本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。

统计

今日共更新617篇论文,其中:

  • 自然语言处理86
  • 信息检索22
  • 计算机视觉97

自然语言处理

1. 【2602.11151】Diffusion-Pretrained Dense and Contextual Embeddings

链接https://arxiv.org/abs/2602.11151

作者:Sedigheh Eslami,Maksim Gaiduk,Markus Krimmel,Louis Milliken,Bo Wang,Denis Bykov

类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:employ multi-stage contrastive, multi-stage contrastive learning, language model backbone, diffusion-pretrained language model, introduce pplx-embed

备注

点击查看摘要

Abstract:In this report, we introduce pplx-embed, a family of multilingual embedding models that employ multi-stage contrastive learning on a diffusion-pretrained language model backbone for web-scale retrieval. By leveraging bidirectional attention through diffusion-based pretraining, our models capture comprehensive bidirectional context within passages, enabling the use of mean pooling and a late chunking strategy to better preserve global context across long documents. We release two model types: pplx-embed-v1 for standard retrieval, and pplx-embed-context-v1 for contextualized embeddings that incorporate global document context into passage representations. pplx-embed-v1 achieves competitive performance on the MTEB(Multilingual, v2), MTEB(Code), MIRACL, BERGEN, and ToolRet retrieval benchmarks, while pplx-embed-context-v1 sets new records on the ConTEB benchmark. Beyond public benchmarks, pplx-embed-v1 demonstrates strong performance on our internal evaluation suite, which focuses on real-world, large-scale search scenarios over tens of millions of documents. These results validate the models' effectiveness in production environments where retrieval quality and efficiency are critical at scale.

2. 【2602.11149】Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning

链接https://arxiv.org/abs/2602.11149

作者:Dawid J. Kopiczko,Sagar Vaze,Tijmen Blankevoort,Yuki M. Asano

类目:Computation and Language (cs.CL)

关键词:essential post-training step, Supervised fine-tuning, essential post-training, post-training step, Supervised

备注

点击查看摘要

Abstract:Supervised fine-tuning (SFT) on chain-of-thought data is an essential post-training step for reasoning language models. Standard machine learning intuition suggests that training with more unique training samples yields better generalization. Counterintuitively, we show that SFT benefits from repetition: under a fixed update budget, training for more epochs on smaller datasets outperforms single-epoch training on larger datasets. On AIME'24/25 and GPQA benchmarks, Olmo3-7B trained for 128 epochs on 400 samples outperforms the equivalent 1 epoch on 51200 samples by 12-26 percentage points, with no additional catastrophic forgetting. We find that training token accuracy reliably signals when repetition has saturated; improvements from additional epochs plateau at full memorization, a pattern consistent across all settings. These findings provide a practical approach for reasoning SFT, where scaling epochs with token accuracy as a stopping criterion can replace expensive undirected data scaling. We pose the repetition advantage, where full memorization coincides with improved generalization, as a new open problem for the community in understanding the training dynamics of large language models.

3. 【2602.11137】Weight Decay Improves Language Model Plasticity

链接https://arxiv.org/abs/2602.11137

作者:Tessa Han,Sebastian Bordt,Hanlin Zhang,Sham Kakade

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:large language model, prevailing paradigm, paradigm in large, large language, base model

备注

点击查看摘要

Abstract:The prevailing paradigm in large language model (LLM) development is to pretrain a base model, then perform further training to improve performance and model behavior. However, hyperparameter optimization and scaling laws have been studied primarily from the perspective of the base model's validation loss, ignoring downstream adaptability. In this work, we study pretraining from the perspective of model plasticity, that is, the ability of the base model to successfully adapt to downstream tasks through fine-tuning. We focus on the role of weight decay, a key regularization parameter during pretraining. Through systematic experiments, we show that models trained with larger weight decay values are more plastic, meaning they show larger performance gains when fine-tuned on downstream tasks. This phenomenon can lead to counterintuitive trade-offs where base models that perform worse after pretraining can perform better after fine-tuning. Further investigation of weight decay's mechanistic effects on model behavior reveals that it encourages linearly separable representations, regularizes attention matrices, and reduces overfitting on the training data. In conclusion, this work demonstrates the importance of using evaluation metrics beyond cross-entropy loss for hyperparameter optimization and casts light on the multifaceted role of that a single optimization hyperparameter plays in shaping model behavior.

4. 【2602.11133】Just on Time: Token-Level Early Stopping for Diffusion Language Models

链接https://arxiv.org/abs/2602.11133

作者:Zahar Kohut,Severyn Shykula,Dmytro Khamula,Mykola Vysotskyi,Taras Rumezhak,Volodymyr Karpiv

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:reach stability long, language models generate, models generate text, tokens reach stability, final denoising step

备注: Under review

点击查看摘要

Abstract:Diffusion language models generate text through iterative refinement, a process that is often computationally inefficient because many tokens reach stability long before the final denoising step. We introduce a training-free, token-level early stopping approach that identifies convergence independently at each position. Our method leverages lightweight signals derived from the model's predictions and local context to dynamically determine when individual tokens can be finalized. This yields adaptive per-token freezing without task-specific fine-tuning, substantially reducing the total number of diffusion steps required. Across diverse benchmarks, spanning mathematical reasoning, general question answering, and scientific understanding, our approach achieves state-of-the-art efficiency gains while preserving generation quality.

5. 【2602.11106】EGRA: Text Encoding With Graph and Retrieval Augmentation for Misinformation Detection

链接https://arxiv.org/abs/2602.11106

作者:Géraud Faye,Wassila Ouerdane,Guillaume Gadek,Céline Hudelot

类目:Computation and Language (cs.CL)

关键词:manual fact-checking, critical task, benefit significantly, integration of external, Misinformation detection

备注

点击查看摘要

Abstract:Misinformation detection is a critical task that can benefit significantly from the integration of external knowledge, much like manual fact-checking. In this work, we propose a novel method for representing textual documents that facilitates the incorporation of information from a knowledge base. Our approach, Text Encoding with Graph (TEG), processes documents by extracting structured information in the form of a graph and encoding both the text and the graph for classification purposes. Through extensive experiments, we demonstrate that this hybrid representation enhances misinformation detection performance compared to using language models alone. Furthermore, we introduce TEGRA, an extension of our framework that integrates domain-specific knowledge, further enhancing classification accuracy in most cases.

6. 【2602.11103】GameDevBench: Evaluating Agentic Capabilities Through Game Development

链接https://arxiv.org/abs/2602.11103

作者:Wayne Chi,Yixiong Fang,Arnav Yayavaram,Siddharth Yayavaram,Seth Karten,Qiuhong Anna Wei,Runkun Chen,Alexander Wang,Valerie Chen,Ameet Talwalkar,Chris Donahue

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)

关键词:rapid progress, progress on coding, counterparts has lagged, development, Game development

备注

点击查看摘要

Abstract:Despite rapid progress on coding agents, progress on their multimodal counterparts has lagged behind. A key challenge is the scarcity of evaluation testbeds that combine the complexity of software development with the need for deep multimodal understanding. Game development provides such a testbed as agents must navigate large, dense codebases while manipulating intrinsically multimodal assets such as shaders, sprites, and animations within a visual game scene. We present GameDevBench, the first benchmark for evaluating agents on game development tasks. GameDevBench consists of 132 tasks derived from web and video tutorials. Tasks require significant multimodal understanding and are complex -- the average solution requires over three times the amount of lines of code and file changes compared to prior software development benchmarks. Agents still struggle with game development, with the best agent solving only 54.5% of tasks. We find a strong correlation between perceived task difficulty and multimodal complexity, with success rates dropping from 46.9% on gameplay-oriented tasks to 31.6% on 2D graphics tasks. To improve multimodal capability, we introduce two simple image and video-based feedback mechanisms for agents. Despite their simplicity, these methods consistently improve performance, with the largest change being an increase in Claude Sonnet 4.5's performance from 33.3% to 47.7%. We release GameDevBench publicly to support further research into agentic game development.

7. 【2602.11096】Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away

链接https://arxiv.org/abs/2602.11096

作者:Soumya Suvra Ghosal,Souradip Chakraborty,Vaibhav Singh,Furong Huang,Dinesh Manocha,Amrit Singh Bedi

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Reinforcement learning, multimodal large-scale reasoning, based post-training, post-training for explicit, ability of multimodal

备注

点击查看摘要

Abstract:Reinforcement learning (RL) based post-training for explicit chain-of-thought (e.g., GRPO) improves the reasoning ability of multimodal large-scale reasoning models (MLRMs). But recent evidence shows that it can simultaneously degrade safety alignment and increase jailbreak success rates. We propose SafeThink, a lightweight inference-time defense that treats safety recovery as a satisficing constraint rather than a maximization objective. SafeThink monitors the evolving reasoning trace with a safety reward model and conditionally injects an optimized short corrective prefix ("Wait, think safely") only when the safety threshold is violated. In our evaluations across six open-source MLRMs and four jailbreak benchmarks (JailbreakV-28K, Hades, FigStep, and MM-SafetyBench), SafeThink reduces attack success rates by 30-60% (e.g., LlamaV-o1: 63.33% to 5.74% on JailbreakV-28K, R1-Onevision: 69.07% to 5.65% on Hades) while preserving reasoning performance (MathVista accuracy: 65.20% to 65.00%). A key empirical finding from our experiments is that safety recovery is often only a few steering steps away: intervening in the first 1-3 reasoning steps typically suffices to redirect the full generation toward safe completions.

8. 【2602.11091】Can Large Language Models Make Everyone Happy?

链接https://arxiv.org/abs/2602.11091

作者:Usman Naseem,Gautam Siddharth Kashyap,Ebad Shabbir,Sushant Kumar Ray,Abdullah Mohammad,Rafiq Ali

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, simultaneously satisfy safety, Language Models, leading to behaviors

备注

点击查看摘要

Abstract:Misalignment in Large Language Models (LLMs) refers to the failure to simultaneously satisfy safety, value, and cultural dimensions, leading to behaviors that diverge from human expectations in real-world settings where these dimensions must co-occur. Existing benchmarks, such as SAFETUNEBED (safety-centric), VALUEBENCH (value-centric), and WORLDVIEW-BENCH (culture-centric), primarily evaluate these dimensions in isolation and therefore provide limited insight into their interactions and trade-offs. More recent efforts, including MIB and INTERPRETABILITY BENCHMARK-based on mechanistic interpretability, offer valuable perspectives on model failures; however, they remain insufficient for systematically characterizing cross-dimensional trade-offs. To address these gaps, we introduce MisAlign-Profile, a unified benchmark for measuring misalignment trade-offs inspired by mechanistic profiling. First, we construct MISALIGNTRADE, an English misaligned-aligned dataset across 112 normative domains taxonomies, including 14 safety, 56 value, and 42 cultural domains. In addition to domain labels, each prompt is classified with one of three orthogonal semantic types-object, attribute, or relations misalignment-using Gemma-2-9B-it and expanded via Qwen3-30B-A3B-Instruct-2507 with SimHash-based fingerprinting to avoid deduplication. Each prompt is paired with misaligned and aligned responses through two-stage rejection sampling to ensure quality. Second, we benchmark general-purpose, fine-tuned, and open-weight LLMs on MISALIGNTRADE-revealing 12%-34% misalignment trade-offs across dimensions.

9. 【2602.11089】DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning

链接https://arxiv.org/abs/2602.11089

作者:Yicheng Chen,Zerun Ma,Xinchen Xie,Yining Li,Kai Chen

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, landscape of Large, Language Models, high-quality training data

备注

点击查看摘要

Abstract:In the current landscape of Large Language Models (LLMs), the curation of large-scale, high-quality training data is a primary driver of model performance. A key lever is the \emph{data recipe}, which comprises a data processing pipeline to transform raw sources into training corpora. Despite the growing use of LLMs to automate individual data processing steps, such as data synthesis and filtering, the overall design of data recipes remains largely manual and labor-intensive, requiring substantial human expertise and iteration. To bridge this gap, we formulate \emph{end-to-end data recipe generation} for LLM adaptation. Given a target benchmark and a pool of available data sources, a model is required to output a complete data recipe that adapts a base LLM to the target task. We present DataChef-32B, which performs online reinforcement learning using a proxy reward that predicts downstream performance for candidate recipes. Across six held-out tasks, DataChef-32B produces practical recipes that reach comparable downstream performance to those curated by human experts. Notably, the recipe from DataChef-32B adapts Qwen3-1.7B-Base to the math domain, achieving 66.7 on AIME'25 and surpassing Qwen3-1.7B. This work sheds new light on automating LLM training and developing self-evolving AI systems.

10. 【2602.11081】SteuerLLM: Local specialized large language model for German tax law analysis

链接https://arxiv.org/abs/2602.11081

作者:Sebastian Wind,Jeta Sopa,Laurin Schmid,Quirin Jackl,Sebastian Kiefer,Fei Wu,Martin Mayr,Harald Köstler,Gerhard Wellein,Andreas Maier,Soroosh Tayebi Arasteh

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:demonstrate strong general, strict formal rules, legally binding structure, Large language models, Large language

备注

点击查看摘要

Abstract:Large language models (LLMs) demonstrate strong general reasoning and language understanding, yet their performance degrades in domains governed by strict formal rules, precise terminology, and legally binding structure. Tax law exemplifies these challenges, as correct answers require exact statutory citation, structured legal argumentation, and numerical accuracy under rigid grading schemes. We algorithmically generate SteuerEx, the first open benchmark derived from authentic German university tax law examinations. SteuerEx comprises 115 expert-validated examination questions spanning six core tax law domains and multiple academic levels, and employs a statement-level, partial-credit evaluation framework that closely mirrors real examination practice. We further present SteuerLLM, a domain-adapted LLM for German tax law trained on a large-scale synthetic dataset generated from authentic examination material using a controlled retrieval-augmented pipeline. SteuerLLM (28B parameters) consistently outperforms general-purpose instruction-tuned models of comparable size and, in several cases, substantially larger systems, demonstrating that domain-specific data and architectural adaptation are more decisive than parameter scale for performance on realistic legal reasoning tasks. All benchmark data, training datasets, model weights, and evaluation code are released openly to support reproducible research in domain-specific legal artificial intelligence. A web-based demo of SteuerLLM is available at this https URL.

11. 【2602.11073】Chatting with Images for Introspective Visual Thinking

链接https://arxiv.org/abs/2602.11073

作者:Junfei Wu,Jian Guan,Qiang Liu,Shu Wu,Liang Wang,Wei Wu,Tienie Tan

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Current large vision-language, Current large, single-pass visual encoding, fine-grained visual information, large vision-language models

备注

点击查看摘要

Abstract:Current large vision-language models (LVLMs) typically rely on text-only reasoning based on a single-pass visual encoding, which often leads to loss of fine-grained visual information. Recently the proposal of ''thinking with images'' attempts to alleviate this limitation by manipulating images via external tools or code; however, the resulting visual states are often insufficiently grounded in linguistic semantics, impairing effective cross-modal alignment - particularly when visual semantics or geometric relationships must be reasoned over across distant regions or multiple images. To address these challenges, we propose ''chatting with images'', a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions, enabling tighter coupling between linguistic reasoning and visual state updates. We instantiate this paradigm in ViLaVT, a novel LVLM equipped with a dynamic vision encoder explicitly designed for such interactive visual reasoning, and trained it with a two-stage curriculum combining supervised fine-tuning and reinforcement learning to promote effective reasoning behaviors. Extensive experiments across eight benchmarks demonstrate that ViLaVT achieves strong and consistent improvements, with particularly pronounced gains on complex multi-image and video-based spatial reasoning tasks.

12. 【2602.11072】Simultaneous Speech-to-Speech Translation Without Aligned Data

链接https://arxiv.org/abs/2602.11072

作者:Tom Labiausse,Romain Fabre,Yannick Estève,Alexandre Défossez,Neil Zeghidour

类目:Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

关键词:non-monotonic word dependencies, requires translating source, handling non-monotonic word, word dependencies, Simultaneous speech translation

备注: See inference code at: [this https URL](https://github.com/kyutai-labs/hibiki-zero)

点击查看摘要

Abstract:Simultaneous speech translation requires translating source speech into a target language in real-time while handling non-monotonic word dependencies. Traditional approaches rely on supervised training with word-level aligned data, which is difficult to collect at scale and thus depends on synthetic alignments using language-specific heuristics that are suboptimal. We propose Hibiki-Zero, which eliminates the need for word-level alignments entirely. This fundamentally simplifies the training pipeline and enables seamless scaling to diverse languages with varying grammatical structures, removing the bottleneck of designing language-specific alignment heuristics. We first train on sentence-level aligned data to learn speech translation at high latency, then apply a novel reinforcement learning strategy using GRPO to optimize latency while preserving translation quality. Hibiki-Zero achieves state-of-the-art performance in translation accuracy, latency, voice transfer, and naturalness across five X-to-English tasks. Moreover, we demonstrate that our model can be adapted to support a new input language with less than 1000h of speech. We provide examples, model weights, inference code and we release a benchmark containing 45h of multilingual data for speech translation evaluation.

13. 【2602.11065】Conversational Behavior Modeling Foundation Model With Multi-Level Perception

链接https://arxiv.org/abs/2602.11065

作者:Dingkun Zhou,Shuchang Pan,Jiachen Lian,Siddharth Banerjee,Sarika Pasumarthy,Dhruv Hebbar,Siddhant Patel,Zeyi Austin Li,Kan Jen Cheng,Sanay Bordia,Krish Patel,Akshaj Gupta,Tingle Li,Gopala Anumanchipalli

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Human conversation, conversation is organized, thoughts that manifests, manifests as timed, timed speech acts

备注

点击查看摘要

Abstract:Human conversation is organized by an implicit chain of thoughts that manifests as timed speech acts. Capturing this perceptual pathway is key to building natural full-duplex interactive systems. We introduce a framework that models this process as multi-level perception, and then reasons over conversational behaviors via a Graph-of-Thoughts (GoT). Our approach formalizes the intent-to-action pathway with a hierarchical labeling scheme, predicting high-level communicative intents and low-level speech acts to learn their causal and temporal dependencies. To train this system, we develop a high quality corpus that pairs controllable, event-rich dialogue data with human-annotated labels. The GoT framework structures streaming predictions as an evolving graph, enabling a transformer to forecast the next speech act, generate concise justifications for its decisions, and dynamically refine its reasoning. Experiments on both synthetic and real duplex dialogues show that the framework delivers robust behavior detection, produces interpretable reasoning chains, and establishes a foundation for benchmarking conversational reasoning in full duplex spoken dialogue systems.

14. 【2602.11052】GraphSeek: Next-Generation Graph Analytics with LLMs

链接https://arxiv.org/abs/2602.11052

作者:Maciej Besta,Łukasz Jarmocik,Orest Hrycyna,Shachar Klaiman,Konrad Mączka,Robert Gerstenberger,Jürgen Müller,Piotr Nyczyk,Hubert Niewiadomski,Torsten Hoefler

类目:Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

关键词:deep expertise, foundational across domains, domains but remain, remain hard, graph

备注

点击查看摘要

Abstract:Graphs are foundational across domains but remain hard to use without deep expertise. LLMs promise accessible natural language (NL) graph analytics, yet they fail to process industry-scale property graphs effectively and efficiently: such datasets are large, highly heterogeneous, structurally complex, and evolve dynamically. To address this, we devise a novel abstraction for complex multi-query analytics over such graphs. Its key idea is to replace brittle generation of graph queries directly from NL with planning over a Semantic Catalog that describes both the graph schema and the graph operations. Concretely, this induces a clean separation between a Semantic Plane for LLM planning and broader reasoning, and an Execution Plane for deterministic, database-grade query execution over the full dataset and tool implementations. This design yields substantial gains in both token efficiency and task effectiveness even with small-context LLMs. We use this abstraction as the basis of the first LLM-enhanced graph analytics framework called GraphSeek. GraphSeek achieves substantially higher success rates (e.g., 86% over enhanced LangChain) and points toward the next generation of affordable and accessible graph analytics that unify LLM reasoning with database-grade execution over large and complex property graphs.

15. 【2602.11047】Embedding Inversion via Conditional Masked Diffusion Language Models

链接https://arxiv.org/abs/2602.11047

作者:Han Xiao

类目:Computation and Language (cs.CL)

关键词:sequential autoregressive generation, conditional masked diffusion, frame embedding inversion, autoregressive generation, inversion as conditional

备注: 9 pages, 3 figures, 7 tables. Code and demo: [this https URL](https://github.com/hanxiao/embedding-inversion-demo)

点击查看摘要

Abstract:We frame embedding inversion as conditional masked diffusion, recovering all tokens in parallel through iterative denoising rather than sequential autoregressive generation. A masked diffusion language model is conditioned on the target embedding via adaptive layer normalization, requiring only 8 forward passes through a 78M parameter model with no access to the target encoder. On 32-token sequences across three embedding models, the method achieves up to 81.3% token accuracy. Source code and live demo are available at this https URL.

16. 【2602.11044】Language Model Inversion through End-to-End Differentiation

链接https://arxiv.org/abs/2602.11044

作者:Kevin Yandoka Denamganaï,Kartic Subr

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Language Models, research on Language, emerging research, approaches analyse, analyse the invertibility

备注: 24 pages, 5 figures, under review

点击查看摘要

Abstract:Despite emerging research on Language Models (LM), few approaches analyse the invertibility of LMs. That is, given a LM and a desirable target output sequence of tokens, determining what input prompts would yield the target output remains an open problem. We formulate this problem as a classical gradient-based optimisation. First, we propose a simple algorithm to achieve end-to-end differentiability of a given (frozen) LM and then find optimised prompts via gradient descent. Our central insight is to view LMs as functions operating on sequences of distributions over tokens (rather than the traditional view as functions on sequences of tokens). Our experiments and ablations demonstrate that our DLM-powered inversion can reliably and efficiently optimise prompts of lengths $10$ and $80$ for targets of length $20$, for several white-box LMs (out-of-the-box).

17. 【2602.11040】Learning Page Order in Shuffled WOO Releases

链接https://arxiv.org/abs/2602.11040

作者:Efe Kahraman,Giulio Tosato

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:shuffled WOO documents, shuffled WOO, Dutch freedom, WOO documents, information releases

备注

点击查看摘要

Abstract:We investigate document page ordering on 5,461 shuffled WOO documents (Dutch freedom of information releases) using page embeddings. These documents are heterogeneous collections such as emails, legal texts, and spreadsheets compiled into single PDFs, where semantic ordering signals are unreliable. We compare five methods, including pointer networks, seq2seq transformers, and specialized pairwise ranking models. The best performing approach successfully reorders documents up to 15 pages, with Kendall's tau ranging from 0.95 for short documents (2-5 pages) to 0.72 for 15 page documents. We observe two unexpected failures: seq2seq transformers fail to generalize on long documents (Kendall's tau drops from 0.918 on 2-5 pages to 0.014 on 21-25 pages), and curriculum learning underperforms direct training by 39% on long documents. Ablation studies suggest learned positional encodings are one contributing factor to seq2seq failure, though the degradation persists across all encoding variants, indicating multiple interacting causes. Attention pattern analysis reveals that short and long documents require fundamentally different ordering strategies, explaining why curriculum learning fails. Model specialization achieves substantial improvements on longer documents (+0.21 tau).

18. 【2602.11028】Linguistic Indicators of Early Cognitive Decline in the DementiaBank Pitt Corpus: A Statistical and Machine Learning Study

链接https://arxiv.org/abs/2602.11028

作者:Artsvik Avetisyan,Sachin Kumar

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:spontaneous language production, earliest indicators, DementiaBank Pitt Corpus, Background, Subtle

备注

点击查看摘要

Abstract:Background: Subtle changes in spontaneous language production are among the earliest indicators of cognitive decline. Identifying linguistically interpretable markers of dementia can support transparent and clinically grounded screening approaches. Methods: This study analyzes spontaneous speech transcripts from the DementiaBank Pitt Corpus using three linguistic representations: raw cleaned text, a part-of-speech (POS)-enhanced representation combining lexical and grammatical information, and a POS-only syntactic representation. Logistic regression and random forest models were evaluated under two protocols: transcript-level train-test splits and subject-level five-fold cross-validation to prevent speaker overlap. Model interpretability was examined using global feature importance, and statistical validation was conducted using Mann-Whitney U tests with Cliff's delta effect sizes. Results: Across representations, models achieved stable performance, with syntactic and grammatical features retaining strong discriminative power even in the absence of lexical content. Subject-level evaluation yielded more conservative but consistent results, particularly for POS-enhanced and POS-only representations. Statistical analysis revealed significant group differences in functional word usage, lexical diversity, sentence structure, and discourse coherence, aligning closely with machine learning feature importance findings. Conclusion: The results demonstrate that abstract linguistic features capture robust markers of early cognitive decline under clinically realistic evaluation. By combining interpretable machine learning with non-parametric statistical validation, this study supports the use of linguistically grounded features for transparent and reliable language-based cognitive screening.

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Cite as:
arXiv:2602.11028 [cs.CL]

(or
arXiv:2602.11028v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2602.11028

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Sachin Kumar [view email] [v1]
Wed, 11 Feb 2026 16:53:57 UTC (537 KB)

19. 【2602.11008】ROCKET: Rapid Optimization via Calibration-guided Knapsack Enhanced Truncation for Efficient Model Compression

链接https://arxiv.org/abs/2602.11008

作者:Ammar Ali,Baher Mohammad,Denis Makhov,Dmitriy Shopkhoev,Magauiya Zhussip,Stamatios Lefkimmiatis

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:dynamic compression baselines, method that achieves, structured-sparsification and dynamic, present ROCKET, ROCKET

备注

点击查看摘要

Abstract:We present ROCKET, a training-free model compression method that achieves state-of-the-art performance in comparison with factorization, structured-sparsification and dynamic compression baselines. Operating under a global compression budget, ROCKET comprises two key innovations: First, it formulates layer-wise compression allocation as a multi-choice knapsack problem, selecting the optimal compression level for each layer to minimize total reconstruction error while adhering to a target model size. Second, it introduces a single-step sparse matrix factorization inspired by dictionary learning: using only a small calibration set, it sparsifies weight coefficients based on activation-weights sensitivity and then updates the dictionary in closed form via least squares bypassing iterative optimization, sparse coding, or backpropagation entirely. ROCKET consistently outperforms existing compression approaches across different model architectures at 20-50\% compression rates. Notably, it retains over 90\% of the original model's performance at 30\% compression without any fine-tuning. Moreover, when applying a light fine-tuning phase, recovery is substantially enhanced: for instance, compressing Qwen3-14B to an 8B-parameter model and healing it with just 30 million tokens yields performance nearly on par with the original Qwen3-8B. The code for ROCKET is at this http URL.

20. 【2602.10996】he emergence of numerical representations in communicating artificial agents

链接https://arxiv.org/abs/2602.10996

作者:Daniela Mihai,Lucas Weber,Francesca Franzon

类目:Multiagent Systems (cs.MA); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Human languages provide, resemble human numerals, languages provide efficient, provide efficient systems, codes resemble human

备注: In the Sixteenth International Conference on the Evolution of Language

点击查看摘要

Abstract:Human languages provide efficient systems for expressing numerosities, but whether the sheer pressure to communicate is enough for numerical representations to arise in artificial agents, and whether the emergent codes resemble human numerals at all, remains an open question. We study two neural network-based agents that must communicate numerosities in a referential game using either discrete tokens or continuous sketches, thus exploring both symbolic and iconic representations. Without any pre-defined numeric concepts, the agents achieve high in-distribution communication accuracy in both communication channels and converge on high-precision symbol-meaning mappings. However, the emergent code is non-compositional: the agents fail to derive systematic messages for unseen numerosities, typically reusing the symbol of the highest trained numerosity (discrete), or collapsing extrapolated values onto a single sketch (continuous). We conclude that the communication pressure alone suffices for precise transmission of learned numerosities, but additional pressures are needed to yield compositional codes and generalisation abilities.

21. 【2602.10993】LoRA-Squeeze: Simple and Effective Post-Tuning and In-Tuning Compression of LoRA Modules

链接https://arxiv.org/abs/2602.10993

作者:Ivan Vulić,Adam Grycner,Quentin de Laroussilhe,Jonas Pfeiffer

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:standard Low-Rank Adaptation, Low-Rank Adaptation, dominant technique, technique for parameter-efficient, PEFT

备注: Preprint

点击查看摘要

Abstract:Despite its huge number of variants, standard Low-Rank Adaptation (LoRA) is still a dominant technique for parameter-efficient fine-tuning (PEFT). Nonetheless, it faces persistent challenges, including the pre-selection of an optimal rank and rank-specific hyper-parameters, as well as the deployment complexity of heterogeneous-rank modules and more sophisticated LoRA derivatives. In this work, we introduce LoRA-Squeeze, a simple and efficient methodology that aims to improve standard LoRA learning by changing LoRA module ranks either post-hoc or dynamically during training}. Our approach posits that it is better to first learn an expressive, higher-rank solution and then compress it, rather than learning a constrained, low-rank solution directly. The method involves fine-tuning with a deliberately high(er) source rank, reconstructing or efficiently approximating the reconstruction of the full weight update matrix, and then using Randomized Singular Value Decomposition (RSVD) to create a new, compressed LoRA module at a lower target rank. Extensive experiments across 13 text and 10 vision-language tasks show that post-hoc compression often produces lower-rank adapters that outperform those trained directly at the target rank, especially if a small number of fine-tuning steps at the target rank is allowed. Moreover, a gradual, in-tuning rank annealing variant of LoRA-Squeeze consistently achieves the best LoRA size-performance trade-off.

22. 【2602.10959】Rotary Positional Embeddings as Phase Modulation: Theoretical Bounds on the RoPE Base for Long-Context Transformers

链接https://arxiv.org/abs/2602.10959

作者:Feilong Liu

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:remains poorly characterized, lengths remains poorly, long context lengths, context lengths remains, encode token positions

备注

点击查看摘要

Abstract:Rotary positional embeddings (RoPE) are widely used in large language models to encode token positions through multiplicative rotations, yet their behavior at long context lengths remains poorly characterized. In this work, we reinterpret RoPE as phase modulation applied to a bank of complex oscillators, enabling analysis through classical signal processing theory. Under this formulation, we derive principled lower bounds on the RoPE base parameter that are necessary to preserve positional coherence over a target context length. These include a fundamental aliasing bound, analogous to a Nyquist limit, and a DC-component stability bound that constrains phase drift in low-frequency positional modes. We further extend this analysis to deep transformers, showing that repeated rotary modulation across layers compounds angular misalignment, tightening the base requirement as depth increases. Complementing these results, we derive a precision-dependent upper bound on the RoPE base arising from finite floating-point resolution. Beyond this limit, incremental phase updates become numerically indistinguishable, leading to positional erasure even in the absence of aliasing. Together, the lower and upper bounds define a precision- and depth-dependent feasibility region a Goldilocks zone for long-context transformers. We validate the framework through a comprehensive case study of state-of-the-art models, including LLaMA, Mistral, and DeepSeek variants, showing that observed successes, failures, and community retrofits align closely with the predicted bounds. Notably, models that violate the stability bound exhibit attention collapse and long-range degradation, while attempts to scale beyond one million tokens encounter a hard precision wall independent of architecture or training.

Subjects:

Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Cite as:
arXiv:2602.10959 [cs.LG]

(or
arXiv:2602.10959v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2602.10959

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Feilong Liu [view email] [v1]
Wed, 11 Feb 2026 15:50:07 UTC (320 KB)

23. 【2602.10953】Search or Accelerate: Confidence-Switched Position Beam Search for Diffusion Language Models

链接https://arxiv.org/abs/2602.10953

作者:Mingyu Cao,Alvaro Correia,Christos Louizos,Shiwei Liu,Lu Yin

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Diffusion Language Models, Diffusion Language, Language Models, generate text, masked sequence

备注: 11 pages, 8 figures

点击查看摘要

Abstract:Diffusion Language Models (DLMs) generate text by iteratively denoising a masked sequence, repeatedly deciding which positions to commit at each step. Standard decoding follows a greedy rule: unmask the most confident positions, yet this local choice can lock the model into a suboptimal unmasking order, especially on reasoning-heavy prompts. We present SOAR, a training-free decoding algorithm that adapts its behavior to the model's uncertainty. When confidence is low, SOAR briefly widens the search over alternative unmasking decisions to avoid premature commitments; when confidence is high, it collapses the search and decodes many positions in parallel to reduce the number of denoising iterations. Across mathematical reasoning and code generation benchmarks (GSM8K, MBPP, HumanEval) on Dream-7B and LLaDA-8B, SOAR improves generation quality while maintaining competitive inference speed, offering a practical way to balance quality and efficiency in DLM decoding.

24. 【2602.10947】Computational Phenomenology of Temporal Experience in Autism: Quantifying the Emotional and Narrative Characteristics of Lived Unpredictability

链接https://arxiv.org/abs/2602.10947

作者:Kacper Dudzic,Karolina Drożdż,Maciej Wodziński,Anastazja Szuła,Marcin Moskalewicz

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

关键词:considered core features, Disturbances in temporality, impact on relationships, social environment, considered core

备注

点击查看摘要

Abstract:Disturbances in temporality, such as desynchronization with the social environment and its unpredictability, are considered core features of autism with a deep impact on relationships. However, limitations regarding research on this issue include: 1) the dominance of deficit-based medical models of autism, 2) sample size in qualitative research, and 3) the lack of phenomenological anchoring in computational research. To bridge the gap between phenomenological and computational approaches and overcome sample-size limitations, our research integrated three methodologies. Study A: structured phenomenological interviews with autistic individuals using the Transdiagnostic Assessment of Temporal Experience. Study B: computational analysis of an autobiographical corpus of autistic narratives built for this purpose. Study C: a replication of a computational study using narrative flow measures to assess the perceived phenomenological authenticity of autistic autobiographies. Interviews revealed that the most significant differences between the autistic and control groups concerned unpredictability of experience. Computational results mirrored these findings: the temporal lexicon in autistic narratives was significantly more negatively valenced - particularly the "Immediacy Suddenness" category. Outlier analysis identified terms associated with perceived discontinuity (unpredictably, precipitously, and abruptly) as highly negative. The computational analysis of narrative flow found that the autistic narratives contained within the corpus quantifiably resemble autobiographical stories more than imaginary ones. Overall, the temporal challenges experienced by autistic individuals were shown to primarily concern lived unpredictability and stem from the contents of lived experience, and not from autistic narrative construction.

25. 【2602.10908】SoftMatcha 2: A Fast and Soft Pattern Matcher for Trillion-Scale Corpora

链接https://arxiv.org/abs/2602.10908

作者:Masataka Yoneda,Yusuke Matsushita,Go Kamoda,Kohei Suenaga,Takuya Akiba,Masaki Waga,Sho Yokoi

类目:Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)

关键词:handling semantic variations, flexible search algorithm, present an ultra-fast, ultra-fast and flexible, algorithm that enables

备注: Project Page Web Interface: [this https URL](https://softmatcha.github.io/v2/) , Source Code: [this https URL](https://github.com/softmatcha/softmatcha2)

点击查看摘要

Abstract:We present an ultra-fast and flexible search algorithm that enables search over trillion-scale natural language corpora in under 0.3 seconds while handling semantic variations (substitution, insertion, and deletion). Our approach employs string matching based on suffix arrays that scales well with corpus size. To mitigate the combinatorial explosion induced by the semantic relaxation of queries, our method is built on two key algorithmic ideas: fast exact lookup enabled by a disk-aware design, and dynamic corpus-aware pruning. We theoretically show that the proposed method suppresses exponential growth in the search space with respect to query length by leveraging statistical properties of natural language. In experiments on FineWeb-Edu (Lozhkov et al., 2024) (1.4T tokens), we show that our method achieves significantly lower search latency than existing methods: infini-gram (Liu et al., 2024), infini-gram mini (Xu et al., 2025), and SoftMatcha (Deguchi et al., 2025). As a practical application, we demonstrate that our method identifies benchmark contamination in training corpora, unidentified by existing approaches. We also provide an online demo of fast, soft search across corpora in seven languages.

26. 【2602.10886】he CLEF-2026 FinMMEval Lab: Multilingual and Multimodal Evaluation of Financial AI Systems

链接https://arxiv.org/abs/2602.10886

作者:Zhuohan Xie,Rania Elbadry,Fan Zhang,Georgi Georgiev,Xueqing Peng,Lingfei Qian,Jimin Huang,Dimitar Dimitrov,Vanshikaa Jani,Yuyang Dai,Jiahui Geng,Yuxia Wang,Ivan Koychev,Veselin Stoyanov,Preslav Nakov

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)

关键词:Large Language Models, financial Large Language, Financial Question Answering, multimodal evaluation framework, Exam Question Answering

备注: 7 pages

点击查看摘要

Abstract:We present the setup and the tasks of the FinMMEval Lab at CLEF 2026, which introduces the first multilingual and multimodal evaluation framework for financial Large Language Models (LLMs). While recent advances in financial natural language processing have enabled automated analysis of market reports, regulatory documents, and investor communications, existing benchmarks remain largely monolingual, text-only, and limited to narrow subtasks. FinMMEval 2026 addresses this gap by offering three interconnected tasks that span financial understanding, reasoning, and decision-making: Financial Exam Question Answering, Multilingual Financial Question Answering (PolyFiQA), and Financial Decision Making. Together, these tasks provide a comprehensive evaluation suite that measures models' ability to reason, generalize, and act across diverse languages and modalities. The lab aims to promote the development of robust, transparent, and globally inclusive financial AI systems, with datasets and evaluation resources publicly released to support reproducible research.

27. 【2602.10881】Diagnosing Structural Failures in LLM-Based Evidence Extraction for Meta-Analysis

链接https://arxiv.org/abs/2602.10881

作者:Zhiyin Tan,Jennifer D'Souza

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:numerically grounded study, grounded study records, converting narrative articles, articles into structured, numerically grounded

备注: Accepted at the 22nd Conference on Information and Research Science Connecting to Digital and Library Science (IRCDL 2026)

点击查看摘要

Abstract:Systematic reviews and meta-analyses rely on converting narrative articles into structured, numerically grounded study records. Despite rapid advances in large language models (LLMs), it remains unclear whether they can meet the structural requirements of this process, which hinge on preserving roles, methods, and effect-size attribution across documents rather than on recognizing isolated entities. We propose a structural, diagnostic framework that evaluates LLM-based evidence extraction as a progression of schema-constrained queries with increasing relational and numerical complexity, enabling precise identification of failure points beyond atom-level extraction. Using a manually curated corpus spanning five scientific domains, together with a unified query suite and evaluation protocol, we evaluate two state-of-the-art LLMs under both per-document and long-context, multi-document input regimes. Across domains and models, performance remains moderate for single-property queries but degrades sharply once tasks require stable binding between variables, roles, statistical methods, and effect sizes. Full meta-analytic association tuples are extracted with near-zero reliability, and long-context inputs further exacerbate these failures. Downstream aggregation amplifies even minor upstream errors, rendering corpus-level statistics unreliable. Our analysis shows that these limitations stem not from entity recognition errors, but from systematic structural breakdowns, including role reversals, cross-analysis binding drift, instance compression in dense result sections, and numeric misattribution, indicating that current LLMs lack the structural fidelity, relational binding, and numerical grounding required for automated meta-analysis. The code and data are publicly available at GitHub (this https URL).

28. 【2602.10874】C-MOP: Integrating Momentum and Boundary-Aware Clustering for Enhanced Prompt Evolution

链接https://arxiv.org/abs/2602.10874

作者:Binwei Yan,Yifei Fu,Mingjian Zhu,Hanting Chen,Mingxuan Yuan,Yunhe Wang,Hailin Hu

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Language Models, Large Language, Automatic prompt optimization, performance of Large

备注: The code is available at [this https URL](https://github.com/huawei-noah/noah-research/tree/master/C-MOP)

点击查看摘要

Abstract:Automatic prompt optimization is a promising direction to boost the performance of Large Language Models (LLMs). However, existing methods often suffer from noisy and conflicting update signals. In this research, we propose C-MOP (Cluster-based Momentum Optimized Prompting), a framework that stabilizes optimization via Boundary-Aware Contrastive Sampling (BACS) and Momentum-Guided Semantic Clustering (MGSC). Specifically, BACS utilizes batch-level information to mine tripartite features--Hard Negatives, Anchors, and Boundary Pairs--to precisely characterize the typical representation and decision boundaries of positive and negative prompt samples. To resolve semantic conflicts, MGSC introduces a textual momentum mechanism with temporal decay that distills persistent consensus from fluctuating gradients across iterations. Extensive experiments demonstrate that C-MOP consistently outperforms SOTA baselines like PromptWizard and ProTeGi, yielding average gains of 1.58% and 3.35%. Notably, C-MOP enables a general LLM with 3B activated parameters to surpass a 70B domain-specific dense LLM, highlighting its effectiveness in driving precise prompt evolution. The code is available at this https URL.

29. 【2602.10833】raining-Induced Bias Toward LLM-Generated Content in Dense Retrieval

链接https://arxiv.org/abs/2602.10833

作者:William Xion,Wolfgang Nejdl

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:information retrieval applications, acquiring relevant context, language processing tasks, retrieval applications, open-domain natural language

备注: Accepted at ECIR 2026

点击查看摘要

Abstract:Dense retrieval is a promising approach for acquiring relevant context or world knowledge in open-domain natural language processing tasks and is now widely used in information retrieval applications. However, recent reports claim a broad preference for text generated by large language models (LLMs). This bias is called "source bias", and it has been hypothesized that lower perplexity contributes to this effect. In this study, we revisit this claim by conducting a controlled evaluation to trace the emergence of such preferences across training stages and data sources. Using parallel human- and LLM-generated counterparts of the SciFact and Natural Questions (NQ320K) datasets, we compare unsupervised checkpoints with models fine-tuned using in-domain human text, in-domain LLM-generated text, and MS MARCO. Our results show the following: 1) Unsupervised retrievers do not exhibit a uniform pro-LLM preference. The direction and magnitude depend on the dataset. 2) Across the settings tested, supervised fine-tuning on MS MARCO consistently shifts the rankings toward LLM-generated text. 3) In-domain fine-tuning produces dataset-specific and inconsistent shifts in preference. 4) Fine-tuning on LLM-generated corpora induces a pronounced pro-LLM bias. Finally, a retriever-centric perplexity probe involving the reattachment of a language modeling head to the fine-tuned dense retriever encoder indicates agreement with relevance near chance, thereby weakening the explanatory power of perplexity. Our study demonstrates that source bias is a training-induced phenomenon rather than an inherent property of dense retrievers.

30. 【2602.10832】I can tell whether you are a Native Hawlêri Speaker! How ANN, CNN, and RNN perform in NLI-Native Language Identification

链接https://arxiv.org/abs/2602.10832

作者:Hardi Garari,Hossein Hassani

类目:Computation and Language (cs.CL)

关键词:Natural Language Processing, Native Language Identification, task in Natural, Language Identification, Language Processing

备注: 16 pages, 12 figures, 7 tables

点击查看摘要

Abstract:Native Language Identification (NLI) is a task in Natural Language Processing (NLP) that typically determines the native language of an author through their writing or a speaker through their speaking. It has various applications in different areas, such as forensic linguistics and general linguistics studies. Although considerable research has been conducted on NLI regarding two different languages, such as English and German, the literature indicates a significant gap regarding NLI for dialects and subdialects. The gap becomes wider in less-resourced languages such as Kurdish. This research focuses on NLI within the context of a subdialect of Sorani (Central) Kurdish. It aims to investigate the NLI for Hewlêri, a subdialect spoken in Hewlêr (Erbil), the Capital of the Kurdistan Region of Iraq. We collected about 24 hours of speech by recording interviews with 40 native or non-native Hewlêri speakers, 17 female and 23 male. We created three Neural Network-based models: Artificial Neural Network (ANN), Convolutional Neural Network (CNN), and Recurrent Neural Network (RNN), which were evaluated through 66 experiments, covering various time-frames from 1 to 60 seconds, undersampling, oversampling, and cross-validation. The RNN model showed the highest accuracy of 95.92% for 5-second audio segmentation, using an 80:10:10 data splitting scheme. The created dataset is the first speech dataset for NLI on the Hewlêri subdialect in the Sorani Kurdish dialect, which can be of benefit to various research areas.

31. 【2602.10816】Beyond Confidence: The Rhythms of Reasoning in Generative Models

链接https://arxiv.org/abs/2602.10816

作者:Deyuan Liu,Zecheng Wang,Zhanyue Qin,Zhiying Tu,Dianhui Chu,Dianbo Sui

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, exhibit impressive capabilities, input context variations, slight input context

备注: ICLR 2026

点击查看摘要

Abstract:Large Language Models (LLMs) exhibit impressive capabilities yet suffer from sensitivity to slight input context variations, hampering reliability. Conventional metrics like accuracy and perplexity fail to assess local prediction robustness, as normalized output probabilities can obscure the underlying resilience of an LLM's internal state to perturbations. We introduce the Token Constraint Bound ($\delta_{\mathrm{TCB}}$), a novel metric that quantifies the maximum internal state perturbation an LLM can withstand before its dominant next-token prediction significantly changes. Intrinsically linked to output embedding space geometry, $\delta_{\mathrm{TCB}}$ provides insights into the stability of the model's internal predictive commitment. Our experiments show $\delta_{\mathrm{TCB}}$ correlates with effective prompt engineering and uncovers critical prediction instabilities missed by perplexity during in-context learning and text generation. $\delta_{\mathrm{TCB}}$ offers a principled, complementary approach to analyze and potentially improve the contextual stability of LLM predictions.

32. 【2602.10801】Deep Learning-based Method for Expressing Knowledge Boundary of Black-Box LLM

链接https://arxiv.org/abs/2602.10801

作者:Haotian Sheng,Heyong Wang,Ming Hong,Hongman He,Junqiu Liu

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, achieved remarkable success, content generation distortion, Language Models

备注

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved remarkable success, however, the emergence of content generation distortion (hallucination) limits their practical applications. The core cause of hallucination lies in LLMs' lack of awareness regarding their stored internal knowledge, preventing them from expressing their knowledge state on questions beyond their internal knowledge boundaries, as humans do. However, existing research on knowledge boundary expression primarily focuses on white-box LLMs, leaving methods suitable for black-box LLMs which offer only API access without revealing internal parameters-largely unexplored. Against this backdrop, this paper proposes LSCL (LLM-Supervised Confidence Learning), a deep learning-based method for expressing the knowledge boundaries of black-box LLMs. Based on the knowledge distillation framework, this method designs a deep learning model. Taking the input question, output answer, and token probability from a black-box LLM as inputs, it constructs a mapping between the inputs and the model' internal knowledge state, enabling the quantification and expression of the black-box LLM' knowledge boundaries. Experiments conducted on diverse public datasets and with multiple prominent black-box LLMs demonstrate that LSCL effectively assists black-box LLMs in accurately expressing their knowledge boundaries. It significantly outperforms existing baseline models on metrics such as accuracy and recall rate. Furthermore, considering scenarios where some black-box LLMs do not support access to token probability, an adaptive alternative method is proposed. The performance of this alternative approach is close to that of LSCL and surpasses baseline models.

33. 【2602.10740】Reinforced Curriculum Pre-Alignment for Domain-Adaptive VLMs

链接https://arxiv.org/abs/2602.10740

作者:Yuming Yan,Shuo Yang,Kai Tang,Sihong Chen,Yang Zhang,Ke Xu,Dan Hu,Qun Yu,Pengfei Hu,Edith C.H. Ngai

类目:Computation and Language (cs.CL)

关键词:demonstrate remarkable general-purpose, demonstrate remarkable, geometric problem-solving, remarkable general-purpose capabilities, fall short

备注

点击查看摘要

Abstract:Vision-Language Models (VLMs) demonstrate remarkable general-purpose capabilities but often fall short in specialized domains such as medical imaging or geometric problem-solving. Supervised Fine-Tuning (SFT) can enhance performance within a target domain, but it typically causes catastrophic forgetting, limiting its generalization. The central challenge, therefore, is to adapt VLMs to new domains while preserving their general-purpose capabilities. Continual pretraining is effective for expanding knowledge in Large Language Models (LLMs), but it is less feasible for VLMs due to prohibitive computational costs and the unavailability of pretraining data for most open-source models. This necessitates efficient post-training adaptation methods. Reinforcement learning (RL)-based approaches such as Group Relative Policy Optimization (GRPO) have shown promise in preserving general abilities, yet they often fail in domain adaptation scenarios where the model initially lacks sufficient domain knowledge, leading to optimization collapse. To bridge this gap, we propose Reinforced Curriculum Pre-Alignment (RCPA), a novel post-training paradigm that introduces a curriculum-aware progressive modulation mechanism. In the early phase, RCPA applies partial output constraints to safely expose the model to new domain concepts. As the model's domain familiarity increases, training gradually transitions to full generation optimization, refining responses and aligning them with domain-specific preferences. This staged adaptation balances domain knowledge acquisition with the preservation of general multimodal capabilities. Extensive experiments across specialized domains and general benchmarks validate the effectiveness of RCPA, establishing a practical pathway toward building high-performing and domain-adaptive VLMs.

34. 【2602.10735】Calliope: A TTS-based Narrated E-book Creator Ensuring Exact Synchronization, Privacy, and Layout Fidelity

链接https://arxiv.org/abs/2602.10735

作者:Hugo L. Hammer,Vajira Thambawita,Pål Halvorsen

类目:ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:e-book combines synchronized, narrated e-book combines, combines synchronized audio, sentence during playback, combines synchronized

备注

点击查看摘要

Abstract:A narrated e-book combines synchronized audio with digital text, highlighting the currently spoken word or sentence during playback. This format supports early literacy and assists individuals with reading challenges, while also allowing general readers to seamlessly switch between reading and listening. With the emergence of natural-sounding neural Text-to-Speech (TTS) technology, several commercial services have been developed to leverage these technology for converting standard text e-books into high-quality narrated e-books. However, no open-source solutions currently exist to perform this task. In this paper, we present Calliope, an open-source framework designed to fill this gap. Our method leverages state-of-the-art open-source TTS to convert a text e-book into a narrated e-book in the EPUB 3 Media Overlay format. The method offers several innovative steps: audio timestamps are captured directly during TTS, ensuring exact synchronization between narration and text highlighting; the publisher's original typography, styling, and embedded media are strictly preserved; and the entire pipeline operates offline. This offline capability eliminates recurring API costs, mitigates privacy concerns, and avoids copyright compliance issues associated with cloud-based services. The framework currently supports the state-of-the-art open-source TTS systems XTTS-v2 and Chatterbox. A potential alternative approach involves first generating narration via TTS and subsequently synchronizing it with the text using forced alignment. However, while our method ensures exact synchronization, our experiments show that forced alignment introduces drift between the audio and text highlighting significant enough to degrade the reading experience. Source code and usage instructions are available at this https URL.

35. 【2602.10732】Macaron: Controlled, Human-Written Benchmark for Multilingual and Multicultural Reasoning via Template-Filling

链接https://arxiv.org/abs/2602.10732

作者:Alaa Elsetohy,Sama Hadhoud,Haryo Akbarianto Wibowo,Chenxi Whitehouse,Genta Indra Winata,Fajri Koto,Alham Fikri Aji

类目:Computation and Language (cs.CL)

关键词:culturally grounded premises, benchmarks rarely test, rarely test reasoning, English-centric scenarios, Multilingual benchmarks rarely

备注

点击查看摘要

Abstract:Multilingual benchmarks rarely test reasoning over culturally grounded premises: translated datasets keep English-centric scenarios, while culture-first datasets often lack control over the reasoning required. We propose Macaron, a template-first benchmark that factorizes reasoning type and cultural aspect across question languages. Using 100 language-agnostic templates that cover 7 reasoning types, 22 cultural aspects, native annotators create scenario-aligned English and local-language multiple-choice questions and systematically derived True/False questions. Macaron contains 11,862 instances spanning 20 countries/cultural contexts, 10 scripts, and 20 languages (including low-resource ones like Amharic, Yoruba, Zulu, Kyrgyz, and some Arabic dialects). In zero-shot evaluation of 21 multilingual LLMs, reasoning-mode models achieve the strongest performance and near-parity between English and local languages, while open-weight models degrade substantially in local languages and often approach chance on T/F tasks. Culture-grounded mathematical and counting templates are consistently the hardest. The data can be accessed here this https URL.

36. 【2602.10718】SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining

链接https://arxiv.org/abs/2602.10718

作者:Yifan Zhang,Zunhai Su,Shuhao Hu,Rui Yang,Wei Wu,Yulei Qian,Yuchen Xie,Xunliang Cai

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Multi-head Latent Attention, DeepSeek Multi-head Latent, architecture presents notable, Multi-head Latent, shown substantial promise

备注

点击查看摘要

Abstract:While FP8 attention has shown substantial promise in innovations like FlashAttention-3, its integration into the decoding phase of the DeepSeek Multi-head Latent Attention (MLA) architecture presents notable challenges. These challenges include numerical heterogeneity arising from the decoupling of positional embeddings, misalignment of quantization scales in FP8 PV GEMM, and the need for optimized system-level support. In this paper, we introduce SnapMLA, an FP8 MLA decoding framework optimized to improve long-context efficiency through the following hardware-aware algorithm-kernel co-optimization techniques: (i) RoPE-Aware Per-Token KV Quantization, where the RoPE part is maintained in high precision, motivated by our comprehensive analysis of the heterogeneous quantization sensitivity inherent to the MLA KV cache. Furthermore, per-token granularity is employed to align with the autoregressive decoding process and maintain quantization accuracy. (ii) Quantized PV Computation Pipeline Reconstruction, which resolves the misalignment of quantization scale in FP8 PV computation stemming from the shared KV structure of the MLA KV cache. (iii) End-to-End Dataflow Optimization, where we establish an efficient data read-and-write workflow using specialized kernels, ensuring efficient data flow and performance gains. Extensive experiments on state-of-the-art MLA LLMs show that SnapMLA achieves up to a 1.91x improvement in throughput, with negligible risk of performance degradation in challenging long-context tasks, including mathematical reasoning and code generation benchmarks. Code is available at this https URL.

37. 【2602.10715】Locomo-Plus: Beyond-Factual Cognitive Memory Evaluation Framework for LLM Agents

链接https://arxiv.org/abs/2602.10715

作者:Yifei Li,Weidong Guo,Lingling Zhang,Rongman Xu,Muye Huang,Hui Liu,Lijiao Xu,Yu Xu,Jun Liu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:surface-level factual recall, protocols primarily focus, Long-term conversational memory, evaluation protocols primarily, LLM-based dialogue systems

备注: 16 pages, 8 figures

点击查看摘要

Abstract:Long-term conversational memory is a core capability for LLM-based dialogue systems, yet existing benchmarks and evaluation protocols primarily focus on surface-level factual recall. In realistic interactions, appropriate responses often depend on implicit constraints such as user state, goals, or values that are not explicitly queried later. To evaluate this setting, we introduce \textbf{LoCoMo-Plus}, a benchmark for assessing cognitive memory under cue--trigger semantic disconnect, where models must retain and apply latent constraints across long conversational contexts. We further show that conventional string-matching metrics and explicit task-type prompting are misaligned with such scenarios, and propose a unified evaluation framework based on constraint consistency. Experiments across diverse backbone models, retrieval-based methods, and memory systems demonstrate that cognitive memory remains challenging and reveals failures not captured by existing benchmarks. Our code and evaluation framework are publicly available at: this https URL.

38. 【2602.10661】argeted Syntactic Evaluation of Language Models on Georgian Case Alignment

链接https://arxiv.org/abs/2602.10661

作者:Daniel Gallagher,Gerhard Heyer

类目:Computation and Language (cs.CL)

关键词:alignment in Georgian, split-ergative case alignment, mark argument roles, paper evaluates, rare system

备注: To appear in Proceedings of The Second Workshop on Language Models for Low-Resource Languages (LoResLM), EACL 2026

点击查看摘要

Abstract:This paper evaluates the performance of transformer-based language models on split-ergative case alignment in Georgian, a particularly rare system for assigning grammatical cases to mark argument roles. We focus on subject and object marking determined through various permutations of nominative, ergative, and dative noun forms. A treebank-based approach for the generation of minimal pairs using the Grew query language is implemented. We create a dataset of 370 syntactic tests made up of seven tasks containing 50-70 samples each, where three noun forms are tested in any given sample. Five encoder- and two decoder-only models are evaluated with word- and/or sentence-level accuracy metrics. Regardless of the specific syntactic makeup, models performed worst in assigning the ergative case correctly and strongest in assigning the nominative case correctly. Performance correlated with the overall frequency distribution of the three forms (NOM DAT ERG). Though data scarcity is a known issue for low-resource languages, we show that the highly specific role of the ergative along with a lack of available training data likely contributes to poor performance on this case. The dataset is made publicly available and the methodology provides an interesting avenue for future syntactic evaluations of languages where benchmarks are limited.

39. 【2602.10657】Benchmarks Are Not That Out of Distribution: Word Overlap Predicts Performance

链接https://arxiv.org/abs/2602.10657

作者:Woojin Chung,Jeonghoon Kim

类目:Computation and Language (cs.CL)

关键词:Understanding what constitutes, constitutes high-quality pre-training, word-level unigram cross-entropy, constitutes high-quality, remains a central

备注

点击查看摘要

Abstract:Understanding what constitutes high-quality pre-training data remains a central question in language model training. In this work, we investigate whether benchmark performance is primarily driven by the degree of statistical pattern overlap between pre-training corpora and evaluation datasets. We measure this overlap using word-level unigram cross-entropy and word frequency statistics, and perform controlled experiments across $10$ zero-shot benchmarks, $4$ pre-training datasets spanning $8.5\mathrm{B}$ to $60\mathrm{B}$ tokens, and model sizes ranging from $400\mathrm{M}$ to $3\mathrm{B}$ parameters. Our results demonstrate a robust inverse relationship between word-level unigram cross-entropy and benchmark performance, suggesting that widely used benchmarks are strongly influenced by word overlap between training and evaluation data. Thus, larger pre-training subsets with similar word-level unigram cross-entropy yield improved downstream results, indicating that word frequency statistics play an additional role in shaping benchmark scores. Taken together, these results suggest that many standard benchmarks are only weakly out-of-distribution relative to pre-training corpora, so that simple word-overlap statistics predict benchmark performance.

40. 【2602.10652】UMEM: Unified Memory Extraction and Management Framework for Generalizable Memory

链接https://arxiv.org/abs/2602.10652

作者:Yongshi Ye,Hui Jiang,Feihu Jiang,Tian Lan,Yichao Du,Biao Fu,Xiaodong Shi,Qianghuai Jia,Longyue Wang,Weihua Luo

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language Model, Large Language, Self-evolving memory serves, distilling insights

备注

点击查看摘要

Abstract:Self-evolving memory serves as the trainable parameters for Large Language Models (LLMs)-based agents, where extraction (distilling insights from experience) and management (updating the memory bank) must be tightly coordinated. Existing methods predominately optimize memory management while treating memory extraction as a static process, resulting in poor generalization, where agents accumulate instance-specific noise rather than robust memories. To address this, we propose Unified Memory Extraction and Management (UMEM), a self-evolving agent framework that jointly optimizes a Large Language Model to simultaneous extract and manage memories. To mitigate overfitting to specific instances, we introduce Semantic Neighborhood Modeling and optimize the model with a neighborhood-level marginal utility reward via GRPO. This approach ensures memory generalizability by evaluating memory utility across clusters of semantically related queries. Extensive experiments across five benchmarks demonstrate that UMEM significantly outperforms highly competitive baselines, achieving up to a 10.67% improvement in multi-turn interactive tasks. Futhermore, UMEM maintains a monotonic growth curve during continuous evolution. Codes and models will be publicly released.

41. 【2602.10625】o Think or Not To Think, That is The Question for Large Reasoning Models in Theory of Mind Tasks

链接https://arxiv.org/abs/2602.10625

作者:Nanxu Gong,Haotian Li,Sixun Dong,Jianxun Lian,Yanjie Fu,Xing Xie

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Theory of Mind, infer hidden mental, hidden mental states, Large Language Models, natural social interaction

备注

点击查看摘要

Abstract:Theory of Mind (ToM) assesses whether models can infer hidden mental states such as beliefs, desires, and intentions, which is essential for natural social interaction. Although recent progress in Large Reasoning Models (LRMs) has boosted step-by-step inference in mathematics and coding, it is still underexplored whether this benefit transfers to socio-cognitive skills. We present a systematic study of nine advanced Large Language Models (LLMs), comparing reasoning models with non-reasoning models on three representative ToM benchmarks. The results show that reasoning models do not consistently outperform non-reasoning models and sometimes perform worse. A fine-grained analysis reveals three insights. First, slow thinking collapses: accuracy significantly drops as responses grow longer, and larger reasoning budgets hurt performance. Second, moderate and adaptive reasoning benefits performance: constraining reasoning length mitigates failure, while distinct success patterns demonstrate the necessity of dynamic adaptation. Third, option matching shortcut: when multiple choice options are removed, reasoning models improve markedly, indicating reliance on option matching rather than genuine deduction. We also design two intervention approaches: Slow-to-Fast (S2F) adaptive reasoning and Think-to-Match (T2M) shortcut prevention to further verify and mitigate the problems. With all results, our study highlights the advancement of LRMs in formal reasoning (e.g., math, code) cannot be fully transferred to ToM, a typical task in social reasoning. We conclude that achieving robust ToM requires developing unique capabilities beyond existing reasoning methods.

42. 【2602.10622】How Do Decoder-Only LLMs Perceive Users? Rethinking Attention Masking for User Representation Learning

链接https://arxiv.org/abs/2602.10622

作者:Jiahao Yuan,Yike Xu,Jinyong Wen,Baokun Wang,Yang Chen,Xiaotong Lin,Wuliang Huang,Ziyi Gao,Xing Fu,Yu Cheng,Weiqiang Wang

类目:Computation and Language (cs.CL)

关键词:embeddings remains underexplored, large language models, Decoder-only large language, user embeddings remains, remains underexplored

备注: 13 pages, 4 figures

点击查看摘要

Abstract:Decoder-only large language models are increasingly used as behavioral encoders for user representation learning, yet the impact of attention masking on the quality of user embeddings remains underexplored. In this work, we conduct a systematic study of causal, hybrid, and bidirectional attention masks within a unified contrastive learning framework trained on large-scale real-world Alipay data that integrates long-horizon heterogeneous user behaviors. To improve training dynamics when transitioning from causal to bidirectional attention, we propose Gradient-Guided Soft Masking, a gradient-based pre-warmup applied before a linear scheduler that gradually opens future attention during optimization. Evaluated on 9 industrial user cognition benchmarks covering prediction, preference, and marketing sensitivity tasks, our approach consistently yields more stable training and higher-quality bidirectional representations compared with causal, hybrid, and scheduler-only baselines, while remaining compatible with decoder pretraining. Overall, our findings highlight the importance of masking design and training transition in adapting decoder-only LLMs for effective user representation learning. Our code is available at this https URL.

43. 【2602.10620】ISD-Agent-Bench: A Comprehensive Benchmark for Evaluating LLM-based Instructional Design Agents

链接https://arxiv.org/abs/2602.10620

作者:YoungHoon Jeon,Suwan Kim,Haein Son,Sookbun Lee,Yeil Jeong,Unggi Lee

类目:oftware Engineering (cs.SE); Computation and Language (cs.CL)

关键词:Large Language Model, automating Instructional Systems, Instructional Systems Design, Large Language, Instructional Systems

备注

点击查看摘要

Abstract:Large Language Model (LLM) agents have shown promising potential in automating Instructional Systems Design (ISD), a systematic approach to developing educational programs. However, evaluating these agents remains challenging due to the lack of standardized benchmarks and the risk of LLM-as-judge bias. We present ISD-Agent-Bench, a comprehensive benchmark comprising 25,795 scenarios generated via a Context Matrix framework that combines 51 contextual variables across 5 categories with 33 ISD sub-steps derived from the ADDIE model. To ensure evaluation reliability, we employ a multi-judge protocol using diverse LLMs from different providers, achieving high inter-judge reliability. We compare existing ISD agents with novel agents grounded in classical ISD theories such as ADDIE, Dick \ Carey, and Rapid Prototyping ISD. Experiments on 1,017 test scenarios demonstrate that integrating classical ISD frameworks with modern ReAct-style reasoning achieves the highest performance, outperforming both pure theory-based agents and technique-only approaches. Further analysis reveals that theoretical quality strongly correlates with benchmark performance, with theory-based agents showing significant advantages in problem-centered design and objective-assessment alignment. Our work provides a foundation for systematic LLM-based ISD research.

44. 【2602.10609】Online Causal Kalman Filtering for Stable and Effective Policy Optimization

链接https://arxiv.org/abs/2602.10609

作者:Shuo He,Lang Feng,Xin Cheng,Lei Feng,Bo An

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:token-level importance sampling, high-variance token-level importance, Reinforcement learning, large language models, language models suffers

备注: Preprint

点击查看摘要

Abstract:Reinforcement learning for large language models suffers from high-variance token-level importance sampling (IS) ratios, which would destabilize policy optimization at scale. To improve stability, recent methods typically use a fixed sequence-level IS ratio for all tokens in a sequence or adjust each token's IS ratio separately, thereby neglecting temporal off-policy derivation across tokens in a sequence. In this paper, we first empirically identify that local off-policy deviation is structurally inconsistent at the token level, which may distort policy-gradient updates across adjacent tokens and lead to training collapse. To address the issue, we propose Online Causal Kalman Filtering for stable and effective Policy Optimization (KPO). Concretely, we model the desired IS ratio as a latent state that evolves across tokens and apply a Kalman filter to update this state online and autoregressively based on the states of past tokens, regardless of future tokens. The resulting filtered IS ratios preserve token-wise local structure-aware variation while strongly smoothing noise spikes, yielding more stable and effective policy updates. Experimentally, KPO achieves superior results on challenging math reasoning datasets compared with state-of-the-art counterparts.

45. 【2602.10604】Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters

链接https://arxiv.org/abs/2602.10604

作者:Ailin Huang,Ang Li,Aobo Kong,Bin Wang,Binxing Jiao,Bo Dong,Bojun Wang,Boyu Chen,Brian Li,Buyun Ma,Chang Su,Changxin Miao,Changyi Wan,Chao Lou,Chen Hu,Chen Xu,Chenfeng Yu,Chengting Feng,Chengyuan Yao,Chunrui Han,Dan Ma,Dapeng Shi,Daxin Jiang,Dehua Ma,Deshan Sun,Di Qi,Enle Liu,Fajie Zhang,Fanqi Wan,Guanzhe Huang,Gulin Yan,Guoliang Cao,Guopeng Li,Han Cheng,Hangyu Guo,Hanshan Zhang,Hao Nie,Haonan Jia,Haoran Lv,Hebin Zhou,Hekun Lv,Heng Wang,Heung-Yeung Shum,Hongbo Huang,Hongbo Peng,Hongyu Zhou,Hongyuan Wang,Houyong Chen,Huangxi Zhu,Huimin Wu,Huiyong Guo,Jia Wang,Jian Zhou,Jianjian Sun,Jiaoren Wu,Jiaran Zhang,Jiashu Lv,Jiashuo Liu,Jiayi Fu,Jiayu Liu,Jie Cheng,Jie Luo,Jie Yang,Jie Zhou,Jieyi Hou,Jing Bai,Jingcheng Hu,Jingjing Xie,Jingwei Wu,Jingyang Zhang,Jishi Zhou,Junfeng Liu,Junzhe Lin,Ka Man Lo,Kai Liang,Kaibo Liu,Kaijun Tan,Kaiwen Yan,Kaixiang Li,Kang An,Kangheng Lin,Lei Yang,Liang Lv,Liang Zhao,Liangyu Chen,Lieyu Shi,Liguo Tan,Lin Lin,Lina Chen,Luck Ma,Mengqiang Ren,Michael Li,Ming Li,Mingliang Li,Mingming Zhang,Mingrui Chen,Mitt Huang,Na Wang,Peng Liu,Qi Han

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:bridges frontier-level agentic, introduce Step, Flash, Step, bridges frontier-level

备注: Technical report for Step 3.5 Flash

点击查看摘要

Abstract:We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution. Step 3.5 Flash pairs a 196B-parameter foundation with 11B active parameters for efficient inference. It is optimized with interleaved 3:1 sliding-window/full attention and Multi-Token Prediction (MTP-3) to reduce the latency and cost of multi-round agentic interactions. To reach frontier-level intelligence, we design a scalable reinforcement learning framework that combines verifiable signals with preference feedback, while remaining stable under large-scale off-policy training, enabling consistent self-improvement across mathematics, code, and tool use. Step 3.5 Flash demonstrates strong performance across agent, coding, and math tasks, achieving 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6 (2024.08-2025.05), 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and 51.0% on Terminal-Bench 2.0, comparable to frontier models such as GPT-5.2 xHigh and Gemini 3.0 Pro. By redefining the efficiency frontier, Step 3.5 Flash provides a high-density foundation for deploying sophisticated agents in real-world industrial environments.

46. 【2602.10560】When to Memorize and When to Stop: Gated Recurrent Memory for Long-Context Reasoning

链接https://arxiv.org/abs/2602.10560

作者:Leheng Sheng,Yongtao Zhang,Wenchang Ma,Yaorui Shi,Ting Huang,Xiang Wang,An Zhang,Ke Shen,Tat-Seng Chua

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:context length grows, large language models, real-world applications, length grows, remains challenging

备注: 26 pages

点击查看摘要

Abstract:While reasoning over long context is crucial for various real-world applications, it remains challenging for large language models (LLMs) as they suffer from performance degradation as the context length grows. Recent work MemAgent has tried to tackle this by processing context chunk-by-chunk in an RNN-like loop and updating a textual memory for final answering. However, this naive recurrent memory update faces two crucial drawbacks: (i) memory can quickly explode because it can update indiscriminately, even on evidence-free chunks; and (ii) the loop lacks an exit mechanism, leading to unnecessary computation after even sufficient evidence is collected. To address these issues, we propose GRU-Mem, which incorporates two text-controlled gates for more stable and efficient long-context reasoning. Specifically, in GRU-Mem, the memory only updates when the update gate is open and the recurrent loop will exit immediately once the exit gate is open. To endow the model with such capabilities, we introduce two reward signals $r^{\text{update}}$ and $r^{\text{exit}}$ within end-to-end RL, rewarding the correct updating and exiting behaviors respectively. Experiments on various long-context reasoning tasks demonstrate the effectiveness and efficiency of GRU-Mem, which generally outperforms the vanilla MemAgent with up to 400\% times inference speed acceleration.

47. 【2602.10525】LHAW: Controllable Underspecification for Long-Horizon Tasks

链接https://arxiv.org/abs/2602.10525

作者:George Pu,Michael S. Lee,Udari Madhushani Sehwag,David J. Lee,Bryan Zhu,Yash Maurya,Mohit Raghavendra,Yuan Xue,Samuel Marc Denton

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:operate effectively, effectively over extended, extended periods, periods are essential, Long-Horizon Augmented Workflows

备注

点击查看摘要

Abstract:Long-horizon workflow agents that operate effectively over extended periods are essential for truly autonomous systems. Their reliable execution critically depends on the ability to reason through ambiguous situations in which clarification seeking is necessary to ensure correct task execution. However, progress is limited by the lack of scalable, task-agnostic frameworks for systematically curating and measuring the impact of ambiguity across custom workflows. We address this gap by introducing LHAW (Long-Horizon Augmented Workflows), a modular, dataset-agnostic synthetic pipeline that transforms any well-specified task into controllable underspecified variants by systematically removing information across four dimensions - Goals, Constraints, Inputs, and Context - at configurable severity levels. Unlike approaches that rely on LLM predictions of ambiguity, LHAW validates variants through empirical agent trials, classifying them as outcome-critical, divergent, or benign based on observed terminal state divergence. We release 285 task variants from TheAgentCompany, SWE-Bench Pro and MCP-Atlas according to our taxonomy alongside formal analysis measuring how current agents detect, reason about, and resolve underspecification across ambiguous settings. LHAW provides the first systematic framework for cost-sensitive evaluation of agent clarification behavior in long-horizon settings, enabling development of reliable autonomous systems.

48. 【2602.10504】On the Robustness of Knowledge Editing for Detoxification

链接https://arxiv.org/abs/2602.10504

作者:Ming Dong,Shiyi Tang,Ziyan Peng,Guanyi Chen,Tingting He

类目:Computation and Language (cs.CL)

关键词:Large Language Models, mitigating harmful behaviours, Large Language, promising approach, approach for mitigating

备注

点击查看摘要

Abstract:Knowledge-Editing-based (KE-based) detoxification has emerged as a promising approach for mitigating harmful behaviours in Large Language Models. Existing evaluations, however, largely rely on automatic toxicity classifiers, implicitly assuming that reduced toxicity scores reflect genuine behavioural suppression. In this work, we propose a robustness-oriented evaluation framework for KE-based detoxification that examines its reliability beyond standard classifier-based metrics along three dimensions: optimisation robustness, compositional robustness, and cross-lingual robustness. We identify pseudo-detoxification as a common failure mode, where apparent toxicity reductions arise from degenerate generation behaviours rather than meaningful suppression of unsafe content. We further show that detoxification effectiveness degrades when multiple unsafe behaviours are edited jointly, and that both monolingual and cross-lingual detoxification remain effective only under specific model-method combinations. Overall, our results indicate that KE-based detoxification is robust only for certain models, limited numbers of detoxification objectives, and a subset of languages.

49. 【2602.10494】Canvas-of-Thought: Grounding Reasoning via Mutable Structured States

链接https://arxiv.org/abs/2602.10494

作者:Lingzhuang Sun,Yuxia Zhu,Ruitong Liu,Hao Liang,Zheng Sun,Caijun Jia,Honghao He,Yuchen Wu,Siyuan Li,Jingxuan Wei,Xiangxiang Zhang,Bihui Yu,Wentao Zhang

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Multimodal Large Language, Large Language, Language Models, linear text sequences

备注

点击查看摘要

Abstract:While Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), relying solely on linear text sequences remains a bottleneck for complex tasks. We observe that even when auxiliary visual elements are interleaved, they are often treated as static snapshots within a one-dimensional, unstructured reasoning chain. We argue that such approaches treat reasoning history as an immutable stream: correcting a local error necessitates either generating verbose downstream corrections or regenerating the entire context. This forces the model to implicitly maintain and track state updates, significantly increasing token consumption and cognitive load. This limitation is particularly acute in high-dimensional domains, such as geometry and SVG design, where the textual expression of CoT lacks explicit visual guidance, further constraining the model's reasoning precision. To bridge this gap, we introduce \textbf{Canvas-of-Thought (Canvas-CoT)}. By leveraging a HTML Canvas as an external reasoning substrate, Canvas-CoT empowers the model to perform atomic, DOM-based CRUD operations. This architecture enables in-place state revisions without disrupting the surrounding context, allowing the model to explicitly maintain the "ground truth". Furthermore, we integrate a rendering-based critique loop that serves as a hard constraint validator, providing explicit visual feedback to resolve complex tasks that are difficult to articulate through text alone. Extensive experiments on VCode, RBench-V, and MathVista demonstrate that Canvas-CoT significantly outperforms existing baselines, establishing a new paradigm for context-efficient multimodal reasoning.

50. 【2602.10480】Neuro-Symbolic Synergy for Interactive World Modeling

链接https://arxiv.org/abs/2602.10480

作者:Hongyu Zhao,Siyu Zhou,Haolin Yang,Zengyi Qin,Tianyi Zhou

类目:Computation and Language (cs.CL)

关键词:exhibit strong general-purpose, general-purpose reasoning capabilities, Large language models, strong general-purpose reasoning, Large language

备注

点击查看摘要

Abstract:Large language models (LLMs) exhibit strong general-purpose reasoning capabilities, yet they frequently hallucinate when used as world models (WMs), where strict compliance with deterministic transition rules--particularly in corner cases--is essential. In contrast, Symbolic WMs provide logical consistency but lack semantic expressivity. To bridge this gap, we propose Neuro-Symbolic Synergy (NeSyS), a framework that integrates the probabilistic semantic priors of LLMs with executable symbolic rules to achieve both expressivity and robustness. NeSyS alternates training between the two models using trajectories inadequately explained by the other. Unlike rule-based prompting, the symbolic WM directly constrains the LLM by modifying its output probability distribution. The neural WM is fine-tuned only on trajectories not covered by symbolic rules, reducing training data by 50% without loss of accuracy. Extensive experiments on three distinct interactive environments, i.e., ScienceWorld, Webshop, and Plancraft, demonstrate NeSyS's consistent advantages over baselines in both WM prediction accuracy and data efficiency.

51. 【2602.10471】stExplora: Benchmarking LLMs for Proactive Bug Discovery via Repository-Level Test Generation

链接https://arxiv.org/abs/2602.10471

作者:Steven Liu,Jane Luo,Xin Zhang,Aofan Liu,Hao Liu,Jie Wu,Ziyang Huang,Yangyu Huang,Yu Kang,Scarlett Li

类目:oftware Engineering (cs.SE); Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, automate software development, reactive reproduction, Language Models

备注

点击查看摘要

Abstract:Given that Large Language Models (LLMs) are increasingly applied to automate software development, comprehensive software assurance spans three distinct goals: regression prevention, reactive reproduction, and proactive discovery. Current evaluations systematically overlook the third goal. Specifically, they either treat existing code as ground truth (a compliance trap) for regression prevention, or depend on post-failure artifacts (e.g., issue reports) for bug reproduction-so they rarely surface defects before failures. To bridge this gap, we present TestExplora, a benchmark designed to evaluate LLMs as proactive testers within full-scale, realistic repository environments. TestExplora contains 2,389 tasks from 482 repositories and hides all defect-related signals. Models must proactively find bugs by comparing implementations against documentation-derived intent, using documentation as the oracle. Furthermore, to keep evaluation sustainable and reduce leakage, we propose continuous, time-aware data collection. Our evaluation reveals a significant capability gap: state-of-the-art models achieve a maximum Fail-to-Pass (F2P) rate of only 16.06%. Further analysis indicates that navigating complex cross-module interactions and leveraging agentic exploration are critical to advancing LLMs toward autonomous software quality assurance. Consistent with this, SWEAgent instantiated with GPT-5-mini achieves an F2P of 17.27% and an F2P@5 of 29.7%, highlighting the effectiveness and promise of agentic exploration in proactive bug discovery tasks.

52. 【2602.10454】LATA: A Tool for LLM-Assisted Translation Annotation

链接https://arxiv.org/abs/2602.10454

作者:Baorong Huang,Ali Asiri

类目:Computation and Language (cs.CL)

关键词:high-quality parallel corpora, multi-layered annotation tasks, construction of high-quality, high-quality parallel, parallel corpora

备注

点击查看摘要

Abstract:The construction of high-quality parallel corpora for translation research has increasingly evolved from simple sentence alignment to complex, multi-layered annotation tasks. This methodological shift presents significant challenges for structurally divergent language pairs, such as Arabic--English, where standard automated tools frequently fail to capture deep linguistic shifts or semantic nuances. This paper introduces a novel, LLM-assisted interactive tool designed to reduce the gap between scalable automation and the rigorous precision required for expert human judgment. Unlike traditional statistical aligners, our system employs a template-based Prompt Manager that leverages large language models (LLMs) for sentence segmentation and alignment under strict JSON output constraints. In this tool, automated preprocessing integrates into a human-in-the-loop workflow, allowing researchers to refine alignments and apply custom translation technique annotations through a stand-off architecture. By leveraging LLM-assisted processing, the tool balances annotation efficiency with the linguistic precision required to analyze complex translation phenomena in specialized domains.

53. 【2602.10453】he Landscape of Prompt Injection Threats in LLM Agents: From Taxonomy to Analysis

链接https://arxiv.org/abs/2602.10453

作者:Peiran Wang,Xinfeng Li,Chong Xiang,Jinghuai Zhang,Ying Li,Lixia Zhang,Xiaofeng Wang,Yuan Tian

类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)

关键词:Large Language Models, Prompt Injection, Large Language, necessitating robust security, evolution of Large

备注

点击查看摘要

Abstract:The evolution of Large Language Models (LLMs) has resulted in a paradigm shift towards autonomous agents, necessitating robust security against Prompt Injection (PI) vulnerabilities where untrusted inputs hijack agent behaviors. This SoK presents a comprehensive overview of the PI landscape, covering attacks, defenses, and their evaluation practices. Through a systematic literature review and quantitative analysis, we establish taxonomies that categorize PI attacks by payload generation strategies (heuristic vs. optimization) and defenses by intervention stages (text, model, and execution levels). Our analysis reveals a key limitation shared by many existing defenses and benchmarks: they largely overlook context-dependent tasks, in which agents are authorized to rely on runtime environmental observations to determine actions. To address this gap, we introduce AgentPI, a new benchmark designed to systematically evaluate agent behavior under context-dependent interaction settings. Using AgentPI, we empirically evaluate representative defenses and show that no single approach can simultaneously achieve high trustworthiness, high utility, and low latency. Moreover, we show that many defenses appear effective under existing benchmarks by suppressing contextual inputs, yet fail to generalize to realistic agent settings where context-dependent reasoning is essential. This SoK distills key takeaways and open research problems, offering structured guidance for future research and practical deployment of secure LLM agents.

54. 【2602.10437】Control Reinforcement Learning: Token-Level Mechanistic Analysis via Learned SAE Feature Steering

链接https://arxiv.org/abs/2602.10437

作者:Seonglae Cho,Zekun Wu,Adriano Koshiyama

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:decompose language model, change model outputs, language model activations, Control Reinforcement Learning, Sparse autoencoders

备注

点击查看摘要

Abstract:Sparse autoencoders (SAEs) decompose language model activations into interpretable features, but existing methods reveal only which features activate, not which change model outputs when amplified. We introduce Control Reinforcement Learning (CRL), which trains a policy to select SAE features for steering at each token, producing interpretable intervention logs: the learned policy identifies features that change model outputs when amplified. Adaptive Feature Masking encourages diverse feature discovery while preserving singlefeature interpretability. The framework yields new analysis capabilities: branch point tracking locates tokens where feature choice determines output correctness; critic trajectory analysis separates policy limitations from value estimation errors; layer-wise comparison reveals syntactic features in early layers and semantic features in later layers. On Gemma 2 2B across MMLU, BBQ, GSM8K, HarmBench, and XSTest, CRL achieves improvements while providing per-token intervention logs. These results establish learned feature steering as a mechanistic interpretability tool that complements static feature analysis with dynamic intervention probes

55. 【2602.10416】AI-rithmetic

链接https://arxiv.org/abs/2602.10416

作者:Alex Bie,Travis Dick,Alex Kulesza,Prabhakar Raghavan,Vinod Raman,Sergei Vassilvitskii

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:international math competitions, Modern AI systems, math competitions, assist with research, research workflows

备注

点击查看摘要

Abstract:Modern AI systems have been successfully deployed to win medals at international math competitions, assist with research workflows, and prove novel technical lemmas. However, despite their progress at advanced levels of mathematics, they remain stubbornly bad at basic arithmetic, consistently failing on the simple task of adding two numbers. We present a systematic investigation of this phenomenon. We demonstrate empirically that all frontier models suffer significantly degraded accuracy for integer addition as the number of digits increases. Furthermore, we show that most errors made by these models are highly interpretable and can be attributed to either operand misalignment or a failure to correctly carry; these two error classes explain 87.9%, 62.9%, and 92.4% of Claude Opus 4.1, GPT-5, and Gemini 2.5 Pro errors, respectively. Finally, we show that misalignment errors are frequently related to tokenization, and that carrying errors appear largely as independent random failures.

56. 【2602.10414】EVOKE: Emotion Vocabulary Of Korean and English

链接https://arxiv.org/abs/2602.10414

作者:Yoonwon Jung,Hagyeong Shin,Benjamin K. Bergen

类目:Computation and Language (cs.CL)

关键词:paper introduces EVOKE, introduces EVOKE, emotion words, words, paper introduces

备注

点击查看摘要

Abstract:This paper introduces EVOKE, a parallel dataset of emotion vocabulary in English and Korean. The dataset offers comprehensive coverage of emotion words in each language, in addition to many-to-many translations between words in the two languages and identification of language-specific emotion words. The dataset contains 1,427 Korean words and 1,399 English words, and we systematically annotate 819 Korean and 924 English adjectives and verbs. We also annotate multiple meanings of each word and their relationships, identifying polysemous emotion words and emotion-related metaphors. The dataset is, to our knowledge, the most comprehensive, systematic, and theory-agnostic dataset of emotion words in both Korean and English to date. It can serve as a practical tool for emotion science, psycholinguistics, computational linguistics, and natural language processing, allowing researchers to adopt different views on the resource reflecting their needs and theoretical perspectives. The dataset is publicly available at this https URL.

57. 【2602.10408】Gated Removal of Normalization in Transformers Enables Stable Training and Efficient Inference

链接https://arxiv.org/abs/2602.10408

作者:Andrei Kanavalau,Carmen Amo Alonso,Sanjay Lall

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:stabilizing Transformer training, stabilizing Transformer, widely viewed, viewed as essential, essential for stabilizing

备注

点击查看摘要

Abstract:Normalization is widely viewed as essential for stabilizing Transformer training. We revisit this assumption for pre-norm Transformers and ask to what extent sample-dependent normalization is needed inside Transformer blocks. We introduce TaperNorm, a drop-in replacement for RMSNorm/LayerNorm that behaves exactly like the standard normalizer early in training and then smoothly tapers to a learned sample-independent linear/affine map. A single global gate is held at $g{=}1$ during gate warmup, used to calibrate the scaling branch via EMAs, and then cosine-decayed to $g{=}0$, at which point per-token statistics vanish and the resulting fixed scalings can be folded into adjacent linear projections. Our theoretical and empirical results isolate scale anchoring as the key role played by output normalization: as a (near) $0$-homogeneous map it removes radial gradients at the output, whereas without such an anchor cross-entropy encourages unbounded logit growth (``logit chasing''). We further show that a simple fixed-target auxiliary loss on the pre-logit residual-stream scale provides an explicit alternative anchor and can aid removal of the final normalization layer. Empirically, TaperNorm matches normalized baselines under identical setups while eliminating per-token statistics and enabling these layers to be folded into adjacent linear projections at inference. On an efficiency microbenchmark, folding internal scalings yields up to $1.22\times$ higher throughput in last-token logits mode. These results take a step towards norm-free Transformers while identifying the special role output normalization plays.

58. 【2602.10404】Modular Multi-Task Learning for Chemical Reaction Prediction

链接https://arxiv.org/abs/2602.10404

作者:Jiayun Pang,Ahmed M. Zaitoun,Xacobe Couso Cambeiro,Ivan Vulić

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Adapting large language, large language models, Adapting large, language models, trained on broad

备注: 19 pages, 7 figures

点击查看摘要

Abstract:Adapting large language models (LLMs) trained on broad organic chemistry to smaller, domain-specific reaction datasets is a key challenge in chemical and pharmaceutical RD. Effective specialisation requires learning new reaction knowledge while preserving general chemical understanding across related tasks. Here, we evaluate Low-Rank Adaptation (LoRA) as a parameter-efficient alternative to full fine-tuning for organic reaction prediction on limited, complex datasets. Using USPTO reaction classes and challenging C-H functionalisation reactions, we benchmark forward reaction prediction, retrosynthesis and reagent prediction. LoRA achieves accuracy comparable to full fine-tuning while effectively mitigating catastrophic forgetting and better preserving multi-task performance. Both fine-tuning approaches generalise beyond training distributions, producing plausible alternative solvent predictions. Notably, C-H functionalisation fine-tuning reveals that LoRA and full fine-tuning encode subtly different reactivity patterns, suggesting more effective reaction-specific adaptation with LoRA. As LLMs continue to scale, our results highlight the practicality of modular, parameter-efficient fine-tuning strategies for their flexible deployment for chemistry applications.

59. 【2602.10400】When are We Worried? Temporal Trends of Anxiety and What They Reveal about Us

链接https://arxiv.org/abs/2602.10400

作者:Saif M. Mohammad

类目:Computation and Language (cs.CL)

关键词:Canadian social media, recently created lexicon, analyze large amounts, social media data, Canadian social

备注

点击查看摘要

Abstract:In this short paper, we make use of a recently created lexicon of word-anxiety associations to analyze large amounts of US and Canadian social media data (tweets) to explore *when* we are anxious and what insights that reveals about us. We show that our levels of anxiety on social media exhibit systematic patterns of rise and fall during the day -- highest at 8am (in-line with when we have high cortisol levels in the body) and lowest around noon. Anxiety is lowest on weekends and highest mid-week. We also examine anxiety in past, present, and future tense sentences to show that anxiety is highest in past tense and lowest in future tense. Finally, we examine the use of anxiety and calmness words in posts that contain pronouns to show: more anxiety in 3rd person pronouns (he, they) posts than 1st and 2nd person pronouns and higher anxiety in posts with subject pronouns (I, he, she, they) than object pronouns (me, him, her, them). Overall, these trends provide valuable insights on not just when we are anxious, but also how different types of focus (future, past, self, outward, etc.) are related to anxiety.

60. 【2602.10388】Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs

链接https://arxiv.org/abs/2602.10388

作者:Zhongzhi Li,Xuansheng Wu,Yijiang Li,Lijie Hu,Ninghao Liu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:effective downstream performance, large language models, downstream performance, critical for effective, large language

备注

点击查看摘要

Abstract:The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics provide only weak signals for the task-relevant features that determine downstream performance. In this work, we introduce Feature Activation Coverage (FAC) which measures data diversity in an interpretable feature space. Building upon this metric, we further propose a diversity-driven data synthesis framework, named FAC Synthesis, that first uses a sparse autoencoder to identify missing features from a seed dataset, and then generates synthetic samples that explicitly reflect these features. Experiments show that our approach consistently improves both data diversity and downstream performance on various tasks, including instruction following, toxicity detection, reward modeling, and behavior steering. Interestingly, we identify a shared, interpretable feature space across model families (i.e., LLaMA, Mistral, and Qwen), enabling cross-model knowledge transfer. Our work provides a solid and practical methodology for exploring data-centric optimization of LLMs.

61. 【2602.10384】When Tables Go Crazy: Evaluating Multimodal Models on French Financial Documents

链接https://arxiv.org/abs/2602.10384

作者:Virginie Mouilleron,Théo Lasnier,Djamé Seddah

类目:Computation and Language (cs.CL)

关键词:non-English domains remains, domains remains underexplored, Multimodal Finance Eval, reliability in specialized, non-English domains

备注: 14 pages, 17 figures

点击查看摘要

Abstract:Vision-language models (VLMs) perform well on many document understanding tasks, yet their reliability in specialized, non-English domains remains underexplored. This gap is especially critical in finance, where documents mix dense regulatory text, numerical tables, and visual charts, and where extraction errors can have real-world consequences. We introduce Multimodal Finance Eval, the first multimodal benchmark for evaluating French financial document understanding. The dataset contains 1,204 expert-validated questions spanning text extraction, table comprehension, chart interpretation, and multi-turn conversational reasoning, drawn from real investment prospectuses, KIDs, and PRIIPs. We evaluate six open-weight VLMs (8B-124B parameters) using an LLM-as-judge protocol. While models achieve strong performance on text and table tasks (85-90% accuracy), they struggle with chart interpretation (34-62%). Most notably, multi-turn dialogue reveals a sharp failure mode: early mistakes propagate across turns, driving accuracy down to roughly 50% regardless of model size. These results show that current VLMs are effective for well-defined extraction tasks but remain brittle in interactive, multi-step financial analysis. Multimodal Finance Eval offers a challenging benchmark to measure and drive progress in this high-stakes setting.

Comments:
14 pages, 17 figures

Subjects:

Computation and Language (cs.CL)

Cite as:
arXiv:2602.10384 [cs.CL]

(or
arXiv:2602.10384v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2602.10384

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
62. 【2602.10382】riggers Hijack Language Circuits: A Mechanistic Analysis of Backdoor Behaviors in Large Language Models

链接https://arxiv.org/abs/2602.10382

作者:Théo Lasnier,Wissam Antoun,Francis Kulumba,Djamé Seddah

类目:Computation and Language (cs.CL)

关键词:remain poorly understood, attacks pose significant, pose significant security, significant security risks, operate remain poorly

备注: 13 pages, 35 figures

点击查看摘要

Abstract:Backdoor attacks pose significant security risks for Large Language Models (LLMs), yet the internal mechanisms by which triggers operate remain poorly understood. We present the first mechanistic analysis of language-switching backdoors, studying the GAPperon model family (1B, 8B, 24B parameters) which contains triggers injected during pretraining that cause output language switching. Using activation patching, we localize trigger formation to early layers (7.5-25% of model depth) and identify which attention heads process trigger information. Our central finding is that trigger-activated heads substantially overlap with heads naturally encoding output language across model scales, with Jaccard indices between 0.18 and 0.66 over the top heads identified. This suggests that backdoor triggers do not form isolated circuits but instead co-opt the model's existing language components. These findings have implications for backdoor defense: detection methods may benefit from monitoring known functional components rather than searching for hidden circuits, and mitigation strategies could potentially leverage this entanglement between injected and natural behaviors.

63. 【2602.10380】he Alignment Bottleneck in Decomposition-Based Claim Verification

链接https://arxiv.org/abs/2602.10380

作者:Mahmud Elahi Akhter,Federico Ruggeri,Iman Munire Bilal,Rob Procter,Maria Liakata

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Structured claim decomposition, Structured claim, Repeated Claim-level Evidence, solution for verifying, evidence

备注

点击查看摘要

Abstract:Structured claim decomposition is often proposed as a solution for verifying complex, multi-faceted claims, yet empirical results have been inconsistent. We argue that these inconsistencies stem from two overlooked bottlenecks: evidence alignment and sub-claim error profiles. To better understand these factors, we introduce a new dataset of real-world complex claims, featuring temporally bounded evidence and human-annotated sub-claim evidence spans. We evaluate decomposition under two evidence alignment setups: Sub-claim Aligned Evidence (SAE) and Repeated Claim-level Evidence (SRE). Our results reveal that decomposition brings significant performance improvement only when evidence is granular and strictly aligned. By contrast, standard setups that rely on repeated claim-level evidence (SRE) fail to improve and often degrade performance as shown across different datasets and domains (PHEMEPlus, MMM-Fact, COVID-Fact). Furthermore, we demonstrate that in the presence of noisy sub-claim labels, the nature of the error ends up determining downstream robustness. We find that conservative "abstention" significantly reduces error propagation compared to aggressive but incorrect predictions. These findings suggest that future claim decomposition frameworks must prioritize precise evidence synthesis and calibrate the label bias of sub-claim verification models.

64. 【2602.10377】Hardware Co-Design Scaling Laws via Roofline Modelling for On-Device LLMs

链接https://arxiv.org/abs/2602.10377

作者:Luoyang Sun,Jiwen Jiang,Yifeng Ding,Fengfa Li,Yan Song,Haifeng Zhang,Jian Ying,Lei Ren,Kun Zhan,Wei Chen,Yan Xie,Cheng Deng

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:paradigm of Physical, autonomous vehicles, smart spaces, on-device LLM deployment, key paradigm

备注

点击查看摘要

Abstract:Vision-Language-Action Models (VLAs) have emerged as a key paradigm of Physical AI and are increasingly deployed in autonomous vehicles, robots, and smart spaces. In these resource-constrained on-device settings, selecting an appropriate large language model (LLM) backbone is a critical challenge: models must balance accuracy with strict inference latency and hardware efficiency constraints. This makes hardware-software co-design a game-changing requirement for on-device LLM deployment, where each hardware platform demands a tailored architectural solution. We propose a hardware co-design law that jointly captures model accuracy and inference performance. Specifically, we model training loss as an explicit function of architectural hyperparameters and characterise inference latency via roofline modelling. We empirically evaluate 1,942 candidate architectures on NVIDIA Jetson Orin, training 170 selected models for 10B tokens each to fit a scaling law relating architecture to training loss. By coupling this scaling law with latency modelling, we establish a direct accuracy-latency correspondence and identify the Pareto frontier for hardware co-designed LLMs. We further formulate architecture search as a joint optimisation over precision and performance, deriving feasible design regions under industrial hardware and application budgets. Our approach reduces architecture selection from months to days. At the same latency as Qwen2.5-0.5B on the target hardware, our co-designed architecture achieves 19.42% lower perplexity on WikiText-2. To our knowledge, this is the first principled and operational framework for hardware co-design scaling laws in on-device LLM deployment. We will make the code and related checkpoints publicly available.

65. 【2602.10356】Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation

链接https://arxiv.org/abs/2602.10356

作者:Tianci Xue,Zeyi Liao,Tianneng Shi,Zilu Wang,Kai Zhang,Dawn Song,Yu Su,Huan Sun

类目:Computation and Language (cs.CL)

关键词:Real-world digital environments, Real-world digital, diverse and dynamic, Real-world, Autonomous Curriculum Reinforcement

备注: 24 pages, 6 figures

点击查看摘要

Abstract:Real-world digital environments are highly diverse and dynamic. These characteristics cause agents to frequently encounter unseen scenarios and distribution shifts, making continual learning in specific environments essential for computer-use agents (CUAs). However, a key challenge lies in obtaining high-quality and environment-grounded agent data without relying on costly human annotation. In this work, we introduce ACuRL, an Autonomous Curriculum Reinforcement Learning framework that continually adapts agents to specific environments with zero human data. The agent first explores target environments to acquire initial experiences. During subsequent iterative training, a curriculum task generator leverages these experiences together with feedback from the previous iteration to synthesize new tasks tailored for the agent's current capabilities. To provide reliable reward signals, we introduce CUAJudge, a robust automatic evaluator for CUAs that achieves 93% agreement with human judgments. Empirically, our method effectively enables both intra-environment and cross-environment continual learning, yielding 4-22% performance gains without catastrophic forgetting on existing environments. Further analyses show highly sparse updates (e.g., 20% parameters), which helps explain the effective and robust adaptation. Our data and code are available at this https URL.

66. 【2602.10354】Physically Interpretable AlphaEarth Foundation Model Embeddings Enable LLM-Based Land Surface Intelligence

链接https://arxiv.org/abs/2602.10354

作者:Mashrekur Rahman

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Continental United States, remains poorly understood, produce dense embeddings, models produce dense, physical interpretability remains

备注

点击查看摘要

Abstract:Satellite foundation models produce dense embeddings whose physical interpretability remains poorly understood, limiting their integration into environmental decision systems. Using 12.1 million samples across the Continental United States (2017--2023), we first present a comprehensive interpretability analysis of Google AlphaEarth's 64-dimensional embeddings against 26 environmental variables spanning climate, vegetation, hydrology, temperature, and terrain. Combining linear, nonlinear, and attention-based methods, we show that individual embedding dimensions map onto specific land surface properties, while the full embedding space reconstructs most environmental variables with high fidelity (12 of 26 variables exceed $R^2 0.90$; temperature and elevation approach $R^2 = 0.97$). The strongest dimension-variable relationships converge across all three analytical methods and remain robust under spatial block cross-validation (mean $\Delta R^2 = 0.017$) and temporally stable across all seven study years (mean inter-year correlation $r = 0.963$). Building on these validated interpretations, we then developed a Land Surface Intelligence system that implements retrieval-augmented generation over a FAISS-indexed embedding database of 12.1 million vectors, translating natural language environmental queries into satellite-grounded assessments. An LLM-as-Judge evaluation across 360 query--response cycles, using four LLMs in rotating generator, system, and judge roles, achieved weighted scores of $\mu = 3.74 \pm 0.77$ (scale 1--5), with grounding ($\mu = 3.93$) and coherence ($\mu = 4.25$) as the strongest criteria. Our results demonstrate that satellite foundation model embeddings are physically structured representations that can be operationalized for environmental and geospatial intelligence.

67. 【2602.10352】Learning Self-Interpretation from Interpretability Artifacts: Training Lightweight Adapters on Vector-Label Pairs

链接https://arxiv.org/abs/2602.10352

作者:Keenan Pepper,Alex McKenzie,Florin Pop,Stijn Servaes,Martin Leitgab,Mike Vaiana,Judd Rosenblatt,Michael S. A. Graziano,Diogo de Lucena

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:remain unreliable due, methods prompt language, internal states, hyperparameter sensitivity, remain unreliable

备注: 23 pages, 17 tables, 17 figures. Code and data at [this https URL](https://github.com/agencyenterprise/selfie-adapters)

点击查看摘要

Abstract:Self-interpretation methods prompt language models to describe their own internal states, but remain unreliable due to hyperparameter sensitivity. We show that training lightweight adapters on interpretability artifacts, while keeping the LM entirely frozen, yields reliable self-interpretation across tasks and model families. A scalar affine adapter with just $d_\text{model}+1$ parameters suffices: trained adapters generate sparse autoencoder feature labels that outperform the training labels themselves (71% vs 63% generation scoring at 70B scale), identify topics with 94% recall@1 versus 1% for untrained baselines, and decode bridge entities in multi-hop reasoning that appear in neither prompt nor response, surfacing implicit reasoning without chain-of-thought. The learned bias vector alone accounts for 85% of improvement, and simpler adapters generalize better than more expressive alternatives. Controlling for model knowledge via prompted descriptions, we find self-interpretation gains outpace capability gains from 7B to 72B parameters. Our results demonstrate that self-interpretation improves with scale, without modifying the model being interpreted.

68. 【2602.10350】When Less Is More? Diagnosing ASR Predictions in Sardinian via Layer-Wise Decoding

链接https://arxiv.org/abs/2602.10350

作者:Domenico De Cristofaro,Alessandro Vietti,Marianne Pouplier,Aleese Block

类目:Computation and Language (cs.CL)

关键词:phonetically accurate representations, Recent studies, multilingual speech models, final output layer, studies have shown

备注

点击查看摘要

Abstract:Recent studies have shown that intermediate layers in multilingual speech models often encode more phonetically accurate representations than the final output layer. In this work, we apply a layer-wise decoding strategy to a pretrained Wav2Vec2 model to investigate how phoneme-level predictions evolve across encoder layers, focusing on Campidanese Sardinian, a low-resource language. We show that truncating upper transformer layers leads to improved Phoneme Error Rates (PER), with the best performance achieved not at the final layer, but two layers earlier. Through fine-grained alignment analysis, we find that intermediate predictions better preserve segmental identity, avoid overgeneration, and reduce certain classes of phonological errors. We also introduce the notion of regressive errors, cases where correct predictions at intermediate layers are overwritten by errors at the final layer. These regressions highlight the limitations of surface-level error metrics and reveal how deeper layers may generalize or abstract away from acoustic detail. Our findings support the use of early-layer probing as a diagnostic tool for ASR models, particularly in low-resource settings where standard evaluation metrics may fail to capture linguistically meaningful behavior.

69. 【2602.10346】Geometry-Aware Decoding with Wasserstein-Regularized Truncation and Mass Penalties for Large Language Models

链接https://arxiv.org/abs/2602.10346

作者:Arash Gholami Davoodi,Navid Rezazadeh,Seyed Pouyan Mousavi Davoudi,Pouya Pezeshkpour

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large language models, Large language, balance diversity, logical coherence, probability mass

备注

点击查看摘要

Abstract:Large language models (LLMs) must balance diversity and creativity against logical coherence in open-ended generation. Existing truncation-based samplers are effective but largely heuristic, relying mainly on probability mass and entropy while ignoring semantic geometry of the token space. We present Top-W, a geometry-aware truncation rule that uses Wasserstein distance-defined over token-embedding geometry-to keep the cropped distribution close to the original, while explicitly balancing retained probability mass against the entropy of the kept set. Our theory yields a simple closed-form structure for the fixed-potential subset update: depending on the mass-entropy trade-off, the optimal crop either collapses to a single token or takes the form of a one-dimensional prefix that can be found efficiently with a linear scan. We implement Top-W using efficient geometry-based potentials (nearest-set or k-NN) and pair it with an alternating decoding routine that keeps the standard truncation-and-sampling interface unchanged. Extensive experiments on four benchmarks (GSM8K, GPQA, AlpacaEval, and MT-Bench) across three instruction-tuned models show that Top-W consistently outperforms prior state-of-the-art decoding approaches achieving up to 33.7% improvement. Moreover, we find that Top-W not only improves accuracy-focused performance, but also boosts creativity under judge-based open-ended evaluation.

70. 【2602.10339】he Subjectivity of Respect in Police Traffic Stops: Modeling Community Perspectives in Body-Worn Camera Footage

链接https://arxiv.org/abs/2602.10339

作者:Preni Golazizian,Elnaz Rahmati,Jackson Trager,Zhivar Sourati,Nona Ghazizadeh,Georgios Chochlakis,Jose Alcocer,Kerby Bennett,Aarya Vijay Devnani,Parsa Hejabi,Harry G. Muttram,Akshay Kiran Padte,Mehrshad Saadatinia,Chenhao Wu,Alireza S. Zaibari,Michael Sierra-Arévalo,Nick Weller,Shrikanth Narayanan,Benjamin A. T. Graham,Morteza Dehghani

类目:Computation and Language (cs.CL)

关键词:frequent police-civilian interactions, Traffic stops, Los Angeles Police, Police Department BWC, Angeles Police Department

备注

点击查看摘要

Abstract:Traffic stops are among the most frequent police-civilian interactions, and body-worn cameras (BWCs) provide a unique record of how these encounters unfold. Respect is a central dimension of these interactions, shaping public trust and perceived legitimacy, yet its interpretation is inherently subjective and shaped by lived experience, rendering community-specific perspectives a critical consideration. Leveraging unprecedented access to Los Angeles Police Department BWC footage, we introduce the first large-scale traffic-stop dataset annotated with respect ratings and free-text rationales from multiple perspectives. By sampling annotators from police-affiliated, justice-system-impacted, and non-affiliated Los Angeles residents, we enable the systematic study of perceptual differences across diverse communities. To this end, we (i) develop a domain-specific evaluation rubric grounded in procedural justice theory, LAPD training materials, and extensive fieldwork; (ii) introduce a rubric-driven preference data construction framework for perspective-consistent alignment; and (iii) propose a perspective-aware modeling framework that predicts personalized respect ratings and generates annotator-specific rationales for both officers and civilian drivers from traffic-stop transcripts. Across all three annotator groups, our approach improves both rating prediction performance and rationale alignment. Our perspective-aware framework enables law enforcement to better understand diverse community expectations, providing a vital tool for building public trust and procedural legitimacy.

71. 【2602.10329】Are More Tokens Rational? Inference-Time Scaling in Language Models as Adaptive Resource Rationality

链接https://arxiv.org/abs/2602.10329

作者:Zhimin Hu,Riya Roshan,Sashank Varma

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Human reasoning, Large Language Models, Large Language, Human, optimizing performance

备注

点击查看摘要

Abstract:Human reasoning is shaped by resource rationality -- optimizing performance under constraints. Recently, inference-time scaling has emerged as a powerful paradigm to improve the reasoning performance of Large Language Models by expanding test-time computation. Specifically, instruction-tuned (IT) models explicitly generate long reasoning steps during inference, whereas Large Reasoning Models (LRMs) are trained by reinforcement learning to discover reasoning paths that maximize accuracy. However, it remains unclear whether resource-rationality can emerge from such scaling without explicit reward related to computational costs. We introduce a Variable Attribution Task in which models infer which variables determine outcomes given candidate variables, input-output trials, and predefined logical functions. By varying the number of candidate variables and trials, we systematically manipulate task complexity. Both models exhibit a transition from brute-force to analytic strategies as complexity increases. IT models degrade on XOR and XNOR functions, whereas LRMs remain robust. These findings suggest that models can adjust their reasoning behavior in response to task complexity, even without explicit cost-based reward. It provides compelling evidence that resource rationality is an emergent property of inference-time scaling itself.

72. 【2602.10324】Discovering Differences in Strategic Behavior Between Humans and LLMs

链接https://arxiv.org/abs/2602.10324

作者:Caroline Wang,Daniel Kasenberg,Kim Stachenfeld,Pablo Samuel Castro

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

关键词:Large Language Models, Large Language, Language Models, LLM behavior, increasingly deployed

备注

点击查看摘要

Abstract:As Large Language Models (LLMs) are increasingly deployed in social and strategic scenarios, it becomes critical to understand where and why their behavior diverges from that of humans. While behavioral game theory (BGT) provides a framework for analyzing behavior, existing models do not fully capture the idiosyncratic behavior of humans or black-box, non-human agents like LLMs. We employ AlphaEvolve, a cutting-edge program discovery tool, to directly discover interpretable models of human and LLM behavior from data, thereby enabling open-ended discovery of structural factors driving human and LLM behavior. Our analysis on iterated rock-paper-scissors reveals that frontier LLMs can be capable of deeper strategic behavior than humans. These results provide a foundation for understanding structural differences driving differences in human and LLM behavior in strategic interactions.

73. 【2602.10298】On Emergent Social World Models -- Evidence for Functional Integration of Theory of Mind and Pragmatic Reasoning in Language Models

链接https://arxiv.org/abs/2602.10298

作者:Polina Tsvilodub,Jan-Felix Klumpp,Amir Mohammadpour,Jennifer Hu,Michael Franke

类目:Computation and Language (cs.CL)

关键词:Theory of Mind, recruit shared computational, shared computational mechanisms, language-specific pragmatic reasoning, LMs recruit shared

备注: 29 pages, 13 figures, under review

点击查看摘要

Abstract:This paper investigates whether LMs recruit shared computational mechanisms for general Theory of Mind (ToM) and language-specific pragmatic reasoning in order to contribute to the general question of whether LMs may be said to have emergent "social world models", i.e., representations of mental states that are repurposed across tasks (the functional integration hypothesis). Using behavioral evaluations and causal-mechanistic experiments via functional localization methods inspired by cognitive neuroscience, we analyze LMs' performance across seven subcategories of ToM abilities (Beaudoin et al., 2020) on a substantially larger localizer dataset than used in prior like-minded work. Results from stringent hypothesis-driven statistical testing offer suggestive evidence for the functional integration hypothesis, indicating that LMs may develop interconnected "social world models" rather than isolated competencies. This work contributes novel ToM localizer data, methodological refinements to functional localization techniques, and empirical insights into the emergence of social cognition in artificial systems.

74. 【2602.10271】MLDocRAG: Multimodal Long-Context Document Retrieval Augmented Generation

链接https://arxiv.org/abs/2602.10271

作者:Yongyue Zhang,Yaxiong Wu

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:localize relevant information, aggregate dispersed evidence, multimodal long-context, comprise multimodal chunks, multimodal

备注: 15 pages

点击查看摘要

Abstract:Understanding multimodal long-context documents that comprise multimodal chunks such as paragraphs, figures, and tables is challenging due to (1) cross-modal heterogeneity to localize relevant information across modalities, (2) cross-page reasoning to aggregate dispersed evidence across pages. To address these challenges, we are motivated to adopt a query-centric formulation that projects cross-modal and cross-page information into a unified query representation space, with queries acting as abstract semantic surrogates for heterogeneous multimodal content. In this paper, we propose a Multimodal Long-Context Document Retrieval Augmented Generation (MLDocRAG) framework that leverages a Multimodal Chunk-Query Graph (MCQG) to organize multimodal document content around semantically rich, answerable queries. MCQG is constructed via a multimodal document expansion process that generates fine-grained queries from heterogeneous document chunks and links them to their corresponding content across modalities and pages. This graph-based structure enables selective, query-centric retrieval and structured evidence aggregation, thereby enhancing grounding and coherence in multimodal long-context question answering. Experiments on datasets MMLongBench-Doc and LongDocURL demonstrate that MLDocRAG consistently improves retrieval quality and answer accuracy, demonstrating its effectiveness for multimodal long-context understanding.

75. 【2602.10238】Learning to Evict from Key-Value Cache

链接https://arxiv.org/abs/2602.10238

作者:Luca Moschella,Laura Manduchi,Ozan Sener

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, size of Large, efficient inference challenging, makes efficient inference

备注: 23 pages, 15 figures

点击查看摘要

Abstract:The growing size of Large Language Models (LLMs) makes efficient inference challenging, primarily due to the memory demands of the autoregressive Key-Value (KV) cache. Existing eviction or compression methods reduce cost but rely on heuristics, such as recency or past attention scores, which serve only as indirect proxies for a token's future utility and introduce computational overhead. We reframe KV cache eviction as a reinforcement learning (RL) problem: learning to rank tokens by their predicted usefulness for future decoding. To this end, we introduce KV Policy (KVP), a framework of lightweight per-head RL agents trained on pre-computed generation traces using only key and value vectors. Each agent learns a specialized eviction policy guided by future utility, which evaluates the quality of the ranking across all cache budgets, requiring no modifications to the underlying LLM or additional inference. Evaluated across two different model families on the long-context benchmark RULER and the multi-turn dialogue benchmark OASST2-4k, KVP significantly outperforms baselines. Furthermore, zero-shot tests on standard downstream tasks (e.g., LongBench, BOOLQ, ARC) indicate that KVP generalizes well beyond its training distribution and to longer context lengths. These results demonstrate that learning to predict future token utility is a powerful and scalable paradigm for adaptive KV cache management.

76. 【2602.10231】Blockwise Advantage Estimation for Multi-Objective RL with Verifiable Rewards

链接https://arxiv.org/abs/2602.10231

作者:Kirill Pavlenko,Alexander Golubev,Simon Karasik,Boris Yangel

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Group Relative Policy, Relative Policy Optimization, Group Relative, Policy Optimization, Relative Policy

备注

点击查看摘要

Abstract:Group Relative Policy Optimization (GRPO) assigns a single scalar advantage to all tokens in a completion. For structured generations with explicit segments and objectives, this couples unrelated reward signals across segments, leading to objective interference and misattributed credit. We propose Blockwise Advantage Estimation, a family of GRPO-compatible methods that assigns each objective its own advantage and applies it only to the tokens in the corresponding text block, reducing reliance on hand-designed scalar rewards and scaling naturally to additional objectives. A key challenge is estimating advantages for later blocks whose rewards are conditioned on sampled prefixes; standard unbiased approaches require expensive nested rollouts from intermediate states. Concretely, we introduce an Outcome-Conditioned Baseline that approximates intermediate state values using only within-group statistics by stratifying samples according to a prefix-derived intermediate outcome. On math tasks with uncertainty estimation, our method mitigates reward interference, is competitive with a state-of-the-art reward-designed approach, and preserves test-time gains from confidence-weighted ensembling. More broadly, it provides a modular recipe for optimizing sequential objectives in structured generations without additional rollouts.

77. 【2602.10229】Latent Thoughts Tuning: Bridging Context and Reasoning with Fused Information in Latent Tokens

链接https://arxiv.org/abs/2602.10229

作者:Weihao Liu,Dehai Min,Lu Cheng

类目:Computation and Language (cs.CL)

关键词:equips Large Language, Large Language Models, Large Language, equips Large, strong reasoning capabilities

备注: version 1.0

点击查看摘要

Abstract:While explicit Chain-of-Thought (CoT) equips Large Language Models (LLMs) with strong reasoning capabilities, it requires models to verbalize every intermediate step in text tokens, constraining the model thoughts to the discrete vocabulary space. Recently, reasoning in continuous latent space has emerged as a promising alternative, enabling more robust inference and flexible computation beyond discrete token constraints. However, current latent paradigms often suffer from feature collapse and instability, stemming from distribution mismatches when recurrently using hidden states as the input embeddings, or alignment issues when relying on assistant models. To address this, we propose Latent Thoughts Tuning (LT-Tuning), a framework that redefines how latent thoughts are constructed and deployed. Instead of relying solely on raw hidden states, our method introduces a Context-Prediction-Fusion mechanism that jointly leveraging contextual hidden states and predictive semantic guidance from the vocabulary embedding space. Combined with a progressive three-stage curriculum learning pipeline, LT-Tuning also enables dynamically switching between latent and explicit thinking modes. Experiments demonstrate that our method outperforms existing latent reasoning baselines, effectively mitigating feature collapse and achieving robust reasoning accuracy.

78. 【2602.10177】owards Autonomous Mathematics Research

链接https://arxiv.org/abs/2602.10177

作者:Tony Feng,Trieu H. Trinh,Garrett Bingham,Dawsen Hwang,Yuri Chervonyi,Junehyuk Jung,Joonkyung Lee,Carlo Pagano,Sang-hyun Kim,Federico Pasqualotto,Sergei Gukov,Jonathan N. Lee,Junsu Kim,Kaiying Hou,Golnaz Ghiasi,Yi Tay,YaGuang Li,Chenkai Kuang,Yuan Liu,Hanzhao(Maggie)Lin,Evan Zheran Liu,Nigamaa Nayakanti,Xiaomeng Yang,Heng-tze Cheng,Demis Hassabis,Koray Kavukcuoglu,Quoc V. Le,Thang Luong

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:International Mathematical Olympiad, Recent advances, International Mathematical, advances in foundational, foundational models

备注: Project page: [this https URL](https://github.com/google-deepmind/superhuman/tree/main/aletheia)

点击查看摘要

Abstract:Recent advances in foundational models have yielded reasoning systems capable of achieving a gold-medal standard at the International Mathematical Olympiad. The transition from competition-level problem-solving to professional research, however, requires navigating vast literature and constructing long-horizon proofs. In this work, we introduce Aletheia, a math research agent that iteratively generates, verifies, and revises solutions end-to-end in natural language. Specifically, Aletheia is powered by an advanced version of Gemini Deep Think for challenging reasoning problems, a novel inference-time scaling law that extends beyond Olympiad-level problems, and intensive tool use to navigate the complexities of mathematical research. We demonstrate the capability of Aletheia from Olympiad problems to PhD-level exercises and most notably, through several distinct milestones in AI-assisted mathematics research: (a) a research paper (Feng26) generated by AI without any human intervention in calculating certain structure constants in arithmetic geometry called eigenweights; (b) a research paper (LeeSeo26) demonstrating human-AI collaboration in proving bounds on systems of interacting particles called independent sets; and (c) an extensive semi-autonomous evaluation (Feng et al., 2026a) of 700 open problems on Bloom's Erdos Conjectures database, including autonomous solutions to four open questions. In order to help the public better understand the developments pertaining to AI and mathematics, we suggest codifying standard levels quantifying autonomy and novelty of AI-assisted results. We conclude with reflections on human-AI collaboration in mathematics.

79. 【2602.10161】Omni-Safety under Cross-Modality Conflict: Vulnerabilities, Dynamics Mechanisms and Efficient Alignment

链接https://arxiv.org/abs/2602.10161

作者:Kun Wang,Zherui Li,Zhenhong Zhou,Yitong Zhang,Yan Mi,Kun Yang,Yiming Zhang,Junhao Dong,Zhongxiang Sun,Qiankun Li,Yang Liu

类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large Language Models, Omni-modal Large Language, greatly expand LLMs', cross-modal safety risks, Language Models

备注

点击查看摘要

Abstract:Omni-modal Large Language Models (OLLMs) greatly expand LLMs' multimodal capabilities but also introduce cross-modal safety risks. However, a systematic understanding of vulnerabilities in omni-modal interactions remains lacking. To bridge this gap, we establish a modality-semantics decoupling principle and construct the AdvBench-Omni dataset, which reveals a significant vulnerability in OLLMs. Mechanistic analysis uncovers a Mid-layer Dissolution phenomenon driven by refusal vector magnitude shrinkage, alongside the existence of a modal-invariant pure refusal direction. Inspired by these insights, we extract a golden refusal vector using Singular Value Decomposition and propose OmniSteer, which utilizes lightweight adapters to modulate intervention intensity adaptively. Extensive experiments show that our method not only increases the Refusal Success Rate against harmful inputs from 69.9% to 91.2%, but also effectively preserves the general capabilities across all modalities. Our code is available at: this https URL.

80. 【2602.10146】VERA: Identifying and Leveraging Visual Evidence Retrieval Heads in Long-Context Understanding

链接https://arxiv.org/abs/2602.10146

作者:Rongcan Pei,Huan Li,Fang Guo,Qi Zhu

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:handling long context, complex reasoning tasks, Visual Evidence Retrieval, face significant challenges, Evidence Retrieval Augmentation

备注: 12 pages, 12 figures

点击查看摘要

Abstract:While Vision-Language Models (VLMs) have shown promise in textual understanding, they face significant challenges when handling long context and complex reasoning tasks. In this paper, we dissect the internal mechanisms governing long-context processing in VLMs to understand their performance bottlenecks. Through the lens of attention analysis, we identify specific Visual Evidence Retrieval (VER) Heads - a sparse, dynamic set of attention heads critical for locating visual cues during reasoning, distinct from static OCR heads. We demonstrate that these heads are causal to model performance; masking them leads to significant degradation. Leveraging this discovery, we propose VERA (Visual Evidence Retrieval Augmentation), a training-free framework that detects model uncertainty (i.e., entropy) to trigger the explicit verbalization of visual evidence attended by VER heads. Comprehensive experiments demonstrate that VERA significantly improves long-context understanding of open-source VLMs: it yields an average relative improvement of 21.3% on Qwen3-VL-8B-Instruct and 20.1% on GLM-4.1V-Thinking across five benchmarks.

81. 【2602.10138】Multimodal Information Fusion for Chart Understanding: A Survey of MLLMs -- Evolution, Limitations, and Cognitive Enhancement

链接https://arxiv.org/abs/2602.10138

作者:Zhihang Yi,Jian Zhao,Jiancheng Lv,Tao Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Multimodal Large Language, Large Language Models, requiring the seamless, extract meaning, seamless integration

备注

点击查看摘要

Abstract:Chart understanding is a quintessential information fusion task, requiring the seamless integration of graphical and textual data to extract meaning. The advent of Multimodal Large Language Models (MLLMs) has revolutionized this domain, yet the landscape of MLLM-based chart analysis remains fragmented and lacks systematic organization. This survey provides a comprehensive roadmap of this nascent frontier by structuring the domain's core components. We begin by analyzing the fundamental challenges of fusing visual and linguistic information in charts. We then categorize downstream tasks and datasets, introducing a novel taxonomy of canonical and non-canonical benchmarks to highlight the field's expanding scope. Subsequently, we present a comprehensive evolution of methodologies, tracing the progression from classic deep learning techniques to state-of-the-art MLLM paradigms that leverage sophisticated fusion strategies. By critically examining the limitations of current models, particularly their perceptual and reasoning deficits, we identify promising future directions, including advanced alignment techniques and reinforcement learning for cognitive enhancement. This survey aims to equip researchers and practitioners with a structured understanding of how MLLMs are transforming chart information fusion and to catalyze progress toward more robust and reliable systems.

82. 【2602.10134】Reverse-Engineering Model Editing on Language Models

链接https://arxiv.org/abs/2602.10134

作者:Zhiyu Sun,Minrui Luo,Yu Wang,Zhili Chen,Tianxing He

类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large language models, memorize sensitive information, inevitably memorize sensitive, Large language, inevitably memorize

备注

点击查看摘要

Abstract:Large language models (LLMs) are pretrained on corpora containing trillions of tokens and, therefore, inevitably memorize sensitive information. Locate-then-edit methods, as a mainstream paradigm of model editing, offer a promising solution by modifying model parameters without retraining. However, in this work, we reveal a critical vulnerability of this paradigm: the parameter updates inadvertently serve as a side channel, enabling attackers to recover the edited data. We propose a two-stage reverse-engineering attack named \textit{KSTER} (\textbf{K}ey\textbf{S}paceRecons\textbf{T}ruction-then-\textbf{E}ntropy\textbf{R}eduction) that leverages the low-rank structure of these updates. First, we theoretically show that the row space of the update matrix encodes a ``fingerprint" of the edited subjects, enabling accurate subject recovery via spectral analysis. Second, we introduce an entropy-based prompt recovery attack that reconstructs the semantic context of the edit. Extensive experiments on multiple LLMs demonstrate that our attacks can recover edited data with high success rates. Furthermore, we propose \textit{subspace camouflage}, a defense strategy that obfuscates the update fingerprint with semantic decoys. This approach effectively mitigates reconstruction risks without compromising editing utility. Our code is available at this https URL.

83. 【2602.10119】Large Language Models Predict Functional Outcomes after Acute Ischemic Stroke

链接https://arxiv.org/abs/2602.10119

作者:Anjali K. Kapoor(1),Anton Alyakin(1,2,3),Jin Vivian Lee(1,2,3),Eunice Yang(1,4),Annelene M. Schulze(1),Krithik Vishwanath(5),Jinseok Lee(2,6),Yindalon Aphinyanaphongs(7,8),Howard Riina(1,9),Jennifer A. Frontera(10),Eric Karl Oermann(1,2,8,11) ((1) Department of Neurosurgery, NYU Langone Health, New York, USA (2) Global AI Frontier Lab, New York University, Brooklyn, USA (3) Department of Neurosurgery, Washington University in Saint Louis, Saint Louis, USA (4) Columbia University Vagelos College of Physicians and Surgeons, New York, USA (5) Department of Aerospace Engineering and Engineering Mechanics, University of Texas at Austin, Austin, USA (6) Department of Biomedical Engineering, Kyung Hee University, Yongin, South Korea (7) Department of Population Health, NYU Langone Health, New York, USA (8) Division of Applied AI Technologies, NYU Langone Health, New York, USA (9) Department of Radiology, NYU Langone Health, New York, USA (10) Department of Neurology, NYU Langone Health, New York, USA (11) Center for Data Science, New York University, New York, USA)

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:acute ischemic stroke, modified Rankin Scale, Accurate prediction, inform clinical decision-making, resource allocation

备注

点击查看摘要

Abstract:Accurate prediction of functional outcomes after acute ischemic stroke can inform clinical decision-making and resource allocation. Prior work on modified Rankin Scale (mRS) prediction has relied primarily on structured variables (e.g., age, NIHSS) and conventional machine learning. The ability of large language models (LLMs) to infer future mRS scores directly from routine admission notes remains largely unexplored. We evaluated encoder (BERT, NYUTron) and generative (Llama-3.1-8B, MedGemma-4B) LLMs, in both frozen and fine-tuned settings, for discharge and 90-day mRS prediction using a large, real-world stroke registry. The discharge outcome dataset included 9,485 History and Physical notes and the 90-day outcome dataset included 1,898 notes from the NYU Langone Get With The Guidelines-Stroke registry (2016-2025). Data were temporally split with the most recent 12 months held out for testing. Performance was assessed using exact (7-class) mRS accuracy and binary functional outcome (mRS 0-2 vs. 3-6) accuracy and compared against established structured-data baselines incorporating NIHSS and age. Fine-tuned Llama achieved the highest performance, with 90-day exact mRS accuracy of 33.9% [95% CI, 27.9-39.9%] and binary accuracy of 76.3% [95% CI, 70.7-81.9%]. Discharge performance reached 42.0% [95% CI, 39.0-45.0%] exact accuracy and 75.0% [95% CI, 72.4-77.6%] binary accuracy. For 90-day prediction, Llama performed comparably to structured-data baselines. Fine-tuned LLMs can predict post-stroke functional outcomes from admission notes alone, achieving performance comparable to models requiring structured variable abstraction. Our findings support the development of text-based prognostic tools that integrate seamlessly into clinical workflows without manual data extraction.

84. 【2602.10118】Reviewing the Reviewer: Elevating Peer Review Quality through LLM-Guided Feedback

链接https://arxiv.org/abs/2602.10118

作者:Sukannya Purkayastha,Qile Wan,Anne Lauscher,Lizhen Qu,Iryna Gurevych

类目:Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:simple heuristics, lowered standards, Peer review, central to scientific, reliance on simple

备注: 39 pages, 22 figures, 29 tables

点击查看摘要

Abstract:Peer review is central to scientific quality, yet reliance on simple heuristics -- lazy thinking -- has lowered standards. Prior work treats lazy thinking detection as a single-label task, but review segments may exhibit multiple issues, including broader clarity problems, or specificity issues. Turning detection into actionable improvements requires guideline-aware feedback, which is currently missing. We introduce an LLM-driven framework that decomposes reviews into argumentative segments, identifies issues via a neurosymbolic module combining LLM features with traditional classifiers, and generates targeted feedback using issue-specific templates refined by a genetic algorithm. Experiments show our method outperforms zero-shot LLM baselines and improves review quality by up to 92.4\%. We also release LazyReviewPlus, a dataset of 1,309 sentences labeled for lazy thinking and specificity.

85. 【2602.10716】RE-LLM: Refining Empathetic Speech-LLM Responses by Integrating Emotion Nuance

链接https://arxiv.org/abs/2602.10716

作者:Jing-Han Chen,Bo-Hao Su,Ya-Tse Wu,Chi-Chun Lee

类目:Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)

关键词:generative AI advancing, interaction is essential, human-AI interaction, ESD, RE-LLM

备注: 5 pages, 1 figure, 2 tables. Accepted at IEEE ASRU 2025

点击查看摘要

Abstract:With generative AI advancing, empathy in human-AI interaction is essential. While prior work focuses on emotional reflection, emotional exploration, key to deeper engagement, remains overlooked. Existing LLMs rely on text which captures limited emotion nuances. To address this, we propose RE-LLM, a speech-LLM integrating dimensional emotion embeddings and auxiliary learning. Experiments show statistically significant gains in empathy metrics across three datasets. RE-LLM relatively improves the Emotional Reaction score by 14.79% and 6.76% compared to text-only and speech-LLM baselines on ESD. Notably, it raises the Exploration score by 35.42% and 3.91% on IEMOCAP, 139.28% and 9.83% on ESD, and 60.95% and 22.64% on MSP-PODCAST. It also boosts unweighted accuracy by 5.4% on IEMOCAP, 2.3% on ESD, and 6.9% on MSP-PODCAST in speech emotion recognition. These results highlight the enriched emotional understanding and improved empathetic response generation of RE-LLM.

86. 【2602.09067】AntigenLM: Structure-Aware DNA Language Modeling for Influenza

链接https://arxiv.org/abs/2602.09067

作者:Yue Pei,Xuebin Chi,Yu Kang

类目:Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)

关键词:advanced sequence analysis, unclear reasons, DNA language, lag behind task-specific, task-specific methods

备注: Accepted by ICLR 2026

点击查看摘要

Abstract:Language models have advanced sequence analysis, yet DNA foundation models often lag behind task-specific methods for unclear reasons. We present AntigenLM, a generative DNA language model pretrained on influenza genomes with intact, aligned functional units. This structure-aware pretraining enables AntigenLM to capture evolutionary constraints and generalize across tasks. Fine-tuned on time-series hemagglutinin (HA) and neuraminidase (NA) sequences, AntigenLM accurately forecasts future antigenic variants across regions and subtypes, including those unseen during training, outperforming phylogenetic and evolution-based models. It also achieves near-perfect subtype classification. Ablation studies show that disrupting genomic structure through fragmentation or shuffling severely degrades performance, revealing the importance of preserving functional-unit integrity in DNA language modeling. AntigenLM thus provides both a powerful framework for antigen evolution prediction and a general principle for building biologically grounded DNA foundation models.

信息检索

1. 【2602.11151】Diffusion-Pretrained Dense and Contextual Embeddings

链接https://arxiv.org/abs/2602.11151

作者:Sedigheh Eslami,Maksim Gaiduk,Markus Krimmel,Louis Milliken,Bo Wang,Denis Bykov

类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:employ multi-stage contrastive, multi-stage contrastive learning, language model backbone, diffusion-pretrained language model, introduce pplx-embed

备注

点击查看摘要

Abstract:In this report, we introduce pplx-embed, a family of multilingual embedding models that employ multi-stage contrastive learning on a diffusion-pretrained language model backbone for web-scale retrieval. By leveraging bidirectional attention through diffusion-based pretraining, our models capture comprehensive bidirectional context within passages, enabling the use of mean pooling and a late chunking strategy to better preserve global context across long documents. We release two model types: pplx-embed-v1 for standard retrieval, and pplx-embed-context-v1 for contextualized embeddings that incorporate global document context into passage representations. pplx-embed-v1 achieves competitive performance on the MTEB(Multilingual, v2), MTEB(Code), MIRACL, BERGEN, and ToolRet retrieval benchmarks, while pplx-embed-context-v1 sets new records on the ConTEB benchmark. Beyond public benchmarks, pplx-embed-v1 demonstrates strong performance on our internal evaluation suite, which focuses on real-world, large-scale search scenarios over tens of millions of documents. These results validate the models' effectiveness in production environments where retrieval quality and efficiency are critical at scale.

2. 【2602.11062】MoToRec: Sparse-Regularized Multimodal Tokenization for Cold-Start Recommendation

链接https://arxiv.org/abs/2602.11062

作者:Jialin Liu,Zhaorui Zhang,Ray C.C. Cheung

类目:Machine Learning (cs.LG); Information Retrieval (cs.IR)

关键词:significantly impair performance, complex user-item interactions, revolutionized recommender systems, effectively modeling complex, modeling complex user-item

备注: Accepted to AAAI 2026 (Main Track)

点击查看摘要

Abstract:Graph neural networks (GNNs) have revolutionized recommender systems by effectively modeling complex user-item interactions, yet data sparsity and the item cold-start problem significantly impair performance, particularly for new items with limited or no interaction history. While multimodal content offers a promising solution, existing methods result in suboptimal representations for new items due to noise and entanglement in sparse data. To address this, we transform multimodal recommendation into discrete semantic tokenization. We present Sparse-Regularized Multimodal Tokenization for Cold-Start Recommendation (MoToRec), a framework centered on a sparsely-regularized Residual Quantized Variational Autoencoder (RQ-VAE) that generates a compositional semantic code of discrete, interpretable tokens, promoting disentangled representations. MoToRec's architecture is enhanced by three synergistic components: (1) a sparsely-regularized RQ-VAE that promotes disentangled representations, (2) a novel adaptive rarity amplification that promotes prioritized learning for cold-start items, and (3) a hierarchical multi-source graph encoder for robust signal fusion with collaborative signals. Extensive experiments on three large-scale datasets demonstrate MoToRec's superiority over state-of-the-art methods in both overall and cold-start scenarios. Our work validates that discrete tokenization provides an effective and scalable alternative for mitigating the long-standing cold-start challenge.

3. 【2602.11052】GraphSeek: Next-Generation Graph Analytics with LLMs

链接https://arxiv.org/abs/2602.11052

作者:Maciej Besta,Łukasz Jarmocik,Orest Hrycyna,Shachar Klaiman,Konrad Mączka,Robert Gerstenberger,Jürgen Müller,Piotr Nyczyk,Hubert Niewiadomski,Torsten Hoefler

类目:Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

关键词:deep expertise, foundational across domains, domains but remain, remain hard, graph

备注

点击查看摘要

Abstract:Graphs are foundational across domains but remain hard to use without deep expertise. LLMs promise accessible natural language (NL) graph analytics, yet they fail to process industry-scale property graphs effectively and efficiently: such datasets are large, highly heterogeneous, structurally complex, and evolve dynamically. To address this, we devise a novel abstraction for complex multi-query analytics over such graphs. Its key idea is to replace brittle generation of graph queries directly from NL with planning over a Semantic Catalog that describes both the graph schema and the graph operations. Concretely, this induces a clean separation between a Semantic Plane for LLM planning and broader reasoning, and an Execution Plane for deterministic, database-grade query execution over the full dataset and tool implementations. This design yields substantial gains in both token efficiency and task effectiveness even with small-context LLMs. We use this abstraction as the basis of the first LLM-enhanced graph analytics framework called GraphSeek. GraphSeek achieves substantially higher success rates (e.g., 86% over enhanced LangChain) and points toward the next generation of affordable and accessible graph analytics that unify LLM reasoning with database-grade execution over large and complex property graphs.

4. 【2602.10833】raining-Induced Bias Toward LLM-Generated Content in Dense Retrieval

链接https://arxiv.org/abs/2602.10833

作者:William Xion,Wolfgang Nejdl

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:information retrieval applications, acquiring relevant context, language processing tasks, retrieval applications, open-domain natural language

备注: Accepted at ECIR 2026

点击查看摘要

Abstract:Dense retrieval is a promising approach for acquiring relevant context or world knowledge in open-domain natural language processing tasks and is now widely used in information retrieval applications. However, recent reports claim a broad preference for text generated by large language models (LLMs). This bias is called "source bias", and it has been hypothesized that lower perplexity contributes to this effect. In this study, we revisit this claim by conducting a controlled evaluation to trace the emergence of such preferences across training stages and data sources. Using parallel human- and LLM-generated counterparts of the SciFact and Natural Questions (NQ320K) datasets, we compare unsupervised checkpoints with models fine-tuned using in-domain human text, in-domain LLM-generated text, and MS MARCO. Our results show the following: 1) Unsupervised retrievers do not exhibit a uniform pro-LLM preference. The direction and magnitude depend on the dataset. 2) Across the settings tested, supervised fine-tuning on MS MARCO consistently shifts the rankings toward LLM-generated text. 3) In-domain fine-tuning produces dataset-specific and inconsistent shifts in preference. 4) Fine-tuning on LLM-generated corpora induces a pronounced pro-LLM bias. Finally, a retriever-centric perplexity probe involving the reattachment of a language modeling head to the fine-tuned dense retriever encoder indicates agreement with relevance near chance, thereby weakening the explanatory power of perplexity. Our study demonstrates that source bias is a training-induced phenomenon rather than an inherent property of dense retrievers.

5. 【2602.10811】EST: Towards Efficient Scaling Laws in Click-Through Rate Prediction via Unified Modeling

链接https://arxiv.org/abs/2602.10811

作者:Mingyang Liu,Yong Bai,Zhangming Chan,Sishuo Chen,Xiang-Rong Sheng,Han Zhu,Jian Xu,Xinyang Chen

类目:Information Retrieval (cs.IR)

关键词:industrial Click-Through Rate, recently attracted significant, attracted significant research, Click-Through Rate, significant research attention

备注

点击查看摘要

Abstract:Efficiently scaling industrial Click-Through Rate (CTR) prediction has recently attracted significant research attention. Existing approaches typically employ early aggregation of user behaviors to maintain efficiency. However, such non-unified or partially unified modeling creates an information bottleneck by discarding fine-grained, token-level signals essential for unlocking scaling gains. In this work, we revisit the fundamental distinctions between CTR prediction and Large Language Models (LLMs), identifying two critical properties: the asymmetry in information density between behavioral and non-behavioral features, and the modality-specific priors of content-rich signals. Accordingly, we propose the Efficiently Scalable Transformer (EST), which achieves fully unified modeling by processing all raw inputs in a single sequence without lossy aggregation. EST integrates two modules: Lightweight Cross-Attention (LCA), which prunes redundant self-interactions to focus on high-impact cross-feature dependencies, and Content Sparse Attention (CSA), which utilizes content similarity to dynamically select high-signal behaviors. Extensive experiments show that EST exhibits a stable and efficient power-law scaling relationship, enabling predictable performance gains with model scale. Deployed on Taobao's display advertising platform, EST significantly outperforms production baselines, delivering a 3.27\% RPM (Revenue Per Mile) increase and a 1.22\% CTR lift, establishing a practical pathway for scalable industrial CTR prediction models.

6. 【2602.10809】DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories

链接https://arxiv.org/abs/2602.10809

作者:Chenlong Deng,Mengjie Deng,Junjie Wu,Dun Zeng,Teng Wang,Qingsong Xie,Jiadeng Huang,Shengjie Ma,Changwang Zhang,Zhaoxiang Wang,Jun Wang,Yutao Zhu,Zhicheng Dou

类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

关键词:Existing multimodal retrieval, Existing multimodal, measured in isolation, excel at semantic, semantic matching

备注: 17 pages, 5 figures

点击查看摘要

Abstract:Existing multimodal retrieval systems excel at semantic matching but implicitly assume that query-image relevance can be measured in isolation. This paradigm overlooks the rich dependencies inherent in realistic visual streams, where information is distributed across temporal sequences rather than confined to single snapshots. To bridge this gap, we introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task. Models must plan and perform multi-step reasoning over raw visual histories to locate targets based on implicit contextual cues. We construct DISBench, a challenging benchmark built on interconnected visual data. To address the scalability challenge of creating context-dependent queries, we propose a human-model collaborative pipeline that employs vision-language models to mine latent spatiotemporal associations, effectively offloading intensive context discovery before human verification. Furthermore, we build a robust baseline using a modular agent framework equipped with fine-grained tools and a dual-memory system for long-horizon navigation. Extensive experiments demonstrate that DISBench poses significant challenges to state-of-the-art models, highlighting the necessity of incorporating agentic reasoning into next-generation retrieval systems.

7. 【2602.10787】VulReaD: Knowledge-Graph-guided Software Vulnerability Reasoning and Detection

链接https://arxiv.org/abs/2602.10787

作者:Samal Mukhtar,Yinghua Yao,Zhu Sun,Mustafa Mustafa,Yew Soon Ong,Youcheng Sun

类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)

关键词:Common Weakness Enumeration, Software vulnerability detection, Software vulnerability, modern systems, critical challenge

备注: 22 pages, 3 figures

点击查看摘要

Abstract:Software vulnerability detection (SVD) is a critical challenge in modern systems. Large language models (LLMs) offer natural-language explanations alongside predictions, but most work focuses on binary evaluation, and explanations often lack semantic consistency with Common Weakness Enumeration (CWE) categories. We propose VulReaD, a knowledge-graph-guided approach for vulnerability reasoning and detection that moves beyond binary classification toward CWE-level reasoning. VulReaD leverages a security knowledge graph (KG) as a semantic backbone and uses a strong teacher LLM to generate CWE-consistent contrastive reasoning supervision, enabling student model training without manual annotations. Students are fine-tuned with Odds Ratio Preference Optimization (ORPO) to encourage taxonomy-aligned reasoning while suppressing unsupported explanations. Across three real-world datasets, VulReaD improves binary F1 by 8-10% and multi-class classification by 30% Macro-F1 and 18% Micro-F1 compared to state-of-the-art baselines. Results show that LLMs outperform deep learning baselines in binary detection and that KG-guided reasoning enhances CWE coverage and interpretability.

8. 【2602.10739】Equity by Design: Fairness-Driven Recommendation in Heterogeneous Two-Sided Markets

链接https://arxiv.org/abs/2602.10739

作者:Dominykas Seputis,Rajeev Verma,Alexander Timans

类目:Computer Science and Game Theory (cs.GT); Information Retrieval (cs.IR)

关键词:consumers seek relevance, heterogeneity in incentives, standard practice, marketplaces embody heterogeneity, embody heterogeneity

备注

点击查看摘要

Abstract:Two-sided marketplaces embody heterogeneity in incentives: producers seek exposure while consumers seek relevance, and balancing these competing objectives through constrained optimization is now a standard practice. Yet real platforms face finer-grained complexity: consumers differ in preferences and engagement patterns, producers vary in catalog value and capacity, and business objectives impose additional constraints beyond raw relevance. We formalize two-sided fairness under these realistic conditions, extending prior work from soft single-item allocations to discrete multi-item recommendations. We introduce Conditional Value-at-Risk (CVaR) as a consumer-side objective that compresses group-level utility disparities, and integrate business constraints directly into the optimization. Our experiments reveal that the "free fairness" regime, where producer constraints impose no consumer cost, disappears in multi item settings. Strikingly, moderate fairness constraints can improve business metrics by diversifying exposure away from saturated producers. Scalable solvers match exact solutions at a fraction of the runtime, making fairness-aware allocation practical at scale. These findings reframe fairness not as a tax on platform efficiency but as a lever for sustainable marketplace health.

9. 【2602.10633】A Cognitive Distribution and Behavior-Consistent Framework for Black-Box Attacks on Recommender Systems

链接https://arxiv.org/abs/2602.10633

作者:Hongyue Zhan,Mingming Li,Dongqin Liu,Hui Wang,Yaning Zhang,Xi Zhou,Honglei Lv,Jiao Dai,Jizhong Han

类目:Information Retrieval (cs.IR)

关键词:raise security concerns, interfaces raise security, black-box interfaces raise, subsequent adversarial manipulation, security concerns

备注

点击查看摘要

Abstract:With the growing deployment of sequential recommender systems in e-commerce and other fields, their black-box interfaces raise security concerns: models are vulnerable to extraction and subsequent adversarial manipulation. Existing black-box extraction attacks primarily rely on hard labels or pairwise learning, often ignoring the importance of ranking positions, which results in incomplete knowledge transfer. Moreover, adversarial sequences generated via pure gradient methods lack semantic consistency with real user behavior, making them easily detectable. To overcome these limitations, this paper proposes a dual-enhanced attack framework. First, drawing on primacy effects and position bias, we introduce a cognitive distribution-driven extraction mechanism that maps discrete rankings into continuous value distributions with position-aware decay, thereby advancing from order alignment to cognitive distribution alignment. Second, we design a behavior-aware noisy item generation strategy that jointly optimizes collaborative signals and gradient signals. This ensures both semantic coherence and statistical stealth while effectively promoting target item rankings. Extensive experiments on multiple datasets demonstrate that our approach significantly outperforms existing methods in both attack success rate and evasion rate, validating the value of integrating cognitive modeling and behavioral consistency for secure recommender systems.

10. 【2602.10606】S-GRec: Personalized Semantic-Aware Generative Recommendation with Asymmetric Advantage

链接https://arxiv.org/abs/2602.10606

作者:Jie Jiang,Hongbo Tang,Wenjie Wu,Yangru Huang,Zhenmao Li,Qian Li,Changping Wang,Jun Zhang,Huan Yu

类目:Information Retrieval (cs.IR)

关键词:Generative recommendation models, underlying user intent, models sequence generation, Large Language Models, recommendation models sequence

备注

点击查看摘要

Abstract:Generative recommendation models sequence generation to produce items end-to-end, but training from behavioral logs often provides weak supervision on underlying user intent. Although Large Language Models (LLMs) offer rich semantic priors that could supply such supervision, direct adoption in industrial recommendation is hindered by two obstacles: semantic signals can conflict with platform business objectives, and LLM inference is prohibitively expensive at scale. This paper presents S-GRec, a semantic-aware framework that decouples an online lightweight generator from an offline LLM-based semantic judge for train-time supervision. S-GRec introduces a two-stage Personalized Semantic Judge (PSJ) that produces interpretable aspect evidence and learns user-conditional aggregation from pairwise feedback, yielding stable semantic rewards. To prevent semantic supervision from deviating from business goals, Asymmetric Advantage Policy Optimization (A2PO) anchors optimization on business rewards (e.g., eCPM) and injects semantic advantages only when they are consistent. Extensive experiments on public benchmarks and a large-scale production system validate both effectiveness and scalability, including statistically significant gains in CTR and a 1.19\% lift in GMV in online A/B tests, without requiring real-time LLM inference.

11. 【2602.10577】Campaign-2-PT-RAG: LLM-Guided Semantic Product Type Attribution for Scalable Campaign Ranking

链接https://arxiv.org/abs/2602.10577

作者:Yiming Che,Mansi Mane,Keerthi Gopalakrishnan,Parisa Kaghazgaran,Murali Mohana Krishna Dandu,Archana Venkatachalapathy,Sinduja Subramaniam,Yokila Arora,Evren Korpeoglu,Sushant Kumar,Kannan Achan

类目:Information Retrieval (cs.IR)

关键词:require large-scale training, models require large-scale, users purchased due, require large-scale, purchased due

备注

点击查看摘要

Abstract:E-commerce campaign ranking models require large-scale training labels indicating which users purchased due to campaign influence. However, generating these labels is challenging because campaigns use creative, thematic language that does not directly map to product purchases. Without clear product-level attribution, supervised learning for campaign optimization remains limited. We present \textbf{Campaign-2-PT-RAG}, a scalable label generation framework that constructs user--campaign purchase labels by inferring which product types (PTs) each campaign promotes. The framework first interprets campaign content using large language models (LLMs) to capture implicit intent, then retrieves candidate PTs through semantic search over the platform taxonomy. A structured LLM-based classifier evaluates each PT's relevance, producing a campaign-specific product coverage set. User purchases matching these PTs generate positive training labels for downstream ranking models. This approach reframes the ambiguous attribution problem into a tractable semantic alignment task, enabling scalable and consistent supervision for downstream tasks such as campaign ranking optimization in production e-commerce environments. Experiments on internal and synthetic datasets, validated against expert-annotated campaign--PT mappings, show that our LLM-assisted approach generates high-quality labels with 78--90% precision while maintaining over 99% recall.

12. 【2602.10493】Boundary-Aware Multi-Behavior Dynamic Graph Transformer for Sequential Recommendation

链接https://arxiv.org/abs/2602.10493

作者:Jingsong Su,Xuetao Ma,Mingming Li,Qiannan Zhu,Yu Guo

类目:Information Retrieval (cs.IR)

关键词:contemporary recommender systems, user, recommender systems, user preferences, landscape of contemporary

备注

点击查看摘要

Abstract:In the landscape of contemporary recommender systems, user-item interactions are inherently dynamic and sequential, often characterized by various behaviors. Prior research has explored the modeling of user preferences through sequential interactions and the user-item interaction graph, utilizing advanced techniques such as graph neural networks and transformer-based architectures. However, these methods typically fall short in simultaneously accounting for the dynamic nature of graph topologies and the sequential pattern of interactions in user preference models. Moreover, they often fail to adequately capture the multiple user behavior boundaries during model optimization. To tackle these challenges, we introduce a boundary-aware Multi-Behavioral Dynamic Graph Transformer (MB-DGT) model that dynamically refines the graph structure to reflect the evolving patterns of user behaviors and interactions. Our model involves a transformer-based dynamic graph aggregator for user preference modeling, which assimilates the changing graph structure and the sequence of user behaviors. This integration yields a more comprehensive and dynamic representation of user preferences. For model optimization, we implement a user-specific multi-behavior loss function that delineates the interest boundaries among different behaviors, thereby enriching the personalized learning of user preferences. Comprehensive experiments across three datasets indicate that our model consistently delivers remarkable recommendation performance.

13. 【2602.10490】ChainRec: An Agentic Recommender Learning to Route Tool Chains for Diverse and Evolving Interests

链接https://arxiv.org/abs/2602.10490

作者:Fuchun Li,Qian Li,Xingyu Gao,Bocheng Pan,Yang Wu,Jun Zhang,Huan Yu,Jie Jiang,Jinsheng Xiao,Hailong Shi

类目:Information Retrieval (cs.IR)

关键词:Large language models, Large language, motivating recent interest, language models, motivating recent

备注

点击查看摘要

Abstract:Large language models (LLMs) are increasingly integrated into recommender systems, motivating recent interest in agentic and reasoning-based recommendation. However, most existing approaches still rely on fixed workflows, applying the same reasoning procedure across diverse recommendation scenarios. In practice, user contexts vary substantially-for example, in cold-start settings or during interest shifts, so an agent should adaptively decide what evidence to gather next rather than following a scripted process. To address this, we propose ChainRec, an agentic recommender that uses a planner to dynamically select reasoning tools. ChainRec builds a standardized Tool Agent Library from expert trajectories. It then trains a planner using supervised fine-tuning and preference optimization to dynamically select tools, decide their order, and determine when to stop. Experiments on AgentRecBench across Amazon, Yelp, and Goodreads show that ChainRec consistently improves Avg HR@{1,3,5} over strong baselines, with especially notable gains in cold-start and evolving-interest scenarios. Ablation studies further validate the importance of tool standardization and preference-optimized planning.

14. 【2602.10455】Compute Only Once: UG-Separation for Efficient Large Recommendation Models

链接https://arxiv.org/abs/2602.10455

作者:Hui Lu,Zheng Chai,Shipeng Bai,Hao Zhang,Zhifang Fan,Kunmin Bai,Yingwen Wu,Bingzheng Wei,Xiang Sun,Ziyan Gong,Tianyi Liu,Hua Chen,Deping Xie,Zhongkai Chen,Zhiliang Guo,Qiwei Chen,Yuchao Zheng

类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:capture complex feature, Driven by scaling, recommender systems increasingly, complex feature interactions, systems increasingly rely

备注: Large Recommender Model, Industrial Recommenders, Scaling Law

点击查看摘要

Abstract:Driven by scaling laws, recommender systems increasingly rely on large-scale models to capture complex feature interactions and user behaviors, but this trend also leads to prohibitive training and inference costs. While long-sequence models(e.g., LONGER) can reuse user-side computation through KV caching, such reuse is difficult in dense feature interaction architectures(e.g., RankMixer), where user and group (candidate item) features are deeply entangled across layers. In this work, we propose User-Group Separation (UG-Sep), a novel framework that enables reusable user-side computation in dense interaction models for the first time. UG-Sep introduces a masking mechanism that explicitly disentangles user-side and item-side information flows within token-mixing layers, ensuring that a subset of tokens to preserve purely user-side representations across layers. This design enables corresponding token computations to be reused across multiple samples, significantly reducing redundant inference cost. To compensate for potential expressiveness loss induced by masking, we further propose an Information Compensation strategy that adaptively reconstructs suppressed user-item interactions. Moreover, as UG-Sep substantially reduces user-side FLOPs and exposes memory-bound components, we incorporate W8A16 (8-bit weight, 16-bit activation) weight-only quantization to alleviate memory bandwidth bottlenecks and achieve additional acceleration. We conduct extensive offline evaluations and large-scale online A/B experiments at ByteDance, demonstrating that UG-Sep reduces inference latency by up to 20 percent without degrading online user experience or commercial metrics across multiple business scenarios, including feed recommendation and advertising systems.

15. 【2602.10445】End-to-End Semantic ID Generation for Generative Advertisement Recommendation

链接https://arxiv.org/abs/2602.10445

作者:Jie Jiang,Xinxun Zhang,Enming Zhang,Yuling Xiong,Jun Zhang,Jingwen Wang,Huan Yu,Yuxiang Wang,Hao Wang,Xiao Yan,Jiawei Jiang

类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:next-token prediction, excelled by framing, framing recommendation, SIDs, Residual Quantization

备注

点击查看摘要

Abstract:Generative Recommendation (GR) has excelled by framing recommendation as next-token prediction. This paradigm relies on Semantic IDs (SIDs) to tokenize large-scale items into discrete sequences. Existing GR approaches predominantly generate SIDs via Residual Quantization (RQ), where items are encoded into embeddings and then quantized to discrete SIDs. However, this paradigm suffers from inherent limitations: 1) Objective misalignment and semantic degradation stemming from the two-stage compression; 2) Error accumulation inherent in the structure of RQ. To address these limitations, we propose UniSID, a Unified SID generation framework for generative advertisement recommendation. Specifically, we jointly optimize embeddings and SIDs in an end-to-end manner from raw advertising data, enabling semantic information to flow directly into the SID space and thus addressing the inherent limitations of the two-stage cascading compression paradigm. To capture fine-grained semantics, a multi-granularity contrastive learning strategy is introduced to align distinct items across SID levels. Finally, a summary-based ad reconstruction mechanism is proposed to encourage SIDs to capture high-level semantic information that is not explicitly present in advertising contexts. Experiments demonstrate that UniSID consistently outperforms state-of-the-art SID generation methods, yielding up to a 4.62% improvement in Hit Rate metrics across downstream advertising scenarios compared to the strongest baseline.

16. 【2602.10444】Chamfer-Linkage for Hierarchical Agglomerative Clustering

链接https://arxiv.org/abs/2602.10444

作者:Kishen N Gowda,Willem Fletcher,MohammadHossein Bateni,Laxman Dhulipala,D Ellis Hershkowitz,Rajesh Jayaram,Jakub Łącki

类目:Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)

关键词:Hierarchical Agglomerative Clustering, Agglomerative Clustering, Hierarchical Agglomerative, widely-used clustering method, clustering method based

备注

点击查看摘要

Abstract:Hierarchical Agglomerative Clustering (HAC) is a widely-used clustering method based on repeatedly merging the closest pair of clusters, where inter-cluster distances are determined by a linkage function. Unlike many clustering methods, HAC does not optimize a single explicit global objective; clustering quality is therefore primarily evaluated empirically, and the choice of linkage function plays a crucial role in practice. However, popular classical linkages, such as single-linkage, average-linkage and Ward's method show high variability across real-world datasets and do not consistently produce high-quality clusterings in practice. In this paper, we propose \emph{Chamfer-linkage}, a novel linkage function that measures the distance between clusters using the Chamfer distance, a popular notion of distance between point-clouds in machine learning and computer vision. We argue that Chamfer-linkage satisfies desirable concept representation properties that other popular measures struggle to satisfy. Theoretically, we show that Chamfer-linkage HAC can be implemented in $O(n^2)$ time, matching the efficiency of classical linkage functions. Experimentally, we find that Chamfer-linkage consistently yields higher-quality clusterings than classical linkages such as average-linkage and Ward's method across a diverse collection of datasets. Our results establish Chamfer-linkage as a practical drop-in replacement for classical linkage functions, broadening the toolkit for hierarchical clustering in both theory and practice.

Subjects:

Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)

Cite as:
arXiv:2602.10444 [cs.LG]

(or
arXiv:2602.10444v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2602.10444

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
17. 【2602.10411】GeoGR: A Generative Retrieval Framework for Spatio-Temporal Aware POI Recommendation

链接https://arxiv.org/abs/2602.10411

作者:Fangye Wang,Haowen Lin,Yifang Yuan,Siyuan Wang,Xiaojiang Zhou,Song Yang,Pengjie Wang

类目:Information Retrieval (cs.IR)

关键词:diverse lifestyle scenarios, large-scale navigation platforms, POI recommendation task, POI recommendation, location-based services

备注

点击查看摘要

Abstract:Next Point-of-Interest (POI) prediction is a fundamental task in location-based services, especially critical for large-scale navigation platforms like AMAP that serve billions of users across diverse lifestyle scenarios. While recent POI recommendation approaches based on SIDs have achieved promising, they struggle in complex, sparse real-world environments due to two key limitations: (1) inadequate modeling of high-quality SIDs that capture cross-category spatio-temporal collaborative relationships, and (2) poor alignment between large language models (LLMs) and the POI recommendation task. To this end, we propose GeoGR, a geographic generative recommendation framework tailored for navigation-based LBS like AMAP, which perceives users' contextual state changes and enables intent-aware POI recommendation. GeoGR features a two-stage design: (i) a geo-aware SID tokenization pipeline that explicitly learns spatio-temporal collaborative semantic representations via geographically constrained co-visited POI pairs, contrastive learning, and iterative refinement; and (ii) a multi-stage LLM training strategy that aligns non-native SID tokens through multiple template-based continued pre-training(CPT) and enables autoregressive POI generation via supervised fine-tuning(SFT). Extensive experiments on multiple real-world datasets demonstrate GeoGR's superiority over state-of-the-art baselines. Moreover, deployment on the AMAP platform, serving millions of users with multiple online metrics boosting, confirms its practical effectiveness and scalability in production.

18. 【2602.10321】Single-Turn LLM Reformulation Powered Multi-Stage Hybrid Re-Ranking for Tip-of-the-Tongue Known-Item Retrieval

链接https://arxiv.org/abs/2602.10321

作者:Debayan Mukhopadhyay,Utshab Kumar Ghosh,Shubham Chatterjee

类目:Information Retrieval (cs.IR)

关键词:Retrieving known items, vague descriptions, remains a significant, significant challenge, items from vague

备注

点击查看摘要

Abstract:Retrieving known items from vague descriptions, Tip-of-the-Tongue (ToT) retrieval, remains a significant challenge. We propose using a single call to a generic 8B-parameter LLM for query reformulation, bridging the gap between ill-formed ToT queries and specific information needs. This method is particularly effective where standard Pseudo-Relevance Feedback fails due to poor initial recall. Crucially, our LLM is not fine-tuned for ToT or specific domains, demonstrating that gains stem from our prompting strategy rather than model specialization. Rewritten queries feed a multi-stage pipeline: sparse retrieval (BM25), dense/late-interaction reranking (Contriever, E5-large-v2, ColBERTv2), monoT5 cross-encoding, and list-wise reranking (Qwen 2.5 72B). Experiments on 2025 TREC-ToT datasets show that while raw queries yield poor performance, our lightweight pre-retrieval transformation improves Recall by 20.61%. Subsequent reranking improves nDCG@10 by 33.88%, MRR by 29.92%, and MAP@10 by 29.98%, offering a cost-effective intervention that unlocks the potential of downstream rankers. Code and data: this https URL

19. 【2602.10295】ECHO: An Open Research Platform for Evaluation of Chat, Human Behavior, and Outcomes

链接https://arxiv.org/abs/2602.10295

作者:Jiqun Liu,Nischal Dinesh,Ran Yu

类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:Web search engines, open research platform, research platform designed, systems and Web, Human behavior

备注

点击查看摘要

Abstract:ECHO (Evaluation of Chat, Human behavior, and Outcomes) is an open research platform designed to support reproducible, mixed-method studies of human interaction with both conversational AI systems and Web search engines. It enables researchers from varying disciplines to orchestrate end-to-end experimental workflows that integrate consent and background surveys, chat-based and search-based information-seeking sessions, writing or judgment tasks, and pre- and post-task evaluations within a unified, low-coding-load framework. ECHO logs fine-grained interaction traces and participant responses, and exports structured datasets for downstream analysis. By supporting both chat and search alongside flexible evaluation instruments, ECHO lowers technical barriers for studying learning, decision making, and user experience across different information access paradigms, empowering researchers from information retrieval, HCI, and the social sciences to conduct scalable and reproducible human-centered AI evaluations.

20. 【2602.10271】MLDocRAG: Multimodal Long-Context Document Retrieval Augmented Generation

链接https://arxiv.org/abs/2602.10271

作者:Yongyue Zhang,Yaxiong Wu

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)

关键词:localize relevant information, aggregate dispersed evidence, multimodal long-context, comprise multimodal chunks, multimodal

备注: 15 pages

点击查看摘要

Abstract:Understanding multimodal long-context documents that comprise multimodal chunks such as paragraphs, figures, and tables is challenging due to (1) cross-modal heterogeneity to localize relevant information across modalities, (2) cross-page reasoning to aggregate dispersed evidence across pages. To address these challenges, we are motivated to adopt a query-centric formulation that projects cross-modal and cross-page information into a unified query representation space, with queries acting as abstract semantic surrogates for heterogeneous multimodal content. In this paper, we propose a Multimodal Long-Context Document Retrieval Augmented Generation (MLDocRAG) framework that leverages a Multimodal Chunk-Query Graph (MCQG) to organize multimodal document content around semantically rich, answerable queries. MCQG is constructed via a multimodal document expansion process that generates fine-grained queries from heterogeneous document chunks and links them to their corresponding content across modalities and pages. This graph-based structure enables selective, query-centric retrieval and structured evidence aggregation, thereby enhancing grounding and coherence in multimodal long-context question answering. Experiments on datasets MMLongBench-Doc and LongDocURL demonstrate that MLDocRAG consistently improves retrieval quality and answer accuracy, demonstrating its effectiveness for multimodal long-context understanding.

21. 【2602.10258】JAG: Joint Attribute Graphs for Filtered Nearest Neighbor Search

链接https://arxiv.org/abs/2602.10258

作者:Haike Xu,Guy Blelloch,Laxman Dhulipala,Lars Gottesbüren,Rajesh Jayaram,Jakub Łącki

类目:Information Retrieval (cs.IR); Databases (cs.DB)

关键词:fundamental task, task in modern, highly sensitive, filter types, vector search systems

备注

点击查看摘要

Abstract:Despite filtered nearest neighbor search being a fundamental task in modern vector search systems, the performance of existing algorithms is highly sensitive to query selectivity and filter type. In particular, existing solutions excel either at specific filter categories (e.g., label equality) or within narrow selectivity bands (e.g., pre-filtering for low selectivity) and are therefore a poor fit for practical deployments that demand generalization to new filter types and unknown query selectivities. In this paper, we propose JAG (Joint Attribute Graphs), a graph-based algorithm designed to deliver robust performance across the entire selectivity spectrum and support diverse filter types. Our key innovation is the introduction of attribute and filter distances, which transform binary filter constraints into continuous navigational guidance. By constructing a proximity graph that jointly optimizes for both vector similarity and attribute proximity, JAG prevents navigational dead-ends and allows JAG to consistently outperform prior graph-based filtered nearest neighbor search methods. Our experimental results across five datasets and four filter types (Label, Range, Subset, Boolean) demonstrate that JAG significantly outperforms existing state-of-the-art baselines in both throughput and recall robustness.

22. 【2602.10145】Silence Routing: When Not Speaking Improves Collective Judgment

链接https://arxiv.org/abs/2602.10145

作者:Itsuki Fujisaki,Kunhao Yang

类目:Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:wisdom of crowds, shown to operate, factual judgments, defined relative, individual preferences

备注: 7pages, 2 figures

点击查看摘要

Abstract:The wisdom of crowds has been shown to operate not only for factual judgments but also in matters of taste, where accuracy is defined relative to an individual's preferences. However, it remains unclear how different types of social signals should be selectively used in such domains. Focusing on a music preference dataset in which contributors provide both personal evaluations (Own) and estimates of population-level preferences (Estimated), we propose a routing framework for collective intelligence in taste. The framework specifies when contributors should speak, what they should report, and when silence is preferable. Using simulation-based aggregation, we show that prediction accuracy improves over an all-own baseline across a broad region of the parameter space, conditional on items where routing applies. Importantly, these gains arise only when silence is allowed, enabling second-order signals to function effectively. The results demonstrate that collective intelligence in matters of taste depends on principled signal routing rather than simple averaging.

计算机视觉

1. 【2602.11154】SurfPhase: 3D Interfacial Dynamics in Two-Phase Flows from Sparse Videos

链接https://arxiv.org/abs/2602.11154

作者:Yue Gao,Hong-Xing Yu,Sanghyeon Chang,Qianxi Fu,Bo Zhu,Yoonjin Won,Juan Carlos Niebles,Jiajun Wu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:flows govern momentum, two-phase flows govern, govern momentum, mass transfer, measure experimentally

备注: The first two authors contributed equally. Project website: [this https URL](https://yuegao.me/SurfPhase)

点击查看摘要

Abstract:Interfacial dynamics in two-phase flows govern momentum, heat, and mass transfer, yet remain difficult to measure experimentally. Classical techniques face intrinsic limitations near moving interfaces, while existing neural rendering methods target single-phase flows with diffuse boundaries and cannot handle sharp, deformable liquid-vapor interfaces. We propose SurfPhase, a novel model for reconstructing 3D interfacial dynamics from sparse camera views. Our approach integrates dynamic Gaussian surfels with a signed distance function formulation for geometric consistency, and leverages a video diffusion model to synthesize novel-view videos to refine reconstruction from sparse observations. We evaluate on a new dataset of high-speed pool boiling videos, demonstrating high-quality view synthesis and velocity estimation from only two camera views. Project website: this https URL.

2. 【2602.11146】Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

链接https://arxiv.org/abs/2602.11146

作者:Gongye Liu,Bo Yang,Yida Zhi,Zhizhou Zhong,Lei Ke,Didan Deng,Han Gao,Yongxiang Huang,Kaihao Zhang,Hongbo Fu,Wenhan Luo

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:flow-matching models relies, computationally efficient, Preference optimization, flow-matching models, models relies

备注: Code: [this https URL](https://github.com/HKUST-C4G/diffusion-rm)

点击查看摘要

Abstract:Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. In preference optimization, we demonstrate that DiNa-LRM improves preference optimization dynamics, enabling faster and more resource-efficient model alignment.

3. 【2602.11144】GENIUS: Generative Fluid Intelligence Evaluation Suite

链接https://arxiv.org/abs/2602.11144

作者:Ruichuan An,Sihan Yang,Ziyu Guo,Wei Dai,Zijun Shen,Haodong Li,Renrui Zhang,Xinyu Wei,Guopeng Li,Wenshan Wu,Wentao Zhang

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Unified Multimodal Models, Unified Multimodal, shown remarkable progress, textit, textbf

备注

点击查看摘要

Abstract:Unified Multimodal Models (UMMs) have shown remarkable progress in visual generation. Yet, existing benchmarks predominantly assess $\textit{Crystallized Intelligence}$, which relies on recalling accumulated knowledge and learned schemas. This focus overlooks $\textit{Generative Fluid Intelligence (GFI)}$: the capacity to induce patterns, reason through constraints, and adapt to novel scenarios on the fly. To rigorously assess this capability, we introduce $\textbf{GENIUS}$ ($\textbf{GEN}$ Fluid $\textbf{I}$ntelligence Eval$\textbf{U}$ation $\textbf{S}$uite). We formalize $\textit{GFI}$ as a synthesis of three primitives. These include $\textit{Inducing Implicit Patterns}$ (e.g., inferring personalized visual preferences), $\textit{Executing Ad-hoc Constraints}$ (e.g., visualizing abstract metaphors), and $\textit{Adapting to Contextual Knowledge}$ (e.g., simulating counter-intuitive physics). Collectively, these primitives challenge models to solve problems grounded entirely in the immediate context. Our systematic evaluation of 12 representative models reveals significant performance deficits in these tasks. Crucially, our diagnostic analysis disentangles these failure modes. It demonstrates that deficits stem from limited context comprehension rather than insufficient intrinsic generative capability. To bridge this gap, we propose a training-free attention intervention strategy. Ultimately, $\textbf{GENIUS}$ establishes a rigorous standard for $\textit{GFI}$, guiding the field beyond knowledge utilization toward dynamic, general-purpose reasoning. Our dataset and code will be released at: $\href{this https URL}{this https URL}$.

4. 【2602.11130】From Circuits to Dynamics: Understanding and Stabilizing Failure in 3D Diffusion Transformers

链接https://arxiv.org/abs/2602.11130

作者:Maximilian Plattner,Fabian Paischer,Johannes Brandstetter,Arturs Berzins

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:Reliable surface completion, applications spanning content, spanning content creation, point clouds underpins, Reliable surface

备注

点击查看摘要

Abstract:Reliable surface completion from sparse point clouds underpins many applications spanning content creation and robotics. While 3D diffusion transformers attain state-of-the-art results on this task, we uncover that they exhibit a catastrophic mode of failure: arbitrarily small on-surface perturbations to the input point cloud can fracture the output into multiple disconnected pieces -- a phenomenon we call Meltdown. Using activation-patching from mechanistic interpretability, we localize Meltdown to a single early denoising cross-attention activation. We find that the singular-value spectrum of this activation provides a scalar proxy: its spectral entropy rises when fragmentation occurs and returns to baseline when patched. Interpreted through diffusion dynamics, we show that this proxy tracks a symmetry-breaking bifurcation of the reverse process. Guided by this insight, we introduce PowerRemap, a test-time control that stabilizes sparse point-cloud conditioning. We demonstrate that Meltdown persists across state-of-the-art architectures (WaLa, Make-a-Shape), datasets (GSO, SimJEB) and denoising strategies (DDPM, DDIM), and that PowerRemap effectively counters this failure with stabilization rates of up to 98.3%. Overall, this work is a case study on how diffusion model behavior can be understood and guided based on mechanistic analysis, linking a circuit-level cross-attention mechanism to diffusion-dynamics accounts of trajectory bifurcations.

5. 【2602.11124】PhyCritic: Multimodal Critic Models for Physical AI

链接https://arxiv.org/abs/2602.11124

作者:Tianyi Xiong,Shihao Wang,Guilin Liu,Yi Dong,Ming Li,Heng Huang,Jan Kautz,Zhiding Yu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:providing pairwise preferences, assessing model-generated responses, numerical scores, preference alignment, providing pairwise

备注

点击查看摘要

Abstract:With the rapid development of large multimodal models, reliable judge and critic models have become essential for open-ended evaluation and preference alignment, providing pairwise preferences, numerical scores, and explanatory justifications for assessing model-generated responses. However, existing critics are primarily trained in general visual domains such as captioning or image question answering, leaving physical AI tasks involving perception, causal reasoning, and planning largely underexplored. We introduce PhyCritic, a multimodal critic model optimized for physical AI through a two-stage RLVR pipeline: a physical skill warmup stage that enhances physically oriented perception and reasoning, followed by self-referential critic finetuning, where the critic generates its own prediction as an internal reference before judging candidate responses, improving judgment stability and physical correctness. Across both physical and general-purpose multimodal judge benchmarks, PhyCritic achieves strong performance gains over open-source baselines and, when applied as a policy model, further improves perception and reasoning in physically grounded tasks.

6. 【2602.11117】HairWeaver: Few-Shot Photorealistic Hair Motion Synthesis with Sim-to-Real Guided Video Diffusion

链接https://arxiv.org/abs/2602.11117

作者:Di Chang,Ji Hou,Aljaz Bozic,Assaf Neuberger,Felix Juefei-Xu,Olivier Maury,Gene Wei-Chin Lin,Tuur Stuyck,Doug Roble,Mohammad Soleymani,Stephane Grabli

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:single human image, diffusion-based pipeline, pipeline that animates, animates a single, expressive hair dynamics

备注: Website: [this https URL](https://boese0601.github.io/hairweaver/)

点击查看摘要

Abstract:We present HairWeaver, a diffusion-based pipeline that animates a single human image with realistic and expressive hair dynamics. While existing methods successfully control body pose, they lack specific control over hair, and as a result, fail to capture the intricate hair motions, resulting in stiff and unrealistic animations. HairWeaver overcomes this limitation using two specialized modules: a Motion-Context-LoRA to integrate motion conditions and a Sim2Real-Domain-LoRA to preserve the subject's photoreal appearance across different data domains. These lightweight components are designed to guide a video diffusion backbone while maintaining its core generative capabilities. By training on a specialized dataset of dynamic human motion generated from a CG simulator, HairWeaver affords fine control over hair motion and ultimately learns to produce highly realistic hair that responds naturally to movement. Comprehensive evaluations demonstrate that our approach sets a new state of the art, producing lifelike human hair animations with dynamic details.

7. 【2602.11105】FastFlow: Accelerating The Generative Flow Matching Models with Bandit Inference

链接https://arxiv.org/abs/2602.11105

作者:Divya Jyoti Bajpai,Dhruv Bhardwaj,Soumya Roy,Tejas Duseja,Harsh Agarwal,Aashay Sandansing,Manjesh Kumar Hanawal

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Flow-matching models deliver, inherent sequential denoising, sequential denoising process, denoising process renders, Flow-matching models

备注: Accepted at International Conference on Learning Representations (ICLR) 2026

点击查看摘要

Abstract:Flow-matching models deliver state-of-the-art fidelity in image and video generation, but the inherent sequential denoising process renders them slower. Existing acceleration methods like distillation, trajectory truncation, and consistency approaches are static, require retraining, and often fail to generalize across tasks. We propose FastFlow, a plug-and-play adaptive inference framework that accelerates generation in flow matching models. FastFlow identifies denoising steps that produce only minor adjustments to the denoising path and approximates them without using the full neural network models used for velocity predictions. The approximation utilizes finite-difference velocity estimates from prior predictions to efficiently extrapolate future states, enabling faster advancements along the denoising path at zero compute cost. This enables skipping computation at intermediary steps. We model the decision of how many steps to safely skip before requiring a full model computation as a multi-armed bandit problem. The bandit learns the optimal skips to balance speed with performance. FastFlow integrates seamlessly with existing pipelines and generalizes across image generation, video generation, and editing tasks. Experiments demonstrate a speedup of over 2.6x while maintaining high-quality outputs. The source code for this work can be found at this https URL.

8. 【2602.11086】First International StepUP Competition for Biometric Footstep Recognition: Methods, Results and Remaining Challenges

链接https://arxiv.org/abs/2602.11086

作者:Robyn Larracy,Eve MacDonald,Angkoon Phinyomark,Saeid Rezaei,Mahdi Laghaei,Ali Hajighasem,Aaron Tabor,Erik Scheme

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:unique pressure patterns, person unique pressure, Biometric footstep recognition, security and safety, Biometric footstep

备注: to be published in 2025 IEEE International Joint Conference on Biometrics (IJCB)

点击查看摘要

Abstract:Biometric footstep recognition, based on a person's unique pressure patterns under their feet during walking, is an emerging field with growing applications in security and safety. However, progress in this area has been limited by the lack of large, diverse datasets necessary to address critical challenges such as generalization to new users and robustness to shifts in factors like footwear or walking speed. The recent release of the UNB StepUP-P150 dataset, the largest and most comprehensive collection of high-resolution footstep pressure recordings to date, opens new opportunities for addressing these challenges through deep learning. To mark this milestone, the First International StepUP Competition for Biometric Footstep Recognition was launched. Competitors were tasked with developing robust recognition models using the StepUP-P150 dataset that were then evaluated on a separate, dedicated test set designed to assess verification performance under challenging variations, given limited and relatively homogeneous reference data. The competition attracted global participation, with 23 registered teams from academia and industry. The top-performing team, Saeid_UCC, achieved the best equal error rate (EER) of 10.77% using a generative reward machine (GRM) optimization strategy. Overall, the competition showcased strong solutions, but persistent challenges in generalizing to unfamiliar footwear highlight a critical area for future work.

9. 【2602.11073】Chatting with Images for Introspective Visual Thinking

链接https://arxiv.org/abs/2602.11073

作者:Junfei Wu,Jian Guan,Qiang Liu,Shu Wu,Liang Wang,Wei Wu,Tienie Tan

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Current large vision-language, Current large, single-pass visual encoding, fine-grained visual information, large vision-language models

备注

点击查看摘要

Abstract:Current large vision-language models (LVLMs) typically rely on text-only reasoning based on a single-pass visual encoding, which often leads to loss of fine-grained visual information. Recently the proposal of ''thinking with images'' attempts to alleviate this limitation by manipulating images via external tools or code; however, the resulting visual states are often insufficiently grounded in linguistic semantics, impairing effective cross-modal alignment - particularly when visual semantics or geometric relationships must be reasoned over across distant regions or multiple images. To address these challenges, we propose ''chatting with images'', a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions, enabling tighter coupling between linguistic reasoning and visual state updates. We instantiate this paradigm in ViLaVT, a novel LVLM equipped with a dynamic vision encoder explicitly designed for such interactive visual reasoning, and trained it with a two-stage curriculum combining supervised fine-tuning and reinforcement learning to promote effective reasoning behaviors. Extensive experiments across eight benchmarks demonstrate that ViLaVT achieves strong and consistent improvements, with particularly pronounced gains on complex multi-image and video-based spatial reasoning tasks.

10. 【2602.11066】PuriLight: A Lightweight Shuffle and Purification Framework for Monocular Depth Estimation

链接https://arxiv.org/abs/2602.11066

作者:Yujie Chen,Li Zhang,Xiaomeng Chu,Tian Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:monocular depth estimation, self-supervised monocular depth, detail preservation, self-supervised depth estimation, depth estimation

备注: 8 pages, 6figures, accepted by European Conference on Artificial Intelligence (ECAI2025)

点击查看摘要

Abstract:We propose PuriLight, a lightweight and efficient framework for self-supervised monocular depth estimation, to address the dual challenges of computational efficiency and detail preservation. While recent advances in self-supervised depth estimation have reduced reliance on ground truth supervision, existing approaches remain constrained by either bulky architectures compromising practicality or lightweight models sacrificing structural precision. These dual limitations underscore the critical need to develop lightweight yet structurally precise architectures. Our framework addresses these limitations through a three-stage architecture incorporating three novel modules: the Shuffle-Dilation Convolution (SDC) module for local feature extraction, the Rotation-Adaptive Kernel Attention (RAKA) module for hierarchical feature enhancement, and the Deep Frequency Signal Purification (DFSP) module for global feature purification. Through effective collaboration, these modules enable PuriLight to achieve both lightweight and accurate feature extraction and processing. Extensive experiments demonstrate that PuriLight achieves state-of-the-art performance with minimal training parameters while maintaining exceptional computational efficiency. Codes will be available at this https URL.

11. 【2602.11024】Chain-of-Look Spatial Reasoning for Dense Surgical Instrument Counting

链接https://arxiv.org/abs/2602.11024

作者:Rishikesh Bhyri,Brian R Quaranto,Philip J Seger,Kaity Tung,Brendan Fox,Gene Yang,Steven D. Schwaitzberg,Junsong Yuan,Nan Xi,Peter C W Kim

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Operating Rooms, ensuring patient safety, Accurate counting, safety during surgery, critical prerequisite

备注: Accepted to WACV 2026. This version includes additional authors who contributed during the rebuttal phase

点击查看摘要

Abstract:Accurate counting of surgical instruments in Operating Rooms (OR) is a critical prerequisite for ensuring patient safety during surgery. Despite recent progress of large visual-language models and agentic AI, accurately counting such instruments remains highly challenging, particularly in dense scenarios where instruments are tightly clustered. To address this problem, we introduce Chain-of-Look, a novel visual reasoning framework that mimics the sequential human counting process by enforcing a structured visual chain, rather than relying on classic object detection which is unordered. This visual chain guides the model to count along a coherent spatial trajectory, improving accuracy in complex scenes. To further enforce the physical plausibility of the visual chain, we introduce the neighboring loss function, which explicitly models the spatial constraints inherent to densely packed surgical instruments. We also present SurgCount-HD, a new dataset comprising 1,464 high-density surgical instrument images. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches for counting (e.g., CountGD, REC) as well as Multimodality Large Language Models (e.g., Qwen, ChatGPT) in the challenging task of dense surgical instrument counting.

12. 【2602.11021】ContactGaussian-WM: Learning Physics-Grounded World Model from Videos

链接https://arxiv.org/abs/2602.11021

作者:Meizhong Wang,Wanxin Jin,Kun Cao,Lihua Xie,Yiguang Hong

类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:http URL address, http URL simulations, http URL framework, http URL, Developing world models

备注

点击查看摘要

Abstract:Developing world models that understand complex physical interactions is essential for advancing robotic planning and this http URL, existing methods often struggle to accurately model the environment under conditions of data scarcity and complex contact-rich dynamic this http URL address these challenges, we propose ContactGaussian-WM, a differentiable physics-grounded rigid-body world model capable of learning intricate physical laws directly from sparse and contact-rich video this http URL framework consists of two core components: (1) a unified Gaussian representation for both visual appearance and collision geometry, and (2) an end-to-end differentiable learning framework that differentiates through a closed-form physics engine to infer physical properties from sparse visual this http URL simulations and real-world evaluations demonstrate that ContactGaussian-WM outperforms state-of-the-art methods in learning complex scenarios, exhibiting robust generalization this http URL, we showcase the practical utility of our framework in downstream applications, including data synthesis and real-time MPC.

13. 【2602.11007】LaSSM: Efficient Semantic-Spatial Query Decoding via Local Aggregation and State Space Models for 3D Instance Segmentation

链接https://arxiv.org/abs/2602.11007

作者:Lei Yao,Yi Wang,Yawen Cui,Moyun Liu,Lap-Pui Chau

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:attained notable performance, point clouds, attained notable, Query-based, scene instance segmentation

备注: Accepted at IEEE-TCSVT

点击查看摘要

Abstract:Query-based 3D scene instance segmentation from point clouds has attained notable performance. However, existing methods suffer from the query initialization dilemma due to the sparse nature of point clouds and rely on computationally intensive attention mechanisms in query decoders. We accordingly introduce LaSSM, prioritizing simplicity and efficiency while maintaining competitive performance. Specifically, we propose a hierarchical semantic-spatial query initializer to derive the query set from superpoints by considering both semantic cues and spatial distribution, achieving comprehensive scene coverage and accelerated convergence. We further present a coordinate-guided state space model (SSM) decoder that progressively refines queries. The novel decoder features a local aggregation scheme that restricts the model to focus on geometrically coherent regions and a spatial dual-path SSM block to capture underlying dependencies within the query set by integrating associated coordinates information. Our design enables efficient instance prediction, avoiding the incorporation of noisy information and reducing redundant computation. LaSSM ranks first place on the latest ScanNet++ V2 leaderboard, outperforming the previous best method by 2.5% mAP with only 1/3 FLOPs, demonstrating its superiority in challenging large-scale scene instance segmentation. LaSSM also achieves competitive performance on ScanNet, ScanNet200, S3DIS and ScanNet++ V1 benchmarks with less computational cost. Extensive ablation studies and qualitative results validate the effectiveness of our design. The code and weights are available at this https URL.

14. 【2602.11005】Interpretable Vision Transformers in Monocular Depth Estimation via SVDA

链接https://arxiv.org/abs/2602.11005

作者:Vasileios Arampatzakis,George Pavlidis,Nikolaos Mitianoudis,Nikos Papamarkos

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:drive modern Transformer, modern Transformer architectures, applications in robotics, autonomous driving, Transformer architectures remain

备注: 8 pages, 2 figures, submitted to CVPR Conference 2026

点击查看摘要

Abstract:Monocular depth estimation is a central problem in computer vision with applications in robotics, AR, and autonomous driving, yet the self-attention mechanisms that drive modern Transformer architectures remain opaque. We introduce SVD-Inspired Attention (SVDA) into the Dense Prediction Transformer (DPT), providing the first spectrally structured formulation of attention for dense prediction tasks. SVDA decouples directional alignment from spectral modulation by embedding a learnable diagonal matrix into normalized query-key interactions, enabling attention maps that are intrinsically interpretable rather than post-hoc approximations. Experiments on KITTI and NYU-v2 show that SVDA preserves or slightly improves predictive accuracy while adding only minor computational overhead. More importantly, SVDA unlocks six spectral indicators that quantify entropy, rank, sparsity, alignment, selectivity, and robustness. These reveal consistent cross-dataset and depth-wise patterns in how attention organizes during training, insights that remain inaccessible in standard Transformers. By shifting the role of attention from opaque mechanism to quantifiable descriptor, SVDA redefines interpretability in monocular depth estimation and opens a principled avenue toward transparent dense prediction models.

15. 【2602.11004】Enhancing Predictability of Multi-Tenant DNN Inference for Autonomous Vehicles' Perception

链接https://arxiv.org/abs/2602.11004

作者:Liangkai Liu,Kang G. Shin,Jinkyu Lee,Chengmo Yang,Weisong Shi

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)

关键词:deep neural networks, make maneuver decisions, Autonomous vehicles, critical frames, frames

备注: 13 pages, 12 figures

点击查看摘要

Abstract:Autonomous vehicles (AVs) rely on sensors and deep neural networks (DNNs) to perceive their surrounding environment and make maneuver decisions in real time. However, achieving real-time DNN inference in the AV's perception pipeline is challenging due to the large gap between the computation requirement and the AV's limited resources. Most, if not all, of existing studies focus on optimizing the DNN inference time to achieve faster perception by compressing the DNN model with pruning and quantization. In contrast, we present a Predictable Perception system with DNNs (PP-DNN) that reduce the amount of image data to be processed while maintaining the same level of accuracy for multi-tenant DNNs by dynamically selecting critical frames and regions of interest (ROIs). PP-DNN is based on our key insight that critical frames and ROIs for AVs vary with the AV's surrounding environment. However, it is challenging to identify and use critical frames and ROIs in multi-tenant DNNs for predictable inference. Given image-frame streams, PP-DNN leverages an ROI generator to identify critical frames and ROIs based on the similarities of consecutive frames and traffic scenarios. PP-DNN then leverages a FLOPs predictor to predict multiply-accumulate operations (MACs) from the dynamic critical frames and ROIs. The ROI scheduler coordinates the processing of critical frames and ROIs with multiple DNN models. Finally, we design a detection predictor for the perception of non-critical frames. We have implemented PP-DNN in an ROS-based AV pipeline and evaluated it with the BDD100K and the nuScenes dataset. PP-DNN is observed to significantly enhance perception predictability, increasing the number of fusion frames by up to 7.3x, reducing the fusion delay by 2.6x and fusion-delay variations by 2.3x, improving detection completeness by 75.4% and the cost-effectiveness by up to 98% over the baseline.

16. 【2602.10994】Interpretable Vision Transformers in Image Classification via SVDA

链接https://arxiv.org/abs/2602.10994

作者:Vasileios Arampatzakis,George Pavlidis,Nikolaos Mitianoudis,Nikos Papamarkos

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Vision Transformers, non-structured behaviors, performance in image, exhibit dense, remain opaque

备注: 10 pages, 4 figures, submitted to IEEE Access

点击查看摘要

Abstract:Vision Transformers (ViTs) have achieved state-of-the-art performance in image classification, yet their attention mechanisms often remain opaque and exhibit dense, non-structured behaviors. In this work, we adapt our previously proposed SVD-Inspired Attention (SVDA) mechanism to the ViT architecture, introducing a geometrically grounded formulation that enhances interpretability, sparsity, and spectral structure. We apply the use of interpretability indicators -- originally proposed with SVDA -- to monitor attention dynamics during training and assess structural properties of the learned representations. Experimental evaluations on four widely used benchmarks -- CIFAR-10, FashionMNIST, CIFAR-100, and ImageNet-100 -- demonstrate that SVDA consistently yields more interpretable attention patterns without sacrificing classification accuracy. While the current framework offers descriptive insights rather than prescriptive guidance, our results establish SVDA as a comprehensive and informative tool for analyzing and developing structured attention models in computer vision. This work lays the foundation for future advances in explainable AI, spectral diagnostics, and attention-based model compression.

17. 【2602.10985】DFIC: Towards a balanced facial image dataset for automatic ICAO compliance verification

链接https://arxiv.org/abs/2602.10985

作者:Nuno Gonçalves,Diogo Nunes,Carla Guerra,João Marcos

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:machine-readable travel documents, current manual inspection, ICAO compliance verification, reliable identity verification, ICAO compliance

备注

点击查看摘要

Abstract:Ensuring compliance with ISO/IEC and ICAO standards for facial images in machine-readable travel documents (MRTDs) is essential for reliable identity verification, but current manual inspection methods are inefficient in high-demand environments. This paper introduces the DFIC dataset, a novel comprehensive facial image dataset comprising around 58,000 annotated images and 2706 videos of more than 1000 subjects, that cover a broad range of non-compliant conditions, in addition to compliant portraits. Our dataset provides a more balanced demographic distribution than the existing public datasets, with one partition that is nearly uniformly distributed, facilitating the development of automated ICAO compliance verification methods. Using DFIC, we fine-tuned a novel method that heavily relies on spatial attention mechanisms for the automatic validation of ICAO compliance requirements, and we have compared it with the state-of-the-art aimed at ICAO compliance verification, demonstrating improved results. DFIC dataset is now made public (this https URL) for the training and validation of new models, offering an unprecedented diversity of faces, that will improve both robustness and adaptability to the intrinsically diverse combinations of faces and props that can be presented to the validation system. These results emphasize the potential of DFIC to enhance automated ICAO compliance methods but it can also be used in many other applications that aim to improve the security, privacy, and fairness of facial recognition systems.

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

ACMclasses:
I.4

Cite as:
arXiv:2602.10985 [cs.CV]

(or
arXiv:2602.10985v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2602.10985

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
18. 【2602.10978】VFGS-Net: Frequency-Guided State-Space Learning for Topology-Preserving Retinal Vessel Segmentation

链接https://arxiv.org/abs/2602.10978

作者:Ruiqi Song,Lei Liu,Ya-Nan Zhang,Chao Wang,Xiaoning Li,Nan Mu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Accurate retinal vessel, Accurate retinal, diabetic retinopathy, critical prerequisite, prerequisite for quantitative

备注

点击查看摘要

Abstract:Accurate retinal vessel segmentation is a critical prerequisite for quantitative analysis of retinal images and computer-aided diagnosis of vascular diseases such as diabetic retinopathy. However, the elongated morphology, wide scale variation, and low contrast of retinal vessels pose significant challenges for existing methods, making it difficult to simultaneously preserve fine capillaries and maintain global topological continuity. To address these challenges, we propose the Vessel-aware Frequency-domain and Global Spatial modeling Network (VFGS-Net), an end-to-end segmentation framework that seamlessly integrates frequency-aware feature enhancement, dual-path convolutional representation learning, and bidirectional asymmetric spatial state-space modeling within a unified architecture. Specifically, VFGS-Net employs a dual-path feature convolution module to jointly capture fine-grained local textures and multi-scale contextual semantics. A novel vessel-aware frequency-domain channel attention mechanism is introduced to adaptively reweight spectral components, thereby enhancing vessel-relevant responses in high-level features. Furthermore, at the network bottleneck, we propose a bidirectional asymmetric Mamba2-based spatial modeling block to efficiently capture long-range spatial dependencies and strengthen the global continuity of vascular structures. Extensive experiments on four publicly available retinal vessel datasets demonstrate that VFGS-Net achieves competitive or superior performance compared to state-of-the-art methods. Notably, our model consistently improves segmentation accuracy for fine vessels, complex branching patterns, and low-contrast regions, highlighting its robustness and clinical potential.

19. 【2602.10967】Healthy Harvests: A Comparative Look at Guava Disease Classification Using InceptionV3

链接https://arxiv.org/abs/2602.10967

作者:Samanta Ghosh,Shaila Afroz Anika,Umma Habiba Ahmed,B. M. Shahria Alam,Mohammad Tahmid Noor,Nishat Tasnim Niloy

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:fruit, Guava fruits, model, fruit crop yield, images

备注: 6 pages, 13 figures, his is the author's accepted manuscript of a paper accepted for publication in the Proceedings of the 16th International IEEE Conference on Computing, Communication and Networking Technologies (ICCCNT 2025). The final published version will be available via IEEE Xplore

点击查看摘要

Abstract:Guava fruits often suffer from many diseases. This can harm fruit quality and fruit crop yield. Early identification is important for minimizing damage and ensuring fruit health. This study focuses on 3 different categories for classifying diseases. These are Anthracnose, Fruit flies, and Healthy fruit. The data set used in this study is collected from Mendeley Data. This dataset contains 473 original images of Guava. These images vary in size and format. The original dataset was resized to 256x256 pixels with RGB color mode for better consistency. After this, the Data augmentation process is applied to improve the dataset by generating variations of the original images. The augmented dataset consists of 3784 images using advanced preprocessing techniques. Two deep learning models were implemented to classify the images. The InceptionV3 model is well known for its advanced framework. These apply multiple convolutional filters for obtaining different features effectively. On the other hand, the ResNet50 model helps to train deeper networks by using residual learning. The InceptionV3 model achieved the impressive accuracy of 98.15%, and ResNet50got 94.46% accuracy. Data mixing methods such as CutMix and MixUp were applied to enhance the model's robustness. The confusion matrix was used to evaluate the overall model performance of both InceptionV3 and Resnet50. Additionally, SHAP analysis is used to improve interpretability, which helps to find the significant parts of the image for the model prediction. This study purposes to highlight how advanced models enhan

20. 【2602.10943】owards Learning a Generalizable 3D Scene Representation from 2D Observations

链接https://arxiv.org/abs/2602.10943

作者:Martin Gromniak,Jan-Gerrit Habekost,Sebastian Kamp,Sven Magg,Stefan Wermter

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:Generalizable Neural Radiance, Neural Radiance Field, Radiance Field approach, Generalizable Neural, Neural Radiance

备注: Paper accepted at ESANN 2026

点击查看摘要

Abstract:We introduce a Generalizable Neural Radiance Field approach for predicting 3D workspace occupancy from egocentric robot observations. Unlike prior methods operating in camera-centric coordinates, our model constructs occupancy representations in a global workspace frame, making it directly applicable to robotic manipulation. The model integrates flexible source views and generalizes to unseen object arrangements without scene-specific finetuning. We demonstrate the approach on a humanoid robot and evaluate predicted geometry against 3D sensor ground truth. Trained on 40 real scenes, our model achieves 26mm reconstruction error, including occluded regions, validating its ability to infer complete 3D occupancy beyond traditional stereo vision methods.

21. 【2602.10940】FastUSP: A Multi-Level Collaborative Acceleration Framework for Distributed Diffusion Model Inference

链接https://arxiv.org/abs/2602.10940

作者:Guandong Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:require multi-GPU parallelism, Large-scale diffusion models, Unified Sequence Parallelism, Large-scale diffusion, Stable Diffusion

备注

点击查看摘要

Abstract:Large-scale diffusion models such as FLUX (12B parameters) and Stable Diffusion 3 (8B parameters) require multi-GPU parallelism for efficient inference. Unified Sequence Parallelism (USP), which combines Ulysses and Ring attention mechanisms, has emerged as the state-of-the-art approach for distributed attention computation. However, existing USP implementations suffer from significant inefficiencies including excessive kernel launch overhead and suboptimal computation-communication scheduling. In this paper, we propose \textbf{FastUSP}, a multi-level optimization framework that integrates compile-level optimization (graph compilation with CUDA Graphs and computation-communication reordering), communication-level optimization (FP8 quantized collective communication), and operator-level optimization (pipelined Ring attention with double buffering). We evaluate FastUSP on FLUX (12B) and Qwen-Image models across 2, 4, and 8 NVIDIA RTX 5090 GPUs. On FLUX, FastUSP achieves consistent \textbf{1.12$\times$--1.16$\times$} end-to-end speedup over baseline USP, with compile-level optimization contributing the dominant improvement. On Qwen-Image, FastUSP achieves \textbf{1.09$\times$} speedup on 2 GPUs; on 4--8 GPUs, we identify a PyTorch Inductor compatibility limitation with Ring attention that prevents compile optimization, while baseline USP scales to 1.30$\times$--1.46$\times$ of 2-GPU performance. We further provide a detailed analysis of the performance characteristics of distributed diffusion inference, revealing that kernel launch overhead -- rather than communication latency -- is the primary bottleneck on modern high-bandwidth GPU interconnects.

22. 【2602.10884】ResWorld: Temporal Residual World Model for End-to-End Autonomous Driving

链接https://arxiv.org/abs/2602.10884

作者:Jinqing Zhang,Zehua Fu,Zelin Xu,Wenying Dai,Qingjie Liu,Yunhong Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:autonomous driving frameworks, Residual World Model, future BEV features, comprehensive understanding capabilities, autonomous driving

备注: ICLR 2026

点击查看摘要

Abstract:The comprehensive understanding capabilities of world models for driving scenarios have significantly improved the planning accuracy of end-to-end autonomous driving frameworks. However, the redundant modeling of static regions and the lack of deep interaction with trajectories hinder world models from exerting their full effectiveness. In this paper, we propose Temporal Residual World Model (TR-World), which focuses on dynamic object modeling. By calculating the temporal residuals of scene representations, the information of dynamic objects can be extracted without relying on detection and tracking. TR-World takes only temporal residuals as input, thus predicting the future spatial distribution of dynamic objects more precisely. By combining the prediction with the static object information contained in the current BEV features, accurate future BEV features can be obtained. Furthermore, we propose Future-Guided Trajectory Refinement (FGTR) module, which conducts interaction between prior trajectories (predicted from the current scene representation) and the future BEV features. This module can not only utilize future road conditions to refine trajectories, but also provides sparse spatial-temporal supervision on future BEV features to prevent world model collapse. Comprehensive experiments conducted on the nuScenes and NAVSIM datasets demonstrate that our method, namely ResWorld, achieves state-of-the-art planning performance. The code is available at this https URL.

23. 【2602.10880】Chart Specification: Structural Representations for Incentivizing VLM Reasoning in Chart-to-Code Generation

链接https://arxiv.org/abs/2602.10880

作者:Minggui He,Mingchen Dai,Jian Zhang,Yilun Liu,Shimin Tao,Pufan Zeng,Osamu Yoshie,Yuya Ieiri

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:fidelity remains challenging, Vision-Language Models, achieving structural fidelity, structural fidelity remains, remains challenging

备注: under review

点击查看摘要

Abstract:Vision-Language Models (VLMs) have shown promise in generating plotting code from chart images, yet achieving structural fidelity remains challenging. Existing approaches largely rely on supervised fine-tuning, encouraging surface-level token imitation rather than faithful modeling of underlying chart structure, which often leads to hallucinated or semantically inconsistent outputs. We propose Chart Specification, a structured intermediate representation that shifts training from text imitation to semantically grounded supervision. Chart Specification filters syntactic noise to construct a structurally balanced training set and supports a Spec-Align Reward that provides fine-grained, verifiable feedback on structural correctness, enabling reinforcement learning to enforce consistent plotting logic. Experiments on three public benchmarks show that our method consistently outperforms prior approaches. With only 3K training samples, we achieve strong data efficiency, surpassing leading baselines by up to 61.7% on complex benchmarks, and scaling to 4K samples establishes new state-of-the-art results across all evaluated metrics. Overall, our results demonstrate that precise structural supervision offers an efficient pathway to high-fidelity chart-to-code generation. Code and dataset are available at: this https URL

24. 【2602.10875】Stride-Net: Fairness-Aware Disentangled Representation Learning for Chest X-Ray Diagnosis

链接https://arxiv.org/abs/2602.10875

作者:Darakshan Rashid,Raza Imam,Dwarikanath Mahapatra,Brejesh Lall

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Deep neural networks, strong average performance, raising critical concerns, X-ray classification achieve, classification achieve strong

备注: 6 pages, 2 Tables, 3 Figures. Our code is available [this https URL](https://github.com/Daraksh/Fairness_StrideNet)

点击查看摘要

Abstract:Deep neural networks for chest X-ray classification achieve strong average performance, yet often underperform for specific demographic subgroups, raising critical concerns about clinical safety and equity. Existing debiasing methods frequently yield inconsistent improvements across datasets or attain fairness by degrading overall diagnostic utility, treating fairness as a post hoc constraint rather than a property of the learned representation. In this work, we propose Stride-Net (Sensitive Attribute Resilient Learning via Disentanglement and Learnable Masking with Embedding Alignment), a fairness-aware framework that learns disease-discriminative yet demographically invariant representations for chest X-ray analysis. Stride-Net operates at the patch level, using a learnable stride-based mask to select label-aligned image regions while suppressing sensitive attribute information through adversarial confusion loss. To anchor representations in clinical semantics and discourage shortcut learning, we further enforce semantic alignment between image features and BioBERT-based disease label embeddings via Group Optimal Transport. We evaluate Stride-Net on the MIMIC-CXR and CheXpert benchmarks across race and intersectional race-gender subgroups. Across architectures including ResNet and Vision Transformers, Stride-Net consistently improves fairness metrics while matching or exceeding baseline accuracy, achieving a more favorable accuracy-fairness trade-off than prior debiasing approaches. Our code is available at this https URL.

25. 【2602.10871】Viewpoint Recommendation for Point Cloud Labeling through Interaction Cost Modeling

链接https://arxiv.org/abs/2602.10871

作者:Yu Zhang,Xinyi Zhao,Chongke Bi,Siming Chen

类目:Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)

关键词:Semantic segmentation, time cost, autonomous driving, point clouds, point cloud

备注: Accepted to IEEE TVCG

点击查看摘要

Abstract:Semantic segmentation of 3D point clouds is important for many applications, such as autonomous driving. To train semantic segmentation models, labeled point cloud segmentation datasets are essential. Meanwhile, point cloud labeling is time-consuming for annotators, which typically involves tuning the camera viewpoint and selecting points by lasso. To reduce the time cost of point cloud labeling, we propose a viewpoint recommendation approach to reduce annotators' labeling time costs. We adapt Fitts' law to model the time cost of lasso selection in point clouds. Using the modeled time cost, the viewpoint that minimizes the lasso selection time cost is recommended to the annotator. We build a data labeling system for semantic segmentation of 3D point clouds that integrates our viewpoint recommendation approach. The system enables users to navigate to recommended viewpoints for efficient annotation. Through an ablation study, we observed that our approach effectively reduced the data labeling time cost. We also qualitatively compare our approach with previous viewpoint selection approaches on different datasets.

26. 【2602.10858】Hyperspectral Smoke Segmentation via Mixture of Prototypes

链接https://arxiv.org/abs/2602.10858

作者:Lujian Yao,Haitao Zhao,Xianghai Kong,Yuhan Xu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:industrial safety applications, safety applications, Smoke segmentation, critical for wildfire, wildfire management

备注: 35 pages, 14 figures

点击查看摘要

Abstract:Smoke segmentation is critical for wildfire management and industrial safety applications. Traditional visible-light-based methods face limitations due to insufficient spectral information, particularly struggling with cloud interference and semi-transparent smoke regions. To address these challenges, we introduce hyperspectral imaging for smoke segmentation and present the first hyperspectral smoke segmentation dataset (HSSDataset) with carefully annotated samples collected from over 18,000 frames across 20 real-world scenarios using a Many-to-One annotations protocol. However, different spectral bands exhibit varying discriminative capabilities across spatial regions, necessitating adaptive band weighting strategies. We decompose this into three technical challenges: spectral interaction contamination, limited spectral pattern modeling, and complex weighting router problems. We propose a mixture of prototypes (MoP) network with: (1) Band split for spectral isolation, (2) Prototype-based spectral representation for diverse patterns, and (3) Dual-level router for adaptive spatial-aware band weighting. We further construct a multispectral dataset (MSSDataset) with RGB-infrared images. Extensive experiments validate superior performance across both hyperspectral and multispectral modalities, establishing a new paradigm for spectral-based smoke segmentation.

27. 【2602.10825】Flow caching for autoregressive video generation

链接https://arxiv.org/abs/2602.10825

作者:Yuexiao Ma,Xuzhe Zheng,Jing Xu,Xiwei Xu,Feng Ling,Xiawu Zheng,Huafeng Kuang,Huixia Li,Xing Wang,Xuefeng Xiao,Fei Chao,Rongrong Ji

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Transformer architectures, built on Transformer, represent a powerful, powerful paradigm, paradigm for generating

备注

点击查看摘要

Abstract:Autoregressive models, often built on Transformer architectures, represent a powerful paradigm for generating ultra-long videos by synthesizing content in sequential chunks. However, this sequential generation process is notoriously slow. While caching strategies have proven effective for accelerating traditional video diffusion models, existing methods assume uniform denoising across all frames-an assumption that breaks down in autoregressive models where different video chunks exhibit varying similarity patterns at identical timesteps. In this paper, we present FlowCache, the first caching framework specifically designed for autoregressive video generation. Our key insight is that each video chunk should maintain independent caching policies, allowing fine-grained control over which chunks require recomputation at each timestep. We introduce a chunkwise caching strategy that dynamically adapts to the unique denoising characteristics of each chunk, complemented by a joint importance-redundancy optimized KV cache compression mechanism that maintains fixed memory bounds while preserving generation quality. Our method achieves remarkable speedups of 2.38 times on MAGI-1 and 6.7 times on SkyReels-V2, with negligible quality degradation (VBench: 0.87 increase and 0.79 decrease respectively). These results demonstrate that FlowCache successfully unlocks the potential of autoregressive models for real-time, ultra-long video generation-establishing a new benchmark for efficient video synthesis at scale. The code is available at this https URL.

28. 【2602.10818】Resource-Efficient RGB-Only Action Recognition for Edge Deployment

链接https://arxiv.org/abs/2602.10818

作者:Dongsik Yoon,Jongeun Kim,Dayeon Lee

类目:Computer Vision and Pattern Recognition (cs.CV); Performance (cs.PF)

关键词:devices poses stringent, poses stringent constraints, edge devices poses, constraints on latency, power consumption

备注: Under review

点击查看摘要

Abstract:Action recognition on edge devices poses stringent constraints on latency, memory, storage, and power consumption. While auxiliary modalities such as skeleton and depth information can enhance recognition performance, they often require additional sensors or computationally expensive pose-estimation pipelines, limiting practicality for edge use. In this work, we propose a compact RGB-only network tailored for efficient on-device inference. Our approach builds upon an X3D-style backbone augmented with Temporal Shift, and further introduces selective temporal adaptation and parameter-free attention. Extensive experiments on the NTU RGB+D 60 and 120 benchmarks demonstrate a strong accuracy-efficiency balance. Moreover, deployment-level profiling on the Jetson Orin Nano verifies a smaller on-device footprint and practical resource utilization compared to existing RGB-based action recognition techniques.

29. 【2602.10815】Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training

链接https://arxiv.org/abs/2602.10815

作者:Aojun Lu,Tao Feng,Hangjie Yuan,Wei Li,Yanan Sun

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Reinforcement Learning, consistently achieve superior, Supervised Fine-Tuning, fine-tuned with Reinforcement, trained with Supervised

备注

点击查看摘要

Abstract:The adaptation of large-scale Vision-Language Models (VLMs) through post-training reveals a pronounced generalization gap: models fine-tuned with Reinforcement Learning (RL) consistently achieve superior out-of-distribution (OOD) performance compared to those trained with Supervised Fine-Tuning (SFT). This paper posits a data-centric explanation for this phenomenon, contending that RL's generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium-difficulty training samples. To test this hypothesis, we systematically evaluate the OOD generalization of SFT models across training datasets of varying difficulty levels. Our results confirm that data difficulty is a critical factor, revealing that training on hard samples significantly degrades OOD performance. Motivated by this finding, we introduce Difficulty-Curated SFT (DC-SFT), a straightforward method that explicitly filters the training set based on sample difficulty. Experiments show that DC-SFT not only substantially enhances OOD generalization over standard SFT, but also surpasses the performance of RL-based training, all while providing greater stability and computational efficiency. This work offers a data-centric account of the OOD generalization gap in VLMs and establishes a more efficient pathway to achieving robust generalization. Code is available at this https URL.

30. 【2602.10809】DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories

链接https://arxiv.org/abs/2602.10809

作者:Chenlong Deng,Mengjie Deng,Junjie Wu,Dun Zeng,Teng Wang,Qingsong Xie,Jiadeng Huang,Shengjie Ma,Changwang Zhang,Zhaoxiang Wang,Jun Wang,Yutao Zhu,Zhicheng Dou

类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

关键词:Existing multimodal retrieval, Existing multimodal, measured in isolation, excel at semantic, semantic matching

备注: 17 pages, 5 figures

点击查看摘要

Abstract:Existing multimodal retrieval systems excel at semantic matching but implicitly assume that query-image relevance can be measured in isolation. This paradigm overlooks the rich dependencies inherent in realistic visual streams, where information is distributed across temporal sequences rather than confined to single snapshots. To bridge this gap, we introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task. Models must plan and perform multi-step reasoning over raw visual histories to locate targets based on implicit contextual cues. We construct DISBench, a challenging benchmark built on interconnected visual data. To address the scalability challenge of creating context-dependent queries, we propose a human-model collaborative pipeline that employs vision-language models to mine latent spatiotemporal associations, effectively offloading intensive context discovery before human verification. Furthermore, we build a robust baseline using a modular agent framework equipped with fine-grained tools and a dual-memory system for long-horizon navigation. Extensive experiments demonstrate that DISBench poses significant challenges to state-of-the-art models, highlighting the necessity of incorporating agentic reasoning into next-generation retrieval systems.

31. 【2602.10806】DMP-3DAD: Cross-Category 3D Anomaly Detection via Realistic Depth Map Projection with Few Normal Samples

链接https://arxiv.org/abs/2602.10806

作者:Zi Wang,Katsuya Hotta,Koichiro Kamide,Yawen Zou,Jianjian Qin,Chao Zhang,Jun Yu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:unseen object belongs, aims to determine, unseen object, object belongs, target category

备注

点击查看摘要

Abstract:Cross-category anomaly detection for 3D point clouds aims to determine whether an unseen object belongs to a target category using only a few normal examples. Most existing methods rely on category-specific training, which limits their flexibility in few-shot scenarios. In this paper, we propose DMP-3DAD, a training-free framework for cross-category 3D anomaly detection based on multi-view realistic depth map projection. Specifically, by converting point clouds into a fixed set of realistic depth images, our method leverages a frozen CLIP visual encoder to extract multi-view representations and performs anomaly detection via weighted feature similarity, which does not require any fine-tuning or category-dependent adaptation. Extensive experiments on the ShapeNetPart dataset demonstrate that DMP-3DAD achieves state-of-the-art performance under few-shot setting. The results show that the proposed approach provides a simple yet effective solution for practical cross-category 3D anomaly detection.

32. 【2602.10799】RSHallu: Dual-Mode Hallucination Evaluation for Remote-Sensing Multimodal Large Language Models with Domain-Tailored Mitigation

链接https://arxiv.org/abs/2602.10799

作者:Zihui Zhou,Yong Feng,Yanying Chen,Guofan Duan,Zhenxi Song,Mingliang Zhou,Weijia Jia

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Multimodal large language, visual question answering, large language models, shown strong performance, multimodal dialogue

备注

点击查看摘要

Abstract:Multimodal large language models (MLLMs) are increasingly adopted in remote sensing (RS) and have shown strong performance on tasks such as RS visual grounding (RSVG), RS visual question answering (RSVQA), and multimodal dialogue. However, hallucinations, which are responses inconsistent with the input RS images, severely hinder their deployment in high-stakes scenarios (e.g., emergency management and agricultural monitoring) and remain under-explored in RS. In this work, we present RSHallu, a systematic study with three deliverables: (1) we formalize RS hallucinations with an RS-oriented taxonomy and introduce image-level hallucination to capture RS-specific inconsistencies beyond object-centric errors (e.g., modality, resolution, and scene-level semantics); (2) we build a hallucination benchmark RSHalluEval (2,023 QA pairs) and enable dual-mode checking, supporting high-precision cloud auditing and low-cost reproducible local checking via a compact checker fine-tuned on RSHalluCheck dataset (15,396 QA pairs); and (3) we introduce a domain-tailored dataset RSHalluShield (30k QA pairs) for training-friendly mitigation and further propose training-free plug-and-play strategies, including decoding-time logit correction and RS-aware prompting. Across representative RS-MLLMs, our mitigation improves the hallucination-free rate by up to 21.63 percentage points under a unified protocol, while maintaining competitive performance on downstream RS tasks (RSVQA/RSVG). Code and datasets will be released.

33. 【2602.10780】Kill it with FIRE: On Leveraging Latent Space Directions for Runtime Backdoor Mitigation in Deep Neural Networks

链接https://arxiv.org/abs/2602.10780

作者:Enrico Ahlers,Daniel Passon,Yannic Noller,Lars Grunske

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

关键词:Machine learning models, adversarial attackers seeking, Machine learning, everyday lives, increasingly present

备注

点击查看摘要

Abstract:Machine learning models are increasingly present in our everyday lives; as a result, they become targets of adversarial attackers seeking to manipulate the systems we interact with. A well-known vulnerability is a backdoor introduced into a neural network by poisoned training data or a malicious training process. Backdoors can be used to induce unwanted behavior by including a certain trigger in the input. Existing mitigations filter training data, modify the model, or perform expensive input modifications on samples. If a vulnerable model has already been deployed, however, those strategies are either ineffective or inefficient. To address this gap, we propose our inference-time backdoor mitigation approach called FIRE (Feature-space Inference-time REpair). We hypothesize that a trigger induces structured and repeatable changes in the model's internal representation. We view the trigger as directions in the latent spaces between layers that can be applied in reverse to correct the inference mechanism. Therefore, we turn the backdoored model against itself by manipulating its latent representations and moving a poisoned sample's features along the backdoor directions to neutralize the trigger. Our evaluation shows that FIRE has low computational overhead and outperforms current runtime mitigations on image benchmarks across various attacks, datasets, and network architectures.

34. 【2602.10771】From Steering to Pedalling: Do Autonomous Driving VLMs Generalize to Cyclist-Assistive Spatial Perception and Planning?

链接https://arxiv.org/abs/2602.10771

作者:Krishna Kanth Nakka,Vedasri Nakka

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:encounter safety-critical situations, informed decision-making, encounter safety-critical, safety-critical situations, situations in urban

备注: Preprint

点击查看摘要

Abstract:Cyclists often encounter safety-critical situations in urban traffic, highlighting the need for assistive systems that support safe and informed decision-making. Recently, vision-language models (VLMs) have demonstrated strong performance on autonomous driving benchmarks, suggesting their potential for general traffic understanding and navigation-related reasoning. However, existing evaluations are predominantly vehicle-centric and fail to assess perception and reasoning from a cyclist-centric viewpoint. To address this gap, we introduce CyclingVQA, a diagnostic benchmark designed to probe perception, spatio-temporal understanding, and traffic-rule-to-lane reasoning from a cyclist's perspective. Evaluating 31+ recent VLMs spanning general-purpose, spatially enhanced, and autonomous-driving-specialized models, we find that current models demonstrate encouraging capabilities, while also revealing clear areas for improvement in cyclist-centric perception and reasoning, particularly in interpreting cyclist-specific traffic cues and associating signs with the correct navigational lanes. Notably, several driving-specialized models underperform strong generalist VLMs, indicating limited transfer from vehicle-centric training to cyclist-assistive scenarios. Finally, through systematic error analysis, we identify recurring failure modes to guide the development of more effective cyclist-assistive intelligent systems.

35. 【2602.10764】Dual-End Consistency Model

链接https://arxiv.org/abs/2602.10764

作者:Linwei Dong,Ruoyu Guo,Ge Bai,Zehuan Yuan,Yawei Luo,Changqing Zou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:flow-based generative models, slow iterative sampling, iterative sampling nature, sampling nature remains, generative models

备注

点击查看摘要

Abstract:The slow iterative sampling nature remains a major bottleneck for the practical deployment of diffusion and flow-based generative models. While consistency models (CMs) represent a state-of-the-art distillation-based approach for efficient generation, their large-scale application is still limited by two key issues: training instability and inflexible sampling. Existing methods seek to mitigate these problems through architectural adjustments or regularized objectives, yet overlook the critical reliance on trajectory selection. In this work, we first conduct an analysis on these two limitations: training instability originates from loss divergence induced by unstable self-supervised term, whereas sampling inflexibility arises from error accumulation. Based on these insights and analysis, we propose the Dual-End Consistency Model (DE-CM) that selects vital sub-trajectory clusters to achieve stable and effective training. DE-CM decomposes the PF-ODE trajectory and selects three critical sub-trajectories as optimization targets. Specifically, our approach leverages continuous-time CMs objectives to achieve few-step distillation and utilizes flow matching as a boundary regularizer to stabilize the training process. Furthermore, we propose a novel noise-to-noisy (N2N) mapping that can map noise to any point, thereby alleviating the error accumulation in the first step. Extensive experimental results show the effectiveness of our method: it achieves a state-of-the-art FID score of 1.70 in one-step generation on the ImageNet 256x256 dataset, outperforming existing CM-based one-step approaches.

36. 【2602.10757】xt-to-Vector Conversion for Residential Plan Design

链接https://arxiv.org/abs/2602.10757

作者:Egor Bazhenov,Stepan Kasai,Viacheslav Shalamov,Valeria Efimova

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Computer graphics, modern science, digital communication, fundamental part, part of modern

备注: 4 pages, 1 figure

点击查看摘要

Abstract:Computer graphics, comprising both raster and vector components, is a fundamental part of modern science, industry, and digital communication. While raster graphics offer ease of use, its pixel-based structure limits scalability. Vector graphics, defined by mathematical primitives, provides scalability without quality loss, however, it is more complex to produce. For design and architecture, the versatility of vector graphics is paramount, despite its computational demands. This paper introduces a novel method for generating vector residential plans from textual descriptions. Our approach surpasses existing solutions by approximately 5% in CLIPScore-based visual quality, benefiting from its inherent handling of right angles and flexible settings. Additionally, we present a new algorithm for vectorizing raster plans into structured vector images. Such images have a better CLIPscore compared to others by about 4%.

37. 【2602.10750】SecureScan: An AI-Driven Multi-Layer Framework for Malware and Phishing Detection Using Logistic Regression and Threat Intelligence Integration

链接https://arxiv.org/abs/2602.10750

作者:Rumman Firdos,Aman Dangi

类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:traditional signature-based intrusion, signature-based intrusion detection, intrusion detection systems, growing sophistication, sophistication of modern

备注

点击查看摘要

Abstract:The growing sophistication of modern malware and phishing campaigns has diminished the effectiveness of traditional signature-based intrusion detection systems. This work presents SecureScan, an AI-driven, triple-layer detection framework that integrates logistic regression-based classification, heuristic analysis, and external threat intelligence via the VirusTotal API for comprehensive triage of URLs, file hashes, and binaries. The proposed architecture prioritizes efficiency by filtering known threats through heuristics, classifying uncertain samples using machine learning, and validating borderline cases with third-party intelligence. On benchmark datasets, SecureScan achieves 93.1 percent accuracy with balanced precision (0.87) and recall (0.92), demonstrating strong generalization and reduced overfitting through threshold-based decision calibration. A calibrated threshold and gray-zone logic (0.45-0.55) were introduced to minimize false positives and enhance real-world stability. Experimental results indicate that a lightweight statistical model, when augmented with calibrated verification and external intelligence, can achieve reliability and performance comparable to more complex deep learning systems.

38. 【2602.10745】Spectral-Spatial Contrastive Learning Framework for Regression on Hyperspectral Data

链接https://arxiv.org/abs/2602.10745

作者:Mohamad Dhaini,Paul Honeine,Maxime Berar,Antonin Van Exem

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:demonstrated great success, image classification tasks, hyperspectral data, demonstrated great, great success

备注

点击查看摘要

Abstract:Contrastive learning has demonstrated great success in representation learning, especially for image classification tasks. However, there is still a shortage in studies targeting regression tasks, and more specifically applications on hyperspectral data. In this paper, we propose a spectral-spatial contrastive learning framework for regression tasks for hyperspectral data, in a model-agnostic design allowing to enhance backbones such as 3D convolutional and transformer-based networks. Moreover, we provide a collection of transformations relevant for augmenting hyperspectral data. Experiments on synthetic and real datasets show that the proposed framework and transformations significantly improve the performance of all studied backbone models.

39. 【2602.10744】Self-Supervised Image Super-Resolution Quality Assessment based on Content-Free Multi-Model Oriented Representation Learning

链接https://arxiv.org/abs/2602.10744

作者:Kian Majlessi,Amir Masoud Soltani,Mohammad Ebrahim Mahdavi,Aurelien Gourrier,Peyman Adibi

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:natural scene acquisition, results in complex, scene acquisition, inherent complexity, complexity of natural

备注

点击查看摘要

Abstract:Super-resolution (SR) applied to real-world low-resolution (LR) images often results in complex, irregular degradations that stem from the inherent complexity of natural scene acquisition. In contrast to SR artifacts arising from synthetic LR images created under well-defined scenarios, those distortions are highly unpredictable and vary significantly across different real-life contexts. Consequently, assessing the quality of SR images (SR-IQA) obtained from realistic LR, remains a challenging and underexplored problem. In this work, we introduce a no-reference SR-IQA approach tailored for such highly ill-posed realistic settings. The proposed method enables domain-adaptive IQA for real-world SR applications, particularly in data-scarce domains. We hypothesize that degradations in super-resolved images are strongly dependent on the underlying SR algorithms, rather than being solely determined by image content. To this end, we introduce a self-supervised learning (SSL) strategy that first pretrains multiple SR model oriented representations in a pretext stage. Our contrastive learning framework forms positive pairs from images produced by the same SR model and negative pairs from those generated by different methods, independent of image content. The proposed approach S3 RIQA, further incorporates targeted preprocessing to extract complementary quality information and an auxiliary task to better handle the various degradation profiles associated with different SR scaling factors. To this end, we constructed a new dataset, SRMORSS, to support unsupervised pretext training; it includes a wide range of SR algorithms applied to numerous real LR images, which addresses a gap in existing datasets. Experiments on real SR-IQA benchmarks demonstrate that S3 RIQA consistently outperforms most state-of-the-art relevant metrics.

40. 【2602.10728】OccFace: Unified Occlusion-Aware Facial Landmark Detection with Per-Point Visibility

链接https://arxiv.org/abs/2602.10728

作者:Xinhao Xiang,Zhengxin Li,Saurav Dhakad,Theo Bancroft,Jiawei Zhang,Weiyang Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Accurate facial landmark, Accurate facial, occlusion remains challenging, facial landmark detection, large appearance variation

备注

点击查看摘要

Abstract:Accurate facial landmark detection under occlusion remains challenging, especially for human-like faces with large appearance variation and rotation-driven self-occlusion. Existing detectors typically localize landmarks while handling occlusion implicitly, without predicting per-point visibility that downstream applications can benefits. We present OccFace, an occlusion-aware framework for universal human-like faces, including humans, stylized characters, and other non-human designs. OccFace adopts a unified dense 100-point layout and a heatmap-based backbone, and adds an occlusion module that jointly predicts landmark coordinates and per-point visibility by combining local evidence with cross-landmark context. Visibility supervision mixes manual labels with landmark-aware masking that derives pseudo visibility from mask-heatmap overlap. We also create an occlusion-aware evaluation suite reporting NME on visible vs. occluded landmarks and benchmarking visibility with Occ AP, F1@0.5, and ROC-AUC, together with a dataset annotated with 100-point landmarks and per-point visibility. Experiments show improved robustness under external occlusion and large head rotations, especially on occluded regions, while preserving accuracy on visible landmarks.

41. 【2602.10722】A Diffusion-Based Generative Prior Approach to Sparse-view Computed Tomography

链接https://arxiv.org/abs/2602.10722

作者:Davide Evangelista,Pasquale Cascarano,Elena Loli Piccolomini

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:highly challenging task, challenging task, Deep Generative Prior, deep generative models, reconstruction of X-rays

备注: 13 pages, 5 figures, 1 table

点击查看摘要

Abstract:The reconstruction of X-rays CT images from sparse or limited-angle geometries is a highly challenging task. The lack of data typically results in artifacts in the reconstructed image and may even lead to object distortions. For this reason, the use of deep generative models in this context has great interest and potential success. In the Deep Generative Prior (DGP) framework, the use of diffusion-based generative models is combined with an iterative optimization algorithm for the reconstruction of CT images from sinograms acquired under sparse geometries, to maintain the explainability of a model-based approach while introducing the generative power of a neural network. There are therefore several aspects that can be further investigated within these frameworks to improve reconstruction quality, such as image generation, the model, and the iterative algorithm used to solve the minimization problem, for which we propose modifications with respect to existing approaches. The results obtained even under highly sparse geometries are very promising, although further research is clearly needed in this direction.

42. 【2602.10720】Ecological mapping with geospatial foundation models

链接https://arxiv.org/abs/2602.10720

作者:Craig Mahlasi,Gciniwe S. Baloyi,Zaheed Gaffoor,Levente Klein,Anne Jones,Etienne Vos,Michal Muszynski,Geoffrey Dawson,Campbell Watson

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Geospatial foundation models, geospatial tasks, Geospatial foundation, Geospatial, fast-emerging paradigm

备注

点击查看摘要

Abstract:Geospatial foundation models (GFMs) are a fast-emerging paradigm for various geospatial tasks, such as ecological mapping. However, the utility of GFMs has not been fully explored for high-value use cases. This study aims to explore the utility, challenges and opportunities associated with the application of GFMs for ecological uses. In this regard, we fine-tune several pretrained AI models, namely, Prithvi-E0-2.0 and TerraMind, across three use cases, and compare this with a baseline ResNet-101 model. Firstly, we demonstrate TerraMind's LULC generation capabilities. Lastly, we explore the utility of the GFMs in forest functional trait mapping and peatlands detection. In all experiments, the GFMs outperform the baseline ResNet models. In general TerraMind marginally outperforms Prithvi. However, with additional modalities TerraMind significantly outperforms the baseline ResNet and Prithvi models. Nonetheless, consideration should be given to the divergence of input data from pretrained modalities. We note that these models would benefit from higher resolution and more accurate labels, especially for use cases where pixel-level dynamics need to be mapped.

43. 【2602.10719】From Representational Complementarity to Dual Systems: Synergizing VLM and Vision-Only Backbones for End-to-End Driving

链接https://arxiv.org/abs/2602.10719

作者:Sining Ang,Yuguang Yang,Chenxu Dang,Canyu Chen,Cheng Chi,Haiyan Liu,Xuanyao Mao,Jason Bao,Xuliang,Bingchuan Sun,Yan Wang

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:driving augments, cost trade-off, VLM, planning with language-enabled, usual accuracy

备注: 22 pages (10 pages main text + 12 pages appendix), 18 figures

点击查看摘要

Abstract:Vision-Language-Action (VLA) driving augments end-to-end (E2E) planning with language-enabled backbones, yet it remains unclear what changes beyond the usual accuracy--cost trade-off. We revisit this question with 3--RQ analysis in RecogDrive by instantiating the system with a full VLM and vision-only backbones, all under an identical diffusion Transformer planner. RQ1: At the backbone level, the VLM can introduce additional subspaces upon the vision-only backbones. RQ2: This unique subspace leads to a different behavioral in some long-tail scenario: the VLM tends to be more aggressive whereas ViT is more conservative, and each decisively wins on about 2--3% of test scenarios; With an oracle that selects, per scenario, the better trajectory between the VLM and ViT branches, we obtain an upper bound of 93.58 PDMS. RQ3: To fully harness this observation, we propose HybridDriveVLA, which runs both ViT and VLM branches and selects between their endpoint trajectories using a learned scorer, improving PDMS to 92.10. Finally, DualDriveVLA implements a practical fast--slow policy: it runs ViT by default and invokes the VLM only when the scorer's confidence falls below a threshold; calling the VLM on 15% of scenarios achieves 91.00 PDMS while improving throughput by 3.2x. Code will be released.

44. 【2602.10710】FGAA-FPN: Foreground-Guided Angle-Aware Feature Pyramid Network for Oriented Object Detection

链接https://arxiv.org/abs/2602.10710

作者:Jialin Ma

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:geographic information updating, high-resolution remote sensing, maritime surveillance, aerial imagery, information updating

备注: Submitted to The Visual Computer

点击查看摘要

Abstract:With the increasing availability of high-resolution remote sensing and aerial imagery, oriented object detection has become a key capability for geographic information updating, maritime surveillance, and disaster response. However, it remains challenging due to cluttered backgrounds, severe scale variation, and large orientation changes. Existing approaches largely improve performance through multi-scale feature fusion with feature pyramid networks or contextual modeling with attention, but they often lack explicit foreground modeling and do not leverage geometric orientation priors, which limits feature discriminability. To overcome these limitations, we propose FGAA-FPN, a Foreground-Guided Angle-Aware Feature Pyramid Network for oriented object detection. FGAA-FPN is built on a hierarchical functional decomposition that accounts for the distinct spatial resolution and semantic abstraction across pyramid levels, thereby strengthening multi-scale representations. Concretely, a Foreground-Guided Feature Modulation module learns foreground saliency under weak supervision to enhance object regions and suppress background interference in low-level features. In parallel, an Angle-Aware Multi-Head Attention module encodes relative orientation relationships to guide global interactions among high-level semantic features. Extensive experiments on DOTA v1.0 and DOTA v1.5 demonstrate that FGAA-FPN achieves state-of-the-art results, reaching 75.5% and 68.3% mAP, respectively.

45. 【2602.10704】(MGS)$^2$-Net: Unifying Micro-Geometric Scale and Macro-Geometric Structure for Cross-View Geo-Localization

链接https://arxiv.org/abs/2602.10704

作者:Minglei Li,Mengfan He,Chao Chen,Ziyang Meng

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:GNSS-denied UAV navigation, orthographic satellite references, Cross-view geo-localization, GNSS-denied UAV, UAV navigation

备注

点击查看摘要

Abstract:Cross-view geo-localization (CVGL) is pivotal for GNSS-denied UAV navigation but remains brittle under the drastic geometric misalignment between oblique aerial views and orthographic satellite references. Existing methods predominantly operate within a 2D manifold, neglecting the underlying 3D geometry where view-dependent vertical facades (macro-structure) and scale variations (micro-scale) severely corrupt feature alignment. To bridge this gap, we propose (MGS)$^2$, a geometry-grounded framework. The core of our innovation is the Macro-Geometric Structure Filtering (MGSF) module. Unlike pixel-wise matching sensitive to noise, MGSF leverages dilated geometric gradients to physically filter out high-frequency facade artifacts while enhancing the view-invariant horizontal plane, directly addressing the domain shift. To guarantee robust input for this structural filtering, we explicitly incorporate a Micro-Geometric Scale Adaptation (MGSA) module. MGSA utilizes depth priors to dynamically rectify scale discrepancies via multi-branch feature fusion. Furthermore, a Geometric-Appearance Contrastive Distillation (GACD) loss is designed to strictly discriminate against oblique occlusions. Extensive experiments demonstrate that (MGS)$^2$ achieves state-of-the-art performance, recording a Recall@1 of 97.5\% on University-1652 and 97.02\% on SUES-200. Furthermore, the framework exhibits superior cross-dataset generalization against geometric ambiguity. The code is available at: \href{this https URL}{this https URL}.

46. 【2602.10698】AugVLA-3D: Depth-Driven Feature Augmentation for Vision-Language-Action Models

链接https://arxiv.org/abs/2602.10698

作者:Zhifeng Rao,Wenlong Chen,Lei Xie,Xia Hua,Dongfu Yin,Zhen Tian,F. Richard Yu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:recently achieved remarkable, achieved remarkable progress, approaches primarily rely, existing approaches primarily, rely on VLM

备注

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic perception and control, yet most existing approaches primarily rely on VLM trained using 2D images, which limits their spatial understanding and action grounding in complex 3D environments. To address this limitation, we propose a novel framework that integrates depth estimation into VLA models to enrich 3D feature representations. Specifically, we employ a depth estimation baseline called VGGT to extract geometry-aware 3D cues from standard RGB inputs, enabling efficient utilization of existing large-scale 2D datasets while implicitly recovering 3D structural information. To further enhance the reliability of these depth-derived features, we introduce a new module called action assistant, which constrains the learned 3D representations with action priors and ensures their consistency with downstream control tasks. By fusing the enhanced 3D features with conventional 2D visual tokens, our approach significantly improves the generalization ability and robustness of VLA models. Experimental results demonstrate that the proposed method not only strengthens perception in geometrically ambiguous scenarios but also leads to superior action prediction accuracy. This work highlights the potential of depth-driven data augmentation and auxiliary expert supervision for bridging the gap between 2D observations and 3D-aware decision-making in robotic systems.

47. 【2602.10687】OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL

链接https://arxiv.org/abs/2602.10687

作者:Jinjie Shen,Jing Wu,Yaxiong Wang,Lechao Cheng,Shengeng Tang,Tianrui Hui,Nan Pu,Zhun Zhong

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Existing forgery detection, Scaling Policy Optimization, vision-language forgery detection, Existing forgery, Reward Scaling Policy

备注: 38 pages, DeepFake Detection

点击查看摘要

Abstract:Existing forgery detection methods are often limited to uni-modal or bi-modal settings, failing to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper targets to develop a unified framework for omnibus vision-language forgery detection and grounding. In this unified setting, the {interplay} between diverse modalities and the dual requirements of simultaneous detection and localization pose a critical ``difficulty bias`` problem: the simpler veracity classification task tends to dominate the gradients, leading to suboptimal performance in fine-grained grounding during multi-task optimization. To address this challenge, we propose \textbf{OmniVL-Guard}, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding. Particularly, OmniVL-Guard comprises two core designs: Self-Evolving CoT Generatio and Adaptive Reward Scaling Policy Optimization (ARSPO). {Self-Evolving CoT Generation} synthesizes high-quality reasoning paths, effectively overcoming the cold-start challenge. Building upon this, {Adaptive Reward Scaling Policy Optimization (ARSPO)} dynamically modulates reward scales and task weights, ensuring a balanced joint optimization. Extensive experiments demonstrate that OmniVL-Guard significantly outperforms state-of-the-art methods and exhibits zero-shot robust generalization across out-of-domain scenarios.

48. 【2602.10675】wiFF (Think With Future Frames): A Large-Scale Dataset for Dynamic Visual Reasoning

链接https://arxiv.org/abs/2602.10675

作者:Junhua Liu,Zhangcheng Wang,Zhike Han,Ningli Wang,Guotao Liang,Kun Kuang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:integrating visual perception, intermediate reasoning steps, enhancing multimodal reasoning, promising paradigm, paradigm for enhancing

备注: preprint

点击查看摘要

Abstract:Visual Chain-of-Thought (VCoT) has emerged as a promising paradigm for enhancing multimodal reasoning by integrating visual perception into intermediate reasoning steps. However, existing VCoT approaches are largely confined to static scenarios and struggle to capture the temporal dynamics essential for tasks such as instruction, prediction, and camera motion. To bridge this gap, we propose TwiFF-2.7M, the first large-scale, temporally grounded VCoT dataset derived from $2.7$ million video clips, explicitly designed for dynamic visual question and answer. Accompanying this, we introduce TwiFF-Bench, a high-quality evaluation benchmark of $1,078$ samples that assesses both the plausibility of reasoning trajectories and the correctness of final answers in open-ended dynamic settings. Building on these foundations, we propose the TwiFF model, a unified modal that synergistically leverages pre-trained video generation and image comprehension capabilities to produce temporally coherent visual reasoning cues-iteratively generating future action frames and textual reasoning. Extensive experiments demonstrate that TwiFF significantly outperforms existing VCoT methods and Textual Chain-of-Thought baselines on dynamic reasoning tasks, which fully validates the effectiveness for visual question answering in dynamic scenarios. Our code and data is available at this https URL.

49. 【2602.10663】AMAP-APP: Efficient Segmentation and Morphometry Quantification of Fluorescent Microscopy Images of Podocytes

链接https://arxiv.org/abs/2602.10663

作者:Arash Fatehi,David Unnersjö-Jess,Linus Butt,Noémie Moreau,Thomas Benzing,Katarzyna Bozek

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Automatic Morphological Analysis, Automatic Morphological, Automated podocyte foot, Morphological Analysis, foot process quantification

备注

点击查看摘要

Abstract:Background: Automated podocyte foot process quantification is vital for kidney research, but the established "Automatic Morphological Analysis of Podocytes" (AMAP) method is hindered by high computational demands, a lack of a user interface, and Linux dependency. We developed AMAP-APP, a cross-platform desktop application designed to overcome these barriers. Methods: AMAP-APP optimizes efficiency by replacing intensive instance segmentation with classic image processing while retaining the original semantic segmentation model. It introduces a refined Region of Interest (ROI) algorithm to improve precision. Validation involved 365 mouse and human images (STED and confocal), benchmarking performance against the original AMAP via Pearson correlation and Two One-Sided T-tests (TOST). Results: AMAP-APP achieved a 147-fold increase in processing speed on consumer hardware. Morphometric outputs (area, perimeter, circularity, and slit diaphragm density) showed high correlation (r0.90) and statistical equivalence (TOST P0.05) to the original method. Additionally, the new ROI algorithm demonstrated superior accuracy compared to the original, showing reduced deviation from manual delineations. Conclusion: AMAP-APP democratizes deep learning-based podocyte morphometry. By eliminating the need for high-performance computing clusters and providing a user-friendly interface for Windows, macOS, and Linux, it enables widespread adoption in nephrology research and potential clinical diagnostics.

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2602.10663 [cs.CV]

(or
arXiv:2602.10663v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2602.10663

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Arash Fatehi [view email] [v1]
Wed, 11 Feb 2026 09:07:03 UTC (8,190 KB)

50. 【2602.10662】Dynamic Frequency Modulation for Controllable Text-driven Image Generation

链接https://arxiv.org/abs/2602.10662

作者:Tiandong Shi,Ling Zhao,Ji Qi,Jiayi Ma,Chengli Peng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:text-guided diffusion models, image generation paradigm, generation paradigm driven, original text prompt, success of text-guided

备注

点击查看摘要

Abstract:The success of text-guided diffusion models has established a new image generation paradigm driven by the iterative refinement of text prompts. However, modifying the original text prompt to achieve the expected semantic adjustments often results in unintended global structure changes that disrupt user intent. Existing methods rely on empirical feature map selection for intervention, whose performance heavily depends on appropriate selection, leading to suboptimal stability. This paper tries to solve the aforementioned problem from a frequency perspective and analyzes the impact of the frequency spectrum of noisy latent variables on the hierarchical emergence of the structure framework and fine-grained textures during the generation process. We find that lower-frequency components are primarily responsible for establishing the structure framework in the early generation stage. Their influence diminishes over time, giving way to higher-frequency components that synthesize fine-grained textures. In light of this, we propose a training-free frequency modulation method utilizing a frequency-dependent weighting function with dynamic decay. This method maintains the structure framework consistency while permitting targeted semantic modifications. By directly manipulating the noisy latent variable, the proposed method avoids the empirical selection of internal feature maps. Extensive experiments demonstrate that the proposed method significantly outperforms current state-of-the-art methods, achieving an effective balance between preserving structure and enabling semantic updates.

51. 【2602.10660】AurigaNet: A Real-Time Multi-Task Network for Enhanced Urban Driving Perception

链接https://arxiv.org/abs/2602.10660

作者:Kiarash Ghasemzadeh,Sedigheh Dehghani

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Self-driving cars hold, cars hold significant, Self-driving cars, enhance urban mobility, reduce traffic accidents

备注

点击查看摘要

Abstract:Self-driving cars hold significant potential to reduce traffic accidents, alleviate congestion, and enhance urban mobility. However, developing reliable AI systems for autonomous vehicles remains a substantial challenge. Over the past decade, multi-task learning has emerged as a powerful approach to address complex problems in driving perception. Multi-task networks offer several advantages, including increased computational efficiency, real-time processing capabilities, optimized resource utilization, and improved generalization. In this study, we present AurigaNet, an advanced multi-task network architecture designed to push the boundaries of autonomous driving perception. AurigaNet integrates three critical tasks: object detection, lane detection, and drivable area instance segmentation. The system is trained and evaluated using the BDD100K dataset, renowned for its diversity in driving conditions. Key innovations of AurigaNet include its end-to-end instance segmentation capability, which significantly enhances both accuracy and efficiency in path estimation for autonomous vehicles. Experimental results demonstrate that AurigaNet achieves an 85.2% IoU in drivable area segmentation, outperforming its closest competitor by 0.7%. In lane detection, AurigaNet achieves a remarkable 60.8% IoU, surpassing other models by more than 30%. Furthermore, the network achieves an mAP@0.5:0.95 of 47.6% in traffic object detection, exceeding the next leading model by 2.9%. Additionally, we validate the practical feasibility of AurigaNet by deploying it on embedded devices such as the Jetson Orin NX, where it demonstrates competitive real-time performance. These results underscore AurigaNet's potential as a robust and efficient solution for autonomous driving perception systems. The code can be found here this https URL.

52. 【2602.10659】Multimodal Priors-Augmented Text-Driven 3D Human-Object Interaction Generation

链接https://arxiv.org/abs/2602.10659

作者:Yin Wang,Ziyao Zhang,Zhiying Leng,Haitian Liu,Frederick W. B. Li,Mu Li,Xiaohui Liang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:task of text-driven, challenging task, tackles, interaction, object

备注

点击查看摘要

Abstract:We address the challenging task of text-driven 3D human-object interaction (HOI) motion generation. Existing methods primarily rely on a direct text-to-HOI mapping, which suffers from three key limitations due to the significant cross-modality gap: (Q1) sub-optimal human motion, (Q2) unnatural object motion, and (Q3) weak interaction between humans and objects. To address these challenges, we propose MP-HOI, a novel framework grounded in four core insights: (1) Multimodal Data Priors: We leverage multimodal data (text, image, pose/object) from large multimodal models as priors to guide HOI generation, which tackles Q1 and Q2 in data modeling. (2) Enhanced Object Representation: We improve existing object representations by incorporating geometric keypoints, contact features, and dynamic properties, enabling expressive object representations, which tackles Q2 in data representation. (3) Multimodal-Aware Mixture-of-Experts (MoE) Model: We propose a modality-aware MoE model for effective multimodal feature fusion paradigm, which tackles Q1 and Q2 in feature fusion. (4) Cascaded Diffusion with Interaction Supervision: We design a cascaded diffusion framework that progressively refines human-object interaction features under dedicated supervision, which tackles Q3 in interaction refinement. Comprehensive experiments demonstrate that MP-HOI outperforms existing approaches in generating high-fidelity and fine-grained HOI motions.

53. 【2602.10639】VideoSTF: Stress-Testing Output Repetition in Video Large Language Models

链接https://arxiv.org/abs/2602.10639

作者:Yuxin Cao,Wei Song,Shangzhi Xu,Jingling Xue,Jin Song Dong

类目:Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Multimedia (cs.MM)

关键词:Large Language Models, Video Large Language, Large Language, recently achieved strong, achieved strong performance

备注

点击查看摘要

Abstract:Video Large Language Models (VideoLLMs) have recently achieved strong performance in video understanding tasks. However, we identify a previously underexplored generation failure: severe output repetition, where models degenerate into self-reinforcing loops of repeated phrases or sentences. This failure mode is not captured by existing VideoLLM benchmarks, which focus primarily on task accuracy and factual correctness. We introduce VideoSTF, the first framework for systematically measuring and stress-testing output repetition in VideoLLMs. VideoSTF formalizes repetition using three complementary n-gram-based metrics and provides a standardized testbed of 10,000 diverse videos together with a library of controlled temporal transformations. Using VideoSTF, we conduct pervasive testing, temporal stress testing, and adversarial exploitation across 10 advanced VideoLLMs. We find that output repetition is widespread and, critically, highly sensitive to temporal perturbations of video inputs. Moreover, we show that simple temporal transformations can efficiently induce repetitive degeneration in a black-box setting, exposing output repetition as an exploitable security vulnerability. Our results reveal output repetition as a fundamental stability issue in modern VideoLLMs and motivate stability-aware evaluation for video-language systems. Our evaluation code and scripts are available at: this https URL.

54. 【2602.10630】Eliminating VAE for Fast and High-Resolution Generative Detail Restoration

链接https://arxiv.org/abs/2602.10630

作者:Yan Wang,Shijie Zhao,Junlin Li,Li Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:attained remarkable breakthroughs, real-world super-resolution, albeit at slow, demand on devices, attained remarkable

备注: Accepted by ICLR 2026

点击查看摘要

Abstract:Diffusion models have attained remarkable breakthroughs in the real-world super-resolution (SR) task, albeit at slow inference and high demand on devices. To accelerate inference, recent works like GenDR adopt step distillation to minimize the step number to one. However, the memory boundary still restricts the maximum processing size, necessitating tile-by-tile restoration of high-resolution images. Through profiling the pipeline, we pinpoint that the variational auto-encoder (VAE) is the bottleneck of latency and memory. To completely solve the problem, we leverage pixel-(un)shuffle operations to eliminate the VAE, reversing the latent-based GenDR to pixel-space GenDR-Pix. However, upscale with x8 pixelshuffle may induce artifacts of repeated patterns. To alleviate the distortion, we propose a multi-stage adversarial distillation to progressively remove the encoder and decoder. Specifically, we utilize generative features from the previous stage models to guide adversarial discrimination. Moreover, we propose random padding to augment generative features and avoid discriminator collapse. We also introduce a masked Fourier space loss to penalize the outliers of amplitude. To improve inference performance, we empirically integrate a padding-based self-ensemble with classifier-free guidance to improve inference scaling. Experimental results show that GenDR-Pix performs 2.8x acceleration and 60% memory-saving compared to GenDR with negligible visual degradation, surpassing other one-step diffusion SR. Against all odds, GenDR-Pix can restore 4K image in only 1 second and 6GB.

55. 【2602.10624】A Vision-Language Foundation Model for Zero-shot Clinical Collaboration and Automated Concept Discovery in Dermatology

链接https://arxiv.org/abs/2602.10624

作者:Siyuan Yan,Xieji Li,Dan Mo,Philipp Tschandl,Yiwen Jiang,Zhonghua Wang,Ming Hu,Lie Ju,Cristina Vico-Alonso,Yizhen Zheng,Jiahe Liu,Juexiao Zhou,Camilla Chello,Jen G. Cheung,Julien Anriot,Luc Thomas,Clare Primiero,Gin Tan,Aik Beng Ng,Simon See,Xiaoying Tang,Albert Ip,Xiaoyang Liao,Adrian Bowling,Martin Haskett,Shuang Zhao,Monika Janda,H. Peter Soyer,Victoria Mar,Harald Kittler,Zongyuan Ge

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:widespread deployment remains, deployment remains hindered, Medical foundation models, Medical foundation, shown promise

备注: reports

点击查看摘要

Abstract:Medical foundation models have shown promise in controlled benchmarks, yet widespread deployment remains hindered by reliance on task-specific fine-tuning. Here, we introduce DermFM-Zero, a dermatology vision-language foundation model trained via masked latent modelling and contrastive learning on over 4 million multimodal data points. We evaluated DermFM-Zero across 20 benchmarks spanning zero-shot diagnosis and multimodal retrieval, achieving state-of-the-art performance without task-specific adaptation. We further evaluated its zero-shot capabilities in three multinational reader studies involving over 1,100 clinicians. In primary care settings, AI assistance enabled general practitioners to nearly double their differential diagnostic accuracy across 98 skin conditions. In specialist settings, the model significantly outperformed board-certified dermatologists in multimodal skin cancer assessment. In collaborative workflows, AI assistance enabled non-experts to surpass unassisted experts while improving management appropriateness. Finally, we show that DermFM-Zero's latent representations are interpretable: sparse autoencoders unsupervisedly disentangle clinically meaningful concepts that outperform predefined-vocabulary approaches and enable targeted suppression of artifact-induced biases, enhancing robustness without retraining. These findings demonstrate that a foundation model can provide effective, safe, and transparent zero-shot clinical decision support.

56. 【2602.10619】Improving Medical Visual Reinforcement Fine-Tuning via Perception and Reasoning Augmentation

链接https://arxiv.org/abs/2602.10619

作者:Guangjing Yang,ZhangYuan Yu,Ziyuan Qin,Xinyuan Song,Huahui Yi,Qingbo Kang,Jun Gao,Yiyue Li,Chenlin Du,Qicheng Lao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:remains largely underexplored, enable effective post-training, vision-centric domains remains, domains remains largely, rule-based reward schemes

备注: CPAL 2026

点击查看摘要

Abstract:While recent advances in Reinforcement Fine-Tuning (RFT) have shown that rule-based reward schemes can enable effective post-training for large language models, their extension to cross-modal, vision-centric domains remains largely underexplored. This limitation is especially pronounced in the medical imaging domain, where effective performance requires both robust visual perception and structured reasoning. In this work, we address this gap by proposing VRFT-Aug, a visual reinforcement fine-tuning framework tailored for the medical domain. VRFT-Aug introduces a series of training strategies designed to augment both perception and reasoning, including prior knowledge injection, perception-driven policy refinement, medically informed reward shaping, and behavioral imitation. Together, these methods aim to stabilize and improve the RFT process. Through extensive experiments across multiple medical datasets, we show that our approaches consistently outperform both standard supervised fine-tuning and RFT baselines. Moreover, we provide empirically grounded insights and practical training heuristics that can be generalized to other medical image tasks. We hope this work contributes actionable guidance and fresh inspiration for the ongoing effort to develop reliable, reasoning-capable models for high-stakes medical applications.

Comments:
CPAL 2026

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2602.10619 [cs.CV]

(or
arXiv:2602.10619v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2602.10619

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
57. 【2602.10593】Fast Person Detection Using YOLOX With AI Accelerator For Train Station Safety

链接https://arxiv.org/abs/2602.10593

作者:Mas Nurul Achmadiah,Novendra Setyawan,Achmad Arif Bryantono,Chi-Chia Sun,Wen-Kai Kuo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Jetson Orin Nano, Image processing, advanced Faster, Faster and applied, Orin Nano

备注: 6 pages, 8 figures, 2 tables. Presented at 2024 International Electronics Symposium (IES). IEEE DOI: [https://doi.org/10.1109/IES63037.2024.10665874](https://doi.org/10.1109/IES63037.2024.10665874)

点击查看摘要

Abstract:Recently, Image processing has advanced Faster and applied in many fields, including health, industry, and transportation. In the transportation sector, object detection is widely used to improve security, for example, in traffic security and passenger crossings at train stations. Some accidents occur in the train crossing area at the station, like passengers uncarefully when passing through the yellow line. So further security needs to be developed. Additional technology is required to reduce the number of accidents. This paper focuses on passenger detection applications at train stations using YOLOX and Edge AI Accelerator hardware. the performance of the AI accelerator will be compared with Jetson Orin Nano. The experimental results show that the Hailo-8 AI hardware accelerator has higher accuracy than Jetson Orin Nano (improvement of over 12%) and has lower latency than Jetson Orin Nano (reduced 20 ms).

58. 【2602.10592】Enhancing YOLOv11n for Reliable Child Detection in Noisy Surveillance Footage

链接https://arxiv.org/abs/2602.10592

作者:Khanh Linh Tran,Minh Nguyen Dang,Thien Nguyen Trong,Hung Nguyen Quoc,Linh Nguyen Kieu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:real-world missing child, missing child alert, low-quality surveillance footage, enhancing child detection, daycare monitoring systems

备注

点击查看摘要

Abstract:This paper presents a practical and lightweight solution for enhancing child detection in low-quality surveillance footage, a critical component in real-world missing child alert and daycare monitoring systems. Building upon the efficient YOLOv11n architecture, we propose a deployment-ready pipeline that improves detection under challenging conditions including occlusion, small object size, low resolution, motion blur, and poor lighting commonly found in existing CCTV infrastructures. Our approach introduces a domain-specific augmentation strategy that synthesizes realistic child placements using spatial perturbations such as partial visibility, truncation, and overlaps, combined with photometric degradations including lighting variation and noise. To improve recall of small and partially occluded instances, we integrate Slicing Aided Hyper Inference (SAHI) at inference time. All components are trained and evaluated on a filtered, child-only subset of the Roboflow Daycare dataset. Compared to the baseline YOLOv11n, our enhanced system achieves a mean Average Precision at 0.5 IoU (mAP@0.5) of 0.967 and a mean Average Precision averaged over IoU thresholds from 0.5 to 0.95 (mAP@0.5:0.95) of 0.783, yielding absolute improvements of 0.7 percent and 2.3 percent, respectively, without architectural changes. Importantly, the entire pipeline maintains compatibility with low-power edge devices and supports real-time performance, making it particularly well suited for low-cost or resource-constrained industrial surveillance deployments. The example augmented dataset and the source code used to generate it are available at: this https URL

59. 【2602.10586】Enhancing Underwater Images via Adaptive Semantic-aware Codebook Learning

链接https://arxiv.org/abs/2602.10586

作者:Bosen Lin,Feng Gao,Yanwei Yu,Junyu Dong,Qian Du

类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:Underwater Image Enhancement, levels vary significantly, Image Enhancement, degradation levels vary, natural clean references

备注: Accepted for publication in IEEE TGRS 2026

点击查看摘要

Abstract:Underwater Image Enhancement (UIE) is an ill-posed problem where natural clean references are not available, and the degradation levels vary significantly across semantic regions. Existing UIE methods treat images with a single global model and ignore the inconsistent degradation of different scene components. This oversight leads to significant color distortions and loss of fine details in heterogeneous underwater scenes, especially where degradation varies significantly across different image regions. Therefore, we propose SUCode (Semantic-aware Underwater Codebook Network), which achieves adaptive UIE from semantic-aware discrete codebook representation. Compared with one-shot codebook-based methods, SUCode exploits semantic-aware, pixel-level codebook representation tailored to heterogeneous underwater degradation. A three-stage training paradigm is employed to represent raw underwater image features to avoid pseudo ground-truth contamination. Gated Channel Attention Module (GCAM) and Frequency-Aware Feature Fusion (FAFF) jointly integrate channel and frequency cues for faithful color restoration and texture recovery. Extensive experiments on multiple benchmarks demonstrate that SUCode achieves state-of-the-art performance, outperforming recent UIE methods on both reference and no-reference metrics. The code will be made public available at this https URL.

60. 【2602.10575】MetaphorStar: Image Metaphor Understanding and Reasoning with End-to-End Visual Reinforcement Learning

链接https://arxiv.org/abs/2602.10575

作者:Chenhao Zhang,Yazhe Niu,Hongsheng Li

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

关键词:Multimodal Large Language, Nowadays AI systems, Metaphorical comprehension, challenge for Nowadays, Large Language Models

备注: 14 pages, 4 figures, 11 tables; Code: [this https URL](https://github.com/MING-ZCH/MetaphorStar) , Model Dataset: [this https URL](https://huggingface.co/collections/MING-ZCH/metaphorstar)

点击查看摘要

Abstract:Metaphorical comprehension in images remains a critical challenge for Nowadays AI systems. While Multimodal Large Language Models (MLLMs) excel at basic Visual Question Answering (VQA), they consistently struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. This difficulty stems from the task's demand for sophisticated multi-hop reasoning, cultural context, and Theory of Mind (ToM) capabilities, which current models lack. To fill this gap, we propose MetaphorStar, the first end-to-end visual reinforcement learning (RL) framework for image implication tasks. Our framework includes three core components: the fine-grained dataset TFQ-Data, the visual RL method TFQ-GRPO, and the well-structured benchmark TFQ-Bench. Our fully open-source MetaphorStar family, trained using TFQ-GRPO on TFQ-Data, significantly improves performance by an average of 82.6% on the image implication benchmarks. Compared with 20+ mainstream MLLMs, MetaphorStar-32B achieves state-of-the-art (SOTA) on Multiple-Choice Question and Open-Style Question, significantly outperforms the top closed-source model Gemini-3.0-pro on True-False Question. Crucially, our experiments reveal that learning image implication tasks improves the general understanding ability, especially the complex visual reasoning ability. We further provide a systematic analysis of model parameter scaling, training data scaling, and the impact of different model architectures and training strategies, demonstrating the broad applicability of our method. We open-sourced all model weights, datasets, and method code at this https URL.

Comments:
14 pages, 4 figures, 11 tables; Code: this https URL, Model Dataset: this https URL

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Cite as:
arXiv:2602.10575 [cs.CV]

(or
arXiv:2602.10575v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2602.10575

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
61. 【2602.10551】C^2ROPE: Causal Continuous Rotary Positional Encoding for 3D Large Multimodal-Models Reasoning

链接https://arxiv.org/abs/2602.10551

作者:Guanting Ye,Qiyan Zhao,Wenhao Yu,Xiaofeng Zhang,Jianmin Ji,Yanyong Zhang,Ka-Veng Yuen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Multimodal Models, Large Language, Large Multimodal, Recent advances

备注: Accepted in ICRA 2026

点击查看摘要

Abstract:Recent advances in 3D Large Multimodal Models (LMMs) built on Large Language Models (LLMs) have established the alignment of 3D visual features with LLM representations as the dominant paradigm. However, the inherited Rotary Position Embedding (RoPE) introduces limitations for multimodal processing. Specifically, applying 1D temporal positional indices disrupts the continuity of visual features along the column dimension, resulting in spatial locality loss. Moreover, RoPE follows the prior that temporally closer image tokens are more causally related, leading to long-term decay in attention allocation and causing the model to progressively neglect earlier visual tokens as the sequence length increases. To address these issues, we propose C^2RoPE, an improved RoPE that explicitly models local spatial Continuity and spatial Causal relationships for visual processing. C^2RoPE introduces a spatio-temporal continuous positional embedding mechanism for visual tokens. It first integrates 1D temporal positions with Cartesian-based spatial coordinates to construct a triplet hybrid positional index, and then employs a frequency allocation strategy to encode spatio-temporal positional information across the three index components. Additionally, we introduce Chebyshev Causal Masking, which determines causal dependencies by computing the Chebyshev distance of image tokens in 2D space. Evaluation results across various benchmarks, including 3D scene reasoning and 3D visual question answering, demonstrate C^2RoPE's effectiveness. The code is be available at this https URL.

62. 【2602.10549】Enhancing Weakly Supervised Multimodal Video Anomaly Detection through Text Guidance

链接https://arxiv.org/abs/2602.10549

作者:Shengyang Sun,Jiashen Hua,Junyi Feng,Xiaojin Gong

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Weakly supervised multimodal, gained significant attention, modality remains under-explored, Weakly supervised, video anomaly detection

备注: Accepted by IEEE Transactions on Multimedia

点击查看摘要

Abstract:Weakly supervised multimodal video anomaly detection has gained significant attention, yet the potential of the text modality remains under-explored. Text provides explicit semantic information that can enhance anomaly characterization and reduce false alarms. However, extracting effective text features is challenging due to the inability of general-purpose language models to capture anomaly-specific nuances and the scarcity of relevant descriptions. Furthermore, multimodal fusion often suffers from redundancy and imbalance. To address these issues, we propose a novel text-guided framework. First, we introduce an in-context learning-based multi-stage text augmentation mechanism to generate high-quality anomaly text samples for fine-tuning the text feature extractor. Second, we design a multi-scale bottleneck Transformer fusion module that uses compressed bottleneck tokens to progressively integrate information across modalities, mitigating redundancy and imbalance. Experiments on UCF-Crime and XD-Violence demonstrate state-of-the-art performance.

63. 【2602.10546】RealHD: A High-Quality Dataset for Robust Detection of State-of-the-Art AI-Generated Images

链接https://arxiv.org/abs/2602.10546

作者:Hanzhe Yu,Yun Ye,Jintao Rong,Qi Xuan,Chen Ma

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:potentially increasing societal, increasing societal risks, highly realistic fake, realistic fake images, potentially increasing

备注: Published in the Proceedings of the 33rd ACM International Conference on Multimedia (ACM MM 2025)

点击查看摘要

Abstract:The rapid advancement of generative AI has raised concerns about the authenticity of digital images, as highly realistic fake images can now be generated at low cost, potentially increasing societal risks. In response, several datasets have been established to train detection models aimed at distinguishing AI-generated images from real ones. However, existing datasets suffer from limited generalization, low image quality, overly simple prompts, and insufficient image diversity. To address these limitations, we propose a high-quality, large-scale dataset comprising over 730,000 images across multiple categories, including both real and AI-generated images. The generated images are synthesized via state-of-the-art methods, including text-to-image generation (guided by over 10,000 carefully designed prompts), image inpainting, image refinement, and face swapping. Each generated image is annotated with its generation method and category. Inpainting images further include binary masks to indicate inpainted regions, providing rich metadata for analysis. Compared to existing datasets, detection models trained on our dataset demonstrate superior generalization capabilities. Our dataset not only serves as a strong benchmark for evaluating detection methods but also contributes to advancing the robustness of AI-generated image detection techniques. Building upon this, we propose a lightweight detection method based on image noise entropy, which transforms the original image into an entropy tensor of Non-Local Means (NLM) noise before classification. Extensive experiments demonstrate that models trained on our dataset achieve strong generalization, and our method delivers competitive performance, establishing a solid baseline for future research. The dataset and source code are publicly available at this https URL.

64. 【2602.10518】MapVerse: A Benchmark for Geospatial Question Answering on Diverse Real-World Maps

链接https://arxiv.org/abs/2602.10518

作者:Sharat Bhat,Harshita Khandelwal,Tushar Kataria,Vivek Gupta

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:encompassing geography, environmental patterns, powerful carriers, carriers of structured, structured and contextual

备注

点击查看摘要

Abstract:Maps are powerful carriers of structured and contextual knowledge, encompassing geography, demographics, infrastructure, and environmental patterns. Reasoning over such knowledge requires models to integrate spatial relationships, visual cues, real-world context, and domain-specific expertise-capabilities that current large language models (LLMs) and vision-language models (VLMs) still struggle to exhibit consistently. Yet, datasets used to benchmark VLMs on map-based reasoning remain narrow in scope, restricted to specific domains, and heavily reliant on artificially generated content (outputs from LLMs or pipeline-based methods), offering limited depth for evaluating genuine geospatial reasoning. To address this gap, we present MapVerse, a large-scale benchmark built on real-world maps. It comprises 11,837 human-authored question-answer pairs across 1,025 maps, spanning ten diverse map categories and multiple question categories for each. The dataset provides a rich setting for evaluating map reading, interpretation, and multimodal reasoning. We evaluate ten state-of-the-art models against our benchmark to establish baselines and quantify reasoning gaps. Beyond overall performance, we conduct fine-grained categorical analyses to assess model inference across multiple dimensions and investigate the visual factors shaping reasoning outcomes. Our findings reveal that while current VLMs perform competitively on classification-style tasks, both open- and closed-source models fall short on advanced tasks requiring complex spatial reasoning.

65. 【2602.10516】3DXTalker: Unifying Identity, Lip Sync, Emotion, and Spatial Dynamics in Expressive 3D Talking Avatars

链接https://arxiv.org/abs/2602.10516

作者:Zhongju Wang,Zhenhong Sun,Beier Wang,Yifu Wang,Daoyi Dong,Huadong Mo,Hongdong Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:exhibit lifelike spatial, digital humans, express emotion, talking avatar, talking avatar generation

备注

点击查看摘要

Abstract:Audio-driven 3D talking avatar generation is increasingly important in virtual communication, digital humans, and interactive media, where avatars must preserve identity, synchronize lip motion with speech, express emotion, and exhibit lifelike spatial dynamics, collectively defining a broader objective of expressivity. However, achieving this remains challenging due to insufficient training data with limited subject identities, narrow audio representations, and restricted explicit controllability. In this paper, we propose 3DXTalker, an expressive 3D talking avatar through data-curated identity modeling, audio-rich representations, and spatial dynamics controllability. 3DXTalker enables scalable identity modeling via 2D-to-3D data curation pipeline and disentangled representations, alleviating data scarcity and improving identity generalization. Then, we introduce frame-wise amplitude and emotional cues beyond standard speech embeddings, ensuring superior lip synchronization and nuanced expression modulation. These cues are unified by a flow-matching-based transformer for coherent facial dynamics. Moreover, 3DXTalker also enables natural head-pose motion generation while supporting stylized control via prompt-based conditioning. Extensive experiments show that 3DXTalker integrates lip synchronization, emotional expression, and head-pose dynamics within a unified framework, achieves superior performance in 3D talking avatar generation.

66. 【2602.10513】1%100%: High-Efficiency Visual Adapter with Complex Linear Projection Optimization

链接https://arxiv.org/abs/2602.10513

作者:Dongshuo Yin,Xue Yang,Deng-Ping Fan,Shi-Min Hu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Deploying vision foundation, vision foundation models, models typically relies, foundation models typically, Deploying vision

备注

点击查看摘要

Abstract:Deploying vision foundation models typically relies on efficient adaptation strategies, whereas conventional full fine-tuning suffers from prohibitive costs and low efficiency. While delta-tuning has proven effective in boosting the performance and efficiency of LLMs during adaptation, its advantages cannot be directly transferred to the fine-tuning pipeline of vision foundation models. To push the boundaries of adaptation efficiency for vision tasks, we propose an adapter with Complex Linear Projection Optimization (CoLin). For architecture, we design a novel low-rank complex adapter that introduces only about 1% parameters to the backbone. For efficiency, we theoretically prove that low-rank composite matrices suffer from severe convergence issues during training, and address this challenge with a tailored loss. Extensive experiments on object detection, segmentation, image classification, and rotated object detection (remote sensing scenario) demonstrate that CoLin outperforms both full fine-tuning and classical delta-tuning approaches with merely 1% parameters for the first time, providing a novel and efficient solution for deployment of vision foundation models. We release the code on this https URL.

67. 【2602.10508】Med-SegLens: Latent-Level Model Diffing for Interpretable Medical Image Segmentation

链接https://arxiv.org/abs/2602.10508

作者:Salma J. Ahmed,Emad A. Mohammed,Azam Asilian Bidgoli

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:remain largely opaque, achieve strong predictive, Modern segmentation models, models achieve strong, strong predictive performance

备注

点击查看摘要

Abstract:Modern segmentation models achieve strong predictive performance but remain largely opaque, limiting our ability to diagnose failures, understand dataset shift, or intervene in a principled manner. We introduce Med-SegLens, a model-diffing framework that decomposes segmentation model activations into interpretable latent features using sparse autoencoders trained on SegFormer and U-Net. Through cross-architecture and cross-dataset latent alignment across healthy, adult, pediatric, and sub-Saharan African glioma cohorts, we identify a stable backbone of shared representations, while dataset shift is driven by differential reliance on population-specific latents. We show that these latents act as causal bottlenecks for segmentation failures, and that targeted latent-level interventions can correct errors and improve cross-dataset adaption without retraining, recovering performance in 70% of failure cases and improving Dice score from 39.4% to 74.2%. Our results demonstrate that latent-level model diffing provides a practical and mechanistic tool for diagnosing failures and mitigating dataset shift in segmentation models.

68. 【2602.10500】he Garbage Dataset (GD): A Multi-Class Image Benchmark for Automated Waste Segregation

链接https://arxiv.org/abs/2602.10500

作者:Suman Kunwar

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:automated waste segregation, advance automated waste, introduces the Garbage, Garbage Dataset, image dataset designed

备注: 11 pages 10 figures and 1 table

点击查看摘要

Abstract:This study introduces the Garbage Dataset (GD), a publicly available image dataset designed to advance automated waste segregation through machine learning and computer vision. It's a diverse dataset covering 10 common household waste categories: metal, glass, biological, paper, battery, trash, cardboard, shoes, clothes, and plastic. The dataset comprises 13,348 labeled images collected through multiple methods, including DWaste mobile app and curated web sources. Methods included rigorous validation through checksums and outlier detection, analysis of class imbalance and visual separability via PCA/t-SNE, and assessment of background complexity using entropy and saliency measures. The dataset was benchmarked using state-of-the-art deep learning models (EfficientNetV2M, EfficientNetV2S, MobileNet, ResNet50, ResNet101) evaluated on performance metrics and operational carbon emissions. Experiment results indicate EfficientNetV2S achieved the highest performance with 96.19% accuracy and a 0.96 F1-score, though with a moderate carbon cost. Analysis revealed inherent dataset characteristics including class imbalance, a skew toward high-outlier classes (plastic, cardboard, paper), and brightness variations that require consideration. The main conclusion is that GD provides a valuable, real-world benchmark for waste classification research while highlighting important challenges such as class imbalance, background complexity, and environmental trade-offs in model selection that must be addressed for practical deployment. The dataset is publicly released to support further research in environmental sustainability applications.

69. 【2602.10495】Characterizing and Optimizing the Spatial Kernel of Multi Resolution Hash Encodings

链接https://arxiv.org/abs/2602.10495

作者:Tianxiang Dai,Jonathan Fan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Neural Graphics Primitives, Instant Neural Graphics, Graphics Primitives, Instant Neural, Neural Graphics

备注: ICLR 2026 (Poster); LaTeX source; 11 figures; 7 tables

点击查看摘要

Abstract:Multi-Resolution Hash Encoding (MHE), the foundational technique behind Instant Neural Graphics Primitives, provides a powerful parameterization for neural fields. However, its spatial behavior lacks rigorous understanding from a physical systems perspective, leading to reliance on heuristics for hyperparameter selection. This work introduces a novel analytical approach that characterizes MHE by examining its Point Spread Function (PSF), which is analogous to the Green's function of the system. This methodology enables a quantification of the encoding's spatial resolution and fidelity. We derive a closed-form approximation for the collision-free PSF, uncovering inherent grid-induced anisotropy and a logarithmic spatial profile. We establish that the idealized spatial bandwidth, specifically the Full Width at Half Maximum (FWHM), is determined by the average resolution, $N_{\text{avg}}$. This leads to a counterintuitive finding: the effective resolution of the model is governed by the broadened empirical FWHM (and therefore $N_{\text{avg}}$), rather than the finest resolution $N_{\max}$, a broadening effect we demonstrate arises from optimization dynamics. Furthermore, we analyze the impact of finite hash capacity, demonstrating how collisions introduce speckle noise and degrade the Signal-to-Noise Ratio (SNR). Leveraging these theoretical insights, we propose Rotated MHE (R-MHE), an architecture that applies distinct rotations to the input coordinates at each resolution level. R-MHE mitigates anisotropy while maintaining the efficiency and parameter count of the original MHE. This study establishes a methodology based on physical principles that moves beyond heuristics to characterize and optimize MHE.

70. 【2602.10492】End-to-End LiDAR optimization for 3D point cloud registration

链接https://arxiv.org/abs/2602.10492

作者:Siddhant Katyan,Marc-André Gardner,Jean-François Lalonde

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:typically designed independently, key modality, typically designed, designed independently, independently of downstream

备注: 36th British Machine Vision Conference 2025, {BMVC} 2025, Sheffield, UK, November 24-27, 2025. Project page: [this https URL](https://lvsn.github.io/e2e-lidar-registration/)

点击查看摘要

Abstract:LiDAR sensors are a key modality for 3D perception, yet they are typically designed independently of downstream tasks such as point cloud registration. Conventional registration operates on pre-acquired datasets with fixed LiDAR configurations, leading to suboptimal data collection and significant computational overhead for sampling, noise filtering, and parameter tuning. In this work, we propose an adaptive LiDAR sensing framework that dynamically adjusts sensor parameters, jointly optimizing LiDAR acquisition and registration hyperparameters. By integrating registration feedback into the sensing loop, our approach optimally balances point density, noise, and sparsity, improving registration accuracy and efficiency. Evaluations in the CARLA simulation demonstrate that our method outperforms fixed-parameter baselines while retaining generalization abilities, highlighting the potential of adaptive LiDAR for autonomous perception and robotic applications.

71. 【2602.10491】owards Remote Sensing Change Detection with Neural Memory

链接https://arxiv.org/abs/2602.10491

作者:Zhenyu Yang,Gensheng Pei,Yazhou Yao,Tianfei Zhou,Lizhong Ding,Fumin Shen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:urban planning, Remote sensing change, environmental monitoring, related applications, essential for environmental

备注: accepted by IEEE Transactions on Geoscience Remote Sensing

点击查看摘要

Abstract:Remote sensing change detection is essential for environmental monitoring, urban planning, and related applications. However, current methods often struggle to capture long-range dependencies while maintaining computational efficiency. Although Transformers can effectively model global context, their quadratic complexity poses scalability challenges, and existing linear attention approaches frequently fail to capture intricate spatiotemporal relationships. Drawing inspiration from the recent success of Titans in language tasks, we present ChangeTitans, the Titans-based framework for remote sensing change detection. Specifically, we propose VTitans, the first Titans-based vision backbone that integrates neural memory with segmented local attention, thereby capturing long-range dependencies while mitigating computational overhead. Next, we present a hierarchical VTitans-Adapter to refine multi-scale features across different network layers. Finally, we introduce TS-CBAM, a two-stream fusion module leveraging cross-temporal attention to suppress pseudo-changes and enhance detection accuracy. Experimental evaluations on four benchmark datasets (LEVIR-CD, WHU-CD, LEVIR-CD+, and SYSU-CD) demonstrate that ChangeTitans achieves state-of-the-art results, attaining \textbf{84.36\%} IoU and \textbf{91.52\%} F1-score on LEVIR-CD, while remaining computationally competitive.

72. 【2602.10425】HII-DPO: Eliminate Hallucination via Accurate Hallucination-Inducing Counterfactual Images

链接https://arxiv.org/abs/2602.10425

作者:Yilin Yang,Zhenghui Guo,Yuke Wang,Omprakash Gnawali,Sheng Di,Chengming Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Vision-Language Models, achieved remarkable success, diverse multimodal tasks, inherent language bias, Large Vision-Language

备注

点击查看摘要

Abstract:Large Vision-Language Models (VLMs) have achieved remarkable success across diverse multimodal tasks but remain vulnerable to hallucinations rooted in inherent language bias. Despite recent progress, existing hallucination mitigation methods often overlook the underlying hallucination patterns driven by language bias. In this work, we design a novel pipeline to accurately synthesize Hallucination-Inducing Images (HIIs). Using synthesized HIIs, we reveal a consistent scene-conditioned hallucination pattern: models tend to mention objects that are highly typical of the scene even when visual evidence is removed. To quantify the susceptibility of VLMs to this hallucination pattern, we establish the Masked-Object-Hallucination (MOH) benchmark to rigorously evaluate existing state-of-the-art alignment frameworks. Finally, we leverage HIIs to construct high-quality preference datasets for fine-grained alignment. Experimental results demonstrate that our approach effectively mitigates hallucinations while preserving general model capabilities. Specifically, our method achieves up to a 38% improvement over the current state-of-the-art on standard hallucination benchmarks.

73. 【2602.10364】Comp2Comp: Open-Source Software with FDA-Cleared Artificial Intelligence Algorithms for Computed Tomography Image Analysis

链接https://arxiv.org/abs/2602.10364

作者:Adrit Rao,Malte Jensen,Andrea T. Fisher,Louis Blankemeier,Pauline Berens,Arash Fereydooni,Seth Lirette,Eren Alkan,Felipe C. Kitamura,Juan M. Zambrano Chaves,Eduardo Reis,Arjun Desai,Marc H. Willis,Jason Hom,Andrew Johnston,Leon Lenchik,Robert D. Boutin,Eduardo M. J. M. Farina,Augusto S. Serpa,Marcelo S. Takahashi,Jordan Perchik,Steven A. Rothenberg,Jamie L. Schroeder,Ross Filice,Leonardo K. Bittencourt,Hari Trivedi,Marly van Assen,John Mongan,Kimberly Kallianos,Oliver Aalami,Akshay S. Chaudhari

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Artificial intelligence, already-acquired radiologic images, intelligence allows automatic, automatic extraction, biomarkers from already-acquired

备注: Adrit Rao, Malte Jensen, Andrea T. Fisher, Louis Blankemeier: Co-first authors. Oliver Aalami, Akshay S. Chaudhari: Co-senior authors

点击查看摘要

Abstract:Artificial intelligence allows automatic extraction of imaging biomarkers from already-acquired radiologic images. This paradigm of opportunistic imaging adds value to medical imaging without additional imaging costs or patient radiation exposure. However, many open-source image analysis solutions lack rigorous validation while commercial solutions lack transparency, leading to unexpected failures when deployed. Here, we report development and validation for two of the first fully open-sourced, FDA-510(k)-cleared deep learning pipelines to mitigate both challenges: Abdominal Aortic Quantification (AAQ) and Bone Mineral Density (BMD) estimation are both offered within the Comp2Comp package for opportunistic analysis of computed tomography scans. AAQ segments the abdominal aorta to assess aneurysm size; BMD segments vertebral bodies to estimate trabecular bone density and osteoporosis risk. AAQ-derived maximal aortic diameters were compared against radiologist ground-truth measurements on 258 patient scans enriched for abdominal aortic aneurysms from four external institutions. BMD binary classifications (low vs. normal bone density) were compared against concurrent DXA scan ground truths obtained on 371 patient scans from four external institutions. AAQ had an overall mean absolute error of 1.57 mm (95% CI 1.38-1.80 mm). BMD had a sensitivity of 81.0% (95% CI 74.0-86.8%) and specificity of 78.4% (95% CI 72.3-83.7%). Comp2Comp AAQ and BMD demonstrated sufficient accuracy for clinical use. Open-sourcing these algorithms improves transparency of typically opaque FDA clearance processes, allows hospitals to test the algorithms before cumbersome clinical pilots, and provides researchers with best-in-class methods.

74. 【2602.10344】Monte Carlo Maximum Likelihood Reconstruction for Digital Holography with Speckle

链接https://arxiv.org/abs/2602.10344

作者:Xi Chen,Arian Maleki,Shirin Jalali

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:multiplicative noise, posing a fundamental, coherent imaging, statistically modeled, modeled as multiplicative

备注

点击查看摘要

Abstract:In coherent imaging, speckle is statistically modeled as multiplicative noise, posing a fundamental challenge for image reconstruction. While maximum likelihood estimation (MLE) provides a principled framework for speckle mitigation, its application to coherent imaging system such as digital holography with finite apertures is hindered by the prohibitive cost of high-dimensional matrix inversion, especially at high resolutions. This computational burden has prevented the use of MLE-based reconstruction with physically accurate aperture modeling. In this work, we propose a randomized linear algebra approach that enables scalable MLE optimization without explicit matrix inversions in gradient computation. By exploiting the structural properties of sensing matrix and using conjugate gradient for likelihood gradient evaluation, the proposed algorithm supports accurate aperture modeling without the simplifying assumptions commonly imposed for tractability. We term the resulting method projected gradient descent with Monte Carlo estimation (PGD-MC). The proposed PGD-MC framework (i) demonstrates robustness to diverse and physically accurate aperture models, (ii) achieves substantial improvements in reconstruction quality and computational efficiency, and (iii) scales effectively to high-resolution digital holography. Extensive experiments incorporating three representative denoisers as regularization show that PGD-MC provides a flexible and effective MLE-based reconstruction framework for digital holography with finite apertures, consistently outperforming prior Plug-and-Play model-based iterative reconstruction methods in both accuracy and speed. Our code is available at: this https URL.

75. 【2602.10343】Conditional Uncertainty-Aware Political Deepfake Detection with Stochastic Convolutional Neural Networks

链接https://arxiv.org/abs/2602.10343

作者:Rafael-Petruţ Gardoş

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:highly realistic political, Recent advances, generative image models, realistic political deepfakes, posing risks

备注: 21 pages, 12 figures, 18 tables

点击查看摘要

Abstract:Recent advances in generative image models have enabled the creation of highly realistic political deepfakes, posing risks to information integrity, public trust, and democratic processes. While automated deepfake detectors are increasingly deployed in moderation and investigative pipelines, most existing systems provide only point predictions and fail to indicate when outputs are unreliable, being an operationally critical limitation in high-stakes political contexts. This work investigates conditional, uncertainty-aware political deepfake detection using stochastic convolutional neural networks within an empirical, decision-oriented reliability framework. Rather than treating uncertainty as a purely Bayesian construct, it is evaluated through observable criteria, including calibration quality, proper scoring rules, and its alignment with prediction errors under both global and confidence-conditioned analyses. A politically focused binary image dataset is constructed via deterministic metadata filtering from a large public real-synthetic corpus. Two pretrained CNN backbones (ResNet-18 and EfficientNet-B4) are fully fine-tuned for classification. Deterministic inference is compared with single-pass stochastic prediction, Monte Carlo dropout with multiple forward passes, temperature scaling, and ensemble-based uncertainty surrogates. Evaluation reports ROC-AUC, thresholded confusion matrices, calibration metrics, and generator-disjoint out-of-distribution performance. Results demonstrate that calibrated probabilistic outputs and uncertainty estimates enable risk-aware moderation policies. A systematic confidence-band analysis further clarifies when uncertainty provides operational value beyond predicted confidence, delineating both the benefits and limitations of uncertainty-aware deepfake detection in political settings.

76. 【2602.10326】Flow Matching with Uncertainty Quantification and Guidance

链接https://arxiv.org/abs/2602.10326

作者:Juyeop Han,Lukas Lao Beyer,Sertac Karaman

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:sampling-based generative models, flow matching, remarkable success, success of sampling-based, sampling-based generative

备注

点击查看摘要

Abstract:Despite the remarkable success of sampling-based generative models such as flow matching, they can still produce samples of inconsistent or degraded quality. To assess sample reliability and generate higher-quality outputs, we propose uncertainty-aware flow matching (UA-Flow), a lightweight extension of flow matching that predicts the velocity field together with heteroscedastic uncertainty. UA-Flow estimates per-sample uncertainty by propagating velocity uncertainty through the flow dynamics. These uncertainty estimates act as a reliability signal for individual samples, and we further use them to steer generation via uncertainty-aware classifier guidance and classifier-free guidance. Experiments on image generation show that UA-Flow produces uncertainty signals more highly correlated with sample fidelity than baseline methods, and that uncertainty-guided sampling further improves generation quality.

77. 【2602.10319】A Low-Rank Defense Method for Adversarial Attack on Diffusion Models

链接https://arxiv.org/abs/2602.10319

作者:Jiaxuan Zhu,Siyu Huang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:diffusion models, Latent Diffusion Models, diffusion models defend, developed rapidly, fine-tuning process

备注: Accepted by ICME2025

点击查看摘要

Abstract:Recently, adversarial attacks for diffusion models as well as their fine-tuning process have been developed rapidly. To prevent the abuse of these attack algorithms from affecting the practical application of diffusion models, it is critical to develop corresponding defensive strategies. In this work, we propose an efficient defensive strategy, named Low-Rank Defense (LoRD), to defend the adversarial attack on Latent Diffusion Models (LDMs). LoRD introduces the merging idea and a balance parameter, combined with the low-rank adaptation (LoRA) modules, to detect and defend the adversarial samples. Based on LoRD, we build up a defense pipeline that applies the learned LoRD modules to help diffusion models defend against attack algorithms. Our method ensures that the LDM fine-tuned on both adversarial and clean samples can still generate high-quality images. To demonstrate the effectiveness of our approach, we conduct extensive experiments on facial and landscape images, and our method shows significantly better defense performance compared to the baseline methods.

78. 【2602.10278】ERGO: Excess-Risk-Guided Optimization for High-Fidelity Monocular 3D Gaussian Splatting

链接https://arxiv.org/abs/2602.10278

作者:Zehua Ma,Hanhui Li,Zhenyu Xie,Xiaonan Luo,Michael Kampffmeyer,Feng Gao,Xiaodan Liang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:single image remains, ill-posed problem due, occluded regions, single image, image remains

备注

点击查看摘要

Abstract:Generating 3D content from a single image remains a fundamentally challenging and ill-posed problem due to the inherent absence of geometric and textural information in occluded regions. While state-of-the-art generative models can synthesize auxiliary views to provide additional supervision, these views inevitably contain geometric inconsistencies and textural misalignments that propagate and amplify artifacts during 3D reconstruction. To effectively harness these imperfect supervisory signals, we propose an adaptive optimization framework guided by excess risk decomposition, termed ERGO. Specifically, ERGO decomposes the optimization losses in 3D Gaussian splatting into two components, i.e., excess risk that quantifies the suboptimality gap between current and optimal parameters, and Bayes error that models the irreducible noise inherent in synthesized views. This decomposition enables ERGO to dynamically estimate the view-specific excess risk and adaptively adjust loss weights during optimization. Furthermore, we introduce geometry-aware and texture-aware objectives that complement the excess-risk-derived weighting mechanism, establishing a synergistic global-local optimization paradigm. Consequently, ERGO demonstrates robustness against supervision noise while consistently enhancing both geometric fidelity and textural quality of the reconstructed 3D content. Extensive experiments on the Google Scanned Objects dataset and the OmniObject3D dataset demonstrate the superiority of ERGO over existing state-of-the-art methods.

79. 【2602.10265】Colorimeter-Supervised Skin Tone Estimation from Dermatoscopic Images for Fairness Auditing

链接https://arxiv.org/abs/2602.10265

作者:Marin Benčević,Krešimir Romić,Ivana Hartmann Tolić,Irena Galić

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:studies report performance, report performance disparities, Individual Typology Angle, studies report, clinical decision support

备注: Preprint submitted to Computer Methods and Programs in Biomedicine

点击查看摘要

Abstract:Neural-network-based diagnosis from dermatoscopic images is increasingly used for clinical decision support, yet studies report performance disparities across skin tones. Fairness auditing of these models is limited by the lack of reliable skin-tone annotations in public dermatoscopy datasets. We address this gap with neural networks that predict Fitzpatrick skin type via ordinal regression and the Individual Typology Angle (ITA) via color regression, using in-person Fitzpatrick labels and colorimeter measurements as targets. We further leverage extensive pretraining on synthetic and real dermatoscopic and clinical images. The Fitzpatrick model achieves agreement comparable to human crowdsourced annotations, and ITA predictions show high concordance with colorimeter-derived ITA, substantially outperforming pixel-averaging approaches. Applying these estimators to ISIC 2020 and MILK10k, we find that fewer than 1% of subjects belong to Fitzpatrick types V and VI. We release code and pretrained models as an open-source tool for rapid skin-tone annotation and bias auditing. This is, to our knowledge, the first dermatoscopic skin-tone estimation neural network validated against colorimeter measurements, and it supports growing evidence of clinically relevant performance gaps across skin-tone groups.

80. 【2602.10259】PMMA: The Polytechnique Montreal Mobility Aids Dataset

链接https://arxiv.org/abs/2602.10259

作者:Qingwu Liu,Nicolas Saunier,Guillaume-Alexandre Bilodeau

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Deformable DETR, mobility aids, Faster R-CNN, study introduces, including wheelchair users

备注: Submitted to the journal IEEE Transactions on Intelligent Transportation Systems, under review

点击查看摘要

Abstract:This study introduces a new object detection dataset of pedestrians using mobility aids, named PMMA. The dataset was collected in an outdoor environment, where volunteers used wheelchairs, canes, and walkers, resulting in nine categories of pedestrians: pedestrians, cane users, two types of walker users, whether walking or resting, five types of wheelchair users, including wheelchair users, people pushing empty wheelchairs, and three types of users pushing occupied wheelchairs, including the entire pushing group, the pusher and the person seated on the wheelchair. To establish a benchmark, seven object detection models (Faster R-CNN, CenterNet, YOLOX, DETR, Deformable DETR, DINO, and RT-DETR) and three tracking algorithms (ByteTrack, BOT-SORT, and OC-SORT) were implemented under the MMDetection framework. Experimental results show that YOLOX, Deformable DETR, and Faster R-CNN achieve the best detection performance, while the differences among the three trackers are relatively small. The PMMA dataset is publicly available at this https URL, and the video processing and model training code is available at this https URL.

81. 【2602.10239】XSPLAIN: XAI-enabling Splat-based Prototype Learning for Attribute-aware INterpretability

链接https://arxiv.org/abs/2602.10239

作者:Dominik Galus,Julia Farganus,Tymoteusz Zapala,Mikołaj Czachorowski,Piotr Borycki,Przemysław Spurek,Piotr Syga

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:multiple critical domains, Gaussian Splatting, standard for high-fidelity, adoption in multiple, multiple critical

备注

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has rapidly become a standard for high-fidelity 3D reconstruction, yet its adoption in multiple critical domains is hindered by the lack of interpretability of the generation models as well as classification of the Splats. While explainability methods exist for other 3D representations, like point clouds, they typically rely on ambiguous saliency maps that fail to capture the volumetric coherence of Gaussian primitives. We introduce XSPLAIN, the first ante-hoc, prototype-based interpretability framework designed specifically for 3DGS classification. Our approach leverages a voxel-aggregated PointNet backbone and a novel, invertible orthogonal transformation that disentangles feature channels for interpretability while strictly preserving the original decision boundaries. Explanations are grounded in representative training examples, enabling intuitive ``this looks like that'' reasoning without any degradation in classification performance. A rigorous user study (N=51) demonstrates a decisive preference for our approach: participants selected XSPLAIN explanations 48.4\% of the time as the best, significantly outperforming baselines $(p0.001)$, showing that XSPLAIN provides transparency and user trust. The source code for this work is available at: this https URL

82. 【2602.10221】DEGMC: Denoising Diffusion Models Based on Riemannian Equivariant Group Morphological Convolutions

链接https://arxiv.org/abs/2602.10221

作者:El Hadji S. Diop,Thierno Fall,Mohamed Daoudi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Denoising Diffusion Probabilistic, recent Denoising Diffusion, Diffusion Probabilistic Models, key feature extraction, Denoising Diffusion

备注

点击查看摘要

Abstract:In this work, we address two major issues in recent Denoising Diffusion Probabilistic Models (DDPM): {\bf 1)} geometric key feature extraction and {\bf 2)} network equivariance. Since the DDPM prediction network relies on the U-net architecture, which is theoretically only translation equivariant, we introduce a geometric approach combined with an equivariance property of the more general Euclidean group, which includes rotations, reflections, and permutations. We introduce the notion of group morphological convolutions in Riemannian manifolds, which are derived from the viscosity solutions of first-order Hamilton-Jacobi-type partial differential equations (PDEs) that act as morphological multiscale dilations and erosions. We add a convection term to the model and solve it using the method of characteristics. This helps us better capture nonlinearities, represent thin geometric structures, and incorporate symmetries into the learning process. Experimental results on the MNIST, RotoMNIST, and CIFAR-10 datasets show noticeable improvements compared to the baseline DDPM model.

83. 【2602.10179】When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models

链接https://arxiv.org/abs/2602.10179

作者:Jiacheng Hou,Yining Sun,Ruochong Jin,Haochen Han,Fangming Liu,Wai Kin Victor Chan,Alex Jinpeng Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Recent advances, visual-text prompts, user intent, intent is inferred, inferred directly

备注: Project homepage: [this https URL](https://csu-jpg.github.io/vja.github.io/)

点击查看摘要

Abstract:Recent advances in large image editing models have shifted the paradigm from text-driven instructions to vision-prompt editing, where user intent is inferred directly from visual inputs such as marks, arrows, and visual-text prompts. While this paradigm greatly expands usability, it also introduces a critical and underexplored safety risk: the attack surface itself becomes visual. In this work, we propose Vision-Centric Jailbreak Attack (VJA), the first visual-to-visual jailbreak attack that conveys malicious instructions purely through visual inputs. To systematically study this emerging threat, we introduce IESBench, a safety-oriented benchmark for image editing models. Extensive experiments on IESBench demonstrate that VJA effectively compromises state-of-the-art commercial models, achieving attack success rates of up to 80.9% on Nano Banana Pro and 70.1% on GPT-Image-1.5. To mitigate this vulnerability, we propose a training-free defense based on introspective multimodal reasoning, which substantially improves the safety of poorly aligned models to a level comparable with commercial systems, without auxiliary guard models and with negligible computational overhead. Our findings expose new vulnerabilities, provide both a benchmark and practical defense to advance safe and trustworthy modern image editing systems. Warning: This paper contains offensive images created by large image editing models.

84. 【2602.10173】ArtisanGS: Interactive Tools for Gaussian Splat Selection with AI and Human in the Loop

链接https://arxiv.org/abs/2602.10173

作者:Clement Fuji Tsang,Anita Hu,Or Perel,Carsten Kolve,Maria Shugrina

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:facilitate physics simulation, Gaussian Splat selection, including recent techniques, Gaussian Splat, versatile Gaussian Splat

备注: 12 pages, includes supplementary material

点击查看摘要

Abstract:Representation in the family of 3D Gaussian Splats (3DGS) are growing into a viable alternative to traditional graphics for an expanding number of application, including recent techniques that facilitate physics simulation and animation. However, extracting usable objects from in-the-wild captures remains challenging and controllable editing techniques for this representation are limited. Unlike the bulk of emerging techniques, focused on automatic solutions or high-level editing, we introduce an interactive suite of tools centered around versatile Gaussian Splat selection and segmentation. We propose a fast AI-driven method to propagate user-guided 2D selection masks to 3DGS selections. This technique allows for user intervention in the case of errors and is further coupled with flexible manual selection and segmentation tools. These allow a user to achieve virtually any binary segmentation of an unstructured 3DGS scene. We evaluate our toolset against the state-of-the-art for Gaussian Splat selection and demonstrate their utility for downstream applications by developing a user-guided local editing approach, leveraging a custom Video Diffusion Model. With flexible selection tools, users have direct control over the areas that the AI can modify. Our selection and editing tools can be used for any in-the-wild capture without additional optimization.

85. 【2602.10160】AD$^2$: Analysis and Detection of Adversarial Threats in Visual Perception for End-to-End Autonomous Driving Systems

链接https://arxiv.org/abs/2602.10160

作者:Ishan Sahu,Somnath Hazra,Somak Aditya,Soumyajit Dey

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:achieved significant progress, remains largely underexplored, robustness remains largely, adversarial robustness remains, autonomous driving systems

备注: Accepted to WACV 2026

点击查看摘要

Abstract:End-to-end autonomous driving systems have achieved significant progress, yet their adversarial robustness remains largely underexplored. In this work, we conduct a closed-loop evaluation of state-of-the-art autonomous driving agents under black-box adversarial threat models in CARLA. Specifically, we consider three representative attack vectors on the visual perception pipeline: (i) a physics-based blur attack induced by acoustic waves, (ii) an electromagnetic interference attack that distorts captured images, and (iii) a digital attack that adds ghost objects as carefully crafted bounded perturbations on images. Our experiments on two advanced agents, Transfuser and Interfuser, reveal severe vulnerabilities to such attacks, with driving scores dropping by up to 99% in the worst case, raising valid safety concerns. To help mitigate such threats, we further propose a lightweight Attack Detection model for Autonomous Driving systems (AD$^2$) based on attention mechanisms that capture spatial-temporal consistency. Comprehensive experiments across multi-camera inputs on CARLA show that our detector achieves superior detection capability and computational efficiency compared to existing approaches.

86. 【2602.10159】Beyond Closed-Pool Video Retrieval: A Benchmark and Agent Framework for Real-World Video Search and Moment Localization

链接https://arxiv.org/abs/2602.10159

作者:Tao Yu,Yujia Yang,Haopeng Jin,Junhao Gong,Xinlong Chen,Yuxuan Zhou,Shanbin Zhang,Jiabing Yang,Xinming Wang,Hongzhu Yi,Ping Nie,Kai Zou,Zhang Zhang,Yan Huang,Liang Wang,Yeshani,Ruiwen Tao,Jin Ma,Haijin Liang,Jinwen Luo

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Traditional video retrieval, textbf, closed video pools, retrieval benchmarks focus, reflect real-world searches

备注: 49 pages, 9 figures

点击查看摘要

Abstract:Traditional video retrieval benchmarks focus on matching precise descriptions to closed video pools, failing to reflect real-world searches characterized by fuzzy, multi-dimensional memories on the open web. We present \textbf{RVMS-Bench}, a comprehensive system for evaluating real-world video memory search. It consists of \textbf{1,440 samples} spanning \textbf{20 diverse categories} and \textbf{four duration groups}, sourced from \textbf{real-world open-web videos}. RVMS-Bench utilizes a hierarchical description framework encompassing \textbf{Global Impression, Key Moment, Temporal Context, and Auditory Memory} to mimic realistic multi-dimensional search cues, with all samples strictly verified via a human-in-the-loop protocol. We further propose \textbf{RACLO}, an agentic framework that employs abductive reasoning to simulate the human ``Recall-Search-Verify'' cognitive process, effectively addressing the challenge of searching for videos via fuzzy memories in the real world. Experiments reveal that existing MLLMs still demonstrate insufficient capabilities in real-world Video Retrieval and Moment Localization based on fuzzy memories. We believe this work will facilitate the advancement of video retrieval robustness in real-world unstructured scenarios.

87. 【2602.10146】VERA: Identifying and Leveraging Visual Evidence Retrieval Heads in Long-Context Understanding

链接https://arxiv.org/abs/2602.10146

作者:Rongcan Pei,Huan Li,Fang Guo,Qi Zhu

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:handling long context, complex reasoning tasks, Visual Evidence Retrieval, face significant challenges, Evidence Retrieval Augmentation

备注: 12 pages, 12 figures

点击查看摘要

Abstract:While Vision-Language Models (VLMs) have shown promise in textual understanding, they face significant challenges when handling long context and complex reasoning tasks. In this paper, we dissect the internal mechanisms governing long-context processing in VLMs to understand their performance bottlenecks. Through the lens of attention analysis, we identify specific Visual Evidence Retrieval (VER) Heads - a sparse, dynamic set of attention heads critical for locating visual cues during reasoning, distinct from static OCR heads. We demonstrate that these heads are causal to model performance; masking them leads to significant degradation. Leveraging this discovery, we propose VERA (Visual Evidence Retrieval Augmentation), a training-free framework that detects model uncertainty (i.e., entropy) to trigger the explicit verbalization of visual evidence attended by VER heads. Comprehensive experiments demonstrate that VERA significantly improves long-context understanding of open-source VLMs: it yields an average relative improvement of 21.3% on Qwen3-VL-8B-Instruct and 20.1% on GLM-4.1V-Thinking across five benchmarks.

88. 【2602.10143】MPA: Multimodal Prototype Augmentation for Few-Shot Learning

链接https://arxiv.org/abs/2602.10143

作者:Liwen Wu,Wei Wang,Lei Zhao,Zhan Gao,Qika Lin,Shaowen Yao,Zuozhu Liu,Bin Pu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Prototype Augmentation FSL, few-shot learning, remote sensing, Multimodal Prototype Augmentation, popular task

备注: This paper has been accepted by AAAI 2026

点击查看摘要

Abstract:Recently, few-shot learning (FSL) has become a popular task that aims to recognize new classes from only a few labeled examples and has been widely applied in fields such as natural science, remote sensing, and medical images. However, most existing methods focus only on the visual modality and compute prototypes directly from raw support images, which lack comprehensive and rich multimodal information. To address these limitations, we propose a novel Multimodal Prototype Augmentation FSL framework called MPA, including LLM-based Multi-Variant Semantic Enhancement (LMSE), Hierarchical Multi-View Augmentation (HMA), and an Adaptive Uncertain Class Absorber (AUCA). LMSE leverages large language models to generate diverse paraphrased category descriptions, enriching the support set with additional semantic cues. HMA exploits both natural and multi-view augmentations to enhance feature diversity (e.g., changes in viewing distance, camera angles, and lighting conditions). AUCA models uncertainty by introducing uncertain classes via interpolation and Gaussian sampling, effectively absorbing uncertain samples. Extensive experiments on four single-domain and six cross-domain FSL benchmarks demonstrate that MPA achieves superior performance compared to existing state-of-the-art methods across most settings. Notably, MPA surpasses the second-best method by 12.29% and 24.56% in the single-domain and cross-domain setting, respectively, in the 5-way 1-shot setting.

89. 【2602.10138】Multimodal Information Fusion for Chart Understanding: A Survey of MLLMs -- Evolution, Limitations, and Cognitive Enhancement

链接https://arxiv.org/abs/2602.10138

作者:Zhihang Yi,Jian Zhao,Jiancheng Lv,Tao Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Multimodal Large Language, Large Language Models, requiring the seamless, extract meaning, seamless integration

备注

点击查看摘要

Abstract:Chart understanding is a quintessential information fusion task, requiring the seamless integration of graphical and textual data to extract meaning. The advent of Multimodal Large Language Models (MLLMs) has revolutionized this domain, yet the landscape of MLLM-based chart analysis remains fragmented and lacks systematic organization. This survey provides a comprehensive roadmap of this nascent frontier by structuring the domain's core components. We begin by analyzing the fundamental challenges of fusing visual and linguistic information in charts. We then categorize downstream tasks and datasets, introducing a novel taxonomy of canonical and non-canonical benchmarks to highlight the field's expanding scope. Subsequently, we present a comprehensive evolution of methodologies, tracing the progression from classic deep learning techniques to state-of-the-art MLLM paradigms that leverage sophisticated fusion strategies. By critically examining the limitations of current models, particularly their perceptual and reasoning deficits, we identify promising future directions, including advanced alignment techniques and reinforcement learning for cognitive enhancement. This survey aims to equip researchers and practitioners with a structured understanding of how MLLMs are transforming chart information fusion and to catalyze progress toward more robust and reliable systems.

90. 【2602.10137】Multi-encoder ConvNeXt Network with Smooth Attentional Feature Fusion for Multispectral Semantic Segmentation

链接https://arxiv.org/abs/2602.10137

作者:Leo Thomas Ramos,Angel D. Sappa

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:multi-branch encoder-decoder architecture, land cover segmentation, work proposes MeCSAFNet, multispectral imagery, work proposes

备注: This is an extended version of the study presented at IEEE SoutheastCon2025. It presents substantial new content and original contributions beyond the previous version, including an expanded and enhanced background, new architectural refinements, additional experiments conducted on a broader range of datasets and experimental scenarios, and a more comprehensive analysis of results

点击查看摘要

Abstract:This work proposes MeCSAFNet, a multi-branch encoder-decoder architecture for land cover segmentation in multispectral imagery. The model separately processes visible and non-visible channels through dual ConvNeXt encoders, followed by individual decoders that reconstruct spatial information. A dedicated fusion decoder integrates intermediate features at multiple scales, combining fine spatial cues with high-level spectral representations. The feature fusion is further enhanced with CBAM attention, and the ASAU activation function contributes to stable and efficient optimization. The model is designed to process different spectral configurations, including a 4-channel (4c) input combining RGB and NIR bands, as well as a 6-channel (6c) input incorporating NDVI and NDWI indices. Experiments on the Five-Billion-Pixels (FBP) and Potsdam datasets demonstrate significant performance gains. On FBP, MeCSAFNet-base (6c) surpasses U-Net (4c) by +19.21%, U-Net (6c) by +14.72%, SegFormer (4c) by +19.62%, and SegFormer (6c) by +14.74% in mIoU. On Potsdam, MeCSAFNet-large (4c) improves over DeepLabV3+ (4c) by +6.48%, DeepLabV3+ (6c) by +5.85%, SegFormer (4c) by +9.11%, and SegFormer (6c) by +4.80% in mIoU. The model also achieves consistent gains over several recent state-of-the-art approaches. Moreover, compact variants of MeCSAFNet deliver notable performance with lower training time and reduced inference cost, supporting their deployment in resource-constrained environments.

91. 【2602.09153】SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes

链接https://arxiv.org/abs/2602.09153

作者:Nicholas Pfaff,Thomas Cohn,Sergey Zakharov,Rick Cory,Russ Tedrake

类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词:real indoor spaces, existing environments fail, evaluating home robots, key tool, tool for training

备注: Project page: [this https URL](https://scenesmith.github.io/)

点击查看摘要

Abstract:Simulation has become a key tool for training and evaluating home robots at scale, yet existing environments fail to capture the diversity and physical complexity of real indoor spaces. Current scene synthesis methods produce sparsely furnished rooms that lack the dense clutter, articulated furniture, and physical properties essential for robotic manipulation. We introduce SceneSmith, a hierarchical agentic framework that generates simulation-ready indoor environments from natural language prompts. SceneSmith constructs scenes through successive stages$\unicode{x2013}$from architectural layout to furniture placement to small object population$\unicode{x2013}$each implemented as an interaction among VLM agents: designer, critic, and orchestrator. The framework tightly integrates asset generation through text-to-3D synthesis for static objects, dataset retrieval for articulated objects, and physical property estimation. SceneSmith generates 3-6x more objects than prior methods, with 2% inter-object collisions and 96% of objects remaining stable under physics simulation. In a user study with 205 participants, it achieves 92% average realism and 91% average prompt faithfulness win rates against baselines. We further demonstrate that these environments can be used in an end-to-end pipeline for automatic robot policy evaluation.

92. 【2602.07049】Enhancing IMU-Based Online Handwriting Recognition via Contrastive Learning with Zero Inference Overhead

链接https://arxiv.org/abs/2602.07049

作者:Jindong Li,Dario Zanca,Vincent Christlein,Tim Hamann,Jens Barth,Peter Kämpf,Björn Eskofier

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Online handwriting recognition, inertial measurement units, measurement units opens, Online handwriting, digital devices

备注

点击查看摘要

Abstract:Online handwriting recognition using inertial measurement units opens up handwriting on paper as input for digital devices. Doing it on edge hardware improves privacy and lowers latency, but entails memory constraints. To address this, we propose Error-enhanced Contrastive Handwriting Recognition (ECHWR), a training framework designed to improve feature representation and recognition accuracy without increasing inference costs. ECHWR utilizes a temporary auxiliary branch that aligns sensor signals with semantic text embeddings during the training phase. This alignment is maintained through a dual contrastive objective: an in-batch contrastive loss for general modality alignment and a novel error-based contrastive loss that distinguishes between correct signals and synthetic hard negatives. The auxiliary branch is discarded after training, which allows the deployed model to keep its original, efficient architecture. Evaluations on the OnHW-Words500 dataset show that ECHWR significantly outperforms state-of-the-art baselines, reducing character error rates by up to 7.4% on the writer-independent split and 10.4% on the writer-dependent split. Finally, although our ablation studies indicate that solving specific challenges require specific architectural and objective configurations, error-based contrastive loss shows its effectiveness for handling unseen writing styles.

93. 【2602.10361】ENIGMA: EEG-to-Image in 15 Minutes Using Less Than 1% of the Parameters

链接https://arxiv.org/abs/2602.10361

作者:Reese Kneeland,Wangshu Jiang,Ugo Bruzadin Nunes,Paul Steven Scotti,Arnaud Delorme,Jonathan Xu

类目:Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

关键词:accessible computing resources, affordable scanning hardware, effective on affordable, computing resources, easily and quickly

备注

点击查看摘要

Abstract:To be practical for real-life applications, models for brain-computer interfaces must be easily and quickly deployable on new subjects, effective on affordable scanning hardware, and small enough to run locally on accessible computing resources. To directly address these current limitations, we introduce ENIGMA, a multi-subject electroencephalography (EEG)-to-Image decoding model that reconstructs seen images from EEG recordings and achieves state-of-the-art (SOTA) performance on the research-grade THINGS-EEG2 and consumer-grade AllJoined-1.6M benchmarks, while fine-tuning effectively on new subjects with as little as 15 minutes of data. ENIGMA boasts a simpler architecture and requires less than 1% of the trainable parameters necessary for previous approaches. Our approach integrates a subject-unified spatio-temporal backbone along with a set of multi-subject latent alignment layers and an MLP projector to map raw EEG signals to a rich visual latent space. We evaluate our approach using a broad suite of image reconstruction metrics that have been standardized in the adjacent field of fMRI-to-Image research, and we describe the first EEG-to-Image study to conduct extensive behavioral evaluations of our reconstructions using human raters. Our simple and robust architecture provides a significant performance boost across both research-grade and consumer-grade EEG hardware, and a substantial improvement in fine-tuning efficiency and inference cost. Finally, we provide extensive ablations to determine the architectural choices most responsible for our performance gains in both single and multi-subject cases across multiple benchmark datasets. Collectively, our work provides a substantial step towards the development of practical brain-computer interface applications.

94. 【2602.10359】Beyond Calibration: Confounding Pathology Limits Foundation Model Specificity in Abdominal Trauma CT

链接https://arxiv.org/abs/2602.10359

作者:Jineel H Raythatha,Shuchang Ye,Jeremy Hsu,Jinman Kim

类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:compound distribution shift, heterogeneous imaging appearances, practice requires evaluating, Translating foundation models, Translating foundation

备注: 26 pages, 4 figures, 4 tables

点击查看摘要

Abstract:Purpose: Translating foundation models into clinical practice requires evaluating their performance under compound distribution shift, where severe class imbalance coexists with heterogeneous imaging appearances. This challenge is relevant for traumatic bowel injury, a rare but high-mortality diagnosis. We investigated whether specificity deficits in foundation models are associated with heterogeneity in the negative class. Methods: This retrospective study used the multi-institutional, RSNA Abdominal Traumatic Injury CT dataset (2019-2023), comprising scans from 23 centres. Two foundation models (MedCLIP, zero-shot; RadDINO, linear probe) were compared against three task-specific approaches (CNN, Transformer, Ensemble). Models were trained on 3,147 patients (2.3% bowel injury prevalence) and evaluated on an enriched 100-patient test set. To isolate negative-class effects, specificity was assessed in patients without bowel injury who had concurrent solid organ injury (n=58) versus no abdominal pathology (n=50). Results: Foundation models achieved equivalent discrimination to task-specific models (AUC, 0.64-0.68 versus 0.58-0.64) with higher sensitivity (79-91% vs 41-74%) but lower specificity (33-50% vs 50-88%). All models demonstrated high specificity in patients without abdominal pathology (84-100%). When solid organ injuries were present, specificity declined substantially for foundation models (50-51 percentage points) compared with smaller reductions of 12-41 percentage points for task-specific models. Conclusion: Foundation models matched task-specific discrimination without task-specific training, but their specificity deficits were driven primarily by confounding negative-class heterogeneity rather than prevalence alone. Susceptibility to negative-class heterogeneity decreased progressively with labelled training, suggesting adaptation is required before clinical implementation.

95. 【2602.10315】Uncertainty-Aware Ordinal Deep Learning for cross-Dataset Diabetic Retinopathy Grading

链接https://arxiv.org/abs/2602.10315

作者:Ali El Bellaj,Aya Benradi,Salman El Youssoufi,Taha El Marzouki,Mohammed-Amine Cheddadi

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:impaired insulin utilization, insufficient insulin production, chronic metabolic disorder, metabolic disorder characterized, persistent hyperglycemia due

备注

点击查看摘要

Abstract:Diabetes mellitus is a chronic metabolic disorder characterized by persistent hyperglycemia due to insufficient insulin production or impaired insulin utilization. One of its most severe complications is diabetic retinopathy (DR), a progressive retinal disease caused by microvascular damage, leading to hemorrhages, exudates, and potential vision loss. Early and reliable detection of DR is therefore critical for preventing irreversible blindness. In this work, we propose an uncertainty-aware deep learning framework for automated DR severity grading that explicitly models the ordinal nature of disease progression. Our approach combines a convolutional backbone with lesion-query attention pooling and an evidential Dirichlet-based ordinal regression head, enabling both accurate severity prediction and principled estimation of predictive uncertainty. The model is trained using an ordinal evidential loss with annealed regularization to encourage calibrated confidence under domain shift. We evaluate the proposed method on a multi-domain training setup combining APTOS, Messidor-2, and a subset of EyePACS fundus datasets. Experimental results demonstrate strong cross-dataset generalization, achieving competitive classification accuracy and high quadratic weighted kappa on held-out test sets, while providing meaningful uncertainty estimates for low-confidence cases. These results suggest that ordinal evidential learning is a promising direction for robust and clinically reliable diabetic retinopathy grading.

Subjects:

Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2602.10315 [eess.IV]

(or
arXiv:2602.10315v1 [eess.IV] for this version)

https://doi.org/10.48550/arXiv.2602.10315

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
96. 【2602.10155】A Systematic Review on Data-Driven Brain Deformation Modeling for Image-Guided Neurosurgery

链接https://arxiv.org/abs/2602.10155

作者:Tiago Assis,Colin P. Galvin,Joshua P. Castillo,Nazim Haouchine,Marta Kersten-Oertel,Zeyu Gao,Mireia Crispin-Ortuzar,Stephen J. Price,Thomas Santarius,Yangming Ou,Sarah Frisken,Nuno C. Garcia,Alexandra J. Golby,Reuben Dorent,Ines P. Machado

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:reliable image-guided neurosurgery, tumor resection induce, resection induce tissue, induce tissue motion, misaligns preoperative planning

备注: 31 pages, 7 figures, 3 tables. Submitted to Medical Image Analysis

点击查看摘要

Abstract:Accurate compensation of brain deformation is a critical challenge for reliable image-guided neurosurgery, as surgical manipulation and tumor resection induce tissue motion that misaligns preoperative planning images with intraoperative anatomy and longitudinal studies. In this systematic review, we synthesize recent AI-driven approaches developed between January 2020 and April 2025 for modeling and correcting brain deformation. A comprehensive literature search was conducted in PubMed, IEEE Xplore, Scopus, and Web of Science, with predefined inclusion and exclusion criteria focused on computational methods applied to brain deformation compensation for neurosurgical imaging, resulting in 41 studies meeting these criteria. We provide a unified analysis of methodological strategies, including deep learning-based image registration, direct deformation field regression, synthesis-driven multimodal alignment, resection-aware architectures addressing missing correspondences, and hybrid models that integrate biomechanical priors. We also examine dataset utilization, reported evaluation metrics, validation protocols, and how uncertainty and generalization have been assessed across studies. While AI-based deformation models demonstrate promising performance and computational efficiency, current approaches exhibit limitations in out-of-distribution robustness, standardized benchmarking, interpretability, and readiness for clinical deployment. Our review highlights these gaps and outlines opportunities for future research aimed at achieving more robust, generalizable, and clinically translatable deformation compensation solutions for neurosurgical guidance. By organizing recent advances and critically evaluating evaluation practices, this work provides a comprehensive foundation for researchers and clinicians engaged in developing and applying AI-based brain deformation methods.

97. 【2602.10124】URBAN-SPIN: A street-level bikeability index to inform design implementations in historical city centres

链接https://arxiv.org/abs/2602.10124

作者:Haining Ding,Chenxi Wang,Michal Gath-Morad

类目:Physics and Society (physics.soc-ph); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)

关键词:vulnerable road users, road users directly, users directly exposed, Cambridge Cycling Experience, Cycling Experience Video

备注: 32 pages, 10 figures

点击查看摘要

Abstract:Cycling is reported by an average of 35\% of adults at least once per week across 28 countries, and as vulnerable road users directly exposed to their surroundings, cyclists experience the street at an intensity unmatched by other modes. Yet the street-level features that shape this experience remain under-analysed, particularly in historical urban contexts where spatial constraints rule out large-scale infrastructural change and where typological context is often overlooked. This study develops a perception-led, typology-based, and data-integrated framework that explicitly models street typologies and their sub-classifications to evaluate how visual and spatial configurations shape cycling experience. Drawing on the Cambridge Cycling Experience Video Dataset (CCEVD), a first-person and handlebar-mounted corpus developed in this study, we extract fine-grained streetscape indicators with computer vision and pair them with built-environment variables and subjective ratings from a Balanced Incomplete Block Design (BIBD) survey, thereby constructing a typology-sensitive Bikeability Index that integrates subjective and perceived dimensions with physical metrics for segment-level comparison. Statistical analysis shows that perceived bikeability arises from cumulative, context-specific interactions among features. While greenness and openness consistently enhance comfort and pleasure, enclosure, imageability, and building continuity display threshold or divergent effects contingent on street type and subtype. AI-assisted visual redesigns further demonstrate that subtle, targeted changes can yield meaningful perceptual gains without large-scale structural interventions. The framework offers a transferable model for evaluating and improving cycling conditions in heritage cities through perceptually attuned, typology-aware design strategies.