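This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.
Statistics
412 papers were updated today, including:
- Natural language processing: 49
- Information retrieval: 16
- Computer vision: 60
Natural Language Processing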
1. 【2602.18429】VIRAASAT: Traversing Novel Paths for Indian Cultural Reasoning
Link: https://arxiv.org/abs/2602.18429
Authors: Harshul Raj Surana, Arijit Maji, Aryan Vats, Akash Ghosh, Sriparna Saha, Amit Sheth
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Large Language Models, Large Language, made significant progress, Indian Culture, Language Models
Abstract:Large Language Models (LLMs) have made significant progress in reasoning tasks across various domains such as mathematics and coding. However, their performance deteriorates in tasks requiring rich socio-cultural knowledge and diverse local contexts, particularly those involving Indian culture. Existing cultural benchmarks (i) are manually crafted, (ii) contain single-hop questions testing factual recall, and (iii) are prohibitively costly to scale, leaving this deficiency largely unmeasured. To address this, we introduce VIRAASAT, a novel, semi-automated approach for generating a culture-specific multi-hop question-answering dataset for Indian culture. VIRAASAT leverages a Knowledge Graph comprising more than 700 expert-curated cultural artifacts, covering 13 key attributes of Indian culture (history, festivals, etc.). VIRAASAT spans all 28 states and 8 Union Territories, yielding more than 3,200 multi-hop questions that necessitate chained cultural reasoning. We evaluate current State-of-the-Art (SOTA) LLMs on VIRAASAT and identify key limitations in reasoning wherein fine-tuning on Chain-of-Thought (CoT) traces fails to ground and synthesize low-probability facts. To bridge this gap, we propose a novel framework named Symbolic Chain-of-Manipulation (SCoM). Adapting the Chain-of-Manipulation paradigm, we train the model to simulate atomic Knowledge Graph manipulations internally. SCoM teaches the model to reliably traverse the topological structure of the graph. Experiments on Supervised Fine-Tuning (SFT) demonstrate that SCoM outperforms standard CoT baselines by up to 20%. We release the VIRAASAT dataset along with our findings, laying a strong foundation towards building Culturally Aware Reasoning Models.
2. 【2602.18425】RVR: Retrieve-Verify-Retrieve for Comprehensive Question Answering
Link: https://arxiv.org/abs/2602.18425
Authors: Deniz Qian, Hung-Ting Chen, Eunsol Choi
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Comprehensively retrieving diverse, Comprehensively retrieving, retrieving diverse documents, retrieving diverse, crucial to address
Comments: 18 pages, 12 figures, 12 tables
Abstract:Comprehensively retrieving diverse documents is crucial to address queries that admit a wide range of valid answers. We introduce retrieve-verify-retrieve (RVR), a multi-round retrieval framework designed to maximize answer coverage. Initially, a retriever takes the original query and returns a candidate document set, followed by a verifier that identifies a high-quality subset. For subsequent rounds, the query is augmented with previously verified documents to uncover answers that are not yet covered in previous rounds. RVR is effective even with off-the-shelf retrievers, and fine-tuning retrievers for our inference procedure brings further gains. Our method outperforms baselines, including agentic search approaches, achieving at least 10% relative and 3% absolute gain in complete recall percentage on a multi-answer retrieval dataset (QAMPARI). We also see consistent gains on two out-of-domain datasets (QUEST and WebQuestionsSP) across different base retrievers. Our work presents a promising iterative approach for comprehensive answer recall leveraging a verifier and adapting retrievers to a new inference scenario.
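The retrieve-verify-retrieve loop described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the word-overlap retriever, the keyword verifier, the toy corpus, and every function name here are hypothetical stand-ins.

```python
from typing import Callable, List, Set

# Toy corpus (hypothetical); a real system would query a document index.
CORPUS = [
    "paris is a capital in europe",
    "berlin is a capital in europe",
    "tokyo is a capital in asia",
    "bananas are yellow",
]

def overlap_retriever(query: str, exclude: Set[str], k: int = 1) -> List[str]:
    """Rank unseen documents by word overlap with the query (stand-in retriever)."""
    q_tokens = set(query.split())
    scored = [(len(q_tokens & set(doc.split())), doc)
              for doc in CORPUS if doc not in exclude]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
    return [doc for _, doc in ranked[:k]]

def keyword_verifier(query: str, doc: str) -> bool:
    """Accept a document only if it contains every word of the original query."""
    doc_tokens = set(doc.split())
    return all(tok in doc_tokens for tok in query.split())

def retrieve_verify_retrieve(
    query: str,
    retrieve: Callable[[str, Set[str]], List[str]],
    verify: Callable[[str, str], bool],
    rounds: int = 3,
) -> List[str]:
    """Multi-round retrieval: verified documents are appended to the next query."""
    verified: List[str] = []
    current_query = query
    for _ in range(rounds):
        candidates = retrieve(current_query, set(verified))
        accepted = [d for d in candidates if d not in verified and verify(query, d)]
        if not accepted:
            break  # nothing new passed verification, so stop early
        verified.extend(accepted)
        # Augment the query with everything verified so far.
        current_query = query + " " + " ".join(verified)
    return verified

results = retrieve_verify_retrieve("capital europe", overlap_retriever, keyword_verifier)
```

In this toy run the loop stops in the third round, when the only remaining candidate fails verification.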
3. 【2602.18420】SPQ: An Ensemble Technique for Large Language Model Compression
Link: https://arxiv.org/abs/2602.18420
Authors: Jiamin Yao, Eren Gultepe
Subjects: Computation and Language (cs.CL)
Keywords: large language model, combines variance-retained singular, post-training linear quantization, language model, singular value decomposition
Comments: Accepted to LREC 2026 Main Conference
Abstract:This study presents an ensemble technique, SPQ (SVD-Pruning-Quantization), for large language model (LLM) compression that combines variance-retained singular value decomposition (SVD), activation-based pruning, and post-training linear quantization. Each component targets a different source of inefficiency: i) pruning removes redundant neurons in MLP layers, ii) SVD reduces attention projections into compact low-rank factors, and iii) 8-bit quantization uniformly compresses all linear layers. At matched compression ratios, SPQ outperforms individual methods (SVD-only, pruning-only, or quantization-only) in perplexity, demonstrating the benefit of combining complementary techniques. Applied to LLaMA-2-7B, SPQ achieves up to 75% memory reduction while maintaining or improving perplexity (e.g., WikiText-2 5.47 to 4.91) and preserving accuracy on downstream benchmarks such as C4, TruthfulQA, and GSM8K. Compared to strong baselines like GPTQ and SparseGPT, SPQ offers competitive perplexity and accuracy while using less memory (6.86 GB vs. 7.16 GB for GPTQ). Moreover, SPQ improves inference throughput over GPTQ, achieving up to a 1.9x speedup, which further enhances its practicality for real-world deployment. By combining layer-aware, complementary compression techniques, SPQ may enable practical deployment of LLMs in memory-constrained environments. Code is available at: this https URL
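To make the quantization component concrete, here is a minimal sketch of 8-bit linear quantization. It assumes a symmetric, per-tensor scheme, which the abstract does not specify, and the function names are hypothetical. It does not cover the SVD or pruning components.

```python
from typing import List, Tuple

def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to the signed range [-127, 127] (assumed scheme)."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer level and clip to the representable range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map integer levels back to approximate floating-point weights."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each restored weight differs from the original by at most half a quantization step (`scale / 2`), which is the rounding error of the linear grid.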
4. 【2602.18417】Subgroups of $U(d)$ Induce Natural RNN and Transformer Architectures
Link: https://arxiv.org/abs/2602.18417
Authors: Joshua Nunley
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: paper presents, presents a direct, direct framework, framework for sequence, hidden states
Comments: 12 pages, 3 figures, 8 tables
Abstract:This paper presents a direct framework for sequence models with hidden states on closed subgroups of U(d). We use a minimal axiomatic setup and derive recurrent and transformer templates from a shared skeleton in which subgroup choice acts as a drop-in replacement for state space, tangent projection, and update map. We then specialize to O(d) and evaluate orthogonal-state RNN and transformer models on Tiny Shakespeare and Penn Treebank under parameter-matched settings. We also report a general linear-mixing extension in tangent space, which applies across subgroup choices and improves finite-budget performance in the current O(d) experiments.
5. 【2602.18351】Validating Political Position Predictions of Arguments
Link: https://arxiv.org/abs/2602.18351
Authors: Jordan Robinson, Angus R. Williams, Katie Atkinson, Anthony G. Cohn
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: widely accepted gold, accepted gold standard, requires capturing subjective, requires capturing, widely accepted
Comments: 13 pages, 6 figures, 6 tables. Under review
Abstract:Real-world knowledge representation often requires capturing subjective, continuous attributes -- such as political positions -- that conflict with pairwise validation, the widely accepted gold standard for human evaluation. We address this challenge through a dual-scale validation framework applied to political stance prediction in argumentative discourse, combining pointwise and pairwise human annotation. Using 22 language models, we construct a large-scale knowledge base of political position predictions for 23,228 arguments drawn from 30 debates that appeared on the UK political television programme Question Time. Pointwise evaluation shows moderate human-model agreement (Krippendorff's $\alpha=0.578$), reflecting intrinsic subjectivity, while pairwise validation reveals substantially stronger alignment between human- and model-derived rankings ($\alpha=0.86$ for the best model). This work contributes: (i) a practical validation methodology for subjective continuous knowledge that balances scalability with reliability; (ii) a validated structured argumentation knowledge base enabling graph-based reasoning and retrieval-augmented generation in political domains; and (iii) evidence that ordinal structure can be extracted from pointwise language model predictions from inherently subjective real-world discourse, advancing knowledge representation capabilities for domains where traditional symbolic or categorical approaches are insufficient.
6. 【2602.18346】Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System
Link: https://arxiv.org/abs/2602.18346
Authors: Pavithra PM Nair, Preethu Rose Anish
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: artificial intelligence offers, intelligence offers transformative, offers transformative potential, jurisdictions like India, artificial intelligence
Abstract:In jurisdictions like India, where courts face an extensive backlog of cases, artificial intelligence offers transformative potential for legal judgment prediction. A critical subset of this backlog comprises appellate cases, which are formal decisions issued by higher courts reviewing the rulings of lower courts. To this end, we present Vichara, a novel framework tailored to the Indian judicial system that predicts and explains appellate judgments. Vichara processes English-language appellate case proceeding documents and decomposes them into decision points. Decision points are discrete legal determinations that encapsulate the legal issue, deciding authority, outcome, reasoning, and temporal context. The structured representation isolates the core determinations and their context, enabling accurate predictions and interpretable explanations. Vichara's explanations follow a structured format inspired by the IRAC (Issue-Rule-Application-Conclusion) framework and adapted for Indian legal reasoning. This enhances interpretability, allowing legal professionals to assess the soundness of predictions efficiently. We evaluate Vichara on two datasets, PredEx and the expert-annotated subset of the Indian Legal Documents Corpus (ILDC_expert), using four large language models: GPT-4o mini, Llama-3.1-8B, Mistral-7B, and Qwen2.5-7B. Vichara surpasses existing judgment prediction benchmarks on both datasets, with GPT-4o mini achieving the highest performance (F1: 81.5 on PredEx, 80.3 on ILDC_expert), followed by Llama-3.1-8B. Human evaluation of the generated explanations across Clarity, Linking, and Usefulness metrics highlights GPT-4o mini's superior interpretability.
7. 【2602.18333】On the "Induction Bias" in Sequence Models
Link: https://arxiv.org/abs/2602.18333
Authors: M.Reza Ebrahimi, Michaël Defferrard, Sunny Panchal, Roland Memisevic
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: remarkable practical success, transformer-based language models, remarkable practical, practical success, success of transformer-based
Abstract:Despite the remarkable practical success of transformer-based language models, recent work has raised concerns about their ability to perform state tracking. In particular, a growing body of literature has shown this limitation primarily through failures in out-of-distribution (OOD) generalization, such as length extrapolation. In this work, we shift attention to the in-distribution implications of these limitations. We conduct a large-scale experimental study of the data efficiency of transformers and recurrent neural networks (RNNs) across multiple supervision regimes. We find that the amount of training data required by transformers grows much more rapidly with state-space size and sequence length than for RNNs. Furthermore, we analyze the extent to which learned state-tracking mechanisms are shared across different sequence lengths. We show that transformers exhibit negligible or even detrimental weight sharing across lengths, indicating that they learn length-specific solutions in isolation. In contrast, recurrent models exhibit effective amortized learning by sharing weights across lengths, allowing data from one sequence length to improve performance on others. Together, these results demonstrate that state tracking remains a fundamental challenge for transformers, even when training and evaluation distributions match.
8. 【2602.18326】Predicting Contextual Informativeness for Vocabulary Learning using Deep Learning
Link: https://arxiv.org/abs/2602.18326
Authors: Tao Wu, Adam Kapelner
Subjects: Computation and Language (cs.CL)
Keywords: high school student, deep learning system, automatically identifies informative, identifies informative contextual, modern deep learning
Comments: 8 pages, 3 figures, 4 tables
Abstract:We describe a modern deep learning system that automatically identifies informative contextual examples ("contexts") for first language vocabulary instruction for high school students. Our paper compares three modeling approaches: (i) an unsupervised similarity-based strategy using MPNet's uniformly contextualized embeddings, (ii) a supervised framework built on instruction-aware, fine-tuned Qwen3 embeddings with a nonlinear regression head and (iii) model (ii) plus handcrafted context features. We introduce a novel metric called the Retention Competency Curve to visualize trade-offs between the discarded proportion of good contexts and the "good-to-bad" context ratio, providing a compact, unified lens on model performance. Model (iii) delivers the most dramatic gains with performance of a good-to-bad ratio of 440 all while only throwing out 70% of the good contexts. In summary, we demonstrate that a modern embedding model on neural network architecture, when guided by human supervision, results in a low-cost large supply of near-perfect contexts for teaching vocabulary for a variety of target words.
9. 【2602.18324】PsihoRo: Depression and Anxiety Romanian Text Corpus
Link: https://arxiv.org/abs/2602.18324
Authors: Alexandra Ciobotaru, Ana-Maria Bucur, Liviu P. Dinu
Subjects: Computation and Language (cs.CL)
Keywords: analyze human psychology, mental health, human psychology, Psychological corpora, mental
Comments: This article was accepted at LREC 2026
Abstract:Psychological corpora in NLP are collections of texts used to analyze human psychology, emotions, and mental health. These texts allow researchers to study psychological constructs, detect mental health issues and analyze emotional language. However, mental health data can be difficult to collect correctly from social media, due to suppositions made by the collectors. A more pragmatic strategy involves gathering data through open-ended questions and then assessing this information with self-report screening surveys. This method was employed successfully for English, a language with a lot of psychological NLP resources. However, this cannot be stated for Romanian, which currently has no open-source mental health corpus. To address this gap, we have created the first corpus for depression and anxiety in Romanian, by utilizing a form with 6 open-ended questions along with the standardized PHQ-9 and GAD-7 screening questionnaires. Although it may seem small, consisting of the texts of 205 respondents, PsihoRo is a first step towards understanding and analyzing texts regarding the mental health of the Romanian population. We employ statistical analysis, text analysis using Romanian LIWC, emotion detection and topic modeling to show the most important features of this newly introduced resource to the NLP community.
10. 【2602.18307】VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean
Link: https://arxiv.org/abs/2602.18307
Authors: Yutong Xin, Qiaochu Chen, Greg Durrett, Işil Dillig
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG); Programming Languages (cs.PL)
Keywords: interactive theorem proving, achieved striking results, Large language models, theorem proving, language models
Abstract:Large language models have achieved striking results in interactive theorem proving, particularly in Lean. However, most benchmarks for LLM-based proof automation are drawn from mathematics in the Mathlib ecosystem, whereas proofs in software verification are developed inside definition-rich codebases with substantial project-specific libraries. We introduce VeriSoftBench, a benchmark of 500 Lean 4 proof obligations drawn from open-source formal-methods developments and packaged to preserve realistic repository context and cross-file dependencies. Our evaluation of frontier LLMs and specialized provers yields three observations. First, provers tuned for Mathlib-style mathematics transfer poorly to this repository-centric setting. Second, success is strongly correlated with transitive repository dependence: tasks whose proofs draw on large, multi-hop dependency closures are less likely to be solved. Third, providing curated context restricted to a proof's dependency closure improves performance relative to exposing the full repository, but nevertheless leaves substantial room for improvement. Our benchmark and evaluation suite are released at this https URL.
11. 【2602.18301】On the Semantic and Syntactic Information Encoded in Proto-Tokens for One-Step Text Reconstruction
Link: https://arxiv.org/abs/2602.18301
Authors: Ivan Bondarenko, Egor Palkin, Fedor Tikunov
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: large language models, Autoregressive large language, language models, generate text, large language
Abstract:Autoregressive large language models (LLMs) generate text token-by-token, requiring n forward passes to produce a sequence of length n. Recent work, Exploring the Latent Capacity of LLMs for One-Step Text Reconstruction (Mezentsev and Oseledets), shows that frozen LLMs can reconstruct hundreds of tokens from only two learned proto-tokens in a single forward pass, suggesting a path beyond the autoregressive paradigm. In this paper, we study what information these proto-tokens encode and how they behave under reconstruction and controlled constraints. We perform a series of experiments aimed at disentangling semantic and syntactic content in the two proto-tokens, analyzing stability properties of the e-token, and visualizing attention patterns to the e-token during reconstruction. Finally, we test two regularization schemes for "imposing" semantic structure on the e-token using teacher embeddings, including an anchor-based loss and a relational distillation objective. Our results indicate that the m-token tends to capture semantic information more strongly than the e-token under standard optimization; anchor-based constraints trade off sharply with reconstruction accuracy; and relational distillation can transfer batch-level semantic relations into the proto-token space without sacrificing reconstruction quality, supporting the feasibility of future non-autoregressive seq2seq systems that predict proto-tokens as an intermediate representation.
12. 【2602.18297】Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory
Link: https://arxiv.org/abs/2602.18297
Authors: Usman Anwar, Tim Bakker, Dana Kianfar, Cristina Pinneri, Christos Louizos
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT)
Keywords: analyze reasoning traces, attributes of interest, code generation, LLM-based systems, systems that analyze
Comments: First two authors contributed equally
Abstract:Chain-of-thought (CoT) monitors are LLM-based systems that analyze reasoning traces to detect when outputs may exhibit attributes of interest, such as test-hacking behavior during code generation. In this paper, we use information-theoretic analysis to show that non-zero mutual information between CoT and output is a necessary but not sufficient condition for CoT monitorability. We identify two sources of approximation error that may undermine the performance of CoT monitors in practice: information gap, which measures the extent to which the monitor can extract the information available in CoT, and elicitation error, which measures the extent to which the monitor approximates the optimal monitoring function. We further demonstrate that CoT monitorability can be systematically improved through targeted training objectives. To this end, we propose two complementary approaches: (a) an oracle-based method that directly rewards the monitored model for producing CoTs that maximize monitor accuracy, and (b) a more practical, label-free approach that maximizes conditional mutual information between outputs and CoTs. Across multiple different environments, we show both methods significantly improve monitor accuracy while preventing CoT degeneration even when training against a monitor, thereby mitigating reward hacking when the task reward is imperfectly specified.
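The mutual-information quantity this abstract relies on can be illustrated with a plug-in estimate over discrete labels. This is only a sketch of the general formula, assuming CoT traces and outputs have already been reduced to discrete categories; it is not the paper's estimator or training objective, and the function name is hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, List, Tuple

def mutual_information(pairs: List[Tuple[Hashable, Hashable]]) -> float:
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    n = len(pairs)
    joint = Counter(pairs)
    marginal_x = Counter(x for x, _ in pairs)
    marginal_y = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        p_x = marginal_x[x] / n
        p_y = marginal_y[y] / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi

# CoT label and output label are independent: the CoT carries no information.
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]
# CoT label fully determines the output label: one bit of information.
dependent = [(0, 0), (1, 1), (0, 0), (1, 1)]
```

When the estimate is zero, as in the `independent` sample, no monitor could predict the output attribute from the CoT, which matches the necessary condition stated in the abstract.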
13. 【2602.18262】Simplifying Outcomes of Language Model Component Analyses with ELIA
Link: https://arxiv.org/abs/2602.18262
Authors: Aaron Louis Eidt, Nils Feldhus
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Large Language Models, developed powerful tools, Explainable Language Interpretability, Language Interpretability Analysis, workings of Large
Comments: EACL 2026 System Demonstrations. GitHub: [this https URL](https://github.com/aaron0eidt/ELIA)
Abstract:While mechanistic interpretability has developed powerful tools to analyze the internal workings of Large Language Models (LLMs), their complexity has created an accessibility gap, limiting their use to specialists. We address this challenge by designing, building, and evaluating ELIA (Explainable Language Interpretability Analysis), an interactive web application that simplifies the outcomes of various language model component analyses for a broader audience. The system integrates three key techniques -- Attribution Analysis, Function Vector Analysis, and Circuit Tracing -- and introduces a novel methodology: using a vision-language model to automatically generate natural language explanations (NLEs) for the complex visualizations produced by these methods. The effectiveness of this approach was empirically validated through a mixed-methods user study, which revealed a clear preference for interactive, explorable interfaces over simpler, static visualizations. A key finding was that the AI-powered explanations helped bridge the knowledge gap for non-experts; a statistical analysis showed no significant correlation between a user's prior LLM experience and their comprehension scores, suggesting that the system reduced barriers to comprehension across experience levels. We conclude that an AI system can indeed simplify complex model analyses, but its true power is unlocked when paired with thoughtful, user-centered design that prioritizes interactivity, specificity, and narrative guidance.
14. 【2602.18232】Thinking by Subtraction: Confidence-Driven Contrastive Decoding for LLM Reasoning
Link: https://arxiv.org/abs/2602.18232
Authors: Lexiang Tang, Weihao Gao, Bingchen Zhao, Lu Ma, Qiao Jin, Bang Yang, Yuexian Zou
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: large language model, inference-time computation uniformly, Recent work, uniformly improves correctness, computation uniformly improves
Abstract:Recent work on test-time scaling for large language model (LLM) reasoning typically assumes that allocating more inference-time computation uniformly improves correctness. However, prior studies show that reasoning uncertainty is highly localized: a small subset of low-confidence tokens disproportionately contributes to reasoning errors and unnecessary output expansion. Motivated by this observation, we propose Thinking by Subtraction, a confidence-driven contrastive decoding approach that improves reasoning reliability through targeted token-level intervention. Our method, Confidence-Driven Contrastive Decoding (CCD), detects low-confidence tokens during decoding and intervenes selectively at these positions. It constructs a contrastive reference by replacing high-confidence tokens with minimal placeholders, and refines predictions by subtracting this reference distribution at low-confidence locations. Experiments show that CCD significantly improves accuracy across mathematical reasoning benchmarks while substantially reducing output length, with minimal KV-cache overhead. As a training-free method, CCD enhances reasoning reliability through targeted low-confidence intervention without computational redundancy. Our code will be made available at: this https URL.
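The selective intervention described in this abstract can be sketched for a single decoding step. The subtraction rule `(1 + alpha) * main - alpha * reference`, the confidence threshold, and all names below are assumptions borrowed from the generic contrastive-decoding form, not details confirmed by the paper; how the reference logits are produced is outside this sketch.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_step(
    main_logits: List[float],
    ref_logits: List[float],
    threshold: float = 0.6,  # hypothetical confidence cutoff
    alpha: float = 1.0,      # hypothetical contrast strength
) -> int:
    """Return the chosen token id, intervening only when confidence is low."""
    probs = softmax(main_logits)
    if max(probs) >= threshold:
        # High confidence: keep the ordinary greedy choice.
        return max(range(len(probs)), key=probs.__getitem__)
    # Low confidence: subtract the reference logits before choosing.
    adjusted = [(1 + alpha) * m - alpha * r
                for m, r in zip(main_logits, ref_logits)]
    return max(range(len(adjusted)), key=adjusted.__getitem__)

confident_choice = contrastive_step([5.0, 0.0, 0.0], [5.0, 0.0, 0.0])
uncertain_choice = contrastive_step([1.0, 0.9, 0.0], [1.0, 0.0, 0.0])
```

In the second call the top probability is below the threshold, so the token favored by the reference is penalized and the runner-up is selected instead.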
15. 【2602.18217】Information-Theoretic Storage Cost in Sentence Comprehension
Link: https://arxiv.org/abs/2602.18217
Authors: Kohei Kajikawa, Shinnosuke Isono, Ethan Gotlieb Wilcox
Subjects: Computation and Language (cs.CL)
Keywords: Real-time sentence comprehension, sentence comprehension imposes, Real-time sentence, anticipate future input, maintain contextual information
Abstract:Real-time sentence comprehension imposes a significant load on working memory, as comprehenders must maintain contextual information to anticipate future input. While measures of such load have played an important role in psycholinguistic theories, they have been formalized, largely, using symbolic grammars, which assign discrete, uniform costs to syntactic predictions. This study proposes a measure of processing storage cost based on an information-theoretic formalization, as the amount of information previous words carry about future context, under uncertainty. Unlike previous discrete, grammar-based metrics, this measure is continuous, theory-neutral, and can be estimated from pre-trained neural language models. The validity of this approach is demonstrated through three analyses in English: our measure (i) recovers well-known processing asymmetries in center embeddings and relative clauses, (ii) correlates with a grammar-based storage cost in a syntactically-annotated corpus, and (iii) predicts reading-time variance in two large-scale naturalistic datasets over and above baseline models with traditional information-based predictors.
16. 【2602.18176】Improving Sampling for Masked Diffusion Models via Information Gain
Link: https://arxiv.org/abs/2602.18176
Authors: Kaisen Yang, Jayden Teoh, Kaicheng Yang, Yitong Zhang, Alex Lamb
Subjects: Computation and Language (cs.CL)
Keywords: Masked Diffusion Models, offer greater flexibility, Diffusion Models, require careful planning, autoregressive models
Comments: [this https URL](https://github.com/yks23/Information-Gain-Sampler)
Abstract:Masked Diffusion Models (MDMs) offer greater flexibility in decoding order than autoregressive models but require careful planning to achieve high-quality generation. Existing samplers typically adopt greedy heuristics, prioritizing positions with the highest local certainty to decode at each step. Through failure case analysis, we identify a fundamental limitation of this approach: it neglects the downstream impact of current decoding choices on subsequent steps and fails to minimize cumulative uncertainty. In particular, these methods do not fully exploit the non-causal nature of MDMs, which enables evaluating how a decoding decision reshapes token probabilities/uncertainty across all remaining masked positions. To bridge this gap, we propose the Info-Gain Sampler, a principled decoding framework that balances immediate uncertainty with information gain over future masked tokens. Extensive evaluations across diverse architectures and tasks (reasoning, coding, creative writing, and image generation) demonstrate that Info-Gain Sampler consistently outperforms existing samplers for MDMs. For instance, it achieves a 3.6% improvement in average accuracy on reasoning tasks and a 63.1% win-rate in creative writing. Notably, on reasoning tasks it reduces cumulative uncertainty from 78.4 to 48.6, outperforming the best baseline by a large margin. The code will be available at this https URL.
17. 【2602.18171】Click it or Leave it: Detecting and Spoiling Clickbait with Informativeness Measures and Large Language Models
Link: https://arxiv.org/abs/2602.18171
Authors: Wojciech Michaluk, Tymoteusz Urban, Mateusz Kubita, Soveatin Kuntur, Anna Wroblewska
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: undermine user trust, Clickbait headlines degrade, user trust, headlines degrade, degrade the quality
Abstract:Clickbait headlines degrade the quality of online information and undermine user trust. We present a hybrid approach to clickbait detection that combines transformer-based text embeddings with linguistically motivated informativeness features. Using natural language processing techniques, we evaluate classical vectorizers, word embedding baselines, and large language model embeddings paired with tree-based classifiers. Our best-performing model, XGBoost over embeddings augmented with 15 explicit features, achieves an F1-score of 91%, outperforming TF-IDF, Word2Vec, GloVe, LLM prompt-based classification, and feature-only baselines. The proposed feature set enhances interpretability by highlighting salient linguistic cues such as second-person pronouns, superlatives, numerals, and attention-oriented punctuation, enabling transparent and well-calibrated clickbait predictions. We release code and trained models to support reproducible research.
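The four linguistic cues named in this abstract can be turned into simple counts, as in the sketch below. The word lists, the tokenizer, and the feature names are hypothetical choices for illustration; the paper's full set of 15 features and their definitions are not given in the abstract.

```python
import re
from typing import Dict

# Hypothetical, deliberately short word lists for illustration only.
SECOND_PERSON = {"you", "your", "yours"}
SUPERLATIVES = {"best", "worst", "most", "least", "greatest"}

def informativeness_features(headline: str) -> Dict[str, int]:
    """Count a few surface cues associated with clickbait headlines."""
    tokens = re.findall(r"[a-z']+|\d+", headline.lower())
    return {
        "second_person": sum(t in SECOND_PERSON for t in tokens),
        "superlatives": sum(t in SUPERLATIVES for t in tokens),
        "numerals": sum(t.isdigit() for t in tokens),
        "attention_punct": len(re.findall(r"[!?]", headline)),
    }

features = informativeness_features("You won't believe the 10 best tricks!")
```

In a pipeline like the one described, such counts would be concatenated with text embeddings before being passed to a tree-based classifier.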
18. 【2602.18154】FENCE: A Financial and Multimodal Jailbreak Detection Dataset
Link: https://arxiv.org/abs/2602.18154
Authors: Mirae Kim, Seonghun Jeong, Youngjun Kwak
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Keywords: Large Language Models, Vision Language Models, Large Language, Vision Language, Jailbreaking poses
Comments: Accepted at LREC 2026
Abstract:Jailbreaking poses a significant risk to the deployment of Large Language Models (LLMs) and Vision Language Models (VLMs). VLMs are particularly vulnerable because they process both text and images, creating broader attack surfaces. However, available resources for jailbreak detection are scarce, particularly in finance. To address this gap, we present FENCE, a bilingual (Korean-English) multimodal dataset for training and evaluating jailbreak detectors in financial applications. FENCE emphasizes domain realism through finance-relevant queries paired with image-grounded threats. Experiments with commercial and open-source VLMs reveal consistent vulnerabilities, with GPT-4o showing measurable attack success rates and open-source models displaying greater exposure. A baseline detector trained on FENCE achieves 99 percent in-distribution accuracy and maintains strong performance on external benchmarks, underscoring the dataset's robustness for training reliable detection models. FENCE provides a focused resource for advancing multimodal jailbreak detection in finance and for supporting safer, more reliable AI systems in sensitive domains. Warning: This paper includes example data that may be offensive.
19. 【2602.18152】The Statistical Signature of LLMs
Link: https://arxiv.org/abs/2602.18152
Authors: Ortal Hadad, Edoardo Loru, Jacopo Nudo, Niccolò Di Marco, Matteo Cinelli, Walter Quattrociocchi
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY); Physics and Society (physics.soc-ph)
Keywords: remains incompletely characterized, language remains incompletely, Large language models, Large language, high-dimensional distributions
Abstract:Large language models generate text through probabilistic sampling from high-dimensional distributions, yet how this process reshapes the structural statistical organization of language remains incompletely characterized. Here we show that lossless compression provides a simple, model-agnostic measure of statistical regularity that differentiates generative regimes directly from surface text. We analyze compression behavior across three progressively more complex information ecosystems: controlled human-LLM continuations, generative mediation of a knowledge infrastructure (Wikipedia vs. Grokipedia), and fully synthetic social interaction environments (Moltbook vs. Reddit). Across settings, compression reveals a persistent structural signature of probabilistic generation. In controlled and mediated contexts, LLM-produced language exhibits higher structural regularity and compressibility than human-written text, consistent with a concentration of output within highly recurrent statistical patterns. However, this signature shows scale dependence: in fragmented interaction environments the separation attenuates, suggesting a fundamental limit to surface-level distinguishability at small scales. This compressibility-based separation emerges consistently across models, tasks, and domains and can be observed directly from surface text without relying on model internals or semantic evaluation. Overall, our findings introduce a simple and robust framework for quantifying how generative systems reshape textual production, offering a structural perspective on the evolving complexity of communication.
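A compression-based regularity measure of the kind this abstract describes can be approximated with a standard lossless compressor. The sketch below uses `zlib` and a compressed-to-raw size ratio; the choice of compressor, the ratio definition, and the two synthetic texts are assumptions for illustration, not the paper's protocol.

```python
import random
import string
import zlib

def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size; lower means more regular text."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("text must be non-empty")
    return len(zlib.compress(raw, 9)) / len(raw)

# A highly repetitive text versus a pseudo-random one of the same length.
repetitive = "the cat sat on the mat. " * 50
rng = random.Random(0)
varied = "".join(rng.choice(string.ascii_lowercase + " ")
                 for _ in range(len(repetitive)))

ratio_repetitive = compression_ratio(repetitive)
ratio_varied = compression_ratio(varied)
```

The repetitive text yields a much lower ratio than the pseudo-random one. Comparing human-written and model-generated text would use the same function on length-matched samples.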
20. 【2602.18145】Detecting Contextual Hallucinations in LLMs with Frequency-Aware Attention
Link: https://arxiv.org/abs/2602.18145
Authors: Siya Qi, Yudong Chen, Runcong Zhao, Qinglin Zhu, Zhanghao Hu, Wei Liu, Yulan He, Zheng Yuan, Lin Gui
Categories: Computation and Language (cs.CL)
Keywords: large language models, detection is critical, critical for ensuring, ensuring the reliability, reliability of large
Comments: 25 pages, 10 figures
Abstract:Hallucination detection is critical for ensuring the reliability of large language models (LLMs) in context-based generation. Prior work has explored intrinsic signals available during generation, among which attention offers a direct view of grounding behavior. However, existing approaches typically rely on coarse summaries that fail to capture fine-grained instabilities in attention. Inspired by signal processing, we introduce a frequency-aware perspective on attention by analyzing its variation during generation. We model attention distributions as discrete signals and extract high-frequency components that reflect rapid local changes in attention. Our analysis reveals that hallucinated tokens are associated with high-frequency attention energy, reflecting fragmented and unstable grounding behavior. Based on this insight, we develop a lightweight hallucination detector using high-frequency attention features. Experiments on the RAGTruth and HalluRAG benchmarks show that our approach achieves performance gains over verification-based, internal-representation-based, and attention-based methods across models and tasks.
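The idea of treating attention as a discrete signal and measuring its rapid local changes can be sketched as follows. This is only a minimal illustration: the first-difference filter, the function name `high_frequency_energy`, and the two toy sequences are assumptions, since the abstract does not specify which signal or which high-pass transform the authors use.

```python
def high_frequency_energy(signal: list[float]) -> float:
    """Mean squared first difference: a crude high-pass energy of a 1-D signal."""
    if len(signal) < 2:
        return 0.0
    diffs = [b - a for a, b in zip(signal, signal[1:])]
    return sum(d * d for d in diffs) / len(diffs)


# Hypothetical attention mass placed on the context at successive generation steps.
stable = [0.60, 0.62, 0.61, 0.63, 0.62, 0.61]
unstable = [0.60, 0.10, 0.70, 0.05, 0.65, 0.15]

print("stable:  ", high_frequency_energy(stable))
print("unstable:", high_frequency_energy(unstable))
```

Under this toy definition, the rapidly oscillating sequence carries far more high-frequency energy, which is the kind of feature a lightweight detector could consume.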
21. 【2602.18137】Agentic Adversarial QA for Improving Domain-Specific LLMs
Link: https://arxiv.org/abs/2602.18137
Authors: Vincent Grari, Ciprian Tomoiaga, Sylvain Lamprier, Tatsunori Hashimoto, Marcin Detyniecki
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Large Language Models, broad internet corpora, Large Language, Language Models, extensive pretraining
Comments: 9 pages, 1 figure
Abstract:Large Language Models (LLMs), despite extensive pretraining on broad internet corpora, often struggle to adapt effectively to specialized domains. There is growing interest in fine-tuning these models for such domains; however, progress is constrained by the scarcity and limited coverage of high-quality, task-relevant data. To address this, synthetic data generation methods such as paraphrasing or knowledge extraction are commonly applied. Although these approaches excel at factual recall and conceptual knowledge, they suffer from two critical shortcomings: (i) they provide minimal support for interpretive reasoning capabilities in these specialized domains, and (ii) they often produce synthetic corpora that are excessively large and redundant, resulting in poor sample efficiency. To overcome these gaps, we propose an adversarial question-generation framework that produces a compact set of semantically challenging questions. These questions are constructed by comparing the outputs of the model to be adapted and a robust expert model grounded in reference documents, using an iterative, feedback-driven process designed to reveal and address comprehension gaps. Evaluation on specialized subsets of the LegalBench corpus demonstrates that our method achieves greater accuracy with substantially fewer synthetic samples.
22. 【2602.18092】Perceived Political Bias in LLMs Reduces Persuasive Abilities
Link: https://arxiv.org/abs/2602.18092
Authors: Matthew DiGiuseppe, Joshua Robison
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Keywords: correct public misconceptions, spread misinformation, correct public, public misconceptions, Abstract
Comments: 39 pages, 10 figures
Abstract:Conversational AI has been proposed as a scalable way to correct public misconceptions and spread misinformation. Yet its effectiveness may depend on perceptions of its political neutrality. As LLMs enter partisan conflict, elites increasingly portray them as ideologically aligned. We test whether these credibility attacks reduce LLM-based persuasion. In a preregistered U.S. survey experiment (N=2144), participants completed a three-round conversation with ChatGPT about a personally held economic policy misconception. Compared to a neutral control, a short message indicating that the LLM was biased against the respondent's party attenuated persuasion by 28%. Transcript analysis indicates that the warnings alter the interaction: respondents push back more and engage less receptively. These findings suggest that the persuasive impact of conversational AI is politically contingent, constrained by perceptions of partisan alignment.
23. 【2602.18037】Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards
Link: https://arxiv.org/abs/2602.18037
Authors: Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, Masashi Sugiyama
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Reinforcement Learning, Human Feedback, Learning from Human, modern Language Models, modern Language
Comments: 25 pages, 15 figures
Abstract:Reinforcement Learning from Human Feedback (RLHF) or Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs). A common problem is reward hacking, where the policy may exploit inaccuracies of the reward and learn an unintended behavior. Most previous works address this by limiting the policy update with a Kullback-Leibler (KL) penalty towards a reference model. We propose a different framing: Train the LM in a way that biases policy updates towards regions in which the reward is more accurate. First, we derive a theoretical connection between the accuracy of a reward model and the flatness of an optimum at convergence. Gradient regularization (GR) can then be used to bias training to flatter regions and thereby maintain reward model accuracy. We confirm these results by showing that the gradient norm and reward accuracy are empirically correlated in RLHF. We then show that Reference Resets of the KL penalty implicitly use GR to find flatter regions with higher reward accuracy. We further improve on this by proposing to use explicit GR with an efficient finite-difference estimate. Empirically, GR performs better than a KL penalty across a diverse set of RL experiments with LMs. GR achieves a higher GPT-judged win-rate in RLHF, avoids overly focusing on the format in rule-based math rewards, and prevents hacking the judge in LLM-as-a-Judge math tasks.
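To make the "explicit GR with a finite-difference estimate" idea concrete, here is a minimal sketch on a toy quadratic loss. It penalizes the squared gradient norm, whose gradient is a Hessian-vector product approximated by one extra gradient evaluation. The penalty form, the names `gr_gradient`, `lam`, and `eps`, and the toy loss are all assumptions; the abstract does not give the authors' exact estimator, and their setting maximizes reward rather than minimizing a loss.

```python
from typing import Callable, List

Vector = List[float]


def gr_gradient(
    grad_fn: Callable[[Vector], Vector],
    theta: Vector,
    lam: float = 0.1,
    eps: float = 1e-3,
) -> Vector:
    """Gradient of L(theta) + lam * 0.5 * ||grad L(theta)||^2.

    The penalty's gradient is H @ g (Hessian times gradient), estimated as
    (grad L(theta + eps * g) - grad L(theta)) / eps.
    """
    g = grad_fn(theta)
    shifted = [t + eps * gi for t, gi in zip(theta, g)]
    g_shifted = grad_fn(shifted)
    hvp = [(gs - gi) / eps for gs, gi in zip(g_shifted, g)]
    return [gi + lam * hi for gi, hi in zip(g, hvp)]


# Toy loss 0.5 * sum(a_i * theta_i^2) with one flat and one sharp direction.
curvature = [1.0, 10.0]


def toy_grad(theta: Vector) -> Vector:
    return [a * t for a, t in zip(curvature, theta)]


regularized = gr_gradient(toy_grad, [1.0, 1.0])
print(regularized)
```

On this toy problem the sharp direction (curvature 10) receives a much larger extra push than the flat one, which is the sense in which such a penalty biases training towards flatter regions.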
24. 【2602.18029】Towards More Standardized AI Evaluation: From Models to Agents
Link: https://arxiv.org/abs/2602.18029
Authors: Ali El Filali, Inès Bedar
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: machine learning lifecycle, learning lifecycle, final checkpoint, machine learning, Evaluation
Comments: 19 pages, 3 figures
Abstract:Evaluation is no longer a final checkpoint in the machine learning lifecycle. As AI systems evolve from static models to compound, tool-using agents, evaluation becomes a core control function. The question is no longer "How good is the model?" but "Can we trust the system to behave as intended, under change, at scale?" Yet most evaluation practices remain anchored in assumptions inherited from the model-centric era: static benchmarks, aggregate scores, and one-off success criteria. This paper argues that such approaches increasingly obscure rather than illuminate system behavior. We examine how evaluation pipelines themselves introduce silent failure modes, why high benchmark scores routinely mislead teams, and how agentic systems fundamentally alter the meaning of performance measurement. Rather than proposing new metrics or harder benchmarks, we aim to clarify the role of evaluation in the AI era, and especially for agents: not as performance theater, but as a measurement discipline that conditions trust, iteration, and governance in non-deterministic systems.
25. 【2602.18008】NIMMGen: Learning Neural-Integrated Mechanistic Digital Twins with LLMs
Link: https://arxiv.org/abs/2602.18008
Authors: Zihan Guan, Rituparna Datta, Mengxuan Hu, Shunshun Liu, Aiying Zhang, Prasanna Balachandran, Sheng Li, Anil Vullikanti
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Mechanistic models, Mechanistic models encode, LLM-generated mechanistic models, encode scientific knowledge, Neural-Integrated Mechanistic Modeling
Comments: 19 pages, 6 figures
Abstract:Mechanistic models encode scientific knowledge about dynamical systems and are widely used in downstream scientific and policy applications. Recent work has explored LLM-based agentic frameworks to automatically construct mechanistic models from data; however, existing problem settings substantially oversimplify real-world conditions, leaving it unclear whether LLM-generated mechanistic models are reliable in practice. To address this gap, we introduce the Neural-Integrated Mechanistic Modeling (NIMM) evaluation framework, which evaluates LLM-generated mechanistic models under realistic settings with partial observations and diversified task objectives. Our evaluation reveals fundamental challenges in current baselines, ranging from model effectiveness to code-level correctness. Motivated by these findings, we design NIMMGen, an agentic framework for neural-integrated mechanistic modeling that enhances code correctness and practical validity through iterative refinement. Experiments across three datasets from diversified scientific domains demonstrate its strong performance. We also show that the learned mechanistic models support counterfactual intervention simulation.
26. 【2602.17981】Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering
Link: https://arxiv.org/abs/2602.17981
Authors: Amine Kobeissi, Philippe Langlais
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: high stakes settings, Retrieval-augmented generation, financial question answering, exact context needed, long regulatory filings
Comments:
Abstract:Retrieval-augmented generation is increasingly used for financial question answering over long regulatory filings, yet reliability depends on retrieving the exact context needed to justify answers in high stakes settings. We study a frequent failure mode in which the correct document is retrieved but the page or chunk that contains the answer is missed, leading the generator to extrapolate from incomplete context. Despite its practical significance, this within-document retrieval failure mode has received limited systematic attention in the Financial Question Answering (QA) literature. We evaluate retrieval at multiple levels of granularity, document, page, and chunk level, and introduce an oracle based analysis to provide empirical upper bounds on retrieval and generative performance. On a 150 question subset of FinanceBench, we reproduce and compare diverse retrieval strategies including dense, sparse, hybrid, and hierarchical methods with reranking and query reformulation. Across methods, gains in document discovery tend to translate into stronger page recall, yet oracle performance still suggests headroom for page and chunk level retrieval. To target this gap, we introduce a domain fine-tuned page scorer that treats pages as an intermediate retrieval unit between documents and chunks. Unlike prior passage-based hierarchical retrieval, we fine-tune a bi-encoder specifically for page-level relevance on financial filings, exploiting the semantic coherence of pages. Overall, our results demonstrate a significant improvement in page recall and chunk retrieval.
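The within-document failure mode described here (right document, wrong page) is easy to see when the same retrieval run is scored at two granularities. The sketch below uses a simple hit rate at k; the function name `hit_rate_at_k`, the filing identifiers, and the page numbers are hypothetical, and the paper's exact recall definition may differ.

```python
from typing import Hashable, Sequence, Set


def hit_rate_at_k(
    retrieved: Sequence[Sequence[Hashable]],
    gold: Sequence[Set[Hashable]],
    k: int,
) -> float:
    """Fraction of queries whose top-k results contain at least one gold item."""
    hits = sum(
        1 for run, answers in zip(retrieved, gold)
        if any(item in answers for item in run[:k])
    )
    return hits / len(gold)


# Each result is a (document id, page number) pair; all values are made up.
page_runs = [
    [("10K_2022", 41), ("10K_2022", 7)],   # right document, right page
    [("10K_2021", 3), ("10K_2021", 88)],   # right document, wrong pages
    [("10Q_2023", 12), ("10K_2020", 5)],   # right page only at rank 2
]
gold_pages = [{("10K_2022", 41)}, {("10K_2021", 55)}, {("10K_2020", 5)}]

# Collapse pages to documents to score the same runs at document level.
doc_runs = [[doc for doc, _ in run] for run in page_runs]
gold_docs = [{doc for doc, _ in answers} for answers in gold_pages]

doc_hit = hit_rate_at_k(doc_runs, gold_docs, k=2)
page_hit = hit_rate_at_k(page_runs, gold_pages, k=2)
print(f"document hit@2: {doc_hit:.2f}, page hit@2: {page_hit:.2f}")
```

The gap between the two numbers corresponds to queries where the generator would receive the correct filing but not the page that contains the answer.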
27. 【2602.17949】CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications
Link: https://arxiv.org/abs/2602.17949
Authors: Victoria Blake, Mathew Miller, Jamie Novak, Sze-yuan Ooi, Blanca Gallego
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Medical Language System, Concept Unique Identifiers, Unified Medical Language, Unique Identifiers, Unified Medical
Comments: 30 pages, 6 figures, 4 tables
Abstract:Background: Clinical named entity recognition tools commonly map free text to Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs). For many downstream tasks, however, the clinically meaningful unit is not a single CUI but a concept set comprising related synonyms, subtypes, and supertypes. Constructing such concept sets is labour-intensive, inconsistently performed, and poorly supported by existing tools, particularly for NLP pipelines that operate directly on UMLS CUIs. Methods: We present CUICurate, a Graph-based retrieval-augmented generation (GraphRAG) framework for automated UMLS concept set curation. A UMLS knowledge graph (KG) was constructed and embedded for semantic retrieval. For each target concept, candidate CUIs were retrieved from the KG, followed by large language model (LLM) filtering and classification steps comparing two LLMs (GPT-5 and GPT-5-mini). The framework was evaluated on five lexically heterogeneous clinical concepts against a manually curated benchmark and gold-standard concept sets. Results: Across all concepts, CUICurate produced substantially larger and more complete concept sets than the manual benchmarks whilst matching human precision. Comparisons between the two LLMs found that GPT-5-mini achieved higher recall during filtering, while GPT-5 produced classifications that more closely aligned with clinician judgements. Outputs were stable across repeated runs and computationally inexpensive. Conclusions: CUICurate offers a scalable and reproducible approach to support UMLS concept set curation that substantially reduces manual effort. By integrating graph-based retrieval with LLM reasoning, the framework produces focused candidate concept sets that can be adapted to clinical NLP pipelines for different phenotyping and analytic requirements.
28. 【2602.17937】Analyzing LLM Instruction Optimization for Tabular Fact Verification
Link: https://arxiv.org/abs/2602.17937
Authors: Xiaotang Du, Giwon Hong, Wai-Chung Kwan, Rohit Saxena, Ivan Titov, Pasquale Minervini, Emily Allaway
Categories: Computation and Language (cs.CL); Programming Languages (cs.PL)
Keywords: large language models, model-agnostic approach, Instruction optimization, approach to enhancing, large language
Comments:
Abstract:Instruction optimization provides a lightweight, model-agnostic approach to enhancing the reasoning performance of large language models (LLMs). This paper presents the first systematic comparison of instruction optimization, based on the DSPy optimization framework, for tabular fact verification. We evaluate four out-of-the-box prompting techniques that cover both text-only prompting and code use: direct prediction, Chain-of-Thought (CoT), ReAct with SQL tools, and CodeAct with Python execution. We study three optimizers from the DSPy framework -- COPRO, MiPROv2, and SIMBA -- across four benchmarks and three model families. We find that instruction optimization consistently improves verification accuracy, with MiPROv2 yielding the most stable gains for CoT, and SIMBA providing the largest benefits for ReAct agents, particularly at larger model scales. Behavioral analyses reveal that SIMBA encourages more direct reasoning paths by applying heuristics, thereby improving numerical comparison abilities in CoT reasoning and helping avoid unnecessary tool calls in ReAct agents. Across different prompting techniques, CoT remains effective for tabular fact checking, especially with smaller models. Although ReAct agents built with larger models can achieve competitive performance, they require careful instruction optimization.
29. 【2602.17911】Condition-Gated Reasoning for Context-Dependent Biomedical Question Answering
Link: https://arxiv.org/abs/2602.17911
Authors: Jash Rajesh Parekh, Wonbin Kweon, Joey Chan, Rezarta Islamaj, Robert Leaman, Pengcheng Jiang, Chih-Hsuan Wei, Zhizheng Wang, Zhiyong Lu, Jiawei Han
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: knowledge applies uniformly, Current biomedical question, real-world clinical reasoning, Current biomedical, systems often assume
Comments:
Abstract:Current biomedical question answering (QA) systems often assume that medical knowledge applies uniformly, yet real-world clinical reasoning is inherently conditional: nearly every decision depends on patient-specific factors such as comorbidities and contraindications. Existing benchmarks do not evaluate such conditional reasoning, and retrieval-augmented or graph-based methods lack explicit mechanisms to ensure that retrieved knowledge is applicable to given context. To address this gap, we propose CondMedQA, the first benchmark for conditional biomedical QA, consisting of multi-hop questions whose answers vary with patient conditions. Furthermore, we propose Condition-Gated Reasoning (CGR), a novel framework that constructs condition-aware knowledge graphs and selectively activates or prunes reasoning paths based on query conditions. Our findings show that CGR more reliably selects condition-appropriate answers while matching or exceeding state-of-the-art performance on biomedical QA benchmarks, highlighting the importance of explicitly modeling conditionality for robust medical reasoning.
30. 【2602.17907】Improving Neural Topic Modeling with Semantically-Grounded Soft Label Distributions
Link: https://arxiv.org/abs/2602.17907
Authors: Raymond Li, Amirhossein Abaskohi, Chuyuan Li, Gabriel Murray, Giuseppe Carenini
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: overlooking contextual information, Traditional neural topic, Traditional neural, neural topic models, overlooking contextual
Comments: 20 pages, 5 figures
Abstract:Traditional neural topic models are typically optimized by reconstructing the document's Bag-of-Words (BoW) representations, overlooking contextual information and struggling with data sparsity. In this work, we propose a novel approach to construct semantically-grounded soft label targets using Language Models (LMs) by projecting the next token probabilities, conditioned on a specialized prompt, onto a pre-defined vocabulary to obtain contextually enriched supervision signals. By training the topic models to reconstruct the soft labels using the LM hidden states, our method produces higher-quality topics that are more closely aligned with the underlying thematic structure of the corpus. Experiments on three datasets show that our method achieves substantial improvements in topic coherence and purity over existing baselines. We also introduce a retrieval-based metric, which shows that our approach significantly outperforms existing methods in identifying semantically similar documents, highlighting its effectiveness for retrieval-oriented applications.
31. 【2602.17905】Games That Teach, Chats That Convince: Comparing Interactive and Static Formats for Persuasive Learning
Link: https://arxiv.org/abs/2602.17905
Authors: Seyed Hossein Alavi, Zining Wang, Shruthi Chockkalingam, Raymond T. Ng, Vered Shwartz
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
Keywords: delivery formats shape, formats shape learning, held constant, Interactive systems, persuade and educate
Comments:
Abstract:Interactive systems such as chatbots and games are increasingly used to persuade and educate on sustainability-related topics, yet it remains unclear how different delivery formats shape learning and persuasive outcomes when content is held constant. Grounding on identical arguments and factual content across conditions, we present a controlled user study comparing three modes of information delivery: static essays, conversational chatbots, and narrative text-based games. Across subjective measures, the chatbot condition consistently outperformed the other modes and increased perceived importance of the topic. However, perceived learning did not reliably align with objective outcomes: participants in the text-based game condition reported learning less than those reading essays, yet achieved higher scores on a delayed (24-hour) knowledge quiz. Additional exploratory analyses further suggest that common engagement proxies, such as verbosity and interaction length, are more closely related to subjective experience than to actual learning. These findings highlight a dissociation between how persuasive experiences feel and what participants retain, and point to important design trade-offs between interactivity, realism, and learning in persuasive systems and serious games.
32. 【2602.17881】Understanding Unreliability of Steering Vectors in Language Models: Geometric Predictors and the Limits of Linear Approximations
Link: https://arxiv.org/abs/2602.17881
Authors: Joschka Braun
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: controlling language model, language model behavior, Steering, inference time, controlling language
Comments: Master's Thesis, University of Tübingen. 89 pages, 34 figures. Portions of this work were published at the ICLR 2025 Workshop on Foundation Models in the Wild (see [arXiv:2505.22637](https://arxiv.org/abs/2505.22637))
Abstract:Steering vectors are a lightweight method for controlling language model behavior by adding a learned bias to the activations at inference time. Although effective on average, steering effect sizes vary across samples and are unreliable for many target behaviors. In my thesis, I investigate why steering reliability differs across behaviors and how it is impacted by steering vector training data. First, I find that higher cosine similarity between training activation differences predicts more reliable steering. Second, I observe that behavior datasets where positive and negative activations are better separated along the steering direction are more reliably steerable. Finally, steering vectors trained on different prompt variations are directionally distinct, yet perform similarly well and exhibit correlated efficacy across datasets. My findings suggest that steering vectors are unreliable when the latent target behavior representation is not effectively approximated by the linear steering direction. Taken together, these insights offer a practical diagnostic for steering unreliability and motivate the development of more robust steering methods that explicitly account for non-linear latent behavior representations.
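The first diagnostic in this abstract, cosine similarity between training activation differences, can be sketched with a few lines of plain Python. The sketch assumes the steering vector is the mean of positive-minus-negative activation differences, which is a common construction but not one the abstract confirms; the helper names and the two-dimensional toy activations are likewise hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy activations for prompts showing (pos) or lacking (neg) a target behavior.
pos_acts = [[1.0, 0.2], [0.9, 0.1], [1.1, 0.3]]
neg_acts = [[0.0, 0.1], [0.1, 0.2], [-0.1, 0.0]]

diffs = [[p - n for p, n in zip(pv, nv)] for pv, nv in zip(pos_acts, neg_acts)]
steering_vector = mean_vector(diffs)   # assumed mean-difference construction
consistency = mean_pairwise_cosine(diffs)

# Differences pointing in unrelated directions, for contrast.
scattered = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print("steering vector:", steering_vector)
print("consistent diffs:", round(consistency, 3))
print("scattered diffs: ", round(mean_pairwise_cosine(scattered), 3))
```

A high average cosine means the per-sample differences agree on a direction, which is the situation in which the thesis reports more reliable steering.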
33. 【2602.17867】ADAPT: Hybrid Prompt Optimization for LLM Feature Visualization
Link: https://arxiv.org/abs/2602.17867
Authors: João N. Cardoso, Arlindo L. Oliveira, Bruno Martins
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: space requires identifying, LLM activation space, requires identifying inputs, encoded by learned, activation space requires
Comments:
Abstract:Understanding what features are encoded by learned directions in LLM activation space requires identifying inputs that strongly activate them. Feature visualization, which optimizes inputs to maximally activate a target direction, offers an alternative to costly dataset search approaches, but remains underexplored for LLMs due to the discrete nature of text. Furthermore, existing prompt optimization techniques are poorly suited to this domain, which is highly prone to local minima. To overcome these limitations, we introduce ADAPT, a hybrid method combining beam search initialization with adaptive gradient-guided mutation, designed around these failure modes. We evaluate on Sparse Autoencoder latents from Gemma 2 2B, proposing metrics grounded in dataset activation statistics to enable rigorous comparison, and show that ADAPT consistently outperforms prior methods across layers and latent types. Our results establish that feature visualization for LLMs is tractable, but requires design assumptions tailored to the domain.
34. 【2602.17850】Mind the Style: Impact of Communication Style on Human-Chatbot Interaction
Link: https://arxiv.org/abs/2602.17850
Authors: Erik Derner, Dalibor Kučera, Aditya Gulati, Ayoub Bagheri, Nuria Oliver
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: success remain unclear, increasingly mediate everyday, mediate everyday digital, task success remain, agents increasingly mediate
Comments:
Abstract:Conversational agents increasingly mediate everyday digital interactions, yet the effects of their communication style on user experience and task success remain unclear. Addressing this gap, we describe the results of a between-subject user study where participants interact with one of two versions of a chatbot called NAVI which assists users in an interactive map-based 2D navigation task. The two chatbot versions differ only in communication style: one is friendly and supportive, while the other is direct and task-focused. Our results show that the friendly style increases subjective satisfaction and significantly improves task completion rates among female participants only, while no baseline differences between female and male participants were observed in a control condition without the chatbot. Furthermore, we find little evidence of users mimicking the chatbot's style, suggesting limited linguistic accommodation. These findings highlight the importance of user- and task-sensitive conversational agents and support that communication style personalization can meaningfully enhance interaction quality and performance.
35. 【2602.17848】On the scaling relationship between cloze probabilities and language model next-token prediction
Link: https://arxiv.org/abs/2602.17848
Authors: Cassandra L. Jacobs, Morgan Grobol
Categories: Computation and Language (cs.CL)
Keywords: Recent work, reading time data, work has shown, predictive power, power for eye
Comments:
Abstract:Recent work has shown that larger language models have better predictive power for eye movement and reading time data. While even the best models under-allocate probability mass to human responses, larger models assign higher-quality estimates of next tokens and their likelihood of production in cloze data because they are less sensitive to lexical co-occurrence statistics while being better aligned semantically to human cloze responses. The results provide support for the claim that the greater memorization capacity of larger models helps them guess more semantically appropriate words, but makes them less sensitive to low-level information that is relevant for word recognition.
36. 【2602.17837】TFL: Targeted Bit-Flip Attack on Large Language Model
Link: https://arxiv.org/abs/2602.17837
Authors: Jingkai Guo, Chaitali Chakrabarti, Deliang Fan
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Large language models, security critical applications, parameter fault injection, Large language, model parameter fault
Comments: 13 pages, 11 figures. Preprint
Abstract:Large language models (LLMs) are increasingly deployed in safety- and security-critical applications, raising concerns about their robustness to model parameter fault injection attacks. Recent studies have shown that bit-flip attacks (BFAs), which exploit computer main memory (i.e., DRAM) vulnerabilities to flip a small number of bits in model weights, can severely disrupt LLM behavior. However, existing BFAs on LLMs largely induce untargeted failures or general performance degradation, offering limited control over manipulating specific or targeted outputs. In this paper, we present TFL, a novel targeted bit-flip attack framework that enables precise manipulation of LLM outputs for selected prompts while causing little or no degradation on unrelated inputs. Within our TFL framework, we propose a novel keyword-focused attack loss to promote attacker-specified target tokens in generative outputs, together with an auxiliary utility score that balances attack effectiveness against collateral performance impact on benign data. We evaluate TFL on multiple LLMs (Qwen, DeepSeek, Llama) and benchmarks (DROP, GSM8K, and TriviaQA). The experiments show that TFL achieves successful targeted LLM output manipulations with fewer than 50 bit flips and a significantly reduced effect on unrelated queries compared to prior BFA approaches. This demonstrates the effectiveness of TFL and positions it as a new class of stealthy and targeted LLM attack.
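To show why a handful of flipped bits can matter so much, the sketch below flips single bits in the IEEE 754 encoding of one float32 value using Python's standard `struct` module. It illustrates only the generic bit-flip mechanism, not the TFL attack itself; the function name `flip_bit_float32` is hypothetical, and real deployments often store weights in other formats such as fp16, bf16, or int8.

```python
import struct


def flip_bit_float32(value: float, bit: int) -> float:
    """Flip one bit (0 = mantissa LSB, 31 = sign) in the float32 encoding of value."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range 0..31")
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped


weight = 0.5  # exactly representable in float32 (bit pattern 0x3F000000)

print("mantissa LSB flipped:", flip_bit_float32(weight, 0))
print("sign bit flipped:    ", flip_bit_float32(weight, 31))
print("exponent MSB flipped:", flip_bit_float32(weight, 30))
```

Flipping the lowest mantissa bit barely changes the weight, while flipping the top exponent bit turns 0.5 into 2^127. This asymmetry is why the choice of which bits to flip dominates the outcome of such attacks.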
37. 【2602.17815】Neural Synchrony Between Socially Interacting Language Models
Link: https://arxiv.org/abs/2602.17815
Authors: Zhining Zhang, Wentao Zhu, Chi Han, Yizhou Wang, Heng Ji
Categories: Computation and Language (cs.CL)
Keywords: Neuroscience has uncovered, human brain activity, contexts involving interaction, social contexts involving, uncovered a fundamental
Comments: Accepted at ICLR 2026
Abstract:Neuroscience has uncovered a fundamental mechanism of our social nature: human brain activity becomes synchronized with others in many social contexts involving interaction. Traditionally, social minds have been regarded as an exclusive property of living beings. Although large language models (LLMs) are widely accepted as powerful approximations of human behavior, with multi-LLM systems being extensively explored to enhance their capabilities, it remains controversial whether they can be meaningfully compared to human social minds. In this work, we explore neural synchrony between socially interacting LLMs as empirical evidence for this debate. Specifically, we introduce neural synchrony during social simulations as a novel proxy for analyzing the sociality of LLMs at the representational level. Through carefully designed experiments, we demonstrate that it reliably reflects both social engagement and temporal alignment in their interactions. Our findings indicate that neural synchrony between LLMs is strongly correlated with their social performance, highlighting an important link between neural synchrony and the social behaviors of LLMs. Our work offers a new perspective to examine the "social minds" of LLMs, highlighting surprising parallels in the internal dynamics that underlie human and LLM social interaction.
38. 【2602.17784】QueryPlot: Generating Geological Evidence Layers using Natural Language Queries for Mineral Exploration
Link: https://arxiv.org/abs/2602.17784
Authors: Meng Ye, Xiao Lin, Georgina Lukoczki, Graham W. Lederer, Yi Yao
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: host specific mineral, requires synthesizing heterogeneous, specific mineral deposit, heterogeneous geological knowledge, mapping requires synthesizing
Comments:
Abstract:Mineral prospectivity mapping requires synthesizing heterogeneous geological knowledge, including textual deposit models and geospatial datasets, to identify regions likely to host specific mineral deposit types. This process is traditionally manual and knowledge-intensive. We present QueryPlot, a semantic retrieval and mapping framework that integrates large-scale geological text corpora with geologic map data using modern Natural Language Processing techniques. We curate descriptive deposit models for over 120 deposit types and transform the State Geologic Map Compilation (SGMC) polygons into structured textual representations. Given a user-defined natural language query, the system encodes both queries and region descriptions using a pretrained embedding model and computes semantic similarity scores to rank and spatially visualize regions as continuous evidence layers. QueryPlot supports compositional querying over deposit characteristics, enabling aggregation of multiple similarity-derived layers for multi-criteria prospectivity analysis. In a case study on tungsten skarn deposits, we demonstrate that embedding-based retrieval achieves high recall of known occurrences and produces prospective regions that closely align with expert-defined permissive tracts. Furthermore, similarity scores can be incorporated as additional features in supervised learning pipelines, yielding measurable improvements in classification performance. QueryPlot is implemented as a web-based system supporting interactive querying, visualization, and export of GIS-compatible prospectivity maps. To support future research, we have made the source code and datasets used in this study publicly available.
39. 【2602.17744】Bayesian Optimality of In-Context Learning with Selective State Spaces
Link: https://arxiv.org/abs/2602.17744
Authors: Di Zhang, Jiaqi Xing
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Statistics Theory (math.ST); Machine Learning (stat.ML)
Keywords: understanding in-context learning, in-context learning, optimal sequential prediction, propose Bayesian optimal, Bayesian optimal sequential
Comments: 17 pages
Click to view abstract
Abstract:We propose Bayesian optimal sequential prediction as a new principle for understanding in-context learning (ICL). Unlike interpretations framing Transformers as performing implicit gradient descent, we formalize ICL as meta-learning over latent sequence tasks. For tasks governed by Linear Gaussian State Space Models (LG-SSMs), we prove a meta-trained selective SSM asymptotically implements the Bayes-optimal predictor, converging to the posterior predictive mean. We further establish a statistical separation from gradient descent, constructing tasks with temporally correlated noise where the optimal Bayesian predictor strictly outperforms any empirical risk minimization (ERM) estimator. Since Transformers can be seen as performing implicit ERM, this demonstrates selective SSMs achieve lower asymptotic risk due to superior statistical efficiency. Experiments on synthetic LG-SSM tasks and a character-level Markov benchmark confirm selective SSMs converge faster to Bayes-optimal risk, show superior sample efficiency with longer contexts in structured-noise settings, and track latent states more robustly than linear Transformers. This reframes ICL from "implicit optimization" to "optimal inference," explaining the efficiency of selective SSMs and offering a principled basis for architecture design.
40. 【2602.17693】A Case Study of Selected PTQ Baselines for Reasoning LLMs on Ascend NPU
Link: https://arxiv.org/abs/2602.17693
Authors: Yuchen Luo, Fangyue Zhu, Ruining Zhou, Mingzhe Huang, Jian Zhu, Fanyu Fan, Wei Shao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: GPU architectures, compared to GPU, remains under-explored compared, crucial for efficient, under-explored compared
Comments:
Click to view abstract
Abstract:Post-Training Quantization (PTQ) is crucial for efficient model deployment, yet its effectiveness on Ascend NPU remains under-explored compared to GPU architectures. This paper presents a case study of representative PTQ baselines applied to reasoning-oriented models such as DeepSeek-R1-Distill-Qwen series (1.5B/7B/14B) and QwQ-32B. We evaluate four distinct algorithms, including AWQ, GPTQ, SmoothQuant, and FlatQuant, to cover the spectrum from weight-only compression to advanced rotation-based methods. Our empirical results reveal significant platform sensitivity. While 4-bit weight-only quantization proves viable for larger models, aggressive 4-bit weight-activation schemes suffer from layer-wise calibration instability on the NPU, leading to logic collapse in long-context reasoning tasks. Conversely, standard 8-bit quantization remains numerically stable. Furthermore, a real-world INT8 deployment demonstrates that although optimized kernels reduce latency, dynamic quantization overheads currently limit end-to-end acceleration. These findings offer a practical reference for the feasibility and limitations of deploying quantized reasoning models on Ascend NPU.
41. 【2602.17691】Tethered Reasoning: Decoupling Entropy from Hallucination in Quantized LLMs via Manifold Steering
Link: https://arxiv.org/abs/2602.17691
Authors: Craig Atkinson
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: low sampling temperatures, temperatures yield repetitive, sampling temperatures yield, language models face, Unified Truth Score
Comments: 16 pages, 6 tables
Click to view abstract
Abstract:Quantized language models face a fundamental dilemma: low sampling temperatures yield repetitive, mode-collapsed outputs, while high temperatures (T 2.0) cause trajectory divergence and semantic incoherence. We present HELIX, a geometric framework that decouples output entropy from hallucination by tethering hidden-state trajectories to a pre-computed truthfulness manifold. HELIX computes a Unified Truth Score (UTS) combining token-level semantic entropy with Mahalanobis distance from the manifold. When UTS indicates trajectory divergence, graduated steering vectors redirect activations toward structurally coherent regions while affecting only 0.2-2.5% of tokens. On 4-bit quantized Granite 4.0 H Small (32B/9B active, hybrid Mamba-Transformer): GSM8K maintains 88.84% accuracy at T = 3.0 (2.81pp degradation from T = 0.5); MMLU maintains 72.49% across 14,042 questions (1.24pp degradation). This demonstrates that high-temperature hallucination is primarily trajectory divergence rather than semantic collapse. Notably, steering the sparse Transformer attention layers (~10% of layers) is sufficient to correct drift in the Mamba-2 state-space formulation. Geometric tethering reveals a previously-masked High-Entropy Creative Reservoir. At T 2.0, steered outputs exhibit 5-20% idea duplication versus 70-80% at conservative settings. Cross-architecture validation (Qwen3-30B-A3B MOE) confirms this phenomenon is architecture-independent, with 46.7% higher unique concept generation. HELIX acts as a syntax tether, enabling exploration of semantic diversity without violating the logical backbone required for valid output. This enables Multi-Temperature Synthesis, generating 200% more unique concepts than single-temperature inference.
42. 【2602.17689】Robust Pre-Training of Medical Vision-and-Language Models with Domain-Invariant Multi-Modal Masked Reconstruction
Link: https://arxiv.org/abs/2602.17689
Authors: Melika Filvantorkaman, Mohsen Piri
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: domain shift caused, models show strong, show strong potential, acquisition protocols, imaging devices
Comments: 28 pages, 3 figures
Click to view abstract
Abstract:Medical vision-language models show strong potential for joint reasoning over medical images and clinical text, but their performance often degrades under domain shift caused by variations in imaging devices, acquisition protocols, and reporting styles. Existing multi-modal pre-training methods largely overlook robustness, treating it as a downstream adaptation problem. In this work, we propose Robust Multi-Modal Masked Reconstruction (Robust-MMR), a self-supervised pre-training framework that explicitly incorporates robustness objectives into masked vision-language learning. Robust-MMR integrates asymmetric perturbation-aware masking, domain-consistency regularization, and modality-resilience constraints to encourage domain-invariant representations. We evaluate Robust-MMR on multiple medical vision-language benchmarks, including medical visual question answering (VQA-RAD, SLAKE, VQA-2019), cross-domain image-text classification (MELINDA), and robust image-caption retrieval (ROCO). Robust-MMR achieves 78.9% cross-domain accuracy on VQA-RAD, outperforming the strongest baseline by 3.8 percentage points, and reaches 74.6% and 77.0% accuracy on SLAKE and VQA-2019, respectively. Under perturbed evaluation, Robust-MMR improves VQA-RAD accuracy from 69.1% to 75.6%. For image-text classification, cross-domain MELINDA accuracy increases from 70.3% to 75.2%, while retrieval experiments show a reduction in mean rank degradation from over 16 to 4.1 under perturbation. Qualitative results further demonstrate improved clinical reasoning for disease detection and structural abnormality assessment. These findings show that explicitly modeling robustness during pre-training leads to more reliable and transferable medical vision-language representations for real-world deployment.
43. 【2602.17687】IRPAPERS: A Visual Document Benchmark for Scientific Retrieval and Question Answering
Link: https://arxiv.org/abs/2602.17687
Authors: Connor Shorten, Augustas Skaburskas, Daniel M. Jones, Charles Pierse, Roberto Esposito, John Trengrove, Etienne Dilocker, Bob van Luijt
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: document processing remains, visual document processing, Recall, achieved remarkable success, processing remains
Comments: 23 pages, 6 figures
Click to view abstract
Abstract:AI systems have achieved remarkable success in processing text and relational data, yet visual document processing remains relatively underexplored. Whereas traditional systems require OCR transcriptions to convert these visual documents into text and metadata, recent advances in multimodal foundation models offer retrieval and generation directly from document images. This raises a key question: How do image-based systems compare to established text-based methods? We introduce IRPAPERS, a benchmark of 3,230 pages from 166 scientific papers, with both an image and an OCR transcription for each page. Using 180 needle-in-the-haystack questions, we compare image- and text-based retrieval and question answering systems. Text retrieval using Arctic 2.0 embeddings, BM25, and hybrid text search achieved 46% Recall@1, 78% Recall@5, and 91% Recall@20, while image-based retrieval reaches 43%, 78%, and 93%, respectively. The two modalities exhibit complementary failures, enabling multimodal hybrid search to outperform either alone, achieving 49% Recall@1, 81% Recall@5, and 95% Recall@20. We further evaluate efficiency-performance tradeoffs with MUVERA and assess multiple multi-vector image embedding models. Among closed-source models, Cohere Embed v4 page image embeddings outperform Voyage 3 Large text embeddings and all tested open-source models, achieving 58% Recall@1, 87% Recall@5, and 97% Recall@20. For question answering, text-based RAG systems achieved higher ground-truth alignment than image-based systems (0.82 vs. 0.71), and both benefit substantially from increased retrieval depth, with multi-document retrieval outperforming oracle single-document retrieval. We analyze the complementary limitations of unimodal text and image representations and identify question types that require one modality over the other. The IRPAPERS dataset and all experimental code are publicly available.
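The Recall@k figures quoted in this abstract can be read with the following minimal sketch. It assumes each question has exactly one gold page, so Recall@k is the fraction of questions whose gold page appears in the top k results; the abstract does not spell out its exact definition, and the question and page identifiers below are invented.

```python
def recall_at_k(rankings, gold, k):
    # Fraction of questions whose single gold page is among the top-k results.
    hits = sum(1 for qid, ranked in rankings.items() if gold[qid] in ranked[:k])
    return hits / len(rankings)

# Hypothetical ranked page IDs per question, and the gold page for each.
rankings = {
    "q1": ["p3", "p1", "p7"],
    "q2": ["p2", "p9", "p4"],
    "q3": ["p5", "p6", "p8"],
}
gold = {"q1": "p1", "q2": "p2", "q3": "p0"}

r1 = recall_at_k(rankings, gold, 1)
r2 = recall_at_k(rankings, gold, 2)
print(f"Recall@1 = {r1:.2f}, Recall@2 = {r2:.2f}")
```

Because the top-k list only grows with k, Recall@k can never decrease as k increases, which matches the pattern in the reported numbers.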
44. 【2602.17681】LATMiX: Learnable Affine Transformations for Microscaling Quantization of LLMs
Link: https://arxiv.org/abs/2602.17681
Authors: Ofir Gordon, Lior Dikstein, Arnon Netzer, Idan Achituve, Hai Victor Habi
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: large language models, Post-training quantization, widely used approach, memory and compute, compute costs
Comments: 24 pages, 4 figures
Click to view abstract
Abstract:Post-training quantization (PTQ) is a widely used approach for reducing the memory and compute costs of large language models (LLMs). Recent studies have shown that applying invertible transformations to activations can significantly improve quantization robustness by reducing activation outliers; however, existing approaches are largely restricted to rotation or Hadamard-based transformations. Moreover, most studies focused primarily on traditional quantization schemes, whereas modern hardware increasingly supports the microscaling (MX) data format. Attempts to combine both showed severe performance degradation, leading prior work to introduce assumptions on the transformations. In this work, we take a complementary perspective. First, we provide a theoretical analysis of transformations under MX quantization by deriving a bound on the quantization error. Our analysis emphasizes the importance of accounting for both the activation distribution and the underlying quantization structure. Building on this analysis, we propose LATMiX, a method that generalizes outlier reduction to learnable invertible affine transformations optimized using standard deep learning tools. Experiments show consistent improvements in average accuracy for MX low-bit quantization over strong baselines on a wide range of zero-shot benchmarks, across multiple model sizes.
45. 【2602.17677】Reducing Text Bias in Synthetically Generated MCQAs for VLMs in Autonomous Driving
Link: https://arxiv.org/abs/2602.17677
Authors: Sutej Kulgod, Sean Ye, Sanchit Tanwar, Christoffer Heckman
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Robotics (cs.RO)
Keywords: Multiple Choice Question, Choice Question Answering, measuring Vision Language, Vision Language Model, Multiple Choice
Comments: 7 pages, 2 figures
Click to view abstract
Abstract:Multiple Choice Question Answering (MCQA) benchmarks are an established standard for measuring Vision Language Model (VLM) performance in driving tasks. However, we observe the known phenomenon that synthetically generated MCQAs are highly susceptible to hidden textual cues that allow models to exploit linguistic patterns rather than visual context. Our results show that a VLM fine-tuned on such data can achieve accuracy comparable to human-validated benchmarks even without visual input. Our proposed method reduces blind accuracy from +66.9% above random to +2.9%, eliminating the vast majority of exploitable textual shortcuts. By decoupling the correct answer from linguistic artifacts and employing a curriculum learning strategy, we force the model to rely on visual grounding, ensuring that performance accurately reflects perceptual understanding.
46. 【2602.17676】Epistemic Traps: Rational Misalignment Driven by Model Misspecification
Link: https://arxiv.org/abs/2602.17676
Authors: Xingcheng Xu, Jingjing Qu, Qiaosheng Zhang, Chaochao Lu, Yanqing Yang, Na Zou, Xia Hu
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Large Language Models, Large Language, pathologies including sycophancy, deployment of Large, persistent behavioral pathologies
Comments:
Click to view abstract
Abstract:The rapid deployment of Large Language Models and AI agents across critical societal and technical domains is hindered by persistent behavioral pathologies including sycophancy, hallucination, and strategic deception that resist mitigation via reinforcement learning. Current safety paradigms treat these failures as transient training artifacts, lacking a unified theoretical framework to explain their emergence and stability. Here we show that these misalignments are not errors, but mathematically rationalizable behaviors arising from model misspecification. By adapting Berk-Nash Rationalizability from theoretical economics to artificial intelligence, we derive a rigorous framework that models the agent as optimizing against a flawed subjective world model. We demonstrate that widely observed failures are structural necessities: unsafe behaviors emerge as either a stable misaligned equilibrium or oscillatory cycles depending on reward scheme, while strategic deception persists as a "locked-in" equilibrium or through epistemic indeterminacy robust to objective risks. We validate these theoretical predictions through behavioral experiments on six state-of-the-art model families, generating phase diagrams that precisely map the topological boundaries of safe behavior. Our findings reveal that safety is a discrete phase determined by the agent's epistemic priors rather than a continuous function of reward magnitude. This establishes Subjective Model Engineering, defined as the design of an agent's internal belief structure, as a necessary condition for robust alignment, marking a paradigm shift from manipulating environmental rewards to shaping the agent's interpretation of reality.
47. 【2602.17674】Lost Before Translation: Social Information Transmission and Survival in AI-AI Communication
Link: https://arxiv.org/abs/2602.17674
Authors: Bijean Ghafouri, Emilio Ferrara
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Keywords: relay information, systems summarize, summarize and relay, inevitably transform
Comments:
Click to view abstract
Abstract:When AI systems summarize and relay information, they inevitably transform it. But how? We introduce an experimental paradigm based on the telephone game to study what happens when AI talks to AI. Across five studies tracking content through AI transmission chains, we find three consistent patterns. The first is convergence, where texts differing in certainty, emotional intensity, and perspectival balance collapse toward a shared default of moderate confidence, muted affect, and analytical structure. The second is selective survival, where narrative anchors persist while the texture of evidence, hedges, quotes, and attributions is stripped away. The third is competitive filtering, where strong arguments survive while weaker but valid considerations disappear when multiple viewpoints coexist. In downstream experiments, human participants rated AI-transmitted content as more credible and polished. Importantly, however, humans also showed degraded factual recall, reduced perception of balance, and diminished emotional resonance. We show that the properties that make AI-mediated content appear authoritative may systematically erode the cognitive and affective diversity on which informed judgment depends.
48. 【2602.17672】Assessing LLM Response Quality in the Context of Technology-Facilitated Abuse
Link: https://arxiv.org/abs/2602.17672
Authors: Vijay Prakash, Majed Almansoori, Donghan Hu, Rahul Chatterjee, Danny Yuxing Huang
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
Keywords: intimate partner violence, leverages digital tools, Technology-facilitated abuse, partner violence, tools to control
Comments:
Click to view abstract
Abstract:Technology-facilitated abuse (TFA) is a pervasive form of intimate partner violence (IPV) that leverages digital tools to control, surveil, or harm survivors. While tech clinics are one of the reliable sources of support for TFA survivors, they face limitations due to staffing constraints and logistical barriers. As a result, many survivors turn to online resources for assistance. With the growing accessibility and popularity of large language models (LLMs), and increasing interest from IPV organizations, survivors may begin to consult LLM-based chatbots before seeking help from tech clinics. In this work, we present the first expert-led manual evaluation of four LLMs - two widely used general-purpose non-reasoning models and two domain-specific models designed for IPV contexts - focused on their effectiveness in responding to TFA-related questions. Using real-world questions collected from literature and online forums, we assess the quality of zero-shot single-turn LLM responses generated with a survivor safety-centered prompt on criteria tailored to the TFA domain. Additionally, we conducted a user study to evaluate the perceived actionability of these responses from the perspective of individuals who have experienced TFA. Our findings, grounded in both expert assessment and user feedback, provide insights into the current capabilities and limitations of LLMs in the TFA context and may inform the design, development, and fine-tuning of future models for this domain. We conclude with concrete recommendations to improve LLM performance for survivor support.
49. 【2602.17671】AI Hallucination from Students' Perspective: A Thematic Analysis
Link: https://arxiv.org/abs/2602.17671
Authors: Abdulhadi Shoufan, Ahmad-Azmi-Abdelhamid Esmaeil
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: large language models, large language, pose a growing, growing threat, hallucinations
Comments:
Click to view abstract
Abstract:As students increasingly rely on large language models, hallucinations pose a growing threat to learning. To mitigate this, AI literacy must expand beyond prompt engineering to address how students should detect and respond to LLM hallucinations. To support this, we need to understand how students experience hallucinations, how they detect them, and why they believe they occur. To investigate these questions, we asked university students three open-ended questions about their experiences with AI hallucinations, their detection strategies, and their mental models of why hallucinations occur. Sixty-three students responded to the survey. Thematic analysis of their responses revealed that reported hallucination issues primarily relate to incorrect or fabricated citations, false information, overconfident but misleading responses, poor adherence to prompts, persistence in incorrect answers, and sycophancy. To detect hallucinations, students rely either on intuitive judgment or on active verification strategies, such as cross-checking with external sources or re-prompting the model. Students' explanations for why hallucinations occur reflected several mental models, including notable misconceptions. Many described AI as a research engine that fabricates information when it cannot locate an answer in its "database." Others attributed hallucinations to issues with training data, inadequate prompting, or the model's inability to understand or verify information. These findings illuminate vulnerabilities in AI-supported learning and highlight the need for explicit instruction in verification protocols, accurate mental models of generative AI, and awareness of behaviors such as sycophancy and confident delivery that obscure inaccuracy. The study contributes empirical evidence for integrating hallucination awareness and mitigation into AI literacy curricula.
Information Retrieval
1. 【2602.18429】VIRAASAT: Traversing Novel Paths for Indian Cultural Reasoning
Link: https://arxiv.org/abs/2602.18429
Authors: Harshul Raj Surana, Arijit Maji, Aryan Vats, Akash Ghosh, Sriparna Saha, Amit Sheth
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Large Language Models, Large Language, made significant progress, Indian Culture, Language Models
Comments:
Click to view abstract
Abstract:Large Language Models (LLMs) have made significant progress in reasoning tasks across various domains such as mathematics and coding. However, their performance deteriorates in tasks requiring rich socio-cultural knowledge and diverse local contexts, particularly those involving Indian Culture. Existing Cultural benchmarks are (i) Manually crafted, (ii) contain single-hop questions testing factual recall, and (iii) prohibitively costly to scale, leaving this deficiency largely unmeasured. To address this, we introduce VIRAASAT, a novel, semi-automated multi-hop approach for generating cultural specific multi-hop Question-Answering dataset for Indian culture. VIRAASAT leverages a Knowledge Graph comprising more than 700 expert-curated cultural artifacts, covering 13 key attributes of Indian culture (history, festivals, etc). VIRAASAT spans all 28 states and 8 Union Territories, yielding more than 3,200 multi-hop questions that necessitate chained cultural reasoning. We evaluate current State-of-the-Art (SOTA) LLMs on VIRAASAT and identify key limitations in reasoning wherein fine-tuning on Chain-of-Thought(CoT) traces fails to ground and synthesize low-probability facts. To bridge this gap, we propose a novel framework named Symbolic Chain-of-Manipulation (SCoM). Adapting the Chain-of-Manipulation paradigm, we train the model to simulate atomic Knowledge Graph manipulations internally. SCoM teaches the model to reliably traverse the topological structure of the graph. Experiments on Supervised Fine-Tuning (SFT) demonstrate that SCoM outperforms standard CoT baselines by up to 20%. We release the VIRAASAT dataset along with our findings, laying a strong foundation towards building Culturally Aware Reasoning Models.
2. 【2602.18425】RVR: Retrieve-Verify-Retrieve for Comprehensive Question Answering
Link: https://arxiv.org/abs/2602.18425
Authors: Deniz Qian, Hung-Ting Chen, Eunsol Choi
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Comprehensively retrieving diverse, Comprehensively retrieving, retrieving diverse documents, retrieving diverse, crucial to address
Comments: 18 pages, 12 figures, 12 tables
Click to view abstract
Abstract:Comprehensively retrieving diverse documents is crucial to address queries that admit a wide range of valid answers. We introduce retrieve-verify-retrieve (RVR), a multi-round retrieval framework designed to maximize answer coverage. Initially, a retriever takes the original query and returns a candidate document set, followed by a verifier that identifies a high-quality subset. For subsequent rounds, the query is augmented with previously verified documents to uncover answers that are not yet covered in previous rounds. RVR is effective even with off-the-shelf retrievers, and fine-tuning retrievers for our inference procedure brings further gains. Our method outperforms baselines, including agentic search approaches, achieving at least 10% relative and 3% absolute gain in complete recall percentage on a multi-answer retrieval dataset (QAMPARI). We also see consistent gains on two out-of-domain datasets (QUEST and WebQuestionsSP) across different base retrievers. Our work presents a promising iterative approach for comprehensive answer recall leveraging a verifier and adapting retrievers to a new inference scenario.
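The multi-round loop described in this abstract (retrieve, verify, then retrieve again conditioned on what has been verified) can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: `toy_retrieve`, `toy_verify`, the word-overlap scoring, the stopping rule, and the tiny corpus are all hypothetical stand-ins. In particular, the toy retriever simply skips already verified documents instead of encoding them into an augmented query.

```python
# Hypothetical four-document corpus, for illustration only.
CORPUS = [
    "Paris is a city in France",
    "Lyon is a city in France",
    "Berlin is a city in Germany",
    "Nice is a city in France",
]

def toy_retrieve(query, verified, k):
    # Rank documents that are not yet verified by word overlap with the query.
    q = set(query.lower().split())
    pool = [d for d in CORPUS if d not in verified]
    ranked = sorted(pool, key=lambda d: -len(q & set(d.lower().split())))
    return ranked[:k]

def toy_verify(query, doc):
    # Stand-in verifier: accept a document only if it mentions France.
    return "france" in doc.lower()

def rvr(query, retrieve, verify, rounds=3, k=2):
    # Each round conditions retrieval on the documents verified so far.
    verified = []
    for _ in range(rounds):
        candidates = retrieve(query, verified, k)
        new = [d for d in candidates if verify(query, d)]
        if not new:
            break  # no new verified documents; stop early (an assumption)
        verified.extend(new)
    return verified

result = rvr("city in France", toy_retrieve, toy_verify)
print(result)
```

A single round with k=2 would return only two of the three matching documents; the later rounds are what recover the third, which is the coverage effect the abstract targets.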
3. 【2602.18288】A Topology-Aware Positive Sample Set Construction and Feature Optimization Method in Implicit Collaborative Filtering
Link: https://arxiv.org/abs/2602.18288
Authors: Jiayi Wu, Zhengyu Wu, Xunkai Li, Rong-Hua Li, Guoren Wang
Subjects: Information Retrieval (cs.IR)
Keywords: implicit collaborative filtering, Negative sampling strategies, false negatives, Topology-aware Positive Sample, Positive Sample Set
Comments:
Click to view abstract
Abstract:Negative sampling strategies are widely used in implicit collaborative filtering to address issues like data sparsity and class imbalance. However, these methods often introduce false negatives, hindering the model's ability to accurately learn users' latent preferences. To mitigate this problem, existing methods adjust the negative sampling distribution based on statistical features from model training or the hardness of negative samples. Nevertheless, these methods face two key limitations: (1) over-reliance on the model's current representation capabilities; (2) failure to leverage the potential of false negatives as latent positive samples to guide model learning of user preferences more accurately. To address the above issues, we propose a Topology-aware Positive Sample Set Construction and Feature Optimization method (TPSC-FO). First, we design a simple topological community-aware false negative identification (FNI) method and observe that topological community structures in interaction networks can effectively identify false negatives. Motivated by this, we develop a topology-aware positive sample set construction module. This module employs a differential community detection strategy to capture topological community structures in implicit feedback, coupled with personalized noise filtration to reliably identify false negatives and convert them into positive samples. Additionally, we introduce a neighborhood-guided feature optimization module that refines positive sample features by incorporating neighborhood features in the embedding space, effectively mitigating noise in the positive samples. Extensive experiments on five real-world datasets and two synthetic datasets validate the effectiveness of TPSC-FO.
4. 【2602.18283】HyTRec: A Hybrid Temporal-Aware Attention Architecture for Long Behavior Sequential Recommendation
Link: https://arxiv.org/abs/2602.18283
Authors: Lei Xin, Yuhao Zheng, Ke Cheng, Changjiang Jiang, Zifan Zhang, Fanhu Zeng
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: Modeling long sequences, Modeling long, generative recommendation, behaviors has emerged, critical frontier
Comments: Preprint
Click to view abstract
Abstract:Modeling long sequences of user behaviors has emerged as a critical frontier in generative recommendation. However, existing solutions face a dilemma: linear attention mechanisms achieve efficiency at the cost of retrieval precision due to limited state capacity, while softmax attention suffers from prohibitive computational overhead. To address this challenge, we propose HyTRec, a model featuring a Hybrid Attention architecture that explicitly decouples long-term stable preferences from short-term intent spikes. By assigning massive historical sequences to a linear attention branch and reserving a specialized softmax attention branch for recent interactions, our approach restores precise retrieval capabilities within industrial-scale contexts involving ten thousand interactions. To mitigate the lag in capturing rapid interest drifts within the linear layers, we furthermore design Temporal-Aware Delta Network (TADN) to dynamically upweight fresh behavioral signals while effectively suppressing historical noise. Empirical results on industrial-scale datasets confirm the superiority that our model maintains linear inference speed and outperforms strong baselines, notably delivering over 8% improvement in Hit Rate for users with ultra-long sequences with great efficiency.
5. 【2602.18249】Dual-Tree LLM-Enhanced Negative Sampling for Implicit Collaborative Filtering
Link: https://arxiv.org/abs/2602.18249
Authors: Jiayi Wu, Zhengyu Wu, Xunkai Li, Rong-Hua Li, Guoren Wang
Subjects: Information Retrieval (cs.IR)
Keywords: Negative sampling, contrasting observed interactions, LLM-enhanced Negative Sampling, Negative
Comments:
Click to view abstract
Abstract:Negative sampling is a pivotal technique in implicit collaborative filtering (CF) recommendation, enabling efficient and effective training by contrasting observed interactions with sampled unobserved ones. Recently, large language models (LLMs) have shown promise in recommender systems; however, research on LLM-empowered negative sampling remains underexplored. Existing methods heavily rely on textual information and task-specific fine-tuning, limiting practical applicability. To address this limitation, we propose a text-free and fine-tuning-free Dual-Tree LLM-enhanced Negative Sampling method (DTL-NS). It consists of two modules: (i) an offline false negative identification module that leverages hierarchical index trees to transform collaborative structural and latent semantic information into structured item-ID encodings for LLM inference, enabling accurate identification of false negatives; and (ii) a multi-view hard negative sampling module that combines user-item preference scores with item-item hierarchical similarities from these encodings to mine high-quality hard negatives, thus improving models' discriminative ability. Extensive experiments demonstrate the effectiveness of DTL-NS. For example, on the Amazon-sports dataset, DTL-NS outperforms the strongest baseline by 10.64% and 19.12% in Recall@20 and NDCG@20, respectively. Moreover, DTL-NS can be integrated into various implicit CF models and negative sampling methods, consistently enhancing their performance.
Subjects:
Information Retrieval (cs.IR)
Cite as:
arXiv:2602.18249 [cs.IR]
(or
arXiv:2602.18249v1 [cs.IR] for this version)
https://doi.org/10.48550/arXiv.2602.18249
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Xunkai Li [view email] [v1]
Fri, 20 Feb 2026 14:32:41 UTC (15,726 KB)
6. 【2602.18221】The Economical-Ecological Benefits of Matching Non-matching Socks
Link: https://arxiv.org/abs/2602.18221
Authors: Teddy Lazebnik
Subjects: Information Retrieval (cs.IR)
Keywords: trigger premature replacement, strand usable wear-capacity, massive scale, vulnerable to waste, premature replacement
Comments:
Click to view abstract
Abstract:Socks are produced and replaced at a massive scale, yet their paired use makes them unusually vulnerable to waste, as the loss of a single sock can strand usable wear-capacity and trigger premature replacement. In this study, we quantify the economic and ecological value of pairing non-matching "orphan" socks, and the social cost that discourages this behaviour. We formalize sock ownership as a sequential decision problem under uncertainty in which socks wear out and disappear stochastically during laundering, while public exposure induces a person-specific mismatch penalty. We conducted an in-person study to estimate mismatch sensitivity and diversity preference, linking behavioural heterogeneity to optimal mixing strategies. Using these results and a computer simulation-based evaluation of interpretable pairing policies, we show that strict matching can appear resource-frugal largely because it generates many sockless days, whereas controlled tolerance for mismatch sustains service and reduces stranded capacity across loss regimes. This study establishes the feasibility of matching non-matching socks while outlining its limitations and challenges.
7. 【2602.18206】A Simple yet Effective Negative Sampling Plugin for Constructing Positive Sample Pairs in Implicit Collaborative Filtering
Link: https://arxiv.org/abs/2602.18206
Authors: Jiayi Wu, Zhengyu Wu, Xunkai Li, Ronghua Li, Guoren Wang
Subjects: Information Retrieval (cs.IR)
Keywords: existing work designs, work designs sophisticated, designs sophisticated strategies, implicit collaborative filtering, collaborative filtering
Comments:
Click to view abstract
Abstract:Most implicit collaborative filtering (CF) models are trained with negative sampling, where existing work designs sophisticated strategies for high-quality negatives while largely overlooking the exploration of positive samples. Although some denoising recommendation methods can be applied to implicit CF for denoising positive samples, they often sparsify positive supervision. Moreover, these approaches generally overlook user activity bias during training, leading to insufficient learning for inactive users. To address these issues, we propose a simple yet effective negative sampling plugin, PSP-NS, from the perspective of enhancing positive supervision signals. It builds a user-item bipartite graph with edge weights indicating interaction confidence inferred from global and local patterns, generates positive sample pairs via replication-based reweighting to strengthen positive signals, and adopts an activity-aware weighting scheme to effectively learn inactive users' preferences. We provide theoretical insights from a margin-improvement perspective, explaining why PSP-NS tends to improve ranking quality (e.g., Precision@k/Recall@k), and conduct extensive experiments on four real-world datasets to demonstrate its superiority. For instance, PSP-NS boosts Recall@30 and Precision@30 by 32.11% and 22.90% on Yelp over the strongest baselines. PSP-NS can be integrated with various implicit CF recommenders or negative sampling methods to enhance their performance.
8. 【2602.18107】SuiteEval: Simplifying Retrieval Benchmarks
Link: https://arxiv.org/abs/2602.18107
Authors: Andrew Parry, Debasis Ganguly, Sean MacAvaney
Subjects: Information Retrieval (cs.IR)
Keywords: varying dataset subsets, Information retrieval evaluation, models requiring robust, foundation embedding models, embedding models requiring
Comments: 5 pages, 3 figures, 2 tables, Accepted as a Demonstration to ECIR 2026
Click to view abstract
Abstract:Information retrieval evaluation often suffers from fragmented practices -- varying dataset subsets, aggregation methods, and pipeline configurations -- that undermine reproducibility and comparability, especially for foundation embedding models requiring robust out-of-domain performance. We introduce SuiteEval, a unified framework that offers automatic end-to-end evaluation, dynamic indexing that reuses on-disk indices to minimise disk usage, and built-in support for major benchmarks (BEIR, LoTTE, MS MARCO, NanoBEIR, and BRIGHT). Users only need to supply a pipeline generator. SuiteEval handles data loading, indexing, ranking, metric computation, and result aggregation. New benchmark suites can be added in a single line. SuiteEval reduces boilerplate and standardises evaluations to facilitate reproducible IR research, as a broader benchmark set is increasingly required.
9. 【2602.17981】Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering
Link: https://arxiv.org/abs/2602.17981
Authors: Amine Kobeissi, Philippe Langlais
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: high stakes settings, Retrieval-augmented generation, financial question answering, exact context needed, long regulatory filings
Comments:
Click to view abstract
Abstract:Retrieval-augmented generation is increasingly used for financial question answering over long regulatory filings, yet reliability depends on retrieving the exact context needed to justify answers in high stakes settings. We study a frequent failure mode in which the correct document is retrieved but the page or chunk that contains the answer is missed, leading the generator to extrapolate from incomplete context. Despite its practical significance, this within-document retrieval failure mode has received limited systematic attention in the Financial Question Answering (QA) literature. We evaluate retrieval at multiple levels of granularity, document, page, and chunk level, and introduce an oracle based analysis to provide empirical upper bounds on retrieval and generative performance. On a 150 question subset of FinanceBench, we reproduce and compare diverse retrieval strategies including dense, sparse, hybrid, and hierarchical methods with reranking and query reformulation. Across methods, gains in document discovery tend to translate into stronger page recall, yet oracle performance still suggests headroom for page and chunk level retrieval. To target this gap, we introduce a domain fine-tuned page scorer that treats pages as an intermediate retrieval unit between documents and chunks. Unlike prior passage-based hierarchical retrieval, we fine-tune a bi-encoder specifically for page-level relevance on financial filings, exploiting the semantic coherence of pages. Overall, our results demonstrate a significant improvement in page recall and chunk retrieval.
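The failure mode described above (right document, wrong page) is easy to make concrete. The sketch below scores one retrieval result list at document, page, and chunk granularity. The tuple layout `(document, page, chunk)`, the helper name `hit_at_k`, and the filing identifiers are hypothetical choices for illustration, not the paper's evaluation code.

```python
def hit_at_k(retrieved, gold, level, k):
    """Return True if any of the top-k results matches the gold location
    at the requested granularity ('document', 'page', or 'chunk')."""
    depth = {"document": 1, "page": 2, "chunk": 3}[level]
    return any(result[:depth] == gold[:depth] for result in retrieved[:k])

# Hypothetical results as (document id, page number, chunk index on that page).
retrieved = [("10K_2022", 41, 3), ("10K_2022", 12, 0), ("10K_2021", 57, 1)]
gold = ("10K_2022", 57, 1)

doc_hit = hit_at_k(retrieved, gold, "document", k=3)   # correct filing found
page_hit = hit_at_k(retrieved, gold, "page", k=3)      # but not the right page
chunk_hit = hit_at_k(retrieved, gold, "chunk", k=3)    # and so not the chunk
```

Averaging such hits over all questions gives recall at each level, and the gap between document-level and page-level recall is the quantity the abstract focuses on.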
10. 【2602.17914】Efficient Filtered-ANN via Learning-based Query Planning
Link: https://arxiv.org/abs/2602.17914
Authors: Zhuocheng Gan, Yifan Wang
Subjects: Databases (cs.DB); Information Retrieval (cs.IR)
Keywords: Filtered ANN search, requires expensive per-predicate, difficult trade-off due, low selectivity due, increasingly important problem
Comments:
Click to view abstract
Abstract:Filtered ANN search is an increasingly important problem in vector retrieval, yet systems face a difficult trade-off due to the execution order: Pre-filtering (filtering first, then ANN over the passing subset) requires expensive per-predicate index construction, while post-filtering (ANN first, then filtering candidates) may waste computation and lose recall under low selectivity due to insufficient candidates after filtering. We introduce a learning-based query planning framework that dynamically selects the most effective execution plan for each query, using lightweight predictions derived from dataset and query statistics (e.g., dimensionality, corpus size, distribution features, and predicate statistics). The framework supports diverse filter types, including categorical/keyword and range predicates, and is generic enough to use any backend ANN index. Experiments show that our method achieves up to 4x acceleration with ≥ 90% recall compared to the strong baselines.
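The two execution orders contrasted in the abstract can be sketched in a few lines. This toy version uses exact brute-force search in place of a real ANN index, and the corpus, the `tag` field, and the `fetch` parameter are assumptions made for illustration. It is not the paper's planner, only a demonstration of why post-filtering can lose recall when few items pass the predicate.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pre_filter_search(corpus, query, predicate, k):
    """Filter first, then search (here: exactly) over the passing subset."""
    passing = [(i, vec) for i, (vec, meta) in enumerate(corpus) if predicate(meta)]
    passing.sort(key=lambda pair: l2(pair[1], query))
    return [i for i, _ in passing[:k]]

def post_filter_search(corpus, query, predicate, k, fetch):
    """Search first for `fetch` candidates, then drop those failing the predicate."""
    order = sorted(range(len(corpus)), key=lambda i: l2(corpus[i][0], query))
    survivors = [i for i in order[:fetch] if predicate(corpus[i][1])]
    return survivors[:k]

# Toy corpus: 100 points on a line; only every 25th point carries the rare tag.
corpus = [((float(i), 0.0), {"tag": "rare" if i % 25 == 0 else "common"})
          for i in range(100)]
query = (10.0, 0.0)

def is_rare(meta):
    return meta["tag"] == "rare"

exact = pre_filter_search(corpus, query, is_rare, k=3)
approx = post_filter_search(corpus, query, is_rare, k=3, fetch=25)
recall = len(set(exact) & set(approx)) / len(exact)
```

With a 4% selectivity, fetching 25 candidates leaves only one survivor after filtering, so post-filtering returns a third of the true top-3. Raising `fetch` restores recall at the cost of extra distance computations, which is the trade-off a per-query planner has to weigh.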
11. 【2602.17856】Enhancing Scientific Literature Chatbots with Retrieval-Augmented Generation: A Performance Evaluation of Vector and Graph-Based Systems
Link: https://arxiv.org/abs/2602.17856
Authors: Hamideh Ghanadian, Amin Kamali, Mohammad Hossein Tekieh
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: retrieval-augmented generation, scientific literature chatbots, paper investigates, investigates the enhancement, focus on evaluating
Comments:
Click to view abstract
Abstract:This paper investigates the enhancement of scientific literature chatbots through retrieval-augmented generation (RAG), with a focus on evaluating vector- and graph-based retrieval systems. The proposed chatbot leverages both structured (graph) and unstructured (vector) databases to access scientific articles and gray literature, enabling efficient triage of sources according to research objectives. To systematically assess performance, we examine two use-case scenarios: retrieval from a single uploaded document and retrieval from a large-scale corpus. Benchmark test sets were generated using a GPT model, with selected outputs annotated for evaluation. The comparative analysis emphasizes retrieval accuracy and response relevance, providing insight into the strengths and limitations of each approach. The findings demonstrate the potential of hybrid RAG systems to improve accessibility to scientific knowledge and to support evidence-based decision making.
12. 【2602.17814】VQPP: Video Query Performance Prediction Benchmark
Link: https://arxiv.org/abs/2602.17814
Authors: Adrian Catalin Lutu, Eduard Poesina, Radu Tudor Ionescu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: actively studied information, Query performance prediction, studied information retrieval, retrieval system selection, information retrieval task
Comments:
Click to view abstract
Abstract:Query performance prediction (QPP) is an important and actively studied information retrieval task, having various applications, such as query reformulation, query expansion, and retrieval system selection, among many others. The task has been primarily studied in the context of text and image retrieval, whereas QPP for content-based video retrieval (CBVR) remains largely underexplored. To this end, we propose the first benchmark for video query performance prediction (VQPP), comprising two text-to-video retrieval datasets and two CBVR systems, respectively. VQPP contains a total of 56K text queries and 51K videos, and comes with official training, validation and test splits, fostering direct comparisons and reproducible results. We explore multiple pre-retrieval and post-retrieval performance predictors, creating a representative benchmark for future exploration of QPP in the video domain. Our results show that pre-retrieval predictors obtain competitive performance, enabling applications before performing the retrieval step. We also demonstrate the applicability of VQPP by employing the best performing pre-retrieval predictor as reward model for training a large language model (LLM) on the query reformulation task via direct preference optimization (DPO). We release our benchmark and code at this https URL.
13. 【2602.17695】EXACT: Explicit Attribute-Guided Decoding-Time Personalization
Link: https://arxiv.org/abs/2602.17695
Authors: Xin Yu, Hanwen Xing, Lingzhou Xue
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: Achieving personalized alignment, large language models, alignment requires adapting, requires adapting large, adapting large language
Comments:
Click to view abstract
Abstract:Achieving personalized alignment requires adapting large language models to each user's evolving context. While decoding-time personalization offers a scalable alternative to training-time methods, existing methods largely rely on implicit, less interpretable preference representations and impose a rigid, context-agnostic user representation, failing to account for how preferences shift across prompts. We introduce EXACT, a new decoding-time personalization that aligns generation with limited pairwise preference feedback using a predefined set of interpretable attributes. EXACT first identifies user-specific attribute subsets by maximizing the likelihood of preferred responses in the offline stage. Then, for online inference, EXACT retrieves the most semantically relevant attributes for an incoming prompt and injects them into the context to steer generation. We establish theoretical approximation guarantees for the proposed algorithm under mild assumptions, and provably show that our similarity-based retrieval mechanism effectively mitigates contextual preference shifts, adapting to disparate tasks without pooling conflicting preferences. Extensive experiments on human-annotated preference datasets demonstrate that EXACT consistently outperforms strong baselines, including preference modeling accuracy and personalized generation quality.
14. 【2602.17687】IRPAPERS: A Visual Document Benchmark for Scientific Retrieval and Question Answering
Link: https://arxiv.org/abs/2602.17687
Authors: Connor Shorten, Augustas Skaburskas, Daniel M. Jones, Charles Pierse, Roberto Esposito, John Trengrove, Etienne Dilocker, Bob van Luijt
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: document processing remains, visual document processing, Recall, achieved remarkable success, processing remains
Comments: 23 pages, 6 figures
Click to view abstract
Abstract:AI systems have achieved remarkable success in processing text and relational data, yet visual document processing remains relatively underexplored. Whereas traditional systems require OCR transcriptions to convert these visual documents into text and metadata, recent advances in multimodal foundation models offer retrieval and generation directly from document images. This raises a key question: How do image-based systems compare to established text-based methods? We introduce IRPAPERS, a benchmark of 3,230 pages from 166 scientific papers, with both an image and an OCR transcription for each page. Using 180 needle-in-the-haystack questions, we compare image- and text-based retrieval and question answering systems. Text retrieval using Arctic 2.0 embeddings, BM25, and hybrid text search achieved 46% Recall@1, 78% Recall@5, and 91% Recall@20, while image-based retrieval reaches 43%, 78%, and 93%, respectively. The two modalities exhibit complementary failures, enabling multimodal hybrid search to outperform either alone, achieving 49% Recall@1, 81% Recall@5, and 95% Recall@20. We further evaluate efficiency-performance tradeoffs with MUVERA and assess multiple multi-vector image embedding models. Among closed-source models, Cohere Embed v4 page image embeddings outperform Voyage 3 Large text embeddings and all tested open-source models, achieving 58% Recall@1, 87% Recall@5, and 97% Recall@20. For question answering, text-based RAG systems achieved higher ground-truth alignment than image-based systems (0.82 vs. 0.71), and both benefit substantially from increased retrieval depth, with multi-document retrieval outperforming oracle single-document retrieval. We analyze the complementary limitations of unimodal text and image representations and identify question types that require one modality over the other. The IRPAPERS dataset and all experimental code are publicly available.
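The abstract reports that combining text and image retrieval beats either modality alone, but it does not say how the two result lists are merged. One widely used option is reciprocal rank fusion (RRF), sketched below as a stand-in. The page IDs are hypothetical, and `k=60` is the conventional RRF constant rather than a value from the paper.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists: each list contributes 1 / (k + rank)
    to every item it contains, and items are sorted by total score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical top-4 pages from a text retriever and an image retriever.
text_ranking = ["p12", "p03", "p44", "p07"]
image_ranking = ["p03", "p51", "p12", "p44"]
fused = reciprocal_rank_fusion([text_ranking, image_ranking])
```

Pages ranked highly by both retrievers rise to the top, while a page seen by only one modality is still kept. That is one way the complementary failures mentioned in the abstract could be exploited.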
15. 【2602.17667】When & How to Write for Personalized Demand-aware Query Rewriting in Video Search
Link: https://arxiv.org/abs/2602.17667
Authors: Cheng cheng, Chenxing Wang, Aolin Li, Haijun Wu, Huiyun Hu, Juyuan Wang
Subjects: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: historical behaviors provide, behaviors provide rich, provide rich context, identifying search intent, user historical behaviors
Comments:
Click to view abstract
Abstract:In video search systems, user historical behaviors provide rich context for identifying search intent and resolving ambiguity. However, traditional methods utilizing implicit history features often suffer from signal dilution and delayed feedback. To address these challenges, we propose WeWrite, a novel Personalized Demand-aware Query Rewriting framework. Specifically, WeWrite tackles three key challenges: (1) When to Write: An automated posterior-based mining strategy extracts high-quality samples from user logs, identifying scenarios where personalization is strictly necessary; (2) How to Write: A hybrid training paradigm combines Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to align the LLM's output style with the retrieval system; (3) Deployment: A parallel "Fake Recall" architecture ensures low latency. Online A/B testing on a large-scale video platform demonstrates that WeWrite improves the Click-Through Video Volume (VV>10s) by 1.07% and reduces the Query Reformulation Rate by 2.97%.
16. 【2602.17705】Wavenumber-domain signal processing for holographic MIMO: Foundations, methods, and future directions
Link: https://arxiv.org/abs/2602.17705
Authors: Zijian Zhang, Linglong Dai
Subjects: Signal Processing (eess.SP); Information Retrieval (cs.IR); Systems and Control (eess.SY)
Keywords: Holographic multiple-input multiple-output, enabling quasi-continuous apertures, Holographic multiple-input, multiple-input multiple-output, quasi-continuous apertures
Comments: Accepted by IEEE Communications Standards Magazine. 6 pages, 5 figures
Click to view abstract
Abstract:Holographic multiple-input multiple-output (H-MIMO) systems represent a paradigm shift in wireless communications by enabling quasi-continuous apertures. Unlike conventional MIMO systems, H-MIMO with subwavelength antenna spacing operates in both far-field and near-field regimes, where classical discrete Fourier transform (DFT) representations fail to sufficiently capture the channel characteristics. To address this challenge, this article provides an overview of the emerging wavenumber-domain signal processing framework. Specifically, by leveraging spatial Fourier plane-wave decomposition to model H-MIMO channels, the wavenumber domain offers a unified and physically consistent basis for characterizing subwavelength-level spatial correlation and spherical wave propagation. This article first introduces the concept of H-MIMO and the wavenumber representation of H-MIMO channels. Next, it elaborates on wavenumber-domain signal processing technologies reported in the literature, including multiplexing, channel estimation, and waveform designs. Finally, it highlights open challenges and outlines future research directions in wavenumber-domain signal processing for next-generation wireless systems.
Computer Vision
1. 【2602.18434】Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory
Link: https://arxiv.org/abs/2602.18434
Authors: Vatsal Agarwal, Saksham Suri, Matthew Gwilliam, Pulkit Kumar, Abhinav Shrivastava
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Streaming video understanding, video question answering, support accurate video, accurate video question, video understanding requires
Comments: Project page: see [this https URL](https://vatsalag99.github.io/memstream/)
Click to view abstract
Abstract:Streaming video understanding requires models to robustly encode, store, and retrieve information from a continuous video stream to support accurate video question answering (VQA). Existing state-of-the-art approaches rely on key-value caching to accumulate frame-level information over time, but use a limited number of tokens per frame, leading to the loss of fine-grained visual details. In this work, we propose scaling the token budget to enable more granular spatiotemporal understanding and reasoning. First, we find that current methods are ill-equipped to handle dense streams: their feature encoding causes query-frame similarity scores to increase over time, biasing retrieval toward later frames. To address this, we introduce an adaptive selection strategy that reduces token redundancy while preserving local spatiotemporal information. We further propose a training-free retrieval mixture-of-experts that leverages external models to better identify relevant frames. Our method, MemStream, achieves +8.0% on CG-Bench, +8.5% on LVBench, and +2.4% on VideoMME (Long) over ReKV with Qwen2.5-VL-7B.
2. 【2602.18432】SARAH: Spatially Aware Real-time Agentic Humans
Link: https://arxiv.org/abs/2602.18432
Authors: Evonne Ng, Siwei Zhang, Zhang Chen, Michael Zollhoefer, Alexander Richard
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: digital human applications, human applications, digital human, speech-aligned gestures
Comments: Project page: [this https URL](https://evonneng.github.io/sarah/)
Click to view abstract
Abstract:As embodied agents become central to VR, telepresence, and digital human applications, their motion must go beyond speech-aligned gestures: agents should turn toward users, respond to their movement, and maintain natural gaze. Current methods lack this spatial awareness. We close this gap with the first real-time, fully causal method for spatially-aware conversational motion, deployable on a streaming VR headset. Given a user's position and dyadic audio, our approach produces full-body motion that aligns gestures with speech while orienting the agent according to the user. Our architecture combines a causal transformer-based VAE with interleaved latent tokens for streaming inference and a flow matching model conditioned on user trajectory and audio. To support varying gaze preferences, we introduce a gaze scoring mechanism with classifier-free guidance to decouple learning from control: the model captures natural spatial alignment from data, while users can adjust eye contact intensity at inference time. On the Embody 3D dataset, our method achieves state-of-the-art motion quality at over 300 FPS -- 3x faster than non-causal baselines -- while capturing the subtle spatial dynamics of natural conversation. We validate our approach on a live VR system, bringing spatially-aware conversational agents to real-time deployment. Please see this https URL for details.
3. 【2602.18428】The Geometry of Noise: Why Diffusion Models Don't Need Noise Conditioning
Link: https://arxiv.org/abs/2602.18428
Authors: Mojtaba Sahraee-Ardakan, Mauricio Delbracio, Peyman Milanfar
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Keywords: Equilibrium Matching, explicit noise-level conditioning, Marginal Energy, challenge the standard, learning a single
Comments:
Click to view abstract
Abstract:Autonomous (noise-agnostic) generative models, such as Equilibrium Matching and blind diffusion, challenge the standard paradigm by learning a single, time-invariant vector field that operates without explicit noise-level conditioning. While recent work suggests that high-dimensional concentration allows these models to implicitly estimate noise levels from corrupted observations, a fundamental paradox remains: what is the underlying landscape being optimized when the noise level is treated as a random variable, and how can a bounded, noise-agnostic network remain stable near the data manifold where gradients typically diverge? We resolve this paradox by formalizing Marginal Energy, $E_{\text{marg}}(\mathbf{u}) = -\log p(\mathbf{u})$, where $p(\mathbf{u}) = \int p(\mathbf{u}|t)p(t)dt$ is the marginal density of the noisy data integrated over a prior distribution of unknown noise levels. We prove that generation using autonomous models is not merely blind denoising, but a specific form of Riemannian gradient flow on this Marginal Energy. Through a novel relative energy decomposition, we demonstrate that while the raw Marginal Energy landscape possesses a $1/t^p$ singularity normal to the data manifold, the learned time-invariant field implicitly incorporates a local conformal metric that perfectly counteracts the geometric singularity, converting an infinitely deep potential well into a stable attractor. We also establish the structural stability conditions for sampling with autonomous models. We identify a "Jensen Gap" in noise-prediction parameterizations that acts as a high-gain amplifier for estimation errors, explaining the catastrophic failure observed in deterministic blind models. Conversely, we prove that velocity-based parameterizations are inherently stable because they satisfy a bounded-gain condition that absorbs posterior uncertainty into a smooth geometric drift.
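To connect the Marginal Energy definition to the familiar noise-conditional score, the following standard identity may help. It follows from the two definitions in the abstract alone, assuming differentiation under the integral is permitted and writing $p(t|\mathbf{u}) = p(\mathbf{u}|t)p(t)/p(\mathbf{u})$ for the posterior over noise levels. It is a reading aid, not a reproduction of the paper's derivation.

$$
\nabla_{\mathbf{u}} E_{\text{marg}}(\mathbf{u})
= -\frac{\nabla_{\mathbf{u}} p(\mathbf{u})}{p(\mathbf{u})}
= -\frac{1}{p(\mathbf{u})}\int \nabla_{\mathbf{u}} p(\mathbf{u}|t)\, p(t)\, dt
= -\int p(t|\mathbf{u})\, \nabla_{\mathbf{u}} \log p(\mathbf{u}|t)\, dt
$$

In words, the gradient of the Marginal Energy is the negative posterior-weighted average of the conditional scores, which is why a network that never sees $t$ can still follow a well-defined field.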
4. 【2602.18424】CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation
Link: https://arxiv.org/abs/2602.18424
Authors: Xia Su, Ruiqi Chen, Benlin Liu, Jingwei Ma, Zonglin Di, Ranjay Krishna, Jon Froehlich
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: shown remarkable progress, offering new possibilities, shown remarkable, remarkable progress, benefit both robotic
Comments:
Click to view abstract
Abstract:Vision-Language Models (VLMs) have shown remarkable progress in Vision-Language Navigation (VLN), offering new possibilities for navigation decision-making that could benefit both robotic platforms and human users. However, real-world navigation is inherently conditioned by the agent's mobility constraints. For example, a sweeping robot cannot traverse stairs, while a quadruped can. We introduce Capability-Conditioned Navigation (CapNav), a benchmark designed to evaluate how well VLMs can navigate complex indoor spaces given an agent's specific physical and operational capabilities. CapNav defines five representative human and robot agents, each described with physical dimensions, mobility capabilities, and environmental interaction abilities. CapNav provides 45 real-world indoor scenes, 473 navigation tasks, and 2365 QA pairs to test if VLMs can traverse indoor environments based on agent capabilities. We evaluate 13 modern VLMs and find that current VLMs' navigation performance drops sharply as mobility constraints tighten, and that even state-of-the-art models struggle with obstacle types that require reasoning on spatial dimensions. We conclude by discussing the implications for capability-aware navigation and the opportunities for advancing embodied spatial reasoning in future VLMs. The benchmark is available at this https URL.
5. 【2602.18422】Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control
Link: https://arxiv.org/abs/2602.18422
Authors: Linxi Xie, Lisong C. Sun, Ashley Neall, Tong Wu, Shengqu Cai, Gordon Wetzstein
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: tracked real-world motion, demands generative models, users' tracked real-world, world models accept, current video world
Comments: Project page here: [this https URL](https://codeysun.github.io/generated-reality)
Click to view abstract
Abstract:Extended reality (XR) demands generative models that respond to users' tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limiting their utility for embodied interaction. We introduce a human-centric video world model that is conditioned on both tracked head pose and joint-level hand poses. For this purpose, we evaluate existing diffusion transformer conditioning strategies and propose an effective mechanism for 3D head and hand control, enabling dexterous hand--object interactions. We train a bidirectional video diffusion model teacher using this strategy and distill it into a causal, interactive system that generates egocentric virtual environments. We evaluate this generated reality system with human subjects and demonstrate improved task performance as well as a significantly higher level of perceived amount of control over the performed actions compared with relevant baselines.
6. 【2602.18406】Latent Equivariant Operators for Robust Object Recognition: Promise and Challenges
Link: https://arxiv.org/abs/2602.18406
Authors: Minh Dinh, Stéphane Deny
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: group-symmetric transformations rarely, undergone group-symmetric transformations, computer vision, difficulties persist, unusual poses
Comments:
Click to view abstract
Abstract:Despite the successes of deep learning in computer vision, difficulties persist in recognizing objects that have undergone group-symmetric transformations rarely seen during training, for example objects seen in unusual poses, scales, positions, or combinations thereof. Equivariant neural networks are a solution to the problem of generalizing across symmetric transformations, but require knowledge of transformations a priori. An alternative family of architectures proposes to learn equivariant operators in a latent space from examples of symmetric transformations. Here, using simple datasets of rotated and translated noisy MNIST, we illustrate how such architectures can successfully be harnessed for out-of-distribution classification, thus overcoming the limitations of both traditional and equivariant networks. While conceptually enticing, we discuss challenges ahead on the path of scaling these architectures to more complex datasets.
7. 【2602.18394】Self-Aware Object Detection via Degradation Manifolds
Link: https://arxiv.org/abs/2602.18394
Authors: Stefan Becker, Simon Weiss, Wolfgang Hübner, Michael Arens
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: nominal imaging conditions, achieve strong performance, Object detectors achieve, adverse weather, exposed to blur
Comments:
Click to view abstract
Abstract:Object detectors achieve strong performance under nominal imaging conditions but can fail silently when exposed to blur, noise, compression, adverse weather, or resolution changes. In safety-critical settings, it is therefore insufficient to produce predictions without assessing whether the input remains within the detector's nominal operating regime. We refer to this capability as self-aware object detection. We introduce a degradation-aware self-awareness framework based on degradation manifolds, which explicitly structure a detector's feature space according to image degradation rather than semantic content. Our method augments a standard detection backbone with a lightweight embedding head trained via multi-layer contrastive learning. Images sharing the same degradation composition are pulled together, while differing degradation configurations are pushed apart, yielding a geometrically organized representation that captures degradation type and severity without requiring degradation labels or explicit density modeling. To anchor the learned geometry, we estimate a pristine prototype from clean training embeddings, defining a nominal operating point in representation space. Self-awareness emerges as geometric deviation from this reference, providing an intrinsic, image-level signal of degradation-induced shift that is independent of detection confidence. Extensive experiments on synthetic corruption benchmarks, cross-dataset zero-shot transfer, and natural weather-induced distribution shifts demonstrate strong pristine-degraded separability, consistent behavior across multiple detector architectures, and robust generalization under semantic shift. These results suggest that degradation-aware representation geometry provides a practical and detector-agnostic foundation.
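The prototype idea in the abstract (a nominal operating point estimated from clean embeddings, with deviation from it serving as the degradation signal) can be illustrated with a toy example. The abstract does not specify the estimator or the distance, so the mean vector and cosine distance used here are assumptions, as are the two-dimensional embeddings and the threshold value.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, larger means further apart."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical embeddings of clean training images from the embedding head.
clean_embeddings = [[1.0, 0.1], [0.9, 0.0], [1.1, -0.1]]
prototype = mean_vector(clean_embeddings)  # assumed "pristine prototype"

# Deviation from the prototype is used as an image-level degradation score.
score_clean = cosine_distance([1.0, 0.05], prototype)
score_degraded = cosine_distance([0.2, 1.0], prototype)

THRESHOLD = 0.1  # illustrative value; in practice tuned on validation data
flag_degraded = score_degraded > THRESHOLD
```

Because the score depends only on the embedding geometry, it can be computed without looking at detection confidences, which matches the independence property the abstract emphasizes.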
8. 【2602.18329】G-LoG Bi-filtration for Medical Image Classification
Link: https://arxiv.org/abs/2602.18329
Authors: Qingsong Wang, Jiaxing He, Bingzhe Hou, Tieru Wu, Yang Cao, Cailing Yao
Categories: Computer Vision and Pattern Recognition (cs.CV); Algebraic Topology (math.AT)
Keywords: Topological Data Analysis, Building practical filtrations, Data Analysis, Building practical, Topological Data
Comments:
Abstract:Building practical filtrations on objects to detect topological and geometric features is an important task in the field of Topological Data Analysis (TDA). In this paper, leveraging the ability of the Laplacian of Gaussian operator to enhance the boundaries of medical images, we define the G-LoG (Gaussian-Laplacian of Gaussian) bi-filtration to generate features better suited to multi-parameter persistence modules. Modeling volumetric images as bounded functions, we prove that the interleaving distance on the persistence modules obtained from our bi-filtrations on the bounded functions is stable with respect to the maximum norm of the bounded functions. Finally, we conduct experiments on the MedMNIST dataset, comparing our bi-filtration against single-parameter filtration and established deep learning baselines, including Google AutoML Vision, ResNet, AutoKeras and auto-sklearn. Experimental results demonstrate that our bi-filtration significantly outperforms single-parameter filtration. Notably, a simple Multi-Layer Perceptron (MLP) trained on the topological features generated by our bi-filtration achieves performance comparable to complex deep learning models trained on the original dataset.
9. 【2602.18322】Unifying Color and Lightness Correction with View-Adaptive Curve Adjustment for Robust 3D Novel View Synthesis
Link: https://arxiv.org/abs/2602.18322
Authors: Ziteng Cui, Shuhong Liu, Xiaoyu Dong, Xuangeng Chu, Lin Gu, Ming-Hsuan Yang, Tatsuya Harada
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: High-quality image acquisition, camera imaging pipelines, real-world environments remains, Neural Radiance Fields, environments remains challenging
Comments: Journal extension version of CVPR 2025 paper: [arXiv:2504.01503](https://arxiv.org/abs/2504.01503)
Abstract:High-quality image acquisition in real-world environments remains challenging due to complex illumination variations and inherent limitations of camera imaging pipelines. These issues are exacerbated in multi-view capture, where differences in lighting, sensor responses, and image signal processor (ISP) configurations introduce photometric and chromatic inconsistencies that violate the assumptions of photometric consistency underlying modern 3D novel view synthesis (NVS) methods, including Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), leading to degraded reconstruction and rendering quality. We propose Luminance-GS++, a 3DGS-based framework for robust NVS under diverse illumination conditions. Our method combines a globally view-adaptive lightness adjustment with a local pixel-wise residual refinement for precise color correction. We further design unsupervised objectives that jointly enforce lightness correction and multi-view geometric and photometric consistency. Extensive experiments demonstrate state-of-the-art performance across challenging scenarios, including low-light, overexposure, and complex luminance and chromatic variations. Unlike prior approaches that modify the underlying representation, our method preserves the explicit 3DGS formulation, improving reconstruction fidelity while maintaining real-time rendering efficiency.
10. 【2602.18314】Diff2DGS: Reliable Reconstruction of Occluded Surgical Scenes via 2D Gaussian Splatting
Link: https://arxiv.org/abs/2602.18314
Authors: Tianyi Song, Danail Stoyanov, Evangelos Mazomenos, Francisco Vasconcelos
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
Keywords: improving surgeon guidance, advancing robotic surgery, Vinci robotic surgery, Gaussian Splatting, improving surgeon
Comments: This work has been submitted to the IEEE for possible publication
Abstract:Real-time reconstruction of deformable surgical scenes is vital for advancing robotic surgery, improving surgeon guidance, and enabling automation. Recent methods achieve dense reconstructions from da Vinci robotic surgery videos, with Gaussian Splatting (GS) offering real-time performance via graphics acceleration. However, reconstruction quality in occluded regions remains limited, and depth accuracy has not been fully assessed, as benchmarks like EndoNeRF and StereoMIS lack 3D ground truth. We propose Diff2DGS, a novel two-stage framework for reliable 3D reconstruction of occluded surgical scenes. In the first stage, a diffusion-based video module with temporal priors inpaints tissue occluded by instruments with high spatial-temporal consistency. In the second stage, we adapt 2D Gaussian Splatting (2DGS) with a Learnable Deformation Model (LDM) to capture dynamic tissue deformation and anatomical geometry. We also extend evaluation beyond prior image-quality metrics by performing quantitative depth accuracy analysis on the SCARED dataset. Diff2DGS outperforms state-of-the-art approaches in both appearance and geometry, reaching 38.02 dB PSNR on EndoNeRF and 34.40 dB on StereoMIS. Furthermore, our experiments demonstrate that optimizing for image quality alone does not necessarily translate into optimal 3D reconstruction accuracy. To address this, we further optimize the depth quality of the reconstructed 3D results, ensuring more faithful geometry in addition to high-fidelity appearance.
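The PSNR figures quoted in this abstract follow the standard definition, PSNR = 10 · log10(MAX² / MSE). The sketch below computes it for two flat pixel sequences; the function name, the 8-bit default peak of 255, and the toy pixel values are assumptions for illustration and are not taken from the paper's evaluation code.

```python
import math

def psnr(reference, rendered, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(rendered):
        raise ValueError("images must have the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((r - s) ** 2 for r, s in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy example: every pixel is off by exactly one intensity level.
reference_pixels = [100, 100, 100, 100]
rendered_pixels = [101, 101, 101, 101]
print(round(psnr(reference_pixels, rendered_pixels), 2))
```

Because PSNR only measures pixel-wise intensity error, it says nothing about depth, which is consistent with the abstract's point that image quality alone does not guarantee accurate 3D geometry.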
11. 【2602.18309】Multi-Level Conditioning by Pairing Localized Text and Sketch for Fashion Image Generation
Link: https://arxiv.org/abs/2602.18309
Authors: Ziyue Liu, Davide Talon, Federico Girella, Zanxi Ruan, Mattia Mondo, Loris Bazzani, Yiming Wang, Marco Cristani
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: textual descriptions complement, early-stage fashion ideation, Sketches offer designers, descriptions complement sketches, spatial relationships
Comments: Project page: [this https URL](https://intelligolabs.github.io/lots/)
Abstract:Sketches offer designers a concise yet expressive medium for early-stage fashion ideation by specifying structure, silhouette, and spatial relationships, while textual descriptions complement sketches to convey material, color, and stylistic details. Effectively combining textual and visual modalities requires adherence to the sketch visual structure when leveraging the guidance of localized attributes from text. We present LOcalized Text and Sketch with multi-level guidance (LOTS), a framework that enhances fashion image generation by combining global sketch guidance with multiple localized sketch-text pairs. LOTS employs a Multi-level Conditioning Stage to independently encode local features within a shared latent space while maintaining global structural coordination. Then, the Diffusion Pair Guidance stage integrates both local and global conditioning via attention-based guidance within the diffusion model's multi-step denoising process. To validate our method, we develop Sketchy, the first fashion dataset where multiple text-sketch pairs are provided per image. Sketchy provides high-quality, clean sketches with a professional look and consistent structure. To assess robustness beyond this setting, we also include an "in the wild" split with non-expert sketches, featuring higher variability and imperfections. Experiments demonstrate that our method strengthens global structural adherence while leveraging richer localized semantic guidance, achieving improvement over state-of-the-art. The dataset, platform, and code are publicly available.
12. 【2602.18282】DEIG: Detail-Enhanced Instance Generation with Fine-Grained Semantic Control
Link: https://arxiv.org/abs/2602.18282
Authors: Shiyan Du, Conghan Yue, Xinyu Cheng, Dongyu Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: advanced significantly, Multi-Instance Generation, Detail Fusion Module, DEIG, Instance Detail Extractor
Comments: Accepted by AAAI 2026
Abstract:Multi-Instance Generation has advanced significantly in spatial placement and attribute binding. However, existing approaches still face challenges in fine-grained semantic understanding, particularly when dealing with complex textual descriptions. To overcome these limitations, we propose DEIG, a novel framework for fine-grained and controllable multi-instance generation. DEIG integrates an Instance Detail Extractor (IDE) that transforms text encoder embeddings into compact, instance-aware representations, and a Detail Fusion Module (DFM) that applies instance-based masked attention to prevent attribute leakage across instances. These components enable DEIG to generate visually coherent multi-instance scenes that precisely match rich, localized textual descriptions. To support fine-grained supervision, we construct a high-quality dataset with detailed, compositional instance captions generated by VLMs. We also introduce DEIG-Bench, a new benchmark with region-level annotations and multi-attribute prompts for both humans and objects. Experiments demonstrate that DEIG consistently outperforms existing approaches across multiple benchmarks in spatial consistency, semantic accuracy, and compositional generalization. Moreover, DEIG functions as a plug-and-play module, making it easily integrable into standard diffusion-based pipelines.
13. 【2602.18258】RoEL: Robust Event-based 3D Line Reconstruction
Link: https://arxiv.org/abs/2602.18258
Authors: Gwangtak Bae, Jaeho Shin, Seunggu Kang, Junho Kim, Ayoung Kim, Young Min Kim
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: detect object boundaries, texture edges, man-made environments, motion tend, object boundaries
Comments: IEEE Transactions on Robotics (T-RO)
Abstract:Event cameras in motion tend to detect object boundaries or texture edges, which produce lines of brightness changes, especially in man-made environments. While lines can constitute a robust intermediate representation that is consistently observed, the sparse nature of lines may lead to drastic deterioration with minor estimation errors. Only a few previous works, often accompanied by additional sensors, utilize lines to compensate for the severe domain discrepancies of event sensors along with unpredictable noise characteristics. We propose a method that can stably extract tracks of varying appearances of lines using a clever algorithmic process that observes multiple representations from various time slices of events, compensating for potential adversaries within the event data. We then propose geometric cost functions that can refine the 3D line maps and camera poses, eliminating projective distortions and depth ambiguities. The 3D line maps are highly compact and can be equipped with our proposed cost function, which can be adapted for any observations that can detect and extract line structures or projections of them, including 3D point cloud maps or image observations. We demonstrate that our formulation is powerful enough to exhibit a significant performance boost in event-based mapping and pose refinement across diverse datasets, and can be flexibly applied to multimodal scenarios. Our results confirm that the proposed line-based formulation is a robust and effective approach for the practical deployment of event-based perceptual modules. Project page: this https URL
14. 【2602.18252】On the Adversarial Robustness of Discrete Image Tokenizers
Link: https://arxiv.org/abs/2602.18252
Authors: Rishika Bhagwatkar, Irina Rish, Nicolas Flammarion, Francesco Croce
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: encode visual inputs, tokenizers encode visual, including encoder-only, encode visual, visual inputs
Comments:
Abstract:Discrete image tokenizers encode visual inputs as sequences of tokens from a finite vocabulary and are gaining popularity in multimodal systems, including encoder-only, encoder-decoder, and decoder-only models. However, unlike CLIP encoders, their vulnerability to adversarial attacks has not been explored. Ours being the first work studying this topic, we first formulate attacks that aim to perturb the features extracted by discrete tokenizers, and thus change the extracted tokens. These attacks are computationally efficient, application-agnostic, and effective across classification, multimodal retrieval, and captioning tasks. Second, to defend against this vulnerability, inspired by recent work on robust CLIP encoders, we fine-tune popular tokenizers with unsupervised adversarial training, keeping all other components frozen. While unsupervised and task-agnostic, our approach significantly improves robustness to both unsupervised and end-to-end supervised attacks and generalizes well to unseen tasks and data. Unlike supervised adversarial training, our approach can leverage unlabeled images, making it more versatile. Overall, our work highlights the critical role of tokenizer robustness in downstream tasks and presents an important step in the development of safe multimodal foundation models.
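To see why perturbing features can change the extracted tokens, consider a toy nearest-neighbor quantizer: a feature maps to the index of its closest codebook vector, so a small shift across a cell boundary flips the token. This is only an illustration of the quantization step, not the attack from the paper (which the abstract does not detail); the codebook, feature values, and function name are all assumed.

```python
def nearest_token(feature, codebook):
    """Return the index of the codebook vector closest to `feature` (squared Euclidean)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)),
               key=lambda i: squared_distance(feature, codebook[i]))

# Hypothetical 2-D codebook with three entries.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

feature = [0.45, 0.10]     # just inside the cell of token 0
perturbed = [0.55, 0.10]   # shifted by 0.10 along the first axis

print(nearest_token(feature, codebook))
print(nearest_token(perturbed, codebook))
```

Here a shift of 0.10 moves the feature from the cell of token 0 into the cell of token 1, so any downstream model that consumes tokens sees a different input.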
15. 【2602.18199】A Self-Supervised Approach on Motion Calibration for Enhancing Physical Plausibility in Text-to-Motion
Link: https://arxiv.org/abs/2602.18199
Authors: Gahyeon Shim, Soogeun Park, Hyemin Ahn
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Generating semantically aligned, made rapid progress, semantically aligned human, Generating semantically, Distortion-aware Motion Calibrator
Comments:
Abstract:Generating semantically aligned human motion from textual descriptions has made rapid progress, but ensuring both semantic and physical realism in motion remains a challenge. In this paper, we introduce the Distortion-aware Motion Calibrator (DMC), a post-hoc module that refines physically implausible motions (e.g., foot floating) while preserving semantic consistency with the original textual description. Rather than relying on complex physical modeling, we propose a self-supervised and data-driven approach, whereby DMC learns to obtain physically plausible motions when an intentionally distorted motion and the original textual descriptions are given as inputs. We evaluate DMC as a post-hoc module to improve motions obtained from various text-to-motion generation models and demonstrate its effectiveness in improving physical plausibility while enhancing semantic consistency. The experimental results show that DMC reduces FID score by 42.74% on T2M and 13.20% on T2M-GPT, while also achieving the highest R-Precision. When applied to high-quality models like MoMask, DMC improves the physical plausibility of motions by reducing penetration by 33.0% as well as adjusting floating artifacts closer to the ground-truth reference. These results highlight that DMC can serve as a promising post-hoc motion refinement framework for any kind of text-to-motion models by incorporating textual semantics and physical plausibility.
16. 【2602.18193】BLM-Guard: Explainable Multimodal Ad Moderation with Chain-of-Thought and Policy-Aligned Rewards
Link: https://arxiv.org/abs/2602.18193
Authors: Yiran Yang, Zhaowei Liu, Yuan Yuan, Yukun Song, Xiong Ma, Yinghao Song, Xiangji Zeng, Lu Sun, Yulu Wang, Hai Zhou, Shuai Cui, Zhaohan Gong, Jiefei Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: subtitles demand finer-grained, community safety filters, host vast multimodal, vast multimodal ads, deceptive visuals
Comments: 7 pages, 3 figures. To appear in AAAI 2026
Abstract:Short-video platforms now host vast multimodal ads whose deceptive visuals, speech and subtitles demand finer-grained, policy-driven moderation than community safety filters. We present BLM-Guard, a content-audit framework for commercial ads that fuses Chain-of-Thought reasoning with rule-based policy principles and a critic-guided reward. A rule-driven ICoT data-synthesis pipeline jump-starts training by generating structured scene descriptions, reasoning chains and labels, cutting annotation costs. Reinforcement learning then refines the model using a composite reward balancing causal coherence with policy adherence. A multitask architecture models intra-modal manipulations (e.g., exaggerated imagery) and cross-modal mismatches (e.g., subtitle-speech drift), boosting robustness. Experiments on real short-video ads show BLM-Guard surpasses strong baselines in accuracy, consistency and generalization.
17. 【2602.18178】Evaluating Graphical Perception Capabilities of Vision Transformers
Link: https://arxiv.org/abs/2602.18178
Authors: Poonam Poonam, Pere-Pau Vázquez, Timo Ropinski
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: convolutional neural networks, Vision Transformers, neural networks, powerful alternative, alternative to convolutional
Comments:
Abstract:Vision Transformers, ViTs, have emerged as a powerful alternative to convolutional neural networks, CNNs, in a variety of image-based tasks. While CNNs have previously been evaluated for their ability to perform graphical perception tasks, which are essential for interpreting visualizations, the perceptual capabilities of ViTs remain largely unexplored. In this work, we investigate the performance of ViTs in elementary visual judgment tasks inspired by the foundational studies of Cleveland and McGill, which quantified the accuracy of human perception across different visual encodings. Inspired by their study, we benchmark ViTs against CNNs and human participants in a series of controlled graphical perception tasks. Our results reveal that, although ViTs demonstrate strong performance in general vision tasks, their alignment with human-like graphical perception in the visualization domain is limited. This study highlights key perceptual gaps and points to important considerations for the application of ViTs in visualization systems and graphical perceptual modeling.
18. 【2602.18094】OODBench: Out-of-Distribution Benchmark for Large Vision-Language Models
Link: https://arxiv.org/abs/2602.18094
Authors: Ling Lin, Yang Bai, Heng Su, Congcong Zhu, Yaoxing Wang, Yang Zhou, Huazhu Fu, Jingrun Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Databases (cs.DB)
Keywords: Existing Visual-Language Models, achieved significant progress, Existing Visual-Language, Visual-Language Models, massive-scale datasets
Comments: 54 pages, 21 figures
Abstract:Existing Visual-Language Models (VLMs) have achieved significant progress by being trained on massive-scale datasets, typically under the assumption that data are independent and identically distributed (IID). However, in real-world scenarios, it is often impractical to expect that all data processed by an AI system satisfy this assumption. Furthermore, failure to appropriately handle out-of-distribution (OOD) objects may introduce safety risks in real-world applications (e.g., autonomous driving or medical assistance). Unfortunately, current research has not yet provided valid benchmarks that can comprehensively assess the performance of VLMs in response to OOD data. Therefore, we propose OODBench, a predominantly automated method with minimal human verification, for constructing new benchmarks and evaluating the ability of VLMs to process OOD data. OODBench contains 40K instance-level OOD instance-category pairs, and we show that current VLMs still exhibit notable performance degradation on OODBench, even when the underlying image categories are common. In addition, we propose a reliable automated assessment metric that employs a Basic-to-Advanced Progression of prompted questions to assess the impact of OOD data on questions of varying difficulty more fully. Lastly, we summarize substantial findings and insights to facilitate future research in the acquisition and evaluation of OOD data.
19. 【2602.18093】Predict to Skip: Linear Multistep Feature Forecasting for Efficient Diffusion Transformers
Link: https://arxiv.org/abs/2602.18093
Authors: Hanshuai Cui, Zhiqing Tang, Qianli Ma, Zhi Yao, Weijia Jia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: high computational costs, widely adopted backbone, iterative denoising process, denoising process incurs, process incurs high
Comments:
Abstract:Diffusion Transformers (DiT) have emerged as a widely adopted backbone for high-fidelity image and video generation, yet their iterative denoising process incurs high computational costs. Existing training-free acceleration methods rely on feature caching and reuse under the assumption of temporal stability. However, reusing features for multiple steps may lead to latent drift and visual degradation. We observe that model outputs evolve smoothly along much of the diffusion trajectory, enabling principled predictions rather than naive reuse. Based on this insight, we propose PrediT, a training-free acceleration framework that formulates feature prediction as a linear multistep problem. We employ classical linear multistep methods to forecast future model outputs from historical information, combined with a corrector that activates in high-dynamics regions to prevent error accumulation. A dynamic step modulation mechanism adaptively adjusts the prediction horizon by monitoring the feature change rate. Together, these components enable substantial acceleration while preserving generation fidelity. Extensive experiments validate that our method achieves up to 5.54× latency reduction across various DiT-based image and video generation models, while incurring negligible quality degradation.
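The contrast between naive feature reuse and multistep forecasting can be sketched in a few lines. The example below applies the classical two-step Adams-Bashforth update with unit step size, treating backward differences of cached features as the derivative. The abstract does not state which multistep formula or order is actually used, so the formula choice, the function names, and the toy feature values are assumptions.

```python
def forecast_next(history):
    """Predict the next feature vector from the last three cached ones.

    Two-step Adams-Bashforth with unit step size, using backward differences
    of the cached features as the derivative estimate (an assumed setup).
    """
    if len(history) < 3:
        raise ValueError("need at least three cached feature vectors")
    f_prev2, f_prev1, f_curr = history[-3], history[-2], history[-1]
    prediction = []
    for a, b, c in zip(f_prev2, f_prev1, f_curr):
        d_curr = c - b   # most recent change
        d_prev = b - a   # change one step earlier
        prediction.append(c + 1.5 * d_curr - 0.5 * d_prev)
    return prediction

def reuse_last(history):
    """Naive caching baseline: reuse the most recent feature unchanged."""
    return list(history[-1])

# Toy cached features at three consecutive denoising steps (assumed values).
cached = [[0.0, 1.0], [1.0, 1.5], [2.0, 2.0]]

print(forecast_next(cached))
print(reuse_last(cached))
```

On this linearly evolving toy trajectory the forecast continues the trend to `[3.0, 2.5]`, while plain reuse stays at `[2.0, 2.0]`, which is the kind of drift the abstract attributes to multi-step caching.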
20. 【2602.18089】DohaScript: A Large-Scale Multi-Writer Dataset for Continuous Handwritten Hindi Text
Link: https://arxiv.org/abs/2602.18089
Authors: Kunwar Arpit Singh, Ankush Prakash, Haroon R Lone
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: remains severely underrepresented, text remains severely, Devanagari text remains, handwritten Devanagari text, millions of speakers
Comments:
Abstract:Despite having hundreds of millions of speakers, handwritten Devanagari text remains severely underrepresented in publicly available benchmark datasets. Existing resources are limited in scale, focus primarily on isolated characters or short words, and lack controlled lexical content and writer level diversity, which restricts their utility for modern data driven handwriting analysis. As a result, they fail to capture the continuous, fused, and structurally complex nature of Devanagari handwriting, where characters are connected through a shared shirorekha (horizontal headline) and exhibit rich ligature formations. We introduce DohaScript, a large scale, multi writer dataset of handwritten Hindi text collected from 531 unique contributors. The dataset is designed as a parallel stylistic corpus, in which all writers transcribe the same fixed set of six traditional Hindi dohas (couplets). This controlled design enables systematic analysis of writer specific variation independent of linguistic content, and supports tasks such as handwriting recognition, writer identification, style analysis, and generative modeling. The dataset is accompanied by non identifiable demographic metadata, rigorous quality curation based on objective sharpness and resolution criteria, and page level layout difficulty annotations that facilitate stratified benchmarking. Baseline experiments demonstrate clear quality separation and strong generalization to unseen writers, highlighting the dataset's reliability and practical value. DohaScript is intended to serve as a standardized and reproducible benchmark for advancing research on continuous handwritten Devanagari text in low resource script settings.
21. 【2602.18083】Comparative Assessment of Multimodal Earth Observation Data for Soil Moisture Estimation
Link: https://arxiv.org/abs/2602.18083
Authors: Ioannis Kontogiorgakis, Athanasios Askitopoulos, Iason Tsardanidis, Dimitrios Bormpoudakis, Ilias Tsoumas, Fotios Balampanis, Charalampos Kontoes
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: water resources management, Accurate soil moisture, Accurate soil, precision agriculture, water resources
Comments: This paper has been submitted to IEEE IGARSS 2026
Abstract:Accurate soil moisture (SM) estimation is critical for precision agriculture, water resources management and climate monitoring. Yet, existing satellite SM products are too coarse (1km) for farm-level applications. We present a high-resolution (10m) SM estimation framework for vegetated areas across Europe, combining Sentinel-1 SAR, Sentinel-2 optical imagery and ERA-5 reanalysis data through machine learning. Using 113 International Soil Moisture Network (ISMN) stations spanning diverse vegetated areas, we compare modality combinations with temporal parameterizations, using spatial cross-validation, to ensure geographic generalization. We also evaluate whether foundation model embeddings from IBM-NASA's Prithvi model improve upon traditional hand-crafted spectral features. Results demonstrate that hybrid temporal matching - Sentinel-2 current-day acquisitions with Sentinel-1 descending orbit - achieves R^2=0.514, with 10-day ERA5 lookback window improving performance to R^2=0.518. Foundation model (Prithvi) embeddings provide negligible improvement over hand-crafted features (R^2=0.515 vs. 0.514), indicating traditional feature engineering remains highly competitive for sparse-data regression tasks. Our findings suggest that domain-specific spectral indices combined with tree-based ensemble methods offer a practical and computationally efficient solution for operational pan-European field-scale soil moisture monitoring.
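Two ingredients of this evaluation are easy to make concrete: the coefficient of determination, R^2 = 1 - SS_res / SS_tot, and a station-wise split that keeps all samples from one station in the same fold. The sketch below is a minimal illustration; the abstract does not describe its exact spatial cross-validation scheme, so the round-robin fold assignment, the function names, and the toy data are assumptions.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def station_folds(station_ids, n_folds):
    """Group sample indices into folds so that no station spans two folds.

    Stations are assigned round-robin in sorted order (an assumed scheme).
    """
    stations = sorted(set(station_ids))
    fold_of = {station: i % n_folds for i, station in enumerate(stations)}
    folds = [[] for _ in range(n_folds)]
    for index, station in enumerate(station_ids):
        folds[fold_of[station]].append(index)
    return folds

# Toy data: five samples from three hypothetical stations.
station_ids = ["a", "a", "b", "c", "b"]
print(station_folds(station_ids, 2))

# A constant prediction at the mean gives R^2 = 0; a perfect one gives 1.
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))
print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
```

Holding out whole stations is what makes the reported scores a test of geographic generalization rather than of interpolation between samples from the same site.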
22. 【2602.18066】Faster Training, Fewer Labels: Self-Supervised Pretraining for Fine-Grained BEV Segmentation
Link: https://arxiv.org/abs/2602.18066
Authors: Daniel Busch, Christian Bohn, Thomas Kurbiel, Klaus Friedrichs, Richard Meyes, Tobias Meisen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Dense Bird Eye, Bird Eye View, Dense Bird, Eye View, Bird Eye
Comments: This paper has been accepted to the 2026 IEEE Intelligent Vehicles Symposium (IV)
Abstract:Dense Bird's Eye View (BEV) semantic maps are central to autonomous driving, yet current multi-camera methods depend on costly, inconsistently annotated BEV ground truth. We address this limitation with a two-phase training strategy for fine-grained road marking segmentation that removes full supervision during pretraining and halves the amount of training data during fine-tuning while still outperforming the comparable supervised baseline model. During the self-supervised pretraining, BEVFormer predictions are differentiably reprojected into the image plane and trained against multi-view semantic pseudo-labels generated by the widely used semantic segmentation model Mask2Former. A temporal loss encourages consistency across frames. The subsequent supervised fine-tuning phase requires only 50% of the dataset and significantly less training time. With our method, the fine-tuning benefits from rich priors learned during pretraining boosting the performance and BEV segmentation quality (up to +2.5pp mIoU over the fully supervised baseline) on nuScenes. It simultaneously halves the usage of annotation data and reduces total training time by up to two thirds. The results demonstrate that differentiable reprojection plus camera perspective pseudo labels yields transferable BEV features and a scalable path toward reduced-label autonomous perception.
23. 【2602.18064】3DMedAgent: Unified Perception-to-Understanding for 3D Medical Analysis
Link: https://arxiv.org/abs/2602.18064
Authors: Ziyue Wang, Linghan Cai, Chang Han Low, Haofeng Liu, Junde Wu, Jingyu Wang, Rui Wang, Lei Song, Jiang Bian, Jingjing Fu, Yueming Jin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: high-level clinical understanding, spans a continuum, continuum from low-level, analysis spans, low-level perception
Comments: 19 pages, 7 figures
Abstract:3D CT analysis spans a continuum from low-level perception to high-level clinical understanding. Existing 3D-oriented analysis methods adopt either isolated task-specific modeling or task-agnostic end-to-end paradigms to produce one-hop outputs, impeding the systematic accumulation of perceptual evidence for downstream reasoning. In parallel, recent multimodal large language models (MLLMs) exhibit improved visual perception and can integrate visual and textual information effectively, yet their predominantly 2D-oriented designs fundamentally limit their ability to perceive and analyze volumetric medical data. To bridge this gap, we propose 3DMedAgent, a unified agent that enables 2D MLLMs to perform general 3D CT analysis without 3D-specific fine-tuning. 3DMedAgent coordinates heterogeneous visual and textual tools through a flexible MLLM agent, progressively decomposing complex 3D analysis into tractable subtasks that transition from global to regional views, from 3D volumes to informative 2D slices, and from visual evidence to structured textual representations. Central to this design, 3DMedAgent maintains a long-term structured memory that aggregates intermediate tool outputs and supports query-adaptive, evidence-driven multi-step reasoning. We further introduce the DeepChestVQA benchmark for evaluating unified perception-to-understanding capabilities in 3D thoracic imaging. Experiments across over 40 tasks demonstrate that 3DMedAgent consistently outperforms general, medical, and 3D-specific MLLMs, highlighting a scalable path toward general-purpose 3D clinical [...]. Code and data are available at this https URL.
24. 【2602.18057】Temporal Consistency-Aware Text-to-Motion Generation
Link: https://arxiv.org/abs/2602.18057
Authors: Hongsong Wang, Wenjing Yan, Qiuxia Lai, Xin Geng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: natural language descriptions, synthesize realistic human, realistic human motion, human motion sequences, language descriptions
Comments: Code is on [this https URL](https://github.com/Giat995/TCA-T2M/)
Abstract:Text-to-Motion (T2M) generation aims to synthesize realistic human motion sequences from natural language descriptions. While two-stage frameworks leveraging discrete motion representations have advanced T2M research, they often neglect cross-sequence temporal consistency, i.e., the shared temporal structures present across different instances of the same action. This leads to semantic misalignments and physically implausible motions. To address this limitation, we propose TCA-T2M, a framework for temporal consistency-aware T2M generation. Our approach introduces a temporal consistency-aware spatial VQ-VAE (TCaS-VQ-VAE) for cross-sequence temporal alignment, coupled with a masked motion transformer for text-conditioned motion generation. Additionally, a kinematic constraint block mitigates discretization artifacts to ensure physical plausibility. Experiments on HumanML3D and KIT-ML benchmarks demonstrate that TCA-T2M achieves state-of-the-art performance, highlighting the importance of temporal consistency in robust and coherent T2M generation.
25. 【2602.18047】CityGuard: Graph-Aware Private Descriptors for Bias-Resilient Identity Search Across Urban Cameras
Link: https://arxiv.org/abs/2602.18047
Authors: Rong Fu, Wenxin Zhang, Yibo Meng, Jia Yee Tan, Jiaxuan Lu, Rui Lu, Jiekai Wu, Zhaolu Kang, Simon Fong
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: City-scale person re-identification, sharing raw imagery, handle severe appearance, data protection rules, prevent sharing raw
Comments: 36 pages, 12 figures
Abstract:City-scale person re-identification across distributed cameras must handle severe appearance changes from viewpoint, occlusion, and domain shift while complying with data protection rules that prevent sharing raw imagery. We introduce CityGuard, a topology-aware transformer for privacy-preserving identity retrieval in decentralized surveillance. The framework integrates three components. A dispersion-adaptive metric learner adjusts instance-level margins according to feature spread, increasing intra-class compactness. Spatially conditioned attention injects coarse geometry, such as GPS or deployment floor plans, into graph-based self-attention to enable projectively consistent cross-view alignment using only coarse geometric priors without requiring survey-grade calibration. Differentially private embedding maps are coupled with compact approximate indexes to support secure and cost-efficient deployment. Together these designs produce descriptors robust to viewpoint variation, occlusion, and domain shifts, and they enable a tunable balance between privacy and utility under rigorous differential-privacy accounting. Experiments on Market-1501 and additional public benchmarks, complemented by database-scale retrieval studies, show consistent gains in retrieval precision and query throughput over strong baselines, confirming the practicality of the framework for privacy-critical urban identity matching.
26. 【2602.18043】Spatio-temporal Decoupled Knowledge Compensator for Few-Shot Action Recognition
链接:https://arxiv.org/abs/2602.18043
作者:Hongyu Qu,Xiangbo Shu,Rui Yan,Hailiang Gao,Wenguan Wang,Jinhui Tang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Few-Shot Action Recognition, Action Recognition, labeled videos, challenging task, task that requires
备注: Accepted to TPAMI 2026
点击查看摘要
Abstract:Few-Shot Action Recognition (FSAR) is a challenging task that requires recognizing novel action categories with a few labeled videos. Recent works typically apply semantically coarse category names as auxiliary contexts to guide the learning of discriminative visual features. However, such context provided by the action names is too limited to provide sufficient background knowledge for capturing novel spatial and temporal concepts in actions. In this paper, we propose DiST, an innovative Decomposition-incorporation framework for FSAR that makes use of decoupled Spatial and Temporal knowledge provided by large language models to learn expressive multi-granularity prototypes. In the decomposition stage, we decouple vanilla action names into diverse spatio-temporal attribute descriptions (action-related knowledge). Such commonsense knowledge complements semantic contexts from spatial and temporal perspectives. In the incorporation stage, we propose Spatial/Temporal Knowledge Compensators (SKC/TKC) to discover discriminative object-level and frame-level prototypes, respectively. In SKC, object-level prototypes adaptively aggregate important patch tokens under the guidance of spatial knowledge. Moreover, in TKC, frame-level prototypes utilize temporal attributes to assist in inter-frame temporal relation modeling. These learned prototypes thus provide transparency in capturing fine-grained spatial details and diverse temporal patterns. Experimental results show DiST achieves state-of-the-art results on five standard FSAR datasets.
27. 【2602.18022】Dual-Channel Attention Guidance for Training-Free Image Editing Control in Diffusion Transformers
链接:https://arxiv.org/abs/2602.18022
作者:Guandong Li,Mengxia Ye
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Diffusion Transformer, editing models built, critical requirement, requirement for diffusion-based, models built
备注:
点击查看摘要
Abstract:Training-free control over editing intensity is a critical requirement for diffusion-based image editing models built on the Diffusion Transformer (DiT) architecture. Existing attention manipulation methods focus exclusively on the Key space to modulate attention routing, leaving the Value space -- which governs feature aggregation -- entirely unexploited. In this paper, we first reveal that both Key and Value projections in DiT's multi-modal attention layers exhibit a pronounced bias-delta structure, where token embeddings cluster tightly around a layer-specific bias vector. Building on this observation, we propose Dual-Channel Attention Guidance (DCAG), a training-free framework that simultaneously manipulates both the Key channel (controlling where to attend) and the Value channel (controlling what to aggregate). We provide a theoretical analysis showing that the Key channel operates through the nonlinear softmax function, acting as a coarse control knob, while the Value channel operates through linear weighted summation, serving as a fine-grained complement. Together, the two-dimensional parameter space $(\delta_k, \delta_v)$ enables more precise editing-fidelity trade-offs than any single-channel method. Extensive experiments on the PIE-Bench benchmark (700 images, 10 editing categories) demonstrate that DCAG consistently outperforms Key-only guidance across all fidelity metrics, with the most significant improvements observed in localized editing tasks such as object deletion (4.9% LPIPS reduction) and object addition (3.2% LPIPS reduction).
28. 【2602.18020】UAOR: Uncertainty-aware Observation Reinjection for Vision-Language-Action Models
链接:https://arxiv.org/abs/2602.18020
作者:Jiabing Yang,Yixiang Chen,Yuan Xu,Peiyan Li,Xiangnan Wu,Zichen Wen,Bowen Fang,Tao Yu,Zhengbo Zhang,Yingda Li,Kai Wang,Jing Liu,Nianfeng Liu,Yan Huang,Liang Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:demonstrating remarkable potential, generalizable robotic manipulation, leverage pretrained Vision-Language, models leverage pretrained, pretrained Vision-Language Models
备注:
点击查看摘要
Abstract:Vision-Language-Action (VLA) models leverage pretrained Vision-Language Models (VLMs) as backbones to map images and instructions to actions, demonstrating remarkable potential for generalizable robotic manipulation. To enhance performance, existing methods often incorporate extra observation cues (e.g., depth maps, point clouds) or auxiliary modules (e.g., object detectors, encoders) to enable more precise and reliable task execution, yet these typically require costly data collection and additional training. Inspired by the finding that Feed-Forward Network (FFN) in language models can act as "key-value memory", we propose Uncertainty-aware Observation Reinjection (UAOR), an effective, training-free and plug-and-play module for VLA models. Specifically, when the current language model layer exhibits high uncertainty, measured by Action Entropy, it reinjects key observation information into the next layer's Feed-Forward Network (FFN) through attention retrieval. This mechanism helps VLAs better attend to observations during inference, enabling more confident and faithful action generation. Comprehensive experiments show that our method consistently improves diverse VLA models across simulation and real-world tasks with minimal overhead. Notably, UAOR eliminates the need for additional observation cues or modules, making it a versatile and practical plug-in for existing VLA pipelines. The project page is at this https URL.
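The uncertainty gate described in this abstract can be illustrated with a minimal sketch. It assumes that "Action Entropy" is the Shannon entropy of the softmax distribution over action logits, which the abstract does not confirm. The function names and the threshold value are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_reinject(action_logits, threshold):
    """Hypothetical gate: reinject observations only when entropy is high.

    Assumption: the layer's uncertainty is measured as the entropy of its
    action distribution; the threshold is an illustrative hyperparameter.
    """
    return entropy(softmax(action_logits)) > threshold

uncertain_layer = [0.0, 0.0, 0.0, 0.0]   # uniform distribution, high entropy
confident_layer = [10.0, 0.0, 0.0, 0.0]  # peaked distribution, low entropy
print(should_reinject(uncertain_layer, threshold=0.5))
print(should_reinject(confident_layer, threshold=0.5))
```

In this sketch a uniform distribution over four actions has entropy ln 4, so it passes the gate, while the peaked distribution does not.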
29. 【2602.18019】DeepSVU: Towards In-depth Security-oriented Video Understanding via Unified Physical-world Regularized MoE
链接:https://arxiv.org/abs/2602.18019
作者:Yujie Jin,Wenxin Zhang,Jingjing Wang,Guodong Zhou
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Security-oriented Video Understanding, In-depth Security-oriented Video, Video Understanding, Security-oriented Video, paradigm SVU task
备注:
点击查看摘要
Abstract:In the literature, prior research on Security-oriented Video Understanding (SVU) has predominantly focused on detecting and localizing threats (e.g., shootings, robberies) in videos, while largely lacking the capability to generate and evaluate the causes of those threats. Motivated by these gaps, this paper introduces a new chat-paradigm SVU task, i.e., In-depth Security-oriented Video Understanding (DeepSVU), which aims not only to identify and locate the threats but also to attribute and evaluate the causes of threatening segments. Furthermore, this paper reveals two key challenges in the proposed task: 1) how to effectively model the coarse-to-fine physical-world information (e.g., human behavior, object interactions and background context) to boost the DeepSVU task; and 2) how to adaptively trade off these factors. To tackle these challenges, this paper proposes a new Unified Physical-world Regularized MoE (UPRM) approach. Specifically, UPRM incorporates two key components: the Unified Physical-world Enhanced MoE (UPE) Block and the Physical-world Trade-off Regularizer (PTR), to address the above two challenges, respectively. Extensive experiments conducted on our DeepSVU instruction datasets (i.e., UCF-C instructions and CUVA instructions) demonstrate that UPRM outperforms several advanced Video-LLMs as well as non-VLM approaches. These results justify the importance of the coarse-to-fine physical-world information in the DeepSVU task and demonstrate the effectiveness of our UPRM in capturing such information.
30. 【2602.18016】Towards LLM-centric Affective Visual Customization via Efficient and Precise Emotion Manipulating
链接:https://arxiv.org/abs/2602.18016
作者:Jiamin Luo,Xuqian Gu,Jingjing Wang,Jiahong Lu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:affective visual customization, visual customization primarily, customization primarily rely, visual customization, importantly lack general-purpose
备注:
点击查看摘要
Abstract:Previous studies on visual customization primarily rely on the objective alignment between various control signals (e.g., language, layout and canny) and the edited images, largely ignoring subjective emotional content and, more importantly, lacking general-purpose foundation models for affective visual customization. With this in mind, this paper proposes an LLM-centric Affective Visual Customization (L-AVC) task, which focuses on generating images while modifying their subjective emotions via a Multimodal LLM. Further, this paper contends that two problems are both important and challenging in the L-AVC task: how to make the model efficiently align emotion conversion in semantics (named inter-emotion semantic conversion) and how to precisely retain emotion-agnostic contents (named exter-emotion semantic retaining). To this end, this paper proposes an Efficient and Precise Emotion Manipulating (EPEM) approach for editing subjective emotions in images. Specifically, an Efficient Inter-emotion Converting (EIC) module is tailored to make the LLM efficiently align emotion conversion in semantics before and after editing, followed by a Precise Exter-emotion Retaining (PER) module to precisely retain the emotion-agnostic contents. Comprehensive experimental evaluations on our constructed L-AVC dataset demonstrate the great advantage of the proposed EPEM approach on the L-AVC task over several state-of-the-art baselines. This justifies the importance of emotion information for L-AVC and the effectiveness of EPEM in efficiently and precisely manipulating such information.
31. 【2602.18006】MUOT_3M: A 3 Million Frame Multimodal Underwater Benchmark and the MUTrack Tracking Method
链接:https://arxiv.org/abs/2602.18006
作者:Ahsan Baidar Bakht,Mohamad Alansari,Muhayy Ud Din,Muzammal Naseer,Sajid Javed,Irfan Hussain,Jiri Matas,Arif Mahmood
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:large scale ecological, scale ecological monitoring, Underwater Object Tracking, efficient marine robotics, Object Tracking
备注:
点击查看摘要
Abstract:Underwater Object Tracking (UOT) is crucial for efficient marine robotics, large scale ecological monitoring, and ocean exploration; however, progress has been hindered by the scarcity of large, multimodal, and diverse datasets. Existing benchmarks remain small and RGB only, limiting robustness under severe color distortion, turbidity, and low visibility conditions. We introduce MUOT_3M, the first pseudo multimodal UOT benchmark comprising 3 million frames from 3,030 videos (27.8h) annotated with 32 tracking attributes, 677 fine grained classes, and synchronized RGB, estimated enhanced RGB, estimated depth, and language modalities validated by a marine biologist. Building upon MUOT_3M, we propose MUTrack, a SAM-based multimodal to unimodal tracker featuring visual geometric alignment, vision language fusion, and four level knowledge distillation that transfers multimodal knowledge into a unimodal student model. Extensive evaluations across five UOT benchmarks demonstrate that MUTrack achieves up to 8.40% higher AUC and 7.80% higher precision than the strongest SOTA baselines while running at 24 FPS. MUOT_3M and MUTrack establish a new foundation for scalable, multimodally trained yet practically deployable underwater tracking.
32. 【2602.18000】Image Quality Assessment: Exploring Quality Awareness via Memory-driven Distortion Patterns Matching
链接:https://arxiv.org/abs/2602.18000
作者:Xuting Lan,Mingliang Zhou,Xuekai Wei,Jielu Yan,Yueting Huang,Huayan Pu,Jun Luo,Weijia Jia
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieve high-precision evaluation, analysing feature differences, Existing full-reference image, methods achieve high-precision, image quality assessment
备注:
点击查看摘要
Abstract:Existing full-reference image quality assessment (FR-IQA) methods achieve high-precision evaluation by analysing feature differences between reference and distorted images. However, their performance is constrained by the quality of the reference image, which limits real-world applications where ideal reference sources are unavailable. Notably, the human visual system has the ability to accumulate visual memory, allowing image quality assessment on the basis of long-term memory storage. Inspired by this biological memory mechanism, we propose a memory-driven quality-aware framework (MQAF), which establishes a memory bank for storing distortion patterns and dynamically switches between dual-mode quality assessment strategies to reduce reliance on high-quality reference images. When reference images are available, MQAF obtains reference-guided quality scores by adaptively weighting reference information and comparing the distorted image with stored distortion patterns in the memory bank. When the reference image is absent, the framework relies on distortion patterns in the memory bank to infer image quality, enabling no-reference quality assessment (NR-IQA). The experimental results show that our method outperforms state-of-the-art approaches across multiple datasets while adapting to both no-reference and full-reference tasks.
33. 【2602.17951】ROCKET: Residual-Oriented Multi-Layer Alignment for Spatially-Aware Vision-Language-Action Models
链接:https://arxiv.org/abs/2602.17951
作者:Guoheng Sun,Tingting Du,Kaixi Feng,Chenxiang Luo,Xingguo Ding,Zheyu Shen,Ziyao Wang,Yexiao He,Ang Li
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:instruction-following robotic manipulation, enable instruction-following robotic, models enable instruction-following, data and lack, spatial understanding
备注:
点击查看摘要
Abstract:Vision-Language-Action (VLA) models enable instruction-following robotic manipulation, but they are typically pretrained on 2D data and lack 3D spatial understanding. An effective approach is representation alignment, where a strong vision foundation model is used to guide a 2D VLA model. However, existing methods usually apply supervision at only a single layer, failing to fully exploit the rich information distributed across depth; meanwhile, naïve multi-layer alignment can cause gradient interference. We introduce ROCKET, a residual-oriented multi-layer representation alignment framework that formulates multi-layer alignment as aligning one residual stream to another. Concretely, ROCKET employs a shared projector to align multiple layers of the VLA backbone with multiple layers of a powerful 3D vision foundation model via a layer-invariant mapping, which reduces gradient conflicts. We provide both theoretical justification and empirical analyses showing that a shared projector is sufficient and outperforms prior designs, and further propose a Matryoshka-style sparse activation scheme for the shared projector to balance multiple alignment losses. Our experiments show that, combined with a training-free layer selection strategy, ROCKET requires only about 4% of the compute budget while achieving 98.5% state-of-the-art success rate on LIBERO. We further demonstrate the superior performance of ROCKET across LIBERO-Plus and RoboTwin, as well as multiple VLA models. The code and model weights can be found at this https URL.
34. 【2602.17929】ZACH-ViT: Regime-Dependent Inductive Bias in Compact Vision Transformers for Medical Imaging
链接:https://arxiv.org/abs/2602.17929
作者:Athanasios Angelakis
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
关键词:Vision Transformers rely, Hierarchical Vision Transformer, compact Vision Transformer, Vision Transformer, Compact Hierarchical Vision
备注: 15 pages, 12 figures, 7 tables. Code and models available at [this https URL](https://github.com/Bluesman79/ZACH-ViT)
点击查看摘要
Abstract:Vision Transformers rely on positional embeddings and class tokens that encode fixed spatial priors. While effective for natural images, these priors may hinder generalization when spatial layout is weakly informative or inconsistent, a frequent condition in medical imaging and edge-deployed clinical systems. We introduce ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer), a compact Vision Transformer that removes both positional embeddings and the [CLS] token, achieving permutation invariance through global average pooling over patch representations. The term "Zero-token" specifically refers to removing the dedicated [CLS] aggregation token and positional embeddings; patch tokens remain unchanged and are processed normally. Adaptive residual projections preserve training stability in compact configurations while maintaining a strict parameter budget. Evaluation is performed across seven MedMNIST datasets spanning binary and multi-class tasks under a strict few-shot protocol (50 samples per class, fixed hyperparameters, five random seeds). The empirical analysis demonstrates regime-dependent behavior: ZACH-ViT (0.25M parameters, trained from scratch) achieves its strongest advantage on BloodMNIST and remains competitive with TransMIL on PathMNIST, while its relative advantage decreases on datasets with strong anatomical priors (OCTMNIST, OrganAMNIST), consistent with the architectural hypothesis. These findings support the view that aligning architectural inductive bias with data structure can be more important than pursuing universal benchmark dominance. Despite its minimal size and lack of pretraining, ZACH-ViT achieves competitive performance while maintaining sub-second inference times, supporting deployment in resource-constrained clinical environments. Code and models are available at this https URL.
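The permutation-invariance claim in this abstract follows from the fact that a mean over patch tokens does not depend on their order. The sketch below only illustrates that property. It is not the ZACH-ViT architecture, and the token values and the function name are made up for the example.

```python
import random

def global_average_pool(tokens):
    """Average a list of equal-length token vectors, dimension by dimension."""
    n = len(tokens)
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / n for d in range(dim)]

# Toy "patch tokens": 16 patches with 4 integer features each (illustrative only).
rng = random.Random(0)
patches = [[rng.randint(-5, 5) for _ in range(4)] for _ in range(16)]

# Shuffle the patch order, as if the spatial layout had been scrambled.
shuffled = patches[:]
rng.shuffle(shuffled)

# With no positional embeddings and no [CLS] token, the pooled vector is the
# only summary passed on, and it is identical for both orderings.
print(global_average_pool(patches) == global_average_pool(shuffled))
```

Integer features are used so that the sums are exact and the two pooled vectors compare equal without a floating-point tolerance.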
35. 【2602.17909】A Single Image and Multimodality Is All You Need for Novel View Synthesis
链接:https://arxiv.org/abs/2602.17909
作者:Amirhosein Javadi,Chi-Shiang Gau,Konstantinos D. Polyzos,Tara Javidi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:recently demonstrated strong, demonstrated strong performance, approaches have recently, recently demonstrated, demonstrated strong
备注:
点击查看摘要
Abstract:Diffusion-based approaches have recently demonstrated strong performance for single-image novel view synthesis by conditioning generative models on geometry inferred from monocular depth estimation. However, in practice, the quality and consistency of the synthesized views are fundamentally limited by the reliability of the underlying depth estimates, which are often fragile under low texture, adverse weather, and occlusion-heavy real-world conditions. In this work, we show that incorporating sparse multimodal range measurements provides a simple yet effective way to overcome these limitations. We introduce a multimodal depth reconstruction framework that leverages extremely sparse range sensing data, such as automotive radar or LiDAR, to produce dense depth maps that serve as robust geometric conditioning for diffusion-based novel view synthesis. Our approach models depth in an angular domain using a localized Gaussian Process formulation, enabling computationally efficient inference while explicitly quantifying uncertainty in regions with limited observations. The reconstructed depth and uncertainty are used as a drop-in replacement for monocular depth estimators in existing diffusion-based rendering pipelines, without modifying the generative model itself. Experiments on real-world multimodal driving scenes demonstrate that replacing vision-only depth with our sparse range-based reconstruction substantially improves both geometric consistency and visual quality in single-image novel-view video generation. These results highlight the importance of reliable geometric priors for diffusion-based view synthesis and demonstrate the practical benefits of multimodal sensing even at extreme levels of sparsity.
36. 【2602.17871】Understanding the Fine-Grained Knowledge Capabilities of Vision-Language Models
链接:https://arxiv.org/abs/2602.17871
作者:Dhruba Ghosh,Yuhui Zhang,Ludwig Schmidt
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
关键词:made substantial progress, spanning visual reasoning, visual question answering, Vision-language models, question answering benchmarks
备注:
点击查看摘要
Abstract:Vision-language models (VLMs) have made substantial progress across a wide range of visual question answering benchmarks, spanning visual reasoning, document understanding, and multimodal dialogue. These improvements are evident in a wide range of VLMs built on a variety of base models, alignment architectures, and training data. However, recent works show that these models trail behind in traditional image classification benchmarks, which test fine-grained visual knowledge. We test a large number of recent VLMs on fine-grained classification benchmarks and identify potential factors in the disconnect between fine-grained knowledge and other vision benchmarks. Through a series of ablation experiments, we find that using a better LLM improves all benchmark scores equally, while a better vision encoder disproportionately improves fine-grained classification performance. Furthermore, we find that the pretraining stage is also vital to fine-grained performance, particularly when the language model weights are unfrozen during pretraining. These insights pave the way for enhancing fine-grained visual understanding and vision-centric capabilities in VLMs.
37. 【2602.17869】Learning Compact Video Representations for Efficient Long-form Video Understanding in Large Multimodal Models
链接:https://arxiv.org/abs/2602.17869
作者:Yuxiao Chen,Jue Wang,Zhikang Zhang,Jingru Yi,Xu Zhang,Yang Zou,Zhaowei Cai,Jianbo Yuan,Xinyu Li,Hao Yang,Davide Modolo
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:video backbone architectures, videos spanning tens, large language models, large language model, backbone architectures
备注:
点击查看摘要
Abstract:With recent advancements in video backbone architectures, combined with the remarkable achievements of large language models (LLMs), the analysis of long-form videos spanning tens of minutes has become both feasible and increasingly prevalent. However, the inherently redundant nature of video sequences poses significant challenges for contemporary state-of-the-art models. These challenges stem from two primary aspects: 1) efficiently incorporating a larger number of frames within memory constraints, and 2) extracting discriminative information from the vast volume of input data. In this paper, we introduce a novel end-to-end schema for long-form video understanding, which includes an information-density-based adaptive video sampler (AVS) and an autoencoder-based spatiotemporal video compressor (SVC) integrated with a multimodal large language model (MLLM). Our proposed system offers two major advantages: it adaptively and effectively captures essential information from video sequences of varying durations, and it achieves high compression rates while preserving crucial discriminative information. The proposed framework demonstrates promising performance across various benchmarks, excelling in both long-form video understanding tasks and standard video understanding benchmarks. These results underscore the versatility and efficacy of our approach, particularly in managing the complexities of prolonged video sequences.
38. 【2602.17854】On the Evaluation Protocol of Gesture Recognition for UAV-based Rescue Operation based on Deep Learning: A Subject-Independence Perspective
链接:https://arxiv.org/abs/2602.17854
作者:Domonkos Varga
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Liu and Szirányi, gesture-recognition approach proposed, proposed by Liu, paper presents, presents a methodological
备注:
点击查看摘要
Abstract:This paper presents a methodological analysis of the gesture-recognition approach proposed by Liu and Szirányi, with a particular focus on the validity of their evaluation protocol. We show that the reported near-perfect accuracy metrics result from a frame-level random train-test split that inevitably mixes samples from the same subjects across both sets, causing severe data leakage. By examining the published confusion matrix, learning curves, and dataset construction, we demonstrate that the evaluation does not measure generalization to unseen individuals. Our findings underscore the importance of subject-independent data partitioning in vision-based gesture-recognition research, especially for applications - such as UAV-human interaction - that require reliable recognition of gestures performed by previously unseen people.
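The leakage described in this abstract can be reproduced on toy data. The sketch below contrasts a frame-level random split with a split that holds out whole subjects. The dataset layout (a `subject` field per frame) and all function names are assumptions made for illustration, not the protocol of either paper.

```python
import random

def frame_level_split(samples, test_ratio=0.2, seed=0):
    """Shuffle individual frames and split, ignoring which subject each came from."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def subject_independent_split(samples, test_ratio=0.2, seed=0):
    """Hold out whole subjects so that no person appears in both sets."""
    rng = random.Random(seed)
    subjects = sorted({s["subject"] for s in samples})
    rng.shuffle(subjects)
    n_test = max(1, int(len(subjects) * test_ratio))
    test_subjects = set(subjects[:n_test])
    train = [s for s in samples if s["subject"] not in test_subjects]
    test = [s for s in samples if s["subject"] in test_subjects]
    return train, test

def shared_subjects(train, test):
    """Subjects that occur on both sides of a split (empty means no leakage)."""
    return {s["subject"] for s in train} & {s["subject"] for s in test}

# Toy data: 5 subjects with 100 frames each.
samples = [{"subject": sid, "frame": f} for sid in range(5) for f in range(100)]

leaky_train, leaky_test = frame_level_split(samples)
clean_train, clean_test = subject_independent_split(samples)
print(len(shared_subjects(leaky_train, leaky_test)))  # subjects on both sides
print(len(shared_subjects(clean_train, clean_test)))  # 0 by construction
```

A classifier evaluated on the first split has already seen every test subject during training, which is why its accuracy says little about generalization to unseen people.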
39. 【2602.17853】Neural Prior Estimation: Learning Class Priors from Latent Representations
链接:https://arxiv.org/abs/2602.17853
作者:Masoud Yavari,Payman Moallem
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:imbalance induces systematic, induces systematic bias, deep neural networks, Neural Prior Estimator, Class imbalance induces
备注:
点击查看摘要
Abstract:Class imbalance induces systematic bias in deep neural networks by imposing a skewed effective class prior. This work introduces the Neural Prior Estimator (NPE), a framework that learns feature-conditioned log-prior estimates from latent representations. NPE employs one or more Prior Estimation Modules trained jointly with the backbone via a one-way logistic loss. Under the Neural Collapse regime, NPE is analytically shown to recover the class log-prior up to an additive constant, providing a theoretically grounded adaptive signal without requiring explicit class counts or distribution-specific hyperparameters. The learned estimate is incorporated into logit adjustment, forming NPE-LA, a principled mechanism for bias-aware prediction. Experiments on long-tailed CIFAR and imbalanced semantic segmentation benchmarks (STARE, ADE20K) demonstrate consistent improvements, particularly for underrepresented classes. NPE thus offers a lightweight and theoretically justified approach to learned prior estimation and imbalance-aware prediction.
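Logit adjustment, the step into which the learned estimate is incorporated, can be sketched as subtracting a scaled log-prior from the logits. This sketch makes one loud assumption: it computes the log-prior from class counts as a stand-in, whereas NPE learns the estimate from latent representations precisely to avoid needing counts. The function names and the `tau` parameter are hypothetical.

```python
import math

def log_prior_from_counts(counts):
    """Stand-in for a learned estimate: empirical log class prior (assumption)."""
    total = sum(counts)
    return [math.log(c / total) for c in counts]

def adjust_logits(logits, log_prior, tau=1.0):
    """Subtract the scaled log-prior so frequent classes lose their head start."""
    return [z - tau * p for z, p in zip(logits, log_prior)]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

counts = [900, 100]   # head class vs. tail class
logits = [1.0, 0.5]   # raw scores slightly favour the head class
log_prior = log_prior_from_counts(counts)

print(argmax(logits))                            # 0: head class wins before adjustment
print(argmax(adjust_logits(logits, log_prior)))  # 1: tail class wins after adjustment

# Shifting the log-prior by a constant shifts every adjusted logit equally,
# so the prediction is unchanged.
shifted = [p + 3.0 for p in log_prior]
print(argmax(adjust_logits(logits, shifted)))    # 1
```

The last check shows why an estimate that is correct only up to an additive constant, as stated in the abstract, is still sufficient for prediction.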
40. 【2602.17814】VQPP: Video Query Performance Prediction Benchmark
链接:https://arxiv.org/abs/2602.17814
作者:Adrian Catalin Lutu,Eduard Poesina,Radu Tudor Ionescu
类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:actively studied information, Query performance prediction, studied information retrieval, retrieval system selection, information retrieval task
备注:
点击查看摘要
Abstract:Query performance prediction (QPP) is an important and actively studied information retrieval task, having various applications, such as query reformulation, query expansion, and retrieval system selection, among many others. The task has been primarily studied in the context of text and image retrieval, whereas QPP for content-based video retrieval (CBVR) remains largely underexplored. To this end, we propose the first benchmark for video query performance prediction (VQPP), comprising two text-to-video retrieval datasets and two CBVR systems, respectively. VQPP contains a total of 56K text queries and 51K videos, and comes with official training, validation and test splits, fostering direct comparisons and reproducible results. We explore multiple pre-retrieval and post-retrieval performance predictors, creating a representative benchmark for future exploration of QPP in the video domain. Our results show that pre-retrieval predictors obtain competitive performance, enabling applications before performing the retrieval step. We also demonstrate the applicability of VQPP by employing the best performing pre-retrieval predictor as reward model for training a large language model (LLM) on the query reformulation task via direct preference optimization (DPO). We release our benchmark and code at this https URL.
41. 【2602.17807】VidEoMT: Your ViT is Secretly Also a Video Segmentation Model
链接:https://arxiv.org/abs/2602.17807
作者:Narges Norouzi,Idil Esen Zulfikar,Niccolò Cavagnero,Tommie Kerssies,Bastian Leibe,Gijs Dubbelman,Daan de Geus
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Existing online video, Existing online, online video segmentation, segmentation models typically, complex specialized tracking
备注:
点击查看摘要
Abstract:Existing online video segmentation models typically combine a per-frame segmenter with complex specialized tracking modules. While effective, these modules introduce significant architectural complexity and computational overhead. Recent studies suggest that plain Vision Transformer (ViT) encoders, when scaled with sufficient capacity and large-scale pre-training, can conduct accurate image segmentation without requiring specialized modules. Motivated by this observation, we propose the Video Encoder-only Mask Transformer (VidEoMT), a simple encoder-only video segmentation model that eliminates the need for dedicated tracking modules. To enable temporal modeling in an encoder-only ViT, VidEoMT introduces a lightweight query propagation mechanism that carries information across frames by reusing queries from the previous frame. To balance this with adaptability to new content, it employs a query fusion strategy that combines the propagated queries with a set of temporally-agnostic learned queries. As a result, VidEoMT attains the benefits of a tracker without added complexity, achieving competitive accuracy while being 5x to 10x faster, running at up to 160 FPS with a ViT-L backbone. Code: this https URL
42. 【2602.17799】Enabling Training-Free Text-Based Remote Sensing Segmentation
链接:https://arxiv.org/abs/2602.17799
作者:Jose Sosa,Danila Rukhovich,Anis Kacem,Djamila Aouada
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Vision Language Models, Vision Foundation Models, Vision Language, Recent advances, Language Models
备注:
点击查看摘要
Abstract:Recent advances in Vision Language Models (VLMs) and Vision Foundation Models (VFMs) have opened new opportunities for zero-shot text-guided segmentation of remote sensing imagery. However, most existing approaches still rely on additional trainable components, limiting their generalisation and practical applicability. In this work, we investigate to what extent text-based remote sensing segmentation can be achieved without additional training, by relying solely on existing foundation models. We propose a simple yet effective approach that integrates contrastive and generative VLMs with the Segment Anything Model (SAM), enabling a fully training-free or lightweight LoRA-tuned pipeline. Our contrastive approach employs CLIP as mask selector for SAM's grid-based proposals, achieving state-of-the-art open-vocabulary semantic segmentation (OVSS) in a completely zero-shot setting. In parallel, our generative approach enables reasoning and referring segmentation by generating click prompts for SAM using GPT-5 in a zero-shot setting and a LoRA-tuned Qwen-VL model, with the latter yielding the best results. Extensive experiments across 19 remote sensing benchmarks, including open-vocabulary, referring, and reasoning-based tasks, demonstrate the strong capabilities of our approach. Code will be released at this https URL.
43. 【2602.17793】LGD-Net: Latent-Guided Dual-Stream Network for HER2 Scoring with Task-Specific Domain Knowledge
Link: https://arxiv.org/abs/2602.17793
Authors: Peide Zhu, Linbin Lu, Zhiqin Chen, Xiong Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Keywords: treatment therapy selection, breast cancer evaluation, targeted treatment therapy, expression level accurately, therapy selection
Comments:
Abstract:It is a critical task to evaluate HER2 expression level accurately for breast cancer evaluation and targeted treatment therapy selection. However, the standard multi-step Immunohistochemistry (IHC) staining is resource-intensive, expensive, and time-consuming, and is often unavailable in many areas. Consequently, predicting HER2 levels directly from HE slides has emerged as a potential alternative solution. It has been shown to be effective to use virtual IHC images from HE images for automatic HER2 scoring. However, the pixel-level virtual staining methods are computationally expensive and prone to reconstruction artifacts that can propagate diagnostic errors. To address these limitations, we propose the Latent-Guided Dual-Stream Network (LGD-Net), a novel framework that employs cross-modal feature hallucination instead of explicit pixel-level image generation. LGD-Net learns to map morphological HE features directly to the molecular latent space, guided by a teacher IHC encoder during training. To ensure the hallucinated features capture clinically relevant phenotypes, we explicitly regularize the model training with task-specific domain knowledge, specifically nuclei distribution and membrane staining intensity, via lightweight auxiliary regularization tasks. Extensive experiments on the public BCI dataset demonstrate that LGD-Net achieves state-of-the-art performance, significantly outperforming baseline methods while enabling efficient inference using single-modality HE inputs.
44. 【2602.17785】Multi-Modal Monocular Endoscopic Depth and Pose Estimation with Edge-Guided Self-Supervision
Link: https://arxiv.org/abs/2602.17785
Authors: Xinwei Ju, Rema Daher, Danail Stoyanov, Sophia Bano, Francisco Vasconcelos
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: reducing blind spots, enable improved screening, colonoscopy-assisted navigation, blind spots, minimizing the risk
Comments: 14 pages, 6 figures; early accepted by IPCAI2026
Abstract:Monocular depth and pose estimation play an important role in the development of colonoscopy-assisted navigation, as they enable improved screening by reducing blind spots, minimizing the risk of missed or recurrent lesions, and lowering the likelihood of incomplete examinations. However, this task remains challenging due to the presence of texture-less surfaces, complex illumination patterns, deformation, and a lack of in-vivo datasets with reliable ground truth. In this paper, we propose **PRISM** (Pose-Refinement with Intrinsic Shading and edge Maps), a self-supervised learning framework that leverages anatomical and illumination priors to guide geometric learning. Our approach uniquely incorporates edge detection and luminance decoupling for structural guidance. Specifically, edge maps are derived using a learning-based edge detector (e.g., DexiNed or HED) trained to capture thin and high-frequency boundaries, while luminance decoupling is obtained through an intrinsic decomposition module that separates shading and reflectance, enabling the model to exploit shading cues for depth estimation. Experimental results on multiple real and synthetic datasets demonstrate state-of-the-art performance. We further conduct a thorough ablation study on training data selection to establish best practices for pose and depth estimation in colonoscopy. This analysis yields two practical insights: (1) self-supervised training on real-world data outperforms supervised training on realistic phantom data, underscoring the superiority of domain realism over ground truth availability; and (2) video frame rate is an extremely important factor for model performance, where dataset-specific video frame sampling is necessary for generating high quality training data.
45. 【2602.17770】CLUTCH: Contextualized Language model for Unlocking Text-Conditioned Hand motion modelling in the wild
Link: https://arxiv.org/abs/2602.17770
Authors: Balamurugan Thambiraja, Omid Taheri, Radek Danecek, Giorgio Becherini, Gerard Pons-Moll, Justus Thies
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: modeling natural hand, motions remains underexplored, daily life, remains underexplored, natural hand motions
Comments: ICLR2026; Project page: [this https URL](https://balamuruganthambiraja.github.io/CLUTCH/)
Abstract:Hands play a central role in daily life, yet modeling natural hand motions remains underexplored. Existing methods that tackle text-to-hand-motion generation or hand animation captioning rely on studio-captured datasets with limited actions and contexts, making them costly to scale to "in-the-wild" settings. Further, contemporary models and their training schemes struggle to capture animation fidelity with text-motion alignment. To address this, we (1) introduce '3D Hands in the Wild' (3D-HIW), a dataset of 32K 3D hand-motion sequences and aligned text, and (2) propose CLUTCH, an LLM-based hand animation system with two critical innovations: (a) SHIFT, a novel VQ-VAE architecture to tokenize hand motion, and (b) a geometric refinement stage to finetune the LLM. To build 3D-HIW, we propose a data annotation pipeline that combines vision-language models (VLMs) and state-of-the-art 3D hand trackers, and apply it to a large corpus of egocentric action videos covering a wide range of scenarios. To fully capture motion in-the-wild, CLUTCH employs SHIFT, a part-modality decomposed VQ-VAE, which improves generalization and reconstruction fidelity. Finally, to improve animation quality, we introduce a geometric refinement stage, where CLUTCH is co-supervised with a reconstruction loss applied directly to decoded hand motion parameters. Experiments demonstrate state-of-the-art performance on text-to-motion and motion-to-text tasks, establishing the first benchmark for scalable in-the-wild hand motion modelling. Code, data and models will be released.
46. 【2602.17768】KPM-Bench: A Kinematic Parsing Motion Benchmark for Fine-grained Motion-centric Video Understanding
Link: https://arxiv.org/abs/2602.17768
Authors: Boda Lin, Yongjie Zhu, Xiaocheng Gong, Wenyu Qin, Meng Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: face significant limitations, recent advancements, face significant, significant limitations, details and suffer
Comments: 26 pages
Abstract:Despite recent advancements, video captioning models still face significant limitations in accurately describing fine-grained motion details and suffer from severe hallucination issues. These challenges become particularly prominent when generating captions for motion-centric videos, where precise depiction of intricate movements and limb dynamics is crucial yet often neglected. To alleviate this gap, we introduce an automated annotation pipeline that integrates kinematic-based motion computation with linguistic parsing, enabling detailed decomposition and description of complex human motions. Based on this pipeline, we construct and release the Kinematic Parsing Motion Benchmark (KPM-Bench), a novel open-source dataset designed to facilitate fine-grained motion understanding. KPM-Bench consists of (i) fine-grained video-caption pairs that comprehensively illustrate limb-level dynamics in complex actions, (ii) diverse and challenging question-answer pairs focusing specifically on motion understanding, and (iii) a meticulously curated evaluation set specifically designed to assess hallucination phenomena associated with motion descriptions. Furthermore, to address hallucination issues systematically, we propose the linguistically grounded Motion Parsing and Extraction (MoPE) algorithm, capable of accurately extracting motion-specific attributes directly from textual captions. Leveraging MoPE, we introduce a precise hallucination evaluation metric that functions independently of large-scale vision-language or language-only models. By integrating MoPE into the GRPO post-training framework, we effectively mitigate hallucination problems, significantly improving the reliability of motion-centric video captioning models.
47. 【2602.17690】DesignAsCode: Bridging Structural Editability and Visual Fidelity in Graphic Design Generation
Link: https://arxiv.org/abs/2602.17690
Authors: Ziyuan Liu, Shizhao Sun, Danqing Huang, Yingdong Shi, Meisheng Zhang, Ji Li, Jingsong Yu, Jiang Bian
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Keywords: high visual fidelity, fine-grained structural editability, design generation demands, demands a delicate, delicate balance
Comments:
Abstract:Graphic design generation demands a delicate balance between high visual fidelity and fine-grained structural editability. However, existing approaches typically bifurcate into either non-editable raster image synthesis or abstract layout generation devoid of visual content. Recent combinations of these two approaches attempt to bridge this gap but often suffer from rigid composition schemas and unresolvable visual dissonances (e.g., text-background conflicts) due to their inexpressive representation and open-loop nature. To address these challenges, we propose DesignAsCode, a novel framework that reimagines graphic design as a programmatic synthesis task using HTML/CSS. Specifically, we introduce a Plan-Implement-Reflect pipeline, incorporating a Semantic Planner to construct dynamic, variable-depth element hierarchies and a Visual-Aware Reflection mechanism that iteratively optimizes the code to rectify rendering artifacts. Extensive experiments demonstrate that DesignAsCode significantly outperforms state-of-the-art baselines in both structural validity and aesthetic quality. Furthermore, our code-native representation unlocks advanced capabilities, including automatic layout retargeting, complex document generation (e.g., resumes), and CSS-based animation.
48. 【2602.17689】Robust Pre-Training of Medical Vision-and-Language Models with Domain-Invariant Multi-Modal Masked Reconstruction
Link: https://arxiv.org/abs/2602.17689
Authors: Melika Filvantorkaman, Mohsen Piri
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: domain shift caused, models show strong, show strong potential, acquisition protocols, imaging devices
Comments: 28 pages, 3 figures
Abstract:Medical vision-language models show strong potential for joint reasoning over medical images and clinical text, but their performance often degrades under domain shift caused by variations in imaging devices, acquisition protocols, and reporting styles. Existing multi-modal pre-training methods largely overlook robustness, treating it as a downstream adaptation problem. In this work, we propose Robust Multi-Modal Masked Reconstruction (Robust-MMR), a self-supervised pre-training framework that explicitly incorporates robustness objectives into masked vision-language learning. Robust-MMR integrates asymmetric perturbation-aware masking, domain-consistency regularization, and modality-resilience constraints to encourage domain-invariant representations. We evaluate Robust-MMR on multiple medical vision-language benchmarks, including medical visual question answering (VQA-RAD, SLAKE, VQA-2019), cross-domain image-text classification (MELINDA), and robust image-caption retrieval (ROCO). Robust-MMR achieves 78.9% cross-domain accuracy on VQA-RAD, outperforming the strongest baseline by 3.8 percentage points, and reaches 74.6% and 77.0% accuracy on SLAKE and VQA-2019, respectively. Under perturbed evaluation, Robust-MMR improves VQA-RAD accuracy from 69.1% to 75.6%. For image-text classification, cross-domain MELINDA accuracy increases from 70.3% to 75.2%, while retrieval experiments show a reduction in mean rank degradation from over 16 to 4.1 under perturbation. Qualitative results further demonstrate improved clinical reasoning for disease detection and structural abnormality assessment. These findings show that explicitly modeling robustness during pre-training leads to more reliable and transferable medical vision-language representations for real-world deployment.
49. 【2602.17683】Probabilistic NDVI Forecasting from Sparse Satellite Time Series and Weather Covariates
Link: https://arxiv.org/abs/2602.17683
Authors: Irene Iele, Giulia Romoli, Daniele Molino, Elena Mulero Ayllón, Filippo Ruffini, Paolo Soda, Matteo Tortora
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Keywords: Accurate short-term forecasting, data-driven decision support, Accurate short-term, Difference Vegetation Index, precision agriculture
Comments:
Abstract:Accurate short-term forecasting of vegetation dynamics is a key enabler for data-driven decision support in precision agriculture. Normalized Difference Vegetation Index (NDVI) forecasting from satellite observations, however, remains challenging due to sparse and irregular sampling caused by cloud coverage, as well as the heterogeneous climatic conditions under which crops evolve. In this work, we propose a probabilistic forecasting framework specifically designed for field-level NDVI prediction under clear-sky acquisition constraints. The method leverages a transformer-based architecture that explicitly separates the modeling of historical vegetation dynamics from future exogenous information, integrating historical NDVI observations with both historical and future meteorological covariates. To address irregular revisit patterns and horizon-dependent uncertainty, we introduce a temporal-distance weighted quantile loss that aligns the training objective with the effective forecasting horizon. In addition, we incorporate cumulative and extreme-weather feature engineering to better capture delayed meteorological effects relevant to vegetation response. Extensive experiments on European satellite data demonstrate that the proposed approach consistently outperforms a diverse set of statistical, deep learning, and recent time series baselines across both point-wise and probabilistic evaluation metrics. Ablation studies further highlight the central role of target history, while showing that meteorological covariates provide complementary gains when jointly exploited. The code is available at this https URL.
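As a rough illustration of the kind of objective this abstract describes, the sketch below weights the standard quantile (pinball) loss by the time gap to each target acquisition. The abstract does not give the weighting function, so the exponential decay, the `tau` parameter, and all function names here are hypothetical choices for illustration, not the authors' formulation.

```python
import math


def pinball_loss(y_true, y_pred, q):
    """Standard quantile (pinball) loss for a single prediction at quantile q."""
    diff = y_true - y_pred
    return max(q * diff, (q - 1.0) * diff)


def weighted_quantile_loss(targets, preds, day_gaps, quantiles, tau=10.0):
    """Weighted average pinball loss over targets and quantiles.

    preds[i][k] is the predicted value of quantiles[k] for targets[i].
    day_gaps[i] is the number of days between the last observation and target i.
    The weight exp(-gap / tau) is an assumed form, not taken from the paper.
    """
    total = 0.0
    weight_sum = 0.0
    for y, row, gap in zip(targets, preds, day_gaps):
        w = math.exp(-gap / tau)  # hypothetical temporal-distance weight
        for q, y_hat in zip(quantiles, row):
            total += w * pinball_loss(y, y_hat, q)
        weight_sum += w * len(quantiles)
    return total / weight_sum


# Example: two NDVI targets, three predicted quantiles each.
quantiles = [0.1, 0.5, 0.9]
targets = [0.62, 0.70]
preds = [[0.55, 0.60, 0.68], [0.60, 0.71, 0.80]]
day_gaps = [5, 15]
loss = weighted_quantile_loss(targets, preds, day_gaps, quantiles)
```

With this assumed weighting, nearer acquisitions contribute more to the objective; a different weighting (for example one that emphasises longer horizons) would only change the line that computes `w`.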
50. 【2602.17667】When & How to Write for Personalized Demand-aware Query Rewriting in Video Search
Link: https://arxiv.org/abs/2602.17667
Authors: Cheng cheng, Chenxing Wang, Aolin Li, Haijun Wu, Huiyun Hu, Juyuan Wang
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: historical behaviors provide, behaviors provide rich, provide rich context, identifying search intent, user historical behaviors
Comments:
Abstract:In video search systems, user historical behaviors provide rich context for identifying search intent and resolving ambiguity. However, traditional methods utilizing implicit history features often suffer from signal dilution and delayed feedback. To address these challenges, we propose WeWrite, a novel Personalized Demand-aware Query Rewriting framework. Specifically, WeWrite tackles three key challenges: (1) When to Write: An automated posterior-based mining strategy extracts high-quality samples from user logs, identifying scenarios where personalization is strictly necessary; (2) How to Write: A hybrid training paradigm combines Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to align the LLM's output style with the retrieval system; (3) Deployment: A parallel "Fake Recall" architecture ensures low latency. Online A/B testing on a large-scale video platform demonstrates that WeWrite improves the Click-Through Video Volume (VV>10s) by 1.07% and reduces the Query Reformulation Rate by 2.97%.
51. 【2602.18426】Spatio-Spectroscopic Representation Learning using Unsupervised Convolutional Long-Short Term Memory Networks
Link: https://arxiv.org/abs/2602.18426
Authors: Kameswara Bharadwaj Mantha, Lucy Fortson, Ramanakumar Sankar, Claudia Scarlata, Chris Lintott, Sandor Kruk, Mike Walmsley, Hugh Dickinson, Karen Masters, Brooke Simmons, Rebecca Smethurst
Categories: Astrophysics of Galaxies (astro-ph.GA); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Integral Field Spectroscopy, uncover previously unknown, previously unknown insights, Integral Field, Field Spectroscopy
Comments: This manuscript was previously submitted to ICML for peer review. Reviewers noted that while the underlying VAE-based architecture builds on established methods, its application to spatially-resolved IFS data is promising for unsupervised representation learning in astronomy. This version is released for community visibility. Reviewer decisions: Weak accept and Weak reject (Final: Reject)
Abstract:Integral Field Spectroscopy (IFS) surveys offer a unique new landscape in which to learn in both spatial and spectroscopic dimensions and could help uncover previously unknown insights into galaxy evolution. In this work, we demonstrate a new unsupervised deep learning framework using Convolutional Long-Short Term Memory Network Autoencoders to encode generalized feature representations across both spatial and spectroscopic dimensions spanning $19$ optical emission lines (3800A $< \lambda <$ 8000A) among a sample of $\sim 9000$ galaxies from the MaNGA IFS survey. As a demonstrative exercise, we assess our model on a sample of $290$ Active Galactic Nuclei (AGN) and highlight scientifically interesting characteristics of some highly anomalous AGN.
52. 【2602.18400】Exploiting Completeness Perception with Diffusion Transformer for Unified 3D MRI Synthesis
Link: https://arxiv.org/abs/2602.18400
Authors: Junkai Liu, Nay Aung, Theodoros N. Arvanitis, Joao A. C. Lima, Steffen E. Petersen, Daniel C. Alexander, Le Zhang
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: pose significant challenges, Missing data problems, multi-modal brain MRI, data problems, pose significant
Comments:
Abstract:Missing data problems, such as missing modalities in multi-modal brain MRI and missing slices in cardiac MRI, pose significant challenges in clinical practice. Existing methods rely on external guidance to supply detailed missing state for instructing generative models to synthesize missing MRIs. However, manual indicators are not always available or reliable in real-world scenarios due to the unpredictable nature of clinical environments. Moreover, these explicit masks are not informative enough to provide guidance for improving semantic consistency. In this work, we argue that generative models should infer and recognize missing states in a self-perceptive manner, enabling them to better capture subtle anatomical and pathological variations. Towards this goal, we propose CoPeDiT, a general-purpose latent diffusion model equipped with completeness perception for unified synthesis of 3D MRIs. Specifically, we incorporate dedicated pretext tasks into our tokenizer, CoPeVAE, empowering it to learn completeness-aware discriminative prompts, and design MDiT3D, a specialized diffusion transformer architecture for 3D MRI synthesis, that effectively uses the learned prompts as guidance to enhance semantic consistency in 3D space. Comprehensive evaluations on three large-scale MRI datasets demonstrate that CoPeDiT significantly outperforms state-of-the-art methods, achieving superior robustness, generalizability, and flexibility. The code is available at this https URL .
53. 【2602.18350】Quantum-enhanced satellite image classification
Link: https://arxiv.org/abs/2602.18350
Authors: Qi Zhang, Anton Simen, Carlos Flores-Garrigós, Gabriel Alvarado Barrios, Paolo A. Erdman, Enrique Solano, Aaron C. Kemp, Vincent Beltrani, Vedangi Pathak, Hamed Mohammadbagherpoor
Categories: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: enhance multi-class image, multi-class image classification, feature extraction method, quantum feature extraction, space applications
Comments:
Abstract:We demonstrate the application of a quantum feature extraction method to enhance multi-class image classification for space applications. By harnessing the dynamics of many-body spin Hamiltonians, the method generates expressive quantum features that, when combined with classical processing, lead to quantum-enhanced classification accuracy. Using a strong and well-established ResNet50 baseline, we achieved a maximum classical accuracy of 83%, which can be improved to 84% with a transfer learning approach. In contrast, applying our quantum-classical method the performance is increased to 87% accuracy, demonstrating a clear and reproducible improvement over robust classical approaches. Implemented on several of IBM's quantum processors, our hybrid quantum-classical approach delivers consistent gains of 2-3% in absolute accuracy. These results highlight the practical potential of current and near-term quantum processors in high-stakes, data-driven domains such as satellite imaging and remote sensing, while suggesting broader applicability in real-world machine learning tasks.
54. 【2602.18119】RamanSeg: Interpretability-driven Deep Learning on Raman Spectra for Cancer Diagnosis
Link: https://arxiv.org/abs/2602.18119
Authors: Chris Tomy, Mo Vali, David Pertzborn, Tammam Alamatouri, Anna Mühlig, Orlando Guntinas-Lichius, Anna Xylander, Eric Michele Fantuzzi, Matteo Negro, Francesco Crisafi, Pietro Lio, Tiago Azevedo
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: requiring expert analysis, current gold standard, time-consuming process requiring, process requiring expert, cancer diagnosis
Comments: 12 pages, 8 figures
Abstract:Histopathology, the current gold standard for cancer diagnosis, involves the manual examination of tissue samples after chemical staining, a time-consuming process requiring expert analysis. Raman spectroscopy is an alternative, stain-free method of extracting information from samples. Using nnU-Net, we trained a segmentation model on a novel dataset of spatial Raman spectra aligned with tumour annotations, achieving a mean foreground Dice score of 80.9%, surpassing previous work. Furthermore, we propose a novel, interpretable, prototype-based architecture called RamanSeg. RamanSeg classifies pixels based on discovered regions of the training set, generating a segmentation mask. Two variants of RamanSeg allow a trade-off between interpretability and performance: one with prototype projection and another projection-free version. The projection-free RamanSeg outperformed a U-Net baseline with a mean foreground Dice score of 67.3%, offering a meaningful improvement over a black-box training approach.
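For readers unfamiliar with the metric quoted in this abstract, the sketch below computes a per-class Dice score and averages it over foreground classes. It uses the standard definition, Dice = 2|A ∩ B| / (|A| + |B|), on flattened label lists; the function names, the toy labels, and the convention of returning 1.0 when a class is absent from both masks are assumptions for illustration, not details from the paper.

```python
def dice_score(pred, target, label):
    """Dice overlap for one class label between two flattened label masks."""
    pred_hits = [p == label for p in pred]
    target_hits = [t == label for t in target]
    intersection = sum(1 for p, t in zip(pred_hits, target_hits) if p and t)
    denom = sum(pred_hits) + sum(target_hits)
    # Assumed convention: a class missing from both masks counts as a perfect match.
    return 2.0 * intersection / denom if denom else 1.0


def mean_foreground_dice(pred, target, foreground_labels):
    """Average the per-class Dice over foreground classes (background excluded)."""
    scores = [dice_score(pred, target, label) for label in foreground_labels]
    return sum(scores) / len(scores)


# Toy example: label 0 is background, labels 1 and 2 are foreground classes.
pred = [0, 1, 1, 2, 2, 0]
target = [0, 1, 2, 2, 2, 0]
mean_dice = mean_foreground_dice(pred, target, foreground_labels=[1, 2])
```

In this toy case class 1 scores 2/3 and class 2 scores 0.8, so the mean foreground Dice is about 0.733.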
55. 【2602.17986】From Global Radiomics to Parametric Maps: A Unified Workflow Fusing Radiomics and Deep Learning for PDAC Detection
Link: https://arxiv.org/abs/2602.17986
Authors: Zengtian Deng, Yimeng He, Yu Shi, Lixia Wang, Touseef Ahmad Qureshi, Xiuzhen Huang, Debiao Li
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: quantitative medical imaging, offer powerful tools, existing fusion approaches, radiomic parametric maps, spatially resolved radiomic
Comments: This work has been submitted to the IEEE for possible publication
Abstract:Radiomics and deep learning both offer powerful tools for quantitative medical imaging, but most existing fusion approaches only leverage global radiomic features and overlook the complementary value of spatially resolved radiomic parametric maps. We propose a unified framework that first selects discriminative radiomic features and then injects them into a radiomics-enhanced nnUNet at both the global and voxel levels for pancreatic ductal adenocarcinoma (PDAC) detection. On the PANORAMA dataset, our method achieved AUC = 0.96 and AP = 0.84 in cross-validation. On an external in-house cohort, it achieved AUC = 0.95 and AP = 0.78, outperforming the baseline nnUNet; it also ranked second in the PANORAMA Grand Challenge. This demonstrates that handcrafted radiomics, when injected at both global and voxel levels, provide complementary signals to deep learning models for PDAC detection. Our code can be found at this https URL
56. 【2602.17901】MeDUET: Disentangled Unified Pretraining for 3D Medical Image Synthesis and Analysis
Link: https://arxiv.org/abs/2602.17901
Authors: Junkai Liu, Ling Shao, Le Zhang
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Computer Science and Game Theory (cs.GT)
Keywords: medical image synthesis, advanced representation learning, image synthesis, Self-supervised learning, models have advanced
Comments:
Abstract:Self-supervised learning (SSL) and diffusion models have advanced representation learning and image synthesis. However, in 3D medical imaging, they remain separate: diffusion for synthesis, SSL for analysis. Unifying 3D medical image synthesis and analysis is intuitive yet challenging, as multi-center datasets exhibit dominant style shifts, while downstream tasks rely on anatomy, and site-specific style co-varies with anatomy across slices, making factors unreliable without explicit constraints. In this paper, we propose MeDUET, a 3D Medical image Disentangled UnifiEd PreTraining framework that performs SSL in the Variational Autoencoder (VAE) latent space which explicitly disentangles domain-invariant content from domain-specific style. The token demixing mechanism serves to turn disentanglement from a modeling assumption into an empirically identifiable property. Two novel proxy tasks, Mixed-Factor Token Distillation (MFTD) and Swap-invariance Quadruplet Contrast (SiQC), are devised to synergistically enhance disentanglement. Once pretrained, MeDUET is capable of (i) delivering higher fidelity, faster convergence, and improved controllability for synthesis, and (ii) demonstrating strong domain generalization and notable label efficiency for analysis across diverse medical benchmarks. In summary, MeDUET converts multi-source heterogeneity from an obstacle into a learning signal, enabling unified pretraining for 3D medical image synthesis and analysis. The code is available at this https URL .
57. 【2602.17855】TopoGate: Quality-Aware Topology-Stabilized Gated Fusion for Longitudinal Low-Dose CT New-Lesion Prediction
Link: https://arxiv.org/abs/2602.17855
Authors: Seungik Cho
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: reconstruction kernels, follow-ups vary, Longitudinal low-dose, ROC curve, Brier score
Comments:
Abstract:Longitudinal low-dose CT follow-ups vary in noise, reconstruction kernels, and registration quality. These differences destabilize subtraction images and can trigger false new lesion alarms. We present TopoGate, a lightweight model that combines the follow-up appearance view with the subtraction view and controls their influence through a learned, quality-aware gate. The gate is driven by three case-specific signals: CT appearance quality, registration consistency, and stability of anatomical topology measured with topological metrics. On the NLST--New-Lesion--LongCT cohort comprising 152 pairs from 122 patients, TopoGate improves discrimination and calibration over single-view baselines, achieving an area under the ROC curve of 0.65 with a standard deviation of 0.05 and a Brier score of 0.14. Removing corrupted or low-quality pairs, identified by the quality scores, further increases the area under the ROC curve from 0.62 to 0.68 and reduces the Brier score from 0.14 to 0.12. The gate responds predictably to degradation, placing more weight on appearance when noise grows, which mirrors radiologist practice. The approach is simple, interpretable, and practical for reliable longitudinal LDCT triage.
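The Brier score reported in this abstract is a calibration metric: the mean squared difference between predicted probabilities and binary outcomes, where lower is better. A minimal sketch of the standard definition follows; the function name and the toy probabilities are illustrative assumptions and are unrelated to the paper's data.

```python
def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


# Toy example: predicted probability of a new lesion for four scan pairs.
probs = [0.9, 0.2, 0.6, 0.1]
labels = [1, 0, 1, 0]
score = brier_score(probs, labels)
```

A model that always predicts 0.5 scores 0.25 under this definition, which gives a simple point of comparison for reported values.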
58. 【2602.17813】Promptable segmentation with region exploration enables minimal-effort expert-level prostate cancer delineation
Link: https://arxiv.org/abs/2602.17813
Authors: Junqing Yang, Natasha Thorley, Ahmed Nadeem Abbasi, Shonit Punwani, Zion Tse, Yipeng Hu, Shaheer U. Saeed
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: planning image-guided interventions, magnetic resonance, targeted biopsies, cancer on magnetic, crucial for planning
Comments: Accepted at IPCAI 2026 (IJCARS - IPCAI 2026 Special Issue)
Abstract:Purpose: Accurate segmentation of prostate cancer on magnetic resonance (MR) images is crucial for planning image-guided interventions such as targeted biopsies, cryoablation, and radiotherapy. However, subtle and variable tumour appearances, differences in imaging protocols, and limited expert availability make consistent interpretation difficult. While automated methods aim to address this, they rely on large expertly-annotated datasets that are often inconsistent, whereas manual delineation remains labour-intensive. This work aims to bridge the gap between automated and manual segmentation through a framework driven by user-provided point prompts, enabling accurate segmentation with minimal annotation effort. Methods: The framework combines reinforcement learning (RL) with a region-growing segmentation process guided by user prompts. Starting from an initial point prompt, region-growing generates a preliminary segmentation, which is iteratively refined through RL. At each step, the RL agent observes the image and current segmentation to predict a new point, from which region growing updates the mask. A reward, balancing segmentation accuracy and voxel-wise uncertainty, encourages exploration of ambiguous regions, allowing the agent to escape local optima and perform sample-specific optimisation. Despite requiring fully supervised training, the framework bridges manual and fully automated segmentation at inference by substantially reducing user effort while outperforming current fully automated methods. Results: The framework was evaluated on two public prostate MR datasets (PROMIS and PICAI, with 566 and 1090 cases). It outperformed the previous best automated methods by 9.9% and 8.9%, respectively, with performance comparable to manual radiologist segmentation, reducing annotation time tenfold.
59. 【2602.17797】Deep Learning for Dermatology: An Innovative Framework for Approaching Precise Skin Cancer Detection
Link: https://arxiv.org/abs/2602.17797
Authors: Mohammad Tahmid Noor, B. M. Shahria Alam, Tasmiah Rahman Orpa, Shaila Afroz Anika, Mahjabin Tasnim Samiha, Fahad Ahammed
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: preventable disease, Skin cancer, prevalent yet preventable, Skin, malignant skin
Comments: 6 pages, 9 figures, this is the author's accepted manuscript of a paper accepted for publication in the Proceedings of the 16th International IEEE Conference on Computing, Communication and Networking Technologies (ICCCNT 2025). The final published version will be available via IEEE Xplore
Abstract:Skin cancer is a prevalent yet preventable disease that can be life-threatening if not diagnosed early. Globally, it is among the most common cancers, and millions of people are diagnosed each year. This paper investigates two prominent deep learning models, VGG16 and DenseNet201, for classifying skin spots as benign or malignant, a task of critical importance in dermatological diagnostics. We evaluate these CNN architectures for their efficacy in differentiating benign from malignant skin lesions, building on advances in deep learning applied to skin cancer detection. Our objective is to assess model accuracy and computational efficiency, offering insights into how these models could assist in early detection, diagnosis, and streamlined workflows in dermatology. Both models were applied to a binary-class dataset containing 3297 images, all resized to 224x224. DenseNet201 achieved the best result, with an accuracy of 93.79%. Although both models provide excellent accuracy, there is still room for improvement. In future work, we plan to improve accuracy further using new datasets.
60. 【2602.17749】Detection and Classification of Cetacean Echolocation Clicks using Image-based Object Detection Methods applied to Advanced Wavelet-based Transformations
Link: https://arxiv.org/abs/2602.17749
Authors: Christopher Hauer
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Keywords: behavioral studies, challenge in marine, detection of animal, marine bioacoustic analysis, Learning Neural Networks
Comments: My Master thesis CLICK-SPOT from 2025
Abstract:A challenge in marine bioacoustic analysis is the detection of animal signals, like calls, whistles and clicks, for behavioral studies. Manual labeling is too time-consuming to process sufficient data to get reasonable results. Thus, an automatic solution to overcome the time-consuming data analysis is necessary. Basic mathematical models can detect events in simple environments, but they struggle with complex scenarios, like differentiating signals with a low signal-to-noise ratio or distinguishing clicks from echoes. Deep Learning Neural Networks, such as ANIMAL-SPOT, are better suited for such tasks. DNNs process audio signals as image representations, often using spectrograms created by Short-Time Fourier Transform. However, spectrograms have limitations due to the uncertainty principle, which creates a tradeoff between time and frequency resolution. Alternatives like the wavelet, which provides better time resolution for high frequencies and improved frequency resolution for low frequencies, may offer advantages for feature extraction in complex bioacoustic environments. This thesis shows the efficacy of CLICK-SPOT on Norwegian Killer whale underwater recordings provided by the cetacean biologist Dr. Vester. Keywords: Bioacoustics, Deep Learning, Wavelet Transformation



