本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。
统计
今日共更新482篇论文,其中:
- 自然语言处理85篇
- 信息检索12篇
- 计算机视觉66篇
自然语言处理
1. 【2602.16704】Reinforced Fast Weights with Next-Sequence Prediction
链接:https://arxiv.org/abs/2602.16704
作者:Hee Seung Hwang,Xindi Wu,Sanghyuk Chun,Olga Russakovsky
类目:Computation and Language (cs.CL)
关键词:maintaining constant memory, constant memory overhead, Fast weight, context length, offer a promising
备注:
点击查看摘要
Abstract:Fast weight architectures offer a promising alternative to attention-based transformers for long-context modeling by maintaining constant memory overhead regardless of context length. However, their potential is limited by the next-token prediction (NTP) training paradigm. NTP optimizes single-token predictions and ignores semantic coherence across multiple tokens following a prefix. Consequently, fast weight models, which dynamically update their parameters to store contextual information, learn suboptimal representations that fail to capture long-range dependencies. We introduce REFINE (Reinforced Fast weIghts with Next sEquence prediction), a reinforcement learning framework that trains fast weight models under the next-sequence prediction (NSP) objective. REFINE selects informative token positions based on prediction entropy, generates multi-token rollouts, assigns self-supervised sequence-level rewards, and optimizes the model with group relative policy optimization (GRPO). REFINE is applicable throughout the training lifecycle of pre-trained language models: mid-training, post-training, and test-time training. Our experiments on LaCT-760M and DeltaNet-1.3B demonstrate that REFINE consistently outperforms supervised fine-tuning with NTP across needle-in-a-haystack retrieval, long-context question answering, and diverse tasks in LongBench. REFINE provides an effective and versatile framework for improving long-context modeling in fast weight architectures.
2. 【2602.16699】Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents
链接:https://arxiv.org/abs/2602.16699
作者:Wenxuan Ding,Nicholas Tomlin,Greg Durrett
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:single response, necessarily resolved, require interacting, acquire information, LLMs
备注:
点击查看摘要
Abstract:LLMs are increasingly being used for complex problems which are not necessarily resolved in a single response, but require interacting with an environment to acquire information. In these scenarios, LLMs must reason about inherent cost-uncertainty tradeoffs in when to stop exploring and commit to an answer. For instance, on a programming task, an LLM should test a generated code snippet if it is uncertain about the correctness of that code; the cost of writing a test is nonzero, but typically lower than the cost of making a mistake. In this work, we show that we can induce LLMs to explicitly reason about balancing these cost-uncertainty tradeoffs, then perform more optimal environment exploration. We formalize multiple tasks, including information retrieval and coding, as sequential decision-making problems under uncertainty. Each problem has latent environment state that can be reasoned about via a prior which is passed to the LLM agent. We introduce a framework called Calibrate-Then-Act (CTA), where we feed the LLM this additional context to enable it to act more optimally. This improvement is preserved even under RL training of both the baseline and CTA. Our results on information-seeking QA and on a simplified coding task show that making cost-benefit tradeoffs explicit with CTA can help agents discover more optimal decision-making strategies.
3. 【2602.16687】Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens
链接:https://arxiv.org/abs/2602.16687
作者:Potsawee Manakul,Woody Haosheng Gan,Martijn Bartelds,Guangzhi Sun,William Held,Diyi Yang
类目:ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
关键词:pre-trained text LLM, Current audio language, text LLM backbones, extending pre-trained text, limiting general audio
备注:
点击查看摘要
Abstract:Current audio language models are predominantly text-first, either extending pre-trained text LLM backbones or relying on semantic-only audio tokens, limiting general audio modeling. This paper presents a systematic empirical study of native audio foundation models that apply next-token prediction to audio at scale, jointly modeling semantic content, acoustic details, and text to support both general audio generation and cross-modal capabilities. We provide comprehensive empirical insights for building such models: (1) We systematically investigate design choices -- data sources, text mixture ratios, and token composition -- establishing a validated training recipe. (2) We conduct the first scaling law study for discrete audio models via IsoFLOP analysis on 64 models spanning $3{\times}10^{18}$ to $3{\times}10^{20}$ FLOPs, finding that optimal data grows 1.6$\times$ faster than optimal model size. (3) We apply these lessons to train SODA (Scaling Open Discrete Audio), a suite of models from 135M to 4B parameters on 500B tokens, comparing against our scaling predictions and existing models. SODA serves as a flexible backbone for diverse audio/text tasks -- we demonstrate this by fine-tuning for voice-preserving speech-to-speech translation, using the same unified architecture.
4. 【2602.16660】Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment
链接:https://arxiv.org/abs/2602.16660
作者:Yuyan Bu,Xiaohao Liu,ZhaoXing Ren,Yaodong Yang,Juntao Dai
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:linguistic communities necessitates, communities necessitates reliable, necessitates reliable multilingual, widespread deployment, deployment of large
备注: Accepted by ICLR 2026
点击查看摘要
Abstract:The widespread deployment of large language models (LLMs) across linguistic communities necessitates reliable multilingual safety alignment. However, recent efforts to extend alignment to other languages often require substantial resources, either through large-scale, high-quality supervision in the target language or through pairwise alignment with high-resource languages, which limits scalability. In this work, we propose a resource-efficient method for improving multilingual safety alignment. We introduce a plug-and-play Multi-Lingual Consistency (MLC) loss that can be integrated into existing monolingual alignment pipelines. By improving collinearity between multilingual representation vectors, our method encourages directional consistency at the multilingual semantic level in a single update. This allows simultaneous alignment across multiple languages using only multilingual prompt variants without requiring additional response-level supervision in low-resource languages. We validate the proposed method across different model architectures and alignment paradigms, and demonstrate its effectiveness in enhancing multilingual safety with limited impact on general model utility. Further evaluation across languages and tasks indicates improved cross-lingual generalization, suggesting the proposed approach as a practical solution for multilingual consistency alignment under limited supervision.
5. 【2602.16640】Quecto-V1: Empirical Analysis of 8-bit Quantized Small Language Models for On-Device Legal Retrieval
链接:https://arxiv.org/abs/2602.16640
作者:Subrit Dikshit
类目:Computation and Language (cs.CL)
关键词:Natural Language Processing, revolutionized Natural Language, Large Language Models, Language Processing, Large Language
备注: 5 pages, 2 tables
点击查看摘要
Abstract:The rapid proliferation of Large Language Models (LLMs) has revolutionized Natural Language Processing (NLP) but has simultaneously created a "resource divide." State-of-the-art legal intelligence systems typically rely on massive parameter counts (7B+) and cloud-based inference, rendering them inaccessible to practitioners in resource-constrained environments and posing significant data sovereignty risks. This paper introduces Quecto-V1, a domain-specific Small Language Model (SLM) engineered to democratize access to Indian legal intelligence. Built upon a custom configuration of the GPT-2 architecture (124 million parameters), Quecto-V1 was trained from scratch exclusively on a corpus of Indian statutes, including the Indian Penal Code (IPC), the Code of Criminal Procedure (CrPC), and the Constitution of India. Unlike generalist models, which prioritize broad world knowledge, our approach maximizes "lexical density" within the legal domain. Furthermore, we address the deployment bottleneck by applying post-training 8-bit quantization (GGUF format), compressing the model to a memory footprint of under 150 MB. Our empirical analysis demonstrates that Quecto-V1 achieves high fidelity in retrieving statutory definitions and penal provisions, outperforming general-purpose SLMs in domain-specific exact match tasks while running entirely offline on consumer-grade CPUs. We further present an ablation study showing that 8-bit quantization yields a 74% reduction in model size with less than 3.5% degradation in retrieval accuracy compared to full-precision baselines. These findings suggest that for specialized, high-stakes domains like law, domain-specific training coupled with aggressive quantization offers a viable, privacy-preserving alternative to monolithic cloud models.
6. 【2602.16639】AREG: Adversarial Resource Extraction Game for Evaluating Persuasion and Resistance in Large Language Models
链接:https://arxiv.org/abs/2602.16639
作者:Adib Sakhawat,Fardeen Sadab
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, increasingly requires moving, static text generation, intelligence of Large
备注: 15 pages, 5 figures, 11 tables. Includes appendix with detailed experimental results and prompts
点击查看摘要
Abstract:Evaluating the social intelligence of Large Language Models (LLMs) increasingly requires moving beyond static text generation toward dynamic, adversarial interaction. We introduce the Adversarial Resource Extraction Game (AREG), a benchmark that operationalizes persuasion and resistance as a multi-turn, zero-sum negotiation over financial resources. Using a round-robin tournament across frontier models, AREG enables joint evaluation of offensive (persuasion) and defensive (resistance) capabilities within a single interactional framework. Our analysis provides evidence that these capabilities are weakly correlated ($\rho = 0.33$) and empirically dissociated: strong persuasive performance does not reliably predict strong resistance, and vice versa. Across all evaluated models, resistance scores exceed persuasion scores, indicating a systematic defensive advantage in adversarial dialogue settings. Further linguistic analysis suggests that interaction structure plays a central role in these outcomes. Incremental commitment-seeking strategies are associated with higher extraction success, while verification-seeking responses are more prevalent in successful defenses than explicit refusal. Together, these findings indicate that social influence in LLMs is not a monolithic capability and that evaluation frameworks focusing on persuasion alone may overlook asymmetric behavioral vulnerabilities.
7. 【2602.16610】Who can we trust? LLM-as-a-jury for Comparative Assessment
链接:https://arxiv.org/abs/2602.16610
作者:Mengjie Qian,Guangzhi Sun,Mark J.F. Gales,Kate M. Knill
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Large language models, natural language generation, language generation assessment, Large language, pairwise comparative judgements
备注:
点击查看摘要
Abstract:Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment often using pairwise comparative judgements. Existing approaches typically rely on single judges or aggregate multiple judges assuming equal reliability. In practice, LLM judges vary substantially in performance across tasks and aspects, and their judgment probabilities may be biased and inconsistent. Furthermore, human-labelled supervision for judge calibration may be unavailable. We first empirically demonstrate that inconsistencies in LLM comparison probabilities exist and show that it limits the effectiveness of direct probability-based ranking. To address this, we study the LLM-as-a-jury setting and propose BT-sigma, a judge-aware extension of the Bradley-Terry model that introduces a discriminator parameter for each judge to jointly infer item rankings and judge reliability from pairwise comparisons alone. Experiments on benchmark NLG evaluation datasets show that BT-sigma consistently outperforms averaging-based aggregation methods, and that the learned discriminator strongly correlates with independent measures of the cycle consistency of LLM judgments. Further analysis reveals that BT-sigma can be interpreted as an unsupervised calibration mechanism that improves aggregation by modelling judge reliability.
8. 【2602.16609】ColBERT-Zero: To Pre-train Or Not To Pre-train ColBERT models
链接:https://arxiv.org/abs/2602.16609
作者:Antoine Chaffin,Luca Arnaboldi,Amélie Chatelain,Florent Krzakala
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:small Knowledge Distillation, Knowledge Distillation, strong single-vector models, small Knowledge, multi-vector models
备注: 9 pages, 5 tables, 2 figures
点击查看摘要
Abstract:Current state-of-the-art multi-vector models are obtained through a small Knowledge Distillation (KD) training step on top of strong single-vector models, leveraging the large-scale pre-training of these models. In this paper, we study the pre-training of multi-vector models and show that large-scale multi-vector pre-training yields much stronger multi-vector models. Notably, a fully ColBERT-pre-trained model, ColBERT-Zero, trained only on public data, outperforms GTE-ModernColBERT as well as its base model, GTE-ModernBERT, which leverages closed and much stronger data, setting new state-of-the-art for model this size. We also find that, although performing only a small KD step is not enough to achieve results close to full pre-training, adding a supervised step beforehand allows to achieve much closer performance while skipping the most costly unsupervised phase. Finally, we find that aligning the fine-tuning and pre-training setups is crucial when repurposing existing models. To enable exploration of our results, we release various checkpoints as well as code used to train them.
9. 【2602.16608】Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models
链接:https://arxiv.org/abs/2602.16608
作者:Melkamu Abay Mersha,Jugal Kalita
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:deeply layered representations, layered representations make, Layer-wise Integrated Gradients, Transformer models achieve, difficult to interpret
备注:
点击查看摘要
Abstract:Transformer models achieve state-of-the-art performance across domains and tasks, yet their deeply layered representations make their predictions difficult to interpret. Existing explainability methods rely on final-layer attributions, capture either local token-level attributions or global attention patterns without unification, and lack context-awareness of inter-token dependencies and structural components. They also fail to capture how relevance evolves across layers and how structural components shape decision-making. To address these limitations, we proposed the \textbf{Context-Aware Layer-wise Integrated Gradients (CA-LIG) Framework}, a unified hierarchical attribution framework that computes layer-wise Integrated Gradients within each Transformer block and fuses these token-level attributions with class-specific attention gradients. This integration yields signed, context-sensitive attribution maps that capture supportive and opposing evidence while tracing the hierarchical flow of relevance through the Transformer layers. We evaluate the CA-LIG Framework across diverse tasks, domains, and transformer model families, including sentiment analysis and long and multi-class document classification with BERT, hate speech detection in a low-resource language setting with XLM-R and AfroLM, and image classification with Masked Autoencoder vision Transformer model. Across all tasks and architectures, CA-LIG provides more faithful attributions, shows stronger sensitivity to contextual dependencies, and produces clearer, more semantically coherent visualizations than established explainability methods. These results indicate that CA-LIG provides a more comprehensive, context-aware, and reliable explanation of Transformer decision-making, advancing both the practical interpretability and conceptual understanding of deep neural models.
10. 【2602.16607】CitiLink-Summ: Summarization of Discussion Subjects in European Portuguese Municipal Meeting Minutes
链接:https://arxiv.org/abs/2602.16607
作者:Miguel Marques,Ana Luísa Fernandes,Ana Filipa Pacheco,Rute Rebouças,Inês Cantante,José Isidro,Luís Filipe Cunha,Alípio Jorge,Nuno Guimarães,Sérgio Nunes,António Leal,Purificação Silvano,Ricardo Campos
类目:Computation and Language (cs.CL)
关键词:formal records documenting, Municipal meeting minutes, local government, citizens to navigate, Municipal meeting
备注:
点击查看摘要
Abstract:Municipal meeting minutes are formal records documenting the discussions and decisions of local government, yet their content is often lengthy, dense, and difficult for citizens to navigate. Automatic summarization can help address this challenge by producing concise summaries for each discussion subject. Despite its potential, research on summarizing discussion subjects in municipal meeting minutes remains largely unexplored, especially in low-resource languages, where the inherent complexity of these documents adds further challenges. A major bottleneck is the scarcity of datasets containing high-quality, manually crafted summaries, which limits the development and evaluation of effective summarization models for this domain. In this paper, we present CitiLink-Summ, a new corpus of European Portuguese municipal meeting minutes, comprising 100 documents and 2,322 manually hand-written summaries, each corresponding to a distinct discussion subject. Leveraging this dataset, we establish baseline results for automatic summarization in this domain, employing state-of-the-art generative models (e.g., BART, PRIMERA) as well as large language models (LLMs), evaluated with both lexical and semantic metrics such as ROUGE, BLEU, METEOR, and BERTScore. CitiLink-Summ provides the first benchmark for municipal-domain summarization in European Portuguese, offering a valuable resource for advancing NLP research on complex administrative texts.
11. 【2602.16578】Creating a digital poet
链接:https://arxiv.org/abs/2602.16578
作者:Vered Tohar,Tsahi Hayat,Amir Leshem
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:machine write good, write good poetry, machine write, write good, good poetry
备注: 24 pages, 3 figures
点击查看摘要
Abstract:Can a machine write good poetry? Any positive answer raises fundamental questions about the nature and value of art. We report a seven-month poetry workshop in which a large language model was shaped into a digital poet through iterative in-context expert feedback, without retraining. Across sessions, the model developed a distinctive style and a coherent corpus, supported by quantitative and qualitative analyses, and it produced a pen name and author image. In a blinded authorship test with 50 humanities students and graduates (three AI poems and three poems by well-known poets each), judgments were at chance: human poems were labeled human 54% of the time and AI poems 52%, with 95% confidence intervals including 50%. After the workshop, a commercial publisher released a poetry collection authored by the model. These results show that workshop-style prompting can support long-horizon creative shaping and renew debates on creativity and authorship.
12. 【2602.16571】Utility-Preserving De-Identification for Math Tutoring: Investigating Numeric Ambiguity in the MathEd-PII Benchmark Dataset
链接:https://arxiv.org/abs/2602.16571
作者:Zhuqian Zhou,Kirk Vanacore,Bakhtawar Ahtisham,Jinsook Lee,Doug Pietrzak,Daryl Hedley,Jorge Dias,Chris Shaw,Ruth Schäfer,René F. Kizilcec
类目:Computation and Language (cs.CL)
关键词:Personally Identifiable Information, Large-scale sharing, generic Personally Identifiable, rigorous de-identification remains, teaching and learning
备注:
点击查看摘要
Abstract:Large-scale sharing of dialogue-based data is instrumental for advancing the science of teaching and learning, yet rigorous de-identification remains a major barrier. In mathematics tutoring transcripts, numeric expressions frequently resemble structured identifiers (e.g., dates or IDs), leading generic Personally Identifiable Information (PII) detection systems to over-redact core instructional content and reduce dataset utility. This work asks how PII can be detected in math tutoring transcripts while preserving their educational utility. To address this challenge, we investigate the "numeric ambiguity" problem and introduce MathEd-PII, the first benchmark dataset for PII detection in math tutoring dialogues, created through a human-in-the-loop LLM workflow that audits upstream redactions and generates privacy-preserving surrogates. The dataset contains 1,000 tutoring sessions (115,620 messages; 769,628 tokens) with validated PII annotations. Using a density-based segmentation method, we show that false PII redactions are disproportionately concentrated in math-dense regions, confirming numeric ambiguity as a key failure mode. We then compare four detection strategies: a Presidio baseline and LLM-based approaches with basic, math-aware, and segment-aware prompting. Math-aware prompting substantially improves performance over the baseline (F1: 0.821 vs. 0.379) while reducing numeric false positives, demonstrating that de-identification must incorporate domain context to preserve analytic utility. This work provides both a new benchmark and evidence that utility-preserving de-identification for tutoring data requires domain-aware modeling.
13. 【2602.16516】Supercharging Agenda Setting Research: The ParlaCAP Dataset of 28 European Parliaments and a Scalable Multilingual LLM-Based Classification
链接:https://arxiv.org/abs/2602.16516
作者:Taja Kuzman Pungeršek,Peter Rupnik,Daniela Širinić,Nikola Ljubešić
类目:Computation and Language (cs.CL)
关键词:setting across Europe, Comparative Agendas Project, building domain-specific policy, paper introduces ParlaCAP, parliamentary agenda setting
备注: 17 pages, 7 figures, 7 tables. Submitted to the PoliticalNLP 2026 workshop, co-located with LREC 2026 conference
点击查看摘要
Abstract:This paper introduces ParlaCAP, a large-scale dataset for analyzing parliamentary agenda setting across Europe, and proposes a cost-effective method for building domain-specific policy topic classifiers. Applying the Comparative Agendas Project (CAP) schema to the multilingual ParlaMint corpus of over 8 million speeches from 28 parliaments of European countries and autonomous regions, we follow a teacher-student framework in which a high-performing large language model (LLM) annotates in-domain training data and a multilingual encoder model is fine-tuned on these annotations for scalable data annotation. We show that this approach produces a classifier tailored to the target domain. Agreement between the LLM and human annotators is comparable to inter-annotator agreement among humans, and the resulting model outperforms existing CAP classifiers trained on manually-annotated but out-of-domain data. In addition to the CAP annotations, the ParlaCAP dataset offers rich speaker and party metadata, as well as sentiment predictions coming from the ParlaSent multilingual transformer model, enabling comparative research on political attention and representation across countries. We illustrate the analytical potential of the dataset with three use cases, examining the distribution of parliamentary attention across policy topics, sentiment patterns in parliamentary speech, and gender differences in policy attention.
14. 【2602.16500】Optimizing Soft Prompt Tuning via Structural Evolution
链接:https://arxiv.org/abs/2602.16500
作者:Zhenzhen Huang,Chaoning Zhang,Haoyu Bian,Songbo Zhang,Chi-lok Andy Tai,Jiaquan Zhang,Caiyan Qin,Jingjing Qu,Yalan Ye,Yang Yang,Heng Tao Shen
类目:Computation and Language (cs.CL)
关键词:capture task-specific information, large pre-trained language, Soft prompt tuning, Soft prompt, leverages continuous embeddings
备注: This manuscript has been submitted to IEEE Transactions on Knowledge and Data Engineering (TKDE) for peer review
点击查看摘要
Abstract:Soft prompt tuning leverages continuous embeddings to capture task-specific information in large pre-trained language models (LLMs), achieving competitive performance in few-shot settings. However, soft prompts rely on high-dimensional, implicit representations and lack explicit semantics and traceable training behaviors, which limits their interpretability. To address this limitation, we propose a soft prompt tuning optimization method based on topological morphological evolution. Specifically, we employ persistent homology from topological data analysis (TDA) to quantify the structural representations of soft prompts in continuous parameter space and their training process evolution. Quantitative analysis shows that topologically stable and compact soft prompts achieve better downstream performance. Based on this empirical observation, we construct a loss function for optimizing soft prompt tuning, termed Topological Soft Prompt Loss (TSLoss). TSLoss guides the model to learn structurally stable adaptations by quantifying inter-parameter connectivity and redundancy. Extensive experiments show that training with TSLoss accelerates convergence and improves tuning performance, providing an interpretable method to understand and optimize soft prompt tuning from structural and topological perspectives.
15. 【2602.16490】From Growing to Looping: A Unified View of Iterative Computation in LLMs
链接:https://arxiv.org/abs/2602.16490
作者:Ferdinand Kapl,Emmanouil Angelis,Kaitlin Maile,Johannes von Oswald,Stefan Bauer
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:relationship remains unclear, remains unclear, duplicating middle layers, linked to stronger, relationship remains
备注:
点击查看摘要
Abstract:Looping, reusing a block of layers across depth, and depth growing, training shallow-to-deep models by duplicating middle layers, have both been linked to stronger reasoning, but their relationship remains unclear. We provide a mechanistic unification: looped and depth-grown models exhibit convergent depth-wise signatures, including increased reliance on late layers and recurring patterns aligned with the looped or grown block. These shared signatures support the view that their gains stem from a common form of iterative computation. Building on this connection, we show that the two techniques are adaptable and composable: applying inference-time looping to the middle blocks of a depth-grown model improves accuracy on some reasoning primitives by up to $2\times$, despite the model never being trained to loop. Both approaches also adapt better than the baseline when given more in-context examples or additional supervised fine-tuning data. Additionally, depth-grown models achieve the largest reasoning gains when using higher-quality, math-heavy cooldown mixtures, which can be further boosted by adapting a middle block to loop. Overall, our results position depth growth and looping as complementary, practical methods for inducing and scaling iterative computation to improve reasoning.
16. 【2602.16488】Learning to Learn from Language Feedback with Social Meta-Learning
链接:https://arxiv.org/abs/2602.16488
作者:Jonathan Cook,Diego Antognini,Martin Klissarov,Claudiu Musat,Edward Grefenstette
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large language models, Large language, conversational context, Large, SML
备注:
点击查看摘要
Abstract:Large language models (LLMs) often struggle to learn from corrective feedback within a conversational context. They are rarely proactive in soliciting this feedback, even when faced with ambiguity, which can make their dialogues feel static, one-sided, and lacking the adaptive qualities of human conversation. To address these limitations, we draw inspiration from social meta-learning (SML) in humans - the process of learning how to learn from others. We formulate SML as a finetuning methodology, training LLMs to solicit and learn from language feedback in simulated pedagogical dialogues, where static tasks are converted into interactive social learning problems. SML effectively teaches models to use conversation to solve problems they are unable to solve in a single turn. This capability generalises across domains; SML on math problems produces models that better use feedback to solve coding problems and vice versa. Furthermore, despite being trained only on fully-specified problems, these models are better able to solve underspecified tasks where critical information is revealed over multiple turns. When faced with this ambiguity, SML-trained models make fewer premature answer attempts and are more likely to ask for the information they need. This work presents a scalable approach to developing AI systems that effectively learn from language feedback.
17. 【2602.16485】am of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling
链接:https://arxiv.org/abs/2602.16485
作者:Jeffrey T. H. Wong,Zixi Zhang,Junyi Liu,Yiren Zhao
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
关键词:Existing Multi-Agent Systems, Existing Multi-Agent, Multi-Agent Systems, differently post-trained models, typically rely
备注: 8 pages
点击查看摘要
Abstract:Existing Multi-Agent Systems (MAS) typically rely on static, homogeneous model configurations, limiting their ability to exploit the distinct strengths of differently post-trained models. To address this, we introduce Team-of-Thoughts, a novel MAS architecture that leverages the complementary capabilities of heterogeneous agents via an orchestrator-tool paradigm. Our framework introduces two key mechanisms to optimize performance: (1) an orchestrator calibration scheme that identifies models with superior coordination capabilities, and (2) a self-assessment protocol where tool agents profile their own domain expertise to account for variations in post-training skills. During inference, the orchestrator dynamically activates the most suitable tool agents based on these proficiency profiles. Experiments on five reasoning and code generation benchmarks show that Team-of-Thoughts delivers consistently superior task performance. Notably, on AIME24 and LiveCodeBench, our approach achieves accuracies of 96.67% and 72.53%, respectively, substantially outperforming homogeneous role-play baselines, which score 80% and 65.93%.
18. 【2602.16469】raining Models on Dialects of Translationese Shows How Lexical Diversity and Source-Target Syntactic Similarity Shape Learning
链接:https://arxiv.org/abs/2602.16469
作者:Jenny Kunz
类目:Computation and Language (cs.CL)
关键词:multilingual NLP, native text, NLP, source, source language
备注:
点击查看摘要
Abstract:Machine-translated data is widely used in multilingual NLP, particularly when native text is scarce. However, translated text differs systematically from native text. This phenomenon is known as translationese, and it reflects both traces of the source language and characteristic properties of translation itself. In this paper, we study how training on machine-translated data affects small English language models, focusing on how translationese from different source languages shapes linguistic acceptability judgments and language modelling for different domains. We train models on English text translated from 24 typologically and resource-diverse source languages, enabling a systematic analysis of how source language and corpus properties influence what models learn. Our results show that the source language has a clear impact on model behavior: general perplexity is more driven by the lexical diversity of the translated corpus, while grammatical performance is strongly correlated to typological similarity to English, given enough data.
19. 【2602.16467】IndicEval: A Bilingual Indian Educational Evaluation Framework for Large Language Models
链接:https://arxiv.org/abs/2602.16467
作者:Saurabh Bharti,Gaurav Azad,Abhinaw Jagtap,Nachiket Tapas
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:reflect real-world academic, real-world academic rigor, necessitates evaluation frameworks, large language models, rapid advancement
备注:
点击查看摘要
Abstract:The rapid advancement of large language models (LLMs) necessitates evaluation frameworks that reflect real-world academic rigor and multilingual complexity. This paper introduces IndicEval, a scalable benchmarking platform designed to assess LLM performance using authentic high-stakes examination questions from UPSC, JEE, and NEET across STEM and humanities domains in both English and Hindi. Unlike synthetic benchmarks, IndicEval grounds evaluation in real examination standards, enabling realistic measurement of reasoning, domain knowledge, and bilingual adaptability. The framework automates assessment using Zero-Shot, Few-Shot, and Chain-of-Thought (CoT) prompting strategies and supports modular integration of new models and languages. Experiments conducted on Gemini 2.0 Flash, GPT-4, Claude, and LLaMA 3-70B reveal three major findings. First, CoT prompting consistently improves reasoning accuracy, with substantial gains across subjects and languages. Second, significant cross-model performance disparities persist, particularly in high-complexity examinations. Third, multilingual degradation remains a critical challenge, with marked accuracy drops in Hindi compared to English, especially under Zero-Shot conditions. These results highlight persistent gaps in bilingual reasoning and domain transfer. Overall, IndicEval provides a practice-oriented, extensible foundation for rigorous, equitable evaluation of LLMs in multilingual educational settings and offers actionable insights for improving reasoning robustness and language adaptability.
20. 【2602.16429】abAgent: A Framework for Replacing Agentic Generative Components with Tabular-Textual Classifiers
链接:https://arxiv.org/abs/2602.16429
作者:Ido Levy,Eilam Shapira,Yinon Goldshtein,Avi Yaeli,Nir Mashkif,Segev Shlomov
类目:Computation and Language (cs.CL)
关键词:achieve complex goals, large language model, autonomously execute multi-step, execute multi-step workflows, repeated large language
备注:
点击查看摘要
Abstract:Agentic systems, AI architectures that autonomously execute multi-step workflows to achieve complex goals, are often built using repeated large language model (LLM) calls for closed-set decision tasks such as routing, shortlisting, gating, and verification. While convenient, this design makes deployments slow and expensive due to cumulative latency and token usage. We propose TabAgent, a framework for replacing generative decision components in closed-set selection tasks with a compact textual-tabular classifier trained on execution traces. TabAgent (i) extracts structured schema, state, and dependency features from trajectories (TabSchema), (ii) augments coverage with schema-aligned synthetic supervision (TabSynth), and (iii) scores candidates with a lightweight classifier (TabHead). On the long-horizon AppWorld benchmark, TabAgent maintains task-level success while eliminating shortlist-time LLM calls, reducing latency by approximately 95% and inference cost by 85-91%. Beyond tool shortlisting, TabAgent generalizes to other agentic decision heads, establishing a paradigm for learned discriminative replacements of generative bottlenecks in production agent architectures.
21. 【2602.16379】Label-Consistent Data Generation for Aspect-Based Sentiment Analysis Using LLM Agents
链接:https://arxiv.org/abs/2602.16379
作者:Mohammad H.A. Monfared,Lucie Flek,Akbar Karimi
类目:Computation and Language (cs.CL)
关键词:Aspect-Based Sentiment Analysis, produce high quality, high quality synthetic, quality synthetic training, Sentiment Analysis
备注: Accepted to WASSA Workshop at EACL 2026
点击查看摘要
Abstract:We propose an agentic data augmentation method for Aspect-Based Sentiment Analysis (ABSA) that uses iterative generation and verification to produce high quality synthetic training examples. To isolate the effect of agentic structure, we also develop a closely matched prompting-based baseline using the same model and instructions. Both methods are evaluated across three ABSA subtasks (Aspect Term Extraction (ATE), Aspect Sentiment Classification (ATSC), and Aspect Sentiment Pair Extraction (ASPE)), four SemEval datasets, and two encoder-decoder models: T5-Base and Tk-Instruct. Our results show that the agentic augmentation outperforms raw prompting in label preservation of the augmented data, especially when the tasks require aspect term generation. In addition, when combined with real data, agentic augmentation provides higher gains, consistently outperforming prompting-based generation. These benefits are most pronounced for T5-Base, while the more heavily pretrained Tk-Instruct exhibits smaller improvements. As a result, augmented data helps T5-Base achieve comparable performance with its counterpart.
22. 【2602.16375】Variable-Length Semantic IDs for Recommender Systems
链接:https://arxiv.org/abs/2602.16375
作者:Kirill Khrylchenko
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:modeling user behavior, Generative models, generative models difficult, integrating large language, training generative models
备注:
点击查看摘要
Abstract:Generative models are increasingly used in recommender systems, both for modeling user behavior as event sequences and for integrating large language models into recommendation pipelines. A key challenge in this setting is the extremely large cardinality of item spaces, which makes training generative models difficult and introduces a vocabulary gap between natural language and item identifiers. Semantic identifiers (semantic IDs), which represent items as sequences of low-cardinality tokens, have recently emerged as an effective solution to this problem. However, existing approaches generate semantic identifiers of fixed length, assigning the same description length to all items. This is inefficient, misaligned with natural language, and ignores the highly skewed frequency structure of real-world catalogs, where popular items and rare long-tail items exhibit fundamentally different information requirements. In parallel, the emergent communication literature studies how agents develop discrete communication protocols, often producing variable-length messages in which frequent concepts receive shorter descriptions. Despite the conceptual similarity, these ideas have not been systematically adopted in recommender systems. In this work, we bridge recommender systems and emergent communication by introducing variable-length semantic identifiers for recommendation. We propose a discrete variational autoencoder with Gumbel-Softmax reparameterization that learns item representations of adaptive length under a principled probabilistic framework, avoiding the instability of REINFORCE-based training and the fixed-length constraints of prior semantic ID methods.
Subjects:
Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:
arXiv:2602.16375 [cs.IR]
(or
arXiv:2602.16375v1 [cs.IR] for this version)
https://doi.org/10.48550/arXiv.2602.16375
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
23. 【2602.16346】Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents
链接:https://arxiv.org/abs/2602.16346
作者:Nivya Talokar,Ayush K Tarun,Murari Mandal,Maksym Andriushchenko,Antoine Bosselut
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:execute real-world workflows, LLM-based agents execute, agents execute real-world, execute real-world, real-world workflows
备注:
点击查看摘要
Abstract:LLM-based agents execute real-world workflows via tools and memory. These affordances enable ill-intended adversaries to also use these agents to carry out complex misuse scenarios. Existing agent misuse benchmarks largely test single-prompt instructions, leaving a gap in measuring how agents end up helping with harmful or illegal tasks over multiple turns. We introduce STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive follow-ups, using judge agents to track phase completion. We further introduce an analysis framework that models multi-turn red-teaming as a time-to-first-jailbreak random variable, enabling analysis tools like discovery curves, hazard-ratio attribution by attack language, and a new metric: Restricted Mean Jailbreak Discovery. Across AgentHarm scenarios, STING yields substantially higher illicit-task completion than single-turn prompting and chat-oriented multi-turn baselines adapted to tool-using agents. In multilingual evaluations across six non-English settings, we find that attack success and illicit-task completion do not consistently increase in lower-resource languages, diverging from common chatbot findings. Overall, STING provides a practical way to evaluate and stress-test agent misuse in realistic deployment settings, where interactions are inherently multi-turn and often multilingual.
24. 【2602.16313】MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks
链接:https://arxiv.org/abs/2602.16313
作者:Zexue He,Yu Wang,Churan Zhi,Yuanzhe Hu,Tzu-Ping Chen,Lang Yin,Ze Chen,Tong Arthur Wu,Siru Ouyang,Zihan Wang,Jiaxin Pei,Julian McAuley,Yejin Choi,Alex Pentland
类目:Computation and Language (cs.CL)
关键词:typically assess memorization, memory typically assess, memory, typically assess, assess memorization
备注:
点击查看摘要
Abstract:Existing evaluations of agents with memory typically assess memorization and action in isolation. One class of benchmarks evaluates memorization by testing recall of past conversations or text but fails to capture how memory is used to guide future decisions. Another class focuses on agents acting in single-session tasks without the need for long-term memory. However, in realistic settings, memorization and action are tightly coupled: agents acquire memory while interacting with the environment, and subsequently rely on that memory to solve future tasks. To capture this setting, we introduce MemoryArena, a unified evaluation gym for benchmarking agent memory in multi-session Memory-Agent-Environment loops. The benchmark consists of human-crafted agentic tasks with explicitly interdependent subtasks, where agents must learn from earlier actions and feedback by distilling experiences into memory, and subsequently use that memory to guide later actions to solve the overall task. MemoryArena supports evaluation across web navigation, preference-constrained planning, progressive information search, and sequential formal reasoning, and reveals that agents with near-saturated performance on existing long-context memory benchmarks like LoCoMo perform poorly in our agentic setting, exposing a gap in current evaluations for agents with memory.
25. 【2602.16298】MultiCW: A Large-Scale Balanced Benchmark Dataset for Training Robust Check-Worthiness Detection Models
链接:https://arxiv.org/abs/2602.16298
作者:Martin Hyben,Sebastian Kula,Jan Cegin,Jakub Simko,Ivan Srba,Robert Moro
类目:Computation and Language (cs.CL)
关键词:professionals verify information, process remains limited, media professionals verify, Large Language Models, Large Language
备注: 18 pages, 8 figures, 19 tables, EACL-2026
点击查看摘要
Abstract:Large Language Models (LLMs) are beginning to reshape how media professionals verify information, yet automated support for detecting check-worthy claims a key step in the fact-checking process remains limited. We introduce the Multi-Check-Worthy (MultiCW) dataset, a balanced multilingual benchmark for check-worthy claim detection spanning 16 languages, 7 topical domains, and 2 writing styles. It consists of 123,722 samples, evenly distributed between noisy (informal) and structured (formal) texts, with balanced representation of check-worthy and non-check-worthy classes across all languages. To probe robustness, we also introduce an equally balanced out-of-distribution evaluation set of 27,761 samples in 4 additional languages. To provide baselines, we benchmark 3 common fine-tuned multilingual transformers against a diverse set of 15 commercial and open LLMs under zero-shot settings. Our findings show that fine-tuned models consistently outperform zero-shot LLMs on claim classification and show strong out-of-distribution generalization across languages, domains, and styles. MultiCW provides a rigorous multilingual resource for advancing automated fact-checking and enables systematic comparisons between fine-tuned models and cutting-edge LLMs on the check-worthy claim detection task.
26. 【2602.16290】Aladdin-FTI @ AMIYA Three Wishes for Arabic NLP: Fidelity, Diglossia, and Multidialectal Generation
链接:https://arxiv.org/abs/2602.16290
作者:Jonathan Mutal,Perla Al Almaoui,Simon Hengchen,Pierrette Bouillon
类目:Computation and Language (cs.CL)
关键词:Natural Language Processing, Language Processing, under-represented in Natural, Natural Language, Large Language Models
备注: 13 pages, Paper submitted to the AMIYA shared task at the VarDial workshop, co-located with EACL 2026
点击查看摘要
Abstract:Arabic dialects have long been under-represented in Natural Language Processing (NLP) research due to their non-standardization and high variability, which pose challenges for computational modeling. Recent advances in the field, such as Large Language Models (LLMs), offer promising avenues to address this gap by enabling Arabic to be modeled as a pluricentric language rather than a monolithic system. This paper presents Aladdin-FTI, our submission to the AMIYA shared task. The proposed system is designed to both generate and translate dialectal Arabic (DA). Specifically, the model supports text generation in Moroccan, Egyptian, Palestinian, Syrian, and Saudi dialects, as well as bidirectional translation between these dialects, Modern Standard Arabic (MSA), and English. The code and trained model are publicly available.
27. 【2602.16241】Are LLMs Ready to Replace Bangla Annotators?
链接:https://arxiv.org/abs/2602.16241
作者:Md. Najib Hasan,Touseef Hasan,Souvika Sarkar
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:remains poorly understood, Large Language Models, scale dataset creation, Large Language, dataset creation
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are increasingly used as automated annotators to scale dataset creation, yet their reliability as unbiased annotators--especially for low-resource and identity-sensitive settings--remains poorly understood. In this work, we study the behavior of LLMs as zero-shot annotators for Bangla hate speech, a task where even human agreement is challenging, and annotator bias can have serious downstream consequences. We conduct a systematic benchmark of 17 LLMs using a unified evaluation framework. Our analysis uncovers annotator bias and substantial instability in model judgments. Surprisingly, increased model scale does not guarantee improved annotation quality--smaller, more task-aligned models frequently exhibit more consistent behavior than their larger counterparts. These results highlight important limitations of current LLMs for sensitive annotation tasks in low-resource languages and underscore the need for careful evaluation before deployment.
28. 【2602.16201】Long-Tail Knowledge in Large Language Models: Taxonomy, Mechanisms, Interventions and Implications
链接:https://arxiv.org/abs/2602.16201
作者:Sanket Badhe,Deep Shah,Nehal Kathrotia
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
关键词:steep power-law distributions, exhibit steep power-law, Large language models, long-Tail Knowledge, power-law distributions
备注:
点击查看摘要
Abstract:Large language models (LLMs) are trained on web-scale corpora that exhibit steep power-law distributions, in which the distribution of knowledge is highly long-tailed, with most appearing infrequently. While scaling has improved average-case performance, persistent failures on low-frequency, domain-specific, cultural, and temporal knowledge remain poorly characterized. This paper develops a structured taxonomy and analysis of long-Tail Knowledge in large language models, synthesizing prior work across technical and sociotechnical perspectives. We introduce a structured analytical framework that synthesizes prior work across four complementary axes: how long-Tail Knowledge is defined, the mechanisms by which it is lost or distorted during training and inference, the technical interventions proposed to mitigate these failures, and the implications of these failures for fairness, accountability, transparency, and user trust. We further examine how existing evaluation practices obscure tail behavior and complicate accountability for rare but consequential failures. The paper concludes by identifying open challenges related to privacy, sustainability, and governance that constrain long-Tail Knowledge representation. Taken together, this paper provides a unifying conceptual framework for understanding how long-Tail Knowledge is defined, lost, evaluated, and manifested in deployed language model systems.
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Cite as:
arXiv:2602.16201 [cs.CL]
(or
arXiv:2602.16201v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2602.16201
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
29. 【2602.16200】he Validity of Coreference-based Evaluations of Natural Language Understanding
链接:https://arxiv.org/abs/2602.16200
作者:Ian Porada
类目:Computation and Language (cs.CL)
关键词:expanding existing evaluation, existing evaluation practices, converging or conflicting, refine our understanding, reach from coreference-based
备注: PhD Thesis
点击查看摘要
Abstract:In this thesis, I refine our understanding as to what conclusions we can reach from coreference-based evaluations by expanding existing evaluation practices and considering the extent to which evaluation results are either converging or conflicting. First, I analyze standard coreference evaluations and show that their design often leads to non-generalizable conclusions due to issues of measurement validity - including contestedness (multiple, competing definitions of coreference) and convergent validity (evaluation results that rank models differently across benchmarks). Second, I propose and implement a novel evaluation focused on testing systems' ability to infer the relative plausibility of events, a key aspect of resolving coreference. Through this extended evaluation, I find that contemporary language models demonstrate strong performance on standard benchmarks - improving over earlier baseline systems within certain domains and types of coreference - but remain sensitive to the evaluation conditions: they often fail to generalize in ways one would expect a human to be capable of when evaluation contexts are slightly modified. Taken together, these findings clarify both the strengths, such as improved accuracy over baselines on widely used evaluations, and the limitations of the current NLP paradigm, including weaknesses in measurement validity, and suggest directions for future work in developing better evaluation methods and more genuinely generalizable systems.
30. 【2602.16197】ModalImmune: Immunity Driven Unlearning via Self Destructive Training
链接:https://arxiv.org/abs/2602.16197
作者:Rong Fu,Jia Yee Tan,Wenxin Zhang,Zijian Zhang,Ziming Wang,Zhaolu Kang,Muge Qi,Shuning Zhang,Simon Fong
类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Multimedia (cs.MM)
关键词:channels at deployment, real-world settings, systems are vulnerable, vulnerable to partial, partial or complete
备注: 23 pages, 8 figures
点击查看摘要
Abstract:Multimodal systems are vulnerable to partial or complete loss of input channels at deployment, which undermines reliability in real-world settings. This paper presents ModalImmune, a training framework that enforces modality immunity by intentionally and controllably collapsing selected modality information during training so the model learns joint representations that are robust to destructive modality influence. The framework combines a spectrum-adaptive collapse regularizer, an information-gain guided controller for targeted interventions, curvature-aware gradient masking to stabilize destructive updates, and a certified Neumann-truncated hyper-gradient procedure for automatic meta-parameter adaptation. Empirical evaluation on standard multimodal benchmarks demonstrates that ModalImmune improves resilience to modality removal and corruption while retaining convergence stability and reconstruction capacity.
31. 【2602.16189】Beyond Learning: A Training-Free Alternative to Model Adaptation
链接:https://arxiv.org/abs/2602.16189
作者:Namkyung Yoon,Kyeonghyun Yoo,Wooyong Jung,Sanghong Kim,Hwangnam Kim
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:underperform previous versions, previous versions, language models, underperform previous, model
备注: 7 pages, 3 figures, 5 tables. Preprint submitted to Pattern Recognition Letters
点击查看摘要
Abstract:Despite the continuous research and evolution of language models, they sometimes underperform previous versions. Existing approaches to overcome these challenges are resource-intensive, highlighting the need for alternatives that enable immediate action. We assume that each language model has a local module inside that is suitable for a specific function. First, this work identifies a set of modules showing consistent and local activation changes under an inference workload through activation-based analysis. Subsequently, we transplant an internal module that is properly activated for a specific task into the target model, leading to immediate and measurable functional changes without additional training or fine-tuning. To experimentally demonstrate the effectiveness of the transplant technique, we quantify the relationship between transplant strength and performance improvement under different conditions for two language models. In the cross-generation setting, we find that transplanting activation-selected modules can substantially improve the underperforming model, reaching up to twice the target baseline and achieving gap-based recovery above 100%. Moreover, in transplant experiments between a base model and its instruction-tuned counterpart, transplantation improves the underperforming model toward the stronger baseline, yielding up to about 2.33 times the target baseline with gap-based recovery reaching up to 100% in the best case. These results show that meaningful capacity transfer can be realized through the implantation of highly localized modules implied by language models. Overall, this work provides empirical evidence for task-localized modularity in language models and presents a new research area: model transplantation.
32. 【2602.16173】Learning Personalized Agents from Human Feedback
链接:https://arxiv.org/abs/2602.16173
作者:Kaiqu Liang,Julia Kruk,Shengyi Qian,Xianjun Yang,Shengjie Bi,Yuanshun Yao,Shaoliang Nie,Mingyang Zhang,Lijuan Liu,Jaime Fernández Fisac,Shuyan Zhou,Saghar Hosseini
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:fail to align, Modern, individual users, PAHF, Modern AI agents
备注:
点击查看摘要
Abstract:Modern AI agents are powerful but often fail to align with the idiosyncratic, evolving preferences of individual users. Prior approaches typically rely on static datasets, either training implicit preference models on interaction history or encoding user profiles in external memory. However, these approaches struggle with new users and with preferences that change over time. We introduce Personalized Agents from Human Feedback (PAHF), a framework for continual personalization in which agents learn online from live interaction using explicit per-user memory. PAHF operationalizes a three-step loop: (1) seeking pre-action clarification to resolve ambiguity, (2) grounding actions in preferences retrieved from memory, and (3) integrating post-action feedback to update memory when preferences drift. To evaluate this capability, we develop a four-phase protocol and two benchmarks in embodied manipulation and online shopping. These benchmarks quantify an agent's ability to learn initial preferences from scratch and subsequently adapt to persona shifts. Our theoretical analysis and empirical results show that integrating explicit memory with dual feedback channels is critical: PAHF learns substantially faster and consistently outperforms both no-memory and single-channel baselines, reducing initial personalization error and enabling rapid adaptation to preference shifts.
33. 【2602.16169】Discrete Stochastic Localization for Non-autoregressive Generation
链接:https://arxiv.org/abs/2602.16169
作者:Yunshu Wu,Jiayi Cheng,Partha Thakuria,Rob Brekelmans,Evangelos E. Papalexakis,Greg Ver Steeg
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:reduces decoding latency, generation reduces decoding, NAR iterative refinement, modern NAR iterative, tokens in parallel
备注:
点击查看摘要
Abstract:Non-autoregressive (NAR) generation reduces decoding latency by predicting many tokens in parallel, but iterative refinement often suffers from error accumulation and distribution shift under self-generated drafts. Masked diffusion language models (MDLMs) and their remasking samplers (e.g., ReMDM) can be viewed as modern NAR iterative refinement, where generation repeatedly revises a partially observed draft. In this work we show that \emph{training alone} can substantially improve the step-efficiency of MDLM/ReMDM sampling. We propose \textsc{DSL} (Discrete Stochastic Localization), which trains a single SNR-invariant denoiser across a continuum of corruption levels, bridging intermediate draft noise and mask-style endpoint corruption within one Diffusion Transformer. On OpenWebText, \textsc{DSL} fine-tuning yields large MAUVE gains at low step budgets, surpassing the MDLM+ReMDM baseline with \(\sim\)4$\times$ fewer denoiser evaluations, and matches autoregressive quality at high budgets. Analyses show improved self-correction and uncertainty calibration, making remasking markedly more compute-efficient.
34. 【2602.16162】LLMs Exhibit Significantly Lower Uncertainty in Creative Writing Than Professional Writers
链接:https://arxiv.org/abs/2602.16162
作者:Peiqi Sui
类目:Computation and Language (cs.CL)
关键词:trite and cliché-ridden, key and understudied, understudied limitation, limitation of LLMs', LLMs' performance
备注: 11 tables
点击查看摘要
Abstract:We argue that uncertainty is a key and understudied limitation of LLMs' performance in creative writing, which is often characterized as trite and cliché-ridden. Literary theory identifies uncertainty as a necessary condition for creative expression, while current alignment strategies steer models away from uncertain outputs to ensure factuality and reduce hallucination. We formalize this tension by quantifying the "uncertainty gap" between human-authored stories and model-generated continuations. Through a controlled information-theoretic analysis of 28 LLMs on high-quality storytelling datasets, we demonstrate that human writing consistently exhibits significantly higher uncertainty than model outputs. We find that instruction-tuned and reasoning models exacerbate this trend compared to their base counterparts; furthermore, the gap is more pronounced in creative writing than in functional domains, and strongly correlates to writing quality. Achieving human-level creativity requires new uncertainty-aware alignment paradigms that can distinguish between destructive hallucinations and the constructive ambiguity required for literary richness.
35. 【2602.16161】Emotion Collider: Dual Hyperbolic Mirror Manifolds for Sentiment Recovery via Anti Emotion Reflection
链接:https://arxiv.org/abs/2602.16161
作者:Rong Fu,Ziming Wang,Shuo Yin,Wenxin Zhang,Haiyun Wei,Kun Liu,Xianda Li,Zeli Su,Simon Fong
类目:Multimedia (cs.MM); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Emotional expression underpins, expression underpins natural, underpins natural communication, Emotional expression, effective human-computer interaction
备注: 25 pages, 14 figures
点击查看摘要
Abstract:Emotional expression underpins natural communication and effective human-computer interaction. We present Emotion Collider (EC-Net), a hyperbolic hypergraph framework for multimodal emotion and sentiment modeling. EC-Net represents modality hierarchies using Poincare-ball embeddings and performs fusion through a hypergraph mechanism that passes messages bidirectionally between nodes and hyperedges. To sharpen class separation, contrastive learning is formulated in hyperbolic space with decoupled radial and angular objectives. High-order semantic relations across time steps and modalities are preserved via adaptive hyperedge construction. Empirical results on standard multimodal emotion benchmarks show that EC-Net produces robust, semantically coherent representations and consistently improves accuracy, particularly when modalities are partially available or contaminated by noise. These findings indicate that explicit hierarchical geometry combined with hypergraph fusion is effective for resilient multimodal affect understanding.
36. 【2602.16154】Balancing Faithfulness and Performance in Reasoning via Multi-Listener Soft Execution
链接:https://arxiv.org/abs/2602.16154
作者:Nithin Sivakumaran,Shoubin Yu,Hyunji Lee,Yue Zhang,Ali Payani,Mohit Bansal,Elias Stengel-Eskin
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:LLMs arrive, large language model, hampering its utility, fails to faithfully, faithfully reflect
备注: Code: [this https URL](https://github.com/nsivaku/remul)
点击查看摘要
Abstract:Chain-of-thought (CoT) reasoning sometimes fails to faithfully reflect the true computation of a large language model (LLM), hampering its utility in explaining how LLMs arrive at their answers. Moreover, optimizing for faithfulness and interpretability in reasoning often degrades task performance. To address this tradeoff and improve CoT faithfulness, we propose Reasoning Execution by Multiple Listeners (REMUL), a multi-party reinforcement learning approach. REMUL builds on the hypothesis that reasoning traces which other parties can follow will be more faithful. A speaker model generates a reasoning trace, which is truncated and passed to a pool of listener models who "execute" the trace, continuing the trace to an answer. Speakers are rewarded for producing reasoning that is clear to listeners, with additional correctness regularization via masked supervised finetuning to counter the tradeoff between faithfulness and performance. On multiple reasoning benchmarks (BIG-Bench Extra Hard, MuSR, ZebraLogicBench, and FOLIO), REMUL consistently and substantially improves three measures of faithfulness -- hint attribution, early answering area over the curve (AOC), and mistake injection AOC -- while also improving accuracy. Our analysis finds that these gains are robust across training domains, translate to legibility gains, and are associated with shorter and more direct CoTs.
37. 【2602.16144】Missing-by-Design: Certifiable Modality Deletion for Revocable Multimodal Sentiment Analysis
链接:https://arxiv.org/abs/2602.16144
作者:Rong Fu,Wenxin Zhang,Ziming Wang,Chunlei Meng,Jiaxuan Lu,Jiekai Wu,Kangan Qian,Hao Zhang,Simon Fong
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:sensitive personal data, revoke specific data, specific data modalities, systems increasingly process, increasingly process sensitive
备注: 21 pages, 6 figures
点击查看摘要
Abstract:As multimodal systems increasingly process sensitive personal data, the ability to selectively revoke specific data modalities has become a critical requirement for privacy compliance and user autonomy. We present Missing-by-Design (MBD), a unified framework for revocable multimodal sentiment analysis that combines structured representation learning with a certifiable parameter-modification pipeline. Revocability is critical in privacy-sensitive applications where users or regulators may request removal of modality-specific information. MBD learns property-aware embeddings and employs generator-based reconstruction to recover missing channels while preserving task-relevant signals. For deletion requests, the framework applies saliency-driven candidate selection and a calibrated Gaussian update to produce a machine-verifiable Modality Deletion Certificate. Experiments on benchmark datasets show that MBD achieves strong predictive performance under incomplete inputs and delivers a practical privacy-utility trade-off, positioning surgical unlearning as an efficient alternative to full retraining.
38. 【2602.16093】Updating Parametric Knowledge with Context Distillation Retains Post-Training Capabilities
链接:https://arxiv.org/abs/2602.16093
作者:Shankar Padmanabhan,Mustafa Omer Gul,Tanya Goyal
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Post-training endows pretrained, endows pretrained LLMs, Post-training endows, endows pretrained, variety of desirable
备注: 15 pages. Preprint, under review
点击查看摘要
Abstract:Post-training endows pretrained LLMs with a variety of desirable skills, including instruction-following, reasoning, and others. However, these post-trained LLMs only encode knowledge up to a cut-off date, necessitating continual adaptation. Unfortunately, existing solutions cannot simultaneously learn new knowledge from an adaptation document corpora and mitigate the forgetting of earlier learned capabilities. To address this, we introduce Distillation via Split Contexts (DiSC), a simple context-distillation based approach for continual knowledge adaptation. \methodname~derives student and teacher distributions by conditioning on distinct segments of the training example and minimizes the KL divergence between the shared tokens. This allows us to efficiently apply context-distillation without requiring explicit generation steps during training. We run experiments on four post-trained models and two adaptation domains. Compared to prior finetuning and distillation methods for continual adaptation, DiSC consistently reports the best trade-off between learning new knowledge and mitigating forgetting of previously learned skills like instruction-following, reasoning, and factual knowledge.
39. 【2602.16092】Why Any-Order Autoregressive Models Need Two-Stream Attention: A Structural-Semantic Tradeoff
链接:https://arxiv.org/abs/2602.16092
作者:Patrick Pynadath,Ruqi Zhang
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:native key-value caching, efficient masked diffusion, enabling native key-value, Any-order autoregressive models, required two-stream attention
备注:
点击查看摘要
Abstract:Any-order autoregressive models (AO-ARMs) offer a promising path toward efficient masked diffusion by enabling native key-value caching, but competitive performance has so far required two-stream attention, typically motivated as a means of decoupling token content from position. In this work, we argue that two-stream attention may be serving a more subtle role. We identify a structural-semantic tradeoff in any-order generation: the hidden representation at each step must simultaneously attend to semantically informative tokens for prediction and structurally recent tokens for summarization, objectives that compete for attention capacity in a single stream but can specialize across two streams. To isolate this tradeoff from position-content separation, we propose Decoupled RoPE, a modification to rotary position embeddings that provides target position information without revealing target content. Decoupled RoPE performs competitively at short sequence lengths--where semantic and structural proximity coincide--but degrades as sequence length increases and the two orderings diverge. These results suggest that the success of two-stream attention stems not merely from separating position from content, but from circumventing the deeper structural-semantic tradeoff inherent to any-order generation.
40. 【2602.16085】Language Statistics and False Belief Reasoning: Evidence from 41 Open-Weight LMs
链接:https://arxiv.org/abs/2602.16085
作者:Sean Trott,Samuel Taylor,Cameron Jones,James A. Michaelov,Pamela D. Rivière
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:mental state reasoning, state reasoning emerges, mental state, state reasoning, potential to inform
备注: 15 pages, 7 figures, submitted to conference
点击查看摘要
Abstract:Research on mental state reasoning in language models (LMs) has the potential to inform theories of human social cognition--such as the theory that mental state reasoning emerges in part from language exposure--and our understanding of LMs themselves. Yet much published work on LMs relies on a relatively small sample of closed-source LMs, limiting our ability to rigorously test psychological theories and evaluate LM capacities. Here, we replicate and extend published work on the false belief task by assessing LM mental state reasoning behavior across 41 open-weight models (from distinct model families). We find sensitivity to implied knowledge states in 34% of the LMs tested; however, consistent with prior work, none fully ``explain away'' the effect in humans. Larger LMs show increased sensitivity and also exhibit higher psychometric predictive power. Finally, we use LM behavior to generate and test a novel hypothesis about human cognition: both humans and LMs show a bias towards attributing false beliefs when knowledge states are cued using a non-factive verb (``John thinks...'') than when cued indirectly (``John looks in the...''). Unlike the primary effect of knowledge states, where human sensitivity exceeds that of LMs, the magnitude of the human knowledge cue effect falls squarely within the distribution of LM effect sizes-suggesting that distributional statistics of language can in principle account for the latter but not the former in humans. These results demonstrate the value of using larger samples of open-weight LMs to test theories of human cognition and evaluate LM capacities.
41. 【2602.16080】Surgical Activation Steering via Generative Causal Mediation
链接:https://arxiv.org/abs/2602.16080
作者:Aruna Sankaranarayanan,Amir Zur,Atticus Geiger,Dylan Hadfield-Menell
类目:Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
关键词:Generative Causal Mediation, introduce Generative Causal, Causal Mediation, control behaviors, Generative Causal
备注:
点击查看摘要
Abstract:Where should we intervene in a language model (LM) to control behaviors that are diffused across many tokens of a long-form response? We introduce Generative Causal Mediation (GCM), a procedure for selecting model components, e.g., attention heads, to steer a binary concept (e.g., talk in verse vs. talk in prose) from contrastive long-form responses. In GCM, we first construct a dataset of contrasting inputs and responses. Then, we quantify how individual model components mediate the contrastive concept and select the strongest mediators for steering. We evaluate GCM on three tasks--refusal, sycophancy, and style transfer--across three language models. GCM successfully localizes concepts expressed in long-form responses and consistently outperforms correlational probe-based baselines when steering with a sparse set of attention heads. Together, these results demonstrate that GCM provides an effective approach for localizing and controlling the long-form responses of LMs.
42. 【2602.16054】CLAA: Cross-Layer Attention Aggregation for Accelerating LLM Prefill
链接:https://arxiv.org/abs/2602.16054
作者:Bradley McDanel,Steven Li,Harshit Khaitan
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:long-context LLM inference, LLM inference remains, long-context LLM, LLM inference, computational bottleneck
备注: 15 pages, 8 figures
点击查看摘要
Abstract:The prefill stage in long-context LLM inference remains a computational bottleneck. Recent token-ranking heuristics accelerate inference by selectively processing a subset of semantically relevant tokens. However, existing methods suffer from unstable token importance estimation, often varying between layers. Evaluating token-ranking quality independently from heuristic-specific architectures is challenging. To address this, we introduce an Answer-Informed Oracle, which defines ground-truth token importance by measuring attention from generated answers back to the prompt. This oracle reveals that existing heuristics exhibit high variance across layers: rankings can degrade sharply at specific layers, a failure mode invisible to end-to-end benchmarks. The diagnosis suggests a simple fix: aggregate scores across layers rather than relying on any single one. We implement this as Cross-Layer Attention Aggregation (CLAA), which closes the gap to the oracle upper bound and reduces Time-to-First-Token (TTFT) by up to 39\% compared to the Full KV Cache baseline.
43. 【2602.16050】Evidence-Grounded Subspecialty Reasoning: Evaluating a Curated Clinical Intelligence Layer on the 2025 Endocrinology Board-Style Examination
链接:https://arxiv.org/abs/2602.16050
作者:Amir Hosseinian,MohammadReza Zare Shahneh,Umer Mansoor,Gilbert Szeto,Kirill Karlin,Nima Aghaeepour
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large language models, demonstrated strong performance, remains challenging due, general medical examinations, Large language
备注:
点击查看摘要
Abstract:Background: Large language models have demonstrated strong performance on general medical examinations, but subspecialty clinical reasoning remains challenging due to rapidly evolving guidelines and nuanced evidence hierarchies. Methods: We evaluated January Mirror, an evidence-grounded clinical reasoning system, against frontier LLMs (GPT-5, GPT-5.2, Gemini-3-Pro) on a 120-question endocrinology board-style examination. Mirror integrates a curated endocrinology and cardiometabolic evidence corpus with a structured reasoning architecture to generate evidence-linked outputs. Mirror operated under a closed-evidence constraint without external retrieval. Comparator LLMs had real-time web access to guidelines and primary literature. Results: Mirror achieved 87.5% accuracy (105/120; 95% CI: 80.4-92.3%), exceeding a human reference of 62.3% and frontier LLMs including GPT-5.2 (74.6%), GPT-5 (74.0%), and Gemini-3-Pro (69.8%). On the 30 most difficult questions (human accuracy less than 50%), Mirror achieved 76.7% accuracy. Top-2 accuracy was 92.5% for Mirror versus 85.25% for GPT-5.2. Conclusions: Mirror provided evidence traceability: 74.2% of outputs cited at least one guideline-tier source, with 100% citation accuracy on manual verification. Curated evidence with explicit provenance can outperform unconstrained web retrieval for subspecialty clinical reasoning and supports auditability for clinical deployment.
44. 【2602.16023】A Curious Class of Adpositional Multiword Expressions in Korean
链接:https://arxiv.org/abs/2602.16023
作者:Junghyun Min,Na-Rae Han,Jena D. Hwang,Nathan Schneider
类目:Computation and Language (cs.CL)
关键词:widely studied, Korean multiword adpositions, PARSEME, Korean, Korean multiword
备注: 10 pages. Camera-ready for MWE at EACL 2026
点击查看摘要
Abstract:Multiword expressions (MWEs) have been widely studied in cross-lingual annotation frameworks such as PARSEME. However, Korean MWEs remain underrepresented in these efforts. In particular, Korean multiword adpositions lack systematic analysis, annotated resources, and integration into existing multilingual frameworks. In this paper, we study a class of Korean functional multiword expressions: postpositional verb-based constructions (PVCs). Using data from Korean Wikipedia, we survey and analyze several PVC expressions and contrast them with non-MWEs and light verb constructions (LVCs) with similar structure. Building on this analysis, we propose annotation guidelines designed to support future work in Korean multiword adpositions and facilitate alignment with cross-lingual frameworks.
45. 【2602.16008】MAEB: Massive Audio Embedding Benchmark
链接:https://arxiv.org/abs/2602.16008
作者:Adnan El Assadi,Isaac Chung,Chenghao Xiao,Roman Solomatin,Animesh Jha,Rahul Chand,Silky Singh,Kaitlyn Wang,Ali Sartaz Khan,Marc Moussa Nasser,Sufen Fong,Pengfei He,Alan Xiao,Ayush Sunil Munot,Aditya Shrivastava,Artem Gazizov,Niklas Muennighoff,Kenneth Enevoldsen
类目:ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Massive Audio Embedding, Audio Embedding Benchmark, large-scale benchmark covering, Embedding Benchmark, introduce the Massive
备注:
点击查看摘要
Abstract:We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages. We evaluate 50+ models and find that no single model dominates across all tasks: contrastive audio-text models excel at environmental sound classification (e.g., ESC50) but score near random on multilingual speech tasks (e.g., SIB-FLEURS), while speech-pretrained models show the opposite pattern. Clustering remains challenging for all models, with even the best-performing model achieving only modest results. We observe that models excelling on acoustic understanding often perform poorly on linguistic tasks, and vice versa. We also show that the performance of audio encoders on MAEB correlates highly with their performance when used in audio large language models. MAEB is derived from MAEB+, a collection of 98 tasks. MAEB is designed to maintain task diversity while reducing evaluation cost, and it integrates into the MTEB ecosystem for unified evaluation across text, image, and audio modalities. We release MAEB and all 98 tasks along with code and a leaderboard at this https URL.
46. 【2602.15997】Anatomy of Capability Emergence: Scale-Invariant Representation Collapse and Top-Down Reorganization in Neural Networks
链接:https://arxiv.org/abs/2602.15997
作者:Jayadev Billa
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:remains mechanistically opaque, neural network training, network training remains, training remains mechanistically, Capability emergence
备注: 19 pages, 6 figures, 12 appendix pages
点击查看摘要
Abstract:Capability emergence during neural network training remains mechanistically opaque. We track five geometric measures across five model scales (405K-85M parameters), 120+ emergence events in eight algorithmic tasks, and three Pythia language models (160M-2.8B). We find: (1) training begins with a universal representation collapse to task-specific floors that are scale-invariant across a 210X parameter range (e.g., modular arithmetic collapses to RANKME ~ 2.0 regardless of model size); (2) collapse propagates top-down through layers (32/32 task X model consistency), contradicting bottom-up feature-building intuition; (3) a geometric hierarchy in which representation geometry leads emergence (75-100% precursor rate for hard tasks), while the local learning coefficient is synchronous (0/24 precursor) and Hessian measures lag. We also delineate prediction limits: geometric measures encode coarse task difficulty but not fine-grained timing (within-class concordance 27%; when task ordering reverses across scales, prediction fails at 26%). On Pythia, global geometric patterns replicate but per-task precursor signals do not -- the precursor relationship requires task-training alignment that naturalistic pre-training does not provide. Our contribution is the geometric anatomy of emergence and its boundary conditions, not a prediction tool.
47. 【2602.15958】DocSplit: A Comprehensive Benchmark Dataset and Evaluation Approach for Document Packet Recognition and Splitting
链接:https://arxiv.org/abs/2602.15958
作者:Md Mofijul Islam,Md Sirajus Salekin,Nivedha Balakrishnan,Vincil C. Bishop III,Niharika Jain,Spencer Romo,Bob Strahan,Boyi Xie,Diego A. Socolinsky
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:multiple documents stitched, document packet, multi-page document packets, document packet splitting, Document
备注:
点击查看摘要
Abstract:Document understanding in real-world applications often requires processing heterogeneous, multi-page document packets containing multiple documents stitched together. Despite recent advances in visual document understanding, the fundamental task of document packet splitting, which involves separating a document packet into individual units, remains largely unaddressed. We present the first comprehensive benchmark dataset, DocSplit, along with novel evaluation metrics for assessing the document packet splitting capabilities of large language models. DocSplit comprises five datasets of varying complexity, covering diverse document types, layouts, and multimodal settings. We formalize the DocSplit task, which requires models to identify document boundaries, classify document types, and maintain correct page ordering within a document packet. The benchmark addresses real-world challenges, including out-of-order pages, interleaved documents, and documents lacking clear demarcations. We conduct extensive experiments evaluating multimodal LLMs on our datasets, revealing significant performance gaps in current models' ability to handle complex document splitting tasks. The DocSplit benchmark datasets and proposed novel evaluation metrics provide a systematic framework for advancing document understanding capabilities essential for legal, financial, healthcare, and other document-intensive domains. We release the datasets to facilitate future research in document packet processing.
48. 【2602.15902】Doc-to-LoRA: Learning to Instantly Internalize Contexts
链接:https://arxiv.org/abs/2602.15902
作者:Rujikorn Charakorn,Edoardo Cetin,Shinnosuke Uesaka,Robert Tjarko Lange
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Large Language, Long input sequences, Long input, reasoning of Large
备注:
点击查看摘要
Abstract:Long input sequences are central to in-context learning, document understanding, and multi-step reasoning of Large Language Models (LLMs). However, the quadratic attention cost of Transformers makes inference memory-intensive and slow. While context distillation (CD) can transfer information into model parameters, per-prompt distillation is impractical due to training costs and latency. To address these limitations, we propose Doc-to-LoRA (D2L), a lightweight hypernetwork that meta-learns to perform approximate CD within a single forward pass. Given an unseen prompt, D2L generates a LoRA adapter for a target LLM, enabling subsequent queries to be answered without re-consuming the original context, reducing latency and KV-cache memory consumption during inference of the target LLM. On a long-context needle-in-a-haystack task, D2L successfully learns to map contexts into adapters that store the needle information, achieving near-perfect zero-shot accuracy at sequence lengths exceeding the target LLM's native context window by more than 4x. On real-world QA datasets with limited compute, D2L outperforms standard CD while significantly reducing peak memory consumption and update latency. We envision that D2L can facilitate rapid adaptation of LLMs, opening up the possibility of frequent knowledge updates and personalized chat behavior.
49. 【2602.15898】MultiCube-RAG for Multi-hop Question Answering
链接:https://arxiv.org/abs/2602.15898
作者:Jimeng Shi,Wei Hu,Runchu Tian,Bowen Jin,Wonbin Kweon,SeongKu Kang,Yunfan Kang,Dingqi Ye,Sizhe Zhou,Shaowen Wang,Jiawei Han
类目:Computation and Language (cs.CL)
关键词:Multi-hop question answering, question answering, necessitates multi-step reasoning, reasoning and retrieval, reasoning
备注: 12 pages
点击查看摘要
Abstract:Multi-hop question answering (QA) necessitates multi-step reasoning and retrieval across interconnected subjects, attributes, and relations. Existing retrieval-augmented generation (RAG) methods struggle to capture these structural semantics accurately, resulting in suboptimal performance. Graph-based RAGs structure such information in graphs, but the resulting graphs are often noisy and computationally expensive. Moreover, most methods rely on single-step retrieval, neglecting the need for multi-hop reasoning processes. Recent training-based approaches attempt to incentivize the large language models (LLMs) for iterative reasoning and retrieval, but their training processes are prone to unstable convergence and high computational overhead. To address these limitations, we devise an ontology-based cube structure with multiple and orthogonal dimensions to model structural subjects, attributes, and relations. Built on the cube structure, we propose MultiCube-RAG, a training-free method consisting of multiple cubes for multi-step reasoning and retrieval. Each cube specializes in modeling a class of subjects, so that MultiCube-RAG flexibly selects the most suitable cubes to acquire the relevant knowledge precisely. To enhance the query-based reasoning and retrieval, our method decomposes a complex multi-hop query into a set of simple subqueries along cube dimensions and conquers each of them sequentially. Experiments on four multi-hop QA datasets show that MultiCube-RAG improves response accuracy by 8.9% over the average performance of various baselines. Notably, we also demonstrate that our method performs with greater efficiency and inherent explainability.
50. 【2602.15897】Mitigating Gradient Inversion Risks in Language Models via Token Obfuscation
链接:https://arxiv.org/abs/2602.15897
作者:Xinguo Feng,Zhongkui Ma,Zihan Wang,Alsharif Abuadbba,Guangdong Bai
类目:Computation and Language (cs.CL)
关键词:fine-tuning large-scale language, reconstruct private training, private training data, large-scale language models, language models largely
备注:
点击查看摘要
Abstract:Training and fine-tuning large-scale language models largely benefit from collaborative learning, but the approach has been proven vulnerable to gradient inversion attacks (GIAs), which allow adversaries to reconstruct private training data from shared gradients. Existing defenses mainly employ gradient perturbation techniques, e.g., noise injection or gradient pruning, to disrupt GIAs' direct mapping from gradient space to token space. However, these methods often fall short due to the retention of semantics similarity across gradient, embedding, and token spaces. In this work, we propose a novel defense mechanism named GHOST (gradient shield with obfuscated tokens), a token-level obfuscation mechanism that neutralizes GIAs by decoupling the inherent connections across gradient, embedding, and token spaces. GHOST is built upon an important insight: due to the large scale of the token space, there exist semantically distinct yet embedding-proximate tokens that can serve as the shadow substitutes of the original tokens, which enables a semantic disconnection in the token space while preserving the connection in the embedding and gradient spaces. GHOST comprises a searching step, which identifies semantically distinct candidate tokens using a multi-criteria searching process, and a selection step, which selects optimal shadow tokens to ensure minimal disruption to features critical for training by preserving alignment with the internal outputs produced by original tokens. Evaluation across diverse model architectures (from BERT to Llama) and datasets demonstrates the remarkable effectiveness of GHOST in protecting privacy (as low as 1% in recovery rate) and preserving utility (up to 0.92 in classification F1 and 5.45 in perplexity), in both classification and generation tasks against state-of-the-art GIAs and adaptive attack scenarios.
51. 【2602.15896】Every Little Helps: Building Knowledge Graph Foundation Model with Fine-grained Transferable Multi-modal Tokens
链接:https://arxiv.org/abs/2602.15896
作者:Yichi Zhang,Zhuo Chen,Lingbing Guo,Wen Zhang,Huajun Chen
类目:Computation and Language (cs.CL)
关键词:multi-modal entity contents, knowledge graph reasoning, aims to predict, entity contents, predict the missing
备注: Work in progress
点击查看摘要
Abstract:Multi-modal knowledge graph reasoning (MMKGR) aims to predict the missing links by exploiting both graph structure information and multi-modal entity contents. Most existing works are designed for a transductive setting, which learns dataset-specific embeddings and struggles to generalize to new KGs. Recent knowledge graph foundation models (KGFMs) improve cross-KG transfer, but they mainly exploit structural patterns and ignore rich multi-modal signals. We address these gaps by proposing a token-based foundation model (TOFU) for MMKGR, which exhibits strong generalization across different MMKGs. TOFU discretizes structural, visual, and textual information into modality-specific tokens. TOFU then employs a hierarchical fusion architecture with mixture-of-message mechanisms, aiming to process these tokens and obtain transferable features for MMKGR. Experimental results on 17 transductive, inductive, and fully-inductive MMKGs show that TOFU consistently outperforms strong KGFM and MMKGR baselines, delivering strong performance on unseen MMKGs.
52. 【2602.15895】Understand Then Memory: A Cognitive Gist-Driven RAG Framework with Global Semantic Diffusion
链接:https://arxiv.org/abs/2602.15895
作者:Pengcheng Zhou,Haochen Li,Zhiqiang Nie,JiaLe Chen,Qing Gong,Weizhen Zhang,Chun Yu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:effectively mitigates hallucinations, incorporating external knowledge, effectively mitigates, mitigates hallucinations, hallucinations in LLMs
备注:
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) effectively mitigates hallucinations in LLMs by incorporating external knowledge. However, the inherent discrete representation of text in existing frameworks often results in a loss of semantic integrity, leading to retrieval deviations. Inspired by the human episodic memory mechanism, we propose CogitoRAG, a RAG framework that simulates human cognitive memory processes. The core of this framework lies in the extraction and evolution of the Semantic Gist. During the offline indexing stage, CogitoRAG first deduces unstructured corpora into gist memory corpora, which are then transformed into a multi-dimensional knowledge graph integrating entities, relational facts, and memory nodes. In the online retrieval stage, the framework handles complex queries via Query Decomposition Module that breaks them into comprehensive sub-queries, mimicking the cognitive decomposition humans employ for complex information. Subsequently, Entity Diffusion Module performs associative retrieval across the graph, guided by structural relevance and an entity-frequency reward mechanism. Furthermore, we propose the CogniRank algorithm, which precisely reranks candidate passages by fusing diffusion-derived scores with semantic similarity. The final evidence is delivered to the generator in a passage-memory pairing format, providing high-density information support. Experimental results across five mainstream QA benchmarks and multi-task generation on GraphBench demonstrate that CogitoRAG significantly outperforms state-of-the-art RAG methods, showcasing superior capabilities in complex knowledge integration and reasoning.
53. 【2602.15894】Quality-constrained Entropy Maximization Policy Optimization for LLM Diversity
链接:https://arxiv.org/abs/2602.15894
作者:Haihui Pan,Yuzhong Hong,Shaoke Lv,Junwei Bao,Hongfei Jiang,Yang Song
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:large language model, Recent research, methods significantly improve, LLM output diversity, language model
备注:
点击查看摘要
Abstract:Recent research indicates that while alignment methods significantly improve the quality of large language model(LLM) outputs, they simultaneously reduce the diversity of the models' output. Although some methods have been proposed to enhance LLM output diversity, they often come at the cost of reduced performance. In this work, we first theoretically demonstrate that the alignment task can be decomposed into two distributions: quality and diversity. To enhance the diversity of LLM outputs while ensuring quality, we propose the Quality-constrained Entropy Maximization Policy Optimization (QEMPO). QEMPO aims to maximize the output entropy of the policy while ensuring output quality. By adding different constraints to QEMPO, we obtain different policies. To optimize policies, we propose both online and offline training methods. Experiments validate that QEMPO achieves performance comparable to or even better than RLHF while improving output diversity.
54. 【2602.15874】P-RAG: Prompt-Enhanced Parametric RAG with LoRA and Selective CoT for Biomedical and Multi-Hop QA
链接:https://arxiv.org/abs/2602.15874
作者:Xingda Lyu,Gongfu Lyu,Zitai Yan,Yuxin Jiang
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Large Language Models, Large Language, Language Models, demonstrate remarkable capabilities, static training data
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) demonstrate remarkable capabilities but remain limited by their reliance on static training data. Retrieval-Augmented Generation (RAG) addresses this constraint by retrieving external knowledge during inference, though it still depends heavily on knowledge base quality. To explore potential improvements, we evaluated three RAG variants-Standard RAG, DA-RAG, and our proposed Prompt-Enhanced Parametric RAG (P-RAG), a hybrid architecture that integrates parametric knowledge within the LLM and retrieved evidence, guided by Chain-of-Thought (CoT) prompting and Low-Rank Adaptation (LoRA) fine-tuning-on both general and biomedical datasets. Using LLaMA-3.2-1B-Instruct fine-tuned via LoRA, we evaluate on PubMedQA and 2WikiMultihopQA. P-RAG outperforms Standard RAG on PubMedQA by 10.47 percentage points in F1 (93.33% vs. 82.86%; 12.64% relative). On 2WikiMultihopQA, P-RAG nearly doubles the overall score vs. Standard RAG (33.44% vs. 17.83%) and achieves 44.03% on the Compare subset (with 42.74% Bridge, 21.84% Inference, 8.60% Compose). CoT prompting substantially improves multi-hop reasoning but yields mixed results for simpler, single-hop queries. These findings underscore P-RAG's potential for accurate, scalable, and contextually adaptive biomedical question answering. Our contributions include: (1) LoRA-based fine-tuning of LLaMA-3.2-1B-Instruct for biomedical QA, (2) introduction of P-RAG with Chain-of-Thought prompting, and (3) state-of-the-art results on PubMedQA and 2WikiMultihopQA.
55. 【2602.15871】CheckIfExist: Detecting Citation Hallucinations in the Era of AI-Generated Content
链接:https://arxiv.org/abs/2602.15871
作者:Diletta Abbonato
类目:Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:large language models, introduced unprecedented challenges, language models, proliferation of large, large language
备注:
点击查看摘要
Abstract:The proliferation of large language models (LLMs) in academic workflows has introduced unprecedented challenges to bibliographic integrity, particularly through reference hallucination -- the generation of plausible but non-existent citations. Recent investigations have documented the presence of AI-hallucinated citations even in papers accepted at premier machine learning conferences such as NeurIPS and ICLR, underscoring the urgency of automated verification mechanisms. This paper presents "CheckIfExist", an open-source web-based tool designed to provide immediate verification of bibliographic references through multi-source validation against CrossRef, Semantic Scholar, and OpenAlex scholarly databases. While existing reference management tools offer bibliographic organization capabilities, they do not provide real-time validation of citation authenticity. Commercial hallucination detection services, though increasingly available, often impose restrictive usage limits on free tiers or require substantial subscription fees. The proposed tool fills this gap by employing a cascading validation architecture with string similarity algorithms to compute multi-dimensional match confidence scores, delivering instant feedback on reference authenticity. The system supports both single-reference verification and batch processing of BibTeX entries through a unified interface, returning validated APA citations and exportable BibTeX records within seconds.
56. 【2602.15870】VDLM: Variable Diffusion LMs via Robust Latent-to-Text Rendering
链接:https://arxiv.org/abs/2602.15870
作者:Shuhui Qu
类目:Computation and Language (cs.CL)
关键词:Autoregressive language models, language models decode, Autoregressive language, irreversible commitments, limiting revision
备注:
点击查看摘要
Abstract:Autoregressive language models decode left-to-right with irreversible commitments, limiting revision during multi-step reasoning. We propose \textbf{VDLM}, a modular variable diffusion language model that separates semantic planning from text rendering. VDLM applies LLaDA-style masked diffusion over semantic variable embeddings to enable iterative refinement in latent space, then post-trains the planner with trajectory-aware optimization using embedding-space rewards and values, avoiding text decoding inside the RL loop. To convert planned embeddings back to text, we use a \textbf{Vec2Text} renderer and introduce \textbf{embedding perturbations} to robustify decoding under planner noise. Across nine benchmarks spanning general reasoning, math, and code, VDLM is competitive in pre-training and yields substantial post-training improvements on long-form generation tasks, outperforming other baselines. These results highlight the effectiveness of embedding-space post-training and robust latent-to-text rendering for diffusion language modeling.
57. 【2602.15869】owards Fair and Efficient De-identification: Quantifying the Efficiency and Generalizability of De-identification Approaches
链接:https://arxiv.org/abs/2602.15869
作者:Noopur Zambare,Kiana Aghakasiri,Carissa Lin,Carrie Ye,J. Ross Mitchell,Mohamed Abdalla
类目:Computation and Language (cs.CL)
关键词:shown strong performance, identifying sensitive identifiers, protect privacy, shown strong, task of identifying
备注: Accpted to the Findings of EACL 2026
点击查看摘要
Abstract:Large language models (LLMs) have shown strong performance on clinical de-identification, the task of identifying sensitive identifiers to protect privacy. However, previous work has not examined their generalizability between formats, cultures, and genders. In this work, we systematically evaluate fine-tuned transformer models (BERT, ClinicalBERT, ModernBERT), small LLMs (Llama 1-8B, Qwen 1.5-7B), and large LLMs (Llama-70B, Qwen-72B) at de-identification. We show that smaller models achieve comparable performance while substantially reducing inference cost, making them more practical for deployment. Moreover, we demonstrate that smaller models can be fine-tuned with limited data to outperform larger models in de-identifying identifiers drawn from Mandarin, Hindi, Spanish, French, Bengali, and regional variations of English, in addition to gendered names. To improve robustness in multi-cultural contexts, we introduce and publicly release BERT-MultiCulture-DEID, a set of de-identification models based on BERT, ClinicalBERT, and ModernBERT, fine-tuned on MIMIC with identifiers from multiple language variants. Our findings provide the first comprehensive quantification of the efficiency-generalizability trade-off in de-identification and establish practical pathways for fair and efficient clinical de-identification. Details on accessing the models are available at: this https URL
Comments:
Accpted to the Findings of EACL 2026
Subjects:
Computation and Language (cs.CL)
Cite as:
arXiv:2602.15869 [cs.CL]
(or
arXiv:2602.15869v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2602.15869
Focus to learn more
arXiv-issued DOI via DataCite</p>
58. 【2602.15868】Understanding LLM Failures: A Multi-Tape Turing Machine Analysis of Systematic Errors in Language Model Reasoning
链接:https://arxiv.org/abs/2602.15868
作者:Magnus Boman
类目:Computation and Language (cs.CL)
关键词:Large language models, Large language, seemingly trivial tasks, exhibit failure modes, seemingly trivial
备注: 8 pages, 1 page appendix
点击查看摘要
Abstract:Large language models (LLMs) exhibit failure modes on seemingly trivial tasks. We propose a formalisation of LLM interaction using a deterministic multi-tape Turing machine, where each tape represents a distinct component: input characters, tokens, vocabulary, model parameters, activations, probability distributions, and output text. The model enables precise localisation of failure modes to specific pipeline stages, revealing, e.g., how tokenisation obscures character-level structure needed for counting tasks. The model clarifies why techniques like chain-of-thought prompting help, by externalising computation on the output tape, while also revealing their fundamental limitations. This approach provides a rigorous, falsifiable alternative to geometric metaphors and complements empirical scaling laws with principled error analysis.
59. 【2602.15867】Playing With AI: How Do State-Of-The-Art Large Language Models Perform in the 1977 Text-Based Adventure Game Zork?
链接:https://arxiv.org/abs/2602.15867
作者:Berry Gerrits
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:contemporary Large Language, Large Language Models, contemporary Large, Large Language, seminal text-based adventure
备注: 18 pages, 1 figure
点击查看摘要
Abstract:In this positioning paper, we evaluate the problem-solving and reasoning capabilities of contemporary Large Language Models (LLMs) through their performance in Zork, the seminal text-based adventure game first released in 1977. The game's dialogue-based structure provides a controlled environment for assessing how LLM-based chatbots interpret natural language descriptions and generate appropriate action sequences to succeed in the game. We test the performance of leading proprietary models - ChatGPT, Claude, and Gemini - under both minimal and detailed instructions, measuring game progress through achieved scores as the primary metric. Our results reveal that all tested models achieve less than 10% completion on average, with even the best-performing model (Claude Opus 4.5) reaching only approximately 75 out of 350 possible points. Notably, providing detailed game instructions offers no improvement, nor does enabling ''extended thinking''. Qualitative analysis of the models' reasoning processes reveals fundamental limitations: repeated unsuccessful actions suggesting an inability to reflect on one's own thinking, inconsistent persistence of strategies, and failure to learn from previous attempts despite access to conversation history. These findings suggest substantial limitations in current LLMs' metacognitive abilities and problem-solving capabilities within the domain of text-based games, raising questions about the nature and extent of their reasoning capabilities.
60. 【2602.15866】NLP Privacy Risk Identification in Social Media (NLP-PRISM): A Survey
链接:https://arxiv.org/abs/2602.15866
作者:Dhiman Goswami,Jai Kruthunz Naveen Kumar,Sanchari Das
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
关键词:Personally Identifiable Information, Identifiable Information, Personally Identifiable, Natural Language Processing, social media analytics
备注:
点击查看摘要
Abstract:Natural Language Processing (NLP) is integral to social media analytics but often processes content containing Personally Identifiable Information (PII), behavioral cues, and metadata raising privacy risks such as surveillance, profiling, and targeted advertising. To systematically assess these risks, we review 203 peer-reviewed papers and propose the NLP Privacy Risk Identification in Social Media (NLP-PRISM) framework, which evaluates vulnerabilities across six dimensions: data collection, preprocessing, visibility, fairness, computational risk, and regulatory compliance. Our analysis shows that transformer models achieve F1-scores ranging from 0.58-0.84, but incur a 1% - 23% drop under privacy-preserving fine-tuning. Using NLP-PRISM, we examine privacy coverage in six NLP tasks: sentiment analysis (16), emotion detection (14), offensive language identification (19), code-mixed processing (39), native language identification (29), and dialect detection (24) revealing substantial gaps in privacy research. We further found a (reduced by 2% - 9%) trade-off in model utility, MIA AUC (membership inference attacks) 0.81, AIA accuracy 0.75 (attribute inference attacks). Finally, we advocate for stronger anonymization, privacy-aware learning, and fairness-driven training to enable ethical NLP in social media contexts.
61. 【2602.15865】AI as Teammate or Tool? A Review of Human-AI Interaction in Decision Support
链接:https://arxiv.org/abs/2602.15865
作者:Most. Sharmin Sultana Samu,Nafisa Khan,Kazi Toufique Elahi,Tasnuva Binte Rahman,Md. Rakibul Islam,Farig Sadeque
类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Artificial Intelligence, integration of Artificial, necessitates determining, function as tools, Intelligence
备注: Preprint
点击查看摘要
Abstract:The integration of Artificial Intelligence (AI) necessitates determining whether systems function as tools or collaborative teammates. In this study, by synthesizing Human-AI Interaction (HAI) literature, we analyze this distinction across four dimensions: interaction design, trust calibration, collaborative frameworks and healthcare applications. Our analysis reveals that static interfaces and miscalibrated trust limit AI efficacy. Performance hinges on aligning transparency with cognitive workflows, yet a fluency trap often inflates trust without improving decision-making. Consequently, an overemphasis on explainability leaves systems largely passive. Our findings show that current AI systems remain largely passive due to an overreliance on explainability-centric designs and that transitioning AI to an active teammate requires adaptive, context-aware interactions that support shared mental models and the dynamic negotiation of authority between humans and AI.
62. 【2602.15863】Not the Example, but the Process: How Self-Generated Examples Enhance LLM Reasoning
链接:https://arxiv.org/abs/2602.15863
作者:Daehoon Gwak,Minseo Jung,Junwoo Park,Minho Park,ChaeHun Park,Junha Hyung,Jaegul Choo
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Large Language, achieving results comparable, Recent studies, shown that Large
备注: Presented at AACL-IJCNLP 2025
点击查看摘要
Abstract:Recent studies have shown that Large Language Models (LLMs) can improve their reasoning performance through self-generated few-shot examples, achieving results comparable to manually curated in-context examples. However, the underlying mechanism behind these gains remains unclear, making it hard to decide when and how to apply the technique effectively. In this work, we argue that the key benefit arises not from the generated examples themselves but from the act of creating them. To validate this, on reasoning-intensive tasks across diverse LLM architectures, we systematically evaluate three prompting strategies for in-context learning: (1) Zero-shot prompting; (2) Integrated prompting, where LLMs create and solve problems within a single, unified prompt; and (3) Decoupled prompting, where self-generated examples are reused as in-context examples, but the context of their creation itself is excluded. We conduct experiments across five widely used model architectures, demonstrating that Integrated prompting consistently outperforms both Zero-shot and Decoupled prompting. In contrast, Decoupled prompting offers only marginal gains over Zero-shot. Further, for a more in-depth analysis, we conduct an attention analysis and observe significant differences in attention patterns between Integrated and Decoupled prompting. These findings suggest that the advantage of self-generation prompting comes from the process of problem creation, not the examples themselves, providing valuable insights for designing more effective prompting strategies.
63. 【2602.15862】Enhancing Action and Ingredient Modeling for Semantically Grounded Recipe Generation
链接:https://arxiv.org/abs/2602.15862
作者:Guoshan Liu,Bin Zhu,Yian Li,Jingjing Chen,Chong-Wah Ngo,Yu-Gang Jiang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Multimodal Large Language, high lexical scores, Language Models, Multimodal Large
备注:
点击查看摘要
Abstract:Recent advances in Multimodal Large Language Models (MLMMs) have enabled recipe generation from food images, yet outputs often contain semantically incorrect actions or ingredients despite high lexical scores (e.g., BLEU, ROUGE). To address this gap, we propose a semantically grounded framework that predicts and validates actions and ingredients as internal context for instruction generation. Our two-stage pipeline combines supervised fine-tuning (SFT) with reinforcement fine-tuning (RFT): SFT builds foundational accuracy using an Action-Reasoning dataset and ingredient corpus, while RFT employs frequency-aware rewards to improve long-tail action prediction and ingredient generalization. A Semantic Confidence Scoring and Rectification (SCSR) module further filters and corrects predictions. Experiments on Recipe1M show state-of-the-art performance and markedly improved semantic fidelity.
64. 【2602.15861】CAST: Achieving Stable LLM-based Text Analysis for Data Analytics
链接:https://arxiv.org/abs/2602.15861
作者:Jinxiang Xie,Zihao Li,Wei He,Rui Ding,Shi Han,Dongmei Zhang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:corpus-level theme extraction, Text analysis, tabular data relies, textbf, core operations
备注:
点击查看摘要
Abstract:Text analysis of tabular data relies on two core operations: \emph{summarization} for corpus-level theme extraction and \emph{tagging} for row-level labeling. A critical limitation of employing large language models (LLMs) for these tasks is their inability to meet the high standards of output stability demanded by data analytics. To address this challenge, we introduce \textbf{CAST} (\textbf{C}onsistency via \textbf{A}lgorithmic Prompting and \textbf{S}table \textbf{T}hinking), a framework that enhances output stability by constraining the model's latent reasoning path. CAST combines (i) Algorithmic Prompting to impose a procedural scaffold over valid reasoning transitions and (ii) Thinking-before-Speaking to enforce explicit intermediate commitments before final generation. To measure progress, we introduce \textbf{CAST-S} and \textbf{CAST-T}, stability metrics for bulleted summarization and tagging, and validate their alignment with human judgments. Experiments across publicly available benchmarks on multiple LLM backbones show that CAST consistently achieves the best stability among all baselines, improving Stability Score by up to 16.2\%, while maintaining or improving output quality.
65. 【2602.15860】Reranker Optimization via Geodesic Distances on k-NN Manifolds
链接:https://arxiv.org/abs/2602.15860
作者:Wen G. Gong
类目:Computation and Language (cs.CL)
关键词:Current neural reranking, large language models, requiring substantial computational, substantial computational resources, Current neural
备注: 10 pages, 3 tables
点击查看摘要
Abstract:Current neural reranking approaches for retrieval-augmented generation (RAG) rely on cross-encoders or large language models (LLMs), requiring substantial computational resources and exhibiting latencies of 3-5 seconds per query. We propose Maniscope, a geometric reranking method that computes geodesic distances on k-nearest neighbor (k-NN) manifolds constructed over retrieved document candidates. This approach combines global cosine similarity with local manifold geometry to capture semantic structure that flat Euclidean metrics miss. Evaluating on eight BEIR benchmark datasets (1,233 queries), Maniscope outperforms HNSW graph-based baseline on the three hardest datasets (NFCorpus: +7.0%, TREC-COVID: +1.6%, AorB: +2.8% NDCG@3) while being 3.2x faster (4.7 ms vs 14.8 ms average). Compared to cross-encoder rerankers, Maniscope achieves within 2% accuracy at 10-45x lower latency. On TREC-COVID, LLM-Reranker provides only +0.5% NDCG@3 improvement over Maniscope at 840x higher latency, positioning Maniscope as a practical alternative for real-time RAG deployment. The method requires O(N D + M^2 D + M k log k) complexity where M N , enabling sub-10 ms latency. We plan to release Maniscope as open-source software.
66. 【2602.15859】From Transcripts to AI Agents: Knowledge Extraction, RAG Integration, and Robust Evaluation of Conversational AI Assistants
链接:https://arxiv.org/abs/2602.15859
作者:Krittin Pachtrachai,Petmongkon Pornpichitsuwan,Wachiravit Modecrua,Touchapon Kraisingkorn
类目:Computation and Language (cs.CL)
关键词:Building reliable conversational, customer-facing industries remains, accurate human hand-off, Building reliable, industries remains challenging
备注: 9 pages, 2 figures, 1 table
点击查看摘要
Abstract:Building reliable conversational AI assistants for customer-facing industries remains challenging due to noisy conversational data, fragmented knowledge, and the requirement for accurate human hand-off - particularly in domains that depend heavily on real-time information. This paper presents an end-to-end framework for constructing and evaluating a conversational AI assistant directly from historical call transcripts. Incoming transcripts are first graded using a simplified adaptation of the PIPA framework, focusing on observation alignment and appropriate response behavior, and are filtered to retain only high-quality interactions exhibiting coherent flow and effective human agent responses. Structured knowledge is then extracted from curated transcripts using large language models (LLMs) and deployed as the sole grounding source in a Retrieval-Augmented Generation (RAG) pipeline. Assistant behavior is governed through systematic prompt tuning, progressing from monolithic prompts to lean, modular, and governed designs that ensure consistency, safety, and controllable execution. Evaluation is conducted using a transcript-grounded user simulator, enabling quantitative measurement of call coverage, factual accuracy, and human escalation behavior. Additional red teaming assesses robustness against prompt injection, out-of-scope, and out-of-context attacks. Experiments are conducted in the Real Estate and Specialist Recruitment domains, which are intentionally challenging and currently suboptimal for automation due to their reliance on real-time data. Despite these constraints, the assistant autonomously handles approximately 30 percents of calls, achieves near-perfect factual accuracy and rejection behavior, and demonstrates strong robustness under adversarial testing.
67. 【2602.15858】State Design Matters: How Representations Shape Dynamic Reasoning in Large Language Models
链接:https://arxiv.org/abs/2602.15858
作者:Annie Wong,Aske Plaat,Thomas Bäck,Niki van Stein,Anna V. Kononova
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:dynamic environments, inference time, tasks toward dynamic, success depends, ability to navigate
备注:
点击查看摘要
Abstract:As large language models (LLMs) move from static reasoning tasks toward dynamic environments, their success depends on the ability to navigate and respond to an environment that changes as they interact at inference time. An underexplored factor in these settings is the representation of the state. Holding model parameters fixed, we systematically vary three key aspects: (1) state granularity (long form versus summary), (2) structure (natural language versus symbolic), and (3) spatial grounding (text-only versus images or textual map encodings) across sequential decision-making benchmarks. We find that trajectory summarisation improves performance by reducing noise and stabilising long-horizon reasoning. Second, natural language representations are the most robust across models, whereas structured encodings help mainly for models with strong code or structured output priors, such as JSON schemas. Third, while image-inputs show some benefit, text-based spatial encodings prove most effective. This advantage stems not from the spatial information itself, but from the act of construction, which compels the model to perform the spatial reasoning that static input does not elicit. Overall, we demonstrate that design choices for representing state are a decisive factor in performance, distinct from the availability of information itself. We note, however, that even with improved representations, current LLMs and VLMs remain brittle over long horizons, particularly when they must synthesise information to manage multiple subtasks to reach a goal.
68. 【2602.15857】Multi-source Heterogeneous Public Opinion Analysis via Collaborative Reasoning and Adaptive Fusion: A Systematically Integrated Approach
链接:https://arxiv.org/abs/2602.15857
作者:Yi Liu
类目:Computation and Language (cs.CL)
关键词:presents significant challenges, significant challenges due, multiple heterogeneous sources, heterogeneous sources presents, sources presents significant
备注: 13 pages, 11 figures
点击查看摘要
Abstract:The analysis of public opinion from multiple heterogeneous sources presents significant challenges due to structural differences, semantic variations, and platform-specific biases. This paper introduces a novel Collaborative Reasoning and Adaptive Fusion (CRAF) framework that systematically integrates traditional feature-based methods with large language models (LLMs) through a structured multi-stage reasoning mechanism. Our approach features four key innovations: (1) a cross-platform collaborative attention module that aligns semantic representations while preserving source-specific characteristics, (2) a hierarchical adaptive fusion mechanism that dynamically weights features based on both data quality and task requirements, (3) a joint optimization strategy that simultaneously learns topic representations and sentiment distributions through shared latent spaces, and (4) a novel multimodal extraction capability that processes video content from platforms like Douyin and Kuaishou by integrating OCR, ASR, and visual sentiment analysis. Theoretical analysis demonstrates that CRAF achieves a tighter generalization bound with a reduction of O(sqrt(d log K / m)) compared to independent source modeling, where d is feature dimensionality, K is the number of sources, and m is sample size. Comprehensive experiments on three multi-platform datasets (Weibo-12, CrossPlatform-15, NewsForum-8) show that CRAF achieves an average topic clustering ARI of 0.76 (4.1% improvement over best baseline) and sentiment analysis F1-score of 0.84 (3.8% improvement). The framework exhibits strong cross-platform adaptability, reducing the labeled data requirement for new platforms by 75%.
69. 【2602.15856】Rethinking Soft Compression in Retrieval-Augmented Generation: A Query-Conditioned Selector Perspective
链接:https://arxiv.org/abs/2602.15856
作者:Yunhao Liu,Zian Jia,Xinyu Gao,Kanjun Xu,Yun Xiong
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:Large Language Models, effectively grounds Large, grounds Large Language, Web-related tasks, grounds Large
备注: Accepted by WWW 2026
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) effectively grounds Large Language Models (LLMs) with external knowledge and is widely applied to Web-related tasks. However, its scalability is hindered by excessive context length and redundant retrievals. Recent research on soft context compression aims to address this by encoding long documents into compact embeddings, yet they often underperform non-compressed RAG due to their reliance on auto-encoder-like full-compression that forces the encoder to compress all document information regardless of relevance to the input query. In this work, we conduct an analysis on this paradigm and reveal two fundamental limitations: (I) Infeasibility, full-compression conflicts with the LLM's downstream generation behavior; and (II) Non-necessity: full-compression is unnecessary and dilutes task-relevant information density. Motivated by these insights, we introduce SeleCom, a selector-based soft compression framework for RAG that redefines the encoder's role as query-conditioned information selector. The selector is decoder-only and is trained with a massive, diverse and difficulty-graded synthetic QA dataset with curriculum learning. Extensive experiments show that SeleCom significantly outperforms existing soft compression approaches and achieves competitive or superior performance to non-compression baselines, while reducing computation and latency by 33.8%~84.6%.
Comments:
Accepted by WWW 2026
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Cite as:
arXiv:2602.15856 [cs.CL]
(or
arXiv:2602.15856v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2602.15856
Focus to learn more
arXiv-issued DOI via DataCite</p>
70. 【2602.15854】Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization
链接:https://arxiv.org/abs/2602.15854
作者:Jingyi Xu,Xingyu Ren,Zhiqiang You,Yumeng Zhang,Zhoupeng Shou
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Customer Service Agent, Large language models, existing training methods, Large language, Expert Agent
备注:
点击查看摘要
Abstract:Large language models show potential in task-oriented dialogue systems, yet existing training methods often rely on token-level likelihood or preference optimization, which poorly align with long-horizon task success. To address this, we propose Goal-Oriented Preference Optimization (GOPO), a hierarchical reinforcement learning framework that decouples strategy planning from response generation via an Expert Agent and a Customer Service Agent. The Expert Agent optimizes multi-turn goal preferences at the dialogue-trajectory level, while the Customer Service Agent generates responses strictly aligned with the selected strategy. We evaluate GOPO on public benchmarks and e-commerce customer service datasets, and introduce Task-focused Sequential Engagement (TSE), a sequence-level metric derived from real e-commerce interaction data. On the Mgshop dataset, GOPO improves TSE by 7.7% and 10.3% over PPO and Memento, with consistent gains in sequence-level reward and generation quality. Furthermore, a 14B model trained with GOPO achieves 2.7% and 1.5% higher TSE than Qwen-235B and GPT-5.2, respectively. Ablation studies confirm the Expert Agent's critical role in long-horizon optimization. GOPO demonstrates consistent improvements across other datasets as well. This work establishes a new paradigm for task-oriented dialogue systems in commercial scenarios, with code and datasets to be made public.
71. 【2602.15853】A Lightweight Explainable Guardrail for Prompt Safety
链接:https://arxiv.org/abs/2602.15853
作者:Md Asiful Islam,Mihai Surdeanu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:lightweight explainable guardrail, explainable guardrail, LEG, propose a lightweight, lightweight explainable
备注:
点击查看摘要
Abstract:We propose a lightweight explainable guardrail (LEG) method for the classification of unsafe prompts. LEG uses a multi-task learning architecture to jointly learn a prompt classifier and an explanation classifier, where the latter labels prompt words that explain the safe/unsafe overall decision. LEG is trained using synthetic data for explainability, which is generated using a novel strategy that counteracts the confirmation biases of LLMs. Lastly, LEG's training process uses a novel loss that captures global explanation signals and combines cross-entropy and focal losses with uncertainty-based weighting. LEG obtains equivalent or better performance than the state-of-the-art for both prompt classification and explainability, both in-domain and out-of-domain on three datasets, despite the fact that its model size is considerably smaller than current approaches. If accepted, we will release all models and the annotated dataset publicly.
72. 【2602.15852】Building Safe and Deployable Clinical Natural Language Processing under Temporal Leakage Constraints
链接:https://arxiv.org/abs/2602.15852
作者:Ha Na Cho,Sairam Sutari,Alexander Lopez,Hansen Bow,Kai Zheng
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:natural language processing, Clinical natural language, leveraging narrative clinical, hospital discharge planning, supporting hospital discharge
备注:
点击查看摘要
Abstract:Clinical natural language processing (NLP) models have shown promise for supporting hospital discharge planning by leveraging narrative clinical documentation. However, note-based models are particularly vulnerable to temporal and lexical leakage, where documentation artifacts encode future clinical decisions and inflate apparent predictive performance. Such behavior poses substantial risks for real-world deployment, where overconfident or temporally invalid predictions can disrupt clinical workflows and compromise patient safety. This study focuses on system-level design choices required to build safe and deployable clinical NLP under temporal leakage constraints. We present a lightweight auditing pipeline that integrates interpretability into the model development process to identify and suppress leakage-prone signals prior to final training. Using next-day discharge prediction after elective spine surgery as a case study, we evaluate how auditing affects predictive behavior, calibration, and safety-relevant trade-offs. Results show that audited models exhibit more conservative and better-calibrated probability estimates, with reduced reliance on discharge-related lexical cues. These findings emphasize that deployment-ready clinical NLP systems should prioritize temporal validity, calibration, and behavioral robustness over optimistic performance.
73. 【2602.15851】Narrative Theory-Driven LLM Methods for Automatic Story Generation and Understanding: A Survey
链接:https://arxiv.org/abs/2602.15851
作者:David Y. Liu,Aditya Joshi,Paul Dawson
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:deliver promising use-cases, automatic story generation, deliver promising, large language models, narrative
备注: 31 pages
点击查看摘要
Abstract:Applications of narrative theories using large language models (LLMs) deliver promising use-cases in automatic story generation and understanding tasks. Our survey examines how natural language processing (NLP) research engages with fields of narrative studies, and proposes a taxonomy for ongoing efforts that reflect established distinctions in narratology. We discover patterns in the following: narrative datasets and tasks, narrative theories and NLP pipeline and methodological trends in prompting and fine-tuning. We highlight how LLMs enable easy connections of NLP pipelines with abstract narrative concepts and opportunities for interdisciplinary collaboration. Challenges remain in attempts to work towards any unified definition or benchmark of narrative related tasks, making model comparison difficult. For future directions, instead of the pursuit of a single, generalised benchmark for 'narrative quality', we believe that progress benefits more from efforts that focus on the following: defining and improving theory-based metrics for individual narrative attributes to incrementally improve model performance; conducting large-scale, theory-driven literary/social/cultural analysis; and creating experiments where outputs can be used to validate or refine narrative theories. This work provides a contextual foundation for more systematic and theoretically informed narrative research in NLP by providing an overview to ongoing research efforts and the broader narrative studies landscape.
74. 【2602.15850】Large Language Models for Assisting American College Applications
链接:https://arxiv.org/abs/2602.15850
作者:Zhengliang Liu,Weihang You,Peng Shu,Junhao Chen,Yi Pan,Hanqi Jiang,Yiwei Li,Zhaojun Ding,Chao Cao,Xinliang Li,Yifan Zhou,Ruidong Zhang,Shaochen Xu,Wei Ruan,Huaqin Zhao,Dajiang Zhu,Tianming Liu
类目:Computation and Language (cs.CL)
关键词:American college applications, demand cross-referencing multiple, fragmented admissions policies, American college, navigate fragmented admissions
备注:
点击查看摘要
Abstract:American college applications require students to navigate fragmented admissions policies, repetitive and conditional forms, and ambiguous questions that often demand cross-referencing multiple sources. We present EZCollegeApp, a large language model (LLM)-powered system that assists high-school students by structuring application forms, grounding suggested answers in authoritative admissions documents, and maintaining full human control over final responses. The system introduces a mapping-first paradigm that separates form understanding from answer generation, enabling consistent reasoning across heterogeneous application portals. EZCollegeApp integrates document ingestion from official admissions websites, retrieval-augmented question answering, and a human-in-the-loop chatbot interface that presents suggestions alongside application fields without automated submission. We describe the system architecture, data pipeline, internal representations, security and privacy measures, and evaluation through automated testing and human quality assessment. Our source code is released on GitHub (this https URL) to facilitate the broader impact of this work.
75. 【2602.15849】Preference Optimization for Review Question Generation Improves Writing Quality
链接:https://arxiv.org/abs/2602.15849
作者:Karun Sharma,Vidushee Vats,Shengzhi Li,Yuxiang Wang,Zhongtian Sun,Prayag Tiwari
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:generate surface-level queries, Peer review relies, existing LLM-based approaches, Sampling Policy Optimization, relies on substantive
备注: 24 Pages, v1
点击查看摘要
Abstract:Peer review relies on substantive, evidence-based questions, yet existing LLM-based approaches often generate surface-level queries, drawing over 50\% of their question tokens from a paper's first page. To bridge this gap, we develop IntelliReward, a novel reward model built from a frozen autoregressive LLM with trainable multi-head transformers over the final 50 token states, which outperforms API-based SFT baselines in predicting expert-level human preferences. By applying Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) with IntelliReward, we train IntelliAsk, a question-generation model aligned with human standards of effort, evidence, and grounding. We find consistent improvements on reasoning and writing benchmarks, suggesting reviewer-question quality correlates with broader capabilities. Compared to the Qwen3-32B base model, IntelliAsk shows measurable gains across diverse benchmarks, specifically improving performance on reasoning tasks like MuSR (68.3 vs 64.7 Acc) and complex writing evaluations such as WritingBench (8.31 vs 8.07). We release our implementation, expert preference annotations, and the IntelliReward model to provide an automatic evaluation benchmark for grounding, effort, and evidence in LLM-generated review questions.
76. 【2602.15848】Can LLMs Assess Personality? Validating Conversational AI for Trait Profiling
链接:https://arxiv.org/abs/2602.15848
作者:Andrius Matšenas,Anet Lello,Tõnis Lees,Hans Peep,Kim Lilii Tamm
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, validates Large Language, study validates Large, Language Models, Large Language
备注: 13 pages, 7 figures, 4 tables, 2 appendices
点击查看摘要
Abstract:This study validates Large Language Models (LLMs) as a dynamic alternative to questionnaire-based personality assessment. Using a within-subjects experiment (N=33), we compared Big Five personality scores derived from guided LLM conversations against the gold-standard IPIP-50 questionnaire, while also measuring user-perceived accuracy. Results indicate moderate convergent validity (r=0.38-0.58), with Conscientiousness, Openness, and Neuroticism scores statistically equivalent between methods. Agreeableness and Extraversion showed significant differences, suggesting trait-specific calibration is needed. Notably, participants rated LLM-generated profiles as equally accurate as traditional questionnaire results. These findings suggest conversational AI offers a promising new approach to traditional psychometrics.
77. 【2602.15847】Do Personality Traits Interfere? Geometric Limitations of Steering in Large Language Models
链接:https://arxiv.org/abs/2602.15847
作者:Pranav Bhandari,Usman Naseem,Mehwish Nasim
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:large language models, injecting trait-specific steering, personality steering directions, commonly relies, implicitly assuming
备注:
点击查看摘要
Abstract:Personality steering in large language models (LLMs) commonly relies on injecting trait-specific steering vectors, implicitly assuming that personality traits can be controlled independently. In this work, we examine whether this assumption holds by analysing the geometric relationships between Big Five personality steering directions. We study steering vectors extracted from two model families (LLaMA-3-8B and Mistral-8B) and apply a range of geometric conditioning schemes, from unconstrained directions to soft and hard orthonormalisation. Our results show that personality steering directions exhibit substantial geometric dependence: steering one trait consistently induces changes in others, even when linear overlap is explicitly removed. While hard orthonormalisation enforces geometric independence, it does not eliminate cross-trait behavioural effects and can reduce steering strength. These findings suggest that personality traits in LLMs occupy a slightly coupled subspace, limiting fully independent trait control.
78. 【2602.15846】Gated Tree Cross-attention for Checkpoint-Compatible Syntax Injection in Decoder-Only LLMs
链接:https://arxiv.org/abs/2602.15846
作者:Xinyu Gao,Shaonan Wang,Nai Ding
类目:Computation and Language (cs.CL)
关键词:minor grammatical perturbations, large language models, language models achieve, models achieve strong, achieve strong broad
备注:
点击查看摘要
Abstract:Decoder-only large language models achieve strong broad performance but are brittle to minor grammatical perturbations, undermining reliability for downstream reasoning. However, directly injecting explicit syntactic structure into an existing checkpoint can interfere with its pretrained competence. We introduce a checkpoint-compatible gated tree cross-attention (GTCA) branch that reads precomputed constituency chunk memory while leaving backbone architecture unchanged. Our design uses a token update mask and staged training to control the scope and timing of structural updates. Across benchmarks and Transformer backbones, GTCA strengthens syntactic robustness beyond continued-training baselines without compromising Multiple-Choice QA performance or commonsense reasoning, providing a practical checkpoint-compatible route to more syntax-robust decoder-only LLMs.
79. 【2602.15845】KD4MT: A Survey of Knowledge Distillation for Machine Translation
链接:https://arxiv.org/abs/2602.15845
作者:Ona de Gibert,Joseph Attieh,Timothee Mickus,Yves Scherrer,Jörg Tiedemann
类目:Computation and Language (cs.CL)
关键词:address challenges related, models in NLP, Knowledge Distillation, area has gained, gained a lot
备注: Pre-print under Review submitted to Computational Linguistics Journal
点击查看摘要
Abstract:Knowledge Distillation (KD) as a research area has gained a lot of traction in recent years as a compression tool to address challenges related to ever-larger models in NLP. Remarkably, Machine Translation (MT) offers a much more nuanced take on this narrative: in MT, KD also functions as a general-purpose knowledge transfer mechanism that shapes supervision and translation quality as well as efficiency. This survey synthesizes KD for MT (KD4MT) across 105 papers (through October 1, 2025). We begin by introducing both MT and KD for non-experts, followed by an overview of the standard KD approaches relevant to MT applications. Subsequently, we categorize advances in the KD4MT literature based on (i) their methodological contributions and (ii) their practical applications. Our qualitative and quantitative analyses identify common trends in the field and highlight key research gaps as well as the absence of unified evaluation practice for KD methods in MT. We further provide practical guidelines for selecting a KD method in concrete settings and highlight potential risks associated with the application of KD to MT such as increased hallucination and bias amplification. Finally, we discuss the role of LLMs in re-shaping the KD4MT field. To support further research, we complement our survey with a publicly available database summarizing the main characteristics of the surveyed KD methods and a glossary of key terms.
Comments:
Pre-print under Review submitted to Computational Linguistics Journal
Subjects:
Computation and Language (cs.CL)
Cite as:
arXiv:2602.15845 [cs.CL]
(or
arXiv:2602.15845v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2602.15845
Focus to learn more
arXiv-issued DOI via DataCite</p>
80. 【2602.15844】Language Model Representations for Efficient Few-Shot Tabular Classification
链接:https://arxiv.org/abs/2602.15844
作者:Inwon Kang,Parikshit Ram,Yi Zhou,Horst Samulowitz,Oshani Seneviratne
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:rich source, textbf, existing LLM infrastructure, LLM infrastructure, structured data
备注: Accepted to WWW'26
点击查看摘要
Abstract:The Web is a rich source of structured data in the form of tables, from product catalogs and knowledge bases to scientific datasets. However, the heterogeneity of the structure and semantics of these tables makes it challenging to build a unified method that can effectively leverage the information they contain. Meanwhile, Large language models (LLMs) are becoming an increasingly integral component of web infrastructure for tasks like semantic search. This raises a crucial question: can we leverage these already-deployed LLMs to classify structured data in web-native tables (e.g., product catalogs, knowledge base exports, scientific data portals), avoiding the need for specialized models or extensive retraining? This work investigates a lightweight paradigm, $\textbf{Ta}$ble $\textbf{R}$epresentation with $\textbf{L}$anguage Model~($\textbf{TaRL}$), for few-shot tabular classification that directly utilizes semantic embeddings of individual table rows. We first show that naive application of these embeddings underperforms compared to specialized tabular models. We then demonstrate that their potentials can be unlocked with two key techniques: removing the common component from all embeddings and calibrating the softmax temperature. We show that a simple meta-learner, trained on handcrafted features, can learn to predict an appropriate temperature. This approach achieves performance comparable to state-of-the-art models in low-data regimes ($k \leq 32$) of semantically-rich tables. Our findings demonstrate the viability of reusing existing LLM infrastructure for efficient semantics-driven pathway to reuse existing LLM infrastructure for Web table understanding.
81. 【2602.15843】he Perplexity Paradox: Why Code Compresses Better Than Math in LLM Prompts
链接:https://arxiv.org/abs/2602.15843
作者:Warren Johnson
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Compress or Route, Compress, Route, perplexity, Abstract
备注: 19 pages, 5 figures, 4 tables. Second paper in TAAC research series. Code and data at [this https URL](https://github.com/micoverde/taac-llm-compression)
点击查看摘要
Abstract:In "Compress or Route?" (Johnson, 2026), we found that code generation tolerates aggressive prompt compression (r = 0.6) while chain-of-thought reasoning degrades gradually. That study was limited to HumanEval (164 problems), left the "perplexity paradox" mechanism unvalidated, and provided no adaptive algorithm. This paper addresses all three gaps. First, we validate across six code benchmarks (HumanEval, MBPP, HumanEval+, MultiPL-E) and four reasoning benchmarks (GSM8K, MATH, ARC-Challenge, MMLU-STEM), confirming the compression threshold generalizes across languages and difficulties. Second, we conduct the first per-token perplexity analysis (n=723 tokens), revealing a "perplexity paradox": code syntax tokens are preserved (high perplexity) while numerical values in math problems are pruned despite being task-critical (low perplexity). Signature injection recovers +34 percentage points in pass rate (5.3% to 39.3%; Cohen's h=0.890). Third, we propose TAAC (Task-Aware Adaptive Compression), achieving 22% cost reduction with 96% quality preservation, outperforming fixed-ratio compression by 7%. MBPP validation (n=1,800 trials) confirms systematic variation: 3.6% at r=0.3 to 54.6% at r=1.0.
82. 【2602.15842】Memes-as-Replies: Can Models Select Humorous Manga Panel Responses?
链接:https://arxiv.org/abs/2602.15842
作者:Ryosuke Kohita,Seiichiro Yoshioka
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:modern web communication, Meme Reply Selection, Manga Meme Reply, Meme Reply Benchmark, Meme Reply
备注:
点击查看摘要
Abstract:Memes are a popular element of modern web communication, used not only as static artifacts but also as interactive replies within conversations. While computational research has focused on analyzing the intrinsic properties of memes, the dynamic and contextual use of memes to create humor remains an understudied area of web science. To address this gap, we introduce the Meme Reply Selection task and present MaMe-Re (Manga Meme Reply Benchmark), a benchmark of 100,000 human-annotated pairs (500,000 total annotations from 2,325 unique annotators) consisting of openly licensed Japanese manga panels and social media posts. Our analysis reveals three key insights: (1) large language models (LLMs) show preliminary evidence of capturing complex social cues such as exaggeration, moving beyond surface-level semantic matching; (2) the inclusion of visual information does not improve performance, revealing a gap between understanding visual content and effectively using it for contextual humor; (3) while LLMs can match human judgments in controlled settings, they struggle to distinguish subtle differences in wit among semantically similar candidates. These findings suggest that selecting contextually humorous replies remains an open challenge for current models.
83. 【2602.15835】A Methodology for Identifying Evaluation Items for Practical Dialogue Systems Based on Business-Dialogue System Alignment Models
链接:https://arxiv.org/abs/2602.15835
作者:Mikio Nakano,Hironori Takeuchi,Kazunori Komatani
类目:Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Software Engineering (cs.SE)
关键词:evaluation items, dialogue systems, practical dialogue systems, evaluation, dialogue
备注: This paper has been accepted for presentation at International Workshop on Spoken Dialogue Systems Technology 2025 (IWSDS 2025)
点击查看摘要
Abstract:This paper proposes a methodology for identifying evaluation items for practical dialogue systems. Traditionally, user satisfaction and user experiences have been the primary metrics for evaluating dialogue systems. However, there are various other evaluation items to consider when developing and operating practical dialogue systems, and such evaluation items are expected to lead to new research topics. So far, there has been no methodology for identifying these evaluation items. We propose identifying evaluation items based on business-dialogue system alignment models, which are applications of business-IT alignment models used in the development and operation of practical IT systems. We also present a generic model that facilitates the construction of a business-dialogue system alignment model for each dialogue system.
84. 【2602.16273】Lyapunov Spectral Analysis of Speech Embedding Trajectories in Psychosis
链接:https://arxiv.org/abs/2602.16273
作者:Jelena Vasic,Branislav Andjelic,Ana Mancic,Dusica Filipovic Djurdjevic,Ljiljana Mihic,Aleksandar Kovacevic,Nadja P. Maric,Aleksandra Maluckov
类目:Adaptation and Self-Organizing Systems (nlin.AO); Computation and Language (cs.CL)
关键词:structured clinical interviews, treating language production, high-dimensional dynamical process, structured clinical, clinical interviews
备注: 14 pages, 3 figures
点击查看摘要
Abstract:We analyze speech embeddings from structured clinical interviews of psychotic patients and healthy controls by treating language production as a high-dimensional dynamical process. Lyapunov exponent (LE) spectra are computed from word-level and answer-level embeddings generated by two distinct large language models, allowing us to assess the stability of the conclusions with respect to different embedding presentations. Word-level embeddings exhibit uniformly contracting dynamics with no positive LE, while answer-level embeddings, in spite of the overall contraction, display a number of positive LEs and higher-dimensional attractors. The resulting LE spectra robustly separate psychotic from healthy speech, while differentiation within the psychotic group is not statistically significant overall, despite a tendency of the most severe cases to occupy distinct dynamical regimes. These findings indicate that nonlinear dynamical invariants of speech embeddings provide a physics-inspired probe of disordered cognition whose conclusions remain stable across embedding models.
85. 【2602.15889】Evidence for Daily and Weekly Periodic Variability in GPT-4o Performance
链接:https://arxiv.org/abs/2602.15889
作者:Paul Tschisgale,Peter Wulff
类目:Applications (stat.AP); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Physics Education (physics.ed-ph)
关键词:Large language models, Large language, objects of investigation, Large, language models
备注:
点击查看摘要
Abstract:Large language models (LLMs) are increasingly used in research both as tools and as objects of investigation. Much of this work implicitly assumes that LLM performance under fixed conditions (identical model snapshot, hyperparameters, and prompt) is time-invariant. If average output quality changes systematically over time, this assumption is violated, threatening the reliability, validity, and reproducibility of findings. To empirically examine this assumption, we conducted a longitudinal study on the temporal variability of GPT-4o's average performance. Using a fixed model snapshot, fixed hyperparameters, and identical prompting, GPT-4o was queried via the API to solve the same multiple-choice physics task every three hours for approximately three months. Ten independent responses were generated at each time point and their scores were averaged. Spectral (Fourier) analysis of the resulting time series revealed notable periodic variability in average model performance, accounting for approximately 20% of the total variance. In particular, the observed periodic patterns are well explained by the interaction of a daily and a weekly rhythm. These findings indicate that, even under controlled conditions, LLM performance may vary periodically over time, calling into question the assumption of time invariance. Implications for ensuring validity and replicability of research that uses or investigates LLMs are discussed.
信息检索
1. 【2602.16673】Neighborhood Stability as a Measure of Nearest Neighbor Searchability
链接:https://arxiv.org/abs/2602.16673
作者:Thomas Vecchiato,Sebastian Bruch
类目:Machine Learning (cs.LG); Information Retrieval (cs.IR)
关键词:Clustering-based Approximate Nearest, Nearest Neighbor Search, Approximate Nearest Neighbor, Clustering-based Approximate, Neighbor Search
备注:
点击查看摘要
Abstract:Clustering-based Approximate Nearest Neighbor Search (ANNS) organizes a set of points into partitions, and searches only a few of them to find the nearest neighbors of a query. Despite its popularity, there are virtually no analytical tools to determine the suitability of clustering-based ANNS for a given dataset -- what we call "searchability." To address that gap, we present two measures for flat clusterings of high-dimensional points in Euclidean space. First is Clustering-Neighborhood Stability Measure (clustering-NSM), an internal measure of clustering quality -- a function of a clustering of a dataset -- that we show to be predictive of ANNS accuracy. The second, Point-Neighborhood Stability Measure (point-NSM), is a measure of clusterability -- a function of the dataset itself -- that is predictive of clustering-NSM. The two together allow us to determine whether a dataset is searchable by clustering-based ANNS given only the data points. Importantly, both are functions of nearest neighbor relationships between points, not distances, making them applicable to various distance functions including inner product.
2. 【2602.16609】ColBERT-Zero: To Pre-train Or Not To Pre-train ColBERT models
链接:https://arxiv.org/abs/2602.16609
作者:Antoine Chaffin,Luca Arnaboldi,Amélie Chatelain,Florent Krzakala
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:small Knowledge Distillation, Knowledge Distillation, strong single-vector models, small Knowledge, multi-vector models
备注: 9 pages, 5 tables, 2 figures
点击查看摘要
Abstract:Current state-of-the-art multi-vector models are obtained through a small Knowledge Distillation (KD) training step on top of strong single-vector models, leveraging the large-scale pre-training of these models. In this paper, we study the pre-training of multi-vector models and show that large-scale multi-vector pre-training yields much stronger multi-vector models. Notably, a fully ColBERT-pre-trained model, ColBERT-Zero, trained only on public data, outperforms GTE-ModernColBERT as well as its base model, GTE-ModernBERT, which leverages closed and much stronger data, setting new state-of-the-art for model this size. We also find that, although performing only a small KD step is not enough to achieve results close to full pre-training, adding a supervised step beforehand allows to achieve much closer performance while skipping the most costly unsupervised phase. Finally, we find that aligning the fine-tuning and pre-training setups is crucial when repurposing existing models. To enable exploration of our results, we release various checkpoints as well as code used to train them.
3. 【2602.16587】Why Thinking Hurts? Diagnosing and Rectifying the Reasoning Shift in Foundation Recommender Models
链接:https://arxiv.org/abs/2602.16587
作者:Luankang Zhang,Yonghao Huang,Hang Lv,Mingjia Yin,Liangyue Li,Zulong Chen,Hao Wang,Enhong Chen
类目:Information Retrieval (cs.IR)
关键词:degrades recommendation performance, Semantic ID-based recommendation, paradoxically degrades recommendation, ID-based recommendation foundation, recommendation performance
备注:
点击查看摘要
Abstract:Integrating Chain-of-Thought (CoT) reasoning into Semantic ID-based recommendation foundation models (such as OpenOneRec) often paradoxically degrades recommendation performance. We identify the root cause as textual inertia from the General Subspace, where verbose reasoning dominates inference and causes the model to neglect critical Semantic ID. To address this, we propose a training-free Inference-Time Subspace Alignment framework. By compressing reasoning chains and applying bias-subtracted contrastive decoding, our approach mitigates ungrounded textual drift. Experiments show this effectively calibrates inference, allowing foundation models to leverage reasoning without sacrificing ID-grounded accuracy.
4. 【2602.16541】From Latent to Observable Position-Based Click Models in Carousel Interfaces
链接:https://arxiv.org/abs/2602.16541
作者:Santiago de Leon-Martinez,Robert Moro,Branislav Kveton,Maria Bielikova
类目:Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
关键词:single ranked-list interfaces, models, Click models, central component, component of learning
备注:
点击查看摘要
Abstract:Click models are a central component of learning and evaluation in recommender systems, yet most existing models are designed for single ranked-list interfaces. In contrast, modern recommender platforms increasingly use complex interfaces such as carousels, which consist of multiple swipeable lists that enable complex user browsing behaviors. In this paper, we study position-based click models in carousel interfaces and examine optimization methods, model structure, and alignment with user behavior. We propose three novel position-based models tailored to carousels, including the first position-based model without latent variables that incorporates observed examination signals derived from eye tracking data, called the Observed Examination Position-Based Model (OEPBM). We develop a general implementation of these carousel click models, supporting multiple optimization techniques and conduct experiments comparing gradient-based methods with classical approaches, namely expectation-maximization and maximum likelihood estimation. Our results show that gradient-based optimization consistently achieve better click likelihoods. Among the evaluated models, the OEPBM achieves the strongest performance in click prediction and produces examination patterns that most closely align to user behavior. However, we also demonstrate that strong click fit does not imply realistic modeling of user examination and browsing patterns. This reveals a fundamental limitation of click-only models in complex interfaces and the need for incorporating additional behavioral signals when designing click models for carousel-based recommender systems.
Subjects:
Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)
Cite as:
arXiv:2602.16541 [cs.IR]
(or
arXiv:2602.16541v1 [cs.IR] for this version)
https://doi.org/10.48550/arXiv.2602.16541
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
5. 【2602.16375】Variable-Length Semantic IDs for Recommender Systems
链接:https://arxiv.org/abs/2602.16375
作者:Kirill Khrylchenko
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:modeling user behavior, Generative models, generative models difficult, integrating large language, training generative models
备注:
点击查看摘要
Abstract:Generative models are increasingly used in recommender systems, both for modeling user behavior as event sequences and for integrating large language models into recommendation pipelines. A key challenge in this setting is the extremely large cardinality of item spaces, which makes training generative models difficult and introduces a vocabulary gap between natural language and item identifiers. Semantic identifiers (semantic IDs), which represent items as sequences of low-cardinality tokens, have recently emerged as an effective solution to this problem. However, existing approaches generate semantic identifiers of fixed length, assigning the same description length to all items. This is inefficient, misaligned with natural language, and ignores the highly skewed frequency structure of real-world catalogs, where popular items and rare long-tail items exhibit fundamentally different information requirements. In parallel, the emergent communication literature studies how agents develop discrete communication protocols, often producing variable-length messages in which frequent concepts receive shorter descriptions. Despite the conceptual similarity, these ideas have not been systematically adopted in recommender systems. In this work, we bridge recommender systems and emergent communication by introducing variable-length semantic identifiers for recommendation. We propose a discrete variational autoencoder with Gumbel-Softmax reparameterization that learns item representations of adaptive length under a principled probabilistic framework, avoiding the instability of REINFORCE-based training and the fixed-length constraints of prior semantic ID methods.
Subjects:
Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:
arXiv:2602.16375 [cs.IR]
(or
arXiv:2602.16375v1 [cs.IR] for this version)
https://doi.org/10.48550/arXiv.2602.16375
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
6. 【2602.16315】he Diversity Paradox revisited: Systemic Effects of Feedback Loops in Recommender Systems
链接:https://arxiv.org/abs/2602.16315
作者:Gabriele Barlacchi,Margherita Lalli,Emanuele Ferragina,Fosca Giannotti,Dino Pedreschi,Luca Pappalardo
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:algorithmic recommendations coevolve, shape individual choices, Recommender systems shape, user behavior, behavior and algorithmic
备注:
点击查看摘要
Abstract:Recommender systems shape individual choices through feedback loops in which user behavior and algorithmic recommendations coevolve over time. The systemic effects of these loops remain poorly understood, in part due to unrealistic assumptions in existing simulation studies. We propose a feedback-loop model that captures implicit feedback, periodic retraining, probabilistic adoption of recommendations, and heterogeneous recommender systems. We apply the framework on online retail and music streaming data and analyze systemic effects of the feedback loop. We find that increasing recommender adoption may lead to a progressive diversification of individual consumption, while collective demand is redistributed in model- and domain-dependent ways, often amplifying popularity concentration. Temporal analyses further reveal that apparent increases in individual diversity observed in static evaluations are illusory: when adoption is fixed and time unfolds, individual diversity consistently decreases across all models. Our results highlight the need to move beyond static evaluations and explicitly account for feedback-loop dynamics when designing recommender systems.
7. 【2602.16299】MICE: Minimal Interaction Cross-Encoders for efficient Re-ranking
链接:https://arxiv.org/abs/2602.16299
作者:Mathias Vast,Victor Morand,Basile van Cooten,Laure Soulier,Josiane Mothe,Benjamin Piwowarski
类目:Information Retrieval (cs.IR)
关键词:high inference cost, ranking effectiveness, information retrieval, Cross-encoders deliver, Cross-encoders
备注: 9 pages, 5 figures
点击查看摘要
Abstract:Cross-encoders deliver state-of-the-art ranking effectiveness in information retrieval, but have a high inference cost. This prevents them from being used as first-stage rankers, but also incurs a cost when re-ranking documents. Prior work has addressed this bottleneck from two largely separate directions: accelerating cross-encoder inference by sparsifying the attention process or improving first-stage retrieval effectiveness using more complex models, e.g. late-interaction ones. In this work, we propose to bridge these two approaches, based on an in-depth understanding of the internal mechanisms of cross-encoders. Starting from cross-encoders, we show that it is possible to derive a new late-interaction-like architecture by carefully removing detrimental or unnecessary interactions. We name this architecture MICE (Minimal Interaction Cross-Encoders). We extensively evaluate MICE across both in-domain (ID) and out-of-domain (OOD) datasets. MICE decreases fourfold the inference latency compared to standard cross-encoders, matching late-interaction models like ColBERT while retaining most of cross-encoder ID effectiveness and demonstrating superior generalization abilities in OOD.
8. 【2602.16136】Retrieval Collapses When AI Pollutes the Web
链接:https://arxiv.org/abs/2602.16136
作者:Hongyeon Yu,Dongchan Kim,Young-Bum Kim
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Language Models, Large Language, Retrieval-Augmented Generation, Web presents
备注: 4 pages, Proceedings of The Web Conference 2026 (WWW '26)
点击查看摘要
Abstract:The rapid proliferation of AI-generated content on the Web presents a structural risk to information retrieval, as search engines and Retrieval-Augmented Generation (RAG) systems increasingly consume evidence produced by the Large Language Models (LLMs). We characterize this ecosystem-level failure mode as Retrieval Collapse, a two-stage process where (1) AI-generated content dominates search results, eroding source diversity, and (2) low-quality or adversarial content infiltrates the retrieval pipeline. We analyzed this dynamic through controlled experiments involving both high-quality SEO-style content and adversarially crafted content. In the SEO scenario, a 67\% pool contamination led to over 80\% exposure contamination, creating a homogenized yet deceptively healthy state where answer accuracy remains stable despite the reliance on synthetic sources. Conversely, under adversarial contamination, baselines like BM25 exposed $\sim$19\% of harmful content, whereas LLM-based rankers demonstrated stronger suppression capabilities. These findings highlight the risk of retrieval pipelines quietly shifting toward synthetic evidence and the need for retrieval-aware strategies to prevent a self-reinforcing cycle of quality decline in Web-grounded systems.
9. 【2602.16124】Rethinking ANN-based Retrieval: Multifaceted Learnable Index for Large-scale Recommendation System
链接:https://arxiv.org/abs/2602.16124
作者:Jiang Zhang,Yubo Wang,Wei Chang,Lu Han,Xingying Cheng,Feng Zhang,Min Li,Songhao Jiang,Wei Zheng,Harry Tran,Zhen Wang,Lei Chen,Yueming Wang,Benyu Zhang,Xiangjun Fan,Bi Xue,Qifan Wang
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Approximate nearest neighbor, Approximate nearest, ANN search, large-scale recommendation systems, nearest neighbor
备注:
点击查看摘要
Abstract:Approximate nearest neighbor (ANN) search is widely used in the retrieval stage of large-scale recommendation systems. In this stage, candidate items are indexed using their learned embedding vectors, and ANN search is executed for each user (or item) query to retrieve a set of relevant items. However, ANN-based retrieval has two key limitations. First, item embeddings and their indices are typically learned in separate stages: indexing is often performed offline after embeddings are trained, which can yield suboptimal retrieval quality-especially for newly created items. Second, although ANN offers sublinear query time, it must still be run for every request, incurring substantial computation cost at industry scale. In this paper, we propose MultiFaceted Learnable Index (MFLI), a scalable, real-time retrieval paradigm that learns multifaceted item embeddings and indices within a unified framework and eliminates ANN search at serving time. Specifically, we construct a multifaceted hierarchical codebook via residual quantization of item embeddings and co-train the codebook with the embeddings. We further introduce an efficient multifaceted indexing structure and mechanisms that support real-time updates. At serving time, the learned hierarchical indices are used directly to identify relevant items, avoiding ANN search altogether. Extensive experiments on real-world data with billions of users show that MFLI improves recall on engagement tasks by up to 11.8\%, cold-content delivery by up to 57.29\%, and semantic relevance by 13.5\% compared with prior state-of-the-art methods. We also deploy MFLI in the system and report online experimental results demonstrating improved engagement, less popularity bias, and higher serving efficiency.
10. 【2602.16034】FeDecider: An LLM-Based Framework for Federated Cross-Domain Recommendation
链接:https://arxiv.org/abs/2602.16034
作者:Xinrui He,Ting-Wei Li,Tianxin Wei,Xuying Ning,Xinyu He,Wenxuan Bao,Hanghang Tong,Jingrui He
类目:Information Retrieval (cs.IR)
关键词:preserving data privacy, Federated CDR, aims to collaboratively, data privacy, recommendation models
备注: Accepted to The Web Conference (WWW) 2026
点击查看摘要
Abstract:Federated cross-domain recommendation (Federated CDR) aims to collaboratively learn personalized recommendation models across heterogeneous domains while preserving data privacy. Recently, large language model (LLM)-based recommendation models have demonstrated impressive performance by leveraging LLMs' strong reasoning capabilities and broad knowledge. However, adopting LLM-based recommendation models in Federated CDR scenarios introduces new challenges. First, there exists a risk of overfitting with domain-specific local adapters. The magnitudes of locally optimized parameter updates often vary across domains, causing biased aggregation and overfitting toward domain-specific distributions. Second, unlike traditional recommendation models (e.g., collaborative filtering, bipartite graph-based methods) that learn explicit and comparable user/item representations, LLMs encode knowledge implicitly through autoregressive text generation training. This poses additional challenges for effectively measuring the cross-domain similarities under heterogeneity. To address these challenges, we propose an LLM-based framework for federated cross-domain recommendation, FeDecider. Specifically, FeDecider tackles the challenge of scale-specific noise by disentangling each client's low-rank updates and sharing only their directional components. To handle the need for flexible and effective integration, each client further learns personalized weights that achieve the data-aware integration of updates from other domains. Extensive experiments across diverse datasets validate the effectiveness of our proposed FeDecider.
11. 【2602.15921】Latent Objective Induction and Diversity-Constrained Selection: Algorithms for Multi-Locale Retrieval Pipelines
链接:https://arxiv.org/abs/2602.15921
作者:Faruk Alpay,Levent Sarioglu
类目:Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
关键词:ranked search results, formal correctness guarantees, selecting a diverse, search results, formal correctness
备注: 13 pages, 2 algorithms, 3 tables
点击查看摘要
Abstract:We present three algorithms with formal correctness guarantees and complexity bounds for the problem of selecting a diverse, multi-locale set of sources from ranked search results. First, we formulate weighted locale allocation as a constrained integer partition problem and give an $O(n \log n)$ algorithm that simultaneously satisfies minimum-representation, budget-exhaustion, and proportionality-bound constraints; we prove all three hold with a tight deviation bound of $ 1$. Second, we define a cascaded country-code inference function as a deterministic priority chain over heterogeneous signals (TLD structure, model-inferred metadata, language fallback) and prove it satisfies both determinism and graceful degradation. Third, we introduce a $\kappa$-domain diversity constraint for source selection and give an $O(|K| \cdot R)$ algorithm that maintains the invariant via hash-map lookup, eliminating the aggregator monopolization pathology present in URL-level deduplication. We further formalize Latent Objective Induction (LOI), an environment-shaping operator over prompt spaces that steers downstream model behavior without restricting the feasible output set, and prove its convergence under mild assumptions. Applied to a multi-locale retrieval pipeline, these algorithms yield 62% improvement in first-party source ratio and 89% reduction in same-domain duplication across 120 multilingual queries.
12. 【2602.15856】Rethinking Soft Compression in Retrieval-Augmented Generation: A Query-Conditioned Selector Perspective
链接:https://arxiv.org/abs/2602.15856
作者:Yunhao Liu,Zian Jia,Xinyu Gao,Kanjun Xu,Yun Xiong
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:Large Language Models, effectively grounds Large, grounds Large Language, Web-related tasks, grounds Large
备注: Accepted by WWW 2026
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) effectively grounds Large Language Models (LLMs) with external knowledge and is widely applied to Web-related tasks. However, its scalability is hindered by excessive context length and redundant retrievals. Recent research on soft context compression aims to address this by encoding long documents into compact embeddings, yet they often underperform non-compressed RAG due to their reliance on auto-encoder-like full-compression that forces the encoder to compress all document information regardless of relevance to the input query. In this work, we conduct an analysis on this paradigm and reveal two fundamental limitations: (I) Infeasibility, full-compression conflicts with the LLM's downstream generation behavior; and (II) Non-necessity: full-compression is unnecessary and dilutes task-relevant information density. Motivated by these insights, we introduce SeleCom, a selector-based soft compression framework for RAG that redefines the encoder's role as query-conditioned information selector. The selector is decoder-only and is trained with a massive, diverse and difficulty-graded synthetic QA dataset with curriculum learning. Extensive experiments show that SeleCom significantly outperforms existing soft compression approaches and achieves competitive or superior performance to non-compression baselines, while reducing computation and latency by 33.8%~84.6%.
Comments:
Accepted by WWW 2026
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Cite as:
arXiv:2602.15856 [cs.CL]
(or
arXiv:2602.15856v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2602.15856
Focus to learn more
arXiv-issued DOI via DataCite</p>
计算机视觉
1. 【2602.16711】CoNeRV: Leveraging Temporal Coherence for Compressible Neural Representations for Videos
链接:https://arxiv.org/abs/2602.16711
作者:Namitha Padmanabhan,Matthew Gwilliam,Abhinav Shrivastava
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Implicit Neural Representations, Implicit Neural, recently demonstrated impressive, demonstrated impressive performance, Neural Representations
备注:
点击查看摘要
Abstract:Implicit Neural Representations (INRs) have recently demonstrated impressive performance for video compression. However, since a separate INR must be overfit for each video, scaling to high-resolution videos while maintaining encoding efficiency remains a significant challenge. Hypernetwork-based approaches predict INR weights (hyponetworks) for unseen videos at high speeds, but with low quality, large compressed size, and prohibitive memory needs at higher resolutions. We address these fundamental limitations through three key contributions: (1) an approach that decomposes the weight prediction task spatially and temporally, by breaking short video segments into patch tubelets, to reduce the pretraining memory overhead by 20$\times$; (2) a residual-based storage scheme that captures only differences between consecutive segment representations, significantly reducing bitstream size; and (3) a temporal coherence regularization framework that encourages changes in the weight space to be correlated with video content. Our proposed method, TeCoNeRV, achieves substantial improvements of 2.47dB and 5.35dB PSNR over the baseline at 480p and 720p on UVG, with 36% lower bitrates and 1.5-3$\times$ faster encoding speeds. With our low memory usage, we are the first hypernetwork approach to demonstrate results at 480p, 720p and 1080p on UVG, HEVC and MCL-JCV. Our project page is available at this https URL .
2. 【2602.16705】Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation
链接:https://arxiv.org/abs/2602.16705
作者:Runpei Dong,Ziyan Li,Xialin He,Saurabh Gupta
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:RGB-D images, humanoid robots requires, humanoid robots, RGB-D, robots requires accurate
备注: Project page: [this https URL](https://hero-humanoid.github.io/)
点击查看摘要
Abstract:Visual loco-manipulation of arbitrary objects in the wild with humanoid robots requires accurate end-effector (EE) control and a generalizable understanding of the scene via visual inputs (e.g., RGB-D images). Existing approaches are based on real-world imitation learning and exhibit limited generalization due to the difficulty in collecting large-scale training datasets. This paper presents a new paradigm, HERO, for object loco-manipulation with humanoid robots that combines the strong generalization and open-vocabulary understanding of large vision models with strong control performance from simulated training. We achieve this by designing an accurate residual-aware EE tracking policy. This EE tracking policy combines classical robotics with machine learning. It uses a) inverse kinematics to convert residual end-effector targets into reference trajectories, b) a learned neural forward model for accurate forward kinematics, c) goal adjustment, and d) replanning. Together, these innovations help us cut down the end-effector tracking error by 3.2x. We use this accurate end-effector tracker to build a modular system for loco-manipulation, where we use open-vocabulary large vision models for strong visual generalization. Our system is able to operate in diverse real-world environments, from offices to coffee shops, where the robot is able to reliably manipulate various everyday objects (e.g., mugs, apples, toys) on surfaces ranging from 43cm to 92cm in height. Systematic modular and end-to-end tests in simulation and the real world demonstrate the effectiveness of our proposed design. We believe the advances in this paper can open up new ways of training humanoid robots to interact with daily objects.
3. 【2602.16702】Saliency-Aware Multi-Route Thinking: Revisiting Vision-Language Reasoning
链接:https://arxiv.org/abs/2602.16702
作者:Mingjia Shi,Yinhan He,Yaochen Zhu,Jundong Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Vision-language models, jointly leveraging visual, aim to reason, reason by jointly, jointly leveraging
备注: preprint 10 pages, 4 figures
点击查看摘要
Abstract:Vision-language models (VLMs) aim to reason by jointly leveraging visual and textual modalities. While allocating additional inference-time computation has proven effective for large language models (LLMs), achieving similar scaling in VLMs remains challenging. A key obstacle is that visual inputs are typically provided only once at the start of generation, while textual reasoning (e.g., early visual summaries) is generated autoregressively, causing reasoning to become increasingly text-dominated and allowing early visual grounding errors to accumulate. Moreover, vanilla guidance for visual grounding during inference is often coarse and noisy, making it difficult to steer reasoning over long texts. To address these challenges, we propose \emph{Saliency-Aware Principle} (SAP) selection. SAP operates on high-level reasoning principles rather than token-level trajectories, which enable stable control over discrete generation under noisy feedback while allowing later reasoning steps to re-consult visual evidence when renewed grounding is required. In addition, SAP supports multi-route inference, enabling parallel exploration of diverse reasoning behaviors. SAP is model-agnostic and data-free, requiring no additional training. Empirical results show that SAP achieves competitive performance, especially in reducing object hallucination, under comparable token-generation budgets while yielding more stable reasoning and lower response latency than CoT-style long sequential reasoning.
4. 【2602.16689】Are Object-Centric Representations Better At Compositional Generalization?
链接:https://arxiv.org/abs/2602.16689
作者:Ferdinand Kapl,Amir Mohammad Karimi Mamaghan,Maximilian Seitzer,Karl Henrik Johansson,Carsten Marr,Stefan Bauer,Andrea Dittadi
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:familiar concepts, machine learning, Visual Question Answering, ability to reason, fundamental to human
备注:
点击查看摘要
Abstract:Compositional generalization, the ability to reason about novel combinations of familiar concepts, is fundamental to human cognition and a critical challenge for machine learning. Object-centric (OC) representations, which encode a scene as a set of objects, are often argued to support such generalization, but systematic evidence in visually rich settings is limited. We introduce a Visual Question Answering benchmark across three controlled visual worlds (CLEVRTex, Super-CLEVR, and MOVi-C) to measure how well vision encoders, with and without object-centric biases, generalize to unseen combinations of object properties. To ensure a fair and comprehensive comparison, we carefully account for training data diversity, sample size, representation size, downstream model capacity, and compute. We use DINOv2 and SigLIP2, two widely used vision encoders, as the foundation models and their OC counterparts. Our key findings reveal that (1) OC approaches are superior in harder compositional generalization settings; (2) original dense representations surpass OC only on easier settings and typically require substantially more downstream compute; and (3) OC models are more sample efficient, achieving stronger generalization with fewer images, whereas dense encoders catch up or surpass them only with sufficient data and diversity. Overall, object-centric representations offer stronger compositional generalization when any one of dataset size, training data diversity, or downstream compute is constrained.
5. 【2602.16682】Learning Situated Awareness in the Real World
链接:https://arxiv.org/abs/2602.16682
作者:Chuhan Li,Ruilin Han,Joy Hsu,Yongyuan Liang,Rajiv Dhawan,Jiajun Wu,Ming-Hsuan Yang,Xin Eric Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:surrounding physical environment, actions in context, core aspect, aspect of human, human perception
备注:
点击查看摘要
Abstract:A core aspect of human perception is situated awareness, the ability to relate ourselves to the surrounding physical environment and reason over possible actions in context. However, most existing benchmarks for multimodal foundation models (MFMs) emphasize environment-centric spatial relations (relations among objects in a scene), while largely overlooking observer-centric relationships that require reasoning relative to agent's viewpoint, pose, and motion. To bridge this gap, we introduce SAW-Bench (Situated Awareness in the Real World), a novel benchmark for evaluating egocentric situated awareness using real-world videos. SAW-Bench comprises 786 self-recorded videos captured with Ray-Ban Meta (Gen 2) smart glasses spanning diverse indoor and outdoor environments, and over 2,071 human-annotated question-answer pairs. It probes a model's observer-centric understanding with six different awareness tasks. Our comprehensive evaluation reveals a human-model performance gap of 37.66%, even with the best-performing MFM, Gemini 3 Flash. Beyond this gap, our in-depth analysis uncovers several notable findings; for example, while models can exploit partial geometric cues in egocentric videos, they often fail to infer a coherent camera geometry, leading to systematic spatial reasoning errors. We position SAW-Bench as a benchmark for situated spatial intelligence, moving beyond passive observation to understanding physically grounded, observer-centric dynamics.
6. 【2602.16681】VETime: Vision Enhanced Zero-Shot Time Series Anomaly Detection
链接:https://arxiv.org/abs/2602.16681
作者:Yingyuan Yang,Tian Lan,Yifei Gao,Yimeng Lu,Wenjun He,Meng Wang,Chenghao Liu,Chen Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:long-range Context Anomalies, Point Anomalies, Context Anomalies, Time-series anomaly detection, long-range Context
备注:
点击查看摘要
Abstract:Time-series anomaly detection (TSAD) requires identifying both immediate Point Anomalies and long-range Context Anomalies. However, existing foundation models face a fundamental trade-off: 1D temporal models provide fine-grained pointwise localization but lack a global contextual perspective, while 2D vision-based models capture global patterns but suffer from information bottlenecks due to a lack of temporal alignment and coarse-grained pointwise detection. To resolve this dilemma, we propose VETime, the first TSAD framework that unifies temporal and visual modalities through fine-grained visual-temporal alignment and dynamic fusion. VETime introduces a Reversible Image Conversion and a Patch-Level Temporal Alignment module to establish a shared visual-temporal timeline, preserving discriminative details while maintaining temporal sensitivity. Furthermore, we design an Anomaly Window Contrastive Learning mechanism and a Task-Adaptive Multi-Modal Fusion to adaptively integrate the complementary perceptual strengths of both modalities. Extensive experiments demonstrate that VETime significantly outperforms state-of-the-art models in zero-shot scenarios, achieving superior localization precision with lower computational overhead than current vision-based approaches. Code available at: this https URL.
7. 【2602.16669】PredMapNet: Future and Historical Reasoning for Consistent Online HD Vectorized Map Construction
链接:https://arxiv.org/abs/2602.16669
作者:Bo Lang,Nirav Savaliya,Zhihao Zheng,Jinglun Feng,Zheng-Hang Yeh,Mooi Choo Chuah
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:providing structured representations, autonomous driving, providing structured, navigation and planning, crucial to autonomous
备注: WACV 2026
点击查看摘要
Abstract:High-definition (HD) maps are crucial to autonomous driving, providing structured representations of road elements to support navigation and planning. However, existing query-based methods often employ random query initialization and depend on implicit temporal modeling, which lead to temporal inconsistencies and instabilities during the construction of a global map. To overcome these challenges, we introduce a novel end-to-end framework for consistent online HD vectorized map construction, which jointly performs map instance tracking and short-term prediction. First, we propose a Semantic-Aware Query Generator that initializes queries with spatially aligned semantic masks to capture scene-level context globally. Next, we design a History Rasterized Map Memory to store fine-grained instance-level maps for each tracked instance, enabling explicit historical priors. A History-Map Guidance Module then integrates rasterized map information into track queries, improving temporal continuity. Finally, we propose a Short-Term Future Guidance module to forecast the immediate motion of map instances based on the stored history trajectories. These predicted future locations serve as hints for tracked instances to further avoid implausible predictions and keep temporal consistency. Extensive experiments on the nuScenes and Argoverse2 datasets demonstrate that our proposed method outperforms state-of-the-art (SOTA) methods with good efficiency.
8. 【2602.16664】Unpaired Image-to-Image Translation via a Self-Supervised Semantic Bridge
链接:https://arxiv.org/abs/2602.16664
作者:Jiaming Liu,Felix Petersen,Yunhe Gao,Yabin Zhang,Hyojin Kim,Akshay S. Chaudhari,Yu Sun,Stefano Ermon,Sergios Gatidis
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:faces key limitations, advanced unpaired, diffusion-inversion methods, require target-domain adversarial, Adversarial approaches require
备注: 36 pages
点击查看摘要
Abstract:Adversarial diffusion and diffusion-inversion methods have advanced unpaired image-to-image translation, but each faces key limitations. Adversarial approaches require target-domain adversarial loss during training, which can limit generalization to unseen data, while diffusion-inversion methods often produce low-fidelity translations due to imperfect inversion into noise-latent representations. In this work, we propose the Self-Supervised Semantic Bridge (SSB), a versatile framework that integrates external semantic priors into diffusion bridge models to enable spatially faithful translation without cross-domain supervision. Our key idea is to leverage self-supervised visual encoders to learn representations that are invariant to appearance changes but capture geometric structure, forming a shared latent space that conditions the diffusion bridges. Extensive experiments show that SSB outperforms strong prior methods for challenging medical image synthesis in both in-domain and out-of-domain settings, and extends easily to high-quality text-guided editing.
9. 【2602.16611】Style-Aware Gloss Control for Generative Non-Photorealistic Rendering
链接:https://arxiv.org/abs/2602.16611
作者:Santiago Jimenez-Navarro,Belen Masia,Ana Serrano
类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
关键词:similar perceptual strategies, perceptual strategies guide, infer material characteristics, paintings or drawings, Humans can infer
备注:
点击查看摘要
Abstract:Humans can infer material characteristics of objects from their visual appearance, and this ability extends to artistic depictions, where similar perceptual strategies guide the interpretation of paintings or drawings. Among the factors that define material appearance, gloss, along with color, is widely regarded as one of the most important, and recent studies indicate that humans can perceive gloss independently of the artistic style used to depict an object. To investigate how gloss and artistic style are represented in learned models, we train an unsupervised generative model on a newly curated dataset of painterly objects designed to systematically vary such factors. Our analysis reveals a hierarchical latent space in which gloss is disentangled from other appearance factors, allowing for a detailed study of how gloss is represented and varies across artistic styles. Building on this representation, we introduce a lightweight adapter that connects our style- and gloss-aware latent space to a latent-diffusion model, enabling the synthesis of non-photorealistic images with fine-grained control of these factors. We compare our approach with previous models and observe improved disentanglement and controllability of the learned factors.
10. 【2602.16608】Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models
链接:https://arxiv.org/abs/2602.16608
作者:Melkamu Abay Mersha,Jugal Kalita
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:deeply layered representations, layered representations make, Layer-wise Integrated Gradients, Transformer models achieve, difficult to interpret
备注:
点击查看摘要
Abstract:Transformer models achieve state-of-the-art performance across domains and tasks, yet their deeply layered representations make their predictions difficult to interpret. Existing explainability methods rely on final-layer attributions, capture either local token-level attributions or global attention patterns without unification, and lack context-awareness of inter-token dependencies and structural components. They also fail to capture how relevance evolves across layers and how structural components shape decision-making. To address these limitations, we proposed the \textbf{Context-Aware Layer-wise Integrated Gradients (CA-LIG) Framework}, a unified hierarchical attribution framework that computes layer-wise Integrated Gradients within each Transformer block and fuses these token-level attributions with class-specific attention gradients. This integration yields signed, context-sensitive attribution maps that capture supportive and opposing evidence while tracing the hierarchical flow of relevance through the Transformer layers. We evaluate the CA-LIG Framework across diverse tasks, domains, and transformer model families, including sentiment analysis and long and multi-class document classification with BERT, hate speech detection in a low-resource language setting with XLM-R and AfroLM, and image classification with Masked Autoencoder vision Transformer model. Across all tasks and architectures, CA-LIG provides more faithful attributions, shows stronger sensitivity to contextual dependencies, and produces clearer, more semantically coherent visualizations than established explainability methods. These results indicate that CA-LIG provides a more comprehensive, context-aware, and reliable explanation of Transformer decision-making, advancing both the practical interpretability and conceptual understanding of deep neural models.
11. 【2602.16590】A Contrastive Learning Framework Empowered by Attention-based Feature Adaptation for Street-View Image Classification
链接:https://arxiv.org/abs/2602.16590
作者:Qi You,Yitai Cheng,Zichao Zeng,James Haworth
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:high-definition map construction, Street-view image attribute, urban analytics, vital downstream task, Street-view image
备注:
点击查看摘要
Abstract:Street-view image attribute classification is a vital downstream task of image classification, enabling applications such as autonomous driving, urban analytics, and high-definition map construction. It remains computationally demanding whether training from scratch, initialising from pre-trained weights, or fine-tuning large models. Although pre-trained vision-language models such as CLIP offer rich image representations, existing adaptation or fine-tuning methods often rely on their global image embeddings, limiting their ability to capture fine-grained, localised attributes essential in complex, cluttered street scenes. To address this, we propose CLIP-MHAdapter, a variant of the current lightweight CLIP adaptation paradigm that appends a bottleneck MLP equipped with multi-head self-attention operating on patch tokens to model inter-patch dependencies. With approximately 1.4 million trainable parameters, CLIP-MHAdapter achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset, attaining new state-of-the-art results while maintaining low computational cost. The code is available at this https URL.
12. 【2602.16569】Arc2Morph: Identity-Preserving Facial Morphing with Arc2Face
链接:https://arxiv.org/abs/2602.16569
作者:Nicolò Di Domenico,Annalisa Franco,Matteo Ferrara,Davide Maltoni
类目:Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
关键词:face recognition systems, electronic identity documents, morphing attack potential, morphing attack, widely recognized
备注:
点击查看摘要
Abstract:Face morphing attacks are widely recognized as one of the most challenging threats to face recognition systems used in electronic identity documents. These attacks exploit a critical vulnerability in passport enrollment procedures adopted by many countries, where the facial image is often acquired without a supervised live capture process. In this paper, we propose a novel face morphing technique based on Arc2Face, an identity-conditioned face foundation model capable of synthesizing photorealistic facial images from compact identity representations. We demonstrate the effectiveness of the proposed approach by comparing the morphing attack potential metric on two large-scale sequestered face morphing attack detection datasets against several state-of-the-art morphing methods, as well as on two novel morphed face datasets derived from FEI and ONOT. Experimental results show that the proposed deep learning-based approach achieves a morphing attack potential comparable to that of landmark-based techniques, which have traditionally been regarded as the most challenging. These findings confirm the ability of the proposed method to effectively preserve and manage identity information during the morph generation process.
13. 【2602.16545】Let's Split Up: Zero-Shot Classifier Edits for Fine-Grained Video Understanding
链接:https://arxiv.org/abs/2602.16545
作者:Kaiting Liu,Hazel Doughty
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:manner or outcome, single label, typically trained, trained on fixed, fixed taxonomies
备注: ICLR 2026
点击查看摘要
Abstract:Video recognition models are typically trained on fixed taxonomies which are often too coarse, collapsing distinctions in object, manner or outcome under a single label. As tasks and definitions evolve, such models cannot accommodate emerging distinctions and collecting new annotations and retraining to accommodate such changes is costly. To address these challenges, we introduce category splitting, a new task where an existing classifier is edited to refine a coarse category into finer subcategories, while preserving accuracy elsewhere. We propose a zero-shot editing method that leverages the latent compositional structure of video classifiers to expose fine-grained distinctions without additional data. We further show that low-shot fine-tuning, while simple, is highly effective and benefits from our zero-shot initialization. Experiments on our new video benchmarks for category splitting demonstrate that our method substantially outperforms vision-language baselines, improving accuracy on the newly split categories without sacrificing performance on the rest. Project page: this https URL.
14. 【2602.16502】DressWild: Feed-Forward Pose-Agnostic Garment Sewing Pattern Generation from In-the-Wild Images
链接:https://arxiv.org/abs/2602.16502
作者:Zeng Tao,Ying Jiang,Yunuo Chen,Tianyi Xie,Huamin Wang,Yingnian Wu,Yin Yang,Abishek Sampath Kumar,Kenji Tashiro,Chenfanfu Jiang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:shown promising progress, Recent advances, promising progress, shown promising, sewing pattern generation
备注:
点击查看摘要
Abstract:Recent advances in garment pattern generation have shown promising progress. However, existing feed-forward methods struggle with diverse poses and viewpoints, while optimization-based approaches are computationally expensive and difficult to scale. This paper focuses on sewing pattern generation for garment modeling and fabrication applications that demand editable, separable, and simulation-ready garments. We propose DressWild, a novel feed-forward pipeline that reconstructs physics-consistent 2D sewing patterns and the corresponding 3D garments from a single in-the-wild image. Given an input image, our method leverages vision-language models (VLMs) to normalize pose variations at the image level, then extract pose-aware, 3D-informed garment features. These features are fused through a transformer-based encoder and subsequently used to predict sewing pattern parameters, which can be directly applied to physical simulation, texture synthesis, and multi-layer virtual try-on. Extensive experiments demonstrate that our approach robustly recovers diverse sewing patterns and the corresponding 3D garments from in-the-wild images without requiring multi-view inputs or iterative optimization, offering an efficient and scalable solution for realistic garment simulation and animation.
15. 【2602.16494】Benchmarking Adversarial Robustness and Adversarial Training Strategies for Object Detection
链接:https://arxiv.org/abs/2602.16494
作者:Alexis Winter,Jean-Vincent Martini,Romaric Audigier,Angelique Loesch,Bertrand Luvison
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:automated systems, perception-based robots, security risk, critical components, components of automated
备注:
点击查看摘要
Abstract:Object detection models are critical components of automated systems, such as autonomous vehicles and perception-based robots, but their sensitivity to adversarial attacks poses a serious security risk. Progress in defending these models lags behind classification, hindered by a lack of standardized evaluation. It is nearly impossible to thoroughly compare attack or defense methods, as existing work uses different datasets, inconsistent efficiency metrics, and varied measures of perturbation cost. This paper addresses this gap by investigating three key questions: (1) How can we create a fair benchmark to impartially compare attacks? (2) How well do modern attacks transfer across different architectures, especially from Convolutional Neural Networks to Vision Transformers? (3) What is the most effective adversarial training strategy for robust defense? To answer these, we first propose a unified benchmark framework focused on digital, non-patch-based attacks. This framework introduces specific metrics to disentangle localization and classification errors and evaluates attack cost using multiple perceptual metrics. Using this benchmark, we conduct extensive experiments on state-of-the-art attacks and a wide range of detectors. Our findings reveal two major conclusions: first, modern adversarial attacks against object detection models show a significant lack of transferability to transformer-based architectures. Second, we demonstrate that the most robust adversarial training strategy leverages a dataset composed of a mix of high-perturbation attacks with different objectives (e.g., spatial and semantic), which outperforms training on any single attack.
16. 【2602.16493】MMA: Multimodal Memory Agent
链接:https://arxiv.org/abs/2602.16493
作者:Yihao Lu,Wanru Cheng,Zeyu Zhang,Hao Tang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Long-horizon multimodal agents, trigger overconfident errors, multimodal agents depend, Long-horizon multimodal, Multimodal Memory Agent
备注:
点击查看摘要
Abstract:Long-horizon multimodal agents depend on external memory; however, similarity-based retrieval often surfaces stale, low-credibility, or conflicting items, which can trigger overconfident errors. We propose Multimodal Memory Agent (MMA), which assigns each retrieved memory item a dynamic reliability score by combining source credibility, temporal decay, and conflict-aware network consensus, and uses this signal to reweight evidence and abstain when support is insufficient. We also introduce MMA-Bench, a programmatically generated benchmark for belief dynamics with controlled speaker reliability and structured text-vision contradictions. Using this framework, we uncover the "Visual Placebo Effect", revealing how RAG-based agents inherit latent visual biases from foundation models. On FEVER, MMA matches baseline accuracy while reducing variance by 35.2% and improving selective utility; on LoCoMo, a safety-oriented configuration improves actionable accuracy and reduces wrong answers; on MMA-Bench, MMA reaches 41.18% Type-B accuracy in Vision mode, while the baseline collapses to 0.0% under the same protocol. Code: this https URL.
17. 【2602.16455】Visual Self-Refine: A Pixel-Guided Paradigm for Accurate Chart Parsing
链接:https://arxiv.org/abs/2602.16455
作者:Jinsong Li,Xiaoyi Dong,Yuhang Zang,Yuhang Cao,Jiaqi Wang,Dahua Lin
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Vision-Language Models, demonstrated remarkable capabilities, strengths provide minimal, provide minimal benefits, Large Vision-Language
备注:
点击查看摘要
Abstract:While Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities for reasoning and self-correction at the textual level, these strengths provide minimal benefits for complex tasks centered on visual perception, such as Chart Parsing. Existing models often struggle with visually dense charts, leading to errors like data omission, misalignment, and hallucination. Inspired by the human strategy of using a finger as a ``visual anchor'' to ensure accuracy when reading complex charts, we propose a new paradigm named Visual Self-Refine (VSR). The core idea of VSR is to enable a model to generate pixel-level localization outputs, visualize them, and then feed these visualizations back to itself, allowing it to intuitively inspect and correct its own potential visual perception errors. We instantiate the VSR paradigm in the domain of Chart Parsing by proposing ChartVSR. This model decomposes the parsing process into two stages: a Refine Stage, where it iteratively uses visual feedback to ensure the accuracy of all data points' Pixel-level Localizations, and a Decode Stage, where it uses these verified localizations as precise visual anchors to parse the final structured data. To address the limitations of existing benchmarks, we also construct ChartP-Bench, a new and highly challenging benchmark for chart parsing. Our work also highlights VSR as a general-purpose visual feedback mechanism, offering a promising new direction for enhancing accuracy on a wide range of vision-centric tasks.
18. 【2602.16430】Designing Production-Scale OCR for India: Multilingual and Domain-Specific Systems
链接:https://arxiv.org/abs/2602.16430
作者:Ali Faraz,Raja Kolla,Ashish Kulkarni,Shubham Agarwal
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Designing Optical Character, Optical Character Recognition, India requires balancing, Designing Optical, Character Recognition
备注:
点击查看摘要
Abstract:Designing Optical Character Recognition (OCR) systems for India requires balancing linguistic diversity, document heterogeneity, and deployment constraints. In this paper, we study two training strategies for building multilingual OCR systems with Vision-Language Models through the Chitrapathak series. We first follow a popular multimodal approach, pairing a generic vision encoder with a strong multilingual language model and training the system end-to-end for OCR. Alternatively, we explore fine-tuning an existing OCR model, despite not being trained for the target languages. Through extensive evaluation on multilingual Indic OCR benchmarks and deployment-oriented metrics, we find that the second strategy consistently achieves better accuracy-latency trade-offs. Chitrapathak-2 achieves 3-6x speedup over its predecessor with being state-of-the-art (SOTA) in Telugu (6.69 char ANLS) and second best in the rest. In addition, we present Parichay, an independent OCR model series designed specifically for 9 Indian government documents to extract structured key fields, achieving 89.8% Exact Match score with a faster inference. Together, these systems achieve SOTA performance and provide practical guidance for building production-scale OCR pipelines in the Indian context.
19. 【2602.16412】ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding
链接:https://arxiv.org/abs/2602.16412
作者:Daichi Yashima,Shuhei Kurita,Yusuke Oda,Komei Sugiura
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:multimodal large language, shown remarkable success, long-form video understanding, video understanding remains, large language models
备注:
点击查看摘要
Abstract:While multimodal large language models (MLLMs) have shown remarkable success across a wide range of tasks, long-form video understanding remains a significant challenge. In this study, we focus on video understanding by MLLMs. This task is challenging because processing a full stream of RGB frames is computationally intractable and highly redundant, as self-attention have quadratic complexity with sequence length. In this paper, we propose ReMoRa, a video MLLM that processes videos by operating directly on their compressed representations. A sparse set of RGB keyframes is retained for appearance, while temporal dynamics are encoded as a motion representation, removing the need for sequential RGB frames. These motion representations act as a compact proxy for optical flow, capturing temporal dynamics without full frame decoding. To refine the noise and low fidelity of block-based motions, we introduce a module to denoise and generate a fine-grained motion representation. Furthermore, our model compresses these features in a way that scales linearly with sequence length. We demonstrate the effectiveness of ReMoRa through extensive experiments across a comprehensive suite of long-video understanding benchmarks. ReMoRa outperformed baseline methods on multiple challenging benchmarks, including LongVideoBench, NExT-QA, and MLVU.
20. 【2602.16385】Parameter-Free Adaptive Multi-Scale Channel-Spatial Attention Aggregation framework for 3D Indoor Semantic Scene Completion Toward Assisting Visually Impaired
链接:https://arxiv.org/abs/2602.16385
作者:Qi He,XiangXiang Wang,Jingtao Zhang,Yongbin Yu,Hongxiang Chu,Manping Fan,JingYe Cai,Zhenglin Yang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Semantic Scene Completion, safety-critical scene understanding, Scene Completion, strictly monocular vision, scene understanding
备注: 17 pages, 9 figures, 5 tables
点击查看摘要
Abstract:In indoor assistive perception for visually impaired users, 3D Semantic Scene Completion (SSC) is expected to provide structurally coherent and semantically consistent occupancy under strictly monocular vision for safety-critical scene understanding. However, existing monocular SSC approaches often lack explicit modeling of voxel-feature reliability and regulated cross-scale information propagation during 2D-3D projection and multi-scale fusion, making them vulnerable to projection diffusion and feature entanglement and thus limiting structural stability. To address these challenges, this paper presents an Adaptive Multi-scale Attention Aggregation (AMAA) framework built upon the MonoScene pipeline. Rather than introducing a heavier backbone, AMAA focuses on reliability-oriented feature regulation within a monocular SSC framework. Specifically, lifted voxel features are jointly calibrated in semantic and spatial dimensions through parallel channel-spatial attention aggregation, while multi-scale encoder-decoder fusion is stabilized via a hierarchical adaptive feature-gating strategy that regulates information injection across scales. Experiments on the NYUv2 benchmark demonstrate consistent improvements over MonoScene without significantly increasing system complexity: AMAA achieves 27.25% SSC mIoU (+0.31) and 43.10% SC IoU (+0.59). In addition, system-level deployment on an NVIDIA Jetson platform verifies that the complete AMAA framework can be executed stably on embedded hardware. Overall, AMAA improves monocular SSC quality and provides a reliable and deployable perception framework for indoor assistive systems targeting visually impaired users.
21. 【2602.16365】Markerless 6D Pose Estimation and Position-Based Visual Servoing for Endoscopic Continuum Manipulators
链接:https://arxiv.org/abs/2602.16365
作者:Junhyun Park,Chunggil An,Myeongbo Park,Ihsan Ullah,Sihyeong Park,Minho Hwang
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:minimally invasive procedures, flexible endoscopic surgical, endoscopic surgical systems, surgical systems offer, remain challenging due
备注: 20 pages, 13 figures, 7 tables
点击查看摘要
Abstract:Continuum manipulators in flexible endoscopic surgical systems offer high dexterity for minimally invasive procedures; however, accurate pose estimation and closed-loop control remain challenging due to hysteresis, compliance, and limited distal sensing. Vision-based approaches reduce hardware complexity but are often constrained by limited geometric observability and high computational overhead, restricting real-time closed-loop applicability. This paper presents a unified framework for markerless stereo 6D pose estimation and position-based visual servoing of continuum manipulators. A photo-realistic simulation pipeline enables large-scale automatic training with pixel-accurate annotations. A stereo-aware multi-feature fusion network jointly exploits segmentation masks, keypoints, heatmaps, and bounding boxes to enhance geometric observability. To enforce geometric consistency without iterative optimization, a feed-forward rendering-based refinement module predicts residual pose corrections in a single pass. A self-supervised sim-to-real adaptation strategy further improves real-world performance using unlabeled data. Extensive real-world validation achieves a mean translation error of 0.83 mm and a mean rotation error of 2.76° across 1,000 samples. Markerless closed-loop visual servoing driven by the estimated pose attains accurate trajectory tracking with a mean translation error of 2.07 mm and a mean rotation error of 7.41°, corresponding to 85% and 59% reductions compared to open-loop control, together with high repeatability in repeated point-reaching tasks. To the best of our knowledge, this work presents the first fully markerless pose-estimation-driven position-based visual servoing framework for continuum manipulators, enabling precise closed-loop control without physical markers or embedded sensing.
22. 【2602.16356】Articulated 3D Scene Graphs for Open-World Mobile Manipulation
链接:https://arxiv.org/abs/2602.16356
作者:Martin Büchner,Adrian Röfer,Tim Engelbracht,Tim Welschehold,Zuria Bauer,Hermann Blum,Marc Pollefeys,Abhinav Valada
类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:understanding and affordance-driven, affordance-driven object interaction, object, objects, scene understanding
备注:
点击查看摘要
Abstract:Semantics has enabled 3D scene understanding and affordance-driven object interaction. However, robots operating in real-world environments face a critical limitation: they cannot anticipate how objects move. Long-horizon mobile manipulation requires closing the gap between semantics, geometry, and kinematics. In this work, we present MoMa-SG, a novel framework for building semantic-kinematic 3D scene graphs of articulated scenes containing a myriad of interactable objects. Given RGB-D sequences containing multiple object articulations, we temporally segment object interactions and infer object motion using occlusion-robust point tracking. We then lift point trajectories into 3D and estimate articulation models using a novel unified twist estimation formulation that robustly estimates revolute and prismatic joint parameters in a single optimization pass. Next, we associate objects with estimated articulations and detect contained objects by reasoning over parent-child relations at identified opening states. We also introduce the novel Arti4D-Semantic dataset, which uniquely combines hierarchical object semantics including parent-child relation labels with object axis annotations across 62 in-the-wild RGB-D sequences containing 600 object interactions and three distinct observation paradigms. We extensively evaluate the performance of MoMa-SG on two datasets and ablate key design choices of our approach. In addition, real-world experiments on both a quadruped and a mobile manipulator demonstrate that our semantic-kinematic scene graphs enable robust manipulation of articulated objects in everyday home environments. We provide code and data at: this https URL.
23. 【2602.16349】SCAR: Satellite Imagery-Based Calibration for Aerial Recordings
链接:https://arxiv.org/abs/2602.16349
作者:Henry Hölzemann,Michael Schleiss
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:persistent global reference, exploits georeferenced satellite, georeferenced satellite imagery, aerial visual-inertial systems, long-term auto-calibration refinement
备注:
点击查看摘要
Abstract:We introduce SCAR, a method for long-term auto-calibration refinement of aerial visual-inertial systems that exploits georeferenced satellite imagery as a persistent global reference. SCAR estimates both intrinsic and extrinsic parameters by aligning aerial images with 2D--3D correspondences derived from publicly available orthophotos and elevation models. In contrast to existing approaches that rely on dedicated calibration maneuvers or manually surveyed ground control points, our method leverages external geospatial data to detect and correct calibration degradation under field deployment conditions. We evaluate our approach on six large-scale aerial campaigns conducted over two years under diverse seasonal and environmental conditions. Across all sequences, SCAR consistently outperforms established baselines (Kalibr, COLMAP, VINS-Mono), reducing median reprojection error by a large margin, and translating these calibration gains into substantially lower visual localization rotation errors and higher pose accuracy. These results demonstrate that SCAR provides accurate, robust, and reproducible calibration over long-term aerial operations without the need for manual intervention.
24. 【2602.16337】Subtractive Modulative Network with Learnable Periodic Activations
链接:https://arxiv.org/abs/2602.16337
作者:Tiou Wang,Zhuoqian Yang,Markus Flierl,Mathieu Salzmann,Sabine Süsstrunk
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Implicit Neural Representation, parameter-efficient Implicit Neural, Subtractive Modulative Network, Neural Representation, Implicit Neural
备注: 4 pages, 3 figures, 3 tables
点击查看摘要
Abstract:We propose the Subtractive Modulative Network (SMN), a novel, parameter-efficient Implicit Neural Representation (INR) architecture inspired by classical subtractive synthesis. The SMN is designed as a principled signal processing pipeline, featuring a learnable periodic activation layer (Oscillator) that generates a multi-frequency basis, and a series of modulative mask modules (Filters) that actively generate high-order harmonics. We provide both theoretical analysis and empirical validation for our design. Our SMN achieves a PSNR of $40+$ dB on two image datasets, comparing favorably against state-of-the-art methods in terms of both reconstruction accuracy and parameter efficiency. Furthermore, consistent advantage is observed on the challenging 3D NeRF novel view synthesis task. Supplementary materials are available at this https URL.
25. 【2602.16327】Guide-Guard: Off-Target Predicting in CRISPR Applications
链接:https://arxiv.org/abs/2602.16327
作者:Joseph Bingham,Netanel Arussy,Saman Zonouz
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:cyber-physical genome sequencing, easily access tools, agriculture and medicine, editing technologies, health science
备注: 10 pages, 11 figs, accepted to IDEAL 2022
点击查看摘要
Abstract:With the introduction of cyber-physical genome sequencing and editing technologies, such as CRISPR, researchers can more easily access tools to investigate and create remedies for a variety of topics in genetics and health science (e.g. agriculture and medicine). As the field advances and grows, new concerns present themselves in the ability to predict the off-target behavior. In this work, we explore the underlying biological and chemical model from a data driven perspective. Additionally, we present a machine learning based solution named \textit{Guide-Guard} to predict the behavior of the system given a gRNA in the CRISPR gene-editing process with 84\% accuracy. This solution is able to be trained on multiple different genes at the same time while retaining accuracy.
26. 【2602.16322】A Self-Supervised Approach for Enhanced Feature Representations in Object Detection Tasks
链接:https://arxiv.org/abs/2602.16322
作者:Santiago C. Vilabella,Pablo Pérez-Núñez,Beatriz Remeseiro
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:training deep learning, artificial intelligence, complexity and size, fast-evolving field, field of artificial
备注:
点击查看摘要
Abstract:In the fast-evolving field of artificial intelligence, where models are increasingly growing in complexity and size, the availability of labeled data for training deep learning models has become a significant challenge. Addressing complex problems like object detection demands considerable time and resources for data labeling to achieve meaningful results. For companies developing such applications, this entails extensive investment in highly skilled personnel or costly outsourcing. This research work aims to demonstrate that enhancing feature extractors can substantially alleviate this challenge, enabling models to learn more effective representations with less labeled data. Utilizing a self-supervised learning strategy, we present a model trained on unlabeled data that outperforms state-of-the-art feature extractors pre-trained on ImageNet and particularly designed for object detection tasks. Moreover, the results demonstrate that our approach encourages the model to focus on the most relevant aspects of an object, thus achieving better feature representations and, therefore, reinforcing its reliability and robustness.
27. 【2602.16281】Breaking the Sub-Millimeter Barrier: Eyeframe Acquisition from Color Images
链接:https://arxiv.org/abs/2602.16281
作者:Manel Guzmán,Antonio Agudo
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:ensure proper lens, proper lens fitting, requires sub-millimeter precision, optimal vision correction, proper lens
备注: Accepted to CAI 2026
点击查看摘要
Abstract:Eyeframe lens tracing is an important process in the optical industry that requires sub-millimeter precision to ensure proper lens fitting and optimal vision correction. Traditional frame tracers rely on mechanical tools that need precise positioning and calibration, which are time-consuming and require additional equipment, creating an inefficient workflow for opticians. This work presents a novel approach based on artificial vision that utilizes multi-view information. The proposed algorithm operates on images captured from an InVision system. The full pipeline includes image acquisition, frame segmentation to isolate the eyeframe from background, depth estimation to obtain 3D spatial information, and multi-view processing that integrates segmented RGB images with depth data for precise frame contour measurement. To this end, different configurations and variants are proposed and analyzed on real data, providing competitive measurements from still color images with respect to other solutions, while eliminating the need for specialized tracing equipment and reducing workflow complexity for optical technicians.
28. 【2602.16249】AFFMAE: Scalable and Efficient Vision Pretraining for Desktop Graphics Cards
链接:https://arxiv.org/abs/2602.16249
作者:David Smerkous,Zian Wang,Behzad Najafian
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:enabling data-efficient fine-tuning, requires server-scale infrastructure, limiting in-domain foundation, transformed computer vision, typically requires server-scale
备注: Preprint
点击查看摘要
Abstract:Self-supervised pretraining has transformed computer vision by enabling data-efficient fine-tuning, yet high-resolution training typically requires server-scale infrastructure, limiting in-domain foundation model development for many research laboratories. Masked Autoencoders (MAE) reduce computation by encoding only visible tokens, but combining MAE with hierarchical downsampling architectures remains structurally challenging due to dense grid priors and mask-aware design compromises. We introduce AFFMAE, a masking-friendly hierarchical pretraining framework built on adaptive, off-grid token merging. By discarding masked tokens and performing dynamic merging exclusively over visible tokens, AFFMAE removes dense-grid assumptions while preserving hierarchical scalability. We developed numerically stable mixed-precision Flash-style cluster attention kernels, and mitigate sparse-stage representation collapse via deep supervision. On high-resolution electron microscopy segmentation, AFFMAE matches ViT-MAE performance at equal parameter count while reducing FLOPs by up to 7x, halving memory usage, and achieving faster training on a single RTX 5090. Code available at this https URL.
29. 【2602.16245】HyPCA-Net: Advancing Multimodal Fusion in Medical Image Analysis
链接:https://arxiv.org/abs/2602.16245
作者:J. Dhar,M. K. Pandey,D. Chakladar,M. Haghighat,A. Alavi,S. Mistry,N. Zaidi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:skin cancer detection, brain tumor prediction, shown great potential, Multimodal fusion frameworks, medical imaging modalities
备注: Accepted at the IEEE/CVF Winter Conference on Applications of Computer Vision 2026
点击查看摘要
Abstract:Multimodal fusion frameworks, which integrate diverse medical imaging modalities (e.g., MRI, CT), have shown great potential in applications such as skin cancer detection, dementia diagnosis, and brain tumor prediction. However, existing multimodal fusion methods face significant challenges. First, they often rely on computationally expensive models, limiting their applicability in low-resource environments. Second, they often employ cascaded attention modules, which potentially increase risk of information loss during inter-module transitions and hinder their capacity to effectively capture robust shared representations across modalities. This restricts their generalization in multi-disease analysis tasks. To address these limitations, we propose a Hybrid Parallel-Fusion Cascaded Attention Network (HyPCA-Net), composed of two core novel blocks: (a) a computationally efficient residual adaptive learning attention block for capturing refined modality-specific representations, and (b) a dual-view cascaded attention block aimed at learning robust shared representations across diverse modalities. Extensive experiments on ten publicly available datasets exhibit that HyPCA-Net significantly outperforms existing leading methods, with improvements of up to 5.2% in performance and reductions of up to 73.1% in computational cost. Code: this https URL.
30. 【2602.16238】EasyControlEdge: A Foundation-Model Fine-Tuning for Edge Detection
链接:https://arxiv.org/abs/2602.16238
作者:Hiroki Nakamura,Hiroto Iino,Masashi Okada,Tadahiro Taniguchi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:edge detection, image-generation foundation models, propose EasyControlEdge, image-generation foundation, edge
备注:
点击查看摘要
Abstract:We propose EasyControlEdge, adapting an image-generation foundation model to edge detection. In real-world edge detection (e.g., floor-plan walls, satellite roads/buildings, and medical organ boundaries), crispness and data efficiency are crucial, yet producing crisp raw edge maps with limited training samples remains challenging. Although image-generation foundation models perform well on many downstream tasks, their pretrained priors for data-efficient transfer and iterative refinement for high-frequency detail preservation remain underexploited for edge detection. To enable crisp and data-efficient edge detection using these capabilities, we introduce an edge-specialized adaptation of image-generation foundation models. To better specialize the foundation model for edge detection, we incorporate an edge-oriented objective with an efficient pixel-space loss. At inference, we introduce guidance based on unconditional dynamics, enabling a single model to control the edge density through a guidance scale. Experiments on BSDS500, NYUDv2, BIPED, and CubiCasa compare against state-of-the-art methods and show consistent gains, particularly under no-post-processing crispness evaluation and with limited training data.
31. 【2602.16231】DataCube: A Video Retrieval Platform via Natural Language Semantic Profiling
链接:https://arxiv.org/abs/2602.16231
作者:Yiming Ju,Hanyu Zhao,Quanyue Ma,Donglin Hao,Chengwei Wu,Ming Li,Songjing Wang,Tengfei Pan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large-scale video repositories, modern video understanding, Large-scale video, generation tasks, understanding and generation
备注: This paper is under review for the IJCAI-ECAI 2026 Demonstrations Track
点击查看摘要
Abstract:Large-scale video repositories are increasingly available for modern video understanding and generation tasks. However, transforming raw videos into high-quality, task-specific datasets remains costly and inefficient. We present DataCube, an intelligent platform for automatic video processing, multi-dimensional profiling, and query-driven retrieval. DataCube constructs structured semantic representations of video clips and supports hybrid retrieval with neural re-ranking and deep semantic matching. Through an interactive web interface, users can efficiently construct customized video subsets from massive repositories for training, analysis, and evaluation, and build searchable systems over their own private video collections. The system is publicly accessible at this https URL. Demo Video: this https URL
32. 【2602.16213】Graph neural network for colliding particles with an application to sea ice floe modeling
链接:https://arxiv.org/abs/2602.16213
作者:Ruibiao Zhu
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computational Physics (physics.comp-ph)
关键词:Graph Neural Networks, natural graph structure, nodes represent individual, Graph Neural, Neural Networks
备注:
点击查看摘要
Abstract:This paper introduces a novel approach to sea ice modeling using Graph Neural Networks (GNNs), utilizing the natural graph structure of sea ice, where nodes represent individual ice pieces, and edges model the physical interactions, including collisions. This concept is developed within a one-dimensional framework as a foundational step. Traditional numerical methods, while effective, are computationally intensive and less scalable. By utilizing GNNs, the proposed model, termed the Collision-captured Network (CN), integrates data assimilation (DA) techniques to effectively learn and predict sea ice dynamics under various conditions. The approach was validated using synthetic data, both with and without observed data points, and it was found that the model accelerates the simulation of trajectories without compromising accuracy. This advancement offers a more efficient tool for forecasting in marginal ice zones (MIZ) and highlights the potential of combining machine learning with data assimilation for more effective and efficient modeling.
33. 【2602.16160】Uncertainty-Guided Inference-Time Depth Adaptation for Transformer-Based Visual Tracking
链接:https://arxiv.org/abs/2602.16160
作者:Patrick Poggi,Divake Kumar,Theja Tulabandhula,Amit Ranjan Trivedi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:incurring unnecessary computational, unnecessary computational cost, single-object trackers achieve, Transformer-based single-object trackers, temporally coherent frames
备注: Submitted to IJCNN 2026
点击查看摘要
Abstract:Transformer-based single-object trackers achieve state-of-the-art accuracy but rely on fixed-depth inference, executing the full encoder--decoder stack for every frame regardless of visual complexity, thereby incurring unnecessary computational cost in long video sequences dominated by temporally coherent frames. We propose UncL-STARK, an architecture-preserving approach that enables dynamic, uncertainty-aware depth adaptation in transformer-based trackers without modifying the underlying network or adding auxiliary heads. The model is fine-tuned to retain predictive robustness at multiple intermediate depths using random-depth training with knowledge distillation, thus enabling safe inference-time truncation. At runtime, we derive a lightweight uncertainty estimate directly from the model's corner localization heatmaps and use it in a feedback-driven policy that selects the encoder and decoder depth for the next frame based on the prediction confidence by exploiting temporal coherence in video. Extensive experiments on GOT-10k and LaSOT demonstrate up to 12\% GFLOPs reduction, 8.9\% latency reduction, and 10.8\% energy savings while maintaining tracking accuracy within 0.2\% of the full-depth baseline across both short-term and long-term sequences.
34. 【2602.16149】Evaluating Demographic Misrepresentation in Image-to-Image Portrait Editing
链接:https://arxiv.org/abs/2602.16149
作者:Huichan Seo,Minki Hong,Sieun Choi,Jihie Kim,Jean Oh
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:editing remain underexplored, remain underexplored, Demographic bias, Soft Erasure, Stereotype Replacement
备注: 19 pages, 13 figures. Preprint
点击查看摘要
Abstract:Demographic bias in text-to-image (T2I) generation is well studied, yet demographic-conditioned failures in instruction-guided image-to-image (I2I) editing remain underexplored. We examine whether identical edit instructions yield systematically different outcomes across subject demographics in open-weight I2I editors. We formalize two failure modes: Soft Erasure, where edits are silently weakened or ignored in the output image, and Stereotype Replacement, where edits introduce unrequested, stereotype-consistent attributes. We introduce a controlled benchmark that probes demographic-conditioned behavior by generating and editing portraits conditioned on race, gender, and age using a diagnostic prompt set, and evaluate multiple editors with vision-language model (VLM) scoring and human evaluation. Our analysis shows that identity preservation failures are pervasive, demographically uneven, and shaped by implicit social priors, including occupation-driven gender inference. Finally, we demonstrate that a prompt-level identity constraint, without model updates, can substantially reduce demographic change for minority groups while leaving majority-group portraits largely unchanged, revealing asymmetric identity priors in current editors. Together, our findings establish identity preservation as a central and demographically uneven failure mode in I2I editing and motivate demographic-robust editing systems. Project page: this https URL
35. 【2602.16138】IRIS: Intent Resolution via Inference-time Saccades for Open-Ended VQA in Large Vision-Language Models
链接:https://arxiv.org/abs/2602.16138
作者:Parsa Madinei,Srijita Karmakar,Russell Cohen Hoffing,Felix Gervitz,Miguel P. Eckstein
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Intent Resolution, Inference-time Saccades, Resolution via Inference-time, introduce IRIS, resolve ambiguity
备注:
点击查看摘要
Abstract:We introduce IRIS (Intent Resolution via Inference-time Saccades), a novel training-free approach that uses eye-tracking data in real-time to resolve ambiguity in open-ended VQA. Through a comprehensive user study with 500 unique image-question pairs, we demonstrate that fixations closest to the time participants start verbally asking their questions are the most informative for disambiguation in Large VLMs, more than doubling the accuracy of responses on ambiguous questions (from 35.2% to 77.2%) while maintaining performance on unambiguous queries. We evaluate our approach across state-of-the-art VLMs, showing consistent improvements when gaze data is incorporated in ambiguous image-question pairs, regardless of architectural differences. We release a new benchmark dataset to use eye movement data for disambiguated VQA, a novel real-time interactive protocol, and an evaluation suite.
36. 【2602.16132】CHAI: CacHe Attention Inference for text2video
链接:https://arxiv.org/abs/2602.16132
作者:Joel Mathew Cherian,Ashutosh Muralidhara Bharadwaj,Vima Gupta,Anand Padmanabha Iyer
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:deliver impressive results, diffusion models deliver, models deliver impressive, deliver impressive, impressive results
备注:
点击查看摘要
Abstract:Text-to-video diffusion models deliver impressive results but remain slow because of the sequential denoising of 3D latents. Existing approaches to speed up inference either require expensive model retraining or use heuristic-based step skipping, which struggles to maintain video quality as the number of denoising steps decreases. Our work, CHAI, aims to use cross-inference caching to reduce latency while maintaining video quality. We introduce Cache Attention as an effective method for attending to shared objects/scenes across cross-inference latents. This selective attention mechanism enables effective reuse of cached latents across semantically related prompts, yielding high cache hit rates. We show that it is possible to generate high-quality videos using Cache Attention with as few as 8 denoising steps. When integrated into the overall system, CHAI is 1.65x - 3.35x faster than baseline OpenSora 1.2 while maintaining video quality.
37. 【2602.16110】OmniCT: Towards a Unified Slice-Volume LVLM for Comprehensive CT Analysis
链接:https://arxiv.org/abs/2602.16110
作者:Tianwei Lin,Zhongwei Qiu,Wenqiao Zhang,Jiang Liu,Yihan Xie,Mingjian Gao,Zhenxuan Fan,Zhaocheng Li,Sijing Li,Zhongle Xie,Peng LU,Yueting Zhuang,Yingda Xia,Ling Zhang,Beng Chin Ooi
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Computed Tomography, covering critical organs, information-dense imaging modalities, diagnostically information-dense imaging, covering critical
备注:
点击查看摘要
Abstract:Computed Tomography (CT) is one of the most widely used and diagnostically information-dense imaging modalities, covering critical organs such as the heart, lungs, liver, and colon. Clinical interpretation relies on both slice-driven local features (e.g., sub-centimeter nodules, lesion boundaries) and volume-driven spatial representations (e.g., tumor infiltration, inter-organ anatomical relations). However, existing Large Vision-Language Models (LVLMs) remain fragmented in CT slice versus volumetric understanding: slice-driven LVLMs show strong generalization but lack cross-slice spatial consistency, while volume-driven LVLMs explicitly capture volumetric semantics but suffer from coarse granularity and poor compatibility with slice inputs. The absence of a unified modeling paradigm constitutes a major bottleneck for the clinical translation of medical LVLMs. We present OmniCT, a powerful unified slice-volume LVLM for CT scenarios, which makes three contributions: (i) Spatial Consistency Enhancement (SCE): volumetric slice composition combined with tri-axial positional embedding that introduces volumetric consistency, and an MoE hybrid projection enables efficient slice-volume adaptation; (ii) Organ-level Semantic Enhancement (OSE): segmentation and ROI localization explicitly align anatomical regions, emphasizing lesion- and organ-level semantics; (iii) MedEval-CT: the largest slice-volume CT dataset and hybrid benchmark integrates comprehensive metrics for unified evaluation. OmniCT consistently outperforms existing methods with a substantial margin across diverse clinical tasks and satisfies both micro-level detail sensitivity and macro-level spatial reasoning. More importantly, it establishes a new paradigm for cross-modal medical imaging understanding.
38. 【2602.16086】LGQ: Learning Discretization Geometry for Scalable and Stable Image Tokenization
链接:https://arxiv.org/abs/2602.16086
作者:Idil Bilge Altun,Mert Onur Cakiroglu,Elham Buxton,Mehmet Dalkilic,Hasan Kurban
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:scalable visual generation, efficient latent-space priors, preserving semantic structure, discrete capacity effectively, Discrete image tokenization
备注:
点击查看摘要
Abstract:Discrete image tokenization is a key bottleneck for scalable visual generation: a tokenizer must remain compact for efficient latent-space priors while preserving semantic structure and using discrete capacity effectively. Existing quantizers face a trade-off: vector-quantized tokenizers learn flexible geometries but often suffer from biased straight-through optimization, codebook under-utilization, and representation collapse at large vocabularies. Structured scalar or implicit tokenizers ensure stable, near-complete utilization by design, yet rely on fixed discretization geometries that may allocate capacity inefficiently under heterogeneous latent statistics. We introduce Learnable Geometric Quantization (LGQ), a discrete image tokenizer that learns discretization geometry end-to-end. LGQ replaces hard nearest-neighbor lookup with temperature-controlled soft assignments, enabling fully differentiable training while recovering hard assignments at inference. The assignments correspond to posterior responsibilities of an isotropic Gaussian mixture and minimize a variational free-energy objective, provably converging to nearest-neighbor quantization in the low-temperature limit. LGQ combines a token-level peakedness regularizer with a global usage regularizer to encourage confident yet balanced code utilization without imposing rigid grids. Under a controlled VQGAN-style backbone on ImageNet across multiple vocabulary sizes, LGQ achieves stable optimization and balanced utilization. At 16K codebook size, LGQ improves rFID by 11.88% over FSQ while using 49.96% fewer active codes, and improves rFID by 6.06% over SimVQ with 49.45% lower effective representation rate, achieving comparable fidelity with substantially fewer active entries. Our GitHub repository is available at: this https URL
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:
arXiv:2602.16086 [cs.CV]
(or
arXiv:2602.16086v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2602.16086
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
39. 【2602.16057】Extracting and Analyzing Rail Crossing Behavior Signatures from Videos using Tensor Methods
链接:https://arxiv.org/abs/2602.16057
作者:Dawon Ahn,Het Patel,Aemal Khattak,Jia Chen,Evangelos E. Papalexakis
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:driver behavior varies, crossings present complex, present complex safety, complex safety challenges, present complex
备注: 6 pages, 10 figures. Accepted at InnovaRail 2026
点击查看摘要
Abstract:Railway crossings present complex safety challenges where driver behavior varies by location, time, and conditions. Traditional approaches analyze crossings individually, limiting the ability to identify shared behavioral patterns across locations. We propose a multi-view tensor decomposition framework that captures behavioral similarities across three temporal phases: Approach (warning activation to gate lowering), Waiting (gates down to train passage), and Clearance (train passage to gate raising). We analyze railway crossing videos from multiple locations using TimeSformer embeddings to represent each phase. By constructing phase-specific similarity matrices and applying non-negative symmetric CP decomposition, we discover latent behavioral components with distinct temporal signatures. Our tensor analysis reveals that crossing location appears to be a stronger determinant of behavior patterns than time of day, and that approach-phase behavior provides particularly discriminative signatures. Visualization of the learned component space confirms location-based clustering, with certain crossings forming distinct behavioral clusters. This automated framework enables scalable pattern discovery across multiple crossings, providing a foundation for grouping locations by behavioral similarity to inform targeted safety interventions.
40. 【2602.16019】MedProbCLIP: Probabilistic Adaptation of Vision-Language Foundation Model for Reliable Radiograph-Report Retrieval
链接:https://arxiv.org/abs/2602.16019
作者:Ahmad Elallaf,Yu Zhang,Yuktha Priya Masupalli,Jeong Yang,Young Lee,Zechun Cao,Gongbo Liang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:high-stakes biomedical applications, general-purpose representation learners, powerful general-purpose representation, Vision-language foundation models, multimodal understanding
备注: Accepted to the 2026 Winter Conference on Applications of Computer Vision (WACV) Workshops
点击查看摘要
Abstract:Vision-language foundation models have emerged as powerful general-purpose representation learners with strong potential for multimodal understanding, but their deterministic embeddings often fail to provide the reliability required for high-stakes biomedical applications. This work introduces MedProbCLIP, a probabilistic vision-language learning framework for chest X-ray and radiology report representation learning and bidirectional retrieval. MedProbCLIP models image and text representations as Gaussian embeddings through a probabilistic contrastive objective that explicitly captures uncertainty and many-to-many correspondences between radiographs and clinical narratives. A variational information bottleneck mitigates overconfident predictions, while MedProbCLIP employs multi-view radiograph encoding and multi-section report encoding during training to provide fine-grained supervision for clinically aligned correspondence, yet requires only a single radiograph and a single report at inference. Evaluated on the MIMIC-CXR dataset, MedProbCLIP outperforms deterministic and probabilistic baselines, including CLIP, CXR-CLIP, and PCME++, in both retrieval and zero-shot classification. Beyond accuracy, MedProbCLIP demonstrates superior calibration, risk-coverage behavior, selective retrieval reliability, and robustness to clinically relevant corruptions, underscoring the value of probabilistic vision-language modeling for improving the trustworthiness and safety of radiology image-text retrieval systems.
41. 【2602.16006】BTReport: A Framework for Brain Tumor Radiology Report Generation with Clinically Relevant Features
链接:https://arxiv.org/abs/2602.16006
作者:Juampablo E. Heras Rivera,Dickson T. Chen,Tianyi Ren,Daniel K. Low,Asma Ben Abacha,Alberto Santamaria-Pang,Mehmet Kurt
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:open paired image-report, Recent advances, large paired image-text, paired image-text datasets, paired image-report datasets
备注: Accepted to Medical Imaging with Deep Learning (MIDL) 2026
点击查看摘要
Abstract:Recent advances in radiology report generation (RRG) have been driven by large paired image-text datasets; however, progress in neuro-oncology has been limited due to a lack of open paired image-report datasets. Here, we introduce BTReport, an open-source framework for brain tumor RRG that constructs natural language radiology reports using deterministically extracted imaging features. Unlike existing approaches that rely on large general-purpose or fine-tuned vision-language models for both image interpretation and report composition, BTReport performs deterministic feature extraction for image analysis and uses large language models only for syntactic structuring and narrative formatting. By separating RRG into a deterministic feature extraction step and a report generation step, the generated reports are completely interpretable and less prone to hallucinations. We show that the features used for report generation are predictive of key clinical outcomes, including survival and IDH mutation status, and reports generated by BTReport are more closely aligned with reference clinical reports than existing baselines for RRG. Finally, we introduce BTReport-BraTS, a companion dataset that augments BraTS imaging with synthetically generated radiology reports produced with BTReport. Code for this project can be found at this https URL.
42. 【2602.15989】SAM 3D Body: Robust Full-Body Human Mesh Recovery
链接:https://arxiv.org/abs/2602.15989
作者:Xitong Yang,Devansh Kukreja,Don Pinkus,Anushka Sagar,Taosha Fan,Jinhyung Park,Soyong Shin,Jinkun Cao,Jiawei Liu,Nicolas Ugrinovic,Matt Feiszli,Jitendra Malik,Piotr Dollar,Kris Kitani
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:human mesh recovery, Momentum Human Rig, single-image full-body, accuracy in diverse, consistent accuracy
备注: Code: [this https URL](https://github.com/facebookresearch/sam-3d-body)
点击查看摘要
Abstract:We introduce SAM 3D Body (3DB), a promptable model for single-image full-body 3D human mesh recovery (HMR) that demonstrates state-of-the-art performance, with strong generalization and consistent accuracy in diverse in-the-wild conditions. 3DB estimates the human pose of the body, feet, and hands. It is the first model to use a new parametric mesh representation, Momentum Human Rig (MHR), which decouples skeletal structure and surface shape. 3DB employs an encoder-decoder architecture and supports auxiliary prompts, including 2D keypoints and masks, enabling user-guided inference similar to the SAM family of models. We derive high-quality annotations from a multi-stage annotation pipeline that uses various combinations of manual keypoint annotation, differentiable optimization, multi-view geometry, and dense keypoint detection. Our data engine efficiently selects and processes data to ensure data diversity, collecting unusual poses and rare imaging conditions. We present a new evaluation dataset organized by pose and appearance categories, enabling nuanced analysis of model behavior. Our experiments demonstrate superior generalization and substantial improvements over prior methods in both qualitative user preference studies and traditional quantitative analysis. Both 3DB and MHR are open-source.
43. 【2602.15973】LAND: A Longitudinal Analysis of Neuromorphic Datasets
链接:https://arxiv.org/abs/2602.15973
作者:Gregory Cohen,Alexandre Marcireau
类目:Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
关键词:neuromorphic datasets, datasets, neuromorphic datasets published, Neuromorphic, existing neuromorphic datasets
备注: The LAND dataset tool can be accessed via [this https URL](https://neuromorphicsystems.github.io/land/)
点击查看摘要
Abstract:Neuromorphic engineering has a data problem. Despite the meteoric rise in the number of neuromorphic datasets published over the past ten years, the conclusion of a significant portion of neuromorphic research papers still states that there is a need for yet more data and even larger datasets. Whilst this need is driven in part by the sheer volume of data required by modern deep learning approaches, it is also fuelled by the current state of the available neuromorphic datasets and the difficulties in finding them, understanding their purpose, and determining the nature of their underlying task. This is further compounded by practical difficulties in downloading and using these datasets. This review starts by capturing a snapshot of the existing neuromorphic datasets, covering over 423 datasets, and then explores the nature of their tasks and the underlying structure of the presented data. Analysing these datasets shows the difficulties arising from their size, the lack of standardisation, and difficulties in accessing the actual data. This paper also highlights the growth in the size of individual datasets and the complexities involved in working with the data. However, a more important concern is the rise of synthetic datasets, created by either simulation or video-to-events methods. This review explores the benefits of simulated data for testing existing algorithms and applications, highlighting the potential pitfalls for exploring new applications of neuromorphic technologies. This review also introduces the concepts of meta-datasets, created from existing datasets, as a way of both reducing the need for more data, and to remove potential bias arising from defining both the dataset and the task.
44. 【2602.15971】B-DENSE: Branching For Dense Ensemble Network Learning
链接:https://arxiv.org/abs/2602.15971
作者:Cherish Puniani,Tushar Kumar,Arnav Bendre,Gaurav Kumar,Shree Singhi
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
关键词:Inspired by non-equilibrium, non-equilibrium thermodynamics, performance in generative, generative modeling, Inspired
备注: 11 pages, 5 figures, 4 algorithms and 2 tables. Submitted to iclr 2026 delta workshop and still under review
点击查看摘要
Abstract:Inspired by non-equilibrium thermodynamics, diffusion models have achieved state-of-the-art performance in generative modeling. However, their iterative sampling nature results in high inference latency. While recent distillation techniques accelerate sampling, they discard intermediate trajectory steps. This sparse supervision leads to a loss of structural information and introduces significant discretization errors. To mitigate this, we propose B-DENSE, a novel framework that leverages multi-branch trajectory alignment. We modify the student architecture to output $K$-fold expanded channels, where each subset corresponds to a specific branch representing a discrete intermediate step in the teacher's trajectory. By training these branches to simultaneously map to the entire sequence of the teacher's target timesteps, we enforce dense intermediate trajectory alignment. Consequently, the student model learns to navigate the solution space from the earliest stages of training, demonstrating superior image generation quality compared to baseline distillation frameworks.
45. 【2602.15967】Non-Contact Physiological Monitoring in Pediatric Intensive Care Units via Adaptive Masking and Self-Supervised Learning
链接:https://arxiv.org/abs/2602.15967
作者:Mohamed Khalil Ben Salah,Philippe Jouvet,Rita Noumeir
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Intensive Care Units, Pediatric Intensive Care, Care Units, Intensive Care, Continuous monitoring
备注:
点击查看摘要
Abstract:Continuous monitoring of vital signs in Pediatric Intensive Care Units (PICUs) is essential for early detection of clinical deterioration and effective clinical decision-making. However, contact-based sensors such as pulse oximeters may cause skin irritation, increase infection risk, and lead to patient discomfort. Remote photoplethysmography (rPPG) offers a contactless alternative to monitor heart rate using facial video, but remains underutilized in PICUs due to motion artifacts, occlusions, variable lighting, and domain shifts between laboratory and clinical data. We introduce a self-supervised pretraining framework for rPPG estimation in the PICU setting, based on a progressive curriculum strategy. The approach leverages the VisionMamba architecture and integrates an adaptive masking mechanism, where a lightweight Mamba-based controller assigns spatiotemporal importance scores to guide probabilistic patch sampling. This strategy dynamically increases reconstruction difficulty while preserving physiological relevance. To address the lack of labeled clinical data, we adopt a teacher-student distillation setup. A supervised expert model, trained on public datasets, provides latent physiological guidance to the student. The curriculum progresses through three stages: clean public videos, synthetic occlusion scenarios, and unlabeled videos from 500 pediatric patients. Our framework achieves a 42% reduction in mean absolute error relative to standard masked autoencoders and outperforms PhysFormer by 31%, reaching a final MAE of 3.2 bpm. Without explicit region-of-interest extraction, the model consistently attends to pulse-rich areas and demonstrates robustness under clinical occlusions and noise.
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2602.15967 [cs.CV]
(or
arXiv:2602.15967v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2602.15967
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Mohamed Khalil Ben Salah [view email] [v1]
Tue, 17 Feb 2026 19:34:50 UTC (8,263 KB)
46. 【2602.15962】Automated Re-Identification of Holstein-Friesian Cattle in Dense Crowds
链接:https://arxiv.org/abs/2602.15962
作者:Phoenix Yu,Tilo Burghardt,Andrew W Dowsey,Neill W Campbell
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:methods capture individuals, methods capture, spatially separate, capture individuals, targets are spatially
备注: 32 pages, 13 figures, 5 tables
点击查看摘要
Abstract:Holstein-Friesian detection and re-identification (Re-ID) methods capture individuals well when targets are spatially separate. However, existing approaches, including YOLO-based species detection, break down when cows group closely together. This is particularly prevalent for species which have outline-breaking coat patterns. To boost both effectiveness and transferability in this setting, we propose a new detect-segment-identify pipeline that leverages the Open-Vocabulary Weight-free Localisation and the Segment Anything models as pre-processing stages alongside Re-ID networks. To evaluate our approach, we publish a collection of nine days CCTV data filmed on a working dairy farm. Our methodology overcomes detection breakdown in dense animal groupings, resulting in a 98.93% accuracy. This significantly outperforms current oriented bounding box-driven, as well as SAM species detection baselines with accuracy improvements of 47.52% and 27.13%, respectively. We show that unsupervised contrastive learning can build on this to yield 94.82% Re-ID accuracy on our test data. Our work demonstrates that Re-ID in crowded scenarios is both practical as well as reliable in working farm settings with no manual intervention. Code and dataset are provided for reproducibility.
47. 【2602.15959】Position-Aware Scene-Appearance Disentanglement for Bidirectional Photoacoustic Microscopy Registration
链接:https://arxiv.org/abs/2602.15959
作者:Yiwen Wang,Jiahao Qin
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:High-speed optical-resolution photoacoustic, optical-resolution photoacoustic microscopy, backward scan lines, bidirectional raster scanning, raster scanning doubles
备注: 10 pages, 5 figures
点击查看摘要
Abstract:High-speed optical-resolution photoacoustic microscopy (OR-PAM) with bidirectional raster scanning doubles imaging speed but introduces coupled domain shift and geometric misalignment between forward and backward scan lines. Existing registration methods, constrained by brightness constancy assumptions, achieve limited alignment quality, while recent generative approaches address domain shift through complex architectures that lack temporal awareness across frames. We propose GPEReg-Net, a scene-appearance disentanglement framework that separates domain-invariant scene features from domain-specific appearance codes via Adaptive Instance Normalization (AdaIN), enabling direct image-to-image registration without explicit deformation field estimation. To exploit temporal structure in sequential acquisitions, we introduce a Global Position Encoding (GPE) module that combines learnable position embeddings with sinusoidal encoding and cross-frame attention, allowing the network to leverage context from neighboring frames for improved temporal coherence. On the OR-PAM-Reg-4K benchmark (432 test samples), GPEReg-Net achieves NCC of 0.953, SSIM of 0.932, and PSNR of 34.49dB, surpassing the state-of-the-art by 3.8% in SSIM and 1.99dB in PSNR while maintaining competitive NCC. Code is available at this https URL.
48. 【2602.15958】DocSplit: A Comprehensive Benchmark Dataset and Evaluation Approach for Document Packet Recognition and Splitting
链接:https://arxiv.org/abs/2602.15958
作者:Md Mofijul Islam,Md Sirajus Salekin,Nivedha Balakrishnan,Vincil C. Bishop III,Niharika Jain,Spencer Romo,Bob Strahan,Boyi Xie,Diego A. Socolinsky
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:multiple documents stitched, document packet, multi-page document packets, document packet splitting, Document
备注:
点击查看摘要
Abstract:Document understanding in real-world applications often requires processing heterogeneous, multi-page document packets containing multiple documents stitched together. Despite recent advances in visual document understanding, the fundamental task of document packet splitting, which involves separating a document packet into individual units, remains largely unaddressed. We present the first comprehensive benchmark dataset, DocSplit, along with novel evaluation metrics for assessing the document packet splitting capabilities of large language models. DocSplit comprises five datasets of varying complexity, covering diverse document types, layouts, and multimodal settings. We formalize the DocSplit task, which requires models to identify document boundaries, classify document types, and maintain correct page ordering within a document packet. The benchmark addresses real-world challenges, including out-of-order pages, interleaved documents, and documents lacking clear demarcations. We conduct extensive experiments evaluating multimodal LLMs on our datasets, revealing significant performance gaps in current models' ability to handle complex document splitting tasks. The DocSplit benchmark datasets and proposed novel evaluation metrics provide a systematic framework for advancing document understanding capabilities essential for legal, financial, healthcare, and other document-intensive domains. We release the datasets to facilitate future research in document packet processing.
49. 【2602.15950】Can Vision-Language Models See Squares? Text-Recognition Mediates Spatial Reasoning Across Three Model Families
链接:https://arxiv.org/abs/2602.15950
作者:Yuval Levental
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:lack textual identity, accurately localize filled, cells lack textual, localize filled cells, textual identity
备注: 9 pages, 3 figures, 2 tables. Workshop-length paper
点击查看摘要
Abstract:We present a simple experiment that exposes a fundamental limitation in vision-language models (VLMs): the inability to accurately localize filled cells in binary grids when those cells lack textual identity. We generate fifteen 15x15 grids with varying density (10.7%-41.8% filled cells) and render each as two image types -- text symbols (. and #) and filled squares without gridlines -- then ask three frontier VLMs (Claude Opus, ChatGPT 5.2, and Gemini 3 Thinking) to transcribe them. In the text-symbol condition, Claude and ChatGPT achieve approximately 91% cell accuracy and 84% F1, while Gemini achieves 84% accuracy and 63% F1. In the filled-squares condition, all three models collapse to 60-73% accuracy and 29-39% F1. Critically, all conditions pass through the same visual encoder -- the text symbols are images, not tokenized text. The text-vs-squares F1 gap ranges from 34 to 54 points across models, demonstrating that VLMs behave as if they possess a high-fidelity text-recognition pathway for spatial reasoning that dramatically outperforms their native visual pathway. Each model exhibits a distinct failure mode in the squares condition -- systematic under-counting (Claude), massive over-counting (ChatGPT), and template hallucination (Gemini) -- but all share the same underlying deficit: severely degraded spatial localization for non-textual visual elements.
50. 【2602.15927】Visual Memory Injection Attacks for Multi-Turn Conversations
链接:https://arxiv.org/abs/2602.15927
作者:Christian Schlarmann,Matthias Hein
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Generative large vision-language, large vision-language models, impressive performance gains, recently achieved impressive, achieved impressive performance
备注:
点击查看摘要
Abstract:Generative large vision-language models (LVLMs) have recently achieved impressive performance gains, and their user base is growing rapidly. However, the security of LVLMs, in particular in a long-context multi-turn setting, is largely underexplored. In this paper, we consider the realistic scenario in which an attacker uploads a manipulated image to the web/social media. A benign user downloads this image and uses it as input to the LVLM. Our novel stealthy Visual Memory Injection (VMI) attack is designed such that on normal prompts the LVLM exhibits nominal behavior, but once the user gives a triggering prompt, the LVLM outputs a specific prescribed target message to manipulate the user, e.g. for adversarial marketing or political persuasion. Compared to previous work that focused on single-turn attacks, VMI is effective even after a long multi-turn conversation with the user. We demonstrate our attack on several recent open-weight LVLMs. This article thereby shows that large-scale manipulation of users is feasible with perturbed images in multi-turn conversation settings, calling for better robustness of LVLMs against these attacks. We release the source code at this https URL
51. 【2602.15926】A Study on Real-time Object Detection using Deep Learning
链接:https://arxiv.org/abs/2602.15926
作者:Ankita Bose,Jayasravani Bhumireddy,Naveen N
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Augmented Reality, Virtual Reality, including human-computer interfaces, industrial automation healthcare, road traffic monitoring
备注: 34 pages, 18 figures
点击查看摘要
Abstract:Object detection has compelling applications over a range of domains, including human-computer interfaces, security and video surveillance, navigation and road traffic monitoring, transportation systems, industrial automation healthcare, the world of Augmented Reality (AR) and Virtual Reality (VR), environment monitoring and activity identification. Applications of real time object detection in all these areas provide dynamic analysis of the visual information that helps in immediate decision making. Furthermore, advanced deep learning algorithms leverage the progress in the field of object detection providing more accurate and efficient solutions. There are some outstanding deep learning algorithms for object detection which includes, Faster R CNN(Region-based Convolutional Neural Network),Mask R-CNN, Cascade R-CNN, YOLO (You Only Look Once), SSD (Single Shot Multibox Detector), RetinaNet etc. This article goes into great detail on how deep learning algorithms are used to enhance real time object recognition. It provides information on the different object detection models available, open benchmark datasets, and studies on the use of object detection models in a range of applications. Additionally, controlled studies are provided to compare various strategies and produce some illuminating findings. Last but not least, a number of encouraging challenges and approaches are offered as suggestions for further investigation in both relevant deep learning approaches and object recognition.
52. 【2602.15922】World Action Models are Zero-shot Policies
链接:https://arxiv.org/abs/2602.15922
作者:Seonghyeon Ye,Yunhao Ge,Kaiyuan Zheng,Shenyuan Gao,Sihyun Yu,George Kurian,Suneel Indupuru,You Liang Tan,Chuning Zhu,Jiannan Xiang,Ayaan Malik,Kyungmin Lee,William Liang,Nadun Ranawaka,Jiasheng Gu,Yinzhen Xu,Guanzhi Wang,Fengyuan Hu,Avnish Narayan,Johan Bjorck,Jing Wang,Gwanghyun Kim,Dantong Niu,Ruijie Zheng,Yuqi Xie,Jimmy Wu,Qi Wang,Ryan Julian,Danfei Xu,Yilun Du,Yevgen Chebotar,Scott Reed,Jan Kautz,Yuke Zhu,Linxi "Jim" Fan,Joel Jang
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:excel at semantic, struggle to generalize, World Action Model, models excel, unseen physical motions
备注: Project page: [this https URL](https://dreamzero0.github.io/)
点击查看摘要
Abstract:State-of-the-art Vision-Language-Action (VLA) models excel at semantic generalization but struggle to generalize to unseen physical motions in novel environments. We introduce DreamZero, a World Action Model (WAM) built upon a pretrained video diffusion backbone. Unlike VLAs, WAMs learn physical dynamics by predicting future world states and actions, using video as a dense representation of how the world evolves. By jointly modeling video and action, DreamZero learns diverse skills effectively from heterogeneous robot data without relying on repetitive demonstrations. This results in over 2x improvement in generalization to new tasks and environments compared to state-of-the-art VLAs in real robot experiments. Crucially, through model and system optimizations, we enable a 14B autoregressive video diffusion model to perform real-time closed-loop control at 7Hz. Finally, we demonstrate two forms of cross-embodiment transfer: video-only demonstrations from other robots or humans yield a relative improvement of over 42% on unseen task performance with just 10-20 minutes of data. More surprisingly, DreamZero enables few-shot embodiment adaptation, transferring to a new embodiment with only 30 minutes of play data while retaining zero-shot generalization.
53. 【2602.15918】EarthSpatialBench: Benchmarking Spatial Reasoning Capabilities of Multimodal LLMs on Earth Imagery
链接:https://arxiv.org/abs/2602.15918
作者:Zelin Xu,Yupu Zhang,Saugat Adhikari,Saiful Islam,Tingsong Xiao,Zibo Liu,Shigang Chen,Da Yan,Zhe Jiang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:multimodal large language, attracted growing interest, computer vision due, require precise interaction, Benchmarking spatial reasoning
备注:
点击查看摘要
Abstract:Benchmarking spatial reasoning in multimodal large language models (MLLMs) has attracted growing interest in computer vision due to its importance for embodied AI and other agentic systems that require precise interaction with the physical world. However, spatial reasoning on Earth imagery has lagged behind, as it uniquely involves grounding objects in georeferenced images and quantitatively reasoning about distances, directions, and topological relations using both visual cues and vector geometry coordinates (e.g., 2D bounding boxes, polylines, and polygons). Existing benchmarks for Earth imagery primarily focus on 2D spatial grounding, image captioning, and coarse spatial relations (e.g., simple directional or proximity cues). They lack support for quantitative direction and distance reasoning, systematic topological relations, and complex object geometries beyond bounding boxes. To fill this gap, we propose \textbf{EarthSpatialBench}, a comprehensive benchmark for evaluating spatial reasoning in MLLMs on Earth imagery. The benchmark contains over 325K question-answer pairs spanning: (1) qualitative and quantitative reasoning about spatial distance and direction; (2) systematic topological relations; (3) single-object queries, object-pair queries, and compositional aggregate group queries; and (4) object references expressed via textual descriptions, visual overlays, and explicit geometry coordinates, including 2D bounding boxes, polylines, and polygons. We conducted extensive experiments on both open-source and proprietary models to identify limitations in the spatial reasoning of MLLMs.
54. 【2602.15915】MaS-VQA: A Mask-and-Select Framework for Knowledge-Based Visual Question Answering
链接:https://arxiv.org/abs/2602.15915
作者:Xianwei Mao,Kai Ye,Sheng Zhou,Nan Zhang,Haikuan Huang,Bin Li,Jiajun Bu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Knowledge-based Visual Question, Visual Question Answering, Question Answering, integrating visual information, Knowledge-based Visual
备注:
点击查看摘要
Abstract:Knowledge-based Visual Question Answering (KB-VQA) requires models to answer questions by integrating visual information with external knowledge. However, retrieved knowledge is often noisy, partially irrelevant, or misaligned with the visual content, while internal model knowledge is difficult to control and interpret. Naive aggregation of these sources limits reasoning effectiveness and reduces answer accuracy. To address this, we propose MaS-VQA, a selection-driven framework that tightly couples explicit knowledge filtering with implicit knowledge reasoning. MaS-VQA first retrieves candidate passages and applies a Mask-and-Select mechanism to jointly prune irrelevant image regions and weakly relevant knowledge fragments, producing compact, high-signal multimodal knowledge . This filtered knowledge then guides the activation of internal knowledge in a constrained semantic space, enabling complementary co-modeling of explicit and implicit knowledge for robust answer prediction. Experiments on Encyclopedic-VQA and InfoSeek demonstrate consistent performance gains across multiple MLLM backbones, and ablations verify that the selection mechanism effectively reduces noise and enhances knowledge utilization.
55. 【2602.15904】A Comprehensive Survey on Deep Learning-Based LiDAR Super-Resolution for Autonomous Driving
链接:https://arxiv.org/abs/2602.15904
作者:June Moh Goo,Zichao Zeng,Jan Boehm
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:miss critical details, sparse point clouds, produce sparse point, high-resolution sensors remain, sensors remain expensive
备注: Accepted to The IEEE Intelligent Vehicles Symposium 2026 (IEEE IV 2026)
点击查看摘要
Abstract:LiDAR sensors are often considered essential for autonomous driving, but high-resolution sensors remain expensive while affordable low-resolution sensors produce sparse point clouds that miss critical details. LiDAR super-resolution addresses this challenge by using deep learning to enhance sparse point clouds, bridging the gap between different sensor types and enabling cross-sensor compatibility in real-world deployments. This paper presents the first comprehensive survey of LiDAR super-resolution methods for autonomous driving. Despite the importance of practical deployment, no systematic review has been conducted until now. We organize existing approaches into four categories: CNN-based architectures, model-based deep unrolling, implicit representation methods, and Transformer and Mamba-based approaches. We establish fundamental concepts including data representations, problem formulation, benchmark datasets and evaluation metrics. Current trends include the adoption of range image representation for efficient processing, extreme model compression and the development of resolution-flexible architectures. Recent research prioritizes real-time inference and cross-sensor generalization for practical deployment. We conclude by identifying open challenges and future research directions for advancing LiDAR super-resolution technology.
56. 【2602.15903】Detecting Deepfakes with Multivariate Soft Blending and CLIP-based Image-Text Alignment
链接:https://arxiv.org/abs/2602.15903
作者:Jingwei Li,Jiaxin Tong,Pengfei Wu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:highly realistic facial, Soft Blending Augmentation, Forgery Intensity Estimation, realistic facial forgeries, facial forgeries necessitates
备注:
点击查看摘要
Abstract:The proliferation of highly realistic facial forgeries necessitates robust detection methods. However, existing approaches often suffer from limited accuracy and poor generalization due to significant distribution shifts among samples generated by diverse forgery techniques. To address these challenges, we propose a novel Multivariate and Soft Blending Augmentation with CLIP-guided Forgery Intensity Estimation (MSBA-CLIP) framework. Our method leverages the multimodal alignment capabilities of CLIP to capture subtle forgery traces. We introduce a Multivariate and Soft Blending Augmentation (MSBA) strategy that synthesizes images by blending forgeries from multiple methods with random weights, forcing the model to learn generalizable patterns. Furthermore, a dedicated Multivariate Forgery Intensity Estimation (MFIE) module is designed to explicitly guide the model in learning features related to varied forgery modes and intensities. Extensive experiments demonstrate state-of-the-art performance. On in-domain tests, our method improves Accuracy and AUC by 3.32\% and 4.02\%, respectively, over the best baseline. In cross-domain evaluations across five datasets, it achieves an average AUC gain of 3.27\%. Ablation studies confirm the efficacy of both proposed components. While the reliance on a large vision-language model entails higher computational cost, our work presents a significant step towards more generalizable and robust deepfake detection.
57. 【2602.15900】Adaptive Illumination Control for Robot Perception
链接:https://arxiv.org/abs/2602.15900
作者:Yash Turkar,Shekoufeh Sadeghi,Karthik Dantu
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:robust feature extraction, high dynamic range, improved downstream, feature extraction, perception under low
备注:
点击查看摘要
Abstract:Robot perception under low light or high dynamic range is usually improved downstream - via more robust feature extraction, image enhancement, or closed-loop exposure control. However, all of these approaches are limited by the image captured these conditions. An alternate approach is to utilize a programmable onboard light that adds to ambient illumination and improves captured images. However, it is not straightforward to predict its impact on image formation. Illumination interacts nonlinearly with depth, surface reflectance, and scene geometry. It can both reveal structure and induce failure modes such as specular highlights and saturation. We introduce Lightning, a closed-loop illumination-control framework for visual SLAM that combines relighting, offline optimization, and imitation learning. This is performed in three stages. First, we train a Co-Located Illumination Decomposition (CLID) relighting model that decomposes a robot observation into an ambient component and a light-contribution field. CLID enables physically consistent synthesis of the same scene under alternative light intensities and thereby creates dense multi-intensity training data without requiring us to repeatedly re-run trajectories. Second, using these synthesized candidates, we formulate an offline Optimal Intensity Schedule (OIS) problem that selects illumination levels over a sequence trading off SLAM-relevant image utility against power consumption and temporal smoothness. Third, we distill this ideal solution into a real-time controller through behavior cloning, producing an Illumination Control Policy (ILC) that generalizes beyond the initial training distribution and runs online on a mobile robot to command discrete light-intensity levels. Across our evaluation, Lightning substantially improves SLAM trajectory robustness while reducing unnecessary illumination power.
58. 【2602.15892】Egocentric Bias in Vision-Language Models
链接:https://arxiv.org/abs/2602.15892
作者:Maijunxian Wang,Yijiang Li,Bingyang Wang,Tianwei Zhao,Ran Ji,Qingying Gao,Emmy Liu,Hokin Deng,Dezhi Luo
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Visual perspective taking, Visual perspective, perspective taking, Visual, social cognition
备注:
点击查看摘要
Abstract:Visual perspective taking--inferring how the world appears from another's viewpoint--is foundational to social cognition. We introduce FlipSet, a diagnostic benchmark for Level-2 visual perspective taking (L2 VPT) in vision-language models. The task requires simulating 180-degree rotations of 2D character strings from another agent's perspective, isolating spatial transformation from 3D scene complexity. Evaluating 103 VLMs reveals systematic egocentric bias: the vast majority perform below chance, with roughly three-quarters of errors reproducing the camera viewpoint. Control experiments expose a compositional deficit--models achieve high theory-of-mind accuracy and above-chance mental rotation in isolation, yet fail catastrophically when integration is required. This dissociation indicates that current VLMs lack the mechanisms needed to bind social awareness to spatial operations, suggesting fundamental limitations in model-based spatial reasoning. FlipSet provides a cognitively grounded testbed for diagnosing perspective-taking capabilities in multimodal systems.
59. 【2602.15872】MARVL: Multi-Stage Guidance for Robotic Manipulation via Vision-Language Models
链接:https://arxiv.org/abs/2602.15872
作者:Xunlan Zhou,Xuanlin Chen,Shaowei Zhang,Xiangkun Li,ShengHua Wan,Xiaohai Hu,Yuan Lei,Le Gan,De-chuan Zhan
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Designing dense reward, robotic Reinforcement Learning, Reinforcement Learning, efficient robotic Reinforcement, Designing dense
备注:
点击查看摘要
Abstract:Designing dense reward functions is pivotal for efficient robotic Reinforcement Learning (RL). However, most dense rewards rely on manual engineering, which fundamentally limits the scalability and automation of reinforcement learning. While Vision-Language Models (VLMs) offer a promising path to reward design, naive VLM rewards often misalign with task progress, struggle with spatial grounding, and show limited understanding of task semantics. To address these issues, we propose MARVL-Multi-stAge guidance for Robotic manipulation via Vision-Language models. MARVL fine-tunes a VLM for spatial and semantic consistency and decomposes tasks into multi-stage subtasks with task direction projection for trajectory sensitivity. Empirically, MARVL significantly outperforms existing VLM-reward methods on the Meta-World benchmark, demonstrating superior sample efficiency and robustness on sparse-reward manipulation tasks.
60. 【2602.15864】ReasonNavi: Human-Inspired Global Map Reasoning for Zero-Shot Embodied Navigation
链接:https://arxiv.org/abs/2602.15864
作者:Yuzhuo Ao,Anbang Wang,Yu-Wing Tai,Chi-Keung Tang
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:partial egocentric observations, restrict global foresight, egocentric observations, inefficient exploration, Multimodal Large Language
备注: 18 pages, 6 figures, Project page: [this https URL](https://reasonnavi.github.io/)
点击查看摘要
Abstract:Embodied agents often struggle with efficient navigation because they rely primarily on partial egocentric observations, which restrict global foresight and lead to inefficient exploration. In contrast, humans plan using maps: we reason globally first, then act locally. We introduce ReasonNavi, a human-inspired framework that operationalizes this reason-then-act paradigm by coupling Multimodal Large Language Models (MLLMs) with deterministic planners. ReasonNavi converts a top-down map into a discrete reasoning space by room segmentation and candidate target nodes sampling. An MLLM is then queried in a multi-stage process to identify the candidate most consistent with the instruction (object, image, or text goal), effectively leveraging the model's semantic reasoning ability while sidestepping its weakness in continuous coordinate prediction. The selected waypoint is grounded into executable trajectories using a deterministic action planner over an online-built occupancy map, while pretrained object detectors and segmenters ensure robust recognition at the goal. This yields a unified zero-shot navigation framework that requires no MLLM fine-tuning, circumvents the brittleness of RL-based policies and scales naturally with foundation model improvements. Across three navigation tasks, ReasonNavi consistently outperforms prior methods that demand extensive training or heavy scene modeling, offering a scalable, interpretable, and globally grounded solution to embodied navigation. Project page: this https URL
61. 【2602.16422】Automated Histopathology Report Generation via Pyramidal Feature Extraction and the UNI Foundation Model
链接:https://arxiv.org/abs/2602.16422
作者:Ahmet Halici,Ece Tugba Cebeci,Musa Balci,Mustafa Cini,Serkan Sokmen
类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:domain specific language, Generating diagnostic text, Generating diagnostic, slide images, requirement for precise
备注: 9 pages. Equal contribution: Ahmet Halici, Ece Tugba Cebeci, Musa Balci
点击查看摘要
Abstract:Generating diagnostic text from histopathology whole slide images (WSIs) is challenging due to the gigapixel scale of the input and the requirement for precise, domain specific language. We propose a hierarchical vision language framework that combines a frozen pathology foundation model with a Transformer decoder for report generation. To make WSI processing tractable, we perform multi resolution pyramidal patch selection (downsampling factors 2^3 to 2^6) and remove background and artifacts using Laplacian variance and HSV based criteria. Patch features are extracted with the UNI Vision Transformer and projected to a 6 layer Transformer decoder that generates diagnostic text via cross attention. To better represent biomedical terminology, we tokenize the output using BioGPT. Finally, we add a retrieval based verification step that compares generated reports with a reference corpus using Sentence BERT embeddings; if a high similarity match is found, the generated report is replaced with the retrieved ground truth reference to improve reliability.
62. 【2602.16320】RefineFormer3D: Efficient 3D Medical Image Segmentation via Adaptive Multi-Scale Transformer with Cross Attention Fusion
链接:https://arxiv.org/abs/2602.16320
作者:Kavyansh Tyagi,Vishwas Rathi,Puneet Goyal
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Accurate and computationally, remains a critical, critical challenge, image segmentation remains, medical image segmentation
备注: 13 pages, 5 figures, 7 tables
点击查看摘要
Abstract:Accurate and computationally efficient 3D medical image segmentation remains a critical challenge in clinical workflows. Transformer-based architectures often demonstrate superior global contextual modeling but at the expense of excessive parameter counts and memory demands, restricting their clinical deployment. We propose RefineFormer3D, a lightweight hierarchical transformer architecture that balances segmentation accuracy and computational efficiency for volumetric medical imaging. The architecture integrates three key components: (i) GhostConv3D-based patch embedding for efficient feature extraction with minimal redundancy, (ii) MixFFN3D module with low-rank projections and depthwise convolutions for parameter-efficient feature extraction, and (iii) a cross-attention fusion decoder enabling adaptive multi-scale skip connection integration. RefineFormer3D contains only 2.94M parameters, substantially fewer than contemporary transformer-based methods. Extensive experiments on ACDC and BraTS benchmarks demonstrate that RefineFormer3D achieves 93.44\% and 85.9\% average Dice scores respectively, outperforming or matching state-of-the-art methods while requiring significantly fewer parameters. Furthermore, the model achieves fast inference (8.35 ms per volume on GPU) with low memory requirements, supporting deployment in resource-constrained clinical environments. These results establish RefineFormer3D as an effective and scalable solution for practical 3D medical image segmentation.
63. 【2602.15988】Automated Assessment of Kidney Ureteroscopy Exploration for Training
链接:https://arxiv.org/abs/2602.15988
作者:Fangjie Li,Nicholas Kavoussi,Charan Mohan,Matthieu Chabanas,Jie Ying Wu
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
关键词:steep learning curve, Kidney ureteroscopic navigation, learning curve, ureteroscopic navigation, navigation is challenging
备注:
点击查看摘要
Abstract:Purpose: Kidney ureteroscopic navigation is challenging with a steep learning curve. However, current clinical training has major deficiencies, as it requires one-on-one feedback from experts and occurs in the operating room (OR). Therefore, there is a need for a phantom training system with automated feedback to greatly \revision{expand} training opportunities. Methods: We propose a novel, purely ureteroscope video-based scope localization framework that automatically identifies calyces missed by the trainee in a phantom kidney exploration. We use a slow, thorough, prior exploration video of the kidney to generate a reference reconstruction. Then, this reference reconstruction can be used to localize any exploration video of the same phantom. Results: In 15 exploration videos, a total of 69 out of 74 calyces were correctly classified. We achieve 4mm camera pose localization error. Given the reference reconstruction, the system takes 10 minutes to generate the results for a typical exploration (1-2 minute long). Conclusion: We demonstrate a novel camera localization framework that can provide accurate and automatic feedback for kidney phantom explorations. We show its ability as a valid tool that enables out-of-OR training without requiring supervision from an expert.
Subjects:
Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Cite as:
arXiv:2602.15988 [eess.IV]
(or
arXiv:2602.15988v1 [eess.IV] for this version)
https://doi.org/10.48550/arXiv.2602.15988
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Fangjie Li [view email] [v1]
Tue, 17 Feb 2026 20:25:04 UTC (1,823 KB)
64. 【2602.15917】ROIX-Comp: Optimizing X-ray Computed Tomography Imaging Strategy for Data Reduction and Reconstruction
链接:https://arxiv.org/abs/2602.15917
作者:Amarjit Singh,Kento Sato,Kohei Yoshida,Kentaro Uesugi,Yasumasa Joti,Takaki Hatsui,Andrès Rubio Proaño
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT)
关键词:synchrotron radiation facilities, X-ray Computed Tomography, large-scale X-ray Computed, X-ray images, high-performance computing
备注: 11 pages, SCA/HPCAsia2026
点击查看摘要
Abstract:In high-performance computing (HPC) environments, particularly in synchrotron radiation facilities, vast amounts of X-ray images are generated. Processing large-scale X-ray Computed Tomography (X-CT) datasets presents significant computational and storage challenges due to their high dimensionality and data volume. Traditional approaches often require extensive storage capacity and high transmission bandwidth, limiting real-time processing capabilities and workflow efficiency. To address these constraints, we introduce a region-of-interest (ROI)-driven extraction framework (ROIX-Comp) that intelligently compresses X-CT data by identifying and retaining only essential features. Our work reduces data volume while preserving critical information for downstream processing tasks. At pre-processing stage, we utilize error-bounded quantization to reduce the amount of data to be processed and therefore improve computational efficiencies. At the compression stage, our methodology combines object extraction with multiple state-of-the-art lossless and lossy compressors, resulting in significantly improved compression ratios. We evaluated this framework against seven X-CT datasets and observed a relative compression ratio improvement of 12.34x compared to the standard compression.
65. 【2602.15913】Foundation Models for Medical Imaging: Status, Challenges, and Directions
链接:https://arxiv.org/abs/2602.15913
作者:Chuang Niu,Pengwei Wu,Bruno De Man,Ge Wang
类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:rapidly reshaping medical, reshaping medical imaging, Foundation models, general-purpose models, shifting the field
备注:
点击查看摘要
Abstract:Foundation models (FMs) are rapidly reshaping medical imaging, shifting the field from narrowly trained, task-specific networks toward large, general-purpose models that can be adapted across modalities, anatomies, and clinical tasks. In this review, we synthesize the emerging landscape of medical imaging FMs along three major axes: principles of FM design, applications of FMs, and forward-looking challenges and opportunities. Taken together, this review provides a technically grounded, clinically aware, and future-facing roadmap for developing FMs that are not only powerful and versatile but also trustworthy and ready for responsible translation into clinical practice.
66. 【2512.17322】Rotterdam artery-vein segmentation (RAV) dataset
链接:https://arxiv.org/abs/2512.17322
作者:Jose Vargas Quiros,Bart Liefers,Karin van Garderen,Jeroen Vermeulen,Eyened Reading Center,Caroline Klaver
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:Toggle, Code Toggle Papers, Toggle Hugging Face, Bibliographic Explorer Toggle, Explorer Toggle Bibliographic
备注:
点击查看摘要
Abstract:Purpose: To provide a diverse, high-quality dataset of color fundus images (CFIs) with detailed artery-vein (A/V) segmentation annotations, supporting the development and evaluation of machine learning algorithms for vascular analysis in ophthalmology. Methods: CFIs were sampled from the longitudinal Rotterdam Study (RS), encompassing a wide range of ages, devices, and capture conditions. Images were annotated using a custom interface that allowed graders to label arteries, veins, and unknown vessels on separate layers, starting from an initial vessel segmentation mask. Connectivity was explicitly verified and corrected using connected component visualization tools. Results: The dataset includes 1024x1024-pixel PNG images in three modalities: original RGB fundus images, contrast-enhanced versions, and RGB-encoded A/V masks. Image quality varied widely, including challenging samples typically excluded by automated quality assessment systems, but judged to contain valuable vascular information. Conclusion: This dataset offers a rich and heterogeneous source of CFIs with high-quality segmentations. It supports robust benchmarking and training of machine learning models under real-world variability in image quality and acquisition settings. Translational Relevance: By including connectivity-validated A/V masks and diverse image conditions, this dataset enables the development of clinically applicable, generalizable machine learning tools for retinal vascular analysis, potentially improving automated screening and diagnosis of systemic and ocular diseases.
Subjects:
Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2512.17322 [eess.IV]
(or
arXiv:2512.17322v2 [eess.IV] for this version)
https://doi.org/10.48550/arXiv.2512.17322
Focus to learn more
arXiv-issued DOI via DataCite
Submission history From: Jose David Vargas Quiros [view email] [v1]
Fri, 19 Dec 2025 08:09:02 UTC (432 KB)
[v2]
Wed, 18 Feb 2026 15:06:58 UTC (384 KB)
Full-text links:
Access Paper:
View a PDF of the paper titled Rotterdam artery-vein segmentation (RAV) dataset, by Jose Vargas Quiros and 5 other authorsView PDF
view license
Current browse context: eess.IV
prev
|
next
new
|
recent
| 2025-12
Change to browse by:
cs
cs.CV
eess
References Citations
NASA ADSGoogle Scholar
Semantic Scholar
export BibTeX citation
Loading…
BibTeX formatted citation
loading…
Data provided by:
Bookmark
checked=“checked”>
Bibliographic Tools
Bibliographic and Citation Tools
Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media
Code, Data and Media Associated with this Article
alphaXiv Toggle
alphaXiv (What is alphaXiv?)
Links to Code Toggle
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub Toggle
DagsHub (What is DagsHub?)
GotitPub Toggle
Gotit.pub (What is GotitPub?)
Huggingface Toggle
Hugging Face (What is Huggingface?)
Links to Code Toggle
Papers with Code (What is Papers with Code?)
ScienceCast Toggle
ScienceCast (What is ScienceCast?)
Demos
Demos
Replicate Toggle
Replicate (What is Replicate?)
Spaces Toggle
Hugging Face Spaces (What is Spaces?)
Spaces Toggle
Related Papers
Recommenders and Search Tools
Link to Influence Flower
Influence Flower (What are Influence Flowers?)
Core recommender toggle
CORE Recommender (What is CORE?)
Author
Venue
Institution
Topic
About arXivLabs
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.
Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)
mathjaxToggle();
About
Help
contact arXivClick here to contact arXiv
Contact
subscribe to arXiv mailingsClick here to subscribe
Subscribe
Copyright
Privacy Policy
Web Accessibility Assistance
arXiv Operational Status



