本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。

统计

今日共更新482篇论文,其中:

  • 自然语言处理85
  • 信息检索12
  • 计算机视觉66

自然语言处理

1. 【2602.16704】Reinforced Fast Weights with Next-Sequence Prediction

链接https://arxiv.org/abs/2602.16704

作者:Hee Seung Hwang,Xindi Wu,Sanghyuk Chun,Olga Russakovsky

类目:Computation and Language (cs.CL)

关键词:maintaining constant memory, constant memory overhead, Fast weight, context length, offer a promising

备注

点击查看摘要

Abstract:Fast weight architectures offer a promising alternative to attention-based transformers for long-context modeling by maintaining constant memory overhead regardless of context length. However, their potential is limited by the next-token prediction (NTP) training paradigm. NTP optimizes single-token predictions and ignores semantic coherence across multiple tokens following a prefix. Consequently, fast weight models, which dynamically update their parameters to store contextual information, learn suboptimal representations that fail to capture long-range dependencies. We introduce REFINE (Reinforced Fast weIghts with Next sEquence prediction), a reinforcement learning framework that trains fast weight models under the next-sequence prediction (NSP) objective. REFINE selects informative token positions based on prediction entropy, generates multi-token rollouts, assigns self-supervised sequence-level rewards, and optimizes the model with group relative policy optimization (GRPO). REFINE is applicable throughout the training lifecycle of pre-trained language models: mid-training, post-training, and test-time training. Our experiments on LaCT-760M and DeltaNet-1.3B demonstrate that REFINE consistently outperforms supervised fine-tuning with NTP across needle-in-a-haystack retrieval, long-context question answering, and diverse tasks in LongBench. REFINE provides an effective and versatile framework for improving long-context modeling in fast weight architectures.

2. 【2602.16699】Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents

链接https://arxiv.org/abs/2602.16699

作者:Wenxuan Ding,Nicholas Tomlin,Greg Durrett

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:single response, necessarily resolved, require interacting, acquire information, LLMs

备注

点击查看摘要

Abstract:LLMs are increasingly being used for complex problems which are not necessarily resolved in a single response, but require interacting with an environment to acquire information. In these scenarios, LLMs must reason about inherent cost-uncertainty tradeoffs in when to stop exploring and commit to an answer. For instance, on a programming task, an LLM should test a generated code snippet if it is uncertain about the correctness of that code; the cost of writing a test is nonzero, but typically lower than the cost of making a mistake. In this work, we show that we can induce LLMs to explicitly reason about balancing these cost-uncertainty tradeoffs, then perform more optimal environment exploration. We formalize multiple tasks, including information retrieval and coding, as sequential decision-making problems under uncertainty. Each problem has latent environment state that can be reasoned about via a prior which is passed to the LLM agent. We introduce a framework called Calibrate-Then-Act (CTA), where we feed the LLM this additional context to enable it to act more optimally. This improvement is preserved even under RL training of both the baseline and CTA. Our results on information-seeking QA and on a simplified coding task show that making cost-benefit tradeoffs explicit with CTA can help agents discover more optimal decision-making strategies.

3. 【2602.16687】Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens

链接https://arxiv.org/abs/2602.16687

作者:Potsawee Manakul,Woody Haosheng Gan,Martijn Bartelds,Guangzhi Sun,William Held,Diyi Yang

类目:ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

关键词:pre-trained text LLM, Current audio language, text LLM backbones, extending pre-trained text, limiting general audio

备注

点击查看摘要

Abstract:Current audio language models are predominantly text-first, either extending pre-trained text LLM backbones or relying on semantic-only audio tokens, limiting general audio modeling. This paper presents a systematic empirical study of native audio foundation models that apply next-token prediction to audio at scale, jointly modeling semantic content, acoustic details, and text to support both general audio generation and cross-modal capabilities. We provide comprehensive empirical insights for building such models: (1) We systematically investigate design choices -- data sources, text mixture ratios, and token composition -- establishing a validated training recipe. (2) We conduct the first scaling law study for discrete audio models via IsoFLOP analysis on 64 models spanning $3{\times}10^{18}$ to $3{\times}10^{20}$ FLOPs, finding that optimal data grows 1.6$\times$ faster than optimal model size. (3) We apply these lessons to train SODA (Scaling Open Discrete Audio), a suite of models from 135M to 4B parameters on 500B tokens, comparing against our scaling predictions and existing models. SODA serves as a flexible backbone for diverse audio/text tasks -- we demonstrate this by fine-tuning for voice-preserving speech-to-speech translation, using the same unified architecture.

4. 【2602.16660】Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment

链接https://arxiv.org/abs/2602.16660

作者:Yuyan Bu,Xiaohao Liu,ZhaoXing Ren,Yaodong Yang,Juntao Dai

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:linguistic communities necessitates, communities necessitates reliable, necessitates reliable multilingual, widespread deployment, deployment of large

备注: Accepted by ICLR 2026

点击查看摘要

Abstract:The widespread deployment of large language models (LLMs) across linguistic communities necessitates reliable multilingual safety alignment. However, recent efforts to extend alignment to other languages often require substantial resources, either through large-scale, high-quality supervision in the target language or through pairwise alignment with high-resource languages, which limits scalability. In this work, we propose a resource-efficient method for improving multilingual safety alignment. We introduce a plug-and-play Multi-Lingual Consistency (MLC) loss that can be integrated into existing monolingual alignment pipelines. By improving collinearity between multilingual representation vectors, our method encourages directional consistency at the multilingual semantic level in a single update. This allows simultaneous alignment across multiple languages using only multilingual prompt variants without requiring additional response-level supervision in low-resource languages. We validate the proposed method across different model architectures and alignment paradigms, and demonstrate its effectiveness in enhancing multilingual safety with limited impact on general model utility. Further evaluation across languages and tasks indicates improved cross-lingual generalization, suggesting the proposed approach as a practical solution for multilingual consistency alignment under limited supervision.

5. 【2602.16640】Quecto-V1: Empirical Analysis of 8-bit Quantized Small Language Models for On-Device Legal Retrieval

链接https://arxiv.org/abs/2602.16640

作者:Subrit Dikshit

类目:Computation and Language (cs.CL)

关键词:Natural Language Processing, revolutionized Natural Language, Large Language Models, Language Processing, Large Language

备注: 5 pages, 2 tables

点击查看摘要

Abstract:The rapid proliferation of Large Language Models (LLMs) has revolutionized Natural Language Processing (NLP) but has simultaneously created a "resource divide." State-of-the-art legal intelligence systems typically rely on massive parameter counts (7B+) and cloud-based inference, rendering them inaccessible to practitioners in resource-constrained environments and posing significant data sovereignty risks. This paper introduces Quecto-V1, a domain-specific Small Language Model (SLM) engineered to democratize access to Indian legal intelligence. Built upon a custom configuration of the GPT-2 architecture (124 million parameters), Quecto-V1 was trained from scratch exclusively on a corpus of Indian statutes, including the Indian Penal Code (IPC), the Code of Criminal Procedure (CrPC), and the Constitution of India. Unlike generalist models, which prioritize broad world knowledge, our approach maximizes "lexical density" within the legal domain. Furthermore, we address the deployment bottleneck by applying post-training 8-bit quantization (GGUF format), compressing the model to a memory footprint of under 150 MB. Our empirical analysis demonstrates that Quecto-V1 achieves high fidelity in retrieving statutory definitions and penal provisions, outperforming general-purpose SLMs in domain-specific exact match tasks while running entirely offline on consumer-grade CPUs. We further present an ablation study showing that 8-bit quantization yields a 74% reduction in model size with less than 3.5% degradation in retrieval accuracy compared to full-precision baselines. These findings suggest that for specialized, high-stakes domains like law, domain-specific training coupled with aggressive quantization offers a viable, privacy-preserving alternative to monolithic cloud models.

6. 【2602.16639】AREG: Adversarial Resource Extraction Game for Evaluating Persuasion and Resistance in Large Language Models

链接https://arxiv.org/abs/2602.16639

作者:Adib Sakhawat,Fardeen Sadab

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, increasingly requires moving, static text generation, intelligence of Large

备注: 15 pages, 5 figures, 11 tables. Includes appendix with detailed experimental results and prompts

点击查看摘要

Abstract:Evaluating the social intelligence of Large Language Models (LLMs) increasingly requires moving beyond static text generation toward dynamic, adversarial interaction. We introduce the Adversarial Resource Extraction Game (AREG), a benchmark that operationalizes persuasion and resistance as a multi-turn, zero-sum negotiation over financial resources. Using a round-robin tournament across frontier models, AREG enables joint evaluation of offensive (persuasion) and defensive (resistance) capabilities within a single interactional framework. Our analysis provides evidence that these capabilities are weakly correlated ($\rho = 0.33$) and empirically dissociated: strong persuasive performance does not reliably predict strong resistance, and vice versa. Across all evaluated models, resistance scores exceed persuasion scores, indicating a systematic defensive advantage in adversarial dialogue settings. Further linguistic analysis suggests that interaction structure plays a central role in these outcomes. Incremental commitment-seeking strategies are associated with higher extraction success, while verification-seeking responses are more prevalent in successful defenses than explicit refusal. Together, these findings indicate that social influence in LLMs is not a monolithic capability and that evaluation frameworks focusing on persuasion alone may overlook asymmetric behavioral vulnerabilities.

7. 【2602.16610】Who can we trust? LLM-as-a-jury for Comparative Assessment

链接https://arxiv.org/abs/2602.16610

作者:Mengjie Qian,Guangzhi Sun,Mark J.F. Gales,Kate M. Knill

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large language models, natural language generation, language generation assessment, Large language, pairwise comparative judgements

备注

点击查看摘要

Abstract:Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment often using pairwise comparative judgements. Existing approaches typically rely on single judges or aggregate multiple judges assuming equal reliability. In practice, LLM judges vary substantially in performance across tasks and aspects, and their judgment probabilities may be biased and inconsistent. Furthermore, human-labelled supervision for judge calibration may be unavailable. We first empirically demonstrate that inconsistencies in LLM comparison probabilities exist and show that it limits the effectiveness of direct probability-based ranking. To address this, we study the LLM-as-a-jury setting and propose BT-sigma, a judge-aware extension of the Bradley-Terry model that introduces a discriminator parameter for each judge to jointly infer item rankings and judge reliability from pairwise comparisons alone. Experiments on benchmark NLG evaluation datasets show that BT-sigma consistently outperforms averaging-based aggregation methods, and that the learned discriminator strongly correlates with independent measures of the cycle consistency of LLM judgments. Further analysis reveals that BT-sigma can be interpreted as an unsupervised calibration mechanism that improves aggregation by modelling judge reliability.

8. 【2602.16609】ColBERT-Zero: To Pre-train Or Not To Pre-train ColBERT models

链接https://arxiv.org/abs/2602.16609

作者:Antoine Chaffin,Luca Arnaboldi,Amélie Chatelain,Florent Krzakala

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:small Knowledge Distillation, Knowledge Distillation, strong single-vector models, small Knowledge, multi-vector models

备注: 9 pages, 5 tables, 2 figures

点击查看摘要

Abstract:Current state-of-the-art multi-vector models are obtained through a small Knowledge Distillation (KD) training step on top of strong single-vector models, leveraging the large-scale pre-training of these models. In this paper, we study the pre-training of multi-vector models and show that large-scale multi-vector pre-training yields much stronger multi-vector models. Notably, a fully ColBERT-pre-trained model, ColBERT-Zero, trained only on public data, outperforms GTE-ModernColBERT as well as its base model, GTE-ModernBERT, which leverages closed and much stronger data, setting new state-of-the-art for model this size. We also find that, although performing only a small KD step is not enough to achieve results close to full pre-training, adding a supervised step beforehand allows to achieve much closer performance while skipping the most costly unsupervised phase. Finally, we find that aligning the fine-tuning and pre-training setups is crucial when repurposing existing models. To enable exploration of our results, we release various checkpoints as well as code used to train them.

9. 【2602.16608】Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models

链接https://arxiv.org/abs/2602.16608

作者:Melkamu Abay Mersha,Jugal Kalita

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:deeply layered representations, layered representations make, Layer-wise Integrated Gradients, Transformer models achieve, difficult to interpret

备注

点击查看摘要

Abstract:Transformer models achieve state-of-the-art performance across domains and tasks, yet their deeply layered representations make their predictions difficult to interpret. Existing explainability methods rely on final-layer attributions, capture either local token-level attributions or global attention patterns without unification, and lack context-awareness of inter-token dependencies and structural components. They also fail to capture how relevance evolves across layers and how structural components shape decision-making. To address these limitations, we proposed the \textbf{Context-Aware Layer-wise Integrated Gradients (CA-LIG) Framework}, a unified hierarchical attribution framework that computes layer-wise Integrated Gradients within each Transformer block and fuses these token-level attributions with class-specific attention gradients. This integration yields signed, context-sensitive attribution maps that capture supportive and opposing evidence while tracing the hierarchical flow of relevance through the Transformer layers. We evaluate the CA-LIG Framework across diverse tasks, domains, and transformer model families, including sentiment analysis and long and multi-class document classification with BERT, hate speech detection in a low-resource language setting with XLM-R and AfroLM, and image classification with Masked Autoencoder vision Transformer model. Across all tasks and architectures, CA-LIG provides more faithful attributions, shows stronger sensitivity to contextual dependencies, and produces clearer, more semantically coherent visualizations than established explainability methods. These results indicate that CA-LIG provides a more comprehensive, context-aware, and reliable explanation of Transformer decision-making, advancing both the practical interpretability and conceptual understanding of deep neural models.

10. 【2602.16607】CitiLink-Summ: Summarization of Discussion Subjects in European Portuguese Municipal Meeting Minutes

链接https://arxiv.org/abs/2602.16607

作者:Miguel Marques,Ana Luísa Fernandes,Ana Filipa Pacheco,Rute Rebouças,Inês Cantante,José Isidro,Luís Filipe Cunha,Alípio Jorge,Nuno Guimarães,Sérgio Nunes,António Leal,Purificação Silvano,Ricardo Campos

类目:Computation and Language (cs.CL)

关键词:formal records documenting, Municipal meeting minutes, local government, citizens to navigate, Municipal meeting

备注

点击查看摘要

Abstract:Municipal meeting minutes are formal records documenting the discussions and decisions of local government, yet their content is often lengthy, dense, and difficult for citizens to navigate. Automatic summarization can help address this challenge by producing concise summaries for each discussion subject. Despite its potential, research on summarizing discussion subjects in municipal meeting minutes remains largely unexplored, especially in low-resource languages, where the inherent complexity of these documents adds further challenges. A major bottleneck is the scarcity of datasets containing high-quality, manually crafted summaries, which limits the development and evaluation of effective summarization models for this domain. In this paper, we present CitiLink-Summ, a new corpus of European Portuguese municipal meeting minutes, comprising 100 documents and 2,322 manually hand-written summaries, each corresponding to a distinct discussion subject. Leveraging this dataset, we establish baseline results for automatic summarization in this domain, employing state-of-the-art generative models (e.g., BART, PRIMERA) as well as large language models (LLMs), evaluated with both lexical and semantic metrics such as ROUGE, BLEU, METEOR, and BERTScore. CitiLink-Summ provides the first benchmark for municipal-domain summarization in European Portuguese, offering a valuable resource for advancing NLP research on complex administrative texts.

11. 【2602.16578】Creating a digital poet

链接https://arxiv.org/abs/2602.16578

作者:Vered Tohar,Tsahi Hayat,Amir Leshem

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:machine write good, write good poetry, machine write, write good, good poetry

备注: 24 pages, 3 figures

点击查看摘要

Abstract:Can a machine write good poetry? Any positive answer raises fundamental questions about the nature and value of art. We report a seven-month poetry workshop in which a large language model was shaped into a digital poet through iterative in-context expert feedback, without retraining. Across sessions, the model developed a distinctive style and a coherent corpus, supported by quantitative and qualitative analyses, and it produced a pen name and author image. In a blinded authorship test with 50 humanities students and graduates (three AI poems and three poems by well-known poets each), judgments were at chance: human poems were labeled human 54% of the time and AI poems 52%, with 95% confidence intervals including 50%. After the workshop, a commercial publisher released a poetry collection authored by the model. These results show that workshop-style prompting can support long-horizon creative shaping and renew debates on creativity and authorship.

12. 【2602.16571】Utility-Preserving De-Identification for Math Tutoring: Investigating Numeric Ambiguity in the MathEd-PII Benchmark Dataset

链接https://arxiv.org/abs/2602.16571

作者:Zhuqian Zhou,Kirk Vanacore,Bakhtawar Ahtisham,Jinsook Lee,Doug Pietrzak,Daryl Hedley,Jorge Dias,Chris Shaw,Ruth Schäfer,René F. Kizilcec

类目:Computation and Language (cs.CL)

关键词:Personally Identifiable Information, Large-scale sharing, generic Personally Identifiable, rigorous de-identification remains, teaching and learning

备注

点击查看摘要

Abstract:Large-scale sharing of dialogue-based data is instrumental for advancing the science of teaching and learning, yet rigorous de-identification remains a major barrier. In mathematics tutoring transcripts, numeric expressions frequently resemble structured identifiers (e.g., dates or IDs), leading generic Personally Identifiable Information (PII) detection systems to over-redact core instructional content and reduce dataset utility. This work asks how PII can be detected in math tutoring transcripts while preserving their educational utility. To address this challenge, we investigate the "numeric ambiguity" problem and introduce MathEd-PII, the first benchmark dataset for PII detection in math tutoring dialogues, created through a human-in-the-loop LLM workflow that audits upstream redactions and generates privacy-preserving surrogates. The dataset contains 1,000 tutoring sessions (115,620 messages; 769,628 tokens) with validated PII annotations. Using a density-based segmentation method, we show that false PII redactions are disproportionately concentrated in math-dense regions, confirming numeric ambiguity as a key failure mode. We then compare four detection strategies: a Presidio baseline and LLM-based approaches with basic, math-aware, and segment-aware prompting. Math-aware prompting substantially improves performance over the baseline (F1: 0.821 vs. 0.379) while reducing numeric false positives, demonstrating that de-identification must incorporate domain context to preserve analytic utility. This work provides both a new benchmark and evidence that utility-preserving de-identification for tutoring data requires domain-aware modeling.

13. 【2602.16516】Supercharging Agenda Setting Research: The ParlaCAP Dataset of 28 European Parliaments and a Scalable Multilingual LLM-Based Classification

链接https://arxiv.org/abs/2602.16516

作者:Taja Kuzman Pungeršek,Peter Rupnik,Daniela Širinić,Nikola Ljubešić

类目:Computation and Language (cs.CL)

关键词:setting across Europe, Comparative Agendas Project, building domain-specific policy, paper introduces ParlaCAP, parliamentary agenda setting

备注: 17 pages, 7 figures, 7 tables. Submitted to the PoliticalNLP 2026 workshop, co-located with LREC 2026 conference

点击查看摘要

Abstract:This paper introduces ParlaCAP, a large-scale dataset for analyzing parliamentary agenda setting across Europe, and proposes a cost-effective method for building domain-specific policy topic classifiers. Applying the Comparative Agendas Project (CAP) schema to the multilingual ParlaMint corpus of over 8 million speeches from 28 parliaments of European countries and autonomous regions, we follow a teacher-student framework in which a high-performing large language model (LLM) annotates in-domain training data and a multilingual encoder model is fine-tuned on these annotations for scalable data annotation. We show that this approach produces a classifier tailored to the target domain. Agreement between the LLM and human annotators is comparable to inter-annotator agreement among humans, and the resulting model outperforms existing CAP classifiers trained on manually-annotated but out-of-domain data. In addition to the CAP annotations, the ParlaCAP dataset offers rich speaker and party metadata, as well as sentiment predictions coming from the ParlaSent multilingual transformer model, enabling comparative research on political attention and representation across countries. We illustrate the analytical potential of the dataset with three use cases, examining the distribution of parliamentary attention across policy topics, sentiment patterns in parliamentary speech, and gender differences in policy attention.

14. 【2602.16500】Optimizing Soft Prompt Tuning via Structural Evolution

链接https://arxiv.org/abs/2602.16500

作者:Zhenzhen Huang,Chaoning Zhang,Haoyu Bian,Songbo Zhang,Chi-lok Andy Tai,Jiaquan Zhang,Caiyan Qin,Jingjing Qu,Yalan Ye,Yang Yang,Heng Tao Shen

类目:Computation and Language (cs.CL)

关键词:capture task-specific information, large pre-trained language, Soft prompt tuning, Soft prompt, leverages continuous embeddings

备注: This manuscript has been submitted to IEEE Transactions on Knowledge and Data Engineering (TKDE) for peer review

点击查看摘要

Abstract:Soft prompt tuning leverages continuous embeddings to capture task-specific information in large pre-trained language models (LLMs), achieving competitive performance in few-shot settings. However, soft prompts rely on high-dimensional, implicit representations and lack explicit semantics and traceable training behaviors, which limits their interpretability. To address this limitation, we propose a soft prompt tuning optimization method based on topological morphological evolution. Specifically, we employ persistent homology from topological data analysis (TDA) to quantify the structural representations of soft prompts in continuous parameter space and their training process evolution. Quantitative analysis shows that topologically stable and compact soft prompts achieve better downstream performance. Based on this empirical observation, we construct a loss function for optimizing soft prompt tuning, termed Topological Soft Prompt Loss (TSLoss). TSLoss guides the model to learn structurally stable adaptations by quantifying inter-parameter connectivity and redundancy. Extensive experiments show that training with TSLoss accelerates convergence and improves tuning performance, providing an interpretable method to understand and optimize soft prompt tuning from structural and topological perspectives.

15. 【2602.16490】From Growing to Looping: A Unified View of Iterative Computation in LLMs

链接https://arxiv.org/abs/2602.16490

作者:Ferdinand Kapl,Emmanouil Angelis,Kaitlin Maile,Johannes von Oswald,Stefan Bauer

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:relationship remains unclear, remains unclear, duplicating middle layers, linked to stronger, relationship remains

备注

点击查看摘要

Abstract:Looping, reusing a block of layers across depth, and depth growing, training shallow-to-deep models by duplicating middle layers, have both been linked to stronger reasoning, but their relationship remains unclear. We provide a mechanistic unification: looped and depth-grown models exhibit convergent depth-wise signatures, including increased reliance on late layers and recurring patterns aligned with the looped or grown block. These shared signatures support the view that their gains stem from a common form of iterative computation. Building on this connection, we show that the two techniques are adaptable and composable: applying inference-time looping to the middle blocks of a depth-grown model improves accuracy on some reasoning primitives by up to $2\times$, despite the model never being trained to loop. Both approaches also adapt better than the baseline when given more in-context examples or additional supervised fine-tuning data. Additionally, depth-grown models achieve the largest reasoning gains when using higher-quality, math-heavy cooldown mixtures, which can be further boosted by adapting a middle block to loop. Overall, our results position depth growth and looping as complementary, practical methods for inducing and scaling iterative computation to improve reasoning.

16. 【2602.16488】Learning to Learn from Language Feedback with Social Meta-Learning

链接https://arxiv.org/abs/2602.16488

作者:Jonathan Cook,Diego Antognini,Martin Klissarov,Claudiu Musat,Edward Grefenstette

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, Large language, conversational context, Large, SML

备注

点击查看摘要

Abstract:Large language models (LLMs) often struggle to learn from corrective feedback within a conversational context. They are rarely proactive in soliciting this feedback, even when faced with ambiguity, which can make their dialogues feel static, one-sided, and lacking the adaptive qualities of human conversation. To address these limitations, we draw inspiration from social meta-learning (SML) in humans - the process of learning how to learn from others. We formulate SML as a finetuning methodology, training LLMs to solicit and learn from language feedback in simulated pedagogical dialogues, where static tasks are converted into interactive social learning problems. SML effectively teaches models to use conversation to solve problems they are unable to solve in a single turn. This capability generalises across domains; SML on math problems produces models that better use feedback to solve coding problems and vice versa. Furthermore, despite being trained only on fully-specified problems, these models are better able to solve underspecified tasks where critical information is revealed over multiple turns. When faced with this ambiguity, SML-trained models make fewer premature answer attempts and are more likely to ask for the information they need. This work presents a scalable approach to developing AI systems that effectively learn from language feedback.

17. 【2602.16485】am of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling

链接https://arxiv.org/abs/2602.16485

作者:Jeffrey T. H. Wong,Zixi Zhang,Junyi Liu,Yiren Zhao

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

关键词:Existing Multi-Agent Systems, Existing Multi-Agent, Multi-Agent Systems, differently post-trained models, typically rely

备注: 8 pages

点击查看摘要

Abstract:Existing Multi-Agent Systems (MAS) typically rely on static, homogeneous model configurations, limiting their ability to exploit the distinct strengths of differently post-trained models. To address this, we introduce Team-of-Thoughts, a novel MAS architecture that leverages the complementary capabilities of heterogeneous agents via an orchestrator-tool paradigm. Our framework introduces two key mechanisms to optimize performance: (1) an orchestrator calibration scheme that identifies models with superior coordination capabilities, and (2) a self-assessment protocol where tool agents profile their own domain expertise to account for variations in post-training skills. During inference, the orchestrator dynamically activates the most suitable tool agents based on these proficiency profiles. Experiments on five reasoning and code generation benchmarks show that Team-of-Thoughts delivers consistently superior task performance. Notably, on AIME24 and LiveCodeBench, our approach achieves accuracies of 96.67% and 72.53%, respectively, substantially outperforming homogeneous role-play baselines, which score 80% and 65.93%.

18. 【2602.16469】raining Models on Dialects of Translationese Shows How Lexical Diversity and Source-Target Syntactic Similarity Shape Learning

链接https://arxiv.org/abs/2602.16469

作者:Jenny Kunz

类目:Computation and Language (cs.CL)

关键词:multilingual NLP, native text, NLP, source, source language

备注

点击查看摘要

Abstract:Machine-translated data is widely used in multilingual NLP, particularly when native text is scarce. However, translated text differs systematically from native text. This phenomenon is known as translationese, and it reflects both traces of the source language and characteristic properties of translation itself. In this paper, we study how training on machine-translated data affects small English language models, focusing on how translationese from different source languages shapes linguistic acceptability judgments and language modelling for different domains. We train models on English text translated from 24 typologically and resource-diverse source languages, enabling a systematic analysis of how source language and corpus properties influence what models learn. Our results show that the source language has a clear impact on model behavior: general perplexity is more driven by the lexical diversity of the translated corpus, while grammatical performance is strongly correlated to typological similarity to English, given enough data.

19. 【2602.16467】IndicEval: A Bilingual Indian Educational Evaluation Framework for Large Language Models

链接https://arxiv.org/abs/2602.16467

作者:Saurabh Bharti,Gaurav Azad,Abhinaw Jagtap,Nachiket Tapas

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:reflect real-world academic, real-world academic rigor, necessitates evaluation frameworks, large language models, rapid advancement

备注

点击查看摘要

Abstract:The rapid advancement of large language models (LLMs) necessitates evaluation frameworks that reflect real-world academic rigor and multilingual complexity. This paper introduces IndicEval, a scalable benchmarking platform designed to assess LLM performance using authentic high-stakes examination questions from UPSC, JEE, and NEET across STEM and humanities domains in both English and Hindi. Unlike synthetic benchmarks, IndicEval grounds evaluation in real examination standards, enabling realistic measurement of reasoning, domain knowledge, and bilingual adaptability. The framework automates assessment using Zero-Shot, Few-Shot, and Chain-of-Thought (CoT) prompting strategies and supports modular integration of new models and languages. Experiments conducted on Gemini 2.0 Flash, GPT-4, Claude, and LLaMA 3-70B reveal three major findings. First, CoT prompting consistently improves reasoning accuracy, with substantial gains across subjects and languages. Second, significant cross-model performance disparities persist, particularly in high-complexity examinations. Third, multilingual degradation remains a critical challenge, with marked accuracy drops in Hindi compared to English, especially under Zero-Shot conditions. These results highlight persistent gaps in bilingual reasoning and domain transfer. Overall, IndicEval provides a practice-oriented, extensible foundation for rigorous, equitable evaluation of LLMs in multilingual educational settings and offers actionable insights for improving reasoning robustness and language adaptability.

20. 【2602.16429】abAgent: A Framework for Replacing Agentic Generative Components with Tabular-Textual Classifiers

链接https://arxiv.org/abs/2602.16429

作者:Ido Levy,Eilam Shapira,Yinon Goldshtein,Avi Yaeli,Nir Mashkif,Segev Shlomov

类目:Computation and Language (cs.CL)

关键词:achieve complex goals, large language model, autonomously execute multi-step, execute multi-step workflows, repeated large language

备注

点击查看摘要

Abstract:Agentic systems, AI architectures that autonomously execute multi-step workflows to achieve complex goals, are often built using repeated large language model (LLM) calls for closed-set decision tasks such as routing, shortlisting, gating, and verification. While convenient, this design makes deployments slow and expensive due to cumulative latency and token usage. We propose TabAgent, a framework for replacing generative decision components in closed-set selection tasks with a compact textual-tabular classifier trained on execution traces. TabAgent (i) extracts structured schema, state, and dependency features from trajectories (TabSchema), (ii) augments coverage with schema-aligned synthetic supervision (TabSynth), and (iii) scores candidates with a lightweight classifier (TabHead). On the long-horizon AppWorld benchmark, TabAgent maintains task-level success while eliminating shortlist-time LLM calls, reducing latency by approximately 95% and inference cost by 85-91%. Beyond tool shortlisting, TabAgent generalizes to other agentic decision heads, establishing a paradigm for learned discriminative replacements of generative bottlenecks in production agent architectures.

21. 【2602.16379】Label-Consistent Data Generation for Aspect-Based Sentiment Analysis Using LLM Agents

链接https://arxiv.org/abs/2602.16379

作者:Mohammad H.A. Monfared,Lucie Flek,Akbar Karimi

类目:Computation and Language (cs.CL)

关键词:Aspect-Based Sentiment Analysis, produce high quality, high quality synthetic, quality synthetic training, Sentiment Analysis

备注: Accepted to WASSA Workshop at EACL 2026

点击查看摘要

Abstract:We propose an agentic data augmentation method for Aspect-Based Sentiment Analysis (ABSA) that uses iterative generation and verification to produce high quality synthetic training examples. To isolate the effect of agentic structure, we also develop a closely matched prompting-based baseline using the same model and instructions. Both methods are evaluated across three ABSA subtasks (Aspect Term Extraction (ATE), Aspect Sentiment Classification (ATSC), and Aspect Sentiment Pair Extraction (ASPE)), four SemEval datasets, and two encoder-decoder models: T5-Base and Tk-Instruct. Our results show that the agentic augmentation outperforms raw prompting in label preservation of the augmented data, especially when the tasks require aspect term generation. In addition, when combined with real data, agentic augmentation provides higher gains, consistently outperforming prompting-based generation. These benefits are most pronounced for T5-Base, while the more heavily pretrained Tk-Instruct exhibits smaller improvements. As a result, augmented data helps T5-Base achieve comparable performance with its counterpart.

22. 【2602.16375】Variable-Length Semantic IDs for Recommender Systems

链接https://arxiv.org/abs/2602.16375

作者:Kirill Khrylchenko

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:modeling user behavior, Generative models, generative models difficult, integrating large language, training generative models

备注

点击查看摘要

Abstract:Generative models are increasingly used in recommender systems, both for modeling user behavior as event sequences and for integrating large language models into recommendation pipelines. A key challenge in this setting is the extremely large cardinality of item spaces, which makes training generative models difficult and introduces a vocabulary gap between natural language and item identifiers. Semantic identifiers (semantic IDs), which represent items as sequences of low-cardinality tokens, have recently emerged as an effective solution to this problem. However, existing approaches generate semantic identifiers of fixed length, assigning the same description length to all items. This is inefficient, misaligned with natural language, and ignores the highly skewed frequency structure of real-world catalogs, where popular items and rare long-tail items exhibit fundamentally different information requirements. In parallel, the emergent communication literature studies how agents develop discrete communication protocols, often producing variable-length messages in which frequent concepts receive shorter descriptions. Despite the conceptual similarity, these ideas have not been systematically adopted in recommender systems. In this work, we bridge recommender systems and emergent communication by introducing variable-length semantic identifiers for recommendation. We propose a discrete variational autoencoder with Gumbel-Softmax reparameterization that learns item representations of adaptive length under a principled probabilistic framework, avoiding the instability of REINFORCE-based training and the fixed-length constraints of prior semantic ID methods.

Subjects:

Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Cite as:
arXiv:2602.16375 [cs.IR]

(or
arXiv:2602.16375v1 [cs.IR] for this version)

https://doi.org/10.48550/arXiv.2602.16375

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
23. 【2602.16346】Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

链接https://arxiv.org/abs/2602.16346

作者:Nivya Talokar,Ayush K Tarun,Murari Mandal,Maksym Andriushchenko,Antoine Bosselut

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:execute real-world workflows, LLM-based agents execute, agents execute real-world, execute real-world, real-world workflows

备注

点击查看摘要

Abstract:LLM-based agents execute real-world workflows via tools and memory. These affordances enable ill-intended adversaries to also use these agents to carry out complex misuse scenarios. Existing agent misuse benchmarks largely test single-prompt instructions, leaving a gap in measuring how agents end up helping with harmful or illegal tasks over multiple turns. We introduce STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive follow-ups, using judge agents to track phase completion. We further introduce an analysis framework that models multi-turn red-teaming as a time-to-first-jailbreak random variable, enabling analysis tools like discovery curves, hazard-ratio attribution by attack language, and a new metric: Restricted Mean Jailbreak Discovery. Across AgentHarm scenarios, STING yields substantially higher illicit-task completion than single-turn prompting and chat-oriented multi-turn baselines adapted to tool-using agents. In multilingual evaluations across six non-English settings, we find that attack success and illicit-task completion do not consistently increase in lower-resource languages, diverging from common chatbot findings. Overall, STING provides a practical way to evaluate and stress-test agent misuse in realistic deployment settings, where interactions are inherently multi-turn and often multilingual.

24. 【2602.16313】MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks

链接https://arxiv.org/abs/2602.16313

作者:Zexue He,Yu Wang,Churan Zhi,Yuanzhe Hu,Tzu-Ping Chen,Lang Yin,Ze Chen,Tong Arthur Wu,Siru Ouyang,Zihan Wang,Jiaxin Pei,Julian McAuley,Yejin Choi,Alex Pentland

类目:Computation and Language (cs.CL)

关键词:typically assess memorization, memory typically assess, memory, typically assess, assess memorization

备注

点击查看摘要

Abstract:Existing evaluations of agents with memory typically assess memorization and action in isolation. One class of benchmarks evaluates memorization by testing recall of past conversations or text but fails to capture how memory is used to guide future decisions. Another class focuses on agents acting in single-session tasks without the need for long-term memory. However, in realistic settings, memorization and action are tightly coupled: agents acquire memory while interacting with the environment, and subsequently rely on that memory to solve future tasks. To capture this setting, we introduce MemoryArena, a unified evaluation gym for benchmarking agent memory in multi-session Memory-Agent-Environment loops. The benchmark consists of human-crafted agentic tasks with explicitly interdependent subtasks, where agents must learn from earlier actions and feedback by distilling experiences into memory, and subsequently use that memory to guide later actions to solve the overall task. MemoryArena supports evaluation across web navigation, preference-constrained planning, progressive information search, and sequential formal reasoning, and reveals that agents with near-saturated performance on existing long-context memory benchmarks like LoCoMo perform poorly in our agentic setting, exposing a gap in current evaluations for agents with memory.

25. 【2602.16298】MultiCW: A Large-Scale Balanced Benchmark Dataset for Training Robust Check-Worthiness Detection Models

链接https://arxiv.org/abs/2602.16298

作者:Martin Hyben,Sebastian Kula,Jan Cegin,Jakub Simko,Ivan Srba,Robert Moro

类目:Computation and Language (cs.CL)

关键词:professionals verify information, process remains limited, media professionals verify, Large Language Models, Large Language

备注: 18 pages, 8 figures, 19 tables, EACL-2026

点击查看摘要

Abstract:Large Language Models (LLMs) are beginning to reshape how media professionals verify information, yet automated support for detecting check-worthy claims a key step in the fact-checking process remains limited. We introduce the Multi-Check-Worthy (MultiCW) dataset, a balanced multilingual benchmark for check-worthy claim detection spanning 16 languages, 7 topical domains, and 2 writing styles. It consists of 123,722 samples, evenly distributed between noisy (informal) and structured (formal) texts, with balanced representation of check-worthy and non-check-worthy classes across all languages. To probe robustness, we also introduce an equally balanced out-of-distribution evaluation set of 27,761 samples in 4 additional languages. To provide baselines, we benchmark 3 common fine-tuned multilingual transformers against a diverse set of 15 commercial and open LLMs under zero-shot settings. Our findings show that fine-tuned models consistently outperform zero-shot LLMs on claim classification and show strong out-of-distribution generalization across languages, domains, and styles. MultiCW provides a rigorous multilingual resource for advancing automated fact-checking and enables systematic comparisons between fine-tuned models and cutting-edge LLMs on the check-worthy claim detection task.

26. 【2602.16290】Aladdin-FTI @ AMIYA Three Wishes for Arabic NLP: Fidelity, Diglossia, and Multidialectal Generation

链接https://arxiv.org/abs/2602.16290

作者:Jonathan Mutal,Perla Al Almaoui,Simon Hengchen,Pierrette Bouillon

类目:Computation and Language (cs.CL)

关键词:Natural Language Processing, Language Processing, under-represented in Natural, Natural Language, Large Language Models

备注: 13 pages, Paper submitted to the AMIYA shared task at the VarDial workshop, co-located with EACL 2026

点击查看摘要

Abstract:Arabic dialects have long been under-represented in Natural Language Processing (NLP) research due to their non-standardization and high variability, which pose challenges for computational modeling. Recent advances in the field, such as Large Language Models (LLMs), offer promising avenues to address this gap by enabling Arabic to be modeled as a pluricentric language rather than a monolithic system. This paper presents Aladdin-FTI, our submission to the AMIYA shared task. The proposed system is designed to both generate and translate dialectal Arabic (DA). Specifically, the model supports text generation in Moroccan, Egyptian, Palestinian, Syrian, and Saudi dialects, as well as bidirectional translation between these dialects, Modern Standard Arabic (MSA), and English. The code and trained model are publicly available.

27. 【2602.16241】Are LLMs Ready to Replace Bangla Annotators?

链接https://arxiv.org/abs/2602.16241

作者:Md. Najib Hasan,Touseef Hasan,Souvika Sarkar

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:remains poorly understood, Large Language Models, scale dataset creation, Large Language, dataset creation

备注

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly used as automated annotators to scale dataset creation, yet their reliability as unbiased annotators--especially for low-resource and identity-sensitive settings--remains poorly understood. In this work, we study the behavior of LLMs as zero-shot annotators for Bangla hate speech, a task where even human agreement is challenging, and annotator bias can have serious downstream consequences. We conduct a systematic benchmark of 17 LLMs using a unified evaluation framework. Our analysis uncovers annotator bias and substantial instability in model judgments. Surprisingly, increased model scale does not guarantee improved annotation quality--smaller, more task-aligned models frequently exhibit more consistent behavior than their larger counterparts. These results highlight important limitations of current LLMs for sensitive annotation tasks in low-resource languages and underscore the need for careful evaluation before deployment.

28. 【2602.16201】Long-Tail Knowledge in Large Language Models: Taxonomy, Mechanisms, Interventions and Implications

链接https://arxiv.org/abs/2602.16201

作者:Sanket Badhe,Deep Shah,Nehal Kathrotia

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

关键词:steep power-law distributions, exhibit steep power-law, Large language models, long-Tail Knowledge, power-law distributions

备注

点击查看摘要

Abstract:Large language models (LLMs) are trained on web-scale corpora that exhibit steep power-law distributions, in which the distribution of knowledge is highly long-tailed, with most appearing infrequently. While scaling has improved average-case performance, persistent failures on low-frequency, domain-specific, cultural, and temporal knowledge remain poorly characterized. This paper develops a structured taxonomy and analysis of long-Tail Knowledge in large language models, synthesizing prior work across technical and sociotechnical perspectives. We introduce a structured analytical framework that synthesizes prior work across four complementary axes: how long-Tail Knowledge is defined, the mechanisms by which it is lost or distorted during training and inference, the technical interventions proposed to mitigate these failures, and the implications of these failures for fairness, accountability, transparency, and user trust. We further examine how existing evaluation practices obscure tail behavior and complicate accountability for rare but consequential failures. The paper concludes by identifying open challenges related to privacy, sustainability, and governance that constrain long-Tail Knowledge representation. Taken together, this paper provides a unifying conceptual framework for understanding how long-Tail Knowledge is defined, lost, evaluated, and manifested in deployed language model systems.

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Cite as:
arXiv:2602.16201 [cs.CL]

(or
arXiv:2602.16201v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2602.16201

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
29. 【2602.16200】he Validity of Coreference-based Evaluations of Natural Language Understanding

链接https://arxiv.org/abs/2602.16200

作者:Ian Porada

类目:Computation and Language (cs.CL)

关键词:expanding existing evaluation, existing evaluation practices, converging or conflicting, refine our understanding, reach from coreference-based

备注: PhD Thesis

点击查看摘要

Abstract:In this thesis, I refine our understanding as to what conclusions we can reach from coreference-based evaluations by expanding existing evaluation practices and considering the extent to which evaluation results are either converging or conflicting. First, I analyze standard coreference evaluations and show that their design often leads to non-generalizable conclusions due to issues of measurement validity - including contestedness (multiple, competing definitions of coreference) and convergent validity (evaluation results that rank models differently across benchmarks). Second, I propose and implement a novel evaluation focused on testing systems' ability to infer the relative plausibility of events, a key aspect of resolving coreference. Through this extended evaluation, I find that contemporary language models demonstrate strong performance on standard benchmarks - improving over earlier baseline systems within certain domains and types of coreference - but remain sensitive to the evaluation conditions: they often fail to generalize in ways one would expect a human to be capable of when evaluation contexts are slightly modified. Taken together, these findings clarify both the strengths, such as improved accuracy over baselines on widely used evaluations, and the limitations of the current NLP paradigm, including weaknesses in measurement validity, and suggest directions for future work in developing better evaluation methods and more genuinely generalizable systems.

30. 【2602.16197】ModalImmune: Immunity Driven Unlearning via Self Destructive Training

链接https://arxiv.org/abs/2602.16197

作者:Rong Fu,Jia Yee Tan,Wenxin Zhang,Zijian Zhang,Ziming Wang,Zhaolu Kang,Muge Qi,Shuning Zhang,Simon Fong

类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Multimedia (cs.MM)

关键词:channels at deployment, real-world settings, systems are vulnerable, vulnerable to partial, partial or complete

备注: 23 pages, 8 figures

点击查看摘要

Abstract:Multimodal systems are vulnerable to partial or complete loss of input channels at deployment, which undermines reliability in real-world settings. This paper presents ModalImmune, a training framework that enforces modality immunity by intentionally and controllably collapsing selected modality information during training so the model learns joint representations that are robust to destructive modality influence. The framework combines a spectrum-adaptive collapse regularizer, an information-gain guided controller for targeted interventions, curvature-aware gradient masking to stabilize destructive updates, and a certified Neumann-truncated hyper-gradient procedure for automatic meta-parameter adaptation. Empirical evaluation on standard multimodal benchmarks demonstrates that ModalImmune improves resilience to modality removal and corruption while retaining convergence stability and reconstruction capacity.

31. 【2602.16189】Beyond Learning: A Training-Free Alternative to Model Adaptation

链接https://arxiv.org/abs/2602.16189

作者:Namkyung Yoon,Kyeonghyun Yoo,Wooyong Jung,Sanghong Kim,Hwangnam Kim

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:underperform previous versions, previous versions, language models, underperform previous, model

备注: 7 pages, 3 figures, 5 tables. Preprint submitted to Pattern Recognition Letters

点击查看摘要

Abstract:Despite the continuous research and evolution of language models, they sometimes underperform previous versions. Existing approaches to overcome these challenges are resource-intensive, highlighting the need for alternatives that enable immediate action. We assume that each language model has a local module inside that is suitable for a specific function. First, this work identifies a set of modules showing consistent and local activation changes under an inference workload through activation-based analysis. Subsequently, we transplant an internal module that is properly activated for a specific task into the target model, leading to immediate and measurable functional changes without additional training or fine-tuning. To experimentally demonstrate the effectiveness of the transplant technique, we quantify the relationship between transplant strength and performance improvement under different conditions for two language models. In the cross-generation setting, we find that transplanting activation-selected modules can substantially improve the underperforming model, reaching up to twice the target baseline and achieving gap-based recovery above 100%. Moreover, in transplant experiments between a base model and its instruction-tuned counterpart, transplantation improves the underperforming model toward the stronger baseline, yielding up to about 2.33 times the target baseline with gap-based recovery reaching up to 100% in the best case. These results show that meaningful capacity transfer can be realized through the implantation of highly localized modules implied by language models. Overall, this work provides empirical evidence for task-localized modularity in language models and presents a new research area: model transplantation.

32. 【2602.16173】Learning Personalized Agents from Human Feedback

链接https://arxiv.org/abs/2602.16173

作者:Kaiqu Liang,Julia Kruk,Shengyi Qian,Xianjun Yang,Shengjie Bi,Yuanshun Yao,Shaoliang Nie,Mingyang Zhang,Lijuan Liu,Jaime Fernández Fisac,Shuyan Zhou,Saghar Hosseini

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:fail to align, Modern, individual users, PAHF, Modern AI agents

备注

点击查看摘要

Abstract:Modern AI agents are powerful but often fail to align with the idiosyncratic, evolving preferences of individual users. Prior approaches typically rely on static datasets, either training implicit preference models on interaction history or encoding user profiles in external memory. However, these approaches struggle with new users and with preferences that change over time. We introduce Personalized Agents from Human Feedback (PAHF), a framework for continual personalization in which agents learn online from live interaction using explicit per-user memory. PAHF operationalizes a three-step loop: (1) seeking pre-action clarification to resolve ambiguity, (2) grounding actions in preferences retrieved from memory, and (3) integrating post-action feedback to update memory when preferences drift. To evaluate this capability, we develop a four-phase protocol and two benchmarks in embodied manipulation and online shopping. These benchmarks quantify an agent's ability to learn initial preferences from scratch and subsequently adapt to persona shifts. Our theoretical analysis and empirical results show that integrating explicit memory with dual feedback channels is critical: PAHF learns substantially faster and consistently outperforms both no-memory and single-channel baselines, reducing initial personalization error and enabling rapid adaptation to preference shifts.

33. 【2602.16169】Discrete Stochastic Localization for Non-autoregressive Generation

链接https://arxiv.org/abs/2602.16169

作者:Yunshu Wu,Jiayi Cheng,Partha Thakuria,Rob Brekelmans,Evangelos E. Papalexakis,Greg Ver Steeg

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:reduces decoding latency, generation reduces decoding, NAR iterative refinement, modern NAR iterative, tokens in parallel

备注

点击查看摘要

Abstract:Non-autoregressive (NAR) generation reduces decoding latency by predicting many tokens in parallel, but iterative refinement often suffers from error accumulation and distribution shift under self-generated drafts. Masked diffusion language models (MDLMs) and their remasking samplers (e.g., ReMDM) can be viewed as modern NAR iterative refinement, where generation repeatedly revises a partially observed draft. In this work we show that \emph{training alone} can substantially improve the step-efficiency of MDLM/ReMDM sampling. We propose \textsc{DSL} (Discrete Stochastic Localization), which trains a single SNR-invariant denoiser across a continuum of corruption levels, bridging intermediate draft noise and mask-style endpoint corruption within one Diffusion Transformer. On OpenWebText, \textsc{DSL} fine-tuning yields large MAUVE gains at low step budgets, surpassing the MDLM+ReMDM baseline with \(\sim\)4$\times$ fewer denoiser evaluations, and matches autoregressive quality at high budgets. Analyses show improved self-correction and uncertainty calibration, making remasking markedly more compute-efficient.

34. 【2602.16162】LLMs Exhibit Significantly Lower Uncertainty in Creative Writing Than Professional Writers

链接https://arxiv.org/abs/2602.16162

作者:Peiqi Sui

类目:Computation and Language (cs.CL)

关键词:trite and cliché-ridden, key and understudied, understudied limitation, limitation of LLMs', LLMs' performance

备注: 11 tables

点击查看摘要

Abstract:We argue that uncertainty is a key and understudied limitation of LLMs' performance in creative writing, which is often characterized as trite and cliché-ridden. Literary theory identifies uncertainty as a necessary condition for creative expression, while current alignment strategies steer models away from uncertain outputs to ensure factuality and reduce hallucination. We formalize this tension by quantifying the "uncertainty gap" between human-authored stories and model-generated continuations. Through a controlled information-theoretic analysis of 28 LLMs on high-quality storytelling datasets, we demonstrate that human writing consistently exhibits significantly higher uncertainty than model outputs. We find that instruction-tuned and reasoning models exacerbate this trend compared to their base counterparts; furthermore, the gap is more pronounced in creative writing than in functional domains, and strongly correlates to writing quality. Achieving human-level creativity requires new uncertainty-aware alignment paradigms that can distinguish between destructive hallucinations and the constructive ambiguity required for literary richness.

35. 【2602.16161】Emotion Collider: Dual Hyperbolic Mirror Manifolds for Sentiment Recovery via Anti Emotion Reflection

链接https://arxiv.org/abs/2602.16161

作者:Rong Fu,Ziming Wang,Shuo Yin,Wenxin Zhang,Haiyun Wei,Kun Liu,Xianda Li,Zeli Su,Simon Fong

类目:Multimedia (cs.MM); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Emotional expression underpins, expression underpins natural, underpins natural communication, Emotional expression, effective human-computer interaction

备注: 25 pages, 14 figures

点击查看摘要

Abstract:Emotional expression underpins natural communication and effective human-computer interaction. We present Emotion Collider (EC-Net), a hyperbolic hypergraph framework for multimodal emotion and sentiment modeling. EC-Net represents modality hierarchies using Poincare-ball embeddings and performs fusion through a hypergraph mechanism that passes messages bidirectionally between nodes and hyperedges. To sharpen class separation, contrastive learning is formulated in hyperbolic space with decoupled radial and angular objectives. High-order semantic relations across time steps and modalities are preserved via adaptive hyperedge construction. Empirical results on standard multimodal emotion benchmarks show that EC-Net produces robust, semantically coherent representations and consistently improves accuracy, particularly when modalities are partially available or contaminated by noise. These findings indicate that explicit hierarchical geometry combined with hypergraph fusion is effective for resilient multimodal affect understanding.

36. 【2602.16154】Balancing Faithfulness and Performance in Reasoning via Multi-Listener Soft Execution

链接https://arxiv.org/abs/2602.16154

作者:Nithin Sivakumaran,Shoubin Yu,Hyunji Lee,Yue Zhang,Ali Payani,Mohit Bansal,Elias Stengel-Eskin

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:LLMs arrive, large language model, hampering its utility, fails to faithfully, faithfully reflect

备注: Code: [this https URL](https://github.com/nsivaku/remul)

点击查看摘要

Abstract:Chain-of-thought (CoT) reasoning sometimes fails to faithfully reflect the true computation of a large language model (LLM), hampering its utility in explaining how LLMs arrive at their answers. Moreover, optimizing for faithfulness and interpretability in reasoning often degrades task performance. To address this tradeoff and improve CoT faithfulness, we propose Reasoning Execution by Multiple Listeners (REMUL), a multi-party reinforcement learning approach. REMUL builds on the hypothesis that reasoning traces which other parties can follow will be more faithful. A speaker model generates a reasoning trace, which is truncated and passed to a pool of listener models who "execute" the trace, continuing the trace to an answer. Speakers are rewarded for producing reasoning that is clear to listeners, with additional correctness regularization via masked supervised finetuning to counter the tradeoff between faithfulness and performance. On multiple reasoning benchmarks (BIG-Bench Extra Hard, MuSR, ZebraLogicBench, and FOLIO), REMUL consistently and substantially improves three measures of faithfulness -- hint attribution, early answering area over the curve (AOC), and mistake injection AOC -- while also improving accuracy. Our analysis finds that these gains are robust across training domains, translate to legibility gains, and are associated with shorter and more direct CoTs.

37. 【2602.16144】Missing-by-Design: Certifiable Modality Deletion for Revocable Multimodal Sentiment Analysis

链接https://arxiv.org/abs/2602.16144

作者:Rong Fu,Wenxin Zhang,Ziming Wang,Chunlei Meng,Jiaxuan Lu,Jiekai Wu,Kangan Qian,Hao Zhang,Simon Fong

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:sensitive personal data, revoke specific data, specific data modalities, systems increasingly process, increasingly process sensitive

备注: 21 pages, 6 figures

点击查看摘要

Abstract:As multimodal systems increasingly process sensitive personal data, the ability to selectively revoke specific data modalities has become a critical requirement for privacy compliance and user autonomy. We present Missing-by-Design (MBD), a unified framework for revocable multimodal sentiment analysis that combines structured representation learning with a certifiable parameter-modification pipeline. Revocability is critical in privacy-sensitive applications where users or regulators may request removal of modality-specific information. MBD learns property-aware embeddings and employs generator-based reconstruction to recover missing channels while preserving task-relevant signals. For deletion requests, the framework applies saliency-driven candidate selection and a calibrated Gaussian update to produce a machine-verifiable Modality Deletion Certificate. Experiments on benchmark datasets show that MBD achieves strong predictive performance under incomplete inputs and delivers a practical privacy-utility trade-off, positioning surgical unlearning as an efficient alternative to full retraining.

38. 【2602.16093】Updating Parametric Knowledge with Context Distillation Retains Post-Training Capabilities

链接https://arxiv.org/abs/2602.16093

作者:Shankar Padmanabhan,Mustafa Omer Gul,Tanya Goyal

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Post-training endows pretrained, endows pretrained LLMs, Post-training endows, endows pretrained, variety of desirable

备注: 15 pages. Preprint, under review

点击查看摘要

Abstract:Post-training endows pretrained LLMs with a variety of desirable skills, including instruction-following, reasoning, and others. However, these post-trained LLMs only encode knowledge up to a cut-off date, necessitating continual adaptation. Unfortunately, existing solutions cannot simultaneously learn new knowledge from an adaptation document corpora and mitigate the forgetting of earlier learned capabilities. To address this, we introduce Distillation via Split Contexts (DiSC), a simple context-distillation based approach for continual knowledge adaptation. \methodname~derives student and teacher distributions by conditioning on distinct segments of the training example and minimizes the KL divergence between the shared tokens. This allows us to efficiently apply context-distillation without requiring explicit generation steps during training. We run experiments on four post-trained models and two adaptation domains. Compared to prior finetuning and distillation methods for continual adaptation, DiSC consistently reports the best trade-off between learning new knowledge and mitigating forgetting of previously learned skills like instruction-following, reasoning, and factual knowledge.

39. 【2602.16092】Why Any-Order Autoregressive Models Need Two-Stream Attention: A Structural-Semantic Tradeoff

链接https://arxiv.org/abs/2602.16092

作者:Patrick Pynadath,Ruqi Zhang

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:native key-value caching, efficient masked diffusion, enabling native key-value, Any-order autoregressive models, required two-stream attention

备注

点击查看摘要

Abstract:Any-order autoregressive models (AO-ARMs) offer a promising path toward efficient masked diffusion by enabling native key-value caching, but competitive performance has so far required two-stream attention, typically motivated as a means of decoupling token content from position. In this work, we argue that two-stream attention may be serving a more subtle role. We identify a structural-semantic tradeoff in any-order generation: the hidden representation at each step must simultaneously attend to semantically informative tokens for prediction and structurally recent tokens for summarization, objectives that compete for attention capacity in a single stream but can specialize across two streams. To isolate this tradeoff from position-content separation, we propose Decoupled RoPE, a modification to rotary position embeddings that provides target position information without revealing target content. Decoupled RoPE performs competitively at short sequence lengths--where semantic and structural proximity coincide--but degrades as sequence length increases and the two orderings diverge. These results suggest that the success of two-stream attention stems not merely from separating position from content, but from circumventing the deeper structural-semantic tradeoff inherent to any-order generation.

40. 【2602.16085】Language Statistics and False Belief Reasoning: Evidence from 41 Open-Weight LMs

链接https://arxiv.org/abs/2602.16085

作者:Sean Trott,Samuel Taylor,Cameron Jones,James A. Michaelov,Pamela D. Rivière

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:mental state reasoning, state reasoning emerges, mental state, state reasoning, potential to inform

备注: 15 pages, 7 figures, submitted to conference

点击查看摘要

Abstract:Research on mental state reasoning in language models (LMs) has the potential to inform theories of human social cognition--such as the theory that mental state reasoning emerges in part from language exposure--and our understanding of LMs themselves. Yet much published work on LMs relies on a relatively small sample of closed-source LMs, limiting our ability to rigorously test psychological theories and evaluate LM capacities. Here, we replicate and extend published work on the false belief task by assessing LM mental state reasoning behavior across 41 open-weight models (from distinct model families). We find sensitivity to implied knowledge states in 34% of the LMs tested; however, consistent with prior work, none fully ``explain away'' the effect in humans. Larger LMs show increased sensitivity and also exhibit higher psychometric predictive power. Finally, we use LM behavior to generate and test a novel hypothesis about human cognition: both humans and LMs show a bias towards attributing false beliefs when knowledge states are cued using a non-factive verb (``John thinks...'') than when cued indirectly (``John looks in the...''). Unlike the primary effect of knowledge states, where human sensitivity exceeds that of LMs, the magnitude of the human knowledge cue effect falls squarely within the distribution of LM effect sizes-suggesting that distributional statistics of language can in principle account for the latter but not the former in humans. These results demonstrate the value of using larger samples of open-weight LMs to test theories of human cognition and evaluate LM capacities.

41. 【2602.16080】Surgical Activation Steering via Generative Causal Mediation

链接https://arxiv.org/abs/2602.16080

作者:Aruna Sankaranarayanan,Amir Zur,Atticus Geiger,Dylan Hadfield-Menell

类目:Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

关键词:Generative Causal Mediation, introduce Generative Causal, Causal Mediation, control behaviors, Generative Causal

备注

点击查看摘要

Abstract:Where should we intervene in a language model (LM) to control behaviors that are diffused across many tokens of a long-form response? We introduce Generative Causal Mediation (GCM), a procedure for selecting model components, e.g., attention heads, to steer a binary concept (e.g., talk in verse vs. talk in prose) from contrastive long-form responses. In GCM, we first construct a dataset of contrasting inputs and responses. Then, we quantify how individual model components mediate the contrastive concept and select the strongest mediators for steering. We evaluate GCM on three tasks--refusal, sycophancy, and style transfer--across three language models. GCM successfully localizes concepts expressed in long-form responses and consistently outperforms correlational probe-based baselines when steering with a sparse set of attention heads. Together, these results demonstrate that GCM provides an effective approach for localizing and controlling the long-form responses of LMs.

42. 【2602.16054】CLAA: Cross-Layer Attention Aggregation for Accelerating LLM Prefill

链接https://arxiv.org/abs/2602.16054

作者:Bradley McDanel,Steven Li,Harshit Khaitan

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:long-context LLM inference, LLM inference remains, long-context LLM, LLM inference, computational bottleneck

备注: 15 pages, 8 figures

点击查看摘要

Abstract:The prefill stage in long-context LLM inference remains a computational bottleneck. Recent token-ranking heuristics accelerate inference by selectively processing a subset of semantically relevant tokens. However, existing methods suffer from unstable token importance estimation, often varying between layers. Evaluating token-ranking quality independently from heuristic-specific architectures is challenging. To address this, we introduce an Answer-Informed Oracle, which defines ground-truth token importance by measuring attention from generated answers back to the prompt. This oracle reveals that existing heuristics exhibit high variance across layers: rankings can degrade sharply at specific layers, a failure mode invisible to end-to-end benchmarks. The diagnosis suggests a simple fix: aggregate scores across layers rather than relying on any single one. We implement this as Cross-Layer Attention Aggregation (CLAA), which closes the gap to the oracle upper bound and reduces Time-to-First-Token (TTFT) by up to 39\% compared to the Full KV Cache baseline.

43. 【2602.16050】Evidence-Grounded Subspecialty Reasoning: Evaluating a Curated Clinical Intelligence Layer on the 2025 Endocrinology Board-Style Examination

链接https://arxiv.org/abs/2602.16050

作者:Amir Hosseinian,MohammadReza Zare Shahneh,Umer Mansoor,Gilbert Szeto,Kirill Karlin,Nima Aghaeepour

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large language models, demonstrated strong performance, remains challenging due, general medical examinations, Large language

备注

点击查看摘要

Abstract:Background: Large language models have demonstrated strong performance on general medical examinations, but subspecialty clinical reasoning remains challenging due to rapidly evolving guidelines and nuanced evidence hierarchies. Methods: We evaluated January Mirror, an evidence-grounded clinical reasoning system, against frontier LLMs (GPT-5, GPT-5.2, Gemini-3-Pro) on a 120-question endocrinology board-style examination. Mirror integrates a curated endocrinology and cardiometabolic evidence corpus with a structured reasoning architecture to generate evidence-linked outputs. Mirror operated under a closed-evidence constraint without external retrieval. Comparator LLMs had real-time web access to guidelines and primary literature. Results: Mirror achieved 87.5% accuracy (105/120; 95% CI: 80.4-92.3%), exceeding a human reference of 62.3% and frontier LLMs including GPT-5.2 (74.6%), GPT-5 (74.0%), and Gemini-3-Pro (69.8%). On the 30 most difficult questions (human accuracy less than 50%), Mirror achieved 76.7% accuracy. Top-2 accuracy was 92.5% for Mirror versus 85.25% for GPT-5.2. Conclusions: Mirror provided evidence traceability: 74.2% of outputs cited at least one guideline-tier source, with 100% citation accuracy on manual verification. Curated evidence with explicit provenance can outperform unconstrained web retrieval for subspecialty clinical reasoning and supports auditability for clinical deployment.

44. 【2602.16023】A Curious Class of Adpositional Multiword Expressions in Korean

链接https://arxiv.org/abs/2602.16023

作者:Junghyun Min,Na-Rae Han,Jena D. Hwang,Nathan Schneider

类目:Computation and Language (cs.CL)

关键词:widely studied, Korean multiword adpositions, PARSEME, Korean, Korean multiword

备注: 10 pages. Camera-ready for MWE at EACL 2026

点击查看摘要

Abstract:Multiword expressions (MWEs) have been widely studied in cross-lingual annotation frameworks such as PARSEME. However, Korean MWEs remain underrepresented in these efforts. In particular, Korean multiword adpositions lack systematic analysis, annotated resources, and integration into existing multilingual frameworks. In this paper, we study a class of Korean functional multiword expressions: postpositional verb-based constructions (PVCs). Using data from Korean Wikipedia, we survey and analyze several PVC expressions and contrast them with non-MWEs and light verb constructions (LVCs) with similar structure. Building on this analysis, we propose annotation guidelines designed to support future work in Korean multiword adpositions and facilitate alignment with cross-lingual frameworks.

45. 【2602.16008】MAEB: Massive Audio Embedding Benchmark

链接https://arxiv.org/abs/2602.16008

作者:Adnan El Assadi,Isaac Chung,Chenghao Xiao,Roman Solomatin,Animesh Jha,Rahul Chand,Silky Singh,Kaitlyn Wang,Ali Sartaz Khan,Marc Moussa Nasser,Sufen Fong,Pengfei He,Alan Xiao,Ayush Sunil Munot,Aditya Shrivastava,Artem Gazizov,Niklas Muennighoff,Kenneth Enevoldsen

类目:ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Massive Audio Embedding, Audio Embedding Benchmark, large-scale benchmark covering, Embedding Benchmark, introduce the Massive

备注

点击查看摘要

Abstract:We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages. We evaluate 50+ models and find that no single model dominates across all tasks: contrastive audio-text models excel at environmental sound classification (e.g., ESC50) but score near random on multilingual speech tasks (e.g., SIB-FLEURS), while speech-pretrained models show the opposite pattern. Clustering remains challenging for all models, with even the best-performing model achieving only modest results. We observe that models excelling on acoustic understanding often perform poorly on linguistic tasks, and vice versa. We also show that the performance of audio encoders on MAEB correlates highly with their performance when used in audio large language models. MAEB is derived from MAEB+, a collection of 98 tasks. MAEB is designed to maintain task diversity while reducing evaluation cost, and it integrates into the MTEB ecosystem for unified evaluation across text, image, and audio modalities. We release MAEB and all 98 tasks along with code and a leaderboard at this https URL.

46. 【2602.15997】Anatomy of Capability Emergence: Scale-Invariant Representation Collapse and Top-Down Reorganization in Neural Networks

链接https://arxiv.org/abs/2602.15997

作者:Jayadev Billa

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:remains mechanistically opaque, neural network training, network training remains, training remains mechanistically, Capability emergence

备注: 19 pages, 6 figures, 12 appendix pages

点击查看摘要

Abstract:Capability emergence during neural network training remains mechanistically opaque. We track five geometric measures across five model scales (405K-85M parameters), 120+ emergence events in eight algorithmic tasks, and three Pythia language models (160M-2.8B). We find: (1) training begins with a universal representation collapse to task-specific floors that are scale-invariant across a 210X parameter range (e.g., modular arithmetic collapses to RANKME ~ 2.0 regardless of model size); (2) collapse propagates top-down through layers (32/32 task X model consistency), contradicting bottom-up feature-building intuition; (3) a geometric hierarchy in which representation geometry leads emergence (75-100% precursor rate for hard tasks), while the local learning coefficient is synchronous (0/24 precursor) and Hessian measures lag. We also delineate prediction limits: geometric measures encode coarse task difficulty but not fine-grained timing (within-class concordance 27%; when task ordering reverses across scales, prediction fails at 26%). On Pythia, global geometric patterns replicate but per-task precursor signals do not -- the precursor relationship requires task-training alignment that naturalistic pre-training does not provide. Our contribution is the geometric anatomy of emergence and its boundary conditions, not a prediction tool.

47. 【2602.15958】DocSplit: A Comprehensive Benchmark Dataset and Evaluation Approach for Document Packet Recognition and Splitting

链接https://arxiv.org/abs/2602.15958

作者:Md Mofijul Islam,Md Sirajus Salekin,Nivedha Balakrishnan,Vincil C. Bishop III,Niharika Jain,Spencer Romo,Bob Strahan,Boyi Xie,Diego A. Socolinsky

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:multiple documents stitched, document packet, multi-page document packets, document packet splitting, Document

备注

点击查看摘要

Abstract:Document understanding in real-world applications often requires processing heterogeneous, multi-page document packets containing multiple documents stitched together. Despite recent advances in visual document understanding, the fundamental task of document packet splitting, which involves separating a document packet into individual units, remains largely unaddressed. We present the first comprehensive benchmark dataset, DocSplit, along with novel evaluation metrics for assessing the document packet splitting capabilities of large language models. DocSplit comprises five datasets of varying complexity, covering diverse document types, layouts, and multimodal settings. We formalize the DocSplit task, which requires models to identify document boundaries, classify document types, and maintain correct page ordering within a document packet. The benchmark addresses real-world challenges, including out-of-order pages, interleaved documents, and documents lacking clear demarcations. We conduct extensive experiments evaluating multimodal LLMs on our datasets, revealing significant performance gaps in current models' ability to handle complex document splitting tasks. The DocSplit benchmark datasets and proposed novel evaluation metrics provide a systematic framework for advancing document understanding capabilities essential for legal, financial, healthcare, and other document-intensive domains. We release the datasets to facilitate future research in document packet processing.

48. 【2602.15902】Doc-to-LoRA: Learning to Instantly Internalize Contexts

链接https://arxiv.org/abs/2602.15902

作者:Rujikorn Charakorn,Edoardo Cetin,Shinnosuke Uesaka,Robert Tjarko Lange

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, Long input sequences, Long input, reasoning of Large

备注

点击查看摘要

Abstract:Long input sequences are central to in-context learning, document understanding, and multi-step reasoning of Large Language Models (LLMs). However, the quadratic attention cost of Transformers makes inference memory-intensive and slow. While context distillation (CD) can transfer information into model parameters, per-prompt distillation is impractical due to training costs and latency. To address these limitations, we propose Doc-to-LoRA (D2L), a lightweight hypernetwork that meta-learns to perform approximate CD within a single forward pass. Given an unseen prompt, D2L generates a LoRA adapter for a target LLM, enabling subsequent queries to be answered without re-consuming the original context, reducing latency and KV-cache memory consumption during inference of the target LLM. On a long-context needle-in-a-haystack task, D2L successfully learns to map contexts into adapters that store the needle information, achieving near-perfect zero-shot accuracy at sequence lengths exceeding the target LLM's native context window by more than 4x. On real-world QA datasets with limited compute, D2L outperforms standard CD while significantly reducing peak memory consumption and update latency. We envision that D2L can facilitate rapid adaptation of LLMs, opening up the possibility of frequent knowledge updates and personalized chat behavior.

49. 【2602.15898】MultiCube-RAG for Multi-hop Question Answering

链接https://arxiv.org/abs/2602.15898

作者:Jimeng Shi,Wei Hu,Runchu Tian,Bowen Jin,Wonbin Kweon,SeongKu Kang,Yunfan Kang,Dingqi Ye,Sizhe Zhou,Shaowen Wang,Jiawei Han

类目:Computation and Language (cs.CL)

关键词:Multi-hop question answering, question answering, necessitates multi-step reasoning, reasoning and retrieval, reasoning

备注: 12 pages

点击查看摘要

Abstract:Multi-hop question answering (QA) necessitates multi-step reasoning and retrieval across interconnected subjects, attributes, and relations. Existing retrieval-augmented generation (RAG) methods struggle to capture these structural semantics accurately, resulting in suboptimal performance. Graph-based RAGs structure such information in graphs, but the resulting graphs are often noisy and computationally expensive. Moreover, most methods rely on single-step retrieval, neglecting the need for multi-hop reasoning processes. Recent training-based approaches attempt to incentivize the large language models (LLMs) for iterative reasoning and retrieval, but their training processes are prone to unstable convergence and high computational overhead. To address these limitations, we devise an ontology-based cube structure with multiple and orthogonal dimensions to model structural subjects, attributes, and relations. Built on the cube structure, we propose MultiCube-RAG, a training-free method consisting of multiple cubes for multi-step reasoning and retrieval. Each cube specializes in modeling a class of subjects, so that MultiCube-RAG flexibly selects the most suitable cubes to acquire the relevant knowledge precisely. To enhance the query-based reasoning and retrieval, our method decomposes a complex multi-hop query into a set of simple subqueries along cube dimensions and conquers each of them sequentially. Experiments on four multi-hop QA datasets show that MultiCube-RAG improves response accuracy by 8.9% over the average performance of various baselines. Notably, we also demonstrate that our method performs with greater efficiency and inherent explainability.

50. 【2602.15897】Mitigating Gradient Inversion Risks in Language Models via Token Obfuscation

链接https://arxiv.org/abs/2602.15897

作者:Xinguo Feng,Zhongkui Ma,Zihan Wang,Alsharif Abuadbba,Guangdong Bai

类目:Computation and Language (cs.CL)

关键词:fine-tuning large-scale language, reconstruct private training, private training data, large-scale language models, language models largely

备注

点击查看摘要

Abstract:Training and fine-tuning large-scale language models largely benefit from collaborative learning, but the approach has been proven vulnerable to gradient inversion attacks (GIAs), which allow adversaries to reconstruct private training data from shared gradients. Existing defenses mainly employ gradient perturbation techniques, e.g., noise injection or gradient pruning, to disrupt GIAs' direct mapping from gradient space to token space. However, these methods often fall short due to the retention of semantics similarity across gradient, embedding, and token spaces. In this work, we propose a novel defense mechanism named GHOST (gradient shield with obfuscated tokens), a token-level obfuscation mechanism that neutralizes GIAs by decoupling the inherent connections across gradient, embedding, and token spaces. GHOST is built upon an important insight: due to the large scale of the token space, there exist semantically distinct yet embedding-proximate tokens that can serve as the shadow substitutes of the original tokens, which enables a semantic disconnection in the token space while preserving the connection in the embedding and gradient spaces. GHOST comprises a searching step, which identifies semantically distinct candidate tokens using a multi-criteria searching process, and a selection step, which selects optimal shadow tokens to ensure minimal disruption to features critical for training by preserving alignment with the internal outputs produced by original tokens. Evaluation across diverse model architectures (from BERT to Llama) and datasets demonstrates the remarkable effectiveness of GHOST in protecting privacy (as low as 1% in recovery rate) and preserving utility (up to 0.92 in classification F1 and 5.45 in perplexity), in both classification and generation tasks against state-of-the-art GIAs and adaptive attack scenarios.

51. 【2602.15896】Every Little Helps: Building Knowledge Graph Foundation Model with Fine-grained Transferable Multi-modal Tokens

链接https://arxiv.org/abs/2602.15896

作者:Yichi Zhang,Zhuo Chen,Lingbing Guo,Wen Zhang,Huajun Chen

类目:Computation and Language (cs.CL)

关键词:multi-modal entity contents, knowledge graph reasoning, aims to predict, entity contents, predict the missing

备注: Work in progress

点击查看摘要

Abstract:Multi-modal knowledge graph reasoning (MMKGR) aims to predict the missing links by exploiting both graph structure information and multi-modal entity contents. Most existing works are designed for a transductive setting, which learns dataset-specific embeddings and struggles to generalize to new KGs. Recent knowledge graph foundation models (KGFMs) improve cross-KG transfer, but they mainly exploit structural patterns and ignore rich multi-modal signals. We address these gaps by proposing a token-based foundation model (TOFU) for MMKGR, which exhibits strong generalization across different MMKGs. TOFU discretizes structural, visual, and textual information into modality-specific tokens. TOFU then employs a hierarchical fusion architecture with mixture-of-message mechanisms, aiming to process these tokens and obtain transferable features for MMKGR. Experimental results on 17 transductive, inductive, and fully-inductive MMKGs show that TOFU consistently outperforms strong KGFM and MMKGR baselines, delivering strong performance on unseen MMKGs.

52. 【2602.15895】Understand Then Memory: A Cognitive Gist-Driven RAG Framework with Global Semantic Diffusion

链接https://arxiv.org/abs/2602.15895

作者:Pengcheng Zhou,Haochen Li,Zhiqiang Nie,JiaLe Chen,Qing Gong,Weizhen Zhang,Chun Yu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:effectively mitigates hallucinations, incorporating external knowledge, effectively mitigates, mitigates hallucinations, hallucinations in LLMs

备注

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) effectively mitigates hallucinations in LLMs by incorporating external knowledge. However, the inherent discrete representation of text in existing frameworks often results in a loss of semantic integrity, leading to retrieval deviations. Inspired by the human episodic memory mechanism, we propose CogitoRAG, a RAG framework that simulates human cognitive memory processes. The core of this framework lies in the extraction and evolution of the Semantic Gist. During the offline indexing stage, CogitoRAG first deduces unstructured corpora into gist memory corpora, which are then transformed into a multi-dimensional knowledge graph integrating entities, relational facts, and memory nodes. In the online retrieval stage, the framework handles complex queries via Query Decomposition Module that breaks them into comprehensive sub-queries, mimicking the cognitive decomposition humans employ for complex information. Subsequently, Entity Diffusion Module performs associative retrieval across the graph, guided by structural relevance and an entity-frequency reward mechanism. Furthermore, we propose the CogniRank algorithm, which precisely reranks candidate passages by fusing diffusion-derived scores with semantic similarity. The final evidence is delivered to the generator in a passage-memory pairing format, providing high-density information support. Experimental results across five mainstream QA benchmarks and multi-task generation on GraphBench demonstrate that CogitoRAG significantly outperforms state-of-the-art RAG methods, showcasing superior capabilities in complex knowledge integration and reasoning.

53. 【2602.15894】Quality-constrained Entropy Maximization Policy Optimization for LLM Diversity

链接https://arxiv.org/abs/2602.15894

作者:Haihui Pan,Yuzhong Hong,Shaoke Lv,Junwei Bao,Hongfei Jiang,Yang Song

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:large language model, Recent research, methods significantly improve, LLM output diversity, language model

备注

点击查看摘要

Abstract:Recent research indicates that while alignment methods significantly improve the quality of large language model(LLM) outputs, they simultaneously reduce the diversity of the models' output. Although some methods have been proposed to enhance LLM output diversity, they often come at the cost of reduced performance. In this work, we first theoretically demonstrate that the alignment task can be decomposed into two distributions: quality and diversity. To enhance the diversity of LLM outputs while ensuring quality, we propose the Quality-constrained Entropy Maximization Policy Optimization (QEMPO). QEMPO aims to maximize the output entropy of the policy while ensuring output quality. By adding different constraints to QEMPO, we obtain different policies. To optimize policies, we propose both online and offline training methods. Experiments validate that QEMPO achieves performance comparable to or even better than RLHF while improving output diversity.

54. 【2602.15874】P-RAG: Prompt-Enhanced Parametric RAG with LoRA and Selective CoT for Biomedical and Multi-Hop QA

链接https://arxiv.org/abs/2602.15874

作者:Xingda Lyu,Gongfu Lyu,Zitai Yan,Yuxin Jiang

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, Language Models, demonstrate remarkable capabilities, static training data

备注

点击查看摘要

Abstract:Large Language Models (LLMs) demonstrate remarkable capabilities but remain limited by their reliance on static training data. Retrieval-Augmented Generation (RAG) addresses this constraint by retrieving external knowledge during inference, though it still depends heavily on knowledge base quality. To explore potential improvements, we evaluated three RAG variants-Standard RAG, DA-RAG, and our proposed Prompt-Enhanced Parametric RAG (P-RAG), a hybrid architecture that integrates parametric knowledge within the LLM and retrieved evidence, guided by Chain-of-Thought (CoT) prompting and Low-Rank Adaptation (LoRA) fine-tuning-on both general and biomedical datasets. Using LLaMA-3.2-1B-Instruct fine-tuned via LoRA, we evaluate on PubMedQA and 2WikiMultihopQA. P-RAG outperforms Standard RAG on PubMedQA by 10.47 percentage points in F1 (93.33% vs. 82.86%; 12.64% relative). On 2WikiMultihopQA, P-RAG nearly doubles the overall score vs. Standard RAG (33.44% vs. 17.83%) and achieves 44.03% on the Compare subset (with 42.74% Bridge, 21.84% Inference, 8.60% Compose). CoT prompting substantially improves multi-hop reasoning but yields mixed results for simpler, single-hop queries. These findings underscore P-RAG's potential for accurate, scalable, and contextually adaptive biomedical question answering. Our contributions include: (1) LoRA-based fine-tuning of LLaMA-3.2-1B-Instruct for biomedical QA, (2) introduction of P-RAG with Chain-of-Thought prompting, and (3) state-of-the-art results on PubMedQA and 2WikiMultihopQA.

55. 【2602.15871】CheckIfExist: Detecting Citation Hallucinations in the Era of AI-Generated Content

链接https://arxiv.org/abs/2602.15871

作者:Diletta Abbonato

类目:Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:large language models, introduced unprecedented challenges, language models, proliferation of large, large language

备注

点击查看摘要

Abstract:The proliferation of large language models (LLMs) in academic workflows has introduced unprecedented challenges to bibliographic integrity, particularly through reference hallucination -- the generation of plausible but non-existent citations. Recent investigations have documented the presence of AI-hallucinated citations even in papers accepted at premier machine learning conferences such as NeurIPS and ICLR, underscoring the urgency of automated verification mechanisms. This paper presents "CheckIfExist", an open-source web-based tool designed to provide immediate verification of bibliographic references through multi-source validation against CrossRef, Semantic Scholar, and OpenAlex scholarly databases. While existing reference management tools offer bibliographic organization capabilities, they do not provide real-time validation of citation authenticity. Commercial hallucination detection services, though increasingly available, often impose restrictive usage limits on free tiers or require substantial subscription fees. The proposed tool fills this gap by employing a cascading validation architecture with string similarity algorithms to compute multi-dimensional match confidence scores, delivering instant feedback on reference authenticity. The system supports both single-reference verification and batch processing of BibTeX entries through a unified interface, returning validated APA citations and exportable BibTeX records within seconds.

56. 【2602.15870】VDLM: Variable Diffusion LMs via Robust Latent-to-Text Rendering

链接https://arxiv.org/abs/2602.15870

作者:Shuhui Qu

类目:Computation and Language (cs.CL)

关键词:Autoregressive language models, language models decode, Autoregressive language, irreversible commitments, limiting revision

备注

点击查看摘要

Abstract:Autoregressive language models decode left-to-right with irreversible commitments, limiting revision during multi-step reasoning. We propose \textbf{VDLM}, a modular variable diffusion language model that separates semantic planning from text rendering. VDLM applies LLaDA-style masked diffusion over semantic variable embeddings to enable iterative refinement in latent space, then post-trains the planner with trajectory-aware optimization using embedding-space rewards and values, avoiding text decoding inside the RL loop. To convert planned embeddings back to text, we use a \textbf{Vec2Text} renderer and introduce \textbf{embedding perturbations} to robustify decoding under planner noise. Across nine benchmarks spanning general reasoning, math, and code, VDLM is competitive in pre-training and yields substantial post-training improvements on long-form generation tasks, outperforming other baselines. These results highlight the effectiveness of embedding-space post-training and robust latent-to-text rendering for diffusion language modeling.

57. 【2602.15869】owards Fair and Efficient De-identification: Quantifying the Efficiency and Generalizability of De-identification Approaches

链接https://arxiv.org/abs/2602.15869

作者:Noopur Zambare,Kiana Aghakasiri,Carissa Lin,Carrie Ye,J. Ross Mitchell,Mohamed Abdalla

类目:Computation and Language (cs.CL)

关键词:shown strong performance, identifying sensitive identifiers, protect privacy, shown strong, task of identifying

备注: Accpted to the Findings of EACL 2026

点击查看摘要

Abstract:Large language models (LLMs) have shown strong performance on clinical de-identification, the task of identifying sensitive identifiers to protect privacy. However, previous work has not examined their generalizability between formats, cultures, and genders. In this work, we systematically evaluate fine-tuned transformer models (BERT, ClinicalBERT, ModernBERT), small LLMs (Llama 1-8B, Qwen 1.5-7B), and large LLMs (Llama-70B, Qwen-72B) at de-identification. We show that smaller models achieve comparable performance while substantially reducing inference cost, making them more practical for deployment. Moreover, we demonstrate that smaller models can be fine-tuned with limited data to outperform larger models in de-identifying identifiers drawn from Mandarin, Hindi, Spanish, French, Bengali, and regional variations of English, in addition to gendered names. To improve robustness in multi-cultural contexts, we introduce and publicly release BERT-MultiCulture-DEID, a set of de-identification models based on BERT, ClinicalBERT, and ModernBERT, fine-tuned on MIMIC with identifiers from multiple language variants. Our findings provide the first comprehensive quantification of the efficiency-generalizability trade-off in de-identification and establish practical pathways for fair and efficient clinical de-identification. Details on accessing the models are available at: this https URL

Comments:
Accpted to the Findings of EACL 2026

Subjects:

Computation and Language (cs.CL)

Cite as:
arXiv:2602.15869 [cs.CL]

(or
arXiv:2602.15869v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2602.15869

Focus to learn more

              arXiv-issued DOI via DataCite</p>
58. 【2602.15868】Understanding LLM Failures: A Multi-Tape Turing Machine Analysis of Systematic Errors in Language Model Reasoning

链接https://arxiv.org/abs/2602.15868

作者:Magnus Boman

类目:Computation and Language (cs.CL)

关键词:Large language models, Large language, seemingly trivial tasks, exhibit failure modes, seemingly trivial

备注: 8 pages, 1 page appendix

点击查看摘要

Abstract:Large language models (LLMs) exhibit failure modes on seemingly trivial tasks. We propose a formalisation of LLM interaction using a deterministic multi-tape Turing machine, where each tape represents a distinct component: input characters, tokens, vocabulary, model parameters, activations, probability distributions, and output text. The model enables precise localisation of failure modes to specific pipeline stages, revealing, e.g., how tokenisation obscures character-level structure needed for counting tasks. The model clarifies why techniques like chain-of-thought prompting help, by externalising computation on the output tape, while also revealing their fundamental limitations. This approach provides a rigorous, falsifiable alternative to geometric metaphors and complements empirical scaling laws with principled error analysis.

59. 【2602.15867】Playing With AI: How Do State-Of-The-Art Large Language Models Perform in the 1977 Text-Based Adventure Game Zork?

链接https://arxiv.org/abs/2602.15867

作者:Berry Gerrits

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:contemporary Large Language, Large Language Models, contemporary Large, Large Language, seminal text-based adventure

备注: 18 pages, 1 figure

点击查看摘要

Abstract:In this positioning paper, we evaluate the problem-solving and reasoning capabilities of contemporary Large Language Models (LLMs) through their performance in Zork, the seminal text-based adventure game first released in 1977. The game's dialogue-based structure provides a controlled environment for assessing how LLM-based chatbots interpret natural language descriptions and generate appropriate action sequences to succeed in the game. We test the performance of leading proprietary models - ChatGPT, Claude, and Gemini - under both minimal and detailed instructions, measuring game progress through achieved scores as the primary metric. Our results reveal that all tested models achieve less than 10% completion on average, with even the best-performing model (Claude Opus 4.5) reaching only approximately 75 out of 350 possible points. Notably, providing detailed game instructions offers no improvement, nor does enabling ''extended thinking''. Qualitative analysis of the models' reasoning processes reveals fundamental limitations: repeated unsuccessful actions suggesting an inability to reflect on one's own thinking, inconsistent persistence of strategies, and failure to learn from previous attempts despite access to conversation history. These findings suggest substantial limitations in current LLMs' metacognitive abilities and problem-solving capabilities within the domain of text-based games, raising questions about the nature and extent of their reasoning capabilities.

60. 【2602.15866】NLP Privacy Risk Identification in Social Media (NLP-PRISM): A Survey

链接https://arxiv.org/abs/2602.15866

作者:Dhiman Goswami,Jai Kruthunz Naveen Kumar,Sanchari Das

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

关键词:Personally Identifiable Information, Identifiable Information, Personally Identifiable, Natural Language Processing, social media analytics

备注

点击查看摘要

Abstract:Natural Language Processing (NLP) is integral to social media analytics but often processes content containing Personally Identifiable Information (PII), behavioral cues, and metadata raising privacy risks such as surveillance, profiling, and targeted advertising. To systematically assess these risks, we review 203 peer-reviewed papers and propose the NLP Privacy Risk Identification in Social Media (NLP-PRISM) framework, which evaluates vulnerabilities across six dimensions: data collection, preprocessing, visibility, fairness, computational risk, and regulatory compliance. Our analysis shows that transformer models achieve F1-scores ranging from 0.58-0.84, but incur a 1% - 23% drop under privacy-preserving fine-tuning. Using NLP-PRISM, we examine privacy coverage in six NLP tasks: sentiment analysis (16), emotion detection (14), offensive language identification (19), code-mixed processing (39), native language identification (29), and dialect detection (24) revealing substantial gaps in privacy research. We further found a (reduced by 2% - 9%) trade-off in model utility, MIA AUC (membership inference attacks) 0.81, AIA accuracy 0.75 (attribute inference attacks). Finally, we advocate for stronger anonymization, privacy-aware learning, and fairness-driven training to enable ethical NLP in social media contexts.

61. 【2602.15865】AI as Teammate or Tool? A Review of Human-AI Interaction in Decision Support

链接https://arxiv.org/abs/2602.15865

作者:Most. Sharmin Sultana Samu,Nafisa Khan,Kazi Toufique Elahi,Tasnuva Binte Rahman,Md. Rakibul Islam,Farig Sadeque

类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Artificial Intelligence, integration of Artificial, necessitates determining, function as tools, Intelligence

备注: Preprint

点击查看摘要

Abstract:The integration of Artificial Intelligence (AI) necessitates determining whether systems function as tools or collaborative teammates. In this study, by synthesizing Human-AI Interaction (HAI) literature, we analyze this distinction across four dimensions: interaction design, trust calibration, collaborative frameworks and healthcare applications. Our analysis reveals that static interfaces and miscalibrated trust limit AI efficacy. Performance hinges on aligning transparency with cognitive workflows, yet a fluency trap often inflates trust without improving decision-making. Consequently, an overemphasis on explainability leaves systems largely passive. Our findings show that current AI systems remain largely passive due to an overreliance on explainability-centric designs and that transitioning AI to an active teammate requires adaptive, context-aware interactions that support shared mental models and the dynamic negotiation of authority between humans and AI.

62. 【2602.15863】Not the Example, but the Process: How Self-Generated Examples Enhance LLM Reasoning

链接https://arxiv.org/abs/2602.15863

作者:Daehoon Gwak,Minseo Jung,Junwoo Park,Minho Park,ChaeHun Park,Junha Hyung,Jaegul Choo

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, achieving results comparable, Recent studies, shown that Large

备注: Presented at AACL-IJCNLP 2025

点击查看摘要

Abstract:Recent studies have shown that Large Language Models (LLMs) can improve their reasoning performance through self-generated few-shot examples, achieving results comparable to manually curated in-context examples. However, the underlying mechanism behind these gains remains unclear, making it hard to decide when and how to apply the technique effectively. In this work, we argue that the key benefit arises not from the generated examples themselves but from the act of creating them. To validate this, on reasoning-intensive tasks across diverse LLM architectures, we systematically evaluate three prompting strategies for in-context learning: (1) Zero-shot prompting; (2) Integrated prompting, where LLMs create and solve problems within a single, unified prompt; and (3) Decoupled prompting, where self-generated examples are reused as in-context examples, but the context of their creation itself is excluded. We conduct experiments across five widely used model architectures, demonstrating that Integrated prompting consistently outperforms both Zero-shot and Decoupled prompting. In contrast, Decoupled prompting offers only marginal gains over Zero-shot. Further, for a more in-depth analysis, we conduct an attention analysis and observe significant differences in attention patterns between Integrated and Decoupled prompting. These findings suggest that the advantage of self-generation prompting comes from the process of problem creation, not the examples themselves, providing valuable insights for designing more effective prompting strategies.

63. 【2602.15862】Enhancing Action and Ingredient Modeling for Semantically Grounded Recipe Generation

链接https://arxiv.org/abs/2602.15862

作者:Guoshan Liu,Bin Zhu,Yian Li,Jingjing Chen,Chong-Wah Ngo,Yu-Gang Jiang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Multimodal Large Language, high lexical scores, Language Models, Multimodal Large

备注

点击查看摘要

Abstract:Recent advances in Multimodal Large Language Models (MLMMs) have enabled recipe generation from food images, yet outputs often contain semantically incorrect actions or ingredients despite high lexical scores (e.g., BLEU, ROUGE). To address this gap, we propose a semantically grounded framework that predicts and validates actions and ingredients as internal context for instruction generation. Our two-stage pipeline combines supervised fine-tuning (SFT) with reinforcement fine-tuning (RFT): SFT builds foundational accuracy using an Action-Reasoning dataset and ingredient corpus, while RFT employs frequency-aware rewards to improve long-tail action prediction and ingredient generalization. A Semantic Confidence Scoring and Rectification (SCSR) module further filters and corrects predictions. Experiments on Recipe1M show state-of-the-art performance and markedly improved semantic fidelity.

64. 【2602.15861】CAST: Achieving Stable LLM-based Text Analysis for Data Analytics

链接https://arxiv.org/abs/2602.15861

作者:Jinxiang Xie,Zihao Li,Wei He,Rui Ding,Shi Han,Dongmei Zhang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:corpus-level theme extraction, Text analysis, tabular data relies, textbf, core operations

备注

点击查看摘要

Abstract:Text analysis of tabular data relies on two core operations: \emph{summarization} for corpus-level theme extraction and \emph{tagging} for row-level labeling. A critical limitation of employing large language models (LLMs) for these tasks is their inability to meet the high standards of output stability demanded by data analytics. To address this challenge, we introduce \textbf{CAST} (\textbf{C}onsistency via \textbf{A}lgorithmic Prompting and \textbf{S}table \textbf{T}hinking), a framework that enhances output stability by constraining the model's latent reasoning path. CAST combines (i) Algorithmic Prompting to impose a procedural scaffold over valid reasoning transitions and (ii) Thinking-before-Speaking to enforce explicit intermediate commitments before final generation. To measure progress, we introduce \textbf{CAST-S} and \textbf{CAST-T}, stability metrics for bulleted summarization and tagging, and validate their alignment with human judgments. Experiments across publicly available benchmarks on multiple LLM backbones show that CAST consistently achieves the best stability among all baselines, improving Stability Score by up to 16.2\%, while maintaining or improving output quality.

65. 【2602.15860】Reranker Optimization via Geodesic Distances on k-NN Manifolds

链接https://arxiv.org/abs/2602.15860

作者:Wen G. Gong

类目:Computation and Language (cs.CL)

关键词:Current neural reranking, large language models, requiring substantial computational, substantial computational resources, Current neural

备注: 10 pages, 3 tables

点击查看摘要

Abstract:Current neural reranking approaches for retrieval-augmented generation (RAG) rely on cross-encoders or large language models (LLMs), requiring substantial computational resources and exhibiting latencies of 3-5 seconds per query. We propose Maniscope, a geometric reranking method that computes geodesic distances on k-nearest neighbor (k-NN) manifolds constructed over retrieved document candidates. This approach combines global cosine similarity with local manifold geometry to capture semantic structure that flat Euclidean metrics miss. Evaluating on eight BEIR benchmark datasets (1,233 queries), Maniscope outperforms HNSW graph-based baseline on the three hardest datasets (NFCorpus: +7.0%, TREC-COVID: +1.6%, AorB: +2.8% NDCG@3) while being 3.2x faster (4.7 ms vs 14.8 ms average). Compared to cross-encoder rerankers, Maniscope achieves within 2% accuracy at 10-45x lower latency. On TREC-COVID, LLM-Reranker provides only +0.5% NDCG@3 improvement over Maniscope at 840x higher latency, positioning Maniscope as a practical alternative for real-time RAG deployment. The method requires O(N D + M^2 D + M k log k) complexity where M N , enabling sub-10 ms latency. We plan to release Maniscope as open-source software.

66. 【2602.15859】From Transcripts to AI Agents: Knowledge Extraction, RAG Integration, and Robust Evaluation of Conversational AI Assistants

链接https://arxiv.org/abs/2602.15859

作者:Krittin Pachtrachai,Petmongkon Pornpichitsuwan,Wachiravit Modecrua,Touchapon Kraisingkorn

类目:Computation and Language (cs.CL)

关键词:Building reliable conversational, customer-facing industries remains, accurate human hand-off, Building reliable, industries remains challenging

备注: 9 pages, 2 figures, 1 table

点击查看摘要

Abstract:Building reliable conversational AI assistants for customer-facing industries remains challenging due to noisy conversational data, fragmented knowledge, and the requirement for accurate human hand-off - particularly in domains that depend heavily on real-time information. This paper presents an end-to-end framework for constructing and evaluating a conversational AI assistant directly from historical call transcripts. Incoming transcripts are first graded using a simplified adaptation of the PIPA framework, focusing on observation alignment and appropriate response behavior, and are filtered to retain only high-quality interactions exhibiting coherent flow and effective human agent responses. Structured knowledge is then extracted from curated transcripts using large language models (LLMs) and deployed as the sole grounding source in a Retrieval-Augmented Generation (RAG) pipeline. Assistant behavior is governed through systematic prompt tuning, progressing from monolithic prompts to lean, modular, and governed designs that ensure consistency, safety, and controllable execution. Evaluation is conducted using a transcript-grounded user simulator, enabling quantitative measurement of call coverage, factual accuracy, and human escalation behavior. Additional red teaming assesses robustness against prompt injection, out-of-scope, and out-of-context attacks. Experiments are conducted in the Real Estate and Specialist Recruitment domains, which are intentionally challenging and currently suboptimal for automation due to their reliance on real-time data. Despite these constraints, the assistant autonomously handles approximately 30 percents of calls, achieves near-perfect factual accuracy and rejection behavior, and demonstrates strong robustness under adversarial testing.

67. 【2602.15858】State Design Matters: How Representations Shape Dynamic Reasoning in Large Language Models

链接https://arxiv.org/abs/2602.15858

作者:Annie Wong,Aske Plaat,Thomas Bäck,Niki van Stein,Anna V. Kononova

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:dynamic environments, inference time, tasks toward dynamic, success depends, ability to navigate

备注

点击查看摘要

Abstract:As large language models (LLMs) move from static reasoning tasks toward dynamic environments, their success depends on the ability to navigate and respond to an environment that changes as they interact at inference time. An underexplored factor in these settings is the representation of the state. Holding model parameters fixed, we systematically vary three key aspects: (1) state granularity (long form versus summary), (2) structure (natural language versus symbolic), and (3) spatial grounding (text-only versus images or textual map encodings) across sequential decision-making benchmarks. We find that trajectory summarisation improves performance by reducing noise and stabilising long-horizon reasoning. Second, natural language representations are the most robust across models, whereas structured encodings help mainly for models with strong code or structured output priors, such as JSON schemas. Third, while image-inputs show some benefit, text-based spatial encodings prove most effective. This advantage stems not from the spatial information itself, but from the act of construction, which compels the model to perform the spatial reasoning that static input does not elicit. Overall, we demonstrate that design choices for representing state are a decisive factor in performance, distinct from the availability of information itself. We note, however, that even with improved representations, current LLMs and VLMs remain brittle over long horizons, particularly when they must synthesise information to manage multiple subtasks to reach a goal.

68. 【2602.15857】Multi-source Heterogeneous Public Opinion Analysis via Collaborative Reasoning and Adaptive Fusion: A Systematically Integrated Approach

链接https://arxiv.org/abs/2602.15857

作者:Yi Liu

类目:Computation and Language (cs.CL)

关键词:presents significant challenges, significant challenges due, multiple heterogeneous sources, heterogeneous sources presents, sources presents significant

备注: 13 pages, 11 figures

点击查看摘要

Abstract:The analysis of public opinion from multiple heterogeneous sources presents significant challenges due to structural differences, semantic variations, and platform-specific biases. This paper introduces a novel Collaborative Reasoning and Adaptive Fusion (CRAF) framework that systematically integrates traditional feature-based methods with large language models (LLMs) through a structured multi-stage reasoning mechanism. Our approach features four key innovations: (1) a cross-platform collaborative attention module that aligns semantic representations while preserving source-specific characteristics, (2) a hierarchical adaptive fusion mechanism that dynamically weights features based on both data quality and task requirements, (3) a joint optimization strategy that simultaneously learns topic representations and sentiment distributions through shared latent spaces, and (4) a novel multimodal extraction capability that processes video content from platforms like Douyin and Kuaishou by integrating OCR, ASR, and visual sentiment analysis. Theoretical analysis demonstrates that CRAF achieves a tighter generalization bound with a reduction of O(sqrt(d log K / m)) compared to independent source modeling, where d is feature dimensionality, K is the number of sources, and m is sample size. Comprehensive experiments on three multi-platform datasets (Weibo-12, CrossPlatform-15, NewsForum-8) show that CRAF achieves an average topic clustering ARI of 0.76 (4.1% improvement over best baseline) and sentiment analysis F1-score of 0.84 (3.8% improvement). The framework exhibits strong cross-platform adaptability, reducing the labeled data requirement for new platforms by 75%.

69. 【2602.15856】Rethinking Soft Compression in Retrieval-Augmented Generation: A Query-Conditioned Selector Perspective

链接https://arxiv.org/abs/2602.15856

作者:Yunhao Liu,Zian Jia,Xinyu Gao,Kanjun Xu,Yun Xiong

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:Large Language Models, effectively grounds Large, grounds Large Language, Web-related tasks, grounds Large

备注: Accepted by WWW 2026

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) effectively grounds Large Language Models (LLMs) with external knowledge and is widely applied to Web-related tasks. However, its scalability is hindered by excessive context length and redundant retrievals. Recent research on soft context compression aims to address this by encoding long documents into compact embeddings, yet they often underperform non-compressed RAG due to their reliance on auto-encoder-like full-compression that forces the encoder to compress all document information regardless of relevance to the input query. In this work, we conduct an analysis on this paradigm and reveal two fundamental limitations: (I) Infeasibility, full-compression conflicts with the LLM's downstream generation behavior; and (II) Non-necessity: full-compression is unnecessary and dilutes task-relevant information density. Motivated by these insights, we introduce SeleCom, a selector-based soft compression framework for RAG that redefines the encoder's role as query-conditioned information selector. The selector is decoder-only and is trained with a massive, diverse and difficulty-graded synthetic QA dataset with curriculum learning. Extensive experiments show that SeleCom significantly outperforms existing soft compression approaches and achieves competitive or superior performance to non-compression baselines, while reducing computation and latency by 33.8%~84.6%.

Comments:
Accepted by WWW 2026

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Cite as:
arXiv:2602.15856 [cs.CL]

(or
arXiv:2602.15856v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2602.15856

Focus to learn more

              arXiv-issued DOI via DataCite</p>
70. 【2602.15854】Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization

链接https://arxiv.org/abs/2602.15854

作者:Jingyi Xu,Xingyu Ren,Zhiqiang You,Yumeng Zhang,Zhoupeng Shou

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Customer Service Agent, Large language models, existing training methods, Large language, Expert Agent

备注

点击查看摘要

Abstract:Large language models show potential in task-oriented dialogue systems, yet existing training methods often rely on token-level likelihood or preference optimization, which poorly align with long-horizon task success. To address this, we propose Goal-Oriented Preference Optimization (GOPO), a hierarchical reinforcement learning framework that decouples strategy planning from response generation via an Expert Agent and a Customer Service Agent. The Expert Agent optimizes multi-turn goal preferences at the dialogue-trajectory level, while the Customer Service Agent generates responses strictly aligned with the selected strategy. We evaluate GOPO on public benchmarks and e-commerce customer service datasets, and introduce Task-focused Sequential Engagement (TSE), a sequence-level metric derived from real e-commerce interaction data. On the Mgshop dataset, GOPO improves TSE by 7.7% and 10.3% over PPO and Memento, with consistent gains in sequence-level reward and generation quality. Furthermore, a 14B model trained with GOPO achieves 2.7% and 1.5% higher TSE than Qwen-235B and GPT-5.2, respectively. Ablation studies confirm the Expert Agent's critical role in long-horizon optimization. GOPO demonstrates consistent improvements across other datasets as well. This work establishes a new paradigm for task-oriented dialogue systems in commercial scenarios, with code and datasets to be made public.

71. 【2602.15853】A Lightweight Explainable Guardrail for Prompt Safety

链接https://arxiv.org/abs/2602.15853

作者:Md Asiful Islam,Mihai Surdeanu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:lightweight explainable guardrail, explainable guardrail, LEG, propose a lightweight, lightweight explainable

备注

点击查看摘要

Abstract:We propose a lightweight explainable guardrail (LEG) method for the classification of unsafe prompts. LEG uses a multi-task learning architecture to jointly learn a prompt classifier and an explanation classifier, where the latter labels prompt words that explain the safe/unsafe overall decision. LEG is trained using synthetic data for explainability, which is generated using a novel strategy that counteracts the confirmation biases of LLMs. Lastly, LEG's training process uses a novel loss that captures global explanation signals and combines cross-entropy and focal losses with uncertainty-based weighting. LEG obtains equivalent or better performance than the state-of-the-art for both prompt classification and explainability, both in-domain and out-of-domain on three datasets, despite the fact that its model size is considerably smaller than current approaches. If accepted, we will release all models and the annotated dataset publicly.

72. 【2602.15852】Building Safe and Deployable Clinical Natural Language Processing under Temporal Leakage Constraints

链接https://arxiv.org/abs/2602.15852

作者:Ha Na Cho,Sairam Sutari,Alexander Lopez,Hansen Bow,Kai Zheng

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:natural language processing, Clinical natural language, leveraging narrative clinical, hospital discharge planning, supporting hospital discharge

备注

点击查看摘要

Abstract:Clinical natural language processing (NLP) models have shown promise for supporting hospital discharge planning by leveraging narrative clinical documentation. However, note-based models are particularly vulnerable to temporal and lexical leakage, where documentation artifacts encode future clinical decisions and inflate apparent predictive performance. Such behavior poses substantial risks for real-world deployment, where overconfident or temporally invalid predictions can disrupt clinical workflows and compromise patient safety. This study focuses on system-level design choices required to build safe and deployable clinical NLP under temporal leakage constraints. We present a lightweight auditing pipeline that integrates interpretability into the model development process to identify and suppress leakage-prone signals prior to final training. Using next-day discharge prediction after elective spine surgery as a case study, we evaluate how auditing affects predictive behavior, calibration, and safety-relevant trade-offs. Results show that audited models exhibit more conservative and better-calibrated probability estimates, with reduced reliance on discharge-related lexical cues. These findings emphasize that deployment-ready clinical NLP systems should prioritize temporal validity, calibration, and behavioral robustness over optimistic performance.

73. 【2602.15851】Narrative Theory-Driven LLM Methods for Automatic Story Generation and Understanding: A Survey

链接https://arxiv.org/abs/2602.15851

作者:David Y. Liu,Aditya Joshi,Paul Dawson

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:deliver promising use-cases, automatic story generation, deliver promising, large language models, narrative

备注: 31 pages

点击查看摘要

Abstract:Applications of narrative theories using large language models (LLMs) deliver promising use-cases in automatic story generation and understanding tasks. Our survey examines how natural language processing (NLP) research engages with fields of narrative studies, and proposes a taxonomy for ongoing efforts that reflect established distinctions in narratology. We discover patterns in the following: narrative datasets and tasks, narrative theories and NLP pipeline and methodological trends in prompting and fine-tuning. We highlight how LLMs enable easy connections of NLP pipelines with abstract narrative concepts and opportunities for interdisciplinary collaboration. Challenges remain in attempts to work towards any unified definition or benchmark of narrative related tasks, making model comparison difficult. For future directions, instead of the pursuit of a single, generalised benchmark for 'narrative quality', we believe that progress benefits more from efforts that focus on the following: defining and improving theory-based metrics for individual narrative attributes to incrementally improve model performance; conducting large-scale, theory-driven literary/social/cultural analysis; and creating experiments where outputs can be used to validate or refine narrative theories. This work provides a contextual foundation for more systematic and theoretically informed narrative research in NLP by providing an overview to ongoing research efforts and the broader narrative studies landscape.

74. 【2602.15850】Large Language Models for Assisting American College Applications

链接https://arxiv.org/abs/2602.15850

作者:Zhengliang Liu,Weihang You,Peng Shu,Junhao Chen,Yi Pan,Hanqi Jiang,Yiwei Li,Zhaojun Ding,Chao Cao,Xinliang Li,Yifan Zhou,Ruidong Zhang,Shaochen Xu,Wei Ruan,Huaqin Zhao,Dajiang Zhu,Tianming Liu

类目:Computation and Language (cs.CL)

关键词:American college applications, demand cross-referencing multiple, fragmented admissions policies, American college, navigate fragmented admissions

备注

点击查看摘要

Abstract:American college applications require students to navigate fragmented admissions policies, repetitive and conditional forms, and ambiguous questions that often demand cross-referencing multiple sources. We present EZCollegeApp, a large language model (LLM)-powered system that assists high-school students by structuring application forms, grounding suggested answers in authoritative admissions documents, and maintaining full human control over final responses. The system introduces a mapping-first paradigm that separates form understanding from answer generation, enabling consistent reasoning across heterogeneous application portals. EZCollegeApp integrates document ingestion from official admissions websites, retrieval-augmented question answering, and a human-in-the-loop chatbot interface that presents suggestions alongside application fields without automated submission. We describe the system architecture, data pipeline, internal representations, security and privacy measures, and evaluation through automated testing and human quality assessment. Our source code is released on GitHub (this https URL) to facilitate the broader impact of this work.

75. 【2602.15849】Preference Optimization for Review Question Generation Improves Writing Quality

链接https://arxiv.org/abs/2602.15849

作者:Karun Sharma,Vidushee Vats,Shengzhi Li,Yuxiang Wang,Zhongtian Sun,Prayag Tiwari

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:generate surface-level queries, Peer review relies, existing LLM-based approaches, Sampling Policy Optimization, relies on substantive

备注: 24 Pages, v1

点击查看摘要

Abstract:Peer review relies on substantive, evidence-based questions, yet existing LLM-based approaches often generate surface-level queries, drawing over 50\% of their question tokens from a paper's first page. To bridge this gap, we develop IntelliReward, a novel reward model built from a frozen autoregressive LLM with trainable multi-head transformers over the final 50 token states, which outperforms API-based SFT baselines in predicting expert-level human preferences. By applying Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) with IntelliReward, we train IntelliAsk, a question-generation model aligned with human standards of effort, evidence, and grounding. We find consistent improvements on reasoning and writing benchmarks, suggesting reviewer-question quality correlates with broader capabilities. Compared to the Qwen3-32B base model, IntelliAsk shows measurable gains across diverse benchmarks, specifically improving performance on reasoning tasks like MuSR (68.3 vs 64.7 Acc) and complex writing evaluations such as WritingBench (8.31 vs 8.07). We release our implementation, expert preference annotations, and the IntelliReward model to provide an automatic evaluation benchmark for grounding, effort, and evidence in LLM-generated review questions.

76. 【2602.15848】Can LLMs Assess Personality? Validating Conversational AI for Trait Profiling

链接https://arxiv.org/abs/2602.15848

作者:Andrius Matšenas,Anet Lello,Tõnis Lees,Hans Peep,Kim Lilii Tamm

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, validates Large Language, study validates Large, Language Models, Large Language

备注: 13 pages, 7 figures, 4 tables, 2 appendices

点击查看摘要

Abstract:This study validates Large Language Models (LLMs) as a dynamic alternative to questionnaire-based personality assessment. Using a within-subjects experiment (N=33), we compared Big Five personality scores derived from guided LLM conversations against the gold-standard IPIP-50 questionnaire, while also measuring user-perceived accuracy. Results indicate moderate convergent validity (r=0.38-0.58), with Conscientiousness, Openness, and Neuroticism scores statistically equivalent between methods. Agreeableness and Extraversion showed significant differences, suggesting trait-specific calibration is needed. Notably, participants rated LLM-generated profiles as equally accurate as traditional questionnaire results. These findings suggest conversational AI offers a promising new approach to traditional psychometrics.

77. 【2602.15847】Do Personality Traits Interfere? Geometric Limitations of Steering in Large Language Models

链接https://arxiv.org/abs/2602.15847

作者:Pranav Bhandari,Usman Naseem,Mehwish Nasim

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:large language models, injecting trait-specific steering, personality steering directions, commonly relies, implicitly assuming

备注

点击查看摘要

Abstract:Personality steering in large language models (LLMs) commonly relies on injecting trait-specific steering vectors, implicitly assuming that personality traits can be controlled independently. In this work, we examine whether this assumption holds by analysing the geometric relationships between Big Five personality steering directions. We study steering vectors extracted from two model families (LLaMA-3-8B and Mistral-8B) and apply a range of geometric conditioning schemes, from unconstrained directions to soft and hard orthonormalisation. Our results show that personality steering directions exhibit substantial geometric dependence: steering one trait consistently induces changes in others, even when linear overlap is explicitly removed. While hard orthonormalisation enforces geometric independence, it does not eliminate cross-trait behavioural effects and can reduce steering strength. These findings suggest that personality traits in LLMs occupy a slightly coupled subspace, limiting fully independent trait control.

78. 【2602.15846】Gated Tree Cross-attention for Checkpoint-Compatible Syntax Injection in Decoder-Only LLMs

链接https://arxiv.org/abs/2602.15846

作者:Xinyu Gao,Shaonan Wang,Nai Ding

类目:Computation and Language (cs.CL)

关键词:minor grammatical perturbations, large language models, language models achieve, models achieve strong, achieve strong broad

备注

点击查看摘要

Abstract:Decoder-only large language models achieve strong broad performance but are brittle to minor grammatical perturbations, undermining reliability for downstream reasoning. However, directly injecting explicit syntactic structure into an existing checkpoint can interfere with its pretrained competence. We introduce a checkpoint-compatible gated tree cross-attention (GTCA) branch that reads precomputed constituency chunk memory while leaving backbone architecture unchanged. Our design uses a token update mask and staged training to control the scope and timing of structural updates. Across benchmarks and Transformer backbones, GTCA strengthens syntactic robustness beyond continued-training baselines without compromising Multiple-Choice QA performance or commonsense reasoning, providing a practical checkpoint-compatible route to more syntax-robust decoder-only LLMs.

79. 【2602.15845】KD4MT: A Survey of Knowledge Distillation for Machine Translation

链接https://arxiv.org/abs/2602.15845

作者:Ona de Gibert,Joseph Attieh,Timothee Mickus,Yves Scherrer,Jörg Tiedemann

类目:Computation and Language (cs.CL)

关键词:address challenges related, models in NLP, Knowledge Distillation, area has gained, gained a lot

备注: Pre-print under Review submitted to Computational Linguistics Journal

点击查看摘要

Abstract:Knowledge Distillation (KD) as a research area has gained a lot of traction in recent years as a compression tool to address challenges related to ever-larger models in NLP. Remarkably, Machine Translation (MT) offers a much more nuanced take on this narrative: in MT, KD also functions as a general-purpose knowledge transfer mechanism that shapes supervision and translation quality as well as efficiency. This survey synthesizes KD for MT (KD4MT) across 105 papers (through October 1, 2025). We begin by introducing both MT and KD for non-experts, followed by an overview of the standard KD approaches relevant to MT applications. Subsequently, we categorize advances in the KD4MT literature based on (i) their methodological contributions and (ii) their practical applications. Our qualitative and quantitative analyses identify common trends in the field and highlight key research gaps as well as the absence of unified evaluation practice for KD methods in MT. We further provide practical guidelines for selecting a KD method in concrete settings and highlight potential risks associated with the application of KD to MT such as increased hallucination and bias amplification. Finally, we discuss the role of LLMs in re-shaping the KD4MT field. To support further research, we complement our survey with a publicly available database summarizing the main characteristics of the surveyed KD methods and a glossary of key terms.

Comments:
Pre-print under Review submitted to Computational Linguistics Journal

Subjects:

Computation and Language (cs.CL)

Cite as:
arXiv:2602.15845 [cs.CL]

(or
arXiv:2602.15845v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2602.15845

Focus to learn more

              arXiv-issued DOI via DataCite</p>
80. 【2602.15844】Language Model Representations for Efficient Few-Shot Tabular Classification

链接https://arxiv.org/abs/2602.15844

作者:Inwon Kang,Parikshit Ram,Yi Zhou,Horst Samulowitz,Oshani Seneviratne

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:rich source, textbf, existing LLM infrastructure, LLM infrastructure, structured data

备注: Accepted to WWW'26

点击查看摘要

Abstract:The Web is a rich source of structured data in the form of tables, from product catalogs and knowledge bases to scientific datasets. However, the heterogeneity of the structure and semantics of these tables makes it challenging to build a unified method that can effectively leverage the information they contain. Meanwhile, Large language models (LLMs) are becoming an increasingly integral component of web infrastructure for tasks like semantic search. This raises a crucial question: can we leverage these already-deployed LLMs to classify structured data in web-native tables (e.g., product catalogs, knowledge base exports, scientific data portals), avoiding the need for specialized models or extensive retraining? This work investigates a lightweight paradigm, $\textbf{Ta}$ble $\textbf{R}$epresentation with $\textbf{L}$anguage Model~($\textbf{TaRL}$), for few-shot tabular classification that directly utilizes semantic embeddings of individual table rows. We first show that naive application of these embeddings underperforms compared to specialized tabular models. We then demonstrate that their potentials can be unlocked with two key techniques: removing the common component from all embeddings and calibrating the softmax temperature. We show that a simple meta-learner, trained on handcrafted features, can learn to predict an appropriate temperature. This approach achieves performance comparable to state-of-the-art models in low-data regimes ($k \leq 32$) of semantically-rich tables. Our findings demonstrate the viability of reusing existing LLM infrastructure for efficient semantics-driven pathway to reuse existing LLM infrastructure for Web table understanding.

81. 【2602.15843】he Perplexity Paradox: Why Code Compresses Better Than Math in LLM Prompts

链接https://arxiv.org/abs/2602.15843

作者:Warren Johnson

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Compress or Route, Compress, Route, perplexity, Abstract

备注: 19 pages, 5 figures, 4 tables. Second paper in TAAC research series. Code and data at [this https URL](https://github.com/micoverde/taac-llm-compression)

点击查看摘要

Abstract:In "Compress or Route?" (Johnson, 2026), we found that code generation tolerates aggressive prompt compression (r = 0.6) while chain-of-thought reasoning degrades gradually. That study was limited to HumanEval (164 problems), left the "perplexity paradox" mechanism unvalidated, and provided no adaptive algorithm. This paper addresses all three gaps. First, we validate across six code benchmarks (HumanEval, MBPP, HumanEval+, MultiPL-E) and four reasoning benchmarks (GSM8K, MATH, ARC-Challenge, MMLU-STEM), confirming the compression threshold generalizes across languages and difficulties. Second, we conduct the first per-token perplexity analysis (n=723 tokens), revealing a "perplexity paradox": code syntax tokens are preserved (high perplexity) while numerical values in math problems are pruned despite being task-critical (low perplexity). Signature injection recovers +34 percentage points in pass rate (5.3% to 39.3%; Cohen's h=0.890). Third, we propose TAAC (Task-Aware Adaptive Compression), achieving 22% cost reduction with 96% quality preservation, outperforming fixed-ratio compression by 7%. MBPP validation (n=1,800 trials) confirms systematic variation: 3.6% at r=0.3 to 54.6% at r=1.0.

82. 【2602.15842】Memes-as-Replies: Can Models Select Humorous Manga Panel Responses?

链接https://arxiv.org/abs/2602.15842

作者:Ryosuke Kohita,Seiichiro Yoshioka

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:modern web communication, Meme Reply Selection, Manga Meme Reply, Meme Reply Benchmark, Meme Reply

备注

点击查看摘要

Abstract:Memes are a popular element of modern web communication, used not only as static artifacts but also as interactive replies within conversations. While computational research has focused on analyzing the intrinsic properties of memes, the dynamic and contextual use of memes to create humor remains an understudied area of web science. To address this gap, we introduce the Meme Reply Selection task and present MaMe-Re (Manga Meme Reply Benchmark), a benchmark of 100,000 human-annotated pairs (500,000 total annotations from 2,325 unique annotators) consisting of openly licensed Japanese manga panels and social media posts. Our analysis reveals three key insights: (1) large language models (LLMs) show preliminary evidence of capturing complex social cues such as exaggeration, moving beyond surface-level semantic matching; (2) the inclusion of visual information does not improve performance, revealing a gap between understanding visual content and effectively using it for contextual humor; (3) while LLMs can match human judgments in controlled settings, they struggle to distinguish subtle differences in wit among semantically similar candidates. These findings suggest that selecting contextually humorous replies remains an open challenge for current models.

83. 【2602.15835】A Methodology for Identifying Evaluation Items for Practical Dialogue Systems Based on Business-Dialogue System Alignment Models

链接https://arxiv.org/abs/2602.15835

作者:Mikio Nakano,Hironori Takeuchi,Kazunori Komatani

类目:Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Software Engineering (cs.SE)

关键词:evaluation items, dialogue systems, practical dialogue systems, evaluation, dialogue

备注: This paper has been accepted for presentation at International Workshop on Spoken Dialogue Systems Technology 2025 (IWSDS 2025)

点击查看摘要

Abstract:This paper proposes a methodology for identifying evaluation items for practical dialogue systems. Traditionally, user satisfaction and user experiences have been the primary metrics for evaluating dialogue systems. However, there are various other evaluation items to consider when developing and operating practical dialogue systems, and such evaluation items are expected to lead to new research topics. So far, there has been no methodology for identifying these evaluation items. We propose identifying evaluation items based on business-dialogue system alignment models, which are applications of business-IT alignment models used in the development and operation of practical IT systems. We also present a generic model that facilitates the construction of a business-dialogue system alignment model for each dialogue system.

84. 【2602.16273】Lyapunov Spectral Analysis of Speech Embedding Trajectories in Psychosis

链接https://arxiv.org/abs/2602.16273

作者:Jelena Vasic,Branislav Andjelic,Ana Mancic,Dusica Filipovic Djurdjevic,Ljiljana Mihic,Aleksandar Kovacevic,Nadja P. Maric,Aleksandra Maluckov

类目:Adaptation and Self-Organizing Systems (nlin.AO); Computation and Language (cs.CL)

关键词:structured clinical interviews, treating language production, high-dimensional dynamical process, structured clinical, clinical interviews

备注: 14 pages, 3 figures

点击查看摘要

Abstract:We analyze speech embeddings from structured clinical interviews of psychotic patients and healthy controls by treating language production as a high-dimensional dynamical process. Lyapunov exponent (LE) spectra are computed from word-level and answer-level embeddings generated by two distinct large language models, allowing us to assess the stability of the conclusions with respect to different embedding presentations. Word-level embeddings exhibit uniformly contracting dynamics with no positive LE, while answer-level embeddings, in spite of the overall contraction, display a number of positive LEs and higher-dimensional attractors. The resulting LE spectra robustly separate psychotic from healthy speech, while differentiation within the psychotic group is not statistically significant overall, despite a tendency of the most severe cases to occupy distinct dynamical regimes. These findings indicate that nonlinear dynamical invariants of speech embeddings provide a physics-inspired probe of disordered cognition whose conclusions remain stable across embedding models.

85. 【2602.15889】Evidence for Daily and Weekly Periodic Variability in GPT-4o Performance

链接https://arxiv.org/abs/2602.15889

作者:Paul Tschisgale,Peter Wulff

类目:Applications (stat.AP); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Physics Education (physics.ed-ph)

关键词:Large language models, Large language, objects of investigation, Large, language models

备注

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used in research both as tools and as objects of investigation. Much of this work implicitly assumes that LLM performance under fixed conditions (identical model snapshot, hyperparameters, and prompt) is time-invariant. If average output quality changes systematically over time, this assumption is violated, threatening the reliability, validity, and reproducibility of findings. To empirically examine this assumption, we conducted a longitudinal study on the temporal variability of GPT-4o's average performance. Using a fixed model snapshot, fixed hyperparameters, and identical prompting, GPT-4o was queried via the API to solve the same multiple-choice physics task every three hours for approximately three months. Ten independent responses were generated at each time point and their scores were averaged. Spectral (Fourier) analysis of the resulting time series revealed notable periodic variability in average model performance, accounting for approximately 20% of the total variance. In particular, the observed periodic patterns are well explained by the interaction of a daily and a weekly rhythm. These findings indicate that, even under controlled conditions, LLM performance may vary periodically over time, calling into question the assumption of time invariance. Implications for ensuring validity and replicability of research that uses or investigates LLMs are discussed.

信息检索

1. 【2602.16673】Neighborhood Stability as a Measure of Nearest Neighbor Searchability

链接https://arxiv.org/abs/2602.16673

作者:Thomas Vecchiato,Sebastian Bruch

类目:Machine Learning (cs.LG); Information Retrieval (cs.IR)

关键词:Clustering-based Approximate Nearest, Nearest Neighbor Search, Approximate Nearest Neighbor, Clustering-based Approximate, Neighbor Search

备注

点击查看摘要

Abstract:Clustering-based Approximate Nearest Neighbor Search (ANNS) organizes a set of points into partitions, and searches only a few of them to find the nearest neighbors of a query. Despite its popularity, there are virtually no analytical tools to determine the suitability of clustering-based ANNS for a given dataset -- what we call "searchability." To address that gap, we present two measures for flat clusterings of high-dimensional points in Euclidean space. First is Clustering-Neighborhood Stability Measure (clustering-NSM), an internal measure of clustering quality -- a function of a clustering of a dataset -- that we show to be predictive of ANNS accuracy. The second, Point-Neighborhood Stability Measure (point-NSM), is a measure of clusterability -- a function of the dataset itself -- that is predictive of clustering-NSM. The two together allow us to determine whether a dataset is searchable by clustering-based ANNS given only the data points. Importantly, both are functions of nearest neighbor relationships between points, not distances, making them applicable to various distance functions including inner product.

2. 【2602.16609】ColBERT-Zero: To Pre-train Or Not To Pre-train ColBERT models

链接https://arxiv.org/abs/2602.16609

作者:Antoine Chaffin,Luca Arnaboldi,Amélie Chatelain,Florent Krzakala

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:small Knowledge Distillation, Knowledge Distillation, strong single-vector models, small Knowledge, multi-vector models

备注: 9 pages, 5 tables, 2 figures

点击查看摘要

Abstract:Current state-of-the-art multi-vector models are obtained through a small Knowledge Distillation (KD) training step on top of strong single-vector models, leveraging the large-scale pre-training of these models. In this paper, we study the pre-training of multi-vector models and show that large-scale multi-vector pre-training yields much stronger multi-vector models. Notably, a fully ColBERT-pre-trained model, ColBERT-Zero, trained only on public data, outperforms GTE-ModernColBERT as well as its base model, GTE-ModernBERT, which leverages closed and much stronger data, setting new state-of-the-art for model this size. We also find that, although performing only a small KD step is not enough to achieve results close to full pre-training, adding a supervised step beforehand allows to achieve much closer performance while skipping the most costly unsupervised phase. Finally, we find that aligning the fine-tuning and pre-training setups is crucial when repurposing existing models. To enable exploration of our results, we release various checkpoints as well as code used to train them.

3. 【2602.16587】Why Thinking Hurts? Diagnosing and Rectifying the Reasoning Shift in Foundation Recommender Models

链接https://arxiv.org/abs/2602.16587

作者:Luankang Zhang,Yonghao Huang,Hang Lv,Mingjia Yin,Liangyue Li,Zulong Chen,Hao Wang,Enhong Chen

类目:Information Retrieval (cs.IR)

关键词:degrades recommendation performance, Semantic ID-based recommendation, paradoxically degrades recommendation, ID-based recommendation foundation, recommendation performance

备注

点击查看摘要

Abstract:Integrating Chain-of-Thought (CoT) reasoning into Semantic ID-based recommendation foundation models (such as OpenOneRec) often paradoxically degrades recommendation performance. We identify the root cause as textual inertia from the General Subspace, where verbose reasoning dominates inference and causes the model to neglect critical Semantic ID. To address this, we propose a training-free Inference-Time Subspace Alignment framework. By compressing reasoning chains and applying bias-subtracted contrastive decoding, our approach mitigates ungrounded textual drift. Experiments show this effectively calibrates inference, allowing foundation models to leverage reasoning without sacrificing ID-grounded accuracy.

4. 【2602.16541】From Latent to Observable Position-Based Click Models in Carousel Interfaces

链接https://arxiv.org/abs/2602.16541

作者:Santiago de Leon-Martinez,Robert Moro,Branislav Kveton,Maria Bielikova

类目:Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)

关键词:single ranked-list interfaces, models, Click models, central component, component of learning

备注

点击查看摘要

Abstract:Click models are a central component of learning and evaluation in recommender systems, yet most existing models are designed for single ranked-list interfaces. In contrast, modern recommender platforms increasingly use complex interfaces such as carousels, which consist of multiple swipeable lists that enable complex user browsing behaviors. In this paper, we study position-based click models in carousel interfaces and examine optimization methods, model structure, and alignment with user behavior. We propose three novel position-based models tailored to carousels, including the first position-based model without latent variables that incorporates observed examination signals derived from eye tracking data, called the Observed Examination Position-Based Model (OEPBM). We develop a general implementation of these carousel click models, supporting multiple optimization techniques and conduct experiments comparing gradient-based methods with classical approaches, namely expectation-maximization and maximum likelihood estimation. Our results show that gradient-based optimization consistently achieve better click likelihoods. Among the evaluated models, the OEPBM achieves the strongest performance in click prediction and produces examination patterns that most closely align to user behavior. However, we also demonstrate that strong click fit does not imply realistic modeling of user examination and browsing patterns. This reveals a fundamental limitation of click-only models in complex interfaces and the need for incorporating additional behavioral signals when designing click models for carousel-based recommender systems.

Subjects:

Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)

Cite as:
arXiv:2602.16541 [cs.IR]

(or
arXiv:2602.16541v1 [cs.IR] for this version)

https://doi.org/10.48550/arXiv.2602.16541

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
5. 【2602.16375】Variable-Length Semantic IDs for Recommender Systems

链接https://arxiv.org/abs/2602.16375

作者:Kirill Khrylchenko

类目:Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:modeling user behavior, Generative models, generative models difficult, integrating large language, training generative models

备注

点击查看摘要

Abstract:Generative models are increasingly used in recommender systems, both for modeling user behavior as event sequences and for integrating large language models into recommendation pipelines. A key challenge in this setting is the extremely large cardinality of item spaces, which makes training generative models difficult and introduces a vocabulary gap between natural language and item identifiers. Semantic identifiers (semantic IDs), which represent items as sequences of low-cardinality tokens, have recently emerged as an effective solution to this problem. However, existing approaches generate semantic identifiers of fixed length, assigning the same description length to all items. This is inefficient, misaligned with natural language, and ignores the highly skewed frequency structure of real-world catalogs, where popular items and rare long-tail items exhibit fundamentally different information requirements. In parallel, the emergent communication literature studies how agents develop discrete communication protocols, often producing variable-length messages in which frequent concepts receive shorter descriptions. Despite the conceptual similarity, these ideas have not been systematically adopted in recommender systems. In this work, we bridge recommender systems and emergent communication by introducing variable-length semantic identifiers for recommendation. We propose a discrete variational autoencoder with Gumbel-Softmax reparameterization that learns item representations of adaptive length under a principled probabilistic framework, avoiding the instability of REINFORCE-based training and the fixed-length constraints of prior semantic ID methods.

Subjects:

Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Cite as:
arXiv:2602.16375 [cs.IR]

(or
arXiv:2602.16375v1 [cs.IR] for this version)

https://doi.org/10.48550/arXiv.2602.16375

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
6. 【2602.16315】he Diversity Paradox revisited: Systemic Effects of Feedback Loops in Recommender Systems

链接https://arxiv.org/abs/2602.16315

作者:Gabriele Barlacchi,Margherita Lalli,Emanuele Ferragina,Fosca Giannotti,Dino Pedreschi,Luca Pappalardo

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:algorithmic recommendations coevolve, shape individual choices, Recommender systems shape, user behavior, behavior and algorithmic

备注

点击查看摘要

Abstract:Recommender systems shape individual choices through feedback loops in which user behavior and algorithmic recommendations coevolve over time. The systemic effects of these loops remain poorly understood, in part due to unrealistic assumptions in existing simulation studies. We propose a feedback-loop model that captures implicit feedback, periodic retraining, probabilistic adoption of recommendations, and heterogeneous recommender systems. We apply the framework on online retail and music streaming data and analyze systemic effects of the feedback loop. We find that increasing recommender adoption may lead to a progressive diversification of individual consumption, while collective demand is redistributed in model- and domain-dependent ways, often amplifying popularity concentration. Temporal analyses further reveal that apparent increases in individual diversity observed in static evaluations are illusory: when adoption is fixed and time unfolds, individual diversity consistently decreases across all models. Our results highlight the need to move beyond static evaluations and explicitly account for feedback-loop dynamics when designing recommender systems.

7. 【2602.16299】MICE: Minimal Interaction Cross-Encoders for efficient Re-ranking

链接https://arxiv.org/abs/2602.16299

作者:Mathias Vast,Victor Morand,Basile van Cooten,Laure Soulier,Josiane Mothe,Benjamin Piwowarski

类目:Information Retrieval (cs.IR)

关键词:high inference cost, ranking effectiveness, information retrieval, Cross-encoders deliver, Cross-encoders

备注: 9 pages, 5 figures

点击查看摘要

Abstract:Cross-encoders deliver state-of-the-art ranking effectiveness in information retrieval, but have a high inference cost. This prevents them from being used as first-stage rankers, but also incurs a cost when re-ranking documents. Prior work has addressed this bottleneck from two largely separate directions: accelerating cross-encoder inference by sparsifying the attention process or improving first-stage retrieval effectiveness using more complex models, e.g. late-interaction ones. In this work, we propose to bridge these two approaches, based on an in-depth understanding of the internal mechanisms of cross-encoders. Starting from cross-encoders, we show that it is possible to derive a new late-interaction-like architecture by carefully removing detrimental or unnecessary interactions. We name this architecture MICE (Minimal Interaction Cross-Encoders). We extensively evaluate MICE across both in-domain (ID) and out-of-domain (OOD) datasets. MICE decreases fourfold the inference latency compared to standard cross-encoders, matching late-interaction models like ColBERT while retaining most of cross-encoder ID effectiveness and demonstrating superior generalization abilities in OOD.

8. 【2602.16136】Retrieval Collapses When AI Pollutes the Web

链接https://arxiv.org/abs/2602.16136

作者:Hongyeon Yu,Dongchan Kim,Young-Bum Kim

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Language Models, Large Language, Retrieval-Augmented Generation, Web presents

备注: 4 pages, Proceedings of The Web Conference 2026 (WWW '26)

点击查看摘要

Abstract:The rapid proliferation of AI-generated content on the Web presents a structural risk to information retrieval, as search engines and Retrieval-Augmented Generation (RAG) systems increasingly consume evidence produced by the Large Language Models (LLMs). We characterize this ecosystem-level failure mode as Retrieval Collapse, a two-stage process where (1) AI-generated content dominates search results, eroding source diversity, and (2) low-quality or adversarial content infiltrates the retrieval pipeline. We analyzed this dynamic through controlled experiments involving both high-quality SEO-style content and adversarially crafted content. In the SEO scenario, a 67\% pool contamination led to over 80\% exposure contamination, creating a homogenized yet deceptively healthy state where answer accuracy remains stable despite the reliance on synthetic sources. Conversely, under adversarial contamination, baselines like BM25 exposed $\sim$19\% of harmful content, whereas LLM-based rankers demonstrated stronger suppression capabilities. These findings highlight the risk of retrieval pipelines quietly shifting toward synthetic evidence and the need for retrieval-aware strategies to prevent a self-reinforcing cycle of quality decline in Web-grounded systems.

9. 【2602.16124】Rethinking ANN-based Retrieval: Multifaceted Learnable Index for Large-scale Recommendation System

链接https://arxiv.org/abs/2602.16124

作者:Jiang Zhang,Yubo Wang,Wei Chang,Lu Han,Xingying Cheng,Feng Zhang,Min Li,Songhao Jiang,Wei Zheng,Harry Tran,Zhen Wang,Lei Chen,Yueming Wang,Benyu Zhang,Xiangjun Fan,Bi Xue,Qifan Wang

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Approximate nearest neighbor, Approximate nearest, ANN search, large-scale recommendation systems, nearest neighbor

备注

点击查看摘要

Abstract:Approximate nearest neighbor (ANN) search is widely used in the retrieval stage of large-scale recommendation systems. In this stage, candidate items are indexed using their learned embedding vectors, and ANN search is executed for each user (or item) query to retrieve a set of relevant items. However, ANN-based retrieval has two key limitations. First, item embeddings and their indices are typically learned in separate stages: indexing is often performed offline after embeddings are trained, which can yield suboptimal retrieval quality-especially for newly created items. Second, although ANN offers sublinear query time, it must still be run for every request, incurring substantial computation cost at industry scale. In this paper, we propose MultiFaceted Learnable Index (MFLI), a scalable, real-time retrieval paradigm that learns multifaceted item embeddings and indices within a unified framework and eliminates ANN search at serving time. Specifically, we construct a multifaceted hierarchical codebook via residual quantization of item embeddings and co-train the codebook with the embeddings. We further introduce an efficient multifaceted indexing structure and mechanisms that support real-time updates. At serving time, the learned hierarchical indices are used directly to identify relevant items, avoiding ANN search altogether. Extensive experiments on real-world data with billions of users show that MFLI improves recall on engagement tasks by up to 11.8\%, cold-content delivery by up to 57.29\%, and semantic relevance by 13.5\% compared with prior state-of-the-art methods. We also deploy MFLI in the system and report online experimental results demonstrating improved engagement, less popularity bias, and higher serving efficiency.

10. 【2602.16034】FeDecider: An LLM-Based Framework for Federated Cross-Domain Recommendation

链接https://arxiv.org/abs/2602.16034

作者:Xinrui He,Ting-Wei Li,Tianxin Wei,Xuying Ning,Xinyu He,Wenxuan Bao,Hanghang Tong,Jingrui He

类目:Information Retrieval (cs.IR)

关键词:preserving data privacy, Federated CDR, aims to collaboratively, data privacy, recommendation models

备注: Accepted to The Web Conference (WWW) 2026

点击查看摘要

Abstract:Federated cross-domain recommendation (Federated CDR) aims to collaboratively learn personalized recommendation models across heterogeneous domains while preserving data privacy. Recently, large language model (LLM)-based recommendation models have demonstrated impressive performance by leveraging LLMs' strong reasoning capabilities and broad knowledge. However, adopting LLM-based recommendation models in Federated CDR scenarios introduces new challenges. First, there exists a risk of overfitting with domain-specific local adapters. The magnitudes of locally optimized parameter updates often vary across domains, causing biased aggregation and overfitting toward domain-specific distributions. Second, unlike traditional recommendation models (e.g., collaborative filtering, bipartite graph-based methods) that learn explicit and comparable user/item representations, LLMs encode knowledge implicitly through autoregressive text generation training. This poses additional challenges for effectively measuring the cross-domain similarities under heterogeneity. To address these challenges, we propose an LLM-based framework for federated cross-domain recommendation, FeDecider. Specifically, FeDecider tackles the challenge of scale-specific noise by disentangling each client's low-rank updates and sharing only their directional components. To handle the need for flexible and effective integration, each client further learns personalized weights that achieve the data-aware integration of updates from other domains. Extensive experiments across diverse datasets validate the effectiveness of our proposed FeDecider.

11. 【2602.15921】Latent Objective Induction and Diversity-Constrained Selection: Algorithms for Multi-Locale Retrieval Pipelines

链接https://arxiv.org/abs/2602.15921

作者:Faruk Alpay,Levent Sarioglu

类目:Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)

关键词:ranked search results, formal correctness guarantees, selecting a diverse, search results, formal correctness

备注: 13 pages, 2 algorithms, 3 tables

点击查看摘要

Abstract:We present three algorithms with formal correctness guarantees and complexity bounds for the problem of selecting a diverse, multi-locale set of sources from ranked search results. First, we formulate weighted locale allocation as a constrained integer partition problem and give an $O(n \log n)$ algorithm that simultaneously satisfies minimum-representation, budget-exhaustion, and proportionality-bound constraints; we prove all three hold with a tight deviation bound of $ 1$. Second, we define a cascaded country-code inference function as a deterministic priority chain over heterogeneous signals (TLD structure, model-inferred metadata, language fallback) and prove it satisfies both determinism and graceful degradation. Third, we introduce a $\kappa$-domain diversity constraint for source selection and give an $O(|K| \cdot R)$ algorithm that maintains the invariant via hash-map lookup, eliminating the aggregator monopolization pathology present in URL-level deduplication. We further formalize Latent Objective Induction (LOI), an environment-shaping operator over prompt spaces that steers downstream model behavior without restricting the feasible output set, and prove its convergence under mild assumptions. Applied to a multi-locale retrieval pipeline, these algorithms yield 62% improvement in first-party source ratio and 89% reduction in same-domain duplication across 120 multilingual queries.

12. 【2602.15856】Rethinking Soft Compression in Retrieval-Augmented Generation: A Query-Conditioned Selector Perspective

链接https://arxiv.org/abs/2602.15856

作者:Yunhao Liu,Zian Jia,Xinyu Gao,Kanjun Xu,Yun Xiong

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:Large Language Models, effectively grounds Large, grounds Large Language, Web-related tasks, grounds Large

备注: Accepted by WWW 2026

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) effectively grounds Large Language Models (LLMs) with external knowledge and is widely applied to Web-related tasks. However, its scalability is hindered by excessive context length and redundant retrievals. Recent research on soft context compression aims to address this by encoding long documents into compact embeddings, yet they often underperform non-compressed RAG due to their reliance on auto-encoder-like full-compression that forces the encoder to compress all document information regardless of relevance to the input query. In this work, we conduct an analysis on this paradigm and reveal two fundamental limitations: (I) Infeasibility, full-compression conflicts with the LLM's downstream generation behavior; and (II) Non-necessity: full-compression is unnecessary and dilutes task-relevant information density. Motivated by these insights, we introduce SeleCom, a selector-based soft compression framework for RAG that redefines the encoder's role as query-conditioned information selector. The selector is decoder-only and is trained with a massive, diverse and difficulty-graded synthetic QA dataset with curriculum learning. Extensive experiments show that SeleCom significantly outperforms existing soft compression approaches and achieves competitive or superior performance to non-compression baselines, while reducing computation and latency by 33.8%~84.6%.

Comments:
Accepted by WWW 2026

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Cite as:
arXiv:2602.15856 [cs.CL]

(or
arXiv:2602.15856v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2602.15856

Focus to learn more

              arXiv-issued DOI via DataCite</p>

计算机视觉

1. 【2602.16711】CoNeRV: Leveraging Temporal Coherence for Compressible Neural Representations for Videos

链接https://arxiv.org/abs/2602.16711

作者:Namitha Padmanabhan,Matthew Gwilliam,Abhinav Shrivastava

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Implicit Neural Representations, Implicit Neural, recently demonstrated impressive, demonstrated impressive performance, Neural Representations

备注

点击查看摘要

Abstract:Implicit Neural Representations (INRs) have recently demonstrated impressive performance for video compression. However, since a separate INR must be overfit for each video, scaling to high-resolution videos while maintaining encoding efficiency remains a significant challenge. Hypernetwork-based approaches predict INR weights (hyponetworks) for unseen videos at high speeds, but with low quality, large compressed size, and prohibitive memory needs at higher resolutions. We address these fundamental limitations through three key contributions: (1) an approach that decomposes the weight prediction task spatially and temporally, by breaking short video segments into patch tubelets, to reduce the pretraining memory overhead by 20$\times$; (2) a residual-based storage scheme that captures only differences between consecutive segment representations, significantly reducing bitstream size; and (3) a temporal coherence regularization framework that encourages changes in the weight space to be correlated with video content. Our proposed method, TeCoNeRV, achieves substantial improvements of 2.47dB and 5.35dB PSNR over the baseline at 480p and 720p on UVG, with 36% lower bitrates and 1.5-3$\times$ faster encoding speeds. With our low memory usage, we are the first hypernetwork approach to demonstrate results at 480p, 720p and 1080p on UVG, HEVC and MCL-JCV. Our project page is available at this https URL .

2. 【2602.16705】Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation

链接https://arxiv.org/abs/2602.16705

作者:Runpei Dong,Ziyan Li,Xialin He,Saurabh Gupta

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:RGB-D images, humanoid robots requires, humanoid robots, RGB-D, robots requires accurate

备注: Project page: [this https URL](https://hero-humanoid.github.io/)

点击查看摘要

Abstract:Visual loco-manipulation of arbitrary objects in the wild with humanoid robots requires accurate end-effector (EE) control and a generalizable understanding of the scene via visual inputs (e.g., RGB-D images). Existing approaches are based on real-world imitation learning and exhibit limited generalization due to the difficulty in collecting large-scale training datasets. This paper presents a new paradigm, HERO, for object loco-manipulation with humanoid robots that combines the strong generalization and open-vocabulary understanding of large vision models with strong control performance from simulated training. We achieve this by designing an accurate residual-aware EE tracking policy. This EE tracking policy combines classical robotics with machine learning. It uses a) inverse kinematics to convert residual end-effector targets into reference trajectories, b) a learned neural forward model for accurate forward kinematics, c) goal adjustment, and d) replanning. Together, these innovations help us cut down the end-effector tracking error by 3.2x. We use this accurate end-effector tracker to build a modular system for loco-manipulation, where we use open-vocabulary large vision models for strong visual generalization. Our system is able to operate in diverse real-world environments, from offices to coffee shops, where the robot is able to reliably manipulate various everyday objects (e.g., mugs, apples, toys) on surfaces ranging from 43cm to 92cm in height. Systematic modular and end-to-end tests in simulation and the real world demonstrate the effectiveness of our proposed design. We believe the advances in this paper can open up new ways of training humanoid robots to interact with daily objects.

3. 【2602.16702】Saliency-Aware Multi-Route Thinking: Revisiting Vision-Language Reasoning

链接https://arxiv.org/abs/2602.16702

作者:Mingjia Shi,Yinhan He,Yaochen Zhu,Jundong Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Vision-language models, jointly leveraging visual, aim to reason, reason by jointly, jointly leveraging

备注: preprint 10 pages, 4 figures

点击查看摘要

Abstract:Vision-language models (VLMs) aim to reason by jointly leveraging visual and textual modalities. While allocating additional inference-time computation has proven effective for large language models (LLMs), achieving similar scaling in VLMs remains challenging. A key obstacle is that visual inputs are typically provided only once at the start of generation, while textual reasoning (e.g., early visual summaries) is generated autoregressively, causing reasoning to become increasingly text-dominated and allowing early visual grounding errors to accumulate. Moreover, vanilla guidance for visual grounding during inference is often coarse and noisy, making it difficult to steer reasoning over long texts. To address these challenges, we propose \emph{Saliency-Aware Principle} (SAP) selection. SAP operates on high-level reasoning principles rather than token-level trajectories, which enable stable control over discrete generation under noisy feedback while allowing later reasoning steps to re-consult visual evidence when renewed grounding is required. In addition, SAP supports multi-route inference, enabling parallel exploration of diverse reasoning behaviors. SAP is model-agnostic and data-free, requiring no additional training. Empirical results show that SAP achieves competitive performance, especially in reducing object hallucination, under comparable token-generation budgets while yielding more stable reasoning and lower response latency than CoT-style long sequential reasoning.

4. 【2602.16689】Are Object-Centric Representations Better At Compositional Generalization?

链接https://arxiv.org/abs/2602.16689

作者:Ferdinand Kapl,Amir Mohammad Karimi Mamaghan,Maximilian Seitzer,Karl Henrik Johansson,Carsten Marr,Stefan Bauer,Andrea Dittadi

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:familiar concepts, machine learning, Visual Question Answering, ability to reason, fundamental to human

备注

点击查看摘要

Abstract:Compositional generalization, the ability to reason about novel combinations of familiar concepts, is fundamental to human cognition and a critical challenge for machine learning. Object-centric (OC) representations, which encode a scene as a set of objects, are often argued to support such generalization, but systematic evidence in visually rich settings is limited. We introduce a Visual Question Answering benchmark across three controlled visual worlds (CLEVRTex, Super-CLEVR, and MOVi-C) to measure how well vision encoders, with and without object-centric biases, generalize to unseen combinations of object properties. To ensure a fair and comprehensive comparison, we carefully account for training data diversity, sample size, representation size, downstream model capacity, and compute. We use DINOv2 and SigLIP2, two widely used vision encoders, as the foundation models and their OC counterparts. Our key findings reveal that (1) OC approaches are superior in harder compositional generalization settings; (2) original dense representations surpass OC only on easier settings and typically require substantially more downstream compute; and (3) OC models are more sample efficient, achieving stronger generalization with fewer images, whereas dense encoders catch up or surpass them only with sufficient data and diversity. Overall, object-centric representations offer stronger compositional generalization when any one of dataset size, training data diversity, or downstream compute is constrained.

5. 【2602.16682】Learning Situated Awareness in the Real World

链接https://arxiv.org/abs/2602.16682

作者:Chuhan Li,Ruilin Han,Joy Hsu,Yongyuan Liang,Rajiv Dhawan,Jiajun Wu,Ming-Hsuan Yang,Xin Eric Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:surrounding physical environment, actions in context, core aspect, aspect of human, human perception

备注

点击查看摘要

Abstract:A core aspect of human perception is situated awareness, the ability to relate ourselves to the surrounding physical environment and reason over possible actions in context. However, most existing benchmarks for multimodal foundation models (MFMs) emphasize environment-centric spatial relations (relations among objects in a scene), while largely overlooking observer-centric relationships that require reasoning relative to agent's viewpoint, pose, and motion. To bridge this gap, we introduce SAW-Bench (Situated Awareness in the Real World), a novel benchmark for evaluating egocentric situated awareness using real-world videos. SAW-Bench comprises 786 self-recorded videos captured with Ray-Ban Meta (Gen 2) smart glasses spanning diverse indoor and outdoor environments, and over 2,071 human-annotated question-answer pairs. It probes a model's observer-centric understanding with six different awareness tasks. Our comprehensive evaluation reveals a human-model performance gap of 37.66%, even with the best-performing MFM, Gemini 3 Flash. Beyond this gap, our in-depth analysis uncovers several notable findings; for example, while models can exploit partial geometric cues in egocentric videos, they often fail to infer a coherent camera geometry, leading to systematic spatial reasoning errors. We position SAW-Bench as a benchmark for situated spatial intelligence, moving beyond passive observation to understanding physically grounded, observer-centric dynamics.

6. 【2602.16681】VETime: Vision Enhanced Zero-Shot Time Series Anomaly Detection

链接https://arxiv.org/abs/2602.16681

作者:Yingyuan Yang,Tian Lan,Yifei Gao,Yimeng Lu,Wenjun He,Meng Wang,Chenghao Liu,Chen Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:long-range Context Anomalies, Point Anomalies, Context Anomalies, Time-series anomaly detection, long-range Context

备注

点击查看摘要

Abstract:Time-series anomaly detection (TSAD) requires identifying both immediate Point Anomalies and long-range Context Anomalies. However, existing foundation models face a fundamental trade-off: 1D temporal models provide fine-grained pointwise localization but lack a global contextual perspective, while 2D vision-based models capture global patterns but suffer from information bottlenecks due to a lack of temporal alignment and coarse-grained pointwise detection. To resolve this dilemma, we propose VETime, the first TSAD framework that unifies temporal and visual modalities through fine-grained visual-temporal alignment and dynamic fusion. VETime introduces a Reversible Image Conversion and a Patch-Level Temporal Alignment module to establish a shared visual-temporal timeline, preserving discriminative details while maintaining temporal sensitivity. Furthermore, we design an Anomaly Window Contrastive Learning mechanism and a Task-Adaptive Multi-Modal Fusion to adaptively integrate the complementary perceptual strengths of both modalities. Extensive experiments demonstrate that VETime significantly outperforms state-of-the-art models in zero-shot scenarios, achieving superior localization precision with lower computational overhead than current vision-based approaches. Code available at: this https URL.

7. 【2602.16669】PredMapNet: Future and Historical Reasoning for Consistent Online HD Vectorized Map Construction

链接https://arxiv.org/abs/2602.16669

作者:Bo Lang,Nirav Savaliya,Zhihao Zheng,Jinglun Feng,Zheng-Hang Yeh,Mooi Choo Chuah

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:providing structured representations, autonomous driving, providing structured, navigation and planning, crucial to autonomous

备注: WACV 2026

点击查看摘要

Abstract:High-definition (HD) maps are crucial to autonomous driving, providing structured representations of road elements to support navigation and planning. However, existing query-based methods often employ random query initialization and depend on implicit temporal modeling, which lead to temporal inconsistencies and instabilities during the construction of a global map. To overcome these challenges, we introduce a novel end-to-end framework for consistent online HD vectorized map construction, which jointly performs map instance tracking and short-term prediction. First, we propose a Semantic-Aware Query Generator that initializes queries with spatially aligned semantic masks to capture scene-level context globally. Next, we design a History Rasterized Map Memory to store fine-grained instance-level maps for each tracked instance, enabling explicit historical priors. A History-Map Guidance Module then integrates rasterized map information into track queries, improving temporal continuity. Finally, we propose a Short-Term Future Guidance module to forecast the immediate motion of map instances based on the stored history trajectories. These predicted future locations serve as hints for tracked instances to further avoid implausible predictions and keep temporal consistency. Extensive experiments on the nuScenes and Argoverse2 datasets demonstrate that our proposed method outperforms state-of-the-art (SOTA) methods with good efficiency.

8. 【2602.16664】Unpaired Image-to-Image Translation via a Self-Supervised Semantic Bridge

链接https://arxiv.org/abs/2602.16664

作者:Jiaming Liu,Felix Petersen,Yunhe Gao,Yabin Zhang,Hyojin Kim,Akshay S. Chaudhari,Yu Sun,Stefano Ermon,Sergios Gatidis

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:faces key limitations, advanced unpaired, diffusion-inversion methods, require target-domain adversarial, Adversarial approaches require

备注: 36 pages

点击查看摘要

Abstract:Adversarial diffusion and diffusion-inversion methods have advanced unpaired image-to-image translation, but each faces key limitations. Adversarial approaches require target-domain adversarial loss during training, which can limit generalization to unseen data, while diffusion-inversion methods often produce low-fidelity translations due to imperfect inversion into noise-latent representations. In this work, we propose the Self-Supervised Semantic Bridge (SSB), a versatile framework that integrates external semantic priors into diffusion bridge models to enable spatially faithful translation without cross-domain supervision. Our key idea is to leverage self-supervised visual encoders to learn representations that are invariant to appearance changes but capture geometric structure, forming a shared latent space that conditions the diffusion bridges. Extensive experiments show that SSB outperforms strong prior methods for challenging medical image synthesis in both in-domain and out-of-domain settings, and extends easily to high-quality text-guided editing.

9. 【2602.16611】Style-Aware Gloss Control for Generative Non-Photorealistic Rendering

链接https://arxiv.org/abs/2602.16611

作者:Santiago Jimenez-Navarro,Belen Masia,Ana Serrano

类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

关键词:similar perceptual strategies, perceptual strategies guide, infer material characteristics, paintings or drawings, Humans can infer

备注

点击查看摘要

Abstract:Humans can infer material characteristics of objects from their visual appearance, and this ability extends to artistic depictions, where similar perceptual strategies guide the interpretation of paintings or drawings. Among the factors that define material appearance, gloss, along with color, is widely regarded as one of the most important, and recent studies indicate that humans can perceive gloss independently of the artistic style used to depict an object. To investigate how gloss and artistic style are represented in learned models, we train an unsupervised generative model on a newly curated dataset of painterly objects designed to systematically vary such factors. Our analysis reveals a hierarchical latent space in which gloss is disentangled from other appearance factors, allowing for a detailed study of how gloss is represented and varies across artistic styles. Building on this representation, we introduce a lightweight adapter that connects our style- and gloss-aware latent space to a latent-diffusion model, enabling the synthesis of non-photorealistic images with fine-grained control of these factors. We compare our approach with previous models and observe improved disentanglement and controllability of the learned factors.

10. 【2602.16608】Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models

链接https://arxiv.org/abs/2602.16608

作者:Melkamu Abay Mersha,Jugal Kalita

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:deeply layered representations, layered representations make, Layer-wise Integrated Gradients, Transformer models achieve, difficult to interpret

备注

点击查看摘要

Abstract:Transformer models achieve state-of-the-art performance across domains and tasks, yet their deeply layered representations make their predictions difficult to interpret. Existing explainability methods rely on final-layer attributions, capture either local token-level attributions or global attention patterns without unification, and lack context-awareness of inter-token dependencies and structural components. They also fail to capture how relevance evolves across layers and how structural components shape decision-making. To address these limitations, we proposed the \textbf{Context-Aware Layer-wise Integrated Gradients (CA-LIG) Framework}, a unified hierarchical attribution framework that computes layer-wise Integrated Gradients within each Transformer block and fuses these token-level attributions with class-specific attention gradients. This integration yields signed, context-sensitive attribution maps that capture supportive and opposing evidence while tracing the hierarchical flow of relevance through the Transformer layers. We evaluate the CA-LIG Framework across diverse tasks, domains, and transformer model families, including sentiment analysis and long and multi-class document classification with BERT, hate speech detection in a low-resource language setting with XLM-R and AfroLM, and image classification with Masked Autoencoder vision Transformer model. Across all tasks and architectures, CA-LIG provides more faithful attributions, shows stronger sensitivity to contextual dependencies, and produces clearer, more semantically coherent visualizations than established explainability methods. These results indicate that CA-LIG provides a more comprehensive, context-aware, and reliable explanation of Transformer decision-making, advancing both the practical interpretability and conceptual understanding of deep neural models.

11. 【2602.16590】A Contrastive Learning Framework Empowered by Attention-based Feature Adaptation for Street-View Image Classification

链接https://arxiv.org/abs/2602.16590

作者:Qi You,Yitai Cheng,Zichao Zeng,James Haworth

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:high-definition map construction, Street-view image attribute, urban analytics, vital downstream task, Street-view image

备注

点击查看摘要

Abstract:Street-view image attribute classification is a vital downstream task of image classification, enabling applications such as autonomous driving, urban analytics, and high-definition map construction. It remains computationally demanding whether training from scratch, initialising from pre-trained weights, or fine-tuning large models. Although pre-trained vision-language models such as CLIP offer rich image representations, existing adaptation or fine-tuning methods often rely on their global image embeddings, limiting their ability to capture fine-grained, localised attributes essential in complex, cluttered street scenes. To address this, we propose CLIP-MHAdapter, a variant of the current lightweight CLIP adaptation paradigm that appends a bottleneck MLP equipped with multi-head self-attention operating on patch tokens to model inter-patch dependencies. With approximately 1.4 million trainable parameters, CLIP-MHAdapter achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset, attaining new state-of-the-art results while maintaining low computational cost. The code is available at this https URL.

12. 【2602.16569】Arc2Morph: Identity-Preserving Facial Morphing with Arc2Face

链接https://arxiv.org/abs/2602.16569

作者:Nicolò Di Domenico,Annalisa Franco,Matteo Ferrara,Davide Maltoni

类目:Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

关键词:face recognition systems, electronic identity documents, morphing attack potential, morphing attack, widely recognized

备注

点击查看摘要

Abstract:Face morphing attacks are widely recognized as one of the most challenging threats to face recognition systems used in electronic identity documents. These attacks exploit a critical vulnerability in passport enrollment procedures adopted by many countries, where the facial image is often acquired without a supervised live capture process. In this paper, we propose a novel face morphing technique based on Arc2Face, an identity-conditioned face foundation model capable of synthesizing photorealistic facial images from compact identity representations. We demonstrate the effectiveness of the proposed approach by comparing the morphing attack potential metric on two large-scale sequestered face morphing attack detection datasets against several state-of-the-art morphing methods, as well as on two novel morphed face datasets derived from FEI and ONOT. Experimental results show that the proposed deep learning-based approach achieves a morphing attack potential comparable to that of landmark-based techniques, which have traditionally been regarded as the most challenging. These findings confirm the ability of the proposed method to effectively preserve and manage identity information during the morph generation process.

13. 【2602.16545】Let's Split Up: Zero-Shot Classifier Edits for Fine-Grained Video Understanding

链接https://arxiv.org/abs/2602.16545

作者:Kaiting Liu,Hazel Doughty

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:manner or outcome, single label, typically trained, trained on fixed, fixed taxonomies

备注: ICLR 2026

点击查看摘要

Abstract:Video recognition models are typically trained on fixed taxonomies which are often too coarse, collapsing distinctions in object, manner or outcome under a single label. As tasks and definitions evolve, such models cannot accommodate emerging distinctions and collecting new annotations and retraining to accommodate such changes is costly. To address these challenges, we introduce category splitting, a new task where an existing classifier is edited to refine a coarse category into finer subcategories, while preserving accuracy elsewhere. We propose a zero-shot editing method that leverages the latent compositional structure of video classifiers to expose fine-grained distinctions without additional data. We further show that low-shot fine-tuning, while simple, is highly effective and benefits from our zero-shot initialization. Experiments on our new video benchmarks for category splitting demonstrate that our method substantially outperforms vision-language baselines, improving accuracy on the newly split categories without sacrificing performance on the rest. Project page: this https URL.

14. 【2602.16502】DressWild: Feed-Forward Pose-Agnostic Garment Sewing Pattern Generation from In-the-Wild Images

链接https://arxiv.org/abs/2602.16502

作者:Zeng Tao,Ying Jiang,Yunuo Chen,Tianyi Xie,Huamin Wang,Yingnian Wu,Yin Yang,Abishek Sampath Kumar,Kenji Tashiro,Chenfanfu Jiang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:shown promising progress, Recent advances, promising progress, shown promising, sewing pattern generation

备注

点击查看摘要

Abstract:Recent advances in garment pattern generation have shown promising progress. However, existing feed-forward methods struggle with diverse poses and viewpoints, while optimization-based approaches are computationally expensive and difficult to scale. This paper focuses on sewing pattern generation for garment modeling and fabrication applications that demand editable, separable, and simulation-ready garments. We propose DressWild, a novel feed-forward pipeline that reconstructs physics-consistent 2D sewing patterns and the corresponding 3D garments from a single in-the-wild image. Given an input image, our method leverages vision-language models (VLMs) to normalize pose variations at the image level, then extract pose-aware, 3D-informed garment features. These features are fused through a transformer-based encoder and subsequently used to predict sewing pattern parameters, which can be directly applied to physical simulation, texture synthesis, and multi-layer virtual try-on. Extensive experiments demonstrate that our approach robustly recovers diverse sewing patterns and the corresponding 3D garments from in-the-wild images without requiring multi-view inputs or iterative optimization, offering an efficient and scalable solution for realistic garment simulation and animation.

15. 【2602.16494】Benchmarking Adversarial Robustness and Adversarial Training Strategies for Object Detection

链接https://arxiv.org/abs/2602.16494

作者:Alexis Winter,Jean-Vincent Martini,Romaric Audigier,Angelique Loesch,Bertrand Luvison

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:automated systems, perception-based robots, security risk, critical components, components of automated

备注

点击查看摘要

Abstract:Object detection models are critical components of automated systems, such as autonomous vehicles and perception-based robots, but their sensitivity to adversarial attacks poses a serious security risk. Progress in defending these models lags behind classification, hindered by a lack of standardized evaluation. It is nearly impossible to thoroughly compare attack or defense methods, as existing work uses different datasets, inconsistent efficiency metrics, and varied measures of perturbation cost. This paper addresses this gap by investigating three key questions: (1) How can we create a fair benchmark to impartially compare attacks? (2) How well do modern attacks transfer across different architectures, especially from Convolutional Neural Networks to Vision Transformers? (3) What is the most effective adversarial training strategy for robust defense? To answer these, we first propose a unified benchmark framework focused on digital, non-patch-based attacks. This framework introduces specific metrics to disentangle localization and classification errors and evaluates attack cost using multiple perceptual metrics. Using this benchmark, we conduct extensive experiments on state-of-the-art attacks and a wide range of detectors. Our findings reveal two major conclusions: first, modern adversarial attacks against object detection models show a significant lack of transferability to transformer-based architectures. Second, we demonstrate that the most robust adversarial training strategy leverages a dataset composed of a mix of high-perturbation attacks with different objectives (e.g., spatial and semantic), which outperforms training on any single attack.

16. 【2602.16493】MMA: Multimodal Memory Agent

链接https://arxiv.org/abs/2602.16493

作者:Yihao Lu,Wanru Cheng,Zeyu Zhang,Hao Tang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Long-horizon multimodal agents, trigger overconfident errors, multimodal agents depend, Long-horizon multimodal, Multimodal Memory Agent

备注

点击查看摘要

Abstract:Long-horizon multimodal agents depend on external memory; however, similarity-based retrieval often surfaces stale, low-credibility, or conflicting items, which can trigger overconfident errors. We propose Multimodal Memory Agent (MMA), which assigns each retrieved memory item a dynamic reliability score by combining source credibility, temporal decay, and conflict-aware network consensus, and uses this signal to reweight evidence and abstain when support is insufficient. We also introduce MMA-Bench, a programmatically generated benchmark for belief dynamics with controlled speaker reliability and structured text-vision contradictions. Using this framework, we uncover the "Visual Placebo Effect", revealing how RAG-based agents inherit latent visual biases from foundation models. On FEVER, MMA matches baseline accuracy while reducing variance by 35.2% and improving selective utility; on LoCoMo, a safety-oriented configuration improves actionable accuracy and reduces wrong answers; on MMA-Bench, MMA reaches 41.18% Type-B accuracy in Vision mode, while the baseline collapses to 0.0% under the same protocol. Code: this https URL.

17. 【2602.16455】Visual Self-Refine: A Pixel-Guided Paradigm for Accurate Chart Parsing

链接https://arxiv.org/abs/2602.16455

作者:Jinsong Li,Xiaoyi Dong,Yuhang Zang,Yuhang Cao,Jiaqi Wang,Dahua Lin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Vision-Language Models, demonstrated remarkable capabilities, strengths provide minimal, provide minimal benefits, Large Vision-Language

备注

点击查看摘要

Abstract:While Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities for reasoning and self-correction at the textual level, these strengths provide minimal benefits for complex tasks centered on visual perception, such as Chart Parsing. Existing models often struggle with visually dense charts, leading to errors like data omission, misalignment, and hallucination. Inspired by the human strategy of using a finger as a ``visual anchor'' to ensure accuracy when reading complex charts, we propose a new paradigm named Visual Self-Refine (VSR). The core idea of VSR is to enable a model to generate pixel-level localization outputs, visualize them, and then feed these visualizations back to itself, allowing it to intuitively inspect and correct its own potential visual perception errors. We instantiate the VSR paradigm in the domain of Chart Parsing by proposing ChartVSR. This model decomposes the parsing process into two stages: a Refine Stage, where it iteratively uses visual feedback to ensure the accuracy of all data points' Pixel-level Localizations, and a Decode Stage, where it uses these verified localizations as precise visual anchors to parse the final structured data. To address the limitations of existing benchmarks, we also construct ChartP-Bench, a new and highly challenging benchmark for chart parsing. Our work also highlights VSR as a general-purpose visual feedback mechanism, offering a promising new direction for enhancing accuracy on a wide range of vision-centric tasks.

18. 【2602.16430】Designing Production-Scale OCR for India: Multilingual and Domain-Specific Systems

链接https://arxiv.org/abs/2602.16430

作者:Ali Faraz,Raja Kolla,Ashish Kulkarni,Shubham Agarwal

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Designing Optical Character, Optical Character Recognition, India requires balancing, Designing Optical, Character Recognition

备注

点击查看摘要

Abstract:Designing Optical Character Recognition (OCR) systems for India requires balancing linguistic diversity, document heterogeneity, and deployment constraints. In this paper, we study two training strategies for building multilingual OCR systems with Vision-Language Models through the Chitrapathak series. We first follow a popular multimodal approach, pairing a generic vision encoder with a strong multilingual language model and training the system end-to-end for OCR. Alternatively, we explore fine-tuning an existing OCR model, despite not being trained for the target languages. Through extensive evaluation on multilingual Indic OCR benchmarks and deployment-oriented metrics, we find that the second strategy consistently achieves better accuracy-latency trade-offs. Chitrapathak-2 achieves 3-6x speedup over its predecessor with being state-of-the-art (SOTA) in Telugu (6.69 char ANLS) and second best in the rest. In addition, we present Parichay, an independent OCR model series designed specifically for 9 Indian government documents to extract structured key fields, achieving 89.8% Exact Match score with a faster inference. Together, these systems achieve SOTA performance and provide practical guidance for building production-scale OCR pipelines in the Indian context.

19. 【2602.16412】ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding

链接https://arxiv.org/abs/2602.16412

作者:Daichi Yashima,Shuhei Kurita,Yusuke Oda,Komei Sugiura

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:multimodal large language, shown remarkable success, long-form video understanding, video understanding remains, large language models

备注

点击查看摘要

Abstract:While multimodal large language models (MLLMs) have shown remarkable success across a wide range of tasks, long-form video understanding remains a significant challenge. In this study, we focus on video understanding by MLLMs. This task is challenging because processing a full stream of RGB frames is computationally intractable and highly redundant, as self-attention have quadratic complexity with sequence length. In this paper, we propose ReMoRa, a video MLLM that processes videos by operating directly on their compressed representations. A sparse set of RGB keyframes is retained for appearance, while temporal dynamics are encoded as a motion representation, removing the need for sequential RGB frames. These motion representations act as a compact proxy for optical flow, capturing temporal dynamics without full frame decoding. To refine the noise and low fidelity of block-based motions, we introduce a module to denoise and generate a fine-grained motion representation. Furthermore, our model compresses these features in a way that scales linearly with sequence length. We demonstrate the effectiveness of ReMoRa through extensive experiments across a comprehensive suite of long-video understanding benchmarks. ReMoRa outperformed baseline methods on multiple challenging benchmarks, including LongVideoBench, NExT-QA, and MLVU.

20. 【2602.16385】Parameter-Free Adaptive Multi-Scale Channel-Spatial Attention Aggregation framework for 3D Indoor Semantic Scene Completion Toward Assisting Visually Impaired

链接https://arxiv.org/abs/2602.16385

作者:Qi He,XiangXiang Wang,Jingtao Zhang,Yongbin Yu,Hongxiang Chu,Manping Fan,JingYe Cai,Zhenglin Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Semantic Scene Completion, safety-critical scene understanding, Scene Completion, strictly monocular vision, scene understanding

备注: 17 pages, 9 figures, 5 tables

点击查看摘要

Abstract:In indoor assistive perception for visually impaired users, 3D Semantic Scene Completion (SSC) is expected to provide structurally coherent and semantically consistent occupancy under strictly monocular vision for safety-critical scene understanding. However, existing monocular SSC approaches often lack explicit modeling of voxel-feature reliability and regulated cross-scale information propagation during 2D-3D projection and multi-scale fusion, making them vulnerable to projection diffusion and feature entanglement and thus limiting structural stability. To address these challenges, this paper presents an Adaptive Multi-scale Attention Aggregation (AMAA) framework built upon the MonoScene pipeline. Rather than introducing a heavier backbone, AMAA focuses on reliability-oriented feature regulation within a monocular SSC framework. Specifically, lifted voxel features are jointly calibrated in semantic and spatial dimensions through parallel channel-spatial attention aggregation, while multi-scale encoder-decoder fusion is stabilized via a hierarchical adaptive feature-gating strategy that regulates information injection across scales. Experiments on the NYUv2 benchmark demonstrate consistent improvements over MonoScene without significantly increasing system complexity: AMAA achieves 27.25% SSC mIoU (+0.31) and 43.10% SC IoU (+0.59). In addition, system-level deployment on an NVIDIA Jetson platform verifies that the complete AMAA framework can be executed stably on embedded hardware. Overall, AMAA improves monocular SSC quality and provides a reliable and deployable perception framework for indoor assistive systems targeting visually impaired users.

21. 【2602.16365】Markerless 6D Pose Estimation and Position-Based Visual Servoing for Endoscopic Continuum Manipulators

链接https://arxiv.org/abs/2602.16365

作者:Junhyun Park,Chunggil An,Myeongbo Park,Ihsan Ullah,Sihyeong Park,Minho Hwang

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:minimally invasive procedures, flexible endoscopic surgical, endoscopic surgical systems, surgical systems offer, remain challenging due

备注: 20 pages, 13 figures, 7 tables

点击查看摘要

Abstract:Continuum manipulators in flexible endoscopic surgical systems offer high dexterity for minimally invasive procedures; however, accurate pose estimation and closed-loop control remain challenging due to hysteresis, compliance, and limited distal sensing. Vision-based approaches reduce hardware complexity but are often constrained by limited geometric observability and high computational overhead, restricting real-time closed-loop applicability. This paper presents a unified framework for markerless stereo 6D pose estimation and position-based visual servoing of continuum manipulators. A photo-realistic simulation pipeline enables large-scale automatic training with pixel-accurate annotations. A stereo-aware multi-feature fusion network jointly exploits segmentation masks, keypoints, heatmaps, and bounding boxes to enhance geometric observability. To enforce geometric consistency without iterative optimization, a feed-forward rendering-based refinement module predicts residual pose corrections in a single pass. A self-supervised sim-to-real adaptation strategy further improves real-world performance using unlabeled data. Extensive real-world validation achieves a mean translation error of 0.83 mm and a mean rotation error of 2.76° across 1,000 samples. Markerless closed-loop visual servoing driven by the estimated pose attains accurate trajectory tracking with a mean translation error of 2.07 mm and a mean rotation error of 7.41°, corresponding to 85% and 59% reductions compared to open-loop control, together with high repeatability in repeated point-reaching tasks. To the best of our knowledge, this work presents the first fully markerless pose-estimation-driven position-based visual servoing framework for continuum manipulators, enabling precise closed-loop control without physical markers or embedded sensing.

22. 【2602.16356】Articulated 3D Scene Graphs for Open-World Mobile Manipulation

链接https://arxiv.org/abs/2602.16356

作者:Martin Büchner,Adrian Röfer,Tim Engelbracht,Tim Welschehold,Zuria Bauer,Hermann Blum,Marc Pollefeys,Abhinav Valada

类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:understanding and affordance-driven, affordance-driven object interaction, object, objects, scene understanding

备注

点击查看摘要

Abstract:Semantics has enabled 3D scene understanding and affordance-driven object interaction. However, robots operating in real-world environments face a critical limitation: they cannot anticipate how objects move. Long-horizon mobile manipulation requires closing the gap between semantics, geometry, and kinematics. In this work, we present MoMa-SG, a novel framework for building semantic-kinematic 3D scene graphs of articulated scenes containing a myriad of interactable objects. Given RGB-D sequences containing multiple object articulations, we temporally segment object interactions and infer object motion using occlusion-robust point tracking. We then lift point trajectories into 3D and estimate articulation models using a novel unified twist estimation formulation that robustly estimates revolute and prismatic joint parameters in a single optimization pass. Next, we associate objects with estimated articulations and detect contained objects by reasoning over parent-child relations at identified opening states. We also introduce the novel Arti4D-Semantic dataset, which uniquely combines hierarchical object semantics including parent-child relation labels with object axis annotations across 62 in-the-wild RGB-D sequences containing 600 object interactions and three distinct observation paradigms. We extensively evaluate the performance of MoMa-SG on two datasets and ablate key design choices of our approach. In addition, real-world experiments on both a quadruped and a mobile manipulator demonstrate that our semantic-kinematic scene graphs enable robust manipulation of articulated objects in everyday home environments. We provide code and data at: this https URL.

23. 【2602.16349】SCAR: Satellite Imagery-Based Calibration for Aerial Recordings

链接https://arxiv.org/abs/2602.16349

作者:Henry Hölzemann,Michael Schleiss

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:persistent global reference, exploits georeferenced satellite, georeferenced satellite imagery, aerial visual-inertial systems, long-term auto-calibration refinement

备注

点击查看摘要

Abstract:We introduce SCAR, a method for long-term auto-calibration refinement of aerial visual-inertial systems that exploits georeferenced satellite imagery as a persistent global reference. SCAR estimates both intrinsic and extrinsic parameters by aligning aerial images with 2D--3D correspondences derived from publicly available orthophotos and elevation models. In contrast to existing approaches that rely on dedicated calibration maneuvers or manually surveyed ground control points, our method leverages external geospatial data to detect and correct calibration degradation under field deployment conditions. We evaluate our approach on six large-scale aerial campaigns conducted over two years under diverse seasonal and environmental conditions. Across all sequences, SCAR consistently outperforms established baselines (Kalibr, COLMAP, VINS-Mono), reducing median reprojection error by a large margin, and translating these calibration gains into substantially lower visual localization rotation errors and higher pose accuracy. These results demonstrate that SCAR provides accurate, robust, and reproducible calibration over long-term aerial operations without the need for manual intervention.

24. 【2602.16337】Subtractive Modulative Network with Learnable Periodic Activations

链接https://arxiv.org/abs/2602.16337

作者:Tiou Wang,Zhuoqian Yang,Markus Flierl,Mathieu Salzmann,Sabine Süsstrunk

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Implicit Neural Representation, parameter-efficient Implicit Neural, Subtractive Modulative Network, Neural Representation, Implicit Neural

备注: 4 pages, 3 figures, 3 tables

点击查看摘要

Abstract:We propose the Subtractive Modulative Network (SMN), a novel, parameter-efficient Implicit Neural Representation (INR) architecture inspired by classical subtractive synthesis. The SMN is designed as a principled signal processing pipeline, featuring a learnable periodic activation layer (Oscillator) that generates a multi-frequency basis, and a series of modulative mask modules (Filters) that actively generate high-order harmonics. We provide both theoretical analysis and empirical validation for our design. Our SMN achieves a PSNR of $40+$ dB on two image datasets, comparing favorably against state-of-the-art methods in terms of both reconstruction accuracy and parameter efficiency. Furthermore, consistent advantage is observed on the challenging 3D NeRF novel view synthesis task. Supplementary materials are available at this https URL.

25. 【2602.16327】Guide-Guard: Off-Target Predicting in CRISPR Applications

链接https://arxiv.org/abs/2602.16327

作者:Joseph Bingham,Netanel Arussy,Saman Zonouz

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:cyber-physical genome sequencing, easily access tools, agriculture and medicine, editing technologies, health science

备注: 10 pages, 11 figs, accepted to IDEAL 2022

点击查看摘要

Abstract:With the introduction of cyber-physical genome sequencing and editing technologies, such as CRISPR, researchers can more easily access tools to investigate and create remedies for a variety of topics in genetics and health science (e.g. agriculture and medicine). As the field advances and grows, new concerns present themselves in the ability to predict the off-target behavior. In this work, we explore the underlying biological and chemical model from a data driven perspective. Additionally, we present a machine learning based solution named \textit{Guide-Guard} to predict the behavior of the system given a gRNA in the CRISPR gene-editing process with 84\% accuracy. This solution is able to be trained on multiple different genes at the same time while retaining accuracy.

26. 【2602.16322】A Self-Supervised Approach for Enhanced Feature Representations in Object Detection Tasks

链接https://arxiv.org/abs/2602.16322

作者:Santiago C. Vilabella,Pablo Pérez-Núñez,Beatriz Remeseiro

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:training deep learning, artificial intelligence, complexity and size, fast-evolving field, field of artificial

备注

点击查看摘要

Abstract:In the fast-evolving field of artificial intelligence, where models are increasingly growing in complexity and size, the availability of labeled data for training deep learning models has become a significant challenge. Addressing complex problems like object detection demands considerable time and resources for data labeling to achieve meaningful results. For companies developing such applications, this entails extensive investment in highly skilled personnel or costly outsourcing. This research work aims to demonstrate that enhancing feature extractors can substantially alleviate this challenge, enabling models to learn more effective representations with less labeled data. Utilizing a self-supervised learning strategy, we present a model trained on unlabeled data that outperforms state-of-the-art feature extractors pre-trained on ImageNet and particularly designed for object detection tasks. Moreover, the results demonstrate that our approach encourages the model to focus on the most relevant aspects of an object, thus achieving better feature representations and, therefore, reinforcing its reliability and robustness.

27. 【2602.16281】Breaking the Sub-Millimeter Barrier: Eyeframe Acquisition from Color Images

链接https://arxiv.org/abs/2602.16281

作者:Manel Guzmán,Antonio Agudo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:ensure proper lens, proper lens fitting, requires sub-millimeter precision, optimal vision correction, proper lens

备注: Accepted to CAI 2026

点击查看摘要

Abstract:Eyeframe lens tracing is an important process in the optical industry that requires sub-millimeter precision to ensure proper lens fitting and optimal vision correction. Traditional frame tracers rely on mechanical tools that need precise positioning and calibration, which are time-consuming and require additional equipment, creating an inefficient workflow for opticians. This work presents a novel approach based on artificial vision that utilizes multi-view information. The proposed algorithm operates on images captured from an InVision system. The full pipeline includes image acquisition, frame segmentation to isolate the eyeframe from background, depth estimation to obtain 3D spatial information, and multi-view processing that integrates segmented RGB images with depth data for precise frame contour measurement. To this end, different configurations and variants are proposed and analyzed on real data, providing competitive measurements from still color images with respect to other solutions, while eliminating the need for specialized tracing equipment and reducing workflow complexity for optical technicians.

28. 【2602.16249】AFFMAE: Scalable and Efficient Vision Pretraining for Desktop Graphics Cards

链接https://arxiv.org/abs/2602.16249

作者:David Smerkous,Zian Wang,Behzad Najafian

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:enabling data-efficient fine-tuning, requires server-scale infrastructure, limiting in-domain foundation, transformed computer vision, typically requires server-scale

备注: Preprint

点击查看摘要

Abstract:Self-supervised pretraining has transformed computer vision by enabling data-efficient fine-tuning, yet high-resolution training typically requires server-scale infrastructure, limiting in-domain foundation model development for many research laboratories. Masked Autoencoders (MAE) reduce computation by encoding only visible tokens, but combining MAE with hierarchical downsampling architectures remains structurally challenging due to dense grid priors and mask-aware design compromises. We introduce AFFMAE, a masking-friendly hierarchical pretraining framework built on adaptive, off-grid token merging. By discarding masked tokens and performing dynamic merging exclusively over visible tokens, AFFMAE removes dense-grid assumptions while preserving hierarchical scalability. We developed numerically stable mixed-precision Flash-style cluster attention kernels, and mitigate sparse-stage representation collapse via deep supervision. On high-resolution electron microscopy segmentation, AFFMAE matches ViT-MAE performance at equal parameter count while reducing FLOPs by up to 7x, halving memory usage, and achieving faster training on a single RTX 5090. Code available at this https URL.

29. 【2602.16245】HyPCA-Net: Advancing Multimodal Fusion in Medical Image Analysis

链接https://arxiv.org/abs/2602.16245

作者:J. Dhar,M. K. Pandey,D. Chakladar,M. Haghighat,A. Alavi,S. Mistry,N. Zaidi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:skin cancer detection, brain tumor prediction, shown great potential, Multimodal fusion frameworks, medical imaging modalities

备注: Accepted at the IEEE/CVF Winter Conference on Applications of Computer Vision 2026

点击查看摘要

Abstract:Multimodal fusion frameworks, which integrate diverse medical imaging modalities (e.g., MRI, CT), have shown great potential in applications such as skin cancer detection, dementia diagnosis, and brain tumor prediction. However, existing multimodal fusion methods face significant challenges. First, they often rely on computationally expensive models, limiting their applicability in low-resource environments. Second, they often employ cascaded attention modules, which potentially increase risk of information loss during inter-module transitions and hinder their capacity to effectively capture robust shared representations across modalities. This restricts their generalization in multi-disease analysis tasks. To address these limitations, we propose a Hybrid Parallel-Fusion Cascaded Attention Network (HyPCA-Net), composed of two core novel blocks: (a) a computationally efficient residual adaptive learning attention block for capturing refined modality-specific representations, and (b) a dual-view cascaded attention block aimed at learning robust shared representations across diverse modalities. Extensive experiments on ten publicly available datasets exhibit that HyPCA-Net significantly outperforms existing leading methods, with improvements of up to 5.2% in performance and reductions of up to 73.1% in computational cost. Code: this https URL.

30. 【2602.16238】EasyControlEdge: A Foundation-Model Fine-Tuning for Edge Detection

链接https://arxiv.org/abs/2602.16238

作者:Hiroki Nakamura,Hiroto Iino,Masashi Okada,Tadahiro Taniguchi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:edge detection, image-generation foundation models, propose EasyControlEdge, image-generation foundation, edge

备注

点击查看摘要

Abstract:We propose EasyControlEdge, adapting an image-generation foundation model to edge detection. In real-world edge detection (e.g., floor-plan walls, satellite roads/buildings, and medical organ boundaries), crispness and data efficiency are crucial, yet producing crisp raw edge maps with limited training samples remains challenging. Although image-generation foundation models perform well on many downstream tasks, their pretrained priors for data-efficient transfer and iterative refinement for high-frequency detail preservation remain underexploited for edge detection. To enable crisp and data-efficient edge detection using these capabilities, we introduce an edge-specialized adaptation of image-generation foundation models. To better specialize the foundation model for edge detection, we incorporate an edge-oriented objective with an efficient pixel-space loss. At inference, we introduce guidance based on unconditional dynamics, enabling a single model to control the edge density through a guidance scale. Experiments on BSDS500, NYUDv2, BIPED, and CubiCasa compare against state-of-the-art methods and show consistent gains, particularly under no-post-processing crispness evaluation and with limited training data.

31. 【2602.16231】DataCube: A Video Retrieval Platform via Natural Language Semantic Profiling

链接https://arxiv.org/abs/2602.16231

作者:Yiming Ju,Hanyu Zhao,Quanyue Ma,Donglin Hao,Chengwei Wu,Ming Li,Songjing Wang,Tengfei Pan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large-scale video repositories, modern video understanding, Large-scale video, generation tasks, understanding and generation

备注: This paper is under review for the IJCAI-ECAI 2026 Demonstrations Track

点击查看摘要

Abstract:Large-scale video repositories are increasingly available for modern video understanding and generation tasks. However, transforming raw videos into high-quality, task-specific datasets remains costly and inefficient. We present DataCube, an intelligent platform for automatic video processing, multi-dimensional profiling, and query-driven retrieval. DataCube constructs structured semantic representations of video clips and supports hybrid retrieval with neural re-ranking and deep semantic matching. Through an interactive web interface, users can efficiently construct customized video subsets from massive repositories for training, analysis, and evaluation, and build searchable systems over their own private video collections. The system is publicly accessible at this https URL. Demo Video: this https URL

32. 【2602.16213】Graph neural network for colliding particles with an application to sea ice floe modeling

链接https://arxiv.org/abs/2602.16213

作者:Ruibiao Zhu

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computational Physics (physics.comp-ph)

关键词:Graph Neural Networks, natural graph structure, nodes represent individual, Graph Neural, Neural Networks

备注

点击查看摘要

Abstract:This paper introduces a novel approach to sea ice modeling using Graph Neural Networks (GNNs), utilizing the natural graph structure of sea ice, where nodes represent individual ice pieces, and edges model the physical interactions, including collisions. This concept is developed within a one-dimensional framework as a foundational step. Traditional numerical methods, while effective, are computationally intensive and less scalable. By utilizing GNNs, the proposed model, termed the Collision-captured Network (CN), integrates data assimilation (DA) techniques to effectively learn and predict sea ice dynamics under various conditions. The approach was validated using synthetic data, both with and without observed data points, and it was found that the model accelerates the simulation of trajectories without compromising accuracy. This advancement offers a more efficient tool for forecasting in marginal ice zones (MIZ) and highlights the potential of combining machine learning with data assimilation for more effective and efficient modeling.

33. 【2602.16160】Uncertainty-Guided Inference-Time Depth Adaptation for Transformer-Based Visual Tracking

链接https://arxiv.org/abs/2602.16160

作者:Patrick Poggi,Divake Kumar,Theja Tulabandhula,Amit Ranjan Trivedi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:incurring unnecessary computational, unnecessary computational cost, single-object trackers achieve, Transformer-based single-object trackers, temporally coherent frames

备注: Submitted to IJCNN 2026

点击查看摘要

Abstract:Transformer-based single-object trackers achieve state-of-the-art accuracy but rely on fixed-depth inference, executing the full encoder--decoder stack for every frame regardless of visual complexity, thereby incurring unnecessary computational cost in long video sequences dominated by temporally coherent frames. We propose UncL-STARK, an architecture-preserving approach that enables dynamic, uncertainty-aware depth adaptation in transformer-based trackers without modifying the underlying network or adding auxiliary heads. The model is fine-tuned to retain predictive robustness at multiple intermediate depths using random-depth training with knowledge distillation, thus enabling safe inference-time truncation. At runtime, we derive a lightweight uncertainty estimate directly from the model's corner localization heatmaps and use it in a feedback-driven policy that selects the encoder and decoder depth for the next frame based on the prediction confidence by exploiting temporal coherence in video. Extensive experiments on GOT-10k and LaSOT demonstrate up to 12\% GFLOPs reduction, 8.9\% latency reduction, and 10.8\% energy savings while maintaining tracking accuracy within 0.2\% of the full-depth baseline across both short-term and long-term sequences.

34. 【2602.16149】Evaluating Demographic Misrepresentation in Image-to-Image Portrait Editing

链接https://arxiv.org/abs/2602.16149

作者:Huichan Seo,Minki Hong,Sieun Choi,Jihie Kim,Jean Oh

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:editing remain underexplored, remain underexplored, Demographic bias, Soft Erasure, Stereotype Replacement

备注: 19 pages, 13 figures. Preprint

点击查看摘要

Abstract:Demographic bias in text-to-image (T2I) generation is well studied, yet demographic-conditioned failures in instruction-guided image-to-image (I2I) editing remain underexplored. We examine whether identical edit instructions yield systematically different outcomes across subject demographics in open-weight I2I editors. We formalize two failure modes: Soft Erasure, where edits are silently weakened or ignored in the output image, and Stereotype Replacement, where edits introduce unrequested, stereotype-consistent attributes. We introduce a controlled benchmark that probes demographic-conditioned behavior by generating and editing portraits conditioned on race, gender, and age using a diagnostic prompt set, and evaluate multiple editors with vision-language model (VLM) scoring and human evaluation. Our analysis shows that identity preservation failures are pervasive, demographically uneven, and shaped by implicit social priors, including occupation-driven gender inference. Finally, we demonstrate that a prompt-level identity constraint, without model updates, can substantially reduce demographic change for minority groups while leaving majority-group portraits largely unchanged, revealing asymmetric identity priors in current editors. Together, our findings establish identity preservation as a central and demographically uneven failure mode in I2I editing and motivate demographic-robust editing systems. Project page: this https URL

35. 【2602.16138】IRIS: Intent Resolution via Inference-time Saccades for Open-Ended VQA in Large Vision-Language Models

链接https://arxiv.org/abs/2602.16138

作者:Parsa Madinei,Srijita Karmakar,Russell Cohen Hoffing,Felix Gervitz,Miguel P. Eckstein

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Intent Resolution, Inference-time Saccades, Resolution via Inference-time, introduce IRIS, resolve ambiguity

备注

点击查看摘要

Abstract:We introduce IRIS (Intent Resolution via Inference-time Saccades), a novel training-free approach that uses eye-tracking data in real-time to resolve ambiguity in open-ended VQA. Through a comprehensive user study with 500 unique image-question pairs, we demonstrate that fixations closest to the time participants start verbally asking their questions are the most informative for disambiguation in Large VLMs, more than doubling the accuracy of responses on ambiguous questions (from 35.2% to 77.2%) while maintaining performance on unambiguous queries. We evaluate our approach across state-of-the-art VLMs, showing consistent improvements when gaze data is incorporated in ambiguous image-question pairs, regardless of architectural differences. We release a new benchmark dataset to use eye movement data for disambiguated VQA, a novel real-time interactive protocol, and an evaluation suite.

36. 【2602.16132】CHAI: CacHe Attention Inference for text2video

链接https://arxiv.org/abs/2602.16132

作者:Joel Mathew Cherian,Ashutosh Muralidhara Bharadwaj,Vima Gupta,Anand Padmanabha Iyer

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:deliver impressive results, diffusion models deliver, models deliver impressive, deliver impressive, impressive results

备注

点击查看摘要

Abstract:Text-to-video diffusion models deliver impressive results but remain slow because of the sequential denoising of 3D latents. Existing approaches to speed up inference either require expensive model retraining or use heuristic-based step skipping, which struggles to maintain video quality as the number of denoising steps decreases. Our work, CHAI, aims to use cross-inference caching to reduce latency while maintaining video quality. We introduce Cache Attention as an effective method for attending to shared objects/scenes across cross-inference latents. This selective attention mechanism enables effective reuse of cached latents across semantically related prompts, yielding high cache hit rates. We show that it is possible to generate high-quality videos using Cache Attention with as few as 8 denoising steps. When integrated into the overall system, CHAI is 1.65x - 3.35x faster than baseline OpenSora 1.2 while maintaining video quality.

37. 【2602.16110】OmniCT: Towards a Unified Slice-Volume LVLM for Comprehensive CT Analysis

链接https://arxiv.org/abs/2602.16110

作者:Tianwei Lin,Zhongwei Qiu,Wenqiao Zhang,Jiang Liu,Yihan Xie,Mingjian Gao,Zhenxuan Fan,Zhaocheng Li,Sijing Li,Zhongle Xie,Peng LU,Yueting Zhuang,Yingda Xia,Ling Zhang,Beng Chin Ooi

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Computed Tomography, covering critical organs, information-dense imaging modalities, diagnostically information-dense imaging, covering critical

备注

点击查看摘要

Abstract:Computed Tomography (CT) is one of the most widely used and diagnostically information-dense imaging modalities, covering critical organs such as the heart, lungs, liver, and colon. Clinical interpretation relies on both slice-driven local features (e.g., sub-centimeter nodules, lesion boundaries) and volume-driven spatial representations (e.g., tumor infiltration, inter-organ anatomical relations). However, existing Large Vision-Language Models (LVLMs) remain fragmented in CT slice versus volumetric understanding: slice-driven LVLMs show strong generalization but lack cross-slice spatial consistency, while volume-driven LVLMs explicitly capture volumetric semantics but suffer from coarse granularity and poor compatibility with slice inputs. The absence of a unified modeling paradigm constitutes a major bottleneck for the clinical translation of medical LVLMs. We present OmniCT, a powerful unified slice-volume LVLM for CT scenarios, which makes three contributions: (i) Spatial Consistency Enhancement (SCE): volumetric slice composition combined with tri-axial positional embedding that introduces volumetric consistency, and an MoE hybrid projection enables efficient slice-volume adaptation; (ii) Organ-level Semantic Enhancement (OSE): segmentation and ROI localization explicitly align anatomical regions, emphasizing lesion- and organ-level semantics; (iii) MedEval-CT: the largest slice-volume CT dataset and hybrid benchmark integrates comprehensive metrics for unified evaluation. OmniCT consistently outperforms existing methods with a substantial margin across diverse clinical tasks and satisfies both micro-level detail sensitivity and macro-level spatial reasoning. More importantly, it establishes a new paradigm for cross-modal medical imaging understanding.

38. 【2602.16086】LGQ: Learning Discretization Geometry for Scalable and Stable Image Tokenization

链接https://arxiv.org/abs/2602.16086

作者:Idil Bilge Altun,Mert Onur Cakiroglu,Elham Buxton,Mehmet Dalkilic,Hasan Kurban

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:scalable visual generation, efficient latent-space priors, preserving semantic structure, discrete capacity effectively, Discrete image tokenization

备注

点击查看摘要

Abstract:Discrete image tokenization is a key bottleneck for scalable visual generation: a tokenizer must remain compact for efficient latent-space priors while preserving semantic structure and using discrete capacity effectively. Existing quantizers face a trade-off: vector-quantized tokenizers learn flexible geometries but often suffer from biased straight-through optimization, codebook under-utilization, and representation collapse at large vocabularies. Structured scalar or implicit tokenizers ensure stable, near-complete utilization by design, yet rely on fixed discretization geometries that may allocate capacity inefficiently under heterogeneous latent statistics. We introduce Learnable Geometric Quantization (LGQ), a discrete image tokenizer that learns discretization geometry end-to-end. LGQ replaces hard nearest-neighbor lookup with temperature-controlled soft assignments, enabling fully differentiable training while recovering hard assignments at inference. The assignments correspond to posterior responsibilities of an isotropic Gaussian mixture and minimize a variational free-energy objective, provably converging to nearest-neighbor quantization in the low-temperature limit. LGQ combines a token-level peakedness regularizer with a global usage regularizer to encourage confident yet balanced code utilization without imposing rigid grids. Under a controlled VQGAN-style backbone on ImageNet across multiple vocabulary sizes, LGQ achieves stable optimization and balanced utilization. At 16K codebook size, LGQ improves rFID by 11.88% over FSQ while using 49.96% fewer active codes, and improves rFID by 6.06% over SimVQ with 49.45% lower effective representation rate, achieving comparable fidelity with substantially fewer active entries. Our GitHub repository is available at: this https URL

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Cite as:
arXiv:2602.16086 [cs.CV]

(or
arXiv:2602.16086v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2602.16086

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
39. 【2602.16057】Extracting and Analyzing Rail Crossing Behavior Signatures from Videos using Tensor Methods

链接https://arxiv.org/abs/2602.16057

作者:Dawon Ahn,Het Patel,Aemal Khattak,Jia Chen,Evangelos E. Papalexakis

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:driver behavior varies, crossings present complex, present complex safety, complex safety challenges, present complex

备注: 6 pages, 10 figures. Accepted at InnovaRail 2026

点击查看摘要

Abstract:Railway crossings present complex safety challenges where driver behavior varies by location, time, and conditions. Traditional approaches analyze crossings individually, limiting the ability to identify shared behavioral patterns across locations. We propose a multi-view tensor decomposition framework that captures behavioral similarities across three temporal phases: Approach (warning activation to gate lowering), Waiting (gates down to train passage), and Clearance (train passage to gate raising). We analyze railway crossing videos from multiple locations using TimeSformer embeddings to represent each phase. By constructing phase-specific similarity matrices and applying non-negative symmetric CP decomposition, we discover latent behavioral components with distinct temporal signatures. Our tensor analysis reveals that crossing location appears to be a stronger determinant of behavior patterns than time of day, and that approach-phase behavior provides particularly discriminative signatures. Visualization of the learned component space confirms location-based clustering, with certain crossings forming distinct behavioral clusters. This automated framework enables scalable pattern discovery across multiple crossings, providing a foundation for grouping locations by behavioral similarity to inform targeted safety interventions.

40. 【2602.16019】MedProbCLIP: Probabilistic Adaptation of Vision-Language Foundation Model for Reliable Radiograph-Report Retrieval

链接https://arxiv.org/abs/2602.16019

作者:Ahmad Elallaf,Yu Zhang,Yuktha Priya Masupalli,Jeong Yang,Young Lee,Zechun Cao,Gongbo Liang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:high-stakes biomedical applications, general-purpose representation learners, powerful general-purpose representation, Vision-language foundation models, multimodal understanding

备注: Accepted to the 2026 Winter Conference on Applications of Computer Vision (WACV) Workshops

点击查看摘要

Abstract:Vision-language foundation models have emerged as powerful general-purpose representation learners with strong potential for multimodal understanding, but their deterministic embeddings often fail to provide the reliability required for high-stakes biomedical applications. This work introduces MedProbCLIP, a probabilistic vision-language learning framework for chest X-ray and radiology report representation learning and bidirectional retrieval. MedProbCLIP models image and text representations as Gaussian embeddings through a probabilistic contrastive objective that explicitly captures uncertainty and many-to-many correspondences between radiographs and clinical narratives. A variational information bottleneck mitigates overconfident predictions, while MedProbCLIP employs multi-view radiograph encoding and multi-section report encoding during training to provide fine-grained supervision for clinically aligned correspondence, yet requires only a single radiograph and a single report at inference. Evaluated on the MIMIC-CXR dataset, MedProbCLIP outperforms deterministic and probabilistic baselines, including CLIP, CXR-CLIP, and PCME++, in both retrieval and zero-shot classification. Beyond accuracy, MedProbCLIP demonstrates superior calibration, risk-coverage behavior, selective retrieval reliability, and robustness to clinically relevant corruptions, underscoring the value of probabilistic vision-language modeling for improving the trustworthiness and safety of radiology image-text retrieval systems.

41. 【2602.16006】BTReport: A Framework for Brain Tumor Radiology Report Generation with Clinically Relevant Features

链接https://arxiv.org/abs/2602.16006

作者:Juampablo E. Heras Rivera,Dickson T. Chen,Tianyi Ren,Daniel K. Low,Asma Ben Abacha,Alberto Santamaria-Pang,Mehmet Kurt

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:open paired image-report, Recent advances, large paired image-text, paired image-text datasets, paired image-report datasets

备注: Accepted to Medical Imaging with Deep Learning (MIDL) 2026

点击查看摘要

Abstract:Recent advances in radiology report generation (RRG) have been driven by large paired image-text datasets; however, progress in neuro-oncology has been limited due to a lack of open paired image-report datasets. Here, we introduce BTReport, an open-source framework for brain tumor RRG that constructs natural language radiology reports using deterministically extracted imaging features. Unlike existing approaches that rely on large general-purpose or fine-tuned vision-language models for both image interpretation and report composition, BTReport performs deterministic feature extraction for image analysis and uses large language models only for syntactic structuring and narrative formatting. By separating RRG into a deterministic feature extraction step and a report generation step, the generated reports are completely interpretable and less prone to hallucinations. We show that the features used for report generation are predictive of key clinical outcomes, including survival and IDH mutation status, and reports generated by BTReport are more closely aligned with reference clinical reports than existing baselines for RRG. Finally, we introduce BTReport-BraTS, a companion dataset that augments BraTS imaging with synthetically generated radiology reports produced with BTReport. Code for this project can be found at this https URL.

42. 【2602.15989】SAM 3D Body: Robust Full-Body Human Mesh Recovery

链接https://arxiv.org/abs/2602.15989

作者:Xitong Yang,Devansh Kukreja,Don Pinkus,Anushka Sagar,Taosha Fan,Jinhyung Park,Soyong Shin,Jinkun Cao,Jiawei Liu,Nicolas Ugrinovic,Matt Feiszli,Jitendra Malik,Piotr Dollar,Kris Kitani

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:human mesh recovery, Momentum Human Rig, single-image full-body, accuracy in diverse, consistent accuracy

备注: Code: [this https URL](https://github.com/facebookresearch/sam-3d-body)

点击查看摘要

Abstract:We introduce SAM 3D Body (3DB), a promptable model for single-image full-body 3D human mesh recovery (HMR) that demonstrates state-of-the-art performance, with strong generalization and consistent accuracy in diverse in-the-wild conditions. 3DB estimates the human pose of the body, feet, and hands. It is the first model to use a new parametric mesh representation, Momentum Human Rig (MHR), which decouples skeletal structure and surface shape. 3DB employs an encoder-decoder architecture and supports auxiliary prompts, including 2D keypoints and masks, enabling user-guided inference similar to the SAM family of models. We derive high-quality annotations from a multi-stage annotation pipeline that uses various combinations of manual keypoint annotation, differentiable optimization, multi-view geometry, and dense keypoint detection. Our data engine efficiently selects and processes data to ensure data diversity, collecting unusual poses and rare imaging conditions. We present a new evaluation dataset organized by pose and appearance categories, enabling nuanced analysis of model behavior. Our experiments demonstrate superior generalization and substantial improvements over prior methods in both qualitative user preference studies and traditional quantitative analysis. Both 3DB and MHR are open-source.

43. 【2602.15973】LAND: A Longitudinal Analysis of Neuromorphic Datasets

链接https://arxiv.org/abs/2602.15973

作者:Gregory Cohen,Alexandre Marcireau

类目:Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)

关键词:neuromorphic datasets, datasets, neuromorphic datasets published, Neuromorphic, existing neuromorphic datasets

备注: The LAND dataset tool can be accessed via [this https URL](https://neuromorphicsystems.github.io/land/)

点击查看摘要

Abstract:Neuromorphic engineering has a data problem. Despite the meteoric rise in the number of neuromorphic datasets published over the past ten years, the conclusion of a significant portion of neuromorphic research papers still states that there is a need for yet more data and even larger datasets. Whilst this need is driven in part by the sheer volume of data required by modern deep learning approaches, it is also fuelled by the current state of the available neuromorphic datasets and the difficulties in finding them, understanding their purpose, and determining the nature of their underlying task. This is further compounded by practical difficulties in downloading and using these datasets. This review starts by capturing a snapshot of the existing neuromorphic datasets, covering over 423 datasets, and then explores the nature of their tasks and the underlying structure of the presented data. Analysing these datasets shows the difficulties arising from their size, the lack of standardisation, and difficulties in accessing the actual data. This paper also highlights the growth in the size of individual datasets and the complexities involved in working with the data. However, a more important concern is the rise of synthetic datasets, created by either simulation or video-to-events methods. This review explores the benefits of simulated data for testing existing algorithms and applications, highlighting the potential pitfalls for exploring new applications of neuromorphic technologies. This review also introduces the concepts of meta-datasets, created from existing datasets, as a way of both reducing the need for more data, and to remove potential bias arising from defining both the dataset and the task.

44. 【2602.15971】B-DENSE: Branching For Dense Ensemble Network Learning

链接https://arxiv.org/abs/2602.15971

作者:Cherish Puniani,Tushar Kumar,Arnav Bendre,Gaurav Kumar,Shree Singhi

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)

关键词:Inspired by non-equilibrium, non-equilibrium thermodynamics, performance in generative, generative modeling, Inspired

备注: 11 pages, 5 figures, 4 algorithms and 2 tables. Submitted to iclr 2026 delta workshop and still under review

点击查看摘要

Abstract:Inspired by non-equilibrium thermodynamics, diffusion models have achieved state-of-the-art performance in generative modeling. However, their iterative sampling nature results in high inference latency. While recent distillation techniques accelerate sampling, they discard intermediate trajectory steps. This sparse supervision leads to a loss of structural information and introduces significant discretization errors. To mitigate this, we propose B-DENSE, a novel framework that leverages multi-branch trajectory alignment. We modify the student architecture to output $K$-fold expanded channels, where each subset corresponds to a specific branch representing a discrete intermediate step in the teacher's trajectory. By training these branches to simultaneously map to the entire sequence of the teacher's target timesteps, we enforce dense intermediate trajectory alignment. Consequently, the student model learns to navigate the solution space from the earliest stages of training, demonstrating superior image generation quality compared to baseline distillation frameworks.

45. 【2602.15967】Non-Contact Physiological Monitoring in Pediatric Intensive Care Units via Adaptive Masking and Self-Supervised Learning

链接https://arxiv.org/abs/2602.15967

作者:Mohamed Khalil Ben Salah,Philippe Jouvet,Rita Noumeir

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Intensive Care Units, Pediatric Intensive Care, Care Units, Intensive Care, Continuous monitoring

备注

点击查看摘要

Abstract:Continuous monitoring of vital signs in Pediatric Intensive Care Units (PICUs) is essential for early detection of clinical deterioration and effective clinical decision-making. However, contact-based sensors such as pulse oximeters may cause skin irritation, increase infection risk, and lead to patient discomfort. Remote photoplethysmography (rPPG) offers a contactless alternative to monitor heart rate using facial video, but remains underutilized in PICUs due to motion artifacts, occlusions, variable lighting, and domain shifts between laboratory and clinical data. We introduce a self-supervised pretraining framework for rPPG estimation in the PICU setting, based on a progressive curriculum strategy. The approach leverages the VisionMamba architecture and integrates an adaptive masking mechanism, where a lightweight Mamba-based controller assigns spatiotemporal importance scores to guide probabilistic patch sampling. This strategy dynamically increases reconstruction difficulty while preserving physiological relevance. To address the lack of labeled clinical data, we adopt a teacher-student distillation setup. A supervised expert model, trained on public datasets, provides latent physiological guidance to the student. The curriculum progresses through three stages: clean public videos, synthetic occlusion scenarios, and unlabeled videos from 500 pediatric patients. Our framework achieves a 42% reduction in mean absolute error relative to standard masked autoencoders and outperforms PhysFormer by 31%, reaching a final MAE of 3.2 bpm. Without explicit region-of-interest extraction, the model consistently attends to pulse-rich areas and demonstrates robustness under clinical occlusions and noise.

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2602.15967 [cs.CV]

(or
arXiv:2602.15967v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2602.15967

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Mohamed Khalil Ben Salah [view email] [v1]
Tue, 17 Feb 2026 19:34:50 UTC (8,263 KB)

46. 【2602.15962】Automated Re-Identification of Holstein-Friesian Cattle in Dense Crowds

链接https://arxiv.org/abs/2602.15962

作者:Phoenix Yu,Tilo Burghardt,Andrew W Dowsey,Neill W Campbell

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:methods capture individuals, methods capture, spatially separate, capture individuals, targets are spatially

备注: 32 pages, 13 figures, 5 tables

点击查看摘要

Abstract:Holstein-Friesian detection and re-identification (Re-ID) methods capture individuals well when targets are spatially separate. However, existing approaches, including YOLO-based species detection, break down when cows group closely together. This is particularly prevalent for species which have outline-breaking coat patterns. To boost both effectiveness and transferability in this setting, we propose a new detect-segment-identify pipeline that leverages the Open-Vocabulary Weight-free Localisation and the Segment Anything models as pre-processing stages alongside Re-ID networks. To evaluate our approach, we publish a collection of nine days CCTV data filmed on a working dairy farm. Our methodology overcomes detection breakdown in dense animal groupings, resulting in a 98.93% accuracy. This significantly outperforms current oriented bounding box-driven, as well as SAM species detection baselines with accuracy improvements of 47.52% and 27.13%, respectively. We show that unsupervised contrastive learning can build on this to yield 94.82% Re-ID accuracy on our test data. Our work demonstrates that Re-ID in crowded scenarios is both practical as well as reliable in working farm settings with no manual intervention. Code and dataset are provided for reproducibility.

47. 【2602.15959】Position-Aware Scene-Appearance Disentanglement for Bidirectional Photoacoustic Microscopy Registration

链接https://arxiv.org/abs/2602.15959

作者:Yiwen Wang,Jiahao Qin

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:High-speed optical-resolution photoacoustic, optical-resolution photoacoustic microscopy, backward scan lines, bidirectional raster scanning, raster scanning doubles

备注: 10 pages, 5 figures

点击查看摘要

Abstract:High-speed optical-resolution photoacoustic microscopy (OR-PAM) with bidirectional raster scanning doubles imaging speed but introduces coupled domain shift and geometric misalignment between forward and backward scan lines. Existing registration methods, constrained by brightness constancy assumptions, achieve limited alignment quality, while recent generative approaches address domain shift through complex architectures that lack temporal awareness across frames. We propose GPEReg-Net, a scene-appearance disentanglement framework that separates domain-invariant scene features from domain-specific appearance codes via Adaptive Instance Normalization (AdaIN), enabling direct image-to-image registration without explicit deformation field estimation. To exploit temporal structure in sequential acquisitions, we introduce a Global Position Encoding (GPE) module that combines learnable position embeddings with sinusoidal encoding and cross-frame attention, allowing the network to leverage context from neighboring frames for improved temporal coherence. On the OR-PAM-Reg-4K benchmark (432 test samples), GPEReg-Net achieves NCC of 0.953, SSIM of 0.932, and PSNR of 34.49dB, surpassing the state-of-the-art by 3.8% in SSIM and 1.99dB in PSNR while maintaining competitive NCC. Code is available at this https URL.

48. 【2602.15958】DocSplit: A Comprehensive Benchmark Dataset and Evaluation Approach for Document Packet Recognition and Splitting

链接https://arxiv.org/abs/2602.15958

作者:Md Mofijul Islam,Md Sirajus Salekin,Nivedha Balakrishnan,Vincil C. Bishop III,Niharika Jain,Spencer Romo,Bob Strahan,Boyi Xie,Diego A. Socolinsky

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:multiple documents stitched, document packet, multi-page document packets, document packet splitting, Document

备注

点击查看摘要

Abstract:Document understanding in real-world applications often requires processing heterogeneous, multi-page document packets containing multiple documents stitched together. Despite recent advances in visual document understanding, the fundamental task of document packet splitting, which involves separating a document packet into individual units, remains largely unaddressed. We present the first comprehensive benchmark dataset, DocSplit, along with novel evaluation metrics for assessing the document packet splitting capabilities of large language models. DocSplit comprises five datasets of varying complexity, covering diverse document types, layouts, and multimodal settings. We formalize the DocSplit task, which requires models to identify document boundaries, classify document types, and maintain correct page ordering within a document packet. The benchmark addresses real-world challenges, including out-of-order pages, interleaved documents, and documents lacking clear demarcations. We conduct extensive experiments evaluating multimodal LLMs on our datasets, revealing significant performance gaps in current models' ability to handle complex document splitting tasks. The DocSplit benchmark datasets and proposed novel evaluation metrics provide a systematic framework for advancing document understanding capabilities essential for legal, financial, healthcare, and other document-intensive domains. We release the datasets to facilitate future research in document packet processing.

49. 【2602.15950】Can Vision-Language Models See Squares? Text-Recognition Mediates Spatial Reasoning Across Three Model Families

链接https://arxiv.org/abs/2602.15950

作者:Yuval Levental

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:lack textual identity, accurately localize filled, cells lack textual, localize filled cells, textual identity

备注: 9 pages, 3 figures, 2 tables. Workshop-length paper

点击查看摘要

Abstract:We present a simple experiment that exposes a fundamental limitation in vision-language models (VLMs): the inability to accurately localize filled cells in binary grids when those cells lack textual identity. We generate fifteen 15x15 grids with varying density (10.7%-41.8% filled cells) and render each as two image types -- text symbols (. and #) and filled squares without gridlines -- then ask three frontier VLMs (Claude Opus, ChatGPT 5.2, and Gemini 3 Thinking) to transcribe them. In the text-symbol condition, Claude and ChatGPT achieve approximately 91% cell accuracy and 84% F1, while Gemini achieves 84% accuracy and 63% F1. In the filled-squares condition, all three models collapse to 60-73% accuracy and 29-39% F1. Critically, all conditions pass through the same visual encoder -- the text symbols are images, not tokenized text. The text-vs-squares F1 gap ranges from 34 to 54 points across models, demonstrating that VLMs behave as if they possess a high-fidelity text-recognition pathway for spatial reasoning that dramatically outperforms their native visual pathway. Each model exhibits a distinct failure mode in the squares condition -- systematic under-counting (Claude), massive over-counting (ChatGPT), and template hallucination (Gemini) -- but all share the same underlying deficit: severely degraded spatial localization for non-textual visual elements.

50. 【2602.15927】Visual Memory Injection Attacks for Multi-Turn Conversations

链接https://arxiv.org/abs/2602.15927

作者:Christian Schlarmann,Matthias Hein

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Generative large vision-language, large vision-language models, impressive performance gains, recently achieved impressive, achieved impressive performance

备注

点击查看摘要

Abstract:Generative large vision-language models (LVLMs) have recently achieved impressive performance gains, and their user base is growing rapidly. However, the security of LVLMs, in particular in a long-context multi-turn setting, is largely underexplored. In this paper, we consider the realistic scenario in which an attacker uploads a manipulated image to the web/social media. A benign user downloads this image and uses it as input to the LVLM. Our novel stealthy Visual Memory Injection (VMI) attack is designed such that on normal prompts the LVLM exhibits nominal behavior, but once the user gives a triggering prompt, the LVLM outputs a specific prescribed target message to manipulate the user, e.g. for adversarial marketing or political persuasion. Compared to previous work that focused on single-turn attacks, VMI is effective even after a long multi-turn conversation with the user. We demonstrate our attack on several recent open-weight LVLMs. This article thereby shows that large-scale manipulation of users is feasible with perturbed images in multi-turn conversation settings, calling for better robustness of LVLMs against these attacks. We release the source code at this https URL

51. 【2602.15926】A Study on Real-time Object Detection using Deep Learning

链接https://arxiv.org/abs/2602.15926

作者:Ankita Bose,Jayasravani Bhumireddy,Naveen N

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Augmented Reality, Virtual Reality, including human-computer interfaces, industrial automation healthcare, road traffic monitoring

备注: 34 pages, 18 figures

点击查看摘要

Abstract:Object detection has compelling applications over a range of domains, including human-computer interfaces, security and video surveillance, navigation and road traffic monitoring, transportation systems, industrial automation healthcare, the world of Augmented Reality (AR) and Virtual Reality (VR), environment monitoring and activity identification. Applications of real time object detection in all these areas provide dynamic analysis of the visual information that helps in immediate decision making. Furthermore, advanced deep learning algorithms leverage the progress in the field of object detection providing more accurate and efficient solutions. There are some outstanding deep learning algorithms for object detection which includes, Faster R CNN(Region-based Convolutional Neural Network),Mask R-CNN, Cascade R-CNN, YOLO (You Only Look Once), SSD (Single Shot Multibox Detector), RetinaNet etc. This article goes into great detail on how deep learning algorithms are used to enhance real time object recognition. It provides information on the different object detection models available, open benchmark datasets, and studies on the use of object detection models in a range of applications. Additionally, controlled studies are provided to compare various strategies and produce some illuminating findings. Last but not least, a number of encouraging challenges and approaches are offered as suggestions for further investigation in both relevant deep learning approaches and object recognition.

52. 【2602.15922】World Action Models are Zero-shot Policies

链接https://arxiv.org/abs/2602.15922

作者:Seonghyeon Ye,Yunhao Ge,Kaiyuan Zheng,Shenyuan Gao,Sihyun Yu,George Kurian,Suneel Indupuru,You Liang Tan,Chuning Zhu,Jiannan Xiang,Ayaan Malik,Kyungmin Lee,William Liang,Nadun Ranawaka,Jiasheng Gu,Yinzhen Xu,Guanzhi Wang,Fengyuan Hu,Avnish Narayan,Johan Bjorck,Jing Wang,Gwanghyun Kim,Dantong Niu,Ruijie Zheng,Yuqi Xie,Jimmy Wu,Qi Wang,Ryan Julian,Danfei Xu,Yilun Du,Yevgen Chebotar,Scott Reed,Jan Kautz,Yuke Zhu,Linxi "Jim" Fan,Joel Jang

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:excel at semantic, struggle to generalize, World Action Model, models excel, unseen physical motions

备注: Project page: [this https URL](https://dreamzero0.github.io/)

点击查看摘要

Abstract:State-of-the-art Vision-Language-Action (VLA) models excel at semantic generalization but struggle to generalize to unseen physical motions in novel environments. We introduce DreamZero, a World Action Model (WAM) built upon a pretrained video diffusion backbone. Unlike VLAs, WAMs learn physical dynamics by predicting future world states and actions, using video as a dense representation of how the world evolves. By jointly modeling video and action, DreamZero learns diverse skills effectively from heterogeneous robot data without relying on repetitive demonstrations. This results in over 2x improvement in generalization to new tasks and environments compared to state-of-the-art VLAs in real robot experiments. Crucially, through model and system optimizations, we enable a 14B autoregressive video diffusion model to perform real-time closed-loop control at 7Hz. Finally, we demonstrate two forms of cross-embodiment transfer: video-only demonstrations from other robots or humans yield a relative improvement of over 42% on unseen task performance with just 10-20 minutes of data. More surprisingly, DreamZero enables few-shot embodiment adaptation, transferring to a new embodiment with only 30 minutes of play data while retaining zero-shot generalization.

53. 【2602.15918】EarthSpatialBench: Benchmarking Spatial Reasoning Capabilities of Multimodal LLMs on Earth Imagery

链接https://arxiv.org/abs/2602.15918

作者:Zelin Xu,Yupu Zhang,Saugat Adhikari,Saiful Islam,Tingsong Xiao,Zibo Liu,Shigang Chen,Da Yan,Zhe Jiang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:multimodal large language, attracted growing interest, computer vision due, require precise interaction, Benchmarking spatial reasoning

备注

点击查看摘要

Abstract:Benchmarking spatial reasoning in multimodal large language models (MLLMs) has attracted growing interest in computer vision due to its importance for embodied AI and other agentic systems that require precise interaction with the physical world. However, spatial reasoning on Earth imagery has lagged behind, as it uniquely involves grounding objects in georeferenced images and quantitatively reasoning about distances, directions, and topological relations using both visual cues and vector geometry coordinates (e.g., 2D bounding boxes, polylines, and polygons). Existing benchmarks for Earth imagery primarily focus on 2D spatial grounding, image captioning, and coarse spatial relations (e.g., simple directional or proximity cues). They lack support for quantitative direction and distance reasoning, systematic topological relations, and complex object geometries beyond bounding boxes. To fill this gap, we propose \textbf{EarthSpatialBench}, a comprehensive benchmark for evaluating spatial reasoning in MLLMs on Earth imagery. The benchmark contains over 325K question-answer pairs spanning: (1) qualitative and quantitative reasoning about spatial distance and direction; (2) systematic topological relations; (3) single-object queries, object-pair queries, and compositional aggregate group queries; and (4) object references expressed via textual descriptions, visual overlays, and explicit geometry coordinates, including 2D bounding boxes, polylines, and polygons. We conducted extensive experiments on both open-source and proprietary models to identify limitations in the spatial reasoning of MLLMs.

54. 【2602.15915】MaS-VQA: A Mask-and-Select Framework for Knowledge-Based Visual Question Answering

链接https://arxiv.org/abs/2602.15915

作者:Xianwei Mao,Kai Ye,Sheng Zhou,Nan Zhang,Haikuan Huang,Bin Li,Jiajun Bu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Knowledge-based Visual Question, Visual Question Answering, Question Answering, integrating visual information, Knowledge-based Visual

备注

点击查看摘要

Abstract:Knowledge-based Visual Question Answering (KB-VQA) requires models to answer questions by integrating visual information with external knowledge. However, retrieved knowledge is often noisy, partially irrelevant, or misaligned with the visual content, while internal model knowledge is difficult to control and interpret. Naive aggregation of these sources limits reasoning effectiveness and reduces answer accuracy. To address this, we propose MaS-VQA, a selection-driven framework that tightly couples explicit knowledge filtering with implicit knowledge reasoning. MaS-VQA first retrieves candidate passages and applies a Mask-and-Select mechanism to jointly prune irrelevant image regions and weakly relevant knowledge fragments, producing compact, high-signal multimodal knowledge . This filtered knowledge then guides the activation of internal knowledge in a constrained semantic space, enabling complementary co-modeling of explicit and implicit knowledge for robust answer prediction. Experiments on Encyclopedic-VQA and InfoSeek demonstrate consistent performance gains across multiple MLLM backbones, and ablations verify that the selection mechanism effectively reduces noise and enhances knowledge utilization.

55. 【2602.15904】A Comprehensive Survey on Deep Learning-Based LiDAR Super-Resolution for Autonomous Driving

链接https://arxiv.org/abs/2602.15904

作者:June Moh Goo,Zichao Zeng,Jan Boehm

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:miss critical details, sparse point clouds, produce sparse point, high-resolution sensors remain, sensors remain expensive

备注: Accepted to The IEEE Intelligent Vehicles Symposium 2026 (IEEE IV 2026)

点击查看摘要

Abstract:LiDAR sensors are often considered essential for autonomous driving, but high-resolution sensors remain expensive while affordable low-resolution sensors produce sparse point clouds that miss critical details. LiDAR super-resolution addresses this challenge by using deep learning to enhance sparse point clouds, bridging the gap between different sensor types and enabling cross-sensor compatibility in real-world deployments. This paper presents the first comprehensive survey of LiDAR super-resolution methods for autonomous driving. Despite the importance of practical deployment, no systematic review has been conducted until now. We organize existing approaches into four categories: CNN-based architectures, model-based deep unrolling, implicit representation methods, and Transformer and Mamba-based approaches. We establish fundamental concepts including data representations, problem formulation, benchmark datasets and evaluation metrics. Current trends include the adoption of range image representation for efficient processing, extreme model compression and the development of resolution-flexible architectures. Recent research prioritizes real-time inference and cross-sensor generalization for practical deployment. We conclude by identifying open challenges and future research directions for advancing LiDAR super-resolution technology.

56. 【2602.15903】Detecting Deepfakes with Multivariate Soft Blending and CLIP-based Image-Text Alignment

链接https://arxiv.org/abs/2602.15903

作者:Jingwei Li,Jiaxin Tong,Pengfei Wu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:highly realistic facial, Soft Blending Augmentation, Forgery Intensity Estimation, realistic facial forgeries, facial forgeries necessitates

备注

点击查看摘要

Abstract:The proliferation of highly realistic facial forgeries necessitates robust detection methods. However, existing approaches often suffer from limited accuracy and poor generalization due to significant distribution shifts among samples generated by diverse forgery techniques. To address these challenges, we propose a novel Multivariate and Soft Blending Augmentation with CLIP-guided Forgery Intensity Estimation (MSBA-CLIP) framework. Our method leverages the multimodal alignment capabilities of CLIP to capture subtle forgery traces. We introduce a Multivariate and Soft Blending Augmentation (MSBA) strategy that synthesizes images by blending forgeries from multiple methods with random weights, forcing the model to learn generalizable patterns. Furthermore, a dedicated Multivariate Forgery Intensity Estimation (MFIE) module is designed to explicitly guide the model in learning features related to varied forgery modes and intensities. Extensive experiments demonstrate state-of-the-art performance. On in-domain tests, our method improves Accuracy and AUC by 3.32\% and 4.02\%, respectively, over the best baseline. In cross-domain evaluations across five datasets, it achieves an average AUC gain of 3.27\%. Ablation studies confirm the efficacy of both proposed components. While the reliance on a large vision-language model entails higher computational cost, our work presents a significant step towards more generalizable and robust deepfake detection.

57. 【2602.15900】Adaptive Illumination Control for Robot Perception

链接https://arxiv.org/abs/2602.15900

作者:Yash Turkar,Shekoufeh Sadeghi,Karthik Dantu

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:robust feature extraction, high dynamic range, improved downstream, feature extraction, perception under low

备注

点击查看摘要

Abstract:Robot perception under low light or high dynamic range is usually improved downstream - via more robust feature extraction, image enhancement, or closed-loop exposure control. However, all of these approaches are limited by the image captured these conditions. An alternate approach is to utilize a programmable onboard light that adds to ambient illumination and improves captured images. However, it is not straightforward to predict its impact on image formation. Illumination interacts nonlinearly with depth, surface reflectance, and scene geometry. It can both reveal structure and induce failure modes such as specular highlights and saturation. We introduce Lightning, a closed-loop illumination-control framework for visual SLAM that combines relighting, offline optimization, and imitation learning. This is performed in three stages. First, we train a Co-Located Illumination Decomposition (CLID) relighting model that decomposes a robot observation into an ambient component and a light-contribution field. CLID enables physically consistent synthesis of the same scene under alternative light intensities and thereby creates dense multi-intensity training data without requiring us to repeatedly re-run trajectories. Second, using these synthesized candidates, we formulate an offline Optimal Intensity Schedule (OIS) problem that selects illumination levels over a sequence trading off SLAM-relevant image utility against power consumption and temporal smoothness. Third, we distill this ideal solution into a real-time controller through behavior cloning, producing an Illumination Control Policy (ILC) that generalizes beyond the initial training distribution and runs online on a mobile robot to command discrete light-intensity levels. Across our evaluation, Lightning substantially improves SLAM trajectory robustness while reducing unnecessary illumination power.

58. 【2602.15892】Egocentric Bias in Vision-Language Models

链接https://arxiv.org/abs/2602.15892

作者:Maijunxian Wang,Yijiang Li,Bingyang Wang,Tianwei Zhao,Ran Ji,Qingying Gao,Emmy Liu,Hokin Deng,Dezhi Luo

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Visual perspective taking, Visual perspective, perspective taking, Visual, social cognition

备注

点击查看摘要

Abstract:Visual perspective taking--inferring how the world appears from another's viewpoint--is foundational to social cognition. We introduce FlipSet, a diagnostic benchmark for Level-2 visual perspective taking (L2 VPT) in vision-language models. The task requires simulating 180-degree rotations of 2D character strings from another agent's perspective, isolating spatial transformation from 3D scene complexity. Evaluating 103 VLMs reveals systematic egocentric bias: the vast majority perform below chance, with roughly three-quarters of errors reproducing the camera viewpoint. Control experiments expose a compositional deficit--models achieve high theory-of-mind accuracy and above-chance mental rotation in isolation, yet fail catastrophically when integration is required. This dissociation indicates that current VLMs lack the mechanisms needed to bind social awareness to spatial operations, suggesting fundamental limitations in model-based spatial reasoning. FlipSet provides a cognitively grounded testbed for diagnosing perspective-taking capabilities in multimodal systems.

59. 【2602.15872】MARVL: Multi-Stage Guidance for Robotic Manipulation via Vision-Language Models

链接https://arxiv.org/abs/2602.15872

作者:Xunlan Zhou,Xuanlin Chen,Shaowei Zhang,Xiangkun Li,ShengHua Wan,Xiaohai Hu,Yuan Lei,Le Gan,De-chuan Zhan

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Designing dense reward, robotic Reinforcement Learning, Reinforcement Learning, efficient robotic Reinforcement, Designing dense

备注

点击查看摘要

Abstract:Designing dense reward functions is pivotal for efficient robotic Reinforcement Learning (RL). However, most dense rewards rely on manual engineering, which fundamentally limits the scalability and automation of reinforcement learning. While Vision-Language Models (VLMs) offer a promising path to reward design, naive VLM rewards often misalign with task progress, struggle with spatial grounding, and show limited understanding of task semantics. To address these issues, we propose MARVL-Multi-stAge guidance for Robotic manipulation via Vision-Language models. MARVL fine-tunes a VLM for spatial and semantic consistency and decomposes tasks into multi-stage subtasks with task direction projection for trajectory sensitivity. Empirically, MARVL significantly outperforms existing VLM-reward methods on the Meta-World benchmark, demonstrating superior sample efficiency and robustness on sparse-reward manipulation tasks.

60. 【2602.15864】ReasonNavi: Human-Inspired Global Map Reasoning for Zero-Shot Embodied Navigation

链接https://arxiv.org/abs/2602.15864

作者:Yuzhuo Ao,Anbang Wang,Yu-Wing Tai,Chi-Keung Tang

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:partial egocentric observations, restrict global foresight, egocentric observations, inefficient exploration, Multimodal Large Language

备注: 18 pages, 6 figures, Project page: [this https URL](https://reasonnavi.github.io/)

点击查看摘要

Abstract:Embodied agents often struggle with efficient navigation because they rely primarily on partial egocentric observations, which restrict global foresight and lead to inefficient exploration. In contrast, humans plan using maps: we reason globally first, then act locally. We introduce ReasonNavi, a human-inspired framework that operationalizes this reason-then-act paradigm by coupling Multimodal Large Language Models (MLLMs) with deterministic planners. ReasonNavi converts a top-down map into a discrete reasoning space by room segmentation and candidate target nodes sampling. An MLLM is then queried in a multi-stage process to identify the candidate most consistent with the instruction (object, image, or text goal), effectively leveraging the model's semantic reasoning ability while sidestepping its weakness in continuous coordinate prediction. The selected waypoint is grounded into executable trajectories using a deterministic action planner over an online-built occupancy map, while pretrained object detectors and segmenters ensure robust recognition at the goal. This yields a unified zero-shot navigation framework that requires no MLLM fine-tuning, circumvents the brittleness of RL-based policies and scales naturally with foundation model improvements. Across three navigation tasks, ReasonNavi consistently outperforms prior methods that demand extensive training or heavy scene modeling, offering a scalable, interpretable, and globally grounded solution to embodied navigation. Project page: this https URL

61. 【2602.16422】Automated Histopathology Report Generation via Pyramidal Feature Extraction and the UNI Foundation Model

链接https://arxiv.org/abs/2602.16422

作者:Ahmet Halici,Ece Tugba Cebeci,Musa Balci,Mustafa Cini,Serkan Sokmen

类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:domain specific language, Generating diagnostic text, Generating diagnostic, slide images, requirement for precise

备注: 9 pages. Equal contribution: Ahmet Halici, Ece Tugba Cebeci, Musa Balci

点击查看摘要

Abstract:Generating diagnostic text from histopathology whole slide images (WSIs) is challenging due to the gigapixel scale of the input and the requirement for precise, domain specific language. We propose a hierarchical vision language framework that combines a frozen pathology foundation model with a Transformer decoder for report generation. To make WSI processing tractable, we perform multi resolution pyramidal patch selection (downsampling factors 2^3 to 2^6) and remove background and artifacts using Laplacian variance and HSV based criteria. Patch features are extracted with the UNI Vision Transformer and projected to a 6 layer Transformer decoder that generates diagnostic text via cross attention. To better represent biomedical terminology, we tokenize the output using BioGPT. Finally, we add a retrieval based verification step that compares generated reports with a reference corpus using Sentence BERT embeddings; if a high similarity match is found, the generated report is replaced with the retrieved ground truth reference to improve reliability.

62. 【2602.16320】RefineFormer3D: Efficient 3D Medical Image Segmentation via Adaptive Multi-Scale Transformer with Cross Attention Fusion

链接https://arxiv.org/abs/2602.16320

作者:Kavyansh Tyagi,Vishwas Rathi,Puneet Goyal

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Accurate and computationally, remains a critical, critical challenge, image segmentation remains, medical image segmentation

备注: 13 pages, 5 figures, 7 tables

点击查看摘要

Abstract:Accurate and computationally efficient 3D medical image segmentation remains a critical challenge in clinical workflows. Transformer-based architectures often demonstrate superior global contextual modeling but at the expense of excessive parameter counts and memory demands, restricting their clinical deployment. We propose RefineFormer3D, a lightweight hierarchical transformer architecture that balances segmentation accuracy and computational efficiency for volumetric medical imaging. The architecture integrates three key components: (i) GhostConv3D-based patch embedding for efficient feature extraction with minimal redundancy, (ii) MixFFN3D module with low-rank projections and depthwise convolutions for parameter-efficient feature extraction, and (iii) a cross-attention fusion decoder enabling adaptive multi-scale skip connection integration. RefineFormer3D contains only 2.94M parameters, substantially fewer than contemporary transformer-based methods. Extensive experiments on ACDC and BraTS benchmarks demonstrate that RefineFormer3D achieves 93.44\% and 85.9\% average Dice scores respectively, outperforming or matching state-of-the-art methods while requiring significantly fewer parameters. Furthermore, the model achieves fast inference (8.35 ms per volume on GPU) with low memory requirements, supporting deployment in resource-constrained clinical environments. These results establish RefineFormer3D as an effective and scalable solution for practical 3D medical image segmentation.

63. 【2602.15988】Automated Assessment of Kidney Ureteroscopy Exploration for Training

链接https://arxiv.org/abs/2602.15988

作者:Fangjie Li,Nicholas Kavoussi,Charan Mohan,Matthieu Chabanas,Jie Ying Wu

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

关键词:steep learning curve, Kidney ureteroscopic navigation, learning curve, ureteroscopic navigation, navigation is challenging

备注

点击查看摘要

Abstract:Purpose: Kidney ureteroscopic navigation is challenging with a steep learning curve. However, current clinical training has major deficiencies, as it requires one-on-one feedback from experts and occurs in the operating room (OR). Therefore, there is a need for a phantom training system with automated feedback to greatly \revision{expand} training opportunities. Methods: We propose a novel, purely ureteroscope video-based scope localization framework that automatically identifies calyces missed by the trainee in a phantom kidney exploration. We use a slow, thorough, prior exploration video of the kidney to generate a reference reconstruction. Then, this reference reconstruction can be used to localize any exploration video of the same phantom. Results: In 15 exploration videos, a total of 69 out of 74 calyces were correctly classified. We achieve 4mm camera pose localization error. Given the reference reconstruction, the system takes 10 minutes to generate the results for a typical exploration (1-2 minute long). Conclusion: We demonstrate a novel camera localization framework that can provide accurate and automatic feedback for kidney phantom explorations. We show its ability as a valid tool that enables out-of-OR training without requiring supervision from an expert.

Subjects:

Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

Cite as:
arXiv:2602.15988 [eess.IV]

(or
arXiv:2602.15988v1 [eess.IV] for this version)

https://doi.org/10.48550/arXiv.2602.15988

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Fangjie Li [view email] [v1]
Tue, 17 Feb 2026 20:25:04 UTC (1,823 KB)

64. 【2602.15917】ROIX-Comp: Optimizing X-ray Computed Tomography Imaging Strategy for Data Reduction and Reconstruction

链接https://arxiv.org/abs/2602.15917

作者:Amarjit Singh,Kento Sato,Kohei Yoshida,Kentaro Uesugi,Yasumasa Joti,Takaki Hatsui,Andrès Rubio Proaño

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT)

关键词:synchrotron radiation facilities, X-ray Computed Tomography, large-scale X-ray Computed, X-ray images, high-performance computing

备注: 11 pages, SCA/HPCAsia2026

点击查看摘要

Abstract:In high-performance computing (HPC) environments, particularly in synchrotron radiation facilities, vast amounts of X-ray images are generated. Processing large-scale X-ray Computed Tomography (X-CT) datasets presents significant computational and storage challenges due to their high dimensionality and data volume. Traditional approaches often require extensive storage capacity and high transmission bandwidth, limiting real-time processing capabilities and workflow efficiency. To address these constraints, we introduce a region-of-interest (ROI)-driven extraction framework (ROIX-Comp) that intelligently compresses X-CT data by identifying and retaining only essential features. Our work reduces data volume while preserving critical information for downstream processing tasks. At pre-processing stage, we utilize error-bounded quantization to reduce the amount of data to be processed and therefore improve computational efficiencies. At the compression stage, our methodology combines object extraction with multiple state-of-the-art lossless and lossy compressors, resulting in significantly improved compression ratios. We evaluated this framework against seven X-CT datasets and observed a relative compression ratio improvement of 12.34x compared to the standard compression.

65. 【2602.15913】Foundation Models for Medical Imaging: Status, Challenges, and Directions

链接https://arxiv.org/abs/2602.15913

作者:Chuang Niu,Pengwei Wu,Bruno De Man,Ge Wang

类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:rapidly reshaping medical, reshaping medical imaging, Foundation models, general-purpose models, shifting the field

备注

点击查看摘要

Abstract:Foundation models (FMs) are rapidly reshaping medical imaging, shifting the field from narrowly trained, task-specific networks toward large, general-purpose models that can be adapted across modalities, anatomies, and clinical tasks. In this review, we synthesize the emerging landscape of medical imaging FMs along three major axes: principles of FM design, applications of FMs, and forward-looking challenges and opportunities. Taken together, this review provides a technically grounded, clinically aware, and future-facing roadmap for developing FMs that are not only powerful and versatile but also trustworthy and ready for responsible translation into clinical practice.

66. 【2512.17322】Rotterdam artery-vein segmentation (RAV) dataset

链接https://arxiv.org/abs/2512.17322

作者:Jose Vargas Quiros,Bart Liefers,Karin van Garderen,Jeroen Vermeulen,Eyened Reading Center,Caroline Klaver

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:Toggle, Code Toggle Papers, Toggle Hugging Face, Bibliographic Explorer Toggle, Explorer Toggle Bibliographic

备注

点击查看摘要

Abstract:Purpose: To provide a diverse, high-quality dataset of color fundus images (CFIs) with detailed artery-vein (A/V) segmentation annotations, supporting the development and evaluation of machine learning algorithms for vascular analysis in ophthalmology. Methods: CFIs were sampled from the longitudinal Rotterdam Study (RS), encompassing a wide range of ages, devices, and capture conditions. Images were annotated using a custom interface that allowed graders to label arteries, veins, and unknown vessels on separate layers, starting from an initial vessel segmentation mask. Connectivity was explicitly verified and corrected using connected component visualization tools. Results: The dataset includes 1024x1024-pixel PNG images in three modalities: original RGB fundus images, contrast-enhanced versions, and RGB-encoded A/V masks. Image quality varied widely, including challenging samples typically excluded by automated quality assessment systems, but judged to contain valuable vascular information. Conclusion: This dataset offers a rich and heterogeneous source of CFIs with high-quality segmentations. It supports robust benchmarking and training of machine learning models under real-world variability in image quality and acquisition settings. Translational Relevance: By including connectivity-validated A/V masks and diverse image conditions, this dataset enables the development of clinically applicable, generalizable machine learning tools for retinal vascular analysis, potentially improving automated screening and diagnosis of systemic and ocular diseases.

Subjects:

Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2512.17322 [eess.IV]

(or
arXiv:2512.17322v2 [eess.IV] for this version)

https://doi.org/10.48550/arXiv.2512.17322

Focus to learn more

              arXiv-issued DOI via DataCite

Submission history From: Jose David Vargas Quiros [view email] [v1]
Fri, 19 Dec 2025 08:09:02 UTC (432 KB)
[v2]
Wed, 18 Feb 2026 15:06:58 UTC (384 KB)

Full-text links:
Access Paper:

View a PDF of the paper titled Rotterdam artery-vein segmentation (RAV) dataset, by Jose Vargas Quiros and 5 other authorsView PDF

view license

Current browse context: eess.IV

prev

|
next

new
|
recent
| 2025-12

Change to browse by:

cs
cs.CV
eess

References Citations

NASA ADSGoogle Scholar
Semantic Scholar

export BibTeX citation
Loading…

BibTeX formatted citation

loading…

Data provided by:

Bookmark

checked=“checked”>
Bibliographic Tools

Bibliographic and Citation Tools

Bibliographic Explorer Toggle

Bibliographic Explorer (What is the Explorer?)

Connected Papers Toggle

Connected Papers (What is Connected Papers?)

Litmaps Toggle

Litmaps (What is Litmaps?)

scite.ai Toggle

scite Smart Citations (What are Smart Citations?)

Code, Data, Media

Code, Data and Media Associated with this Article

alphaXiv Toggle

alphaXiv (What is alphaXiv?)

Links to Code Toggle

CatalyzeX Code Finder for Papers (What is CatalyzeX?)

DagsHub Toggle

DagsHub (What is DagsHub?)

GotitPub Toggle

Gotit.pub (What is GotitPub?)

Huggingface Toggle

Hugging Face (What is Huggingface?)

Links to Code Toggle

Papers with Code (What is Papers with Code?)

ScienceCast Toggle

ScienceCast (What is ScienceCast?)

Demos

Demos

Replicate Toggle

Replicate (What is Replicate?)

Spaces Toggle

Hugging Face Spaces (What is Spaces?)

Spaces Toggle

TXYZ.AI (What is TXYZ.AI?)

Related Papers

Recommenders and Search Tools

Link to Influence Flower

Influence Flower (What are Influence Flowers?)

Core recommender toggle

CORE Recommender (What is CORE?)

Author
Venue
Institution
Topic

    About arXivLabs

arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.

Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)

mathjaxToggle();

About
Help

contact arXivClick here to contact arXiv
Contact

subscribe to arXiv mailingsClick here to subscribe
Subscribe

Copyright
Privacy Policy

Web Accessibility Assistance

arXiv Operational Status