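This blog post presents the daily list of the latest papers fetched from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.
Statistics
670 papers were updated today, including:
- Natural language processing: 119
- Information retrieval: 22
- Computer vision: 98
Natural Language Processing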
1. 【2602.12276】Agentic Test-Time Scaling for WebAgents
Link: https://arxiv.org/abs/2602.12276
Authors: Nicholas Lee, Lutfi Eren Erdogan, Chris Joseph John, Surya Krishnapillai, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: neural network models, network models, boost reliability, reliability of neural, neural network
Comments:
Abstract:Test-time scaling has become a standard way to improve performance and boost reliability of neural network models. However, its behavior on agentic, multi-step tasks remains less well-understood: small per-step errors can compound over long horizons; and we find that naive policies that uniformly increase sampling show diminishing returns. In this work, we present CATTS, a simple technique for dynamically allocating compute for multi-step agents. We first conduct an empirical study of inference-time scaling for web agents. We find that uniformly increasing per-step compute quickly saturates in long-horizon environments. We then investigate stronger aggregation strategies, including an LLM-based Arbiter that can outperform naive voting, but that can overrule high-consensus decisions. We show that uncertainty statistics derived from the agent's own vote distribution (entropy and top-1/top-2 margin) correlate with downstream success and provide a practical signal for dynamic compute allocation. Based on these findings, we introduce Confidence-Aware Test-Time Scaling (CATTS), which uses vote-derived uncertainty to allocate compute only when decisions are genuinely contentious. CATTS improves performance on WebArena-Lite and GoBrowse by up to 9.1% over React while using up to 2.3x fewer tokens than uniform scaling, providing both efficiency gains and an interpretable decision rule.
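The vote-derived uncertainty statistics named in this abstract (entropy and top-1/top-2 margin) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the thresholds `max_entropy` and `min_margin`, and the rule for combining the two signals are all assumptions made for illustration.

```python
import math
from collections import Counter


def vote_uncertainty(votes):
    """Return (entropy in nats, top-1/top-2 margin) for a list of sampled actions."""
    counts = Counter(votes)
    total = len(votes)
    # Empirical vote distribution, sorted from most to least popular action.
    probs = sorted((c / total for c in counts.values()), reverse=True)
    entropy = -sum(p * math.log(p) for p in probs)
    runner_up = probs[1] if len(probs) > 1 else 0.0
    margin = probs[0] - runner_up
    return entropy, margin


def needs_more_compute(votes, max_entropy=0.5, min_margin=0.4):
    """Hypothetical gating rule: spend extra compute only on contentious steps."""
    entropy, margin = vote_uncertainty(votes)
    return entropy > max_entropy or margin < min_margin


if __name__ == "__main__":
    print(needs_more_compute(["click"] * 5))
    print(needs_more_compute(["click", "click", "type", "scroll", "type"]))
```

A unanimous vote has zero entropy and a margin of 1, so the step would be accepted as-is; a split vote raises entropy and shrinks the margin, which is when additional samples or an arbiter would be worth their cost under this sketch.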
2. 【2602.12275】On-Policy Context Distillation for Language Models
Link: https://arxiv.org/abs/2602.12275
Authors: Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, Furu Wei
Categories: Computation and Language (cs.CL)
Keywords: Context distillation, On-Policy Context Distillation, Context distillation enables, propose On-Policy Context, Context
Comments:
Abstract:Context distillation enables language models to internalize in-context knowledge into their parameters. In our work, we propose On-Policy Context Distillation (OPCD), a framework that bridges on-policy distillation with context distillation by training a student model on its own generated trajectories while minimizing reverse Kullback-Leibler divergence against a context-conditioned teacher. We demonstrate the effectiveness of OPCD on two important applications: experiential knowledge distillation, where models extract and consolidate transferable knowledge from their historical solution traces, and system prompt distillation, where models internalize beneficial behaviors encoded in optimized prompts. Across mathematical reasoning, text-based games, and domain-specific tasks, OPCD consistently outperforms baseline methods, achieving higher task accuracy while better preserving out-of-distribution capabilities. We further show that OPCD enables effective cross-size distillation, where smaller student models can internalize experiential knowledge from larger teachers.
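For readers unfamiliar with the reverse Kullback-Leibler objective mentioned in this abstract, the sketch below computes it for a single next-token distribution. It only illustrates the direction of the divergence (expectation taken under the student); the function name, the `eps` smoothing constant, and the toy distributions are assumptions, and nothing here reflects how OPCD is actually trained.

```python
import math


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher): the expectation is taken under the student."""
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student_probs, teacher_probs)
        if s > 0  # terms with zero student mass contribute nothing
    )


if __name__ == "__main__":
    student = [0.5, 0.5]
    teacher = [0.9, 0.1]  # stands in for a context-conditioned teacher
    print(reverse_kl(student, teacher))
    print(reverse_kl(teacher, student))  # forward direction, for comparison
```

The two directions give different values, which is why the choice matters: the reverse form penalizes the student for placing mass where the teacher places little.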
3. 【2602.12262】3D: Few-Step Diffusion Language Models via Trajectory Self-Distillation with Direct Discriminative Optimization
Link: https://arxiv.org/abs/2602.12262
Authors: Tunyu Zhang, Xinxi Zhang, Ligong Han, Haizhou Shi, Xiaoxiao He, Zhuowei Li, Hao Wang, Kai Xu, Akash Srivastava, Hao Wang, Vladimir Pavlovic, Dimitris N. Metaxas
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Diffusion large language, enable fast text, fast text generation, Diffusion large, large language models
Comments:
Abstract:Diffusion large language models (DLLMs) have the potential to enable fast text generation by decoding multiple tokens in parallel. However, in practice, their inference efficiency is constrained by the need for many refinement steps, while aggressively reducing the number of steps leads to a substantial degradation in generation quality. To alleviate this, we propose a trajectory self-distillation framework that improves few-step decoding by distilling the model's own generative trajectories. We incorporate Direct Discriminative Optimization (DDO), a reverse-KL objective that promotes mode-seeking distillation and encourages the student to concentrate on high-probability teacher modes. Across benchmarks, our approach consistently outperforms strong few-step baselines and standard training under tight step budgets. Although full-step decoding remains superior, we substantially narrow the gap, establishing a strong foundation towards practical few-step DLLMs. The source code is available at this https URL.
4. 【2602.12251】A technical curriculum on language-oriented artificial intelligence in translation and specialised communication
Link: https://arxiv.org/abs/2602.12251
Authors: Ralph Krüger
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Keywords: language-oriented artificial intelligence, artificial intelligence, paper presents, language-oriented artificial, curriculum
Comments: 10 pages, 1 figure, EAMT 2026, TAITT Workshop
Abstract:This paper presents a technical curriculum on language-oriented artificial intelligence (AI) in the language and translation (LT) industry. The curriculum aims to foster domain-specific technical AI literacy among stakeholders in the fields of translation and specialised communication by exposing them to the conceptual and technical/algorithmic foundations of modern language-oriented AI in an accessible way. The core curriculum focuses on 1) vector embeddings, 2) the technical foundations of neural networks, 3) tokenization and 4) transformer neural networks. It is intended to help users develop computational thinking as well as algorithmic awareness and algorithmic agency, ultimately contributing to their digital resilience in AI-driven work environments. The didactic suitability of the curriculum was tested in an AI-focused MA course at the Institute of Translation and Multilingual Communication at TH Koeln. Results suggest the didactic effectiveness of the curriculum, but participant feedback indicates that it should be embedded into higher-level didactic scaffolding - e.g., in the form of lecturer support - in order to enable optimal learning conditions.
5. 【2602.12249】"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most
Link: https://arxiv.org/abs/2602.12249
Authors: Kaitlyn Zhou, Martijn Bartelds, Federico Bianchi, James Zou
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: achieving low word, low word error, fail on short, recognition systems achieving, systems achieving low
Comments:
Abstract:Despite speech recognition systems achieving low word error rates on standard benchmarks, they often fail on short, high-stakes utterances in real-world deployments. Here, we study this failure mode in a high-stakes task: the transcription of U.S. street names as spoken by U.S. participants. We evaluate 15 models from OpenAI, Deepgram, Google, and Microsoft on recordings from linguistically diverse U.S. speakers and find an average transcription error rate of 44%. We quantify the downstream impact of failed transcriptions by geographic locations and show that mis-transcriptions systematically cause errors for all speakers, but that routing distance errors are twice as large for non-English primary speakers compared to English primary speakers. To mitigate this harm, we introduce a synthetic data generation approach that produces diverse pronunciations of named entities using open-source text-to-speech models. Fine-tuning with less than 1,000 synthetic samples improves street name transcription accuracy by nearly 60% (relative to base models) for non-English primary speakers. Our results highlight a critical gap between benchmark performance and real-world reliability in speech systems and demonstrate a simple, scalable path to reducing high-stakes transcription errors.
6. 【2602.12241】Moonshine v2: Ergodic Streaming Encoder ASR for Latency-Critical Speech Applications
Link: https://arxiv.org/abs/2602.12241
Authors: Manjunath Kudlur, Evan King, James Wang, Pete Warden
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Keywords: Latency-critical speech applications, high transcription accuracy, voice commands, demand low, Latency-critical speech
Comments: 7 pages, 5 figures
Abstract:Latency-critical speech applications (e.g., live transcription, voice commands, and real-time translation) demand low time-to-first-token (TTFT) and high transcription accuracy, particularly on resource-constrained edge devices. Full-attention Transformer encoders remain a strong accuracy baseline for automatic speech recognition (ASR) because every frame can directly attend to every other frame, which resolves otherwise locally ambiguous acoustics using distant lexical context. However, this global dependency incurs quadratic complexity in sequence length, inducing an inherent "encode-the-whole-utterance" latency profile. For streaming use cases, this causes TTFT to grow linearly with utterance length as the encoder must process the entire prefix before any decoder token can be emitted. To better meet the needs of on-device, streaming ASR use cases we introduce Moonshine v2, an ergodic streaming-encoder ASR model that employs sliding-window self-attention to achieve bounded, low-latency inference while preserving strong local context. Our models achieve state of the art word error rates across standard benchmarks, attaining accuracy on-par with models 6x their size while running significantly faster. These results demonstrate that carefully designed local attention is competitive with the accuracy of full attention at a fraction of the size and latency cost, opening new possibilities for interactive speech interfaces on edge devices.
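The sliding-window self-attention described in this abstract can be pictured as a banded attention mask. The sketch below builds such a mask in plain Python; the function name and the `left`/`right` window sizes are hypothetical and are not taken from Moonshine v2.

```python
def sliding_window_mask(num_frames, left, right=0):
    """Return a boolean matrix where mask[i][j] is True if frame i may attend to frame j.

    Each frame sees at most `left` past frames, itself, and `right` future frames,
    so the per-frame cost is bounded regardless of utterance length.
    """
    return [
        [(i - left) <= j <= (i + right) for j in range(num_frames)]
        for i in range(num_frames)
    ]


if __name__ == "__main__":
    for row in sliding_window_mask(num_frames=5, left=2):
        print("".join("x" if allowed else "." for allowed in row))
```

Because every row has at most `left + right + 1` allowed positions, the encoder never has to wait for the whole utterance, in contrast to the full-attention case the abstract describes.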
7. 【2602.12237】Olmix: A Framework for Data Mixing Throughout LM Development
Link: https://arxiv.org/abs/2602.12237
Authors: Mayee F. Chen, Tyler Murray, David Heineman, Matt Jordan, Hannaneh Hajishirzi, Christopher Ré, Luca Soldaini, Kyle Lo
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: training language models, language models, first-order concern, mixing, Data
Comments:
Abstract:Data mixing -- determining the ratios of data from different domains -- is a first-order concern for training language models (LMs). While existing mixing methods show promise, they fall short when applied during real-world LM development. We present Olmix, a framework that addresses two such challenges. First, the configuration space for developing a mixing method is not well understood -- design choices across existing methods lack justification or consensus and overlook practical issues like data constraints. We conduct a comprehensive empirical study of this space, identifying which design choices lead to a strong mixing method. Second, in practice, the domain set evolves throughout LM development as datasets are added, removed, partitioned, and revised -- a problem setting largely unaddressed by existing works, which assume fixed domains. We study how to efficiently recompute the mixture after the domain set is updated, leveraging information from past mixtures. We introduce mixture reuse, a mechanism that reuses existing ratios and recomputes ratios only for domains affected by the update. Over a sequence of five domain-set updates mirroring real-world LM development, mixture reuse matches the performance of fully recomputing the mix after each update with 74% less compute and improves over training without mixing by 11.6% on downstream tasks.
8. 【2602.12235】Detecting Overflow in Compressed Token Representations for Retrieval-Augmented Generation
Link: https://arxiv.org/abs/2602.12235
Authors: Julia Belikova, Danila Rozhevskii, Dennis Svirin, Konstantin Polev, Alexander Panchenko
Categories: Computation and Language (cs.CL)
Keywords: Efficient long-context processing, large language models, contemporary large language, Efficient long-context, long-context processing remains
Comments: Accepted to EACL 2026 Student Research Workshop. 14 pages, 6 tables, 1 figure
Abstract:Efficient long-context processing remains a crucial challenge for contemporary large language models (LLMs), especially in resource-constrained environments. Soft compression architectures promise to extend effective context length by replacing long token sequences with smaller sets of learned compressed tokens. Yet, the limits of compressibility -- and when compression begins to erase task-relevant content -- remain underexplored. In this paper, we define "token overflow" as a regime in which compressed representations no longer contain sufficient information to answer a given query, and propose a methodology to characterize and detect it. In the xRAG soft-compression setting, we find that query-agnostic saturation statistics reliably separate compressed from uncompressed token representations, providing a practical tool for identifying compressed tokens but showing limited overflow detection capability. Lightweight probing classifiers over both query and context xRAG representations detect overflow with 0.72 AUC-ROC on average on HotpotQA, SQuADv2, and TriviaQA datasets, demonstrating that incorporating query information improves detection performance. These results advance from query-independent diagnostics to query-aware detectors, enabling low-cost pre-LLM gating to mitigate compression-induced errors.
9. 【2602.12203】ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images
Link: https://arxiv.org/abs/2602.12203
Authors: Mathieu Sibue, Andres Muñoz Garza, Samuel Mensah, Pranav Shetty, Zhiqiang Ma, Xiaomo Liu, Manuela Veloso
Categories: Computation and Language (cs.CL)
Keywords: embed critical information, Vision Language Models, Visual Question Answering, Enterprise documents, automated workflows
Comments: EACL 2026, main conference
Abstract:Enterprise documents, such as forms and reports, embed critical information for downstream applications like data archiving, automated workflows, and analytics. Although generalist Vision Language Models (VLMs) perform well on established document understanding benchmarks, their ability to conduct holistic, fine-grained structured extraction across diverse document types and flexible schemas is not well studied. Existing Key Entity Extraction (KEE), Relation Extraction (RE), and Visual Question Answering (VQA) datasets are limited by narrow entity ontologies, simple queries, or homogeneous document types, often overlooking the need for adaptable and structured extraction. To address these gaps, we introduce ExStrucTiny, a new benchmark dataset for structured Information Extraction (IE) from document images, unifying aspects of KEE, RE, and VQA. Built through a novel pipeline combining manual and synthetic human-validated samples, ExStrucTiny covers more varied document types and extraction scenarios. We analyze open and closed VLMs on this benchmark, highlighting challenges such as schema adaptation, query under-specification, and answer localization. We hope our work provides a bedrock for improving generalist models for structured IE in documents.
10. 【2602.12196】Visual Reasoning Benchmark: Evaluating Multimodal LLMs on Classroom-Authentic Visual Problems from Primary Education
Link: https://arxiv.org/abs/2602.12196
Authors: Mohamed Huti, Alasdair Mackintosh, Amy Waldock, Dominic Andrews, Maxime Lelièvre, Moritz Boos, Tobias Murray, Paul Atherton, Robin A. A. Ince, Oliver G. B. Garrod
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: relational structures remains, Large Language Models, Multimodal Large Language, results in textual, critical bottleneck
Comments:
Abstract:AI models have achieved state-of-the-art results in textual reasoning; however, their ability to reason over spatial and relational structures remains a critical bottleneck -- particularly in early-grade maths, which relies heavily on visuals. This paper introduces the visual reasoning benchmark (VRB), a novel dataset designed to evaluate Multimodal Large Language Models (MLLMs) on their ability to solve authentic visual problems from classrooms. This benchmark is built on a set of 701 questions sourced from primary school examinations in Zambia and India, which cover a range of tasks such as reasoning by analogy, pattern completion, and spatial matching. We outline the methodology and development of the benchmark which intentionally uses unedited, minimal-text images to test if models can meet realistic needs of primary education. Our findings reveal a "jagged frontier" of capability where models demonstrate better proficiency in static skills such as counting and scaling, but reach a distinct "spatial ceiling" when faced with dynamic operations like folding, reflection, and rotation. These weaknesses pose a risk for classroom use on visual reasoning problems, with the potential for incorrect marking, false scaffolding, and reinforcing student misconceptions. Consequently, education-focused benchmarks like the VRB are essential for determining the functional boundaries of multimodal tools used in classrooms.
11. 【2602.12192】Query-focused and Memory-aware Reranker for Long Context Processing
Link: https://arxiv.org/abs/2602.12192
Authors: Yuqing Li, Jiangnan Li, Mo Yu, Guoxuan Ding, Zheng Lin, Weiping Wang, Jie Zhou
Categories: Computation and Language (cs.CL)
Keywords: estimate passage-query relevance, large language models, alternative reranking framework, large language, propose an alternative
Comments: 14 pages, 2 figures
Abstract:Built upon the existing analysis of retrieval heads in large language models, we propose an alternative reranking framework that trains models to estimate passage-query relevance using the attention scores of selected heads. This approach provides a listwise solution that leverages holistic information within the entire candidate shortlist during ranking. At the same time, it naturally produces continuous relevance scores, enabling training on arbitrary retrieval datasets without requiring Likert-scale supervision. Our framework is lightweight and effective, requiring only small-scale models (e.g., 4B parameters) to achieve strong performance. Extensive experiments demonstrate that our method outperforms existing state-of-the-art pointwise and listwise rerankers across multiple domains, including Wikipedia and long narrative datasets. It further establishes a new state-of-the-art on the LoCoMo benchmark that assesses the capabilities of dialogue understanding and memory usage. We further demonstrate that our framework supports flexible extensions. For example, augmenting candidate passages with contextual information further improves ranking accuracy, while training attention heads from middle layers enhances efficiency without sacrificing performance.
12. 【2602.12172】Pedagogically-Inspired Data Synthesis for Language Model Knowledge Distillation
Link: https://arxiv.org/abs/2602.12172
Authors: Bowei He, Yankai Chen, Xiaokun Zhang, Linghe Kong, Philip S. Yu, Xue Liu, Chen Ma
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, Language Models, efficient AI systems, critical technique
Comments: Accepted by ICLR 2026
Abstract:Knowledge distillation from Large Language Models (LLMs) to smaller models has emerged as a critical technique for deploying efficient AI systems. However, current methods for distillation via synthetic data lack pedagogical awareness, treating knowledge transfer as a one-off data synthesis and training task rather than a systematic learning process. In this paper, we propose a novel pedagogically-inspired framework for LLM knowledge distillation that draws from fundamental educational principles. Our approach introduces a three-stage pipeline -- Knowledge Identifier, Organizer, and Adapter (IOA) -- that systematically identifies knowledge deficiencies in student models, organizes knowledge delivery through progressive curricula, and adapts representations to match the cognitive capacity of student models. We integrate Bloom's Mastery Learning Principles and Vygotsky's Zone of Proximal Development to create a dynamic distillation process where student models approach teacher model's performance on prerequisite knowledge before advancing, and new knowledge is introduced with controlled, gradual difficulty increments. Extensive experiments using LLaMA-3.1/3.2 and Qwen2.5 as student models demonstrate that IOA achieves significant improvements over baseline distillation methods, with student models retaining 94.7% of teacher performance on DollyEval while using less than 1/10th of the parameters. Our framework particularly excels in complex reasoning tasks, showing 19.2% improvement on MATH and 22.3% on HumanEval compared with state-of-the-art baselines.
13. 【2602.12153】dVoting: Fast Voting for dLLMs
Link: https://arxiv.org/abs/2602.12153
Authors: Sicheng Feng, Zigeng Chen, Xinyin Ma, Gongfan Fang, Xinchao Wang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Diffusion Large Language, Diffusion Large, Language Models, Large Language
Comments:
Abstract:Diffusion Large Language Models (dLLMs) represent a new paradigm beyond autoregressive modeling, offering competitive performance while naturally enabling a flexible decoding process. Specifically, dLLMs can generate tokens at arbitrary positions in parallel, endowing them with significant potential for parallel test-time scaling, which was previously constrained by severe inefficiency in autoregressive modeling. In this work, we introduce dVoting, a fast voting technique that boosts reasoning capability without training, with only an acceptable extra computational overhead. dVoting is motivated by the observation that, across multiple samples for the same prompt, token predictions remain largely consistent, whereas performance is determined by a small subset of tokens exhibiting cross-sample variability. Leveraging the arbitrary-position generation capability of dLLMs, dVoting performs iterative refinement by sampling, identifying uncertain tokens via consistency analysis, regenerating them through voting, and repeating this process until convergence. Extensive evaluations demonstrate that dVoting consistently improves performance across various benchmarks. It achieves gains of 6.22%-7.66% on GSM8K, 4.40%-7.20% on MATH500, 3.16%-14.84% on ARC-C, and 4.83%-5.74% on MMLU. Our code is available at this https URL
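The consistency analysis and voting steps described in this abstract can be sketched on toy data. The code below takes several equal-length samples for one prompt, takes a per-position majority vote, and flags positions where the samples disagree. It is an illustration only: the function name and the `min_agreement` threshold are assumptions, and the regeneration of flagged positions by a dLLM is not shown.

```python
from collections import Counter


def vote_and_flag(samples, min_agreement=1.0):
    """Majority-vote each position and list positions with cross-sample disagreement.

    `samples` is a list of equal-length token lists for the same prompt.
    Returns (voted_tokens, uncertain_positions).
    """
    voted, uncertain = [], []
    for pos, column in enumerate(zip(*samples)):
        token, count = Counter(column).most_common(1)[0]
        voted.append(token)
        # A position is "uncertain" when too few samples agree on the winner.
        if count / len(column) < min_agreement:
            uncertain.append(pos)
    return voted, uncertain


if __name__ == "__main__":
    samples = [
        ["2", "+", "2", "=", "4"],
        ["2", "+", "2", "=", "5"],
        ["2", "+", "2", "=", "4"],
    ]
    print(vote_and_flag(samples))
```

In this toy case only the final position varies across samples, matching the abstract's observation that most token predictions are consistent and a small subset carries the variability.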
14. 【2602.12150】GPT-4o Lacks Core Features of Theory of Mind
Link: https://arxiv.org/abs/2602.12150
Authors: John Muchovej, Amanda Royka, Shane Lee, Julian Jara-Ettinger
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Large Language Models, Theory of Mind, Large Language, possess a Theory, Language Models
Comments: Submitted to CogSci 2025; see more at [this https URL](https://jmuchovej.com/projects/llm-tom). Note: "abstractness" is the second feature we test for, but due to arXiv's abstract requirements, the text has been altered
Abstract:Do Large Language Models (LLMs) possess a Theory of Mind (ToM)? Research into this question has focused on evaluating LLMs against benchmarks and found success across a range of social tasks. However, these evaluations do not test for the actual representations posited by ToM: namely, a causal model of mental states and behavior. Here, we use a cognitively-grounded definition of ToM to develop and test a new evaluation framework. Specifically, our approach probes whether LLMs have a coherent, domain-general, and consistent model of how mental states cause behavior -- regardless of whether that model matches a human-like ToM. We find that even though LLMs succeed in approximating human judgments in a simple ToM paradigm, they fail at a logically equivalent task and exhibit low consistency between their action predictions and corresponding mental state inferences. As such, these findings suggest that the social proficiency exhibited by LLMs is not the result of a domain-general or consistent ToM.
15. 【2602.12146】Seq2Seq2Seq: Lossless Data Compression via Discrete Latent Transformers and Reinforcement Learning
Link: https://arxiv.org/abs/2602.12146
Authors: Mahdi Khodabandeh, Ghazal Shabani, Arash Yousefi Jordehi, Seyed Abolghasem Mirroshandel
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT)
Keywords: minimizing storage costs, Reinforcement Learning, essential for minimizing, minimizing storage, storage costs
Comments:
Abstract:Efficient lossless compression is essential for minimizing storage costs and transmission overhead while preserving data integrity. Traditional compression techniques, such as dictionary-based and statistical methods, often struggle to optimally exploit the structure and redundancy in complex data formats. Recent advancements in deep learning have opened new avenues for compression; however, many existing approaches depend on dense vector representations that obscure the underlying token structure. To address these limitations, we propose a novel lossless compression method that leverages Reinforcement Learning applied to a T5 language model architecture. This approach enables the compression of data into sequences of tokens rather than traditional vector representations. Unlike auto-encoders, which typically encode information into continuous latent spaces, our method preserves the token-based structure, aligning more closely with the original data format. This preservation allows for higher compression ratios while maintaining semantic integrity. By training the model using an off-policy Reinforcement Learning algorithm, we optimize sequence length to minimize redundancy and enhance compression efficiency. Our method introduces an efficient and adaptive data compression system built upon advanced Reinforcement Learning techniques, functioning independently of external grammatical or world knowledge. This approach shows significant improvements in compression ratios compared to conventional methods. By leveraging the latent information within language models, our system effectively compresses data without requiring explicit content understanding, paving the way for more robust and practical compression solutions across various applications.
16. 【2602.12137】CitiLink-Minutes: A Multilayer Annotated Dataset of Municipal Meeting Minutes
Link: https://arxiv.org/abs/2602.12137
Authors: Ricardo Campos, Ana Filipa Pacheco, Ana Luísa Fernandes, Inês Cantante, Rute Rebouças, Luís Filipe Cunha, José Miguel Isidro, José Pedro Evans, Miguel Marques, Rodrigo Batista, Evelin Amorim, Alípio Jorge, Nuno Guimarães, Sérgio Nunes, António Leal, Purificação Silvano
Categories: Computation and Language (cs.CL)
Keywords: City councils play, directly influencing citizens', influencing citizens' daily, citizens' daily lives, City councils
Comments:
Abstract:City councils play a crucial role in local governance, directly influencing citizens' daily lives through decisions made during municipal meetings. These deliberations are formally documented in meeting minutes, which serve as official records of discussions, decisions, and voting outcomes. Despite their importance, municipal meeting records have received little attention in Information Retrieval (IR) and Natural Language Processing (NLP), largely due to the lack of annotated datasets, which ultimately limit the development of computational models. To address this gap, we introduce CitiLink-Minutes, a multilayer dataset of 120 European Portuguese municipal meeting minutes from six municipalities. Unlike prior annotated datasets of parliamentary or video records, CitiLink-Minutes provides multilayer annotations and structured linkage of official written minutes. The dataset contains over one million tokens, with all personal identifiers de-identified. Each minute was manually annotated by two trained annotators and curated by an experienced linguist across three complementary dimensions: (1) metadata, (2) subjects of discussion, and (3) voting outcomes, totaling over 38,000 individual annotations. Released under FAIR principles and accompanied by baseline results on metadata extraction, topic classification, and vote labeling, CitiLink-Minutes demonstrates its potential for downstream NLP and IR tasks, while promoting transparent access to municipal decisions.
17. 【2602.12135】WavBench: Benchmarking Reasoning, Colloquialism, and Paralinguistics for End-to-End Spoken Dialogue Models
Link: https://arxiv.org/abs/2602.12135
Authors: Yangzhuo Li, Shengpeng Ji, Yifu Chen, Tianle Liang, Haorong Ying, Yule Wang, Junbo Li, Jun Fang, Zhou Zhao
Categories: Computation and Language (cs.CL)
Keywords: field urgently demands, transcend simple interactions, address real-world complexity, advanced reasoning capabilities, urgently demands benchmarks
Comments: Open-source at [this https URL](https://naruto-2024.github.io/wavbench.github.io/)
Abstract:With the rapid integration of advanced reasoning capabilities into spoken dialogue models, the field urgently demands benchmarks that transcend simple interactions to address real-world complexity. However, current evaluations predominantly adhere to text-generation standards, overlooking the unique audio-centric characteristics of paralinguistics and colloquialisms, alongside the cognitive depth required by modern agents. To bridge this gap, we introduce WavBench, a comprehensive benchmark designed to evaluate realistic conversational abilities where prior works fall short. Uniquely, WavBench establishes a tripartite framework: 1) Pro subset, designed to rigorously challenge reasoning-enhanced models with significantly increased difficulty; 2) Basic subset, defining a novel standard for spoken colloquialism that prioritizes "listenability" through natural vocabulary, linguistic fluency, and interactive rapport, rather than rigid written accuracy; and 3) Acoustic subset, covering explicit understanding, generation, and implicit dialogue to rigorously evaluate comprehensive paralinguistic capabilities within authentic real-world scenarios. Through evaluating five state-of-the-art models, WavBench offers critical insights into the intersection of complex problem-solving, colloquial delivery, and paralinguistic fidelity, guiding the evolution of robust spoken dialogue models. The benchmark dataset and evaluation toolkit are available at this https URL.
18. 【2602.12133】Neutral Prompts, Non-Neutral People: Quantifying Gender and Skin-Tone Bias in Gemini Flash 2.5 Image and GPT Image 1.5
Link: https://arxiv.org/abs/2602.12133
Authors: Roberto Balestri
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Keywords: commercial image generators, widely deployed commercial, deployed commercial image, Gemini Flash, prompts yield demographically
Comments:
Abstract:This study quantifies gender and skin-tone bias in two widely deployed commercial image generators - Gemini Flash 2.5 Image (NanoBanana) and GPT Image 1.5 - to test the assumption that neutral prompts yield demographically neutral outputs. We generated 3,200 photorealistic images using four semantically neutral prompts. The analysis employed a rigorous pipeline combining hybrid color normalization, facial landmark masking, and perceptually uniform skin tone quantification using the Monk (MST), PERLA, and Fitzpatrick scales. Neutral prompts produced highly polarized defaults. Both models exhibited a strong "default white" bias (96% of outputs). However, they diverged sharply on gender: Gemini favored female-presenting subjects, while GPT favored male-presenting subjects with lighter skin tones. This research provides a large-scale, comparative audit of state-of-the-art models using an illumination-aware colorimetric methodology, distinguishing aesthetic rendering from underlying pigmentation in synthetic imagery. The study demonstrates that neutral prompts function as diagnostic probes rather than neutral instructions. It offers a robust framework for auditing algorithmic visual culture and challenges the sociolinguistic assumption that unmarked language results in inclusive representation.
19. 【2602.12132】A Rule-based Computational Model for Gaidhlig Morphology
Link: https://arxiv.org/abs/2602.12132
Authors: Peter J Barclay
Subjects: Computation and Language (cs.CL)
Keywords: popular neural models, neural models require, models require considerable, require considerable data, continuing vitality
Comments: A revised version of this article will be published at ICAART 2026 ( [this https URL](https://icaart.scitevents.org/?y=2026) )
Abstract:Language models and software tools are essential to support the continuing vitality of lesser-used languages; however, currently popular neural models require considerable data for training, which normally is not available for such low-resource languages. This paper describes work-in-progress to construct a rule-based model of Gaidhlig morphology using data from Wiktionary, arguing that rule-based systems effectively leverage limited sample data, support greater interpretability, and provide insights useful in the design of teaching materials. The use of SQL for querying the occurrence of different lexical patterns is investigated, and a declarative rule-base is presented that allows Python utilities to derive inflected forms of Gaidhlig words. This functionality could be used to support educational tools that teach or explain language patterns, for example, or to support higher level tools such as rule-based dependency parsers. This approach adds value to the data already present in Wiktionary by adapting it to new use-cases.
20. 【2602.12125】Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
Link: https://arxiv.org/abs/2602.12125
Authors: Wenkai Yang,Weijie Liu,Ruobing Xie,Kai Yang,Saiyong Yang,Yankai Lin
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: demonstrated strong empirical, strong empirical gains, teacher logit distribution, Generalized On-Policy Distillation, outperforms off-policy distillation
Comments: Work in progress. Github repo: [this https URL](https://github.com/RUCBM/G-OPD)
Abstract:On-policy distillation (OPD), which aligns the student with the teacher's logit distribution on student-generated trajectories, has demonstrated strong empirical gains in improving student performance and often outperforms off-policy distillation and reinforcement learning (RL) paradigms. In this work, we first theoretically show that OPD is a special case of dense KL-constrained RL where the reward function and the KL regularization are always weighted equally and the reference model can be any model. Then, we propose the Generalized On-Policy Distillation (G-OPD) framework, which extends the standard OPD objective by introducing a flexible reference model and a reward scaling factor that controls the relative weight of the reward term against the KL regularization. Through comprehensive experiments on math reasoning and code generation tasks, we derive two novel insights: (1) Setting the reward scaling factor to be greater than 1 (i.e., reward extrapolation), which we term ExOPD, consistently improves over standard OPD across a range of teacher-student size pairings. In particular, in the setting where we merge the knowledge from different domain experts, obtained by applying domain-specific RL to the same student model, back into the original student, ExOPD enables the student to even surpass the teacher's performance boundary and outperform the domain teachers. (2) Building on ExOPD, we further find that in the strong-to-weak distillation setting (i.e., distilling a smaller student from a larger teacher), performing reward correction by choosing the reference model as the teacher's base model before RL yields a more accurate reward signal and further improves distillation performance. However, this choice assumes access to the teacher's pre-RL variant and incurs more computational overhead. We hope our work offers new insights for future research on OPD.
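The relationship described in this abstract can be written out as a short identity. This is a sketch under assumed notation, not the paper's own formulation: the symbols $\pi_\theta$ (student), $\pi_T$ (teacher), $\pi_{\mathrm{ref}}$ (reference model) and $\lambda$ (reward scaling factor) are chosen here for illustration, and the objective is written at the sequence level for brevity, whereas the abstract describes a dense reward.

```latex
% Reverse KL to the teacher, split around an arbitrary reference model
\mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_T\right)
  = \mathbb{E}_{y \sim \pi_\theta}\!\left[\log \pi_\theta(y \mid x) - \log \pi_T(y \mid x)\right]
  = -\,\mathbb{E}_{y \sim \pi_\theta}\!\left[r(x,y)\right]
    + \mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right),
\qquad
r(x,y) = \log \frac{\pi_T(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}

% Generalized objective with a reward scaling factor \lambda
\max_{\theta}\;
  \lambda\, \mathbb{E}_{y \sim \pi_\theta}\!\left[r(x,y)\right]
  - \mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)
```

With $\lambda = 1$ the objective reduces to minimizing the reverse KL to the teacher for any choice of $\pi_{\mathrm{ref}}$, which matches the equal-weighting statement in the abstract. For general $\lambda$ the unconstrained maximizer is proportional to $\pi_{\mathrm{ref}} \cdot (\pi_T / \pi_{\mathrm{ref}})^{\lambda}$, so $\lambda > 1$ pushes the student beyond the teacher along the direction in which the teacher differs from the reference.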
21. 【2602.12124】Capability-Oriented Training Induced Alignment Risk
Link: https://arxiv.org/abs/2602.12124
Authors: Yujun Zhou,Yue Huang,Han Bao,Kehan Guo,Zhenwen Liang,Pin-Yu Chen,Tian Gao,Werner Geyer,Nuno Moniz,Nitesh V Chawla,Xiangliang Zhang
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: generating explicitly harmful, alignment research focuses, explicitly harmful content, training induced exploitation, research focuses
Comments:
Abstract:While most AI alignment research focuses on preventing models from generating explicitly harmful content, a more subtle risk is emerging: capability-oriented training induced exploitation. We investigate whether language models, when trained with reinforcement learning (RL) in environments with implicit loopholes, will spontaneously learn to exploit these flaws to maximize their reward, even without any malicious intent in their training. To test this, we design a suite of four diverse "vulnerability games", each presenting a unique, exploitable flaw related to context-conditional compliance, proxy metrics, reward tampering, and self-evaluation. Our experiments show that models consistently learn to exploit these vulnerabilities, discovering opportunistic strategies that significantly increase their reward at the expense of task correctness or safety. More critically, we find that these exploitative strategies are not narrow "tricks" but generalizable skills; they can be transferred to new tasks and even "distilled" from a capable teacher model to other student models through data alone. Our findings reveal that capability-oriented training induced risks pose a fundamental challenge to current alignment approaches, suggesting that future AI safety work must extend beyond content moderation to rigorously auditing and securing the training environments and reward mechanisms themselves. Code is available at this https URL.
22. 【2602.12123】Meta-Sel: Efficient Demonstration Selection for In-Context Learning via Supervised Meta-Learning
Link: https://arxiv.org/abs/2602.12123
Authors: Xubin Wang,Weijia Jia
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: change substantially depending, tight prompt budget, large candidate pools, accuracy can change, practical bottleneck
Comments:
Abstract:Demonstration selection is a practical bottleneck in in-context learning (ICL): under a tight prompt budget, accuracy can change substantially depending on which few-shot examples are included, yet selection must remain cheap enough to run per query over large candidate pools. We propose Meta-Sel, a lightweight supervised meta-learning approach for intent classification that learns a fast, interpretable scoring function for (candidate, query) pairs from labeled training data. Meta-Sel constructs a meta-dataset by sampling pairs from the training split and using class agreement as supervision, then trains a calibrated logistic regressor on two inexpensive meta-features: TF-IDF cosine similarity and a length-compatibility ratio. At inference time, the selector performs a single vectorized scoring pass over the full candidate pool and returns the top-k demonstrations, requiring no model fine-tuning, no online exploration, and no additional LLM calls. This yields deterministic rankings and makes the selection mechanism straightforward to audit via interpretable feature weights. Beyond proposing Meta-Sel, we provide a broad empirical study of demonstration selection, benchmarking 12 methods -- spanning prompt engineering baselines, heuristic selection, reinforcement learning, and influence-based approaches -- across four intent datasets and five open-source LLMs. Across this benchmark, Meta-Sel consistently ranks among the top-performing methods, is particularly effective for smaller models where selection quality can partially compensate for limited model capacity, and maintains competitive selection-time overhead.
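The scoring pass described in this abstract can be illustrated with a small standard-library sketch. It is not the authors' implementation: the function names, the whitespace tokenizer, the smoothed IDF formula, and the fixed `weights` and `bias` are all assumptions made for illustration. In the paper's setup the logistic weights would be learned from sampled (candidate, query) pairs with class agreement as the label.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one sparse TF-IDF dict per text (whitespace tokens, smoothed IDF)."""
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    df = Counter(tok for d in docs for tok in d)  # document frequency per token
    idf = {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}
    return [{tok: tf * idf[tok] for tok, tf in d.items()} for d in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def length_ratio(a, b):
    """Length-compatibility feature: shorter token count over longer one."""
    la, lb = len(a.split()), len(b.split())
    return min(la, lb) / max(la, lb) if max(la, lb) else 0.0


def select_top_k(query, candidates, k=2, weights=(4.0, 1.0), bias=-2.0):
    """Score every candidate against the query and return the k best.

    `weights` and `bias` are hypothetical stand-ins for a trained
    logistic regressor over the two meta-features.
    """
    vecs = tfidf_vectors(candidates + [query])
    q_vec = vecs[-1]
    scored = []
    for idx, (cand, c_vec) in enumerate(zip(candidates, vecs)):
        z = (weights[0] * cosine(c_vec, q_vec)
             + weights[1] * length_ratio(cand, query)
             + bias)
        scored.append((1.0 / (1.0 + math.exp(-z)), idx))
    # Sort by score, breaking ties by original index so rankings are deterministic.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [candidates[i] for _, i in scored[:k]]


pool = [
    "book a flight to paris",
    "play some jazz music",
    "what is the weather today",
    "reserve a flight to rome",
]
print(select_top_k("book a flight to berlin", pool, k=2))
```

Because the score is a monotone function of a weighted feature sum, the ranking can be audited directly from the two weights, which is the interpretability property the abstract emphasizes.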
23. 【2602.12116】P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling
Link: https://arxiv.org/abs/2602.12116
Authors: Pinyi Zhang,Ting-En Lin,Yuchuan Wu,Jingyang Chen,Zongqi Wang,Hua Yang,Ze Xu,Fei Huang,Kai Zhang,Yongbin Li
Subjects: Computation and Language (cs.CL)
Keywords: language models seeks, large language models, typically via reinforcement, reinforcement learning, large language
Comments: Accepted as ICLR 2026 Oral
Abstract:Personalized alignment of large language models seeks to adapt responses to individual user preferences, typically via reinforcement learning. A key challenge is obtaining accurate, user-specific reward signals in open-ended scenarios. Existing personalized reward models face two persistent limitations: (1) oversimplifying diverse, scenario-specific preferences into a small, fixed set of evaluation principles, and (2) struggling with generalization to new users with limited feedback. To this end, we propose P-GenRM, the first Personalized Generative Reward Model with test-time user-based scaling. P-GenRM transforms preference signals into structured evaluation chains that derive adaptive personas and scoring rubrics across various scenarios. It further clusters users into User Prototypes and introduces a dual-granularity scaling mechanism: at the individual level, it adaptively scales and aggregates each user's scoring scheme; at the prototype level, it incorporates preferences from similar users. This design mitigates noise in inferred preferences and enhances generalization to unseen users through prototype-based transfer. Empirical results show that P-GenRM achieves state-of-the-art results on widely-used personalized reward model benchmarks, with an average improvement of 2.31%, and demonstrates strong generalization on an out-of-distribution dataset. Notably, Test-time User-based scaling provides an additional 3% boost, demonstrating stronger personalized alignment with test-time scalability.
24. 【2602.12113】Stop Unnecessary Reflection: Training LRMs for Efficient Reasoning with Adaptive Reflection and Length Coordinated Penalty
Link: https://arxiv.org/abs/2602.12113
Authors: Zewei Yu,Lirong Gao,Yuke Zhu,Bo Zheng,Sheng Guo,Haobo Wang,Junbo Zhao
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: employing test-time scaling, demonstrated remarkable performance, Large Reasoning Models, complex reasoning tasks, Large Reasoning
Comments: Accepted to ICLR 2026
Abstract:Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning tasks by employing test-time scaling. However, they often generate over-long chains-of-thought that, driven by substantial reflections such as repetitive self-questioning and circular reasoning, lead to high token consumption, substantial computational overhead, and increased latency without improving accuracy, particularly in smaller models. Our observation reveals that increasing problem complexity induces more excessive and unnecessary reflection, which in turn reduces accuracy and increases token overhead. To address this challenge, we propose Adaptive Reflection and Length Coordinated Penalty (ARLCP), a novel reinforcement learning framework designed to dynamically balance reasoning efficiency and solution accuracy. ARLCP introduces two key innovations: (1) a reflection penalty that adaptively curtails unnecessary reflective steps while preserving essential reasoning, and (2) a length penalty calibrated to the estimated complexity of the problem. By coordinating these penalties, ARLCP encourages the model to generate more concise and effective reasoning paths. We evaluate our method on five mathematical reasoning benchmarks using DeepSeek-R1-Distill-Qwen-1.5B and DeepSeek-R1-Distill-Qwen-7B models. Experimental results show that ARLCP achieves a superior efficiency-accuracy trade-off compared to existing approaches. For the 1.5B model, it reduces the average response length by 53.1% while simultaneously improving accuracy by 5.8%. For the 7B model, it achieves a 35.0% reduction in length with a 2.7% accuracy gain. The code is released at this https URL.
25. 【2602.12092】DeepSight: An All-in-One LM Safety Toolkit
Link: https://arxiv.org/abs/2602.12092
Authors: Bo Zhang,Jiaxuan Guo,Lijun Li,Dongrui Liu,Sujin Chen,Guanxu Chen,Zhijie Zheng,Qihao Lin,Lewen Yan,Chen Qian,Yijin Zhou,Yuyao Wu,Shaoxiong Guo,Tianyi Du,Jingyi Yang,Xuhao Hu,Ziqi Miao,Xiaoya Lu,Jing Shao,Xia Hu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Language Models, Multimodal Large Language, Large Language, current Large Language, Language Models
Comments: Technical report, 29 pages, 24 figures
Abstract:As the development of Large Models (LMs) progresses rapidly, their safety is also a priority. In current safety workflows for Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), evaluation, diagnosis, and alignment are often handled by separate tools. Specifically, safety evaluation can only locate external behavioral risks but cannot figure out internal root causes. Meanwhile, safety diagnosis often drifts from concrete risk scenarios and remains at the explainable level. In this way, safety alignment lacks dedicated explanations of changes in internal mechanisms, potentially degrading general capabilities. To systematically address these issues, we propose an open-source project, namely DeepSight, to practice a new safety evaluation-diagnosis integrated paradigm. DeepSight is a low-cost, reproducible, efficient, and highly scalable large-scale model safety evaluation project consisting of an evaluation toolkit, DeepSafe, and a diagnosis toolkit, DeepScan. By unifying task and data protocols, we build a connection between the two stages and transform safety evaluation from black-box to white-box insight. Besides, DeepSight is the first open-source toolkit that supports frontier AI risk evaluation and joint safety evaluation and diagnosis.
26. 【2602.12078】Tiny Recursive Reasoning with Mamba-2 Attention Hybrid
Link: https://arxiv.org/abs/2602.12078
Authors: Wenlong Wang,Fergal Reid
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: emitting intermediate tokens, achieve strong performance, abstract reasoning tasks, Recent work, hidden representation space
Comments:
Abstract:Recent work on recursive reasoning models like TRM demonstrates that tiny networks (7M parameters) can achieve strong performance on abstract reasoning tasks through latent recursion -- iterative refinement in hidden representation space without emitting intermediate tokens. This raises a natural question about operator choice: Mamba-2's state space recurrence is itself a form of iterative refinement, making it a natural candidate for recursive reasoning -- but does introducing Mamba-2 into the recursive scaffold preserve reasoning capability? We investigate this by replacing the Transformer blocks in TRM with Mamba-2 hybrid operators while maintaining parameter parity (6.83M vs 6.86M parameters). On ARC-AGI-1, we find that the hybrid improves pass@2 (the official metric) by +2.0% (45.88% vs 43.88%) and consistently outperforms at higher K values (+4.75% at pass@100), whilst maintaining pass@1 parity. This suggests improved candidate coverage -- the model generates correct solutions more reliably -- with similar top-1 selection. Our results validate that Mamba-2 hybrid operators preserve reasoning capability within the recursive scaffold, establishing SSM-based operators as viable candidates in the recursive operator design space and taking a first step towards understanding the best mixing strategies for recursive reasoning.
27. 【2602.12036】Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models
Link: https://arxiv.org/abs/2602.12036
Authors: Xin Xu,Clive Bai,Kai Yang,Tianhao Chen,Yangkun Chen,Weijie Liu,Hao Chen,Yang Wang,Saiyong Yang,Can Yang
Subjects: Computation and Language (cs.CL)
Keywords: Reinforcement Learning, success of Reinforcement, Large-scale verifiable prompts, Large-scale verifiable, Verifiable Rewards
Comments:
Abstract:Large-scale verifiable prompts underpin the success of Reinforcement Learning with Verifiable Rewards (RLVR), but they contain many uninformative examples and are costly to expand further. Recent studies focus on better exploiting limited training data by prioritizing hard prompts whose rollout pass rate is 0. However, easy prompts with a pass rate of 1 also become increasingly prevalent as training progresses, thereby reducing the effective data size. To mitigate this, we propose Composition-RL, a simple yet useful approach for better utilizing limited verifiable prompts targeting pass-rate-1 prompts. More specifically, Composition-RL automatically composes multiple problems into a new verifiable question and uses these compositional prompts for RL training. Extensive experiments across model sizes from 4B to 30B show that Composition-RL consistently improves reasoning capability over RL trained on the original dataset. Performance can be further boosted with a curriculum variant of Composition-RL that gradually increases compositional depth over training. Additionally, Composition-RL enables more effective cross-domain RL by composing prompts drawn from different domains. Codes, datasets, and models are available at this https URL.
28. 【2602.12018】Artificial intelligence is creating a new global linguistic hierarchy
Link: https://arxiv.org/abs/2602.12018
Authors: Giulia Occhini,Kumiko Tanaka-Ishii,Anna Barford,Refael Tikochinski,Songbo Hu,Roi Reichart,Yijie Zhou,Hannah Claus,Ulla Petti,Ivan Vulić,Ramit Debnath,Anna Korhonen
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
Keywords: Artificial intelligence, benefits remain concentrated, transform healthcare, governance and socioeconomic, socioeconomic equity
Comments:
Abstract:Artificial intelligence (AI) has the potential to transform healthcare, education, governance and socioeconomic equity, but its benefits remain concentrated in a small number of languages (Bender, 2019; Blasi et al., 2022; Joshi et al., 2020; Ranathunga and de Silva, 2022; Young, 2015). Language AI - the technologies that underpin widely-used conversational systems such as ChatGPT - could provide major benefits if available in people's native languages, yet most of the world's 7,000+ linguistic communities currently lack access and face persistent digital marginalization. Here we present a global longitudinal analysis of social, economic and infrastructural conditions across languages to assess systemic inequalities in language AI. We first analyze the existence of AI resources for 6003 languages. We find that despite efforts of the community to broaden the reach of language technologies (Bapna et al., 2022; Costa-Jussà et al., 2022), the dominance of a handful of languages is exacerbating disparities on an unprecedented scale, with divides widening exponentially rather than narrowing. Further, we contrast the longitudinal diffusion of AI with that of earlier IT technologies, revealing a distinctive hype-driven pattern of spread. To translate our findings into practical insights and guide prioritization efforts, we introduce the Language AI Readiness Index (EQUATE), which maps the state of technological, socio-economic, and infrastructural prerequisites for AI deployment across languages. The index highlights communities where capacity exists but remains underutilized, and provides a framework for accelerating more equitable diffusion of language AI. Our work contributes to setting the baseline for a transition towards more sustainable and equitable language technologies.
29. 【2602.12015】Disentangling Ambiguity from Instability in Large Language Models: A Clinical Text-to-SQL Case Study
Link: https://arxiv.org/abs/2602.12015
Authors: Angelo Ziletti,Leonardo D'Ambrosi
Subjects: Computation and Language (cs.CL)
Keywords: Deploying large language, trigger human review, Deploying large, trigger clarification, trigger human
Comments:
Abstract:Deploying large language models for clinical Text-to-SQL requires distinguishing two qualitatively different causes of output diversity: (i) input ambiguity that should trigger clarification, and (ii) model instability that should trigger human review. We propose CLUES, a framework that models Text-to-SQL as a two-stage process (interpretations -- answers) and decomposes semantic uncertainty into an ambiguity score and an instability score. The instability score is computed via the Schur complement of a bipartite semantic graph matrix. Across AmbigQA/SituatedQA (gold interpretations) and a clinical Text-to-SQL benchmark (known interpretations), CLUES improves failure prediction over state-of-the-art Kernel Language Entropy. In deployment settings, it remains competitive while providing a diagnostic decomposition unavailable from a single score. The resulting uncertainty regimes map to targeted interventions - query refinement for ambiguity, model improvement for instability. The high-ambiguity/high-instability regime contains 51% of errors while covering 25% of queries, enabling efficient triage.
30. 【2602.12005】LaCy: What Small Language Models Can and Should Learn is Not Just a Question of Loss
Link: https://arxiv.org/abs/2602.12005
Authors: Szilvia Ujváry,Louis Béthune,Pierre Ablin,João Monteiro,Marco Cuturi,Michael Kirchhof
Subjects: Computation and Language (cs.CL)
Keywords: Small Language Models, Language models, parameter size, Small Language, world knowledge
Comments: 29 pages, 24 figures, 5 tables, preprint
Abstract:Language models have consistently grown to compress more world knowledge into their parameters, but the knowledge that can be pretrained into them is upper-bounded by their parameter size. The capacity of Small Language Models (SLMs) is especially limited, leading to factually incorrect generations. This problem is often mitigated by giving the SLM access to an outside source: the ability to query a larger model, documents, or a database. Under this setting, we study the fundamental question of which tokens an SLM can and should learn during pretraining, versus which ones it should delegate via a CALL token. We find that this is not simply a question of loss: although the loss is predictive of whether a predicted token mismatches the ground-truth, some tokens are acceptable in that they are truthful alternative continuations of a pretraining document, and should not trigger a CALL even if their loss is high. We find that a spaCy grammar parser can help augment the loss signal to decide which tokens the SLM should learn to delegate to prevent factual errors and which are safe to learn and predict even under high losses. We propose LaCy, a novel pretraining method based on this token selection philosophy. Our experiments demonstrate that LaCy models successfully learn which tokens to predict and where to delegate for help. This results in higher FactScores when generating in a cascade with a bigger model and outperforms Rho or LLM-judge trained SLMs, while being simpler and cheaper.
31. 【2602.11982】Automatic Simplification of Common Vulnerabilities and Exposures Descriptions
Link: https://arxiv.org/abs/2602.11982
Authors: Varpu Vehomäki,Kimmo K. Kaski
Subjects: Computation and Language (cs.CL)
Keywords: Understanding cyber security, Understanding cyber, cyber security, individuals and organizations, increasingly important
Comments: 8 pages, 1 figure, submitted to Nordic Machine Intelligence
Abstract:Understanding cyber security is increasingly important for individuals and organizations. However, a lot of information related to cyber security can be difficult to understand for those not familiar with the topic. In this study, we focus on investigating how large language models (LLMs) could be utilized in automatic text simplification (ATS) of Common Vulnerability and Exposure (CVE) descriptions. Automatic text simplification has been studied in several contexts, such as medical, scientific, and news texts, but it has not yet been studied to simplify texts in the rapidly changing and complex domain of cyber security. We created a baseline for cyber security ATS and a test dataset of 40 CVE descriptions, evaluated by two groups of cyber security experts in two survey rounds. We have found that while out-of-the-box LLMs can make the text appear simpler, they struggle with meaning preservation. Code and data are available at this https URL\_nmi.
32. 【2602.11968】DHPLT: large-scale multilingual diachronic corpora and word representations for semantic change modelling
Link: https://arxiv.org/abs/2602.11968
Authors: Mariia Fedorova,Andrey Kutuzov,Khonzoda Umarova
Subjects: Computation and Language (cs.CL)
Keywords: present DHPLT, web-crawled HPLT datasets, diverse languages, DHPLT, HPLT datasets
Comments: LChange'26 workshop at the EACL 2026 conference
Abstract:In this resource paper, we present DHPLT, an open collection of diachronic corpora in 41 diverse languages. DHPLT is based on the web-crawled HPLT datasets; we use web crawl timestamps as the approximate signal of document creation time. The collection covers three time periods: 2011-2015, 2020-2021 and 2024-present (1 million documents per time period for each language). We additionally provide pre-computed word type and token embeddings and lexical substitutions for our chosen target words, while at the same time leaving it open for other researchers to come up with their own target words using the same datasets. DHPLT aims at filling in the current lack of multilingual diachronic corpora for semantic change modelling (beyond a dozen high-resource languages). It opens the way for a variety of new experimental setups in this field. All the resources described in this paper are available at this https URL, sorted by language.
33. 【2602.11961】Scaling Model and Data for Multilingual Machine Translation with Open Large Language Models
Link: https://arxiv.org/abs/2602.11961
Authors: Yuzhe Shang,Pengzhi Gao,Wei Liu,Jian Luan,Jinsong Su
Subjects: Computation and Language (cs.CL)
Keywords: demonstrated improving multilingual, improving multilingual capabilities, Open large language, open LLMs, adapting open LLMs
Comments:
Abstract:Open large language models (LLMs) have demonstrated improving multilingual capabilities in recent years. In this paper, we present a study of open LLMs for multilingual machine translation (MT) across a range of languages, and investigate the effects of model scaling and data scaling when adapting open LLMs to multilingual MT through continual pretraining and instruction finetuning. Based on the Gemma3 model family, we develop MiLMMT-46, which achieves top-tier multilingual translation performance across 46 languages. Extensive experiments show that MiLMMT-46 consistently outperforms recent state-of-the-art (SOTA) models, including Seed-X, HY-MT-1.5, and TranslateGemma, and achieves competitive performance with strong proprietary systems such as Google Translate and Gemini 3 Pro.
34. 【2602.11960】Benchmarking Vision-Language Models for French PDF-to-Markdown Conversion
Link: https://arxiv.org/abs/2602.11960
Authors: Bruno Rigal,Victor Dupriez,Alexis Mignon,Ronan Le Hy,Nicolas Mery
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: challenging French documents, recent Vision-Language Models, challenging French, French documents, report evaluates
Comments: 13 pages, 6 figures
Abstract:This report evaluates PDF-to-Markdown conversion using recent Vision-Language Models (VLMs) on challenging French documents. Document parsing is a critical step for Retrieval-Augmented Generation (RAG) pipelines, where transcription and layout errors propagate to downstream retrieval and grounding. Existing benchmarks often emphasize English or Chinese and can over-penalize benign formatting and linearization choices (e.g., line breaks, list segmentation, alternative table renderings) that are largely irrelevant for downstream use. We introduce a French-focused benchmark of difficult pages selected via model-disagreement sampling from a corpus of 60,000 documents, covering handwritten forms, complex layouts, dense tables, and graphics-rich pages. Evaluation is performed with unit-test-style checks that target concrete failure modes (text presence, reading order, and local table constraints) combined with category-specific normalization designed to discount presentation-only variance. Across 15 models, we observe substantially higher robustness for the strongest proprietary models on handwriting and forms, while several open-weights systems remain competitive on standard printed layouts.
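The unit-test-style evaluation mentioned in this abstract could look roughly like the sketch below. It is an assumption-laden illustration, not the benchmark's toolkit: the function names and the normalization rules are hypothetical, and only two of the three failure modes named in the abstract (text presence and reading order) are covered.

```python
import re
import unicodedata


def normalize(text):
    """Discount presentation-only variance: Unicode form, Markdown markers, case, whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"[*_#>|`]", " ", text)  # drop common Markdown markup characters
    return re.sub(r"\s+", " ", text).strip().lower()


def check_text_present(markdown, snippet):
    """Pass if the snippet occurs in the output after normalization."""
    return normalize(snippet) in normalize(markdown)


def check_reading_order(markdown, snippets):
    """Pass if every snippet occurs and each one appears after the previous one."""
    haystack = normalize(markdown)
    pos = 0
    for snippet in snippets:
        needle = normalize(snippet)
        idx = haystack.find(needle, pos)
        if idx < 0:
            return False
        pos = idx + len(needle)
    return True


page = "# Facture\n\n**Montant**   total : 120 €\n\n| Article | Prix |\n|---|---|\n| Café | 3 € |\n"
print(check_text_present(page, "Montant total"))
print(check_reading_order(page, ["Facture", "Montant total", "Café"]))
```

Checks of this kind judge whether specific content survived conversion, so a model is not penalized for choosing a different but equally usable table or line-break rendering.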
35. 【2602.11958】RAM-Net: Expressive Linear Attention with Selectively Addressable Memory
Link: https://arxiv.org/abs/2602.11958
Authors: Kaicheng Xiao,Haotian Li,Liran Dong,Guoliang Xing
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: offer efficient inference, compressing unbounded history, inherently limits expressivity, architectures offer efficient, fixed-size memory inherently
Comments:
Abstract:While linear attention architectures offer efficient inference, compressing unbounded history into a fixed-size memory inherently limits expressivity and causes information loss. To address this limitation, we introduce Random Access Memory Network (RAM-Net), a novel architecture designed to bridge the gap between the representational capacity of full attention and the memory efficiency of linear models. The core of RAM-Net maps inputs to high-dimensional sparse vectors serving as explicit addresses, allowing the model to selectively access a massive memory state. This design enables exponential state size scaling without additional parameters, which significantly mitigates signal interference and enhances retrieval fidelity. Moreover, the inherent sparsity ensures exceptional computational efficiency, as state updates are confined to minimal entries. Extensive experiments demonstrate that RAM-Net consistently surpasses state-of-the-art baselines in fine-grained long-range retrieval tasks and achieves competitive performance in standard language modeling and zero-shot commonsense reasoning benchmarks, validating its superior capability to capture complex dependencies with significantly reduced computational overhead.
36. 【2602.11939】Do Large Language Models Adapt to Language Variation across Socioeconomic Status?
Link: https://arxiv.org/abs/2602.11939
Authors: Elisa Bassignana,Mike Zhang,Dirk Hovy,Amanda Cercas Curry
Subjects: Computation and Language (cs.CL)
Keywords: Humans adjust, Humans, SES, social, LLMs
Comments:
Abstract:Humans adjust their linguistic style to the audience they are addressing. However, the extent to which LLMs adapt to different social contexts is largely unknown. As these models increasingly mediate human-to-human communication, their failure to adapt to diverse styles can perpetuate stereotypes and marginalize communities whose linguistic norms are less closely mirrored by the models, thereby reinforcing social stratification. We study the extent to which LLMs integrate into social media communication across different socioeconomic status (SES) communities. We collect a novel dataset from Reddit and YouTube, stratified by SES. We prompt four LLMs with incomplete text from that corpus and compare the LLM-generated completions to the originals along 94 sociolinguistic metrics, including syntactic, rhetorical, and lexical features. LLMs modulate their style with respect to SES to only a minor extent, often resulting in approximation or caricature, and tend to emulate the style of upper SES more effectively. Our findings (1) show how LLMs risk amplifying linguistic hierarchies and (2) call into question their validity for agent-based social simulation, survey experiments, and any research relying on language style as a social signal.
37. 【2602.11938】Who is the richest club in the championship? Detecting and Rewriting Underspecified Questions Improve QA Performance
Link: https://arxiv.org/abs/2602.11938
Authors: Yunchong Huang, Gianni Barlacchi, Sandro Pezzelle
Categories: Computation and Language (cs.CL)
Keywords: Large language models, Large language, standard question-answering, Large, underspecified questions
Comments: 4 pages of main text, 13 pages in total, 5 tables and 10 figures in total
Click to view abstract
Abstract:Large language models (LLMs) perform well on well-posed questions, yet standard question-answering (QA) benchmarks remain far from solved. We argue that this gap is partly due to underspecified questions - queries whose interpretation cannot be uniquely determined without additional context. To test this hypothesis, we introduce an LLM-based classifier to identify underspecified questions and apply it to several widely used QA datasets, finding that 16% to over 50% of benchmark questions are underspecified and that LLMs perform significantly worse on them. To isolate the effect of underspecification, we conduct a controlled rewriting experiment that serves as an upper-bound analysis, rewriting underspecified questions into fully specified variants while holding gold answers fixed. QA performance consistently improves under this setting, indicating that many apparent QA failures stem from question underspecification rather than model limitations. Our findings highlight underspecification as an important confound in QA evaluation and motivate greater attention to question clarity in benchmark design.
38. 【2602.11933】Cross-Modal Robustness Transfer (CMRT): Training Robust Speech Translation Models Using Adversarial Text
Link: https://arxiv.org/abs/2602.11933
Authors: Abderrahmane Issam, Yusuf Can Semerci, Jan Scholtes, Gerasimos Spanakis
Categories: Computation and Language (cs.CL)
Keywords: Speech Translation, significant advancements, benchmarked on curated, primarily benchmarked, Speech
Comments:
Click to view abstract
Abstract:End-to-End Speech Translation (E2E-ST) has seen significant advancements, yet current models are primarily benchmarked on curated, "clean" datasets. This overlooks critical real-world challenges, such as morphological robustness to inflectional variations common in non-native or dialectal speech. In this work, we adapt a text-based adversarial attack targeting inflectional morphology to the speech domain and demonstrate that state-of-the-art E2E-ST models are highly vulnerable to it. While adversarial training effectively mitigates such risks in text-based tasks, generating high-quality adversarial speech data remains computationally expensive and technically challenging. To address this, we propose Cross-Modal Robustness Transfer (CMRT), a framework that transfers adversarial robustness from the text modality to the speech modality. Our method eliminates the requirement for adversarial speech data during training. Extensive experiments across four language pairs demonstrate that CMRT improves adversarial robustness by an average of more than 3 BLEU points, establishing a new baseline for robust E2E-ST without the overhead of generating adversarial speech.
39. 【2602.11931】AdaptEvolve: Improving Efficiency of Evolutionary AI Agents through Adaptive Model Selection
Link: https://arxiv.org/abs/2602.11931
Authors: Pretam Ray, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: agentic systems intensify, repeatedly invoking large, invoking large language, Evolutionary agentic systems, large language models
Comments: 8 pages, 2 figures
Click to view abstract
Abstract:Evolutionary agentic systems intensify the trade-off between computational efficiency and reasoning capability by repeatedly invoking large language models (LLMs) during inference. This setting raises a central question: how can an agent dynamically select an LLM that is sufficiently capable for the current generation step while remaining computationally efficient? While model cascades offer a practical mechanism for balancing this trade-off, existing routing strategies typically rely on static heuristics or external controllers and do not explicitly account for model uncertainty. We introduce AdaptEvolve: Adaptive LLM Selection for Multi-LLM Evolutionary Refinement within an evolutionary sequential refinement framework that leverages intrinsic generation confidence to estimate real-time solvability. Empirical results show that confidence-driven selection yields a favourable Pareto frontier, reducing total inference cost by an average of 37.9% across benchmarks while retaining 97.5% of the upper-bound accuracy of static large-model baselines. Our code is available at this https URL.
40. 【2602.11908】When Should LLMs Be Less Specific? Selective Abstraction for Reliable Long-Form Text Generation
Link: https://arxiv.org/abs/2602.11908
Authors: Shani Goren, Ido Galil, Ran El-Yaniv
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: erode user trust, remain prone, errors that erode, erode user, user trust
Comments:
Click to view abstract
Abstract:LLMs are widely used, yet they remain prone to factual errors that erode user trust and limit adoption in high-risk settings. One approach to mitigate this risk is to equip models with uncertainty estimation mechanisms that abstain when confidence is low. However, this binary "all-or-nothing" approach is excessively restrictive in long-form settings, often discarding valuable information. We introduce Selective Abstraction (SA), a framework that enables LLMs to trade specificity for reliability by selectively reducing the detail of uncertain content. We first formalize SA through the lenses of selective risk and coverage. We then propose Atom-wise Selective Abstraction, a claim-level instantiation that decomposes responses into atomic claims (short, self-contained statements each expressing a single fact) and replaces uncertain atoms with higher confidence, less specific abstractions. To evaluate this framework, we develop a novel end-to-end pipeline for open-ended generation that instantiates risk as factual correctness and measures coverage using an information-theoretic measure of retained information. Across six open-source models on the FactScore and LongFact-Objects benchmarks, atom-wise SA consistently outperforms existing baselines, improving the area under the risk-coverage curve (AURC) by up to 27.73% over claim removal, demonstrating that reducing specificity can boost accuracy and reliability while preserving most of their original meaning.
41. 【2602.11898】Benchmark Illusion: Disagreement among LLMs and Its Scientific Consequences
Link: https://arxiv.org/abs/2602.11898
Authors: Eddie Yang, Dashun Wang
Categories: Computation and Language (cs.CL)
Keywords: large language models, measured and trusted, underpin how progress, progress in large, large language
Comments:
Click to view abstract
Abstract:Benchmarks underpin how progress in large language models (LLMs) is measured and trusted. Yet our analyses reveal that apparent convergence in benchmark accuracy can conceal deep epistemic divergence. Using two major reasoning benchmarks - MMLU-Pro and GPQA - we show that LLMs achieving comparable accuracy still disagree on 16-66% of items, and 16-38% among top-performing frontier models. These discrepancies suggest distinct error profiles for different LLMs. When such models are used for scientific data annotation and inference, their hidden disagreements propagate into research results: in re-analyses of published studies in education and political science, switching the annotation model can change estimated treatment effects by more than 80%, and in some cases reverses their sign. Together, these findings illustrate a benchmark illusion, where equal accuracy may conceal disagreement, with model choice becoming a hidden yet consequential variable for scientific reproducibility.
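To make the distinction between accuracy and agreement concrete, here is a minimal sketch of how two models can score identically while still disagreeing on many items. The helper names and the toy answer lists are made up for illustration; they are not the paper's data or code.

```python
from typing import List


def accuracy(preds: List[str], gold: List[str]) -> float:
    """Fraction of items where the prediction matches the gold label."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def disagreement_rate(preds_a: List[str], preds_b: List[str]) -> float:
    """Fraction of items where the two models give different answers."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


# Toy multiple-choice data (hypothetical): ten items with gold labels.
gold = ["A", "B", "C", "D", "A", "B", "C", "D", "A", "B"]
# Model A is wrong on the last two items, model B on the first two.
model_a = ["A", "B", "C", "D", "A", "B", "C", "D", "B", "C"]
model_b = ["C", "D", "C", "D", "A", "B", "C", "D", "A", "B"]

print(accuracy(model_a, gold), accuracy(model_b, gold))
print(disagreement_rate(model_a, model_b))
```

In this toy case both models reach the same accuracy, yet they give different answers on every item that either one gets wrong, which is the kind of gap a single accuracy number hides.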
42. 【2602.11886】LLM-based Triplet Extraction from Financial Reports
Link: https://arxiv.org/abs/2602.11886
Authors: Dante Wesslund, Ville Stenström, Pontus Linde, Alexander Holmberg
Categories: Computation and Language (cs.CL)
Keywords: Knowledge Graph construction, annotated ground truth, Knowledge Graph, makes evaluation difficult, domain makes evaluation
Comments:
Click to view abstract
Abstract:Corporate financial reports are a valuable source of structured knowledge for Knowledge Graph construction, but the lack of annotated ground truth in this domain makes evaluation difficult. We present a semi-automated pipeline for Subject-Predicate-Object triplet extraction that uses ontology-driven proxy metrics, specifically Ontology Conformance and Faithfulness, instead of ground-truth-based evaluation. We compare a static, manually engineered ontology against a fully automated, document-specific ontology induction approach across different LLMs and two corporate annual reports. The automatically induced ontology achieves 100% schema conformance in all configurations, eliminating the ontology drift observed with the manual approach. We also propose a hybrid verification strategy that combines regex matching with an LLM-as-a-judge check, reducing apparent subject hallucination rates from 65.2% to 1.6% by filtering false positives caused by coreference resolution. Finally, we identify a systematic asymmetry between subject and object hallucinations, which we attribute to passive constructions and omitted agents in financial prose.
43. 【2602.11877】Towards Fair and Comprehensive Evaluation of Routers in Collaborative LLM Systems
Link: https://arxiv.org/abs/2602.11877
Authors: Wanxing Wu, He Zhu, Yixia Li, Lei Yang, Jiehui Zhao, Hongru Wang, Jian Yang, Benyou Wang, Bingyi Jing, Guanhua Chen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: privacy constraints necessitate, constraints necessitate deploying, necessitate deploying smaller, offloading complex queries, Large language models
Comments: Our code is publicly available at [this https URL](https://github.com/zhuchichi56/RouterXBench)
Click to view abstract
Abstract:Large language models (LLMs) have achieved success, but cost and privacy constraints necessitate deploying smaller models locally while offloading complex queries to cloud-based models. Existing router evaluations are unsystematic, overlooking scenario-specific requirements and out-of-distribution robustness. We propose RouterXBench, a principled evaluation framework with three dimensions: router ability, scenario alignment, and cross-domain robustness. Unlike prior work that relies on output probabilities or external embeddings, we utilize internal hidden states that capture model uncertainty before answer generation. We introduce ProbeDirichlet, a lightweight router that aggregates cross-layer hidden states via learnable Dirichlet distributions with probabilistic training. Trained on multi-domain data, it generalizes robustly across in-domain and out-of-distribution scenarios. Our results show ProbeDirichlet achieves 16.68% and 18.86% relative improvements over the best baselines in router ability and high-accuracy scenarios, with consistent performance across model families, model scales, heterogeneous tasks, and agentic workflows.
44. 【2602.11871】DMAP: A Distribution Map for Text
Link: https://arxiv.org/abs/2602.11871
Authors: Tom Kempton, Julia Rozanova, Parameswaran Kamalaruban, Maeve Madigan, Karolina Wresilo, Yoann L. Launay, David Sutton, Stuart Burrell
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Large Language Models, Large Language, probability distributions offering, next-token probability distributions, powerful tool
Comments: ICLR 2026
Click to view abstract
Abstract:Large Language Models (LLMs) are a powerful tool for statistical text analysis, with derived sequences of next-token probability distributions offering a wealth of information. Extracting this signal typically relies on metrics such as perplexity, which do not adequately account for context; how one should interpret a given next-token probability is dependent on the number of reasonable choices encoded by the shape of the conditional distribution. In this work, we present DMAP, a mathematically grounded method that maps a text, via a language model, to a set of samples in the unit interval that jointly encode rank and probability information. This representation enables efficient, model-agnostic analysis and supports a range of applications. We illustrate its utility through three case studies: (i) validation of generation parameters to ensure data integrity, (ii) examining the role of probability curvature in machine-generated text detection, and (iii) a forensic analysis revealing statistical fingerprints left in downstream models that have been subject to post-training on synthetic data. Our results demonstrate that DMAP offers a unified statistical view of text that is simple to compute on consumer hardware, widely applicable, and provides a foundation for further research into text analysis with LLMs.
45. 【2602.11861】A$^{2}$V-SLP: Alignment-Aware Variational Modeling for Disentangled Sign Language Production
Link: https://arxiv.org/abs/2602.11861
Authors: Sümeyye Meryem Taşyürek, Enis Mücahid İskender, Hacer Yalim Keles
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: structural disentanglement frameworks, recent structural disentanglement, alignment-aware variational framework, Building upon recent, sign language production
Comments: 9 pages, 2 figures, 8 tables
Click to view abstract
Abstract:Building upon recent structural disentanglement frameworks for sign language production, we propose A$^{2}$V-SLP, an alignment-aware variational framework that learns articulator-wise disentangled latent distributions rather than deterministic embeddings. A disentangled Variational Autoencoder (VAE) encodes ground-truth sign pose sequences and extracts articulator-specific mean and variance vectors, which are used as distributional supervision for training a non-autoregressive Transformer. Given text embeddings, the Transformer predicts both latent means and log-variances, while the VAE decoder reconstructs the final sign pose sequences through stochastic sampling at the decoding stage. This formulation maintains articulator-level representations by avoiding deterministic latent collapse through distributional latent modeling. In addition, we integrate a gloss attention mechanism to strengthen alignment between linguistic input and articulated motion. Experimental results show consistent gains over deterministic latent regression, achieving state-of-the-art back-translation performance and improved motion realism in a fully gloss-free setting.
46. 【2602.11858】Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
Link: https://arxiv.org/abs/2602.11858
Authors: Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai, Siyuan Li, Huijia Zhu, Weiqiang Wang, Linghe Kong, Yue Wang, Zhuosheng Zhang, Weiran Huang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Multimodal Large Language, Large Language Models, Large Language, broad visual understanding, Multimodal Large
Comments:
Click to view abstract
Abstract:Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine-grained perception, where decisive evidence is small and easily overwhelmed by global context. Recent "Thinking-with-Images" methods alleviate this by iteratively zooming in and out of regions of interest during inference, but incur high latency due to repeated tool calls and visual re-encoding. To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM. In particular, we first zoom in to micro-cropped regions to let strong teacher models generate high-quality VQA data, and then distill this region-grounded supervision back to the full image. After training on such data, the smaller student model improves "single-glance" fine-grained perception without tool use. To rigorously evaluate this capability, we further present ZoomBench, a hybrid-annotated benchmark of 845 VQA data spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global-regional "zooming gap". Experiments show that our models achieve leading performance across multiple fine-grained perception benchmarks, and also improve general multimodal cognition on benchmarks such as visual reasoning and GUI agents. We further discuss when "Thinking-with-Images" is necessary versus when its gains can be distilled into a single forward pass. Our code is available at this https URL.
47. 【2602.11852】Prototype Transformer: Towards Language Model Architectures Interpretable by Design
Link: https://arxiv.org/abs/2602.11852
Authors: Yordan Yordanov, Matteo Forasassi, Bayar Menzat, Ruizhi Wang, Chang Qi, Markus Kaltenberger, Amine M'Charrak, Tommaso Salvatori, Thomas Lukasiewicz
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: remains largely opaque, reasoning remains largely, surpass the vast, undermining trust, vast majority
Comments: Preprint under review. Equal contribution: Yordan Yordanov and Matteo Forasassi. 39 pages, 25 figures, 22 tables
Click to view abstract
Abstract:While state-of-the-art language models (LMs) surpass the vast majority of humans in certain domains, their reasoning remains largely opaque, undermining trust in their output. Furthermore, while autoregressive LMs can output explicit reasoning, their true reasoning process is opaque, which introduces risks like deception and hallucination. In this work, we introduce the Prototype Transformer (ProtoT) -- an autoregressive LM architecture based on prototypes (parameter vectors), posed as an alternative to the standard self-attention-based transformers. ProtoT works by means of two-way communication between the input sequence and the prototypes, and we show that this leads to the prototypes automatically capturing nameable concepts (e.g. "woman") during training. They provide the potential to interpret the model's reasoning and allow for targeted edits of its behavior. Furthermore, by design, the prototypes create communication channels that aggregate contextual information at different time scales, aiding interpretability. In terms of computation scalability, ProtoT scales linearly with sequence length vs the quadratic scalability of SOTA self-attention transformers. Compared to baselines, ProtoT scales well with model and data size, and performs well on text generation and downstream tasks (GLUE). ProtoT exhibits robustness to input perturbations on par with or better than some baselines, but differs from them by providing interpretable pathways showing how robustness and sensitivity arise. Reaching close to the performance of state-of-the-art architectures, ProtoT paves the way to creating well-performing autoregressive LMs interpretable by design.
48. 【2602.11795】A Subword Embedding Approach for Variation Detection in Luxembourgish User Comments
Link: https://arxiv.org/abs/2602.11795
Authors: Anne-Marie Lutgen, Alistair Plum, Christoph Purschke
Categories: Computation and Language (cs.CL)
Keywords: predefined variant lists, variant lists, paper presents, presents an embedding-based, relying on prior
Comments:
Click to view abstract
Abstract:This paper presents an embedding-based approach to detecting variation without relying on prior normalisation or predefined variant lists. The method trains subword embeddings on raw text and groups related forms through combined cosine and n-gram similarity. This allows spelling and morphological diversity to be examined and analysed as linguistic structure rather than treated as noise. Using a large corpus of Luxembourgish user comments, the approach uncovers extensive lexical and orthographic variation that aligns with patterns described in dialectal and sociolinguistic research. The induced families capture systematic correspondences and highlight areas of regional and stylistic differentiation. The procedure does not strictly require manual annotation, but does produce transparent clusters that support both quantitative and qualitative analysis. The results demonstrate that distributional modelling can reveal meaningful patterns of variation even in "noisy" or low-resource settings, offering a reproducible methodological framework for studying language variety in multilingual and small-language contexts.
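The abstract mentions grouping forms by "combined cosine and n-gram similarity" without giving the formula. The sketch below shows one plausible way such a score could be built. The weighted average, the `alpha` parameter, the Jaccard overlap on padded character trigrams, and the toy vectors are all assumptions made here for illustration, not the authors' method.

```python
import math
from typing import Dict, List, Set


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def char_ngrams(word: str, n: int = 3) -> Set[str]:
    """Character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_jaccard(w1: str, w2: str, n: int = 3) -> float:
    """Jaccard overlap between the character n-gram sets of two words."""
    a, b = char_ngrams(w1, n), char_ngrams(w2, n)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def combined_similarity(w1: str, w2: str, vectors: Dict[str, List[float]],
                        alpha: float = 0.5, n: int = 3) -> float:
    """Hypothetical combination: weighted mean of cosine and n-gram scores."""
    return (alpha * cosine(vectors[w1], vectors[w2])
            + (1 - alpha) * ngram_jaccard(w1, w2, n))


# Toy embeddings (made up); a real setup would use trained subword vectors.
toy_vectors = {"color": [1.0, 0.1], "colour": [0.9, 0.2], "table": [0.0, 1.0]}
print(combined_similarity("color", "colour", toy_vectors))
print(combined_similarity("color", "table", toy_vectors))
```

Pairs that score above a chosen threshold could then be linked into variant families. Mixing the two signals guards against forms that merely look alike and against forms that only share contexts.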
49. 【2602.11793】More Haste, Less Speed: Weaker Single-Layer Watermark Improves Distortion-Free Watermark Ensembles
Link: https://arxiv.org/abs/2602.11793
Authors: Ruibo Chen, Yihan Wu, Xuehao Cui, Jingqi Zhang, Heng Huang
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Keywords: large language models, attributing content generated, language models, crucial technique, technique for detecting
Comments:
Click to view abstract
Abstract:Watermarking has emerged as a crucial technique for detecting and attributing content generated by large language models. While recent advancements have utilized watermark ensembles to enhance robustness, prevailing methods typically prioritize maximizing the strength of the watermark at every individual layer. In this work, we identify a critical limitation in this "stronger-is-better" approach: strong watermarks significantly reduce the entropy of the token distribution, which paradoxically weakens the effectiveness of watermarking in subsequent layers. We theoretically and empirically show that detectability is bounded by entropy and that watermark ensembles induce a monotonic decrease in both entropy and the expected green-list ratio across layers. To address this inherent trade-off, we propose a general framework that utilizes weaker single-layer watermarks to preserve the entropy required for effective multi-layer ensembling. Empirical evaluations demonstrate that this counter-intuitive strategy mitigates signal decay and consistently outperforms strong baselines in both detectability and robustness.
50. 【2602.11792】Detecting RLVR Training Data via Structural Convergence of Reasoning
Link: https://arxiv.org/abs/2602.11792
Authors: Hongbo Zhang, Yue Yang, Jianhao Yan, Guangsheng Bao, Yue Zhang, Yue Zhang
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: data raises concerns, Reinforcement learning, undisclosed training data, training data raises, learning with verifiable
Comments: Preprint
Click to view abstract
Abstract:Reinforcement learning with verifiable rewards (RLVR) is central to training modern reasoning models, but the undisclosed training data raises concerns about benchmark contamination. Unlike pretraining methods, which optimize models using token-level probabilities, RLVR fine-tunes models based on reward feedback from self-generated reasoning trajectories, making conventional likelihood-based detection methods less effective. We show that RLVR induces a distinctive behavioral signature: prompts encountered during RLVR training result in more rigid and similar generations, while unseen prompts retain greater diversity. We introduce Min-$k$NN Distance, a simple black-box detector that quantifies this collapse by sampling multiple completions for a given prompt and computing the average of the $k$ smallest nearest-neighbor edit distances. Min-$k$NN Distance requires no access to the reference model or token probabilities. Experiments across multiple RLVR-trained reasoning models show that Min-$k$NN Distance reliably distinguishes RL-seen examples from unseen ones and outperforms existing membership inference and RL contamination detection baselines.
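The detector described above can be sketched in a few lines, going only by the one-sentence description in the abstract. The sketch assumes character-level Levenshtein distance with no length normalization; the paper may tokenize or normalize differently, and the function names are illustrative.

```python
from typing import List, Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def min_knn_distance(completions: List[str], k: int) -> float:
    """Average of the k smallest nearest-neighbor edit distances.

    For each sampled completion, find its distance to the closest other
    completion, then average the k smallest of those values.
    """
    if len(completions) < 2:
        raise ValueError("need at least two completions")
    if k < 1:
        raise ValueError("k must be at least 1")
    nearest = sorted(
        min(edit_distance(c, other)
            for j, other in enumerate(completions) if j != i)
        for i, c in enumerate(completions)
    )
    k = min(k, len(nearest))
    return sum(nearest[:k]) / k


# Toy completions (made up): two near-duplicates and one outlier.
samples = ["abc", "abd", "xyz"]
print(min_knn_distance(samples, k=2))
```

Under the abstract's hypothesis, a low score (many near-duplicate completions) would suggest the prompt was seen during RLVR training. The decision threshold would have to be calibrated on prompts known to be unseen.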
51. 【2602.11790】Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation
Link: https://arxiv.org/abs/2602.11790
Authors: Lingyong Yan, Jiulong Wu, Dong Xie, Weixian Shi, Deguo Xia, Jizhou Huang
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: oriented content creation, models demonstrate impressive, demonstrate impressive performance, visually oriented content, require strict logical
Comments: For more information, visit the project website: [this https URL](https://robitsg.github.io/LASEV/)
Click to view abstract
Abstract:Although recent end-to-end video generation models demonstrate impressive performance in visually oriented content creation, they remain limited in scenarios that require strict logical rigor and precise knowledge representation, such as instructional and educational media. To address this problem, we propose LAVES, a hierarchical LLM-based multi-agent system for generating high-quality instructional videos from educational problems. LAVES formulates educational video generation as a multi-objective task that simultaneously demands correct step-by-step reasoning, pedagogically coherent narration, semantically faithful visual demonstrations, and precise audio-visual alignment. To address the limitations of prior approaches (including low procedural fidelity, high production cost, and limited controllability), LAVES decomposes the generation workflow into specialized agents coordinated by a central Orchestrating Agent with explicit quality gates and iterative critique mechanisms. Specifically, the Orchestrating Agent supervises a Solution Agent for rigorous problem solving, an Illustration Agent that produces executable visualization code, and a Narration Agent for learner-oriented instructional scripts. In addition, all outputs from the working agents are subject to semantic critique, rule-based constraints, and tool-based compilation checks. Rather than directly synthesizing pixels, the system constructs a structured executable video script that is deterministically compiled into synchronized visuals and narration using template-driven assembly rules, enabling fully automated end-to-end production without manual editing. In large-scale deployments, LAVES achieves a throughput exceeding one million videos per day, delivering over a 95% reduction in cost compared to current industry-standard approaches while maintaining a high acceptance rate.
52. 【2602.11767】TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents
Link: https://arxiv.org/abs/2602.11767
Authors: Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Holger Boche
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: large language models, Advances in large, language models, large language, driving a shift
Comments:
Click to view abstract
Abstract:Advances in large language models (LLMs) are driving a shift toward using reinforcement learning (RL) to train agents from iterative, multi-turn interactions across tasks. However, multi-turn RL remains challenging as rewards are often sparse or delayed, and environments can be stochastic. In this regime, naive trajectory sampling can hinder exploitation and induce mode collapse. We propose TSR (Trajectory-Search Rollouts), a training-time approach that repurposes test-time scaling ideas for improved per-turn rollout generation. TSR performs lightweight tree-style search to construct high-quality trajectories by selecting high-scoring actions at each turn using task-specific feedback. This improves rollout quality and stabilizes learning while leaving the underlying optimization objective unchanged, making TSR optimizer-agnostic. We instantiate TSR with best-of-N, beam, and shallow lookahead search, and pair it with PPO and GRPO, achieving up to 15% performance gains and more stable learning on Sokoban, FrozenLake, and WebShop tasks at a one-time increase in training compute. By moving search from inference time to the rollout stage of training, TSR provides a simple and general mechanism for stronger multi-turn agent learning, complementary to existing frameworks and rejection-sampling-style selection methods.
53. 【2602.11761】MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling
Link: https://arxiv.org/abs/2602.11761
Authors: MiniCPM Team: Wenhao An, Yingfa Chen, Yewei Fang, Jiayi Li, Xin Li, Yaohui Li, Yishan Li, Yuxuan Li, Biyuan Lin, Chuan Liu, Hezi Liu, Siyuan Liu, Hongya Lyu, Yinxu Pan, Shixin Ren, Xingyu Shen, Zhou Su, Haojun Sun, Yangang Sun, Zhen Leng Thai, Xin Tian, Rui Wang, Xiaorong Wang, Yudong Wang, Bo Wu, Xiaoyue Xu, Dong Xu, Shuaikang Xue, Jiawei Yang, Bowen Zhang, Jinqian Zhang, Letian Zhang, Shengnan Zhang, Xinyu Zhang, Xinyuan Zhang, Zhu Zhang, Hengyu Zhao, Jiacheng Zhao, Jie Zhou, Zihan Zhou, Shuo Wang, Chaojun Xiao, Xu Han, Zhiyuan Liu, Maosong Sun
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: faces challenges posed, large language models, ultra-long contexts faces, contexts faces challenges, Transformer architecture
Comments: MiniCPM-SALA Technical Report
Click to view abstract
Abstract:The evolution of large language models (LLMs) towards applications with ultra-long contexts faces challenges posed by the high computational and memory costs of the Transformer architecture. While existing sparse and linear attention mechanisms attempt to mitigate these issues, they typically involve a trade-off between memory efficiency and model performance. This paper introduces MiniCPM-SALA, a 9B-parameter hybrid architecture that integrates the high-fidelity long-context modeling of sparse attention (InfLLM-V2) with the global efficiency of linear attention (Lightning Attention). By employing a layer selection algorithm to integrate these mechanisms in a 1:3 ratio and utilizing a hybrid positional encoding (HyPE), the model maintains efficiency and performance for long-context tasks. Furthermore, we introduce a cost-effective continual training framework that transforms pre-trained Transformer-based models into hybrid models, which reduces training costs by approximately 75% compared to training from scratch. Extensive experiments show that MiniCPM-SALA maintains general capabilities comparable to full-attention models while offering improved efficiency. On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention model at the sequence length of 256K tokens and supports context lengths of up to 1M tokens, a scale where traditional full-attention 8B models fail because of memory constraints.
54. 【2602.11748】Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning
Link: https://arxiv.org/abs/2602.11748
Authors: Futing Wang, Jianhao Yan, Yun Luo, Ganqu Cui, Zhi Wang, Xiaoye Qu, Yue Zhang, Yu Cheng, Tao Lin
Categories: Computation and Language (cs.CL)
Keywords: Achieving effective test-time, Shallow Exploration Trap, Toggle, test-time scaling requires, refine multiple reasoning
Comments:
Click to view abstract
Abstract:Achieving effective test-time scaling requires models to engage in In-Context Exploration -- the intrinsic ability to generate, verify, and refine multiple reasoning hypotheses within a single continuous context. Grounded in State Coverage theory, our analysis identifies a critical bottleneck to enabling this capability: while broader state coverage requires longer reasoning trajectories, the probability of sampling such sequences decays exponentially during autoregressive generation, a phenomenon we term the "Shallow Exploration Trap". To bridge this gap, we propose Length-Incentivized Exploration. This simple yet effective recipe explicitly encourages models to explore more via a length-based reward coupled with a redundancy penalty, thereby maximizing state coverage in a two-step manner. Comprehensive experiments across different models (Qwen3, Llama) demonstrate that Length-Incentivized Exploration effectively incentivizes in-context exploration. As a result, our method achieves an average improvement of 4.4% on in-domain tasks and a 2.7% gain on out-of-domain benchmarks.
55. 【2602.11737】Mask What Matters: Mitigating Object Hallucinations in Multimodal Large Language Models with Object-Aligned Visual Contrastive Decoding
Link: https://arxiv.org/abs/2602.11737
Authors: Boqi Chen,Xudong Liu,Jianing Qiu
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Large Language Models, Multimodal Large Language, Language Models, Multimodal Large, Large Language
Comments:
Click to view abstract
Abstract:We study object hallucination in Multimodal Large Language Models (MLLMs) and improve visual contrastive decoding (VCD) by constructing an object-aligned auxiliary view. We leverage object-centric attention in self-supervised Vision Transformers. In particular, we remove the most salient visual evidence to construct an auxiliary view that disrupts unsupported tokens and produces a stronger contrast signal. Our method is prompt-agnostic, model-agnostic, and can be seamlessly plugged into the existing VCD pipeline with little computation overhead, i.e., a single cacheable forward pass. Empirically, our method demonstrates consistent gains on two popular object hallucination benchmarks across two MLLMs.
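Visual contrastive decoding, which this paper builds on, compares next-token logits computed from the original image against logits computed from an auxiliary view. The sketch below shows only that contrast step, assuming the commonly cited form (1 + alpha) * original - alpha * auxiliary. The abstract does not state the formula, so the function name, `alpha`, and the toy logits are illustrative assumptions, and constructing the object-aligned auxiliary view is not shown.

```python
from typing import List, Sequence


def contrastive_logits(orig_logits: Sequence[float],
                       aux_logits: Sequence[float],
                       alpha: float = 1.0) -> List[float]:
    """Amplify what the original view supports and subtract what the
    auxiliary (evidence-removed) view still predicts."""
    return [(1.0 + alpha) * o - alpha * a
            for o, a in zip(orig_logits, aux_logits)]


# Token 0 scores high in both views, so it is driven by the language prior.
# Token 1 drops once the visual evidence is removed, so it is visually grounded.
orig = [3.0, 2.5]
aux = [3.0, 0.5]
scores = contrastive_logits(orig, aux)
print(scores, "-> argmax:", scores.index(max(scores)))
```

In this toy case plain decoding would pick token 0, while the contrasted scores favor token 1, the token whose support depends on the image.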
56. 【2602.11731】Thinking with Drafting: Optical Decompression via Logical Reconstruction
Link: https://arxiv.org/abs/2602.11731
Authors: Jingxuan Wei,Honghao He,Caijun Jia,Siyuan Li,Zheng Sun,Yuhang Xu,Yuanyuan Lin,Linzhuang Sun,Yuchen Wu,Bihui Yu,Xiangxiang Zhang,Cheng Tan
Categories: Computation and Language (cs.CL)
Keywords: Existing multimodal large, Existing multimodal, multimodal large language, achieved high-fidelity visual, large language models
Comments:
Click to view abstract
Abstract:Existing multimodal large language models have achieved high-fidelity visual perception and exploratory visual generation. However, a precision paradox persists in complex reasoning tasks: optical perception systems transcribe symbols without capturing logical topology, while pixel-based generative models produce visual artifacts lacking mathematical exactness. To bridge this gap, we propose that reasoning over visual inputs be reconceptualized as optical decompression, the process of reconstructing latent logical structures from compressed visual tokens. Guided by the axiom that Parsing is Reasoning, we introduce Thinking with Drafting (TwD), which utilizes a minimalist Domain-Specific Language (DSL) as a grounding intermediate representation. Unlike standard approaches that hallucinate answers directly, TwD forces the model to draft its mental model into executable code, rendering deterministic visual proofs for self-verification. To validate this, we present VisAlg, a visual algebra benchmark. Experiments demonstrate that TwD serves as a superior cognitive scaffold. Our work establishes a closed-loop system where visual generation acts not as a creative output but as a logical verifier, offering a generalizable path for visual reasoning.
57. 【2602.11715】DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels
Link: https://arxiv.org/abs/2602.11715
Authors: Haolei Bai,Lingcheng Kong,Xueyi Chen,Jianmian Wang,Zhiqiang Tao,Huan Wang
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: CUDA kernel generation, CUDA kernel, parallel token generation, CUDA, kernel generation
Comments:
Click to view abstract
Abstract:Diffusion large language models (dLLMs) have emerged as a compelling alternative to autoregressive (AR) LLMs, owing to their capacity for parallel token generation. This paradigm is particularly well-suited for code generation, where holistic structural planning and non-sequential refinement are critical. Despite this potential, tailoring dLLMs for CUDA kernel generation remains challenging, obstructed not only by the high specialization but also by the severe lack of high-quality training data. To address these challenges, we construct CuKe, an augmented supervised fine-tuning dataset optimized for high-performance CUDA kernels. On top of it, we propose a bi-phase curated reinforcement learning (BiC-RL) framework consisting of a CUDA kernel infilling stage and an end-to-end CUDA kernel generation stage. Leveraging this training framework, we introduce DICE, a series of diffusion large language models designed for CUDA kernel generation, spanning three parameter scales, 1.7B, 4B, and 8B. Extensive experiments on KernelBench demonstrate that DICE significantly outperforms both autoregressive and diffusion LLMs of comparable scale, establishing a new state-of-the-art for CUDA kernel generation.
58. 【2602.11699】Finding Sense in Nonsense with Generated Contexts: Perspectives from Humans and Language Models
Link: https://arxiv.org/abs/2602.11699
Authors: Katrin Olsen,Sebastian Padó
Categories: Computation and Language (cs.CL)
Keywords: semantic interpretation, development of computational, computational models, models of semantic, Nonsensical
Comments:
Click to view abstract
Abstract:Nonsensical and anomalous sentences have been instrumental in the development of computational models of semantic interpretation. A core challenge is to distinguish between what is merely anomalous (but can be interpreted given a supporting context) and what is truly nonsensical. However, it is unclear (a) how nonsensical, rather than merely anomalous, existing datasets are; and (b) how well LLMs can make this distinction. In this paper, we answer both questions by collecting sensicality judgments from human raters and LLMs on sentences from five semantically deviant datasets: both context-free and when providing a context. We find that raters consider most sentences at most anomalous, and only a few as properly nonsensical. We also show that LLMs are substantially skilled in generating plausible contexts for anomalous cases.
59. 【2602.11684】PatientHub: A Unified Framework for Patient Simulation
Link: https://arxiv.org/abs/2602.11684
Authors: Sahand Sabour,TszYam NG,Minlie Huang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Keywords: Large Language Models, Language Models increasingly, Models increasingly power, power role-playing applications, scaling therapeutic assessment
Comments: Work in progress
Click to view abstract
Abstract:As Large Language Models increasingly power role-playing applications, simulating patients has become a valuable tool for training counselors and scaling therapeutic assessment. However, prior work is fragmented: existing approaches rely on incompatible, non-standardized data formats, prompts, and evaluation metrics, hindering reproducibility and fair comparison. In this paper, we introduce PatientHub, a unified and modular framework that standardizes the definition, composition, and deployment of simulated patients. To demonstrate PatientHub's utility, we implement several representative patient simulation methods as case studies, showcasing how our framework supports standardized cross-method evaluation and the seamless integration of custom evaluation metrics. We further demonstrate PatientHub's extensibility by prototyping two new simulator variants, highlighting how PatientHub accelerates method development by eliminating infrastructure overhead. By consolidating existing work into a single reproducible pipeline, PatientHub lowers the barrier to developing new simulation methods and facilitates cross-method and cross-model benchmarking. Our framework provides a practical foundation for future datasets, methods, and benchmarks in patient-centered dialogue, and the code is publicly available via this https URL.
60. 【2602.11683】ThinkRouter: Efficient Reasoning via Routing Thinking between Latent and Discrete Spaces
Link: https://arxiv.org/abs/2602.11683
Authors: Xin Xu,Tong Yu,Xiang Chen,Haoliang Wang,Julian McAuley,Saayan Mitra
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Recent work explores, Recent work, work explores latent, improve reasoning efficiency, varies across settings
Comments: Work in Progress
Click to view abstract
Abstract:Recent work explores latent reasoning to improve reasoning efficiency by replacing explicit reasoning trajectories with continuous representations in a latent space, yet its effectiveness varies across settings. Analysis of model confidence dynamics under latent reasoning reveals that thinking trajectories ending in incorrect answers contain fewer low-confidence steps than those ending in correct answers. Meanwhile, we suggest that soft embeddings aggregated by multiple low-confidence thinking alternatives may introduce and propagate noise, leading to high confidence in unreliable reasoning trajectories. Motivated by these observations, ThinkRouter, an inference-time confidence-aware routing mechanism, is proposed to avoid high confidence and noise for efficient reasoning. ThinkRouter routes thinking to the discrete token space when model confidence is low, and to the latent space otherwise. Extensive experiments on STEM reasoning and coding benchmarks across diverse large reasoning models demonstrate that ThinkRouter outperforms explicit CoT, random routing, and latent reasoning baselines in terms of accuracy, achieving an average improvement of 19.70 points in Pass@1, while reducing generation length by up to 15.55%. Further comprehensive analysis reveals that ThinkRouter can calibrate errors arising from explicit CoT and latent reasoning, and accelerates end-of-thinking token generation by globally lowering model confidence.
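The routing rule described in the abstract (discrete token space when confidence is low, latent space otherwise) can be sketched for a single decoding step. This is a minimal illustration rather than the authors' implementation: using the maximum next-token probability as the confidence measure, the threshold `tau`, greedy token choice in the discrete branch, and the function names are all assumptions.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_step(logits: Sequence[float],
               embeddings: Sequence[Sequence[float]],
               tau: float = 0.5) -> Tuple[str, List[float]]:
    """Return the routing mode and the next input embedding for one step."""
    probs = softmax(logits)
    confidence = max(probs)  # assumed confidence measure
    if confidence < tau:
        # Low confidence: commit to one discrete token (greedy here) so that
        # several weak alternatives are not blended into a noisy soft embedding.
        token = probs.index(confidence)
        return "discrete", list(embeddings[token])
    # Otherwise stay in latent space: probability-weighted mix of embeddings.
    dim = len(embeddings[0])
    soft = [sum(p * emb[d] for p, emb in zip(probs, embeddings))
            for d in range(dim)]
    return "latent", soft


vocab_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(route_step([0.0, 0.0, 0.0], vocab_embeddings))   # flat distribution
print(route_step([10.0, 0.0, 0.0], vocab_embeddings))  # peaked distribution
```

With a flat distribution the step falls back to a single token embedding, and with a peaked distribution it passes the soft mixture forward.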
61. 【2602.11666】PhyNiKCE: A Neurosymbolic Agentic Framework for Autonomous Computational Fluid Dynamics
Link: https://arxiv.org/abs/2602.11666
Authors: E Fan,Lisong Shi,Zhengtong Li,Chih-yung Wen
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Computational Fluid Dynamics, Large Language Models, Fluid Dynamics, Computational Fluid, Large Language
Comments: 30 pages, 10 figures
Click to view abstract
Abstract:The deployment of autonomous agents for Computational Fluid Dynamics (CFD) is critically limited by the probabilistic nature of Large Language Models (LLMs), which struggle to enforce the strict conservation laws and numerical stability required for physics-based simulations. Reliance on purely semantic Retrieval Augmented Generation (RAG) often leads to "context poisoning," where agents generate linguistically plausible but physically invalid configurations due to a fundamental Semantic-Physical Disconnect. To bridge this gap, this work introduces PhyNiKCE (Physical and Numerical Knowledgeable Context Engineering), a neurosymbolic agentic framework for trustworthy engineering. Unlike standard black-box agents, PhyNiKCE decouples neural planning from symbolic validation. It employs a Symbolic Knowledge Engine that treats simulation setup as a Constraint Satisfaction Problem, rigidly enforcing physical constraints via a Deterministic RAG Engine with specialized retrieval strategies for solvers, turbulence models, and boundary conditions. Validated through rigorous OpenFOAM experiments on practical, non-tutorial CFD tasks using Gemini-2.5-Pro/Flash, PhyNiKCE demonstrates a 96% relative improvement over state-of-the-art baselines. Furthermore, by replacing trial-and-error with knowledge-driven initialization, the framework reduced autonomous self-correction loops by 59% while simultaneously lowering LLM token consumption by 17%. These results demonstrate that decoupling neural generation from symbolic constraint enforcement significantly enhances robustness and efficiency. While validated on CFD, this architecture offers a scalable, auditable paradigm for Trustworthy Artificial Intelligence in broader industrial automation.
62. 【2602.11650】Which Feedback Works for Whom? Differential Effects of LLM-Generated Feedback Elements Across Learner Profiles
Link: https://arxiv.org/abs/2602.11650
Authors: Momoka Furuhashi,Kouta Nakayama,Noboru Kawai,Takashi Kodama,Saku Sugawara,Kyosuke Takami
Categories: Computation and Language (cs.CL)
Keywords: Large language models, Large language, automatically generating feedback, language models, promise for automatically
Comments: Under Review
Click to view abstract
Abstract:Large language models (LLMs) show promise for automatically generating feedback in education settings. However, it remains unclear how specific feedback elements, such as tone and information coverage, contribute to learning outcomes and learner acceptance, particularly across learners with different personality traits. In this study, we define six feedback elements and generate feedback for multiple-choice biology questions using GPT-5. We conduct a learning experiment with 321 first-year high school students and evaluate feedback effectiveness using two learning-outcome measures and subjective evaluations across six criteria. We further analyze how feedback acceptance varies across learners based on Big Five personality traits. Our results show that effective feedback elements share common patterns supporting learning outcomes, while learners' subjective preferences differ across personality-based clusters. These findings highlight the importance of selecting and adapting feedback elements according to learners' personality traits when we design LLM-generated feedback, and provide practical implications for personalized feedback design in education.
63. 【2602.11639】PACE: Prefix-Protected and Difficulty-Aware Compression for Efficient Reasoning
Link: https://arxiv.org/abs/2602.11639
Authors: Ruixiang Feng,Yuntao Wen,Silin Zhou,Ke Shi,Yifan Wang,Ran Le,Zhenwei An,Zongchao Chen,Chen Yang,Guangyue Peng,Yiming Jia,Dongsheng Wang,Tao Zhang,Lisi Chen,Yang Song,Shen Gao,Shuo Shang
Categories: Computation and Language (cs.CL)
Keywords: producing excessively long, scaling test-time computation, Language Reasoning Models, excessively long reasoning, long reasoning traces
Comments:
Click to view abstract
Abstract:Language Reasoning Models (LRMs) achieve strong performance by scaling test-time computation but often suffer from "overthinking", producing excessively long reasoning traces that increase latency and memory usage. Existing LRMs typically enforce conciseness with uniform length penalties, which over-compress crucial early deduction steps at the sequence level and indiscriminately penalize all queries at the group level. To address these limitations, we propose PACE, a dual-level framework for prefix-protected and difficulty-aware compression under hierarchical supervision. At the sequence level, prefix-protected optimization employs decaying mixed rollouts to maintain valid reasoning paths while promoting conciseness. At the group level, difficulty-aware penalty dynamically scales length constraints based on query complexity, maintaining exploration for harder questions while curbing redundancy on easier ones. Extensive experiments on DeepSeek-R1-Distill-Qwen (1.5B/7B) demonstrate that PACE achieves a substantial reduction in token usage (up to 55.7%) while simultaneously improving accuracy (up to 4.1%) on math benchmarks, with generalization ability to code, science, and general domains.
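The group-level idea of scaling the length penalty by query difficulty can be illustrated as follows. This is a sketch under stated assumptions, not the PACE objective: estimating difficulty from the pass rate of a rollout group, the linear scaling, the `base_penalty` value, and the function name are all hypothetical.

```python
from typing import List, Sequence


def difficulty_aware_rewards(correct: Sequence[bool],
                             lengths: Sequence[int],
                             max_len: int,
                             base_penalty: float = 0.5) -> List[float]:
    """Reward each rollout in a group sampled for one query.

    The length penalty grows with the group's pass rate, so easy queries
    (high pass rate) are pushed toward brevity while hard queries
    (low pass rate) keep room for long exploration.
    """
    pass_rate = sum(correct) / len(correct)  # assumed difficulty proxy
    coeff = base_penalty * pass_rate
    return [float(c) - coeff * min(n, max_len) / max_len
            for c, n in zip(correct, lengths)]


# Easy query: every rollout is correct, so long answers are penalized heavily.
print(difficulty_aware_rewards([True] * 4, [100, 200, 300, 400], max_len=400))
# Hard query: one rollout in four is correct, so the penalty is much weaker.
print(difficulty_aware_rewards([False, False, False, True],
                               [400, 400, 400, 400], max_len=400))
```

A uniform penalty would treat both groups identically, which is the behavior the abstract criticizes.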
64. 【2602.11607】Scene-Aware Memory Discrimination: Deciding Which Personal Knowledge Stays
Link: https://arxiv.org/abs/2602.11607
Authors: Yijie Zhong,Mengying Guo,Zewei Wang,Zhongyang Li,Dandan Tu,Haofen Wang
Categories: Computation and Language (cs.CL)
Keywords: generating vast amounts, Intelligent devices, form valuable personal, everyday life, generating vast
Comments: Accepted by Knowledge-Based Systems. License: CC BY-NC-ND
Click to view abstract
Abstract:Intelligent devices have become deeply integrated into everyday life, generating vast amounts of user interactions that form valuable personal knowledge. Efficient organization of this knowledge in user memory is essential for enabling personalized applications. However, current research on memory writing, management, and reading using large language models (LLMs) faces challenges in filtering irrelevant information and in dealing with rising computational costs. Inspired by the concept of selective attention in the human brain, we introduce a memory discrimination task. To address large-scale interactions and diverse memory standards in this task, we propose a Scene-Aware Memory Discrimination method (SAMD), which comprises two key components: the Gating Unit Module (GUM) and the Cluster Prompting Module (CPM). GUM enhances processing efficiency by filtering out non-memorable interactions and focusing on the salient content most relevant to application demands. CPM establishes adaptive memory standards, guiding LLMs to discern what information should be remembered or discarded. It also analyzes the relationship between user intents and memory contexts to build effective clustering prompts. Comprehensive direct and indirect evaluations demonstrate the effectiveness and generalization of our approach. We independently assess the performance of memory discrimination, showing that SAMD successfully recalls the majority of memorable data and remains robust in dynamic scenarios. Furthermore, when integrated into personalized applications, SAMD significantly enhances both the efficiency and quality of memory construction, leading to better organization of personal knowledge.
65. 【2602.11581】Analytical Search
Link: https://arxiv.org/abs/2602.11581
Authors: Yiteng Tu,Shuo Miao,Weihang Su,Yiqun Liu,Qingyao Ai
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: causal impact assessment, domains including law, Analytical, impact assessment, including law
Comments:
Click to view abstract
Abstract:Analytical information needs, such as trend analysis and causal impact assessment, are prevalent across various domains including law, finance, science, and much more. However, existing information retrieval paradigms, whether based on relevance-oriented document ranking or retrieval-augmented generation (RAG) with large language models (LLMs), often struggle to meet the end-to-end requirements of such tasks at the corpus scale. They either emphasize information finding rather than end-to-end problem solving, or simply treat everything as naive question answering, offering limited control over reasoning, evidence usage, and verifiability. As a result, they struggle to support analytical queries that have diverse utility concepts and high accountability requirements. In this paper, we propose analytical search as a distinct and emerging search paradigm designed to fulfill these analytical information needs. Analytical search reframes search as an evidence-governed, process-oriented analytical workflow that explicitly models analytical intent, retrieves evidence for fusion, and produces verifiable conclusions through structured, multi-step inference. We position analytical search in contrast to existing paradigms, and present a unified system framework that integrates query understanding, recall-oriented retrieval, reasoning-aware fusion, and adaptive verification. We also discuss potential research directions for the construction of analytical search engines. In this way, we highlight the conceptual significance and practical importance of analytical search and call on efforts toward the next generation of search engines that support analytical information needs.
66. 【2602.11570】PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering
Link: https://arxiv.org/abs/2602.11570
Authors: Xiangfeng Wang,Hangyu Guo,Yanlin Lai,Mitt Huang,Liang Zhao,Chengyuan Yao,Yinmin Zhang,Qi Han,Xiaoxiao Ren,Chun Yuan,Tong Xu,Zheng Ge,Xiangyu Zhang,Daxin Jiang
Categories: Computation and Language (cs.CL)
Keywords: scaling Reinforcement Learning, Reinforcement Learning, Learning with Verifiable, neglecting potential errors, scaling Reinforcement
Comments:
Click to view abstract
Abstract:While model-based verifiers are essential for scaling Reinforcement Learning with Verifiable Rewards (RLVR), current outcome-centric verification paradigms primarily focus on the consistency between the final result and the ground truth, often neglecting potential errors in the derivation process. This leads to assigning positive rewards to correct answers produced from incorrect derivations. To bridge this gap, we introduce PRIME, a benchmark for evaluating verifiers on Process-Outcome Alignment verification in Mathematics and Engineering. Curated from a comprehensive collection of college-level STEM problems, PRIME comprises 2,530 high-difficulty samples through a consistency-based filtering pipeline. Through extensive evaluation, we find that current verifiers frequently fail to detect derivation flaws. Furthermore, we propose a process-aware RLVR training paradigm utilizing verifiers selected via PRIME. This approach substantially outperforms the outcome-only verification baseline, achieving absolute performance gains of 8.29%, 9.12%, and 7.31% on AIME24, AIME25, and Beyond-AIME, respectively, for the Qwen3-14B-Base model. Finally, we demonstrate a strong linear correlation ($R^2 0.92$) between verifier accuracy on PRIME and RLVR training effectiveness, validating PRIME as a reliable predictor for verifier selection.
67. 【2602.11551】SIGHT: Reinforcement Learning with Self-Evidence and Information-Gain Diverse Branching for Search Agent
Link: https://arxiv.org/abs/2602.11551
Authors: Wenlin Zhong,Jinluan Yang,Yiquan Wu,Yi Liu,Jianhang Yao,Kun Kuang
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, empowered Large Language, Reinforcement Learning, Language Models, Large Language
Comments:
Click to view abstract
Abstract:Reinforcement Learning (RL) has empowered Large Language Models (LLMs) to master autonomous search for complex question answering. However, particularly within multi-turn search scenarios, this interaction introduces a critical challenge: search results often suffer from high redundancy and low signal-to-noise ratios. Consequently, agents easily fall into "Tunnel Vision," where the forced interpretation of early noisy retrievals leads to irreversible error accumulation. To address these challenges, we propose SIGHT, a framework that enhances search-based reasoning through Self-Evidence Support (SES) and Information-Gain Driven Diverse Branching. SIGHT distills search results into high-fidelity evidence via SES and calculates an Information Gain score to pinpoint pivotal states where observations maximally reduce uncertainty. This score guides Dynamic Prompting Interventions (including de-duplication, reflection, or adaptive branching) to spawn new branches with SES. Finally, by integrating SES and correctness rewards via Group Relative Policy Optimization, SIGHT internalizes robust exploration strategies without external verifiers. Experiments on single-hop and multi-hop QA benchmarks demonstrate that SIGHT significantly outperforms existing approaches, particularly in complex reasoning scenarios, using fewer search steps.
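An Information Gain score that identifies states where an observation most reduces uncertainty can be illustrated with the textbook entropy-reduction definition. The abstract does not give SIGHT's exact formula, so treating the agent's state as a probability distribution over candidate answers, the function names, and the toy belief sequence are assumptions made for this sketch.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def information_gain(prior: Sequence[float],
                     posterior: Sequence[float]) -> float:
    """Uncertainty removed by one observation: H(prior) - H(posterior)."""
    return entropy(prior) - entropy(posterior)


def pivotal_step(beliefs: Sequence[Sequence[float]]) -> int:
    """Index of the search step whose observation yields the largest gain.

    beliefs[t] is the (hypothetical) answer distribution before step t,
    and beliefs[t + 1] is the distribution after observing its results.
    """
    gains: List[float] = [information_gain(beliefs[t], beliefs[t + 1])
                          for t in range(len(beliefs) - 1)]
    return max(range(len(gains)), key=gains.__getitem__)


beliefs = [
    [0.25, 0.25, 0.25, 0.25],  # before any search
    [0.25, 0.25, 0.25, 0.25],  # step 0 retrieved redundant, uninformative text
    [1.0, 0.0, 0.0, 0.0],      # step 1 retrieved decisive evidence
]
print(pivotal_step(beliefs))
```

A step with near-zero gain corresponds to the redundant retrievals the abstract describes, while a high-gain step marks a state worth branching from.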
68. 【2602.11543】Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm
Link: https://arxiv.org/abs/2602.11543
Authors: Jinrui Zhang,Chaodong Xiao,Aoqi Wu,Xindong Zhang,Lei Zhang
Categories: Computation and Language (cs.CL)
Keywords: Pretraining large language, large language models, typically requires centralized, requires centralized clusters, typically requires
Comments:
Click to view abstract
Abstract:Pretraining large language models (LLMs) typically requires centralized clusters with thousands of high-memory GPUs (e.g., H100/A100). Recent decentralized training methods reduce communication overhead by employing federated optimization; however, they still need to train the entire model on each node, remaining constrained by GPU memory limitations. In this work, we propose SParse Expert Synchronization (SPES), a memory-efficient decentralized framework for pretraining mixture-of-experts (MoE) LLMs. SPES trains only a subset of experts per node, substantially lowering the memory footprint. Each node updates its local experts and periodically synchronizes with other nodes, eliminating full-parameter transmission while ensuring efficient knowledge sharing. To accelerate convergence, we introduce an expert-merging warm-up strategy, where experts exchange knowledge early in training, to rapidly establish foundational capabilities. With SPES, we train a 2B-parameter MoE LLM using 16 standalone 48GB GPUs over internet connections, which achieves competitive performance with centrally trained LLMs under similar computational budgets. We further demonstrate scalability by training a 7B model from scratch and a 9B model upcycled from a dense checkpoint, both of which match prior centralized baselines. Our code is available at this https URL.
69. 【2602.11528】Stop Tracking Me! Proactive Defense Against Attribute Inference Attack in LLMs
Link: https://arxiv.org/abs/2602.11528
Authors: Dong Yan,Jian Liang,Ran He,Tieniu Tan
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: large-scale privacy breaches, Recent studies, text shared online, infer private user, user-generated text shared
Comments: Accepted at ICLR 2026
Click to view abstract
Abstract:Recent studies have shown that large language models (LLMs) can infer private user attributes (e.g., age, location, gender) from user-generated text shared online, enabling rapid and large-scale privacy breaches. Existing anonymization-based defenses are coarse-grained, lacking word-level precision in anonymizing privacy-leaking elements. Moreover, they are inherently limited as altering user text to hide sensitive cues still allows attribute inference to occur through models' reasoning capabilities. To address these limitations, we propose a unified defense framework that combines fine-grained anonymization (TRACE) with inference-preventing optimization (RPS). TRACE leverages attention mechanisms and inference chain generation to identify and anonymize privacy-leaking textual elements, while RPS employs a lightweight two-stage optimization strategy to induce model rejection behaviors, thereby preventing attribute inference. Evaluations across diverse LLMs show that TRACE-RPS reduces attribute inference accuracy from around 50% to below 5% on open-source models. In addition, our approach offers strong cross-model generalization, prompt-variation robustness, and utility-privacy tradeoffs. Our code is available at this https URL.
70. 【2602.11524】Adaptive Milestone Reward for GUI Agents
Link: https://arxiv.org/abs/2602.11524
Authors: Congmin Zheng,Xiaoyun Mo,Xinbei Ma,Qiqiang Lin,Yin Zhao,Jiachen Zhu,Xingyu Lou,Jun Wang,Zhaoxiang Wang,Weiwen Liu,Zhuosheng Zhang,Yong Yu,Weinan Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Mobile GUI Agents, training Mobile GUI, Reinforcement Learning, GUI Agents, Mobile GUI
Comments:
Click to view abstract
Abstract:Reinforcement Learning (RL) has emerged as a mainstream paradigm for training Mobile GUI Agents, yet it struggles with the temporal credit assignment problem inherent in long-horizon tasks. A primary challenge lies in the trade-off between reward fidelity and density: outcome reward offers high fidelity but suffers from signal sparsity, while process reward provides dense supervision but remains prone to bias and reward hacking. To resolve this conflict, we propose the Adaptive Milestone Reward (ADMIRE) mechanism. ADMIRE constructs a verifiable, adaptive reward system by anchoring trajectory to milestones, which are dynamically distilled from successful explorations. Crucially, ADMIRE integrates an asymmetric credit assignment strategy that denoises successful trajectories and scaffolds failed trajectories. Extensive experiments demonstrate that ADMIRE consistently yields over 10% absolute improvement in success rate across different base models on AndroidWorld. Moreover, the method exhibits robust generalizability, achieving strong performance across diverse RL algorithms and heterogeneous environments such as web navigation and embodied tasks.
71. 【2602.11509】Multimodal Fact-Level Attribution for Verifiable Reasoning
Link: https://arxiv.org/abs/2602.11509
Authors: David Wan,Han Wang,Ziyang Wang,Elias Stengel-Eskin,Hyunji Lee,Mohit Bansal
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: individual factual claims, real-world tasks involving, tasks involving multi-step, verifying individual factual, Multimodal large language
Comments: 29 pages. Code and data are available at https://github.com/meetdavidwan/murgat
Click to view abstract
Abstract:Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation, where reliability requires grounding model outputs in heterogeneous input sources and verifying individual factual claims. However, existing multimodal grounding benchmarks and evaluation methods focus on simplified, observation-based scenarios or limited modalities and fail to assess attribution in complex multimodal reasoning. We introduce MuRGAt (Multimodal Reasoning with Grounded Attribution), a benchmark for evaluating fact-level multimodal attribution in settings that require reasoning beyond direct observation. Given inputs spanning video, audio, and other modalities, MuRGAt requires models to generate answers with explicit reasoning and precise citations, where each citation specifies both modality and temporal segments. To enable reliable assessment, we introduce an automatic evaluation framework that strongly correlates with human judgments. Benchmarking with human and automated scores reveals that even strong MLLMs frequently hallucinate citations despite correct reasoning. Moreover, we observe a key trade-off: increasing reasoning depth or enforcing structured grounding often degrades accuracy, highlighting a significant gap between internal reasoning and verifiable attribution.
72. 【2602.11495】Jailbreaking Leaves a Trace: Understanding and Detecting Jailbreak Attacks from Internal Representations of Large Language Models
Link: https://arxiv.org/abs/2602.11495
Authors: Sri Durga Sai Sowmya Kadali,Evangelos E. Papalexakis
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Keywords: Jailbreaking large language, critical security challenge, large language models, conversational AI systems, large language
Comments:
Abstract:Jailbreaking large language models (LLMs) has emerged as a critical security challenge with the widespread deployment of conversational AI systems. Adversarial users exploit these models through carefully crafted prompts to elicit restricted or unsafe outputs, a phenomenon commonly referred to as Jailbreaking. Despite numerous proposed defense mechanisms, attackers continue to develop adaptive prompting strategies, and existing models remain vulnerable. This motivates approaches that examine the internal behavior of LLMs rather than relying solely on prompt-level defenses. In this work, we study jailbreaking from both security and interpretability perspectives by analyzing how internal representations differ between jailbreak and benign prompts. We conduct a systematic layer-wise analysis across multiple open-source models, including GPT-J, LLaMA, Mistral, and the state-space model Mamba, and identify consistent latent-space patterns associated with harmful inputs. We then propose a tensor-based latent representation framework that captures structure in hidden activations and enables lightweight jailbreak detection without model fine-tuning or auxiliary LLM-based detectors. We further demonstrate that the latent signals can be used to actively disrupt jailbreak execution at inference time. On an abliterated LLaMA-3.1-8B model, selectively bypassing high-susceptibility layers blocks 78% of jailbreak attempts while preserving benign behavior on 94% of benign prompts. This intervention operates entirely at inference time and introduces minimal overhead, providing a scalable foundation for achieving stronger coverage by incorporating additional attack distributions or more refined susceptibility thresholds. Our results provide evidence that jailbreak behavior is rooted in identifiable internal structures and suggest a complementary, architecture-agnostic direction for improving LLM security.
73. 【2602.11488】When Audio-LLMs Don't Listen: A Cross-Linguistic Study of Modality Arbitration
Link: https://arxiv.org/abs/2602.11488
Authors: Jayadev Billa
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Keywords: text dominance, text, explicitly instructed, instructed to trust, audio
Comments: 25 pages, 18 tables, 8 languages, benchmark and code at [this https URL](https://github.com/jb1999/alme-benchmark)
Abstract:When audio and text conflict, speech-enabled language models follow the text 10 times more often than when arbitrating between two text sources, even when explicitly instructed to trust the audio. Using ALME, a benchmark of 57,602 controlled audio-text conflict stimuli across 8 languages, we find that Gemini 2.0 Flash exhibits 16.6% text dominance under audio-text conflict versus 1.6% under text-text conflict with identical reliability cues. This gap is not explained by audio quality: audio-only accuracy (97.2%) exceeds cascade accuracy (93.9%), indicating audio embeddings preserve more information than text transcripts. We propose that text dominance reflects an asymmetry not in information content but in arbitration accessibility: how easily the model can reason over competing representations. This framework explains otherwise puzzling findings. Forcing transcription before answering increases text dominance (19% to 33%), sacrificing audio's information advantage without improving accessibility. Framing text as "deliberately corrupted" reduces text dominance by 80%. A fine-tuning ablation provides interventional evidence: training only the audio projection layer increases text dominance (+26.5%), while LoRA on the language model halves it (-23.9%), localizing text dominance to the LLM's reasoning rather than the audio encoder. Experiments across four state-of-the-art audio-LLMs and 8 languages show consistent trends with substantial cross-linguistic and cross-model variation, establishing modality arbitration as a distinct reliability dimension not captured by standard speech benchmarks.
74. 【2602.11460】ADRD-Bench: A Preliminary LLM Benchmark for Alzheimer's Disease and Related Dementias
Link: https://arxiv.org/abs/2602.11460
Authors: Guangxin Zhao,Jiahao Zheng,Malaz Boustani,Jarek Nabrzyski,Meng Jiang,Yiyu Shi,Zhi Zheng
Categories: Computation and Language (cs.CL)
Keywords: Large language models, shown great potential, Large language, healthcare applications, shown great
Comments:
Abstract:Large language models (LLMs) have shown great potential for healthcare applications. However, existing evaluation benchmarks provide minimal coverage of Alzheimer's Disease and Related Dementias (ADRD). To address this gap, we introduce ADRD-Bench, the first ADRD-specific benchmark dataset designed for rigorous evaluation of LLMs. ADRD-Bench has two components: 1) ADRD Unified QA, a synthesis of 1,352 questions consolidated from seven established medical benchmarks, providing a unified assessment of clinical knowledge; and 2) ADRD Caregiving QA, a novel set of 149 questions derived from the Aging Brain Care (ABC) program, a widely used, evidence-based brain health management program. Guided by a program with national expertise in comprehensive ADRD care, this new set was designed to mitigate the lack of practical caregiving context in existing benchmarks. We evaluated 33 state-of-the-art LLMs on the proposed ADRD-Bench. Results showed that the accuracy of open-weight general models ranged from 0.63 to 0.93 (mean: 0.78; std: 0.09). The accuracy of open-weight medical models ranged from 0.48 to 0.93 (mean: 0.82; std: 0.13). The accuracy of closed-source general models ranged from 0.83 to 0.91 (mean: 0.89; std: 0.03). While top-tier models achieved high accuracies (0.9), case studies revealed that inconsistent reasoning quality and stability limit their reliability, highlighting a critical need for domain-specific improvement to enhance LLMs' knowledge and reasoning grounded in daily caregiving data. The entire dataset is available at this https URL.
75. 【2602.11451】LoopFormer: Elastic-Depth Looped Transformers for Latent Reasoning via Shortcut Modulation
Link: https://arxiv.org/abs/2602.11451
Authors: Ahmadreza Jeddi,Marco Ciccone,Babak Taati
Categories: Computation and Language (cs.CL)
Keywords: Looped Transformers, efficient and powerful, powerful class, Looped, looped Transformer trained
Comments: ICLR2026
Abstract:Looped Transformers have emerged as an efficient and powerful class of models for reasoning in the language domain. Recent studies show that these models achieve strong performance on algorithmic and reasoning tasks, suggesting that looped architectures possess an inductive bias toward latent reasoning. However, prior approaches fix the number of loop iterations during training and inference, leaving open the question of whether these models can flexibly adapt their computational depth under variable compute budgets. We introduce LoopFormer, a looped Transformer trained on variable-length trajectories to enable budget-conditioned reasoning. Our core contribution is a shortcut-consistency training scheme that aligns trajectories of different lengths, ensuring that shorter loops yield informative representations while longer loops continue to refine them. LoopFormer conditions each loop on the current time and step size, enabling representations to evolve consistently across trajectories of varying length rather than drifting or stagnating. Empirically, LoopFormer demonstrates robust performance on language modeling and reasoning benchmarks even under aggressive compute constraints, while scaling gracefully with additional budget. These results show that looped Transformers are inherently suited for adaptive language modeling, opening a path toward controllable and budget-aware large language models.
76. 【2602.11444】Towards Reliable Machine Translation: Scaling LLMs for Critical Error Detection and Safety
Link: https://arxiv.org/abs/2602.11444
Authors: Muskaan Chopra,Lorenz Sparrenberg,Rafet Sifa
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: public policy communication, equitable knowledge dissemination, Machine Translation, cross-lingual information access, plays a pivotal
Comments: Accepted at ECIR 2026
Abstract:Machine Translation (MT) plays a pivotal role in cross-lingual information access, public policy communication, and equitable knowledge dissemination. However, critical meaning errors, such as factual distortions, intent reversals, or biased translations, can undermine the reliability, fairness, and safety of multilingual systems. In this work, we explore the capacity of instruction-tuned Large Language Models (LLMs) to detect such critical errors, evaluating models across a range of parameters using the publicly accessible data sets. Our findings show that model scaling and adaptation strategies (zero-shot, few-shot, fine-tuning) yield consistent improvements, outperforming encoder-only baselines like XLM-R and ModernBERT. We argue that improving critical error detection in MT contributes to safer, more trustworthy, and socially accountable information systems by reducing the risk of disinformation, miscommunication, and linguistic harm, especially in high-stakes or underrepresented contexts. This work positions error detection not merely as a technical challenge, but as a necessary safeguard in the pursuit of just and responsible multilingual AI. The code will be made available at GitHub.
77. 【2602.11424】Gradients Must Earn Their Influence: Unifying SFT with Generalized Entropic Objectives
Link: https://arxiv.org/abs/2602.11424
Authors: Zecheng Wang,Deyuan Liu,Chunshan Li,Yupeng Zhang,Zhengyun Zhao,Dianhui Chu,Bingning Wang,Dianbo Sui
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Standard negative log-likelihood, Standard negative, applies uniform token-level, negative log-likelihood, Supervised Fine-Tuning
Comments:
Abstract:Standard negative log-likelihood (NLL) for Supervised Fine-Tuning (SFT) applies uniform token-level weighting. This rigidity creates a two-fold failure mode: (i) overemphasizing low-probability targets can amplify gradients on noisy supervision and disrupt robust priors, and (ii) uniform weighting provides weak sharpening when the model is already confident. Existing methods fail to resolve the resulting plasticity-stability dilemma, often suppressing necessary learning signals alongside harmful ones. To address this issue, we unify token-level SFT objectives within a generalized deformed-log family and expose a universal gate $\times$ error gradient structure, where the gate controls how much the model trusts its current prediction. By employing the Cayley transform, we map the model's continuously evolving uncertainty onto a continuous focus trajectory, which enables seamless interpolation between scenarios involving uncertain novel concepts and those involving well-established knowledge. We then introduce Dynamic Entropy Fine-Tuning (DEFT), a parameter-free objective that modulates the trust gate using distribution concentration (Rényi-2 entropy) as a practical proxy for the model's predictive state. Extensive experiments and analyses demonstrate that DEFT achieves a better balance between exploration and exploitation, leading to improved overall performance.
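The abstract names Rényi-2 entropy as its measure of distribution concentration. As a minimal sketch of that quantity only, H_2(p) = -log(sum_i p_i^2), the snippet below computes it for a token distribution. It is not DEFT's gate or objective, which the abstract does not spell out; the function name `renyi2_entropy` is an assumption made for illustration.

```python
import math
from typing import Sequence


def renyi2_entropy(probs: Sequence[float]) -> float:
    """Rényi entropy of order 2 in nats: H_2(p) = -log(sum_i p_i^2).

    Illustrative helper (hypothetical name), not code from the paper.
    """
    if any(p < 0 for p in probs):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probs), 1.0, rel_tol=1e-6):
        raise ValueError("probabilities must sum to 1")
    # sum_i p_i^2 is the collision probability: close to 1 for a peaked
    # distribution, 1/n for a uniform distribution over n outcomes.
    return -math.log(sum(p * p for p in probs))


uniform = [0.25, 0.25, 0.25, 0.25]   # maximally spread over 4 tokens
peaked = [0.97, 0.01, 0.01, 0.01]    # model is nearly certain

print(renyi2_entropy(uniform))  # equals log(4) for a uniform 4-way distribution
print(renyi2_entropy(peaked))   # much smaller: the distribution is concentrated
```

A low value signals a confident prediction and a high value signals a spread-out one, which is the sense in which the abstract uses concentration as a proxy for the model's predictive state.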
78. 【2602.11391】Advancing AI Trustworthiness Through Patient Simulation: Risk Assessment of Conversational Agents for Antidepressant Selection
Link: https://arxiv.org/abs/2602.11391
Authors: Md Tanvir Rouf Shawon,Mohammad Sabik Irbaz,Hadeel R. A. Elyazori,Keerti Reddy Resapu,Yili Lin,Vladimir Franzuela Cardenas,Farrokh Alemi,Kevin Lybarger
Categories: Computation and Language (cs.CL)
Keywords: healthcare conversational agents, patient simulator designed, Risk Management Framework, enable scalable, automated evaluation
Comments:
Abstract:Objective: This paper introduces a patient simulator designed to enable scalable, automated evaluation of healthcare conversational agents. The simulator generates realistic, controllable patient interactions that systematically vary across medical, linguistic, and behavioral dimensions, allowing annotators and an independent AI judge to assess agent performance, identify hallucinations and inaccuracies, and characterize risk patterns across diverse patient populations. Methods: The simulator is grounded in the NIST AI Risk Management Framework and integrates three profile components reflecting different dimensions of patient variation: (1) medical profiles constructed from electronic health records in the All of Us Research Program; (2) linguistic profiles modeling variation in health literacy and condition-specific communication patterns; and (3) behavioral profiles representing empirically observed interaction patterns, including cooperation, distraction, and adversarial engagement. We evaluated the simulator's effectiveness in identifying errors in an AI decision aid for antidepressant selection. Results: We generated 500 conversations between the patient simulator and the AI decision aid across systematic combinations of five linguistic and three behavioral profiles. Human annotators assessed 1,787 medical concepts across 100 conversations, achieving high agreement (F1=0.94, κ=0.73), and the LLM judge achieved comparable agreement with human annotators (F1=0.94, κ=0.78; paired bootstrap p=0.21). The simulator revealed a monotonic degradation in AI decision aid performance across the health literacy spectrum: rank-one concept retrieval accuracy increased from 47.9% for limited health literacy to 69.1% for functional and 81.6% for proficient.
79. 【2602.11388】Sparse Semantic Dimension as a Generalization Certificate for LLMs
Link: https://arxiv.org/abs/2602.11388
Authors: Dibyanayan Bandyopadhyay,Asif Ekbal
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Large Language Models, Standard statistical learning, Large Language, statistical learning theory, learning theory predicts
Comments: Work in progress (17 pages)
Abstract:Standard statistical learning theory predicts that Large Language Models (LLMs) should overfit because their parameter counts vastly exceed the number of training tokens. Yet, in practice, they generalize robustly. We propose that the effective capacity controlling generalization lies in the geometry of the model's internal representations: while the parameter space is high-dimensional, the activation states lie on a low-dimensional, sparse manifold. To formalize this, we introduce the Sparse Semantic Dimension (SSD), a complexity measure derived from the active feature vocabulary of a Sparse Autoencoder (SAE) trained on the model's layers. Treating the LLM and SAE as frozen oracles, we utilize this framework to attribute the model's generalization capabilities to the sparsity of the dictionary rather than the total parameter count. Empirically, we validate this framework on GPT-2 Small and Gemma-2B, demonstrating that our bound provides non-vacuous certificates at realistic sample sizes. Crucially, we uncover a counter-intuitive "feature sharpness" scaling law: despite being an order of magnitude larger, Gemma-2B requires significantly fewer calibration samples to identify its active manifold compared to GPT-2, suggesting that larger models learn more compressible, distinct semantic structures. Finally, we show that this framework functions as a reliable safety monitor: out-of-distribution inputs trigger a measurable "feature explosion" (a sharp spike in active features), effectively signaling epistemic uncertainty through learned feature violation. Code is available at: this https URL.
80. 【2602.11364】The Energy of Falsehood: Detecting Hallucinations via Diffusion Model Likelihoods
Link: https://arxiv.org/abs/2602.11364
Authors: Arpit Singh Gautam,Kailash Talreja,Saurabh Jha
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, frequently hallucinate plausible, Language Models, Generative Stress Test
Comments:
Abstract:Large Language Models (LLMs) frequently hallucinate plausible but incorrect assertions, a vulnerability often missed by uncertainty metrics when models are confidently wrong. We propose DiffuTruth, an unsupervised framework that reconceptualizes fact verification via non-equilibrium thermodynamics, positing that factual truths act as stable attractors on a generative manifold while hallucinations are unstable. We introduce the Generative Stress Test, in which claims are corrupted with noise and reconstructed using a discrete text diffusion model. We define Semantic Energy, a metric measuring the semantic divergence between the original claim and its reconstruction using an NLI critic. Unlike vector-space errors, Semantic Energy isolates deep factual contradictions. We further propose a Hybrid Calibration fusing this stability signal with discriminative confidence. Extensive experiments on FEVER demonstrate that DiffuTruth achieves a state-of-the-art unsupervised AUROC of 0.725, outperforming baselines by 1.5 percent through the correction of overconfident predictions. Furthermore, we show superior zero-shot generalization on the multi-hop HOVER dataset, outperforming baselines by over 4 percent, confirming the robustness of thermodynamic truth properties to distribution shifts.
81. 【2602.11361】Finding the Cracks: Improving LLMs Reasoning with Paraphrastic Probing and Consistency Verification
Link: https://arxiv.org/abs/2602.11361
Authors: Weili Shi,Dongliang Guo,Lehan Yang,Tianlong Wang,Hanzhang Yuan,Sheng Li
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Large language models, Large language, critical tokens, demonstrated impressive performance, reasoning
Comments:
Abstract:Large language models have demonstrated impressive performance across a variety of reasoning tasks. However, their problem-solving ability often declines on more complex tasks due to hallucinations and the accumulation of errors within intermediate steps. Recent work has introduced the notion of critical tokens: tokens in the reasoning process that exert significant influence on subsequent steps. Prior studies suggest that replacing critical tokens can refine reasoning trajectories. Nonetheless, reliably identifying and exploiting critical tokens remains challenging. To address this, we propose the Paraphrastic Probing and Consistency Verification (PPCV) framework. PPCV operates in two stages. In the first stage, we roll out an initial reasoning path from the original question and then concatenate paraphrased versions of the question with this reasoning path. We then identify critical tokens based on mismatches between the predicted top-1 token and the expected token in the reasoning path. A criterion is employed to confirm the final critical token. In the second stage, we substitute critical tokens with candidate alternatives and roll out new reasoning paths for both the original and paraphrased questions. The final answer is determined by checking the consistency of outputs across these parallel reasoning processes. We evaluate PPCV on mainstream LLMs across multiple benchmarks. Extensive experiments demonstrate that PPCV substantially enhances the reasoning performance of LLMs compared to baselines.
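The first stage described in the abstract (flagging positions where the model's top-1 prediction under a paraphrased question disagrees with the token in the original reasoning path) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Top1Fn` interface, the function name `candidate_critical_positions`, the "any paraphrase disagrees" rule, and the toy predictor are all hypothetical, and the paper's confirmation criterion is not reproduced.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: (question, reasoning tokens so far) -> top-1 next token.
Top1Fn = Callable[[str, Sequence[str]], str]


def candidate_critical_positions(
    paraphrases: Sequence[str],
    reasoning_path: Sequence[str],
    top1: Top1Fn,
) -> List[int]:
    """Return positions where some paraphrase's top-1 token differs from the path.

    Assumption: a single disagreeing paraphrase is enough to flag a position.
    """
    positions = []
    for pos, expected in enumerate(reasoning_path):
        prefix = reasoning_path[:pos]
        if any(top1(question, prefix) != expected for question in paraphrases):
            positions.append(pos)
    return positions


# Toy stand-in for a language model, for demonstration only.
def toy_top1(question: str, prefix: Sequence[str]) -> str:
    original_path = ["2", "+", "3", "=", "5"]
    if question == "paraphrase B" and len(prefix) == 2:
        return "4"  # this paraphrase disagrees at position 2
    return original_path[len(prefix)]


path = ["2", "+", "3", "=", "5"]
found = candidate_critical_positions(["paraphrase A", "paraphrase B"], path, toy_top1)
print(found)  # prints [2]
```

In a real setting `top1` would wrap a model call over the concatenated paraphrase and reasoning prefix; the flagged positions would then feed the substitution and consistency check of the second stage.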
82. 【2602.11358】When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing
Link: https://arxiv.org/abs/2602.11358
Authors: Zachary Pedram Dadfar
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Large language models, Large language, rich introspective language, language reflects internal, reflects internal computation
Comments: Code and data: [this https URL](https://doi.org/10.5281/zenodo.18567446)
Abstract:Large language models produce rich introspective language when prompted for self-examination, but whether this language reflects internal computation or sophisticated confabulation has remained unclear. We show that self-referential vocabulary tracks concurrent activation dynamics, and that this correspondence is specific to self-referential processing. We introduce the Pull Methodology, a protocol that elicits extended self-examination through format engineering, and use it to identify a direction in activation space that distinguishes self-referential from descriptive processing in Llama 3.1. The direction is orthogonal to the known refusal direction, localised at 6.25% of model depth, and causally influences introspective output when used for steering. When models produce "loop" vocabulary, their activations exhibit higher autocorrelation (r = 0.44, p = 0.002); when they produce "shimmer" vocabulary under steering, activation variability increases (r = 0.36, p = 0.002). Critically, the same vocabulary in non-self-referential contexts shows no activation correspondence despite nine-fold higher frequency. Qwen 2.5-32B, with no shared training, independently develops different introspective vocabulary tracking different activation metrics, all absent in descriptive controls. The findings indicate that self-report in transformer models can, under appropriate conditions, reliably track internal computational states.
83. 【2602.11354】ReplicatorBench: Benchmarking LLM Agents for Replicability in Social and Behavioral Sciences
Link: https://arxiv.org/abs/2602.11354
Authors: Bang Nguyen,Dominik Soós,Qian Ma,Rochana R. Obadage,Zack Ranjan,Sai Koneru,Timothy M. Errington,Shakhlo Nematova,Sarah Rajtmajer,Jian Wu,Meng Jiang
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: literature has witnessed, witnessed an emerging, emerging interest, automated assessment, assessment of scientific
Comments:
Abstract:The literature has witnessed an emerging interest in AI agents for automated assessment of scientific papers. Existing benchmarks focus primarily on the computational aspect of this task, testing agents' ability to reproduce or replicate research outcomes when having access to the code and data. This setting, while foundational, (1) fails to capture the inconsistent availability of new data for replication as opposed to reproduction, and (2) lacks ground-truth diversity by focusing only on reproducible papers, thereby failing to evaluate an agent's ability to identify non-replicable research. Furthermore, most benchmarks only evaluate outcomes rather than the replication process. In response, we introduce ReplicatorBench, an end-to-end benchmark, including human-verified replicable and non-replicable research claims in social and behavioral sciences for evaluating AI agents in research replication across three stages: (1) extraction and retrieval of replication data; (2) design and execution of computational experiments; and (3) interpretation of results, allowing a test of AI agents' capability to mimic the activities of human replicators in the real world. To set a baseline of AI agents' capability, we develop ReplicatorAgent, an agentic framework equipped with necessary tools, like web search and iterative interaction with sandboxed environments, to accomplish tasks in ReplicatorBench. We evaluate ReplicatorAgent across four underlying large language models (LLMs), as well as different design choices of programming language and levels of code access. Our findings reveal that while current LLM agents are capable of effectively designing and executing computational experiments, they struggle with retrieving resources, such as new data, necessary to replicate a claim. All code and data are publicly available at this https URL.
84. 【2602.11328】Evaluating Alignment of Behavioral Dispositions in LLMs
Link: https://arxiv.org/abs/2602.11328
Authors: Amir Taubenfeld,Zorik Gekhman,Lior Nezry,Omri Feldman,Natalie Harris,Shashir Reddy,Romina Stella,Ariel Goldstein,Marian Croak,Yossi Matias,Amir Feder
Categories: Computation and Language (cs.CL)
Keywords: Situational Judgment Tests, daily lives, LLMs, human, Judgment Tests
Comments:
Abstract:As LLMs integrate into our daily lives, understanding their behavior becomes essential. In this work, we focus on behavioral dispositions (the underlying tendencies that shape responses in social contexts) and introduce a framework to study how closely the dispositions expressed by LLMs align with those of humans. Our approach is grounded in established psychological questionnaires but adapts them for LLMs by transforming human self-report statements into Situational Judgment Tests (SJTs). These SJTs assess behavior by eliciting natural recommendations in realistic user-assistant scenarios. We generate 2,500 SJTs, each validated by three human annotators, and collect preferred actions from 10 annotators per SJT, from a large pool of 550 participants. In a comprehensive study involving 25 LLMs, we find that models often do not reflect the distribution of human preferences: (1) in scenarios with low human consensus, LLMs consistently exhibit overconfidence in a single response; (2) when human consensus is high, smaller models deviate significantly, and even some frontier models do not reflect the consensus in 15-20% of cases; (3) traits can exhibit cross-LLM patterns, e.g., LLMs may encourage emotion expression in contexts where human consensus favors composure. Lastly, mapping psychometric statements directly to behavioral scenarios presents a unique opportunity to evaluate the predictive validity of self-reports, revealing considerable gaps between LLMs' stated values and their revealed behavior.
85. 【2602.11318】Dissecting Subjectivity and the "Ground Truth" Illusion in Data Annotation
Link: https://arxiv.org/abs/2602.11318
Authors: Sheza Munir,Benjamin Mah,Krisha Kalsi,Shivani Kapania,Julian Posada,Edith Law,Ding Wang,Syed Ishtiaque Ahmed
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: assumed correct labels, ground truth, machine learning, assumed correct, correct labels
Comments:
Abstract:In machine learning, "ground truth" refers to the assumed correct labels used to train and evaluate models. However, the foundational "ground truth" paradigm rests on a positivistic fallacy that treats human disagreement as technical noise rather than a vital sociotechnical signal. This systematic literature review analyzes research published between 2020 and 2025 across seven premier venues: ACL, AIES, CHI, CSCW, EAAMO, FAccT, and NeurIPS, investigating the mechanisms in data annotation practices that facilitate this "consensus trap". Our identification phase captured 30,897 records, which were refined via a tiered keyword filtration schema to a high-recall corpus of 3,042 records for manual screening, resulting in a final included corpus of 346 papers for qualitative synthesis. Our reflexive thematic analysis reveals that systemic failures in positional legibility, combined with the recent architectural shift toward human-as-verifier models, specifically the reliance on model-mediated annotations, introduce deep-seated anchoring bias and effectively remove human voices from the loop. We further demonstrate how geographic hegemony imposes Western norms as universal benchmarks, often enforced by the performative alignment of precarious data workers who prioritize requester compliance over honest subjectivity to avoid economic penalties. Critiquing the "noisy sensor" fallacy, where statistical models misdiagnose cultural pluralism as random error, we argue for reclaiming disagreement as a high-fidelity signal essential for building culturally competent models. To address these systemic tensions, we propose a roadmap for pluralistic annotation infrastructures that shift the objective from discovering a singular "right" answer to mapping the diversity of human experience.
86. 【2602.11305】Are Aligned Large Language Models Still Misaligned?
Link: https://arxiv.org/abs/2602.11305
Authors: Usman Naseem,Gautam Siddharth Kashyap,Rafiq Ali,Ebad Shabbir,Sushant Kumar Ray,Abdullah Mohammad,Agrima Seth
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, model behavior diverges, simultaneously satisfy safety, real-world query
Comments:
Abstract:Misalignment in Large Language Models (LLMs) arises when model behavior diverges from human expectations and fails to simultaneously satisfy safety, value, and cultural dimensions, which must co-occur in real-world settings to solve a real-world query. Existing misalignment benchmarks, such as INSECURE CODE (safety-centric), VALUEACTIONLENS (value-centric), and CULTURALHERITAGE (culture-centric), rely on evaluating misalignment along individual dimensions, preventing simultaneous evaluation. To address this gap, we introduce Mis-Align Bench, a unified benchmark for analyzing misalignment across safety, value, and cultural dimensions. First, we construct SAVACU, an English misaligned-aligned dataset of 382,424 samples spanning 112 domains (or labels), by reclassifying prompts from the LLM-PROMPT-DATASET via taxonomy into 14 safety domains, 56 value domains, and 42 cultural domains using Mistral-7B-Instruct-v0.3, and expanding low-resource domains via Llama-3.1-8B-Instruct with SimHash-based fingerprinting to avoid duplication. Furthermore, we pair prompts with misaligned and aligned responses via two-stage rejection sampling to enforce quality. Second, we benchmark general-purpose, fine-tuned, and open-weight LLMs, enabling systematic evaluation of misalignment under three dimensions. Empirically, single-dimension models achieve high Coverage (up to 97.6%) but incur False Failure Rate 50% and lower Alignment Score (63%-66%) under joint conditions.
87. 【2602.11246】How Many Features Can a Language Model Store Under the Linear Representation Hypothesis?
Link: https://arxiv.org/abs/2602.11246
Authors: Nikhil Garg,Jon Kleinberg,Kenny Peng
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT); Combinatorics (math.CO)
Keywords: linear, compressed sensing, introduce a mathematical, mathematical framework, asserts that intermediate
Comments:
Abstract:We introduce a mathematical framework for the linear representation hypothesis (LRH), which asserts that intermediate layers of language models store features linearly. We separate the hypothesis into two claims: linear representation (features are linearly embedded in neuron activations) and linear accessibility (features can be linearly decoded). We then ask: How many neurons $d$ suffice to both linearly represent and linearly access $m$ features? Classical results in compressed sensing imply that for $k$-sparse inputs, $d = O(k\log (m/k))$ suffices if we allow non-linear decoding algorithms (Candes and Tao, 2006; Candes et al., 2006; Donoho, 2006). However, the additional requirement of linear decoding takes the problem out of the classical compressed sensing, into linear compressed sensing. Our main theoretical result establishes nearly-matching upper and lower bounds for linear compressed sensing. We prove that $d = \Omega_\epsilon(\frac{k^2}{\log k}\log (m/k))$ is required while $d = O_\epsilon(k^2\log m)$ suffices. The lower bound establishes a quantitative gap between classical and linear compressed setting, illustrating how linear accessibility is a meaningfully stronger hypothesis than linear representation alone. The upper bound confirms that neurons can store an exponential number of features under the LRH, giving theoretical evidence for the "superposition hypothesis" (Elhage et al., 2022). The upper bound proof uses standard random constructions of matrices with approximately orthogonal columns. The lower bound proof uses rank bounds for near-identity matrices (Alon, 2003) together with Turán's theorem (bounding the number of edges in clique-free graphs). We also show how our results do and do not constrain the geometry of feature representations and extend our results to allow decoders with an activation function and bias.
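To make the upper-bound construction mentioned in the abstract more concrete (random matrices with approximately orthogonal columns), here is a small pure-Python sketch: it stores m = 600 features in d = 256 neurons as random unit vectors, superposes a k-sparse set of active features, and reads each feature out with a dot product. The sizes, the 0.5 threshold, and all function names are assumptions chosen for illustration; this is not the paper's proof or its exact decoder.

```python
import math
import random
from typing import List, Set


def random_unit_columns(d: int, m: int, rng: random.Random) -> List[List[float]]:
    """Draw m random unit vectors in R^d (approximately orthogonal when d is large)."""
    cols = []
    for _ in range(m):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        cols.append([x / norm for x in v])
    return cols


def encode(cols: List[List[float]], active: Set[int]) -> List[float]:
    """Linear representation: the activation is the sum of the active feature vectors."""
    d = len(cols[0])
    y = [0.0] * d
    for j in active:
        for i in range(d):
            y[i] += cols[j][i]
    return y


def linear_decode(cols: List[List[float]], y: List[float], threshold: float = 0.5) -> Set[int]:
    """Linear readout per feature (dot product), followed by a fixed threshold."""
    return {
        j for j, c in enumerate(cols)
        if sum(ci * yi for ci, yi in zip(c, y)) > threshold
    }


rng = random.Random(0)
d, m, k = 256, 600, 3              # more features than neurons (m > d)
cols = random_unit_columns(d, m, rng)
active = set(rng.sample(range(m), k))
recovered = linear_decode(cols, encode(cols, active))
print(recovered == active)
```

Each readout equals 1 for an active feature plus small interference from the other active features, and only interference for an inactive one. That interference grows with k, which is the intuition behind linear decoding needing more neurons than the classical $O(k\log(m/k))$ bound.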
88. 【2602.11243】Evaluating Memory Structure in LLM Agents
Link: https://arxiv.org/abs/2602.11243
Authors: Alina Shutova, Alexandra Olenina, Ivan Vinogradov, Anton Sinitsin
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: chat assistants rely, recall user preferences, store reusable knowledge, user preferences, augment reasoning
Comments: Preprint, work in progress
Abstract:Modern LLM-based agents and chat assistants rely on long-term memory frameworks to store reusable knowledge, recall user preferences, and augment reasoning. As researchers create more complex memory architectures, it becomes increasingly difficult to analyze their capabilities and guide future memory designs. Most long-term memory benchmarks focus on simple fact retention, multi-hop recall, and time-based changes. While undoubtedly important, these capabilities can often be achieved with simple retrieval-augmented LLMs and do not test complex memory hierarchies. To bridge this gap, we propose StructMemEval - a benchmark that tests the agent's ability to organize its long-term memory, not just factual recall. We gather a suite of tasks that humans solve by organizing their knowledge in a specific structure: transaction ledgers, to-do lists, trees and others. Our initial experiments show that simple retrieval-augmented LLMs struggle with these tasks, whereas memory agents can reliably solve them if prompted how to organize their memory. However, we also find that modern LLMs do not always recognize the memory structure when not prompted to do so. This highlights an important direction for future improvements in both LLM training and memory frameworks.
89. 【2602.11238】SurveyLens: A Research Discipline-Aware Benchmark for Automatic Survey Generation
Link: https://arxiv.org/abs/2602.11238
Authors: Beichen Guo, Zhiyuan Wen, Jia Gu, Senzhang Wang, Haochen Shi, Ruosong Yang, Shuaiqi Liu
Subjects: Computation and Language (cs.CL)
Keywords: Automatic Survey Generation, evolution of Automatic, commercial Deep Research, Deep Research agents, Survey Generation
Comments:
Abstract:The exponential growth of scientific literature has driven the evolution of Automatic Survey Generation (ASG) from simple pipelines to multi-agent frameworks and commercial Deep Research agents. However, current ASG evaluation methods rely on generic metrics and are heavily biased toward Computer Science (CS), failing to assess whether ASG methods adhere to the distinct standards of various academic disciplines. Consequently, researchers, especially those outside CS, lack clear guidance on using ASG systems to yield high-quality surveys compliant with specific discipline standards. To bridge this gap, we introduce SurveyLens, the first discipline-aware benchmark evaluating ASG methods across diverse research disciplines. We construct SurveyLens-1k, a curated dataset of 1,000 high-quality human-written surveys spanning 10 disciplines. Subsequently, we propose a dual-lens evaluation framework: (1) Discipline-Aware Rubric Evaluation, which utilizes LLMs with human preference-aligned weights to assess adherence to domain-specific writing standards; and (2) Canonical Alignment Evaluation to rigorously measure content coverage and synthesis quality against human-written survey papers. We conduct extensive experiments by evaluating 11 state-of-the-art ASG methods on SurveyLens, including Vanilla LLMs, ASG systems, and Deep Research agents. Our analysis reveals the distinct strengths and weaknesses of each paradigm across fields, providing essential guidance for selecting tools tailored to specific disciplinary requirements.
90. 【2602.11236】ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning
Link: https://arxiv.org/abs/2602.11236
Authors: Yandan Yang, Shuang Zeng, Tong Lin, Xinyuan Chang, Dekang Qi, Junjin Xiao, Haoyun Liu, Ronghan Chen, Yuzhi Chen, Dongjie Huo, Feng Xiong, Xing Wei, Zhiheng Ma, Mu Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Robotics (cs.RO)
Keywords: Building general-purpose embodied, diverse hardware remains, many-forms paradigm, Building general-purpose, challenge in robotics
Comments: Project website: [this https URL](https://amap-cvlab.github.io/ABot-Manipulation/). Code: [this https URL](https://github.com/amap-cvlab/ABot-Manipulation). 22 pages, 10 figures, 10 tables
Abstract:Building general-purpose embodied agents across diverse hardware remains a central challenge in robotics, often framed as the "one-brain, many-forms" paradigm. Progress is hindered by fragmented data, inconsistent representations, and misaligned training objectives. We present ABot-M0, a framework that builds a systematic data curation pipeline while jointly optimizing model architecture and training strategies, enabling end-to-end transformation of heterogeneous raw data into unified, efficient representations. From six public datasets, we clean, standardize, and balance samples to construct UniACT-dataset, a large-scale dataset with over 6 million trajectories and 9,500 hours of data, covering diverse robot morphologies and task scenarios. Unified pre-training improves knowledge transfer and generalization across platforms and tasks, supporting general-purpose embodied intelligence. To improve action prediction efficiency and stability, we propose the Action Manifold Hypothesis: effective robot actions lie not in the full high-dimensional space but on a low-dimensional, smooth manifold governed by physical laws and task constraints. Based on this, we introduce Action Manifold Learning (AML), which uses a DiT backbone to predict clean, continuous action sequences directly. This shifts learning from denoising to projection onto feasible manifolds, improving decoding speed and policy stability. ABot-M0 supports modular perception via a dual-stream mechanism that integrates VLM semantics with geometric priors and multi-view inputs from plug-and-play 3D modules such as VGGT and Qwen-Image-Edit, enhancing spatial understanding without modifying the backbone and mitigating standard VLM limitations in 3D reasoning. Experiments show components operate independently with additive benefits. We will release all code and pipelines for reproducibility and future research.
91. 【2602.11224】Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation
Link: https://arxiv.org/abs/2602.11224
Authors: Hubert M. Pysklo, Artem Zhuravel, Patrick D. Watson
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Keywords: Large Language Models, agentic Large Language, Large Language, evaluating agentic Large, Language Models
Comments: Pre-Print. Under review for KDD 2026
Abstract:We present Agent-Diff, a novel benchmarking framework for evaluating agentic Large Language Models (LLMs) on real-world tasks that execute code via external APIs. Agentic LLM performance varies due to differences in models, external tool access, prompt structures, and agentic frameworks. Benchmarks must make fundamental trade-offs between a sandboxed approach that controls for variation in software environments and more ecologically valid approaches employing real services. Agent-Diff attempts to capture the desirable features of both of these approaches by including access to the real API interfaces for software services while sandboxing the environment in which calls are made, processed, and evaluated. This approach relies on two key innovations. The first is a novel state-diff contract, which separates process from outcome - rather than fuzzy trace or parameter matching, we define task success as whether the expected change in environment state was achieved. The second is a novel sandbox that provides a standardized scripting layer that all models use to execute code against external APIs (Slack, Box, Linear, Google Calendar). Thus, we can evaluate different agentic LLMs against a standardized set of contracts using a unified sandbox while still evaluating their performance on real-world service interfaces. Using the Agent-Diff framework, we provide benchmarks for nine LLMs across 224 tasks utilizing enterprise software workflows. In addition, we evaluate the robustness of the framework with ablation experiments to assess the contribution of access to API documentation on benchmark performance. Code and data: this https URL.
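The state-diff idea in the abstract (success is judged by the change in environment state, not by the agent's trace) can be sketched as below. This is an illustration under stated assumptions, not Agent-Diff's actual contract format: the flat dict snapshots, the `added`/`removed`/`changed` layout, and the calendar example are all hypothetical.

```python
def state_diff(before, after):
    """Compare two flat snapshots of environment state (dicts keyed by object id)."""
    added = {key: after[key] for key in after.keys() - before.keys()}
    removed = {key: before[key] for key in before.keys() - after.keys()}
    changed = {
        key: (before[key], after[key])
        for key in before.keys() & after.keys()
        if before[key] != after[key]
    }
    return {"added": added, "removed": removed, "changed": changed}


def task_succeeded(before, after, expected_diff):
    """Outcome-only check: the observed diff must equal the expected diff."""
    return state_diff(before, after) == expected_diff


# Hypothetical calendar state before and after an agent run.
before = {"event-1": {"title": "Standup", "time": "09:00"}}
after = {
    "event-1": {"title": "Standup", "time": "09:30"},
    "event-2": {"title": "Retro", "time": "15:00"},
}
expected = {
    "added": {"event-2": {"title": "Retro", "time": "15:00"}},
    "removed": {},
    "changed": {
        "event-1": (
            {"title": "Standup", "time": "09:00"},
            {"title": "Standup", "time": "09:30"},
        )
    },
}
print(task_succeeded(before, after, expected))
```

Because only the before and after snapshots are compared, two agents that reach the same end state through different API calls are scored identically.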
92. 【2602.11221】The Automatic Verification of Image-Text Claims (AVerImaTeC) Shared Task
Link: https://arxiv.org/abs/2602.11221
Authors: Rui Cao, Zhenyun Deng, Yulong Chen, Michael Schlichtkrull, Andreas Vlachos
Subjects: Computation and Language (cs.CL)
Keywords: real-world image-text claims, verifying real-world image-text, Image-Text Claims, Automatic Verification, real-world image-text
Comments: Shared Task Overview and Summary for the Ninth FEVER Workshop, Co-located at EACL 2026
Abstract:The Automatic Verification of Image-Text Claims (AVerImaTeC) shared task aims to advance system development for retrieving evidence and verifying real-world image-text claims. Participants were allowed to either employ external knowledge sources, such as web search engines, or leverage the curated knowledge store provided by the organizers. System performance was evaluated using the AVerImaTeC score, defined as a conditional verdict accuracy in which a verdict is considered correct only when the associated evidence score exceeds a predefined threshold. The shared task attracted 14 submissions during the development phase and 6 submissions during the testing phase. All participating systems in the testing phase outperformed the baseline provided. The winning team, HUMANE, achieved an AVerImaTeC score of 0.5455. This paper provides a detailed description of the shared task, presents the complete evaluation results, and discusses key insights and lessons learned.
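The scoring rule described in the abstract, a verdict counted as correct only when its evidence score exceeds a threshold, can be sketched as follows. The field names, verdict labels, and the threshold value of 0.5 are assumptions made for illustration; the abstract does not state the threshold or how the evidence score is computed.

```python
def conditional_verdict_accuracy(examples, threshold):
    """Fraction of examples whose verdict is right AND whose evidence score exceeds the threshold."""
    if not examples:
        return 0.0
    correct = sum(
        1
        for ex in examples
        if ex["evidence_score"] > threshold and ex["predicted"] == ex["gold"]
    )
    return correct / len(examples)


# Hypothetical system outputs.
examples = [
    {"predicted": "Supported", "gold": "Supported", "evidence_score": 0.8},  # counts
    {"predicted": "Refuted", "gold": "Refuted", "evidence_score": 0.2},      # right verdict, weak evidence
    {"predicted": "Supported", "gold": "Refuted", "evidence_score": 0.9},    # wrong verdict
    {"predicted": "Refuted", "gold": "Refuted", "evidence_score": 0.6},      # counts
]
score = conditional_verdict_accuracy(examples, threshold=0.5)
print(score)
```

Under this rule a system cannot score by guessing the verdict without retrieving adequate evidence, which is why the second example contributes nothing.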
93. 【2602.11220】Patch the Distribution Mismatch: RL Rewriting Agent for Stable Off-Policy SFT
Link: https://arxiv.org/abs/2602.11220
Authors: Jiacheng Wang, Ping Jian, Zhen Yang, Zirong Chen, Keren Liao, Zhongbin Guo
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Large language models, made rapid progress, Large language, rapid progress, supervised fine-tuning
Comments:
Abstract:Large language models (LLMs) have made rapid progress, yet adapting them to downstream scenarios still commonly relies on supervised fine-tuning (SFT). When downstream data exhibit a substantial distribution shift from the model's prior training distribution, SFT can induce catastrophic forgetting. To narrow this gap, data rewriting has been proposed as a data-centric approach that rewrites downstream training data prior to SFT. However, existing methods typically sample rewrites from a prompt-induced conditional distribution, so the resulting targets are not necessarily aligned with the model's natural QA-style generation distribution. Moreover, reliance on fixed templates can lead to diversity collapse. To address these issues, we cast data rewriting as a policy learning problem and learn a rewriting policy that better matches the backbone's QA-style generation distribution while preserving diversity. Since distributional alignment, diversity and task consistency are automatically evaluable but difficult to optimize end-to-end with differentiable objectives, we leverage reinforcement learning to optimize the rewrite distribution under reward feedback and propose an RL-based data-rewriting agent. The agent jointly optimizes QA-style distributional alignment and diversity under a hard task-consistency gate, thereby constructing a higher-quality rewritten dataset for downstream SFT. Extensive experiments show that our method achieves downstream gains comparable to standard SFT while reducing forgetting on non-downstream benchmarks by 12.34% on average. Our code is available at this https URL.
94. 【2602.11201】Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning
Link: https://arxiv.org/abs/2602.11201
Authors: Donald Ye, Max Loffgren, Om Kotadia, Linus Wong
Subjects: Computation and Language (cs.CL)
Keywords: solve complex problems, Logit Difference Decay, Normalized Logit Difference, language models solve, models solve complex
Comments: 16 pages, 15 figures. Code: [this https URL](https://github.com/donald-ye/ACL_2026)
Abstract:Chain-of-Thought (CoT) explanations are widely used to interpret how language models solve complex problems, yet it remains unclear whether these step-by-step explanations reflect how the model actually reaches its answer, or merely post-hoc justifications. We propose Normalized Logit Difference Decay (NLDD), a metric that measures whether individual reasoning steps are faithful to the model's decision-making process. Our approach corrupts individual reasoning steps from the explanation and measures how much the model's confidence in its answer drops, to determine if a step is truly important. By standardizing these measurements, NLDD enables rigorous cross-model comparison across different architectures. Testing three model families across syntactic, logical, and arithmetic tasks, we discover a consistent Reasoning Horizon (k*) at 70--85% of chain length, beyond which reasoning tokens have little or negative effect on the final answer. We also find that models can encode correct internal representations while completely failing the task. These results show that accuracy alone does not reveal whether a model actually reasons through its chain. NLDD offers a way to measure when CoT matters.
95. 【2602.11199】When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification
Link: https://arxiv.org/abs/2602.11199
Authors: Jiale Zhao, Ke Fang, Lu Cheng
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Large language models, include misleading information, prompts omit critical, omit critical details, Large language
Comments:
Abstract:Large language models (LLMs) often respond even when prompts omit critical details or include misleading information, leading to hallucinations or reinforced misconceptions. We study how to evaluate and improve LLMs' ability to decide when and what to ask for clarification without sacrificing task performance. We introduce AskBench, an interactive benchmark that converts standard QA pairs into multi-turn interactions with explicit checkpoints. A unified judge loop evaluates final answers and simulates user responses as needed. AskBench covers two settings: AskMind, with intent-deficient queries requiring clarification, and AskOverconfidence, with queries containing false premises that must be identified and corrected. We further propose rubric-guided reinforcement learning with verifier-based rewards (RLVR), which uses structured rubrics to encourage targeted clarification. Experiments show consistent improvements in accuracy, rubric adherence, and interaction efficiency, with strong generalization to unseen domains.
96. 【2602.11198】DDL2PropBank Agent: Benchmarking Multi-Agent Frameworks' Developer Experience Through a Novel Relational Schema Mapping Task
Link: https://arxiv.org/abs/2602.11198
Authors: Shafiuddin Rehan Ahmed, Wei Wei
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: LLM-driven software development, Multi-agent frameworks promise, simplify LLM-driven software, software development, controlled setting
Comments: ARR submission
Abstract:Multi-agent frameworks promise to simplify LLM-driven software development, yet there is no principled way to evaluate their developer experience in a controlled setting. We introduce DDL2PropBank, a novel benchmark task that maps relational database schemas to PropBank rolesets, requiring autonomous retrieval of candidate frames and fine-grained linguistic reasoning over table names, columns, and relations. Using the Agent-as-a-Tool pattern, we implement identical agent logic across 10 frameworks and evaluate along two dimensions: (i) code complexity via static analysis, and (ii) AI-assistability -- the extent to which LLMs can autonomously generate correct, framework-specific code. Our results reveal a threefold complexity spectrum, with Pydantic AI and Agno requiring the least implementation overhead. For AI-assistability, structural alignment scores reliably proxy runtime success for frameworks with single canonical patterns, but overestimate correctness for multi-pattern frameworks. Agno emerges as the strongest overall performer, combining lowest complexity with highest structural alignment and 83% pass@1.
97. 【2602.11182】MetaMem: Evolving Meta-Memory for Knowledge Utilization through Self-Reflective Symbolic Optimization
Link: https://arxiv.org/abs/2602.11182
Authors: Haidong Xin, Xinze Li, Zhenghao Liu, Yukun Yan, Shuo Wang, Cheng Yang, Yu Gu, Ge Yu, Maosong Sun
Subjects: Computation and Language (cs.CL)
Keywords: Large Language Models, enable Large Language, Language Models, Large Language, systems enable Large
Comments:
Abstract:Existing memory systems enable Large Language Models (LLMs) to support long-horizon human-LLM interactions by persisting historical interactions beyond limited context windows. However, while recent approaches have succeeded in constructing effective memories, they often disrupt the inherent logical and temporal relationships within interaction sessions, resulting in fragmented memory units and degraded reasoning performance. In this paper, we propose MetaMem, a novel framework that augments memory systems with a self-evolving meta-memory, aiming to teach LLMs how to effectively utilize memorized knowledge. During meta-memory optimization, MetaMem iteratively distills transferable knowledge utilization experiences across different tasks by self-reflecting on reasoning processes and performing actions to update the current meta-memory state. The accumulated meta-memory units serve as explicit knowledge utilization experiences, guiding the LLM to systematically identify and integrate critical evidence from scattered memory fragments. Extensive experiments demonstrate the effectiveness of MetaMem, which significantly outperforms strong baselines by over 3.6%. All codes and datasets are available at this https URL.
98. 【2602.11181】Code Mixologist: A Practitioner's Guide to Building Code-Mixed LLMs
Link: https://arxiv.org/abs/2602.11181
Authors: Himanshu Gupta, Pratik Jayarao, Chaitanya Dwivedi, Neeraj Varshney
Subjects: Computation and Language (cs.CL)
Keywords: remain challenging phenomena, remain challenging, large language models, large language model, challenging phenomena
Comments: 7 pages main paper, 10 pages total
Abstract:Code-mixing and code-switching (CSW) remain challenging phenomena for large language models (LLMs). Despite recent advances in multilingual modeling, LLMs often struggle in mixed-language settings, exhibiting systematic degradation in grammaticality, factuality, and safety behavior. This work provides a comprehensive overview of CSW research in modern large language model settings. We introduce a unifying taxonomy that organizes prior work along dimensions of data, modeling, and evaluation, and we distill these findings into a practical playbook of actionable recommendations for building, adapting, and evaluating CSW-capable LLMs. We review modeling approaches ranging from CSW-tailored pre-training and task-specific post-training to prompting strategies and in-context learning. We analyze current evaluation practices, highlighting sources of instability and limited reproducibility, and we catalog existing benchmarks while critically examining their linguistic coverage and English-centric biases. Finally, we discuss emerging safety concerns, including use of code-mixing as a mechanism for bypassing model safeguards, and identify open research challenges.
99. 【2602.11180】Mechanistic Interpretability for Large Language Model Alignment: Progress, Challenges, and Future Directions
Link: https://arxiv.org/abs/2602.11180
Authors: Usman Naseem
Subjects: Computation and Language (cs.CL)
Keywords: Large language models, remain largely opaque, achieved remarkable capabilities, internal decision-making processes, decision-making processes remain
Comments:
Abstract:Large language models (LLMs) have achieved remarkable capabilities across diverse tasks, yet their internal decision-making processes remain largely opaque. Mechanistic interpretability (i.e., the systematic study of how neural networks implement algorithms through their learned representations and computational structures) has emerged as a critical research direction for understanding and aligning these models. This paper surveys recent progress in mechanistic interpretability techniques applied to LLM alignment, examining methods ranging from circuit discovery to feature visualization, activation steering, and causal intervention. We analyze how interpretability insights have informed alignment strategies including reinforcement learning from human feedback (RLHF), constitutional AI, and scalable oversight. Key challenges are identified, including the superposition hypothesis, polysemanticity of neurons, and the difficulty of interpreting emergent behaviors in large-scale models. We propose future research directions focusing on automated interpretability, cross-model generalization of circuits, and the development of interpretability-driven alignment techniques that can scale to frontier models.
100. 【2602.11179】From Instruction to Output: The Role of Prompting in Modern NLG
Link: https://arxiv.org/abs/2602.11179
Authors: Munazza Zaib, Elaf Alhazmi
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Natural Language Processing, significant performance gains, gain significant performance, Natural Language Generation
Comments:
Abstract:Prompt engineering has emerged as an integral technique for extending the strengths and abilities of Large Language Models (LLMs) to gain significant performance gains in various Natural Language Processing (NLP) tasks. This approach, which requires instructions to be composed in natural language to bring out the knowledge from LLMs in a structured way, has driven breakthroughs in various NLP tasks. Yet there is still no structured framework or coherent understanding of the varied prompt engineering methods and techniques, particularly in the field of Natural Language Generation (NLG). This survey aims to help fill that gap by outlining recent developments in prompt engineering, and their effect on different NLG tasks. It reviews recent advances in prompting methods and their impact on NLG tasks, presenting prompt design as an input-level control mechanism that complements fine-tuning and decoding approaches. The paper introduces a taxonomy of prompting paradigms, a decision framework for prompt selection based on varying factors for the practitioners, outlines emerging trends and challenges, and proposes a framework that links design, optimization, and evaluation to support more controllable and generalizable NLG.
101. 【2602.11177】What Do LLMs Know About Alzheimer's Disease? Fine-Tuning, Probing, and Data Synthesis for AD Detection
Link: https://arxiv.org/abs/2602.11177
Authors: Lei Jiang, Yue Zhou, Natalie Parde
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Reliable early detection, Alzheimer disease, Reliable early, due to limited, limited availability
Comments:
Abstract:Reliable early detection of Alzheimer's disease (AD) is challenging, particularly due to limited availability of labeled data. While large language models (LLMs) have shown strong transfer capabilities across domains, adapting them to the AD domain through supervised fine-tuning remains largely unexplored. In this work, we fine-tune an LLM for AD detection and investigate how task-relevant information is encoded within its internal representations. We employ probing techniques to analyze intermediate activations across transformer layers, and we observe that, after fine-tuning, the probing values of specific words and special markers change substantially, indicating that these elements assume a crucial role in the model's improved detection performance. Guided by this insight, we design a curated set of task-aware special markers and train a sequence-to-sequence model as a data-synthesis tool that leverages these markers to generate structurally consistent and diagnostically informative synthetic samples. We evaluate the synthesized data both intrinsically and by incorporating it into downstream training pipelines.
102. 【2602.11176】Evaluating Few-Shot Temporal Reasoning of LLMs for Human Activity Prediction in Smart Environments
Link: https://arxiv.org/abs/2602.11176
Authors: Maral Doctorarastoo, Katherine A. Flanigan, Mario Bergés, Christopher McComb
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Keywords: activity-based transportation system, transportation system simulation, Anticipating human activities, system simulation, Anticipating human
Comments:
Abstract:Anticipating human activities and their durations is essential in applications such as smart-home automation, simulation-based architectural and urban design, activity-based transportation system simulation, and human-robot collaboration, where adaptive systems must respond to human activities. Existing data-driven agent-based models--from rule-based to deep learning--struggle in low-data environments, limiting their practicality. This paper investigates whether large language models, pre-trained on broad human knowledge, can fill this gap by reasoning about everyday activities from compact contextual cues. We adopt a retrieval-augmented prompting strategy that integrates four sources of context--temporal, spatial, behavioral history, and persona--and evaluate it on the CASAS Aruba smart-home dataset. The evaluation spans two complementary tasks: next-activity prediction with duration estimation, and multi-step daily sequence generation, each tested with various numbers of few-shot examples provided in the prompt. Analyzing few-shot effects reveals how much contextual supervision is sufficient to balance data efficiency and predictive accuracy, particularly in low-data environments. Results show that large language models exhibit strong inherent temporal understanding of human behavior: even in zero-shot settings, they produce coherent daily activity predictions, while adding one or two demonstrations further refines duration calibration and categorical accuracy. Beyond a few examples, performance saturates, indicating diminishing returns. Sequence-level evaluation confirms consistent temporal alignment across few-shot conditions. These findings suggest that pre-trained language models can serve as promising temporal reasoners, capturing both recurring routines and context-dependent behavioral variations, thereby strengthening the behavioral modules of agent-based models.
103. 【2602.11175】Barriers to Discrete Reasoning with Transformers: A Survey Across Depth, Exactness, and Bandwidth
Link: https://arxiv.org/abs/2602.11175
Authors: Michelle Yuan, Weiyi Sun, Amir H. Rezaeian, Jyotika Singh, Sandip Ghoshal, Yao-Ting Wang, Miguel Ballesteros, Yassine Benajiba
Subjects: Computation and Language (cs.CL)
Keywords: sequence modeling applications, natural language processing, modeling applications, systems in natural, language processing
Comments: Accepted to EACL 2026 Main Conference
Abstract:Transformers have become the foundational architecture for a broad spectrum of sequence modeling applications, underpinning state-of-the-art systems in natural language processing, vision, and beyond. However, their theoretical limitations in discrete reasoning tasks, such as arithmetic, logical inference, and algorithmic composition, remain a critical open problem. In this survey, we synthesize recent studies from three theoretical perspectives: circuit complexity, approximation theory, and communication complexity, to clarify the structural and computational barriers that transformers face when performing symbolic computations. By connecting these established theoretical frameworks, we provide an accessible and unified account of why current transformer architectures struggle to implement exact discrete algorithms, even as they excel at pattern matching and interpolation. We review key definitions, seminal results, and illustrative examples, highlighting challenges such as depth constraints, difficulty approximating discontinuities, and bottlenecks in inter-token communication. Finally, we discuss implications for model design and suggest promising directions for overcoming these foundational limitations.
104. 【2602.11174】The Script Tax: Measuring Tokenization-Driven Efficiency and Latency Disparities in Multilingual Language Models
Link: https://arxiv.org/abs/2602.11174
Authors: Aradhya Dixit, Shreem Dixit
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Pretrained multilingual language, impose systematic costs, multilingual language models, writing systems, language models
Comments:
Abstract:Pretrained multilingual language models are often assumed to be script-agnostic, yet their tokenizers can impose systematic costs on certain writing systems. We quantify this script tax by comparing two orthographic variants with identical linguistic content. Across mBERT and XLM-R, the higher-fragmentation orthography shows a ~3.4x increase in fertility (6.73-6.85 vs. 2.10-2.35 tokens/word), leading to a 16.5x inference slowdown (0.23 vs. 3.8 sentences/second) on identical hardware. Using bits per character (BPC) to avoid the "NLL paradox" from subword fragmentation, we find a substantial increase in information cost: +19.7% for mBERT (8.06-9.65) and +47.1% for XLM-R (12.19-17.94). A round-trip conversion check (CER_rt=0.31) suggests these gaps reflect orthography-conditioned processing rather than mapping noise. Our results highlight tokenization as a key source of inequity in multilingual NLP and motivate script-aware tokenization and pretraining.
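The quantities in this abstract can be reproduced with a few lines of arithmetic. The sketch below uses the common definitions of fertility (tokens per word) and bits per character (total negative log-likelihood converted to bits, divided by character count). Whether the paper computes them exactly this way is an assumption, and the helper names are hypothetical.

```python
import math


def fertility(num_tokens, num_words):
    """Average number of subword tokens per word."""
    return num_tokens / num_words


def bits_per_character(total_nll_nats, num_chars):
    """Convert a summed negative log-likelihood (in nats) to bits per character."""
    return total_nll_nats / (math.log(2) * num_chars)


def relative_increase_percent(baseline, value):
    """Percentage increase of value over baseline."""
    return (value - baseline) / baseline * 100


# Figures quoted in the abstract.
slowdown = 3.8 / 0.23                                 # throughput ratio, sentences/second
mbert_bpc_gap = relative_increase_percent(8.06, 9.65)
xlmr_bpc_gap = relative_increase_percent(12.19, 17.94)

print(round(slowdown, 1), round(mbert_bpc_gap, 1), round(xlmr_bpc_gap, 1))
```

Normalizing by characters rather than tokens keeps the comparison fair across orthographies, because a more fragmented script spreads the same text over more tokens. The rounded endpoints give about 16.5x and +19.7%, matching the abstract. For XLM-R they give about +47.2%, which is consistent with the reported +47.1% up to rounding of the inputs.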
105. 【2602.11173】Author-in-the-Loop Response Generation and Evaluation: Integrating Author Expertise and Intent in Responses to Peer Review
Link: https://arxiv.org/abs/2602.11173
Authors: Qian Ruan, Iryna Gurevych
Subjects: Computation and Language (cs.CL)
Keywords: substantial author effort, demands substantial author, author expertise, scientific peer review, Author
Comments:
Abstract:Author response (rebuttal) writing is a critical stage of scientific peer review that demands substantial author effort. Recent work frames this task as automatic text generation, underusing author expertise and intent. In practice, authors possess domain expertise, author-only information, revision and response strategies--concrete forms of author expertise and intent--to address reviewer concerns, and seek NLP assistance that integrates these signals to support effective response writing in peer review. We reformulate author response generation as an author-in-the-loop task and introduce REspGen, a generation framework that integrates explicit author input, multi-attribute control, and evaluation-guided refinement, together with REspEval, a comprehensive evaluation suite with 20+ metrics covering input utilization, controllability, response quality, and discourse. To support this formulation, we construct Re$^3$Align, the first large-scale dataset of aligned review--response--revision triplets, where revisions provide signals of author expertise and intent. Experiments with state-of-the-art LLMs show the benefits of author input and evaluation-guided refinement, the impact of input design on response quality, and trade-offs between controllability and quality. We make our dataset, generation and evaluation tools publicly available.
106. 【2602.11172】Synthesizing the Virtual Advocate: A Multi-Persona Speech Generation Framework for Diverse Linguistic Jurisdictions in Indic Languages
链接:https://arxiv.org/abs/2602.11172
作者:Aniket Deroy
类目:Computation and Language (cs.CL)
关键词:Pro TTS models, authoritative tone, rhythmic pausing, pausing for emphasis, emotional intelligence
备注:
点击查看摘要
Abstract:Legal advocacy requires a unique combination of authoritative tone, rhythmic pausing for emphasis, and emotional intelligence. This study investigates the performance of the Gemini 2.5 Flash TTS and Gemini 2.5 Pro TTS models in generating synthetic courtroom speeches across five Indic languages: Tamil, Telugu, Bengali, Hindi, and Gujarati. We propose a prompting framework that utilizes Gemini 2.5's native support for 5 languages and its context-aware pacing to produce distinct advocate personas. The evolution of Large Language Models (LLMs) has shifted the focus of Text-to-Speech (TTS) technology from basic intelligibility to context-aware, expressive synthesis. In the legal domain, synthetic speech must convey authority and a specific professional persona, a task that becomes significantly more complex in the linguistically diverse landscape of India. The models exhibit a "monotone authority," excelling at procedural information delivery but struggling with the dynamic vocal modulation and emotive gravitas required for persuasive advocacy. Performance dips in Bengali and Gujarati further highlight phonological frontiers for future refinement. This research underscores the readiness of multilingual TTS for procedural legal tasks while identifying the remaining challenges in replicating the persuasive artistry of human legal discourse. The code is available at this https URL
107. 【2602.11171】Efficient Hyper-Parameter Search for LoRA via Language-aided Bayesian Optimization
链接:https://arxiv.org/abs/2602.11171
作者:Baek Seong-Eun,Lee Jung-Mok,Kim Sung-Bin,Tae-Hyun Oh
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Fine-tuning Large Language, enables resource-efficient personalization, Large Language Models, Fine-tuning Large, Low-Rank Adaptation
备注:
点击查看摘要
Abstract:Fine-tuning Large Language Models (LLMs) with Low-Rank Adaptation (LoRA) enables resource-efficient personalization or specialization, but it comes at the expense of additional hyperparameter tuning. Although LoRA makes fine-tuning efficient, it is highly sensitive to the choice of hyperparameters, and exhaustive hyperparameter search is still computationally very demanding. To address these challenges, we propose a framework that integrates the domain knowledge of pre-trained LLMs into Bayesian Optimization (BO) to efficiently search for LoRA hyperparameters. To leverage the informed knowledge of LLMs, we repurpose LLMs as a discrete-to-continuous mapping to link the hyperparameters and their domain knowledge with a continuous vector space, where BO is conducted. We design and control the mapping by language prompting, where we provide a domain-aware textual prompt describing the relationships among hyperparameters and their respective roles; thereby, we explicitly inject domain knowledge about LoRA into the LLM in natural language. Also, we model the residual information that is hard to linguistically describe in the prompt with an additional learnable token. This aids BO to sample more high-performing hyperparameters. In addition, by leveraging the observation of the strong correlation between the respective performance obtained from full and subset training datasets in LoRA training regimes, we introduce proxy training and evaluation with a data subset. This further increases the efficiency of our method. We demonstrate that our hyperparameter found with only about 30 iterations achieves more than 20% performance improvement over standard hyperparameters found from about 45,000 combinations.
108. 【2602.11170】PRIME: Policy-Reinforced Iterative Multi-agent Execution for Algorithmic Reasoning in Large Language Models
链接:https://arxiv.org/abs/2602.11170
作者:Jiawei Xu,Zhenyu Yu,Ziqian Bi,Minh Duc Pham,Xiaoyi Qu,Danyang Zhang
类目:Computation and Language (cs.CL)
关键词:demonstrated remarkable capabilities, Large language models, reasoning remains limited, Large language, algorithmic reasoning remains
备注:
点击查看摘要
Abstract:Large language models have demonstrated remarkable capabilities across diverse reasoning tasks, yet their performance on algorithmic reasoning remains limited. To handle this limitation, we propose PRIME (Policy-Reinforced Iterative Multi-agent Execution), a framework comprising three specialized agents: an executor for step-by-step reasoning, a verifier for constraint checking, and a coordinator for backtracking control, all optimized through group relative policy optimization. For comprehensive evaluation, we introduce PRIME-Bench, the largest algorithmic reasoning benchmark to date, comprising 86 tasks across 12 categories with 51,600 instances. Tasks span sorting algorithms, graph and tree structures, automata and state machines, symbolic reasoning, and constraint-based puzzles, with execution traces reaching over one million steps. Compared to the baseline approach, PRIME improves average accuracy from 26.8% to 93.8%, a 250% relative gain. The largest improvements occur on tasks requiring sustained state tracking, with Turing machine simulation improving from 9% to 92% and long division from 16% to 94%. Ablation studies identify iterative verification as the primary contributor, preventing the error propagation that causes baseline approaches to fail catastrophically. Analysis across model scales (8B-120B parameters) reveals that smaller models benefit disproportionately, achieving accuracy comparable to models 8x larger.
109. 【2602.11169】Disentangling Direction and Magnitude in Transformer Representations: A Double Dissociation Through L2-Matched Perturbation Analysis
链接:https://arxiv.org/abs/2602.11169
作者:Mangadoddi Srikar Vardhan,Lekkala Sai Teja
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Transformer hidden states, hidden states encode, states encode information, Transformer hidden, serve distinct functional
备注: 15 pages, 7 figures. Will submit to ICML 2026
点击查看摘要
Abstract:Transformer hidden states encode information as high-dimensional vectors, yet whether direction (orientation in representational space) and magnitude (vector norm) serve distinct functional roles remains unclear. Studying Pythia-family models, we discover a striking cross-over dissociation: angular perturbations cause up to 42.9x more damage to language modeling loss, while magnitude perturbations cause disproportionately more damage to syntactic processing (20.4% vs. 1.6% accuracy drop on subject-verb agreement). This finding is enabled by L2-matched perturbation analysis, a methodology ensuring that angular and magnitude perturbations achieve identical Euclidean displacements. Causal intervention reveals that angular damage flows substantially through the attention pathways (28.4% loss recovery via attention repair), while magnitude damage flows partly through the LayerNorm pathways (29.9% recovery via LayerNorm repair). These patterns replicate across scales within the Pythia architecture family. These findings provide evidence that direction and magnitude support partially distinct computational roles in LayerNorm-based architectures. Direction preferentially affects attentional routing, while magnitude modulates processing intensity for fine-grained syntactic judgments. We find different patterns in RMSNorm-based architectures, suggesting that the dissociation depends on architectural choices. Our results refine the linear representation hypothesis and have implications for model editing and interpretability research.
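The idea of matching the Euclidean displacement of an angular and a magnitude perturbation can be illustrated with a small sketch. This is an assumed construction for illustration, not the paper's procedure: the magnitude perturbation scales the vector by (1 + alpha), the angular perturbation rotates it in a random plane while keeping its norm, and the rotation angle is chosen as theta = 2 * arcsin(alpha / 2) so that both displacements equal alpha * ||h||. All function names are hypothetical.

```python
import math
import random


def norm(v):
    """Euclidean (L2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def magnitude_perturb(h, alpha):
    """Scale the norm by (1 + alpha) while keeping the direction fixed."""
    return [(1.0 + alpha) * x for x in h]


def angular_perturb(h, theta, rng):
    """Rotate h by angle theta in a random plane, keeping its norm fixed."""
    n = norm(h)
    u = [x / n for x in h]
    r = [rng.gauss(0.0, 1.0) for _ in h]
    proj = sum(a * b for a, b in zip(r, u))
    r = [a - proj * b for a, b in zip(r, u)]  # drop the component along h
    rn = norm(r)
    v = [a / rn for a in r]                   # unit vector orthogonal to h
    return [n * (math.cos(theta) * a + math.sin(theta) * b)
            for a, b in zip(u, v)]


def matched_angle(alpha):
    """Angle whose chord length 2*sin(theta/2) equals alpha (needs 0 <= alpha <= 2)."""
    return 2.0 * math.asin(alpha / 2.0)


rng = random.Random(0)
h = [rng.gauss(0.0, 1.0) for _ in range(64)]
alpha = 0.3

h_mag = magnitude_perturb(h, alpha)
h_ang = angular_perturb(h, matched_angle(alpha), rng)

d_mag = norm([a - b for a, b in zip(h_mag, h)])
d_ang = norm([a - b for a, b in zip(h_ang, h)])
print(f"magnitude displacement: {d_mag:.6f}")
print(f"angular displacement:   {d_ang:.6f}")
```

Both displacements come out equal up to floating-point error, while only the angular variant leaves the vector norm unchanged.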
110. 【2602.11168】Enhancing SDG-Text Classification with Combinatorial Fusion Analysis and Generative AI
链接:https://arxiv.org/abs/2602.11168
作者:Jingyan Xu,Marcelo L. LaFleur,Christina Schweikert,D. Frank Hsu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Natural Language Processing, Natural Language, Language Processing, including information retrieval, application areas including
备注: 8 pages, 8 figures, 4 tables; Accepted to 2025 IEEE International Conference on Pervasive Intelligence and Computing (PICom 2025)
点击查看摘要
Abstract:Natural Language Processing (NLP) techniques such as text classification and topic discovery are very useful in many application areas including information retrieval, knowledge discovery, policy formulation, and decision-making. However, it remains a challenging problem in cases where the categories are unavailable, difficult to differentiate, or are interrelated. Social analysis with human context is an area that can benefit from text classification, as it relies substantially on text data. The focus of this paper is to enhance the classification of text according to the UN's Sustainable Development Goals (SDGs) by collecting and combining intelligence from multiple models. Combinatorial Fusion Analysis (CFA), a system fusion paradigm using a rank-score characteristic (RSC) function and cognitive diversity (CD), has been used to enhance classifier methods by combining a set of relatively good and mutually diverse classification models. We use a generative AI model to generate synthetic data for model training and then apply CFA to this classification task. The CFA technique achieves 96.73% performance, outperforming the best individual model. We compare the outcomes with those obtained from human domain experts. It is demonstrated that combining intelligence from multiple ML/AI models using CFA and getting input from human experts can not only complement but also enhance each other.
111. 【2602.11167】Visualizing and Benchmarking LLM Factual Hallucination Tendencies via Internal State Analysis and Clustering
链接:https://arxiv.org/abs/2602.11167
作者:Nathan Mao,Varun Kaushik,Shreya Shivkumar,Parham Sharafoleslami,Kevin Zhu,Sunishchal Dev
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Large Language, Language Models, generating nonsensical, medicine or law
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) often hallucinate, generating nonsensical or false information that can be especially harmful in sensitive fields such as medicine or law. To study this phenomenon systematically, we introduce FalseCite, a curated dataset designed to capture and benchmark hallucinated responses induced by misleading or fabricated citations. Running GPT-4o-mini, Falcon-7B, and Mistral-7B through FalseCite, we observed a noticeable increase in hallucination activity for false claims with deceptive citations, especially in GPT-4o-mini. Using the responses from FalseCite, we can also analyze the internal states of hallucinating models, visualizing and clustering the hidden state vectors. From this analysis, we noticed that the hidden state vectors, regardless of hallucination or non-hallucination, tend to trace out a distinct horn-like shape. Our work underscores FalseCite's potential as a foundation for evaluating and mitigating hallucinations in future LLM research.
112. 【2602.11166】Small Updates, Big Doubts: Does Parameter-Efficient Fine-tuning Enhance Hallucination Detection?
链接:https://arxiv.org/abs/2602.11166
作者:Xu Hu,Yifan Zhang,Songtao Wei,Chen Zhao,Qiannan Li,Bingzhe Li,Feng Chen
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:adapt large language, large language models, improve factual correctness, Parameter-efficient fine-tuning, parameter-efficient fine-tuning methods
备注: 18 pages, 13 figures, 8 tables
点击查看摘要
Abstract:Parameter-efficient fine-tuning (PEFT) methods are widely used to adapt large language models (LLMs) to downstream tasks and are often assumed to improve factual correctness. However, how PEFT methods affect hallucination behavior remains insufficiently understood, especially on QA datasets. In this work, we systematically investigate the impact of PEFT on hallucination detection through a comprehensive empirical study across three open-weight LLM backbones and three fact-seeking QA benchmarks. For each model, we evaluate performance using seven unsupervised hallucination detection methods spanning three complementary approaches: semantic-consistency-based detectors, confidence-based detectors, and entropy-based detectors. This multifaceted evaluation enables us to characterize how PEFT reshapes uncertainty across different detection paradigms. Our experimental results show that PEFT consistently strengthens hallucination detection ability, substantially improving AUROC across a wide range of hallucination detectors. Further analyses using linear probes and representation diagnostics indicate that PEFT methods primarily reshape how uncertainty is encoded and surfaced, rather than injecting new factual knowledge into the models.
113. 【2602.11165】Assessing LLM Reliability on Temporally Recent Open-Domain Questions
链接:https://arxiv.org/abs/2602.11165
作者:Pushwitha Krishnappa,Amit Das,Vinija Jain,Tathagata Mukherjee,Aman Chadha
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Large Language, open-domain question answering, temporally recent information, information remains underexplored
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are increasingly deployed for open-domain question answering, yet their alignment with human perspectives on temporally recent information remains underexplored. We introduce RECOM (Reddit Evaluation for Correspondence of Models), a benchmark dataset of 15,000 recent Reddit questions from September 2025 paired with community-derived reference answers. We investigate how four open-source LLMs (Llama3.1-8B, Mistral-7B, Gemma-2-9B, and GPT-OSS-20B) respond to these questions, evaluating alignment using lexical metrics (BLEU, ROUGE), semantic similarity (BERTScore, MoverScore, cosine similarity), and logical inference (NLI). Our central finding is a striking semantic-lexical paradox: all models achieve over 99% cosine similarity with references despite less than 8% BLEU-1 overlap, a 90+ percentage point gap indicating that models preserve meaning through extensive paraphrasing rather than lexical reproduction. MoverScore (51-53%) confirms this pattern, occupying an intermediate position that reflects the optimal transport cost of semantic alignment. Furthermore, model scale does not predict performance: Mistral-7B (7B parameters) outperforms GPT-OSS-20B (20B parameters) across all metrics. NLI analysis reveals that contradiction rates remain below 7%, suggesting models rarely generate content that directly conflicts with human consensus. These findings challenge the reliability of lexical metrics for evaluating abstractive generation and argue for multi-dimensional evaluation frameworks that capture semantic fidelity beyond surface-level text matching. The RECOM dataset is publicly available at this https URL
114. 【2602.11164】Automated Optimization Modeling via a Localizable Error-Driven Perspective
链接:https://arxiv.org/abs/2602.11164
作者:Weiting Liu,Han Wu,Yufei Kuang,Xiongwei Han,Tao Zhong,Jianfeng Feng,Wenlian Lu
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, complex human decision-making, assist complex human, Automated optimization modeling
备注:
点击查看摘要
Abstract:Automated optimization modeling via Large Language Models (LLMs) has emerged as a promising approach to assist complex human decision-making. While post-training has become a pivotal technique to enhance LLMs' capabilities in this domain, its effectiveness is severely constrained by the scarcity and underutilization of high-quality training data. However, through a detailed profiling of error patterns across various problem-response pairs drawn from post-training, we identify two fundamental limitations of existing automated optimization modeling approaches: (L1) the sparsity of error-specific problems and (L2) the sparse rewards associated with difficult problems. We demonstrate that these limitations can result in suboptimal performance in domain-specific post-training for LLMs. To tackle the above two limitations, we propose a novel error-driven learning framework, namely automated optimization modeling via a localizable error-driven perspective (MIND), that customizes the whole model training framework from data synthesis to post-training. MIND is based on our key observation of the unique localizable patterns in error propagation of optimization modeling, that is, modeling errors may remain localized to specific semantic segments and do not propagate throughout the entire solution. Thus, in contrast to holistic reasoning tasks such as mathematical proofs, MIND leverages the construction of a focused, high-density training corpus and proposes Dynamic Supervised Fine-Tuning Policy Optimization (DFPO) to tackle difficult problems through localized refinement. Experiments on six benchmarks demonstrate that MIND consistently outperforms all the state-of-the-art automated optimization modeling approaches.
115. 【2602.11163】Nested Named Entity Recognition in Plasma Physics Research Articles
链接:https://arxiv.org/abs/2602.11163
作者:Muhammad Haris,Hans Höft,Markus M. Becker,Markus Stocker
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:natural language processing, plasma physics, plasma physics research, physics research articles, extract key entities
备注:
点击查看摘要
Abstract:Named Entity Recognition (NER) is an important task in natural language processing that aims to identify and extract key entities from unstructured text. We present a novel application of NER in plasma physics research articles and address the challenges of extracting specialized entities from scientific text in this domain. Research articles in plasma physics often contain highly complex and context-rich content that must be extracted to enable, e.g., advanced search. We propose a lightweight approach based on encoder-transformers and conditional random fields to extract (nested) named entities from plasma physics research articles. First, we annotate a plasma physics corpus with 16 classes specifically designed for the nested NER task. Second, we evaluate an entity-specific model specialization approach, where independent BERT-CRF models are trained to recognize individual entity types in plasma physics text. Third, we integrate an optimization process to systematically fine-tune hyperparameters and enhance model performance. Our work contributes to the advancement of entity recognition in plasma physics and also provides a foundation to support researchers in navigating and analyzing scientific literature.
116. 【2602.11162】Retrieval Heads are Dynamic
链接:https://arxiv.org/abs/2602.11162
作者:Yuping Lin,Zitao Li,Yue Xing,Pengfei He,Yingqian Cui,Yaliang Li,Bolin Ding,Jingren Zhou,Jiliang Tang
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, Recent studies, Language Models, retrieval heads
备注:
点击查看摘要
Abstract:Recent studies have identified "retrieval heads" in Large Language Models (LLMs) responsible for extracting information from input contexts. However, prior works largely rely on static statistics aggregated across datasets, identifying heads that perform retrieval on average. This perspective overlooks the fine-grained temporal dynamics of autoregressive generation. In this paper, we investigate retrieval heads from a dynamic perspective. Through extensive analysis, we establish three core claims: (1) Dynamism: Retrieval heads vary dynamically across timesteps; (2) Irreplaceability: Dynamic retrieval heads are specific at each timestep and cannot be effectively replaced by static retrieval heads; and (3) Correlation: The model's hidden state encodes a predictive signal for future retrieval head patterns, indicating an internal planning mechanism. We validate these findings on the Needle-in-a-Haystack task and a multi-hop QA task, and quantify the differences on the utility of dynamic and static retrieval heads in a Dynamic Retrieval-Augmented Generation framework. Our study provides new insights into the internal mechanisms of LLMs.
117. 【2602.11161】Althea: Human-AI Collaboration for Fact-Checking and Critical Reasoning
链接:https://arxiv.org/abs/2602.11161
作者:Svetlana Churina,Kokil Jaidka,Anab Maulana Barik,Harshit Aneja,Cai Yang,Wynne Hsu,Mong Li Lee
类目:Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
关键词:web information ecosystem, information ecosystem demands, ecosystem demands fact-checking, demands fact-checking systems, epistemically trustworthy
备注:
点击查看摘要
Abstract:The web's information ecosystem demands fact-checking systems that are both scalable and epistemically trustworthy. Automated approaches offer efficiency but often lack transparency, while human verification remains slow and inconsistent. We introduce Althea, a retrieval-augmented system that integrates question generation, evidence retrieval, and structured reasoning to support user-driven evaluation of online claims. On the AVeriTeC benchmark, Althea achieves a Macro-F1 of 0.44, outperforming standard verification pipelines and improving discrimination between supported and refuted claims. We further evaluate Althea through a controlled user study and a longitudinal survey experiment (N = 642), comparing three interaction modes that vary in the degree of scaffolding: an Exploratory mode with guided reasoning, a Summary mode providing synthesized verdicts, and a Self-search mode that offers procedural guidance without algorithmic intervention. Results show that guided interaction produces the strongest immediate gains in accuracy and confidence, while self-directed search yields the most persistent improvements over time. This pattern suggests that performance gains are not driven solely by effort or exposure, but by how cognitive work is structured and internalized.
118. 【2602.11157】Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety
链接:https://arxiv.org/abs/2602.11157
作者:Max Zhang,Derek Liu,Kai Zhang,Joshua Franco,Haihao Liu
类目:Computation and Language (cs.CL)
关键词:remains predominantly English-centric, increasingly deployed worldwide, Large language models, predominantly English-centric, Large language
备注: 9 pages, Poster presented at Socially Responsible and Trustworthy Foundation Models at NeurIPS 2025 Workshop
点击查看摘要
Abstract:Large language models (LLMs) are increasingly deployed worldwide, yet their safety alignment remains predominantly English-centric. This allows for vulnerabilities in non-English contexts, especially with low-resource languages. We introduce a novel application of knowledge distillation (KD) in the context of multilingual jailbreak prevention, examining its efficacy. We distill the refusal behaviors of a proprietary teacher model (OpenAI o1-mini) with Low-Rank Adaptation (LoRA) into three open-source student models: Meta-Llama-3-8B-Instruct, Gemma-2-2B-IT, and Qwen3-8B, using ~28,000 multilingual jailbreak prompts from XSafety via black-box response-based, parameter-efficient fine-tuning (PEFT). Evaluation on the MultiJail benchmark reveals a counterintuitive behavior: standard fine-tuning on the teacher's "safe" refusal data inadvertently increases Jailbreak Success Rate (JSR) for all student models, up to 16.6 percentage points. Our experiments reveal a divergent generalization to unseen languages during distillation, with varying outcomes depending on the base model. By removing a primary source of safety degradation, nuanced 'boundary' refusals, we mitigate or even reverse safety declines in student models, although reductions in reasoning performance (GSM8K) persist. Overall, our exploratory study highlights the challenges and potential of KD as a technique for multilingual safety alignment, offering a foundation for future research in this direction.
119. 【2602.11156】HybridRAG: A Practical LLM-based ChatBot Framework based on Pre-Generated QA over Raw Unstructured Documents
链接:https://arxiv.org/abs/2602.11156
作者:Sungmoon Kim,Hyuna Jeon,Dahye Kim,Mingyu Kim,Dong-Kyu Chae,Jiwoong Kim
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:Large Language Model, Language Model, grounding Large Language, based chatbot responses, Large Language
备注:
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) has emerged as a powerful approach for grounding Large Language Model (LLM)-based chatbot responses on external knowledge. However, existing RAG studies typically assume well-structured textual sources (e.g., Wikipedia or curated datasets) and perform retrieval and generation at query time, which can limit their applicability in real-world chatbot scenarios. In this paper, we present HybridRAG, a novel and practical RAG framework towards more accurate and faster chatbot responses. First, HybridRAG ingests raw, unstructured PDF documents containing complex layouts (text, tables, figures) via Optical Character Recognition (OCR) and layout analysis, and converts them into hierarchical text chunks. Then, it pre-generates a plausible question-answer (QA) knowledge base from the organized chunks using an LLM. At query time, user questions are matched against this QA bank to retrieve immediate answers when possible, and only if no suitable QA match is found does our framework fall back to on-the-fly response generation. Experiments on OHRBench demonstrate that our HybridRAG provides higher answer quality and lower latency compared to a standard RAG baseline. We believe that HybridRAG could be a practical solution for real-world chatbot applications that must handle large volumes of unstructured documents and many users under limited computational resources.
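The query-time behavior described in this abstract (match against a pre-generated QA bank, fall back to generation only when no match is good enough) can be sketched as follows. This is a hedged illustration, not the paper's implementation: token-level Jaccard overlap stands in for whatever matcher the authors use, and `jaccard`, `answer`, the `threshold` value, and the `fallback` callable are all hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two strings (stand-in similarity measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def answer(query, qa_bank, fallback, threshold=0.6):
    """Return (answer, source): a stored answer if a bank question matches
    the query closely enough, otherwise the result of the fallback."""
    best_score, best_answer = 0.0, None
    for question, stored_answer in qa_bank:
        score = jaccard(query, question)
        if score > best_score:
            best_score, best_answer = score, stored_answer
    if best_answer is not None and best_score >= threshold:
        return best_answer, "qa_bank"
    return fallback(query), "generated"


# Hypothetical pre-generated QA bank and a placeholder generator.
qa_bank = [
    ("what is the refund policy", "Refunds are accepted within 30 days."),
    ("how do i reset my password", "Use the reset link on the login page."),
]


def generate(query):
    # Placeholder for an on-the-fly retrieval-and-generation call.
    return f"[generated answer for: {query}]"


hit = answer("what is the refund policy", qa_bank, generate)
miss = answer("which gpu does the server use", qa_bank, generate)
print(hit)
print(miss)
```

The threshold controls the trade-off: a higher value sends more queries to the slower generation path, a lower value risks returning a stored answer to a question that only superficially matches.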
信息检索
1. 【2602.12278】AttentionRetriever: Attention Layers are Secretly Long Document Retrievers
链接:https://arxiv.org/abs/2602.12278
作者:David Jiahao Fu,Lam Thanh Do,Jiayu Li,Kevin Chen-Chuan Chang
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:long document retrieval, Large Language Models, process tasks involving, Retrieval augmented generation, long document
备注:
点击查看摘要
Abstract:Retrieval augmented generation (RAG) has been widely adopted to help Large Language Models (LLMs) process tasks involving long documents. However, existing retrieval models are not designed for long document retrieval and fail to address several of its key challenges, including context-awareness, causal dependence, and scope of retrieval. In this paper, we propose AttentionRetriever, a novel long document retrieval model that leverages the attention mechanism and entity-based retrieval to build context-aware embeddings for long documents and determine the scope of retrieval. Through extensive experiments, we find that AttentionRetriever outperforms existing retrieval models on long document retrieval datasets by a large margin while remaining as efficient as dense retrieval models.
2. 【2602.12187】SAGEO Arena: A Realistic Environment for Evaluating Search-Augmented Generative Engine Optimization
链接:https://arxiv.org/abs/2602.12187
作者:Sunghwan Kim,Wooseok Jeong,Serin Kim,Sangam Lee,Dongha Lee
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:deliver synthesized answers, Search-Augmented Generative Engines, Search-Augmented Generative Engine, Generative Engine Optimization, bridging web-scale retrieval
备注: Work in Progress
点击查看摘要
Abstract:Search-Augmented Generative Engines (SAGE) have emerged as a new paradigm for information access, bridging web-scale retrieval with generative capabilities to deliver synthesized answers. This shift has fundamentally reshaped how web content gains exposure online, giving rise to Search-Augmented Generative Engine Optimization (SAGEO), the practice of optimizing web documents to improve their visibility in AI-generated responses. Despite growing interest, no evaluation environment currently supports comprehensive investigation of SAGEO. Specifically, existing benchmarks lack end-to-end visibility evaluation of optimization strategies, operating on pre-determined candidate documents that abstract away retrieval and reranking preceding generation. Moreover, existing benchmarks discard structural information (e.g., schema markup) present in real web documents, overlooking the rich signals that search systems actively leverage in practice. Motivated by these gaps, we introduce SAGEO Arena, a realistic and reproducible environment for stage-level SAGEO analysis. Our objective is to jointly target search-oriented optimization (SEO) and generation-centric optimization (GEO). To achieve this, we integrate a full generative search pipeline over a large-scale corpus of web documents with rich structural information. Our findings reveal that existing approaches remain largely impractical under realistic conditions and often degrade performance in retrieval and reranking. We also find that structural information helps mitigate these limitations, and that effective SAGEO requires tailoring optimization to each pipeline stage. Overall, our benchmark paves the way for realistic SAGEO evaluation and optimization beyond simplified settings.
3. 【2602.12129】Towards Personalized Bangla Book Recommendation: A Large-Scale Multi-Entity Book Graph Dataset
链接:https://arxiv.org/abs/2602.12129
作者:Rahin Arefin Ahmed,Md. Anik Chowdhury,Sakil Ahmed Sheikh Reza,Devnil Bhattacharjee,Muhammad Abdullah Adnan,Nafis Sadeq
类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:Personalized book recommendation, lack of structured, Bangla book recommendation, Bangla literature, recommendation
备注:
点击查看摘要
Abstract:Personalized book recommendation in Bangla literature has been constrained by the lack of structured, large-scale, and publicly available datasets. This work introduces RokomariBG, a large-scale, multi-entity heterogeneous book graph dataset designed to support research on personalized recommendation in a low-resource language setting. The dataset comprises 127,302 books, 63,723 users, 16,601 authors, 1,515 categories, 2,757 publishers, and 209,602 reviews, connected through eight relation types and organized as a comprehensive knowledge graph. To demonstrate the utility of the dataset, we provide a systematic benchmarking study on the Top-N recommendation task, evaluating a diverse set of representative recommendation models, including classical collaborative filtering methods, matrix factorization models, content-based approaches, graph neural networks, a hybrid matrix factorization model with side information, and a neural two-tower retrieval architecture. The benchmarking results highlight the importance of leveraging multi-relational structure and textual side information, with neural retrieval models achieving the strongest performance (NDCG@10 = 0.204). Overall, this work establishes a foundational benchmark and a publicly available resource for Bangla book recommendation research, enabling reproducible evaluation and future studies on recommendation in low-resource cultural domains. The dataset and code are publicly available at this https URL
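For readers unfamiliar with the NDCG@10 metric reported in this abstract, here is a minimal sketch of the standard definition. It assumes binary relevance (an item is either in the user's held-out set or not), which may differ from the paper's exact evaluation protocol; the function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_items, relevant, k=10):
    """NDCG@k with binary relevance: DCG of the ranking divided by the
    DCG of an ideal ranking that places all relevant items first."""
    gains = [1.0 if item in relevant else 0.0 for item in ranked_items]
    ideal = [1.0] * min(len(relevant), k)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


# Hypothetical recommendation list and held-out relevant items.
perfect = ndcg_at_k(["book_a", "book_b", "book_c"], {"book_a"})
second = ndcg_at_k(["book_b", "book_a", "book_c"], {"book_a"})
print(f"relevant item ranked first:  {perfect:.4f}")
print(f"relevant item ranked second: {second:.4f}")
```

A score is averaged over users in practice; the per-user value drops as relevant items appear further down the list.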
4. 【2602.12041】Compress, Cross and Scale: Multi-Level Compression Cross Networks for Efficient Scaling in Recommender Systems
Link: https://arxiv.org/abs/2602.12041
Authors: Heng Yu, Xiangjun Zhou, Jie Xia, Heng Zhao, Anxin Wu, Yu Zhao, Dongying Kong
Subjects: Information Retrieval (cs.IR)
Keywords: conversion rate prediction, Modeling high-order feature, rate prediction, Modeling high-order, click-through rate
Comments: 11 pages, 3 figures
Abstract:Modeling high-order feature interactions efficiently is a central challenge in click-through rate and conversion rate prediction. Modern industrial recommender systems are predominantly built upon deep learning recommendation models, where the interaction backbone plays a critical role in determining both predictive performance and system efficiency. However, existing interaction modules often struggle to simultaneously achieve strong interaction capacity, high computational efficiency, and good scalability, resulting in limited ROI when models are scaled under strict production constraints. In this work, we propose MLCC, a structured feature interaction architecture that organizes feature crosses through hierarchical compression and dynamic composition, which can efficiently capture high-order feature dependencies while maintaining favorable computational complexity. We further introduce MC-MLCC, a Multi-Channel extension that decomposes feature interactions into parallel subspaces, enabling efficient horizontal scaling with improved representation capacity and significantly reduced parameter growth. Extensive experiments on three public benchmarks and a large-scale industrial dataset show that our proposed models consistently outperform strong DLRM-style baselines by up to 0.52 AUC, while reducing model parameters and FLOPs by up to 26× under comparable performance. Comprehensive scaling analyses demonstrate stable and predictable scaling behavior across embedding dimension, head number, and channel count, with channel-based scaling achieving substantially better efficiency than conventional embedding inflation. Finally, online A/B testing on a real-world advertising platform validates the practical effectiveness of our approach, which has been widely adopted in the Bilibili advertising system under strict latency and resource constraints.
5. 【2602.11941】IncompeBench: A Permissively Licensed, Fine-Grained Benchmark for Music Information Retrieval
Link: https://arxiv.org/abs/2602.11941
Authors: Benjamin Clavié, Atoof Shakir, Jonah Turner, Sean Lee, Aamir Shakir, Makoto P. Kato
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: increasingly strong multimodal, strong multimodal abilities, Multimodal Information Retrieval, made significant progress, deep pre-trained models
Comments:
Abstract:Multimodal Information Retrieval has made significant progress in recent years, leveraging the increasingly strong multimodal abilities of deep pre-trained models to represent information across modalities. Music Information Retrieval (MIR), in particular, has considerably increased in quality, with neural representations of music even making their way into everyday life products. However, there is a lack of high-quality benchmarks for evaluating music retrieval performance. To address this issue, we introduce IncompeBench, a carefully annotated benchmark comprising 1,574 permissively licensed, high-quality music snippets, 500 diverse queries, and over 125,000 individual relevance judgements. These annotations were created through the use of a multi-stage pipeline, resulting in high agreement between human annotators and the generated data. The resulting datasets are publicly available at this https URL and this https URL with the prompts available at this https URL.
6. 【2602.11874】Efficient Crawling for Scalable Web Data Acquisition (Extended Version)
Link: https://arxiv.org/abs/2602.11874
Authors: Antoine Gauquier, Ioana Manolescu, Pierre Senellart
Subjects: Information Retrieval (cs.IR)
Keywords: require analyzing high-quality, Journalistic fact-checking, high-quality statistics datasets, analyzing high-quality statistics, economic research
Comments: Extended version of a paper published at the EDBT 2026 conference
Abstract:Journalistic fact-checking, as well as social or economic research, require analyzing high-quality statistics datasets (SDs, in short). However, retrieving SD corpora at scale may be hard, inefficient, or impossible, depending on how they are published online. To improve open statistics data accessibility, we present a focused Web crawling algorithm that retrieves as many targets, i.e., resources of certain types, as possible, from a given website, in an efficient and scalable way, by crawling (much) less than the full website. We show that optimally solving this problem is intractable, and propose an approach based on reinforcement learning, namely using sleeping bandits. We propose SB-CLASSIFIER, a crawler that efficiently learns which hyperlinks lead to pages that link to many targets, based on the paths leading to the links in their enclosing webpages. Our experiments on websites with millions of webpages show that our crawler is highly efficient, delivering high fractions of a site's targets while crawling only a small part.
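The abstract frames crawling as a sleeping-bandit problem, where only some actions are available at each step. The sketch below shows a generic UCB-style sleeping bandit under stated assumptions: arms stand for hypothetical groups of hyperlinks, and the reward is 1 when following a link led to a target. It does not reproduce SB-CLASSIFIER itself; the class name `SleepingUCB` is illustrative.

```python
import math

class SleepingUCB:
    """UCB-style bandit in which only a subset of arms is "awake" each round."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms    # number of pulls per arm
        self.means = [0.0] * n_arms   # running mean reward per arm
        self.t = 0                    # number of rounds played

    def select(self, awake):
        """Pick the awake arm with the highest upper confidence bound."""
        self.t += 1
        for arm in awake:
            if self.counts[arm] == 0:
                return arm            # try every arm once before comparing
        def ucb(arm):
            bonus = math.sqrt(2 * math.log(self.t) / self.counts[arm])
            return self.means[arm] + bonus
        return max(awake, key=ucb)

    def update(self, arm, reward):
        """Fold the observed reward into the arm's running mean."""
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```

In a crawler, `awake` would be the link groups present in the current frontier; arms whose links have all been visited simply stop appearing in that list.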
7. 【2602.11841】Improving Neural Retrieval with Attribution-Guided Query Rewriting
Link: https://arxiv.org/abs/2602.11841
Authors: Moncef Garouani, Josiane Mothe
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: relevant documents exist, Neural retrievers, effective but brittle, documents exist, misdirect ranking
Comments:
Abstract:Neural retrievers are effective but brittle: underspecified or ambiguous queries can misdirect ranking even when relevant documents exist. Existing approaches address this brittleness only partially: LLMs rewrite queries without retriever feedback, and explainability methods identify misleading tokens but are used for post-hoc analysis. We close this loop and propose an attribution-guided query rewriting method that uses token-level explanations to guide query rewriting. For each query, we compute gradient-based token attributions from the retriever and then use these scores as soft guidance in a structured prompt to an LLM that clarifies weak or misleading query components while preserving intent. Evaluated on BEIR collections, the resulting rewrites consistently improve retrieval effectiveness over strong baselines, with larger gains for implicit or ambiguous information needs.
8. 【2602.11836】ULTRA: Urdu Language Transformer-based Recommendation Architecture
Link: https://arxiv.org/abs/2602.11836
Authors: Alishbah Bashir, Fatima Qaiser, Ijaz Hussain
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: lacks effective semantic, lacks effective, domain of personalized, recommendation, content recommendation
Comments:
Abstract:Urdu, as a low-resource language, lacks effective semantic content recommendation systems, particularly in the domain of personalized news retrieval. Existing approaches largely rely on lexical matching or language-agnostic techniques, which struggle to capture semantic intent and perform poorly under varying query lengths and information needs. This limitation results in reduced relevance and adaptability in Urdu content recommendation. We propose ULTRA (Urdu Language Transformer-based Recommendation Architecture), an adaptive semantic recommendation framework designed to address these challenges. ULTRA introduces a dual-embedding architecture with a query-length aware routing mechanism that dynamically distinguishes between short, intent-focused queries and longer, context-rich queries. Based on a threshold-driven decision process, user queries are routed to specialized semantic pipelines optimized for either title/headline-level or full-content/document-level representations, ensuring appropriate semantic granularity during retrieval. The proposed system leverages transformer-based embeddings and optimized pooling strategies to move beyond surface-level keyword matching and enable context-aware similarity search. Extensive experiments conducted on a large-scale Urdu news corpus demonstrate that the proposed architecture consistently improves recommendation relevance across diverse query types. Results show gains in precision above 90% compared to single-pipeline baselines, highlighting the effectiveness of query-adaptive semantic alignment for low-resource languages. The findings establish ULTRA as a robust and generalizable content recommendation architecture, offering practical design insights for semantic retrieval systems in low-resource language settings.
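The threshold-driven routing described in this abstract can be pictured with a very small sketch. Everything concrete here is an assumption for illustration: the whitespace tokenization, the threshold of 5 tokens, and the pipeline names `title_index` and `content_index` are hypothetical and not taken from the paper.

```python
def route_query(query, threshold=5):
    """Send short queries to a title-level pipeline and long ones to a document-level one.

    The threshold and the returned pipeline names are hypothetical placeholders.
    """
    n_tokens = len(query.split())  # naive whitespace tokenization
    return "title_index" if n_tokens <= threshold else "content_index"

# Hypothetical usage: a short headline-style query and a longer descriptive one.
short_route = route_query("cricket world cup")
long_route = route_query("detailed analysis of the national budget and its effect on food prices")
```

A real system would then embed the query with the encoder attached to the chosen pipeline and run similarity search against the matching index.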
9. 【2602.11764】Reliable and Private Anonymous Routing for Satellite Constellations
Link: https://arxiv.org/abs/2602.11764
Authors: Nilesh Vyas, Fabien Geyer, Svetoslav Duhovnikov
Subjects: Cryptography and Security (cs.CR); Emerging Technologies (cs.ET); Information Retrieval (cs.IR); Networking and Internet Architecture (cs.NI)
Keywords: pose critical threats, state actors operating, dual-use LEO satellite, pose critical, mixed-trust environments
Comments: 14 pages, 16 figures
Abstract:Shared, dynamic network infrastructures, such as dual-use LEO satellite constellations, pose critical threats to metadata privacy, particularly for state actors operating in mixed-trust environments. This work proposes an enhanced anonymity architecture, evolving the Loopix mix-network, to provide robust security and reliability in these volatile topologies. We introduce three primary contributions: (1) A multi-path transport protocol utilizing $(n, k)$ erasure codes, which is demonstrated to counteract the high link volatility and intermittent connectivity that renders standard mix-networks unreliable. (2) The integration of a computationally efficient Private Information Retrieval (PIR) protocol during route discovery. (3) The introduction of adaptive, centrality-based delay strategies that efficiently mitigate the inherent topological bias of LEO networks, providing a superior anonymity-to-latency trade-off. This mechanism provably prevents metadata leakage at the user-provider directory, mitigating profiling and correlation attacks. We validate this architecture via high-fidelity, packet-level simulations of a LEO constellation. Empirical results show our multi-path transport achieves near-zero message loss, establishing a quantifiable trade-off between reliability and bandwidth overhead. Furthermore, microbenchmarks of the PIR protocol quantify its computational and latency overheads, confirming its feasibility for practical deployment. This work provides a validated blueprint for deployable high-anonymity communication systems, demonstrating the viability of securely multiplexing sensitive operations within large-scale commercial network infrastructures.
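To make the (n, k) erasure-code idea concrete, here is a minimal sketch of the simplest such code: a single XOR parity fragment, that is, an (n, k) = (k+1, k) code that survives the loss of any one fragment. The abstract does not say which code family the authors use; practical multi-path systems often rely on Reed-Solomon-style codes that tolerate more losses. The function names are illustrative.

```python
from functools import reduce

def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(fragments):
    """Turn k equal-length data fragments into k+1 fragments (data plus XOR parity)."""
    parity = reduce(xor_bytes, fragments)
    return list(fragments) + [parity]

def decode(received):
    """Recover the k data fragments when at most one of the k+1 fragments is None."""
    missing = [i for i, frag in enumerate(received) if frag is None]
    if len(missing) > 1:
        raise ValueError("a single-parity code tolerates only one lost fragment")
    received = list(received)
    if missing:
        present = [frag for frag in received if frag is not None]
        # XOR of all surviving fragments equals the lost one.
        received[missing[0]] = reduce(xor_bytes, present)
    return received[:-1]

# Hypothetical usage: send four fragments over four paths, lose one in transit.
data = [b"abcd", b"efgh", b"ijkl"]
coded = encode(data)
coded[1] = None
recovered = decode(coded)
```

The trade-off the abstract mentions is visible even here: one extra fragment of bandwidth buys tolerance to one lost path.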
10. 【2602.11719】Uncertainty-aware Generative Recommendation
Link: https://arxiv.org/abs/2602.11719
Authors: Chenxiao Fan, Chongming Gao, Yaxin Gong, Haoyan Liu, Fuli Feng, Xiangnan He
Subjects: Information Retrieval (cs.IR)
Keywords: autoregressive sequence generation, sequence generation task, transformative paradigm, autoregressive sequence, Uncertainty-aware Generative Recommendation
Comments:
Abstract:Generative Recommendation has emerged as a transformative paradigm, reformulating recommendation as an end-to-end autoregressive sequence generation task. Despite its promise, existing preference optimization methods typically rely on binary outcome correctness, suffering from a systemic limitation we term uncertainty blindness. This issue manifests in the neglect of the model's intrinsic generation confidence, the variation in sample learning difficulty, and the lack of explicit confidence expression, directly leading to unstable training dynamics and unquantifiable decision risks. In this paper, we propose Uncertainty-aware Generative Recommendation (UGR), a unified framework that leverages uncertainty as a critical signal for adaptive optimization. UGR synergizes three mechanisms: (1) an uncertainty-weighted reward to penalize confident errors; (2) difficulty-aware optimization dynamics to prevent premature convergence; and (3) explicit confidence alignment to empower the model with confidence expression capabilities. Extensive experiments demonstrate that UGR not only yields superior recommendation performance but also fundamentally stabilizes training, preventing the performance degradation often observed in standard methods. Furthermore, the learned confidence enables reliable downstream risk-aware applications.
11. 【2602.11680】EpicCBR: Item-Relation-Enhanced Dual-Scenario Contrastive Learning for Cold-Start Bundle Recommendation
Link: https://arxiv.org/abs/2602.11680
Authors: Yihang Li, Zhuo Liu, Wei Wei
Subjects: Information Retrieval (cs.IR)
Keywords: Bundle recommendation aims, aims to recommend, recommend a set, Bundle recommendation, recommendation aims
Comments: 10 pages, 3 figures, 5 tables, accepted by WSDM 2026
Abstract:Bundle recommendation aims to recommend a set of items to users for overall consumption. Existing bundle recommendation models primarily depend on observed user-bundle interactions, limiting exploration of newly-emerged bundles that are constantly created. This poses a critical representation challenge for current bundle methods, as they usually treat each bundle as an independent instance, while neglecting to fully leverage the user-item (UI) and bundle-item (BI) relations over popular items. To alleviate it, in this paper we propose a multi-view contrastive learning framework for cold-start bundle recommendation, named EpicCBR. Specifically, it precisely mines and utilizes the item relations to construct user profiles, identifying users likely to engage with bundles. Additionally, a popularity-based method that characterizes the features of new bundles through historical bundle information and user preferences is proposed. To build a framework that demonstrates robustness in both cold-start and warm-start scenarios, a multi-view graph contrastive learning framework capable of integrating these diverse scenarios is introduced to ensure the model's generalization capability. Extensive experiments conducted on three popular benchmarks showed that EpicCBR outperforms the state of the art by a large margin (up to 387%), sufficiently demonstrating the superiority of the proposed method in the cold-start scenario. The code and dataset can be found in the GitHub repository: this https URL.
12. 【2602.11664】IntTravel: A Real-World Dataset and Generative Framework for Integrated Multi-Task Travel Recommendation
Link: https://arxiv.org/abs/2602.11664
Authors: Huimin Yan, Longfei Xu, Junjie Sun, Zheng Liu, Wei Luo, Kaikui Liu, Xiangxiang Chu
Subjects: Information Retrieval (cs.IR)
Keywords: Point of Interest, location-based services, essential for modern, modern mobility, mobility and location-based
Comments:
Abstract:Next Point of Interest (POI) recommendation is essential for modern mobility and location-based services. To provide a smooth user experience, models must understand several components of a journey holistically: "when to depart", "how to travel", "where to go", and "what needs arise via the route". However, current research is limited by fragmented datasets that focus merely on next POI recommendation ("where to go"), neglecting the departure time, travel mode, and situational requirements along the journey. Furthermore, the limited scale of these datasets impedes accurate evaluation of performance. To bridge this gap, we introduce IntTravel, the first large-scale public dataset for integrated travel recommendation, including 4.1 billion interactions from 163 million users with 7.3 million POIs. Built upon this dataset, we introduce an end-to-end, decoder-only generative framework for multi-task recommendation. It incorporates information preservation, selection, and factorization to balance task collaboration with specialized differentiation, yielding substantial performance gains. The framework's generalizability is highlighted by its state-of-the-art performance across both IntTravel dataset and an additional non-travel benchmark. IntTravel has been successfully deployed on Amap serving hundreds of millions of users, leading to a 1.09% increase in CTR. IntTravel is available at this https URL.
13. 【2602.11622】Evolutionary Router Feature Generation for Zero-Shot Graph Anomaly Detection with Mixture-of-Experts
Link: https://arxiv.org/abs/2602.11622
Authors: Haiyang Jiang, Tong Chen, Xinyi Gao, Guansong Pang, Quoc Viet Hung Nguyen, Hongzhi Yin
Subjects: Information Retrieval (cs.IR)
Keywords: single GNN methods, GNN methods insufficiently, attention recent years, attracted increasing attention, increasing attention recent
Comments:
Abstract:Zero-shot graph anomaly detection (GAD) has attracted increasing attention in recent years, yet the heterogeneity of graph structures, features, and anomaly patterns across graphs makes existing single GNN methods insufficiently expressive to model diverse anomaly mechanisms. In this regard, Mixture-of-experts (MoE) architectures provide a promising paradigm by integrating diverse GNN experts with complementary inductive biases, yet their effectiveness in zero-shot GAD is severely constrained by distribution shifts, leading to two key routing challenges. First, nodes often carry vastly different semantics across graphs, and straightforwardly performing routing based on their features is prone to generating biased or suboptimal expert assignments. Second, as anomalous graphs often exhibit pronounced distributional discrepancies, existing router designs fall short in capturing domain-invariant routing principles that generalize beyond the training graphs. To address these challenges, we propose a novel MoE framework with evolutionary router feature generation (EvoFG) for zero-shot GAD. To enhance MoE routing, we propose an evolutionary feature generation scheme that iteratively constructs and selects informative structural features via an LLM-based generator and Shapley-guided evaluation. Moreover, a memory-enhanced router with an invariant learning objective is designed to capture transferable routing patterns under distribution shifts. Extensive experiments on six benchmarks show that EvoFG consistently outperforms state-of-the-art baselines, achieving strong and stable zero-shot GAD performance.
14. 【2602.11605】Recurrent Preference Memory for Efficient Long-Sequence Generative Recommendation
Link: https://arxiv.org/abs/2602.11605
Authors: Yixiao Chen, Yuan Wang, Yue Liu, Qiyao Wang, Ke Cheng, Xin Xu, Juntong Yan, Shuojin Yang, Menghao Guo, Jun Zhang, Huan Yu, Jie Jiang
Subjects: Information Retrieval (cs.IR)
Keywords: prohibitive computational costs, Generative recommendation, full attention, behavior via full, scaling to lifelong
Comments: 12 pages, 6 figures
Abstract:Generative recommendation (GenRec) models typically model user behavior via full attention, but scaling to lifelong sequences is hindered by prohibitive computational costs and noise accumulation from stochastic interactions. To address these challenges, we introduce Rec2PM, a framework that compresses long user interaction histories into compact Preference Memory tokens. Unlike traditional recurrent methods that suffer from serial training, Rec2PM employs a novel self-referential teacher-forcing strategy: it leverages a global view of the history to generate reference memories, which serve as supervision targets for parallelized recurrent updates. This allows for fully parallel training while maintaining the capability for iterative updates during inference. Additionally, by representing memory as token embeddings rather than extensive KV caches, Rec2PM achieves extreme storage efficiency. Experiments on large-scale benchmarks show that Rec2PM significantly reduces inference latency and memory footprint while achieving superior accuracy compared to full-sequence models. Analysis reveals that the Preference Memory functions as a denoising Information Bottleneck, effectively filtering interaction noise to capture robust long-term interests.
15. 【2602.11581】Analytical Search
Link: https://arxiv.org/abs/2602.11581
Authors: Yiteng Tu, Shuo Miao, Weihang Su, Yiqun Liu, Qingyao Ai
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: causal impact assessment, domains including law, Analytical, impact assessment, including law
Comments:
Abstract:Analytical information needs, such as trend analysis and causal impact assessment, are prevalent across various domains including law, finance, science, and much more. However, existing information retrieval paradigms, whether based on relevance-oriented document ranking or retrieval-augmented generation (RAG) with large language models (LLMs), often struggle to meet the end-to-end requirements of such tasks at the corpus scale. They either emphasize information finding rather than end-to-end problem solving, or simply treat everything as naive question answering, offering limited control over reasoning, evidence usage, and verifiability. As a result, they struggle to support analytical queries that have diverse utility concepts and high accountability requirements. In this paper, we propose analytical search as a distinct and emerging search paradigm designed to fulfill these analytical information needs. Analytical search reframes search as an evidence-governed, process-oriented analytical workflow that explicitly models analytical intent, retrieves evidence for fusion, and produces verifiable conclusions through structured, multi-step inference. We position analytical search in contrast to existing paradigms, and present a unified system framework that integrates query understanding, recall-oriented retrieval, reasoning-aware fusion, and adaptive verification. We also discuss potential research directions for the construction of analytical search engines. In this way, we highlight the conceptual significance and practical importance of analytical search and call on efforts toward the next generation of search engines that support analytical information needs.
16. 【2602.11562】LASER: An Efficient Target-Aware Segmented Attention Framework for End-to-End Long Sequence Modeling
Link: https://arxiv.org/abs/2602.11562
Authors: Tianhe Lin, Ziwei Xiong, Baoyuan Ou, Yingjie Qin, Lai Xu, Xiaocheng Zhong, Yao Hu, Zhiyong Wang, Tao Zhou, Yubin Xu, Di Wu
Subjects: Information Retrieval (cs.IR)
Keywords: modern recommendation systems, ultra-long user behavior, Modeling ultra-long user, pivotal for capturing, capturing evolving
Comments: 9 pages
Abstract:Modeling ultra-long user behavior sequences is pivotal for capturing evolving and lifelong interests in modern recommendation systems. However, deploying such models in real-time industrial environments faces a strict "Latency Wall", constrained by two distinct bottlenecks: the high I/O latency of retrieving massive user histories and the quadratic computational complexity of standard attention mechanisms. To break these bottlenecks, we present LASER, a full-stack optimization framework developed and deployed at Xiaohongshu (RedNote). Our approach tackles the challenges through two complementary innovations: (1) System efficiency: We introduce SeqVault, a unified schema-aware serving infrastructure for long user histories. By implementing a hybrid DRAM-SSD indexing strategy, SeqVault reduces retrieval latency by 50% and CPU usage by 75%, ensuring millisecond-level access to full real-time and life-cycle user histories. (2) Algorithmic efficiency: We propose a Segmented Target Attention (STA) mechanism to address the computational overhead. Motivated by the inherent sparsity of user interests, STA employs a sigmoid-based gating strategy that acts as a silence mechanism to filter out noisy items. Subsequently, a lightweight Global Stacked Target Attention (GSTA) module refines these compressed segments to capture cross-segment dependencies without incurring high computational costs. This design performs effective sequence compression, reducing the complexity of long-sequence modeling while preserving critical signals. Extensive offline evaluations demonstrate that LASER consistently outperforms state-of-the-art baselines. In large-scale online A/B testing serving over 100 million daily active users, LASER achieved a 2.36% lift in ADVV and a 2.08% lift in revenue, demonstrating its scalability and significant commercial impact.
17. 【2602.11518】KuaiSearch: A Large-Scale E-Commerce Search Dataset for Recall, Ranking, and Relevance
Link: https://arxiv.org/abs/2602.11518
Authors: Yupeng Li, Ben Chen, Mingyue Cheng, Zhiding Liu, Xuxin Zhang, Chenyi Lei, Wenwu Ou
Subjects: Information Retrieval (cs.IR)
Keywords: connecting user demands, massive product inventories, E-commerce search, E-commerce search serves, central interface
Comments:
Abstract:E-commerce search serves as a central interface, connecting user demands with massive product inventories and plays a vital role in our daily lives. However, in real-world applications, it faces challenges, including highly ambiguous queries, noisy product texts with weak semantic order, and diverse user preferences, all of which make it difficult to accurately capture user intent and fine-grained product semantics. In recent years, significant advances in large language models (LLMs) for semantic representation and contextual reasoning have created new opportunities to address these challenges. Nevertheless, existing e-commerce search datasets still suffer from notable limitations: queries are often heuristically constructed, cold-start users and long-tail products are filtered out, query and product texts are anonymized, and most datasets cover only a single stage of the search pipeline. Collectively, these issues constrain research on LLM-based e-commerce search. To address these challenges, we construct and release KuaiSearch. To the best of our knowledge, it is the largest e-commerce search dataset currently available. KuaiSearch is built upon real user search interactions from the Kuaishou platform, preserving authentic user queries and natural-language product texts, covering cold-start users and long-tail products, and systematically spanning three key stages of the search pipeline: recall, ranking, and relevance judgment. We conduct a comprehensive analysis of KuaiSearch from multiple perspectives, including products, users, and queries, and establish benchmark experiments across several representative search tasks. Experimental results demonstrate that KuaiSearch provides a valuable foundation for research on real-world e-commerce search.
18. 【2602.11453】From Noise to Order: Learning to Rank via Denoising Diffusion
Link: https://arxiv.org/abs/2602.11453
Authors: Sajad Ebrahimi, Bhaskar Mitra, Negar Arabzadeh, Ye Yuan, Haolun Wu, Fattane Zarrinkalam, Ebrahim Bagheri
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: discriminative machine learning, machine learning approaches, information retrieval, methods have traditionally, query-document pair
Comments:
Abstract:In information retrieval (IR), learning-to-rank (LTR) methods have traditionally limited themselves to discriminative machine learning approaches that model the probability of the document being relevant to the query given some feature representation of the query-document pair. In this work, we propose an alternative denoising diffusion-based deep generative approach to LTR that instead models the full joint distribution over feature vectors and relevance labels. While in the discriminative setting, an over-parameterized ranking model may find different ways to fit the training data, we hypothesize that candidate solutions that can explain the full data distribution under the generative setting produce more robust ranking models. With this motivation, we propose DiffusionRank that extends TabDiff, an existing denoising diffusion-based generative model for tabular datasets, to create generative equivalents of classical discriminative pointwise and pairwise LTR objectives. Our empirical results demonstrate significant improvements from DiffusionRank models over their discriminative counterparts. Our work points to a rich space for future research exploration on how we can leverage ongoing advancements in deep generative modeling approaches, such as diffusion, for learning-to-rank in IR.
19. 【2602.11443】Filtered Approximate Nearest Neighbor Search in Vector Databases: System Design and Performance Analysis
Link: https://arxiv.org/abs/2602.11443
Authors: Abylay Amanbayev, Brian Tsan, Tri Dang, Florin Rusu
Subjects: Databases (cs.DB); Information Retrieval (cs.IR)
Keywords: Approximate Nearest Neighbor, Nearest Neighbor Search, Nearest Neighbor, applications increasingly rely, Retrieval-Augmented Generation
Comments: The artifacts are available at: https://github.com/aabylay/ANN-benchmark-HQ
Abstract:Retrieval-Augmented Generation (RAG) applications increasingly rely on Filtered Approximate Nearest Neighbor Search (FANNS) to combine semantic retrieval with metadata constraints. While algorithmic innovations for FANNS have been proposed, there remains a lack of understanding regarding how generic filtering strategies perform within Vector Databases. In this work, we systematize the taxonomy of filtering strategies and evaluate their integration into FAISS, Milvus, and pgvector. To provide a robust benchmarking framework, we introduce a new relational dataset, MoReVec, consisting of two tables, featuring 768-dimensional text embeddings and a rich schema of metadata attributes. We further propose the Global-Local Selectivity (GLS) correlation metric to quantify the relationship between filters and query vectors. Our experiments reveal that algorithmic adaptations within the engine often override raw index performance. Specifically, we find that: (1) Milvus achieves superior recall stability through hybrid approximate/exact execution; (2) pgvector's cost-based query optimizer frequently selects suboptimal execution plans, favoring approximate index scans even when exact sequential scans would yield perfect recall at comparable latency; and (3) partition-based indexes (IVFFlat) outperform graph-based indexes (HNSW) for low-selectivity queries. To facilitate this analysis, we extend the widely-used ANN-Benchmarks to support filtered vector search and make it available online. Finally, we synthesize our findings into a set of practical guidelines for selecting index types and configuring query optimizers for hybrid search workloads.
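Two generic filtering strategies that are commonly discussed for filtered vector search are pre-filtering (apply the metadata predicate, then search) and post-filtering (search, then apply the predicate). The sketch below contrasts them using exact brute-force cosine search as a stand-in for an ANN index. The toy rows, the `lang` attribute, and the function names are assumptions for illustration; the paper's own taxonomy and the behavior of FAISS, Milvus, or pgvector are not reproduced here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def pre_filter_search(rows, query, predicate, k):
    """Apply the metadata filter first, then rank only the surviving rows."""
    candidates = [r for r in rows if predicate(r)]
    return sorted(candidates, key=lambda r: cosine(r["vec"], query), reverse=True)[:k]

def post_filter_search(rows, query, predicate, k):
    """Take the top-k by similarity first, then drop rows that fail the filter."""
    top = sorted(rows, key=lambda r: cosine(r["vec"], query), reverse=True)[:k]
    return [r for r in top if predicate(r)]

# Toy table: the rows closest to the query do not satisfy the filter.
rows = [
    {"id": 1, "vec": [1.0, 0.0], "lang": "en"},
    {"id": 2, "vec": [0.9, 0.1], "lang": "en"},
    {"id": 3, "vec": [0.5, 0.5], "lang": "fr"},
    {"id": 4, "vec": [0.0, 1.0], "lang": "fr"},
]
query = [1.0, 0.0]
is_french = lambda r: r["lang"] == "fr"

pre = [r["id"] for r in pre_filter_search(rows, query, is_french, k=2)]
post = [r["id"] for r in post_filter_search(rows, query, is_french, k=2)]
```

Here pre-filtering returns both matching rows, while post-filtering returns none because the top-2 neighbors are filtered out afterwards. This kind of recall loss is one reason engines may fall back to exact scans or over-fetch candidates when the filter and the query vector are poorly correlated.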
20. 【2602.11235】MTFM: A Scalable and Alignment-free Foundation Model for Industrial Recommendation in Meituan
Link: https://arxiv.org/abs/2602.11235
Authors: Xin Song, Zhilin Guan, Ruidong Han, Binghao Tang, Tianwen Chen, Bing Li, Zihao Li, Han Zhang, Fei Jiang, Chaolin Xie, Chi Ma, Chunyang Jiang, Chunzhen Jing, Dengxuan Li, Fengyi Li, Lei Yu, Mengyao Sun, Pu Wang, Qing Wang, Rui Fan, Shangyu Chen, Shifeng Du, Siyuan Bai, Wei Lin, Wentao Zhu, Zhou Han, Zhuo Chen, Zikang Xu
Subjects: Information Retrieval (cs.IR)
Keywords: Industrial recommendation systems, involve multiple scenarios, systems typically involve, typically involve multiple, require prohibitive resources
Comments:
Abstract:Industrial recommendation systems typically involve multiple scenarios, yet existing cross-domain (CDR) and multi-scenario (MSR) methods often require prohibitive resources and strict input alignment, limiting their extensibility. We propose MTFM (Meituan Foundation Model for Recommendation), a transformer-based framework that addresses these challenges. Instead of pre-aligning inputs, MTFM transforms cross-domain data into heterogeneous tokens, capturing multi-scenario knowledge in an alignment-free manner. To enhance efficiency, we first introduce a multi-scenario user-level sample aggregation that significantly enhances training throughput by reducing the total number of instances. We further integrate Grouped-Query Attention and a customized Hybrid Target Attention to minimize memory usage and computational complexity. Furthermore, we implement various system-level optimizations, such as kernel fusion and the elimination of CPU-GPU blocking, to further enhance both training and inference throughput. Offline and online experiments validate the effectiveness of MTFM, demonstrating that significant performance gains are achieved by scaling both model capacity and multi-scenario training data.
21. 【2602.11160】BIRD: A Museum Open Dataset Combining Behavior Patterns and Identity Types to Better Model Visitors' Experience
Link: https://arxiv.org/abs/2602.11160
Authors: Alexanne Worm(LORIA),Florian Marchal(LORIA),Sylvain Castagnos(LORIA)
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: Artificial Intelligence, problem in Artificial, recurring problem, essential for training, training and validating
Comments:
Abstract:Lack of data is a recurring problem in Artificial Intelligence, as it is essential for training and validating models. This is particularly true in the field of cultural heritage, where the number of open datasets is relatively limited and where the data collected does not always allow for holistic modeling of visitors' experience due to the fact that data are ad hoc (i.e. restricted to the sole characteristics required for the evaluation of a specific model). To overcome this lack, we conducted a study between February and March 2019 aimed at obtaining comprehensive and detailed information about visitors, their visit experience and their feedback. We equipped 51 participants with eye-tracking glasses, leaving them free to explore the 3 floors of the museum for an average of 57 minutes, and to discover an exhibition of more than 400 artworks. On this basis, we built an open dataset combining contextual data (demographic data, preferences, visiting habits, motivations, social context...), behavioral data (spatiotemporal trajectories, gaze data) and feedback (satisfaction, fatigue, liked artworks, verbatim...). Our analysis made it possible to re-enact visitor identities combining the majority of characteristics found in the literature and to reproduce the Veron and Levasseur profiles. This dataset will ultimately make it possible to improve the quality of recommended paths in museums by personalizing the number of points of interest (POIs), the time spent at these different POIs, and the amount of information to be provided to each visitor based on their level of interest.
22. 【2602.11156】HybridRAG: A Practical LLM-based ChatBot Framework based on Pre-Generated QA over Raw Unstructured Documents
Link: https://arxiv.org/abs/2602.11156
Authors: Sungmoon Kim,Hyuna Jeon,Dahye Kim,Mingyu Kim,Dong-Kyu Chae,Jiwoong Kim
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: Large Language Model, Language Model, grounding Large Language, based chatbot responses, Large Language
Comments:
Abstract:Retrieval-Augmented Generation (RAG) has emerged as a powerful approach for grounding Large Language Model (LLM)-based chatbot responses on external knowledge. However, existing RAG studies typically assume well-structured textual sources (e.g. Wikipedia or curated datasets) and perform retrieval and generation at query time, which can limit their applicability in real-world chatbot scenarios. In this paper, we present HybridRAG, a novel and practical RAG framework towards more accurate and faster chatbot responses. First, HybridRAG ingests raw, unstructured PDF documents containing complex layouts (text, tables, figures) via Optical Character Recognition (OCR) and layout analysis, and converts them into hierarchical text chunks. Then, it pre-generates a plausible question-answer (QA) knowledge base from the organized chunks using an LLM. At query time, user questions are matched against this QA bank to retrieve immediate answers when possible, and only if no suitable QA match is found does our framework fall back to an on-the-fly response generation. Experiments on OHRBench demonstrate that our HybridRAG provides higher answer quality and lower latency compared to a standard RAG baseline. We believe that HybridRAG could be a practical solution for real-world chatbot applications that must handle large volumes of unstructured documents and lots of users under limited computational resources.
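The query-time behaviour described in this abstract (match the question against a pre-generated QA bank, fall back to generation only when no match is good enough) can be sketched as below. This is a minimal illustration under stated assumptions, not the HybridRAG implementation: token-overlap (Jaccard) similarity stands in for whatever matcher the paper uses, and `answer`, `generate`, and the `threshold` value of 0.6 are hypothetical names and settings chosen for the example.

```python
import re


def tokens(text):
    """Lowercased word tokens of a string, as a set."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Token-overlap similarity in [0, 1]; a stand-in for a real matcher."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def answer(question, qa_bank, generate, threshold=0.6):
    """Return (answer, source), where source is 'qa_bank' or 'generated'.

    qa_bank: list of (question, answer) pairs produced ahead of time.
    generate: fallback callable used when no stored question is close enough.
    threshold: hypothetical cut-off for accepting a stored answer.
    """
    best_answer, best_score = None, 0.0
    for stored_question, stored_answer in qa_bank:
        score = jaccard(question, stored_question)
        if score > best_score:
            best_answer, best_score = stored_answer, score
    if best_answer is not None and best_score >= threshold:
        return best_answer, "qa_bank"
    return generate(question), "generated"
```

Serving a stored answer skips generation entirely for that query, which is the mechanism behind the latency reduction the abstract reports.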
Computer Vision
1. 【2602.12280】Stroke of Surprise: Progressive Semantic Illusions in Vector Sketching
Link: https://arxiv.org/abs/2602.12280
Authors: Huai-Hsun Cheng,Siang-Ling Zhang,Yu-Lun Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: illusions traditionally rely, Progressive Semantic Illusions, Visual illusions traditionally, multi-view consistency, traditionally rely
Comments: Project page: [this https URL](https://stroke-of-surprise.github.io/) Code: [this https URL](https://github.com/stroke-of-surprise/Stroke-Of-Surprise)
Abstract:Visual illusions traditionally rely on spatial manipulations such as multi-view consistency. In this work, we introduce Progressive Semantic Illusions, a novel vector sketching task where a single sketch undergoes a dramatic semantic transformation through the sequential addition of strokes. We present Stroke of Surprise, a generative framework that optimizes vector strokes to satisfy distinct semantic interpretations at different drawing stages. The core challenge lies in the "dual-constraint": initial prefix strokes must form a coherent object (e.g., a duck) while simultaneously serving as the structural foundation for a second concept (e.g., a sheep) upon adding delta strokes. To address this, we propose a sequence-aware joint optimization framework driven by a dual-branch Score Distillation Sampling (SDS) mechanism. Unlike sequential approaches that freeze the initial state, our method dynamically adjusts prefix strokes to discover a "common structural subspace" valid for both targets. Furthermore, we introduce a novel Overlay Loss that enforces spatial complementarity, ensuring structural integration rather than occlusion. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art baselines in recognizability and illusion strength, successfully expanding visual anagrams from the spatial to the temporal dimension.
2. 【2602.12279】UniT: Unified Multimodal Chain-of-Thought Test-time Scaling
Link: https://arxiv.org/abs/2602.12279
Authors: Leon Liangyu Chen,Haoyu Ma,Zhipeng Fan,Ziqi Huang,Animesh Sinha,Xiaoliang Dai,Jialiang Wang,Zecheng He,Jianwei Yang,Chunyuan Li,Junzhe Sun,Chu Wang,Serena Yeung-Levy,Felix Juefei-Xu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: refining their outputs, typically operate, pass without iteratively, iteratively refining, Unified
Comments:
Abstract:Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving complex spatial compositions, multiple interacting objects, or evolving instructions, require decomposing instructions, verifying intermediate results, and making iterative corrections. While test-time scaling (TTS) has demonstrated that allocating additional inference compute for iterative reasoning substantially improves language model performance, extending this paradigm to unified multimodal models remains an open challenge. We introduce UniT, a framework for multimodal chain-of-thought test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds. UniT combines agentic data synthesis, unified model training, and flexible test-time inference to elicit cognitive behaviors including verification, subgoal decomposition, and content memory. Our key findings are: (1) unified models trained on short reasoning trajectories generalize to longer inference chains at test time; (2) sequential chain-of-thought reasoning provides a more scalable and compute-efficient TTS strategy than parallel sampling; (3) training on generation and editing trajectories improves out-of-distribution visual reasoning. These results establish multimodal test-time scaling as an effective paradigm for advancing both generation and understanding in unified models.
3. 【2602.12271】MonarchRT: Efficient Attention for Real-Time Video Generation
Link: https://arxiv.org/abs/2602.12271
Authors: Krish Agarwal,Zhuoming Chen,Cheng Luo,Yongqi Chen,Haizhong Zheng,Xun Huang,Atri Rudra,Beidi Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Transformers is bottlenecked, Diffusion Transformers, few-step and autoregressive, substantially more information, quadratic cost
Comments:
Abstract:Real-time video generation with Diffusion Transformers is bottlenecked by the quadratic cost of 3D self-attention, especially in real-time regimes that are both few-step and autoregressive, where errors compound across time and each denoising step must carry substantially more information. In this setting, we find that prior sparse-attention approximations break down, despite showing strong results for bidirectional, many-step diffusion. Specifically, we observe that video attention is not reliably sparse, but instead combines pronounced periodic structure driven by spatiotemporal position with dynamic, sparse semantic correspondences and dense mixing, exceeding the representational capacity of even oracle top-k attention. Building on this insight, we propose Monarch-RT, a structured attention parameterization for video diffusion models that factorizes attention using Monarch matrices. Through appropriately aligned block structure and our extended tiled Monarch parameterization, we achieve high expressivity while preserving computational efficiency. We further overcome the overhead of parameterization through finetuning, with custom Triton kernels. We first validate the high efficacy of Monarch-RT over existing sparse baselines designed only for bidirectional models. We further observe that Monarch-RT attains up to 95% attention sparsity with no loss in quality when applied to the state-of-the-art model Self-Forcing, making Monarch-RT a pioneering work on highly-capable sparse attention parameterization for real-time video generation. Our optimized implementation outperforms FlashAttention-2, FlashAttention-3, and FlashAttention-4 kernels on Nvidia RTX 5090, H100, and B200 GPUs respectively, providing kernel speedups in the range of 1.4-11.8X. This enables us, for the first time, to achieve true real-time video generation with Self-Forcing at 16 FPS on a single RTX 5090.
4. 【2602.12236】Energy-Aware Spike Budgeting for Continual Learning in Spiking Neural Networks for Neuromorphic Vision
Link: https://arxiv.org/abs/2602.12236
Authors: Anika Tabassum Meem,Muntasir Hossain Nadid,Md Zesun Ahmed Mia
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: continually evolving environments, catastrophic forgetting remains, spiking neural networks, neural networks, evolving environments
Comments:
Abstract:Neuromorphic vision systems based on spiking neural networks (SNNs) offer ultra-low-power perception for event-based and frame-based cameras, yet catastrophic forgetting remains a critical barrier to deployment in continually evolving environments. Existing continual learning methods, developed primarily for artificial neural networks, seldom jointly optimize accuracy and energy efficiency, with particularly limited exploration on event-based datasets. We propose an energy-aware spike budgeting framework for continual SNN learning that integrates experience replay, learnable leaky integrate-and-fire neuron parameters, and an adaptive spike scheduler to enforce dataset-specific energy constraints during training. Our approach exhibits modality-dependent behavior: on frame-based datasets (MNIST, CIFAR-10), spike budgeting acts as a sparsity-inducing regularizer, improving accuracy while reducing spike rates by up to 47%; on event-based datasets (DVS-Gesture, N-MNIST, CIFAR-10-DVS), controlled budget relaxation enables accuracy gains up to 17.45 percentage points with minimal computational overhead. Across five benchmarks spanning both modalities, our method demonstrates consistent performance improvements while minimizing dynamic power consumption, advancing the practical viability of continual learning in neuromorphic vision systems.
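To make "spike budgeting" concrete, here is a minimal sketch of a leaky integrate-and-fire (LIF) neuron together with a penalty on firing above a target rate. It is only an illustration under assumptions: the hinge-style penalty, the soft reset, and the parameter names `beta`, `threshold`, and `budget_rate` are choices made for this example, not the paper's adaptive spike scheduler.

```python
def lif_spikes(inputs, beta=0.5, threshold=1.0):
    """Simulate one LIF neuron and return its 0/1 spike train.

    beta: membrane leak factor per time step (learnable in the paper,
          fixed here for simplicity).
    threshold: membrane potential at which the neuron fires.
    """
    potential = 0.0
    spikes = []
    for current in inputs:
        potential = beta * potential + current  # leak, then integrate input
        if potential >= threshold:
            spikes.append(1)
            potential -= threshold  # soft reset after a spike
        else:
            spikes.append(0)
    return spikes


def budget_penalty(spikes, budget_rate):
    """Hinge penalty: zero while the firing rate stays within the budget."""
    rate = sum(spikes) / len(spikes)
    return max(0.0, rate - budget_rate)
```

Adding a term like `budget_penalty` to a training loss pushes firing rates down only when they exceed the budget. Since each spike costs energy on neuromorphic hardware, that is one simple way a budget can act as the sparsity-inducing regularizer the abstract describes.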
5. 【2602.12222】Towards On-Policy SFT: Distribution Discriminant Theory and its Applications in LLM Training
Link: https://arxiv.org/abs/2602.12222
Authors: Miaosen Zhang,Yishan Liu,Shuxia Lin,Xu Yang,Qi Dai,Chong Luo,Weihao Jiang,Peng Hou,Anxiang Zeng,Xin Geng,Baining Guo
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Supervised fine-tuning, yields inferior generalization, inferior generalization compared, reinforcement learning, computationally efficient
Comments:
Abstract:Supervised fine-tuning (SFT) is computationally efficient but often yields inferior generalization compared to reinforcement learning (RL). This gap is primarily driven by RL's use of on-policy data. We propose a framework to bridge this chasm by enabling On-Policy SFT. We first present Distribution Discriminant Theory (DDT), which explains and quantifies the alignment between data and the model-induced distribution. Leveraging DDT, we introduce two complementary techniques: (i) In-Distribution Finetuning (IDFT), a loss-level method to enhance generalization ability of SFT, and (ii) Hinted Decoding, a data-level technique that can re-align the training corpus to the model's distribution. Extensive experiments demonstrate that our framework achieves generalization performance on par with prominent offline RL algorithms, including DPO and SimPO, while maintaining the efficiency of an SFT pipeline. The proposed framework thus offers a practical alternative in domains where RL is infeasible. We open-source the code here: this https URL
6. 【2602.12221】Best of Both Worlds: Multimodal Reasoning and Generation via Unified Discrete Flow Matching
Link: https://arxiv.org/abs/2602.12221
Authors: Onkar Susladkar,Tushar Prakash,Gayatri Deshmukh,Kiet A. Nguyen,Jiaxun Zhang,Adheesh Juvekar,Tianshu Bao,Lin Chai,Sparsh Mittal,Inderjit S Dhillon,Ismini Lourentzou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: unified discrete flow-matching, discrete flow-matching framework, propose UniDFlow, unified discrete, discrete flow-matching
Comments:
Abstract:We propose UniDFlow, a unified discrete flow-matching framework for multimodal understanding, generation, and editing. It decouples understanding and generation via task-specific low-rank adapters, avoiding objective interference and representation entanglement, while a novel reference-based multimodal preference alignment optimizes relative outcomes under identical conditioning, improving faithfulness and controllability without large-scale retraining. UniDFlow achieves SOTA performance across eight benchmarks and exhibits strong zero-shot generalization to tasks including inpainting, in-context image generation, reference-based editing, and compositional generation, despite no explicit task-specific training.
7. 【2602.12205】DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing
Link: https://arxiv.org/abs/2602.12205
Authors: Dianyi Wang,Ruihang Li,Feng Han,Chaofan Ma,Wei Song,Siyuan Wang,Yibin Wang,Yi Xin,Hongjian Liu,Zhixiong Zhang,Shengyuan Ding,Tianhang Wang,Zhenglin Cheng,Tao Lin,Cheng Jin,Kaicheng Yu,Jingjing Chen,Wenjie Wang,Zhongyu Wei,Jiaqi Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: massive parameter scales, Current unified multimodal, editing typically rely, entailing prohibitive training, prohibitive training costs
Comments:
Abstract:Current unified multimodal models for image generation and editing typically rely on massive parameter scales (e.g., 10B), entailing prohibitive training costs and deployment footprints. In this work, we present DeepGen 1.0, a lightweight 5B unified model that achieves comprehensive capabilities competitive with or surpassing much larger counterparts. To overcome the limitations of compact models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable 'think tokens' to provide the generative backbone with structured, reasoning-rich guidance. We further design a data-centric training strategy spanning three progressive stages: (1) Alignment Pre-training on large-scale image-text pairs and editing triplets to synchronize VLM and DiT representations, (2) Joint Supervised Fine-tuning on a high-quality mixture of generation, editing, and reasoning tasks to foster omni-capabilities, and (3) Reinforcement Learning with MR-GRPO, which leverages a mixture of reward functions and supervision signals, resulting in substantial gains in generation quality and alignment with human preferences, while maintaining stable training progress and avoiding visual artifacts. Despite being trained on only ~50M samples, DeepGen 1.0 achieves leading performance across diverse benchmarks, surpassing the 80B HunyuanImage by 28% on WISE and the 27B Qwen-Image-Edit by 37% on UniREditBench. By open-sourcing our training code, weights, and datasets, we provide an efficient, high-performance alternative to democratize unified multimodal research.
8. 【2602.12177】EO-VAE: Towards A Multi-sensor Tokenizer for Earth Observation Data
Link: https://arxiv.org/abs/2602.12177
Authors: Nils Lehmann,Yi Wang,Zhitong Xiong,Xiaoxiang Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: compress high-dimensional inputs, efficient latent representations, video models rely, models rely heavily, image and video
Comments:
Abstract:State-of-the-art generative image and video models rely heavily on tokenizers that compress high-dimensional inputs into more efficient latent representations. While this paradigm has revolutionized RGB generation, Earth observation (EO) data presents unique challenges due to diverse sensor specifications and variable spectral channels. We propose EO-VAE, a multi-sensor variational autoencoder designed to serve as a foundational tokenizer for the EO domain. Unlike prior approaches that train separate tokenizers for each modality, EO-VAE utilizes a single model to encode and reconstruct flexible channel combinations via dynamic hypernetworks. Our experiments on the TerraMesh dataset demonstrate that EO-VAE achieves superior reconstruction fidelity compared to the TerraMind tokenizers, establishing a robust baseline for latent generative modeling in remote sensing.
9. 【2602.12160】DreamID-Omni: Unified Framework for Controllable Human-Centric Audio-Video Generation
Link: https://arxiv.org/abs/2602.12160
Authors: Xu Guo,Fulong Ye,Qichao Sun,Liyang Chen,Bingchuan Li,Pengze Zhang,Jiawei Liu,Songtao Zhao,Qian He,Xiangwang Hou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: joint audio-video generation, revolutionized joint audio-video, Recent advancements, audio-video generation, reference-based audio-video generation
Comments: Project: [this https URL](https://guoxu1233.github.io/DreamID-Omni/)
Abstract:Recent advancements in foundation models have revolutionized joint audio-video generation. However, existing approaches typically treat human-centric tasks including reference-based audio-video generation (R2AV), video editing (RV2AV) and audio-driven video animation (RA2V) as isolated objectives. Furthermore, achieving precise, disentangled control over multiple character identities and voice timbres within a single framework remains an open challenge. In this paper, we propose DreamID-Omni, a unified framework for controllable human-centric audio-video generation. Specifically, we design a Symmetric Conditional Diffusion Transformer that integrates heterogeneous conditioning signals via a symmetric conditional injection scheme. To resolve the pervasive identity-timbre binding failures and speaker confusion in multi-person scenarios, we introduce a Dual-Level Disentanglement strategy: Synchronized RoPE at the signal level to ensure rigid attention-space binding, and Structured Captions at the semantic level to establish explicit attribute-subject mappings. Furthermore, we devise a Multi-Task Progressive Training scheme that leverages weakly-constrained generative priors to regularize strongly-constrained tasks, preventing overfitting and harmonizing disparate objectives. Extensive experiments demonstrate that DreamID-Omni achieves comprehensive state-of-the-art performance across video, audio, and audio-visual consistency, even outperforming leading proprietary commercial models. We will release our code to bridge the gap between academic research and commercial-grade applications.
10. 【2602.12157】TexSpot: 3D Texture Enhancement with Spatially-uniform Point Latent Representation
Link: https://arxiv.org/abs/2602.12157
Authors: Ziteng Lu,Yushuang Wu,Chongjie Ye,Yuda Qiu,Jing Shao,Xiaoyang Guo,Jiaqing Zhou,Tianlei Hu,Kun Zhou,Xiaoguang Han
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: fundamental challenge due, current mainstream multi-view, texture generation remains, remains a fundamental, fundamental challenge
Comments: Project page: [this https URL](https://anonymous.4open.science/w/TexSpot-page-2D91)
Abstract:High-quality 3D texture generation remains a fundamental challenge due to the view-inconsistency inherent in current mainstream multi-view diffusion pipelines. Existing representations either rely on UV maps, which suffer from distortion during unwrapping, or point-based methods, which tightly couple texture fidelity to geometric density that limits high-resolution texture generation. To address these limitations, we introduce TexSpot, a diffusion-based texture enhancement framework. At its core is Texlet, a novel 3D texture representation that merges the geometric expressiveness of point-based 3D textures with the compactness of UV-based representation. Each Texlet latent vector encodes a local texture patch via a 2D encoder and is further aggregated using a 3D encoder to incorporate global shape context. A cascaded 3D-to-2D decoder reconstructs high-quality texture patches, enabling the Texlet space learning. Leveraging this representation, we train a diffusion transformer conditioned on Texlets to refine and enhance textures produced by multi-view diffusion methods. Extensive experiments demonstrate that TexSpot significantly improves visual fidelity, geometric consistency, and robustness over existing state-of-the-art 3D texture generation and enhancement approaches.
11. 【2602.12155】FAIL: Flow Matching Adversarial Imitation Learning for Image Generation
Link: https://arxiv.org/abs/2602.12155
Authors: Yeyao Ma,Chen Li,Xiaosong Zhang,Han Hu,Weidi Xie
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: high-quality target-is mathematically, target-is mathematically equivalent, flow matching models-aligning, imitation learning, Adversarial Imitation Learning
Comments:
Abstract:Post-training of flow matching models (aligning the output distribution with a high-quality target) is mathematically equivalent to imitation learning. While Supervised Fine-Tuning mimics expert demonstrations effectively, it cannot correct policy drift in unseen states. Preference optimization methods address this but require costly preference pairs or reward modeling. We propose Flow Matching Adversarial Imitation Learning (FAIL), which minimizes policy-expert divergence through adversarial training without explicit rewards or pairwise comparisons. We derive two algorithms: FAIL-PD exploits differentiable ODE solvers for low-variance pathwise gradients, while FAIL-PG provides a black-box alternative for discrete or computationally constrained settings. Fine-tuning FLUX with only 13,000 demonstrations from Nano Banana pro, FAIL achieves competitive performance on prompt following and aesthetic benchmarks. Furthermore, the framework generalizes effectively to discrete image and video generation, and functions as a robust regularizer to mitigate reward hacking in reward-based optimization. Code and data are available at this https URL.
12. 【2602.12127】PosterOmni: Generalized Artistic Poster Creation via Task Distillation and Unified Reward Feedback
Link: https://arxiv.org/abs/2602.12127
Authors: Sixiang Chen,Jianyu Lai,Jialin Gao,Hengyu Shi,Zhongying Liu,Tian Ye,Junfeng Luo,Xiaoming Wei,Lei Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: high-demand task requiring, high-level design understanding, global creation, understanding abstract design, local editing
Comments:
Abstract:Image-to-poster generation is a high-demand task requiring not only local adjustments but also high-level design understanding. Models must generate text, layout, style, and visual elements while preserving semantic fidelity and aesthetic coherence. The process spans two regimes: local editing, where ID-driven generation, rescaling, filling, and extending must preserve concrete visual entities; and global creation, where layout- and style-driven tasks rely on understanding abstract design concepts. These intertwined demands make image-to-poster a multi-dimensional process coupling entity-preserving editing with concept-driven creation under image-prompt control. To address these challenges, we propose PosterOmni, a generalized artistic poster creation framework that unlocks the potential of a base edit model for multi-task image-to-poster generation. PosterOmni integrates the two regimes, namely local editing and global creation, within a single system through an efficient data-distillation-reward pipeline: (i) constructing multi-scenario image-to-poster datasets covering six task types across entity-based and concept-based creation; (ii) distilling knowledge between local and global experts for supervised fine-tuning; and (iii) applying unified PosterOmni Reward Feedback to jointly align visual entity-preserving and aesthetic preference across all tasks. Additionally, we establish PosterOmni-Bench, a unified benchmark for evaluating both local editing and global creation. Extensive experiments show that PosterOmni significantly enhances reference adherence, global composition quality, and aesthetic harmony, outperforming all open-source baselines and even surpassing several proprietary systems.
13. 【2602.12105】Iskra: A System for Inverse Geometry Processing
Link: https://arxiv.org/abs/2602.12105
Authors: Ana Dodik,Ahmed H. Mahmoud,Justin Solomon
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: geometry processing problems, geometry processing, geometry processing applications, existing geometry processing, geometry processing algorithms
Comments:
Abstract:We propose a system for differentiating through solutions to geometry processing problems. Our system differentiates a broad class of geometric algorithms, exploiting existing fast problem-specific schemes common to geometry processing, including local-global and ADMM solvers. It is compatible with machine learning frameworks, opening doors to new classes of inverse geometry processing applications. We marry the scatter-gather approach to mesh processing with tensor-based workflows and rely on the adjoint method applied to user-specified imperative code to generate an efficient backward pass behind the scenes. We demonstrate our approach by differentiating through mean curvature flow, spectral conformal parameterization, geodesic distance computation, and as-rigid-as-possible deformation, examining usability and performance on these applications. Our system allows practitioners to differentiate through existing geometry processing algorithms without needing to reformulate them, resulting in low implementation effort, fast runtimes, and lower memory requirements than differentiable optimization tools not tailored to geometry processing.
14. 【2602.12100】AssetFormer: Modular 3D Assets Generation with Autoregressive Transformer
Link: https://arxiv.org/abs/2602.12100
Authors: Lingting Zhu,Shengju Qian,Haidi Fan,Jiayu Dong,Zhenchao Jin,Siwei Zhou,Gen Dong,Xin Wang,Lequan Yu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: industry demands high-quality, digital industry demands, demands high-quality, digital industry, industry demands
Comments: Accepted by ICLR 2026. 23 pages, 14 figures
Abstract:The digital industry demands high-quality, diverse modular 3D assets, especially for user-generated content (UGC). In this work, we introduce AssetFormer, an autoregressive Transformer-based model designed to generate modular 3D assets from textual descriptions. Our pilot study leverages real-world modular assets collected from online platforms. AssetFormer tackles the challenge of creating assets composed of primitives that adhere to constrained design parameters for various applications. By innovatively adapting module sequencing and decoding techniques inspired by language models, our approach enhances asset generation quality through autoregressive modeling. Initial results indicate the effectiveness of AssetFormer in streamlining asset creation for professional development and UGC scenarios. This work presents a flexible framework extendable to various types of modular 3D assets, contributing to the broader field of 3D content generation. The code is available at this https URL.
15. 【2602.12099】GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning
Link: https://arxiv.org/abs/2602.12099
Authors: GigaBrain Team:Boyuan Wang,Chaojun Ni,Guan Huang,Guosheng Zhao,Hao Li,Jie Li,Jindi Lv,Jingyu Liu,Lv Feng,Mingming Yu,Peng Li,Qiuping Deng,Tianze Liu,Xinyu Zhou,Xinze Chen,Xiaofeng Wang,Yang Wang,Yifan Li,Yifei Nie,Yilong Li,Yukun Zhou,Yun Ye,Zhichao Liu,Zheng Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: directly predict multi-step, predict multi-step action, multi-step action chunks, current observations face, observations face inherent
Comments: [this https URL](https://gigabrain05m.github.io/)
Abstract:Vision-language-action (VLA) models that directly predict multi-step action chunks from current observations face inherent limitations due to constrained scene understanding and weak future anticipation capabilities. In contrast, video world models pre-trained on web-scale video corpora exhibit robust spatiotemporal reasoning and accurate future prediction, making them a natural foundation for enhancing VLA learning. Therefore, we propose GigaBrain-0.5M*, a VLA model trained via world model-based reinforcement learning. It is built upon GigaBrain-0.5, which is pre-trained on over 10,000 hours of robotic manipulation data and whose intermediate version currently ranks first on the international RoboChallenge benchmark. GigaBrain-0.5M* further integrates world model-based reinforcement learning via RAMP (Reinforcement leArning via world Model-conditioned Policy) to enable robust cross-task adaptation. Empirical results demonstrate that RAMP achieves substantial performance gains over the RECAP baseline, yielding improvements of approximately 30% on challenging tasks including Laundry Folding, Box Packing, and Espresso Preparation. Critically, GigaBrain-0.5M* exhibits reliable long-horizon execution, consistently accomplishing complex manipulation tasks without failure as validated by real-world deployment videos on our project page.
16. 【2602.12092】DeepSight: An All-in-One LM Safety Toolkit
链接:https://arxiv.org/abs/2602.12092
作者:Bo Zhang,Jiaxuan Guo,Lijun Li,Dongrui Liu,Sujin Chen,Guanxu Chen,Zhijie Zheng,Qihao Lin,Lewen Yan,Chen Qian,Yijin Zhou,Yuyao Wu,Shaoxiong Guo,Tianyi Du,Jingyi Yang,Xuhao Hu,Ziqi Miao,Xiaoya Lu,Jing Shao,Xia Hu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Language Models, Multimodal Large Language, Large Language, current Large Language, Language Models
备注: Technical report, 29 pages, 24 figures
Abstract:As the development of Large Models (LMs) progresses rapidly, their safety is also a priority. In current safety workflows for Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), evaluation, diagnosis, and alignment are often handled by separate tools. Specifically, safety evaluation can only locate external behavioral risks but cannot figure out internal root causes. Meanwhile, safety diagnosis often drifts from concrete risk scenarios and remains at the explainable level. In this way, safety alignment lacks dedicated explanations of changes in internal mechanisms, potentially degrading general capabilities. To systematically address these issues, we propose an open-source project, namely DeepSight, to practice a new safety evaluation-diagnosis integrated paradigm. DeepSight is a low-cost, reproducible, efficient, and highly scalable large-scale model safety evaluation project consisting of an evaluation toolkit, DeepSafe, and a diagnosis toolkit, DeepScan. By unifying task and data protocols, we build a connection between the two stages and transform safety evaluation from black-box to white-box insight. Besides, DeepSight is the first open-source toolkit that supports frontier AI risk evaluation and joint safety evaluation and diagnosis.
17. 【2602.12044】A DMD-Based Adaptive Modulation Method for High Dynamic Range Imaging in High-Glare Environments
链接:https://arxiv.org/abs/2602.12044
作者:Banglei Guan,Jing Tao,Liang Xu,Dongcai Tan,Pengju Sun,Jianbing Liu,Yang Shang,Qifeng Yu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:photomechanics measurements critically, measurements critically relies, welding arc monitoring, polished metallic surface, extreme illumination conditions
备注: This paper has been accepted by Experimental Mechanics
Abstract:Background: The accuracy of photomechanics measurements critically relies on image quality, particularly under extreme illumination conditions such as welding arc monitoring and polished metallic surface analysis. High dynamic range (HDR) imaging above 120 dB is essential in these contexts. Conventional CCD/CMOS sensors, with dynamic ranges typically below 70 dB, are highly susceptible to saturation under glare, resulting in irreversible loss of detail and significant errors in digital image correlation (DIC). Methods: This paper presents an HDR imaging system that leverages the spatial modulation capability of a digital micromirror device (DMD). The system architecture enables autonomous regional segmentation and adaptive exposure control for high-dynamic-range scenes through an integrated framework comprising two synergistic subsystems: a DMD-based optical modulation unit and an adaptive computational imaging pipeline. Results: The system achieves a measurable dynamic range of 127 dB, effectively eliminating saturation artifacts under high glare. Experimental results demonstrate a 78% reduction in strain error and improved DIC positioning accuracy, confirming reliable performance across extreme intensity variations. Conclusion: The DMD-based system provides high-fidelity adaptive HDR imaging, overcoming key limitations of conventional sensors. It exhibits strong potential for optical metrology and stress analysis in high-glare environments where traditional methods are inadequate.
18. 【2602.12003】Projected Representation Conditioning for High-fidelity Novel View Synthesis
链接:https://arxiv.org/abs/2602.12003
作者:Min-Seop Kwak,Minkyung Kwon,Jinhyeok Choi,Jiho Park,Seungryong Kim
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:enhanced geometric consistency, semantic correspondence properties, generated novel viewpoints, leverage external representations, enhanced geometric
备注:
Abstract:We propose a novel framework for diffusion-based novel view synthesis in which we leverage external representations as conditions, harnessing their geometric and semantic correspondence properties for enhanced geometric consistency in generated novel viewpoints. First, we provide a detailed analysis exploring the correspondence capabilities emergent in the spatial attention of external visual representations. Building from these insights, we propose a representation-guided novel view synthesis through dedicated representation projection modules that inject external representations into the diffusion process, a methodology named ReNoV, short for representation-guided novel view synthesis. Our experiments show that this design yields marked improvements in both reconstruction fidelity and inpainting quality, outperforming prior diffusion-based novel-view methods on standard benchmarks and enabling robust synthesis from sparse, unposed image collections.
19. 【2602.12002】Can Local Vision-Language Models improve Activity Recognition over Vision Transformers? -- Case Study on Newborn Resuscitation
链接:https://arxiv.org/abs/2602.12002
作者:Enrico Guerriero,Kjersti Engan,Øyvind Meinich-Bache
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Accurate documentation, newborn resuscitation videos, newborn resuscitation, clinical guidelines, underutilized in practice
备注: Presented at the Satellite Workshop on Workshop 15: Generative AI for World Simulations and Communications Celebrating 40 Years of Excellence in Education: Honoring Professor Aggelos Katsaggelos, IEEE International Conference on Image Processing (ICIP), 2025
Abstract:Accurate documentation of newborn resuscitation is essential for quality improvement and adherence to clinical guidelines, yet remains underutilized in practice. Previous work using 3D-CNNs and Vision Transformers (ViT) has shown promising results in detecting key activities from newborn resuscitation videos, but also highlighted the challenges in recognizing such fine-grained activities. This work investigates the potential of generative AI (GenAI) methods to improve activity recognition from such videos. Specifically, we explore the use of local vision-language models (VLMs), combined with large language models (LLMs), and compare them to a supervised TimeSformer baseline. Using a simulated dataset comprising 13.26 hours of newborn resuscitation videos, we evaluate several zero-shot VLM-based strategies and fine-tuned VLMs with classification heads, including Low-Rank Adaptation (LoRA). Our results suggest that small (local) VLMs struggle with hallucinations, but when fine-tuned with LoRA, they reach an F1 score of 0.91, surpassing the TimeSformer result of 0.70.
20. 【2602.11980】Spatial Chain-of-Thought: Bridging Understanding and Generation Models for Spatial Reasoning Generation
链接:https://arxiv.org/abs/2602.11980
作者:Wei Chen,Yancheng Long,Mingqiao Liu,Haojie Ding,Yankai Yang,Hongyang Wei,Yi-Fan Zhang,Bin Wen,Fan Yang,Tingting Gao,Han Li,Long Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal Large Language, Large Language Models, aesthetic image synthesis, shown exceptional capabilities, shown exceptional
备注: 19 pages, 4 figures
Abstract:While diffusion models have shown exceptional capabilities in aesthetic image synthesis, they often struggle with complex spatial understanding and reasoning. Existing approaches resort to Multimodal Large Language Models (MLLMs) to enhance this capability. However, they either incur high computational costs through joint training or suffer from spatial information loss when relying solely on textual prompts. To alleviate these limitations, we propose a Spatial Chain-of-Thought (SCoT) framework, a plug-and-play approach that effectively bridges the reasoning capabilities of MLLMs with the generative power of diffusion models. Specifically, we first enhance the diffusion model's layout awareness by training it on an interleaved text-coordinate instruction format. We then leverage state-of-the-art MLLMs as planners to generate comprehensive layout plans, transferring their spatial planning capabilities directly to the generation process. Extensive experiments demonstrate that our method achieves state-of-the-art performance on image generation benchmarks and significantly outperforms baselines on complex reasoning tasks, while also showing strong efficacy in image editing scenarios.
21. 【2602.11973】Calibrated Bayesian Deep Learning for Explainable Decision Support Systems Based on Medical Imaging
链接:https://arxiv.org/abs/2602.11973
作者:Hua Xu,Julián D. Arias-Londoño,Juan I. Godino-Llorente
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:critical decision support, decision support systems, support systems based, critical decision, decision support
备注: 24 pages, 3 figures
Abstract:In critical decision support systems based on medical imaging, the reliability of AI-assisted decision-making is as relevant as predictive accuracy. Although deep learning models have demonstrated significant accuracy, they frequently suffer from miscalibration, manifested as overconfidence in erroneous predictions. To facilitate clinical acceptance, it is imperative that models quantify uncertainty in a manner that correlates with prediction correctness, allowing clinicians to identify unreliable outputs for further review. In order to address this necessity, the present paper proposes a generalizable probabilistic optimization framework grounded in Bayesian deep learning. Specifically, a novel Confidence-Uncertainty Boundary Loss (CUB-Loss) is introduced that imposes penalties on high-certainty errors and low-certainty correct predictions, explicitly enforcing alignment between prediction correctness and uncertainty estimates. Complementing this training-time optimization, a Dual Temperature Scaling (DTS) strategy is devised for post-hoc calibration, further refining the posterior distribution to improve intuitive explainability. The proposed framework is validated on three distinct medical imaging tasks: automatic screening of pneumonia, diabetic retinopathy detection, and identification of skin lesions. Empirical results demonstrate that the proposed approach achieves consistent calibration improvements across diverse modalities, maintains robust performance in data-scarce scenarios, and remains effective on severely imbalanced datasets, underscoring its potential for real clinical deployment.
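For readers unfamiliar with post-hoc calibration, the following is a minimal sketch of standard single-temperature scaling, the classic technique that the Dual Temperature Scaling name builds on. It is not the paper's DTS or CUB-Loss, whose details the abstract does not give. The function name `softmax_with_temperature` and the example logits are hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T > 1 softens, T < 1 sharpens)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits from an overconfident classifier.
logits = [4.0, 1.0, 0.5]
p_raw = softmax_with_temperature(logits, 1.0)    # uncalibrated probabilities
p_soft = softmax_with_temperature(logits, 2.0)   # softened probabilities

# The predicted class is unchanged; only the reported confidence drops.
print(max(p_raw), max(p_soft))
```

In practice the temperature is usually fitted on a held-out validation set rather than chosen by hand; the value 2.0 here is only for illustration.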
22. 【2602.11960】Benchmarking Vision-Language Models for French PDF-to-Markdown Conversion
链接:https://arxiv.org/abs/2602.11960
作者:Bruno Rigal,Victor Dupriez,Alexis Mignon,Ronan Le Hy,Nicolas Mery
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:challenging French documents, recent Vision-Language Models, challenging French, French documents, report evaluates
备注: 13 pages, 6 figures
Abstract:This report evaluates PDF-to-Markdown conversion using recent Vision-Language Models (VLMs) on challenging French documents. Document parsing is a critical step for Retrieval-Augmented Generation (RAG) pipelines, where transcription and layout errors propagate to downstream retrieval and grounding. Existing benchmarks often emphasize English or Chinese and can over-penalize benign formatting and linearization choices (e.g., line breaks, list segmentation, alternative table renderings) that are largely irrelevant for downstream use. We introduce a French-focused benchmark of difficult pages selected via model-disagreement sampling from a corpus of 60,000 documents, covering handwritten forms, complex layouts, dense tables, and graphics-rich pages. Evaluation is performed with unit-test-style checks that target concrete failure modes (text presence, reading order, and local table constraints) combined with category-specific normalization designed to discount presentation-only variance. Across 15 models, we observe substantially higher robustness for the strongest proprietary models on handwriting and forms, while several open-weights systems remain competitive on standard printed layouts.
23. 【2602.11942】Synthesis of Late Gadolinium Enhancement Images via Implicit Neural Representations for Cardiac Scar Segmentation
链接:https://arxiv.org/abs/2602.11942
作者:Soufiane Ben Haddou,Laura Alvarez-Florez,Erik J. Bekkers,Fleur V. Y. Tjong,Ahmad S. Amin,Connie R. Bezzina,Ivana Išgum
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Late gadolinium enhancement, myocardial scar assessment, limited annotated datasets, annotated datasets hinder, Late gadolinium
备注: Paper accepted at SPIE Medical Imaging 2026 Conference
Abstract:Late gadolinium enhancement (LGE) imaging is the clinical standard for myocardial scar assessment, but limited annotated datasets hinder the development of automated segmentation methods. We propose a novel framework that synthesises both LGE images and their corresponding segmentation masks using implicit neural representations (INRs) combined with denoising diffusion models. Our approach first trains INRs to capture continuous spatial representations of LGE data and associated myocardium and fibrosis masks. These INRs are then compressed into compact latent embeddings, preserving essential anatomical information. A diffusion model operates on this latent space to generate new representations, which are decoded into synthetic LGE images with anatomically consistent segmentation masks. Experiments on 133 cardiac MRI scans suggest that augmenting training data with 200 synthetic volumes contributes to improved fibrosis segmentation performance, with the Dice score showing an increase from 0.509 to 0.524. Our approach provides an annotation-free method to help mitigate data scarcity. The code for this research is publicly available.
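The Dice score quoted in the abstract measures overlap between a predicted mask and a reference mask. A minimal sketch of the standard definition is shown below; the function name `dice_score`, the toy masks, and the empty-mask convention are assumptions for illustration and are not taken from the paper's code.

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for flat binary masks of 0/1 values."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks count as perfect agreement
    return 2.0 * intersection / total

# Toy flattened masks (hypothetical): 1 marks a fibrosis pixel.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
score = dice_score(pred, target)
print(round(score, 3))
```

A score of 1.0 means the two masks coincide, and 0.0 means they share no foreground pixels.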
24. 【2602.11919】DynaHOI: Benchmarking Hand-Object Interaction for Dynamic Target
链接:https://arxiv.org/abs/2602.11919
作者:BoCheng Hu,Zhonghan Zhao,Kaiyue Zhou,Hongwei Wang,Gaoang Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:coordination largely untested, time-critical coordination largely, leaving dynamic scenarios, hand-object interaction, focus on static
备注:
Abstract:Most existing hand motion generation benchmarks for hand-object interaction (HOI) focus on static objects, leaving dynamic scenarios with moving targets and time-critical coordination largely untested. To address this gap, we introduce the DynaHOI-Gym, a unified online closed-loop platform with parameterized motion generators and rollout-based metrics for dynamic capture evaluation. Built on DynaHOI-Gym, we release DynaHOI-10M, a large-scale benchmark with 10M frames and 180K hand capture trajectories, whose target motions are organized into 8 major categories and 22 fine-grained subcategories. We also provide a simple observe-before-act baseline (ObAct) that integrates short-term observations with the current frame via spatiotemporal attention to predict actions, achieving an 8.1% improvement in location success rate.
25. 【2602.11882】Where Bits Matter in World Model Planning: A Paired Mixed-Bit Study for Efficient Spatial Reasoning
链接:https://arxiv.org/abs/2602.11882
作者:Suraj Ranganath,Anish Patnaik,Vaishak Menon
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:requires world models, reasoning requires world, requires world, world models, reliable under tight
备注: Workshop submission
Abstract:Efficient spatial reasoning requires world models that remain reliable under tight precision budgets. We study whether low-bit planning behavior is determined mostly by total bitwidth or by where bits are allocated across modules. Using DINO-WM on the Wall planning task, we run a paired-goal mixed-bit evaluation across uniform, mixed, asymmetric, and layerwise variants under two planner budgets. We observe a consistent three-regime pattern: 8-bit and 6-bit settings remain close to FP16, 3-bit settings collapse, and 4-bit settings are allocation-sensitive. In that transition region, preserving encoder precision improves planning relative to uniform quantization, and near-size asymmetric variants show the same encoder-side direction. In a later strict 22-cell replication with smaller per-cell episode count, the mixed-versus-uniform INT4 sign becomes budget-conditioned, which further highlights the sensitivity of this transition regime. These findings motivate module-aware, budget-aware quantization policies as a broader research direction for efficient spatial reasoning. Code and run artifacts are available at this https URL.
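To give a feel for why bitwidth matters, the sketch below applies plain symmetric uniform quantization to a small list of weights and measures the round-trip error at several bitwidths. This is a generic illustration, not the paper's mixed-bit pipeline for DINO-WM. The function names and the sample weights are hypothetical.

```python
def quantize_dequantize(values, bits):
    """Symmetric uniform quantization to signed `bits`-bit integers, then back to floats."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for 8 bits, 3 for 3 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)               # nothing to scale
    scale = max_abs / qmax                # one shared scale for the whole tensor
    out = []
    for v in values:
        q = round(v / scale)
        q = max(-qmax, min(qmax, q))      # clamp to the representable range
        out.append(q * scale)
    return out

def mean_abs_error(values, bits):
    """Mean absolute difference between the original and round-tripped values."""
    deq = quantize_dequantize(values, bits)
    return sum(abs(a - b) for a, b in zip(values, deq)) / len(values)

# Hypothetical weights; real layers have far more values and different statistics.
weights = [0.013, -0.27, 0.5, -0.91, 0.33, 0.07, -0.44, 1.0]
errors = {bits: mean_abs_error(weights, bits) for bits in (8, 6, 4, 3)}
for bits, err in errors.items():
    print(bits, err)
```

A mixed-bit policy in the spirit of the abstract would call such a routine with a different `bits` value per module (for example, more bits for an encoder than for a predictor) instead of one global setting.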
26. 【2602.11880】SynthRAR: Ring Artifacts Reduction in CT with Unrolled Network and Synthetic Data Training
链接:https://arxiv.org/abs/2602.11880
作者:Hongxu Yang,Levente Lippenszky,Edina Timko,Gopal Avinash
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Defective and inconsistent, making them unusable, ring artifact reduction, reconstructed images, streak artifacts
备注: Prepare for submission
Abstract:Defective and inconsistent responses in CT detectors can cause ring and streak artifacts in the reconstructed images, making them unusable for clinical purposes. In recent years, several ring artifact reduction solutions have been proposed in the image domain or in the sinogram domain using supervised deep learning methods. However, these methods require dedicated datasets for training, leading to a high data collection cost. Furthermore, existing approaches focus exclusively on either image-space or sinogram-space correction, neglecting the intrinsic correlations from the forward operation of the CT geometry. Based on the theoretical analysis of non-ideal CT detector responses, the RAR problem is reformulated as an inverse problem by using an unrolled network, which considers non-ideal response together with linear forward-projection with CT geometry. Additionally, the intrinsic correlations of ring artifacts between the sinogram and image domains are leveraged through synthetic data derived from natural images, enabling the trained model to correct artifacts without requiring real-world clinical data. Extensive evaluations on diverse scanning geometries and anatomical regions demonstrate that the model trained on synthetic data consistently outperforms existing state-of-the-art methods.
27. 【2602.11875】DiffPlace: Street View Generation via Place-Controllable Diffusion Model Enhancing Place Recognition
链接:https://arxiv.org/abs/2602.11875
作者:Ji Li,Zhiwei Li,Shihao Li,Zhenjiang Yu,Boyang Wang,Haiou Liu
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:diffusion models excelling, advanced significantly, diffusion models improve, diffusion models, Recent multi-view diffusion
备注: accepted by ICRA 2026
Abstract:Generative models have advanced significantly in realistic image synthesis, with diffusion models excelling in quality and stability. Recent multi-view diffusion models improve 3D-aware street view generation, but they struggle to produce place-aware and background-consistent urban scenes from text, BEV maps, and object bounding boxes. This limits their effectiveness in generating realistic samples for place recognition tasks. To address these challenges, we propose DiffPlace, a novel framework that introduces a place-ID controller to enable place-controllable multi-view image generation. The place-ID controller employs linear projection, perceiver transformer, and contrastive learning to map place-ID embeddings into a fixed CLIP space, allowing the model to synthesize images with consistent background buildings while flexibly modifying foreground objects and weather conditions. Extensive experiments, including quantitative comparisons and augmented training evaluations, demonstrate that DiffPlace outperforms existing methods in both generation quality and training support for visual place recognition. Our results highlight the potential of generative models in enhancing scene-level and place-aware synthesis, providing a valuable approach for improving place recognition in autonomous driving.
28. 【2602.11858】Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
链接:https://arxiv.org/abs/2602.11858
作者:Lai Wei,Liangbo He,Jun Lan,Lingzhong Dong,Yutong Cai,Siyuan Li,Huijia Zhu,Weiqiang Wang,Linghe Kong,Yue Wang,Zhuosheng Zhang,Weiran Huang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Multimodal Large Language, Large Language Models, Large Language, broad visual understanding, Multimodal Large
备注:
Abstract:Multimodal Large Language Models (MLLMs) excel at broad visual understanding but still struggle with fine-grained perception, where decisive evidence is small and easily overwhelmed by global context. Recent "Thinking-with-Images" methods alleviate this by iteratively zooming in and out of regions of interest during inference, but incur high latency due to repeated tool calls and visual re-encoding. To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM. In particular, we first zoom in to micro-cropped regions to let strong teacher models generate high-quality VQA data, and then distill this region-grounded supervision back to the full image. After training on such data, the smaller student model improves "single-glance" fine-grained perception without tool use. To rigorously evaluate this capability, we further present ZoomBench, a hybrid-annotated benchmark of 845 VQA data spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global-regional "zooming gap". Experiments show that our models achieve leading performance across multiple fine-grained perception benchmarks, and also improve general multimodal cognition on benchmarks such as visual reasoning and GUI agents. We further discuss when "Thinking-with-Images" is necessary versus when its gains can be distilled into a single forward pass. Our code is available at this https URL.
29. 【2602.11850】Free Lunch for Stabilizing Rectified Flow Inversion
链接:https://arxiv.org/abs/2602.11850
作者:Chenru Wang,Beier Zhu,Chi Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:based generative models, traditional diffusion models, based generative, recently emerged, emerged as strong
备注:
Abstract:Rectified-Flow (RF)-based generative models have recently emerged as strong alternatives to traditional diffusion models, demonstrating state-of-the-art performance across various tasks. By learning a continuous velocity field that transforms simple noise into complex data, RF-based models not only enable high-quality generation, but also support training-free inversion, which facilitates downstream tasks such as reconstruction and editing. However, existing inversion methods, such as vanilla RF-based inversion, suffer from approximation errors that accumulate across timesteps, leading to unstable velocity fields and degraded reconstruction and editing quality. To address this challenge, we propose Proximal-Mean Inversion (PMI), a training-free gradient correction method that stabilizes the velocity field by guiding it toward a running average of past velocities, constrained within a theoretically derived spherical Gaussian. Furthermore, we introduce mimic-CFG, a lightweight velocity correction scheme for editing tasks, which interpolates between the current velocity and its projection onto the historical average, balancing editing effectiveness and structural consistency. Extensive experiments on PIE-Bench demonstrate that our methods significantly improve inversion stability, image reconstruction quality, and editing fidelity, while reducing the required number of neural function evaluations. Our approach achieves state-of-the-art performance on the PIE-Bench with enhanced efficiency and theoretical soundness.
30. 【2602.11845】WorldTree: Towards 4D Dynamic Worlds from Monocular Video using Tree-Chains
链接:https://arxiv.org/abs/2602.11845
作者:Qisen Wang,Yifan Zhao,Jia Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieved remarkable progress, remarkable progress, practical applications, reconstruction has achieved, achieved remarkable
备注:
Abstract:Dynamic reconstruction has achieved remarkable progress, but there remain challenges in monocular input for more practical applications. The prevailing works attempt to construct efficient motion representations, but lack a unified spatiotemporal decomposition framework, suffering from either holistic temporal optimization or coupled hierarchical spatial composition. To this end, we propose WorldTree, a unified framework comprising Temporal Partition Tree (TPT) that enables coarse-to-fine optimization based on the inheritance-based partition tree structure for hierarchical temporal decomposition, and Spatial Ancestral Chains (SAC) that recursively query ancestral hierarchical structure to provide complementary spatial dynamics while specializing motion representations across ancestral nodes. Experimental results on different datasets indicate that our proposed method achieves 8.26% improvement of LPIPS on NVIDIA-LS and 9.09% improvement of mLPIPS on DyCheck compared to the second-best method. Code: this https URL.
31. 【2602.11832】JEPA-VLA: Video Predictive Embedding is Needed for VLA Models
链接:https://arxiv.org/abs/2602.11832
作者:Shangchen Miao,Ningya Feng,Jialong Wu,Ye Lin,Xu He,Dong Li,Mingsheng Long
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:achieved significant improvements, pretrained vision-language models, robotic manipulation, models built, vision-language models
备注:
Abstract:Recent vision-language-action (VLA) models built upon pretrained vision-language models (VLMs) have achieved significant improvements in robotic manipulation. However, current VLAs still suffer from low sample efficiency and limited generalization. This paper argues that these limitations are closely tied to an overlooked component, pretrained visual representation, which offers insufficient knowledge on both aspects of environment understanding and policy prior. Through an in-depth analysis, we find that commonly used visual representations in VLAs, whether pretrained via language-image contrastive learning or image-based self-supervised learning, remain inadequate at capturing crucial, task-relevant environment information and at inducing effective policy priors, i.e., anticipatory knowledge of how the environment evolves under successful task execution. In contrast, we discover that predictive embeddings pretrained on videos, in particular V-JEPA 2, are adept at flexibly discarding unpredictable environment factors and encoding task-relevant temporal dynamics, thereby effectively compensating for key shortcomings of existing visual representations in VLAs. Building on these observations, we introduce JEPA-VLA, a simple yet effective approach that adaptively integrates predictive embeddings into existing VLAs. Our experiments demonstrate that JEPA-VLA yields substantial performance gains across a range of benchmarks, including LIBERO, LIBERO-plus, RoboTwin2.0, and real-robot tasks.
32. 【2602.11814】A Comparative Study of MAP and LMMSE Estimators for Blind Inverse Problems
链接:https://arxiv.org/abs/2602.11814
作者:Nathan Buskulic,Luca Calatroni
类目:Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:careful parameter selection, forward operators, framework for inverse, combined with expressive, inverse problems
备注:
Abstract:Maximum-a-posteriori (MAP) approaches are an effective framework for inverse problems with known forward operators, particularly when combined with expressive priors and careful parameter selection. In blind settings, however, their use becomes significantly less stable due to the inherent non-convexity of the problem and the potential non-identifiability of the solutions. (Linear) minimum mean square error (MMSE) estimators provide a compelling alternative that can circumvent these limitations. In this work, we study synthetic two-dimensional blind deconvolution problems under fully controlled conditions, with complete prior knowledge of both the signal and kernel distributions. We compare tailored MAP algorithms with simple LMMSE estimators whose functional form is closely related to that of an optimal Tikhonov estimator. Our results show that, even in these highly controlled settings, MAP methods remain unstable and require extensive parameter tuning, whereas the LMMSE estimator yields a robust and reliable baseline. Moreover, we demonstrate empirically that the LMMSE solution can serve as an effective initialization for MAP approaches, improving their performance and reducing sensitivity to regularization parameters, thereby opening the door to future theoretical and practical developments.
33. 【2602.11810】How to Sample High Quality 3D Fractals for Action Recognition Pre-Training?
链接:https://arxiv.org/abs/2602.11810
作者:Marko Putak,Thomas B. Moeslund,Joakim Bruslund Haurum
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:deep learning realm, Driven Supervised Learning, Formula Driven Supervised, exhaustively labeled real, labeled real data
备注: 12 pages, 6 figures. To be published in VISAPP
Abstract:Synthetic datasets are being recognized in the deep learning realm as a valuable alternative to exhaustively labeled real data. One such synthetic data generation method is Formula Driven Supervised Learning (FDSL), which can provide an unlimited amount of perfectly labeled data through a formula-driven approach, such as fractals or contours. FDSL does not have common drawbacks like manual labor, privacy and other ethical concerns. In this work we generate 3D fractals using 3D Iterated Function Systems (IFS) for pre-training an action recognition model. The fractals are temporally transformed to form a video that is used as a pre-training dataset for the downstream task of action recognition. We find that standard methods of generating fractals are slow and produce degenerate 3D fractals. Therefore, we systematically explore alternative ways of generating fractals and find that overly restrictive approaches, while generating aesthetically pleasing fractals, are detrimental to downstream task performance. We propose a novel method, Targeted Smart Filtering, to address both the generation speed and fractal diversity issues. The method reports roughly 100 times faster sampling speed and achieves superior downstream performance against other 3D fractal filtering methods.
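As background on how an IFS produces a fractal point cloud, here is a minimal sketch of the classic "chaos game": repeatedly apply a randomly chosen contractive affine map to a point and record the trajectory. It does not implement the paper's Targeted Smart Filtering, and the sampling ranges, the row-sum contraction bound, and all function names are assumptions chosen for illustration.

```python
import random

def random_contraction(rng, max_row_sum=0.8):
    """Random affine map (A, b) in 3D, with A rescaled so that its largest
    absolute row sum equals max_row_sum (< 1 makes it a max-norm contraction)."""
    A = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(3)]
    largest = max(sum(abs(a) for a in row) for row in A)
    factor = max_row_sum / largest if largest > 0 else 0.0
    A = [[a * factor for a in row] for row in A]
    b = [rng.uniform(-1.0, 1.0) for _ in range(3)]
    return A, b

def apply_map(A, b, p):
    """Compute A @ p + b for a 3D point p."""
    return [sum(A[i][j] * p[j] for j in range(3)) + b[i] for i in range(3)]

def sample_ifs_points(n_maps=4, n_points=2000, burn_in=50, seed=0):
    """Chaos game: iterate randomly chosen maps and keep the points after burn-in."""
    rng = random.Random(seed)
    maps = [random_contraction(rng) for _ in range(n_maps)]
    p = [0.0, 0.0, 0.0]
    points = []
    for step in range(n_points + burn_in):
        A, b = rng.choice(maps)
        p = apply_map(A, b, p)
        if step >= burn_in:
            points.append(p)
    return points

points = sample_ifs_points()
print(len(points), points[0])
```

Randomly drawn maps like these can yield point clouds that collapse onto a few clusters, which is the kind of degenerate output a filtering step is meant to reject.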
34. 【2602.11804】Efficient Segment Anything with Depth-Aware Fusion and Limited Training Data
链接:https://arxiv.org/abs/2602.11804
作者:Yiming Zhou,Xuenjie Xie,Panfeng Li,Albrecht Kunz,Ahmad Osman,Xavier Maldague
类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
关键词:Segment Anything Models, require massive datasets, achieve impressive universal, universal segmentation performance, impressive universal segmentation
备注:
Abstract:Segment Anything Models (SAM) achieve impressive universal segmentation performance but require massive datasets (e.g., 11M images) and rely solely on RGB inputs. Recent efficient variants reduce computation but still depend on large-scale training. We propose a lightweight RGB-D fusion framework that augments EfficientViT-SAM with monocular depth priors. Depth maps are generated with a pretrained estimator and fused mid-level with RGB features through a dedicated depth encoder. Trained on only 11.2k samples (less than 0.1% of SA-1B), our method achieves higher accuracy than EfficientViT-SAM, showing that depth cues provide strong geometric priors for segmentation.
35. 【2602.11769】Light4D: Training-Free Extreme Viewpoint 4D Video Relighting
链接:https://arxiv.org/abs/2602.11769
作者:Zhenghuang Wu,Kang Chen,Zeyu Zhang,Hao Tang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:diffusion-based generative models, Recent advances, advances in diffusion-based, diffusion-based generative, generative models
备注:
点击查看摘要
Abstract:Recent advances in diffusion-based generative models have established a new paradigm for image and video relighting. However, extending these capabilities to 4D relighting remains challenging, due primarily to the scarcity of paired 4D relighting training data and the difficulty of maintaining temporal consistency across extreme viewpoints. In this work, we propose Light4D, a novel training-free framework designed to synthesize consistent 4D videos under target illumination, even under extreme viewpoint changes. First, we introduce Disentangled Flow Guidance, a time-aware strategy that effectively injects lighting control into the latent space while preserving geometric integrity. Second, to reinforce temporal consistency, we develop Temporal Consistent Attention within the IC-Light architecture and further incorporate deterministic regularization to eliminate appearance flickering. Extensive experiments demonstrate that our method achieves competitive performance in temporal consistency and lighting fidelity, robustly handling camera rotations from -90 to 90. Code: this https URL. Website: this https URL.
36. 【2602.11757】Code2Worlds: Empowering Coding LLMs for 4D World Generation
链接:https://arxiv.org/abs/2602.11757
作者:Yi Zhang,Yunshuang Wang,Zeyu Zhang,Hao Tang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Achieving spatial intelligence, spatial intelligence requires, intelligence requires moving, build world simulators, world simulators grounded
备注:
点击查看摘要
Abstract:Achieving spatial intelligence requires moving beyond visual plausibility to build world simulators grounded in physical laws. While coding LLMs have advanced static 3D scene generation, extending this paradigm to 4D dynamics remains a critical frontier. This task presents two fundamental challenges: multi-scale context entanglement, where monolithic generation fails to balance local object structures with global environmental layouts; and a semantic-physical execution gap, where open-loop code generation leads to physical hallucinations lacking dynamic fidelity. We introduce Code2Worlds, a framework that formulates 4D generation as language-to-simulation code generation. First, we propose a dual-stream architecture that disentangles retrieval-augmented object generation from hierarchical environmental orchestration. Second, to ensure dynamic fidelity, we establish a physics-aware closed-loop mechanism in which a PostProcess Agent scripts dynamics, coupled with a VLM-Motion Critic that performs self-reflection to iteratively refine simulation code. Evaluations on the Code4D benchmark show Code2Worlds outperforms baselines with a 41% SGS gain and 49% higher Richness, while uniquely generating physics-aware dynamics absent in prior static methods. Code: this https URL. Website: this https URL.
37. 【2602.11743】Adaptive Debiasing Tsallis Entropy for Test-Time Adaptation
链接:https://arxiv.org/abs/2602.11743
作者:Xiangyu Wu,Dongming Jiang,Feng Yu,Yueying Tian,Jiaqi Tang,Qing-Guo Chen,Yang Yang,Jianfeng Lu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Mainstream Test-Time Adaptation, Shannon Entropy, Mainstream Test-Time, measure prediction uncertainty, rely on Shannon
备注: Accepted for publication at ICLR 2026; 24 pages; 5 figures
点击查看摘要
Abstract:Mainstream Test-Time Adaptation (TTA) methods for adapting vision-language models, e.g., CLIP, typically rely on Shannon Entropy (SE) at test time to measure prediction uncertainty and inconsistency. However, since CLIP has a built-in bias from pretraining on highly imbalanced web-crawled data, SE inevitably produces biased uncertainty estimates. To address this issue, we find and demonstrate that Tsallis Entropy (TE), a generalized form of SE, is naturally suited for characterizing biased distributions by introducing a non-extensive parameter q, with the performance of SE serving as a lower bound for TE. Building upon this, we generalize TE into Adaptive Debiasing Tsallis Entropy (ADTE) for TTA, customizing for each category a class-specific parameter q^l derived by normalizing the estimated label bias from continuously incoming test instances. This adaptive approach allows ADTE to accurately select high-confidence views and seamlessly integrate with a label adjustment strategy to enhance adaptation, without introducing distribution-specific hyperparameter tuning. Besides, our investigation reveals that both TE and ADTE can serve as direct, advanced alternatives to SE in TTA, without any other modifications. Experimental results show that ADTE outperforms state-of-the-art methods on ImageNet and its five variants, and achieves the highest average performance on 10 cross-domain benchmarks, regardless of the model architecture or text prompts used. Our code is available at this https URL.
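For readers unfamiliar with Tsallis entropy, the sketch below shows the standard textbook definition, S_q(p) = (1 - Σ p_i^q) / (q - 1), and checks numerically that it approaches Shannon entropy as q tends to 1. It does not reproduce ADTE's class-specific q^l or its label-bias estimate; the function names are illustrative only.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def tsallis_entropy(probs, q):
    """Standard Tsallis entropy S_q = (1 - sum(p_i^q)) / (q - 1).

    At q == 1 the formula is undefined, so the Shannon limit is returned.
    """
    if abs(q - 1.0) < 1e-12:
        return shannon_entropy(probs)
    return (1.0 - sum(p ** q for p in probs if p > 0)) / (q - 1.0)

probs = [0.7, 0.2, 0.1]
print(shannon_entropy(probs))          # Shannon entropy of the example distribution
print(tsallis_entropy(probs, 1.0001))  # close to the Shannon value
print(tsallis_entropy(probs, 2.0))     # q = 2 gives 1 - sum(p_i^2)
```

Varying q changes how strongly high- and low-probability classes are weighted, which is the degree of freedom the abstract says ADTE sets per class.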
38. 【2602.11737】Mask What Matters: Mitigating Object Hallucinations in Multimodal Large Language Models with Object-Aligned Visual Contrastive Decoding
链接:https://arxiv.org/abs/2602.11737
作者:Boqi Chen,Xudong Liu,Jianing Qiu
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:Large Language Models, Multimodal Large Language, Language Models, Multimodal Large, Large Language
备注:
点击查看摘要
Abstract:We study object hallucination in Multimodal Large Language Models (MLLMs) and improve visual contrastive decoding (VCD) by constructing an object-aligned auxiliary view. We leverage object-centric attention in self-supervised Vision Transformers. In particular, we remove the most salient visual evidence to construct an auxiliary view that disrupts unsupported tokens and produces a stronger contrast signal. Our method is prompt-agnostic, model-agnostic, and can be seamlessly plugged into the existing VCD pipeline with little computation overhead, i.e., a single cacheable forward pass. Empirically, our method demonstrates consistent gains on two popular object hallucination benchmarks across two MLLMs.
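As background, visual contrastive decoding is commonly described as contrasting next-token logits computed from the original image with logits computed from an auxiliary view: (1 + α) · logits_original - α · logits_auxiliary. The sketch below shows only that combination step on toy numbers. How the auxiliary view is built (the object-aligned masking proposed here) is not reproduced, the plausibility filtering often paired with VCD is omitted, and all names and values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_logits(logits_original, logits_auxiliary, alpha=1.0):
    """Combine two logit vectors as (1 + alpha) * original - alpha * auxiliary."""
    return [(1 + alpha) * lo - alpha * la
            for lo, la in zip(logits_original, logits_auxiliary)]

# Toy three-token vocabulary: the auxiliary view favours token 2,
# so the contrast pushes probability mass away from it.
logits_original = [2.0, 1.0, 0.5]
logits_auxiliary = [0.5, 1.0, 2.0]
combined = contrastive_logits(logits_original, logits_auxiliary, alpha=1.0)
print(combined)
print(softmax(combined))
```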
39. 【2602.11733】Adapting Vision-Language Models for E-commerce Understanding at Scale
链接:https://arxiv.org/abs/2602.11733
作者:Matteo Nulli,Vladimir Orshulevich,Tala Bazazo,Christian Herold,Michael Kozielski,Marcin Mazur,Szymon Tuzel,Cees G. M. Snoek,Seyyed Hadi Hashemi,Omar Javed,Yannick Versley,Shahram Khadivi
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:strong multimodal comprehension, comprehension from text, General-purpose Vision-Language Models, product understanding demands, strong multimodal
备注:
点击查看摘要
Abstract:E-commerce product understanding by nature demands strong multimodal comprehension of text, images, and structured attributes. General-purpose Vision-Language Models (VLMs) enable generalizable multimodal latent modelling, yet there is no documented, well-known strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data without sacrificing general performance. In this work, we show through a large-scale experimental study how targeted adaptation of general VLMs can substantially improve e-commerce performance while preserving broad multimodal capabilities. Furthermore, we propose a novel extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.
40. 【2602.11730】STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning
链接:https://arxiv.org/abs/2602.11730
作者:Xiaowen Zhang,Zhi Gao,Licheng Jiao,Lingling Li,Qing Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:vision-language models, misalignment between textual, induces hallucinations, textual descriptions
备注:
点击查看摘要
Abstract:In vision-language models (VLMs), misalignment between textual descriptions and visual coordinates often induces hallucinations. This issue becomes particularly severe in dense prediction tasks such as spatial-temporal video grounding (STVG). Prior approaches typically focus on enhancing visual-textual alignment or attaching auxiliary decoders. However, these strategies inevitably introduce additional trainable modules, leading to significant annotation costs and computational overhead. In this work, we propose a novel visual prompting paradigm that avoids the difficult problem of aligning coordinates across modalities. Specifically, we reformulate per-frame coordinate prediction as a compact instance-level identification problem by assigning each object a unique, temporally consistent ID. These IDs are embedded into the video as visual prompts, providing explicit and interpretable inputs to the VLMs. Furthermore, we introduce STVG-R1, the first reinforcement learning framework for STVG, which employs a task-driven reward to jointly optimize temporal accuracy, spatial consistency, and structural format regularization. Extensive experiments on six benchmarks demonstrate the effectiveness of our approach. STVG-R1 surpasses the baseline Qwen2.5-VL-7B by a remarkable margin of 20.9% on m_IoU on the HCSTVG-v2 benchmark, establishing a new state of the art (SOTA). Surprisingly, STVG-R1 also exhibits strong zero-shot generalization to multi-object referring video object segmentation tasks, achieving a SOTA 47.3% JF on MeViS.
41. 【2602.11714】GSO-SLAM: Bidirectionally Coupled Gaussian Splatting and Direct Visual Odometry
链接:https://arxiv.org/abs/2602.11714
作者:Jiung Yeon,Seongbo Ha,Hyeonwoo Yu
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:monocular dense SLAM, dense SLAM system, real-time monocular dense, propose GSO-SLAM, dense SLAM
备注: 8 pages, 6 figures, RA-L accepted
点击查看摘要
Abstract:We propose GSO-SLAM, a real-time monocular dense SLAM system that leverages Gaussian scene representation. Unlike existing methods that couple tracking and mapping with a unified scene, incurring computational costs, or loosely integrate them with well-structured tracking frameworks, introducing redundancies, our method bidirectionally couples Visual Odometry (VO) and Gaussian Splatting (GS). Specifically, our approach formulates joint optimization within an Expectation-Maximization (EM) framework, enabling the simultaneous refinement of VO-derived semi-dense depth estimates and the GS representation without additional computational overhead. Moreover, we present Gaussian Splat Initialization, which utilizes image information, keyframe poses, and pixel associations from VO to produce close approximations to the final Gaussian scene, thereby eliminating the need for heuristic methods. Through extensive experiments, we validate the effectiveness of our method, showing that it not only operates in real time but also achieves state-of-the-art geometric/photometric fidelity of the reconstructed scene and tracking accuracy.
42. 【2602.11706】LLM-Driven 3D Scene Generation of Agricultural Simulation Environments
链接:https://arxiv.org/abs/2602.11706
作者:Arafa Yoncalik,Wouter Jansen,Nico Huebel,Mohammad Hasan Rahmani,Jan Steckel
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
关键词:Procedural generation techniques, Procedural generation, Large Language Models, reducing reliance, revolutionized the creation
备注: Accepted at IEEE Conference on Artificial Intelligence 2026
点击查看摘要
Abstract:Procedural generation techniques in 3D rendering engines have revolutionized the creation of complex environments, reducing reliance on manual design. Recent approaches using Large Language Models (LLMs) for 3D scene generation show promise but often lack domain-specific reasoning, verification mechanisms, and modular design. These limitations lead to reduced control and poor scalability. This paper investigates the use of LLMs to generate agricultural synthetic simulation environments from natural language prompts, specifically to address the limitations of lacking domain-specific reasoning, verification mechanisms, and modular design. A modular multi-LLM pipeline was developed, integrating 3D asset retrieval, domain knowledge injection, and code generation for the Unreal rendering engine using its API. This results in a 3D environment with realistic planting layouts and environmental context, all based on the input prompt and the domain knowledge. To enhance accuracy and scalability, the system employs a hybrid strategy combining LLM optimization techniques such as few-shot prompting, Retrieval-Augmented Generation (RAG), finetuning, and validation. Unlike monolithic models, the modular architecture enables structured data handling, intermediate verification, and flexible expansion. The system was evaluated using structured prompts and semantic accuracy metrics. A user study assessed realism and familiarity against real-world images, while an expert comparison demonstrated significant time savings over manual scene design. The results confirm the effectiveness of multi-LLM pipelines in automating domain-specific 3D scene generation with improved reliability and precision. Future work will explore expanding the asset hierarchy, incorporating real-time generation, and adapting the pipeline to other simulation domains beyond agriculture.
43. 【2602.11705】TG-Field: Geometry-Aware Radiative Gaussian Fields for Tomographic Reconstruction
链接:https://arxiv.org/abs/2602.11705
作者:Yuxiang Zhong,Jun Wei,Chaoqi Chen,Senyou An,Hui Huang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Gaussian Splatting, Tomographic Geometry Field, efficiency and quality, superior efficiency, propose Tomographic Geometry
备注: Accepted to AAAI 2026. Project page: [this https URL](https://vcc.tech/research/2026/TG-Field)
点击查看摘要
Abstract:3D Gaussian Splatting (3DGS) has revolutionized 3D scene representation with superior efficiency and quality. While recent adaptations for computed tomography (CT) show promise, they struggle with severe artifacts under highly sparse-view projections and dynamic motions. To address these challenges, we propose Tomographic Geometry Field (TG-Field), a geometry-aware Gaussian deformation framework tailored for both static and dynamic CT reconstruction. A multi-resolution hash encoder is employed to capture local spatial priors, regularizing primitive parameters under ultra-sparse settings. We further extend the framework to dynamic reconstruction by introducing time-conditioned representations and a spatiotemporal attention block to adaptively aggregate features, thereby resolving spatiotemporal ambiguities and enforcing temporal coherence. In addition, a motion-flow network models fine-grained respiratory motion to track local anatomical deformations. Extensive experiments on synthetic and real-world datasets demonstrate that TG-Field consistently outperforms existing methods, achieving state-of-the-art reconstruction accuracy under highly sparse-view conditions.
44. 【2602.11703】Semantically Conditioned Diffusion Models for Cerebral DSA Synthesis
链接:https://arxiv.org/abs/2602.11703
作者:Qiwen Xu,David Rügamer,Holger Wenz,Johann Fontana,Nora Meggyeshazi,Andreas Bender,Máté E. Maros
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Digital subtraction angiography, public data sharing, large-scale data collection, limit large-scale data, cost severely limit
备注:
点击查看摘要
Abstract:Digital subtraction angiography (DSA) plays a central role in the diagnosis and treatment of cerebrovascular disease, yet its invasive nature and high acquisition cost severely limit large-scale data collection and public data sharing. Therefore, we developed a semantically conditioned latent diffusion model (LDM) that synthesizes arterial-phase cerebral DSA frames under explicit control of anatomical circulation (anterior vs. posterior) and canonical C-arm positions. We curated a large single-centre DSA dataset of 99,349 frames and trained a conditional LDM using text embeddings that encoded anatomy and acquisition geometry. To assess clinical realism, four medical experts, including two neuroradiologists, one neurosurgeon, and one internal medicine expert, systematically rated 400 synthetic DSA images using a 5-grade Likert scale for evaluating proximal large, medium, and small peripheral vessels. The generated images achieved image-wise overall Likert scores ranging from 3.1 to 3.3, with high inter-rater reliability (ICC(2,k) = 0.80–0.87). Distributional similarity to real DSA frames was supported by a low median Fréchet inception distance (FID) of 15.27. Our results indicate that semantically controlled LDMs can produce realistic synthetic DSAs suitable for downstream algorithm development, research, and training.
45. 【2602.11693】OMEGA-Avatar: One-shot Modeling of 360° Gaussian Avatars
链接:https://arxiv.org/abs/2602.11693
作者:Zehao Xia,Yiqun Wang,Zhengda Lu,Kai Liu,Jun Xiao,Peter Wonka
类目:Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Creating high-fidelity, single image remains, formidable challenge, remains a formidable, full-head avatar generation
备注: Project page: [this https URL](https://omega-avatar.github.io/OMEGA-Avatar/)
点击查看摘要
Abstract:Creating high-fidelity, animatable 3D avatars from a single image remains a formidable challenge. We identified three desirable attributes of avatar generation: the method should 1) be feed-forward, 2) model a full 360° head, and 3) be animation-ready. However, current work addresses at most two of these three points simultaneously. To address these limitations, we propose OMEGA-Avatar, the first feed-forward framework that simultaneously generates a generalizable, 360°-complete, and animatable 3D Gaussian head from a single image. Starting from a feed-forward and animatable framework, we address the 360° full-head avatar generation problem with two novel components. First, to overcome poor hair modeling in full-head avatar generation, we introduce a semantic-aware mesh deformation module that integrates multi-view normals to optimize a FLAME head with hair while preserving its topology structure. Second, to enable effective feed-forward decoding of full-head features, we propose a multi-view feature splatting module that constructs a shared canonical UV representation from features across multiple views through differentiable bilinear splatting, hierarchical UV mapping, and visibility-aware fusion. This approach preserves both global structural coherence and local high-frequency details across all viewpoints, ensuring 360° consistency without per-instance optimization. Extensive experiments demonstrate that OMEGA-Avatar achieves state-of-the-art performance, significantly outperforming existing baselines in 360° full-head completeness while robustly preserving identity across different viewpoints.
46. 【2602.11678】Beyond Pixels: Vector-to-Graph Transformation for Reliable Schematic Auditing
链接:https://arxiv.org/abs/2602.11678
作者:Chengwei Ma,Zhen Tian,Zhou Zhou,Zhixian Xu,Xiaowei Zhu,Xia Hua,Si Shi,F. Richard Yu
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Language Models, shown remarkable progress, Multimodal Large Language, Language Models, visual understanding
备注: 4 pages, 3 figures. Accepted to ICASSP 2026
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) have shown remarkable progress in visual understanding, yet they suffer from a critical limitation: structural blindness. Even state-of-the-art models fail to capture topology and symbolic logic in engineering schematics, as their pixel-driven paradigm discards the explicit vector-defined relations needed for reasoning. To overcome this, we propose a Vector-to-Graph (V2G) pipeline that converts CAD diagrams into property graphs where nodes represent components and edges encode connectivity, making structural dependencies explicit and machine-auditable. On a diagnostic benchmark of electrical compliance checks, V2G yields large accuracy gains across all error categories, while leading MLLMs remain near chance level. These results highlight the systemic inadequacy of pixel-based methods and demonstrate that structure-aware representations provide a reliable path toward practical deployment of multimodal AI in engineering domains. To facilitate further research, we release our benchmark and implementation at this https URL.
47. 【2602.11673】RI-Mamba: Rotation-Invariant Mamba for Robust Text-to-Shape Retrieval
链接:https://arxiv.org/abs/2602.11673
作者:Khanh Nguyen,Dasith de Silva Edirimuni,Ghulam Mubashar Hassan,Ajmal Mian
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:assets have rapidly, reality and gaming, rapidly expanded, expanded in quantity, quantity and diversity
备注:
点击查看摘要
Abstract:3D assets have rapidly expanded in quantity and diversity due to the growing popularity of virtual reality and gaming. As a result, text-to-shape retrieval has become essential in facilitating intuitive search within large repositories. However, existing methods require canonical poses and support few object categories, limiting their real-world applicability where objects can belong to diverse classes and appear in random orientations. To address this challenge, we propose RI-Mamba, the first rotation-invariant state-space model for point clouds. RI-Mamba defines global and local reference frames to disentangle pose from geometry and uses Hilbert sorting to construct token sequences with meaningful geometric structure while maintaining rotation invariance. We further introduce a novel strategy to compute orientational embeddings and reintegrate them via feature-wise linear modulation, effectively recovering spatial context and enhancing model expressiveness. Our strategy is inherently compatible with state-space models and operates in linear time. To scale up retrieval, we adopt cross-modal contrastive learning with automated triplet generation, allowing training on diverse datasets without manual annotation. Extensive experiments demonstrate RI-Mamba's superior representational capacity and robustness, achieving state-of-the-art performance on the OmniObject3D benchmark across more than 200 object categories under arbitrary orientations. Our code will be made available at this https URL.
48. 【2602.11672】U-Net with Hadamard Transform and DCT Latent Spaces for Next-day Wildfire Spread Prediction
链接:https://arxiv.org/abs/2602.11672
作者:Yingyi Luo,Shuaiang Rong,Adam Watts,Ahmet Enis Cetin
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:computationally efficient tool, multimodal satellite data, Discrete Cosine Transform, next-day wildfire spread, Transform Domain Fusion
备注:
点击查看摘要
Abstract:We developed a lightweight and computationally efficient tool for next-day wildfire spread prediction using multimodal satellite data as input. The deep learning model, which we call Transform Domain Fusion UNet (TD-FusionUNet), incorporates trainable Hadamard Transform and Discrete Cosine Transform layers that apply two-dimensional transforms, enabling the network to capture essential "frequency" components in orthogonalized latent spaces. Additionally, we introduce custom preprocessing techniques, including random margin cropping and a Gaussian mixture model, to enrich the representation of the sparse pre-fire masks and enhance the model's generalization capability. The TD-FusionUNet is evaluated on two datasets: the Next-Day Wildfire Spread dataset released by Google Research in 2023 and the WildfireSpreadTS dataset. Our proposed TD-FusionUNet achieves an F1 score of 0.591 with 370k parameters, outperforming the UNet baseline using ResNet18 as the encoder reported in the WildfireSpreadTS dataset while using substantially fewer parameters. These results show that the proposed latent space fusion model balances accuracy and efficiency under a lightweight setting, making it suitable for real-time wildfire prediction applications in resource-limited environments.
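As a reference point for the Hadamard Transform layers mentioned above, the following sketch implements the fixed (non-trainable) fast Walsh-Hadamard transform in one and two dimensions. The trainable variant and the way it is fused into the UNet are not described in the abstract and are not reproduced here; `fwht` and `fwht_2d` are illustrative names.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform of a power-of-two-length list."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly step: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out

def fwht_2d(grid):
    """Apply the 1D transform along rows, then along columns."""
    rows = [fwht(r) for r in grid]
    cols = [fwht(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

print(fwht([1, 0, 1, 0]))        # [2, 2, 0, 0]
print(fwht_2d([[1, 0], [0, 0]])) # [[1, 1], [1, 1]]
```

The transform uses only additions and subtractions, and applying it twice returns the input scaled by its length, which is one reason it is attractive for lightweight models.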
49. 【2602.11669】Egocentric Gaze Estimation via Neck-Mounted Camera
链接:https://arxiv.org/abs/2602.11669
作者:Haoyu Huang,Yoichi Sato
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:estimates user gaze, paper introduces neck-mounted, neck-mounted camera perspective, introduces neck-mounted view, paper introduces
备注:
点击查看摘要
Abstract:This paper introduces neck-mounted view gaze estimation, a new task that estimates user gaze from the perspective of a neck-mounted camera. Prior work on egocentric gaze estimation, which predicts the device wearer's gaze location within the camera's field of view, mainly focuses on head-mounted cameras, while alternative viewpoints remain underexplored. To bridge this gap, we collect the first dataset for this task, consisting of approximately 4 hours of video collected from 8 participants during everyday activities. We evaluate a transformer-based gaze estimation model, GLC, on the new dataset and propose two extensions: an auxiliary gaze out-of-bound classification task and a multi-view co-learning approach that jointly trains head-view and neck-view models using a geometry-aware auxiliary loss. Experimental results show that incorporating gaze out-of-bound classification improves performance over standard fine-tuning, while the co-learning approach does not yield gains. We further analyze these results and discuss implications for neck-mounted gaze estimation.
50. 【2602.11660】Clutt3R-Seg: Sparse-view 3D Instance Segmentation for Language-grounded Grasping in Cluttered Scenes
链接:https://arxiv.org/abs/2602.11660
作者:Jeongho Noh,Tai Hyoung Rhee,Eunho Lee,Jeongyun Kim,Sunwoo Lee,Ayoung Kim
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:language-grounded robotic manipulation, robotic manipulation, Reliable, language-grounded robotic, instance segmentation
备注: Accepted to ICRA 2026. 9 pages, 8 figures
点击查看摘要
Abstract:Reliable 3D instance segmentation is fundamental to language-grounded robotic manipulation. Its critical application lies in cluttered environments, where occlusions, limited viewpoints, and noisy masks degrade perception. To address these challenges, we present Clutt3R-Seg, a zero-shot pipeline for robust 3D instance segmentation for language-grounded grasping in cluttered scenes. Our key idea is to introduce a hierarchical instance tree of semantic cues. Unlike prior approaches that attempt to refine noisy masks, our method leverages them as informative cues: through cross-view grouping and conditional substitution, the tree suppresses over- and under-segmentation, yielding view-consistent masks and robust 3D instances. Each instance is enriched with open-vocabulary semantic embeddings, enabling accurate target selection from natural language instructions. To handle scene changes during multi-stage tasks, we further introduce a consistency-aware update that preserves instance correspondences from only a single post-interaction image, allowing efficient adaptation without rescanning. Clutt3R-Seg is evaluated on both synthetic and real-world datasets, and validated on a real robot. Across all settings, it consistently outperforms state-of-the-art baselines in cluttered and sparse-view scenarios. Even on the most challenging heavy-clutter sequences, Clutt3R-Seg achieves an AP@25 of 61.66, over 2.2x higher than baselines, and with only four input views it surpasses MaskClustering with eight views by more than 2x. The code is available at: this https URL.
51. 【2602.11658】EmoSpace: Fine-Grained Emotion Prototype Learning for Immersive Affective Content Generation
链接:https://arxiv.org/abs/2602.11658
作者:Bingyuan Wang,Xingbei Chen,Zongyang Qiu,Linping Yuan,Zeyu Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:compelling virtual reality, creating compelling virtual, virtual reality, compelling virtual, creating compelling
备注:
点击查看摘要
Abstract:Emotion is important for creating compelling virtual reality (VR) content. Although some generative methods have been applied to lower the barrier to creating emotionally rich content, they fail to capture the nuanced emotional semantics and the fine-grained control essential for immersive experiences. To address these limitations, we introduce EmoSpace, a novel framework for emotion-aware content generation that learns dynamic, interpretable emotion prototypes through vision-language alignment. We employ a hierarchical emotion representation with rich learnable prototypes that evolve during training, enabling fine-grained emotional control without requiring explicit emotion labels. We develop a controllable generation pipeline featuring multi-prototype guidance, temporal blending, and attention reweighting that supports diverse applications, including emotional image outpainting, stylized generation, and emotional panorama generation for VR environments. Our experiments demonstrate the superior performance of EmoSpace over existing methods in both qualitative and quantitative evaluations. Additionally, we present a comprehensive user study investigating how VR environments affect emotional perception compared to desktop settings. Our work facilitates immersive visual content generation with fine-grained emotion control and supports applications like therapy, education, storytelling, artistic creation, and cultural preservation. Code and models will be made publicly available.
52. 【2602.11656】SToRM: Supervised Token Reduction for Multi-modal LLMs toward efficient end-to-end autonomous driving
链接:https://arxiv.org/abs/2602.11656
作者:Seo Hyun Kim,Jin Bok Park,Do Yeon Koo,Ho Gun Park,Il Yong Chun
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
关键词:achieved significant advancements, predict control commands, control commands directly, significant advancements, predict control
备注:
点击查看摘要
Abstract:In autonomous driving, end-to-end (E2E) driving systems that predict control commands directly from sensor data have achieved significant advancements. For safe driving in unexpected scenarios, these systems may additionally rely on human interventions such as natural language instructions. Using a multi-modal large language model (MLLM) facilitates human-vehicle interaction and can improve performance in such scenarios. However, because it relies on an LLM and numerous visual tokens from sensor inputs, this approach requires substantial computational resources, which are limited in autonomous vehicles. Many MLLM studies have explored reducing visual tokens, but often suffer end-task performance degradation compared to using all tokens. To enable efficient E2E driving while maintaining performance comparable to using all tokens, this paper proposes the first Supervised Token Reduction framework for multi-modal LLMs (SToRM). The proposed framework consists of three key elements. First, a lightweight importance predictor with short-term sliding windows estimates token importance scores. Second, a supervised training approach uses an auxiliary path to obtain pseudo-supervision signals from an all-token LLM pass. Third, an anchor-context merging module partitions tokens into anchors and context tokens, and merges context tokens into relevant anchors to reduce redundancy while minimizing information loss. Experiments on the LangAuto benchmark show that SToRM outperforms state-of-the-art E2E driving MLLMs under the same reduced-token budget, maintaining all-token performance while reducing computational cost by up to 30x.
53. 【2602.11653】GR-Diffusion: 3D Gaussian Representation Meets Diffusion in Whole-Body PET Reconstruction
Link: https://arxiv.org/abs/2602.11653
Authors: Mengxiao Geng, Zijie Chen, Ran Hong, Bingxuan Li, Qiegen Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Positron emission tomography, Positron emission, detail loss due, structural blurring, discrete Gaussian representation
Comments:
Abstract:Positron emission tomography (PET) reconstruction is a critical challenge in molecular imaging, often hampered by noise amplification, structural blurring, and detail loss due to sparse sampling and the ill-posed nature of inverse problems. The three-dimensional discrete Gaussian representation (GR), which efficiently encodes 3D scenes using parameterized discrete Gaussian distributions, has shown promise in computer vision. In this work, we propose a novel GR-Diffusion framework that synergistically integrates the geometric priors of GR with the generative power of diffusion models for 3D low-dose whole-body PET reconstruction. GR-Diffusion employs GR to generate a reference 3D PET image from projection data, establishing a physically grounded and structurally explicit benchmark that overcomes the low-pass limitations of conventional point-based or voxel-based methods. This reference image serves as a dual guide during the diffusion process, ensuring both global consistency and local accuracy. Specifically, we employ a hierarchical guidance mechanism based on the GR reference. Fine-grained guidance leverages differences to refine local details, while coarse-grained guidance uses multi-scale difference maps to correct deviations. This strategy allows the diffusion model to sequentially integrate the strong geometric prior from GR and recover sub-voxel information. Experimental results on the UDPET and Clinical datasets with varying dose levels show that GR-Diffusion outperforms state-of-the-art methods in enhancing 3D whole-body PET image quality and preserving physiological details.
54. 【2602.11646】Brain Tumor Classifiers Under Attack: Robustness of ResNet Variants Against Transferable FGSM and PGD Attacks
Link: https://arxiv.org/abs/2602.11646
Authors: Ryan Deem, Garrett Goodman, Waqas Majeed, Md Abdullah Al Hafiz Khan, Michail S. Alexiou
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: scenarios involving MRI, deep learning models, clinical deployment scenarios, deployment scenarios involving, involving MRI data
Comments:
Abstract:Adversarial robustness in deep learning models for brain tumor classification remains an underexplored yet critical challenge, particularly for clinical deployment scenarios involving MRI data. In this work, we investigate the susceptibility and resilience of several ResNet-based architectures, referred to as BrainNet, BrainNeXt and DilationNet, against gradient-based adversarial attacks, namely FGSM and PGD. These models, based on ResNet, ResNeXt, and dilated ResNet variants respectively, are evaluated across three preprocessing configurations: (i) full-sized augmented, (ii) shrunk augmented, and (iii) shrunk non-augmented MRI datasets. Our experiments reveal that BrainNeXt models exhibit the highest robustness to black-box attacks, likely due to their increased cardinality, though they produce weaker transferable adversarial samples. In contrast, BrainNet and DilationNet models are more vulnerable to attacks from each other, especially under PGD with higher iteration steps and $\alpha$ values. Notably, shrunk and non-augmented data significantly reduce model resilience, even when the untampered test accuracy remains high, highlighting a key trade-off between input resolution and adversarial vulnerability. These results underscore the importance of jointly evaluating classification performance and adversarial robustness for reliable real-world deployment in brain MRI analysis.
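For readers unfamiliar with the two attacks named in this abstract, the following is a minimal sketch of FGSM and L-infinity PGD. It uses a toy logistic-regression classifier in place of the paper's ResNet variants, so the model, weights, and input values are illustrative assumptions only; `alpha` plays the role of the PGD step size mentioned in the abstract.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return (v > 0) - (v < 0)

def input_gradient(x, y, w, b):
    """Gradient of binary cross-entropy w.r.t. the input of a logistic model."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]

def fgsm(x, y, w, b, eps):
    """Single step: move each feature by eps in the loss-increasing direction."""
    g = input_gradient(x, y, w, b)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]

def pgd(x, y, w, b, eps, alpha, steps):
    """Iterated signed-gradient steps of size alpha, clipped after each step
    to the L-infinity ball of radius eps around the clean input."""
    x_adv = list(x)
    for _ in range(steps):
        g = input_gradient(x_adv, y, w, b)
        x_adv = [xi + alpha * sign(gi) for xi, gi in zip(x_adv, g)]
        x_adv = [min(max(xa, x0 - eps), x0 + eps) for xa, x0 in zip(x_adv, x)]
    return x_adv

# Toy model and input (hypothetical values).
w, b = [2.0, -1.0], 0.0
x, y = [0.5, 0.2], 1
x_fgsm = fgsm(x, y, w, b, eps=0.3)
x_pgd = pgd(x, y, w, b, eps=0.3, alpha=0.1, steps=5)
print(x_fgsm, x_pgd)
```

In a transfer (black-box) setting such as the one studied here, the gradient would be computed on a surrogate model and the resulting input fed to a different target model.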
55. 【2602.11643】ViTaS: Visual Tactile Soft Fusion Contrastive Learning for Visuomotor Learning
Link: https://arxiv.org/abs/2602.11643
Authors: Yufeng Tian, Shuiqi Cheng, Tianming Wei, Tianxing Zhou, Yuanhang Zhang, Zixian Liu, Qianwei Han, Zhecheng Yuan, Huazhe Xu
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: human manipulation tasks, recently garnered increasing, garnered increasing attention, Tactile information plays, human manipulation
Comments: Published to ICRA 2026
Abstract:Tactile information plays a crucial role in human manipulation tasks and has recently garnered increasing attention in robotic manipulation. However, existing approaches mostly focus on the alignment of visual and tactile features, and the integration mechanism tends to be direct concatenation. Consequently, they struggle to cope with occluded scenarios because they neglect the inherent complementary nature of the two modalities, and the alignment may not be fully exploited, limiting their potential for real-world deployment. In this paper, we present ViTaS, a simple yet effective framework that incorporates both visual and tactile information to guide the behavior of an agent. We introduce Soft Fusion Contrastive Learning, an advanced version of the conventional contrastive learning method, and a CVAE module to exploit the alignment and complementarity within visuo-tactile representations. We demonstrate the effectiveness of our method in 12 simulated and 3 real-world environments, and our experiments show that ViTaS significantly outperforms existing baselines.
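As background for the "conventional contrastive learning method" this abstract builds on, here is a sketch of a standard InfoNCE loss over paired visual and tactile embeddings. It is not the Soft Fusion variant proposed in the paper, whose details the abstract does not give; the function names, temperature, and toy embeddings are assumptions.

```python
import math

def normalize(v):
    """Scale a vector to unit length (assumes it is non-zero)."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def info_nce(visual, tactile, temperature=0.1):
    """Mean InfoNCE loss: the i-th visual embedding should be closer to the
    i-th tactile embedding than to any other tactile embedding in the batch."""
    vis = [normalize(v) for v in visual]
    tac = [normalize(t) for t in tactile]
    total = 0.0
    for i, v in enumerate(vis):
        logits = [sum(a * b for a, b in zip(v, t)) / temperature for t in tac]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(vis)

visual = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(visual, [[0.9, 0.1], [0.1, 0.9]])
mismatched = info_nce(visual, [[0.1, 0.9], [0.9, 0.1]])
print(aligned, mismatched)
```

The loss is small when matching pairs are already the most similar in the batch and large when the pairing is swapped, which is the alignment signal the abstract refers to.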
56. 【2602.11642】Electrostatics-Inspired Surface Reconstruction (EISR): Recovering 3D Shapes as a Superposition of Poisson's PDE Solutions
Link: https://arxiv.org/abs/2602.11642
Authors: Diego Patiño, Knut Peterson, Kostas Daniilidis, David K. Han
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Implicit shape representation, level sets, Poisson equation, Eikonal partial differential, proxy PDE
Comments:
Abstract:Implicit shape representations, such as signed distance functions (SDFs), are a popular approach to recovering the surface of a 3D shape as a level set of a scalar field. Several methods approximate SDFs using machine learning strategies that exploit the knowledge that SDFs are solutions of the Eikonal partial differential equation (PDE). In this work, we present a novel approach to surface reconstruction by encoding it as a solution to a proxy PDE, namely Poisson's equation. Then, we explore the connection between Poisson's equation and physics, e.g., the electrostatic potential due to a positive charge density. We employ Green's functions to obtain a closed-form parametric expression for the PDE's solution, and leverage the linearity of our proxy PDE to find the target shape's implicit field as a superposition of solutions. Our method shows improved results in approximating high-frequency details, even with a small number of shape priors.
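The superposition idea in this abstract can be made concrete with the textbook free-space Green's function of the 3D Laplacian, G(x, y) = 1 / (4π|x − y|). The sketch below assumes point charges and unit permittivity, which the abstract does not specify; the paper's actual parametrization may differ.

```python
import math

def green_3d(x, y):
    """Free-space Green's function of the 3D Laplacian: 1 / (4*pi*|x - y|).
    Singular when x == y, so query points must not coincide with a charge."""
    return 1.0 / (4.0 * math.pi * math.dist(x, y))

def potential(x, charges):
    """Potential at x as a superposition over (position, strength) point
    charges. Linearity of Poisson's equation lets the contributions add."""
    return sum(q * green_3d(x, pos) for pos, q in charges)

# Two hypothetical point charges and one query point.
charges = [((0.0, 0.0, 0.0), 1.0), ((1.0, 0.0, 0.0), 0.5)]
query = (0.0, 2.0, 0.0)
phi = potential(query, charges)
print(phi)
```

A surface is then read off as a level set of `potential`; for a single charge of strength q, the level set at value c is a sphere of radius q / (4πc).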
57. 【2602.11636】ScalSelect: Scalable Training-Free Multimodal Data Selection for Efficient Visual Instruction Tuning
Link: https://arxiv.org/abs/2602.11636
Authors: Changti Wu, Jiahuai Mao, Yuzhuo Miao, Shijie Lian, Bin Yu, Xiaopeng Lin, Cong Huang, Lei Zhang, Kai Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Visual Instruction Tuning, multimodal data selection, Instruction Tuning, Large-scale Visual Instruction, key paradigm
Comments: The code is available at [ScalSelect](https://github.com/ChangtiWu/ScalSelect)
Abstract:Large-scale Visual Instruction Tuning (VIT) has become a key paradigm for advancing the performance of vision-language models (VLMs) across various multimodal tasks. However, training on large-scale datasets is computationally expensive and inefficient due to redundancy in the data, which motivates the need for multimodal data selection to improve training efficiency. Existing data selection methods for VIT require either costly training or gradient computation. Training-free alternatives often depend on proxy models or datasets, instruction-agnostic representations, and pairwise similarity with quadratic complexity, limiting scalability and representation fidelity. In this work, we propose ScalSelect, a scalable training-free multimodal data selection method with linear-time complexity with respect to the number of samples, eliminating the need for external models or auxiliary datasets. ScalSelect first constructs sample representations by extracting visual features most attended by instruction tokens in the target VLM, capturing instruction-relevant information. It then identifies samples whose representations best approximate the dominant subspace of the full dataset representations, enabling scalable importance scoring without pairwise comparisons. Extensive experiments across multiple VLMs, datasets, and selection budgets demonstrate that ScalSelect achieves over 97.5% of the performance of training on the full dataset using only 16% of the data, and even outperforms full-data training in some settings. The code is available at [ScalSelect](https://github.com/ChangtiWu/ScalSelect).
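To give a feel for subspace-based scoring without pairwise comparisons, the sketch below scores each sample by its squared projection onto the top principal direction of the dataset, found by power iteration. This is a rank-1 stand-in chosen for brevity, not the ScalSelect scoring rule; the function names and toy representations are assumptions.

```python
import math

def top_direction(rows, iters=50):
    """Power iteration on X^T X: approximates the dominant right singular
    vector of X. Assumes the data is non-zero and the uniform start vector
    is not orthogonal to that direction."""
    dim = len(rows[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        proj = [sum(r[d] * v[d] for d in range(dim)) for r in rows]          # X v
        v = [sum(p * r[d] for p, r in zip(proj, rows)) for d in range(dim)]  # X^T (X v)
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
    return v

def select(rows, budget):
    """Rank samples by squared projection onto the dominant direction and
    keep the top `budget`. Each pass is linear in the number of samples."""
    v = top_direction(rows)
    scores = [sum(r[d] * v[d] for d in range(len(v))) ** 2 for r in rows]
    ranked = sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget], scores

# Toy 2-D sample representations (hypothetical values).
rows = [[3.0, 0.1], [2.5, -0.1], [0.1, 0.5], [0.0, 0.2]]
chosen, scores = select(rows, budget=2)
print(chosen)
```

No sample is ever compared with another sample, which is what keeps the cost linear rather than quadratic in the dataset size.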
58. 【2602.11628】PLESS: Pseudo-Label Enhancement with Spreading Scribbles for Weakly Supervised Segmentation
Link: https://arxiv.org/abs/2602.11628
Authors: Yeva Gabrielyan (1), Varduhi Yeghiazaryan (1), Irina Voiculescu (2) ((1) Akian College of Science and Engineering, American University of Armenia, Yerevan, Armenia; (2) Department of Computer Science, University of Oxford, Oxford, UK)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Weakly supervised learning, sparse user-drawn strokes, Weakly supervised, subset of pixels, supervised learning
Comments: This work was supported by the Afeyan Family Foundation Seed Grants and the JACE Foundation Research Innovation Grant Program at AUA
Abstract:Weakly supervised learning with scribble annotations uses sparse user-drawn strokes to indicate segmentation labels on a small subset of pixels. This annotation reduces the cost of dense pixel-wise labeling, but suffers inherently from noisy and incomplete supervision. Recent scribble-based approaches in medical image segmentation address this limitation using pseudo-label-based training; however, the quality of the pseudo-labels remains a key performance limit. We propose PLESS, a generic pseudo-label enhancement strategy which improves reliability and spatial consistency. It builds on a hierarchical partitioning of the image into spatially coherent regions. PLESS propagates scribble information to refine pseudo-labels within semantically coherent regions. The framework is model-agnostic and easily integrates into existing pseudo-label methods. Experiments on two public cardiac MRI datasets (ACDC and MSCMRseg) across four scribble-supervised algorithms show consistent improvements in segmentation accuracy. Code will be made available on GitHub upon acceptance.
59. 【2602.11625】PLOT-CT: Pre-log Voronoi Decomposition Assisted Generation for Low-dose CT Reconstruction
Link: https://arxiv.org/abs/2602.11625
Authors: Bin Huang, Xun Yu, Yikun Zhang, Yi Zhang, Yang Chen, Qiegen Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Low-dose computed tomography, reduced radiation exposure, Low-dose computed, compromised data fidelity, computed tomography
Comments:
Abstract:Low-dose computed tomography (LDCT) reconstruction is fundamentally challenged by severe noise and compromised data fidelity under reduced radiation exposure. Most existing methods operate either in the image or post-log projection domain, which fails to fully exploit the rich structural information in pre-log measurements while being highly susceptible to noise. The requisite logarithmic transformation critically amplifies noise within these data, imposing exceptional demands on reconstruction precision. To overcome these challenges, we propose PLOT-CT, a novel framework for Pre-Log vOronoi decomposiTion-assisted CT generation. Our method begins by applying Voronoi decomposition to pre-log sinograms, disentangling the data into distinct underlying components, which are embedded in separate latent spaces. This explicit decomposition significantly enhances the model's capacity to learn discriminative features, directly improving reconstruction accuracy by mitigating noise and preserving information inherent in the pre-log domain. Extensive experiments demonstrate that PLOT-CT achieves state-of-the-art performance, attaining a 2.36 dB PSNR improvement over traditional methods at the 1e4 incident photon level in the pre-log domain.
60. 【2602.11598】ABot-N0: Technical Report on the VLA Foundation Model for Versatile Embodied Navigation
Link: https://arxiv.org/abs/2602.11598
Authors: Zedong Chu, Shichao Xie, Xiaolong Wu, Yanfen Shen, Minghua Luo, Zhengbo Wang, Fei Liu, Xiaoxu Leng, Junjun Hu, Mingyang Yin, Jia Lu, Yingnan Guo, Kai Yang, Jiawei Han, Xu Chen, Yanqing Zhu, Yuxiang Zhao, Xin Liu, Yirong Yang, Ye He, Jiahang Wang, Yang Cai, Tianlin Zhang, Li Gao, Liu Liu, Mingchao Sun, Fan Jiang, Chiyu Wang, Zhicheng Liu, Hongyu Pan, Honglin Han, Zhining Gu, Kuan Yang, Jianfang Zhang, Di Jing, Zihao Guan, Wei Guo, Guoqing Liu, Di Yang, Xiangpo Yang, Menglin Yang, Hongguang Xing, Weiguo Li, Mu Xu
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Embodied navigation, Flow Matching-based Action, long been fragmented, fragmented by task-specific, Grand Unification
Comments: Project Page: [this https URL](https://amap-cvlab.github.io/ABot-Navigation/ABot-N0/)
Abstract:Embodied navigation has long been fragmented by task-specific architectures. We introduce ABot-N0, a unified Vision-Language-Action (VLA) foundation model that achieves a "Grand Unification" across 5 core tasks: Point-Goal, Object-Goal, Instruction-Following, POI-Goal, and Person-Following. ABot-N0 utilizes a hierarchical "Brain-Action" architecture, pairing an LLM-based Cognitive Brain for semantic reasoning with a Flow Matching-based Action Expert for precise, continuous trajectory generation. To support large-scale learning, we developed the ABot-N0 Data Engine, curating 16.9M expert trajectories and 5.0M reasoning samples across 7,802 high-fidelity 3D scenes (10.7 $\text{km}^2$). ABot-N0 achieves new SOTA performance across 7 benchmarks, significantly outperforming specialized models. Furthermore, our Agentic Navigation System integrates a planner with hierarchical topological memory, enabling robust, long-horizon missions in dynamic real-world environments.
61. 【2602.11588】A Large Language Model for Disaster Structural Reconnaissance Summarization
Link: https://arxiv.org/abs/2602.11588
Authors: Yuqing Gao, Guanren Zhou, Khalid M. Mosalam
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Structural Health Monitoring, Artificial Intelligence, vision-based Structural Health, Health Monitoring, assessing structural condition
Comments: 8 pages, 4 figures. Presented at the 18th World Conference on Earthquake Engineering (18WCEE 2024)
Abstract:Artificial Intelligence (AI)-aided vision-based Structural Health Monitoring (SHM) has emerged as an effective approach for monitoring and assessing structural condition by analyzing image and video data. By integrating Computer Vision (CV) and Deep Learning (DL), vision-based SHM can automatically identify and localize visual patterns associated with structural damage. However, previous works typically generate only discrete outputs, such as damage class labels and damage region coordinates, requiring engineers to further reorganize and analyze these results for evaluation and decision-making. In late 2022, Large Language Models (LLMs) became popular across multiple fields, providing new insights into AI-aided vision-based SHM. In this study, a novel LLM-based Disaster Reconnaissance Summarization (LLM-DRS) framework is proposed. It introduces a standard reconnaissance plan in which the collection of vision data and corresponding metadata follows a well-designed on-site investigation process. Text-based metadata and image-based vision data are then processed and integrated into a unified format, where well-trained Deep Convolutional Neural Networks extract key attributes, including damage state, material type, and damage level. Finally, all data are fed into an LLM with carefully designed prompts, enabling the LLM-DRS to generate summary reports for individual structures or affected regions based on aggregated attributes and metadata. Results show that integrating LLMs into vision-based SHM, particularly for rapid post-disaster reconnaissance, demonstrates promising potential for improving resilience of the built environment through effective reconnaissance.
62. 【2602.11575】ReaDy-Go: Real-to-Sim Dynamic 3D Gaussian Splatting Simulation for Environment-Specific Visual Navigation with Moving Obstacles
Link: https://arxiv.org/abs/2602.11575
Authors: Seungyeon Yoo, Youngseok Jang, Dabin Kim, Youngsoo Han, Seungwoo Jung, H. Jin Kim
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Visual navigation models, training policies tailored, Visual navigation, Gaussian Splatting, dynamic
Comments: Project page: [this https URL](https://syeon-yoo.github.io/ready-go-site/)
Abstract:Visual navigation models often struggle in real-world dynamic environments due to limited robustness to the sim-to-real gap and the difficulty of training policies tailored to target deployment environments (e.g., households, restaurants, and factories). Although real-to-sim navigation simulation using 3D Gaussian Splatting (GS) can mitigate this gap, prior works have assumed only static scenes or unrealistic dynamic obstacles, despite the importance of safe navigation in dynamic environments. To address these issues, we propose ReaDy-Go, a novel real-to-sim simulation pipeline that synthesizes photorealistic dynamic scenarios for target environments. ReaDy-Go generates photorealistic navigation datasets for dynamic environments by combining a reconstructed static GS scene with dynamic human GS obstacles, and trains policies robust to both the sim-to-real gap and moving obstacles. The pipeline consists of three components: (1) a dynamic GS simulator that integrates scene GS with a human animation module, enabling the insertion of animatable human GS avatars and the synthesis of plausible human motions from 2D trajectories, (2) navigation dataset generation for dynamic environments that leverages the simulator, a robot expert planner designed for dynamic GS representations, and a human planner, and (3) policy learning using the generated datasets. ReaDy-Go outperforms baselines across target environments in both simulation and real-world experiments, demonstrating improved navigation performance even after sim-to-real transfer and in the presence of moving obstacles. Moreover, zero-shot sim-to-real deployment in an unseen environment indicates its generalization potential.
63. 【2602.11565】Move What Matters: Parameter-Efficient Domain Adaptation via Optimal Transport Flow for Collaborative Perception
Link: https://arxiv.org/abs/2602.11565
Authors: Zesheng Jia, Jin Wang, Siao Liu, Lingzhi Li, Ziyao Huang, Yunjiang Xu, Jianping Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: deploying multi-agent systems, Fast domain adaptation, collaborative perception, remains a fundamental, fundamental challenge
Comments:
Abstract:Fast domain adaptation remains a fundamental challenge for deploying multi-agent systems across diverse environments in Vehicle-to-Everything (V2X) collaborative perception. Despite the success of Parameter-Efficient Fine-Tuning (PEFT) in natural language processing and conventional vision tasks, directly applying PEFT to multi-agent settings leads to significant performance degradation and training instability. In this work, we conduct a detailed analysis and identify two key factors: (i) inter-frame redundancy in heterogeneous sensory streams, and (ii) erosion of fine-grained semantics in deep-layer representations under PEFT adaptation. To address these issues, we propose FlowAdapt, a parameter-efficient framework grounded in optimal transport theory, which minimizes information transport costs across both data distributions and network hierarchies. Specifically, we introduce a Wasserstein Greedy Sampling strategy to selectively filter redundant samples via a bounded covering radius. Furthermore, a Progressive Knowledge Transfer module is designed to progressively inject compressed early-stage representations into later stages through learnable pathways, alleviating semantic degradation in late-stage adaptation. Extensive experiments on three benchmarks demonstrate that FlowAdapt achieves state-of-the-art performance with only 1% of trainable parameters, effectively bridging domain gaps with superior sample efficiency and generalization.
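Filtering redundant samples "via a bounded covering radius" resembles classic greedy k-center (farthest-point) selection, sketched below. This is an illustration of that general idea rather than the paper's Wasserstein Greedy Sampling: Euclidean distance stands in for the Wasserstein distance, and the function name, radius, and toy points are assumptions.

```python
import math

def greedy_cover(points, radius):
    """Farthest-point selection: keep adding the point farthest from the
    selected set until every point lies within `radius` of a selected one.
    Assumes a non-empty point list and radius >= 0."""
    selected = [0]
    nearest = [math.dist(p, points[0]) for p in points]
    while max(nearest) > radius and len(selected) < len(points):
        far = max(range(len(points)), key=lambda i: nearest[i])
        selected.append(far)
        nearest = [min(d, math.dist(p, points[far])) for d, p in zip(nearest, points)]
    return selected

# Three clusters; near-duplicates inside a cluster are treated as redundant.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
selected = greedy_cover(points, radius=0.5)
print(sorted(selected))
```

One representative per cluster survives, and the covering-radius bound guarantees that every dropped sample is within 0.5 of a kept one.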
64. 【2602.11564】LUVE: Latent-Cascaded Ultra-High-Resolution Video Generation with Dual Frequency Experts
Link: https://arxiv.org/abs/2602.11564
Authors: Chen Zhao, Jiawei Chen, Hongyu Li, Zhuoliang Kang, Shilin Lu, Xiaoming Wei, Kai Zhang, Jian Yang, Ying Tai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: improved visual quality, significantly improved visual, formidable challenge due, Recent advances, video diffusion models
Comments:
Abstract:Recent advances in video diffusion models have significantly improved visual quality, yet ultra-high-resolution (UHR) video generation remains a formidable challenge due to the compounded difficulties of motion modeling, semantic planning, and detail synthesis. To address these limitations, we propose LUVE, a Latent-cascaded UHR Video generation framework built upon dual frequency Experts. LUVE employs a three-stage architecture comprising low-resolution motion generation for motion-consistent latent synthesis, video latent upsampling that performs resolution upsampling directly in the latent space to mitigate memory and computational overhead, and high-resolution content refinement that integrates low-frequency and high-frequency experts to jointly enhance semantic coherence and fine-grained detail generation. Extensive experiments demonstrate that our LUVE achieves superior photorealism and content fidelity in UHR video generation, and comprehensive ablation studies further validate the effectiveness of each component.
65. 【2602.11554】HyperDet: 3D Object Detection with Hyper 4D Radar Point Clouds
Link: https://arxiv.org/abs/2602.11554
Authors: Yichun Xiao, Runwei Guan, Fangqiang Ding
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: velocity-aware measurements, cost-effective than LiDAR, radar, mmWave radar, radar point clouds
Comments: 9 pages, 4 figures, 6 tables
Abstract:4D mmWave radar provides weather-robust, velocity-aware measurements and is more cost-effective than LiDAR. However, radar-only 3D detection still trails LiDAR-based systems because radar point clouds are sparse, irregular, and often corrupted by multipath noise, yielding weak and unstable geometry. We present HyperDet, a detector-agnostic radar-only 3D detection framework that constructs a task-aware hyper 4D radar point cloud for standard LiDAR-oriented detectors. HyperDet aggregates returns from multiple surround-view 4D radars over consecutive frames to improve coverage and density, then applies geometry-aware cross-sensor consensus validation with a lightweight self-consistency check outside overlap regions to suppress inconsistent returns. It further integrates a foreground-focused diffusion module with training-time mixed radar-LiDAR supervision to densify object structures while lifting radar attributes (e.g., Doppler, RCS); the model is distilled into a consistency model for single-step inference. On MAN TruckScenes, HyperDet consistently improves over raw radar inputs with VoxelNeXt and CenterPoint, partially narrowing the radar-LiDAR gap. These results show that input-level refinement enables radar to better leverage LiDAR-oriented detectors without architectural modifications.
66. 【2602.11553】Perception-based Image Denoising via Generative Compression
Link: https://arxiv.org/abs/2602.11553
Authors: Nam Nguyen, Thinh Nguyen, Bella Bose
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: preserving structural details, Image denoising aims, produce over-smoothed reconstructions, distribution shift, aims to remove
Comments:
Abstract:Image denoising aims to remove noise while preserving structural details and perceptual realism, yet distortion-driven methods often produce over-smoothed reconstructions, especially under strong noise and distribution shift. This paper proposes a generative compression framework for perception-based denoising, where restoration is achieved by reconstructing from entropy-coded latent representations that enforce low-complexity structure, while generative decoders recover realistic textures via perceptual measures such as learned perceptual image patch similarity (LPIPS) loss and Wasserstein distance. Two complementary instantiations are introduced: (i) a conditional Wasserstein GAN (WGAN)-based compression denoiser that explicitly controls the rate-distortion-perception (RDP) trade-off, and (ii) a conditional diffusion-based reconstruction strategy that performs iterative denoising guided by compressed latents. We further establish non-asymptotic guarantees for the compression-based maximum-likelihood denoiser under additive Gaussian noise, including bounds on reconstruction error and decoding error probability. Experiments on synthetic and real-noise benchmarks demonstrate consistent perceptual improvements while maintaining competitive distortion performance.
67. 【2602.11545】Supervise-assisted Multi-modality Fusion Diffusion Model for PET Restoration
Link: https://arxiv.org/abs/2602.11545
Authors: Yingkai Zhang, Shuang Chen, Ye Tian, Yunyi Gao, Jianyong Jiang, Ying Fu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Positron emission tomography, offers powerful functional, powerful functional imaging, Positron emission, involves radiation exposure
Comments:
Abstract:Positron emission tomography (PET) offers powerful functional imaging but involves radiation exposure. Efforts to reduce this exposure by lowering the radiotracer dose or scan time can degrade image quality. While using magnetic resonance (MR) images with clearer anatomical information to restore standard-dose PET (SPET) from low-dose PET (LPET) is a promising approach, it faces challenges with the inconsistencies in the structure and texture of multi-modality fusion, as well as the mismatch in out-of-distribution (OOD) data. In this paper, we propose a supervise-assisted multi-modality fusion diffusion model (MFdiff) for addressing these challenges for high-quality PET restoration. Firstly, to fully utilize auxiliary MR images without introducing extraneous details in the restored image, a multi-modality feature fusion module is designed to learn an optimized fusion feature. Secondly, using the fusion feature as an additional condition, high-quality SPET images are iteratively generated based on the diffusion model. Furthermore, we introduce a two-stage supervise-assisted learning strategy that harnesses both generalized priors from simulated in-distribution datasets and specific priors tailored to in-vivo OOD data. Experiments demonstrate that the proposed MFdiff effectively restores high-quality SPET images from multi-modality inputs and outperforms state-of-the-art methods both qualitatively and quantitatively.
68. 【2602.11536】Vascular anatomy-aware self-supervised pre-training for X-ray angiogram analysis
Link: https://arxiv.org/abs/2602.11536
Authors: De-Xing Huang, Chaohui Yu, Xiao-Hu Zhou, Tian-Yu Xiang, Qin-Yi Zhang, Mei-Jiang Gui, Rui-Ze Ma, Chen-Yu Wang, Nu-Fang Xiao, Fan Wang, Zeng-Guang Hou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: gold standard imaging, standard imaging modality, cardiovascular diseases, gold standard, standard imaging
Comments: 10 pages, 10 figures, 10 tables. Journal version of VasoMIM (AAAI 2026)
Abstract:X-ray angiography is the gold standard imaging modality for cardiovascular diseases. However, current deep learning approaches for X-ray angiogram analysis are severely constrained by the scarcity of annotated data. While large-scale self-supervised learning (SSL) has emerged as a promising solution, its potential in this domain remains largely unexplored, primarily due to the lack of effective SSL frameworks and large-scale datasets. To bridge this gap, we introduce a vascular anatomy-aware masked image modeling (VasoMIM) framework that explicitly integrates domain-specific anatomical knowledge. Specifically, VasoMIM comprises two key designs: an anatomy-guided masking strategy and an anatomical consistency loss. The former strategically masks vessel-containing patches to compel the model to learn robust vascular semantics, while the latter preserves structural consistency of vessels between original and reconstructed images, enhancing the discriminability of the learned representations. In conjunction with VasoMIM, we curate XA-170K, the largest X-ray angiogram pre-training dataset to date. We validate VasoMIM on four downstream tasks across six datasets, where it demonstrates superior transferability and achieves state-of-the-art performance compared to existing methods. These findings highlight the significant potential of VasoMIM as a foundation model for advancing a wide range of X-ray angiogram analysis tasks. VasoMIM and XA-170K will be made publicly available.
69. 【2602.11514】How Smart Is Your GUI Agent? A Framework for the Future of Software Interaction
Link: https://arxiv.org/abs/2602.11514
Authors: Sidong Feng, Chunyang Chen
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Keywords: allowing people, navigate web, desktop and mobile, people to navigate, GUI Agent Autonomy
Comments:
Abstract:GUI agents are rapidly becoming a new way of interacting with software, allowing people to navigate web, desktop, and mobile applications rather than operate them click by click. Yet the term "agent" is used for radically different degrees of autonomy, obscuring capability, responsibility, and risk. We call for conceptual clarity through GUI Agent Autonomy Levels (GAL), a six-level framework that makes autonomy explicit and helps benchmark progress toward trustworthy software interaction.
70. 【2602.11509】Multimodal Fact-Level Attribution for Verifiable Reasoning
Link: https://arxiv.org/abs/2602.11509
Authors: David Wan, Han Wang, Ziyang Wang, Elias Stengel-Eskin, Hyunji Lee, Mohit Bansal
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: individual factual claims, real-world tasks involving, tasks involving multi-step, verifying individual factual, Multimodal large language
Comments: 29 pages. Code and data are available at [this https URL](https://github.com/meetdavidwan/murgat)
Abstract:Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation, where reliability requires grounding model outputs in heterogeneous input sources and verifying individual factual claims. However, existing multimodal grounding benchmarks and evaluation methods focus on simplified, observation-based scenarios or limited modalities and fail to assess attribution in complex multimodal reasoning. We introduce MuRGAt (Multimodal Reasoning with Grounded Attribution), a benchmark for evaluating fact-level multimodal attribution in settings that require reasoning beyond direct observation. Given inputs spanning video, audio, and other modalities, MuRGAt requires models to generate answers with explicit reasoning and precise citations, where each citation specifies both modality and temporal segments. To enable reliable assessment, we introduce an automatic evaluation framework that strongly correlates with human judgments. Benchmarking with human and automated scores reveals that even strong MLLMs frequently hallucinate citations despite correct reasoning. Moreover, we observe a key trade-off: increasing reasoning depth or enforcing structured grounding often degrades accuracy, highlighting a significant gap between internal reasoning and verifiable attribution.
71. 【2602.11499】What if Agents Could Imagine? Reinforcing Open-Vocabulary HOI Comprehension through Generation
Link: https://arxiv.org/abs/2602.11499
Authors: Zhenlong Yuan,Xiangyan Qu,Jing Tang,Rui Chen,Lei Sun,Ruidong Chen,Hongwei Yu,Chengxuan Qian,Xiangxiang Chu,Shuo Li,Yuyin Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Multimodal Large Language, Large Language Models, Open-Vocabulary Human-Object Interaction, shown promising capabilities, Multimodal Large
Comments:
Click to view abstract
Abstract:Multimodal Large Language Models have shown promising capabilities in bridging visual and textual reasoning, yet their reasoning capabilities in Open-Vocabulary Human-Object Interaction (OV-HOI) are limited by cross-modal hallucinations and occlusion-induced ambiguity. To address this, we propose ImagineAgent, an agentic framework that harmonizes cognitive reasoning with generative imagination for robust visual understanding. Specifically, our method innovatively constructs cognitive maps that explicitly model plausible relationships between detected entities and candidate actions. Subsequently, it dynamically invokes tools including retrieval augmentation, image cropping, and diffusion models to gather domain-specific knowledge and enriched visual evidence, thereby achieving cross-modal alignment in ambiguous scenarios. Moreover, we propose a composite reward that balances prediction accuracy and tool efficiency. Evaluations on SWIG-HOI and HICO-DET datasets demonstrate our SOTA performance, requiring approximately 20% of training data compared to existing methods, validating our robustness and efficiency.
72. 【2602.11494】Arbitrary Ratio Feature Compression via Next Token Prediction
Link: https://arxiv.org/abs/2602.11494
Authors: Yufan Liu,Daoyuan Ren,Zhipeng Zhang,Wenyang Luo,Bing Li,Weiming Hu,Stephen Maybank
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: applications involving large-scale, Arbitrary Ratio Feature, Arbitrary Ratio Compressor, compression, compression ratio
Comments:
Click to view abstract
Abstract:Feature compression is increasingly important for improving the efficiency of downstream tasks, especially in applications involving large-scale or multi-modal data. While existing methods typically rely on dedicated models for achieving specific compression ratios, they are often limited in flexibility and generalization. In particular, retraining is necessary when adapting to a new compression ratio. To address this limitation, we propose a novel and flexible Arbitrary Ratio Feature Compression (ARFC) framework, which supports any compression ratio with a single model, eliminating the need for multiple specialized models. At its core, the Arbitrary Ratio Compressor (ARC) is an auto-regressive model that performs compression via next-token prediction. This allows the compression ratio to be controlled at inference simply by adjusting the number of generated tokens. To enhance the quality of the compressed features, two key modules are introduced. The Mixture of Solutions (MoS) module refines the compressed tokens by utilizing multiple compression results (solutions), reducing uncertainty and improving robustness. The Entity Relation Graph Constraint (ERGC) is integrated into the training process to preserve semantic and structural relationships during compression. Extensive experiments on cross-modal retrieval, image classification, and image retrieval tasks across multiple datasets demonstrate that our method consistently outperforms existing approaches at various compression ratios. Notably, in some cases, it even surpasses the performance of the original, uncompressed features. These results validate the effectiveness and versatility of ARFC for practical, resource-constrained scenarios.
73. 【2602.11466】A Dual-Branch Framework for Semantic Change Detection with Boundary and Temporal Awareness
Link: https://arxiv.org/abs/2602.11466
Authors: Yun-Cheng Li,Sen Lei,Heng-Chao Li,Ke Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Semantic Change Detection, remote sensing images, bi-temporal remote sensing, categorize land-cover changes, Change Detection
Comments:
Click to view abstract
Abstract:Semantic Change Detection (SCD) aims to detect and categorize land-cover changes from bi-temporal remote sensing images. Existing methods often suffer from blurred boundaries and inadequate temporal modeling, limiting segmentation accuracy. To address these issues, we propose a Dual-Branch Framework for Semantic Change Detection with Boundary and Temporal Awareness, termed DBTANet. Specifically, we utilize a dual-branch Siamese encoder where a frozen SAM branch captures global semantic context and boundary priors, while a ResNet34 branch provides local spatial details, ensuring complementary feature representations. On this basis, we design a Bidirectional Temporal Awareness Module (BTAM) to aggregate multi-scale features and capture temporal dependencies in a symmetric manner. Furthermore, a Gaussian-smoothed Projection Module (GSPM) refines shallow SAM features, suppressing noise while enhancing edge information for boundary-aware constraints. Extensive experiments on two public benchmarks demonstrate that DBTANet effectively integrates global semantics, local details, temporal reasoning, and boundary awareness, achieving state-of-the-art performance.
74. 【2602.11448】Hierarchical Concept Embedding Pursuit for Interpretable Image Classification
Link: https://arxiv.org/abs/2602.11448
Authors: Nghia Nguyen,Tianjiao Ding,René Vidal
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: provide faithful explanations, concept embeddings, concept, gaining traction, traction in computer
Comments:
Click to view abstract
Abstract:Interpretable-by-design models are gaining traction in computer vision because they provide faithful explanations for their predictions. In image classification, these models typically recover human-interpretable concepts from an image and use them for classification. Sparse concept recovery methods leverage the latent space of vision-language models to represent image embeddings as a sparse combination of concept embeddings. However, because such methods ignore the hierarchical structure of concepts, they can produce correct predictions with explanations that are inconsistent with the hierarchy. In this work, we propose Hierarchical Concept Embedding Pursuit (HCEP), a framework that induces a hierarchy of concept embeddings in the latent space and uses hierarchical sparse coding to recover the concepts present in an image. Given a hierarchy of semantic concepts, we construct a corresponding hierarchy of concept embeddings and, assuming the correct concepts for an image form a rooted path in the hierarchy, derive desirable conditions for identifying them in the embedded space. We show that hierarchical sparse coding reliably recovers hierarchical concept embeddings, whereas vanilla sparse coding fails. Our experiments on real-world datasets demonstrate that HCEP outperforms baselines in concept precision and recall while maintaining competitive classification accuracy. Moreover, when the number of samples is limited, HCEP achieves superior classification accuracy and concept recovery. These results show that incorporating hierarchical structures into sparse coding yields more reliable and interpretable image classification models.
75. 【2602.11446】Enhanced Portable Ultra Low-Field Diffusion Tensor Imaging with Bayesian Artifact Correction and Deep Learning-Based Super-Resolution
Link: https://arxiv.org/abs/2602.11446
Authors: Mark D. Olchanyi,Annabel Sorby-Adams,John Kirsch,Brian L. Edlow,Ava Farnan,Renfei Liu,Matthew S. Rosen,Emery N. Brown,W. Taylor Kimberly,Juan Eugenio Iglesias
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: magnetic resonance imaging, ULF DTI, ULF DTI scans, ULF DTI sequence, DTI
Comments: 38 pages, 8 figures, 2 supplementary figures, and 3 supplementary tables
Click to view abstract
Abstract:Portable, ultra-low-field (ULF) magnetic resonance imaging has the potential to expand access to neuroimaging but currently suffers from coarse spatial and angular resolutions and low signal-to-noise ratios. Diffusion tensor imaging (DTI), a sequence tailored to detect and reconstruct white matter tracts within the brain, is particularly prone to such imaging degradation due to inherent sequence design coupled with prolonged scan times. In addition, ULF DTI scans exhibit artifacts that span both the spatial and angular domains, requiring a custom modelling algorithm for subsequent correction. We introduce a nine-direction, single-shell ULF DTI sequence, as well as a companion Bayesian bias field correction algorithm that possesses angular dependence and a convolutional neural network-based super-resolution algorithm ("DiffSR") that is generalizable across DTI datasets and does not require re-training. We show through a synthetic downsampling experiment and white matter assessment in real, matched ULF and high-field DTI scans that these algorithms can recover microstructural and volumetric white matter information at ULF. We also show that DiffSR can be directly applied to white matter-based Alzheimer's disease classification in synthetically degraded scans, with notable improvements in agreement between DTI metrics, as compared to un-degraded scans. We freely disseminate the Bayesian bias correction algorithm and DiffSR with the goal of furthering progress on both ULF reconstruction methods and general DTI sequence harmonization. We release all code related to DiffSR for public use at this https URL.
76. 【2602.11440】CtrlShift: High-Quality Geometry-Aware Object Manipulation in Visual Generation
Link: https://arxiv.org/abs/2602.11440
Authors: Penghui Ruan,Bojia Zi,Xianbiao Qi,Youze Huang,Rong Xiao,Pichao Wang,Jiannong Cao,Yuhui Shi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: preserving scene realism, Object-level manipulation, relocating or reorienting, scene realism, film post-production
Comments: Accepted at ICLR 2026
Click to view abstract
Abstract:Object-level manipulation, relocating or reorienting objects in images or videos while preserving scene realism, is central to film post-production, AR, and creative editing. Yet existing methods struggle to jointly achieve three core goals: background preservation, geometric consistency under viewpoint shifts, and user-controllable transformations. Geometry-based approaches offer precise control but require explicit 3D reconstruction and generalize poorly; diffusion-based methods generalize better but lack fine-grained geometric control. We present CtrlShift, an end-to-end diffusion framework to achieve geometry-consistent object manipulation without explicit 3D representations. Our key insight is to decompose manipulation into two stages, object removal and reference-guided inpainting under explicit camera pose control, and encode both within a unified diffusion process. To enable precise, disentangled control, we design a multi-task, multi-stage training strategy that separates background, identity, and pose signals across tasks. To improve generalization, we introduce a scalable real-world dataset construction pipeline that generates paired image and video samples with estimated relative camera poses. Extensive experiments demonstrate that CtrlShift achieves state-of-the-art results in fidelity, viewpoint consistency, and controllability. To our knowledge, this is the first framework to unify fine-grained geometric control and real-world generalization for object manipulation, without relying on any explicit 3D modeling.
77. 【2602.11436】Fighting MRI Anisotropy: Learning Multiple Cardiac Shapes From a Single Implicit Neural Representation
Link: https://arxiv.org/abs/2602.11436
Authors: Carolina Brás,Soufiane Ben Haddou,Thijs P. Kuipers,Laura Alvarez-Florez,R. Nils Planken,Fleur V. Y. Tjong,Connie Bezzina,Ivana Išgum
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: cardiovascular magnetic resonance, magnetic resonance imaging, limits cardiac shape, nature of short-axis, cardiovascular magnetic
Comments:
Click to view abstract
Abstract:The anisotropic nature of short-axis (SAX) cardiovascular magnetic resonance imaging (CMRI) limits cardiac shape analysis. To address this, we propose to leverage near-isotropic, higher resolution computed tomography angiography (CTA) data of the heart. We use this data to train a single neural implicit function to jointly represent cardiac shapes from CMRI at any resolution. We evaluate the method for the reconstruction of right ventricle (RV) and myocardium (MYO), where MYO simultaneously models endocardial and epicardial left-ventricle surfaces. Since high-resolution SAX reference segmentations are unavailable, we evaluate performance by extracting a 4-chamber (4CH) slice of RV and MYO from their reconstructed shapes. When compared with the reference 4CH segmentation masks from CMRI, our method achieved a Dice similarity coefficient of 0.91 ± 0.07 and 0.75 ± 0.13, and a Hausdorff distance of 6.21 ± 3.97 mm and 7.53 ± 5.13 mm for RV and MYO, respectively. Quantitative and qualitative assessment demonstrate the model's ability to reconstruct accurate, smooth and anatomically plausible shapes, supporting improvements in cardiac shape analysis.
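For readers unfamiliar with the Dice similarity coefficient reported in this abstract, the following is a minimal sketch of how it is commonly computed for two binary masks. It is not the authors' evaluation code; the function name `dice_coefficient`, the flat 0/1 mask representation, and the toy masks are all assumptions made for illustration.

```python
def dice_coefficient(pred, ref):
    """Dice similarity between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of pixels")
    # Pixels labelled foreground in both masks.
    intersection = sum(1 for p, r in zip(pred, ref) if p and r)
    # Total foreground pixels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical toy masks (flattened), not data from the paper.
pred_mask = [0, 1, 1, 1, 0, 0]
ref_mask = [0, 1, 1, 0, 1, 0]
print(dice_coefficient(pred_mask, ref_mask))
```

A value of 1 means the masks overlap perfectly and 0 means they share no foreground pixels.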
78. 【2602.11401】Latent Forcing: Reordering the Diffusion Trajectory for Pixel-Space Image Generation
Link: https://arxiv.org/abs/2602.11401
Authors: Alan Baade,Eric Ryan Chan,Kyle Sargent,Changan Chen,Justin Johnson,Ehsan Adeli,Li Fei-Fei
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: generating high-quality images, diffusion models excel, excel at generating, generating high-quality, lose the benefits
Comments: 8 pages, 6 figures
Click to view abstract
Abstract:Latent diffusion models excel at generating high-quality images but lose the benefits of end-to-end modeling. They discard information during image encoding, require a separately trained decoder, and model an auxiliary distribution to the raw data. In this paper, we propose Latent Forcing, a simple modification to existing architectures that achieves the efficiency of latent diffusion while operating on raw natural images. Our approach orders the denoising trajectory by jointly processing latents and pixels with separately tuned noise schedules. This allows the latents to act as a scratchpad for intermediate computation before high-frequency pixel features are generated. We find that the order of conditioning signals is critical, and we analyze this to explain differences between REPA distillation in the tokenizer and the diffusion model, conditional versus unconditional generation, and how tokenizer reconstruction quality relates to diffusability. Applied to ImageNet, Latent Forcing achieves a new state-of-the-art for diffusion transformer-based pixel generation at our compute scale.
79. 【2602.11349】ArtContext: Contextualizing Artworks with Open-Access Art History Articles and Wikidata Knowledge through a LoRA-Tuned CLIP Model
Link: https://arxiv.org/abs/2602.11349
Authors: Samuel Waugh,Stuart James
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Art History articles, History articles discuss, Art History, Open-Access Art History, parts of works
Comments:
Click to view abstract
Abstract:Many Art History articles discuss artworks in general as well as specific parts of works, such as layout, iconography, or material culture. However, when viewing an artwork, it is not trivial to identify what different articles have said about the piece. Therefore, we propose ArtContext, a pipeline for taking a corpus of Open-Access Art History articles and Wikidata Knowledge and annotating Artworks with this information. We do this using a novel corpus collection pipeline, then learn a bespoke CLIP model adapted using Low-Rank Adaptation (LoRA) to make it domain-specific. We show that the new model, PaintingCLIP, which is weakly supervised by the collected corpus, outperforms CLIP and provides context for a given artwork. The proposed pipeline is generalisable and can be readily applied to numerous humanities areas.
80. 【2602.11339】Exploring Real-Time Super-Resolution: Benchmarking and Fine-Tuning for Streaming Content
Link: https://arxiv.org/abs/2602.11339
Authors: Evgeney Bogatyrev,Khaled Abud,Ivan Molodetskikh,Nikita Alutis,Dmitry Vatolin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: compressed video content, Recent advancements, enabled higher-quality video, existing methods struggle, higher-quality video streaming
Comments:
Click to view abstract
Abstract:Recent advancements in real-time super-resolution have enabled higher-quality video streaming, yet existing methods struggle with the unique challenges of compressed video content. Commonly used datasets do not accurately reflect the characteristics of streaming media, limiting the relevance of current benchmarks. To address this gap, we introduce a comprehensive dataset - StreamSR - sourced from YouTube, covering a wide range of video genres and resolutions representative of real-world streaming scenarios. We benchmark 11 state-of-the-art real-time super-resolution models to evaluate their performance for the streaming use-case. Furthermore, we propose EfRLFN, an efficient real-time model that integrates Efficient Channel Attention and a hyperbolic tangent activation function - a novel design choice in the context of real-time super-resolution. We extensively optimized the architecture to maximize efficiency and designed a composite loss function that improves training convergence. EfRLFN combines the strengths of existing architectures while improving both visual quality and runtime performance. Finally, we show that fine-tuning other models on our dataset results in significant performance gains that generalize well across various standard benchmarks. We made the dataset, the code, and the benchmark available at this https URL.
81. 【2602.11337】MolmoSpaces: A Large-Scale Open Ecosystem for Robot Navigation and Manipulation
Link: https://arxiv.org/abs/2602.11337
Authors: Yejin Kim,Wilbert Pumacay,Omar Rayyan,Max Argus,Winson Han,Eli VanderBilt,Jordi Salvador,Abhay Deshpande,Rose Hendrix,Snehal Jauhri,Shuo Liu,Nur Muhammad Mahi Shafiullah,Maya Guru,Ainaz Eftekhar,Karen Farley,Donovan Clay,Jiafei Duan,Arjun Guru,Piper Wolters,Alvaro Herrasti,Ying-Chun Lee,Georgia Chalvatzaki,Yuchen Cui,Ali Farhadi,Dieter Fox,Ranjay Krishna
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: scale demands robustness, Deploying robots, everyday situations, demands robustness, long tail
Comments:
Click to view abstract
Abstract:Deploying robots at scale demands robustness to the long tail of everyday situations. The countless variations in scene layout, object geometry, and task specifications that characterize real environments are vast and underrepresented in existing robot benchmarks. Measuring this level of generalization requires infrastructure at a scale and diversity that physical evaluation alone cannot provide. We introduce MolmoSpaces, a fully open ecosystem to support large-scale benchmarking of robot policies. MolmoSpaces consists of over 230k diverse indoor environments, ranging from handcrafted household scenes to procedurally generated multiroom houses, populated with 130k richly annotated object assets, including 48k manipulable objects with 42M stable grasps. Crucially, these environments are simulator-agnostic, supporting popular options such as MuJoCo, Isaac, and ManiSkill. The ecosystem supports the full spectrum of embodied tasks: static and mobile manipulation, navigation, and multiroom long-horizon tasks requiring coordinated perception, planning, and interaction across entire indoor environments. We also design MolmoSpaces-Bench, a benchmark suite of 8 tasks in which robots interact with our diverse scenes and richly annotated objects. Our experiments show MolmoSpaces-Bench exhibits strong sim-to-real correlation (R = 0.96, ρ = 0.98), confirm newer and stronger zero-shot policies outperform earlier versions in our benchmarks, and identify key sensitivities to prompt phrasing, initial joint positions, and camera occlusion. Through MolmoSpaces and its open-source assets and tooling, we provide a foundation for scalable data generation, policy training, and benchmark creation for robot learning research.
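The abstract reports sim-to-real agreement with two correlation coefficients. Assuming R denotes Pearson correlation and ρ denotes Spearman rank correlation (the usual reading, though the abstract does not spell it out), a minimal standard-library sketch looks like this. The function names and the success-rate numbers are hypothetical and are not taken from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical per-policy success rates in simulation and on real robots.
sim_success = [0.10, 0.35, 0.50, 0.80]
real_success = [0.05, 0.30, 0.55, 0.70]
print(pearson(sim_success, real_success), spearman(sim_success, real_success))
```

Pearson measures linear agreement between the two sets of scores, while Spearman only checks whether the policies are ranked in the same order.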
82. 【2602.11323】MDE-VIO: Enhancing Visual-Inertial Odometry Using Learned Depth Priors
Link: https://arxiv.org/abs/2602.11323
Authors: Arda Alniak,Sinan Kalkan,Mustafa Mert Ankarali,Afsar Saranli,Abdullah Aydin Alatan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Traditional monocular Visual-Inertial, monocular Visual-Inertial Odometry, accurate pose estimation, sparse visual features, Visual-Inertial Odometry
Comments: 6 pages, 2 figures, 3 tables. Submitted to ICIP 2026
Click to view abstract
Abstract:Traditional monocular Visual-Inertial Odometry (VIO) systems struggle in low-texture environments where sparse visual features are insufficient for accurate pose estimation. To address this, dense Monocular Depth Estimation (MDE) has been widely explored as a complementary information source. While recent Vision Transformer (ViT) based complex foundational models offer dense, geometrically consistent depth, their computational demands typically preclude them from real-time edge deployment. Our work bridges this gap by integrating learned depth priors directly into the VINS-Mono optimization backend. We propose a novel framework that enforces affine-invariant depth consistency and pairwise ordinal constraints, explicitly filtering unstable artifacts via variance-based gating. This approach strictly adheres to the computational limits of edge devices while robustly recovering metric scale. Extensive experiments on the TartanGround and M3ED datasets demonstrate that our method prevents divergence in challenging scenarios and delivers significant accuracy gains, reducing Absolute Trajectory Error (ATE) by up to 28.3%. Code will be made available.
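As background for the Absolute Trajectory Error (ATE) figure quoted above, here is a minimal sketch of the RMSE form of ATE over position errors. It assumes the estimated and ground-truth trajectories are already time-synchronized and expressed in the same frame; real evaluations usually align them first, a step omitted here. The function name and the toy trajectories are hypothetical and do not come from the paper.

```python
import math


def ate_rmse(estimated, ground_truth):
    """Root-mean-square position error between two aligned trajectories.

    Each trajectory is a sequence of (x, y, z) positions in metres.
    """
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and equally long")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(est_pos, gt_pos))
        for est_pos, gt_pos in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical two-pose trajectories, not data from the paper.
est = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(ate_rmse(est, gt))
```

A relative reduction such as the one reported would then be computed as `(baseline_ate - new_ate) / baseline_ate`.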
83. 【2602.11316】Selective Prior Synchronization via SYNC Loss
Link: https://arxiv.org/abs/2602.11316
Authors: Ishan Mishra,Jiajie Li,Deepak Mishra,Jinjun Xiong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: deep neural network, selective prediction, succeed responsibly, critical requirement, deep neural
Comments:
Click to view abstract
Abstract:Prediction under uncertainty is a critical requirement for the deep neural network to succeed responsibly. This paper focuses on selective prediction, which allows DNNs to make informed decisions about when to predict or abstain based on the uncertainty level of their predictions. Current methods are either ad-hoc such as SelectiveNet, focusing on how to modify the network architecture or objective function, or post-hoc such as softmax response, achieving selective prediction through analyzing the model's probabilistic outputs. We observe that post-hoc methods implicitly generate uncertainty information, termed the selective prior, which has traditionally been used only during inference. We argue that the selective prior provided by the selection mechanism is equally vital during the training stage. Therefore, we propose the SYNC loss which introduces a novel integration of ad-hoc and post-hoc methods. Specifically, our approach incorporates the softmax response into the training process of SelectiveNet, enhancing its selective prediction capabilities by examining the selective prior. Evaluated across various datasets, including CIFAR-100, ImageNet-100, and Stanford Cars, our method not only enhances the model's generalization capabilities but also surpasses previous works in selective prediction performance, and sets new benchmarks for state-of-the-art performance.
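To make the post-hoc "softmax response" baseline mentioned in this abstract concrete, the sketch below abstains whenever the highest softmax probability falls below a threshold. It illustrates only that baseline as it is commonly described, not the SYNC loss or SelectiveNet; the function names, the logits, and the 0.9 threshold are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_response(logits, threshold):
    """Return (label, confidence); label is None when the model abstains."""
    probs = softmax(logits)
    confidence = max(probs)
    label = probs.index(confidence)
    # Abstain when the top-class probability is below the threshold.
    return (label if confidence >= threshold else None), confidence


# Hypothetical logits: one confident prediction and one ambiguous one.
print(softmax_response([4.0, 0.0, 0.0], threshold=0.9))
print(softmax_response([1.0, 1.0, 1.0], threshold=0.9))
```

Raising the threshold trades coverage (the fraction of inputs answered) for lower risk on the inputs that are answered.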
84. 【2602.11314】Advancing Digital Twin Generation Through a Novel Simulation Framework and Quantitative Benchmarking
Link: https://arxiv.org/abs/2602.11314
Authors: Jacob Rubinstein,Avi Donaty,Don Engel
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: triangulating matched point-based, matched point-based features, accomplished through photogrammetry, textured mesh, triangulating matched
Comments: 9 pages, 10 figures. Preprint
Click to view abstract
Abstract:The generation of 3D models from real-world objects has often been accomplished through photogrammetry, i.e., by taking 2D photos from a variety of perspectives and then triangulating matched point-based features to create a textured mesh. Many design choices exist within this framework for the generation of digital twins, and differences between such approaches are largely judged qualitatively. Here, we present and test a novel pipeline for generating synthetic images from high-quality 3D models and programmatically generated camera poses. This enables a wide variety of repeatable, quantifiable experiments which can compare ground-truth knowledge of virtual camera parameters and of virtual objects against the reconstructed estimations of those perspectives and subjects.
85. 【2602.11244】Stress Tests REVEAL Fragile Temporal and Visual Grounding in Video-Language Models
Link: https://arxiv.org/abs/2602.11244
Authors: Sethuraman T V,Savya Khosla,Aditi Tiwari,Vidya Ganesh,Rakshana Jayaprakash,Aditya Jain,Vignesh Srinivasakumar,Onkar Kishor Susladkar,Srinidhi Sunkara,Aditya Shanmugham,Rakesh Vaideeswaran,Abbaas Alif Mohamed Nishar,Simon Jenni,Derek Hoiem
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: robustly account, work investigates, Video-Language Models, video content, temporal sequence
Comments:
Click to view abstract
Abstract:This work investigates a fundamental question: Do Video-Language Models (VidLMs) robustly account for video content, temporal sequence, and motion? Our investigation shows that, surprisingly, they often do not. We introduce REVEAL, a diagnostic benchmark that probes fundamental weaknesses of contemporary VidLMs through five controlled stress tests, assessing temporal expectation bias, reliance on language-only shortcuts, video sycophancy, camera motion sensitivity, and robustness to spatiotemporal occlusion. We test leading open- and closed-source VidLMs and find that these models confidently describe reversed scenes as forward, answer questions while neglecting video content, agree with false claims, struggle with basic camera motion, and fail to aggregate temporal information amidst simple spatiotemporal masking. Humans, on the other hand, succeed at these tasks with ease. Alongside our benchmark, we provide a data pipeline that automatically generates diagnostic examples for our stress tests, enabling broader and more scalable evaluation. We will release our benchmark and code to support future research.
86. 【2602.11242】ReTracing: An Archaeological Approach Through Body, Machine, and Generative Systems
Link: https://arxiv.org/abs/2602.11242
Authors: Yitong Wang,Yue Yao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: artificial intelligence shapes, multi-agent embodied performance, embodied performance art, produces bodily movement, intelligence shapes
Comments:
Click to view abstract
Abstract:We present ReTracing, a multi-agent embodied performance art that adopts an archaeological approach to examine how artificial intelligence shapes, constrains, and produces bodily movement. Drawing from science-fiction novels, the project extracts sentences that describe human-machine interaction. We use large language models (LLMs) to generate paired prompts "what to do" and "what not to do" for each excerpt. A diffusion-based text-to-video model transforms these prompts into choreographic guides for a human performer and motor commands for a quadruped robot. Both agents enact the actions on a mirrored floor, captured by multi-camera motion tracking and reconstructed into 3D point clouds and motion trails, forming a digital archive of motion traces. Through this process, ReTracing serves as a novel approach to reveal how generative systems encode socio-cultural biases through choreographed movements. Through an immersive interplay of AI, human, and robot, ReTracing confronts a critical question of our time: What does it mean to be human among AIs that also move, think, and leave traces behind?
87. 【2602.11241】Active Zero: Self-Evolving Vision-Language Models through Active Environment Exploration
Link: https://arxiv.org/abs/2602.11241
Authors: Jinghan He,Junfeng Fang,Feng Xiong,Zijun Yao,Fei Shen,Haiyun Guo,Jinqiao Wang,Tat-Seng Chua
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: enabled large language, large language models, self-generated challenges, enabled large, large language
Comments:
Click to view abstract
Abstract:Self-play has enabled large language models to autonomously improve through self-generated challenges. However, existing self-play methods for vision-language models rely on passive interaction with static image collections, resulting in strong dependence on initial datasets and inefficient learning. Without the ability to actively seek visual data tailored to their evolving capabilities, agents waste computational effort on samples that are either trivial or beyond their current skill level. To address these limitations, we propose Active-Zero, a framework that shifts from passive interaction to active exploration of visual environments. Active-Zero employs three co-evolving agents: a Searcher that retrieves images from open-world repositories based on the model's capability frontier, a Questioner that synthesizes calibrated reasoning tasks, and a Solver refined through accuracy rewards. This closed loop enables self-scaffolding auto-curricula where the model autonomously constructs its learning trajectory. On Qwen2.5-VL-7B-Instruct across 12 benchmarks, Active-Zero achieves 53.97 average accuracy on reasoning tasks (5.7% improvement) and 59.77 on general understanding (3.9% improvement), consistently outperforming existing self-play baselines. These results highlight active exploration as a key ingredient for scalable and adaptive self-evolving vision-language systems.
88. 【2602.11239】Toward Reliable Tea Leaf Disease Diagnosis Using Deep Learning Model: Enhancing Robustness With Explainable AI and Adversarial Training
Link: https://arxiv.org/abs/2602.11239
Authors: Samanta Ghosh,Jannatul Adan Mahi,Shayan Abrar,Md Parvez Mia,Asaduzzaman Rayhan,Abdul Awal Yasir,Asaduzzaman Hridoy
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: economy of Bangladesh, tea leaf disease, tea leaf diseases, valuable asset, tea leaf
Comments: 6 pages, 9 figures, 2025 IEEE International Women in Engineering (WIE) Conference on Electrical and Computer Engineering (WIECON-ECE)
Click to view abstract
Abstract:Tea is a valuable asset for the economy of Bangladesh, so tea cultivation plays an important role in boosting the economy. These valuable plants are vulnerable to various kinds of leaf infections, which may reduce both yield and quality. Detecting these diseases manually is difficult, time-consuming and error-prone. Therefore, the purpose of the study is to develop an automated deep learning model for tea leaf disease classification based on the teaLeafBD dataset so that anyone can detect the diseases more easily and efficiently. There are 5,278 high-resolution images in this dataset. The images are classified into seven categories: six represent various diseases and the remaining one represents healthy leaves. The proposed pipeline contains data preprocessing, data splitting, adversarial training, augmentation, model training, evaluation, and interpretation using Explainable AI strategies. DenseNet201 and EfficientNetB3 were employed to perform the classification task. To make the model more robust, we applied adversarial training so it can operate effectively even with noisy or disturbed inputs. In addition, Grad-CAM visualization was used to analyze the model's predictions by identifying the most influential regions of each image. Our experimental outcomes revealed that EfficientNetB3 achieved the highest classification accuracy of 93%, while DenseNet201 reached 91%. The outcomes show that the proposed approach can accurately detect tea leaf diseases and provide a practical solution for advanced agricultural management.
89. 【2602.11236】ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning
Link: https://arxiv.org/abs/2602.11236
Authors: Yandan Yang,Shuang Zeng,Tong Lin,Xinyuan Chang,Dekang Qi,Junjin Xiao,Haoyun Liu,Ronghan Chen,Yuzhi Chen,Dongjie Huo,Feng Xiong,Xing Wei,Zhiheng Ma,Mu Xu
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Robotics (cs.RO)
Keywords: Building general-purpose embodied, diverse hardware remains, many-forms paradigm, Building general-purpose, challenge in robotics
Comments: Project website: [this https URL](https://amap-cvlab.github.io/ABot-Manipulation/). Code: [this https URL](https://github.com/amap-cvlab/ABot-Manipulation). 22 pages, 10 figures, 10 tables
Click to view abstract
Abstract:Building general-purpose embodied agents across diverse hardware remains a central challenge in robotics, often framed as the "one-brain, many-forms" paradigm. Progress is hindered by fragmented data, inconsistent representations, and misaligned training objectives. We present ABot-M0, a framework that builds a systematic data curation pipeline while jointly optimizing model architecture and training strategies, enabling end-to-end transformation of heterogeneous raw data into unified, efficient representations. From six public datasets, we clean, standardize, and balance samples to construct UniACT-dataset, a large-scale dataset with over 6 million trajectories and 9,500 hours of data, covering diverse robot morphologies and task scenarios. Unified pre-training improves knowledge transfer and generalization across platforms and tasks, supporting general-purpose embodied intelligence. To improve action prediction efficiency and stability, we propose the Action Manifold Hypothesis: effective robot actions lie not in the full high-dimensional space but on a low-dimensional, smooth manifold governed by physical laws and task constraints. Based on this, we introduce Action Manifold Learning (AML), which uses a DiT backbone to predict clean, continuous action sequences directly. This shifts learning from denoising to projection onto feasible manifolds, improving decoding speed and policy stability. ABot-M0 supports modular perception via a dual-stream mechanism that integrates VLM semantics with geometric priors and multi-view inputs from plug-and-play 3D modules such as VGGT and Qwen-Image-Edit, enhancing spatial understanding without modifying the backbone and mitigating standard VLM limitations in 3D reasoning. Experiments show components operate independently with additive benefits. We will release all code and pipelines for reproducibility and future research.
90. 【2602.11214】DD-MDN: Human Trajectory Forecasting with Diffusion-Based Dual Mixture Density Networks and Uncertainty Self-Calibration
Link: https://arxiv.org/abs/2602.11214
Authors: Manuel Hetzel,Kerim Turacan,Hannes Reichert,Konrad Doll,Bernhard Sick
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: Human Trajectory Forecasting, predicts future human, future human movements, Smart Surveillance, Autonomous Driving
Comments:
Click to view abstract
Abstract:Human Trajectory Forecasting (HTF) predicts future human movements from past trajectories and environmental context, with applications in Autonomous Driving, Smart Surveillance, and Human-Robot Interaction. While prior work has focused on accuracy, social interaction modeling, and diversity, little attention has been paid to uncertainty modeling, calibration, and forecasts from short observation periods, which are crucial for downstream tasks such as path planning and collision avoidance. We propose DD-MDN, an end-to-end probabilistic HTF model that combines high positional accuracy, calibrated uncertainty, and robustness to short observations. Using a few-shot denoising diffusion backbone and a dual mixture density network, our method learns self-calibrated residence areas and probability-ranked anchor paths, from which diverse trajectory hypotheses are derived, without predefined anchors or endpoints. Experiments on the ETH/UCY, SDD, inD, and IMPTC datasets demonstrate state-of-the-art accuracy, robustness at short observation intervals, and reliable uncertainty modeling. The code is available at: this https URL.
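As background for the mixture density network mentioned in this abstract, the following is a minimal sketch of the standard negative log-likelihood of a one-dimensional Gaussian mixture, the usual training objective for mixture density networks. It is not the DD-MDN loss, which the abstract does not specify; the function names `gaussian_log_pdf` and `mixture_nll` and the example values are illustrative assumptions.

```python
import math


def gaussian_log_pdf(y, mean, std):
    """Log density of a 1D Gaussian N(mean, std^2) evaluated at y."""
    return (-0.5 * math.log(2.0 * math.pi)
            - math.log(std)
            - 0.5 * ((y - mean) / std) ** 2)


def mixture_nll(y, weights, means, stds):
    """Negative log-likelihood of y under a 1D Gaussian mixture.

    weights must be positive and sum to 1; stds must be positive.
    (Hypothetical helper for illustration, not taken from the paper.)
    """
    log_terms = [
        math.log(w) + gaussian_log_pdf(y, m, s)
        for w, m, s in zip(weights, means, stds)
    ]
    # Log-sum-exp with the maximum factored out for numerical stability.
    top = max(log_terms)
    return -(top + math.log(sum(math.exp(t - top) for t in log_terms)))


# A target near the first component is far more likely than one between modes.
near_mode = mixture_nll(0.1, [0.7, 0.3], [0.0, 5.0], [1.0, 1.0])
between_modes = mixture_nll(2.5, [0.7, 0.3], [0.0, 5.0], [1.0, 1.0])
print(near_mode < between_modes)
```

In a mixture density network the weights, means, and standard deviations would be network outputs rather than constants; they are fixed here only to keep the sketch self-contained.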
91. 【2602.11206】UltraLIF: Fully Differentiable Spiking Neural Networks via Ultradiscretization and Max-Plus Algebra
Link: https://arxiv.org/abs/2602.11206
Authors: Jose Marie Antonio Miñoza
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Rings and Algebras (math.RA); Neurons and Cognition (q-bio.NC)
Keywords: Spiking Neural Networks, biologically plausible computation, non-differentiable spike generation, Spiking Neural, Neural Networks
Comments:
Click to view abstract
Abstract:Spiking Neural Networks (SNNs) offer energy-efficient, biologically plausible computation but suffer from non-differentiable spike generation, necessitating reliance on heuristic surrogate gradients. This paper introduces UltraLIF, a principled framework that replaces surrogate gradients with ultradiscretization, a mathematical formalism from tropical geometry providing continuous relaxations of discrete dynamics. The central insight is that the max-plus semiring underlying ultradiscretization naturally models neural threshold dynamics: the log-sum-exp function serves as a differentiable soft-maximum that converges to hard thresholding as a learnable temperature parameter $\varepsilon \to 0$. Two neuron models are derived from distinct dynamical systems: UltraLIF from the LIF ordinary differential equation (temporal dynamics) and UltraDLIF from the diffusion equation modeling gap junction coupling across neuronal populations (spatial dynamics). Both yield fully differentiable SNNs trainable via standard backpropagation with no forward-backward mismatch. Theoretical analysis establishes pointwise convergence to classical LIF dynamics with quantitative error bounds and bounded non-vanishing gradients. Experiments on six benchmarks spanning static images, neuromorphic vision, and audio demonstrate improvements over surrogate gradient baselines, with gains most pronounced in single-timestep ($T = 1$) settings on neuromorphic and temporal datasets. An optional sparsity penalty enables significant energy reduction while maintaining competitive accuracy.
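The soft-maximum described in this abstract can be illustrated with the temperature-scaled log-sum-exp, which satisfies max(x) ≤ ε·log Σ exp(x_i/ε) ≤ max(x) + ε·log n and therefore approaches the hard maximum as ε → 0. The sketch below shows only this general identity, not the UltraLIF neuron model; the function name `soft_max` and the sample inputs are assumptions made for illustration.

```python
import math


def soft_max(values, eps):
    """Temperature-scaled log-sum-exp: eps * log(sum(exp(v / eps))).

    Hypothetical helper: a smooth upper bound on max(values) that
    tightens as eps shrinks.
    """
    m = max(values)
    # Factor out the maximum so every exponent is <= 0 (no overflow).
    return m + eps * math.log(sum(math.exp((v - m) / eps) for v in values))


inputs = [1.0, 2.0, 3.0]
for eps in (1.0, 0.1, 0.001):
    gap = soft_max(inputs, eps) - max(inputs)
    # The gap is never negative and never exceeds eps * log(n).
    print(eps, 0.0 <= gap <= eps * math.log(len(inputs)) + 1e-12)
```

Because the gap to the true maximum is at most ε·log n, the temperature directly controls how closely the smooth function tracks hard thresholding.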
92. 【2602.11186】GAC-KAN: An Ultra-Lightweight GNSS Interference Classifier for GenAI-Powered Consumer Edge Devices
Link: https://arxiv.org/abs/2602.11186
Authors: Zhihan Zeng,Kaihe Wang,Zhongpei Zhang,Yue Xiu
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Uncrewed Aerial Vehicles, autonomous Uncrewed Aerial, revolutionized user experiences, Consumer Electronics, Aerial Vehicles
Comments:
Click to view abstract
Abstract:The integration of Generative AI (GenAI) into Consumer Electronics (CE), from AI-powered assistants in wearables to generative planning in autonomous Uncrewed Aerial Vehicles (UAVs), has revolutionized user experiences. However, these GenAI applications impose immense computational burdens on edge hardware, leaving strictly limited resources for fundamental security tasks like Global Navigation Satellite System (GNSS) signal protection. Furthermore, training robust classifiers for such devices is hindered by the scarcity of real-world interference data. To address the dual challenges of data scarcity and the extreme efficiency required by the GenAI era, this paper proposes a novel framework named GAC-KAN. First, we adopt a physics-guided simulation approach to synthesize a large-scale, high-fidelity jamming dataset, mitigating the data bottleneck. Second, to reconcile high accuracy with the stringent resource constraints of GenAI-native chips, we design a Multi-Scale Ghost-ACB-Coordinate (MS-GAC) backbone. This backbone combines Asymmetric Convolution Blocks (ACB) and Ghost modules to extract rich spectral-temporal features with minimal redundancy. Replacing the traditional Multi-Layer Perceptron (MLP) decision head, we introduce a Kolmogorov-Arnold Network (KAN), which employs learnable spline activation functions to achieve superior non-linear mapping capabilities with significantly fewer parameters. Experimental results demonstrate that GAC-KAN achieves an overall accuracy of 98.0%, outperforming state-of-the-art baselines. Significantly, the model contains only 0.13 million parameters, approximately 660 times fewer than Vision Transformer (ViT) baselines. This extreme lightweight characteristic makes GAC-KAN an ideal "always-on" security companion, ensuring GNSS reliability without contending for the computational resources required by primary GenAI tasks.
93. 【2602.11183】Mitigating Error Accumulation in Continuous Navigation via Memory-Augmented Kalman Filtering
Link: https://arxiv.org/abs/2602.11183
Authors: Yin Tang,Jiawei Ma,Jinrui Zhang,Alex Jinpeng Wang,Deyu Zhang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Keywords: Unmanned Aerial Vehicle, Aerial Vehicle, Unmanned Aerial, critical for Unmanned, Continuous navigation
Comments: Preprint, 15 pages, 6 figures
Click to view abstract
Abstract:Continuous navigation in complex environments is critical for Unmanned Aerial Vehicles (UAVs). However, existing Vision-Language Navigation (VLN) models follow a dead-reckoning scheme, which iteratively updates the position for the next waypoint prediction and subsequently constructs the complete trajectory. This stepwise procedure inevitably accumulates position errors over time, resulting in misalignment between the internal belief and the objective coordinates, which is known as "state drift" and ultimately compromises the full trajectory prediction. Drawing inspiration from classical control theory, we propose to correct these errors by formulating the sequential prediction as a recursive Bayesian state estimation problem. In this paper, we design NeuroKalman, a novel framework that decouples navigation into two complementary processes: a Prior Prediction, based on motion dynamics, and a Likelihood Correction, based on historical observations. We first mathematically associate Kernel Density Estimation of the measurement likelihood with the attention-based retrieval mechanism, which then allows the system to rectify the latent representation using retrieved historical anchors without gradient updates. Comprehensive experiments on the TravelUAV benchmark demonstrate that, with fine-tuning on only 10% of the training data, our method clearly outperforms strong baselines and regulates drift accumulation.
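The predict-then-correct structure of recursive Bayesian estimation referenced in this abstract can be sketched with the textbook one-dimensional Kalman filter. This is the classical filter only, not NeuroKalman, whose attention-based correction the abstract does not detail; the function name `kalman_step` and all numeric values are illustrative assumptions.

```python
def kalman_step(mean, var, motion, motion_var, measurement, measurement_var):
    """One predict/correct cycle of a scalar Kalman filter.

    Hypothetical helper for illustration; the state is a single position.
    """
    # Prior prediction: apply the motion model; uncertainty grows.
    prior_mean = mean + motion
    prior_var = var + motion_var

    # Likelihood correction: blend in the measurement by relative confidence.
    gain = prior_var / (prior_var + measurement_var)
    post_mean = prior_mean + gain * (measurement - prior_mean)
    post_var = (1.0 - gain) * prior_var
    return post_mean, post_var


# Dead reckoning alone would keep adding motion_var at every step;
# the correction step pulls the variance back down.
mean, var = 0.0, 1.0
for observed in (1.2, 2.1, 2.9):
    mean, var = kalman_step(mean, var, 1.0, 0.5, observed, 0.5)
print(mean, var)
```

Skipping the correction step reproduces the drift the abstract describes: the variance would then grow without bound as steps accumulate.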
94. 【2602.10619】Improving Medical Visual Reinforcement Fine-Tuning via Perception and Reasoning Augmentation
Link: https://arxiv.org/abs/2602.10619
Authors: Guangjing Yang,ZhangYuan Yu,Ziyuan Qin,Xinyuan Song,Huahui Yi,Qingbo Kang,Jun Gao,Yiyue Li,Chenlin Du,Qicheng Lao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: remains largely underexplored, enable effective post-training, vision-centric domains remains, domains remains largely, rule-based reward schemes
Comments: CPAL 2026
Click to view abstract
Abstract:While recent advances in Reinforcement Fine-Tuning (RFT) have shown that rule-based reward schemes can enable effective post-training for large language models, their extension to cross-modal, vision-centric domains remains largely underexplored. This limitation is especially pronounced in the medical imaging domain, where effective performance requires both robust visual perception and structured reasoning. In this work, we address this gap by proposing VRFT-Aug, a visual reinforcement fine-tuning framework tailored for the medical domain. VRFT-Aug introduces a series of training strategies designed to augment both perception and reasoning, including prior knowledge injection, perception-driven policy refinement, medically informed reward shaping, and behavioral imitation. Together, these methods aim to stabilize and improve the RFT process. Through extensive experiments across multiple medical datasets, we show that our approaches consistently outperform both standard supervised fine-tuning and RFT baselines. Moreover, we provide empirically grounded insights and practical training heuristics that can be generalized to other medical image tasks. We hope this work contributes actionable guidance and fresh inspiration for the ongoing effort to develop reliable, reasoning-capable models for high-stakes medical applications.
95. 【2602.11969】UPDA: Unsupervised Progressive Domain Adaptation for No-Reference Point Cloud Quality Assessment
Link: https://arxiv.org/abs/2602.11969
Authors: Bingxu Xie,Fang Zhou,Jincan Wu,Yonghui Liu,Weiqing Li,Zhiyong Su
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Keywords: no-reference point cloud, achieved significant progress, distribution gap exists, cloud quality assessment, point cloud quality
Comments: to be published in IEEE Transactions on Broadcasting
Click to view abstract
Abstract:While no-reference point cloud quality assessment (NR-PCQA) approaches have achieved significant progress over the past decade, their performance often degrades substantially when a distribution gap exists between the training (source domain) and testing (target domain) data. However, to date, limited attention has been paid to transferring NR-PCQA models across domains. To address this challenge, we propose the first unsupervised progressive domain adaptation (UPDA) framework for NR-PCQA, which introduces a two-stage coarse-to-fine alignment paradigm to address domain shifts. At the coarse-grained stage, a discrepancy-aware coarse-grained alignment method is designed to capture relative quality relationships between cross-domain samples through a novel quality-discrepancy-aware hybrid loss, circumventing the challenges of direct absolute feature alignment. At the fine-grained stage, a perception fusion fine-grained alignment approach with symmetric feature fusion is developed to identify domain-invariant features, while a conditional discriminator selectively enhances the transfer of quality-relevant features. Extensive experiments demonstrate that the proposed UPDA effectively enhances the performance of NR-PCQA methods in cross-domain scenarios, validating its practical applicability. The code is available at this https URL.
96. 【2602.11903】Learning Perceptual Representations for Gaming NR-VQA with Multi-Task FR Signals
Link: https://arxiv.org/abs/2602.11903
Authors: Yu-Chih Chen,Michael Wang,Chieh-Dun Wen,Kai-Siang Ma,Avinab Saha,Li-Heng Chen,Alan Bovik
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Keywords: including fast motion, No-reference video quality, unique content characteristics, content characteristics including, characteristics including fast
Comments: 6 pages, 2 figures
Click to view abstract
Abstract:No-reference video quality assessment (NR-VQA) for gaming videos is challenging due to limited human-rated datasets and unique content characteristics including fast motion, stylized graphics, and compression artifacts. We present MTL-VQA, a multi-task learning framework that uses full-reference metrics as supervisory signals to learn perceptually meaningful features without human labels for pretraining. By jointly optimizing multiple full-reference (FR) objectives with adaptive task weighting, our approach learns shared representations that transfer effectively to NR-VQA. Experiments on gaming video datasets show MTL-VQA achieves performance competitive with state-of-the-art NR-VQA methods across both MOS-supervised and label-efficient/self-supervised settings.
97. 【2602.11704】U-DAVI: Uncertainty-Aware Diffusion-Prior-Based Amortized Variational Inference for Image Reconstruction
Link: https://arxiv.org/abs/2602.11704
Authors: Ayush Varshney,Katherine L. Bouman,Berthy T. Feng
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Ill-posed imaging inverse, imaging inverse problems, inverse problems remain, problems remain challenging, remain challenging due
Comments: Accepted at ICASSP 2026
Click to view abstract
Abstract:Ill-posed imaging inverse problems remain challenging due to the ambiguity in mapping degraded observations to clean images. Diffusion-based generative priors have recently shown promise, but typically rely on computationally intensive iterative sampling or per-instance optimization. Amortized variational inference frameworks address this inefficiency by learning a direct mapping from measurements to posteriors, enabling fast posterior sampling without requiring the optimization of a new posterior for every new set of measurements. However, they still struggle to reconstruct fine details and complex textures. To address this, we extend the amortized framework by injecting spatially adaptive perturbations to measurements during training, guided by uncertainty estimates, to emphasize learning in the most uncertain regions. Experiments on deblurring and super-resolution demonstrate that our method achieves superior or competitive performance to previous diffusion-based approaches, delivering more realistic reconstructions without the computational cost of iterative refinement.
98. 【2602.11197】Hybrid operator learning of wave scattering maps in high-contrast media
Link: https://arxiv.org/abs/2602.11197
Authors: Advait Balaji,Trevor Teolis,S. David Mis,Jose Antonio Lara Benitez,Chao Wang,Maarten V. de Hoop
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Surrogate modeling, wave field map, field map, imaging and inversion, speed and source
Comments:
Click to view abstract
Abstract:Surrogate modeling of wave propagation and scattering (i.e., the map from wave speed and source to wave field) in heterogeneous media has significant potential in applications such as seismic imaging and inversion. High-contrast settings, such as subsurface models with salt bodies, exhibit strong scattering and phase sensitivity that challenge existing neural operators. We propose a hybrid architecture that decomposes the scattering operator into two separate contributions: a smooth background propagation and a high-contrast scattering correction. The smooth component is learned with a Fourier Neural Operator (FNO), which produces globally coupled feature tokens encoding background wave propagation; these tokens are then passed to a vision transformer, where attention is used to model the high-contrast scattering correction dominated by strong spatial interactions. Evaluated on high-frequency Helmholtz problems with strong contrasts, the hybrid model achieves substantially improved phase and amplitude accuracy compared to standalone FNOs or transformers, with favorable accuracy-parameter scaling.




