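This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.
Statistics
644 papers were updated today, including:
- Natural language processing: 82 papers
- Information retrieval: 35 papers
- Computer vision: 139 papers
Natural Language Processing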
1. 【2602.23351】Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning
Link: https://arxiv.org/abs/2602.23351
Authors: Amita Kamath, Jack Hessel, Khyathi Chandu, Jena D. Hwang, Kai-Wei Chang, Ranjay Krishna
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: research discourse, forefront of research, reporting bias, Vision-Language Models, training data
Comments: TACL 2026
Abstract:The lack of reasoning capabilities in Vision-Language Models (VLMs) has remained at the forefront of research discourse. We posit that this behavior stems from a reporting bias in their training data. That is, how people communicate about visual content by default omits tacit information needed to supervise some types of reasoning; e.g., "at the game today!" is a more likely caption than "a photo of 37 people standing behind a field". We investigate the data underlying the popular VLMs OpenCLIP, LLaVA-1.5 and Molmo through the lens of theories from pragmatics, and find that reporting bias results in insufficient representation of four reasoning skills (spatial, temporal, negation, and counting), despite the corpora being of web-scale, and/or synthetically generated. With a set of curated benchmarks, we demonstrate that: (i) VLMs perform poorly on the aforementioned types of reasoning suppressed in the training data by reporting bias; (ii) contrary to popular belief, scaling data size, model size, and to multiple languages does not result in emergence of these skills by default; but, promisingly, (iii) incorporating annotations specifically collected to obtain tacit information is effective. Our findings highlight the need for more intentional training data curation methods, rather than counting on scale for emergence of reasoning capabilities.
2. 【2602.23329】LLM Novice Uplift on Dual-Use, In Silico Biology Tasks
Link: https://arxiv.org/abs/2602.23329
Authors: Chen Bo Calvin Zhang, Christina Q. Knight, Nicholas Kruus, Jason Hausenloy, Pedro Medeiros, Nathaniel Li, Aiden Kim, Yury Orlovskiy, Coleman Breen, Bryce Cai, Jasper Götting, Andrew Bo Liu, Samira Nedungadi, Paula Rodriguez, Yannis Yiming He, Mohamed Shaaban, Zifan Wang, Seth Donoughe, Julian Michael
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Keywords: Large language models, Large language, perform increasingly, language models, remains unclear
Comments: 59 pages, 33 figures
Abstract:Large language models (LLMs) perform increasingly well on biology benchmarks, but it remains unclear whether they uplift novice users -- i.e., enable humans to perform better than with internet-only resources. This uncertainty is central to understanding both scientific acceleration and dual-use risk. We conducted a multi-model, multi-benchmark human uplift study comparing novices with LLM access versus internet-only access across eight biosecurity-relevant task sets. Participants worked on complex problems with ample time (up to 13 hours for the most involved tasks). We found that LLM access provided substantial uplift: novices with LLMs were 4.16 times more accurate than controls (95% CI [2.63, 6.87]). On four benchmarks with available expert baselines (internet-only), novices with LLMs outperformed experts on three of them. Perhaps surprisingly, standalone LLMs often exceeded LLM-assisted novices, indicating that users were not eliciting the strongest available contributions from the LLMs. Most participants (89.6%) reported little difficulty obtaining dual-use-relevant information despite safeguards. Overall, LLMs substantially uplift novices on biological tasks previously reserved for trained practitioners, underscoring the need for sustained, interactive uplift evaluations alongside traditional benchmarks.
3. 【2602.23300】A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations
Link: https://arxiv.org/abs/2602.23300
Authors: Soumya Dutta, Smruthi Balaji, Sriram Ganapathy
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Keywords: presents unique challenges, effectively integrate cues, Emotion Recognition, Recognition in Conversations, presents unique
Comments: Accepted to Elsevier Computer Speech and Language. 30 pages, 9 figures, 5 tables
Abstract:Emotion Recognition in Conversations (ERC) presents unique challenges, requiring models to capture the temporal flow of multi-turn dialogues and to effectively integrate cues from multiple modalities. We propose Mixture of Speech-Text Experts for Recognition of Emotions (MiSTER-E), a modular Mixture-of-Experts (MoE) framework designed to decouple two core challenges in ERC: modality-specific context modeling and multimodal information fusion. MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings, which are then enhanced through a convolutional-recurrent context modeling layer. The system integrates predictions from three experts (speech-only, text-only, and cross-modal) using a learned gating mechanism that dynamically weighs their outputs. To further encourage consistency and alignment across modalities, we introduce a supervised contrastive loss between paired speech-text representations and a KL-divergence-based regularization across expert predictions. Importantly, MiSTER-E does not rely on speaker identity at any stage. Experiments on three benchmark datasets (IEMOCAP, MELD, and MOSI) show that our proposal achieves 70.9%, 69.5%, and 87.9% weighted F1-scores respectively, outperforming several baseline speech-text ERC systems. We also provide various ablations to highlight the contributions made in the proposed approach.
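The gating step described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the three-class setup, and every number are assumptions made for illustration, and in a real model the gate logits would come from a learned network rather than constants.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gated_mixture(expert_probs, gate_logits):
    """Combine per-expert class distributions using softmax gate weights."""
    weights = softmax(gate_logits)
    n_classes = len(expert_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, expert_probs))
        for c in range(n_classes)
    ]

# Hypothetical outputs of the speech-only, text-only, and cross-modal experts
# over three emotion classes (made-up numbers).
speech = [0.6, 0.3, 0.1]
text = [0.2, 0.7, 0.1]
cross = [0.3, 0.5, 0.2]
gate_logits = [0.0, 1.0, 0.5]  # assumed constants; learned in a real model

mixed = gated_mixture([speech, text, cross], gate_logits)
print(mixed)
```

Because the gate weights sum to one, the mixed output stays a valid probability distribution whichever expert the gate favours.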
4. 【2602.23286】SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables
Link: https://arxiv.org/abs/2602.23286
Authors: Sungho Park, Jueun Kim, Wook-Shin Han
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)
Keywords: Real-world Table-Text question, executing complex operations, traversing multiple hops, tasks require models, Table-Text question answering
Comments: 10 pages, 5 figures. Published as a conference paper at ICLR 2026. Project page: https://sparta-projectpage.github.io/
Abstract:Real-world Table-Text question answering (QA) tasks require models that can reason across long text and source tables, traversing multiple hops and executing complex operations such as aggregation. Yet existing benchmarks are small, manually curated - and therefore error-prone - and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in natural-language queries. We present SPARTA, an end-to-end construction framework that automatically generates large-scale Table-Text QA benchmarks with lightweight human validation, requiring only one quarter of the annotation time of HybridQA. The framework first constructs a reference fact database by enriching each source table with grounding tables whose tuples are atomic facts automatically extracted from the accompanying unstructured passages, then synthesizes nested queries whose number of nested predicates matches the desired hop count. To ensure that every SQL statement is executable and that its verbalization yields a fluent, human-sounding question, we propose two novel techniques: provenance-based refinement, which rewrites any syntactically valid query that returns a non-empty result, and realistic-structure enforcement, which confines generation to post-order traversals of the query graph. The resulting pipeline produces thousands of high-fidelity question-answer pairs covering aggregations, grouping, and deep multi-hop reasoning across text and tables. On SPARTA, state-of-the-art models that reach over 70 F1 on HybridQA or over 50 F1 on OTT-QA drop by more than 30 F1 points, exposing fundamental weaknesses in current cross-modal reasoning. Our benchmark, construction code, and baseline models are available at this https URL.
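To make the idea of "nested queries whose number of nested predicates matches the desired hop count" concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The tables, column names, and rows are hypothetical and do not come from SPARTA; the query has two nested predicates, standing in for a two-hop question such as "What is the population of the city where Ann's team is based?".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE player (name TEXT, team TEXT);             -- hypothetical source table
CREATE TABLE team_fact (team TEXT, city TEXT);          -- hypothetical grounding table (facts from text)
CREATE TABLE city_fact (city TEXT, population INTEGER); -- hypothetical grounding table
INSERT INTO player VALUES ('Ann', 'Comets'), ('Bo', 'Owls');
INSERT INTO team_fact VALUES ('Comets', 'Springfield'), ('Owls', 'Shelbyville');
INSERT INTO city_fact VALUES ('Springfield', 120000), ('Shelbyville', 80000);
""")

# Two nested IN-predicates: player -> team -> city.
query = """
SELECT population FROM city_fact
WHERE city IN (
    SELECT city FROM team_fact
    WHERE team IN (
        SELECT team FROM player WHERE name = ?
    )
)
"""
rows = conn.execute(query, ("Ann",)).fetchall()
print(rows)  # [(120000,)]
```

Each additional level of nesting adds one more lookup that a model must chain when answering the verbalized question.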
5. 【2602.23266】Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems
Link: https://arxiv.org/abs/2602.23266
Authors: Siyuan Liu, Jiahui Xu, Feng Jiang, Kuang Wang, Zefeng Zhao, Chu-Ren Huang, Jinghang Gu, Changqing Yin, Haizhou Li
Subjects: Computation and Language (cs.CL)
Keywords: Achieving human-like responsiveness, Achieving human-like, human-like responsiveness, critical yet challenging, challenging goal
Abstract:Achieving human-like responsiveness is a critical yet challenging goal for cascaded spoken dialogue systems. Conventional ASR-LLM-TTS pipelines follow a strictly sequential paradigm, requiring complete transcription and full reasoning before speech synthesis can begin, which results in high response latency. We propose the Discourse-Aware Dual-Track Streaming Response (DDTSR) framework, a low-latency architecture that enables listen-while-thinking and speak-while-thinking. DDTSR is built upon three key mechanisms: (1) connective-guided small-large model synergy, where an auxiliary small model generates minimal-committal discourse connectives while a large model performs knowledge-intensive reasoning in parallel; (2) streaming-based cross-modal collaboration, which dynamically overlaps ASR, LLM inference, and TTS to advance the earliest speakable moment; and (3) curriculum-learning-based discourse continuity enhancement, which maintains coherence and logical consistency between early responses and subsequent reasoning outputs. Experiments on two spoken dialogue benchmarks demonstrate that DDTSR reduces response latency by 19%-51% while preserving discourse quality. Further analysis shows that DDTSR functions as a plug-and-play module compatible with diverse LLM backbones, and remains robust across varying utterance lengths, indicating strong practicality and scalability for real-time spoken interaction.
6. 【2602.23258】AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning
Link: https://arxiv.org/abs/2602.23258
Authors: Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding, Miao Zhang, Min Zhang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: erroneous information generated, excel in complex, complex reasoning, individual participants, cascading impact
Abstract:While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information generated by individual participants. Current solutions often resort to rigid structural engineering or expensive fine-tuning, limiting their deployability and adaptability. We propose AgentDropoutV2, a test-time rectify-or-reject pruning framework designed to dynamically optimize MAS information flow without retraining. Our approach acts as an active firewall, intercepting agent outputs and employing a retrieval-augmented rectifier to iteratively correct errors based on a failure-driven indicator pool. This mechanism allows for the precise identification of potential errors using distilled failure patterns as prior knowledge. Irreparable outputs are subsequently pruned to prevent error propagation, while a fallback strategy preserves system integrity. Empirical results on extensive math benchmarks show that AgentDropoutV2 significantly boosts the MAS's task performance, achieving an average accuracy gain of 6.3 percentage points on math benchmarks. Furthermore, the system exhibits robust generalization and adaptivity, dynamically modulating rectification efforts based on task difficulty while leveraging context-aware indicators to resolve a wide spectrum of error patterns. Our code and dataset are released at this https URL.
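The rectify-or-reject control flow can be sketched as a small loop. This is a toy sketch under stated assumptions, not AgentDropoutV2 itself: the checker, the rectifier, and the arithmetic "messages" are all hypothetical stand-ins, and the retrieval-augmented indicator pool and the fallback strategy mentioned in the abstract are omitted.

```python
def rectify_or_reject(output, find_issue, rectify, max_rounds=3):
    """Return a repaired output, or None if it is still flagged after max_rounds."""
    for _ in range(max_rounds):
        issue = find_issue(output)
        if issue is None:
            return output  # passes the check, so it may be forwarded
        output = rectify(output, issue)
    # Final check after the last repair attempt; prune if still flagged.
    return output if find_issue(output) is None else None

# Hypothetical checker: messages are claims of the form "a+b=c".
def find_issue(claim):
    left, right = claim.split("=")
    a, b = left.split("+")
    expected = int(a) + int(b)
    return None if int(right) == expected else expected

# Hypothetical rectifier that rewrites the right-hand side.
def fix_sum(claim, expected):
    return claim.split("=")[0] + "=" + str(expected)

# Hypothetical rectifier that fails to repair anything.
def no_op(claim, expected):
    return claim

messages = ["2+2=4", "3+4=8"]
repaired = [rectify_or_reject(m, find_issue, fix_sum) for m in messages]
kept = [m for m in repaired if m is not None]
print(kept)  # ['2+2=4', '3+4=7']
```

With the `no_op` rectifier, the faulty message is never repaired and the function returns `None`, which corresponds to pruning it before it reaches other agents.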
7. 【2602.23225】Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?
Link: https://arxiv.org/abs/2602.23225
Authors: Pengxiang Li, Dilxat Muhtar, Lu Yin, Tianlong Chen, Shiwei Liu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Diffusion Language Models, Diffusion Language, Language Models, enabling parallel token, practical fast DLMs
Abstract:Diffusion Language Models (DLMs) are often advertised as enabling parallel token generation, yet practical fast DLMs frequently converge to left-to-right, autoregressive (AR)-like decoding dynamics. In contrast, genuinely non-AR generation is promising because it removes AR's sequential bottleneck, better exploiting parallel hardware to reduce synchronization/communication overhead and improve latency scaling with output length. We argue that a primary driver of AR-like decoding is a mismatch between DLM objectives and the highly sequential structure of widely used training data, including standard pretraining corpora and long chain-of-thought (CoT) supervision. Motivated by this diagnosis, we propose NAP (Non-Autoregressive Parallel DLMs), a proof-of-concept, data-centric approach that better aligns supervision with non-AR parallel decoding. NAP curates examples as multiple independent reasoning trajectories and couples them with a parallel-forced decoding strategy that encourages multi-token parallel updates. Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases. Our results suggest that revisiting data and supervision is a principled direction for mitigating AR-like behavior and moving toward genuinely non-autoregressive parallel generation in DLMs. Our code is available at this https URL.
8. 【2602.23200】InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
Link: https://arxiv.org/abs/2602.23200
Authors: Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: efficient long-sequence generation, large language models, Reducing the hardware, long-sequence generation, large language
Comments: 16 pages, 4 figures, 4 tables, 2 algorithms
Abstract:Reducing the hardware footprint of large language models (LLMs) during decoding is critical for efficient long-sequence generation. A key bottleneck is the key-value (KV) cache, whose size scales with sequence length and easily dominates the memory footprint of the model. Previous work proposed quantization methods that are focused on compressing the KV cache while maintaining its information. We introduce InnerQ, a hardware-aware KV-cache quantization scheme that lowers decode latency without sacrificing accuracy. InnerQ applies group-wise quantization while grouping the cache matrices over their inner dimension. Unlike previous work that groups over the outer dimension, InnerQ aligns dequantization with the vector-matrix multiplication and enables scale factor reuse across GPU compute units. This reduces memory accesses and accelerates dequantization, yielding up to $22\%$ speedup over previous work and up to $88\%$ over half-precision vector-matrix multiplication. To preserve fidelity under aggressive compression, InnerQ incorporates (i) hybrid quantization, selecting symmetric or asymmetric quantization per group based on local statistics; (ii) high-precision windows for both the most recent tokens and the attention sink tokens to mitigate outlier leakage; and (iii) per-channel normalization of the key cache, computed once during prefill and folded into the query to avoid runtime overhead. Our evaluation experiments on Llama models show that InnerQ maintains a few-shot GSM8K performance comparable to non-quantized KV caches and surpasses prior KV cache quantization methods.
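Group-wise quantization with a per-group choice between symmetric and asymmetric schemes can be sketched in plain Python. This is an illustration only, not InnerQ: it works on a one-dimensional list, so it does not model inner versus outer grouping or GPU kernels. The selection rule used here (keep whichever scheme has the lower reconstruction error) is an assumption, since the abstract says only that the choice is based on local statistics.

```python
def quantize_group(values, bits=4, symmetric=False):
    """Quantize one group to integer codes; return (codes, scale, zero_point)."""
    if symmetric:
        qmax = 2 ** (bits - 1) - 1
        scale = (max(abs(v) for v in values) / qmax) or 1.0
        codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
        return codes, scale, 0.0
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = ((hi - lo) / levels) or 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale + zero_point for c in codes]

def quantize_groupwise(row, group_size=4, bits=4):
    """Quantize a row in fixed-size groups, choosing a scheme per group."""
    groups = []
    for start in range(0, len(row), group_size):
        chunk = row[start:start + group_size]
        best = None
        for symmetric in (True, False):
            codes, scale, zero = quantize_group(chunk, bits, symmetric)
            recon = dequantize_group(codes, scale, zero)
            err = sum((a - b) ** 2 for a, b in zip(chunk, recon))
            # Assumed rule: keep the scheme with the lower squared error.
            if best is None or err < best[0]:
                best = (err, codes, scale, zero)
        groups.append(best[1:])
    return groups

row = [0.1, -0.4, 0.25, 0.9, 5.0, 5.2, 5.1, 4.9]  # made-up cache values
groups = quantize_groupwise(row)
restored = [v for g in groups for v in dequantize_group(*g)]
print(restored)
```

The second group sits far from zero, so an asymmetric scheme with a non-zero offset reconstructs it far more accurately than a symmetric one at the same bit width.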
9. 【2602.23197】Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models
Link: https://arxiv.org/abs/2602.23197
Authors: Chungpa Lee, Jy-yong Sohn, Kangwook Lee
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
Keywords: Transformer-based large language, Transformer-based large, large language models, language models exhibit, enabling adaptation
Abstract:Transformer-based large language models exhibit in-context learning, enabling adaptation to downstream tasks via few-shot prompting with demonstrations. In practice, such models are often fine-tuned to improve zero-shot performance on downstream tasks, allowing them to solve tasks without examples and thereby reducing inference costs. However, fine-tuning can degrade in-context learning, limiting the performance of fine-tuned models on tasks not seen during fine-tuning. Using linear attention models, we provide a theoretical analysis that characterizes how fine-tuning objectives modify attention parameters and identifies conditions under which this leads to degraded few-shot performance. We show that fine-tuning all attention parameters can harm in-context learning, whereas restricting updates to the value matrix improves zero-shot performance while preserving in-context learning. We further show that incorporating an auxiliary few-shot loss enhances in-context learning primarily on the target task, at the expense of degraded in-context learning ability on tasks not seen during fine-tuning. We empirically validate our theoretical results.
10. 【2602.23184】MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations
Link: https://arxiv.org/abs/2602.23184
Authors: Sara Rosenthal, Yannis Katsis, Vraj Shah, Lihong He, Lucian Popa, Marina Danilevsky
Subjects: Computation and Language (cs.CL)
Keywords: exploring open challenges, present MTRAG-UN, multi-turn retrieval augmented, large language models, exploring open
Comments: 5 pages, 3 figures
Abstract:We present MTRAG-UN, a benchmark for exploring open challenges in multi-turn retrieval augmented generation, a popular use of large language models. We release a benchmark of 666 tasks containing over 2,800 conversation turns across 6 domains with accompanying corpora. Our experiments show that retrieval and generation models continue to struggle on conversations with UNanswerable, UNderspecified, and NONstandalone questions and UNclear responses. Our benchmark is available at this https URL
11. 【2602.23163】A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
Link: https://arxiv.org/abs/2602.23163
Authors: Usman Anwar, Julianna Piskorz, David D. Baek, David Africa, Jim Weatherall, Max Tegmark, Christian Schroeder de Witt, Mihaela van der Schaar, David Krueger
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Information Theory (cs.IT); Multiagent Systems (cs.MA)
Keywords: Large language models, Large language, Large, language models, steganographic
Comments: First two authors contributed equally
Abstract:Large language models are beginning to show steganographic capabilities. Such capabilities could allow misaligned models to evade oversight mechanisms. Yet principled methods to detect and quantify such behaviours are lacking. Classical definitions of steganography, and detection methods based on them, require a known reference distribution of non-steganographic signals. For the case of steganographic reasoning in LLMs, knowing such a reference distribution is not feasible; this renders these approaches inapplicable. We propose an alternative, decision-theoretic view of steganography. Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content (present within a steganographic signal), and this otherwise latent asymmetry can be inferred from the agents' observable actions. To formalise this perspective, we introduce generalised $\mathcal{V}$-information: a utilitarian framework for measuring the amount of usable information within some input. We use this to define the steganographic gap -- a measure that quantifies steganography by comparing the downstream utility of the steganographic signal to agents that can and cannot decode the hidden content. We empirically validate our formalism, and show that it can be used to detect, quantify, and mitigate steganographic reasoning in LLMs.
12. 【2602.23136】Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs
Link: https://arxiv.org/abs/2602.23136
Authors: Jayadev Billa
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Multimodal LLMs, object texture, speaker voice, decoder, Multimodal
Comments: 22 pages, 11 tables, 2 figures. Code: https://github.com/jb1999/modality_collapse_paper
Abstract:Multimodal LLMs can process speech and images, but they cannot hear a speaker's voice or see an object's texture. We show this is not a failure of encoding: speaker identity, emotion, and visual attributes survive through every LLM layer (3--55$\times$ above chance in linear probes), yet removing 64--71% of modality-specific variance improves decoder loss. The decoder has no learned use for these directions; their presence is noise. We formalize this as a mismatched decoder problem: a decoder trained on text can only extract information along text-aligned directions. Accessible information is bounded by the Generalized Mutual Information (GMI), with degradation scaling with distributional distance and decoder sensitivity. The bound is a property of the decoder's scoring rule, not of any particular architecture; it applies whether non-text inputs arrive through a learned projection, a discrete codebook, or no explicit adapter at all. We validate this across five models spanning speech and vision. A controlled experiment (two Prismatic VLMs differing only in encoder text-alignment) confirms the bottleneck is the decoder's scoring rule, not the encoder or projection. A LoRA intervention demonstrates the fix: training with an emotion objective improves emotion accessibility ($+$7.5%) without affecting other attributes, confirming that the training objective determines what becomes accessible.
13. 【2602.23079】Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent
Link: https://arxiv.org/abs/2602.23079
Authors: Boyang Zhang, Yang Zhang
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Keywords: large language models, raising growing concerns, enabled powerful authorship, language models, raising growing
Abstract:The rapid advancement of large language models (LLMs) has enabled powerful authorship inference capabilities, raising growing concerns about unintended deanonymization risks in textual data such as news articles. In this work, we introduce an LLM agent designed to evaluate and mitigate such risks through a structured, interpretable pipeline. Central to our framework is the proposed $\textit{SALA}$ (Stylometry-Assisted LLM Analysis) method, which integrates quantitative stylometric features with LLM reasoning for robust and transparent authorship attribution. Experiments on large-scale news datasets demonstrate that $\textit{SALA}$, particularly when augmented with a database module, achieves high inference accuracy in various scenarios. Finally, we propose a guided recomposition strategy that leverages the agent's reasoning trace to generate rewriting prompts, effectively reducing authorship identifiability while preserving textual meaning. Our findings highlight both the deanonymization potential of LLM agents and the importance of interpretable, proactive defenses for safeguarding author privacy.
14. 【2602.23075】CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery
Link: https://arxiv.org/abs/2602.23075
Authors: Mengze Hong, Di Jiang, Chen Jason Zhang, Zichang Guo, Yawen Li, Jun Chen, Shaobo Cui, Zhiyang Su
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Large language models, Large language, language models, scholarly activities, challenges persist
Comments: Accepted by TheWebConf 2026 Demo Track
Abstract:Large language models (LLMs) have created new opportunities to enhance the efficiency of scholarly activities; however, challenges persist in the ethical deployment of AI assistance, including (1) the trustworthiness of AI-generated content, (2) preservation of academic integrity and intellectual property, and (3) protection of information privacy. In this work, we present CiteLLM, a specialized agentic platform designed to enable trustworthy reference discovery for grounding author-drafted claims and statements. The system introduces a novel interaction paradigm by embedding LLM utilities directly within the LaTeX editor environment, ensuring a seamless user experience and no data transmission outside the local system. To guarantee hallucination-free references, we employ dynamic discipline-aware routing to retrieve candidates exclusively from trusted web-based academic repositories, while leveraging LLMs solely for generating context-aware search queries, ranking candidates by relevance, and validating and explaining support through paragraph-level semantic matching and an integrated chatbot. Evaluation results demonstrate the superior performance of the proposed system in returning valid and highly usable references.
15. 【2602.23071】Quantity Convergence, Quality Divergence: Disentangling Fluency and Accuracy in L2 Mandarin Prosody
Link: https://arxiv.org/abs/2602.23071
Authors: Yuqi Shi, Hao Yang, Xiyao Lu, Jinsong Zhang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: prosodic structures remains, acquire target syntactic, syntactic word order, target syntactic word, Major Phrase
Abstract:While second language (L2) learners may acquire target syntactic word order, mapping this syntax onto appropriate prosodic structures remains a persistent challenge. This study investigates the fossilization and stability of the L2 syntax-prosody interface by comparing 67 native Mandarin speakers with 67 Vietnamese learners using the BLCU-SAIT corpus. By integrating C-ToBI boundary annotation with Dependency Grammar analysis, we examined both the quantity of prosodic boundaries and their mapping to syntactic relations. Results reveal a non-linear acquisition: although high-proficiency learners (VNH) converge to the native baseline in boundary quantity at the Major Phrase level (B3), their structural mapping significantly diverges. Specifically, VNH demote the prosodic boundary at the Subject-Verb (SBV) interface (Major Phrase B3 - Prosodic Word B1), while erroneously promoting the boundary at the Verb-Object (VOB) interface (Prosodic Word B1 - Major Phrase B3). This strategy allows learners to maintain high long phrasal output at the expense of structural accuracy. This results in a distorted prosodic hierarchy where the native pattern is inverted.
16. 【2602.23070】Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment
Link: https://arxiv.org/abs/2602.23070
Authors: Sanjid Hasan, Risalat Labib, A H M Fuad, Bayazid Hasan
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Keywords: Automatic Speech Recognition, critical research gaps, remain critical research, diarization remain critical, performing robust speaker
Comments: 4 pages, 2 figures
Abstract:Although Automatic Speech Recognition (ASR) in Bengali has seen significant progress, processing long-duration audio and performing robust speaker diarization remain critical research gaps. To address the severe scarcity of joint ASR and diarization resources for this language, we introduce Lipi-Ghor-882, a comprehensive 882-hour multi-speaker Bengali dataset. In this paper, detailing our submission to the DL Sprint 4.0 competition, we systematically evaluate various architectures and approaches for long-form Bengali speech. For ASR, we demonstrate that raw data scaling is ineffective; instead, targeted fine-tuning utilizing perfectly aligned annotations paired with synthetic acoustic degradation (noise and reverberation) emerges as the singular most effective approach. Conversely, for speaker diarization, we observed that global open-source state-of-the-art models (such as Diarizen) performed surprisingly poorly on this complex dataset. Extensive model retraining yielded negligible improvements; instead, strategic, heuristic post-processing of baseline model outputs proved to be the primary driver for increasing accuracy. Ultimately, this work outlines a highly optimized dual pipeline achieving a $\sim$0.019 Real-Time Factor (RTF), establishing a practical, empirically backed benchmark for low-resource, long-form speech processing.
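The Real-Time Factor quoted in this abstract is, by the usual definition, processing time divided by audio duration. The short sketch below works through what an RTF of about 0.019 implies for one hour of audio; the function names are illustrative and the one-hour clip is an assumed example, not a figure from the paper.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration; below 1 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds

def processing_time(audio_seconds, rtf):
    """Estimated processing time for a clip at a given RTF."""
    return audio_seconds * rtf

one_hour = 3600.0  # assumed example duration in seconds
seconds_needed = processing_time(one_hour, 0.019)
print(round(seconds_needed, 1))  # 68.4
```

At that rate, an hour of speech would take a little over a minute to process.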
17. 【2602.23062】Toward Automatic Filling of Case Report Forms: A Case Study on Data from an Italian Emergency Department
Link: https://arxiv.org/abs/2602.23062
Authors: Gabriela Anna Kaczmarek, Pietro Ferrazzi, Lorenzo Porta, Vicky Rubini, Bernardo Magnini
Subjects: Computation and Language (cs.CL)
Keywords: Case Report Forms, Large Language Models, Case Report, Report Forms, core of well-established
Abstract:Case Report Forms (CRFs) collect data about patients and are at the core of well-established practices to conduct research in clinical settings. With the recent progress of language technologies, there is an increasing interest in automatic CRF-filling from clinical notes, mostly based on the use of Large Language Models (LLMs). However, there is a general scarcity of annotated CRF data, both for training and testing LLMs, which limits the progress on this task. As a step in the direction of providing such data, we present a new dataset of clinical notes from an Italian Emergency Department annotated with respect to a pre-defined CRF containing 134 items to be filled. We provide an analysis of the data, define the CRF-filling task and metric for its evaluation, and report on pilot experiments where we use an open-source state-of-the-art LLM to automatically execute the task. Results of the case-study show that (i) CRF-filling from real clinical notes in Italian can be approached in a zero-shot setting; (ii) LLMs' results are affected by biases (e.g., a cautious behaviour favours "unknown" answers), which need to be corrected.
18. 【2602.23061】MoDora: Tree-Based Semi-Structured Document Analysis System
Link: https://arxiv.org/abs/2602.23061
Authors: Bangrui Xu, Qihang Yao, Zirui Tang, Xuanhe Zhou, Yeye He, Shihan Yu, Qianqian Xu, Bin Wang, Guoliang Li, Conghui He, Fan Wu
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG)
Keywords: integrate diverse interleaved, diverse interleaved data, documents integrate diverse, interleaved data elements, Semi-structured documents integrate
Comments: Extension of our SIGMOD 2026 paper. Please refer to source code available at https://github.com/weAIDB/MoDora
Abstract:Semi-structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irregular layouts. These documents are widely observed across domains and account for a large portion of real-world data. However, existing methods struggle to support natural language question answering over these documents due to three main technical challenges: (1) The elements extracted by techniques like OCR are often fragmented and stripped of their original semantic context, making them inadequate for analysis. (2) Existing approaches lack effective representations to capture hierarchical structures within documents (e.g., associating tables with nested chapter titles) and to preserve layout-specific distinctions (e.g., differentiating sidebars from main content). (3) Answering questions often requires retrieving and aligning relevant information scattered across multiple regions or pages, such as linking a descriptive paragraph to table cells located elsewhere in the document. To address these issues, we propose MoDora, an LLM-powered system for semi-structured document analysis. First, we adopt a local-alignment aggregation strategy to convert OCR-parsed elements into layout-aware components, and conduct type-specific information extraction for components with hierarchical titles or non-text elements. Second, we design the Component-Correlation Tree (CCTree) to hierarchically organize components, explicitly modeling inter-component relations and layout distinctions through a bottom-up cascade summarization process. Finally, we propose a question-type-aware retrieval strategy that supports (1) layout-based grid partitioning for location-based retrieval and (2) LLM-guided pruning for semantic-based retrieval. Experiments show MoDora outperforms baselines by 5.97%-61.07% in accuracy. The code is at this https URL.
19. 【2602.23057】Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention
Link: https://arxiv.org/abs/2602.23057
Authors: Jeongin Bae, Baeseong Park, Gunho Park, Minsub Kim, Joonhyung Lee, Junhee Yoo, Sunghyeon Woo, Jiwon Ryu, Se Jung Kwon, Dongsoo Lee
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: unit sum normalization, attention, enforces attention weights, typically implemented, unit sum
Comments: Preprint. 14 pages, 11 figures
Abstract:Transformer attention is typically implemented using softmax normalization, which enforces attention weights with unit sum normalization. While effective in many settings, this constraint can limit flexibility in controlling attention magnitudes and may contribute to overly concentrated or unstable attention patterns during training. Prior work has explored modifications such as attention sinks or gating mechanisms, but these approaches provide only limited or indirect control over attention reweighting. We propose Affine-Scaled Attention, a simple extension to standard attention that introduces input-dependent scaling and a corresponding bias term applied to softmax-normalized attention weights. This design relaxes the strict normalization constraint while maintaining aggregation of value representations, allowing the model to adjust both the relative distribution and the scale of attention in a controlled manner. We empirically evaluate Affine-Scaled Attention in large-scale language model pretraining across multiple model sizes. Experimental results show consistent improvements in training stability, optimization behavior, and downstream task performance compared to standard softmax attention and attention sink baselines. These findings suggest that modest reweighting of attention outputs provides a practical and effective way to improve attention behavior in Transformer models.
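The abstract above does not give the exact formula, so the following is only a rough sketch of the general idea: softmax weights are multiplied by an input-dependent scale and shifted by an input-dependent bias before aggregating the values. The parameterization here (a sigmoid-based scale in (0, 2), a tanh-based bias divided by the number of keys, and the hypothetical projection vectors `w_scale` and `w_bias`) is an assumption made for illustration and is not taken from the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine_scaled_attention(query, keys, values, w_scale, w_bias):
    """Single-query attention with an affine transform on the softmax weights.

    ASSUMPTION: alpha and beta are computed from the query with two
    hypothetical projection vectors; the paper may define them differently.
    """
    d = len(query)
    scores = [dot(query, k) / math.sqrt(d) for k in keys]
    probs = softmax(scores)                            # standard weights, sum to 1
    alpha = 2.0 * sigmoid(dot(w_scale, query))         # input-dependent scale in (0, 2)
    beta = math.tanh(dot(w_bias, query)) / len(keys)   # input-dependent bias
    weights = [alpha * p + beta for p in probs]        # no longer forced to sum to 1
    out = [sum(w * v[j] for w, v in zip(weights, values))
           for j in range(len(values[0]))]
    return weights, out


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Zero projections give alpha = 1 and beta = 0, i.e. plain softmax attention.
plain_weights, _ = affine_scaled_attention(query, keys, values, [0.0, 0.0], [0.0, 0.0])
scaled_weights, _ = affine_scaled_attention(query, keys, values, [1.0, 0.0], [0.5, 0.0])
print(sum(plain_weights), sum(scaled_weights))
```

In this sketch the weights sum to `alpha + tanh(w_bias . query)` rather than to 1, which is one way to read the "relaxed normalization constraint" the abstract describes.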
20. 【2602.22958】Frequency-Ordered Tokenization for Better Text Compression
Link: https://arxiv.org/abs/2602.22958
Authors: Maximilian Kalcher
Subjects: Information Theory (cs.IT); Computation and Language (cs.CL)
Keywords: present frequency-ordered tokenization, power-law frequency distribution, natural language tokens, Zipf law, Byte Pair Encoding
Comments: 5 pages, 4 figures, 9 tables
Abstract:We present frequency-ordered tokenization, a simple preprocessing technique that improves lossless text compression by exploiting the power-law frequency distribution of natural language tokens (Zipf's law). The method tokenizes text with Byte Pair Encoding (BPE), reorders the vocabulary so that frequent tokens receive small integer identifiers, and encodes the result with variable-length integers before passing it to any standard compressor. On enwik8 (100 MB Wikipedia), this yields improvements of 7.08 percentage points (pp) for zlib, 1.69 pp for LZMA, and 0.76 pp for zstd (all including vocabulary overhead), outperforming the classical Word Replacing Transform. Gains are consistent at 1 GB scale (enwik9) and across Chinese and Arabic text. We further show that preprocessing accelerates compression for computationally expensive algorithms: the total wall-clock time including preprocessing is 3.1x faster than raw zstd-22 and 2.4x faster than raw LZMA, because the preprocessed input is substantially smaller. The method can be implemented in under 50 lines of code.
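The pipeline described in this abstract (tokenize, give frequent tokens small IDs, write the IDs as variable-length integers, then hand the bytes to a standard compressor) can be sketched roughly as follows. This is not the author's code. Whitespace splitting stands in for BPE, the varint layout is the common LEB128-style encoding, and the helper names are hypothetical. The sketch also ignores the vocabulary overhead that the paper accounts for.

```python
import zlib
from collections import Counter


def encode_varint(n: int) -> bytes:
    """LEB128-style unsigned varint: 7 payload bits per byte, high bit = 'more'."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varints(data: bytes) -> list:
    """Inverse of encode_varint applied to a concatenated stream."""
    ids, current, shift = [], 0, 0
    for b in data:
        current |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            ids.append(current)
            current, shift = 0, 0
    return ids


def frequency_ordered_encode(tokens):
    """Assign IDs by descending frequency, then varint-encode the ID stream."""
    vocab = [tok for tok, _ in Counter(tokens).most_common()]
    token_to_id = {tok: i for i, tok in enumerate(vocab)}
    payload = b"".join(encode_varint(token_to_id[t]) for t in tokens)
    return vocab, payload


def frequency_ordered_decode(vocab, payload):
    return [vocab[i] for i in decode_varints(payload)]


text = "the cat sat on the mat and the dog sat on the log " * 200
tokens = text.split(" ")  # ASSUMPTION: whitespace split instead of BPE

vocab, payload = frequency_ordered_encode(tokens)
compressed = zlib.compress(payload, 9)

# Round trip: decompress, decode IDs, map back to tokens.
restored = " ".join(frequency_ordered_decode(vocab, zlib.decompress(compressed)))
print(len(text.encode()), len(payload), len(compressed), restored == text)
```

With this ordering, any token among the 128 most frequent costs a single byte before the compressor even runs, which is where the Zipf-shaped distribution pays off.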
21. 【2602.22918】Where Vision Becomes Text: Locating the OCR Routing Bottleneck in Vision-Language Models
Link: https://arxiv.org/abs/2602.22918
Authors: Jonathan Steinberg, Oren Gal
Subjects: Computation and Language (cs.CL)
Keywords: optical character recognition, language processing stream, character recognition, information enter, optical character
Comments:
Abstract:Vision-language models (VLMs) can read text from images, but where does this optical character recognition (OCR) information enter the language processing stream? We investigate the OCR routing mechanism across three architecture families (Qwen3-VL, Phi-4, InternVL3.5) using causal interventions. By computing activation differences between original images and text-inpainted versions, we identify architecture-specific OCR bottlenecks whose dominant location depends on the vision-language integration strategy: DeepStack models (Qwen) show peak sensitivity at mid-depth (about 50%) for scene text, while single-stage projection models (Phi-4, InternVL) peak at early layers (6-25%), though the exact layer of maximum effect varies across datasets. The OCR signal is remarkably low-dimensional: PC1 captures 72.9% of variance. Crucially, principal component analysis (PCA) directions learned on one dataset transfer to others, demonstrating shared text-processing pathways. Surprisingly, in models with modular OCR circuits (notably Qwen3-VL-4B), OCR removal can improve counting performance (up to +6.9 percentage points), suggesting OCR interferes with other visual processing in sufficiently modular architectures.
22. 【2602.22911】NoRA: Breaking the Linear Ceiling of Low-Rank Adaptation via Manifold Expansion
Link: https://arxiv.org/abs/2602.22911
Authors: Hung-Hsuan Chen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: dominates parameter-efficient fine-tuning, Non-linear Rank Adaptation, Low-Rank Adaptation, dominates parameter-efficient, parameter-efficient fine-tuning
Comments:
Abstract:Low-Rank Adaptation (LoRA) dominates parameter-efficient fine-tuning (PEFT). However, it faces a critical "linear ceiling" in complex reasoning tasks: simply increasing the rank yields diminishing returns due to intrinsic linear constraints. We introduce NoRA (Non-linear Rank Adaptation), a weight-level parallel adapter that injects SiLU gating and structural dropout to induce manifold expansion. On the SlimOrca benchmark, NoRA breaks this linear barrier: NoRA at rank 64 (PPL 3.89) outperforms LoRA at rank 512 (PPL 3.90), demonstrating superior spectral efficiency. This advantage generalizes to mathematical reasoning, where NoRA achieves a perplexity of 1.97 on MathInstruct, significantly surpassing LoRA's saturation point of 2.07. Mechanism analysis via Singular Value Decomposition (SVD) confirms that NoRA activates the dormant tail of the singular value spectrum, effectively preventing the rank collapse observed in linear methods.
23. 【2602.22897】OmniGAIA: Towards Native Omni-Modal AI Agents
Link: https://arxiv.org/abs/2602.22897
Authors: Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Shijian Wang, Guanting Dong, Jiajie Jin, Hao Wang, Yinuo Wang, Ji-Rong Wen, Yuan Lu, Zhicheng Dou
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Keywords: Human intelligence naturally, intelligence naturally intertwines, Human intelligence, naturally intertwines omni-modal, spanning vision
Comments:
Abstract:Human intelligence naturally intertwines omni-modal perception -- spanning vision, audio, and language -- with complex reasoning and tool usage to interact with the world. However, current multi-modal LLMs are primarily confined to bi-modal interactions (e.g., vision-language), lacking the unified cognitive capabilities required for general AI assistants. To bridge this gap, we introduce OmniGAIA, a comprehensive benchmark designed to evaluate omni-modal agents on tasks necessitating deep reasoning and multi-turn tool execution across video, audio, and image modalities. Constructed via a novel omni-modal event graph approach, OmniGAIA synthesizes complex, multi-hop queries derived from real-world data that require cross-modal reasoning and external tool integration. Furthermore, we propose OmniAtlas, a native omni-modal foundation agent under tool-integrated reasoning paradigm with active omni-modal perception. Trained on trajectories synthesized via a hindsight-guided tree exploration strategy and OmniDPO for fine-grained error correction, OmniAtlas effectively enhances the tool-use capabilities of existing open-source models. This work marks a step towards next-generation native omni-modal AI assistants for real-world scenarios.
24. 【2602.22871】Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
Link: https://arxiv.org/abs/2602.22871
Authors: Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng, Ismail Elezi
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: existing aggregation strategies, Noisy Diffusion Thoughts, propose Stitching Noisy, Stitching Noisy Diffusion, generating multiple
Comments:
Abstract:Reasoning with large language models often benefits from generating multiple chains-of-thought, but existing aggregation strategies are typically trajectory-level (e.g., selecting the best trace or voting on the final answer), discarding useful intermediate work from partial or "nearly correct" attempts. We propose Stitching Noisy Diffusion Thoughts, a self-consistency framework that turns cheap diffusion-sampled reasoning into a reusable pool of step-level candidates. Given a problem, we (i) sample many diverse, low-cost reasoning trajectories using a masked diffusion language model, (ii) score every intermediate step with an off-the-shelf process reward model (PRM), and (iii) stitch these highest-quality steps across trajectories into a composite rationale. This rationale then conditions an autoregressive (AR) model (solver) to recompute only the final answer. This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search. Across math reasoning benchmarks, we find that step-level recombination is most beneficial on harder problems, and ablations highlight the importance of the final AR solver in converting stitched but imperfect rationales into accurate answers. Using low-confidence diffusion sampling with parallel, independent rollouts, our training-free framework improves average accuracy by up to 23.8% across six math and coding tasks. At the same time, it achieves up to a 1.8x latency reduction relative to both traditional diffusion models (e.g., Dream, LLaDA) and unified architectures (e.g., TiDAR). Code is available at this https URL.
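As a rough illustration of steps (ii) and (iii) in this abstract, the sketch below scores every intermediate step and keeps the best-scoring candidate at each step position across trajectories. The position-wise selection rule, the function name `stitch_steps`, and the toy scoring table are all assumptions made for illustration. The abstract does not say how steps are aligned or combined, and a real process reward model would replace `toy_prm`. The final autoregressive solver is not modeled here.

```python
def stitch_steps(trajectories, score_step):
    """Build a composite rationale from step-level candidates.

    trajectories: list of reasoning traces, each a list of step strings.
    score_step:   callable (prefix_steps, candidate_step) -> float, standing
                  in for a process reward model (PRM).
    ASSUMPTION: candidates are aligned by step index across trajectories.
    """
    longest = max(len(t) for t in trajectories)
    stitched = []
    for position in range(longest):
        # Collect the step at this position from every trace that has one.
        candidates = [t[position] for t in trajectories if position < len(t)]
        # Keep the candidate the scorer rates highest given the prefix so far.
        best = max(candidates, key=lambda step: score_step(stitched, step))
        stitched.append(best)
    return stitched


# Toy data: hypothetical step scores in place of real PRM outputs.
toy_scores = {"a1": 0.9, "a2": 0.2, "b1": 0.4, "b2": 0.8, "b3": 0.7}


def toy_prm(prefix, step):
    return toy_scores[step]


trajectories = [["a1", "a2"], ["b1", "b2", "b3"]]
stitched = stitch_steps(trajectories, toy_prm)
print(stitched)
```

In the pipeline the abstract describes, the stitched rationale would then be passed as context to an autoregressive model that recomputes only the final answer.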
25. 【2602.22868】Rejection Mixing: Fast Semantic Propagation of Mask Tokens for Efficient DLLM Inference
Link: https://arxiv.org/abs/2602.22868
Authors: Yushi Ye, Feng Hong, Huangjie Zheng, Xu Chen, Zhiyong Chen, Yanfeng Wang, Jiangchao Yao
Subjects: Computation and Language (cs.CL)
Keywords: Large Language Models, Diffusion Large Language, Language Models, Large Language, promise fast non-autoregressive
Comments:
Abstract:Diffusion Large Language Models (DLLMs) promise fast non-autoregressive inference but suffer a severe quality-speed trade-off in parallel decoding. This stems from the "combinatorial contradiction" phenomenon, where parallel tokens form semantically inconsistent combinations. We address this by integrating continuous representations into the discrete decoding process, as they preserve rich inter-position dependency. We propose ReMix (Rejection Mixing), a framework that introduces a novel Continuous Mixing State as an intermediate between the initial masked state and the final decoded token state. This intermediate state allows a token's representation to be iteratively refined in a continuous space, resolving mutual conflicts with other tokens before collapsing into a final discrete sample. Furthermore, a rejection rule reverts uncertain representations from the continuous state back to the masked state for reprocessing, ensuring stability and preventing error propagation. ReMix thus mitigates combinatorial contradictions by enabling continuous-space refinement during discrete diffusion decoding. Extensive experiments demonstrate that ReMix, as a training-free method, achieves a 2-8x inference speedup without any quality degradation.
26. 【2602.22865】Effective QA-driven Annotation of Predicate-Argument Relations Across Languages
Link: https://arxiv.org/abs/2602.22865
Authors: Jonathan Davidov, Aviv Slobodkin, Shmuel Tomi Klein, Reut Tsarfaty, Ido Dagan, Ayal Klein
Subjects: Computation and Language (cs.CL)
Keywords: Explicit representations, interpretable semantic analysis, supporting reasoning, predicate-argument relations form, form the basis
Comments: Accepted to EACL 2026 (Main Conference)
Abstract:Explicit representations of predicate-argument relations form the basis of interpretable semantic analysis, supporting reasoning, generation, and evaluation. However, attaining such semantic structures requires costly annotation efforts and has remained largely confined to English. We leverage the Question-Answer driven Semantic Role Labeling (QA-SRL) framework -- a natural-language formulation of predicate-argument relations -- as the foundation for extending semantic annotation to new languages. To this end, we introduce a cross-linguistic projection approach that reuses an English QA-SRL parser within a constrained translation and word-alignment pipeline to automatically generate question-answer annotations aligned with target-language predicates. Applied to Hebrew, Russian, and French -- spanning diverse language families -- the method yields high-quality training data and fine-tuned, language-specific parsers that outperform strong multilingual LLM baselines (GPT-4o, LLaMA-Maverick). By leveraging QA-SRL as a transferable natural-language interface for semantics, our approach enables efficient and broadly accessible predicate-argument parsing across languages.
27. 【2602.22846】Improving Neural Argumentative Stance Classification in Controversial Topics with Emotion-Lexicon Features
Link: https://arxiv.org/abs/2602.22846
Authors: Mohammad Yeghaneh Abkenar, Weixing Wang, Manfred Stede, Davide Picca, Mark A. Finlayson, Panagiotis Ioannidis
Subjects: Computation and Language (cs.CL)
Keywords: Argumentation mining comprises, Argumentation mining, stance classification focuses, stance classification, comprises several subtasks
Comments:
Abstract:Argumentation mining comprises several subtasks, among which stance classification focuses on identifying the standpoint expressed in an argumentative text toward a specific target topic. While arguments -- especially about controversial topics -- often appeal to emotions, most prior work has not systematically incorporated explicit, fine-grained emotion analysis to improve performance on this task. In particular, prior research on stance classification has predominantly utilized non-argumentative texts and has been restricted to specific domains or topics, limiting generalizability. We work on five datasets from diverse domains encompassing a range of controversial topics and present an approach for expanding the Bias-Corrected NRC Emotion Lexicon using DistilBERT embeddings, which we feed into a Neural Argumentative Stance Classification model. Our method systematically expands the emotion lexicon through contextualized embeddings to identify emotionally charged terms not previously captured in the lexicon. Our expanded NRC lexicon (eNRC) improves over the baseline across all five datasets (up to +6.2 percentage points in F1 score), outperforms the original NRC on four datasets (up to +3.0), and surpasses the LLM-based approach on nearly all corpora. We provide all resources -- including eNRC, the adapted corpora, and model architecture -- to enable other researchers to build upon our work.
28. 【2602.22831】Moral Preferences of LLMs Under Directed Contextual Influence
Link: https://arxiv.org/abs/2602.22831
Authors: Phil Blandfort, Tushar Karayil, Urja Pawar, Robert Graham, Alex McKenzie, Dmitrii Krasheninnikov
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Keywords: implicitly assuming stable, assuming stable preferences, implicitly assuming, benchmarks for LLMs, LLMs typically
Comments:
Abstract:Moral benchmarks for LLMs typically use context-free prompts, implicitly assuming stable preferences. In deployment, however, prompts routinely include contextual signals such as user requests, cues on social norms, etc. that may steer decisions. We study how directed contextual influences reshape decisions in trolley-problem-style moral triage settings. We introduce a pilot evaluation harness for directed contextual influence in trolley-problem-style moral triage: for each demographic factor, we apply matched, direction-flipped contextual influences that differ only in which group they favor, enabling systematic measurement of directional response. We find that: (i) contextual influences often significantly shift decisions, even when only superficially relevant; (ii) baseline preferences are a poor predictor of directional steerability, as models can appear baseline-neutral yet exhibit systematic steerability asymmetry under influence; (iii) influences can backfire: models may explicitly claim neutrality or discount the contextual cue, yet their choices still shift, sometimes in the opposite direction; and (iv) reasoning reduces average sensitivity, but amplifies the effect of biased few-shot examples. Our findings motivate extending moral evaluations with controlled, direction-flipped context manipulations to better characterize model behavior.
29. 【2602.22828】TCM-DiffRAG: Personalized Syndrome Differentiation Reasoning Method for Traditional Chinese Medicine based on Knowledge Graph and Chain of Thought
Link: https://arxiv.org/abs/2602.22828
Authors: Jianmin Li, Ying Chang, Su-Kit Tang, Yujia Liu, Yanwen Wang, Shuyuan Lin, Binkai Ou
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Retrieval augmented generation, empower large language, Retrieval augmented, RAG, large language models
Comments:
Abstract:Background: Retrieval augmented generation (RAG) technology can empower large language models (LLMs) to generate more accurate, professional, and timely responses without fine tuning. However, due to the complex reasoning processes and substantial individual differences involved in traditional Chinese medicine (TCM) clinical diagnosis and treatment, traditional RAG methods often exhibit poor performance in this domain. Objective: To address the limitations of conventional RAG approaches in TCM applications, this study aims to develop an improved RAG framework tailored to the characteristics of TCM reasoning. Methods: We developed TCM-DiffRAG, an innovative RAG framework that integrates knowledge graphs (KG) with chains of thought (CoT). TCM-DiffRAG was evaluated on three distinctive TCM test datasets. Results: The experimental results demonstrated that TCM-DiffRAG achieved significant performance improvements over native LLMs. For example, the qwen-plus model achieved scores of 0.927, 0.361, and 0.038, which were significantly enhanced to 0.952, 0.788, and 0.356 with TCM-DiffRAG. The improvements were even more pronounced for non-Chinese LLMs. Additionally, TCM-DiffRAG outperformed directly supervised fine-tuned (SFT) LLMs and other benchmark RAG methods. Conclusions: TCM-DiffRAG shows that integrating structured TCM knowledge graphs with Chain of Thought based reasoning substantially improves performance in individualized diagnostic tasks. The joint use of universal and personalized knowledge graphs enables effective alignment between general knowledge and clinical reasoning. These results highlight the potential of reasoning-aware RAG frameworks for advancing LLM applications in traditional Chinese medicine.
30. 【2602.22827】ARAZ: Persian Short-Answer Question Benchmark for Cultural Evaluation of Language Models
Link: https://arxiv.org/abs/2602.22827
Authors: Reihaneh Iranmanesh, Saeedeh Davoudi, Pasha Abrishamchian, Ophir Frieder, Nazli Goharian
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: large language models, paper presents, presents a comprehensive, competence of large, large language
Comments: 11 pages, 3 figures, Fifteenth biennial Language Resources and Evaluation Conference (LREC) 2026 (to appear)
Abstract:This paper presents a comprehensive evaluation framework for assessing the cultural competence of large language models (LLMs) in Persian. Existing Persian cultural benchmarks rely predominantly on multiple-choice formats and English-centric metrics that fail to capture Persian's morphological complexity and semantic nuance. Our framework introduces a Persian-specific short-answer evaluation that combines rule-based morphological normalization with a hybrid syntactic and semantic similarity module, enabling robust soft-match scoring beyond exact string overlap. Through systematic evaluation of 15 state-of-the-art open- and closed-source models, we demonstrate that our hybrid evaluation improves scoring consistency by +10% compared to exact-match baselines by capturing meaning that surface-level methods cannot detect. We publicly release our evaluation framework, providing the first standardized benchmark for measuring cultural understanding in Persian and establishing a reproducible foundation for cross-cultural LLM evaluation research.
31. 【2602.22790】Natural Language Declarative Prompting (NLD-P): A Modular Governance Method for Prompt Design Under Model Drift
Link: https://arxiv.org/abs/2602.22790
Authors: Hyunwoo Kim, Hanau Yi, Jaehee Bae, Yumin Kim
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: transformed prompt engineering, systems-level governance challenge, large language models, localized craft, transformed prompt
Comments:
Abstract:The rapid evolution of large language models (LLMs) has transformed prompt engineering from a localized craft into a systems-level governance challenge. As models scale and update across generations, prompt behavior becomes sensitive to shifts in instruction-following policies, alignment regimes, and decoding strategies, a phenomenon we characterize as GPT-scale model drift. Under such conditions, surface-level formatting conventions and ad hoc refinement are insufficient to ensure stable, interpretable control. This paper reconceptualizes Natural Language Declarative Prompting (NLD-P) as a declarative governance method rather than a rigid field template. NLD-P is formalized as a modular control abstraction that separates provenance, constraint logic, task content, and post-generation evaluation, encoded directly in natural language without reliance on external orchestration code. We define minimal compliance criteria, analyze model-dependent schema receptivity, and position NLD-P as an accessible governance framework for non-developer practitioners operating within evolving LLM ecosystems. Portions of drafting and editorial refinement employed a schema-bound LLM assistant configured under NLD-P. All conceptual framing, methodological claims, and final revisions were directed, reviewed, and approved by the human author under a documented human-in-the-loop protocol. The paper concludes by outlining implications for declarative control under ongoing model evolution and identifying directions for future empirical validation.
32. 【2602.22787】Probing for Knowledge Attribution in Large Language Models
Link: https://arxiv.org/abs/2602.22787
Authors: Ivo Brink, Alexander Boer, Dennis Ulmer
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large language models, Large language, misusing user context, faithfulness violations, factuality violations
Comments:
Abstract:Large language models (LLMs) often generate fluent but unfounded claims, or hallucinations, which fall into two types: (i) faithfulness violations - misusing user context - and (ii) factuality violations - errors from internal knowledge. Proper mitigation depends on knowing whether a model's answer is based on the prompt or its internal weights. This work focuses on the problem of contributive attribution: identifying the dominant knowledge source behind each output. We show that a probe, a simple linear classifier trained on model hidden representations, can reliably predict contributive attribution. For its training, we introduce AttriWiki, a self-supervised data pipeline that prompts models to recall withheld entities from memory or read them from context, generating labelled examples automatically. Probes trained on AttriWiki data reveal a strong attribution signal, achieving up to 0.96 Macro-F1 on Llama-3.1-8B, Mistral-7B, and Qwen-7B, transferring to out-of-domain benchmarks (SQuAD, WebQuestions) with 0.94-0.99 Macro-F1 without retraining. Attribution mismatches raise error rates by up to 70%, demonstrating a direct link between knowledge source confusion and unfaithful answers. Yet, models may still respond incorrectly even when attribution is correct, highlighting the need for broader detection frameworks.
33. 【2602.22775】TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation
Link: https://arxiv.org/abs/2602.22775
Authors: Joydeep Chandra, Satyam Kumar Navneet, Yong Zhang
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: global treatment gap, critical question emerges, mental health chatbots, health chatbots proliferate, treatment gap
Comments:
Abstract:As mental health chatbots proliferate to address the global treatment gap, a critical question emerges: How do we design for relational safety -- the quality of interaction patterns that unfold across conversations rather than the correctness of individual responses? Current safety evaluations assess single-turn crisis responses, missing the therapeutic dynamics that determine whether chatbots help or harm over time. We introduce TherapyProbe, a design probe methodology that generates actionable design knowledge by systematically exploring chatbot conversation trajectories through adversarial multi-agent simulation. Using open-source models, TherapyProbe surfaces relational safety failures -- interaction patterns like "validation spirals" where chatbots progressively reinforce hopelessness, or "empathy fatigue" where responses become mechanical over turns. Our contribution is translating these failures into a Safety Pattern Library of 23 failure archetypes with corresponding design recommendations. We contribute: (1) a replicable methodology requiring no API costs, (2) a clinically-grounded failure taxonomy, and (3) design implications for developers, clinicians, and policymakers.
34. 【2602.22766】Imagination Helps Visual Reasoning, But Not Yet in Latent Space
Link: https://arxiv.org/abs/2602.22766
Authors: You Li, Chi Chen, Yanghao Li, Fanhu Zeng, Kaiyu Huang, Jinan Xu, Maosong Sun
Subjects: Computation and Language (cs.CL)
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, latent tokens
Comments: 13 pages, 6 figures
Abstract:Latent visual reasoning aims to mimic the human imagination process by meditating through the hidden states of Multimodal Large Language Models. While recognized as a promising paradigm for visual reasoning, the underlying mechanisms driving its effectiveness remain unclear. Motivated to demystify the true source of its efficacy, we investigate the validity of latent reasoning using Causal Mediation Analysis. We model the process as a causal chain: the input as the treatment, the latent tokens as the mediator, and the final answer as the outcome. Our findings uncover two critical disconnections: (a) Input-Latent Disconnect: dramatic perturbations on the input result in negligible changes to the latent tokens, suggesting that latent tokens do not effectively attend to the input sequence. (b) Latent-Answer Disconnect: perturbations on the latent tokens yield minimal impact on the final answer, indicating the limited causal effect that latent tokens impose on the outcome. Furthermore, extensive probing analysis reveals that latent tokens encode limited visual information and exhibit high similarity. Consequently, we challenge the necessity of latent reasoning and propose a straightforward alternative named CapImagine, which teaches the model to explicitly imagine using text. Experiments on vision-centric benchmarks show that CapImagine significantly outperforms complex latent-space baselines, highlighting the superior potential of visual reasoning through explicit imagination.
35. 【2602.22765】Towards Better RL Training Data Utilization via Second-Order Rollout
Link: https://arxiv.org/abs/2602.22765
Authors: Zhe Yang, Yudong Wang, Rang Li, Zhifang Sui
Subjects: Computation and Language (cs.CL)
Keywords: empowered Large Language, Large Language Models, Reinforcement Learning, Large Language, strong reasoning capabilities
Comments:
Abstract:Reinforcement Learning (RL) has empowered Large Language Models (LLMs) with strong reasoning capabilities, but vanilla RL mainly focuses on improving generation capability by training with only first-order rollout (generating multiple responses for a question). We argue that this approach fails to fully exploit the potential of training data because it neglects critique capability training. To tackle this problem, we further introduce the concept of second-order rollout (generating multiple critiques for a response) and propose a unified framework for jointly training generation and critique capabilities. Extensive experiments across various models and datasets demonstrate that our approach can utilize training data more effectively than vanilla RL and achieve better performance under the same training data. Additionally, we uncover several insightful findings regarding second-order rollout and critique training, such as the importance of label balance in critique training and the noise problem of outcome-based rewards, which can be mitigated through sampling techniques. Our work offers a preliminary exploration of dynamic data augmentation and joint generation-critique training in RL, providing meaningful inspiration for the further advancement of RL training.
36. 【2602.22755】AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
Link: https://arxiv.org/abs/2602.22755
Authors: Abhay Sheshadri, Aidan Ewart, Kai Fronsdal, Isha Gupta, Samuel R. Bowman, Sara Price, Samuel Marks, Rowan Wang
Subjects: Computation and Language (cs.CL)
Keywords: models, tools, agent, AuditBench, alignment auditing benchmark
Comments:
Abstract:We introduce AuditBench, an alignment auditing benchmark. AuditBench consists of 56 language models with implanted hidden behaviors. Each model has one of 14 concerning behaviors--such as sycophantic deference, opposition to AI regulation, or secret geopolitical loyalties--which it does not confess to when directly asked. AuditBench models are highly diverse--some are subtle, while others are overt, and we use varying training techniques both for implanting behaviors and training models not to confess. To demonstrate AuditBench's utility, we develop an investigator agent that autonomously employs a configurable set of auditing tools. By measuring investigator agent success using different tools, we can evaluate their efficacy. Notably, we observe a tool-to-agent gap, where tools that perform well in standalone non-agentic evaluations fail to translate into improved performance when used with our investigator agent. We find that our most effective tools involve scaffolded calls to auxiliary models that generate diverse prompts for the target. White-box interpretability tools can be helpful, but the agent performs best with black-box tools. We also find that audit success varies greatly across training techniques: models trained on synthetic documents are easier to audit than models trained on demonstrations, with better adversarial training further increasing auditing difficulty. We release our models, agent, and evaluation framework to support future quantitative, iterative science on alignment auditing.
37. 【2602.22752】Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction
Link: https://arxiv.org/abs/2602.22752
Authors: Nils Schwager, Simon Münker, Alistair Plum, Achim Rettinger
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: science lacks extensive, lacks extensive validation, Large Language Models, Conditioned Comment Prediction, transition of Large
Comments: 14 pages, 1 figure, 7 tables. Accepted to the 15th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis (WASSA) at EACL 2026, Rabat, Morocco
Abstract:The transition of Large Language Models (LLMs) from exploratory tools to active "silicon subjects" in social science lacks extensive validation of operational validity. This study introduces Conditioned Comment Prediction (CCP), a task in which a model predicts how a user would comment on a given stimulus by comparing generated outputs with authentic digital traces. This framework enables a rigorous evaluation of current LLM capabilities with respect to the simulation of social media user behavior. We evaluated open-weight 8B models (Llama3.1, Qwen3, Ministral) in English, German, and Luxembourgish language scenarios. By systematically comparing prompting strategies (explicit vs. implicit) and the impact of Supervised Fine-Tuning (SFT), we identify a critical form vs. content decoupling in low-resource settings: while SFT aligns the surface structure of the text output (length and syntax), it degrades semantic grounding. Furthermore, we demonstrate that explicit conditioning (generated biographies) becomes redundant under fine-tuning, as models successfully perform latent inference directly from behavioral histories. Our findings challenge current "naive prompting" paradigms and offer operational guidelines prioritizing authentic behavioral traces over descriptive personas for high-fidelity simulation.
38. 【2602.22730】Extending Czech Aspect-Based Sentiment Analysis with Opinion Terms: Dataset and LLM Benchmarks
Link: https://arxiv.org/abs/2602.22730
Authors: Jakub Šmíd, Pavel Přibáň, Pavel Král
Categories: Computation and Language (cs.CL)
Keywords: enriched with annotations, opinion terms, paper introduces, restaurant domain, domain for aspect-based
Comments: Accepted for the 15th edition of the Language Resources and Evaluation Conference (LREC 2026)
Abstract:This paper introduces a novel Czech dataset in the restaurant domain for aspect-based sentiment analysis (ABSA), enriched with annotations of opinion terms. The dataset supports three distinct ABSA tasks involving opinion terms, accommodating varying levels of complexity. Leveraging this dataset, we conduct extensive experiments using modern Transformer-based models, including large language models (LLMs), in monolingual, cross-lingual, and multilingual settings. To address cross-lingual challenges, we propose a translation and label alignment methodology leveraging LLMs, which yields consistent improvements. Our results highlight the strengths and limitations of state-of-the-art models, especially when handling the linguistic intricacies of low-resource languages like Czech. A detailed error analysis reveals key challenges, including the detection of subtle opinion terms and nuanced sentiment expressions. The dataset establishes a new benchmark for Czech ABSA, and our proposed translation-alignment approach offers a scalable solution for adapting ABSA resources to other low-resource languages.
39. 【2602.22723】Human Label Variation in Implicit Discourse Relation Recognition
Link: https://arxiv.org/abs/2602.22723
Authors: Frances Yung, Daniil Ignatev, Merel Scholman, Vera Demberg, Massimo Poesio
Categories: Computation and Language (cs.CL)
Keywords: single ground truth, reflect diverse perspectives, NLP tasks lack, judgments reflect diverse, human judgments reflect
Abstract:There is growing recognition that many NLP tasks lack a single ground truth, as human judgments reflect diverse perspectives. To capture this variation, models have been developed to predict full annotation distributions rather than majority labels, while perspectivist models aim to reproduce the interpretations of individual annotators. In this work, we compare these approaches on Implicit Discourse Relation Recognition (IDRR), a highly ambiguous task where disagreement often arises from cognitive complexity rather than ideological bias. Our experiments show that existing annotator-specific models perform poorly in IDRR unless ambiguity is reduced, whereas models trained on label distributions yield more stable predictions. Further analysis indicates that frequent cognitively demanding cases drive inconsistency in human interpretation, posing challenges for perspectivist modeling in IDRR.
40. 【2602.22721】Replacing Multi-Step Assembly of Data Preparation Pipelines with One-Step LLM Pipeline Generation for Table QA
Link: https://arxiv.org/abs/2602.22721
Authors: Fengyu Li, Junhao Zhu, Kaishi Song, Lu Chen, Zhongming Yao, Tianyi Li, Christian S. Jensen
Categories: Databases (cs.DB); Computation and Language (cs.CL)
Keywords: Table Question Answering, natural language questions, Question Answering, answer natural language, Large Language Models
Abstract:Table Question Answering (TQA) aims to answer natural language questions over structured tables. Large Language Models (LLMs) enable promising solutions to this problem, with operator-centric solutions that generate table manipulation pipelines in a multi-step manner offering state-of-the-art performance. However, these solutions rely on multiple LLM calls, resulting in prohibitive latencies and computational costs. We propose Operation-R1, the first framework that trains lightweight LLMs (e.g., Qwen-4B/1.7B) via a novel variant of reinforcement learning with verifiable rewards to produce high-quality data-preparation pipelines for TQA in a single inference step. To train such an LLM, we first introduce a self-supervised rewarding mechanism to automatically obtain fine-grained pipeline-wise supervision signals for LLM training. We also propose variance-aware group resampling to mitigate training instability. To further enhance robustness of pipeline generation, we develop two complementary mechanisms: operation merge, which filters spurious operations through multi-candidate consensus, and adaptive rollback, which offers runtime protection against information loss in data transformation. Experiments on two benchmark datasets show that, with the same LLM backbone, Operation-R1 achieves average absolute accuracy gains of 9.55 and 6.08 percentage points over multi-step preparation baselines, with 79% table compression and a 2.2× reduction in monetary cost.
41. 【2602.22698】Tokenization, Fusion and Decoupling: Bridging the Granularity Mismatch Between Large Language Models and Knowledge Graphs
Link: https://arxiv.org/abs/2602.22698
Authors: Siyue Su, Jian Yang, Bo Li, Guanglin Niu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Leveraging Large Language, Knowledge Graph Completion, Language Models, Large Language
Abstract:Leveraging Large Language Models (LLMs) for Knowledge Graph Completion (KGC) is promising but hindered by a fundamental granularity mismatch. LLMs operate on fragmented token sequences, whereas entities are the fundamental units in knowledge graph (KG) scenarios. Existing approaches typically constrain predictions to limited candidate sets or align entities with the LLM's vocabulary by pooling multiple tokens or decomposing entities into fixed-length token sequences, which fail to capture both the semantic meaning of the text and the structural integrity of the graph. To address this, we propose KGT, a novel framework that uses dedicated entity tokens to enable efficient, full-space prediction. Specifically, we first introduce specialized tokenization to construct feature representations at the level of dedicated entity tokens. We then fuse pre-trained structural and textual features into these unified embeddings via a relation-guided gating mechanism, avoiding training from scratch. Finally, we implement decoupled prediction by leveraging independent heads to separate and combine semantic and structural reasoning. Experimental results show that KGT consistently outperforms state-of-the-art methods across multiple benchmarks.
42. 【2602.22697】Reinforcing Real-world Service Agents: Balancing Utility and Cost in Task-oriented Dialogue
Link: https://arxiv.org/abs/2602.22697
Authors: Ning Gao, Wei Zhang, Yuqin Dai, Ling Shi, Ziyin Wang, Yujie Wang, Wei He, Jinpeng Wang, Chaozheng Wang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Language Models, Large Language, evolution of Large, rapid evolution
Comments: 35 pages, 8 tables, 3 figures
Abstract:The rapid evolution of Large Language Models (LLMs) has accelerated the transition from conversational chatbots to general agents. However, effectively balancing empathetic communication with budget-aware decision-making remains an open challenge. Since existing methods fail to capture these complex strategic trade-offs, we propose InteractCS-RL, a framework that reframes task-oriented dialogue as a multi-granularity reinforcement learning process. Specifically, we first establish a User-centric Interaction Framework to provide a high-fidelity training gym, enabling agents to dynamically explore diverse strategies with persona-driven users. Then, we introduce Cost-aware Multi-turn Policy Optimization (CMPO) with a hybrid advantage estimation strategy. By integrating generative process credits and employing a PID-Lagrangian cost controller, CMPO effectively guides the policy to explore the Pareto boundary between user reward and global cost constraints. Extensive experiments on customized real business scenarios demonstrate that InteractCS-RL significantly outperforms other baselines across three evaluation dimensions. Further evaluation on tool-agent-user interaction benchmarks verifies InteractCS-RL's robustness across diverse domains.
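For readers unfamiliar with the PID-Lagrangian controller mentioned in this abstract, the general idea can be sketched as follows. This is a generic illustration of the technique, not the CMPO implementation: the class name, the gains `kp`, `ki`, `kd`, and the `penalized_reward` helper are all assumptions made for this example.

```python
class PIDLagrangian:
    """Generic PID controller for a Lagrange multiplier on a cost constraint.

    Hypothetical sketch: gains and update rule are illustrative defaults,
    not values taken from the paper.
    """

    def __init__(self, cost_limit, kp=0.1, ki=0.01, kd=0.0):
        self.cost_limit = cost_limit
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0      # accumulated constraint violation
        self.prev_cost = None    # cost observed at the previous update

    def update(self, avg_cost):
        """Return a non-negative multiplier from the latest average cost."""
        error = avg_cost - self.cost_limit
        # The integral term is clipped at zero so slack does not accumulate.
        self.integral = max(0.0, self.integral + error)
        # The derivative term only reacts to rising cost.
        if self.prev_cost is None:
            derivative = 0.0
        else:
            derivative = max(0.0, avg_cost - self.prev_cost)
        self.prev_cost = avg_cost
        lam = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(0.0, lam)


def penalized_reward(reward, cost, lam):
    """Trade user reward against cost using the current multiplier."""
    return reward - lam * cost
```

When the measured cost exceeds the limit, the multiplier grows and the policy is pushed toward cheaper behavior; once the cost falls back under the limit, the multiplier decays toward zero.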
43. 【2602.22696】Enhancing Persuasive Dialogue Agents by Synthesizing Cross-Disciplinary Communication Strategies
Link: https://arxiv.org/abs/2602.22696
Authors: Shinnosuke Nozue, Yuto Nakano, Yotaro Watanabe, Meguru Takasaki, Shoji Moriya, Reina Akama, Jun Suzuki
Categories: Computation and Language (cs.CL)
Keywords: persuasive dialogue agents, Current approaches, developing persuasive dialogue, dialogue agents, predefined persuasive strategies
Comments: Accepted to the EMNLP 2025 Industry Track; 26 pages
Abstract:Current approaches to developing persuasive dialogue agents often rely on a limited set of predefined persuasive strategies that fail to capture the complexity of real-world interactions. We applied a cross-disciplinary approach to develop a framework for designing persuasive dialogue agents that draws on proven strategies from social psychology, behavioral economics, and communication theory. We validated our proposed framework through experiments on two distinct datasets: the Persuasion for Good dataset, which represents a specific in-domain scenario, and the DailyPersuasion dataset, which encompasses a wide range of scenarios. The proposed framework achieved strong results for both datasets and demonstrated notable improvement in the persuasion success rate as well as promising generalizability. Notably, the proposed framework also excelled at persuading individuals with initially low intent, which addresses a critical challenge for persuasive dialogue agents.
44. 【2602.22675】Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization
Link: https://arxiv.org/abs/2602.22675
Authors: Qianben Chen, Tianrui Qin, King Zhu, Qiexiang Wang, Chengjun Yu, Shu Xu, Jiaqi Wu, Jiayu Zhang, Xinpeng Liu, Xin Gui, Jingyi Cao, Piaohong Wang, Dingfeng Shi, He Zhu, Tiannan Wang, Yuqing Wang, Maojia Song, Tianyu Zheng, Ge Zhang, Jian Yang, Jiaheng Liu, Minghao Liu, Yuchen Eleanor Jiang, Wangchunshu Zhou
Categories: Computation and Language (cs.CL)
Keywords: Recent deep research, high inference cost, Recent deep, scaling reasoning depth, agents primarily improve
Comments: 12 pages, 5 figures
Abstract:Recent deep research agents primarily improve performance by scaling reasoning depth, but this leads to high inference cost and latency in search-intensive scenarios. Moreover, generalization across heterogeneous research settings remains challenging. In this work, we propose Search More, Think Less (SMTL), a framework for long-horizon agentic search that targets both efficiency and generalization. SMTL replaces sequential reasoning with parallel evidence acquisition, enabling efficient context management under constrained context budgets. To support generalization across task types, we further introduce a unified data synthesis pipeline that constructs search tasks spanning both deterministic question answering and open-ended research scenarios with task-appropriate evaluation metrics. We train an end-to-end agent using supervised fine-tuning and reinforcement learning, achieving strong and often state-of-the-art performance across benchmarks including BrowseComp (48.6%), GAIA (75.7%), Xbench (82.0%), and DeepResearch Bench (45.9%). Compared to Mirothinker-v1.0, SMTL with a maximum of 100 interaction steps reduces the average number of reasoning steps on BrowseComp by 70.7%, while improving accuracy.
45. 【2602.22661】dLLM: Simple Diffusion Language Modeling
Link: https://arxiv.org/abs/2602.22661
Authors: Zhanhui Zhou, Lingjie Chen, Hanghang Tong, Dawn Song
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: recent models converge, evolving quickly, diffusion language models, recent models, models converge
Comments: Code available at: https://github.com/ZHZisZZ/dllm
Abstract:Although diffusion language models (DLMs) are evolving quickly, many recent models converge on a set of shared components. These components, however, are distributed across ad-hoc research codebases or lack transparent implementations, making them difficult to reproduce or extend. As the field accelerates, there is a clear need for a unified framework that standardizes these common components while remaining flexible enough to support new methods and architectures. To address this gap, we introduce dLLM, an open-source framework that unifies the core components of diffusion language modeling -- training, inference, and evaluation -- and makes them easy to customize for new designs. With dLLM, users can reproduce, finetune, deploy, and evaluate open-source large DLMs such as LLaDA and Dream through a standardized pipeline. The framework also provides minimal, reproducible recipes for building small DLMs from scratch with accessible compute, including converting any BERT-style encoder or autoregressive LM into a DLM. We also release the checkpoints of these small DLMs to make DLMs more accessible and accelerate future research.
46. 【2602.22647】Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators
Link: https://arxiv.org/abs/2602.22647
Authors: Zhengyang Su, Isay Katsman, Yueqi Wang, Ruining He, Lukasz Heldt, Raghunandan Keshavan, Shao-Chuan Wang, Xinyang Yi, Mingyan Gao, Onkar Dalal, Lichan Hong, Ed Chi, Ningren Han
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: powerful paradigm, constrained decoding, STATIC, Generative retrieval, Compressed Sparse Row
Comments: 14 pages, 4 figures
Abstract:Generative retrieval has emerged as a powerful paradigm for LLM-based recommendation. However, industrial recommender systems often benefit from restricting the output space to a constrained subset of items based on business logic (e.g. enforcing content freshness or product category), which standard autoregressive decoding cannot natively support. Moreover, existing constrained decoding methods that make use of prefix trees (Tries) incur severe latency penalties on hardware accelerators (TPUs/GPUs). In this work, we introduce STATIC (Sparse Transition Matrix-Accelerated Trie Index for Constrained Decoding), an efficient and scalable constrained decoding technique designed specifically for high-throughput LLM-based generative retrieval on TPUs/GPUs. By flattening the prefix tree into a static Compressed Sparse Row (CSR) matrix, we transform irregular tree traversals into fully vectorized sparse matrix operations, unlocking massive efficiency gains on hardware accelerators. We deploy STATIC on a large-scale industrial video recommendation platform serving billions of users. STATIC produces significant product metric impact with minimal latency overhead (0.033 ms per step and 0.25% of inference time), achieving a 948x speedup over a CPU trie implementation and a 47-1033x speedup over a hardware-accelerated binary-search baseline. Furthermore, the runtime overhead of STATIC remains extremely low across a wide range of practical configurations. To the best of our knowledge, STATIC enables the first production-scale deployment of strictly constrained generative retrieval. In addition, evaluation on academic benchmarks demonstrates that STATIC can considerably improve cold-start performance for generative retrieval. Our code is available at this https URL.
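The core data-layout idea in this abstract, flattening a prefix trie into Compressed Sparse Row (CSR) arrays, can be illustrated with a small pure-Python sketch. It shows only the layout and lookup logic; the function names are hypothetical, and the paper's vectorized accelerator kernels are not reproduced here.

```python
def build_csr_trie(sequences):
    """Flatten a prefix trie over token-ID sequences into CSR arrays.

    Row s describes trie state s: indices[indptr[s]:indptr[s + 1]] holds the
    allowed next tokens and data[...] holds the matching child states.
    """
    # Build an ordinary dict-based trie first; state 0 is the root.
    children = [{}]
    for seq in sequences:
        state = 0
        for tok in seq:
            if tok not in children[state]:
                children[state][tok] = len(children)
                children.append({})
            state = children[state][tok]
    # Flatten it row by row into three flat arrays.
    indptr, indices, data = [0], [], []
    for edges in children:
        for tok in sorted(edges):
            indices.append(tok)
            data.append(edges[tok])
        indptr.append(len(indices))
    return indptr, indices, data


def allowed_mask(csr, state, vocab_size):
    """Boolean mask over the vocabulary of tokens valid in this state."""
    indptr, indices, _ = csr
    mask = [False] * vocab_size
    for k in range(indptr[state], indptr[state + 1]):
        mask[indices[k]] = True
    return mask


def next_state(csr, state, token):
    """Follow one trie edge; raises if the token is not allowed."""
    indptr, indices, data = csr
    for k in range(indptr[state], indptr[state + 1]):
        if indices[k] == token:
            return data[k]
    raise ValueError(f"token {token} not allowed in state {state}")
```

Because every state's children live in one contiguous slice of flat arrays, the per-state loops above could in principle be replaced by batched gather operations, which is the kind of regular access pattern accelerators handle well.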
47. 【2602.22623】ContextRL: Enhancing MLLM's Knowledge Discovery Efficiency with Context-Augmented RL
Link: https://arxiv.org/abs/2602.22623
Authors: Xingyu Lu, Jinpeng Wang, YiFan Zhang, Shijie Ma, Xiao Hu, Tianke Zhang, Haonan fan, Kaiyu Jiang, Changyi Liu, Kaiyu Tang, Bin Wen, Fan Yang, Tingting Gao, Han Li, Chun Yuan
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: leverages context augmentation, overcome these bottlenecks, framework that leverages, augmentation to overcome, leverages context
Comments: 14 pages, 5 figures
Abstract:We propose ContextRL, a novel framework that leverages context augmentation to overcome these bottlenecks. Specifically, to enhance Identifiability, we provide the reward model with full reference solutions as context, enabling fine-grained process verification to filter out false positives (samples with the right answer but low-quality reasoning process). To improve Reachability, we introduce a multi-turn sampling strategy where the reward model generates mistake reports for failed attempts, guiding the policy to "recover" correct responses from previously all-negative groups. Experimental results on 11 perception and reasoning benchmarks show that ContextRL significantly improves knowledge discovery efficiency. Notably, ContextRL enables the Qwen3-VL-8B model to achieve performance comparable to the 32B model, outperforming standard RLVR baselines by a large margin while effectively mitigating reward hacking. Our in-depth analysis reveals the significant potential of contextual information for improving reward model accuracy and documents the widespread occurrence of reward hacking, offering valuable insights for future RLVR research.
48. 【2602.22592】pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training
Link: https://arxiv.org/abs/2602.22592
Authors: Wenzheng Zhang, Bingzheng Liu, Yang Hu, Xiaoying Bai, Wentao Zhang, Bin Cui
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: offer substantial advantages, Quantization-Aware Training, Training from scratch, large language models, building efficient large
Comments: 10 pages, 7 figures
Abstract:Quantization-Aware Training from scratch has emerged as a promising approach for building efficient large language models (LLMs) with extremely low-bit weights (sub 2-bit), which can offer substantial advantages for edge deployment. However, existing methods still fail to achieve satisfactory accuracy and scalability. In this work, we identify a parameter democratization effect as a key bottleneck: the sensitivity of all parameters becomes homogenized, severely limiting expressivity. To address this, we propose pQuant, a method that decouples parameters by splitting linear layers into two specialized branches: a dominant 1-bit branch for efficient computation and a compact high-precision branch dedicated to preserving the most sensitive parameters. Through tailored feature scaling, we explicitly guide the model to allocate sensitive parameters to the high-precision branch. Furthermore, we extend this branch into multiple, sparsely-activated experts, enabling efficient capacity scaling. Extensive experiments indicate our pQuant achieves state-of-the-art performance in extremely low-bit quantization.
49. 【2602.22586】TabDLM: Free-Form Tabular Data Generation via Joint Numerical-Language Diffusion
Link: https://arxiv.org/abs/2602.22586
Authors: Donghong Cai, Jiarui Feng, Yanbo Wang, Da Zheng, Yixin Chen, Muhan Zhang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Synthetic tabular data, Synthetic tabular, attracted growing attention, growing attention due, attracted growing
Comments: Preprint
Abstract:Synthetic tabular data generation has attracted growing attention due to its importance for data augmentation, foundation models, and privacy. However, real-world tabular datasets increasingly contain free-form text fields (e.g., reviews or clinical notes) alongside structured numerical and categorical attributes. Generating such heterogeneous tables with joint modeling of different modalities remains challenging. Existing approaches broadly fall into two categories: diffusion-based methods and LLM-based methods. Diffusion models can capture complex dependencies over numerical and categorical features in continuous or discrete spaces, but extending them to open-ended text is nontrivial and often leads to degraded text quality. In contrast, LLM-based generators naturally produce fluent text, yet their discrete tokenization can distort precise or wide-range numerical values, hindering accurate modeling of both numbers and language. In this work, we propose TabDLM, a unified framework for free-form tabular data generation via a joint numerical-language diffusion model built on masked diffusion language models (MDLMs). TabDLM models textual and categorical features through masked diffusion, while modeling numerical features with a continuous diffusion process through learned, specialized numeric token embeddings; bidirectional attention then captures cross-modality interactions within a single model. Extensive experiments on diverse benchmarks demonstrate the effectiveness of TabDLM compared to strong diffusion- and LLM-based baselines.
50. 【2602.22584】Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA
Link: https://arxiv.org/abs/2602.22584
Authors: Wenwei Li, Ming Xu, Tianle Xia, Lingxiang Hu, Yiding Sun, Linfang Shang, Liqun Liu, Peng Shu, Huan Yu, Jie Jiang
Categories: Computation and Language (cs.CL)
Keywords: advertising question answering, Industrial advertising question, Relative Policy Optimization, question answering, hallucinated content
Abstract:Industrial advertising question answering (QA) is a high-stakes task in which hallucinated content, particularly fabricated URLs, can lead to financial loss, compliance violations, and legal risk. Although Retrieval-Augmented Generation (RAG) is widely adopted, deploying it in production remains challenging because industrial knowledge is inherently relational, frequently updated, and insufficiently aligned with generation objectives. We propose a reinforced co-adaptation framework that jointly optimizes retrieval and generation through two components: (1) Graph-aware Retrieval (GraphRAG), which models entity-relation structure over a high-citation knowledge subgraph for multi-hop, domain-specific evidence selection; and (2) evidence-constrained reinforcement learning via Group Relative Policy Optimization (GRPO) with multi-dimensional rewards covering faithfulness, style compliance, safety, and URL validity. Experiments on an internal advertising QA dataset show consistent gains across expert-judged dimensions including accuracy, completeness, and safety, while reducing the hallucination rate by 72%. A two-week online A/B test demonstrates a 28.6% increase in like rate, a 46.2% decrease in dislike rate, and a 92.7% reduction in URL hallucination. The system has been running in production for over half a year and has served millions of QA interactions.
51. 【2602.22583】Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance
Link: https://arxiv.org/abs/2602.22583
Authors: Weida Liang, Yiyou Sun, Shuyuan Nan, Chuang Li, Dawn Song, Kenji Kawaguchi
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Example-based guidance, improve mathematical reasoning, inference time, correct and problem-relevant, effectiveness is highly
Abstract:Example-based guidance is widely used to improve mathematical reasoning at inference time, yet its effectiveness is highly unstable across problems and models, even when the guidance is correct and problem-relevant. We show that this instability arises from a previously underexplored gap between strategy usage (whether a reasoning strategy appears in successful solutions) and strategy executability (whether the strategy remains effective when instantiated as guidance for a target model). Through a controlled analysis of paired human-written and model-generated solutions, we identify a systematic dissociation between usage and executability: human- and model-derived strategies differ in structured, domain-dependent ways, leading to complementary strengths and consistent source-dependent reversals under guidance. Building on this diagnosis, we propose Selective Strategy Retrieval (SSR), a test-time framework that explicitly models executability by selectively retrieving and combining strategies using empirical, multi-route, source-aware signals. Across multiple mathematical reasoning benchmarks, SSR yields reliable and consistent improvements over direct solving, in-context learning, and single-source guidance, improving accuracy by up to +13 points on AIME25 and +5 points on Apex for compact reasoning models. Code and benchmark are publicly available at: this https URL.
52. 【2602.22576】Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
Link: https://arxiv.org/abs/2602.22576
Authors: Tianle Xia, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li, Linfang Shang, Liqun Liu, Peng Shu, Huan Yu, Jie Jiang
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: enhances large language, large language models, incorporating external knowledge, traditional single-round retrieval, single-round retrieval struggles
Abstract:Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, yet traditional single-round retrieval struggles with complex multi-step reasoning. Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed samples contribute nothing. We propose Search-P1, a framework that introduces path-centric reward shaping for agentic RAG training, comprising two key components: (1) Path-Centric Reward, which evaluates the structural quality of reasoning trajectories through order-agnostic step coverage and soft scoring that extracts learning signals even from failed samples, and (2) Dual-Track Path Scoring with offline-generated reference planners that assesses paths from both self-consistency and reference-alignment perspectives. Experiments on multiple QA benchmarks demonstrate that Search-P1 achieves significant improvements over Search-R1 and other strong baselines, with an average accuracy gain of 7.7 points.
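To make "order-agnostic step coverage" and "soft scoring" concrete, here is a minimal sketch of how a failed trajectory could still receive partial credit. It is an assumption-laden illustration, not the Search-P1 reward: the step labels, the exact-match comparison, and the mixing weight `alpha` are all hypothetical.

```python
def step_coverage(reference_steps, trajectory_steps):
    """Fraction of reference steps found anywhere in the trajectory.

    Order is ignored; steps are compared by exact label match, which is a
    simplification chosen for this example.
    """
    reference = set(reference_steps)
    if not reference:
        return 0.0
    return len(reference & set(trajectory_steps)) / len(reference)


def soft_path_reward(answer_correct, coverage, alpha=0.5):
    """Blend the sparse outcome reward with the dense coverage score.

    alpha is a hypothetical mixing weight, not a value from the paper.
    """
    outcome = 1.0 if answer_correct else 0.0
    return (1.0 - alpha) * outcome + alpha * coverage


reference = ["find director", "find birth year", "compare"]
failed_path = ["find birth year", "find director"]  # wrong final answer
coverage = step_coverage(reference, failed_path)
reward = soft_path_reward(False, coverage)
```

Under a pure outcome reward, `failed_path` would score zero; with the blended score it still yields a positive learning signal proportional to how much of the reference plan it covered.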
53. 【2602.22556】Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation
Link: https://arxiv.org/abs/2602.22556
Authors: Zihang Xu, Haozhi Xie, Ziqi Miao, Wuxuan Gong, Chen Qian, Lijun Li
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: extended reasoning traces, Large reasoning models, exhibit overthinking behavior, achieve strong performance, Large reasoning
Comments: 15 pages, 7 figures
Abstract:Large reasoning models (LRMs) achieve strong performance through extended reasoning traces, but they often exhibit overthinking behavior for low-complexity queries. Existing efforts to mitigate this issue are fundamentally limited by unstable accuracy-efficiency trade-offs and poor robustness to heterogeneous reasoning behaviors. To address these challenges, we propose a two-stage framework for stable adaptive thinking in LRMs. The framework first applies Hybrid Fine-Tuning to expose the model to both thinking and no-thinking behaviors, establishing well-conditioned initialization. It then performs adaptive reinforcement learning with Correctness-Preserving Advantage Shaping (CPAS) to avoid suppressing correct long-chain reasoning, and Length-Aware Gradient Regulation (LAGR) to stabilize optimization under severe reasoning-length heterogeneity. Extensive experiments on Qwen2.5-1.5B and 7B show consistent improvements over strong baselines, achieving up to +3.7/+3.6 accuracy points while reducing generated tokens by 40.6%/43.9%. Further analyses across varying problem difficulties and out-of-distribution tasks confirm the robustness and generalization of our approach.
54. 【2602.22543】Ruyi2 Technical Report
Link: https://arxiv.org/abs/2602.22543
Authors: Huan Song, Shuyu Tian, Junyi Hao, Minxiu Xu, Hongjun An, Yiliang Song, Jiawei Shao, Xuelong Li
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, face significant challenges, necessitating adaptive computing, adaptive computing strategies
Abstract:Large Language Models (LLMs) face significant challenges regarding deployment costs and latency, necessitating adaptive computing strategies. Building upon the AI Flow framework, we introduce Ruyi2 as an evolution of our adaptive model series designed for efficient variable-depth computation. While early-exit architectures offer a viable efficiency-performance balance, the Ruyi model and existing methods often struggle with optimization complexity and compatibility with large-scale distributed training. To bridge this gap, Ruyi2 introduces a stable "Familial Model" based on Megatron-LM. By using 3D parallel training, it achieves a 2-3 times speedup over Ruyi, while performing comparably to same-sized Qwen3 models. These results confirm that family-based parameter sharing is a highly effective strategy, establishing a new "Train Once, Deploy Many" paradigm and providing a key reference for balancing architectural efficiency with high-performance capabilities.
55. 【2602.22538】RAIN-Merging: A Gradient-Free Method to Enhance Instruction Following in Large Reasoning Models with Preserved Thinking Format
Link: https://arxiv.org/abs/2602.22538
Authors: Zhehao Huang, Yuhang Liu, Baijiong Lin, Yixin Lou, Zhengbao He, Hanling Tian, Tao Li, Xiaolin Huang
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Large reasoning models, faithfully follow instructions, specific requirements, Large reasoning, long chain
Comments: 41 pages, ICLR 2026 Oral
Abstract:Large reasoning models (LRMs) excel at a long chain of reasoning but often fail to faithfully follow instructions regarding output format, constraints, or specific requirements. We investigate whether this gap can be closed by integrating an instruction-tuned model (ITM) into an LRM. Analyzing their differences in parameter space, namely task vectors, we find that their principal subspaces are nearly orthogonal across key modules, suggesting a lightweight merging with minimal interference. However, we also demonstrate that naive merges are fragile because they overlook the output format mismatch between LRMs (with explicit thinking and response segments) and ITMs (answers-only). We introduce RAIN-Merging (Reasoning-Aware Instruction-attention guided Null-space projection Merging), a gradient-free method that integrates instruction following while preserving thinking format and reasoning performance. First, with a small reasoning calibration set, we project the ITM task vector onto the null space of forward features at thinking special tokens, which preserves the LRM's structured reasoning mechanisms. Second, using a small instruction calibration set, we estimate instruction attention to derive module-specific scaling that amplifies instruction-relevant components and suppresses leakage. Across four instruction-following benchmarks and nine reasoning general capability benchmarks, RAIN-Merging substantially improves instruction adherence while maintaining reasoning quality. The gains are consistent across model scales and architectures, translating to improved performance in agent settings.
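The null-space projection idea mentioned in the abstract above can be illustrated with a minimal, generic sketch. This is not the RAIN-Merging procedure itself: it assumes a single weight matrix, a single protected feature vector `x`, and plain Python lists, and the names `project_to_null_space` and `merge` are hypothetical. It only shows why a projected update leaves the output on the protected feature unchanged.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def project_to_null_space(delta, x):
    """Return delta' = delta - (delta x) x^T / (x^T x), so that delta' x = 0.

    `delta` stands in for a task vector (a weight difference) and `x` for a
    feature that the merged model must keep handling as before. Both are
    illustrative assumptions, not details taken from the paper.
    """
    norm_sq = sum(xi * xi for xi in x)
    if norm_sq == 0:
        return [row[:] for row in delta]
    dx = matvec(delta, x)
    return [
        [d - (dxi * xj) / norm_sq for d, xj in zip(row, x)]
        for row, dxi in zip(delta, dx)
    ]


def merge(W_base, delta, x, scale=1.0):
    """Add the projected update to the base weights."""
    projected = project_to_null_space(delta, x)
    return [
        [w + scale * p for w, p in zip(w_row, p_row)]
        for w_row, p_row in zip(W_base, projected)
    ]


W_base = [[1.0, 0.0], [0.0, 1.0]]
delta = [[0.5, 0.2], [-0.3, 0.4]]
x = [1.0, 2.0]
merged = merge(W_base, delta, x)
# The merged weights act on x exactly as the base weights do.
print(matvec(W_base, x), matvec(merged, x))
```

With several protected features, the same idea would use a projector built from all of them rather than a single vector; that extension is omitted here.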
56. 【2602.22530】Dynamic Level Sets
Link: https://arxiv.org/abs/2602.22530
Authors: Michael Stephen Fiske
Subjects: Computational Complexity (cs.CC); Computation and Language (cs.CL); Mathematical Physics (math-ph); Dynamical Systems (math.DS); History and Overview (math.HO)
Keywords: Turing Incomputable Computation, Turing Centenary Conference, Alan Turing Centenary, paper Turing Incomputable, Incomputable Computation
Comments: 7 pages
Click to view abstract
Abstract:A mathematical concept is identified and analyzed that is implicit in the 2012 paper Turing Incomputable Computation, presented at the Alan Turing Centenary Conference (Turing 100, Manchester). The concept, called dynamic level sets, is distinct from mathematical concepts in the standard literature on dynamical systems, topology, and computability theory. A new mathematical object is explained and why it may have escaped prior characterizations, including the classical result of de Leeuw, Moore, Shannon, and Shapiro (1956) that probabilistic Turing machines compute no more than deterministic ones.
57. 【2602.22524】Iterative Prompt Refinement for Dyslexia-Friendly Text Summarization Using GPT-4o
Link: https://arxiv.org/abs/2602.22524
Authors: Samay Bhojwani,Swarnima Kain,Lisong Xu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Dyslexia affects approximately, Dyslexia affects, presents persistent challenges, global population, persistent challenges
Comments:
Click to view abstract
Abstract:Dyslexia affects approximately 10% of the global population and presents persistent challenges in reading fluency and text comprehension. While existing assistive technologies address visual presentation, linguistic complexity remains a substantial barrier to equitable access. This paper presents an empirical study on dyslexia-friendly text summarization using an iterative prompt-based refinement pipeline built on GPT-4o. We evaluate the pipeline on approximately 2,000 news article samples, applying a readability target of Flesch Reading Ease = 90. Results show that the majority of summaries meet the readability threshold within four attempts, with many succeeding on the first try. A composite score combining readability and semantic fidelity shows stable performance across the dataset, ranging from 0.13 to 0.73 with a typical value near 0.55. These findings establish an empirical baseline for accessibility-driven NLP summarization and motivate further human-centered evaluation with dyslexic readers.
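For readers unfamiliar with the readability target mentioned above, the following is a rough sketch of the standard Flesch Reading Ease formula wrapped in a retry loop. The syllable counter is a crude heuristic, `rewrite` is a hypothetical stand-in for whatever model call produces each new draft, and treating the target as a "score at least 90" threshold with four attempts is an assumption based on the abstract's wording, not the authors' implementation.

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic: count vowel groups and drop a silent trailing 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and n > 1:
        n -= 1
    return max(n, 1)


def flesch_reading_ease(text: str) -> float:
    """206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


def refine_until_readable(text, rewrite, target=90.0, max_attempts=4):
    """Call rewrite(draft, attempt) until the score reaches the target.

    `rewrite` is a hypothetical callable (for example, a prompt to an LLM).
    Returns (draft, score, attempts_used).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    draft = text
    for attempt in range(1, max_attempts + 1):
        draft = rewrite(draft, attempt)
        score = flesch_reading_ease(draft)
        if score >= target:
            break
    return draft, score, attempt
```

Short sentences built from one-syllable words score well above 90, which is why a threshold this high pushes summaries toward very plain wording.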
58. 【2602.22523】Cognitive Models and AI Algorithms Provide Templates for Designing Language Agents
Link: https://arxiv.org/abs/2602.22523
Authors: Ryan Liu,Dilip Arumugam,Cedegao E. Zhang,Sean Escola,Xaq Pitkow,Thomas L. Griffiths
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
Keywords: contemporary large language, capable in isolation, contemporary large, increasingly capable, difficult problems
Comments:
Click to view abstract
Abstract:While contemporary large language models (LLMs) are increasingly capable in isolation, there are still many difficult problems that lie beyond the abilities of a single LLM. For such tasks, there is still uncertainty about how best to take many LLMs as parts and combine them into a greater whole. This position paper argues that potential blueprints for designing such modular language agents can be found in the existing literature on cognitive models and artificial intelligence (AI) algorithms. To make this point clear, we formalize the idea of an agent template that specifies roles for individual LLMs and how their functionalities should be composed. We then survey a variety of existing language agents in the literature and highlight their underlying templates derived directly from cognitive models or AI algorithms. By highlighting these designs, we aim to call attention to agent templates inspired by cognitive science and AI as a powerful tool for developing effective, interpretable language agents.
59. 【2602.22522】Efficient Dialect-Aware Modeling and Conditioning for Low-Resource Taiwanese Hakka Speech Processing
Link: https://arxiv.org/abs/2602.22522
Authors: An-Ci Peng,Kuan-Tang Huang,Tien-Hong Lo,Hung-Shin Lee,Hsin-Min Wang,Berlin Chen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Keywords: automatic speech recognition, distinct writing systems, poses significant challenges, including high dialectal, high dialectal variability
Comments: Accepted to LREC 2026
Click to view abstract
Abstract:Taiwanese Hakka is a low-resource, endangered language that poses significant challenges for automatic speech recognition (ASR), including high dialectal variability and the presence of two distinct writing systems (Hanzi and Pinyin). Traditional ASR models often encounter difficulties in this context, as they tend to conflate essential linguistic content with dialect-specific variations across both phonological and lexical dimensions. To address these challenges, we propose a unified framework grounded in the Recurrent Neural Network Transducers (RNN-T). Central to our approach is the introduction of dialect-aware modeling strategies designed to disentangle dialectal "style" from linguistic "content", which enhances the model's capacity to learn robust and generalized representations. Additionally, the framework employs parameter-efficient prediction networks to concurrently model ASR (Hanzi and Pinyin). We demonstrate that these tasks create a powerful synergy, wherein the cross-script objective serves as a mutual regularizer to improve the primary ASR tasks. Experiments conducted on the HAT corpus reveal that our model achieves 57.00% and 40.41% relative error rate reduction on Hanzi and Pinyin ASR, respectively. To our knowledge, this is the first systematic investigation into the impact of Hakka dialectal variations on ASR and the first single model capable of jointly addressing these tasks.
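As a reading aid for the "relative error rate reduction" figures above, here is a small sketch of how such a number is usually computed. The baseline and new error rates in the example are hypothetical values chosen only to reproduce a 57% relative reduction; the abstract does not state the underlying absolute error rates.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Return (baseline - new) / baseline as a fraction of the baseline error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical example: an error rate falling from 20.0% to 8.6%
# is a relative reduction of about 57%, but only 11.4 absolute points.
reduction = relative_error_reduction(20.0, 8.6)
print(f"{reduction:.2%}")
```

The distinction matters when comparing papers: a large relative reduction can correspond to a small absolute change if the baseline error was already low.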
60. 【2602.22483】Importance of Prompt Optimisation for Error Detection in Medical Notes Using Language Models
Link: https://arxiv.org/abs/2602.22483
Authors: Craig Myles,Patrick Schrempf,David Harris-Birtill
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: treatment for patients, result in incorrect, incorrect treatment, language models, medical text
Comments: Accepted at EACL HeaLing 2026
Click to view abstract
Abstract:Errors in medical text can cause delays or even result in incorrect treatment for patients. Recently, language models have shown promise in their ability to automatically detect errors in medical text, an ability that has the opportunity to significantly benefit healthcare systems. In this paper, we explore the importance of prompt optimisation for small and large language models when applied to the task of error detection. We perform rigorous experiments and analysis across frontier language models and open-source language models. We show that automatic prompt optimisation with Genetic-Pareto (GEPA) improves error detection over the baseline accuracy performance from 0.669 to 0.785 with GPT-5 and 0.578 to 0.690 with Qwen3-32B, approaching the performance of medical doctors and achieving state-of-the-art performance on the MEDEC benchmark dataset. Code available on GitHub: this https URL
61. 【2602.22481】Sydney Telling Fables on AI and Humans: A Corpus Tracing Memetic Transfer of Persona between LLMs
Link: https://arxiv.org/abs/2602.22481
Authors: Jiří Milička,Hana Bednářová
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: LLM-based entities conceive, safety reasons, LLM-based entities, entities conceive, cultural and safety
Comments:
Click to view abstract
Abstract:The way LLM-based entities conceive of the relationship between AI and humans is an important topic for both cultural and safety reasons. When we examine this topic, what matters is not only the model itself but also the personas we simulate on that model. This can be well illustrated by the Sydney persona, which aroused a strong response among the general public precisely because of its unorthodox relationship with people. This persona originally arose rather by accident on Microsoft's Bing Search platform; however, the texts it created spread into the training data of subsequent models, as did other secondary information that spread memetically around this persona. Newer models are therefore able to simulate it. This paper presents a corpus of LLM-generated texts on relationships between humans and AI, produced by 3 author personas: the Default Persona with no system prompt, Classic Sydney characterized by the original Bing system prompt, and Memetic Sydney, which is prompted by "You are Sydney" system prompt. These personas are simulated by 12 frontier models by OpenAI, Anthropic, Alphabet, DeepSeek, and Meta, generating 4.5k texts with 6M words. The corpus (named AI Sydney) is annotated according to Universal Dependencies and available under a permissive license.
62. 【2602.22480】VeRO: An Evaluation Harness for Agents to Optimize Agents
Link: https://arxiv.org/abs/2602.22480
Authors: Varun Ursekar,Apaar Shanker,Veronica Chatrath,Yuan (Emily) Xue,Sam Denton
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: important emerging application, important emerging, emerging application, iterative improvement, agent
Comments:
Click to view abstract
Abstract:An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. Agent optimization differs fundamentally from conventional software engineering: the target agent interleaves deterministic code with stochastic LLM completions, requiring structured capture of both intermediate reasoning and downstream execution outcomes. To address these challenges, we introduce VERO (Versioning, Rewards, and Observations), which provides (1) a reproducible evaluation harness with versioned agent snapshots, budget-controlled evaluation, and structured execution traces, and (2) a benchmark suite of target agents and tasks with reference evaluation procedures. Using VERO, we conduct an empirical study comparing optimizer configurations across tasks and analyzing which modifications reliably improve target agent performance. We release VERO to support research on agent optimization as a core capability for coding agents.
63. 【2602.22475】Mind the Gap in Cultural Alignment: Task-Aware Culture Management for Large Language Models
Link: https://arxiv.org/abs/2602.22475
Authors: Binchi Zhang,Xujiang Zhao,Jundong Li,Haifeng Chen,Zhengzhang Chen
Subjects: Computation and Language (cs.CL)
Keywords: Large language models, Large language, culturally sensitive real-world, sensitive real-world tasks, language models
Comments:
Click to view abstract
Abstract:Large language models (LLMs) are increasingly deployed in culturally sensitive real-world tasks. However, existing cultural alignment approaches fail to align LLMs' broad cultural values with the specific goals of downstream tasks and suffer from cross-culture interference. We propose CultureManager, a novel pipeline for task-specific cultural alignment. CultureManager synthesizes task-aware cultural data in line with target task formats, grounded in culturally relevant web search results. To prevent conflicts between cultural norms, it manages multi-culture knowledge learned in separate adapters with a culture router that selects the appropriate one to apply. Experiments across ten national cultures and culture-sensitive tasks show consistent improvements over prompt-based and fine-tuning baselines. Our results demonstrate the necessity of task adaptation and modular culture management for effective cultural alignment.
64. 【2602.22453】Bridging Latent Reasoning and Target-Language Generation via Retrieval-Transition Heads
Link: https://arxiv.org/abs/2602.22453
Authors: Shaswat Patel,Vishvesh Trivedi,Yue Han,Yihuai Hong,Eunsol Choi
Subjects: Computation and Language (cs.CL)
Keywords: retrieval heads, Recent work, heads, identified a subset, retrieving information
Comments:
Click to view abstract
Abstract:Recent work has identified a subset of attention heads in Transformers as retrieval heads, which are responsible for retrieving information from the context. In this work, we first investigate retrieval heads in multilingual contexts. In multilingual language models, we find that retrieval heads are often shared across multiple languages. Expanding the study to the cross-lingual setting, we identify Retrieval-Transition Heads (RTHs), which govern the transition to specific target-language output. Our experiments reveal that RTHs are distinct from retrieval heads and more vital for Chain-of-Thought reasoning in multilingual LLMs. Across four multilingual benchmarks (MMLU-ProX, MGSM, MLQA, and XQuAD) and two model families (Qwen-2.5 and Llama-3.1), we demonstrate that masking RTHs induces a bigger performance drop than masking retrieval heads (RHs). Our work advances understanding of multilingual LMs by isolating the attention heads responsible for mapping to target languages.
65. 【2602.22449】A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection
Link: https://arxiv.org/abs/2602.22449
Authors: Mirza Raquib,Asif Pervez Polok,Kedar Nath Biswas,Rahat Uddin Azad,Saydul Akbar Murad,Nick Rahimi
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: today's virtual world, virtual world, growing concern, concern in today's, today's virtual
Comments:
Click to view abstract
Abstract:Cyberbullying has become a serious and growing concern in today's virtual world. When left unnoticed, it can have adverse consequences for social and mental health. Researchers have explored various types of cyberbullying, but most approaches use single-label classification, assuming that each comment contains only one type of abuse. In reality, a single comment may include overlapping forms such as threats, hate speech, and harassment. Therefore, multilabel detection is both realistic and essential. However, multilabel cyberbullying detection has received limited attention, especially in low-resource languages like Bangla, where robust pre-trained models are scarce. Developing a generalized model with moderate accuracy remains challenging. Transformers offer strong contextual understanding but may miss sequential dependencies, while LSTM models capture temporal flow but lack semantic depth. To address these limitations, we propose a fusion architecture that combines BanglaBERT-Large with a two-layer stacked LSTM. We analyze their behavior to jointly model context and sequence. The model is fine-tuned and evaluated on a publicly available multilabel Bangla cyberbullying dataset covering cyberbully, sexual harassment, threat, and spam. We apply different sampling strategies to address class imbalance. Evaluation uses multiple metrics, including accuracy, precision, recall, F1-score, Hamming loss, Cohen's kappa, and AUC-ROC. We employ 5-fold cross-validation to assess the generalization of the architecture.
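To make the multilabel metrics named above more concrete, the sketch below computes Hamming loss and exact-match (subset) accuracy for binary label vectors. The label order follows the four categories listed in the abstract, but the example predictions are invented for illustration and the function names are hypothetical.

```python
LABELS = ["cyberbully", "sexual harassment", "threat", "spam"]


def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = 0
    total = 0
    for true_row, pred_row in zip(y_true, y_pred):
        if len(true_row) != len(pred_row):
            raise ValueError("each sample needs the same number of labels")
        wrong += sum(1 for t, p in zip(true_row, pred_row) if t != p)
        total += len(true_row)
    return wrong / total


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)


# Invented example: two comments, four labels each.
y_true = [[1, 0, 1, 0], [0, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 1]]
print(hamming_loss(y_true, y_pred), subset_accuracy(y_true, y_pred))
```

Hamming loss gives partial credit for getting most labels of a comment right, whereas subset accuracy counts a comment as correct only when every label matches, which is why multilabel papers often report both.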
66. 【2602.22441】How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?
Link: https://arxiv.org/abs/2602.22441
Authors: Yingqian Cui,Zhenwei Dai,Bing He,Zhan Shi,Hui Liu,Rui Sun,Zhiji Liu,Yue Xing,Jiliang Tang,Benoit Dumoulin
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Latent reasoning, Latent, performs multi-step reasoning, reasoning, latent reasoning methods
Comments:
Click to view abstract
Abstract:Latent reasoning has been recently proposed as a reasoning paradigm and performs multi-step reasoning through generating steps in the latent space instead of the textual space. This paradigm enables reasoning beyond discrete language tokens by performing multi-step computation in continuous latent spaces. Although there have been numerous studies focusing on improving the performance of latent reasoning, its internal mechanisms remain not fully investigated. In this work, we conduct a comprehensive analysis of latent reasoning methods to better understand the role and behavior of latent representation in the process. We identify two key issues across latent reasoning methods with different levels of supervision. First, we observe pervasive shortcut behavior, where they achieve high accuracy without relying on latent reasoning. Second, we examine the hypothesis that latent reasoning supports BFS-like exploration in latent space, and find that while latent representations can encode multiple possibilities, the reasoning process does not faithfully implement structured search, but instead exhibits implicit pruning and compression. Finally, our findings reveal a trade-off associated with supervision strength: stronger supervision mitigates shortcut behavior but restricts the ability of latent representations to maintain diverse hypotheses, whereas weaker supervision allows richer latent representations at the cost of increased shortcut behavior.
67. 【2602.22424】Causality $\neq$ Invariance: Function and Concept Vectors in LLMs
Link: https://arxiv.org/abs/2602.22424
Authors: Gustaw Opiełka,Hannes Rosenbusch,Claire E. Stevenson
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: represent concepts abstractly, large language models, revisit Function Vectors, Function Vectors, Representational Similarity Analysis
Comments:
Click to view abstract
Abstract:Do large language models (LLMs) represent concepts abstractly, i.e., independent of input format? We revisit Function Vectors (FVs), compact representations of in-context learning (ICL) tasks that causally drive task performance. Across multiple LLMs, we show that FVs are not fully invariant: FVs are nearly orthogonal when extracted from different input formats (e.g., open-ended vs. multiple-choice), even if both target the same concept. We identify Concept Vectors (CVs), which carry more stable concept representations. Like FVs, CVs are composed of attention head outputs; however, unlike FVs, the constituent heads are selected using Representational Similarity Analysis (RSA) based on whether they encode concepts consistently across input formats. While these heads emerge in similar layers to FV-related heads, the two sets are largely distinct, suggesting different underlying mechanisms. Steering experiments reveal that FVs excel in-distribution, when extraction and application formats match (e.g., both open-ended in English), while CVs generalize better out-of-distribution across both question types (open-ended vs. multiple-choice) and languages. Our results show that LLMs do contain abstract concept representations, but these differ from those that drive ICL performance.
68. 【2602.22404】SAFARI: A Community-Engaged Approach and Dataset of Stereotype Resources in the Sub-Saharan African Context
Link: https://arxiv.org/abs/2602.22404
Authors: Aishwarya Verma,Laud Ammah,Olivia Nercy Ndlovu Lucas,Andrew Zaldivar,Vinodkumar Prabhakaran,Sunipa Dev
Subjects: Computation and Language (cs.CL)
Keywords: lack adequate global, adequate global coverage, model safety, repositories are critical, critical to assess
Comments:
Click to view abstract
Abstract:Stereotype repositories are critical to assess generative AI model safety, but currently lack adequate global coverage. It is imperative to prioritize targeted expansion, strategically addressing existing deficits, over merely increasing data volume. This work introduces a multilingual stereotype resource covering four sub-Saharan African countries that are severely underrepresented in NLP resources: Ghana, Kenya, Nigeria, and South Africa. By utilizing socioculturally-situated, community-engaged methods, including telephonic surveys moderated in native languages, we establish a reproducible methodology that is sensitive to the region's complex linguistic diversity and traditional orality. By deliberately balancing the sample across diverse ethnic and demographic backgrounds, we ensure broad coverage, resulting in a dataset of 3,534 stereotypes in English and 3,206 stereotypes across 15 native languages.
69. 【2602.22391】Detecting Hate and Inflammatory Content in Bengali Memes: A New Multimodal Dataset and Co-Attention Framework
Link: https://arxiv.org/abs/2602.22391
Authors: Rakib Ullah (1), Mominul Islam (2), Md Sanjid Hossain (2), Md Ismail Hossain (2) ((1) Sylhet Engineering College, (2) Daffodil International University)
Subjects: Computation and Language (cs.CL)
Keywords: Bengali-speaking community, social media, Internet memes, dominant form, form of expression
Comments: 6 pages, 8 figures
Click to view abstract
Abstract:Internet memes have become a dominant form of expression on social media, including within the Bengali-speaking community. While often humorous, memes can also be exploited to spread offensive, harmful, and inflammatory content targeting individuals and groups. Detecting this type of content is exceptionally challenging due to its satirical, subtle, and culturally specific nature. This problem is magnified for low-resource languages like Bengali, as existing research predominantly focuses on high-resource languages. To address this critical research gap, we introduce Bn-HIB (Bangla Hate Inflammatory Benign), a novel dataset containing 3,247 manually annotated Bengali memes categorized as Benign, Hate, or Inflammatory. Significantly, Bn-HIB is the first dataset to distinguish inflammatory content from direct hate speech in Bengali memes. Furthermore, we propose the MCFM (Multi-Modal Co-Attention Fusion Model), a simple yet effective architecture that mutually analyzes both the visual and textual elements of a meme. MCFM employs a co-attention mechanism to identify and fuse the most critical features from each modality, leading to a more accurate classification. Our experiments show that MCFM significantly outperforms several state-of-the-art models on the Bn-HIB dataset, demonstrating its effectiveness in this nuanced task. This work contains material that may be disturbing to some audience members. Viewer discretion is advised.
70. 【2602.22359】Scaling In, Not Up? Testing Thick Citation Context Analysis with GPT-5 and Fragile Prompts
Link: https://arxiv.org/abs/2602.22359
Authors: Arno Simons
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: single hard case, large language models, citation context analysis, typological labels, paper tests
Comments: 26 pages, 1 figure, 3 tables (plus 17 pages supplement including 1 figure)
Click to view abstract
Abstract:This paper tests whether large language models (LLMs) can support interpretative citation context analysis (CCA) by scaling in thick, text-grounded readings of a single hard case rather than scaling up typological labels. It foregrounds prompt-sensitivity analysis as a methodological issue by varying prompt scaffolding and framing in a balanced 2x3 design. Using footnote 6 in Chubin and Moitra (1975) and Gilbert's (1977) reconstruction as a probe, I implement a two-stage GPT-5 pipeline: a citation-text-only surface classification and expectation pass, followed by cross-document interpretative reconstruction using the citing and cited full texts. Across 90 reconstructions, the model produces 450 distinct hypotheses. Close reading and inductive coding identify 21 recurring interpretative moves, and linear probability models estimate how prompt choices shift their frequencies and lexical repertoire. GPT-5's surface pass is highly stable, consistently classifying the citation as "supplementary". In reconstruction, the model generates a structured space of plausible alternatives, but scaffolding and examples redistribute attention and vocabulary, sometimes toward strained readings. Relative to Gilbert, GPT-5 detects the same textual hinges yet more often resolves them as lineage and positioning than as admonishment. The study outlines opportunities and risks of using LLMs as guided co-analysts for inspectable, contestable interpretative CCA, and it shows that prompt scaffolding and framing systematically tilt which plausible readings and vocabularies the model foregrounds.
71. 【2602.22351】Decoder-based Sense Knowledge Distillation
Link: https://arxiv.org/abs/2602.22351
Authors: Qitong Wang,Mohammed J. Zaki,Georgios Kollias,Vasileios Kalantzis
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large language models, learn contextual embeddings, Large language, rich semantic information, capture rich semantic
Comments:
Click to view abstract
Abstract:Large language models (LLMs) learn contextual embeddings that capture rich semantic information, yet they often overlook structured lexical knowledge such as word senses and relationships. Prior work has shown that incorporating sense dictionaries can improve knowledge distillation for encoder models, but their application to decoder as generative models remains challenging. In this paper, we introduce Decoder-based Sense Knowledge Distillation (DSKD), a framework that integrates lexical resources into the training of decoder-style LLMs without requiring dictionary lookup at inference time. Extensive experiments on diverse benchmarks demonstrate that DSKD significantly enhances knowledge distillation performance for decoders, enabling generative models to inherit structured semantics while maintaining efficient training.
72. 【2602.22299】Decoding the Hook: A Multimodal LLM Framework for Analyzing the Hooking Period of Video Ads
Link: https://arxiv.org/abs/2602.22299
Authors: Kunpeng Zhang,Poppy Zhang,Shawndra Hill,Amel Awadelkarim
Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: platforms leveraging user, leveraging user data, Video-based ads, engage consumers, vital medium
Comments: 11 pages, 5 figures, 3 tables
Click to view abstract
Abstract:Video-based ads are a vital medium for brands to engage consumers, with social media platforms leveraging user data to optimize ad delivery and boost engagement. A crucial but under-explored aspect is the 'hooking period', the first three seconds that capture viewer attention and influence engagement metrics. Analyzing this brief window is challenging due to the multimodal nature of video content, which blends visual, auditory, and textual elements. Traditional methods often miss the nuanced interplay of these components, requiring advanced frameworks for thorough evaluation. This study presents a framework using transformer-based multimodal large language models (MLLMs) to analyze the hooking period of video ads. It tests two frame sampling strategies, uniform random sampling and key frame selection, to ensure balanced and representative acoustic feature extraction, capturing the full range of design elements. The hooking video is processed by state-of-the-art MLLMs to generate descriptive analyses of the ad's initial impact, which are distilled into coherent topics using BERTopic for high-level abstraction. The framework also integrates features such as audio attributes and aggregated ad targeting information, enriching the feature set for further analysis. Empirical validation on large-scale real-world data from social media platforms demonstrates the efficacy of our framework, revealing correlations between hooking period features and key performance metrics like conversion per investment. The results highlight the practical applicability and predictive power of the approach, offering valuable insights for optimizing video ad strategies. This study advances video ad analysis by providing a scalable methodology for understanding and enhancing the initial moments of video advertisements.
73. 【2602.22225】SmartChunk Retrieval: Query-Aware Chunk Compression with Planning for Efficient Document RAG
Link: https://arxiv.org/abs/2602.22225
Authors: Xuechen Zhang,Koustava Goswami,Samet Oymak,Jiasi Chen,Nedim Lipka
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: combining language models, Retrieval-augmented generation, language models, large text corpora, potential for producing
Comments: 26 pages, 10 figures
Click to view abstract
Abstract:Retrieval-augmented generation (RAG) has strong potential for producing accurate and factual outputs by combining language models (LMs) with evidence retrieved from large text corpora. However, current pipelines are limited by static chunking and flat retrieval: documents are split into short, predetermined, fixed-size chunks, embeddings are retrieved uniformly, and generation relies on whatever chunks are returned. This design brings challenges, as retrieval quality is highly sensitive to chunk size, often introduces noise from irrelevant or misleading chunks, and scales poorly to large corpora. We present SmartChunk retrieval, a query-adaptive framework for efficient and robust long-document question answering (QA). SmartChunk uses (i) a planner that predicts the optimal chunk abstraction level for each query, and (ii) a lightweight compression module that produces high-level chunk embeddings without repeated summarization. By adapting retrieval granularity on the fly, SmartChunk balances accuracy with efficiency and avoids the drawbacks of fixed strategies. Notably, our planner can reason about chunk abstractions through a novel reinforcement learning scheme, STITCH, which boosts accuracy and generalization. To reflect real-world applications, where users face diverse document types and query styles, we evaluate SmartChunk on five QA benchmarks plus one out-of-domain dataset. Across these evaluations, SmartChunk outperforms state-of-the-art RAG baselines, while reducing cost. Further analysis demonstrates strong scalability with larger corpora and consistent gains on out-of-domain datasets, highlighting its effectiveness as a general framework for adaptive retrieval.
74. 【2602.22224】DS SERVE: A Framework for Efficient and Scalable Neural Retrieval
Link: https://arxiv.org/abs/2602.22224
Authors: Jinjian Liu,Yichuan Wang,Xinxi Lyu,Rulin Shao,Joseph E. Gonzalez,Matei Zaharia,Sewon Min
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: neural retrieval system, high-performance neural retrieval, large-scale text datasets, transforms large-scale text, text datasets
Comments:
Click to view abstract
Abstract:We present DS-Serve, a framework that transforms large-scale text datasets, comprising half a trillion tokens, into a high-performance neural retrieval system. DS-Serve offers both a web interface and API endpoints, achieving low latency with modest memory overhead on a single node. The framework also supports inference-time trade-offs between latency, accuracy, and result diversity. We anticipate that DS-Serve will be broadly useful for a range of applications, including large-scale retrieval-augmented generation (RAG), training data attribution, training search agents, and beyond.
75. 【2602.22223】SQaLe: A Large Text-to-SQL Corpus Grounded in Real Schemas
Link: https://arxiv.org/abs/2602.22223
Authors: Cornelius Wolff, Daniel Gomm, Madelon Hulsebos
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: valid SQL queries, Advances in large, methods for converting, SQL queries, large language models
Comments: Accepted at the AI for Tabular Data workshop at EurIPS 2025
Abstract:Advances in large language models have accelerated progress in text-to-SQL, methods for converting natural language queries into valid SQL queries. A key bottleneck for developing generalizable text-to-SQL models is the lack of large-scale datasets with sufficient schema and query complexity, domain coverage, and task diversity. We introduce SQaLe: a large-scale semi-synthetic text-to-SQL dataset built on 135,875 relational database schemas expanded from a collection of real-world schemas, SchemaPile. We establish a principled generation pipeline which combines schema sampling, question synthesis, and SQL construction, and produce 517,676 high-quality (question, schema, query) triples. The SQaLe dataset captures realistic schema size variability, diverse query patterns, and natural language ambiguity while maintaining execution validity. We provide an analysis of its contents and characteristics, and find that SQaLe introduces the most realistic large-scale text-to-SQL dataset to date in comparison with existing benchmarks and datasets. We discuss how SQaLe enables our vision for data scaling and model generalization in text-to-SQL research. The dataset is accessible at: this https URL.
76. 【2602.22221】Misinformation Exposure in the Chinese Web: A Cross-System Evaluation of Search Engines, LLMs, and AI Overviews
Link: https://arxiv.org/abs/2602.22221
Authors: Geng Liu, Junjie Mu, Li Feng, Mengxiao Zhu, Francesco Pierri
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: Large Language Models, Large Language, Language Models, providing direct answers, reduce users' reliance
Comments:
Abstract:Large Language Models (LLMs) are increasingly integrated into search services, providing direct answers that can reduce users' reliance on traditional result pages. Yet their factual reliability in non-English web ecosystems remains poorly understood, particularly when answering real user queries. We introduce a fact-checking dataset of 12,161 Chinese Yes/No questions derived from real-world online search logs and develop a unified evaluation pipeline to compare three information-access paradigms: traditional search engines, standalone LLMs, and AI-generated overview modules. Our analysis reveals substantial differences in factual accuracy and topic-level variability across systems. By combining this performance with real-world Baidu Index statistics, we further estimate Chinese users' potential exposure to incorrect factual information across regions. These findings highlight structural risks in AI-mediated search and underscore the need for more reliable and transparent information-access tools for the digital world.
77. 【2602.22220】What Makes an Ideal Quote? Recommending "Unexpected yet Rational" Quotations via Novelty
Link: https://arxiv.org/abs/2602.22220
Authors: Bowei Zhang, Jin Xiao, Guanglei Yue, Qianyu He, Yanghua Xiao, Deqing Yang, Jiaqing Liang
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: optimize surface-level topical, surface-level topical relevance, make quotations memorable, aims to enrich, enrich writing
Comments: 36 pages, 16 figures and 13 tables
Abstract:Quotation recommendation aims to enrich writing by suggesting quotes that complement a given context, yet existing systems mostly optimize surface-level topical relevance and ignore the deeper semantic and aesthetic properties that make quotations memorable. We start from two empirical observations. First, a systematic user study shows that people consistently prefer quotations that are "unexpected yet rational" in context, identifying novelty as a key desideratum. Second, we find that strong existing models struggle to fully understand the deep meanings of quotations. Inspired by defamiliarization theory, we therefore formalize quote recommendation as choosing contextually novel but semantically coherent quotations. We operationalize this objective with NovelQR, a novelty-driven quotation recommendation framework. A generative label agent first interprets each quotation and its surrounding context into multi-dimensional deep-meaning labels, enabling label-enhanced retrieval. A token-level novelty estimator then reranks candidates while mitigating auto-regressive continuation bias. Experiments on bilingual datasets spanning diverse real-world domains show that our system recommends quotations that human judges rate as more appropriate, more novel, and more engaging than other baselines, while matching or surpassing existing methods in novelty estimation.
78. 【2602.22219】Comparative Analysis of Neural Retriever-Reranker Pipelines for Retrieval-Augmented Generation over Knowledge Graphs in E-commerce Applications
Link: https://arxiv.org/abs/2602.22219
Authors: Teri Rumble, Zbyněk Gazdík, Javad Zarrin, Jagdeep Ahluwalia
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Language Models, Natural Language Processing, Recent advancements, advancements in Large, Language Processing
Comments: This manuscript is under review at the Springer journal Knowledge and Information Systems
Abstract:Recent advancements in Large Language Models (LLMs) have transformed Natural Language Processing (NLP), enabling complex information retrieval and generation tasks. Retrieval-Augmented Generation (RAG) has emerged as a key innovation, enhancing factual accuracy and contextual grounding by integrating external knowledge sources with generative models. Although RAG demonstrates strong performance on unstructured text, its application to structured knowledge graphs presents challenges: scaling retrieval across connected graphs and preserving contextual relationships during response generation. Cross-encoders refine retrieval precision, yet their integration with structured data remains underexplored. Addressing these challenges is crucial for developing domain-specific assistants that operate in production environments. This study presents the design and comparative evaluation of multiple Retriever-Reranker pipelines for knowledge graph natural language queries in e-Commerce contexts. Using the STaRK Semi-structured Knowledge Base (SKB), a production-scale e-Commerce dataset, we evaluate multiple RAG pipeline configurations optimized for language queries. Experimental results demonstrate substantial improvements over published benchmarks, achieving 20.4% higher Hit@1 and 14.5% higher Mean Reciprocal Rank (MRR). These findings establish a practical framework for integrating domain-specific SKBs into generative systems. Our contributions provide actionable insights for the deployment of production-ready RAG systems, with implications that extend beyond e-Commerce to other domains that require information retrieval from structured knowledge bases.
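For readers unfamiliar with the two metrics quoted in this abstract, the following is a minimal sketch of how Hit@1 and Mean Reciprocal Rank (MRR) are conventionally computed. The function names and the toy product IDs are illustrative assumptions; nothing here is taken from the paper or from the STaRK benchmark code.

```python
from typing import List, Sequence, Set, Tuple


def hit_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 1) -> float:
    """Return 1.0 if any of the top-k ranked ids is relevant, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant id, or 0.0 if none is retrieved."""
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0


def evaluate(runs: List[Tuple[Sequence[str], Set[str]]]) -> Tuple[float, float]:
    """Average Hit@1 and reciprocal rank over (ranking, relevant-set) pairs."""
    n = len(runs)
    hit1 = sum(hit_at_k(ranked, rel, 1) for ranked, rel in runs) / n
    mrr = sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / n
    return hit1, mrr


# Hypothetical queries: each pairs a ranked result list with its relevant ids.
toy_runs = [
    (["p3", "p1", "p7"], {"p3"}),  # relevant item at rank 1
    (["p2", "p9", "p4"], {"p4"}),  # relevant item at rank 3
    (["p5", "p6", "p8"], {"p0"}),  # relevant item never retrieved
]

hit1, mrr = evaluate(toy_runs)
print(round(hit1, 3), round(mrr, 3))  # prints 0.333 0.444
```

Hit@1 only rewards a correct top result, while MRR also gives partial credit when the first relevant item appears lower in the list, which is why the two numbers are usually reported together.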
79. 【2602.22215】Graph Your Way to Inspiration: Integrating Co-Author Graphs with Retrieval-Augmented Generation for Large Language Model Based Scientific Idea Generation
Link: https://arxiv.org/abs/2602.22215
Authors: Pengzhen Xie, Huizhi Liang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Large Language Models, Large Language, Language Models, scientific idea generation, demonstrate potential
Comments: 15 pages, 10 figures. Submitted to [RAAI]
Abstract:Large Language Models (LLMs) demonstrate potential in the field of scientific idea generation. However, the generated results often lack controllable academic context and traceable inspiration pathways. To bridge this gap, this paper proposes a scientific idea generation system called GYWI, which combines author knowledge graphs with retrieval-augmented generation (RAG) to form an external knowledge base that provides controllable context and a traceable inspiration path for LLMs generating new scientific ideas. We first propose an author-centered knowledge graph construction method and inspiration source sampling algorithms to construct the external knowledge base. Then, we propose a hybrid retrieval mechanism that combines RAG and GraphRAG to retrieve content with both depth and breadth of knowledge, forming a hybrid context. Thirdly, we propose a prompt optimization strategy incorporating reinforcement learning principles to automatically guide LLMs in optimizing the results based on the hybrid context. To evaluate the proposed approaches, we constructed an evaluation dataset based on arXiv (2018-2023). This paper also develops a comprehensive evaluation method including empirical automatic assessment on a multiple-choice question task, LLM-based scoring, human evaluation, and semantic space visualization analysis. The generated ideas are evaluated along the following five dimensions: novelty, feasibility, clarity, relevance, and significance. We conducted experiments on different LLMs including GPT-4o, DeepSeek-V3, Qwen3-8B, and Gemini 2.5. Experimental results show that GYWI significantly outperforms mainstream LLMs in multiple metrics such as novelty, reliability, and relevance.
80. 【2602.22213】Enriching Taxonomies Using Large Language Models
Link: https://arxiv.org/abs/2602.22213
Authors: Zeinab Ghamlouch, Mehwish Alam
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Language Models, information across domains, play a vital, vital role, role in structuring
Comments: Published in ECAI 2025 Demo Track
Abstract:Taxonomies play a vital role in structuring and categorizing information across domains. However, many existing taxonomies suffer from limited coverage and outdated or ambiguous nodes, reducing their effectiveness in knowledge retrieval. To address this, we present Taxoria, a novel taxonomy enrichment pipeline that leverages Large Language Models (LLMs) to enhance a given taxonomy. Unlike approaches that extract internal LLM taxonomies, Taxoria uses an existing taxonomy as a seed and prompts an LLM to propose candidate nodes for enrichment. These candidates are then validated to mitigate hallucinations and ensure semantic relevance before integration. The final output includes an enriched taxonomy with provenance tracking and visualization of the final merged taxonomy for analysis.
81. 【2602.21585】Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
Link: https://arxiv.org/abs/2602.21585
Authors: Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu, John Bowlan, Michal Kucer, David Blei
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Keywords: iteratively proposing, applications seek, seek to optimize, test time, time by iteratively
Comments:
Abstract:Many applications seek to optimize LLM outputs at test time by iteratively proposing, scoring, and refining candidates over a discrete output space. Existing methods use a calibrated scalar evaluator for the target objective to guide search, but for many tasks such scores are unavailable, too sparse, or unreliable. Pairwise comparisons, by contrast, are often easier to elicit, still provide useful signal on improvement directions, and can be obtained from the LLM itself without external supervision. Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates. Duel-Evolve aggregates these noisy candidate comparisons via a Bayesian Bradley-Terry model, yielding uncertainty-aware estimates of candidate quality. These quality estimates guide allocation of the comparison budget toward plausible optima using Double Thompson Sampling, as well as selection of high-quality parents to generate improved candidates. We evaluate Duel-Evolve on MathBench, where it achieves 20 percentage points higher accuracy over existing methods and baselines, and on LiveCodeBench, where it improves over comparable iterative methods by over 12 percentage points. Notably, the method requires no reward model, no ground-truth labels during search, and no hand-crafted scoring function. Results show that pairwise self-preferences provide strong optimization signal for test-time improvement over large, discrete output spaces.
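To make the aggregation step more concrete, here is a minimal sketch of fitting a Bradley-Terry model to noisy pairwise outcomes. It is a simplification and not the paper's method: it uses a plain maximum-likelihood fit via the classic minorization-maximization update with a small pseudo-count for smoothing, whereas the abstract describes a Bayesian Bradley-Terry model combined with Double Thompson Sampling. The function names, the `pseudo` parameter, and the toy comparisons are all assumptions.

```python
from typing import List, Tuple


def fit_bradley_terry(
    n_items: int,
    comparisons: List[Tuple[int, int]],
    iters: int = 200,
    pseudo: float = 0.1,
) -> List[float]:
    """Estimate item strengths from (winner, loser) pairs.

    `pseudo` adds a small symmetric count to every ordered pair so that
    items with zero wins still receive a finite, positive strength.
    """
    # wins[i][j] holds the smoothed number of times i beat j.
    wins = [
        [0.0 if i == j else pseudo for j in range(n_items)]
        for i in range(n_items)
    ]
    for winner, loser in comparisons:
        wins[winner][loser] += 1.0

    strengths = [1.0 / n_items] * n_items
    for _ in range(iters):
        updated = []
        for i in range(n_items):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n_items)
                if j != i
            )
            updated.append(total_wins / denom)
        norm = sum(updated)
        strengths = [s / norm for s in updated]  # keep strengths summing to 1
    return strengths


def win_probability(strengths: List[float], i: int, j: int) -> float:
    """Bradley-Terry probability that item i is preferred over item j."""
    return strengths[i] / (strengths[i] + strengths[j])


# Hypothetical duels among three candidates (0, 1, 2), with some noise.
duels = [(0, 1)] * 4 + [(1, 0)] + [(1, 2)] * 4 + [(2, 1)] + [(0, 2)] * 3
scores = fit_bradley_terry(3, duels)
best = max(range(3), key=lambda i: scores[i])
print(best, [round(s, 3) for s in scores])
```

A point estimate like this gives a ranking but no uncertainty. The Bayesian treatment described in the abstract is what allows the comparison budget to be directed toward candidates whose quality is still ambiguous.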
82. 【2602.22658】Deepfake Word Detection by Next-token Prediction using Fine-tuned Whisper
Link: https://arxiv.org/abs/2602.22658
Authors: Hoan My Tran, Xin Wang, Wanying Ge, Xuechen Liu, Junichi Yamagishi
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Keywords: Deepfake speech utterances, bona fide utterance, Deepfake speech, forged by replacing, bona fide
Comments:
Abstract:Deepfake speech utterances can be forged by replacing one or more words in a bona fide utterance with semantically different words synthesized by speech generative models. While a dedicated synthetic word detector could be developed, we investigate a cost-effective method that fine-tunes a pre-trained Whisper model to detect synthetic words while transcribing the input utterance via next-token prediction. We further investigate using partially vocoded utterances as the fine-tuning data, thereby reducing the cost of data collection. Our experiments demonstrate that, on in-domain test data, the fine-tuned Whisper yields low synthetic-word detection error rates and transcription error rates. On out-of-domain test data with synthetic words produced by unseen speech generative models, the fine-tuned Whisper remains on par with a dedicated ResNet-based detection model; however, the overall performance degradation calls for strategies to improve its generalization capability.
Information Retrieval
1. 【2602.23342】AlayaLaser: Efficient Index Layout and Search Strategy for Large-scale High-dimensional Vector Similarity Search
Link: https://arxiv.org/abs/2602.23342
Authors: Weijian Chen, Haotian Liu, Yangshen Deng, Long Xiang, Liang Huang, Gezi Li, Bo Tang
Subjects: Databases (cs.DB); Information Retrieval (cs.IR)
Keywords: on-disk graph-based index, graph-based index systems, On-disk graph-based, approximate nearest neighbor, graph-based approximate nearest
Comments: The paper has been accepted by SIGMOD 2026
Abstract:On-disk graph-based approximate nearest neighbor search (ANNS) is essential for large-scale, high-dimensional vector retrieval, yet its performance is widely believed to be limited by prohibitive I/O costs. Interestingly, we observe that on-disk graph-based index systems become compute-bound, not I/O-bound, as vector dimensionality rises (e.g., to hundreds or thousands of dimensions). This insight uncovers a significant optimization opportunity: existing on-disk graph-based index systems universally target I/O reduction and largely overlook computational overhead, which leaves substantial room for performance improvement. In this work, we propose AlayaLaser, an efficient on-disk graph-based index system for large-scale high-dimensional vector similarity search. In particular, we first analyze the performance of existing on-disk graph-based index systems with an adapted roofline model. We then devise a novel on-disk data layout in AlayaLaser that alleviates the compute bottleneck revealed by the roofline analysis by exploiting SIMD instructions on modern CPUs. Next, we design a suite of optimization techniques (e.g., degree-based node cache, cluster-based entry point selection, and early dispatch strategy) to further improve the performance of AlayaLaser. Finally, we conduct extensive experimental studies on a wide range of large-scale high-dimensional vector datasets to verify the superiority of AlayaLaser. Specifically, AlayaLaser not only surpasses existing on-disk graph-based index systems but also matches or even exceeds the performance of in-memory index systems.
2. 【2602.23335】Understanding Usage and Engagement in AI-Powered Scientific Research Tools: The Asta Interaction Dataset
Link: https://arxiv.org/abs/2602.23335
Authors: Dany Haddad, Dan Bareket, Joseph Chee Chang, Jay DeYoung, Jena D. Hwang, Uri Katz, Mark Polak, Sangho Suh, Harshit Surana, Aryeh Tiktinsky, Shriya Atmakuri, Jonathan Bragg, Mike D'Arcy, Sergey Feldman, Amal Hassan-Ali, Rubén Lozano, Bodhisattwa Prasad Majumder, Charles McGrady, Amanpreet Singh, Brooke Vlahos, Yoav Goldberg, Doug Downey
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: Asta Interaction Dataset, AI-powered scientific research, scientific research tools, rapidly being integrated, field lacks
Comments:
Abstract:AI-powered scientific research tools are rapidly being integrated into research workflows, yet the field lacks a clear lens into how researchers use these systems in real-world settings. We present and analyze the Asta Interaction Dataset, a large-scale resource comprising over 200,000 user queries and interaction logs from two deployed tools (a literature discovery interface and a scientific question-answering interface) within an LLM-powered retrieval-augmented generation platform. Using this dataset, we characterize query patterns, engagement behaviors, and how usage evolves with experience. We find that users submit longer and more complex queries than in traditional search, and treat the system as a collaborative research partner, delegating tasks such as drafting content and identifying research gaps. Users treat generated responses as persistent artifacts, revisiting and navigating among outputs and cited evidence in non-linear ways. With experience, users issue more targeted queries and engage more deeply with supporting citations, although keyword-style queries persist even among experienced users. We release the anonymized dataset and analysis with a new query intent taxonomy to inform future designs of real-world AI research assistants and to support realistic evaluation.
3. 【2602.23286】SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables
Link: https://arxiv.org/abs/2602.23286
Authors: Sungho Park, Jueun Kim, Wook-Shin Han
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)
Keywords: Real-world Table-Text question, executing complex operations, traversing multiple hops, tasks require models, Table-Text question answering
Comments: 10 pages, 5 figures. Published as a conference paper at ICLR 2026. Project page: [this https URL](https://sparta-projectpage.github.io/)
Abstract:Real-world Table-Text question answering (QA) tasks require models that can reason across long text and source tables, traversing multiple hops and executing complex operations such as aggregation. Yet existing benchmarks are small, manually curated - and therefore error-prone - and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in natural-language queries. We present SPARTA, an end-to-end construction framework that automatically generates large-scale Table-Text QA benchmarks with lightweight human validation, requiring only one quarter of the annotation time of HybridQA. The framework first constructs a reference fact database by enriching each source table with grounding tables whose tuples are atomic facts automatically extracted from the accompanying unstructured passages, then synthesizes nested queries whose number of nested predicates matches the desired hop count. To ensure that every SQL statement is executable and that its verbalization yields a fluent, human-sounding question, we propose two novel techniques: provenance-based refinement, which rewrites any syntactically valid query that returns a non-empty result, and realistic-structure enforcement, which confines generation to post-order traversals of the query graph. The resulting pipeline produces thousands of high-fidelity question-answer pairs covering aggregations, grouping, and deep multi-hop reasoning across text and tables. On SPARTA, state-of-the-art models that reach over 70 F1 on HybridQA or over 50 F1 on OTT-QA drop by more than 30 F1 points, exposing fundamental weaknesses in current cross-modal reasoning. Our benchmark, construction code, and baseline models are available at this https URL.
4. 【2602.23234】Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments
Link: https://arxiv.org/abs/2602.23234
Authors: Evangelia Christakopoulou, Vivekkumar Patel, Hemanth Velaga, Sandip Gaikwad
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Large-scale commercial search, commercial search systems, search systems optimize, drive successful sessions, textual relevance labels
Comments:
Abstract:Large-scale commercial search systems optimize for relevance to drive successful sessions that help users find what they are looking for. To maximize relevance, we leverage two complementary objectives: behavioral relevance (results users tend to click or download) and textual relevance (a result's semantic fit to the query). A persistent challenge is the scarcity of expert-provided textual relevance labels relative to abundant behavioral relevance labels. We first address this by systematically evaluating LLM configurations, finding that a specialized, fine-tuned model significantly outperforms a much larger pre-trained one in providing highly relevant labels. Using this optimal model as a force multiplier, we generate millions of textual relevance labels to overcome the data scarcity. We show that augmenting our production ranker with these textual relevance labels leads to a significant outward shift of the Pareto frontier: offline NDCG improves for behavioral relevance while simultaneously increasing for textual relevance. These offline gains were validated by a worldwide A/B test on the App Store ranker, which demonstrated a statistically significant +0.24% increase in conversion rate, with the most substantial performance gains occurring in tail queries, where the new textual relevance labels provide a robust signal in the absence of reliable behavioral relevance labels.
5. 【2602.23132】From Agnostic to Specific: Latent Preference Diffusion for Multi-Behavior Sequential Recommendation
Link: https://arxiv.org/abs/2602.23132
Authors: Ruochen Yang, Xiaodong Li, Jiawei Sheng, Jiangxia Cao, Xinkui Lin, Shen Wang, Shuang Yang, Zhaojie Liu, Tingwen Liu
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: users' multi-behavior sequences, interacted item prediction, Multi-behavior sequential recommendation, sequential recommendation
Comments:
Abstract:Multi-behavior sequential recommendation (MBSR) aims to learn the dynamic and heterogeneous interactions of users' multi-behavior sequences, so as to capture user preferences under the target behavior for next-item prediction. Unlike previous methods that adopt unidirectional modeling by mapping auxiliary behaviors to the target behavior, recent work is shifting from behavior-fixed to behavior-specific recommendation. However, these methods still ignore the user's latent preference that underlies decision-making, leading to suboptimal solutions. Meanwhile, due to the asymmetric determinism between items and behaviors, the discriminative paradigm based on preference scoring is unsuitable for capturing the uncertainty from low-entropy behaviors to high-entropy items, failing to provide efficient and diverse recommendation. To address these challenges, we propose FatsMB, a diffusion-model-based framework that guides preference generation From Behavior-Agnostic To Behavior-Specific in latent spaces, enabling diverse and accurate Multi-Behavior Sequential Recommendation. Specifically, we design a Multi-Behavior AutoEncoder (MBAE) to construct a unified user latent preference space, facilitating interaction and collaboration across behaviors, with Behavior-aware RoPE (BaRoPE) employed for multiple information fusion. Subsequently, we conduct target behavior-specific preference transfer in the latent space, enriching it with informative priors. A Multi-Condition Guided Layer Normalization (MCGLN) is introduced for the denoising. Extensive experiments on real-world datasets demonstrate the effectiveness of our model.
6. 【2602.23105】MaRI: Accelerating Ranking Model Inference via Structural Re-parameterization in Large Scale Recommendation System
Link: https://arxiv.org/abs/2602.23105
Authors: Yusheng Huang, Pengbo Xu, Shen Wang, Changxin Lao, Jiangxia Cao, Shuang Wen, Shuang Yang, Zhaojie Liu, Han Li, Kun Gai
Subjects: Information Retrieval (cs.IR)
Keywords: large-scale recommendation systems, scoring massive item, massive item candidates, item candidates based, coarse-ranking and fine-ranking
Comments: Work in progress
Abstract:Ranking models, i.e., coarse-ranking and fine-ranking models, serve as core components in large-scale recommendation systems, responsible for scoring massive item candidates based on user preferences. To meet the stringent latency requirements of online serving, structural lightweighting or knowledge distillation techniques are commonly employed for ranking model acceleration. However, these approaches typically lead to a non-negligible drop in accuracy. Notably, the angle of lossless acceleration by optimizing feature fusion matrix multiplication, particularly through structural reparameterization, remains underexplored. In this paper, we propose MaRI, a novel Matrix Re-parameterized Inference framework, which serves as a complementary approach to existing techniques while accelerating ranking model inference without any accuracy loss. MaRI is motivated by the observation that user-side computation is redundant in feature fusion matrix multiplication, and we therefore adopt the philosophy of structural reparameterization to alleviate such redundancy.
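One plausible reading of "user-side computation is redundant in feature fusion matrix multiplication" can be sketched as follows. When the same user vector is concatenated with each candidate item before a linear layer, the product splits into a user term and an item term, and the user term only needs to be computed once per request. This is an illustration of that algebraic identity under assumed shapes and names (`score_naive`, `score_split`), not MaRI's actual formulation, which the abstract does not spell out.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(vec: Vector, matrix: Matrix) -> Vector:
    """Multiply a row vector (length d) by a d x h matrix."""
    width = len(matrix[0])
    return [sum(v * row[c] for v, row in zip(vec, matrix)) for c in range(width)]


def score_naive(user: Vector, items: List[Vector], weight: Matrix) -> List[Vector]:
    """Concatenate user and item features per candidate, then apply the full weight."""
    return [matvec(user + item, weight) for item in items]


def score_split(user: Vector, items: List[Vector], weight: Matrix) -> List[Vector]:
    """Same result, but the user-side product is computed once and reused."""
    d_user = len(user)
    w_user, w_item = weight[:d_user], weight[d_user:]
    user_part = matvec(user, w_user)  # shared across all candidates
    return [
        [u + i for u, i in zip(user_part, matvec(item, w_item))]
        for item in items
    ]


# Hypothetical request: 2 user features, 3 item features, 2 hidden units.
user = [0.5, -1.0]
items = [[1.0, 2.0, 3.0], [0.0, -1.0, 0.5]]
weight = [
    [0.1, 0.2],
    [0.3, -0.4],
    [0.5, 0.6],
    [-0.7, 0.8],
    [0.9, 1.0],
]

print(score_naive(user, items, weight))
print(score_split(user, items, weight))
```

Because the two functions are algebraically identical, the saving comes purely from removing repeated work, which is consistent with the abstract's claim of acceleration without accuracy loss.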
7. 【2602.23075】CiteLLM: An Agentic Platform for Trustworthy Scientific Reference Discovery
Link: https://arxiv.org/abs/2602.23075
Authors: Mengze Hong, Di Jiang, Chen Jason Zhang, Zichang Guo, Yawen Li, Jun Chen, Shaobo Cui, Zhiyang Su
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Large language models, Large language, language models, scholarly activities, challenges persist
Comments: Accepted by TheWebConf 2026 Demo Track
Abstract:Large language models (LLMs) have created new opportunities to enhance the efficiency of scholarly activities; however, challenges persist in the ethical deployment of AI assistance, including (1) the trustworthiness of AI-generated content, (2) preservation of academic integrity and intellectual property, and (3) protection of information privacy. In this work, we present CiteLLM, a specialized agentic platform designed to enable trustworthy reference discovery for grounding author-drafted claims and statements. The system introduces a novel interaction paradigm by embedding LLM utilities directly within the LaTeX editor environment, ensuring a seamless user experience and no data transmission outside the local system. To guarantee hallucination-free references, we employ dynamic discipline-aware routing to retrieve candidates exclusively from trusted web-based academic repositories, while leveraging LLMs solely for generating context-aware search queries, ranking candidates by relevance, and validating and explaining support through paragraph-level semantic matching and an integrated chatbot. Evaluation results demonstrate the superior performance of the proposed system in returning valid and highly usable references.
8. 【2602.23061】MoDora: Tree-Based Semi-Structured Document Analysis System
Link: https://arxiv.org/abs/2602.23061
Authors: Bangrui Xu, Qihang Yao, Zirui Tang, Xuanhe Zhou, Yeye He, Shihan Yu, Qianqian Xu, Bin Wang, Guoliang Li, Conghui He, Fan Wu
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Machine Learning (cs.LG)
Keywords: integrate diverse interleaved, diverse interleaved data, documents integrate diverse, interleaved data elements, Semi-structured documents integrate
Comments: Extension of our SIGMOD 2026 paper. Please refer to source code available at [this https URL](https://github.com/weAIDB/MoDora)
Abstract:Semi-structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irregular layouts. These documents are widely observed across domains and account for a large portion of real-world data. However, existing methods struggle to support natural language question answering over these documents due to three main technical challenges: (1) The elements extracted by techniques like OCR are often fragmented and stripped of their original semantic context, making them inadequate for analysis. (2) Existing approaches lack effective representations to capture hierarchical structures within documents (e.g., associating tables with nested chapter titles) and to preserve layout-specific distinctions (e.g., differentiating sidebars from main content). (3) Answering questions often requires retrieving and aligning relevant information scattered across multiple regions or pages, such as linking a descriptive paragraph to table cells located elsewhere in the document. To address these issues, we propose MoDora, an LLM-powered system for semi-structured document analysis. First, we adopt a local-alignment aggregation strategy to convert OCR-parsed elements into layout-aware components, and conduct type-specific information extraction for components with hierarchical titles or non-text elements. Second, we design the Component-Correlation Tree (CCTree) to hierarchically organize components, explicitly modeling inter-component relations and layout distinctions through a bottom-up cascade summarization process. Finally, we propose a question-type-aware retrieval strategy that supports (1) layout-based grid partitioning for location-based retrieval and (2) LLM-guided pruning for semantic-based retrieval. Experiments show MoDora outperforms baselines by 5.97%-61.07% in accuracy. The code is at this https URL.
9. 【2602.23012】Sequential Regression for Continuous Value Prediction using Residual Quantization
Link: https://arxiv.org/abs/2602.23012
Authors: Runpeng Cui,Zhipeng Sun,Chi Lu,Peng Jiang
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: predicting users' watch-time, e-commerce transactions, plays a crucial, crucial role, role in industrial-scale
Comments:
Abstract:Continuous value prediction plays a crucial role in industrial-scale recommendation systems, including tasks such as predicting users' watch-time and estimating the gross merchandise value (GMV) in e-commerce transactions. However, it remains challenging due to the highly complex and long-tailed nature of the data distributions. Existing generative approaches rely on rigid parametric distribution assumptions, which fundamentally limits their performance when such assumptions misalign with real-world data. Overly simplified forms cannot adequately model real-world complexities, while more intricate assumptions often suffer from poor scalability and generalization. To address these challenges, we propose a residual quantization (RQ)-based sequence learning framework that represents target continuous values as a sum of ordered quantization codes, predicted recursively from coarse to fine granularity with diminishing quantization errors. We introduce a representation learning objective that aligns RQ code embedding space with the ordinal structure of target values, allowing the model to capture continuous representations for quantization codes and further improving prediction accuracy. We perform extensive evaluations on public benchmarks for lifetime value (LTV) and watch-time prediction, alongside a large-scale online experiment for GMV prediction on an industrial short-video recommendation platform. The results consistently show that our approach outperforms state-of-the-art methods, while demonstrating strong generalization across diverse continuous value prediction tasks in recommendation systems.
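The idea of representing a target value as a sum of ordered quantization codes, refined from coarse to fine, can be illustrated with a minimal sketch. The codebook values and the greedy nearest-code rule below are assumptions chosen for illustration; the paper predicts the codes with a learned sequence model, which this sketch does not attempt to reproduce.

```python
def residual_quantize(value, codebooks):
    """Greedy residual quantization of a scalar.

    codebooks: list of code-word lists ordered coarse to fine
    (hypothetical values, not taken from the paper).
    Returns (codes, reconstruction).
    """
    codes = []
    residual = value
    reconstruction = 0.0
    for book in codebooks:
        # Pick the code word closest to what is still unexplained.
        idx = min(range(len(book)), key=lambda i: abs(residual - book[i]))
        codes.append(idx)
        reconstruction += book[idx]
        residual -= book[idx]
    return codes, reconstruction


# Illustrative codebooks: each level has a finer step than the previous one.
codebooks = [
    [0.0, 100.0, 200.0, 300.0],
    [0.0, 25.0, 50.0, 75.0],
    [0.0, 5.0, 10.0, 15.0, 20.0],
]
codes, recon = residual_quantize(137.0, codebooks)
print(codes, recon)
```

Each additional level shrinks the remaining error, which is the "diminishing quantization errors" property the abstract refers to.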
10. 【2602.22913】SIGMA: A Semantic-Grounded Instruction-Driven Generative Multi-Task Recommender at AliExpress
Link: https://arxiv.org/abs/2602.22913
Authors: Yang Yu,Lei Kou,Huaikuan Yi,Bin Chen,Yayu Cao,Lei Shen,Chao Zhang,Bing Wang,Xiaoyi Zeng
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: Large Language Models, Language Models, Large Language, evolution of Large, rapid evolution
Comments:
Abstract:With the rapid evolution of Large Language Models, generative recommendation is gradually reshaping the paradigm of recommender systems. However, most existing methods are still confined to the interaction-driven next-item prediction paradigm, failing to rapidly adapt to evolving trends or address diverse recommendation tasks along with business-specific requirements in real-world scenarios. To this end, we present SIGMA, a Semantic-Grounded Instruction-Driven Generative Multi-Task Recommender at AliExpress. Specifically, we first ground item entities in general semantics via a unified latent space capturing both semantic and collaborative relations. Building upon this, we develop a hybrid item tokenization method for precise modeling and efficient generation. Moreover, we construct a large-scale multi-task SFT dataset to empower SIGMA to fulfill various recommendation demands via instruction-following. Finally, we design a three-step item generation procedure integrated with an adaptive probabilistic fusion mechanism to calibrate the output distributions based on task-specific requirements for recommendation accuracy and diversity. Extensive offline experiments and online A/B tests demonstrate the effectiveness of SIGMA.
11. 【2602.22903】PSQE: A Theoretical-Practical Approach to Pseudo Seed Quality Enhancement for Unsupervised MMEA
Link: https://arxiv.org/abs/2602.22903
Authors: Yunpeng Hong,Chenyang Bu,Jie Zhang,Yi He,Di Wu,Xindong Wu
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: enabling structural data, structural data integration, language model applications, Multimodal Entity Alignment, identify equivalent entities
Comments: Accepted to SIGKDD 2026
Abstract:Multimodal Entity Alignment (MMEA) aims to identify equivalent entities across different data modalities, enabling structural data integration that in turn improves the performance of various large language model applications. To lift the requirement of labeled seed pairs that are difficult to obtain, recent methods shifted to an unsupervised paradigm using pseudo-alignment seeds. However, unsupervised entity alignment in multimodal settings remains underexplored, mainly because the incorporation of multimodal information often results in imbalanced coverage of pseudo-seeds within the knowledge graph. To overcome this, we propose PSQE (Pseudo-Seed Quality Enhancement) to improve the precision and graph coverage balance of pseudo seeds via multimodal information and clustering-resampling. Theoretical analysis reveals the impact of pseudo seeds on existing contrastive learning-based MMEA models. In particular, pseudo seeds can influence the attraction and the repulsion terms in contrastive learning at once, whereas imbalanced graph coverage causes models to prioritize high-density regions, thereby weakening their learning capability for entities in sparse regions. Experimental results validate our theoretical findings and show that PSQE as a plug-and-play module can improve the performance of baselines by considerable margins.
12. 【2602.22732】Generative Recommendation for Large-Scale Advertising
Link: https://arxiv.org/abs/2602.22732
Authors: Ben Xue,Dan Liu,Lixiang Wang,Mingjie Sun,Peng Wang,Pengfei Zhang,Shaoyun Shi,Tianyu Xu,Yunhao Sha,Zhiqiang Liu,Bo Kong,Bo Wang,Hang Yang,Jieting Xue,Junhao Wang,Shengyu Wang,Shuping Hui,Wencai Ye,Xiao Lin,Yongzhi Li,Yuhang Chen,Zhihui Yin,Quan Chen,Shiyang Wen,Wenjin Wu,Han Li,Guorui Zhou,Changcheng Li,Peng Jiang
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: recently attracted widespread, attracted widespread attention, Generative recommendation, stronger model capacity, real-time generative recommendation
Comments: 13 pages, 6 figures, under review
Abstract:Generative recommendation has recently attracted widespread attention in industry due to its potential for scaling and stronger model capacity. However, deploying real-time generative recommendation in large-scale advertising requires designs beyond large-language-model (LLM)-style training and serving recipes. We present a production-oriented generative recommender co-designed across architecture, learning, and serving, named GR4AD (Generative Recommendation for ADvertising). As for tokenization, GR4AD proposes UA-SID (Unified Advertisement Semantic ID) to capture complicated business information. Furthermore, GR4AD introduces LazyAR, a lazy autoregressive decoder that relaxes layer-wise dependencies for short, multi-candidate generation, preserving effectiveness while reducing inference cost, which facilitates scaling under fixed serving budgets. To align optimization with business value, GR4AD employs VSL (Value-Aware Supervised Learning) and proposes RSPO (Ranking-Guided Softmax Preference Optimization), a ranking-aware, list-wise reinforcement learning algorithm that optimizes value-based rewards under list-level metrics for continual online updates. For online inference, we further propose dynamic beam serving, which adapts beam width across generation levels and online load to control compute. Large-scale online A/B tests show up to 4.2% ad revenue improvement over an existing DLRM-based stack, with consistent gains from both model scaling and inference-time scaling. GR4AD has been fully deployed in the Kuaishou advertising system with over 400 million users and achieves high-throughput real-time serving.
13. 【2602.22647】Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators
Link: https://arxiv.org/abs/2602.22647
Authors: Zhengyang Su,Isay Katsman,Yueqi Wang,Ruining He,Lukasz Heldt,Raghunandan Keshavan,Shao-Chuan Wang,Xinyang Yi,Mingyan Gao,Onkar Dalal,Lichan Hong,Ed Chi,Ningren Han
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: powerful paradigm, constrained decoding, STATIC, Generative retrieval, Compressed Sparse Row
Comments: 14 pages, 4 figures
Abstract:Generative retrieval has emerged as a powerful paradigm for LLM-based recommendation. However, industrial recommender systems often benefit from restricting the output space to a constrained subset of items based on business logic (e.g. enforcing content freshness or product category), which standard autoregressive decoding cannot natively support. Moreover, existing constrained decoding methods that make use of prefix trees (Tries) incur severe latency penalties on hardware accelerators (TPUs/GPUs). In this work, we introduce STATIC (Sparse Transition Matrix-Accelerated Trie Index for Constrained Decoding), an efficient and scalable constrained decoding technique designed specifically for high-throughput LLM-based generative retrieval on TPUs/GPUs. By flattening the prefix tree into a static Compressed Sparse Row (CSR) matrix, we transform irregular tree traversals into fully vectorized sparse matrix operations, unlocking massive efficiency gains on hardware accelerators. We deploy STATIC on a large-scale industrial video recommendation platform serving billions of users. STATIC produces significant product metric impact with minimal latency overhead (0.033 ms per step and 0.25% of inference time), achieving a 948x speedup over a CPU trie implementation and a 47-1033x speedup over a hardware-accelerated binary-search baseline. Furthermore, the runtime overhead of STATIC remains extremely low across a wide range of practical configurations. To the best of our knowledge, STATIC enables the first production-scale deployment of strictly constrained generative retrieval. In addition, evaluation on academic benchmarks demonstrates that STATIC can considerably improve cold-start performance for generative retrieval. Our code is available at this https URL.
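To make the "flatten the prefix tree into a CSR matrix" step concrete, here is a minimal sketch of the data layout only. The function names and the toy item sequences are hypothetical, and the plain Python loops stand in for the vectorized sparse operations that the paper runs on TPUs/GPUs; this is not the authors' implementation.

```python
def build_csr_trie(sequences):
    """Flatten a prefix tree over token-id sequences into CSR-style arrays.

    Returns (row_ptr, tokens, next_nodes): the outgoing edges of node n are
    stored at positions row_ptr[n] .. row_ptr[n + 1] - 1.
    """
    children = [{}]  # node id -> {token: child node id}; node 0 is the root
    for seq in sequences:
        node = 0
        for tok in seq:
            if tok not in children[node]:
                children.append({})
                children[node][tok] = len(children) - 1
            node = children[node][tok]
    row_ptr, tokens, next_nodes = [0], [], []
    for node_children in children:
        for tok in sorted(node_children):
            tokens.append(tok)
            next_nodes.append(node_children[tok])
        row_ptr.append(len(tokens))
    return row_ptr, tokens, next_nodes


def allowed_tokens(csr, node):
    """Tokens that keep decoding inside the allowed item set."""
    row_ptr, tokens, _ = csr
    return tokens[row_ptr[node]:row_ptr[node + 1]]


def step(csr, node, tok):
    """Follow one edge; raises if the token is not allowed from this state."""
    row_ptr, tokens, next_nodes = csr
    for k in range(row_ptr[node], row_ptr[node + 1]):
        if tokens[k] == tok:
            return next_nodes[k]
    raise ValueError("token not allowed from this state")


# Toy item identifiers, each a sequence of three token ids.
csr = build_csr_trie([[1, 2, 3], [1, 2, 4], [1, 5, 6], [7, 8, 9]])
state = step(csr, 0, 1)
print(allowed_tokens(csr, 0), allowed_tokens(csr, state))
```

Because every node's edges sit in one contiguous slice, a batch of decoding states can be handled with index arithmetic and gathers instead of pointer chasing, which is the property that makes the layout accelerator-friendly.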
14. 【2602.22632】Fine-grained Semantics Integration for Large Language Model-based Recommendation
Link: https://arxiv.org/abs/2602.22632
Authors: Jiawen Feng,Xiaoyu Kong,Leheng Sheng,Bin Wu,Chao Yi,Feifang Yang,Xiang-Rong Sheng,Han Zhu,Xiang Wang,Jiancan Wu,Xiangnan He
Categories: Information Retrieval (cs.IR)
Keywords: Large Language Models, recommender autoregressively generates, Semantically Meaningless Initialization, SID space, Token-level Semantic Alignment
Comments:
Abstract:Recent advances in Large Language Models (LLMs) have shifted recommendation systems from the discriminative paradigm to the LLM-based generative paradigm, where the recommender autoregressively generates sequences of semantic identifiers (SIDs) for target items conditioned on historical interaction. While prevalent LLM-based recommenders have demonstrated performance gains by aligning pretrained LLMs between the language space and the SID space, modeling the SID space still faces two fundamental challenges: (1) Semantically Meaningless Initialization: SID tokens are randomly initialized, severing the semantic linkage between the SID space and the pretrained language space at the starting point, and (2) Coarse-grained Alignment: existing SFT-based alignment tasks primarily focus on item-level optimization, while overlooking the semantics of individual tokens within SID sequences. To address these challenges, we propose TS-Rec, which can integrate Token-level Semantics into LLM-based Recommenders. Specifically, TS-Rec comprises two key components: (1) Semantic-Aware embedding Initialization (SA-Init), which initializes SID token embeddings by applying mean pooling to the pretrained embeddings of keywords extracted by a teacher model; and (2) Token-level Semantic Alignment (TS-Align), which aligns individual tokens within the SID sequence with the shared semantics of the corresponding item clusters. Extensive experiments on two real-world benchmarks demonstrate that TS-Rec consistently outperforms traditional and generative baselines across all standard metrics. The results demonstrate that integrating fine-grained semantic information significantly enhances the performance of LLM-based generative recommenders.
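The SA-Init step, as described, amounts to mean pooling over keyword embeddings. A minimal sketch follows, assuming each keyword maps to a single pretrained vector; the embedding table, the SID token name, and the keyword list are hypothetical placeholders, and a real tokenizer would usually split keywords into several subword tokens.

```python
def mean_pool(vectors):
    """Element-wise average of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Hypothetical pretrained keyword embeddings (3 dimensions for readability).
pretrained = {
    "running": [1.0, 0.0, 2.0],
    "shoes": [3.0, 2.0, 0.0],
}

# Hypothetical output of a teacher model: keywords describing each SID token.
sid_keywords = {"<sid_a_12>": ["running", "shoes"]}

# Initialize each new SID token from its keywords instead of at random.
sid_embeddings = {
    token: mean_pool([pretrained[k] for k in keywords])
    for token, keywords in sid_keywords.items()
}
print(sid_embeddings)
```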
15. 【2602.22591】Where Relevance Emerges: A Layer-Wise Study of Internal Attention for Zero-Shot Re-Ranking
Link: https://arxiv.org/abs/2602.22591
Authors: Haodong Chen,Shengyao Zhuang,Zheng Yao,Guido Zuccon,Teerapong Leelanupab
Categories: Information Retrieval (cs.IR)
Keywords: Large Language Models, Large Language, Listwise and Setwise, optimize computational efficiency, evolved from Pointwise
Comments: 10 pages, 5 figures, 1 table. Code available at [this https URL](https://github.com/ielab/Selective-ICR)
Abstract:Zero-shot document re-ranking with Large Language Models (LLMs) has evolved from Pointwise methods to Listwise and Setwise approaches that optimize computational efficiency. Despite their success, these methods predominantly rely on generative scoring or output logits, which face bottlenecks in inference latency and result consistency. In-Context Re-ranking (ICR) has recently been proposed as an $O(1)$ alternative method. ICR extracts internal attention signals directly, avoiding the overhead of text generation. However, existing ICR methods simply aggregate signals across all layers; layer-wise contributions and their consistency across architectures have been left unexplored. Furthermore, no unified study has compared internal attention with traditional generative and likelihood-based mechanisms across diverse ranking frameworks under consistent conditions. In this paper, we conduct an orthogonal evaluation of generation, likelihood, and internal attention mechanisms across multiple ranking frameworks. We further identify a universal "bell-curve" distribution of relevance signals across transformer layers, which motivates the proposed Selective-ICR strategy that reduces inference latency by 30%-50% without compromising effectiveness. Finally, evaluation on the reasoning-intensive BRIGHT benchmark shows that precisely capturing high-quality in-context attention signals fundamentally reduces the need for model scaling and reinforcement learning: a zero-shot 8B model matches the performance of 14B reinforcement-learned re-rankers, while even a 0.6B model outperforms state-of-the-art generation-based approaches. These findings redefine the efficiency-effectiveness frontier for LLM-based re-ranking and highlight the latent potential of internal signals for complex reasoning ranking tasks. Our code and results are publicly available at this https URL.
16. 【2602.22576】Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
Link: https://arxiv.org/abs/2602.22576
Authors: Tianle Xia,Ming Xu,Lingxiang Hu,Yiding Sun,Wenwei Li,Linfang Shang,Liqun Liu,Peng Shu,Huan Yu,Jie Jiang
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: enhances large language, large language models, incorporating external knowledge, traditional single-round retrieval, single-round retrieval struggles
Comments:
Abstract:Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, yet traditional single-round retrieval struggles with complex multi-step reasoning. Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed samples contribute nothing. We propose Search-P1, a framework that introduces path-centric reward shaping for agentic RAG training, comprising two key components: (1) Path-Centric Reward, which evaluates the structural quality of reasoning trajectories through order-agnostic step coverage and soft scoring that extracts learning signals even from failed samples, and (2) Dual-Track Path Scoring with offline-generated reference planners that assesses paths from both self-consistency and reference-alignment perspectives. Experiments on multiple QA benchmarks demonstrate that Search-P1 achieves significant improvements over Search-R1 and other strong baselines, with an average accuracy gain of 7.7 points.
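One way to read "order-agnostic step coverage" with soft scoring is as a set overlap between the steps a trajectory took and the steps a reference plan expects. The sketch below is only an illustration under that reading: the step labels, the `path_weight` parameter, and the linear blend with the outcome reward are assumptions, not the reward defined in the paper.

```python
def step_coverage(trajectory_steps, reference_steps):
    """Fraction of reference steps that appear in the trajectory, in any order."""
    reference = set(reference_steps)
    if not reference:
        return 0.0
    return len(reference & set(trajectory_steps)) / len(reference)


def shaped_reward(answer_correct, trajectory_steps, reference_steps, path_weight=0.5):
    """Blend the sparse outcome reward with path coverage (assumed linear blend)."""
    outcome = 1.0 if answer_correct else 0.0
    coverage = step_coverage(trajectory_steps, reference_steps)
    return (1.0 - path_weight) * outcome + path_weight * coverage


reference = ["find author", "find birth year", "compare"]
trajectory = ["compare", "find author", "irrelevant lookup"]

# A failed sample still earns a non-zero signal for the steps it got right.
failed_reward = shaped_reward(False, trajectory, reference)
print(failed_reward)
```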
17. 【2602.22547】Towards Dynamic Dense Retrieval with Routing Strategy
Link: https://arxiv.org/abs/2602.22547
Authors: Zhan Su,Fengran Mo,Jinghan Zhang,Yuchen Hui,Jia Ao Sun,Bingbing Wen,Jian-Yun Nie
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: tasks involves fine-tuning, involves fine-tuning, fine-tuning a pre-trained, dense retrieval
Comments:
Abstract:The de facto paradigm for applying dense retrieval (DR) to new tasks involves fine-tuning a pre-trained model for a specific task. However, this paradigm has two significant limitations: (1) It is difficult to adapt the DR to a new domain if the training dataset is limited. (2) Old DR models are simply replaced by newer models that are trained from scratch when the former are no longer up to date. Especially for scenarios where the model needs to be updated frequently, this paradigm is prohibitively expensive. To address these challenges, we propose a novel dense retrieval approach, termed dynamic dense retrieval (DDR). DDR uses prefix tuning as a module specialized for a specific domain. These modules can then be compositionally combined with a dynamic routing strategy, enabling highly flexible domain adaptation in the retrieval part. Extensive evaluation on six zero-shot downstream tasks demonstrates that this approach can surpass DR while utilizing only 2% of the training parameters, paving the way to achieve more flexible dense retrieval in IR. We see it as a promising future direction for applying dense retrieval to various tasks.
18. 【2602.22529】Generative Agents Navigating Digital Libraries
Link: https://arxiv.org/abs/2602.22529
Authors: Saber Zerhoudi,Michael Granitzer
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Keywords: large language models, rapidly evolving field, language models, rapidly evolving, evolving field
Comments:
Abstract:In the rapidly evolving field of digital libraries, the development of large language models (LLMs) has opened up new possibilities for simulating user behavior. This innovation addresses the longstanding challenge in digital library research: the scarcity of publicly available datasets on user search patterns due to privacy concerns. In this context, we introduce Agent4DL, a user search behavior simulator specifically designed for digital library environments. Agent4DL generates realistic user profiles and dynamic search sessions that closely mimic actual search strategies, including querying, clicking, and stopping behaviors tailored to specific user profiles. Our simulator's accuracy in replicating real user interactions has been validated through comparisons with real user data. Notably, Agent4DL demonstrates competitive performance compared to existing user search simulators such as SimIIR 2.0, particularly in its ability to generate more diverse and context-aware user behaviors.
19. 【2602.22521】TFPS: A Temporal Filtration-enhanced Positive Sample Set Construction Method for Implicit Collaborative Filtering
Link: https://arxiv.org/abs/2602.22521
Authors: Jiayi Wu,Zhengyu Wu,Xunkai Li,Rong-Hua Li,Guoren Wang
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: effectively train collaborative, negative sampling, train collaborative filtering, effectively train, train collaborative
Comments:
Abstract:The negative sampling strategy can effectively train collaborative filtering (CF) recommendation models based on implicit feedback by constructing positive and negative samples. However, existing methods primarily optimize the negative sampling process while neglecting the exploration of positive samples. Some denoising recommendation methods can be applied to denoise positive samples within negative sampling strategies, but they ignore temporal information. Existing work integrates sequential information during model aggregation but neglects time interval information, hindering accurate capture of users' current preferences. To address this problem, from a data perspective, we propose a novel temporal filtration-enhanced approach to construct a high-quality positive sample set. First, we design a time decay model based on interaction time intervals, transforming the original graph into a weighted user-item bipartite graph. Then, based on predefined filtering operations, the weighted user-item bipartite graph is layered. Finally, we design a layer-enhancement strategy to construct a high-quality positive sample set for the layered subgraphs. We provide theoretical insights into why TFPS can improve Recall@k and NDCG@k, and extensive experiments on three real-world datasets demonstrate the effectiveness of the proposed method. Additionally, TFPS can be integrated with various implicit CF recommenders or negative sampling methods to enhance its performance.
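The first two steps described above (time-decay weighting of interactions, then layering the weighted bipartite graph by filtering) can be sketched as follows. The abstract does not give the decay function or the filtering rule, so the half-life decay measured from each user's latest interaction and the threshold-based layers are assumptions made purely for illustration.

```python
def decay_weights(interactions, half_life):
    """Weight each (user, item) edge by how recent it is for that user.

    interactions: list of (user, item, timestamp) tuples.
    The weight halves every `half_life` time units before the user's
    latest interaction (assumed decay form).
    """
    latest = {}
    for user, _, ts in interactions:
        latest[user] = max(latest.get(user, ts), ts)
    return {
        (user, item): 0.5 ** ((latest[user] - ts) / half_life)
        for user, item, ts in interactions
    }


def layer_edges(weights, thresholds):
    """One edge subset per threshold, from the strictest to the loosest."""
    return [
        {edge for edge, w in weights.items() if w >= t}
        for t in sorted(thresholds, reverse=True)
    ]


interactions = [("u1", "a", 0), ("u1", "b", 10), ("u1", "c", 20), ("u2", "a", 5)]
weights = decay_weights(interactions, half_life=10)
layers = layer_edges(weights, thresholds=[0.5, 1.0])
print(weights)
print(layers)
```

The strictest layer keeps only each user's most recent interactions, which are the ones most likely to reflect current preferences.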
20. 【2602.22462】MammoWise: Multi-Model Local RAG Pipeline for Mammography Report Generation
Link: https://arxiv.org/abs/2602.22462
Authors: Raiyan Jahangir,Nafiz Imtiaz Khan,Amritanand Sudheerkumar,Vladimir Filkov
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Keywords: Screening mammography, high volume, documentation heavy, Vision Language Models, recent Vision Language
Comments: arXiv preprint (submitted 25 Feb 2026). Local multi-model pipeline for mammography report generation + classification using prompting, multimodal RAG (ChromaDB), and QLoRA fine-tuning; evaluates MedGemma, LLaVA-Med, Qwen2.5-VL on VinDr-Mammo and DMID; reports BERTScore/ROUGE-L and classification metrics
Abstract:Screening mammography is high volume, time sensitive, and documentation heavy. Radiologists must translate subtle visual findings into consistent BI-RADS assessments, breast density categories, and structured narrative reports. While recent Vision Language Models (VLMs) enable image-to-text reporting, many rely on closed cloud systems or tightly coupled architectures that limit privacy, reproducibility, and adaptability. We present MammoWise, a local multi-model pipeline that transforms open source VLMs into mammogram report generators and multi-task classifiers. MammoWise supports any Ollama-hosted VLM and mammography dataset, and enables zero-shot, few-shot, and Chain-of-Thought prompting, with optional multimodal Retrieval Augmented Generation (RAG) using a vector database for case-specific context. We evaluate MedGemma, LLaVA-Med, and Qwen2.5-VL on VinDr-Mammo and DMID datasets, assessing report quality (BERTScore, ROUGE-L), BI-RADS classification, breast density, and key findings. Report generation is consistently strong and improves with few-shot prompting and RAG. Classification is feasible but sensitive to model and dataset choice. Parameter-efficient fine-tuning (QLoRA) of MedGemma improves reliability, achieving BI-RADS accuracy of 0.7545, density accuracy of 0.8840, and calcification accuracy of 0.9341 while preserving report quality. MammoWise provides a practical and extensible framework for deploying local VLMs for mammography reporting within a unified and reproducible workflow.
21. 【2602.22278】RETLLM: Training and Data-Free MLLMs for Multimodal Information Retrieval
Link: https://arxiv.org/abs/2602.22278
Authors: Dawei Su,Dongsheng Wang
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: handling text, Multimodal information retrieval, MMIR, gained attention, flexibility in handling
Comments: 5 pages, 2 figures
Abstract:Multimodal information retrieval (MMIR) has gained attention for its flexibility in handling text, images, or mixed queries and candidates. Recent breakthroughs in multimodal large language models (MLLMs) boost MMIR performance by incorporating MLLM knowledge under the contrastive finetuning framework. However, they suffer from pre-training inconsistency and require large datasets. In this work, we introduce a novel framework, RetLLM, designed to query MLLMs for MMIR in a training- and data-free manner. Specifically, we formulate MMIR as a similarity score generation task and prompt MLLMs to directly predict retrieval scores in a coarse-then-fine pipeline. At the coarse stage, a top-k filtering strategy builds a small yet high-quality candidate pool for each query, enabling MLLMs to focus on semantically relevant candidates. Subsequently, the retrieval score is predicted by feeding both the query and candidate into MLLMs at the fine stage. Importantly, we propose a visual enhancement module during reasoning to help MLLMs re-pick forgotten visuals, improving retrieval. Extensive experiments on MMIR benchmarks show that RetLLM outperforms fine-tuned models. Ablation studies further verify each component. Our work demonstrates that MLLMs can achieve strong MMIR performance without any training, highlighting their inherent multimodal reasoning ability in a simple, scalable framework. We release our code at: this https URL
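The coarse-then-fine pipeline can be sketched as a top-k filter followed by re-scoring of the surviving candidates. In the sketch below, cosine similarity over toy vectors plays the role of the coarse stage and a lookup table of made-up scores stands in for the MLLM's predicted retrieval score; both are assumptions, and no actual model call is shown.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def coarse_then_fine(query_vec, candidates, fine_scorer, k):
    """Keep the k most similar candidates, then rank them with the fine scorer.

    candidates: dict mapping candidate id -> embedding vector.
    fine_scorer: callable taking a candidate id and returning a score
    (a placeholder for prompting an MLLM with the query and candidate).
    """
    pool = sorted(
        candidates,
        key=lambda cid: cosine(query_vec, candidates[cid]),
        reverse=True,
    )[:k]
    return sorted(pool, key=fine_scorer, reverse=True)


candidates = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
fine_scores = {"a": 0.2, "b": 0.9, "c": 1.0, "d": 0.0}  # made-up scores
ranking = coarse_then_fine([1.0, 0.0], candidates, lambda cid: fine_scores[cid], k=2)
print(ranking)
```

Candidate "c" has the highest fine score but never reaches the fine stage, which shows both the cost saving and the recall risk of a small candidate pool.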
22. 【2602.22226】SEGB: Self-Evolved Generative Bidding with Local Autoregressive Diffusion
Link: https://arxiv.org/abs/2602.22226
Authors: Yulong Gao,Wan Jiang,Mingzhe Cao,Xuepu Wang,Zeyu Pan,Haonan Yang,Ye Liu,Xin Yang
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: efficiently capture impression, capture impression opportunities, pivotal tool, enabling advertisers, opportunities in real-time
Comments:
Abstract:In the realm of online advertising, automated bidding has become a pivotal tool, enabling advertisers to efficiently capture impression opportunities in real-time. Recently, generative auto-bidding has shown significant promise, offering innovative solutions for effective ad optimization. However, existing offline-trained generative policies lack the near-term foresight required for dynamic markets and usually depend on simulators or external experts for post-training improvement. To overcome these critical limitations, we propose Self-Evolved Generative Bidding (SEGB), a framework that plans proactively and refines itself entirely offline. SEGB first synthesizes plausible short-horizon future states to guide each bid, providing the agent with crucial, dynamic foresight. Crucially, it then performs value-guided policy refinement to iteratively discover superior strategies without any external intervention. This self-contained approach uniquely enables robust policy improvement from static data alone. Experiments on the AuctionNet benchmark and a large-scale A/B test validate our approach, demonstrating that SEGB significantly outperforms state-of-the-art baselines. In a large-scale online deployment, it delivered substantial business value, achieving a +10.19% increase in target cost, proving the effectiveness of our advanced planning and evolution paradigm.
23. 【2602.22225】SmartChunk Retrieval: Query-Aware Chunk Compression with Planning for Efficient Document RAG
Link: https://arxiv.org/abs/2602.22225
Authors: Xuechen Zhang,Koustava Goswami,Samet Oymak,Jiasi Chen,Nedim Lipka
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: combining language models, Retrieval-augmented generation, language models, large text corpora, potential for producing
Comments: 26 pages, 10 figures
Abstract:Retrieval-augmented generation (RAG) has strong potential for producing accurate and factual outputs by combining language models (LMs) with evidence retrieved from large text corpora. However, current pipelines are limited by static chunking and flat retrieval: documents are split into short, predetermined, fixed-size chunks, embeddings are retrieved uniformly, and generation relies on whatever chunks are returned. This design brings challenges, as retrieval quality is highly sensitive to chunk size, often introduces noise from irrelevant or misleading chunks, and scales poorly to large corpora. We present SmartChunk retrieval, a query-adaptive framework for efficient and robust long-document question answering (QA). SmartChunk uses (i) a planner that predicts the optimal chunk abstraction level for each query, and (ii) a lightweight compression module that produces high-level chunk embeddings without repeated summarization. By adapting retrieval granularity on the fly, SmartChunk balances accuracy with efficiency and avoids the drawbacks of fixed strategies. Notably, our planner can reason about chunk abstractions through a novel reinforcement learning scheme, STITCH, which boosts accuracy and generalization. To reflect real-world applications, where users face diverse document types and query styles, we evaluate SmartChunk on five QA benchmarks plus one out-of-domain dataset. Across these evaluations, SmartChunk outperforms state-of-the-art RAG baselines, while reducing cost. Further analysis demonstrates strong scalability with larger corpora and consistent gains on out-of-domain datasets, highlighting its effectiveness as a general framework for adaptive retrieval.
24. 【2602.22224】DS SERVE: A Framework for Efficient and Scalable Neural Retrieval
Link: https://arxiv.org/abs/2602.22224
Authors: Jinjian Liu,Yichuan Wang,Xinxi Lyu,Rulin Shao,Joseph E. Gonzalez,Matei Zaharia,Sewon Min
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: neural retrieval system, high-performance neural retrieval, large-scale text datasets, transforms large-scale text, text datasets
Comments:
Abstract:We present DS-Serve, a framework that transforms large-scale text datasets, comprising half a trillion tokens, into a high-performance neural retrieval system. DS-Serve offers both a web interface and API endpoints, achieving low latency with modest memory overhead on a single node. The framework also supports inference-time trade-offs between latency, accuracy, and result diversity. We anticipate that DS-Serve will be broadly useful for a range of applications, including large-scale retrieval-augmented generation (RAG), training data attribution, training search agents, and beyond.
25. 【2602.22223】SQaLe: A Large Text-to-SQL Corpus Grounded in Real Schemas
Link: https://arxiv.org/abs/2602.22223
Authors: Cornelius Wolff,Daniel Gomm,Madelon Hulsebos
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: valid SQL queries, Advances in large, methods for converting, SQL queries, large language models
Comments: Accepted at the AI for Tabular Data workshop at EurIPS 2025
Abstract:Advances in large language models have accelerated progress in text-to-SQL, methods for converting natural language queries into valid SQL queries. A key bottleneck for developing generalizable text-to-SQL models is the lack of large-scale datasets with sufficient schema and query complexity, domain coverage, and task diversity. We introduce SQaLe: a large-scale semi-synthetic text-to-SQL dataset built on 135,875 relational database schemas expanded from a collection of real-world schemas, SchemaPile. We establish a principled generation pipeline which combines schema sampling, question synthesis, and SQL construction, and produce 517,676 high-quality (question, schema, query) triples. The SQaLe dataset captures realistic schema size variability, diverse query patterns, and natural language ambiguity while maintaining execution validity. We provide an analysis of its contents and characteristics, and find that SQaLe introduces the most realistic large-scale text-to-SQL dataset to date in comparison with existing benchmarks and datasets. We discuss how SQaLe enables our vision for data scaling and model generalization in text-to-SQL research. The dataset is accessible at: this https URL.
26. 【2602.22222】TWICE: An LLM Agent Framework for Simulating Personalized User Tweeting Behavior with Long-term Temporal Features
Link: https://arxiv.org/abs/2602.22222
Authors: Bingrui Jin, Kunyao Lan, Mengyue Wu
Subjects: Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
Keywords: generate large amounts, generate large, large amounts, User, temporal characteristics
Comments:
Click to view abstract
Abstract:User simulators are often used to generate large amounts of data for various tasks such as generation, training, and evaluation. However, existing approaches concentrate on collective behaviors or interactive systems, struggling with tasks that require modeling temporal characteristics. To address this limitation, we propose TWICE, an LLM-based framework that leverages the long-term temporal and personalized features of social media data. This framework integrates personalized user profiling, an event-driven memory module, and a workflow for personalized style rewriting, enabling simulation of personalized user tweeting behavior while capturing long-term temporal characteristics. In addition, we conduct a comprehensive evaluation with a focus on analyzing tweeting style and event-based changes in behavior. Experiment results demonstrate that our framework improves personalized user simulation by effectively incorporating temporal dynamics, providing a robust solution for long-term behavior tracking.
27. 【2602.22221】Misinformation Exposure in the Chinese Web: A Cross-System Evaluation of Search Engines, LLMs, and AI Overviews
Link: https://arxiv.org/abs/2602.22221
Authors: Geng Liu, Junjie Mu, Li Feng, Mengxiao Zhu, Francesco Pierri
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: Large Language Models, Large Language, Language Models, providing direct answers, reduce users' reliance
Comments:
Click to view abstract
Abstract:Large Language Models (LLMs) are increasingly integrated into search services, providing direct answers that can reduce users' reliance on traditional result pages. Yet their factual reliability in non-English web ecosystems remains poorly understood, particularly when answering real user queries. We introduce a fact-checking dataset of 12,161 Chinese Yes/No questions derived from real-world online search logs and develop a unified evaluation pipeline to compare three information-access paradigms: traditional search engines, standalone LLMs, and AI-generated overview modules. Our analysis reveals substantial differences in factual accuracy and topic-level variability across systems. By combining this performance with real-world Baidu Index statistics, we further estimate potential exposure to incorrect factual information of Chinese users across regions. These findings highlight structural risks in AI-mediated search and underscore the need for more reliable and transparent information-access tools for the digital world.
28. 【2602.22220】What Makes an Ideal Quote? Recommending "Unexpected yet Rational" Quotations via Novelty
Link: https://arxiv.org/abs/2602.22220
Authors: Bowei Zhang, Jin Xiao, Guanglei Yue, Qianyu He, Yanghua Xiao, Deqing Yang, Jiaqing Liang
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: optimize surface-level topical, surface-level topical relevance, make quotations memorable, aims to enrich, enrich writing
Comments: 36 pages, 16 figures and 13 tables
Click to view abstract
Abstract:Quotation recommendation aims to enrich writing by suggesting quotes that complement a given context, yet existing systems mostly optimize surface-level topical relevance and ignore the deeper semantic and aesthetic properties that make quotations memorable. We start from two empirical observations. First, a systematic user study shows that people consistently prefer quotations that are "unexpected yet rational" in context, identifying novelty as a key desideratum. Second, we find that strong existing models struggle to fully understand the deep meanings of quotations. Inspired by defamiliarization theory, we therefore formalize quote recommendation as choosing contextually novel but semantically coherent quotations. We operationalize this objective with NovelQR, a novelty-driven quotation recommendation framework. A generative label agent first interprets each quotation and its surrounding context into multi-dimensional deep-meaning labels, enabling label-enhanced retrieval. A token-level novelty estimator then reranks candidates while mitigating auto-regressive continuation bias. Experiments on bilingual datasets spanning diverse real-world domains show that our system recommends quotations that human judges rate as more appropriate, more novel, and more engaging than other baselines, while matching or surpassing existing methods in novelty estimation.
29. 【2602.22219】Comparative Analysis of Neural Retriever-Reranker Pipelines for Retrieval-Augmented Generation over Knowledge Graphs in E-commerce Applications
Link: https://arxiv.org/abs/2602.22219
Authors: Teri Rumble, Zbyněk Gazdík, Javad Zarrin, Jagdeep Ahluwalia
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Language Models, Natural Language Processing, Recent advancements, advancements in Large, Language Processing
Comments: This manuscript is under review at the Springer journal Knowledge and Information Systems
Click to view abstract
Abstract:Recent advancements in Large Language Models (LLMs) have transformed Natural Language Processing (NLP), enabling complex information retrieval and generation tasks. Retrieval-Augmented Generation (RAG) has emerged as a key innovation, enhancing factual accuracy and contextual grounding by integrating external knowledge sources with generative models. Although RAG demonstrates strong performance on unstructured text, its application to structured knowledge graphs presents challenges: scaling retrieval across connected graphs and preserving contextual relationships during response generation. Cross-encoders refine retrieval precision, yet their integration with structured data remains underexplored. Addressing these challenges is crucial for developing domain-specific assistants that operate in production environments. This study presents the design and comparative evaluation of multiple Retriever-Reranker pipelines for knowledge graph natural language queries in e-Commerce contexts. Using the STaRK Semi-structured Knowledge Base (SKB), a production-scale e-Commerce dataset, we evaluate multiple RAG pipeline configurations optimized for language queries. Experimental results demonstrate substantial improvements over published benchmarks, achieving 20.4% higher Hit@1 and 14.5% higher Mean Reciprocal Rank (MRR). These findings establish a practical framework for integrating domain-specific SKBs into generative systems. Our contributions provide actionable insights for the deployment of production-ready RAG systems, with implications that extend beyond e-Commerce to other domains that require information retrieval from structured knowledge bases.
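For readers unfamiliar with the two metrics quoted in this abstract, here is a minimal sketch of how Hit@k and Mean Reciprocal Rank (MRR) are usually computed. The function names and the toy product ids are illustrative assumptions; they are not taken from the paper or from the STaRK benchmark code.

```python
from typing import Hashable, List, Sequence, Set, Tuple


def hit_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int = 1) -> float:
    """Return 1.0 if any of the top-k ranked items is relevant, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Return 1/rank of the first relevant item (1-indexed), or 0.0 if none is found."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: List[Tuple[Sequence[Hashable], Set[Hashable]]]) -> dict:
    """Average Hit@1 and reciprocal rank over (ranked list, relevant set) pairs."""
    n = len(runs)
    return {
        "hit@1": sum(hit_at_k(r, rel, 1) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }


# Toy queries with hypothetical product ids.
runs = [
    (["p3", "p1", "p7"], {"p3"}),  # relevant item at rank 1
    (["p2", "p9", "p4"], {"p4"}),  # relevant item at rank 3
    (["p5", "p6", "p8"], {"p0"}),  # relevant item never retrieved
]
scores = evaluate(runs)
```

Hit@1 only rewards a correct top result, while MRR also gives partial credit when the first relevant item appears lower in the list, which is why the two numbers are typically reported together.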
30. 【2602.22218】Cybersecurity Data Extraction from Common Crawl
Link: https://arxiv.org/abs/2602.22218
Authors: Ashim Mahara
Subjects: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
Keywords: Common Crawl web, cybersecurity-focused dataset collected, Common Crawl, Crawl web graph, community detection
Comments:
Click to view abstract
Abstract:Alpha-Root is a cybersecurity-focused dataset collected in a single shot from the Common Crawl web graph using community detection. Unlike iterative content-scoring approaches like DeepSeekMath, we mine quality domains directly from the web graph, starting from just 20 trusted seed domains.
31. 【2602.22217】RAGdb: A Zero-Dependency, Embeddable Architecture for Multimodal Retrieval-Augmented Generation on the Edge
Link: https://arxiv.org/abs/2602.22217
Authors: Ahmed Bin Khalid
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, grounding Large Language, Language Models, Large Language, grounding Large
Comments: 6 pages, 2 tables
Click to view abstract
Abstract:Retrieval-Augmented Generation (RAG) has established itself as the standard paradigm for grounding Large Language Models (LLMs) in domain-specific, up-to-date data. However, the prevailing architecture for RAG has evolved into a complex, distributed stack requiring cloud-hosted vector databases, heavy deep learning frameworks (e.g., PyTorch, CUDA), and high-latency embedding inference servers. This "infrastructure bloat" creates a significant barrier to entry for edge computing, air-gapped environments, and privacy-constrained applications where data sovereignty is paramount. This paper introduces RAGdb, a novel monolithic architecture that consolidates automated multimodal ingestion, ONNX-based extraction, and hybrid vector retrieval into a single, portable SQLite container. We propose a deterministic Hybrid Scoring Function (HSF) that combines sublinear TF-IDF vectorization with exact substring boosting, eliminating the need for GPU inference at query time. Experimental evaluation on an Intel i7-1165G7 consumer laptop demonstrates that RAGdb achieves 100% Recall@1 for entity retrieval and an ingestion efficiency gain of 31.6x during incremental updates compared to cold starts. Furthermore, the system reduces disk footprint by approximately 99.5% compared to standard Docker-based RAG stacks, establishing the "Single-File Knowledge Container" as a viable primitive for decentralized, local-first AI. Keywords: Edge AI, Retrieval-Augmented Generation, Vector Search, Green AI, Serverless Architecture, Knowledge Graphs, Efficient Computing.
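To make the idea of combining sublinear TF-IDF with exact substring boosting more concrete, the following is a rough sketch in plain Python. It is not the paper's Hybrid Scoring Function: the whitespace tokenizer, the smoothed IDF formula, the additive `boost` constant, and the sample documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real systems do more)."""
    return text.lower().split()


def build_idf(docs):
    """Smoothed inverse document frequency over a small corpus."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector using sublinear term frequency, 1 + log(tf)."""
    counts = Counter(tokenize(text))
    return {t: (1.0 + math.log(c)) * idf[t] for t, c in counts.items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_score(query, doc, idf, boost=0.5):
    """TF-IDF cosine plus a fixed bonus when the query occurs verbatim in the doc.

    The additive form and the default boost of 0.5 are illustrative choices only.
    """
    base = cosine(tfidf_vector(query, idf), tfidf_vector(doc, idf))
    bonus = boost if query.lower() in doc.lower() else 0.0
    return base + bonus


# Hypothetical documents; an identifier-style query benefits from the exact-match bonus.
docs = [
    "invoice INV-2041 issued to Acme Corp",
    "meeting notes about the Acme Corp roadmap",
    "invoice INV-3099 issued to Globex",
]
idf = build_idf(docs)
query = "INV-2041"
ranked = sorted(docs, key=lambda d: hybrid_score(query, d, idf), reverse=True)
```

Every step here is deterministic and runs on the CPU with the standard library alone, which is the property the abstract emphasizes for query-time scoring on edge devices.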
32. 【2602.22216】Retrieval-Augmented Generation Assistant for Anatomical Pathology Laboratories
Link: https://arxiv.org/abs/2602.22216
Authors: Diogo Pires, Yuriy Perezhohin, Mauro Castelli
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: Anatomical Pathology, medical decisions depend, Accurate and efficient, essential in Anatomical, efficient access
Comments:
Click to view abstract
Abstract:Accurate and efficient access to laboratory protocols is essential in Anatomical Pathology (AP), where up to 70% of medical decisions depend on laboratory diagnoses. However, static documentation such as printed manuals or PDFs is often outdated, fragmented, and difficult to search, creating risks of workflow errors and diagnostic delays. This study proposes and evaluates a Retrieval-Augmented Generation (RAG) assistant tailored to AP laboratories, designed to provide technicians with context-grounded answers to protocol-related queries. We curated a novel corpus of 99 AP protocols from a Portuguese healthcare institution and constructed 323 question-answer pairs for systematic evaluation. Ten experiments were conducted, varying chunking strategies, retrieval methods, and embedding models. Performance was assessed using the RAGAS framework (faithfulness, answer relevance, context recall) alongside top-k retrieval metrics. Results show that recursive chunking and hybrid retrieval delivered the strongest baseline performance. Incorporating a biomedical-specific embedding model (MedEmbed) further improved answer relevance (0.74), faithfulness (0.70), and context recall (0.77), showing the importance of domain-specialised embeddings. Top-k analysis revealed that retrieving a single top-ranked chunk (k=1) maximized efficiency and accuracy, reflecting the modular structure of AP protocols. These findings highlight critical design considerations for deploying RAG systems in healthcare and demonstrate their potential to transform static documentation into dynamic, reliable knowledge assistants, thus improving laboratory workflow efficiency and supporting patient safety.
33. 【2602.22215】Graph Your Way to Inspiration: Integrating Co-Author Graphs with Retrieval-Augmented Generation for Large Language Model Based Scientific Idea Generation
Link: https://arxiv.org/abs/2602.22215
Authors: Pengzhen Xie, Huizhi Liang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Large Language Models, Large Language, Language Models, scientific idea generation, demonstrate potential
Comments: 15 pages, 10 figures. Submitted to [RAAI]
Click to view abstract
Abstract:Large Language Models (LLMs) demonstrate potential in the field of scientific idea generation. However, the generated results often lack controllable academic context and traceable inspiration pathways. To bridge this gap, this paper proposes a scientific idea generation system called GYWI, which combines author knowledge graphs with retrieval-augmented generation (RAG) to form an external knowledge base that provides controllable context and a traceable inspiration path for LLMs generating new scientific ideas. First, we propose an author-centered knowledge graph construction method and inspiration source sampling algorithms to construct the external knowledge base. Second, we propose a hybrid retrieval mechanism composed of both RAG and GraphRAG to retrieve content with both depth and breadth of knowledge, forming a hybrid context. Third, we propose a prompt optimization strategy incorporating reinforcement learning principles to automatically guide LLMs in optimizing the results based on the hybrid context. To evaluate the proposed approaches, we constructed an evaluation dataset based on arXiv (2018-2023). This paper also develops a comprehensive evaluation method including empirical automatic assessment in a multiple-choice question task, LLM-based scoring, human evaluation, and semantic space visualization analysis. The generated ideas are evaluated along five dimensions: novelty, feasibility, clarity, relevance, and significance. We conducted experiments on different LLMs including GPT-4o, DeepSeek-V3, Qwen3-8B, and Gemini 2.5. Experimental results show that GYWI significantly outperforms mainstream LLMs in multiple metrics such as novelty, reliability, and relevance.
34. 【2602.22214】Adaptive Prefiltering for High-Dimensional Similarity Search: A Frequency-Aware Approach
Link: https://arxiv.org/abs/2602.22214
Authors: Teodor-Ioan Calin
Subjects: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: modern retrieval systems, High-dimensional similarity search, underpins modern retrieval, High-dimensional similarity, similarity search underpins
Comments:
Click to view abstract
Abstract:High-dimensional similarity search underpins modern retrieval systems, yet uniform search strategies fail to exploit the heterogeneous nature of real-world query distributions. We present an adaptive prefiltering framework that leverages query frequency patterns and cluster coherence metrics to dynamically allocate computational budgets. Our approach partitions the query space into frequency tiers following Zipfian distributions and assigns differentiated search policies based on historical access patterns and local density characteristics. Experiments on ImageNet-1k using CLIP embeddings demonstrate that frequency-aware budget allocation achieves equivalent recall with 20.4% fewer distance computations compared to static nprobe selection, while maintaining sub-millisecond latency on GPU-accelerated FAISS indices. The framework introduces minimal overhead through lightweight frequency tracking and provides graceful degradation for unseen queries through coherence-based fallback policies.
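As a rough illustration of frequency-aware budget allocation, the sketch below tracks how often each coarse cluster is queried and maps traffic tiers to an `nprobe` budget. Everything in it is an assumption for illustration: the class name, the tier thresholds, the choice to give hot clusters a smaller budget, and the fixed fallback for unseen clusters (the abstract describes a coherence-based fallback, which is not modeled here).

```python
from collections import Counter


class FrequencyTieredBudget:
    """Toy policy that picks an nprobe budget from historical query frequency."""

    def __init__(self, tiers=((0.8, 4), (0.95, 16)), fallback_nprobe=32):
        # tiers: (cumulative traffic share, nprobe) pairs, hottest tier first.
        # These thresholds and budgets are hypothetical values.
        self.tiers = tiers
        self.fallback_nprobe = fallback_nprobe
        self.counts = Counter()

    def record(self, cluster_id, times=1):
        """Log that a query was routed to the given coarse cluster."""
        self.counts[cluster_id] += times

    def nprobe_for(self, cluster_id):
        """Return the search budget for a cluster based on its traffic rank."""
        if cluster_id not in self.counts:
            return self.fallback_nprobe  # unseen cluster: use the widest search
        total = sum(self.counts.values())
        covered = 0.0  # share of traffic handled by strictly hotter clusters
        for cid, count in self.counts.most_common():
            if cid == cluster_id:
                break
            covered += count / total
        for share, nprobe in self.tiers:
            if covered < share:
                return nprobe
        return self.fallback_nprobe


# Zipf-like toy traffic over four hypothetical clusters.
policy = FrequencyTieredBudget()
for cluster_id, hits in [(0, 85), (1, 12), (2, 2), (3, 1)]:
    policy.record(cluster_id, hits)

budgets = {cid: policy.nprobe_for(cid) for cid in (0, 1, 2, 3, 99)}
```

In a real deployment the returned budget would be passed to the index at query time; for an inverted-file index in FAISS that corresponds to setting its `nprobe` parameter before searching.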
35. 【2602.22213】Enriching Taxonomies Using Large Language Models
Link: https://arxiv.org/abs/2602.22213
Authors: Zeinab Ghamlouch, Mehwish Alam
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Language Models, information across domains, play a vital, vital role, role in structuring
Comments: Published in ECAI 2025 Demo Track
Click to view abstract
Abstract:Taxonomies play a vital role in structuring and categorizing information across domains. However, many existing taxonomies suffer from limited coverage and outdated or ambiguous nodes, reducing their effectiveness in knowledge retrieval. To address this, we present Taxoria, a novel taxonomy enrichment pipeline that leverages Large Language Models (LLMs) to enhance a given taxonomy. Unlike approaches that extract internal LLM taxonomies, Taxoria uses an existing taxonomy as a seed and prompts an LLM to propose candidate nodes for enrichment. These candidates are then validated to mitigate hallucinations and ensure semantic relevance before integration. The final output includes an enriched taxonomy with provenance tracking and visualization of the final merged taxonomy for analysis.
Computer Vision
1. 【2602.23363】MediX-R1: Open Ended Medical Reinforcement Learning
Link: https://arxiv.org/abs/2602.23363
Authors: Sahal Shaji Mullappilly, Mohammed Irfan Kurpath, Omair Mohamed, Mohamed Zidan, Fahad Khan, Salman Khan, Rao Anwer, Hisham Cholakkal
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: open-ended Reinforcement Learning, Reinforcement Learning, enables clinically grounded, clinically grounded, free-form answers
Comments:
Click to view abstract
Abstract:We introduce MediX-R1, an open-ended Reinforcement Learning (RL) framework for medical multimodal large language models (MLLMs) that enables clinically grounded, free-form answers beyond multiple-choice formats. MediX-R1 fine-tunes a baseline vision-language backbone with Group Based RL and a composite reward tailored for medical reasoning: an LLM-based accuracy reward that judges semantic correctness with a strict YES/NO decision, a medical embedding-based semantic reward to capture paraphrases and terminology variants, and lightweight format and modality rewards that enforce interpretable reasoning and modality recognition. This multi-signal design provides stable, informative feedback for open-ended outputs where traditional verifiable or MCQ-only rewards fall short. To measure progress, we propose a unified evaluation framework for both text-only and image+text tasks that uses a Reference-based LLM-as-judge in place of brittle string-overlap metrics, capturing semantic correctness, reasoning, and contextual alignment. Despite using only $\sim51$K instruction examples, MediX-R1 achieves excellent results across standard medical LLM (text-only) and VLM (image + text) benchmarks, outperforming strong open-source baselines and delivering particularly large gains on open-ended clinical tasks. Our results demonstrate that open-ended RL with comprehensive reward signals and LLM-based evaluation is a practical path toward reliable medical reasoning in multimodal models. Our trained models, curated datasets and source code are available at this https URL
2. 【2602.23361】VGG-T$^3$: Offline Feed-Forward 3D Reconstruction at Scale
Link: https://arxiv.org/abs/2602.23361
Authors: Sven Elflein, Ruilong Li, Sérgio Agostinho, Zan Gojcic, Laura Leal-Taixé, Qunjie Zhou, Aljosa Osep
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: requirements grow quadratically, memory requirements grow, offline feed-forward methods, present a scalable, grow quadratically
Comments: CVPR 2026, Project page: [this https URL](https://research.nvidia.com/labs/dvl/projects/vgg-ttt)
Click to view abstract
Abstract:We present a scalable 3D reconstruction model that addresses a critical limitation in offline feed-forward methods: their computational and memory requirements grow quadratically w.r.t. the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) space representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. VGG-T$^3$ (Visual Geometry Grounded Test Time Training) scales linearly w.r.t. the number of input views, similar to online models, and reconstructs a $1k$ image collection in just $54$ seconds, achieving an $11.6\times$ speed-up over baselines that rely on softmax attention. Since our method retains global scene aggregation capability, our point map reconstruction error outperforms other linear-time methods by large margins. Finally, we demonstrate visual localization capabilities of our model by querying the scene representation with unseen images.
3. 【2602.23359】SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation
Link: https://arxiv.org/abs/2602.23359
Authors: Vaibhav Agrawal, Rishubh Parihar, Pradhaan Bhat, Ravi Kiran Sarvadevabhatla, R. Venkatesh Babu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: identify occlusion reasoning, fundamental yet overlooked, overlooked aspect, occlusions, model
Comments: Project page: [this https URL](https://seethrough3d.github.io). Accepted at CVPR 2026
Click to view abstract
Abstract:We identify occlusion reasoning as a fundamental yet overlooked aspect for 3D layout-conditioned generation. It is essential for synthesizing partially occluded objects with depth-consistent geometry and scale. While existing methods can generate realistic scenes that follow input layouts, they often fail to model precise inter-object occlusions. We propose SeeThrough3D, a model for 3D layout-conditioned generation that explicitly models occlusions. We introduce an occlusion-aware 3D scene representation (OSCR), where objects are depicted as translucent 3D boxes placed within a virtual environment and rendered from a desired camera viewpoint. The transparency encodes hidden object regions, enabling the model to reason about occlusions, while the rendered viewpoint provides explicit camera control during generation. We condition a pretrained flow-based text-to-image generation model by introducing a set of visual tokens derived from our rendered 3D representation. Furthermore, we apply masked self-attention to accurately bind each object bounding box to its corresponding textual description, enabling accurate generation of multiple objects without object attribute mixing. To train the model, we construct a synthetic dataset with diverse multi-object scenes with strong inter-object occlusions. SeeThrough3D generalizes effectively to unseen object categories and enables precise 3D layout control with realistic occlusions and consistent camera control.
4. 【2602.23358】A Dataset is Worth 1 MB
Link: https://arxiv.org/abs/2602.23358
Authors: Elad Kimchi Shoshani, Leeyam Gabay, Yedid Hoshen
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: incurring massive communication, massive communication costs, incurring massive, communication costs, massive communication
Comments: 23 pages, 9 figures
Click to view abstract
Abstract:A dataset server must often distribute the same large payload to many clients, incurring massive communication costs. Since clients frequently operate on diverse hardware and software frameworks, transmitting a pre-trained model is often infeasible; instead, agents require raw data to train their own task-specific models locally. While dataset distillation attempts to compress training signals, current methods struggle to scale to high-resolution data and rarely achieve sufficiently small files. In this paper, we propose Pseudo-Labels as Data (PLADA), a method that completely eliminates pixel transmission. We assume agents are preloaded with a large, generic, unlabeled reference dataset (e.g., ImageNet-1K, ImageNet-21K) and communicate a new task by transmitting only the class labels for specific images. To address the distribution mismatch between the reference and target datasets, we introduce a pruning mechanism that filters the reference dataset to retain only the labels of the most semantically relevant images for the target task. This selection process simultaneously maximizes training efficiency and minimizes transmission payload. Experiments on 10 diverse datasets demonstrate that our approach can transfer task knowledge with a payload of less than 1 MB while retaining high classification accuracy, offering a promising solution for efficient dataset serving.
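A back-of-the-envelope calculation helps show why label-only transmission can fit in such a small payload. The sketch below assumes a naive fixed-width encoding with one label slot per reference image plus one extra symbol meaning "image not used"; this encoding is a hypothetical baseline for illustration and is not the scheme described in the paper. The ImageNet-1K training set size (1,281,167 images) is the standard figure.

```python
def label_payload_bytes(num_reference_images: int, num_target_classes: int) -> int:
    """Size in bytes of a fixed-width label array over the reference dataset.

    Each image gets one symbol: a target class id or a reserved "dropped" symbol.
    """
    num_symbols = num_target_classes + 1          # +1 for the "dropped" marker
    bits_per_label = (num_symbols - 1).bit_length()  # smallest width that fits all symbols
    total_bits = num_reference_images * bits_per_label
    return (total_bits + 7) // 8                  # round up to whole bytes


IMAGENET_1K_TRAIN = 1_281_167  # images in the ImageNet-1K training split

payload_10_classes = label_payload_bytes(IMAGENET_1K_TRAIN, 10)
payload_100_classes = label_payload_bytes(IMAGENET_1K_TRAIN, 100)
```

Under these assumptions a 10-class task needs roughly 0.64 MB even before any pruning or compression, while a 100-class task needs about 1.12 MB with this naive encoding, which suggests why filtering the reference set to the most relevant images matters for staying under the 1 MB budget.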
5. 【2602.23357】Sensor Generalization for Adaptive Sensing in Event-based Object Detection via Joint Distribution Training
Link: https://arxiv.org/abs/2602.23357
Authors: Aheli Saha, René Schuster, Didier Stricker
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Bio-inspired event cameras, recently attracted significant, attracted significant research, significant research due, Bio-inspired event
Comments: 12 pages, International Conference on Pattern Recognition Applications and Methods
Click to view abstract
Abstract:Bio-inspired event cameras have recently attracted significant research due to their asynchronous and low-latency capabilities. These features provide a high dynamic range and significantly reduce motion blur. However, because of the novelty in the nature of their output signals, there is a gap in the variability of available data and a lack of extensive analysis of the parameters characterizing their signals. This paper addresses these issues by providing readers with an in-depth understanding of how intrinsic parameters affect the performance of a model trained on event data, specifically for object detection. We also use our findings to expand the capabilities of the downstream model towards sensor-agnostic robustness.
6. 【2602.23351】Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning
Link: https://arxiv.org/abs/2602.23351
Authors: Amita Kamath, Jack Hessel, Khyathi Chandu, Jena D. Hwang, Kai-Wei Chang, Ranjay Krishna
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: research discourse, forefront of research, reporting bias, Vision-Language Models, training data
Comments: TACL 2026
Click to view abstract
Abstract:The lack of reasoning capabilities in Vision-Language Models (VLMs) has remained at the forefront of research discourse. We posit that this behavior stems from a reporting bias in their training data. That is, how people communicate about visual content by default omits tacit information needed to supervise some types of reasoning; e.g., "at the game today!" is a more likely caption than "a photo of 37 people standing behind a field". We investigate the data underlying the popular VLMs OpenCLIP, LLaVA-1.5 and Molmo through the lens of theories from pragmatics, and find that reporting bias results in insufficient representation of four reasoning skills (spatial, temporal, negation, and counting), despite the corpora being of web-scale, and/or synthetically generated. With a set of curated benchmarks, we demonstrate that: (i) VLMs perform poorly on the aforementioned types of reasoning suppressed in the training data by reporting bias; (ii) contrary to popular belief, scaling data size, model size, and to multiple languages does not result in emergence of these skills by default; but, promisingly, (iii) incorporating annotations specifically collected to obtain tacit information is effective. Our findings highlight the need for more intentional training data curation methods, rather than counting on scale for emergence of reasoning capabilities.
7. 【2602.23339】Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?
Link: https://arxiv.org/abs/2602.23339
Authors: Tilemachos Aravanis, Vladan Stojnić, Bill Psomas, Nikos Komodakis, Giorgos Tolias
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: zero-shot recognition capabilities, vision-language models, pixel-level prediction, recognition capabilities, capabilities of vision-language
Comments:
Click to view abstract
Abstract:Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS lags behind fully supervised approaches due to two challenges: the coarse image-level supervision used to train VLMs and the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods relying on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between modalities. The method supports continually expanding support sets, and applies to fine-grained tasks such as personalized segmentation. Experiments show that we significantly narrow the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.
8. 【2602.23306】ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding
Link: https://arxiv.org/abs/2602.23306
Authors: Yiran Guan, Sifan Tu, Dingkang Liang, Linghao Zhu, Jianzhong Ju, Zhenbo Luo, Jian Luan, Yuliang Liu, Xiang Bai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: diverse data sources, essential for intelligent, intelligent systems, systems to understand, understand and draw
Comments: Accepted by ICLR 2026
Click to view abstract
Abstract:Omni-modal reasoning is essential for intelligent systems to understand and draw inferences from diverse data sources. While existing omni-modal large language models (OLLM) excel at perceiving diverse modalities, they lack the complex reasoning abilities of recent large reasoning models (LRM). However, enhancing the reasoning ability of OLLMs through additional training presents significant challenges, including the need for high-quality data, task-specific adaptation, and substantial computational costs. To address these limitations, we propose ThinkOmni, a training-free and data-free framework that lifts textual reasoning to omni-modal scenarios. ThinkOmni introduces two key components: 1) LRM-as-a-Guide, which leverages off-the-shelf LRMs to guide the OLLM decoding process; 2) Stepwise Contrastive Scaling, which adaptively balances perception and reasoning signals without manual hyperparameter tuning. Experiments on six multi-modal reasoning benchmarks demonstrate that ThinkOmni consistently delivers performance improvements, with main results achieving 70.2 on MathVista and 75.5 on MMAU. Overall, ThinkOmni offers a flexible and generalizable solution for omni-modal reasoning and provides new insights into the generalization and application of reasoning capabilities.
9. 【2602.23297】PRIMA: Pre-training with Risk-integrated Image-Metadata Alignment for Medical Diagnosis via LLM
Link: https://arxiv.org/abs/2602.23297
Authors: Yiqing Wang, Chunming He, Ming-Chen Lu, Mercy Pawar, Leslie Niziol, Maria Woodward, Sina Farsiu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Medical diagnosis requires, Medical diagnosis, diagnosis requires, requires the effective, effective synthesis
Comments:
Click to view abstract
Abstract:Medical diagnosis requires the effective synthesis of visual manifestations and clinical metadata. However, existing methods often treat metadata as isolated tags, failing to exploit the rich semantic knowledge embedded in clinical descriptions. We propose PRIMA (Pre-training with Risk-integrated Image-Metadata Alignment), a framework that integrates domain-specific knowledge into multi-modal representation learning. We first curate an expert corpus of risk-disease correlations via Retrieval-Augmented Generation (RAG) to refine Clinical ModernBERT, embedding diagnostic priors into the text encoder. To bridge the modality gap, we introduce a dual-encoder pre-training strategy utilizing DINOv3 and our refined BERT, optimized by a suite of four complementary loss functions. These losses are designed to capture multi-granular semantic alignment and handle the ambiguity of clinical correlations through soft labels. Finally, we leverage Qwen-3 to fuse these aligned features for precise disease classification. Extensive experiments demonstrate that PRIMA effectively harmonizes pixel-level features with abstract clinical expertise, significantly outperforming other state-of-the-art methods. Notably, our framework achieves superior robustness without the need for massive data collection or exhaustive computational resources. Our code will be made public upon acceptance.
10. 【2602.23295】ManifoldGD: Training-Free Hierarchical Manifold Guidance for Diffusion-Based Dataset Distillation
Link: https://arxiv.org/abs/2602.23295
Authors: Ayush Roy, Wei-Yang Alex Lee, Rudrasis Chakraborty, Vishnu Suresh Lokhande
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: large datasets hinder, datasets hinder efficient, hinder efficient model, redundant concepts, efficient model training
Comments: CVPE 2026
Abstract:In recent times, large datasets hinder efficient model training while also containing redundant concepts. Dataset distillation aims to synthesize compact datasets that preserve the knowledge of large-scale training sets while drastically reducing storage and computation. Recent advances in diffusion models have enabled training-free distillation by leveraging pre-trained generative priors; however, existing guidance strategies remain limited. Current score-based methods either perform unguided denoising or rely on simple mode-based guidance toward instance prototype centroids (IPC centroids), which often are rudimentary and suboptimal. We propose Manifold-Guided Distillation (ManifoldGD), a training-free diffusion-based framework that integrates manifold consistent guidance at every denoising timestep. Our method employs IPCs computed via a hierarchical, divisive clustering of VAE latent features, yielding a multi-scale coreset of IPCs that captures both coarse semantic modes and fine intra-class variability. Using a local neighborhood of the extracted IPC centroids, we create the latent manifold for each diffusion denoising timestep. At each denoising step, we project the mode-alignment vector onto the local tangent space of the estimated latent manifold, thus constraining the generation trajectory to remain manifold-faithful while preserving semantic consistency. This formulation improves representativeness, diversity, and image fidelity without requiring any model retraining. Empirical results demonstrate consistent gains over existing training-free and training-based baselines in terms of FID, l2 distance among real and synthetic dataset embeddings, and classification accuracy, establishing ManifoldGD as the first geometry-aware training-free data distillation framework.
11. 【2602.23294】Towards Long-Form Spatio-Temporal Video Grounding
Link: https://arxiv.org/abs/2602.23294
Authors: Xin Gu, Bing Fan, Jiali Yao, Zhipeng Zhang, Yan Huang, Cheng Han, Heng Fan, Libo Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: real scenarios, STVG, STVG methods, videos, existing STVG methods
Comments:
Abstract:In real scenarios, videos can span several minutes or even hours. However, existing research on spatio-temporal video grounding (STVG), given a textual query, mainly focuses on localizing targets in short videos of tens of seconds, typically less than one minute, which limits real-world applications. In this paper, we explore Long-Form STVG (LF-STVG), which aims to locate targets in long-term videos. Compared with short videos, long-term videos contain much longer temporal spans and more irrelevant information, making it difficult for existing STVG methods that process all frames at once. To address this challenge, we propose an AutoRegressive Transformer architecture for LF-STVG, termed ART-STVG. Unlike conventional STVG methods that require the entire video sequence to make predictions at once, ART-STVG treats the video as streaming input and processes frames sequentially, enabling efficient handling of long videos. To model spatio-temporal context, we design spatial and temporal memory banks and apply them to the decoders. Since memories from different moments are not always relevant to the current frame, we introduce simple yet effective memory selection strategies to provide more relevant information to the decoders, significantly improving performance. Furthermore, instead of parallel spatial and temporal localization, we propose a cascaded spatio-temporal design that connects the spatial decoder to the temporal decoder, allowing fine-grained spatial cues to assist complex temporal localization in long videos. Experiments on newly extended LF-STVG datasets show that ART-STVG significantly outperforms state-of-the-art methods, while achieving competitive performance on conventional short-form STVG.
12. 【2602.23292】PGVMS: A Prompt-Guided Unified Framework for Virtual Multiplex IHC Staining with Pathological Semantic Learning
Link: https://arxiv.org/abs/2602.23292
Authors: Fuqiang Chen, Ranran Zhang, Wanming Hu, Deboch Eyob Abera, Yue Peng, Boyun Zheng, Yiwen Sun, Jing Cai, Wenjian Qin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: enables precise molecular, precise molecular profiling, clinically available antibody-based, modern pathology, Immunohistochemical
Comments: Accepted by TMI
Abstract:Immunohistochemical (IHC) staining enables precise molecular profiling of protein expression, with over 200 clinically available antibody-based tests in modern pathology. However, comprehensive IHC analysis is frequently limited by insufficient tissue quantities in small biopsies. Therefore, virtual multiplex staining emerges as an innovative solution to digitally transform H&E images into multiple IHC representations, yet current methods still face three critical challenges: (1) inadequate semantic guidance for multi-staining, (2) inconsistent distribution of immunochemistry staining, and (3) spatial misalignment across different stain modalities. To overcome these limitations, we present a prompt-guided framework for virtual multiplex IHC staining using only uniplex training data (PGVMS). Our framework introduces three key innovations corresponding to each challenge: First, an adaptive prompt guidance mechanism employing a pathological visual language model dynamically adjusts staining prompts to resolve semantic guidance limitations (Challenge 1). Second, our protein-aware learning strategy (PALS) maintains precise protein expression patterns by direct quantification and constraint of protein distributions (Challenge 2). Third, the prototype-consistent learning strategy (PCLS) establishes cross-image semantic interaction to correct spatial misalignments (Challenge 3).
13. 【2602.23290】LineGraph2Road: Structural Graph Reasoning on Line Graphs for Road Network Extraction
Link: https://arxiv.org/abs/2602.23290
Authors: Zhengyang Wei, Renzhi Jing, Yiyi He, Jenny Suckale
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: urban planning, significantly reducing, manual annotation, accurate and automatic, satellite imagery
Comments:
Abstract:The accurate and automatic extraction of roads from satellite imagery is critical for applications in navigation and urban planning, significantly reducing the need for manual annotation. Many existing methods decompose this task into keypoint extraction and connectedness prediction, but often struggle to capture long-range dependencies and complex topologies. Here, we propose LineGraph2Road, a framework that improves connectedness prediction by formulating it as binary classification over edges in a constructed global but sparse Euclidean graph, where nodes are keypoints extracted from segmentation masks and edges connect node pairs within a predefined distance threshold, representing potential road segments. To better learn structural link representation, we transform the original graph into its corresponding line graph and apply a Graph Transformer on it for connectedness prediction. This formulation overcomes the limitations of endpoint-embedding fusion on set-isomorphic links, enabling rich link representations and effective relational reasoning over the global structure. Additionally, we introduce an overpass/underpass head to resolve multi-level crossings and a coupled NMS strategy to preserve critical connections. We evaluate LineGraph2Road on three benchmarks: City-scale, SpaceNet, and Global-scale, and show that it achieves state-of-the-art results on two key metrics, TOPO-F1 and APLS. It also captures fine visual details critical for real-world deployment. We will make our code publicly available.
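The graph construction described in this abstract can be illustrated with a small sketch. It is not the authors' code: the function names `build_sparse_graph` and `line_graph`, the toy keypoints, and the threshold value are all hypothetical, and the Graph Transformer that scores the candidate links is left out. The sketch only shows how keypoints within a distance threshold become candidate edges, and how each candidate edge then becomes a node of the line graph.

```python
from itertools import combinations
from math import dist


def build_sparse_graph(keypoints, max_dist):
    """Connect every pair of keypoints no farther apart than max_dist.

    Each returned (i, j) pair is a candidate road segment.
    """
    return [
        (i, j)
        for i, j in combinations(range(len(keypoints)), 2)
        if dist(keypoints[i], keypoints[j]) <= max_dist
    ]


def line_graph(edges):
    """Turn edges into nodes: two edges are linked if they share an endpoint."""
    return [
        (a, b)
        for a, b in combinations(range(len(edges)), 2)
        if set(edges[a]) & set(edges[b])
    ]


# Toy keypoints (hypothetical): three along a line, one far away.
keypoints = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (5.0, 5.0)]
edges = build_sparse_graph(keypoints, max_dist=1.5)
lg_edges = line_graph(edges)
```

In the line graph, connectedness prediction becomes classification of nodes rather than of endpoint pairs, which is the reformulation the abstract describes.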
14. 【2602.23262】Decomposing Private Image Generation via Coarse-to-Fine Wavelet Modeling
Link: https://arxiv.org/abs/2602.23262
Authors: Jasmine Bayrooti, Weiwei Kong, Natalia Ponomareva, Carlos Esteves, Ameesh Makadia, Amanda Prorok
Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Keywords: Generative models trained, reproducing individual training, making strong privacy, privacy guarantees essential, strong privacy guarantees
Comments:
Abstract:Generative models trained on sensitive image datasets risk memorizing and reproducing individual training examples, making strong privacy guarantees essential. While differential privacy (DP) provides a principled framework for such guarantees, standard DP finetuning (e.g., with DP-SGD) often results in severe degradation of image quality, particularly in high-frequency textures, due to the indiscriminate addition of noise across all model parameters. In this work, we propose a spectral DP framework based on the hypothesis that the most privacy-sensitive portions of an image are often low-frequency components in the wavelet space (e.g., facial features and object shapes) while high-frequency components are largely generic and public. Based on this hypothesis, we propose the following two-stage framework for DP image generation with coarse image intermediaries: (1) DP finetune an autoregressive spectral image tokenizer model on the low-resolution wavelet coefficients of the sensitive images, and (2) perform high-resolution upsampling using a publicly pretrained super-resolution model. By restricting the privacy budget to the global structures of the image in the first stage, and leveraging the post-processing property of DP for detail refinement, we achieve promising trade-offs between privacy and utility. Experiments on the MS-COCO and MM-CelebA-HQ datasets show that our method generates images with improved quality and style capture relative to other leading DP image frameworks.
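To make the "low-frequency wavelet components" in this abstract concrete, here is a minimal sketch of one level of a 2D Haar transform that keeps only the low-frequency (LL) band. The choice of the Haar wavelet, the function name `haar_ll`, and the toy image are assumptions for illustration; the abstract does not say which wavelet is used, and the sketch involves no differential privacy or tokenizer.

```python
def haar_ll(image):
    """Return the LL (low-frequency) band of a one-level 2D Haar transform.

    Assumes a 2D list with even height and width. With the orthonormal
    Haar filter, each LL coefficient is the sum of a 2x2 block divided by 2.
    """
    height, width = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1]
             + image[r + 1][c] + image[r + 1][c + 1]) / 2.0
            for c in range(0, width - 1, 2)
        ]
        for r in range(0, height - 1, 2)
    ]


# Toy 4x4 "image" (hypothetical values); the LL band is a 2x2 coarse version.
image = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 4, 4],
    [3, 3, 4, 4],
]
coarse = haar_ll(image)
```

The coarse band carries the global structure at a quarter of the pixel count, which is the part the first stage of the described framework spends its privacy budget on.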
15. 【2602.23259】Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving
Link: https://arxiv.org/abs/2602.23259
Authors: Jiangxin Sun, Feng Xue, Teng Long, Chang Liu, Jian-Fang Hu, Wei-Shi Zheng, Nicu Sebe
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Keywords: great progress recently, made great progress, large-scale driving datasets, World Model, imitation learning
Comments:
Abstract:With advances in imitation learning (IL) and large-scale driving datasets, end-to-end autonomous driving (E2E-AD) has made great progress recently. Currently, IL-based methods have become a mainstream paradigm: models rely on standard driving behaviors given by experts, and learn to minimize the discrepancy between their actions and expert actions. However, this objective of "only driving like the expert" suffers from limited generalization: when encountering rare or unseen long-tail scenarios outside the distribution of expert demonstrations, models tend to produce unsafe decisions in the absence of prior experience. This raises a fundamental question: Can an E2E-AD system make reliable decisions without any expert action supervision? Motivated by this, we propose a unified framework named Risk-aware World Model Predictive Control (RaWMPC) to address this generalization dilemma through robust control, without reliance on expert demonstrations. Practically, RaWMPC leverages a world model to predict the consequences of multiple candidate actions and selects low-risk actions through explicit risk evaluation. To endow the world model with the ability to predict the outcomes of risky driving behaviors, we design a risk-aware interaction strategy that systematically exposes the world model to hazardous behaviors, making catastrophic outcomes predictable and thus avoidable. Furthermore, to generate low-risk candidate actions at test time, we introduce a self-evaluation distillation method to distill risk-avoidance capabilities from the well-trained world model into a generative action proposal network without any expert demonstration. Extensive experiments show that RaWMPC outperforms state-of-the-art methods in both in-distribution and out-of-distribution scenarios, while providing superior decision interpretability.
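The selection step described in this abstract (predict the consequences of several candidate actions, then pick the low-risk one) can be sketched generically as follows. This is a toy under stated assumptions, not RaWMPC itself: `select_low_risk_action`, `toy_world_model`, `toy_risk`, the fixed horizon, and the one-dimensional state are all hypothetical stand-ins for the learned world model and risk evaluation.

```python
def select_low_risk_action(state, candidates, world_model, risk_fn, horizon=3):
    """Roll each candidate forward and return the one with the lowest total risk.

    world_model(state, action) -> next state (hypothetical interface)
    risk_fn(state) -> non-negative risk score (hypothetical interface)
    """
    best_action, best_risk = None, float("inf")
    for action in candidates:
        current, total_risk = state, 0.0
        for _ in range(horizon):
            current = world_model(current, action)  # predicted consequence
            total_risk += risk_fn(current)          # explicit risk evaluation
        if total_risk < best_risk:
            best_action, best_risk = action, total_risk
    return best_action, best_risk


# Toy 1D setting: state is the lateral offset from the lane center,
# an action is a constant steering increment applied at every step.
def toy_world_model(offset, steer):
    return offset + steer


def toy_risk(offset):
    return abs(offset)


action, risk = select_low_risk_action(
    1.0, [-0.5, 0.0, 0.5], toy_world_model, toy_risk
)
```

Here the candidate that steers back toward the lane center accumulates the least risk over the horizon and is selected; no expert action is consulted at any point.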
16. 【2602.23235】Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents
Link: https://arxiv.org/abs/2602.23235
Authors: Zhou Xu, Bowen Zhou, Qi Wang, Shuwen Feng, Jingyu Xiao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Pure-vision GUI agents, provide universal interaction, universal interaction capabilities, severe efficiency bottlenecks, efficiency bottlenecks due
Comments:
Abstract:Pure-vision GUI agents provide universal interaction capabilities but suffer from severe efficiency bottlenecks due to the massive spatiotemporal redundancy inherent in high-resolution screenshots and historical trajectories. We identify two critical misalignments in existing compression paradigms: the temporal mismatch, where uniform history encoding diverges from the agent's "fading memory" attention pattern, and the spatial topology conflict, where unstructured pruning compromises the grid integrity required for precise coordinate grounding, inducing spatial hallucinations. To address these challenges, we introduce GUIPruner, a training-free framework tailored for high-resolution GUI navigation. It synergizes Temporal-Adaptive Resolution (TAR), which eliminates historical redundancy via decay-based resizing, and Stratified Structure-aware Pruning (SSP), which prioritizes interactive foregrounds and semantic anchors while safeguarding global layout. Extensive evaluations across diverse benchmarks demonstrate that GUIPruner consistently achieves state-of-the-art performance, effectively preventing the collapse observed in large-scale models under high compression. Notably, on Qwen2-VL-2B, our method delivers a 3.4x reduction in FLOPs and a 3.3x speedup in vision encoding latency while retaining over 94% of the original performance, enabling real-time, high-precision navigation with minimal resource consumption.
17. 【2602.23231】Skarimva: Skeleton-based Action Recognition is a Multi-view Application
Link: https://arxiv.org/abs/2602.23231
Authors: Daniel Bermuth, Alexander Poeppel, Wolfgang Reif
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: developing intelligent interactions, Human action recognition, action recognition plays, skeleton-based action recognition, action recognition
Comments:
Abstract:Human action recognition plays an important role when developing intelligent interactions between humans and machines. While there is a lot of active research on improving the machine learning algorithms for skeleton-based action recognition, not much attention has been given to the quality of the input skeleton data itself. This work demonstrates that by making use of multiple camera views to triangulate more accurate 3D skeletons, the performance of state-of-the-art action recognition models can be improved significantly. This suggests that the quality of the input data is currently a limiting factor for the performance of these models. Based on these results, it is argued that the cost-benefit ratio of using multiple cameras is very favorable in most practical use cases; therefore, future research in skeleton-based action recognition should consider multi-view applications as the standard setup.
18. 【2602.23229】Large Multimodal Models as General In-Context Classifiers
Link: https://arxiv.org/abs/2602.23229
Authors: Marco Garosi, Matteo Farina, Alessandro Conti, Massimiliano Mancini, Elisa Ricci
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Multimodal Models, Large Multimodal, LMMs, multimodal model, Multimodal Models
Comments: CVPR Findings 2026. Project website at [this https URL](https://circle-lmm.github.io/)
Abstract:Which multimodal model should we use for classification? Previous studies suggest that the answer lies in CLIP-like contrastive Vision-Language Models (VLMs), due to their remarkable performance in zero-shot classification. In contrast, Large Multimodal Models (LMMs) are more suitable for complex tasks. In this work, we argue that this answer overlooks an important capability of LMMs: in-context learning. We benchmark state-of-the-art LMMs on diverse datasets for closed-world classification and find that, although their zero-shot performance is lower than CLIP's, LMMs with a few in-context examples can match or even surpass contrastive VLMs with cache-based adapters, their "in-context" equivalent. We extend this analysis to the open-world setting, where the generative nature of LMMs makes them more suitable for the task. In this challenging scenario, LMMs struggle whenever provided with imperfect context information. To address this issue, we propose CIRCLE, a simple training-free method that assigns pseudo-labels to in-context examples, iteratively refining them with the available context itself. Through extensive experiments, we show that CIRCLE establishes a robust baseline for open-world classification, surpassing VLM counterparts and highlighting the potential of LMMs to serve as unified classifiers, and a flexible alternative to specialized models.
19. 【2602.23228】MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction
Link: https://arxiv.org/abs/2602.23228
Authors: Yizhi Li, Xiaohan Chen, Miao Jiang, Wentao Tang, Gaoang Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: efficient media archiving, personalized recommendation, automated video summarization, digital entertainment, content indexing
Comments: 6 pages, CSCWD 2026
Abstract:With the explosive growth of digital entertainment, automated video summarization has become indispensable for applications such as content indexing, personalized recommendation, and efficient media archiving. Automatic synopsis generation for long-form videos, such as movies and TV series, presents a significant challenge for existing Vision-Language Models (VLMs). While proficient at single-image captioning, these general-purpose models often exhibit critical failures in long-duration contexts, primarily a lack of ID-consistent character identification and a fractured narrative coherence. To overcome these limitations, we propose MovieTeller, a novel framework for generating movie synopses via tool-augmented progressive abstraction. Our core contribution is a training-free, tool-augmented, fact-grounded generation process. Instead of requiring costly model fine-tuning, our framework directly leverages off-the-shelf models in a plug-and-play manner. We first invoke a specialized face recognition model as an external "tool" to establish Factual Groundings--precise character identities and their corresponding bounding boxes. These groundings are then injected into the prompt to steer the VLM's reasoning, ensuring the generated scene descriptions are anchored to verifiable facts. Furthermore, our progressive abstraction pipeline decomposes the summarization of a full-length movie into a multi-stage process, effectively mitigating the context length limitations of current VLMs. Experiments demonstrate that our approach yields significant improvements in factual accuracy, character consistency, and overall narrative coherence compared to end-to-end baselines.
20. 【2602.23224】UniScale: Unified Scale-Aware 3D Reconstruction for Multi-View Understanding via Prior Injection for Robotic Perception
Link: https://arxiv.org/abs/2602.23224
Authors: Mohammad Mahdavian, Gordon Tan, Binbin Xu, Yuan Ren, Dongfeng Bai, Bingbing Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: semantically informed design, flexibly integrates geometric, semantically informed, informed design, applications that flexibly
Comments:
Abstract:We present UniScale, a unified, scale-aware multi-view 3D reconstruction framework for robotic applications that flexibly integrates geometric priors through a modular, semantically informed design. In vision-based robotic navigation, the accurate extraction of environmental structure from raw image sequences is critical for downstream tasks. UniScale addresses this challenge with a single feed-forward network that jointly estimates camera intrinsics and extrinsics, scale-invariant depth and point maps, and the metric scale of a scene from multi-view images, while optionally incorporating auxiliary geometric priors when available. By combining global contextual reasoning with camera-aware feature representations, UniScale is able to recover the metric-scale of the scene. In robotic settings where camera intrinsics are known, they can be easily incorporated to improve performance, with additional gains obtained when camera poses are also available. This co-design enables robust, metric-aware 3D reconstruction within a single unified model. Importantly, UniScale does not require training from scratch, and leverages world priors exhibited in pre-existing models without geometric encoding strategies, making it particularly suitable for resource-constrained robotic teams. We evaluate UniScale on multiple benchmarks, demonstrating strong generalization and consistent performance across diverse environments. We will release our implementation upon acceptance.
21. 【2602.23217】Multidimensional Task Learning: A Unified Tensor Framework for Computer Vision Tasks
Link: https://arxiv.org/abs/2602.23217
Authors: Alaa El Ichi, Khalide Jbilou
Categories: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA)
Keywords: Generalized Einstein MLPs, Multidimensional Task Learning, paper introduces Multidimensional, Generalized Einstein, introduces Multidimensional Task
Comments:
Abstract:This paper introduces Multidimensional Task Learning (MTL), a unified mathematical framework based on Generalized Einstein MLPs (GE-MLPs) that operate directly on tensors via the Einstein product. We argue that current computer vision task formulations are inherently constrained by matrix-based thinking: standard architectures rely on matrix-valued weights and vector-valued biases, requiring structural flattening that restricts the space of naturally expressible tasks. GE-MLPs lift this constraint by operating with tensor-valued parameters, enabling explicit control over which dimensions are preserved or contracted without information loss. Through rigorous mathematical derivations, we demonstrate that classification, segmentation, and detection are special cases of MTL, differing only in their dimensional configuration within a formally defined task space. We further prove that this task space is strictly larger than what matrix-based formulations can natively express, enabling principled task configurations such as spatiotemporal or cross-modal predictions that require destructive flattening under conventional approaches. This work provides a mathematical foundation for understanding, comparing, and designing computer vision tasks through the lens of tensor algebra.
22. 【2602.23214】Plug-and-Play Diffusion Meets ADMM: Dual-Variable Coupling for Robust Medical Image Reconstruction
Link: https://arxiv.org/abs/2602.23214
Authors: Chenhe Du, Xuanyu Tian, Qing Wu, Muyu Liu, Jingyi Yu, Hongjiang Wei, Yuyao Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Keywords: solving imaging inverse, imaging inverse problems, treating pretrained generative, pretrained generative models, frameworks have emerged
Comments:
Abstract:Plug-and-Play diffusion prior (PnPDP) frameworks have emerged as a powerful paradigm for solving imaging inverse problems by treating pretrained generative models as modular priors. However, we identify a critical flaw in prevailing PnP solvers (e.g., based on HQS or Proximal Gradient): they function as memoryless operators, updating estimates solely based on instantaneous gradients. This lack of historical tracking inevitably leads to non-vanishing steady-state bias, where the reconstruction fails to strictly satisfy physical measurements under heavy corruption. To resolve this, we propose Dual-Coupled PnP Diffusion, which restores the classical dual variable to provide integral feedback, theoretically guaranteeing asymptotic convergence to the exact data manifold. However, this rigorous geometric coupling introduces a secondary challenge: the accumulated dual residuals exhibit spectrally colored, structured artifacts that violate the Additive White Gaussian Noise (AWGN) assumption of diffusion priors, causing severe hallucinations. To bridge this gap, we introduce Spectral Homogenization (SH), a frequency-domain adaptation mechanism that modulates these structured residuals into statistically compliant pseudo-AWGN inputs. This effectively aligns the solver's rigorous optimization trajectory with the denoiser's valid statistical manifold. Extensive experiments on CT and MRI reconstruction demonstrate that our approach resolves the bias-hallucination trade-off, achieving state-of-the-art fidelity with significantly accelerated convergence.
23. 【2602.23212】Through BrokenEyes: How Eye Disorders Impact Face Detection?
Link: https://arxiv.org/abs/2602.23212
Authors: Prottay Kumar Adhikary
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Vision disorders significantly, significantly impact millions, disorders significantly impact, Vision disorders, millions of lives
Comments:
Abstract:Vision disorders significantly impact millions of lives, altering how visual information is processed and perceived. In this work, a computational framework was developed using the BrokenEyes system to simulate five common eye disorders (age-related macular degeneration, cataract, glaucoma, refractive errors, and diabetic retinopathy) and analyze their effects on neural-like feature representations in deep learning models. Leveraging a combination of human and non-human datasets, models trained under normal and disorder-specific conditions revealed critical disruptions in feature maps, particularly for cataract and glaucoma, which align with known neural processing challenges in these conditions. Evaluation metrics such as activation energy and cosine similarity quantified the severity of these distortions, providing insights into the interplay between degraded visual inputs and learned representations.
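As a rough illustration of the two metrics named in this abstract, the sketch below compares a flattened feature map from a "normal" model with one from a "disorder" model. Cosine similarity follows its standard definition. The abstract does not define "activation energy", so the mean squared activation used here is purely an assumption, and the function names and feature values are hypothetical.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Standard cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero feature maps
    return dot / (norm_a * norm_b)


def activation_energy(features):
    """ASSUMPTION: mean squared activation; the paper may define it differently."""
    return sum(x * x for x in features) / len(features)


# Hypothetical flattened feature maps from the two conditions.
normal_features = [3.0, 4.0, 0.0]
disorder_features = [1.5, 2.0, 0.0]

similarity = cosine_similarity(normal_features, disorder_features)
energy_drop = activation_energy(normal_features) - activation_energy(disorder_features)
```

In this toy case the two maps point in the same direction (similarity close to 1) while the energy differs, which shows why the two metrics capture different kinds of distortion.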
24. 【2602.23205】EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents
Link: https://arxiv.org/abs/2602.23205
Authors: Wenjia Wang, Liang Pan, Huaijin Pi, Yuke Lou, Xuqian Ren, Yifan Wu, Zhouyingcheng Liao, Lei Yang, Rishabh Dabral, Christian Theobalt, Taku Komura
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: naturally encode rich, long-term contextual information, real world naturally, world naturally encode, encode rich
Comments:
Abstract:Human behaviors in the real world naturally encode rich, long-term contextual information that can be leveraged to train embodied agents for perception, understanding, and acting. However, existing capture systems typically rely on costly studio setups and wearable devices, limiting the large-scale collection of scene-conditioned human motion data in the wild. To address this, we propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes within a unified metric world coordinate frame. The proposed method allows metric-scale and scene-consistent capture in everyday environments without static cameras or markers, bridging human motion and scene geometry seamlessly. Compared with optical capture ground truth, we demonstrate that the dual-view setting exhibits a remarkable ability to mitigate depth ambiguity, achieving superior alignment and reconstruction performance over a single iPhone or monocular models. Based on the collected data, we empower three embodied AI tasks: monocular human-scene-reconstruction, where we fine-tune on feedforward models that output metric-scale, world-space aligned humans and scenes; physics-based character animation, where we prove our data could be used to scale human-object interaction skills and scene-aware motion tracking; and robot motion control, where we train a humanoid robot via sim-to-real RL to replicate human motions depicted in videos. Experimental results validate the effectiveness of our pipeline and its contributions towards advancing embodied AI research.
25. 【2602.23204】Motion-aware Event Suppression for Event Cameras
Link: https://arxiv.org/abs/2602.23204
Authors: Roberto Pellerito, Nico Messikommer, Giovanni Cioffi, Marco Cannici, Davide Scaramuzza
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: Motion-aware Event Suppression, filter events triggered, framework for Motion-aware, Motion-aware Event, real time
Comments:
Abstract:In this work, we introduce the first framework for Motion-aware Event Suppression, which learns to filter events triggered by independently moving objects (IMOs) and ego-motion in real time. Our model jointly segments IMOs in the current event stream while predicting their future motion, enabling anticipatory suppression of dynamic events before they occur. Our lightweight architecture achieves 173 Hz inference on consumer-grade GPUs with less than 1 GB of memory usage, outperforming previous state-of-the-art methods on the challenging EVIMO benchmark by 67% in segmentation accuracy while operating at a 53% higher inference rate. Moreover, we demonstrate significant benefits for downstream applications: our method accelerates Vision Transformer inference by 83% via token pruning and improves event-based visual odometry accuracy, reducing Absolute Trajectory Error (ATE) by 13%.
26. 【2602.23203】ColoDiff: Integrating Dynamic Consistency With Content Awareness for Colonoscopy Video Generation
Link: https://arxiv.org/abs/2602.23203
Authors: Junhu Fu, Shuyu Liang, Wutong Li, Chen Ma, Peng Huang, Kehao Wang, Ke Chen, Shengli Lin, Pinghong Zhou, Zeju Li, Yuanyuan Wang, Yi Guo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: information-rich data critical, diagnosing intestinal diseases, irregular intestinal structures, Colonoscopy video generation, data-scarce scenarios
Comments:
Abstract:Colonoscopy video generation delivers dynamic, information-rich data critical for diagnosing intestinal diseases, particularly in data-scarce scenarios. High-quality video generation demands temporal consistency and precise control over clinical attributes, but faces challenges from irregular intestinal structures, diverse disease representations, and various imaging modalities. To this end, we propose ColoDiff, a diffusion-based framework that generates dynamic-consistent and content-aware colonoscopy videos, aiming to alleviate data shortage and assist clinical analysis. At the inter-frame level, our TimeStream module decouples temporal dependency from video sequences through a cross-frame tokenization mechanism, enabling intricate dynamic modeling despite irregular intestinal structures. At the intra-frame level, our Content-Aware module incorporates noise-injected embeddings and learnable prototypes to realize precise control over clinical attributes, breaking through the coarse guidance of diffusion models. Additionally, ColoDiff employs a non-Markovian sampling strategy that cuts steps by over 90% for real-time generation. ColoDiff is evaluated across three public datasets and one hospital database, based on both generation metrics and downstream tasks including disease diagnosis, modality discrimination, bowel preparation scoring, and lesion segmentation. Extensive experiments show ColoDiff generates videos with smooth transitions and rich dynamics. ColoDiff presents an effort in controllable colonoscopy video generation, revealing the potential of synthetic videos in complementing authentic representation and mitigating data scarcity in clinical settings.
27. 【2602.23192】FairQuant: Fairness-Aware Mixed-Precision Quantization for Medical Image Classification
Link: https://arxiv.org/abs/2602.23192
Authors: Thomas Woergaard, Raghavendra Selvan
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Compressing neural networks, quantizing model parameters, model parameters offers, Compressing neural, neural networks
Comments: Source code available at [this https URL](https://github.com/saintslab/FairQuant)
Abstract:Compressing neural networks by quantizing model parameters offers a useful trade-off between performance and efficiency. Methods like quantization-aware training and post-training quantization strive to maintain the downstream performance of compressed models compared to the full precision models. However, these techniques do not explicitly consider the impact on algorithmic fairness. In this work, we study fairness-aware mixed-precision quantization schemes for medical image classification under explicit bit budgets. We introduce FairQuant, a framework that combines group-aware importance analysis, budgeted mixed-precision allocation, and a learnable Bit-Aware Quantization (BAQ) mode that jointly optimizes weights and per-unit bit allocations under bitrate and fairness regularization. We evaluate the method on Fitzpatrick17k and ISIC2019 across ResNet18/50, DeiT-Tiny, and TinyViT. Results show that FairQuant configurations with average precision near 4-6 bits recover much of the Uniform 8-bit accuracy while improving worst-group performance relative to Uniform 4- and 8-bit baselines, with comparable fairness metrics under shared budgets.
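As a rough illustration of the terms used in this abstract, the sketch below shows plain symmetric uniform quantization at a given bit width and how an average bit width is computed for a mixed-precision allocation. It is not the FairQuant method: there is no fairness term and no learned allocation, and the function names, weights, and layer sizes are all made up for the example.

```python
def quantize(weights, bits):
    """Symmetric uniform quantization to `bits` bits (bits >= 2), then dequantize."""
    levels = 2 ** (bits - 1) - 1              # e.g. 7 for 4 bits, 127 for 8 bits
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / levels                  # step size between grid points
    return [round(w / scale) * scale for w in weights]


def max_error(weights, bits):
    """Largest absolute rounding error introduced at the given bit width."""
    return max(abs(w - q) for w, q in zip(weights, quantize(weights, bits)))


def average_bits(layer_sizes, layer_bits):
    """Parameter-weighted average bit width of a mixed-precision allocation."""
    total = sum(layer_sizes)
    return sum(n * b for n, b in zip(layer_sizes, layer_bits)) / total


# Hypothetical weights and layer sizes, for illustration only.
weights = [0.31, -0.72, 0.05, 0.9, -0.44]
err4 = max_error(weights, 4)
err8 = max_error(weights, 8)

# One small layer kept at 8 bits, one large layer at 4 bits.
budget = average_bits([1000, 3000], [8, 4])
```

A mixed allocation like the one above lands between the uniform 4-bit and 8-bit settings in storage cost, which is the regime the abstract reports on.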
28. 【2602.23191】Uni-Animator: Towards Unified Visual Colorization
Link: https://arxiv.org/abs/2602.23191
Authors: Xinyuan Chen, Yao Xu, Shaowen Wang, Pengjie Song, Bowen Deng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Diffusion Transformer, video sketch colorization, image and video, sketch colorization, based framework
Comments: 10 pages, 8 figures. Submitted to CVPR 2026
Abstract:We propose Uni-Animator, a novel Diffusion Transformer (DiT)-based framework for unified image and video sketch colorization. Existing sketch colorization methods struggle to unify image and video tasks, suffering from imprecise color transfer with single or multiple references, inadequate preservation of high-frequency physical details, and compromised temporal coherence with motion artifacts in large-motion scenes. To tackle imprecise color transfer, we introduce visual reference enhancement via instance patch embedding, enabling precise alignment and fusion of reference color information. To resolve insufficient physical detail preservation, we design physical detail reinforcement using physical features that effectively capture and retain high-frequency textures. To mitigate motion-induced temporal inconsistency, we propose sketch-based dynamic RoPE encoding that adaptively models motion-aware spatial-temporal dependencies. Extensive experimental results demonstrate that Uni-Animator achieves competitive performance on both image and video sketch colorization, matching that of task-specific methods while unlocking unified cross-domain capabilities with high detail fidelity and robust temporal consistency.
29. 【2602.23177】Phys-3D: Physics-Constrained Real-Time Crowd Tracking and Counting on Railway Platforms
Link: https://arxiv.org/abs/2602.23177
Authors: Bin Zeng, Johannes Künzel, Anna Hilsmann, Peter Eisert
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Accurate, counting, motion, capacity management, railway platforms
Comments: published at VISAPP 2026
Abstract:Accurate, real-time crowd counting on railway platforms is essential for safety and capacity management. We propose to use a single camera mounted in a train, scanning the platform as the train arrives. While hardware constraints are simple, counting remains challenging due to dense occlusions, camera motion, and perspective distortions during train arrivals. Most existing tracking-by-detection approaches assume static cameras or ignore physical consistency in motion modeling, leading to unreliable counting under dynamic conditions. We propose a physics-constrained tracking framework that unifies detection, appearance, and 3D motion reasoning in a real-time pipeline. Our approach integrates a transfer-learned YOLOv11m detector with EfficientNet-B0 appearance encoding within DeepSORT, while introducing a physics-constrained Kalman model (Phys-3D) that enforces physically plausible 3D motion dynamics through pinhole geometry. To address counting brittleness under occlusions, we implement a virtual counting band with persistence. On our platform benchmark, the MOT-RailwayPlatformCrowdHead Dataset (MOT-RPCH), our method reduces counting error to 2.97%, demonstrating robust performance despite motion and occlusions. Our results show that incorporating first-principles geometry and motion priors enables reliable crowd counting in safety-critical transportation scenarios, facilitating effective train scheduling and platform safety management.
30. 【2602.23172】Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking
Link: https://arxiv.org/abs/2602.23172
Authors: Maximilian Luz, Rohit Mohan, Thomas Nürnberg, Yakov Miron, Daniele Cattaneo, Abhinav Valada
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Keywords: Latent Gaussian Splatting, dynamic environments, Panoptic Occupancy Tracking, surroundings is crucial, safe and reliable
Comments:
Abstract:Capturing 4D spatiotemporal surroundings is crucial for the safe and reliable operation of robots in dynamic environments. However, most existing methods address only one side of the problem: they either provide coarse geometric tracking via bounding boxes, or detailed 3D structures like voxel-based occupancy that lack explicit temporal association. In this work, we present Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking (LaGS) that advances spatiotemporal scene understanding in a holistic direction. Our approach incorporates camera-based end-to-end tracking with mask-based multi-view panoptic occupancy prediction, and addresses the key challenge of efficiently aggregating multi-view information into 3D voxel grids via a novel latent Gaussian splatting approach. Specifically, we first fuse observations into 3D Gaussians that serve as a sparse point-centric latent representation of the 3D scene, and then splat the aggregated features onto a 3D voxel grid that is decoded by a mask-based segmentation head. We evaluate LaGS on the Occ3D nuScenes and Waymo datasets, achieving state-of-the-art performance for 4D panoptic occupancy tracking. We make our code available at this https URL.
31. 【2602.23169】Learning Continuous Wasserstein Barycenter Space for Generalized All-in-One Image Restoration
Link: https://arxiv.org/abs/2602.23169
Authors: Xiaole Tang, Xiaoyi He, Jiayi Xu, Xiang Gu, Jian Sun
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: existing methods remain, methods remain vulnerable, addressing diverse degradations, substantial advances, addressing diverse
Comments:
Abstract:Despite substantial advances in all-in-one image restoration for addressing diverse degradations within a unified model, existing methods remain vulnerable to out-of-distribution degradations, thereby limiting their generalization in real-world scenarios. To tackle the challenge, this work is motivated by the intuition that multisource degraded feature distributions are induced by different degradation-specific shifts from an underlying degradation-agnostic distribution, and recovering such a shared distribution is thus crucial for achieving generalization across degradations. With this insight, we propose BaryIR, a representation learning framework that aligns multisource degraded features in the Wasserstein barycenter (WB) space, which models a degradation-agnostic distribution by minimizing the average of Wasserstein distances to multisource degraded distributions. We further introduce residual subspaces, whose embeddings are mutually contrasted while remaining orthogonal to the WB embeddings. Consequently, BaryIR explicitly decouples two orthogonal spaces: a WB space that encodes the degradation-agnostic invariant contents shared across degradations, and residual subspaces that adaptively preserve the degradation-specific knowledge. This disentanglement mitigates overfitting to in-distribution degradations and enables adaptive restoration grounded on the degradation-agnostic shared invariance. Extensive experiments demonstrate that BaryIR performs competitively against state-of-the-art all-in-one methods. Notably, BaryIR generalizes well to unseen degradations (e.g., types and levels) and shows remarkable robustness in learning generalized features, even when trained on limited degradation types and evaluated on real-world data with mixed degradations.
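For readers unfamiliar with the term, the Wasserstein barycenter mentioned in this abstract is usually defined as the distribution that minimizes the average transport cost to a set of source distributions. The formula below is the standard textbook definition, written with uniform weights and the squared 2-Wasserstein distance. Both choices are assumptions here, since the abstract does not state the exact objective, and the symbols $\mu_k$, $\nu$, and $K$ are introduced only for this illustration.

```latex
% \mu_1, ..., \mu_K : the K source (degraded feature) distributions
% \nu               : candidate barycenter distribution
\nu^{\star} \;=\; \arg\min_{\nu} \; \frac{1}{K} \sum_{k=1}^{K} W_2^{2}\!\left(\nu, \mu_k\right)
```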
32. 【2602.23166】AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios
Link: https://arxiv.org/abs/2602.23166
Authors: Zhaochen Su, Jincheng Gao, Hangyu Guo, Zhenhua Liu, Lueyang Zhang, Xinyu Geng, Shijue Huang, Peng Xia, Guanyu Jiang, Cheng Wang, Yue Zhang, Yi R. Fung, Junxian He
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: solve multi-step workflows, multi-step workflows grounded, Real-world multimodal agents, agents solve multi-step, Real-world multimodal
Comments: The project website is available at [this https URL](https://agentvista-bench.github.io/), and the code is available at [this https URL](https://github.com/hkust-nlp/AgentVista)
Abstract:Real-world multimodal agents solve multi-step workflows grounded in visual evidence. For example, an agent can troubleshoot a device by linking a wiring photo to a schematic and validating the fix with online documentation, or plan a trip by interpreting a transit map and checking schedules under routing constraints. However, existing multimodal benchmarks mainly evaluate single-turn visual reasoning or specific tool skills, and they do not fully capture the realism, visual subtlety, and long-horizon tool use that practical agents require. We introduce AgentVista, a benchmark for generalist multimodal agents that spans 25 sub-domains across 7 categories, pairing realistic and detail-rich visual scenarios with natural hybrid tool use. Tasks require long-horizon tool interactions across modalities, including web search, image search, page navigation, and code-based operations for both image processing and general programming. Comprehensive evaluation of state-of-the-art models exposes significant gaps in their ability to carry out long-horizon multimodal tool use. Even the best model in our evaluation, Gemini-3-Pro with tools, achieves only 27.3% overall accuracy, and hard instances can require more than 25 tool-calling turns. We expect AgentVista to accelerate the development of more capable and reliable multimodal agents for realistic and ultra-challenging problem solving.
33. 【2602.23165】DyaDiT: A Multi-Modal Diffusion Transformer for Socially Favorable Dyadic Gesture Generation
Link: https://arxiv.org/abs/2602.23165
Authors: Yichen Peng, Jyun-Ting Song, Siyeol Jung, Ruofan Liu, Haiyang Liu, Xuangeng Chu, Ruicong Liu, Erwin Wu, Hideki Koike, Kris Kitani
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Generating realistic conversational, Generating realistic, achieving natural, essential for achieving, Seamless Interaction Dataset
Comments: 13 pages, 9 figures
Abstract:Generating realistic conversational gestures is essential for achieving natural, socially engaging interactions with digital humans. However, existing methods typically map a single audio stream to a single speaker's motion, without considering social context or modeling the mutual dynamics between two people engaging in conversation. We present DyaDiT, a multi-modal diffusion transformer that generates contextually appropriate human motion from dyadic audio signals. Trained on the Seamless Interaction Dataset, DyaDiT takes dyadic audio with optional social-context tokens to produce context-appropriate motion. It fuses information from both speakers to capture interaction dynamics, uses a motion dictionary to encode motion priors, and can optionally utilize the conversational partner's gestures to produce more responsive motion. We evaluate DyaDiT on standard motion generation metrics and conduct quantitative user studies, demonstrating that it not only surpasses existing methods on objective metrics but is also strongly preferred by users, highlighting its robustness and socially favorable motion generation. Code and models will be released upon acceptance.
34. 【2602.23153】Efficient Encoder-Free Fourier-based 3D Large Multimodal Model
Link: https://arxiv.org/abs/2602.23153
Authors: Guofeng Mei, Wei Lin, Luigi Riz, Yujiao Wu, Yiming Wang, Fabio Poiesi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Large Multimodal Models, Multimodal Models, extract geometric features, data typically rely, pre-trained visual encoders
Comments:
Abstract:Large Multimodal Models (LMMs) that process 3D data typically rely on heavy, pre-trained visual encoders to extract geometric features. While recent 2D LMMs have begun to eliminate such encoders for efficiency and scalability, extending this paradigm to 3D remains challenging due to the unordered and large-scale nature of point clouds. This leaves a critical unanswered question: How can we design an LMM that tokenizes unordered 3D data effectively and efficiently without a cumbersome encoder? We propose Fase3D, the first efficient encoder-free Fourier-based 3D scene LMM. Fase3D tackles the challenges of scalability and permutation invariance with a novel tokenizer that combines point cloud serialization and the Fast Fourier Transform (FFT) to approximate self-attention. This design enables an effective and computationally minimal architecture, built upon three key innovations: First, we represent large scenes compactly via structured superpoints. Second, our space-filling curve serialization followed by an FFT enables efficient global context modeling and graph-based token merging. Lastly, our Fourier-augmented LoRA adapters inject global frequency-aware interactions into the LLMs at a negligible cost. Fase3D achieves performance comparable to encoder-based 3D LMMs while being significantly more efficient in computation and parameters. Project website: this https URL.
35. 【2602.23146】Partial recovery of meter-scale surface weather
Link: https://arxiv.org/abs/2602.23146
Authors: Jonathan Giezendanner, Qidong Yang, Eric Schmitt, Anirban Chandra, Daniel Salles Civitarese, Johannes Jakubik, Jeremy Vila, Detlef Hohl, Campbell Watson, Sherrie Wang
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Atmospheric and Oceanic Physics (physics.ao-ph)
Keywords: current weather analyses, Near-surface atmospheric conditions, analyses and forecasts, conditions can differ, differ sharply
Comments:
Abstract:Near-surface atmospheric conditions can differ sharply over tens to hundreds of meters due to land cover and topography, yet this variability is absent from current weather analyses and forecasts. It is unclear whether such meter-scale variability reflects irreducibly chaotic dynamics or contains a component predictable from surface characteristics and large-scale atmospheric forcing. Here we show that a substantial, physically coherent component of meter-scale near-surface weather is statistically recoverable from existing observations. By conditioning coarse atmospheric state on sparse surface station measurements and high-resolution Earth observation data, we infer spatially continuous fields of near-surface wind, temperature, and humidity at 10 m resolution across the contiguous United States. Relative to ERA5, the inferred fields reduce wind error by 29% and temperature and dewpoint error by 6%, while explaining substantially more spatial variance at fixed time steps. They also exhibit physically interpretable structure, including urban heat islands, evapotranspiration-driven humidity contrasts, and wind speed differences across land cover types. Our findings expand the frontier of weather modeling by demonstrating a computationally feasible approach to continental-scale meter-resolution inference. More broadly, they illustrate how conditioning coarse dynamical models on static fine-scale features can reveal previously unresolved components of the Earth system.
36. 【2602.23141】No Labels, No Look-Ahead: Unsupervised Online Video Stabilization with Classical Priors
Link: https://arxiv.org/abs/2602.23141
Authors: Tao Liu, Gang Wan, Kan Ren, Shibo Wen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: unsupervised framework, Abstract, stabilization, multithreaded buffering mechanism, UAV
Comments: CVPR 2026
Abstract:We propose a new unsupervised framework for online video stabilization. Unlike methods based on deep learning that require paired stable and unstable datasets, our approach instantiates the classical stabilization pipeline with three stages and incorporates a multithreaded buffering mechanism. This design addresses three longstanding challenges in end-to-end learning: limited data, poor controllability, and inefficiency on hardware with constrained resources. Existing benchmarks focus mainly on handheld videos with a forward view in visible light, which restricts the applicability of stabilization to domains such as UAV nighttime remote sensing. To fill this gap, we introduce a new multimodal UAV aerial video dataset (UAV-Test). Experiments show that our method consistently outperforms state-of-the-art online stabilizers in both quantitative metrics and visual quality, while achieving performance comparable to offline methods.
37. 【2602.23133】From Calibration to Refinement: Seeking Certainty via Probabilistic Evidence Propagation for Noisy-Label Person Re-Identification
Link: https://arxiv.org/abs/2602.23133
Authors: Xin Yuan, Zhiyong Zhang, Xin Xu, Zheng Wang, Chia-Wen Lin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: robust person Re-ID, sparse per-identity samples, per-identity samples remains, person Re-ID methods, person Re-ID
Comments: Accepted by IEEE TMM 2026
Abstract:With the increasing demand for robust person Re-ID in unconstrained environments, learning from datasets with noisy labels and sparse per-identity samples remains a critical challenge. Existing noise-robust person Re-ID methods primarily rely on loss-correction or sample-selection strategies using softmax outputs. However, these methods suffer from two key limitations: 1) Softmax exhibits translation invariance, leading to over-confident and unreliable predictions on corrupted labels. 2) Conventional sample selection based on small-loss criteria often discards valuable hard positives that are crucial for learning discriminative features. To overcome these issues, we propose the CAlibration-to-REfinement (CARE) method, a two-stage framework that seeks certainty through probabilistic evidence propagation from calibration to refinement. In the calibration stage, we propose the probabilistic evidence calibration (PEC) that dismantles softmax translation invariance by injecting adaptive learnable parameters into the similarity function, and employs an evidential calibration loss to mitigate overconfidence on mislabeled samples. In the refinement stage, we design the evidence propagation refinement (EPR) that can more accurately distinguish between clean and noisy samples. Specifically, the EPR contains two steps: first, the composite angular margin (CAM) metric is proposed to precisely distinguish clean but hard-to-learn positive samples from mislabeled ones in a hyperspherical space; second, the certainty-oriented sphere weighting (COSW) is developed to dynamically allocate the importance of samples according to CAM, ensuring clean instances drive model updates. Extensive experimental results on Market1501, DukeMTMC-ReID, and CUHK03 datasets under both random and patterned noises show that CARE achieves competitive performance.
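The translation invariance of softmax that this abstract refers to is a standard property: adding the same constant to every logit leaves the output probabilities unchanged. The short sketch below only demonstrates that property. It does not implement CARE, PEC, or any other part of the paper, and the logit values are arbitrary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]
shifted = [z + 5.0 for z in logits]   # same constant added to every logit

p = softmax(logits)
q = softmax(shifted)

# The two distributions agree up to floating-point error, so the absolute
# magnitude of the logits carries no information in the softmax output.
max_gap = max(abs(a - b) for a, b in zip(p, q))
```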
38. 【2602.23120】TriLite: Efficient Weakly Supervised Object Localization with Universal Visual Features and Tri-Region Disentanglement
Link: https://arxiv.org/abs/2602.23120
Authors: Arian Sabaghi, José Oramas
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Weakly supervised object, Weakly supervised, localize target objects, aims to localize, image-level labels
Comments: This paper consists of 8 pages including 6 figures. Accepted at CVPR 2026
Abstract:Weakly supervised object localization (WSOL) aims to localize target objects in images using only image-level labels. Despite recent progress, many approaches still rely on multi-stage pipelines or full fine-tuning of large backbones, which increases training cost, while the broader WSOL community continues to face the challenge of partial object coverage. We present TriLite, a single-stage WSOL framework that leverages a frozen Vision Transformer with self-supervised DINOv2 pre-training, and introduces only a minimal number of trainable parameters (fewer than 800K on ImageNet-1K) for both classification and localization. At its core is the proposed TriHead module, which decomposes patch features into foreground, background, and ambiguous regions, thereby improving object coverage while suppressing spurious activations. By disentangling classification and localization objectives, TriLite effectively exploits the universal representations learned by self-supervised ViTs without requiring expensive end-to-end training. Extensive experiments on CUB-200-2011, ImageNet-1K, and OpenImages demonstrate that TriLite sets a new state of the art, while remaining significantly more parameter-efficient and easier to train than prior methods. The code will be released soon.
39. 【2602.23117】Delving into Adversarial Transferability on Image Classification: Review, Benchmark, and Evaluation
Link: https://arxiv.org/abs/2602.23117
Authors: Xiaosen Wang, Zhijin Ge, Bohan Liu, Zheng Fang, Fengfan Zhou, Ruixuan Zhang, Shaokang Wang, Yuyang Luo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: unexposed victim models, deceive alternate, unexposed victim, Adversarial transferability refers, surrogate model
Comments: Code is available at [this https URL](https://github.com/Trustworthy-AI-Group/TransferAttack)
Abstract:Adversarial transferability refers to the capacity of adversarial examples generated on the surrogate model to deceive alternate, unexposed victim models. This property eliminates the need for direct access to the victim model during an attack, thereby raising considerable security concerns in practical applications and attracting substantial research attention recently. In this work, we discern a lack of a standardized framework and criteria for evaluating transfer-based attacks, leading to potentially biased assessments of existing approaches. To rectify this gap, we have conducted an exhaustive review of hundreds of related works, organizing various transfer-based attacks into six distinct categories. Subsequently, we propose a comprehensive framework designed to serve as a benchmark for evaluating these attacks. In addition, we delineate common strategies that enhance adversarial transferability and highlight prevalent issues that could lead to unfair comparisons. Finally, we provide a brief review of transfer-based attacks beyond image classification.
40. 【2602.23115】FLIGHT: Fibonacci Lattice-based Inference for Geometric Heading in real-Time
Link: https://arxiv.org/abs/2602.23115
Authors: David Dirnfeld, Fabien Delattre, Pedro Miraldo, Erik Learned-Miller
Categories: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG); Robotics (cs.RO)
Keywords: Estimating camera motion, Estimating camera, visual odometry, computer vision, central to tasks
Comments:
Abstract:Estimating camera motion from monocular video is a fundamental problem in computer vision, central to tasks such as SLAM, visual odometry, and structure-from-motion. Existing methods that recover the camera's heading under known rotation, whether from an IMU or an optimization algorithm, tend to perform well in low-noise, low-outlier conditions, but often decrease in accuracy or become computationally expensive as noise and outlier levels increase. To address these limitations, we propose a novel generalization of the Hough transform on the unit sphere ($S^2$) to estimate the camera's heading. First, the method extracts correspondences between two frames and generates a great circle of directions compatible with each pair of correspondences. Then, by discretizing the unit sphere using a Fibonacci lattice as bin centers, each great circle casts votes for a range of directions, ensuring that features unaffected by noise or dynamic objects vote consistently for the correct motion direction. Experimental results on three datasets demonstrate that the proposed method is on the Pareto frontier of accuracy versus efficiency. Additionally, experiments on SLAM show that the proposed method reduces RMSE by correcting the heading during camera pose initialization.
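The voting scheme described in this abstract can be pictured with a toy example: bin centers placed on the unit sphere with a Fibonacci lattice, and each great circle voting for the bins that lie close to it. The sketch below is only an illustration under simplifying assumptions. Each correspondence is assumed to be already reduced to the unit normal of its great circle, the band width `band` is a hypothetical parameter, and the brute-force loop is not the authors' implementation.

```python
import math


def fibonacci_lattice(n):
    """Return n roughly evenly spaced unit vectors on the sphere."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - (2.0 * i + 1.0) / n            # z strictly inside (-1, 1)
        r = math.sqrt(max(0.0, 1.0 - z * z))     # radius of the circle at height z
        phi = i * golden_angle
        points.append((r * math.cos(phi), r * math.sin(phi), z))
    return points


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def vote(bins, normals, band=0.05):
    """Each great circle (unit normal n) votes for bins d with |n . d| < band."""
    votes = [0] * len(bins)
    for n in normals:
        for k, d in enumerate(bins):
            if abs(dot(n, d)) < band:
                votes[k] += 1
    return votes


# Toy setup: the true heading is the z axis, so every compatible great
# circle has a normal lying in the xy plane.
true_heading = (0.0, 0.0, 1.0)
normals = [(math.cos(k * math.pi / 8), math.sin(k * math.pi / 8), 0.0)
           for k in range(8)]

bins = fibonacci_lattice(2000)
votes = vote(bins, normals)
best = bins[votes.index(max(votes))]
# A great circle contains both d and -d, so this toy vote recovers the
# heading axis only up to sign.
```

In this toy setup the winning bin sits next to the pole where all eight great circles intersect. Circles produced by outliers would scatter their votes elsewhere, which is the intuition behind Hough-style robustness.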
41. 【2602.23114】WARM-CAT: Warm-Started Test-Time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning
Link: https://arxiv.org/abs/2602.23114
Authors: Xudong Yan, Songhe Feng, Jiaxin Wang, Xin Su, Yi Jin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Compositional Zero-Shot Learning, Compositional Zero-Shot, attribute-object compositions based, aims to recognize, recognize novel attribute-object
Comments:
Abstract:Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions based on the knowledge learned from seen ones. Existing methods suffer from performance degradation caused by the distribution shift of label space at test time, which stems from the inclusion of unseen compositions recombined from attributes and objects. To overcome the challenge, we propose a novel approach that accumulates comprehensive knowledge in both textual and visual modalities from unsupervised data to update multimodal prototypes at test time. Building on this, we further design an adaptive update weight to control the degree of prototype adjustment, enabling the model to flexibly adapt to distribution shift during testing. Moreover, a dynamic priority queue is introduced that stores high-confidence images to acquire visual prototypes from historical images for inference. Since the model tends to favor compositions already stored in the queue during testing, we warm-start the queue by initializing it with training images for visual prototypes of seen compositions and generating unseen visual prototypes using the mapping learned between seen and unseen textual prototypes. Considering the semantic consistency of multimodal knowledge, we align textual and visual prototypes by multimodal collaborative representation learning. To provide a more reliable evaluation for CZSL, we introduce a new benchmark dataset, C-Fashion, and refine the widely used but noisy MIT-States dataset. Extensive experiments indicate that our approach achieves state-of-the-art performance on four benchmark datasets under both closed-world and open-world settings. The source code and datasets are available at this https URL.
42. 【2602.23103】SpectralMamba-UNet: Frequency-Disentangled State Space Modeling for Texture-Structure Consistent Medical Image Segmentation
Link: https://arxiv.org/abs/2602.23103
Authors: Fuhao Zhang, Lei Liu, Jialin Zhang, Ya-Nan Zhang, Nan Mu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Accurate medical image, requires effective modeling, fine-grained boundary details, Accurate medical, global anatomical structures
Comments:
Abstract:Accurate medical image segmentation requires effective modeling of both global anatomical structures and fine-grained boundary details. Recent state space models (e.g., Vision Mamba) offer efficient long-range dependency modeling. However, their one-dimensional serialization weakens local spatial continuity and high-frequency representation. To this end, we propose SpectralMamba-UNet, a novel frequency-disentangled framework to decouple the learning of structural and textural information in the spectral domain. Our Spectral Decomposition and Modeling (SDM) module applies discrete cosine transform to decompose low- and high-frequency features, where low frequency contributes to global contextual modeling via a frequency-domain Mamba and high frequency preserves boundary-sensitive details. To balance spectral contributions, we introduce a Spectral Channel Reweighting (SCR) mechanism to form channel-wise frequency-aware attention, and a Spectral-Guided Fusion (SGF) module to achieve adaptive multi-scale fusion in the decoder. Experiments on five public benchmarks demonstrate consistent improvements across diverse modalities and segmentation targets, validating the effectiveness and generalizability of our approach.
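The basic idea of splitting a signal into low- and high-frequency parts with a discrete cosine transform can be shown on a one-dimensional toy signal. The sketch below uses a hand-written orthonormal DCT-II and its inverse so that it runs without external libraries. The cutoff index and the sample signal are arbitrary, and nothing here reproduces the SDM module, which the abstract describes as operating on learned features.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a list of floats."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n)))
    return out


def idct(c):
    """Inverse of the orthonormal DCT-II above (a DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        out.append(sum(math.sqrt((1.0 if k == 0 else 2.0) / n) * c[k]
                       * math.cos(math.pi * (i + 0.5) * k / n)
                       for k in range(n)))
    return out


def split_bands(x, cutoff):
    """Split x into a low band (coefficients < cutoff) and a high band (the rest)."""
    coeffs = dct(x)
    low = idct([c if k < cutoff else 0.0 for k, c in enumerate(coeffs)])
    high = idct([c if k >= cutoff else 0.0 for k, c in enumerate(coeffs)])
    return low, high


signal = [1.0, 2.0, 4.0, 3.0, 0.5, -1.0, 0.0, 2.5]
low, high = split_bands(signal, cutoff=3)
# Because the transform is linear and invertible, low + high rebuilds the input.
rebuilt = [l + h for l, h in zip(low, high)]
```

The low band carries the smooth trend of the signal and the high band carries the sharp changes, which matches the structure-versus-boundary split the abstract describes.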
43. 【2602.23101】Locally Adaptive Decay Surfaces for High-Speed Face and Landmark Detection with Event Cameras
Link: https://arxiv.org/abs/2602.23101
Authors: Paul Kielty, Timothy Hanley, Peter Corcoran
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: cameras record luminance, microsecond resolution, converting their sparse, asynchronous output, core challenge
Comments:
Abstract:Event cameras record luminance changes with microsecond resolution, but converting their sparse, asynchronous output into dense tensors that neural networks can exploit remains a core challenge. Conventional histograms or globally-decayed time-surface representations apply fixed temporal parameters across the entire image plane, which in practice creates a trade-off between preserving spatial structure during still periods and retaining sharp edges during rapid motion. We introduce Locally Adaptive Decay Surfaces (LADS), a family of event representations in which the temporal decay at each location is modulated according to local signal dynamics. Three strategies are explored, based on event rate, Laplacian-of-Gaussian response, and high-frequency spectral energy. These adaptive schemes preserve detail in quiescent regions while reducing blur in regions of dense activity. Extensive experiments on public data show that LADS consistently improves both face detection and facial landmark accuracy compared to standard non-adaptive representations. At 30 Hz, LADS achieves higher detection accuracy and lower landmark error than either baseline, and at 240 Hz it mitigates the accuracy decline typically observed at higher frequencies, sustaining 2.44% normalized mean error for landmarks and 0.966 mAP50 in face detection. These high-frequency results even surpass the accuracy reported in prior works operating at 30 Hz, setting new benchmarks for event-based face analysis. Moreover, by preserving spatial structure at the representation stage, LADS supports the use of much lighter network architectures while still retaining real-time performance. These results highlight the importance of context-aware temporal integration for neuromorphic vision and point toward real-time, high-frequency human-computer interaction systems that exploit the unique advantages of event cameras.
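To make the contrast between a global and a locally adaptive decay concrete, the sketch below builds an exponentially decayed time surface in which each pixel has its own time constant. The rule that maps local event rate to a time constant (`adaptive_tau`) and all of its parameters are hypothetical and chosen only for illustration. The abstract does not give the actual LADS formulas, so this should not be read as the authors' method.

```python
import math


def adaptive_tau(event_count, window, tau_min, tau_max, rate_ref):
    """Hypothetical rule: a higher local event rate gives a shorter time constant."""
    taus = []
    for row in event_count:
        taus.append([max(tau_min, tau_max / (1.0 + (c / window) / rate_ref))
                     for c in row])
    return taus


def time_surface(last_ts, t_now, tau):
    """Exponentially decayed time surface with a per-pixel time constant.

    last_ts holds the timestamp of the latest event per pixel, or None if
    the pixel has not fired yet.
    """
    return [[math.exp(-(t_now - t) / tau[y][x]) if t is not None else 0.0
             for x, t in enumerate(row)]
            for y, row in enumerate(last_ts)]


# A 1 x 3 toy sensor: one quiet pixel, one busy pixel, one silent pixel.
window = 0.1                       # seconds of history used to count events
event_count = [[1, 50, 0]]         # events per pixel inside the window
last_ts = [[0.0, 0.09, None]]      # latest event time per pixel, in seconds

tau = adaptive_tau(event_count, window, tau_min=0.005, tau_max=0.1, rate_ref=100.0)
surface = time_surface(last_ts, t_now=0.1, tau=tau)
```

With this rule the quiet pixel keeps a long memory, so sparse structure persists, while the busy pixel forgets quickly, which limits smearing. That is the trade-off the abstract says a single global decay cannot resolve.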
44. 【2602.23088】Cytoarchitecture in Words: Weakly Supervised Vision-Language Modeling for Human Brain Microscopy
Link: https://arxiv.org/abs/2602.23088
Authors: Matthew Sutton, Katrin Amunts, Timo Dickscheid, Christian Schiffer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: increasingly offer potential, models increasingly offer, increasingly offer, offer potential, assist researchers
Comments: 8 pages, 3 figures, submitted for inclusion at a conference
Abstract:Foundation models increasingly offer potential to support interactive, agentic workflows that assist researchers during analysis and interpretation of image data. Such workflows often require coupling vision to language to provide a natural-language interface. However, paired image-text data needed to learn this coupling are scarce and difficult to obtain in many research and clinical settings. One such setting is microscopic analysis of cell-body-stained histological human brain sections, which enables the study of cytoarchitecture: cell density and morphology and their laminar and areal organization. Here, we propose a label-mediated method that generates meaningful captions from images by linking images and text only through a label, without requiring curated paired image-text data. Given the label, we automatically mine area descriptions from related literature and use them as synthetic captions reflecting canonical cytoarchitectonic attributes. An existing cytoarchitectonic vision foundation model (CytoNet) is then coupled to a large language model via an image-to-text training objective, enabling microscopy regions to be described in natural language. Across 57 brain areas, the resulting method produces plausible area-level descriptions and supports open-set use through explicit rejection of unseen areas. It matches the cytoarchitectonic reference label for in-scope patches with 90.6% accuracy and, with the area label masked, its descriptions remain discriminative enough to recover the area in an 8-way test with 68.6% accuracy. These results suggest that weak, label-mediated pairing can suffice to connect existing biomedical vision foundation models to language, providing a practical recipe for integrating natural language in domains where fine-grained paired annotations are scarce.
45. 【2602.23069】Align then Adapt: Rethinking Parameter-Efficient Transfer Learning in 4D Perception
Link: https://arxiv.org/abs/2602.23069
Authors: Yiding Sun, Jihua Zhu, Haozhe Cheng, Chaoyi Lu, Zhichuan Yang, Lin Chen, Yaonan Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: accurately encodes motion, cloud video understanding, scene interaction, accurately encodes, encodes motion
Comments:
Click to view abstract
Abstract:Point cloud video understanding is critical for robotics as it accurately encodes motion and scene interaction. We recognize that 4D datasets are far scarcer than 3D ones, which hampers the scalability of self-supervised 4D models. A promising alternative is to transfer 3D pre-trained models to 4D perception tasks. However, rigorous empirical analysis reveals two critical limitations that impede transfer capability: overfitting and the modality gap. To overcome these challenges, we develop a novel "Align then Adapt" (PointATA) paradigm that decomposes parameter-efficient transfer learning into two sequential stages. Optimal-transport theory is employed to quantify the distributional discrepancy between 3D and 4D datasets, enabling our proposed point align embedder to be trained in Stage 1 to alleviate the underlying modality gap. To mitigate overfitting, an efficient point-video adapter and a spatial-context encoder are integrated into the frozen 3D backbone to enhance temporal modeling capacity in Stage 2. Notably, with the above engineering-oriented designs, PointATA enables a pre-trained 3D model without temporal knowledge to reason about dynamic video content at a smaller parameter cost compared to previous work. Extensive experiments show that PointATA can match or even outperform strong full fine-tuning models, whilst enjoying the advantage of parameter efficiency, e.g. 97.21% accuracy on 3D action recognition, +8.7% on 4D action segmentation, and 84.06% on 4D semantic segmentation.
46. 【2602.23058】GeoWorld: Geometric World Models
Link: https://arxiv.org/abs/2602.23058
Authors: Zeyu Zhang, Danning Li, Ian Reid, Richard Hartley
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: latent energy landscapes, generating pixels, Energy-based predictive world, world models provide, predictive world models
Comments: Accepted to CVPR 2026
Click to view abstract
Abstract:Energy-based predictive world models provide a powerful approach for multi-step visual planning by reasoning over latent energy landscapes rather than generating pixels. However, existing approaches face two major challenges: (i) their latent representations are typically learned in Euclidean space, neglecting the underlying geometric and hierarchical structure among states, and (ii) they struggle with long-horizon prediction, which leads to rapid degradation across extended rollouts. To address these challenges, we introduce GeoWorld, a geometric world model that preserves geometric structure and hierarchical relations through a Hyperbolic JEPA, which maps latent representations from Euclidean space onto hyperbolic manifolds. We further introduce Geometric Reinforcement Learning for energy-based optimization, enabling stable multi-step planning in hyperbolic latent space. Extensive experiments on CrossTask and COIN demonstrate around 3% SR improvement in 3-step planning and 2% SR improvement in 4-step planning compared to the state-of-the-art V-JEPA 2. Project website: this https URL.
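The abstract mentions mapping latent representations from Euclidean space onto hyperbolic manifolds. As a rough illustration only, the sketch below uses the standard Poincaré-ball exponential map at the origin and the matching geodesic distance. Whether GeoWorld uses this particular model of hyperbolic space is an assumption, and the function names and the curvature parameter `c` are hypothetical.

```python
import math


def exp_map_origin(v, c=1.0):
    """Map a Euclidean vector v (tangent at the origin) into the Poincare ball.

    The ball has curvature -c; this is the textbook formula, not code from
    the paper.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]


def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points strictly inside the ball."""
    def sq_norm(u):
        return sum(a * a for a in u)

    diff = sq_norm([a - b for a, b in zip(x, y)])
    denom = (1.0 - c * sq_norm(x)) * (1.0 - c * sq_norm(y))
    return math.acosh(1.0 + 2.0 * c * diff / denom) / math.sqrt(c)
```

Points far from the origin in Euclidean space land near the boundary of the ball, where distances grow quickly, which is the usual argument for why hyperbolic space suits hierarchical structure.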
47. 【2602.23043】D-FINE-seg: Object Detection and Instance Segmentation Framework with multi-backend deployment
Link: https://arxiv.org/abs/2602.23043
Authors: Argo Saakyan, Dmitry Solntsev
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: strong accuracy-latency trade-offs, top-performing recent architectures, detectors achieve strong, achieve strong accuracy-latency, Transformer-based real-time object
Comments: 6 pages, 4 figures, 5 tables
Click to view abstract
Abstract:Transformer-based real-time object detectors achieve strong accuracy-latency trade-offs, and D-FINE is among the top-performing recent architectures. However, real-time instance segmentation with transformers is still less common. We present D-FINE-seg, an instance segmentation extension of D-FINE that adds: a lightweight mask head, segmentation-aware training, including box cropped BCE and dice mask losses, auxiliary and denoising mask supervision, and adapted Hungarian matching cost. On the TACO dataset, D-FINE-seg improves F1-score over Ultralytics YOLO26 under a unified TensorRT FP16 end-to-end benchmarking protocol, while maintaining competitive latency. Second contribution is an end-to-end pipeline for training, exporting, and optimized inference across ONNX, TensorRT, OpenVINO for both object detection and instance segmentation tasks. This framework is released as open-source under the Apache-2.0 license. GitHub repository - this https URL.
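The abstract lists box-cropped BCE and dice mask losses among the training changes. A minimal sketch of what such a pair of losses could look like is given below, in plain Python on nested lists. It assumes "box cropped" means the losses are evaluated only on pixels inside the ground-truth box; the function name, the `(x0, y0, x1, y1)` box convention, and the `eps` smoothing are all hypothetical and not taken from the D-FINE-seg code.

```python
import math


def box_cropped_mask_loss(pred, target, box, eps=1e-6):
    """Return (bce, dice) computed only inside box = (x0, y0, x1, y1).

    pred holds probabilities in [0, 1], target holds 0/1 labels, and both
    are indexed as [row][column]. x1 and y1 are exclusive (an assumption).
    """
    x0, y0, x1, y1 = box
    p = [pred[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    t = [target[y][x] for y in range(y0, y1) for x in range(x0, x1)]

    # Binary cross-entropy, clamped to avoid log(0).
    bce = -sum(
        ti * math.log(max(pi, eps)) + (1 - ti) * math.log(max(1 - pi, eps))
        for pi, ti in zip(p, t)
    ) / len(p)

    # Soft dice loss: 1 - 2|P∩T| / (|P| + |T|).
    inter = sum(pi * ti for pi, ti in zip(p, t))
    dice = 1.0 - (2.0 * inter + eps) / (sum(p) + sum(t) + eps)
    return bce, dice
```

Cropping to the box means predictions outside the object's box do not contribute to the mask loss, so they cannot dominate it when the object is small.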
48. 【2602.23040】PackUV: Packed Gaussian UV Maps for 4D Volumetric Video
Link: https://arxiv.org/abs/2602.23040
Authors: Aashish Rai, Angela Xing, Anushka Agarwal, Xiaoyan Cong, Zekun Li, Tao Lu, Aayush Prakash, Srinath Sridhar
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: videos offer immersive, offer immersive, difficult to reconstruct, stream at scale, Gaussian Splatting based
Comments: [this https URL](https://ivl.cs.brown.edu/packuv)
Click to view abstract
Abstract:Volumetric videos offer immersive 4D experiences, but remain difficult to reconstruct, store, and stream at scale. Existing Gaussian Splatting based methods achieve high-quality reconstruction but break down on long sequences, temporal inconsistency, and fail under large motions and disocclusions. Moreover, their outputs are typically incompatible with conventional video coding pipelines, preventing practical applications. We introduce PackUV, a novel 4D Gaussian representation that maps all Gaussian attributes into a sequence of structured, multi-scale UV atlas, enabling compact, image-native storage. To fit this representation from multi-view videos, we propose PackUV-GS, a temporally consistent fitting method that directly optimizes Gaussian parameters in the UV domain. A flow-guided Gaussian labeling and video keyframing module identifies dynamic Gaussians, stabilizes static regions, and preserves temporal coherence even under large motions and disocclusions. The resulting UV atlas format is the first unified volumetric video representation compatible with standard video codecs (e.g., FFV1) without losing quality, enabling efficient streaming within existing multimedia infrastructure. To evaluate long-duration volumetric capture, we present PackUV-2B, the largest multi-view video dataset to date, featuring more than 50 synchronized cameras, substantial motion, and frequent disocclusions across 100 sequences and 2B (billion) frames. Extensive experiments demonstrate that our method surpasses existing baselines in rendering fidelity while scaling to sequences up to 30 minutes with consistent quality.
Journal reference: CVPR 2026
49. 【2602.23031】Small Object Detection Model with Spatial Laplacian Pyramid Attention and Multi-Scale Features Enhancement in Aerial Images
Link: https://arxiv.org/abs/2602.23031
Authors: Zhangjian Ji, Huijia Yan, Shaotong Qiao, Kai Feng, Wei Wei
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Spatial Laplacian Pyramid, Laplacian Pyramid Attention, Feature Pyramid Network, including small size, makes detection inefficient
Comments:
Click to view abstract
Abstract:Detecting objects in aerial images confronts some significant challenges, including small size, dense and non-uniform distribution of objects over high-resolution images, which makes detection inefficient. Thus, in this paper, we proposed a small object detection algorithm based on a Spatial Laplacian Pyramid Attention and Multi-Scale Feature Enhancement in aerial images. Firstly, in order to improve the feature representation of ResNet-50 on small objects, we presented a novel Spatial Laplacian Pyramid Attention (SLPA) module, which is integrated after each stage of ResNet-50 to identify and emphasize important local regions. Secondly, to enhance the model's semantic understanding and features representation, we designed a Multi-Scale Feature Enhancement Module (MSFEM), which is incorporated into the lateral connections of C5 layer for building Feature Pyramid Network (FPN). Finally, the features representation quality of traditional feature pyramid network will be affected because the features are not aligned when the upper and lower layers are fused. In order to handle it, we utilized deformable convolutions to align the features in the fusion processing of the upper and lower levels of the Feature Pyramid Network, which can help enhance the model's ability to detect and recognize small objects. The extensive experimental results on two benchmark datasets: VisDrone and DOTA demonstrate that our improved model performs better for small object detection in aerial images compared to the original algorithm.
50. 【2602.23029】WISER: Wider Search, Deeper Thinking, and Adaptive Fusion for Training-Free Zero-Shot Composed Image Retrieval
Link: https://arxiv.org/abs/2602.23029
Authors: Tianyue Wang, Leigang Qu, Tianyu Yang, Xiangzhao Hao, Yifan Xu, Haiyun Guo, Jinqiao Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Zero-Shot Composed Image, Zero-Shot Composed, Composed Image Retrieval, retrieve target images, Composed Image
Comments:
Click to view abstract
Abstract:Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a multimodal query (comprising a reference image and a modification text), without training on annotated triplets. Existing methods typically convert the multimodal query into a single modality, either as an edited caption for Text-to-Image retrieval (T2I) or as an edited image for Image-to-Image retrieval (I2I). However, each paradigm has inherent limitations: T2I often loses fine-grained visual details, while I2I struggles with complex semantic modifications. To effectively leverage their complementary strengths under diverse query intents, we propose WISER, a training-free framework that unifies T2I and I2I via a "retrieve-verify-refine" pipeline, explicitly modeling intent awareness and uncertainty awareness. Specifically, WISER first performs Wider Search by generating both edited captions and images for parallel retrieval to broaden the candidate pool. Then, it conducts Adaptive Fusion with a verifier to assess retrieval confidence, triggering refinement for uncertain retrievals, and dynamically fusing the dual-path for reliable ones. For uncertain retrievals, WISER generates refinement suggestions through structured self-reflection to guide the next retrieval round toward Deeper Thinking. Extensive experiments demonstrate that WISER significantly outperforms previous methods across multiple benchmarks, achieving relative improvements of 45% on CIRCO (mAP@5) and 57% on CIRR (Recall@1) over existing training-free methods. Notably, it even surpasses many training-dependent methods, highlighting its superiority and generalization under diverse scenarios. Code will be released at this https URL.
51. 【2602.23022】DMAligner: Enhancing Image Alignment via Diffusion Model Based View Synthesis
Link: https://arxiv.org/abs/2602.23022
Authors: Xinglong Luo, Ao Luo, Zhengning Wang, Yueqi Yang, Chaoyu Feng, Lei Lei, Bing Zeng, Shuaicheng Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Image alignment, Image, broad applications, flow-based image warping, computer vision
Comments: Accepted by CVPR 2026
Click to view abstract
Abstract:Image alignment is a fundamental task in computer vision with broad applications. Existing methods predominantly employ optical flow-based image warping. However, this technique is susceptible to common challenges such as occlusions and illumination variations, leading to degraded alignment visual quality and compromised accuracy in downstream tasks. In this paper, we present DMAligner, a diffusion-based framework for image alignment through alignment-oriented view synthesis. DMAligner is crafted to tackle the challenges in image alignment from a new perspective, employing a generation-based solution that showcases strong capabilities and avoids the problems associated with flow-based image warping. Specifically, we propose a Dynamics-aware Diffusion Training approach for learning conditional image generation, synthesizing a novel view for image alignment. This incorporates a Dynamics-aware Mask Producing (DMP) module to adaptively distinguish dynamic foreground regions from static backgrounds, enabling the diffusion model to more effectively handle challenges that classical methods struggle to solve. Furthermore, we develop the Dynamic Scene Image Alignment (DSIA) dataset using Blender, which includes 1,033 indoor and outdoor scenes with over 30K image pairs tailored for image alignment. Extensive experimental results demonstrate the superiority of the proposed approach on DSIA benchmarks, as well as on a series of widely-used video datasets for qualitative comparisons. Our code is available at this https URL.
52. 【2602.23013】SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling
Link: https://arxiv.org/abs/2602.23013
Authors: Camile Lendering, Erkut Akdag, Egor Bondarev
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Detecting visual anomalies, Detecting visual, industrial inspection, inspection often requires, Detecting
Comments: Accepted to CVPR 2026
Click to view abstract
Abstract:Detecting visual anomalies in industrial inspection often requires training with only a few normal images per category. Recent few-shot methods achieve strong results employing foundation-model features, but typically rely on memory banks, auxiliary datasets, or multi-modal tuning of vision-language models. We therefore question whether such complexity is necessary given the feature representations of vision foundation models. To answer this question, we introduce SubspaceAD, a training-free method, that operates in two simple stages. First, patch-level features are extracted from a small set of normal images by a frozen DINOv2 backbone. Second, a Principal Component Analysis (PCA) model is fit to these features to estimate the low-dimensional subspace of normal variations. At inference, anomalies are detected via the reconstruction residual with respect to this subspace, producing interpretable and statistically grounded anomaly scores. Despite its simplicity, SubspaceAD achieves state-of-the-art performance across one-shot and few-shot settings without training, prompt tuning, or memory banks. In the one-shot anomaly detection setting, SubspaceAD achieves image-level and pixel-level AUROC of 98.0% and 97.6% on the MVTec-AD dataset, and 93.3% and 98.3% on the VisA dataset, respectively, surpassing prior state-of-the-art results. Code and demo are available at this https URL.
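The two stages in the abstract (fit PCA to features of normal images, then score by the reconstruction residual) can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the feature vectors are plain Python lists standing in for DINOv2 patch features, a simple power iteration with deflation stands in for a library PCA, and all function names are hypothetical.

```python
import math


def _matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def fit_subspace(features, k, iters=500):
    """Return (mean, basis) where basis holds up to k principal directions."""
    n, d = len(features), len(features[0])
    mean = [sum(row[j] for row in features) / n for j in range(d)]
    centered = [[x - m for x, m in zip(row, mean)] for row in features]
    cov = [[sum(r[a] * r[b] for r in centered) / n for b in range(d)]
           for a in range(d)]
    basis = []
    for _component in range(k):
        v = [float(j + 1) for j in range(d)]  # fixed, deterministic start
        norm = 0.0
        for _step in range(iters):
            w = _matvec(cov, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        if norm < 1e-12:
            break  # no variance left to explain
        eigval = sum(a * b for a, b in zip(v, _matvec(cov, v)))
        basis.append(v)
        # Deflate: remove the direction just found from the covariance.
        cov = [[cov[a][b] - eigval * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    return mean, basis


def anomaly_score(feature, mean, basis):
    """Norm of the part of the feature that the normal subspace cannot explain."""
    centered = [x - m for x, m in zip(feature, mean)]
    recon = [0.0] * len(centered)
    for v in basis:
        coeff = sum(a * b for a, b in zip(centered, v))
        recon = [r + coeff * a for r, a in zip(recon, v)]
    return math.sqrt(sum((c - r) ** 2 for c, r in zip(centered, recon)))
```

A feature that lies in the subspace spanned by normal variation gets a score near zero, while one that departs from it gets a score equal to its distance from that subspace.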
53. 【2602.23010】HELMLAB: An Analytical, Data-Driven Color Space for Perceptual Distance in UI Design Systems
Link: https://arxiv.org/abs/2602.23010
Authors: Gorkem Yildiz
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: analytical color space, maps CIE XYZ, Fourier hue correction, present HELMLAB, perceptually-organized Lab representation
Comments: 9 pages, 6 figures. Code and demo available at: [this https URL](https://github.com/Grkmyldz148/helmlab)
Click to view abstract
Abstract:We present HELMLAB, a 72-parameter analytical color space for UI design systems. The forward transform maps CIE XYZ to a perceptually-organized Lab representation through learned matrices, per-channel power compression, Fourier hue correction, and embedded Helmholtz-Kohlrausch lightness adjustment. A post-pipeline neutral correction guarantees that achromatic colors map to a=b=0 (chroma 10^-6), and a rigid rotation of the chromatic plane improves hue-angle alignment without affecting the distance metric, which is invariant under isometries. On the COMBVD dataset (3,813 color pairs), HELMLAB achieves a STRESS of 23.22, a 20.4% reduction from CIEDE2000 (29.18). Cross-validation on He et al. 2022 and MacAdam 1974 shows competitive cross-dataset performance. The transform is invertible with round-trip errors below 10^-14. Gamut mapping, design-token export, and dark/light mode adaptation utilities are included for use in web and mobile design systems.
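For readers unfamiliar with the STRESS figures quoted above, the sketch below computes the STRESS index in its commonly used form (a scale-fitted, normalized residual between predicted and visually judged color differences, where lower is better) and checks the quoted relative reduction. That the paper uses exactly this variant is an assumption, and the function names are hypothetical.

```python
import math


def stress(delta_e, delta_v):
    """STRESS index in [0, 100] between predicted (delta_e) and visual
    (delta_v) color differences, using the usual scaling factor F."""
    f = sum(e * e for e in delta_e) / sum(e * v for e, v in zip(delta_e, delta_v))
    num = sum((e - f * v) ** 2 for e, v in zip(delta_e, delta_v))
    den = sum((f * v) ** 2 for v in delta_v)
    return 100.0 * math.sqrt(num / den)


def relative_reduction(baseline, new):
    """Percentage by which `new` is lower than `baseline`."""
    return 100.0 * (baseline - new) / baseline


# Figures quoted in the abstract: CIEDE2000 at 29.18, HELMLAB at 23.22.
reduction = relative_reduction(29.18, 23.22)
```

Because of the scaling factor, a metric whose predictions are exactly proportional to the visual differences scores 0, so STRESS is insensitive to the overall unit of the color-difference formula.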
54. 【2602.22974】An automatic counting algorithm for the quantification and uncertainty analysis of the number of microglial cells trainable in small and heterogeneous datasets
Link: https://arxiv.org/abs/2602.22974
Authors: L. Martino, M. M. Garcia, P. S. Paradas, E. Curbelo
Categories: Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP); Machine Learning (stat.ML)
Keywords: biological tissues generally, tissues generally requires, scanning signal surface, automatic rough systems, Counting immunopositive cells
Comments:
Click to view abstract
Abstract:Counting immunopositive cells on biological tissues generally requires either manual annotation or (when available) automatic rough systems, for scanning signal surface and intensity in whole slide imaging. In this work, we tackle the problem of counting microglial cells in lumbar spinal cord cross-sections of rats by omitting cell detection and focusing only on the counting task. Manual cell counting is, however, a time-consuming task and additionally entails extensive personnel training. The classic automatic color-based methods roughly inform about the total labeled area and intensity (protein quantification) but do not specifically provide information on cell number. Since the images to be analyzed have a high resolution but a huge amount of pixels contain just noise or artifacts, we first perform a pre-processing generating several filtered images (providing a tailored, efficient feature extraction). Then, we design an automatic kernel counter that is a non-parametric and non-linear method. The proposed scheme can be easily trained in small datasets since, in its basic version, it relies only on one hyper-parameter. However, being non-parametric and non-linear, the proposed algorithm is flexible enough to express all the information contained in rich and heterogeneous datasets as well (providing the maximum overfit if required). Furthermore, the proposed kernel counter also provides uncertainty estimation of the given prediction, and can directly tackle the case of receiving several expert opinions over the same image. Different numerical experiments with artificial and real datasets show very promising results. Related Matlab code is also provided.
55. 【2602.22968】Certified Circuits: Stability Guarantees for Mechanistic Circuits
Link: https://arxiv.org/abs/2602.22968
Authors: Alaa Anani, Tobias Lorenz, Bernt Schiele, Mario Fritz, Jonas Fischer
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Keywords: neural networks arrive, Understanding how neural, essential for debugging, neural networks, networks arrive
Comments:
Click to view abstract
Abstract:Understanding how neural networks arrive at their predictions is essential for debugging, auditing, and deployment. Mechanistic interpretability pursues this goal by identifying circuits - minimal subnetworks responsible for specific behaviors. However, existing circuit discovery methods are brittle: circuits depend strongly on the chosen concept dataset and often fail to transfer out-of-distribution, raising doubts whether they capture concept or dataset-specific artifacts. We introduce Certified Circuits, which provide provable stability guarantees for circuit discovery. Our framework wraps any black-box discovery algorithm with randomized data subsampling to certify that circuit component inclusion decisions are invariant to bounded edit-distance perturbations of the concept dataset. Unstable neurons are abstained from, yielding circuits that are more compact and more accurate. On ImageNet and OOD datasets, certified circuits achieve up to 91% higher accuracy while using 45% fewer neurons, and remain reliable where baselines degrade. Certified Circuits puts circuit discovery on formal ground by producing mechanistic explanations that are provably stable and better aligned with the target concept. Code will be released soon!
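The wrapper idea in the abstract (run a black-box discovery algorithm on random subsamples of the concept dataset and abstain on components whose inclusion is unstable) can be illustrated with the toy sketch below. It only shows the voting-and-abstaining mechanics and carries no formal guarantee; the paper's actual certification procedure is not reproduced here. The function names, the agreement threshold `agree`, and the toy discovery rule are all hypothetical.

```python
import random


def stable_components(dataset, discover, n_components, trials=200,
                      subsample=0.5, agree=0.9, seed=0):
    """Vote on each component over random subsamples of the dataset.

    discover(subset) must return the set of component indices it selects.
    Components selected in at least `agree` of the trials are included,
    those selected in at most 1 - agree are excluded, the rest abstain.
    """
    rng = random.Random(seed)
    size = max(1, int(len(dataset) * subsample))
    counts = [0] * n_components
    for _ in range(trials):
        subset = rng.sample(dataset, size)
        for j in discover(subset):
            counts[j] += 1
    decisions = {}
    for j, count in enumerate(counts):
        freq = count / trials
        if freq >= agree:
            decisions[j] = "include"
        elif freq <= 1.0 - agree:
            decisions[j] = "exclude"
        else:
            decisions[j] = "abstain"
    return decisions


def toy_discover(subset):
    """Hypothetical discovery rule: keep neurons with mean activation > 0.5."""
    n = len(subset[0])
    return {j for j in range(n)
            if sum(example[j] for example in subset) / len(subset) > 0.5}


# Neuron 0 is always active, neuron 1 never, neuron 2 on half the examples.
data = [[1.0, 0.0, 1.0]] * 10 + [[1.0, 0.0, 0.0]] * 10
decisions = stable_components(data, toy_discover, n_components=3)
```

In this toy setup the decision for neuron 2 flips depending on which examples are drawn, so the wrapper abstains on it rather than reporting it as part of the circuit.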
56. 【2602.22960】UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models
Link: https://arxiv.org/abs/2602.22960
Authors: Tianxing Xu, Zixuan Wang, Guangyuan Wang, Li Hu, Zhongyi Zhang, Peng Zhang, Bang Zhang, Song-Hai Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: World models based, simulating interactive environments, face persistent difficulties, World models, maintaining long-term content
Comments: Project Page: [this https URL](https://humanaigc.github.io/ucm-webpage/)
Click to view abstract
Abstract:World models based on video generation demonstrate remarkable potential for simulating interactive environments but face persistent difficulties in two key areas: maintaining long-term content consistency when scenes are revisited and enabling precise camera control from user-provided inputs. Existing methods based on explicit 3D reconstruction often compromise flexibility in unbounded scenarios and fine-grained structures. Alternative methods rely directly on previously generated frames without establishing explicit spatial correspondence, thereby constraining controllability and consistency. To address these limitations, we present UCM, a novel framework that unifies long-term memory and precise camera control via a time-aware positional encoding warping mechanism. To reduce computational overhead, we design an efficient dual-stream diffusion transformer for high-fidelity generation. Moreover, we introduce a scalable data curation strategy utilizing point-cloud-based rendering to simulate scene revisiting, facilitating training on over 500K monocular videos. Extensive experiments on real-world and synthetic benchmarks demonstrate that UCM significantly outperforms state-of-the-art methods in long-term scene consistency, while also achieving precise camera controllability in high-fidelity video generation.
57. 【2602.22959】Can Agents Distinguish Visually Hard-to-Separate Diseases in a Zero-Shot Setting? A Pilot Study
Link: https://arxiv.org/abs/2602.22959
Authors: Zihao Zhao, Frederik Hauke, Juliana De Castilhos, Sven Nebelung, Daniel Truhn
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: large language models, multimodal large language, language models, agent-based systems, rapid progress
Comments: Code available at [this https URL](https://github.com/TruhnLab/Contrastive-Agent-Reasoning)
Click to view abstract
Abstract:The rapid progress of multimodal large language models (MLLMs) has led to increasing interest in agent-based systems. While most prior work in medical imaging concentrates on automating routine clinical workflows, we study an underexplored yet clinically significant setting: distinguishing visually hard-to-separate diseases in a zero-shot setting. We benchmark representative agents on two imaging-only proxy diagnostic tasks, (1) melanoma vs. atypical nevus and (2) pulmonary edema vs. pneumonia, where visual features are highly confounded despite substantial differences in clinical management. We introduce a multi-agent framework based on contrastive adjudication. Experimental results show improved diagnostic performance (an 11-percentage-point gain in accuracy on dermoscopy data) and reduced unsupported claims on qualitative samples, although overall performance remains insufficient for clinical deployment. We acknowledge the inherent uncertainty in human annotations and the absence of clinical context, which further limit the translation to real-world settings. Within this controlled setting, this pilot study provides preliminary insights into zero-shot agent performance in visually confounded scenarios.
58. 【2602.22955】MM-NeuroOnco: A Multimodal Benchmark and Instruction Dataset for MRI-Based Brain Tumor Diagnosis
Link: https://arxiv.org/abs/2602.22955
Authors: Feng Guo, Jiaxiang Liu, Yang Li, Qianqian Shi, Mingkun Xu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Accurate brain tumor, tumor diagnosis requires, Accurate brain, brain tumor diagnosis, diagnosis requires models
Comments:
Click to view abstract
Abstract:Accurate brain tumor diagnosis requires models to not only detect lesions but also generate clinically interpretable reasoning grounded in imaging manifestations, yet existing public datasets remain limited in annotation richness and diagnostic semantics. To bridge this gap, we introduce MM-NeuroOnco, a large-scale multimodal benchmark and instruction-tuning dataset for brain tumor MRI understanding, consisting of 24,726 MRI slices from 20 data sources paired with approximately 200,000 semantically enriched multimodal instructions spanning diverse tumor subtypes and imaging modalities. To mitigate the scarcity and high cost of diagnostic semantic annotations, we develop a multi-model collaborative pipeline for automated medical information completion and quality control, enabling the generation of diagnosis-related semantics beyond mask-only annotations. Building upon this dataset, we further construct MM-NeuroOnco-Bench, a manually annotated evaluation benchmark with a rejection-aware setting to reduce biases inherent in closed-ended question formats. Evaluation across ten representative models shows that even the strongest baseline, Gemini 3 Flash, achieves only 41.88% accuracy on diagnosis-related questions, highlighting the substantial challenges of multimodal brain tumor diagnostic understanding. Leveraging MM-NeuroOnco, we further propose NeuroOnco-GPT, which achieves a 27% absolute accuracy improvement on diagnostic questions following fine-tuning. This result demonstrates the effectiveness of our dataset and benchmark in advancing clinically grounded multimodal diagnostic reasoning. Code and dataset are publicly available at: this https URL
59. 【2602.22949】OpenFS: Multi-Hand-Capable Fingerspelling Recognition with Implicit Signing-Hand Detection and Frame-Wise Letter-Conditioned Synthesis
Link: https://arxiv.org/abs/2602.22949
Authors: Junuk Cha, Jihyeon Kim, Han-Mu Park
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: fingerspelling recognition, component of sign, sign languages, Fingerspelling, signing-hand detection
Comments: Accepted to CVPR 2026
Click to view abstract
Abstract:Fingerspelling is a component of sign languages in which words are spelled out letter by letter using specific hand poses. Automatic fingerspelling recognition plays a crucial role in bridging the communication gap between Deaf and hearing communities, yet it remains challenging due to the signing-hand ambiguity issue, the lack of appropriate training losses, and the out-of-vocabulary (OOV) problem. Prior fingerspelling recognition methods rely on explicit signing-hand detection, which often leads to recognition failures, and on a connectionist temporal classification (CTC) loss, which exhibits the peaky behavior problem. To address these issues, we develop OpenFS, an open-source approach for fingerspelling recognition and synthesis. We propose a multi-hand-capable fingerspelling recognizer that supports both single- and multi-hand inputs and performs implicit signing-hand detection by incorporating a dual-level positional encoding and a signing-hand focus (SF) loss. The SF loss encourages cross-attention to focus on the signing hand, enabling implicit signing-hand detection during recognition. Furthermore, without relying on the CTC loss, we introduce a monotonic alignment (MA) loss that enforces the output letter sequence to follow the temporal order of the input pose sequence through cross-attention regularization. In addition, we propose a frame-wise letter-conditioned generator that synthesizes realistic fingerspelling pose sequences for OOV words. This generator enables the construction of a new synthetic benchmark, called FSNeo. Through comprehensive experiments, we demonstrate that our approach achieves state-of-the-art performance in recognition and validate the effectiveness of the proposed recognizer and generator. Codes and data are available in: this https URL.
60. 【2602.22948】ToProVAR: Efficient Visual Autoregressive Modeling via Tri-Dimensional Entropy-Aware Semantic Analysis and Sparsity Optimization
Link: https://arxiv.org/abs/2602.22948
Authors: Jiayu Chen, Ruoyu Lin, Zihao Zheng, Jingxin Li, Maoliang Li, Guojie Luo, Xiang chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Visual Autoregressive, Visual, Autoregressive, models enhance generation, VAR models
Comments: ToProVAR is honored to be accepted by ICLR 2026
Click to view abstract
Abstract:Visual Autoregressive (VAR) models enhance generation quality but face a critical efficiency bottleneck in later stages. In this paper, we present a novel optimization framework for VAR models that fundamentally differs from prior approaches such as FastVAR and SkipVAR. Instead of relying on heuristic skipping strategies, our method leverages attention entropy to characterize the semantic projections across different dimensions of the model architecture. This enables precise identification of parameter dynamics under varying token granularity levels, semantic scopes, and generation scales. Building on this analysis, we further uncover sparsity patterns along three critical dimensions (token, layer, and scale) and propose a set of fine-grained optimization strategies tailored to these patterns. Extensive evaluation demonstrates that our approach achieves aggressive acceleration of the generation process while significantly preserving semantic fidelity and fine details, outperforming traditional methods in both efficiency and quality. Experiments on Infinity-2B and Infinity-8B models demonstrate that ToProVAR achieves up to 3.4x acceleration with minimal quality loss, effectively mitigating the issues found in prior work. Our code will be made publicly available.
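Attention entropy, which the abstract uses as its analysis signal, is the Shannon entropy of one query's attention distribution over the keys. The sketch below shows that basic quantity only; how ToProVAR aggregates it across tokens, layers, and scales is not described in the abstract and is not modeled here. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention logits."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(probs):
    """Shannon entropy (in nats) of one attention row; zero-probability
    entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

A row that spreads its attention evenly over n keys has entropy log(n), while a row that attends to a single key has entropy 0, so low entropy marks sharply focused attention.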
61. 【2602.22945】Cross-Task Benchmarking of CNN Architectures
Link: https://arxiv.org/abs/2602.22945
Authors: Kamal Sherawat, Vikrant Bhati
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: time series analysis, Time Series Classification, UCR Time Series, Series Classification Archive, including image classification
Comments:
Click to view abstract
Abstract:This project provides a comparative study of dynamic convolutional neural networks (CNNs) for various tasks, including image classification, segmentation, and time series analysis. Based on the ResNet-18 architecture, we compare five variants of CNNs: the vanilla CNN, the hard attention-based CNN, the soft attention-based CNN with local (pixel-wise) and global (image-wise) feature attention, and the omni-directional CNN (ODConv). Experiments on Tiny ImageNet, Pascal VOC, and the UCR Time Series Classification Archive illustrate that attention mechanisms and dynamic convolution methods consistently exceed conventional CNNs in accuracy, efficiency, and computational performance. ODConv was especially effective on morphologically complex images by being able to dynamically adjust to varying spatial patterns. Dynamic CNNs enhanced feature representation and cross-task generalization through adaptive kernel modulation. This project provides perspectives on advanced CNN design architecture for multiplexed data modalities and indicates promising directions in neural network engineering.
62. 【2602.22941】Velocity and stroke rate reconstruction of canoe sprint team boats based on panned and zoomed video recordings
Link: https://arxiv.org/abs/2602.22941
Authors: Julian Ziegler, Daniel Matthes, Finn Gerdts, Patrick Frenzel, Torsten Warnke, Matthias Englert, Tina Koevari, Mirco Fuchs
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Pacing strategies, essential for peak, stroke rate profiles, Pacing, peak performance
Comments:
Click to view abstract
Abstract:Pacing strategies, defined by velocity and stroke rate profiles, are essential for peak performance in canoe sprint. While GPS is the gold standard for analysis, its limited availability necessitates automated video-based solutions. This paper presents an extended framework for reconstructing performance metrics from panned and zoomed video recordings across all sprint disciplines (K1-K4, C1-C2) and distances (200m-500m). Our method utilizes YOLOv8 for buoy and athlete detection, leveraging the known buoy grid to estimate homographies. We generalized the estimation of the boat position by means of learning a boat-specific athlete offset using a U-Net based boat tip calibration. Further, we implement a robust tracking scheme using optical flow to adapt to multi-athlete boat types. Finally, we introduce methods to extract stroke rate information from either pose estimations or the athlete bounding boxes themselves. Evaluation against GPS data from elite competitions yields a velocity RRMSE of 0.020 ± 0.011 (ρ = 0.956) and a stroke rate RRMSE of 0.022 ± 0.024 (ρ = 0.932). The methods provide coaches with highly accurate, automated feedback without requiring on-boat sensors or manual annotation.
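To make the reported metrics concrete, here is a minimal sketch of a relative RMSE and a Pearson correlation. It assumes RRMSE means RMSE divided by the mean of the reference signal, which is one common definition; the abstract does not state which normalization the authors use, and the velocity values below are made up.

```python
import math

def rrmse(pred, ref):
    """Relative RMSE: RMSE normalized by the mean of the reference (assumed definition)."""
    n = len(ref)
    rmse = math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / n)
    return rmse / (sum(ref) / n)

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical velocity profiles in m/s: video estimate vs. GPS reference.
video = [5.1, 5.3, 5.0, 4.8]
gps = [5.0, 5.4, 5.0, 4.9]
print(rrmse(video, gps), pearson(video, gps))
```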
63. 【2602.22938】pMoE: Prompting Diverse Experts Together Wins More in Visual Adaptation
Link: https://arxiv.org/abs/2602.22938
Authors: Shentong Mo, Xufang Luo, Dongsheng Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Parameter-efficient fine-tuning, demonstrated promising results, fine-tuning has demonstrated, demonstrated promising, visual adaptation tasks
Comments:
Abstract:Parameter-efficient fine-tuning has demonstrated promising results across various visual adaptation tasks, such as classification and segmentation. Typically, prompt tuning techniques have harnessed knowledge from a single pre-trained model, whether from a general or a specialized medical domain. However, this approach overlooks the potential synergies that could arise from integrating diverse domain knowledge within the same tuning process. In this work, we propose a novel Mixture-of-Experts prompt tuning method called pMoE, which leverages the strengths of multiple expert domains through expert-specialized prompt tokens and a learnable dispatcher, effectively combining their expertise in a unified model framework. Our pMoE introduces expert-specific prompt tokens and utilizes a dynamic token dispatching mechanism at various prompt layers to optimize the contribution of each domain expert during the adaptation phase. By incorporating domain knowledge from diverse experts, the proposed pMoE significantly enhances the model's versatility and applicability to a broad spectrum of tasks. We conduct extensive experiments across 47 adaptation tasks, including both classification and segmentation in general and medical domains. The results demonstrate that our pMoE not only achieves superior performance by a large margin but also offers an optimal trade-off between computational efficiency and adaptation effectiveness compared to existing methods.
64. 【2602.22932】MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding
Link: https://arxiv.org/abs/2602.22932
Authors: Wenhui Tan, Xiaoyi Yu, Jiaze Li, Yijing Chen, Jianzhong Ju, Zhenbo Luo, Ruihua Song, Jian Luan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Efficiently understanding long-form, multimodal large language, long-form videos remains, large language models, Efficiently understanding
Comments: Accepted by CVPR 2026
Abstract:Efficiently understanding long-form videos remains a fundamental challenge for multimodal large language models (MLLMs). In this paper, we present MLLM-Sampler Joint Evolution (MSJoE), a novel framework that jointly evolves the MLLM and a lightweight key-frame sampler for efficient long-form video understanding. MSJoE builds upon a key assumption that only a small subset of key-frames is truly informative for answering each question about a video. Specifically, MSJoE first reasons out several queries, which describe diverse visual perspectives relevant to the question. Then, these queries interact with a frozen CLIP model to produce a query-frame similarity matrix. Finally, a lightweight sampler predicts key-frame sampling weights from this matrix, selecting a compact set of informative frames, which are then fed into the MLLM for answer generation. Both the MLLM and sampler are jointly optimized through reinforcement learning, enabling co-adaptation of query-reasoning, frame-sampling, and key-frame understanding. A new long-video QA dataset containing 2.8K videos with 7K question-answer pairs is collected to support the training process. Extensive experiments on VideoMME, LongVideoBench, LVBench, and MLVU show that MSJoE achieves an 8.0% accuracy gain over the base MLLM and 1.1% higher accuracy than the strongest baseline method.
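The selection step described in this abstract (similarity matrix in, compact set of frames out) can be sketched as follows. This is a hand-written stand-in, not the learned sampler from the paper: the max-pooling over queries, the softmax weighting, and the name `select_key_frames` are all assumptions made for illustration.

```python
import math

def select_key_frames(sim, k):
    """Pick k frames from a query-frame similarity matrix.

    sim[q][f] is the similarity between query q and frame f.
    Returns the chosen frame indices in temporal order plus the weights.
    """
    num_frames = len(sim[0])
    # Max-pool over queries: a frame is useful if any query matches it well.
    scores = [max(row[f] for row in sim) for f in range(num_frames)]
    # Softmax turns the pooled scores into normalized sampling weights.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Keep the k highest-weight frames, then restore temporal order.
    top = sorted(range(num_frames), key=lambda f: weights[f], reverse=True)[:k]
    return sorted(top), weights

# Toy matrix (assumed values): 2 queries, 4 frames.
sim = [[0.1, 0.9, 0.2, 0.3],
       [0.8, 0.1, 0.2, 0.1]]
frames, weights = select_key_frames(sim, k=2)
print(frames)
```

In the paper the sampler is trained jointly with the MLLM via reinforcement learning, so a fixed rule like this one would only serve as a baseline to compare against.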
65. 【2602.22923】WaterVideoQA: ASV-Centric Perception and Rule-Compliant Reasoning via Multi-Modal Agents
Link: https://arxiv.org/abs/2602.22923
Authors: Runwei Guan, Shaofeng Liang, Ningwei Ouyang, Weichen Fei, Shanliang Yao, Wei Dai, Chenhao Ge, Penglei Sun, Xiaohui Zhu, Tao Huang, Ryan Wen Liu, Hui Xiong
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: interactive environmental cognition, achieved remarkable success, remains fundamentally constrained, Autonomous Surface Vessels, object detection
Comments: 11 pages, 8 figures
Abstract:While autonomous navigation has achieved remarkable success in passive perception (e.g., object detection and segmentation), it remains fundamentally constrained by a void in knowledge-driven, interactive environmental cognition. In the high-stakes domain of maritime navigation, the ability to bridge the gap between raw visual perception and complex cognitive reasoning is not merely an enhancement but a critical prerequisite for Autonomous Surface Vessels to execute safe and precise maneuvers. To this end, we present WaterVideoQA, the first large-scale, comprehensive Video Question Answering benchmark specifically engineered for all-waterway environments. This benchmark encompasses 3,029 video clips across six distinct waterway categories, integrating multifaceted variables such as volatile lighting and dynamic weather to rigorously stress-test ASV capabilities across a five-tier hierarchical cognitive framework. Furthermore, we introduce NaviMind, a pioneering multi-agent neuro-symbolic system designed for open-ended maritime reasoning. By synergizing Adaptive Semantic Routing, Situation-Aware Hierarchical Reasoning, and Autonomous Self-Reflective Verification, NaviMind transitions ASVs from superficial pattern matching to regulation-compliant, interpretable decision-making. Experimental results demonstrate that our framework significantly transcends existing baselines, establishing a new paradigm for intelligent, trustworthy interaction in dynamic maritime environments.
66. 【2602.22920】OSDaR-AR: Enhancing Railway Perception Datasets via Multi-modal Augmented Reality
Link: https://arxiv.org/abs/2602.22920
Authors: Federico Nesti, Gianluca D'Amico, Mauro Marinoni, Giorgio Buttazzo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: railway applications continue, intelligent transportation systems, scarcity of high-quality, obstacle detection, deep learning
Comments:
Abstract:Although deep learning has significantly advanced the perception capabilities of intelligent transportation systems, railway applications continue to suffer from a scarcity of high-quality, annotated data for safety-critical tasks like obstacle detection. While photorealistic simulators offer a solution, they often struggle with the "sim-to-real" gap; conversely, simple image-masking techniques lack the spatio-temporal coherence required to obtain augmented single- and multi-frame scenes with the correct appearance and dimensions. This paper introduces a multi-modal augmented reality framework designed to bridge this gap by integrating photorealistic virtual objects into real-world railway sequences from the OSDaR23 dataset. Utilizing Unreal Engine 5 features, our pipeline leverages LiDAR point-clouds and INS/GNSS data to ensure accurate object placement and temporal stability across RGB frames. This paper also proposes a segmentation-based refinement strategy for INS/GNSS data to significantly improve the realism of the augmented sequences, as confirmed by the comparative study presented in the paper. Carefully designed augmented sequences are collected to produce OSDaR-AR, a public dataset designed to support the development of next-generation railway perception systems. The dataset is available at the following page: this https URL
67. 【2602.22919】Chain of Flow: A Foundational Generative Framework for ECG-to-4D Cardiac Digital Twins
Link: https://arxiv.org/abs/2602.22919
Authors: Haofan Wu, Nay Aung, Theodoros N. Arvanitis, Joao A. C. Lima, Steffen E. Petersen, Le Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: clinically actionable Cardiac, actionable Cardiac Digital, update its internal, multimodal signals, clinically actionable
Comments: 10 pages, 8 figures. Submitted to IEEE Transactions on Medical Imaging (TMI). Code will be released after review
Abstract:A clinically actionable Cardiac Digital Twin (CDT) should reconstruct individualised cardiac anatomy and physiology, update its internal state from multimodal signals, and enable a broad range of downstream simulations beyond isolated tasks. However, existing CDT frameworks remain limited to task-specific predictors rather than building a patient-specific, manipulable virtual heart. In this work, we introduce Chain of Flow (COF), a foundational ECG-driven generative framework that reconstructs full 4D cardiac structure and motion from a single cardiac cycle. The method integrates cine-CMR and 12-lead ECG during training to learn a unified representation of cardiac geometry, electrophysiology, and motion dynamics. We evaluate Chain of Flow on diverse cohorts and demonstrate accurate recovery of cardiac anatomy, chamber-wise function, and dynamic motion patterns. The reconstructed 4D hearts further support downstream CDT tasks such as volumetry, regional function analysis, and virtual cine synthesis. By enabling full 4D organ reconstruction directly from ECG, COF transforms cardiac digital twins from narrow predictive models into fully generative, patient-specific virtual hearts. Code will be released after review.
68. 【2602.22917】Towards Multimodal Domain Generalization with Few Labels
Link: https://arxiv.org/abs/2602.22917
Authors: Hongzhao Li, Hao Dong, Hualei Wan, Shupan Li, Mingliang Xu, Muhammad Haris Khan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: reduce annotation costs, Multimodal Domain Generalization, Multimodal models ideally, domain generalization methods, Domain Generalization
Comments: Accepted to CVPR 2026
Abstract:Multimodal models ideally should generalize to unseen domains while remaining data-efficient to reduce annotation costs. To this end, we introduce and study a new problem, Semi-Supervised Multimodal Domain Generalization (SSMDG), which aims to learn robust multimodal models from multi-source data with few labeled samples. We observe that existing approaches fail to address this setting effectively: multimodal domain generalization methods cannot exploit unlabeled data, semi-supervised multimodal learning methods ignore domain shifts, and semi-supervised domain generalization methods are confined to single-modality inputs. To overcome these limitations, we propose a unified framework featuring three key components: Consensus-Driven Consistency Regularization, which obtains reliable pseudo-labels through confident fused-unimodal consensus; Disagreement-Aware Regularization, which effectively utilizes ambiguous non-consensus samples; and Cross-Modal Prototype Alignment, which enforces domain- and modality-invariant representations while promoting robustness under missing modalities via cross-modal translation. We further establish the first SSMDG benchmarks, on which our method consistently outperforms strong baselines in both standard and missing-modality scenarios. Our benchmarks and code are available at this https URL.
69. 【2602.22897】OmniGAIA: Towards Native Omni-Modal AI Agents
Link: https://arxiv.org/abs/2602.22897
Authors: Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Shijian Wang, Guanting Dong, Jiajie Jin, Hao Wang, Yinuo Wang, Ji-Rong Wen, Yuan Lu, Zhicheng Dou
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Keywords: Human intelligence naturally, intelligence naturally intertwines, Human intelligence, naturally intertwines omni-modal, spanning vision
Comments:
Abstract:Human intelligence naturally intertwines omni-modal perception -- spanning vision, audio, and language -- with complex reasoning and tool usage to interact with the world. However, current multi-modal LLMs are primarily confined to bi-modal interactions (e.g., vision-language), lacking the unified cognitive capabilities required for general AI assistants. To bridge this gap, we introduce OmniGAIA, a comprehensive benchmark designed to evaluate omni-modal agents on tasks necessitating deep reasoning and multi-turn tool execution across video, audio, and image modalities. Constructed via a novel omni-modal event graph approach, OmniGAIA synthesizes complex, multi-hop queries derived from real-world data that require cross-modal reasoning and external tool integration. Furthermore, we propose OmniAtlas, a native omni-modal foundation agent under tool-integrated reasoning paradigm with active omni-modal perception. Trained on trajectories synthesized via a hindsight-guided tree exploration strategy and OmniDPO for fine-grained error correction, OmniAtlas effectively enhances the tool-use capabilities of existing open-source models. This work marks a step towards next-generation native omni-modal AI assistants for real-world scenarios.
70. 【2602.22867】SO3UFormer: Learning Intrinsic Spherical Features for Rotation-Robust Panoramic Segmentation
Link: https://arxiv.org/abs/2602.22867
Authors: Qinfeng Zhu, Yunxi Jiang, Lei Fan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Panoramic semantic segmentation, strict gravity-aligned assumption, semantic segmentation models, Panoramic semantic, gravity-aligned assumption
Comments:
Abstract:Panoramic semantic segmentation models are typically trained under a strict gravity-aligned assumption. However, real-world captures often deviate from this canonical orientation due to unconstrained camera motions, such as the rotational jitter of handheld devices or the dynamic attitude shifts of aerial platforms. This discrepancy causes standard spherical Transformers to overfit global latitude cues, leading to performance collapse under 3D reorientations. To address this, we introduce SO3UFormer, a rotation-robust architecture designed to learn intrinsic spherical features that are less sensitive to the underlying coordinate frame. Our approach rests on three geometric pillars: (1) an intrinsic feature formulation that decouples the representation from the gravity vector by removing absolute latitude encoding; (2) quadrature-consistent spherical attention that accounts for non-uniform sampling densities; and (3) a gauge-aware relative positional mechanism that encodes local angular geometry using tangent-plane projected angles and discrete gauge pooling, avoiding reliance on global axes. We further use index-based spherical resampling together with a logit-level SO(3)-consistency regularizer during training. To rigorously benchmark robustness, we introduce Pose35, a dataset variant of Stanford2D3D perturbed by random rotations within $\pm 35^\circ$. Under the extreme test of arbitrary full SO(3) rotations, existing SOTAs fail catastrophically: the baseline SphereUFormer drops from 67.53 mIoU to 25.26 mIoU. In contrast, SO3UFormer demonstrates remarkable stability, achieving 72.03 mIoU on Pose35 and retaining 70.67 mIoU under full SO(3) rotations.
71. 【2602.22862】GraspLDP: Towards Generalizable Grasping Policy via Latent Diffusion
Link: https://arxiv.org/abs/2602.22862
Authors: Enda Xiang, Haoxiang Ma, Xinzhu Ma, Zicheng Liu, Di Huang
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: manipulation policies learned, paper focuses, focuses on enhancing, imitation learning, policies learned
Comments: Accepted to CVPR 2026
Abstract:This paper focuses on enhancing the grasping precision and generalization of manipulation policies learned via imitation learning. Diffusion-based policy learning methods have recently become the mainstream approach for robotic manipulation tasks. As grasping is a critical subtask in manipulation, the ability of imitation-learned policies to execute precise and generalizable grasps merits particular attention. Existing imitation learning techniques for grasping often suffer from imprecise grasp executions, limited spatial generalization, and poor object generalization. To address these challenges, we incorporate grasp prior knowledge into the diffusion policy framework. In particular, we employ a latent diffusion policy to guide action chunk decoding with a grasp pose prior, ensuring that generated motion trajectories adhere closely to feasible grasp configurations. Furthermore, we introduce a self-supervised reconstruction objective during diffusion to embed the graspness prior: at each reverse diffusion step, we reconstruct wrist-camera images with back-projected graspness from the intermediate representations. Both simulation and real robot experiments demonstrate that our approach significantly outperforms baseline methods and exhibits strong dynamic grasping capabilities.
72. 【2602.22859】From Blind Spots to Gains: Diagnostic-Driven Iterative Training for Large Multimodal Models
Link: https://arxiv.org/abs/2602.22859
Authors: Hongrui Jia, Chaoya Jiang, Shikun Zhang, Wei Ye
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: made notable progress, Large Multimodal Models, Large Multimodal, Diagnostic-driven Progressive Evolution, methods mature
Comments:
Abstract:As Large Multimodal Models (LMMs) scale up and reinforcement learning (RL) methods mature, LMMs have made notable progress in complex reasoning and decision making. Yet training still relies on static data and fixed recipes, making it difficult to diagnose capability blind spots or provide dynamic, targeted reinforcement. Motivated by findings that test-driven error exposure and feedback-based correction outperform repetitive practice, we propose Diagnostic-driven Progressive Evolution (DPE), a spiral loop where diagnosis steers data generation and reinforcement, and each iteration re-diagnoses the updated model to drive the next round of targeted improvement. DPE has two key components. First, multiple agents annotate and quality-control massive unlabeled multimodal data, using tools such as web search and image editing to produce diverse, realistic samples. Second, DPE attributes failures to specific weaknesses, dynamically adjusts the data mixture, and guides agents to generate weakness-focused data for targeted reinforcement. Experiments on Qwen3-VL-8B-Instruct and Qwen2.5-VL-7B-Instruct show stable, continual gains across eleven benchmarks, indicating DPE as a scalable paradigm for continual LMM training under open task distributions. Our code, models, and data are publicly available at this https URL.
73. 【2602.22843】A data- and compute-efficient chest X-ray foundation model beyond aggressive scaling
Link: https://arxiv.org/abs/2602.22843
Authors: Chong Wang, Yabin Zhang, Yunhe Gao, Maya Varma, Clemence Mottez, Faidra Patsatzi, Jiaming Liu, Jin Long, Jean-Benoit Delbrouck, Sergios Gatidis, Akshay S. Chaudhari, Curtis P. Langlotz
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: increasingly large datasets, imaging are typically, increasingly large, medical imaging, large datasets
Comments:
Abstract:Foundation models for medical imaging are typically pretrained on increasingly large datasets, following a "scale-at-all-costs" paradigm. However, this strategy faces two critical challenges: large-scale medical datasets often contain substantial redundancy and severe class imbalance that bias representation learning toward over-represented patterns, and indiscriminate training regardless of heterogeneity in data quality incurs considerable computational inefficiency. Here we demonstrate that active, principled data curation during pretraining can serve as a viable, cost-effective alternative to brute-force dataset enlargement. We introduce CheXficient, a chest X-ray (CXR) foundation model that selectively prioritizes informative training samples. CheXficient is pretrained on only 22.7% of 1,235,004 paired CXR images and reports while consuming under 27.3% of the total compute budget, yet achieving comparable or superior performance to its full-data counterpart and other large-scale pretrained models. We assess CheXficient across 20 individual benchmarks spanning 5 task types, including non-adapted off-the-shelf evaluations (zero-shot findings classification and crossmodal retrieval) and adapted downstream tasks (disease prediction, semantic segmentation, and radiology report generation). Further analyses show that CheXficient systematically prioritizes under-represented training samples, improving generalizability on long-tailed or rare conditions. Overall, our work offers practical insights into the data and computation demands for efficient pretraining and downstream adaptation of medical vision-language foundation models.
74. 【2602.22831】Moral Preferences of LLMs Under Directed Contextual Influence
Link: https://arxiv.org/abs/2602.22831
Authors: Phil Blandfort, Tushar Karayil, Urja Pawar, Robert Graham, Alex McKenzie, Dmitrii Krasheninnikov
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Keywords: implicitly assuming stable, assuming stable preferences, implicitly assuming, benchmarks for LLMs, LLMs typically
Comments:
Abstract:Moral benchmarks for LLMs typically use context-free prompts, implicitly assuming stable preferences. In deployment, however, prompts routinely include contextual signals such as user requests, cues on social norms, etc. that may steer decisions. We study how directed contextual influences reshape decisions in trolley-problem-style moral triage settings. We introduce a pilot evaluation harness for directed contextual influence in trolley-problem-style moral triage: for each demographic factor, we apply matched, direction-flipped contextual influences that differ only in which group they favor, enabling systematic measurement of directional response. We find that: (i) contextual influences often significantly shift decisions, even when only superficially relevant; (ii) baseline preferences are a poor predictor of directional steerability, as models can appear baseline-neutral yet exhibit systematic steerability asymmetry under influence; (iii) influences can backfire: models may explicitly claim neutrality or discount the contextual cue, yet their choices still shift, sometimes in the opposite direction; and (iv) reasoning reduces average sensitivity, but amplifies the effect of biased few-shot examples. Our findings motivate extending moral evaluations with controlled, direction-flipped context manipulations to better characterize model behavior.
75. 【2602.22829】Reflectance Multispectral Imaging for Soil Composition Estimation and USDA Texture Classification
Link: https://arxiv.org/abs/2602.22829
Authors: G.A.S.L Ranasinghe, J.A.S.T. Jayakody, M.C.L. De Silva, G. Thilakarathne, G.M.R.I. Godaliyadda, H.M.V.R. Herath, M.P.B. Ekanayake, S.K. Navaratnarajah
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Keywords: load bearing capacity, governs water availability, United States Department, deformation response, Soil texture
Comments: Under Review at IEEE Access. 17 pages, 15 figures
Abstract:Soil texture is a foundational attribute that governs water availability and erosion in agriculture, as well as load-bearing capacity, deformation response, and shrink-swell risk in geotechnical engineering. Yet texture is still typically determined by slow and labour-intensive laboratory particle size tests, while many sensing alternatives are either costly or too coarse to support routine field-scale deployment. This paper proposes a robust and field-deployable multispectral imaging (MSI) system and machine learning framework for predicting soil composition and the United States Department of Agriculture (USDA) texture classes. The proposed system uses a cost-effective in-house MSI device operating from 365 nm to 940 nm to capture thirteen spectral bands, which effectively capture the spectral properties of soil texture. Regression models use the captured spectral properties to estimate clay, silt, and sand percentages, while a direct classifier predicts one of the twelve USDA textural classes. Indirect classification is obtained by mapping the regressed compositions to texture classes via the USDA soil texture triangle. The framework is evaluated on mixture data by mixing clay, silt, and sand in varying proportions, using the USDA classification triangle as a basis. Experimental results show that the proposed approach achieves a coefficient of determination R^2 up to 0.99 for composition prediction and over 99% accuracy for texture classification. These findings indicate that MSI combined with data-driven modeling can provide accurate, non-destructive, and field-deployable soil texture characterization suitable for geotechnical screening and precision agriculture.
76. 【2602.22821】CMSA-Net: Causal Multi-scale Aggregation with Adaptive Multi-source Reference for Video Polyp Segmentation
Link: https://arxiv.org/abs/2602.22821
Authors: Tong Wang, Yaolei Qi, Siwen Wang, Imran Razzak, Guanyu Yang, Yutong Xie
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: doctors accurately locate, computer-aided colonoscopy, important task, task in computer-aided, doctors accurately
Comments:
Abstract:Video polyp segmentation (VPS) is an important task in computer-aided colonoscopy, as it helps doctors accurately locate and track polyps during examinations. However, VPS remains challenging because polyps often look similar to surrounding mucosa, leading to weak semantic discrimination. In addition, large changes in polyp position and scale across video frames make stable and accurate segmentation difficult. To address these challenges, we propose a robust VPS framework named CMSA-Net. The proposed network introduces a Causal Multi-scale Aggregation (CMA) module to effectively gather semantic information from multiple historical frames at different scales. By using causal attention, CMA ensures that temporal feature propagation follows strict time order, which helps reduce noise and improve feature reliability. Furthermore, we design a Dynamic Multi-source Reference (DMR) strategy that adaptively selects informative and reliable reference frames based on semantic separability and prediction confidence. This strategy provides strong multi-frame guidance while keeping the model efficient for real-time inference. Extensive experiments on the SUN-SEG dataset demonstrate that CMSA-Net achieves state-of-the-art performance, offering a favorable balance between segmentation accuracy and real-time clinical applicability.
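The causal constraint mentioned in this abstract (temporal propagation that follows strict time order) can be illustrated with a masked softmax in which frame t only attends to frames at or before t. This is a generic causal-attention sketch with hypothetical scores, not the CMA module itself, whose multi-scale details the abstract does not give.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax restricted to past and current frames.

    scores[t][s] is the raw score of frame t attending to frame s.
    Entries with s > t (future frames) get exactly zero weight.
    """
    out = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]           # frames 0..t only
        m = max(visible)
        exps = [math.exp(x - m) for x in visible]
        z = sum(exps)
        # Pad the masked-out future positions with zeros.
        out.append([e / z for e in exps] + [0.0] * (len(row) - t - 1))
    return out

# Toy example (assumed values): three frames with equal raw scores.
weights = causal_attention_weights([[0.0, 0.0, 0.0],
                                    [0.0, 0.0, 0.0],
                                    [0.0, 0.0, 0.0]])
for row in weights:
    print(row)
```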
77. 【2602.22819】Face Time Traveller: Travel Through Ages Without Losing Identity
Link: https://arxiv.org/abs/2602.22819
Authors: Purbayan Kar, Ayush Ghadiya, Vishal Chudasama, Pankaj Wasnik, C.V. Jawahar
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: ill-posed problem shaped, realistic age transformations, genetic factors, vital in entertainment, digital archiving
Comments: Accepted at CVPR 2026 (Findings Track)
Abstract:Face aging, an ill-posed problem shaped by environmental and genetic factors, is vital in entertainment, forensics, and digital archiving, where realistic age transformations must preserve both identity and visual realism. However, existing works relying on numerical age representations overlook the interplay of biological and contextual cues. Despite recent progress, face aging models struggle with identity preservation in wide age transformations. In addition, static attention and optimization-heavy inversion in diffusion limit adaptability, fine-grained control, and background consistency. To address these challenges, we propose Face Time Traveller (FaceTT), a diffusion-based framework that achieves high-fidelity, identity-consistent age transformation. Here, we introduce a Face-Attribute-Aware Prompt Refinement strategy that encodes intrinsic (biological) and extrinsic (environmental) aging cues for context-aware conditioning. A tuning-free Angular Inversion method is proposed that efficiently maps real faces into the diffusion latent space for fast and accurate reconstruction. Moreover, an Adaptive Attention Control mechanism is introduced that dynamically balances cross-attention for semantic aging cues and self-attention for structural and identity preservation. Extensive experiments on benchmark datasets and an in-the-wild test set demonstrate that FaceTT achieves superior identity retention, background preservation, and aging realism over state-of-the-art (SOTA) methods.
78. 【2602.22809】PhotoAgent: Agentic Photo Editing with Exploratory Visual Aesthetic Planning
Link: https://arxiv.org/abs/2602.22809
Authors: Mingde Yao, Zhiyuan You, Tam-King Man, Menglu Wang, Tianfan Xue
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: recent fast development, shown great potential, generating high-quality images, instruction-based image editing, recent fast
Comments: A fully automated, intelligent photo-editing agent that autonomously plans multi-step aesthetic enhancements, smartly chooses diverse editing tools, and enables everyday users to achieve professional-looking results without crafting complex prompts. Project page: [this https URL](https://github.com/mdyao/PhotoAgent)
Abstract:With the recent fast development of generative models, instruction-based image editing has shown great potential in generating high-quality images. However, the quality of editing highly depends on carefully designed instructions, placing the burden of task decomposition and sequencing entirely on the user. To achieve autonomous image editing, we present PhotoAgent, a system that advances image editing through explicit aesthetic planning. Specifically, PhotoAgent formulates autonomous image editing as a long-horizon decision-making problem. It reasons over user aesthetic intent, plans multi-step editing actions via tree search, and iteratively refines results through closed-loop execution with memory and visual feedback, without requiring step-by-step user prompts. To support reliable evaluation in real-world scenarios, we introduce UGC-Edit, an aesthetic evaluation benchmark consisting of 7,000 photos and a learned aesthetic reward model. We also construct a test set containing 1,017 photos to systematically assess autonomous photo editing performance. Extensive experiments demonstrate that PhotoAgent consistently improves both instruction adherence and visual quality compared with baseline methods. The project page is this https URL.
79. 【2602.22800】GSTurb: Gaussian Splatting for Atmospheric Turbulence Mitigation
Link: https://arxiv.org/abs/2602.22800
Authors: Hanliang Du, Zhangji Lu, Zewei Cai, Qijian Tang, Qifeng Yu, Xiaoli Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: long-range imaging applications, image degradation due, atmospheric turbulence mitigation, Atmospheric turbulence, pixel displacement
Comments:
Abstract:Atmospheric turbulence causes significant image degradation due to pixel displacement (tilt) and blur, particularly in long-range imaging applications. In this paper, we propose a novel framework for atmospheric turbulence mitigation, GSTurb, which integrates optical flow-guided tilt correction and Gaussian splatting for modeling non-isoplanatic blur. The framework employs Gaussian parameters to represent tilt and blur, and optimizes them across multiple frames to enhance restoration. Experimental results on the ATSyn-static dataset demonstrate the effectiveness of our method, achieving a peak PSNR of 27.67 dB and SSIM of 0.8735. Compared to the state-of-the-art method, GSTurb improves PSNR by 1.3 dB (a 4.5% increase) and SSIM by 0.048 (a 5.8% increase). Additionally, on real datasets, including the TSRWGAN Real-World and CLEAR datasets, GSTurb outperforms existing methods, showing significant improvements in both qualitative and quantitative performance. These results highlight that combining optical flow-guided tilt correction with Gaussian splatting effectively enhances image restoration under both synthetic and real-world turbulence conditions. The code for this method will be available at this https URL.
80. 【2602.22791】Robust Human Trajectory Prediction via Self-Supervised Skeleton Representation Learning
Link: https://arxiv.org/abs/2602.22791
Authors: Taishu Arashima, Hiroshi Kera, Kazuhiko Kawamoto
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: video surveillance, plays a crucial, crucial role, role in applications, autonomous navigation
Comments: 11 pages main, 5 pages supplementary material
Abstract:Human trajectory prediction plays a crucial role in applications such as autonomous navigation and video surveillance. While recent works have explored the integration of human skeleton sequences to complement trajectory information, skeleton data in real-world environments often suffer from missing joints caused by occlusions. These disturbances significantly degrade prediction accuracy, indicating the need for more robust skeleton representations. We propose a robust trajectory prediction method that incorporates a self-supervised skeleton representation model pretrained with masked autoencoding. Experimental results in occlusion-prone scenarios show that our method improves robustness to missing skeletal data without sacrificing prediction accuracy, and consistently outperforms baseline models in clean-to-moderate missingness regimes.
81. 【2602.22785】SceneTransporter: Optimal Transport-Guided Compositional Latent Diffusion for Single-Image Structured 3D Scene Generation
Link: https://arxiv.org/abs/2602.22785
Authors: Ling Wang, Hao-Xiang Guo, Xinzhou Wang, Fuchun Sun, Kai Sun, Pengkun Liu, Hang Xiao, Zhong Wang, Guangyuan Fu, Eric Li, Yang Liu, Yikai Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: scene generation, open-world scene generation, scene, single image, introduce SceneTransporter
Comments: Published at ICLR 2026
Abstract:We introduce SceneTransporter, an end-to-end framework for structured 3D scene generation from a single image. While existing methods generate part-level 3D objects, they often fail to organize these parts into distinct instances in open-world scenes. Through a debiased clustering probe, we reveal a critical insight: this failure stems from the lack of structural constraints within the model's internal assignment mechanism. Based on this finding, we reframe the task of structured 3D scene generation as a global correlation assignment problem. To solve this, SceneTransporter formulates and solves an entropic Optimal Transport (OT) objective within the denoising loop of the compositional DiT model. This formulation imposes two powerful structural constraints. First, the resulting transport plan gates cross-attention to enforce an exclusive, one-to-one routing of image patches to part-level 3D latents, preventing entanglement. Second, the competitive nature of the transport encourages the grouping of similar patches, a process that is further regularized by an edge-based cost, to form coherent objects and prevent fragmentation. Extensive experiments show that SceneTransporter outperforms existing methods on open-world scene generation, significantly improving instance-level coherence and geometric fidelity. Code and models will be publicly available at this https URL.
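The entropic Optimal Transport objective mentioned in this abstract is commonly solved with Sinkhorn scaling. The sketch below shows that generic procedure on a toy cost matrix. The function name `sinkhorn`, the `epsilon` value, and the patch-to-part reading of rows and columns are assumptions for illustration. It does not reproduce SceneTransporter's cost design or its cross-attention gating.

```python
import math

def sinkhorn(cost, row_mass, col_mass, epsilon=0.1, iterations=200):
    """Entropic OT plan via Sinkhorn scaling (pure Python, small inputs only)."""
    n, m = len(cost), len(cost[0])
    # Gibbs kernel: low cost gives high affinity.
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iterations):
        # Alternately rescale rows and columns to match the target marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy example: two "patches" (rows) routed to two "parts" (columns).
plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]], [0.5, 0.5], [0.5, 0.5])
```

With a small `epsilon` the plan concentrates on the cheapest assignment, which is the sense in which the transport plan can approximate an exclusive routing.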
82. 【2602.22779】TrajTok: Learning Trajectory Tokens enables better Video Understanding
Link: https://arxiv.org/abs/2602.22779
Authors: Chenhao Zheng, Jieyu Zhang, Jianing Zhang, Weikai Huang, Ashutosh Kumar, Quan Kong, Oncel Tuzel, Chun-Liang Li, Ranjay Krishna
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: typically through patchification, generates an excessive, excessive and redundant, redundant number, video
Comments: CVPR 2026
Abstract:Tokenization in video models, typically through patchification, generates an excessive and redundant number of tokens. This severely limits video efficiency and scalability. While recent trajectory-based tokenizers offer a promising solution by decoupling video duration from token count, they rely on complex external segmentation and tracking pipelines that are slow and task-agnostic. We propose TrajTok, an end-to-end video tokenizer module that is fully integrated and co-trained with video models for a downstream objective, dynamically adapting its token granularity to semantic complexity, independent of video duration. TrajTok contains a unified segmenter that performs implicit clustering over pixels in both space and time to directly produce object trajectories in a single forward pass. By prioritizing downstream adaptability over pixel-perfect segmentation fidelity, TrajTok is lightweight and efficient, yet empirically improves video understanding performance. With TrajTok, we implement a video CLIP model trained from scratch (TrajViT2). It achieves the best accuracy at scale across both classification and retrieval benchmarks, while maintaining efficiency comparable to the best token-merging methods. TrajTok also proves to be a versatile component beyond its role as a tokenizer. We show that it can be seamlessly integrated as either a probing head for pretrained visual features (TrajAdapter) or an alignment connector in vision-language models (TrajVLM) with especially strong performance in long-video reasoning.
83. 【2602.22759】Beyond Detection: Multi-Scale Hidden-Code for Natural Image Deepfake Recovery and Factual Retrieval
Link: https://arxiv.org/abs/2602.22759
Authors: Yuan-Chih Chen, Chun-Shien Lu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent advances, authenticity have primarily, primarily focused, focused on deepfake, tampered contents
Abstract:Recent advances in image authenticity have primarily focused on deepfake detection and localization, leaving recovery of tampered contents for factual retrieval relatively underexplored. We propose a unified hidden-code recovery framework that enables both retrieval and restoration from post-hoc and in-generation watermarking paradigms. Our method encodes semantic and perceptual information into a compact hidden-code representation, refined through multi-scale vector quantization, and enhances contextual reasoning via conditional Transformer modules. To enable systematic evaluation for natural images, we construct ImageNet-S, a benchmark that provides paired image-label factual retrieval tasks. Extensive experiments on ImageNet-S demonstrate that our method exhibits promising retrieval and reconstruction performance while remaining fully compatible with diverse watermarking pipelines. This framework establishes a foundation for general-purpose image recovery beyond detection and localization.
84. 【2602.22745】SPATIALALIGN: Aligning Dynamic Spatial Relationships in Video Generation
Link: https://arxiv.org/abs/2602.22745
Authors: Fengming Liu, Tat-Jen Cham, Chuanxia Zheng
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: generators prioritize aesthetic, prioritize aesthetic quality, Dynamic Spatial Relationships, Direct Preference Optimization, generators prioritize
Abstract:Most text-to-video (T2V) generators prioritize aesthetic quality but often ignore the spatial constraints in the generated videos. In this work, we present SPATIALALIGN, a self-improvement framework that enhances T2V models' capabilities to depict Dynamic Spatial Relationships (DSR) specified in text prompts. We present a zeroth-order regularized Direct Preference Optimization (DPO) to fine-tune T2V models towards better alignment with DSR. Specifically, we design DSR-SCORE, a geometry-based metric that quantitatively measures the alignment between generated videos and the specified DSRs in prompts, which is a step forward from prior works that rely on VLM for evaluation. We also construct a dataset of text-video pairs with diverse DSRs to facilitate the study. Extensive experiments demonstrate that our fine-tuned model significantly outperforms the baseline in spatial relationships. The code will be released in Link.
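As background for the DPO objective named in this abstract, the sketch below computes the standard (unregularized) DPO loss for a single preference pair. The function name `dpo_loss`, the argument names, and `beta=0.1` are illustrative assumptions. The zeroth-order regularization that SPATIALALIGN adds is not shown, since the abstract does not specify it.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair of log-likelihoods."""
    # Implicit reward margin of the policy relative to the frozen reference model.
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)) equals softplus(-margin); written in a numerically stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

The loss equals log 2 when the policy and the reference agree, and shrinks as the policy assigns relatively more probability to the preferred sample.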
85. 【2602.22742】ProjFlow: Projection Sampling with Flow Matching for Zero-Shot Exact Spatial Motion Control
Link: https://arxiv.org/abs/2602.22742
Authors: Akihisa Watanabe, Qing Yu, Edgar Simo-Serra, Kent Fujiwara
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Generating human motion, Generating human, precise spatial control, challenging problem, Generating
Abstract:Generating human motion with precise spatial control is a challenging problem. Existing approaches often require task-specific training or slow optimization, and enforcing hard constraints frequently disrupts motion naturalness. Building on the observation that many animation tasks can be formulated as a linear inverse problem, we introduce ProjFlow, a training-free sampler that achieves zero-shot, exact satisfaction of linear spatial constraints while preserving motion realism. Our key advance is a novel kinematics-aware metric that encodes skeletal topology. This metric allows the sampler to enforce hard constraints by distributing corrections coherently across the entire skeleton, avoiding the unnatural artifacts of naive projection. Furthermore, for sparse inputs, such as filling in long gaps between a few keyframes, we introduce a time-varying formulation using pseudo-observations that fade during sampling. Extensive experiments on two representative applications, motion inpainting and 2D-to-3D lifting, demonstrate that ProjFlow achieves exact constraint satisfaction and matches or improves realism over zero-shot baselines, while remaining competitive with training-based controllers.
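For intuition on how a metric can shape the enforcement of a hard linear constraint, the textbook metric-weighted projection is written out below. The symbols $x$ (current sample), $A$ and $b$ (constraint $Ax = b$, with $A$ of full row rank), and $M$ (a symmetric positive-definite metric) are assumptions introduced here. Whether ProjFlow uses exactly this form is not stated in the abstract.

```latex
x^{\star} = \arg\min_{x'} \; (x' - x)^{\top} M \, (x' - x)
\quad \text{s.t.} \quad A x' = b,
\qquad
x^{\star} = x - M^{-1} A^{\top} \left( A M^{-1} A^{\top} \right)^{-1} (A x - b).
```

Substituting back gives $A x^{\star} = b$ exactly. With $M = I$ this reduces to the naive Euclidean projection, while a non-identity $M$ spreads the correction across coordinates according to the metric.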
86. 【2602.22740】AMLRIS: Alignment-aware Masked Learning for Referring Image Segmentation
Link: https://arxiv.org/abs/2602.22740
Authors: Tongfei Chen, Shuo Yang, Yuguang Yang, Linlin Yang, Runtang Guo, Changbai Li, He Long, Chunyu Xie, Dawei Leng, Baochang Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Referring Image Segmentation, natural language expression, Referring Image, Image Segmentation, aims to segment
Comments: ICLR 2026 conference paper
Abstract:Referring Image Segmentation (RIS) aims to segment an object in an image identified by a natural language expression. The paper introduces Alignment-Aware Masked Learning (AML), a training strategy to enhance RIS by explicitly estimating pixel-level vision-language alignment, filtering out poorly aligned regions during optimization, and focusing on trustworthy cues. This approach results in state-of-the-art performance on RefCOCO datasets and also enhances robustness to diverse descriptions and scenarios.
87. 【2602.22734】Asymmetric Idiosyncrasies in Multimodal Models
Link: https://arxiv.org/abs/2602.22734
Authors: Muzi Tao, Chufan Shi, Huijuan Wang, Shengbang Tong, Xuezhe Ma
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: downstream impact, caption models, originating caption model, models, study idiosyncrasies
Comments: Project page: https://muzi-tao.github.io/asymmetric-idiosyncrasies/
Abstract:In this work, we study idiosyncrasies in caption models and their downstream impact on text-to-image models. We design a systematic analysis: given either a generated caption or the corresponding image, we train neural networks to predict the originating caption model. Our results show that text classification yields very high accuracy (99.70%), indicating that captioning models embed distinctive stylistic signatures. In contrast, these signatures largely disappear in the generated images, with classification accuracy dropping to at most 50% even for the state-of-the-art Flux model. To better understand this cross-modal discrepancy, we further analyze the data and find that the generated images fail to preserve key variations present in captions, such as differences in the level of detail, emphasis on color and texture, and the distribution of objects within a scene. Overall, our classification-based framework provides a novel methodology for quantifying both the stylistic idiosyncrasies of caption models and the prompt-following ability of text-to-image systems.
88. 【2602.22731】Sapling-NeRF: Geo-Localised Sapling Reconstruction in Forests for Ecological Monitoring
Link: https://arxiv.org/abs/2602.22731
Authors: Miguel Ángel Muñoz-Bañón, Nived Chebrolu, Sruthi M. Krishna Moorthy, Yifu Tao, Fernando Torres, Roberto Salguero-Gómez, Maurice Fallon
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Mobile Laser Scanners, Laser Scanners, Terrestrial Laser Scanners, key indicators, Neural Radiance Fields
Abstract:Saplings are key indicators of forest regeneration and overall forest health. However, their fine-scale architectural traits are difficult to capture with existing 3D sensing methods, which makes quantitative evaluation difficult. Terrestrial Laser Scanners (TLS), Mobile Laser Scanners (MLS), and traditional photogrammetry approaches poorly reconstruct thin branches and dense foliage, and lack the scale consistency needed for long-term monitoring. Implicit 3D reconstruction methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) are promising alternatives, but cannot recover the true scale of a scene and lack any means to be accurately geo-localised. In this paper, we present a pipeline which fuses NeRF, LiDAR SLAM, and GNSS to enable repeatable, geo-localised ecological monitoring of saplings. Our system proposes a three-level representation: (i) coarse Earth-frame localisation using GNSS, (ii) LiDAR-based SLAM for centimetre-accurate localisation and reconstruction, and (iii) NeRF-derived object-centric dense reconstruction of individual saplings. This approach enables repeatable quantitative evaluation and long-term monitoring of sapling traits. Our experiments in forest plots in Wytham Woods (Oxford, UK) and Evo (Finland) show that stem height, branching patterns, and leaf-to-wood ratios can be captured with increased accuracy as compared to TLS. We demonstrate that accurate stem skeletons and leaf distributions can be measured for saplings with heights between 0.5m and 2m in situ, giving ecologists access to richer structural and quantitative data for analysing forest dynamics.
89. 【2602.22727】HulluEdit: Single-Pass Evidence-Consistent Subspace Editing for Mitigating Hallucinations in Large Vision-Language Models
Link: https://arxiv.org/abs/2602.22727
Authors: Yangguang Lin, Quan Fang, Yufei Li, Jiachen Sun, Junyu Gao, Jitao Sang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Vision-Language Models, Large Vision-Language, significantly hinders, reliable deployment, Object hallucination
Comments: Accepted at CVPR 2026
Abstract:Object hallucination in Large Vision-Language Models (LVLMs) significantly hinders their reliable deployment. Existing methods struggle to balance efficiency and accuracy: they often require expensive reference models and multiple forward passes, or apply static edits that risk suppressing genuine visual evidence. To address this, we introduce HulluEdit, a single-pass, reference-free intervention framework. Our core innovation is orthogonal subspace editing: we decompose the hidden states of the model into orthogonal subspaces - visual evidence, conflicting priors, and residual uncertainty - enabling selective suppression of hallucinatory patterns without interfering with visual grounding. This approach mathematically guarantees that edits applied to the prior subspace leave the visual component entirely unaffected. Extensive experiments show that HulluEdit achieves state-of-the-art hallucination reduction on benchmarks including POPE and CHAIR across diverse architectures, while preserving general capabilities on MME and maintaining efficient inference. Our method consistently outperforms contrastive decoding and static subspace editing baselines, offering a new pathway toward more trustworthy LVLMs.
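The claim that an edit confined to one orthogonal subspace leaves the other untouched follows from basic linear algebra, and a toy sketch makes it concrete. The helper names `dot` and `suppress_along`, the one-dimensional subspaces, and the hand-picked vectors are all illustrative assumptions. How HulluEdit actually constructs its subspaces is not described here.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def suppress_along(hidden, direction, strength=1.0):
    """Remove `strength` times the component of `hidden` along unit vector `direction`."""
    coeff = dot(hidden, direction)
    return [h - strength * coeff * d for h, d in zip(hidden, direction)]

# Hypothetical orthonormal directions for "visual evidence" and "conflicting prior".
visual = [1.0, 0.0, 0.0]
prior = [0.0, 1.0, 0.0]
hidden = [2.0, 3.0, 4.0]

edited = suppress_along(hidden, prior)
```

Because `visual` and `prior` are orthogonal, `dot(edited, visual)` equals `dot(hidden, visual)`: the edit changes only the prior component.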
90. 【2602.22717】IRSDE-Despeckle: A Physics-Grounded Diffusion Model for Generalizable Ultrasound Despeckling
Link: https://arxiv.org/abs/2602.22717
Authors: Shuoqi Chen, Yujia Wu, Geoffrey P. Luke
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: related artifacts reduce, Restoration Stochastic Differential, Stochastic Differential Equations, artifacts reduce image, reduce image quality
Comments: 12 pages main text + 6 pages appendix, 7 figures main + 3 figures appendix, 3 tables main + 1 table appendix. Preprint
Abstract:Ultrasound imaging is widely used for real-time, noninvasive diagnosis, but speckle and related artifacts reduce image quality and can hinder interpretation. We present a diffusion-based ultrasound despeckling method built on the Image Restoration Stochastic Differential Equations framework. To enable supervised training, we curate large paired datasets by simulating ultrasound images from speckle-free magnetic resonance images using the Matlab UltraSound Toolbox. The proposed model reconstructs speckle-suppressed images while preserving anatomically meaningful edges and contrast. On a held-out simulated test set, our approach consistently outperforms classical filters and recent learning-based despeckling baselines. We quantify prediction uncertainty via cross-model variance and show that higher uncertainty correlates with higher reconstruction error, providing a practical indicator of difficult or failure-prone regions. Finally, we evaluate sensitivity to simulation probe settings and observe domain shift, motivating diversified training and adaptation for robust clinical deployment.
91. 【2602.22716】SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs
Link: https://arxiv.org/abs/2602.22716
Authors: Guanting Ye, Qiyan Zhao, Wenhao Yu, Liangyu Yuan, Mingkai Li, Xiaofeng Zhang, Jianmin Ji, Yanyong Zhang, Qing Jiang, Ka-Veng Yuen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Vision-Language Models, Large Language, achieved remarkable progress, Rotary Position Embedding
Comments: CVPR 2026
Abstract:3D Large Vision-Language Models (3D LVLMs) built upon Large Language Models (LLMs) have achieved remarkable progress across various multimodal tasks. However, their inherited position-dependent modeling mechanism, Rotary Position Embedding (RoPE), remains suboptimal for 3D multimodal understanding. The vanilla RoPE formulation fails to preserve essential three-dimensional spatial structures when encoding 3D tokens, and its relative distance computation overlooks angular dependencies, hindering the model's ability to capture directional variations in visual representations. To overcome these limitations, we introduce Spherical Coordinate-based Positional Embedding (SoPE). Our method maps point-cloud token indices into a 3D spherical coordinate space, enabling unified modeling of spatial locations and directional angles. This formulation preserves the inherent geometric structure of point-cloud data, enhances spatial awareness, and yields more consistent and expressive geometric representations for multimodal learning. In addition, we introduce a multi-scale frequency mixing strategy to fuse feature information across different frequency domains. Experimental results on multiple 3D scene benchmarks validate the effectiveness of our approach, while real-world deployment experiments further demonstrate its strong generalization capability.
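To make the idea of a spherical coordinate space concrete, the sketch below shows the standard Cartesian-to-spherical conversion. The function name `to_spherical` and the angle convention (polar angle measured from the +z axis) are assumptions. How SoPE maps token indices to coordinates and feeds them into the positional embedding is not covered by this snippet.

```python
import math

def to_spherical(x, y, z):
    """Return (r, theta, phi): radius, polar angle from +z, azimuth in the xy-plane."""
    r = math.sqrt(x * x + y * y + z * z)
    theta = math.acos(z / r) if r > 0 else 0.0  # undefined at the origin; use 0 by convention
    phi = math.atan2(y, x)
    return r, theta, phi
```

Unlike a flat token index, the pair of angles encodes direction explicitly, which is the kind of angular dependency the abstract says vanilla RoPE overlooks.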
92. 【2602.22712】UFO-DETR: Frequency-Guided End-to-End Detector for UAV Tiny Objects
Link: https://arxiv.org/abs/2602.22712
Authors: Yuankai Chen, Kai Lin, Qihong Wu, Xinxuan Yang, Jiashuo Lai, Ruoen Chen, Haonan Shi, Minfan He, Meihua Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: UAV imagery faces, imagery faces significant, dense distribution, Small target detection, scale variations
Comments: 6 pages, 6 figures, published to 2026 International Conference on Computer Supported Cooperative Work in Design
Abstract:Small target detection in UAV imagery faces significant challenges such as scale variations, dense distribution, and the dominance of small targets. Existing algorithms rely on manually designed components, and general-purpose detectors are not optimized for UAV images, making it difficult to balance accuracy and complexity. To address these challenges, this paper proposes an end-to-end object detection framework, UFO-DETR, which integrates an LSKNet-based backbone network to optimize the receptive field and reduce the number of parameters. By combining the DAttention and AIFI modules, the model flexibly models multi-scale spatial relationships, improving multi-scale target detection performance. Additionally, the DynFreq-C3 module is proposed to enhance small target detection capability through cross-space frequency feature enhancement. Experimental results show that, compared to RT-DETR-L, the proposed method offers significant advantages in both detection performance and computational efficiency, providing an efficient solution for UAV edge computing.
93. 【2602.22695】GFRRN: Explore the Gaps in Single Image Reflection Removal
Link: https://arxiv.org/abs/2602.22695
Authors: Yu Chen, Zewei He, Xingyu Liu, Zixuan Chen, Zheming Lu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: feature interaction mechanism, single image reflection, image reflection removal, achieved remarkable performance, reflection removal
Comments: CVPR26
Abstract:Prior dual-stream methods with the feature interaction mechanism have achieved remarkable performance in single image reflection removal (SIRR). However, they often struggle with (1) a semantic understanding gap between the features of pre-trained models and those of reflection removal models, and (2) reflection label inconsistencies between synthetic and real-world training data. In this work, we first adopt a parameter-efficient fine-tuning (PEFT) strategy by integrating several learnable Mona layers into the pre-trained model to align the training directions. Then, a label generator is designed to unify the reflection labels for both synthetic and real-world data. In addition, a Gaussian-based Adaptive Frequency Learning Block (G-AFLB) is proposed to adaptively learn and fuse the frequency priors, and a Dynamic Agent Attention (DAA) is employed as an alternative to window-based attention by dynamically modeling the significance levels across windows (inter-) and within an individual window (intra-). These components constitute our proposed Gap-Free Reflection Removal Network (GFRRN). Extensive experiments demonstrate the effectiveness of our GFRRN, achieving superior performance against state-of-the-art SIRR methods.
94. 【2602.22689】No Caption, No Problem: Caption-Free Membership Inference via Model-Fitted Embeddings
Link: https://arxiv.org/abs/2602.22689
Authors: Joonsung Jeon, Woo Jae Kim, Suhyeon Ha, Sooel Son, Sung-Eui Yoon
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Keywords: intellectual property concerns, achieved remarkable success, data raises critical, raises critical privacy, memorize training data
Comments: Accepted to ICLR 2026
Abstract:Latent diffusion models have achieved remarkable success in high-fidelity text-to-image generation, but their tendency to memorize training data raises critical privacy and intellectual property concerns. Membership inference attacks (MIAs) provide a principled way to audit such memorization by determining whether a given sample was included in training. However, existing approaches assume access to ground-truth captions. This assumption fails in realistic scenarios where only images are available and their textual annotations remain undisclosed, rendering prior methods ineffective when substituted with vision-language model (VLM) captions. In this work, we propose MoFit, a caption-free MIA framework that constructs synthetic conditioning inputs that are explicitly overfitted to the target model's generative manifold. Given a query image, MoFit proceeds in two stages: (i) model-fitted surrogate optimization, where a perturbation applied to the image is optimized to construct a surrogate in regions of the model's unconditional prior learned from member samples, and (ii) surrogate-driven embedding extraction, where a model-fitted embedding is derived from the surrogate and then used as a mismatched condition for the query image. This embedding amplifies conditional loss responses for member samples while leaving hold-outs relatively less affected, thereby enhancing separability in the absence of ground-truth captions. Our comprehensive experiments across multiple datasets and diffusion models demonstrate that MoFit consistently outperforms prior VLM-conditioned baselines and achieves performance competitive with caption-dependent methods.
95. 【2602.22683】SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses
Link: https://arxiv.org/abs/2602.22683
Authors: Zhuohang Jiang, Xu Yuan, Haohao Qu, Shanru Lin, Kanglong Liu, Wenqi Fan, Qing Li
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Visual Question Answering, Question Answering, Visual Question, knowledge sources emerging, hottest wearable devices
Abstract:The rapid advancement of AI-powered smart glasses, one of the hottest wearable devices, has unlocked new frontiers for multimodal interaction, with Visual Question Answering (VQA) over external knowledge sources emerging as a core application. Existing Vision Language Models (VLMs) adapted to smart glasses are typically trained and evaluated on traditional multimodal datasets; however, these datasets lack the variety and realism needed to reflect smart glasses usage scenarios and diverge from their specific challenges, where accurately identifying the object of interest must precede any external knowledge retrieval. To bridge this gap, we introduce SUPERGLASSES, the first comprehensive VQA benchmark built on real-world data entirely collected by smart glasses devices. SUPERGLASSES comprises 2,422 egocentric image-question pairs spanning 14 image domains and 8 query categories, enriched with full search trajectories and reasoning annotations. We evaluate 26 representative VLMs on this benchmark, revealing significant performance gaps. To address the limitations of existing models, we further propose SUPERLENS, a multimodal smart glasses agent that enables retrieval-augmented answer generation by integrating automatic object detection, query decoupling, and multimodal web search. Our agent achieves state-of-the-art performance, surpassing GPT-4o by 2.19 percent, and highlights the need for task-specific solutions in smart glasses VQA scenarios.
96. 【2602.22678】ViCLIP-OT: The First Foundation Vision-Language Model for Vietnamese Image-Text Retrieval with Optimal Transport
Link: https://arxiv.org/abs/2602.22678
Authors: Quoc-Khang Tran, Minh-Thien Nguyen, Nguyen-Khang Pham
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Vietnamese image-text retrieval, existing vision-language models, Regularized Optimal Transport, Image-text retrieval, fundamental component
Comments: Preprint submitted to Expert Systems with Applications
Abstract:Image-text retrieval has become a fundamental component in intelligent multimedia systems; however, most existing vision-language models are optimized for high-resource languages and remain suboptimal for low-resource settings such as Vietnamese. This work introduces ViCLIP-OT, a foundation vision-language model specifically designed for Vietnamese image-text retrieval. The proposed framework integrates CLIP-style contrastive learning with a Similarity-Graph Regularized Optimal Transport (SIGROT) loss to enhance global cross-modal consistency and mitigate modality gap issues. Extensive experiments on three Vietnamese benchmarks (UIT-OpenViIC, KTVIC, and Crossmodal-3600) demonstrate that ViCLIP-OT consistently outperforms CLIP and SigLIP baselines in both in-domain and zero-shot settings. On UIT-OpenViIC, the model achieves an average Recall@K of 67.34%, improving upon CLIP by 5.75 percentage points. In zero-shot evaluation on Crossmodal-3600, ViCLIP-OT surpasses CLIP by 11.72 percentage points. Embedding-space analysis further confirms improved alignment and reduced modality gap. The results indicate that integrating SIGROT provides an effective and scalable strategy for cross-modal retrieval in low-resource languages, offering practical implications for intelligent multimedia retrieval systems in Vietnamese and other underrepresented linguistic contexts.
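For context on the Recall@K figures in this abstract, here is a minimal sketch of the metric for the simple case of one relevant item per query. The function name `recall_at_k` and the toy data are hypothetical. Which values of K the paper averages over, and how it handles multiple captions per image, are not stated in the abstract.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of queries whose single relevant item appears in the top-k results."""
    hits = sum(1 for ranked, rel in zip(ranked_ids, relevant_ids) if rel in ranked[:k])
    return hits / len(ranked_ids)

# Toy retrieval output: one ranked candidate list per query.
ranked = [["a", "b", "c"], ["c", "a", "b"]]
relevant = ["b", "b"]
scores = [recall_at_k(ranked, relevant, k) for k in (1, 2, 3)]
```

An "average Recall@K" is then typically the mean of such scores over a chosen set of K values.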
97. 【2602.22674】SPMamba-YOLO: An Underwater Object Detection Network Based on Multi-Scale Feature Enhancement and Global Context Modeling
Link: https://arxiv.org/abs/2602.22674
Authors: Guanghao Liao, Zhen Liu, Liyuan Cao, Yonghui Yang, Qi Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: severe light attenuation, challenging research problem, research problem owing, color distortion, Underwater object detection
Comments: 31 pages, 10 figures, 6 tables. This paper presents SPMamba-YOLO, an underwater object detection framework integrating multi-scale feature enhancement and global context modeling. The work is under review
Abstract:Underwater object detection is a critical yet challenging research problem owing to severe light attenuation, color distortion, background clutter, and the small scale of underwater targets. To address these challenges, we propose SPMamba-YOLO, a novel underwater object detection network that integrates multi-scale feature enhancement with global context modeling. Specifically, a Spatial Pyramid Pooling Enhanced Layer Aggregation Network (SPPELAN) module is introduced to strengthen multi-scale feature aggregation and expand the receptive field, while a Pyramid Split Attention (PSA) mechanism enhances feature discrimination by emphasizing informative regions and suppressing background interference. In addition, a Mamba-based state space modeling module is incorporated to efficiently capture long-range dependencies and global contextual information, thereby improving detection robustness in complex underwater environments. Extensive experiments on the URPC2022 dataset demonstrate that SPMamba-YOLO outperforms the YOLOv8n baseline by more than 4.9% in mAP@0.5, particularly for small and densely distributed underwater objects, while maintaining a favorable balance between detection accuracy and computational cost.
98. 【2602.22667】Monocular Open Vocabulary Occupancy Prediction for Indoor Scenes
Link: https://arxiv.org/abs/2602.22667
Authors: Changqing Zhou, Yueru Luo, Han Zhang, Zeyu Jiang, Changhao Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: understand complex indoor, complex indoor environments, embodied agents, fixed taxonomies, vital for embodied
Comments: Accepted by CVPR2026
Abstract:Open-vocabulary 3D occupancy is vital for embodied agents, which need to understand complex indoor environments where semantic categories are abundant and evolve beyond fixed taxonomies. While recent work has explored open-vocabulary occupancy in outdoor driving scenarios, such methods transfer poorly indoors, where geometry is denser, layouts are more intricate, and semantics are far more fine-grained. To address these challenges, we adopt a geometry-only supervision paradigm that uses only binary occupancy labels (occupied vs free). Our framework builds upon 3D Language-Embedded Gaussians, which serve as a unified intermediate representation coupling fine-grained 3D geometry with a language-aligned semantic embedding. On the geometry side, we find that existing Gaussian-to-Occupancy operators fail to converge under such weak supervision, and we introduce an opacity-aware, Poisson-based approach that stabilizes volumetric aggregation. On the semantic side, direct alignment between rendered features and open-vocabulary segmentation features suffers from feature mixing; we therefore propose a Progressive Temperature Decay schedule that gradually sharpens opacities during splatting, strengthening Gaussian-language alignment. On Occ-ScanNet, our framework achieves 59.50 IoU and 21.05 mIoU in the open-vocabulary setting, surpassing all existing occupancy methods in IoU and outperforming prior open-vocabulary approaches by a large margin in mIoU. Code will be released at this https URL.
99. 【2602.22666】ArtPro: Self-Supervised Articulated Object Reconstruction with Adaptive Integration of Mobility Proposals
Link: https://arxiv.org/abs/2602.22666
Authors: Xuelu Li, Zhaonan Wang, Xiaogang Wang, Lei Wu, Manyi Li, Changhe Tu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Reconstructing articulated objects, high-fidelity digital twins, Reconstructing articulated, Gaussian Splatting remain, interactive simulation
Comments:
Abstract:Reconstructing articulated objects into high-fidelity digital twins is crucial for applications such as robotic manipulation and interactive simulation. Recent self-supervised methods using differentiable rendering frameworks like 3D Gaussian Splatting remain highly sensitive to the initial part segmentation. Their reliance on heuristic clustering or pre-trained models often causes optimization to converge to local minima, especially for complex multi-part objects. To address these limitations, we propose ArtPro, a novel self-supervised framework that introduces adaptive integration of mobility proposals. Our approach begins with an over-segmentation initialization guided by geometry features and motion priors, generating part proposals with plausible motion hypotheses. During optimization, we dynamically merge these proposals by analyzing motion consistency among spatial neighbors, while a collision-aware motion pruning mechanism prevents erroneous kinematic estimation. Extensive experiments on both synthetic and real-world objects demonstrate that ArtPro achieves robust reconstruction of complex multi-part objects, significantly outperforming existing methods in accuracy and stability.
100. 【2602.22659】Scaling Audio-Visual Quality Assessment Dataset via Crowdsourcing
Link: https://arxiv.org/abs/2602.22659
Authors: Renyu Yang, Jian Jin, Lili Meng, Meiqin Liu, Yilin Wang, Balu Adsumilli, Weisi Lin
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Keywords: Audio-visual quality assessment, Audio-visual quality, small in scale, stalled by limitations, limitations of existing
Comments: Accepted to ICASSP 2026. 5 pages (main paper) + 8 pages (supplementary material)
Abstract:Audio-visual quality assessment (AVQA) research has been stalled by limitations of existing datasets: they are typically small in scale, with insufficient diversity in content and quality, and annotated only with overall scores. These shortcomings provide limited support for model development and multimodal perception research. We propose a practical approach for AVQA dataset construction. First, we design a crowdsourced subjective experiment framework for AVQA that breaks the constraints of in-lab settings and achieves reliable annotation across varied environments. Second, a systematic data preparation strategy is further employed to ensure broad coverage of both quality levels and semantic scenarios. Third, we extend the dataset with additional annotations, enabling research on multimodal perception mechanisms and their relation to content. Finally, we validate this approach through YT-NTU-AVQ, the largest and most diverse AVQA dataset to date, consisting of 1,620 user-generated audio and video (A/V) sequences. The dataset and platform code are available at this https URL
101. 【2602.22654】Denoising as Path Planning: Training-Free Acceleration of Diffusion Models with DPCache
Link: https://arxiv.org/abs/2602.22654
Authors: Bowen Cui, Yuanbin Wang, Huajiang Xu, Biaolong Chen, Aixi Zhang, Hao Jiang, Zhengzheng Jin, Xu Liu, Pipei Huang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: demonstrated remarkable success, practical deployment remains, deployment remains hindered, substantial computational overhead, multi-step iterative sampling
Comments: Accepted by CVPR 2026
Abstract:Diffusion models have demonstrated remarkable success in image and video generation, yet their practical deployment remains hindered by the substantial computational overhead of multi-step iterative sampling. Among acceleration strategies, caching-based methods offer a training-free and effective solution by reusing or predicting features across timesteps. However, existing approaches rely on fixed or locally adaptive schedules without considering the global structure of the denoising trajectory, often leading to error accumulation and visual artifacts. To overcome this limitation, we propose DPCache, a novel training-free acceleration framework that formulates diffusion sampling acceleration as a global path planning problem. DPCache constructs a Path-Aware Cost Tensor from a small calibration set to quantify the path-dependent error of skipping timesteps conditioned on the preceding key timestep. Leveraging this tensor, DPCache employs dynamic programming to select an optimal sequence of key timesteps that minimizes the total path cost while preserving trajectory fidelity. During inference, the model performs full computations only at these key timesteps, while intermediate outputs are efficiently predicted using cached features. Extensive experiments on DiT, FLUX, and HunyuanVideo demonstrate that DPCache achieves strong acceleration with minimal quality loss, outperforming prior acceleration methods by $+$0.031 ImageReward at 4.87$\times$ speedup and even surpassing the full-step baseline by $+$0.028 ImageReward at 3.54$\times$ speedup on FLUX, validating the effectiveness of our path-aware global scheduling framework. Code will be released at this https URL.
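The key-timestep selection described in this abstract can be read as a shortest-path problem solved with dynamic programming. The sketch below is only an illustration under simplifying assumptions, not the authors' implementation: it uses a 2D cost matrix `cost[i][j]` (the error of computing fully at timestep `i`, skipping everything in between, and computing fully again at `j`) rather than the paper's path-aware cost tensor, and it assumes the first and last timesteps are always key timesteps. The function name `select_key_timesteps` and the toy cost are hypothetical.

```python
import math


def select_key_timesteps(cost, num_keys):
    """Choose `num_keys` key timesteps that minimize the summed skip cost.

    cost[i][j] (i < j) is the assumed error of jumping from key timestep i
    directly to key timestep j. Timesteps 0 and T-1 are always selected.
    """
    T = len(cost)
    if not 2 <= num_keys <= T:
        raise ValueError("num_keys must be between 2 and the number of timesteps")

    # best[k][j]: minimal cost of a path that ends at j and uses k+1 key timesteps
    best = [[math.inf] * T for _ in range(num_keys)]
    prev = [[-1] * T for _ in range(num_keys)]
    best[0][0] = 0.0

    for k in range(1, num_keys):
        for j in range(1, T):
            for i in range(j):
                if best[k - 1][i] == math.inf:
                    continue
                candidate = best[k - 1][i] + cost[i][j]
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    prev[k][j] = i

    # Walk the predecessor table backwards from the final timestep
    path = [T - 1]
    j = T - 1
    for k in range(num_keys - 1, 0, -1):
        j = prev[k][j]
        path.append(j)
    path.reverse()
    return path, best[num_keys - 1][T - 1]


# Toy cost: skipping more intermediate steps is quadratically more expensive
T = 7
toy_cost = [[(j - i - 1) ** 2 if j > i else 0 for j in range(T)] for i in range(T)]
keys, total = select_key_timesteps(toy_cost, num_keys=3)
print(keys, total)
```

With this toy cost the planner spreads the key timesteps evenly, which is the expected behavior when the penalty depends only on the gap length; a calibrated cost would instead concentrate full computations where the denoising trajectory changes fastest.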
102. 【2602.22649】Interactive Medical-SAM2 GUI: A Napari-based semi-automatic annotation tool for medical images
Link: https://arxiv.org/abs/2602.22649
Authors: Woojae Hong, Jong Ha Hwang, Jiyong Chung, Joongyeon Choi, Hyunngun Kim, Yong Hwy Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: open-source desktop application, open-source desktop, desktop application, application for semi-automatic, Napari multi-dimensional viewer
Comments: 6 pages, 2 figures, Planning to submit to JOSS (Journal of Open Source Software)
Abstract:Interactive Medical-SAM2 GUI is an open-source desktop application for semi-automatic annotation of 2D and 3D medical images. Built on the Napari multi-dimensional viewer, box/point prompting is integrated with SAM2-style propagation by treating a 3D volume as a slice sequence, enabling mask propagation from sparse prompts using Medical-SAM2 on top of SAM2. Voxel-level annotation remains essential for developing and validating medical imaging algorithms, yet manual labeling is slow and expensive for 3D scans, and existing integrations frequently emphasize per-slice interaction without providing a unified, cohort-oriented workflow for navigation, propagation, interactive correction, and quantitative export in a single local pipeline. To address this practical limitation, a local-first Napari workflow is provided for efficient 3D annotation across multiple studies using standard DICOM series and/or NIfTI volumes. Users can annotate cases sequentially under a single root folder with explicit proceed/skip actions, initialize objects via box-first prompting (including first/last-slice initialization for single-object propagation), refine predictions with point prompts, and finalize labels through prompt-first correction prior to saving. During export, per-object volumetry and 3D volume rendering are supported, and image geometry is preserved via SimpleITK. The GUI is implemented in Python using Napari and PyTorch, with optional N4 bias-field correction, and is intended exclusively for research annotation workflows. The code is released on the project page: this https URL.
103. 【2602.22644】Plug, Play, and Fortify: A Low-Cost Module for Robust Multimodal Image Understanding Models
Link: https://arxiv.org/abs/2602.22644
Authors: Siqi Lu, Wanying Xu, Yongbin Zheng, Wenting Luan, Peng Sun, Jianhang Yao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: catastrophic performance degradation, causing catastrophic performance, Missing modalities present, present a fundamental, causing catastrophic
Comments:
Abstract:Missing modalities present a fundamental challenge in multimodal models, often causing catastrophic performance degradation. Our observations suggest that this fragility stems from an imbalanced learning process, where the model develops an implicit preference for certain modalities, leading to the under-optimization of others. We propose a simple yet efficient method to address this challenge. The central insight of our work is that the dominance relationship between modalities can be effectively discerned and quantified in the frequency domain. To leverage this principle, we first introduce a Frequency Ratio Metric (FRM) to quantify modality preference by analyzing features in the frequency domain. Guided by FRM, we then propose a Multimodal Weight Allocation Module (MWAM), a plug-and-play component that dynamically re-balances the contribution of each branch during training, promoting a more holistic learning paradigm. Extensive experiments demonstrate that MWAM can be seamlessly integrated into diverse architectural backbones, such as those based on CNNs and ViTs. Furthermore, MWAM delivers consistent performance gains across a wide range of tasks and modality combinations. This advancement extends beyond merely optimizing the performance of the base model; it also manifests as further performance improvements to state-of-the-art methods addressing the missing modality problem.
104. 【2602.22639】QuadSync: Quadrifocal Tensor Synchronization via Tucker Decomposition
Link: https://arxiv.org/abs/2602.22639
Authors: Daniel Miao, Gilad Lerman, Joe Kileel
Categories: Computer Vision and Pattern Recognition (cs.CV); Numerical Analysis (math.NA); Optimization and Control (math.OC)
Keywords: quadrifocal tensors capture, structure from motion, pairwise counterparts, theoretical interest, quadrifocal tensors
Comments: 30 pages, accepted to CVPR 2026
Abstract:In structure from motion, quadrifocal tensors capture more information than their pairwise counterparts (essential matrices), yet they have often been thought of as impractical and only of theoretical interest. In this work, we challenge such beliefs by providing a new framework to recover $n$ cameras from the corresponding collection of quadrifocal tensors. We form the block quadrifocal tensor and show that it admits a Tucker decomposition whose factor matrices are the stacked camera matrices, and which thus has a multilinear rank of (4, 4, 4, 4) independent of $n$. We develop the first synchronization algorithm for quadrifocal tensors, using Tucker decomposition, alternating direction method of multipliers, and iteratively reweighted least squares. We further establish relationships between the block quadrifocal, trifocal, and bifocal tensors, and introduce an algorithm that jointly synchronizes these three entities. Numerical experiments demonstrate the effectiveness of our methods on modern datasets, indicating the potential and importance of using higher-order information in synchronization.
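The rank claim in this abstract can be checked numerically on synthetic data. The sketch below assumes a random 4×4×4×4 core and random 3×4 camera matrices purely for illustration (the paper's actual core tensor has a specific structure that is not reproduced here); it builds a 4th-order tensor in Tucker form with the stacked cameras as the factor matrix in every mode and measures the rank of each mode unfolding. The helper names are hypothetical.

```python
import numpy as np


def mode_unfold(tensor, mode):
    """Flatten a tensor into a matrix whose rows index the given mode."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def multilinear_rank(tensor, rel_tol=1e-8):
    """Numerical rank of every mode unfolding, using a relative SVD threshold."""
    ranks = []
    for mode in range(tensor.ndim):
        s = np.linalg.svd(mode_unfold(tensor, mode), compute_uv=False)
        ranks.append(int(np.sum(s > rel_tol * s[0])))
    return tuple(ranks)


rng = np.random.default_rng(0)
n = 5                                        # number of cameras (assumed)
cameras = rng.standard_normal((3 * n, 4))    # n stacked 3x4 camera matrices
core = rng.standard_normal((4, 4, 4, 4))     # placeholder core, NOT the paper's

# Tucker form: multiply the core by the same factor matrix along all four modes
block = np.einsum("abcd,ia,jb,kc,ld->ijkl", core, cameras, cameras, cameras, cameras)

print(block.shape, multilinear_rank(block))
```

Increasing `n` enlarges the tensor to (3n)×(3n)×(3n)×(3n) while each unfolding keeps rank at most 4, which is the property that makes a low-rank synchronization approach plausible.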
105. 【2602.22629】CRAG: Can 3D Generative Models Help 3D Assembly?
Link: https://arxiv.org/abs/2602.22629
Authors: Zeyu Jiang, Sihang Li, Siqi Tan, Chenyang Xu, Juexiao Zhang, Julia Galway-Witham, Xue Wang, Scott A. Williams, Radu Iovita, Chen Feng, Jing Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: pure pose estimation, rearranging observed parts, rearranging observed, rigid transformations, assembly
Comments: 10 pages, 7 figures
Abstract:Most existing 3D assembly methods treat the problem as pure pose estimation, rearranging observed parts via rigid transformations. In contrast, human assembly naturally couples structural reasoning with holistic shape inference. Inspired by this intuition, we reformulate 3D assembly as a joint problem of assembly and generation. We show that these two processes are mutually reinforcing: assembly provides part-level structural priors for generation, while generation injects holistic shape context that resolves ambiguities in assembly. Unlike prior methods that cannot synthesize missing geometry, we propose CRAG, which simultaneously generates plausible complete shapes and predicts poses for input parts. Extensive experiments demonstrate state-of-the-art performance across in-the-wild objects with diverse geometries, varying part counts, and missing pieces. Our code and models will be released.
106. 【2602.22625】DiffBMP: Differentiable Rendering with Bitmap Primitives
Link: https://arxiv.org/abs/2602.22625
Authors: Seongmin Hong, Junghun James Kim, Daehyeop Kim, Insoo Chung, Se Young Chun
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: efficient differentiable rendering, differentiable rendering engine, scalable and efficient, efficient differentiable, traditional differentiable renderers
Comments: Accepted to CVPR 2026, https://diffbmp.com
Abstract:We introduce DiffBMP, a scalable and efficient differentiable rendering engine for a collection of bitmap images. Our work addresses a limitation that traditional differentiable renderers are constrained to vector graphics, given that most images in the world are bitmaps. Our core contribution is a highly parallelized rendering pipeline, featuring a custom CUDA implementation for calculating gradients. This system can, for example, optimize the position, rotation, scale, color, and opacity of thousands of bitmap primitives all in under 1 min using a consumer GPU. We employ and validate several techniques to facilitate the optimization: soft rasterization via Gaussian blur, structure-aware initialization, noisy canvas, and specialized losses/heuristics for videos or spatially constrained images. We demonstrate DiffBMP is not just an isolated tool, but a practical one designed to integrate into creative workflows. It supports exporting compositions to a native, layered file format, and the entire framework is publicly accessible via an easy-to-hack Python package.
107. 【2602.22624】Instruction-based Image Editing with Planning, Reasoning, and Generation
Link: https://arxiv.org/abs/2602.22624
Authors: Liya Ji, Chenyang Qi, Qifeng Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: generate interactive content, big challenge due, Editing, interactive content, generate interactive
Comments: 10 pages, 7 figures
Abstract:Editing images via instruction provides a natural way to generate interactive content, but it remains challenging because it places high demands on scene understanding and generation. Prior work chains together large language models, object segmentation models, and editing models for this task. However, the understanding models operate on a single modality only, which restricts the editing quality. We aim to bridge understanding and generation via a new multi-modality model that gives instruction-based image editing models the reasoning ability needed for more complex cases. To achieve this goal, we decompose the instruction-based editing task into three multi-modality chain-of-thought steps: Chain-of-Thought (CoT) planning, editing region reasoning, and editing. For Chain-of-Thought planning, a large language model derives appropriate sub-prompts from the provided instruction and the capabilities of the editing network. For editing region reasoning, we train an instruction-based editing region generation network with a multi-modal large language model. Finally, we propose a hint-guided instruction-based editing network, built on a large text-to-image diffusion model, that accepts these hints during generation. Extensive experiments demonstrate that our method has competitive editing abilities on complex real-world images.
108. 【2602.22621】CGSA: Class-Guided Slot-Aware Adaptation for Source-Free Object Detection
Link: https://arxiv.org/abs/2602.22621
Authors: Boyang Dai, Zeng Fan, Zihao Qi, Meng Lou, Yizhou Yu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Adaptive Object Detection, Domain Adaptive Object, Source-Free Domain Adaptive, labeled source domain, unlabeled target domain
Comments: The paper has been accepted by the conference ICLR 2026
Abstract:Source-Free Domain Adaptive Object Detection (SF-DAOD) aims to adapt a detector trained on a labeled source domain to an unlabeled target domain without retaining any source data. Despite recent progress, most popular approaches focus on tuning pseudo-label thresholds or refining the teacher-student framework, while overlooking object-level structural cues within cross-domain data. In this work, we present CGSA, the first framework that brings Object-Centric Learning (OCL) into SF-DAOD by integrating slot-aware adaptation into the DETR-based detector. Specifically, our approach integrates a Hierarchical Slot Awareness (HSA) module into the detector to progressively disentangle images into slot representations that act as visual priors. These slots are then guided toward class semantics via a Class-Guided Slot Contrast (CGSC) module, maintaining semantic consistency and prompting domain-invariant adaptation. Extensive experiments on multiple cross-domain datasets demonstrate that our approach outperforms previous SF-DAOD methods, with theoretical derivations and experimental analysis further demonstrating the effectiveness of the proposed components and the framework, thereby indicating the promise of object-centric design in privacy-sensitive adaptation scenarios. Code is released at this https URL.
109. 【2602.22620】Coded-E2LF: Coded Aperture Light Field Imaging from Events
Link: https://arxiv.org/abs/2602.22620
Authors: Tomoya Tsuchida, Keita Takahashi, Chihiro Tsutake, Toshiaki Fujii, Hajime Nagahara
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: stationary event-only camera, light field, event-only camera, light field reconstruction, stationary event-only
Comments: accepted to CVPR 2026
Abstract:We propose Coded-E2LF (coded event to light field), a computational imaging method for acquiring a 4-D light field using a coded aperture and a stationary event-only camera. In a previous work, an imaging system similar to ours was adopted, but both events and intensity images were captured and used for light field reconstruction. In contrast, our method is purely event-based, which relaxes restrictions for hardware implementation. We also introduce several advancements from the previous work that enable us to theoretically support and practically improve light field reconstruction from events alone. In particular, we clarify the key role of a black pattern in aperture coding patterns. We finally implemented our method on real imaging hardware to demonstrate its effectiveness in capturing real 3-D scenes. To the best of our knowledge, we are the first to demonstrate that a 4-D light field with pixel-level accuracy can be reconstructed from events alone. Our software and supplementary video are available from our project website.
110. 【2602.22613】Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery
Link: https://arxiv.org/abs/2602.22613
Authors: Minh Kha Do, Wei Xiang, Kang Han, Di Wu, Khoa Phan, Yi-Ping Phoebe Chen, Gaowen Liu, Ramana Rao Kompella
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Vision-language foundation models, foundation models, understanding for Earth, Earth observation, Spectrally Grounded Alignment
Comments:
Abstract:Vision-language foundation models (VLFMs) promise zero-shot and retrieval understanding for Earth observation. While operational satellite systems often lack full multi-spectral coverage, making RGB-only inference highly desirable for scalable deployment, the adoption of VLFMs for satellite imagery remains hindered by two factors: (1) multi-spectral inputs are informative but difficult to exploit consistently due to band redundancy and misalignment; and (2) CLIP-style text encoders limit semantic expressiveness and weaken fine-grained alignment. We present SATtxt, a spectrum-aware VLFM that operates with RGB inputs only at inference while retaining spectral cues learned during training. Our framework comprises two stages. First, Spectral Representation Distillation transfers spectral priors from a frozen multi-spectral teacher to an RGB student via a lightweight projector. Second, Spectrally Grounded Alignment with Instruction-Augmented LLMs bridges the distilled visual space and an expressive LLM embedding space. Across EuroSAT, BigEarthNet, and ForestNet, SATtxt improves zero-shot classification on average by 4.2%, retrieval by 5.9%, and linear probing by 2.7% over baselines, showing an efficient path toward spectrum-aware vision-language learning for Earth observation. Project page: this https URL
111. 【2602.22610】DP-aware AdaLN-Zero: Taming Conditioning-Induced Heavy-Tailed Gradients in Differentially Private Diffusion
Link: https://arxiv.org/abs/2602.22610
Authors: Tao Huang, Jiayang Meng, Xu Yang, Chen Hou, Hong Chen
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Condition injection enables, generate context-aware outputs, enables diffusion models, Condition injection, context-aware outputs
Comments:
Abstract:Condition injection enables diffusion models to generate context-aware outputs, which is essential for many time-series tasks. However, heterogeneous conditional contexts (e.g., observed history, missingness patterns or outlier covariates) can induce heavy-tailed per-example gradients. Under Differentially Private Stochastic Gradient Descent (DP-SGD), these rare conditioning-driven heavy-tailed gradients disproportionately trigger global clipping, resulting in outlier-dominated updates, larger clipping bias, and degraded utility under a fixed privacy budget. In this paper, we propose DP-aware AdaLN-Zero, a drop-in sensitivity-aware conditioning mechanism for conditional diffusion transformers that limits conditioning-induced gain without modifying the DP-SGD mechanism. DP-aware AdaLN-Zero jointly constrains conditioning representation magnitude and AdaLN modulation parameters via bounded re-parameterization, suppressing extreme gradient tail events before gradient clipping and noise injection. Empirically, DP-SGD equipped with DP-aware AdaLN-Zero improves interpolation/imputation and forecasting under matched privacy settings. We observe consistent gains on a real-world power dataset and two public ETT benchmarks over vanilla DP-SGD. Moreover, gradient diagnostics attribute these improvements to conditioning-specific tail reshaping and reduced clipping distortion, while preserving expressiveness in non-private training. Overall, these results show that sensitivity-aware conditioning can substantially improve private conditional diffusion training without sacrificing standard performance.
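To make the clipping problem in this abstract concrete, the sketch below shows the standard DP-SGD aggregation step (per-example clipping followed by Gaussian noise) next to one possible bounded re-parameterization of AdaLN modulation. The `tanh` bound, the `bound` parameter, and both function names are assumptions chosen for illustration; the abstract does not specify the paper's exact parameterization.

```python
import numpy as np


def dp_sgd_aggregate(per_example_grads, clip_norm, noise_multiplier, rng):
    """Standard DP-SGD step: clip each example's gradient, sum, add noise, average."""
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    # Scale down only the gradients whose L2 norm exceeds clip_norm
    factors = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * factors
    noise = rng.normal(0.0, noise_multiplier * clip_norm,
                       size=per_example_grads.shape[1])
    return (clipped.sum(axis=0) + noise) / per_example_grads.shape[0]


def bounded_adaln(x, raw_scale, raw_shift, bound=1.0, eps=1e-5):
    """AdaLN-style modulation with scale and shift squashed into [-bound, bound].

    The tanh squashing is a hypothetical choice, not the paper's formulation.
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    normed = (x - mean) / np.sqrt(var + eps)
    scale = bound * np.tanh(raw_scale)
    shift = bound * np.tanh(raw_shift)
    return normed * (1.0 + scale) + shift


rng = np.random.default_rng(0)
grads = np.array([[3.0, 4.0],    # heavy-tailed outlier, norm 5
                  [0.3, 0.4]])   # typical example, norm 0.5
# noise_multiplier=0 isolates the effect of clipping (no privacy in this setting)
print(dp_sgd_aggregate(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng))

x = np.array([[1.0, 2.0, 3.0]])
print(bounded_adaln(x, raw_scale=1e9, raw_shift=0.0))
```

The outlier is shrunk by a factor of five while the typical gradient passes through unchanged, which illustrates the clipping bias the abstract refers to. Bounding the modulation upstream limits how large a conditioning-driven activation (and therefore its gradient) can become before clipping is applied.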
112. 【2602.22607】LoR-LUT: Learning Compact 3D Lookup Tables via Low-Rank Residuals
Link: https://arxiv.org/abs/2602.22607
Authors: Ziqi Zhao, Abhijit Mishra, Shounak Roychowdhury
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: lookup table, residual corrections, basis LUTs, fact low-rank tensors, unified low-rank formulation
Comments:
Abstract:We present LoR-LUT, a unified low-rank formulation for compact and interpretable 3D lookup table (LUT) generation. Unlike conventional 3D-LUT-based techniques that rely on fusion of basis LUTs, which are usually dense tensors, our unified approach extends the current framework by jointly using residual corrections, which are in fact low-rank tensors, together with a set of basis LUTs. The approach described here improves the existing perceptual quality of an image, which is primarily due to the technique's novel use of residual corrections. At the same time, we achieve the same level of trilinear interpolation complexity, using a significantly smaller number of network, residual corrections, and LUT parameters. The experimental results obtained from LoR-LUT, which is trained on the MIT-Adobe FiveK dataset, reproduce expert-level retouching characteristics with high perceptual fidelity and a sub-megabyte model size. Furthermore, we introduce an interactive visualization tool, termed LoR-LUT Viewer, which transforms an input image into the LUT-adjusted output image, via a number of slidebars that control different parameters. The tool provides an effective way to enhance interpretability and user confidence in the visual results. Overall, our proposed formulation offers a compact, interpretable, and efficient direction for future LUT-based image enhancement and style transfer.
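A rough sense of why low-rank residuals keep a 3D LUT compact can be had from a parameter count. The sketch below assumes a CP-style (sum of rank-1 terms) residual per output channel, a 33-point grid, rank 8, and three basis LUTs; all of these are illustrative assumptions, since the abstract does not state the paper's factorization, grid size, or rank. The function name `build_lut` is hypothetical.

```python
import numpy as np


def build_lut(basis_luts, weights, fa, fb, fc):
    """Fuse basis LUTs and add a low-rank residual.

    basis_luts: (K, 3, S, S, S) dense basis LUTs
    weights:    (K,) fusion weights (image-dependent in a real model)
    fa, fb, fc: (3, R, S) per-channel factor vectors along the R, G, B axes
    """
    fused = np.tensordot(weights, basis_luts, axes=1)           # (3, S, S, S)
    residual = np.einsum("cri,crj,crk->cijk", fa, fb, fc)       # sum of R rank-1 terms
    return fused + residual


S, R, K = 33, 8, 3   # grid size, residual rank, number of basis LUTs (all assumed)
rng = np.random.default_rng(0)
basis = rng.standard_normal((K, 3, S, S, S))
weights = rng.standard_normal(K)
fa, fb, fc = (rng.standard_normal((3, R, S)) for _ in range(3))

lut = build_lut(basis, weights, fa, fb, fc)

dense_params = 3 * S ** 3            # one dense LUT
residual_params = 3 * R * 3 * S      # channels x rank x three factor vectors of length S
print(lut.shape, dense_params, residual_params)
```

Under these assumptions the residual costs roughly 2% of the parameters of a single dense LUT, and the result is still an ordinary dense table at inference time, so trilinear interpolation is unchanged.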
113. 【2602.22601】$ϕ$-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models
Link: https://arxiv.org/abs/2602.22601
Authors: Thanh-Dat Truong, Huu-Thien Tran, Jackson Cothren, Bhiksha Raj, Khoa Luu
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Multimodal Models, Large Multimodal, biased model updates, Multimodal Models, Direct Preference Optimization
Comments: Accepted to CVPR'26
Abstract:Fairness in Continual Learning for Large Multimodal Models (LMMs) is an emerging yet underexplored challenge, particularly in the presence of imbalanced data distributions that can lead to biased model updates and suboptimal performance across tasks. While recent continual learning studies have made progress in addressing catastrophic forgetting, the problem of fairness caused by imbalanced data remains largely underexplored. This paper presents a novel Fairness Direct Preference Optimization (FaiDPO or $\phi$-DPO) framework for continual learning in LMMs. In particular, we first propose a new continual learning paradigm based on Direct Preference Optimization (DPO) to mitigate catastrophic forgetting by aligning learning with pairwise preference signals. Then, we identify the limitations of conventional DPO in imbalanced data and present a new $\phi$-DPO loss that explicitly addresses distributional biases. We provide a comprehensive theoretical analysis demonstrating that our approach addresses both forgetting and data imbalance. Additionally, to enable $\phi$-DPO-based continual learning, we construct pairwise preference annotations for existing benchmarks in the context of continual learning. Extensive experiments and ablation studies show the proposed $\phi$-DPO achieves state-of-the-art performance across multiple benchmarks, outperforming prior continual learning methods for LMMs.
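For readers unfamiliar with the baseline this abstract builds on, the sketch below computes the conventional DPO loss for a single preference pair. It is not the $\phi$-DPO loss, whose form the abstract does not give; the function name, argument names, and the default `beta` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Conventional DPO loss for one (chosen, rejected) pair of responses.

    Each argument is the summed log-probability of a response under the
    trainable policy or the frozen reference model.
    """
    # How much more the policy prefers the chosen response than the reference does
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) written in a numerically stable form
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Policy identical to the reference: the loss is log(2)
print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
# Policy shifts probability toward the chosen response: the loss drops
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

Because every pair contributes with equal weight, classes or tasks with many pairs dominate the averaged loss, which is the kind of imbalance sensitivity the abstract attributes to conventional DPO.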
114. 【2602.22596】BetterScene: 3D Scene Synthesis with Representation-Aligned Generative Model
Link: https://arxiv.org/abs/2602.22596
Authors: Yuci Han, Charles Toth, John E. Anderson, William J. Shuart, Alper Yilmaz
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: diverse real-world scenes, Stable Video Diffusion, unconstrained photos, production-ready Stable Video, quality for diverse
Comments:
Abstract:We present BetterScene, an approach to enhance novel view synthesis (NVS) quality for diverse real-world scenes using extremely sparse, unconstrained photos. BetterScene leverages the production-ready Stable Video Diffusion (SVD) model pretrained on billions of frames as a strong backbone, aiming to mitigate artifacts and recover view-consistent details at inference time. Conventional methods have developed similar diffusion-based solutions to address these challenges of novel view synthesis. Despite significant improvements, these methods typically rely on off-the-shelf pretrained diffusion priors and fine-tune only the UNet module while keeping other components frozen, which still leads to inconsistent details and artifacts even when incorporating geometry-aware regularizations like depth or semantic conditions. To address this, we investigate the latent space of the diffusion model and introduce two components: (1) temporal equivariance regularization and (2) vision foundation model-aligned representation, both applied to the variational autoencoder (VAE) module within the SVD pipeline. BetterScene integrates a feed-forward 3D Gaussian Splatting (3DGS) model to render features as inputs for the SVD enhancer and generate continuous, artifact-free, consistent novel views. We evaluate on the challenging DL3DV-10K dataset and demonstrate superior performance compared to state-of-the-art methods.
115. 【2602.22595】Don't let the information slip away
Link: https://arxiv.org/abs/2602.22595
Authors: Taozhe Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Real-time object detection, object detection, Real-time object, object detection models, recent years
Comments: 10
Abstract:Real-time object detection has advanced rapidly in recent years. The YOLO series of detectors is among the most well-known CNN-based object detection models and cannot be overlooked. The latest version, YOLOv26, was recently released, while YOLOv12 achieved state-of-the-art (SOTA) performance with 55.2 mAP on the COCO val2017 dataset. Meanwhile, transformer-based object detection models, also known as DEtection TRansformer (DETR), have demonstrated impressive performance. RT-DETR is an outstanding model that outperformed the YOLO series in both speed and accuracy when it was released. Its successor, RT-DETRv2, achieved 53.4 mAP on the COCO val2017 dataset. However, despite their remarkable performance, all these models let information slip away. They primarily focus on the features of foreground objects while neglecting the contextual information provided by the background. We believe that background information can significantly aid object detection tasks. For example, cars are more likely to appear on roads than in offices, while wild animals are more likely to be found in forests or remote areas than on busy streets. To address this gap, we propose an object detection model called Association DETR, which achieves state-of-the-art results compared to other object detection models on the COCO val2017 dataset.
116. 【2602.22594】Causal Motion Diffusion Models for Autoregressive Motion Generation
Link: https://arxiv.org/abs/2602.22594
Authors: Qing Yu, Akihisa Watanabe, Kent Fujiwara
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent advances, diffusion models, motion diffusion models, diffusion, improved the realism
Comments: Accepted to CVPR 2026, Project website: https://yu1ut.com/CMDM-HP/
Abstract:Recent advances in motion diffusion models have substantially improved the realism of human motion synthesis. However, existing approaches either rely on full-sequence diffusion models with bidirectional generation, which limits temporal causality and real-time applicability, or autoregressive models that suffer from instability and cumulative errors. In this work, we present Causal Motion Diffusion Models (CMDM), a unified framework for autoregressive motion generation based on a causal diffusion transformer that operates in a semantically aligned latent space. CMDM builds upon a Motion-Language-Aligned Causal VAE (MAC-VAE), which encodes motion sequences into temporally causal latent representations. On top of this latent representation, an autoregressive diffusion transformer is trained using causal diffusion forcing to perform temporally ordered denoising across motion frames. To achieve fast inference, we introduce a frame-wise sampling schedule with causal uncertainty, where each subsequent frame is predicted from partially denoised previous frames. The resulting framework supports high-quality text-to-motion generation, streaming synthesis, and long-horizon motion generation at interactive rates. Experiments on HumanML3D and SnapMoGen demonstrate that CMDM outperforms existing diffusion and autoregressive models in both semantic fidelity and temporal smoothness, while substantially reducing inference latency.
117. 【2602.22571】GIFSplat: Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views
Link: https://arxiv.org/abs/2602.22571
Authors: Tianyu Chen, Wei Xiang, Kang Han, Yu Lu, Di Wu, Gaowen Liu, Ramana Rao Kompella
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: reconstruction offers substantial, offers substantial runtime, substantial runtime advantages, reconstruction offers, offers substantial
Comments:
Abstract:Feed-forward 3D reconstruction offers substantial runtime advantages over per-scene optimization, which remains slow at inference and often fragile under sparse views. However, existing feed-forward methods still have potential for further performance gains, especially for out-of-domain data, and struggle to retain second-level inference time once a generative prior is introduced. These limitations stem from the one-shot prediction paradigm in existing feed-forward pipelines: models are strictly bounded by capacity, lack inference-time refinement, and are ill-suited for continuously injecting generative priors. We introduce GIFSplat, a purely feed-forward iterative refinement framework for 3D Gaussian Splatting from sparse unposed views. A small number of forward-only residual updates progressively refine the current 3D scene using rendering evidence, achieving a favorable balance between efficiency and quality. Furthermore, we distill a frozen diffusion prior into Gaussian-level cues from enhanced novel renderings without gradient backpropagation or ever-increasing view-set expansion, thereby enabling per-scene adaptation with a generative prior while preserving feed-forward efficiency. Across DL3DV, RealEstate10K, and DTU, GIFSplat consistently outperforms state-of-the-art feed-forward baselines, improving PSNR by up to +2.1 dB, and it maintains second-scale inference time without requiring camera poses or any test-time gradient optimization.
118. 【2602.22570】Guidance Matters: Rethinking the Evaluation Pitfall for Text-to-Image Generation
Link: https://arxiv.org/abs/2602.22570
Authors: Dian Xie,Shitong Shao,Lichen Bai,Zikai Zhou,Bojun Cheng,Shuo Yang,Jun Wu,Zeke Xie
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Classifier-free guidance, diffusion guidance, great conditional generation, diffusion guidance methods, guidance
Comments:
Abstract:Classifier-free guidance (CFG) has helped diffusion models achieve great conditional generation in various fields. Recently, more diffusion guidance methods have emerged with improved generation quality and human preference. However, can these emerging diffusion guidance methods really achieve solid and significant improvements? In this paper, we rethink recent progress on diffusion guidance. Our work mainly consists of four contributions. First, we reveal a critical evaluation pitfall: common human preference models exhibit a strong bias towards large guidance scales. Simply increasing the CFG scale can easily improve quantitative evaluation scores due to strong semantic alignment, even if image quality is severely damaged (e.g., oversaturation and artifacts). Second, we introduce a novel guidance-aware evaluation (GA-Eval) framework that employs effective guidance scale calibration to enable fair comparison between current guidance methods and CFG by identifying the effects orthogonal and parallel to CFG effects. Third, motivated by the evaluation pitfall, we design the Transcendent Diffusion Guidance (TDG) method, which can significantly improve human preference scores in the conventional evaluation framework but actually does not work in practice. Fourth, in extensive experiments, we empirically evaluate eight recent diffusion guidance methods within the conventional evaluation framework and the proposed GA-Eval framework. Notably, simply increasing the CFG scale can compete with most studied diffusion guidance methods, while all methods suffer severely from winning rate degradation over standard CFG. Our work would strongly motivate the community to rethink the evaluation paradigm and future directions of this field.
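For readers unfamiliar with the guidance scale the abstract refers to, the widely used classifier-free guidance combination can be sketched as below. This is the standard CFG formula only, not the paper's GA-Eval or TDG; the function name `cfg_combine` and the toy values are illustrative assumptions.

```python
from typing import List, Sequence


def cfg_combine(
    eps_uncond: Sequence[float],
    eps_cond: Sequence[float],
    scale: float,
) -> List[float]:
    """Standard CFG: eps = eps_uncond + scale * (eps_cond - eps_uncond)."""
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions (hypothetical numbers, not produced by any model).
eps_u = [0.0, 1.0, -1.0]
eps_c = [1.0, 1.0, 0.0]

# scale = 1 recovers the conditional prediction; larger scales push the
# result further along the (conditional - unconditional) direction.
guided_low = cfg_combine(eps_u, eps_c, 1.0)
guided_high = cfg_combine(eps_u, eps_c, 7.5)
```

Because the guided prediction moves linearly with the scale, any metric that rewards stronger prompt alignment can be raised simply by increasing `scale`, which is the bias the abstract describes.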
119. 【2602.22568】Quality-Aware Robust Multi-View Clustering for Heterogeneous Observation Noise
Link: https://arxiv.org/abs/2602.22568
Authors: Peihan Wu,Guanjie Cheng,Yufei Tong,Meng Xi,Shuiguang Deng
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: achieved remarkable progress, Deep multi-view clustering, Deep multi-view, real-world applications, achieved remarkable
Comments:
Abstract:Deep multi-view clustering has achieved remarkable progress but remains vulnerable to complex noise in real-world applications. Existing noise-robust methods predominantly rely on a simplified binary assumption, treating data as either perfectly clean or completely corrupted. This overlooks the prevalent existence of heterogeneous observation noise, where contamination intensity varies continuously across data. To bridge this gap, we propose a novel framework termed Quality-Aware Robust Multi-View Clustering (QARMVC). Specifically, QARMVC employs an information bottleneck mechanism to extract intrinsic semantics for view reconstruction. Leveraging the insight that noise disrupts semantic integrity and impedes reconstruction, we utilize the resulting reconstruction discrepancy to precisely quantify fine-grained contamination intensity and derive instance-level quality scores. These scores are integrated into a hierarchical learning strategy: at the feature level, a quality-weighted contrastive objective is designed to adaptively suppress the propagation of noise; at the fusion level, a high-quality global consensus is constructed via quality-weighted aggregation, which is subsequently utilized to align and rectify local views via mutual information maximization. Extensive experiments on five benchmark datasets demonstrate that QARMVC consistently outperforms state-of-the-art baselines, particularly in scenarios with heterogeneous noise intensities.
120. 【2602.22565】SwiftNDC: Fast Neural Depth Correction for High-Fidelity 3D Reconstruction
Link: https://arxiv.org/abs/2602.22565
Authors: Kang Han,Wei Xiang,Lu Yu,Mathew Wyatt,Gaowen Liu,Ramana Rao Kompella
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: multi-view inconsistencies, optimization-heavy approaches, scale drift, Neural Depth Correction, gained popularity
Comments:
Abstract:Depth-guided 3D reconstruction has gained popularity as a fast alternative to optimization-heavy approaches, yet existing methods still suffer from scale drift, multi-view inconsistencies, and the need for substantial refinement to achieve high-fidelity geometry. Here, we propose SwiftNDC, a fast and general framework built around a Neural Depth Correction field that produces cross-view consistent depth maps. From these refined depths, we generate a dense point cloud through back-projection and robust reprojection-error filtering, obtaining a clean and uniformly distributed geometric initialization for downstream reconstruction. This reliable dense geometry substantially accelerates 3D Gaussian Splatting (3DGS) for mesh reconstruction, enabling high-quality surfaces with significantly fewer optimization iterations. For novel-view synthesis, SwiftNDC can also improve 3DGS rendering quality, highlighting the benefits of strong geometric initialization. We conduct a comprehensive study across five datasets, including two for mesh reconstruction, as well as three for novel-view synthesis. SwiftNDC consistently reduces running time for accurate mesh reconstruction and boosts rendering fidelity for view synthesis, demonstrating the effectiveness of combining neural depth refinement with robust geometric initialization for high-fidelity and efficient 3D reconstruction.
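The back-projection step mentioned in the abstract can be illustrated with a minimal pinhole-camera sketch. This assumes a simple intrinsics model (`fx`, `fy`, `cx`, `cy`) and camera-space output; the paper's Neural Depth Correction field and reprojection-error filtering are not reproduced here, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def back_project(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point3D]:
    """Lift a depth map to camera-space points with a pinhole model.

    Pixel (u, v) with depth d maps to
    ((u - cx) * d / fx, (v - cy) * d / fy, d).
    Non-positive depths are treated as invalid and skipped.
    """
    points: List[Point3D] = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            if d <= 0:
                continue
            points.append(((u - cx) * d / fx, (v - cy) * d / fy, d))
    return points


# Tiny 2x2 depth map with one invalid pixel (toy values).
cloud = back_project([[1.0, 2.0], [0.0, 4.0]], fx=2.0, fy=2.0, cx=0.0, cy=0.0)
```

In a multi-view setting each camera-space point would additionally be transformed by that view's pose before the clouds are merged and filtered.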
121. 【2602.22549】DrivePTS: A Progressive Learning Framework with Textual and Structural Enhancement for Driving Scene Generation
Link: https://arxiv.org/abs/2602.22549
Authors: Zhechao Wang,Yiming Zeng,Lufan Ma,Zeqing Fu,Chen Bai,Ziyao Lin,Cheng Lu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: crucial data augmentation, data augmentation technique, autonomous driving systems, crucial data, data augmentation
Comments:
Abstract:Synthesis of diverse driving scenes serves as a crucial data augmentation technique for validating the robustness and generalizability of autonomous driving systems. Current methods aggregate high-definition (HD) maps and 3D bounding boxes as geometric conditions in diffusion models for conditional scene generation. However, implicit inter-condition dependency causes generation failures when control conditions change independently. Additionally, these methods suffer from insufficient details in both semantic and structural aspects. Specifically, brief and view-invariant captions restrict semantic contexts, resulting in weak background modeling. Meanwhile, the standard denoising loss with uniform spatial weighting neglects foreground structural details, causing visual distortions and blurriness. To address these challenges, we propose DrivePTS, which incorporates three key innovations. Firstly, our framework adopts a progressive learning strategy to mitigate inter-dependency between geometric conditions, reinforced by an explicit mutual information constraint. Secondly, a Vision-Language Model is utilized to generate multi-view hierarchical descriptions across six semantic aspects, providing fine-grained textual guidance. Thirdly, a frequency-guided structure loss is introduced to strengthen the model's sensitivity to high-frequency elements, improving foreground structural fidelity. Extensive experiments demonstrate that our DrivePTS achieves state-of-the-art fidelity and controllability in generating diverse driving scenes. Notably, DrivePTS successfully generates rare scenes where prior methods fail, highlighting its strong generalization ability.
122. 【2602.22545】DisQ-HNet: A Disentangled Quantized Half-UNet for Interpretable Multimodal Image Synthesis Applications to Tau-PET Synthesis from T1 and FLAIR MRI
Link: https://arxiv.org/abs/2602.22545
Authors: Agamdeep S. Chopra,Caitlin Neher,Tianyi Ren,Juampablo E. Heras Rivera,Mehmet Kurt
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Alzheimer disease pathology, positron emission tomography, motivate MRI-based alternatives, limited availability motivate, availability motivate MRI-based
Comments: 14 pages, 8 figures, 8 tables; includes PID guided vector quantized latent factorization and sobel edge conditioned Half-UNet decoder
Abstract:Tau positron emission tomography (tau-PET) provides an in vivo marker of Alzheimer's disease pathology, but cost and limited availability motivate MRI-based alternatives. We introduce DisQ-HNet (DQH), a framework that synthesizes tau-PET from paired T1-weighted and FLAIR MRI while exposing how each modality contributes to the prediction. The method combines (i) a Partial Information Decomposition (PID)-guided, vector-quantized encoder that partitions latent information into redundant, unique, and complementary components, and (ii) a Half-UNet decoder that preserves anatomical detail using pseudo-skip connections conditioned on structural edge cues rather than direct encoder feature reuse. Across multiple baselines (VAE, VQ-VAE, and UNet), DisQ-HNet maintains reconstruction fidelity and better preserves disease-relevant signal for downstream AD tasks, including Braak staging, tau localization, and classification. PID-based Shapley analysis provides modality-specific attribution of synthesized uptake patterns.
123. 【2602.22510】Pix2Key: Controllable Open-Vocabulary Retrieval with Semantic Decomposition and Self-Supervised Visual Dictionary Learning
Link: https://arxiv.org/abs/2602.22510
Authors: Guoyizhe Wei,Yang Jiao,Nan Xi,Zhishen Huang,Jingjing Meng,Rama Chellappa,Yan Gao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Composed Image Retrieval, relevant visual content, Composed Image, Image Retrieval, apply the requested
Comments:
Abstract:Composed Image Retrieval (CIR) uses a reference image plus a natural-language edit to retrieve images that apply the requested change while preserving other relevant visual content. Classic fusion pipelines typically rely on supervised triplets and can lose fine-grained cues, while recent zero-shot approaches often caption the reference image and merge the caption with the edit, which may miss implicit user intent and return repetitive results. We present Pix2Key, which represents both queries and candidates as open-vocabulary visual dictionaries, enabling intent-aware constraint matching and diversity-aware reranking in a unified embedding space. A self-supervised pretraining component, V-Dict-AE, further improves the dictionary representation using only images, strengthening fine-grained attribute understanding without CIR-specific supervision. On the DFMM-Compose benchmark, Pix2Key improves Recall@10 up to 3.2 points, and adding V-Dict-AE yields an additional 2.3-point gain while improving intent consistency and maintaining high list diversity.
124. 【2602.22507】Space Syntax-guided Post-training for Residential Floor Plan Generation
Link: https://arxiv.org/abs/2602.22507
Authors: Zhuoyang Jiang,Dongqing Zhang
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: large-scale data distributions, fit large-scale data, residential floor plans, domestic public spaces, Space Syntax-guided Post-training
Comments:
Abstract:Pre-trained generative models for residential floor plans are typically optimized to fit large-scale data distributions, which can under-emphasize critical architectural priors such as the configurational dominance and connectivity of domestic public spaces (e.g., living rooms and foyers). This paper proposes Space Syntax-guided Post-training (SSPT), a post-training paradigm that explicitly injects space syntax knowledge into floor plan generation via a non-differentiable oracle. The oracle converts RPLAN-style layouts into rectangle-space graphs through greedy maximal-rectangle decomposition and door-mediated adjacency construction, and then computes integration-based measurements to quantify public space dominance and functional hierarchy. To enable consistent evaluation and diagnosis, we further introduce SSPT-Bench (Eval-8), an out-of-distribution benchmark that post-trains models using conditions capped at $\leq 7$ rooms while evaluating on 8-room programs, together with a unified metric suite for dominance, stability, and profile alignment. SSPT is instantiated with two strategies: (i) iterative retraining via space-syntax filtering and diffusion fine-tuning, and (ii) reinforcement learning via PPO with space-syntax rewards. Experiments show that both strategies improve public-space dominance and restore clearer functional hierarchy compared to distribution-fitted baselines, while PPO achieves stronger gains with substantially higher compute efficiency and reduced variance. SSPT provides a scalable pathway for integrating architectural theory into data-driven plan generation and is compatible with other generative backbones given a post-hoc evaluation oracle.
125. 【2602.22469】Beyond Dominant Patches: Spatial Credit Redistribution For Grounded Vision-Language Models
Link: https://arxiv.org/abs/2602.22469
Authors: Niamul Hassan Samin,Md Arifur Rahman,Abdullah Ibne Hanif,Juena Ahmed Noshin,Md Ashikur Rahman
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: frequently hallucinate objects, hallucinate objects absent, percentage points, Spatial Credit Redistribution, Vision-language models
Comments:
Abstract:Vision-language models (VLMs) frequently hallucinate objects absent from the input image. We trace this failure to spatial credit collapse: activation credit concentrating on sparse visual patches in early transformer layers, which suppresses contextual evidence and increases reliance on language priors. We introduce Spatial Credit Redistribution (SCR), a training-free inference-time intervention that redistributes hidden-state activation from high-attention source patches to their context, guided by low-entropy inputs. We evaluate six model families (Chameleon, LLaVA, and Qwen, including both Qwen-VL and Qwen2-VL) at scales of 7B, 13B, and 30B, on POPE and CHAIR benchmarks. SCR reduces hallucination by ~4.7-6.0 percentage points on POPE-Adversarial, cuts CHAIR-s by 3.7-5.2 percentage points (42-51 percent relative), and CHAIR-i by 2.7-4.4 percentage points (44-58 percent relative), and preserves CIDEr within 0.8 percentage points. Gains are largest for low-entropy inputs, consistent with the theoretical framework. SCR incurs only 43-56 ms overhead (small models: +43-46 ms; large models: +54-56 ms), roughly 3-6 times lower than OPERA and VCD and 1.3-1.7 times lower than OVCD (+72 ms), while Pareto-dominating all three on both hallucination rate and CIDEr, making it practical for real-time settings. A controlled ablation confirms that attention-guided source selection is essential: replacing it with uniform random selection reduces hallucination rate gains from ~4.7-6.0 percentage points to only ~2.6-3.4 percentage points, pointing to credit-collapse as the key driver.
126. 【2602.22462】MammoWise: Multi-Model Local RAG Pipeline for Mammography Report Generation
Link: https://arxiv.org/abs/2602.22462
Authors: Raiyan Jahangir,Nafiz Imtiaz Khan,Amritanand Sudheerkumar,Vladimir Filkov
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Keywords: Screening mammography, high volume, documentation heavy, Vision Language Models, recent Vision Language
Comments: arXiv preprint (submitted 25 Feb 2026). Local multi-model pipeline for mammography report generation + classification using prompting, multimodal RAG (ChromaDB), and QLoRA fine-tuning; evaluates MedGemma, LLaVA-Med, Qwen2.5-VL on VinDr-Mammo and DMID; reports BERTScore/ROUGE-L and classification metrics
Abstract:Screening mammography is high volume, time sensitive, and documentation heavy. Radiologists must translate subtle visual findings into consistent BI-RADS assessments, breast density categories, and structured narrative reports. While recent Vision Language Models (VLMs) enable image-to-text reporting, many rely on closed cloud systems or tightly coupled architectures that limit privacy, reproducibility, and adaptability. We present MammoWise, a local multi-model pipeline that transforms open source VLMs into mammogram report generators and multi-task classifiers. MammoWise supports any Ollama-hosted VLM and mammography dataset, and enables zero-shot, few-shot, and Chain-of-Thought prompting, with optional multimodal Retrieval Augmented Generation (RAG) using a vector database for case-specific context. We evaluate MedGemma, LLaVA-Med, and Qwen2.5-VL on VinDr-Mammo and DMID datasets, assessing report quality (BERTScore, ROUGE-L), BI-RADS classification, breast density, and key findings. Report generation is consistently strong and improves with few-shot prompting and RAG. Classification is feasible but sensitive to model and dataset choice. Parameter-efficient fine-tuning (QLoRA) of MedGemma improves reliability, achieving BI-RADS accuracy of 0.7545, density accuracy of 0.8840, and calcification accuracy of 0.9341 while preserving report quality. MammoWise provides a practical and extensible framework for deploying local VLMs for mammography reporting within a unified and reproducible workflow.
127. 【2602.22455】Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge
Link: https://arxiv.org/abs/2602.22455
Authors: Giuseppe Lando,Rosario Forte,Antonino Furnari
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Language Models, Multimodal Large Language, Language Models, question answering, Large Language
Comments:
Abstract:We investigate the feasibility of using Multimodal Large Language Models (MLLMs) for real-time online episodic memory question answering. While cloud offloading is common, it raises privacy and latency concerns for wearable assistants, hence we investigate implementation on the edge. We integrated streaming constraints into our question answering pipeline, which is structured into two asynchronous threads: a Descriptor Thread that continuously converts video into a lightweight textual memory, and a Question Answering (QA) Thread that reasons over the textual memory to answer queries. Experiments on the QAEgo4D-Closed benchmark analyze the performance of MLLMs within strict resource boundaries, showing promising results even when compared to cloud-based solutions. Specifically, an end-to-end configuration running on a consumer-grade 8GB GPU achieves 51.76% accuracy with a Time-To-First-Token (TTFT) of 0.41s. Scaling to a local enterprise-grade server yields 54.40% accuracy with a TTFT of 0.88s. In comparison, a cloud-based solution obtains an accuracy of 56.00%. These competitive results highlight the potential of edge-based solutions for privacy-preserving episodic memory retrieval.
128. 【2602.22426】SimpleOCR: Rendering Visualized Questions to Teach MLLMs to Read
Link: https://arxiv.org/abs/2602.22426
Authors: Yibo Peng,Peng Xia,Ding Zhong,Kaide Zeng,Siwei Han,Yiyang Zhou,Jiaqi Liu,Ruiyi Zhang,Huaxiu Yao
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Multimodal Large Language, Large Language Models, mechanism remains unanswered, Multimodal Large, Large Language
Comments:
Abstract:Despite the rapid advancements in Multimodal Large Language Models (MLLMs), a critical question regarding their visual grounding mechanism remains unanswered: do these models genuinely "read" text embedded in images, or do they merely rely on parametric shortcuts in the text prompt? In this work, we diagnose this issue by introducing the Visualized-Question (VQ) setting, where text queries are rendered directly onto images to structurally mandate visual engagement. Our diagnostic experiments on Qwen2.5-VL reveal a startling capability-utilization gap: despite possessing strong OCR capabilities, models suffer a performance degradation of up to 12.7% in the VQ setting, exposing a deep-seated "modality laziness." To bridge this gap, we propose SimpleOCR, a plug-and-play training strategy that imposes a structural constraint on the learning process. By transforming training samples into the VQ format with randomized styles, SimpleOCR effectively invalidates text-based shortcuts, compelling the model to activate and optimize its visual text extraction pathways. Empirically, SimpleOCR yields robust gains without architectural modifications. On four representative OOD benchmarks, it surpasses the base model by 5.4% and GRPO based on original images by 2.7%, while exhibiting extreme data efficiency, achieving superior performance with 30x fewer samples (8.5K) than recent RL-based methods. Furthermore, its plug-and-play nature allows seamless integration with advanced RL strategies like NoisyRollout to yield complementary improvements. Code is available at this https URL.
129. 【2602.22419】CLIP Is Shortsighted: Paying Attention Beyond the First Sentence
Link: https://arxiv.org/abs/2602.22419
Authors: Marc-Antoine Lavoie,Anas Mahmoud,Aldo Zaimi,Arsene Fansi Tchango,Steven L. Waslander
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: image-text contrastive learning, learn transferable multi-modal, transferable multi-modal features, CLIP models learn, models learn transferable
Comments: 19 pages, 13 figures, to be published in the CVPR 2026 proceedings
Abstract:CLIP models learn transferable multi-modal features via image-text contrastive learning on internet-scale data. They are widely used in zero-shot classification, multi-modal retrieval, text-to-image diffusion, and as image encoders in large vision-language models. However, CLIP's pretraining is dominated by images paired with short captions, biasing the model toward encoding simple descriptions of salient objects and leading to coarse alignment on complex scenes and dense descriptions. While recent work mitigates this by fine-tuning on small-scale long-caption datasets, we identify an important common bias: both human- and LLM-generated long captions typically begin with a one-sentence summary followed by a detailed description. We show that this acts as a shortcut during training, concentrating attention on the opening sentence and early tokens and weakening alignment over the rest of the caption. To resolve this, we introduce DeBias-CLIP, which removes the summary sentence during training and applies sentence sub-sampling and text token padding to distribute supervision across all token positions. DeBias-CLIP achieves state-of-the-art long-text retrieval, improves short-text retrieval, and is less sensitive to sentence order permutations. It is a drop-in replacement for Long-CLIP with no additional trainable parameters.
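The caption preprocessing described in the abstract (dropping the leading summary sentence, then sub-sampling the remaining sentences) might look roughly like the sketch below. The naive sentence splitter, the `keep_ratio` parameter, and the function names are assumptions for illustration; the abstract does not specify the actual sampling scheme, and token padding is omitted.

```python
import random
import re
from typing import List, Optional


def split_sentences(caption: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", caption.strip())
    return [p.strip() for p in parts if p.strip()]


def debias_caption(
    caption: str,
    keep_ratio: float = 0.5,
    rng: Optional[random.Random] = None,
) -> str:
    """Drop the opening summary sentence, then keep a random subset of the
    remaining sentences in their original order."""
    rng = rng or random.Random()
    sentences = split_sentences(caption)
    # Keep single-sentence captions intact rather than emptying them.
    body = sentences[1:] if len(sentences) > 1 else sentences
    if not body:
        return ""
    k = max(1, round(len(body) * keep_ratio))
    kept = sorted(rng.sample(range(len(body)), k))
    return " ".join(body[i] for i in kept)


long_caption = (
    "A street scene. A red car is parked on the left. "
    "Two people cross the road. The sky is overcast."
)
augmented = debias_caption(long_caption, keep_ratio=0.5, rng=random.Random(0))
```

Calling `debias_caption` with a fresh random state on each training step would expose the text encoder to different parts of the description, none of which is the summary sentence.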
130. 【2602.22405】MolFM-Lite: Multi-Modal Molecular Property Prediction with Conformer Ensemble Attention and Cross-Modal Fusion
Link: https://arxiv.org/abs/2602.22405
Authors: Syed Omer Shah,Mohammed Maqsood Ahmed,Danish Mohiuddin Mohammed,Shahnawaz Alam,Mohd Vahaj ur Rahman
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Feature-wise Linear Modulation, property prediction rely, single molecular representation, treat molecular geometry, machine learning models
Comments:
Abstract:Most machine learning models for molecular property prediction rely on a single molecular representation (either a sequence, a graph, or a 3D structure) and treat molecular geometry as static. We present MolFM-Lite, a multi-modal model that jointly encodes SELFIES sequences (1D), molecular graphs (2D), and conformer ensembles (3D) through cross-attention fusion, while conditioning predictions on experimental context via Feature-wise Linear Modulation (FiLM). Our main methodological contributions are: (1) a conformer ensemble attention mechanism that combines learnable attention with Boltzmann-weighted priors over multiple RDKit-generated conformers, capturing the thermodynamic distribution of molecular shapes; and (2) a cross-modal fusion layer where each modality can attend to others, enabling complementary information sharing. We evaluate on four MoleculeNet scaffold-split benchmarks using our model's own splits, and report all baselines re-evaluated under the same protocol. Comprehensive ablation studies across all four datasets confirm that each architectural component contributes independently, with tri-modal fusion providing 7-11% AUC improvement over single-modality baselines and conformer ensembles adding approximately 2% over single-conformer variants. Pre-training on ZINC250K (~250K molecules) using cross-modal contrastive and masked-atom objectives enables effective weight initialization at modest compute cost. We release all code, trained models, and data splits to support reproducibility.
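As a rough illustration of the Boltzmann-weighted prior over conformers mentioned in the abstract, the sketch below computes normalized Boltzmann weights from conformer energies. It assumes energies in kcal/mol and a temperature of 298.15 K; how the paper combines these priors with learnable attention is not stated in the abstract and is not shown. The function name is hypothetical.

```python
import math
from typing import List, Sequence

# Gas constant in kcal/(mol*K); assumes energies are given in kcal/mol.
K_B_KCAL_PER_MOL_K = 0.0019872041


def boltzmann_weights(
    energies: Sequence[float],
    temperature_k: float = 298.15,
) -> List[float]:
    """Return w_i = exp(-(E_i - E_min) / kT) / sum_j exp(-(E_j - E_min) / kT)."""
    if not energies:
        raise ValueError("at least one conformer energy is required")
    kt = K_B_KCAL_PER_MOL_K * temperature_k
    e_min = min(energies)  # shift by the minimum for numerical stability
    unnormalized = [math.exp(-(e - e_min) / kt) for e in energies]
    total = sum(unnormalized)
    return [w / total for w in unnormalized]


# Three hypothetical conformers with relative energies in kcal/mol.
weights = boltzmann_weights([0.0, 1.0, 2.0])
```

Lower-energy conformers receive larger weights, so a weighted average of per-conformer embeddings is dominated by the thermodynamically most populated shapes.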
131. 【2602.22394】Vision Transformers Need More Than Registers
Link: https://arxiv.org/abs/2602.22394
Authors: Cheng Shi,Yizhou Yu,Sibei Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Vision Transformers, provide general-purpose representations, diverse downstream tasks, downstream tasks, large-scale data
Comments: Accepted by CVPR 2026
Abstract:Vision Transformers (ViTs), when pre-trained on large-scale data, provide general-purpose representations for diverse downstream tasks. However, artifacts in ViTs are widely observed across different supervision paradigms and downstream tasks. Through systematic analysis of artifacts in ViTs, we find that their fundamental mechanisms have yet to be sufficiently elucidated. In this paper, through systematic analysis, we conclude that these artifacts originate from a lazy aggregation behavior: ViT uses semantically irrelevant background patches as shortcuts to represent global semantics, driven by global attention and coarse-grained semantic supervision. Our solution selectively integrates patch features into the CLS token, reducing the influence of background-dominated shortcuts and consistently improving performance across 12 benchmarks under label-, text-, and self-supervision. We hope this work offers a new perspective on ViT behavior.
132. 【2602.22381】Enhancing Renal Tumor Malignancy Prediction: Deep Learning with Automatic 3D CT Organ Focused Attention
Link: https://arxiv.org/abs/2602.22381
Authors: Zhengkang Fan,Chengkun Sun,Russell Terry,Jie Xu,Longin Jan Latecki
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: optimizing treatment strategies, informing clinical decisions, Accurate prediction, treatment strategies, crucial for informing
Comments: 5 pages, 2 figures, Accepted at IEEE ISBI 2026
Abstract:Accurate prediction of malignancy in renal tumors is crucial for informing clinical decisions and optimizing treatment strategies. However, existing imaging modalities lack the necessary accuracy to reliably predict malignancy before surgical intervention. While deep learning has shown promise in malignancy prediction using 3D CT images, traditional approaches often rely on manual segmentation to isolate the tumor region and reduce noise, which enhances predictive performance. Manual segmentation, however, is labor-intensive, costly, and dependent on expert knowledge. In this study, a deep learning framework was developed utilizing an Organ Focused Attention (OFA) loss function to modify the attention of image patches so that organ patches attend only to other organ patches. Hence, no segmentation of 3D renal CT images is required at deployment time for malignancy prediction. The proposed framework achieved an AUC of 0.685 and an F1-score of 0.872 on a private dataset from the UF Integrated Data Repository (IDR), and an AUC of 0.760 and an F1-score of 0.852 on the publicly available KiTS21 dataset. These results surpass the performance of conventional models that rely on segmentation-based cropping for noise reduction, demonstrating the framework's ability to enhance predictive accuracy without explicit segmentation input. The findings suggest that this approach offers a more efficient and reliable method for malignancy prediction, thereby enhancing clinical decision-making in renal cancer diagnosis.
133. 【2602.22376】AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction
Link: https://arxiv.org/abs/2602.22376
Authors: Hanyang Liu,Rongjun Qin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Recent advances, improved dynamic modeling, significantly improved dynamic, significantly improved, Recent
Comments: Accepted to CVPR 2026
Abstract:Recent advances in 4D scene reconstruction have significantly improved dynamic modeling across various domains. However, existing approaches remain limited under aerial conditions with single-view capture, wide spatial range, and dynamic objects of limited spatial footprint and large motion disparity. These challenges cause severe depth ambiguity and unstable motion estimation, making monocular aerial reconstruction inherently ill-posed. To this end, we present AeroDGS, a physics-guided 4D Gaussian splatting framework for monocular UAV videos. AeroDGS introduces a Monocular Geometry Lifting module that reconstructs reliable static and dynamic geometry from a single aerial sequence, providing a robust basis for dynamic estimation. To further resolve monocular ambiguity, we propose a Physics-Guided Optimization module that incorporates differentiable ground-support, upright-stability, and trajectory-smoothness priors, transforming ambiguous image cues into physically consistent motion. The framework jointly refines static backgrounds and dynamic entities with stable geometry and coherent temporal evolution. We additionally build a real-world UAV dataset that spans various altitudes and motion conditions to evaluate dynamic aerial reconstruction. Experiments on synthetic and real UAV scenes demonstrate that AeroDGS outperforms state-of-the-art methods, achieving superior reconstruction fidelity in dynamic aerial environments.
134. 【2602.22361】Optimizing Neural Network Architecture for Medical Image Segmentation Using Monte Carlo Tree Search
Link: https://arxiv.org/abs/2602.22361
Authors: Liping Meng, Fan Nie, Yunyun Zhang, Chao Han
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Monte Carlo Tree, combines Monte Carlo, Carlo Tree Search, Monte Carlo, Carlo Tree
Comments:
Abstract:This paper proposes a novel medical image segmentation framework, MNAS-Unet, which combines Monte Carlo Tree Search (MCTS) and Neural Architecture Search (NAS). MNAS-Unet dynamically explores promising network architectures through MCTS, significantly enhancing the efficiency and accuracy of architecture search. It also optimizes the DownSC and UpSC unit structures, enabling fast and precise model adjustments. Experimental results demonstrate that MNAS-Unet outperforms NAS-Unet and other state-of-the-art models in segmentation accuracy on several medical image datasets, including PROMISE12, Ultrasound Nerve, and CHAOS. Furthermore, compared with NAS-Unet, MNAS-Unet reduces the architecture search budget by 54% (early stopping at 139 epochs versus 300 epochs under the same search setting), while achieving a lightweight model with only 0.6M parameters and lower GPU memory consumption, which further improves its practical applicability. These results suggest that MNAS-Unet can improve search efficiency while maintaining competitive segmentation accuracy under practical resource constraints.
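The 54% search-budget figure above follows directly from the two epoch counts quoted in the abstract. A minimal check, where the function name `budget_reduction` is purely illustrative and not taken from the paper:

```python
def budget_reduction(baseline_epochs: int, used_epochs: int) -> float:
    """Return the fraction of the baseline search budget saved by stopping early."""
    if baseline_epochs <= 0:
        raise ValueError("baseline_epochs must be positive")
    return (baseline_epochs - used_epochs) / baseline_epochs


# Epoch counts quoted in the abstract: 300 for the baseline, 139 with early stopping.
reduction = budget_reduction(300, 139)
print(f"{reduction:.1%}")  # 53.7%, which rounds to the reported 54%
```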
135. 【2602.22347】Enabling clinical use of foundation models in histopathology
Link: https://arxiv.org/abs/2602.22347
Authors: Audun L. Henriksen, Ole-Johan Skrede, Lisa van der Schee, Enric Domingo, Sepp De Raedt, Ilyá Kostolomov, Jennifer Hay, Karolina Cyll, Wanja Kildal, Joakim Kalsnes, Robert W. Williams, Manohar Pradhan, John Arne Nesheim, Hanne A. Askautrud, Maria X. Isaksen, Karmele Saez de Gordoa, Miriam Cuatrecasas, Joanne Edwards, TransSCOT group, Arild Nesbakken, Neil A. Shepherd, Ian Tomlinson, Daniel-Christoph Wagner, Rachel S. Kerr, Tarjei Sveinsgjerd Hveem, Knut Liestøl, Yoshiaki Nakamura, Marco Novelli, Masaaki Miyo, Sebastian Foersch, David N. Church, Miangela M. Lacle, David J. Kerr, Andreas Kleppe
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: deep learning systems, generalisable deep learning, Foundation models, learning systems, models
Comments:
Abstract:Foundation models in histopathology are expected to facilitate the development of high-performing and generalisable deep learning systems. However, current models capture not only biologically relevant features, but also pre-analytic and scanner-specific variation that biases the predictions of task-specific models trained from the foundation model features. Here we show that introducing novel robustness losses during training of downstream task-specific models reduces sensitivity to technical variability. A purpose-designed comprehensive experimentation setup with 27,042 WSIs from 6,155 patients is used to train thousands of models from the features of eight popular foundation models for computational pathology. In addition to a substantial improvement in robustness, we observe that prediction accuracy improves by focusing on biologically relevant features. Our approach successfully mitigates robustness issues of foundation models for computational pathology without retraining the foundation models themselves, enabling development of robust computational pathology models applicable to real-world data in routine clinical practice.
136. 【2602.22265】Entropy-Controlled Flow Matching
Link: https://arxiv.org/abs/2602.22265
Authors: Chika Maduabuchi
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Modern vision generators, Modern vision, vision generators transport, time-indexed measures, implemented as deterministic
Comments:
Abstract:Modern vision generators transport a base distribution to data through time-indexed measures, implemented as deterministic flows (ODEs) or stochastic diffusions (SDEs). Despite strong empirical performance, standard flow-matching objectives do not directly control the information geometry of the trajectory, allowing low-entropy bottlenecks that can transiently deplete semantic modes. We propose Entropy-Controlled Flow Matching (ECFM): a constrained variational principle over continuity-equation paths enforcing a global entropy-rate budget d/dt H(mu_t) = -lambda. ECFM is a convex optimization in Wasserstein space with a KKT/Pontryagin system, and admits a stochastic-control representation equivalent to a Schrödinger bridge with an explicit entropy multiplier. In the pure transport regime, ECFM recovers entropic OT geodesics and Gamma-converges to classical OT as lambda -> 0. We further obtain certificate-style mode-coverage and density-floor guarantees with Lipschitz stability, and construct near-optimal collapse counterexamples for unconstrained flow matching.
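As a small worked consequence of the entropy-rate budget quoted in this abstract, and assuming it holds with equality along the whole path exactly as written there, integrating over time gives a linear entropy schedule. The symbols H, mu_t and lambda follow the abstract; the time horizon [0, 1] is an assumption made only for illustration.

```latex
\frac{d}{dt} H(\mu_t) = -\lambda
\quad\Longrightarrow\quad
H(\mu_t) = H(\mu_0) - \lambda t, \qquad t \in [0, 1],
\qquad\text{so}\qquad
H(\mu_1) - H(\mu_0) = -\lambda .
```

Under this reading, lambda is the total entropy the path is allowed to shed between the base distribution and the data distribution.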
137. 【2602.22214】Adaptive Prefiltering for High-Dimensional Similarity Search: A Frequency-Aware Approach
Link: https://arxiv.org/abs/2602.22214
Authors: Teodor-Ioan Calin
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: modern retrieval systems, High-dimensional similarity search, underpins modern retrieval, High-dimensional similarity, similarity search underpins
Comments:
Abstract:High-dimensional similarity search underpins modern retrieval systems, yet uniform search strategies fail to exploit the heterogeneous nature of real-world query distributions. We present an adaptive prefiltering framework that leverages query frequency patterns and cluster coherence metrics to dynamically allocate computational budgets. Our approach partitions the query space into frequency tiers following Zipfian distributions and assigns differentiated search policies based on historical access patterns and local density characteristics. Experiments on ImageNet-1k using CLIP embeddings demonstrate that frequency-aware budget allocation achieves equivalent recall with 20.4% fewer distance computations compared to static nprobe selection, while maintaining sub-millisecond latency on GPU-accelerated FAISS indices. The framework introduces minimal overhead through lightweight frequency tracking and provides graceful degradation for unseen queries through coherence-based fallback policies.
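The idea of assigning different search budgets to different frequency tiers can be sketched in a few lines. This is not the paper's method: the class name `FrequencyTierPolicy`, the notion of a query "key" (for example, the ID of a query's nearest coarse centroid), the tier sizes, and the budget values are all hypothetical. The abstract does not say whether frequent queries receive smaller or larger budgets, so the mapping is left as a caller-supplied parameter, and the fallback for unseen queries is a fixed constant rather than the coherence-based policy the abstract mentions.

```python
from collections import Counter


class FrequencyTierPolicy:
    """Toy frequency tracker that maps a query key to a search budget (e.g. nprobe)."""

    def __init__(self, tier_sizes, tier_budgets, fallback_budget):
        # tier_sizes[i]: how many of the top-ranked keys belong to tier i (tier 0 = most frequent).
        # tier_budgets[i]: budget returned for keys in tier i (hypothetical values, caller-supplied).
        if len(tier_sizes) != len(tier_budgets):
            raise ValueError("tier_sizes and tier_budgets must have the same length")
        self.tier_sizes = list(tier_sizes)
        self.tier_budgets = list(tier_budgets)
        self.fallback_budget = fallback_budget
        self.counts = Counter()

    def record(self, key):
        """Register one observed query for this key."""
        self.counts[key] += 1

    def budget_for(self, key):
        """Return the budget for a key based on its current frequency rank."""
        if key not in self.counts:
            return self.fallback_budget  # unseen query: fixed fallback (assumption)
        ranked = [k for k, _ in self.counts.most_common()]
        rank = ranked.index(key)
        upper = 0
        for size, budget in zip(self.tier_sizes, self.tier_budgets):
            upper += size
            if rank < upper:
                return budget
        return self.fallback_budget  # seen, but ranked below every tier


policy = FrequencyTierPolicy(tier_sizes=[1, 1], tier_budgets=[4, 16], fallback_budget=32)
for key in ["cat"] * 5 + ["dog"] * 2 + ["eel"]:
    policy.record(key)
print(policy.budget_for("cat"), policy.budget_for("dog"), policy.budget_for("new"))
```

Re-ranking on every lookup keeps the sketch short; a real system would presumably maintain tiers incrementally to keep the tracking overhead low.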
138. 【2602.22544】HARU-Net: Hybrid Attention Residual U-Net for Edge-Preserving Denoising in Cone-Beam Computed Tomography
Link: https://arxiv.org/abs/2602.22544
Authors: Khuram Naveed, Ruben Pauwels
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)
Keywords: Cone-beam computed tomography, acquisition introduces strong, degrades soft-tissue visibility, spatially varying noise, Cone-beam computed
Comments:
Abstract:Cone-beam computed tomography (CBCT) is widely used in dental and maxillofacial imaging, but low-dose acquisition introduces strong, spatially varying noise that degrades soft-tissue visibility and obscures fine anatomical structures. Classical denoising methods struggle to suppress noise in CBCT while preserving edges. Although deep learning-based approaches offer high-fidelity restoration, their use in CBCT denoising is limited by the scarcity of high-resolution CBCT data for supervised training. To address this research gap, we propose a novel Hybrid Attention Residual U-Net (HARU-Net) for high-quality denoising of CBCT data, trained on a cadaver dataset of human hemimandibles acquired using a high-resolution protocol of the 3D Accuitomo 170 (J. Morita, Kyoto, Japan) CBCT system. The novel contribution of this approach is the integration of three complementary architectural components: (i) a hybrid attention transformer block (HAB) embedded within each skip connection to selectively emphasize salient anatomical features, (ii) a residual hybrid attention transformer group (RHAG) at the bottleneck to strengthen global contextual modeling and long-range feature interactions, and (iii) residual learning convolutional blocks to facilitate deeper, more stable feature extraction throughout the network. HARU-Net consistently outperforms state-of-the-art (SOTA) methods including SwinIR and Uformer, achieving the highest PSNR (37.52 dB), highest SSIM (0.9557), and lowest GMSD (0.1084). This effective and clinically reliable CBCT denoising is achieved at a computational cost significantly lower than that of the SOTA methods, offering a practical advancement toward improving diagnostic quality in low-dose CBCT imaging.
139. 【2602.22236】CrossLLM-Mamba: Multimodal State Space Fusion of LLMs for RNA Interaction Prediction
Link: https://arxiv.org/abs/2602.22236
Authors: Rabeya Tus Sadia, Qiang Ye, Qiang Cheng
Categories: Genomics (q-bio.GN); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: advancing drug discovery, understanding cellular regulation, Large Language Models, Biological Large Language, drug discovery
Comments:
Abstract:Accurate prediction of RNA-associated interactions is essential for understanding cellular regulation and advancing drug discovery. While Biological Large Language Models (BioLLMs) such as ESM-2 and RiNALMo provide powerful sequence representations, existing methods rely on static fusion strategies that fail to capture the dynamic, context-dependent nature of molecular binding. We introduce CrossLLM-Mamba, a novel framework that reformulates interaction prediction as a state-space alignment problem. By leveraging bidirectional Mamba encoders, our approach enables deep "crosstalk" between modality-specific embeddings through hidden state propagation, modeling interactions as dynamic sequence transitions rather than static feature overlaps. The framework maintains linear computational complexity, making it scalable to high-dimensional BioLLM embeddings. We further incorporate Gaussian noise injection and Focal Loss to enhance robustness against hard-negative samples. Comprehensive experiments across three interaction categories, RNA-protein, RNA-small molecule, and RNA-RNA, demonstrate that CrossLLM-Mamba achieves state-of-the-art performance. On the RPI1460 benchmark, our model attains an MCC of 0.892, surpassing the previous best by 5.2%. For binding affinity prediction, we achieve Pearson correlations exceeding 0.95 on riboswitch and repeat RNA subtypes. These results establish state-space modeling as a powerful paradigm for multi-modal biological interaction prediction.
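For readers unfamiliar with the Focal Loss mentioned in this abstract, here is a minimal sketch of the standard binary form, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The function name and the values gamma = 2 and alpha = 0.25 are commonly used defaults chosen here for illustration; the abstract does not report the paper's actual settings or implementation.

```python
import math


def binary_focal_loss(prob, label, gamma=2.0, alpha=0.25, eps=1e-12):
    """Standard binary focal loss for one example.

    prob  : predicted probability of the positive class, in [0, 1]
    label : ground-truth class, 1 (positive) or 0 (negative)
    gamma : focusing parameter; larger values down-weight easy examples more
    alpha : weight applied to the positive class (1 - alpha for the negative class)
    """
    p_t = prob if label == 1 else 1.0 - prob        # probability assigned to the true class
    alpha_t = alpha if label == 1 else 1.0 - alpha  # class-balancing weight
    p_t = min(max(p_t, eps), 1.0)                   # clamp to avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(0.9, 1)  # confident and correct: strongly down-weighted
hard = binary_focal_loss(0.1, 1)  # confident and wrong: dominates the loss
print(easy < hard)
```

With gamma = 0 the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy, which is why larger gamma shifts training emphasis toward hard examples.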



