本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。
统计
今日共更新517篇论文,其中:
- 自然语言处理77篇
- 信息检索15篇
- 计算机视觉123篇
自然语言处理
1. 【2602.22207】Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets
链接:https://arxiv.org/abs/2602.22207
作者:Hanna Yukhymenko,Anton Alexandrov,Martin Vechev
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Large Language Model, multilingual Large Language, Large Language, multilingual Large, LLM
备注:
点击查看摘要
Abstract:The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks. Existing resources often suffer from semantic drift and context loss, which can lead to misleading performance metrics. In this work, we present a fully automated framework designed to address these challenges by enabling scalable, high-quality translation of datasets and benchmarks. We demonstrate that adapting test-time compute scaling strategies, specifically Universal Self-Improvement (USI) and our proposed multi-round ranking method, T-RANK, allows for significantly higher quality outputs compared to traditional pipelines. Our framework ensures that benchmarks preserve their original task structure and linguistic nuances during localization. We apply this approach to translate popular benchmarks and datasets into eight Eastern and Southern European languages (Ukrainian, Bulgarian, Slovak, Romanian, Lithuanian, Estonian, Turkish, Greek). Evaluations using both reference-based metrics and LLM-as-a-judge show that our translations surpass existing resources, resulting in more accurate downstream model assessment. We release both the framework and the improved benchmarks to facilitate robust and reproducible multilingual AI development.
2. 【2602.22200】SumTablets: A Transliteration Dataset of Sumerian Tablets
链接:https://arxiv.org/abs/2602.22200
作者:Cole Simmons,Richard Diehl Martinez,Dan Jurafsky
类目:Computation and Language (cs.CL)
关键词:Latin script, Sumerian transliteration, Natural Language Processing, Sumerian, Sumerian cuneiform tablets
备注: 11 pages with 3 figures
点击查看摘要
Abstract:Sumerian transliteration is a conventional system for representing a scholar's interpretation of a tablet in the Latin script. Thanks to visionary digital Assyriology projects such as ETCSL, CDLI, and Oracc, a large number of Sumerian transliterations have been published online, and these data are well-structured for a variety of search and analysis tasks. However, the absence of a comprehensive, accessible dataset pairing transliterations with a digital representation of the tablet's cuneiform glyphs has prevented the application of modern Natural Language Processing (NLP) methods to the task of Sumerian transliteration. To address this gap, we present SumTablets, a dataset pairing Unicode representations of 91,606 Sumerian cuneiform tablets (totaling 6,970,407 glyphs) with the associated transliterations published by Oracc. We construct SumTablets by first preprocessing and standardizing the Oracc transliterations before mapping each reading back to the Unicode representation of the source glyph. Further, we retain parallel structural information (e.g., surfaces, newlines, broken segments) through the use of special tokens. We release SumTablets as a Hugging Face Dataset (CC BY 4.0) and open source data preparation code via GitHub. Additionally, we leverage SumTablets to implement and evaluate two transliteration baselines: (1) weighted sampling from a glyph's possible readings, and (2) fine-tuning an autoregressive language model. Our fine-tuned language model achieves an average transliteration character-level F-score (chrF) of 97.55, demonstrating the immediate potential of transformer-based transliteration models in allowing experts to rapidly verify generated transliterations rather than manually transliterating tablets one-by-one.
Comments:
11 pages with 3 figures
Subjects:
Computation and Language (cs.CL)
Cite as:
arXiv:2602.22200 [cs.CL]
(or
arXiv:2602.22200v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2602.22200
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Journalreference:
Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024), pages 192-202, Hybrid in Bangkok, Thailand and online. Association for Computational Linguistics
Related DOI:
https://doi.org/10.18653/v1/2024.ml4al-1.20
Focus to learn more
DOI(s) linking to related resources</p>
3. 【2602.22193】Improving Parametric Knowledge Access in Reasoning Language Models
链接:https://arxiv.org/abs/2602.22193
作者:Melody Ma,John Hewitt
类目:Computation and Language (cs.CL)
关键词:language model parameters, Canberra is Australia, world knowledge stored, knowledge, world knowledge
备注:
点击查看摘要
Abstract:We study reasoning for accessing world knowledge stored in a language model's parameters. For example, recalling that Canberra is Australia's capital may benefit from thinking through major cities and the concept of purpose-built capitals. While reasoning language models are trained via reinforcement learning to produce reasoning traces on tasks such as mathematics, they may not reason well for accessing their own world knowledge. We first find that models do not generate their best world knowledge reasoning by default: adding a simple "think step-by-step" cue demonstrates statistically significant improvement in knowledge recall but not math. Motivated by this, we propose training models to reason over their parametric knowledge using world-knowledge question answering as a verifiable reward. After reinforcement learning on TriviaQA (+9.9%), performance also improves on Natural Questions, HotpotQA, SimpleQA, and StrategyQA by 4.2%, 2.1%, 0.6%, and 3.0%, respectively. Reasoning models are under-optimized for parametric knowledge access, but can be easily trained to reason better.
4. 【2602.22190】GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
链接:https://arxiv.org/abs/2602.22190
作者:Rui Yang,Qianhui Wu,Zhaoyang Wang,Hanyang Chen,Ke Yang,Hao Cheng,Huaxiu Yao,Baoling Peng,Huan Zhang,Jianfeng Gao,Tong Zhang
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Open-source native GUI, native GUI agents, Open-source native, long-horizon navigation tasks, GUI agents
备注: 57 pages, 17 figures
点击查看摘要
Abstract:Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks. This gap stems from two limitations: a shortage of high-quality, action-aligned reasoning data, and the direct adoption of generic post-training pipelines that overlook the unique challenges of GUI agents. We identify two fundamental issues in these pipelines: (i) standard SFT with CoT reasoning often hurts grounding, and (ii) step-wise RLVR-tyle training faces partial verifiability, where multiple actions can be correct but only a single demonstrated action is used for verification. This makes offline step-wise metrics weak predictors of online task success. In this work, we present GUI-Libra, a tailored training recipe that addresses these challenges. First, to mitigate the scarcity of action-aligned reasoning data, we introduce a data construction and filtering pipeline and release a curated 81K GUI reasoning dataset. Second, to reconcile reasoning with grounding, we propose action-aware SFT that mixes reasoning-then-action and direct-action data and reweights tokens to emphasize action and grounding. Third, to stabilize RL under partial verifiability, we identify the overlooked importance of KL regularization in RLVR and show that a KL trust region is critical for improving offline-to-online predictability; we further introduce success-adaptive scaling to downweight unreliable negative gradients. Across diverse web and mobile benchmarks, GUI-Libra consistently improves both step-wise accuracy and end-to-end task completion. Our results suggest that carefully designed post-training and data curation can unlock significantly stronger task-solving capabilities without costly online data collection. We release our dataset, code, and models to facilitate further research on data-efficient post-training for reasoning-capable GUI agents.
5. 【2602.22182】LiCQA : A Lightweight Complex Question Answering System
链接:https://arxiv.org/abs/2602.22182
作者:Sourav Saha,Dwaipayan Roy,Mandar Mitra
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:implementing Question Answering, Question Answering, twenty years, significant progress, made in designing
备注:
点击查看摘要
Abstract:Over the last twenty years, significant progress has been made in designing and implementing Question Answering (QA) systems. However, addressing complex questions, the answers to which are spread across multiple documents, remains a challenging problem. Recent QA systems that are designed to handle complex questions work either on the basis of knowledge graphs, or utilise contem- porary neural models that are expensive to train, in terms of both computational resources and the volume of training data required. In this paper, we present LiCQA, an unsupervised question answer- ing model that works primarily on the basis of corpus evidence. We empirically compare the effectiveness and efficiency of LiCQA with two recently presented QA systems, which are based on different underlying principles. The results of our experiments show that LiCQA significantly outperforms these two state-of-the-art systems on benchmark data with noteworthy reduction in latency.
6. 【2602.22175】DySCO: Dynamic Attention-Scaling Decoding for Long-Context LMs
链接:https://arxiv.org/abs/2602.22175
作者:Xi Ye,Wuwei Zhang,Fangcong Yin,Howard Yen,Danqi Chen
类目:Computation and Language (cs.CL)
关键词:crucial capability, capability for language, Understanding, Understanding and reasoning, long context windows
备注:
点击查看摘要
Abstract:Understanding and reasoning over long contexts is a crucial capability for language models (LMs). Although recent models support increasingly long context windows, their accuracy often deteriorates as input length grows. In practice, models often struggle to keep attention aligned with the most relevant context throughout decoding. In this work, we propose DySCO, a novel decoding algorithm for improving long-context reasoning. DySCO leverages retrieval heads--a subset of attention heads specialized for long-context retrieval--to identify task-relevant tokens at each decoding step and explicitly up-weight them. By doing so, DySCO dynamically adjusts attention during generation to better utilize relevant context. The method is training-free and can be applied directly to any off-the-shelf LMs. Across multiple instruction-tuned and reasoning models, DySCO consistently improves performance on challenging long-context reasoning benchmarks, yielding relative gains of up to 25% on MRCR and LongBenchV2 at 128K context length with modest additional compute. Further analysis highlights the importance of both dynamic attention rescaling and retrieval-head-guided selection for the effectiveness of the method, while providing interpretability insights into decoding-time attention behavior. Our code is available at this https URL.
7. 【2602.22157】Dynamic Personality Adaptation in Large Language Models via State Machines
链接:https://arxiv.org/abs/2602.22157
作者:Leon Pielage,Ole Hätscher,Mitja Back,Bernhard Marschall,Benjamin Risse
类目:Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
关键词:Large Language Models, Large Language, inability of Large, Language Models, evolving dialogue dynamics
备注: 22 pages, 5 figures, submitted to ICPR 2026
点击查看摘要
Abstract:The inability of Large Language Models (LLMs) to modulate their personality expression in response to evolving dialogue dynamics hinders their performance in complex, interactive contexts. We propose a model-agnostic framework for dynamic personality simulation that employs state machines to represent latent personality states, where transition probabilities are dynamically adapted to the conversational context. Part of our architecture is a modular pipeline for continuous personality scoring that evaluates dialogues along latent axes while remaining agnostic to the specific personality models, their dimensions, transition mechanisms, or LLMs used. These scores function as dynamic state variables that systematically reconfigure the system prompt, steering behavioral alignment throughout the this http URL evaluate this framework by operationalizing the Interpersonal Circumplex (IPC) in a medical education setting. Results demonstrate that the system successfully adapts its personality state to user inputs, but also influences user behavior, thereby facilitating de-escalation training. Notably, the scoring pipeline maintains comparable precision even when utilizing lightweight, fine-tuned classifiers instead of large-scale LLMs. This work demonstrates the feasibility of modular, personality-adaptive architectures for education, customer support, and broader human-computer interaction.
8. 【2602.22145】When AI Writes, Whose Voice Remains? Quantifying Cultural Marker Erasure Across World English Varieties in Large Language Models
链接:https://arxiv.org/abs/2602.22145
作者:Satyam Kumar Navneet,Joydeep Chandra,Yong Zhang
类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, professionalize workplace communication, Language Models, professionalize workplace
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are increasingly used to ``professionalize'' workplace communication, often at the cost of linguistic identity. We introduce "Cultural Ghosting", the systematic erasure of linguistic markers unique to non-native English varieties during text processing. Through analysis of 22,350 LLM outputs generated from 1,490 culturally marked texts (Indian, Singaporean, Nigerian English) processed by five models under three prompt conditions, we quantify this phenomenon using two novel metrics: Identity Erasure Rate (IER) Semantic Preservation Score (SPS). Across all prompts, we find an overall IER of 10.26%, with model-level variation from 3.5% to 20.5% (5.9x range). Crucially, we identify a Semantic Preservation Paradox: models maintain high semantic similarity (mean SPS = 0.748) while systematically erasing cultural markers. Pragmatic markers (politeness conventions) are 1.9x more vulnerable than lexical markers (71.5% vs. 37.1% erasure). Our experiments demonstrate that explicit cultural-preservation prompts reduce erasure by 29% without sacrificing semantic quality.
9. 【2602.22144】NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors
链接:https://arxiv.org/abs/2602.22144
作者:Lingfeng Ren,Weihao Yu,Runpeng Yu,Xinchao Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Vision-Language Models, Vision-Language Models, issue in Large, Large Vision-Language, outputs include objects
备注: Code: [this https URL](https://github.com/lingfengren/NoLan)
点击查看摘要
Abstract:Object hallucination is a critical issue in Large Vision-Language Models (LVLMs), where outputs include objects that do not appear in the input image. A natural question arises from this phenomenon: Which component of the LVLM pipeline primarily contributes to object hallucinations? The vision encoder to perceive visual information, or the language decoder to generate text responses? In this work, we strive to answer this question through designing a systematic experiment to analyze the roles of the vision encoder and the language decoder in hallucination generation. Our observations reveal that object hallucinations are predominantly associated with the strong priors from the language decoder. Based on this finding, we propose a simple and training-free framework, No-Language-Hallucination Decoding, NoLan, which refines the output distribution by dynamically suppressing language priors, modulated based on the output distribution difference between multimodal and text-only inputs. Experimental results demonstrate that NoLan effectively reduces object hallucinations across various LVLMs on different tasks. For instance, NoLan achieves substantial improvements on POPE, enhancing the accuracy of LLaVA-1.5 7B and Qwen-VL 7B by up to 6.45 and 7.21, respectively. The code is publicly available at: this https URL.
10. 【2602.22125】IndicIFEval: A Benchmark for Verifiable Instruction-Following Evaluation in 14 Indic Languages
链接:https://arxiv.org/abs/2602.22125
作者:Thanmay Jayakumar,Mohammed Safi Ur Rahman Khan,Raj Dabre,Ratish Puduppully,Anoop Kunchukuttan
类目:Computation and Language (cs.CL)
关键词:remain predominantly English-centric, predominantly English-centric, benchmarks remain predominantly, Indic language speakers, Instruction-following benchmarks remain
备注: 8 pages + Appendix
点击查看摘要
Abstract:Instruction-following benchmarks remain predominantly English-centric, leaving a critical evaluation gap for the hundreds of millions of Indic language speakers. We introduce IndicIFEval, a benchmark evaluating constrained generation of LLMs across 14 Indic languages using automatically verifiable, rule-based instructions. It comprises around 800 human-verified examples per language spread across two complementary subsets: IndicIFEval-Ground, translated prompts from IFEval (Zhou et al., 2023) carefully localized for Indic contexts, and IndicIFEval-Ground, synthetically generated instructions grounded in native Indic content. We conduct a comprehensive evaluation of major open-weight and proprietary models spanning both reasoning and non-reasoning models. While models maintain strong adherence to formatting constraints, they struggle significantly with lexical and cross-lingual tasks -- and despite progress in high-resource languages, instruction-following across the broader Indic family lags significantly behind English. We release IndicIFEval and its evaluation scripts to support progress on multilingual constrained generation (this http URL).
11. 【2602.22124】SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
链接:https://arxiv.org/abs/2602.22124
作者:Patrick Tser Jern Kon,Archana Pradeep,Ang Chen,Alexander P. Ellis,Warren Hunt,Zijian Wang,John Yang,Samuel Thompson
类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Small language models, offer compelling advantages, low resolution rates, Small language, long-horizon software engineering
备注:
点击查看摘要
Abstract:Small language models (SLMs) offer compelling advantages in cost, latency, and adaptability, but have so far lagged behind larger models on long-horizon software engineering tasks such as SWE-bench, where they suffer from pervasive action looping and low resolution rates. We introduce SWE-Protégé, a post-training framework that reframes software repair as an expert-protégé collaboration problem. In SWE-Protégé, an SLM remains the sole decision-maker while learning to selectively seek guidance from a strong expert model, recognize stalled states, and follow through on expert feedback. Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration. We lightly post-train Qwen2.5-Coder-7B-Instruct to achieve 42.4% Pass@1 on SWE-bench Verified, a +25.4% improvement over the prior SLM state of the art, while using expert assistance sparsely (~4 calls per task and 11% of total tokens).
12. 【2602.22090】Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference
链接:https://arxiv.org/abs/2602.22090
作者:Bo-Wei Chen,Chung-Chi Chen,An-Zi Yen
类目:Computation and Language (cs.CL)
关键词:Large Language Models, diverse natural language, Large Language, natural language tasks, larger models performing
备注: Accepted by EACL 2026 Findings
点击查看摘要
Abstract:Large Language Models (LLMs) have revolutionized inference across diverse natural language tasks, with larger models performing better but at higher computational costs. We propose a confidence-driven strategy that dynamically selects the most suitable model based on confidence estimates. By assessing a model's confidence in handling the task and response accuracy, tasks that are likely to be solved correctly are retained, while more uncertain or complex cases are delegated to a larger model, ensuring reliability while minimizing computation. Specifically, we evaluate a model's likelihood of knowing the correct answer and the probability that its response is accurate. Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20\% to 40\%. When applied to GPT-4o API calls, it reduces token usage by approximately 60\%, further improving cost efficiency. These findings indicate the potential of confidence-based model selection to enhance real-world LLM deployment, particularly in resource-constrained settings such as edge devices and commercial API applications.
13. 【2602.22072】Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models
链接:https://arxiv.org/abs/2602.22072
作者:Christian Nickel,Laura Schrewe,Florian Mai,Lucie Flek
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Theory of Mind, agent ability, internal states, ToM, Theory
备注:
点击查看摘要
Abstract:Theory of Mind (ToM) refers to an agent's ability to model the internal states of others. Contributing to the debate whether large language models (LLMs) exhibit genuine ToM capabilities, our study investigates their ToM robustness using perturbations on false-belief tasks and examines the potential of Chain-of-Thought prompting (CoT) to enhance performance and explain the LLM's decision. We introduce a handcrafted, richly annotated ToM dataset, including classic and perturbed false belief tasks, the corresponding spaces of valid reasoning chains for correct task completion, subsequent reasoning faithfulness, task solutions, and propose metrics to evaluate reasoning chain correctness and to what extent final answers are faithful to reasoning traces of the generated CoT. We show a steep drop in ToM capabilities under task perturbation for all evaluated LLMs, questioning the notion of any robust form of ToM being present. While CoT prompting improves the ToM performance overall in a faithful manner, it surprisingly degrades accuracy for some perturbation classes, indicating that selective application is necessary.
14. 【2602.22045】DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain
链接:https://arxiv.org/abs/2602.22045
作者:Walter Hernandez Cruz,Peter Devine,Nikhil Vadgama,Paolo Tasca,Jiahua Xu
类目:Computation and Language (cs.CL)
关键词:United States Patent, Distributed Ledger Technology, million documents spanning, United States, Trademark Office
备注:
点击查看摘要
Abstract:We introduce DLT-Corpus, the largest domain-specific text collection for Distributed Ledger Technology (DLT) research to date: 2.98 billion tokens from 22.12 million documents spanning scientific literature (37,440 publications), United States Patent and Trademark Office (USPTO) patents (49,023 filings), and social media (22 million posts). Existing Natural Language Processing (NLP) resources for DLT focus narrowly on cryptocurrencies price prediction and smart contracts, leaving domain-specific language under explored despite the sector's ~$3 trillion market capitalization and rapid technological evolution. We demonstrate DLT-Corpus' utility by analyzing technology emergence patterns and market-innovation correlations. Findings reveal that technologies originate in scientific literature before reaching patents and social media, following traditional technology transfer patterns. While social media sentiment remains overwhelmingly bullish even during crypto winters, scientific and patent activity grow independently of market fluctuations, tracking overall market expansion in a virtuous cycle where research precedes and enables economic growth that funds further innovation. We publicly release the full DLT-Corpus; LedgerBERT, a domain-adapted model achieving 23% improvement over BERT-base on a DLT-specific Named Entity Recognition (NER) task; and all associated tools and code.
Subjects:
Computation and Language (cs.CL)
Cite as:
arXiv:2602.22045 [cs.CL]
(or
arXiv:2602.22045v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2602.22045
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
15. 【2602.22014】A Diversity Diet for a Healthier Model: A Case Study of French ModernBERT
链接:https://arxiv.org/abs/2602.22014
作者:Louis Estève,Christophe Servan,Thomas Lavergne,Agata Savary
类目:Computation and Language (cs.CL)
关键词:NLP community, recent years, gaining interest, community in recent, NLP
备注:
点击查看摘要
Abstract:Diversity has been gaining interest in the NLP community in recent years. At the same time, state-of-the-art transformer models such as ModernBERT use very large pre-training datasets, which are driven by size rather than by diversity. This summons for an investigation of the impact of diversity on the ModernBERT pre-training. We do so in this study, with the express intent of reducing pre-training dataset size, while retaining at least comparable performance. We compare diversity-driven sampling algorithms, so as to pick the best one. We find that diversity-driven sampling allows in some tasks to gain 10 points relative to randomly-sampled pre-training data of commensurate size. We also see that a model pre-trained for 483h on a diversity-driven dataset of 150M tokens can yield a commensurate performance to a model pre-trained for 1,775h on a randomly-driven dataset of 2.4B tokens.
16. 【2602.21978】CxMP: A Linguistic Minimal-Pair Benchmark for Evaluating Constructional Understanding in Language Models
链接:https://arxiv.org/abs/2602.21978
作者:Miyu Oba,Saku Sugawara
类目:Computation and Language (cs.CL)
关键词:Recent work, language models, examined language models, work has examined, language
备注:
点击查看摘要
Abstract:Recent work has examined language models from a linguistic perspective to better understand how they acquire language. Most existing benchmarks focus on judging grammatical acceptability, whereas the ability to interpret meanings conveyed by grammatical forms has received much less attention. We introduce the Linguistic Minimal-Pair Benchmark for Evaluating Constructional Understanding in Language Models (CxMP), a benchmark grounded in Construction Grammar that treats form-meaning pairings, or constructions, as fundamental linguistic units. CxMP evaluates whether models can interpret the semantic relations implied by constructions, using a controlled minimal-pair design across nine construction types, including the let-alone, caused motion, and ditransitive constructions. Our results show that while syntactic competence emerges early, constructional understanding develops more gradually and remains limited even in large language models (LLMs). CxMP thus reveals persistent gaps in how language models integrate form and meaning, providing a framework for studying constructional understanding and learning trajectories in language models.
17. 【2602.21951】RADAR: Reasoning as Discrimination with Aligned Representations for LLM-based Knowledge Graph Reasoning
链接:https://arxiv.org/abs/2602.21951
作者:Bo Xue,Yuan Jin,Luoyi Fu,Jiaxin Ding,Xinbing Wang
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Language Models, infers missing facts, Large Language, Knowledge graph reasoning
备注:
点击查看摘要
Abstract:Knowledge graph reasoning (KGR) infers missing facts, with recent advances increasingly harnessing the semantic priors and reasoning abilities of Large Language Models (LLMs). However, prevailing generative paradigms are prone to memorizing surface-level co-occurrences rather than learning genuine relational semantics, limiting out-of-distribution generalization. To address this, we propose RADAR, which reformulates KGR from generative pattern matching to discriminative relational reasoning. We recast KGR as discriminative entity selection, where reinforcement learning enforces relative entity separability beyond token-likelihood imitation. Leveraging this separability, inference operates directly in representation space, ensuring consistency with the discriminative optimization and bypassing generation-induced hallucinations. Across four benchmarks, RADAR achieves 5-6% relative gains on link prediction and triple classification over strong LLM baselines, while increasing task-relevant mutual information in intermediate representations by 62.9%, indicating more robust and transferable relational reasoning.
18. 【2602.21950】MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models
链接:https://arxiv.org/abs/2602.21950
作者:Boqi Chen,Xudong Liu,Jiachuan Peng,Marianne Frey-Marti,Bang Zheng,Kyle Lam,Lin Li,Jianing Qiu
类目:Computation and Language (cs.CL)
关键词:Multimodal large language, shown great potential, inadequately capture real-world, real-world clinical complexity, existing benchmarks inadequately
备注:
点击查看摘要
Abstract:Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity. We introduce MEDSYN, a multilingual, multimodal benchmark of highly complex clinical cases with up to 7 distinct visual clinical evidence (CE) types per case. Mirroring clinical workflow, we evaluate 18 MLLMs on differential diagnosis (DDx) generation and final diagnosis (FDx) selection. While top models often match or even outperform human experts on DDx generation, all MLLMs exhibit a much larger DDx--FDx performance gap compared to expert clinicians, indicating a failure mode in synthesis of heterogeneous CE types. Ablations attribute this failure to (i) overreliance on less discriminative textual CE ($\it{e.g.}$, medical history) and (ii) a cross-modal CE utilization gap. We introduce Evidence Sensitivity to quantify the latter and show that a smaller gap correlates with higher diagnostic accuracy. Finally, we demonstrate how it can be used to guide interventions to improve model performance. We will open-source our benchmark and code.
19. 【2602.21947】Large Language Models are Algorithmically Blind
链接:https://arxiv.org/abs/2602.21947
作者:Sohan Venkatesh,Ashish Mahendran Kurapath,Tejas Melkote
类目:Computation and Language (cs.CL)
关键词:demonstrate remarkable breadth, remains poorly understood, Large language models, computational processes remains, processes remains poorly
备注: 20 pages, 11 figures, 14 tables
点击查看摘要
Abstract:Large language models (LLMs) demonstrate remarkable breadth of knowledge, yet their ability to reason about computational processes remains poorly understood. Closing this gap matters for practitioners who rely on LLMs to guide algorithm selection and deployment. We address this limitation using causal discovery as a testbed and evaluate eight frontier LLMs against ground truth derived from large-scale algorithm executions and find systematic, near-total failure. Models produce ranges far wider than true confidence intervals yet still fail to contain the true algorithmic mean in the majority of instances; most perform worse than random guessing and the marginal above-random performance of the best model is most consistent with benchmark memorization rather than principled reasoning. We term this failure algorithmic blindness and argue it reflects a fundamental gap between declarative knowledge about algorithms and calibrated procedural prediction.
20. 【2602.21941】MERRY: Semantically Decoupled Evaluation of Multimodal Emotional and Role Consistencies of Role-Playing Agents
链接:https://arxiv.org/abs/2602.21941
作者:Zhenyu Wang,Xiaofen Xing,Yirong Chen,Xiangmin Xu
类目:Computation and Language (cs.CL)
关键词:attracting increasing attention, increasing attention due, Multimodal Role-Playing Agents, multimodal emotional interactions, immersive multimodal emotional
备注: 11 pages, 6 figures
点击查看摘要
Abstract:Multimodal Role-Playing Agents (MRPAs) are attracting increasing attention due to their ability to deliver more immersive multimodal emotional interactions. However, existing studies still rely on pure textual benchmarks to evaluate the text responses of MRPAs, while delegating the assessment of their multimodal expressions solely to modality-synthesis metrics. This evaluation paradigm, on the one hand, entangles semantic assessment with modality generation, leading to ambiguous error attribution, and on the other hand remains constrained by the heavy reliance on human judgment. To this end, we propose MERRY, a semantically decoupled evaluation framework for assessing Multimodal Emotional and Role consistencies of Role-playing agents. This framework introduce five refined metrics for EC and three for RC. Notably, we transform the traditional subjective scoring approach into a novel bidirectional-evidence-finding task, significantly improving the human agreement of LLM-as-Judge evaluations. Based on MERRY, we conduct extensive evaluations. Our empirical results primarily reveal that: (1) Training on synthetic datasets tends to reduce emotional consistency, whereas training on real-world datasets improves it; (2) Existing models suffer from emotional templatization and simplification, exhibiting positive-bias and performance bottleneck in fine-grained negative emotions; (3) Simple prompting method strengthens the weak models but constrains the strong ones, while simple fine-tuning method suffers from poor role generalization. Codes and dataset are available.
21. 【2602.21933】Small Wins Big: Comparing Large Language Models and Domain Fine-Tuned Models for Sarcasm Detection in Code-Mixed Hinglish Text
链接:https://arxiv.org/abs/2602.21933
作者:Bitan Majumder,Anirban Sen
类目:Computation and Language (cs.CL)
关键词:natural language processing, low-resource linguistic availability, code-mixed environments remains, processing models due, informal expressions
备注:
点击查看摘要
Abstract:Sarcasm detection in multilingual and code-mixed environments remains a challenging task for natural language processing models due to structural variations, informal expressions, and low-resource linguistic availability. This study compares four large language models, Llama 3.1, Mistral, Gemma 3, and Phi-4, with a fine-tuned DistilBERT model for sarcasm detection in code-mixed Hinglish text. The results indicate that the smaller, sequentially fine-tuned DistilBERT model achieved the highest overall accuracy of 84%, outperforming all of the LLMs in zero and few-shot set ups, using minimal LLM generated code-mixed data used for fine-tuning. These findings indicate that domain-adaptive fine-tuning of smaller transformer based models may significantly improve sarcasm detection over general LLM inference, in low-resource and data scarce settings.
22. 【2602.21887】ExpLang: Improved Exploration and Exploitation in LLM Reasoning with On-Policy Thinking Language Selection
链接:https://arxiv.org/abs/2602.21887
作者:Changjiang Gao,Zixian Huang,Kaichen Yang,Jiajun Chen,Jixing Li,Shujian Huang
类目:Computation and Language (cs.CL)
关键词:Current large reasoning, shown strong ability, large reasoning models, Current large, reinforcement learning
备注:
点击查看摘要
Abstract:Current large reasoning models (LRMs) have shown strong ability on challenging tasks after reinforcement learning (RL) based post-training. However, previous work mainly focuses on English reasoning in expectation of the strongest performance, despite the demonstrated potential advantage of multilingual thinking, as well as the requirement for native thinking traces by global users. In this paper, we propose ExpLang, a novel LLM post-training pipeline that enables on-policy thinking language selection to improve exploration and exploitation during RL with the use of multiple languages. The results show that our method steadily outperforms English-only training with the same training budget, while showing high thinking language compliance for both seen and unseen languages. Analysis shows that, by enabling on-policy thinking language selection as an action during RL, ExpLang effectively extends the RL exploration space with diversified language preference and improves the RL exploitation outcome with leveraged non-English advantage. The method is orthogonal to most RL algorithms and opens up a new perspective on using multilinguality to improve LRMs.
23. 【2602.21864】DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs
链接:https://arxiv.org/abs/2602.21864
作者:Yanbin Wei,Jiangyue Yan,Chun Kang,Yang Chen,Hua Liu,James Kwok,Yu Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Graphics (cs.GR)
关键词:zero-shot question answering, question answering, emerged as versatile, Vision-Language Models, versatile solutions
备注: CVPR 2026
点击查看摘要
Abstract:Vision-Language Models (VLMs) have emerged as versatile solutions for zero-shot question answering (QA) across various domains. However, enabling VLMs to effectively comprehend structured graphs and perform accurate, efficient QA remains challenging. Existing approaches typically rely on one single graph topology representation (GTR), such as fixed-style visual images or unified text descriptions. This ``one-size-fits-all'' strategy often neglects model-specific and task-specific preferences, resulting in inaccurate or over-lengthy responses to graph-related queries. To address this, we propose the $\mbox{DynamicGTR}$ framework, which dynamically selects the optimal GTR for each query during inference, thereby enhancing the zero-shot graph QA capabilities of VLMs with a customizable accuracy and brevity trade-off. Extensive experiments show that DynamicGTR not only improves VLM-based graph algorithm QA performance but also successfully transfers the experience trained from synthetic graph algorithm tasks to real-world applications like link prediction and node classification, without any additional training. Additionally, DynamicGTR demonstrates strong transferability across tasks, domains, and models, suggesting its potential as a flexible solution for broad graph scenarios.
24. 【2602.21862】Personalized Graph-Empowered Large Language Model for Proactive Information Access
链接:https://arxiv.org/abs/2602.21862
作者:Chia Cheng Chang,An-Zi Yen,Hen-Hsen Huang,Hsin-Hsi Chen
类目:Computation and Language (cs.CL)
关键词:individuals may struggle, life details, memory recall systems, confuse events, proposed memory recall
备注:
点击查看摘要
Abstract:Since individuals may struggle to recall all life details and often confuse events, establishing a system to assist users in recalling forgotten experiences is essential. While numerous studies have proposed memory recall systems, these primarily rely on deep learning techniques that require extensive training and often face data scarcity due to the limited availability of personal lifelogs. As lifelogs grow over time, systems must also adapt quickly to newly accumulated data. Recently, large language models (LLMs) have demonstrated remarkable capabilities across various tasks, making them promising for personalized applications. In this work, we present a framework that leverages LLMs for proactive information access, integrating personal knowledge graphs to enhance the detection of access needs through a refined decision-making process. Our framework offers high flexibility, enabling the replacement of base models and the modification of fact retrieval methods for continuous improvement. Experimental results demonstrate that our approach effectively identifies forgotten events, supporting users in recalling past experiences more efficiently.
25. 【2602.21857】Distill and Align Decomposition for Enhanced Claim Verification
链接:https://arxiv.org/abs/2602.21857
作者:Jabez Magomere,Elena Kochkina,Samuel Mensah,Simerjot Kaur,Fernando Acero,Arturo Oncevay,Charese H. Smiley,Xiaomo Liu,Manuela Veloso
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:requires decomposing sentences, Relative Policy Optimization, Group Relative Policy, Complex claim verification, Complex claim
备注: EACL Findings 2026
点击查看摘要
Abstract:Complex claim verification requires decomposing sentences into verifiable subclaims, yet existing methods struggle to align decomposition quality with verification performance. We propose a reinforcement learning (RL) approach that jointly optimizes decomposition quality and verifier alignment using Group Relative Policy Optimization (GRPO). Our method integrates: (i) structured sequential reasoning; (ii) supervised finetuning on teacher-distilled exemplars; and (iii) a multi-objective reward balancing format compliance, verifier alignment, and decomposition quality. Across six evaluation settings, our trained 8B decomposer improves downstream verification performance to (71.75%) macro-F1, outperforming prompt-based approaches ((+1.99), (+6.24)) and existing RL methods ((+5.84)). Human evaluation confirms the high quality of the generated subclaims. Our framework enables smaller language models to achieve state-of-the-art claim verification by jointly optimising for verification accuracy and decomposition quality.
26. 【2602.21854】FewMMBench: A Benchmark for Multimodal Few-Shot Learning
链接:https://arxiv.org/abs/2602.21854
作者:Mustafa Dogan,Ilker Kesen,Iacer Calixto,Aykut Erdem,Erkut Erdem
类目:Computation and Language (cs.CL)
关键词:handling interleaved image-text, large language models, multimodal large language, advance in handling, open challenge
备注: Preprint. 49 pages, 38 Figures, 5 Tables
点击查看摘要
Abstract:As multimodal large language models (MLLMs) advance in handling interleaved image-text data, assessing their few-shot learning capabilities remains an open challenge. In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting. Covering a diverse suite of multimodal understanding tasks, from attribute recognition to temporal reasoning, FewMMBench enables systematic analysis across task types, model families, and prompting strategies. We evaluate 26 open-weight MLLMs from six model families across zero-shot, few-shot, and CoT-augmented few-shot settings. Our findings reveal that instruction-tuned models exhibit strong zero-shot performance but benefit minimally, or even regress, with additional demonstrations or CoT reasoning. Retrieval-based demonstrations and increased context size also yield limited gains. These results highlight FewMMBench as a rigorous testbed for diagnosing and advancing few-shot capabilities in multimodal LLMs. The data is available at: this https URL
27. 【2602.21814】Prompt Architecture Determines Reasoning Quality: A Variable Isolation Study on the Car Wash Problem
链接:https://arxiv.org/abs/2602.21814
作者:Heejin Jo
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large language models, car wash problem, language models consistently, models consistently fail, Large language
备注: 9 pages, 4 tables
点击查看摘要
Abstract:Large language models consistently fail the "car wash problem," a viral reasoning benchmark requiring implicit physical constraint inference. We present a variable isolation study (n=20 per condition, 6 conditions, 120 total trials) examining which prompt architecture layers in a production system enable correct reasoning. Using Claude 3.5 Sonnet with controlled hyperparameters (temperature 0.7, top_p 1.0), we find that the STAR (Situation-Task-Action-Result) reasoning framework alone raises accuracy from 0% to 85% (p=0.001, Fisher's exact test, odds ratio 13.22). Adding user profile context via vector database retrieval provides a further 10 percentage point gain, while RAG context contributes an additional 5 percentage points, achieving 100% accuracy in the full-stack condition. These results suggest that structured reasoning scaffolds -- specifically, forced goal articulation before inference -- matter substantially more than context injection for implicit constraint reasoning tasks.
28. 【2602.21786】D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models
链接:https://arxiv.org/abs/2602.21786
作者:Shunsuke Ubukata
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Small Language Models, Large Language, Small Language, Language Models
备注: 9 pages, 3 figures. Code: [this https URL](https://github.com/gitpullpull/DisciplinedChainOfThought) | Benchmarks: [this https URL](https://huggingface.co/datasets/gitpullpull/D-CoT-Benchmarks) | Dataset: [this https URL](https://huggingface.co/datasets/gitpullpull/D-CoT-datasets)
点击查看摘要
Abstract:Chain-of-Thought (CoT) distillation from Large Language Models (LLMs) often induces "overthinking" in Small Language Models (SLMs), leading to performance degradation and excessive token consumption. In this study, we propose Disciplined Chain-of-Thought (D-CoT), a novel framework that enforces a structured reasoning process using control tags -- such as TEMP_LOW for fact-checking and TEMP_HIGH for multi-perspective exploration -- as auxiliary scaffolding during training. By optimizing the CoT trajectory, D-CoT suppresses reasoning drift and simultaneously achieves token reduction and performance improvement. We demonstrate the efficacy of our approach on Qwen3-8B: with only 5,000 training samples, D-CoT significantly boosts accuracy on GPQA-diamond by 9.9% and MMLU-Pro (0-shot) by 9.1%, while drastically reducing computational costs. Furthermore, we confirm that the model internalizes this disciplined thought structure, maintaining high performance even without explicit control tags during inference.
29. 【2602.21763】Improving Implicit Discourse Relation Recognition with Natural Language Explanations from LLMs
链接:https://arxiv.org/abs/2602.21763
作者:Heng Wang,Changxing Wu
类目:Computation and Language (cs.CL)
关键词:Implicit Discourse Relation, Discourse Relation Recognition, explicit discourse markers, Implicit Discourse, challenging task due
备注: AAAI26'0ral
点击查看摘要
Abstract:Implicit Discourse Relation Recognition (IDRR) remains a challenging task due to the requirement for deep semantic understanding in the absence of explicit discourse markers. A further limitation is that existing methods only predict relations without providing any supporting explanations. Recent advances in large language models (LLMs) have shown strong reasoning capabilities in both deep language understanding and natural language explanation generation. In this work, we propose a simple yet effective approach to distill the reasoning capabilities of LLMs into lightweight IDRR models to improve both performance and interpretability. Specifically, we first prompt an LLM to generate explanations for each training instance conditioned on its gold label. Then, we introduce a novel classification-generation framework that jointly performs relation prediction and explanation generation, and train it with the additional supervision of LLM-generated explanations. Our framework is plug-and-play, enabling easy integration with most existing IDRR models. Experimental results on PDTB demonstrate that our approach significantly improves IDRR performance, while human evaluation further confirms that the generated explanations enhance model interpretability. Furthermore, we validate the generality of our approach on sentiment classification and natural language inference
30. 【2602.21741】Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization
链接:https://arxiv.org/abs/2602.21741
作者:MD. Sagor Chowdhury,Adiba Fairooz Chowdhury
类目:Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
关键词:competition on Kaggle, long-form speech recognition, Word Error Rate, Diarization Error Rate, Bengali long-form speech
备注: 6 pages, 5 figures, 3 tables; system paper submitted to DL Sprint 4.0 (Kaggle)
点击查看摘要
Abstract:We describe our end-to-end system for Bengali long-form speech recognition (ASR) and speaker diarization submitted to the DL Sprint 4.0 competition on Kaggle. Bengali presents substantial challenges for both tasks: a large phoneme inventory, significant dialectal variation, frequent code-mixing with English, and a relative scarcity of large-scale labelled corpora. For ASR we achieve a best private Word Error Rate (WER) of 0.37738 and public WER of 0.36137, combining a BengaliAI fine-tuned Whisper medium model with Demucs source separation for vocal isolation, silence-boundary chunking, and carefully tuned generation hyperparameters. For speaker diarization we reach a best private Diarization Error Rate (DER) of 0.27671 and public DER of 0.20936 by replacing the default segmentation model inside the this http URL pipeline with a Bengali-fine-tuned variant, pairing it with wespeaker-voxceleb-resnet34-LM embeddings and centroid-based agglomerative clustering. Our experiments demonstrate that domain-specific fine-tuning of the segmentation component, vocal source separation, and natural silence-aware chunking are the three most impactful design choices for low-resource Bengali speech processing.
31. 【2602.21728】Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling
链接:https://arxiv.org/abs/2602.21728
作者:Shiqi Yan,Yubo Chen,Ruiqi Zhou,Zhengxi Yao,Shuai Chen,Tianyi Zhang,Shijie Zhang,Wei Qiang Zhang,Yongfeng Huang,Haixin Duan,Yunqi Zhang
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Language Models, Large Language, question-answering tasks, plagued by hallucinations
备注: Published as a conference paper at ICLR 2026
点击查看摘要
Abstract:The reasoning process of Large Language Models (LLMs) is often plagued by hallucinations and missing facts in question-answering tasks. A promising solution is to ground LLMs' answers in verifiable knowledge sources, such as Knowledge Graphs (KGs). Prevailing KG-enhanced methods typically constrained LLM reasoning either by enforcing rules during generation or by imitating paths from a fixed set of demonstrations. However, they naturally confined the reasoning patterns of LLMs within the scope of prior experience or fine-tuning data, limiting their generalizability to out-of-distribution graph reasoning problems. To tackle this problem, in this paper, we propose Explore-on-Graph (EoG), a novel framework that encourages LLMs to autonomously explore a more diverse reasoning space on KGs. To incentivize exploration and discovery of novel reasoning paths, we propose to introduce reinforcement learning during training, whose reward is the correctness of the reasoning paths' final answers. To enhance the efficiency and meaningfulness of the exploration, we propose to incorporate path information as additional reward signals to refine the exploration process and reduce futile efforts. Extensive experiments on five KGQA benchmark datasets demonstrate that, to the best of our knowledge, our method achieves state-of-the-art performance, outperforming not only open-source but also even closed-source LLMs.
32. 【2602.21720】Evaluating the relationship between regularity and learnability in recursive numeral systems using Reinforcement Learning
链接:https://arxiv.org/abs/2602.21720
作者:Andrea Silvi,Ponrawee Prasertsom,Jennifer Culbertson,Devdatt Dubhashi,Moa Johansson,Kenny Smith
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:systems, recursive numeral systems, highly regular human, numeral systems, Human recursive numeral
备注:
点击查看摘要
Abstract:Human recursive numeral systems (i.e., counting systems such as English base-10 numerals), like many other grammatical systems, are highly regular. Following prior work that relates cross-linguistic tendencies to biases in learning, we ask whether regular systems are common because regularity facilitates learning. Adopting methods from the Reinforcement Learning literature, we confirm that highly regular human(-like) systems are easier to learn than unattested but possible irregular systems. This asymmetry emerges under the natural assumption that recursive numeral systems are designed for generalisation from limited data to represent all integers exactly. We also find that the influence of regularity on learnability is absent for unnatural, highly irregular systems, whose learnability is influenced instead by signal length, suggesting that different pressures may influence learnability differently in different parts of the space of possible numeral systems. Our results contribute to the body of work linking learnability to cross-linguistic prevalence.
33. 【2602.21669】DWA-KD: Dual-Space Weighting and Time-Warped Alignment for Cross-Tokenizer Knowledge Distillation
链接:https://arxiv.org/abs/2602.21669
作者:Duc Trung Vu,Pham Khanh Chi,Dat Phi Van,Linh Ngo Van,Sang Dinh,Trung Le
类目:Computation and Language (cs.CL)
关键词:Large Language Models, compressing Large Language, Language Models, Large Language, compressing Large
备注: EACL Findings
点击查看摘要
Abstract:Knowledge Distillation (KD) has emerged as a crucial technique for compressing Large Language Models (LLMs). Although existing cross-tokenizer KD methods have made notable progress, their effectiveness remains constrained by suboptimal alignment across sequence and vocabulary levels. To address these limitations, we introduce Dual-Space Weighting and Time-Warped Alignment (DWA-KD), a novel cross-tokenizer distillation framework that enhances token-wise distillation through dual-space entropy-based weighting and achieves precise sequence-level alignment by leveraging both lexical and semantic information. At the token level, DWA-KD maps teacher representations into the student space and vice versa, performing dual-space KD via Kullback-Leibler divergence (KL). The process is modulated by dual-space weights that up-weight tokens where the student is uncertain and the teacher is confident, thereby focusing learning on informative tokens rather than treating all positions equally. At the sequence level, DWA-KD applies Soft Dynamic Time Warping (Soft-DTW) to both the embedding and final hidden-state layers, enabling robust alignment of lexical and contextual semantics between teacher and student sequences. Extensive experiments across diverse NLP benchmarks demonstrate that DWA-KD outperforms state-of-the-art KD baselines, while ablation studies confirm the complementary contributions of entropy-based token weighting and embedding and final hidden state layer Soft-DTW alignment.
34. 【2602.21652】Sparsity Induction for Accurate Post-Training Pruning of Large Language Models
链接:https://arxiv.org/abs/2602.21652
作者:Minhao Jiang,Zhikai Li,Xuewen Liu,Jing Zhang,Mengjuan Chen,Qingyi Gu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large language models, scales present challenges, Large language, increasing parameter scales, parameter scales present
备注: 5 pages, 1 figure, 4 tables
点击查看摘要
Abstract:Large language models have demonstrated capabilities in text generation, while their increasing parameter scales present challenges in computational and memory efficiency. Post-training sparsity (PTS), which reduces model cost by removing weights from dense networks, is an effective approach. However, native dense matrices lack high sparsity, making existing approaches that directly remove weights disrupt model states, resulting in unsatisfactory performance recovery even with post-tuning. We propose Sparsity Induction, which promotes models toward higher sparsity at both distribution and feature levels before pruning, to push the limits of PTS. At the distribution level, we enhance distributional sparsity through mathematically equivalent scaling transformations, which are fully absorbable and incur no extra parameters or inference-time overhead. At the feature level, we introduce Spectral Norm Loss to promote feature sparsity from a low-rank perspective. Experiments across diverse model architectures and tasks demonstrate that our method further enhances sparsity-friendliness, achieving superior pruning performance over existing approaches.
35. 【2602.21647】Mitigating Structural Noise in Low-Resource S2TT: An Optimized Cascaded Nepali-English Pipeline with Punctuation Restoration
链接:https://arxiv.org/abs/2602.21647
作者:Tangsang Chongbang,Pranesh Pyara Shrestha,Amrit Sarki,Anku Jaiswal
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Automatic Speech Recognition, introduced by Automatic, Speech Recognition, Automatic Speech, paper presents
备注: 13 pages, 4 figures, 12 tables
点击查看摘要
Abstract:This paper presents and evaluates an optimized cascaded Nepali speech-to-English text translation (S2TT) system, focusing on mitigating structural noise introduced by Automatic Speech Recognition (ASR). We first establish highly proficient ASR and NMT components: a Wav2Vec2-XLS-R-300m model achieved a state-of-the-art 2.72% CER on OpenSLR-54, and a multi-stage fine-tuned MarianMT model reached a 28.32 BLEU score on the FLORES-200 benchmark. We empirically investigate the influence of punctuation loss, demonstrating that unpunctuated ASR output significantly degrades translation quality, causing a massive 20.7% relative BLEU drop on the FLORES benchmark. To overcome this, we propose and evaluate an intermediate Punctuation Restoration Module (PRM). The final S2TT pipeline was tested across three configurations on a custom dataset. The optimal configuration, which applied the PRM directly to ASR output, achieved a 4.90 BLEU point gain over the direct ASR-to-NMT baseline (BLEU 36.38 vs. 31.48). This improvement was validated by human assessment, which confirmed the optimized pipeline's superior Adequacy (3.673) and Fluency (3.804). This work validates that targeted punctuation restoration is the most effective intervention for mitigating structural noise in the Nepali S2TT pipeline. It establishes an optimized baseline and demonstrates a critical architectural insight for developing cascaded speech translation systems for similar low-resource languages.
36. 【2602.21646】Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion
链接:https://arxiv.org/abs/2602.21646
作者:Yexing Du,Youcheng Pan,Zekun Wang,Zheng Chu,Yichong Huang,Kaiyuan Liu,Bo Yang,Yang Xiang,Ming Liu,Bing Qin
类目:Computation and Language (cs.CL)
关键词:Multimodal Large Language, achieved notable success, Large Language Models, Large Language, integrating multimodal information
备注: Accepted in ICLR 2026
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) have achieved notable success in enhancing translation performance by integrating multimodal information. However, existing research primarily focuses on image-guided methods, whose applicability is constrained by the scarcity of multilingual image-text pairs. The speech modality overcomes this limitation due to its natural alignment with text and the abundance of existing speech datasets, which enable scalable language coverage. In this paper, we propose a Speech-guided Machine Translation (SMT) framework that integrates speech and text as fused inputs into an MLLM to improve translation quality. To mitigate reliance on low-resource data, we introduce a Self-Evolution Mechanism. The core components of this framework include a text-to-speech model, responsible for generating synthetic speech, and an MLLM capable of classifying synthetic speech samples and iteratively optimizing itself using positive samples. Experimental results demonstrate that our framework surpasses all existing methods on the Multi30K multimodal machine translation benchmark, achieving new state-of-the-art results. Furthermore, on general machine translation datasets, particularly the FLORES-200, it achieves average state-of-the-art performance in 108 translation directions. Ablation studies on CoVoST-2 confirms that differences between synthetic and authentic speech have negligible impact on translation quality. The code and models are released at this https URL.
37. 【2602.21638】Multi-dimensional Assessment and Explainable Feedback for Counselor Responses to Client Resistance in Text-based Counseling with LLMs
链接:https://arxiv.org/abs/2602.21638
作者:Anqi Li,Ruihan Wang,Zhaoming Chen,Yuqian Chen,Yu Lu,Yi Zhu,Yuan Xie,Zhenzhong Lan
类目:Computation and Language (cs.CL)
关键词:sophisticated clinical skill, scalable supervisory feedback, addressing client resistance, refine their approaches, sophisticated clinical
备注: 8 pages
点击查看摘要
Abstract:Effectively addressing client resistance is a sophisticated clinical skill in psychological counseling, yet practitioners often lack timely and scalable supervisory feedback to refine their approaches. Although current NLP research has examined overall counseling quality and general therapeutic skills, it fails to provide granular evaluations of high-stakes moments where clients exhibit resistance. In this work, we present a comprehensive pipeline for the multi-dimensional evaluation of human counselors' interventions specifically targeting client resistance in text-based therapy. We introduce a theory-driven framework that decomposes counselor responses into four distinct communication mechanisms. Leveraging this framework, we curate and share an expert-annotated dataset of real-world counseling excerpts, pairing counselor-client interactions with professional ratings and explanatory rationales. Using this data, we perform full-parameter instruction tuning on a Llama-3.1-8B-Instruct backbone to model fine-grained evaluative judgments of response quality and generate explanations underlying. Experimental results show that our approach can effectively distinguish the quality of different communication mechanisms (77-81% F1), substantially outperforming GPT-4o and Claude-3.5-Sonnet (45-59% F1). Moreover, the model produces high-quality explanations that closely align with expert references and receive near-ceiling ratings from human experts (2.8-2.9/3.0). A controlled experiment with 43 counselors further confirms that receiving these AI-generated feedback significantly improves counselors' ability to respond effectively to client resistance.
38. 【2602.21628】RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning
链接:https://arxiv.org/abs/2602.21628
作者:Yukun Chen,Jiaming Li,Longze Chen,Ze Gong,Jingpeng Li,Zhen Qin,Hengyu Chang,Ancheng Xu,Zhihao Yang,Hamid Alinejad-Rokny,Qiang Qu,Bo Zheng,Min Yang
类目:Computation and Language (cs.CL)
关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Reinforcement Learning
备注: 8 pages
点击查看摘要
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a prevailing paradigm for enhancing reasoning in Multimodal Large Language Models (MLLMs). However, relying solely on outcome supervision risks reward hacking, where models learn spurious reasoning patterns to satisfy final answer checks. While recent rubric-based approaches offer fine-grained supervision signals, they suffer from high computational costs of instance-level generation and inefficient training dynamics caused by treating all rubrics as equally learnable. In this paper, we propose Stratified Rubric-based Curriculum Learning (RuCL), a novel framework that reformulates curriculum learning by shifting the focus from data selection to reward design. RuCL generates generalized rubrics for broad applicability and stratifies them based on the model's competence. By dynamically adjusting rubric weights during training, RuCL guides the model from mastering foundational perception to tackling advanced logical reasoning. Extensive experiments on various visual reasoning benchmarks show that RuCL yields a remarkable +7.83% average improvement over the Qwen2.5-VL-7B model, achieving a state-of-the-art accuracy of 60.06%.
39. 【2602.21619】When More Is Less: A Systematic Analysis of Spatial and Commonsense Information for Visual Spatial Reasoning
链接:https://arxiv.org/abs/2602.21619
作者:Muku Akasaka,Soyeon Caren Han
类目:Computation and Language (cs.CL)
关键词:modern vision-language models, Visual spatial reasoning, Visual spatial, vision-language models, challenging for modern
备注: 5 pages, 6 figures, Under review
点击查看摘要
Abstract:Visual spatial reasoning (VSR) remains challenging for modern vision-language models (VLMs), despite advances in multimodal architectures. A common strategy is to inject additional information at inference time, such as explicit spatial cues, external commonsense knowledge, or chain-of-thought (CoT) reasoning instructions. However, it remains unclear when such information genuinely improves reasoning and when it introduces noise. In this paper, we conduct a hypothesis-driven analysis of information injection for VSR across three representative VLMs and two public benchmarks. We examine (i) the type and number of spatial contexts, (ii) the amount and relevance of injected commonsense knowledge, and (iii) the interaction between spatial grounding and CoT prompting. Our results reveal a consistent pattern: more information does not necessarily yield better reasoning. Targeted single spatial cues outperform multi-context aggregation, excessive or weakly relevant commonsense knowledge degrades performance, and CoT prompting improves accuracy only when spatial grounding is sufficiently precise. These findings highlight the importance of selective, task-aligned information injection and provide practical guidance for designing reliable multimodal reasoning pipelines.
40. 【2602.21608】MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification
链接:https://arxiv.org/abs/2602.21608
作者:Kazi Samin Yasar Alam,Md Tanbir Chowdhury,Tamim Ahmed,Ajwad Abrar,Md Rafid Haque
类目:Computation and Language (cs.CL)
关键词:South Asian social, setting remain scarce, South Asian, widespread across South, Asian social media
备注: Under Review
点击查看摘要
Abstract:Bangla-English code-mixing is widespread across South Asian social media, yet resources for implicit meaning identification in this setting remain scarce. Existing sentiment and sarcasm models largely focus on monolingual English or high-resource languages and struggle with transliteration variation, cultural references, and intra-sentential language switching. To address this gap, we introduce MixSarc, the first publicly available Bangla-English code-mixed corpus for implicit meaning identification. The dataset contains 9,087 manually annotated sentences labeled for humor, sarcasm, offensiveness, and vulgarity. We construct the corpus through targeted social media collection, systematic filtering, and multi-annotator validation. We benchmark transformer-based models and evaluate zero-shot large language models under structured prompting. Results show strong performance on humor detection but substantial degradation on sarcasm, offense, and vulgarity due to class imbalance and pragmatic complexity. Zero-shot models achieve competitive micro-F1 scores but low exact match accuracy. Further analysis reveals that over 42\% of negative sentiment instances in an external dataset exhibit sarcastic characteristics. MixSarc provides a foundational resource for culturally aware NLP and supports more reliable multi-label modeling in code-mixed environments.
41. 【2602.21543】Enhancing Multilingual Embeddings via Multi-Way Parallel Text Alignment
链接:https://arxiv.org/abs/2602.21543
作者:Barah Fazili,Koustava Goswami
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:pretraining typically lacks, typically lacks explicit, explicit alignment signals, Multilingual pretraining typically, lacks explicit alignment
备注:
点击查看摘要
Abstract:Multilingual pretraining typically lacks explicit alignment signals, leading to suboptimal cross-lingual alignment in the representation space. In this work, we show that training standard pretrained models for cross-lingual alignment with a multi-way parallel corpus in a diverse pool of languages can substantially improve multilingual and cross-lingual representations for NLU tasks. We construct a multi-way parallel dataset using translations of English text from an off-the-shelf NMT model for a pool of six target languages and achieve strong cross-lingual alignment through contrastive learning. This leads to substantial performance gains across both seen and unseen languages for multiple tasks from the MTEB benchmark evaluated for XLM-Roberta and multilingual BERT base models. Using a multi-way parallel corpus for contrastive training yields substantial gains on bitext mining (21.3%), semantic similarity (5.3%), and classification (28.4%) compared to English-centric (En-X) bilingually parallel data, where X is sampled from a pool of multiple target languages. Furthermore, finetuning mE5 model on a small dataset with multi-way parallelism significantly improves bitext mining compared to one without, underscoring the importance of multi-way cross-lingual supervision even for models already pretrained for high-quality sentence embeddings.
42. 【2602.21492】GradAlign: Gradient-Aligned Data Selection for LLM Reinforcement Learning
链接:https://arxiv.org/abs/2602.21492
作者:Ningyuan Yang,Weihua Du,Weiwei Sun,Sean Welleck,Yiming Yang
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:large language models, central post-training paradigm, language models, central post-training, post-training paradigm
备注: 14 pages. Preliminary work
点击查看摘要
Abstract:Reinforcement learning (RL) has become a central post-training paradigm for large language models (LLMs), but its performance is highly sensitive to the quality of training problems. This sensitivity stems from the non-stationarity of RL: rollouts are generated by an evolving policy, and learning is shaped by exploration and reward feedback, unlike supervised fine-tuning (SFT) with fixed trajectories. As a result, prior work often relies on manual curation or simple heuristic filters (e.g., accuracy), which can admit incorrect or low-utility problems. We propose GradAlign, a gradient-aligned data selection method for LLM reinforcement learning that uses a small, trusted validation set to prioritize training problems whose policy gradients align with validation gradients, yielding an adaptive curriculum. We evaluate GradAlign across three challenging data regimes: unreliable reward signals, distribution imbalance, and low-utility training corpus, showing that GradAlign consistently outperforms existing baselines, underscoring the importance of directional gradient signals in navigating non-stationary policy optimization and yielding more stable training and improved final performance. We release our implementation at this https URL
43. 【2602.21485】Evaluating the Usage of African-American Vernacular English in Large Language Models
链接:https://arxiv.org/abs/2602.21485
作者:Deja Dunlap,R. Thomas McCoy
类目:Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
关键词:Standard American English, American Vernacular English, African American Vernacular, natural language understanding, language understanding tasks
备注:
点击查看摘要
Abstract:In AI, most evaluations of natural language understanding tasks are conducted in standardized dialects such as Standard American English (SAE). In this work, we investigate how accurately large language models (LLMs) represent African American Vernacular English (AAVE). We analyze three LLMs to compare their usage of AAVE to the usage of humans who natively speak AAVE. We first analyzed interviews from the Corpus of Regional African American Language and TwitterAAE to identify the typical contexts where people use AAVE grammatical features such as ain't. We then prompted the LLMs to produce text in AAVE and compared the model-generated text to human usage patterns. We find that, in many cases, there are substantial differences between AAVE usage in LLMs and humans: LLMs usually underuse and misuse grammatical features characteristic of AAVE. Furthermore, through sentiment analysis and manual inspection, we found that the models replicated stereotypes about African Americans. These results highlight the need for more diversity in training data and the incorporation of fairness methods to mitigate the perpetuation of stereotypes.
44. 【2602.21480】Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"?
链接:https://arxiv.org/abs/2602.21480
作者:Germán T. Eizaguirre,Lars Tissen,Marc Sánchez-Artigas
类目:Databases (cs.DB); Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:extensively benchmarked fields, Big Data, Big Data workflows, benchmarked fields, evaluates them jointly
备注: 11 pages, 4 figures
点击查看摘要
Abstract:Text-to-SQL and Big Data are both extensively benchmarked fields, yet there is limited research that evaluates them jointly. In the real world, Text-to-SQL systems are often embedded with Big Data workflows, such as large-scale data processing or interactive data analytics. We refer to this as "Text-to-Big SQL". However, existing text-to-SQL benchmarks remain narrowly scoped and overlook the cost and performance implications that arise at scale. For instance, translation errors that are minor on small datasets lead to substantial cost and latency overheads as data scales, a relevant issue completely ignored by text-to-SQL metrics. In this paper, we overcome this overlooked challenge by introducing novel and representative metrics for evaluating Text-to-Big SQL. Our study focuses on production-level LLM agents, a database-agnostic system adaptable to diverse user needs. Via an extensive evaluation of frontier models, we show that text-to-SQL metrics are insufficient for Big Data. In contrast, our proposed text-to-Big SQL metrics accurately reflect execution efficiency, cost, and the impact of data scale. Furthermore, we provide LLM-specific insights, including fine-grained, cross-model comparisons of latency and cost.
Comments:
11 pages, 4 figures
Subjects:
Databases (cs.DB); Computation and Language (cs.CL); Information Retrieval (cs.IR)
ACMclasses:
I.2.7; H.2.8
Cite as:
arXiv:2602.21480 [cs.DB]
(or
arXiv:2602.21480v1 [cs.DB] for this version)
https://doi.org/10.48550/arXiv.2602.21480
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
45. 【2602.21461】VecGlypher: Unified Vector Glyph Generation with Language Models
链接:https://arxiv.org/abs/2602.21461
作者:Xiaoke Huang,Bhavul Gauri,Kam Woh Ng,Tony Ng,Mengmeng Xu,Zhiheng Liu,Weiming Ren,Zhaochong An,Zijian Zhou,Haonan Qiu,Yuyin Zhou,Sen He,Ziheng Wang,Tao Xiang,Xiao Han
类目:Computation and Language (cs.CL)
关键词:curated exemplar sheets, carefully curated exemplar, vector glyphs directly, high-fidelity vector glyphs, Vector glyphs
备注: Accepted to CVPR'26. Project page: [this https URL](https://xk-huang.github.io/VecGlypher/)
点击查看摘要
Abstract:Vector glyphs are the atomic units of digital typography, yet most learning-based pipelines still depend on carefully curated exemplar sheets and raster-to-vector postprocessing, which limits accessibility and editability. We introduce VecGlypher, a single multimodal language model that generates high-fidelity vector glyphs directly from text descriptions or image exemplars. Given a style prompt, optional reference glyph images, and a target character, VecGlypher autoregressively emits SVG path tokens, avoiding raster intermediates and producing editable, watertight outlines in one pass. A typography-aware data and training recipe makes this possible: (i) a large-scale continuation stage on 39K noisy Envato fonts to master SVG syntax and long-horizon geometry, followed by (ii) post-training on 2.5K expert-annotated Google Fonts with descriptive tags and exemplars to align language and imagery with geometry; preprocessing normalizes coordinate frames, canonicalizes paths, de-duplicates families, and quantizes coordinates for stable long-sequence decoding. On cross-family OOD evaluation, VecGlypher substantially outperforms both general-purpose LLMs and specialized vector-font baselines for text-only generation, while image-referenced generation reaches a state-of-the-art performance, with marked gains over DeepVecFont-v2 and DualVector. Ablations show that model scale and the two-stage recipe are critical and that absolute-coordinate serialization yields the best geometry. VecGlypher lowers the barrier to font creation by letting users design with words or exemplars, and provides a scalable foundation for future multimodal design tools.
46. 【2602.21456】Revisiting Text Ranking in Deep Research
链接:https://arxiv.org/abs/2602.21456
作者:Chuan Meng,Litu Ou,Sean MacAvaney,Jeff Dalton
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:extensive open-web exploration, address hard queries, Deep research, open-web exploration, important task
备注:
点击查看摘要
Abstract:Deep research has emerged as an important task that aims to address hard queries through extensive open-web exploration. To tackle it, most prior work equips large language model (LLM)-based agents with opaque web search APIs, enabling agents to iteratively issue search queries, retrieve external evidence, and reason over it. Despite search's essential role in deep research, black-box web search APIs hinder systematic analysis of search components, leaving the behaviour of established text ranking methods in deep research largely unclear. To fill this gap, we reproduce a selection of key findings and best practices for IR text ranking methods in the deep research setting. In particular, we examine their effectiveness from three perspectives: (i) retrieval units (documents vs. passages), (ii) pipeline configurations (different retrievers, re-rankers, and re-ranking depths), and (iii) query characteristics (the mismatch between agent-issued queries and the training queries of text rankers). We perform experiments on BrowseComp-Plus, a deep research dataset with a fixed corpus, evaluating 2 open-source agents, 5 retrievers, and 3 re-rankers across diverse setups. We find that agent-issued queries typically follow web-search-style syntax (e.g., quoted exact matches), favouring lexical, learned sparse, and multi-vector retrievers; passage-level units are more efficient under limited context windows, and avoid the difficulties of document length normalisation in lexical retrieval; re-ranking is highly effective; translating agent-issued queries into natural-language questions significantly bridges the query mismatch.
47. 【2602.21447】Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG
链接:https://arxiv.org/abs/2602.21447
作者:Inderjeet Singh,Vikas Pahuja,Aishvariya Priya Rathina Sabapathy,Chiara Picardi,Amit Giloni,Roman Vainshtein,Andrés Murillo,Hisashi Kojima,Motoyoshi Sekiya,Yuki Unno,Junichi Suga
类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:multimodal agentic RAG, agentic RAG fail, distribute malicious semantics, Current stateless defences, detect adversarial strategies
备注: 13 pages, 2 figures, 5 tables
点击查看摘要
Abstract:Current stateless defences for multimodal agentic RAG fail to detect adversarial strategies that distribute malicious semantics across retrieval, planning, and generation components. We formulate this security challenge as a Partially Observable Markov Decision Process (POMDP), where adversarial intent is a latent variable inferred from noisy multi-stage observations. We introduce MMA-RAG^T, an inference-time control framework governed by a Modular Trust Agent (MTA) that maintains an approximate belief state via structured LLM reasoning. Operating as a model-agnostic overlay, MMA-RAGT mediates a configurable set of internal checkpoints to enforce stateful defence-in-depth. Extensive evaluation on 43,774 instances demonstrates a 6.50x average reduction factor in Attack Success Rate relative to undefended baselines, with negligible utility cost. Crucially, a factorial ablation validates our theoretical bounds: while statefulness and spatial coverage are individually necessary (26.4 pp and 13.6 pp gains respectively), stateless multi-point intervention can yield zero marginal benefit under homogeneous stateless filtering when checkpoint detections are perfectly correlated.
48. 【2602.21379】MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation
链接:https://arxiv.org/abs/2602.21379
作者:Daniel Tamayo,Iñaki Lacunza,Paula Rivera-Hidalgo,Severino Da Dalt,Javier Aula-Blasco,Aitor Gonzalez-Agirre,Marta Villegas
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:parameter encoders built, languages and code, Matryoshka Representation Learning, incorporate Matryoshka Representation, parameter encoders
备注: 24 pages, 14 tables and 4 figures
点击查看摘要
Abstract:We introduce MrBERT, a family of 150M-300M parameter encoders built on the ModernBERT architecture and pre-trained on 35 languages and code. Through targeted adaptation, this model family achieves state-of-the-art results on Catalan- and Spanish-specific tasks, while establishing robust performance across specialized biomedical and legal domains. To bridge the gap between research and production, we incorporate Matryoshka Representation Learning (MRL), enabling flexible vector sizing that significantly reduces inference and storage costs. Ultimately, the MrBERT family demonstrates that modern encoder architectures can be optimized for both localized linguistic excellence and efficient, high-stakes domain specialization. We open source the complete model family on Huggingface.
49. 【2602.21377】Beyond Subtokens: A Rich Character Embedding for Low-resource and Morphologically Complex Languages
链接:https://arxiv.org/abs/2602.21377
作者:Felix Schneider,Maria Gogolev,Sven Sickert,Joachim Denzler
类目:Computation and Language (cs.CL)
关键词:Tokenization and sub-tokenization, natural language processing, sub-tokenization based models, sub-tokenization based, Tokenization
备注: 12 content pages, 2 figures, 8 tables, one example textbox
点击查看摘要
Abstract:Tokenization and sub-tokenization based models like word2vec, BERT and the GPTs are the state-of-the-art in natural language processing. Typically, these approaches have limitations with respect to their input representation. They fail to fully capture orthographic similarities and morphological variations, especially in highly inflected and under-resource languages. To mitigate this problem, we propose to computes word vectors directly from character strings, integrating both semantic and syntactic information. We denote this transformer-based approach Rich Character Embeddings (RCE). Furthermore, we propose a hybrid model that combines transformer and convolutional mechanisms. Both vector representations can be used as a drop-in replacement for dictionary- and subtoken-based word embeddings in existing model architectures. It has the potential to improve performance for both large context-based language models like BERT and small models like word2vec for under-resourced and morphologically rich languages. We evaluate our approach on various tasks like the SWAG, declension prediction for inflected languages, metaphor and chiasmus detection for various languages. Our experiments show that it outperforms traditional token-based approaches on limited data using OddOneOut and TopK metrics.
50. 【2602.21374】Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages
链接:https://arxiv.org/abs/2602.21374
作者:Mohammadreza Ghaffarzadeh-Esfahani,Nahid Yousefian,Ebrahim Heidari-Farsani,Ali Akbar Omidvarian,Sepehr Ghahraei,Atena Farangi,AmirBahador Boroumand
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Extracting clinical information, natural language processing, low-resource languages remains, Extracting clinical, Matthews Correlation Coefficient
备注: 16 pages, 3 figures, 2 supplementary files
点击查看摘要
Abstract:Extracting clinical information from medical transcripts in low-resource languages remains a significant challenge in healthcare natural language processing (NLP). This study evaluates a two-step pipeline combining Aya-expanse-8B as a Persian-to-English translation model with five open-source small language models (SLMs) -- Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct, Llama-3.2-3B-Instruct, Qwen2.5-1.5B-Instruct, and Gemma-3-1B-it -- for binary extraction of 13 clinical features from 1,221 anonymized Persian transcripts collected at a cancer palliative care call center. Using a few-shot prompting strategy without fine-tuning, models were assessed on macro-averaged F1-score, Matthews Correlation Coefficient (MCC), sensitivity, and specificity to account for class imbalance. Qwen2.5-7B-Instruct achieved the highest overall performance (median macro-F1: 0.899; MCC: 0.797), while Gemma-3-1B-it showed the weakest results. Larger models (7B--8B parameters) consistently outperformed smaller counterparts in sensitivity and MCC. A bilingual analysis of Aya-expanse-8B revealed that translating Persian transcripts to English improved sensitivity, reduced missing outputs, and boosted metrics robust to class imbalance, though at the cost of slightly lower specificity and precision. Feature-level results showed reliable extraction of physiological symptoms across most models, whereas psychological complaints, administrative requests, and complex somatic features remained challenging. These findings establish a practical, privacy-preserving blueprint for deploying open-source SLMs in multilingual clinical NLP settings with limited infrastructure and annotation resources, and highlight the importance of jointly optimizing model scale and input language strategy for sensitive healthcare applications.
51. 【2602.21368】Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration
链接:https://arxiv.org/abs/2602.21368
作者:Charafeddine Mouzouni
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
关键词:practitioner trust, system output, black-box deployment gate, confidence level, self-consistency sampling
备注: 41 pages, 11 figures, 10 tables, including appendices
点击查看摘要
Abstract:Given a black-box AI system and a task, at what confidence level can a practitioner trust the system's output? We answer with a reliability level -- a single number per system-task pair, derived from self-consistency sampling and conformal calibration, that serves as a black-box deployment gate with exact, finite-sample, distribution-free guarantees. Self-consistency sampling reduces uncertainty exponentially; conformal calibration guarantees correctness within 1/(n+1) of the target level, regardless of the system's errors -- made transparently visible through larger answer sets for harder questions. Weaker models earn lower reliability levels (not accuracy -- see Definition 2.4): GPT-4.1 earns 94.6% on GSM8K and 96.8% on TruthfulQA, while GPT-4.1-nano earns 89.8% on GSM8K and 66.5% on MMLU. We validate across five benchmarks, five models from three families, and both synthetic and real data. Conditional coverage on solvable items exceeds 0.93 across all configurations; sequential stopping reduces API costs by around 50%.
52. 【2602.21346】Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment
链接:https://arxiv.org/abs/2602.21346
作者:Mengxuan Hu,Vivek V. Datla,Anoop Kumar,Zihan Guan,Sheng Li,Alfy Samuel,Daben Liu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Direct Preference Optimization, Reinforcement Learning, Human Feedback, Learning from Human, Recent advances
备注:
点击查看摘要
Abstract:Recent advances in alignment techniques such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO) have improved the safety of large language models (LLMs). However, these LLMs remain vulnerable to jailbreak attacks that disguise harmful intent through indirect or deceptive phrasing. Using causal intervention, we empirically demonstrate that this vulnerability stems from shallow alignment mechanisms that lack deep reasoning, often rejecting harmful prompts without truly understanding why they are harmful. To mitigate this vulnerability, we propose enhancing alignment through reasoning-aware post-training. We construct and release a novel Chain-of-Thought (CoT) fine-tuning dataset that includes both utility-oriented and safety-critical prompts with step-by-step rationales. Fine-tuning on this dataset encourages models to produce principled refusals grounded in reasoning, outperforming standard SFT baselines. Furthermore, inspired by failure patterns in CoT fine-tuning, we introduce Alignment-Weighted DPO, which targets the most problematic parts of an output by assigning different preference weights to the reasoning and final-answer segments. This produces finer-grained, targeted updates than vanilla DPO and improves robustness to diverse jailbreak strategies. Extensive experiments across multiple safety and utility benchmarks show that our method consistently improves alignment robustness while maintaining overall model utility.
53. 【2602.21265】oolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning
链接:https://arxiv.org/abs/2602.21265
作者:Hyeonje Choi,Jeongsoo Lee,Hyojun Lee,Jay-Yoon Lee
类目:Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
关键词:realistic multi-tool environments, evaluates tool-augmented language, calling schema-specified tools, sustaining multi-step execution, tool-augmented language models
备注: Conference : Submitted to ICML 2026. 8 pages (+ abstract 16 pages), 5 figures
点击查看摘要
Abstract:We introduce \ToolMATH, a math-grounded benchmark that evaluates tool-augmented language models in realistic multi-tool environments where the output depends on calling schema-specified tools and sustaining multi-step execution. It turns math problems into a controlled, correctness-checkable benchmark with tool sets, enabling systematic evaluation of model reliability under (1) large, overlapping tool catalogs and (2) the absence of the intended capability. \ToolMATH provides actionable diagnostic evidence of failure modes in tool-augmented agents, helping identify the control mechanisms required for robustness. \ToolMATH roughly contains 8k questions and 12k tools; we provide an additional hard-set \ToolMATHHard with questions and tools. Our evaluation reveals that the key failure factor is due to the inability to reason, leading to the accumulation of intermediate results' errors and constrain later decisions. Tool-list redundancy do not simply add noise, but amplify small early deviations into irreversible execution drift. The benchmark highlights that when the intended capability is missing, distractor tools can sometimes serve as partial substitutes in solution paths, yet they can also mislead models into ungrounded tool trajectories. Finally, comparisons between tool-use protocols emphasize that improvements come less from local action selection and more from long-range plan coherence and disciplined use of observations.
54. 【2602.21262】Under the Influence: Quantifying Persuasion and Vigilance in Large Language Models
链接:https://arxiv.org/abs/2602.21262
作者:Sasha Robinson,Kerem Oktar,Katherine M. Collins,Ilia Sucholutsky,Kelsey R. Allen
类目:Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
关键词:Large Language Models, high-stakes human decision-making, Large Language, Language Models, human decision-making
备注:
点击查看摘要
Abstract:With increasing integration of Large Language Models (LLMs) into areas of high-stakes human decision-making, it is important to understand the risks they introduce as advisors. To be useful advisors, LLMs must sift through large amounts of content, written with both benevolent and malicious intent, and then use this information to convince a user to take a specific action. This involves two social capacities: vigilance (the ability to determine which information to use, and which to discard) and persuasion (synthesizing the available evidence to make a convincing argument). While existing work has investigated these capacities in isolation, there has been little prior investigation of how these capacities may be linked. Here, we use a simple multi-turn puzzle-solving game, Sokoban, to study LLMs' abilities to persuade and be rationally vigilant towards other LLM agents. We find that puzzle-solving performance, persuasive capability, and vigilance are dissociable capacities in LLMs. Performing well on the game does not automatically mean a model can detect when it is being misled, even if the possibility of deception is explicitly mentioned. However, LLMs do consistently modulate their token use, using fewer tokens to reason when advice is benevolent and more when it is malicious, even if they are still persuaded to take actions leading them to failure. To our knowledge, our work presents the first investigation of the relationship between persuasion, vigilance, and task performance in LLMs, and suggests that monitoring all three independently will be critical for future work in AI safety.
55. 【2602.21257】Structured Prompt Language: Declarative Context Management for LLMs
链接:https://arxiv.org/abs/2602.21257
作者:Wen G. Gong
类目:Computation and Language (cs.CL); Databases (cs.DB); Programming Languages (cs.PL)
关键词:Structured Prompt Language, declarative SQL-inspired language, generative knowledge bases, treats large language, SQL EXPLAIN ANALYZE
备注: 44 pages, 6 figures, 14 tables, 15 code-listings
点击查看摘要
Abstract:We present SPL (Structured Prompt Language), a declarative SQL-inspired language that treats large language models as generative knowledge bases and their context windows as constrained resources. SPL provides explicit WITH BUDGET/LIMIT token management, an automatic query optimizer, EXPLAIN transparency analogous to SQL's EXPLAIN ANALYZE, and native integration of retrieval-augmented generation (RAG) and persistent memory in a single declarative framework. SPL-flow extends SPL into resilient agentic pipelines with a three-tier provider fallback strategy (Ollama - OpenRouter - self-healing retry) fully transparent to the .spl script. Five extensions demonstrate the paradigm's breadth: (1) Text2SPL (multilingual NL-SPL translation); (2) Mixture-of-Models (MoM) routing that dispatches each PROMPT to a domain-specialist model at runtime; (3) Logical Chunking, an intelligent strategy for documents exceeding a single context window--expressed naturally through SPL's existing CTE syntax with no new constructs, decomposing a large query into a Map-Reduce pipeline that reduces attention cost from O(N^2) to O(N^2/k) and runs identically on cloud (parallel) or local hardware (sequential); (4) SPL-flow, a declarative agentic orchestration layer with resilient three-tier provider fallback; and (5) BENCHMARK for parallel multi-model comparison with automatic winner persistence. We provide a formal EBNF grammar, two pip-installable Python packages (spl-llm, spl-flow), and comparison against Prompty, DSPy, and LMQL. SPL reduces prompt boilerplate by 65% on average, surfaces a 68x cost spread across model tiers as a pre-execution signal, and runs the identical .spl script at $0.002 on OpenRouter or at zero marginal cost on a local Ollama instance--without modification.
56. 【2602.21231】ACAR: Adaptive Complexity Routing for Multi-Model Ensembles with Auditable Decision Traces
链接:https://arxiv.org/abs/2602.21231
作者:Ramchand Kumaresan
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Adaptive Complexity, studying multi-model orchestration, present ACAR, measurement framework, framework for studying
备注: 12 pages, 9 figures. Measurement framework for adaptive multi-model routing with auditable execution traces
点击查看摘要
Abstract:We present ACAR (Adaptive Complexity and Attribution Routing), a measurement framework for studying multi-model orchestration under auditable conditions. ACAR uses self-consistency variance (sigma) computed from N=3 probe samples to route tasks across single-model, two-model, and three-model execution modes. The system is implemented on top of TEAMLLM, a deterministic execution substrate with immutable artifacts and complete decision traces. We evaluate ACAR on 1,510 tasks spanning four benchmarks: MathArena, Reasoning Gym, LiveCodeBench, and SuperGPQA, using Claude Sonnet 4, GPT-4o, and Gemini 2.0 Flash, producing more than 7,550 auditable runs. Results show that sigma-based routing achieves 55.6 percent accuracy, exceeding the two-model baseline of 54.4 percent while avoiding full ensembling on 54.2 percent of tasks. The routing mechanism is model-agnostic and requires no learned components. We also document negative results. First, retrieval augmentation reduced accuracy by 3.4 percentage points, as median retrieval similarity was only 0.167, demonstrating that experience injection without semantic alignment introduces noise rather than grounding. Second, when models agree on incorrect answers (sigma equals zero), no downstream ensemble can recover; this agreement-but-wrong failure mode is intrinsic to self-consistency and bounds achievable accuracy at approximately eight percentage points below full ensembling. Third, attribution estimates based on proxy signals such as response similarity and entropy showed weak correlation with ground-truth leave-one-out values, indicating that practical attribution requires explicit counterfactual computation. This work documents which assumptions fail in practice and provides falsifiable baselines for future research on routing, retrieval, and multi-model attribution.
57. 【2602.21230】RACE: Trajectory-Aware Comprehensive Evaluation for Deep Research Agents
链接:https://arxiv.org/abs/2602.21230
作者:Yanyu Chen,Jiyue Jiang,Jiahong Liu,Yifei Zhang,Xiao Guo,Irwin King
类目:Computation and Language (cs.CL)
关键词:Deep Research Agents, Deep Research, conventional outcome-based metrics, outcome-based metrics fail, Research Agents
备注: Accepted by WWW 2026
点击查看摘要
Abstract:The evaluation of Deep Research Agents is a critical challenge, as conventional outcome-based metrics fail to capture the nuances of their complex reasoning. Current evaluation faces two primary challenges: 1) a reliance on singular metrics like Pass@1, creating a "high-score illusion" that ignores the quality, efficiency, and soundness of the reasoning process; and 2) the failure of static benchmarks to quantify crucial attributes like robustness and latent capability. To address these gaps, we introduce TRACE (Trajectory-Aware Comprehensive Evaluation), a framework that holistically assesses the entire problem-solving trajectory. To counter the "high-score illusion", we propose a Hierarchical Trajectory Utility Function that quantifies process efficiency and cognitive quality, including evidence grounding, alongside accuracy. To measure deeper attributes, TRACE introduces a Scaffolded Capability Assessment protocol, quantifying an agent's latent ability by determining the minimum guidance needed for success. Our contributions include the TRACE framework, its novel metrics, and the accompanying DeepResearch-Bench with controllable complexity. Experiments show TRACE delivers a granular ranking that uncovers critical trade-offs between agent accuracy, efficiency, and robustness entirely missed by singular metrics.
58. 【2602.21228】ImpRIF: Stronger Implicit Reasoning Leads to Better Complex Instruction Following
链接:https://arxiv.org/abs/2602.21228
作者:Yuancheng Yang,Lin Yang,Xu Wang,Chao Tong,Haihua Yang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:large language models, robust complex instruction, applications of large, large language, demand for robust
备注:
点击查看摘要
Abstract:As applications of large language models (LLMs) become increasingly complex, the demand for robust complex instruction following capabilities is growing accordingly. We argue that a thorough understanding of the instruction itself, especially the latent reasoning structure embedded between the lines, is crucial for improving instruction following. Therefore we target complex instructions that involve implicit reasoning, intricate logical relations, and multi-constraint dependencies. We propose ImpRIF, a method to enhance LLMs' understanding of implicit reasoning instructions, thereby improving its ability to follow complex instructions. We formalize such instructions as verifiable reasoning graphs, enabling programmatic verification and graph-driven chain-of-thought reasoning. Based on this formulation, we synthesize large-scale single- and multi-turn data, propose fine-tuning with graph reasoning, and apply reinforcement learning to explicitly train models to reason along the graph. On five complex instruction following benchmarks, our models substantially outperform their base models. These results demonstrate that enhancing implicit reasoning capabilities can significantly improve complex instruction following. This project will be open-sourced in the near future.
59. 【2602.21227】Budget-Aware Agentic Routing via Boundary-Guided Training
链接:https://arxiv.org/abs/2602.21227
作者:Caiqi Zhang,Menglin Xia,Xuchao Zhang,Daniel Madrigal,Ankur Mallick,Samuel Kessler,Victor Ruehle,Saravan Rajmohan
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:execute long-horizon workflows, large language models, evolve into autonomous, long-horizon workflows, invoking a high-capability
备注:
点击查看摘要
Abstract:As large language models (LLMs) evolve into autonomous agents that execute long-horizon workflows, invoking a high-capability model at every step becomes economically unsustainable. While model routing is effective for single-turn queries, agentic routing is a sequential, path-dependent problem: early mistakes compound, feedback is often at the end of the episode, and deployments often demand strict per-task spending limits. We propose Budget-Aware Agentic Routing, which selects between a cheap and an expensive model at each step to optimize the cost--success frontier and to operate under strict per-task budgets. We propose Boundary-Guided Training, which leverages two boundary policies (always-small vs.\ always-large) to build a difficulty taxonomy and to anchor learning under sparse rewards. Our approach warms start with boundary-guided SFT data synthesis via stratified sampling of cost-efficient trajectories, then applies Boundary-Guided Policy Optimization (BoPO), combining boundary-relative rewards with a reference-guided advantage to avoid degenerate cheap-failure solutions. Experiment results show that our method improves the efficiency frontier, matching strong routing baselines at substantially lower cost while demonstrating generalization to strict inference-time budget constraints. Overall, our work establishes a foundational framework for agentic routing, shifting the paradigm from static model selection to dynamic, budget-aware sequential decision-making.
60. 【2602.21226】IslamicLegalBench: Evaluating LLMs Knowledge and Reasoning of Islamic Law Across 1,200 Years of Islamic Pluralist Legal Traditions
链接:https://arxiv.org/abs/2602.21226
作者:Ezieddin Elmahjub,Junaid Qadir,Abdullah Mushtaq,Rafay Naeem,Ibrahim Ghaznavi,Waleed Iqbal
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:systems reliably reason, millions of Muslims, Muslims turn, critical question arises, question arises
备注: This manuscript has been submitted for review to Artificial Intelligence \ Law
点击查看摘要
Abstract:As millions of Muslims turn to LLMs like GPT, Claude, and DeepSeek for religious guidance, a critical question arises: Can these AI systems reliably reason about Islamic law? We introduce IslamicLegalBench, the first benchmark evaluating LLMs across seven schools of Islamic jurisprudence, with 718 instances covering 13 tasks of varying complexity. Evaluation of nine state-of-the-art models reveals major limitations: the best model achieves only 68% correctness with 21% hallucination, while several models fall below 35% correctness and exceed 55% hallucination. Few-shot prompting provides minimal gains, improving only 2 of 9 models by 1%. Moderate-complexity tasks requiring exact knowledge show the highest errors, whereas high-complexity tasks display apparent competence through semantic reasoning. False premise detection indicates risky sycophancy, with 6 of 9 models accepting misleading assumptions at rates above 40%. These results highlight that prompt-based methods cannot compensate for missing foundational knowledge. IslamicLegalBench offers the first systematic framework to evaluate Islamic legal reasoning in AI, revealing critical gaps in tools increasingly relied on for spiritual guidance.
61. 【2602.21225】Architecture-Agnostic Curriculum Learning for Document Understanding: Empirical Evidence from Text-Only and Multimodal
链接:https://arxiv.org/abs/2602.21225
作者:Mohammed Hamdan,Vincenzo Dentamaro,Giuseppe Pirlo,Mohamed Cheriet
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:architecturally distinct document, distinct document understanding, incrementally increases training, yields consistent efficiency, training data exposure
备注:
点击查看摘要
Abstract:We investigate whether progressive data scheduling -- a curriculum learning strategy that incrementally increases training data exposure (33\%$\rightarrow$67\%$\rightarrow$100\%) -- yields consistent efficiency gains across architecturally distinct document understanding models. By evaluating BERT (text-only, 110M parameters) and LayoutLMv3 (multimodal, 126M parameters) on the FUNSD and CORD benchmarks, we establish that this schedule reduces wall-clock training time by approximately 33\%, commensurate with the reduction from 6.67 to 10.0 effective epoch-equivalents of data. To isolate curriculum effects from compute reduction, we introduce matched-compute baselines (Standard-7) that control for total gradient updates. On the FUNSD dataset, the curriculum significantly outperforms the matched-compute baseline for BERT ($\Delta$F1 = +0.023, $p=0.022$, $d_z=3.83$), constituting evidence for a genuine scheduling benefit in capacity-constrained models. In contrast, no analogous benefit is observed for LayoutLMv3 ($p=0.621$), whose multimodal representations provide sufficient inductive bias. On the CORD dataset, all conditions converge to equivalent F1 scores ($\geq$0.947) irrespective of scheduling, indicating a performance ceiling. Schedule ablations comparing progressive, two-phase, reverse, and random pacing confirm that the efficiency gain derives from reduced data volume rather than ordering. Taken together, these findings demonstrate that progressive scheduling is a reliable compute-reduction strategy across model families, with curriculum-specific benefits contingent on the interaction between model capacity and task complexity.
62. 【2602.21224】Make Every Draft Count: Hidden State based Speculative Decoding
链接:https://arxiv.org/abs/2602.21224
作者:Yuetao Chen,Xuliang Wang,Xinzhou Zheng,Ming Li,Peng Wang,Hong Xu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
关键词:accelerate LLM inference, accelerate LLM, LLM inference, generate candidate tokens, pivotal technique
备注:
点击查看摘要
Abstract:Speculative decoding has emerged as a pivotal technique to accelerate LLM inference by employing a lightweight draft model to generate candidate tokens that are subsequently verified by the target model in parallel. However, while this paradigm successfully increases the arithmetic intensity of memory-bound inference, it causes significant compute inefficiency: the majority of draft tokens fail verification and are discarded, resulting in waste of computation. Motivated by the goal of recollecting this wasted computation, we propose a novel system that transforms discarded drafts into reusable tokens. Our key insight is to perform auto-regressive prediction at the hidden states level and postpone the integrating token information after the hidden states generation, so the draft hidden states are not contaminated by incorrect tokens, enabling hidden state reuse. To implement such a system, first we introduce a draft model architecture based on auto-regressive hidden states, which preserves richer semantics than token-based drafters to facilitate draft repurposing. Second, we design an efficient token information injection mechanism that leverages our specialized draft model to construct high-quality draft token trees and enables resampling tokens from verification failures. Third, we eliminate the overhead hidden in our design to further maximize hardware utilization. We conducted extensive evaluations against various baselines, demonstrating up to a 3.3x speedup against standard speculative decoding.
63. 【2602.21223】Measuring Pragmatic Influence in Large Language Model Instructions
链接:https://arxiv.org/abs/2602.21223
作者:Yilin Geng,Omri Abend,Eduard Hovy,Lea Frermann
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:large language models, large language, framing, language models, pragmatic framing
备注:
点击查看摘要
Abstract:It is not only what we ask large language models (LLMs) to do that matters, but also how we prompt. Phrases like "This is urgent" or "As your supervisor" can shift model behavior without altering task content. We study this effect as pragmatic framing, contextual cues that shape directive interpretation rather than task specification. While prior work exploits such cues for prompt optimization or probes them as security vulnerabilities, pragmatic framing itself has not been treated as a measurable property of instruction following. Measuring this influence systematically remains challenging, requiring controlled isolation of framing cues. We introduce a framework with three novel components: directive-framing decomposition separating framing context from task specification; a taxonomy organizing 400 instantiations of framing into 13 strategies across 4 mechanism clusters; and priority-based measurement that quantifies influence through observable shifts in directive prioritization. Across five LLMs of different families and sizes, influence mechanisms cause consistent and structured shifts in directive prioritization, moving models from baseline impartiality toward favoring the framed directive. This work establishes pragmatic framing as a measurable and predictable factor in instruction-following systems.
64. 【2602.21222】ask-Aware LoRA Adapter Composition via Similarity Retrieval in Vector Databases
链接:https://arxiv.org/abs/2602.21222
作者:Riya Adsul,Balachandra Devarangadi Sunil,Isha Nalawade,Sudharshan Govindan
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Parameter efficient fine, efficiently composing multiple, composing multiple specialized, tasks remains challenging, multiple specialized adapters
备注:
点击查看摘要
Abstract:Parameter efficient fine tuning methods like LoRA have enabled task specific adaptation of large language models, but efficiently composing multiple specialized adapters for unseen tasks remains challenging. We present a novel framework for dynamic LoRA adapter composition that leverages similarity retrieval in vector databases to enable zero-shot generalization across diverse NLP tasks. Our approach constructs a task-aware vector database by embedding training examples from 22 datasets spanning commonsense reasoning, question answering, natural language inference, and sentiment analysis. At inference time, we retrieve the most similar training examples, compute task similarity distributions via nucleus sampling, and dynamically merge relevant LoRA adapters using retrieval weighted fusion strategies. We evaluated four merging methods Linear, Concatenation, TIES, and Magnitude Prune demonstrating that our dataset centric retrieval approach often matches or exceeds the performance of individually fine-tuned task-specific adapters. Notably, Linear merging achieves 70.95% on PIQA and 77.62% on RTE, substantially outperforming single-task baselines (46% and 52%, respectively). Our framework requires no additional retriever training, operates with frozen embeddings, and enables efficient, interpretable adapter composition. These results suggest that retrieval based dynamic merging offers a promising direction for scalable, parameter-efficient multitask learning without requiring full model retraining for each new task.
65. 【2602.21221】Latent Context Compilation: Distilling Long Context into Compact Portable Memory
链接:https://arxiv.org/abs/2602.21221
作者:Zeju Li,Yizhou Zhou,Qiang Xu
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Efficient long-context LLM, long-context LLM deployment, complicate concurrent serving, Efficient long-context, Test-Time Training
备注:
点击查看摘要
Abstract:Efficient long-context LLM deployment is stalled by a dichotomy between amortized compression, which struggles with out-of-distribution generalization, and Test-Time Training, which incurs prohibitive synthetic data costs and requires modifying model weights, creating stateful parameters that complicate concurrent serving. We propose Latent Context Compilation, a framework that fundamentally shifts context processing from adaptation to compilation. By utilizing a disposable LoRA module as a compiler, we distill long contexts into compact buffer tokens -- stateless, portable memory artifacts that are plug-and-play compatible with frozen base models. Crucially, we introduce a self-aligned optimization strategy that eliminates the need for synthetic context-relevant QA pairs. By regularizing context reconstruction task with context-agnostic random queries, we force compressed tokens to reside within the model's existing instruction-following manifold. Experiments with Llama-3.1-8B demonstrate that Latent Context Compilation preserves fine-grained details and reasoning capabilities where prior methods falter, effectively decoupling memory density from model parameters even at a 16x compression ratio.
66. 【2602.21220】Field-Theoretic Memory for AI Agents: Continuous Dynamics for Context Preservation
链接:https://arxiv.org/abs/2602.21220
作者:Subhadip Mitra
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:treats stored information, partial differential equations, continuous fields governed, present a memory, agents that treats
备注: 15 pages, 6 figures. Code: [this https URL](https://github.com/rotalabs/rotalabs-fieldmem)
点击查看摘要
Abstract:We present a memory system for AI agents that treats stored information as continuous fields governed by partial differential equations rather than discrete entries in a database. The approach draws from classical field theory: memories diffuse through semantic space, decay thermodynamically based on importance, and interact through field coupling in multi-agent scenarios. We evaluate the system on two established long-context benchmarks: LoCoMo (ACL 2024) with 300-turn conversations across 35 sessions, and LongMemEval (ICLR 2025) testing multi-session reasoning over 500+ turns. On LongMemEval, the field-theoretic approach achieves significant improvements: +116% F1 on multi-session reasoning (p0.01, d= 3.06), +43.8% on temporal reasoning (p0.001, d= 9.21), and +27.8% retrieval recall on knowledge updates (p0.001, d= 5.00). Multi-agent experiments show near-perfect collective intelligence (99.8%) through field coupling. Code is available at this http URL.
67. 【2602.21219】Reasoning-Based Personalized Generation for Users with Sparse Data
链接:https://arxiv.org/abs/2602.21219
作者:Bo Ni,Branislav Kveton,Samyadeep Basu,Subhojyoti Mukherjee,Leyao Wang,Franck Dernoncourt,Sungchul Kim,Seunghyun Yoon,Zichao Wang,Ruiyi Zhang,Puneet Mathur,Jihyung Kil,Jiuxiang Gu,Nedim Lipka,Yu Wang,Ryan A. Rossi,Tyler Derr
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Model, Large Language, Language Model, holds great promise, personalization holds great
备注:
点击查看摘要
Abstract:Large Language Model (LLM) personalization holds great promise for tailoring responses by leveraging personal context and history. However, real-world users usually possess sparse interaction histories with limited personal context, such as cold-start users in social platforms and newly registered customers in online E-commerce platforms, compromising the LLM-based personalized generation. To address this challenge, we introduce GraSPer (Graph-based Sparse Personalized Reasoning), a novel framework for enhancing personalized text generation under sparse context. GraSPer first augments user context by predicting items that the user would likely interact with in the future. With reasoning alignment, it then generates texts for these interactions to enrich the augmented context. In the end, it generates personalized outputs conditioned on both the real and synthetic histories, ensuring alignment with user style and preferences. Extensive experiments on three benchmark personalized generation datasets show that GraSPer achieves significant performance gain, substantially improving personalization in sparse user context settings.
68. 【2602.21218】EPSVec: Efficient and Private Synthetic Data Generation via Dataset Vectors
链接:https://arxiv.org/abs/2602.21218
作者:Amin Banayeeanzade,Qingchuan Yang,Deqing Fu,Spencer Hong,Erin Babinsky,Alfy Samuel,Anoop Kumar,Robin Jia,Sai Praneeth Karimireddy
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
关键词:modern machine learning, High-quality data, machine learning, freely shared, essential for modern
备注:
点击查看摘要
Abstract:High-quality data is essential for modern machine learning, yet many valuable corpora are sensitive and cannot be freely shared. Synthetic data offers a practical substitute for downstream development, and large language models (LLMs) have emerged as powerful engines for generating it. However, existing private text generation methods are severely inefficient: they are data-intensive, computationally slow, and often require large private corpora or batch sizes to achieve usable quality. We introduce EPSVec, a differentially-private lightweight alternative that steers LLM generation using *dataset vectors*--directions in activation space that capture the distributional gap between private data and public priors. EPSVec extracts and sanitizes steering vectors just once and then performs standard decoding. This decouples the privacy budget from generation, enabling arbitrarily many synthetic samples without additional privacy cost and yielding strong fidelity even in low-data regimes. Furthermore, we enhance our method by utilizing pretrained (base) models and introducing fixed-shot prompting to boost generation diversity and fidelity. Our experiments demonstrate that EPSVec outperforms existing baselines in distributional alignment and downstream utility, particularly in low-data regimes, while significantly reducing computational overhead.
69. 【2602.21217】Applied Sociolinguistic AI for Community Development (ASA-CD): A New Scientific Paradigm for Linguistically-Grounded Social Intervention
链接:https://arxiv.org/abs/2602.21217
作者:S M Ruhul Alam,Rifa Ferzana
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
关键词:establishes Applied Sociolinguistic, paper establishes Applied, Applied Sociolinguistic, addressing community challenges, establishes Applied
备注: 13 pages, 2 figures, 3 tables; simulation-based study introducing the ASA-CD framework
点击查看摘要
Abstract:This paper establishes Applied Sociolinguistic AI for Community Development (ASA-CD) as a novel scientific paradigm for addressing community challenges through linguistically grounded, AI-enabled intervention. ASA-CD introduces three key contributions: (1) linguistic biomarkers as computational indicators of discursive fragmentation; (2) development-aligned natural language processing (NLP), an AI optimisation paradigm prioritising collective outcomes; and (3) a standardised five-phase protocol for discursive intervention. A proof-of-concept study, incorporating real-world and synthetic corpora, demonstrates systematic associations between exclusionary language and negative sentiment and simulates intervention-based improvements. ASA-CD provides a unified methodological, ethical and empirical framework for scalable, value-aligned AI in the service of community empowerment.
70. 【2602.21216】EQ-5D Classification Using Biomedical Entity-Enriched Pre-trained Language Models and Multiple Instance Learning
链接:https://arxiv.org/abs/2602.21216
作者:Zhyar Rzgar K Rostam,Gábor Kertész
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:quality of life, standardized instrument, evaluation of health-related, health-related quality, Multiple Instance Learning
备注: 12 tables
点击查看摘要
Abstract:The EQ-5D (EuroQol 5-Dimensions) is a standardized instrument for the evaluation of health-related quality of life. In health economics, systematic literature reviews (SLRs) depend on the correct identification of publications that use the EQ-5D, but manual screening of large volumes of scientific literature is time-consuming, error-prone, and inconsistent. In this study, we investigate fine-tuning of general-purpose (BERT) and domain-specific (SciBERT, BioBERT) pre-trained language models (PLMs), enriched with biomedical entity information extracted through scispaCy models for each statement, to improve EQ-5D detection from abstracts. We conduct nine experimental setups, including combining three scispaCy models with three PLMs, and evaluate their performance at both the sentence and study levels. Furthermore, we explore a Multiple Instance Learning (MIL) approach with attention pooling to aggregate sentence-level information into study-level predictions, where each abstract is represented as a bag of enriched sentences (by scispaCy). The findings indicate consistent improvements in F1-scores (reaching 0.82) and nearly perfect recall at the study-level, significantly exceeding classical bag-of-words baselines and recently reported PLM baselines. These results show that entity enrichment significantly improves domain adaptation and model generalization, enabling more accurate automated screening in systematic reviews.
71. 【2602.21215】Inference-time Alignment via Sparse Junction Steering
链接:https://arxiv.org/abs/2602.21215
作者:Runyi Hu,Jie Zhang,Shiqian Zhao,Jiale Meng,Jiwei Li,Jason Zeng,Ming Wu,Michael Heinrich,Yonggang Wen,Tianwei Zhang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:enabling fine grained, fine grained control, large language models, enabling fine, parameter updates
备注: 28 pages, 17 figures
点击查看摘要
Abstract:Token-level steering has emerged as a pivotal approach for inference-time alignment, enabling fine grained control over large language models by modulating their output distributions without parameter updates. While effective, existing methods rely on dense intervention at every decoding step. This persistent manipulation not only incurs substantial computational overhead but also risks compromising generation quality by excessively drifting from the model's intrinsic distribution. In this work, we show that dense intervention is unnecessary and propose Sparse Inference time Alignment (SIA), which performs sparse junction steering by intervening only at critical decision points along the generation trajectory. Our key insight is that high entropy junctions mark pivotal decision points in the generation trajectory and are particularly susceptible to misalignment, indicating the need to introduce alignment related reward signals at these points. Extensive experiments across different model families and alignment objectives show that steering only 20% to 80% of tokens achieves superior alignment-efficiency trade offs. For strong base models such as Qwen3, intervening on as few as 20% of tokens matches or even surpasses heavily post-trained instruct models. This sparsity enables stronger guidance while better preserving the model's native distribution, integrates seamlessly with search based methods such as Best-of-N, and reduces computational cost by up to 6x.
72. 【2602.21214】oward Effective Multi-Domain Rumor Detection in Social Networks Using Domain-Gated Mixture-of-Experts
链接:https://arxiv.org/abs/2602.21214
作者:Mohadeseh Sheikhqoraei,Zainabolhoda Heshmati,Zeinab Rajabi,Leila Rabiei
类目:ocial and Information Networks (cs.SI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:Social media platforms, tracking rumors due, key channels, channels for spreading, spreading and tracking
备注:
点击查看摘要
Abstract:Social media platforms have become key channels for spreading and tracking rumors due to their widespread accessibility and ease of information sharing. Rumors can continuously emerge across diverse domains and topics, often with the intent to mislead society for personal or commercial gain. Therefore, developing methods that can accurately detect rumors at early stages is crucial to mitigating their negative impact. While existing approaches often specialize in single-domain detection, their performance degrades when applied to new domains due to shifts in data distribution, such as lexical patterns and propagation dynamics. To bridge this gap, this study introduces PerFact, a large-scale multi-domain rumor dataset comprising 8,034 annotated posts from the X platform, annotated into two primary categories: rumor (including true, false, and unverified rumors) and non-rumor. Annotator agreement, measured via Fleiss' Kappa ($\kappa = 0.74$), ensures high-quality labels. This research further proposes an effective multi-domain rumor detection model that employs a domain gate to dynamically aggregate multiple feature representations extracted through a Mixture-of-Experts method. Each expert combines CNN and BiLSTM networks to capture local syntactic features and long-range contextual dependencies. By leveraging both textual content and publisher information, the proposed model classifies posts into rumor and non-rumor categories with high accuracy. Evaluations demonstrate state-of-the-art performance, achieving an F1-score of 79.86\% and an accuracy of 79.98\% in multi-domain settings. Keywords: Rumor Detection, Multi-Domain, Natural Language Processing, Social Networks, Mixture-of-Experts Model
Subjects:
Social and Information Networks (cs.SI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Cite as:
arXiv:2602.21214 [cs.SI]
(or
arXiv:2602.21214v1 [cs.SI] for this version)
https://doi.org/10.48550/arXiv.2602.21214
Focus to learn more
arXiv-issued DOI via DataCite</p>
73. 【2602.21212】Disaster Question Answering with LoRA Efficiency and Accurate End Position
链接:https://arxiv.org/abs/2602.21212
作者:Takato Yasuno
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:limited geographic areas, volcanic eruptions occur, extremely low frequency, affect limited geographic, torrential rainfall
备注: 12 pages, 5 figures
点击查看摘要
Abstract:Natural disasters such as earthquakes, torrential rainfall, floods, and volcanic eruptions occur with extremely low frequency and affect limited geographic areas. When individuals face disaster situations, they often experience confusion and lack the domain-specific knowledge and experience necessary to determine appropriate responses and actions. While disaster information is continuously updated, even when utilizing RAG search and large language models for inquiries, obtaining relevant domain knowledge about natural disasters and experiences similar to one's specific situation is not guaranteed. When hallucinations are included in disaster question answering, artificial misinformation may spread and exacerbate confusion. This work introduces a disaster-focused question answering system based on Japanese disaster situations and response experiences. Utilizing the cl-tohoku/bert-base-japanese-v3 + Bi-LSTM + Enhanced Position Heads architecture with LoRA efficiency optimization, we achieved 70.4\% End Position accuracy with only 5.7\% of the total parameters (6.7M/117M). Experimental results demonstrate that the combination of Japanese BERT-base optimization and Bi-LSTM contextual understanding achieves accuracy levels suitable for real disaster response scenarios, attaining a 0.885 Span F1 score. Future challenges include: establishing natural disaster Q\A benchmark datasets, fine-tuning foundation models with disaster knowledge, developing lightweight and power-efficient edge AI Disaster Q\A applications for situations with insufficient power and communication during disasters, and addressing disaster knowledge base updates and continual learning capabilities.
74. 【2602.22039】G-ASR: Translation-Guided Learning with Parallel Gated Cross Attention for Low-Resource Automatic Speech Recognition
链接:https://arxiv.org/abs/2602.22039
作者:Cheng-Yeh Yang,Chien-Chun Wang,Li-Wei Chen,Hung-Shin Lee,Hsin-Min Wang,Berlin Chen
类目:Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
关键词:pose significant challenges, Taiwanese Hokkien drama, Taiwanese Hokkien, Hokkien drama speech, Taiwanese Hokkien exemplifies
备注: Accepted to LREC 2026
点击查看摘要
Abstract:Low-resource automatic speech recognition (ASR) continues to pose significant challenges, primarily due to the limited availability of transcribed data for numerous languages. While a wealth of spoken content is accessible in television dramas and online videos, Taiwanese Hokkien exemplifies this issue, with transcriptions often being scarce and the majority of available subtitles provided only in Mandarin. To address this deficiency, we introduce TG-ASR for Taiwanese Hokkien drama speech recognition, a translation-guided ASR framework that utilizes multilingual translation embeddings to enhance recognition performance in low-resource environments. The framework is centered around the parallel gated cross-attention (PGCA) mechanism, which adaptively integrates embeddings from various auxiliary languages into the ASR decoder. This mechanism facilitates robust cross-linguistic semantic guidance while ensuring stable optimization and minimizing interference between languages. To support ongoing research initiatives, we present YT-THDC, a 30-hour corpus of Taiwanese Hokkien drama speech with aligned Mandarin subtitles and manually verified Taiwanese Hokkien transcriptions. Comprehensive experiments and analyses identify the auxiliary languages that most effectively enhance ASR performance, achieving a 14.77% relative reduction in character error rate and demonstrating the efficacy of translation-guided learning for underrepresented languages in practical applications.
75. 【2602.21522】One Brain, Omni Modalities: Towards Unified Non-Invasive Brain Decoding with Large Language Models
链接:https://arxiv.org/abs/2602.21522
作者:Changli Tang,Shurui Li,Junliang Wang,Qinfan Xiao,Zhonghao Zhai,Lei Bai,Yu Qiao,Bowen Zhou,Wen Wu,Yuanning Li,Chao Zhang
类目:Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Deciphering brain function, recordings requires synthesizing, non-invasive recordings requires, textbf, synthesizing complementary high-frequency
备注:
点击查看摘要
Abstract:Deciphering brain function through non-invasive recordings requires synthesizing complementary high-frequency electromagnetic (EEG/MEG) and low-frequency metabolic (fMRI) signals. However, despite their shared neural origins, extreme discrepancies have traditionally confined these modalities to isolated analysis pipelines, hindering a holistic interpretation of brain activity. To bridge this fragmentation, we introduce \textbf{NOBEL}, a \textbf{n}euro-\textbf{o}mni-modal \textbf{b}rain-\textbf{e}ncoding \textbf{l}arge language model (LLM) that unifies these heterogeneous signals within the LLM's semantic embedding space. Our architecture integrates a unified encoder for EEG and MEG with a novel dual-path strategy for fMRI, aligning non-invasive brain signals and external sensory stimuli into a shared token space, then leverages an LLM as a universal backbone. Extensive evaluations demonstrate that NOBEL serves as a robust generalist across standard single-modal tasks. We also show that the synergistic fusion of electromagnetic and metabolic signals yields higher decoding accuracy than unimodal baselines, validating the complementary nature of multiple neural modalities. Furthermore, NOBEL exhibits strong capabilities in stimulus-aware decoding, effectively interpreting visual semantics from multi-subject fMRI data on the NSD and HAD datasets while uniquely leveraging direct stimulus inputs to verify causal links between sensory signals and neural responses. NOBEL thus takes a step towards unifying non-invasive brain decoding, demonstrating the promising potential of omni-modal brain understanding.
76. 【2602.21464】MiGUE-Speech: A Spontaneous Speech Dataset for Affective Analysis
链接:https://arxiv.org/abs/2602.21464
作者:Sofoklis Kakouros,Fang Kang,Haoyu Chen
类目:Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
关键词:work presents iMiGUE-Speech, work presents, spontaneous affective corpus, dataset, presents iMiGUE-Speech
备注: Accepted to Speech Prosody 2026
点击查看摘要
Abstract:This work presents iMiGUE-Speech, an extension of the iMiGUE dataset that provides a spontaneous affective corpus for studying emotional and affective states. The new release focuses on speech and enriches the original dataset with additional metadata, including speech transcripts, speaker-role separation between interviewer and interviewee, and word-level forced alignments. Unlike existing emotional speech datasets that rely on acted or laboratory-elicited emotions, iMiGUE-Speech captures spontaneous affect arising naturally from real match outcomes. To demonstrate the utility of the dataset and establish initial benchmarks, we introduce two evaluation tasks for comparative assessment: speech emotion recognition and transcript-based sentiment analysis. These tasks leverage state-of-the-art pre-trained representations to assess the dataset's ability to capture spontaneous affective states from both acoustic and linguistic modalities. iMiGUE-Speech can also be synchronously paired with micro-gesture annotations from the original iMiGUE dataset, forming a uniquely multimodal resource for studying speech-gesture affective dynamics. The extended dataset is available at this https URL.
77. 【2602.21229】Forecasting Future Language: Context Design for Mention Markets
链接:https://arxiv.org/abs/2602.21229
作者:Sumin Kim,Jihoon Kwon,Yoon Kim,Nicole Kagan,Raffi Khatchadourian,Wonbin Ahn,Alejandro Lopez-Lira,Jaewon Lee,Yoontae Hwang,Oscar Levy,Yongjae Lee,Chanyeol Choi
类目:General Finance (q-fin.GN); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:future public event, contracts resolve based, accurate probabilistic forecasts, require accurate probabilistic, public event
备注: 10 pages
点击查看摘要
Abstract:Mention markets, a type of prediction market in which contracts resolve based on whether a specified keyword is mentioned during a future public event, require accurate probabilistic forecasts of keyword-mention outcomes. While recent work shows that large language models (LLMs) can generate forecasts competitive with human forecasters, it remains unclear how input context should be designed to support accurate prediction. In this paper, we study this question through experiments on earnings-call mention markets, which require forecasting whether a company will mention a specified keyword during its upcoming call. We run controlled comparisons varying (i) which contextual information is provided (news and/or prior earnings-call transcripts) and (ii) how \textit{market probability}, (i.e., prediction market contract price) is used. We introduce Market-Conditioned Prompting (MCP), which explicitly treats the market-implied probability as a prior and instructs the LLM to update this prior using textual evidence, rather than re-predicting the base rate from scratch. In our experiments, we find three insights: (1) richer context consistently improves forecasting performance; (2) market-conditioned prompting (MCP), which treats the market probability as a prior and updates it using textual evidence, yields better-calibrated forecasts; and (3) a mixture of the market probability and MCP (MixMCP) outperforms the market baseline. By dampening the LLM's posterior update with the market prior, MixMCP yields more robust predictions than either the market or the LLM alone.
信息检索
1. 【2602.22182】LiCQA : A Lightweight Complex Question Answering System
链接:https://arxiv.org/abs/2602.22182
作者:Sourav Saha,Dwaipayan Roy,Mandar Mitra
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:implementing Question Answering, Question Answering, twenty years, significant progress, made in designing
备注:
点击查看摘要
Abstract:Over the last twenty years, significant progress has been made in designing and implementing Question Answering (QA) systems. However, addressing complex questions, the answers to which are spread across multiple documents, remains a challenging problem. Recent QA systems that are designed to handle complex questions work either on the basis of knowledge graphs, or utilise contem- porary neural models that are expensive to train, in terms of both computational resources and the volume of training data required. In this paper, we present LiCQA, an unsupervised question answer- ing model that works primarily on the basis of corpus evidence. We empirically compare the effectiveness and efficiency of LiCQA with two recently presented QA systems, which are based on different underlying principles. The results of our experiments show that LiCQA significantly outperforms these two state-of-the-art systems on benchmark data with noteworthy reduction in latency.
2. 【2602.21957】Learning to Collaborate via Structures: Cluster-Guided Item Alignment for Federated Recommendation
链接:https://arxiv.org/abs/2602.21957
作者:Yuchun Tu,Zhiwei Li,Bingli Sun,Yixuan Li,Xiao Song
类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:Federated recommendation facilitates, sensitive user interaction, user interaction data, keeping sensitive user, Federated recommendation
备注: 18 pages, 9 figures
点击查看摘要
Abstract:Federated recommendation facilitates collaborative model training across distributed clients while keeping sensitive user interaction data local. Conventional approaches typically rely on synchronizing high-dimensional item representations between the server and clients. This paradigm implicitly assumes that precise geometric alignment of embedding coordinates is necessary for collaboration across clients. We posit that establishing relative semantic relationships among items is more effective than enforcing shared representations. Specifically, global semantic relations serve as structural constraints for items. Within these constraints, the framework allows item representations to vary locally on each client, which flexibility enables the model to capture fine-grained user personalization while maintaining global consistency. To this end, we propose Cluster-Guided FedRec framework (CGFedRec), a framework that transforms uploaded embeddings into compact cluster labels. In this framework, the server functions as a global structure discoverer to learn item clusters and distributes only the resulting labels. This mechanism explicitly cuts off the downstream transmission of item embeddings, relieving clients from maintaining global shared item embeddings. Consequently, CGFedRec achieves the effective injection of global collaborative signals into local item representations without transmitting full embeddings. Extensive experiments demonstrate that our approach significantly improves communication efficiency while maintaining superior recommendation accuracy across multiple datasets.
3. 【2602.21756】Offline Reasoning for Efficient Recommendation: LLM-Empowered Persona-Profiled Item Indexing
链接:https://arxiv.org/abs/2602.21756
作者:Deogyong Kim,Junseong Lee,Jeongeun Lee,Changhoe Kim,Junguel Lee,Jungseok Lee,Dongha Lee
类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:rich semantic understanding, large language models, nuanced semantics, rich semantic, semantic understanding
备注: Under review
点击查看摘要
Abstract:Recent advances in large language models (LLMs) offer new opportunities for recommender systems by capturing the nuanced semantics of user interests and item characteristics through rich semantic understanding and contextual reasoning. In particular, LLMs have been employed as rerankers that reorder candidate items based on inferred user-item relevance. However, these approaches often require expensive online inference-time reasoning, leading to high latency that hampers real-world deployment. In this work, we introduce Persona4Rec, a recommendation framework that performs offline reasoning to construct interpretable persona representations of items, enabling lightweight and scalable real-time inference. In the offline stage, Persona4Rec leverages LLMs to reason over item reviews, inferring diverse user motivations that explain why different types of users may engage with an item; these inferred motivations are materialized as persona representations, providing multiple, human-interpretable views of each item. Unlike conventional approaches that rely on a single item representation, Persona4Rec learns to align user profiles with the most plausible item-side persona through a dedicated encoder, effectively transforming user-item relevance into user-persona relevance. At the online stage, this persona-profiled item index allows fast relevance computation without invoking expensive LLM reasoning. Extensive experiments show that Persona4Rec achieves performance comparable to recent LLM-based rerankers while substantially reducing inference time. Moreover, qualitative analysis confirms that persona representations not only drive efficient scoring but also provide intuitive, review-grounded explanations. These results demonstrate that Persona4Rec offers a practical and interpretable solution for next-generation recommender systems.
4. 【2602.21677】rie-Aware Transformers for Generative Recommendation
链接:https://arxiv.org/abs/2602.21677
作者:Zhenxiang Xu,Jiawei Chen,Sirui Chen,Yong He,Jieyu Yang,Chuan Yuan,Ke Ding,Can Wang
类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:casting next-item prediction, aligns with advances, score-based ranking, casting next-item, next-item prediction
备注:
点击查看摘要
Abstract:Generative recommendation (GR) aligns with advances in generative AI by casting next-item prediction as token-level generation rather than score-based ranking. Most GR methods adopt a two-stage pipeline: (i) \textit{item tokenization}, which maps each item to a sequence of discrete, hierarchically organized tokens; and (ii) \textit{autoregressive generation}, which predicts the next item's tokens conditioned on the tokens of user's interaction history. Although hierarchical tokenization induces a prefix tree (trie) over items, standard autoregressive modeling with conventional Transformers often flattens item tokens into a linear stream and overlooks the underlying topology. To address this, we propose TrieRec, a trie-aware generative recommendation method that augments Transformers with structural inductive biases via two positional encodings. First, a \textit{trie-aware absolute positional encoding} aggregates a token's (node's) local structural context (\eg depth, ancestors, and descendants) into the token representation. Second, a \textit{topology-aware relative positional encoding} injects pairwise structural relations into self-attention to capture topology-induced semantic relatedness. TrieRec is also model-agnostic, efficient, and hyperparameter-free. In our experiments, we implement TrieRec within three representative GR backbones, achieving notably improvements of 8.83\% on average across four real-world datasets.
Subjects:
Information Retrieval (cs.IR); Machine Learning (cs.LG)
Cite as:
arXiv:2602.21677 [cs.IR]
(or
arXiv:2602.21677v1 [cs.IR] for this version)
https://doi.org/10.48550/arXiv.2602.21677
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
5. 【2602.21600】AQR-HNSW: Accelerating Approximate Nearest Neighbor Search via Density-aware Quantization and Multi-stage Re-ranking
链接:https://arxiv.org/abs/2602.21600
作者:Ganap Ashit Tewary,Nrusinga Charan Gantayat,Jeff Zhang
类目:Information Retrieval (cs.IR)
关键词:Approximate Nearest Neighbor, Approximate Nearest, Nearest Neighbor, large language models, powering recommendation systems
备注: Accepted at DAC 2026
点击查看摘要
Abstract:Approximate Nearest Neighbor (ANN) search has become fundamental to modern AI infrastructure, powering recommendation systems, search engines, and large language models across industry leaders from Google to OpenAI. Hierarchical Navigable Small World (HNSW) graphs have emerged as the dominant ANN algorithm, widely adopted in production systems due to their superior recall versus latency balance. However, as vector databases scale to billions of embeddings, HNSW faces critical bottlenecks: memory consumption expands, distance computation overhead dominates query latency, and it suffers suboptimal performance on heterogeneous data distributions. This paper presents Adaptive Quantization and Rerank HNSW (AQR-HNSW), a novel framework that synergistically integrates three strategies to enhance HNSW scalability. AQR-HNSW introduces (1) density-aware adaptive quantization, achieving 4x compression while preserving distance relationships; (2) multi-state re-ranking that reduces unnecessary computations by 35%; and (3) quantization-optimized SIMD implementations delivering 16-64 operations per cycle across architectures. Evaluation on standard benchmarks demonstrates 2.5-3.3x higher queries per second (QPS) than state-of-the-art HNSW implementations while maintaining over 98% recall, with 75% memory reduction for the index graph and 5x faster index construction.
6. 【2602.21598】Retrieval Challenges in Low-Resource Public Service Information: A Case Study on Food Pantry Access
链接:https://arxiv.org/abs/2602.21598
作者:Touseef Hasan,Laila Cure,Souvika Sarkar
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:inconsistently formatted, Public service information, service information systems, service information, retrieval
备注: 3 pages, 1 figure
点击查看摘要
Abstract:Public service information systems are often fragmented, inconsistently formatted, and outdated. These characteristics create low-resource retrieval environments that hinder timely access to critical services. We investigate retrieval challenges in such settings through the domain of food pantry access, a socially urgent problem given persistent food insecurity. We develop an AI-powered conversational retrieval system that scrapes and indexes publicly available pantry data and employs a Retrieval-Augmented Generation (RAG) pipeline to support natural language queries via a web interface. We conduct a pilot evaluation study using community-sourced queries to examine system behavior in realistic scenarios. Our analysis reveals key limitations in retrieval robustness, handling underspecified queries, and grounding over inconsistent knowledge bases. This ongoing work exposes fundamental IR challenges in low-resource environments and motivates future research on robust conversational retrieval to improve access to critical public resources.
7. 【2602.21553】Revisiting RAG Retrievers: An Information Theoretic Benchmark
链接:https://arxiv.org/abs/2602.21553
作者:Wenqing Zheng,Dmitri Kalaev,Noah Fatsi,Daniel Barcklow,Owen Reinert,Igor Melnyk,Senthil Kumar,C. Bayan Bruss
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:large language models, surface relevant context, Retrieval-Augmented Generation, systems rely critically, language models
备注:
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) systems rely critically on the retriever module to surface relevant context for large language models. Although numerous retrievers have recently been proposed, each built on different ranking principles such as lexical matching, dense embeddings, or graph citations, there remains a lack of systematic understanding of how these mechanisms differ and overlap. Existing benchmarks primarily compare entire RAG pipelines or introduce new datasets, providing little guidance on selecting or combining retrievers themselves. Those that do compare retrievers directly use a limited set of evaluation tools which fail to capture complementary and overlapping strengths. This work presents MIGRASCOPE, a Mutual Information based RAG Retriever Analysis Scope. We revisit state-of-the-art retrievers and introduce principled metrics grounded in information and statistical estimation theory to quantify retrieval quality, redundancy, synergy, and marginal contribution. We further show that if chosen carefully, an ensemble of retrievers outperforms any single retriever. We leverage the developed tools over major RAG corpora to provide unique insights on contribution levels of the state-of-the-art retrievers. Our findings provide a fresh perspective on the structure of modern retrieval techniques and actionable guidance for designing robust and efficient RAG systems.
8. 【2602.21543】Enhancing Multilingual Embeddings via Multi-Way Parallel Text Alignment
链接:https://arxiv.org/abs/2602.21543
作者:Barah Fazili,Koustava Goswami
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:pretraining typically lacks, typically lacks explicit, explicit alignment signals, Multilingual pretraining typically, lacks explicit alignment
备注:
点击查看摘要
Abstract:Multilingual pretraining typically lacks explicit alignment signals, leading to suboptimal cross-lingual alignment in the representation space. In this work, we show that training standard pretrained models for cross-lingual alignment with a multi-way parallel corpus in a diverse pool of languages can substantially improve multilingual and cross-lingual representations for NLU tasks. We construct a multi-way parallel dataset using translations of English text from an off-the-shelf NMT model for a pool of six target languages and achieve strong cross-lingual alignment through contrastive learning. This leads to substantial performance gains across both seen and unseen languages for multiple tasks from the MTEB benchmark evaluated for XLM-Roberta and multilingual BERT base models. Using a multi-way parallel corpus for contrastive training yields substantial gains on bitext mining (21.3%), semantic similarity (5.3%), and classification (28.4%) compared to English-centric (En-X) bilingually parallel data, where X is sampled from a pool of multiple target languages. Furthermore, finetuning mE5 model on a small dataset with multi-way parallelism significantly improves bitext mining compared to one without, underscoring the importance of multi-way cross-lingual supervision even for models already pretrained for high-quality sentence embeddings.
9. 【2602.21480】Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"?
链接:https://arxiv.org/abs/2602.21480
作者:Germán T. Eizaguirre,Lars Tissen,Marc Sánchez-Artigas
类目:Databases (cs.DB); Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:extensively benchmarked fields, Big Data, Big Data workflows, benchmarked fields, evaluates them jointly
备注: 11 pages, 4 figures
点击查看摘要
Abstract:Text-to-SQL and Big Data are both extensively benchmarked fields, yet there is limited research that evaluates them jointly. In the real world, Text-to-SQL systems are often embedded with Big Data workflows, such as large-scale data processing or interactive data analytics. We refer to this as "Text-to-Big SQL". However, existing text-to-SQL benchmarks remain narrowly scoped and overlook the cost and performance implications that arise at scale. For instance, translation errors that are minor on small datasets lead to substantial cost and latency overheads as data scales, a relevant issue completely ignored by text-to-SQL metrics. In this paper, we overcome this overlooked challenge by introducing novel and representative metrics for evaluating Text-to-Big SQL. Our study focuses on production-level LLM agents, a database-agnostic system adaptable to diverse user needs. Via an extensive evaluation of frontier models, we show that text-to-SQL metrics are insufficient for Big Data. In contrast, our proposed text-to-Big SQL metrics accurately reflect execution efficiency, cost, and the impact of data scale. Furthermore, we provide LLM-specific insights, including fine-grained, cross-model comparisons of latency and cost.
Comments:
11 pages, 4 figures
Subjects:
Databases (cs.DB); Computation and Language (cs.CL); Information Retrieval (cs.IR)
ACMclasses:
I.2.7; H.2.8
Cite as:
arXiv:2602.21480 [cs.DB]
(or
arXiv:2602.21480v1 [cs.DB] for this version)
https://doi.org/10.48550/arXiv.2602.21480
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
10. 【2602.21456】Revisiting Text Ranking in Deep Research
链接:https://arxiv.org/abs/2602.21456
作者:Chuan Meng,Litu Ou,Sean MacAvaney,Jeff Dalton
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:extensive open-web exploration, address hard queries, Deep research, open-web exploration, important task
备注:
点击查看摘要
Abstract:Deep research has emerged as an important task that aims to address hard queries through extensive open-web exploration. To tackle it, most prior work equips large language model (LLM)-based agents with opaque web search APIs, enabling agents to iteratively issue search queries, retrieve external evidence, and reason over it. Despite search's essential role in deep research, black-box web search APIs hinder systematic analysis of search components, leaving the behaviour of established text ranking methods in deep research largely unclear. To fill this gap, we reproduce a selection of key findings and best practices for IR text ranking methods in the deep research setting. In particular, we examine their effectiveness from three perspectives: (i) retrieval units (documents vs. passages), (ii) pipeline configurations (different retrievers, re-rankers, and re-ranking depths), and (iii) query characteristics (the mismatch between agent-issued queries and the training queries of text rankers). We perform experiments on BrowseComp-Plus, a deep research dataset with a fixed corpus, evaluating 2 open-source agents, 5 retrievers, and 3 re-rankers across diverse setups. We find that agent-issued queries typically follow web-search-style syntax (e.g., quoted exact matches), favouring lexical, learned sparse, and multi-vector retrievers; passage-level units are more efficient under limited context windows, and avoid the difficulties of document length normalisation in lexical retrieval; re-ranking is highly effective; translating agent-issued queries into natural-language questions significantly bridges the query mismatch.
11. 【2602.21351】A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives
链接:https://arxiv.org/abs/2602.21351
作者:Dmitrii Pantiukhin,Ivan Kuznetsov,Boris Shapkin,Antonia Anna Jost,Thomas Jung,Nikolay Koldunov
类目:Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
关键词:PANGAEA host vast, significant scalability challenge, portion remains underutilized, host vast collections, substantial portion remains
备注: 20 pages, 6 figures, 7 tables, supplementary material included
点击查看摘要
Abstract:The rapid accumulation of Earth science data has created a significant scalability challenge; while repositories like PANGAEA host vast collections of datasets, citation metrics indicate that a substantial portion remains underutilized, limiting data reusability. Here we present PANGAEA-GPT, a hierarchical multi-agent framework designed for autonomous data discovery and analysis. Unlike standard Large Language Model (LLM) wrappers, our architecture implements a centralized Supervisor-Worker topology with strict data-type-aware routing, sandboxed deterministic code execution, and self-correction via execution feedback, enabling agents to diagnose and resolve runtime errors. Through use-case scenarios spanning physical oceanography and ecology, we demonstrate the system's capacity to execute complex, multi-step workflows with minimal human intervention. This framework provides a methodology for querying and analyzing heterogeneous repository data through coordinated agent workflows.
12. 【2602.21247】PiPNN: Ultra-Scalable Graph-Based Nearest Neighbor Indexing
链接:https://arxiv.org/abs/2602.21247
作者:Tobias Rubel,Richard Wen,Laxman Dhulipala,Lars Gottesbüren,Rajesh Jayaram,Jakub Łącki
类目:Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)
关键词:Approximate Nearest Neighbor, Nearest Neighbor Search, Neighbor Search today, Approximate Nearest, Nearest Neighbor
备注:
点击查看摘要
Abstract:The fastest indexes for Approximate Nearest Neighbor Search today are also the slowest to build: graph-based methods like HNSW and Vamana achieve state-of-the-art query performance but have large construction times due to relying on random-access-heavy beam searches. We introduce PiPNN (Pick-in-Partitions Nearest Neighbors), an ultra-scalable graph construction algorithm that avoids this ``search bottleneck'' that existing graph-based methods suffer from. PiPNN's core innovation is HashPrune, a novel online pruning algorithm which dynamically maintains sparse collections of edges. HashPrune enables PiPNN to partition the dataset into overlapping sub-problems, efficiently perform bulk distance comparisons via dense matrix multiplication kernels, and stream a subset of the edges into HashPrune. HashPrune guarantees bounded memory during index construction which permits PiPNN to build higher quality indices without the use of extra intermediate memory. PiPNN builds state-of-the-art indexes up to 11.6x faster than Vamana (DiskANN) and up to 12.9x faster than HNSW. PiPNN is significantly more scalable than recent algorithms for fast graph construction. PiPNN builds indexes at least 19.1x faster than MIRAGE and 17.3x than FastKCNA while producing indexes that achieve higher query throughput. PiPNN enables us to build, for the first time, high-quality ANN indexes on billion-scale datasets in under 20 minutes using a single multicore machine.
Subjects:
Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)
Cite as:
arXiv:2602.21247 [cs.DB]
(or
arXiv:2602.21247v1 [cs.DB] for this version)
https://doi.org/10.48550/arXiv.2602.21247
Focus to learn more
arXiv-issued DOI via DataCite</p>
13. 【2602.21214】oward Effective Multi-Domain Rumor Detection in Social Networks Using Domain-Gated Mixture-of-Experts
链接:https://arxiv.org/abs/2602.21214
作者:Mohadeseh Sheikhqoraei,Zainabolhoda Heshmati,Zeinab Rajabi,Leila Rabiei
类目:ocial and Information Networks (cs.SI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:Social media platforms, tracking rumors due, key channels, channels for spreading, spreading and tracking
备注:
点击查看摘要
Abstract:Social media platforms have become key channels for spreading and tracking rumors due to their widespread accessibility and ease of information sharing. Rumors can continuously emerge across diverse domains and topics, often with the intent to mislead society for personal or commercial gain. Therefore, developing methods that can accurately detect rumors at early stages is crucial to mitigating their negative impact. While existing approaches often specialize in single-domain detection, their performance degrades when applied to new domains due to shifts in data distribution, such as lexical patterns and propagation dynamics. To bridge this gap, this study introduces PerFact, a large-scale multi-domain rumor dataset comprising 8,034 annotated posts from the X platform, annotated into two primary categories: rumor (including true, false, and unverified rumors) and non-rumor. Annotator agreement, measured via Fleiss' Kappa ($\kappa = 0.74$), ensures high-quality labels. This research further proposes an effective multi-domain rumor detection model that employs a domain gate to dynamically aggregate multiple feature representations extracted through a Mixture-of-Experts method. Each expert combines CNN and BiLSTM networks to capture local syntactic features and long-range contextual dependencies. By leveraging both textual content and publisher information, the proposed model classifies posts into rumor and non-rumor categories with high accuracy. Evaluations demonstrate state-of-the-art performance, achieving an F1-score of 79.86\% and an accuracy of 79.98\% in multi-domain settings. Keywords: Rumor Detection, Multi-Domain, Natural Language Processing, Social Networks, Mixture-of-Experts Model
Subjects:
Social and Information Networks (cs.SI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Cite as:
arXiv:2602.21214 [cs.SI]
(or
arXiv:2602.21214v1 [cs.SI] for this version)
https://doi.org/10.48550/arXiv.2602.21214
Focus to learn more
arXiv-issued DOI via DataCite</p>
14. 【2602.21212】Disaster Question Answering with LoRA Efficiency and Accurate End Position
链接:https://arxiv.org/abs/2602.21212
作者:Takato Yasuno
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:limited geographic areas, volcanic eruptions occur, extremely low frequency, affect limited geographic, torrential rainfall
备注: 12 pages, 5 figures
点击查看摘要
Abstract:Natural disasters such as earthquakes, torrential rainfall, floods, and volcanic eruptions occur with extremely low frequency and affect limited geographic areas. When individuals face disaster situations, they often experience confusion and lack the domain-specific knowledge and experience necessary to determine appropriate responses and actions. While disaster information is continuously updated, even when utilizing RAG search and large language models for inquiries, obtaining relevant domain knowledge about natural disasters and experiences similar to one's specific situation is not guaranteed. When hallucinations are included in disaster question answering, artificial misinformation may spread and exacerbate confusion. This work introduces a disaster-focused question answering system based on Japanese disaster situations and response experiences. Utilizing the cl-tohoku/bert-base-japanese-v3 + Bi-LSTM + Enhanced Position Heads architecture with LoRA efficiency optimization, we achieved 70.4\% End Position accuracy with only 5.7\% of the total parameters (6.7M/117M). Experimental results demonstrate that the combination of Japanese BERT-base optimization and Bi-LSTM contextual understanding achieves accuracy levels suitable for real disaster response scenarios, attaining a 0.885 Span F1 score. Future challenges include: establishing natural disaster Q\A benchmark datasets, fine-tuning foundation models with disaster knowledge, developing lightweight and power-efficient edge AI Disaster Q\A applications for situations with insufficient power and communication during disasters, and addressing disaster knowledge base updates and continual learning capabilities.
15. 【2407.13416】he Language of Infographics: Toward Understanding Conceptual Metaphor Use in Scientific Storytelling
链接:https://arxiv.org/abs/2407.13416
作者:Hana Pokojná,Tobias Isenberg,Stefan Bruckner,Barbora Kozlíková,Laura Garrison
类目:Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC); Information Theory (cs.IT)
关键词:Conceptual Metaphor Theory, visual conceptual metaphors, visual conceptual, Metaphor Theory, conceptual metaphors
备注: 11 pages, 8 figures, 1 table, accepted to IEEE VIS 2024 Conference
点击查看摘要
Abstract:We apply an approach from cognitive linguistics by mapping Conceptual Metaphor Theory (CMT) to the visualization domain to address patterns of visual conceptual metaphors that are often used in science infographics. Metaphors play an essential part in visual communication and are frequently employed to explain complex concepts. However, their use is often based on intuition, rather than following a formal process. At present, we lack tools and language for understanding and describing metaphor use in visualization to the extent where taxonomy and grammar could guide the creation of visual components, e.g., infographics. Our classification of the visual conceptual mappings within scientific representations is based on the breakdown of visual components in existing scientific infographics. We demonstrate the development of this mapping through a detailed analysis of data collected from four domains (biomedicine, climate, space, and anthropology) that represent a diverse range of visual conceptual metaphors used in the visual communication of science. This work allows us to identify patterns of visual conceptual metaphor use within the domains, resolve ambiguities about why specific conceptual metaphors are used, and develop a better overall understanding of visual metaphor use in scientific infographics. Our analysis shows that ontological and orientational conceptual metaphors are the most widely applied to translate complex scientific concepts. To support our findings we developed a visual exploratory tool based on the collected database that places the individual infographics on a spatio-temporal scale and illustrates the breakdown of visual conceptual metaphors.
计算机视觉
1. 【2602.22212】Neu-PiG: Neural Preconditioned Grids for Fast Dynamic Surface Reconstruction on Long Sequences
链接:https://arxiv.org/abs/2602.22212
作者:Julian Kaltheuner,Hannah Dröge,Markus Plack,Patrick Stotko,Reinhard Klein
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Temporally consistent surface, data remains challenging, unstructured point cloud, point cloud data, cloud data remains
备注: CVPR 2026, Code: [this https URL](https://github.com/vc-bonn/neu-pig)
点击查看摘要
Abstract:Temporally consistent surface reconstruction of dynamic 3D objects from unstructured point cloud data remains challenging, especially for very long sequences. Existing methods either optimize deformations incrementally, risking drift and requiring long runtimes, or rely on complex learned models that demand category-specific training. We present Neu-PiG, a fast deformation optimization method based on a novel preconditioned latent-grid encoding that distributes spatial features parameterized on the position and normal direction of a keyframe surface. Our method encodes entire deformations across all time steps at various spatial scales into a multi-resolution latent grid, parameterized by the position and normal direction of a reference surface from a single keyframe. This latent representation is then augmented for time modulation and decoded into per-frame 6-DoF deformations via a lightweight multilayer perceptron (MLP). To achieve high-fidelity, drift-free surface reconstructions in seconds, we employ Sobolev preconditioning during gradient-based training of the latent space, completely avoiding the need for any explicit correspondences or further priors. Experiments across diverse human and animal datasets demonstrate that Neu-PiG outperforms state-the-art approaches, offering both superior accuracy and scalability to long sequences while running at least 60x faster than existing training-free methods and achieving inference speeds on the same order as heavy pretrained models.
2. 【2602.22209】WHOLE: World-Grounded Hand-Object Lifted from Egocentric Videos
链接:https://arxiv.org/abs/2602.22209
作者:Yufei Ye,Jiaman Li,Ryan Rong,C. Karen Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:highly challenging due, frequent object entries, person moves, highly challenging, challenging due
备注: Project website: [this https URL](https://judyye.github.io/whole-www)
点击查看摘要
Abstract:Egocentric manipulation videos are highly challenging due to severe occlusions during interactions and frequent object entries and exits from the camera view as the person moves. Current methods typically focus on recovering either hand or object pose in isolation, but both struggle during interactions and fail to handle out-of-sight cases. Moreover, their independent predictions often lead to inconsistent hand-object relations. We introduce WHOLE, a method that holistically reconstructs hand and object motion in world space from egocentric videos given object templates. Our key insight is to learn a generative prior over hand-object motion to jointly reason about their interactions. At test time, the pretrained prior is guided to generate trajectories that conform to the video observations. This joint generative reconstruction substantially outperforms approaches that process hands and objects separately followed by post-processing. WHOLE achieves state-of-the-art performance on hand motion estimation, 6D object pose estimation, and their relative interaction reconstruction. Project website: this https URL
3. 【2602.22208】Solaris: Building a Multiplayer Video World Model in Minecraft
链接:https://arxiv.org/abs/2602.22208
作者:Georgy Savva,Oscar Michel,Daohan Lu,Suppakit Waiwitlikhit,Timothy Meehan,Dhairya Mishra,Srivats Poddar,Jack Lu,Saining Xie
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:single-agent perspectives, real-world environments, limited to single-agent, Existing action-conditioned video, world models
备注: Project website: [this https URL](https://solaris-wm.github.io/)
点击查看摘要
Abstract:Existing action-conditioned video generation models (video world models) are limited to single-agent perspectives, failing to capture the multi-agent interactions of real-world environments. We introduce Solaris, a multiplayer video world model that simulates consistent multi-view observations. To enable this, we develop a multiplayer data system designed for robust, continuous, and automated data collection on video games such as Minecraft. Unlike prior platforms built for single-player settings, our system supports coordinated multi-agent interaction and synchronized videos + actions capture. Using this system, we collect 12.64 million multiplayer frames and propose an evaluation framework for multiplayer movement, memory, grounding, building, and view consistency. We train Solaris using a staged pipeline that progressively transitions from single-player to multiplayer modeling, combining bidirectional, causal, and Self Forcing training. In the final stage, we introduce Checkpointed Self Forcing, a memory-efficient Self Forcing variant that enables a longer-horizon teacher. Results show our architecture and training design outperform existing baselines. Through open-sourcing our system and models, we hope to lay the groundwork for a new generation of multi-agent world models.
4. 【2602.22197】Off-The-Shelf Image-to-Image Models Are All You Need To Defeat Image Protection Schemes
链接:https://arxiv.org/abs/2602.22197
作者:Xavier Pleimling,Sifat Muhammad Abdullah,Gunjan Balde,Peng Gao,Mainack Mondal,Murtuza Jadliwala,Bimal Viswanath
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Advances in Generative, strategies to prevent, prevent the unauthorized, Advances, Generative
备注: This work has been accepted for publication at the IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). The final version will be available on IEEE Xplore. To IEEE SaTML 2026
点击查看摘要
Abstract:Advances in Generative AI (GenAI) have led to the development of various protection strategies to prevent the unauthorized use of images. These methods rely on adding imperceptible protective perturbations to images to thwart misuse such as style mimicry or deepfake manipulations. Although previous attacks on these protections required specialized, purpose-built methods, we demonstrate that this is no longer necessary. We show that off-the-shelf image-to-image GenAI models can be repurposed as generic ``denoisers" using a simple text prompt, effectively removing a wide range of protective perturbations. Across 8 case studies spanning 6 diverse protection schemes, our general-purpose attack not only circumvents these defenses but also outperforms existing specialized attacks while preserving the image's utility for the adversary. Our findings reveal a critical and widespread vulnerability in the current landscape of image protection, indicating that many schemes provide a false sense of security. We stress the urgent need to develop robust defenses and establish that any future protection mechanism must be benchmarked against attacks from off-the-shelf GenAI models. Code is available in this repository: this https URL
5. 【2602.22176】Mixed Magnification Aggregation for Generalizable Region-Level Representations in Computational Pathology
链接:https://arxiv.org/abs/2602.22176
作者:Eric Zimmermann,Julian Viret,Michal Zelechowski,James Brian Hall,Neil Tenenholtz,Adam Casson,George Shaikovski,Eugene Vorontsov,Siqi Liu,Kristen A Severson
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:standard computational pathology, computational pathology workflow, recent years, standard computational, computational pathology
备注:
点击查看摘要
Abstract:In recent years, a standard computational pathology workflow has emerged where whole slide images are cropped into tiles, these tiles are processed using a foundation model, and task-specific models are built using the resulting representations. At least 15 different foundation models have been proposed, and the vast majority are trained exclusively with tiles using the 20$\times$ magnification. However, it is well known that certain histologic features can only be discerned with larger context windows and requires a pathologist to zoom in and out when analyzing a whole slide image. Furthermore, creating 224$\times$224 pixel crops at 20$\times$ leads to a large number of tiles per slide, which can be gigapixel in size. To more accurately capture multi-resolution features and investigate the possibility of reducing the number of representations per slide, we propose a region-level mixing encoder. Our approach jointly fuses image tile representations of a mixed magnification foundation model using a masked embedding modeling pretraining step. We explore a design space for pretraining the proposed mixed-magnification region aggregators and evaluate our models on transfer to biomarker prediction tasks representing various cancer types. Results demonstrate cancer dependent improvements in predictive performance, highlighting the importance of spatial context and understanding.
6. 【2602.22159】CASR: A Robust Cyclic Framework for Arbitrary Large-Scale Super-Resolution with Distribution Alignment and Self-Similarity Awareness
链接:https://arxiv.org/abs/2602.22159
作者:Wenhao Guo,Zhaoran Zhao,Peng Lu,Sheng Li,Qian Qiao,RuiDe Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:remains fundamentally limited, artifacts accumulate sharply, cross-scale distribution shift, remains fundamentally, training range
备注:
点击查看摘要
Abstract:Arbitrary-Scale SR (ASISR) remains fundamentally limited by cross-scale distribution shift: once the inference scale leaves the training range, noise, blur, and artifacts accumulate sharply. We revisit this challenge from a cross-scale distribution transition perspective and propose CASR, a simple yet highly efficient cyclic SR framework that reformulates ultra-magnification as a sequence of in-distribution scale transitions. This design ensures stable inference at arbitrary scales while requiring only a single model. CASR tackles two major bottlenecks: distribution drift across iterations and patch-wise diffusion inconsistencies. The proposed SDAM module aligns structural distributions via superpixel aggregation, preventing error accumulation, while SARM module restores high-frequency textures by enforcing autocorrelation and embedding LR self-similarity priors. Despite using only a single model, our approach significantly reduces distribution drift, preserves long-range texture consistency, and achieves superior generalization even at extreme magnification.
7. 【2602.22150】CoLoGen: Progressive Learning of Concept`-`Localization Duality for Unified Image Generation
链接:https://arxiv.org/abs/2602.22150
作者:YuXin Song,Yu Lu,Haoyuan Sun,Huanjin Yao,Fanglong Liu,Yifan Sun,Haocheng Feng,Hang Zhou,Jingdong Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:generation remains difficult, remains difficult, depend on fundamentally, fundamentally different internal, Unified conditional image
备注: Accepted by CVPR2026. 15 pages, 8 figures
点击查看摘要
Abstract:Unified conditional image generation remains difficult because different tasks depend on fundamentally different internal representations. Some require conceptual understanding for semantic synthesis, while others rely on localization cues for spatial precision. Forcing these heterogeneous tasks to share a single representation leads to concept-localization representational conflict. To address this issue, we propose CoLoGen, a unified diffusion framework that progressively learns and reconciles this concept-localization duality. CoLoGen uses a staged curriculum that first builds core conceptual and localization abilities, then adapts them to diverse visual conditions, and finally refines their synergy for complex instruction-driven tasks. Central to this process is the Progressive Representation Weaving (PRW) module, which dynamically routes features to specialized experts and stably integrates their outputs across stages. Experiments on editing, controllable generation, and customized generation show that CoLoGen achieves competitive or superior performance, offering a principled representational perspective for unified image generation.
8. 【2602.22144】NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors
链接:https://arxiv.org/abs/2602.22144
作者:Lingfeng Ren,Weihao Yu,Runpeng Yu,Xinchao Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Vision-Language Models, Vision-Language Models, issue in Large, Large Vision-Language, outputs include objects
备注: Code: [this https URL](https://github.com/lingfengren/NoLan)
点击查看摘要
Abstract:Object hallucination is a critical issue in Large Vision-Language Models (LVLMs), where outputs include objects that do not appear in the input image. A natural question arises from this phenomenon: Which component of the LVLM pipeline primarily contributes to object hallucinations? The vision encoder to perceive visual information, or the language decoder to generate text responses? In this work, we strive to answer this question through designing a systematic experiment to analyze the roles of the vision encoder and the language decoder in hallucination generation. Our observations reveal that object hallucinations are predominantly associated with the strong priors from the language decoder. Based on this finding, we propose a simple and training-free framework, No-Language-Hallucination Decoding, NoLan, which refines the output distribution by dynamically suppressing language priors, modulated based on the output distribution difference between multimodal and text-only inputs. Experimental results demonstrate that NoLan effectively reduces object hallucinations across various LVLMs on different tasks. For instance, NoLan achieves substantial improvements on POPE, enhancing the accuracy of LLaVA-1.5 7B and Qwen-VL 7B by up to 6.45 and 7.21, respectively. The code is publicly available at: this https URL.
9. 【2602.22143】MedTri: A Platform for Structured Medical Report Normalization to Enhance Vision-Language Pretraining
链接:https://arxiv.org/abs/2602.22143
作者:Yuetan Chu,Xinhua Ma,Xinran Jin,Gongning Luo,Xin Gao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:large-scale supervisory signals, Medical vision-language pretraining, pretraining increasingly relies, vision-language pretraining increasingly, substantial stylistic heterogeneity
备注:
点击查看摘要
Abstract:Medical vision-language pretraining increasingly relies on medical reports as large-scale supervisory signals; however, raw reports often exhibit substantial stylistic heterogeneity, variable length, and a considerable amount of image-irrelevant content. Although text normalization is frequently adopted as a preprocessing step in prior work, its design principles and empirical impact on vision-language pretraining remain insufficiently and systematically examined. In this study, we present MedTri, a deployable normalization framework for medical vision-language pretraining that converts free-text reports into a unified [Anatomical Entity: Radiologic Description + Diagnosis Category] triplet. This structured, anatomy-grounded normalization preserves essential morphological and spatial information while removing stylistic noise and image-irrelevant content, providing consistent and image-grounded textual supervision at scale. Across multiple datasets spanning both X-ray and computed tomography (CT) modalities, we demonstrate that structured, anatomy-grounded text normalization is an important factor in medical vision-language pretraining quality, yielding consistent improvements over raw reports and existing normalization baselines. In addition, we illustrate how this normalization can easily support modular text-level augmentation strategies, including knowledge enrichment and anatomy-grounded counterfactual supervision, which provide complementary gains in robustness and generalization without altering the core normalization process. Together, our results position structured text normalization as a critical and generalizable preprocessing component for medical vision-language learning, while MedTri provides this normalization platform. Code and data will be released at this https URL.
10. 【2602.22142】WeaveTime: Stream from Earlier Frames into Emergent Memory in VideoLLMs
链接:https://arxiv.org/abs/2602.22142
作者:Yulin Zhang,Cheng Shi,Sibei Yang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, greatly improved visual
备注: Accepted at CVPR 2026 (preview; camera-ready in preparation)
点击查看摘要
Abstract:Recent advances in Multimodal Large Language Models have greatly improved visual understanding and reasoning, yet their quadratic attention and offline training protocols make them ill-suited for streaming settings where frames arrive sequentially and future observations are inaccessible. We diagnose a core limitation of current Video-LLMs, namely Time-Agnosticism, in which videos are treated as an unordered bag of evidence rather than a causally ordered sequence, yielding two failures in streams: temporal order ambiguity, in which the model cannot follow or reason over the correct chronological order, and past-current focus blindness where it fails to distinguish present observations from accumulated history. We present WeaveTime, a simple, efficient, and model agnostic framework that first teaches order and then uses order. We introduce a lightweight Temporal Reconstruction objective-our Streaming Order Perception enhancement-that instills order aware representations with minimal finetuning and no specialized streaming data. At inference, a Past-Current Dynamic Focus Cache performs uncertainty triggered, coarse-to-fine retrieval, expanding history only when needed. Plugged into exsiting Video-LLM without architectural changes, WeaveTime delivers consistent gains on representative streaming benchmarks, improving accuracy while reducing latency. These results establish WeaveTime as a practical path toward time aware stream Video-LLMs under strict online, time causal constraints. Code and weights will be made publicly available. Project Page: this https URL
11. 【2602.22120】GeoDiv: Framework For Measuring Geographical Diversity In Text-To-Image Models
链接:https://arxiv.org/abs/2602.22120
作者:Abhipsa Basu,Mohana Singh,Shashank Agnihotri,Margret Keuper,R. Venkatesh Babu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:rapidly gaining popularity, reinforce stereotypes, gaining popularity, misrepresent regions, Visual Diversity Index
备注: ICLR 2026
点击查看摘要
Abstract:Text-to-image (T2I) models are rapidly gaining popularity, yet their outputs often lack geographical diversity, reinforce stereotypes, and misrepresent regions. Given their broad reach, it is critical to rigorously evaluate how these models portray the world. Existing diversity metrics either rely on curated datasets or focus on surface-level visual similarity, limiting interpretability. We introduce GeoDiv, a framework leveraging large language and vision-language models to assess geographical diversity along two complementary axes: the Socio-Economic Visual Index (SEVI), capturing economic and condition-related cues, and the Visual Diversity Index (VDI), measuring variation in primary entities and backgrounds. Applied to images generated by models such as Stable Diffusion and FLUX.1-dev across $10$ entities and $16$ countries, GeoDiv reveals a consistent lack of diversity and identifies fine-grained attributes where models default to biased portrayals. Strikingly, depictions of countries like India, Nigeria, and Colombia are disproportionately impoverished and worn, reflecting underlying socio-economic biases. These results highlight the need for greater geographical nuance in generative models. GeoDiv provides the first systematic, interpretable framework for measuring such biases, marking a step toward fairer and more inclusive generative systems. Project page: this https URL
12. 【2602.22098】Brain3D: Brain Report Automation via Inflated Vision Transformers in 3D
链接:https://arxiv.org/abs/2602.22098
作者:Mariano Barone,Francesco Di Serio,Giuseppe Riccio,Antonio Romano,Marco Postiglione,Antonino Ferraro,Vincenzo Moscato
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:accurate neuroradiological interpretation, spatial context required, process volumetric brain, volumetric brain MRI, Current medical vision-language
备注:
点击查看摘要
Abstract:Current medical vision-language models (VLMs) process volumetric brain MRI using 2D slice-based approximations, fragmenting the spatial context required for accurate neuroradiological interpretation. We developed \textbf{Brain3D}, a staged vision-language framework for automated radiology report generation from 3D brain tumor MRI. Our approach inflates a pretrained 2D medical encoder into a native 3D architecture and progressively aligns it with a causal language model through three stages: contrastive grounding, supervised projector warmup, and LoRA-based linguistic specialization. Unlike generalist 3D medical VLMs, \textbf{Brain3D} is tailored to neuroradiology, where hemispheric laterality, tumor infiltration patterns, and anatomical localization are critical. Evaluated on 468 subjects (BraTS pathological cases plus healthy controls), our model achieves a Clinical Pathology F1 of 0.951 versus 0.413 for a strong 2D baseline while maintaining perfect specificity on healthy scans. The staged alignment proves essential: contrastive grounding establishes visual-textual correspondence, projector warmup stabilizes conditioning, and LoRA adaptation shifts output from verbose captions to structured clinical reports\footnote{Our code is publicly available for transparency and reproducibility
13. 【2602.22096】WeatherCity: Urban Scene Reconstruction with Controllable Multi-Weather Transformation
链接:https://arxiv.org/abs/2602.22096
作者:Wenhua Wu,Huai Guan,Zhe Liu,Hesheng Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Editable high-fidelity, autonomous driving, training and closed-loop, weather, crucial for autonomous
备注:
点击查看摘要
Abstract:Editable high-fidelity 4D scenes are crucial for autonomous driving, as they can be applied to end-to-end training and closed-loop simulation. However, existing reconstruction methods are primarily limited to replicating observed scenes and lack the capability for diverse weather simulation. While image-level weather editing methods tend to introduce scene artifacts and offer poor controllability over the weather effects. To address these limitations, we propose WeatherCity, a novel framework for 4D urban scene reconstruction and weather editing. Specifically, we leverage a text-guided image editing model to achieve flexible editing of image weather backgrounds. To tackle the challenge of multi-weather modeling, we introduce a novel weather Gaussian representation based on shared scene features and dedicated weather-specific decoders. This representation is further enhanced with a content consistency optimization, ensuring coherent modeling across different weather conditions. Additionally, we design a physics-driven model that simulates dynamic weather effects through particles and motion patterns. Extensive experiments on multiple datasets and various scenes demonstrate that WeatherCity achieves flexible controllability, high fidelity, and temporal consistency in 4D reconstruction and weather editing. Our framework not only enables fine-grained control over weather conditions (e.g., light rain and heavy snow) but also supports object-level manipulation within the scene.
14. 【2602.22092】Overview of the CXR-LT 2026 Challenge: Multi-Center Long-Tailed and Zero Shot Chest X-ray Classification
链接:https://arxiv.org/abs/2602.22092
作者:Hexin Dong,Yi Lin,Pengyu Zhou,Fengnian Zhao,Alan Clint Legasto,Mingquan Lin,Hao Chen,Yuzhe Yang,George Shih,Yifan Peng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:NIH Chest X-ray, Chest X-ray, Chest X-ray datasets, interpretation is hindered, clinical environments
备注:
点击查看摘要
Abstract:Chest X-ray (CXR) interpretation is hindered by the long-tailed distribution of pathologies and the open-world nature of clinical environments. Existing benchmarks often rely on closed-set classes from single institutions, failing to capture the prevalence of rare diseases or the appearance of novel findings. To address this, we present the CXR-LT 2026 challenge. This third iteration of the benchmark introduces a multi-center dataset comprising over 145,000 images from PadChest and NIH Chest X-ray datasets. The challenge defines two core tasks: (1) Robust Multi-Label Classification on 30 known classes and (2) Open-World Generalization to 6 unseen (out-of-distribution) rare disease classes. We report the results of the top-performing teams, evaluating them via mean Average Precision (mAP), AUROC, and F1-score. The winning solutions achieved an mAP of 0.5854 on Task 1 and 0.4315 on Task 2, demonstrating that large-scale vision-language pre-training significantly mitigates the performance drop typically associated with zero-shot diagnosis.
15. 【2602.22091】Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos
链接:https://arxiv.org/abs/2602.22091
作者:Matthew Strong,Wei-Jer Chang,Quentin Herau,Jiezhi Yang,Yihan Hu,Chensheng Peng,Wei Zhan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Ego-centric driving videos, Ego-centric driving, abundant source, source of visual, visual data
备注: Accepted at CVPR 2026
点击查看摘要
Abstract:Ego-centric driving videos available online provide an abundant source of visual data for autonomous driving, yet their lack of annotations makes it difficult to learn representations that capture both semantic structure and 3D geometry. Recent advances in large feedforward spatial models demonstrate that point maps and ego-motion can be inferred in a single forward pass, suggesting a promising direction for scalable driving perception. We therefore propose a label-free, teacher-guided framework for learning autonomous driving representations directly from unposed videos. Unlike prior self-supervised approaches that focus primarily on frame-to-frame consistency, we posit that safe and reactive driving depends critically on temporal context. To this end, we leverage a feedforward architecture equipped with a lightweight autoregressive module, trained using multi-modal supervisory signals that guide the model to jointly predict current and future point maps, camera poses, semantic segmentation, and motion masks. Multi-modal teachers provide sequence-level pseudo-supervision, enabling LFG to learn a unified pseudo-4D representation from raw YouTube videos without poses, labels, or LiDAR. The resulting encoder not only transfers effectively to downstream autonomous driving planning on the NAVSIM benchmark, surpassing multi-camera and LiDAR baselines with only a single monocular camera, but also yields strong performance when evaluated on a range of semantic, geometric, and qualitative motion prediction tasks. These geometry and motion-aware features position LFG as a compelling video-centric foundation model for autonomous driving.
16. 【2602.22073】AdaSpot: Spend Resolution Where It Matters for Precise Event Spotting
链接:https://arxiv.org/abs/2602.22073
作者:Artur Xarles,Sergio Escalera,Thomas B. Moeslund,Albert Clapés
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Event Spotting aims, Precise Event Spotting, high temporal precision, localize fast-paced actions, Event Spotting
备注:
点击查看摘要
Abstract:Precise Event Spotting aims to localize fast-paced actions or events in videos with high temporal precision, a key task for applications in sports analytics, robotics, and autonomous systems. Existing methods typically process all frames uniformly, overlooking the inherent spatio-temporal redundancy in video data. This leads to redundant computation on non-informative regions while limiting overall efficiency. To remain tractable, they often spatially downsample inputs, losing fine-grained details crucial for precise localization. To address these limitations, we propose \textbf{AdaSpot}, a simple yet effective framework that processes low-resolution videos to extract global task-relevant features while adaptively selecting the most informative region-of-interest in each frame for high-resolution processing. The selection is performed via an unsupervised, task-aware strategy that maintains spatio-temporal consistency across frames and avoids the training instability of learnable alternatives. This design preserves essential fine-grained visual cues with a marginal computational overhead compared to low-resolution-only baselines, while remaining far more efficient than uniform high-resolution processing. Experiments on standard PES benchmarks demonstrate that \textbf{AdaSpot} achieves state-of-the-art performance under strict evaluation metrics (\eg, $+3.96$ and $+2.26$ mAP$@0$ frames on Tennis and FineDiving), while also maintaining strong results under looser metrics. Code is available at: \href{this https URL}{this https URL}.
17. 【2602.22059】NESTOR: A Nested MOE-based Neural Operator for Large-Scale PDE Pre-Training
链接:https://arxiv.org/abs/2602.22059
作者:Dengdi Sun,Xiaoya Zhou,Xiao Wang,Hao Si,Wanli Lyu,Jin Tang,Bin Luo
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:improving computational efficiency, traditional numerical methods, significantly improving computational, Neural operators, overcoming the limitations
备注: Accepted by CVPR 2026
点击查看摘要
Abstract:Neural operators have emerged as an efficient paradigm for solving PDEs, overcoming the limitations of traditional numerical methods and significantly improving computational efficiency. However, due to the diversity and complexity of PDE systems, existing neural operators typically rely on a single network architecture, which limits their capacity to fully capture heterogeneous features and complex system dependencies. This constraint poses a bottleneck for large-scale PDE pre-training based on neural operators. To address these challenges, we propose a large-scale PDE pre-trained neural operator based on a nested Mixture-of-Experts (MoE) framework. In particular, the image-level MoE is designed to capture global dependencies, while the token-level Sub-MoE focuses on local dependencies. Our model can selectively activate the most suitable expert networks for a given input, thereby enhancing generalization and transferability. We conduct large-scale pre-training on twelve PDE datasets from diverse sources and successfully transfer the model to downstream tasks. Extensive experiments demonstrate the effectiveness of our approach.
18. 【2602.22052】AutoSew: A Geometric Approach to Stitching Prediction with Graph Neural Networks
链接:https://arxiv.org/abs/2602.22052
作者:Pablo Ríos-Navarro,Elena Garces,Jorge Lopez-Moreno
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:significant challenge due, sewing patterns remains, standardized annotation protocols, Automating garment assembly, semantic cues
备注: WACV 2026
点击查看摘要
Abstract:Automating garment assembly from sewing patterns remains a significant challenge due to the lack of standardized annotation protocols and the frequent absence of semantic cues. Existing methods often rely on panel labels or handcrafted heuristics, which limit their applicability to real-world, non-conforming patterns. We present AutoSew, a fully automatic, geometry-based approach for predicting stitch correspondences directly from 2D pattern contours. AutoSew formulates the problem as a graph matching task, leveraging a Graph Neural Network to capture local and global geometric context, and employing a differentiable optimal transport solver to infer stitching relationships-including multi-edge connections. To support this task, we update the GarmentCodeData dataset modifying over 18k patterns with realistic multi-edge annotations, reflecting industrial assembly scenarios. AutoSew achieves 96% F1-score and successfully assembles 73.3% of test garments without error, outperforming existing methods while relying solely on geometric input. Our results demonstrate that geometry alone can robustly guide stitching prediction, enabling scalable garment assembly without manual input.
19. 【2602.22049】SPGen: Stochastic scanpath generation for paintings using unsupervised domain adaptation
链接:https://arxiv.org/abs/2602.22049
作者:Mohamed Amine Kerkouri,Marouane Tliba,Aladine Chetouani,Alessandro Bruno
类目:Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
关键词:Understanding human visual, human visual attention, preserving cultural heritage, eye movementswhen viewers, movementswhen viewers observe
备注: Under Review
点击查看摘要
Abstract:Understanding human visual attention is key to preserving cultural heritage We introduce SPGen a novel deep learning model to predict scanpaths the sequence of eye movementswhen viewers observe paintings. Our architecture uses a Fully Convolutional Neural Network FCNN with differentiable fixation selection and learnable Gaussian priors to simulate natural viewing biases To address the domain gap between photographs and artworks we employ unsupervised domain adaptation via a gradient reversal layer allowing the model to transfer knowledge from natural scenes to paintings Furthermore a random noise sampler models the inherent stochasticity of eyetracking data. Extensive testing shows SPGen outperforms existing methods offering a powerful tool to analyze gaze behavior and advance the preservation and appreciation of artistic treasures.
Comments:
Under Review
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Cite as:
arXiv:2602.22049 [cs.CV]
(or
arXiv:2602.22049v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2602.22049
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
20. 【2602.22033】RT-RMOT: A Dataset and Framework for RGB-Thermal Referring Multi-Object Tracking
链接:https://arxiv.org/abs/2602.22033
作者:Yanqiu Yu,Zhifan Jin,Sijia Chen,Tongfei Chu,En Yu,Liman Liu,Wenbing Tao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Referring Multi-Object Tracking, human-friendly interactive characteristics, attracted increasing attention, increasing attention due, Referring Multi-Object
备注:
点击查看摘要
Abstract:Referring Multi-Object Tracking has attracted increasing attention due to its human-friendly interactive characteristics, yet it exhibits limitations in low-visibility conditions, such as nighttime, smoke, and other challenging scenarios. To overcome this limitation, we propose a new RGB-Thermal RMOT task, named RT-RMOT, which aims to fuse RGB appearance features with the illumination robustness of the thermal modality to enable all-day referring multi-object tracking. To promote research on RT-RMOT, we construct the first Referring Multi-Object Tracking dataset under RGB-Thermal modality, named RefRT. It contains 388 language descriptions, 1,250 tracked targets, and 166,147 Language-RGB-Thermal (L-RGB-T) triplets. Furthermore, we propose RTrack, a framework built upon a multimodal large language model (MLLM) that integrates RGB, thermal, and textual features. Since the initial framework still leaves room for improvement, we introduce a Group Sequence Policy Optimization (GSPO) strategy to further exploit the model's potential. To alleviate training instability during RL fine-tuning, we introduce a Clipped Advantage Scaling (CAS) strategy to suppress gradient explosion. In addition, we design Structured Output Reward and Comprehensive Detection Reward to balance exploration and exploitation, thereby improving the completeness and accuracy of target perception. Extensive experiments on the RefRT dataset demonstrate the effectiveness of the proposed RTrack framework.
21. 【2602.22026】RGB-Event HyperGraph Prompt for Kilometer Marker Recognition based on Pre-trained Foundation Models
链接:https://arxiv.org/abs/2602.22026
作者:Xiaoyu Xian,Shiao Wang,Xiao Wang,Daxin Tian,Yan Tian
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:highly complex environments, adverse weather conditions, complex environments, characterized by illumination, illumination variations
备注: Accepted by IEEE Transactions on Cognitive and Developmental Systems (IEEE TCDS) 2026
点击查看摘要
Abstract:Metro trains often operate in highly complex environments, characterized by illumination variations, high-speed motion, and adverse weather conditions. These factors pose significant challenges for visual perception systems, especially those relying solely on conventional RGB cameras. To tackle these difficulties, we explore the integration of event cameras into the perception system, leveraging their advantages in low-light conditions, high-speed scenarios, and low power consumption. Specifically, we focus on Kilometer Marker Recognition (KMR), a critical task for autonomous metro localization under GNSS-denied conditions. In this context, we propose a robust baseline method based on a pre-trained RGB OCR foundation model, enhanced through multi-modal adaptation. Furthermore, we construct the first large-scale RGB-Event dataset, EvMetro5K, containing 5,599 pairs of synchronized RGB-Event samples, split into 4,479 training and 1,120 testing samples. Extensive experiments on EvMetro5K and other widely used benchmarks demonstrate the effectiveness of our approach for KMR. Both the dataset and source code will be released on this https URL
22. 【2602.22025】Olbedo: An Albedo and Shading Aerial Dataset for Large-Scale Outdoor Environments
链接:https://arxiv.org/abs/2602.22025
作者:Shuang Song,Debao Huang,Deyan Deng,Haolin Xiong,Yang Tang,Yajie Zhao,Rongjun Qin
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:understanding large-scale environments, lack of real-world, large-scale environments, shading supervision, understanding large-scale
备注: CVPR 2026
点击查看摘要
Abstract:Intrinsic image decomposition (IID) of outdoor scenes is crucial for relighting, editing, and understanding large-scale environments, but progress has been limited by the lack of real-world datasets with reliable albedo and shading supervision. We introduce Olbedo, a large-scale aerial dataset for outdoor albedo--shading decomposition in the wild. Olbedo contains 5,664 UAV images captured across four landscape types, multiple years, and diverse illumination conditions. Each view is accompanied by multi-view consistent albedo and shading maps, metric depth, surface normals, sun and sky shading components, camera poses, and, for recent flights, measured HDR sky domes. These annotations are derived from an inverse-rendering refinement pipeline over multi-view stereo reconstructions and calibrated sky illumination, together with per-pixel confidence masks. We demonstrate that Olbedo enables state-of-the-art diffusion-based IID models, originally trained on synthetic indoor data, to generalize to real outdoor imagery: fine-tuning on Olbedo significantly improves single-view outdoor albedo prediction on the MatrixCity benchmark. We further illustrate applications of Olbedo-trained models to multi-view consistent relighting of 3D assets, material editing, and scene change analysis for urban digital twins. We release the dataset, baseline models, and an evaluation protocol to support future research in outdoor intrinsic decomposition and illumination-aware aerial vision.
23. 【2602.22013】RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations
链接:https://arxiv.org/abs/2602.22013
作者:I-Hsiang Chen,Yu-Wei Liu,Tse-Yu Wu,Yu-Chien Chiang,Jen-Chien Yang,Wei-Ting Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Vision-based Retrieval-Augmented Generation, jointly retrieve relevant, generate grounded answers, grounded answers based, leverages vision-language models
备注: Accepted by CVPR2026; Project Page: [this https URL](https://robustvisrag.github.io)
点击查看摘要
Abstract:Vision-based Retrieval-Augmented Generation (VisRAG) leverages vision-language models (VLMs) to jointly retrieve relevant visual documents and generate grounded answers based on multimodal evidence. However, existing VisRAG models degrade in performance when visual inputs suffer from distortions such as blur, noise, low light, or shadow, where semantic and degradation factors become entangled within pretrained visual encoders, leading to errors in both retrieval and generation stages. To address this limitation, we introduce RobustVisRAG, a causality-guided dual-path framework that improves VisRAG robustness while preserving efficiency and zero-shot generalization. RobustVisRAG uses a non-causal path to capture degradation signals through unidirectional attention and a causal path to learn purified semantics guided by these signals. Together with the proposed Non-Causal Distortion Modeling and Causal Semantic Alignment objectives, the framework enforces a clear separation between semantics and degradations, enabling stable retrieval and generation under challenging visual conditions. To evaluate robustness under realistic conditions, we introduce the Distortion-VisRAG dataset, a large-scale benchmark containing both synthetic and real-world degraded documents across seven domains, with 12 synthetic and 5 real distortion types that comprehensively reflect practical visual degradations. Experimental results show that RobustVisRAG improves retrieval, generation, and end-to-end performance by 7.35%, 6.35%, and 12.40%, respectively, on real-world degradations, while maintaining comparable accuracy on clean inputs.
24. 【2602.22010】World Guidance: World Modeling in Condition Space for Action Generation
链接:https://arxiv.org/abs/2602.22010
作者:Yue Su,Sijin Chen,Haixin Shi,Mingyu Liu,Zhengshen Zhang,Ningyuan Huang,Weiheng Zhong,Zhengbang Zhu,Yuxiao Liu,Xihui Liu
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:Leveraging future observation, Leveraging future, action generation presents, presents a promising, promising avenue
备注: Project Page: [this https URL](https://selen-suyue.github.io/WoGNet/)
点击查看摘要
Abstract:Leveraging future observation modeling to facilitate action generation presents a promising avenue for enhancing the capabilities of Vision-Language-Action (VLA) models. However, existing approaches struggle to strike a balance between maintaining efficient, predictable future representations and preserving sufficient fine-grained information to guide precise action generation. To address this limitation, we propose WoG (World Guidance), a framework that maps future observations into compact conditions by injecting them into the action inference pipeline. The VLA is then trained to simultaneously predict these compressed conditions alongside future actions, thereby achieving effective world modeling within the condition space for action inference. We demonstrate that modeling and predicting this condition space not only facilitates fine-grained action generation but also exhibits superior generalization capabilities. Moreover, it learns effectively from substantial human manipulation videos. Extensive experiments across both simulation and real-world environments validate that our method significantly outperforms existing methods based on future prediction. Project page is available at: this https URL
25. 【2602.21992】PanoEnv: Exploring 3D Spatial Intelligence in Panoramic Environments with Reinforcement Learning
链接:https://arxiv.org/abs/2602.21992
作者:Zekai Lin,Xu Zheng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:holistic scene understanding, autonomous driving, virtual reality, robotics for holistic, holistic scene
备注:
点击查看摘要
Abstract:360 panoramic images are increasingly used in virtual reality, autonomous driving, and robotics for holistic scene understanding. However, current Vision-Language Models (VLMs) struggle with 3D spatial reasoning on Equirectangular Projection (ERP) images due to geometric distortion and limited 3D supervision. We introduce PanoEnv, a large-scale VQA benchmark built from synthetic 3D environments, containing 14.8K questions across five categories (e.g., relative position, volume comparison) grounded in accurate 3D annotations including depth, segmentation, and bounding boxes. Benchmarking 14 state-of-the-art VLMs reveals limited 3D understanding, achieving only 49.34% overall accuracy and 8.36% on open-ended (OE) questions. To enhance 3D reasoning, we propose a reinforcement learning post-training framework based on Group Relative Policy Optimization (GRPO) with a ground-truth-guided reward that incorporates five geometry-aware strategies such as distance tolerance and spatial consistency. A two-stage curriculum further mitigates catastrophic forgetting: Stage 1 trains on structured tasks (true/false and multiple choice), and Stage 2 fine-tunes on mixed open-ended data to improve generalization. Our 7B model achieves new state-of-the-art performance, improving overall accuracy to 52.93% (+3.59%) and open-ended accuracy to 14.83% while maintaining structured-task performance. It also achieves top semantic evaluation scores (Q-Score 6.24, P-Score 5.95), surpassing 32B models. These results demonstrate that PanoEnv-QA and our curriculum-based RL framework effectively instill 3D spatial intelligence in VLMs for omnidirectional perception.
26. 【2602.21987】PatchDenoiser: Parameter-efficient multi-scale patch learning and fusion denoiser for medical images
链接:https://arxiv.org/abs/2602.21987
作者:Jitindra Fartiyal,Pedro Freire,Sergei K. Turitsyn,Sergei G. Solovski
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:treatment planning, patient motion, essential for diagnosis, interpretation and downstream, clinical interpretation
备注: Under review in Medical Image Analysis journal
点击查看摘要
Abstract:Medical images are essential for diagnosis, treatment planning, and research, but their quality is often degraded by noise from low-dose acquisition, patient motion, or scanner limitations, affecting both clinical interpretation and downstream analysis. Traditional filtering approaches often over-smooth and lose fine anatomical details, while deep learning methods, including CNNs, GANs, and transformers, may struggle to preserve such details or require large, computationally expensive models, limiting clinical practicality. We propose PatchDenoiser, a lightweight, energy-efficient multi-scale patch-based denoising framework. It decomposes denoising into local texture extraction and global context aggregation, fused via a spatially aware patch fusion strategy. This design enables effective noise suppression while preserving fine structural and anatomical details. PatchDenoiser is ultra-lightweight, with far fewer parameters and lower computational complexity than CNN-, GAN-, and transformer-based denoisers. On the 2016 Mayo Low-Dose CT dataset, PatchDenoiser consistently outperforms state-of-the-art CNN- and GAN-based methods in PSNR and SSIM. It is robust to variations in slice thickness, reconstruction kernels, and HU windows, generalizes across scanners without fine-tuning, and reduces parameters by ~9x and energy consumption per inference by ~27x compared with conventional CNN denoisers. PatchDenoiser thus provides a practical, scalable, and computationally efficient solution for medical image denoising, balancing performance, robustness, and clinical deployability.
Comments:
Under review in Medical Image Analysis journal
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2602.21987 [cs.CV]
(or
arXiv:2602.21987v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2602.21987
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Jitindra Fartiyal [view email] [v1]
Wed, 25 Feb 2026 15:08:43 UTC (5,244 KB)
27. 【2602.21977】When LoRA Betrays: Backdooring Text-to-Image Models by Masquerading as Benign Adapters
链接:https://arxiv.org/abs/2602.21977
作者:Liangwei Lyu,Jiaqi Xu,Jianwei Ding,Qiyao Deng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Low-Rank Adaptation, efficiently fine-tuning, leading technique, technique for efficiently, widespread adoption
备注:
点击查看摘要
Abstract:Low-Rank Adaptation (LoRA) has emerged as a leading technique for efficiently fine-tuning text-to-image diffusion models, and its widespread adoption on open-source platforms has fostered a vibrant culture of model sharing and customization. However, the same modular and plug-and-play flexibility that makes LoRA appealing also introduces a broader attack surface. To highlight this risk, we propose Masquerade-LoRA (MasqLoRA), the first systematic attack framework that leverages an independent LoRA module as the attack vehicle to stealthily inject malicious behavior into text-to-image diffusion models. MasqLoRA operates by freezing the base model parameters and updating only the low-rank adapter weights using a small number of "trigger word-target image" pairs. This enables the attacker to train a standalone backdoor LoRA module that embeds a hidden cross-modal mapping: when the module is loaded and a specific textual trigger is provided, the model produces a predefined visual output; otherwise, it behaves indistinguishably from the benign model, ensuring the stealthiness of the attack. Experimental results demonstrate that MasqLoRA can be trained with minimal resource overhead and achieves a high attack success rate of 99.8%. MasqLoRA reveals a severe and unique threat in the AI supply chain, underscoring the urgent need for dedicated defense mechanisms for the LoRA-centric sharing ecosystem.
28. 【2602.21967】Dream-SLAM: Dreaming the Unseen for Active SLAM in Dynamic Environments
链接:https://arxiv.org/abs/2602.21967
作者:Xiangqi Meng,Pengxu Hou,Zhenjun Zhao,Javier Civera,Daniel Cremers,Hesheng Wang,Haoang Li
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:volves generating robot, generating robot actions, active SLAM additionally, active SLAM, existing active SLAM
备注:
点击查看摘要
Abstract:In addition to the core tasks of simultaneous localization and mapping (SLAM), active SLAM additionally in- volves generating robot actions that enable effective and efficient exploration of unknown environments. However, existing active SLAM pipelines are limited by three main factors. First, they inherit the restrictions of the underlying SLAM modules that they may be using. Second, their motion planning strategies are typically shortsighted and lack long-term vision. Third, most approaches struggle to handle dynamic scenes. To address these limitations, we propose a novel monocular active SLAM method, Dream-SLAM, which is based on dreaming cross-spatio-temporal images and semantically plausible structures of partially observed dynamic environments. The generated cross-spatio-temporal im- ages are fused with real observations to mitigate noise and data incompleteness, leading to more accurate camera pose estimation and a more coherent 3D scene representation. Furthermore, we integrate dreamed and observed scene structures to enable long- horizon planning, producing farsighted trajectories that promote efficient and thorough exploration. Extensive experiments on both public and self-collected datasets demonstrate that Dream-SLAM outperforms state-of-the-art methods in localization accuracy, mapping quality, and exploration efficiency. Source code will be publicly available upon paper acceptance.
29. 【2602.21963】Global-Aware Edge Prioritization for Pose Graph Initialization
链接:https://arxiv.org/abs/2602.21963
作者:Tong Wei,Giorgos Tolias,Jiri Matas,Daniel Barath
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:encode relative poses, edges encode relative, act as nodes, encode relative, pose graph
备注: accepted to CVPR 2026
点击查看摘要
Abstract:The pose graph is a core component of Structure-from-Motion (SfM), where images act as nodes and edges encode relative poses. Since geometric verification is expensive, SfM pipelines restrict the pose graph to a sparse set of candidate edges, making initialization critical. Existing methods rely on image retrieval to connect each image to its $k$ nearest neighbors, treating pairs independently and ignoring global consistency. We address this limitation through the concept of edge prioritization, ranking candidate edges by their utility for SfM. Our approach has three components: (1) a GNN trained with SfM-derived supervision to predict globally consistent edge reliability; (2) multi-minimal-spanning-tree-based pose graph construction guided by these ranks; and (3) connectivity-aware score modulation that reinforces weak regions and reduces graph diameter. This globally informed initialization yields more reliable and compact pose graphs, improving reconstruction accuracy in sparse and high-speed settings and outperforming SOTA retrieval methods on ambiguous scenes. The ode and trained models are available at this https URL.
30. 【2602.21956】Global-Local Dual Perception for MLLMs in High-Resolution Text-Rich Image Translation
链接:https://arxiv.org/abs/2602.21956
作者:Junxin Lu,Tengfei Song,Zhanglin Wu,Pengfei Li,Xiaowei Liang,Hui Yang,Kun Chen,Ning Xie,Yunfei Lu,Jing Zhao,Shiliang Sun,Daimeng Wei
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:requiring synergistic integration, Text Image Machine, Image Machine Translation, translate text embedded, Image Machine
备注:
点击查看摘要
Abstract:Text Image Machine Translation (TIMT) aims to translate text embedded in images in the source-language into target-language, requiring synergistic integration of visual perception and linguistic understanding. Existing TIMT methods, whether cascaded pipelines or end-to-end multimodal large language models (MLLMs),struggle with high-resolution text-rich images due to cluttered layouts, diverse fonts, and non-textual distractions, resulting in text omission, semantic drift, and contextual inconsistency. To address these challenges, we propose GLoTran, a global-local dual visual perception framework for MLLM-based TIMT. GLoTran integrates a low-resolution global image with multi-scale region-level text image slices under an instruction-guided alignment strategy, conditioning MLLMs to maintain scene-level contextual consistency while faithfully capturing fine-grained textual details. Moreover, to realize this dual-perception paradigm, we construct GLoD, a large-scale text-rich TIMT dataset comprising 510K high-resolution global-local image-text pairs covering diverse real-world scenarios. Extensive experiments demonstrate that GLoTran substantially improves translation completeness and accuracy over state-of-the-art MLLMs, offering a new paradigm for fine-grained TIMT under high-resolution and text-rich conditions.
31. 【2602.21952】MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving
链接:https://arxiv.org/abs/2602.21952
作者:Lingjun Zhang,Yujian Yuan,Changjie Wu,Xinyuan Chang,Xin Cai,Shuang Zeng,Linzhe Shi,Sijin Wang,Hang Zhang,Mu Xu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Vision-Language Models, exhibit strong reasoning, strong reasoning capabilities, exhibit strong, showing promise
备注: CVPR2026; Yujian Yuan and Lingjun Zhang contributed equally with random order
点击查看摘要
Abstract:Vision-Language Models (VLM) exhibit strong reasoning capabilities, showing promise for end-to-end autonomous driving systems. Chain-of-Thought (CoT), as VLM's widely used reasoning strategy, is facing critical challenges. Existing textual CoT has a large gap between text semantic space and trajectory physical space. Although the recent approach utilizes future image to replace text as CoT process, it lacks clear planning-oriented objective guidance to generate images with accurate scene evolution. To address these, we innovatively propose MindDriver, a progressive multimodal reasoning framework that enables VLM to imitate human-like progressive thinking for autonomous driving. MindDriver presents semantic understanding, semantic-to-physical space imagination, and physical-space trajectory planning. To achieve aligned reasoning processes in MindDriver, we develop a feedback-guided automatic data annotation pipeline to generate aligned multimodal reasoning training data. Furthermore, we develop a progressive reinforcement fine-tuning method to optimize the alignment through progressive high- level reward-based learning. MindDriver demonstrates superior performance in both nuScences open-loop and Bench2Drive closed-loop evaluation. Codes are available at this https URL.
32. 【2602.21944】Learning to Fuse and Reconstruct Multi-View Graphs for Diabetic Retinopathy Grading
链接:https://arxiv.org/abs/2602.21944
作者:Haoran Li,Yuxin Lin,Huan Wang,Xiaoling Luo,Qi Zhu,Jiahua Shi,Huaming Chen,Bo Du,Johan Barthelemy,Zongyan Xue,Jun Shen,Yong Xu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:vision loss worldwide, Multi-View Graph Fusion, Graph Fusion, Multi-View Graph, multi-view fundus images
备注:
点击查看摘要
Abstract:Diabetic retinopathy (DR) is one of the leading causes of vision loss worldwide, making early and accurate DR grading critical for timely intervention. Recent clinical practices leverage multi-view fundus images for DR detection with a wide coverage of the field of view (FOV), motivating deep learning methods to explore the potential of multi-view learning for DR grading. However, existing methods often overlook the inter-view correlations when fusing multi-view fundus images, failing to fully exploit the inherent consistency across views originating from the same patient. In this work, we present MVGFDR, an end-to-end Multi-View Graph Fusion framework for DR grading. Different from existing methods that directly fuse visual features from multiple views, MVGFDR is equipped with a novel Multi-View Graph Fusion (MVGF) module to explicitly disentangle the shared and view-specific visual features. Specifically, MVGF comprises three key components: (1) Multi-view Graph Initialization, which constructs visual graphs via residual-guided connections and employs Discrete Cosine Transform (DCT) coefficients as frequency-domain anchors; (2) Multi-view Graph Fusion, which integrates selective nodes across multi-view graphs based on frequency-domain relevance to capture complementary view-specific information; and (3) Masked Cross-view Reconstruction, which leverages masked reconstruction of shared information across views to facilitate view-invariant representation learning. Extensive experimental results on MFIDDR, by far the largest multi-view fundus image dataset, demonstrate the superiority of our proposed approach over existing state-of-the-art approaches in diabetic retinopathy grading.
33. 【2602.21943】Mobile-Ready Automated Triage of Diabetic Retinopathy Using Digital Fundus Images
链接:https://arxiv.org/abs/2602.21943
作者:Aadi Joshi,Manav S. Sharma,Vijay Uttam Rathod,Ashlesha Sawant,Prajakta Musale,Asmita B. Kalamkar
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:vision impairment worldwide, impairment worldwide, vision impairment, Consistent Rank Logits, Diabetic Retinopathy
备注: Presented at ICCI 2025. 11 pages, 2 figures. MobileNetV3 + CORAL-based lightweight model for diabetic retinopathy severity classification with mobile deployment
点击查看摘要
Abstract:Diabetic Retinopathy (DR) is a major cause of vision impairment worldwide. However, manual diagnosis is often time-consuming and prone to errors, leading to delays in screening. This paper presents a lightweight automated deep learning framework for efficient assessment of DR severity from digital fundus images. We use a MobileNetV3 architecture with a Consistent Rank Logits (CORAL) head to model the ordered progression of disease while maintaining computational efficiency for resource-constrained environments. The model is trained and validated on a combined dataset of APTOS 2019 and IDRiD images using a preprocessing pipeline including circular cropping and illumination normalization. Extensive experiments including 3-fold cross-validation and ablation studies demonstrate strong performance. The model achieves a Quadratic Weighted Kappa (QWK) score of 0.9019 and an accuracy of 80.03 percent. Additionally, we address real-world deployment challenges through model calibration to reduce overconfidence and optimization for mobile devices. The proposed system provides a scalable and practical tool for early-stage diabetic retinopathy screening.
34. 【2602.21942】Directed Ordinal Diffusion Regularization for Progression-Aware Diabetic Retinopathy Grading
链接:https://arxiv.org/abs/2602.21942
作者:Huangwei Chen,Junhao Jia,Ruocheng Li,Cunyuan Yang,Wu Li,Xiaotao Pang,Yifei Chen,Haishuai Wang,Jiajun Bu,Lei Wu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Diabetic Retinopathy, well-defined clinical trajectory, continuous and irreversible, irreversible deterioration, well-defined clinical
备注: 3 figures
点击查看摘要
Abstract:Diabetic Retinopathy (DR) progresses as a continuous and irreversible deterioration of the retina, following a well-defined clinical trajectory from mild to severe stages. However, most existing ordinal regression approaches model DR severity as a set of static, symmetric ranks, capturing relative order while ignoring the inherent unidirectional nature of disease progression. As a result, the learned feature representations may violate biological plausibility, allowing implausible proximity between non-consecutive stages or even reverse transitions. To bridge this gap, we propose Directed Ordinal Diffusion Regularization (D-ODR), which explicitly models the feature space as a directed flow by constructing a progression-constrained directed graph that strictly enforces forward disease evolution. By performing multi-scale diffusion on this directed structure, D-ODR imposes penalties on score inversions along valid progression paths, thereby effectively preventing the model from learning biologically inconsistent reverse transitions. This mechanism aligns the feature representation with the natural trajectory of DR worsening. Extensive experiments demonstrate that D-ODR yields superior grading performance compared to state-of-the-art ordinal regression and DR-specific grading methods, offering a more clinically reliable assessment of disease severity. Our code is available on this https URL.
35. 【2602.21935】A Framework for Cross-Domain Generalization in Coronary Artery Calcium Scoring Across Gated and Non-Gated Computed Tomography
链接:https://arxiv.org/abs/2602.21935
作者:Mahmut S. Gokmen,Moneera N. Haque,Steve W. Leung,Caroline N. Leach,Seth Parker,Stephen B. Hobbs,Vincent L. Sorrell,W. Brent Seales,V. K. Cody Bumgardner
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Coronary artery calcium, Coronary artery, cardiac imaging settings, specialized cardiac imaging, artery calcium
备注:
点击查看摘要
Abstract:Coronary artery calcium (CAC) scoring is a key predictor of cardiovascular risk, but it relies on ECG-gated CT scans, restricting its use to specialized cardiac imaging settings. We introduce an automated framework for CAC detection and lesion-specific Agatston scoring that operates across both gated and non-gated CT scans. At its core is CARD-ViT, a self-supervised Vision Transformer trained exclusively on gated CT data using DINO. Without any non-gated training data, our framework achieves 0.707 accuracy and a Cohen's kappa of 0.528 on the Stanford non-gated dataset, matching models trained directly on non-gated scans. On gated test sets, the framework achieves 0.910 accuracy with Cohen's kappa scores of 0.871 and 0.874 across independent datasets, demonstrating robust risk stratification. These results demonstrate the feasibility of cross-domain CAC scoring from gated to non-gated domains, supporting scalable cardiovascular screening in routine chest imaging without additional scans or annotations.
36. 【2602.21929】Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context
链接:https://arxiv.org/abs/2602.21929
作者:JiaKui Hu,Jialun Liu,Liying Yang,Xinliang Zhang,Kaiwen Li,Shuang Zeng,Yuanwei Li,Haibin Huang,Chi Zhang,Yanye Lu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Scene-consistent video generation, video generation aims, Scene-consistent video, video generation, aims to create
备注: Accepted by CVPR 2026
点击查看摘要
Abstract:Scene-consistent video generation aims to create videos that explore 3D scenes based on a camera trajectory. Previous methods rely on video generation models with external memory for consistency, or iterative 3D reconstruction and inpainting, which accumulate errors during inference due to incorrect intermediary outputs, non-differentiable processes, and separate models. To overcome these limitations, we introduce ``geometry-as-context". It iteratively completes the following steps using an autoregressive camera-controlled video generation model: (1) estimates the geometry of the current view necessary for 3D reconstruction, and (2) simulates and restores novel view images rendered by the 3D scene. Under this multi-task framework, we develop the camera gated attention module to enhance the model's capability to effectively leverage camera poses. During the training phase, text contexts are utilized to ascertain whether geometric or RGB images should be generated. To ensure that the model can generate RGB-only outputs during inference, the geometry context is randomly dropped from the interleaved text-image-geometry training sequence. The method has been tested on scene video generation with one-direction and forth-and-back trajectories. The results show its superiority over previous approaches in maintaining scene consistency and camera control.
37. 【2602.21919】Learning in the Null Space: Small Singular Values for Continual Learning
链接:https://arxiv.org/abs/2602.21919
作者:Cuong Anh Pham,Praneeth Vepakomma,Samuel Horváth
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:Alleviating catastrophic forgetting, Alleviating catastrophic, small singular, primary challenge, Alleviating
备注: 17 pages, accepted as Oral presentation at the Third Conference on Parsimony and Learning (CPAL 2026)
点击查看摘要
Abstract:Alleviating catastrophic forgetting while enabling further learning is a primary challenge in continual learning (CL). Orthogonal-based training methods have gained attention for their efficiency and strong theoretical properties, and many existing approaches enforce orthogonality through gradient projection. In this paper, we revisit orthogonality and exploit the fact that small singular values correspond to directions that are nearly orthogonal to the input space of previous tasks. Building on this principle, we introduce NESS (Null-space Estimated from Small Singular values), a CL method that applies orthogonality directly in the weight space rather than through gradient manipulation. Specifically, NESS constructs an approximate null space using the smallest singular values of each layer's input representation and parameterizes task-specific updates via a compact low-rank adaptation (LoRA-style) formulation constrained to this subspace. The subspace basis is fixed to preserve the null-space constraint, and only a single trainable matrix is learned for each task. This design ensures that the resulting updates remain approximately in the null space of previous inputs while enabling adaptation to new tasks. Our theoretical analysis and experiments on three benchmark datasets demonstrate competitive performance, low forgetting, and stable accuracy across tasks, highlighting the role of small singular values in continual learning. The code is available at this https URL.
38. 【2602.21917】Scan Clusters, Not Pixels: A Cluster-Centric Paradigm for Efficient Ultra-high-definition Image Restoration
链接:https://arxiv.org/abs/2602.21917
作者:Chen Wu,Ling Wang,Zhuoran Zheng,Yuning Cui,Zhixiong Yang,Xiangyu Chen,Yue Zhang,Weidong Jiang,Jingyuan Xia
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:demand unsustainable computation, scalability crisis, bound to pixel-wise, pixel-wise operations, demand unsustainable
备注: Aceepted by CVPR26
点击查看摘要
Abstract:Ultra-High-Definition (UHD) image restoration is trapped in a scalability crisis: existing models, bound to pixel-wise operations, demand unsustainable computation. While state space models (SSMs) like Mamba promise linear complexity, their pixel-serial scanning remains a fundamental bottleneck for the millions of pixels in UHD content. We ask: must we process every pixel to understand the image? This paper introduces C$^2$SSM, a visual state space model that breaks this taboo by shifting from pixel-serial to cluster-serial scanning. Our core discovery is that the rich feature distribution of a UHD image can be distilled into a sparse set of semantic centroids via a neural-parameterized mixture model. C$^2$SSM leverages this to reformulate global modeling into a novel dual-path process: it scans and reasons over a handful of cluster centers, then diffuses the global context back to all pixels through a principled similarity distribution, all while a lightweight modulator preserves fine details. This cluster-centric paradigm achieves a decisive leap in efficiency, slashing computational costs while establishing new state-of-the-art results across five UHD restoration tasks. More than a solution, C$^2$SSM charts a new course for efficient large-scale vision: scan clusters, not pixels.
39. 【2602.21915】Protein Graph Neural Networks for Heterogeneous Cryo-EM Reconstruction
链接:https://arxiv.org/abs/2602.21915
作者:Jonathan Krook,Axel Janson,Joakim andén,Melanie Weber,Ozan Öktem
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:cryogenic electron microscopy, heterogeneous single-particle cryogenic, single-particle cryogenic electron, predicts atomic backbone, atomic backbone conformations
备注:
点击查看摘要
Abstract:We present a geometry-aware method for heterogeneous single-particle cryogenic electron microscopy (cryo-EM) reconstruction that predicts atomic backbone conformations. To incorporate protein-structure priors, we represent the backbone as a graph and use a graph neural network (GNN) autodecoder that maps per-image latent variables to 3D displacements of a template conformation. The objective combines a data-discrepancy term based on a differentiable cryo-EM forward model with geometric regularization, and it supports unknown orientations via ellipsoidal support lifting (ESL) pose estimation. On synthetic datasets derived from molecular dynamics trajectories, the proposed GNN achieves higher accuracy compared to a multilayer perceptron (MLP) of comparable size, highlighting the benefits of a geometry-informed inductive bias.
40. 【2602.21905】IRAuxCloud: A Thermal Infrared Dataset for Day and Night Cloud Detection
链接:https://arxiv.org/abs/2602.21905
作者:Alexis Apostolakis,Vasileios Botsos,Niklas Wölki,Andrea Spichtinger,Nikolaos Ioannis Bountos,Ioannis Papoutsis,Panayiotis Tsanakas
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:urban heat island, fire disaster response, critical remote sensing, remote sensing applications, heat island monitoring
备注:
点击查看摘要
Abstract:Clouds are a major obstacle in Earth observation, limiting the usability and reliability of critical remote sensing applications such as fire disaster response, urban heat island monitoring, and snow and ice cover mapping. Therefore, the ability to detect clouds 24/7 is of paramount importance. While visible and near-infrared bands are effective for daytime cloud detection, their dependence on solar illumination makes them unsuitable for nighttime monitoring. In contrast, thermal infrared (TIR) imagery plays a crucial role in detecting clouds at night, when sunlight is absent. Due to their generally lower temperatures, clouds emit distinct thermal signatures that are detectable in TIR bands. Despite this, accurate nighttime cloud detection remains challenging due to limited spectral information and the typically lower spatial resolution of TIR imagery. To address these challenges, we present TIRAuxCloud, a multi-modal dataset centered around thermal spectral data to facilitate cloud segmentation under both daytime and nighttime conditions. The dataset comprises a unique combination of multispectral data (TIR, optical, and near-infrared bands) from Landsat and VIIRS, aligned with auxiliary information layers. Elevation, land cover, meteorological variables, and cloud-free reference images are included to help reduce surface-cloud ambiguity and cloud formation uncertainty. To overcome the scarcity of manual cloud labels, we include a large set of samples with automated cloud masks and a smaller manually annotated subset to further evaluate and improve models. Comprehensive benchmarks are presented to establish performance baselines through supervised and transfer learning, demonstrating the dataset's value in advancing the development of innovative methods for day and night time cloud detection.
41. 【2602.21904】UNet-Based Keypoint Regression for 3D Cone Localization in Autonomous Racing
链接:https://arxiv.org/abs/2602.21904
作者:Mariia Baidachna,James Carty,Aidan Ferguson,Joseph Agrane,Varad Kulkarni,Aubrey Agub,Michael Baxendale,Aaron David,Rachel Horton,Elliott Atkinson
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:Accurate cone localization, space is essential, precise navigation, Accurate cone, cone localization
备注: 8 pages, 9 figures. Accepted to ICCV End-to-End 3D Learning Workshop 2025 and presented as a poster; not included in the final proceedings due to a conference administrative error
点击查看摘要
Abstract:Accurate cone localization in 3D space is essential in autonomous racing for precise navigation around the track. Approaches that rely on traditional computer vision algorithms are sensitive to environmental variations, and neural networks are often trained on limited data and are infeasible to run in real time. We present a UNet-based neural network for keypoint detection on cones, leveraging the largest custom-labeled dataset we have assembled. Our approach enables accurate cone position estimation and the potential for color prediction. Our model achieves substantial improvements in keypoint accuracy over conventional methods. Furthermore, we leverage our predicted keypoints in the perception pipeline and evaluate the end-to-end autonomous system. Our results show high-quality performance across all metrics, highlighting the effectiveness of this approach and its potential for adoption in competitive autonomous racing systems.
42. 【2602.21893】EndoDDC: Learning Sparse to Dense Reconstruction for Endoscopic Robotic Navigation via Diffusion Depth Completion
链接:https://arxiv.org/abs/2602.21893
作者:Yinheng Lin,Yiming Huang,Beilei Cui,Long Bai,Huxin Gao,Hongliang Ren,Jiewen Lai
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:safe instrument guidance, endoscopic surgical robots, depth, depth estimation plays, Accurate depth estimation
备注: Accepted by ICRA 2026
点击查看摘要
Abstract:Accurate depth estimation plays a critical role in the navigation of endoscopic surgical robots, forming the foundation for 3D reconstruction and safe instrument guidance. Fine-tuning pretrained models heavily relies on endoscopic surgical datasets with precise depth annotations. While existing self-supervised depth estimation techniques eliminate the need for accurate depth annotations, their performance degrades in environments with weak textures and variable lighting, leading to sparse reconstruction with invalid depth estimation. Depth completion using sparse depth maps can mitigate these issues and improve accuracy. Despite the advances in depth completion techniques in general fields, their application in endoscopy remains limited. To overcome these limitations, we propose EndoDDC, an endoscopy depth completion method that integrates images, sparse depth information with depth gradient features, and optimizes depth maps through a diffusion model, addressing the issues of weak texture and light reflection in endoscopic environments. Extensive experiments on two publicly available endoscopy datasets show that our approach outperforms state-of-the-art models in both depth accuracy and robustness. This demonstrates the potential of our method to reduce visual errors in complex endoscopic environments. Our code will be released at this https URL.
43. 【2602.21877】How to Take a Memorable Picture? Empowering Users with Actionable Feedback
链接:https://arxiv.org/abs/2602.21877
作者:Francesco Laiti,Davide Talon,Jacopo Staiano,Elisa Ricci
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:generative methods altering, traditionally been studied, studied in computer, computer vision, regressing a scalar
备注: Accepted @ CVPR 2026. Project page: [this https URL](https://laitifranz.github.io/MemCoach/)
点击查看摘要
Abstract:Image memorability, i.e., how likely an image is to be remembered, has traditionally been studied in computer vision either as a passive prediction task, with models regressing a scalar score, or with generative methods altering the visual input to boost the image likelihood of being remembered. Yet, none of these paradigms supports users at capture time, when the crucial question is how to improve a photo memorability. We introduce the task of Memorability Feedback (MemFeed), where an automated model should provide actionable, human-interpretable guidance to users with the goal to enhance an image future recall. We also present MemCoach, the first approach designed to provide concrete suggestions in natural language for memorability improvement (e.g., "emphasize facial expression," "bring the subject forward"). Our method, based on Multimodal Large Language Models (MLLMs), is training-free and employs a teacher-student steering strategy, aligning the model internal activations toward more memorable patterns learned from a teacher model progressing along least-to-most memorable samples. To enable systematic evaluation on this novel task, we further introduce MemBench, a new benchmark featuring sequence-aligned photoshoots with annotated memorability scores. Our experiments, considering multiple MLLMs, demonstrate the effectiveness of MemCoach, showing consistently improved performance over several zero-shot models. The results indicate that memorability can not only be predicted but also taught and instructed, shifting the focus from mere prediction to actionable feedback for human creators.
44. 【2602.21873】GFPL: Generative Federated Prototype Learning for Resource-Constrained and Data-Imbalanced Vision Task
链接:https://arxiv.org/abs/2602.21873
作者:Shiwei Lu,Yuhang He,Jiashuo Li,Qiang Wang,Yihong Gong
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:medical image recognition, Federated Prototype Learning, facilitates the secure, advancing applications, autonomous driving
备注:
点击查看摘要
Abstract:Federated learning (FL) facilitates the secure utilization of decentralized images, advancing applications in medical image recognition and autonomous driving. However, conventional FL faces two critical challenges in real-world deployment: ineffective knowledge fusion caused by model updates biased toward majority-class features, and prohibitive communication overhead due to frequent transmissions of high-dimensional model parameters. Inspired by the human brain's efficiency in knowledge integration, we propose a novel Generative Federated Prototype Learning (GFPL) framework to address these issues. Within this framework, a prototype generation method based on Gaussian Mixture Model (GMM) captures the statistical information of class-wise features, while a prototype aggregation strategy using Bhattacharyya distance effectively fuses semantically similar knowledge across clients. In addition, these fused prototypes are leveraged to generate pseudo-features, thereby mitigating feature distribution imbalance across clients. To further enhance feature alignment during local training, we devise a dual-classifier architecture, optimized via a hybrid loss combining Dot Regression and Cross-Entropy. Extensive experiments on benchmarks show that GFPL improves model accuracy by 3.6% under imbalanced data settings while maintaining low communication cost.
45. 【2602.21864】DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs
链接:https://arxiv.org/abs/2602.21864
作者:Yanbin Wei,Jiangyue Yan,Chun Kang,Yang Chen,Hua Liu,James Kwok,Yu Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Graphics (cs.GR)
关键词:zero-shot question answering, question answering, emerged as versatile, Vision-Language Models, versatile solutions
备注: CVPR 2026
点击查看摘要
Abstract:Vision-Language Models (VLMs) have emerged as versatile solutions for zero-shot question answering (QA) across various domains. However, enabling VLMs to effectively comprehend structured graphs and perform accurate, efficient QA remains challenging. Existing approaches typically rely on one single graph topology representation (GTR), such as fixed-style visual images or unified text descriptions. This ``one-size-fits-all'' strategy often neglects model-specific and task-specific preferences, resulting in inaccurate or over-lengthy responses to graph-related queries. To address this, we propose the $\mbox{DynamicGTR}$ framework, which dynamically selects the optimal GTR for each query during inference, thereby enhancing the zero-shot graph QA capabilities of VLMs with a customizable accuracy and brevity trade-off. Extensive experiments show that DynamicGTR not only improves VLM-based graph algorithm QA performance but also successfully transfers the experience trained from synthetic graph algorithm tasks to real-world applications like link prediction and node classification, without any additional training. Additionally, DynamicGTR demonstrates strong transferability across tasks, domains, and models, suggesting its potential as a flexible solution for broad graph scenarios.
46. 【2602.21855】Understanding Annotation Error Propagation and Learning an Adaptive Policy for Expert Intervention in Barrett's Video Segmentation
链接:https://arxiv.org/abs/2602.21855
作者:Lokesha Rasanjalee,Jin Lin Tan,Dileepa Pitawela,Rajvinder Singh,Hsiang-Ting Chen
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:lack clear boundaries, Accurate annotation, essential yet time-consuming, clear boundaries, endoscopic videos
备注: Accepted at IEEE ISBI 2026
点击查看摘要
Abstract:Accurate annotation of endoscopic videos is essential yet time-consuming, particularly for challenging datasets such as dysplasia in Barrett's esophagus, where the affected regions are irregular and lack clear boundaries. Semi-automatic tools like Segment Anything Model 2 (SAM2) can ease this process by propagating annotations across frames, but small errors often accumulate and reduce accuracy, requiring expert review and correction. To address this, we systematically study how annotation errors propagate across different prompt types, namely masks, boxes, and points, and propose Learning-to-Re-Prompt (L2RP), a cost-aware framework that learns when and where to seek expert input. By tuning a human-cost parameter, our method balances annotation effort and segmentation accuracy. Experiments on a private Barrett's dysplasia dataset and the public SUN-SEG benchmark demonstrate improved temporal consistency and superior performance over baseline strategies.
47. 【2602.21849】Meta-FC: Meta-Learning with Feature Consistency for Robust and Generalizable Watermarking
链接:https://arxiv.org/abs/2602.21849
作者:Yuheng Li,Weitong Chen,Chengcheng Zhu,Jiale Zhang,Chunpeng Ge,Di Wu,Guodong Long
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Deep learning-based watermarking, made remarkable progress, Deep learning-based, underline, textbf
备注:
点击查看摘要
Abstract:Deep learning-based watermarking has made remarkable progress in recent years. To achieve robustness against various distortions, current methods commonly adopt a training strategy where a \underline{\textbf{s}}ingle \underline{\textbf{r}}andom \underline{\textbf{d}}istortion (SRD) is chosen as the noise layer in each training batch. However, the SRD strategy treats distortions independently within each batch, neglecting the inherent relationships among different types of distortions and causing optimization conflicts across batches. As a result, the robustness and generalizability of the watermarking model are limited. To address this issue, we propose a novel training strategy that enhances robustness and generalization via \underline{\textbf{meta}}-learning with \underline{\textbf{f}}eature \underline{\textbf{c}}onsistency (Meta-FC). Specifically, we randomly sample multiple distortions from the noise pool to construct a meta-training task, while holding out one distortion as a simulated ``unknown'' distortion for the meta-testing phase. Through meta-learning, the model is encouraged to identify and utilize neurons that exhibit stable activations across different types of distortions, mitigating the optimization conflicts caused by the random sampling of diverse distortions in each batch. To further promote the transformation of stable activations into distortion-invariant representations, we introduce a feature consistency loss that constrains the decoded features of the same image subjected to different distortions to remain consistent. Extensive experiments demonstrate that, compared to the SRD training strategy, Meta-FC improves the robustness and generalization of various watermarking models by an average of 1.59\%, 4.71\%, and 2.38\% under high-intensity, combined, and unknown distortions.
48. 【2602.21835】UniVBench: Towards Unified Evaluation for Video Foundation Models
链接:https://arxiv.org/abs/2602.21835
作者:Jianhui Wei,Xiaotian Zhang,Yichen Li,Yuan Wang,Yan Zhang,Ziyi Chen,Zhihang Tang,Wei Xu,Zuozhu Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:next-generation multimodal systems, Video foundation models, Video, integrate video understanding, central direction
备注:
点击查看摘要
Abstract:Video foundation models aim to integrate video understanding, generation, editing, and instruction following within a single framework, making them a central direction for next-generation multimodal systems. However, existing evaluation benchmarks remain fragmented and limited in scope, as they each target a single task, rely on task-specific metrics, and typically use short or simple video clips. As a result, they do not capture the unified capabilities that these models are designed to deliver. To address this gap, we introduce UniVBench, a benchmark purpose-built for evaluating video foundation models across four core abilities: video understanding, video generation, video editing, and a newly proposed task, video reconstruction, which assesses how faithfully a model can reproduce video content it has encountered. Our benchmark substantially expands the complexity of evaluation by incorporating 200 high-quality, diverse and multi-shot videos, each paired with detailed captions, multi-format editing instructions, and reference images. All videos are human-created and carefully validated, offering richer cinematic information than prior benchmarks. In addition, we develop a unified agentic evaluation system (UniV-Eval) that standardizes prompting, instruction parsing, and scoring across all tasks, enabling fair, scalable, and reproducible comparisons of unified video models. By grounding evaluation in instruction-based multi-shot video tasks, UniVBench provides the first framework for measuring the integrated capabilities that video foundation models aim to achieve. Extensive human annotations ensure our evaluation aligns with human judgment, enabling rigorous assessment and accelerating progress toward robust video intelligence.
49. 【2602.21829】StoryMovie: A Dataset for Semantic Alignment of Visual Stories with Movie Scripts and Subtitles
链接:https://arxiv.org/abs/2602.21829
作者:Daniel Oliveira,David Martins de Matos
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:correctly ground entities, generating incorrect dialogue, Visual storytelling models, generating incorrect, emotional states
备注: 15 pages, submitted to Journal of Visual Communication and Image Representation
点击查看摘要
Abstract:Visual storytelling models that correctly ground entities in images may still hallucinate semantic relationships, generating incorrect dialogue attribution, character interactions, or emotional states. We introduce StoryMovie, a dataset of 1,757 stories aligned with movie scripts and subtitles through LCS matching. Our alignment pipeline synchronizes screenplay dialogue with subtitle timestamps, enabling dialogue attribution by linking character names from scripts to temporal positions from subtitles. Using this aligned content, we generate stories that maintain visual grounding tags while incorporating authentic character names, dialogue, and relationship dynamics. We fine-tune Qwen Storyteller3 on this dataset, building on prior work in visual grounding and entity re-identification. Evaluation using DeepSeek V3 as judge shows that Storyteller3 achieves an 89.9% win rate against base Qwen2.5-VL 7B on subtitle alignment. Compared to Storyteller, trained without script grounding, Storyteller3 achieves 48.5% versus 38.0%, confirming that semantic alignment progressively improves dialogue attribution beyond visual grounding alone.
Comments:
15 pages, submitted to Journal of Visual Communication and Image Representation
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
ACMclasses:
I.2.7; I.2.10; I.4.8
Cite as:
arXiv:2602.21829 [cs.CV]
(or
arXiv:2602.21829v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2602.21829
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
50. 【2602.21820】Joint Shadow Generation and Relighting via Light-Geometry Interaction Maps
链接:https://arxiv.org/abs/2602.21820
作者:Shan Wang,Peixia Li,Chenchen Xu,Ziang Cheng,Jiayu Yang,Hongdong Li,Pulak Purkait
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:encodes light-aware occlusion, propose Light-Geometry Interaction, Light-Geometry Interaction, encodes light-aware, light-aware occlusion
备注: ICRL 2026
点击查看摘要
Abstract:We propose Light-Geometry Interaction (LGI) maps, a novel representation that encodes light-aware occlusion from monocular depth. Unlike ray tracing, which requires full 3D reconstruction, LGI captures essential light-shadow interactions reliably and accurately, computed from off-the-shelf 2.5D depth map predictions. LGI explicitly ties illumination direction to geometry, providing a physics-inspired prior that constrains generative models. Without such prior, these models often produce floating shadows, inconsistent illumination, and implausible shadow geometry. Building on this representation, we propose a unified pipeline for joint shadow generation and relighting - unlike prior methods that treat them as disjoint tasks - capturing the intrinsic coupling of illumination and shadowing essential for modeling indirect effects. By embedding LGI into a bridge-matching generative backbone, we reduce ambiguity and enforce physically consistent light-shadow reasoning. To enable effective training, we curated the first large-scale benchmark dataset for joint shadow and relighting, covering reflections, transparency, and complex interreflections. Experiments show significant gains in realism and consistency across synthetic and real images. LGI thus bridges geometry-inspired rendering with generative modeling, enabling efficient, physically consistent shadow generation and relighting.
51. 【2602.21819】SemVideo: Reconstructs What You Watch from Brain Activity via Hierarchical Semantic Guidance
链接:https://arxiv.org/abs/2602.21819
作者:Minghan Yang,Lan Yang,Ke Li,Honggang Zhang,Kaiyue Pang,Yizhe Song
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:human visual perception, Reconstructing dynamic visual, Reconstructing dynamic, experiences from brain, brain activity
备注:
点击查看摘要
Abstract:Reconstructing dynamic visual experiences from brain activity provides a compelling avenue for exploring the neural mechanisms of human visual perception. While recent progress in fMRI-based image reconstruction has been notable, extending this success to video reconstruction remains a significant challenge. Current fMRI-to-video reconstruction approaches consistently encounter two major shortcomings: (i) inconsistent visual representations of salient objects across frames, leading to appearance mismatches; (ii) poor temporal coherence, resulting in motion misalignment or abrupt frame transitions. To address these limitations, we introduce SemVideo, a novel fMRI-to-video reconstruction framework guided by hierarchical semantic information. At the core of SemVideo is SemMiner, a hierarchical guidance module that constructs three levels of semantic cues from the original video stimulus: static anchor descriptions, motion-oriented narratives, and holistic summaries. Leveraging this semantic guidance, SemVideo comprises three key components: a Semantic Alignment Decoder that aligns fMRI signals with CLIP-style embeddings derived from SemMiner, a Motion Adaptation Decoder that reconstructs dynamic motion patterns using a novel tripartite attention fusion architecture, and a Conditional Video Render that leverages hierarchical semantic guidance for video reconstruction. Experiments conducted on the CC2017 and HCP datasets demonstrate that SemVideo achieves superior performance in both semantic alignment and temporal consistency, setting a new state-of-the-art in fMRI-to-video reconstruction.
52. 【2602.21818】SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model
链接:https://arxiv.org/abs/2602.21818
作者:Guibin Chen,Dixuan Lin,Jiangping Yang,Youqiang Zhang,Zhengcong Fei,Debang Li,Sheng Chen,Chaofeng Ao,Nuo Pang,Yiming Wang,Yikun Dou,Zheng Chen,Mingyuan Fan,Tuanhui Li,Mingshan Chang,Hao Zhang,Xiaopeng Sun,Jingtao Xu,Yuqiang Xie,Jiahua Wang,Zhiheng Xu,Weiming Xiong,Yuzhe Jin,Baoxuan Gu,Binjie Mao,Yunjie Yu,Jujie He,Yuhao Feng,Shiwen Tu,Chaojie Wang,Rui Yan,Wei Shen,Jingchen Wu,Peng Zhao,Xuanyue Zhong,Zhuangzhuang Liu,Kaifei Wang,Fuxiang Zhang,Weikai Xu,Wenyan Liu,Binglu Zhang,Yu Shen,Tianhui Xiong,Bin Peng,Liang Zeng,Xuchen Song,Haoxiang Guo,Peiyu Wang,Yahui Zhou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal Diffusion Transformer, Multimodal Large Language, Large Language Models, stream Multimodal Diffusion, video
备注:
点击查看摘要
Abstract:SkyReels V4 is a unified multi modal video foundation model for joint video audio generation, inpainting, and editing. The model adopts a dual stream Multimodal Diffusion Transformer (MMDiT) architecture, where one branch synthesizes video and the other generates temporally aligned audio, while sharing a powerful text encoder based on the Multimodal Large Language Models (MMLM). SkyReels V4 accepts rich multi modal instructions, including text, images, video clips, masks, and audio references. By combining the MMLMs multi modal instruction following capability with in context learning in the video branch MMDiT, the model can inject fine grained visual guidance under complex conditioning, while the audio branch MMDiT simultaneously leverages audio references to guide sound generation. On the video side, we adopt a channel concatenation formulation that unifies a wide range of inpainting style tasks, such as image to video, video extension, and video editing under a single interface, and naturally extends to vision referenced inpainting and editing via multi modal prompts. SkyReels V4 supports up to 1080p resolution, 32 FPS, and 15 second duration, enabling high fidelity, multi shot, cinema level video generation with synchronized audio. To make such high resolution, long-duration generation computationally feasible, we introduce an efficiency strategy: Joint generation of low resolution full sequences and high-resolution keyframes, followed by dedicated super-resolution and frame interpolation models. To our knowledge, SkyReels V4 is the first video foundation model that simultaneously supports multi-modal input, joint video audio generation, and a unified treatment of generation, inpainting, and editing, while maintaining strong efficiency and quality at cinematic resolutions and durations.
53. 【2602.21810】GeoMotion: Rethinking Motion Segmentation via Latent 4D Geometry
链接:https://arxiv.org/abs/2602.21810
作者:Xiankang He,Peile Lin,Ying Cui,Dongyan Guo,Chunhua Shen,Xiaoqin Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:noisy motion cues, inherently noisy motion, conventional methods heavily, methods heavily rely, highly challenging
备注:
点击查看摘要
Abstract:Motion segmentation in dynamic scenes is highly challenging, as conventional methods heavily rely on estimating camera poses and point correspondences from inherently noisy motion cues. Existing statistical inference or iterative optimization techniques that struggle to mitigate the cumulative errors in multi-stage pipelines often lead to limited performance or high computational cost. In contrast, we propose a fully learning-based approach that directly infers moving objects from latent feature representations via attention mechanisms, thus enabling end-to-end feed-forward motion segmentation. Our key insight is to bypass explicit correspondence estimation and instead let the model learn to implicitly disentangle object and camera motion. Supported by recent advances in 4D scene geometry reconstruction (e.g., $\pi^3$), the proposed method leverages reliable camera poses and rich spatial-temporal priors, which ensure stable training and robust inference for the model. Extensive experiments demonstrate that by eliminating complex pre-processing and iterative refinement, our approach achieves state-of-the-art motion segmentation performance with high efficiency. The code is available at:this https URL.
54. 【2602.21780】XStreamVGGT: Extremely Memory-Efficient Streaming Vision Geometry Grounded Transformer with KV Cache Compression
链接:https://arxiv.org/abs/2602.21780
作者:Zunhai Su,Weihao Ye,Hansen Feng,Keyu Fan,Jing Zhang,Dahai Yu,Zhengwu Liu,Ngai Wong
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:visual geometry models, visual geometry, large-scale transformers, geometry models, models have significantly
备注: Submission to the Journal of the Society for Information Display
点击查看摘要
Abstract:Learning-based 3D visual geometry models have significantly advanced with the advent of large-scale transformers. Among these, StreamVGGT leverages frame-wise causal attention to deliver robust and efficient streaming 3D reconstruction. However, it suffers from unbounded growth in the Key-Value (KV) cache due to the massive influx of vision tokens from multi-image and long-video inputs, leading to increased memory consumption and inference latency as input frames accumulate. This ultimately limits its scalability for long-horizon applications. To address this gap, we propose XStreamVGGT, a tuning-free approach that seamlessly integrates pruning and quantization to systematically compress the KV cache, enabling extremely memory-efficient streaming inference. Specifically, redundant KVs generated from multi-frame inputs are initially pruned to conform to a fixed KV memory budget using an efficient token-importance identification mechanism that maintains full compatibility with high-performance attention kernels (e.g., FlashAttention). Additionally, leveraging the inherent distribution patterns of KV tensors, we apply dimension-adaptive KV quantization within the pruning pipeline to further minimize memory overhead while preserving numerical accuracy. Extensive evaluations show that XStreamVGGT achieves mostly negligible performance degradation while substantially reducing memory usage by 4.42$\times$ and accelerating inference by 5.48$\times$, enabling practical and scalable streaming 3D applications. The code is available at this https URL.
55. 【2602.21779】Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models
链接:https://arxiv.org/abs/2602.21779
作者:Zheyuan Gu,Qingsong Zhao,Yusong Wang,Zhaohong Huang,Xinqi Li,Cheng Yuan,Jiaowei Shao,Chi Zhang,Xuelong Li
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Current Vision-Language Models, Current Vision-Language, identifying spatial artifacts, critical dimension, video forgeries
备注: 16 pages, 9 figures. Submitted to CVPR 2026
点击查看摘要
Abstract:Current Vision-Language Models (VLMs) for deepfake detection excel at identifying spatial artifacts but overlook a critical dimension: temporal inconsistencies in video forgeries. Adapting VLMs to reason about these dynamic cues remains a distinct challenge. To bridge this gap, we propose Forensic Answer-Questioning (FAQ), a large-scale benchmark that formulates temporal deepfake analysis as a multiple-choice task. FAQ introduces a three-level hierarchy to progressively evaluate and equip VLMs with forensic capabilities: (1) Facial Perception, testing the ability to identify static visual artifacts; (2) Temporal Deepfake Grounding, requiring the localization of dynamic forgery artifacts across frames; and (3) Forensic Reasoning, challenging models to synthesize evidence for final authenticity verdicts. We evaluate a range of VLMs on FAQ and generate a corresponding instruction-tuning set, FAQ-IT. Extensive experiments show that models fine-tuned on FAQ-IT achieve advanced performance on both in-domain and cross-dataset detection benchmarks. Ablation studies further validate the impact of our key design choices, confirming that FAQ is the driving force behind the temporal reasoning capabilities of these VLMs.
56. 【2602.21778】From Statics to Dynamics: Physics-Aware Image Editing with Latent Transition Priors
链接:https://arxiv.org/abs/2602.21778
作者:Liangbing Zhao,Le Zhuo,Sayak Paul,Hongsheng Li,Mohamed Elhoseiny
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieved remarkable success, involves complex causal, Instruction-based image editing, complex causal dynamics, editing involves complex
备注: All code, checkpoints, and datasets are available at [this https URL](https://liangbingzhao.github.io/statics2dynamics/)
点击查看摘要
Abstract:Instruction-based image editing has achieved remarkable success in semantic alignment, yet state-of-the-art models frequently fail to render physically plausible results when editing involves complex causal dynamics, such as refraction or material deformation. We attribute this limitation to the dominant paradigm that treats editing as a discrete mapping between image pairs, which provides only boundary conditions and leaves transition dynamics underspecified. To address this, we reformulate physics-aware editing as predictive physical state transitions and introduce PhysicTran38K, a large-scale video-based dataset comprising 38K transition trajectories across five physical domains, constructed via a two-stage filtering and constraint-aware annotation pipeline. Building on this supervision, we propose PhysicEdit, an end-to-end framework equipped with a textual-visual dual-thinking mechanism. It combines a frozen Qwen2.5-VL for physically grounded reasoning with learnable transition queries that provide timestep-adaptive visual guidance to a diffusion backbone. Experiments show that PhysicEdit improves over Qwen-Image-Edit by 5.9% in physical realism and 10.1% in knowledge-grounded editing, setting a new state-of-the-art for open-source methods, while remaining competitive with leading proprietary models.
57. 【2602.21773】Easy to Learn, Yet Hard to Forget: Towards Robust Unlearning Under Bias
链接:https://arxiv.org/abs/2602.21773
作者:JuneHyoung Kwon,MiHyeon Kim,Eunju Lee,Yoonji Lee,Seunghoon Lee,YoungBin Kim
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:ensuring data privacy, Machine unlearning, forget specific data, crucial for ensuring, Machine
备注: Accepted to AAAI 2026
点击查看摘要
Abstract:Machine unlearning, which enables a model to forget specific data, is crucial for ensuring data privacy and model reliability. However, its effectiveness can be severely undermined in real-world scenarios where models learn unintended biases from spurious correlations within the data. This paper investigates the unique challenges of unlearning from such biased models. We identify a novel phenomenon we term ``shortcut unlearning," where models exhibit an ``easy to learn, yet hard to forget" tendency. Specifically, models struggle to forget easily-learned, bias-aligned samples; instead of forgetting the class attribute, they unlearn the bias attribute, which can paradoxically improve accuracy on the class intended to be forgotten. To address this, we propose CUPID, a new unlearning framework inspired by the observation that samples with different biases exhibit distinct loss landscape sharpness. Our method first partitions the forget set into causal- and bias-approximated subsets based on sample sharpness, then disentangles model parameters into causal and bias pathways, and finally performs a targeted update by routing refined causal and bias gradients to their respective pathways. Extensive experiments on biased datasets including Waterbirds, BAR, and Biased NICO++ demonstrate that our method achieves state-of-the-art forgetting performance and effectively mitigates the shortcut unlearning problem.
58. 【2602.21762】SAPNet++: Evolving Point-Prompted Instance Segmentation with Semantic and Spatial Awareness
链接:https://arxiv.org/abs/2602.21762
作者:Zhaoyang Wei,Xumeng Han,Xuehui Yu,Xue Yang,Guorong Li,Zhenjun Han,Jianbin Jiao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:labeling cost reduction, cost reduction, increasingly prominent, prominent in visual, labeling cost
备注: 18 pages
点击查看摘要
Abstract:Single-point annotation is increasingly prominent in visual tasks for labeling cost reduction. However, it challenges tasks requiring high precision, such as the point-prompted instance segmentation (PPIS) task, which aims to estimate precise masks using single-point prompts to train a segmentation network. Due to the constraints of point annotations, granularity ambiguity and boundary uncertainty arise the difficulty distinguishing between different levels of detail (eg. whole object vs. parts) and the challenge of precisely delineating object boundaries. Previous works have usually inherited the paradigm of mask generation along with proposal selection to achieve PPIS. However, proposal selection relies solely on category information, failing to resolve the ambiguity of different granularity. Furthermore, mask generators offer only finite discrete solutions that often deviate from actual masks, particularly at boundaries. To address these issues, we propose the Semantic-Aware Point-Prompted Instance Segmentation Network (SAPNet). It integrates Point Distance Guidance and Box Mining Strategy to tackle group and local issues caused by the point's granularity ambiguity. Additionally, we incorporate completeness scores within proposals to add spatial granularity awareness, enhancing multiple instance learning (MIL) in proposal selection termed S-MIL. The Multi-level Affinity Refinement conveys pixel and semantic clues, narrowing boundary uncertainty during mask refinement. These modules culminate in SAPNet++, mitigating point prompt's granularity ambiguity and boundary uncertainty and significantly improving segmentation performance. Extensive experiments on four challenging datasets validate the effectiveness of our methods, highlighting the potential to advance PPIS.
59. 【2602.21760】Accelerating Diffusion via Hybrid Data-Pipeline Parallelism Based on Conditional Guidance Scheduling
链接:https://arxiv.org/abs/2602.21760
作者:Euisoo Jung,Byunghyun Kim,Hyunjin Kim,Seonghye Cho,Jae-Gil Lee
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:remains computationally expensive, achieved remarkable progress, inference remains computationally, computationally expensive, Diffusion models
备注:
点击查看摘要
Abstract:Diffusion models have achieved remarkable progress in high-fidelity image, video, and audio generation, yet inference remains computationally expensive. Nevertheless, current diffusion acceleration methods based on distributed parallelism suffer from noticeable generation artifacts and fail to achieve substantial acceleration proportional to the number of GPUs. Therefore, we propose a hybrid parallelism framework that combines a novel data parallel strategy, condition-based partitioning, with an optimal pipeline scheduling method, adaptive parallelism switching, to reduce generation latency and achieve high generation quality in conditional diffusion models. The key ideas are to (i) leverage the conditional and unconditional denoising paths as a new data-partitioning perspective and (ii) adaptively enable optimal pipeline parallelism according to the denoising discrepancy between these two paths. Our framework achieves $2.31\times$ and $2.07\times$ latency reductions on SDXL and SD3, respectively, using two NVIDIA RTX~3090 GPUs, while preserving image quality. This result confirms the generality of our approach across U-Net-based diffusion models and DiT-based flow-matching architectures. Our approach also outperforms existing methods in acceleration under high-resolution synthesis settings. Code is available at this https URL.
60. 【2602.21754】LiREC-Net: A Target-Free and Learning-Based Network for LiDAR, RGB, and Event Calibration
链接:https://arxiv.org/abs/2602.21754
作者:Aditya Ranjan Dash,Ramy Battrawy,René Schuster,Didier Stricker
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Advanced autonomous systems, autonomous systems rely, Advanced autonomous, robust perception, autonomous systems
备注: Accepted in CVPR 2026
点击查看摘要
Abstract:Advanced autonomous systems rely on multi-sensor fusion for safer and more robust perception. To enable effective fusion, calibrating directly from natural driving scenes (i.e., target-free) with high accuracy is crucial for precise multi-sensor alignment. Existing learning-based calibration methods are typically designed for only a single pair of sensor modalities (i.e., a bi-modal setup). Unlike these methods, we propose LiREC-Net, a target-free, learning-based calibration network that jointly calibrates multiple sensor modality pairs, including LiDAR, RGB, and event data, within a unified framework. To reduce redundant computation and improve efficiency, we introduce a shared LiDAR representation that leverages features from both its 3D nature and projected depth map, ensuring better consistency across modalities. Trained and evaluated on established datasets, such as KITTI and DSEC, our LiREC-Net achieves competitive performance to bi-modal models and sets a new strong baseline for the tri-modal use case.
61. 【2602.21743】Enhancing Multi-Modal LLMs Reasoning via Difficulty-Aware Group Normalization
链接:https://arxiv.org/abs/2602.21743
作者:Jinghan Li,Junfeng Fang,Jinda Lu,Yuan Wang,Xiaoyan Guo,Tianyu Zhang,Xiang Wang,Xiangnan He
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Relative Policy Optimization, Group Relative Policy, Reinforcement Learning, Policy Optimization, Learning with Verifiable
备注:
点击查看摘要
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) and Group Relative Policy Optimization (GRPO) have significantly advanced the reasoning capabilities of large language models. Extending these methods to multimodal settings, however, faces a critical challenge: the instability of std-based normalization, which is easily distorted by extreme samples with nearly positive or negative rewards. Unlike pure-text LLMs, multimodal models are particularly sensitive to such distortions, as both perceptual and reasoning errors influence their responses. To address this, we characterize each sample by its difficulty, defined through perceptual complexity (measured via visual entropy) and reasoning uncertainty (captured by model confidence). Building on this characterization, we propose difficulty-aware group normalization (Durian), which re-groups samples by difficulty levels and shares the std within each group. Our approach preserves GRPO's intra-group distinctions while eliminating sensitivity to extreme cases, yielding significant performance gains across multiple multimodal reasoning benchmarks.
62. 【2602.21740】Structure-to-Image: Zero-Shot Depth Estimation in Colonoscopy via High-Fidelity Sim-to-Real Adaptation
链接:https://arxiv.org/abs/2602.21740
作者:Juan Yang,Yuyan Zhang,Han Jia,Bing Hu,Wanzhong Song
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Monocular depth estimation, Monocular depth, real-world images, colonoscopy is hampered, gap between simulated
备注: \c{opyright} 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
点击查看摘要
Abstract:Monocular depth estimation (MDE) for colonoscopy is hampered by the domain gap between simulated and real-world images. Existing image-to-image translation methods, which use depth as a posterior constraint, often produce structural distortions and specular highlights by failing to balance realism with structure consistency. To address this, we propose a Structure-to-Image paradigm that transforms the depth map from a passive constraint into an active generative foundation. We are the first to introduce phase congruency to colonoscopic domain adaptation and design a cross-level structure constraint to co-optimize geometric structures and fine-grained details like vascular textures. In zero-shot evaluations conducted on a publicly available phantom dataset, the MDE model that was fine-tuned on our generated data achieved a maximum reduction of 44.18% in RMSE compared to competing methods. Our code is available at this https URL.
63. 【2602.21735】SigVLP: Sigmoid Volume-Language Pre-Training for Self-Supervised CT-Volume Adaptive Representation Learning
链接:https://arxiv.org/abs/2602.21735
作者:Jiayi Wang,Hadrien Reynaud,Ibrahim Ethem Hamamci,Sezgin Er,Suprosanna Shit,Bjoern Menze,Bernhard Kainz
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:volumetric medical imaging, medical imaging datasets, imaging datasets typically, datasets typically aggregate, typically aggregate scans
备注:
点击查看摘要
Abstract:Large-scale, volumetric medical imaging datasets typically aggregate scans from different vendors and devices, resulting in highly variable resolution, slice thicknesses, and numbers of slices per study. Consequently, training representation models usually requires cropping or interpolating along the z-axis to obtain fixed-size blocks, which inevitably causes information loss. We propose a new training approach to overcome this limitation. Instead of absolute position embeddings, we interpret volumes as sequences of 3D chunks and adopt Rotary Position Embeddings, allowing us to treat the z-axis as an unconstrained temporal dimensions. Building on this idea, we introduce a new vision-language model: SigVLP. In SigVLP, we implement Rotary Position Embedding as the positional encoding method, which is applied directly within the attention operation, generating input-conditioned sine and cosine weights on the fly. This design ensures consistent alignment between query and key projections and adapts to any input sizes. To allow for variable input size during training, we sample Computed Tomography volumes in chunks and pair them with localized organ-wise textual observations. Compared to using entire reports for conditioning, chunkwise alignment provides finer-grained supervision, enabling the model to establish stronger correlations between the text and volume representations, thereby improving the precision of text-to-volume alignment. Our models are trained with the Muon optimizer and evaluated on a diverse set of downstream tasks, including zero-shot abnormality and organ classification, segmentation, and retrieval tasks.
64. 【2602.21716】ranX-Adapter: Bridging Artifacts and Semantics within MLLMs for Robust AI-generated Image Detection
链接:https://arxiv.org/abs/2602.21716
作者:Wenbin Wang,Yuge Huang,Jianqing Xu,Yue Yu,Jiangtao Yan,Shouhong Ding,Pan Zhou,Yong Luo
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:technology enable highly, highly realistic synthesis, enable highly realistic, Rapid advances, threatening public information
备注:
点击查看摘要
Abstract:Rapid advances in AI-generated image (AIGI) technology enable highly realistic synthesis, threatening public information integrity and security. Recent studies have demonstrated that incorporating texture-level artifact features alongside semantic features into multimodal large language models (MLLMs) can enhance their AIGI detection capability. However, our preliminary analyses reveal that artifact features exhibit high intra-feature similarity, leading to an almost uniform attention map after the softmax operation. This phenomenon causes attention dilution, thereby hindering effective fusion between semantic and artifact features. To overcome this limitation, we propose a lightweight fusion adapter, TranX-Adapter, which integrates a Task-aware Optimal-Transport Fusion that leverages the Jensen-Shannon divergence between artifact and semantic prediction probabilities as a cost matrix to transfer artifact information into semantic features, and an X-Fusion that employs cross-attention to transfer semantic information into artifact features. Experiments on standard AIGI detection benchmarks upon several advanced MLLMs, show that our TranX-Adapter brings consistent and significant improvements (up to +6% accuracy).
65. 【2602.21712】Innovative Tooth Segmentation Using Hierarchical Features and Bidirectional Sequence Modeling
链接:https://arxiv.org/abs/2602.21712
作者:Xinxin Zhao,Jian Jiang,Yan Tian,Liqin Wu,Zhaocheng Xu,Teddy Yang,Yunuo Zou,Xun Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Pattern Recognition, dental digitization, Pattern Recognition Subjects, Tooth image segmentation, dental images
备注: Accepted by Pattern Recognition
点击查看摘要
Abstract:Tooth image segmentation is a cornerstone of dental digitization. However, traditional image encoders relying on fixed-resolution feature maps often lead to discontinuous segmentation and poor discrimination between target regions and background, due to insufficient modeling of environmental and global context. Moreover, transformer-based self-attention introduces substantial computational overhead because of its quadratic complexity (O(n^2)), making it inefficient for high-resolution dental images. To address these challenges, we introduce a three-stage encoder with hierarchical feature representation to capture scale-adaptive information in dental images. By jointly leveraging low-level details and high-level semantics through cross-scale feature fusion, the model effectively preserves fine structural information while maintaining strong contextual awareness. Furthermore, a bidirectional sequence modeling strategy is incorporated to enhance global spatial context understanding without incurring high computational cost. We validate our method on two dental datasets, with experimental results demonstrating its superiority over existing approaches. On the OralVision dataset, our model achieves a 1.1% improvement in mean intersection over union (mIoU).
Comments:
Accepted by Pattern Recognition
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2602.21712 [cs.CV]
(or
arXiv:2602.21712v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2602.21712
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Journalreference:
Xinxin Zhao, Jian Jiang, Yan Tian, Liqin Wu, Zhaocheng Xu, Wei-fa Yang, Yunuo Zou, Xun Wang. Innovative tooth segmentation using hierarchical features and bidirectional sequence modeling[J]. Pattern Recognition, 2026, 175:113045
Related DOI:
https://doi.org/10.1016/j.patcog.2026.113045
Focus to learn more
DOI(s) linking to related resources</p>
66. 【2602.21709】Assessing airborne laser scanning and aerial photogrammetry for deep learning-based stand delineation
链接:https://arxiv.org/abs/2602.21709
作者:Håkon Næss Sandum,Hans Ole Ørka,Oliver Tomic,Terje Gobakken
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Accurate forest stand, Accurate forest, subjective process, inventory and management, management but remains
备注: 20 pages, 4 figures, 4 tables
点击查看摘要
Abstract:Accurate forest stand delineation is essential for forest inventory and management but remains a largely manual and subjective process. A recent study has shown that deep learning can produce stand delineations comparable to expert interpreters when combining aerial imagery and airborne laser scanning (ALS) data. However, temporal misalignment between data sources limits operational scalability. Canopy height models (CHMs) derived from digital photogrammetry (DAP) offer better temporal alignment but may smoothen canopy surface and canopy gaps, raising the question of whether they can reliably replace ALS-derived CHMs. Similarly, the inclusion of a digital terrain model (DTM) has been suggested to improve delineation performance, but has remained untested in published literature. Using expert-delineated forest stands as reference data, we assessed a U-Net-based semantic segmentation framework with municipality-level cross-validation across six municipalities in southeastern Norway. We compared multispectral aerial imagery combined with (i) an ALS-derived CHM, (ii) a DAP-derived CHM, and (iii) a DAP-derived CHM in combination with a DTM. Results showed comparable performance across all data combinations, reaching overall accuracy values between 0.90-0.91. Agreement between model predictions was substantially larger than agreement with the reference data, highlighting both model consistency and the inherent subjectivity of stand delineation. The similar performance of DAP-CHMs, despite the reduced structural detail, and the lack of improvements of the DTM indicate that the framework is resilient to variations in input data. These findings indicate that large datasets for deep learning-based stand delineations can be assembled using projects including temporally aligned ALS data and DAP point clouds.
67. 【2602.21706】SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video
链接:https://arxiv.org/abs/2602.21706
作者:Guanyi Qin,Xiaozhen Wang,Zhu Zhuo,Chang Han Low,Yuancan Xiao,Yibing Fu,Haofeng Liu,Kai Wang,Chunjiang Li,Yueming Jin
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Minimally invasive surgery, patient operative outcomes, integrate visual cues, high cognitive load, improved patient operative
备注:
点击查看摘要
Abstract:Minimally invasive surgery has dramatically improved patient operative outcomes, yet identifying safe operative zones remains challenging in critical phases, requiring surgeons to integrate visual cues, procedural phase, and anatomical context under high cognitive load. Existing AI systems offer binary safety verification or static detection, ignoring the phase-dependent nature of intraoperative reasoning. We introduce ResGo, a benchmark of laparoscopic frames annotated with Go Zone bounding boxes and clinician-authored rationales covering phase, exposure quality reasoning, next action and risk reminder. We introduce evaluation metrics that treat correct grounding under incorrect phase as failures, revealing that most vision-language models cannot handle such tasks and perform poorly. We then present SurGo-R1, a model optimized via RLHF with a multi-turn phase-then-go architecture where the model first identifies the surgical phase, then generates reasoning and Go Zone coordinates conditioned on that context. On unseen procedures, SurGo-R1 achieves 76.6% phase accuracy, 32.7 mIoU, and 54.8% hardcore accuracy, a 6.6$\times$ improvement over the mainstream generalist VLMs. Code, model and benchmark will be available at this https URL
68. 【2602.21704】Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models
链接:https://arxiv.org/abs/2602.21704
作者:Jianghao Yin,Qin Chen,Kedi Chen,Jie Zhou,Xingjiao Wu,Liang He
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Large Vision-Language Models, Large Vision-Language, exhibit outstanding performance, exhibit outstanding, Dynamic Multimodal Activation
备注: Accepted by ICLR 2026
点击查看摘要
Abstract:Large Vision-Language Models (LVLMs) exhibit outstanding performance on vision-language tasks but struggle with hallucination problems. Through in-depth analysis of LVLM activation patterns, we reveal two key findings: 1) truthfulness and visual perception capabilities predominantly engage different subsets of attention heads within the model architecture; and 2) truthfulness steering vectors vary significantly across different semantic contexts. Based on these observations, we propose Dynamic Multimodal Activation Steering, a training-free approach for hallucination mitigation. Our method constructs a semantic-based truthfulness steering vector database and computes visual perception steering vectors, enabling context-aware interventions during inference by dynamically selecting the most relevant steering vectors based on input semantic similarity and applying them to the most influential attention heads. We conduct comprehensive experiments across multiple models and datasets, demonstrating that our approach significantly enhances model performance, outperforming existing state-of-the-art methods.
69. 【2602.21703】Brain Tumor Segmentation with Special Emphasis on the Non-Enhancing Brain Tumor Compartment
链接:https://arxiv.org/abs/2602.21703
作者:T. Schaffer,A. Brawanski,S. Wein,A. M. Tomé,E. W. Lang
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:U-Net based deep, based deep learning, deep learning architecture, MRI modalities, U-Net based
备注:
点击查看摘要
Abstract:A U-Net based deep learning architecture is designed to segment brain tumors as they appear on various MRI modalities. Special emphasis is lent to the non-enhancing tumor compartment. The latter has not been considered anymore in recent brain tumor segmentation challenges like the MICCAI challenges. However, it is considered to be indicative of the survival time of the patient as well as of areas of further tumor growth. Hence it deems essential to have means to automatically delineate its extension within the tumor.
70. 【2602.21699】SF3D-RGB: Scene Flow Estimation from Monocular Camera and Sparse LiDAR
链接:https://arxiv.org/abs/2602.21699
作者:Rajai Alhimdiat,Ramy Battrawy,René Schuster,Didier Stricker,Wesam Ashour
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:extremely important task, Scene flow, Scene flow estimation, Scene, extremely important
备注: Accepted in Computer Vision Conference (CVC) 2026
点击查看摘要
Abstract:Scene flow estimation is an extremely important task in computer vision to support the perception of dynamic changes in the scene. For robust scene flow, learning-based approaches have recently achieved impressive results using either image-based or LiDAR-based modalities. However, these methods have tended to focus on the use of a single modality. To tackle these problems, we present a deep learning architecture, SF3D-RGB, that enables sparse scene flow estimation using 2D monocular images and 3D point clouds (e.g., acquired by LiDAR) as inputs. Our architecture is an end-to-end model that first encodes information from each modality into features and fuses them together. Then, the fused features enhance a graph matching module for better and more robust mapping matrix computation to generate an initial scene flow. Finally, a residual scene flow module further refines the initial scene flow. Our model is designed to strike a balance between accuracy and efficiency. Furthermore, experiments show that our proposed method outperforms single-modality methods and achieves better scene flow accuracy on real-world datasets while using fewer parameters compared to other state-of-the-art methods with fusion.
71. 【2602.21698】E-comIQ-ZH: A Human-Aligned Dataset and Benchmark for Fine-Grained Evaluation of E-commerce Posters with Chain-of-Thought
链接:https://arxiv.org/abs/2602.21698
作者:Meiqi Sun,Mingyu Li,Junxiong Zhu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:create commercial posters, create commercial, Chinese e-commerce posters, Generative, e-commerce posters
备注: 21pages, 19figures, accepted by CVPR 2026
点击查看摘要
Abstract:Generative AI is widely used to create commercial posters. However, rapid advances in generation have outpaced automated quality assessment. Existing models emphasize generic esthetics or low level distortions and lack the functional criteria required for e-commerce design. It is especially challenging for Chinese content, where complex characters often produce subtle but critical textual artifacts that are overlooked by existing methods. To address this, we introduce E-comIQ-ZH, a framework for evaluating Chinese e-commerce posters. We build the first dataset E-comIQ-18k to feature multi dimensional scores and expert calibrated Chain of Thought (CoT) rationales. Using this dataset, we train E-comIQ-M, a specialized evaluation model that aligns with human expert judgment. Our framework enables E-comIQ-Bench, the first automated and scalable benchmark for the generation of Chinese e-commerce posters. Extensive experiments show our E-comIQ-M aligns more closely with expert standards and enables scalable automated assessment of e-commerce posters. All datasets, models, and evaluation tools will be released to support future research in this this http URL will be available at this https URL.
72. 【2602.21668】Space-Time Forecasting of Dynamic Scenes with Motion-aware Gaussian Grouping
链接:https://arxiv.org/abs/2602.21668
作者:Junmyeong Lee,Hoseung Choi,Minsu Cho
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:limited observations make, capture coherent object-level, Gaussian Splatting representation, Group-aware Gaussian Forecasting, Motion Group-aware Gaussian
备注: 20 pages, 13 figures
点击查看摘要
Abstract:Forecasting dynamic scenes remains a fundamental challenge in computer vision, as limited observations make it difficult to capture coherent object-level motion and long-term temporal evolution. We present Motion Group-aware Gaussian Forecasting (MoGaF), a framework for long-term scene extrapolation built upon the 4D Gaussian Splatting representation. MoGaF introduces motion-aware Gaussian grouping and group-wise optimization to enforce physically consistent motion across both rigid and non-rigid regions, yielding spatially coherent dynamic representations. Leveraging this structured space-time representation, a lightweight forecasting module predicts future motion, enabling realistic and temporally stable scene evolution. Experiments on synthetic and real-world datasets demonstrate that MoGaF consistently outperforms existing baselines in rendering quality, motion plausibility, and long-term forecasting stability. Our project page is available at this https URL
73. 【2602.21667】Send Less, Perceive More: Masked Quantized Point Cloud Communication for Loss-Tolerant Collaborative Perception
链接:https://arxiv.org/abs/2602.21667
作者:Sheng Xu,Enshu Wang,Hongfei Xue,Jian Teng,Bingyi Liu,Yi Zhu,Pu Wang,Libing Wu,Chunming Qiao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Collaborative perception, sharing sensory information, packet loss, perception allows connected, connected vehicles
备注:
点击查看摘要
Abstract:Collaborative perception allows connected vehicles to overcome occlusions and limited viewpoints by sharing sensory information. However, existing approaches struggle to achieve high accuracy under strict bandwidth constraints and remain highly vulnerable to random transmission packet loss. We introduce QPoint2Comm, a quantized point-cloud communication framework that dramatically reduces bandwidth while preserving high-fidelity 3D information. Instead of transmitting intermediate features, QPoint2Comm directly communicates quantized point-cloud indices using a shared codebook, enabling efficient reconstruction with lower bandwidth than feature-based methods. To ensure robustness to possible communication packet loss, we employ a masked training strategy that simulates random packet loss, allowing the model to maintain strong performance even under severe transmission failures. In addition, a cascade attention fusion module is proposed to enhance multi-vehicle information integration. Extensive experiments on both simulated and real-world datasets demonstrate that QPoint2Comm sets a new state of the art in accuracy, communication efficiency, and resilience to packet loss.
74. 【2602.21662】HybridINR-PCGC: Hybrid Lossless Point Cloud Geometry Compression Bridging Pretrained Model and Implicit Neural Representation
链接:https://arxiv.org/abs/2602.21662
作者:Wenjie Huang,Qi Yang,Shuting Xia,He Huang,Zhu Li,Yiling Xu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Learning-based point cloud, Learning-based point, point cloud compression, cloud compression presents, handcrafted codecs
备注: 8 pages, 10 figures
点击查看摘要
Abstract:Learning-based point cloud compression presents superior performance to handcrafted codecs. However, pretrained-based methods, which are based on end-to-end training and expected to generalize to all the potential samples, suffer from training data dependency. Implicit neural representation (INR) based methods are distribution-agnostic and more robust, but they require time-consuming online training and suffer from the bitstream overhead from the overfitted model. To address these limitations, we propose HybridINR-PCGC, a novel hybrid framework that bridges the pretrained model and INR. Our framework retains distribution-agnostic properties while leveraging a pretrained network to accelerate convergence and reduce model overhead, which consists of two parts: the Pretrained Prior Network (PPN) and the Distribution Agnostic Refiner (DAR). We leverage the PPN, designed for fast inference and stable performance, to generate a robust prior for accelerating the DAR's convergence. The DAR is decomposed into a base layer and an enhancement layer, and only the enhancement layer needed to be packed into the bitstream. Finally, we propose a supervised model compression module to further supervise and minimize the bitrate of the enhancement layer parameters. Based on experiment results, HybridINR-PCGC achieves a significantly improved compression rate and encoding efficiency. Specifically, our method achieves a Bpp reduction of approximately 20.43% compared to G-PCC on 8iVFB. In the challenging out-of-distribution scenario Cat1B, our method achieves a Bpp reduction of approximately 57.85% compared to UniPCGC. And our method exhibits a superior time-rate trade-off, achieving an average Bpp reduction of 15.193% relative to the LINR-PCGC on 8iVFB.
75. 【2602.21657】Following the Diagnostic Trace: Visual Cognition-guided Cooperative Network for Chest X-Ray Diagnosis
链接:https://arxiv.org/abs/2602.21657
作者:Shaoxuan Wu,Jingkun Chen,Chong Ma,Cong Shen,Xiao Zhang,Jun Feng
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:automated chest X-ray, chest X-ray diagnosis, significantly advanced automated, advanced automated chest, lacks reliable decision
备注:
点击查看摘要
Abstract:Computer-aided diagnosis (CAD) has significantly advanced automated chest X-ray diagnosis but remains isolated from clinical workflows and lacks reliable decision support and interpretability. Human-AI collaboration seeks to enhance the reliability of diagnostic models by integrating the behaviors of controllable radiologists. However, the absence of interactive tools seamlessly embedded within diagnostic routines impedes collaboration, while the semantic gap between radiologists' decision-making patterns and model representations further limits clinical adoption. To overcome these limitations, we propose a visual cognition-guided collaborative network (VCC-Net) to achieve the cooperative diagnostic paradigm. VCC-Net centers on visual cognition (VC) and employs clinically compatible interfaces, such as eye-tracking or the mouse, to capture radiologists' visual search traces and attention patterns during diagnosis. VCC-Net employs VC as a spatial cognition guide, learning hierarchical visual search strategies to localize diagnostically key regions. A cognition-graph co-editing module subsequently integrates radiologist VC with model inference to construct a disease-aware graph. The module captures dependencies among anatomical regions and aligns model representations with VC-driven features, mitigating radiologist bias and facilitating complementary, transparent decision-making. Experiments on the public datasets SIIM-ACR, EGD-CXR, and self-constructed TB-Mouse dataset achieved classification accuracies of 88.40%, 85.05%, and 92.41%, respectively. The attention maps produced by VCC-Net exhibit strong concordance with radiologists' gaze distributions, demonstrating a mutual reinforcement of radiologist and model inference. The code is available at this https URL.
76. 【2602.21655】CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning
链接:https://arxiv.org/abs/2602.21655
作者:Zhijiang Tang,Linhua Wang,Jiaxin Qi,Weihao Jiang,Peng Hou,Anxiang Zeng,Jianqiang Huang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:vision language understanding, language understanding, human-annotated references, remains a fundamental, fundamental task
备注: Accept by CVPR 2026
点击查看摘要
Abstract:Image captioning remains a fundamental task for vision language understanding, yet ground-truth supervision still relies predominantly on human-annotated references. Because human annotations reflect subjective preferences and expertise, ground-truth captions are often incomplete or even incorrect, which in turn limits caption models. We argue that caption quality should be assessed by two objective aspects: completeness (does the caption cover all salient visual facts?) and correctness (are the descriptions true with respect to the image?). To this end, we introduce CCCaption: a dual-reward reinforcement learning framework with a dedicated fine-tuning corpus that explicitly optimizes these properties to generate \textbf{C}omplete and \textbf{C}orrect \textbf{Captions}. For completeness, we use diverse LVLMs to disentangle the image into a set of visual queries, and reward captions that answer more of these queries, with a dynamic query sampling strategy to improve training efficiency. For correctness, we penalize captions that contain hallucinations by validating the authenticity of sub-caption queries, which are derived from the caption decomposition. Our symmetric dual-reward optimization jointly maximizes completeness and correctness, guiding models toward captions that better satisfy these objective criteria. Extensive experiments across standard captioning benchmarks show consistent improvements, offering a principled path to training caption models beyond human-annotation imitation.
77. 【2602.21645】Lie Flow: Video Dynamic Fields Modeling and Predicting with Lie Algebra as Geometric Physics Principle
链接:https://arxiv.org/abs/2602.21645
作者:Weidong Qiao,Wangmeng Zuo,Hui Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:physically consistent representations, challenging due, rigid and non-rigid, scenes requires capturing, spatial structure
备注: 10pages,5 figures
点击查看摘要
Abstract:Modeling 4D scenes requires capturing both spatial structure and temporal motion, which is challenging due to the need for physically consistent representations of complex rigid and non-rigid motions. Existing approaches mainly rely on translational displacements, which struggle to represent rotations, articulated transformations, often leading to spatial inconsistency and physically implausible motion. LieFlow, a dynamic radiance representation framework that explicitly models motion within the SE(3) Lie group, enabling coherent learning of translation and rotation in a unified geometric space. The SE(3) transformation field enforces physically inspired constraints to maintain motion continuity and geometric consistency. The evaluation includes a synthetic dataset with rigid-body trajectories and two real-world datasets capturing complex motion under natural lighting and occlusions. Across all datasets, LieFlow consistently improves view-synthesis fidelity, temporal coherence, and physical realism over NeRF-based baselines. These results confirm that SE(3)-based motion modeling offers a robust and physically grounded framework for representing dynamic 4D scenes.
78. 【2602.21637】CARE: A Molecular-Guided Foundation Model with Adaptive Region Modeling for Whole Slide Image Analysis
链接:https://arxiv.org/abs/2602.21637
作者:Di Zhang,Zhangpeng Gong,Xiaobo Pang,Jiashuai Liu,Junbo Lu,Hao Cui,Jiusong Ge,Zhi Zeng,Kai Yi,Yinghua Li,Si Liu,Tingsong Yu,Haoran Wang,Mireia Crispin-Ortuzar,eimiao Yu,Chen Li,Zeyu Gao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:demonstrating strong generalization, recently achieved impressive, achieved impressive success, diverse histopathology tasks, demonstrating strong
备注:
点击查看摘要
Abstract:Foundation models have recently achieved impressive success in computational pathology, demonstrating strong generalization across diverse histopathology tasks. However, existing models overlook the heterogeneous and non-uniform organization of pathological regions of interest (ROIs) because they rely on natural image backbones not tailored for tissue morphology. Consequently, they often fail to capture the coherent tissue architecture beyond isolated patches, limiting interpretability and clinical relevance. To address these challenges, we present Cross-modal Adaptive Region Encoder (CARE), a foundation model for pathology that automatically partitions WSIs into several morphologically relevant regions. Specifically, CARE employs a two-stage pretraining strategy: (1) a self-supervised unimodal pretraining stage that learns morphological representations from 34,277 whole-slide images (WSIs) without segmentation annotations, and (2) a cross-modal alignment stage that leverages RNA and protein profiles to refine the construction and representation of adaptive regions. This molecular guidance enables CARE to identify biologically relevant patterns and generate irregular yet coherent tissue regions, selecting the most representative area as ROI. CARE supports a broad range of pathology-related tasks, using either the ROI feature or the slide-level feature obtained by aggregating adaptive regions. Based on only one-tenth of the pretraining data typically used by mainstream foundation models, CARE achieves superior average performance across 33 downstream benchmarks, including morphological classification, molecular prediction, and survival analysis, and outperforms other foundation model baselines overall.
79. 【2602.21636】Axial-Centric Cross-Plane Attention for 3D Medical Image Classification
链接:https://arxiv.org/abs/2602.21636
作者:Doyoung Park,Jinsoo Kim,Lohendran Baskaran
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Clinicians commonly interpret, commonly interpret three-dimensional, Clinicians commonly, single volumetric representation, interpret three-dimensional
备注: Submitted to MICCAI 2026
点击查看摘要
Abstract:Clinicians commonly interpret three-dimensional (3D) medical images, such as computed tomography (CT) scans, using multiple anatomical planes rather than as a single volumetric representation. In this multi-planar approach, the axial plane typically serves as the primary acquisition and diagnostic reference, while the coronal and sagittal planes provide complementary spatial information to increase diagnostic confidence. However, many existing 3D deep learning methods either process volumetric data holistically or assign equal importance to all planes, failing to reflect the axial-centric clinical interpretation workflow. To address this gap, we propose an axial-centric cross-plane attention architecture for 3D medical image classification that captures the inherent asymmetric dependencies between different anatomical planes. Our architecture incorporates MedDINOv3, a medical vision foundation model pretrained via self-supervised learning on large-scale axial CT images, as a frozen feature extractor for the axial, coronal, and sagittal planes. RICA blocks and intra-plane transformer encoders capture plane-specific positional and contextual information within each anatomical plane, while axial-centric cross-plane transformer encoders condition axial features on complementary information from auxiliary planes. Experimental results on six datasets from the MedMNIST3D benchmark demonstrate that the proposed architecture consistently outperforms existing 3D and multi-plane models in terms of accuracy and AUC. Ablation studies further confirm the importance of axial-centric query-key-value allocation and directional cross-plane fusion. These results highlight the importance of aligning architectural design with clinical interpretation workflows for robust and data-efficient 3D medical image analysis.
80. 【2602.21633】Self-Correcting VLA: Online Action Refinement via Sparse World Imagination
链接:https://arxiv.org/abs/2602.21633
作者:Chenyv Liu,Wentao Tan,Lei Zhu,Fengling Li,Jingjing Li,Guoli Yang,Heng Tao Shen
类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:statistical data priors, fitting statistical data, underlying physical dynamics, data priors, limiting their robust
备注:
点击查看摘要
Abstract:Standard vision-language-action (VLA) models rely on fitting statistical data priors, limiting their robust understanding of underlying physical dynamics. Reinforcement learning enhances physical grounding through exploration yet typically relies on external reward signals that remain isolated from the agent's internal states. World action models have emerged as a promising paradigm that integrates imagination and control to enable predictive planning. However, they rely on implicit context modeling, lacking explicit mechanisms for self-improvement. To solve these problems, we propose Self-Correcting VLA (SC-VLA), which achieve self-improvement by intrinsically guiding action refinement through sparse imagination. We first design sparse world imagination by integrating auxiliary predictive heads to forecast current task progress and future trajectory trends, thereby constraining the policy to encode short-term physical evolution. Then we introduce the online action refinement module to reshape progress-dependent dense rewards, adjusting trajectory orientation based on the predicted sparse future states. Evaluations on challenging robot manipulation tasks from simulation benchmarks and real-world settings demonstrate that SC-VLA achieve state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines, alongside a 14% gain in real-world experiments. Code is available at this https URL.
81. 【2602.21631】UniHand: A Unified Model for Diverse Controlled 4D Hand Motion Modeling
链接:https://arxiv.org/abs/2602.21631
作者:Zhihao Sun,Tong Wu,Ruirui Tu,Daoguo Dong,Zuxuan Wu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:remains challenging, human interaction, plays a central, central role, role in human
备注:
点击查看摘要
Abstract:Hand motion plays a central role in human interaction, yet modeling realistic 4D hand motion (i.e., 3D hand pose sequences over time) remains challenging. Research in this area is typically divided into two tasks: (1) Estimation approaches reconstruct precise motion from visual observations, but often fail under hand occlusion or absence; (2) Generation approaches focus on synthesizing hand poses by exploiting generative priors under multi-modal structured inputs and infilling motion from incomplete sequences. However, this separation not only limits the effective use of heterogeneous condition signals that frequently arise in practice, but also prevents knowledge transfer between the two tasks. We present UniHand, a unified diffusion-based framework that formulates both estimation and generation as conditional motion synthesis. UniHand integrates heterogeneous inputs by embedding structured signals into a shared latent space through a joint variational autoencoder, which aligns conditions such as MANO parameters and 2D skeletons. Visual observations are encoded with a frozen vision backbone, while a dedicated hand perceptron extracts hand-specific cues directly from image features, removing the need for complex detection and cropping pipelines. A latent diffusion model then synthesizes consistent motion sequences from these diverse conditions. Extensive experiments across multiple benchmarks demonstrate that UniHand delivers robust and accurate hand motion modeling, maintaining performance under severe occlusions and temporally incomplete inputs.
82. 【2602.21627】okenizing Semantic Segmentation with RLE
链接:https://arxiv.org/abs/2602.21627
作者:Abhineet Singh,Justin Rozeboom,Nilanjan Ray
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:paper presents, language modeling, modeling to output, unified approach, approach to semantic
备注:
点击查看摘要
Abstract:This paper presents a new unified approach to semantic segmentation in both images and videos by using language modeling to output the masks as sequences of discrete tokens. We use run length encoding (RLE) to discretize the segmentation masks and then train a modified version of Pix2Seq \cite{p2s} to output these RLE tokens through autoregression. We propose novel tokenization strategies to compress the length of the token sequence to make it practicable to extend this approach to videos. We also show how instance information can be incorporated into the tokenization process to perform panoptic segmentation. We evaluate our proposed models on two datasets to show that they are competitive with the state of the art in spite of being bottlenecked by our limited computational resources.
83. 【2602.21613】Virtual Biopsy for Intracranial Tumors Diagnosis on MRI
链接:https://arxiv.org/abs/2602.21613
作者:Xinzhe Luo,Shuai Shao,Yan Wang,Jiangtao Wang,Yutong Bai,Jianguo Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Deep intracranial tumors, eloquent brain regions, brain regions controlling, regions controlling vital, controlling vital functions
备注:
点击查看摘要
Abstract:Deep intracranial tumors situated in eloquent brain regions controlling vital functions present critical diagnostic challenges. Clinical practice has shifted toward stereotactic biopsy for pathological confirmation before treatment. Yet biopsy carries inherent risks of hemorrhage and neurological deficits and struggles with sampling bias due to tumor spatial heterogeneity, because pathological changes are typically region-selective rather than tumor-wide. Therefore, advancing non-invasive MRI-based pathology prediction is essential for holistic tumor assessment and modern clinical decision-making. The primary challenge lies in data scarcity: low tumor incidence requires long collection cycles, and annotation demands biopsy-verified pathology from neurosurgical experts. Additionally, tiny lesion volumes lacking segmentation masks cause critical features to be overwhelmed by background noise. To address these challenges, we construct the ICT-MRI dataset - the first public biopsy-verified benchmark with 249 cases across four categories. We propose a Virtual Biopsy framework comprising: MRI-Processor for standardization; Tumor-Localizer employing vision-language models for coarse-to-fine localization via weak supervision; and Adaptive-Diagnoser with a Masked Channel Attention mechanism fusing local discriminative features with global contexts. Experiments demonstrate over 90% accuracy, outperforming baselines by more than 20%.
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2602.21613 [cs.CV]
(or
arXiv:2602.21613v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2602.21613
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
84. 【2602.21599】Iterative Closed-Loop Motion Synthesis for Scaling the Capabilities of Humanoid Control
链接:https://arxiv.org/abs/2602.21599
作者:Weisheng Xu,Qiwei Wu,Jiaxi Zhang,Tan Jing,Yangfan Li,Yuetong Fang,Jiaqi Xiong,Kai Wu,Rong Ou,Renjing Xu
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:Physics-based humanoid control, humanoid control relies, Physics-based humanoid, diverse data distributions, relies on training
备注:
点击查看摘要
Abstract:Physics-based humanoid control relies on training with motion datasets that have diverse data distributions. However, the fixed difficulty distribution of datasets limits the performance ceiling of the trained control policies. Additionally, the method of acquiring high-quality data through professional motion capture systems is constrained by costs, making it difficult to achieve large-scale scalability. To address these issues, we propose a closed-loop automated motion data generation and iterative framework. It can generate high-quality motion data with rich action semantics, including martial arts, dance, combat, sports, gymnastics, and more. Furthermore, our framework enables difficulty iteration of policies and data through physical metrics and objective evaluations, allowing the trained tracker to break through its original difficulty limits. On the PHC single-primitive tracker, using only approximately 1/10 of the AMASS dataset size, the average failure rate on the test set (2201 clips) is reduced by 45\% compared to the baseline. Finally, we conduct comprehensive ablation and comparative experiments to highlight the rationality and advantages of our framework.
85. 【2602.21596】A Hidden Semantic Bottleneck in Conditional Embeddings of Diffusion Transformers
链接:https://arxiv.org/abs/2602.21596
作者:Trung X. Pham,Kang Zhang,Ji Woo Hong,Chang D. Yoo
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Transformers have achieved, remains poorly understood, learned conditional embeddings, Diffusion Transformers, performance in class-conditional
备注: Accepted to ICLR 2026
点击查看摘要
Abstract:Diffusion Transformers have achieved state-of-the-art performance in class-conditional and multimodal generation, yet the structure of their learned conditional embeddings remains poorly understood. In this work, we present the first systematic study of these embeddings and uncover a notable redundancy: class-conditioned embeddings exhibit extreme angular similarity, exceeding 99\% on ImageNet-1K, while continuous-condition tasks such as pose-guided image generation and video-to-audio generation reach over 99.9\%. We further find that semantic information is concentrated in a small subset of dimensions, with head dimensions carrying the dominant signal and tail dimensions contributing minimally. By pruning low-magnitude dimensions--removing up to two-thirds of the embedding space--we show that generation quality and fidelity remain largely unaffected, and in some cases improve. These results reveal a semantic bottleneck in Transformer-based diffusion models, providing new insights into how semantics are encoded and suggesting opportunities for more efficient conditioning mechanisms.
86. 【2602.21593】Breaking Semantic-Aware Watermarks via LLM-Guided Coherence-Preserving Semantic Injection
链接:https://arxiv.org/abs/2602.21593
作者:Zheng Gao,Xiaoyu Li,Zhicheng Bao,Xiaoyan Feng,Jiaojiao Jiang
类目:Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
关键词:copyright distribution scenarios, online copyright distribution, support reliable provenance, reliable provenance tracking, web content
备注: Accepted by The Web Conference 2026 (Short Paper Track)
点击查看摘要
Abstract:Generative images have proliferated on Web platforms in social media and online copyright distribution scenarios, and semantic watermarking has increasingly been integrated into diffusion models to support reliable provenance tracking and forgery prevention for web content. Traditional noise-layer-based watermarking, however, remains vulnerable to inversion attacks that can recover embedded signals. To mitigate this, recent content-aware semantic watermarking schemes bind watermark signals to high-level image semantics, constraining local edits that would otherwise disrupt global coherence. Yet, large language models (LLMs) possess structured reasoning capabilities that enable targeted exploration of semantic spaces, allowing locally fine-grained but globally coherent semantic alterations that invalidate such bindings. To expose this overlooked vulnerability, we introduce a Coherence-Preserving Semantic Injection (CSI) attack that leverages LLM-guided semantic manipulation under embedding-space similarity constraints. This alignment enforces visual-semantic consistency while selectively perturbing watermark-relevant semantics, ultimately inducing detector misclassification. Extensive empirical results show that CSI consistently outperforms prevailing attack baselines against content-aware semantic watermarking, revealing a fundamental security weakness of current semantic watermark designs when confronted with LLM-driven semantic perturbations.
87. 【2602.21591】CADC: Content Adaptive Diffusion-Based Generative Image Compression
链接:https://arxiv.org/abs/2602.21591
作者:Xihua Sheng,Lingyu Zhu,Tianyu Zhang,Dong Liu,Shiqi Wang,Jing Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieving realistic reconstruction, demonstrated remarkable potential, generative image compression, demonstrated remarkable, achieving realistic
备注: CVPR2026
点击查看摘要
Abstract:Diffusion-based generative image compression has demonstrated remarkable potential for achieving realistic reconstruction at ultra-low bitrates. The key to unlocking this potential lies in making the entire compression process content-adaptive, ensuring that the encoder's representation and the decoder's generative prior are dynamically aligned with the semantic and structural characteristics of the input image. However, existing methods suffer from three critical limitations that prevent effective content adaptation. First, isotropic quantization applies a uniform quantization step, failing to adapt to the spatially varying complexity of image content and creating a misalignment with the diffusion model's noise-dependent prior. Second, the information concentration bottleneck -- arising from the dimensional mismatch between the high-dimensional noisy latent and the diffusion decoder's fixed input -- prevents the model from adaptively preserving essential semantic information in the primary channels. Third, existing textual conditioning strategies either need significant textual bitrate overhead or rely on generic, content-agnostic textual prompts, thereby failing to provide adaptive semantic guidance efficiently. To overcome these limitations, we propose a content-adaptive diffusion-based image codec with three technical innovations: 1) an Uncertainty-Guided Adaptive Quantization method that learns spatial uncertainty maps to adaptively align quantization distortion with content characteristics; 2) an Auxiliary Decoder-Guided Information Concentration method that uses a lightweight auxiliary decoder to enforce content-aware information preservation in the primary latent channels; and 3) a Bitrate-Free Adaptive Textual Conditioning method that derives content-aware textual descriptions from the auxiliary reconstructed image, enabling semantic guidance without bitrate cost.
88. 【2602.21589】SEF-MAP: Subspace-Decomposed Expert Fusion for Robust Multimodal HD Map Prediction
链接:https://arxiv.org/abs/2602.21589
作者:Haoxiang Fu,Lingfeng Zhang,Hao Li,Ruibing Hu,Zhengrong Li,Guanjing Liu,Zimu Tan,Long Chen,Hangjun Ye,Xiaoshuai Hao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:sparse point clouds, autonomous driving, LiDAR modalities, point clouds, essential for autonomous
备注:
点击查看摘要
Abstract:High-definition (HD) maps are essential for autonomous driving, yet multi-modal fusion often suffers from inconsistency between camera and LiDAR modalities, leading to performance degradation under low-light conditions, occlusions, or sparse point clouds. To address this, we propose SEFMAP, a Subspace-Expert Fusion framework for robust multimodal HD map prediction. The key idea is to explicitly disentangle BEV features into four semantic subspaces: LiDAR-private, Image-private, Shared, and Interaction. Each subspace is assigned a dedicated expert, thereby preserving modality-specific cues while capturing cross-modal consensus. To adaptively combine expert outputs, we introduce an uncertainty-aware gating mechanism at the BEV-cell level, where unreliable experts are down-weighted based on predictive variance, complemented by a usage balance regularizer to prevent expert collapse. To enhance robustness in degraded conditions and promote role specialization, we further propose distribution-aware masking: during training, modality-drop scenarios are simulated using EMA-statistical surrogate features, and a specialization loss enforces distinct behaviors of private, shared, and interaction experts across complete and masked inputs. Experiments on nuScenes and Argoverse2 benchmarks demonstrate that SEFMAP achieves state-of-the-art performance, surpassing prior methods by +4.2% and +4.8% in mAP, respectively. SEF-MAPprovides a robust and effective solution for multi-modal HD map prediction under diverse and degraded conditions.
89. 【2602.21581】MultiAnimate: Pose-Guided Image Animation Made Extensible
链接:https://arxiv.org/abs/2602.21581
作者:Yingcheng Hu,Haowen Gong,Chuanguang Yang,Zhulin An,Yongjun Xu,Songhua Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Pose-guided human image, Pose-guided human, synthesize realistic videos, reference character driven, sequence of poses
备注: Project page at [this https URL](https://hyc001.github.io/MultiAnimate/)
点击查看摘要
Abstract:Pose-guided human image animation aims to synthesize realistic videos of a reference character driven by a sequence of poses. While diffusion-based methods have achieved remarkable success, most existing approaches are limited to single-character animation. We observe that naively extending these methods to multi-character scenarios often leads to identity confusion and implausible occlusions between characters. To address these challenges, in this paper, we propose an extensible multi-character image animation framework built upon modern Diffusion Transformers (DiTs) for video generation. At its core, our framework introduces two novel components-Identifier Assigner and Identifier Adapter - which collaboratively capture per-person positional cues and inter-person spatial relationships. This mask-driven scheme, along with a scalable training strategy, not only enhances flexibility but also enables generalization to scenarios with more characters than those seen during training. Remarkably, trained on only a two-character dataset, our model generalizes to multi-character animation while maintaining compatibility with single-character cases. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in multi-character image animation, surpassing existing diffusion-based baselines.
90. 【2602.21552】Generalizing Visual Geometry Priors to Sparse Gaussian Occupancy Prediction
链接:https://arxiv.org/abs/2602.21552
作者:Changqing Zhou,Yueru Luo,Changhao Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:scene understanding, embodied intelligence, free space, understanding is essential, essential for embodied
备注: Accepted by CVPR2026
点击查看摘要
Abstract:Accurate 3D scene understanding is essential for embodied intelligence, with occupancy prediction emerging as a key task for reasoning about both objects and free space. Existing approaches largely rely on depth priors (e.g., DepthAnything) but make only limited use of 3D cues, restricting performance and generalization. Recently, visual geometry models such as VGGT have shown strong capability in providing rich 3D priors, but similar to monocular depth foundation models, they still operate at the level of visible surfaces rather than volumetric interiors, motivating us to explore how to more effectively leverage these increasingly powerful geometry priors for 3D occupancy prediction. We present GPOcc, a framework that leverages generalizable visual geometry priors (GPs) for monocular occupancy prediction. Our method extends surface points inward along camera rays to generate volumetric samples, which are represented as Gaussian primitives for probabilistic occupancy inference. To handle streaming input, we further design a training-free incremental update strategy that fuses per-frame Gaussians into a unified global representation. Experiments on Occ-ScanNet and EmbodiedOcc-ScanNet demonstrate significant gains: GPOcc improves mIoU by +9.99 in the monocular setting and +11.79 in the streaming setting over prior state of the art. Under the same depth prior, it achieves +6.73 mIoU while running 2.65$\times$ faster. These results highlight that GPOcc leverages geometry priors more effectively and efficiently. Code will be released at this https URL.
91. 【2602.21539】VasGuideNet: Vascular Topology-Guided Couinaud Liver Segmentation with Structural Contrastive Loss
链接:https://arxiv.org/abs/2602.21539
作者:Chaojie Shen,Jingjun Gu,Zihao Zhao,Ruocheng Li,Cunyuan Yang,Jiajun Bu,Lei Wu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Accurate Couinaud liver, existing methods primarily, spatial location cues, preoperative surgical planning, methods primarily rely
备注:
点击查看摘要
Abstract:Accurate Couinaud liver segmentation is critical for preoperative surgical planning and tumor this http URL, existing methods primarily rely on image intensity and spatial location cues, without explicitly modeling vascular topology. As a result, they often produce indistinct boundaries near vessels and show limited generalization under anatomical this http URL propose VasGuideNet, the first Couinaud segmentation framework explicitly guided by vascular topology. Specifically, skeletonized vessels, Euclidean distance transform (EDT)--derived geometry, and k-nearest neighbor (kNN) connectivity are encoded into topology features using Graph Convolutional Networks (GCNs). These features are then injected into a 3D encoder--decoder backbone via a cross-attention fusion module. To further improve inter-class separability and anatomical consistency, we introduce a Structural Contrastive Loss (SCL) with a global memory this http URL Task08_HepaticVessel and our private LASSD dataset, VasGuideNet achieves Dice scores of 83.68% and 76.65% with RVDs of 1.68 and 7.08, respectively. It consistently outperforms representative baselines including UNETR, Swin UNETR, and G-UNETR++, delivering higher Dice/mIoU and lower RVD across datasets, demonstrating its effectiveness for anatomically consistent segmentation. Code is available at this https URL.
92. 【2602.21536】IHF-Harmony: Multi-Modality Magnetic Resonance Images Harmonization using Invertible Hierarchy Flow Model
链接:https://arxiv.org/abs/2602.21536
作者:Pengli Zhu,Yitao Zhu,Haowen Pang,Anqi Qiu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:traveling subject datasets, Retrospective MRI harmonization, Retrospective MRI, subject datasets, limited by poor
备注:
点击查看摘要
Abstract:Retrospective MRI harmonization is limited by poor scalability across modalities and reliance on traveling subject datasets. To address these challenges, we introduce IHF-Harmony, a unified invertible hierarchy flow framework for multi-modality harmonization using unpaired data. By decomposing the translation process into reversible feature transformations, IHF-Harmony guarantees bijective mapping and lossless reconstruction to prevent anatomical distortion. Specifically, an invertible hierarchy flow (IHF) performs hierarchical subtractive coupling to progressively remove artefact-related features, while an artefact-aware normalization (AAN) employs anatomy-fixed feature modulation to accurately transfer target characteristics. Combined with anatomy and artefact consistency loss objectives, IHF-Harmony achieves high-fidelity harmonization that retains source anatomy. Experiments across multiple MRI modalities demonstrate that IHF-Harmony outperforms existing methods in both anatomical fidelity and downstream task performance, facilitating robust harmonization for large-scale multi-site imaging studies. Code will be released upon acceptance.
93. 【2602.21535】Pseudo-View Enhancement via Confidence Fusion for Unposed Sparse-View Reconstruction
链接:https://arxiv.org/abs/2602.21535
作者:Beizhen Zhao,Sicheng Yu,Guanzhi Ding,Yu Hu,Hao Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:practically important problem, unposed sparse viewpoints, important problem, scale variation, unposed sparse
备注: 14 pages
点击查看摘要
Abstract:3D scene reconstruction under unposed sparse viewpoints is a highly challenging yet practically important problem, especially in outdoor scenes due to complex lighting and scale variation. With extremely limited input views, directly utilizing diffusion model to synthesize pseudo frames will introduce unreasonable geometry, which will harm the final reconstruction quality. To address these issues, we propose a novel framework for sparse-view outdoor reconstruction that achieves high-quality results through bidirectional pseudo frame restoration and scene perception Gaussian management. Specifically, we introduce a bidirectional pseudo frame restoration method that restores missing content by diffusion-based synthesis guided by adjacent frames with a lightweight pseudo-view deblur model and confidence mask inference algorithm. Then we propose a scene perception Gaussian management strategy that optimize Gaussians based on joint depth-density information. These designs significantly enhance reconstruction completeness, suppress floating artifacts and improve overall geometric consistency under extreme view sparsity. Experiments on outdoor benchmarks demonstrate substantial gains over existing methods in both fidelity and stability.
94. 【2602.21531】LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
链接:https://arxiv.org/abs/2602.21531
作者:Yue Yang,Shuo Cheng,Yu Fang,Homanga Bharadhwaj,Mingyu Ding,Gedas Bertasius,Daniel Szafir
类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
关键词:involving multiple kinematic, multiple kinematic structure, General-purpose robots, tasks involving multiple, master long-horizon manipulation
备注:
点击查看摘要
Abstract:General-purpose robots must master long-horizon manipulation, defined as tasks involving multiple kinematic structure changes (e.g., attaching or detaching objects) in unstructured environments. While Vision-Language-Action (VLA) models offer the potential to master diverse atomic skills, they struggle with the combinatorial complexity of sequencing them and are prone to cascading failures due to environmental sensitivity. To address these challenges, we propose LiLo-VLA (Linked Local VLA), a modular framework capable of zero-shot generalization to novel long-horizon tasks without ever being trained on them. Our approach decouples transport from interaction: a Reaching Module handles global motion, while an Interaction Module employs an object-centric VLA to process isolated objects of interest, ensuring robustness against irrelevant visual features and invariance to spatial configurations. Crucially, this modularity facilitates robust failure recovery through dynamic replanning and skill reuse, effectively mitigating the cascading errors common in end-to-end approaches. We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long. In these simulations, LiLo-VLA achieves a 69% average success rate, outperforming Pi0.5 by 41% and OpenVLA-OFT by 67%. Furthermore, real-world evaluations across 8 long-horizon tasks demonstrate an average success rate of 85%. Project page: this https URL.
95. 【2602.21517】Which Tool Response Should I Trust? Tool-Expertise-Aware Chest X-ray Agent with Multimodal Agentic Learning
链接:https://arxiv.org/abs/2602.21517
作者:Zheang Huai,Honglong Yang,Xiaomeng Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:tool-use capabilities show, capabilities show promise, tool-use capabilities, promise for integrating, integrating the domain
备注: 11 pages
点击查看摘要
Abstract:AI agents with tool-use capabilities show promise for integrating the domain expertise of various tools. In the medical field, however, tools are usually AI models that are inherently error-prone and can produce contradictory responses. Existing research on medical agents lacks sufficient understanding of the tools' realistic reliability and thus cannot effectively resolve tool conflicts. To address this gap, this paper introduces a framework that enables an agent to interact with tools and empirically learn their practical trustworthiness across different types of multimodal queries via agentic learning. As a concrete instantiation, we focus on chest X-ray analysis and present a tool-expertise-aware chest X-ray agent (TEA-CXA). When tool outputs disagree, the agent experimentally accepts or rejects multimodal tool results, receives rewards, and learns which tool to trust for each query type. Importantly, TEA-CXA extends existing codebases for reinforcement learning with multi-turn tool-calling that focus on textual inputs, to support multimodal contexts effectively. In addition, we enhance the codebase for medical use scenarios by supporting multiple tool calls in one turn, parallel tool inference, and multi-image accommodation within a single user query. Our code framework is applicable to general medical research on multi-turn tool-calling reinforcement learning in multimodal settings. Experiments show that TEA-CXA outperforms the state-of-the-art methods and a comprehensive set of baselines. Code will be released.
96. 【2602.21508】WaterVIB: Learning Minimal Sufficient Watermark Representations via Variational Information Bottleneck
链接:https://arxiv.org/abs/2602.21508
作者:Haoyuan He,Yu Zheng,Jie Zhou,Jiwen Lu
类目:Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
关键词:intellectual property protection, regeneration-based AIGC attacks, Robust watermarking, regeneration-based AIGC, existing methods face
备注: 22 pages, 7 figures. Preprint
点击查看摘要
Abstract:Robust watermarking is critical for intellectual property protection, whereas existing methods face a severe vulnerability against regeneration-based AIGC attacks. We identify that existing methods fail because they entangle the watermark with high-frequency cover texture, which is susceptible to being rewritten during generative purification. To address this, we propose WaterVIB, a theoretically grounded framework that reformulates the encoder as an information sieve via the Variational Information Bottleneck. Instead of overfitting to fragile cover details, our approach forces the model to learn a Minimal Sufficient Statistic of the message. This effectively filters out redundant cover nuances prone to generative shifts, retaining only the essential signal invariant to regeneration. We theoretically prove that optimizing this bottleneck is a necessary condition for robustness against distribution-shifting attacks. Extensive experiments demonstrate that WaterVIB significantly outperforms state-of-the-art methods, achieving superior zero-shot resilience against unknown diffusion-based editing.
97. 【2602.21503】AHAN: Asymmetric Hierarchical Attention Network for Identical Twin Face Verification
链接:https://arxiv.org/abs/2602.21503
作者:Hoang-Nhat Nguyen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:overwhelming genetic similarity, extreme fine-grained recognition, systems fail due, genetic similarity, fine-grained recognition challenge
备注: Accepted to AAAI 2026
点击查看摘要
Abstract:Identical twin face verification represents an extreme fine-grained recognition challenge where even state-of-the-art systems fail due to overwhelming genetic similarity. Current face recognition methods achieve over 99.8% accuracy on standard benchmarks but drop dramatically to 88.9% when distinguishing identical twins, exposing critical vulnerabilities in biometric security systems. The difficulty lies in learning features that capture subtle, non-genetic variations that uniquely identify individuals. We propose the Asymmetric Hierarchical Attention Network (AHAN), a novel architecture specifically designed for this challenge through multi-granularity facial analysis. AHAN introduces a Hierarchical Cross-Attention (HCA) module that performs multi-scale analysis on semantic facial regions, enabling specialized processing at optimal resolutions. We further propose a Facial Asymmetry Attention Module (FAAM) that learns unique biometric signatures by computing cross-attention between left and right facial halves, capturing subtle asymmetric patterns that differ even between twins. To ensure the network learns truly individuating features, we introduce Twin-Aware Pair-Wise Cross-Attention (TA-PWCA), a training-only regularization strategy that uses each subject's own twin as the hardest possible distractor. Extensive experiments on the ND_TWIN dataset demonstrate that AHAN achieves 92.3% twin verification accuracy, representing a 3.4% improvement over state-of-the-art methods.
98. 【2602.21499】Easy3E: Feed-Forward 3D Asset Editing via Rectified Voxel Flow
链接:https://arxiv.org/abs/2602.21499
作者:Shimin Hu,Yuanyi Wei,Fei Zha,Yudong Guo,Juyong Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:computationally intensive, iterative optimization, rely on computationally, optimization and suffer, TRELLIS generative backbone
备注: Accepted to CVPR 2026
点击查看摘要
Abstract:Existing 3D editing methods rely on computationally intensive scene-by-scene iterative optimization and suffer from multi-view inconsistency. We propose an effective and fully feedforward 3D editing framework based on the TRELLIS generative backbone, capable of modifying 3D models from a single editing view. Our framework addresses two key issues: adapting training-free 2D editing to structured 3D representations, and overcoming the bottleneck of appearance fidelity in compressed 3D features. To ensure geometric consistency, we introduce Voxel FlowEdit, an edit-driven flow in the sparse voxel latent space that achieves globally consistent 3D deformation in a single pass. To restore high-fidelity details, we develop a normal-guided single to multi-view generation module as an external appearance prior, successfully recovering high-frequency textures. Experiments demonstrate that our method enables fast, globally consistent, and high-fidelity 3D model editing.
99. 【2602.21497】See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs
链接:https://arxiv.org/abs/2602.21497
作者:Yongchang Zhang,Xianzheng Ma,Tianyi Liu,Guangquan Zhou,Yang Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent large vision-language, Recent large, large vision-language models, impressive reasoning ability, demonstrated impressive reasoning
备注: CVPR2026 Accepted
点击查看摘要
Abstract:Recent large vision-language models (LVLMs) have demonstrated impressive reasoning ability by generating long chain-of-thought (CoT) responses. However, CoT reasoning in multimodal contexts is highly vulnerable to visual hallucination propagation: once an intermediate reasoning step becomes inconsistent with the visual evidence, subsequent steps-even if logically valid-can still lead to incorrect final answers. Existing solutions attempt to mitigate this issue by training models to "think with images" via reinforcement learning (RL). While effective, these methods are costly, model-specific, and difficult to generalize across architectures. Differently, we present a lightweight method that bypasses RL training and provides an iterative, training-free, plug-and-play framework for visually-grounded multimodal reasoning. Our key idea is to supervise each reasoning step at test time with visual evidence, ensuring that every decoded token is justified by corresponding visual cues. Concretely, we construct a textual visual-evidence pool that guides the model's reasoning generation. When existing evidence is insufficient, a visual decider module dynamically extracts additional relevant evidence from the image based on the ongoing reasoning context, expanding the pool until the model achieves sufficient visual certainty to terminate reasoning and produce the final answer. Extensive experiments on multiple LVLM backbones and benchmarks demonstrate the effectiveness of our approach. Our method achieves 16.5%-29.5% improvements on TreeBench and 13.7% RH-AUC gains on RH-Bench, substantially reducing hallucination rates while improving reasoning accuracy without additional training.
100. 【2602.21484】Unified Unsupervised and Sparsely-Supervised 3D Object Detection by Semantic Pseudo-Labeling and Prototype Learning
链接:https://arxiv.org/abs/2602.21484
作者:Yushen He
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:large-scale manually annotated, manually annotated data, annotated data limits, data limits scalability, robotic perception
备注:
点击查看摘要
Abstract:3D object detection is essential for autonomous driving and robotic perception, yet its reliance on large-scale manually annotated data limits scalability and adaptability. To reduce annotation dependency, unsupervised and sparsely-supervised paradigms have emerged. However, they face intertwined challenges: low-quality pseudo-labels, unstable feature mining, and a lack of a unified training framework. This paper proposes SPL, a unified training framework for both Unsupervised and Sparsely-Supervised 3D Object Detection via Semantic Pseudo-labeling and prototype Learning. SPL first generates high-quality pseudo-labels by integrating image semantics, point cloud geometry, and temporal cues, producing both 3D bounding boxes for dense objects and 3D point labels for sparse ones. These pseudo-labels are not used directly but as probabilistic priors within a novel, multi-stage prototype learning strategy. This strategy stabilizes feature representation learning through memory-based initialization and momentum-based prototype updating, effectively mining features from both labeled and unlabeled data. Extensive experiments on KITTI and nuScenes datasets demonstrate that SPL significantly outperforms state-of-the-art methods in both settings. Our work provides a robust and generalizable solution for learning 3D object detectors with minimal or no manual annotations.
101. 【2602.21473】Automatic Map Density Selection for Locally-Performant Visual Place Recognition
链接:https://arxiv.org/abs/2602.21473
作者:Somayeh Hussaini,Tobias Fischer,Michael Milford
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Visual Place Recognition, translating Visual Place, Place Recognition, Visual Place, translating Visual
备注: Under Review
点击查看摘要
Abstract:A key challenge in translating Visual Place Recognition (VPR) from the lab to long-term deployment is ensuring a priori that a system can meet user-specified performance requirements across different parts of an environment, rather than just on average globally. A critical mechanism for controlling local VPR performance is the density of the reference mapping database, yet this factor is largely neglected in existing work, where benchmark datasets with fixed, engineering-driven (sensors, storage, GPS frequency) sampling densities are typically used. In this paper, we propose a dynamic VPR mapping approach that uses pairs of reference traverses from the target environment to automatically select an appropriate map density to satisfy two user-defined requirements: (1) a target Local Recall@1 level, and (2) the proportion of the operational environment over which this requirement must be met or exceeded, which we term the Recall Achievement Rate (RAR). Our approach is based on the hypothesis that match patterns between multiple reference traverses, evaluated across different map densities, can be modelled to predict the density required to meet these performance targets on unseen deployment data. Through extensive experiments across multiple VPR methods and the Nordland and Oxford RobotCar benchmarks, we show that our system consistently achieves or exceeds the specified local recall level over at least the user-specified proportion of the environment. Comparisons with alternative baselines demonstrate that our approach reliably selects the correct operating point in map density, avoiding unnecessary over-densification. Finally, ablation studies and analysis evaluate sensitivity to reference map choice and local space definitions, and reveal that conventional global Recall@1 is a poor predictor of the often more operationally meaningful RAR metric.
102. 【2602.21452】Adversarial Robustness of Deep Learning-Based Thyroid Nodule Segmentation in Ultrasound
链接:https://arxiv.org/abs/2602.21452
作者:Nicholas Dietrich,David McShannon
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Deep learning-based segmentation, clinical imaging workflows, remains incompletely characterized, Deep learning-based, Structured Speckle Amplification
备注: 14 pages, 3 figures, 3 tables
点击查看摘要
Abstract:Introduction: Deep learning-based segmentation models are increasingly integrated into clinical imaging workflows, yet their robustness to adversarial perturbations remains incompletely characterized, particularly for ultrasound images. We evaluated adversarial attacks and inference-time defenses for thyroid nodule segmentation in B-mode ultrasound. Methods: Two black-box adversarial attacks were developed: (1) Structured Speckle Amplification Attack (SSAA), which injects boundary-targeted noise, and (2) Frequency-Domain Ultrasound Attack (FDUA), which applies bandpass-filtered phase perturbations in the Fourier domain. Three inference-time mitigations were evaluated on adversarial images: randomized preprocessing with test-time augmentation, deterministic input denoising, and stochastic ensemble inference with consistency-aware aggregation. Experiments were conducted on a U-Net segmentation model trained on cine-clips from a database of 192 thyroid nodules. Results: The baseline model achieved a mean Dice similarity coefficient (DSC) of 0.76 (SD 0.20) on unperturbed images. SSAA reduced DSC by 0.29 (SD 0.20) while maintaining high visual similarity (SSIM = 0.94). FDUA resulted in a smaller DSC reduction of 0.11 (SD 0.09) with lower visual fidelity (SSIM = 0.82). Against SSAA, all three defenses significantly improved DSC after correction, with deterministic denoising showing the largest recovery (+0.10, p 0.001), followed by randomized preprocessing (+0.09, p 0.001), and stochastic ensemble inference (+0.08, p = 0.002). No defense achieved statistically significant improvement against FDUA. Conclusion: Spatial-domain adversarial perturbations in ultrasound segmentation showed partial mitigation with input preprocessing, whereas frequency-domain perturbations were not mitigated by the defenses, highlighting modality-specific challenges in adversarial robustness evaluation.
103. 【2602.21441】Causal Decoding for Hallucination-Resistant Multimodal Large Language Models
链接:https://arxiv.org/abs/2602.21441
作者:Shiwei Tan,Hengyi Wang,Weiyi Qin,Qi Xu,Zhigang Hua,Hao Wang
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, deliver detailed responses
备注: Published in Transactions on Machine Learning Research (TMLR), 2026
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) deliver detailed responses on vision-language tasks, yet remain susceptible to object hallucination (introducing objects not present in the image), undermining reliability in practice. Prior efforts often rely on heuristic penalties, post-hoc correction, or generic decoding tweaks, which do not directly intervene in the mechanisms that trigger object hallucination and thus yield limited gains. To address this challenge, we propose a causal decoding framework that applies targeted causal interventions during generation to curb spurious object mentions. By reshaping the decoding dynamics to attenuate spurious dependencies, our approach reduces false object tokens while maintaining descriptive quality. Across captioning and QA benchmarks, our framework substantially lowers object-hallucination rates and achieves state-of-the-art faithfulness without degrading overall output quality.
104. 【2602.21435】Synergizing Understanding and Generation with Interleaved Analyzing-Drafting Thinking
链接:https://arxiv.org/abs/2602.21435
作者:Shengqiong Wu,Bobo Li,Xinkai Wang,Xiangtai Li,Lei Cui,Furu Wei,Shuicheng Yan,Hao Fei,Tat-seng Chua
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Unified Vision-Language Models, Unified Vision-Language, advance multimodal learning, aim to advance, single framework
备注: 28 pages, 17 figures, 6 tables, ICLR conference
点击查看摘要
Abstract:Unified Vision-Language Models (UVLMs) aim to advance multimodal learning by supporting both understanding and generation within a single framework. However, existing approaches largely focus on architectural unification while overlooking the need for explicit interaction between the two capabilities during task solving. As a result, current models treat understanding and generation as parallel skills rather than synergistic processes. To achieve real synergy, we introduce the interleaved Analyzing-Drafting problem-solving loop (AD-Loop), a new think paradigm that dynamically alternates between analytic and drafting operations. By interleaving textual thoughts with visual thoughts, AD-Loop enables models to iteratively refine both comprehension and outputs, fostering genuine synergy. To train this mechanism, we design a two-stage strategy: supervised learning on interleaved thought data to initialize alternation, followed by reinforcement learning to promote adaptive and autonomous control. Extensive experiments demonstrate that AD-Loop consistently improves performance across standard benchmarks for both understanding and generation, with strong transferability to various UVLMs architectures. Visual analyses further validate the effectiveness of implicit visual thoughts. These results highlight AD-Loop as a principled and broadly applicable strategy for synergizing comprehension and creation. The project page is at this https URL.
105. 【2602.21428】PSF-Med: Measuring and Explaining Paraphrase Sensitivity in Medical Vision Language Models
链接:https://arxiv.org/abs/2602.21428
作者:Binesh Sadanandan,Vahid Behzadan
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Medical Vision Language, raises deployment risks, Vision Language Models, Medical Vision, Vision Language
备注:
点击查看摘要
Abstract:Medical Vision Language Models (VLMs) can change their answers when clinicians rephrase the same question, which raises deployment risks. We introduce Paraphrase Sensitivity Failure (PSF)-Med, a benchmark of 19,748 chest Xray questions paired with about 92,000 meaningpreserving paraphrases across MIMIC-CXR and PadChest. Across six medical VLMs, we measure yes/no flips for the same image and find flip rates from 8% to 58%. However, low flip rate does not imply visual grounding: text-only baselines show that some models stay consistent even when the image is removed, suggesting they rely on language priors. To study mechanisms in one model, we apply GemmaScope 2 Sparse Autoencoders (SAEs) to MedGemma 4B and analyze FlipBank, a curated set of 158 flip cases. We identify a sparse feature at layer 17 that correlates with prompt framing and predicts decision margin shifts. In causal patching, removing this feature's contribution recovers 45% of the yesminus-no logit margin on average and fully reverses 15% of flips. Acting on this finding, we show that clamping the identified feature at inference reduces flip rates by 31% relative with only a 1.3 percentage-point accuracy cost, while also decreasing text-prior reliance. These results suggest that flip rate alone is not enough; robustness evaluations should test both paraphrase stability and image reliance.
106. 【2602.21425】Automating Timed Up and Go Phase Segmentation and Gait Analysis via the tugturn Markerless 3D Pipeline
链接:https://arxiv.org/abs/2602.21425
作者:Abel Gonçalves Chinaglia,Guilherme Manna Cesar,Paulo Roberto Pereira Santiago
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Instrumented Timed, markerless TUG processing, research decision-making, support clinical, clinical and research
备注: 16 pages, 2 figures, 1 pdf report, submitted to arXiv under cs.CV
点击查看摘要
Abstract:Instrumented Timed Up and Go (TUG) analysis can support clinical and research decision-making, but robust and reproducible markerless pipelines are still limited. We present \textit{this http URL}, a Python-based workflow for 3D markerless TUG processing that combines phase segmentation, gait-event detection, spatiotemporal metrics, intersegmental coordination, and dynamic stability analysis. The pipeline uses spatial thresholds to segment each trial into stand, first gait, turning, second gait, and sit phases, and applies a relative-distance strategy to detect heel-strike and toe-off events within valid gait windows. In addition to conventional kinematics, \textit{tugturn} provides Vector Coding outputs and Extrapolated Center of Mass (XCoM)-based metrics. The software is configured through TOML files and produces reproducible artifacts, including HTML reports, CSV tables, and quality-assurance visual outputs. A complete runnable example is provided with test data and command-line instructions. This manuscript describes the implementation, outputs, and reproducibility workflow of \textit{tugturn} as a focused software contribution for markerless biomechanical TUG analysis.
107. 【2602.21421】ECHOSAT: Estimating Canopy Height Over Space And Time
链接:https://arxiv.org/abs/2602.21421
作者:Jan Pauls,Karsten Schrödter,Sven Ligensa,Martin Schwartz,Berkant Turan,Max Zimmer,Sassan Saatchi,Sebastian Pokutta,Philippe Ciais,Fabian Gieseke
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:climate change mitigation, change mitigation, critical for climate, climate change, Forest
备注: 19 pages, 12 figures, 6 tables
点击查看摘要
Abstract:Forest monitoring is critical for climate change mitigation. However, existing global tree height maps provide only static snapshots and do not capture temporal forest dynamics, which are essential for accurate carbon accounting. We introduce ECHOSAT, a global and temporally consistent tree height map at 10 m resolution spanning multiple years. To this end, we resort to multi-sensor satellite data to train a specialized vision transformer model, which performs pixel-level temporal regression. A self-supervised growth loss regularizes the predictions to follow growth curves that are in line with natural tree development, including gradual height increases over time, but also abrupt declines due to forest loss events such as fires. Our experimental evaluation shows that our model improves state-of-the-art accuracies in the context of single-year predictions. We also provide the first global-scale height map that accurately quantifies tree growth and disturbances over time. We expect ECHOSAT to advance global efforts in carbon monitoring and disturbance assessment. The maps can be accessed at this https URL.
108. 【2602.21416】WildSVG: Towards Reliable SVG Generation Under Real-Word Conditions
链接:https://arxiv.org/abs/2602.21416
作者:Marco Terral,Haotian Zhang,Tianyang Zhang,Meng Lin,Xiaoqing Xie,Haoran Dai,Darsh Kaushik,Pai Peng,Nicklas Scharpff,David Vazquez,Joan Rodriguez
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:scalable vector graphics, translating specific visual, specific visual inputs, vector graphics, consists in translating
备注: 10 pages, 6 pages of additional material
点击查看摘要
Abstract:We introduce the task of SVG extraction, which consists in translating specific visual inputs from an image into scalable vector graphics. Existing multimodal models achieve strong results when generating SVGs from clean renderings or textual descriptions, but they fall short in real-world scenarios where natural images introduce noise, clutter, and domain shifts. A central challenge in this direction is the lack of suitable benchmarks. To address this need, we introduce the WildSVG Benchmark, formed by two complementary datasets: Natural WildSVG, built from real images containing company logos paired with their SVG annotations, and Synthetic WildSVG, which blends complex SVG renderings into real scenes to simulate difficult conditions. Together, these resources provide the first foundation for systematic benchmarking SVG extraction. We benchmark state-of-the-art multimodal models and find that current approaches perform well below what is needed for reliable SVG extraction in real scenarios. Nonetheless, iterative refinement methods point to a promising path forward, and model capabilities are steadily improving
109. 【2602.21406】Exploring Vision-Language Models for Open-Vocabulary Zero-Shot Action Segmentation
链接:https://arxiv.org/abs/2602.21406
作者:Asim Unmesh,Kaki Ramesh,Mayank Patel,Rahul Jain,Karthik Ramani
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:comprehensive datasets infeasible, alternative breakdowns makes, breakdowns makes collecting, makes collecting comprehensive, collecting comprehensive datasets
备注: ICRA 2026
点击查看摘要
Abstract:Temporal Action Segmentation (TAS) requires dividing videos into action segments, yet the vast space of activities and alternative breakdowns makes collecting comprehensive datasets infeasible. Existing methods remain limited to closed vocabularies and fixed label sets. In this work, we explore the largely unexplored problem of Open-Vocabulary Zero-Shot Temporal Action Segmentation (OVTAS) by leveraging the strong zero-shot capabilities of Vision-Language Models (VLMs). We introduce a training-free pipeline that follows a segmentation-by-classification design: Frame-Action Embedding Similarity (FAES) matches video frames to candidate action labels, and Similarity-Matrix Temporal Segmentation (SMTS) enforces temporal consistency. Beyond proposing OVTAS, we present a systematic study across 14 diverse VLMs, providing the first broad analysis of their suitability for open-vocabulary action segmentation. Experiments on standard benchmarks show that OVTAS achieves strong results without task-specific supervision, underscoring the potential of VLMs for structured temporal understanding.
110. 【2602.21402】FlowFixer: Towards Detail-Preserving Subject-Driven Generation
链接:https://arxiv.org/abs/2602.21402
作者:Jinyoung Jun,Won-Dong Jang,Wenbin Ouyang,Raghudeep Gadde,Jungbeom Lee
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:restores fine details, fine details lost, refinement framework, restores fine, scale and perspective
备注:
点击查看摘要
Abstract:We present FlowFixer, a refinement framework for subject-driven generation (SDG) that restores fine details lost during generation caused by changes in scale and perspective of a subject. FlowFixer proposes direct image-to-image translation from visual references, avoiding ambiguities in language prompts. To enable image-to-image training, we introduce a one-step denoising scheme to generate self-supervised training data, which automatically removes high-frequency details while preserving global structure, effectively simulating real-world SDG errors. We further propose a keypoint matching-based metric to properly assess fidelity in details beyond semantic similarities usually measured by CLIP or DINO. Experimental results demonstrate that FlowFixer outperforms state-of-the-art SDG methods in both qualitative and quantitative evaluations, setting a new benchmark for high-fidelity subject-driven generation.
111. 【2602.21399】FedVG: Gradient-Guided Aggregation for Enhanced Federated Learning
链接:https://arxiv.org/abs/2602.21399
作者:Alina Devkota,Jacob Thrasher,Donald Adjeroh,Binod Bhattarai,Prashnna K. Gyawali
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:enables collaborative model, Federated Learning, collaborative model training, enables collaborative, training across multiple
备注: Accepted to CVPR 2026 (Findings Track)
点击查看摘要
Abstract:Federated Learning (FL) enables collaborative model training across multiple clients without sharing their private data. However, data heterogeneity across clients leads to client drift, which degrades the overall generalization performance of the model. This effect is further compounded by overemphasis on poorly performing clients. To address this problem, we propose FedVG, a novel gradient-based federated aggregation framework that leverages a global validation set to guide the optimization process. Such a global validation set can be established using readily available public datasets, ensuring accessibility and consistency across clients without compromising privacy. In contrast to conventional approaches that prioritize client dataset volume, FedVG assesses the generalization ability of client models by measuring the magnitude of validation gradients across layers. Specifically, we compute layerwise gradient norms to derive a client-specific score that reflects how much each client needs to adjust for improved generalization on the global validation set, thereby enabling more informed and adaptive federated aggregation. Extensive experiments on both natural and medical image benchmarking datasets, across diverse model architectures, demonstrate that FedVG consistently improves performance, particularly in highly heterogeneous settings. Moreover, FedVG is modular and can be seamlessly integrated with various state-of-the-art FL algorithms, often further improving their results. Our code is available at this https URL.
112. 【2602.21397】MMLoP: Multi-Modal Low-Rank Prompting for Efficient Vision-Language Adaptation
链接:https://arxiv.org/abs/2602.21397
作者:Sajjad Ghiasvand,Haniyeh Ehsani Oskouie,Mahnoosh Alizadeh,Ramtin Pedarsani
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:adapting vision-language models, modifying pretrained weights, textbf, vision-language models, pretrained weights
备注:
点击查看摘要
Abstract:Prompt learning has become a dominant paradigm for adapting vision-language models (VLMs) such as CLIP to downstream tasks without modifying pretrained weights. While extending prompts to both vision and text encoders across multiple transformer layers significantly boosts performance, it dramatically increases the number of trainable parameters, with state-of-the-art methods requiring millions of parameters and abandoning the parameter efficiency that makes prompt tuning attractive. In this work, we propose \textbf{MMLoP} (\textbf{M}ulti-\textbf{M}odal \textbf{Lo}w-Rank \textbf{P}rompting), a framework that achieves deep multi-modal prompting with only \textbf{11.5K trainable parameters}, comparable to early text-only methods like CoOp. MMLoP parameterizes vision and text prompts at each transformer layer through a low-rank factorization, which serves as an implicit regularizer against overfitting on few-shot training data. To further close the accuracy gap with state-of-the-art methods, we introduce three complementary components: a self-regulating consistency loss that anchors prompted representations to frozen zero-shot CLIP features at both the feature and logit levels, a uniform drift correction that removes the global embedding shift induced by prompt tuning to preserve class-discriminative structure, and a shared up-projection that couples vision and text prompts through a common low-rank factor to enforce cross-modal alignment. Extensive experiments across three benchmarks and 11 diverse datasets demonstrate that MMLoP achieves a highly favorable accuracy-efficiency tradeoff, outperforming the majority of existing methods including those with orders of magnitude more parameters, while achieving a harmonic mean of 79.70\% on base-to-novel generalization.
113. 【2602.21395】Momentum Memory for Knowledge Distillation in Computational Pathology
链接:https://arxiv.org/abs/2602.21395
作者:Yongxin Guo,Hao Lu,Onur C. Koyun,Zhengjie Zhu,Muhammet Fatih Demir,Metin Nafi Gurcan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:paired histology-genomics data, shown strong potential, cancer diagnosis, histology-genomics data, potential in cancer
备注: Accepted by CVPR 2026
点击查看摘要
Abstract:Multimodal learning that integrates genomics and histopathology has shown strong potential in cancer diagnosis, yet its clinical translation is hindered by the limited availability of paired histology-genomics data. Knowledge distillation (KD) offers a practical solution by transferring genomic supervision into histopathology models, enabling accurate inference using histology alone. However, existing KD methods rely on batch-local alignment, which introduces instability due to limited within-batch comparisons and ultimately degrades performance. To address these limitations, we propose Momentum Memory Knowledge Distillation (MoMKD), a cross-modal distillation framework driven by a momentum-updated memory. This memory aggregates genomic and histopathology information across batches, effectively enlarging the supervisory context available to each mini-batch. Furthermore, we decouple the gradients of the genomics and histology branches, preventing genomic signals from dominating histology feature learning during training and eliminating the modality-gap issue at inference time. Extensive experiments on the TCGA-BRCA benchmark (HER2, PR, and ODX classification tasks) and an independent in-house testing dataset demonstrate that MoMKD consistently outperforms state-of-the-art MIL and multimodal KD baselines, delivering strong performance and generalization under histology-only inference. Overall, MoMKD establishes a robust and generalizable knowledge distillation paradigm for computational pathology.
Comments:
Accepted by CVPR 2026
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2602.21395 [cs.CV]
(or
arXiv:2602.21395v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2602.21395
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
114. 【2602.21365】owards Controllable Video Synthesis of Routine and Rare OR Events
链接:https://arxiv.org/abs/2602.21365
作者:Dominik Schneider,Lalithkumar Seenivasan,Sampath Rapuri,Vishalroshan Anil,Aiza Maksutova,Yiqing Shen,Jan Emily Mangulabnan,Hao Ding,Jose L. Porras,Masaru Ishii,Mathias Unberath
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
关键词:Curating large-scale datasets, Curating large-scale, operating room, remains operationally, ethically challenging
备注: Accepted to IPCAI 2026 and submitted to IJCARs
点击查看摘要
Abstract:Purpose: Curating large-scale datasets of operating room (OR) workflow, encompassing rare, safety-critical, or atypical events, remains operationally and ethically challenging. This data bottleneck complicates the development of ambient intelligence for detecting, understanding, and mitigating rare or safety-critical events in the OR. Methods: This work presents an OR video diffusion framework that enables controlled synthesis of rare and safety-critical events. The framework integrates a geometric abstraction module, a conditioning module, and a fine-tuned diffusion model to first transform OR scenes into abstract geometric representations, then condition the synthesis process, and finally generate realistic OR event videos. Using this framework, we also curate a synthetic dataset to train and validate AI models for detecting near-misses of sterile-field violations. Results: In synthesizing routine OR events, our method outperforms off-the-shelf video diffusion baselines, achieving lower FVD/LPIPS and higher SSIM/PSNR in both in- and out-of-domain datasets. Through qualitative results, we illustrate its ability for controlled video synthesis of counterfactual events. An AI model trained and validated on the generated synthetic data achieved a RECALL of 70.13% in detecting near safety-critical events. Finally, we conduct an ablation study to quantify performance gains from key design choices. Conclusion: Our solution enables controlled synthesis of routine and rare OR events from abstract geometric representations. Beyond demonstrating its capability to generate rare and safety-critical scenarios, we show its potential to support the development of ambient intelligence models.
Comments:
Accepted to IPCAI 2026 and submitted to IJCARs
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Cite as:
arXiv:2602.21365 [cs.CV]
(or
arXiv:2602.21365v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2602.21365
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Lalithkumar Seenivasan [view email] [v1]
Tue, 24 Feb 2026 20:56:15 UTC (2,177 KB)
115. 【2602.21341】Scaling View Synthesis Transformers
链接:https://arxiv.org/abs/2602.21341
作者:Evan Kim,Hyunwoo Ryu,Thomas W. Mitchel,Vincent Sitzmann
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Geometry-free view synthesis, outperforming traditional approaches, explicit geometry modeling, view synthesis transformers, Geometry-free view
备注: Project page: [this https URL](https://www.evn.kim/research/svsm)
点击查看摘要
Abstract:Geometry-free view synthesis transformers have recently achieved state-of-the-art performance in Novel View Synthesis (NVS), outperforming traditional approaches that rely on explicit geometry modeling. Yet the factors governing their scaling with compute remain unclear. We present a systematic study of scaling laws for view synthesis transformers and derive design principles for training compute-optimal NVS models. Contrary to prior findings, we show that encoder-decoder architectures can be compute-optimal; we trace earlier negative results to suboptimal architectural choices and comparisons across unequal training compute budgets. Across several compute levels, we demonstrate that our encoder-decoder architecture, which we call the Scalable View Synthesis Model (SVSM), scales as effectively as decoder-only models, achieves a superior performance-compute Pareto frontier, and surpasses the previous state-of-the-art on real-world NVS benchmarks with substantially reduced training compute.
116. 【2602.21333】HorizonForge: Driving Scene Editing with Any Trajectories and Any Vehicles
链接:https://arxiv.org/abs/2602.21333
作者:Yifan Wang,Francesco Pittaluga,Zaid Tasneem,Chenyu You,Manmohan Chandraker,Ziyu Jiang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:existing approaches struggle, jointly achieve photorealism, scalable autonomous driving, driving scene generation, autonomous driving simulation
备注: Accepted by CVPR 2026
点击查看摘要
Abstract:Controllable driving scene generation is critical for realistic and scalable autonomous driving simulation, yet existing approaches struggle to jointly achieve photorealism and precise control. We introduce HorizonForge, a unified framework that reconstructs scenes as editable Gaussian Splats and Meshes, enabling fine-grained 3D manipulation and language-driven vehicle insertion. Edits are rendered through a noise-aware video diffusion process that enforces spatial and temporal consistency, producing diverse scene variations in a single feed-forward pass without per-trajectory optimization. To standardize evaluation, we further propose HorizonSuite, a comprehensive benchmark spanning ego- and agent-level editing tasks such as trajectory modifications and object manipulation. Extensive experiments show that Gaussian-Mesh representation delivers substantially higher fidelity than alternative 3D representations, and that temporal priors from video diffusion are essential for coherent synthesis. Combining these findings, HorizonForge establishes a simple yet powerful paradigm for photorealistic, controllable driving simulation, achieving an 83.4% user-preference gain and a 25.19% FID improvement over the second best state-of-the-art method. Project page: this https URL .
117. 【2602.21319】Uncertainty-Aware Diffusion Model for Multimodal Highway Trajectory Prediction via DDIM Sampling
链接:https://arxiv.org/abs/2602.21319
作者:Marion Neumeier,Niklas Roßberg,Michael Botsch,Wolfgang Utschick
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:complex multi-agent interactions, diverse scene contexts, Accurate and uncertainty-aware, inherently stochastic nature, autonomous driving
备注: Accepted as a conference paper in IEEE Intelligent Vehicles Symposium (IV) 2026, Detroit, MI, United States
点击查看摘要
Abstract:Accurate and uncertainty-aware trajectory prediction remains a core challenge for autonomous driving, driven by complex multi-agent interactions, diverse scene contexts and the inherently stochastic nature of future motion. Diffusion-based generative models have recently shown strong potential for capturing multimodal futures, yet existing approaches such as cVMD suffer from slow sampling, limited exploitation of generative diversity and brittle scenario encodings. This work introduces cVMDx, an enhanced diffusion-based trajectory prediction framework that improves efficiency, robustness and multimodal predictive capability. Through DDIM sampling, cVMDx achieves up to a 100x reduction in inference time, enabling practical multi-sample generation for uncertainty estimation. A fitted Gaussian Mixture Model further provides tractable multimodal predictions from the generated trajectories. In addition, a CVQ-VAE variant is evaluated for scenario encoding. Experiments on the publicly available highD dataset show that cVMDx achieves higher accuracy and significantly improved efficiency over cVMD, enabling fully stochastic, multimodal trajectory prediction.
Comments:
Accepted as a conference paper in IEEE Intelligent Vehicles Symposium (IV) 2026, Detroit, MI, United States
Subjects:
Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Cite as:
arXiv:2602.21319 [cs.LG]
(or
arXiv:2602.21319v1 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2602.21319
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
118. 【2602.21273】StoryTailor:A Zero-Shot Pipeline for Action-Rich Multi-Subject Visual Narratives
链接:https://arxiv.org/abs/2602.21273
作者:Jinghao Hu,Yuhe Zhang,GuoHua Geng,Kang Li,Han Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:action-rich visual narratives, action text faithfulness, subject identity fidelity, cross-frame background continuity, Generating multi-frame
备注: 24 pages,19 figures,accepted by CVPR2026
点击查看摘要
Abstract:Generating multi-frame, action-rich visual narratives without fine-tuning faces a threefold tension: action text faithfulness, subject identity fidelity, and cross-frame background continuity. We propose StoryTailor, a zero-shot pipeline that runs on a single RTX 4090 (24 GB) and produces temporally coherent, identity-preserving image sequences from a long narrative prompt, per-subject references, and grounding boxes. Three synergistic modules drive the system: Gaussian-Centered Attention (GCA) to dynamically focus on each subject core and ease grounding-box overlaps; Action-Boost Singular Value Reweighting (AB-SVR) to amplify action-related directions in the text embedding space; and Selective Forgetting Cache (SFC) that retains transferable background cues, forgets nonessential history, and selectively surfaces retained cues to build cross-scene semantic ties. Compared with baseline methods, experiments show that CLIP-T improves by up to 10-15%, with DreamSim lower than strong baselines, while CLIP-I stays in a visually acceptable, competitive range. With matched resolution and steps on a 24 GB GPU, inference is faster than FluxKontext. Qualitatively, StoryTailor delivers expressive interactions and evolving yet stable scenes.
119. 【2602.22140】Lumosaic: Hyperspectral Video via Active Illumination and Coded-Exposure Pixels
链接:https://arxiv.org/abs/2602.22140
作者:Dhruv Verma,Andrew Qiu,Roberto Rangel,Ayandev Barman,Hao Yang,Chenjia Hu,Fengqi Zhang,Roman Genov,David B. Lindell,Kiriakos N. Kutulakos,Alex Mariakakis
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:compact active hyperspectral, video system designed, compact active, designed for real-time, real-time capture
备注: Accepted to CVPR 2026
点击查看摘要
Abstract:We present Lumosaic, a compact active hyperspectral video system designed for real-time capture of dynamic scenes. Our approach combines a narrowband LED array with a coded-exposure-pixel (CEP) camera capable of high-speed, per-pixel exposure control, enabling joint encoding of scene information across space, time, and wavelength within each video frame. Unlike passive snapshot systems that divide light across multiple spectral channels simultaneously and assume no motion during a frame's exposure, Lumosaic actively synchronizes illumination and pixel-wise exposure, improving photon utilization and preserving spectral fidelity under motion. A learning-based reconstruction pipeline then recovers 31-channel hyperspectral (400-700 nm) video at 30 fps and VGA resolution, producing temporally coherent and spectrally accurate reconstructions. Experiments on synthetic and real data demonstrate that Lumosaic significantly improves reconstruction fidelity and temporal stability over existing snapshot hyperspectral imaging systems, enabling robust hyperspectral video across diverse materials and motion conditions.
120. 【2602.21707】Learning spatially adaptive sparsity level maps for arbitrary convolutional dictionaries
链接:https://arxiv.org/abs/2602.21707
作者:Joshua Schulz,David Schote,Christoph Kolbitsch,Kostas Papafitsoros,Andreas Kofler
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Optimization and Control (math.OC)
关键词:strong performance, raise questions, rely on black-box, black-box modules, image reconstruction method
备注:
点击查看摘要
Abstract:State-of-the-art learned reconstruction methods often rely on black-box modules that, despite their strong performance, raise questions about their interpretability and robustness. Here, we build on a recently proposed image reconstruction method, which is based on embedding data-driven information into a model-based convolutional dictionary regularization via neural network-inferred spatially adaptive sparsity level maps. By means of improved network design and dedicated training strategies, we extend the method to achieve filter-permutation invariance as well as the possibility to change the convolutional dictionary at inference time. We apply our method to low-field MRI and compare it to several other recent deep learning-based methods, also on in vivo data, in which the benefit for the use of a different dictionary is showcased. We further assess the method's robustness when tested on in- and out-of-distribution data. When tested on the latter, the proposed method suffers less from the data distribution shift compared to the other learned methods, which we attribute to its reduced reliance on training data due to its underlying model-based reconstruction component.
121. 【2602.21482】Perceptual Quality Optimization of Image Super-Resolution
链接:https://arxiv.org/abs/2602.21482
作者:Wei Zhou,Yixiao Li,Hadi Amirpour,Xiaoshuai Hao,Jiang Liu,Peng Wang,Hantao Liu
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:achieved remarkable progress, Single-image super-resolution, Bi-directional Attention Network, heuristic perceptual priors, Perceptual Bi-directional Attention
备注: 6 pages, 2 figures, accepted in ICASSP 26
点击查看摘要
Abstract:Single-image super-resolution (SR) has achieved remarkable progress with deep learning, yet most approaches rely on distortion-oriented losses or heuristic perceptual priors, which often lead to a trade-off between fidelity and visual quality. To address this issue, we propose an \textit{Efficient Perceptual Bi-directional Attention Network (Efficient-PBAN)} that explicitly optimizes SR towards human-preferred quality. Unlike patch-based quality models, Efficient-PBAN avoids extensive patch sampling and enables efficient image-level perception. The proposed framework is trained on our self-constructed SR quality dataset that covers a wide range of state-of-the-art SR methods with corresponding human opinion scores. Using this dataset, Efficient-PBAN learns to predict perceptual quality in a way that correlates strongly with subjective judgments. The learned metric is further integrated into SR training as a differentiable perceptual loss, enabling closed-loop alignment between reconstruction and perceptual assessment. Extensive experiments demonstrate that our approach delivers superior perceptual quality. Code is publicly available at this https URL.
122. 【2602.21361】owards single-shot coherent imaging via overlap-free ptychography
链接:https://arxiv.org/abs/2602.21361
作者:Oliver Hoidn,Aashwin Mishra,Steven Henke,Albert Vong,Matthew Seaberg
类目:Optics (physics.optics); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
关键词:XFEL sources requires, dense overlapping scans, requires dense overlapping, synchrotron and XFEL, sources requires dense
备注:
点击查看摘要
Abstract:Ptychographic imaging at synchrotron and XFEL sources requires dense overlapping scans, limiting throughput and increasing dose. Extending coherent diffractive imaging to overlap-free operation on extended samples remains an open problem. Here, we extend PtychoPINN (O. Hoidn \emph{et al.}, \emph{Scientific Reports} \textbf{13}, 22789, 2023) to deliver \emph{overlap-free, single-shot} reconstructions in a Fresnel coherent diffraction imaging (CDI) geometry while also accelerating conventional multi-shot ptychography. The framework couples a differentiable forward model of coherent scattering with a Poisson photon-counting likelihood; real-space overlap enters as a tunable parameter via coordinate-based grouping rather than a hard requirement. On synthetic benchmarks, reconstructions remain accurate at low counts ($\sim\!10^4$ photons/frame), and overlap-free single-shot reconstruction with an experimental probe reaches amplitude structural similarity (SSIM) 0.904, compared with 0.968 for overlap-constrained reconstruction. Against a data-saturated supervised model with the same backbone (16,384 training images), PtychoPINN achieves higher SSIM with only 1,024 images and generalizes to unseen illumination profiles. Per-graphics processing unit (GPU) throughput is approximately $40\times$ that of least-squares maximum-likelihood (LSQ-ML) reconstruction at matched $128\times128$ resolution. These results, validated on experimental data from the Advanced Photon Source and the Linac Coherent Light Source, unify single-exposure Fresnel CDI and overlapped ptychography within one framework, supporting dose-efficient, high-throughput imaging at modern light sources.
123. 【2602.21345】RelA-Diffusion: Relativistic Adversarial Diffusion for Multi-Tracer PET Synthesis from Multi-Sequence MRI
链接:https://arxiv.org/abs/2602.21345
作者:Minhui Yu,Yongheng Sun,David S. Lalush,Jason P Mihalik,Pew-Thian Yap,Mingxia Liu
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:positron emission tomography, comprehensive neurological assessment, diverse neuropathological processes, Multi-tracer positron emission, multi-tracer PET
备注:
点击查看摘要
Abstract:Multi-tracer positron emission tomography (PET) provides critical insights into diverse neuropathological processes such as tau accumulation, neuroinflammation, and $\beta$-amyloid deposition in the brain, making it indispensable for comprehensive neurological assessment. However, routine acquisition of multi-tracer PET is limited by high costs, radiation exposure, and restricted tracer availability. Recent efforts have explored deep learning approaches for synthesizing PET images from structural MRI. While some methods rely solely on T1-weighted MRI, others incorporate additional sequences such as T2-FLAIR to improve pathological sensitivity. However, existing methods often struggle to capture fine-grained anatomical and pathological details, resulting in artifacts and unrealistic outputs. To this end, we propose RelA-Diffusion, a Relativistic Adversarial Diffusion framework for multi-tracer PET synthesis from multi-sequence MRI. By leveraging both T1-weighted and T2-FLAIR scans as complementary inputs, RelA-Diffusion captures richer structural information to guide image generation. To improve synthesis fidelity, we introduce a gradient-penalized relativistic adversarial loss to the intermediate clean predictions of the diffusion model. This loss compares real and generated images in a relative manner, encouraging the synthesis of more realistic local structures. Both the relativistic formulation and the gradient penalty contribute to stabilizing the training, while adversarial feedback at each diffusion timestep enables consistent refinement throughout the generation process. Extensive experiments on two datasets demonstrate that RelA-Diffusion outperforms existing methods in both visual fidelity and quantitative metrics, highlighting its potential for accurate synthesis of multi-tracer PET.



