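This post lists the latest papers retrieved daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.
Statistics
495 papers were updated today, including:
- Natural language processing: 61 papers
- Information retrieval: 20 papers
- Computer vision: 93 papers
Natural Language Processing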
1. 【2602.13194】Semantic Chunking and the Entropy of Natural Language
Link: https://arxiv.org/abs/2602.13194
Authors: Weishun Zhong, Doron Sivan, Tankut Can, Mikhail Katkov, Misha Tsodyks
Subjects: Computation and Language (cs.CL); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI)
Keywords: entropy rate, printed English, recently approached, large language models, modern large language
Comments: 29 pages, 9 figures
Abstract: The entropy rate of printed English is famously estimated to be about one bit per character, a benchmark that modern large language models (LLMs) have only recently approached. This entropy rate implies that English contains nearly 80 percent redundancy relative to the five bits per character expected for random text. We introduce a statistical model that attempts to capture the intricate multi-scale structure of natural language, providing a first-principles account of this redundancy level. Our model describes a procedure of self-similarly segmenting text into semantically coherent chunks down to the single-word level. The semantic structure of the text can then be hierarchically decomposed, allowing for analytical treatment. Numerical experiments with modern LLMs and open datasets suggest that our model quantitatively captures the structure of real texts at different levels of the semantic hierarchy. The entropy rate predicted by our model agrees with the estimated entropy rate of printed English. Moreover, our theory further reveals that the entropy rate of natural language is not fixed but should increase systematically with the semantic complexity of corpora, which is captured by the only free parameter in our model.
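The "nearly 80 percent redundancy" figure follows directly from the two numbers quoted in the abstract. The snippet below is a small arithmetic sketch rather than anything taken from the paper; the 27-symbol alphabet (26 letters plus space) is an assumption added here to show roughly where a value near five bits per character could come from.

```python
import math

def redundancy(entropy_rate_bits: float, max_bits: float) -> float:
    """Fraction of the per-character capacity that is predictable."""
    return 1.0 - entropy_rate_bits / max_bits

# Figures quoted in the abstract: about 1 bit/char vs. 5 bits/char for random text.
abstract_redundancy = redundancy(1.0, 5.0)

# Assumption (not stated in the abstract): uniform text over 26 letters plus space.
uniform_bits_27 = math.log2(27)
alt_redundancy = redundancy(1.0, uniform_bits_27)

print(f"redundancy with 5 bits/char: {abstract_redundancy:.2f}")
print(f"log2(27) = {uniform_bits_27:.2f} bits/char, redundancy {alt_redundancy:.2f}")
```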
2. 【2602.13191】CoPE-VideoLM: Codec Primitives For Efficient Video Language Models
Link: https://arxiv.org/abs/2602.13191
Authors: Sayan Deb Sarkar, Rémi Pautrat, Ondrej Miksik, Marc Pollefeys, Iro Armeni, Mahdi Rad, Mihai Dusmanu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Video Language Models, Language Models, understand temporal dynamics, empower AI systems, Video Language
Comments: Project page: https://sayands.github.io/cope/
Abstract:Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit to the maximum context window constraint, current methods use keyframe sampling which can miss both macro-level events and micro-level details due to the sparse temporal coverage. Furthermore, processing full images and their tokens for each frame incurs substantial computational overhead. To address these limitations, we propose to leverage video codec primitives (specifically motion vectors and residuals) which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach reduces the time-to-first-token by up to $86\%$ and token usage by up to $93\%$ compared to standard VideoLMs. Moreover, by varying the keyframe and codec primitive densities we are able to maintain or exceed performance on $14$ diverse video understanding benchmarks spanning general question answering, temporal reasoning, long-form understanding, and spatial scene understanding.
3. 【2602.13151】Quantization-Robust LLM Unlearning via Low-Rank Adaptation
Link: https://arxiv.org/abs/2602.13151
Authors: João Vitor Boer Abitante, Joana Meneguzzo Pasquali, Luan Fonseca Garcia, Ewerton de Oliveira, Thomas da Silva Paula, Rodrigo C. Barros, Lucas S. Kupssinskü
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Large Language Model, Large Language, remove targeted knowledge, require post-training quantization, Language Model
Abstract: Large Language Model (LLM) unlearning aims to remove targeted knowledge from a trained model, but practical deployments often require post-training quantization (PTQ) for efficient inference. However, aggressive low-bit PTQ can mask or erase unlearning updates, causing quantized models to revert to pre-unlearning behavior. We show that standard full-parameter fine-tuning often induces parameter changes that are too small to survive 4-bit quantization. We propose quantization-robust unlearning via low-rank adaptation (LoRA): we freeze the base model and concentrate unlearning into trainable adapters so that the effective update is preserved after quantization. On Llama-2-7B evaluated on the MUSE dataset (BOOKS and NEWS), LoRA improves 4-bit utility by up to 7.93 points (NPO+GDR on BOOKS: 50.17 to 58.10) and yields higher 4-bit utility on NEWS for GA+GDR (40.06 to 44.82, increase of 4.76). LoRA also substantially reduces privacy leakage under 4-bit PTQ, e.g., for GA+KLR on BOOKS, PrivLeak moves from -25.68 to -5.86 (closer to ideal 0), while maintaining strong forgetting (VerMem and KnowMem near 0). Thus, using LoRA for Machine Unlearning is beneficial for scenarios where quantization is necessary for model deployment.
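To see why a small weight change can vanish under 4-bit quantization, consider a toy round-to-nearest quantizer. This is a minimal sketch under stated assumptions: the uniform grid over [-1, 1], the function `quantize_4bit`, and the example weights are all hypothetical and are not the PTQ scheme or values used in the paper.

```python
def quantize_4bit(w: float, w_min: float = -1.0, w_max: float = 1.0) -> int:
    """Map a weight to one of 16 integer codes on a uniform grid (round-to-nearest)."""
    step = (w_max - w_min) / (2**4 - 1)      # 15 intervals between 16 codes
    clipped = min(max(w, w_min), w_max)      # keep the weight inside the grid range
    return round((clipped - w_min) / step)

base_weight = 0.30
small_update = 0.01   # hypothetical tiny change, smaller than half a grid step
large_update = 0.20   # hypothetical larger change, more than one grid step

code_before = quantize_4bit(base_weight)
code_small = quantize_4bit(base_weight + small_update)
code_large = quantize_4bit(base_weight + large_update)

# The small update lands on the same code, so it is invisible after quantization.
print(code_before, code_small, code_large)
```

In this toy setting an update only survives when it moves the weight across a rounding boundary, which is the intuition behind concentrating the change rather than spreading many tiny changes across all parameters.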
4. 【2602.13139】OpenLID-v3: Improving the Precision of Closely Related Language Identification -- An Experience Report
Link: https://arxiv.org/abs/2602.13139
Authors: Mariia Fedorova, Nikolay Arefyev, Maja Buljan, Jindřich Helcl, Stephan Oepen, Egil Rønningstad, Yves Scherrer
Subjects: Computation and Language (cs.CL)
Keywords: building high-quality multilingual, high-quality multilingual datasets, essential step, step in building, building high-quality
Comments: VarDial'26 workshop at the EACL 2026 conference
Abstract:Language identification (LID) is an essential step in building high-quality multilingual datasets from web data. Existing LID tools (such as OpenLID or GlotLID) often struggle to identify closely related languages and to distinguish valid natural language from noise, which contaminates language-specific subsets, especially for low-resource languages. In this work we extend the OpenLID classifier by adding more training data, merging problematic language variant clusters, and introducing a special label for marking noise. We call this extended system OpenLID-v3 and evaluate it against GlotLID on multiple benchmarks. During development, we focus on three groups of closely related languages (Bosnian, Croatian, and Serbian; Romance varieties of Northern Italy and Southern France; and Scandinavian languages) and contribute new evaluation datasets where existing ones are inadequate. We find that ensemble approaches improve precision but also substantially reduce coverage for low-resource languages. OpenLID-v3 is available on this https URL.
5. 【2602.13123】From sunblock to softblock: Analyzing the correlates of neology in published writing and on social media
Link: https://arxiv.org/abs/2602.13123
Authors: Maria Ryskina, Matthew R. Gormley, Kyle Mahowald, David R. Mortensen, Taylor Berg-Kirkpatrick, Vivek Kulkarni
Subjects: Computation and Language (cs.CL)
Keywords: external evolutionary pressures, Living languages, host of conflicting, conflicting internal, internal and external
Comments: Accepted to LChange 2026
Abstract:Living languages are shaped by a host of conflicting internal and external evolutionary pressures. While some of these pressures are universal across languages and cultures, others differ depending on the social and conversational context: language use in newspapers is subject to very different constraints than language use on social media. Prior distributional semantic work on English word emergence (neology) identified two factors correlated with creation of new words by analyzing a corpus consisting primarily of historical published texts (Ryskina et al., 2020, arXiv:2001.07740). Extending this methodology to contextual embeddings in addition to static ones and applying it to a new corpus of Twitter posts, we show that the same findings hold for both domains, though the topic popularity growth factor may contribute less to neology on Twitter than in published writing. We hypothesize that this difference can be explained by the two domains favouring different neologism formation mechanisms.
6. 【2602.13110】SCOPE: Selective Conformal Optimized Pairwise LLM Judging
Link: https://arxiv.org/abs/2602.13110
Authors: Sher Badshah, Ali Emami, Hassan Sajjad
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large language models, replace costly human, Large language, costly human preference, human preference labels
Abstract: Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation. Despite their practicality, LLM judges remain prone to miscalibration and systematic biases. This paper proposes SCOPE (Selective Conformal Optimized Pairwise Evaluation), a framework for selective pairwise judging with finite-sample statistical guarantees. Under exchangeability, SCOPE calibrates an acceptance threshold such that the error rate among non-abstained judgments is at most a user-specified level $\alpha$. To provide SCOPE with a bias-neutral uncertainty signal, we introduce Bidirectional Preference Entropy (BPE), which queries the judge under both response positions, aggregates the implied preference probabilities to enforce invariance to response order, and converts the aggregated probability into an entropy-based uncertainty score. Across MT-Bench, RewardBench, and Chatbot Arena, BPE improves uncertainty quality over standard confidence proxies, providing a stronger selection signal that enables SCOPE to consistently meet the target risk level while retaining good coverage across judge scales. In particular, at $\alpha = 0.10$, SCOPE consistently satisfies the risk bound across all benchmarks and judge scales (empirical risk $\approx 0.097$ to $0.099$), while retaining substantial coverage, reaching $0.89$ on RewardBench with Qwen-14B and $0.98$ on RewardBench with Qwen-32B. Compared to naïve baselines, SCOPE accepts up to $2.4\times$ more judgments on MT-Bench with Qwen-7B under the same target risk constraint, demonstrating that BPE enables reliable and high-coverage LLM-based evaluation.
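The BPE idea described above (query both orderings, aggregate, convert to entropy) can be sketched in a few lines. This is an illustrative sketch only: the abstract does not give the aggregation rule, so the simple average used here is an assumption, and the function and argument names are hypothetical.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def bidirectional_preference_entropy(p_a_when_first: float,
                                     p_a_when_second: float) -> tuple[float, float]:
    """Return (order-averaged probability that A wins, entropy-based uncertainty).

    p_a_when_first:  judge's probability that A wins when A is shown first.
    p_a_when_second: judge's probability that A wins when A is shown second.
    Averaging the two is an assumption made for this sketch.
    """
    p_a = 0.5 * (p_a_when_first + p_a_when_second)
    return p_a, binary_entropy(p_a)

# A judge that prefers A in both orderings: low uncertainty.
consistent = bidirectional_preference_entropy(0.9, 0.9)
# A judge that always prefers whichever response comes first: maximal uncertainty.
position_biased = bidirectional_preference_entropy(0.9, 0.1)

print(consistent, position_biased)
```

A selective judge would then abstain whenever the uncertainty exceeds a threshold; in SCOPE that threshold is calibrated to meet the target risk level $\alpha$, which this sketch does not attempt.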
7. 【2602.13102】Towards interpretable models for language proficiency assessment: Predicting the CEFR level of Estonian learner texts
Link: https://arxiv.org/abs/2602.13102
Authors: Kais Allkivi
Subjects: Computation and Language (cs.CL)
Keywords: analyze authentic learner, build automated assessment, NLP to analyze, authentic learner language, feedback tools
Abstract:Using NLP to analyze authentic learner language helps to build automated assessment and feedback tools. It also offers new and extensive insights into the development of second language production. However, there is a lack of research explicitly combining these aspects. This study aimed to classify Estonian proficiency examination writings (levels A2-C1), assuming that careful feature selection can lead to more explainable and generalizable machine learning models for language testing. Various linguistic properties of the training data were analyzed to identify relevant proficiency predictors associated with increasing complexity and correctness, rather than the writing task. Such lexical, morphological, surface, and error features were used to train classification models, which were compared to models that also allowed for other features. The pre-selected features yielded a similar test accuracy but reduced variation in the classification of different text types. The best classifiers achieved an accuracy of around 0.9. Additional evaluation on an earlier exam sample revealed that the writings have become more complex over a 7-10-year period, while accuracy still reached 0.8 with some feature sets. The results have been implemented in the writing evaluation module of an Estonian open-source language learning environment.
8. 【2602.13093】Consistency of Large Reasoning Models Under Multi-Turn Attacks
Link: https://arxiv.org/abs/2602.13093
Authors: Yubo Li, Ramayya Krishnan, Rema Padman
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large reasoning models, pressure remains underexplored, Large reasoning, reasoning models, multi-turn adversarial pressure
Abstract:Large reasoning models with reasoning capabilities achieve state-of-the-art performance on complex tasks, but their robustness under multi-turn adversarial pressure remains underexplored. We evaluate nine frontier reasoning models under adversarial attacks. Our findings reveal that reasoning confers meaningful but incomplete robustness: most reasoning models studied significantly outperform instruction-tuned baselines, yet all exhibit distinct vulnerability profiles, with misleading suggestions universally effective and social pressure showing model-specific efficacy. Through trajectory analysis, we identify five failure modes (Self-Doubt, Social Conformity, Suggestion Hijacking, Emotional Susceptibility, and Reasoning Fatigue) with the first two accounting for 50% of failures. We further demonstrate that Confidence-Aware Response Generation (CARG), effective for standard LLMs, fails for reasoning models due to overconfidence induced by extended reasoning traces; counterintuitively, random confidence embedding outperforms targeted extraction. Our results highlight that reasoning capabilities do not automatically confer adversarial robustness and that confidence-based defenses require fundamental redesign for reasoning models.
9. 【2602.13084】Exploring a New Competency Modeling Process with Large Language Models
Link: https://arxiv.org/abs/2602.13084
Authors: Silin Du, Manqing Xin, Raymond Jia Wang
Subjects: Computation and Language (cs.CL)
Keywords: human resource management, management to select, evaluate talent, human resource, resource management
Abstract:Competency modeling is widely used in human resource management to select, develop, and evaluate talent. However, traditional expert-driven approaches rely heavily on manual analysis of large volumes of interview transcripts, making them costly and prone to randomness, ambiguity, and limited reproducibility. This study proposes a new competency modeling process built on large language models (LLMs). Instead of merely automating isolated steps, we reconstruct the workflow by decomposing expert practices into structured computational components. Specifically, we leverage LLMs to extract behavioral and psychological descriptions from raw textual data and map them to predefined competency libraries through embedding-based similarity. We further introduce a learnable parameter that adaptively integrates different information sources, enabling the model to determine the relative importance of behavioral and psychological signals. To address the long-standing challenge of validation, we develop an offline evaluation procedure that allows systematic model selection without requiring additional large-scale data collection. Empirical results from a real-world implementation in a software outsourcing company demonstrate strong predictive validity, cross-library consistency, and structural robustness. Overall, our framework transforms competency modeling from a largely qualitative and expert-dependent practice into a transparent, data-driven, and evaluable analytical process.
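The mapping step described above (embedding-based similarity plus a parameter that weights behavioral against psychological signals) might look roughly like the following. This is a hedged sketch: the vectors, the library entries, the weight `alpha`, and the linear blending rule are all hypothetical stand-ins, and in the paper the weight is learned rather than set by hand.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_competencies(behavior_vec, psych_vec, library, alpha):
    """Rank library entries by a blend of behavioral and psychological similarity.

    alpha weights the behavioral signal; (1 - alpha) weights the psychological one.
    """
    scores = {
        name: alpha * cosine(behavior_vec, vec) + (1.0 - alpha) * cosine(psych_vec, vec)
        for name, vec in library.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy 3-dimensional "embeddings" (hypothetical values).
library = {"teamwork": [1.0, 0.0, 0.0], "analysis": [0.0, 1.0, 0.0]}
behavior = [0.9, 0.1, 0.0]   # behavioral description leans toward "teamwork"
psych = [0.2, 0.8, 0.0]      # psychological description leans toward "analysis"

behavior_heavy = rank_competencies(behavior, psych, library, alpha=0.8)
psych_heavy = rank_competencies(behavior, psych, library, alpha=0.2)
print(behavior_heavy, psych_heavy)
```

Changing `alpha` flips the top-ranked competency in this toy case, which is why the relative weighting of the two sources matters.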
10. 【2602.13073】LCSB: Layer-Cyclic Selective Backpropagation for Memory-Efficient On-Device LLM Fine-Tuning
Link: https://arxiv.org/abs/2602.13073
Authors: Juneyoung Park, Eunbeen Yoon, Seongwan Kim, Jaeho Lee
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: enabled first-order fine-tuning, Memory-efficient backpropagation, large language models, enabled first-order, first-order fine-tuning
Comments: Under review, 13 pages
Abstract: Memory-efficient backpropagation (MeBP) has enabled first-order fine-tuning of large language models (LLMs) on mobile devices with less than 1GB memory. However, MeBP requires backward computation through all transformer layers at every step, where weight decompression alone accounts for 32-42% of backward time. We propose Layer-Cyclic Selective Backpropagation (LCSB), which computes gradients for only a subset of layers per step. Our key insight is that residual connections guarantee gradient flow through identity paths, while AdamW momentum provides implicit updates for non-selected layers. We interpret LCSB as Block Coordinate Descent on the LoRA parameter space, providing theoretical justification for convergence. LCSB achieves up to $1.40\times$ speedup with less than 2% quality degradation across five models and three tasks. Surprisingly, in 4-bit quantized settings, LCSB exhibits superior stability: a 3B model that completely diverges under full backpropagation converges smoothly with LCSB, suggesting an implicit regularization effect from selective gradient computation.
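Computing gradients "for only a subset of layers per step" in a cyclic fashion can be pictured as a round-robin schedule over layer indices. The abstract does not specify the exact selection rule, so the schedule below is an assumption for illustration, and `layers_for_step` is a hypothetical helper.

```python
def layers_for_step(step: int, num_layers: int, layers_per_step: int) -> list[int]:
    """Return the layer indices whose gradients are computed at this step.

    Layers are visited in contiguous blocks that wrap around, so every layer
    is selected once per cycle when layers_per_step divides num_layers.
    """
    start = (step * layers_per_step) % num_layers
    return [(start + i) % num_layers for i in range(layers_per_step)]

num_layers, layers_per_step = 8, 2
schedule = [layers_for_step(s, num_layers, layers_per_step) for s in range(5)]
print(schedule)

# One full cycle (4 steps here) touches every layer exactly once.
covered = sorted(i for block in schedule[:4] for i in block)
print(covered)
```

Updating one block of parameters at a time while holding the rest fixed is the pattern that block coordinate descent refers to, which matches the interpretation given in the abstract.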
11. 【2602.13069】Memory-Efficient Structured Backpropagation for On-Device LLM Fine-Tuning
Link: https://arxiv.org/abs/2602.13069
Authors: Juneyoung Park, Yuri Hong, Seongwan Kim, Jaeho Lee
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: On-device fine-tuning enables, enables privacy-preserving personalization, large language models, severe memory constraints, Memory-efficient Structured Backpropagation
Comments: Under review, 11 pages
Abstract: On-device fine-tuning enables privacy-preserving personalization of large language models, but mobile devices impose severe memory constraints, typically 6-12GB shared across all workloads. Existing approaches force a trade-off between exact gradients with high memory (MeBP) and low memory with noisy estimates (MeZO). We propose Memory-efficient Structured Backpropagation (MeSP), which bridges this gap by manually deriving backward passes that exploit LoRA's low-rank structure. Our key insight is that the intermediate projection $h = xA$ can be recomputed during backward at minimal cost since rank $r \ll d_{in}$, eliminating the need to store it. MeSP achieves 49% average memory reduction compared to MeBP on Qwen2.5 models (0.5B-3B) while computing mathematically identical gradients. Our analysis also reveals that MeZO's gradient estimates show near-zero correlation with true gradients (cosine similarity $\approx 0.001$), explaining its slow convergence. MeSP reduces peak memory from 361MB to 136MB for Qwen2.5-0.5B, enabling fine-tuning scenarios previously infeasible on memory-constrained devices.
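The recomputation idea can be checked on a toy LoRA branch: the gradients are the same whether $h = xA$ is stored during the forward pass or recomputed during the backward pass. This is a minimal pure-Python sketch under assumptions made here: only the low-rank branch $y = (xA)B$ is modeled, the frozen base weight and any scaling factor are omitted, and all function names and values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def lora_forward(x, A, B, keep_h):
    h = matmul(x, A)                      # (batch, r) intermediate projection
    y = matmul(h, B)                      # (batch, d_out) low-rank branch output
    saved = {"x": x, "h": h if keep_h else None}
    return y, saved

def lora_backward(dy, saved, A, B):
    x = saved["x"]
    # Recompute h = xA when it was not stored; cheap because r is small.
    h = saved["h"] if saved["h"] is not None else matmul(x, A)
    dB = matmul(transpose(h), dy)         # dL/dB = h^T dy, shape (r, d_out)
    dh = matmul(dy, transpose(B))         # dL/dh = dy B^T, shape (batch, r)
    dA = matmul(transpose(x), dh)         # dL/dA = x^T dh, shape (d_in, r)
    return dA, dB

x = [[1.0, 2.0, 0.0, -1.0], [0.5, 0.0, 1.0, 2.0]]   # batch=2, d_in=4
A = [[0.1], [0.2], [0.3], [0.4]]                    # d_in=4, r=1
B = [[1.0, -1.0, 0.5]]                              # r=1, d_out=3
dy = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]             # upstream gradient

_, saved_store = lora_forward(x, A, B, keep_h=True)
_, saved_recompute = lora_forward(x, A, B, keep_h=False)
grads_stored = lora_backward(dy, saved_store, A, B)
grads_recomputed = lora_backward(dy, saved_recompute, A, B)
print(grads_stored == grads_recomputed)
```

The trade is one extra small matrix product in the backward pass in exchange for not holding a `(batch, r)` activation in memory for every adapted layer.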
12. 【2602.13059】TraceBack: Multi-Agent Decomposition for Fine-Grained Table Attribution
Link: https://arxiv.org/abs/2602.13059
Authors: Tejas Anvekar, Junha Park, Rajat Jha, Devanshu Gupta, Poojah Ganesan, Puneeth Mathur, Vivek Gupta
Subjects: Computation and Language (cs.CL)
Keywords: structured tables requires, Question answering, structured tables, tables requires, accurate answers
Abstract:Question answering (QA) over structured tables requires not only accurate answers but also transparency about which cells support them. Existing table QA systems rarely provide fine-grained attribution, so even correct answers often lack verifiable grounding, limiting trust in high-stakes settings. We address this with TraceBack, a modular multi-agent framework for scalable, cell-level attribution in single-table QA. TraceBack prunes tables to relevant rows and columns, decomposes questions into semantically coherent sub-questions, and aligns each answer span with its supporting cells, capturing both explicit and implicit evidence used in intermediate reasoning steps. To enable systematic evaluation, we release CITEBench, a benchmark with phrase-to-cell annotations drawn from ToTTo, FetaQA, and AITQA. We further propose FairScore, a reference-less metric that compares atomic facts derived from predicted cells and answers to estimate attribution precision and recall without human cell labels. Experiments show that TraceBack substantially outperforms strong baselines across datasets and granularities, while FairScore closely tracks human judgments and preserves relative method rankings, supporting interpretable and scalable evaluation of table-based QA.
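Cell-level attribution is naturally scored with precision and recall over sets of table cells. The sketch below shows the reference-based version, which needs gold cell labels; it is not FairScore, which the abstract describes as reference-less, and the `(row, column)` cell encoding and function name are assumptions made for illustration.

```python
def cell_precision_recall(predicted: set[tuple[int, int]],
                          gold: set[tuple[int, int]]) -> tuple[float, float]:
    """Precision and recall of predicted supporting cells against gold cells.

    Cells are (row_index, column_index) pairs. Returns (0.0, 0.0) when either
    set is empty, to avoid dividing by zero.
    """
    if not predicted or not gold:
        return 0.0, 0.0
    hits = len(predicted & gold)
    return hits / len(predicted), hits / len(gold)

# Hypothetical example: the system cites three cells, two of which are correct.
predicted_cells = {(0, 1), (2, 3), (4, 0)}
gold_cells = {(0, 1), (2, 3)}
precision, recall = cell_precision_recall(predicted_cells, gold_cells)
print(precision, recall)
```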
13. 【2602.13047】Can we trust AI to detect healthy multilingual English speakers among the cognitively impaired cohort in the UK? An investigation using real-world conversational speech
Link: https://arxiv.org/abs/2602.13047
Authors: Madhurananda Pahar, Caitlin Illingworth, Dorota Braun, Bahman Mirheidari, Lise Sproson, Daniel Blackburn, Heidi Christensen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: reveals early signs, Conversational speech, speech often reveals, reveals early, early signs
Abstract: Conversational speech often reveals early signs of cognitive decline, such as dementia and MCI. In the UK, one in four people belongs to an ethnic minority, and dementia prevalence is expected to rise most rapidly among Black and Asian communities. This study examines the trustworthiness of AI models, specifically the presence of bias, in detecting healthy multilingual English speakers among the cognitively impaired cohort, to make these tools clinically beneficial. For experiments, monolingual participants were recruited nationally (UK), and multilingual speakers were enrolled from four community centres in Sheffield and Bradford. The multilingual participants, who spoke English with a non-native accent alongside Somali, Chinese, or South Asian languages, were further divided by Yorkshire accent (West and South) to challenge the efficiency of the AI tools thoroughly. Although ASR systems showed no significant bias across groups, classification and regression models using acoustic and linguistic features exhibited bias against multilingual speakers, particularly in memory, fluency, and reading tasks. This bias was more pronounced when models were trained on the publicly available DementiaBank dataset. Moreover, multilinguals were more likely to be misclassified as having cognitive decline. This study is the first of its kind to discover that, despite their strong overall performance, current AI models show bias against multilingual individuals from ethnic minority backgrounds in the UK, and they are also more likely to misclassify speakers with a certain accent (South Yorkshire) as living with a more severe cognitive decline. In this pilot study, we conclude that the existing AI tools are therefore not yet reliable for diagnostic use in these populations, and we aim to address this in future work by developing more generalisable, bias-mitigated models.
14. 【2602.13035】Look Inward to Explore Outward: Learning Temperature Policy from LLM Internal States via Hierarchical RL
Link: https://arxiv.org/abs/2602.13035
Authors: Yixiao Zhou, Yang Li, Dongzhou Cheng, Hehe Fan, Yu Cheng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: trains large language, purely inference-time choice, making decoding strategy, large language models, Verifiable Rewards
Abstract:Reinforcement Learning from Verifiable Rewards (RLVR) trains large language models (LLMs) from sampled trajectories, making decoding strategy a core component of learning rather than a purely inference-time choice. Sampling temperature directly controls the exploration--exploitation trade-off by modulating policy entropy, yet existing methods rely on static values or heuristic adaptations that are decoupled from task-level rewards. We propose Introspective LLM, a hierarchical reinforcement learning framework that learns to control sampling temperature during generation. At each decoding step, the model selects a temperature based on its hidden state and samples the next token from the resulting distribution. Temperature and token policies are jointly optimized from downstream rewards using a coordinate ascent scheme. Experiments on mathematical reasoning benchmarks show that learned temperature policies outperform fixed and heuristic baselines, while exhibiting interpretable exploration behaviors aligned with reasoning uncertainty.
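The claim that sampling temperature modulates policy entropy is easy to verify numerically. The sketch below only demonstrates that effect on a fixed set of logits; it does not implement the learned temperature policy from the paper, and the logits and temperature values are hypothetical.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs: list[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

logits = [2.0, 1.0, 0.1]   # hypothetical next-token logits
entropies = {t: entropy(softmax_with_temperature(logits, t)) for t in (0.5, 1.0, 2.0)}

# Lower temperature sharpens the distribution (exploitation);
# higher temperature flattens it (exploration).
for t, h in entropies.items():
    print(f"T={t}: entropy={h:.3f} nats")
```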
15. 【2602.13033】Buy versus Build an LLM: A Decision Framework for Governments
Link: https://arxiv.org/abs/2602.13033
Authors: Jiahao Lu, Ziwei Xu, William Tjhi, Junnan Li, Antoine Bosselut, Pang Wei Koh, Mohan Kankanhalli
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Keywords: sensitive state functions, Large Language Models, general purpose citizen, purpose citizen services, Large Language
Comments: The short version of this document is published as an ACM TechBrief, and this document is published as an ACM Technology Policy Council white paper
Abstract:Large Language Models (LLMs) represent a new frontier of digital infrastructure that can support a wide range of public-sector applications, from general purpose citizen services to specialized and sensitive state functions. When expanding AI access, governments face a set of strategic choices over whether to buy existing services, build domestic capabilities, or adopt hybrid approaches across different domains and use cases. These are critical decisions especially when leading model providers are often foreign corporations, and LLM outputs are increasingly treated as trusted inputs to public decision-making and public discourse. In practice, these decisions are not intended to mandate a single approach across all domains; instead, national AI strategies are typically pluralistic, with sovereign, commercial and open-source models coexisting to serve different purposes. Governments may rely on commercial models for non-sensitive or commodity tasks, while pursuing greater control for critical, high-risk or strategically important applications. This paper provides a strategic framework for making this decision by evaluating these options across dimensions including sovereignty, safety, cost, resource capability, cultural fit, and sustainability. Importantly, "building" does not imply that governments must act alone: domestic capabilities may be developed through public research institutions, universities, state-owned enterprises, joint ventures, or broader national ecosystems. By detailing the technical requirements and practical challenges of each pathway, this work aims to serve as a reference for policy-makers to determine whether a buy or build approach best aligns with their specific national needs and societal goals.
16. 【2602.13028】Human-Aligned MLLM Judges for Fine-Grained Image Editing Evaluation: A Benchmark, Framework, and Analysis
Link: https://arxiv.org/abs/2602.13028
Authors: Runzhou Liu (1), Hailey Weingord (2), Sejal Mittal (2), Prakhar Dungarwal (2), Anusha Nandula (2), Bo Ni (3), Samyadeep Basu (4), Hongjie Chen (5), Nesreen K. Ahmed (6), Li Li (7), Jiayi Zhang (8), Koustava Goswami (4), Subhojyoti Mukherjee (4), Branislav Kveton (4), Puneet Mathur (4), Franck Dernoncourt (4), Yue Zhao (7), Yu Wang (9), Ryan A. Rossi (4), Zhengzhong Tu (10), Hongru Du (1) ((1) University of Virginia, (2) Columbia University, (3) Vanderbilt University, (4) Adobe Research, (5) Dolby Laboratories, (6) Cisco Research, (7) University of Southern California, (8) University of Wisconsin-Madison, (9) University of Oregon, (10) Texas A&M University)
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: remains challenging due, capture aspects important, Evaluating image editing, models remains challenging, Evaluating image
Abstract:Evaluating image editing models remains challenging due to the coarse granularity and limited interpretability of traditional metrics, which often fail to capture aspects important to human perception and intent. Such metrics frequently reward visually plausible outputs while overlooking controllability, edit localization, and faithfulness to user instructions. In this work, we introduce a fine-grained Multimodal Large Language Model (MLLM)-as-a-Judge framework for image editing that decomposes common evaluation notions into twelve fine-grained interpretable factors spanning image preservation, edit quality, and instruction fidelity. Building on this formulation, we present a new human-validated benchmark that integrates human judgments, MLLM-based evaluations, model outputs, and traditional metrics across diverse image editing tasks. Through extensive human studies, we show that the proposed MLLM judges align closely with human evaluations at a fine granularity, supporting their use as reliable and scalable evaluators. We further demonstrate that traditional image editing metrics are often poor proxies for these factors, failing to distinguish over-edited or semantically imprecise outputs, whereas our judges provide more intuitive and informative assessments in both offline and online settings. Together, this work introduces a benchmark, a principled factorization, and empirical evidence positioning fine-grained MLLM judges as a practical foundation for studying, comparing, and improving image editing approaches.
17. 【2602.12996】Know More, Know Clearer: A Meta-Cognitive Framework for Knowledge Augmentation in Large Language Models
Link: https://arxiv.org/abs/2602.12996
Authors: Hao Chen, Ye He, Yuchun Fan, Yukun Yan, Zhenghao Liu, Qingfu Zhu, Maosong Sun, Wanxiang Che
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, Language Models, knowledge-intensive tasks, model performance equates
Abstract: Knowledge augmentation has significantly enhanced the performance of Large Language Models (LLMs) in knowledge-intensive tasks. However, existing methods typically operate on the simplistic premise that model performance equates with internal knowledge, overlooking the knowledge-confidence gaps that lead to overconfident errors or uncertain truths. To bridge this gap, we propose a novel meta-cognitive framework for reliable knowledge augmentation via differentiated intervention and alignment. Our approach leverages internal cognitive signals to partition the knowledge space into mastered, confused, and missing regions, guiding targeted knowledge expansion. Furthermore, we introduce a cognitive consistency mechanism to synchronize subjective certainty with objective accuracy, ensuring calibrated knowledge boundaries. Extensive experiments demonstrate that our framework consistently outperforms strong baselines, validating its rationality in not only enhancing knowledge capabilities but also fostering cognitive behaviors that better distinguish knowns from unknowns.
18. 【2602.12989】Evaluating the Homogeneity of Keyphrase Prediction Models
Link: https://arxiv.org/abs/2602.12989
Authors: Maël Houbre, Florian Boudin, Beatrice Daille
Subjects: Computation and Language (cs.CL)
Keywords: keyphrase generation models, keyphrase generation, keyphrase, keyphrase prediction models, models
Comments: Accepted to LREC 2026
Abstract: Keyphrases, which are useful in several NLP and IR applications, are either extracted from text or predicted by generative models. Contrary to keyphrase extraction approaches, keyphrase generation models can predict keyphrases that do not appear in a document's text, called "absent keyphrases". This ability means that keyphrase generation models can associate a document with a notion that is not explicitly mentioned in its text. Intuitively, this suggests that for two documents treating the same subjects, a keyphrase generation model is more likely to be homogeneous in their indexing, i.e., to predict the same keyphrases for both documents regardless of whether those keyphrases appear in their respective texts, something a keyphrase extraction model would fail to do. Yet, homogeneity of keyphrase prediction models is not covered by current benchmarks. In this work, we introduce a method to evaluate the homogeneity of keyphrase prediction models and study whether absent keyphrase generation capabilities actually help the model to be more homogeneous. To our surprise, we show that keyphrase extraction methods are competitive with generative models, and that the ability to generate absent keyphrases can actually have a negative impact on homogeneity. Our data, code and prompts are available on Hugging Face and GitHub.
19. 【2602.12984】SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents
链接:https://arxiv.org/abs/2602.12984
作者:Yujiong Shen,Yajie Yang,Zhiheng Xi,Binze Hu,Huayu Sha,Jiazheng Zhang,Qiyuan Peng,Junlin Shang,Jixuan Huang,Yutao Fan,Jingqi Tong,Shihan Dou,Ming Zhang,Lei Bai,Zhenfei Yin,Tao Gui,Xingjun Ma,Qi Zhang,Xuanjing Huang,Yu-Gang Jiang
类目:Computation and Language (cs.CL)
关键词:reasoning inherently demands, inherently demands integrating, demands integrating sophisticated, integrating sophisticated toolkits, navigate domain-specific knowledge
备注:
点击查看摘要
Abstract:Scientific reasoning inherently demands integrating sophisticated toolkits to navigate domain-specific knowledge. Yet, current benchmarks largely overlook agents' ability to orchestrate tools for such rigorous workflows. To bridge this gap, we introduce SciAgentGym, a scalable interactive environment featuring 1,780 domain-specific tools across four natural science disciplines, supported by a robust execution infrastructure. Complementing this, we present SciAgentBench, a tiered evaluation suite designed to stress-test agentic capabilities from elementary actions to long-horizon workflows. Our evaluation identifies a critical bottleneck: state-of-the-art models struggle with complex scientific tool-use. Even for a leading model like GPT-5, success rates drop sharply from 60.6% to 30.9% as interaction horizons extend, primarily due to failures in multi-step workflow execution. To address this, we propose SciForge, a data synthesis method that models the tool action space as a dependency graph to generate logic-aware training trajectories. By fine-tuning on these trajectories, our SciAgent-8B outperforms the significantly larger Qwen3-VL-235B-Instruct while exhibiting positive cross-domain transfer of scientific tool-use capabilities. These results underscore the promising potential of next-generation autonomous scientific agents.
20. 【2602.12968】RGAlign-Rec: Ranking-Guided Alignment for Latent Query Reasoning in Recommendation Systems
链接:https://arxiv.org/abs/2602.12968
作者:Junhua Liu,Yang Jihao,Cheng Chang,Kunrong LI,Bin Fu,Kwan Hui Lim
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:modern e-commerce chatbots, chatbot Knowledge Base, critical capability, capability in modern, modern e-commerce
备注:
点击查看摘要
Abstract:Proactive intent prediction is a critical capability in modern e-commerce chatbots, enabling "zero-query" recommendations by anticipating user needs from behavioral and contextual signals. However, existing industrial systems face two fundamental challenges: (1) the semantic gap between discrete user features and the semantic intents within the chatbot's Knowledge Base, and (2) the objective misalignment between general-purpose LLM outputs and task-specific ranking utilities. To address these issues, we propose RGAlign-Rec, a closed-loop alignment framework that integrates an LLM-based semantic reasoner with a Query-Enhanced (QE) ranking model. We also introduce Ranking-Guided Alignment (RGA), a multi-stage training paradigm that utilizes downstream ranking signals as feedback to refine the LLM's latent reasoning. Extensive experiments on a large-scale industrial dataset from Shopee demonstrate that RGAlign-Rec achieves a 0.12% gain in GAUC, leading to a significant 3.52% relative reduction in error rate, and a 0.56% improvement in Recall@3. Online A/B testing further validates the cumulative effectiveness of our framework: the Query-Enhanced model (QE-Rec) initially yields a 0.98% improvement in CTR, while the subsequent Ranking-Guided Alignment stage contributes an additional 0.13% gain. These results indicate that ranking-aware alignment effectively synchronizes semantic reasoning with ranking objectives, significantly enhancing both prediction accuracy and service quality in real-world proactive recommendation systems.
21. 【2602.12966】ProbeLLM: Automating Principled Diagnosis of LLM Failures
链接:https://arxiv.org/abs/2602.12966
作者:Yue Huang,Zhengzhe Jiang,Yuchen Ma,Yu Jiang,Xiangqi Wang,Yujun Zhou,Yuexing Hao,Kehan Guo,Pin-Yu Chen,Stefan Feuerriegel,Xiangliang Zhang
类目:Computation and Language (cs.CL); Software Engineering (cs.SE)
关键词:large language models, models rapidly evolve, large language, central challenge, rapidly evolve
备注:
点击查看摘要
Abstract:Understanding how and why large language models (LLMs) fail is becoming a central challenge as models rapidly evolve and static evaluations fall behind. While automated probing has been enabled by dynamic test generation, existing approaches often discover isolated failure cases, lack principled control over exploration, and provide limited insight into the underlying structure of model weaknesses. We propose ProbeLLM, a benchmark-agnostic automated probing framework that elevates weakness discovery from individual failures to structured failure modes. ProbeLLM formulates probing as a hierarchical Monte Carlo Tree Search, explicitly allocating limited probing budgets between global exploration of new failure regions and local refinement of recurring error patterns. By restricting probing to verifiable test cases and leveraging tool-augmented generation and verification, ProbeLLM grounds failure discovery in reliable evidence. Discovered failures are further consolidated into interpretable failure modes via failure-aware embeddings and boundary-aware induction. Across diverse benchmarks and LLMs, ProbeLLM reveals substantially broader, cleaner, and more fine-grained failure landscapes than static benchmarks and prior automated methods, supporting a shift from case-centric evaluation toward principled weakness discovery.
22. 【2602.12937】Curriculum Learning and Pseudo-Labeling Improve the Generalization of Multi-Label Arabic Dialect Identification Models
链接:https://arxiv.org/abs/2602.12937
作者:Ali Mekky,Mohamed El Zeftawy,Lara Hassan,Amr Keleg,Preslav Nakov
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Arabic Dialect Identification, single-label classification task, classification task, multi-label classification task, Dialect Identification
备注: Accepted at the 12th Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), co-located with EACL 2026
点击查看摘要
Abstract:Although Arabic Dialect Identification (ADI) has long been modeled as a single-label classification task, recent work has argued that it should be framed as a multi-label classification task. However, ADI remains constrained by the availability of single-label datasets, with no large-scale multi-label resources available for training. By analyzing models trained on single-label ADI data, we show that the main difficulty in repurposing such datasets for Multi-Label Arabic Dialect Identification (MLADI) lies in the selection of negative samples, as many sentences treated as negative could be acceptable in multiple dialects. To address these issues, we construct a multi-label dataset by generating automatic multi-label annotations using GPT-4o and binary dialect acceptability classifiers, with aggregation guided by the Arabic Level of Dialectness (ALDi). Afterward, we train a BERT-based multi-label classifier using curriculum learning strategies aligned with dialectal complexity and label cardinality. On the MLADI leaderboard, our best-performing LAHJATBERT model achieves a macro F1 of 0.69, compared to 0.55 for the strongest previously reported system. Code and data are available at this https URL.
23. 【2602.12921】When Words Don't Mean What They Say: Figurative Understanding in Bengali Idioms
链接:https://arxiv.org/abs/2602.12921
作者:Adib Sakhawat,Shamim Ara Parveen,Md Ruhul Amin,Shamim Al Mahmud,Md Saiful Islam,Tahera Khatun
类目:Computation and Language (cs.CL)
关键词:Large Language Models, challenge for Large, Figurative language understanding, Large Language, language understanding remains
备注: 9 pages, 5 figures. Accepted for presentation at LREC 2026 (Language Resources and Evaluation Conference)
点击查看摘要
Abstract:Figurative language understanding remains a significant challenge for Large Language Models (LLMs), especially for low-resource languages. To address this, we introduce a new idiom dataset, a large-scale, culturally-grounded corpus of 10,361 Bengali idioms. Each idiom is annotated under a comprehensive 19-field schema, established and refined through a deliberative expert consensus process, that captures its semantic, syntactic, cultural, and religious dimensions, providing a rich, structured resource for computational linguistics. To establish a robust benchmark for Bangla figurative language understanding, we evaluate 30 state-of-the-art multilingual and instruction-tuned LLMs on the task of inferring figurative meaning. Our results reveal a critical performance gap, with no model surpassing 50% accuracy, a stark contrast to significantly higher human performance (83.4%). This underscores the limitations of existing models in cross-linguistic and cultural reasoning. By releasing the new idiom dataset and benchmark, we provide foundational infrastructure for advancing figurative language understanding and cultural grounding in LLMs for Bengali and other low-resource languages.
24. 【2602.12911】ViMedCSS: A Vietnamese Medical Code-Switching Speech Dataset Benchmark
链接:https://arxiv.org/abs/2602.12911
作者:Tung X. Nguyen,Nhu Vo,Giang-Son Nguyen,Duy Mai Hoang,Chien Dinh Huynh,Inigo Jauregi Unanue,Massimo Piccardi,Wray Buntine,Dung D. Le
类目:Computation and Language (cs.CL)
关键词:Automatic Speech Recognition, Vietnamese medical communication, Vietnamese, Automatic Speech
备注: Accepted at LREC 2026
点击查看摘要
Abstract:Code-switching (CS), in which Vietnamese speech incorporates English words such as drug names or procedures, is a common phenomenon in Vietnamese medical communication. This creates challenges for Automatic Speech Recognition (ASR) systems, especially in low-resource languages like Vietnamese. Most current ASR systems struggle to correctly recognize English medical terms within Vietnamese sentences, and no benchmark addresses this challenge. In this paper, we construct a 34-hour Vietnamese Medical Code-Switching Speech dataset (ViMedCSS) containing 16,576 utterances. Each utterance includes at least one English medical term drawn from a curated bilingual lexicon covering five medical topics. Using this dataset, we evaluate several state-of-the-art ASR models and examine different fine-tuning strategies for improving medical term recognition, in order to identify the best approach for this dataset. Experimental results show that Vietnamese-optimized models perform better on general segments, while multilingual pretraining helps capture English insertions. The combination of both approaches yields the best balance between overall and code-switched accuracy. This work provides the first benchmark for Vietnamese medical code-switching and offers insights into effective domain adaptation for low-resource, multilingual ASR systems.
25. 【2602.12892】RADAR: Revealing Asymmetric Development of Abilities in MLLM Pre-training
链接:https://arxiv.org/abs/2602.12892
作者:Yunshuang Nie,Bingqian Lin,Minzhe Niu,Kun Xiang,Jianhua Han,Guowei Huang,Xingyue Quan,Hang Xu,Bokui Chen,Xiaodan Liang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Multi-modal Large Language, Large Language Models, Large Language, solve complex tasks, Pre-trained Multi-modal Large
备注:
点击查看摘要
Abstract:Pre-trained Multi-modal Large Language Models (MLLMs) provide a knowledge-rich foundation for post-training by leveraging their inherent perception and reasoning capabilities to solve complex tasks. However, the lack of an efficient evaluation framework impedes the diagnosis of their performance bottlenecks. Current evaluation primarily relies on testing after supervised fine-tuning, which introduces laborious additional training and autoregressive decoding costs. Meanwhile, common pre-training metrics cannot quantify a model's perception and reasoning abilities in a disentangled manner. Furthermore, existing evaluation benchmarks are typically limited in scale or misaligned with pre-training objectives. Thus, we propose RADAR, an efficient ability-centric evaluation framework for Revealing Asymmetric Development of Abilities in MLLM pRe-training. RADAR involves two key components: (1) Soft Discrimination Score, a novel metric for robustly tracking ability development without fine-tuning, based on quantifying nuanced gradations of the model preference for the correct answer over distractors; and (2) Multi-Modal Mixture Benchmark, a new 15K+ sample benchmark for comprehensively evaluating pre-trained MLLMs' perception and reasoning abilities in a 0-shot manner, where we unify authoritative benchmark datasets and carefully collect new datasets, extending the evaluation scope and addressing the critical gaps in current benchmarks. With RADAR, we comprehensively reveal the asymmetric development of perceptual and reasoning capabilities in pretrained MLLMs across diverse factors, including data volume, model size, and pretraining strategy. Our RADAR underscores the need for a decomposed perspective on pre-training ability bottlenecks, informing targeted interventions to advance MLLMs efficiently. Our code is publicly available at this https URL.
26. 【2602.12889】BaziQA-Benchmark: Evaluating Symbolic and Temporally Compositional Reasoning in Large Language Models
链接:https://arxiv.org/abs/2602.12889
作者:Jiangxi Chen,Qian Liu
类目:Computation and Language (cs.CL)
关键词:Global Fortune-teller Competition, temporally compositional reasoning, temporally compositional, large language models, Fortune-teller Competition
备注:
点击查看摘要
Abstract:We present BaziQA-Benchmark, a standardized benchmark for evaluating symbolic and temporally compositional reasoning in large language models. The benchmark is derived from 200 professionally curated, multiple-choice problems from the Global Fortune-teller Competition (2021-2025), where each instance requires structured inference over a fixed symbolic chart and interacting temporal conditions. Unlike anecdotal or prompt-driven evaluations, BaziQA-Benchmark enables objective scoring and controlled comparison across years, domains, and model families. We evaluate contemporary language models under a multi-turn setting and analyze performance variation across temporal difficulty, reasoning domains, and inference [...]. To further probe reasoning behavior, we introduce a lightweight Structured Reasoning Protocol that constrains inference order without adding domain knowledge. Results show that models consistently outperform chance but remain far from saturation, exhibiting pronounced sensitivity to temporal composition and reasoning order, as well as systematic failures on precise temporal localization and multi-condition symbolic judgments.
27. 【2602.12881】Semantic Communities and Boundary-Spanning Lyrics in K-pop: A Graph-Based Unsupervised Analysis
链接:https://arxiv.org/abs/2602.12881
作者:Oktay Karakuş
类目:Social and Information Networks (cs.SI); Computation and Language (cs.CL)
关键词:present unique challenges, multilingual content, Large-scale lyric corpora, data-driven analysis, including the absence
备注:
点击查看摘要
Abstract:Large-scale lyric corpora present unique challenges for data-driven analysis, including the absence of reliable annotations, multilingual content, and high levels of stylistic repetition. Most existing approaches rely on supervised classification, genre labels, or coarse document-level representations, limiting their ability to uncover latent semantic structure. We present a graph-based framework for unsupervised discovery and evaluation of semantic communities in K-pop lyrics using line-level semantic representations. By constructing a similarity graph over lyric texts and applying community detection, we uncover stable micro-theme communities without genre, artist, or language supervision. We further identify boundary-spanning songs via graph-theoretic bridge metrics and analyse their structural properties. Across multiple robustness settings, boundary-spanning lyrics exhibit higher lexical diversity and lower repetition compared to core community members, challenging the assumption that hook intensity or repetition drives cross-theme connectivity. Our framework is language-agnostic and applicable to unlabeled cultural text corpora.
28. 【2602.12871】MentalBench: A Benchmark for Evaluating Psychiatric Diagnostic Capability of Large Language Models
链接:https://arxiv.org/abs/2602.12871
作者:Hoyun Song,Migyeong Kang,Jisu Shin,Jihyun Kim,Chanbi Park,Hangyeol Yoo,Jihyun An,Alice Oh,Jinyoung Han,KyungTae Lim
类目:Computation and Language (cs.CL)
关键词:large language models, language models, large language, evaluating psychiatric diagnostic, diagnostic
备注:
点击查看摘要
Abstract:We introduce MentalBench, a benchmark for evaluating psychiatric diagnostic decision-making in large language models (LLMs). Existing mental health benchmarks largely rely on social media data, limiting their ability to assess DSM-grounded diagnostic judgments. At the core of MentalBench is MentalKG, a psychiatrist-built and validated knowledge graph encoding DSM-5 diagnostic criteria and differential diagnostic rules for 23 psychiatric disorders. Using MentalKG as a golden-standard logical backbone, we generate 24,750 synthetic clinical cases that systematically vary in information completeness and diagnostic complexity, enabling low-noise and interpretable evaluation. Our experiments show that while state-of-the-art LLMs perform well on structured queries probing DSM-5 knowledge, they struggle to calibrate confidence in diagnostic decision-making when distinguishing between clinically overlapping disorders. These findings reveal evaluation gaps not captured by existing benchmarks.
29. 【2602.12818】AIWizards at MULTIPRIDE: A Hierarchical Approach to Slur Reclamation Detection
链接:https://arxiv.org/abs/2602.12818
作者:Luca Tedeschini,Matteo Fasulo
类目:Computation and Language (cs.CL)
关键词:in-group affirmations depending, Detecting reclaimed slurs, Detecting reclaimed, reclaimed slurs represents, represents a fundamental
备注:
点击查看摘要
Abstract:Detecting reclaimed slurs represents a fundamental challenge for hate speech detection systems, as the same lexical items can function either as abusive expressions or as in-group affirmations depending on social identity and context. In this work, we address Subtask B of the MultiPRIDE shared task at EVALITA 2026 by proposing a hierarchical approach to modeling the slur reclamation process. Our core assumption is that members of the LGBTQ+ community are more likely, on average, to employ certain slurs in a reclamatory manner. Based on this hypothesis, we decompose the task into two stages. First, using a weakly supervised LLM-based annotation, we assign fuzzy labels to users indicating the likelihood of belonging to the LGBTQ+ community, inferred from the tweet and the user bio. These soft labels are then used to train a BERT-like model to predict community membership, encouraging the model to learn latent representations associated with LGBTQ+ identity. In the second stage, we integrate this latent space with a newly initialized model for the downstream slur reclamation detection task. The intuition is that the first model encodes user-oriented sociolinguistic signals, which are then fused with representations learned by a model pretrained for hate speech detection. Experimental results on Italian and Spanish show that our approach achieves performance statistically comparable to a strong BERT-based baseline, while providing a modular and extensible framework for incorporating sociolinguistic context into hate speech modeling. We argue that more fine-grained hierarchical modeling of user identity and discourse context may further improve the detection of reclaimed language. We release our code at this https URL.
30. 【2602.12811】Left-right asymmetry in predicting brain activity from LLMs' representations emerges with their formal linguistic competence
链接:https://arxiv.org/abs/2602.12811
作者:Laurent Bonnasse-Gahot,Christophe Pallier
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
关键词:magnetic resonance imaging, functional magnetic resonance, brain activity measured, left-right asymmetry, large language models
备注:
点击查看摘要
Abstract:When humans and large language models (LLMs) process the same text, activations in the LLMs correlate with brain activity measured, e.g., with functional magnetic resonance imaging (fMRI). Moreover, it has been shown that, as the training of an LLM progresses, the performance in predicting brain activity from its internal activations improves more in the left hemisphere than in the right one. The aim of the present work is to understand which kind of competence acquired by the LLMs underlies the emergence of this left-right asymmetry. Using the OLMo-2 7B language model at various training checkpoints and fMRI data from English participants, we compare the evolution of the left-right asymmetry in brain scores alongside performance on several benchmarks. We observe that the asymmetry co-emerges with the formal linguistic abilities of the LLM. These abilities are demonstrated in two ways: by the model's capacity to assign a higher probability to an acceptable sentence than to a grammatically unacceptable one within a minimal contrasting pair, or its ability to produce well-formed text. By contrast, the left-right asymmetry does not correlate with the performance on arithmetic or Dyck language tasks, nor with text-based tasks involving world knowledge and reasoning. We generalize these results to another family of LLMs (Pythia) and another language, namely French. Our observations indicate that the left-right asymmetry in brain predictivity matches the progress in formal linguistic competence (knowledge of linguistic patterns).
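As a rough illustration of the minimal-pair criterion mentioned in this abstract, the sketch below counts a model as correct on a pair when it assigns a higher log-probability to the acceptable sentence than to the unacceptable one. The function name `minimal_pair_accuracy` and the toy log-probabilities are assumptions made for illustration; they are not taken from the paper.

```python
from typing import Sequence, Tuple


def minimal_pair_accuracy(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of pairs where the acceptable sentence scores higher.

    Each pair is (logp_acceptable, logp_unacceptable). Ties count as failures.
    """
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(1 for good, bad in pairs if good > bad)
    return correct / len(pairs)


# Made-up sentence log-probabilities (hypothetical values, not from the paper).
toy_pairs = [(-12.3, -15.8), (-20.1, -19.7), (-8.4, -11.0), (-30.2, -33.5)]
accuracy = minimal_pair_accuracy(toy_pairs)
print(accuracy)
```

Tracking a number like this at each training checkpoint would give one curve to set against the brain-score asymmetry, which is the kind of comparison the abstract describes.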
31. 【2602.12806】RAT-Bench: A Comprehensive Benchmark for Text Anonymization
链接:https://arxiv.org/abs/2602.12806
作者:Nataša Krčo,Zexi Yao,Matthieu Meeus,Yves-Alexandre de Montjoye
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
关键词:Large Language Models, query Large Language, Data containing personal, query Large, Anthropic PII purifier
备注:
点击查看摘要
Abstract:Data containing personal information is increasingly used to train, fine-tune, or query Large Language Models (LLMs). Text is typically scrubbed of identifying information prior to use, often with tools such as Microsoft's Presidio or Anthropic's PII purifier. These tools have traditionally been evaluated on their ability to remove specific identifiers (e.g., names), yet their effectiveness at preventing re-identification remains unclear. We introduce RAT-Bench, a comprehensive benchmark for text anonymization tools based on re-identification risk. Using U.S. demographic statistics, we generate synthetic text containing various direct and indirect identifiers across domains, languages, and difficulty levels. We evaluate a range of NER- and LLM-based text anonymization tools and, based on the attributes an LLM-based attacker is able to correctly infer from the anonymized text, we report the risk of re-identification in the U.S. population, while properly accounting for the disparate impact of identifiers. We find that, while capabilities vary widely, even the best tools are far from perfect in particular when direct identifiers are not written in standard ways and when indirect identifiers enable re-identification. Overall we find LLM-based anonymizers, including new iterative anonymizers, to provide a better privacy-utility trade-off albeit at a higher computational cost. Importantly, we also find them to work well across languages. We conclude with recommendations for future anonymization tools and will release the benchmark and encourage community efforts to expand it, in particular to other geographies.
32. 【2602.12778】Aspect-Based Sentiment Analysis for Future Tourism Experiences: A BERT-MoE Framework for Persian User Reviews
链接:https://arxiv.org/abs/2602.12778
作者:Hamidreza Kazemi Taskooh,Taha Zare Harofte
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Persian-language user reviews, Persian-language user, advances aspect-based sentiment, study advances aspect-based, addressing challenges
备注: 25 pages, 12 figures, 4 tables
点击查看摘要
Abstract:This study advances aspect-based sentiment analysis (ABSA) for Persian-language user reviews in the tourism domain, addressing challenges of low-resource languages. We propose a hybrid BERT-based model with Top-K routing and auxiliary losses to mitigate routing collapse and improve efficiency. The pipeline includes: (1) overall sentiment classification using BERT on 9,558 labeled reviews, (2) multi-label aspect extraction for six tourism-related aspects (host, price, location, amenities, cleanliness, connectivity), and (3) integrated ABSA with dynamic routing. The dataset consists of 58,473 preprocessed reviews from the Iranian accommodation platform Jabama, manually annotated for aspects and sentiments. The proposed model achieves a weighted F1-score of 90.6% for ABSA, outperforming baseline BERT (89.25%) and a standard hybrid approach (85.7%). Key efficiency gains include a 39% reduction in GPU power consumption compared to dense BERT, supporting sustainable AI deployment in alignment with UN SDGs 9 and 12. Analysis reveals high mention rates for cleanliness and amenities as critical aspects. This is the first ABSA study focused on Persian tourism reviews, and we release the annotated dataset to facilitate future multilingual NLP research in tourism.
33. 【2602.12759】Towards a Diagnostic and Predictive Evaluation Methodology for Sequence Labeling Tasks
链接:https://arxiv.org/abs/2602.12759
作者:Elena Alvarez-Mellado,Julio Gonzalo
类目:Computation and Language (cs.CL)
关键词:NLP typically, Standard evaluation, evaluation in NLP, Standard, improve performance
备注: Accepted at LREC 2026
点击查看摘要
Abstract:Standard evaluation in NLP typically indicates that system A is better on average than system B, but it provides little information on how to improve performance and, what is worse, it should not come as a surprise if B ends up being better than A on outside data. We propose an evaluation methodology for sequence labeling tasks grounded in error analysis that provides both quantitative and qualitative information on where systems must be improved and predicts how models will perform on a different distribution. The key is to create test sets that, contrary to common practice, do not rely on gathering large amounts of real-world in-distribution scraped data, but instead consist of a small, handcrafted set of linguistically motivated examples that exhaustively cover the range of span attributes (such as shape, length, casing, sentence position, etc.) a system may encounter in the wild. We demonstrate this methodology on a benchmark for anglicism identification in Spanish. Our methodology provides results that are diagnostic (because they help identify systematic weaknesses in performance), actionable (because they can inform which model is better suited for a given scenario) and predictive: our method predicts model performance on external datasets with a median correlation of 0.85.
34. 【2602.12746】Lamer-SSL: Layer-aware Mixture of LoRA Experts for Continual Multilingual Expansion of Self-supervised Models without Forgetting
链接:https://arxiv.org/abs/2602.12746
作者:Jing Xu,Minglin Wu,Xueyuan Chen,Xixin Wu,Helen Meng
类目:Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
关键词:forget previously acquired, struggle to generalize, tend to forget, continual training, previously acquired knowledge
备注: Accepted by ICASSP 2026
点击查看摘要
Abstract:Despite their impressive performance, self-supervised speech models often struggle to generalize to new languages and tend to forget previously acquired knowledge during continual training. To address this, we propose Lamer-SSL, a parameter-efficient framework that integrates a Layer-Aware MixturE of LoRA Experts (Lamer) module with a replay strategy. The Lamer module enables flexible balancing between shared and language-specific representations, while layer-aware expert allocation assigns more experts to deeper layers where semantic information is richer. Meanwhile, the replay strategy retains prior knowledge using minimal data, mitigating forgetting during continual training. Experiments on automatic speech recognition (ASR) and language identification (LID) demonstrate that Lamer-SSL extends self-supervised models to new languages effectively while maintaining strong performance on previously learned languages, with only 2.14% of parameters being trainable.
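The idea of giving deeper layers more experts can be sketched with a simple schedule. The linear growth rule, the function name `allocate_experts`, and the layer and expert counts below are assumptions for illustration only; the abstract does not specify the actual allocation used in Lamer-SSL.

```python
from typing import List


def allocate_experts(num_layers: int, min_experts: int, max_experts: int) -> List[int]:
    """Return an expert count per layer, growing linearly with depth.

    Layer 0 gets `min_experts` and the last layer gets `max_experts`.
    The linear schedule is an assumed example, not the paper's rule.
    """
    if num_layers < 1 or min_experts < 1 or max_experts < min_experts:
        raise ValueError("invalid configuration")
    if num_layers == 1:
        return [max_experts]
    span = max_experts - min_experts
    return [
        min_experts + round(span * layer / (num_layers - 1))
        for layer in range(num_layers)
    ]


# Hypothetical 6-layer encoder with between 1 and 4 LoRA experts per layer.
allocation = allocate_experts(num_layers=6, min_experts=1, max_experts=4)
print(allocation)
```

Any non-decreasing schedule would express the same design choice; the linear one is used here only because it is the simplest to read.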
35. 【2602.12735】VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph
链接:https://arxiv.org/abs/2602.12735
作者:Qiuchen Wang,Shihang Wang,Yu Zeng,Qiang Zhang,Fanrui Zhang,Zhuoning Guo,Bosi Zhang,Wenxuan Huang,Lin Chen,Zehui Chen,Pengjun Xie,Ruixue Ding
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:Effectively retrieving, Traditional Retrieval-augmented Generation, understanding multimodal information, multimodal information remains, agentic systems
备注:
点击查看摘要
Abstract:Effectively retrieving, reasoning, and understanding multimodal information remains a critical challenge for agentic systems. Traditional Retrieval-augmented Generation (RAG) methods rely on linear interaction histories, which struggle to handle long-context tasks, especially those involving information-sparse yet token-heavy visual data in iterative reasoning scenarios. To bridge this gap, we introduce VimRAG, a framework tailored for multimodal Retrieval-augmented Reasoning across text, images, and videos. Inspired by our systematic study, we model the reasoning process as a dynamic directed acyclic graph that structures the agent states and retrieved multimodal evidence. Building upon this structured memory, we introduce a Graph-Modulated Visual Memory Encoding mechanism, with which the significance of memory nodes is evaluated via their topological position, allowing the model to dynamically allocate high-resolution tokens to pivotal evidence while compressing or discarding trivial clues. To implement this paradigm, we propose a Graph-Guided Policy Optimization strategy. This strategy disentangles step-wise validity from trajectory-level rewards by pruning memory nodes associated with redundant actions, thereby facilitating fine-grained credit assignment. Extensive experiments demonstrate that VimRAG consistently achieves state-of-the-art performance on diverse multimodal RAG benchmarks. The code is available at this https URL.
36. 【2602.12709】ReFilter: Improving Robustness of Retrieval-Augmented Generation via Gated Filter
链接:https://arxiv.org/abs/2602.12709
作者:Yixin Chen,Ying Xiong,Shangyu Wu,Xiangrui Ke,Nan Guan,Chun Jason Xue
类目:Computation and Language (cs.CL)
关键词:knowledge-intensive question answering, Retrieval-augmented generation, large language models, grounding large language, question answering
备注:
点击查看摘要
Abstract:Retrieval-augmented generation (RAG) has become a dominant paradigm for grounding large language models (LLMs) with external evidence in knowledge-intensive question answering. A core design choice is how to fuse retrieved samples into the LLMs, where existing internal fusion approaches broadly fall into query-based fusion, parametric fusion, and latent-based fusion. Despite their effectiveness at modest retrieval scales, these methods often fail to scale gracefully as the number of retrieved candidates k increases: Larger k improves evidence coverage, yet realistic top-k retrieval inevitably contains irrelevant or redundant content and increases the inference cost. To address these limitations, we propose ReFilter, a novel latent-based fusion framework that performs token-level filtering and fusion. ReFilter consists of three key components: a context encoder for encoding context features, a gated filter for weighting each token, and a token fusion module for integrating the weighted token feature into the LLM's hidden states. Our experiments across four general-domain QA benchmarks show that ReFilter consistently achieves the best average performance under both in-domain adaptation and out-of-domain transfer. ReFilter further generalizes to five biomedical QA benchmarks in zero-shot transfer without domain fine-tuning, reaching 70.01% average accuracy with Qwen2.5-14B-Instruct.
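The gated filtering and fusion steps named in this abstract can be pictured with a toy sketch in plain Python. The sigmoid gate over a linear score, the gate-weighted average, and the additive fusion into a hidden state are all assumed simplifications, as are the names `gate_tokens` and `fuse`; the paper's actual architecture is not described in enough detail here to reproduce.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gate_tokens(tokens: List[Vector], gate_weights: Vector, gate_bias: float) -> List[float]:
    """Return one gate value in (0, 1) per token: sigmoid of a linear score."""
    return [
        sigmoid(sum(w * x for w, x in zip(gate_weights, tok)) + gate_bias)
        for tok in tokens
    ]


def fuse(hidden: Vector, tokens: List[Vector], gates: List[float]) -> Vector:
    """Add the gate-weighted average of the token features to a hidden state."""
    if not tokens:
        return list(hidden)
    total = sum(gates)  # strictly positive, since every sigmoid output is > 0
    pooled = [
        sum(g * tok[d] for g, tok in zip(gates, tokens)) / total
        for d in range(len(hidden))
    ]
    return [h + p for h, p in zip(hidden, pooled)]


# Two hypothetical retrieved-token features: one scored as relevant, one not.
tokens = [[2.0, 0.0], [-2.0, 0.0]]
gates = gate_tokens(tokens, gate_weights=[3.0, 0.0], gate_bias=0.0)
fused = fuse([0.0, 0.0], tokens, gates)
print(gates, fused)
```

In this toy setup the low-gate token contributes almost nothing to the fused state, which is the behaviour one would want as k grows and more irrelevant content is retrieved.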
37. 【2602.12705】MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs
Link: https://arxiv.org/abs/2602.12705
Authors: Baorong Shi, Bo Cui, Boyuan Jiang, Deli Yu, Fang Qian, Haihua Yang, Huichao Wang, Jiale Chen, Jianfei Pan, Jieqiong Cao, Jinghao Lin, Kai Wu, Lin Yang, Shengsheng Yao, Tao Chen, Xiaojun Xiao, Xiaozhong Ji, Xu Wang, Yijun He, Zhixiong Yang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Keywords: vision-language foundation model, foundation model designed, real-world clinical applications, medical vision-language foundation, advance general-purpose medical
Abstract:We present MedXIAOHE, a medical vision-language foundation model designed to advance general-purpose medical understanding and reasoning in real-world clinical applications. MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities. To achieve this, we propose an entity-aware continual pretraining framework that organizes heterogeneous medical corpora to broaden knowledge coverage and reduce long-tail gaps (e.g., rare diseases). For medical expert-level reasoning and interaction, MedXIAOHE incorporates diverse medical reasoning patterns via reinforcement learning and tool-augmented agentic training, enabling multi-step diagnostic reasoning with verifiable decision traces. To improve reliability in real-world use, MedXIAOHE integrates user-preference rubrics, evidence-grounded reasoning, and low-hallucination long-form report generation, with improved adherence to medical instructions. We release this report to document our practical design choices, scaling insights, and evaluation framework, hoping to inspire further research.
38. 【2602.12674】$\mathcal{X}$-KD: General Experiential Knowledge Distillation for Large Language Models
Link: https://arxiv.org/abs/2602.12674
Authors: Yuang Cai, Yuyu Yuan
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, Experiential Knowledge Distillation, Language Models, Knowledge Distillation
Abstract:Knowledge Distillation (KD) for Large Language Models (LLMs) has become increasingly important as models grow in size and complexity. While existing distillation approaches focus on imitating teacher behavior, they often overlook the original learning environment that shaped the teacher's knowledge. Inspired by the experiential learning theory and inverse reinforcement learning, we propose Experiential Knowledge Distillation ($\mathcal{X}$-KD), a novel and general framework that enables student models to learn in the teacher's original learning environment. $\mathcal{X}$-KD adopts the Approximated Variational Reward Imitation Learning (AVRIL) framework to jointly model the teacher's original reward function and perform policy distillation, encouraging consistency between the student policy and the original reward function. Our derivation demonstrates that $\mathcal{X}$-KD follows the supervised learning framework and applies to both sequence-level and divergence-based distillation methods, underlining the simplicity and flexibility of our approach. Empirical results show that $\mathcal{X}$-KD outperforms the generalized KD and MiniLLM baselines on abstractive summarization, machine translation, and arithmetic reasoning tasks. Additionally, $\mathcal{X}$-KD achieves better performance-diversity trade-off and data efficiency than baseline KD approaches.
39. 【2602.12662】Think Fast and Slow: Step-Level Cognitive Depth Adaptation for LLM Agents
Link: https://arxiv.org/abs/2602.12662
Authors: Ruihan Yang, Fanghua Ye, Xiang We, Ruoqing Zhao, Kang Luo, Xinbo Xu, Bo Zhao, Ruotian Ma, Shanyi Wang, Zhaopeng Tu, Xiaolong Li, Deqing Yang, Linus
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large language models, Large language, multi-turn decision-making tasks, increasingly deployed, deployed as autonomous
Abstract:Large language models (LLMs) are increasingly deployed as autonomous agents for multi-turn decision-making tasks. However, current agents typically rely on fixed cognitive patterns: non-thinking models generate immediate responses, while thinking models engage in deep reasoning uniformly. This rigidity is inefficient for long-horizon tasks, where cognitive demands vary significantly from step to step, with some requiring strategic planning and others only routine execution. In this paper, we introduce CogRouter, a framework that trains agents to dynamically adapt cognitive depth at each step. Grounded in ACT-R theory, we design four hierarchical cognitive levels ranging from instinctive responses to strategic planning. Our two-stage training approach includes Cognition-aware Supervised Fine-tuning (CoSFT) to instill stable level-specific patterns, and Cognition-aware Policy Optimization (CoPO) for step-level credit assignment via confidence-aware advantage reweighting. The key insight is that appropriate cognitive depth should maximize the confidence of the resulting action. Experiments on ALFWorld and ScienceWorld demonstrate that CogRouter achieves state-of-the-art performance with superior efficiency. With Qwen2.5-7B, it reaches an 82.3% success rate, outperforming GPT-4o (+40.3%), OpenAI-o3 (+18.3%), and GRPO (+14.0%), while using 62% fewer tokens.
40. 【2602.12660】Learning Ordinal Probabilistic Reward from Preferences
Link: https://arxiv.org/abs/2602.12660
Authors: Longze Chen, Lu Wang, Renke Shan, Ze Gong, Run Luo, Jiaming Li, Jing Luo, Qiyao Wang, Min Yang
Categories: Computation and Language (cs.CL)
Keywords: aligning large language, large language models, Probabilistic Reward Model, crucial for aligning, aligning large
Comments: 28 pages, 5 figures, ICLR 2026
Abstract:Reward models are crucial for aligning large language models (LLMs) with human values and intentions. Existing approaches follow either Generative (GRMs) or Discriminative (DRMs) paradigms, yet both suffer from limitations: GRMs typically demand costly point-wise supervision, while DRMs produce uncalibrated relative scores that lack probabilistic interpretation. To address these challenges, we introduce a novel reward modeling paradigm: Probabilistic Reward Model (PRM). Instead of modeling reward as a deterministic scalar, our approach treats it as a random variable, learning a full probability distribution for the quality of each response. To make this paradigm practical, we present its closed-form, discrete realization: the Ordinal Probabilistic Reward Model (OPRM), which discretizes the quality score into a finite set of ordinal ratings. Building on OPRM, we propose a data-efficient training strategy called Region Flooding Tuning (RgFT). It enables rewards to better reflect absolute text quality by incorporating quality-level annotations, which guide the model to concentrate the probability mass within corresponding rating sub-regions. Experiments on various reward model benchmarks show that our method improves accuracy by 2.9% to 7.4% compared to prior reward models, demonstrating strong performance and data efficiency. Analysis of the score distribution provides evidence that our method captures not only relative rankings but also absolute quality.
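To make the idea of a reward that is a distribution over ordinal ratings more concrete, here is a minimal sketch. It assumes ratings 1..K, a softmax over K logits, and independence between the two responses when comparing them; the function names and the tie-splitting rule are illustrative assumptions, not the paper's formulation or training objective.

```python
import math


def rating_distribution(logits):
    """Turn K raw scores into a probability distribution over ratings 1..K."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_rating(probs):
    """Mean rating under the distribution (ratings are 1-based)."""
    return sum((i + 1) * p for i, p in enumerate(probs))


def prob_preferred(probs_a, probs_b):
    """P(rating_a > rating_b) + 0.5 * P(tie), assuming the two ratings
    are independent (an assumption made for this sketch)."""
    win = 0.0
    tie = 0.0
    for i, pa in enumerate(probs_a):
        for j, pb in enumerate(probs_b):
            if i > j:
                win += pa * pb
            elif i == j:
                tie += pa * pb
    return win + 0.5 * tie
```

Unlike a single scalar score, the distribution carries both an absolute quality estimate (the expected rating) and a calibrated pairwise comparison.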
41. 【2602.12642】Beyond Normalization: Rethinking the Partition Function as a Difficulty Scheduler for RLVR
Link: https://arxiv.org/abs/2602.12642
Authors: Dohyung Kim, Minbeom Kim, Jeonghye Kim, Sangmook Lee, Sojeong Rhee, Kyomin Jung
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Reward-maximizing RL methods, diversity among outputs, methods enhance, enhance the reasoning, reduce the diversity
Abstract:Reward-maximizing RL methods enhance the reasoning performance of LLMs, but often reduce the diversity among outputs. Recent works address this issue by adopting GFlowNets, training LLMs to match a target distribution while jointly learning its partition function. In contrast to prior works that treat this partition function solely as a normalizer, we reinterpret it as a per-prompt expected-reward (i.e., online accuracy) signal, leveraging this unused information to improve sample efficiency. Specifically, we first establish a theoretical relationship between the partition function and per-prompt accuracy estimates. Building on this key insight, we propose Partition Function-Guided RL (PACED-RL), a post-training framework that leverages accuracy estimates to prioritize informative question prompts during training, and further improves sample efficiency through an accuracy estimate error-prioritized replay. Crucially, both components reuse information already produced during GFlowNet training, effectively amortizing the compute overhead into the existing optimization process. Extensive experiments across diverse benchmarks demonstrate strong performance improvements over GRPO and prior GFlowNet approaches, highlighting PACED-RL as a promising direction for a more sample efficient distribution-matching training for LLMs.
42. 【2602.12639】CLASE: A Hybrid Method for Chinese Legalese Stylistic Evaluation
Link: https://arxiv.org/abs/2602.12639
Authors: Yiran Rex Ma, Yuxiao Ye, Huiyuan Xie
Categories: Computation and Language (cs.CL)
Keywords: large language models, reasonable factual accuracy, achieve reasonable factual, language models, Legal
Comments: Accepted at LREC 2026
Abstract:Legal text generated by large language models (LLMs) can usually achieve reasonable factual accuracy, but it frequently fails to adhere to the specialised stylistic norms and linguistic conventions of legal writing. In order to improve stylistic quality, a crucial first step is to establish a reliable evaluation method. However, having legal experts manually develop such a metric is impractical, as the implicit stylistic requirements in legal writing practice are difficult to formalise into explicit rubrics. Meanwhile, existing automatic evaluation methods also fall short: reference-based metrics conflate semantic accuracy with stylistic fidelity, and LLM-as-a-judge evaluations suffer from opacity and inconsistency. To address these challenges, we introduce CLASE (Chinese LegAlese Stylistic Evaluation), a hybrid evaluation method that focuses on the stylistic performance of legal text. The method incorporates a hybrid scoring mechanism that combines 1) linguistic feature-based scores and 2) experience-guided LLM-as-a-judge scores. Both the feature coefficients and the LLM scoring experiences are learned from contrastive pairs of authentic legal documents and their LLM-restored counterparts. This hybrid design captures both surface-level features and implicit stylistic norms in a transparent, reference-free manner. Experiments on 200 Chinese legal documents show that CLASE achieves substantially higher alignment with human judgments than traditional metrics and pure LLM-as-a-judge methods. Beyond improved alignment, CLASE provides interpretable score breakdowns and suggestions for improvements, offering a scalable and practical solution for professional stylistic evaluation in legal text generation (Code and data for CLASE is available at: this https URL).
43. 【2602.12635】Unleashing Low-Bit Inference on Ascend NPUs: A Comprehensive Evaluation of HiFloat Formats
Link: https://arxiv.org/abs/2602.12635
Authors: Pengxiang Zhao, Hui-Ling Zhen, Xing Li, Han Bao, Weizhe Lin, Zhiyuan Yang, Ziwei Yu, Xin Wang, Mingxuan Yuan, Xianzhi Yu, Zhenhua Dong
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: low-bit floating-point formats, offer new opportunities, precision and efficiency, opportunities for precision, LLMs scale
Abstract:As LLMs scale, low-bit floating-point formats like MXFP and NVFP4 offer new opportunities for precision and efficiency. In this work, we evaluate HiFloat (HiF8 and HiF4), a family of formats tailored for Ascend NPUs. Through rigorous comparison across weight-activation and KV-cache tasks, we provide three key insights: (1) INT8 suits narrow-range data, while floating-point formats excel with high-variance data; (2) in 4-bit regimes, HiF4's hierarchical scaling prevents the accuracy collapse seen in integer formats; and (3) HiFloat is fully compatible with state-of-the-art post-training quantization frameworks. Overall, HiFloat provides a solution for high-efficiency LLM inference on NPUs.
44. 【2602.12618】Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models
Link: https://arxiv.org/abs/2602.12618
Authors: Omer Faruk Deniz, Ruiyu Mao, Ruochen Li, Yapeng Tian, Latifur Khan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, incur significant computational
Comments: 2025 IEEE International Conference on Big Data (BigData)
Abstract:Multimodal Large Language Models (MLLMs) incur significant computational cost from processing numerous vision tokens through all LLM layers. Prior pruning methods operate either before the LLM, limiting generality due to diverse encoder-projector designs or within the LLM using heuristics that are incompatible with FlashAttention. We take a different approach: rather than identifying unimportant tokens, we treat the LLM itself as the optimal guide for compression. Observing that deeper layers naturally transmit vision-to-text information, we introduce Attention-Driven Self-Compression (ADSC), a simple, broadly applicable method that progressively reduces vision tokens using only the LLM's attention mechanism. Our method applies uniform token downsampling at selected layers, forming bottlenecks that encourage the model to reorganize and compress information into the remaining tokens. It requires no score computation, auxiliary modules, or attention modification, and remains fully compatible with FlashAttention. Applied to LLaVA-1.5, ADSC reduces FLOPs by 53.7% and peak KV-cache memory by 56.7%, while preserving 98.2% of the original model performance. Across multiple benchmarks, it outperforms prior pruning approaches in both efficiency and accuracy. Crucially, under high compression ratios, our method remains robust while heuristic-based techniques degrade sharply.
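The uniform downsampling step mentioned in this abstract can be pictured with a small sketch. It keeps every `stride`-th vision token at a few chosen layers and leaves text tokens untouched, with no importance scores involved. The function names, the stride of 2, and the layer indices are illustrative assumptions rather than details taken from the paper, and the transformer layer computation itself is omitted.

```python
def downsample_vision_tokens(tokens, is_vision, stride=2):
    """Keep every `stride`-th vision token and every text token.

    tokens    -- list of token items (any type)
    is_vision -- list of bools, True where the token is a vision token
    """
    new_tokens, new_flags = [], []
    seen_vision = 0
    for tok, vis in zip(tokens, is_vision):
        if not vis:
            new_tokens.append(tok)
            new_flags.append(False)
            continue
        # Selection depends only on position, never on a learned score.
        if seen_vision % stride == 0:
            new_tokens.append(tok)
            new_flags.append(True)
        seen_vision += 1
    return new_tokens, new_flags


def run_layers(tokens, is_vision, num_layers, compress_at, stride=2):
    """Walk through the layers, downsampling before each layer in
    `compress_at`. Returns the surviving tokens and the number of
    vision tokens seen by each layer."""
    counts = []
    for layer in range(num_layers):
        if layer in compress_at:
            tokens, is_vision = downsample_vision_tokens(tokens, is_vision, stride)
        counts.append(sum(is_vision))
    return tokens, counts
```

With 8 vision tokens, 4 layers, and compression at layers 1 and 3, the per-layer vision token counts are 8, 4, 4, 2.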
45. 【2602.12601】HyperMLP: An Integrated Perspective for Sequence Modeling
Link: https://arxiv.org/abs/2602.12601
Authors: Jiecheng Lu, Shihao Yang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Keywords: probabilistic query-key lookup, preserve normalized attention, fixed positional semantics, normalized attention scores, query-key lookup
Abstract:Self-attention is often viewed as probabilistic query-key lookup, motivating designs that preserve normalized attention scores and fixed positional semantics. We advocate a simpler and more unified perspective: an autoregressive attention head can be viewed as a dynamic two-layer MLP whose weights are instantiated from the context history. From this view, attention scores form an ever-growing hidden representation, and standard MLP activations such as ReLU or GLU naturally implement input-conditioned selection over a context-dependent memory pool rather than a probability distribution. Based on this formulation, we introduce HyperMLP and HyperGLU, which learn dynamic mixing in both feature space and sequence space, using a reverse-offset (lag) layout to align temporal mixing with autoregressive semantics. We provide theoretical characterizations of the expressivity and implications of this structure, and empirically show that HyperMLP/HyperGLU consistently outperform strong softmax-attention baselines under matched parameter budgets.
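The "attention head as a dynamic two-layer MLP" view in this abstract can be checked on a toy case. In the sketch below, the keys act as first-layer weights (one hidden unit per context token), the values act as second-layer weights, and the activation is swappable. This illustrates the general observation only; it is not the HyperMLP or HyperGLU architecture, and the usual 1/sqrt(d) scaling is left out for brevity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def relu(xs):
    return [max(0.0, x) for x in xs]


def attention_head(q, keys, values, activation):
    """Output for the newest position, written as a two-layer MLP.

    Layer 1: hidden = activation(K q), so the hidden width equals the
             context length and grows as the history grows.
    Layer 2: output = V^T hidden.
    """
    hidden = activation([dot(k, q) for k in keys])
    dim = len(values[0])
    return [sum(h * v[j] for h, v in zip(hidden, values)) for j in range(dim)]
```

With `softmax` the hidden layer is a probability distribution, which recovers standard attention; with `relu` it simply switches context entries on or off, which is the selection-over-a-memory-pool reading described in the abstract.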
46. 【2602.12575】Discovering Semantic Latent Structures in Psychological Scales: A Response-Free Pathway to Efficient Simplification
Link: https://arxiv.org/abs/2602.12575
Authors: Bo Wang, Yuxuan Zhang, Yueqin Hu, Hanchao Hou, Kaiping Peng, Shiguang Ni
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Psychological scale refinement, refinement traditionally relies, item response theory, optimize item composition, scale refinement traditionally
Comments: 78 pages, 20 figures
Abstract:Psychological scale refinement traditionally relies on response-based methods such as factor analysis, item response theory, and network psychometrics to optimize item composition. Although rigorous, these approaches require large samples and may be constrained by data availability and cross-cultural comparability. Recent advances in natural language processing suggest that the semantic structure of questionnaire items may encode latent construct organization, offering a complementary response-free perspective. We introduce a topic-modeling framework that operationalizes semantic latent structure for scale simplification. Items are encoded using contextual sentence embeddings and grouped via density-based clustering to discover latent semantic factors without predefining their number. Class-based term weighting derives interpretable topic representations that approximate constructs and enable merging of semantically adjacent clusters. Representative items are selected using membership criteria within an integrated reduction pipeline. We benchmarked the framework across DASS, IPIP, and EPOCH, evaluating structural recovery, internal consistency, factor congruence, correlation preservation, and reduction efficiency. The proposed method recovered coherent factor-like groupings aligned with established constructs. Selected items reduced scale length by 60.5% on average while maintaining psychometric adequacy. Simplified scales showed high concordance with original factor structures and preserved inter-factor correlations, indicating that semantic latent organization provides a response-free approximation of measurement structure. Our framework formalizes semantic structure as an inspectable front-end for scale construction and reduction. To facilitate adoption, we provide a visualization-supported tool enabling one-click semantic analysis and structured simplification.
47. 【2602.12528】DiffuRank: Effective Document Reranking with Diffusion Language Models
Link: https://arxiv.org/abs/2602.12528
Authors: Qi Liu, Kun Ai, Jiaxin Mao, Yanzhao Zhang, Mingxin Li, Dingkun Long, Pengjun Xie, Fengbin Zhu, Ji-Rong Wen
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Keywords: Recent advances, advances in large, dLLMs, large language models, Recent
Comments: The code is available at [this https URL](https://github.com/liuqi6777/DiffusionRank)
Abstract:Recent advances in large language models (LLMs) have inspired new paradigms for document reranking. While this paradigm better exploits the reasoning and contextual understanding capabilities of LLMs, most existing LLM-based rerankers rely on autoregressive generation, which limits their efficiency and flexibility. In particular, token-by-token decoding incurs high latency, while the fixed left-to-right generation order causes early prediction errors to propagate and is difficult to revise. To address these limitations, we explore the use of diffusion language models (dLLMs) for document reranking and propose DiffuRank, a reranking framework built upon dLLMs. Unlike autoregressive models, dLLMs support more flexible decoding and generation processes that are not constrained to a left-to-right order, and enable parallel decoding, which may lead to improved efficiency and controllability. Specifically, we investigate three reranking strategies based on dLLMs: (1) a pointwise approach that uses dLLMs to estimate the relevance of each query-document pair; (2) a logit-based listwise approach that prompts dLLMs to jointly assess the relevance of multiple documents and derives ranking lists directly from model logits; and (3) a permutation-based listwise approach that adapts the canonical decoding process of dLLMs to the reranking tasks. For each approach, we design corresponding training methods to fully exploit the advantages of dLLMs. We evaluate both zero-shot and fine-tuned reranking performance on multiple benchmarks. Experimental results show that dLLMs achieve performance comparable to, and in some cases exceeding, that of autoregressive LLMs with similar model sizes. These findings demonstrate the promise of diffusion-based language models as a compelling alternative to autoregressive architectures for document reranking.
48. 【2602.12526】Constraint-Rectified Training for Efficient Chain-of-Thought
Link: https://arxiv.org/abs/2602.12526
Authors: Qinhang Wu, Sen Lin, Ming Zhang, Yingbin Liang, Ness B. Shroff
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: Large Language Models, capabilities of Large, Language Models, Large Language, reinforcement learning
Abstract:Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs), especially when combined with reinforcement learning (RL) based post-training methods. While longer reasoning traces can improve answer quality and unlock abilities such as self-correction, they also incur high inference costs and often introduce redundant steps, known as overthinking. Recent research seeks to develop efficient reasoning strategies that balance reasoning length and accuracy, either through length-aware reward design or prompt-based calibration. However, these heuristic-based approaches may suffer from severe accuracy drop and be very sensitive to hyperparameters. To address these problems, we introduce CRT (Constraint-Rectified Training), a principled post-training framework based on reference-guarded constrained optimization, yielding a more stable and interpretable formulation for efficient reasoning. CRT alternates between minimizing reasoning length and rectifying accuracy only when performance falls below the reference, enabling stable and effective pruning of redundant reasoning. We further extend CRT with a two-stage training scheme that first discovers the shortest reliable reasoning patterns and then refines accuracy under a learnt length budget, preventing the re-emergence of verbose CoT. Our comprehensive evaluation shows that this framework consistently reduces token usage while maintaining answer quality at a robust and reliable level. Further analysis reveals that CRT improves reasoning efficiency not only by shortening responses but also by reducing internal language redundancy, leading to a new evaluation metric. Moreover, CRT-based training naturally yields a sequence of intermediate checkpoints that span a spectrum of explanation lengths while preserving correctness, enabling fine-grained control over reasoning verbosity without retraining.
49. 【2602.12445】RBCorr: Response Bias Correction in Language Models
Link: https://arxiv.org/abs/2602.12445
Authors: Om Bhatt, Anna A. Ivanova
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: option preference biases, preference biases, present as option, option preference, response bias correction
Comments: 12 pages (8 pages main text), 4 figures
Abstract:Language models (LMs) are known to be prone to response biases, which present as option preference biases in fixed-response questions. It is therefore imperative to develop low-cost and effective response bias correction methods to improve LM performance and enable more accurate evaluations of model abilities. Here, we propose a simple response bias correction strategy ($\texttt{RBCorr}$) and test it on 12 open-weight language models using yes-no, entailment, and multiple choice questions. We show that response bias is prevalent in LMs pre-correction and that $\texttt{RBCorr}$ effectively eliminates bias and boosts model performance. We also explore the generalizability of bias behavior across models, datasets, and prompt formats, showing that LogProbs-based correction is highly dependent on all three of these aspects. Overall, $\texttt{RBCorr}$ is an easy-to-use method that can boost the performance of smaller LMs and ensure that LM performance on closed-response benchmarks aligns more closely with their true capabilities.
50. 【2602.12424】RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty
Link: https://arxiv.org/abs/2602.12424
Authors: Ziqian Zhang, Xingjian Hu, Yue Huang, Kai Zhang, Ruoxi Chen, Yixin Liu, Qingsong Wen, Kaidi Xu, Xiangliang Zhang, Neil Zhenqiang Gong, Lichao Sun
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: facilitating objective comparisons, large language models, facilitating objective, Benchmarks establish, establish a standardized
Comments: 32 pages, 9 figures. Accepted by ICLR 2026
Abstract:Benchmarks establish a standardized evaluation framework to systematically assess the performance of large language models (LLMs), facilitating objective comparisons and driving advancements in the field. However, existing benchmarks fail to differentiate question difficulty, limiting their ability to effectively distinguish models' capabilities. To address this limitation, we propose RankLLM, a novel framework designed to quantify both question difficulty and model competency. RankLLM introduces difficulty as the primary criterion for differentiation, enabling a more fine-grained evaluation of LLM capabilities. RankLLM's core mechanism facilitates bidirectional score propagation between models and questions. The core intuition of RankLLM is that a model earns a competency score when it correctly answers a question, while a question's difficulty score increases when it challenges a model. Using this framework, we evaluate 30 models on 35,550 questions across multiple domains. RankLLM achieves 90% agreement with human judgments and consistently outperforms strong baselines such as IRT. It also exhibits strong stability, fast convergence, and high computational efficiency, making it a practical solution for large-scale, difficulty-aware LLM evaluation.
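The bidirectional propagation intuition in this abstract (models earn competency from the questions they solve, questions earn difficulty from the models they defeat) resembles a mutual-reinforcement iteration. The sketch below is one plausible way to write such a loop; the update rules, the normalization, and the PageRank-style `damping` term are assumptions made for illustration, not RankLLM's actual algorithm.

```python
def _rescale(xs, damping):
    """Normalize to sum 1, then mix with a uniform prior so no score
    collapses to exactly zero (damping is an assumption of this sketch)."""
    n = len(xs)
    total = sum(xs)
    if total == 0:
        return [1.0 / n] * n
    return [(1 - damping) / n + damping * x / total for x in xs]


def propagate_scores(correct, iterations=50, damping=0.85):
    """correct[m][q] is True if model m answered question q correctly.

    Returns (competency per model, difficulty per question).
    """
    n_models = len(correct)
    n_questions = len(correct[0])
    competency = [1.0 / n_models] * n_models
    difficulty = [1.0 / n_questions] * n_questions
    for _ in range(iterations):
        # A question gains difficulty from each model that fails it,
        # weighted by how competent that model currently looks.
        difficulty = _rescale(
            [sum(competency[m] for m in range(n_models) if not correct[m][q])
             for q in range(n_questions)],
            damping,
        )
        # A model gains competency from each question it solves,
        # weighted by how difficult that question currently looks.
        competency = _rescale(
            [sum(difficulty[q] for q in range(n_questions) if correct[m][q])
             for m in range(n_models)],
            damping,
        )
    return competency, difficulty
```

The point of the weighting is that solving a question nobody else can solve counts for more than solving one that every model gets right, which a plain accuracy average cannot express.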
51. 【2602.12418】Sparse Autoencoders are Capable LLM Jailbreak Mitigators
Link: https://arxiv.org/abs/2602.12418
Authors: Yannick Assogba, Jacopo Cortellazzi, Javier Abad, Pau Rodriguez, Xavier Suau, Arno Blaas
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: language model safety, large language model, remain a persistent, persistent threat, threat to large
Comments: 26 pages, 14 figures, 3 tables
Abstract:Jailbreak attacks remain a persistent threat to large language model safety. We propose Context-Conditioned Delta Steering (CC-Delta), an SAE-based defense that identifies jailbreak-relevant sparse features by comparing token-level representations of the same harmful request with and without jailbreak context. Using paired harmful/jailbreak prompts, CC-Delta selects features via statistical testing and applies inference-time mean-shift steering in SAE latent space. Across four aligned instruction-tuned models and twelve jailbreak attacks, CC-Delta achieves comparable or better safety-utility tradeoffs than baseline defenses operating in dense latent space. In particular, our method clearly outperforms dense mean-shift steering on all four models, and particularly against out-of-distribution attacks, showing that steering in sparse SAE feature space offers advantages over steering in dense activation space for jailbreak mitigation. Our results suggest off-the-shelf SAEs trained for interpretability can be repurposed as practical jailbreak defenses without task-specific training.
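The two steps named in this abstract, selecting features from paired prompts and then applying a mean shift at inference time, can be sketched roughly as follows. The latent vectors are plain lists standing in for SAE activations, the paired z-like statistic is a crude stand-in for whatever statistical test the paper uses, and the `threshold` and `strength` parameters are hypothetical.

```python
from statistics import mean, pstdev


def select_and_delta(plain, jailbreak, threshold=2.0):
    """plain[i] and jailbreak[i] are latent vectors for the same harmful
    request without and with jailbreak context.

    Returns {feature_index: mean_shift} for features whose paired
    difference is large relative to its spread.
    """
    n_features = len(plain[0])
    deltas = {}
    for f in range(n_features):
        diffs = [j[f] - p[f] for p, j in zip(plain, jailbreak)]
        mu = mean(diffs)
        sd = pstdev(diffs)
        if sd > 0:
            stat = abs(mu) / (sd / len(diffs) ** 0.5)
        else:
            stat = float("inf") if mu != 0 else 0.0
        if stat >= threshold:
            deltas[f] = mu
    return deltas


def steer(latent, deltas, strength=1.0):
    """Shift only the selected features back toward the no-jailbreak mean."""
    steered = list(latent)
    for f, d in deltas.items():
        steered[f] -= strength * d
    return steered
```

Because only the selected sparse features are shifted, the rest of the representation is left untouched, which is the intuition behind steering in sparse rather than dense space.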
52. 【2602.12414】propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale
Link: https://arxiv.org/abs/2602.12414
Authors: Maximilian Idahl, Benedikt Droste, Björn Plüster, Jan Philipp Harries
Categories: Computation and Language (cs.CL)
Keywords: single scalar quality, scalar quality scores, quality scores produced, predominantly relied, single score conflates
Comments: Release: [this https URL](https://hf.co/collections/ellamind/propella-1)
Abstract:Since FineWeb-Edu, data curation for LLM pretraining has predominantly relied on single scalar quality scores produced by small classifiers. A single score conflates multiple quality dimensions, prevents flexible filtering, and offers no interpretability. We introduce propella-1, a family of small multilingual LLMs (0.6B, 1.7B, 4B parameters) that annotate text documents across 18 properties organized into six categories: core content, classification, quality and value, audience and purpose, safety and compliance, and geographic relevance. The models support 57 languages and produce structured JSON annotations conforming to a predefined schema. Evaluated against a frontier commercial LLM as a reference annotator, the 4B model achieves higher agreement than much larger general-purpose models. We release propella-annotations, a dataset of over three billion document annotations covering major pretraining corpora including data from FineWeb-2, FinePDFs, HPLT 3.0, and Nemotron-CC. Using these annotations, we present a multi-dimensional compositional analysis of widely used pretraining datasets, revealing substantial differences in quality, reasoning depth, and content composition that single-score approaches cannot capture. All model weights and annotations are released under permissive, commercial-use licenses.
53. 【2602.12389】Evolving Beyond Snapshots: Harmonizing Structure and Sequence via Entity State Tuning for Temporal Knowledge Graph Forecasting
Link: https://arxiv.org/abs/2602.12389
Authors: Siyuan Li, Yunjia Wu, Yiyong Xiao, Pingyang Huang, Peize Li, Ruitong Liu, Yan Wen, Te Sun, Fangyi Pei
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: requires predicting future, predicting future facts, Temporal knowledge graph, jointly modeling structural, forecasting requires predicting
Abstract:Temporal knowledge graph (TKG) forecasting requires predicting future facts by jointly modeling structural dependencies within each snapshot and temporal evolution across snapshots. However, most existing methods are stateless: they recompute entity representations at each timestamp from a limited query window, leading to episodic amnesia and rapid decay of long-term dependencies. To address this limitation, we propose Entity State Tuning (EST), an encoder-agnostic framework that endows TKG forecasters with persistent and continuously evolving entity states. EST maintains a global state buffer and progressively aligns structural evidence with sequential signals via a closed-loop design. Specifically, a topology-aware state perceiver first injects entity-state priors into structural encoding. Then, a unified temporal context module aggregates the state-enhanced events with a pluggable sequence backbone. Subsequently, a dual-track evolution mechanism writes the updated context back to the global entity state memory, balancing plasticity against stability. Experiments on multiple benchmarks show that EST consistently improves diverse backbones and achieves state-of-the-art performance, highlighting the importance of state persistence for long-horizon TKG forecasting. The code is published at this https URL
54. 【2602.12316】GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory
Link: https://arxiv.org/abs/2602.12316
Authors: Pepijn Cobben, Xuanqiang Angelo Huang, Thao Amelia Pham, Isabel Dahlgren, Terry Jingchen Zhang, Zhijing Jin
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Keywords: systems are increasingly, increasingly capable, capable and deployed, leaving multi-agent risks, high-stakes multi-agent environments
Abstract:Frontier AI systems are increasingly capable and deployed in high-stakes multi-agent environments. However, existing AI safety benchmarks largely evaluate single agents, leaving multi-agent risks such as coordination failure and conflict poorly understood. We introduce GT-HarmBench, a benchmark of 2,009 high-stakes scenarios spanning game-theoretic structures such as the Prisoner's Dilemma, Stag Hunt and Chicken. Scenarios are drawn from realistic AI risk contexts in the MIT AI Risk Repository. Across 15 frontier models, agents choose socially beneficial actions in only 62% of cases, frequently leading to harmful outcomes. We measure sensitivity to game-theoretic prompt framing and ordering, and analyze reasoning patterns driving failures. We further show that game-theoretic interventions improve socially beneficial outcomes by up to 18%. Our results highlight substantial reliability gaps and provide a broad standardized testbed for studying alignment in multi-agent environments. The benchmark and code are available at this https URL.
55. 【2602.12302】Grandes Modelos de Linguagem Multimodais (MLLMs): Da Teoria à Prática
Link: https://arxiv.org/abs/2602.12302
Authors: Neemias da Silva, Júlio C. W. Scholz, John Harrison, Marina Borges, Paulo Ávila, Frances A Santos, Myriam Delgado, Rodrigo Minetto, Thiago H Silva
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Language Models, Multimodal Large Language, natural language understanding, Large Language, Language Models
Comments: In Portuguese. Accepted book chapter - Webmedia 2025
Abstract:Multimodal Large Language Models (MLLMs) combine the natural language understanding and generation capabilities of LLMs with perception skills in modalities such as image and audio, representing a key advancement in contemporary AI. This chapter presents the main fundamentals of MLLMs and emblematic models. Practical techniques for preprocessing, prompt engineering, and building multimodal pipelines with LangChain and LangGraph are also explored. For further practical study, supplementary material is publicly available online: this https URL. Finally, the chapter discusses the challenges and highlights promising trends.
56. 【2602.12301】Beyond Musical Descriptors: Extracting Preference-Bearing Intent in Music Queries
Link: https://arxiv.org/abs/2602.12301
Authors: Marion Baranes, Romain Hennequin, Elena V. Epure
Categories: Sound (cs.SD); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Keywords: increasingly common, queries are increasingly, essential for effectively, effectively meeting, Reddit music requests
Comments: Accepted at NLP4MusA 2026 (4th Workshop on NLP for Music and Audio)
Abstract:Although annotated music descriptor datasets for user queries are increasingly common, few consider the user's intent behind these descriptors, which is essential for effectively meeting their needs. We introduce MusicRecoIntent, a manually annotated corpus of 2,291 Reddit music requests, labeling musical descriptors across seven categories with positive, negative, or referential preference-bearing roles. We then investigate how reliably large language models (LLMs) can extract these music descriptors, finding that they do capture explicit descriptors but struggle with context-dependent ones. This work can further serve as a benchmark for fine-grained modeling of user intent and for gaining insights into improving LLM-based music understanding systems.
57. 【2602.12287】Retrieval-Augmented Self-Taught Reasoning Model with Adaptive Chain-of-Thought for ASR Named Entity Correction
Link: https://arxiv.org/abs/2602.12287
Authors: Junjie An, Jingguang Tian, Tianyi Wang, Yu Gao, Xiaofeng Mou, Yi Xu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Keywords: systems frequently misrecognize, frequently misrecognize domain-specific, misrecognize domain-specific phrases, automatic speech recognition, automatic speech
Comments:
Abstract:End-to-end automatic speech recognition (ASR) systems frequently misrecognize domain-specific phrases like named entities, which can cause catastrophic failures in downstream tasks. A new family of named entity correction methods based on large language models (LLMs) has recently emerged. However, these approaches have yet to fully exploit the sophisticated reasoning capabilities inherent to LLMs. To bridge this gap, we propose a novel retrieval-augmented generation framework for correcting named entity errors in ASR. Our approach consists of two key components: (1) a rephrasing language model (RLM) for named entity recognition, followed by candidate retrieval using a phonetic-level edit distance; and (2) a novel self-taught reasoning model with adaptive chain-of-thought (A-STAR) that dynamically adjusts the depth of its reasoning based on task difficulty. Experiments on the AISHELL-1 and Homophone datasets demonstrate the effectiveness of our method, which achieves relative reductions in the named entity character error rate of 17.96% and 34.42%, respectively, compared to a strong baseline.
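The abstract's "candidate retrieval using a phonetic-level edit distance" can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each entity is stored as a sequence of romanized syllables, uses plain Levenshtein distance over those syllables, and the lexicon and the function names `edit_distance` and `retrieve_candidates` are hypothetical.

```python
from typing import Dict, List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences of phonetic units."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def retrieve_candidates(query: Sequence[str],
                        lexicon: Dict[str, List[str]],
                        top_k: int = 2) -> List[str]:
    """Return the top_k entity names closest to the query, ties broken by name."""
    ranked = sorted(lexicon, key=lambda name: (edit_distance(query, lexicon[name]), name))
    return ranked[:top_k]


# Hypothetical toy lexicon: entity name -> syllable sequence (an assumption).
LEXICON = {
    "Beijing": ["bei", "jing"],
    "Nanjing": ["nan", "jing"],
    "Beiping": ["bei", "ping"],
}

# A misrecognized span whose syllables are close to several entities.
candidates = retrieve_candidates(["bei", "jin"], LEXICON)
```

In a pipeline like the one described, the retrieved candidates would then be passed to the reasoning model, which decides whether and how to correct the span.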
58. 【2602.12285】From Biased Chatbots to Biased Agents: Examining Role Assignment Effects on LLM Agent Robustness
Link: https://arxiv.org/abs/2602.12285
Authors: Linbo Cao, Lihao Sun, Yang Yue
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, autonomous agents capable, Language Models, text generation
Comments: Accepted to the AAAI 2026 TrustAgent Workshop. 6 pages, 4 figures
Abstract:Large Language Models (LLMs) are increasingly deployed as autonomous agents capable of actions with real-world impacts beyond text generation. While persona-induced biases in text generation are well documented, their effects on agent task performance remain largely unexplored, even though such effects pose more direct operational risks. In this work, we present the first systematic case study showing that demographic-based persona assignments can alter LLM agents' behavior and degrade performance across diverse domains. Evaluating widely deployed models on agentic benchmarks spanning strategic reasoning, planning, and technical operations, we uncover substantial performance variations - up to 26.2% degradation, driven by task-irrelevant persona cues. These shifts appear across task types and model architectures, indicating that persona conditioning and simple prompt injections can distort an agent's decision-making reliability. Our findings reveal an overlooked vulnerability in current LLM agentic systems: persona assignments can introduce implicit biases and increase behavioral volatility, raising concerns for the safe and robust deployment of LLM agents.
59. 【2602.12284】A Lightweight LLM Framework for Disaster Humanitarian Information Classification
Link: https://arxiv.org/abs/2602.12284
Authors: Han Jinzhen, Kim Jisung, Yang Jong Soo, Yun Hong Sik
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: effective disaster response, Timely classification, social media, media is critical, critical for effective
Comments:
Abstract:Timely classification of humanitarian information from social media is critical for effective disaster response. However, deploying large language models (LLMs) for this task faces challenges in resource-constrained emergency settings. This paper develops a lightweight, cost-effective framework for disaster tweet classification using parameter-efficient fine-tuning. We construct a unified experimental corpus by integrating and normalizing the HumAID dataset (76,484 tweets across 19 disaster events) into a dual-task benchmark: humanitarian information categorization and event type identification. Through systematic evaluation of prompting strategies, LoRA fine-tuning, and retrieval-augmented generation (RAG) on Llama 3.1 8B, we demonstrate that: (1) LoRA achieves 79.62% humanitarian classification accuracy (+37.79% over zero-shot) while training only ~2% of parameters; (2) QLoRA enables efficient deployment with 99.4% of LoRA performance at 50% memory cost; (3) contrary to common assumptions, RAG strategies degrade fine-tuned model performance due to label noise from retrieved examples. These findings establish a practical, reproducible pipeline for building reliable crisis intelligence systems with limited computational resources.
60. 【2602.12546】Decoder-only Conformer with Modality-aware Sparse Mixtures of Experts for ASR
Link: https://arxiv.org/abs/2602.12546
Authors: Jaeyoung Lee, Masato Mimura
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Keywords: automatic speech recognition, external speech encoders, pretrained large language, hybrid-causality Conformer blocks, stack without external
Comments: Accepted to ICASSP 2026
Abstract:We present a decoder-only Conformer for automatic speech recognition (ASR) that processes speech and text in a single stack without external speech encoders or pretrained large language models (LLMs). The model uses a modality-aware sparse mixture of experts (MoE): disjoint expert pools for speech and text with hard routing and top-1 selection, embedded in hybrid-causality Conformer blocks (bidirectional for speech, causal for text). Training combines CTC on speech positions with label-smoothed cross-entropy for text generation. Our 113M-parameter model consistently improves WER over a 139M AED baseline on LibriSpeech (2.8% vs. 3.2% test-clean; 5.6% vs. 6.0% test-other). On Common Voice 16.1 with a single multilingual model across five languages, our approach reduces average WER from 12.2% to 10.6%. To our knowledge, this is the first randomly initialized decoder-only ASR that surpasses strong AED baselines via modality-aware routing and sparse MoE, achieving better accuracy with fewer active parameters and without alignment/adaptation modules.
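The routing rule described in the abstract (disjoint expert pools per modality, hard routing, top-1 selection) can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's code: the pool layout `EXPERT_POOLS`, the function names, and the idea that router logits arrive as one score per expert are all hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical layout: experts 0-1 serve speech tokens, experts 2-3 serve text tokens.
EXPERT_POOLS: Dict[str, List[int]] = {"speech": [0, 1], "text": [2, 3]}


def route_top1(modality: str, router_logits: Sequence[float]) -> int:
    """Pick the single best expert, restricted to the pool for this modality."""
    pool = EXPERT_POOLS[modality]
    return max(pool, key=lambda expert: router_logits[expert])


def route_sequence(modalities: Sequence[str],
                   logits_per_token: Sequence[Sequence[float]]) -> List[int]:
    """Route every token in a mixed speech/text sequence."""
    return [route_top1(m, logits) for m, logits in zip(modalities, logits_per_token)]


# The speech token cannot reach expert 2 even though its logit is the largest.
assignments = route_sequence(
    ["speech", "text"],
    [[0.1, 0.5, 9.0, 0.2], [7.0, 0.0, 0.3, 0.4]],
)
```

Because the pools are disjoint, a token's modality alone determines which experts it may use, and only one expert is active per token.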
61. 【2602.12286】Alignment or Integration? Rethinking Multimodal Fusion in DNA-language Foundation Models
Link: https://arxiv.org/abs/2602.12286
Authors: Yanan Li, Christina Yi Jin, Yuan Jin, Manli Luo, Tie Xu, Shuai Jiao, Wei He, Qing Zhang
Categories: Genomics (q-bio.GN); Computation and Language (cs.CL)
Keywords: Fusing DNA foundation, Fusing DNA, DNA foundation models, natural language interact, encode DNA sequences
Comments:
Abstract:Fusing DNA foundation models with large language models (LLMs) for DNA-language reasoning raises a fundamental question: at what level should genomic sequences and natural language interact? Most existing approaches encode DNA sequences and text separately and rely on embedding-level alignment to connect the two modalities. Such late-stage fusion compresses rich genomic sequences into fixed representations, limiting the model's ability to reason over fine-grained, token-level genomic structure. In this work, we propose two new methods for DNA-language fusion, i.e., a semantic alignment method SeqCLIP and a vocabulary-level integration method OneVocab. SeqCLIP strengthens embedding-level alignment via sequence-level contrastive pre-training, and OneVocab directly integrates genomic k-mers into the language model's existing vocabulary. Comprehensive experiments on classification and reasoning tasks show that, while various alignment strategies improve embedding-level fusion, early vocabulary-level integration yields more expressive and effective representations for DNA-language modeling.
Information Retrieval
1. 【2602.13179】Fix Before Search: Benchmarking Agentic Query Visual Pre-processing in Multimodal Retrieval-augmented Generation
Link: https://arxiv.org/abs/2602.13179
Authors: Jiankun Zhang, Shenglai Zeng, Kai Guo, Xinnan Dai, Hui Liu, Jiliang Tang, Yi Chang
Categories: Information Retrieval (cs.IR)
Keywords: Multimodal Retrieval-Augmented Generation, Multimodal Retrieval-Augmented, Retrieval-Augmented Generation, external knowledge, query pre-processing
Comments:
Abstract:Multimodal Retrieval-Augmented Generation (MRAG) has emerged as a key paradigm for grounding MLLMs with external knowledge. While query pre-processing (e.g., rewriting) is standard in text-based RAG, existing MRAG pipelines predominantly treat visual inputs as static and immutable, implicitly assuming they are noise-free. However, real-world visual queries are often "imperfect" -- suffering from geometric distortions, quality degradation, or semantic ambiguity -- leading to catastrophic retrieval failures. To address this gap, we propose V-QPP-Bench, the first comprehensive benchmark dedicated to Visual Query Pre-processing (V-QPP). We formulate V-QPP as an agentic decision-making task where MLLMs must autonomously diagnose imperfections and deploy perceptual tools to refine queries. Our extensive evaluation across 46,700 imperfect queries and diverse MRAG paradigms reveals three critical insights: (1) Vulnerability -- visual imperfections severely degrade both retrieval recall and end-to-end MRAG performance; (2) Restoration Potential & Bottleneck -- while oracle preprocessing recovers near-perfect performance, off-the-shelf MLLMs struggle with tool selection and parameter prediction without specialized training; and (3) Training Enhancement -- supervised fine-tuning enables compact models to achieve comparable or superior performance to larger proprietary models, demonstrating the benchmark's value for developing robust MRAG systems. The code is available at this https URL
2. 【2602.13165】Asynchronous Verified Semantic Caching for Tiered LLM Architectures
Link: https://arxiv.org/abs/2602.13165
Authors: Asmit Kumar Singh, Haozhe Wang, Laxmi Naga Santosh Attaluri, Tak Chiam, Weihua Zhu
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: Large language models, reducing inference cost, Large language, making semantic caching, semantic caching essential
Comments:
Abstract:Large language models (LLMs) now sit in the critical path of search, assistance, and agentic workflows, making semantic caching essential for reducing inference cost and latency. Production deployments typically use a tiered static-dynamic design: a static cache of curated, offline vetted responses mined from logs, backed by a dynamic cache populated online. In practice, both tiers are commonly governed by a single embedding similarity threshold, which induces a hard tradeoff: conservative thresholds miss safe reuse opportunities, while aggressive thresholds risk serving semantically incorrect responses. We introduce Krites, an asynchronous, LLM-judged caching policy that expands static coverage without changing serving decisions. On the critical path, Krites behaves exactly like a standard static threshold policy. When the nearest static neighbor of the prompt falls just below the static threshold, Krites asynchronously invokes an LLM judge to verify whether the static response is acceptable for the new prompt. Approved matches are promoted into the dynamic cache, allowing future repeats and paraphrases to reuse curated static answers and expanding static reach over time. In trace-driven simulations on conversational and search workloads, Krites increases the fraction of requests served with curated static answers (direct static hits plus verified promotions) by up to 3.9 times for conversational traffic and search-style queries relative to tuned baselines, with unchanged critical path latency.
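The policy described in the abstract can be sketched as follows. This is a simplified illustration, not the Krites implementation: the class `TieredCache`, the `grey_floor` lower bound for "just below the static threshold", the threshold values, and the `judge` and `backend` callables are all assumptions, and the asynchronous judge is modeled as a pending queue that is drained off the serving path.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def nearest(query: Vector, entries: Iterable[Tuple[Vector, str]]) -> Tuple[float, Optional[str]]:
    """Return (similarity, response) of the closest entry, or (-1.0, None) if empty."""
    best_score, best_response = -1.0, None
    for emb, response in entries:
        score = cosine(query, emb)
        if score > best_score:
            best_score, best_response = score, response
    return best_score, best_response


@dataclass
class TieredCache:
    static: List[Tuple[Vector, str]]           # curated, offline-vetted answers
    judge: Callable[[str, str], bool]          # stand-in for the LLM judge
    static_threshold: float = 0.90             # assumed value
    dynamic_threshold: float = 0.90            # assumed value
    grey_floor: float = 0.75                   # assumed lower bound of the "just below" band
    dynamic: Dict[Tuple[float, ...], str] = field(default_factory=dict)
    pending: List[Tuple[str, Vector, str]] = field(default_factory=list)

    def serve(self, prompt: str, emb: Vector, backend: Callable[[str], str]) -> Tuple[str, str]:
        """Critical path: same decisions as a plain threshold policy."""
        s_score, s_resp = nearest(emb, self.static)
        if s_resp is not None and s_score >= self.static_threshold:
            return "static", s_resp
        if s_resp is not None and s_score >= self.grey_floor:
            # Near miss: queue a verification job; serving is not delayed.
            self.pending.append((prompt, emb, s_resp))
        d_score, d_resp = nearest(emb, self.dynamic.items())
        if d_resp is not None and d_score >= self.dynamic_threshold:
            return "dynamic", d_resp
        answer = backend(prompt)
        self.dynamic[tuple(emb)] = answer
        return "backend", answer

    def run_pending_verifications(self) -> int:
        """Off the critical path: promote judge-approved static answers."""
        promoted = 0
        while self.pending:
            prompt, emb, static_resp = self.pending.pop()
            if self.judge(prompt, static_resp):
                self.dynamic[tuple(emb)] = static_resp
                promoted += 1
        return promoted
```

In this sketch a near-miss prompt is first answered by the backend as usual; once the judge approves the curated answer, later repeats and close paraphrases of that prompt are served the curated answer from the dynamic tier.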
3. 【2602.13134】Awakening Dormant Users: Generative Recommendation with Counterfactual Functional Role Reasoning
Link: https://arxiv.org/abs/2602.13134
Authors: Huishi Luo, Shuokai Li, Hanchen Yang, Zhongbo Sun, Haojie Ding, Boheng Zhang, Zijia Cai, Renliang Qian, Fan Yang, Tingting Gao, Chenyi Lei, Wenwu Ou, Fuzhen Zhuang
Categories: Information Retrieval (cs.IR)
Keywords: incremental GMV growth, exhibit low conversion, incremental GMV, GMV growth, Awakening dormant users
Comments:
Abstract:Awakening dormant users, who remain engaged but exhibit low conversion, is a pivotal driver for incremental GMV growth in large-scale e-commerce platforms. However, existing approaches often yield suboptimal results since they typically rely on single-step estimation of an item's intrinsic value (e.g., immediate click probability). This mechanism overlooks the instrumental effect of items, where specific interactions act as triggers to shape latent intent and drive subsequent decisions along a conversion trajectory. To bridge this gap, we propose RoleGen, a novel framework that synergizes a Conversion Trajectory Reasoner with a Generative Behavioral Backbone. Specifically, the LLM-based Reasoner explicitly models the context-dependent Functional Role of items to reconstruct intent evolution. It further employs counterfactual inference to simulate diverse conversion paths, effectively mitigating interest collapse. These reasoned candidate items are integrated into the generative backbone, which is optimized via a collaborative "Reasoning-Execution-Feedback-Reflection" closed-loop strategy to ensure grounded execution. Extensive offline experiments and online A/B testing on the Kuaishou e-commerce platform demonstrate that RoleGen achieves a 6.2% gain in Recall@1 and a 7.3% increase in online order volume, confirming its effectiveness in activating the dormant user base.
4. 【2602.12968】RGAlign-Rec: Ranking-Guided Alignment for Latent Query Reasoning in Recommendation Systems
Link: https://arxiv.org/abs/2602.12968
Authors: Junhua Liu, Yang Jihao, Cheng Chang, Kunrong LI, Bin Fu, Kwan Hui Lim
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: modern e-commerce chatbots, chatbot Knowledge Base, critical capability, capability in modern, modern e-commerce
Comments:
Abstract:Proactive intent prediction is a critical capability in modern e-commerce chatbots, enabling "zero-query" recommendations by anticipating user needs from behavioral and contextual signals. However, existing industrial systems face two fundamental challenges: (1) the semantic gap between discrete user features and the semantic intents within the chatbot's Knowledge Base, and (2) the objective misalignment between general-purpose LLM outputs and task-specific ranking utilities. To address these issues, we propose RGAlign-Rec, a closed-loop alignment framework that integrates an LLM-based semantic reasoner with a Query-Enhanced (QE) ranking model. We also introduce Ranking-Guided Alignment (RGA), a multi-stage training paradigm that utilizes downstream ranking signals as feedback to refine the LLM's latent reasoning. Extensive experiments on a large-scale industrial dataset from Shopee demonstrate that RGAlign-Rec achieves a 0.12% gain in GAUC, leading to a significant 3.52% relative reduction in error rate, and a 0.56% improvement in Recall@3. Online A/B testing further validates the cumulative effectiveness of our framework: the Query-Enhanced model (QE-Rec) initially yields a 0.98% improvement in CTR, while the subsequent Ranking-Guided Alignment stage contributes an additional 0.13% gain. These results indicate that ranking-aware alignment effectively synchronizes semantic reasoning with ranking objectives, significantly enhancing both prediction accuracy and service quality in real-world proactive recommendation systems.
5. 【2602.12941】JARVIS: An Evidence-Grounded Retrieval System for Interpretable Deceptive Reviews Adjudication
Link: https://arxiv.org/abs/2602.12941
Authors: Nan Lu, Leyang Li, Yurong Hu, Rui Lin, Shaoyi Xu
Categories: Information Retrieval (cs.IR)
Keywords: fabricated feedback designed, refer to fabricated, quality of products, Deceptive reviews, fabricated feedback
Comments:
Abstract:Deceptive reviews refer to fabricated feedback designed to artificially manipulate the perceived quality of products. Within modern e-commerce ecosystems, these reviews remain a critical governance challenge. Despite advances in review-level and graph-based detection methods, two pivotal limitations remain: inadequate generalization and lack of interpretability. To address these challenges, we propose JARVIS, a framework providing Judgment via Augmented Retrieval and eVIdence graph Structures. Starting from the review to be evaluated, it retrieves semantically similar evidence via hybrid dense-sparse multimodal retrieval, expands relational signals through shared entities, and constructs a heterogeneous evidence graph. A large language model then performs evidence-grounded adjudication to produce interpretable risk assessments. Offline experiments demonstrate that JARVIS enhances performance on our constructed review dataset, achieving a precision increase from 0.953 to 0.988 and a recall boost from 0.830 to 0.901. In the production environment, our framework achieves a 27% increase in the recall volume and reduces manual inspection time by 75%. Furthermore, the adoption rate of the model-generated analysis reaches 96.4%.
6. 【2602.12819】WISE: A Multimodal Search Engine for Visual Scenes, Audio, Objects, Faces, Speech, and Metadata
Link: https://arxiv.org/abs/2602.12819
Authors: Prasanna Sridhar, Horace Lee, David M. S. Pinto, Andrew Zisserman, Abhishek Dutta
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: practical tool accessible, machine learning expertise, multimodal retrieval capabilities, audiovisual search engine, practical tool
Comments: Software: [this https URL](https://www.robots.ox.ac.uk/~vgg/software/wise/), Online demos: [this https URL](https://www.robots.ox.ac.uk/~vgg/software/wise/demo/), Example Queries: [this https URL](https://www.robots.ox.ac.uk/~vgg/software/wise/examples/)
Abstract:In this paper, we present WISE, an open-source audiovisual search engine which integrates a range of multimodal retrieval capabilities into a single, practical tool accessible to users without machine learning expertise. WISE supports natural-language and reverse-image queries at both the scene level (e.g. empty street) and object level (e.g. horse) across images and videos; face-based search for specific individuals; audio retrieval of acoustic events using text (e.g. wood creak) or an audio file; search over automatically transcribed speech; and filtering by user-provided metadata. Rich insights can be obtained by combining queries across modalities -- for example, retrieving German trains from a historical archive by applying the object query "train" and the metadata query "Germany", or searching for a face in a place. By employing vector search techniques, WISE can scale to support efficient retrieval over millions of images or thousands of hours of video. Its modular architecture facilitates the integration of new models. WISE can be deployed locally for private or sensitive collections, and has been applied to various real-world use cases. Our code is open-source and available at this https URL.
7. 【2602.12783】SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise
Link: https://arxiv.org/abs/2602.12783
Authors: Yuejie Li, Ke Yang, Yueying Hua, Berlin Chen, Jianhao Nie, Yueping He, Caixin Kang
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: important interaction mode, Spoken query retrieval, Spoken query, modern information retrieval, important interaction
Comments:
Abstract:Spoken query retrieval is an important interaction mode in modern information retrieval. However, existing evaluation datasets are often limited to simple queries under constrained noise conditions, making them inadequate for assessing the robustness of spoken query retrieval systems under complex acoustic perturbations. To address this limitation, we present SQuTR, a robustness benchmark for spoken query retrieval that includes a large-scale dataset and a unified evaluation protocol. SQuTR aggregates 37,317 unique queries from six commonly used English and Chinese text retrieval datasets, spanning multiple domains and diverse query types. We synthesize speech using voice profiles from 200 real speakers and mix 17 categories of real-world environmental noise under controlled SNR levels, enabling reproducible robustness evaluation from quiet to highly noisy conditions. Under the unified protocol, we conduct large-scale evaluations on representative cascaded and end-to-end retrieval systems. Experimental results show that retrieval performance decreases as noise increases, with substantially different drops across systems. Even large-scale retrieval models struggle under extreme noise, indicating that robustness remains a critical bottleneck. Overall, SQuTR provides a reproducible testbed for benchmarking and diagnostic analysis, and facilitates future research on robustness in spoken query to text retrieval.
8. 【2602.12727】Training Dense Retrievers with Multiple Positive Passages
Link: https://arxiv.org/abs/2602.12727
Authors: Benben Wang, Minghao Tang, Hengran Zhang, Jiafeng Guo, Keping Bi
Categories: Information Retrieval (cs.IR)
Keywords: Modern knowledge-intensive systems, Modern knowledge-intensive, knowledge-intensive systems, retrieval-augmented generation, rely on effective
Comments:
Abstract:Modern knowledge-intensive systems, such as retrieval-augmented generation (RAG), rely on effective retrievers to establish the performance ceiling for downstream modules. However, retriever training has been bottlenecked by sparse, single-positive annotations, which lead to false-negative noise and suboptimal supervision. While the advent of large language models (LLMs) makes it feasible to collect comprehensive multi-positive relevance labels at scale, the optimal strategy for incorporating these dense signals into training remains poorly understood. In this paper, we present a systematic study of multi-positive optimization objectives for retriever training. We unify representative objectives, including Joint Likelihood (JointLH), Summed Marginal Likelihood (SumMargLH), and Log-Sum-Exp Pairwise (LSEPair) loss, under a shared contrastive learning framework. Our theoretical analysis characterizes their distinct gradient behaviors, revealing how each allocates probability mass across positive document sets. Empirically, we conduct extensive evaluations on Natural Questions, MS MARCO, and the BEIR benchmark across two realistic regimes: homogeneous LLM-annotated data and heterogeneous mixtures of human and LLM labels. Our results show that LSEPair consistently achieves superior robustness and performance across settings, while JointLH and SumMargLH exhibit high sensitivity to the quality of positives. Furthermore, we find that the simple strategy of random sampling (Rand1LH) serves as a reliable baseline. By aligning theoretical insights with empirical findings, we provide practical design principles for leveraging dense, LLM-augmented supervision to enhance retriever effectiveness.
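The abstract names four multi-positive objectives without giving their formulas. The sketch below shows one common way such losses are written for a single query, given similarity scores for its positives and negatives. These forms are assumptions made for illustration and may differ from the paper's exact definitions; the function names are hypothetical.

```python
import math
import random
from typing import List


def logsumexp(xs: List[float]) -> float:
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def joint_lh(pos: List[float], neg: List[float]) -> float:
    """Joint likelihood (assumed form): every positive must get high softmax mass."""
    log_z = logsumexp(pos + neg)
    return -sum(s - log_z for s in pos)


def sum_marg_lh(pos: List[float], neg: List[float]) -> float:
    """Summed marginal likelihood (assumed form): total mass on the positive set."""
    return logsumexp(pos + neg) - logsumexp(pos)


def lse_pair(pos: List[float], neg: List[float]) -> float:
    """Log-sum-exp pairwise (assumed form): smooth penalty over all (neg, pos) margins."""
    margins = [n - p for p in pos for n in neg]
    return logsumexp([0.0] + margins)


def rand1_lh(pos: List[float], neg: List[float], rng: random.Random) -> float:
    """Random single positive: standard contrastive loss on one sampled positive."""
    p = rng.choice(pos)
    return logsumexp([p] + neg) - p


# One strong and one weak positive against a single negative.
pos_scores, neg_scores = [2.0, 0.0], [1.0]
losses = {
    "joint": joint_lh(pos_scores, neg_scores),
    "sum_marg": sum_marg_lh(pos_scores, neg_scores),
    "lse_pair": lse_pair(pos_scores, neg_scores),
    "rand1": rand1_lh(pos_scores, neg_scores, random.Random(0)),
}
```

Under these assumed forms, the joint loss penalizes every low-scoring positive, while the summed marginal loss is satisfied as soon as any positive carries most of the probability mass, which is one way to see why the objectives can react differently to noisy positives.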
9. 【2602.12612】Self-EvolveRec: Self-Evolving Recommender Systems with LLM-based Directional Feedback
Link: https://arxiv.org/abs/2602.12612
Authors: Sein Kim, Sangwu Park, Hongseok Kang, Wonjoong Kim, Jimin Seo, Yeonjun In, Kanghoon Yoon, Chanyoung Park
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: Neural Architecture Search, recommender system design, fixed search space, automating recommender system, search space defined
Comments:
Abstract:Traditional methods for automating recommender system design, such as Neural Architecture Search (NAS), are often constrained by a fixed search space defined by human priors, limiting innovation to pre-defined operators. While recent LLM-driven code evolution frameworks shift the target from a fixed search space to open-ended program spaces, they primarily rely on scalar metrics (e.g., NDCG, Hit Ratio) that fail to provide qualitative insights into model failures or directional guidance for improvement. To address this, we propose Self-EvolveRec, a novel framework that establishes a directional feedback loop by integrating a User Simulator for qualitative critiques and a Model Diagnosis Tool for quantitative internal verification. Furthermore, we introduce a Diagnosis Tool - Model Co-Evolution strategy to ensure that evaluation criteria dynamically adapt as the recommendation architecture evolves. Extensive experiments demonstrate that Self-EvolveRec significantly outperforms state-of-the-art NAS and LLM-driven code evolution baselines in both recommendation performance and user satisfaction. Our code is available at this https URL.
10. 【2602.12593】RQ-GMM: Residual Quantized Gaussian Mixture Model for Multimodal Semantic Discretization in CTR Prediction
Link: https://arxiv.org/abs/2602.12593
Authors: Ziye Tong, Jiahao Liu, Weimin Zhang, Hongji Ruan, Derick Tang, Zhanpeng Zeng, Qinsong Zeng, Peng Zhang, Tun Lu, Ning Gu
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: CTR models, click-through rate, Quantized Gaussian Mixture, content is crucial, crucial for click-through
Comments: Under review
Abstract:Multimodal content is crucial for click-through rate (CTR) prediction. However, directly incorporating continuous embeddings from pre-trained models into CTR models yields suboptimal results due to misaligned optimization objectives and convergence speed inconsistency during joint training. Discretizing embeddings into semantic IDs before feeding them into CTR models offers a more effective solution, yet existing methods suffer from limited codebook utilization, reconstruction accuracy, and semantic discriminability. We propose RQ-GMM (Residual Quantized Gaussian Mixture Model), which introduces probabilistic modeling to better capture the statistical structure of multimodal embedding spaces. Through Gaussian Mixture Models combined with residual quantization, RQ-GMM achieves superior codebook utilization and reconstruction accuracy. Experiments on public datasets and online A/B tests on a large-scale short-video platform serving hundreds of millions of users demonstrate substantial improvements: RQ-GMM yields a 1.502% gain in Advertiser Value over strong baselines. The method has been fully deployed, serving daily recommendations for hundreds of millions of users.
11. 【2602.12564】CAPTS: Channel-Aware, Preference-Aligned Trigger Selection for Multi-Channel Item-to-Item Retrieval
Link: https://arxiv.org/abs/2602.12564
Authors: Xiaoyou Zhou, Yuqi Liu, Zhao Liu, Xiao Lv, Bo Chen, Ruiming Tang, Guorui Zhou
Categories: Information Retrieval (cs.IR)
Keywords: industrial recommender systems, recommender systems commonly, systems commonly adopt, Large-scale industrial recommender, commonly adopt multi-channel
Comments: 10 pages, 6 figures
Abstract:Large-scale industrial recommender systems commonly adopt multi-channel retrieval for candidate generation, combining direct user-to-item (U2I) retrieval with two-hop user-to-item-to-item (U2I2I) pipelines. In U2I2I, the system selects a small set of historical interactions as triggers to seed downstream item-to-item (I2I) retrieval across multiple channels. In production, triggers are often selected using rule-based policies or learned scorers and tuned in a channel-by-channel manner. However, these practices face two persistent challenges: biased value attribution that values triggers by on-trigger feedback rather than their downstream utility as retrieval seeds, and uncoordinated multi-channel routing where channels select triggers independently under a shared quota, increasing cross-channel overlap. To address these challenges, we propose Channel-Aware, Preference-Aligned Trigger Selection (CAPTS), a unified and flexible framework that treats multi-channel trigger selection as a learnable routing problem. CAPTS introduces a Value Attribution Module (VAM) that provides look-ahead supervision by crediting each trigger with the subsequent engagement generated by items retrieved from it on each I2I channel, and a Channel-Adaptive Trigger Routing (CATR) module that coordinates trigger-to-channel assignment to maximize the overall value of multi-channel retrieval. Extensive offline experiments and large-scale online A/B tests on Kwai, Kuaishou's international short-video platform, show that CAPTS consistently improves multi-channel recall offline and delivers a +0.351% lift in average time spent per device online.
12. 【2602.12530】Reasoning to Rank: An End-to-End Solution for Exploiting Large Language Models for Recommendation
Link: https://arxiv.org/abs/2602.12530
Authors: Kehan Zheng, Deyao Hong, Qian Li, Jun Zhang, Huan Yu, Jie Jiang, Hongning Wang
Categories: Information Retrieval (cs.IR)
Keywords: infer users' evolving, users' evolving preferences, rank items aligned, Recommender systems, pattern-based scoring
Comments:
Abstract:Recommender systems are tasked to infer users' evolving preferences and rank items aligned with their intents, which calls for in-depth reasoning beyond pattern-based scoring. Recent efforts have started to leverage large language models (LLMs) for recommendation, but how to effectively optimize the model for improved recommendation utility is still underexplored. In this work, we propose Reasoning to Rank, an end-to-end training framework that internalizes recommendation utility optimization into the learning of step-by-step reasoning in LLMs. To avoid position bias in LLM reasoning and enable direct optimization of the reasoning process, our framework performs reasoning at the user-item level and employs reinforcement learning for end-to-end training of the LLM. Experiments on three Amazon datasets and a large-scale industrial dataset show consistent gains over strong conventional and LLM-based solutions. Extensive in-depth analyses validate the necessity of the key components in the proposed framework and shed light on future developments of this line of work.
13. 【2602.12528】DiffuRank: Effective Document Reranking with Diffusion Language Models
Link: https://arxiv.org/abs/2602.12528
Authors: Qi Liu,Kun Ai,Jiaxin Mao,Yanzhao Zhang,Mingxin Li,Dingkun Long,Pengjun Xie,Fengbin Zhu,Ji-Rong Wen
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Keywords: Recent advances, advances in large, dLLMs, large language models, Recent
Comments: The code is available at [this https URL](https://github.com/liuqi6777/DiffusionRank)
Click to view abstract
Abstract:Recent advances in large language models (LLMs) have inspired new paradigms for document reranking. While this paradigm better exploits the reasoning and contextual understanding capabilities of LLMs, most existing LLM-based rerankers rely on autoregressive generation, which limits their efficiency and flexibility. In particular, token-by-token decoding incurs high latency, while the fixed left-to-right generation order causes early prediction errors to propagate and is difficult to revise. To address these limitations, we explore the use of diffusion language models (dLLMs) for document reranking and propose DiffuRank, a reranking framework built upon dLLMs. Unlike autoregressive models, dLLMs support more flexible decoding and generation processes that are not constrained to a left-to-right order, and enable parallel decoding, which may lead to improved efficiency and controllability. Specifically, we investigate three reranking strategies based on dLLMs: (1) a pointwise approach that uses dLLMs to estimate the relevance of each query-document pair; (2) a logit-based listwise approach that prompts dLLMs to jointly assess the relevance of multiple documents and derives ranking lists directly from model logits; and (3) a permutation-based listwise approach that adapts the canonical decoding process of dLLMs to the reranking tasks. For each approach, we design corresponding training methods to fully exploit the advantages of dLLMs. We evaluate both zero-shot and fine-tuned reranking performance on multiple benchmarks. Experimental results show that dLLMs achieve performance comparable to, and in some cases exceeding, that of autoregressive LLMs with similar model sizes. These findings demonstrate the promise of diffusion-based language models as a compelling alternative to autoregressive architectures for document reranking.
14. 【2602.12510】Visual RAG Toolkit: Scaling Multi-Vector Visual Retrieval with Training-Free Pooling and Multi-Stage Search
Link: https://arxiv.org/abs/2602.12510
Authors: Ara Yeroyan
Subjects: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: deliver strong accuracy, ColPali-style late interaction, search increasingly expensive, late interaction models, Multi-vector visual retrievers
Comments: 4 pages, 3 figures. Submitted to SIGIR 2026 Demonstrations Track. Project website: [this https URL](https://github.com/Ara-Yeroyan/visual-rag-toolkit)
Click to view abstract
Abstract:Multi-vector visual retrievers (e.g., ColPali-style late interaction models) deliver strong accuracy, but scale poorly because each page yields thousands of vectors, making indexing and search increasingly expensive. We present Visual RAG Toolkit, a practical system for scaling visual multi-vector retrieval with training-free, model-aware pooling and multi-stage retrieval. Motivated by Matryoshka Embeddings, our method performs static spatial pooling - including a lightweight sliding-window averaging variant - over patch embeddings to produce compact tile-level and global representations for fast candidate generation, followed by exact MaxSim reranking using full multi-vector embeddings. Our design yields a quadratic reduction in vector-to-vector comparisons by reducing stored vectors per page from thousands to dozens, notably without requiring post-training, adapters, or distillation. Across experiments with interaction-style models such as ColPali and ColSmol-500M, we observe that over the limited ViDoRe v2 benchmark corpus 2-stage retrieval typically preserves NDCG and Recall @ 5/10 with minimal degradation, while substantially improving throughput (approximately 4x QPS); with sensitivity mainly at very large k. The toolkit additionally provides robust preprocessing - high resolution PDF to image conversion, optional margin/empty-region cropping and token hygiene (indexing only visual tokens) - and a reproducible evaluation pipeline, enabling rapid exploration of two-, three-, and cascaded retrieval variants. By emphasizing efficiency at common cutoffs (e.g., k = 10), the toolkit lowers hardware barriers and makes state-of-the-art visual retrieval more accessible in practice.
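The two-stage idea in this abstract (coarse search over pooled vectors, then exact MaxSim reranking on the full vectors) can be illustrated with a minimal pure-Python sketch. It is not the toolkit's code: the function names, the non-overlapping 1-D window standing in for spatial tile pooling, and the toy 2-D vectors are all assumptions made for illustration.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_pool(vectors, window):
    """Average consecutive, non-overlapping groups of `window` vectors.

    Assumption: a 1-D window stands in for the spatial (tile-level)
    pooling described in the abstract.
    """
    pooled = []
    for start in range(0, len(vectors), window):
        group = vectors[start:start + window]
        dim = len(group[0])
        pooled.append([sum(v[i] for v in group) / len(group) for i in range(dim)])
    return pooled


def maxsim(query_vecs, page_vecs):
    """Late-interaction score: best page match per query vector, summed."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def two_stage_search(query_vecs, pages, window=2, n_candidates=2, top_k=1):
    """Hypothetical two-stage search over `pages` (name -> list of vectors)."""
    # Stage 1: cheap scoring on pooled vectors. A real index would
    # precompute the pooled vectors instead of pooling per query.
    coarse = sorted(
        pages,
        key=lambda name: maxsim(query_vecs, mean_pool(pages[name], window)),
        reverse=True,
    )
    candidates = coarse[:n_candidates]
    # Stage 2: exact MaxSim on the full multi-vector representation.
    reranked = sorted(
        candidates,
        key=lambda name: maxsim(query_vecs, pages[name]),
        reverse=True,
    )
    return reranked[:top_k]


pages = {
    "a": [[1, 0], [1, 0], [0, 1], [0, 1]],
    "b": [[0, 1], [0, 1], [0, 1], [0, 1]],
    "c": [[-1, 0], [-1, 0], [-1, 0], [-1, 0]],
}
query = [[1, 0]]
best = two_stage_search(query, pages)
```

With a window of 2, each toy page shrinks from four vectors to two for the first stage, and only the shortlisted pages pay for the full MaxSim comparison.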
ACM classes: H.3.3
15. 【2602.12485】Latent Customer Segmentation and Value-Based Recommendation Leveraging a Two-Stage Model with Missing Labels
Link: https://arxiv.org/abs/2602.12485
Authors: Keerthi Gopalakrishnan,Tianning Dong,Chia-Yen Ho,Yokila Arora,Topojoy Biswas,Jason Cho,Sushant Kumar,Kannan Achan
Subjects: Information Retrieval (cs.IR)
Keywords: businesses depends, ability to convert, convert consumers, consumers into loyal, customers
Comments:
Click to view abstract
Abstract:The success of businesses depends on their ability to convert consumers into loyal customers. A customer's value proposition is a primary determinant in this process, requiring a balance between affordability and long-term brand equity. Broad marketing campaigns can erode perceived brand value and reduce return on investment, while existing economic algorithms often misidentify highly engaged customers as ideal targets, leading to inefficient engagement and conversion outcomes. This work introduces a two-stage multi-model architecture employing Self-Paced Loss to improve customer categorization. The first stage uses a multi-class neural network to distinguish customers influenced by campaigns, organically engaged customers, and low-engagement customers. The second stage applies a binary label correction model to identify true campaign-driven intent using a missing-label framework, refining customer segmentation during training. By separating prompted engagement from organic behavior, the system enables more precise campaign targeting, reduces exposure costs, and improves conversion efficiency. A/B testing demonstrates over 100 basis points improvement in key success metrics, highlighting the effectiveness of intent-aware segmentation for value-driven marketing strategies.
Journal reference: Companion Proceedings of the ACM Web Conference 2025 (WWW Companion 25), ACM, 2025
Related DOI: https://doi.org/10.1145/3701716.3715243
16. 【2602.12354】An Industrial-Scale Sequential Recommender for LinkedIn Feed Ranking
Link: https://arxiv.org/abs/2602.12354
Authors: Lars Hertel,Gaurav Srivastava,Syed Ali Naqvi,Satyam Kumar,Yue Zhang,Borja Ocejo,Benjamin Zelditch,Adrian Englhardt,Hailing Cheng,Andy Hu,Antonio Alonso,Daming Li,Siddharth Dangi,Chen Zhu,Mingzhou Zhou,Wanning Li,Tao Huang,Fedor Borisyuk,Ganesh Parameswaran,Birjodh Singh Tiwana,Sriram Sankar,Qing Lan,Julie Choi,Souvik Ghosh
Subjects: Information Retrieval (cs.IR)
Keywords: discover relevant content, Feed enables professionals, Feed Sequential Recommender, enables professionals worldwide, LinkedIn Feed
Comments:
Click to view abstract
Abstract:LinkedIn Feed enables professionals worldwide to discover relevant content, build connections, and share knowledge at scale. We present Feed Sequential Recommender (Feed-SR), a transformer-based sequential ranking model for LinkedIn Feed that replaces a DCNv2-based ranker and meets strict production constraints. We detail the modeling choices, training techniques, and serving optimizations that enable deployment at LinkedIn scale. Feed-SR is currently the primary member experience on LinkedIn's Feed and shows significant improvements in member engagement (+2.10% time spent) in online A/B tests compared to the existing production model. We also describe our deployment experience with alternative sequential and LLM-based ranking architectures and why Feed-SR provided the best combination of online metrics and production efficiency.
17. 【2602.12315】AgenticShop: Benchmarking Agentic Product Curation for Personalized Web Shopping
Link: https://arxiv.org/abs/2602.12315
Authors: Sunghwan Kim,Ryang Heo,Yongsik Seo,Jinyoung Yeo,Dongha Lee
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: vast digital marketplace, platforms key gateways, shopping platforms key, digital marketplace, proliferation of e-commerce
Comments: Accepted at WWW 2026
Click to view abstract
Abstract:The proliferation of e-commerce has made web shopping platforms key gateways for customers navigating the vast digital marketplace. Yet this rapid expansion has led to a noisy and fragmented information environment, increasing cognitive burden as shoppers explore and purchase products online. With promising potential to alleviate this challenge, agentic systems have garnered growing attention for automating user-side tasks in web shopping. Despite significant advancements, existing benchmarks fail to comprehensively evaluate how well agentic systems can curate products in open-web settings. Specifically, they have limited coverage of shopping scenarios, focusing only on simplified single-platform lookups rather than exploratory search. Moreover, they overlook personalization in evaluation, leaving unclear whether agents can adapt to diverse user preferences in realistic shopping contexts. To address this gap, we present AgenticShop, the first benchmark for evaluating agentic systems on personalized product curation in open-web environments. Crucially, our approach features realistic shopping scenarios, diverse user profiles, and a verifiable, checklist-driven personalization evaluation framework. Through extensive experiments, we demonstrate that current agentic systems remain largely insufficient, emphasizing the need for user-side systems that effectively curate tailored products across the modern web.
18. 【2602.12301】Beyond Musical Descriptors: Extracting Preference-Bearing Intent in Music Queries
Link: https://arxiv.org/abs/2602.12301
Authors: Marion Baranes,Romain Hennequin,Elena V. Epure
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Keywords: increasingly common, queries are increasingly, essential for effectively, effectively meeting, Reddit music requests
Comments: Accepted at NLP4MusA 2026 (4th Workshop on NLP for Music and Audio)
Click to view abstract
Abstract:Although annotated music descriptor datasets for user queries are increasingly common, few consider the user's intent behind these descriptors, which is essential for effectively meeting their needs. We introduce MusicRecoIntent, a manually annotated corpus of 2,291 Reddit music requests, labeling musical descriptors across seven categories with positive, negative, or referential preference-bearing roles. We then investigate how reliably large language models (LLMs) can extract these music descriptors, finding that they do capture explicit descriptors but struggle with context-dependent ones. This work can further serve as a benchmark for fine-grained modeling of user intent and for gaining insights into improving LLM-based music understanding systems.
19. 【2602.11799】Hi-SAM: A Hierarchical Structure-Aware Multi-modal Framework for Large-Scale Recommendation
Link: https://arxiv.org/abs/2602.11799
Authors: Pingjun Pan,Tingting Zhou,Peiyao Lu,Tingting Fei,Hongxiang Chen,Chuanjiang Luo
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords: possess rich attributes, items possess rich, text and images, recommendation has gained, gained traction
Comments:
Click to view abstract
Abstract:Multi-modal recommendation has gained traction as items possess rich attributes like text and images. Semantic ID-based approaches effectively discretize this information into compact tokens. However, two challenges persist: (1) Suboptimal Tokenization: existing methods (e.g., RQ-VAE) lack disentanglement between shared cross-modal semantics and modality-specific details, causing redundancy or collapse; (2) Architecture-Data Mismatch: vanilla Transformers treat semantic IDs as flat streams, ignoring the hierarchy of user interactions, items, and tokens. Expanding items into multiple tokens amplifies length and noise, biasing attention toward local details over holistic semantics. We propose Hi-SAM, a Hierarchical Structure-Aware Multi-modal framework with two designs: (1) Disentangled Semantic Tokenizer (DST): unifies modalities via geometry-aware alignment and quantizes them via a coarse-to-fine strategy. Shared codebooks distill consensus while modality-specific ones recover nuances from residuals, enforced by mutual information minimization; (2) Hierarchical Memory-Anchor Transformer (HMAT): splits positional encoding into inter- and intra-item subspaces via Hierarchical RoPE to restore hierarchy. It inserts Anchor Tokens to condense items into compact memory, retaining details for the current item while accessing history only through compressed summaries. Experiments on real-world datasets show consistent improvements over SOTA baselines, especially in cold-start scenarios. Deployed on a large-scale social platform serving millions of users, Hi-SAM achieved a 6.55% gain in the core online metric.
20. 【2602.12291】Nationwide Hourly Population Estimating at the Neighborhood Scale in the United States Using Stable-Attendance Anchor Calibration
Link: https://arxiv.org/abs/2602.12291
Authors: Huan Ning,Zhenlong Li,Manzhu Yu,Xiao Huang,Shiyan Zhang,Shan Qiao
Subjects: Applications (stat.AP); Information Retrieval (cs.IR)
Keywords: Traditional population datasets, Traditional population, population, datasets are largely, largely static
Comments:
Click to view abstract
Abstract:Traditional population datasets are largely static and therefore unable to capture the strong temporal dynamics of human presence driven by daily mobility. Recent smartphone-based mobility data offer unprecedented spatiotemporal coverage, yet translating these opportunistic observations into accurate population estimates remains challenging due to incomplete sensing, spatially heterogeneous device penetration, and unstable observation processes. We propose a Stable-Attendance Anchor Calibration (SAAC) framework to reconstruct hourly population presence at the Census block group level across the United States. SAAC formulates population estimation as a balance-based population accounting problem, combining residential population with time-varying inbound and outbound mobility inferred from device-event observations. To address observation bias and identifiability limitations, the framework leverages locations with highly regular attendance as calibration anchors, using high schools in this study. These anchors enable estimation of observation scaling factors that correct for under-recorded mobility events. By integrating anchor-based calibration with an explicit sampling model, SAAC enables consistent conversion from observed device events to population presence at fine temporal resolution. The inferred population patterns are consistent with established empirical findings in prior mobility and urban population studies. SAAC provides a generalizable framework for transforming large-scale, biased digital trace data into interpretable dynamic population products, with implications for urban science, public health, and human mobility research. The hourly population estimates can be accessed at: this https URL.
Computer Vision
1. 【2602.13197】Imitating What Works: Simulation-Filtered Modular Policy Learning from Human Videos
Link: https://arxiv.org/abs/2602.13197
Authors: Albert J. Zhai,Kuo-Hao Zeng,Jiasen Lu,Ali Farhadi,Shenlong Wang,Wei-Chiu Ma
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: highly scalable data, potential to unlock, source of highly, highly scalable, learning
Comments:
Click to view abstract
Abstract:The ability to learn manipulation skills by watching videos of humans has the potential to unlock a new source of highly scalable data for robot learning. Here, we tackle prehensile manipulation, in which tasks involve grasping an object before performing various post-grasp motions. Human videos offer strong signals for learning the post-grasp motions, but they are less useful for learning the prerequisite grasping behaviors, especially for robots without human-like hands. A promising way forward is to use a modular policy design, leveraging a dedicated grasp generator to produce stable grasps. However, arbitrary stable grasps are often not task-compatible, hindering the robot's ability to perform the desired downstream motion. To address this challenge, we present Perceive-Simulate-Imitate (PSI), a framework for training a modular manipulation policy using human video motion data processed by paired grasp-trajectory filtering in simulation. This simulation step extends the trajectory data with grasp suitability labels, which allows for supervised learning of task-oriented grasping capabilities. We show through real-world experiments that our framework can be used to learn precise manipulation skills efficiently without any robot data, resulting in significantly more robust performance than using a grasp generator naively.
2. 【2602.13195】Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision
Link: https://arxiv.org/abs/2602.13195
Authors: Aadarsh Sahoo,Georgia Gkioxari
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Conversational image segmentation, segmentation grounds abstract, image segmentation grounds, Conversational image, grounds abstract
Comments: Project webpage: [this https URL](https://glab-caltech.github.io/converseg/)
Click to view abstract
Abstract:Conversational image segmentation grounds abstract, intent-driven concepts into pixel-accurate masks. Prior work on referring image grounding focuses on categorical and spatial queries (e.g., "left-most apple") and overlooks functional and physical reasoning (e.g., "where can I safely store the knife?"). We address this gap and introduce Conversational Image Segmentation (CIS) and ConverSeg, a benchmark spanning entities, spatial relations, intent, affordances, functions, safety, and physical reasoning. We also present ConverSeg-Net, which fuses strong segmentation priors with language understanding, and an AI-powered data engine that generates prompt-mask pairs without human supervision. We show that current language-guided segmentation models are inadequate for CIS, while ConverSeg-Net trained on our data engine achieves significant gains on ConverSeg and maintains strong performance on existing language-guided segmentation benchmarks. Project webpage: this https URL
3. 【2602.13191】CoPE-VideoLM: Codec Primitives For Efficient Video Language Models
Link: https://arxiv.org/abs/2602.13191
Authors: Sayan Deb Sarkar,Rémi Pautrat,Ondrej Miksik,Marc Pollefeys,Iro Armeni,Mahdi Rad,Mihai Dusmanu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Video Language Models, Language Models, understand temporal dynamics, empower AI systems, Video Language
Comments: Project Page: [this https URL](https://sayands.github.io/cope/)
Click to view abstract
Abstract:Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit to the maximum context window constraint, current methods use keyframe sampling which can miss both macro-level events and micro-level details due to the sparse temporal coverage. Furthermore, processing full images and their tokens for each frame incurs substantial computational overhead. To address these limitations, we propose to leverage video codec primitives (specifically motion vectors and residuals) which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach reduces the time-to-first-token by up to $86\%$ and token usage by up to $93\%$ compared to standard VideoLMs. Moreover, by varying the keyframe and codec primitive densities we are able to maintain or exceed performance on $14$ diverse video understanding benchmarks spanning general question answering, temporal reasoning, long-form understanding, and spatial scene understanding.
4. 【2602.13185】FlexAM: Flexible Appearance-Motion Decomposition for Versatile Video Generation Control
Link: https://arxiv.org/abs/2602.13185
Authors: Mingzhi Sheng,Zekai Gu,Peng Li,Cheng Lin,Hao-Xiang Guo,Ying-Cong Chen,Yuan Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: Effective and generalizable, video generation remains, significant challenge, generation remains, remains a significant
Comments: Codes: [this https URL](https://github.com/IGL-HKUST/FlexAM)
Click to view abstract
Abstract:Effective and generalizable control in video generation remains a significant challenge. While many methods rely on ambiguous or task-specific signals, we argue that a fundamental disentanglement of "appearance" and "motion" provides a more robust and scalable pathway. We propose FlexAM, a unified framework built upon a novel 3D control signal. This signal represents video dynamics as a point cloud, introducing three key enhancements: multi-frequency positional encoding to distinguish fine-grained motion, depth-aware positional encoding, and a flexible control signal for balancing precision and generative quality. This representation allows FlexAM to effectively disentangle appearance and motion, enabling a wide range of tasks including I2V/V2V editing, camera control, and spatial object editing. Extensive experiments demonstrate that FlexAM achieves superior performance across all evaluated tasks.
5. 【2602.13176】Monocular Markerless Motion Capture Enables Quantitative Assessment of Upper Extremity Reachable Workspace
Link: https://arxiv.org/abs/2602.13176
Authors: Seth Donahue,J.D. Peiffer,R. Tyler Richardson,Yishan Zhong,Shaun Q. Y. Tan,Benoit Marteau,Stephanie R. Russo,May D. Wang,R. James Cotton,Ross Chafetz
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: driven Markerless Motion, Artificial Intelligence, Markerless Motion Capture, driven Markerless, Upper Extremity Reachable
Comments:
Click to view abstract
Abstract:To validate a clinically accessible approach for quantifying the Upper Extremity Reachable Workspace (UERW) using a single (monocular) camera and Artificial Intelligence (AI)-driven Markerless Motion Capture (MMC) for biomechanical analysis. Objective assessment and validation of these techniques for specific clinically oriented tasks are crucial for their adoption in clinical motion analysis. AI-driven monocular MMC reduces the barriers to adoption in the clinic and has the potential to reduce the overhead for analysis of this common clinical assessment. Nine adult participants with no impairments performed the standardized UERW task, which entails reaching targets distributed across a virtual sphere centered on the torso, with targets displayed in a VR headset. Movements were simultaneously captured using a marker-based motion capture system and a set of eight FLIR cameras. We performed monocular video analysis on two of these video camera views to compare frontal and offset camera configurations. The frontal camera orientation demonstrated strong agreement with the marker-based reference, exhibiting a minimal mean bias of $0.61 \pm 0.12$ \% reachspace reached per octant (mean $\pm$ standard deviation). In contrast, the offset camera view underestimated the percent workspace reached ($-5.66 \pm 0.45$ \% reachspace reached). Conclusion: The findings support the feasibility of a frontal monocular camera configuration for UERW assessment, particularly for anterior workspace evaluation where agreement with marker-based motion capture was highest. The overall performance demonstrates clinical potential for practical, single-camera assessments. This study provides the first validation of a monocular MMC system for the assessment of the UERW task. By reducing technical complexity, this approach enables broader implementation of quantitative upper extremity mobility assessment.
6. 【2602.13172】LongStream: Long-Sequence Streaming Autoregressive Visual Geometry
Link: https://arxiv.org/abs/2602.13172
Authors: Chong Cheng,Xianda Chen,Tao Xie,Wei Yin,Weiqiang Ren,Qian Zhang,Xiaoyuang Guo,Hao Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: significant open challenge, Long-sequence streaming, open challenge, remains a significant, significant open
Comments:
Click to view abstract
Abstract:Long-sequence streaming 3D reconstruction remains a significant open challenge. Existing autoregressive models often fail when processing long sequences. They typically anchor poses to the first frame, which leads to attention decay, scale drift, and extrapolation errors. We introduce LongStream, a novel gauge-decoupled streaming visual geometry model for metric-scale scene reconstruction across thousands of frames. Our approach is threefold. First, we discard the first-frame anchor and predict keyframe-relative poses. This reformulates long-range extrapolation into a constant-difficulty local task. Second, we introduce orthogonal scale learning. This method fully disentangles geometry from scale estimation to suppress drift. Finally, we solve Transformer cache issues such as attention-sink reliance and long-term KV-cache contamination. We propose cache-consistent training combined with periodic cache refresh. This approach suppresses attention degradation over ultra-long sequences and reduces the gap between training and inference. Experiments show LongStream achieves state-of-the-art performance. It delivers stable, metric-scale reconstruction over kilometer-scale sequences at 18 FPS. Project Page: this https URL
7. 【2602.13168】Realistic Face Reconstruction from Facial Embeddings via Diffusion Models
Link: https://arxiv.org/abs/2602.13168
Authors: Dong Han,Yong Li,Joachim Denzler
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: privacy-preserving face recognition, enhanced facial privacy, facial privacy protection, accurate recognition, face recognition
Comments: Accepted to AAAI 2026
Click to view abstract
Abstract:With the advancement of face recognition (FR) systems, privacy-preserving face recognition (PPFR) systems have gained popularity for their accurate recognition, enhanced facial privacy protection, and robustness to various attacks. However, there are limited studies to further verify privacy risks by reconstructing realistic high-resolution face images from embeddings of these systems, especially for PPFR. In this work, we propose the face embedding mapping (FEM), a general framework that explores Kolmogorov-Arnold Network (KAN) for conducting the embedding-to-face attack by leveraging a pre-trained Identity-Preserving diffusion model against state-of-the-art (SOTA) FR and PPFR systems. Based on extensive experiments, we verify that reconstructed faces can be used for accessing other real-world FR systems. Besides, the proposed method shows robustness in reconstructing faces from partial and protected face embeddings. Moreover, FEM can be utilized as a tool for evaluating the safety of FR and PPFR systems in terms of privacy leakage. All images used in this work are from public datasets.
8. 【2602.13091】Universal Transformation of One-Class Classifiers for Unsupervised Anomaly Detection
Link: https://arxiv.org/abs/2602.13091
Authors: Declan McIntosh,Alexandra Branzan Albu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: including industrial inspection, Detecting anomalies, computer-assisted diagnosis, multiple real-world problems, including industrial
Comments: 6 figures, 9 pages main paper, 15 pages total with supplemental
Click to view abstract
Abstract:Detecting anomalies in images and video is an essential task for multiple real-world problems, including industrial inspection, computer-assisted diagnosis, and environmental monitoring. Anomaly detection is typically formulated as a one-class classification problem, where the training data consists solely of nominal values, leaving methods built on this assumption susceptible to training label noise. We present a dataset folding method that transforms an arbitrary one-class classifier-based anomaly detector into a fully unsupervised method. This is achieved by making a set of key weak assumptions: that anomalies are uncommon in the training dataset and generally heterogeneous. These assumptions enable us to utilize multiple independently trained instances of a one-class classifier to filter the training dataset for anomalies. This transformation requires no modifications to the underlying anomaly detector; the only changes are algorithmically selected data subsets used for training. We demonstrate that our method can transform a wide variety of one-class classifier anomaly detectors for both images and videos into unsupervised ones. Our method creates the first unsupervised logical anomaly detectors by transforming existing methods. We also demonstrate that our method achieves state-of-the-art performance for unsupervised anomaly detection on the MVTec AD, ViSA, and MVTec Loco AD datasets. As improvements to one-class classifiers are made, our method directly transfers those improvements to the unsupervised domain, linking the domains.
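The abstract does not spell out the folding procedure, so the following is only one plausible reading, sketched in pure Python: split the training set into folds, fit an independent detector on each fold, score every sample with the detectors that did not see it, and drop the highest-scoring fraction. The toy median-distance detector, the function names, and the `drop_fraction` parameter are all assumptions, not the paper's method.

```python
import statistics


def fit_detector(samples):
    """Toy one-class detector (assumption): score = distance from the median."""
    center = statistics.median(samples)
    return lambda x: abs(x - center)


def fold_filter(data, n_folds=3, drop_fraction=0.2):
    """Drop the samples that detectors trained on the other folds find most anomalous."""
    # Interleaved split so each fold sees a similar slice of the data.
    folds = [data[i::n_folds] for i in range(n_folds)]
    detectors = [fit_detector(fold) for fold in folds]

    scored = []
    for i, fold in enumerate(folds):
        # Score each sample only with detectors that were not trained on it.
        others = [d for j, d in enumerate(detectors) if j != i]
        for x in fold:
            score = sum(d(x) for d in others) / len(others)
            scored.append((score, x))

    scored.sort()
    n_keep = len(scored) - int(len(scored) * drop_fraction)
    return [x for _, x in scored[:n_keep]]


# Nominal values near 1.0 with a single contaminating sample (9.0).
train = [1.0, 1.1, 0.9, 1.2, 0.8, 1.05, 0.95, 1.0, 9.0, 1.15]
cleaned = fold_filter(train)
```

The filtered subset would then be used to train the unmodified one-class detector, which matches the abstract's point that only the training data selection changes.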
9. 【2602.13067】SIEFormer: Spectral-Interpretable and -Enhanced Transformer for Generalized Category Discovery
Link: https://arxiv.org/abs/2602.13067
Authors: Chunming Li,Shidong Wang,Tong Xin,Haofeng Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Generalized Category Discovery, challenging Generalized Category, Vision Transformer, Category Discovery, leverages spectral analysis
Comments:
Click to view abstract
Abstract:This paper presents a novel approach, Spectral-Interpretable and -Enhanced Transformer (SIEFormer), which leverages spectral analysis to reinterpret the attention mechanism within Vision Transformer (ViT) and enhance feature adaptability, with particular emphasis on challenging Generalized Category Discovery (GCD) tasks. The proposed SIEFormer is composed of two main branches, each corresponding to an implicit and explicit spectral perspective of the ViT, enabling joint optimization. The implicit branch realizes the use of different types of graph Laplacians to model the local structure correlations of tokens, along with a novel Band-adaptive Filter (BaF) layer that can flexibly perform both band-pass and band-reject filtering. The explicit branch, on the other hand, introduces a Maneuverable Filtering Layer (MFL) that learns global dependencies among tokens by applying the Fourier transform to the input "value" features, modulating the transformed signal with a set of learnable parameters in the frequency domain, and then performing an inverse Fourier transform to obtain the enhanced features. Extensive experiments reveal state-of-the-art performance on multiple image recognition datasets, reaffirming the superiority of our approach through ablation studies and visualizations.
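The transform, modulate, inverse-transform pattern described for the explicit branch can be shown on a toy 1-D signal. This is a minimal sketch, not the paper's layer: it uses a naive discrete Fourier transform over a short list of scalars, and the per-frequency `weights` are fixed numbers standing in for learnable parameters. All names are hypothetical.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform (O(n^2)), fine for toy inputs."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of `dft`."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def frequency_filter(values, weights):
    """Scale each frequency component by its weight, then return to the token domain."""
    spectrum = dft(values)
    modulated = [w * s for w, s in zip(weights, spectrum)]
    # Real part only: weights that break conjugate symmetry would leave an
    # imaginary residue, which this toy version simply discards.
    return [v.real for v in idft(modulated)]


tokens = [1.0, 2.0, 3.0, 4.0]
# Keeping only the zero-frequency component replaces every token by the mean.
smoothed = frequency_filter(tokens, [1, 0, 0, 0])
# All-ones weights leave the signal unchanged (up to rounding).
unchanged = frequency_filter(tokens, [1, 1, 1, 1])
```

Because every output position depends on every frequency component, a single multiplication in the frequency domain mixes information across all tokens, which is the sense in which such a layer captures global dependencies.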
10. 【2602.13066】A Calibrated Memorization Index (MI) for Detecting Training Data Leakage in Generative MRI Models
Link: https://arxiv.org/abs/2602.13066
Authors: Yash Deo,Yan Jia,Toni Lassila,Victoria J Hodge,Alejandro F Frang,Chenghao Qian,Siyuan Kang,Ibrahim Habli
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: medical image generation, lead to privacy, privacy concerns, training data, Novelty Index
Comments: Accepted in ISBI 2026
Click to view abstract
Abstract:Image generative models are known to duplicate images from the training data as part of their outputs, which can lead to privacy concerns when used for medical image generation. We propose a calibrated per-sample metric for detecting memorization and duplication of training data. Our metric uses image features extracted using an MRI foundation model, aggregates multi-layer whitened nearest-neighbor similarities, and maps them to a bounded Overfit/Novelty Index (ONI) and Memorization Index (MI) scores. Across three MRI datasets with controlled duplication percentages and typical image augmentations, our metric robustly detects duplication and provides more consistent metric values across datasets. At the sample level, our metric achieves near-perfect detection of duplicates.
11. 【2602.13055】Curriculum-DPO++: Direct Preference Optimization via Data and Model Curricula for Text-to-Image Generation
Link: https://arxiv.org/abs/2602.13055
Authors: Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, Nicu Sebe, Mubarak Shah
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Direct Preference Optimization, Direct Preference, Preference Optimization, preference optimization approaches, effective and efficient
Comments: arXiv admin note: substantial text overlap with [arXiv:2405.13637](https://arxiv.org/abs/2405.13637)
Abstract:Direct Preference Optimization (DPO) has been proposed as an effective and efficient alternative to reinforcement learning from human feedback (RLHF). However, neither RLHF nor DPO take into account the fact that learning certain preferences is more difficult than learning other preferences, rendering the optimization process suboptimal. To address this gap in text-to-image generation, we recently proposed Curriculum-DPO, a method that organizes image pairs by difficulty. In this paper, we introduce Curriculum-DPO++, an enhanced method that combines the original data-level curriculum with a novel model-level curriculum. More precisely, we propose to dynamically increase the learning capacity of the denoising network as training advances. We implement this capacity increase via two mechanisms. First, we initialize the model with only a subset of the trainable layers used in the original Curriculum-DPO. As training progresses, we sequentially unfreeze layers until the configuration matches the full baseline architecture. Second, as the fine-tuning is based on Low-Rank Adaptation (LoRA), we implement a progressive schedule for the dimension of the low-rank matrices. Instead of maintaining a fixed capacity, we initialize the low-rank matrices with a dimension significantly smaller than that of the baseline. As training proceeds, we incrementally increase their rank, allowing the capacity to grow until it converges to the same rank value as in Curriculum-DPO. Furthermore, we propose an alternative ranking strategy to the one employed by Curriculum-DPO. Finally, we compare Curriculum-DPO++ against Curriculum-DPO and other state-of-the-art preference optimization approaches on nine benchmarks, outperforming the competing methods in terms of text alignment, aesthetics and human preference. Our code is available at this https URL.
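The model-level curriculum (unfreezing layers step by step and growing the LoRA rank over training) can be pictured as a staged capacity schedule. The sketch below is only an illustration under stated assumptions: the function name `staged_capacity`, the equal-length stages, and the linear growth rule are hypothetical and are not taken from the paper.

```python
def staged_capacity(step: int, total_steps: int, start: int, final: int, stages: int) -> int:
    """Capacity (e.g. LoRA rank or number of trainable layers) at a training step.

    Training is split into `stages` equal phases; capacity grows linearly
    from `start` in the first phase to `final` in the last phase.
    """
    if total_steps <= 0 or stages < 1 or not 0 <= step <= total_steps:
        raise ValueError("invalid schedule arguments")
    if stages == 1:
        return final
    # Index of the current phase, clamped so the last step stays in the final phase.
    phase = min(stages - 1, step * stages // total_steps)
    return start + (final - start) * phase // (stages - 1)
```

For example, `[staged_capacity(s, 100, 4, 16, 4) for s in (0, 25, 50, 75)]` yields ranks 4, 8, 12, 16. A real implementation would also have to initialize the newly added low-rank rows and columns so that the model output does not jump when the rank grows; that part is not covered here.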
12. 【2602.13041】Implicit-Scale 3D Reconstruction for Multi-Food Volume Estimation from Monocular Images
Link: https://arxiv.org/abs/2602.13041
Authors: Yuhao Chen, Gautham Vinod, Siddeshwar Raghavan, Talha Ibn Mahmud, Bruce Coburn, Jinge Ma, Fengqing Zhu, Jiangpeng He
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: realistic dining scenarios, Monocular Multi-Food Images, food portion estimation, dining scenarios, Multi-Food Images
Comments: Paper accepted to 2026 IEEE Southwest Symposium on Image Analysis and Interpretation. The dataset can be downloaded at: [this https URL](https://www.kaggle.com/competitions/3d-reconstruction-from-monocular-multi-food-images/data)
Abstract:We present Implicit-Scale 3D Reconstruction from Monocular Multi-Food Images, a benchmark dataset designed to advance geometry-based food portion estimation in realistic dining scenarios. Existing dietary assessment methods largely rely on single-image analysis or appearance-based inference, including recent vision-language models, which lack explicit geometric reasoning and are sensitive to scale ambiguity. This benchmark reframes food portion estimation as an implicit-scale 3D reconstruction problem under monocular observations. To reflect real-world conditions, explicit physical references and metric annotations are removed; instead, contextual objects such as plates and utensils are provided, requiring algorithms to infer scale from implicit cues and prior knowledge. The dataset emphasizes multi-food scenes with diverse object geometries, frequent occlusions, and complex spatial arrangements. The benchmark was adopted as a challenge at the MetaFood 2025 Workshop, where multiple teams proposed reconstruction-based solutions. Experimental results show that while strong vision-language baselines achieve competitive performance, geometry-based reconstruction methods provide both improved accuracy and greater robustness, with the top-performing approach achieving 0.21 MAPE in volume estimation and 5.7 L1 Chamfer Distance in geometric accuracy.
13. 【2602.13030】Resource-Efficient Gesture Recognition through Convexified Attention
Link: https://arxiv.org/abs/2602.13030
Authors: Daniel Schwartz, Dario Salvucci, Yusuf Osmanlioglu, Richard Vallett, Genevieve Dion, Ali Shokoufandeh
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
Keywords: face severe constraints, make traditional deep, deep learning impractical, computational capacity, traditional deep learning
Comments: 22 pages, 3 figures, EICS 2026
Abstract:Wearable e-textile interfaces require gesture recognition capabilities but face severe constraints in power consumption, computational capacity, and form factor that make traditional deep learning impractical. While lightweight architectures like MobileNet improve efficiency, they still demand thousands of parameters, limiting deployment on textile-integrated platforms. We introduce a convexified attention mechanism for wearable applications that dynamically weights features while preserving convexity through nonexpansive simplex projection and convex loss functions. Unlike conventional attention mechanisms using non-convex softmax operations, our approach employs Euclidean projection onto the probability simplex combined with multi-class hinge loss, ensuring global convergence guarantees. Implemented on a textile-based capacitive sensor with four connection points, our approach achieves 100.00% accuracy on tap gestures and 100.00% on swipe gestures (consistent across 10-fold cross-validation and held-out test evaluation) while requiring only 120-360 parameters, a 97% reduction compared to conventional approaches. With sub-millisecond inference times (290-296 µs) and minimal storage requirements (<7KB), our method enables gesture interfaces directly within e-textiles without external processing. Our evaluation, conducted in controlled laboratory conditions with a single-user dataset, demonstrates feasibility for basic gesture interactions. Real-world deployment would require validation across multiple users, environmental conditions, and more complex gesture vocabularies. These results demonstrate how convex optimization can enable efficient on-device machine learning for textile interfaces.
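The Euclidean projection onto the probability simplex mentioned in this abstract has a well-known sort-and-threshold solution. The sketch below shows that standard algorithm, not necessarily the authors' implementation; the function name `project_to_simplex` is an assumption.

```python
from typing import List, Sequence


def project_to_simplex(v: Sequence[float]) -> List[float]:
    """Euclidean projection of v onto {w : w_i >= 0, sum(w) == 1}."""
    if not v:
        raise ValueError("input must be non-empty")
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        candidate = (cumulative - 1.0) / k
        # The condition holds for a prefix of the sorted values; the last
        # index where it holds determines the shift theta.
        if u_k - candidate > 0:
            theta = candidate
    # Shift every coordinate by theta and clip negatives to zero.
    return [max(x - theta, 0.0) for x in v]
```

Unlike softmax, the projection can return exact zeros, so `project_to_simplex([2.0, 0.0])` puts all weight on the first feature, while equal scores are mapped to uniform weights.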
14. 【2602.13028】Human-Aligned MLLM Judges for Fine-Grained Image Editing Evaluation: A Benchmark, Framework, and Analysis
Link: https://arxiv.org/abs/2602.13028
Authors: Runzhou Liu(1), Hailey Weingord(2), Sejal Mittal(2), Prakhar Dungarwal(2), Anusha Nandula(2), Bo Ni(3), Samyadeep Basu(4), Hongjie Chen(5), Nesreen K. Ahmed(6), Li Li(7), Jiayi Zhang(8), Koustava Goswami(4), Subhojyoti Mukherjee(4), Branislav Kveton(4), Puneet Mathur(4), Franck Dernoncourt(4), Yue Zhao(7), Yu Wang(9), Ryan A. Rossi(4), Zhengzhong Tu(10), Hongru Du(1) ((1) University of Virginia, (2) Columbia University, (3) Vanderbilt University, (4) Adobe Research, (5) Dolby Laboratories, (6) Cisco Research, (7) University of Southern California, (8) University of Wisconsin-Madison, (9) University of Oregon, (10) Texas A&M University)
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: remains challenging due, capture aspects important, Evaluating image editing, models remains challenging, Evaluating image
Comments:
Abstract:Evaluating image editing models remains challenging due to the coarse granularity and limited interpretability of traditional metrics, which often fail to capture aspects important to human perception and intent. Such metrics frequently reward visually plausible outputs while overlooking controllability, edit localization, and faithfulness to user instructions. In this work, we introduce a fine-grained Multimodal Large Language Model (MLLM)-as-a-Judge framework for image editing that decomposes common evaluation notions into twelve fine-grained interpretable factors spanning image preservation, edit quality, and instruction fidelity. Building on this formulation, we present a new human-validated benchmark that integrates human judgments, MLLM-based evaluations, model outputs, and traditional metrics across diverse image editing tasks. Through extensive human studies, we show that the proposed MLLM judges align closely with human evaluations at a fine granularity, supporting their use as reliable and scalable evaluators. We further demonstrate that traditional image editing metrics are often poor proxies for these factors, failing to distinguish over-edited or semantically imprecise outputs, whereas our judges provide more intuitive and informative assessments in both offline and online settings. Together, this work introduces a benchmark, a principled factorization, and empirical evidence positioning fine-grained MLLM judges as a practical foundation for studying, comparing, and improving image editing approaches.
15. 【2602.13024】FedHENet: A Frugal Federated Learning Framework for Heterogeneous Environments
Link: https://arxiv.org/abs/2602.13024
Authors: Alejandro Dopico-Castro, Oscar Fontenla-Romero, Bertha Guijarro-Berdiñas, Amparo Alonso-Betanzos, Iván Pérez Digón
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: enables collaborative training, sensitive visual information, real-world scenarios involving, scenarios involving sensitive, involving sensitive visual
Comments: Accepted for publication at the 34th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2026)
Abstract:Federated Learning (FL) enables collaborative training without centralizing data, essential for privacy compliance in real-world scenarios involving sensitive visual information. Most FL approaches rely on expensive, iterative deep network optimization, which still risks privacy via shared gradients. In this work, we propose FedHENet, extending the FedHEONN framework to image classification. By using a fixed, pre-trained feature extractor and learning only a single output layer, we avoid costly local fine-tuning. This layer is learned by analytically aggregating client knowledge in a single round of communication using homomorphic encryption (HE). Experiments show that FedHENet achieves competitive accuracy compared to iterative FL baselines while demonstrating superior stability performance and up to 70% better energy efficiency. Crucially, our method is hyperparameter-free, removing the carbon footprint associated with hyperparameter tuning in standard FL. Code available in this https URL
16. 【2602.13022】Learning Image-based Tree Crown Segmentation from Enhanced Lidar-based Pseudo-labels
Link: https://arxiv.org/abs/2602.13022
Authors: Julius Pesonen, Stefan Rua, Josef Taher, Niko Koivumäki, Xiaowei Yu, Eija Honkavaara
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: monitoring forest health, Mapping individual tree, maintaining urban tree, urban tree inventories, Mapping individual
Comments:
Abstract:Mapping individual tree crowns is essential for tasks such as maintaining urban tree inventories and monitoring forest health, which help us understand and care for our environment. However, automatically separating the crowns from each other in aerial imagery is challenging due to factors such as the texture and partial tree crown overlaps. In this study, we present a method to train deep learning models that segment and separate individual trees from RGB and multispectral images, using pseudo-labels derived from aerial laser scanning (ALS) data. Our study shows that the ALS-derived pseudo-labels can be enhanced using a zero-shot instance segmentation model, Segment Anything Model 2 (SAM 2). Our method offers a way to obtain domain-specific training annotations for optical image-based models without any manual annotation cost, leading to segmentation models which outperform any available models which have been targeted for general domain deployment on the same task.
17. 【2602.13020】DynaGuide: A Generalizable Dynamic Guidance Framework for Unsupervised Semantic Segmentation
Link: https://arxiv.org/abs/2602.13020
Authors: Boujemaa Guermazi, Riadh Ksantini, Naimul Khan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: computer vision, critical task, task in computer, Unsupervised image segmentation, Unsupervised image
Comments: Accepted at Image and Vision Computing
Abstract:Unsupervised image segmentation is a critical task in computer vision. It enables dense scene understanding without human annotations, which is especially valuable in domains where labelled data is scarce. However, existing methods often struggle to reconcile global semantic structure with fine-grained boundary accuracy. This paper introduces DynaGuide, an adaptive segmentation framework that addresses these challenges through a novel dual-guidance strategy and dynamic loss optimization. Building on our previous work, DynaSeg, DynaGuide combines global pseudo-labels from zero-shot models such as DiffSeg or SegFormer with local boundary refinement using a lightweight CNN trained from scratch. This synergy allows the model to correct coarse or noisy global predictions and produce high-precision segmentations. At the heart of DynaGuide is a multi-component loss that dynamically balances feature similarity, Huber-smoothed spatial continuity, including diagonal relationships, and semantic alignment with the global pseudo-labels. Unlike prior approaches, DynaGuide trains entirely without ground-truth labels in the target domain and supports plug-and-play integration of diverse guidance sources. Extensive experiments on BSD500, PASCAL VOC2012, and COCO demonstrate that DynaGuide achieves state-of-the-art performance, improving mIoU by 17.5% on BSD500, 3.1% on PASCAL VOC2012, and 11.66% on COCO. With its modular design, strong generalization, and minimal computational footprint, DynaGuide offers a scalable and practical solution for unsupervised segmentation in real-world settings. Code available at: this https URL
18. 【2602.13015】Multimodal Classification via Total Correlation Maximization
Link: https://arxiv.org/abs/2602.13015
Authors: Feng Yu, Xiangyu Wu, Yang Yang, Jianfeng Lu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: effectively harness information, learning integrates data, unimodal learning, integrates data, data from diverse
Comments: Accepted for publication at ICLR 2026; 19 pages; 2 figures
Abstract:Multimodal learning integrates data from diverse sensors to effectively harness information from different modalities. However, recent studies reveal that joint learning often overfits certain modalities while neglecting others, leading to performance inferior to that of unimodal learning. Although previous efforts have sought to balance modal contributions or combine joint and unimodal learning, thereby mitigating the degradation of weaker modalities with promising outcomes, few have examined the relationship between joint and unimodal learning from an information-theoretic perspective. In this paper, we theoretically analyze modality competition and propose a method for multimodal classification by maximizing the total correlation between multimodal features and labels. By maximizing this objective, our approach alleviates modality competition while capturing inter-modal interactions via feature alignment. Building on Mutual Information Neural Estimation (MINE), we introduce Total Correlation Neural Estimation (TCNE) to derive a lower bound for total correlation. Subsequently, we present TCMax, a hyperparameter-free loss function that maximizes total correlation through variational bound optimization. Extensive experiments demonstrate that TCMax outperforms state-of-the-art joint and unimodal learning approaches. Our code is available at this https URL.
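Total correlation, the quantity maximized here, is the sum of the marginal entropies minus the joint entropy. The sketch below computes a plug-in estimate for small discrete samples to make the definition concrete. It is not the paper's TCNE, which derives a neural lower bound for continuous features; the function names are assumptions.

```python
import math
from collections import Counter
from typing import Hashable, Sequence, Tuple


def entropy(samples: Sequence[Hashable]) -> float:
    """Empirical Shannon entropy in bits."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())


def total_correlation(rows: Sequence[Tuple[Hashable, ...]]) -> float:
    """TC = sum of per-variable entropies minus the joint entropy.

    Each row is one joint observation, e.g. (modality_1, modality_2, label).
    """
    columns = list(zip(*rows))
    return sum(entropy(col) for col in columns) - entropy(list(rows))
```

Three perfectly coupled binary variables give 2 bits of total correlation (three 1-bit marginals minus a 1-bit joint), whereas two independent fair bits give 0.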
19. 【2602.13013】Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions
Link: https://arxiv.org/abs/2602.13013
Authors: Yunheng Li, Hengrui Zhang, Meng-Hao Guo, Wenzhao Gao, Shaoyong Jia, Shaohui Jiao, Qibin Hou, Ming-Ming Cheng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: diverse real-world scenarios, Universal video understanding, understanding requires modeling, requires modeling fine-grained, modeling fine-grained visual
Comments: Project page: [this https URL](https://asid-caption.github.io/)
Abstract:Universal video understanding requires modeling fine-grained visual and audio information over time in diverse real-world scenarios. However, the performance of existing models is primarily constrained by video-instruction data that represents complex audiovisual content as single, incomplete descriptions, lacking fine-grained organization and reliable annotation. To address this, we introduce: (i) ASID-1M, an open-source collection of one million structured, fine-grained audiovisual instruction annotations with single- and multi-attribute supervision; (ii) ASID-Verify, a scalable data curation pipeline for annotation, with automatic verification and refinement that enforces semantic and temporal consistency between descriptions and the corresponding audiovisual content; and (iii) ASID-Captioner, a video understanding model trained via Supervised Fine-Tuning (SFT) on the ASID-1M. Experiments across seven benchmarks covering audiovisual captioning, attribute-wise captioning, caption-based QA, and caption-based temporal grounding show that ASID-Captioner improves fine-grained caption quality while reducing hallucinations and improving instruction following. It achieves state-of-the-art performance among open-source models and is competitive with Gemini-3-Pro.
20. 【2602.13003】MASAR: Motion-Appearance Synergy Refinement for Joint Detection and Trajectory Forecasting
Link: https://arxiv.org/abs/2602.13003
Authors: Mohammed Amine Bencheikh Lehocine, Julian Schmidt, Frank Moosmann, Dikshant Gupta, Fabian Flohr
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Classical autonomous driving, limiting information flow, hand-crafted bounding-box interfaces, autonomous driving systems, driving systems connect
Comments: Accepted to the 2026 IEEE International Conference on Robotics and Automation (ICRA 2026)
Abstract:Classical autonomous driving systems connect perception and prediction modules via hand-crafted bounding-box interfaces, limiting information flow and propagating errors to downstream tasks. Recent research aims to develop end-to-end models that jointly address perception and prediction; however, they often fail to fully exploit the synergy between appearance and motion cues, relying mainly on short-term visual features. We follow the idea of "looking backward to look forward", and propose MASAR, a novel fully differentiable framework for joint 3D detection and trajectory forecasting compatible with any transformer-based 3D detector. MASAR employs an object-centric spatio-temporal mechanism that jointly encodes appearance and motion features. By predicting past trajectories and refining them using guidance from appearance cues, MASAR captures long-term temporal dependencies that enhance future trajectory forecasting. Experiments conducted on the nuScenes dataset demonstrate MASAR's effectiveness, showing improvements of over 20% in minADE and minFDE while maintaining robust detection performance. Code and models are available at this https URL.
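The minADE and minFDE metrics cited in this abstract can be computed as sketched below for one agent with several predicted trajectory modes. This assumes the common convention that each metric is minimized over modes independently; benchmark-specific details (such as the number of modes or how matching with detections is done) are not covered, and the function name is an assumption.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def min_ade_fde(modes: Sequence[Sequence[Point]], truth: Sequence[Point]) -> Tuple[float, float]:
    """Return (minADE, minFDE) over all predicted modes for one agent."""
    if not modes or not truth:
        raise ValueError("need at least one mode and one ground-truth point")
    ades: List[float] = []
    fdes: List[float] = []
    for trajectory in modes:
        if len(trajectory) != len(truth):
            raise ValueError("each mode must match the ground-truth horizon")
        distances = [math.dist(p, q) for p, q in zip(trajectory, truth)]
        ades.append(sum(distances) / len(distances))  # average displacement error
        fdes.append(distances[-1])                    # final displacement error
    return min(ades), min(fdes)
```

Note that the two minima may come from different modes: one mode can track the path well on average while another ends closer to the true final position.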
21. 【2602.12983】Detecting Object Tracking Failure via Sequential Hypothesis Testing
Link: https://arxiv.org/abs/2602.12983
Authors: Alejandro Monroy Muñoz, Rajeev Verma, Alexander Timans
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: wide-ranging applications including, including video surveillance, applications including video, motion capture, Real-time online object
Comments: Accepted in WACV workshop "Real World Surveillance: Applications and Challenges, 6th"
Abstract:Real-time online object tracking in videos constitutes a core task in computer vision, with wide-ranging applications including video surveillance, motion capture, and robotics. Deployed tracking systems usually lack formal safety assurances to convey when tracking is reliable and when it may fail, at best relying on heuristic measures of model confidence to raise alerts. To obtain such assurances we propose interpreting object tracking as a sequential hypothesis test, wherein evidence for or against tracking failures is gradually accumulated over time. Leveraging recent advancements in the field, our sequential test (formalized as an e-process) quickly identifies when tracking failures set in whilst provably containing false alerts at a desired rate, and thus limiting potentially costly re-calibration or intervention steps. The approach is computationally light-weight, requires no extra training or fine-tuning, and is in principle model-agnostic. We propose both supervised and unsupervised variants by leveraging either ground-truth or solely internal tracking information, and demonstrate its effectiveness for two established tracking models across four video benchmarks. As such, sequential testing can offer a statistically grounded and efficient mechanism to incorporate safety assurances into real-time tracking systems.
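To make the idea of accumulating evidence over time concrete, here is a minimal sketch of a generic betting-style test supermartingale. It is not the paper's construction. It assumes each frame yields a bounded failure score in [0, 1] and tests the null hypothesis that the conditional mean score stays at or below a tolerated level `m`; the names `first_alert`, `m`, `lam`, and `alpha` are hypothetical.

```python
from typing import Iterable, Optional


def first_alert(scores: Iterable[float], m: float = 0.1,
                lam: float = 2.0, alpha: float = 0.05) -> Optional[int]:
    """Index of the first frame at which the null is rejected, or None.

    The wealth process multiplies factors 1 + lam * (x - m). Under the null
    each factor has conditional mean <= 1, and 0 <= lam <= 1/m keeps the
    wealth non-negative, so by Ville's inequality the chance of ever
    reaching 1/alpha is at most alpha.
    """
    if not 0.0 < m < 1.0 or not 0.0 <= lam <= 1.0 / m or not 0.0 < alpha < 1.0:
        raise ValueError("invalid test parameters")
    wealth = 1.0
    for t, x in enumerate(scores):
        if not 0.0 <= x <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
        wealth *= 1.0 + lam * (x - m)
        if wealth >= 1.0 / alpha:
            return t
    return None
```

With the defaults, a run of failure-free frames shrinks the wealth and never triggers an alert, while a run of consecutive failures crosses the threshold of 20 after three frames.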
22. 【2602.12957】Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding
Link: https://arxiv.org/abs/2602.12957
Authors: Wenhui Liao, Hongliang Li, Pengyu Xie, Xinyu Cai, Yufan Shen, Yi Xin, Qi Qin, Shenglong Ye, Tianbin Li, Ming Hu, Junjun He, Yihao Liu, Wenhai Wang, Min Dou, Bin Fu, Botian Shi, Yu Qiao, Lianwen Jin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: intelligent document analysis, multimodal understanding, supporting a wide, wide range, range of downstream
Comments: Preliminary version of an ongoing project; the paper will be refined and extended in subsequent revisions
Abstract:Document parsing is a fundamental task in multimodal understanding, supporting a wide range of downstream applications such as information extraction and intelligent document analysis. Benefiting from strong semantic modeling and robust generalization, VLM-based end-to-end approaches have emerged as the mainstream paradigm in recent years. However, these models often suffer from substantial inference latency, as they must auto-regressively generate long token sequences when processing long-form documents. In this work, motivated by the extremely long outputs and complex layout structures commonly found in document parsing, we propose a training-free and highly efficient acceleration method. Inspired by speculative decoding, we employ a lightweight document parsing pipeline as a draft model to predict batches of future tokens, while the more accurate VLM verifies these draft predictions in parallel. Moreover, we further exploit the layout-structured nature of documents by partitioning each page into independent regions, enabling parallel decoding of each region using the same draft-verify strategy. The final predictions are then assembled according to the natural reading order. Experimental results demonstrate the effectiveness of our approach: on the general-purpose OmniDocBench, our method provides a 2.42x lossless acceleration for the this http URL model, and achieves up to 4.89x acceleration on long-document parsing tasks. We will release our code to facilitate reproducibility and future research.
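The draft-then-verify loop behind speculative decoding can be sketched in a few lines for the greedy case. This is a toy illustration under stated assumptions, not the paper's hierarchical method: `target_next` and `draft_propose` are hypothetical callables standing in for the VLM and the lightweight draft pipeline, verification is done token by token rather than in one batched pass, and region-level parallelism is omitted.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]
Drafter = Callable[[Sequence[int], int], List[int]]


def speculative_decode(target_next: NextToken, draft_propose: Drafter,
                       prompt: Sequence[int], max_new: int, block: int = 4) -> List[int]:
    """Greedy draft-then-verify decoding; output equals plain greedy decoding."""
    if block < 1:
        raise ValueError("block must be at least 1")
    out = list(prompt)
    limit = len(out) + max_new
    while len(out) < limit:
        budget = min(block, limit - len(out))
        proposal = draft_propose(out, budget)[:budget]
        if not proposal:
            # No draft available: fall back to one ordinary target step.
            out.append(target_next(out))
            continue
        for token in proposal:
            # A real system would score all draft positions in one batched
            # forward pass; this loop checks them one at a time for clarity.
            expected = target_next(out)
            out.append(expected)
            if token != expected:
                break  # discard the rest of the draft after the first mismatch
    return out
```

Because every emitted token is the target model's own greedy choice, the result is identical to decoding with the target alone, which is the sense in which such acceleration is lossless; the speedup in practice comes from verifying accepted draft tokens in parallel.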
23. 【2602.12952】Transporting Task Vectors across Different Architectures without Training
Link: https://arxiv.org/abs/2602.12952
Authors: Filippo Rinaldi, Aniello Panariello, Giacomo Salici, Angelo Porrello, Simone Calderara
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Adapting large pre-trained, Adapting large, large pre-trained models, large pre-trained, expensive to relearn
Comments:
Abstract:Adapting large pre-trained models to downstream tasks often produces task-specific parameter updates that are expensive to relearn for every model variant. While recent work has shown that such updates can be transferred between models with identical architectures, transferring them across models of different widths remains largely unexplored. In this work, we introduce Theseus, a training-free method for transporting task-specific updates across heterogeneous models. Rather than matching parameters directly, we characterize a task update by the functional effect it induces on intermediate representations. We formalize task-vector transport as a functional matching problem on observed activations and show that, after aligning representation spaces via orthogonal Procrustes analysis, it admits a stable closed-form solution that preserves the geometry of the update. We evaluate Theseus on vision and language models across different widths, showing consistent improvements over strong baselines without additional training or backpropagation. Our results show that task updates can be meaningfully transferred across architectures when task identity is defined functionally rather than parametrically.
24. 【2602.12936】Unleashing MLLMs on the Edge: A Unified Framework for Cross-Modal ReID via Adaptive SVD Distillation
Link: https://arxiv.org/abs/2602.12936
Authors: Hongbo Jiang, Jie Li, Xinqi Cai, Tianyu Xie, Yunhang Shen, Pingyang Dai, Liujuan Cao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: faces challenges due, Practical cloud-edge deployment, Large Language Models, Practical cloud-edge, Multi-Modal Large Language
Comments: Equal contribution by Jie Li
Abstract:Practical cloud-edge deployment of Cross-Modal Re-identification (CM-ReID) faces challenges due to maintaining a fragmented ecosystem of specialized cloud models for diverse modalities. While Multi-Modal Large Language Models (MLLMs) offer strong unification potential, existing approaches fail to adapt them into a single end-to-end backbone and lack effective knowledge distillation strategies for edge deployment. To address these limitations, we propose MLLMEmbed-ReID, a unified framework based on a powerful cloud-edge architecture. First, we adapt a foundational MLLM into a state-of-the-art cloud model. We leverage instruction-based prompting to guide the MLLM in generating a unified embedding space across RGB, infrared, sketch, and text modalities. This model is then trained efficiently with a hierarchical Low-Rank Adaptation finetuning (LoRA-SFT) strategy, optimized under a holistic cross-modal alignment objective. Second, to deploy its knowledge onto an edge-native student, we introduce a novel distillation strategy motivated by the low-rank property in the teacher's feature space. To prioritize essential information, this method employs a Principal Component Mapping loss, while relational structures are preserved via a Feature Relation loss. Our lightweight edge-based model achieves state-of-the-art performance on multiple visual CM-ReID benchmarks, while its cloud-based counterpart excels across all CM-ReID benchmarks. The MLLMEmbed-ReID framework thus presents a complete and effective solution for deploying unified MLLM-level intelligence on resource-constrained devices. The code and models will be open-sourced soon.
25. 【2602.12933】Deep-Learning Atlas Registration for Melanoma Brain Metastases: Preserving Pathology While Enabling Cohort-Level Analyses
Link: https://arxiv.org/abs/2602.12933
Authors: Nanna E. Wielenberg, Ilinca Popp, Oliver Blanck, Lucas Zander, Jan C. Peeken, Stephanie E. Combs, Anca-Ligia Grosu, Dimos Baltas, Tobias Fechter
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Medical Physics (physics.med-ph)
Keywords: spatially heterogeneous lesions, differing MRI protocols, Melanoma brain metastases, complicating cohort-level analyses, cohort-level analyses due
Comments:
Abstract:Melanoma brain metastases (MBM) are common and spatially heterogeneous lesions, complicating cohort-level analyses due to anatomical variability and differing MRI protocols. We propose a fully differentiable, deep-learning-based deformable registration framework that aligns individual pathological brains to a common atlas while preserving metastatic tissue without requiring lesion masks or preprocessing. Missing anatomical correspondences caused by metastases are handled through a forward-model similarity metric based on distance-transformed anatomical labels, combined with a volume-preserving regularization term to ensure deformation plausibility. Registration performance was evaluated using Dice coefficient (DSC), Hausdorff distance (HD), average symmetric surface distance (ASSD), and Jacobian-based measures. The method was applied to 209 MBM patients from three centres, enabling standardized mapping of metastases to anatomical, arterial, and perfusion atlases. The framework achieved high registration accuracy across datasets (DSC 0.89-0.92, HD 6.79-7.60 mm, ASSD 0.63-0.77 mm) while preserving metastatic volumes. Spatial analysis demonstrated significant over-representation of MBM in the cerebral cortex and putamen, under-representation in white matter, and consistent localization near the gray-white matter junction. No arterial territory showed increased metastasis frequency after volume correction. This approach enables robust atlas registration of pathological brain MRI without lesion masks and supports reproducible multi-centre analyses. Applied to MBM, it confirms and refines known spatial predilections, particularly preferential seeding near the gray-white matter junction and cortical regions. The publicly available implementation facilitates reproducible research and extension to other brain tumours and neurological pathologies.
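For readers unfamiliar with the Dice coefficient (DSC) reported above, a minimal sketch on flattened binary masks follows. The function name and the convention of returning 1.0 when both masks are empty are assumptions; the paper's evaluation code may differ.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], truth: Sequence[int]) -> float:
    """DSC = 2 * |A ∩ B| / (|A| + |B|) for two flattened binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks are treated as a perfect match (an assumed convention).
    return 1.0 if size_sum == 0 else 2.0 * intersection / size_sum
```

For example, masks `[1, 1, 0, 0]` and `[1, 0, 1, 0]` share one voxel out of two each, giving a DSC of 0.5.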
26. 【2602.12922】Beyond Benchmarks of IUGC: Rethinking Requirements of Deep Learning Methods for Intrapartum Ultrasound Biometry from Fetal Ultrasound Videos
Link: https://arxiv.org/abs/2602.12922
Authors: Jieyun Bai, Zihao Zhou, Yitong Tang, Jie Gan, Zhuonan Liang, Jianan Fan, Lisa B. Mcguire, Jillian L. Clarke, Weidong Cai, Jacaueline Spurway, Yubo Tang, Shiye Wang, Wenda Shen, Wangwang Yu, Yihao Li, Philippe Zhang, Weili Jiang, Yongjie Li, Salem Muhsin Ali Binqahal Al Nasim, Arsen Abzhanov, Numan Saeed, Mohammad Yaqub, Zunhui Xian, Hongxing Lin, Libin Lan, Jayroop Ramesh, Valentin Bacher, Mark Eid, Hoda Kalabizadeh, Christian Rupprecht, Ana I. L. Namburete, Pak-Hei Yeung, Madeleine K. Wyburd, Nicola K. Dinsdale, Assanali Serikbey, Jiankai Li, Sung-Liang Chen, Zicheng Hu, Nana Liu, Yian Deng, Wei Hu, Cong Tan, Wenfeng Zhang, Mai Tuyet Nhi, Gregor Koehler, Rapheal Stock, Klaus Maier-Hein, Marawan Elbatel, Xiaomeng Li, Saad Slimani, Victor M. Campello, Benard Ohene-Botwe, Isaac Khobo, Yuxin Huang, Zhenyan Han, Hongying Hou, Di Qiu, Zheng Zheng, Gongning Luo, Dong Ni, Yaosheng Lu, Karim Lekadir, Shuo Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: neonatal deaths, maternal deaths, Intrapartum Ultrasound Grand, substantial proportion, burden in low
Comments:
Abstract:A substantial proportion (45%) of maternal deaths, neonatal deaths, and stillbirths occur during the intrapartum phase, with a particularly high burden in low- and middle-income countries. Intrapartum biometry plays a critical role in monitoring labor progression; however, the routine use of ultrasound in resource-limited settings is hindered by a shortage of trained sonographers. To address this challenge, the Intrapartum Ultrasound Grand Challenge (IUGC), co-hosted with MICCAI 2024, was launched. The IUGC introduces a clinically oriented multi-task automatic measurement framework that integrates standard plane classification, fetal head-pubic symphysis segmentation, and biometry, enabling algorithms to exploit complementary task information for more accurate estimation. Furthermore, the challenge releases the largest multi-center intrapartum ultrasound video dataset to date, comprising 774 videos (68,106 frames) collected from three hospitals, providing a robust foundation for model training and evaluation. In this study, we present a comprehensive overview of the challenge design, review the submissions from eight participating teams, and analyze their methods from five perspectives: preprocessing, data augmentation, learning strategy, model architecture, and post-processing. In addition, we perform a systematic analysis of the benchmark results to identify key bottlenecks, explore potential solutions, and highlight open challenges for future research. Although encouraging performance has been achieved, our findings indicate that the field remains at an early stage, and further in-depth investigation is required before large-scale clinical deployment. All benchmark solutions and the complete dataset have been publicly released to facilitate reproducible research and promote continued advances in automatic intrapartum ultrasound biometry.
27. 【2602.12919】EPRBench: A High-Quality Benchmark Dataset for Event Stream Based Visual Place Recognition
Link: https://arxiv.org/abs/2602.12919
Authors: Xiao Wang,Xingxing Xiong,Jinfeng Gao,Xufeng Lou,Bo Jiang,Si-bao Chen,Yaowei Wang,Yonghong Tian
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Keywords: Event stream-based Visual, conventional visible-light cameras, stream-based Visual Place, Visual Place Recognition, emerging research direction
Comments:
Abstract:Event stream-based Visual Place Recognition (VPR) is an emerging research direction that offers a compelling solution to the instability of conventional visible-light cameras under challenging conditions such as low illumination, overexposure, and high-speed motion. Recognizing the current scarcity of dedicated datasets in this domain, we introduce EPRBench, a high-quality benchmark specifically designed for event stream-based VPR. EPRBench comprises 10K event sequences and 65K event frames, collected using both handheld and vehicle-mounted setups to comprehensively capture real-world challenges across diverse viewpoints, weather conditions, and lighting scenarios. To support semantic-aware and language-integrated VPR research, we provide LLM-generated scene descriptions, subsequently refined through human annotation, establishing a solid foundation for integrating LLMs into event-based perception pipelines. To facilitate systematic evaluation, we implement and benchmark 15 state-of-the-art VPR algorithms on EPRBench, offering a strong baseline for future algorithmic comparisons. Furthermore, we propose a novel multi-modal fusion paradigm for VPR: leveraging LLMs to generate textual scene descriptions from raw event streams, which then guide spatially attentive token selection, cross-modal feature fusion, and multi-scale representation learning. This framework not only achieves highly accurate place recognition but also produces interpretable reasoning processes alongside its predictions, significantly enhancing model transparency and explainability. The dataset and source code will be released on this https URL
28. 【2602.12916】Reliable Thinking with Images
Link: https://arxiv.org/abs/2602.12916
Authors: Haobin Li,Yutong Yang,Yijie Lin,Dai Xiang,Mouxing Yang,Xi Peng
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Large Language Models, Multi-modal Large Language, Language Models, Multi-modal Large, Large Language
Comments: 26 pages, 19 figures
Abstract:As a multimodal extension of Chain-of-Thought (CoT), Thinking with Images (TWI) has recently emerged as a promising avenue to enhance the reasoning capability of Multi-modal Large Language Models (MLLMs), which generates interleaved CoT by incorporating visual cues into the textual reasoning process. However, the success of existing TWI methods heavily relies on the assumption that interleaved image-text CoTs are faultless, which is easily violated in real-world scenarios due to the complexity of multimodal understanding. In this paper, we reveal and study a highly-practical yet under-explored problem in TWI, termed Noisy Thinking (NT). Specifically, NT refers to the imperfect visual cues mining and answer reasoning process. As the saying goes, "One mistake leads to another", erroneous interleaved CoT would cause error accumulation, thus significantly degrading the performance of MLLMs. To solve the NT problem, we propose a novel method dubbed Reliable Thinking with Images (RTWI). In brief, RTWI estimates the reliability of visual cues and textual CoT in a unified text-centric manner and accordingly employs robust filtering and voting modules to prevent NT from contaminating the final answer. Extensive experiments on seven benchmarks verify the effectiveness of RTWI against NT.
29. 【2602.12905】Adaptive Scaling with Geometric and Visual Continuity of completed 3D objects
Link: https://arxiv.org/abs/2602.12905
Authors: Jelle Vermandere,Maarten Bassier,Maarten Vergauwen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Signed Distance Fields, static Signed Distance, produce static Signed, introducing structural distortions, Signed Distance
Comments: ISPRS Congress 2026
Abstract:Object completion networks typically produce static Signed Distance Fields (SDFs) that faithfully reconstruct geometry but cannot be rescaled or deformed without introducing structural distortions. This limitation restricts their use in applications requiring flexible object manipulation, such as indoor redesign, simulation, and digital content creation. We introduce a part-aware scaling framework that transforms these static completed SDFs into editable, structurally coherent objects. Starting from SDFs and Texture Fields generated by state-of-the-art completion models, our method performs automatic part segmentation, defines user-controlled scaling zones, and applies smooth interpolation of SDFs, color, and part indices to enable proportional and artifact-free deformation. We further incorporate a repetition-based strategy to handle large-scale deformations while preserving repeating geometric patterns. Experiments on Matterport3D and ShapeNet objects show that our method overcomes the inherent rigidity of completed SDFs and is visually more appealing than global and naive selective scaling, particularly for complex shapes and repetitive structures.
30. 【2602.12902】Robustness of Object Detection of Autonomous Vehicles in Adverse Weather Conditions
Link: https://arxiv.org/abs/2602.12902
Authors: Fox Pettersen,Hong Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Keywords: determining safe operational, self-driving technology advances, safe operational thresholds, object detection, object detection model
Comments:
Abstract:As self-driving technology advances toward widespread adoption, determining safe operational thresholds across varying environmental conditions becomes critical for public safety. This paper proposes a method for evaluating the robustness of object detection ML models in autonomous vehicles under adverse weather conditions. It employs data augmentation operators to generate synthetic data that simulates different severity degrees of the adverse operation conditions at progressive intensity levels to find the lowest intensity of the adverse conditions at which the object detection model fails. The robustness of the object detection model is measured by the average first failure coefficients (AFFC) over the input images in the benchmark. The paper reports an experiment with four object detection models: YOLOv5s, YOLOv11s, Faster R-CNN, and Detectron2, utilising seven data augmentation operators that simulate weather conditions fog, rain, and snow, and lighting conditions of dark, bright, flaring, and shadow. The experiment data show that the method is feasible, effective, and efficient to evaluate and compare the robustness of object detection models in various adverse operation conditions. In particular, the Faster R-CNN model achieved the highest robustness with an overall average AFFC of 71.9% over all seven adverse conditions, while YOLO variants showed AFFC values of 43%. The method is also applied to assess the impact of model training that targets adverse operation conditions using synthetic data on model robustness. It is observed that such training can improve robustness in adverse conditions but may suffer from diminishing returns and forgetting phenomena (i.e., decline in robustness) if overtrained.
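The abstract does not give the exact AFFC formula, so the following is only a rough sketch of the idea of searching for the lowest failing intensity per image and averaging the result. It assumes intensities are normalised to [0, 1] and that an image whose detection never fails scores 1.0; the names `first_failure_coefficient`, `average_ffc`, and `make_image` are hypothetical.

```python
def first_failure_coefficient(detects, intensities):
    """Lowest intensity at which detection fails; 1.0 if it never fails (assumption)."""
    for level in sorted(intensities):
        if not detects(level):
            return level
    return 1.0


def average_ffc(images, intensities):
    """Mean first failure coefficient over all benchmark images."""
    scores = [first_failure_coefficient(detects, intensities) for detects in images]
    return sum(scores) / len(scores)


def make_image(threshold):
    """Hypothetical stand-in for 'augment image at this level, run detector, check result'."""
    return lambda level: level < threshold


levels = [i / 10 for i in range(1, 11)]  # progressive intensity levels 0.1 .. 1.0
images = [make_image(0.5), make_image(0.8), make_image(2.0)]  # last one never fails
print(average_ffc(images, levels))
```

Under these assumptions a higher score means the detector tolerates stronger fog, rain, or lighting perturbation before it first fails, which matches the abstract's reading of Faster R-CNN's 71.9% as the most robust result.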
31. 【2602.12892】RADAR: Revealing Asymmetric Development of Abilities in MLLM Pre-training
Link: https://arxiv.org/abs/2602.12892
Authors: Yunshuang Nie,Bingqian Lin,Minzhe Niu,Kun Xiang,Jianhua Han,Guowei Huang,Xingyue Quan,Hang Xu,Bokui Chen,Xiaodan Liang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Multi-modal Large Language, Large Language Models, Large Language, solve complex tasks, Pre-trained Multi-modal Large
Comments:
Abstract:Pre-trained Multi-modal Large Language Models (MLLMs) provide a knowledge-rich foundation for post-training by leveraging their inherent perception and reasoning capabilities to solve complex tasks. However, the lack of an efficient evaluation framework impedes the diagnosis of their performance bottlenecks. Current evaluation primarily relies on testing after supervised fine-tuning, which introduces laborious additional training and autoregressive decoding costs. Meanwhile, common pre-training metrics cannot quantify a model's perception and reasoning abilities in a disentangled manner. Furthermore, existing evaluation benchmarks are typically limited in scale or misaligned with pre-training objectives. Thus, we propose RADAR, an efficient ability-centric evaluation framework for Revealing Asymmetric Development of Abilities in MLLM pRe-training. RADAR involves two key components: (1) Soft Discrimination Score, a novel metric for robustly tracking ability development without fine-tuning, based on quantifying nuanced gradations of the model preference for the correct answer over distractors; and (2) Multi-Modal Mixture Benchmark, a new 15K+ sample benchmark for comprehensively evaluating pre-trained MLLMs' perception and reasoning abilities in a 0-shot manner, where we unify authoritative benchmark datasets and carefully collect new datasets, extending the evaluation scope and addressing the critical gaps in current benchmarks. With RADAR, we comprehensively reveal the asymmetric development of perceptual and reasoning capabilities in pretrained MLLMs across diverse factors, including data volume, model size, and pretraining strategy. Our RADAR underscores the need for a decomposed perspective on pre-training ability bottlenecks, informing targeted interventions to advance MLLMs efficiently. Our code is publicly available at this https URL.
32. 【2602.12877】RoadscapesQA: A Multitask, Multimodal Dataset for Visual Question Answering on Indian Roads
Link: https://arxiv.org/abs/2602.12877
Authors: Vijayasri Iyer,Maahin Rathinagiriswaran,Jyothikamalesh S
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: interpret visual surroundings, effective decision-making, Indian driving environments, essential for autonomous, enables systems
Comments:
Abstract:Understanding road scenes is essential for autonomous driving, as it enables systems to interpret visual surroundings to aid in effective decision-making. We present Roadscapes, a multitask multimodal dataset consisting of up to 9,000 images captured in diverse Indian driving environments, accompanied by manually verified bounding boxes. To facilitate scalable scene understanding, we employ rule-based heuristics to infer various scene attributes, which are subsequently used to generate question-answer (QA) pairs for tasks such as object grounding, reasoning, and scene understanding. The dataset includes a variety of scenes from urban and rural India, encompassing highways, service roads, village paths, and congested city streets, captured in both daytime and nighttime settings. Roadscapes has been curated to advance research on visual scene understanding in unstructured environments. In this paper, we describe the data collection and annotation process, present key dataset statistics, and provide initial baselines for image QA tasks using vision-language models.
33. 【2602.12869】X-VORTEX: Spatio-Temporal Contrastive Learning for Wake Vortex Trajectory Forecasting
Link: https://arxiv.org/abs/2602.12869
Authors: Zhan Qu,Michael Färber
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: air traffic management, coherent air turbulences, air turbulences created, Wake vortices, coherent air
Comments:
Abstract:Wake vortices are strong, coherent air turbulences created by aircraft, and they pose a major safety and capacity challenge for air traffic management. Tracking how vortices move, weaken, and dissipate over time from LiDAR measurements is still difficult because scans are sparse, vortex signatures fade as the flow breaks down under atmospheric turbulence and instabilities, and point-wise annotation is prohibitively expensive. Existing approaches largely treat each scan as an independent, fully supervised segmentation problem, which overlooks temporal structure and does not scale to the vast unlabeled archives collected in practice. We present X-VORTEX, a spatio-temporal contrastive learning framework grounded in Augmentation Overlap Theory that learns physics-aware representations from unlabeled LiDAR point cloud sequences. X-VORTEX addresses two core challenges: sensor sparsity and time-varying vortex dynamics. It constructs paired inputs from the same underlying flight event by combining a weakly perturbed sequence with a strongly augmented counterpart produced via temporal subsampling and spatial masking, encouraging the model to align representations across missing frames and partial observations. Architecturally, a time-distributed geometric encoder extracts per-scan features and a sequential aggregator models the evolving vortex state across variable-length sequences. We evaluate on a real-world dataset of over one million LiDAR scans. X-VORTEX achieves superior vortex center localization while using only 1% of the labeled data required by supervised baselines, and the learned representations support accurate trajectory forecasting.
34. 【2602.12843】Thinking Like a Radiologist: A Dataset for Anatomy-Guided Interleaved Vision Language Reasoning in Chest X-ray Interpretation
Link: https://arxiv.org/abs/2602.12843
Authors: Yichen Zhao,Zelin Peng,Piao Yang,Xiaokang Yang,Wei Shen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: careful visual inspection, visual inspection, visual, perform visual inspection, Radiological diagnosis
Comments:
Abstract:Radiological diagnosis is a perceptual process in which careful visual inspection and language reasoning are repeatedly interleaved. Most medical large vision language models (LVLMs) perform visual inspection only once and then rely on text-only chain-of-thought (CoT) reasoning, which operates purely in the linguistic space and is prone to hallucination. Recent methods attempt to mitigate this issue by introducing visually related coordinates, such as bounding boxes. However, these remain a pseudo-visual solution: coordinates are still text and fail to preserve rich visual details like texture and density. Motivated by the interleaved nature of radiological diagnosis, we introduce MMRad-IVL-22K, the first large-scale dataset designed for natively interleaved visual language reasoning in chest X-ray interpretation. MMRad-IVL-22K reflects a repeated cycle of reasoning and visual inspection workflow of radiologists, in which visual rationales complement textual descriptions and ground each step of the reasoning process. MMRad-IVL-22K comprises 21,994 diagnostic traces, enabling systematic scanning across 35 anatomical regions. Experimental results on advanced closed-source LVLMs demonstrate that report generation guided by multimodal CoT significantly outperforms that guided by text-only CoT in clinical accuracy and report quality (e.g., 6% increase in the RadGraph metric), confirming that high-fidelity interleaved vision language evidence is a non-substitutable component of reliable medical AI. Furthermore, benchmarking across seven state-of-the-art open-source LVLMs demonstrates that models fine-tuned on MMRad-IVL-22K achieve superior reasoning consistency and report quality compared with both general-purpose and medical-specific LVLMs. The project page is available at this https URL.
35. 【2602.12819】WISE: A Multimodal Search Engine for Visual Scenes, Audio, Objects, Faces, Speech, and Metadata
Link: https://arxiv.org/abs/2602.12819
Authors: Prasanna Sridhar,Horace Lee,David M. S. Pinto,Andrew Zisserman,Abhishek Dutta
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: practical tool accessible, machine learning expertise, multimodal retrieval capabilities, audiovisual search engine, practical tool
Comments: Software: [this https URL](https://www.robots.ox.ac.uk/~vgg/software/wise/), Online demos: [this https URL](https://www.robots.ox.ac.uk/~vgg/software/wise/demo/), Example Queries: [this https URL](https://www.robots.ox.ac.uk/~vgg/software/wise/examples/)
Abstract:In this paper, we present WISE, an open-source audiovisual search engine which integrates a range of multimodal retrieval capabilities into a single, practical tool accessible to users without machine learning expertise. WISE supports natural-language and reverse-image queries at both the scene level (e.g. empty street) and object level (e.g. horse) across images and videos; face-based search for specific individuals; audio retrieval of acoustic events using text (e.g. wood creak) or an audio file; search over automatically transcribed speech; and filtering by user-provided metadata. Rich insights can be obtained by combining queries across modalities -- for example, retrieving German trains from a historical archive by applying the object query "train" and the metadata query "Germany", or searching for a face in a place. By employing vector search techniques, WISE can scale to support efficient retrieval over millions of images or thousands of hours of video. Its modular architecture facilitates the integration of new models. WISE can be deployed locally for private or sensitive collections, and has been applied to various real-world use cases. Our code is open-source and available at this https URL.
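To make the "object query plus metadata query" combination concrete, here is a toy sketch of vector search with a metadata filter. It is not WISE's actual code or API: the embeddings, item ids, and the `search` function are invented for illustration, and a brute-force scan stands in for whatever index a production system would use.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def search(index, query_vec, metadata_filter=None, top_k=3):
    """Rank items by similarity to the query, keeping only items that match the filter."""
    hits = []
    for item in index:
        if metadata_filter and any(
            item["meta"].get(key) != value for key, value in metadata_filter.items()
        ):
            continue  # metadata mismatch, e.g. wrong country
        hits.append((cosine(item["vec"], query_vec), item["id"]))
    hits.sort(reverse=True)
    return [item_id for _, item_id in hits[:top_k]]


# Hypothetical archive with made-up 3-d embeddings.
archive = [
    {"id": "train_berlin", "vec": [0.9, 0.1, 0.0], "meta": {"country": "Germany"}},
    {"id": "train_paris", "vec": [0.8, 0.2, 0.0], "meta": {"country": "France"}},
    {"id": "horse_munich", "vec": [0.1, 0.9, 0.0], "meta": {"country": "Germany"}},
]
train_query = [1.0, 0.0, 0.0]  # stand-in for the embedded text query "train"
print(search(archive, train_query, {"country": "Germany"}, top_k=1))
```

Filtering before ranking keeps the French train out of the results even though its embedding is close to the query.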
36. 【2602.12796】GSM-GS: Geometry-Constrained Single and Multi-view Gaussian Splatting for Surface Reconstruction
Link: https://arxiv.org/abs/2602.12796
Authors: Xiao Ren,Yu Liu,Ning An,Jian Cheng,Xin Qiao,He Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: prominent research direction, research direction owing, ultrarapid training speed, Gaussian Splatting, Splatting has emerged
Comments: [this https URL](https://aislab-sustech.github.io/GSM-GS/)
Abstract:Recently, 3D Gaussian Splatting has emerged as a prominent research direction owing to its ultrarapid training speed and high-fidelity rendering capabilities. However, the unstructured and irregular nature of Gaussian point clouds poses challenges to reconstruction accuracy. This limitation frequently causes high-frequency detail loss in complex surface microstructures when relying solely on routine strategies. To address this limitation, we propose GSM-GS: a synergistic optimization framework integrating single-view adaptive sub-region weighting constraints and multi-view spatial structure refinement. For single-view optimization, we leverage image gradient features to partition scenes into texture-rich and texture-less sub-regions. The reconstruction quality is enhanced through adaptive filtering mechanisms guided by depth discrepancy features. This preserves high-weight regions while implementing a dual-branch constraint strategy tailored to regional texture variations, thereby improving geometric detail characterization. For multi-view optimization, we introduce a geometry-guided cross-view point cloud association method combined with a dynamic weight sampling strategy. This constructs 3D structural normal constraints across adjacent point cloud frames, effectively reinforcing multi-view consistency and reconstruction fidelity. Extensive experiments on public datasets demonstrate that our method achieves both competitive rendering quality and geometric reconstruction. See our interactive project page
37. 【2602.12774】Bootstrapping MLLM for Weakly-Supervised Class-Agnostic Object Counting
Link: https://arxiv.org/abs/2602.12774
Authors: Xiaowen Zhang,Zijie Yue,Yong Luo,Cairong Zhao,Qijun Chen,Miaojing Shi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: computer vision, real-world scenarios, fundamental task, task in computer, broad applicability
Comments: Accepted at ICLR 2026
Abstract:Object counting is a fundamental task in computer vision, with broad applicability in many real-world scenarios. Fully-supervised counting methods require costly point-level annotations per object. Few weakly-supervised methods leverage only image-level object counts as supervision and achieve fairly promising results. They are, however, often limited to counting a single category, e.g. person. In this paper, we propose WS-COC, the first MLLM-driven weakly-supervised framework for class-agnostic object counting. Instead of directly fine-tuning MLLMs to predict object counts, which can be challenging due to the modality gap, we incorporate three simple yet effective strategies to bootstrap the counting paradigm in both training and testing: First, a divide-and-discern dialogue tuning strategy is proposed to guide the MLLM to determine whether the object count falls within a specific range and progressively break down the range through multi-round dialogue. Second, a compare-and-rank count optimization strategy is introduced to train the MLLM to optimize the relative ranking of multiple images according to their object counts. Third, a global-and-local counting enhancement strategy aggregates and fuses local and global count predictions to improve counting performance in dense scenes. Extensive experiments on FSC-147, CARPK, PUCPR+, and ShanghaiTech show that WS-COC matches or even surpasses many state-of-the-art fully-supervised methods while significantly reducing annotation costs. Code is available at this https URL.
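The divide-and-discern idea of narrowing a count range over several dialogue rounds resembles a bisection search. The sketch below illustrates only that intuition: `answer_in_range` is a hypothetical stand-in for the MLLM's yes/no reply, and the halving rule is an assumption rather than the procedure used in the paper.

```python
def count_by_range_dialogue(answer_in_range, low, high):
    """Narrow [low, high] with yes/no range questions until a single count remains."""
    rounds = 0
    while low < high:
        mid = (low + high) // 2
        rounds += 1
        # One dialogue round: "Is the object count between low and mid?"
        if answer_in_range(low, mid):
            high = mid
        else:
            low = mid + 1
    return low, rounds


# Hypothetical oracle standing in for a model that answers every question correctly.
true_count = 37
oracle = lambda lo, hi: lo <= true_count <= hi
print(count_by_range_dialogue(oracle, 0, 127))
```

With a perfect oracle, a range of 128 candidate counts is resolved in seven rounds; a real model would answer imperfectly, which is presumably why the paper pairs this strategy with ranking and global-local fusion.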
38. 【2602.12769】PixelRush: Ultra-Fast, Training-Free High-Resolution Image Generation via One-step Diffusion
Link: https://arxiv.org/abs/2602.12769
Authors: Hong-Phuc Lai,Phong Nguyen,Anh Tran
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Pre-trained diffusion models, native training resolution, diffusion models excel, remain inherently limited, Pre-trained diffusion
Comments:
Abstract:Pre-trained diffusion models excel at generating high-quality images but remain inherently limited by their native training resolution. Recent training-free approaches have attempted to overcome this constraint by introducing interventions during the denoising process; however, these methods incur substantial computational overhead, often requiring more than five minutes to produce a single 4K image. In this paper, we present PixelRush, the first tuning-free framework for practical high-resolution text-to-image generation. Our method builds upon the established patch-based inference paradigm but eliminates the need for multiple inversion and regeneration cycles. Instead, PixelRush enables efficient patch-based denoising within a low-step regime. To address artifacts introduced by patch blending in few-step generation, we propose a seamless blending strategy. Furthermore, we mitigate over-smoothing effects through a noise injection mechanism. PixelRush delivers exceptional efficiency, generating 4K images in approximately 20 seconds, representing a 10× to 35× speedup over state-of-the-art methods while maintaining superior visual fidelity. Extensive experiments validate both the performance gains and the quality of outputs achieved by our approach.
39. 【2602.12761】Towards complete digital twins in cultural heritage with ART3mis 3D artifacts annotator
Link: https://arxiv.org/abs/2602.12761
Authors: Dimitrios Karamatskos,Vasileios Arampatzakis,Vasileios Sevetlidis,Stavros Nousias,Athanasios Kalogeras,Christos Koulamas,Aris Lalos,George Pavlidis
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: additional functions, simplistic three-dimensional, specialists and practitioners, attachment of metadata, metadata to specific
Comments: Presented at EUROMED 2022: International Conference on Digital Heritage
Abstract:Archaeologists, as well as specialists and practitioners in cultural heritage, require applications with additional functions, such as the annotation and attachment of metadata to specific regions of the 3D digital artifacts, to go beyond the simplistic three-dimensional (3D) visualization. Different strategies addressed this issue, most of which are excellent in their particular area of application, but their capacity is limited to their design's purpose; they lack generalization and interoperability. This paper introduces ART3mis, a general-purpose, user-friendly, feature-rich, interactive web-based textual annotation tool for 3D objects. Moreover, it enables the communication, distribution, and reuse of information as it complies with the W3C Web Annotation Data Model. It is primarily designed to help cultural heritage conservators, restorers, and curators who lack technical expertise in 3D imaging and graphics, handle, segment, and annotate 3D digital replicas of artifacts with ease.
40. 【2602.12755】Towards reconstructing experimental sparse-view X-ray CT data with diffusion models
Link: https://arxiv.org/abs/2602.12755
Authors: Nelas J. Thomsen,Xinyuan Wang,Felix Lucka,Ezgi Demircan-Tureyen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: X-ray Computed Tomography, sparse-view X-ray Computed, Computed Tomography, X-ray Computed, Diffusion-based image generators
Comments: 5 pages + references, 4 figures, 2 tables, conference paper
Abstract:Diffusion-based image generators are promising priors for ill-posed inverse problems like sparse-view X-ray Computed Tomography (CT). As most studies consider synthetic data, it is not clear whether training data mismatch ("domain shift") or forward model mismatch complicates their successful application to experimental data. We measured CT data from a physical phantom resembling the synthetic Shepp-Logan phantom and trained diffusion priors on synthetic image data sets with different degrees of domain shift towards it. Then, we employed the priors in a Decomposed Diffusion Sampling scheme on sparse-view CT data sets with increasing difficulty leading to the experimental data. Our results reveal that domain shift plays a nuanced role: while severe mismatch causes model collapse and hallucinations, diverse priors outperform well-matched but narrow priors. Forward model mismatch pulls the image samples away from the prior manifold, which causes artifacts but can be mitigated with annealed likelihood schedules that also increase computational efficiency. Overall, we demonstrate that performance gains do not immediately translate from synthetic to experimental data, and future development must validate against real-world benchmarks.
41. 【2602.12751】ReBA-Pred-Net: Weakly-Supervised Regional Brain Age Prediction on MRI
Link: https://arxiv.org/abs/2602.12751
Authors: Shuai Shao,Yan Wang,Shu Jiang,Shiyuan Zhao,Xinzhe Luo,Di Yang,Jiangtao Wang,Yutong Bai,Jianguo Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Brain age, regional brain age, Brain, Brain Age Prediction, prominent biomarker
Comments:
Abstract:Brain age has become a prominent biomarker of brain health. Yet most prior work targets whole brain age (WBA), a coarse paradigm that struggles to support tasks such as disease characterization and research on development and aging patterns, because relevant changes are typically region-selective rather than brain-wide. Therefore, robust regional brain age (ReBA) estimation is critical, yet a widely generalizable model has yet to be established. In this paper, we propose the Regional Brain Age Prediction Network (ReBA-Pred-Net), a Teacher-Student framework designed for fine-grained brain age estimation. The Teacher produces soft ReBA to guide the Student to yield reliable ReBA estimates with a clinical-prior consistency constraint (regions within the same function should change similarly). For rigorous evaluation, we introduce two indirect metrics: Healthy Control Similarity (HCS), which assesses statistical consistency by testing whether regional brain-age-gap (ReBA minus chronological age) distributions align between training and unseen HC; and Neuro Disease Correlation (NDC), which assesses factual consistency by checking whether clinically confirmed patients show elevated brain-age-gap in disease-associated regions. Experiments across multiple backbones demonstrate the statistical and factual validity of our method.
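The brain-age-gap used by both metrics is defined in the abstract as ReBA minus chronological age. The sketch below works through that arithmetic per region and averages it over subjects; the region names, ages, and function names are made-up illustrations, and the averaging step is only a loose stand-in for the NDC idea, not the paper's metric.

```python
from statistics import mean


def brain_age_gaps(regional_ages, chronological_age):
    """Regional brain-age-gap: predicted regional age minus chronological age."""
    return {region: age - chronological_age for region, age in regional_ages.items()}


def mean_gap_per_region(subjects):
    """Average gap per region over (regional_ages, chronological_age) pairs."""
    gaps = [brain_age_gaps(ages, chrono) for ages, chrono in subjects]
    return {region: mean(gap[region] for gap in gaps) for region in gaps[0]}


# Hypothetical patients: predicted regional ages and true age in years.
patients = [
    ({"hippocampus": 78.0, "occipital": 70.5}, 70.0),
    ({"hippocampus": 66.0, "occipital": 60.0}, 60.0),
]
print(mean_gap_per_region(patients))
```

In this toy data the hippocampus shows a clearly elevated mean gap while the occipital region stays near zero, which is the kind of region-selective pattern the abstract argues whole-brain age cannot expose.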
42. 【2602.12742】Synthetic Craquelure Generation for Unsupervised Painting Restoration
Link: https://arxiv.org/abs/2602.12742
Authors: Jana Cuch-Guillén,Antonio Agudo,Raül Pérez-Gonzalo
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Cultural heritage preservation, scarce pixel-level annotations, heritage preservation increasingly, preservation increasingly demands, increasingly demands non-invasive
Comments: Accepted to CAI 2026
Abstract:Cultural heritage preservation increasingly demands non-invasive digital methods for painting restoration, yet identifying and restoring fine craquelure patterns from complex brushstrokes remains challenging due to scarce pixel-level annotations. We propose a fully annotation-free framework driven by a domain-specific synthetic craquelure generator, which simulates realistic branching and tapered fissure geometry using Bézier trajectories. Our approach couples a classical morphological detector with a learning-based refinement module: a SegFormer backbone adapted via Low-Rank Adaptation (LoRA). Uniquely, we employ a detector-guided strategy, injecting the morphological map as an input spatial prior, while a masked hybrid loss and logit adjustment constrain the training to focus specifically on refining candidate crack regions. The refined masks subsequently guide an Anisotropic Diffusion inpainting stage to reconstruct missing content. Experimental results demonstrate that our pipeline significantly outperforms state-of-the-art photographic restoration models in zero-shot settings, while faithfully preserving the original paint brushwork.
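As a rough illustration of how a Bézier trajectory can describe a tapered fissure, the sketch below samples a cubic Bézier curve and attaches a width that shrinks towards the tip. The linear taper, the control points, and the function names are assumptions for illustration; the paper's generator, including its branching logic, is not reproduced here.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Point on a cubic Bézier curve at parameter t in [0, 1]."""
    s = 1.0 - t
    x = s**3 * p0[0] + 3 * s * s * t * p1[0] + 3 * s * t * t * p2[0] + t**3 * p3[0]
    y = s**3 * p0[1] + 3 * s * s * t * p1[1] + 3 * s * t * t * p2[1] + t**3 * p3[1]
    return x, y


def crack_polyline(p0, p1, p2, p3, max_width=3.0, samples=5):
    """Sample (x, y, width) along the curve; width tapers linearly to zero (assumption)."""
    points = []
    for i in range(samples):
        t = i / (samples - 1)
        x, y = cubic_bezier(p0, p1, p2, p3, t)
        points.append((x, y, max_width * (1.0 - t)))
    return points


# Hypothetical control points for a single synthetic crack.
crack = crack_polyline((0, 0), (0, 4), (10, 4), (10, 0))
for point in crack:
    print(point)
```

Rasterising such polylines onto clean paintings would yield image and mask pairs without manual labels, which is the general motivation for a synthetic generator in an annotation-free pipeline.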
43. 【2602.12740】SPRig: Self-Supervised Pose-Invariant Rigging from Mesh Sequences
Link: https://arxiv.org/abs/2602.12740
Authors: Ruipeng Wang, Langkun Zhong, Miaowei Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: canonical rest pose, animal motion capture, capture or AIGC, lack the T-pose, video-derived mesh sequences
Comments: Code: [this https URL](https://github.com/WANG-Ruipeng/SPRig)
Abstract:State-of-the-art rigging methods assume a canonical rest pose--an assumption that fails for sequential data (e.g., animal motion capture or AIGC/video-derived mesh sequences) that lack the T-pose. Applied frame-by-frame, these methods are not pose-invariant and produce topological inconsistencies across frames. We therefore propose SPRig, a general fine-tuning framework that enforces cross-frame consistency losses to learn pose-invariant rigs on top of existing models. We validate our approach on rigging using a new permutation-invariant stability protocol. Experiments demonstrate SOTA temporal stability: our method produces coherent rigs from challenging sequences and dramatically reduces the artifacts that plague baseline methods. The code will be released publicly upon acceptance.
44. 【2602.12735】VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph
Link: https://arxiv.org/abs/2602.12735
Authors: Qiuchen Wang, Shihang Wang, Yu Zeng, Qiang Zhang, Fanrui Zhang, Zhuoning Guo, Bosi Zhang, Wenxuan Huang, Lin Chen, Zehui Chen, Pengjun Xie, Ruixue Ding
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Effectively retrieving, Traditional Retrieval-augmented Generation, understanding multimodal information, multimodal information remains, agentic systems
Comments:
Abstract:Effectively retrieving, reasoning, and understanding multimodal information remains a critical challenge for agentic systems. Traditional Retrieval-augmented Generation (RAG) methods rely on linear interaction histories, which struggle to handle long-context tasks, especially those involving information-sparse yet token-heavy visual data in iterative reasoning scenarios. To bridge this gap, we introduce VimRAG, a framework tailored for multimodal Retrieval-augmented Reasoning across text, images, and videos. Inspired by our systematic study, we model the reasoning process as a dynamic directed acyclic graph that structures the agent states and retrieved multimodal evidence. Building upon this structured memory, we introduce a Graph-Modulated Visual Memory Encoding mechanism, with which the significance of memory nodes is evaluated via their topological position, allowing the model to dynamically allocate high-resolution tokens to pivotal evidence while compressing or discarding trivial clues. To implement this paradigm, we propose a Graph-Guided Policy Optimization strategy. This strategy disentangles step-wise validity from trajectory-level rewards by pruning memory nodes associated with redundant actions, thereby facilitating fine-grained credit assignment. Extensive experiments demonstrate that VimRAG consistently achieves state-of-the-art performance on diverse multimodal RAG benchmarks. The code is available at this https URL.
45. 【2602.12725】ART3mis: Ray-Based Textual Annotation on 3D Cultural Objects
Link: https://arxiv.org/abs/2602.12725
Authors: Vasileios Arampatzakis, Vasileios Sevetlidis, Fotis Arnaoutoglou, Athanasios Kalogeras, Christos Koulamas, Aris Lalos, Chairi Kiourt, George Ioannakis, Anestis Koutsoudis, George Pavlidis
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: experts and practitioners, advanced functionalities, cultural heritage experts, heritage experts
Comments: Presented at CAA 2021 - "Digital Crossroads"
Abstract:Beyond simplistic 3D visualisations, archaeologists, as well as cultural heritage experts and practitioners, need applications with advanced functionalities, such as the annotation and attachment of metadata onto particular regions of 3D digital objects. Various approaches have been presented to tackle this challenge, most of which achieve excellent results in the domain of their application. However, they are often confined to that specific domain and particular problem. In this paper, we present ART3mis - a general-purpose, user-friendly, interactive textual annotation tool for 3D objects. Primarily attuned to aid cultural heritage conservators, restorers and curators with no technical skills in 3D imaging and graphics, the tool allows for the easy handling, segmenting and annotating of 3D digital replicas of artefacts. ART3mis applies a user-driven, direct-on-surface approach. It can handle detailed 3D cultural objects in real-time and store textual annotations for multiple complex regions in JSON data format.
46. 【2602.12705】MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs
Link: https://arxiv.org/abs/2602.12705
Authors: Baorong Shi, Bo Cui, Boyuan Jiang, Deli Yu, Fang Qian, Haihua Yang, Huichao Wang, Jiale Chen, Jianfei Pan, Jieqiong Cao, Jinghao Lin, Kai Wu, Lin Yang, Shengsheng Yao, Tao Chen, Xiaojun Xiao, Xiaozhong Ji, Xu Wang, Yijun He, Zhixiong Yang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Keywords: vision-language foundation model, foundation model designed, real-world clinical applications, medical vision-language foundation, advance general-purpose medical
Comments:
Abstract:We present MedXIAOHE, a medical vision-language foundation model designed to advance general-purpose medical understanding and reasoning in real-world clinical applications. MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities. To achieve this, we propose an entity-aware continual pretraining framework that organizes heterogeneous medical corpora to broaden knowledge coverage and reduce long-tail gaps (e.g., rare diseases). For medical expert-level reasoning and interaction, MedXIAOHE incorporates diverse medical reasoning patterns via reinforcement learning and tool-augmented agentic training, enabling multi-step diagnostic reasoning with verifiable decision traces. To improve reliability in real-world use, MedXIAOHE integrates user-preference rubrics, evidence-grounded reasoning, and low-hallucination long-form report generation, with improved adherence to medical instructions. We release this report to document our practical design choices, scaling insights, and evaluation framework, hoping to inspire further research.
47. 【2602.12696】Channel-Aware Probing for Multi-Channel Imaging
Link: https://arxiv.org/abs/2602.12696
Authors: Umar Marikkar, Syed Sameed Husain, Muhammad Awais, Sara Atito
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: data remains challenging, Multi-Channel Imaging, preventing fixed-channel training, evaluating vision encoders, channel configurations vary
Comments:
Abstract:Training and evaluating vision encoders on Multi-Channel Imaging (MCI) data remains challenging as channel configurations vary across datasets, preventing fixed-channel training and limiting reuse of pre-trained encoders on new channel settings. Prior work trains MCI encoders but typically evaluates them via full fine-tuning, leaving probing with frozen pre-trained encoders comparatively underexplored. Existing studies that perform probing largely focus on improving representations, rather than how to best leverage fixed representations for downstream tasks. Although the latter problem has been studied in other domains, directly transferring those strategies to MCI yields weak results, even worse than training from scratch. We therefore propose Channel-Aware Probing (CAP), which exploits the intrinsic inter-channel diversity in MCI datasets by controlling feature flow at both the encoder and probe levels. CAP uses Independent Feature Encoding (IFE) to encode each channel separately, and Decoupled Pooling (DCP) to pool within channels before aggregating across channels. Across three MCI benchmarks, CAP consistently improves probing performance over the default probing protocol, matches fine-tuning from scratch, and largely reduces the gap to full fine-tuning from the same MCI pre-trained checkpoints. Code can be found in this https URL.
48. 【2602.12679】Motion Prior Distillation in Time Reversal Sampling for Generative Inbetweening
Link: https://arxiv.org/abs/2602.12679
Authors: Wooseok Jeon, Seunghyun Shin, Dongmin Shin, Hae-Gon Jeon
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: generate semantically plausible, Recent progress, semantically plausible frames, significantly advanced, advanced the field
Comments: Accepted at ICLR 2026. Project page: [this https URL](https://vvsjeon.github.io/MPD/)
Abstract:Recent progress in image-to-video (I2V) diffusion models has significantly advanced the field of generative inbetweening, which aims to generate semantically plausible frames between two keyframes. In particular, inference-time sampling strategies, which leverage the generative priors of large-scale pre-trained I2V models without additional training, have become increasingly popular. However, existing inference-time sampling, either fusing forward and backward paths in parallel or alternating them sequentially, often suffers from temporal discontinuities and undesirable visual artifacts due to the misalignment between the two generated paths. This is because each path follows the motion prior induced by its own conditioning frame. In this work, we propose Motion Prior Distillation (MPD), a simple yet effective inference-time distillation technique that suppresses bidirectional mismatch by distilling the motion residual of the forward path into the backward path. Our method can deliberately avoid denoising the end-conditioned path which causes the ambiguity of the path, and yield more temporally coherent inbetweening results with the forward motion prior. We not only perform quantitative evaluations on standard benchmarks, but also conduct extensive user studies to demonstrate the effectiveness of our approach in practical scenarios.
49. 【2602.12675】SLA2: Sparse-Linear Attention with Learnable Routing and QAT
Link: https://arxiv.org/abs/2602.12675
Authors: Jintao Zhang, Haoxu Wang, Kai Jiang, Kaiwen Zheng, Youhe Jiang, Ion Stoica, Jianfei Chen, Jun Zhu, Joseph E. Gonzalez
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: shown strong performance, linear attention, Attention, SLA, shown strong
Comments:
Abstract:Sparse-Linear Attention (SLA) combines sparse and linear attention to accelerate diffusion models and has shown strong performance in video generation. However, (i) SLA relies on a heuristic split that assigns computations to the sparse or linear branch based on attention-weight magnitude, which can be suboptimal. Additionally, (ii) after formally analyzing the attention error in SLA, we identify a mismatch between SLA and a direct decomposition into sparse and linear attention. We propose SLA2, which introduces (I) a learnable router that dynamically selects whether each attention computation should use sparse or linear attention, (II) a more faithful and direct sparse-linear attention formulation that uses a learnable ratio to combine the sparse and linear attention branches, and (III) a sparse + low-bit attention design, where low-bit attention is introduced via quantization-aware fine-tuning to reduce quantization error. Experiments show that on video diffusion models, SLA2 can achieve 97% attention sparsity and deliver an 18.6x attention speedup while preserving generation quality.
50. 【2602.12659】IndicFairFace: Balanced Indian Face Dataset for Auditing and Mitigating Geographical Bias in Vision-Language Models
Link: https://arxiv.org/abs/2602.12659
Authors: Aarish Shah Mohsin, Mohammed Tayyab Ilyas Khan, Mohammad Nadeem, Shahab Saquib Sohail, Erik Cambria, Jiechao Gao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: amplify societal biases, web-scale training data, Vision-Language Models, inherit and amplify, amplify societal
Comments:
Abstract:Vision-Language Models (VLMs) are known to inherit and amplify societal biases from their web-scale training data, with Indians being particularly misrepresented. Existing fairness-aware datasets have significantly improved demographic balance across global race and gender groups, yet they continue to treat Indian as a single monolithic category. This oversimplification ignores the vast intra-national diversity across the 28 states and 8 Union Territories of India and leads to representational and geographical bias. To address this limitation, we present IndicFairFace, a novel and balanced face dataset comprising 14,400 images representing the geographical diversity of India. Images were sourced ethically from Wikimedia Commons and open-license web repositories and uniformly balanced across states and gender. Using IndicFairFace, we quantify intra-national geographical bias in prominent CLIP-based VLMs and reduce it using a post-hoc Iterative Nullspace Projection debiasing approach. We also show that the adopted debiasing approach does not adversely impact the existing embedding space, as the average drop in retrieval accuracy on benchmark datasets is less than 1.5 percent. Our work establishes IndicFairFace as the first benchmark to study geographical bias in VLMs for the Indian context.
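To make the nullspace-projection idea concrete, here is a minimal sketch of a single projection step: an embedding is projected onto the nullspace of one linear attribute direction, so that a linear classifier using that direction can no longer read the attribute off the embedding. The direction `w` and the function names are hypothetical. In the iterative variant, a new classifier would be re-trained on the projected embeddings and the step repeated, and the paper's actual pipeline may differ from this sketch.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_to_nullspace(embedding, w):
    """Remove the component of `embedding` along direction `w`.

    Computes x - (x.w / w.w) * w, which is orthogonal to w.
    `w` stands in for the weight vector of a linear attribute
    classifier (an assumption made for this illustration).
    """
    scale = dot(embedding, w) / dot(w, w)
    return [xi - scale * wi for xi, wi in zip(embedding, w)]


x = [1.0, 2.0, 3.0]
w = [1.0, 1.0, 0.0]  # hypothetical attribute direction
x_debiased = project_to_nullspace(x, w)
```

After the step, `dot(x_debiased, w)` is zero, while the component of `x` orthogonal to `w` is left unchanged.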
51. 【2602.12652】CBEN -- A Multimodal Machine Learning Dataset for Cloud Robust Remote Sensing Image Understanding
Link: https://arxiv.org/abs/2602.12652
Authors: Marco Stricker, Masakazu Iwamura, Koichi Kise
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: distorts optical satellite, remote sensing, common phenomenon, phenomenon that distorts, poses a challenge
Comments: This work has been submitted to the IEEE Transactions on Geoscience Remote Sensing for possible publication
Abstract:Clouds are a common phenomenon that distorts optical satellite imagery, which poses a challenge for remote sensing. However, in the literature cloudless analysis is often performed where cloudy images are excluded from machine learning datasets and methods. Such an approach cannot be applied to time sensitive applications, e.g., during natural disasters. A possible solution is to apply cloud removal as a preprocessing step to ensure that cloudfree solutions are not failing under such conditions. But cloud removal methods are still actively researched and suffer from drawbacks, such as generated visual artifacts. Therefore, it is desirable to develop cloud robust methods that are less affected by cloudy weather. Cloud robust methods can be achieved by combining optical data with radar, a modality unaffected by clouds. While many datasets for machine learning combine optical and radar data, most researchers exclude cloudy images. We identify this exclusion from machine learning training and evaluation as a limitation that reduces applicability to cloudy scenarios. To investigate this, we assembled a dataset, named CloudyBigEarthNet (CBEN), of paired optical and radar images with cloud occlusion for training and evaluation. Using average precision (AP) as the evaluation metric, we show that state-of-the-art methods trained on combined clear-sky optical and radar imagery suffer performance drops of 23-33 percentage points when evaluated on cloudy images. We then adapt these methods to cloudy optical data during training, achieving relative improvement of 17.2-28.7 percentage points on cloudy test cases compared with the original approaches. Code and dataset are publicly available at: this https URL
52. 【2602.12649】Multi-Task Learning with Additive U-Net for Image Denoising and Classification
Link: https://arxiv.org/abs/2602.12649
Authors: Vikram Lakkavalli, Neelam Sinha
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: architectures for image, investigate additive skip, additive fusion, investigate additive, U-Net architectures
Comments:
Abstract:We investigate additive skip fusion in U-Net architectures for image denoising and denoising-centric multi-task learning (MTL). By replacing concatenative skips with gated additive fusion, the proposed Additive U-Net (AddUNet) constrains shortcut capacity while preserving fixed feature dimensionality across depth. This structural regularization induces controlled encoder-decoder information flow and stabilizes joint optimization. Across single-task denoising and joint denoising-classification settings, AddUNet achieves competitive reconstruction performance with improved training stability. In MTL, learned skip weights exhibit systematic task-aware redistribution: shallow skips favor reconstruction, while deeper features support discrimination. Notably, reconstruction remains robust even under limited classification capacity, indicating implicit task decoupling through additive fusion. These findings show that simple constraints on skip connections act as an effective architectural regularizer for stable and scalable multi-task learning without increasing model complexity.
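The contrast between concatenative and gated additive skips can be sketched in a few lines. This is only an illustration under stated assumptions: features are flat lists of channel values, and the gate is a single learnable scalar passed through a sigmoid. The abstract does not specify the actual gating form in AddUNet, and all names here are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function, used here to keep the gate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def concat_skip(decoder_feat, encoder_feat):
    """Concatenative skip: channel count grows from C to 2C."""
    return decoder_feat + encoder_feat


def gated_additive_skip(decoder_feat, encoder_feat, gate_logit):
    """Gated additive skip: channel count stays at C.

    `gate_logit` is a hypothetical learnable scalar; its sigmoid scales
    how much encoder information flows through the shortcut.
    """
    gate = sigmoid(gate_logit)
    return [d + gate * e for d, e in zip(decoder_feat, encoder_feat)]


dec = [1.0, 2.0]
enc = [2.0, 4.0]
fused_concat = concat_skip(dec, enc)              # 4 channels
fused_add = gated_additive_skip(dec, enc, 0.0)    # 2 channels, gate = 0.5
```

With additive fusion the decoder always sees the same feature width, and the gate gives a single interpretable weight per skip, which is the kind of quantity the abstract describes as shifting between reconstruction and discrimination.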
53. 【2602.12640】ImageRAGTurbo: Towards One-step Text-to-Image Generation with Retrieval-Augmented Diffusion Models
Link: https://arxiv.org/abs/2602.12640
Authors: Peijie Qiu, Hariharan Ramshankar, Arnau Ramisa, René Vidal, Amit Kumar K C, Vamsi Salaka, Rahul Bhagat
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: few-step diffusion models, Diffusion models, diffusion models reduce, few-step diffusion, Diffusion
Comments: 11 pages, 7 figures
Abstract:Diffusion models have emerged as the leading approach for text-to-image generation. However, their iterative sampling process, which gradually morphs random noise into coherent images, introduces significant latency that limits their applicability. While recent few-step diffusion models reduce the number of sampling steps to as few as one to four steps, they often compromise image quality and prompt alignment, especially in one-step generation. Additionally, these models require computationally expensive training procedures. To address these limitations, we propose ImageRAGTurbo, a novel approach to efficiently finetune few-step diffusion models via retrieval augmentation. Given a text prompt, we retrieve relevant text-image pairs from a database and use them to condition the generation process. We argue that such retrieved examples provide rich contextual information to the UNet denoiser that helps reduce the number of denoising steps without compromising image quality. Indeed, our initial investigations show that using the retrieved content to edit the denoiser's latent space ($\mathcal{H}$-space) without additional finetuning already improves prompt fidelity. To further improve the quality of the generated images, we augment the UNet denoiser with a trainable adapter in the $\mathcal{H}$-space, which efficiently blends the retrieved content with the target prompt using a cross-attention mechanism. Experimental results on fast text-to-image generation demonstrate that our approach produces high-fidelity images without compromising latency compared to existing methods.
54. 【2602.12624】Formalizing the Sampling Design Space of Diffusion-Based Generative Models via Adaptive Solvers and Wasserstein-Bounded Timesteps
Link: https://arxiv.org/abs/2602.12624
Authors: Sangwoo Jo, Sungjoon Choi
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Diffusion-based generative models, high sampling costs, Diffusion-based generative, achieved remarkable performance, generative models
Comments:
Abstract:Diffusion-based generative models have achieved remarkable performance across various domains, yet their practical deployment is often limited by high sampling costs. While prior work focuses on training objectives or individual solvers, the holistic design of sampling, specifically solver selection and scheduling, remains dominated by static heuristics. In this work, we revisit this challenge through a geometric lens, proposing SDM, a principled framework that aligns the numerical solver with the intrinsic properties of the diffusion trajectory. By analyzing the ODE dynamics, we show that efficient low-order solvers suffice in early high-noise stages while higher-order solvers can be progressively deployed to handle the increasing non-linearity of later stages. Furthermore, we formalize the scheduling by introducing a Wasserstein-bounded optimization framework. This method systematically derives adaptive timesteps that explicitly bound the local discretization error, ensuring the sampling process remains faithful to the underlying continuous dynamics. Without requiring additional training or architectural modifications, SDM achieves state-of-the-art performance across standard benchmarks, including an FID of 1.93 on CIFAR-10, 2.41 on FFHQ, and 1.98 on AFHQv2, with a reduced number of function evaluations compared to existing samplers. Our code is available at this https URL.
55. 【2602.12618】Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models
Link: https://arxiv.org/abs/2602.12618
Authors: Omer Faruk Deniz, Ruiyu Mao, Ruochen Li, Yapeng Tian, Latifur Khan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, incur significant computational
Comments: 2025 IEEE International Conference on Big Data (BigData)
Abstract:Multimodal Large Language Models (MLLMs) incur significant computational cost from processing numerous vision tokens through all LLM layers. Prior pruning methods operate either before the LLM, limiting generality due to diverse encoder-projector designs, or within the LLM, using heuristics that are incompatible with FlashAttention. We take a different approach: rather than identifying unimportant tokens, we treat the LLM itself as the optimal guide for compression. Observing that deeper layers naturally transmit vision-to-text information, we introduce Attention-Driven Self-Compression (ADSC), a simple, broadly applicable method that progressively reduces vision tokens using only the LLM's attention mechanism. Our method applies uniform token downsampling at selected layers, forming bottlenecks that encourage the model to reorganize and compress information into the remaining tokens. It requires no score computation, auxiliary modules, or attention modification, and remains fully compatible with FlashAttention. Applied to LLaVA-1.5, ADSC reduces FLOPs by 53.7% and peak KV-cache memory by 56.7%, while preserving 98.2% of the original model performance. Across multiple benchmarks, it outperforms prior pruning approaches in both efficiency and accuracy. Crucially, under high compression ratios, our method remains robust while heuristic-based techniques degrade sharply.
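The "uniform token downsampling at selected layers" step can be pictured with a small bookkeeping sketch that tracks how many vision tokens each layer would process. The stride of 2, the chosen layer indices, and the 32-layer depth are assumptions for illustration only and are not taken from the paper. The 576 starting tokens correspond to a 24x24 patch grid.

```python
def downsample_uniform(tokens, stride=2):
    """Keep every `stride`-th token; no importance scores are computed."""
    return tokens[::stride]


def vision_tokens_per_layer(num_tokens, num_layers, compress_at, stride=2):
    """Return the number of vision tokens seen by each layer.

    `compress_at` is a hypothetical set of layer indices at which the
    vision-token sequence is uniformly downsampled before the layer runs.
    """
    tokens = list(range(num_tokens))  # stand-ins for token embeddings
    counts = []
    for layer in range(num_layers):
        if layer in compress_at:
            tokens = downsample_uniform(tokens, stride)
        counts.append(len(tokens))
    return counts


counts = vision_tokens_per_layer(576, 32, compress_at={8, 16, 24})
```

Because the selection is a fixed slice rather than a score-based choice, it needs no access to attention weights, which is consistent with the abstract's point about staying compatible with FlashAttention.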
56. 【2602.12609】QuEPT: Quantized Elastic Precision Transformers with One-Shot Calibration for Multi-Bit Switching
Link: https://arxiv.org/abs/2602.12609
Authors: Ke Xu, Yixin Wang, Zhongcheng Li, Hao Cui, Jinshui Hu, Xingyi Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: small data slice, elastic quantization remains, Transformer architecture, single optimization pass
Comments: Accepted by AAAI 2026
Abstract:Elastic precision quantization enables multi-bit deployment via a single optimization pass, fitting diverse quantization scenarios. However, due to the high storage and optimization costs associated with the Transformer architecture, research on elastic quantization remains limited, particularly for large language models. This paper proposes QuEPT, an efficient post-training scheme that reconstructs block-wise multi-bit errors with one-shot calibration on a small data slice. It can dynamically adapt to various predefined bit-widths by cascading different low-rank adapters, and supports real-time switching between uniform quantization and mixed precision quantization without repeated optimization. To enhance accuracy and robustness, we introduce Multi-Bit Token Merging (MB-ToMe) to dynamically fuse token features across different bit-widths, improving robustness during bit-width switching. Additionally, we propose Multi-Bit Cascaded Low-Rank adapters (MB-CLoRA) to strengthen correlations between bit-width groups, further improving the overall performance of QuEPT. Extensive experiments demonstrate that QuEPT achieves comparable or better performance than existing state-of-the-art post-training quantization methods. The code is available at this https URL
57. 【2602.12590】Unbiased Gradient Estimation for Event Binning via Functional Backpropagation
Link: https://arxiv.org/abs/2602.12590
Authors: Jinze Chen, Wei Zhai, Han Han, Tiankai Ma, Yang Cao, Bin Li, Zheng-Jun Zha
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: vision encodes dynamic, encodes dynamic scenes, asynchronous spatio-temporal spikes, spatio-temporal spikes called, spikes called events
Comments:
Abstract:Event-based vision encodes dynamic scenes as asynchronous spatio-temporal spikes called events. To leverage conventional image processing pipelines, events are typically binned into frames. However, binning functions are discontinuous, which truncates gradients at the frame level and forces most event-based algorithms to rely solely on frame-based features. Attempts to directly learn from raw events avoid this restriction but instead suffer from biased gradient estimation due to the discontinuities of the binning operation, ultimately limiting their learning efficiency. To address this challenge, we propose a novel framework for unbiased gradient estimation of arbitrary binning functions by synthesizing weak derivatives during backpropagation while keeping the forward output unchanged. The key idea is to exploit integration by parts: lifting the target functions to functionals yields an integral form of the derivative of the binning function during backpropagation, where the cotangent function naturally arises. By reconstructing this cotangent function from the sampled cotangent vector, we compute weak derivatives that provably match long-range finite differences of both smooth and non-smooth targets. Experimentally, our method improves simple optimization-based egomotion estimation with 3.2% lower RMS error and 1.57x faster convergence. On complex downstream tasks, we achieve 9.4% lower EPE in self-supervised optical flow, and 5.1% lower RMS error in SLAM, demonstrating broad benefits for event-based visual perception. Source code can be found at this https URL.
58. 【2602.12563】The Constant Eye: Benchmarking and Bridging Appearance Robustness in Autonomous Driving
Link: https://arxiv.org/abs/2602.12563
Authors: Jiabao Wang, Hongyu Zhou, Yuanbo Yang, Jiahao Shao, Yiyi Liao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: remain notoriously fragile, rapid progress, notoriously fragile, autonomous driving algorithms, algorithms remain notoriously
Comments:
Abstract:Despite rapid progress, autonomous driving algorithms remain notoriously fragile under Out-of-Distribution (OOD) conditions. We identify a critical decoupling failure in current research: the lack of distinction between appearance-based shifts, such as weather and lighting, and structural scene changes. This leaves a fundamental question unanswered: Is the planner failing because of complex road geometry, or simply because it is raining? To resolve this, we establish navdream, a high-fidelity robustness benchmark leveraging generative pixel-aligned style transfer. By creating a visual stress test with negligible geometric deviation, we isolate the impact of appearance on driving performance. Our evaluation reveals that existing planning algorithms often show significant degradation under OOD appearance conditions, even when the underlying scene structure remains consistent. To bridge this gap, we propose a universal perception interface leveraging a frozen visual foundation model (DINOv3). By extracting appearance-invariant features as a stable interface for the planner, we achieve exceptional zero-shot generalization across diverse planning paradigms, including regression-based, diffusion-based, and scoring-based models. Our plug-and-play solution maintains consistent performance across extreme appearance shifts without requiring further fine-tuning. The benchmark and code will be made available.
59. 【2602.12561】PLLM: Pseudo-Labeling Large Language Models for CAD Program Synthesis
Link: https://arxiv.org/abs/2602.12561
Authors: Yuanbo Li, Dule Shu, Yanying Chen, Matt Klenk, Daniel Ritchie
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recovering Computer-Aided Design, widely studied problem, Recovering Computer-Aided, Computer-Aided Design, CAD program synthesis
Comments:
Abstract:Recovering Computer-Aided Design (CAD) programs from 3D geometries is a widely studied problem. Recent advances in large language models (LLMs) have enabled progress in CAD program synthesis, but existing methods rely on supervised training with paired shape-program data, which is often unavailable. We introduce PLLM, a self-training framework for CAD program synthesis from unlabeled 3D shapes. Given a pre-trained CAD-capable LLM and a shape dataset, PLLM iteratively samples candidate programs, selects high-fidelity executions, and augments programs to construct synthetic program-shape pairs for fine-tuning. Experiments on adapting CAD-Recode from DeepCAD to the unlabeled ABC dataset show consistent improvements in geometric fidelity and program diversity.
60. 【2602.12540】Self-Supervised JEPA-based World Models for LiDAR Occupancy Completion and Forecasting
Link: https://arxiv.org/abs/2602.12540
Authors: Haoran Zhu, Anna Choromanska
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: support long-term planning, environment evolves spatiotemporally, requires the fundamental, capability to build, long-term planning
Comments:
Abstract:Autonomous driving, as an agent operating in the physical world, requires the fundamental capability to build world models that capture how the environment evolves spatiotemporally in order to support long-term planning. At the same time, scalability demands learning such models in a self-supervised manner; the joint-embedding predictive architecture (JEPA) enables learning world models by leveraging large volumes of unlabeled data without relying on expensive human annotations. In this paper, we propose AD-LiST-JEPA, a self-supervised world model for autonomous driving that predicts future spatiotemporal evolution from LiDAR data using a JEPA framework. We evaluate the quality of the learned representations through a downstream LiDAR-based occupancy completion and forecasting (OCF) task, which jointly assesses perception and prediction. Proof-of-concept experiments show better OCF performance with an encoder pretrained via JEPA-based world model learning.
61. 【2602.12529】Flow-Factory: A Unified Framework for Reinforcement Learning in Flow-Matching Models
Link: https://arxiv.org/abs/2602.12529
Authors: Bowen Ping, Chengyou Jia, Minnan Luo, Hangwei Qian, Ivor Tsang
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: practitioners face fragmented, Reinforcement learning, face fragmented codebases, human preferences, engineering complexity
Comments:
Abstract:Reinforcement learning has emerged as a promising paradigm for aligning diffusion and flow-matching models with human preferences, yet practitioners face fragmented codebases, model-specific implementations, and engineering complexity. We introduce Flow-Factory, a unified framework that decouples algorithms, models, and rewards through a modular, registry-based architecture. This design enables seamless integration of new algorithms and architectures, as demonstrated by our support for GRPO, DiffusionNFT, and AWM across Flux, Qwen-Image, and WAN video models. By minimizing implementation overhead, Flow-Factory empowers researchers to rapidly prototype and scale future innovations with ease. Flow-Factory provides production-ready memory optimization, flexible multi-reward training, and seamless distributed training support. The codebase is available at this https URL.
62. 【2602.12525】Geometric Stratification for Singular Configurations of the P3P Problem via Local Dual Space
Link: https://arxiv.org/abs/2602.12525
Authors: Xueying Sun,Zijia Li,Nan Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: paper investigates singular, investigates singular configurations, paper investigates, singular configurations, danger cylinder
Comments:
Click to view abstract
Abstract:This paper investigates singular configurations of the P3P problem. Using local dual space, a systematic algebraic-computational framework is proposed to give a complete geometric stratification for the P3P singular configurations with respect to the multiplicity $\mu$ of the camera center $O$: for $\mu\ge 2$, $O$ lies on the "danger cylinder"; for $\mu\ge 3$, $O$ lies on one of three generatrices of the danger cylinder associated with the first Morley triangle or the circumcircle; and for $\mu\ge 4$, $O$ lies on the circumcircle, which corresponds to infinitely many P3P solutions. Furthermore, a geometric stratification for the complementary configuration $O^\prime$ associated with a singular configuration $O$ is studied as well: for $\mu\ge 2$, $O^\prime$ lies on a deltoidal surface associated with the danger cylinder, and for $\mu\ge 3$, $O^\prime$ lies on one of three cuspidal curves of the deltoidal surface.
63. 【2602.12524】LiDAR-Anchored Collaborative Distillation for Robust 2D Representations
Link: https://arxiv.org/abs/2602.12524
Authors: Wonjun Jo,Hyunwoo Ha,Kim Ji-Yeon,Hawook Jeong,Tae-Hyun Oh
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: made considerable strides, deep learning continues, continues to advance, considerable strides, image encoders
Comments:
Click to view abstract
Abstract:As deep learning continues to advance, self-supervised learning has made considerable strides. It allows 2D image encoders to extract useful features for various downstream tasks, including those related to vision-based systems. Nevertheless, pre-trained 2D image encoders fall short under noisy and adverse weather conditions beyond clear daytime scenes, which require robust visual perception. To address these issues, we propose a novel self-supervised approach, Collaborative Distillation, which leverages 3D LiDAR as self-supervision to improve robustness to noisy and adverse weather conditions in 2D image encoders while retaining their original capabilities. Our method outperforms competing methods in various downstream tasks across diverse conditions and exhibits strong generalization ability. In addition, our method improves 3D awareness stemming from LiDAR's characteristics. This advancement highlights our method's practicality and adaptability in real-world scenarios.
64. 【2602.12515】Matching of SAR and optical images based on transformation to shared modality
Link: https://arxiv.org/abs/2602.12515
Authors: Alexey Borisov,Evgeny Myasnikov,Vladislav Myasnikov
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Synthetic Aperture Radar, Aperture Radar, Earth remote sensing, Synthetic Aperture, remote sensing platforms
Comments:
Click to view abstract
Abstract:Significant differences in optical images and Synthetic Aperture Radar (SAR) images are caused by fundamental differences in the physical principles underlying their acquisition by Earth remote sensing platforms. These differences make precise image matching (co-registration) of these two types of images difficult. In this paper, we propose a new approach to image matching of optical and SAR images, which is based on transforming the images to a new modality. The new image modality is common to both optical and SAR images and satisfies the following conditions. First, the transformed images must have an equal pre-defined number of channels. Second, the transformed and co-registered images must be as similar as possible. Third, the transformed images must be non-degenerate, meaning they must preserve the significant features of the original images. To further match images transformed to this shared modality, we train the RoMa image matching model, which is one of the leading solutions for matching of regular digital photographs. We evaluated the proposed approach on the publicly available MultiSenGE dataset containing both optical and SAR images. We demonstrated its superiority over alternative approaches based on image translation between original modalities and various feature matching algorithms. The proposed solution not only provides better quality of matching, but is also more versatile. It enables the use of ready-made RoMa and DeDoDe models, pre-trained for regular images, without retraining for a new modality, while maintaining high-quality matching of optical and SAR images.
65. 【2602.12510】Visual RAG Toolkit: Scaling Multi-Vector Visual Retrieval with Training-Free Pooling and Multi-Stage Search
Link: https://arxiv.org/abs/2602.12510
Authors: Ara Yeroyan
Subjects: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: deliver strong accuracy, ColPali-style late interaction, search increasingly expensive, late interaction models, Multi-vector visual retrievers
Comments: 4 pages, 3 figures. Submitted to SIGIR 2026 Demonstrations Track. Project website: [this https URL](https://github.com/Ara-Yeroyan/visual-rag-toolkit)
Click to view abstract
Abstract:Multi-vector visual retrievers (e.g., ColPali-style late interaction models) deliver strong accuracy, but scale poorly because each page yields thousands of vectors, making indexing and search increasingly expensive. We present Visual RAG Toolkit, a practical system for scaling visual multi-vector retrieval with training-free, model-aware pooling and multi-stage retrieval. Motivated by Matryoshka Embeddings, our method performs static spatial pooling - including a lightweight sliding-window averaging variant - over patch embeddings to produce compact tile-level and global representations for fast candidate generation, followed by exact MaxSim reranking using full multi-vector embeddings. Our design yields a quadratic reduction in vector-to-vector comparisons by reducing stored vectors per page from thousands to dozens, notably without requiring post-training, adapters, or distillation. Across experiments with interaction-style models such as ColPali and ColSmol-500M, we observe that over the limited ViDoRe v2 benchmark corpus 2-stage retrieval typically preserves NDCG and Recall @ 5/10 with minimal degradation, while substantially improving throughput (approximately 4x QPS); with sensitivity mainly at very large k. The toolkit additionally provides robust preprocessing - high resolution PDF to image conversion, optional margin/empty-region cropping and token hygiene (indexing only visual tokens) - and a reproducible evaluation pipeline, enabling rapid exploration of two-, three-, and cascaded retrieval variants. By emphasizing efficiency at common cutoffs (e.g., k = 10), the toolkit lowers hardware barriers and makes state-of-the-art visual retrieval more accessible in practice.
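The two-stage idea in this abstract (pooled vectors for fast candidate generation, then exact MaxSim reranking on the full multi-vector embeddings) can be sketched in a few lines of plain Python. This is an illustrative sketch under simplifying assumptions, not the toolkit's implementation: pooling here averages non-overlapping runs of consecutive patch vectors rather than 2D spatial tiles, and the function names and the `window`, `n_candidates`, and `k` parameters are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean_pool(patches: List[Vector], window: int) -> List[List[float]]:
    """Average consecutive, non-overlapping groups of `window` patch vectors."""
    pooled = []
    for start in range(0, len(patches), window):
        group = patches[start:start + window]
        dim = len(group[0])
        pooled.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return pooled


def maxsim(query: List[Vector], doc: List[Vector]) -> float:
    """Late-interaction score: each query vector takes its best doc match; sum the maxima."""
    return sum(max(dot(q, d) for d in doc) for q in query)


def two_stage_search(query, pages, window, n_candidates, k):
    """Stage 1 ranks pages by pooled vectors; stage 2 reranks the shortlist exactly."""
    # In a real system the pooled vectors would be computed once at index time.
    pooled_pages = [mean_pool(p, window) for p in pages]
    coarse = sorted(range(len(pages)),
                    key=lambda i: maxsim(query, pooled_pages[i]),
                    reverse=True)
    shortlist = coarse[:n_candidates]
    reranked = sorted(shortlist, key=lambda i: maxsim(query, pages[i]), reverse=True)
    return reranked[:k]
```

Because stage 1 compares the query against far fewer vectors per page, the expensive full MaxSim computation only runs on the shortlist.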
66. 【2602.12508】Monocular Reconstruction of Neural Tactile Fields
Link: https://arxiv.org/abs/2602.12508
Authors: Pavan Mantripragada,Siddhanth Deshmukh,Eadom Dessalene,Manas Desai,Yiannis Aloimonos
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: static geometric occupancy, requiring interaction-aware, neural tactile fields, environments that deform, geometric occupancy
Comments: 10 pages, 8 figures
Click to view abstract
Abstract:Robots operating in the real world must plan through environments that deform, yield, and reconfigure under contact, requiring interaction-aware 3D representations that extend beyond static geometric occupancy. To address this, we introduce neural tactile fields, a novel 3D representation that maps spatial locations to the expected tactile response upon contact. Our model predicts these neural tactile fields from a single monocular RGB image -- the first method to do so. When integrated with off-the-shelf path planners, neural tactile fields enable robots to generate paths that avoid high-resistance objects while deliberately routing through low-resistance regions (e.g. foliage), rather than treating all occupied space as equally impassable. Empirically, our learning framework improves volumetric 3D reconstruction by $85.8\%$ and surface reconstruction by $26.7\%$ compared to state-of-the-art monocular 3D reconstruction methods (LRM and Direct3D).
67. 【2602.12498】Layer-Specific Fine-Tuning for Improved Negation Handling in Medical Vision-Language Models
Link: https://arxiv.org/abs/2602.12498
Authors: Ali Abbasi,Mehdi Taghipour,Rahmatollah Beheshti
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: fundamental linguistic operation, frequently fail, fundamental linguistic, linguistic operation, fail to distinguish
Comments: 15 pages, 5 figures. Submitted to ICML 2026
Click to view abstract
Abstract:Negation is a fundamental linguistic operation in clinical reporting, yet vision-language models (VLMs) frequently fail to distinguish affirmative from negated medical statements. To systematically characterize this limitation, we introduce a radiology-specific diagnostic benchmark that evaluates polarity sensitivity under controlled clinical conditions, revealing that common medical VLMs consistently confuse negated and non-negated findings. To enable learning beyond simple condition absence, we further construct a contextual clinical negation dataset that encodes structured claims and supports attribute-level negations involving location and severity. Building on these resources, we propose Negation-Aware Selective Training (NAST), an interpretability-guided adaptation method that uses causal tracing effects (CTEs) to modulate layer-wise gradient updates during fine-tuning. Rather than applying uniform learning rates, NAST scales each layer's update according to its causal contribution to negation processing, transforming mechanistic interpretability signals into a principled optimization rule. Experiments demonstrate improved discrimination of affirmative and negated clinical statements without degrading general vision-language alignment, highlighting the value of causal interpretability for targeted model adaptation in safety-critical medical settings. Code and resources are available at this https URL.
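The idea of scaling each layer's update by its causal contribution, rather than using one uniform learning rate, can be sketched as follows. The abstract does not give the exact scaling rule, so the normalization used here (a multiplier between a `floor` and 1, proportional to the layer's score relative to the largest score) is an assumption for illustration only, and the function names are hypothetical.

```python
from typing import Dict, List


def layer_scales(cte: Dict[str, float], floor: float = 0.1) -> Dict[str, float]:
    """Turn per-layer causal-effect scores into update multipliers in [floor, 1].

    Assumed rule (not from the paper): scale linearly with |score| / max |score|.
    """
    peak = max(abs(v) for v in cte.values())
    if peak == 0:
        # No causal signal at all: fall back to uniform updates.
        return {name: 1.0 for name in cte}
    return {name: floor + (1.0 - floor) * abs(v) / peak for name, v in cte.items()}


def scaled_sgd_step(params: Dict[str, List[float]],
                    grads: Dict[str, List[float]],
                    scales: Dict[str, float],
                    base_lr: float) -> Dict[str, List[float]]:
    """One plain SGD step in which each layer's update is multiplied by its scale."""
    return {
        name: [w - base_lr * scales[name] * g
               for w, g in zip(params[name], grads[name])]
        for name in params
    }
```

With scores of 0.0 and 2.0 for two layers, the first layer moves at one tenth of the base rate while the second moves at the full rate.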
68. 【2602.12489】Insertion Network for Image Sequence Correspondence
Link: https://arxiv.org/abs/2602.12489
Authors: Dingjie Su,Weixiang Hong,Benoit M. Dawant,Bennett A. Landman
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: establishing correspondence, body part regression, Abstract, images, slices
Comments:
Click to view abstract
Abstract:We propose a novel method for establishing correspondence between two sequences of 2D images. One particular application of this technique is slice-level content navigation, where the goal is to localize specific 2D slices within a 3D volume or determine the anatomical coverage of a 3D scan based on its 2D slices. This serves as an important preprocessing step for various diagnostic tasks, as well as for automatic registration and segmentation pipelines. Our approach builds sequence correspondence by training a network to learn how to insert a slice from one sequence into the appropriate position in another. This is achieved by encoding contextual representations of each slice and modeling the insertion process using a slice-to-slice attention mechanism. We apply this method to localize manually labeled key slices in body CT scans and compare its performance to the current state-of-the-art alternative known as body part regression, which predicts anatomical position scores for individual slices. Unlike body part regression, which treats each slice independently, our method leverages contextual information from the entire sequence. Experimental results show that the insertion network reduces slice localization errors in supervised settings from 8.4 mm to 5.4 mm, demonstrating a substantial improvement in accuracy.
69. 【2602.12486】Human-Like Coarse Object Representations in Vision Models
Link: https://arxiv.org/abs/2602.12486
Authors: Andrey Gizdov,Andrea Procopio,Yichen Li,Daniel Harari,Tomer Ullman
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: trading fine visual, efficient physical predictions, fine visual details, smooth concavities, trading fine
Comments:
Click to view abstract
Abstract:Humans appear to represent objects for intuitive physics with coarse, volumetric "bodies" that smooth concavities - trading fine visual details for efficient physical predictions - yet their internal structure is largely unknown. Segmentation models, in contrast, optimize pixel-accurate masks that may misalign with such bodies. We ask whether and when these models nonetheless acquire human-like bodies. Using a time-to-collision (TTC) behavioral paradigm, we introduce a comparison pipeline and alignment metric, then vary model training time, size, and effective capacity via pruning. Across all manipulations, alignment with human behavior follows an inverse U-shaped curve: small/briefly trained/pruned models under-segment into blobs; large/fully trained models over-segment with boundary wiggles; and an intermediate "ideal body granularity" best matches humans. This suggests human-like coarse bodies emerge from resource constraints rather than bespoke biases, and points to simple knobs - early checkpoints, modest architectures, light pruning - for eliciting physics-efficient representations. We situate these results within resource-rational accounts balancing recognition detail against physical affordances.
70. 【2602.12484】A Lightweight and Explainable DenseNet-121 Framework for Grape Leaf Disease Classification
Link: https://arxiv.org/abs/2602.12484
Authors: Md. Ehsanul Haque,Md. Saymon Hosen Polash,Rakib Hasan Ovi,Aminul Kader Bulbul,Md Kamrul Siam,Tamim Hasan Saykat
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: culturally significant fruits, Europe and Asia, quantities in Europe, culturally significant, significant fruits
Comments: Accepted and Presented at 28th International Conference on Computer and Information Technology (ICCIT)
Click to view abstract
Abstract:Grapes are among the most economically and culturally significant fruits on a global scale, and table grapes and wine are produced in significant quantities in Europe and Asia. The production and quality of grapes are significantly impacted by grape diseases such as Bacterial Rot, Downy Mildew, and Powdery Mildew. Consequently, the sustainable management of a vineyard necessitates the early and precise identification of these diseases. Current automated methods, particularly those based on the YOLO framework, are often computationally costly and lack interpretability, which makes them unsuitable for real-world scenarios. This study proposes grape leaf disease classification using Optimized DenseNet 121. Domain-specific preprocessing and extensive connectivity reveal disease-relevant characteristics, including veins, edges, and lesions. An extensive comparison with baseline CNN models, including ResNet18, VGG16, AlexNet, and SqueezeNet, demonstrates that the proposed model exhibits superior performance. It achieves an accuracy of 99.27%, an F1 score of 99.28%, a specificity of 99.71%, and a Kappa of 98.86%, with an inference time of 9 seconds. The cross-validation findings show a mean accuracy of 99.12%, indicating robustness and generalizability across all classes. We also employ Grad-CAM to highlight disease-related regions, both to check that the model focuses on physiologically relevant aspects and to increase transparency and confidence. Model optimization reduces processing requirements for real-time deployment, while transfer learning ensures consistency on smaller and unbalanced samples. An effective architecture, domain-specific preprocessing, and interpretable outputs make the proposed framework scalable, precise, and computationally inexpensive for detecting grape leaf diseases.
71. 【2602.12461】Semantic-aware Adversarial Fine-tuning for CLIP
Link: https://arxiv.org/abs/2602.12461
Authors: Jiacheng Zhang,Jinhao Li,Hanxun Huang,Sarah M. Erfani,Benjamin I.P. Rubinstein,Feng Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent studies, zero-shot classification tasks, cosine similarity, classification tasks, enhanced by adversarially
Comments:
Click to view abstract
Abstract:Recent studies have shown that the CLIP model's adversarial robustness in zero-shot classification tasks can be enhanced by adversarially fine-tuning its image encoder with adversarial examples (AEs), which are generated by minimizing the cosine similarity between images and a hand-crafted template (e.g., "A photo of a {label}"). However, it has been shown that the cosine similarity between a single image and a single hand-crafted template is insufficient to measure the similarity for image-text pairs. Building on this, in this paper, we find that the AEs generated using cosine similarity may fail to fool CLIP when the similarity metric is replaced with semantically enriched alternatives, making the image encoder fine-tuned with these AEs less robust. To overcome this issue, we first propose a semantic-ensemble attack to generate semantic-aware AEs by minimizing the average similarity between the original image and an ensemble of refined textual descriptions. These descriptions are initially generated by a foundation model to capture core semantic features beyond hand-crafted templates and are then refined to reduce hallucinations. To this end, we propose Semantic-aware Adversarial Fine-Tuning (SAFT), which fine-tunes CLIP's image encoder with semantic-aware AEs. Extensive experiments show that SAFT outperforms current methods, achieving substantial improvements in zero-shot adversarial robustness across 16 datasets. Our code is available at: this https URL.
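The difference between scoring an image against a single template and against an ensemble of descriptions comes down to which quantity the attack minimizes. A minimal sketch of the two objectives, assuming embeddings are plain lists of floats and using hypothetical function names; the attack loop itself (perturbing the image to lower the score) is omitted:

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def template_similarity(image_emb: Vector, template_emb: Vector) -> float:
    """Objective of the single-template attack: one image, one text embedding."""
    return cosine(image_emb, template_emb)


def ensemble_similarity(image_emb: Vector, description_embs: List[Vector]) -> float:
    """Objective of an ensemble attack: mean similarity over several descriptions."""
    return sum(cosine(image_emb, t) for t in description_embs) / len(description_embs)
```

An image that scores low against one template can still score high on average against many descriptions, which is the gap the abstract points to.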
72. 【2602.12441】Prototype-driven fusion of pathology and spatial transcriptomics for interpretable survival prediction
Link: https://arxiv.org/abs/2602.12441
Authors: Lihe Liu,Xiaoxi Pan,Yinyin Yuan,Lulu Shang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: multiple instance learning, slide images, modeling via multiple, multiple instance, MIL
Comments:
Click to view abstract
Abstract:Whole slide images (WSIs) enable weakly supervised prognostic modeling via multiple instance learning (MIL). Spatial transcriptomics (ST) preserves in situ gene expression, providing a spatial molecular context that complements morphology. As paired WSI-ST cohorts scale to population level, leveraging their complementary spatial signals for prognosis becomes crucial; however, principled cross-modal fusion strategies remain limited for this paradigm. To this end, we introduce PathoSpatial, an interpretable end-to-end framework integrating co-registered WSIs and ST to learn spatially informed prognostic representations. PathoSpatial uses task-guided prototype learning within a multi-level experts architecture, adaptively orchestrating unsupervised within-modality discovery with supervised cross-modal aggregation. By design, PathoSpatial substantially strengthens interpretability while maintaining discriminative ability. We evaluate PathoSpatial on a triple-negative breast cancer cohort with paired ST and WSIs. PathoSpatial delivers strong and consistent performance across five survival endpoints, achieving superior or comparable performance to leading unimodal and multimodal methods. PathoSpatial inherently enables post-hoc prototype interpretation and molecular risk decomposition, providing quantitative, biologically grounded explanations, highlighting candidate prognostic factors. We present PathoSpatial as a proof-of-concept for scalable and interpretable multimodal learning for spatial omics-pathology fusion.
73. 【2602.12407】MiDAS: A Multimodal Data Acquisition System and Dataset for Robot-Assisted Minimally Invasive Surgery
Link: https://arxiv.org/abs/2602.12407
Authors: Keshara Weerasinghe,Seyed Hamid Reza Roodabeh,Andrew Hawkins(MD),Zhaomeng Zhang,Zachary Schrader,Homa Alemzadeh
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Robot-assisted minimally invasive, minimally invasive surgery, research increasingly relies, Robot-assisted minimally, invasive surgery
Comments: 29 pages, 17 figures
Click to view abstract
Abstract:Background: Robot-assisted minimally invasive surgery (RMIS) research increasingly relies on multimodal data, yet access to proprietary robot telemetry remains a major barrier. We introduce MiDAS, an open-source, platform-agnostic system enabling time-synchronized, non-invasive multimodal data acquisition across surgical robotic platforms. Methods: MiDAS integrates electromagnetic and RGB-D hand tracking, foot pedal sensing, and surgical video capturing without requiring proprietary robot interfaces. We validated MiDAS on the open-source Raven-II and the clinical da Vinci Xi by collecting multimodal datasets of peg transfer and hernia repair suturing tasks performed by surgical residents. Correlation analysis and downstream gesture recognition experiments were conducted. Results: External hand and foot sensing closely approximated internal robot kinematics and non-invasive motion signals achieved gesture recognition performance comparable to proprietary telemetry. Conclusion: MiDAS enables reproducible multimodal RMIS data collection and is released with annotated datasets, including the first multimodal dataset capturing hernia repair suturing on high-fidelity simulation models.
74. 【2602.12403】MonoLoss: A Training Objective for Interpretable Monosemantic Representations
Link: https://arxiv.org/abs/2602.12403
Authors: Ali Nasiri-Sarvi,Anh Tien Nguyen,Hassan Rivaz,Dimitris Samaras,Mahdi S. Hosseini
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: multiple unrelated concepts, decompose polysemantic neural, Sparse autoencoders, polysemantic neural representations, unrelated concepts
Comments: Under Review
Click to view abstract
Abstract:Sparse autoencoders (SAEs) decompose polysemantic neural representations, where neurons respond to multiple unrelated concepts, into monosemantic features that capture single, interpretable concepts. However, standard training objectives only weakly encourage this decomposition, and existing monosemanticity metrics require pairwise comparisons across all dataset samples, making them inefficient during training and evaluation. We study a recent MonoScore metric and derive a single-pass algorithm that computes exactly the same quantity, but with a cost that grows linearly, rather than quadratically, with the number of dataset images. On OpenImagesV7, we achieve up to a 1200x wall-clock speedup in evaluation and 159x during training, while adding only ~4% per-epoch overhead. This allows us to treat MonoScore as a training signal: we introduce the Monosemanticity Loss (MonoLoss), a plug-in objective that directly rewards semantically consistent activations for learning interpretable monosemantic representations. Across SAEs trained on CLIP, SigLIP2, and pretrained ViT features, using BatchTopK, TopK, and JumpReLU SAEs, MonoLoss increases MonoScore for most latents. MonoLoss also consistently improves class purity (the fraction of a latent's activating images belonging to its dominant class) across all encoder and SAE combinations, with the largest gain raising baseline purity from 0.152 to 0.723. Used as an auxiliary regularizer during ResNet-50 and CLIP-ViT-B/32 finetuning, MonoLoss yields up to 0.6% accuracy gains on ImageNet-1K and monosemantic activating patterns on standard benchmark datasets. The code is publicly available at this https URL.
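To see how a pairwise quantity can be computed in a single pass, consider a generic activation-weighted sum of pairwise dot products. The abstract does not give MonoScore's definition, so the quantity below is only an assumed stand-in that shows the kind of algebraic identity involved, namely sum over i != j of a_i a_j <x_i, x_j> = ||sum_i a_i x_i||^2 - sum_i a_i^2 ||x_i||^2. It is not the authors' derivation, and the function names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def pairwise_quadratic(acts: List[float], embs: List[Vector]) -> float:
    """O(n^2): sum of a_i * a_j * <x_i, x_j> over all ordered pairs with i != j."""
    total = 0.0
    n = len(acts)
    for i in range(n):
        for j in range(n):
            if i != j:
                total += acts[i] * acts[j] * dot(embs[i], embs[j])
    return total


def pairwise_single_pass(acts: List[float], embs: List[Vector]) -> float:
    """O(n): same value from one running weighted sum plus a self-term correction."""
    dim = len(embs[0])
    running = [0.0] * dim          # accumulates sum_i a_i * x_i
    self_term = 0.0                # accumulates sum_i a_i^2 * ||x_i||^2
    for a, x in zip(acts, embs):
        for d in range(dim):
            running[d] += a * x[d]
        self_term += a * a * dot(x, x)
    return dot(running, running) - self_term
```

Both functions return the same number, but the second touches each sample once, so it can be updated batch by batch during training.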
75. 【2602.12401】ZeroDiff++: Substantial Unseen Visual-semantic Correlation in Zero-shot Learning
Link: https://arxiv.org/abs/2602.12401
Authors: Zihan Ye,Shreyank N Gowda,Kaile Du,Weijian Luo,Ling Shao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: learn visual semantic, visual semantic correlations, synthesize unseen class, Zero-shot Learning, recognize classes unseen
Comments: Under review
Click to view abstract
Abstract:Zero-shot Learning (ZSL) enables classifiers to recognize classes unseen during training, commonly via generative two-stage methods: (1) learn visual-semantic correlations from seen classes; (2) synthesize unseen class features from semantics to train classifiers. In this paper, we identify spurious visual-semantic correlations in existing generative ZSL, worsened by scarce seen-class samples, and introduce two metrics to quantify spuriousness for seen and unseen classes. Furthermore, we point out a more critical bottleneck: existing unadaptive fully noised generators produce features disconnected from real test samples, which also leads to the spurious correlation. To enhance the visual-semantic correlations on both seen and unseen classes, we propose ZeroDiff++, a diffusion-based generative framework. In training, ZeroDiff++ uses (i) diffusion augmentation to produce diverse noised samples, (ii) supervised contrastive (SC) representations for instance-level semantics, and (iii) multi-view discriminators with Wasserstein mutual learning to assess generated features. At generation time, we introduce (iv) Diffusion-based Test-time Adaptation (DiffTTA) to adapt the generator using pseudo-label reconstruction, and (v) Diffusion-based Test-time Generation (DiffGen) to trace the diffusion denoising path and produce partially synthesized features that connect real and generated data, further mitigating data scarcity. Extensive experiments on three ZSL benchmarks demonstrate that ZeroDiff++ not only achieves significant improvements over existing ZSL methods but also maintains robust performance even with scarce training data. Code will be made available.
76. 【2602.12395】What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis
Link: https://arxiv.org/abs/2602.12395
Authors: Xirui Li,Ming Li,Tianyi Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: standard post-training stage, Reinforcement learning, cold-start initialization, verifiable rewards, standard post-training
Comments:
Click to view abstract
Abstract:Reinforcement learning (RL) with verifiable rewards has become a standard post-training stage for boosting visual reasoning in vision-language models, yet it remains unclear what capabilities RL actually improves compared with supervised fine-tuning as cold-start initialization (IN). End-to-end benchmark gains conflate multiple factors, making it difficult to attribute improvements to specific skills. To bridge the gap, we propose a Frankenstein-style analysis framework including: (i) functional localization via causal probing; (ii) update characterization via parameter comparison; and (iii) transferability test via model merging. Instead, RL induces a consistent inference-time shift primarily in mid-to-late layers, and these mid-to-late refinements are both transferable (via merging) and necessary (via freezing) for RL gains. Overall, our results suggest that RL's reliable contribution in visual reasoning is not a uniform enhancement of visual perception, but a systematic refinement of mid-to-late transformer computation that improves vision-to-reasoning alignment and reasoning performance, highlighting the limitations of benchmark-only evaluation for understanding multimodal reasoning improvements.
77. 【2602.12393】Reproducing DragDiffusion: Interactive Point-Based Editing with Diffusion Models
Link: https://arxiv.org/abs/2602.12393
Authors: Ali Subhan,Ashir Raza
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: interactive point-based image, point-based image editing, dragging selected points, directly dragging selected, point-based image
Comments: 16 pages, 8 figures. Reproducibility study of DragDiffusion (CVPR 2024). Submitted to TMLR Reproducibility Challenge. Code available on GitHub
Click to view abstract
Abstract:DragDiffusion is a diffusion-based method for interactive point-based image editing that enables users to manipulate images by directly dragging selected points. The method claims that accurate spatial control can be achieved by optimizing a single diffusion latent at an intermediate timestep, together with identity-preserving fine-tuning and spatial regularization. This work presents a reproducibility study of DragDiffusion using the authors' released implementation and the DragBench benchmark. We reproduce the main ablation studies on diffusion timestep selection, LoRA-based fine-tuning, mask regularization strength, and UNet feature supervision, and observe close agreement with the qualitative and quantitative trends reported in the original work. At the same time, our experiments show that performance is sensitive to a small number of hyperparameter assumptions, particularly the optimized timestep and the feature level used for motion supervision, while other components admit broader operating ranges. We further evaluate a multi-timestep latent optimization variant and find that it does not improve spatial accuracy while substantially increasing computational cost. Overall, our findings support the central claims of DragDiffusion while clarifying the conditions under which they are reliably reproducible. Code is available at this https URL.
78. 【2602.12381】Synthetic Image Detection with CLIP: Understanding and Assessing Predictive Cues
Link: https://arxiv.org/abs/2602.12381
Authors: Marco Willi,Melanie Mathys,Michael Graber
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: produce near-photorealistic images, models produce near-photorealistic, challenging the trustworthiness, produce near-photorealistic, SID
Comments: 11 figures; 23 pages
Click to view abstract
Abstract:Recent generative models produce near-photorealistic images, challenging the trustworthiness of photographs. Synthetic image detection (SID) has thus become an important area of research. Prior work has highlighted how synthetic images differ from real photographs--unfortunately, SID methods often struggle to generalize to novel generative models and often perform poorly in practical settings. CLIP, a foundational vision-language model which yields semantically rich image-text embeddings, shows strong accuracy and generalization for SID. Yet, the underlying relevant cues embedded in CLIP features remain unknown. It is unclear whether CLIP-based detectors simply detect strong visual artifacts or exploit subtle semantic biases, both of which would render them useless in practical settings or on generative models of high quality. We introduce SynthCLIC, a paired dataset of real photographs and high-quality synthetic counterparts from recent diffusion models, designed to reduce semantic bias in SID. Using an interpretable linear head with de-correlated activations and a text-grounded concept-model, we analyze what CLIP-based detectors learn. CLIP-based linear detectors reach 0.96 mAP on a GAN-based benchmark but only 0.92 on our high-quality diffusion dataset SynthCLIC, and generalization across generator families drops to as low as 0.37 mAP. We find that the detectors primarily rely on high-level photographic attributes (e.g., minimalist style, lens flare, or depth layering), rather than overt generator-specific artifacts. CLIP-based detectors perform well overall but generalize unevenly across diverse generative architectures. This highlights the need for continual model updates and broader training exposure, while reinforcing CLIP-based approaches as a strong foundation for more universal, robust SID.
79. 【2602.12380】TFT-ACB-XML: Decision-Level Integration of Customized Temporal Fusion Transformer and Attention-BiLSTM with XGBoost Meta-Learner for BTC Price Forecasting
Link: https://arxiv.org/abs/2602.12380
Authors: Raiz Ud Din (1), Saddam Hussain Khan (2) ((1) Artificial Intelligence Lab, Department of Computer Systems Engineering, University of Engineering and Applied Sciences (UEAS), Swat, Pakistan, (2) Interdisciplinary Research Center for Smart Mobility and Logistics, King Fahad University of Petroleum and Minerals (KFUPM), Dhahran, Saudi Arabia)
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Accurate forecasting, forecasting of Bitcoin, highly volatile, challenge because decentralized, Temporal Fusion Transformer
Comments: 41 pages, 15 Figures, 12 Tables
Abstract:Accurate forecasting of Bitcoin (BTC) has always been a challenge because decentralized markets are non-linear, highly volatile, and have temporal irregularities. Existing deep learning models often struggle with interpretability and generalization across diverse market conditions. This research presents a hybrid stacked-generalization framework, TFT-ACB-XML, for BTC closing price prediction. The framework integrates two parallel base learners: a customized Temporal Fusion Transformer (TFT) and an Attention-Customized Bidirectional Long Short-Term Memory network (ACB), followed by an XGBoost regressor as the meta-learner. The customized TFT model handles long-range dependencies and global temporal dynamics via variable selection networks and interpretable single-head attention. The ACB module uses a new attention mechanism alongside the customized BiLSTM to capture short-term sequential dependencies. Predictions from both customized TFT and ACB are weighted through an error-reciprocal weighting strategy. These weights are derived from validation performance, where a model showing lower prediction error receives a higher weight. Finally, the framework concatenates these weighted outputs into a feature vector and feeds the vector to an XGBoost regressor, which captures non-linear residuals and produces the final BTC closing price prediction. Empirical validation using BTC data from October 1, 2014, to January 5, 2026, shows improved performance of the proposed framework compared to recent Deep Learning and Transformer baseline models. The results show a MAPE of 0.65%, an MAE of 198.15, and an RMSE of 258.30 for one-step-ahead out-of-sample forecasts under a walk-forward evaluation on the test block. The evaluation period spans the 2024 BTC halving and the spot exchange-traded fund (ETF) period, which coincide with major liquidity and volatility shifts.
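The error-reciprocal weighting step described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the normalization of the weights to sum to one, and the validation errors and predictions are all assumptions made for illustration. The abstract only states that a lower validation error yields a higher weight.

```python
import numpy as np

def error_reciprocal_weights(val_errors):
    """Return weights proportional to 1/error, normalized to sum to 1.

    Normalization is an assumption; the abstract only says that lower
    validation error gives a higher weight.
    """
    errors = np.asarray(val_errors, dtype=float)
    if np.any(errors <= 0):
        raise ValueError("validation errors must be positive")
    inverse = 1.0 / errors
    return inverse / inverse.sum()

def weighted_features(preds, weights):
    """Scale each base learner's prediction column by its weight.

    preds has shape (n_samples, n_models); the result has the same shape
    and would serve as the meta-learner's input feature matrix.
    """
    return np.asarray(preds, dtype=float) * np.asarray(weights, dtype=float)

# Hypothetical validation MAEs for the two base learners (TFT, ACB).
weights = error_reciprocal_weights([2.0, 1.0])

# Hypothetical one-step-ahead predictions from the two base learners.
features = weighted_features([[100.0, 102.0], [110.0, 108.0]], weights)
```

In a stacked setup such as the one the abstract describes, a matrix like `features` would then be passed to the meta-learner (an XGBoost regressor in the paper); that step is omitted here to keep the sketch dependency-free.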
80. 【2602.12370】LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens
Link: https://arxiv.org/abs/2602.12370
Authors: Zekun Li, Sizhe An, Chengcheng Tang, Chuan Guo, Ivan Shugurov, Linguang Zhang, Amy Zhao, Srinath Sridhar, Lingling Tao, Abhay Mittal
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent progress, Recent, generation, understanding, models
Comments: Project page: https://kunkun0w0.github.io/project/LLaMo/
Abstract:Recent progress in large models has led to significant advances in unified multimodal generation and understanding. However, the development of models that unify motion-language generation and understanding remains largely underexplored. Existing approaches often fine-tune large language models (LLMs) on paired motion-text data, which can result in catastrophic forgetting of linguistic capabilities due to the limited scale of available text-motion pairs. Furthermore, prior methods typically convert motion into discrete representations via quantization to integrate with language models, introducing substantial jitter artifacts from discrete tokenization. To address these challenges, we propose LLaMo, a unified framework that extends pretrained LLMs through a modality-specific Mixture-of-Transformers (MoT) architecture. This design inherently preserves the language understanding of the base model while enabling scalable multimodal adaptation. We encode human motion into a causal continuous latent space and maintain the next-token prediction paradigm in the decoder-only backbone through a lightweight flow-matching head, allowing for streaming motion generation in real-time (30 FPS). Leveraging the comprehensive language understanding of pretrained LLMs and large-scale motion-text pretraining, our experiments demonstrate that LLaMo achieves high-fidelity text-to-motion generation and motion-to-text captioning in general settings, especially zero-shot motion generation, marking a significant step towards a general unified motion-language large model.
81. 【2602.12361】Thermal Imaging for Contactless Cardiorespiratory and Sudomotor Response Monitoring
Link: https://arxiv.org/abs/2602.12361
Authors: Constantino Álvarez Casado, Mohammad Rahman, Sasan Sharifipour, Nhi Nguyen, Manuel Lage Cañellas, Xiaoting Wu, Miguel Bordallo López
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: infrared imaging captures, imaging captures skin, captures skin temperature, Thermal infrared imaging, electrodermal activity
Comments: 7 pages, 6 figures, 3 tables, 22 references, 1 equation, conference
Abstract:Thermal infrared imaging captures skin temperature changes driven by autonomic regulation and can potentially provide contactless estimation of electrodermal activity (EDA), heart rate (HR), and breathing rate (BR). While visible-light methods address HR and BR, they cannot access EDA, a standard marker of sympathetic activation. This paper characterizes the extraction of these three biosignals from facial thermal video using a signal-processing pipeline that tracks anatomical regions, applies spatial aggregation, and separates slow sudomotor trends from faster cardiorespiratory components. For HR, we apply an orthogonal matrix image transformation (OMIT) decomposition across multiple facial regions of interest (ROIs), and for BR we average nasal and cheek signals before spectral peak detection. We evaluate 288 EDA configurations and the HR/BR pipeline on 31 sessions from the public SIMULATOR STUDY 1 (SIM1) driver monitoring dataset. The best fixed EDA configuration (nose region, exponential moving average) reaches a mean absolute correlation of $0.40 \pm 0.23$ against palm EDA, with individual sessions reaching 0.89. BR estimation achieves a mean absolute error of $3.1 \pm 1.1$ bpm, while HR estimation yields $13.8 \pm 7.5$ bpm MAE, limited by the low camera frame rate (7.5 Hz). We report signal polarity alternation across sessions, short thermodynamic latency for well-tracked signals, and condition-dependent and demographic effects on extraction quality. These results provide baseline performance bounds and design guidance for thermal contactless biosignal estimation.
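Two building blocks named in this abstract, an exponential moving average for the slow trend and spectral peak detection for breathing rate, can be sketched as follows. This is only an illustration on a synthetic signal, not the paper's pipeline: the smoothing factor, the 0.1 to 0.5 Hz breathing band, and the signal itself are assumptions; only the 7.5 Hz frame rate comes from the abstract.

```python
import numpy as np

def exponential_moving_average(x, alpha):
    """Causal EMA: out[i] = alpha * x[i] + (1 - alpha) * out[i - 1]."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    out[0] = x[0]
    for i in range(1, len(x)):
        out[i] = alpha * x[i] + (1.0 - alpha) * out[i - 1]
    return out

def spectral_peak_bpm(signal, fs, band=(0.1, 0.5)):
    """Return the dominant frequency inside `band` (Hz), in cycles per minute."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    power = np.abs(np.fft.rfft(x)) ** 2
    mask = (freqs >= band[0]) & (freqs <= band[1])
    return 60.0 * freqs[mask][np.argmax(power[mask])]

fs = 7.5                                   # camera frame rate from the abstract (Hz)
t = np.arange(450) / fs                    # 60 s of synthetic samples
breathing = 0.05 * np.sin(2 * np.pi * 0.25 * t)   # assumed 15 breaths/min
drift = 0.002 * t                          # assumed slow thermal drift
signal = breathing + drift

slow_trend = exponential_moving_average(signal, alpha=0.05)  # slow component
fast = signal - slow_trend                                   # faster component
bpm = spectral_peak_bpm(fast, fs)
```

Subtracting the EMA acts as a simple high-pass filter, which is one plausible way to separate slow trends from faster components; the paper may separate them differently.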
82. 【2602.12351】LongNav-R1: Horizon-Adaptive Multi-Turn RL for Long-Horizon VLA Navigation
Link: https://arxiv.org/abs/2602.12351
Authors: Yue Hu, Avery Xi, Qixin Xiao, Seth Isaacson, Henry X. Liu, Ram Vasudevan, Maani Ghaffari
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: multi-turn reinforcement learning, reinforcement learning, designed to optimize, VLA policy, multi-turn reinforcement
Comments: VLA, Navigation
Abstract:This paper develops LongNav-R1, an end-to-end multi-turn reinforcement learning (RL) framework designed to optimize Visual-Language-Action (VLA) models for long-horizon navigation. Unlike the existing single-turn paradigm, LongNav-R1 reformulates the navigation decision process as a continuous multi-turn conversation between the VLA policy and the embodied environment. This multi-turn RL framework offers two distinct advantages: i) it enables the agent to reason about the causal effects of historical interactions and sequential future outcomes; and ii) it allows the model to learn directly from online interactions, fostering diverse trajectory generation and avoiding the behavioral rigidity often imposed by human demonstrations. Furthermore, we introduce Horizon-Adaptive Policy Optimization. This mechanism explicitly accounts for varying horizon lengths during advantage estimation, facilitating accurate temporal credit assignment over extended sequences. Consequently, the agent develops diverse navigation behaviors and resists collapse during long-horizon tasks. Experiments on object navigation benchmarks validate the framework's efficacy: with 4,000 rollout trajectories, LongNav-R1 boosts the Qwen3-VL-2B success rate from 64.3% to 73.0%. These results demonstrate superior sample efficiency and significantly outperform state-of-the-art methods. The model's generalizability and robustness are further validated by its zero-shot performance in long-horizon real-world navigation settings. All source code will be open-sourced upon publication.
83. 【2602.12314】LatentAM: Real-Time, Large-Scale Latent Gaussian Attention Mapping via Online Dictionary Learning
Link: https://arxiv.org/abs/2602.12314
Authors: Junwoon Lee, Yulun Tian
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: open-vocabulary robotic perception, builds scalable latent, Gaussian Splatting, streaming RGB-D observations, scalable latent feature
Comments: 8 pages, 5 figures
Abstract:We present LatentAM, an online 3D Gaussian Splatting (3DGS) mapping framework that builds scalable latent feature maps from streaming RGB-D observations for open-vocabulary robotic perception. Instead of distilling high-dimensional Vision-Language Model (VLM) embeddings using model-specific decoders, LatentAM proposes an online dictionary learning approach that is both model-agnostic and pretraining-free, enabling plug-and-play integration with different VLMs at test time. Specifically, our approach associates each Gaussian primitive with a compact query vector that can be converted into approximate VLM embeddings using an attention mechanism with a learnable dictionary. The dictionary is initialized efficiently from streaming observations and optimized online to adapt to evolving scene semantics under trust-region regularization. To scale to long trajectories and large environments, we further propose an efficient map management strategy based on voxel hashing, where optimization is restricted to an active local map on the GPU, while the global map is stored and indexed on the CPU to maintain bounded GPU memory usage. Experiments on public benchmarks and a large-scale custom dataset demonstrate that LatentAM attains significantly better feature reconstruction fidelity compared to state-of-the-art methods, while achieving near-real-time speed (12-35 FPS) on the evaluated datasets. Our project page is at: this https URL
84. 【2602.12302】Grandes Modelos de Linguagem Multimodais (MLLMs): Da Teoria à Prática
Link: https://arxiv.org/abs/2602.12302
Authors: Neemias da Silva, Júlio C. W. Scholz, John Harrison, Marina Borges, Paulo Ávila, Frances A Santos, Myriam Delgado, Rodrigo Minetto, Thiago H Silva
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Language Models, Multimodal Large Language, natural language understanding, Large Language, Language Models
Comments: in Portuguese language. Accepted book chapter - Webmedia 2025
Abstract:Multimodal Large Language Models (MLLMs) combine the natural language understanding and generation capabilities of LLMs with perception skills in modalities such as image and audio, representing a key advancement in contemporary AI. This chapter presents the main fundamentals of MLLMs and emblematic models. Practical techniques for preprocessing, prompt engineering, and building multimodal pipelines with LangChain and LangGraph are also explored. For further practical study, supplementary material is publicly available online: this https URL. Finally, the chapter discusses the challenges and highlights promising trends.
85. 【2511.13494】Language-Guided Invariance Probing of Vision-Language Models
Link: https://arxiv.org/abs/2511.13494
Authors: Jae Joong Lee
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Recent vision-language models, strong zero-shot performance, achieve strong zero-shot, Recent vision-language, controlled linguistic perturbations
Comments:
Abstract:Recent vision-language models (VLMs) such as CLIP, OpenCLIP, EVA02-CLIP and SigLIP achieve strong zero-shot performance, but it is unclear how reliably they respond to controlled linguistic perturbations. We introduce Language-Guided Invariance Probing (LGIP), a benchmark that measures (i) invariance to meaning-preserving paraphrases and (ii) sensitivity to meaning-changing semantic flips in image-text matching. Using 40k MS COCO images with five human captions each, we automatically generate paraphrases and rule-based flips that alter object category, color or count, and summarize model behavior with an invariance error, a semantic sensitivity gap and a positive-rate statistic. Across nine VLMs, EVA02-CLIP and large OpenCLIP variants lie on a favorable invariance-sensitivity frontier, combining low paraphrase-induced variance with consistently higher scores for original captions than for their flipped counterparts. In contrast, SigLIP and SigLIP2 show much larger invariance error and often prefer flipped captions to the human descriptions, especially for object and color edits. These failures are largely invisible to standard retrieval metrics, indicating that LGIP provides a model-agnostic diagnostic for the linguistic robustness of VLMs beyond conventional accuracy scores.
86. 【2602.12985】Represent Micro-Doppler Signature in Orders
Link: https://arxiv.org/abs/2602.12985
Authors: Weicheng Gao
Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
Keywords: indoor human activities, human activities, multiple-input multiple-output, complex environments, environments is enabled
Comments: 17 pages, 8 figures, 5 tables
Abstract:Non-line-of-sight sensing of human activities in complex environments is enabled by multiple-input multiple-output through-the-wall radar (TWR). However, the distinctiveness of micro-Doppler signature between similar indoor human activities such as gun carrying and normal walking is minimal, while the large scale of input images required for effective identification utilizing time-frequency spectrograms creates challenges for model training and inference efficiency. To address this issue, the Chebyshev-time map is proposed in this paper, which is a method characterizing micro-Doppler signature using polynomial orders. The parametric kinematic models for human motion and the TWR echo model are first established. Then, a time-frequency feature representation method based on orthogonal Chebyshev polynomial decomposition is proposed. The kinematic envelopes of the torso and limbs are extracted, and the time-frequency spectrum slices are mapped into a robust Chebyshev-time coefficient space, preserving the multi-order morphological detail information of time-frequency spectrum. Numerical simulations and experiments are conducted to verify the effectiveness of the proposed method, which demonstrates the capability to characterize armed and unarmed indoor human activities while effectively compressing the scale of the time-frequency spectrum to achieve a balance between recognition accuracy and input data dimensions. The open-source code of this paper can be found in: this https URL.
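The core idea of representing a time-frequency slice by a small number of Chebyshev coefficients can be sketched with NumPy's polynomial module. This is a generic least-squares fit, not the paper's Chebyshev-time map: the synthetic Gaussian-shaped slice, the 256-bin length, and the order of 16 are assumptions chosen only to show the compression effect.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def chebyshev_coefficients(slice_values, order):
    """Least-squares Chebyshev fit of one spectrum slice mapped onto [-1, 1]."""
    y = np.asarray(slice_values, dtype=float)
    x = np.linspace(-1.0, 1.0, len(y))
    return C.chebfit(x, y, order)

def reconstruct(coeffs, n_bins):
    """Evaluate the Chebyshev series back on an n_bins grid over [-1, 1]."""
    x = np.linspace(-1.0, 1.0, n_bins)
    return C.chebval(x, coeffs)

n_bins = 256
grid = np.linspace(-1.0, 1.0, n_bins)
# Hypothetical smooth Doppler profile standing in for one spectrogram column.
spectrum_slice = np.exp(-((grid - 0.2) / 0.3) ** 2)

coeffs = chebyshev_coefficients(spectrum_slice, order=16)
approx = reconstruct(coeffs, n_bins)
max_err = np.max(np.abs(approx - spectrum_slice))
```

Here 256 frequency bins are summarized by 17 coefficients per time step; stacking such coefficient vectors over time would give a coefficient-versus-time map, which is one way to read the "polynomial orders" representation in the abstract.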
87. 【2602.12974】Statistical Opportunities in Neuroimaging
Link: https://arxiv.org/abs/2602.12974
Authors: Jian Kang, Thomas Nichols, Lexin Li, Martin A. Lindquist, Hongtu Zhu
Categories: Applications (stat.AP); Computer Vision and Pattern Recognition (cs.CV); Methodology (stat.ME)
Keywords: modalities like MRI, characterizing its structure, profoundly enhanced, enhanced our understanding, connectivity through modalities
Comments: 33 pages, 3 figures
Abstract:Neuroimaging has profoundly enhanced our understanding of the human brain by characterizing its structure, function, and connectivity through modalities like MRI, fMRI, EEG, and PET. These technologies have enabled major breakthroughs across the lifespan, from early brain development to neurodegenerative and neuropsychiatric disorders. Despite these advances, the brain is a complex, multiscale system, and neuroimaging measurements are correspondingly high-dimensional. This creates major statistical challenges, including measurement noise, motion-related artifacts, substantial inter-subject and site/scanner variability, and the sheer scale of modern studies. This paper explores statistical opportunities and challenges in neuroimaging across four key areas: (i) brain development from birth to age 20, (ii) the adult and aging brain, (iii) neurodegeneration and neuropsychiatric disorders, and (iv) brain encoding and decoding. After a quick tutorial on major imaging technologies, we review cutting-edge studies, underscore data and modeling challenges, and highlight research opportunities for statisticians. We conclude by emphasizing that close collaboration among statisticians, neuroscientists, and clinicians is essential for translating neuroimaging advances into improved diagnostics, deeper mechanistic insight, and more personalized treatments.
88. 【2602.12883】Dual-Phase Cross-Modal Contrastive Learning for CMR-Guided ECG Representations for Cardiovascular Disease Assessment
Link: https://arxiv.org/abs/2602.12883
Authors: Laura Alvarez-Florez, Angel Bujalance-Gomez, Femke Raijmakers, Samuel Ruiperez-Campillo, Maarten Z. H. Kolk, Jesse Wiers, Julia Vogt, Erik J. Bekkers, Ivana Išgum, Fleur V. Y. Tjong
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: magnetic resonance imaging, selected patient populations, offers detailed evaluation, limited accessibility restricts, Cardiac magnetic resonance
Comments: Paper accepted at SPIE Medical Imaging 2026 Conference
Abstract:Cardiac magnetic resonance imaging (CMR) offers detailed evaluation of cardiac structure and function, but its limited accessibility restricts use to selected patient populations. In contrast, the electrocardiogram (ECG) is ubiquitous and inexpensive, and provides rich information on cardiac electrical activity and rhythm, yet offers limited insight into underlying cardiac structure and mechanical function. To address this, we introduce a contrastive learning framework that improves the extraction of clinically relevant cardiac phenotypes from ECG by learning from paired ECG-CMR data. Our approach aligns ECG representations with 3D CMR volumes at end-diastole (ED) and end-systole (ES), with a dual-phase contrastive loss to anchor each ECG jointly with both cardiac phases in a shared latent space. Unlike prior methods limited to 2D CMR representations with or without a temporal component, our framework models 3D anatomy at both ED and ES phases as distinct latent representations, enabling flexible disentanglement of structural and functional cardiac properties. Using over 34,000 ECG-CMR pairs from the UK Biobank, we demonstrate improved extraction of image-derived phenotypes from ECG, particularly for functional parameters ($\uparrow$ 9.2\%), while improvements in clinical outcome prediction remained modest ($\uparrow$ 0.7\%). This strategy could enable scalable and cost-effective extraction of image-derived traits from ECG. The code for this research is publicly available.
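A dual-phase contrastive objective of the kind this abstract describes might look like the sketch below, which anchors each ECG embedding to both its ED and ES counterparts. The exact loss in the paper is not given in the abstract, so the symmetric InfoNCE form, the equal weighting of the two phases, the temperature, and the random embeddings are all assumptions.

```python
import numpy as np

def l2_normalize(x):
    """Normalize each row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce(a, b, temperature=0.1):
    """InfoNCE loss where row i of `a` should match row i of `b`."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def dual_phase_loss(ecg, cmr_ed, cmr_es, temperature=0.1):
    """Sum of symmetric ECG-to-ED and ECG-to-ES contrastive terms (assumed form)."""
    loss_ed = 0.5 * (info_nce(ecg, cmr_ed, temperature)
                     + info_nce(cmr_ed, ecg, temperature))
    loss_es = 0.5 * (info_nce(ecg, cmr_es, temperature)
                     + info_nce(cmr_es, ecg, temperature))
    return loss_ed + loss_es

rng = np.random.default_rng(0)
ecg = rng.normal(size=(8, 16))                        # hypothetical ECG embeddings
cmr_ed = ecg + 0.01 * rng.normal(size=ecg.shape)      # well-aligned ED embeddings
cmr_es = ecg + 0.01 * rng.normal(size=ecg.shape)      # well-aligned ES embeddings

aligned = dual_phase_loss(ecg, cmr_ed, cmr_es)
# Shifting the ED batch by one row breaks the pairing and should raise the loss.
shuffled = dual_phase_loss(ecg, np.roll(cmr_ed, 1, axis=0), cmr_es)
```

Keeping ED and ES as separate terms, rather than fusing them into one CMR embedding, mirrors the abstract's point that the two phases are modeled as distinct latent representations.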
89. 【2602.12820】3DLAND: 3D Lesion Abdominal Anomaly Localization Dataset
Link: https://arxiv.org/abs/2602.12820
Authors: Mehran Advand, Zahra Dehghanian, Navid Faraji, Reza Barati, Seyed Amir Ahmad Safavi-Naini, Hamid R. Rabiee
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: hindering robust representation, Existing medical imaging, lack three-dimensional annotations, robust representation learning, multi-organ coverage
Comments:
Abstract:Existing medical imaging datasets for abdominal CT often lack three-dimensional annotations, multi-organ coverage, or precise lesion-to-organ associations, hindering robust representation learning and clinical applications. To address this gap, we introduce 3DLAND, a large-scale benchmark dataset comprising over 6,000 contrast-enhanced CT volumes with over 20,000 high-fidelity 3D lesion annotations linked to seven abdominal organs: liver, kidneys, pancreas, spleen, stomach, and gallbladder. Our streamlined three-phase pipeline integrates automated spatial reasoning, prompt-optimized 2D segmentation, and memory-guided 3D propagation, validated by expert radiologists with surface dice scores exceeding 0.75. By providing diverse lesion types and patient demographics, 3DLAND enables scalable evaluation of anomaly detection, localization, and cross-organ transfer learning for medical AI. Our dataset establishes a new benchmark for evaluating organ-aware 3D segmentation models, paving the way for advancements in healthcare-oriented AI. To facilitate reproducibility and further research, the 3DLAND dataset and implementation code are publicly available at this https URL.
90. 【2602.12758】VineetVC: Adaptive Video Conferencing Under Severe Bandwidth Constraints Using Audio-Driven Talking-Head Reconstruction
Link: https://arxiv.org/abs/2602.12758
Authors: Vineet Kumar Rakesh, Soumya Mazumdar, Tapas Samanta, Hemendra Kumar Pandey, Amitabha Das, Sarbajit Pal
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Keywords: frame rates deteriorate, packet loss escalates, latency significantly increases, encoder rate management, Intense bandwidth depletion
Comments:
Abstract:Intense bandwidth depletion within consumer and constrained networks has the potential to undermine the stability of real-time video conferencing: encoder rate management becomes saturated, packet loss escalates, frame rates deteriorate, and end-to-end latency significantly increases. This work delineates an adaptive conferencing system that integrates WebRTC media delivery with a supplementary audio-driven talking-head reconstruction pathway and telemetry-driven mode regulation. The system consists of a WebSocket signaling service, an optional SFU for multi-party transmission, a browser client capable of real-time WebRTC statistics extraction and CSV telemetry export, and an AI REST service that processes a reference face image and recorded audio to produce a synthesized MP4; the browser can substitute its outbound camera track with the synthesized stream with a median bandwidth of 32.80 kbps. The solution incorporates a bandwidth-mode switching strategy and a client-side mode-state logger.
91. 【2602.12750】Lung nodule classification on CT scan patches using 3D convolutional neural networks
Link: https://arxiv.org/abs/2602.12750
Authors: Volodymyr Sydorskyi
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Keywords: Lung cancer remains, Lung cancer, common and deadliest, deadliest forms, lung cancer detection
Comments:
Abstract:Lung cancer remains one of the most common and deadliest forms of cancer worldwide. The likelihood of successful treatment depends strongly on the stage at which the disease is diagnosed. Therefore, early detection of lung cancer represents a critical medical challenge. However, this task poses significant difficulties for thoracic radiologists due to the large number of studies to review, the presence of multiple nodules within the lungs, and the small size of many nodules, which complicates visual assessment. Consequently, the development of automated systems that incorporate highly accurate and computationally efficient lung nodule detection and classification modules is essential. This study introduces three methodological improvements for lung nodule classification: (1) an advanced CT scan cropping strategy that focuses the model on the target nodule while reducing computational cost; (2) target filtering techniques for removing noisy labels; (3) novel augmentation methods to improve model robustness. The integration of these techniques enables the development of a robust classification subsystem within a comprehensive Clinical Decision Support System for lung cancer detection, capable of operating across diverse acquisition protocols, scanner types, and upstream models (segmentation or detection). The multiclass model achieved a Macro ROC AUC of 0.9176 and a Macro F1-score of 0.7658, while the binary model reached a Binary ROC AUC of 0.9383 and a Binary F1-score of 0.8668 on the LIDC-IDRI dataset. These results outperform several previously reported approaches and demonstrate state-of-the-art performance for this task.
92. 【2602.12410】Conference Proceedings of the Inaugural Conference of the International Society for Tractography (IST 2025 Bordeaux)
Link: https://arxiv.org/abs/2602.12410
Authors: Flavio Dell Acqua, Maxime Descoteaux, Graham Little, Laurent Petit, Dogu Baran Aydogan, Stephanie Forkel, Alexander Leemans, Simona Schiavi, Michel Thiebaut de Schotten
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
Keywords: International Society, held in Bordeaux, IST Conference, October 13-16, Inaugural Conference
Comments: Proceedings of the Inaugural Conference of the International Society for Tractography (IST Conference 2025). Held at the Institut des Maladies Neurodégénératives in Bordeaux, France, October 13-16, 2025. Society website: http://www.tractography.io
Abstract:This collection comprises the abstracts presented during poster, power pitch and oral sessions at the Inaugural Conference of the International Society for Tractography (IST Conference 2025), held in Bordeaux, France, from October 13-16, 2025. The conference was designed to foster meaningful exchange and collaboration between disparate fields. The overall focus was on advancing research, innovation, and community in the common fields of interest: neuroanatomy, tractography methods and scientific/clinical applications of tractography. The included abstracts cover the latest advancements in tractography, Diffusion MRI, and related fields, including new work on neurological and psychiatric disorders, deep brain stimulation targeting, and brain development. This landmark event brought together world-leading experts to discuss critical challenges and chart the future direction of the field.
93. 【2602.12306】Quantum walk inspired JPEG compression of images
Link: https://arxiv.org/abs/2602.12306
Authors: Abhishek Verma, Sahil Tomar, Sandeep Kumar
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Information Theory (cs.IT)
Keywords: Walk Inspired Optimization, Quantum Walk Inspired, quantum inspired adaptive, optimized Qtable derived, inspired adaptive quantization
Comments: 8 pages
Abstract:This work proposes a quantum inspired adaptive quantization framework that enhances the classical JPEG compression by introducing a learned, optimized Qtable derived using a Quantum Walk Inspired Optimization (QWIO) search strategy. The optimizer searches a continuous parameter space of frequency band scaling factors under a unified rate distortion objective that jointly considers reconstruction fidelity and compression efficiency. The proposed framework is evaluated on MNIST, CIFAR10, and ImageNet subsets, using Peak Signal to Noise Ratio (PSNR), Structural Similarity Index (SSIM), Bits Per Pixel (BPP), and error heatmap visual analysis as evaluation metrics. Experimental results show average gains ranging from 3 to 6 dB PSNR, along with better structural preservation of edges, contours, and luminance transitions, without modifying decoder compatibility. The structure remains JPEG compliant and can be implemented using accessible scientific packages making it ideal for deployment and practical research use.



