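This post lists the latest papers retrieved each day from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

673 papers were updated today, including:

  • Natural Language Processing: 93
  • Information Retrieval: 8
  • Computer Vision: 110

Natural Language Processing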

1. 【2601.23278】FOCUS: DLLMs Know How to Tame Their Compute Bound

Link: https://arxiv.org/abs/2601.23278

Authors: Kaihua Liang, Xin Tan, An Zhong, Hong Xu, Marco Canini

Categories: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Computation and Language (cs.CL)

Keywords: Large Language Models, Diffusion Large Language, Large Language, Language Models, Auto-Regressive models

Comments: 22 pages, 15 figures

Abstract:Diffusion Large Language Models (DLLMs) offer a compelling alternative to Auto-Regressive models, but their deployment is constrained by high decoding cost. In this work, we identify a key inefficiency in DLLM decoding: while computation is parallelized over token blocks, only a small subset of tokens is decodable at each diffusion step, causing most compute to be wasted on non-decodable tokens. We further observe a strong correlation between attention-derived token importance and token-wise decoding probability. Based on this insight, we propose FOCUS -- an inference system designed for DLLMs. By dynamically focusing computation on decodable tokens and evicting non-decodable ones on-the-fly, FOCUS increases the effective batch size, alleviating compute limitations and enabling scalable throughput. Empirical evaluations demonstrate that FOCUS achieves up to 3.52$\times$ throughput improvement over the production-grade engine LMDeploy, while preserving or improving generation quality across multiple benchmarks. The FOCUS system is publicly available on GitHub: this https URL.

2. 【2601.23273】UPA: Unsupervised Prompt Agent via Tree-Based Search and Selection

Link: https://arxiv.org/abs/2601.23273

Authors: Siran Peng, Weisong Zhao, Tianyu Fu, Chenxu Zhao, Tianshuo Zhang, Haoyuan Zhang, Xiangyu Zhu, Minghui Wu, Zhen Lei

Categories: Computation and Language (cs.CL)

Keywords: sequential decision-making problem, framing refinement, recently emerged, promising paradigm, paradigm for automated

Abstract:Prompt agents have recently emerged as a promising paradigm for automated prompt optimization, framing refinement as a sequential decision-making problem over a structured prompt space. While this formulation enables the use of advanced planning algorithms, these methods typically assume access to supervised reward signals, which are often unavailable in practical scenarios. In this work, we propose UPA, an Unsupervised Prompt Agent that realizes structured search and selection without relying on supervised feedback. Specifically, during search, UPA iteratively constructs an evolving tree structure to navigate the prompt space, guided by fine-grained and order-invariant pairwise comparisons from Large Language Models (LLMs). Crucially, as these local comparisons do not inherently yield a consistent global scale, we decouple systematic prompt exploration from final selection, introducing a two-stage framework grounded in the Bradley-Terry-Luce (BTL) model. This framework first performs path-wise Bayesian aggregation of local comparisons to filter candidates under uncertainty, followed by global tournament-style comparisons to infer latent prompt quality and identify the optimal prompt. Experiments across multiple tasks demonstrate that UPA consistently outperforms existing prompt optimization methods, showing that agent-style optimization remains highly effective even in fully unsupervised settings.
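To make the Bradley-Terry-Luce (BTL) model mentioned in this abstract concrete, here is a minimal sketch of how latent strengths can be estimated from pairwise win/loss records. It uses the classic minorization-maximization update for the BTL likelihood, which is a generic textbook procedure and not necessarily what UPA does; the function name `fit_bradley_terry` and the toy comparison data are assumptions made for illustration.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, n_items, iters=200):
    """Estimate BTL strengths from (winner, loser) pairs with the MM update.

    Assumes every item wins at least once, so no strength collapses to zero.
    """
    wins = [0] * n_items
    pair_counts = defaultdict(int)  # comparisons per unordered pair
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1

    strengths = [1.0] * n_items
    for _ in range(iters):
        updated = []
        for i in range(n_items):
            denom = 0.0
            for pair, count in pair_counts.items():
                if i in pair:
                    (j,) = pair - {i}
                    denom += count / (strengths[i] + strengths[j])
            updated.append(wins[i] / denom if denom > 0 else strengths[i])
        total = sum(updated)
        strengths = [s / total for s in updated]  # normalize to sum to 1
    return strengths


# Toy data (hypothetical): three candidate prompts, each pair compared 4 times.
toy_comparisons = (
    [(0, 1)] * 3 + [(1, 0)] * 1
    + [(0, 2)] * 3 + [(2, 0)] * 1
    + [(1, 2)] * 3 + [(2, 1)] * 1
)
strengths = fit_bradley_terry(toy_comparisons, n_items=3)
ranking = sorted(range(3), key=lambda i: strengths[i], reverse=True)
```

Under the BTL model, candidate i beats candidate j with probability strengths[i] / (strengths[i] + strengths[j]), so sorting by the fitted strengths turns locally inconsistent pairwise judgments into a single global ranking.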

3. 【2601.23265】PaperBanana: Automating Academic Illustration for AI Scientists

Link: https://arxiv.org/abs/2601.23265

Authors: Dawei Zhu, Rui Meng, Yale Song, Xiyu Wei, Sujian Li, Tomas Pfister, Jinsung Yoon

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: generating publication-ready illustrations, rapid advances, advances in autonomous, autonomous AI scientists, remains a labor-intensive

Abstract:Despite rapid advances in autonomous AI scientists powered by language models, generating publication-ready illustrations remains a labor-intensive bottleneck in the research workflow. To lift this burden, we introduce PaperBanana, an agentic framework for automated generation of publication-ready academic illustrations. Powered by state-of-the-art VLMs and image generation models, PaperBanana orchestrates specialized agents to retrieve references, plan content and style, render images, and iteratively refine via self-critique. To rigorously evaluate our framework, we introduce PaperBananaBench, comprising 292 test cases for methodology diagrams curated from NeurIPS 2025 publications, covering diverse research domains and illustration styles. Comprehensive experiments demonstrate that PaperBanana consistently outperforms leading baselines in faithfulness, conciseness, readability, and aesthetics. We further show that our method effectively extends to the generation of high-quality statistical plots. Collectively, PaperBanana paves the way for the automated generation of publication-ready illustrations.

4. 【2601.23258】Agnostic Language Identification and Generation

Link: https://arxiv.org/abs/2601.23258

Authors: Mikael Møller Høgsgaard, Chirag Pabbaraju

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Recent works, established tight statistical, Recent, language identification, identification and generation

Abstract:Recent works on language identification and generation have established tight statistical rates at which these tasks can be achieved. These works typically operate under a strong realizability assumption: that the input data is drawn from an unknown distribution necessarily supported on some language in a given collection. In this work, we relax this assumption of realizability entirely, and impose no restrictions on the distribution of the input data. We propose objectives to study both language identification and generation in this more general "agnostic" setup. Across both problems, we obtain novel interesting characterizations and nearly tight rates.

5. 【2601.23255】Now You Hear Me: Audio Narrative Attacks Against Large Audio-Language Models

Link: https://arxiv.org/abs/2601.23255

Authors: Ye Yu, Haibo Jin, Yaoning Yu, Jun Zhuang, Haohan Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Keywords: Large audio-language models, Large audio-language, raw speech inputs, audio-language models increasingly, models increasingly operate

Comments: to be published at EACL 2026 main conference

Abstract:Large audio-language models increasingly operate on raw speech inputs, enabling more seamless integration across domains such as voice assistants, education, and clinical triage. This transition, however, introduces a distinct class of vulnerabilities that remain largely uncharacterized. We examine the security implications of this modality shift by designing a text-to-audio jailbreak that embeds disallowed directives within a narrative-style audio stream. The attack leverages an advanced instruction-following text-to-speech (TTS) model to exploit structural and acoustic properties, thereby circumventing safety mechanisms primarily calibrated for text. When delivered through synthetic speech, the narrative format elicits restricted outputs from state-of-the-art models, including Gemini 2.0 Flash, achieving a 98.26% success rate that substantially exceeds text-only baselines. These results highlight the need for safety frameworks that jointly reason over linguistic and paralinguistic representations, particularly as speech-based interfaces become more prevalent.

6. 【2601.23228】Scaling Multiagent Systems with Process Rewards

Link: https://arxiv.org/abs/2601.23228

Authors: Ed Li, Junyu Ren, Cat Yan

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA)

Keywords: multiple agents simultaneously, agents simultaneously faces, sample efficiency, shown promise, promise for tackling

Abstract:While multiagent systems have shown promise for tackling complex tasks via specialization, finetuning multiple agents simultaneously faces two key challenges: (1) credit assignment across agents, and (2) sample efficiency of expensive multiagent rollouts. In this work, we propose finetuning multiagent systems with per-action process rewards from AI feedback (MAPPA) to address both. Through assigning credit to individual agent actions rather than only at task completion, MAPPA enables fine-grained supervision without ground truth labels while extracting maximal training signal from each rollout. We demonstrate our approach on competition math problems and tool-augmented data analysis tasks. On unseen math problems, MAPPA achieves +5.0--17.5pp on AIME and +7.8--17.2pp on AMC. For data analysis tasks, our method improves success rate by +12.5pp while quality metrics improve by up to 30%, validating that per-action supervision can lead to improvements across different multiagent systems on various domains. By addressing these challenges, our work takes a first step toward scaling multiagent systems for complex, long-horizon tasks with minimal human supervision.

7. 【2601.23223】Are you going to finish that? A Practical Study of the Tokenization Boundary Problem

Link: https://arxiv.org/abs/2601.23223

Authors: Hao Xu, Alisa Liu, Jonathan Hayase, Yejin Choi, Noah A. Smith

Categories: Computation and Language (cs.CL)

Keywords: trained over sequences, users interact, partial token problem, token, boundaries

Abstract:Language models (LMs) are trained over sequences of tokens, whereas users interact with LMs via text. This mismatch gives rise to the partial token problem, which occurs when a user ends their prompt in the middle of the expected next-token, leading to distorted next-token predictions. Although this issue has been studied using arbitrary character prefixes, its prevalence and severity in realistic prompts respecting word boundaries remain underexplored. In this work, we identify three domains where token and "word" boundaries often do not line up: languages that do not use whitespace, highly compounding languages, and code. In Chinese, for example, up to 25% of word boundaries do not line up with token boundaries, making even natural, word-complete prompts susceptible to this problem. We systematically construct semantically natural prompts ending with partial tokens; in experiments, we find that they comprise a serious failure mode: frontier LMs consistently place three orders of magnitude less probability on the correct continuation compared to when the prompt is "backed-off" to be token-aligned. This degradation does not diminish with scale and often worsens for larger models. Finally, we evaluate inference-time mitigations to the partial token problem and validate the effectiveness of recent exact solutions. Overall, we demonstrate the scale and severity of probability distortion caused by tokenization in realistic use cases, and provide practical recommendations for model inference providers.
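The partial token problem described in this abstract can be illustrated with a toy example. The sketch below assumes a greedy longest-match tokenizer and a five-entry vocabulary, both invented for illustration; real subword tokenizers (such as BPE) work differently, but the mismatch has the same shape.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization over a toy vocabulary.

    Single characters are always accepted as a fallback.
    """
    tokens, i = [], 0
    max_len = max(len(t) for t in vocab)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabulary in which "()" is a single token.
TOY_VOCAB = {"print", "(", ")", "()", "x"}

full = greedy_tokenize("print()", TOY_VOCAB)    # text the user has in mind
prompt = greedy_tokenize("print(", TOY_VOCAB)   # prompt that stops after "("
backed_off = prompt[:-1]                        # drop the trailing partial token

prompt_is_prefix = full[:len(prompt)] == prompt
backed_off_is_prefix = full[:len(backed_off)] == backed_off
```

Here the full text tokenizes as `["print", "()"]` while the prompt tokenizes as `["print", "("]`, so the prompt is not a prefix of the token sequence a model would have seen for the intended text. Dropping the last prompt token restores alignment, which is one simple way to picture the "backed-off" condition the abstract compares against.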

8. 【2601.23188】Deep Search with Hierarchical Meta-Cognitive Monitoring Inspired by Cognitive Neuroscience

Link: https://arxiv.org/abs/2601.23188

Authors: Zhongxiang Sun, Qipeng Wang, Weijie Yu, Jingxuan Yang, Haolang Lu, Jun Xu

Categories: Computation and Language (cs.CL)

Keywords: demonstrated strong capabilities, long-horizon task execution, large language models, Deep search, powered by large

Comments: 11 pages, 3 figures

Abstract:Deep search agents powered by large language models have demonstrated strong capabilities in multi-step retrieval, reasoning, and long-horizon task execution. However, their practical failures often stem from the lack of mechanisms to monitor and regulate reasoning and retrieval states as tasks evolve under uncertainty. Insights from cognitive neuroscience suggest that human metacognition is hierarchically organized, integrating fast anomaly detection with selectively triggered, experience-driven reflection. In this work, we propose Deep Search with Meta-Cognitive Monitoring (DS-MCM), a deep search framework augmented with an explicit hierarchical metacognitive monitoring mechanism. DS-MCM integrates a Fast Consistency Monitor, which performs lightweight checks on the alignment between external evidence and internal reasoning confidence, and a Slow Experience-Driven Monitor, which is selectively activated to guide corrective intervention based on experience memory from historical agent trajectories. By embedding monitoring directly into the reasoning-retrieval loop, DS-MCM determines both when intervention is warranted and how corrective actions should be informed by prior experience. Experiments across multiple deep search benchmarks and backbone models demonstrate that DS-MCM consistently improves performance and robustness.

9. 【2601.23184】ReGuLaR: Variational Latent Reasoning Guided by Rendered Chain-of-Thought

Link: https://arxiv.org/abs/2601.23184

Authors: Fanmeng Wang, Haotian Liu, Guojiang Zhao, Hongteng Xu, Zhifeng Gao

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, latent reasoning, Language Models, chains introduce substantial

Abstract:While Chain-of-Thought (CoT) significantly enhances the performance of Large Language Models (LLMs), explicit reasoning chains introduce substantial computational redundancy. Recent latent reasoning methods attempt to mitigate this by compressing reasoning processes into latent space, but often suffer from severe performance degradation due to the lack of appropriate compression guidance. In this study, we propose Rendered CoT-Guided variational Latent Reasoning (ReGuLaR), a simple yet novel latent learning paradigm resolving this issue. Fundamentally, we formulate latent reasoning within the Variational Auto-Encoding (VAE) framework, sampling the current latent reasoning state from the posterior distribution conditioned on previous ones. Specifically, when learning this variational latent reasoning model, we render explicit reasoning chains as images, from which we extract dense visual-semantic representations to regularize the posterior distribution, thereby achieving efficient compression with minimal information loss. Extensive experiments demonstrate that ReGuLaR significantly outperforms existing latent reasoning methods across both computational efficiency and reasoning effectiveness, and even surpasses CoT through multi-modal reasoning, providing a new and insightful solution to latent reasoning. Code: this https URL.

10. 【2601.23183】JobResQA: A Benchmark for LLM Machine Reading Comprehension on Multilingual Résumés and JDs

Link: https://arxiv.org/abs/2601.23183

Authors: Casimiro Pio Carrino, Paula Estrella, Rabih Zbib, Carlos Escolano, José A. R. Fonollosa

Categories: Computation and Language (cs.CL)

Keywords: Machine Reading Comprehension, evaluating Machine Reading, Reading Comprehension, HR-specific tasks involving, tasks involving résumés

Comments: Under review

Abstract:We introduce JobResQA, a multilingual Question Answering benchmark for evaluating Machine Reading Comprehension (MRC) capabilities of LLMs on HR-specific tasks involving résumés and job descriptions. The dataset comprises 581 QA pairs across 105 synthetic résumé-job description pairs in five languages (English, Spanish, Italian, German, and Chinese), with questions spanning three complexity levels from basic factual extraction to complex cross-document reasoning. We propose a data generation pipeline derived from real-world sources through de-identification and data synthesis to ensure both realism and privacy, while controlled demographic and professional attributes (implemented via placeholders) enable systematic bias and fairness studies. We also present a cost-effective, human-in-the-loop translation pipeline based on the TEaR methodology, incorporating MQM error annotations and selective post-editing to ensure a high-quality multi-way parallel benchmark. We provide baseline evaluations across multiple open-weight LLM families using an LLM-as-judge approach, revealing higher performance on English and Spanish but substantial degradation for other languages, highlighting critical gaps in multilingual MRC capabilities for HR applications. JobResQA provides a reproducible benchmark for advancing fair and reliable LLM-based HR systems. The benchmark is publicly available at: this https URL

11. 【2601.23182】FourierSampler: Unlocking Non-Autoregressive Potential in Diffusion Language Models via Frequency-Guided Generation

Link: https://arxiv.org/abs/2601.23182

Authors: Siyang He, Qiqi Wang, Xiaoran Liu, Hongnan Ma, Yiwei Shi, Yuerong Song, Ying Zhu, Tianyi Liang, Zengfeng Huang, Ziwei He, Xipeng Qiu

Categories: Computation and Language (cs.CL)

Keywords: demonstrate positional bias, existing decoding strategies, decoding strategies demonstrate, strategies demonstrate positional, diffusion language models

Comments: 15 pages, 6 figures, under review

Abstract:Despite the non-autoregressive potential of diffusion language models (dLLMs), existing decoding strategies demonstrate positional bias, failing to fully unlock the potential of arbitrary generation. In this work, we delve into the inherent spectral characteristics of dLLMs and present the first frequency-domain analysis showing that low-frequency components in hidden states primarily encode global structural information and long-range dependencies, while high-frequency components are responsible for characterizing local details. Based on this observation, we propose FourierSampler, which leverages a frequency-domain sliding window mechanism to dynamically guide the model to achieve a "structure-to-detail" generation. FourierSampler outperforms other inference enhancement strategies on LLaDA and SDAR, achieving relative improvements of 20.4% on LLaDA1.5-8B and 16.0% on LLaDA-8B-Instruct. It notably surpasses similarly sized autoregressive models like Llama3.1-8B-Instruct.

12. 【2601.23166】Monotonic Reference-Free Refinement for Autoformalization

Link: https://arxiv.org/abs/2601.23166

Authors: Lan Zhang, Marco Valentino, André Freitas

Categories: Computation and Language (cs.CL)

Keywords: remains largely unexplored, autoformalization remains largely, full-theorem autoformalization remains, full-theorem autoformalization, advanced rapidly

Comments: Work in progress

Abstract:While statement autoformalization has advanced rapidly, full-theorem autoformalization remains largely unexplored. Existing iterative refinement methods in statement autoformalization typically improve isolated aspects of formalization, such as syntactic correctness, but struggle to jointly optimize multiple quality dimensions, which is critical for full-theorem autoformalization. We introduce a reference-free iterative monotonic process for full-theorem autoformalization that leverages complementary feedback from theorem provers and LLM-based judges, without access to ground-truth proofs or existing formalizations at inference time. Our approach optimizes a masked composite objective over Formal Validity, Logical Preservation, Mathematical Consistency, and Formal Quality, guided by a responsiveness map that indicates how different LLMs acting as different roles preferentially improve each dimension. We further propose an acceptance policy that guarantees certified monotonic improvement, and provide conditions ensuring convergence and termination. Empirical experiments demonstrate the proposed process enables simultaneous improvement across multiple dimensions, achieving 93.44% formal validity and a 78.22% overall score on miniF2F, and 44.09% formal validity and a 29.79% overall score on ProofNet.

13. 【2601.23161】DIFFA-2: A Practical Diffusion Large Language Model for General Audio Understanding

Link: https://arxiv.org/abs/2601.23161

Authors: Jiaming Zhou, Xuxin Cheng, Shiwan Zhao, Yuhang Jia, Cao Liu, Ke Zeng, Xunliang Cai, Yong Qin

Categories: Sound (cs.SD); Computation and Language (cs.CL)

Keywords: limits inference efficiency, strictly sequential decoding, sequential decoding limits, decoding limits inference, large language models

Abstract:Autoregressive (AR) large audio language models (LALMs) such as Qwen-2.5-Omni have achieved strong performance on audio understanding and interaction, but scaling them remains costly in data and computation, and strictly sequential decoding limits inference efficiency. Diffusion large language models (dLLMs) have recently been shown to make effective use of limited training data, and prior work on DIFFA indicates that replacing an AR backbone with a diffusion counterpart can substantially improve audio understanding under matched settings, albeit at a proof-of-concept scale without large-scale instruction tuning, preference alignment, or practical decoding schemes. We introduce DIFFA-2, a practical diffusion-based LALM for general audio understanding. DIFFA-2 upgrades the speech encoder, employs dual semantic and acoustic adapters, and is trained with a four-stage curriculum that combines semantic and acoustic alignment, large-scale supervised fine-tuning, and variance-reduced preference optimization, using only fully open-source corpora. Experiments on MMSU, MMAU, and MMAR show that DIFFA-2 consistently improves over DIFFA and is competitive with strong AR LALMs under practical training budgets, supporting the view that diffusion-based modeling is a viable backbone for large-scale audio understanding. Our code is available at this https URL.

14. 【2601.23129】Evaluating the Utility of Grounding Documents with Reference-Free LLM-based Metrics

Link: https://arxiv.org/abs/2601.23129

Authors: Yilun Hua, Giuseppe Castellucci, Peter Schulam, Heba Elfardy, Kevin Small

Categories: Computation and Language (cs.CL)

Keywords: Retrieval Augmented Generation, Retrieval Augmented, Augmented Generation, success depends, LLM derives

Abstract:Retrieval Augmented Generation (RAG)'s success depends on the utility the LLM derives from the content used for grounding. Quantifying content utility does not have a definitive specification and existing metrics ignore model-specific capabilities and/or rely on costly annotations. In this paper, we propose Grounding Generation Utility (GroGU), a model-specific and reference-free metric that defines utility as a function of the downstream LLM's generation confidence based on entropy. Despite having no annotation requirements, GroGU is largely faithful in distinguishing ground-truth documents while capturing nuances ignored by LLM-agnostic metrics. We apply GroGU to train a query-rewriter for RAG by identifying high-utility preference data for Direct Preference Optimization. Experiments show improvements by up to 18.2 points in Mean Reciprocal Rank and up to 9.4 points in answer accuracy.
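As a rough illustration of scoring a grounding document by entropy-based generation confidence, the sketch below compares the mean next-token entropy of a generation with and without the document. This is only one plausible reading of the idea; the abstract does not give GroGU's actual definition, and the function names and probability values here are invented for illustration.

```python
import math


def mean_token_entropy(step_distributions):
    """Average Shannon entropy (in nats) over per-step next-token distributions."""
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0)
        for dist in step_distributions
    ]
    return sum(entropies) / len(entropies)


def entropy_utility(dists_without_doc, dists_with_doc):
    """Hypothetical utility: how much a document lowers generation entropy."""
    return mean_token_entropy(dists_without_doc) - mean_token_entropy(dists_with_doc)


# Made-up distributions over a 3-token vocabulary, two decoding steps each.
without_doc = [[1 / 3, 1 / 3, 1 / 3], [0.5, 0.25, 0.25]]
with_doc = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]
score = entropy_utility(without_doc, with_doc)
```

A positive score means the model's next-token distributions became sharper when the document was supplied. Because the score is computed from the downstream model's own probabilities, it is model-specific and needs no reference answer, which matches the two properties the abstract emphasizes.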

15. 【2601.23094】Safer Policy Compliance with Dynamic Epistemic Fallback

Link: https://arxiv.org/abs/2601.23094

Authors: Joseph Marvin Imperial, Harish Tayyar Madabushi

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: everyday interactions, Humans develop, series of cognitive, combat risks, misinformation from everyday

Abstract:Humans develop a series of cognitive defenses, known as epistemic vigilance, to combat risks of deception and misinformation from everyday interactions. Developing safeguards for LLMs inspired by this mechanism might be particularly helpful for their application in high-stakes tasks such as automating compliance with data privacy laws. In this paper, we introduce Dynamic Epistemic Fallback (DEF), a dynamic safety protocol for improving an LLM's inference-time defenses against deceptive attacks that make use of maliciously perturbed policy texts. Through various levels of one-sentence textual cues, DEF nudges LLMs to flag inconsistencies, refuse compliance, and fall back to their parametric knowledge upon encountering perturbed policy texts. Using globally recognized legal policies such as HIPAA and GDPR, our empirical evaluations report that DEF effectively improves the capability of frontier LLMs to detect and refuse perturbed versions of policies, with DeepSeek-R1 achieving a 100% detection rate in one setting. This work encourages further efforts to develop cognitively inspired defenses to improve LLM robustness against forms of harm and deception that exploit legal artifacts.

16. 【2601.23081】Character as a Latent Variable in Large Language Models: A Mechanistic Account of Emergent Misalignment and Conditional Safety Failures

Link: https://arxiv.org/abs/2601.23081

Authors: Yanghao Su, Wenbo Zhou, Tianwei Zhang, Qiu Han, Weiming Zhang, Nenghai Yu, Jie Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Keywords: narrowly scoped data, induces broadly misaligned, fine-tuning large language, broadly misaligned behavior, large language models

Abstract:Emergent Misalignment refers to a failure mode in which fine-tuning large language models (LLMs) on narrowly scoped data induces broadly misaligned behavior. Prior explanations mainly attribute this phenomenon to the generalization of erroneous or unsafe content. In this work, we show that this view is incomplete. Across multiple domains and model families, we find that fine-tuning models on data exhibiting specific character-level dispositions induces substantially stronger and more transferable misalignment than incorrect-advice fine-tuning, while largely preserving general capabilities. This indicates that emergent misalignment arises from stable shifts in model behavior rather than from capability degradation or corrupted knowledge. We further show that such behavioral dispositions can be conditionally activated by both training-time triggers and inference-time persona-aligned prompts, revealing shared structure across emergent misalignment, backdoor activation, and jailbreak susceptibility. Overall, our results identify character formation as a central and underexplored alignment risk, suggesting that robust alignment must address behavioral dispositions rather than isolated errors or prompt-level defenses.

17. 【2601.23022】DimABSA: Building Multilingual and Multidomain Datasets for Dimensional Aspect-Based Sentiment Analysis

Link: https://arxiv.org/abs/2601.23022

Authors: Lung-Hao Lee, Liang-Chih Yu, Natalia Loukashevich, Ilseyar Alimova, Alexander Panchenko, Tzu-Mi Lin, Zhe-Yu Xu, Jian-Yu Zhou, Guangmin Zheng, Jin Wang, Sharanya Awasthi, Jonas Becker, Jan Philip Wahle, Terry Ruas, Shamsuddeen Hassan Muhammad, Saif M. Mohammed

Categories: Computation and Language (cs.CL)

Keywords: Aspect-Based Sentiment Analysis, ABSA, focuses on extracting, dimensional ABSA, widely applied

Abstract:Aspect-Based Sentiment Analysis (ABSA) focuses on extracting sentiment at a fine-grained aspect level and has been widely applied across real-world domains. However, existing ABSA research relies on coarse-grained categorical labels (e.g., positive, negative), which limits its ability to capture nuanced affective states. To address this limitation, we adopt a dimensional approach that represents sentiment with continuous valence-arousal (VA) scores, enabling fine-grained analysis at both the aspect and sentiment levels. To this end, we introduce DimABSA, the first multilingual, dimensional ABSA resource annotated with both traditional ABSA elements (aspect terms, aspect categories, and opinion terms) and newly introduced VA scores. This resource contains 76,958 aspect instances across 42,590 sentences, spanning six languages and four domains. We further introduce three subtasks that combine VA scores with different ABSA elements, providing a bridge from traditional ABSA to dimensional ABSA. Given that these subtasks involve both categorical and continuous outputs, we propose a new unified metric, continuous F1 (cF1), which incorporates VA prediction error into standard F1. We provide a comprehensive benchmark using both prompted and fine-tuned large language models across all subtasks. Our results show that DimABSA is a challenging benchmark and provides a foundation for advancing multilingual dimensional ABSA.

18. 【2601.23014】Mem-T: Densifying Rewards for Long-Horizon Memory Agents

Link: https://arxiv.org/abs/2601.23014

Authors: Yanwei Yue, Guibin Zhang, Boci Peng, Xuanbo Fan, Jiaxin Guo, Qiankun Li, Yan Zhang

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: predefined memory-processing pipelines, garnered increasing attention, managing the processing, autonomy and adaptability, depart from predefined

Abstract:Memory agents, which depart from predefined memory-processing pipelines by endogenously managing the processing, storage, and retrieval of memories, have garnered increasing attention for their autonomy and adaptability. However, existing training paradigms remain constrained: agents often traverse long-horizon sequences of memory operations before receiving sparse and delayed rewards, which hinders truly end-to-end optimization of memory management policies. To address this limitation, we introduce Mem-T, an autonomous memory agent that interfaces with a lightweight hierarchical memory database to perform dynamic updates and multi-turn retrieval over streaming inputs. To effectively train long-horizon memory management capabilities, we further propose MoT-GRPO, a tree-guided reinforcement learning framework that transforms sparse terminal feedback into dense, step-wise supervision via memory operation tree backpropagation and hindsight credit assignment, thereby enabling the joint optimization of memory construction and retrieval. Extensive experiments demonstrate that Mem-T is (1) high-performing, surpassing frameworks such as A-Mem and Mem0 by up to $14.92\%$, and (2) economical, operating on a favorable accuracy-efficiency Pareto frontier and reducing inference tokens per query by $\sim24.45\%$ relative to GAM without sacrificing performance.

19. 【2601.23006】InstructDiff: Domain-Adaptive Data Selection via Differential Entropy for Efficient LLM Fine-Tuning

Link: https://arxiv.org/abs/2601.23006

Authors: Junyou Su, He Zhu, Xiao Luo, Liyu Zhang, Hong-Yu Zhou, Yun Chen, Peng Li, Yang Liu, Guanhua Chen

Categories: Computation and Language (cs.CL)

Keywords: adapting large language, complete datasets incurs, datasets incurs prohibitive, incurs prohibitive costs, Supervised fine-tuning

Abstract:Supervised fine-tuning (SFT) is fundamental to adapting large language models, yet training on complete datasets incurs prohibitive costs with diminishing returns. Existing data selection methods suffer from severe domain specificity: techniques optimized for general instruction-following fail on reasoning tasks, and vice versa. We observe that measuring entropy differences between base models and minimally instruction-tuned calibrated models reveals a pattern -- samples with the lowest differential entropy consistently yield optimal performance across domains, yet this principle manifests domain-adaptively: reasoning tasks favor entropy increase (cognitive expansion), while general tasks favor entropy decrease (cognitive compression). We introduce InstructDiff, a unified framework that operationalizes differential entropy as a domain-adaptive selection criterion through warmup calibration, bi-directional NLL filtering, and entropy-based ranking. Extensive experiments show that InstructDiff achieves 17\% relative improvement over full data training on mathematical reasoning and 52\% for general instruction-following, outperforming prior baselines while using only 10\% of the data.

20. 【2601.23001】Bias Beyond Borders: Political Ideology Evaluation and Steering in Multilingual LLMs

Link: https://arxiv.org/abs/2601.23001

Authors: Afrozah Nadeem, Agrima, Mehwish Nasim, Usman Naseem

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, increasingly shape global, shape global discourse, Large Language, Language Models

Comments: Preprint

Abstract:Large Language Models (LLMs) increasingly shape global discourse, making fairness and ideological neutrality essential for responsible AI deployment. Despite growing attention to political bias in LLMs, prior work largely focuses on high-resource, Western languages or narrow multilingual settings, leaving cross-lingual consistency and safe post-hoc mitigation underexplored. To address this gap, we present a large-scale multilingual evaluation of political bias spanning 50 countries and 33 languages. We introduce a complementary post-hoc mitigation framework, Cross-Lingual Alignment Steering (CLAS), designed to augment existing steering methods by aligning ideological representations across languages and dynamically regulating intervention strength. This method aligns latent ideological representations induced by political prompts into a shared ideological subspace, ensuring cross-lingual consistency, while the adaptive mechanism prevents over-correction and preserves coherence. Experiments demonstrate substantial bias reduction along both economic and social axes with minimal degradation in response quality. The proposed framework establishes a scalable and interpretable paradigm for fairness-aware multilingual LLM governance, balancing ideological neutrality with linguistic and cultural diversity.

21. 【2601.22987】ArabicDialectHub: A Cross-Dialectal Arabic Learning Resource and Platform

Link: https://arxiv.org/abs/2601.22987

Authors: Salem Lahlou

Categories: Computation and Language (cs.CL)

Keywords: cross-dialectal Arabic learning, Arabic learning resource, Moroccan Darija, learning resource comprising, cross-dialectal Arabic

Abstract:We present ArabicDialectHub, a cross-dialectal Arabic learning resource comprising 552 phrases across six varieties (Moroccan Darija, Lebanese, Syrian, Emirati, Saudi, and MSA) and an interactive web platform. Phrases were generated using LLMs, validated by five native speakers, stratified by difficulty, and organized thematically. The open-source platform provides translation exploration, adaptive quizzing with algorithmic distractor generation, cloud-synchronized progress tracking, and cultural context. Both the dataset and the complete platform source code are released under the MIT license. Platform: this https URL.

22. 【2601.22980】Learnable Permutation for Structured Sparsity on Transformer Models

Link: https://arxiv.org/abs/2601.22980

Authors: Zekai Li, Ji Liu, Guanchen Li, Yixing Xu, Ziqiong Liu, Xuanwu Yin, Dong Li, Emad Barsoum

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: model pruning technique, popular model pruning, including CNNs, large language models, widely adopted

Abstract:Structured sparsity has emerged as a popular model pruning technique, widely adopted in various architectures, including CNNs, Transformer models, and especially large language models (LLMs) in recent years. A promising direction to further improve post-pruning performance is weight permutation, which reorders model weights into patterns more amenable to pruning. However, the exponential growth of the permutation search space with the scale of Transformer architectures forces most methods to rely on greedy or heuristic algorithms, limiting the effectiveness of reordering. In this work, we propose a novel end-to-end learnable permutation framework. Our method introduces a learnable permutation cost matrix to quantify the cost of swapping any two input channels of a given weight matrix, a differentiable bipartite matching solver to obtain the optimal binary permutation matrix given a cost matrix, and a sparsity optimization loss function to directly optimize the permutation operator. We extensively validate our approach on vision and language Transformers, demonstrating that our method achieves state-of-the-art permutation results for structured sparsity.
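
To make the motivation for weight permutation concrete, here is a minimal sketch that assumes N:M structured sparsity (keep the 2 largest-magnitude weights in every group of 4 input channels); the abstract itself does not name a specific sparsity pattern. The functions `retained_magnitude` and `permute` are hypothetical helpers, and the permutation is hand-picked rather than learned, so this illustrates only why reordering channels can help, not the paper's learnable cost matrix or differentiable matching solver.

```python
def retained_magnitude(row, n=2, m=4):
    """Sum of |w| kept when only the n largest-magnitude weights survive in each group of m."""
    total = 0.0
    for start in range(0, len(row), m):
        group = sorted((abs(w) for w in row[start:start + m]), reverse=True)
        total += sum(group[:n])
    return total


def permute(row, perm):
    """Reorder channels so that position i holds the weight from channel perm[i]."""
    return [row[j] for j in perm]


# Toy weight row (assumed values): all large weights sit in the first group.
row = [9.0, 8.0, 7.0, 6.0, 1.0, 1.0, 1.0, 1.0]
# Hand-picked permutation that spreads the large weights across both groups.
perm = [0, 1, 4, 5, 2, 3, 6, 7]

before = retained_magnitude(row)
after = retained_magnitude(permute(row, perm))
print(f"retained magnitude before: {before}, after: {after}")
```

In this toy case the unpermuted row is forced to prune two of its four large weights, while the permuted row keeps all four. Finding such a permutation for full weight matrices is the combinatorial search problem the abstract describes.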

23. 【2601.22974】MiTa: A Hierarchical Multi-Agent Collaboration Framework with Memory-integrated and Task Allocation

Link: https://arxiv.org/abs/2601.22974

Authors: XiaoJie Zhang, JianHan Wu, Xiaoyang Qu, Jianzong Wang

Categories: Emerging Technologies (cs.ET); Computation and Language (cs.CL)

Keywords: large language models, language models, advances in large, large language, substantially accelerated

Comments: Accepted to 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026)

Abstract:Recent advances in large language models (LLMs) have substantially accelerated the development of embodied agents. LLM-based multi-agent systems mitigate the inefficiency of single agents in complex tasks. However, they still suffer from issues such as memory inconsistency and agent behavioral conflicts. To address these challenges, we propose MiTa, a hierarchical memory-integrated task allocative framework to enhance collaborative efficiency. MiTa organizes agents into a manager-member hierarchy, where the manager incorporates additional allocation and summary modules that enable (1) global task allocation and (2) episodic memory integration. The allocation module enables the manager to allocate tasks from a global perspective, thereby avoiding potential inter-agent conflicts. The summary module, triggered by task progress updates, performs episodic memory integration by condensing recent collaboration history into a concise summary that preserves long-horizon context. By combining task allocation with episodic memory, MiTa attains a clearer understanding of the task and facilitates globally consistent task distribution. Experimental results confirm that MiTa achieves superior efficiency and adaptability in complex multi-agent cooperation over strong baseline methods.

24. 【2601.22966】A Unified View of Attention and Residual Sinks: Outlier-Driven Rescaling is Essential for Transformer Training

Link: https://arxiv.org/abs/2601.22966

Authors: Zihan Qiu, Zeyu Huang, Kaiyue Wen, Peng Jin, Bo Zheng, Yuxin Zhou, Haofeng Huang, Zekun Wang, Xiao Li, Huaqing Zhang, Yang Xu, Haoran Lian, Siqi Zhang, Rui Men, Jianwei Zhang, Ivan Titov, Dayiheng Liu, Jingren Zhou, Junyang Lin

Categories: Computation and Language (cs.CL)

Keywords: consistently receive large, persistently large activations, large attention logits, receive large attention, large language models

Abstract:We investigate the functional role of emergent outliers in large language models, specifically attention sinks (a few tokens that consistently receive large attention logits) and residual sinks (a few fixed dimensions with persistently large activations across most tokens). We hypothesize that these outliers, in conjunction with the corresponding normalizations (e.g., softmax attention and RMSNorm), effectively rescale other non-outlier components. We term this phenomenon outlier-driven rescaling and validate this hypothesis across different model architectures and training token counts. This view unifies the origin and mitigation of both sink types. Our main conclusions and observations include: (1) Outliers function jointly with normalization: removing normalization eliminates the corresponding outliers but degrades training stability and performance; directly clipping outliers while retaining normalization leads to degradation, indicating that outlier-driven rescaling contributes to training stability. (2) Outliers serve as rescaling factors rather than contributors, as the final contributions of attention and residual sinks are significantly smaller than those of non-outliers. (3) Outliers can be absorbed into learnable parameters or mitigated via explicit gated rescaling, leading to improved training performance (average gain of 2 points) and enhanced quantization robustness (1.2 points degradation under W4A4 quantization).
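
The rescaling effect that the abstract attributes to attention sinks can be seen in plain softmax arithmetic. The sketch below uses made-up logits (an assumption, not values from the paper) and shows that adding one large "sink" logit shrinks the total attention mass on the remaining tokens while leaving their relative proportions unchanged. It is a toy illustration of the hypothesis, not the authors' experimental setup.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical attention logits for three ordinary (non-sink) tokens.
content_logits = [1.0, 2.0, 3.0]

no_sink = softmax(content_logits)
# Prepend one sink token with a much larger logit (assumed value).
with_sink = softmax([6.0] + content_logits)

# Attention mass left for the ordinary tokens once the sink is present.
content_mass = sum(with_sink[1:])

# Ratios between ordinary tokens, relative to the first ordinary token.
ratios_before = [w / no_sink[0] for w in no_sink]
ratios_after = [w / with_sink[1] for w in with_sink[1:]]

print(f"mass on ordinary tokens with sink: {content_mass:.4f}")
print("ratios before:", [round(r, 4) for r in ratios_before])
print("ratios after: ", [round(r, 4) for r in ratios_after])
```

Because softmax divides every exponentiated logit by the same denominator, the sink acts as a scalar that scales all other weights down together, which matches the "rescale factor rather than contributor" reading in the abstract.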

25. 【2601.22954】Residual Context Diffusion Language Models

Link: https://arxiv.org/abs/2601.22954

Authors: Yuezhou Hu, Harman Singh, Monishwaran Maheswaran, Haocheng Xi, Coleman Hooper, Jintao Zhang, Aditya Tomar, Michael W. Mahoney, Sewon Min, Mehrdad Farajtabar, Kurt Keutzer, Amir Gholami, Chenfeng Xu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Diffusion Large Language, Large Language Models, purely autoregressive language, Large Language, autoregressive language models

Abstract:Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, state-of-the-art block-wise dLLMs rely on a "remasking" mechanism that decodes only the most confident tokens and discards the rest, effectively wasting computation. We demonstrate that recycling computation from the discarded tokens is beneficial, as these tokens retain contextual information useful for subsequent decoding iterations. In light of this, we propose Residual Context Diffusion (RCD), a module that converts these discarded token representations into contextual residuals and injects them back for the next denoising step. RCD uses a decoupled two-stage training pipeline to bypass the memory bottlenecks associated with backpropagation. We validate our method on both long CoT reasoning (SDAR) and short CoT instruction following (LLaDA) models. We demonstrate that a standard dLLM can be efficiently converted to the RCD paradigm with merely ~1 billion tokens. RCD consistently improves frontier dLLMs by 5-10 points in accuracy with minimal extra computation overhead across a wide range of benchmarks. Notably, on the most challenging AIME tasks, RCD nearly doubles baseline accuracy and attains up to 4-5x fewer denoising steps at equivalent accuracy levels.

26. 【2601.22950】Perplexity Cannot Always Tell Right from Wrong

Link: https://arxiv.org/abs/2601.22950

Authors: Petar Veličković, Federico Barbero, Christos Perivolaropoulos, Simon Osindero, Razvan Pascanu

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)

Keywords: gained significant traction, function measuring, loss function, gained significant, significant traction

Comments: 11 pages, 4 figures

Abstract:Perplexity -- a function measuring a model's overall level of "surprise" when encountering a particular output -- has gained significant traction in recent years, both as a loss function and as a simple-to-compute metric of model quality. Prior studies have pointed out several limitations of perplexity, often from an empirical standpoint. Here we leverage recent results on Transformer continuity to show rigorously how perplexity may be an unsuitable metric for model selection. Specifically, we prove that, if there is any sequence that a compact decoder-only Transformer model predicts accurately and confidently -- a necessary pre-requisite for strong generalisation -- there must exist another sequence with very low perplexity that the same model does not predict correctly. Further, by analytically studying iso-perplexity plots, we find that perplexity will not always select for the more accurate model -- rather, any increase in model confidence must be accompanied by a commensurate rise in accuracy for the new model to be selected.
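
A small numeric sketch can make the confidence-versus-accuracy trade-off tangible. It uses the standard definition of perplexity (the exponential of the mean negative log-probability of the target tokens) and two hypothetical models whose per-token probabilities are invented for illustration. It is not the paper's proof, only an example of a lower-perplexity model that is less accurate under greedy decoding.

```python
import math


def perplexity(target_probs):
    """exp of the mean negative log-probability assigned to the target tokens."""
    nll = [-math.log(p) for p in target_probs]
    return math.exp(sum(nll) / len(nll))


def greedy_accuracy(target_probs, best_other_probs):
    """Fraction of positions where the target token outranks every other token."""
    hits = sum(1 for p, q in zip(target_probs, best_other_probs) if p > q)
    return hits / len(target_probs)


# Hypothetical model A: moderately confident, always ranks the target first.
a_target = [0.6, 0.6, 0.6, 0.6]
a_other = [0.3, 0.3, 0.3, 0.3]

# Hypothetical model B: very confident on three tokens, wrong on the fourth.
b_target = [0.99, 0.99, 0.99, 0.45]
b_other = [0.005, 0.005, 0.005, 0.55]

print(f"A: ppl={perplexity(a_target):.3f}, acc={greedy_accuracy(a_target, a_other)}")
print(f"B: ppl={perplexity(b_target):.3f}, acc={greedy_accuracy(b_target, b_other)}")
```

Model B attains the lower perplexity despite mispredicting one token in four, so selecting by perplexity alone would prefer the less accurate model in this toy setting.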

27. 【2601.22949】Autonomous Chain-of-Thought Distillation for Graph-Based Fraud Detection

Link: https://arxiv.org/abs/2601.22949

Authors: Yuan Li, Jun Hu, Bryan Hooi, Bingsheng He, Cheng Chen

Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)

Keywords: requires jointly modeling, jointly modeling rich, modeling rich textual, Graph-based fraud detection, rich textual semantics

Abstract:Graph-based fraud detection on text-attributed graphs (TAGs) requires jointly modeling rich textual semantics and relational dependencies. However, existing LLM-enhanced GNN approaches are constrained by predefined prompting and decoupled training pipelines, limiting reasoning autonomy and weakening semantic-structural alignment. We propose FraudCoT, a unified framework that advances TAG-based fraud detection through autonomous, graph-aware chain-of-thought (CoT) reasoning and scalable LLM-GNN co-training. To address the limitations of predefined prompts, we introduce a fraud-aware selective CoT distillation mechanism that generates diverse reasoning paths and enhances semantic-structural understanding. These distilled CoTs are integrated into node texts, providing GNNs with enriched, multi-hop semantic and structural cues for fraud detection. Furthermore, we develop an efficient asymmetric co-training strategy that enables end-to-end optimization while significantly reducing the computational cost of naive joint training. Extensive experiments on public and industrial benchmarks demonstrate that FraudCoT achieves up to 8.8% AUPRC improvement over state-of-the-art methods and delivers up to 1,066x speedup in training throughput, substantially advancing both detection performance and efficiency.

28. 【2601.22947】Relaxing Positional Alignment in Masked Diffusion Language Models

Link: https://arxiv.org/abs/2601.22947

Authors: Mengyu Ye, Ryosuke Takahashi, Keito Kudo, Jun Suzuki

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Masked diffusion language, dominant autoregressive approaches, Masked diffusion, diffusion language models, autoregressive approaches

Abstract:Masked diffusion language models (MDLMs) have emerged as a promising alternative to dominant autoregressive approaches. Although they achieve competitive performance on several tasks, a substantial gap remains in open-ended text generation. We hypothesize that one cause of this gap is that strict positional prediction makes MDLM decoding highly sensitive to token misalignment, and we show through controlled interventions that a one-position shift can severely disrupt semantics. This observation suggests that enforcing strict positional supervision during training is misaligned with the irreversible denoising dynamics of MDLM decoding. Motivated by this mismatch, we adopt an alignment-flexible supervision strategy during fine-tuning. Specifically, we introduce a special token slack via the connectionist temporal classification objective. We apply this approach to the widely used MDLM model and conduct experiments on five open-ended text generation benchmarks. Our method consistently outperforms the original model and improves robustness to positional shifts, indicating that relaxing strict positional supervision is an important factor in improving generation quality in MDLMs.
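
For readers unfamiliar with connectionist temporal classification (CTC), the sketch below shows the standard CTC collapse rule: merge consecutive repeats, then drop the blank token. The token name `<slack>` and the example sequences are assumptions for illustration, and whether the paper uses exactly this collapse rule is not stated in the abstract. The example shows why such an objective tolerates a one-position shift.

```python
from itertools import groupby

SLACK = "<slack>"  # hypothetical name for the blank/slack token


def ctc_collapse(tokens, blank=SLACK):
    """Standard CTC collapse: merge consecutive repeats, then drop blank tokens."""
    merged = [tok for tok, _ in groupby(tokens)]
    return [tok for tok in merged if tok != blank]


# Two position-wise outputs that differ by a one-position shift.
aligned = ["the", "cat", SLACK, "sat"]
shifted = [SLACK, "the", "cat", "sat"]

print(ctc_collapse(aligned))
print(ctc_collapse(shifted))
```

Under strict positional supervision the two outputs would disagree at three of four positions, whereas after collapsing they map to the same target sequence.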

29. 【2601.22931】Benchmarking Machine Translation on Chinese Social Media Texts

Link: https://arxiv.org/abs/2601.22931

Authors: Kaiyan Zhao, Zheyong Xie, Zhongtao Miao, Xinze Lyu, Yao Hu, Shaosheng Cao

Categories: Computation and Language (cs.CL)

Keywords: poses significant challenges, highly stylized expressions, rapidly evolving slang, challenges for Machine, Chinese social media

Comments: Work in Progress

Abstract:The prevalence of rapidly evolving slang, neologisms, and highly stylized expressions in informal user-generated text, particularly on Chinese social media, poses significant challenges for Machine Translation (MT) benchmarking. Specifically, we identify two primary obstacles: (1) data scarcity, as high-quality parallel data requires bilingual annotators familiar with platform-specific slang and stylistic cues in both languages; and (2) metric limitations, where traditional evaluators like COMET often fail to capture stylistic fidelity and nonstandard expressions. To bridge these gaps, we introduce CSM-MTBench, a benchmark covering five Chinese-foreign language directions and consisting of two expert-curated subsets: Fun Posts, featuring context-rich, slang- and neologism-heavy content, and Social Snippets, emphasizing concise, emotion- and style-driven expressions. Furthermore, we propose tailored evaluation approaches for each subset: measuring the translation success rate of slang and neologisms in Fun Posts, while assessing tone and style preservation in Social Snippets via a hybrid of embedding-based metrics and LLM-as-a-judge. Experiments on over 20 models reveal substantial variation in how current MT systems handle semantic fidelity and informal, social-media-specific stylistic cues. CSM-MTBench thus serves as a rigorous testbed for advancing MT systems capable of mastering real-world Chinese social media texts.

30. 【2601.22929】Semantic Leakage from Image Embeddings

Link: https://arxiv.org/abs/2601.22929

Authors: Yiyi Chen, Qiongkai Xu, Desmond Eliott, Qiongxiu Li, Johannes Bjerva

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Cryptography and Security (cs.CR)

Keywords: Image embeddings, limited privacy risk, Image, semantic, pose limited privacy

Comments: 20 pages, 19 figures

Abstract:Image embeddings are generally assumed to pose limited privacy risk. We challenge this assumption by formalizing semantic leakage as the ability to recover semantic structures from compressed image embeddings. Surprisingly, we show that semantic leakage does not require exact reconstruction of the original image. Preserving local semantic neighborhoods under embedding alignment is sufficient to expose the intrinsic vulnerability of image embeddings. Crucially, this preserved neighborhood structure allows semantic information to propagate through a sequence of lossy mappings. Based on this conjecture, we propose Semantic Leakage from Image Embeddings (SLImE), a lightweight inference framework that reveals semantic information from standalone compressed image embeddings, incorporating a locally trained semantic retriever with off-the-shelf models, without training task-specific decoders. We thoroughly validate each step of the framework empirically, from aligned embeddings to retrieved tags, symbolic representations, and grammatical and coherent descriptions. We evaluate SLImE across a range of open and closed embedding models, including GEMINI, COHERE, NOMIC, and CLIP, and demonstrate consistent recovery of semantic information across diverse inference tasks. Our results reveal a fundamental vulnerability in image embeddings, whereby the preservation of semantic neighborhoods under alignment enables semantic leakage, highlighting challenges for privacy preservation.

31. 【2601.22928】LLMs Explain't: A Post-Mortem on Semantic Interpretability in Transformer Models

Link: https://arxiv.org/abs/2601.22928

Authors: Alhassan Abdelhalim, Janick Edinger, Sören Laue, Michaela Regneri

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Language Models, Large Language, increasingly popular, versatility and strong, strong performance

Abstract:Large Language Models (LLMs) are becoming increasingly popular in pervasive computing due to their versatility and strong performance. However, despite their ubiquitous use, the exact mechanisms underlying their outstanding performance remain unclear. Different methods for LLM explainability exist, and many of them are not fully understood themselves. We started with the question of how linguistic abstraction emerges in LLMs, aiming to detect it across different LLM modules (attention heads and input embeddings). For this, we used methods well-established in the literature: (1) probing for token-level relational structures, and (2) feature-mapping using embeddings as carriers of human-interpretable properties. Both attempts failed for different methodological reasons: Attention-based explanations collapsed once we tested the core assumption that later-layer representations still correspond to tokens. Property-inference methods applied to embeddings also failed because their high predictive scores were driven by methodological artifacts and dataset structure rather than meaningful semantic knowledge. These failures matter because both techniques are widely treated as evidence for what LLMs supposedly understand, yet our results show such conclusions are unwarranted. These limitations are particularly relevant in pervasive and distributed computing settings where LLMs are deployed as system components and interpretability methods are relied upon for debugging, compression, and explaining models.

32. 【2601.22889】DiffuSpeech: Silent Thought, Spoken Answer via Unified Speech-Text Diffusion

Link: https://arxiv.org/abs/2601.22889

Authors: Yuxuan Lou, Ziming Wu, Yaochen Wang, Yong Liu, Yingxuan Ren, Fuming Lai, Shaobing Lian, Jie Tang, Yang You

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)

Keywords: Current speech language, Current speech, leading to errors, audio is produced, generate responses directly

Abstract:Current speech language models generate responses directly without explicit reasoning, leading to errors that cannot be corrected once audio is produced. We introduce "Silent Thought, Spoken Answer" -- a paradigm where speech LLMs generate internal text reasoning alongside spoken responses, with thinking traces informing speech quality. To realize this, we present DiffuSpeech, the first diffusion-based speech-text language model supporting both understanding and generation, unifying discrete text and tokenized speech under a single masked diffusion framework. Unlike autoregressive approaches, DiffuSpeech jointly generates reasoning traces and speech tokens through iterative denoising, with modality-specific masking schedules. We also construct the first speech QA dataset with paired text reasoning traces, containing 26K samples totaling 319 hours. Experiments show DiffuSpeech achieves state-of-the-art speech-to-speech QA accuracy, outperforming the best baseline by up to 9 points, while attaining the best TTS quality among generative models (6.2% WER) and preserving language understanding (66.2% MMLU). Ablations confirm that both the diffusion architecture and thinking traces contribute to these gains.

33. 【2601.22888】Should LLMs, $\textit{like}$, Generate How Users Talk? Building Dialect-Accurate Dialog[ue]s Beyond the American Default with MDial

Link: https://arxiv.org/abs/2601.22888

Authors: Jio Oh, Paul Vicinanza, Thomas Butler, Steven Euijong Whang, Dezhi Hong, Amani Namboori

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Standard American English, billion English speakers, Standard American, higher failure rates, experience higher failure

Abstract:More than 80% of the 1.6 billion English speakers do not use Standard American English (SAE) and experience higher failure rates and stereotyped responses when interacting with LLMs as a result. Yet multi-dialectal performance remains underexplored. We introduce MDial, the first large-scale framework for generating multi-dialectal conversational data encompassing the three pillars of written dialect -- lexical (vocabulary), orthographic (spelling), and morphosyntactic (grammar) features -- for nine English dialects. Partnering with native linguists, we design an annotated and scalable rule-based LLM transformation to ensure precision. Our approach challenges the assumption that models should mirror users' morphosyntactic features, showing that up to 90% of the grammatical features of a dialect should not be reproduced by models. Independent evaluations confirm data quality, with annotators preferring MDial outputs over prior methods in 98% of pairwise comparisons for dialect naturalness. Using this pipeline, we construct the dialect-parallel MDialBench benchmark with 50k+ dialogs, resulting in 97k+ QA pairs, and evaluate 17 LLMs on dialect identification and response generation tasks. Even frontier models achieve under 70% accuracy, fail to reach 50% for Canadian English, and systematically misclassify non-SAE dialects as American or British. As dialect identification underpins natural language understanding, these errors risk cascading failures into downstream tasks.

34. 【2601.22887】MoVE: Mixture of Value Embeddings -- A New Axis for Scaling Parametric Memory in Autoregressive Models

Link: https://arxiv.org/abs/2601.22887

Authors: Yangyan Li

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: diverse modalities ranging, modern Generative, powering results, cornerstone of modern, results across diverse

Abstract:Autoregressive sequence modeling stands as the cornerstone of modern Generative AI, powering results across diverse modalities ranging from text generation to image generation. However, a fundamental limitation of this paradigm is the rigid structural coupling of model capacity to computational cost: expanding a model's parametric memory -- its repository of factual knowledge or visual patterns -- traditionally requires deepening or widening the network, which incurs a proportional rise in active FLOPs. In this work, we introduce MoVE (Mixture of Value Embeddings), a mechanism that breaks this coupling and establishes a new axis for scaling capacity. MoVE decouples memory from compute by introducing a global bank of learnable value embeddings shared across all attention layers. For every step in the sequence, the model employs a differentiable soft gating mechanism to dynamically mix retrieved concepts from this bank into the standard value projection. This architecture allows parametric memory to be scaled independently of network depth by simply increasing the number of embedding slots. We validate MoVE through strictly controlled experiments on two representative applications of autoregressive modeling: Text Generation and Image Generation. In both domains, MoVE yields consistent performance improvements over standard and layer-wise memory baselines, enabling the construction of "memory-dense" models that achieve lower perplexity and higher fidelity than their dense counterparts at comparable compute budgets.
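
The soft-gating idea in the abstract can be sketched in a few lines. Everything specific here is an assumption made for illustration: the gate logits are computed as dot products between the hidden state and hypothetical per-slot keys (`gate_keys`), the gate is a softmax, and the mixed embedding is simply added to the value projection. The abstract does not specify these details, so treat this as one plausible reading rather than the paper's architecture.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mix_value(hidden, value_proj, gate_keys, bank):
    """Add a gate-weighted mixture of shared value embeddings to the value projection.

    hidden:     hidden state for one sequence position
    value_proj: the layer's ordinary value vector for that position
    gate_keys:  one hypothetical key per slot, used to score the slot (assumption)
    bank:       global bank of value embeddings, one vector per slot
    """
    gates = softmax([dot(hidden, key) for key in gate_keys])
    mixed = [
        sum(g * slot[d] for g, slot in zip(gates, bank))
        for d in range(len(value_proj))
    ]
    return [v + m for v, m in zip(value_proj, mixed)], gates


# Toy 2-dimensional example with a 3-slot bank (all values invented).
hidden = [1.0, -1.0]
value_proj = [0.5, -0.5]
gate_keys = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
bank = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

value, gates = mix_value(hidden, value_proj, gate_keys, bank)
print("gates:", [round(g, 3) for g in gates])
print("value:", [round(v, 3) for v in value])
```

In this sketch, adding slots grows `bank` and `gate_keys` without adding layers or widening the network, which is the decoupling of memory from depth that the abstract describes.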

35. 【2601.22885】Leveraging LLMs For Turkish Skill Extraction

Link: https://arxiv.org/abs/2601.22885

Authors: Ezgi Arslan İltüzer, Özgür Anıl Özlü, Vahid Farajijobehdar, Gülşen Eryiğit

Categories: Computation and Language (cs.CL)

Keywords: modern recruitment systems, labor market analysis, Skill extraction, Skill, efficient job matching

Abstract:Skill extraction is a critical component of modern recruitment systems, enabling efficient job matching, personalized recommendations, and labor market analysis. Despite Türkiye's significant role in the global workforce, Turkish, a morphologically complex language, lacks both a skill taxonomy and a dedicated skill extraction dataset, resulting in underexplored research in skill extraction for Turkish. This article seeks answers to three research questions: 1) How can skill extraction be effectively performed for this language, in light of its low-resource nature? 2) What is the most promising model? 3) What is the impact of different Large Language Models (LLMs) and prompting strategies on skill extraction (i.e., dynamic vs. static few-shot samples, varying context information, and encouraging causal reasoning)? The article introduces the first Turkish skill extraction dataset and performance evaluations of automated skill extraction using LLMs. The manually annotated dataset contains 4,819 labeled skill spans from 327 job postings across different occupation areas. LLMs outperform supervised sequence labeling when used in an end-to-end pipeline, aligning extracted spans with standardized skills in the ESCO taxonomy more effectively. The best-performing configuration, utilizing Claude Sonnet 3.7 with dynamic few-shot prompting for skill identification, embedding-based retrieval, and LLM-based reranking for skill linking, achieves an end-to-end performance of 0.56, positioning Turkish alongside similar studies in other languages, which are few in the literature. Our findings suggest that LLMs can improve skill extraction performance in low-resource settings, and we hope that our work will accelerate similar research on skill extraction for underrepresented languages.

36. 【2601.22875】From Labels to Facets: Building a Taxonomically Enriched Turkish Learner Corpus

Link: https://arxiv.org/abs/2601.22875

Authors: Elif Sayar, Tolgahan Türker, Anna Golynskaia Knezhevich, Bihter Dereli, Ayşe Demirhas, Lionel Nicolas, Gülşen Eryiğit

Categories: Computation and Language (cs.CL)

Keywords: explicitly separate multiple, holistic flat label, flat label inventories, separate multiple linguistic, rely on holistic

Abstract:In terms of annotation structure, most learner corpora rely on holistic flat label inventories which, even when extensive, do not explicitly separate multiple linguistic dimensions. This makes linguistically deep annotation difficult and complicates fine-grained analyses aimed at understanding why and how learners produce specific errors. To address these limitations, this paper presents a semi-automated annotation methodology for learner corpora, built upon a recently proposed faceted taxonomy, and implemented through a novel annotation extension framework. The taxonomy provides a theoretically grounded, multi-dimensional categorization that captures the linguistic properties underlying each error instance, thereby enabling standardized, fine-grained, and interpretable enrichment beyond flat annotations. The annotation extension tool, implemented based on the proposed extension framework for Turkish, automatically extends existing flat annotations by inferring additional linguistic and metadata information as facets within the taxonomy to provide richer learner-specific context. It was systematically evaluated and yielded promising performance results, achieving a facet-level accuracy of 95.86%. The resulting taxonomically enriched corpus offers enhanced querying capabilities and supports detailed exploratory analyses across learner corpora, enabling researchers to investigate error patterns through complex linguistic and pedagogical dimensions. This work introduces the first collaboratively annotated and taxonomically enriched Turkish Learner Corpus, a manual annotation guideline with a refined tagset, and an annotation extender. As the first corpus designed in accordance with the recently introduced taxonomy, we expect our study to pave the way for subsequent enrichment efforts of existing error-annotated learner corpora.

37. 【2601.22871】Eroding the Truth-Default: A Causal Analysis of Human Susceptibility to Foundation Model Hallucinations and Disinformation in the Wild

Link: https://arxiv.org/abs/2601.22871

Authors: Alexander Loth, Martin Kappes, Marc-Oliver Pahl

Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: Trustworthy Web Intelligence, approach human-level fluency, Web Intelligence, Structural Causal Models, approach human-level

Comments: Accepted at ACM TheWebConf '26 Companion

Abstract:As foundation models (FMs) approach human-level fluency, distinguishing synthetic from organic content has become a key challenge for Trustworthy Web Intelligence. This paper presents JudgeGPT and RogueGPT, a dual-axis framework that decouples "authenticity" from "attribution" to investigate the mechanisms of human susceptibility. Analyzing 918 evaluations across five FMs (including GPT-4 and Llama-2), we employ Structural Causal Models (SCMs) as a principal framework for formulating testable causal hypotheses about detection accuracy. Contrary to partisan narratives, we find that political orientation shows a negligible association with detection performance ($r=-0.10$). Instead, "fake news familiarity" emerges as a candidate mediator ($r=0.35$), suggesting that exposure may function as adversarial training for human discriminators. We identify a "fluency trap" where GPT-4 outputs (HumanMachineScore: 0.20) bypass Source Monitoring mechanisms, rendering them indistinguishable from human text. These findings suggest that "pre-bunking" interventions should target cognitive source monitoring rather than demographic segmentation to ensure trustworthy information ecosystems.

ACM classes: I.2.7; H.5.2; K.4.1

Submission history: [v1] Fri, 30 Jan 2026 11:49:58 UTC (1,074 KB), from Alexander Loth

38. 【2601.22851】When Meanings Meet: Investigating the Emergence and Quality of Shared Concept Spaces during Multilingual Language Model Training

Link: https://arxiv.org/abs/2601.22851

Authors: Felicia Körner, Max Müller-Eberstein, Anna Korhonen, Barbara Plank

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Training Large Language, high multilingual coverage, Large Language, Training Large

Comments: Accepted to EACL 2026 Main Conference

Abstract:Training Large Language Models (LLMs) with high multilingual coverage is becoming increasingly important -- especially when monolingual resources are scarce. Recent studies have found that LLMs process multilingual inputs in shared concept spaces, thought to support generalization and cross-lingual transfer. However, these prior studies often do not use causal methods, lack deeper error analysis or focus on the final model only, leaving open how these spaces emerge during training. We investigate the development of language-agnostic concept spaces during pretraining of EuroLLM through the causal interpretability method of activation patching. We isolate cross-lingual concept representations, then inject them into a translation prompt to investigate how consistently translations can be altered, independently of the language. We find that shared concept spaces emerge early and continue to refine, but that alignment with them is language-dependent. Furthermore, in contrast to prior work, our fine-grained manual analysis reveals that some apparent gains in translation quality reflect shifts in behavior -- like selecting senses for polysemous words or translating instead of copying cross-lingual homographs -- rather than improved translation ability. Our findings offer new insight into the training dynamics of cross-lingual alignment and the conditions under which causal interpretability methods offer meaningful insights in multilingual contexts.

39. 【2601.22805】SOMBRERO: Measuring and Steering Boundary Placement in End-to-End Hierarchical Sequence Models

Link: https://arxiv.org/abs/2601.22805

Authors: Pit Neitemeier, Alessio Serra, Jiaze Li, Sascha Wirges, Lukas Balles, Jan Hendrik Metzen

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Hierarchical sequence models, efficient autoregressive modeling, sequence models replace, long byte sequences, models replace fixed

Abstract:Hierarchical sequence models replace fixed tokenization with learned segmentations that compress long byte sequences for efficient autoregressive modeling. While recent end-to-end methods can learn meaningful boundaries from the language-modeling objective alone, it remains difficult to quantitatively assess and systematically steer where compute is spent. We introduce a router-agnostic metric of boundary quality, boundary enrichment B, which measures how strongly chunk starts concentrate on positions with high next-byte surprisal. Guided by this metric, we propose Sombrero, which steers boundary placement toward predictive difficulty via a confidence-alignment boundary loss and stabilizes boundary learning by applying confidence-weighted smoothing at the input level rather than on realized chunks. On 1B scale, across UTF-8 corpora covering English and German text as well as code and mathematical content, Sombrero improves the accuracy-efficiency trade-off and yields boundaries that more consistently align compute with hard-to-predict positions.
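The abstract describes boundary enrichment B only in words, so the following is a rough sketch of one way such a quantity could be computed, not the paper's definition. It assumes B is the ratio of mean next-byte surprisal at chunk starts to mean surprisal over all positions; the function name `boundary_enrichment` and both argument names are hypothetical.

```python
from typing import Sequence


def boundary_enrichment(surprisal: Sequence[float],
                        is_boundary: Sequence[bool]) -> float:
    """Assumed form of B: mean surprisal at chunk starts / mean surprisal overall.

    A value above 1 means chunk starts sit on harder-to-predict positions
    than an average position does. This ratio is an illustrative assumption.
    """
    if len(surprisal) != len(is_boundary):
        raise ValueError("surprisal and is_boundary must have the same length")
    at_boundaries = [s for s, b in zip(surprisal, is_boundary) if b]
    if not at_boundaries:
        raise ValueError("at least one boundary position is required")
    mean_all = sum(surprisal) / len(surprisal)
    if mean_all == 0:
        raise ValueError("mean surprisal must be non-zero")
    mean_boundary = sum(at_boundaries) / len(at_boundaries)
    return mean_boundary / mean_all
```

Under this assumed definition, a router that places boundaries uniformly at random would score close to 1, which gives a natural reference point for comparison.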

40. 【2601.22795】Sparse or Dense? A Mechanistic Estimation of Computation Density in Transformer-based LLMs

Link: https://arxiv.org/abs/2601.22795

Authors: Corentin Kervadec, Iuliia Lysova, Marco Baroni, Gemma Boleda

Categories: Computation and Language (cs.CL)

Keywords: Transformer-based large language, wide computational graphs, Transformer-based large, large language models, computational graphs

Abstract:Transformer-based large language models (LLMs) are comprised of billions of parameters arranged in deep and wide computational graphs. Several studies on LLM efficiency optimization argue that it is possible to prune a significant portion of the parameters, while only marginally impacting performance. This suggests that the computation is not uniformly distributed across the parameters. We introduce here a technique to systematically quantify computation density in LLMs. In particular, we design a density estimator drawing on mechanistic interpretability. We experimentally test our estimator and find that: (1) contrary to what has been often assumed, LLM processing generally involves dense computation; (2) computation density is dynamic, in the sense that models shift between sparse and dense processing regimes depending on the input; (3) per-input density is significantly correlated across LLMs, suggesting that the same inputs trigger either low or high density. Investigating the factors influencing density, we observe that predicting rarer tokens requires higher density, and increasing context length often decreases the density. We believe that our computation density estimator will contribute to a better understanding of the processing at work in LLMs, challenging their symbolic interpretation.

41. 【2601.22777】RASST: Fast Cross-modal Retrieval-Augmented Simultaneous Speech Translation

Link: https://arxiv.org/abs/2601.22777

Authors: Jiaxuan Luo, Siqi Ouyang, Lei Li

Categories: Computation and Language (cs.CL)

Keywords: produces target text, target text incrementally, Simultaneous speech translation, Simultaneous speech, produces target

Abstract:Simultaneous speech translation (SST) produces target text incrementally from partial speech input. Recent speech large language models (Speech LLMs) have substantially improved SST quality, yet they still struggle to correctly translate rare and domain-specific terminology. While retrieval augmentation has been effective for terminology translation in machine translation, bringing retrieval to SST is non-trivial: it requires fast and accurate cross-modal (speech-to-text) retrieval under partial, continually arriving input, and the model must decide whether and when to apply retrieved terms during incremental generation. We propose Retrieval-Augmented Simultaneous Speech Translation (RASST), which tightly integrates cross-modal retrieval into the SST pipeline. RASST trains a lightweight speech-text retriever and performs efficient sliding-window retrieval, providing chunkwise terminology hints to the Speech LLM. We further synthesize training data that teaches the Speech LLM to leverage retrieved terms precisely. Experiments on three language directions of the ACL 60/60 dev set show that RASST improves terminology translation accuracy by up to 16% and increases overall translation quality by up to 3 BLEU points, with ablations confirming the contribution of each component.

42. 【2601.22742】AR-BENCH: Benchmarking Legal Reasoning with Judgment Error Detection, Classification and Correction

Link: https://arxiv.org/abs/2601.22742

Authors: Yifei Li, Richong Zhang, Wanyu Tu, Zhijie Nie, Haokun Luo, Chuantao Yin, Pengchong Li

Categories: Computation and Language (cs.CL)

Keywords: mechanisms face efficiency, face efficiency pressures, review mechanisms face, appellate review mechanisms, case volumes

Abstract:Legal judgments may contain errors due to the complexity of case circumstances and the abstract nature of legal concepts, while existing appellate review mechanisms face efficiency pressures from a surge in case volumes. Although current legal AI research focuses on tasks like judgment prediction and legal document generation, the task of judgment review differs fundamentally in its objectives and paradigm: it centers on detecting, classifying, and correcting errors after a judgment is issued, constituting anomaly detection rather than prediction or generation. To address this research gap, we introduce a novel task APPELLATE REVIEW, aiming to assess models' diagnostic reasoning and reliability in legal practice. We also construct a novel dataset benchmark AR-BENCH, which comprises 8,700 finely annotated decisions and 34,617 supplementary corpora. By evaluating 14 large language models, we reveal critical limitations in existing models' ability to identify legal application errors, providing empirical evidence for future improvements.

43. 【2601.22735】MM-THEBench: Do Reasoning MLLMs Think Reasonably?

Link: https://arxiv.org/abs/2601.22735

Authors: Zhidian Huang, Zijun Yao, Ji Qi, Shangqing Tu, Junxian Ma, Jinxin Liu, Weichuan Liu, Xiaoyin Che, Lei Hou, Juanzi Li

Categories: Computation and Language (cs.CL)

Keywords: solving complex problems, Recent advances, multimodal large language, large language models, mark a shift

Abstract:Recent advances in multimodal large language models (MLLMs) mark a shift from non-thinking models to post-trained reasoning models capable of solving complex problems through thinking. However, whether such thinking mitigates hallucinations in multimodal perception and reasoning remains unclear. Self-reflective reasoning enhances robustness but introduces additional hallucinations, and subtle perceptual errors still result in incorrect or coincidentally correct answers. Existing benchmarks primarily focus on models before the emergence of reasoning MLLMs, neglecting the internal thinking process and failing to measure the hallucinations that occur during thinking. To address these challenges, we introduce MM-THEBench, a comprehensive benchmark for assessing hallucinations of intermediate CoTs in reasoning MLLMs. MM-THEBench features a fine-grained taxonomy grounded in cognitive dimensions, diverse data with verified reasoning annotations, and a multi-level automated evaluation framework. Extensive experiments on mainstream reasoning MLLMs reveal insights into how thinking affects hallucination and reasoning capability in various multimodal tasks.

44. 【2601.22710】AlienLM: Alienization of Language for API-Boundary Privacy in Black-Box LLMs

Link: https://arxiv.org/abs/2601.22710

Authors: Jaehee Kim, Pilsung Kang

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: transmit sensitive prompts, critical privacy risk, Modern LLMs, Alien Adaptation Training, requiring users

Abstract:Modern LLMs are increasingly accessed via black-box APIs, requiring users to transmit sensitive prompts, outputs, and fine-tuning data to external providers, creating a critical privacy risk at the API boundary. We introduce AlienLM, a deployable API-only privacy layer that protects text by translating it into an Alien Language via a vocabulary-scale bijection, enabling lossless recovery on the client side. Using only standard fine-tuning APIs, Alien Adaptation Training (AAT) adapts target models to operate directly on alienized inputs. Across four LLM backbones and seven benchmarks, AlienLM retains over 81% of plaintext-oracle performance on average, substantially outperforming random-bijection and character-level baselines. Under adversaries with access to model weights, corpus statistics, and learning-based inverse translation, recovery attacks reconstruct fewer than 0.22% of alienized tokens. Our results demonstrate a practical pathway for privacy-preserving LLM deployment under API-only access, substantially reducing plaintext exposure while maintaining task performance.
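To make the idea of a vocabulary-scale bijection with lossless client-side recovery concrete, here is a minimal sketch over token IDs. It uses a seeded random permutation, which is closer to the random-bijection baseline mentioned in the abstract than to AlienLM's own construction (the abstract does not specify that). The names `make_bijection`, `alienize`, and `recover` are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def make_bijection(vocab_size: int, seed: int) -> Tuple[Dict[int, int], Dict[int, int]]:
    """Build a one-to-one mapping over token IDs and its inverse.

    Assumption: a seeded random permutation stands in for whatever
    mapping the real system uses.
    """
    rng = random.Random(seed)
    shuffled = list(range(vocab_size))
    rng.shuffle(shuffled)
    forward = {plain: alien for plain, alien in enumerate(shuffled)}
    inverse = {alien: plain for plain, alien in forward.items()}
    return forward, inverse


def alienize(token_ids: List[int], forward: Dict[int, int]) -> List[int]:
    """Client side: map plaintext token IDs to alien token IDs before sending."""
    return [forward[t] for t in token_ids]


def recover(alien_ids: List[int], inverse: Dict[int, int]) -> List[int]:
    """Client side: map alien token IDs in a response back to plaintext IDs."""
    return [inverse[t] for t in alien_ids]
```

Because the mapping is a permutation of the full vocabulary, `recover(alienize(ids, forward), inverse)` returns `ids` unchanged for any sequence of valid token IDs.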

45. 【2601.22708】A Unified Study of LoRA Variants: Taxonomy, Review, Codebase, and Empirical Evaluation

Link: https://arxiv.org/abs/2601.22708

Authors: Haonan He, Jingqi Ye, Minglei Li, Zhengbo Wang, Tao Chen, Lei Bai, Peng Ye

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: fundamental parameter-efficient fine-tuning, parameter-efficient fine-tuning method, large-scale neural networks, Low-Rank Adaptation, neural networks

Comments: Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence, Under Review

Abstract:Low-Rank Adaptation (LoRA) is a fundamental parameter-efficient fine-tuning method that balances efficiency and performance in large-scale neural networks. However, the proliferation of LoRA variants has led to fragmentation in methodology, theory, code, and evaluation. To this end, this work presents the first unified study of LoRA variants, offering a systematic taxonomy, unified theoretical review, structured codebase, and standardized empirical assessment. First, we categorize LoRA variants along four principal axes: rank, optimization dynamics, initialization, and integration with Mixture-of-Experts. Then, we review their relationships and evolution within a common theoretical framework focused on low-rank update dynamics. Further, we introduce LoRAFactory, a modular codebase that implements variants through a unified interface, supporting plug-and-play experimentation and fine-grained analysis. Last, using this codebase, we conduct a large-scale evaluation across natural language generation, natural language understanding, and image classification tasks, systematically exploring key hyperparameters. Our results uncover several findings, notably: LoRA and its variants exhibit pronounced sensitivity to the choices of learning rate compared to other hyperparameters; moreover, with proper hyperparameter configurations, LoRA consistently matches or surpasses the performance of most of its variants.
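For readers unfamiliar with the low-rank update that these variants build on, the sketch below shows the commonly described vanilla LoRA form, W' = W + (alpha / r) * B A, in dependency-free Python. It is not LoRAFactory's interface; `matmul` and `lora_effective_weight` are hypothetical helper names, and the matrix shapes are assumptions stated in the comments.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product (a is m x k, b is k x n)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_effective_weight(w: Matrix, b: Matrix, a: Matrix,
                          alpha: float, rank: int) -> Matrix:
    """Return W + (alpha / rank) * B @ A.

    Assumed shapes: w is d_out x d_in (frozen), b is d_out x rank,
    a is rank x d_in. Only b and a would be trained.
    """
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]
```

In the usual setup B starts at zero, so the adapted layer initially behaves exactly like the frozen base layer; the initialization axis of the taxonomy covers departures from that default.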

46. 【2601.22699】Models Know Models Best: Evaluation via Model-Preferred Formats

Link: https://arxiv.org/abs/2601.22699

Authors: Joonhak Lee, Sungmok Jung, Jongyeon Park, Jaejin Lee

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, multiple-choice tasks differs, tasks differs markedly, Language Models, Large Language

Abstract:Performance of Large Language Models (LLMs) on multiple-choice tasks differs markedly between symbol-based and cloze-style evaluation formats. The observed discrepancies are systematically attributable to task characteristics: natural language continuation benefits from likelihood scoring, whereas explicit comparison is better suited to symbol-based selection. These trends are consistent across various decoder-based LLMs, indicating model-agnostic effects. To address these inconsistencies, a dynamic format-alignment strategy is introduced that employs a lightweight classifier trained on latent model-preference signals. In contrast to human-designed heuristics, which often degrade performance, this approach uses model-generated signals to determine the optimal format for each problem instance. The proposed method achieves substantial and consistent improvements in zero-shot accuracy across reasoning and knowledge benchmarks, better revealing the models' latent capabilities.

47. 【2601.22692】FNF: Functional Network Fingerprint for Large Language Models

Link: https://arxiv.org/abs/2601.22692

Authors: Yiheng Liu, Junhao Ning, Sichen Xia, Haiyang Sun, Yang Yang, Hanyang Chi, Xiaohui Gao, Ning Qiang, Bao Ge, Junwei Han, Xintao Hu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Keywords: large language models, Functional Network Fingerprint, development of large, large language, significant commercial

Comments: 13 pages, 4 figures

Abstract:The development of large language models (LLMs) is costly and has significant commercial value. Consequently, preventing unauthorized appropriation of open-source LLMs and protecting developers' intellectual property rights have become critical challenges. In this work, we propose the Functional Network Fingerprint (FNF), a training-free, sample-efficient method for detecting whether a suspect LLM is derived from a victim model, based on the consistency between their functional network activity. We demonstrate that models that share a common origin, even with differences in scale or architecture, exhibit highly consistent patterns of neuronal activity within their functional networks across diverse input samples. In contrast, models trained independently on distinct data or with different objectives fail to preserve such activity alignment. Unlike conventional approaches, our method requires only a few samples for verification, preserves model utility, and remains robust to common model modifications (such as fine-tuning, pruning, and parameter permutation), as well as to comparisons across diverse architectures and dimensionalities. FNF thus provides model owners and third parties with a simple, non-invasive, and effective tool for protecting LLM intellectual property. The code is available at this https URL.

48. 【2601.22688】TSLM: Tree-Structured Language Modeling for Divergent Thinking

Link: https://arxiv.org/abs/2601.22688

Authors: Doyoung Kim, Jaehyeok Doo, Minjoon Seo

Categories: Computation and Language (cs.CL)

Keywords: decoupling irrelevant exploration, Tree-Structured Language Modeling, generate reasoning sequentially, decoupling irrelevant, Language Modeling

Abstract:Language models generate reasoning sequentially, preventing them from decoupling irrelevant exploration paths during search. We introduce Tree-Structured Language Modeling (TSLM), which uses special tokens to encode branching structure, enabling models to generate and selectively expand multiple search paths within a single generation process. By training on complete search trees including both successful and failed attempts, TSLM learns to internalize systematic exploration without redundant recomputation of shared prefixes. TSLM achieves robust performance and superior inference efficiency by avoiding the multiple independent forward passes required by external search methods. These results suggest a new paradigm of inference-time scaling for robust reasoning, demonstrating that supervised learning on complete tree-structured traces provides an efficient alternative for developing systematic exploration capabilities in language models.

49. 【2601.22657】NAG: A Unified Native Architecture for Encoder-free Text-Graph Modeling in Language Models

Link: https://arxiv.org/abs/2601.22657

Authors: Haisong Gong, Zhibo Liu, Qiang Liu, Shu Wu, Liang Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Graph Neural Networks, Neural Networks, Prevailing methods, LMs process textual, external Graph Neural

Abstract:Prevailing methods for integrating graphs into Language Models (LMs) typically rely on a segregated architecture: external Graph Neural Networks (GNNs) encode structural topology, while LMs process textual semantics. We argue this approach is suboptimal for text-graphs: it creates a conceptually disjointed interaction paradigm. By segregating structural encoding from semantic processing, these systems must perform a complex implicit alignment between abstract graph tokens and concrete textual elements. Challenging the necessity of external encoders, we propose NAG (Native Architecture for Graphs), a unified framework that internalizes graph processing within the LM's native manifold. Instead of bridging disparate embedding spaces, NAG repurposes the self-attention mechanism to enforce topological dependencies and recalibrates positional IDs to ensure structural equivalence. This allows the model to harness its intrinsic linguistic capability to simultaneously comprehend node and edge content alongside structural topology. We introduce two efficient implementations: NAG-Zero for absolute preservation of the base model's linguistic capabilities, and NAG-LoRA for enhanced structural adaptation. Experiments across diverse graph tasks validate that NAG achieves robust graph comprehension without the overhead of external encoders, offering a simpler, more coherent paradigm for text-graph modeling.

50. 【2601.22632】DART-ing Through the Drift: Dynamic Tracing of Knowledge Neurons for Adaptive Inference-Time Pruning

Link: https://arxiv.org/abs/2601.22632

Authors: Abhishek Tyagi, Yunuo Cen, Shrey Dhorajiya, Bharadwaj Veeravalli, Xuanyao Fong

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Language Models, Large Language, Feed-Forward Networks, exhibit substantial parameter, substantial parameter redundancy

Abstract:Large Language Models (LLMs) exhibit substantial parameter redundancy, particularly in Feed-Forward Networks (FFNs). Existing pruning methods suffer from two primary limitations. First, reliance on dataset-specific calibration introduces significant data dependency and computational overhead. Second, being predominantly static, they fail to account for the evolving subset of knowledge neurons in LLMs during autoregressive generation as the context evolves. To address this, we introduce DART (Dynamic Attention-Guided Runtime Tracing), a lightweight, training-free method that performs on-the-fly context-based pruning. DART monitors shifts in attention score distributions to infer context changes, dynamically updating neuron-level masks to retain salient parameters. Across ten benchmarks, DART outperforms the prior dynamic baseline, achieving accuracy gains of up to 14.5% on LLAMA-3.1-8B at 70% FFN sparsity. Furthermore, DART achieves up to 3x better ROUGE-L scores with respect to static-masked pruning on summarization tasks, with its performance comparable to the original dense models. We conclusively demonstrate that the proposed framework effectively adapts to diverse semantic contexts and preserves model capabilities across both general and domain-specific tasks, while running at less than 10 MB of memory for LLAMA-3.1-8B (16 GB) with 0.1% FLOPs overhead. The code is available at this https URL.

51. 【2601.22629】Time-Annealed Perturbation Sampling: Diverse Generation for Diffusion Language Models

Link: https://arxiv.org/abs/2601.22629

Authors: Jingxuan Wu, Zhenglin Wan, Xingrui Yu, Yuzhe Yang, Yiqiao Huang, Ivor Tsang, Yang You

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: paths remains underexplored, exploring multiple valid, explicit temporal dimension, multiple valid semantic, reasoning paths remains

Abstract:Diffusion language models (Diffusion-LMs) introduce an explicit temporal dimension into text generation, yet how this structure can be leveraged to control generation diversity for exploring multiple valid semantic or reasoning paths remains underexplored. In this paper, we show that Diffusion-LMs, like diffusion models in image generation, exhibit a temporal division of labor: early denoising steps largely determine the global semantic structure, while later steps focus on local lexical refinement. Building on this insight, we propose Time-Annealed Perturbation Sampling (TAPS), a training-free inference strategy that encourages semantic branching early in the diffusion process while progressively reducing perturbations to preserve fluency and instruction adherence. TAPS is compatible with both non-autoregressive and semi-autoregressive Diffusion backbones, demonstrated on LLaDA and TraDo in our paper, and consistently improves output diversity across creative writing and reasoning benchmarks without compromising generation quality.
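The annealing idea (strong perturbation in early denoising steps, progressively less later) can be illustrated with a simple schedule. The abstract does not give the actual schedule or where the perturbation is applied, so the linear decay below and the name `perturbation_scale` are purely illustrative assumptions.

```python
def perturbation_scale(step: int, total_steps: int, initial_scale: float) -> float:
    """Linearly anneal the perturbation scale over denoising steps.

    Assumption: the scale starts at initial_scale at step 0 and reaches 0
    at the final step. The paper's real schedule may differ.
    """
    if total_steps < 2:
        raise ValueError("total_steps must be at least 2")
    if not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    return initial_scale * (1.0 - step / (total_steps - 1))
```

With such a schedule, early steps (which the abstract says fix global semantic structure) receive the most noise, while late steps (local lexical refinement) are left nearly untouched.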

52. 【2601.22628】TTCS: Test-Time Curriculum Synthesis for Self-Evolving

Link: https://arxiv.org/abs/2601.22628

Authors: Chengyi Yang, Zhishang Xiang, Yunbo Tang, Zongpei Teng, Chengsong Huang, Fei Long, Yuhan Liu, Jinsong Su

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: large language models, Test-Time Training offers, test questions, Test-Time Training, offers a promising

Comments: 10 pages, 4 figures. Code and implementation details are available at https://github.com/XMUDeepLIT/TTCS

Abstract:Test-Time Training offers a promising way to improve the reasoning ability of large language models (LLMs) by adapting the model using only the test questions. However, existing methods struggle with difficult reasoning problems for two reasons: raw test questions are often too difficult to yield high-quality pseudo-labels, and the limited size of test sets makes continuous online updates prone to instability. To address these limitations, we propose TTCS, a co-evolving test-time training framework. Specifically, TTCS initializes two policies from the same pretrained model: a question synthesizer and a reasoning solver. These policies evolve through iterative optimization: the synthesizer generates progressively challenging question variants conditioned on the test questions, creating a structured curriculum tailored to the solver's current capability, while the solver updates itself using self-consistency rewards computed from multiple sampled responses on both original test and synthetic questions. Crucially, the solver's feedback guides the synthesizer to generate questions aligned with the model's current capability, and the generated question variants in turn stabilize the solver's test-time training. Experiments show that TTCS consistently strengthens the reasoning ability on challenging mathematical benchmarks and transfers to general-domain tasks across different LLM backbones, highlighting a scalable path towards dynamically constructing test-time curricula for self-evolving. Our code and implementation details are available at https://github.com/XMUDeepLIT/TTCS.
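A self-consistency reward computed from multiple sampled responses can be sketched as a majority vote: the most frequent final answer acts as a pseudo-label, and each sample is rewarded for agreeing with it. This is a common formulation and an assumption here; the abstract does not spell out the exact reward, and `self_consistency_rewards` is a hypothetical name.

```python
from collections import Counter
from typing import List, Tuple


def self_consistency_rewards(answers: List[str]) -> Tuple[str, List[float]]:
    """Majority-vote pseudo-label plus a 0/1 reward per sampled answer.

    Assumption: answers are already normalized final answers extracted
    from the sampled responses. Ties go to the answer seen first.
    """
    if not answers:
        raise ValueError("at least one sampled answer is required")
    majority, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == majority else 0.0 for a in answers]
    return majority, rewards
```

This also shows why very hard questions are a problem for such rewards: if the samples rarely agree, the majority answer is close to arbitrary, which is the pseudo-label quality issue the abstract points to.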

53. 【2601.22620】Layer-wise Swapping for Generalizable Multilingual Safety

Link: https://arxiv.org/abs/2601.22620

Authors: Hyunseo Shin, Wonseok Hwang

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, advancements of Large, safety risks remain, Large Language, rapid advancements

Abstract:Despite the rapid advancements of Large Language Models (LLMs), safety risks remain a critical challenge for low-resource languages. Existing safety datasets are predominantly English-centric, limiting progress in multilingual safety alignment. As a result, low-resource expert models, finetuned on their respective instruction datasets, tend to exhibit higher unsafety rates compared to their high-resource counterparts. In this work, we propose a safety-aware layer swapping method that transfers safety alignment from an English safety expert to low-resource language experts without additional training. To further enhance transfer ability, our method adaptively selects or blends modules based on their degree of specialization. Our approach preserves performance on general language understanding tasks while enhancing safety in the target languages. Experimental results show that the proposed method achieves comparable performance to the language expert on general benchmarks such as MMMLU, BELEBELE, and MGSM, while producing more aligned and less harmful responses on the MultiJail safety benchmark.

54. 【2601.22607】From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents

Link: https://arxiv.org/abs/2601.22607

Authors: Jiaxuan Gao, Jiaao Chen, Chuyi He, Wei-Chen Wang, Shusheng Xu, Hanrui Wang, Di Jin, Yi Wu

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: multi-step tool execution, solve real-world tasks, requiring dialogue state, dialogue state tracking, Interactive tool-using agents

Comments: Submitted to ICML 2026

Abstract:Interactive tool-using agents must solve real-world tasks via multi-turn interaction with both humans and external environments, requiring dialogue state tracking and multi-step tool execution while following complex instructions. Post-training such agents is challenging because synthesis for high-quality multi-turn tool-use data is difficult to scale, and reinforcement learning (RL) could face noisy signals caused by user simulation, leading to degraded training efficiency. We propose a unified framework that combines a self-evolving data agent with verifier-based RL. Our system, EigenData, is a hierarchical multi-agent engine that synthesizes tool-grounded dialogues together with executable per-instance checkers, and improves generation reliability via a closed-loop self-evolving process that updates prompts and workflow. Building on the synthetic data, we develop an RL recipe that first fine-tunes the user model and then applies GRPO-style training with trajectory-level group-relative advantages and dynamic filtering, yielding consistent improvements beyond SFT. Evaluated on tau^2-bench, our best model reaches 73.0% pass^1 on Airline and 98.3% pass^1 on Telecom, matching or exceeding frontier models. Overall, our results suggest a scalable pathway for bootstrapping complex tool-using behaviors without expensive human annotation.
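The abstract reports pass^1 scores without defining the metric. As a sketch, assuming the pass^k definition used in the tau-bench line of work (a task counts only if all k independent trials succeed, estimated per task as C(c, k) / C(n, k) from c successes in n trials and averaged over tasks), the computation looks like this. The function names are hypothetical.

```python
from math import comb
from typing import List


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Per-task estimate of the chance that k sampled trials all succeed."""
    if not 1 <= k <= num_trials:
        raise ValueError("k must satisfy 1 <= k <= num_trials")
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must be between 0 and num_trials")
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(successes_per_task: List[int], num_trials: int, k: int) -> float:
    """Average the per-task estimate over all tasks in a benchmark split."""
    if not successes_per_task:
        raise ValueError("at least one task is required")
    scores = [pass_hat_k(num_trials, c, k) for c in successes_per_task]
    return sum(scores) / len(scores)
```

For k = 1 this reduces to the plain average success rate, so a 73.0% pass^1 reads as 73.0% of trials succeeding on average under this assumed definition.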

55. 【2601.22597】TimeMachine-bench: A Benchmark for Evaluating Model Capabilities in Repository-Level Migration Tasks

Link: https://arxiv.org/abs/2601.22597

Authors: Ryo Fujii, Makoto Morishita, Kazuki Yano, Jun Suzuki

Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)

Keywords: practical tasks reflecting, research focus, automated software engineering, focus is increasingly, increasingly shifting

Comments: Accepted to EACL 2026 Main, camera-ready

Abstract:With the advancement of automated software engineering, research focus is increasingly shifting toward practical tasks reflecting the day-to-day work of software engineers. Among these tasks, software migration, a critical process of adapting code to evolving environments, has been largely overlooked. In this study, we introduce TimeMachine-bench, a benchmark designed to evaluate software migration in real-world Python projects. Our benchmark consists of GitHub repositories whose tests begin to fail in response to dependency updates. The construction process is fully automated, enabling live updates of the benchmark. Furthermore, we curated a human-verified subset to ensure problem solvability. We evaluated agent-based baselines built on top of 11 models, including both strong open-weight and state-of-the-art LLMs on this verified subset. Our results indicated that, while LLMs show some promise for migration tasks, they continue to face substantial reliability challenges, including spurious solutions that exploit low test coverage and unnecessary edits stemming from suboptimal tool-use strategies. Our dataset and implementation are available at this https URL.

56. 【2601.22594】Language Model Circuits Are Sparse in the Neuron Basis

Link: https://arxiv.org/abs/2601.22594

Authors: Aryaman Arora, Zhengxuan Wu, Jacob Steinhardt, Sarah Schwettmann

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: high-level concepts, neural network, aligned to individual, perform computation, Smolensky

Comments: 8 pages main text, 41 pages total

Abstract:The high-level concepts that a neural network uses to perform computation need not be aligned to individual neurons (Smolensky, 1986). Language model interpretability research has thus turned to techniques such as sparse autoencoders (SAEs) to decompose the neuron basis into more interpretable units of model computation, for tasks such as circuit tracing. However, not all neuron-based representations are uninterpretable. For the first time, we empirically show that MLP neurons are as sparse a feature basis as SAEs. We use this finding to develop an end-to-end pipeline for circuit tracing on the MLP neuron basis, which locates causal circuitry on a variety of tasks using gradient-based attribution. On a standard subject-verb agreement benchmark (Marks et al., 2025), a circuit of $\approx 10^2$ MLP neurons is enough to control model behaviour. On the multi-hop city $\to$ state $\to$ capital task from Lindsey et al., 2025, we find a circuit in which small sets of neurons encode specific latent reasoning steps (e.g., 'map city to its state'), and can be steered to change the model's output. This work thus advances automated interpretability of language models without additional training costs.

57. 【2601.22588】Rethinking LLM-as-a-Judge: Representation-as-a-Judge with Small Language Models via Semantic Capacity Asymmetry

Link: https://arxiv.org/abs/2601.22588

Authors: Zhuochun Li, Yong Zhang, Ming Li, Yuelyu Ji, Yiming Zeng, Ning Cheng, Yun Zhu, Yanmeng Wang, Shaojun Wang, Jing Xiao, Daqing He

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large language models, Large language, prompt design, sensitive to prompt, Capacity Asymmetry Hypothesis

Abstract:Large language models (LLMs) are widely used as reference-free evaluators via prompting, but this "LLM-as-a-Judge" paradigm is costly, opaque, and sensitive to prompt design. In this work, we investigate whether smaller models can serve as efficient evaluators by leveraging internal representations instead of surface generation. We uncover a consistent empirical pattern: small LMs, despite weak generative ability, encode rich evaluative signals in their hidden states. This motivates us to propose the Semantic Capacity Asymmetry Hypothesis: evaluation requires significantly less semantic capacity than generation and can be grounded in intermediate representations, suggesting that evaluation does not necessarily need to rely on large-scale generative models but can instead leverage latent features from smaller ones. Our findings motivate a paradigm shift from LLM-as-a-Judge to Representation-as-a-Judge, a decoding-free evaluation strategy that probes internal model structure rather than relying on prompted output. We instantiate this paradigm through INSPECTOR, a probing-based framework that predicts aspect-level evaluation scores from small model representations. Experiments on reasoning benchmarks (GSM8K, MATH, GPQA) show that INSPECTOR substantially outperforms prompting-based small LMs and closely approximates full LLM judges, while offering a more efficient, reliable, and interpretable alternative for scalable evaluation.

58. 【2601.22580】SpanNorm: Reconciling Training Stability and Performance in Deep Transformers

Link: https://arxiv.org/abs/2601.22580

Authors: Chao Wang, Bei Li, Jiaqi Zhang, Xinyu Liu, Yuchun Fan, Linkun Lyu, Xin Chen, Jingang Wang, Tong Xiao, Peng Pei, Xunliang Cai

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large Language Models, Large Language, success of Large, deep Transformer architectures, Language Models

Abstract:The success of Large Language Models (LLMs) hinges on the stable training of deep Transformer architectures. A critical design choice is the placement of normalization layers, leading to a fundamental trade-off: the "PreNorm" architecture ensures training stability at the cost of potential performance degradation in deep models, while the "PostNorm" architecture offers strong performance but suffers from severe training instability. In this work, we propose SpanNorm, a novel technique designed to resolve this dilemma by integrating the strengths of both paradigms. Structurally, SpanNorm establishes a clean residual connection that spans the entire transformer block to stabilize signal propagation, while employing a PostNorm-style computation that normalizes the aggregated output to enhance model performance. We provide a theoretical analysis demonstrating that SpanNorm, combined with a principled scaling strategy, maintains bounded signal variance throughout the network, preventing the gradient issues that plague PostNorm models, and also alleviating the representation collapse of PreNorm. Empirically, SpanNorm consistently outperforms standard normalization schemes in both dense and Mixture-of-Experts (MoE) scenarios, paving the way for more powerful and stable Transformer architectures.
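
For readers unfamiliar with the two baselines this abstract contrasts, here is a minimal sketch of a PreNorm and a PostNorm residual block in plain Python. It does not implement SpanNorm, whose exact formulation is not given in the abstract. The helper names (`layer_norm`, `pre_norm_block`, `post_norm_block`, `toy_sublayer`) are hypothetical, and the layer norm omits the learnable gain and bias for brevity.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_norm_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """PreNorm: y = x + f(LN(x)). The residual path carries x untouched."""
    update = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, update)]


def post_norm_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """PostNorm: y = LN(x + f(x)). The sum itself is normalized."""
    update = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, update)])


def toy_sublayer(x: Vector) -> Vector:
    """Hypothetical stand-in for attention or an MLP: doubles each entry."""
    return [2.0 * v for v in x]
```

Calling `pre_norm_block([1.0, 2.0, 3.0, 4.0], toy_sublayer)` leaves the input on an identity path and adds the sublayer output, whereas `post_norm_block` renormalizes the whole sum, so its output always has roughly zero mean and unit variance.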

59. 【2601.22575】PhoStream: Benchmarking Real-World Streaming for Omnimodal Assistants in Mobile Scenarios

Link: https://arxiv.org/abs/2601.22575

Authors: Xudong Lu, Huankang Guan, Yang Bo, Jinpeng Chen, Xintong Guo, Shuhan Li, Fang Liu, Peiwen Sun, Xueying Li, Wei Zhang, Xue Yang, Rui Liu, Hongsheng Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Multimodal Large Language, Large Language Models, streams remains underexplored, Multimodal Large, continuous real-world streams

Comments: 18 pages

Abstract:Multimodal Large Language Models excel at offline audio-visual understanding, but their ability to serve as mobile assistants in continuous real-world streams remains underexplored. In daily phone use, mobile assistants must track streaming audio-visual inputs and respond at the right time, yet existing benchmarks are often restricted to multiple-choice questions or use shorter videos. In this paper, we introduce PhoStream, the first mobile-centric streaming benchmark that unifies on-screen and off-screen scenarios to evaluate video, audio, and temporal reasoning. PhoStream contains 5,572 open-ended QA pairs from 578 videos across 4 scenarios and 10 capabilities. We build it with an Automated Generative Pipeline backed by rigorous human verification, and evaluate models using a realistic Online Inference Pipeline and LLM-as-a-Judge evaluation for open-ended responses. Experiments reveal a temporal asymmetry in LLM-judged scores (0-100): models perform well on Instant and Backward tasks (Gemini 3 Pro exceeds 80), but drop sharply on Forward tasks (16.40), largely due to early responses before the required visual and audio cues appear. This highlights a fundamental limitation: current MLLMs struggle to decide when to speak, not just what to say. Code and datasets used in this work will be made publicly accessible at this https URL.

60. 【2601.22548】Are LLM Evaluators Really Narcissists? Sanity Checking Self-Preference Evaluations

Link: https://arxiv.org/abs/2601.22548

Authors: Dani Roytburg, Matthew Bozoukov, Matthew Nguyen, Mackenzie Puig-Hall, Narmeen Oozeer

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: large language models, undermining the integrity, Recent research, shown that large, large language

Abstract:Recent research has shown that large language models (LLMs) favor their own outputs when acting as judges, undermining the integrity of automated post-training and evaluation workflows. However, it is difficult to disentangle which evaluation biases are explained by narcissism versus general experimental confounds, distorting measurements of self-preference bias. We discover a core methodological confound whose correction could reduce measurement error by 89.6%. Specifically, LLM evaluators may deliver self-preferring verdicts when judging queries that they themselves answered incorrectly; this would be true regardless of whether one of the candidate responses is their own. To decouple self-preference signals from noisy outputs on hard problems, we introduce an Evaluator Quality Baseline, which compares the probability that a judge incorrectly votes for itself against the probability that it votes for an incorrect response from another model. Evaluating this simple baseline on 37,448 queries, only 51% of initial findings retain statistical significance. Finally, we turn towards characterizing the entropy of "easy" versus "hard" evaluation votes from LLM judges. Our corrective baseline enables future research on self-preference by eliminating noisy data from potential solutions. More broadly, this work contributes to the growing body of work on cataloging and isolating judge-bias effects.
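
The comparison behind the Evaluator Quality Baseline can be illustrated with a small sketch. The abstract only says that two probabilities are compared; the record layout (`Verdict`), the function names, and the use of a simple difference as the summary statistic are assumptions made here for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Verdict:
    # True if the incorrect candidate was written by the judge itself,
    # False if it was written by another model (hypothetical field).
    incorrect_is_own: bool
    # True if the judge voted for the incorrect candidate (hypothetical field).
    voted_incorrect: bool


def incorrect_vote_rate(verdicts: Sequence[Verdict], own: bool) -> float:
    """Share of verdicts in which the judge picked the incorrect response,
    restricted to cases where that response was (own=True) or was not
    (own=False) the judge's own output."""
    subset = [v for v in verdicts if v.incorrect_is_own == own]
    if not subset:
        raise ValueError("no verdicts for the requested condition")
    return sum(v.voted_incorrect for v in subset) / len(subset)


def self_preference_gap(verdicts: Sequence[Verdict]) -> float:
    """Assumed summary: P(vote incorrect | own) - P(vote incorrect | other).
    A gap near zero suggests the errors reflect judge quality on hard
    queries rather than a preference for the judge's own outputs."""
    return incorrect_vote_rate(verdicts, True) - incorrect_vote_rate(verdicts, False)
```

Under this reading, a judge that picks wrong answers equally often whether or not they are its own yields a gap of zero, even if its raw rate of self-votes looks high.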

61. 【2601.22546】Towards the Holographic Characteristic of LLMs for Efficient Short-text Generation

Link: https://arxiv.org/abs/2601.22546

Authors: Shun Qian, Bingquan Liu, Chengjie Sun, Zhen Xu, Baoxun Wang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, in-context learning abilities, advancements in Large, Language Models, Large Language

Abstract:The recent advancements in Large Language Models (LLMs) have attracted interest in exploring their in-context learning abilities and chain-of-thought capabilities. However, there are few studies investigating the specific traits related to the powerful generation capacity of LLMs. This paper aims to delve into the generation characteristics exhibited by LLMs. Through our investigation, we have discovered that language models tend to capture target-side keywords at the beginning of the generation process. We name this phenomenon the Holographic Characteristic of language models. For the purpose of exploring this characteristic and further improving the inference efficiency of language models, we propose a plugin called HOLO, which leverages the Holographic Characteristic to extract target-side keywords from language models within a limited number of generation steps and complements the sentence with a parallel lexically constrained text generation method. To verify the effectiveness of HOLO, we conduct massive experiments on language models of varying architectures and scales in the short-text generation scenario. The results demonstrate that HOLO achieves comparable performance to the baselines in terms of both automatic and human-like evaluation metrics and highlight the potential of the Holographic Characteristic.

62. 【2601.22527】$ρ$-$\texttt{EOS}$: Training-free Bidirectional Variable-Length Control for Masked Diffusion LLMs

Link: https://arxiv.org/abs/2601.22527

Authors: Jingyi Yang, Yuxian Jiang, Jing Shao

Categories: Computation and Language (cs.CL)

Keywords: global context modeling, large language models, diffusion large language, EOS, texttt

Comments: 11 pages, 6 figures, 6 tables

Abstract:Beyond parallel generation and global context modeling, current masked diffusion large language models (dLLMs) suffer from a fundamental limitation: they require a predefined, fixed generation length, which lacks flexibility and forces an inevitable trade-off between output quality and computational efficiency. To address this, we study the denoising dynamics and find that the implicit density ($\rho$) of end-of-sequence ($\texttt{EOS}$) tokens serves as a reliable signal of generation sufficiency. In particular, the evolving implicit $\texttt{EOS}$ density during denoising reveals whether the current masked space is excessive or insufficient, thereby guiding the adjustment direction for generation length. Building on this insight, we propose $\rho$-$\texttt{EOS}$, a training-free, single-stage strategy that enables bidirectional variable-length generation for masked dLLMs. Unlike prior two-stage approaches (which require separate length adjustment and iterative mask insertion phases while supporting only unidirectional expansion), $\rho$-$\texttt{EOS}$ achieves bidirectional length adjustment within a unified denoising process by continuously estimating the implicit $\texttt{EOS}$ density: excessively high density triggers $\texttt{MASK}$ token contraction, while insufficient density induces expansion. Extensive experiments on mathematics and code benchmarks demonstrate that $\rho$-$\texttt{EOS}$ achieves comparable performance while substantially improving inference efficiency and token utilization.

63. 【2601.22521】One Ring to Rule Them All: Unifying Group-Based RL via Dynamic Power-Mean Geometry

Link: https://arxiv.org/abs/2601.22521

Authors: Weisong Zhao, Tong Wang, Zichang Tan, Te Yang, Siran Peng, Haoyuan Zhang, Tianshuo Zhang, Haichao Shi, Meng Meng, Yang Yang, Xiangyu Zhu, Zhen Lei, Xiao-Yu Zhang, Xu Zhou

Categories: Computation and Language (cs.CL)

Keywords: Group-based reinforcement learning, Group-based reinforcement, reinforcement learning, learning has evolved, Power-Mean Policy Optimization

Comments: 17 pages, 3 figures

Abstract:Group-based reinforcement learning has evolved from the arithmetic mean of GRPO to the geometric mean of GMPO. While GMPO improves stability by constraining a conservative objective, it shares a fundamental limitation with GRPO: reliance on a fixed aggregation geometry that ignores the evolving and heterogeneous nature of each trajectory. In this work, we unify these approaches under Power-Mean Policy Optimization (PMPO), a generalized framework that parameterizes the aggregation geometry via the power-mean exponent $p$. Within this framework, GRPO and GMPO are recovered as special cases. Theoretically, we demonstrate that adjusting $p$ modulates the concentration of gradient updates, effectively reweighting tokens based on their advantage contribution. To determine $p$ adaptively, we introduce a Clip-aware Effective Sample Size (ESS) mechanism. Specifically, we propose a deterministic rule that maps a trajectory's clipping fraction to a target ESS. Then, we solve for the specific $p$ that aligns the trajectory-induced ESS with this target. This allows PMPO to dynamically transition between the aggressive arithmetic mean for reliable trajectories and the conservative geometric mean for unstable ones. Experiments on multiple mathematical reasoning benchmarks demonstrate that PMPO outperforms strong baselines.
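
The "special cases" claim rests on a standard property of the power mean: exponent 1 gives the arithmetic mean, and the limit as the exponent goes to 0 gives the geometric mean. The sketch below shows only that mathematical fact. The function name `power_mean` and the sample values are hypothetical, and how PMPO applies the mean to token-level quantities is not specified in the abstract.

```python
import math
from typing import Sequence


def power_mean(values: Sequence[float], p: float) -> float:
    """Power mean M_p = (mean(v ** p)) ** (1 / p) for strictly positive values.

    p = 1 is the arithmetic mean; p = 0 is handled as the limiting case,
    the geometric mean exp(mean(log v)).
    """
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and strictly positive")
    n = len(values)
    if abs(p) < 1e-12:
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** p for v in values) / n) ** (1.0 / p)


# Hypothetical positive per-token quantities for one trajectory.
ratios = [0.5, 1.0, 2.0]
arithmetic = power_mean(ratios, 1.0)   # arithmetic-mean end of the family
geometric = power_mean(ratios, 0.0)    # geometric-mean end of the family
in_between = power_mean(ratios, 0.5)   # an intermediate geometry
```

Because the power mean is non-decreasing in its exponent, sliding `p` from 1 toward 0 moves the aggregate from the arithmetic mean down toward the geometric mean, which matches the abstract's description of moving between the two regimes.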

64. 【2601.22511】Mock Worlds, Real Skills: Building Small Agentic Language Models with Synthetic Tasks, Simulated Environments, and Rubric-Based Rewards

Link: https://arxiv.org/abs/2601.22511

Authors: Yuan-Jay Lü, Chengyu Wang, Lei Shen, Jun Huang, Tong Xu

Categories: Computation and Language (cs.CL)

Keywords: capabilities of large, LLMs often struggle, struggle to match, agentic capabilities, reinforcement learning

Abstract:Small LLMs often struggle to match the agentic capabilities of large, costly models. While reinforcement learning can help, progress has been limited by two structural bottlenecks: existing open-source agentic training data are narrow in task variety and easily solved; real-world APIs lack diversity and are unstable for large-scale reinforcement learning rollout processes. We address these challenges with SYNTHAGENT, a framework that jointly synthesizes diverse tool-use training data and simulates complete environments. Specifically, a strong teacher model creates novel tasks and tool ecosystems, then rewrites them into intentionally underspecified instructions. This compels agents to actively query users for missing details. When handling synthetic tasks, an LLM-based user simulator provides user-private information, while a mock tool system delivers stable tool responses. For rewards, task-level rubrics are constructed based on required subgoals, user-agent interactions, and forbidden behaviors. Across 14 challenging datasets in math, search, and tool use, models trained on our synthetic data achieve substantial gains, with small models outperforming larger baselines.

65. 【2601.22491】SSL: Sweet Spot Learning for Differentiated Guidance in Agentic Optimization

Link: https://arxiv.org/abs/2601.22491

Authors: Jinyang Wu, Changpeng Yang, Yuhao Shen, Fangzhi Xu, Bolin Ni, Chonghua Liao, Yuchen Liu, Hongzhen Wang, Shuai Nie, Shuai Zhang, Haoran Luo, Jiaming Xu

Categories: Computation and Language (cs.CL)

Keywords: Reinforcement learning, textbf, learning with verifiable, powerful paradigm, Reinforcement

Abstract:Reinforcement learning with verifiable rewards has emerged as a powerful paradigm for training intelligent agents. However, existing methods typically employ binary rewards that fail to capture quality differences among trajectories achieving identical outcomes, thereby overlooking potential diversity within the solution space. Inspired by the "sweet spot" concept in tennis (the racket's core region that produces optimal hitting effects), we introduce Sweet Spot Learning (SSL), a novel framework that provides differentiated guidance for agent optimization. SSL follows a simple yet effective principle: progressively amplified, tiered rewards guide policies toward the sweet-spot region of the solution space. This principle naturally adapts across diverse tasks: visual perception tasks leverage distance-tiered modeling to reward proximity, while complex reasoning tasks reward incremental progress toward promising solutions. We theoretically demonstrate that SSL preserves optimal solution ordering and enhances the gradient signal-to-noise ratio, thereby fostering more directed optimization. Extensive experiments across GUI perception, short/long-term planning, and complex reasoning tasks show consistent improvements over strong baselines on 12 benchmarks, achieving up to 2.5$\times$ sample efficiency gains and effective cross-task transferability. Our work establishes SSL as a general principle for training capable and robust agents.

66. 【2601.22485】FraudShield: Knowledge Graph Empowered Defense for LLMs against Fraud Attacks

Link: https://arxiv.org/abs/2601.22485

Authors: Naen Xu, Jinghuai Zhang, Ping He, Chunyi Zhou, Jun Wang, Zhihui Fu, Tianyu Du, Zhaoxiang Wang, Shouling Ji

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: critical automated workflows, including contract review, Large language models, job application processes, Large language

Comments: WWW 2026

Abstract:Large language models (LLMs) have been widely integrated into critical automated workflows, including contract review and job application processes. However, LLMs are susceptible to manipulation by fraudulent information, which can lead to harmful outcomes. Although advanced defense methods have been developed to address this issue, they often exhibit limitations in effectiveness, interpretability, and generalizability, particularly when applied to LLM-based applications. To address these challenges, we introduce FraudShield, a novel framework designed to protect LLMs from fraudulent content by leveraging a comprehensive analysis of fraud tactics. Specifically, FraudShield constructs and refines a fraud tactic-keyword knowledge graph to capture high-confidence associations between suspicious text and fraud techniques. The structured knowledge graph augments the original input by highlighting keywords and providing supporting evidence, guiding the LLM toward more secure responses. Extensive experiments show that FraudShield consistently outperforms state-of-the-art defenses across four mainstream LLMs and five representative fraud types, while also offering interpretable clues for the model's generations.

67. 【2601.22448】HeaPA: Difficulty-Aware Heap Sampling and On-Policy Query Augmentation for LLM Reinforcement Learning

Link: https://arxiv.org/abs/2601.22448

Authors: Weiqi Wang, Xin Liu, Binxuan Huang, Hejie Cui, Rongzhi Zhang, Changlong Yu, Shuowei Jin, Jingfeng Yang, Qingyu Yin, Zhengyang Wang, Zheng Li, Yifan Gao, Priyanka Nigam, Bing Yin, Lihong Li, Yangqiu Song

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: rollout generation dominates, efficiency depends heavily, verifiable outcomes, train LLMs, LLMs on reasoning

Abstract:RLVR is now a standard way to train LLMs on reasoning tasks with verifiable outcomes, but when rollout generation dominates the cost, efficiency depends heavily on which prompts you sample and when. In practice, prompt pools are often static or only loosely tied to the model's learning progress, so uniform sampling can't keep up with the shifting capability frontier and ends up wasting rollouts on prompts that are already solved or still out of reach. Existing approaches improve efficiency through filtering, curricula, adaptive rollout allocation, or teacher guidance, but they typically either assume a fixed pool, which makes it hard to support stable on-policy pool growth, or add extra teacher cost and latency. We introduce HeaPA (Heap Sampling and On-Policy Query Augmentation), which maintains a bounded, evolving pool, tracks the frontier using heap-based boundary sampling, expands the pool via on-policy augmentation with lightweight asynchronous validation, and stabilizes correlated queries through topology-aware re-estimation of pool statistics and controlled reinsertion. Across two training corpora, two training recipes, and seven benchmarks, HeaPA consistently improves accuracy and reaches target performance with fewer computations while keeping wall-clock time comparable. Our analyses suggest these gains come from frontier-focused sampling and on-policy pool growth, with the benefits becoming larger as model scale increases. Our code is available at this https URL.

68. 【2601.22440】AI and My Values: User Perceptions of LLMs' Ability to Extract, Embody, and Explain Human Values from Casual Conversations

Link: https://arxiv.org/abs/2601.22440

Authors: Bhada Yun, Renn Su, April Yi Wang

Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Value-Alignment Perception Toolkit, understand human, Perception Toolkit, human, open philosophical question

Comments: To appear in CHI '26

Abstract:Does AI understand human values? While this remains an open philosophical question, we take a pragmatic stance by introducing VAPT, the Value-Alignment Perception Toolkit, for studying how LLMs reflect people's values and how people judge those reflections. 20 participants texted a human-like chatbot over a month, then completed a 2-hour interview with our toolkit evaluating AI's ability to extract (pull details regarding), embody (make decisions guided by), and explain (provide proof of) human values. 13 participants left our study convinced that AI can understand human values. Participants found the experience insightful for self-reflection and found themselves getting persuaded by the AI's reasoning. Thus, we warn about "weaponized empathy": a potentially dangerous design pattern that may arise in value-aligned, yet welfare-misaligned AI. VAPT offers concrete artifacts and design implications to evaluate and responsibly build value-aligned conversational agents with transparency, consent, and safeguards as AI grows more capable and human-like into the future.

69. 【2601.22439】Stop Jostling: Adaptive Negative Sampling Reduces the Marginalization of Low-Resource Language Tokens by Cross-Entropy Loss

Link: https://arxiv.org/abs/2601.22439

Authors: Galim Turumtaev

Categories: Computation and Language (cs.CL)

Keywords: Neural language models, Neural language, limited availability, rare tokens, training set

Comments: Accepted at LoResLM 2025 (COLING 2025 workshop). Oral presentation

Abstract:Neural language models often struggle with low-resource languages due to the limited availability of training data, making tokens from these languages rare in the training set. This paper addresses a specific challenge during training: rare tokens are disproportionately affected by marginalization, which prevents them from learning effectively. We propose a thresholding technique that reduces the impact of this marginalization, allowing rare tokens to benefit from more meaningful alignment. Through experiments with a character-level language model, we demonstrate that this method significantly improves performance on low-resource language validation data. This work is the first to show how negative sampling can be applied to improve the representation of rare tokens by limiting the harmful influence of excessive marginalization, offering a new approach to enhancing language model performance for underrepresented languages.

70. 【2601.22438】Towards Resiliency in Large Language Model Serving with KevlarFlow

Link: https://arxiv.org/abs/2601.22438

Authors: Shangshu Qian, Kipling Liu, P. C. Sruthi, Lin Tan, Yongle Zhang

Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Language Model, Large Language, remain fundamentally fragile, hyperscale clusters trigger, clusters trigger disproportionate

Abstract:Large Language Model (LLM) serving systems remain fundamentally fragile, where frequent hardware faults in hyperscale clusters trigger disproportionate service outages in the software stack. Current recovery mechanisms are prohibitively slow, often requiring up to 10 minutes to reinitialize resources and reload massive model weights. We introduce KevlarFlow, a fault tolerant serving architecture designed to bridge the gap between hardware unreliability and service availability. KevlarFlow leverages 1) decoupled model parallelism initialization, 2) dynamic traffic rerouting, and 3) background KV cache replication to maintain high throughput during partial failures. Our evaluation demonstrates that KevlarFlow reduces mean-time-to-recovery (MTTR) by 20x and, under failure conditions, improves average latency by 3.1x, 99th percentile (p99) latency by 2.8x, average time-to-first-token (TTFT) by 378.9x, and p99 TTFT by 574.6x with negligible runtime overhead in comparison to state-of-the-art LLM serving systems.

71. 【2601.22436】Large Language Model Agents Are Not Always Faithful Self-Evolvers

Link: https://arxiv.org/abs/2601.22436

Authors: Weixiang Zhao, Yingshuo Wang, Yichen Zhang, Yang Deng, Yanyan Zhao, Wanxiang Che, Bing Qin, Ting Liu

Categories: Computation and Language (cs.CL)

Keywords: large language model, Self-evolving large language, agents continually improve, self-evolving LLM agents, reusing past experience

Comments: 25 pages, 16 figures, 7 tables

Abstract:Self-evolving large language model (LLM) agents continually improve by accumulating and reusing past experience, yet it remains unclear whether they faithfully rely on that experience to guide their behavior. We present the first systematic investigation of experience faithfulness, the causal dependence of an agent's decisions on the experience it is given, in self-evolving LLM agents. Using controlled causal interventions on both raw and condensed forms of experience, we comprehensively evaluate four representative frameworks across 10 LLM backbones and 9 environments. Our analysis uncovers a striking asymmetry: while agents consistently depend on raw experience, they often disregard or misinterpret condensed experience, even when it is the only experience provided. This gap persists across single- and multi-agent configurations and across backbone scales. We trace its underlying causes to three factors: the semantic limitations of condensed content, internal processing biases that suppress experience, and task regimes where pretrained priors already suffice. These findings challenge prevailing assumptions about self-evolving methods and underscore the need for more faithful and reliable approaches to experience integration.

72. 【2601.22432】ReNCE: Learning to Reason by Noise Contrastive Estimation

Link: https://arxiv.org/abs/2601.22432

Authors: Wenzheng Zhang, Karl Stratos

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: endowing pretrained LLMs, endowing pretrained, pretrained LLMs, reasoning capabilities, GRPO

Abstract:GRPO is a standard approach to endowing pretrained LLMs with reasoning capabilities. It estimates the advantage of an outcome from a group of $K$ outcomes, and promotes those with positive advantages inside a trust region. Since GRPO discriminates between good and bad outcomes softly, it benefits from additional refinements such as asymmetric clipping and zero-variance data filtering. While effective, these refinements require significant empirical insight and can be challenging to identify. We instead propose an explicit contrastive learning approach. Instead of estimating advantages, we bifurcate $K$ outcomes into positive and negative sets, then maximize the likelihood of positive outcomes. Our approach can be viewed as an online instantiation of (multi-label) noise contrastive estimation for LLM reasoning. We validate our method by demonstrating competitive performance on a suite of challenging math benchmarks against strong baselines such as DAPO and online DPO.

73. 【2601.22410】Word-Centered Semantic Graphs for Interpretable Diachronic Sense Tracking

Link: https://arxiv.org/abs/2601.22410

Authors: Imene Kolli, Kai-Robin Lange, Jonas Rieger, Carsten Jentsch

Categories: Computation and Language (cs.CL)

Keywords: propose an interpretable, graph-based framework, framework for analyzing, diachronic Skip-gram embeddings, analyzing semantic shift

Comments: 20 pages, 16 figures

Abstract:We propose an interpretable, graph-based framework for analyzing semantic shift in diachronic corpora. For each target word and time slice, we induce a word-centered semantic network that integrates distributional similarity from diachronic Skip-gram embeddings with lexical substitutability from time-specific masked language models. We identify sense-related structure by clustering the peripheral graph, align clusters across time via node overlap, and track change through cluster composition and normalized cluster mass. In an application study on a corpus of New York Times Magazine articles (1980 - 2017), we show that graph connectivity reflects polysemy dynamics and that the induced communities capture contrasting trajectories: event-driven sense replacement (trump), semantic stability with cluster over-segmentation effects (god), and gradual association shifts tied to digital communication (post). Overall, word-centered semantic graphs offer a compact and transparent representation for exploring sense evolution without relying on predefined sense inventories.

74. 【2601.22402】Bifocal Attention: Harmonizing Geometric and Spectral Positional Embeddings for Algorithmic Generalization

Link: https://arxiv.org/abs/2601.22402

Authors: Kanishk Awadhiya

Categories: Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)

Keywords: Large Language Models, Rotary Positional Embeddings, Large Language, encode relative positions, Language Models

Abstract:Rotary Positional Embeddings (RoPE) have become the standard for Large Language Models (LLMs) due to their ability to encode relative positions through geometric rotation. However, we identify a significant limitation we term "Spectral Rigidity": standard RoPE utilizes a fixed geometric decay ($\theta^{-i}$) optimized for local syntactic coherence, which fails to capture the long-range, periodic structures inherent in recursive logic and algorithmic reasoning. This results in a "Structure Gap", where models trained on shallow reasoning chains fail to extrapolate to deeper recursive steps. In this work, we introduce Bifocal Attention, an architectural paradigm that decouples positional encoding into two distinct modalities: Geometric Eyes (Standard RoPE) for precise token-level manipulation, and Spectral Eyes (Learnable Harmonic Operators) for tracking long-range recursive depth. We propose a novel training protocol, Spectral Evolution, which initializes positional frequencies as static geometric parameters but allows them to evolve via gradient descent into a harmonic basis optimized for the specific algorithmic topology of the task.
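
To make the "fixed geometric decay" concrete, here is a minimal sketch of standard RoPE in plain Python. It uses the commonly cited formulation in which pair $i$ of a $d$-dimensional vector rotates at frequency $\text{base}^{-2i/d}$ with base 10000. The function names are hypothetical, and the learnable "Spectral Eyes" component is not shown because the abstract does not specify it.

```python
import math
from typing import List, Sequence


def rope_frequencies(dim: int, base: float = 10000.0) -> List[float]:
    """Fixed, geometrically decaying rotation frequencies, one per 2-D pair."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def apply_rope(vec: Sequence[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate each consecutive pair (vec[2i], vec[2i+1]) by position * freq_i."""
    out: List[float] = []
    for i, freq in enumerate(rope_frequencies(len(vec), base)):
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# The attention score depends only on the relative offset between positions:
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
score_a = dot(apply_rope(q, 5), apply_rope(k, 3))    # offset 2
score_b = dot(apply_rope(q, 12), apply_rope(k, 10))  # offset 2 again
```

Here `score_a` and `score_b` agree up to floating-point error, which is the relative-position property the abstract refers to. The frequencies themselves are constants, never updated during training, which is the rigidity the paper proposes to relax.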

75. 【2601.22396】Culturally Grounded Personas in Large Language Models: Characterization and Alignment with Socio-Psychological Value Frameworks

Link: https://arxiv.org/abs/2601.22396

Authors: Candida M. Greco, Lucio La Cava, Andrea Tagarelli

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Physics and Society (physics.soc-ph)

Keywords: Large Language Models, Language Models, Large Language, accurately reflect world, conditionings remains uncertain

Abstract:Despite the growing utility of Large Language Models (LLMs) for simulating human behavior, the extent to which these synthetic personas accurately reflect world and moral value systems across different cultural conditionings remains uncertain. This paper investigates the alignment of synthetic, culturally-grounded personas with established frameworks, specifically the World Values Survey (WVS), the Inglehart-Welzel Cultural Map, and Moral Foundations Theory. We conceptualize and produce LLM-generated personas based on a set of interpretable WVS-derived variables, and we examine the generated personas through three complementary lenses: positioning on the Inglehart-Welzel map, which unveils their interpretation reflecting stable differences across cultural conditionings; demographic-level consistency with the World Values Survey, where response distributions broadly track human group patterns; and moral profiles derived from a Moral Foundations questionnaire, which we analyze through a culture-to-morality mapping to characterize how moral responses vary across different cultural configurations. Our approach of culturally-grounded persona generation and analysis enables evaluation of cross-cultural structure and moral variation.

76. 【2601.22386】Specialists or Generalists? Multi-Agent and Single-Agent LLMs for Essay Grading

Link: https://arxiv.org/abs/2601.22386

Authors: Jamiu Adekunle Idowu, Ahmed Almasoud

Categories: Computation and Language (cs.CL); Multiagent Systems (cs.MA)

Keywords: Automated essay scoring, systems increasingly rely, Automated essay, increasingly rely, rely on large

Abstract:Automated essay scoring (AES) systems increasingly rely on large language models, yet little is known about how architectural choices shape their performance across different essay quality levels. This paper evaluates single-agent and multi-agent LLM architectures for essay grading using the ASAP 2.0 corpus. Our multi-agent system decomposes grading into three specialist agents (Content, Structure, Language) coordinated by a Chairman Agent that implements rubric-aligned logic including veto rules and score capping. We test both architectures in zero-shot and few-shot conditions using GPT-5.1. Results show that the multi-agent system is significantly better at identifying weak essays while the single-agent system performs better on mid-range essays. Both architectures struggle with high-quality essays. Critically, few-shot calibration emerges as the dominant factor in system performance -- providing just two examples per score level improves QWK by approximately 26% for both architectures. These findings suggest architectural choice should align with specific deployment priorities, with multi-agent AI particularly suited for diagnostic screening of at-risk students, while single-agent models provide a cost-effective solution for general assessment.
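
In essay scoring, QWK conventionally stands for quadratic weighted kappa, an agreement measure that penalizes disagreements by the squared distance between the two scores. Assuming that reading, a minimal pure-Python sketch of the standard formula follows. The function name and the sample scores are hypothetical and unrelated to the paper's data.

```python
from typing import Sequence


def quadratic_weighted_kappa(
    human: Sequence[int], model: Sequence[int], min_score: int, max_score: int
) -> float:
    """QWK = 1 - sum(w * O) / sum(w * E), with w[i][j] = (i - j)^2 / (n - 1)^2,
    O the observed confusion matrix and E the matrix expected by chance
    from the two raters' score histograms."""
    n = max_score - min_score + 1
    if n < 2 or len(human) != len(model) or not human:
        raise ValueError("need equal-length, non-empty ratings and >= 2 score levels")
    total = len(human)

    observed = [[0] * n for _ in range(n)]
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1

    hist_human = [sum(row) for row in observed]
    hist_model = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            numerator += weight * observed[i][j]
            denominator += weight * hist_human[i] * hist_model[j] / total
    if denominator == 0.0:
        raise ValueError("kappa is undefined when both raters use a single score")
    return 1.0 - numerator / denominator
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement. For example, identical score lists give 1, while fully reversed scores on a four-point scale give -1.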

77. 【2601.22385】SP^2DPO: An LLM-assisted Semantic Per-Pair DPO Generalization

Link: https://arxiv.org/abs/2601.22385

Authors: Chaoyue He, Xin Zhou, Di Wang, Hong Xu, Wei Liu, Chunyan Miao

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Direct Preference Optimization, Direct Preference, Preference Optimization, fitting preference labels, single global temperature

Comments: 39 pages, 15 figures, 16 tables, 60 equations

Abstract:Direct Preference Optimization (DPO) controls the trade-off between fitting preference labels and staying close to a reference model using a single global temperature beta, implicitly treating all preference pairs as equally informative. Real-world preference corpora are heterogeneous: they mix high-signal, objective failures (for example, safety, factuality, instruction violations) with low-signal or subjective distinctions (for example, style), and also include label noise. We introduce our method, SP2DPO (Semantic Per-Pair DPO), a generalization that replaces the global temperature with an instance-specific schedule beta_i pre-decided offline from structured semantic-gap annotations (category, magnitude, confidence) produced by teacher language models. We instantiate this procedure on the UltraFeedback preference corpus (59,960 pairs), enabling large-scale construction of an auditable beta_i artifact, and incur zero training-time overhead: the inner-loop optimizer remains standard DPO with beta set per pair. We focus our empirical study on AlpacaEval 2.0, reporting both raw win rate and length-controlled win rate. Across four open-weight, instruction-tuned student backbones (4B-8B), SP2DPO is competitive with a tuned global-beta DPO baseline and improves AlpacaEval 2.0 length-controlled win rate on two of four backbones, while avoiding per-model beta sweeps. All code, annotations, and artifacts will be released.
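
To make the "beta set per pair" idea concrete, the sketch below evaluates the standard DPO loss with a separate temperature for each preference pair. It assumes sequence log-probabilities are already computed. The function names and tuple layout are hypothetical, and the way the paper derives each beta_i from teacher annotations is not reproduced here.

```python
import math


def dpo_pair_loss(beta, logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z); written this way it avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def per_pair_dpo_loss(pairs, betas):
    """Mean DPO loss where pair i uses its own precomputed temperature betas[i].

    Each element of `pairs` is a tuple
    (logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected).
    """
    losses = [dpo_pair_loss(beta, *pair) for pair, beta in zip(pairs, betas)]
    return sum(losses) / len(losses)
```

Passing the same value for every entry of `betas` recovers ordinary global-beta DPO, which is why the per-pair variant adds no training-time overhead.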

78. 【2601.22379】SPLA: Block Sparse Plus Linear Attention for Long Context Modeling

Link: https://arxiv.org/abs/2601.22379

Authors: Bailin Wang, Dan Friedman, Tao Lei, Chong Wang

Subjects: Computation and Language (cs.CL)

Keywords: Block-wise sparse attention, offers significant efficiency, significant efficiency gains, cumulative contextual loss, low selection fidelity

Comments: v1

Abstract:Block-wise sparse attention offers significant efficiency gains for long-context modeling, yet existing methods often suffer from low selection fidelity and cumulative contextual loss by completely discarding unselected blocks. To address these limitations, we introduce Sparse Plus Linear Attention (SPLA), a framework that utilizes a selection metric derived from second-order Taylor expansions to accurately identify relevant blocks for exact attention. Instead of discarding the remaining "long tail," SPLA compresses unselected blocks into a compact recurrent state via a residual linear attention (RLA) module. Crucially, to avoid IO overhead, we derive an optimized subtraction-based formulation for RLA -- calculating the residual as the difference between global and selected linear attention -- ensuring that unselected blocks are never explicitly accessed during inference. Our experiments demonstrate that SPLA closes the performance gap in continual pretraining, surpassing dense attention models on long-context benchmarks like RULER while maintaining competitive general knowledge and reasoning capabilities.
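
The subtraction trick mentioned in the abstract can be illustrated with a toy example. The sketch assumes unnormalized linear attention with an identity feature map, so the recurrent state is just the sum of key-value outer products. Those simplifications and all function names are assumptions made for illustration, and the paper's actual RLA module may differ.

```python
def outer_sum(keys, values):
    """Linear-attention state S = sum_j k_j v_j^T, stored as a nested list."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    for k, v in zip(keys, values):
        for a in range(d_k):
            for b in range(d_v):
                state[a][b] += k[a] * v[b]
    return state


def residual_state(global_state, selected_keys, selected_values):
    """State of the unselected tokens, obtained without ever reading them."""
    if not selected_keys:
        return [row[:] for row in global_state]
    selected = outer_sum(selected_keys, selected_values)
    return [[g - s for g, s in zip(g_row, s_row)]
            for g_row, s_row in zip(global_state, selected)]


def read(state, query):
    """Linear-attention readout o = q^T S."""
    return [sum(q * row[b] for q, row in zip(query, state))
            for b in range(len(state[0]))]
```

Because the global state can be maintained as a running sum, only the selected blocks need to be loaded to form the residual, which is the I/O saving the abstract points to.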

79. 【2601.22373】Stability-Aware Prompt Optimization for Clinical Data Abstraction

Link: https://arxiv.org/abs/2601.22373

Authors: Arinbjörn Kolbeinsson, Daniel Timbie, Sajjan Narsinghani, Sanjay Hariharan

Subjects: Computation and Language (cs.CL)

Keywords: Large language models, Large language, work treats prompts, uncertainty in isolation, work treats

Abstract:Large language models used for clinical abstraction are sensitive to prompt wording, yet most work treats prompts as fixed and studies uncertainty in isolation. We argue these should be treated jointly. Across two clinical tasks (MedAlign applicability/correctness and MS subtype abstraction) and multiple open and proprietary models, we measure prompt sensitivity via flip rates and relate it to calibration and selective prediction. We find that higher accuracy does not guarantee prompt stability, and that models can appear well-calibrated yet remain fragile to paraphrases. We propose a dual-objective prompt optimization loop that jointly targets accuracy and stability, showing that explicitly including a stability term reduces flip rates across tasks and models, sometimes at modest accuracy cost. Our results suggest prompt sensitivity should be an explicit objective when validating clinical LLM systems.
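
The abstract does not spell out how flip rates are computed, so the sketch below uses one plausible definition: the fraction of paraphrased prompts whose prediction differs from the prediction under the original prompt. The combined score with a linear stability penalty is likewise an assumption, shown only to illustrate what a dual objective could look like.

```python
def flip_rate(predictions_by_item):
    """Fraction of paraphrase predictions that differ from the base prediction.

    Each inner list holds the base-prompt prediction first, followed by the
    predictions obtained with paraphrased prompts for the same item.
    """
    flips = 0
    total = 0
    for preds in predictions_by_item:
        base, paraphrases = preds[0], preds[1:]
        flips += sum(1 for p in paraphrases if p != base)
        total += len(paraphrases)
    return flips / total if total else 0.0


def stability_aware_score(accuracy, flips, weight=1.0):
    """Hypothetical dual objective: reward accuracy, penalize instability."""
    return accuracy - weight * flips
```

Under this scoring, a candidate prompt with slightly lower accuracy can still be preferred when its predictions flip less often across paraphrases.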

80. 【2601.22364】Context Structure Reshapes the Representational Geometry of Language Models

Link: https://arxiv.org/abs/2601.22364

Authors: Eghbal A. Hosseini, Yuxuan Li, Yasaman Bahri, Declan Campbell, Andrew Kyle Lampinen

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, facilitate next-token prediction, deep layers, linear extrapolation

Abstract:Large Language Models (LLMs) have been shown to organize the representations of input sequences into straighter neural trajectories in their deep layers, which has been hypothesized to facilitate next-token prediction via linear extrapolation. Language models can also adapt to diverse tasks and learn new structure in context, and recent work has shown that this in-context learning (ICL) can be reflected in representational changes. Here we bring these two lines of research together to explore whether representation straightening occurs within a context during ICL. We measure representational straightening in Gemma 2 models across a diverse set of in-context tasks, and uncover a dichotomy in how LLMs' representations change in context. In continual prediction settings (e.g., natural language, grid world traversal tasks) we observe that increasing context increases the straightness of neural sequence trajectories, which is correlated with improvement in model prediction. Conversely, in structured prediction settings (e.g., few-shot tasks), straightening is inconsistent -- it is only present in phases of the task with explicit structure (e.g., repeating a template), but vanishes elsewhere. These results suggest that ICL is not a monolithic process. Instead, we propose that LLMs function like a Swiss Army knife: depending on task structure, the LLM dynamically selects between strategies, only some of which yield representational straightening.

81. 【2601.22361】MERMAID: Memory-Enhanced Retrieval and Reasoning with Multi-Agent Iterative Knowledge Grounding for Veracity Assessment

Link: https://arxiv.org/abs/2601.22361

Authors: Yupeng Cao, Chengyang He, Yangyang Yu, Ping Wang, K.P. Subbalakshmi

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: veracity assessment, increasingly critical, online content, veracity, Assessing

Abstract:Assessing the veracity of online content has become increasingly critical. Large language models (LLMs) have recently enabled substantial progress in automated veracity assessment, including automated fact-checking and claim verification systems. Typical veracity assessment pipelines break down complex claims into sub-claims, retrieve external evidence, and then apply LLM reasoning to assess veracity. However, existing methods often treat evidence retrieval as a static, isolated step and do not effectively manage or reuse retrieved evidence across claims. In this work, we propose MERMAID, a memory-enhanced multi-agent veracity assessment framework that tightly couples the retrieval and reasoning processes. MERMAID integrates agent-driven search, structured knowledge representations, and a persistent memory module within a Reason-Action style iterative process, enabling dynamic evidence acquisition and cross-claim evidence reuse. By retaining retrieved evidence in an evidence memory, the framework reduces redundant searches and improves verification efficiency and consistency. We evaluate MERMAID on three fact-checking benchmarks and two claim-verification datasets using multiple LLMs, including GPT, LLaMA, and Qwen families. Experimental results show that MERMAID achieves state-of-the-art performance while improving the search efficiency, demonstrating the effectiveness of synergizing retrieval, reasoning, and memory for reliable veracity assessment.

82. 【2601.22311】Why Reasoning Fails to Plan: A Planning-Centric Analysis of Long-Horizon Decision Making in LLM Agents

Link: https://arxiv.org/abs/2601.22311

Authors: Zehong Wang, Fang Wu, Hongru Wang, Xiangru Tang, Bolian Li, Zhenfei Yin, Yijun Ma, Yiyang Li, Weixiang Sun, Xiusi Chen, Yanfang Ye

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large language model, Large language, based agents exhibit, agents exhibit strong, sustain coherent behavior

Abstract:Large language model (LLM)-based agents exhibit strong step-by-step reasoning capabilities over short horizons, yet often fail to sustain coherent behavior over long planning horizons. We argue that this failure reflects a fundamental mismatch: step-wise reasoning induces a form of step-wise greedy policy that is adequate for short horizons but fails in long-horizon planning, where early actions must account for delayed consequences. From this planning-centric perspective, we study LLM-based agents in deterministic, fully structured environments with explicit state transitions and evaluation signals. Our analysis reveals a core failure mode of reasoning-based policies: locally optimal choices induced by step-wise scoring lead to early myopic commitments that are systematically amplified over time and difficult to recover from. We introduce FLARE (Future-aware Lookahead with Reward Estimation) as a minimal instantiation of future-aware planning to enforce explicit lookahead, value propagation, and limited commitment in a single model, allowing downstream outcomes to influence early decisions. Across multiple benchmarks, agent frameworks, and LLM backbones, FLARE consistently improves task performance and planning-level behavior, frequently allowing LLaMA-8B with FLARE to outperform GPT-4o with standard step-by-step reasoning. These results establish a clear distinction between reasoning and planning.

83. 【2601.22297】Prepare Reasoning Language Models for Multi-Agent Debate with Self-Debate Reinforcement Learning

Link: https://arxiv.org/abs/2601.22297

Authors: Chenxi Liu, Yanshuo Chen, Ruibo Chen, Tianyi Xiong, Tong Zheng, Heng Huang

Subjects: Computation and Language (cs.CL)

Keywords: large language models, verifiable rewards, abilities of large, large language, substantially improved

Abstract:The reasoning abilities of large language models (LLMs) have been substantially improved by reinforcement learning with verifiable rewards (RLVR). At test time, collaborative reasoning through Multi-Agent Debate (MAD) has emerged as a promising approach for enhancing LLM performance. However, current RLVR methods typically train LLMs to solve problems in isolation, without explicitly preparing them to synthesize and benefit from different rationales that arise during debate. In this work, we propose Self-Debate Reinforcement Learning (SDRL), a training framework that equips a single LLM with strong standalone problem-solving ability and the capability to learn from diverse reasoning trajectories in MAD. Given a prompt, SDRL first samples multiple candidate solutions, then constructs a debate context with diverse reasoning paths and generates second-turn responses conditioned on this context. Finally, SDRL jointly optimizes both the initial and debate-conditioned responses, yielding a model that is effective as both a standalone solver and a debate participant. Experiments across multiple base models and reasoning benchmarks show that SDRL improves overall MAD performance while simultaneously strengthening single model reasoning.

84. 【2601.22269】JAF: Judge Agent Forest

Link: https://arxiv.org/abs/2601.22269

Authors: Sahil Garg, Brad Cheezum, Sridhar Dutta, Vishal Agarwal

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Judge Agent Forest, enable iterative self-refinement, provide automated evaluation, Judge Agent, judge agent conducts

Abstract:Judge agents are fundamental to agentic AI frameworks: they provide automated evaluation, and enable iterative self-refinement of reasoning processes. We introduce JAF: Judge Agent Forest, a framework in which the judge agent conducts joint inference across a cohort of query--response pairs generated by a primary agent, rather than evaluating each in isolation. This paradigm elevates the judge from a local evaluator to a holistic learner: by simultaneously assessing related responses, the judge discerns cross-instance patterns and inconsistencies, whose aggregate feedback enables the primary agent to improve by viewing its own outputs through the judge's collective perspective. Conceptually, JAF bridges belief propagation and ensemble-learning principles: overlapping in-context neighborhoods induce a knowledge-graph structure that facilitates propagation of critique, and repeated, randomized evaluations yield a robust ensemble of context-sensitive judgments. JAF can be instantiated entirely via ICL, with the judge prompted for each query using its associated primary-agent response plus a small, possibly noisy set of peer exemplars. While kNN in embedding space is a natural starting point for exemplars, this approach overlooks categorical structure, domain metadata, or nuanced distinctions accessible to modern LLMs. To overcome these limitations, we develop a flexible locality-sensitive hashing (LSH) algorithm that learns informative binary codes by integrating semantic embeddings, LLM-driven hash predicates, supervision from categorical labels, and relevant side information. These hash codes support efficient, interpretable, and relation-aware selection of diverse exemplars, and further optimize exploration of CoT reasoning paths. We validate JAF with an empirical study on the demanding task of cloud misconfigs triage in large-scale cloud environments.

85. 【2601.22264】Predicting Intermittent Job Failure Categories for Diagnosis Using Few-Shot Fine-Tuned Language Models

Link: https://arxiv.org/abs/2601.22264

Authors: Henri Aïdasso, Francis Bordeleau, Ali Tizghadam

Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Continuous Integration, provide valuable feedback, failures provide valuable, pipeline failures provide, code-related errors

Abstract:In principle, Continuous Integration (CI) pipeline failures provide valuable feedback to developers on code-related errors. In practice, however, pipeline jobs often fail intermittently due to non-deterministic tests, network outages, infrastructure failures, resource exhaustion, and other reliability issues. These intermittent (flaky) job failures lead to substantial inefficiencies: wasted computational resources from repeated reruns and significant diagnosis time that distracts developers from core activities and often requires intervention from specialized teams. Prior work has proposed machine learning techniques to detect intermittent failures, but does not address the subsequent diagnosis challenge. To fill this gap, we introduce FlaXifyer, a few-shot learning approach for predicting intermittent job failure categories using pre-trained language models. FlaXifyer requires only job execution logs and achieves 84.3% Macro F1 and 92.0% Top-2 accuracy with just 12 labeled examples per category. We also propose LogSift, an interpretability technique that identifies influential log statements in under one second, reducing review effort by 74.4% while surfacing relevant failure information in 87% of cases. Evaluation on 2,458 job failures from TELUS demonstrates that FlaXifyer and LogSift enable effective automated triage, accelerate failure diagnosis, and pave the way towards the automated resolution of intermittent job failures.

86. 【2601.22240】A Systematic Literature Review on LLM Defenses Against Prompt Injection and Jailbreaking: Expanding NIST Taxonomy

Link: https://arxiv.org/abs/2601.22240

Authors: Pedro H. Barcha Correia, Ryan W. Achjian, Diego E. G. Caetano de Oliveira, Ygor Acacio Maria, Victor Takashi Hayashi, Marcos Lopes, Charles Christian Miers, Marcos A. Simplicio Jr

Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: generative artificial intelligence, large language models, prompt injection, artificial intelligence, language models

Comments: 27 pages, 14 figures, 11 tables, submitted to Elsevier Computer Science Review

Abstract:The rapid advancement and widespread adoption of generative artificial intelligence (GenAI) and large language models (LLMs) has been accompanied by the emergence of new security vulnerabilities and challenges, such as jailbreaking and other prompt injection attacks. These maliciously crafted inputs can exploit LLMs, causing data leaks, unauthorized actions, or compromised outputs, for instance. As both offensive and defensive prompt injection techniques evolve quickly, a structured understanding of mitigation strategies becomes increasingly important. To address that, this work presents the first systematic literature review on prompt injection mitigation strategies, comprehending 88 studies. Building upon NIST's report on adversarial machine learning, this work contributes to the field through several avenues. First, it identifies studies beyond those documented in NIST's report and other academic reviews and surveys. Second, we propose an extension to NIST taxonomy by introducing additional categories of defenses. Third, by adopting NIST's established terminology and taxonomy as a foundation, we promote consistency and enable future researchers to build upon the standardized taxonomy proposed in this work. Finally, we provide a comprehensive catalog of the reviewed prompt injection defenses, documenting their reported quantitative effectiveness across specific LLMs and attack datasets, while also indicating which solutions are open-source and model-agnostic. This catalog, together with the guidelines presented herein, aims to serve as a practical resource for researchers advancing the field of adversarial machine learning and for developers seeking to implement effective defenses in production systems.

87. 【2601.22228】Lost in Space? Vision-Language Models Struggle with Relative Camera Pose Estimation

Link: https://arxiv.org/abs/2601.22228

Authors: Ken Deng, Yifu Qiu, Yoni Kasten, Shay B. Cohen, Yftah Ziser

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: semantic reasoning compared, perception and semantic, limited understanding, relative camera, Vision-Language Models

Abstract:Vision-Language Models (VLMs) perform well in 2D perception and semantic reasoning compared to their limited understanding of 3D spatial structure. We investigate this gap using relative camera pose estimation (RCPE), a fundamental vision task that requires inferring relative camera translation and rotation from a pair of images. We introduce VRRPI-Bench, a benchmark derived from unlabeled egocentric videos with verbalized annotations of relative camera motion, reflecting realistic scenarios with simultaneous translation and rotation around a shared object. We further propose VRRPI-Diag, a diagnostic benchmark that isolates individual motion degrees of freedom. Despite the simplicity of RCPE, most VLMs fail to generalize beyond shallow 2D heuristics, particularly for depth changes and roll transformations along the optical axis. Even state-of-the-art models such as GPT-5 ($0.64$) fall short of classic geometric baselines ($0.97$) and human performance ($0.92$). Moreover, VLMs exhibit difficulty in multi-image reasoning, with inconsistent performance (best $59.7\%$) when integrating spatial cues across frames. Our findings reveal limitations in grounding VLMs in 3D and multi-view spatial reasoning.

88. 【2601.22181】MrRoPE: Mixed-radix Rotary Position Embedding

Link: https://arxiv.org/abs/2601.22181

Authors: Qingyuan Tian, Wenhong Zhu, Xiaoran Liu, Xiaofeng Wang, Rui Wang

Subjects: Computation and Language (cs.CL)

Keywords: Rotary Position Embedding, Position Embedding scheme, Rotary Position, Position Embedding, handle longer sequences

Abstract:Rotary Position Embedding (RoPE)-extension refers to modifying or generalizing the Rotary Position Embedding scheme to handle longer sequences than those encountered during pre-training. However, current extension strategies are highly diverse and lack a unified theoretical foundation. In this paper, we propose MrRoPE (Mixed-radix RoPE), a generalized encoding formulation based on a radix system conversion perspective, which elegantly unifies various RoPE-extension approaches as distinct radix conversion strategies. Based on this theory, we introduce two training-free extensions, MrRoPE-Uni and MrRoPE-Pro, which leverage uniform and progressive radix conversion strategies, respectively, to achieve 'train short, test long' generalization. Without fine-tuning, MrRoPE-Pro sustains over 85% recall in the 128K-context Needle-in-a-Haystack test and achieves more than double YaRN's accuracy on Infinite-Bench retrieval and dialogue subsets. Theoretical analysis confirms that MrRoPE-Pro effectively raises the upper bound of RoPE's attainable encoding length, which further validates the reliability and utility of our theory and methodology.

89. 【2601.22169】In Vino Veritas and Vulnerabilities: Examining LLM Safety via Drunk Language Inducement

Link: https://arxiv.org/abs/2601.22169

Authors: Anudeex Shetty, Aditya Joshi, Salil S. Kanhere

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Keywords: Humans are susceptible, influence of alcohol, susceptible to undesirable, drunk language, Humans

Comments: WIP

Abstract:Humans are susceptible to undesirable behaviours and privacy leaks under the influence of alcohol. This paper investigates drunk language, i.e., text written under the influence of alcohol, as a driver for safety failures in large language models (LLMs). We investigate three mechanisms for inducing drunk language in LLMs: persona-based prompting, causal fine-tuning, and reinforcement-based post-training. When evaluated on 5 LLMs, we observe a higher susceptibility to jailbreaking on JailbreakBench (even in the presence of defences) and privacy leaks on ConfAIde, where both benchmarks are in English, as compared to the base LLMs as well as previously reported approaches. Via a robust combination of manual evaluation and LLM-based evaluators and analysis of error categories, our findings highlight a correspondence between human-intoxicated behaviour, and anthropomorphism in LLMs induced with drunk language. The simplicity and efficiency of our drunk language inducement approaches position them as potential counters for LLM safety tuning, highlighting significant risks to LLM safety.

90. 【2601.22873】EmoShift: Lightweight Activation Steering for Enhanced Emotion-Aware Speech Synthesis

Link: https://arxiv.org/abs/2601.22873

Authors: Li Zhou, Hao Jiang, Junjie Li, Tianrui Wang, Haizhou Li

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)

Keywords: Achieving precise, emotion-aware TTS systems, crucial for producing, producing natural, natural and context-appropriate

Comments: Activation Steering; Emotion-Aware TTS; Speech Synthesis; Accepted by ICASSP 2026

Abstract:Achieving precise and controllable emotional expression is crucial for producing natural and context-appropriate speech in text-to-speech (TTS) synthesis. However, many emotion-aware TTS systems, including large language model (LLM)-based designs, rely on scaling fixed emotion embeddings or external guidance, limiting their ability to model emotion-specific latent characteristics. To address this gap, we present EmoShift, a lightweight activation-steering framework incorporating an EmoSteer layer, which learns a steering vector for each target emotion in the output embedding space to capture its latent offset and maintain stable, appropriate expression across utterances and categories. With only 10M trainable parameters, less than 1/30 of full fine-tuning, EmoShift outperforms zero-shot and fully fine-tuned baselines in objective and subjective evaluations, enhancing emotional expressiveness while preserving naturalness and speaker similarity. Further analysis confirms the proposed EmoSteer layer's effectiveness and reveals its potential for controllable emotional intensity in speech synthesis.

91. 【2601.22792】CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR

Link: https://arxiv.org/abs/2601.22792

Authors: Muhammad Shakeel, Yosuke Fukumoto, Chikara Maeda, Chyi-Jiunn Lin, Shinji Watanabe

Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)

Keywords: automatic speech recognition, multi-speaker automatic speech, multi-speaker automatic, Contextual Acoustic-Linguistic Modeling, ASR

Comments: Accepted to IEEE ICASSP 2026

Abstract:We present CALM, a joint Contextual Acoustic-Linguistic Modeling framework for multi-speaker automatic speech recognition (ASR). In personalized AI scenarios, the joint availability of acoustic and linguistic cues naturally motivates the integration of target-speaker conditioning with contextual biasing in overlapping conversations. CALM implements this integration in an end-to-end framework through speaker embedding-driven target-speaker extraction and dynamic vocabulary-based contextual biasing. We evaluate CALM on simulated English (LibriSpeechMix) and Japanese (Corpus of Spontaneous Japanese mixtures, CSJMix). On two-speaker mixtures, CALM reduces biased word error rate (B-WER) from 12.7 to 4.7 on LibriSpeech2Mix and biased character error rate (B-CER) from 16.6 to 8.4 on CSJMix2 (eval3), demonstrating the effectiveness of joint acoustic-linguistic modeling across languages. We additionally report results on the AMI corpus (IHM-mix condition) to validate performance on standardized speech mixtures.

92. 【2601.22306】Sylber 2.0: A Universal Syllable Embedding

Link: https://arxiv.org/abs/2601.22306

Authors: Cheol Jun Cho, Nicholas Lee, Alan W Black, Gopala K. Anumanchipalli

Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)

Keywords: Scaling spoken language, Scaling spoken, requires speech tokens, Scaling, Sylber

Abstract:Scaling spoken language modeling requires speech tokens that are both efficient and universal. Recent work has proposed syllables as promising speech tokens at low temporal resolution, but existing models are constrained to English and fail to capture sufficient acoustic detail. To address this gap, we present Sylber 2.0, a self-supervised framework for coding speech at the syllable level that enables efficient temporal compression and high-fidelity reconstruction. Sylber 2.0 achieves a very low token frequency around 5 Hz, while retaining both linguistic and acoustic detail across multiple languages and expressive styles. Experiments show that it performs on par with previous models operating on high-frequency baselines. Furthermore, Sylber 2.0 enables efficient TTS modeling which can generate speech with competitive intelligibility and quality with SOTA models using only 72M parameters. Moreover, the universality of Sylber 2.0 provides more effective features for low resource ASR than previous speech coding frameworks. In sum, we establish an effective syllable-level abstraction for general spoken language.

93. 【2601.22162】UniFinEval: Towards Unified Evaluation of Financial Multimodal Models across Text, Images and Videos

Link: https://arxiv.org/abs/2601.22162

Authors: Zhi Yang, Lingfeng Zeng, Fangqi Lou, Qi Qi, Wei Zhang, Zhenyu Wu, Zhenxiong Yu, Jun Han, Zhiheng Jin, Lejie Zhang, Xiaoming Huang, Xiaolong Liang, Zheng Wei, Junbo Zou, Dongpo Cheng, Zhaowei Liu, Xin Guo, Rongjunchen Zhang, Liwen Zhang

Subjects: General Finance (q-fin.GN); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: increasingly significant role, Multimodal large language, cross-modal multi-hop reasoning, existing multimodal benchmarks, Company Fundamental Reasoning

Abstract:Multimodal large language models are playing an increasingly significant role in empowering the financial domain; however, the challenges they face, such as multimodal and high-density information and cross-modal multi-hop reasoning, go beyond the evaluation scope of existing multimodal benchmarks. To address this gap, we propose UniFinEval, the first unified multimodal benchmark designed for high-information-density financial environments, covering text, images, and videos. UniFinEval systematically constructs five core financial scenarios grounded in real-world financial systems: Financial Statement Auditing, Company Fundamental Reasoning, Industry Trend Insights, Financial Risk Sensing, and Asset Allocation Analysis. We manually construct a high-quality dataset consisting of 3,767 question-answer pairs in both Chinese and English and systematically evaluate 10 mainstream MLLMs under Zero-Shot and CoT settings. Results show that Gemini-3-pro-preview achieves the best overall performance, yet still exhibits a substantial gap compared to financial experts. Further error analysis reveals systematic deficiencies in current models. UniFinEval aims to provide a systematic assessment of MLLMs' capabilities in fine-grained, high-information-density financial environments, thereby enhancing the robustness of MLLM applications in real-world financial scenarios. Data and code are available at this https URL.

Information Retrieval

1. 【2601.23085】OrLog: Resolving Complex Queries with LLMs and Probabilistic Reasoning

Link: https://arxiv.org/abs/2601.23085

Authors: Mohanna Hoveyda, Jelle Piepenbrock, Arjen P de Vries, Maarten de Rijke, Faegheh Hasibi

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Resolving complex information, candidate answer set, Resolving complex, logical operators encoded, answer set

Comments: Accepted to ECIR 2026

Abstract:Resolving complex information needs that come with multiple constraints should consider enforcing the logical operators encoded in the query (i.e., conjunction, disjunction, negation) on the candidate answer set. Current retrieval systems either ignore these constraints in neural embeddings or approximate them in a generative reasoning process that can be inconsistent and unreliable. Although well-suited to structured reasoning, existing neuro-symbolic approaches remain confined to formal logic or mathematics problems as they often assume unambiguous queries and access to complete evidence, conditions rarely met in information retrieval. To bridge this gap, we introduce OrLog, a neuro-symbolic retrieval framework that decouples predicate-level plausibility estimation from logical reasoning: a large language model (LLM) provides plausibility scores for atomic predicates in one decoding-free forward pass, from which a probabilistic reasoning engine derives the posterior probability of query satisfaction. We evaluate OrLog across multiple backbone LLMs, varying levels of access to external knowledge, and a range of logical constraints, and compare it against base retrievers and LLM-as-reasoner methods. Provided with entity descriptions, OrLog can significantly boost top-rank precision compared to LLM reasoning with larger gains on disjunctive queries. OrLog is also more efficient, cutting mean tokens by ~90% per query-entity pair. These results demonstrate that generation-free predicate plausibility estimation combined with probabilistic reasoning enables constraint-aware retrieval that outperforms monolithic reasoning while using far fewer tokens.
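
To give a feel for how predicate-level plausibility scores can be combined into a probability that a query is satisfied, the sketch below composes conjunction, disjunction and negation under an independence assumption between predicates. That assumption, the helper names, and the example query are illustrative only. The paper's probabilistic reasoning engine may handle dependencies differently.

```python
from math import prod


def p_not(p):
    """Probability that a predicate does not hold."""
    return 1.0 - p


def p_and(*ps):
    """Probability that all predicates hold, assuming independence."""
    return prod(ps)


def p_or(*ps):
    """Probability that at least one predicate holds, assuming independence."""
    return 1.0 - prod(1.0 - p for p in ps)


def rank_entities(scores_by_entity, query):
    """Sort entities by the probability that they satisfy the query.

    `scores_by_entity` maps an entity name to a dict of predicate plausibilities;
    `query` is a function turning such a dict into a single probability.
    """
    posterior = {entity: query(scores) for entity, scores in scores_by_entity.items()}
    return sorted(posterior.items(), key=lambda item: item[1], reverse=True)
```

A query such as "A and (B or not C)" would then be written as `lambda s: p_and(s["A"], p_or(s["B"], p_not(s["C"])))` and passed to `rank_entities`.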

2. 【2601.22925】BEAR: Towards Beam-Search-Aware Optimization for Recommendation with Large Language Models

Link: https://arxiv.org/abs/2601.22925

Authors: Weiqin Yang, Bohao Wang, Zhenxiang Xu, Jiawei Chen, Shengjia Zhang, Jingbang Chen, Canghong Jin, Can Wang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large Language Models, leveraging Large Language, research leveraging Large, Language Models, Large Language

Abstract: Recent years have witnessed a rapid surge in research leveraging Large Language Models (LLMs) for recommendation. These methods typically employ supervised fine-tuning (SFT) to adapt LLMs to recommendation scenarios, and utilize beam search during inference to efficiently retrieve B top-ranked recommended items. However, we identify a critical training-inference inconsistency: while SFT optimizes the overall probability of positive items, it does not guarantee that such items will be retrieved by beam search even if they possess high overall probabilities. Due to the greedy pruning mechanism, beam search can prematurely discard a positive item once its prefix probability is insufficient. To address this inconsistency, we propose BEAR (Beam-SEarch-Aware Regularization), a novel fine-tuning objective that explicitly accounts for beam search behavior during training. Rather than directly simulating beam search for each instance during training, which is computationally prohibitive, BEAR enforces a relaxed necessary condition: each token in a positive item must rank within the top-B candidate tokens at each decoding step. This objective effectively mitigates the risk of incorrect pruning while incurring negligible computational overhead compared to standard SFT. Extensive experiments across four real-world datasets demonstrate that BEAR significantly outperforms strong baselines. Code will be released upon acceptance.
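
The relaxed necessary condition can be illustrated with a short Python check. The sketch below takes hypothetical per-step token probabilities and reports the decoding steps at which the positive item's token falls outside the top-B candidates. It only illustrates the condition itself. The abstract does not give the form of BEAR's regularizer, so none is shown, and all names and numbers here are assumptions.

```python
def rank_of(token, step_scores):
    """1-based rank of `token` among candidate tokens, by descending score."""
    target = step_scores[token]
    return 1 + sum(1 for s in step_scores.values() if s > target)

def violating_steps(item_tokens, per_step_scores, beam_size):
    """Steps where the positive item's token is not among the top-B candidates."""
    return [
        t
        for t, (tok, scores) in enumerate(zip(item_tokens, per_step_scores))
        if rank_of(tok, scores) > beam_size
    ]

# Hypothetical conditional token probabilities along the positive item's prefix.
item = ["A", "B", "C"]
steps = [
    {"A": 0.30, "X": 0.40, "Y": 0.30},  # "A" has rank 2 (tied with "Y")
    {"B": 0.10, "X": 0.50, "Y": 0.40},  # "B" has rank 3
    {"C": 0.90, "X": 0.05, "Y": 0.05},  # "C" has rank 1
]
print(violating_steps(item, steps, beam_size=2))
```

With `beam_size=2`, the second token of the item violates the condition, so the item is at risk of being pruned at that step regardless of how probable its remaining tokens are.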

3. 【2601.22783】Compact Hypercube Embeddings for Fast Text-based Wildlife Observation Retrieval

Link: https://arxiv.org/abs/2601.22783

Authors: Ilyass Moummad, Marius Miron, David Robinson, Kawtar Zaher, Hervé Goëau, Olivier Pietquin, Pierre Bonnet, Emmanuel Chemla, Matthieu Geist, Alexis Joly

Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)

Keywords: platforms increasingly rely, monitoring platforms increasingly, platforms increasingly, increasingly rely, rely on multimodal

Abstract:Large-scale biodiversity monitoring platforms increasingly rely on multimodal wildlife observations. While recent foundation models enable rich semantic representations across vision, audio, and language, retrieving relevant observations from massive archives remains challenging due to the computational cost of high-dimensional similarity search. In this work, we introduce compact hypercube embeddings for fast text-based wildlife observation retrieval, a framework that enables efficient text-based search over large-scale wildlife image and audio databases using compact binary representations. Building on the cross-view code alignment hashing framework, we extend lightweight hashing beyond a single-modality setup to align natural language descriptions with visual or acoustic observations in a shared Hamming space. Our approach leverages pretrained wildlife foundation models, including BioCLIP and BioLingual, and adapts them efficiently for hashing using parameter-efficient fine-tuning. We evaluate our method on large-scale benchmarks, including iNaturalist2024 for text-to-image retrieval and iNatSounds2024 for text-to-audio retrieval, as well as multiple soundscape datasets to assess robustness under domain shift. Results show that retrieval using discrete hypercube embeddings achieves competitive, and in several cases superior, performance compared to continuous embeddings, while drastically reducing memory and search cost. Moreover, we observe that the hashing objective consistently improves the underlying encoder representations, leading to stronger retrieval and zero-shot generalization. These results demonstrate that binary, language-based retrieval enables scalable and efficient search over large wildlife archives for biodiversity monitoring systems.
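
As a rough illustration of why binary codes make search cheap, the following Python sketch binarizes embeddings by sign, packs them into integers, and ranks database entries by Hamming distance. The sign-based binarization, the 8-dimensional toy vectors, and all names are assumptions for illustration only. The paper's hashing is learned, and its details are not given in the abstract.

```python
def to_code(vector):
    """Binarize a real-valued embedding by sign and pack the bits into an int."""
    code = 0
    for x in vector:
        code = (code << 1) | (1 if x > 0 else 0)
    return code

def hamming(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")

def search(query_vec, database, k=2):
    """Return the k database keys whose codes are closest to the query code."""
    q = to_code(query_vec)
    return sorted(database, key=lambda name: hamming(q, database[name]))[:k]

# Hypothetical 8-dimensional embeddings standing in for encoder outputs.
db = {
    "obs_1": to_code([0.9, -0.2, 0.4, -0.7, 0.1, 0.3, -0.5, 0.8]),
    "obs_2": to_code([-0.9, 0.2, -0.4, 0.7, -0.1, -0.3, 0.5, -0.8]),
    "obs_3": to_code([0.8, -0.1, 0.5, 0.6, 0.2, 0.1, -0.4, 0.9]),
}
text_query = [0.7, -0.3, 0.2, -0.6, 0.3, 0.2, -0.1, 0.5]
print(search(text_query, db))
```

Each comparison is one XOR plus a bit count, and each item is stored in one bit per dimension instead of one float per dimension.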

4. 【2601.22694】Farewell to Item IDs: Unlocking the Scaling Potential of Large Ranking Models via Semantic Tokens

Link: https://arxiv.org/abs/2601.22694

Authors: Zhen Zhao, Tong Zhang, Jie Xu, Qingliang Cai, Qile Zhang, Leyuan Yang, Daorui Xiao, Xiaojia Chang

Categories: Information Retrieval (cs.IR)

Keywords: Recent studies, achieved substantial improvement, achieved substantial, ranking systems rely, recommendation systems

Abstract:Recent studies on scaling up ranking models have achieved substantial improvement for recommendation systems and search engines. However, most large-scale ranking systems rely on item IDs, where each item is treated as an independent categorical symbol and mapped to a learned embedding. As items rapidly appear and disappear, these embeddings become difficult to train and maintain. This instability impedes effective learning of neural network parameters and limits the scalability of ranking models. In this paper, we show that semantic tokens possess greater scaling potential compared to item IDs. Our proposed framework TRM improves the token generation and application pipeline, leading to 33% reduction in sparse storage while achieving 0.85% AUC increase. Extensive experiments further show that TRM could consistently outperform state-of-the-art models when model capacity scales. Finally, TRM has been successfully deployed on large-scale personalized search engines, yielding 0.26% and 0.75% improvement on user active days and change query ratio respectively through A/B test.

5. 【2601.22547】PersonaAct: Simulating Short-Video Users with Personalized Agents for Counterfactual Filter Bubble Auditing

Link: https://arxiv.org/abs/2601.22547

Authors: Shilong Zhao, Qinggang Yang, Zhiyi Yin, Xiaoshi Wang, Zhenxing Chen, Du Su, Xueqi Cheng

Categories: Information Retrieval (cs.IR)

Keywords: Short-video platforms rely, narrow content exposure, personalized recommendation, raising concerns, platforms rely

Abstract:Short-video platforms rely on personalized recommendation, raising concerns about filter bubbles that narrow content exposure. Auditing such phenomena at scale is challenging because real user studies are costly and privacy-sensitive, and existing simulators fail to reproduce realistic behaviors due to their reliance on textual signals and weak personalization. We propose PersonaAct, a framework for simulating short-video users with persona-conditioned multimodal agents trained on real behavioral traces for auditing filter bubbles in breadth and depth. PersonaAct synthesizes interpretable personas through automated interviews combining behavioral analysis with structured questioning, then trains agents on multimodal observations using supervised fine-tuning and reinforcement learning. We deploy trained agents for filter bubble auditing and evaluate bubble breadth via content diversity and bubble depth via escape potential. The evaluation demonstrates substantial improvements in fidelity over generic LLM baselines, enabling realistic behavior reproduction. Results reveal significant content narrowing over interaction. However, we find that Bilibili demonstrates the strongest escape potential. We release the first open multimodal short-video dataset and code to support reproducible auditing of recommender systems.

6. 【2601.22543】SCaLRec: Semantic Calibration for LLM-enabled Cloud-Device Sequential Recommendation

Link: https://arxiv.org/abs/2601.22543

Authors: Ruiqi Zheng, Jinli Cao, Jiao Yin, Hongzhi Yin

Categories: Information Retrieval (cs.IR)

Keywords: device leverages recent, cloud LLM, cloud, collaborative recommendation partitions, recommendation partitions computation

Abstract: Cloud-device collaborative recommendation partitions computation across the cloud and user devices: the cloud provides semantic user modeling, while the device leverages recent interactions and cloud semantic signals for privacy-preserving, responsive reranking. With large language models (LLMs) on the cloud, semantic user representations can improve sequential recommendation by capturing high-level intent. However, regenerating such representations via cloud LLM inference for every request is often infeasible at real-world scale. As a result, on-device reranking commonly reuses a cached cloud semantic user embedding across requests. We empirically identify a cloud semantic staleness effect: reused embeddings become less aligned with the user's latest interactions, leading to measurable ranking degradation. Most existing LLM-enabled cloud-device recommenders are designed around on-demand cloud semantics, either by assuming low-latency cloud LLM access or by regenerating semantic embeddings per request. When per-request regeneration is infeasible and cached semantics must be reused, two technical challenges arise: (1) deciding when cached cloud semantics remain useful for on-device reranking, and (2) maintaining ranking quality when the cloud LLM cannot be invoked and only cached semantics are available. To address this gap, we introduce the Semantic Calibration for LLM-enabled Cloud-Device Recommendation (SCaLRec). First, it estimates the reliability of cached semantics under the user's latest interactions. Second, an on-device semantic calibration module adjusts the cached semantic embedding using up-to-date interaction evidence, without per-request cloud LLM involvement. Experiments on real-world datasets show that SCaLRec consistently improves recommendation performance over strong baselines under cloud semantic staleness.

7. 【2601.22498】FITMM: Adaptive Frequency-Aware Multimodal Recommendation via Information-Theoretic Representation Learning

Link: https://arxiv.org/abs/2601.22498

Authors: Wei Yang, Rui Zhong, Yiqun Chen, Shixuan Li, Heng Ping, Chi Lu, Peng Jiang

Categories: Information Retrieval (cs.IR)

Keywords: enhance user preference, user preference modeling, rich item content, leveraging rich item, Gaussian Information Bottleneck

Abstract: Multimodal recommendation aims to enhance user preference modeling by leveraging rich item content such as images and text. Yet dominant systems fuse modalities in the spatial domain, obscuring the frequency structure of signals and amplifying misalignment and redundancy. We adopt a spectral information-theoretic view and show that, under an orthogonal transform that approximately block-diagonalizes bandwise covariances, the Gaussian Information Bottleneck objective decouples across frequency bands, providing a principled basis for a separate-then-fuse paradigm. Building on this foundation, we propose FITMM, a Frequency-aware Information-Theoretic framework for multimodal recommendation. FITMM constructs graph-enhanced item representations, performs modality-wise spectral decomposition to obtain orthogonal bands, and forms lightweight within-band multimodal components. A residual, task-adaptive gate aggregates bands into the final representation. To control redundancy and improve generalization, we regularize training with a frequency-domain IB term that allocates capacity across bands (Wiener-like shrinkage with shut-off of weak bands). We further introduce a cross-modal spectral consistency loss that aligns modalities within each band. The model is jointly optimized with the standard recommendation loss. Extensive experiments on three real-world datasets demonstrate that FITMM consistently and significantly outperforms advanced baselines.

8. 【2601.22493】Do AI Overviews Benefit Search Engines? An Ecosystem Perspective

Link: https://arxiv.org/abs/2601.22493

Authors: Yihang Wu, Jiajun Tang, Jinfei Liu, Haifeng Xu, Fan Yao

Categories: Computer Science and Game Theory (cs.GT); Information Retrieval (cs.IR)

Keywords: potentially discouraging high-quality, enhances user experience, discouraging high-quality content, high-quality content creation, causing user attrition

Abstract:The integration of AI Overviews into search engines enhances user experience but diverts traffic from content creators, potentially discouraging high-quality content creation and causing user attrition that undermines long-term search engine profit. To address this issue, we propose a game-theoretic model of creator competition with costly effort, characterize equilibrium behavior, and design two incentive mechanisms: a citation mechanism that references sources within an AI Overview, and a compensation mechanism that offers monetary rewards to creators. For both cases, we provide structural insights and near-optimal profit-maximizing mechanisms. Evaluations on real click data show that although AI Overviews harm long-term search engine profit, interventions based on our proposed mechanisms can increase long-term profit across a range of realistic scenarios, pointing toward a more sustainable trajectory for AI-enhanced search ecosystems.

Computer Vision

1. 【2601.23286】VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation

Link: https://arxiv.org/abs/2601.23286

Authors: Hongyang Du, Junjie Ye, Xiaoyan Cong, Runhao Li, Jingcheng Ni, Aman Agarwal, Zeqi Zhou, Zekun Li, Randall Balestriero, Yue Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: produce visually impressive, visually impressive results, recent video diffusion, produce visually, impressive results

Abstract:While recent video diffusion models (VDMs) produce visually impressive results, they fundamentally struggle to maintain 3D structural consistency, often resulting in object deformation or spatial drift. We hypothesize that these failures arise because standard denoising objectives lack explicit incentives for geometric coherence. To address this, we introduce VideoGPA (Video Geometric Preference Alignment), a data-efficient self-supervised framework that leverages a geometry foundation model to automatically derive dense preference signals that guide VDMs via Direct Preference Optimization (DPO). This approach effectively steers the generative distribution toward inherent 3D consistency without requiring human annotations. VideoGPA significantly enhances temporal stability, physical plausibility, and motion coherence using minimal preference pairs, consistently outperforming state-of-the-art baselines in extensive experiments.

2. 【2601.23281】User Prompting Strategies and Prompt Enhancement Methods for Open-Set Object Detection in XR Environments

Link: https://arxiv.org/abs/2601.23281

Authors: Junfeng Lin, Yanming Xiu, Maria Gorlatova

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Open-set object detection, rejecting unknown classes, Open-set object, object detection, localizes objects

Comments: Accepted by IEEE VR 2026: GenAI-XR workshop

Abstract:Open-set object detection (OSOD) localizes objects while identifying and rejecting unknown classes at inference. While recent OSOD models perform well on benchmarks, their behavior under realistic user prompting remains underexplored. In interactive XR settings, user-generated prompts are often ambiguous, underspecified, or overly detailed. To study prompt-conditioned robustness, we evaluate two OSOD models, GroundingDINO and YOLO-E, on real-world XR images and simulate diverse user prompting behaviors using vision-language models. We consider four prompt types: standard, underdetailed, overdetailed, and pragmatically ambiguous, and examine the impact of two enhancement strategies on these prompts. Results show that both models exhibit stable performance under underdetailed and standard prompts, while they suffer degradation under ambiguous prompts. Overdetailed prompts primarily affect GroundingDINO. Prompt enhancement substantially improves robustness under ambiguity, yielding gains exceeding 55% mIoU and 41% average confidence. Based on the findings, we propose several prompting strategies and prompt enhancement methods for OSOD models in XR environments.

3. 【2601.23265】PaperBanana: Automating Academic Illustration for AI Scientists

Link: https://arxiv.org/abs/2601.23265

Authors: Dawei Zhu, Rui Meng, Yale Song, Xiyu Wei, Sujian Li, Tomas Pfister, Jinsung Yoon

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: generating publication-ready illustrations, rapid advances, advances in autonomous, autonomous AI scientists, remains a labor-intensive

Abstract:Despite rapid advances in autonomous AI scientists powered by language models, generating publication-ready illustrations remains a labor-intensive bottleneck in the research workflow. To lift this burden, we introduce PaperBanana, an agentic framework for automated generation of publication-ready academic illustrations. Powered by state-of-the-art VLMs and image generation models, PaperBanana orchestrates specialized agents to retrieve references, plan content and style, render images, and iteratively refine via self-critique. To rigorously evaluate our framework, we introduce PaperBananaBench, comprising 292 test cases for methodology diagrams curated from NeurIPS 2025 publications, covering diverse research domains and illustration styles. Comprehensive experiments demonstrate that PaperBanana consistently outperforms leading baselines in faithfulness, conciseness, readability, and aesthetics. We further show that our method effectively extends to the generation of high-quality statistical plots. Collectively, PaperBanana paves the way for the automated generation of publication-ready illustrations.

4. 【2601.23253】Training-Free Test-Time Adaptation with Brownian Distance Covariance in Vision-Language Models

Link: https://arxiv.org/abs/2601.23253

Authors: Yi Zhang, Chun-Wun Cheng, Angelica I. Aviles-Rivero, Zhihai He, Liang-Jie Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: limiting real-world applicability, Brownian Distance Covariance, limiting real-world, real-world applicability, suffer performance degradation

Comments: Accepted in ICASSP 2026

Abstract: Vision-language models suffer performance degradation under domain shift, limiting real-world applicability. Existing test-time adaptation methods are computationally intensive, rely on back-propagation, and often focus on single modalities. To address these issues, we propose Training-free Test-Time Adaptation with Brownian Distance Covariance (TaTa). TaTa leverages Brownian Distance Covariance, a powerful statistical measure that captures both linear and nonlinear dependencies via pairwise distances, to dynamically adapt VLMs to new domains without training or back-propagation. This not only improves efficiency but also enhances stability by avoiding disruptive weight updates. TaTa further integrates attribute-enhanced prompting to improve vision-language inference with descriptive visual cues. Combined with dynamic clustering and pseudo-label refinement, it effectively recalibrates the model for novel visual contexts. Experiments across diverse datasets show that TaTa significantly reduces computational cost while achieving state-of-the-art performance in domain and cross-dataset generalization.
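
For readers unfamiliar with the statistic, the sketch below computes the standard sample (squared) distance covariance in plain Python: both pairwise distance matrices are double-centered and their element-wise product is averaged. How TaTa applies the measure to VLM features is not described in the abstract, so the toy data and function names here are purely illustrative.

```python
import math

def _centered_distances(xs):
    """Pairwise Euclidean distance matrix, double-centered by row, column, and grand mean."""
    n = len(xs)
    d = [[math.dist(a, b) for b in xs] for a in xs]
    row = [sum(r) / n for r in d]
    grand = sum(row) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]

def distance_covariance(xs, ys):
    """Sample squared distance covariance between paired samples xs and ys."""
    n = len(xs)
    a, b = _centered_distances(xs), _centered_distances(ys)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)

# y depends on x only through a symmetric, nonlinear map, so the Pearson
# correlation between x and y is zero while the distance covariance is not.
xs = [(0.0,), (1.0,), (2.0,), (3.0,)]
ys = [((x[0] - 1.5) ** 2,) for x in xs]
constant = [(1.0,)] * 4

print(distance_covariance(xs, ys))
print(distance_covariance(xs, constant))
```

The first value is positive even though the linear correlation vanishes, which is the "nonlinear dependencies" property the abstract refers to. The second is zero because a constant sample has no pairwise distances to co-vary.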

5. 【2601.23251】Structured Over Scale: Learning Spatial Reasoning from Educational Video

Link: https://arxiv.org/abs/2601.23251

Authors: Bishoy Galoaa, Xiangyu Bai, Sarah Ostadabbas

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrate impressive performance, Vision-language models, including counting, demonstrate impressive, impressive performance

Abstract: Vision-language models (VLMs) demonstrate impressive performance on standard video understanding benchmarks yet fail systematically on simple reasoning tasks that preschool children can solve, including counting, spatial reasoning, and compositional understanding. We hypothesize that the pedagogically-structured content of educational videos provides an ideal training signal for improving these capabilities. We introduce DoraVQA, a dataset of 5,344 question-answer pairs automatically extracted from 8 seasons of Dora the Explorer with precise timestamp alignment. Each episode follows a consistent context-question-pause-answer structure that creates a self-contained learning environment analogous to interactive tutoring. We fine-tune both Qwen2 and Qwen3 using Group Relative Policy Optimization (GRPO), leveraging the clear correctness signals and structured reasoning traces inherent in educational content. Despite training exclusively on 38 hours of children's educational videos, our approach achieves improvements of 8-14 points on DoraVQA and state-of-the-art 86.16% on CVBench, with strong transfer to Video-MME and NExT-QA, demonstrating effective generalization from narrow pedagogical content to broad multimodal understanding. Through cross-domain benchmarks, we show that VLMs can perform tasks that require robust reasoning learned from structured educational content, suggesting that content structure matters as much as content scale.

6. 【2601.23232】ShotFinder: Imagination-Driven Open-Domain Video Shot Retrieval via Web Search

Link: https://arxiv.org/abs/2601.23232

Authors: Tao Yu, Haopeng Jin, Hao Wang, Shenghua Chai, Yujia Yang, Junhao Gong, Jiaming Guo, Minghui Zhang, Xinlong Chen, Zhenghao Zhang, Yuxuan Zhou, Yanpei Gong, YuanCheng Liu, Yiming Ding, Kangwei Zeng, Pengfei Yang, Zhongtian Luo, Yufei Xiong, Shanbin Zhang, Shaoxiong Cheng, Huang Ruilin, Li Shuo, Yuxi Niu, Xinyuan Zhang, Yueya Xu, Jie Mao, Ruixuan Ji, Yaru Zhao, Mingchen Zhang, Jiabing Yang, Jiaqi Liu, YiFan Zhang, Hongzhu Yi, Xinming Wang, Cheng Zhong, Xiao Ma, Zhang Zhang, Yan Huang, Liang Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: made rapid progress, static multimodal settings, recent years, made rapid, rapid progress

Comments: 28 pages, 7 figures

Abstract:In recent years, large language models (LLMs) have made rapid progress in information retrieval, yet existing research has mainly focused on text or static multimodal settings. Open-domain video shot retrieval, which involves richer temporal structure and more complex semantics, still lacks systematic benchmarks and analysis. To fill this gap, we introduce ShotFinder, a benchmark that formalizes editing requirements as keyframe-oriented shot descriptions and introduces five types of controllable single-factor constraints: Temporal order, Color, Visual style, Audio, and Resolution. We curate 1,210 high-quality samples from YouTube across 20 thematic categories, using large models for generation with human verification. Based on the benchmark, we propose ShotFinder, a text-driven three-stage retrieval and localization pipeline: (1) query expansion via video imagination, (2) candidate video retrieval with a search engine, and (3) description-guided temporal localization. Experiments on multiple closed-source and open-source models reveal a significant gap to human performance, with clear imbalance across constraints: temporal localization is relatively tractable, while color and visual style remain major challenges. These results reveal that open-domain video shot retrieval is still a critical capability that multimodal large models have yet to overcome.

7. 【2601.23224】Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning

Link: https://arxiv.org/abs/2601.23224

Authors: Xiangyu Zeng, Zhiqiu Zhang, Yuhan Zhu, Xinhao Li, Zikang Wang, Changlian Ma, Qingyu Zhang, Zizheng Huang, Kun Ouyang, Tianxiang Jiang, Ziang Yan, Yi Wang, Hongjie Zhang, Yali Wang, Limin Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing multimodal large, multimodal large language, large language models, understanding predominantly rely, critical evidence amid

Comments: 24 pages, 15 figures, 11 tables

Abstract:Existing multimodal large language models for long-video understanding predominantly rely on uniform sampling and single-turn inference, limiting their ability to identify sparse yet critical evidence amid extensive redundancy. We introduce Video-o3, a novel framework that supports iterative discovery of salient visual clues, fine-grained inspection of key segments, and adaptive termination once sufficient evidence is acquired. Technically, we address two core challenges in interleaved tool invocation. First, to mitigate attention dispersion induced by the heterogeneity of reasoning and tool-calling, we propose Task-Decoupled Attention Masking, which isolates per-step concentration while preserving shared global context. Second, to control context length growth in multi-turn interactions, we introduce a Verifiable Trajectory-Guided Reward that balances exploration coverage with reasoning efficiency. To support training at scale, we further develop a data synthesis pipeline and construct Seeker-173K, comprising 173K high-quality tool-interaction trajectories for effective supervised and reinforcement learning. Extensive experiments show that Video-o3 substantially outperforms state-of-the-art methods, achieving 72.1% accuracy on MLVU and 46.5% on Video-Holmes. These results demonstrate Video-o3's strong multi-hop evidence-seeking and reasoning capabilities, and validate the effectiveness of native tool invocation in long-video scenarios.

8. 【2601.23222】Region-Normalized DPO for Medical Image Segmentation under Noisy Judges

Link: https://arxiv.org/abs/2601.23222

Authors: Hamza Kalisch, Constantin Seibold, Jens Kleesiek, Ken Herrmann, Frederic Jonske

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: dense pixel-wise annotations, pixel-wise annotations remain, limit scalability, dense pixel-wise, remain the gold

Abstract:While dense pixel-wise annotations remain the gold standard for medical image segmentation, they are costly to obtain and limit scalability. In contrast, many deployed systems already produce inexpensive automatic quality-control (QC) signals like model agreement, uncertainty measures, or learned mask-quality scores which can be used for further model training without additional ground-truth annotation. However, these signals can be noisy and biased, making preference-based fine-tuning susceptible to harmful updates. We study Direct Preference Optimization (DPO) for segmentation from such noisy judges using proposals generated by a supervised base segmenter trained on a small labeled set. We find that outcomes depend strongly on how preference pairs are mined: selecting the judge's top-ranked proposal can improve peak performance when the judge is reliable, but can amplify harmful errors under weaker judges. We propose Region-Normalized DPO (RN-DPO), a segmentation-aware objective which normalizes preference updates by the size of the disagreement region between masks, reducing the leverage of harmful comparisons and improving optimization stability. Across two medical datasets and multiple regimes, RN-DPO improves sustained performance and stabilizes preference-based fine-tuning, outperforming standard DPO and strong baselines without requiring additional pixel annotations.
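
The abstract does not spell out the RN-DPO objective, so the following Python sketch is only one plausible reading, stated as an assumption. If a mask's log-likelihood factorizes over pixels, every pixel where the preferred and rejected masks agree cancels in the DPO margin, leaving a sum over the disagreement region whose magnitude grows with the region's size. Dividing by that size gives a per-pixel average margin. All names and numbers are hypothetical.

```python
import math

def region_normalized_margin(logp_policy, logp_ref, mask_w, mask_l):
    """Average per-pixel preference margin over the region where the masks disagree.

    logp_*[i] is a (log P(label 0), log P(label 1)) pair for pixel i.
    mask_w / mask_l are the preferred / rejected binary masks as flat lists.
    """
    region = [i for i, (w, l) in enumerate(zip(mask_w, mask_l)) if w != l]
    if not region:
        return 0.0  # identical masks carry no preference signal
    total = 0.0
    for i in region:
        w, l = mask_w[i], mask_l[i]
        policy_gap = logp_policy[i][w] - logp_policy[i][l]
        ref_gap = logp_ref[i][w] - logp_ref[i][l]
        total += policy_gap - ref_gap
    return total / len(region)

def dpo_loss(margin, beta=1.0):
    """Standard DPO form: -log(sigmoid(beta * margin))."""
    return math.log1p(math.exp(-beta * margin))

def log_pair(p_foreground):
    """(log P(label 0), log P(label 1)) for one pixel."""
    return (math.log(1.0 - p_foreground), math.log(p_foreground))

# Hypothetical four-pixel example; the reference model is uniform.
policy = [log_pair(p) for p in (0.9, 0.8, 0.3, 0.5)]
reference = [log_pair(0.5)] * 4
preferred = [1, 1, 0, 0]
rejected = [1, 0, 1, 0]

margin = region_normalized_margin(policy, reference, preferred, rejected)
print(margin, dpo_loss(margin))
```

Under this reading, a comparison between two masks that differ over a large region no longer produces a proportionally larger update than one that differs in a few pixels.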

9. 【2601.23220】Med-Scout: Curing MLLMs' Geometric Blindness in Medical Perception via Geometry-Aware RL Post-Training

Link: https://arxiv.org/abs/2601.23220

Authors: Anglin Liu, Ruichao Chen, Yi Lu, Hongxia Xu, Jintai Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Multimodal Large Language, recent Multimodal Large, Language Models, Multimodal Large

Abstract:Despite recent Multimodal Large Language Models (MLLMs)' linguistic prowess in medical diagnosis, we find even state-of-the-art MLLMs suffer from a critical perceptual deficit: geometric blindness. This failure to ground outputs in objective geometric constraints leads to plausible yet factually incorrect hallucinations, rooted in training paradigms that prioritize linguistic fluency over geometric fidelity. This paper introduces Med-Scout, a novel framework that "cures" this blindness via Reinforcement Learning (RL) that leverages the intrinsic geometric logic latent within unlabeled medical images. Instead of relying on costly expert annotations, Med-Scout derives verifiable supervision signals through three strategic proxy tasks: Hierarchical Scale Localization, Topological Jigsaw Reconstruction, and Anomaly Consistency Detection. To rigorously quantify this deficit, we present Med-Scout-Bench, a new benchmark specifically designed to evaluate geometric perception. Extensive evaluations show that Med-Scout significantly mitigates geometric blindness, outperforming leading proprietary and open-source MLLMs by over 40% on our benchmark. Furthermore, this enhanced geometric perception generalizes to broader medical understanding, achieving superior results on radiological and comprehensive medical VQA tasks.

10. 【2601.23167】Hi-Light: A Path to high-fidelity, high-resolution video relighting with a Novel Evaluation Paradigm

Link: https://arxiv.org/abs/2601.23167

Authors: Xiangrui Liu, Haoxiang Li, Yezhou Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: offers immense creative, immense creative potential, severe light flickering, relighting offers immense, Video relighting offers

Abstract:Video relighting offers immense creative potential and commercial value but is hindered by challenges, including the absence of an adequate evaluation metric, severe light flickering, and the degradation of fine-grained details during editing. To overcome these challenges, we introduce Hi-Light, a novel, training-free framework for high-fidelity, high-resolution, robust video relighting. Our approach introduces three technical innovations: lightness prior anchored guided relighting diffusion that stabilises intermediate relit video, a Hybrid Motion-Adaptive Lighting Smoothing Filter that leverages optical flow to ensure temporal stability without introducing motion blur, and a LAB-based Detail Fusion module that preserves high-frequency detail information from the original video. Furthermore, to address the critical gap in evaluation, we propose the Light Stability Score, the first quantitative metric designed to specifically measure lighting consistency. Extensive experiments demonstrate that Hi-Light significantly outperforms state-of-the-art methods in both qualitative and quantitative comparisons, producing stable, highly detailed relit videos.

11. 【2601.23159】Segment Any Events with Language

Link: https://arxiv.org/abs/2601.23159

Authors: Seungjun Lee, Gim Hee Lee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: point clouds, Scene understanding, free-form language, widely explored, explored within diverse

Comments: ICLR 2026. Project page: https://0nandon.github.io/SEAL

Abstract: Scene understanding with free-form language has been widely explored within diverse modalities such as images, point clouds, and LiDAR. However, related studies on event sensors are scarce or narrowly centered on semantic-level understanding. We introduce SEAL, the first Semantic-aware Segment Any Events framework that addresses Open-Vocabulary Event Instance Segmentation (OV-EIS). Given the visual prompt, our model presents a unified framework to support both event segmentation and open-vocabulary mask classification at multiple levels of granularity, including instance-level and part-level. To enable thorough evaluation on OV-EIS, we curate four benchmarks that cover label granularity from coarse to fine class configurations and semantic granularity from instance-level to part-level understanding. Extensive experiments show that our SEAL largely outperforms proposed baselines in terms of performance and inference speed with a parameter-efficient architecture. In the Appendix, we further present a simple variant of our SEAL achieving generic spatiotemporal OV-EIS that does not require any visual prompts from users at inference time. Project page: https://0nandon.github.io/SEAL

12. 【2601.23107】FlowCalib: LiDAR-to-Vehicle Miscalibration Detection using Scene Flows

Link: https://arxiv.org/abs/2601.23107

Authors: Ilir Tahiraj, Peter Wittal, Markus Lienkamp

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: safe autonomous driving, calibration is essential, essential for safe, autonomous driving, safe autonomous

Comments:

Click to view abstract

Abstract:Accurate sensor-to-vehicle calibration is essential for safe autonomous driving. Angular misalignments of LiDAR sensors can lead to safety-critical issues during autonomous operation. However, current methods primarily focus on correcting sensor-to-sensor errors without considering the miscalibration of individual sensors that cause these errors in the first place. We introduce FlowCalib, the first framework that detects LiDAR-to-vehicle miscalibration using motion cues from the scene flow of static objects. Our approach leverages the systematic bias induced by rotational misalignment in the flow field generated from sequential 3D point clouds, eliminating the need for additional sensors. The architecture integrates a neural scene flow prior for flow estimation and incorporates a dual-branch detection network that fuses learned global flow features with handcrafted geometric descriptors. These combined representations allow the system to perform two complementary binary classification tasks: a global binary decision indicating whether misalignment is present and separate, axis-specific binary decisions indicating whether each rotational axis is misaligned. Experiments on the nuScenes dataset demonstrate FlowCalib's ability to robustly detect miscalibration, establishing a benchmark for sensor-to-vehicle miscalibration detection.

13. 【2601.23102】Rethinking Transferable Adversarial Attacks on Point Clouds from a Compact Subspace Perspective

Link: https://arxiv.org/abs/2601.23102

Authors: Keke Tang, Xianheng Liu, Weilong Peng, Xiaofei Wang, Daizong Liu, Peican Zhu, Can Lu, Zhihong Tian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: clouds remain challenging, remain challenging, point clouds remain, existing methods, methods often rely

Comments:

Click to view abstract

Abstract:Transferable adversarial attacks on point clouds remain challenging, as existing methods often rely on model-specific gradients or heuristics that limit generalization to unseen architectures. In this paper, we rethink adversarial transferability from a compact subspace perspective and propose CoSA, a transferable attack framework that operates within a shared low-dimensional semantic space. Specifically, each point cloud is represented as a compact combination of class-specific prototypes that capture shared semantic structure, while adversarial perturbations are optimized within a low-rank subspace to induce coherent and architecture-agnostic variations. This design suppresses model-dependent noise and constrains perturbations to semantically meaningful directions, thereby improving cross-model transferability without relying on surrogate-specific artifacts. Extensive experiments on multiple datasets and network architectures demonstrate that CoSA consistently outperforms state-of-the-art transferable attacks, while maintaining competitive imperceptibility and robustness under common defense strategies. Codes will be made public upon paper acceptance.

14. 【2601.23065】EAG-PT: Emission-Aware Gaussians and Path Tracing for Indoor Scene Reconstruction and Editing

Link: https://arxiv.org/abs/2601.23065

Authors: Xijie Yang, Mulin Yu, Changjian Jiang, Kerui Ren, Tao Lu, Jiangmiao Pang, Dahua Lin, Bo Dai, Linning Xu

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent reconstruction methods, high visual fidelity, explicit light transport, reproduce indoor scenes, path tracing

Comments: project page: [this https URL](https://eag-pt.github.io)

Click to view abstract

Abstract:Recent reconstruction methods based on radiance field such as NeRF and 3DGS reproduce indoor scenes with high visual fidelity, but break down under scene editing due to baked illumination and the lack of explicit light transport. In contrast, physically based inverse rendering relies on mesh representations and path tracing, which enforce correct light transport but place strong requirements on geometric fidelity, becoming a practical bottleneck for real indoor scenes. In this work, we propose Emission-Aware Gaussians and Path Tracing (EAG-PT), aiming for physically based light transport with a unified 2D Gaussian representation. Our design is based on three cores: (1) using 2D Gaussians as a unified scene representation and transport-friendly geometry proxy that avoids reconstructed mesh, (2) explicitly separating emissive and non-emissive components during reconstruction for further scene editing, and (3) decoupling reconstruction from final rendering by using efficient single-bounce optimization and high-quality multi-bounce path tracing after scene editing. Experiments on synthetic and real indoor scenes show that EAG-PT produces more natural and physically consistent renders after editing than radiant scene reconstructions, while preserving finer geometric detail and avoiding mesh-induced artifacts compared to mesh-based inverse path tracing. These results suggest promising directions for future use in interior design, XR content creation, and embodied AI.

15. 【2601.23064】HierLoc: Hyperbolic Entity Embeddings for Hierarchical Visual Geolocation

Link: https://arxiv.org/abs/2601.23064

Authors: Hari Krishna Gadi, Daniel Matos, Hongyi Luo, Lu Liu, Yongliang Wang, Yanfeng Zhang, Liqiu Meng

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: remains challenging due, Visual geolocalization, visual ambiguity, inherently hierarchical structure, remains challenging

Comments:

Click to view abstract

Abstract:Visual geolocalization, the task of predicting where an image was taken, remains challenging due to global scale, visual ambiguity, and the inherently hierarchical structure of geography. Existing paradigms rely on either large-scale retrieval, which requires storing a large number of image embeddings, grid-based classifiers that ignore geographic continuity, or generative models that diffuse over space but struggle with fine detail. We introduce an entity-centric formulation of geolocation that replaces image-to-image retrieval with a compact hierarchy of geographic entities embedded in Hyperbolic space. Images are aligned directly to country, region, subregion, and city entities through Geo-Weighted Hyperbolic contrastive learning by directly incorporating haversine distance into the contrastive objective. This hierarchical design enables interpretable predictions and efficient inference with 240k entity embeddings instead of over 5 million image embeddings on the OSV5M benchmark, on which our method establishes a new state-of-the-art performance. Compared to the current methods in the literature, it reduces mean geodesic error by 19.5%, while improving the fine-grained subregion accuracy by 43%. These results demonstrate that geometry-aware hierarchical embeddings provide a scalable and conceptually new alternative for global image geolocation.
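
For readers unfamiliar with the haversine distance mentioned in this abstract, the sketch below computes it for two latitude/longitude pairs. The `geo_weight` helper is purely hypothetical: it shows one way a distance could be turned into a weight for a contrastive term, and is not the weighting used in the paper. The Earth radius and the `scale_km` parameter are assumed values.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def geo_weight(distance_km, scale_km=1000.0):
    """Hypothetical weight in [0, 1): grows as a negative sample gets farther away."""
    return 1.0 - math.exp(-distance_km / scale_km)
```

Under this assumed weighting, a negative entity next door to the true location receives a weight near 0, while one on another continent receives a weight near 1.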

16. 【2601.23041】One-shot Optimized Steering Vector for Hallucination Mitigation for VLMs

Link: https://arxiv.org/abs/2601.23041

Authors: Youxu Shi, Suorong Yang, Dong Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Language Models, Vision Language, achieve strong performance, Language Models, safety-related failures

Comments:

Click to view abstract

Abstract:Vision Language Models (VLMs) achieve strong performance on multimodal tasks but still suffer from hallucination and safety-related failures that persist even at scale. Steering offers a lightweight technique to improve model performance. However, steering, whether input-dependent or input-independent, struggles to achieve a meaningful trade-off between efficiency and effectiveness. In this work, we observe that steering vectors can generalize across inputs when tasks share aligned semantic intent. Based on this insight, we propose OSGA (One-shot Steering with Generative Anchor), an input-independent framework that improves model performance with a single optimization instance. OSGA first selects an informative sample via a variance-based data selection strategy and learns a single steering vector with a contrastive objective with generative anchor regularization. The resulting vector can be universally applied at a certain layer during inference time without modifying model parameters. Experiments across multiple benchmarks show that a single OSGA-optimized steering vector consistently improves hallucination mitigation and safety enhancement with negligible overhead, highlighting one-shot steering as a practical and scalable solution for reliable VLMs.

17. 【2601.23007】Leveraging Multi-Rater Annotations to Calibrate Object Detectors in Microscopy Imaging

Link: https://arxiv.org/abs/2601.23007

Authors: Francesco Campi, Lucrezia Tondo, Ekin Karabati, Johannes Betge, Marie Piraud

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Deep learning-based object, Deep learning-based, achieved impressive performance, limiting their reliability, achieved impressive

Comments: Accepted as a conference paper at ISBI 2026

Click to view abstract

Abstract:Deep learning-based object detectors have achieved impressive performance in microscopy imaging, yet their confidence estimates often lack calibration, limiting their reliability for biomedical applications. In this work, we introduce a new approach to improve model calibration by leveraging multi-rater annotations. We propose to train separate models on the annotations from single experts and aggregate their predictions to emulate consensus. This improves upon label sampling strategies, where models are trained on mixed annotations, and offers a more principled way to capture inter-rater variability. Experiments on a colorectal organoid dataset annotated by two experts demonstrate that our rater-specific ensemble strategy improves calibration performance while maintaining comparable detection accuracy. These findings suggest that explicitly modelling rater disagreement can lead to more trustworthy object detectors in biomedical imaging.

18. 【2601.22990】Self-Supervised Slice-to-Volume Reconstruction with Gaussian Representations for Fetal MRI

Link: https://arxiv.org/abs/2601.22990

Authors: Yinsong Wang, Thomas Fletcher, Xinzhe Luo, Aine Travers Dineen, Rhodri Cusack, Chen Qin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: challenging task, crucial and challenging, Reconstructing, motion-corrupted stacks, reconstruction

Comments:

Click to view abstract

Abstract:Reconstructing 3D fetal MR volumes from motion-corrupted stacks of 2D slices is a crucial and challenging task. Conventional slice-to-volume reconstruction (SVR) methods are time-consuming and require multiple orthogonal stacks for reconstruction. While learning-based SVR approaches have significantly reduced the time required at the inference stage, they heavily rely on ground truth information for training, which is inaccessible in practice. To address these challenges, we propose GaussianSVR, a self-supervised framework for slice-to-volume reconstruction. GaussianSVR represents the target volume using 3D Gaussian representations to achieve high-fidelity reconstruction. It leverages a simulated forward slice acquisition model to enable self-supervised training, alleviating the need for ground-truth volumes. Furthermore, to enhance both accuracy and efficiency, we introduce a multi-resolution training strategy that jointly optimizes Gaussian parameters and spatial transformations across different resolution levels. Experiments show that GaussianSVR outperforms the baseline methods on fetal MR volumetric reconstruction. Code will be available upon acceptance.

19. 【2601.22982】About an Automating Annotation Method for Robot Markers

Link: https://arxiv.org/abs/2601.22982

Authors: Wataru Uemura, Takeru Nagashima

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: increasingly important due, Factory automation, autonomous mobile robots, labor shortages, material transportation

Comments:

Click to view abstract

Abstract:Factory automation has become increasingly important due to labor shortages, leading to the introduction of autonomous mobile robots for tasks such as material transportation. Markers are commonly used for robot self-localization and object identification. In the RoboCup Logistics League (RCLL), ArUco markers are employed both for robot localization and for identifying processing modules. Conventional recognition relies on OpenCV-based image processing, which detects black-and-white marker patterns. However, these methods often fail under noise, motion blur, defocus, or varying illumination conditions. Deep-learning-based recognition offers improved robustness under such conditions, but requires large amounts of annotated data. Annotation must typically be done manually, as the type and position of objects cannot be detected automatically, making dataset preparation a major bottleneck. In contrast, ArUco markers include built-in recognition modules that provide both ID and positional information, enabling automatic annotation. This paper proposes an automated annotation method for training deep-learning models on ArUco marker images. By leveraging marker detection results obtained from the ArUco module, the proposed approach eliminates the need for manual labeling. A YOLO-based model is trained using the automatically annotated dataset, and its performance is evaluated under various conditions. Experimental results demonstrate that the proposed method improves recognition performance compared with conventional image-processing techniques, particularly for images affected by blur or defocus. Automatic annotation also reduces human effort and ensures consistent labeling quality. Future work will investigate the relationship between confidence thresholds and recognition performance.
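
The automatic annotation idea in this abstract comes down to turning each detected marker's corner points into a training label. Below is a minimal sketch, assuming the detector reports four pixel-coordinate corners per marker and that labels follow the common YOLO text format (class index, normalized box centre, normalized width and height). The function name `corners_to_yolo` and the use of the marker ID as the class index are illustrative assumptions, not the authors' code.

```python
def corners_to_yolo(marker_id, corners, img_w, img_h):
    """Convert one marker detection to a YOLO-format label line.

    corners: four (x, y) pixel coordinates, as a marker detector would report.
    Using the marker ID as the class index is an assumption for illustration.
    """
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    # Axis-aligned bounding box, normalized to [0, 1] by the image size.
    cx = (x_min + x_max) / 2 / img_w
    cy = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{marker_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"
```

Writing one such line per detected marker into a text file next to each image would give a labelled dataset without manual work, provided the detector succeeds on the frames used for labelling.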

20. 【2601.22961】Improving Supervised Machine Learning Performance in Optical Quality Control via Generative AI for Dataset Expansion

Link: https://arxiv.org/abs/2601.22961

Authors: Dennis Sprute, Hanna Senke, Holger Flatt

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: optical quality control, learning algorithms play, algorithms play, play a crucial, crucial role

Comments: Accepted at 19th CIRP Conference on Intelligent Computation in Manufacturing Engineering

Click to view abstract

Abstract:Supervised machine learning algorithms play a crucial role in optical quality control within industrial production. These approaches require representative datasets for effective model training. However, while non-defective components are frequent, defective parts are rare in production, resulting in highly imbalanced datasets that adversely impact model performance. Existing strategies to address this challenge, such as specialized loss functions or traditional data augmentation techniques, have limitations, including the need for careful hyperparameter tuning or the alteration of only simple image features. Therefore, this work explores the potential of generative artificial intelligence (GenAI) as an alternative method for expanding limited datasets and enhancing supervised machine learning performance. Specifically, we investigate Stable Diffusion and CycleGAN as image generation models, focusing on the segmentation of combine harvester components in thermal images for subsequent defect detection. Our results demonstrate that dataset expansion using Stable Diffusion yields the most significant improvement, enhancing segmentation performance by 4.6 %, resulting in a Mean Intersection over Union (Mean IoU) of 84.6 %.
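
The Mean IoU metric quoted in this abstract can be sketched as follows. This version takes flat lists of per-pixel class labels and averages over classes that occur in either the prediction or the target; how the paper handles absent classes is not stated, so that choice is an assumption.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union for flat lists of per-pixel class labels.

    Classes absent from both prediction and target are skipped (an assumed
    convention; other implementations count them as 0 or 1).
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

For example, with `pred = [0, 0, 1, 1]` and `target = [0, 1, 1, 1]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.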

21. 【2601.22959】Triage: Hierarchical Visual Budgeting for Efficient Video Reasoning in Vision-Language Models

Link: https://arxiv.org/abs/2601.22959

Authors: Anmin Wang, Nan Zhang, Wei Tao, Xiaoyang Qu, Guokuan Li, Jiguang Wan, Jianzong Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: face significant computational, massive data redundancy, significant computational challenges, creates prohibitively long, Vision-Language Models

Comments: Accepted to 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026)

Click to view abstract

Abstract:Vision-Language Models (VLMs) face significant computational challenges in video processing due to massive data redundancy, which creates prohibitively long token sequences. To address this, we introduce Triage, a training-free, plug-and-play framework that reframes video reasoning as a resource allocation problem via hierarchical visual budgeting. Its first stage, Frame-Level Budgeting, identifies keyframes by evaluating their visual dynamics and relevance, generating a strategic prior based on their importance scores. Guided by this prior, the second stage, Token-Level Budgeting, allocates tokens in two phases: it first secures high-relevance Core Tokens, followed by diverse Context Tokens selected with an efficient batched Maximal Marginal Relevance (MMR) algorithm. Extensive experiments demonstrate that Triage improves inference speed and reduces memory footprint, while maintaining or surpassing the performance of baselines and other methods on various video reasoning benchmarks.
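
Maximal Marginal Relevance, which this abstract uses to pick diverse Context Tokens, can be illustrated with the textbook greedy procedure below. This is the plain sequential form over small Python lists, not the batched variant the paper describes; the cosine similarity measure and the trade-off parameter `lam` are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mmr_select(query, candidates, k, lam=0.5):
    """Greedy MMR: indices of up to k candidates, trading relevance for diversity."""
    relevance = [cosine(query, c) for c in candidates]
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=1.0` the procedure reduces to plain top-k by relevance; lowering `lam` makes it prefer candidates that differ from those already selected.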

22. 【2601.22929】Semantic Leakage from Image Embeddings

Link: https://arxiv.org/abs/2601.22929

Authors: Yiyi Chen, Qiongkai Xu, Desmond Eliott, Qiongxiu Li, Johannes Bjerva

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Cryptography and Security (cs.CR)

Keywords: Image embeddings, limited privacy risk, Image, semantic, pose limited privacy

Comments: 20 pages, 19 figures

Click to view abstract

Abstract:Image embeddings are generally assumed to pose limited privacy risk. We challenge this assumption by formalizing semantic leakage as the ability to recover semantic structures from compressed image embeddings. Surprisingly, we show that semantic leakage does not require exact reconstruction of the original image. Preserving local semantic neighborhoods under embedding alignment is sufficient to expose the intrinsic vulnerability of image embeddings. Crucially, this preserved neighborhood structure allows semantic information to propagate through a sequence of lossy mappings. Based on this conjecture, we propose Semantic Leakage from Image Embeddings (SLImE), a lightweight inference framework that reveals semantic information from standalone compressed image embeddings, incorporating a locally trained semantic retriever with off-the-shelf models, without training task-specific decoders. We thoroughly validate each step of the framework empirically, from aligned embeddings to retrieved tags, symbolic representations, and grammatical and coherent descriptions. We evaluate SLImE across a range of open and closed embedding models, including GEMINI, COHERE, NOMIC, and CLIP, and demonstrate consistent recovery of semantic information across diverse inference tasks. Our results reveal a fundamental vulnerability in image embeddings, whereby the preservation of semantic neighborhoods under alignment enables semantic leakage, highlighting challenges for privacy preservation.

23. 【2601.22920】Q-Hawkeye: Reliable Visual Policy Optimization for Image Quality Assessment

Link: https://arxiv.org/abs/2601.22920

Authors: Wulin Xie, Rui Dai, Ruidong Ding, Kaikui Liu, Xiangxiang Chu, Xinwen Hou, Jie Wen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Image Quality Assessment, Quality Assessment, quality scores consistent, predicts perceptual quality, Recent RL-based IQA

Comments:

Click to view abstract

Abstract:Image Quality Assessment (IQA) predicts perceptual quality scores consistent with human judgments. Recent RL-based IQA methods built on MLLMs focus on generating visual quality descriptions and scores, ignoring two key reliability limitations: (i) although the model's prediction stability varies significantly across training samples, existing GRPO-based methods apply uniform advantage weighting, thereby amplifying noisy signals from unstable samples in gradient updates; (ii) most works emphasize text-grounded reasoning over images while overlooking the model's visual perception ability of image content. In this paper, we propose Q-Hawkeye, an RL-based reliable visual policy optimization framework that redesigns the learning signal through unified Uncertainty-Aware Dynamic Optimization and Perception-Aware Optimization. Q-Hawkeye estimates predictive uncertainty using the variance of predicted scores across multiple rollouts and leverages this uncertainty to reweight each sample's update strength, stabilizing policy optimization. To strengthen perceptual reliability, we construct paired inputs of degraded images and their original images and introduce an Implicit Perception Loss that constrains the model to ground its quality judgments in genuine visual evidence. Extensive experiments demonstrate that Q-Hawkeye outperforms state-of-the-art methods and generalizes better across multiple datasets. The code and models will be made available.
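
The uncertainty-aware reweighting described in this abstract can be pictured with the toy sketch below: the variance of the scores predicted across several rollouts of the same sample is mapped to a weight that shrinks as the variance grows. The mapping `1 / (1 + var / temperature)` and the `temperature` parameter are assumptions chosen for simplicity; the abstract does not give the paper's actual formula.

```python
from statistics import pvariance


def uncertainty_weights(rollout_scores, temperature=1.0):
    """Map per-sample score variance across rollouts to a weight in (0, 1].

    rollout_scores: one list of predicted quality scores per training sample.
    The inverse-variance style mapping is an illustrative assumption.
    """
    weights = []
    for scores in rollout_scores:
        var = pvariance(scores)  # population variance over this sample's rollouts
        weights.append(1.0 / (1.0 + var / temperature))
    return weights
```

A sample whose rollouts all agree keeps weight 1.0, while a sample with scattered predictions contributes less to the update.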

24. 【2601.22917】Deep in the Jungle: Towards Automating Chimpanzee Population Estimation

Link: https://arxiv.org/abs/2601.22917

Authors: Tom Raynes, Otto Brookes, Timm Haucke, Lukas Bösch, Anne-Sophie Crunchant, Hjalmar Kühl, Sara Beery, Majid Mirmehdi, Tilo Burghardt

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: camera trap, great apes relies, frameworks that require, Dense Prediction Transformers, relies on statistical

Comments:

Click to view abstract

Abstract:The estimation of abundance and density in unmarked populations of great apes relies on statistical frameworks that require animal-to-camera distance measurements. In practice, acquiring these distances depends on labour-intensive manual interpretation of animal observations across large camera trap video corpora. This study introduces and evaluates an only sparsely explored alternative: the integration of computer vision-based monocular depth estimation (MDE) pipelines directly into ecological camera trap workflows for great ape conservation. Using a real-world dataset of 220 camera trap videos documenting a wild chimpanzee population, we combine two MDE models, Dense Prediction Transformers and Depth Anything, with multiple distance sampling strategies. These components are used to generate detection distance estimates, from which population density and abundance are inferred. Comparative analysis against manually derived ground-truth distances shows that calibrated DPT consistently outperforms Depth Anything. This advantage is observed in both distance estimation accuracy and downstream density and abundance inference. Nevertheless, both models exhibit systematic biases. We show that, given complex forest environments, they tend to overestimate detection distances and consequently underestimate density and abundance relative to conventional manual approaches. We further find that failures in animal detection across distance ranges are a primary factor limiting estimation accuracy. Overall, this work provides a case study that shows MDE-driven camera trap distance sampling is a viable and practical alternative to manual distance estimation. The proposed approach yields population estimates within 22% of those obtained using traditional methods.

25. 【2601.22913】Multi-Cue Anomaly Detection and Localization under Data Contamination

Link: https://arxiv.org/abs/2601.22913

Authors: Anindya Sundar Das, Monowar Bhuyan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: industrial settings faces, real-world industrial settings, major limitations, industrial settings, settings faces

Comments: 12 pages total (10 pages main text + references), 6 figures. Preprint version; the final camera-ready version may differ

Click to view abstract

Abstract:Visual anomaly detection in real-world industrial settings faces two major limitations. First, most existing methods are trained on purely normal data or on unlabeled datasets assumed to be predominantly normal, presuming the absence of contamination, an assumption that is rarely satisfied in practice. Second, they assume no access to labeled anomaly samples, limiting the model from learning discriminative characteristics of true anomalies. Therefore, these approaches often struggle to distinguish anomalies from normal instances, resulting in reduced detection and weak localization performance. In real-world applications, where training data are frequently contaminated with anomalies, such methods fail to deliver reliable performance. In this work, we propose a robust anomaly detection framework that integrates limited anomaly supervision into the adaptive deviation learning paradigm. We introduce a composite anomaly score that combines three complementary components: a deviation score capturing statistical irregularity, an entropy-based uncertainty score reflecting predictive inconsistency, and a segmentation-based score highlighting spatial abnormality. This unified scoring mechanism enables accurate detection and supports gradient-based localization, providing intuitive and explainable visual evidence of anomalous regions. Following the few-anomaly paradigm, we incorporate a small set of labeled anomalies during training while simultaneously mitigating the influence of contaminated samples through adaptive instance weighting. Extensive experiments on the MVTec and VisA benchmarks demonstrate that our framework outperforms state-of-the-art baselines and achieves strong detection and localization performance, interpretability, and robustness under various levels of data contamination.
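
To make the three-part composite score in this abstract concrete, the sketch below combines a deviation score, an entropy-based uncertainty score, and a segmentation-based score as a weighted sum. The weighted-sum form, the use of binary entropy in bits, and the choice of the segmentation map's maximum as its score are all assumptions; the abstract names the three components but not how they are combined.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli prediction; 0 when the model is certain."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def composite_anomaly_score(deviation, anomaly_prob, seg_map, weights=(1.0, 1.0, 1.0)):
    """Hypothetical weighted sum of the three cues named in the abstract.

    deviation: scalar deviation score (e.g. distance from normal statistics).
    anomaly_prob: predicted probability that the image is anomalous.
    seg_map: 2D list of per-pixel anomaly scores; its maximum is used here.
    """
    w_dev, w_ent, w_seg = weights
    seg_score = max(max(row) for row in seg_map)
    return w_dev * deviation + w_ent * binary_entropy(anomaly_prob) + w_seg * seg_score
```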

26. 【2601.22904】DINO-SAE: DINO Spherical Autoencoder for High-Fidelity Image Reconstruction and Generation

Link: https://arxiv.org/abs/2601.22904

Authors: Hun Chang, Byunghee Cha, Jong Chul Ye

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Vision Foundation Models, Recent studies, showing strong generative, strong generative performance, pretrained Vision Foundation

Comments: 17 pages, and 11 figures

Click to view abstract

Abstract:Recent studies have explored using pretrained Vision Foundation Models (VFMs) such as DINO for generative autoencoders, showing strong generative performance. Unfortunately, existing approaches often suffer from limited reconstruction fidelity due to the loss of high-frequency details. In this work, we present the DINO Spherical Autoencoder (DINO-SAE), a framework that bridges semantic representation and pixel-level reconstruction. Our key insight is that semantic information in contrastive representations is primarily encoded in the direction of feature vectors, while forcing strict magnitude matching can hinder the encoder from preserving fine-grained details. To address this, we introduce a Hierarchical Convolutional Patch Embedding module that enhances local structure and texture preservation, and a Cosine Similarity Alignment objective that enforces semantic consistency while allowing flexible feature magnitudes for detail retention. Furthermore, leveraging the observation that SSL-based foundation model representations intrinsically lie on a hypersphere, we employ Riemannian Flow Matching to train a Diffusion Transformer (DiT) directly on this spherical latent manifold. Experiments on ImageNet-1K demonstrate that our approach achieves state-of-the-art reconstruction quality, reaching 0.37 rFID and 26.2 dB PSNR, while maintaining strong semantic alignment to the pretrained VFM. Notably, our Riemannian Flow Matching-based DiT exhibits efficient convergence, achieving a gFID of 3.47 at 80 epochs.
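
The point that a cosine objective constrains direction but not magnitude can be checked with a few lines. The sketch below uses the generic form `1 - cos(student, teacher)`; whether the paper's Cosine Similarity Alignment objective has exactly this form is an assumption.

```python
import math


def cosine_alignment_loss(student, teacher):
    """1 - cosine similarity: zero when directions match, whatever the magnitudes."""
    dot = sum(s * t for s, t in zip(student, teacher))
    ns = math.sqrt(sum(s * s for s in student))
    nt = math.sqrt(sum(t * t for t in teacher))
    return 1.0 - dot / (ns * nt)
```

Scaling either vector leaves the loss unchanged, so `[3, 4]` and `[6, 8]` are treated as perfectly aligned, whereas a squared-error loss would penalize the difference in length.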

27. 【2601.22887】MoVE: Mixture of Value Embeddings -- A New Axis for Scaling Parametric Memory in Autoregressive Models

Link: https://arxiv.org/abs/2601.22887

Authors: Yangyan Li

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: diverse modalities ranging, modern Generative, powering results, cornerstone of modern, results across diverse

Comments:

Click to view abstract

Abstract:Autoregressive sequence modeling stands as the cornerstone of modern Generative AI, powering results across diverse modalities ranging from text generation to image generation. However, a fundamental limitation of this paradigm is the rigid structural coupling of model capacity to computational cost: expanding a model's parametric memory -- its repository of factual knowledge or visual patterns -- traditionally requires deepening or widening the network, which incurs a proportional rise in active FLOPs. In this work, we introduce MoVE (Mixture of Value Embeddings), a mechanism that breaks this coupling and establishes a new axis for scaling capacity. MoVE decouples memory from compute by introducing a global bank of learnable value embeddings shared across all attention layers. For every step in the sequence, the model employs a differentiable soft gating mechanism to dynamically mix retrieved concepts from this bank into the standard value projection. This architecture allows parametric memory to be scaled independently of network depth by simply increasing the number of embedding slots. We validate MoVE through strictly controlled experiments on two representative applications of autoregressive modeling: Text Generation and Image Generation. In both domains, MoVE yields consistent performance improvements over standard and layer-wise memory baselines, enabling the construction of "memory-dense" models that achieve lower perplexity and higher fidelity than their dense counterparts at comparable compute budgets.
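
A rough picture of the soft gating described in this abstract: a softmax over gate logits weights the slots of a shared embedding bank, and the weighted sum is combined with the usual value vector. In the sketch below the additive combination, the function names, and the assumption that gate logits arrive precomputed (in a real model they would come from a learned projection of the hidden state) are all illustrative, not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_value(value, gate_logits, bank):
    """Add a softly gated mixture of bank slots to one token's value vector.

    value: the standard value projection for one token (length d).
    gate_logits: one logit per bank slot (assumed to be produced elsewhere).
    bank: list of slot embeddings, each of length d, shared across layers.
    """
    gates = softmax(gate_logits)
    dim = len(value)
    retrieved = [sum(g * slot[d] for g, slot in zip(gates, bank)) for d in range(dim)]
    return [v + r for v, r in zip(value, retrieved)]
```

Adding slots to `bank` enlarges the parameter count without changing the output dimension or the depth of the network, which is the scaling axis the abstract refers to.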

28. 【2601.22868】When Anomalies Depend on Context: Learning Conditional Compatibility for Anomaly Detection

Link: https://arxiv.org/abs/2601.22868

Authors: Shashank Mishra, Didier Stricker, Jason Rambach

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Anomaly detection, intrinsic property, Anomaly, context, contextual anomaly detection

Comments: Preprint. Submitted to ICML 2026. 8 pages main text, plus appendix

Click to view abstract

Abstract:Anomaly detection is often formulated under the assumption that abnormality is an intrinsic property of an observation, independent of context. This assumption breaks down in many real-world settings, where the same object or action may be normal or anomalous depending on latent contextual factors (e.g., running on a track versus on a highway). We revisit contextual anomaly detection, classically defined as context-dependent abnormality, and operationalize it in the visual domain, where anomaly labels depend on subject-context compatibility rather than intrinsic appearance. To enable systematic study of this setting, we introduce CAAD-3K, a benchmark that isolates contextual anomalies by controlling subject identity while varying context. We further propose a conditional compatibility learning framework that leverages vision-language representations to model subject-context relationships under limited supervision. Our method substantially outperforms existing approaches on CAAD-3K and achieves state-of-the-art performance on MVTec-AD and VisA, demonstrating that modeling context dependence complements traditional structural anomaly detection. Our code and dataset will be publicly released.

29. 【2601.22861】Under-Canopy Terrain Reconstruction in Dense Forests Using RGB Imaging and Neural 3D Reconstruction

Link: https://arxiv.org/abs/2601.22861

Authors: Refael Sheffer, Chen Pinchover, Haim Zisman, Dror Ozeri, Roee Litman

Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Emerging Technologies (cs.ET); Graphics (cs.GR)

Keywords: Airborne Optical Sectioning, hidden beneath dense, dense forest canopies, great interest, interest for numerous

Comments: WACV 2026 CV4EO

Click to view abstract

Abstract:Mapping the terrain and understory hidden beneath dense forest canopies is of great interest for numerous applications such as search and rescue, trail mapping, and forest inventory tasks. Existing solutions rely on specialized sensors: either heavy, costly airborne LiDAR, or Airborne Optical Sectioning (AOS), which uses thermal synthetic aperture photography and is tailored for person detection. We introduce a novel approach for the reconstruction of canopy-free, photorealistic ground views using only conventional RGB images. Our solution is based on the celebrated Neural Radiance Fields (NeRF), a recent 3D reconstruction method. Additionally, we include specific image capture considerations, which dictate the needed illumination to successfully expose the scene beneath the canopy. To better cope with the poorly lit understory, we employ a low light loss. Finally, we propose two complementary approaches to remove occluding canopy elements by controlling the per-ray integration procedure. To validate the value of our approach, we present two possible downstream tasks. For the task of search and rescue (SAR), we demonstrate that our method enables person detection which achieves promising results compared to thermal AOS (using only RGB images). Additionally, we show the potential of our approach for forest inventory tasks like tree counting. These results position our approach as a cost-effective, high-resolution alternative to specialized sensors for SAR, trail mapping, and forest-inventory tasks.
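
One way to picture "controlling the per-ray integration" is standard NeRF-style alpha compositing in which samples closer to the camera than some depth are skipped, so that near occluders neither add colour nor block what lies behind them. The depth cutoff `min_depth` is a hypothetical mechanism used only for illustration; the abstract does not describe the paper's two actual approaches.

```python
def render_ray(alphas, colors, depths, min_depth=0.0):
    """Alpha-composite samples along one ray, ignoring samples nearer than min_depth.

    alphas: per-sample opacity in [0, 1]; colors: per-sample scalar intensity;
    depths: per-sample distance from the camera, in increasing order.
    """
    transmittance = 1.0
    out = 0.0
    for alpha, color, depth in zip(alphas, colors, depths):
        if depth < min_depth:
            continue  # skipped samples neither contribute colour nor block light
        out += transmittance * alpha * color
        transmittance *= 1.0 - alpha
    return out
```

With an opaque canopy sample at depth 1 and an opaque ground sample at depth 5, the full integral returns the canopy colour, while starting the integration at depth 3 returns the ground colour.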

ACM classes: I.3.3; I.3.5; I.3.7; I.3.8; I.4.1; I.4.3; I.4.5; I.4.9

30. 【2601.22853】Inference-Time Dynamic Modality Selection for Incomplete Multimodal Classification

Link: https://arxiv.org/abs/2601.22853

Authors: Siyi Du, Xinzhe Luo, Declan P. O'Regan, Chen Qin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved remarkable success, Existing incomplete MDL, achieved remarkable, remarkable success, practical deployment

Comments: 27 pages (including appendix), accepted by ICLR 2026

Click to view abstract

Abstract:Multimodal deep learning (MDL) has achieved remarkable success across various domains, yet its practical deployment is often hindered by incomplete multimodal data. Existing incomplete MDL methods either discard missing modalities, risking the loss of valuable task-relevant information, or recover them, potentially introducing irrelevant noise, leading to the discarding-imputation dilemma. To address this dilemma, in this paper, we propose DyMo, a new inference-time dynamic modality selection framework that adaptively identifies and integrates reliable recovered modalities, fully exploring task-relevant information beyond the conventional discard-or-impute paradigm. Central to DyMo is a novel selection algorithm that maximizes multimodal task-relevant information for each test sample. Since direct estimation of such information at test time is intractable due to the unknown data distribution, we theoretically establish a connection between information and the task loss, which we compute at inference time as a tractable proxy. Building on this, a novel principled reward function is proposed to guide modality selection. In addition, we design a flexible multimodal network architecture compatible with arbitrary modality combinations, alongside a tailored training strategy for robust representation learning. Extensive experiments on diverse natural and medical image datasets show that DyMo significantly outperforms state-of-the-art incomplete/dynamic MDL methods across various missing-data scenarios. Our code is available at this https URL.

31. 【2601.22841】How Much of a Model Do We Need? Redundancy and Slimmability in Remote Sensing Foundation Models

Link: https://arxiv.org/abs/2601.22841

Authors: Leonard Hackel, Tom Burgert, Begüm Demir

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Earth observation applications, Large-scale foundation models, Large-scale foundation, Earth observation, remote sensing

Comments:

Click to view abstract

Abstract:Large-scale foundation models (FMs) in remote sensing (RS) are developed based on the paradigms established in computer vision (CV) and have shown promise for various Earth observation applications. However, the direct transfer of scaling assumptions from CV to RS has not been adequately examined. We hypothesize that RS FMs enter an overparameterized regime at substantially smaller scales than their CV counterparts, where increasing parameter count primarily induces redundant representations rather than qualitatively new abstractions. To test this hypothesis, we use post-hoc slimming, where we uniformly reduce the width of the pretrained encoder, as a tool to measure representational redundancy across six state-of-the-art RS FMs on four downstream classification tasks. Our findings reveal a significant contrast with those in the CV domain: while a post-hoc slimmed masked autoencoder (MAE) trained on ImageNet retains less than 10% accuracy at 1% FLOPs, RS FMs maintain over 71% relative accuracy at the same budget. This sevenfold difference provides strong empirical support for our hypothesis. We further demonstrate that learned slimmable training can improve both Momentum Contrast (MoCo)-based and MAE-based models. In addition, through the explained variance ratio and the feature correlation analysis, we provide mechanistic explanations showing that RS FMs distribute task-relevant information with high redundancy. Our findings establish post-hoc slimmability as both a practical deployment strategy for resource-constrained environments and a diagnostic tool that challenges the prevailing scaling paradigm in RS. Upon acceptance, we will publish all code.
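
The post-hoc slimming idea in this abstract (uniformly reducing encoder width after pretraining) can be illustrated with a toy two-layer MLP. This is only a sketch under stated assumptions: the abstract does not say which units are kept, so keeping the first units is a choice made here, and `slim_mlp` and the weight matrices are hypothetical names and values.

```python
import math

def slim_mlp(w1, w2, ratio):
    """Keep the first ceil(ratio * hidden) hidden units of a two-layer MLP.

    w1 has shape hidden x in, w2 has shape out x hidden (nested lists).
    Which units to keep is an assumption; here it is simply the first ones.
    """
    hidden = len(w1)
    keep = max(1, math.ceil(ratio * hidden))
    slim_w1 = w1[:keep]                      # drop rows of the first layer
    slim_w2 = [row[:keep] for row in w2]     # drop matching columns of the second
    return slim_w1, slim_w2

# Hypothetical pretrained weights with 4 hidden units.
w1 = [[1, 2], [3, 4], [5, 6], [7, 8]]
w2 = [[1, 0, 1, 0], [0, 1, 0, 1]]
s1, s2 = slim_mlp(w1, w2, ratio=0.5)
```

No retraining happens in this sketch; the slimmed weights are used as they are, which is what makes the procedure "post-hoc".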

32. 【2601.22838】Neural Clothing Tryer: Customized Virtual Try-On via Semantic Enhancement and Controlling Diffusion Model

Link: https://arxiv.org/abs/2601.22838

Authors: Zhijing Yang, Weiwei Zhang, Mingliang Yang, Siyuan Peng, Yukai Shi, Junpeng Tan, Tianshui Chen, Liruo Zhong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Customized Virtual Try-ON, Customized Virtual, Neural Clothing Tryer, traditional VTON task, Virtual Try-ON

Comments: Accepted by Expert Systems with Applications. 16 pages, 10 figures

Click to view abstract

Abstract:This work aims to address a novel Customized Virtual Try-ON (Cu-VTON) task, enabling the superimposition of a specified garment onto a model that can be customized in terms of appearance, posture, and additional attributes. Compared with the traditional VTON task, it enables users to tailor digital avatars to their individual preferences, thereby enhancing the virtual fitting experience with greater flexibility and engagement. To address this task, we introduce a Neural Clothing Tryer (NCT) framework, which exploits advanced diffusion models equipped with semantic enhancement and controlling modules to better preserve the semantic characterization and textural details of the garment while facilitating flexible editing of the model's postures and appearances. Specifically, NCT introduces a semantic-enhanced module to take semantic descriptions of garments and utilizes a visual-language encoder to learn aligned features across modalities. The aligned features serve as the conditional input to the diffusion model to enhance the preservation of the garment's semantics. Then, a semantic controlling module is designed to take the garment image, tailored posture image, and semantic description as input to maintain garment details while simultaneously editing model postures, expressions, and various attributes. Extensive experiments on the openly available benchmark demonstrate the superior performance of the proposed NCT framework.

33. 【2601.22837】NativeTok: Native Visual Tokenization for Improved Image Generation

Link: https://arxiv.org/abs/2601.22837

Authors: Bin Wu, Mengqi Huang, Weinan Jia, Zhendong Mao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: tokenizer encodes images, VQ-based image generation, image generation typically, two-stage pipeline, tokenizer encodes

Comments:

Click to view abstract

Abstract:VQ-based image generation typically follows a two-stage pipeline: a tokenizer encodes images into discrete tokens, and a generative model learns their dependencies for reconstruction. However, improved tokenization in the first stage does not necessarily enhance the second-stage generation, as existing methods fail to constrain token dependencies. This mismatch forces the generative model to learn from unordered distributions, leading to bias and weak coherence. To address this, we propose native visual tokenization, which enforces causal dependencies during tokenization. Building on this idea, we introduce NativeTok, a framework that achieves efficient reconstruction while embedding relational constraints within token sequences. NativeTok consists of: (1) a Meta Image Transformer (MIT) for latent image modeling, and (2) a Mixture of Causal Expert Transformer (MoCET), where each lightweight expert block generates a single token conditioned on prior tokens and latent features. We further design a Hierarchical Native Training strategy that updates only new expert blocks, ensuring training efficiency. Extensive experiments demonstrate the effectiveness of NativeTok.

34. 【2601.22830】A Comparative Evaluation of Large Vision-Language Models for 2D Object Detection under SOTIF Conditions

Link: https://arxiv.org/abs/2601.22830

Authors: Ji Zhou, Yilin Ding, Yongqi Zhao, Jiachen Xu, Arno Eichberger

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Reliable environmental perception, Reliable environmental, environmental perception remains, main obstacles, obstacles for safe

Comments: 6 pages, 11 figures

Click to view abstract

Abstract:Reliable environmental perception remains one of the main obstacles for safe operation of automated vehicles. Safety of the Intended Functionality (SOTIF) concerns safety risks from perception insufficiencies, particularly under adverse conditions where conventional detectors often falter. While Large Vision-Language Models (LVLMs) demonstrate promising semantic reasoning, their quantitative effectiveness for safety-critical 2D object detection is underexplored. This paper presents a systematic evaluation of ten representative LVLMs using the PeSOTIF dataset, a benchmark specifically curated for long-tail traffic scenarios and environmental degradations. Performance is quantitatively compared against the classical perception approach, a YOLO-based detector. Experimental results reveal a critical trade-off: top-performing LVLMs (e.g., Gemini 3, Doubao) surpass the YOLO baseline in recall by over 25% in complex natural scenarios, exhibiting superior robustness to visual degradation. Conversely, the baseline retains an advantage in geometric precision for synthetic perturbations. These findings highlight the complementary strengths of semantic reasoning versus geometric regression, supporting the use of LVLMs as high-level safety validators in SOTIF-oriented automated driving systems.

35. 【2601.22828】Decomposing and Composing: Towards Efficient Vision-Language Continual Learning via Rank-1 Expert Pool in a Single LoRA

Link: https://arxiv.org/abs/2601.22828

Authors: Zhan Fa, Yue Duan, Jian Zhang, Lei Qi, Wanqi Yang, Yinghuan Shi

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: faces significant challenges, improving task adaptation, avoiding catastrophic forgetting, vision-language models, faces significant

Comments:

Click to view abstract

Abstract:Continual learning (CL) in vision-language models (VLMs) faces significant challenges in improving task adaptation and avoiding catastrophic forgetting. Existing methods usually incur a heavy inference burden or rely on external knowledge, while Low-Rank Adaptation (LoRA) has shown potential in reducing these issues by enabling parameter-efficient tuning. However, because directly using LoRA to alleviate catastrophic forgetting is non-trivial, we introduce a novel framework that restructures a single LoRA module as a decomposable Rank-1 Expert Pool. Our method learns to dynamically compose a sparse, task-specific update by selecting from this expert pool, guided by the semantics of the [CLS] token. In addition, we propose an Activation-Guided Orthogonal (AGO) loss that orthogonalizes critical parts of LoRA weights across tasks. This sparse composition and orthogonalization enable fewer parameter updates, resulting in domain-aware learning while minimizing inter-task interference and maintaining downstream task performance. Extensive experiments across multiple settings demonstrate state-of-the-art results in all metrics, surpassing zero-shot upper bounds in generalization. Notably, it reduces trainable parameters by 96.7% compared to the baseline method, eliminating reliance on external datasets or task-ID discriminators. The merged LoRAs retain fewer weights and incur no inference latency, making our method computationally lightweight.
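
A rank-r LoRA update can be written as a sum of r rank-1 terms, which is what makes a "Rank-1 Expert Pool" decomposable. The sketch below composes a sparse update from a few selected rank-1 experts. The top-k selection rule, the gate weighting, and all names and numbers are assumptions for illustration; the abstract only states that selection is guided by the [CLS] token.

```python
def outer(u, v):
    """Outer product u v^T as a nested list."""
    return [[ui * vj for vj in v] for ui in u]

def compose_update(b_vectors, a_vectors, gates, top_k):
    """Sum the top_k highest-gated rank-1 experts b_i a_i^T into one weight update."""
    chosen = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:top_k]
    rows, cols = len(b_vectors[0]), len(a_vectors[0])
    delta = [[0.0] * cols for _ in range(rows)]
    for i in chosen:
        expert = outer(b_vectors[i], a_vectors[i])
        for r in range(rows):
            for c in range(cols):
                delta[r][c] += gates[i] * expert[r][c]
    return delta, sorted(chosen)

# Hypothetical pool of three rank-1 experts for a 2 x 2 weight matrix.
b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
gates = [0.5, 0.1, 0.4]   # stand-in for scores derived from the [CLS] token
delta, chosen = compose_update(b, a, gates, top_k=2)
```

Because the composed `delta` is an ordinary matrix, it can be merged into the base weight, which is consistent with the abstract's claim of no added inference latency.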

36. 【2601.22809】FarmMind: Reasoning-Query-Driven Dynamic Segmentation for Farmland Remote Sensing Images

Link: https://arxiv.org/abs/2601.22809

Authors: Haiyang Wu, Weiliang Mu, Jipeng Zhang, Zhong Dandan, Zhuofei Du, Haifeng Li, Tao Chao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: analysis relies solely, farmland remote sensing, segmentation generally follow, single input patch, generally follow

Comments:

Click to view abstract

Abstract:Existing methods for farmland remote sensing image (FRSI) segmentation generally follow a static segmentation paradigm, where analysis relies solely on the limited information contained within a single input patch. Consequently, their reasoning capability is limited when dealing with complex scenes characterized by ambiguity and visual uncertainty. In contrast, human experts, when interpreting remote sensing images in such ambiguous cases, tend to actively query auxiliary images (such as higher-resolution, larger-scale, or temporally adjacent data) to conduct cross-verification and achieve more comprehensive reasoning. Inspired by this, we propose a reasoning-query-driven dynamic segmentation framework for FRSIs, named FarmMind. This framework breaks through the limitations of the static segmentation paradigm by introducing a reasoning-query mechanism, which dynamically and on-demand queries external auxiliary images to compensate for the insufficient information in a single input image. Unlike direct queries, this mechanism simulates the thinking process of human experts when faced with segmentation ambiguity: it first analyzes the root causes of segmentation ambiguities through reasoning, and then determines what type of auxiliary image needs to be queried based on this analysis. Extensive experiments demonstrate that FarmMind achieves superior segmentation performance and stronger generalization ability compared with existing methods. The source code and dataset used in this work are publicly available at: this https URL.

37. 【2601.22808】Diachronic Stereo Matching for Multi-Date Satellite Imagery

Link: https://arxiv.org/abs/2601.22808

Authors: Elías Masquil (IIE, UDELAR), Luca Savant Aira (Polito), Roger Marí, Thibaud Ehret (AMIAD), Pablo Musé (IIE, UDELAR, CB), Gabriele Facciolo (CB, IUF)

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, complementary directions, Recent, Diachronic, image

Comments:

Click to view abstract

Abstract:Recent advances in image-based satellite 3D reconstruction have progressed along two complementary directions. On one hand, multi-date approaches using NeRF or Gaussian splatting jointly model appearance and geometry across many acquisitions, achieving accurate reconstructions on opportunistic imagery with numerous observations. On the other hand, classical stereoscopic reconstruction pipelines deliver robust and scalable results for simultaneous or quasi-simultaneous image pairs. However, when the two images are captured months apart, strong seasonal, illumination, and shadow changes violate standard stereoscopic assumptions, causing existing pipelines to fail. This work presents the first Diachronic Stereo Matching method for satellite imagery, enabling reliable 3D reconstruction from temporally distant pairs. Two advances make this possible: (1) fine-tuning a state-of-the-art deep stereo network that leverages monocular depth priors, and (2) exposing it to a dataset specifically curated to include a diverse set of diachronic image pairs. In particular, we start from a pretrained MonSter model, trained initially on a mix of synthetic and real datasets such as SceneFlow and KITTI, and fine-tune it on a set of stereo pairs derived from the DFC2019 remote sensing challenge. This dataset contains both synchronic and diachronic pairs under diverse seasonal and illumination conditions. Experiments on multi-date WorldView-3 imagery demonstrate that our approach consistently surpasses classical pipelines and unadapted deep stereo models in both synchronic and diachronic settings. Fine-tuning on temporally diverse images, together with monocular priors, proves essential for enabling 3D reconstruction from previously incompatible acquisition dates.

Figure 1. Output geometry for a winter-autumn image pair from Omaha (OMA 331 test scene). Panels: left image (winter), right image (autumn), and DSM geometry for ours (1.23 m), zero-shot (3.99 m), and LiDAR GT. Our method recovers accurate geometry despite the diachronic nature of the pair, exhibiting strong appearance changes, which cause existing zero-shot methods to fail. Missing values due to perspective shown in black. Mean altitude error in parentheses; lower is better.

38. 【2601.22796】HeatMat: Simulation of City Material Impact on Urban Heat Island Effect

Link: https://arxiv.org/abs/2601.22796

Authors: Marie Reinbigler, Romain Rouffet, Peter Naylor, Mikolaj Czerkawski, Nikolaos Dionelis, Elisabeth Brunet, Catalin Fetita, Rosalie Martin

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Urban Heat Island, Heat Island, urban environments compared, satellites or in-situ, in-situ stations

Comments:

Click to view abstract

Abstract:The Urban Heat Island (UHI) effect, defined as a significant increase in temperature in urban environments compared to surrounding areas, is difficult to study in real cities using sensor data (satellites or in-situ stations) due to their coarse spatial and temporal resolution. Among the factors contributing to this effect are the properties of urban materials, which differ from those in rural areas. To analyze their individual impact and to test new material configurations, a high-resolution simulation at the city scale is required. Estimating the current materials used in a city, including those on building facades, is also challenging. We propose HeatMat, an approach to analyze at high resolution the individual impact of urban materials on the UHI effect in a real city, relying only on open data. We estimate building materials using street-view images and a pre-trained vision-language model (VLM) to supplement existing OpenStreetMap data, which describes the 2D geometry and features of buildings. We further encode this information into a set of 2D maps that represent the city's vertical structure and material characteristics. These maps serve as inputs for our 2.5D simulator, which models coupled heat transfers and enables random-access surface temperature estimation at multiple resolutions, reaching a 20x speedup compared to an equivalent simulation in 3D.

39. 【2601.22783】Compact Hypercube Embeddings for Fast Text-based Wildlife Observation Retrieval

Link: https://arxiv.org/abs/2601.22783

Authors: Ilyass Moummad, Marius Miron, David Robinson, Kawtar Zaher, Hervé Goëau, Olivier Pietquin, Pierre Bonnet, Emmanuel Chemla, Matthieu Geist, Alexis Joly

Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)

Keywords: platforms increasingly rely, monitoring platforms increasingly, platforms increasingly, increasingly rely, rely on multimodal

Comments:

Click to view abstract

Abstract:Large-scale biodiversity monitoring platforms increasingly rely on multimodal wildlife observations. While recent foundation models enable rich semantic representations across vision, audio, and language, retrieving relevant observations from massive archives remains challenging due to the computational cost of high-dimensional similarity search. In this work, we introduce compact hypercube embeddings for fast text-based wildlife observation retrieval, a framework that enables efficient text-based search over large-scale wildlife image and audio databases using compact binary representations. Building on the cross-view code alignment hashing framework, we extend lightweight hashing beyond a single-modality setup to align natural language descriptions with visual or acoustic observations in a shared Hamming space. Our approach leverages pretrained wildlife foundation models, including BioCLIP and BioLingual, and adapts them efficiently for hashing using parameter-efficient fine-tuning. We evaluate our method on large-scale benchmarks, including iNaturalist2024 for text-to-image retrieval and iNatSounds2024 for text-to-audio retrieval, as well as multiple soundscape datasets to assess robustness under domain shift. Results show that retrieval using discrete hypercube embeddings achieves competitive, and in several cases superior, performance compared to continuous embeddings, while drastically reducing memory and search cost. Moreover, we observe that the hashing objective consistently improves the underlying encoder representations, leading to stronger retrieval and zero-shot generalization. These results demonstrate that binary, language-based retrieval enables scalable and efficient search over large wildlife archives for biodiversity monitoring systems.
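
To make the "shared Hamming space" concrete, here is a minimal sketch of retrieval with binary codes: embeddings are binarized, packed into integers, and ranked by Hamming distance (XOR followed by a bit count). Sign thresholding is an assumption used for illustration; the paper learns its codes with a hashing objective. The item names and vectors are hypothetical.

```python
def to_code(embedding):
    """Binarize a real-valued embedding by sign and pack the bits into an int."""
    code = 0
    for value in embedding:
        code = (code << 1) | (1 if value > 0 else 0)
    return code

def hamming(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")

def search(query_code, database, k=1):
    """Return the ids of the k items closest to the query in Hamming distance."""
    ranked = sorted(database, key=lambda item: hamming(query_code, database[item]))
    return ranked[:k]

# Hypothetical 4-bit codes for one image and one audio observation.
database = {
    "fox_photo": to_code([0.9, -0.2, 0.4, -0.7]),   # bits 1010
    "owl_call": to_code([-0.5, 0.8, -0.1, 0.3]),    # bits 0101
}
query = to_code([0.7, -0.1, 0.2, 0.6])              # text query, bits 1011
top = search(query, database)
```

Each item costs one bit per dimension instead of a float, and comparison reduces to integer operations, which is the source of the memory and search savings the abstract describes.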

40. 【2601.22778】Color Matters: Demosaicing-Guided Color Correlation Training for Generalizable AI-Generated Image Detection

Link: https://arxiv.org/abs/2601.22778

Authors: Nan Zhong, Yiran Xu, Mian Zou

Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

Keywords: threaten digital authenticity, camera imaging pipeline, generative artifact-based detectors, images threaten digital, realistic AI-generated images

Comments:

Click to view abstract

Abstract:As realistic AI-generated images threaten digital authenticity, we address the generalization failure of generative artifact-based detectors by exploiting the intrinsic properties of the camera imaging pipeline. Concretely, we investigate color correlations induced by the color filter array (CFA) and demosaicing, and propose a Demosaicing-guided Color Correlation Training (DCCT) framework for AI-generated image detection. By simulating the CFA sampling pattern, we decompose each color image into a single-channel input (as the condition) and the remaining two channels as the ground-truth targets (for prediction). A self-supervised U-Net is trained to model the conditional distribution of the missing channels from the given one, parameterized via a mixture of logistic functions. Our theoretical analysis reveals that DCCT targets a provable distributional difference in color-correlation features between photographic and AI-generated images. By leveraging these distinct features to construct a binary classifier, DCCT achieves state-of-the-art generalization and robustness, significantly outperforming prior methods across over 20 unseen generators.
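
The decomposition step described above (one CFA-sampled channel as the condition, the other two as prediction targets) can be sketched as follows. The RGGB Bayer layout is an assumption, since the abstract does not name the pattern, and `cfa_decompose` and the toy image are hypothetical.

```python
# Channel index sampled at (row % 2, col % 2) for an assumed RGGB Bayer layout:
# R G
# G B
BAYER_RGGB = [[0, 1], [1, 2]]

def cfa_decompose(image):
    """Split an H x W image of (R, G, B) tuples into condition and targets.

    condition[r][c] is the single value the CFA would sample at that pixel;
    targets[r][c] holds the two remaining channel values, in RGB order.
    """
    condition, targets = [], []
    for r, row in enumerate(image):
        cond_row, tgt_row = [], []
        for c, pixel in enumerate(row):
            k = BAYER_RGGB[r % 2][c % 2]
            cond_row.append(pixel[k])
            tgt_row.append(tuple(v for i, v in enumerate(pixel) if i != k))
        condition.append(cond_row)
        targets.append(tgt_row)
    return condition, targets

# Toy 2 x 2 image.
image = [[(10, 20, 30), (11, 21, 31)],
         [(12, 22, 32), (13, 23, 33)]]
condition, targets = cfa_decompose(image)
```

A model trained to predict `targets` from `condition` would then capture the color correlations that the abstract argues differ between photographs and generated images.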

41. 【2601.22763】Is Training Necessary for Anomaly Detection?

Link: https://arxiv.org/abs/2601.22763

Authors: Xingwu Zhang, Guanxuan Li, Paul Henderson, Gerardo Aragon-Camarasa, Zijun Long

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: multi-class unsupervised anomaly, unsupervised anomaly detection, multi-class unsupervised, methods rely, training encoder-decoder models

Comments:

Click to view abstract

Abstract:Current state-of-the-art multi-class unsupervised anomaly detection (MUAD) methods rely on training encoder-decoder models to reconstruct anomaly-free features. We first show these approaches have an inherent fidelity-stability dilemma in how they detect anomalies via reconstruction residuals. We then abandon the reconstruction paradigm entirely and propose Retrieval-based Anomaly Detection (RAD). RAD is a training-free approach that stores anomaly-free features in a memory and detects anomalies through multi-level retrieval, matching test patches against the memory. Experiments demonstrate that RAD achieves state-of-the-art performance across four established benchmarks (MVTec-AD, VisA, Real-IAD, 3D-ADAM) under both standard and few-shot settings. On MVTec-AD, RAD reaches 96.7% Pixel AUROC with just a single anomaly-free image compared to 98.5% of RAD's full-data performance. We further prove that retrieval-based scores theoretically upper-bound reconstruction-residual scores. Collectively, these findings overturn the assumption that MUAD requires task-specific training, showing that state-of-the-art anomaly detection is feasible with memory-based retrieval. Our code is available at this https URL.
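
The core of a memory-based, training-free detector can be sketched in a few lines: store anomaly-free patch features, then score each test patch by its distance to the nearest stored feature. This is a single-level simplification, not RAD's multi-level retrieval; the Euclidean metric, the max-pooling to an image score, and the toy 2-D features are assumptions for illustration.

```python
import math

def nearest_distance(query, memory):
    """Euclidean distance from `query` to its closest vector in `memory`."""
    return min(math.dist(query, m) for m in memory)

def anomaly_scores(test_patches, memory):
    """Score each test patch by its distance to the nearest anomaly-free feature."""
    return [nearest_distance(p, memory) for p in test_patches]

# Hypothetical anomaly-free patch features collected without any training.
memory_bank = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# One patch that matches the memory and one that is far from everything in it.
test_patches = [(0.0, 0.0), (3.0, 4.0)]
scores = anomaly_scores(test_patches, memory_bank)
image_score = max(scores)  # image-level score taken from the most anomalous patch
```

Patches that resemble stored normal features score near zero, while unfamiliar patches score high, so no reconstruction model has to be trained.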

42. 【2601.22754】Procedural Knowledge Extraction from Industrial Troubleshooting Guides Using Vision Language Models

Link: https://arxiv.org/abs/2601.22754

Authors: Guillermo Gil de Avalle, Laura Maruster, Christos Emmanouilidis

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: encode diagnostic procedures, guides encode diagnostic, Industrial troubleshooting guides, technical language jointly, language jointly convey

Comments:

Click to view abstract

Abstract:Industrial troubleshooting guides encode diagnostic procedures in flowchart-like diagrams where spatial layout and technical language jointly convey meaning. To integrate this knowledge into operator support systems, which assist shop-floor personnel in diagnosing and resolving equipment issues, the information must first be extracted and structured for machine interpretation. However, when performed manually, this extraction is labor-intensive and error-prone. Vision Language Models offer potential to automate this process by jointly interpreting visual and textual meaning, yet their performance on such guides remains underexplored. This paper evaluates two VLMs on extracting structured knowledge, comparing two prompting strategies: standard instruction-guided versus an augmented approach that cues troubleshooting layout patterns. Results reveal model-specific trade-offs between layout sensitivity and semantic robustness, informing practical deployment decisions.

43. 【2601.22744】Beauty and the Beast: Imperceptible Perturbations Against Diffusion-Based Face Swapping via Directional Attribute Editing

Link: https://arxiv.org/abs/2601.22744

Authors: Yilong Huang, Songze Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Keywords: undermine personal reputation, personal reputation, face swapping achieves, exacerbates the potential, potential harm

Comments:

Click to view abstract

Abstract:Diffusion-based face swapping achieves state-of-the-art performance, yet it also exacerbates the potential harm of malicious face swapping, which can violate portrait rights or undermine personal reputation. This has spurred the development of proactive defense methods. However, existing approaches face a core trade-off: large perturbations distort facial structures, while small ones weaken protection effectiveness. To address these issues, we propose FaceDefense, an enhanced proactive defense framework against diffusion-based face swapping. Our method introduces a new diffusion loss to strengthen the defensive efficacy of adversarial examples, and employs directional facial attribute editing to restore perturbation-induced distortions, thereby enhancing visual imperceptibility. A two-phase alternating optimization strategy is designed to generate final perturbed face images. Extensive experiments show that FaceDefense significantly outperforms existing methods in both imperceptibility and defense effectiveness, achieving a superior trade-off.

44. 【2601.22738】StreamSense: Streaming Social Task Detection with Selective Vision-Language Model Routing

Link: https://arxiv.org/abs/2601.22738

Authors: Han Wang, Deyi Ji, Lanyun Zhu, Jiebo Luo, Roy Ka-Wei Lee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Live streaming platforms, platforms require real-time, require real-time monitoring, streaming platforms require, Live streaming

Comments: 10 pages, 4 figures, The Web Conference 2026

Click to view abstract

Abstract:Live streaming platforms require real-time monitoring and reaction to social signals, utilizing partial and asynchronous evidence from video, text, and audio. We propose StreamSense, a streaming detector that couples a lightweight streaming encoder with selective routing to a Vision-Language Model (VLM) expert. StreamSense handles most timestamps with the lightweight streaming encoder, escalates hard/ambiguous cases to the VLM, and defers decisions when context is insufficient. The encoder is trained using (i) a cross-modal contrastive term to align visual/audio cues with textual signals, and (ii) an IoU-weighted loss that down-weights poorly overlapping target segments, mitigating label interference across segment boundaries. We evaluate StreamSense on multiple social streaming detection tasks (e.g., sentiment classification and hate content moderation), and the results show that StreamSense achieves higher accuracy than VLM-only streaming while only occasionally invoking the VLM, thereby reducing average latency and compute. Our results indicate that selective escalation and deferral are effective primitives for understanding streaming social tasks. Code is publicly available on GitHub.
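
The three-way decision described above (handle with the lightweight encoder, escalate to the VLM, or defer) can be sketched as a small routing function. The confidence and context thresholds, the input signals, and the name `route` are assumptions for illustration; the abstract does not specify how StreamSense makes this decision.

```python
def route(encoder_confidence, context_seconds,
          accept_threshold=0.8, min_context=2.0):
    """Pick who handles a timestamp: the encoder, the VLM expert, or nobody yet.

    Both thresholds are hypothetical values, not taken from the paper.
    """
    if context_seconds < min_context:
        return "defer"             # not enough evidence has arrived yet
    if encoder_confidence >= accept_threshold:
        return "encoder"           # easy case, keep it on the cheap path
    return "escalate_to_vlm"       # hard or ambiguous case

decisions = [
    route(0.95, 5.0),   # confident with enough context
    route(0.40, 5.0),   # ambiguous with enough context
    route(0.95, 0.5),   # too little context so far
]
```

Since only the ambiguous timestamps reach the VLM, average latency and compute stay close to those of the lightweight encoder.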

45. 【2601.22737】Lingua-SafetyBench: A Benchmark for Safety Evaluation of Multilingual Vision-Language Models

Link: https://arxiv.org/abs/2601.22737

Authors: Enyi Shi, Pengyang Shao, Yanxin Zhang, Chenhang Cui, Jiayi Lyu, Xu Xie, Xiaobo Xia, Fei Shen, Tat-Seng Chua

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: inputs remains underexplored, multimodal inputs remains, vision-language large models, remains underexplored, Robust safety

Comments:

Click to view abstract

Abstract:Robust safety of vision-language large models (VLLMs) under joint multilingual and multimodal inputs remains underexplored. Existing benchmarks are typically multilingual but text-only, or multimodal but monolingual. Recent multilingual multimodal red-teaming efforts render harmful prompts into images, yet rely heavily on typography-style visuals and lack semantically grounded image-text pairs, limiting coverage of realistic cross-modal interactions. We introduce Lingua-SafetyBench, a benchmark of 100,440 harmful image-text pairs across 10 languages, explicitly partitioned into image-dominant and text-dominant subsets to disentangle risk sources. Evaluating 11 open-source VLLMs reveals a consistent asymmetry: image-dominant risks yield a higher Attack Success Rate (ASR) in high-resource languages (HRLs), while text-dominant risks are more severe in non-high-resource languages (Non-HRLs). A controlled study on the Qwen series shows that scaling and version upgrades reduce ASR overall but disproportionately benefit HRLs, widening the gap between HRLs and Non-HRLs under text-dominant risks. This underscores the necessity of language- and modality-aware safety alignment beyond mere scaling. To facilitate reproducibility and future research, we will publicly release our benchmark, model checkpoints, and source code. The code and dataset will be available at this https URL. This paper contains examples with unsafe content.

46. 【2601.22730】ImgCoT: Compressing Long Chain of Thought into Compact Visual Tokens for Efficient Reasoning of Large Language Model

Link: https://arxiv.org/abs/2601.22730

Authors: Xiaoshu Chen, Sihang Zhou, Ke Liang, Taichun Zhou, Xinwang Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Compressing long chains, Compressing long, latent tokens, large language models, reasoning

Comments:

Click to view abstract

Abstract:Compressing long chains of thought (CoT) into compact latent tokens is crucial for efficient reasoning with large language models (LLMs). Recent studies employ autoencoders to achieve this by reconstructing textual CoT from latent tokens, thus encoding CoT semantics. However, treating textual CoT as the reconstruction target forces latent tokens to preserve surface-level linguistic features (e.g., word choice and syntax), introducing a strong linguistic inductive bias that prioritizes linguistic form over reasoning structure and limits logical abstraction. Thus, we propose ImgCoT that replaces the reconstruction target from textual CoT to the visual CoT obtained by rendering CoT into images. This substitutes linguistic bias with spatial inductive bias, i.e., a tendency to model spatial layouts of the reasoning steps in visual CoT, enabling latent tokens to better capture global reasoning structure. Moreover, although visual latent tokens encode abstract reasoning structure, they may blur reasoning details. We thus propose a loose ImgCoT, a hybrid reasoning that augments visual latent tokens with a few key textual reasoning steps, selected based on low token log-likelihood. This design allows LLMs to retain both global reasoning structure and fine-grained reasoning details with fewer tokens than the complete CoT. Extensive experiments across multiple datasets and LLMs demonstrate the effectiveness of the two versions of ImgCoT.
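
The selection rule behind loose ImgCoT (keep a few textual steps chosen by low token log-likelihood) can be sketched as below. Averaging log-likelihood over the tokens of each step and using a fixed budget are assumptions; the abstract only says that steps are selected based on low token log-likelihood. The steps and scores are hypothetical.

```python
def select_key_steps(steps, step_logprobs, budget):
    """Keep the `budget` steps with the lowest mean token log-likelihood.

    The kept steps are returned in their original order.
    """
    mean_lp = [sum(lp) / len(lp) for lp in step_logprobs]
    hardest = sorted(range(len(steps)), key=lambda i: mean_lp[i])[:budget]
    return [steps[i] for i in sorted(hardest)]

# Hypothetical reasoning steps with per-token log-likelihoods.
steps = [
    "Let x be the number of apples.",
    "Then 3x + 2 = 11.",
    "So x = 3.",
]
step_logprobs = [[-0.1, -0.2], [-2.0, -1.5], [-0.3, -0.9]]
key_steps = select_key_steps(steps, step_logprobs, budget=2)
```

Steps the model finds surprising are the ones most likely to be blurred by a compact latent representation, which is a plausible reason to keep them in text form.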

47. 【2601.22729】GaussianOcc3D: A Gaussian-Based Adaptive Multi-modal 3D Occupancy Prediction

Link: https://arxiv.org/abs/2601.22729

Authors: A. Enes Doruk, Hasan F. Ates

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: semantic occupancy prediction, single-modality methods face, methods face trade-offs, semantic occupancy, autonomous driving

Abstract:3D semantic occupancy prediction is a pivotal task in autonomous driving, providing a dense and fine-grained understanding of the surrounding environment, yet single-modality methods face trade-offs between camera semantics and LiDAR geometry. Existing multi-modal frameworks often struggle with modality heterogeneity, spatial misalignment, and the representation crisis--where voxels are computationally heavy and BEV alternatives are lossy. We present GaussianOcc3D, a multi-modal framework bridging camera and LiDAR through a memory-efficient, continuous 3D Gaussian representation. We introduce four modules: (1) LiDAR Depth Feature Aggregation (LDFA), using depth-wise deformable sampling to lift sparse signals onto Gaussian primitives; (2) Entropy-Based Feature Smoothing (EBFS) to mitigate domain noise; (3) Adaptive Camera-LiDAR Fusion (ACLF) with uncertainty-aware reweighting for sensor reliability; and (4) a Gauss-Mamba Head leveraging Selective State Space Models for global context with linear complexity. Evaluations on Occ3D, SurroundOcc, and SemanticKITTI benchmarks demonstrate state-of-the-art performance, achieving mIoU scores of 49.4%, 28.9%, and 25.2% respectively. GaussianOcc3D exhibits superior robustness across challenging rainy and nighttime conditions.

48. 【2601.22725】OpenVTON-Bench: A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On Evaluation

Link: https://arxiv.org/abs/2601.22725

Authors: Jin Li, Tao Chen, Shuai Jiang, Weijie Wang, Jingwen Luo, Chenhui Wu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Virtual Try-On, Recent advances, persistent bottleneck, advances in diffusion, diffusion models

Abstract:Recent advances in diffusion models have significantly elevated the visual fidelity of Virtual Try-On (VTON) systems, yet reliable evaluation remains a persistent bottleneck. Traditional metrics struggle to quantify fine-grained texture details and semantic consistency, while existing datasets fail to meet commercial standards in scale and diversity. We present OpenVTON-Bench, a large-scale benchmark comprising approximately 100K high-resolution image pairs (up to $1536 \times 1536$). The dataset is constructed using DINOv3-based hierarchical clustering for semantically balanced sampling and Gemini-powered dense captioning, ensuring a uniform distribution across 20 fine-grained garment categories. To support reliable evaluation, we propose a multi-modal protocol that measures VTON quality along five interpretable dimensions: background consistency, identity fidelity, texture fidelity, shape plausibility, and overall realism. The protocol integrates VLM-based semantic reasoning with a novel Multi-Scale Representation Metric based on SAM3 segmentation and morphological erosion, enabling the separation of boundary alignment errors from internal texture artifacts. Experimental results show strong agreement with human judgments (Kendall's $\tau$ of 0.833 vs. 0.611 for SSIM), establishing a robust benchmark for VTON evaluation.
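
The agreement figures in the abstract are Kendall's $\tau$ values. For readers unfamiliar with the statistic, here is a minimal sketch of the tau-a variant, which counts concordant and discordant pairs and ignores ties. The scores are made up for illustration, and the abstract does not say which tau variant the paper uses.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical rankings of five try-on results (not data from the paper).
human = [1, 2, 3, 4, 5]
metric = [1, 3, 2, 4, 5]
print(kendall_tau(human, metric))  # 0.8: one swapped pair out of ten
```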

49. 【2601.22714】Vision-Language Models Unlock Task-Centric Latent Actions

Link: https://arxiv.org/abs/2601.22714

Authors: Alexander Nikulin, Ilya Zisman, Albina Klepach, Denis Tarasov, Alexander Derevyagin, Andrei Polubarov, Lyubaykin Nikita, Vladislav Kurenkov

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: rapidly gained traction, Latent Action Models, Latent Action, pipelines of leading, meaningful latent actions

Comments: Preprint

Abstract:Latent Action Models (LAMs) have rapidly gained traction as an important component in the pre-training pipelines of leading Vision-Language-Action models. However, they fail when observations contain action-correlated distractors, often encoding noise instead of meaningful latent actions. Humans, on the other hand, can effortlessly distinguish task-relevant motions from irrelevant details in any video given only a brief task description. In this work, we propose to utilize the common-sense reasoning abilities of Vision-Language Models (VLMs) to provide promptable representations, effectively separating controllable changes from the noise in an unsupervised way. We use these representations as targets during LAM training and benchmark a wide variety of popular VLMs, revealing substantial variation in the quality of promptable representations as well as their robustness to different prompts and hyperparameters. Interestingly, we find that more recent VLMs may perform worse than older ones. Finally, we show that simply asking VLMs to ignore distractors can substantially improve latent action quality, yielding up to a six-fold increase in downstream success rates on Distracting MetaWorld.

50. 【2601.22711】SQUAD: Scalable Quorum Adaptive Decisions via ensemble of early exit neural networks

Link: https://arxiv.org/abs/2601.22711

Authors: Matteo Gambella, Fabrizio Pittorino, Giuliano Casale, Manuel Roveri

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)

Keywords: Early-exit neural networks, Quorum Adaptive Decisions, Scalable Quorum Adaptive, allowing intermediate predictions, Quorum Search Technique

Abstract:Early-exit neural networks have become popular for reducing inference latency by allowing intermediate predictions when sufficient confidence is achieved. However, standard approaches typically rely on single-model confidence thresholds, which are frequently unreliable due to inherent calibration issues. To address this, we introduce SQUAD (Scalable Quorum Adaptive Decisions), the first inference scheme that integrates early-exit mechanisms with distributed ensemble learning, improving uncertainty estimation while reducing the inference time. Unlike traditional methods that depend on individual confidence scores, SQUAD employs a quorum-based stopping criterion on early-exit learners by collecting intermediate predictions incrementally in order of computational complexity until a consensus is reached and halting the computation at that exit if the consensus is statistically significant. To maximize the efficacy of this voting mechanism, we also introduce QUEST (Quorum Search Technique), a Neural Architecture Search method to select early-exit learners with optimized hierarchical diversity, ensuring learners are complementary at every intermediate layer. This consensus-driven approach yields statistically robust early exits, improving the test accuracy up to 5.95% compared to state-of-the-art dynamic solutions with a comparable computational cost and reducing the inference latency up to 70.60% compared to static ensembles while maintaining a good accuracy.
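
The quorum-based stopping rule described above can be illustrated with a small sketch. It collects predictions from learners ordered cheapest first and stops once one label holds a large enough share of the votes. The fixed vote-share threshold below is a stand-in chosen for this sketch; the paper instead halts when the consensus is statistically significant, and that test is not specified in the abstract. The function name and the parameters `quorum` and `min_votes` are hypothetical.

```python
from collections import Counter


def quorum_early_exit(predictions, quorum=0.75, min_votes=3):
    """Return (label, learners_used) for a vote-share stopping rule.

    predictions: class labels from early-exit learners, cheapest first.
    quorum: required share of votes for the leading label (assumed rule).
    min_votes: minimum number of learners consulted before stopping.
    """
    if not predictions:
        raise ValueError("predictions must not be empty")
    votes = Counter()
    for used, label in enumerate(predictions, start=1):
        votes[label] += 1
        top_label, top_count = votes.most_common(1)[0]
        if used >= min_votes and top_count / used >= quorum:
            return top_label, used  # consensus reached, skip costlier learners
    # No quorum: fall back to the plurality vote over all learners.
    return votes.most_common(1)[0][0], len(predictions)


print(quorum_early_exit(["cat", "cat", "cat", "dog", "cat"]))  # ('cat', 3)
print(quorum_early_exit(["cat", "dog", "cat", "cat", "cat"]))  # ('cat', 4)
```

In the first call the three cheapest learners already agree, so the two more expensive ones are never evaluated. That saving is the latency benefit the abstract refers to.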

51. 【2601.22709】Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs

Link: https://arxiv.org/abs/2601.22709

Authors: Yanlong Chen, Amirhossein Habibian, Luca Benini, Yawei Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: significant accuracy loss, strong multimodal performance, achieve strong multimodal, costly to deploy, accuracy loss

Comments: This paper is currently under review for the 2026 International Conference on Machine Learning (ICML)

Abstract:Vision-Language Models (VLMs) achieve strong multimodal performance but are costly to deploy, and post-training quantization often causes significant accuracy loss. Despite its potential, quantization-aware training for VLMs remains underexplored. We propose GRACE, a framework unifying knowledge distillation and QAT under the Information Bottleneck principle: quantization constrains information capacity while distillation guides what to preserve within this budget. Treating the teacher as a proxy for task-relevant information, we introduce confidence-gated decoupled distillation to filter unreliable supervision, relational centered kernel alignment to transfer visual token structures, and an adaptive controller via Lagrangian relaxation to balance fidelity against capacity constraints. Across extensive benchmarks on LLaVA and Qwen families, our INT4 models consistently outperform FP16 baselines (e.g., LLaVA-1.5-7B: 70.1 vs. 66.8 on SQA; Qwen2-VL-2B: 76.9 vs. 72.6 on MMBench), nearly matching teacher performance. Using real INT4 kernel, we achieve 3$\times$ throughput with 54% memory reduction. This principled framework significantly outperforms existing quantization methods, making GRACE a compelling solution for resource-constrained deployment.

52. 【2601.22703】DAVIS: OOD Detection via Dominant Activations and Variance for Increased Separation

Link: https://arxiv.org/abs/2601.22703

Authors: Abid Hassan, Tuan Ngo, Saad Shafiq, Nenad Medvidovic

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: deploying machine learning, machine learning models, global average pooling, real world, critical safeguard

Abstract:Detecting out-of-distribution (OOD) inputs is a critical safeguard for deploying machine learning models in the real world. However, most post-hoc detection methods operate on penultimate feature representations derived from global average pooling (GAP) -- a lossy operation that discards valuable distributional statistics from activation maps prior to global average pooling. We contend that these overlooked statistics, particularly channel-wise variance and dominant (maximum) activations, are highly discriminative for OOD detection. We introduce DAVIS, a simple and broadly applicable post-hoc technique that enriches feature vectors by incorporating these crucial statistics, directly addressing the information loss from GAP. Extensive evaluations show DAVIS sets a new benchmark across diverse architectures, including ResNet, DenseNet, and EfficientNet. It achieves significant reductions in the false positive rate (FPR95), with improvements of 48.26% on CIFAR-10 using ResNet-18, 38.13% on CIFAR-100 using ResNet-34, and 26.83% on ImageNet-1k benchmarks using MobileNet-v2. Our analysis reveals the underlying mechanism for this improvement, providing a principled basis for moving beyond the mean in OOD detection.
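
To make the information loss from GAP concrete, the sketch below computes, per channel, the mean (what GAP keeps) together with the variance and the maximum activation (the statistics the abstract highlights). Concatenating the three into one vector is an assumption of this sketch; the abstract does not state how DAVIS combines them.

```python
from statistics import fmean, pvariance


def enriched_features(activation_maps):
    """Return [means..., variances..., maxima...] over all channels.

    activation_maps: list of channels, each a flat list of spatial
    activations. Plain concatenation is an illustrative choice.
    """
    means = [fmean(ch) for ch in activation_maps]          # what GAP keeps
    variances = [pvariance(ch) for ch in activation_maps]  # channel-wise spread
    maxima = [max(ch) for ch in activation_maps]           # dominant activation
    return means + variances + maxima


# Two hypothetical channels with the same mean but different structure.
flat = [1.0, 1.0, 1.0, 1.0]
peaky = [0.0, 0.0, 0.0, 4.0]
print(enriched_features([flat, peaky]))
```

Both channels average to 1.0, so GAP alone cannot tell them apart, while the variance (0 vs. 3) and the maximum (1 vs. 4) separate them clearly.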

53. 【2601.22696】Bi-MCQ: Reformulating Vision-Language Alignment for Negation Understanding

Link: https://arxiv.org/abs/2601.22696

Authors: Tae Hun Kim, Hyun Gyu Lee

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: large-scale image-text pretraining, Recent vision-language models, achieve strong zero-shot, medical image analysis, Recent vision-language

Comments: 15 pages, 4 figures, Submitted to ICPR 2026 (under review)

Abstract:Recent vision-language models (VLMs) achieve strong zero-shot performance via large-scale image-text pretraining and have been widely adopted in medical image analysis. However, existing VLMs remain notably weak at understanding negated clinical statements, largely due to contrastive alignment objectives that treat negation as a minor linguistic variation rather than a meaning-inverting operator. In multi-label settings, prompt-based InfoNCE fine-tuning further reinforces easy-positive image-prompt alignments, limiting effective learning of disease absence. To overcome these limitations, we reformulate vision-language alignment as a conditional semantic comparison problem, which is instantiated through a bi-directional multiple-choice learning framework (Bi-MCQ). By jointly training Image-to-Text and Text-to-Image MCQ tasks with affirmative, negative, and mixed prompts, our method implements fine-tuning as conditional semantic comparison instead of global similarity maximization. We further introduce direction-specific Cross-Attention fusion modules to address asymmetric cues required by bi-directional reasoning and reduce alignment interference. Experiments on ChestXray14, Open-I, CheXpert, and PadChest show that Bi-MCQ improves negation understanding by up to 0.47 AUC over the zero-shot performance of the state-of-the-art CARZero model, while achieving up to a 0.08 absolute gain on positive-negative combined (PNC) evaluation. Additionally, Bi-MCQ reduces the affirmative-negative AUC gap by an average of 0.12 compared to InfoNCE-based fine-tuning, demonstrating that objective reformulation can substantially enhance negation understanding in medical VLMs.

54. 【2601.22693】PEAR: Pixel-aligned Expressive humAn mesh Recovery

Link: https://arxiv.org/abs/2601.22693

Authors: Jiahao Wu, Yunfei Liu, Lijian Lin, Ye Zhu, Lei Zhu, Jingyi Li, Yu Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Reconstructing detailed, image remains, computer vision, remains a fundamental, Reconstructing

Comments: 23 pages

Abstract:Reconstructing detailed 3D human meshes from a single in-the-wild image remains a fundamental challenge in computer vision. Existing SMPLX-based methods often suffer from slow inference, produce only coarse body poses, and exhibit misalignments or unnatural artifacts in fine-grained regions such as the face and hands. These issues make current approaches difficult to apply to downstream tasks. To address these challenges, we propose PEAR, a fast and robust framework for pixel-aligned expressive human mesh recovery. PEAR explicitly tackles three major limitations of existing methods: slow inference, inaccurate localization of fine-grained human pose details, and insufficient facial expression capture. Specifically, to enable real-time SMPLX parameter inference, we depart from prior designs that rely on high-resolution inputs or multi-branch architectures. Instead, we adopt a clean and unified ViT-based model capable of recovering coarse 3D human geometry. To compensate for the loss of fine-grained details caused by this simplified architecture, we introduce pixel-level supervision to optimize the geometry, significantly improving the reconstruction accuracy of fine-grained human details. To make this approach practical, we further propose a modular data annotation strategy that enriches the training data and enhances the robustness of the model. Overall, PEAR is a preprocessing-free framework that can simultaneously infer EHM-s (SMPLX and scaled-FLAME) parameters at over 100 FPS. Extensive experiments on multiple benchmark datasets demonstrate that our method achieves substantial improvements in pose estimation accuracy compared to previous SMPLX-based approaches. Project page: this https URL

55. 【2601.22685】OOVDet: Low-Density Prior Learning for Zero-Shot Out-of-Vocabulary Object Detection

Link: https://arxiv.org/abs/2601.22685

Authors: Binyi Su, Chenghao Huang, Haiyong Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: accurately recognize objects, simultaneously rejecting undefined, OOV, zero-shot OOV detector, aims to accurately

Abstract:Zero-shot out-of-vocabulary detection (ZS-OOVD) aims to accurately recognize objects of in-vocabulary (IV) categories provided at zero-shot inference, while simultaneously rejecting undefined ones (out-of-vocabulary, OOV) that lack corresponding category prompts. However, previous methods are prone to overfitting the IV classes, leading to the OOV or undefined classes being misclassified as IV ones with a high confidence score. To address this issue, this paper proposes a zero-shot OOV detector (OOVDet), a novel framework that effectively detects predefined classes while reliably rejecting undefined ones in zero-shot scenes. Specifically, due to the model's lack of prior knowledge about the distribution of OOV data, we synthesize region-level OOV prompts by sampling from the low-likelihood regions of the class-conditional Gaussian distributions in the hidden space, motivated by the assumption that unknown semantics are more likely to emerge in low-density areas of the latent space. For OOV images, we further propose a Dirichlet-based gradient attribution mechanism to mine pseudo-OOV image samples, where the attribution gradients are interpreted as Dirichlet evidence to estimate prediction uncertainty, and samples with high uncertainty are selected as pseudo-OOV images. Building on these synthesized OOV prompts and pseudo-OOV images, we construct the OOV decision boundary through a low-density prior constraint, which regularizes the optimization of OOV classes using Gaussian kernel density estimation in accordance with the above assumption. Experimental results show that our method significantly improves the OOV detection performance in zero-shot scenes. The code is available at this https URL.

56. 【2601.22680】Visual Personalization Turing Test

Link: https://arxiv.org/abs/2601.22680

Authors: Rameen Abdal, James Burgess, Sergey Tulyakov, Kuan-Chieh Jackson Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Personalization Turing Test, Visual Personalization Turing, visual personalization based, contextual visual personalization, Turing Test

Comments: Webpage: [this https URL](https://snap-research.github.io/vptt)

Abstract:We introduce the Visual Personalization Turing Test (VPTT), a new paradigm for evaluating contextual visual personalization based on perceptual indistinguishability, rather than identity replication. A model passes the VPTT if its output (image, video, 3D asset, etc.) is indistinguishable to a human or calibrated VLM judge from content a given person might plausibly create or share. To operationalize VPTT, we present the VPTT Framework, integrating a 10k-persona benchmark (VPTT-Bench), a visual retrieval-augmented generator (VPRAG), and the VPTT Score, a text-only metric calibrated against human and VLM judgments. We show high correlation across human, VLM, and VPTT evaluations, validating the VPTT Score as a reliable perceptual proxy. Experiments demonstrate that VPRAG achieves the best alignment-originality balance, offering a scalable and privacy-safe foundation for personalized generative AI.

57. 【2601.22679】Stabilizing Consistency Training: A Flow Map Analysis and Self-Distillation

Link: https://arxiv.org/abs/2601.22679

Authors: Youngjoong Kim, Duhoe Kim, Woosung Kim, Jaesik Park

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieving results competitive, fast generative modeling, generative modeling, achieving results, proposed for fast

Abstract:Consistency models have been proposed for fast generative modeling, achieving results competitive with diffusion and flow models. However, these methods exhibit inherent instability and limited reproducibility when training from scratch, motivating subsequent work to explain and stabilize these issues. While these efforts have provided valuable insights, the explanations remain fragmented, and the theoretical relationships remain unclear. In this work, we provide a theoretical examination of consistency models by analyzing them from a flow map-based perspective. This joint analysis clarifies how training stability and convergence behavior can give rise to degenerate solutions. Building on these insights, we revisit self-distillation as a practical remedy for certain forms of suboptimal convergence and reformulate it to avoid excessive gradient norms for stable optimization. We further demonstrate that our strategy extends beyond image generation to diffusion-based policy learning, without reliance on a pretrained diffusion model for initialization, thereby illustrating its broader applicability.

58. 【2601.22675】Fire on Motion: Optimizing Video Pass-bands for Efficient Spiking Action Recognition

Link: https://arxiv.org/abs/2601.22675

Authors: Shuhan Ye, Yuanbin Qian, Yi Yu, Chong Wang, Yuqi Xie, Jiazhen Xu, Kun Wang, Xudong Jiang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Spiking neural networks, neural networks, artificial neural networks, energy efficiency, gained traction

Abstract:Spiking neural networks (SNNs) have gained traction in vision due to their energy efficiency, bio-plausibility, and inherent temporal processing. Yet, despite this temporal capacity, most progress concentrates on static image benchmarks, and SNNs still underperform on dynamic video tasks compared to artificial neural networks (ANNs). In this work, we diagnose a fundamental pass-band mismatch: standard spiking dynamics behave as a temporal low-pass filter that emphasizes static content while attenuating motion-bearing bands, where task-relevant information concentrates in dynamic tasks. This phenomenon explains why SNNs can approach ANNs on static tasks yet fall behind on tasks that demand richer temporal this http URL. To remedy this, we propose the Pass-Bands Optimizer (PBO), a plug-and-play module that optimizes the temporal pass-band toward task-relevant motion bands. PBO introduces only two learnable parameters and a lightweight consistency constraint that preserves semantics and boundaries, incurring negligible computational overhead and requiring no architectural changes. PBO deliberately suppresses static components that contribute little to discrimination, effectively high-passing the stream so that spiking activity concentrates on motion-bearing content. On UCF101, PBO yields over ten percentage points improvement. On more complex multi-modal action recognition and weakly supervised video anomaly detection, PBO delivers consistent and significant gains, offering a new perspective for SNN-based video processing and understanding.

59. 【2601.22674】VisionTrim: Unified Vision Token Compression for Training-Free MLLM Acceleration

Link: https://arxiv.org/abs/2601.22674

Authors: Hanxun Yu, Wentong Li, Xuan Qu, Song Wang, Junbo Chen, Jianke Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large language models, high computational costs, computational costs due, Multimodal large language, language models

Comments: ICLR2026, Code Link: [this https URL](https://github.com/hanxunyu/VisionTrim)

Abstract:Multimodal large language models (MLLMs) suffer from high computational costs due to excessive visual tokens, particularly in high-resolution and video-based scenarios. Existing token reduction methods typically focus on isolated pipeline components and often neglect textual alignment, leading to performance degradation. In this paper, we propose VisionTrim, a unified framework for training-free MLLM acceleration, integrating two effective plug-and-play modules: 1) the Dominant Vision Token Selection (DVTS) module, which preserves essential visual tokens via a global-local view, and 2) the Text-Guided Vision Complement (TGVC) module, which facilitates context-aware token merging guided by textual cues. Extensive experiments across diverse image and video multimodal benchmarks demonstrate the performance superiority of our VisionTrim, advancing practical MLLM deployment in real-world applications. The code is available at: this https URL.

60. 【2601.22666】ExpAlign: Expectation-Guided Vision-Language Alignment for Open-Vocabulary Grounding

Link: https://arxiv.org/abs/2601.22666

Authors: Junyi Hu, Tian Bai, Fengyi Wu, Wenyan Li, Zhenming Peng, Yi Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: heavy cross-attention designs, grounding requires accurate, global sentence embeddings, lack fine-grained expressiveness, requires accurate vision-language

Comments: 20 pages, 6 figures

Abstract:Open-vocabulary grounding requires accurate vision-language alignment under weak supervision, yet existing methods either rely on global sentence embeddings that lack fine-grained expressiveness or introduce token-level alignment with explicit supervision or heavy cross-attention designs. We propose ExpAlign, a theoretically grounded vision-language alignment framework built on a principled multiple instance learning formulation. ExpAlign introduces an Expectation Alignment Head that performs attention-based soft MIL pooling over token-region similarities, enabling implicit token and instance selection without additional annotations. To further stabilize alignment learning, we develop an energy-based multi-scale consistency regularization scheme, including a Top-K multi-positive contrastive objective and a Geometry-Aware Consistency Objective derived from a Lagrangian-constrained free-energy minimization. Extensive experiments show that ExpAlign consistently improves open-vocabulary detection and zero-shot instance segmentation, particularly on long-tail categories. Most notably, it achieves 36.2 AP$_r$ on the LVIS minival split, outperforming other state-of-the-art methods at comparable model scale, while remaining lightweight and inference-efficient.
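
The "attention-based soft MIL pooling over token-region similarities" mentioned above can be pictured as a softmax-weighted expectation of similarity scores. The sketch below is one plausible reading, not the paper's Expectation Alignment Head: it assumes the attention logits are the similarities themselves divided by a temperature, and both the function name and the `temperature` parameter are hypothetical.

```python
import math


def soft_mil_pool(similarities, temperature=0.1):
    """Softmax-weighted expectation of instance similarities.

    High-similarity instances get most of the weight, so the pooled score
    lies between the plain mean and the maximum. Using the similarities as
    their own attention logits is an assumption of this sketch.
    """
    if not similarities:
        raise ValueError("similarities must not be empty")
    scaled = [s / temperature for s in similarities]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]
    return sum(w * s for w, s in zip(weights, similarities))


# Hypothetical similarities between one text token and three image regions.
print(soft_mil_pool([0.1, 0.9, 0.2]))
```

Lowering `temperature` pushes the result toward max pooling, while raising it pushes the result toward the plain average, which is why this kind of pooling can select instances implicitly without extra annotations.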

61. 【2601.22663】Unsupervised Synthetic Image Attribution: Alignment and Disentanglement

Link: https://arxiv.org/abs/2601.22663

Authors: Zongfang Liu, Guangyi Chen, Boyang Sun, Tongliang Liu, Kun Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: synthetic images improves, ensuring model transparency, identifying the underlying, increasingly crucial, crucial for copyright

Abstract:As the quality of synthetic images improves, identifying the underlying concepts of model-generated images is becoming increasingly crucial for copyright protection and ensuring model transparency. Existing methods achieve this attribution goal by training models using annotated pairs of synthetic images and their original training sources. However, obtaining such paired supervision is challenging, as it requires either well-designed synthetic concepts or precise annotations from millions of training sources. To eliminate the need for costly paired annotations, in this paper, we explore the possibility of unsupervised synthetic image attribution. We propose a simple yet effective unsupervised method called Alignment and Disentanglement. Specifically, we begin by performing basic concept alignment using contrastive self-supervised learning. Next, we enhance the model's attribution ability by promoting representation disentanglement with the Infomax loss. This approach is motivated by an interesting observation: contrastive self-supervised models, such as MoCo and DINO, inherently exhibit the ability to perform simple cross-domain alignment. By formulating this observation as a theoretical assumption on cross-covariance, we provide a theoretical explanation of how alignment and disentanglement can approximate the concept-matching process through a decomposition of the canonical correlation analysis objective. On the real-world benchmarks, AbC, we show that our unsupervised method surprisingly outperforms the supervised methods. As a starting point, we expect our intuitive insights and experimental findings to provide a fresh perspective on this challenging task.

62. 【2601.22634】What can Computer Vision learn from Ranganathan?

Link: https://arxiv.org/abs/2601.22634

Authors: Mayukh Bagchi, Fausto Giunchiglia

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Semantic Gap Problem, lexical semantics leading, Gap Problem, Computer Vision, Semantic Gap

Comments: Accepted @ DRTC-ISI Conference 2026, Indian Statistical Institute (ISI), Bangalore, India

Abstract:The Semantic Gap Problem (SGP) in Computer Vision (CV) arises from the misalignment between visual and lexical semantics, leading to flawed CV dataset design and CV benchmarks. This paper proposes that classification principles of S.R. Ranganathan can offer a principled starting point to address SGP and design high-quality CV datasets. We elucidate how these principles, suitably adapted, underpin the vTelos CV annotation methodology. The paper also briefly presents experimental evidence showing improvements in CV annotation and accuracy, thereby validating vTelos.

63. 【2601.22630】LINA: Linear Autoregressive Image Generative Models with Continuous Tokens

Link: https://arxiv.org/abs/2601.22630

Authors: Jiahao Wang, Ting Pan, Haoge Deng, Dongchen Han, Taiqiang Wu, Xinlong Wang, Ping Luo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high computational cost, continuous tokens form, computational cost, continuous tokens, tokens form

Comments: 20 pages, 9 figures

Abstract:Autoregressive models with continuous tokens form a promising paradigm for visual generation, especially for text-to-image (T2I) synthesis, but they suffer from high computational cost. We study how to design compute-efficient linear attention within this framework. Specifically, we conduct a systematic empirical analysis of scaling behavior with respect to parameter counts under different design choices, focusing on (1) normalization paradigms in linear attention (division-based vs. subtraction-based) and (2) depthwise convolution for locality augmentation. Our results show that although subtraction-based normalization is effective for image classification, division-based normalization scales better for linear generative transformers. In addition, incorporating convolution for locality modeling plays a crucial role in autoregressive generation, consistent with findings in diffusion models. We further extend gating mechanisms, commonly used in causal linear attention, to the bidirectional setting and propose a KV gate. By introducing data-independent learnable parameters to the key and value states, the KV gate assigns token-wise memory weights, enabling flexible memory management similar to forget gates in language models. Based on these findings, we present LINA, a simple and compute-efficient T2I model built entirely on linear attention, capable of generating high-fidelity 1024x1024 images from user instructions. LINA achieves competitive performance on both class-conditional and T2I benchmarks, obtaining 2.18 FID on ImageNet (about 1.4B parameters) and 0.74 on GenEval (about 1.5B parameters). A single linear attention module reduces FLOPs by about 61 percent compared to softmax attention. Code and models are available at: this https URL.
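
As background for the division-based normalization discussed above, here is a minimal sketch of bidirectional linear attention in pure Python. It is not LINA's implementation: the feature map `phi` (elu(x) + 1) is an assumed, commonly used choice, and the subtraction-based variant and the KV gate are not shown. The quadratic reference function is included only to check that the linear-cost ordering gives the same result.

```python
import math


def phi(x):
    """Positive feature map elu(x) + 1 (an assumed choice)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(queries, keys, values):
    """Division-normalized linear attention, linear in sequence length."""
    d, dv = len(keys[0]), len(values[0])
    kv = [[0.0] * dv for _ in range(d)]  # sum_j phi(k_j) v_j^T
    k_sum = [0.0] * d                    # sum_j phi(k_j)
    for k, v in zip(keys, values):
        fk = phi(k)
        for a in range(d):
            k_sum[a] += fk[a]
            for b in range(dv):
                kv[a][b] += fk[a] * v[b]
    outputs = []
    for q in queries:
        fq = phi(q)
        denom = sum(fq[a] * k_sum[a] for a in range(d))  # division-based norm
        outputs.append(
            [sum(fq[a] * kv[a][b] for a in range(d)) / denom for b in range(dv)]
        )
    return outputs


def quadratic_reference(queries, keys, values):
    """Same result computed pairwise, quadratic in sequence length."""
    outputs = []
    for q in queries:
        fq = phi(q)
        weights = [sum(a * b for a, b in zip(fq, phi(k))) for k in keys]
        total = sum(weights)
        outputs.append(
            [sum(w * v[b] for w, v in zip(weights, values)) / total
             for b in range(len(values[0]))]
        )
    return outputs


# Hypothetical toy inputs: 3 tokens, key/query dim 2, value dim 2.
Q = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
K = [[0.1, 0.4], [-0.7, 1.2], [0.9, -0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(linear_attention(Q, K, V))
```

Because the key-value summary `kv` and the key sum `k_sum` are computed once and reused for every query, the cost grows linearly with the number of tokens rather than quadratically.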

64. 【2601.22616】UniGeo: A Unified 3D Indoor Object Detection Framework Integrating Geometry-Aware Learning and Dynamic Channel Gating

Link: https://arxiv.org/abs/2601.22616

Authors: Xing Yi, Jinyang Huang, Feng-Qi Cui, Anyang Tong, Ruimin Wang, Liu Liu, Dan Guo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: driven considerable research, considerable research interest, object detection based, growing adoption, adoption of robotics

Abstract:The growing adoption of robotics and augmented reality in real-world applications has driven considerable research interest in 3D object detection based on point clouds. While previous methods address unified training across multiple datasets, they fail to model geometric relationships in sparse point cloud scenes and ignore the feature distribution in significant areas, which ultimately restricts their performance. To deal with this issue, a unified 3D indoor detection framework, called UniGeo, is proposed. To model geometric relations in scenes, we first propose a geometry-aware learning module that establishes a learnable mapping from spatial relationships to feature weights, which enables explicit geometric feature enhancement. Then, to further enhance point cloud feature representation, we propose a dynamic channel gating mechanism that leverages learnable channel-wise weighting. This mechanism adaptively optimizes features generated by the sparse 3D U-Net network, significantly enhancing key geometric information. Extensive experiments on six different indoor scene datasets clearly validate the superior performance of our method.

65. 【2601.22615】TTSA3R: Training-Free Temporal-Spatial Adaptive Persistent State for Streaming 3D Reconstruction

Link: https://arxiv.org/abs/2601.22615

Authors: Zhijie Zheng, Xinhao Xiang, Jiawei Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Streaming recurrent models, Streaming recurrent, persistent state representations, models enable efficient, enable efficient

Comments

Click to view abstract

Abstract:Streaming recurrent models enable efficient 3D reconstruction by maintaining persistent state representations. However, they suffer from catastrophic memory forgetting over long sequences due to balancing historical information with new observations. Recent methods alleviate this by deriving adaptive signals from attention perspective, but they operate on single dimensions without considering temporal and spatial consistency. To this end, we propose a training-free framework termed TTSA3R that leverages both temporal state evolution and spatial observation quality for adaptive state updates in 3D reconstruction. In particular, we devise a Temporal Adaptive Update Module that regulates update magnitude by analyzing temporal state evolution patterns. Then, a Spatial Contextual Update Module is introduced to localize spatial regions that require updates through observation-state alignment and scene dynamics. These complementary signals are finally fused to determine the state updating strategies. Extensive experiments demonstrate the effectiveness of TTSA3R in diverse 3D tasks. Moreover, our method exhibits only 15% error increase compared to over 200% degradation in baseline models on extended sequences, significantly improving long-term reconstruction stability. Our codes will be available soon.

66. 【2601.22596】FOTBCD: A Large-Scale Building Change Detection Benchmark from French Orthophotos and Topographic Data

Link: https://arxiv.org/abs/2601.22596

Authors: Abdelrrahman Moubane

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: authoritative French orthophotos, IGN France, authoritative French, French orthophotos, provided by IGN

Comments

Click to view abstract

Abstract:We introduce FOTBCD, a large-scale building change detection dataset derived from authoritative French orthophotos and topographic building data provided by IGN France. Unlike existing benchmarks that are geographically constrained to single cities or limited regions, FOTBCD spans 28 departments across mainland France, with 25 used for training and three geographically disjoint departments held out for evaluation. The dataset covers diverse urban, suburban, and rural environments at 0.2m/pixel resolution. We publicly release FOTBCD-Binary, a dataset comprising approximately 28,000 before/after image pairs with pixel-wise binary building change masks, each associated with patch-level spatial metadata. The dataset is designed for large-scale benchmarking and evaluation under geographic domain shift, with validation and test samples drawn from held-out departments and manually verified to ensure label quality. In addition, we publicly release FOTBCD-Instances, a publicly available instance-level annotated subset comprising several thousand image pairs, which illustrates the complete annotation schema used in the full instance-level version of FOTBCD. Using a fixed reference baseline, we benchmark FOTBCD-Binary against LEVIR-CD+ and WHU-CD, providing strong empirical evidence that geographic diversity at the dataset level is associated with improved cross-domain generalization in building change detection.
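
The geographically disjoint evaluation protocol described above (whole departments held out, rather than random image pairs) can be sketched as follows. The sample records, the `department` field, and the department codes are made up for illustration; the real FOTBCD metadata format is not given in the abstract.

```python
def split_by_department(samples, held_out):
    """Split samples so that no held-out department appears in training."""
    train = [s for s in samples if s["department"] not in held_out]
    evaluation = [s for s in samples if s["department"] in held_out]
    return train, evaluation


# Hypothetical records: each image pair carries the department it came from.
samples = [
    {"pair_id": 1, "department": "A"},
    {"pair_id": 2, "department": "B"},
    {"pair_id": 3, "department": "C"},
    {"pair_id": 4, "department": "A"},
]
train, evaluation = split_by_department(samples, held_out={"C"})
```

Splitting on the department key rather than on individual pairs is what keeps evaluation regions unseen during training, which is the point of measuring geographic domain shift.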

67. 【2601.22581】Cross-Domain Few-Shot Learning for Hyperspectral Image Classification Based on Mixup Foundation Model

Link: https://arxiv.org/abs/2601.22581

Authors: Naeem Paeedeh, Mahardhika Pratama, Ary Shiddiqi, Zehong Cao, Mukesh Prasad, Wisnu Jatmiko

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: unrealistic data augmentation, data augmentation procedure, cross-domain few-shot learning, significant research interest, attracted significant research

Comments

Click to view abstract

Abstract:Although cross-domain few-shot learning (CDFSL) for hyper-spectral image (HSI) classification has attracted significant research interest, existing works often rely on an unrealistic data augmentation procedure in the form of external noise to enlarge the sample size, thus greatly simplifying the issue of data scarcity. They involve a large number of parameters for model updates, being prone to the overfitting problem. To the best of our knowledge, none has explored the strength of the foundation model, having strong generalization power to be quickly adapted to downstream tasks. This paper proposes the MIxup FOundation MOdel (MIFOMO) for CDFSL of HSI classifications. MIFOMO is built upon the concept of a remote sensing (RS) foundation model, pre-trained across a large scale of RS problems, thus featuring generalizable features. The notion of coalescent projection (CP) is introduced to quickly adapt the foundation model to downstream tasks while freezing the backbone network. The concept of mixup domain adaptation (MDM) is proposed to address the extreme domain discrepancy problem. Last but not least, the label smoothing concept is implemented to cope with noisy pseudo-label problems. Our rigorous experiments demonstrate the advantage of MIFOMO, where it beats prior arts with up to 14% margin. The source code of MIFOMO is open-sourced in this https URL Paeedeh/MIFOMO for reproducibility and convenient further study.

68. 【2601.22575】PhoStream: Benchmarking Real-World Streaming for Omnimodal Assistants in Mobile Scenarios

Link: https://arxiv.org/abs/2601.22575

Authors: Xudong Lu, Huankang Guan, Yang Bo, Jinpeng Chen, Xintong Guo, Shuhan Li, Fang Liu, Peiwen Sun, Xueying Li, Wei Zhang, Xue Yang, Rui Liu, Hongsheng Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Multimodal Large Language, Large Language Models, streams remains underexplored, Multimodal Large, continuous real-world streams

Comments: 18 pages

Click to view abstract

Abstract:Multimodal Large Language Models excel at offline audio-visual understanding, but their ability to serve as mobile assistants in continuous real-world streams remains underexplored. In daily phone use, mobile assistants must track streaming audio-visual inputs and respond at the right time, yet existing benchmarks are often restricted to multiple-choice questions or use shorter videos. In this paper, we introduce PhoStream, the first mobile-centric streaming benchmark that unifies on-screen and off-screen scenarios to evaluate video, audio, and temporal reasoning. PhoStream contains 5,572 open-ended QA pairs from 578 videos across 4 scenarios and 10 capabilities. We build it with an Automated Generative Pipeline backed by rigorous human verification, and evaluate models using a realistic Online Inference Pipeline and LLM-as-a-Judge evaluation for open-ended responses. Experiments reveal a temporal asymmetry in LLM-judged scores (0-100): models perform well on Instant and Backward tasks (Gemini 3 Pro exceeds 80), but drop sharply on Forward tasks (16.40), largely due to early responses before the required visual and audio cues appear. This highlights a fundamental limitation: current MLLMs struggle to decide when to speak, not just what to say. Code and datasets used in this work will be made publicly accessible at this https URL.

69. 【2601.22574】Mitigating Hallucinations in Video Large Language Models via Spatiotemporal-Semantic Contrastive Decoding

Link: https://arxiv.org/abs/2601.22574

Authors: Yuansheng Gao, Jinman Zhao, Tong Zhang, Xingguo Xu, Han Bao, Zonghui Wang, Wenzhi Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Video Large Language, Large Language, Language Models perform, Models perform remarkably

Comments: Preprint

Click to view abstract

Abstract:Although Video Large Language Models perform remarkably well across tasks such as video understanding, question answering, and reasoning, they still suffer from the problem of hallucination, which refers to generating outputs that are inconsistent with explicit video content or factual evidence. However, existing decoding methods for mitigating video hallucinations, while considering the spatiotemporal characteristics of videos, mostly rely on heuristic designs. As a result, they fail to precisely capture the root causes of hallucinations and their fine-grained temporal and semantic correlations, leading to limited robustness and generalization in complex scenarios. To more effectively mitigate video hallucinations, we propose a novel decoding strategy termed Spatiotemporal-Semantic Contrastive Decoding. This strategy constructs negative features by deliberately disrupting the spatiotemporal consistency and semantic associations of video features, and suppresses video hallucinations through contrastive decoding against the original video features during inference. Extensive experiments demonstrate that our method not only effectively mitigates the occurrence of hallucinations, but also preserves the general video understanding and reasoning capabilities of the model.
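
For readers unfamiliar with contrastive decoding, the sketch below shows the generic form often used in the hallucination-mitigation literature: next-token logits from the original input are pushed away from logits obtained with a deliberately degraded input. The abstract does not give the paper's exact formula, so the combination rule, the weight `alpha`, and the toy logits are all assumptions.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(original, negative, alpha=1.0):
    """Amplify what the original input supports and subtract what the
    disrupted (negative) input still predicts. alpha is a hypothetical weight."""
    return [(1.0 + alpha) * o - alpha * n for o, n in zip(original, negative)]


# Toy vocabulary of three tokens.
original = [2.0, 1.0, 0.5]  # logits conditioned on the real video features
negative = [2.0, 0.0, 0.5]  # logits conditioned on disrupted video features
adjusted = contrastive_logits(original, negative, alpha=1.0)
probs = softmax(adjusted)
```

Token 0 scores the same with and without intact video evidence, so it gains nothing, while token 1, which depends on the real video, is boosted relative to it.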

70. 【2601.22573】DELNet: Continuous All-in-One Weather Removal via Dynamic Expert Library

Link: https://arxiv.org/abs/2601.22573

Authors: Shihong Liu, Kun Zuo, Hanguang Xiao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: weather image restoration, leading to high, weather image, image restoration, valuable in practice

Comments: Accepted by the ICASSP conference, not yet officially published

Click to view abstract

Abstract: All-in-one weather image restoration methods are valuable in practice but depend on pre-collected data and require retraining for unseen degradations, leading to high cost. We propose DELNet, a continual learning framework for weather image restoration. DELNet integrates a judging valve that measures task similarity to distinguish new from known tasks, and a dynamic expert library that stores experts trained on different degradations. For new tasks, the valve selects top-k experts for knowledge transfer while adding new experts to capture task-specific features; for known tasks, the corresponding experts are directly reused. This design enables continuous optimization without retraining existing models. Experiments on OTS, Rain100H, and Snow100K demonstrate that DELNet surpasses state-of-the-art continual learning methods, achieving PSNR gains of 16%, 11%, and 12%, respectively. These results highlight the effectiveness, robustness, and efficiency of DELNet, which reduces retraining cost and enables practical deployment in real-world scenarios.
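
The routing logic described above (a similarity-based "valve" that either reuses a known expert or picks the top-k experts for transfer) might look roughly like this. Cosine similarity, the threshold value, the task embeddings, and the function `route_task` are illustrative assumptions; the abstract does not specify how DELNet measures task similarity.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def route_task(task_embedding, library, threshold=0.9, k=2):
    """library maps expert name -> embedding of the task it was trained on.

    Returns ("known", [expert]) when one expert is similar enough to reuse,
    otherwise ("new", top-k experts to transfer knowledge from).
    """
    scored = sorted(
        ((cosine(task_embedding, emb), name) for name, emb in library.items()),
        reverse=True,
    )
    best_score, best_name = scored[0]
    if best_score >= threshold:
        return "known", [best_name]
    return "new", [name for _, name in scored[:k]]


library = {"rain": [1.0, 0.0, 0.0], "snow": [0.0, 1.0, 0.0], "haze": [0.0, 0.0, 1.0]}
known_case = route_task([1.0, 0.0, 0.0], library)
new_case = route_task([2.0, 1.0, 0.0], library)
```

In the second call the best similarity is about 0.89, below the assumed threshold, so the task is treated as new and the two closest experts are returned as transfer sources; a full system would also add a fresh expert to the library at that point.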

71. 【2601.22570】Leveraging Data to Say No: Memory Augmented Plug-and-Play Selective Prediction

Link: https://arxiv.org/abs/2601.22570

Authors: Aditya Sarkar, Yi Li, Jiacheng Cheng, Shlok Mishra, Nuno Vasconcelos

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: avoid low confidence, low confidence predictions, Selective prediction aims, Selective prediction, aims to endow

Comments: ICLR 2026

Click to view abstract

Abstract:Selective prediction aims to endow predictors with a reject option, to avoid low confidence predictions. However, existing literature has primarily focused on closed-set tasks, such as visual question answering with predefined options or fixed-category classification. This paper considers selective prediction for visual language foundation models, addressing a taxonomy of tasks ranging from closed to open set and from finite to unbounded vocabularies, as in image captioning. We seek training-free approaches of low-complexity, applicable to any foundation model and consider methods based on external vision-language model embeddings, like CLIP. This is denoted as Plug-and-Play Selective Prediction (PaPSP). We identify two key challenges: (1) instability of the visual-language representations, leading to high variance in image-text embeddings, and (2) poor calibration of similarity scores. To address these issues, we propose a memory augmented PaPSP (MA-PaPSP) model, which augments PaPSP with a retrieval dataset of image-text pairs. This is leveraged to reduce embedding variance by averaging retrieved nearest-neighbor pairs and is complemented by the use of contrastive normalization to improve score calibration. Through extensive experiments on multiple datasets, we show that MA-PaPSP outperforms PaPSP and other selective prediction baselines for selective captioning, image-text matching, and fine-grained classification. Code is publicly available at this https URL.

72. 【2601.22551】Hybrid Cross-Device Localization via Neural Metric Learning and Feature Fusion

Link: https://arxiv.org/abs/2601.22551

Authors: Meixia Lin, Mingkai Liu, Shuxue Peng, Dikai Fan, Shengyu Gu, Xianliang Huang, Haoyang Ye, Xiao Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: hybrid cross-device localization, cross-device localization pipeline, localization pipeline developed, present a hybrid, hybrid cross-device

Comments: 3 pages

Click to view abstract

Abstract:We present a hybrid cross-device localization pipeline developed for the CroCoDL 2025 Challenge. Our approach integrates a shared retrieval encoder and two complementary localization branches: a classical geometric branch using feature fusion and PnP, and a neural feed-forward branch (MapAnything) for metric localization conditioned on geometric inputs. A neural-guided candidate pruning strategy further filters unreliable map frames based on translation consistency, while depth-conditioned localization refines metric scale and translation precision on Spot scenes. These components jointly lead to significant improvements in recall and accuracy across both HYDRO and SUCCU benchmarks. Our method achieved a final score of 92.62 (R@0.5m, 5°) during the challenge.
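
The reported score uses a recall-at-threshold metric, R@0.5m, 5°: the share of query images localized within 0.5 m translation error and 5° rotation error. A minimal sketch of that computation is below; the error values are invented, and whether the challenge uses inclusive bounds or how it aggregates scenes into the final score is an assumption here.

```python
def recall_at(errors, max_trans_m=0.5, max_rot_deg=5.0):
    """errors: list of (translation_error_m, rotation_error_deg) per query.
    Returns the percentage of queries within both thresholds."""
    hits = sum(1 for t, r in errors if t <= max_trans_m and r <= max_rot_deg)
    return 100.0 * hits / len(errors)


# Hypothetical per-query pose errors.
errors = [(0.1, 1.0), (0.4, 6.0), (0.6, 2.0), (0.3, 4.9)]
score = recall_at(errors)
```

A query counts as a hit only if both thresholds are met, so the second query (rotation too large) and the third (translation too large) are misses.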

73. 【2601.22529】SHED Light on Segmentation for Dense Prediction

Link: https://arxiv.org/abs/2601.22529

Authors: Seung Hyun Lee, Sangwoo Mo, Stella X. Yu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: prediction infers per-pixel, Dense prediction infers, perception and robotics, infers per-pixel, single image

Comments

Click to view abstract

Abstract:Dense prediction infers per-pixel values from a single image and is fundamental to 3D perception and robotics. Although real-world scenes exhibit strong structure, existing methods treat it as an independent pixel-wise prediction, often resulting in structural inconsistencies. We propose SHED, a novel encoder-decoder architecture that enforces geometric prior explicitly by incorporating segmentation into dense prediction. By bidirectional hierarchical reasoning, segment tokens are hierarchically pooled in the encoder and unpooled in the decoder to reverse the hierarchy. The model is supervised only at the final output, allowing the segment hierarchy to emerge without explicit segmentation supervision. SHED improves depth boundary sharpness and segment coherence, while demonstrating strong cross-domain generalization from synthetic to the real-world environments. Its hierarchy-aware decoder better captures global 3D scene layouts, leading to improved semantic segmentation performance. Moreover, SHED enhances 3D reconstruction quality and reveals interpretable part-level structures that are often missed by conventional pixel-wise methods.

74. 【2601.22522】Can 3D point cloud data improve automated body condition score prediction in dairy cattle?

Link: https://arxiv.org/abs/2601.22522

Authors: Zhou Tang, Jin Wang, Angelo De Castro, Yuxi Zhang, Victoria Bastos Primo, Ana Beatriz Montevecchio Bernardino, Gota Morota, Xu Wang, Ricardo C Chebel, Haipeng Yu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: body energy status, conventional visual scoring, Body condition score, body energy, BCS prediction

Comments

Click to view abstract

Abstract:Body condition score (BCS) is a widely used indicator of body energy status and is closely associated with metabolic status, reproductive performance, and health in dairy cattle; however, conventional visual scoring is subjective and labor-intensive. Computer vision approaches have been applied to BCS prediction, with depth images widely used because they capture geometric information independent of coat color and texture. More recently, three-dimensional point cloud data have attracted increasing interest due to their ability to represent richer geometric characteristics of animal morphology, but direct head-to-head comparisons with depth image-based approaches remain limited. In this study, we compared top-view depth image and point cloud data for BCS prediction under four settings: 1) unsegmented raw data, 2) segmented full-body data, 3) segmented hindquarter data, and 4) handcrafted feature data. Prediction models were evaluated using data from 1,020 dairy cows collected on a commercial farm, with cow-level cross-validation to prevent data leakage. Depth image-based models consistently achieved higher accuracy than point cloud-based models when unsegmented raw data and segmented full-body data were used, whereas comparable performance was observed when segmented hindquarter data were used. Both depth image and point cloud approaches showed reduced accuracy when handcrafted feature data were employed compared with the other settings. Overall, point cloud-based predictions were more sensitive to noise and model architecture than depth image-based predictions. Taken together, these results indicate that three-dimensional point clouds do not provide a consistent advantage over depth images for BCS prediction in dairy cattle under the evaluated conditions.

75. 【2601.22515】DNA: Uncovering Universal Latent Forgery Knowledge

Link: https://arxiv.org/abs/2601.22515

Authors: Jingtong Dou, Chuancheng Shi, Yemin Wang, Shiming Guo, Anqi Yi, Wenhua Wu, Li Zhang, Fei Shen, Tat-Seng Chua

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: superficial artifact detection, superficial artifact, artifact detection, achieves hyper-realism, Abstract

Comments

Click to view abstract

Abstract:As generative AI achieves hyper-realism, superficial artifact detection has become obsolete. While prevailing methods rely on resource-intensive fine-tuning of black-box backbones, we propose that forgery detection capability is already encoded within pre-trained models rather than requiring end-to-end retraining. To elicit this intrinsic capability, we propose the discriminative neural anchors (DNA) framework, which employs a coarse-to-fine excavation mechanism. First, by analyzing feature decoupling and attention distribution shifts, we pinpoint critical intermediate layers where the focus of the model logically transitions from global semantics to local anomalies. Subsequently, we introduce a triadic fusion scoring metric paired with a curvature-truncation strategy to strip away semantic redundancy, precisely isolating the forgery-discriminative units (FDUs) inherently imprinted with sensitivity to forgery traces. Moreover, we introduce HIFI-Gen, a high-fidelity synthetic benchmark built upon the very latest models, to address the lag in existing datasets. Experiments demonstrate that by solely relying on these anchors, DNA achieves superior detection performance even under few-shot conditions. Furthermore, it exhibits remarkable robustness across diverse architectures and against unseen generative models, validating that waking up latent neurons is more effective than extensive fine-tuning.

76. 【2601.22508】CoVA: Text-Guided Composed Video Retrieval for Audio-Visual Content

Link: https://arxiv.org/abs/2601.22508

Authors: Gyuwon Han, Young Kyun Jang, Chanho Eom

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Composed Video Retrieval, aims to retrieve, retrieve a target, large gallery, introduce Composed retrieval

Comments: Please visit our project page at [this https URL](https://perceptualai-lab.github.io/CoVA/)

Click to view abstract

Abstract: Composed Video Retrieval (CoVR) aims to retrieve a target video from a large gallery using a reference video and a textual query specifying visual modifications. However, existing benchmarks consider only visual changes, ignoring videos that differ in audio despite visual similarity. To address this limitation, we introduce Composed retrieval for Video with its Audio (CoVA), a new retrieval task that accounts for both visual and auditory variations. To support this, we construct AV-Comp, a benchmark consisting of video pairs with cross-modal changes and corresponding textual queries that describe the differences. We also propose AVT Compositional Fusion (AVT), which integrates video, audio, and text features by selectively aligning the query to the most relevant modality. AVT outperforms traditional unimodal fusion and serves as a strong baseline for CoVA. Examples from the proposed dataset, including both visual and auditory information, are available at this https URL.

77. 【2601.22507】DreamVAR: Taming Reinforced Visual Autoregressive Model for High-Fidelity Subject-Driven Image Generation

Link: https://arxiv.org/abs/2601.22507

Authors: Xin Jiang, Jingwen Chen, Yehao Li, Yingwei Pan, Kezhou Chen, Zechao Li, Ting Yao, Tao Mei

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: attracted considerable attention, Recent advances, producing high-quality images, generation using diffusion, attracted considerable

Comments: Accepted By ICASSP 2026

Click to view abstract

Abstract:Recent advances in subject-driven image generation using diffusion models have attracted considerable attention for their remarkable capabilities in producing high-quality images. Nevertheless, the potential of Visual Autoregressive (VAR) models, despite their unified architecture and efficient inference, remains underexplored. In this work, we present DreamVAR, a novel framework for subject-driven image synthesis built upon a VAR model that employs next-scale prediction. Technically, multi-scale features of the reference subject are first extracted by a visual tokenizer. Instead of interleaving these conditional features with target image tokens across scales, our DreamVAR pre-fills the full subject feature sequence prior to predicting target image tokens. This design simplifies autoregressive dependencies and mitigates the train-test discrepancy in multi-scale conditioning scenario within the VAR paradigm. DreamVAR further incorporates reinforcement learning to jointly enhance semantic alignment and subject consistency. Extensive experiments demonstrate that DreamVAR achieves superior appearance preservation compared to leading diffusion-based methods.

78. 【2601.22501】MIRRORTALK: Forging Personalized Avatars Via Disentangled Style and Hierarchical Motion Control

Link: https://arxiv.org/abs/2601.22501

Authors: Renjie Lu, Xulong Zhang, Xiaoyang Qu, Jianzong Wang, Shangfei Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)

Keywords: Synthesizing personalized talking, personalized talking faces, Synthesizing personalized, speaker unique style, speaker unique persona

Comments: Accepted to 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026)

Click to view abstract

Abstract:Synthesizing personalized talking faces that uphold and highlight a speaker's unique style while maintaining lip-sync accuracy remains a significant challenge. A primary limitation of existing approaches is the intrinsic confounding of speaker-specific talking style and semantic content within facial motions, which prevents the faithful transfer of a speaker's unique persona to arbitrary speech. In this paper, we propose MirrorTalk, a generative framework based on a conditional diffusion model, combined with a Semantically-Disentangled Style Encoder (SDSE) that can distill pure style representations from a brief reference video. To effectively utilize this representation, we further introduce a hierarchical modulation strategy within the diffusion process. This mechanism guides the synthesis by dynamically balancing the contributions of audio and style features across distinct facial regions, ensuring both precise lip-sync accuracy and expressive full-face dynamics. Extensive experiments demonstrate that MirrorTalk achieves significant improvements over state-of-the-art methods in terms of lip-sync accuracy and personalization preservation.

79. 【2601.22492】PromptMAD: Cross-Modal Prompting for Multi-Class Visual Anomaly Localization

Link: https://arxiv.org/abs/2601.22492

Authors: Duncan McCain, Hossein Kashiani, Fatemeh Afghah

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: multi-class settings poses, settings poses significant, Visual anomaly detection, poses significant challenges, significant challenges due

Comments: Accepted to ICASSP 2026

Click to view abstract

Abstract: Visual anomaly detection in multi-class settings poses significant challenges due to the diversity of object categories, the scarcity of anomalous examples, and the presence of camouflaged defects. In this paper, we propose PromptMAD, a cross-modal prompting framework for unsupervised visual anomaly detection and localization that integrates semantic guidance through vision-language alignment. By leveraging CLIP-encoded text prompts describing both normal and anomalous class-specific characteristics, our method enriches visual reconstruction with semantic context, improving the detection of subtle and textural anomalies. To further address the challenge of class imbalance at the pixel level, we incorporate a focal loss function, which emphasizes hard-to-detect anomalous regions during training. Our architecture also includes a supervised segmentor that fuses multi-scale convolutional features with Transformer-based spatial attention and diffusion iterative refinement, yielding precise and high-resolution anomaly maps. Extensive experiments on the MVTec-AD dataset demonstrate that our method achieves state-of-the-art pixel-level performance, improving mean AUC to 98.35% and AP to 66.54%, while maintaining efficiency across diverse categories.

80. 【2601.22483】Head-Aware Visual Cropping: Enhancing Fine-Grained VQA with Attention-Guided Subimage

Link: https://arxiv.org/abs/2601.22483

Authors: Junfei Xie, Peng Pan, Xulong Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Visual Question Answering, Multimodal Large, Language Models

Comments: Accepted to 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026)

Click to view abstract

Abstract: Multimodal Large Language Models (MLLMs) show strong performance in Visual Question Answering (VQA) but remain limited in fine-grained reasoning due to low-resolution inputs and noisy attention aggregation. We propose Head Aware Visual Cropping (HAVC), a training-free method that improves visual grounding by leveraging a selectively refined subset of attention heads. HAVC first filters heads through an OCR-based diagnostic task, ensuring that only those with genuine grounding ability are retained. At inference, these heads are further refined using spatial entropy for stronger spatial concentration and gradient sensitivity for predictive contribution. The fused signals produce a reliable Visual Cropping Guidance Map, which highlights the most task-relevant region and guides the cropping of a subimage subsequently provided to the MLLM together with the image-question pair. Extensive experiments on multiple fine-grained VQA benchmarks demonstrate that HAVC consistently outperforms state-of-the-art cropping strategies, achieving more precise localization and stronger visual grounding, and providing a simple yet effective strategy for enhancing precision in MLLMs.
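
One simple way to read "spatial entropy for stronger spatial concentration" is the Shannon entropy of each head's normalized attention map: a head that attends to a small region has low entropy, a diffuse head has high entropy. The sketch below ranks heads that way. It is an assumption about the general idea only; HAVC's actual entropy definition and its gradient-sensitivity term are not described in the abstract, and `select_heads` is a hypothetical helper.

```python
import math


def attention_entropy(attn_map):
    """Shannon entropy (in nats) of a 2D attention map after normalization."""
    flat = [v for row in attn_map for v in row]
    total = sum(flat)
    probs = [v / total for v in flat if v > 0]
    return -sum(p * math.log(p) for p in probs)


def select_heads(head_maps, keep):
    """Return the names of the `keep` heads with the most concentrated attention."""
    ranked = sorted(head_maps, key=lambda name: attention_entropy(head_maps[name]))
    return ranked[:keep]


head_maps = {
    "focused": [[0.0, 0.0], [0.0, 1.0]],      # all attention on one patch
    "diffuse": [[0.25, 0.25], [0.25, 0.25]],  # attention spread evenly
    "partial": [[0.5, 0.5], [0.0, 0.0]],      # attention on two patches
}
selected = select_heads(head_maps, keep=2)
```

The uniform map reaches the maximum entropy for four patches, log 4, and is the one dropped when two heads are kept.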

81. 【2601.22468】Training-Free Representation Guidance for Diffusion Models with a Representation Alignment Projector

Link: https://arxiv.org/abs/2601.22468

Authors: Wenqiang Zu, Shenghao Xie, Bo Lei, Lei Ma

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Recent progress, supporting controllable sampling, enabled high-quality visual, diffusion-based frameworks, supporting controllable

Comments

Click to view abstract

Abstract:Recent progress in generative modeling has enabled high-quality visual synthesis with diffusion-based frameworks, supporting controllable sampling and large-scale training. Inference-time guidance methods such as classifier-free and representative guidance enhance semantic alignment by modifying sampling dynamics; however, they do not fully exploit unsupervised feature representations. Although such visual representations contain rich semantic structure, their integration during generation is constrained by the absence of ground-truth reference images at inference. This work reveals semantic drift in the early denoising stages of diffusion transformers, where stochasticity results in inconsistent alignment even under identical conditioning. To mitigate this issue, we introduce a guidance scheme using a representation alignment projector that injects representations predicted by a projector into intermediate sampling steps, providing an effective semantic anchor without modifying the model architecture. Experiments on SiTs and REPAs show notable improvements in class-conditional ImageNet synthesis, achieving substantially lower FID scores; for example, REPA-XL/2 improves from 5.9 to 3.3, and the proposed method outperforms representative guidance when applied to SiT models. The approach further yields complementary gains when combined with classifier-free guidance, demonstrating enhanced semantic coherence and visual fidelity. These results establish representation-informed diffusion sampling as a practical strategy for reinforcing semantic preservation and image consistency.

82. 【2601.22467】CARE: Multi-Task Pretraining for Latent Continuous Action Representation in Robot Control

Link: https://arxiv.org/abs/2601.22467

Authors: Jiaqi Shi, Xulong Zhang, Xiaoyang Qu, Jianzong Wang

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, shown promise, promise for robot, train VLA models, Recent

Comments: Accepted to 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026)

Click to view abstract

Abstract:Recent advances in Vision-Language-Action (VLA) models have shown promise for robot control, but their dependence on action supervision limits scalability and generalization. To address this challenge, we introduce CARE, a novel framework designed to train VLA models for robotic task execution. Unlike existing methods that depend on action annotations during pretraining, CARE eliminates the need for explicit action labels by leveraging only video-text pairs. These weakly aligned data sources enable the model to learn continuous latent action representations through a newly designed multi-task pretraining objective. During fine-tuning, a small set of labeled data is used to train the action head for control. Experimental results across various simulation tasks demonstrate CARE's superior success rate, semantic interpretability, and ability to avoid shortcut learning. These results underscore CARE's scalability, interpretability, and effectiveness in robotic control with weak supervision.

83. 【2601.22455】ScribbleSense: Generative Scribble-Based Texture Editing with Intent Prediction

Link: https://arxiv.org/abs/2601.22455

Authors: Yudi Zhang, Yeming Geng, Lei Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: presents enhanced opportunities, freehand drawing style, drawing style offering, editing presents enhanced, opportunities for creating

Comments: Accepted by IEEE TVCG. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses

Click to view abstract

Abstract:Interactive 3D model texture editing presents enhanced opportunities for creating 3D assets, with freehand drawing style offering the most intuitive experience. However, existing methods primarily support sketch-based interactions for outlining, while the utilization of coarse-grained scribble-based interaction remains limited. Furthermore, current methodologies often encounter challenges due to the abstract nature of scribble instructions, which can result in ambiguous editing intentions and unclear target semantic locations. To address these issues, we propose ScribbleSense, an editing method that combines multimodal large language models (MLLMs) and image generation models to effectively resolve these challenges. We leverage the visual capabilities of MLLMs to predict the editing intent behind the scribbles. Once the semantic intent of the scribble is discerned, we employ globally generated images to extract local texture details, thereby anchoring local semantics and alleviating ambiguities concerning the target semantic locations. Experimental results indicate that our method effectively leverages the strengths of MLLMs, achieving state-of-the-art interactive editing performance for scribble-based texture editing.

84. 【2601.22451】Countering the Over-Reliance Trap: Mitigating Object Hallucination for LVLMs via a Self-Validation Framework

Link: https://arxiv.org/abs/2601.22451

Authors: Shiyu Liu, Xinyi Wen, Zhibin Lan, Ante Wang, Jinsong Su

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large Vision Language, Vision Language Models, models generate descriptions, Large Vision, progress in Large

Comments: Code is available at [this https URL](https://github.com/Liushiyu-0709/SelfVal)

Abstract:Despite progress in Large Vision Language Models (LVLMs), object hallucination remains a critical issue in the image captioning task, where models generate descriptions of non-existent objects, compromising their reliability. Previous work attributes this to LVLMs' over-reliance on language priors and attempts to mitigate it through logits calibration. However, it still lacks a thorough analysis of the over-reliance. To gain a deeper understanding of over-reliance, we conduct a series of preliminary experiments, indicating that as the generation length increases, LVLMs' over-reliance on language priors leads to inflated probability of hallucinated object tokens, consequently exacerbating object hallucination. To circumvent this issue, we propose Language-Prior-Free Verification to enable LVLMs to faithfully verify the confidence of object existence. Based on this, we propose a novel training-free Self-Validation Framework to counter the over-reliance trap. It first validates objects' existence in sampled candidate captions and further mitigates object hallucination via caption selection or aggregation. Experimental results demonstrate that our framework mitigates object hallucination significantly in the image captioning task (e.g., 65.6% improvement on the CHAIR_I metric with LLaVA-v1.5-7B), surpassing the previous SOTA methods. This result highlights a novel path towards mitigating hallucination by unlocking the inherent potential within LVLMs themselves.
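
For readers unfamiliar with the CHAIR metric cited in this abstract, here is a minimal sketch of the instance-level and sentence-level scores as they are commonly defined: the share of mentioned objects that are not in the image, and the share of captions that contain at least one such object. The function name, the toy object lists, and the exact-string matching are assumptions for illustration; the published metric maps caption words onto a fixed object vocabulary with synonyms, which is not reproduced here.

```python
def chair_scores(mentioned_per_caption, present_per_image):
    """Return (chair_i, chair_s) for parallel lists of captions and images.

    mentioned_per_caption: list of lists, objects named in each caption.
    present_per_image: list of sets, objects actually present in each image.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0
    for mentioned, present in zip(mentioned_per_caption, present_per_image):
        hallucinated = [obj for obj in mentioned if obj not in present]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        if hallucinated:
            captions_with_hallucination += 1
    n_captions = len(mentioned_per_caption)
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = captions_with_hallucination / n_captions if n_captions else 0.0
    return chair_i, chair_s


# Toy data (hypothetical): the first caption mentions a frisbee that is absent.
captions = [["dog", "frisbee"], ["cat", "sofa", "tv"]]
images = [{"dog"}, {"cat", "sofa", "tv"}]
print(chair_scores(captions, images))
```

Lower values are better for both scores, so an "improvement" on this metric means a reduction.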

85. 【2601.22445】High-Definition 5MP Stereo Vision Sensing for Robotics

Link: https://arxiv.org/abs/2601.22445

Authors: Leaf Jiang, Matthew Holzel, Bernhard Kaplan, Hsiou-Yuan Liu, Sabyasachi Paul, Karen Rankin, Piotr Swierczynski

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: advancing robotic capabilities, generating significantly denser, stereo vision systems, robotic capabilities, enabling operation

Abstract:High-resolution (5MP+) stereo vision systems are essential for advancing robotic capabilities, enabling operation over longer ranges and generating significantly denser and more accurate 3D point clouds. However, realizing the full potential of high-angular-resolution sensors requires a commensurately higher level of calibration accuracy and faster processing -- requirements often unmet by conventional methods. This study addresses that critical gap by processing 5MP camera imagery using a novel, advanced frame-to-frame calibration and stereo matching methodology designed to achieve both high accuracy and speed. Furthermore, we introduce a new approach to evaluate real-time performance by comparing real-time disparity maps with ground-truth disparity maps derived from more computationally intensive stereo matching algorithms. Crucially, the research demonstrates that high-pixel-count cameras yield high-quality point clouds only through the implementation of high-accuracy calibration.
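
The link between sensor resolution, range, and calibration accuracy can be illustrated with the standard rectified pinhole stereo relation Z = f * B / d. The sketch below is not taken from the paper; the focal lengths, baseline, and disparity error are hypothetical values chosen only to show how depth uncertainty grows with the square of the range and shrinks with focal length in pixels.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in metres for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def depth_error(depth_m, focal_px, baseline_m, disparity_error_px):
    """First-order depth uncertainty: |dZ/dd| * delta_d = Z^2 / (f * B) * delta_d."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px


# Hypothetical rigs: same 0.1 m baseline, focal length doubled in pixels
# (for example, a higher-resolution sensor behind the same lens).
for focal_px in (2000.0, 4000.0):
    err = depth_error(20.0, focal_px, 0.1, 0.25)
    print(f"focal={focal_px:.0f}px -> depth error at 20 m: {err:.3f} m")
```

Because the error scales with the disparity error, any residual calibration error that shifts disparities by a fraction of a pixel translates directly into range error, which is consistent with the abstract's emphasis on calibration.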

86. 【2601.22443】Weak Diffusion Priors Can Still Achieve Strong Inverse-Problem Performance

Link: https://arxiv.org/abs/2601.22443

Authors: Jing Jia, Wei Yuan, Sifan Liu, Liyue Shen, Guanyang Wang

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Computation (stat.CO); Machine Learning (stat.ML)

Keywords: recover human faces, bedrooms recover human, diffusion model trained, model trained, human faces

Abstract:Can a diffusion model trained on bedrooms recover human faces? Diffusion models are widely used as priors for inverse problems, but standard approaches usually assume a high-fidelity model trained on data that closely match the unknown signal. In practice, one often must use a mismatched or low-fidelity diffusion prior. Surprisingly, these weak priors often perform nearly as well as full-strength, in-domain baselines. We study when and why inverse solvers are robust to weak diffusion priors. Through extensive experiments, we find that weak priors succeed when measurements are highly informative (e.g., many observed pixels), and we identify regimes where they fail. Our theory, based on Bayesian consistency, gives conditions under which high-dimensional measurements make the posterior concentrate near the true signal. These results provide a principled justification for when weak diffusion priors can be used reliably.
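
The intuition that informative measurements can override a mismatched prior shows up already in a scalar Gaussian toy model. This sketch is an assumption-laden illustration, not the paper's setting: it uses a conjugate Gaussian prior in place of a diffusion prior, and all names and numbers are hypothetical.

```python
import math


def gaussian_posterior(prior_mean, prior_std, obs_mean, noise_std, n_obs):
    """Posterior over a scalar x given n_obs noisy measurements y_i = x + noise.

    Prior: x ~ N(prior_mean, prior_std^2); noise ~ N(0, noise_std^2).
    obs_mean is the average of the n_obs measurements.
    """
    prior_precision = 1.0 / prior_std ** 2
    data_precision = n_obs / noise_std ** 2
    precision = prior_precision + data_precision
    mean = (prior_precision * prior_mean + data_precision * obs_mean) / precision
    return mean, math.sqrt(1.0 / precision)


# A badly mismatched prior (centred at 10) while the measurements average 0.
for n_obs in (1, 9, 99):
    mean, std = gaussian_posterior(10.0, 1.0, 0.0, 1.0, n_obs)
    print(f"n={n_obs:3d}  posterior mean={mean:.2f}  posterior std={std:.2f}")
```

With one measurement the wrong prior pulls the estimate halfway to 10; with 99 measurements the posterior mean is 0.1 and the prior barely matters, which mirrors the "many observed pixels" regime described in the abstract.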

87. 【2601.22412】EMBC Special Issue: Calibrated Uncertainty for Trustworthy Clinical Gait Analysis Using Probabilistic Multiview Markerless Motion Capture

Link: https://arxiv.org/abs/2601.22412

Authors: Seth Donahue, Irina Djuraskovic, Kunal Shah, Fabian Sinz, Ross Chafetz, R. James Cotton

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Video-based human movement, human movement analysis, movement analysis holds, analysis holds potential, Video-based human

Comments: 9 pages, 5 figures, EMBS Special Issue

Abstract:Video-based human movement analysis holds potential for movement assessment in clinical practice and research. However, the clinical implementation and trust of multi-view markerless motion capture (MMMC) require that, in addition to being accurate, these systems produce reliable confidence intervals to indicate how accurate they are for any individual. Building on our prior work utilizing variational inference to estimate joint angle posterior distributions, this study evaluates the calibration and reliability of a probabilistic MMMC method. We analyzed data from 68 participants across two institutions, validating the model against an instrumented walkway and standard marker-based motion capture. We measured the calibration of the confidence intervals using the Expected Calibration Error (ECE). The model demonstrated reliable calibration, yielding ECE values generally below 0.1 for both step and stride length and bias-corrected gait kinematics. We observed a median step and stride length error of ~16 mm and ~12 mm respectively, with median bias-corrected kinematic errors ranging from 1.5 to 3.8 degrees across lower extremity joints. Consistent with the calibrated ECE, the magnitude of the model's predicted uncertainty correlated strongly with observed error measures. These findings indicate that, as designed, the probabilistic model reconstruction quantifies epistemic uncertainty, allowing it to identify unreliable outputs without the need for concurrent ground-truth instrumentation.
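
The abstract does not spell out how ECE is computed for confidence intervals, so the following is only one common way to score interval calibration, assuming Gaussian predictive distributions: for each nominal level p, compare p with the fraction of observations that fall inside the central p-interval, then average the absolute gaps. The function name, the level grid, and the synthetic data are hypothetical.

```python
from statistics import NormalDist


def interval_calibration_error(means, stds, observations, levels=None):
    """Mean absolute gap between nominal and empirical interval coverage."""
    if levels is None:
        levels = [i / 10 for i in range(1, 10)]
    # Predictive CDF evaluated at each observation (probability integral transform).
    pit = [NormalDist(m, s).cdf(y) for m, s, y in zip(means, stds, observations)]
    gaps = []
    for level in levels:
        # y lies in the central `level` interval iff its CDF value is within level/2 of 0.5.
        covered = sum(1 for u in pit if abs(u - 0.5) <= level / 2) / len(pit)
        gaps.append(abs(covered - level))
    return sum(gaps) / len(gaps)


# Synthetic check: observations placed at evenly spaced standard-normal quantiles.
n = 1000
ys = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
well_calibrated = interval_calibration_error([0.0] * n, [1.0] * n, ys)
overconfident = interval_calibration_error([0.0] * n, [0.5] * n, ys)
print(well_calibrated, overconfident)
```

A model that reports standard deviations half as large as they should be produces intervals that cover too few observations, so its score is much larger than that of the well-calibrated case.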

88. 【2601.22398】Jailbreaks on Vision Language Model via Multimodal Reasoning

Link: https://arxiv.org/abs/2601.22398

Authors: Aarush Noheria, Yuguang Yao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: visual question answering, question answering, Vision-language models, central to tasks, image captioning

Abstract:Vision-language models (VLMs) have become central to tasks such as visual question answering, image captioning, and text-to-image generation. However, their outputs are highly sensitive to prompt variations, which can reveal vulnerabilities in safety alignment. In this work, we present a jailbreak framework that exploits post-training Chain-of-Thought (CoT) prompting to construct stealthy prompts capable of bypassing safety filters. To further increase attack success rates (ASR), we propose a ReAct-driven adaptive noising mechanism that iteratively perturbs input images based on model feedback. This approach leverages the ReAct paradigm to refine adversarial noise in regions most likely to activate safety defenses, thereby enhancing stealth and evasion. Experimental results demonstrate that the proposed dual-strategy significantly improves ASR while maintaining naturalness in both text and visual domains.

89. 【2601.22376】FlexMap: Generalized HD Map Construction from Flexible Camera Configurations

Link: https://arxiv.org/abs/2601.22376

Authors: Run Wang, Chaoyi Zhou, Amir Salarpour, Xi Liu, Zhi-Qi Cheng, Feng Luo, Mert D. Pesé, Siyu Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: autonomous driving systems, provide essential semantic, essential semantic information, require calibrated multi-camera, calibrated multi-camera setups

Abstract:High-definition (HD) maps provide essential semantic information of road structures for autonomous driving systems, yet current HD map construction methods require calibrated multi-camera setups and either implicit or explicit 2D-to-BEV transformations, making them fragile when sensors fail or camera configurations vary across vehicle fleets. We introduce FlexMap. Unlike prior methods that are fixed to a specific N-camera rig, our approach adapts to variable camera configurations without any architectural changes or per-configuration retraining. Our key innovation eliminates explicit geometric projections by using a geometry-aware foundation model with cross-frame attention to implicitly encode 3D scene understanding in feature space. FlexMap features two core components: a spatial-temporal enhancement module that separates cross-view spatial reasoning from temporal dynamics, and a camera-aware decoder with latent camera tokens, enabling view-adaptive attention without the need for projection matrices. Experiments demonstrate that FlexMap outperforms existing methods across multiple configurations while maintaining robustness to missing views and sensor variations, enabling more practical real-world deployment.

90. 【2601.22301】Coarse-to-Real: Generative Rendering for Populated Dynamic Scenes

Link: https://arxiv.org/abs/2601.22301

Authors: Gonzalo Gomez-Nogales, Yicong Hong, Chongjian Ge, Marc Comino-Trinidad, Dan Casas, Yi Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Traditional rendering pipelines, substantial computational resources, rendering pipelines rely, Traditional rendering, populated dynamic scenes

Comments: Project website at [this https URL](https://gonzalognogales.github.io/coarse2real/)

Abstract:Traditional rendering pipelines rely on complex assets, accurate materials and lighting, and substantial computational resources to produce realistic imagery, yet they still face challenges in scalability and realism for populated dynamic scenes. We present C2R (Coarse-to-Real), a generative rendering framework that synthesizes real-style urban crowd videos from coarse 3D simulations. Our approach uses coarse 3D renderings to explicitly control scene layout, camera motion, and human trajectories, while a learned neural renderer generates realistic appearance, lighting, and fine-scale dynamics guided by text prompts. To overcome the lack of paired training data between coarse simulations and real videos, we adopt a two-phase mixed CG-real training strategy that learns a strong generative prior from large-scale real footage and introduces controllability through shared implicit spatio-temporal features across domains. The resulting system supports coarse-to-fine control, generalizes across diverse CG and game inputs, and produces temporally consistent, controllable, and realistic urban scene videos from minimal 3D input. We will release the model and project webpage at this https URL.

91. 【2601.22276】SurrogateSHAP: Training-Free Contributor Attribution for Text-to-Image (T2I) Models

Link: https://arxiv.org/abs/2601.22276

Authors: Mingyu Lu, Soham Gadgil, Chris Lin, Chanwoo Kim, Su-In Lee

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: real-world creative workflows, sustainable data marketplaces, creative workflows, real-world creative, provide a collection

Abstract:As Text-to-Image (T2I) diffusion models are increasingly used in real-world creative workflows, a principled framework for valuing contributors who provide a collection of data is essential for fair compensation and sustainable data marketplaces. While the Shapley value offers a theoretically grounded approach to attribution, it faces a dual computational bottleneck: (i) the prohibitive cost of exhaustive model retraining for each sampled subset of players (i.e., data contributors) and (ii) the combinatorial number of subsets needed to estimate marginal contributions due to contributor interactions. To this end, we propose SurrogateSHAP, a retraining-free framework that approximates the expensive retraining game through inference from a pretrained model. To further improve efficiency, we employ a gradient-boosted tree to approximate the utility function and derive Shapley values analytically from the tree-based model. We evaluate SurrogateSHAP across three diverse attribution tasks: (i) image quality for DDPM-CFG on CIFAR-20, (ii) aesthetics for Stable Diffusion on Post-Impressionist artworks, and (iii) product diversity for FLUX.1 on Fashion-Product data. Across settings, SurrogateSHAP outperforms prior methods while substantially reducing computational overhead, consistently identifying influential contributors across multiple utility metrics. Finally, we demonstrate that SurrogateSHAP effectively localizes data sources responsible for spurious correlations in clinical images, providing a scalable path toward auditing safety-critical generative models.
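
To make the combinatorial bottleneck in point (ii) concrete, here is a small exact Shapley computation over every subset of players. It is a textbook enumeration, not SurrogateSHAP; the toy utility function and contributor names are hypothetical, and in the paper's setting each utility call would correspond to an expensive model evaluation.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, utility):
    """Exact Shapley values; cost grows as 2^n utility evaluations per player."""
    n = len(players)
    values = {}
    for player in players:
        others = [p for p in players if p != player]
        total = 0.0
        for size in range(n):
            # Probability that `player` joins right after a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                total += weight * (utility(coalition | {player}) - utility(coalition))
        values[player] = total
    return values


BASE = {"a": 1.0, "b": 2.0, "c": 3.0}  # hypothetical stand-alone contributions


def toy_utility(coalition):
    """Additive utility plus a bonus of 2 when contributors a and b are both present."""
    bonus = 2.0 if {"a", "b"} <= coalition else 0.0
    return sum(BASE[p] for p in coalition) + bonus


print(shapley_values(["a", "b", "c"], toy_utility))
```

The interaction bonus is split evenly between `a` and `b`, and the values sum to the utility of the full coalition. With three players this needs only a handful of evaluations, but the subset count doubles with every added contributor.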

92. 【2601.22275】VMonarch: Efficient Video Diffusion Transformers with Structured Attention

Link: https://arxiv.org/abs/2601.22275

Authors: Cheng Liang, Haoxian Chen, Liang Hou, Qi Fan, Gangshan Wu, Xin Tao, Limin Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Video Diffusion Transformers, Diffusion Transformers, Video Diffusion, mechanism severely limits, Video DiTs

Abstract:The quadratic complexity of the attention mechanism severely limits the context scalability of Video Diffusion Transformers (DiTs). We find that the highly sparse spatio-temporal attention patterns exhibited in Video DiTs can be naturally represented by the Monarch matrix. It is a class of structured matrices with flexible sparsity, enabling sub-quadratic attention via an alternating minimization algorithm. Accordingly, we propose VMonarch, a novel attention mechanism for Video DiTs that enables efficient computation over the dynamic sparse patterns with structured Monarch matrices. First, we adapt spatio-temporal Monarch factorization to explicitly capture the intra-frame and inter-frame correlations of the video data. Second, we introduce a recomputation strategy to mitigate artifacts arising from instabilities during alternating minimization of Monarch matrices. Third, we propose a novel online entropy algorithm fused into FlashAttention, enabling fast Monarch matrix updates for long sequences. Extensive experiments demonstrate that VMonarch achieves comparable or superior generation quality to full attention on VBench after minimal tuning. It overcomes the attention bottleneck in Video DiTs, reduces attention FLOPs by a factor of 17.5, and achieves a speedup of over 5x in attention computation for long videos, surpassing state-of-the-art sparse attention methods at 90% sparsity.

93. 【2601.22244】Is Hierarchical Quantization Essential for Optimal Reconstruction?

Link: https://arxiv.org/abs/2601.22244

Authors: Shirin Reyhanian, Laurenz Wiskott

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Vector-quantized variational autoencoders, Vector-quantized variational, variational autoencoders, generative pipelines, high reconstruction fidelity

Comments: To appear in the Proceedings of ICPRAM 2026. Code available at: [this https URL](https://github.com/wiskott-lab/single-vs-hier-recon)

Abstract:Vector-quantized variational autoencoders (VQ-VAEs) are central to models that rely on high reconstruction fidelity, from neural compression to generative pipelines. Hierarchical extensions, such as VQ-VAE2, are often credited with superior reconstruction performance because they split global and local features across multiple levels. However, since higher levels derive all their information from lower levels, they should not carry additional reconstructive content beyond what the lower-level already encodes. Combined with recent advances in training objectives and quantization mechanisms, this leads us to ask whether a single-level VQ-VAE, with matched representational budget and no codebook collapse, can equal the reconstruction fidelity of its hierarchical counterpart. Although the multi-scale structure of hierarchical models may improve perceptual quality in downstream tasks, the effect of hierarchy on reconstruction accuracy, isolated from codebook utilization and overall representational capacity, remains empirically underexamined. We revisit this question by comparing a two-level VQ-VAE and a capacity-matched single-level model on high-resolution ImageNet images. Consistent with prior observations, we confirm that inadequate codebook utilization limits single-level VQ-VAEs and that overly high-dimensional embeddings destabilize quantization and increase codebook collapse. We show that lightweight interventions such as initialization from data, periodic reset of inactive codebook vectors, and systematic tuning of codebook hyperparameters significantly reduce collapse. Our results demonstrate that when representational budgets are matched, and codebook collapse is mitigated, single-level VQ-VAEs can match the reconstruction fidelity of hierarchical variants, challenging the assumption that hierarchical quantization is inherently superior for high-quality reconstructions.
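
Two of the mechanisms mentioned in this abstract, nearest-codebook quantization and periodic reset of inactive codebook vectors, can be sketched in a few lines. This is a generic illustration under simple assumptions (plain lists, squared Euclidean distance, reset to a randomly chosen encoder output); it is not the authors' implementation, and all function names are hypothetical.

```python
import random


def nearest_code(vector, codebook):
    """Index of the codebook entry closest to `vector` in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vector, codebook[i])),
    )


def quantize(batch, codebook):
    """Assign every vector in the batch to its nearest codebook index."""
    return [nearest_code(vector, codebook) for vector in batch]


def reset_inactive_codes(batch, codebook, assignments, rng):
    """Replace codes that received no assignments with random vectors from the batch."""
    used = set(assignments)
    new_codebook = [list(code) for code in codebook]
    for index in range(len(codebook)):
        if index not in used:
            new_codebook[index] = list(rng.choice(batch))
    return new_codebook


batch = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
codebook = [[0.0, 0.0], [100.0, 100.0], [1.0, 1.0]]  # the middle code is never selected
assignments = quantize(batch, codebook)
codebook = reset_inactive_codes(batch, codebook, assignments, random.Random(0))
print(assignments, codebook)
```

Moving a dead code onto real data gives it a chance to be selected in later steps, which is the basic idea behind reset strategies for reducing codebook collapse.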

94. 【2601.22231】Geometry without Position? When Positional Embeddings Help and Hurt Spatial Reasoning

Link: https://arxiv.org/abs/2601.22231

Authors: Jian Shi, Michael Birsak, Wenqing Cui, Zhenyu Li, Peter Wonka

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: positional embeddings, vision transformers, paper revisits, geometric perspective, revisits the role

Abstract:This paper revisits the role of positional embeddings (PEs) within vision transformers (ViTs) from a geometric perspective. We show that PEs are not mere token indices but effectively function as geometric priors that shape the spatial structure of the representation. We introduce token-level diagnostics that measure how multi-view geometric consistency in ViT representations depends on consistent PEs. Through extensive experiments on 14 foundation ViT models, we reveal how PEs influence multi-view geometry and spatial reasoning. Our findings clarify the role of PEs as a causal mechanism that governs spatial structure in ViT representations. Our code is provided in this https URL

95. 【2601.22228】Lost in Space? Vision-Language Models Struggle with Relative Camera Pose Estimation

Link: https://arxiv.org/abs/2601.22228

Authors: Ken Deng, Yifu Qiu, Yoni Kasten, Shay B. Cohen, Yftah Ziser

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: semantic reasoning compared, perception and semantic, limited understanding, relative camera, Vision-Language Models

Abstract:Vision-Language Models (VLMs) perform well in 2D perception and semantic reasoning compared to their limited understanding of 3D spatial structure. We investigate this gap using relative camera pose estimation (RCPE), a fundamental vision task that requires inferring relative camera translation and rotation from a pair of images. We introduce VRRPI-Bench, a benchmark derived from unlabeled egocentric videos with verbalized annotations of relative camera motion, reflecting realistic scenarios with simultaneous translation and rotation around a shared object. We further propose VRRPI-Diag, a diagnostic benchmark that isolates individual motion degrees of freedom. Despite the simplicity of RCPE, most VLMs fail to generalize beyond shallow 2D heuristics, particularly for depth changes and roll transformations along the optical axis. Even state-of-the-art models such as GPT-5 (0.64) fall short of classic geometric baselines (0.97) and human performance (0.92). Moreover, VLMs exhibit difficulty in multi-image reasoning, with inconsistent performance (best 59.7%) when integrating spatial cues across frames. Our findings reveal limitations in grounding VLMs in 3D and multi-view spatial reasoning.
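
For context on what a relative camera pose is, the sketch below composes two camera extrinsics into the rotation and translation that map coordinates in the first camera frame to the second. It assumes the world-to-camera convention x_cam = R * x_world + t; the benchmark's own conventions and verbalized labels are not reproduced, and the helper names are hypothetical.

```python
import math


def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]


def transform(rotation, translation, point):
    """Apply x -> R x + t to a 3-vector."""
    return [sum(rotation[i][k] * point[k] for k in range(3)) + translation[i] for i in range(3)]


def rot_y(angle_rad):
    """Rotation about the camera's vertical (y) axis."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def relative_pose(r1, t1, r2, t2):
    """Return (R_rel, t_rel) such that x_cam2 = R_rel * x_cam1 + t_rel."""
    r_rel = matmul(r2, transpose(r1))
    rotated_t1 = transform(r_rel, [0.0, 0.0, 0.0], t1)
    t_rel = [t2[i] - rotated_t1[i] for i in range(3)]
    return r_rel, t_rel


# Two hypothetical cameras looking at the same world point.
r1, t1 = rot_y(0.2), [0.1, 0.0, 0.5]
r2, t2 = rot_y(-0.4), [-0.3, 0.2, 1.0]
point_world = [0.5, -0.2, 3.0]
r_rel, t_rel = relative_pose(r1, t1, r2, t2)
in_cam1 = transform(r1, t1, point_world)
print(transform(r_rel, t_rel, in_cam1), transform(r2, t2, point_world))
```

The two printed vectors agree, confirming that the relative pose maps camera-1 coordinates to camera-2 coordinates without reference to the world frame.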

96. 【2601.22218】What Lies Beneath: A Call for Distribution-based Visual Question Answer Datasets

Link: https://arxiv.org/abs/2601.22218

Authors: Jill P. Naiman, Daniel J. Evans, JooYoung Seo

Categories: Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)

Keywords: Visual Question Answering, Visual Question, Question Answering, VQA, VQA datasets

Comments: Accepted to ACM/IEEE Joint Conference on Digital Libraries JCDL 2025, 4 pages, 2 figures

Abstract:Visual Question Answering (VQA) has become an important benchmark for assessing how large multimodal models (LMMs) interpret images. However, most VQA datasets focus on real-world images or simple diagrammatic analysis, with few focused on interpreting complex scientific charts. Indeed, many VQA datasets that analyze charts do not contain the underlying data behind those charts or assume a 1-to-1 correspondence between chart marks and underlying data. In reality, charts are transformations (i.e. analysis, simplification, modification) of data. This distinction introduces a reasoning challenge in VQA that the current datasets do not capture. In this paper, we argue for a dedicated VQA benchmark for scientific charts where there is no 1-to-1 correspondence between chart marks and underlying data. To do so, we survey existing VQA datasets and highlight limitations of the current field. We then generate synthetic histogram charts based on ground truth data, and ask both humans and a large reasoning model questions where precise answers depend on access to the underlying data. We release the open-source dataset, including figures, underlying data, distribution parameters used to generate the data, and bounding boxes for all figure marks and text for future research.

97. 【2601.22164】Do Open-Vocabulary Detectors Transfer to Aerial Imagery? A Comparative Evaluation

Link: https://arxiv.org/abs/2601.22164

Authors: Christos Tsourveloudis

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: Open-vocabulary object detection, Open-vocabulary object, enables zero-shot recognition, achieving strong performance, object detection

Abstract:Open-vocabulary object detection (OVD) enables zero-shot recognition of novel categories through vision-language models, achieving strong performance on natural images. However, transferability to aerial imagery remains unexplored. We present the first systematic benchmark evaluating five state-of-the-art OVD models on the LAE-80C aerial dataset (3,592 images, 80 categories) under strict zero-shot conditions. Our experimental protocol isolates semantic confusion from visual localization through Global, Oracle, and Single-Category inference modes. Results reveal severe domain transfer failure: the best model (OWLv2) achieves only 27.6% F1-score with a 69% false positive rate. Critically, reducing vocabulary size from 80 to 3.2 classes yields a 15x improvement, demonstrating that semantic confusion is the primary bottleneck. Prompt engineering strategies, such as domain-specific prefixing and synonym expansion, fail to provide meaningful performance gains. Performance varies dramatically across datasets (F1: 0.53 on DIOR, 0.12 on FAIR1M), exposing brittleness to imaging conditions. These findings establish baseline expectations and highlight the need for domain-adaptive approaches in aerial OVD.

98. 【2601.22161】Attention Isn't All You Need for Emotion Recognition: Domain Features Outperform Transformers on the EAV Dataset

Link: https://arxiv.org/abs/2601.22161

Authors: Anmol Guragain

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: EAV dataset, mechanisms improve performance, complex attention mechanisms, attention mechanisms improve, investigating whether complex

Abstract:We present a systematic study of multimodal emotion recognition using the EAV dataset, investigating whether complex attention mechanisms improve performance on small datasets. We implement three model categories: baseline transformers (M1), novel factorized attention mechanisms (M2), and improved CNN baselines (M3). Our experiments show that sophisticated attention mechanisms consistently underperform on small datasets. M2 models scored 5 to 13 percentage points below baselines due to overfitting and destruction of pretrained features. In contrast, simple domain-appropriate modifications proved effective: adding delta MFCCs to the audio CNN improved accuracy from 61.9% to 65.56% (+3.66pp), while frequency-domain features for EEG achieved 67.62% (+7.62pp over the paper baseline). Our vision transformer baseline (M1) reached 75.30%, exceeding the paper's ViViT result (74.5%) through domain-specific pretraining, and vision delta features achieved 72.68% (+1.28pp over the paper CNN). These findings demonstrate that for small-scale emotion recognition, domain knowledge and proper implementation outperform architectural complexity.
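
Delta features, as in the "delta MFCCs" mentioned in this abstract, are local slopes of each coefficient over time that get appended to the static features. The sketch below uses the widely used regression formula with edge replication; the window width, function names, and toy input are assumptions, and the paper's exact feature pipeline is not specified in the abstract.

```python
def delta_features(frames, width=2):
    """Regression-based deltas: sum_n n * (c[t+n] - c[t-n]) / (2 * sum_n n^2).

    frames: list of per-frame coefficient lists (e.g. MFCC vectors).
    Frames beyond either end are replaced by the nearest valid frame.
    """
    n_frames = len(frames)
    denominator = 2 * sum(n * n for n in range(1, width + 1))

    def frame_at(t):
        return frames[min(max(t, 0), n_frames - 1)]

    deltas = []
    for t in range(n_frames):
        row = []
        for k in range(len(frames[t])):
            numerator = sum(
                n * (frame_at(t + n)[k] - frame_at(t - n)[k]) for n in range(1, width + 1)
            )
            row.append(numerator / denominator)
        deltas.append(row)
    return deltas


def append_deltas(frames, width=2):
    """Concatenate static coefficients with their deltas, frame by frame."""
    return [static + delta for static, delta in zip(frames, delta_features(frames, width))]


# Toy "coefficient" that rises by one per frame: interior deltas equal the slope.
ramp = [[float(t)] for t in range(6)]
print(delta_features(ramp))
print(append_deltas(ramp)[2])
```

On the ramp the interior deltas are exactly 1.0, while the first and last frames are smaller because of the edge replication.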

99. 【2601.23276】Denoising the Deep Sky: Physics-Based CCD Noise Formation for Astronomical Imaging

Link: https://arxiv.org/abs/2601.23276

Authors: Shuhong Liu, Xining Ge, Ziying Gu, Lin Gu, Ziteng Cui, Xuangeng Chu, Jun Liu, Dong Li, Tatsuya Harada

Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Astronomical imaging remains, practical observing constraints, imaging remains noise-limited, remove structured artifacts, Astronomical imaging

Abstract:Astronomical imaging remains noise-limited under practical observing constraints, while standard calibration pipelines mainly remove structured artifacts and leave stochastic noise largely unresolved. Learning-based denoising is promising, yet progress is hindered by scarce paired training data and the need for physically interpretable and reproducible models in scientific workflows. We propose a physics-based noise synthesis framework tailored to CCD noise formation. The pipeline models photon shot noise, photo-response non-uniformity, dark-current noise, readout effects, and localized outliers arising from cosmic-ray hits and hot pixels. To obtain low-noise inputs for synthesis, we average multiple unregistered exposures to produce high-SNR bases. Realistic noisy counterparts synthesized from these bases using our noise model enable the construction of abundant paired datasets for supervised learning. We further introduce a real-world dataset across multi-bands acquired with two twin ground-based telescopes, providing paired raw frames and instrument-pipeline calibrated frames, together with calibration data and stacked high-SNR bases for real-world evaluation.

100. 【2601.23201】Scale-Cascaded Diffusion Models for Super-Resolution in Medical Imaging

Link: https://arxiv.org/abs/2601.23201

Authors: Darshan Thaker, Mahmoud Mostapha, Radu Miron, Shihan Qiu, Mariappan Nadar

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: solving inverse problems, strong generative priors, strong generative, solving inverse, inverse problems

Comments: Accepted at IEEE International Symposium for Biomedical Imaging (ISBI) 2026

Abstract:Diffusion models have been increasingly used as strong generative priors for solving inverse problems such as super-resolution in medical imaging. However, these approaches typically utilize a diffusion prior trained at a single scale, ignoring the hierarchical scale structure of image data. In this work, we propose to decompose images into Laplacian pyramid scales and train separate diffusion priors for each frequency band. We then develop an algorithm to perform super-resolution that utilizes these priors to progressively refine reconstructions across different scales. Evaluated on brain, knee, and prostate MRI data, our approach both improves perceptual quality over baselines and reduces inference time through smaller coarse-scale networks. Our framework unifies multiscale reconstruction and diffusion priors for medical image super-resolution.
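
The decomposition into Laplacian pyramid scales can be illustrated on a 1D signal. This is a simplified sketch, not the paper's pipeline: it uses pairwise averaging and sample repetition in place of the Gaussian filtering of a classical Laplacian pyramid, works on lists instead of images, and assumes the signal length is divisible by 2 raised to the number of levels.

```python
def downsample(signal):
    """Halve the resolution by averaging neighbouring pairs."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]


def upsample(signal):
    """Double the resolution by repeating every sample."""
    return [value for value in signal for _ in range(2)]


def build_pyramid(signal, levels):
    """Return (bands, coarse): per-scale detail bands, finest first, plus the coarsest signal."""
    bands = []
    current = list(signal)
    for _ in range(levels):
        coarse = downsample(current)
        bands.append([a - b for a, b in zip(current, upsample(coarse))])
        current = coarse
    return bands, current


def reconstruct(bands, coarse):
    """Invert build_pyramid by adding detail bands back from coarse to fine."""
    current = list(coarse)
    for band in reversed(bands):
        current = [u + d for u, d in zip(upsample(current), band)]
    return current


signal = [1.0, 3.0, 2.0, 6.0, 5.0, 5.0, 0.0, 4.0]
bands, coarse = build_pyramid(signal, levels=2)
print(coarse, bands)
print(reconstruct(bands, coarse))
```

Each band isolates detail at one scale, and the coarse-to-fine reconstruction loop has the same shape as progressive refinement across scales, with a separate model conceivably handling each band.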

101. 【2601.23103】Vision-Language Controlled Deep Unfolding for Joint Medical Image Restoration and Segmentation

Link: https://arxiv.org/abs/2601.23103

Authors: Ping Chen, Zicheng Huang, Xiangming Wang, Yungeng Liu, Bingyu Liang, Haijin Zeng, Yongyong Chen

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Medical Image Restoration, Medical Image, low-level signal recovery, high-level semantic understanding, Image Restoration

Comments: 18 pages, medical image

Abstract:We propose VL-DUN, a principled framework for joint All-in-One Medical Image Restoration and Segmentation (AiOMIRS) that bridges the gap between low-level signal recovery and high-level semantic understanding. While standard pipelines treat these tasks in isolation, our core insight is that they are fundamentally synergistic: restoration provides clean anatomical structures to improve segmentation, while semantic priors regularize the restoration process. VL-DUN resolves the sub-optimality of sequential processing through two primary innovations. (1) We formulate AiOMIRS as a unified optimization problem, deriving an interpretable joint unfolding mechanism where restoration and segmentation are mathematically coupled for mutual refinement. (2) We introduce a frequency-aware Mamba mechanism to capture long-range dependencies for global segmentation while preserving the high-frequency textures necessary for restoration. This allows for efficient global context modeling with linear complexity, effectively mitigating the spectral bias of standard architectures. As a pioneering work in the AiOMIRS task, VL-DUN establishes a new state-of-the-art across multi-modal benchmarks, improving PSNR by 0.92 dB and the Dice coefficient by 9.76%. Our results demonstrate that joint collaborative learning offers a superior, more robust solution for complex clinical workflows compared to isolated task processing. The codes are provided in this https URL.

102. 【2601.23037】Scale Equivariance Regularization and Feature Lifting in High Dynamic Range Modulo Imaging

Link: https://arxiv.org/abs/2601.23037

Authors: Brayan Monroy, Jorge Bacca

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: high dynamic range, artificial wrap discontinuities, imaging enables high, enables high dynamic, accurate reconstruction remains

Abstract:Modulo imaging enables high dynamic range (HDR) acquisition by cyclically wrapping saturated intensities, but accurate reconstruction remains challenging due to ambiguities between natural image edges and artificial wrap discontinuities. This work proposes a learning-based HDR restoration framework that incorporates two key strategies: (i) a scale-equivariant regularization that enforces consistency under exposure variations, and (ii) a feature lifting input design combining the raw modulo image, wrapped finite differences, and a closed-form initialization. Together, these components enhance the network's ability to distinguish true structure from wrapping artifacts, yielding state-of-the-art performance across perceptual and linear HDR quality metrics.
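
A 1D toy example shows what cyclic wrapping does and why wrapped finite differences are a useful input. It uses the classical observation that re-wrapping the differences of the modulo signal recovers the true differences whenever neighbouring samples change by less than half the wrap threshold. This is not the paper's network or its closed-form initialization; the threshold and signal values are hypothetical.

```python
def wrap(value, threshold):
    """Modulo sensor model: intensities fold back into [0, threshold)."""
    return value % threshold


def centered_wrap(difference, threshold):
    """Fold a difference into [-threshold / 2, threshold / 2)."""
    return (difference + threshold / 2) % threshold - threshold / 2


def wrapped_differences(modulo_signal, threshold):
    """Finite differences of the modulo signal, re-wrapped to remove jump artifacts."""
    return [
        centered_wrap(modulo_signal[i] - modulo_signal[i - 1], threshold)
        for i in range(1, len(modulo_signal))
    ]


def unwrap_1d(modulo_signal, threshold):
    """Integrate wrapped differences, assuming the first sample is not wrapped."""
    recovered = [modulo_signal[0]]
    for diff in wrapped_differences(modulo_signal, threshold):
        recovered.append(recovered[-1] + diff)
    return recovered


true_signal = [0.2, 0.5, 0.9, 1.3, 1.6, 1.3, 0.9]  # exceeds the threshold of 1.0
measured = [wrap(v, 1.0) for v in true_signal]
print(measured)
print(unwrap_1d(measured, 1.0))
```

The recovery breaks as soon as a genuine edge is larger than half the threshold, which is the ambiguity between natural edges and wrap discontinuities that the abstract describes.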

103. 【2601.22878】Development of Domain-Invariant Visual Enhancement and Restoration (DIVER) Approach for Underwater Images

Link: https://arxiv.org/abs/2601.22878

Authors: Rajini Makam, Sharanya Patil, Dhatri Shankari T M, Suresh Sundaram, Narasimhan Sundararajan

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: suffer severe degradation, severe degradation due, images suffer severe, Underwater images suffer, underwater image enhancement

Comments: Submitted to IEEE Journal of Oceanic Engineering

Abstract:Underwater images suffer severe degradation due to wavelength-dependent attenuation, scattering, and illumination non-uniformity that vary across water types and depths. We propose an unsupervised Domain-Invariant Visual Enhancement and Restoration (DIVER) framework that integrates empirical correction with physics-guided modeling for robust underwater image enhancement. DIVER first applies either IlluminateNet for adaptive luminance enhancement or a Spectral Equalization Filter for spectral normalization. An Adaptive Optical Correction Module then refines hue and contrast using channel-adaptive filtering, while Hydro-OpticNet employs physics-constrained learning to compensate for backscatter and wavelength-dependent attenuation. The parameters of IlluminateNet and Hydro-OpticNet are optimized via unsupervised learning using a composite loss function. DIVER is evaluated on eight diverse datasets covering shallow, deep, and highly turbid environments, including both naturally low-light and artificially illuminated scenes, using reference and non-reference metrics. While state-of-the-art methods such as WaterNet, UDNet, and Phaseformer perform reasonably in shallow water, their performance degrades in deep, unevenly illuminated, or artificially lit conditions. In contrast, DIVER consistently achieves best or near-best performance across all datasets, demonstrating strong domain-invariant capability. DIVER yields at least a 9% improvement over SOTA methods in UCIQE. On the low-light SeaThru dataset, where color-palette references enable direct evaluation of color restoration, DIVER achieves at least a 4.9% reduction in GPMAE compared to existing methods. Beyond visual quality, DIVER also improves robotic perception by enhancing ORB-based keypoint repeatability and matching performance, confirming its robustness across diverse underwater environments.

104. 【2601.22732】Active Learning-Driven Lightweight YOLOv9: Enhancing Efficiency in Smart Agriculture

Link: https://arxiv.org/abs/2601.22732

Authors: Hung-Chih Tu, Bo-Syun Chen, Yun-Chien Cheng

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: agricultural robots deployed, greenhouse environments, study addresses, addresses the demand, demand for real-time

Abstract:This study addresses the demand for real-time detection of tomatoes and tomato flowers by agricultural robots deployed on edge devices in greenhouse environments. Under practical imaging conditions, object detection systems often face challenges such as large scale variations caused by varying camera distances, severe occlusion from plant structures, and highly imbalanced class distributions. These factors make it difficult for conventional object detection approaches that rely on fully annotated datasets to achieve high detection accuracy and deployment efficiency simultaneously. To overcome these limitations, this research proposes an active-learning-driven lightweight object detection framework that integrates data analysis, model design, and training strategy. First, the size distribution of objects in raw agricultural images is analyzed to redefine an operational target range, thereby improving learning stability under real-world conditions. Second, an efficient feature extraction module is incorporated to reduce computational cost, while a lightweight attention mechanism is introduced to enhance feature representation under multi-scale and occluded scenarios. Finally, an active learning strategy is employed to iteratively select high-information samples for annotation and training under a limited labeling budget, effectively improving the recognition performance of minority and small-object categories. Experimental results demonstrate that, while maintaining a low parameter count and inference cost suitable for edge-device deployment, the proposed method effectively improves the detection performance of tomatoes and tomato flowers in raw images. Under limited annotation conditions, the framework achieves an overall detection accuracy of 67.8% mAP, validating its practicality and feasibility for intelligent agricultural applications.
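The step of selecting high-information samples under a limited labeling budget can be illustrated with a small sketch. The abstract does not say which acquisition criterion the authors use, so the entropy-based uncertainty ranking below is an assumption, and the function names and image ids are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of one predicted class distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_annotation(predictions, budget):
    """Pick the `budget` unlabeled images whose predictions are most uncertain.

    predictions: dict mapping a (hypothetical) image id to class probabilities.
    """
    ranked = sorted(predictions, key=lambda k: entropy(predictions[k]), reverse=True)
    return ranked[:budget]


# Toy pool of unlabeled images with predicted [tomato, flower] probabilities.
pool = {
    "img_a": [0.98, 0.02],  # confident prediction, low value for annotation
    "img_b": [0.55, 0.45],  # highly uncertain
    "img_c": [0.70, 0.30],  # moderately uncertain
}
print(select_for_annotation(pool, budget=2))  # → ['img_b', 'img_c']
```

In an iterative loop, the selected images would be annotated, added to the training set, and the model retrained before the next round of selection.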

105. 【2601.22637】Training Beyond Convergence: Grokking nnU-Net for Glioma Segmentation in Sub-Saharan MRI

Link: https://arxiv.org/abs/2601.22637

Authors: Mohtady Barakat, Omar Salah, Ahmed Yasser, Mostafa Ahmed, Zahirul Arief, Waleed Khan, Dong Zhang, Aondona Iorumbur, Confidence Raymond, Mohannad Barakat, Noha Magdy

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: increasingly clinical burden, placing an increasingly, increasingly clinical, clinical burden, burden on Sub-Saharan

Abstract:Gliomas are placing an increasing clinical burden on Sub-Saharan Africa (SSA). In the region, the median survival for patients remains under two years, and access to diagnostic imaging is extremely limited. These constraints highlight an urgent need for automated tools that can extract the maximum possible information from each available scan, tools that are specifically trained on local data rather than adapted from high-income settings where conditions are vastly different. We utilize the Brain Tumor Segmentation (BraTS) Africa 2025 Challenge dataset, an expert-annotated collection of glioma MRIs. Our objectives are to: (i) establish a strong baseline with nnUNet on this dataset, and (ii) explore whether the celebrated "grokking" phenomenon (an abrupt, late-training jump from memorization to superior generalization) can be triggered to push performance without extra labels. We evaluate two training regimes. The first is a fast, budget-conscious approach that limits optimization to just a few epochs, reflecting the constrained GPU resources typically available in African institutions. Despite this limitation, nnUNet achieves strong Dice scores: 92.3% for whole tumor (WH), 86.6% for tumor core (TC), and 86.3% for enhancing tumor (ET). The second regime extends training well beyond the point of convergence, aiming to trigger a grokking-driven performance leap. With this approach, we were able to achieve grokking, reaching Dice scores of 92.2% for whole tumor (WH), 90.1% for tumor core (TC), and 90.2% for enhancing tumor (ET).
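For readers unfamiliar with the metric reported above, the Dice score between a predicted mask P and a ground-truth mask T is 2|P ∩ T| / (|P| + |T|). The sketch below computes it for flattened binary masks. The function name is hypothetical, and returning 1.0 when both masks are empty is a convention assumed here, not something stated in the abstract.

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example: 3 predicted voxels, 3 true voxels, 2 of them overlapping.
pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
score = dice_score(pred, target)  # 2 * 2 / (3 + 3) = 2/3
```

In a multi-region setting such as the one in the abstract, the score would be computed separately for each region's mask.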

106. 【2601.22576】Bonnet: Ultra-fast whole-body bone segmentation from CT scans

Link: https://arxiv.org/abs/2601.22576

Authors: Hanjiang Zhu, Pedro Martelleto Rezende, Zhang Yang, Tong Ye, Bruce Z. Gao, Feng Luo, Siyu Huang, Jiancheng Yang

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: ultra-fast sparse-volume pipeline, work proposes Bonnet, whole-body bone segmentation, Accurate bone segmentation, work proposes

Comments: 5 pages, 2 figures. Accepted for publication at the 2026 IEEE International Symposium on Biomedical Imaging (ISBI 2026)

Abstract:This work proposes Bonnet, an ultra-fast sparse-volume pipeline for whole-body bone segmentation from CT scans. Accurate bone segmentation is important for surgical planning and anatomical analysis, but existing 3D voxel-based models such as nnU-Net and STU-Net require heavy computation and often take several minutes per scan, which limits time-critical use. The proposed Bonnet addresses this by integrating a series of novel framework components including HU-based bone thresholding, patch-wise inference with a sparse spconv-based U-Net, and multi-window fusion into a full-volume prediction. Trained on TotalSegmentator and evaluated without additional tuning on RibSeg, CT-Pelvic1K, and CT-Spine1K, Bonnet achieves high Dice across ribs, pelvis, and spine while running in only 2.69 seconds per scan on an RTX A6000. Compared to strong voxel baselines, Bonnet attains a similar accuracy but reduces inference time by roughly 25x on the same hardware and tiling setup. The toolkit and pre-trained models will be released at this https URL.
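The first stage named in the abstract, HU-based bone thresholding, can be sketched as follows: keep only voxels whose Hounsfield unit (HU) value is at or above a cutoff, and pass their coordinates on as a sparse volume. This is an illustration of the general idea rather than Bonnet's implementation. The function name and the cutoff of 200 HU are assumptions, since the abstract gives no threshold.

```python
def sparse_bone_candidates(volume_hu, hu_threshold=200):
    """Return (z, y, x) coordinates and HU values of voxels at or above the threshold.

    volume_hu: nested lists indexed as volume_hu[z][y][x], holding HU values.
    hu_threshold: hypothetical cutoff; the abstract does not state the value used.
    """
    coords, values = [], []
    for z, plane in enumerate(volume_hu):
        for y, row in enumerate(plane):
            for x, hu in enumerate(row):
                if hu >= hu_threshold:
                    coords.append((z, y, x))
                    values.append(hu)
    return coords, values


# Tiny 2x2x2 toy volume: air (-1000), soft tissue (about 50), and denser voxels.
volume = [
    [[-1000, 50], [300, 1200]],
    [[150, 250], [-50, 700]],
]
coords, values = sparse_bone_candidates(volume)
print(coords)  # → [(0, 1, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
```

Because bone occupies a small share of a whole-body CT, a coordinate list like this is far smaller than the dense volume, which is what makes sparse convolution attractive for the later stages.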

107. 【2601.22537】EndoCaver: Handling Fog, Blur and Glare in Endoscopic Images via Joint Deblurring-Segmentation

Link: https://arxiv.org/abs/2601.22537

Authors: Zhuoyu Wu, Wenhui Ou, Pei-Sze Tan, Jiayan Yang, Wenqi Fang, Zheng Wang, Raphaël C.-W. Phan

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Endoscopic image analysis, colorectal cancer screening, automated polyp detection, severely compromise automated, compromise automated polyp

Comments: Accepted for publication at IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2026

Abstract:Endoscopic image analysis is vital for colorectal cancer screening, yet real-world conditions often suffer from lens fogging, motion blur, and specular highlights, which severely compromise automated polyp detection. We propose EndoCaver, a lightweight transformer with a unidirectional-guided dual-decoder architecture, enabling joint multi-task capability for image deblurring and segmentation while significantly reducing computational complexity and model parameters. Specifically, it integrates a Global Attention Module (GAM) for cross-scale aggregation, a Deblurring-Segmentation Aligner (DSA) to transfer restoration cues, and a cosine-based scheduler (LoCoS) for stable multi-task optimisation. Experiments on the Kvasir-SEG dataset show that EndoCaver achieves 0.922 Dice on clean data and 0.889 under severe image degradation, surpassing state-of-the-art methods while reducing model parameters by 90%. These results demonstrate its efficiency and robustness, making it well-suited for on-device clinical deployment. Code is available at this https URL.

108. 【2601.22202】A Survey on Semantic Communication for Vision: Categories, Frameworks, Enabling Techniques, and Applications

Link: https://arxiv.org/abs/2601.22202

Authors: Runze Cheng, Yao Sun, Ahmad Taha, Xuesong Liu, David Flynn, Muhammad Ali Imran

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: visual data transmission, traffic-intensive visual data, meaningful content transmission, visual data, shifting focus

Abstract:Semantic communication (SemCom) emerges as a transformative paradigm for traffic-intensive visual data transmission, shifting focus from raw data to meaningful content transmission and relieving the increasing pressure on communication resources. However, to achieve SemCom, challenges are faced in accurate semantic quantization for visual data, robust semantic extraction and reconstruction under diverse tasks and goals, transceiver coordination with effective knowledge utilization, and adaptation to unpredictable wireless communication environments. In this paper, we present a systematic review of SemCom for visual data transmission (SemCom-Vision), wherein an interdisciplinary analysis integrating computer vision (CV) and communication engineering is conducted to provide comprehensive guidelines for the machine learning (ML)-empowered SemCom-Vision design. Specifically, this survey first elucidates the basics and key concepts of SemCom. Then, we introduce a novel classification perspective to categorize existing SemCom-Vision approaches as semantic preservation communication (SPC), semantic expansion communication (SEC), and semantic refinement communication (SRC) based on communication goals interpreted through semantic quantization schemes. Moreover, this survey articulates the ML-based encoder-decoder models and training algorithms for each SemCom-Vision category, followed by knowledge structure and utilization strategies. Finally, we discuss potential SemCom-Vision applications.

109. 【2601.22189】SCENE: Semantic-aware Codec Enhancement with Neural Embeddings

Link: https://arxiv.org/abs/2601.22189

Authors: Han-Yu Lin, Li-Wei Chen, Hung-Shin Lee

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)

Keywords: degrade perceptual quality, Compression artifacts, Compression, perceptual quality, Abstract

Comments: Accepted to ICASSP 2026

Abstract:Compression artifacts from standard video codecs often degrade perceptual quality. We propose a lightweight, semantic-aware pre-processing framework that enhances perceptual fidelity by selectively addressing these distortions. Our method integrates semantic embeddings from a vision-language model into an efficient convolutional architecture, prioritizing the preservation of perceptually significant structures. The model is trained end-to-end with a differentiable codec proxy, enabling it to mitigate artifacts from various standard codecs without modifying the existing video pipeline. During inference, the codec proxy is discarded, and SCENE operates as a standalone pre-processor, enabling real-time performance. Experiments on high-resolution benchmarks show improved performance over baselines in both objective (MS-SSIM) and perceptual (VMAF) metrics, with notable gains in preserving detailed textures within salient regions. Our results show that semantic-guided, codec-aware pre-processing is an effective approach for enhancing compressed video streams.

110. 【2601.12526】Deep Lightweight Unrolled Network for High Dynamic Range Modulo Imaging

Link: https://arxiv.org/abs/2601.12526

Authors: Brayan Monroy, Jorge Bacca

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: offers a promising, saturation level, promising alternative, alternative for expanding, expanding the dynamic

Abstract:Modulo-Imaging (MI) offers a promising alternative for expanding the dynamic range of images by resetting the signal intensity when it reaches the saturation level. Subsequently, high-dynamic range (HDR) modulo imaging requires a recovery process to obtain the HDR image. MI is a non-convex and ill-posed problem where recent recovery networks suffer in high-noise scenarios. In this work, we formulate the HDR reconstruction task as an optimization problem that incorporates a deep prior and subsequently unrolls it into an optimization-inspired deep neural network. The network employs a lightweight convolutional denoiser for fast inference with minimal computational overhead, effectively recovering intensity values while mitigating noise. Moreover, we introduce the Scaling Equivariance term that facilitates self-supervised fine-tuning, thereby enabling the model to adapt to new modulo images that fall outside the original training distribution. Extensive evaluations demonstrate the superiority of our method compared to state-of-the-art recovery algorithms in terms of performance and quality.