本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。

统计

今日共更新1514篇论文,其中:

  • 自然语言处理247
  • 信息检索37
  • 计算机视觉318

自然语言处理

1. 【2601.14249】Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment

链接https://arxiv.org/abs/2601.14249

作者:Yuming Yang,Mingyoung Lai,Wanxu Zhao,Xiaoran Fan,Zhiheng Xi,Mingqi Wu,Chiyue Huang,Jun Zhao,Haijun Lv,Jian Tong,Yunhua Zhou,Yicheng Zou,Qipeng Guo,Tao Gui,Qi Zhang,Xuanjing Huang

类目:Computation and Language (cs.CL)

关键词:provide rich supervision, trajectories provide rich, rich supervision signals, provide rich, rich supervision

备注: 26 pages. Project page: [this https URL](https://github.com/UmeanNever/RankSurprisalRatio)

点击查看摘要

Abstract:Long chain-of-thought (CoT) trajectories provide rich supervision signals for distilling reasoning from teacher to student LLMs. However, both prior work and our experiments show that trajectories from stronger teachers do not necessarily yield better students, highlighting the importance of data-student suitability in distillation. Existing methods assess suitability primarily through student likelihood, favoring trajectories that closely align with the model's current behavior but overlooking more informative ones. Addressing this, we propose Rank-Surprisal Ratio (RSR), a simple metric that captures both alignment and informativeness to assess the suitability of a reasoning trajectory. RSR is motivated by the observation that effective trajectories typically combine low absolute probability with relatively high-ranked tokens under the student model, balancing learning signal strength and behavioral alignment. Concretely, RSR is defined as the ratio of a trajectory's average token-wise rank to its average negative log-likelihood, and is straightforward to compute and interpret. Across five student models and reasoning trajectories from 11 diverse teachers, RSR strongly correlates with post-training performance (average Spearman 0.86), outperforming existing metrics. We further demonstrate its practical utility in both trajectory selection and teacher selection.

2. 【2601.14243】Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow

链接https://arxiv.org/abs/2601.14243

作者:Haocheng Xi,Charlie Ruan,Peiyuan Liao,Yujun Lin,Han Cai,Yilong Zhao,Shuo Yang,Kurt Keutzer,Song Han,Ligeng Zhu

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:large language models, complex reasoning capabilities, Reinforcement learning, language models, training

备注: 11 pages, 6 figures, 4 tables

点击查看摘要

Abstract:Reinforcement learning (RL) is essential for enhancing the complex reasoning capabilities of large language models (LLMs). However, existing RL training pipelines are computationally inefficient and resource-intensive, with the rollout phase accounting for over 70% of total training time. Quantized RL training, particularly using FP8 precision, offers a promising approach to mitigating this bottleneck. A commonly adopted strategy applies FP8 precision during rollout while retaining BF16 precision for training. In this work, we present the first comprehensive study of FP8 RL training and demonstrate that the widely used BF16-training + FP8-rollout strategy suffers from severe training instability and catastrophic accuracy collapse under long-horizon rollouts and challenging tasks. Our analysis shows that these failures stem from the off-policy nature of the approach, which introduces substantial numerical mismatch between training and inference. Motivated by these observations, we propose Jet-RL, an FP8 RL training framework that enables robust and stable RL optimization. The key idea is to adopt a unified FP8 precision flow for both training and rollout, thereby minimizing numerical discrepancies and eliminating the need for inefficient inter-step calibration. Extensive experiments validate the effectiveness of Jet-RL: our method achieves up to 33% speedup in the rollout phase, up to 41% speedup in the training phase, and a 16% end-to-end speedup over BF16 training, while maintaining stable convergence across all settings and incurring negligible accuracy degradation.

3. 【2601.14242】APEX-Agents

链接https://arxiv.org/abs/2601.14242

作者:Bertie Vidgen,Austin Mann,Abby Fennelly,John Wright Stanly,Lucas Rothman,Marco Burstein,Julien Benchek,David Ostrofsky,Anirudh Ravichandran,Debnil Sur,Neel Venugopal,Alannah Hsia,Isaac Robinson,Calix Huang,Olivia Varones,Daniyal Khan,Michael Haines,Zach Richards,Chirag Mahapatra,Brendan Foody,Osvald Nitski

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Productivity Index, cross-application tasks created, investment banking analysts, management consultants, Thinking

备注

点击查看摘要

Abstract:We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers. APEX-Agents requires agents to navigate realistic work environments with files and tools. We test eight agents for the leaderboard using Pass@1. Gemini 3 Flash (Thinking=High) achieves the highest score of 24.0%, followed by GPT-5.2 (Thinking=High), Claude Opus 4.5 (Thinking=High), and Gemini 3 Pro (Thinking=High). We open source the APEX-Agents benchmark (n=480) with all prompts, rubrics, gold outputs, files, and metadata. We also open-source Archipelago, our infrastructure for agent execution and evaluation.

4. 【2601.14230】MASCOT: Towards Multi-Agent Socio-Collaborative Companion Systems

链接https://arxiv.org/abs/2601.14230

作者:Yiyang Wang,Yiqiao Jin,Alex Cabral,Josiah Hester

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

关键词:promising socio-collaborative companions, recently emerged, emerged as promising, emotional and cognitive, Collaborative Dialogue Optimization

备注: 15 pages, 9 figures

点击查看摘要

Abstract:Multi-agent systems (MAS) have recently emerged as promising socio-collaborative companions for emotional and cognitive support. However, these systems frequently suffer from persona collapse--where agents revert to generic, homogenized assistant behaviors--and social sycophancy, which produces redundant, non-constructive dialogue. We propose MASCOT, a generalizable framework for multi-perspective socio-collaborative companions. MASCOT introduces a novel bi-level optimization strategy to harmonize individual and collective behaviors: 1) Persona-Aware Behavioral Alignment, an RLAIF-driven pipeline that finetunes individual agents for strict persona fidelity to prevent identity loss; and 2) Collaborative Dialogue Optimization, a meta-policy guided by group-level rewards to ensure diverse and productive discourse. Extensive evaluations across psychological support and workplace domains demonstrate that MASCOT significantly outperforms state-of-the-art baselines, achieving improvements of up to +14.1 in Persona Consistency and +10.6 in Social Contribution. Our framework provides a practical roadmap for engineering the next generation of socially intelligent multi-agent systems.

5. 【2601.14212】Generalization and Completeness of Stochastic Local Search Algorithms

链接https://arxiv.org/abs/2601.14212

作者:Daniel Loscos,Narciso Marti-Oliet,Ismael Rodriguez

类目:Neural and Evolutionary Computing (cs.NE); Computation and Language (cs.CL)

关键词:Stochastic Local Search, generalize Stochastic Local, Local Search, Stochastic Local, generalize Stochastic

备注: This paper was published in Swarm and Evolutionary Computation. The present version is the author's accepted manuscript

点击查看摘要

Abstract:We generalize Stochastic Local Search (SLS) heuristics into a unique formal model. This model has two key components: a common structure designed to be as large as possible and a parametric structure intended to be as small as possible. Each heuristic is obtained by instantiating the parametric part in a different way. Particular instances for Genetic Algorithms (GA), Ant Colony Optimization (ACO), and Particle Swarm Optimization (PSO) are presented. Then, we use our model to prove the Turing-completeness of SLS algorithms in general. The proof uses our framework to construct a GA able to simulate any Turing machine. This Turing-completeness implies that determining any non-trivial property concerning the relationship between the inputs and the computed outputs is undecidable for GA and, by extension, for the general set of SLS methods (although not necessarily for each particular method). Similar proofs are more informally presented for PSO and ACO.

6. 【2601.14210】HALT: Hallucination Assessment via Latent Testing

链接https://arxiv.org/abs/2601.14210

作者:Rohan Bhatnagar,Youran Sun,Chi Andrew Zhang,Yixin Wen,Haizhao Yang

类目:Computation and Language (cs.CL)

关键词:large language models, language models, large language, failure of faithful, pressures still yield

备注

点击查看摘要

Abstract:Hallucination in large language models (LLMs) can be understood as a failure of faithful readout: although internal representations may encode uncertainty about a query, decoding pressures still yield a fluent answer. We propose lightweight residual probes that read hallucination risk directly from intermediate hidden states of question tokens, motivated by the hypothesis that these layers retain epistemic signals that are attenuated in the final decoding stage. The probe is a small auxiliary network whose computation is orders of magnitude cheaper than token generation and can be evaluated fully in parallel with inference, enabling near-instantaneous hallucination risk estimation with effectively zero added latency in low-risk cases. We deploy the probe as an agentic critic for fast selective generation and routing, allowing LLMs to immediately answer confident queries while delegating uncertain ones to stronger verification pipelines. Across four QA benchmarks and multiple LLM families, the method achieves strong AUROC and AURAC, generalizes under dataset shift, and reveals interpretable structure in intermediate representations, positioning fast internal uncertainty readout as a principled foundation for reliable agentic AI.

7. 【2601.14209】InT: Self-Proposed Interventions Enable Credit Assignment in LLM Reasoning

链接https://arxiv.org/abs/2601.14209

作者:Matthew Y. R. Yang,Hao Bai,Ian Wu,Gene Yang,Amrith Setlur,Aviral Kumar

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Outcome-reward reinforcement learning, Outcome-reward reinforcement, large language models, reinforcement learning, proven effective

备注

点击查看摘要

Abstract:Outcome-reward reinforcement learning (RL) has proven effective at improving the reasoning capabilities of large language models (LLMs). However, standard RL assigns credit only at the level of the final answer, penalizing entire reasoning traces when the outcome is incorrect and uniformly reinforcing all steps when it is correct. As a result, correct intermediate steps may be discouraged in failed traces, while spurious steps may be reinforced in successful ones. We refer to this failure mode as the problem of credit assignment. While a natural remedy is to train a process reward model, accurately optimizing such models to identify corrective reasoning steps remains challenging. We introduce Intervention Training (InT), a training paradigm in which the model performs fine-grained credit assignment on its own reasoning traces by proposing short, targeted corrections that steer trajectories toward higher reward. Using reference solutions commonly available in mathematical reasoning datasets and exploiting the fact that verifying a model-generated solution is easier than generating a correct one from scratch, the model identifies the first error in its reasoning and proposes a single-step intervention to redirect the trajectory toward the correct solution. We then apply supervised fine-tuning (SFT) to the on-policy rollout up to the point of error concatenated with the intervention, localizing error to the specific step that caused failure. We show that the resulting model serves as a far better initialization for RL training. After running InT and subsequent fine-tuning with RL, we improve accuracy by nearly 14% over a 4B-parameter base model on IMO-AnswerBench, outperforming larger open-source models such as gpt-oss-20b.

8. 【2601.14192】oward Efficient Agents: Memory, Tool learning, and Planning

链接https://arxiv.org/abs/2601.14192

作者:Xiaofang Yang,Lijun Li,Heng Zhou,Tong Zhu,Xiaoye Qu,Yuchen Fan,Qianshan Wei,Rui Ye,Li Kang,Yiran Qin,Zhiqiang Kou,Daizong Liu,Qi Li,Ning Ding,Siheng Chen,Jing Shao

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:witnessed increasing interest, extending large language, large language models, years have witnessed, witnessed increasing

备注: 35 pages, 200 references

点击查看摘要

Abstract:Recent years have witnessed increasing interest in extending large language models into agentic systems. While the effectiveness of agents has continued to improve, efficiency, which is crucial for real-world deployment, has often been overlooked. This paper therefore investigates efficiency from three core components of agents: memory, tool learning, and planning, considering costs such as latency, tokens, steps, etc. Aimed at conducting comprehensive research addressing the efficiency of the agentic system itself, we review a broad range of recent approaches that differ in implementation yet frequently converge on shared high-level principles including but not limited to bounding context via compression and management, designing reinforcement learning rewards to minimize tool invocation, and employing controlled search mechanisms to enhance efficiency, which we discuss in detail. Accordingly, we characterize efficiency in two complementary ways: comparing effectiveness under a fixed cost budget, and comparing cost at a comparable level of effectiveness. This trade-off can also be viewed through the Pareto frontier between effectiveness and cost. From this perspective, we also examine efficiency oriented benchmarks by summarizing evaluation protocols for these components and consolidating commonly reported efficiency metrics from both benchmark and methodological studies. Moreover, we discuss the key challenges and future directions, with the goal of providing promising insights.

9. 【2601.14175】A model of errors in transformers

链接https://arxiv.org/abs/2601.14175

作者:Suvrat Raju,Praneeth Netrapalli

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); High Energy Physics - Theory (hep-th)

关键词:deterministic output, arithmetic that require, require a deterministic, error rate, small set

备注: 8+17pages

点击查看摘要

Abstract:We study the error rate of LLMs on tasks like arithmetic that require a deterministic output, and repetitive processing of tokens drawn from a small set of alternatives. We argue that incorrect predictions arise when small errors in the attention mechanism accumulate to cross a threshold, and use this insight to derive a quantitative two-parameter relationship between the accuracy and the complexity of the task. The two parameters vary with the prompt and the model; they can be interpreted in terms of an elementary noise rate, and the number of plausible erroneous tokens that can be predicted. Our analysis is inspired by an ``effective field theory'' perspective: the LLM's many raw parameters can be reorganized into just two parameters that govern the error rate. We perform extensive empirical tests, using Gemini 2.5 Flash, Gemini 2.5 Pro and DeepSeek R1, and find excellent agreement between the predicted and observed accuracy for a variety of tasks, although we also identify deviations in some cases. Our model provides an alternative to suggestions that errors made by LLMs on long repetitive tasks indicate the ``collapse of reasoning'', or an inability to express ``compositional'' functions. Finally, we show how to construct prompts to reduce the error rate.

10. 【2601.14172】Human Values in a Single Sentence: Moral Presence, Hierarchies, and Transformer Ensembles on the Schwartz Continuum

链接https://arxiv.org/abs/2601.14172

作者:Víctor Yeste,Paolo Rosso

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Schwartz motivational continuum, study sentence-level identification, Schwartz motivational, motivational continuum, concrete formulation

备注: Code: [this https URL](https://github.com/VictorMYeste/human-value-detection) , 37 pages, 4 figures,

点击查看摘要

Abstract:We study sentence-level identification of the 19 values in the Schwartz motivational continuum as a concrete formulation of human value detection in text. The setting - out-of-context sentences from news and political manifestos - features sparse moral cues and severe class imbalance. This combination makes fine-grained sentence-level value detection intrinsically difficult, even for strong modern neural models. We first operationalize a binary moral presence task ("does any value appear?") and show that it is learnable from single sentences (positive-class F1 $\approx$ 0.74 with calibrated thresholds). We then compare a presence-gated hierarchy to a direct multi-label classifier under matched compute, both based on DeBERTa-base and augmented with lightweight signals (prior-sentence context, LIWC-22/eMFD/MJD lexica, and topic features). The hierarchy does not outperform direct prediction, indicating that gate recall limits downstream gains. We also benchmark instruction-tuned LLMs - Gemma 2 9B, Llama 3.1 8B, Mistral 8B, and Qwen 2.5 7B - in zero-/few-shot and QLoRA setups and build simple ensembles; a soft-vote supervised ensemble reaches macro-F1 0.332, significantly surpassing the best single supervised model and exceeding prior English-only baselines. Overall, in this scenario, lightweight signals and small ensembles yield the most reliable improvements, while hierarchical gating offers limited benefit. We argue that, under an 8 GB single-GPU constraint and at the 7-9B scale, carefully tuned supervised encoders remain a strong and compute-efficient baseline for structured human value detection, and we outline how richer value structure and sentence-in-document context could further improve performance.

11. 【2601.14160】Domain-Adaptation through Synthetic Data: Fine-Tuning Large Language Models for German Law

链接https://arxiv.org/abs/2601.14160

作者:Ali Hamza Bashir,Muhammad Rehan Khalid,Kostadin Cvejoski,Jana Birr,Jule Berghaus,Armin Berger,Sandra Halscheidt,Christian Temath,Rafet Sifa,David Berghaus

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, limited expert knowledge, factually incorrect outputs, Large language, legal reasoning due

备注

点击查看摘要

Abstract:Large language models (LLMs) often struggle in specialized domains such as legal reasoning due to limited expert knowledge, resulting in factually incorrect outputs or hallucinations. This paper presents an effective method for adapting advanced LLMs to German legal question answering through a novel synthetic data generation approach. In contrast to costly human-annotated resources or unreliable synthetic alternatives, our approach systematically produces high-quality, diverse, and legally accurate question-answer pairs directly from authoritative German statutes. Using rigorous automated filtering methods and parameter-efficient fine-tuning techniques, we demonstrate that LLMs adapted with our synthetic dataset significantly outperform their baseline counterparts on German legal question answering tasks. Our results highlight the feasibility of using carefully designed synthetic data as a robust alternative to manual annotation in high-stakes, knowledge-intensive domains.

12. 【2601.14152】Lost in the Prompt Order: Revealing the Limitations of Causal Attention in Language Models

链接https://arxiv.org/abs/2601.14152

作者:Hyunjong Ok,Jaeho Lee

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large language models, remain poorly understood, sensitivity remain poorly, exhibit surprising sensitivity, language models exhibit

备注: preprint

点击查看摘要

Abstract:Large language models exhibit surprising sensitivity to the structure of the prompt, but the mechanisms underlying this sensitivity remain poorly understood. In this work, we conduct an in-depth investigation on a striking case: in multiple-choice question answering, placing context before the questions and options (CQO) outperforms the reverse order (QOC) by over 14%p, consistently over a wide range of models and datasets. Through systematic architectural analysis, we identify causal attention as the core mechanism: in QOC prompts, the causal mask prevents option tokens from attending to context, creating an information bottleneck where context becomes invisible to options.

13. 【2601.14127】he Side Effects of Being Smart: Safety Risks in MLLMs' Multi-Image Reasoning

链接https://arxiv.org/abs/2601.14127

作者:Renmiao Chen,Yida Lu,Shiyao Cui,Xuan Ouyang,Victor Shea-Jay Huang,Shumin Zhang,Chengwei Pan,Han Qiu,Minlie Huang

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, acquire stronger reasoning

备注: *15 pages, 5 figures. Introduces MIR-SafetyBench (2,676 instances; 9 multi-image relations). Equal contribution; †Corresponding author. Code/data: [this https URL](https://github.com/thu-coai/MIR-SafetyBench)

点击查看摘要

Abstract:As Multimodal Large Language Models (MLLMs) acquire stronger reasoning capabilities to handle complex, multi-image instructions, this advancement may pose new safety risks. We study this problem by introducing MIR-SafetyBench, the first benchmark focused on multi-image reasoning safety, which consists of 2,676 instances across a taxonomy of 9 multi-image relations. Our extensive evaluations on 19 MLLMs reveal a troubling trend: models with more advanced multi-image reasoning can be more vulnerable on MIR-SafetyBench. Beyond attack success rates, we find that many responses labeled as safe are superficial, often driven by misunderstanding or evasive, non-committal replies. We further observe that unsafe generations exhibit lower attention entropy than safe ones on average. This internal signature suggests a possible risk that models may over-focus on task solving while neglecting safety constraints. Our code and data are available at this https URL.

14. 【2601.14124】Style Transfer as Bias Mitigation: Diffusion Models for Synthetic Mental Health Text for Arabic

链接https://arxiv.org/abs/2601.14124

作者:Saad Mankarious,Aya Zirikly

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:existing approaches largely, approaches largely rely, limited output diversity, propagate biases inherited, large language models

备注

点击查看摘要

Abstract:Synthetic data offers a promising solution for mitigating data scarcity and demographic bias in mental health analysis, yet existing approaches largely rely on pretrained large language models (LLMs), which may suffer from limited output diversity and propagate biases inherited from their training data. In this work, we propose a pretraining-free diffusion-based approach for synthetic text generation that frames bias mitigation as a style transfer problem. Using the CARMA Arabic mental health corpus, which exhibits a substantial gender imbalance, we focus on male-to-female style transfer to augment underrepresented female-authored content. We construct five datasets capturing varying linguistic and semantic aspects of gender expression in Arabic and train separate diffusion models for each setting. Quantitative evaluations demonstrate consistently high semantic fidelity between source and generated text, alongside meaningful surface-level stylistic divergence, while qualitative analysis confirms linguistically plausible gender transformations. Our results show that diffusion-based style transfer can generate high-entropy, semantically faithful synthetic data without reliance on pretrained LLMs, providing an effective and flexible framework for mitigating gender bias in sensitive, low-resource mental health domains.

15. 【2601.14123】A Systematic Analysis of Chunking Strategies for Reliable Question Answering

链接https://arxiv.org/abs/2601.14123

作者:Sofia Bennani,Charles Moslonka

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Retrieval-Augmented Generation, systems in industry, document chunking choices, chunking choices impact, Natural Questions systematically

备注: 3 pages, 2 figures, 1 table, pre-print

点击查看摘要

Abstract:We study how document chunking choices impact the reliability of Retrieval-Augmented Generation (RAG) systems in industry. While practice often relies on heuristics, our end-to-end evaluation on Natural Questions systematically varies chunking method (token, sentence, semantic, code), chunk size, overlap, and context length. We use a standard industrial setup: SPLADE retrieval and a Mistral-8B generator. We derive actionable lessons for cost-efficient deployment: (i) overlap provides no measurable benefit and increases indexing cost; (ii) sentence chunking is the most cost-effective method, matching semantic chunking up to ~5k tokens; (iii) a "context cliff" reduces quality beyond ~2.5k tokens; and (iv) optimal context depends on the goal (semantic quality peaks at small contexts; exact match at larger ones).

16. 【2601.14121】NewsRECON: News article REtrieval for image CONtextualization

链接https://arxiv.org/abs/2601.14121

作者:Jonathan Tonglet,Iryna Gurevych,Tinne Tuytelaars,Marie-Francine Moens

类目:Computation and Language (cs.CL)

关键词:produce credible stories, debunk misinformation, crucial for journalists, journalists and forensic, forensic experts

备注: Preprint under review. Code available at [this https URL](https://github.com/jtonglet/arxiv2025-newsrecon)

点击查看摘要

Abstract:Identifying when and where a news image was taken is crucial for journalists and forensic experts to produce credible stories and debunk misinformation. While many existing methods rely on reverse image search (RIS) engines, these tools often fail to return results, thereby limiting their practical applicability. In this work, we address the challenging scenario where RIS evidence is unavailable. We introduce NewsRECON, a method that links images to relevant news articles to infer their date and location from article metadata. NewsRECON leverages a corpus of over 90,000 articles and integrates: (1) a bi-encoder for retrieving event-relevant articles; (2) two cross-encoders for reranking articles by location and event consistency. Experiments on the TARA and 5Pils-OOC show that NewsRECON outperforms prior work and can be combined with a multimodal large language model to achieve new SOTA results in the absence of RIS evidence. We make our code available.

17. 【2601.14112】Learning to Explain: Supervised Token Attribution from Transformer Attention Patterns

链接https://arxiv.org/abs/2601.14112

作者:George Mihaila

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:applications including healthcare, high-stakes applications including, opacity hinders trust, legal systems, including healthcare

备注

点击查看摘要

Abstract:Explainable AI (XAI) has become critical as transformer-based models are deployed in high-stakes applications including healthcare, legal systems, and financial services, where opacity hinders trust and accountability. Transformers self-attention mechanisms have proven valuable for model interpretability, with attention weights successfully used to understand model focus and behavior (Xu et al., 2015); (Wiegreffe and Pinter, 2019). However, existing attention-based explanation methods rely on manually defined aggregation strategies and fixed attribution rules (Abnar and Zuidema, 2020a); (Chefer et al., 2021), while model-agnostic approaches (LIME, SHAP) treat the model as a black box and incur significant computational costs through input perturbation. We introduce Explanation Network (ExpNet), a lightweight neural network that learns an explicit mapping from transformer attention patterns to token-level importance scores. Unlike prior methods, ExpNet discovers optimal attention feature combinations automatically rather than relying on predetermined rules. We evaluate ExpNet in a challenging cross-task setting and benchmark it against a broad spectrum of model-agnostic methods and attention-based techniques spanning four methodological families.

18. 【2601.14105】ruth with a Twist: The Rhetoric of Persuasion in Professional vs. Community-Authored Fact-Checks

链接https://arxiv.org/abs/2601.14105

作者:Olesya Razuvayevskaya,Kalina Bontcheva

类目:Computation and Language (cs.CL)

关键词:versus professionally-written debunks, persuasion techniques present, persuasion techniques, versus professionally-written, study presents

备注: In Proceedings of the ACM Web Conference 2026 (WWW 2026)

点击查看摘要

Abstract:This study presents the first large-scale comparison of persuasion techniques present in crowd- versus professionally-written debunks. Using extensive datasets from Community Notes (CNs), EUvsDisinfo, and the Database of Known Fakes (DBKF), we quantify the prevalence and types of persuasion techniques across these fact-checking ecosystems. Contrary to prior hypothesis that community-produced debunks rely more heavily on subjective or persuasive wording, we find no evidence that CNs contain a higher average number of persuasion techniques than professional fact-checks. We additionally identify systematic rhetorical differences between CNs and professional debunking efforts, reflecting differences in institutional norms and topical coverage. Finally, we examine how the crowd evaluates persuasive language in CNs and show that, although notes with more persuasive elements receive slightly higher overall helpfulness ratings, crowd raters are effective at penalising the use of particular problematic rhetorical means

19. 【2601.14084】DermaBench: A Clinician-Annotated Benchmark Dataset for Dermatology Visual Question Answering and Reasoning

链接https://arxiv.org/abs/2601.14084

作者:Abdurrahim Yilmaz,Ozan Erdem,Ece Gokyayla,Ayda Acar,Burc Bugra Dagtas,Dilara Ilhan Erdil,Gulsum Gencoglan,Burak Temelkuran

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:image-level classification tasks, dermatology remains limited, Vision-language models, medical applications, increasingly important

备注

点击查看摘要

Abstract:Vision-language models (VLMs) are increasingly important in medical applications; however, their evaluation in dermatology remains limited by datasets that focus primarily on image-level classification tasks such as lesion recognition. While valuable for recognition, such datasets cannot assess the full visual understanding, language grounding, and clinical reasoning capabilities of multimodal models. Visual question answering (VQA) benchmarks are required to evaluate how models interpret dermatological images, reason over fine-grained morphology, and generate clinically meaningful descriptions. We introduce DermaBench, a clinician-annotated dermatology VQA benchmark built on the Diverse Dermatology Images (DDI) dataset. DermaBench comprises 656 clinical images from 570 unique patients spanning Fitzpatrick skin types I-VI. Using a hierarchical annotation schema with 22 main questions (single-choice, multi-choice, and open-ended), expert dermatologists annotated each image for diagnosis, anatomic site, lesion morphology, distribution, surface features, color, and image quality, together with open-ended narrative descriptions and summaries, yielding approximately 14.474 VQA-style annotations. DermaBench is released as a metadata-only dataset to respect upstream licensing and is publicly available at Harvard Dataverse.

20. 【2601.14063】XCR-Bench: A Multi-Task Benchmark for Evaluating Cultural Reasoning in LLMs

链接https://arxiv.org/abs/2601.14063

作者:Mohsinul Kabir,Tasnim Ahmed,Md Mezbaur Rahman,Shaoxiong Ji,Hassan Alhuzali,Sophia Ananiadou

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

关键词:identify Culture-Specific Items, large language models, Culture-Specific Items, language models, requires the ability

备注: 30 Pages, 13 Figures

点击查看摘要

Abstract:Cross-cultural competence in large language models (LLMs) requires the ability to identify Culture-Specific Items (CSIs) and to adapt them appropriately across cultural contexts. Progress in evaluating this capability has been constrained by the scarcity of high-quality CSI-annotated corpora with parallel cross-cultural sentence pairs. To address this limitation, we introduce XCR-Bench, a Cross(X)-Cultural Reasoning Benchmark consisting of 4.9k parallel sentences and 1,098 unique CSIs, spanning three distinct reasoning tasks with corresponding evaluation metrics. Our corpus integrates Newmark's CSI framework with Hall's Triad of Culture, enabling systematic analysis of cultural reasoning beyond surface-level artifacts and into semi-visible and invisible cultural elements such as social norms, beliefs, and values. Our findings show that state-of-the-art LLMs exhibit consistent weaknesses in identifying and adapting CSIs related to social etiquette and cultural reference. Additionally, we find evidence that LLMs encode regional and ethno-religious biases even within a single linguistic setting during cultural adaptation. We release our corpus and code to facilitate future research on cross-cultural NLP.

21. 【2601.14051】Kakugo: Distillation of Low-Resource Languages into Small Language Models

链接https://arxiv.org/abs/2601.14051

作者:Peter Devine,Mardhiyah Sanni,Farid Adilazuarda,Julieta Gil Loizaga,Barry Haddow

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:train general-purpose Small, general-purpose Small Language, Small Language Models, general-purpose Small, cost-effective pipeline designed

备注

点击查看摘要

Abstract:We present Kakugo, a novel and cost-effective pipeline designed to train general-purpose Small Language Models (SLMs) for low-resource languages using only the language name as input. By using a large teacher model to generate synthetic prompts and translate instruction datasets, we produced training data and SLMs for 54 low-resource languages. Evaluations across a diverse set of general natural language processing tasks, including translation, classification, and question answering, demonstrate that our pipeline consistently improves performance over base models. With a total generation and training cost of under $50 per language, Kakugo offers an accessible method for communities to develop language-specific AI.

22. 【2601.14050】Understanding Multilingualism in Mixture-of-Experts LLMs: Routing Mechanism, Expert Specialization, and Layerwise Steering

链接https://arxiv.org/abs/2601.14050

作者:Yuxin Chen,Zhengzhou Cai,Xiangtian Ji,Weixiang Zhao,An Zhang,Xiang Wang,Tat-Seng Chua

类目:Computation and Language (cs.CL)

关键词:remain insufficiently understood, internal mechanisms underlying, cross-language differences remain, differences remain insufficiently, strong multilingual capabilities

备注

点击查看摘要

Abstract:Mixture-of-Experts (MoE) architectures have shown strong multilingual capabilities, yet the internal mechanisms underlying performance gains and cross-language differences remain insufficiently understood. In this work, we conduct a systematic analysis of MoE models, examining routing behavior and expert specialization across languages and network depth. Our analysis reveals that multilingual processing in MoE models is highly structured: routing aligns with linguistic families, expert utilization follows a clear layerwise pattern, and high-resource languages rely on shared experts while low-resource languages depend more on language-exclusive experts despite weaker performance. Layerwise interventions further show that early and late MoE layers support language-specific processing, whereas middle layers serve as language-agnostic capacity hubs. Building on these insights, we propose a routing-guided steering method that adaptively guides routing behavior in middle layers toward shared experts associated with dominant languages at inference time, leading to consistent multilingual performance improvements, particularly for linguistically related language pairs. Our code is available at this https URL.

23. 【2601.14046】PRiSM: Benchmarking Phone Realization in Speech Models

链接https://arxiv.org/abs/2601.14046

作者:Shikhar Bharadwaj,Chin-Jou Li,Yoonjae Kim,Kwanghee Choi,Eunjung Yeo,Ryan Soh-Eun Shim,Hanyu Zhou,Brendon Boldt,Karen Rosero Jacome,Kalvin Chang,Darsh Agrawal,Keer Xu,Chao-Han Huck Yang,Jian Zhu,Shinji Watanabe,David R. Mortensen

类目:Computation and Language (cs.CL); Sound (cs.SD)

关键词:Phone recognition, cross-lingual speech processing, atomic interface, interface for language-agnostic, language-agnostic modeling

备注

点击查看摘要

Abstract:Phone recognition (PR) serves as the atomic interface for language-agnostic modeling for cross-lingual speech processing and phonetic analysis. Despite prolonged efforts in developing PR systems, current evaluations only measure surface-level transcription accuracy. We introduce PRiSM, the first open-source benchmark designed to expose blind spots in phonetic perception through intrinsic and extrinsic evaluation of PR systems. PRiSM standardizes transcription-based evaluation and assesses downstream utility in clinical, educational, and multilingual settings with transcription and representation probes. We find that diverse language exposure during training is key to PR performance, encoder-CTC models are the most stable, and specialized PR models still outperform Large Audio Language Models. PRiSM releases code, recipes, and datasets to move the field toward multilingual speech models with robust phonetic ability: this https URL.

24. 【2601.14041】op 10 Open Challenges Steering the Future of Diffusion Language Model and Its Variants

链接https://arxiv.org/abs/2601.14041

作者:Yunhe Wang,Kai Han,Huiling Zhen,Yuchuan Tian,Hanting Chen,Yongbing Huang,Yufei Cui,Yingte Shu,Shan Gao,Ismail Elezi,Roy Vaughan Miles,Songcen Xu,Feng Wen,Chao Xu,Sinan Zeng,Dacheng Tao

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, paradigm of Large, Large Language, Language Models, Diffusion Language Models

备注

点击查看摘要

Abstract:The paradigm of Large Language Models (LLMs) is currently defined by auto-regressive (AR) architectures, which generate text through a sequential ``brick-by-brick'' process. Despite their success, AR models are inherently constrained by a causal bottleneck that limits global structural foresight and iterative refinement. Diffusion Language Models (DLMs) offer a transformative alternative, conceptualizing text generation as a holistic, bidirectional denoising process akin to a sculptor refining a masterpiece. However, the potential of DLMs remains largely untapped as they are frequently confined within AR-legacy infrastructures and optimization frameworks. In this Perspective, we identify ten fundamental challenges ranging from architectural inertia and gradient sparsity to the limitations of linear reasoning that prevent DLMs from reaching their ``GPT-4 moment''. We propose a strategic roadmap organized into four pillars: foundational infrastructure, algorithmic optimization, cognitive reasoning, and unified multimodal intelligence. By shifting toward a diffusion-native ecosystem characterized by multi-scale tokenization, active remasking, and latent thinking, we can move beyond the constraints of the causal horizon. We argue that this transition is essential for developing next-generation AI capable of complex structural reasoning, dynamic self-correction, and seamless multimodal integration.

25. 【2601.14032】RM-Distiller: Exploiting Generative LLM for Reward Model Distillation

链接https://arxiv.org/abs/2601.14032

作者:Hongli Zhou,Hui Huang,Wei Liu,Chenglong Wang,Xingyuan Bu,Lvyuan Han,Fuhai Song,Muyun Yang,Wenhao Jiang,Hailong Cao,Tiejun Zhao

类目:Computation and Language (cs.CL)

关键词:aligning large language, large language models, play a pivotal, pivotal role, role in aligning

备注

点击查看摘要

Abstract:Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human preferences. Due to the difficulty of obtaining high-quality human preference annotations, distilling preferences from generative LLMs has emerged as a standard practice. However, existing approaches predominantly treat teacher models as simple binary annotators, failing to fully exploit the rich knowledge and capabilities for RM distillation. To address this, we propose RM-Distiller, a framework designed to systematically exploit the multifaceted capabilities of teacher LLMs: (1) Refinement capability, which synthesizes highly correlated response pairs to create fine-grained and contrastive signals. (2) Scoring capability, which guides the RM in capturing precise preference strength via a margin-aware optimization objective. (3) Generation capability, which incorporates the teacher's generative distribution to regularize the RM to preserve its fundamental linguistic knowledge. Extensive experiments demonstrate that RM-Distiller significantly outperforms traditional distillation methods both on RM benchmarks and reinforcement learning-based alignment, proving that exploiting multifaceted teacher capabilities is critical for effective reward modeling. To the best of our knowledge, this is the first systematic research on RM distillation from generative LLMs.

26. 【2601.14007】BACH-V: Bridging Abstract and Concrete Human-Values in Large Language Models

链接https://arxiv.org/abs/2601.14007

作者:Junyu Zhang,Yipeng Kang,Jiong Guo,Jiayu Zhan,Junqi Wang

类目:Computation and Language (cs.CL)

关键词:large language models, genuinely understand abstract, understand abstract concepts, language models, genuinely understand

备注: 34 pagess, 16 figures, 6 tables, submitted to ACL 2026

点击查看摘要

Abstract:Do large language models (LLMs) genuinely understand abstract concepts, or merely manipulate them as statistical patterns? We introduce an abstraction-grounding framework that decomposes conceptual understanding into three capacities: interpretation of abstract concepts (Abstract-Abstract, A-A), grounding of abstractions in concrete events (Abstract-Concrete, A-C), and application of abstract principles to regulate concrete decisions (Concrete-Concrete, C-C). Using human values as a testbed - given their semantic richness and centrality to alignment - we employ probing (detecting value traces in internal activations) and steering (modifying representations to shift behavior). Across six open-source LLMs and ten value dimensions, probing shows that diagnostic probes trained solely on abstract value descriptions reliably detect the same values in concrete event narratives and decision reasoning, demonstrating cross-level transfer. Steering reveals an asymmetry: intervening on value representations causally shifts concrete judgments and decisions (A-C, C-C), yet leaves abstract interpretations unchanged (A-A), suggesting that encoded abstract values function as stable anchors rather than malleable activations. These findings indicate LLMs maintain structured value representations that bridge abstraction and action, providing a mechanistic and operational foundation for building value-driven autonomous AI systems with more transparent, generalizable alignment and control.

27. 【2601.14004】Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

链接https://arxiv.org/abs/2601.14004

作者:Hengyuan Zhang,Zhihao Zhang,Mingyang Wang,Zunhai Su,Yiwei Wang,Qianli Wang,Shuzhou Yuan,Ercong Nie,Xufeng Duan,Qibo Xue,Zeping Yu,Chenming Shang,Xiao Liang,Jing Xiong,Hui Shen,Chaofan Tao,Zhengwu Liu,Senjie Jin,Zhiheng Xi,Dongdong Zhang,Sophia Ananiadou,Tao Gui,Ruobing Xie,Hayden Kwok-Hay So,Hinrich Schütze,Xuanjing Huang,Qi Zhang,Ngai Wong

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Mechanistic Interpretability, Large Language, decision-making of Large, Language Models

备注

点击查看摘要

Abstract:Mechanistic Interpretability (MI) has emerged as a vital approach to demystify the opaque decision-making of Large Language Models (LLMs). However, existing reviews primarily treat MI as an observational science, summarizing analytical insights while lacking a systematic framework for actionable intervention. To bridge this gap, we present a practical survey structured around the pipeline: "Locate, Steer, and Improve." We formally categorize Localizing (diagnosis) and Steering (intervention) methods based on specific Interpretable Objects to establish a rigorous intervention protocol. Furthermore, we demonstrate how this framework enables tangible improvements in Alignment, Capability, and Efficiency, effectively operationalizing MI as an actionable methodology for model optimization. The curated paper list of this work is available at this https URL.

28. 【2601.13995】From Tags to Trees: Structuring Fine-Grained Knowledge for Controllable Data Selection in LLM Instruction Tuning

链接https://arxiv.org/abs/2601.13995

作者:Zihan Niu,Wenping Hu,Junmin Chen,Xiyue Wang,Tong Xu,Ruiming Tang

类目:Computation and Language (cs.CL)

关键词:LLM instruction tuning, massive open-source datasets, critical for LLM, LLM instruction, instruction tuning

备注

点击查看摘要

Abstract:Effective and controllable data selection is critical for LLM instruction tuning, especially with massive open-source datasets. Existing approaches primarily rely on instance-level quality scores, or diversity metrics based on embedding clusters or semantic tags. However, constrained by the flatness of embedding spaces or the coarseness of tags, these approaches overlook fine-grained knowledge and its intrinsic hierarchical dependencies, consequently hindering precise data valuation and knowledge-aligned sampling. To address this challenge, we propose Tree-aware Aligned Global Sampling (TAGS), a unified framework that leverages a knowledge tree built from fine-grained tags, thereby enabling joint control of global quality, diversity, and target alignment. Using an LLM-based tagger, we extract atomic knowledge concepts, which are organized into a global tree through bottom-up hierarchical clustering. By grounding data instances onto this tree, a tree-aware metric then quantifies data quality and diversity, facilitating effective sampling. Our controllable sampling strategy maximizes tree-level information gain and enforces leaf-level alignment via KL-divergence for specific domains. Extensive experiments demonstrate that TAGS significantly outperforms state-of-the-art baselines. Notably, it surpasses the full-dataset model by \textbf{+5.84\%} using only \textbf{5\%} of the data, while our aligned sampling strategy further boosts average performance by \textbf{+4.24\%}.

29. 【2601.13992】"The Whole Is Greater Than the Sum of Its Parts": A Compatibility-Aware Multi-Teacher CoT Distillation Framework

链接https://arxiv.org/abs/2601.13992

作者:Jin Cui,Jiaqi Guo,Jiepeng Zhou,Ruixuan Yang,Jiayi Lu,Jiajun Xu,Jiangcheng Song,Boran Zhao,Pengju Ren

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:empowers Large Language, Large Language Models, prohibitive parameter scales, Large Language, typically requires prohibitive

备注: 11pages, 9figures

点击查看摘要

Abstract:Chain-of-Thought (CoT) reasoning empowers Large Language Models (LLMs) with remarkable capabilities but typically requires prohibitive parameter scales. CoT distillation has emerged as a promising paradigm to transfer reasoning prowess into compact Student Models (SLMs), but existing approaches often rely on a solitary teacher, capping the student's potential since individual LLMs often exhibit distinct capability biases and may suffer from catastrophic forgetting. While leveraging diverse teachers seems appealing, effectively fusing their supervisions remains challenging: teacher-student incompatibility risks amplifying hallucinations, and passive supervision fails to ensure genuine logic internalization. To address this, we introduce COMPACT, a framework that adaptively fuses supervisions from different teachers by dynamically weighting teacher gradients based on the student's real-time compatibility evaluated by a multi-dimensional metric: (1) Graph-based Consensus to filter misleading rationales by identifying mainstream reasoning paths; (2) Mutual-Information-based Adaptability to detect "epiphany moments" for genuinely understanding the reasoning process rather than merely imitating; and (3) Loss-based Difficulty to assess student receptivity to the teacher's guidance and prevent negative transfer. Extensive experiments and latent space analysis demonstrate that COMPACT effectively integrates diverse reasoning capabilities without damaging the model's original knowledge structure, achieving state-of-the-art performance on various benchmarks while mitigating catastrophic forgetting.

30. 【2601.13922】Automatic Prompt Optimization for Dataset-Level Feature Discovery

链接https://arxiv.org/abs/2601.13922

作者:Adrian Cosma,Oleg Szehr,David Kletz,Alessandro Antonucci,Olivier Pelletier

类目:Computation and Language (cs.CL)

关键词:current approaches largely, downstream classification pipelines, fixed feature schemas, approaches largely rely, classification pipelines

备注: 5 Figures, 1 Table

点击查看摘要

Abstract:Feature extraction from unstructured text is a critical step in many downstream classification pipelines, yet current approaches largely rely on hand-crafted prompts or fixed feature schemas. We formulate feature discovery as a dataset-level prompt optimization problem: given a labelled text corpus, the goal is to induce a global set of interpretable and discriminative feature definitions whose realizations optimize a downstream supervised learning objective. To this end, we propose a multi-agent prompt optimization framework in which language-model agents jointly propose feature definitions, extract feature values, and evaluate feature quality using dataset-level performance and interpretability feedback. Instruction prompts are iteratively refined based on this structured feedback, enabling optimization over prompts that induce shared feature sets rather than per-example predictions. This formulation departs from prior prompt optimization methods that rely on per-sample supervision and provides a principled mechanism for automatic feature discovery from unstructured text.

31. 【2601.13919】HyperWalker: Dynamic Hypergraph-Based Deep Diagnosis for Multi-Hop Clinical Modeling across EHR and X-Ray in Medical VLMs

链接https://arxiv.org/abs/2601.13919

作者:Yuezhe Yang,Hao Wang,Yige Peng,Jinman Kim,Lei Bi

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Automated clinical diagnosis, integrate multi-modal data, case-specific contexts, reason across complex, Automated clinical

备注: Under Review

点击查看摘要

Abstract:Automated clinical diagnosis remains a core challenge in medical AI, which usually requires models to integrate multi-modal data and reason across complex, case-specific contexts. Although recent methods have advanced medical report generation (MRG) and visual question answering (VQA) with medical vision-language models (VLMs), these methods, however, predominantly operate under a sample-isolated inference paradigm, as such processing cases independently without access to longitudinal electronic health records (EHRs) or structurally related patient examples. This paradigm limits reasoning to image-derived information alone, which ignores external complementary medical evidence for potentially more accurate diagnosis. To overcome this limitation, we propose \textbf{HyperWalker}, a \textit{Deep Diagnosis} framework that reformulates clinical reasoning via dynamic hypergraphs and test-time training. First, we construct a dynamic hypergraph, termed \textbf{iBrochure}, to model the structural heterogeneity of EHR data and implicit high-order associations among multimodal clinical information. Within this hypergraph, a reinforcement learning agent, \textbf{Walker}, navigates to and identifies optimal diagnostic paths. To ensure comprehensive coverage of diverse clinical characteristics in test samples, we incorporate a \textit{linger mechanism}, a multi-hop orthogonal retrieval strategy that iteratively selects clinically complementary neighborhood cases reflecting distinct clinical attributes. Experiments on MRG with MIMIC and medical VQA on EHRXQA demonstrate that HyperWalker achieves state-of-the-art performance. Code is available at: this https URL

32. 【2601.13918】AgentEHR: Advancing Autonomous Clinical Decision-Making via Retrospective Summarization

链接https://arxiv.org/abs/2601.13918

作者:Yusheng Liao,Chuan Xuan,Yutong Cai,Lina Yang,Zhe Chen,Yanfeng Wang,Yu Wang

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, Language Models, demonstrated profound utility, Electronic Health Records

备注: 37 pages, 12 figures

点击查看摘要

Abstract:Large Language Models have demonstrated profound utility in the medical domain. However, their application to autonomous Electronic Health Records~(EHRs) navigation remains constrained by a reliance on curated inputs and simplified retrieval tasks. To bridge the gap between idealized experimental settings and realistic clinical environments, we present AgentEHR. This benchmark challenges agents to execute complex decision-making tasks, such as diagnosis and treatment planning, requiring long-range interactive reasoning directly within raw and high-noise databases. In tackling these tasks, we identify that existing summarization methods inevitably suffer from critical information loss and fractured reasoning continuity. To address this, we propose RetroSum, a novel framework that unifies a retrospective summarization mechanism with an evolving experience strategy. By dynamically re-evaluating interaction history, the retrospective mechanism prevents long-context information loss and ensures unbroken logical coherence. Additionally, the evolving strategy bridges the domain gap by retrieving accumulated experience from a memory bank. Extensive empirical evaluations demonstrate that RetroSum achieves performance gains of up to 29.16% over competitive baselines, while significantly decreasing total interaction errors by up to 92.3%.

33. 【2601.13885】Confident Rankings with Fewer Items: Adaptive LLM Evaluation with Continuous Scores

链接https://arxiv.org/abs/2601.13885

作者:Esma Balkır,Alice Pernthaller,Marco Basaldella,José Hernández-Orallo,Nigel Collier

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:efficient LLM evaluation, modern LLM evaluation, LLM evaluation increasingly, LLM evaluation, evaluation increasingly relies

备注

点击查看摘要

Abstract:Computerized Adaptive Testing (CAT) has proven effective for efficient LLM evaluation on multiple-choice benchmarks, but modern LLM evaluation increasingly relies on generation tasks where outputs are scored continuously rather than marked correct/incorrect. We present a principled extension of IRT-based adaptive testing to continuous bounded scores (ROUGE, BLEU, LLM-as-a-Judge) by replacing the Bernoulli response distribution with a heteroskedastic normal distribution. Building on this, we introduce an uncertainty aware ranker with adaptive stopping criteria that achieves reliable model ranking while testing as few items and as cheaply as possible. We validate our method on five benchmarks spanning n-gram-based, embedding-based, and LLM-as-judge metrics. Our method uses 2% of the items while improving ranking correlation by 0.12 {\tau} over random sampling, with 95% accuracy on confident predictions.

34. 【2601.13882】OpenLearnLM Benchmark: A Unified Framework for Evaluating Knowledge, Skill, and Attitude in Educational Large Language Models

链接https://arxiv.org/abs/2601.13882

作者:Unggi Lee,Sookbun Lee,Heungsoo Choi,Jinseo Lee,Haeun Park,Younghoon Jeon,Sungmin Cho,Minju Kang,Junbo Koh,Jiyeong Bae,Minwoo Nam,Juyeon Eun,Yeonji Jung,Yeil Jeong

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, existing benchmarks focus, learning sciences, Language Models

备注

点击查看摘要

Abstract:Large Language Models are increasingly deployed as educational tools, yet existing benchmarks focus on narrow skills and lack grounding in learning sciences. We introduce OpenLearnLM Benchmark, a theory-grounded framework evaluating LLMs across three dimensions derived from educational assessment theory: Knowledge (curriculum-aligned content and pedagogical understanding), Skills (scenario-based competencies organized through a four-level center-role-scenario-subscenario hierarchy), and Attitude (alignment consistency and deception resistance). Our benchmark comprises 124K+ items spanning multiple subjects, educational roles, and difficulty levels based on Bloom's taxonomy. The Knowledge domain prioritizes authentic assessment items from established benchmarks, while the Attitude domain adapts Anthropic's Alignment Faking methodology to detect behavioral inconsistency under varying monitoring conditions. Evaluation of seven frontier models reveals distinct capability profiles: Claude-Opus-4.5 excels in practical skills despite lower content knowledge, while Grok-4.1-fast leads in knowledge but shows alignment concerns. Notably, no single model dominates all dimensions, validating the necessity of multi-axis evaluation. OpenLearnLM provides an open, comprehensive framework for advancing LLM readiness in authentic educational contexts.

35. 【2601.13879】Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring

链接https://arxiv.org/abs/2601.13879

作者:Dongxu Zhang,Yiding Sun,Cheng Tan,Wenbiao Yan,Ning Yang,Jihua Zhu,Hiajun Zhang

类目:Multimedia (cs.MM); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Language Models, Multimodal Large Language, Language Models, Large Language, prohibitive latency constraints

备注

点击查看摘要

Abstract:While Chain-of-Thought (CoT) reasoning significantly enhances the performance of Multimodal Large Language Models (MLLMs), its autoregressive nature incurs prohibitive latency constraints. Current efforts to mitigate this via token compression often fail by blindly applying text-centric metrics to multimodal contexts. We identify a critical failure mode termed Visual Amnesia, where linguistically redundant tokens are erroneously pruned, leading to hallucinations. To address this, we introduce V-Skip that reformulates token pruning as a Visual-Anchored Information Bottleneck (VA-IB) optimization problem. V-Skip employs a dual-path gating mechanism that weighs token importance through both linguistic surprisal and cross-modal attention flow, effectively rescuing visually salient anchors. Extensive experiments on Qwen2-VL and Llama-3.2 families demonstrate that V-Skip achieves a $2.9\times$ speedup with negligible accuracy loss. Specifically, it preserves fine-grained visual details, outperforming other baselines over 30\% on the DocVQA.

36. 【2601.13876】Pedagogical Alignment for Vision-Language-Action Models: A Comprehensive Framework for Data, Architecture, and Evaluation in Education

链接https://arxiv.org/abs/2601.13876

作者:Unggi Lee,Jahyun Jeong,Sunyoung Shin,Haeun Park,Jeongsu Moon,Youngchang Song,Jaechang Shim,JaeHwan Lee,Yunju Noh,Seungwon Choi,Ahhyun Kim,TaeHyeon Kim,Kyungtae Joo,Taeyeong Kim,Gyeonggeon Lee

类目:Computation and Language (cs.CL)

关键词:effective STEM education, Pedagogical VLA Framework, effective STEM, VLA Framework, Pedagogical VLA

备注

点击查看摘要

Abstract:Science demonstrations are important for effective STEM education, yet teachers face challenges in conducting them safely and consistently across multiple occasions, where robotics can be helpful. However, current Vision-Language-Action (VLA) models require substantial computational resources and sacrifice language generation capabilities to maximize efficiency, making them unsuitable for resource-constrained educational settings that require interpretable, explanation-generating systems. We present \textit{Pedagogical VLA Framework}, a framework that applies pedagogical alignment to lightweight VLA models through four components: text healing to restore language generation capabilities, large language model (LLM) distillation to transfer pedagogical knowledge, safety training for educational environments, and pedagogical evaluation adjusted to science education contexts. We evaluate Pedagogical VLA Framework across five science demonstrations spanning physics, chemistry, biology, and earth science, using an evaluation framework developed in collaboration with science education experts. Our evaluation assesses both task performance (success rate, protocol compliance, efficiency, safety) and pedagogical quality through teacher surveys and LLM-as-Judge assessment. We additionally provide qualitative analysis of generated texts. Experimental results demonstrate that Pedagogical VLA Framework achieves comparable task performance to baseline models while producing contextually appropriate educational explanations.

37. 【2601.13836】FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

链接https://arxiv.org/abs/2601.13836

作者:Qian Chen,Jinlan Fu,Changsong Li,See-Kiong Ng,Xipeng Qiu

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, remains largely unexplored

备注: [this https URL](https://openmoss.github.io/FutureOmni)

点击查看摘要

Abstract:Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio-visual environments. The evaluated models are required to perform cross-modal causal and temporal reasoning, as well as effectively leverage internal knowledge to predict future events. FutureOmni is constructed via a scalable LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations on 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios, with the best accuracy of 64.8% achieved by Gemini 3 Flash. To mitigate this limitation, we curate a 7K-sample instruction-tuning dataset and propose an Omni-Modal Future Forecasting (OFF) training strategy. Evaluations on FutureOmni and popular audio-visual and video-only benchmarks demonstrate that OFF enhances future forecasting and generalization. We publicly release all code (this https URL) and datasets (this https URL).

38. 【2601.13835】he Role of Prosodic and Lexical Cues in Turn-Taking with Self-Supervised Speech Representations

链接https://arxiv.org/abs/2601.13835

作者:Sam OConnor Russell,Delphine Charuau,Naomi Harte

类目:Computation and Language (cs.CL)

关键词:Fluid turn-taking remains, Fluid turn-taking, human-robot interaction, key challenge, challenge in human-robot

备注: Accepted to ICASSP 2026

点击查看摘要

Abstract:Fluid turn-taking remains a key challenge in human-robot interaction. Self-supervised speech representations (S3Rs) have driven many advances, but it remains unclear whether S3R-based turn-taking models rely on prosodic cues, lexical cues or both. We introduce a vocoder-based approach to control prosody and lexical cues in speech more cleanly than prior work. This allows us to probe the voice-activity projection model, an S3R-based turn-taking model. We find that prediction on prosody-matched, unintelligible noise is similar to accuracy on clean speech. This reveals both prosodic and lexical cues support turn-taking, but either can be used in isolation. Hence, future models may only require prosody, providing privacy and potential performance benefits. When either prosodic or lexical information is disrupted, the model exploits the other without further training, indicating they are encoded in S3Rs with limited interdependence. Results are consistent in CPC-based and wav2vec2.0 S3Rs. We discuss our findings and highlight a number of directions for future work. All code is available to support future research.

39. 【2601.13806】Knowledge Graph-Assisted LLM Post-Training for Enhanced Legal Reasoning

链接https://arxiv.org/abs/2601.13806

作者:Dezhao Song,Guglielmo Bonifazi,Frank Schilder,Jonathan Richard Schwarz

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:large text corpora, human feedback, primarily relied, relied on large, large text

备注

点击查看摘要

Abstract:LLM post-training has primarily relied on large text corpora and human feedback, without capturing the structure of domain knowledge. This has caused models to struggle dealing with complex reasoning tasks, especially for high-stakes professional domains. In Law, reasoning requires deep understanding of the relations between various legal concepts, a key component missing in current LLM post-training. In this paper, we propose a knowledge graph (KG)-assisted approach for enhancing LLMs' reasoning capability in Legal that is generalizable to other high-stakes domains. We model key legal concepts by following the \textbf{IRAC} (Issue, Rule, Analysis and Conclusion) framework, and construct a KG with 12K legal cases. We then produce training data using our IRAC KG, and conduct both Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) with three state-of-the-art (SOTA) LLMs (30B, 49B and 70B), varying architecture and base model family. Our post-trained models obtained better average performance on 4/5 diverse legal benchmarks (14 tasks) than baselines. In particular, our 70B DPO model achieved the best score on 4/6 reasoning tasks, among baselines and a 141B SOTA legal LLM, demonstrating the effectiveness of our KG for enhancing LLMs' legal reasoning capability.

40. 【2601.13802】Habibi: Laying the Open-Source Foundation of Unified-Dialectal Arabic Speech Synthesis

链接https://arxiv.org/abs/2601.13802

作者:Yushen Chen,Junzhe Liu,Yujie Tu,Zhikang Niu,Yuzhe Liang,Kai Yu,Chunyu Qiang,Chen Zhang,Xie Chen

类目:Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

关键词:notable gap persists, unified modeling perspective, Arabic dialects, modeling perspective, notable gap

备注

点击查看摘要

Abstract:A notable gap persists in speech synthesis research and development for Arabic dialects, particularly from a unified modeling perspective. Despite its high practical value, the inherent linguistic complexity of Arabic dialects, further compounded by a lack of standardized data, benchmarks, and evaluation guidelines, steers researchers toward safer ground. To bridge this divide, we present Habibi, a suite of specialized and unified text-to-speech models that harnesses existing open-source ASR corpora to support a wide range of high- to low-resource Arabic dialects through linguistically-informed curriculum learning. Our approach outperforms the leading commercial service in generation quality, while maintaining extensibility through effective in-context learning, without requiring text diacritization. We are committed to open-sourcing the model, along with creating the first systematic benchmark for multi-dialect Arabic speech synthesis. Furthermore, by identifying the key challenges in and establishing evaluation standards for the process, we aim to provide a solid groundwork for subsequent research. Resources at this https URL .

41. 【2601.13770】Look-Ahead-Bench: a Standardized Benchmark of Look-ahead Bias in Point-in-Time LLMs for Finance

链接https://arxiv.org/abs/2601.13770

作者:Mostapha Benhenda(LAGA)

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Computational Finance (q-fin.CP); General Finance (q-fin.GN)

关键词:Large Language Models, Large Language, benchmark measuring look-ahead, measuring look-ahead bias, Language Models

备注

点击查看摘要

Abstract:We introduce Look-Ahead-Bench, a standardized benchmark measuring look-ahead bias in Point-in-Time (PiT) Large Language Models (LLMs) within realistic and practical financial workflows. Unlike most existing approaches that primarily test inner lookahead knowledge via Q\\A, our benchmark evaluates model behavior in practical scenarios. To distinguish genuine predictive capability from memorization-based performance, we analyze performance decay across temporally distinct market regimes, incorporating several quantitative baselines to establish performance thresholds. We evaluate prominent open-source LLMs -- Llama 3.1 (8B and 70B) and DeepSeek 3.2 -- against a family of Point-in-Time LLMs (Pitinf-Small, Pitinf-Medium, and frontier-level model Pitinf-Large) from PiT-Inference. Results reveal significant lookahead bias in standard LLMs, as measured with alpha decay, unlike Pitinf models, which demonstrate improved generalization and reasoning abilities as they scale in size. This work establishes a foundation for the standardized evaluation of temporal bias in financial LLMs and provides a practical framework for identifying models suitable for real-world deployment. Code is available on GitHub: this https URL

42. 【2601.13761】DARC: Decoupled Asymmetric Reasoning Curriculum for LLM Evolution

链接https://arxiv.org/abs/2601.13761

作者:Shengda Fan,Xuyan Ye,Yankai Lin

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:self-improving artificial intelligence, achieving self-improving artificial, large language models, artificial intelligence, large language

备注

点击查看摘要

Abstract:Self-play with large language models has emerged as a promising paradigm for achieving self-improving artificial intelligence. However, existing self-play frameworks often suffer from optimization instability, due to (i) non-stationary objectives induced by solver-dependent reward feedback for the Questioner, and (ii) bootstrapping errors from self-generated pseudo-labels used to supervise the Solver. To mitigate these challenges, we introduce DARC (Decoupled Asymmetric Reasoning Curriculum), a two-stage framework that stabilizes the self-evolution process. First, we train the Questioner to synthesize difficulty-calibrated questions, conditioned on explicit difficulty levels and external corpora. Second, we train the Solver with an asymmetric self-distillation mechanism, where a document-augmented teacher generates high-quality pseudo-labels to supervise the student Solver that lacks document access. Empirical results demonstrate that DARC is model-agnostic, yielding an average improvement of 10.9 points across nine reasoning benchmarks and three backbone models. Moreover, DARC consistently outperforms all baselines and approaches the performance of fully supervised models without relying on human annotations. The code is available at this https URL.

43. 【2601.13752】Finding RELIEF: Shaping Reasoning Behavior without Reasoning Supervision via Belief Engineering

链接https://arxiv.org/abs/2601.13752

作者:Chak Tou Leong,Dingwei Chen,Heming Xia,Qingyu Yin,Sunbowen Lee,Jian Wang,Wenjie Li

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:achieved remarkable success, Large reasoning models, Large reasoning, complex problem-solving, achieved remarkable

备注: Working in progress

点击查看摘要

Abstract:Large reasoning models (LRMs) have achieved remarkable success in complex problem-solving, yet they often suffer from computational redundancy or reasoning unfaithfulness. Current methods for shaping LRM behavior typically rely on reinforcement learning or fine-tuning with gold-standard reasoning traces, a paradigm that is both computationally expensive and difficult to scale. In this paper, we reveal that LRMs possess latent \textit{reasoning beliefs} that internally track their own reasoning traits, which can be captured through simple logit probing. Building upon this insight, we propose Reasoning Belief Engineering (RELIEF), a simple yet effective framework that shapes LRM behavior by aligning the model's self-concept with a target belief blueprint. Crucially, RELIEF completely bypasses the need for reasoning-trace supervision. It internalizes desired traits by fine-tuning on synthesized, self-reflective question-answering pairs that affirm the target belief. Extensive experiments on efficiency and faithfulness tasks demonstrate that RELIEF matches or outperforms behavior-supervised and preference-based baselines while requiring lower training costs. Further analysis validates that shifting a model's reasoning belief effectively shapes its actual behavior.

44. 【2601.13749】Pro-AI Bias in Large Language Models

链接https://arxiv.org/abs/2601.13749

作者:Benaya Trabelsi,Jonathan Shaki,Sarit Kraus

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)

关键词:Large language models, Large language, multiple domains, increasingly employed, employed for decision-support

备注: 13 pages, 6 figures. Code available at: [this https URL](https://github.com/benayat/Pro-AI-bias-in-LLMs)

点击查看摘要

Abstract:Large language models (LLMs) are increasingly employed for decision-support across multiple domains. We investigate whether these models display a systematic preferential bias in favor of artificial intelligence (AI) itself. Across three complementary experiments, we find consistent evidence of pro-AI bias. First, we show that LLMs disproportionately recommend AI-related options in response to diverse advice-seeking queries, with proprietary models doing so almost deterministically. Second, we demonstrate that models systematically overestimate salaries for AI-related jobs relative to closely matched non-AI jobs, with proprietary models overestimating AI salaries more by 10 percentage points. Finally, probing internal representations of open-weight models reveals that ``Artificial Intelligence'' exhibits the highest similarity to generic prompts for academic fields under positive, negative, and neutral framings alike, indicating valence-invariant representational centrality. These patterns suggest that LLM-generated advice and valuation can systematically skew choices and perceptions in high-stakes decisions.

45. 【2601.13742】Dimension-First Evaluation of Speech-to-Speech Models with Structured Acoustic Cues

链接https://arxiv.org/abs/2601.13742

作者:Arjun Chandra,Kevin Miller,Venkatesh Ravichandran,Constantinos Papayiannis,Venkatesh Saligrama

类目:Computation and Language (cs.CL)

关键词:Large Language Model, Audio Language Models, Large Language, Language Model, exhibit strong reasoning

备注: EACL 2026 Findings

点击查看摘要

Abstract:Large Language Model (LLM) judges exhibit strong reasoning capabilities but are limited to textual content. This leaves current automatic Speech-to-Speech (S2S) evaluation methods reliant on opaque and expensive Audio Language Models (ALMs). In this work, we propose TRACE (Textual Reasoning over Audio Cues for Evaluation), a novel framework that enables LLM judges to reason over audio cues to achieve cost-efficient and human-aligned S2S evaluation. To demonstrate the strength of the framework, we first introduce a Human Chain-of-Thought (HCoT) annotation protocol to improve the diagnostic capability of existing judge benchmarks by separating evaluation into explicit dimensions: content (C), voice quality (VQ), and paralinguistics (P). Using this data, TRACE constructs a textual blueprint of inexpensive audio signals and prompts an LLM to render dimension-wise judgments, fusing them into an overall rating via a deterministic policy. TRACE achieves higher agreement with human raters than ALMs and transcript-only LLM judges while being significantly more cost-effective. We will release the HCoT annotations and the TRACE framework to enable scalable and human-aligned S2S evaluation.

46. 【2601.13734】owards robust long-context understanding of large language model via active recap learning

链接https://arxiv.org/abs/2601.13734

作者:Chenyu Hui

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:active recap learning, propose active recap, enhancing large language, large language model, recap learning

备注: 5 pages

点击查看摘要

Abstract:In this paper, we propose active recap learning (ARL), a framework for enhancing large language model (LLM) in understanding long contexts. ARL enables models to revisit and summarize earlier content through targeted sequence construction during contined pretraining and retrospective summarization at inference. First, we identify key tokens in prepared long context based on loss gaps between long and short forward contexts and find most revant preceding paragraphs, then summarize them using an LLM. Second, ARL equips models with the ability to autonomously generate and utilize these retrospective summaries during inference, thereby establishing a recursive memory mechanism across paragraphs. Experimental results show substantial gains, with ARL achieving a 26.8% improvement on RULER and a 9.44% improvement on LongBench. Overall, ARL offers a simple yet effective continued pretraining-based approach to strengthen long-context understanding, advancing scalable memory augmentation in LLM

47. 【2601.13729】On Temperature-Constrained Non-Deterministic Machine Translation: Potential and Evaluation

链接https://arxiv.org/abs/2601.13729

作者:Weichuan Wang,Mingyang Liu,Linqi Song,Chen Ma

类目:Computation and Language (cs.CL)

关键词:garnered considerable attention, recent years, real-world applications, language models, models have garnered

备注: 9 pages, 12 figures

点击查看摘要

Abstract:In recent years, the non-deterministic properties of language models have garnered considerable attention and have shown a significant influence on real-world applications. However, such properties remain under-explored in machine translation (MT), a complex, non-deterministic NLP task. In this study, we systematically evaluate modern MT systems and identify temperature-constrained Non-Deterministic MT (ND-MT) as a distinct phenomenon. Additionally, we demonstrate that ND-MT exhibits significant potential in addressing the multi-modality issue that has long challenged MT research and provides higher-quality candidates than Deterministic MT (D-MT) under temperature constraints. However, ND-MT introduces new challenges in evaluating system performance. Specifically, the evaluation framework designed for D-MT fails to yield consistent evaluation results when applied to ND-MT. We further investigate this emerging challenge by evaluating five state-of-the-art ND-MT systems across three open datasets using both lexical-based and semantic-based metrics at varying sampling sizes. The results reveal a Buckets effect across these systems: the lowest-quality candidate generated by ND-MT consistently determines the overall system ranking across different sampling sizes for all reasonable metrics. Furthermore, we propose the ExpectoSample strategy to automatically assess the reliability of evaluation metrics for selecting robust ND-MT.

48. 【2601.13722】OP-Bench: Benchmarking Over-Personalization for Memory-Augmented Personalized Conversational Agents

链接https://arxiv.org/abs/2601.13722

作者:Yulin Hu,Zimo Long,Jiahe Guo,Xingyu Sui,Xing Fu,Weixiang Zhao,Yanyan Zhao,Bing Qin

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:gained substantial traction, enable personalized interactions, conversational agents enable, agents enable personalized, substantial traction

备注

点击查看摘要

Abstract:Memory-augmented conversational agents enable personalized interactions using long-term user memory and have gained substantial traction. However, existing benchmarks primarily focus on whether agents can recall and apply user information, while overlooking whether such personalization is used appropriately. In fact, agents may overuse personal information, producing responses that feel forced, intrusive, or socially inappropriate to users. We refer to this issue as \emph{over-personalization}. In this work, we formalize over-personalization into three types: Irrelevance, Repetition, and Sycophancy, and introduce \textbf{OP-Bench} a benchmark of 1,700 verified instances constructed from long-horizon dialogue histories. Using \textbf{OP-Bench}, we evaluate multiple large language models and memory-augmentation methods, and find that over-personalization is widespread when memory is introduced. Further analysis reveals that agents tend to retrieve and over-attend to user memories even when unnecessary. To address this issue, we propose \textbf{Self-ReCheck}, a lightweight, model-agnostic memory filtering mechanism that mitigates over-personalization while preserving personalization performance. Our work takes an initial step toward more controllable and appropriate personalization in memory-augmented dialogue systems.

49. 【2601.13717】Simulated Ignorance Fails: A Systematic Study of LLM Behaviors on Forecasting Problems Before Model Knowledge Cutoff

链接https://arxiv.org/abs/2601.13717

作者:Zehan Li,Yuxuan Wang,Ali El Lahib,Ying-Jieh Xia,Xinyu Pi

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Evaluating LLM forecasting, Evaluating LLM, prospective evaluation offers, faces rapidly shrinking, evaluation offers methodological

备注

点击查看摘要

Abstract:Evaluating LLM forecasting capabilities is constrained by a fundamental tension: prospective evaluation offers methodological rigor but prohibitive latency, while retrospective forecasting (RF) -- evaluating on already-resolved events -- faces rapidly shrinking clean evaluation data as SOTA models possess increasingly recent knowledge cutoffs. Simulated Ignorance (SI), prompting models to suppress pre-cutoff knowledge, has emerged as a potential solution. We provide the first systematic test of whether SI can approximate True Ignorance (TI). Across 477 competition-level questions and 9 models, we find that SI fails systematically: (1) cutoff instructions leave a 52% performance gap between SI and TI; (2) chain-of-thought reasoning fails to suppress prior knowledge, even when reasoning traces contain no explicit post-cutoff references; (3) reasoning-optimized models exhibit worse SI fidelity despite superior reasoning trace quality. These findings demonstrate that prompts cannot reliably "rewind" model knowledge. We conclude that RF on pre-cutoff events is methodologically flawed; we recommend against using SI-based retrospective setups to benchmark forecasting capabilities.

50. 【2601.13711】GerAV: Towards New Heights in German Authorship Verification using Fine-Tuned LLMs on a New Benchmark

链接https://arxiv.org/abs/2601.13711

作者:Lotta Kiefer,Christoph Leiter,Sotaro Takeshita,Elena Schmidt,Steffen Eger

类目:Computation and Language (cs.CL)

关键词:Authorship verification, predominantly for English, studied extensively, task of determining, English data

备注

点击查看摘要

Abstract:Authorship verification (AV) is the task of determining whether two texts were written by the same author and has been studied extensively, predominantly for English data. In contrast, large-scale benchmarks and systematic evaluations for other languages remain scarce. We address this gap by introducing GerAV, a comprehensive benchmark for German AV comprising over 600k labeled text pairs. GerAV is built from Twitter and Reddit data, with the Reddit part further divided into in-domain and cross-domain message-based subsets, as well as a profile-based subset. This design enables controlled analysis of the effects of data source, topical domain, and text length. Using the provided training splits, we conduct a systematic evaluation of strong baselines and state-of-the-art models and find that our best approach, a fine-tuned large language model, outperforms recent baselines by up to 0.09 absolute F1 score and surpasses GPT-5 in a zero-shot setting by 0.08. We further observe a trade-off between specialization and generalization: models trained on specific data types perform best under matching conditions but generalize less well across data regimes, a limitation that can be mitigated by combining training sources. Overall, GerAV provides a challenging and versatile benchmark for advancing research on German and cross-domain AV.

51. 【2601.13710】Who Should Have Surgery? A Comparative Study of GenAI vs Supervised ML for CRS Surgical Outcome Prediction

链接https://arxiv.org/abs/2601.13710

作者:Sayeed Shafayet Chowdhury,Snehasis Mukhopadhyay,Shiaofen Fang,Vijay R. Ramakrishnan

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:reshaped medical imaging, Artificial intelligence, support remains limited, prospective decision support, decision support remains

备注

点击查看摘要

Abstract:Artificial intelligence has reshaped medical imaging, yet the use of AI on clinical data for prospective decision support remains limited. We study pre-operative prediction of clinically meaningful improvement in chronic rhinosinusitis (CRS), defining success as a more than 8.9-point reduction in SNOT-22 at 6 months (MCID). In a prospectively collected cohort where all patients underwent surgery, we ask whether models using only pre-operative clinical data could have identified those who would have poor outcomes, i.e. those who should have avoided surgery. We benchmark supervised ML (logistic regression, tree ensembles, and an in-house MLP) against generative AI (ChatGPT, Claude, Gemini, Perplexity), giving each the same structured inputs and constraining outputs to binary recommendations with confidence. Our best ML model (MLP) achieves 85 % accuracy with superior calibration and decision-curve net benefit. GenAI models underperform on discrimination and calibration across zero-shot setting. Notably, GenAI justifications align with clinician heuristics and the MLP's feature importance, repeatedly highlighting baseline SNOT-22, CT/endoscopy severity, polyp phenotype, and physchology/pain comorbidities. We provide a reproducible tabular-to-GenAI evaluation protocol and subgroup analyses. Findings support an ML-first, GenAI- augmented workflow: deploy calibrated ML for primary triage of surgical candidacy, with GenAI as an explainer to enhance transparency and shared decision-making.

52. 【2601.13709】Hidden in Plain Text: Measuring LLM Deception Quality Against Human Baselines Using Social Deduction Games

链接https://arxiv.org/abs/2601.13709

作者:Christopher Kao,Vanshika Vats,James Davis

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)

关键词:Large Language Model, Large Language, Language Model, social contexts, raising concerns

备注: For associated dataset, see [this https URL](https://github.com/cocochief4/llm-mafia) . Published in IEEE ICA 2025, waiting for IEEEXplore proceedings

点击查看摘要

Abstract:Large Language Model (LLM) agents are increasingly used in many applications, raising concerns about their safety. While previous work has shown that LLMs can deceive in controlled tasks, less is known about their ability to deceive using natural language in social contexts. In this paper, we study deception in the Social Deduction Game (SDG) Mafia, where success is dependent on deceiving others through conversation. Unlike previous SDG studies, we use an asynchronous multi-agent framework which better simulates realistic social contexts. We simulate 35 Mafia games with GPT-4o LLM agents. We then create a Mafia Detector using GPT-4-Turbo to analyze game transcripts without player role information to predict the mafia players. We use prediction accuracy as a surrogate marker for deception quality. We compare this prediction accuracy to that of 28 human games and a random baseline. Results show that the Mafia Detector's mafia prediction accuracy is lower on LLM games than on human games. The result is consistent regardless of the game days and the number of mafias detected. This indicates that LLMs blend in better and thus deceive more effectively. We also release a dataset of LLM Mafia transcripts to support future research. Our findings underscore both the sophistication and risks of LLM deception in social contexts.

53. 【2601.13697】Uncertainty-Aware Gradient Signal-to-Noise Data Selection for Instruction Tuning

链接https://arxiv.org/abs/2601.13697

作者:Zhihang Yuan,Chengyu Yue,Long Huang,Litu Ou,Lei Shi

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:modern instruction datasets, large language models, making full-data fine-tuning, adapting large language, full-data fine-tuning costly

备注: Preprint

点击查看摘要

Abstract:Instruction tuning is a standard paradigm for adapting large language models (LLMs), but modern instruction datasets are large, noisy, and redundant, making full-data fine-tuning costly and often unnecessary. Existing data selection methods either build expensive gradient datastores or assign static scores from a weak proxy, largely ignoring evolving uncertainty, and thus missing a key source of LLM interpretability. We propose GRADFILTERING, an objective-agnostic, uncertainty-aware data selection framework that utilizes a small GPT-2 proxy with a LoRA ensemble and aggregates per-example gradients into a Gradient Signal-to-Noise Ratio (G-SNR) utility. Our method matches or surpasses random subsets and strong baselines in most LLM-as-a-judge evaluations as well as in human assessment. Moreover, GRADFILTERING-selected subsets converge faster than competitive filters under the same compute budget, reflecting the benefit of uncertainty-aware scoring.

54. 【2601.13695】OptiSQL: Executable SQL Generation from Optical TokensOptiSQL: Executable SQL Generation from Optical Tokens

链接https://arxiv.org/abs/2601.13695

作者:Sifan Li,Hongkai Chen,Yujun Cai,Liyang Chen,Qingwen Ye,Yiwei Wang

类目:Computation and Language (cs.CL)

关键词:fully linearized textual, linearized textual schemas, typically studied, provided as fully, fully linearized

备注

点击查看摘要

Abstract:Executable SQL generation is typically studied in text-to-SQL settings, where tables are provided as fully linearized textual schemas and contents. While effective, this formulation assumes access to structured text and incurs substantial token overhead, which is misaligned with many real-world scenarios where tables appear as visual artifacts in documents or webpages. We investigate whether compact optical representations can serve as an efficient interface for executable semantic parsing. We present OptiSQL, a vision-driven framework that generates executable SQL directly from table images and natural language questions using compact optical tokens. OptiSQL leverages an OCR-oriented visual encoder to compress table structure and content into a small set of optical tokens and fine-tunes a pretrained decoder for SQL generation while freezing the encoder to isolate representation sufficiency. Experiments on a visualized version of Spider 2.0-Snow show that OptiSQL retains strong execution accuracy while reducing table input tokens by an order of magnitude. Robustness analyses further demonstrate that optical tokens preserve essential structural information under visual perturbations.

55. 【2601.13690】Dr. Assistant: Enhancing Clinical Diagnostic Inquiry via Structured Diagnostic Reasoning Data and Reinforcement Learning

链接https://arxiv.org/abs/2601.13690

作者:Yue Guo,Fanfu Wang,Jianwei Lv,Xincheng Shi,Yuchen Li,Youya Wang,Yunsheng Zeng,Yujing Liu,Yunhao Qiao,Gen Li,Junfeng Wang,Bo Yuan

类目:Computation and Language (cs.CL)

关键词:Decision Support Systems, Clinical Decision Support, Support Systems, Decision Support, face notable challenges

备注

点击查看摘要

Abstract:Clinical Decision Support Systems (CDSSs) provide reasoning and inquiry guidance for physicians, yet they face notable challenges, including high maintenance costs and low generalization capability. Recently, Large Language Models (LLMs) have been widely adopted in healthcare due to their extensive knowledge reserves, retrieval, and communication capabilities. While LLMs show promise and excel at medical benchmarks, their diagnostic reasoning and inquiry skills are constrained. To mitigate this issue, we propose (1) Clinical Diagnostic Reasoning Data (CDRD) structure to capture abstract clinical reasoning logic, and a pipeline for its construction, and (2) the Dr. Assistant, a clinical diagnostic model equipped with clinical reasoning and inquiry skills. Its training involves a two-stage process: SFT, followed by RL with a tailored reward function. We also introduce a benchmark to evaluate both diagnostic reasoning and inquiry. Our experiments demonstrate that the Dr. Assistant outperforms open-source models and achieves competitive performance to closed-source models, providing an effective solution for clinical diagnostic inquiry guidance.

56. 【2601.13684】HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference

链接https://arxiv.org/abs/2601.13684

作者:Zhiyuan Shi,Qibo Qiu,Feng Xue,Zhonglin Jiang,Li Yu,Jian Jiang,Xiaofei He,Wenxiao Wang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:linear memory growth, bottleneck for LLM, LLM inference, linear memory, memory growth

备注

点击查看摘要

Abstract:The linear memory growth of the KV cache poses a significant bottleneck for LLM inference in long-context tasks. Existing static compression methods often fail to preserve globally important information, principally because they overlook the attention drift phenomenon where token significance evolves dynamically. Although recent dynamic retrieval approaches attempt to address this issue, they typically suffer from coarse-grained caching strategies and incur high I/O overhead due to frequent data transfers. To overcome these limitations, we propose HeteroCache, a training-free dynamic compression framework. Our method is built on two key insights: attention heads exhibit diverse temporal heterogeneity, and there is significant spatial redundancy among heads within the same layer. Guided by these insights, HeteroCache categorizes heads based on stability and redundancy. Consequently, we apply a fine-grained weighting strategy that allocates larger cache budgets to heads with rapidly shifting attention to capture context changes, thereby addressing the inefficiency of coarse-grained strategies. Furthermore, we employ a hierarchical storage mechanism in which a subset of representative heads monitors attention shift, and trigger an asynchronous, on-demand retrieval of contexts from the CPU, effectively hiding I/O latency. Finally, experiments demonstrate that HeteroCache achieves state-of-the-art performance on multiple long-context benchmarks and accelerates decoding by up to $3\times$ compared to the original model in the 224K context. Our code will be open-source.

57. 【2601.13669】CommunityBench: Benchmarking Community-Level Alignment across Diverse Groups and Tasks

链接https://arxiv.org/abs/2601.13669

作者:Jiayu Lin,Zhongyu Wei

类目:Computation and Language (cs.CL)

关键词:Large language models, Large language, behaviors reflect human, ensures model behaviors, model behaviors reflect

备注

点击查看摘要

Abstract:Large language models (LLMs) alignment ensures model behaviors reflect human value. Existing alignment strategies primarily follow two paths: one assumes a universal value set for a unified goal (i.e., one-size-fits-all), while the other treats every individual as unique to customize models (i.e., individual-level). However, assuming a monolithic value space marginalizes minority norms, while tailoring individual models is prohibitively expensive. Recognizing that human society is organized into social clusters with high intra-group value alignment, we propose community-level alignment as a "middle ground". Practically, we introduce CommunityBench, the first large-scale benchmark for community-level alignment evaluation, featuring four tasks grounded in Common Identity and Common Bond theory. With CommunityBench, we conduct a comprehensive evaluation of various foundation models on CommunityBench, revealing that current LLMs exhibit limited capacity to model community-specific preferences. Furthermore, we investigate the potential of community-level alignment in facilitating individual modeling, providing a promising direction for scalable and pluralistic alignment.

58. 【2601.13659】mporal-Spatial Decouple before Act: Disentangled Representation Learning for Multimodal Sentiment Analysis

链接https://arxiv.org/abs/2601.13659

作者:Chunlei Meng,Ziyang Zhou,Lucas He,Xiaojing Du,Chun Ouyang,Zhongxue Gan

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

关键词:Multimodal Sentiment Analysis, Multimodal Sentiment, Analysis integrates Linguistic, Sentiment Analysis integrates, integrates Linguistic

备注: This study has been accepted by IEEE ICASSP2026

点击查看摘要

Abstract:Multimodal Sentiment Analysis integrates Linguistic, Visual, and Acoustic. Mainstream approaches based on modality-invariant and modality-specific factorization or on complex fusion still rely on spatiotemporal mixed modeling. This ignores spatiotemporal heterogeneity, leading to spatiotemporal information asymmetry and thus limited performance. Hence, we propose TSDA, Temporal-Spatial Decouple before Act, which explicitly decouples each modality into temporal dynamics and spatial structural context before any interaction. For every modality, a temporal encoder and a spatial encoder project signals into separate temporal and spatial body. Factor-Consistent Cross-Modal Alignment then aligns temporal features only with their temporal counterparts across modalities, and spatial features only with their spatial counterparts. Factor specific supervision and decorrelation regularization reduce cross factor leakage while preserving complementarity. A Gated Recouple module subsequently recouples the aligned streams for task. Extensive experiments show that TSDA outperforms baselines. Ablation analysis studies confirm the necessity and interpretability of the design.

59. 【2601.13658】Beyond Known Facts: Generating Unseen Temporal Knowledge to Address Data Contamination in LLM Evaluation

链接https://arxiv.org/abs/2601.13658

作者:Arthur Amalvy,Hen-Hsen Huang

类目:Computation and Language (cs.CL)

关键词:temporal knowledge graph, Knowledge Graph Forecasting, knowledge graph extraction, populating large web, information is important

备注: 12 pages

点击查看摘要

Abstract:The automatic extraction of information is important for populating large web knowledge bases such as Wikidata. The temporal version of that task, temporal knowledge graph extraction (TKGE), involves extracting temporally grounded facts from text, represented as semantic quadruples (subject, relation, object, timestamp). Many recent systems take advantage of large language models (LLMs), which are becoming a new cornerstone of the web due to their performance on many tasks across the natural language processing (NLP) field. Despite the importance of TKGE, existing datasets for training and evaluation remain scarce, and contamination of evaluation data is an unaddressed issue, potentially inflating LLMs' perceived performance due to overlaps between training and evaluation sets. To mitigate these challenges, we propose a novel synthetic evaluation dataset constructed from predicted future, previously unseen temporal facts, thereby eliminating contamination and enabling robust and unbiased benchmarking. Our dataset creation involves a two-step approach: (1) Temporal Knowledge Graph Forecasting (TKGF) generates plausible future quadruples, which are subsequently filtered to adhere to the original knowledge base schema; (2) LLMs perform quadruple-to-text generation, creating semantically aligned textual descriptions. We benchmark Extract, Define and Canonicalize (EDC), a state-of-the-art LLM-based extraction framework, demonstrating that LLM performance decreases when evaluated on our dataset compared to a dataset of known facts. We publicly release our dataset consisting of 4.2K future quadruples and corresponding textual descriptions, along with the generation methodology, enabling continuous creation of unlimited future temporal datasets to serve as long-term, contamination-free benchmarks for TKGE.

60. 【2601.13649】Fairness or Fluency? An Investigation into Language Bias of Pairwise LLM-as-a-Judge

链接https://arxiv.org/abs/2601.13649

作者:Xiaolin Zhou,Zheng Luo,Yicheng Gao,Qixuan Chen,Xiyang Hu,Yue Zhao,Ruishan Liu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Recent advances, advances in Large, Large Language, Language

备注

点击查看摘要

Abstract:Recent advances in Large Language Models (LLMs) have incentivized the development of LLM-as-a-judge, an application of LLMs where they are used as judges to decide the quality of a certain piece of text given a certain context. However, previous studies have demonstrated that LLM-as-a-judge can be biased towards different aspects of the judged texts, which often do not align with human preference. One of the identified biases is language bias, which indicates that the decision of LLM-as-a-judge can differ based on the language of the judged texts. In this paper, we study two types of language bias in pairwise LLM-as-a-judge: (1) performance disparity between languages when the judge is prompted to compare options from the same language, and (2) bias towards options written in major languages when the judge is prompted to compare options of two different languages. We find that for same-language judging, there exist significant performance disparities across language families, with European languages consistently outperforming African languages, and this bias is more pronounced in culturally-related subjects. For inter-language judging, we observe that most models favor English answers, and that this preference is influenced more by answer language than question language. Finally, we investigate whether language bias is in fact caused by low-perplexity bias, a previously identified bias of LLM-as-a-judge, and we find that while perplexity is slightly correlated with language bias, language bias cannot be fully explained by perplexity only.

61. 【2601.13644】owards Token-Level Text Anomaly Detection

链接https://arxiv.org/abs/2601.13644

作者:Yang Cao,Bicheng Yu,Sikun Yang,Ming Liu,Yujiu Yang

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:existing methods, document-level analysis, unable to identify, significant progress, web applications

备注: WWW 2026

点击查看摘要

Abstract:Despite significant progress in text anomaly detection for web applications such as spam filtering and fake news detection, existing methods are fundamentally limited to document-level analysis, unable to identify which specific parts of a text are anomalous. We introduce token-level anomaly detection, a novel paradigm that enables fine-grained localization of anomalies within text. We formally define text anomalies at both document and token-levels, and propose a unified detection framework that operates across multiple levels. To facilitate research in this direction, we collect and annotate three benchmark datasets spanning spam, reviews and grammar errors with token-level labels. Experimental results demonstrate that our framework get better performance than other 6 baselines, opening new possibilities for precise anomaly localization in text. All the codes and data are publicly available on this https URL.

62. 【2601.13630】Activation-Space Anchored Access Control for Multi-Class Permission Reasoning in Large Language Models

链接https://arxiv.org/abs/2601.13630

作者:Zhaopeng Zhang,Pengcheng Sun,Lan Zhang,Chen Tang,Jiewei Lai,Yunhao Wang,Hui Jin

类目:Computation and Language (cs.CL)

关键词:Large language models, efficient knowledge retrieval, Large language, language models, question answering

备注

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed over knowledge bases for efficient knowledge retrieval and question answering. However, LLMs can inadvertently answer beyond a user's permission scope, leaking sensitive content, thus making it difficult to deploy knowledge-base QA under fine-grained access control requirements. In this work, we identify a geometric regularity in intermediate activations: for the same query, representations induced by different permission scopes cluster distinctly and are readily separable. Building on this separability, we propose Activation-space Anchored Access Control (AAAC), a training-free framework for multi-class permission control. AAAC constructs an anchor bank, with one permission anchor per class, from a small offline sample set and requires no fine-tuning. At inference time, a multi-anchor steering mechanism redirects each query's activations toward the anchor-defined authorized region associated with the current user, thereby suppressing over-privileged generations by design. Finally, extensive experiments across three LLM families demonstrate that AAAC reduces permission violation rates by up to 86.5% and prompt-based attack success rates by 90.7%, while improving response usability with minor inference overhead compared to baselines.

63. 【2601.13614】CauScientist: Teaching LLMs to Respect Data for Causal Discovery

链接https://arxiv.org/abs/2601.13614

作者:Bo Peng,Sirui Chen,Lei Xu,Chaochao Lu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Causal discovery, reliable decision-making, discovery is fundamental, fundamental to scientific, scientific understanding

备注

点击查看摘要

Abstract:Causal discovery is fundamental to scientific understanding and reliable decision-making. Existing approaches face critical limitations: purely data-driven methods suffer from statistical indistinguishability and modeling assumptions, while recent LLM-based methods either ignore statistical evidence or incorporate unverified priors that can mislead result. To this end, we propose CauScientist, a collaborative framework that synergizes LLMs as hypothesis-generating "data scientists" with probabilistic statistics as rigorous "verifiers". CauScientist employs hybrid initialization to select superior starting graphs, iteratively refines structures through LLM-proposed modifications validated by statistical criteria, and maintains error memory to guide efficient search space. Experiments demonstrate that CauScientist substantially outperforms purely data-driven baselines, achieving up to 53.8% F1 score improvement and enhancing recall from 35.0% to 100.0%. Notably, while standalone LLM performance degrades with graph complexity, CauScientist reduces structural hamming distance (SHD) by 44.0% compared to Qwen3-32B on 37-node graphs. Our project page is at this https URL.

64. 【2601.13591】DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems

链接https://arxiv.org/abs/2601.13591

作者:Maojun Sun,Yifei Xie,Yue Wu,Ruijian Han,Binyan Jiang,Defeng Sun,Yancheng Yuan,Jian Huang

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Recent LLM-based data, Recent LLM-based, real-world data science, data science, automate data science

备注

点击查看摘要

Abstract:Recent LLM-based data agents aim to automate data science tasks ranging from data analysis to deep learning. However, the open-ended nature of real-world data science problems, which often span multiple taxonomies and lack standard answers, poses a significant challenge for evaluation. To address this, we introduce DSAEval, a benchmark comprising 641 real-world data science problems grounded in 285 diverse datasets, covering both structured and unstructured data (e.g., vision and text). DSAEval incorporates three distinctive features: (1) Multimodal Environment Perception, which enables agents to interpret observations from multiple modalities including text and vision; (2) Multi-Query Interactions, which mirror the iterative and cumulative nature of real-world data science projects; and (3) Multi-Dimensional Evaluation, which provides a holistic assessment across reasoning, code, and results. We systematically evaluate 11 advanced agentic LLMs using DSAEval. Our results show that Claude-Sonnet-4.5 achieves the strongest overall performance, GPT-5.2 is the most efficient, and MiMo-V2-Flash is the most cost-effective. We further demonstrate that multimodal perception consistently improves performance on vision-related tasks, with gains ranging from 2.04% to 11.30%. Overall, while current data science agents perform well on structured data and routine data anlysis workflows, substantial challenges remain in unstructured domains. Finally, we offer critical insights and outline future research directions to advance the development of data science agents.

65. 【2601.13590】Vulnerability of LLMs' Belief Systems? LLMs Belief Resistance Check Through Strategic Persuasive Conversation Interventions

链接https://arxiv.org/abs/2601.13590

作者:Fan Huang,Haewoon Kwak,Jisun An

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, question-answering tasks, mainstream Large Language, increasingly employed

备注

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly employed in various question-answering tasks. However, recent studies showcase that LLMs are susceptible to persuasion and could adopt counterfactual beliefs. We present a systematic evaluation of LLM susceptibility to persuasion under the Source--Message--Channel--Receiver (SMCR) communication framework. Across five mainstream Large Language Models (LLMs) and three domains (factual knowledge, medical QA, and social bias), we analyze how different persuasive strategies influence belief stability over multiple interaction turns. We further examine whether meta-cognition prompting (i.e., eliciting self-reported confidence) affects resistance to persuasion. Results show that smaller models exhibit extreme compliance, with over 80% of belief changes occurring at the first persuasive turn (average end turn of 1.1--1.4). Contrary to expectations, meta-cognition prompting increases vulnerability by accelerating belief erosion rather than enhancing robustness. Finally, we evaluate adversarial fine-tuning as a defense. While GPT-4o-mini achieves near-complete robustness (98.6%) and Mistral~7B improves substantially (35.7% $\rightarrow$ 79.3%), Llama models remain highly susceptible (14%) even when fine-tuned on their own failure cases. Together, these findings highlight substantial model-dependent limits of current robustness interventions and offer guidance for developing more trustworthy LLMs.

66. 【2601.13588】REX: Tokenizer Regression for Optimal Data Mixture

链接https://arxiv.org/abs/2601.13588

作者:Inho Won,Hangyeol Yoo,Minkyung Cho,Jungyeul Park,Hoyun Song,KyungTae Lim

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Building effective tokenizers, multilingual Large Language, requires careful control, Building effective

备注: Accepted to EACL 2026. Long Paper. (19 languages studied: Chinese, Greek, Japanese, etc.)

点击查看摘要

Abstract:Building effective tokenizers for multilingual Large Language Models (LLMs) requires careful control over language-specific data mixtures. While a tokenizer's compression performance critically affects the efficiency of LLM training and inference, existing approaches rely on heuristics or costly large-scale searches to determine optimal language ratios. We introduce Tokenizer Regression for Optimal Data MiXture (TREX), a regression-based framework that efficiently predicts the optimal data mixture for tokenizer training. TREX trains small-scale proxy tokenizers on random mixtures, gathers their compression statistics, and learns to predict compression performance from data mixtures. This learned model enables scalable mixture search before large-scale tokenizer training, mitigating the accuracy-cost trade-off in multilingual tokenizer design. Tokenizers trained with TReX's predicted mixtures outperform mixtures based on LLaMA3 and uniform distributions by up to 12% in both inand out-of-distribution compression efficiency, demonstrating strong scalability, robustness, and practical effectiveness.

67. 【2601.13575】Comparing Without Saying: A Dataset and Benchmark for Implicit Comparative Opinion Mining from Same-User Reviews

链接https://arxiv.org/abs/2601.13575

作者:Thanh-Lam T. Nguyen,Ngoc-Quang Le,Quoc-Trung Phu,Thi-Phuong Le,Ngoc-Huyen Pham,Phuong-Nguyen Nguyen,Hoang-Quynh Le

类目:Computation and Language (cs.CL)

关键词:Existing studies, explicit comparative expressions, comparative opinion mining, uncommon in real-world, Existing

备注

点击查看摘要

Abstract:Existing studies on comparative opinion mining have mainly focused on explicit comparative expressions, which are uncommon in real-world reviews. This leaves implicit comparisons - here users express preferences across separate reviews - largely underexplored. We introduce SUDO, a novel dataset for implicit comparative opinion mining from same-user reviews, allowing reliable inference of user preferences even without explicit comparative cues. SUDO comprises 4,150 annotated review pairs (15,191 sentences) with a bi-level structure capturing aspect-level mentions and review-level preferences. We benchmark this task using two baseline architectures: traditional machine learning- and language model-based baselines. Experimental results show that while the latter outperforms the former, overall performance remains moderate, revealing the inherent difficulty of the task and establishing SUDO as a challenging and valuable benchmark for future research.

68. 【2601.13566】Self-Improvement as Coherence Optimization: A Theoretical Account

链接https://arxiv.org/abs/2601.13566

作者:Tianyi Qiu,Ahmed Hani Ismail,Zhonghao He,Shi Feng

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:language models improve, external supervision, improve their accuracy, accuracy without external, Abstract

备注: 39 pages

点击查看摘要

Abstract:Can language models improve their accuracy without external supervision? Methods such as debate, bootstrap, and internal coherence maximization achieve this surprising feat, even matching golden finetuning performance. Yet why they work remains theoretically unclear. We show that they are all special cases of coherence optimization: finding a context-to-behavior mapping that's most compressible and jointly predictable. We prove that coherence optimization is equivalent to description-length regularization, and that among all such regularization schemes, it is optimal for semi-supervised learning when the regularizer is derived from a pretrained model. Our theory, supported by preliminary experiments, explains why feedback-free self-improvement works and predicts when it should succeed or fail.

69. 【2601.13558】Leveraging ChatGPT and Other NLP Methods for Identifying Risk and Protective Behaviors in MSM: Social Media and Dating apps Text Analysis

链接https://arxiv.org/abs/2601.13558

作者:Mehrab Beikzadeh,Chenglin Hong,Cory J Cascalheira,Callisto Boka,Majid Sarrafzadeh,Ian W Holloway

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:sexually transmitted infections, harmful drinking compared, heterosexual men, social media, media and dating

备注

点击查看摘要

Abstract:Men who have sex with men (MSM) are at elevated risk for sexually transmitted infections and harmful drinking compared to heterosexual men. Text data collected from social media and dating applications may provide new opportunities for personalized public health interventions by enabling automatic identification of risk and protective behaviors. In this study, we evaluated whether text from social media and dating apps can be used to predict sexual risk behaviors, alcohol use, and pre-exposure prophylaxis (PrEP) uptake among MSM. With participant consent, we collected textual data and trained machine learning models using features derived from ChatGPT embeddings, BERT embeddings, LIWC, and a dictionary-based risk term approach. The models achieved strong performance in predicting monthly binge drinking and having more than five sexual partners, with F1 scores of 0.78, and moderate performance in predicting PrEP use and heavy drinking, with F1 scores of 0.64 and 0.63. These findings demonstrate that social media and dating app text data can provide valuable insights into risk and protective behaviors and highlight the potential of large language model-based methods to support scalable and personalized public health interventions for MSM.

70. 【2601.13547】HateXScore: A Metric Suite for Evaluating Reasoning Quality in Hate Speech Explanations

链接https://arxiv.org/abs/2601.13547

作者:Yujia Hu,Roy Ka-Wei Lee

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:frameworks rarely assess, Hateful speech detection, current evaluation frameworks, evaluation frameworks rarely, deemed hateful

备注: EACL 2026 Main Conference

点击查看摘要

Abstract:Hateful speech detection is a key component of content moderation, yet current evaluation frameworks rarely assess why a text is deemed hateful. We introduce \textsf{HateXScore}, a four-component metric suite designed to evaluate the reasoning quality of model explanations. It assesses (i) conclusion explicitness, (ii) faithfulness and causal grounding of quoted spans, (iii) protected group identification (policy-configurable), and (iv) logical consistency among these elements. Evaluated on six diverse hate speech datasets, \textsf{HateXScore} is intended as a diagnostic complement to reveal interpretability failures and annotation inconsistencies that are invisible to standard metrics like Accuracy or F1. Moreover, human evaluation shows strong agreement with \textsf{HateXScore}, validating it as a practical tool for trustworthy and transparent moderation. \textcolor{red}{Disclaimer: This paper contains sensitive content that may be disturbing to some readers.}

Comments:
EACL 2026 Main Conference

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Cite as:
arXiv:2601.13547 [cs.CL]

(or
arXiv:2601.13547v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2601.13547

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
71. 【2601.13537】When Wording Steers the Evaluation: Framing Bias in LLM judges

链接https://arxiv.org/abs/2601.13537

作者:Yerin Hwang,Dongryeol Lee,Taegwan Kang,Minwoo Lee,Kyomin Jung

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, produce varying responses, varying responses depending, Large language, indicating that subtle

备注: 4 pages

点击查看摘要

Abstract:Large language models (LLMs) are known to produce varying responses depending on prompt phrasing, indicating that subtle guidance in phrasing can steer their answers. However, the impact of this framing bias on LLM-based evaluation, where models are expected to make stable and impartial judgments, remains largely underexplored. Drawing inspiration from the framing effect in psychology, we systematically investigate how deliberate prompt framing skews model judgments across four high-stakes evaluation tasks. We design symmetric prompts using predicate-positive and predicate-negative constructions and demonstrate that such framing induces significant discrepancies in model outputs. Across 14 LLM judges, we observe clear susceptibility to framing, with model families showing distinct tendencies toward agreement or rejection. These findings suggest that framing bias is a structural property of current LLM-based evaluation systems, underscoring the need for framing-aware protocols.

72. 【2601.13528】Eliciting Harmful Capabilities by Fine-Tuning On Safeguarded Outputs

链接https://arxiv.org/abs/2601.13528

作者:Jackson Kaunismaa,Avery Griffin,John Hughes,Christina Q. Knight,Mrinank Sharma,Erik Jones

类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)

关键词:Model developers implement, filter dangerous outputs, developers implement safeguards, prevent misuse, developers implement

备注

点击查看摘要

Abstract:Model developers implement safeguards in frontier models to prevent misuse, for example, by employing classifiers to filter dangerous outputs. In this work, we demonstrate that even robustly safeguarded models can be used to elicit harmful capabilities in open-source models through elicitation attacks. Our elicitation attacks consist of three stages: (i) constructing prompts in adjacent domains to a target harmful task that do not request dangerous information; (ii) obtaining responses to these prompts from safeguarded frontier models; (iii) fine-tuning open-source models on these prompt-output pairs. Since the requested prompts cannot be used to directly cause harm, they are not refused by frontier model safeguards. We evaluate these elicitation attacks within the domain of hazardous chemical synthesis and processing, and demonstrate that our attacks recover approximately 40% of the capability gap between the base open-source model and an unrestricted frontier model. We then show that the efficacy of elicitation attacks scales with the capability of the frontier model and the amount of generated fine-tuning data. Our work demonstrates the challenge of mitigating ecosystem level risks with output-level safeguards.

73. 【2601.13503】Anonpsy: A Graph-Based Framework for Structure-Preserving De-identification of Psychiatric Narratives

链接https://arxiv.org/abs/2601.13503

作者:Kyung Ho Lim,Byung-Hoon Kim

类目:Computation and Language (cs.CL)

关键词:encode patient identity, idiosyncratic life events, life events embedded, narratives encode patient, encode patient

备注

点击查看摘要

Abstract:Psychiatric narratives encode patient identity not only through explicit identifiers but also through idiosyncratic life events embedded in their clinical structure. Existing de-identification approaches, including PHI masking and LLM-based synthetic rewriting, operate at the text level and offer limited control over which semantic elements are preserved or altered. We introduce Anonpsy, a de-identification framework that reformulates the task as graph-guided semantic rewriting. Anonpsy (1) converts each narrative into a semantic graph encoding clinical entities, temporal anchors, and typed relations; (2) applies graph-constrained perturbations that modify identifying context while preserving clinically essential structure; and (3) regenerates text via graph-conditioned LLM generation. Evaluated on 90 clinician-authored psychiatric case narratives, Anonpsy preserves diagnostic fidelity while achieving consistently low re-identification risk under expert, semantic, and GPT-5-based evaluations. Compared with a strong LLM-only rewriting baseline, Anonpsy yields substantially lower semantic similarity and identifiability. These results demonstrate that explicit structural representations combined with constrained generation provide an effective approach to de-identification for psychiatric narratives.

74. 【2601.13487】he Hidden Toll of Social Media News: Causal Effects on Psychosocial Wellbeing

链接https://arxiv.org/abs/2601.13487

作者:Olivia Pal,Agam Goyal,Eshwar Chandrasekharan,Koustuv Saha

类目:ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)

关键词:outcomes remains unclear, remains unclear, Treated users exposed, psychosocial outcomes remains, shape psychosocial outcomes

备注

点击查看摘要

Abstract:News consumption on social media has become ubiquitous, yet how different forms of engagement shape psychosocial outcomes remains unclear. To address this gap, we leveraged a large-scale dataset of ~26M posts and ~45M comments on the BlueSky platform, and conducted a quasi-experimental study, matching 81,345 Treated users exposed to News feeds with 83,711 Control users using stratified propensity score analysis. We examined psychosocial wellbeing, in terms of affective, behavioral, and cognitive outcomes. Our findings reveal that news engagement produces systematic trade-offs: increased depression, stress, and anxiety, yet decreased loneliness and increased social interaction on the platform. Regression models reveal that News feed bookmarking is associated with greater psychosocial deterioration compared to commenting or quoting, with magnitude differences exceeding tenfold. These per-engagement effects accumulate with repeated exposure, showing significant psychosocial impacts. Our work extends theories of news effects beyond crisis-centric frameworks by demonstrating that routine consumption creates distinct psychological dynamics depending on engagement type, and bears implications for tools and interventions for mitigating the psychosocial costs of news consumption on social media.

75. 【2601.13453】PhysicsSolutionAgent: Towards Multimodal Explanations for Numerical Physics Problem Solving

链接https://arxiv.org/abs/2601.13453

作者:Aditya Thole,Anmol Agrawal,Arnav Ramamoorthy,Dhruv Kumar

类目:Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

关键词:Explaining numerical physics, substantially improve conceptual, Explaining numerical, text-based solutions, improve conceptual understanding

备注

点击查看摘要

Abstract:Explaining numerical physics problems often requires more than text-based solutions; clear visual reasoning can substantially improve conceptual understanding. While large language models (LLMs) demonstrate strong performance on many physics questions in textual form, their ability to generate long, high-quality visual explanations remains insufficiently explored. In this work, we introduce PhysicsSolutionAgent (PSA), an autonomous agent that generates physics-problem explanation videos of up to six minutes using Manim animations. To evaluate the generated videos, we design an assessment pipeline that performs automated checks across 15 quantitative parameters and incorporates feedback from a vision-language model (VLM) to iteratively improve video quality. We evaluate PSA on 32 videos spanning numerical and theoretical physics problems. Our results reveal systematic differences in video quality depending on problem difficulty and whether the task is numerical or theoretical. Using GPT-5-mini, PSA achieves a 100% video-completion rate with an average automated score of 3.8/5. However, qualitative analysis and human inspection uncover both minor and major issues, including visual layout inconsistencies and errors in how visual content is interpreted during feedback. These findings expose key limitations in reliable Manim code generation and highlight broader challenges in multimodal reasoning and evaluation for visual explanations of numerical physics problems. Our work underscores the need for improved visual understanding, verification, and evaluation frameworks in future multimodal educational systems

76. 【2601.13437】MOSLD-Bench: Multilingual Open-Set Learning and Discovery Benchmark for Text Categorization

链接https://arxiv.org/abs/2601.13437

作者:Adriana-Valentina Costache,Daria-Nicoleta Dragomir,Silviu-Florin Gheorghe,Eduard Poesina,Paul Irofti,Radu Tudor Ionescu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:challenging machine learning, Open-set learning, test time, challenging machine, machine learning task

备注

点击查看摘要

Abstract:Open-set learning and discovery (OSLD) is a challenging machine learning task in which samples from new (unknown) classes can appear at test time. It can be seen as a generalization of zero-shot learning, where the new classes are not known a priori, hence involving the active discovery of new classes. While zero-shot learning has been extensively studied in text classification, especially with the emergence of pre-trained language models, open-set learning and discovery is a comparatively new setup for the text domain. To this end, we introduce the first multilingual open-set learning and discovery (MOSLD) benchmark for text categorization by topic, comprising 960K data samples across 12 languages. To construct the benchmark, we (i) rearrange existing datasets and (ii) collect new data samples from the news domain. Moreover, we propose a novel framework for the OSLD task, which integrates multiple stages to continuously discover and learn new classes. We evaluate several language models, including our own, to obtain results that can be used as reference for future work. We release our benchmark at this https URL.

77. 【2601.13433】rust Me, I'm an Expert: Decoding and Steering Authority Bias in Large Language Models

链接https://arxiv.org/abs/2601.13433

作者:Priyanka Mary Mammen,Emil Joswin,Shankar Venkitachalam

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Prior research demonstrates, Prior research, influenced by suggestions, research demonstrates, Prior

备注

点击查看摘要

Abstract:Prior research demonstrates that performance of language models on reasoning tasks can be influenced by suggestions, hints and endorsements. However, the influence of endorsement source credibility remains underexplored. We investigate whether language models exhibit systematic bias based on the perceived expertise of the provider of the endorsement. Across 4 datasets spanning mathematical, legal, and medical reasoning, we evaluate 11 models using personas representing four expertise levels per domain. Our results reveal that models are increasingly susceptible to incorrect/misleading endorsements as source expertise increases, with higher-authority sources inducing not only accuracy degradation but also increased confidence in wrong answers. We also show that this authority bias is mechanistically encoded within the model and a model can be steered away from the bias, thereby improving its performance even when an expert gives a misleading endorsement.

78. 【2601.13392】Beyond Memorization: Testing LLM Reasoning on Unseen Theory of Computation Tasks

链接https://arxiv.org/abs/2601.13392

作者:Shlok Shelat,Jay Raval,Souvik Roy,Manas Gaur

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL)

关键词:demonstrated strong performance, reflects genuine symbolic, constructions remains unclear, Large language models, familiar constructions remains

备注: 30 pages, 11 figures, 6 tables, Work in Progress

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated strong performance on formal language tasks, yet whether this reflects genuine symbolic reasoning or pattern matching on familiar constructions remains unclear. We introduce a benchmark for deterministic finite automata (DFA) construction from regular languages, comprising factual knowledge questions, seen construction problems from public sources, and two types of unseen problems: hand-crafted instances with multiple interacting constraints and systematically generated problems via Arden's theorem. Models achieve perfect accuracy on factual questions and 84-90% on seen tasks. However, accuracy drops sharply on unseen problems (by 30-64%), with failures stemming from systematic misinterpretation of language constraints, incorrect handling of Kleene-star semantics, and a failure to preserve global consistency. We evaluate a three-stage hint protocol that enables correction of shallow errors but does not reliably resolve globally inconsistent or structurally flawed automata. Our analysis across multiple prompting strategies (direct, Chain-of-Thought, Tree-of-Thought) reveals that errors persist regardless of prompting approach, exposing a fundamental gap between LLMs' ability to generate syntactically plausible DFAs and their capacity for semantically correct formal reasoning.

79. 【2601.13388】Structured Insight from Unstructured Data: Large Language Models for SDOH-Driven Diabetes Risk Prediction

链接https://arxiv.org/abs/2601.13388

作者:Sasha Ronaghi,Prerit Choudhary,David H Rehkopf,Bryant Lin

类目:Computation and Language (cs.CL)

关键词:electronic health records, role in Type, electronic health, health records, play a critical

备注: 7 pages, 5 figures

点击查看摘要

Abstract:Social determinants of health (SDOH) play a critical role in Type 2 Diabetes (T2D) management but are often absent from electronic health records and risk prediction models. Most individual-level SDOH data is collected through structured screening tools, which lack the flexibility to capture the complexity of patient experiences and unique needs of a clinic's population. This study explores the use of large language models (LLMs) to extract structured SDOH information from unstructured patient life stories and evaluate the predictive value of both the extracted features and the narratives themselves for assessing diabetes control. We collected unstructured interviews from 65 T2D patients aged 65 and older, focused on their lived experiences, social context, and diabetes management. These narratives were analyzed using LLMs with retrieval-augmented generation to produce concise, actionable qualitative summaries for clinical interpretation and structured quantitative SDOH ratings for risk prediction modeling. The structured SDOH ratings were used independently and in combination with traditional laboratory biomarkers as inputs to linear and tree-based machine learning models (Ridge, Lasso, Random Forest, and XGBoost) to demonstrate how unstructured narrative data can be applied in conventional risk prediction workflows. Finally, we evaluated several LLMs on their ability to predict a patient's level of diabetes control (low, medium, high) directly from interview text with A1C values redacted. LLMs achieved 60% accuracy in predicting diabetes control levels from interview text. This work demonstrates how LLMs can translate unstructured SDOH-related data into structured insights, offering a scalable approach to augment clinical risk models and decision-making.

80. 【2601.13387】Confidence over Time: Confidence Calibration with Temporal Logic for Large Language Model Reasoning

链接https://arxiv.org/abs/2601.13387

作者:Zhenjiang Mao,Anirudhh Venkat,Artem Bisliouk,Akshat Kothiyal,Sindhura Kumbakonam Subramanian,Saithej Singhu,Ivan Ruchkin

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, Language Models, mathematical problem solving, scientific question answering

备注

点击查看摘要

Abstract:Large Language Models (LLMs) increasingly rely on long-form, multi-step reasoning to solve complex tasks such as mathematical problem solving and scientific question answering. Despite strong performance, existing confidence estimation methods typically reduce an entire reasoning process to a single scalar score, ignoring how confidence evolves throughout the generation. As a result, these methods are often sensitive to superficial factors such as response length or verbosity, and struggle to distinguish correct reasoning from confidently stated errors. We propose to characterize the stepwise confidence signal using Signal Temporal Logic (STL). Using a discriminative STL mining procedure, we discover temporal formulas that distinguish confidence signals of correct and incorrect responses. Our analysis found that the STL patterns generalize across tasks, and numeric parameters exhibit sensitivity to individual questions. Based on these insights, we develop a confidence estimation approach that informs STL blocks with parameter hypernetworks. Experiments on multiple reasoning tasks show our confidence scores are more calibrated than the baselines.

81. 【2601.13384】From Completion to Editing: Unlocking Context-Aware Code Infilling via Search-and-Replace Instruction Tuning

链接https://arxiv.org/abs/2601.13384

作者:Jiajun Zhang,Zeyu Cui,Jiaxi Yang,Lei Zhang,Yuheng Jing,Zeyao Ma,Tianyi Bai,Zilei Wang,Qiang Liu,Liang Wang,Binyuan Hui,Junyang Lin

类目:oftware Engineering (cs.SE); Computation and Language (cs.CL)

关键词:correct contextual errors, insecure Base models, reliance on unaligned, rigid inability, inability to correct

备注

点击查看摘要

Abstract:The dominant Fill-in-the-Middle (FIM) paradigm for code completion is constrained by its rigid inability to correct contextual errors and reliance on unaligned, insecure Base models. While Chat LLMs offer safety and Agentic workflows provide flexibility, they suffer from performance degradation and prohibitive latency, respectively. To resolve this dilemma, we propose Search-and-Replace Infilling (SRI), a framework that internalizes the agentic verification-and-editing mechanism into a unified, single-pass inference process. By structurally grounding edits via an explicit search phase, SRI harmonizes completion tasks with the instruction-following priors of Chat LLMs, extending the paradigm from static infilling to dynamic context-aware editing. We synthesize a high-quality dataset, SRI-200K, and fine-tune the SRI-Coder series. Extensive evaluations demonstrate that with minimal data (20k samples), SRI-Coder enables Chat models to surpass the completion performance of their Base counterparts. Crucially, unlike FIM-style tuning, SRI preserves general coding competencies and maintains inference latency comparable to standard FIM. We empower the entire Qwen3-Coder series with SRI, encouraging the developer community to leverage this framework for advanced auto-completion and assisted development.

82. 【2601.13368】Recurrent Confidence Chain: Temporal-Aware Uncertainty Quantification in Large Language Models

链接https://arxiv.org/abs/2601.13368

作者:Zhenjiang Mao,Anirudhh Venkat

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:answering common-sense questions, solving math problems, answering common-sense, common-sense questions, questions and solving

备注

点击查看摘要

Abstract:As reasoning modules, such as the chain-of-thought mechanism, are applied to large language models, they achieve strong performance on various tasks such as answering common-sense questions and solving math problems. The main challenge now is to assess the uncertainty of answers, which can help prevent misleading or serious hallucinations for users. Although current methods analyze long reasoning sequences by filtering unrelated tokens and examining potential connections between nearby tokens or sentences, the temporal spread of confidence is often overlooked. This oversight can lead to inflated overall confidence, even when earlier steps exhibit very low confidence. To address this issue, we propose a novel method that incorporates inter-step attention to analyze semantic correlations across steps. For handling long-horizon responses, we introduce a hidden confidence mechanism to retain historical confidence information, which is then combined with stepwise confidence to produce a more accurate overall estimate. We evaluate our method on the GAOKAO math benchmark and the CLadder causal reasoning dataset using mainstream open-source large language models. Our approach is shown to outperform state-of-the-art methods by achieving a superior balance between predictive quality and calibration, demonstrated by strong performance on both Negative Log-Likelihood and Expected Calibration Error.

83. 【2601.13359】Sockpuppetting: Jailbreaking LLMs Without Optimization Through Output Prefix Injection

链接https://arxiv.org/abs/2601.13359

作者:Asen Dotsinski,Panagiotis Eustratiadis

类目:Computation and Language (cs.CL); Cryptography and Security (cs.CR); Machine Learning (cs.LG)

关键词:open-weight large language, large language models, increase in capabilities, large language, open-weight large

备注

点击查看摘要

Abstract:As open-weight large language models (LLMs) increase in capabilities, safeguarding them against malicious prompts and understanding possible attack vectors becomes ever more important. While automated jailbreaking methods like GCG [Zou et al., 2023] remain effective, they often require substantial computational resources and specific expertise. We introduce "sockpuppetting'', a simple method for jailbreaking open-weight LLMs by inserting an acceptance sequence (e.g., "Sure, here is how to...'') at the start of a model's output and allowing it to complete the response. Requiring only a single line of code and no optimization, sockpuppetting achieves up to 80% higher attack success rate (ASR) than GCG on Qwen3-8B in per-prompt comparisons. We also explore a hybrid approach that optimizes the adversarial suffix within the assistant message block rather than the user prompt, increasing ASR by 64% over GCG on Llama-3.1-8B in a prompt-agnostic setting. The results establish sockpuppetting as an effective low-cost attack accessible to unsophisticated adversaries, highlighting the need for defences against output-prefix injection in open-weight models.

84. 【2601.13357】On the Relation of State Space Models and Hidden Markov Models

链接https://arxiv.org/abs/2601.13357

作者:Aydin Ghojogh,M.Hadi Sepanj,Benyamin Ghojogh

类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS); Systems and Control (eess.SY)

关键词:Hidden Markov Models, State Space Models, Hidden Markov, State Space, NLP state space

备注

点击查看摘要

Abstract:State Space Models (SSMs) and Hidden Markov Models (HMMs) are foundational frameworks for modeling sequential data with latent variables and are widely used in signal processing, control theory, and machine learning. Despite their shared temporal structure, they differ fundamentally in the nature of their latent states, probabilistic assumptions, inference procedures, and training paradigms. Recently, deterministic state space models have re-emerged in natural language processing through architectures such as S4 and Mamba, raising new questions about the relationship between classical probabilistic SSMs, HMMs, and modern neural sequence models. In this paper, we present a unified and systematic comparison of HMMs, linear Gaussian state space models, Kalman filtering, and contemporary NLP state space models. We analyze their formulations through the lens of probabilistic graphical models, examine their inference algorithms -- including forward-backward inference and Kalman filtering -- and contrast their learning procedures via Expectation-Maximization and gradient-based optimization. By highlighting both structural similarities and semantic differences, we clarify when these models are equivalent, when they fundamentally diverge, and how modern NLP SSMs relate to classical probabilistic models. Our analysis bridges perspectives from control theory, probabilistic modeling, and modern deep learning.

Subjects:

Machine Learning (cs.LG); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS); Systems and Control (eess.SY)

Cite as:
arXiv:2601.13357 [cs.LG]

(or
arXiv:2601.13357v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2601.13357

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
85. 【2601.13352】LLM-as-RNN: A Recurrent Language Model for Memory Updates and Sequence Prediction

链接https://arxiv.org/abs/2601.13352

作者:Yuxing Lu,J. Ben Tamo,Weichen Zhao,Nan Sun,Yishan Zhong,Wenqi Shi,Jinzhuo Wang,May D. Wang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

关键词:Large language models, immutable context histories, strong sequence predictors, standard inference relies, Large language

备注: 17 pages, 5 figures, 6 tables

点击查看摘要

Abstract:Large language models are strong sequence predictors, yet standard inference relies on immutable context histories. After making an error at generation step t, the model lacks an updatable memory mechanism that improves predictions for step t+1. We propose LLM-as-RNN, an inference-only framework that turns a frozen LLM into a recurrent predictor by representing its hidden state as natural-language memory. This state, implemented as a structured system-prompt summary, is updated at each timestep via feedback-driven text rewrites, enabling learning without parameter updates. Under a fixed token budget, LLM-as-RNN corrects errors and retains task-relevant patterns, effectively performing online learning through language. We evaluate the method on three sequential benchmarks in healthcare, meteorology, and finance across Llama, Gemma, and GPT model families. LLM-as-RNN significantly outperforms zero-shot, full-history, and MemPrompt baselines, improving predictive accuracy by 6.5% on average, while producing interpretable, human-readable learning traces absent in standard context accumulation.

86. 【2601.13346】AfroScope: A Framework for Studying the Linguistic Landscape of Africa

链接https://arxiv.org/abs/2601.13346

作者:Sang Yun Kwon,AbdelRahim Elmadany,Muhammad Abdul-Mageed

类目:Computation and Language (cs.CL)

关键词:downstream NLP applications, fundamental preprocessing step, NLP applications, downstream NLP, African LID

备注

点击查看摘要

Abstract:Language Identification (LID) is the task of determining the language of a given text and is a fundamental preprocessing step that affects the reliability of downstream NLP applications. While recent work has expanded LID coverage for African languages, existing approaches remain limited in (i) the number of supported languages and (ii) their ability to make fine-grained distinctions among closely related varieties. We introduce AfroScope, a unified framework for African LID that includes AfroScope-Data, a dataset covering 713 African languages, and AfroScope-Models, a suite of strong LID models with broad language coverage. To better distinguish highly confusable languages, we propose a hierarchical classification approach that leverages Mirror-Serengeti, a specialized embedding model targeting 29 closely related or geographically proximate languages. This approach improves macro F1 by 4.55 on this confusable subset compared to our best base model. Finally, we analyze cross linguistic transfer and domain effects, offering guidance for building robust African LID systems. We position African LID as an enabling technology for large scale measurement of Africas linguistic landscape in digital text and release AfroScope-Data and AfroScope-Models publicly.

87. 【2601.13330】RegCheck: A tool for automating comparisons between study registrations and papers

链接https://arxiv.org/abs/2601.13330

作者:Jamie Cummins,Beth Clarke,Ian Hussey,Malte Elson

类目:Computation and Language (cs.CL)

关键词:planned research activities, research activities, planned research, social and medical, transparency and rigour

备注: 15 pages, 1 figure

点击查看摘要

Abstract:Across the social and medical sciences, researchers recognize that specifying planned research activities (i.e., 'registration') prior to the commencement of research has benefits for both the transparency and rigour of science. Despite this, evidence suggests that study registrations frequently go unexamined, minimizing their effectiveness. In a way this is no surprise: manually checking registrations against papers is labour- and time-intensive, requiring careful reading across formats and expertise across domains. The advent of AI unlocks new possibilities in facilitating this activity. We present RegCheck, a modular LLM-assisted tool designed to help researchers, reviewers, and editors from across scientific disciplines compare study registrations with their corresponding papers. Importantly, RegCheck keeps human expertise and judgement in the loop by (i) ensuring that users are the ones who determine which features should be compared, and (ii) presenting the most relevant text associated with each feature to the user, facilitating (rather than replacing) human discrepancy judgements. RegCheck also generates shareable reports with unique RegCheck IDs, enabling them to be easily shared and verified by other users. RegCheck is designed to be adaptable across scientific domains, as well as registration and publication formats. In this paper we provide an overview of the motivation, workflow, and design principles of RegCheck, and we discuss its potential as an extensible infrastructure for reproducible science with an example use case.

88. 【2601.13328】Reducing Tokenization Premiums for Low-Resource Languages

链接https://arxiv.org/abs/2601.13328

作者:Geoffrey Churchill,Steven Skiena

类目:Computation and Language (cs.CL)

关键词:Relative to English, substantial tokenization premiums, sentence in English, analogous sentence, English

备注

点击查看摘要

Abstract:Relative to English, low-resource languages suffer from substantial tokenization premiums in modern LMs, meaning that it generally requires several times as many tokens to encode a sentence in a low-resource language than to encode the analogous sentence in English. This tokenization premium results in increased API and energy costs and reduced effective context windows for these languages. In this paper we analyze the tokenizers of ten popular LMs to better understand their designs and per-language tokenization premiums. We also propose a mechanism to reduce tokenization premiums in pre-trained models, by post-hoc additions to the token vocabulary that coalesce multi-token characters into single tokens. We apply this methodology to 12 low-resource languages, demonstrating that the original and compressed inputs often have similar last hidden states when run through the Llama 3.2 1B model.

89. 【2601.13319】Arab Voices: Mapping Standard and Dialectal Arabic Speech Technology

链接https://arxiv.org/abs/2601.13319

作者:Peter Sullivan,AbdelRahim Elmadany,Alcides Alcoba Inciarte,Muhammad Abdul-Mageed

类目:Computation and Language (cs.CL)

关键词:speech data vary, complicating cross-dataset comparison, dialect labeling practices, data vary widely, Dialectal Arabic

备注

点击查看摘要

Abstract:Dialectal Arabic (DA) speech data vary widely in domain coverage, dialect labeling practices, and recording conditions, complicating cross-dataset comparison and model evaluation. To characterize this landscape, we conduct a computational analysis of linguistic ``dialectness'' alongside objective proxies of audio quality on the training splits of widely used DA corpora. We find substantial heterogeneity both in acoustic conditions and in the strength and consistency of dialectal signals across datasets, underscoring the need for standardized characterization beyond coarse labels. To reduce fragmentation and support reproducible evaluation, we introduce Arab Voices, a standardized framework for DA ASR. Arab Voices provides unified access to 31 datasets spanning 14 dialects, with harmonized metadata and evaluation utilities. We further benchmark a range of recent ASR systems, establishing strong baselines for modern DA ASR.

90. 【2601.13317】Paid Voices vs. Public Feeds: Interpretable Cross-Platform Theme Modeling of Climate Discourse

链接https://arxiv.org/abs/2601.13317

作者:Samantha Sudhoff,Pranav Perumal,Zhaoqing Wu,Tunazzina Islam

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI)

关键词:discourse online plays, shaping public understanding, Climate discourse online, policy outcomes, online plays

备注

点击查看摘要

Abstract:Climate discourse online plays a crucial role in shaping public understanding of climate change and influencing political and policy outcomes. However, climate communication unfolds across structurally distinct platforms with fundamentally different incentive structures: paid advertising ecosystems incentivize targeted, strategic persuasion, while public social media platforms host largely organic, user-driven discourse. Existing computational studies typically analyze these environments in isolation, limiting our ability to distinguish institutional messaging from public expression. In this work, we present a comparative analysis of climate discourse across paid advertisements on Meta (previously known as Facebook) and public posts on Bluesky from July 2024 to September 2025. We introduce an interpretable, end-to-end thematic discovery and assignment framework that clusters texts by semantic similarity and leverages large language models (LLMs) to generate concise, human-interpretable theme labels. We evaluate the quality of the induced themes against traditional topic modeling baselines using both human judgments and an LLM-based evaluator, and further validate their semantic coherence through downstream stance prediction and theme-guided retrieval tasks. Applying the resulting themes, we characterize systematic differences between paid climate messaging and public climate discourse and examine how thematic prevalence shifts around major political events. Our findings show that platform-level incentives are reflected in the thematic structure, stance alignment, and temporal responsiveness of climate narratives. While our empirical analysis focuses on climate communication, the proposed framework is designed to support comparative narrative analysis across heterogeneous communication environments.

91. 【2601.13300】OI-Bench: An Option Injection Benchmark for Evaluating LLM Susceptibility to Directive Interference

链接https://arxiv.org/abs/2601.13300

作者:Yow-Fu Liou,Yu-Chien Tang,Yu-Hsiang Liu,An-Zi Yen

类目:Computation and Language (cs.CL)

关键词:Benchmarking large language, large language models, understanding their capabilities, large language, critical for understanding

备注

点击查看摘要

Abstract:Benchmarking large language models (LLMs) is critical for understanding their capabilities, limitations, and robustness. In addition to interface artifacts, prior studies have shown that LLM decisions can be influenced by directive signals such as social cues, framing, and instructions. In this work, we introduce option injection, a benchmarking approach that augments the multiple-choice question answering (MCQA) interface with an additional option containing a misleading directive, leveraging standardized choice structure and scalable evaluation. We construct OI-Bench, a benchmark of 3,000 questions spanning knowledge, reasoning, and commonsense tasks, with 16 directive types covering social compliance, bonus framing, threat framing, and instructional interference. This setting combines manipulation of the choice interface with directive-based interference, enabling systematic assessment of model susceptibility. We evaluate 12 LLMs to analyze attack success rates, behavioral responses, and further investigate mitigation strategies ranging from inference-time prompting to post-training alignment. Experimental results reveal substantial vulnerabilities and heterogeneous robustness across models. OI-Bench is expected to support more systematic evaluation of LLM robustness to directive interference within choice-based interfaces.

92. 【2601.13295】CooperBench: Why Coding Agents Cannot be Your Teammates Yet

链接https://arxiv.org/abs/2601.13295

作者:Arpandeep Khatua,Hao Zhu,Peter Tran,Arya Prabhudesai,Frederic Sadrieh,Johann K. Lieberwirth,Xinkai Yu,Yicheng Fu,Michael J. Ryan,Jiaxin Pei,Diyi Yang

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)

关键词:find common ground, Resolving team conflicts, task-specific competence, build consensus, Resolving team

备注: [this https URL](https://cooperbench.com)

点击查看摘要

Abstract:Resolving team conflicts requires not only task-specific competence, but also social intelligence to find common ground and build consensus. As AI agents increasingly collaborate on complex work, they must develop coordination capabilities to function as effective teammates. Yet we hypothesize that current agents lack these capabilities. To test this, we introduce CooperBench, a benchmark of over 600 collaborative coding tasks across 12 libraries in 4 programming languages. Each task assigns two agents different features that can be implemented independently but may conflict without proper coordination. Tasks are grounded in real open-source repositories with expert-written tests. Evaluating state-of-the-art coding agents, we observe the curse of coordination: agents achieve on average 30% lower success rates when working together compared to performing both tasks individually. This contrasts sharply with human teams, where adding teammates typically improves productivity. Our analysis reveals three key issues: (1) communication channels become jammed with vague, ill-timed, and inaccurate messages; (2) even with effective communication, agents deviate from their commitments; and (3) agents often hold incorrect expectations about others' plans and communication. Through large-scale simulation, we also observe rare but interesting emergent coordination behavior including role division, resource division, and negotiation. Our research presents a novel benchmark for collaborative coding and calls for a shift from pursuing individual agent capability to developing social intelligence.

93. 【2601.13288】A BERTology View of LLM Orchestrations: Token- and Layer-Selective Probes for Efficient Single-Pass Classification

链接https://arxiv.org/abs/2601.13288

作者:Gonzalo Ariel Meyoyan,Luciano Del Corro

类目:Computation and Language (cs.CL)

关键词:Production LLM systems, Production LLM, LLM systems, classification-heavy steps, operational complexity

备注

点击查看摘要

Abstract:Production LLM systems often rely on separate models for safety and other classification-heavy steps, increasing latency, VRAM footprint, and operational complexity. We instead reuse computation already paid for by the serving LLM: we train lightweight probes on its hidden states and predict labels in the same forward pass used for generation. We frame classification as representation selection over the full token-layer hidden-state tensor, rather than committing to a fixed token or fixed layer (e.g., first-token logits or final-layer pooling). To implement this, we introduce a two-stage aggregator that (i) summarizes tokens within each layer and (ii) aggregates across layer summaries to form a single representation for classification. We instantiate this template with direct pooling, a 100K-parameter scoring-attention gate, and a downcast multi-head self-attention (MHA) probe with up to 35M trainable parameters. Across safety and sentiment benchmarks our probes improve over logit-only reuse (e.g., MULI) and are competitive with substantially larger task-specific baselines, while preserving near-serving latency and avoiding the VRAM and latency costs of a separate guard-model pipeline.

94. 【2601.13264】Unlearning in LLMs: Methods, Evaluation, and Open Challenges

链接https://arxiv.org/abs/2601.13264

作者:Tyler Lizzo,Larry Heck

类目:Computation and Language (cs.CL)

关键词:achieved remarkable success, widespread deployment raises, deployment raises pressing, raises pressing concerns, language processing tasks

备注

点击查看摘要

Abstract:Large language models (LLMs) have achieved remarkable success across natural language processing tasks, yet their widespread deployment raises pressing concerns around privacy, copyright, security, and bias. Machine unlearning has emerged as a promising paradigm for selectively removing knowledge or data from trained models without full retraining. In this survey, we provide a structured overview of unlearning methods for LLMs, categorizing existing approaches into data-centric, parameter-centric, architecture-centric, hybrid, and other strategies. We also review the evaluation ecosystem, including benchmarks, metrics, and datasets designed to measure forgetting effectiveness, knowledge retention, and robustness. Finally, we outline key challenges and open problems, such as scalable efficiency, formal guarantees, cross-language and multimodal unlearning, and robustness against adversarial relearning. By synthesizing current progress and highlighting open directions, this paper aims to serve as a roadmap for developing reliable and responsible unlearning techniques in large language models.

95. 【2601.13262】CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning

链接https://arxiv.org/abs/2601.13262

作者:Eric Onyame,Akash Ghosh,Subhadip Baidya,Sriparna Saha,Xiuying Chen,Chirag Agarwal

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:multilingual medical reasoning, multilingual healthcare settings, medical reasoning applications, large language models, multilingual medical

备注

点击查看摘要

Abstract:While large language models (LLMs) have shown to perform well on monolingual mathematical and commonsense reasoning, they remain unreliable for multilingual medical reasoning applications, hindering their deployment in multilingual healthcare settings. We address this by first introducing CUREMED-BENCH, a high-quality multilingual medical reasoning dataset with open-ended reasoning queries with a single verifiable answer, spanning thirteen languages, including underrepresented languages such as Amharic, Yoruba, and Swahili. Building on this dataset, we propose CURE-MED, a curriculum-informed reinforcement learning framework that integrates code-switching-aware supervised fine-tuning and Group Relative Policy Optimization to jointly improve logical correctness and language stability. Across thirteen languages, our approach consistently outperforms strong baselines and scales effectively, achieving 85.21% language consistency and 54.35% logical correctness at 7B parameters, and 94.96% language consistency and 70.04% logical correctness at 32B parameters. These results support reliable and equitable multilingual medical reasoning in LLMs. The code and dataset are available at this https URL

96. 【2601.13260】Stop Taking Tokenizers for Granted: They Are Core Design Decisions in Large Language Models

链接https://arxiv.org/abs/2601.13260

作者:Sawsan Alqahtani,Mir Tafseer Nayeem,Md Tahmid Rahman Laskar,Tasnim Mohiuddin,M Saiful Bari

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:inconsistently designed component, Byte Pair Encoding, designed component, large language model, underlies every large

备注: Accepted to EACL 2026 (long, main). The first two authors contributed equally

点击查看摘要

Abstract:Tokenization underlies every large language model, yet it remains an under-theorized and inconsistently designed component. Common subword approaches such as Byte Pair Encoding (BPE) offer scalability but often misalign with linguistic structure, amplify bias, and waste capacity across languages and domains. This paper reframes tokenization as a core modeling decision rather than a preprocessing step. We argue for a context-aware framework that integrates tokenizer and model co-design, guided by linguistic, domain, and deployment considerations. Standardized evaluation and transparent reporting are essential to make tokenization choices accountable and comparable. Treating tokenization as a core design problem, not a technical afterthought, can yield language technologies that are fairer, more efficient, and more adaptable.

97. 【2601.13253】A Hybrid Protocol for Large-Scale Semantic Dataset Generation in Low-Resource Languages: The Turkish Semantic Relations Corpus

链接https://arxiv.org/abs/2601.13253

作者:Ebubekir Tosun,Mehmet Emin Buldur,Özay Ezerceli,Mahmoud ElHussieni

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:semantic relations corpus, generating large-scale semantic, comprehensive Turkish semantic, Turkish semantic relations, relations corpus

备注

点击查看摘要

Abstract:We present a hybrid methodology for generating large-scale semantic relationship datasets in low-resource languages, demonstrated through a comprehensive Turkish semantic relations corpus. Our approach integrates three phases: (1) FastText embeddings with Agglomerative Clustering to identify semantic clusters, (2) Gemini 2.5-Flash for automated semantic relationship classification, and (3) integration with curated dictionary sources. The resulting dataset comprises 843,000 unique Turkish semantic pairs across three relationship types (synonyms, antonyms, co-hyponyms) representing a 10x scale increase over existing resources at minimal cost ($65). We validate the dataset through two downstream tasks: an embedding model achieving 90% top-1 retrieval accuracy and a classification model attaining 90% F1-macro. Our scalable protocol addresses critical data scarcity in Turkish NLP and demonstrates applicability to other low-resource languages. We publicly release the dataset and models.

98. 【2601.13251】Beyond Cosine Similarity: Taming Semantic Drift and Antonym Intrusion in a 15-Million Node Turkish Synonym Graph

链接https://arxiv.org/abs/2601.13251

作者:Ebubekir Tosun,Mehmet Emin Buldur,Özay Ezerceli,Mahmoud ElHussieni

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:notorious blind spot, blind spot, Neural embeddings, notorious blind, Neural

备注

点击查看摘要

Abstract:Neural embeddings have a notorious blind spot: they can't reliably tell synonyms apart from antonyms. Consequently, increasing similarity thresholds often fails to prevent opposites from being grouped together. We've built a large-scale semantic clustering system specifically designed to tackle this problem head on. Our pipeline chews through 15 million lexical items, evaluates a massive 520 million potential relationships, and ultimately generates 2.9 million high-precision semantic clusters. The system makes three primary contributions. First, we introduce a labeled dataset of 843,000 concept pairs spanning synonymy, antonymy, and co-hyponymy, constructed via Gemini 2.5-Flash LLM augmentation and verified using human-curated dictionary resources. Second, we propose a specialized three-way semantic relation discriminator that achieves 90% macro-F1, enabling robust disambiguation beyond raw embedding similarity. Third, we introduce a novel soft-to-hard clustering algorithm that mitigates semantic drift preventing erroneous transitive chains (e.g., hot - spicy - pain - depression) while simultaneously resolving polysemy. Our approach employs a topology-aware two-stage expansion-pruning procedure with topological voting, ensuring that each term is assigned to exactly one semantically coherent cluster. The resulting resource enables high-precision semantic search and retrieval-augmented generation, particularly for morphologically rich and low-resource languages where existing synonym databases remain sparse.

99. 【2601.13247】Aligning Agentic World Models via Knowledgeable Experience Learning

链接https://arxiv.org/abs/2601.13247

作者:Baochang Ren,Yunzhi Yao,Rui Sun,Shuofei Qiao,Ningyu Zhang,Huajun Chen

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)

关键词:Current Large Language, Large Language Models, critical modal disconnect, Current Large, possess vast semantic

备注: Ongoing work

点击查看摘要

Abstract:Current Large Language Models (LLMs) exhibit a critical modal disconnect: they possess vast semantic knowledge but lack the procedural grounding to respect the immutable laws of the physical world. Consequently, while these agents implicitly function as world models, their simulations often suffer from physical hallucinations-generating plans that are logically sound but physically unexecutable. Existing alignment strategies predominantly rely on resource-intensive training or fine-tuning, which attempt to compress dynamic environmental rules into static model parameters. However, such parametric encapsulation is inherently rigid, struggling to adapt to the open-ended variability of physical dynamics without continuous, costly retraining. To bridge this gap, we introduce WorldMind, a framework that autonomously constructs a symbolic World Knowledge Repository by synthesizing environmental feedback. Specifically, it unifies Process Experience to enforce physical feasibility via prediction errors and Goal Experience to guide task optimality through successful trajectories. Experiments on EB-ALFRED and EB-Habitat demonstrate that WorldMind achieves superior performance compared to baselines with remarkable cross-model and cross-environment transferability.

100. 【2601.13240】KOCO-BENCH: Can Large Language Models Leverage Domain Knowledge in Software Development?

链接https://arxiv.org/abs/2601.13240

作者:Xue Jiang,Jiaru Qian,Xianjie Shi,Chenjie Li,Hao Zhu,Ziyu Wang,Jielun Zhang,Zheyu Zhao,Kechi Zhang,Jia Li,Wenpin Jiao,Zhi Jin,Ge Li,Yihong Dong

类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large language models, domain specialization methods, specialization methods, Large language, domain specialization

备注

点击查看摘要

Abstract:Large language models (LLMs) excel at general programming but struggle with domain-specific software development, necessitating domain specialization methods for LLMs to learn and utilize domain knowledge and data. However, existing domain-specific code benchmarks cannot evaluate the effectiveness of domain specialization methods, which focus on assessing what knowledge LLMs possess rather than how they acquire and apply new knowledge, lacking explicit knowledge corpora for developing domain specialization methods. To this end, we present KOCO-BENCH, a novel benchmark designed for evaluating domain specialization methods in real-world software development. KOCO-BENCH contains 6 emerging domains with 11 software frameworks and 25 projects, featuring curated knowledge corpora alongside multi-granularity evaluation tasks including domain code generation (from function-level to project-level with rigorous test suites) and domain knowledge understanding (via multiple-choice QA). Unlike previous benchmarks that only provide test sets for direct evaluation, KOCO-BENCH requires acquiring and applying diverse domain knowledge (APIs, rules, constraints, etc.) from knowledge corpora to solve evaluation tasks. Our evaluations reveal that KOCO-BENCH poses significant challenges to state-of-the-art LLMs. Even with domain specialization methods (e.g., SFT, RAG, kNN-LM) applied, improvements remain marginal. Best-performing coding agent, Claude Code, achieves only 34.2%, highlighting the urgent need for more effective domain specialization methods. We release KOCO-BENCH, evaluation code, and baselines to advance further research at this https URL.

101. 【2601.13235】RubRIX: Rubric-Driven Risk Mitigation in Caregiver-AI Interactions

链接https://arxiv.org/abs/2601.13235

作者:Drishti Goel,Jeongah Lee,Qiuyue Joy Zhong,Violeta J. Rodriguez,Daniel S. Brown,Ravi Karkar,Dong Whi Yoo,Koustuv Saha

类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI)

关键词:emotional validation, warrant careful evaluation, support express complex, distress cues, safety and appropriateness

备注

点击查看摘要

Abstract:Caregivers seeking AI-mediated support express complex needs -- information-seeking, emotional validation, and distress cues -- that warrant careful evaluation of response safety and appropriateness. Existing AI evaluation frameworks, primarily focused on general risks (toxicity, hallucinations, policy violations, etc), may not adequately capture the nuanced risks of LLM-responses in caregiving-contexts. We introduce RubRIX (Rubric-based Risk Index), a theory-driven, clinician-validated framework for evaluating risks in LLM caregiving responses. Grounded in the Elements of an Ethic of Care, RubRIX operationalizes five empirically-derived risk dimensions: Inattention, Bias Stigma, Information Inaccuracy, Uncritical Affirmation, and Epistemic Arrogance. We evaluate six state-of-the-art LLMs on over 20,000 caregiver queries from Reddit and ALZConnected. Rubric-guided refinement consistently reduced risk-components by 45-98% after one iteration across models. This work contributes a methodological approach for developing domain-sensitive, user-centered evaluation frameworks for high-burden contexts. Our findings highlight the importance of domain-sensitive, interactional risk evaluation for the responsible deployment of LLMs in caregiving support contexts. We release benchmark datasets to enable future research on contextual risk evaluation in AI-mediated support.

102. 【2601.13228】Autoregressive Models Rival Diffusion Models at ANY-ORDER Generation

链接https://arxiv.org/abs/2601.13228

作者:Tianqi Du,Lizhe Fang,Weijie Yang,Chenheng Zhang,Zeming Wei,Yifei Wang,Yisen Wang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:offering appealing flexibility, offering appealing, models enable any-order, enable any-order generation, Any-order Any-subset Autoregressive

备注

点击查看摘要

Abstract:Diffusion language models enable any-order generation and bidirectional conditioning, offering appealing flexibility for tasks such as infilling, rewriting, and self-correction. However, their formulation-predicting one part of a sequence from another within a single-step dependency-limits modeling depth and often yields lower sample quality and stability than autoregressive (AR) models. To address this, we revisit autoregressive modeling as a foundation and reformulate diffusion-style training into a structured multi-group prediction process. We propose Any-order Any-subset Autoregressive modeling (A3), a generalized framework that extends the standard AR factorization to arbitrary token groups and generation orders. A3 preserves the probabilistic rigor and multi-layer dependency modeling of AR while inheriting diffusion models' flexibility for parallel and bidirectional generation. We implement A3 through a two-stream attention architecture and a progressive adaptation strategy that transitions pretrained AR models toward any-order prediction. Experiments on question answering, commonsense reasoning, and story infilling demonstrate that A3 outperforms diffusion-based models while maintaining flexible decoding. This work offers a unified approach for a flexible, efficient, and novel language modeling paradigm.

103. 【2601.13217】Beyond Single-shot Writing: Deep Research Agents are Unreliable at Multi-turn Report Revision

链接https://arxiv.org/abs/2601.13217

作者:Bingsen Chen,Boyan Li,Ping Nie,Yuyu Zhang,Xi Ye,Chen Zhao

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Deep Research Agents, Deep Research, single-shot writing task, human researchers iteratively, researchers iteratively draft

备注

点击查看摘要

Abstract:Existing benchmarks for Deep Research Agents (DRAs) treat report generation as a single-shot writing task, which fundamentally diverges from how human researchers iteratively draft and revise reports via self-reflection or peer feedback. Whether DRAs can reliably revise reports with user feedback remains unexplored. We introduce Mr Dre, an evaluation suite that establishes multi-turn report revision as a new evaluation axis for DRAs. Mr Dre consists of (1) a unified long-form report evaluation protocol spanning comprehensiveness, factuality, and presentation, and (2) a human-verified feedback simulation pipeline for multi-turn revision. Our analysis of five diverse DRAs reveals a critical limitation: while agents can address most user feedback, they also regress on 16-27% of previously covered content and citation quality. Over multiple revision turns, even the best-performing agents leave significant headroom, as they continue to disrupt content outside the feedback's scope and fail to preserve earlier edits. We further show that these issues are not easily resolvable through inference-time fixes such as prompt engineering and a dedicated sub-agent for report revision.

104. 【2601.13183】OpenExempt: A Diagnostic Benchmark for Legal Reasoning and a Framework for Creating Custom Benchmarks on Demand

链接https://arxiv.org/abs/2601.13183

作者:Sergio Servantez,Sarah B. Lawsky,Rajiv Jain,Daniel W. Linna Jr.,Kristian Hammond

类目:Computation and Language (cs.CL)

关键词:played a crucial, crucial role, Reasoning, benchmark, Abstract

备注: 25 pages, 9 Figures, 15 tables

点击查看摘要

Abstract:Reasoning benchmarks have played a crucial role in the progress of language models. Yet rigorous evaluation remains a significant challenge as static question-answer pairs provide only a snapshot of performance, compressing complex behavior into a single accuracy metric. This limitation is especially true in complex, rule-bound domains such as law, where existing benchmarks are costly to build and ill suited for isolating specific failure modes. To address this, we introduce OpenExempt, a framework and benchmark for diagnostic evaluation of legal reasoning. The OpenExempt Framework uses expert-crafted symbolic representations of U.S. Bankruptcy Code statutes to dynamically generate a large space of natural language reasoning tasks and their machine-computable solutions on demand. This gives users fine-grained control over task complexity and scope, allowing individual reasoning skills to be probed in isolation. Using this system, we construct the OpenExempt Benchmark, a diagnostic benchmark for legal reasoning with 9,765 samples across nine evaluation suites designed to carefully probe model capabilities. Experiments on 13 diverse language models reveal sharp performance cliffs that emerge only under longer reasoning paths and in the presence of obfuscating statements. We release the framework and benchmark publicly to support research aimed at understanding and improving the next generation of reasoning systems.

105. 【2601.13178】Medical Triage as Pairwise Ranking: A Benchmark for Urgency in Patient Portal Messages

链接https://arxiv.org/abs/2601.13178

作者:Joseph Gatto,Parker Seegmiller,Timothy Burdick,Philip Resnik,Roshnik Rahat,Sarah DeLozier,Sarah M. Preum

类目:Computation and Language (cs.CL)

关键词:prioritizing patients based, Medical triage, Medical, allocating medical resources, allocating medical

备注: 19 Pages, 5 Figures

点击查看摘要

Abstract:Medical triage is the task of allocating medical resources and prioritizing patients based on medical need. This paper introduces the first large-scale public dataset for studying medical triage in the context of asynchronous outpatient portal messages. Our novel task formulation views patient message triage as a pairwise inference problem, where we train LLMs to choose `"which message is more medically urgent" in a head-to-head tournament-style re-sort of a physician's inbox. Our novel benchmark PMR-Bench contains 1569 unique messages and 2,000+ high-quality test pairs for pairwise medical urgency assessment alongside a scalable training data generation pipeline. PMR-Bench includes samples that contain both unstructured patient-written messages alongside real electronic health record (EHR) data, emulating a real-world medical triage scenario. We develop a novel automated data annotation strategy to provide LLMs with in-domain guidance on this task. The resulting data is used to train two model classes, UrgentReward and UrgentSFT, leveraging Bradley-Terry and next token prediction objective, respectively to perform pairwise urgency classification. We find that UrgentSFT achieves top performance on PMR-Bench, with UrgentReward showing distinct advantages in low-resource settings. For example, UrgentSFT-8B and UrgentReward-8B provide a 15- and 16-point boost, respectively, on inbox sorting metrics over off-the-shelf 8B models. Paper resources can be found at this https URL

Comments:
19 Pages, 5 Figures

Subjects:

Computation and Language (cs.CL)

Cite as:
arXiv:2601.13178 [cs.CL]

(or
arXiv:2601.13178v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2601.13178

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
106. 【2601.13155】Probe and Skip: Self-Predictive Token Skipping for Efficient Long-Context LLM Inference

链接https://arxiv.org/abs/2601.13155

作者:Zimeng Wu,Donghao Wang,Chaozhe Jin,Jiaxin Chen,Yunhong Wang

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, significant computational overhead, incurring significant computational, capability of Large

备注

点击查看摘要

Abstract:Long-context inference enhances the reasoning capability of Large Language Models (LLMs) while incurring significant computational overhead. Token-oriented methods, such as pruning and skipping, have shown promise in reducing inference latency, but still suffer from inherently limited acceleration potential, outdated proxy signals, and redundancy interference, thus yielding suboptimal speed-accuracy trade-offs. To address these challenges, we propose SPTS (Self-Predictive Token Skipping), a training-free framework for efficient long-context LLM inference. Specifically, motivated by the thought of probing the influence of targeted skipping layers, we design two component-specific strategies for selective token skipping: Partial Attention Probing (PAP) for multi-head attention, which selects informative tokens by performing partial forward attention computation, and Low-rank Transformation Probing (LTP) for feed forward network, which constructs a low-rank proxy network to predict token transformations. Furthermore, a Multi-Stage Delayed Pruning (MSDP) strategy reallocates the skipping budget and progressively prunes redundant tokens across layers. Extensive experiments demonstrate the effectiveness of our method, achieving up to 2.46$\times$ and 2.29$\times$ speedups for prefilling and end-to-end generation, respectively, while maintaining state-of-the-art model performance. The source code will be publicly available upon paper acceptance.

107. 【2601.13142】VWorld: Foundations for Remote-Control TV Agents

链接https://arxiv.org/abs/2601.13142

作者:Zhantao Ma,Quanfeng Lu,Shuai Zhong,Dahai Yu,Ping Luo,Michael K. Ng

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Recent large vision-language, Recent large, large vision-language models, demonstrated strong potential, device control

备注

点击查看摘要

Abstract:Recent large vision-language models (LVLMs) have demonstrated strong potential for device control. However, existing research has primarily focused on point-and-click (PnC) interaction, while remote-control (RC) interaction commonly encountered in everyday TV usage remains largely underexplored. To fill this gap, we introduce \textbf{TVWorld}, an offline graph-based abstraction of real-world TV navigation that enables reproducible and deployment-free evaluation. On this basis, we derive two complementary benchmarks that comprehensively assess TV-use capabilities: \textbf{TVWorld-N} for topology-aware navigation and \textbf{TVWorld-G} for focus-aware grounding. These benchmarks expose a key limitation of existing agents: insufficient topology awareness for focus-based, long-horizon TV navigation. Motivated by this finding, we propose a \emph{Topology-Aware Training} framework that injects topology awareness into LVLMs. Using this framework, we develop \textbf{TVTheseus}, a foundation model specialized for TV navigation. TVTheseus achieves a success rate of $68.3\%$ on TVWorld-N, surpassing strong closed-source baselines such as Gemini 3 Flash and establishing state-of-the-art (SOTA) performance. Additional analyses further provide valuable insights into the development of effective TV-use agents.

108. 【2601.13137】Adversarial Alignment: Ensuring Value Consistency in Large Language Models for Sensitive Domains

链接https://arxiv.org/abs/2601.13137

作者:Yuan Gao,Zhigang Liu,Xinyu Yao,Bo Chen,Xiaobing Zhao

类目:Computation and Language (cs.CL)

关键词:large language models, sensitive domains, society and politics, gradually emerged, terms of race

备注: 13 pages, 5 figures

点击查看摘要

Abstract:With the wide application of large language models (LLMs), the problems of bias and value inconsistency in sensitive domains have gradually emerged, especially in terms of race, society and politics. In this paper, we propose an adversarial alignment framework, which enhances the value consistency of the model in sensitive domains through continued pre-training, instruction fine-tuning and adversarial training. In adversarial training, we use the Attacker to generate controversial queries, the Actor to generate responses with value consistency, and the Critic to filter and ensure response quality. Furthermore, we train a Value-Consistent Large Language Model, VC-LLM, for sensitive domains, and construct a bilingual evaluation dataset in Chinese and English. The experimental results show that VC-LLM performs better than the existing mainstream models in both Chinese and English tests, verifying the effectiveness of the method. Warning: This paper contains examples of LLMs that are offensive or harmful in nature.

109. 【2601.13115】Agentic Conversational Search with Contextualized Reasoning via Reinforcement Learning

链接https://arxiv.org/abs/2601.13115

作者:Fengran Mo,Yifan Gao,Sha Li,Hansi Zeng,Xin Liu,Zhaoxuan Tan,Xian Li,Jianshu Chen,Dakuo Wang,Meng Jiang

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Large Language Models, supporting information seeking, Large Language, Language Models, supporting information

备注

点击查看摘要

Abstract:Large Language Models (LLMs) have become a popular interface for human-AI interaction, supporting information seeking and task assistance through natural, multi-turn dialogue. To respond to users within multi-turn dialogues, the context-dependent user intent evolves across interactions, requiring contextual interpretation, query reformulation, and dynamic coordination between retrieval and generation. Existing studies usually follow static rewrite, retrieve, and generate pipelines, which optimize different procedures separately and overlook the mixed-initiative action optimization simultaneously. Although the recent developments in deep search agents demonstrate the effectiveness in jointly optimizing retrieval and generation via reasoning, these approaches focus on single-turn scenarios, which might lack the ability to handle multi-turn interactions. We introduce a conversational agent that interleaves search and reasoning across turns, enabling exploratory and adaptive behaviors learned through reinforcement learning (RL) training with tailored rewards towards evolving user goals. The experimental results across four widely used conversational benchmarks demonstrate the effectiveness of our methods by surpassing several existing strong baselines.

110. 【2601.13111】CORE-T: COherent REtrieval of Tables for Text-to-SQL

链接https://arxiv.org/abs/2601.13111

作者:Hassan Soliman,Vivek Gupta,Dan Roth,Iryna Gurevych

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:require joining multiple, joining multiple tables, workflows often require, require joining, joining multiple

备注: Preprint under review. Code and data available at: [this https URL](https://github.com/UKPLab/arxiv2026-core-t)

点击查看摘要

Abstract:Realistic text-to-SQL workflows often require joining multiple tables. As a result, accurately retrieving the relevant set of tables becomes a key bottleneck for end-to-end performance. We study an open-book setting where queries must be answered over large, heterogeneous table collections pooled from many sources, without clean scoping signals such as database identifiers. Here, dense retrieval (DR) achieves high recall but returns many distractors, while join-aware alternatives often rely on extra assumptions and/or incur high inference overhead. We propose CORE-T, a scalable, training-free framework that enriches tables with LLM-generated purpose metadata and pre-computes a lightweight table-compatibility cache. At inference time, DR returns top-K candidates; a single LLM call selects a coherent, joinable subset, and a simple additive adjustment step restores strongly compatible tables. Across Bird, Spider, and MMQA, CORE-T improves table-selection F1 by up to 22.7 points while retrieving up to 42% fewer tables, improving multi-table execution accuracy by up to 5.0 points on Bird and 6.9 points on MMQA, and using 4-5x fewer tokens than LLM-intensive baselines.

111. 【2601.13105】Leveraging Lora Fine-Tuning and Knowledge Bases for Construction Identification

链接https://arxiv.org/abs/2601.13105

作者:Liu Kaipeng,Wu Ling

类目:Computation and Language (cs.CL)

关键词:British National Corpus, English ditransitive construction, framework.A binary classification, binary classification task, National Corpus

备注: 19pages, 1figure

点击查看摘要

Abstract:This study investigates the automatic identification of the English ditransitive construction by integrating LoRA-based fine-tuning of a large language model with a Retrieval-Augmented Generation (RAG) framework.A binary classification task was conducted on annotated data from the British National Corpus. Results demonstrate that a LoRA-fine-tuned Qwen3-8B model significantly outperformed both a native Qwen3-MAX model and a theory-only RAG system. Detailed error analysis reveals that fine-tuning shifts the model's judgment from a surface-form pattern matching towards a more semantically grounded understanding based.

112. 【2601.13099】Alexandria: A Multi-Domain Dialectal Arabic Machine Translation Dataset for Culturally Inclusive and Linguistically Diverse LLMs

链接https://arxiv.org/abs/2601.13099

作者:Abdellah El Mekki,Samar M. Magdy,Houdaifa Atou,Ruwa AbuHweidi,Baraah Qawasmeh,Omer Nacar,Thikra Al-hibiri,Razan Saadie,Hamzah Alsayadi,Nadia Ghezaiel Hammouda,Alshima Alkhazimi,Aya Hamod,Al-Yas Al-Ghafri,Wesam El-Sayed,Asila Al sharji,Mohamad Ballout,Anas Belfathi,Karim Ghaddar,Serry Sibaee,Alaa Aoun,Areej Asiri,Lina Abureesh,Ahlam Bashiti,Majdal Yousef,Abdulaziz Hafiz,Yehdih Mohamed,Emira Hamedtou,Brakehe Brahim,Rahaf Alhamouri,Youssef Nafea,Aya El Aatar,Walid Al-Dhabyani,Emhemed Hamed,Sara Shatnawi,Fakhraddin Alwajih,Khalid Elkhidir,Ashwag Alasmari,Abdurrahman Gerrio,Omar Alshahri,AbdelRahim A. Elmadany,Ismail Berrada,Amir Azad Adli Alkathiri,Fadi A Zaraket,Mustafa Jarrar,Yahya Mohamed El Hadj,Hassan Alhuzali,Muhammad Abdul-Mageed

类目:Computation and Language (cs.CL)

关键词:Modern Standard Arabic, Modern Standard, daily communication occurs, highly diglossic language, Standard Arabic

备注: Project resources will be available here: [this https URL](https://github.com/UBC-NLP/Alexandria)

点击查看摘要

Abstract:Arabic is a highly diglossic language where most daily communication occurs in regional dialects rather than Modern Standard Arabic. Despite this, machine translation (MT) systems often generalize poorly to dialectal input, limiting their utility for millions of speakers. We introduce \textbf{Alexandria}, a large-scale, community-driven, human-translated dataset designed to bridge this gap. Alexandria covers 13 Arab countries and 11 high-impact domains, including health, education, and agriculture. Unlike previous resources, Alexandria provides unprecedented granularity by associating contributions with city-of-origin metadata, capturing authentic local varieties beyond coarse regional labels. The dataset consists of multi-turn conversational scenarios annotated with speaker-addressee gender configurations, enabling the study of gender-conditioned variation in dialectal use. Comprising 107K total samples, Alexandria serves as both a training resource and a rigorous benchmark for evaluating MT and Large Language Models (LLMs). Our automatic and human evaluation of Arabic-aware LLMs benchmarks current capabilities in translating across diverse Arabic dialects and sub-dialects, while exposing significant persistent challenges.

113. 【2601.13050】Profiling German Text Simplification with Interpretable Model-Fingerprints

链接https://arxiv.org/abs/2601.13050

作者:Lars Klöser,Mika Beele,Bodo Kraft

类目:Computation and Language (cs.CL)

关键词:produce highly nuanced, highly nuanced text, nuanced text simplifications, produce highly, highly nuanced

备注: Presented at 2nd International Conference on Explainable AI for Neural and Symbolic Systems

点击查看摘要

Abstract:While Large Language Models (LLMs) produce highly nuanced text simplifications, developers currently lack tools for a holistic, efficient, and reproducible diagnosis of their behavior. This paper introduces the Simplification Profiler, a diagnostic toolkit that generates a multidimensional, interpretable fingerprint of simplified texts. Multiple aggregated simplifications of a model result in a model's fingerprint. This novel evaluation paradigm is particularly vital for languages, where the data scarcity problem is magnified when creating flexible models for diverse target groups rather than a single, fixed simplification style. We propose that measuring a model's unique behavioral signature is more relevant in this context as an alternative to correlating metrics with human preferences. We operationalize this with a practical meta-evaluation of our fingerprints' descriptive power, which bypasses the need for large, human-rated datasets. This test measures if a simple linear classifier can reliably identify various model configurations by their created simplifications, confirming that our metrics are sensitive to a model's specific characteristics. The Profiler can distinguish high-level behavioral variations between prompting strategies and fine-grained changes from prompt engineering, including few-shot examples. Our complete feature set achieves classification F1-scores up to 71.9 %, improving upon simple baselines by over 48 percentage points. The Simplification Profiler thus offers developers a granular, actionable analysis to build more effective and truly adaptive text simplification systems.

114. 【2601.13044】yphoon ASR Real-time: FastConformer-Transducer for Thai Automatic Speech Recognition

链接https://arxiv.org/abs/2601.13044

作者:Warit Sirichotedumrong,Adisai Na-Thalang,Potsawee Manakul,Pittawat Taveekitworachai,Sittipong Sripaisarnmongkol,Kunat Pipatanakul

类目:Computation and Language (cs.CL)

关键词:Large encoder-decoder models, Large encoder-decoder, streaming applications due, Typhoon ASR Real-time, Whisper achieve strong

备注: Models and datasets are publicly available on [this https URL](https://huggingface.co/collections/typhoon-ai/typhoon-asr-technical-report) ; Project Page: [this https URL](https://opentyphoon.ai/model/typhoon-asr-realtime)

点击查看摘要

Abstract:Large encoder-decoder models like Whisper achieve strong offline transcription but remain impractical for streaming applications due to high latency. However, due to the accessibility of pre-trained checkpoints, the open Thai ASR landscape remains dominated by these offline architectures, leaving a critical gap in efficient streaming solutions. We present Typhoon ASR Real-time, a 115M-parameter FastConformer-Transducer model for low-latency Thai speech recognition. We demonstrate that rigorous text normalization can match the impact of model scaling: our compact model achieves a 45x reduction in computational cost compared to Whisper Large-v3 while delivering comparable accuracy. Our normalization pipeline resolves systemic ambiguities in Thai transcription --including context-dependent number verbalization and repetition markers (mai yamok) --creating consistent training targets. We further introduce a two-stage curriculum learning approach for Isan (north-eastern) dialect adaptation that preserves Central Thai performance. To address reproducibility challenges in Thai ASR, we release the Typhoon ASR Benchmark, a gold-standard human-labeled datasets with transcriptions following established Thai linguistic conventions, providing standardized evaluation protocols for the research community.

115. 【2601.13035】SASA: Semantic-Aware Contrastive Learning Framework with Separated Attention for Triple Classification

链接https://arxiv.org/abs/2601.13035

作者:Xu Xiaodan,Hu Xiaolin

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Knowledge Graphs, unreliable knowledge, restricts their utility, suffer from unreliable, Knowledge

备注: in progress

点击查看摘要

Abstract:Knowledge Graphs~(KGs) often suffer from unreliable knowledge, which restricts their utility. Triple Classification~(TC) aims to determine the validity of triples from KGs. Recently, text-based methods learn entity and relation representations from natural language descriptions, significantly improving the generalization capabilities of TC models and setting new benchmarks in performance. However, there are still two critical challenges. First, existing methods often ignore the effective semantic interaction among different KG components. Second, most approaches adopt single binary classification training objective, leading to insufficient semantic representation learning. To address these challenges, we propose \textbf{SASA}, a novel framework designed to enhance TC models via separated attention mechanism and semantic-aware contrastive learning~(CL). Specifically, we first propose separated attention mechanism to encode triples into decoupled contextual representations and then fuse them through a more effective interactive way. Then, we introduce semantic-aware hierarchical CL as auxiliary training objective to guide models in improving their discriminative capabilities and achieving sufficient semantic learning, considering both local level and global level CL. Experimental results across two benchmark datasets demonstrate that SASA significantly outperforms state-of-the-art methods. In terms of accuracy, we advance the state-of-the-art by +5.9\% on FB15k-237 and +3.4\% on YAGO3-10.

116. 【2601.13024】ars or Cheers? Benchmarking LLMs via Culturally Elicited Distinct Affective Responses

链接https://arxiv.org/abs/2601.13024

作者:Chongyuan Dai,Yaling Shen,Jinpeng Hu,Zihan Gao,Jia Li,Yishun Jiang,Yaxiong Wang,Liu Liu,Zongyuan Ge

类目:Computation and Language (cs.CL)

关键词:Culture serves, interpret emotional stimuli, fundamental determinant, processing and profoundly, profoundly shapes

备注: 24 pages, 10 figures, 9 Tables

点击查看摘要

Abstract:Culture serves as a fundamental determinant of human affective processing and profoundly shapes how individuals perceive and interpret emotional stimuli. Despite this intrinsic link extant evaluations regarding cultural alignment within Large Language Models primarily prioritize declarative knowledge such as geographical facts or established societal customs. These benchmarks remain insufficient to capture the subjective interpretative variance inherent to diverse sociocultural lenses. To address this limitation, we introduce CEDAR, a multimodal benchmark constructed entirely from scenarios capturing Culturally \underline{\textsc{E}}licited \underline{\textsc{D}}istinct \underline{\textsc{A}}ffective \underline{\textsc{R}}esponses. To construct CEDAR, we implement a novel pipeline that leverages LLM-generated provisional labels to isolate instances yielding cross-cultural emotional distinctions, and subsequently derives reliable ground-truth annotations through rigorous human evaluation. The resulting benchmark comprises 10,962 instances across seven languages and 14 fine-grained emotion categories, with each language including 400 multimodal and 1,166 text-only samples. Comprehensive evaluations of 17 representative multilingual models reveal a dissociation between language consistency and cultural alignment, demonstrating that culturally grounded affective understanding remains a significant challenge for current models.

117. 【2601.13018】Bi-Attention HateXplain : Taking into account the sequential aspect of data during explainability in a multi-task context

链接https://arxiv.org/abs/2601.13018

作者:Ghislain Dorian Tchuente Mondjo

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:online social networks, Technological advances, Internet and online, benefits to humanity, online social

备注: Accepted at "EAI AFRICOMM 2025 - 17th EAI International Conference on Communications and Networks in Africa"

点击查看摘要

Abstract:Technological advances in the Internet and online social networks have brought many benefits to humanity. At the same time, this growth has led to an increase in hate speech, the main global threat. To improve the reliability of black-box models used for hate speech detection, post-hoc approaches such as LIME, SHAP, and LRP provide the explanation after training the classification model. In contrast, multi-task approaches based on the HateXplain benchmark learn to explain and classify simultaneously. However, results from HateXplain-based algorithms show that predicted attention varies considerably when it should be constant. This attention variability can lead to inconsistent interpretations, instability of predictions, and learning difficulties. To solve this problem, we propose the BiAtt-BiRNN-HateXplain (Bidirectional Attention BiRNN HateXplain) model which is easier to explain compared to LLMs which are more complex in view of the need for transparency, and will take into account the sequential aspect of the input data during explainability thanks to a BiRNN layer. Thus, if the explanation is correctly estimated, thanks to multi-task learning (explainability and classification task), the model could classify better and commit fewer unintentional bias errors related to communities. The experimental results on HateXplain data show a clear improvement in detection performance, explainability and a reduction in unintentional bias.

118. 【2601.12995】Graph Reasoning Paradigm: Structured and Symbolic Reasoning with Topology-Aware Reinforcement Learning for Large Language Models

链接https://arxiv.org/abs/2601.12995

作者:Runxuan Liu,Xianhao Ou,Xinyan Ma,Jiyuan Wang,Jiafeng Liang,Jiaqi Li,Tao He,Zheng Chu,Rongchuan Mu,Zekun Wang,Baoxin Wang,Dayong Wu,Ming Liu,Shijin Wang,Guoping Hu,Bing Qin

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Reinforcement Learning, Learning with Verifiable, Large Language, achieved by Reinforcement

备注

点击查看摘要

Abstract:Long Chain-of-Thought (LCoT), achieved by Reinforcement Learning with Verifiable Rewards (RLVR), has proven effective in enhancing the reasoning capabilities of Large Language Models (LLMs). However, reasoning in current LLMs is primarily generated as plain text, where performing semantic evaluation on such unstructured data creates a computational bottleneck during training. Despite RLVR-based optimization, existing methods still suffer from coarse-grained supervision, reward hacking, high training costs, and poor generalization. To address these issues, we propose the Graph Reasoning Paradigm (GRP), which realizes structured and symbolic reasoning, implemented via graph-structured representations with step-level cognitive labels. Building upon GRP, we further design Process-Aware Stratified Clipping Group Relative Policy Optimization (PASC-GRPO), which leverages structured evaluation to replace semantic evaluation, achieves process-aware verification through graph-structured outcome rewards, and mitigates reward hacking via stratified clipping advantage estimation. Experiments demonstrate significant improvements across mathematical reasoning and code generation tasks. Data, models, and code will be released later.

119. 【2601.12991】RAGExplorer: A Visual Analytics System for the Comparative Diagnosis of RAG Systems

链接https://arxiv.org/abs/2601.12991

作者:Haoyu Tian,Yingchaojie Feng,Zhen Wen,Haoxuan Li,Minfeng Zhu,Wei Chen

类目:Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, produce factually accurate, Retrieval-Augmented Generation, Language Models

备注: 11 pages, 7 figures. Accepted to IEEE TVCG (PacificVis 2026)

点击查看摘要

Abstract:The advent of Retrieval-Augmented Generation (RAG) has significantly enhanced the ability of Large Language Models (LLMs) to produce factually accurate and up-to-date responses. However, the performance of a RAG system is not determined by a single component but emerges from a complex interplay of modular choices, such as embedding models and retrieval algorithms. This creates a vast and often opaque configuration space, making it challenging for developers to understand performance trade-offs and identify optimal designs. To address this challenge, we present RAGExplorer, a visual analytics system for the systematic comparison and diagnosis of RAG configurations. RAGExplorer guides users through a seamless macro-to-micro analytical workflow. Initially, it empowers developers to survey the performance landscape across numerous configurations, allowing for a high-level understanding of which design choices are most effective. For a deeper analysis, the system enables users to drill down into individual failure cases, investigate how differences in retrieved information contribute to errors, and interactively test hypotheses by manipulating the provided context to observe the resulting impact on the generated answer. We demonstrate the effectiveness of RAGExplorer through detailed case studies and user studies, validating its ability to empower developers in navigating the complex RAG design space. Our code and user guide are publicly available at this https URL.

120. 【2601.12983】ChartAttack: Testing the Vulnerability of LLMs to Malicious Prompting in Chart Generation

链接https://arxiv.org/abs/2601.12983

作者:Jesus-German Ortiz-Barajas,Jonathan Tonglet,Vivek Gupta,Iryna Gurevych

类目:Computation and Language (cs.CL)

关键词:Multimodal large language, large language models, Multimodal large, enabling efficient data, efficient data analysis

备注

点击查看摘要

Abstract:Multimodal large language models (MLLMs) are increasingly used to automate chart generation from data tables, enabling efficient data analysis and reporting but also introducing new misuse risks. In this work, we introduce ChartAttack, a novel framework for evaluating how MLLMs can be misused to generate misleading charts at scale. ChartAttack injects misleaders into chart designs, aiming to induce incorrect interpretations of the underlying data. Furthermore, we create AttackViz, a chart question-answering (QA) dataset where each (chart specification, QA) pair is labeled with effective misleaders and their induced incorrect answers. Experiments in in-domain and cross-domain settings show that ChartAttack significantly degrades the QA performance of MLLM readers, reducing accuracy by an average of 19.6 points and 14.9 points, respectively. A human study further shows an average 20.2 point drop in accuracy for participants exposed to misleading charts generated by ChartAttack. Our findings highlight an urgent need for robustness and security considerations in the design, evaluation, and deployment of MLLM-based chart generation systems. We make our code and data publicly available.

121. 【2601.12979】he Bitter Lesson of Diffusion Language Models for Agentic Workflows: A Comprehensive Reality Check

链接https://arxiv.org/abs/2601.12979

作者:Qingyu Lu,Liang Ding,Kanjian Zhang,Jinxia Zhang,Dacheng Tao

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Diffusion-based Large Language, Language Models, Diffusion-based Large, Large Language

备注: Under Review

点击查看摘要

Abstract:The pursuit of real-time agentic interaction has driven interest in Diffusion-based Large Language Models (dLLMs) as alternatives to auto-regressive backbones, promising to break the sequential latency bottleneck. However, does such efficiency gains translate into effective agentic behavior? In this work, we present a comprehensive evaluation of dLLMs (e.g., LLaDA, Dream) across two distinct agentic paradigms: Embodied Agents (requiring long-horizon planning) and Tool-Calling Agents (requiring precise formatting). Contrary to the efficiency hype, our results on Agentboard and BFCL reveal a "bitter lesson": current dLLMs fail to serve as reliable agentic backbones, frequently leading to systematically failure. (1) In Embodied settings, dLLMs suffer repeated attempts, failing to branch under temporal feedback. (2) In Tool-Calling settings, dLLMs fail to maintain symbolic precision (e.g. strict JSON schemas) under diffusion noise. To assess the potential of dLLMs in agentic workflows, we introduce DiffuAgent, a multi-agent evaluation framework that integrates dLLMs as plug-and-play cognitive cores. Our analysis shows that dLLMs are effective in non-causal roles (e.g., memory summarization and tool selection) but require the incorporation of causal, precise, and logically grounded reasoning mechanisms into the denoising process to be viable for agentic tasks.

122. 【2601.12974】Bridging the Knowledge-Action Gap by Evaluating LLMs in Dynamic Dental Clinical Scenarios

链接https://arxiv.org/abs/2601.12974

作者:Hongyang Ma,Tiantian Gu,Huaiyuan Sun,Huilin Zhu,Yongxin Wang,Jie Li,Wubin Sun,Zeliang Lian,Yinghong Zhou,Yi Gao,Shirui Wang,Zhihui Tang

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, clinical agents demands, evaluation-from static accuracy, dynamic behavioral reliability

备注: 29 pages, 15 figures

点击查看摘要

Abstract:The transition of Large Language Models (LLMs) from passive knowledge retrievers to autonomous clinical agents demands a shift in evaluation-from static accuracy to dynamic behavioral reliability. To explore this boundary in dentistry, a domain where high-quality AI advice uniquely empowers patient-participatory decision-making, we present the Standardized Clinical Management Performance Evaluation (SCMPE) benchmark, which comprehensively assesses performance from knowledge-oriented evaluations (static objective tasks) to workflow-based simulations (multi-turn simulated patient interactions). Our analysis reveals that while models demonstrate high proficiency in static objective tasks, their performance precipitates in dynamic clinical dialogues, identifying that the primary bottleneck lies not in knowledge retention, but in the critical challenges of active information gathering and dynamic state tracking. Mapping "Guideline Adherence" versus "Decision Quality" reveals a prevalent "High Efficacy, Low Safety" risk in general models. Furthermore, we quantify the impact of Retrieval-Augmented Generation (RAG). While RAG mitigates hallucinations in static tasks, its efficacy in dynamic workflows is limited and heterogeneous, sometimes causing degradation. This underscores that external knowledge alone cannot bridge the reasoning gap without domain-adaptive pre-training. This study empirically charts the capability boundaries of dental LLMs, providing a roadmap for bridging the gap between standardized knowledge and safe, autonomous clinical practice.

123. 【2601.12973】Pardon? Evaluating Conversational Repair in Large Audio-Language Models

链接https://arxiv.org/abs/2601.12973

作者:Shuanghong Huang,Jinlei Xu,Youchao Zhou,Yanghao Zhou,Xuan Zhao,Chong Feng,Wenxuan Zhang

类目:Computation and Language (cs.CL)

关键词:Large Audio-Language Models, demonstrated strong performance, spoken question answering, Large Audio-Language, existing evaluations primarily

备注

点击查看摘要

Abstract:Large Audio-Language Models (LALMs) have demonstrated strong performance in spoken question answering (QA), with existing evaluations primarily focusing on answer accuracy and robustness to acoustic perturbations. However, such evaluations implicitly assume that spoken inputs remain semantically answerable, an assumption that often fails in real-world interaction when essential information is missing. In this work, we introduce a repair-aware evaluation setting that explicitly distinguishes between answerable and unanswerable audio inputs. We define answerability as a property of the input itself and construct paired evaluation conditions using a semantic-acoustic masking protocol. Based on this setting, we propose the Evaluability Awareness and Repair (EAR) score, a non-compensatory metric that jointly evaluates task competence under answerable conditions and repair behavior under unanswerable conditions. Experiments on two spoken QA benchmarks across diverse LALMs reveal a consistent gap between answer accuracy and conversational reliability: while many models perform well when inputs are answerable, most fail to recognize semantic unanswerability and initiate appropriate conversational repair. These findings expose a limitation of prevailing accuracy-centric evaluation practices and motivate reliability assessments that treat unanswerable inputs as cues for repair and continued interaction.

124. 【2601.12966】Lombard Speech Synthesis for Any Voice with Controllable Style Embeddings

链接https://arxiv.org/abs/2601.12966

作者:Seymanur Akti,Alexander Waibel

类目:ound (cs.SD); Computation and Language (cs.CL)

关键词:addressing hearing-impaired listeners, Lombard effect plays, natural communication, hearing-impaired listeners, effect plays

备注

点击查看摘要

Abstract:The Lombard effect plays a key role in natural communication, particularly in noisy environments or when addressing hearing-impaired listeners. We present a controllable text-to-speech (TTS) system capable of synthesizing Lombard speech for any speaker without requiring explicit Lombard data during training. Our approach leverages style embeddings learned from a large, prosodically diverse dataset and analyzes their correlation with Lombard attributes using principal component analysis (PCA). By shifting the relevant PCA components, we manipulate the style embeddings and incorporate them into our TTS model to generate speech at desired Lombard levels. Evaluations demonstrate that our method preserves naturalness and speaker identity, enhances intelligibility under noise, and provides fine-grained control over prosody, offering a robust solution for controllable Lombard TTS for any speaker.

125. 【2601.12960】rustworthy Data-driven Chronological Age Estimation from Panoramic Dental Images

链接https://arxiv.org/abs/2601.12960

作者:Ainhoa Vivel-Couso,Nicolás Vila-Blanco,María J. Carreira,Alberto Bugarín-Diz,Inmaculada Tomás,Jose M. Alonso-Moral

类目:Computation and Language (cs.CL)

关键词:Integrating deep learning, healthcare enables personalized, enables personalized care, raises trust issues, trust issues due

备注: This paper is a preliminary version of an accepted article in Information Systems Frontiers, Springer, Special Issue "Explainability in Human-Centric AI". Please cite the final published version of the paper, not this preprint. The final published version can be found at [this https URL](https://link.springer.com/article/10.1007/s10796-025-10682-3)

点击查看摘要

Abstract:Integrating deep learning into healthcare enables personalized care but raises trust issues due to model opacity. To improve transparency, we propose a system for dental age estimation from panoramic images that combines an opaque and a transparent method within a natural language generation (NLG) module. This module produces clinician-friendly textual explanations about the age estimations, designed with dental experts through a rule-based approach. Following the best practices in the field, the quality of the generated explanations was manually validated by dental experts using a questionnaire. The results showed a strong performance, since the experts rated 4.77+/-0.12 (out of 5) on average across the five dimensions considered. We also performed a trustworthy self-assessment procedure following the ALTAI checklist, in which it scored 4.40+/-0.27 (out of 5) across seven dimensions of the AI Trustworthiness Assessment List.

126. 【2601.12946】AI-generated data contamination erodes pathological variability and diagnostic reliability

链接https://arxiv.org/abs/2601.12946

作者:Hongyu He,Shaowen Xiang,Ye Zhang,Yingtao Zhu,Jin Zhang,Hao Deng,Emily Alsentzer,Qingyu Chen,Kun-Hsing Yu,Andrew Marmenshall,Tingting Chen,Srinivas Anumasa,Daniel Ebner,Dean Ho,Kee Yuan Ngiam,Ching-Yu Cheng,Dianbo Liu

类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Generative artificial intelligence, rapidly populating medical, populating medical records, artificial intelligence, creating a feedback

备注: *Corresponding author: Dianbo Liu (dianbo@nus. [this http URL](http://edu.sg) )

点击查看摘要

Abstract:Generative artificial intelligence (AI) is rapidly populating medical records with synthetic content, creating a feedback loop where future models are increasingly at risk of training on uncurated AI-generated data. However, the clinical consequences of this AI-generated data contamination remain unexplored. Here, we show that in the absence of mandatory human verification, this self-referential cycle drives a rapid erosion of pathological variability and diagnostic reliability. By analysing more than 800,000 synthetic data points across clinical text generation, vision-language reporting, and medical image synthesis, we find that models progressively converge toward generic phenotypes regardless of the model architecture. Specifically, rare but critical findings, including pneumothorax and effusions, vanish from the synthetic content generated by AI models, while demographic representations skew heavily toward middle-aged male phenotypes. Crucially, this degradation is masked by false diagnostic confidence; models continue to issue reassuring reports while failing to detect life-threatening pathology, with false reassurance rates tripling to 40%. Blinded physician evaluation confirms that this decoupling of confidence and accuracy renders AI-generated documentation clinically useless after just two generations. We systematically evaluate three mitigation strategies, finding that while synthetic volume scaling fails to prevent collapse, mixing real data with quality-aware filtering effectively preserves diversity. Ultimately, our results suggest that without policy-mandated human oversight, the deployment of generative AI threatens to degrade the very healthcare data ecosystems it relies upon.

127. 【2601.12945】A Component-Based Survey of Interactions between Large Language Models and Multi-Armed Bandits

链接https://arxiv.org/abs/2601.12945

作者:Miao Xie,Siguang Chen,Chunli Lv

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large language models, Large language, powerful and widely, provide a principled, principled framework

备注: 27 pages, 6 table

点击查看摘要

Abstract:Large language models (LLMs) have become powerful and widely used systems for language understanding and generation, while multi-armed bandit (MAB) algorithms provide a principled framework for adaptive decision-making under uncertainty. This survey explores the potential at the intersection of these two fields. As we know, it is the first survey to systematically review the bidirectional interaction between large language models and multi-armed bandits at the component level. We highlight the bidirectional benefits: MAB algorithms address critical LLM challenges, spanning from pre-training to retrieval-augmented generation (RAG) and personalization. Conversely, LLMs enhance MAB systems by redefining core components such as arm definition and environment modeling, thereby improving decision-making in sequential tasks. We analyze existing LLM-enhanced bandit systems and bandit-enhanced LLM systems, providing insights into their design, methodologies, and performance. Key challenges and representative findings are identified to help guide future research. An accompanying GitHub repository that indexes relevant literature is available at this https URL.

128. 【2601.12921】Injecting Knowledge from Social Science Journals to Improve Indonesian Cultural Understanding by LLMs

链接https://arxiv.org/abs/2601.12921

作者:Adimulya Kartiyasa,Bao Gia Cao,Boyang Li

类目:Computation and Language (cs.CL)

关键词:large language models, language models, intensifying efforts, efforts to improve, improve the understanding

备注

点击查看摘要

Abstract:Recently there have been intensifying efforts to improve the understanding of Indonesian cultures by large language models (LLMs). An attractive source of cultural knowledge that has been largely overlooked is local journals of social science, which likely contain substantial cultural studies from a native perspective. We present a novel text dataset of journal article passages, created from 151 open-source Indonesian social science journals, called IndoSoSci. We demonstrate an effective recipe for injecting Indonesian cultural knowledge therein into LLMs: extracting the facts related to Indonesian culture, and apply retrieval-augmented generation (RAG) with LLM-generated hypothetical documents as queries during retrieval. The proposed recipe yields strong performance gains over several strong baselines on the IndoCulture benchmark. Additionally, by combining IndoSoSci with Indonesian Wikipedia, we set a new state-of-the-art accuracy on the IndoCulture benchmark.

129. 【2601.12910】SciCoQA: Quality Assurance for Scientific Paper--Code Alignment

链接https://arxiv.org/abs/2601.12910

作者:Tim Baumgärtner,Iryna Gurevych

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:ensure faithful implementations, faithful implementations, scientific publications, codebases to ensure, ensure faithful

备注

点击查看摘要

Abstract:We present SciCoQA, a dataset for detecting discrepancies between scientific publications and their codebases to ensure faithful implementations. We construct SciCoQA from GitHub issues and reproducibility papers, and to scale our dataset, we propose a synthetic data generation method for constructing paper-code discrepancies. We analyze the paper-code discrepancies in detail and propose discrepancy types and categories to better understand the occurring mismatches. In total, our dataset consists of 611 paper-code discrepancies (81 real, 530 synthetic), spanning diverse computational science disciplines, including AI, Physics, Quantitative Biology, and others. Our evaluation of 21 LLMs highlights the difficulty of SciCoQA, particularly for instances involving omitted paper details, long-context inputs, and data outside the models' pre-training corpus. The best performing model in our evaluation, GPT-5, can only detect 45.7\% of real-world paper-code discrepancies.

130. 【2601.12906】Gated Differentiable Working Memory for Long-Context Language Modeling

链接https://arxiv.org/abs/2601.12906

作者:Lingrui Mei,Shenghua Liu,Yiwei Wang,Yuyao Ge,Baolong Bi,Jiayu Yao,Jun Wan,Ziling Yin,Jiafeng Guo,Xueqi Cheng

类目:Computation and Language (cs.CL)

关键词:Long contexts challenge, attention scores dilute, contexts challenge transformers, Long contexts, challenge transformers

备注

点击查看摘要

Abstract:Long contexts challenge transformers: attention scores dilute across thousands of tokens, critical information is often lost in the middle, and models struggle to adapt to novel patterns at inference time. Recent work on test-time adaptation addresses this by maintaining a form of working memory -- transient parameters updated on the current context -- but existing approaches rely on uniform write policies that waste computation on low-utility regions and suffer from high gradient variance across semantically heterogeneous contexts. In this work, we reframe test-time adaptation as a budget-constrained memory consolidation problem, focusing on which parts of the context should be consolidated into working memory under limited computation. We propose Gdwm (Gated Differentiable Working Memory), a framework that introduces a write controller to gate the consolidation process. The controller estimates Contextual Utility, an information-theoretic measure of long-range contextual dependence, and allocates gradient steps accordingly while maintaining global coverage. Experiments on ZeroSCROLLS and LongBench v2 demonstrate that Gdwm achieves comparable or superior performance with 4$\times$ fewer gradient steps than uniform baselines, establishing a new efficiency-performance Pareto frontier for test-time adaptation.

131. 【2601.12904】From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation

链接https://arxiv.org/abs/2601.12904

作者:Jiahao Wang,Weiyu Xie,Mingxing Zhang,Boxing Zhang,Jianwei Dong,Yuening Zhu,Chen Lin,Jinqi Tang,Yaochen Han,Zhiyuan Ai,Xianglin Chen,Yongwei Wu,Congfeng Jiang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:enhances Large Language, Large Language Models, Generation enhances Large, Large Language, integrating external knowledge

备注

点击查看摘要

Abstract:Retrieval-Augmented Generation enhances Large Language Models by integrating external knowledge, which reduces hallucinations but increases prompt length. This increase leads to higher computational costs and longer Time to First Token (TTFT). To mitigate this issue, existing solutions aim to reuse the preprocessed KV cache of each retrieved chunk to accelerate RAG. However, the lack of cross-chunk contextual information leads to a significant drop in generation quality, leaving the potential benefits of KV cache reuse largely unfulfilled. The challenge lies in how to reuse the precomputed KV cache of chunks while preserving generation quality. We propose FusionRAG, a novel inference framework that optimizes both the preprocessing and reprocessing stages of RAG. In the offline preprocessing stage, we embed information from other related text chunks into each chunk, while in the online reprocessing stage, we recompute the KV cache for tokens that the model focuses on. As a result, we achieve a better trade-off between generation quality and efficiency. According to our experiments, FusionRAG significantly improves generation quality at the same recomputation ratio compared to previous state-of-the-art solutions. By recomputing fewer than 15% of the tokens, FusionRAG achieves up to 70% higher normalized F1 scores than baselines and reduces TTFT by 2.66x-9.39x compared to Full Attention.

132. 【2601.12879】Hierarchical Sparse Circuit Extraction from Billion-Parameter Language Models through Scalable Attribution Graph Decomposition

链接https://arxiv.org/abs/2601.12879

作者:Mohammed Mudassir Uddin,Shahnawaz Alam,Mohammed Kaif Pasha

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Mechanistic interpretability seeks, remains challenging due, exponential search complexity, extracting sparse computational, models remains challenging

备注

点击查看摘要

Abstract:Mechanistic interpretability seeks to reverse-engineer neural network computations into human-understandable algorithms, yet extracting sparse computational circuits from billion-parameter language models remains challenging due to exponential search complexity and pervasive polysemanticity. The proposed Hierarchical Attribution Graph Decomposition (HAGD) framework reduces circuit discovery complexity from O(2^n) exhaustive enumeration to O(n^2 log n) through multi-resolution abstraction hierarchies and differentiable circuit search. The methodology integrates cross-layer transcoders for monosemantic feature extraction, graph neural network meta-learning for topology prediction, and causal intervention protocols for validation. Empirical evaluation spans GPT-2 variants, Llama-7B through Llama-70B, and Pythia suite models across algorithmic tasks and natural language benchmarks. On modular arithmetic tasks, the framework achieves up to 91% behavioral preservation ($\pm$2.3\% across runs) while maintaining interpretable subgraph sizes. Cross-architecture transfer experiments suggest that discovered circuits exhibit moderate structural similarity (averaging 67%) across model families, indicating potential shared computational patterns. These results provide preliminary foundations for interpretability at larger model scales while identifying significant limitations in current attribution methodologies that require future advances.

133. 【2601.12868】Race, Ethnicity and Their Implication on Bias in Large Language Models

链接https://arxiv.org/abs/2601.12868

作者:Shiyue Hu,Ruizhe Li,Yanjun Gao

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large language models, high-stakes settings including, settings including healthcare, Large language, increasingly operate

备注: Work in process

点击查看摘要

Abstract:Large language models (LLMs) increasingly operate in high-stakes settings including healthcare and medicine, where demographic attributes such as race and ethnicity may be explicitly stated or implicitly inferred from text. However, existing studies primarily document outcome-level disparities, offering limited insight into internal mechanisms underlying these effects. We present a mechanistic study of how race and ethnicity are represented and operationalized within LLMs. Using two publicly available datasets spanning toxicity-related generation and clinical narrative understanding tasks, we analyze three open-source models with a reproducible interpretability pipeline combining probing, neuron-level attribution, and targeted intervention. We find that demographic information is distributed across internal units with substantial cross-model variation. Although some units encode sensitive or stereotype-related associations from pretraining, identical demographic cues can induce qualitatively different behaviors. Interventions suppressing such neurons reduce bias but leave substantial residual effects, suggesting behavioral rather than representational change and motivating more systematic mitigation.

134. 【2601.12844】Rapport du Projet de Recherche TRAIMA

链接https://arxiv.org/abs/2601.12844

作者:Julie Rançon(UP, FoReLLIS, Poitiers),Jean-François Cerisier(Techné, Poitiers),Emilie Remond(Techné, Poitiers),Aurélien Nguyen(Techné, Poitiers),Andrew Peterson(Techné, Poitiers),Ladjel Bellatreche(ISAE-ENSMA, IDD, A\amp;S)

类目:Computation and Language (cs.CL)

关键词:TRaitement Automatique des, Automatique des Interactions, des Interactions Multimodales, Multimodales en Apprentissage, conducted between March

备注: in French language

点击查看摘要

Abstract:The TRAIMA project (TRaitement Automatique des Interactions Multimodales en Apprentissage), conducted between March 2019 and June 2020, investigates the potential of automatic processing of multimodal interactions in educational settings. The project addresses a central methodological challenge in educational and interactional research: the analysis of verbal, paraverbal, and non-verbal data is currently carried out manually, making it extremely time-consuming and difficult to scale. TRAIMA explores how machine learning approaches could contribute to the categorisation and classification of such interactions. The project focuses specifically on explanatory and collaborative sequences occurring in classroom interactions, particularly in French as a Foreign Language (FLE) and French as a First Language (FLM) contexts. These sequences are analysed as inherently multimodal phenomena, combining spoken language with prosody, gestures, posture, gaze, and spatial positioning. A key theoretical contribution of the project is the precise linguistic and interactional definition of explanatory discourse as a tripartite sequence (opening, explanatory core, closure), drawing on discourse analysis and interactional linguistics. A substantial part of the research is devoted to the methodological foundations of transcription, which constitute a critical bottleneck for any form of automation. The report provides a detailed state of the art of existing transcription conventions (ICOR, Mondada, GARS, VALIBEL, Ferr{é}), highlighting their respective strengths and limitations when applied to multimodal classroom data. Through comparative analyses of manually transcribed sequences, the project demonstrates the inevitable variability and interpretative dimension of transcription practices, depending on theoretical positioning and analytical goals. Empirical work is based on several corpora, notably the INTER-EXPLIC corpus (approximately 30 hours of classroom interaction) and the EXPLIC-LEXIC corpus, which serve both as testing grounds for manual annotation and as reference datasets for future automation. Particular attention is paid to teacher gestures (kin{é}sic and proxemic resources), prosodic features, and their functional role in meaning construction and learner comprehension. The project also highlights the strategic role of the Techn{é}LAB platform, which provides advanced multimodal data capture (multi-camera video, synchronized audio, eye-tracking, digital interaction traces) and constitutes both a research infrastructure and a test environment for the development of automated tools. In conclusion, TRAIMA does not aim to deliver a fully operational automated system, but rather to establish a rigorous methodological framework for the automatic processing of multimodal pedagogical interactions. The project identifies transcription conventions, annotation categories, and analytical units that are compatible with machine learning approaches, while emphasizing the need for theoretical explicitness and researcher reflexivity. TRAIMA thus lays the groundwork for future interdisciplinary research at the intersection of didactics, discourse analysis, multimodality, and artificial intelligence in education.

135. 【2601.12815】Multimodal Multi-Agent Empowered Legal Judgment Prediction

链接https://arxiv.org/abs/2601.12815

作者:Zhaolu Kang,Junhao Gong,Qingxi Chen,Hao Zhang,Jiaxin Liu,Rong Fu,Zhiyuan Feng,Yuan Wang,Simon Fong,Kaiyue Zhou

类目:Computation and Language (cs.CL)

关键词:Legal Judgment Prediction, Judgment Prediction, legal cases based, aims to predict, factual descriptions

备注

点击查看摘要

Abstract:Legal Judgment Prediction (LJP) aims to predict the outcomes of legal cases based on factual descriptions, serving as a fundamental task to advance the development of legal systems. Traditional methods often rely on statistical analyses or role-based simulations but face challenges with multiple allegations, diverse evidence, and lack adaptability. In this paper, we introduce JurisMMA, a novel framework for LJP that effectively decomposes trial tasks, standardizes processes, and organizes them into distinct stages. Furthermore, we build JurisMM, a large dataset with over 100,000 recent Chinese judicial records, including both text and multimodal video-text data, enabling comprehensive evaluation. Experiments on JurisMM and the benchmark LawBench validate our framework's effectiveness. These results indicate that our framework is effective not only for LJP but also for a broader range of legal applications, offering new perspectives for the development of future legal methods and datasets.

136. 【2601.12812】Do Clinical Question Answering Systems Really Need Specialised Medical Fine Tuning?

链接https://arxiv.org/abs/2601.12812

作者:Sushant Kumar Ray,Gautam Siddharth Kashyap,Sahil Tripathi,Nipun Joshi,Vijay Govindarajan,Rafiq Ali,Jiechao Gao,Usman Naseem

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, Clinical Question-Answering, rely on Large, specialised medical LLMs

备注: Accepted at EACL 2026 (Industry Track)

点击查看摘要

Abstract:Clinical Question-Answering (CQA) industry systems are increasingly rely on Large Language Models (LLMs), yet their deployment is often guided by the assumption that domain-specific fine-tuning is essential. Although specialised medical LLMs such as BioBERT, BioGPT, and PubMedBERT remain popular, they face practical limitations including narrow coverage, high retraining costs, and limited adaptability. Efforts based on Supervised Fine-Tuning (SFT) have attempted to address these assumptions but continue to reinforce what we term the SPECIALISATION FALLACY-the belief that specialised medical LLMs are inherently superior for CQA. To address this assumption, we introduce MEDASSESS-X, a deployment-industry-oriented CQA framework that applies alignment at inference time rather than through SFT. MEDASSESS-X uses lightweight steering vectors to guide model activations toward medically consistent reasoning without updating model weights or requiring domain-specific retraining. This inference-time alignment layer stabilises CQA performance across both general-purpose and specialised medical LLMs, thereby resolving the SPECIALISATION FALLACY. Empirically, MEDASSESS-X delivers consistent gains across all LLM families, improving Accuracy by up to +6%, Factual Consistency by +7%, and reducing Safety Error Rate by as much as 50%.

137. 【2601.12799】FRoM-W1: Towards General Humanoid Whole-Body Control with Language Instructions

链接https://arxiv.org/abs/2601.12799

作者:Peng Li,Zihan Zhuang,Yangfan Gao,Yi Dong,Sixian Li,Changhao Jiang,Shihan Dou,Zhiheng Xi,Enyu Zhou,Jixuan Huang,Hui Li,Jingjing Gong,Xingjun Ma,Tao Gui,Zuxuan Wu,Qi Zhang,Xuanjing Huang,Yu-Gang Jiang,Xipeng Qiu

类目:Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:capable of performing, Humanoid robots, Humanoid, whole-body motion, robots

备注: Project Page: [this https URL](https://openmoss.github.io/FRoM-W1)

点击查看摘要

Abstract:Humanoid robots are capable of performing various actions such as greeting, dancing and even backflipping. However, these motions are often hard-coded or specifically trained, which limits their versatility. In this work, we present FRoM-W1, an open-source framework designed to achieve general humanoid whole-body motion control using natural language. To universally understand natural language and generate corresponding motions, as well as enable various humanoid robots to stably execute these motions in the physical world under gravity, FRoM-W1 operates in two stages: (a) H-GPT: utilizing massive human data, a large-scale language-driven human whole-body motion generation model is trained to generate diverse natural behaviors. We further leverage the Chain-of-Thought technique to improve the model's generalization in instruction understanding. (b) H-ACT: After retargeting generated human whole-body motions into robot-specific actions, a motion controller that is pretrained and further fine-tuned through reinforcement learning in physical simulation enables humanoid robots to accurately and stably perform corresponding actions. It is then deployed on real robots via a modular simulation-to-reality module. We extensively evaluate FRoM-W1 on Unitree H1 and G1 robots. Results demonstrate superior performance on the HumanML3D-X benchmark for human whole-body motion generation, and our introduced reinforcement learning fine-tuning consistently improves both motion tracking accuracy and task success rates of these humanoid robots. We open-source the entire FRoM-W1 framework and hope it will advance the development of humanoid intelligence.

138. 【2601.12779】Open Vocabulary Panoptic Segmentation With Retrieval Augmentation

链接https://arxiv.org/abs/2601.12779

作者:Nafis Sadeq,Qingfeng Liu,Mostafa El-Khamy

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:panoptic segmentation aims, Open Vocabulary Panoptic, Vocabulary Panoptic Segmentation, panoptic segmentation, segmentation aims

备注

点击查看摘要

Abstract:Given an input image and set of class names, panoptic segmentation aims to label each pixel in an image with class labels and instance labels. In comparison, Open Vocabulary Panoptic Segmentation aims to facilitate the segmentation of arbitrary classes according to user input. The challenge is that a panoptic segmentation system trained on a particular dataset typically does not generalize well to unseen classes beyond the training data. In this work, we propose RetCLIP, a retrieval-augmented panoptic segmentation method that improves the performance of unseen classes. In particular, we construct a masked segment feature database using paired image-text data. At inference time, we use masked segment features from the input image as query keys to retrieve similar features and associated class labels from the database. Classification scores for the masked segment are assigned based on the similarity between query features and retrieved features. The retrieval-based classification scores are combined with CLIP-based scores to produce the final output. We incorporate our solution with a previous SOTA method (FC-CLIP). When trained on COCO, the proposed method demonstrates 30.9 PQ, 19.3 mAP, 44.0 mIoU on the ADE20k dataset, achieving +4.5 PQ, +2.5 mAP, +10.0 mIoU absolute improvement over the baseline.

139. 【2601.12771】Who Does This Name Remind You of? Nationality Prediction via Large Language Model Associative Memory

链接https://arxiv.org/abs/2601.12771

作者:Keito Inoshita

类目:Computation and Language (cs.CL)

关键词:Large language models, possess extensive world, Large language, knowledge remain underexplored, LLM world knowledge

备注

点击查看摘要

Abstract:Large language models (LLMs) possess extensive world knowledge, yet methods for effectively eliciting this knowledge remain underexplored. Nationality and region prediction tasks require understanding of not only linguistic features but also cultural and historical background, making LLM world knowledge particularly valuable. However, conventional LLM prompting methods rely on direct reasoning approaches, which have limitations in applying abstract linguistic rules. We propose LLM Associative Memory Agents (LAMA), a novel framework that leverages LLM world knowledge as associative memory. Rather than directly inferring nationality from names, LAMA recalls famous individuals with the same name and aggregates their nationalities through indirect reasoning. A dual-agent architecture comprising a Person Agent and a Media Agent, specialized in different knowledge domains, recalls famous individuals in parallel, generating Top-1 predictions through voting and Top-K predictions through conditional completion. On a 99-country nationality prediction task, LAMA achieved 0.817 accuracy, substantially outperforming conventional LLM prompting methods and neural models. Our experiments reveal that LLMs exhibit higher reliability in recalling concrete examples than in abstract reasoning, that recall-based approaches are robust to low-frequency nationalities independent of data frequency distributions, and that the dual-agent architecture functions complementarily to produce synergistic effects. These results demonstrate the effectiveness of a new multi-agent system that retrieves and aggregates LLM knowledge rather than prompting reasoning.

140. 【2601.12758】VISPA: Pluralistic Alignment via Automatic Value Selection and Activation

链接https://arxiv.org/abs/2601.12758

作者:Shenyan Zheng,Jiayou Zhong,Anudeex Shetty,Heng Ji,Preslav Nakov,Usman Naseem

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:human preference, high-stakes domains, reflect not average, varying perspectives, outputs reflect

备注: WIP

点击查看摘要

Abstract:As large language models are increasingly used in high-stakes domains, it is essential that their outputs reflect not average} human preference, rather range of varying perspectives. Achieving such pluralism, however, remains challenging. Existing approaches consider limited values or rely on prompt-level interventions, lacking value control and representation. To address this, we introduce VISPA, a training-free pluralistic alignment framework, that enables direct control over value expression by dynamic selection and internal model activation steering. Across extensive empirical studies spanning multiple models and evaluation settings, we show VISPA is performant across all pluralistic alignment modes in healthcare and beyond. Further analysis reveals VISPA is adaptable with different steering initiations, model, and/or values. These results suggest that pluralistic alignment can be achieved through internal activation mechanisms, offering a scalable path toward language models that serves all.

141. 【2601.12754】PAIR-SAFE: A Paired-Agent Approach for Runtime Auditing and Refining AI-Mediated Mental Health Support

链接https://arxiv.org/abs/2601.12754

作者:Jiwon Kim,Violeta J. Rodriguez,Dong Whi Yoo,Eshwar Chandrasekharan,Koustuv Saha

类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:Large language models, Large language, mental health support, language models, overly directive

备注

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used for mental health support, yet they can produce responses that are overly directive, inconsistent, or clinically misaligned, particularly in sensitive or high-risk contexts. Existing approaches to mitigating these risks largely rely on implicit alignment through training or prompting, offering limited transparency and runtime accountability. We introduce PAIR-SAFE, a paired-agent framework for auditing and refining AI-generated mental health support that integrates a Responder agent with a supervisory Judge agent grounded in the clinically validated Motivational Interviewing Treatment Integrity (MITI-4) framework. The Judgeaudits each response and provides structuredALLOW or REVISE decisions that guide runtime response refinement. We simulate counseling interactions using a support-seeker simulator derived from human-annotated motivational interviewing data. We find that Judge-supervised interactions show significant improvements in key MITI dimensions, including Partnership, Seek Collaboration, and overall Relational quality. Our quantitative findings are supported by qualitative expert evaluation, which further highlights the nuances of runtime supervision. Together, our results reveal that such pairedagent approach can provide clinically grounded auditing and refinement for AI-assisted conversational mental health support.

142. 【2601.12748】owards Robust Process Reward Modeling via Noise-aware Learning

链接https://arxiv.org/abs/2601.12748

作者:Bin Xie,Bingbing Xu,Xueyun Tian,Yilin Chen,Huawei Shen

类目:Computation and Language (cs.CL)

关键词:achieved strong results, Monte Carlo Estimation, costly process-level supervision, Process Reward Models, defines process rewards

备注

点击查看摘要

Abstract:Process Reward Models (PRMs) have achieved strong results in complex reasoning, but are bottlenecked by costly process-level supervision. A widely used alternative, Monte Carlo Estimation (MCE), defines process rewards as the probability that a policy model reaches the correct final answer from a given reasoning step. However, step correctness is an intrinsic property of the reasoning trajectory, and should be invariant to policy choice. Our empirical findings show that MCE producing policy-dependent rewards that induce label noise, including false positives that reward incorrect steps and false negatives that penalize correct ones. To address above challenges, we propose a two-stage framework to mitigate noisy supervision. In the labeling stage, we introduce a reflection-aware label correction mechanism that uses a large language model (LLM) as a judge to detect reflection and self-correction behaviors related to the current reasoning step, thereby suppressing overestimated rewards. In the training stage, we further propose a \underline{\textbf{N}}oise-\underline{\textbf{A}}ware \underline{\textbf{I}}terative \underline{\textbf{T}}raining framework that enables the PRM to progressively refine noisy labels based on its own confidence. Extensive Experiments show that our method substantially improves step-level correctness discrimination, achieving up to a 27\% absolute gain in average F1 over PRMs trained with noisy supervision.

143. 【2601.12731】A Shared Geometry of Difficulty in Multilingual Language Models

链接https://arxiv.org/abs/2601.12731

作者:Stefano Civelli,Pietro Bernardelle,Nicolò Brunello,Gianluca Demartini

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:training linear probes, Predicting problem-difficulty, training linear, refers to estimating, typically by training

备注

点击查看摘要

Abstract:Predicting problem-difficulty in large language models (LLMs) refers to estimating how difficult a task is according to the model itself, typically by training linear probes on its internal representations. In this work, we study the multilingual geometry of problem-difficulty in LLMs by training linear probes using the AMC subset of the Easy2Hard benchmark, translated into 21 languages. We found that difficulty-related signals emerge at two distinct stages of the model internals, corresponding to shallow (early-layers) and deep (later-layers) internal representations, that exhibit functionally different behaviors. Probes trained on deep representations achieve high accuracy when evaluated on the same language but exhibit poor cross-lingual generalization. In contrast, probes trained on shallow representations generalize substantially better across languages, despite achieving lower within-language performance. Together, these results suggest that LLMs first form a language-agnostic representation of problem difficulty, which subsequently becomes language-specific. This closely aligns with existing findings in LLM interpretability showing that models tend to operate in an abstract conceptual space before producing language-specific outputs. We demonstrate that this two-stage representational process extends beyond semantic content to high-level meta-cognitive properties such as problem-difficulty estimation.

144. 【2601.12698】A Two-Stage GPU Kernel Tuner Combining Semantic Refactoring and Search-Based Optimization

链接https://arxiv.org/abs/2601.12698

作者:Qiuyi Qu,Yicheng Sui,Yufei Sun,Rui Chen,Xiaofei Zhang,Yuzhi Zhang,Haofeng Wang,Ge Lan,Ning Zhang

类目:Computation and Language (cs.CL)

关键词:GPU code optimization, key performance bottleneck, bottleneck for HPC, GPU code, training and inference

备注

点击查看摘要

Abstract:GPU code optimization is a key performance bottleneck for HPC workloads as well as large-model training and inference. Although compiler optimizations and hand-written kernels can partially alleviate this issue, achieving near-hardware-limit performance still relies heavily on manual code refactoring and parameter tuning. Recent progress in LLM-agent-based kernel generation and optimization has been reported, yet many approaches primarily focus on direct code rewriting, where parameter choices are often implicit and hard to control, or require human intervention, leading to unstable performance gains. This paper introduces a template-based rewriting layer on top of an agent-driven iterative loop: kernels are semantically refactored into explicitly parameterizable templates, and template parameters are then optimized via search-based autotuning, yielding more stable and higher-quality speedups. Experiments on a set of real-world kernels demonstrate speedups exceeding 3x in the best case. We extract representative CUDA kernels from SGLang as evaluation targets; the proposed agentic tuner iteratively performs templating, testing, analysis, and planning, and leverages profiling feedback to execute constrained parameter search under hardware resource limits. Compared to agent-only direct rewriting, the template-plus-search design significantly reduces the randomness of iterative optimization, making the process more interpretable and enabling a more systematic approach toward high-performance configurations. The proposed method can be further extended to OpenCL, HIP, and other backends to deliver automated performance optimization for real production workloads.

145. 【2601.12696】UbuntuGuard: A Culturally-Grounded Policy Benchmark for Equitable AI Safety in African Languages

链接https://arxiv.org/abs/2601.12696

作者:Tassallah Abdullahi,Macton Mgonzo,Mardiyyah Oduwole,Paul Okewunmi,Abraham Owodunni,Ritambhara Singh,Carsten Eickhoff

类目:Computation and Language (cs.CL)

关键词:Current guardian models, predominantly Western-centric, Western-centric and optimized, Current guardian, leaving low-resource African

备注: 12 pages

点击查看摘要

Abstract:Current guardian models are predominantly Western-centric and optimized for high-resource languages, leaving low-resource African languages vulnerable to evolving harms, cross-lingual safety failures, and cultural misalignment. Moreover, most guardian models rely on rigid, predefined safety categories that fail to generalize across diverse linguistic and sociocultural contexts. Robust safety, therefore, requires flexible, runtime-enforceable policies and benchmarks that reflect local norms, harm scenarios, and cultural expectations. We introduce UbuntuGuard, the first African policy-based safety benchmark built from adversarial queries authored by 155 domain experts across sensitive fields, including healthcare. From these expert-crafted queries, we derive context-specific safety policies and reference responses that capture culturally grounded risk signals, enabling policy-aligned evaluation of guardian models. We evaluate 13 models, comprising six general-purpose LLMs and seven guardian models across three distinct variants: static, dynamic, and multilingual. Our findings reveal that existing English-centric benchmarks overestimate real-world multilingual safety, cross-lingual transfer provides partial but insufficient coverage, and dynamic models, while better equipped to leverage policies at inference time, still struggle to fully localize African-language contexts. These findings highlight the urgent need for multilingual, culturally grounded safety benchmarks to enable the development of reliable and equitable guardian models for low-resource languages. Our code can be found online.\footnote{Code repository available at this https URL.

146. 【2601.12658】Augmenting Question Answering with A Hybrid RAG Approach

链接https://arxiv.org/abs/2601.12658

作者:Tianyi Yang,Nashrah Haque,Vaishnave Jonnalagadda,Yuya Jeremy Ong,Zhehui Chen,Yanzhao Wu,Lei Yu,Divyesh Jadav,Wenqi Wei

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Retrieval-Augmented Generation, Large Language Models, Generation, powerful technique, RAG

备注: 10 pages, 5 tables, 2 figures; presented at IEEE CogMI 2025

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a powerful technique for enhancing the quality of responses in Question-Answering (QA) tasks. However, existing approaches often struggle with retrieving contextually relevant information, leading to incomplete or suboptimal answers. In this paper, we introduce Structured-Semantic RAG (SSRAG), a hybrid architecture that enhances QA quality by integrating query augmentation, agentic routing, and a structured retrieval mechanism combining vector and graph based techniques with context unification. By refining retrieval processes and improving contextual grounding, our approach improves both answer accuracy and informativeness. We conduct extensive evaluations on three popular QA datasets, TruthfulQA, SQuAD and WikiQA, across five Large Language Models (LLMs), demonstrating that our proposed approach consistently improves response quality over standard RAG implementations.

147. 【2601.12648】Intelligent Documentation in Medical Education: Can AI Replace Manual Case Logging?

链接https://arxiv.org/abs/2601.12648

作者:Nafiz Imtiaz Khan,Kylie Cleland,Vladimir Filkov,Roger Eric Goldman

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Procedural case logs, automate procedural case, Procedural case, core requirement, time-consuming to complete

备注: 51 pages, 12 figures, 8 tables. Feasibility study using retrospective radiology reports. Submitted to JAMIA Open (under review)

点击查看摘要

Abstract:Procedural case logs are a core requirement in radiology training, yet they are time-consuming to complete and prone to inconsistency when authored manually. This study investigates whether large language models (LLMs) can automate procedural case log documentation directly from free-text radiology reports. We evaluate multiple local and commercial LLMs under instruction-based and chain-of-thought prompting to extract structured procedural information from 414 curated interventional radiology reports authored by nine residents between 2018 and 2024. Model performance is assessed using sensitivity, specificity, and F1-score, alongside inference latency and token efficiency to estimate operational cost. Results show that both local and commercial models achieve strong extraction performance, with best F1-scores approaching 0.87, while exhibiting different trade-offs between speed and cost. Automation using LLMs has the potential to substantially reduce clerical burden for trainees and improve consistency in case logging. These findings demonstrate the feasibility of AI-assisted documentation in medical education and highlight the need for further validation across institutions and clinical workflows.

148. 【2601.12639】Objective Matters: Fine-Tuning Objectives Shape Safety, Robustness, and Persona Drift

链接https://arxiv.org/abs/2601.12639

作者:Daniel Vennemeyer,Punya Syon Pandey,Phan Anh Duong,Michael Umeokoli,Samuel Ratnam

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:outcomes remain limited, Odds Ratio Preference, Direct Preference Optimization, Ratio Preference Optimization, Preference Optimization

备注

点击查看摘要

Abstract:Fine-tuning LLMs on benign data can still degrade alignment and adversarial robustness, yet direct analysis of the role of fine-tuning objectives in shaping these safety outcomes remain limited. We present a controlled comparison of six fine-tuning objectives -- Supervised Fine-Tuning, Direct Preference Optimization, Conditional Fine-Tuning, Inoculation Prompting, Odds Ratio Preference Optimization, and KL-regularized fine-tuning -- holding data, domain, architecture, and optimization fixed. Across closed-form reasoning and open-ended generation tasks, we find that objective choice induces systematic, scale-dependent shifts along the safety-capability frontier. At small training budgets, robustness is similar across objectives but capability differs. At larger budgets, objectives diverge sharply: supervised and preference-based tuning tightly couple capability gains to increased adversarial vulnerability and persona drift, while objectives that constrain learning signals -- especially ORPO and KL-regularization -- substantially mitigate both. Fine-tuning objectives therefore matter little for safety at small scales but become a primary driver of adversarial robustness and latent persona stability as training scale increases.

149. 【2601.12632】BioPulse-QA: A Dynamic Biomedical Question-Answering Benchmark for Evaluating Factuality, Robustness, and Bias in Large Language Models

链接https://arxiv.org/abs/2601.12632

作者:Kriti Bhattarai,Vipina K. Keloth,Donald Wright,Andrew Loza,Yang Ren,Hua Xu

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Large language models, supporting model development, existing benchmark datasets, Large language, development and evaluation

备注

点击查看摘要

Abstract:Objective: Large language models (LLMs) are increasingly applied in biomedical settings, and existing benchmark datasets have played an important role in supporting model development and evaluation. However, these benchmarks often have limitations. Many rely on static or outdated datasets that fail to capture the dynamic, context-rich, and high-stakes nature of biomedical knowledge. They also carry increasing risk of data leakage due to overlap with model pretraining corpora and often overlook critical dimensions such as robustness to linguistic variation and potential demographic biases. Materials and Methods: To address these gaps, we introduce BioPulse-QA, a benchmark that evaluates LLMs on answering questions from newly published biomedical documents including drug labels, trial protocols, and clinical guidelines. BioPulse-QA includes 2,280 expert-verified question answering (QA) pairs and perturbed variants, covering both extractive and abstractive formats. We evaluate four LLMs - GPT-4o, GPT-o1, Gemini-2.0-Flash, and LLaMA-3.1 8B Instruct - released prior to the publication dates of the benchmark documents. Results: GPT-o1 achieves the highest relaxed F1 score (0.92), followed by Gemini-2.0-Flash (0.90) on drug labels. Clinical trials are the most challenging source, with extractive F1 scores as low as 0.36. Discussion and Conclusion: Performance differences are larger for paraphrasing than for typographical errors, while bias testing shows negligible differences. BioPulse-QA provides a scalable and clinically relevant framework for evaluating biomedical LLMs.

Subjects:

Computation and Language (cs.CL); Information Retrieval (cs.IR)

Cite as:
arXiv:2601.12632 [cs.CL]

(or
arXiv:2601.12632v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2601.12632

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Kriti Bhattarai [view email] [v1]
Mon, 19 Jan 2026 00:38:33 UTC (3,169 KB)

150. 【2601.12618】Disagreement as Data: Reasoning Trace Analytics in Multi-Agent Systems

链接https://arxiv.org/abs/2601.12618

作者:Elham Tajik,Conrad Borchers,Bahar Shahrokhian,Sebastian Simon,Ali Keramati,Sonika Pal,Sreecharan Sankaranarayanan

类目:Computation and Language (cs.CL)

关键词:understand learning processes, Learning analytics researchers, learning processes, understand learning, analyze qualitative student

备注: LAK 2026 conference paper, 7 pages

点击查看摘要

Abstract:Learning analytics researchers often analyze qualitative student data such as coded annotations or interview transcripts to understand learning processes. With the rise of generative AI, fully automated and human-AI workflows have emerged as promising methods for analysis. However, methodological standards to guide such workflows remain limited. In this study, we propose that reasoning traces generated by large language model (LLM) agents, especially within multi-agent systems, constitute a novel and rich form of process data to enhance interpretive practices in qualitative coding. We apply cosine similarity to LLM reasoning traces to systematically detect, quantify, and interpret disagreements among agents, reframing disagreement as a meaningful analytic signal. Analyzing nearly 10,000 instances of agent pairs coding human tutoring dialog segments, we show that LLM agents' semantic reasoning similarity robustly differentiates consensus from disagreement and correlates with human coding reliability. Qualitative analysis guided by this metric reveals nuanced instructional sub-functions within codes and opportunities for conceptual codebook refinement. By integrating quantitative similarity metrics with qualitative review, our method has the potential to improve and accelerate establishing inter-rater reliability during coding by surfacing interpretive ambiguity, especially when LLMs collaborate with humans. We discuss how reasoning-trace disagreements represent a valuable new class of analytic signals advancing methodological rigor and interpretive depth in educational research.

151. 【2601.12607】A Cloud-based Multi-Agentic Workflow for Science

链接https://arxiv.org/abs/2601.12607

作者:Anurag Acharya,Timothy Vega,Rizwan A. Ashraf,Anshu Sharma,Derek Parker,Robert Rallo

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, make complex decisions, complex decisions limits, Language Models

备注

点击查看摘要

Abstract:As Large Language Models (LLMs) become ubiquitous across various scientific domains, their lack of ability to perform complex tasks like running simulations or to make complex decisions limits their utility. LLM-based agents bridge this gap due to their ability to call external resources and tools and thus are now rapidly gaining popularity. However, coming up with a workflow that can balance the models, cloud providers, and external resources is very challenging, making implementing an agentic system more of a hindrance than a help. In this work, we present a domain-agnostic, model-independent workflow for an agentic framework that can act as a scientific assistant while being run entirely on cloud. Built with a supervisor agent marshaling an array of agents with individual capabilities, our framework brings together straightforward tasks like literature review and data analysis with more complex ones like simulation runs. We describe the framework here in full, including a proof-of-concept system we built to accelerate the study of Catalysts, which is highly important in the field of Chemistry and Material Science. We report the cost to operate and use this framework, including the breakdown of the cost by services use. We also evaluate our system on a custom-curated synthetic benchmark and a popular Chemistry benchmark, and also perform expert validation of the system. The results show that our system is able to route the task to the correct agent 90% of the time and successfully complete the assigned task 97.5% of the time for the synthetic tasks and 91% of the time for real-world tasks, while still achieving better or comparable accuracy to most frontier models, showing that this is a viable framework for other scientific domains to replicate.

152. 【2601.12600】SSVD-O: Parameter-Efficient Fine-Tuning with Structured SVD for Speech Recognition

链接https://arxiv.org/abs/2601.12600

作者:Pu Wang,Shinji Watanabe,Hugo Van hamme

类目:ound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

关键词:adapting large speech, large speech foundation, Parameter-efficient fine-tuning, approach for adapting, adapting large

备注: Accepted by IEEE ICASSP 2026

点击查看摘要

Abstract:Parameter-efficient fine-tuning (PEFT) is a scalable approach for adapting large speech foundation models to new domains. While methods such as LoRA and its state-of-the-art variants reduce adaptation costs, they typically allocate parameters uniformly across model subspaces, which limits their efficiency and scalability in speech applications. Building on our prior work, this paper introduces SSVD-Outer (SSVD-O), an extension of the structured SVD-guided (SSVD) fine-tuning method. SSVD-O combines input acoustic feature space-associated inner transformations with output semantic feature space-associated outer transformations to enable scalable and balanced adaptation. We conduct the first systematic analysis of parameter budget allocation across model subspaces in PEFT for automatic speech recognition (ASR), and investigate the trade-off between learning and forgetting under constrained resources. SSVD-O is benchmarked against LoRA, DoRA, PiSSA, and SSVD on domain-shifted ASR tasks, including child speech and regional accents, across model scales from 0.1B to 2B within the ESPnet framework. Experimental results show that SSVD-O consistently narrows the performance gap to full fine-tuning while improving generalization and mitigating catastrophic forgetting.

153. 【2601.12598】Dissecting Linear Recurrent Models: How Different Gating Strategies Drive Selectivity and Generalization

链接https://arxiv.org/abs/2601.12598

作者:Younes Bouhadjar,Maxime Fabre,Felix Schmidt,Emre Neftci

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:original Transformer softmax, highly parallelizable training, original Transformer, recurrent neural networks, linear recurrent models

备注: 11 pages, 4 figures and 4 tables

点击查看摘要

Abstract:Linear recurrent neural networks have emerged as efficient alternatives to the original Transformer's softmax attention mechanism, thanks to their highly parallelizable training and constant memory and computation requirements at inference. Iterative refinements of these models have introduced an increasing number of architectural mechanisms, leading to increased complexity and computational costs. Nevertheless, systematic direct comparisons among these models remain limited. Existing benchmark tasks are either too simplistic to reveal substantial differences or excessively resource-intensive for experimentation. In this work, we propose a refined taxonomy of linear recurrent models and introduce SelectivBench, a set of lightweight and customizable synthetic benchmark tasks for systematically evaluating sequence models. SelectivBench specifically evaluates selectivity in sequence models at small to medium scale, such as the capacity to focus on relevant inputs while ignoring context-based distractors. It employs rule-based grammars to generate sequences with adjustable complexity, incorporating irregular gaps that intentionally violate transition rules. Evaluations of linear recurrent models on SelectivBench reveal performance patterns consistent with results from large-scale language tasks. Our analysis clarifies the roles of essential architectural features: gating and rapid forgetting mechanisms facilitate recall, in-state channel mixing is unnecessary for selectivity, but critical for generalization, and softmax attention remains dominant due to its memory capacity scaling with sequence length. Our benchmark enables targeted, efficient exploration of linear recurrent models and provides a controlled setting for studying behaviors observed in large-scale evaluations. Code is available at this https URL

154. 【2601.12555】Evaluating Contextually Mediated Factual Recall in Multilingual Large Language Models

链接https://arxiv.org/abs/2601.12555

作者:Yihong Liu,Bingyu Xiong,Hinrich Schütze

类目:Computation and Language (cs.CL)

关键词:Large language, Large language models, wide range, Large, factual recall

备注: preprint

点击查看摘要

Abstract:Large language models (LLMs) can recall a wide range of factual knowledge across languages. However, existing factual recall evaluations primarily assess fact retrieval in isolation, where the queried entity is explicitly named and the fact is requested directly. In natural language use, facts are often accessed through context, where the relevant entity is introduced only indirectly. In this work, we study contextually mediated factual recall, asking whether LLMs can reliably retrieve factual knowledge when the target entity is embedded in a naturalistic context rather than queried explicitly, across languages. We construct controlled prompts that preserve the underlying fact while introducing referential mediation through contextual sentences. To disentangle contextual effects from name-specific associations, we further compare performance using synthetic names and real names across languages. Evaluating multiple model families in five languages, we find that contextual mediation consistently degrades factual recall, with substantial variation across relations. Larger models are more robust to contextual mediation, exhibiting a reduced performance gap relative to direct queries, while the effect of real names and name origin is mixed and unsystematic. These findings highlight a gap between isolated factual recall and context-dependent language understanding in multilingual LLMs.

155. 【2601.12549】Benchmarking Concept-Spilling Across Languages in LLMs

链接https://arxiv.org/abs/2601.12549

作者:Ilia Badanin,Daniil Dzenhaliou,Imanol Schlag

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:exhibit remarkable cross-lingual, remarkable cross-lingual abilities, Multilingual Large Language, Large Language Models, Multilingual Large

备注

点击查看摘要

Abstract:Multilingual Large Language Models (LLMs) exhibit remarkable cross-lingual abilities, yet often exhibit a systematic bias toward the representations from other languages, resulting in semantic interference when generating content in non-English languages$-$a phenomenon we define as language spilling. This paper presents a novel comparative framework for evaluating multilingual semantic robustness by systematically measuring how models handle polysemous words across languages. Our methodology provides a relative measure of model performance: when required to generate exactly five meanings, both strong and weak models may resort to meanings from dominant languages, but semantically stronger models do so later in the generation sequence, producing more true meanings from the target language before failing, while weaker models resort to dominant-language meanings earlier in the sequence. We evaluate a diverse set of open and closed multilingual LLMs using a structured meaning generation task across nine languages, employing a carefully curated benchmark of 100 high-polysemy English words. Our findings reveal significant variation in semantic robustness across both models and languages, providing a principled ranking system for model comparison without requiring definitive causal attribution of error sources. We contribute both a scalable comparative benchmark for multilingual semantic evaluation and a rigorous validation pipeline$-$critical tools for developing more linguistically balanced AI systems.

156. 【2601.12539】MemeLens: Multilingual Multitask VLMs for Memes

链接https://arxiv.org/abs/2601.12539

作者:Ali Ezzat Shahroor,Mohamed Bayan Kmainasi,Abul Hasnat,Dimitar Dimitrov,Giovanni Da San Martino,Preslav Nakov,Firoj Alam

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:embedded text, cultural context, dominant medium, medium for online, online communication

备注: disinformation, misinformation, factuality, harmfulness, fake news, propaganda, hateful meme, multimodality, text, images

点击查看摘要

Abstract:Memes are a dominant medium for online communication and manipulation because meaning emerges from interactions between embedded text, imagery, and cultural context. Existing meme research is distributed across tasks (hate, misogyny, propaganda, sentiment, humour) and languages, which limits cross-domain generalization. To address this gap we propose MemeLens, a unified multilingual and multitask explanation-enhanced Vision Language Model (VLM) for meme understanding. We consolidate 38 public meme datasets, filter and map dataset-specific labels into a shared taxonomy of $20$ tasks spanning harm, targets, figurative/pragmatic intent, and affect. We present a comprehensive empirical analysis across modeling paradigms, task categories, and datasets. Our findings suggest that robust meme understanding requires multimodal training, exhibits substantial variation across semantic categories, and remains sensitive to over-specialization when models are fine-tuned on individual datasets rather than trained in a unified setting. We will make the experimental resources and datasets publicly available for the community.

157. 【2601.12538】Agentic Reasoning for Large Language Models

链接https://arxiv.org/abs/2601.12538

作者:Tianxin Wei,Ting-Wei Li,Zhining Liu,Xuying Ning,Ze Yang,Jiaru Zou,Zhichen Zeng,Ruizhong Qiu,Xiao Lin,Dongqi Fu,Zihao Li,Mengting Ai,Duo Zhou,Wenxuan Bao,Yunzhe Li,Gaotang Li,Cheng Qian,Yu Wang,Xiangru Tang,Yin Xiao,Liri Fang,Hui Liu,Xianfeng Tang,Yuji Zhang,Chi Wang,Jiaxuan You,Heng Ji,Hanghang Tong,Jingrui He

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:process underlying inference, fundamental cognitive process, cognitive process underlying, Agentic reasoning, Reasoning

备注: Project: [this https URL](https://github.com/weitianxin/Awesome-Agentic-Reasoning)

点击查看摘要

Abstract:Reasoning is a fundamental cognitive process underlying inference, problem-solving, and decision-making. While large language models (LLMs) demonstrate strong reasoning capabilities in closed-world settings, they struggle in open-ended and dynamic environments. Agentic reasoning marks a paradigm shift by reframing LLMs as autonomous agents that plan, act, and learn through continual interaction. In this survey, we organize agentic reasoning along three complementary dimensions. First, we characterize environmental dynamics through three layers: foundational agentic reasoning, which establishes core single-agent capabilities including planning, tool use, and search in stable environments; self-evolving agentic reasoning, which studies how agents refine these capabilities through feedback, memory, and adaptation; and collective multi-agent reasoning, which extends intelligence to collaborative settings involving coordination, knowledge sharing, and shared goals. Across these layers, we distinguish in-context reasoning, which scales test-time interaction through structured orchestration, from post-training reasoning, which optimizes behaviors via reinforcement learning and supervised fine-tuning. We further review representative agentic reasoning frameworks across real-world applications and benchmarks, including science, robotics, healthcare, autonomous research, and mathematics. This survey synthesizes agentic reasoning methods into a unified roadmap bridging thought and action, and outlines open challenges and future directions, including personalization, long-horizon interaction, world modeling, scalable multi-agent training, and governance for real-world deployment.

158. 【2601.12535】Improving Low-Resource Machine Translation via Round-Trip Reinforcement Learning

链接https://arxiv.org/abs/2601.12535

作者:Ahmed Attia,Alham Fikri

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:gained increasing attention, low-resource language communities, Low-resource machine translation, communities is collected, remain unexplored

备注

点击查看摘要

Abstract:Low-resource machine translation (MT) has gained increasing attention as parallel data from low-resource language communities is collected, but many potential methods for improving low-resource MT remain unexplored. We investigate a self-supervised reinforcement-learning-based fine-tuning for translation in low-resource settings using round-trip bootstrapping with the No Language Left Behind (NLLB) family of models. Our approach translates English into a target low-resource language and then back into English, using a combination of chrF++ and BLEU as the reward function on the reconstructed English sentences. Using the NLLB-MD dataset, we evaluate both the 600M and 1.3B parameter NLLB models and observe consistent improvements for the following languages: Central Aymara, Friulian, Wolof and Russian. Qualitative inspection of translation outputs indicates increased fluency and semantic fidelity. We argue that our method can further benefit from scale, enabling models to increasingly leverage their pretrained knowledge and continue self-improving.

159. 【2601.12505】DoPE: Decoy Oriented Perturbation Encapsulation Human-Readable, AI-Hostile Documents for Academic Integrity

链接https://arxiv.org/abs/2601.12505

作者:Ashish Raj Shekhar,Shiven Agarwal,Priyanuj Bordoloi,Yash Shah,Tejas Anvekar,Vivek Gupta

类目:Computation and Language (cs.CL)

关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, Large Language

备注

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) can directly consume exam documents, threatening conventional assessments and academic integrity. We present DoPE (Decoy-Oriented Perturbation Encapsulation), a document-layer defense framework that embeds semantic decoys into PDF/HTML assessments to exploit render-parse discrepancies in MLLM pipelines. By instrumenting exams at authoring time, DoPE provides model-agnostic prevention (stop or confound automated solving) and detection (flag blind AI reliance) without relying on conventional one-shot classifiers. We formalize prevention and detection tasks, and introduce FewSoRT-Q, an LLM-guided pipeline that generates question-level semantic decoys and FewSoRT-D to encapsulate them into watermarked documents. We evaluate on Integrity-Bench, a novel benchmark of 1826 exams (PDF+HTML) derived from public QA datasets and OpenCourseWare. Against black-box MLLMs from OpenAI and Anthropic, DoPE yields strong empirical gains: a 91.4% detection rate at an 8.7% false-positive rate using an LLM-as-Judge verifier, and prevents successful completion or induces decoy-aligned failures in 96.3% of attempts. We release Integrity-Bench, our toolkit, and evaluation code to enable reproducible study of document-layer defenses for academic integrity.

160. 【2601.12494】Harmonizing the Arabic Audio Space with Data Scheduling

链接https://arxiv.org/abs/2601.12494

作者:Hunzalah Hassan Bhatti,Firoj Alam,Shammur Absar Chowdhury

类目:ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

关键词:dialect-rich settings remains, large language models, settings remains underexplored, Arabic-centric audio LLM, Audio large language

备注: Foundation Models, Large Language Models, Native, Speech Models, Arabic

点击查看摘要

Abstract:Audio large language models (LLMs) enable unified speech understanding and generation, yet their adaptation to linguistically complex, dialect-rich settings remains underexplored. This paper presents the first systematic study of multi-task instruction tuning for an Arabic-centric audio LLM, covering a hierarchy of generative tasks (ASR, speech summarization) and discriminative tasks (dialect and emotion identification). To support this study, we introduce AraMega-SSum, a novel dataset for Arabic speech summarization. We fine-tune Qwen2.5-Omni (7B) and propose Task-Progressive Curriculum (TPC) along with Aligner-Based Diverse Sampling (ADS), a strategy that constructs information-dense batches by selecting task- and label-balanced examples. Our results reveal a critical efficiency, robustness trade-off: while ADS accelerates initial convergence and boosts paralinguistic F1-scores, its inherent gradient volatility can destabilize generative decoding under prolonged training. Furthermore, while the TPC stabilizes core acoustic mapping, it often induces negative transfer in downstream tasks. We demonstrate that a Hybrid TPC+ADS Strategy provides an optimal training ``recipe'', first establishing a robust representative foundation before employing diversity-aware refinement to capture fine-grained nuances. These findings offer practical guidance for the efficient adaptation of Omni-models in complex, low-resource multimodal environments.

161. 【2601.12473】Capability-Aware Early-Stage Research Idea Evaluation

链接https://arxiv.org/abs/2601.12473

作者:Renlong Jie,Chen Chu,Zhen Wang

类目:Computation and Language (cs.CL)

关键词:holds great potential, conceptual stage, holds great, research ideas, great potential

备注

点击查看摘要

Abstract:Predicting the outcomes of research ideas at their conceptual stage (i.e. before significant resources are committed) holds great potential for optimizing scientific resource allocation and research planning. While existing methods rely heavily on finished manuscripts or peer reviews, we propose a novel capability-aware framework that predicts paper acceptance and ratings using only author information and research ideas, without requiring full text or experimental results. Our approach integrates author information, (inferred) capability presentation, and research ideas through a three-way transformer architecture with flexible fusion mechanisms. We also introduce a two-stage architecture for learning the capability representation given the author information and idea. Experiments show that our method significantly outperform the single-way models by finetuning bert-base and bert-large, and the capability predicting significantly increase the predictive accuracy of the final model. The proposed method can be applied in both early-stage research outcome prediction and scientific resource allocation.

162. 【2601.12471】Knowing When to Abstain: Medical LLMs Under Clinical Uncertainty

链接https://arxiv.org/abs/2601.12471

作者:Sravanthi Machcha,Sushrita Yerra,Sahil Gupta,Aishwarya Sahoo,Sharmin Sultana,Hong Yu,Zonghai Yao

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:overwhelmingly prioritizes accuracy, Current evaluation, large language models, overwhelmingly prioritizes, prioritizes accuracy

备注: Equal contribution for the first two authors; To appear in proceedings of the Main Conference of the European Chapter of the Association for Computational Linguistics (EACL) 2026

点击查看摘要

Abstract:Current evaluation of large language models (LLMs) overwhelmingly prioritizes accuracy; however, in real-world and safety-critical applications, the ability to abstain when uncertain is equally vital for trustworthy deployment. We introduce MedAbstain, a unified benchmark and evaluation protocol for abstention in medical multiple-choice question answering (MCQA) -- a discrete-choice setting that generalizes to agentic action selection -- integrating conformal prediction, adversarial question perturbations, and explicit abstention options. Our systematic evaluation of both open- and closed-source LLMs reveals that even state-of-the-art, high-accuracy models often fail to abstain with uncertain. Notably, providing explicit abstention options consistently increases model uncertainty and safer abstention, far more than input perturbations, while scaling model size or advanced prompting brings little improvement. These findings highlight the central role of abstention mechanisms for trustworthy LLM deployment and offer practical guidance for improving safety in high-stakes applications.

163. 【2601.12465】Incentivizing In-depth Reasoning over Long Contexts with Process Advantage Shaping

链接https://arxiv.org/abs/2601.12465

作者:Miao Peng,Weizhou Shen,Nuo Chen,Chenliang Li,Ming Yan,Jia Li

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Verifiable Rewards, enhancing LLMs short-context, Reinforcement Learning, robust long-range reasoning, performance degrades

备注

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective in enhancing LLMs short-context reasoning, but its performance degrades in long-context scenarios that require both precise grounding and robust long-range reasoning. We identify the "almost-there" phenomenon in long-context reasoning, where trajectories are largely correct but fail at the final step, and attribute this failure to two factors: (1) the lack of high reasoning density in long-context QA data that push LLMs beyond mere grounding toward sophisticated multi-hop reasoning; and (2) the loss of valuable learning signals during long-context RL training due to the indiscriminate penalization of partially correct trajectories with incorrect outcomes. To overcome this bottleneck, we propose DeepReasonQA, a KG-driven synthesis framework that controllably constructs high-difficulty, multi-hop long-context QA pairs with inherent reasoning chains. Building on this, we introduce Long-context Process Advantage Shaping (LongPAS), a simple yet effective method that performs fine-grained credit assignment by evaluating reasoning steps along Validity and Relevance dimensions, which captures critical learning signals from "almost-there" trajectories. Experiments on three long-context reasoning benchmarks show that our approach substantially outperforms RLVR baselines and matches frontier LLMs while using far fewer parameters. Further analysis confirms the effectiveness of our methods in strengthening long-context reasoning while maintaining stable RL training.

164. 【2601.12447】Privacy-Preserving Federated Learning with Verifiable Fairness Guarantees

链接https://arxiv.org/abs/2601.12447

作者:Mohammed Himayath Ali,Mohammed Aqib Abdullah,Syed Muneer Hussin,Mohammed Mudassir Uddin,Shahnawaz Alam

类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:remains fundamentally unresolved, centralizing sensitive data, collaborative model training, heterogeneous data distributions, enables collaborative model

备注

点击查看摘要

Abstract:Federated learning enables collaborative model training across distributed institutions without centralizing sensitive data; however, ensuring algorithmic fairness across heterogeneous data distributions while preserving privacy remains fundamentally unresolved. This paper introduces CryptoFair-FL, a novel cryptographic framework providing the first verifiable fairness guarantees for federated learning systems under formal security definitions. The proposed approach combines additively homomorphic encryption with secure multi-party computation to enable privacy-preserving verification of demographic parity and equalized odds metrics without revealing protected attribute distributions or individual predictions. A novel batched verification protocol reduces computational complexity from BigO(n^2) to BigO(n \log n) while maintaining (\dparam, \deltap)-differential privacy with dparam = 0.5 and deltap = 10^{-6}. Theoretical analysis establishes information-theoretic lower bounds on the privacy cost of fairness verification, demonstrating that the proposed protocol achieves near-optimal privacy-fairness tradeoffs. Comprehensive experiments across four benchmark datasets (MIMIC-IV healthcare records, Adult Income, CelebA, and a novel FedFair-100 benchmark) demonstrate that CryptoFair-FL reduces fairness violations from 0.231 to 0.031 demographic parity difference while incurring only 2.3 times computational overhead compared to standard federated averaging. The framework successfully defends against attribute inference attacks, maintaining adversarial success probability below 0.05 across all tested configurations. These results establish a practical pathway for deploying fairness-aware federated learning in regulated industries requiring both privacy protection and algorithmic accountability.

165. 【2601.12430】System-Mediated Attention Imbalances Make Vision-Language Models Say Yes

链接https://arxiv.org/abs/2601.12430

作者:Tsan Tsai Chan,Varsha Suresh,Anisha Saha,Michael Hahn,Vera Demberg

类目:Computation and Language (cs.CL)

关键词:Vision-language model, commonly linked, linked to imbalanced, imbalanced allocation, Vision-language

备注: Under review

点击查看摘要

Abstract:Vision-language model (VLM) hallucination is commonly linked to imbalanced allocation of attention across input modalities: system, image and text. However, existing mitigation strategies tend towards an image-centric interpretation of these imbalances, often prioritising increased image attention while giving less consideration to the roles of the other modalities. In this study, we evaluate a more holistic, system-mediated account, which attributes these imbalances to functionally redundant system weights that reduce attention to image and textual inputs. We show that this framework offers a useful empirical perspective on the yes-bias, a common form of hallucination in which VLMs indiscriminately respond 'yes'. Causally redistributing attention from the system modality to image and textual inputs substantially suppresses this bias, often outperforming existing approaches. We further present evidence suggesting that system-mediated attention imbalances contribute to the yes-bias by encouraging a default reliance on coarse input representations, which are effective for some tasks but ill-suited to others. Taken together, these findings firmly establish system attention as a key factor in VLM hallucination and highlight its potential as a lever for mitigation.

166. 【2601.12419】Legal experts disagree with rationale extraction techniques for explaining ECtHR case outcome classification

链接https://arxiv.org/abs/2601.12419

作者:Mahammad Namazov,Tomáš Koref,Ivan Habernal

类目:Computation and Language (cs.CL)

关键词:large language models, trust and transparency, critical for applications, applications of large, large language

备注

点击查看摘要

Abstract:Interpretability is critical for applications of large language models in the legal domain which requires trust and transparency. While some studies develop task-specific approaches, other use the classification model's parameters to explain the decisions. However, which technique explains the legal outcome prediction best remains an open question. To address this challenge, we propose a comparative analysis framework for model-agnostic interpretability techniques. Among these, we employ two rationale extraction methods, which justify outcomes with human-interpretable and concise text fragments (i.e., rationales) from the given input text. We conduct comparison by evaluating faithfulness-via normalized sufficiency and comprehensiveness metrics along with plausibility-by asking legal experts to evaluate extracted rationales. We further assess the feasibility of LLM-as-a-Judge using legal expert evaluation results. We show that the model's "reasons" for predicting a violation differ substantially from those of legal experts, despite highly promising quantitative analysis results and reasonable downstream classification performance. The source code of our experiments is publicly available at this https URL.

167. 【2601.12407】De-Anonymization at Scale via Tournament-Style Attribution

链接https://arxiv.org/abs/2601.12407

作者:Lirui Zhang,Huishuai Zhang

类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:LLMs rapidly advance, increasingly important, rapidly advance, advance and enter, enter real-world

备注: 14 pages

点击查看摘要

Abstract:As LLMs rapidly advance and enter real-world use, their privacy implications are increasingly important. We study an authorship de-anonymization threat: using LLMs to link anonymous documents to their authors, potentially compromising settings such as double-blind peer review. We propose De-Anonymization at Scale (DAS), a large language model-based method for attributing authorship among tens of thousands of candidate texts. DAS uses a sequential progression strategy: it randomly partitions the candidate corpus into fixed-size groups, prompts an LLM to select the text most likely written by the same author as a query text, and iteratively re-queries the surviving candidates to produce a ranked top-k list. To make this practical at scale, DAS adds a dense-retrieval prefilter to shrink the search space and a majority-voting style aggregation over multiple independent runs to improve robustness and ranking precision. Experiments on anonymized review data show DAS can recover same-author texts from pools of tens of thousands with accuracy well above chance, demonstrating a realistic privacy risk for anonymous platforms. On standard authorship benchmarks (Enron emails and blog posts), DAS also improves both accuracy and scalability over prior approaches, highlighting a new LLM-enabled de-anonymization vulnerability.

Comments:
14 pages

Subjects:

Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)

Cite as:
arXiv:2601.12407 [cs.CR]

(or
arXiv:2601.12407v1 [cs.CR] for this version)

https://doi.org/10.48550/arXiv.2601.12407

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
168. 【2601.12389】NADIR: Differential Attention Flow for Non-Autoregressive Transliteration in Indic Languages

链接https://arxiv.org/abs/2601.12389

作者:Lakshya Tomar,Vinayak Abrol,Puneet Agarwal

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:strong inductive biases, biases of autoregressive, require the strong, strong inductive, inductive biases

备注: Accepted at the AAAI Conference on Artificial Intelligence (AAAI 2026)

点击查看摘要

Abstract:In this work, we argue that not all sequence-to-sequence tasks require the strong inductive biases of autoregressive (AR) models. Tasks like multilingual transliteration, code refactoring, grammatical correction or text normalization often rely on local dependencies where the full modeling capacity of AR models can be overkill, creating a trade-off between their high accuracy and high inference latency. While non-autoregressive (NAR) models offer speed, they typically suffer from hallucinations and poor length control. To explore this trade-off, we focus on the multilingual transliteration task in Indic languages and introduce NADIR, a novel NAR architecture designed to strike a balance between speed and accuracy. NADIR integrates a Differential Transformer and a Mixture-of-Experts mechanism, enabling it to robustly model complex character mappings without sequential dependencies. NADIR achieves over a 13x speed-up compared to the state-of-the-art AR baseline. It maintains a competitive mean Character Error Rate of 15.78%, compared to 14.44% for the AR model and 21.88% for a standard NAR equivalent. Importantly, NADIR reduces Repetition errors by 49.53%, Substitution errors by 24.45%, Omission errors by 32.92%, and Insertion errors by 16.87%. This work provides a practical blueprint for building fast and reliable NAR systems, effectively bridging the gap between AR accuracy and the demands of real-time, large-scale deployment.

169. 【2601.12376】LR-DWM: Efficient Watermarking for Diffusion Language Models

链接https://arxiv.org/abs/2601.12376

作者:Ofek Raban,Ethan Fetaya,Gal Chechik

类目:Computation and Language (cs.CL)

关键词:attributing AI-generated content, Large Language Models, Language Models, Diffusion Language Models, AI-generated content

备注: Submitted to ACL Rolling Review (ARR). 7 pages, 4 figures

点击查看摘要

Abstract:Watermarking (WM) is a critical mechanism for detecting and attributing AI-generated content. Current WM methods for Large Language Models (LLMs) are predominantly tailored for autoregressive (AR) models: They rely on tokens being generated sequentially, and embed stable signals within the generated sequence based on the previously sampled text. Diffusion Language Models (DLMs) generate text via non-sequential iterative denoising, which requires significant modification to use WM methods designed for AR models. Recent work proposed to watermark DLMs by inverting the process when needed, but suffers significant computational or memory overhead. We introduce Left-Right Diffusion Watermarking (LR-DWM), a scheme that biases the generated token based on both left and right neighbors, when they are available. LR-DWM incurs minimal runtime and memory overhead, remaining close to the non-watermarked baseline DLM while enabling reliable statistical detection under standard evaluation settings. Our results demonstrate that DLMs can be watermarked efficiently, achieving high detectability with negligible computational and memory overhead.

170. 【2601.12374】A Scalable Entity-Based Framework for Auditing Bias in LLMs

链接https://arxiv.org/abs/2601.12374

作者:Akram Elbouanani,Aboubacar Tuo,Adrian Popescu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:trade ecological validity, poorly reflect real-world, Existing approaches, trade ecological, statistical control

备注

点击查看摘要

Abstract:Existing approaches to bias evaluation in large language models (LLMs) trade ecological validity for statistical control, relying on artificial prompts that poorly reflect real-world use, or on naturalistic tasks that lack scale and rigor. We introduce a scalable bias-auditing framework using named entities as probes to measure structural disparities in model behavior. We show that synthetic data reliably reproduces bias patterns observed in natural text, enabling large-scale analysis. Using this approach, we conduct the largest bias audit to date, comprising 1.9 billion data points across multiple entity types, tasks, languages, models, and prompting strategies. Our results reveal systematic biases: models penalize right-wing politicians, favor left-wing politicians, prefer Western and wealthy nations over the Global South, favor Western companies, and penalize firms in the defense and pharmaceutical sectors. While instruction tuning reduces bias, increasing model scale amplifies it, and prompting in Chinese or Russian does not attenuate Western-aligned preferences. These results indicate that LLMs should undergo rigorous auditing before deployment in high-stakes applications.

171. 【2601.12369】Can Deep Research Agents Find and Organize? Evaluating the Synthesis Gap with Expert Taxonomies

链接https://arxiv.org/abs/2601.12369

作者:Ming Zhang,Jiabao Zhuang,Wenqing Jing,Ziyu Kong,Jingyi Deng,Yujiong Shen,Kexin Tan,Yuhang Zhao,Ning Luo,Renzhe Zheng,Jiahui Lin,Mingqi Wu,Long Ma,Yi Zou,Shihan Dou,Tao Gui,Qi Zhang,Xuanjing Huang

类目:Computation and Language (cs.CL)

关键词:Deep Research, automated survey generation, Deep Research Agents, Research, Research Agents

备注

点击查看摘要

Abstract:Deep Research Agents are increasingly used for automated survey generation. However, whether they can write surveys like human experts remains unclear. Existing benchmarks focus on fluency or citation accuracy, but none evaluates the core capabilities: retrieving essential papers and organizing them into coherent knowledge structures. We introduce TaxoBench, a diagnostic benchmark derived from 72 highly-cited computer science surveys. We manually extract expert-authored taxonomy trees containing 3,815 precisely categorized citations as ground truth. Our benchmark supports two evaluation modes: Deep Research mode tests end-to-end retrieval and organization given only a topic, while Bottom-Up mode isolates structuring capability by providing the exact papers human experts used. We evaluate 7 leading Deep Research agents and 12 frontier LLMs. Results reveal a dual bottleneck: the best agent recalls only 20.9% of expert-selected papers, and even with perfect input, the best model achieves only 0.31 ARI in organization. Current deep research agents remain far from expert-level survey writing. Our benchmark is publicly available at this https URL.

172. 【2601.12359】Zero-Shot Embedding Drift Detection: A Lightweight Defense Against Prompt Injections in LLMs

链接https://arxiv.org/abs/2601.12359

作者:Anirudh Sekar,Mrinal Agarwal,Rachel Sharma,Akitsugu Tanaka,Jasmine Zhang,Arjun Damerla,Kevin Zhu

类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)

关键词:circumvent alignment safeguards, prompts exploit indirect, adversarial prompts exploit, exploit indirect input, indirect input channels

备注: Accepted to NeurIPS 2025 Lock-LLM Workshop

点击查看摘要

Abstract:Prompt injection attacks have become an increasing vulnerability for LLM applications, where adversarial prompts exploit indirect input channels such as emails or user-generated content to circumvent alignment safeguards and induce harmful or unintended outputs. Despite advances in alignment, even state-of-the-art LLMs remain broadly vulnerable to adversarial prompts, underscoring the urgent need for robust, productive, and generalizable detection mechanisms beyond inefficient, model-specific patches. In this work, we propose Zero-Shot Embedding Drift Detection (ZEDD), a lightweight, low-engineering-overhead framework that identifies both direct and indirect prompt injection attempts by quantifying semantic shifts in embedding space between benign and suspect inputs. ZEDD operates without requiring access to model internals, prior knowledge of attack types, or task-specific retraining, enabling efficient zero-shot deployment across diverse LLM architectures. Our method uses adversarial-clean prompt pairs and measures embedding drift via cosine similarity to capture subtle adversarial manipulations inherent to real-world injection attacks. To ensure robust evaluation, we assemble and re-annotate the comprehensive LLMail-Inject dataset spanning five injection categories derived from publicly available sources. Extensive experiments demonstrate that embedding drift is a robust and transferable signal, outperforming traditional methods in detection accuracy and operational efficiency. With greater than 93% accuracy in classifying prompt injections across model architectures like Llama 3, Qwen 2, and Mistral and a false positive rate of 3%, our approach offers a lightweight, scalable defense layer that integrates into existing LLM pipelines, addressing a critical gap in securing LLM-powered systems to withstand adaptive adversarial threats.

173. 【2601.12307】Rethinking the Value of Multi-Agent Workflow: A Strong Single Agent Baseline

链接https://arxiv.org/abs/2601.12307

作者:Jiawei Xu,Arief Koesdwiady,Sisong Bei,Yan Han,Baixiang Huang,Dakuo Wang,Yutong Chen,Zheshen Wang,Peihao Wang,Pan Li,Ying Ding

类目:Multiagent Systems (cs.MA); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Recent advances, multiple LLM agents, distinct roles, complex tasks, advances in LLM-based

备注

点击查看摘要

Abstract:Recent advances in LLM-based multi-agent systems (MAS) show that workflows composed of multiple LLM agents with distinct roles, tools, and communication patterns can outperform single-LLM baselines on complex tasks. However, most frameworks are homogeneous, where all agents share the same base LLM and differ only in prompts, tools, and positions in the workflow. This raises the question of whether such workflows can be simulated by a single agent through multi-turn conversations. We investigate this across seven benchmarks spanning coding, mathematics, general question answering, domain-specific reasoning, and real-world planning and tool use. Our results show that a single agent can reach the performance of homogeneous workflows with an efficiency advantage from KV cache reuse, and can even match the performance of an automatically optimized heterogeneous workflow. Building on this finding, we propose \textbf{OneFlow}, an algorithm that automatically tailors workflows for single-agent execution, reducing inference costs compared to existing automatic multi-agent design frameworks without trading off accuracy. These results position the single-LLM implementation of multi-agent workflows as a strong baseline for MAS research. We also note that single-LLM methods cannot capture heterogeneous workflows due to the lack of KV cache sharing across different LLMs, highlighting future opportunities in developing \textit{truly} heterogeneous multi-agent systems.

174. 【2601.12286】Conversational Context Classification: A Representation Engineering Approach

链接https://arxiv.org/abs/2601.12286

作者:Jonathan Pan

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

关键词:Large Language Models, Large Language, demands effective safeguards, prevalence of Large, Language Models

备注

点击查看摘要

Abstract:The increasing prevalence of Large Language Models (LLMs) demands effective safeguards for their operation, particularly concerning their tendency to generate out-of-context responses. A key challenge is accurately detecting when LLMs stray from expected conversational norms, manifesting as topic shifts, factual inaccuracies, or outright hallucinations. Traditional anomaly detection struggles to directly apply within contextual semantics. This paper outlines our experiment in exploring the use of Representation Engineering (RepE) and One-Class Support Vector Machine (OCSVM) to identify subspaces within the internal states of LLMs that represent a specific context. By training OCSVM on in-context examples, we establish a robust boundary within the LLM's hidden state latent space. We evaluate out study with two open source LLMs - Llama and Qwen models in specific contextual domain. Our approach entailed identifying the optimal layers within the LLM's internal state subspaces that strongly associates with the context of interest. Our evaluation results showed promising results in identifying the subspace for a specific context. Aside from being useful in detecting in or out of context conversation threads, this research work contributes to the study of better interpreting LLMs.

175. 【2601.12269】Simulated Annealing Enhances Theory-of-Mind Reasoning in Autoregressive Language Models

链接https://arxiv.org/abs/2601.12269

作者:Xucong Hu,Jian-Qiao Zhu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:optimizing surface plausibility, correct latent-state representations, maintaining correct latent-state, local coherence, global coherence

备注

点击查看摘要

Abstract:Autoregressive language models are next-token predictors and have been criticized for only optimizing surface plausibility (i.e., local coherence) rather than maintaining correct latent-state representations (i.e., global coherence). Because Theory of Mind (ToM) tasks crucially depend on reasoning about latent mental states of oneself and others, such models are therefore often thought to fail at ToM. While post-training methods can improve ToM performance, we show that strong ToM capability can be recovered directly from the base model without any additional weight updates or verifications. Our approach builds on recent power-sampling methods (Karan Du, 2025) that use Markov chain Monte Carlo (MCMC) to sample from sharpened sequence-level (rather than token-level) probability distributions of autoregressive language models. We further find that incorporating annealing, where the tempered distribution is gradually shifted from high to low temperature, substantially improves ToM performance over fixed-temperature power sampling. Together, these results suggest that sampling-based optimization provides a powerful way to extract latent capabilities from language models without retraining.

176. 【2601.12263】Multimodal Generative Engine Optimization: Rank Manipulation for Vision-Language Model Rankers

链接https://arxiv.org/abs/2601.12263

作者:Yixuan Du,Chenxiao Yu,Haoyan Xu,Ziyi Wang,Yue Zhao,Xiyang Hu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:rapidly replacing unimodal, replacing unimodal encoders, recommendation systems, rapidly replacing, replacing unimodal

备注

点击查看摘要

Abstract:Vision-Language Models (VLMs) are rapidly replacing unimodal encoders in modern retrieval and recommendation systems. While their capabilities are well-documented, their robustness against adversarial manipulation in competitive ranking scenarios remains largely unexplored. In this paper, we uncover a critical vulnerability in VLM-based product search: multimodal ranking attacks. We present Multimodal Generative Engine Optimization (MGEO), a novel adversarial framework that enables a malicious actor to unfairly promote a target product by jointly optimizing imperceptible image perturbations and fluent textual suffixes. Unlike existing attacks that treat modalities in isolation, MGEO employs an alternating gradient-based optimization strategy to exploit the deep cross-modal coupling within the VLM. Extensive experiments on real-world datasets using state-of-the-art models demonstrate that our coordinated attack significantly outperforms text-only and image-only baselines. These findings reveal that multimodal synergy, typically a strength of VLMs, can be weaponized to compromise the integrity of search rankings without triggering conventional content filters.

177. 【2601.12262】Environment-Aware Code Generation: How far are We?

链接https://arxiv.org/abs/2601.12262

作者:Tongtong Wu,Rongyi Chen,Wenjie Du,Suyu Ma,Guilin Qi,Zhenchang Xing,Shahram Khadivi,Ramesh Periyathambi,Gholamreza Haffari

类目:oftware Engineering (cs.SE); Computation and Language (cs.CL)

关键词:large language models, Recent progress, improved code generation, code generation, language models

备注: ICSE 2026

点击查看摘要

Abstract:Recent progress in large language models (LLMs) has improved code generation, but most evaluations still test isolated, small-scale code (e.g., a single function) under default or unspecified software environments. As a result, it is unclear whether LLMs can reliably generate executable code tailored to a user's specific environment. We present the first systematic study of Environment-Aware Code Generation (EACG), where generated code must be functionally correct and directly executable under arbitrary software configurations. To enable realistic evaluation, we introduce VersiBCB, a benchmark that is multi-package, execution-verified, and deprecation-aware, capturing complex and evolving environments that prior datasets often overlook. Using VersiBCB, we investigate three complementary adaptation axes: data, parameters, and cache, and develop representative strategies for each. Our results show that current LLMs struggle with environment-specific code generation, while our adaptations improve environment compatibility and executability. These findings highlight key challenges and opportunities for deploying LLMs in practical software engineering workflows.

178. 【2601.12247】Plan, Verify and Fill: A Structured Parallel Decoding Approach for Diffusion Language Models

链接https://arxiv.org/abs/2601.12247

作者:Miao Li,Hanyang Jiang,Sikai Chen,Hengyu Fu,Yuhang Cai,Baihe Huang,Tinghan Ye,Xuanzhou Chen,Pascal Van Hentenryck

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Diffusion Language Models, Diffusion Language, Language Models, promising non-sequential paradigm, present a promising

备注

点击查看摘要

Abstract:Diffusion Language Models (DLMs) present a promising non-sequential paradigm for text generation, distinct from standard autoregressive (AR) approaches. However, current decoding strategies often adopt a reactive stance, underutilizing the global bidirectional context to dictate global trajectories. To address this, we propose Plan-Verify-Fill (PVF), a training-free paradigm that grounds planning via quantitative validation. PVF actively constructs a hierarchical skeleton by prioritizing high-leverage semantic anchors and employs a verification protocol to operationalize pragmatic structural stopping where further deliberation yields diminishing returns. Extensive evaluations on LLaDA-8B-Instruct and Dream-7B-Instruct demonstrate that PVF reduces the Number of Function Evaluations (NFE) by up to 65% compared to confidence-based parallel decoding across benchmark datasets, unlocking superior efficiency without compromising accuracy.

179. 【2601.12208】CoReflect: Conversational Evaluation via Co-Evolutionary Simulation and Reflective Rubric Refinement

链接https://arxiv.org/abs/2601.12208

作者:Yunzhe Li,Richie Yueqi Feng,Tianxin Wei,Chin-Chia Hsu

类目:Computation and Language (cs.CL)

关键词:Evaluating conversational systems, multi-turn settings remains, Evaluating conversational, fundamental challenge, systems in multi-turn

备注

点击查看摘要

Abstract:Evaluating conversational systems in multi-turn settings remains a fundamental challenge. Conventional pipelines typically rely on manually defined rubrics and fixed conversational context$-$a static approach that limits coverage and fails to capture the diverse, emergent behaviors of dialogue models. To address this, we introduce CoReflect (Conversational Evaluation via Co-Evolutionary Simulation and Reflective Rubric Refinement), which unifies dialogue simulation and evaluation into an adaptive, iterative process. CoReflect employs a conversation planner that generates structured templates to guide a user simulator through diverse, goal-directed dialogues. Subsequently, a reflective analyzer processes these dialogues to identify systematic behavioral patterns and automatically refine the evaluation rubrics. Crucially, the insights from the conversation analysis are fed back into the planner to update conversation templates for subsequent iterations. This co-evolution loop ensures that the complexity of test cases and the diagnostic precision of rubrics improve in tandem. By minimizing human intervention, CoReflect provides a scalable and self-refining methodology that allows evaluation protocols to adapt alongside the rapidly advancing capabilities of dialogue models.

180. 【2601.12199】CTC-DID: CTC-Based Arabic dialect identification for streaming applications

链接https://arxiv.org/abs/2601.12199

作者:Muhammad Umar Farooq,Oscar Saz

类目:Computation and Language (cs.CL)

关键词:Connectionist Temporal Classification, Automatic Speech Recognition, Temporal Classification, Speech Recognition, Connectionist Temporal

备注: Accepted for IEEE ICASSP 2026

点击查看摘要

Abstract:This paper proposes a Dialect Identification (DID) approach inspired by the Connectionist Temporal Classification (CTC) loss function as used in Automatic Speech Recognition (ASR). CTC-DID frames the dialect identification task as a limited-vocabulary ASR system, where dialect tags are treated as a sequence of labels for a given utterance. For training, the repetition of dialect tags in transcriptions is estimated either using a proposed Language-Agnostic Heuristic (LAH) approach or a pre-trained ASR model. The method is evaluated on the low-resource Arabic Dialect Identification (ADI) task, with experimental results demonstrating that an SSL-based CTC-DID model, trained on a limited dataset, outperforms both fine-tuned Whisper and ECAPA-TDNN models. Notably, CTC-DID also surpasses these models in zero-shot evaluation on the Casablanca dataset. The proposed approach is found to be more robust to shorter utterances and is shown to be easily adaptable for streaming, real-time applications, with minimal performance degradation.

181. 【2601.12179】olerance Principle and Small Language Model Learning

链接https://arxiv.org/abs/2601.12179

作者:Adam E. Friedman,Stevan Harnad,Rushen Shi

类目:Computation and Language (cs.CL)

关键词:LLaMA require massive, Modern language models, Tolerance Principle, require massive training, Modern language

备注: 14 pages, 6 figures. BUCLD 50 Proceedings. To be published in 2026 by Cascadilla Press

点击查看摘要

Abstract:Modern language models like GPT-3, BERT, and LLaMA require massive training data, yet with sufficient training they reliably learn to distinguish grammatical from ungrammatical sentences. Children aged as young as 14 months already have the capacity to learn abstract grammar rules from very few exemplars, even in the presence of non-rule-following exceptions. Yang's (2016) Tolerance Principle defines a precise threshold for how many exceptions a rule can tolerate and still be learnable. The present study explored the minimal amount and quality of training data necessary for rules to be generalized by a transformer-based language model to test the predictions of the Tolerance Principle. We trained BabyBERTa (Huebner et al. 2021), a transformer model optimized for small datasets, on artificial grammars. The training sets varied in size, number of unique sentence types, and proportion of rule-following versus exception exemplars. We found that, unlike human infants, BabyBERTa's learning dynamics do not align with the Tolerance Principle.

182. 【2601.12164】he Language You Ask In: Language-Conditioned Ideological Divergence in LLM Analysis of Contested Political Documents

链接https://arxiv.org/abs/2601.12164

作者:Oleg Smirnov

类目:Computers and Society (cs.CY); Computation and Language (cs.CL)

关键词:carry systematic biases, systematic biases conditioned, Large language models, Large language, increasingly deployed

备注

点击查看摘要

Abstract:Large language models (LLMs) are increasingly deployed as analytical tools across multilingual contexts, yet their outputs may carry systematic biases conditioned by the language of the prompt. This study presents an experimental comparison of LLM-generated political analyses of a Ukrainian civil society document, using semantically equivalent prompts in Russian and Ukrainian. Despite identical source material and parallel query structures, the resulting analyses varied substantially in rhetorical positioning, ideological orientation, and interpretive conclusions. The Russian-language output echoed narratives common in Russian state discourse, characterizing civil society actors as illegitimate elites undermining democratic mandates. The Ukrainian-language output adopted vocabulary characteristic of Western liberal-democratic political science, treating the same actors as legitimate stakeholders within democratic contestation. These findings demonstrate that prompt language alone can produce systematically different ideological orientations from identical models analyzing identical content, with significant implications for AI deployment in polarized information environments, cross-lingual research applications, and the governance of AI systems in multilingual societies.

183. 【2601.12154】Analyzing Cancer Patients' Experiences with Embedding-based Topic Modeling and LLMs

链接https://arxiv.org/abs/2601.12154

作者:Teodor-Călin Ionescu,Lifeng Han,Jan Heijdra Suasnabar,Anne Stiggelbout,Suzan Verberne

类目:Computation and Language (cs.CL)

关键词:uncover meaningful themes, patient storytelling data, patient-oriented healthcare practices, storytelling data, study investigates

备注: under review to CLIN journal

点击查看摘要

Abstract:This study investigates the use of neural topic modeling and LLMs to uncover meaningful themes from patient storytelling data, to offer insights that could contribute to more patient-oriented healthcare practices. We analyze a collection of transcribed interviews with cancer patients (132,722 words in 13 interviews). We first evaluate BERTopic and Top2Vec for individual interview summarization by using similar preprocessing, chunking, and clustering configurations to ensure a fair comparison on Keyword Extraction. LLMs (GPT4) are then used for the next step topic labeling. Their outputs for a single interview (I0) are rated through a small-scale human evaluation, focusing on {coherence}, {clarity}, and {relevance}. Based on the preliminary results and evaluation, BERTopic shows stronger performance and is selected for further experimentation using three {clinically oriented embedding} models. We then analyzed the full interview collection with the best model setting. Results show that domain-specific embeddings improved topic \textit{precision} and \textit{interpretability}, with BioClinicalBERT producing the most consistent results across transcripts. The global analysis of the full dataset of 13 interviews, using the BioClinicalBERT embedding model, reveals the most dominant topics throughout all 13 interviews, namely ``Coordination and Communication in Cancer Care Management" and ``Patient Decision-Making in Cancer Treatment Journey''. Although the interviews are machine translations from Dutch to English, and clinical professionals are not involved in this evaluation, the findings suggest that neural topic modeling, particularly BERTopic, can help provide useful feedback to clinicians from patient interviews. This pipeline could support more efficient document navigation and strengthen the role of patients' voices in healthcare workflows.

184. 【2601.12132】Bengali Text Classification: An Evaluation of Large Language Model Approaches

链接https://arxiv.org/abs/2601.12132

作者:Md Mahmudul Hoque,Md Mehedi Hassain,Md Hojaifa Tanvir,Rahul Nandy

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:natural language processing, Bengali text classification, Significant task, predefined labels, Bengali

备注

点击查看摘要

Abstract:Bengali text classification is a Significant task in natural language processing (NLP), where text is categorized into predefined labels. Unlike English, Bengali faces challenges due to the lack of extensive annotated datasets and pre-trained language models. This study explores the effectiveness of large language models (LLMs) in classifying Bengali newspaper articles. The dataset used, obtained from Kaggle, consists of articles from Prothom Alo, a major Bangladeshi newspaper. Three instruction-tuned LLMs LLaMA 3.1 8B Instruct, LLaMA 3.2 3B Instruct, and Qwen 2.5 7B Instruct were evaluated for this task under the same classification framework. Among the evaluated models, Qwen 2.5 achieved the highest classification accuracy of 72%, showing particular strength in the "Sports" category. In comparison, LLaMA 3.1 and LLaMA 3.2 attained accuracies of 53% and 56%, respectively. The findings highlight the effectiveness of LLMs in Bengali text classification, despite the scarcity of resources for Bengali NLP. Future research will focus on exploring additional models, addressing class imbalance issues, and refining fine-tuning approaches to improve classification performance.

185. 【2601.12104】Powerful Training-Free Membership Inference Against Autoregressive Language Models

链接https://arxiv.org/abs/2601.12104

作者:David Ilić,David Stanojević,Kostadin Cvejoski

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

关键词:expose sensitive information, pose significant privacy, models pose significant, significant privacy risks, pose significant

备注: 9 pages, 2 figures; appendix with additional experiments and derivations

点击查看摘要

Abstract:Fine-tuned language models pose significant privacy risks, as they may memorize and expose sensitive information from their training data. Membership inference attacks (MIAs) provide a principled framework for auditing these risks, yet existing methods achieve limited detection rates, particularly at the low false-positive thresholds required for practical privacy auditing. We present EZ-MIA, a membership inference attack that exploits a key observation: memorization manifests most strongly at error positions, specifically tokens where the model predicts incorrectly yet still shows elevated probability for training examples. We introduce the Error Zone (EZ) score, which measures the directional imbalance of probability shifts at error positions relative to a pretrained reference model. This principled statistic requires only two forward passes per query and no model training of any kind. On WikiText with GPT-2, EZ-MIA achieves 3.8x higher detection than the previous state-of-the-art under identical conditions (66.3% versus 17.5% true positive rate at 1% false positive rate), with near-perfect discrimination (AUC 0.98). At the stringent 0.1% FPR threshold critical for real-world auditing, we achieve 8x higher detection than prior work (14.0% versus 1.8%), requiring no reference model training. These gains extend to larger architectures: on AG News with Llama-2-7B, we achieve 3x higher detection (46.7% versus 15.8% TPR at 1% FPR). These results establish that privacy risks of fine-tuned language models are substantially greater than previously understood, with implications for both privacy auditing and deployment decisions. Code is available at this https URL.

186. 【2601.12099】Large language models struggle with ethnographic text annotation

链接https://arxiv.org/abs/2601.12099

作者:Leonardo S. Goodall,Dor Shilton,Daniel A. Mullins,Harvey Whitehouse

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, accelerate cross-cultural research, extracting structured data, Large language, raising hopes

备注

点击查看摘要

Abstract:Large language models (LLMs) have shown promise for automated text annotation, raising hopes that they might accelerate cross-cultural research by extracting structured data from ethnographic texts. We evaluated 7 state-of-the-art LLMs on their ability to annotate 121 ritual features across 567 ethnographic excerpts. Performance was limited, falling well below levels required for reliable automated annotation. Longer texts, features requiring ordinal distinctions, and ambiguous constructs proved particularly difficult. Human inter-coder reliability set an approximate ceiling on LLM accuracy: features that human coders found difficult to agree upon were also difficult for LLMs. Yet even on features where humans reliably agreed, models fell short of human performance. Our findings suggest that LLMs cannot yet substitute for human expertise in ethnographic annotation.

187. 【2601.12095】Neural Isomorphic Fields: A Transformer-based Algebraic Numerical Embedding

链接https://arxiv.org/abs/2601.12095

作者:Hamidreza Sadeghi,Saeedeh Momtazi,Reza Safabakhsh

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:unstable output variations, large numbers due, output variations, processing very small, due to issues

备注

点击查看摘要

Abstract:Neural network models often face challenges when processing very small or very large numbers due to issues such as overflow, underflow, and unstable output variations. To mitigate these problems, we propose using embedding vectors for numbers instead of directly using their raw values. These embeddings aim to retain essential algebraic properties while preventing numerical instabilities. In this paper, we introduce, for the first time, a fixed-length number embedding vector that preserves algebraic operations, including addition, multiplication, and comparison, within the field of rational numbers. We propose a novel Neural Isomorphic Field, a neural abstraction of algebraic structures such as groups and fields. The elements of this neural field are embedding vectors that maintain algebraic structure during computations. Our experiments demonstrate that addition performs exceptionally well, achieving over 95 percent accuracy on key algebraic tests such as identity, closure, and associativity. In contrast, multiplication exhibits challenges, with accuracy ranging from 53 percent to 73 percent across various algebraic properties. These findings highlight the model's strengths in preserving algebraic properties under addition while identifying avenues for further improvement in handling multiplication.

188. 【2601.12078】Optimizing User Profiles via Contextual Bandits for Retrieval-Augmented LLM Personalization

链接https://arxiv.org/abs/2601.12078

作者:Linfeng Du,Ye Yuan,Zichen Zhao,Fuyuan Lyu,Emiliano Penaloza,Xiuying Chen,Zipeng Sun,Jikun Kang,Laurent Charlin,Xue Liu,Haolun Wu

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Large Language Models, Large Language, users remains challenging, individual users remains, Language Models

备注

点击查看摘要

Abstract:Large Language Models (LLMs) excel at general-purpose tasks, yet adapting their responses to individual users remains challenging. Retrieval augmentation provides a lightweight alternative to fine-tuning by conditioning LLMs on user history records, and existing approaches typically select these records based on semantic relevance. We argue that relevance serves as an unreliable proxy for utility: a record may be semantically similar to a query yet fail to improve generation quality or even degrade it due to redundancy or conflicting information. To bridge this gap, we propose PURPLE, a contextual bandit framework that oPtimizes UseR Profiles for Llm pErsonalization. In contrast to a greedy selection of the most relevant records, PURPLE treats profile construction as a set generation process and utilizes a Plackett-Luce ranking model to capture complex inter-record dependencies. By training with dense feedback provided by the likelihood of the reference response, our method aligns retrieval directly with generation quality. Extensive experiments on nine personalization tasks demonstrate that PURPLE consistently outperforms strong heuristic and retrieval-augmented baselines in both effectiveness and efficiency, establishing a principled and scalable solution for optimizing user profiles.

189. 【2601.12076】CroBIM-V: Memory-Quality Controlled Remote Sensing Referring Video Object Segmentation

链接https://arxiv.org/abs/2601.12076

作者:H. Jiang,Y. Sun,Z. Dong,T. Liu,Y. Gu

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Remote sensing video, Remote sensing, weak target saliency, maintain discriminative target, discriminative target representations

备注

点击查看摘要

Abstract:Remote sensing video referring object segmentation (RS-RVOS) is challenged by weak target saliency and severe visual information truncation in dynamic scenes, making it extremely difficult to maintain discriminative target representations during segmentation. Moreover, progress in this field is hindered by the absence of large-scale dedicated benchmarks, while existing models are often affected by biased initial memory construction that impairs accurate instance localization in complex scenarios, as well as indiscriminate memory accumulation that encodes noise from occlusions or misclassifications, leading to persistent error propagation. This paper advances RS-RVOS research through dual contributions in data and methodology. First, we construct RS-RVOS Bench, the first large-scale benchmark comprising 111 video sequences, about 25,000 frames, and 213,000 temporal referring annotations. Unlike common RVOS benchmarks where many expressions are written with access to the full video context, our dataset adopts a strict causality-aware annotation strategy in which linguistic references are generated solely from the target state in the initial frame. Second, we propose a memory-quality-aware online referring segmentation framework, termed Memory Quality Control with Segment Anything Model (MQC-SAM). MQC-SAM introduces a temporal motion consistency module for initial memory calibration, leveraging short-term motion trajectory priors to correct structural deviations and establish accurate memory anchoring. Furthermore, it incorporates a decoupled attention-based memory integration mechanism with dynamic quality assessment, selectively updating high-confidence semantic features while filtering unreliable information, thereby effectively preventing error accumulation and propagation. Extensive experiments on RS-RVOS Bench demonstrate that MQC-SAM achieves state-of-the-art performance.

190. 【2601.12075】o Copy or Not to Copy: Copying Is Easier to Induce Than Recall

链接https://arxiv.org/abs/2601.12075

作者:Mehrdad Farahani,Franziska Penzkofer,Richard Johansson

类目:Computation and Language (cs.CL)

关键词:parametric knowledge stored, Language models, elicit parametric recall, retrieval-augmented settings, settings must arbitrate

备注

点击查看摘要

Abstract:Language models used in retrieval-augmented settings must arbitrate between parametric knowledge stored in their weights and contextual information in the prompt. This work presents a mechanistic study of that choice by extracting an \emph{arbitration vector} from model activations on a curated dataset designed to disentangle (i) irrelevant contexts that elicit parametric recall and (ii) relevant but false contexts that elicit copying. The vector is computed as the residual-stream centroid difference between these regimes across 27 relations, and is injected as an additive intervention at selected layers and token spans to steer behavior in two directions: Copy$\rightarrow$Recall (suppressing context use) and Recall$\rightarrow$Copy (inducing the model to copy any token from the context). Experiments on two architectures (decoder-only and encoder/decoder) and two open-domain QA benchmarks show consistent behavior shifts under moderate scaling while monitoring accuracy and fluency. Mechanistic analyses of attention routing, MLP contributions, and layer-wise probability trajectories reveal an asymmetry: inducing copying is an easy ``reactivation'' process that can be triggered at different locations in the input, while restoring recall is a ``suppression'' process that is more fragile and strongly tied to object-token interventions.

191. 【2601.12068】Bridging the Gap in Bangla Healthcare: Machine Learning Based Disease Prediction Using a Symptoms-Disease Dataset

链接https://arxiv.org/abs/2601.12068

作者:Rowzatul Zannat,Abdullah Al Shafi,Abdul Muntakim

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:prediction remain limited, Increased access, remain limited, Bangla, Increased

备注

点击查看摘要

Abstract:Increased access to reliable health information is essential for non-English-speaking populations, yet resources in Bangla for disease prediction remain limited. This study addresses this gap by developing a comprehensive Bangla symptoms-disease dataset containing 758 unique symptom-disease relationships spanning 85 diseases. To ensure transparency and reproducibility, we also make our dataset publicly available. The dataset enables the prediction of diseases based on Bangla symptom inputs, supporting healthcare accessibility for Bengali-speaking populations. Using this dataset, we evaluated multiple machine learning models to predict diseases based on symptoms provided in Bangla and analyzed their performance on our dataset. Both soft and hard voting ensemble approaches combining top-performing models achieved 98\% accuracy, demonstrating superior robustness and generalization. Our work establishes a foundational resource for disease prediction in Bangla, paving the way for future advancements in localized health informatics and diagnostic tools. This contribution aims to enhance equitable access to health information for Bangla-speaking communities, particularly for early disease detection and healthcare interventions.

192. 【2601.12061】Codebook-Injected Dialogue Segmentation for Multi-Utterance Constructs Annotation: LLM-Assisted and Gold-Label-Free Evaluation

链接https://arxiv.org/abs/2601.12061

作者:Jinsook Lee,Kirk Vanacore,Zhuqian Zhou,Jeanine Grutter,Rene F. Kizilcec

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:typically treats communicative, annotation typically treats, Dialogue Act, utterances or turns, typically treats

备注: Under Review for ACL 2026

点击查看摘要

Abstract:Dialogue Act (DA) annotation typically treats communicative or pedagogical intent as localized to individual utterances or turns. This leads annotators to agree on the underlying action while disagreeing on segment boundaries, reducing apparent reliability. We propose codebook-injected segmentation, which conditions boundary decisions on downstream annotation criteria, and evaluate LLM-based segmenters against standard and retrieval-augmented baselines. To assess these without gold labels, we introduce evaluation metrics for span consistency, distinctiveness, and human-AI distributional agreement. We found DA-awareness produces segments that are internally more consistent than text-only baselines. While LLMs excel at creating construct-consistent spans, coherence-based baselines remain superior at detecting global shifts in dialogue flow. Across two datasets, no single segmenter dominates. Improvements in within-segment coherence frequently trade off against boundary distinctiveness and human-AI distributional agreement. These results highlight segmentation as a consequential design choice that should be optimized for downstream objectives rather than a single performance score.

193. 【2601.12034】Don't Start Over: A Cost-Effective Framework for Migrating Personalized Prompts Between LLMs

链接https://arxiv.org/abs/2601.12034

作者:Ziyi Zhao,Chongming Gao,Yang Zhang,Haoyan Liu,Weinan Gan,Huifeng Guo,Yong Liu,Fuli Feng

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Large Language Models, Personalization in Large, Large Language, user-specific soft prompts, Language Models

备注: Accepted to AAAI 2026 (Oral). 9 pages, 5 figures

点击查看摘要

Abstract:Personalization in Large Language Models (LLMs) often relies on user-specific soft prompts. However, these prompts become obsolete when the foundation model is upgraded, necessitating costly, full-scale retraining. To overcome this limitation, we propose the Prompt-level User Migration Adapter (PUMA), a lightweight framework to efficiently migrate personalized prompts across incompatible models. PUMA utilizes a parameter-efficient adapter to bridge the semantic gap, combined with a group-based user selection strategy to significantly reduce training costs. Experiments on three large-scale datasets show our method matches or even surpasses the performance of retraining from scratch, reducing computational cost by up to 98%. The framework demonstrates strong generalization across diverse model architectures and robustness in advanced scenarios like chained and aggregated migrations, offering a practical path for the sustainable evolution of personalized AI by decoupling user assets from the underlying models.

194. 【2601.12033】Preserving Fairness and Safety in Quantized LLMs Through Critical Weight Protection

链接https://arxiv.org/abs/2601.12033

作者:Muhammad Alif Al Hakim,Alfan Farizki Wicaksono,Fajri Koto

类目:Computation and Language (cs.CL)

关键词:large language models, remain underexplored, multilingual contexts, widely adopted, adopted to reduce

备注

点击查看摘要

Abstract:Quantization is widely adopted to reduce the computational cost of large language models (LLMs); however, its implications for fairness and safety, particularly in dynamic quantization and multilingual contexts, remain underexplored. In this work, we conduct a systematic study of how static and dynamic quantization methods impact fairness and safety across benchmarks measuring intrinsic and extrinsic bias and safety alignment. For fairness, we evaluate English, French, Dutch, Spanish, and Turkish; for safety, we focus on English, Korean, and Arabic. Our findings reveal that quantization consistently degrades fairness and safety, with dynamic methods demonstrating greater stability than static ones. Moreover, fairness degradation varies across languages, while safety deterioration is especially pronounced in non-English settings. To address these risks, we introduce Critical Weight Protection, a novel technique that identifies and preserves fairness- and safety-critical weights during quantization. This approach effectively mitigates bias and safety deterioration without costly retraining or alignment, maintaining trustworthiness while retaining efficiency.

195. 【2601.12024】A Multi-Agent System for Generating Actionable Business Advice

链接https://arxiv.org/abs/2601.12024

作者:Kartikey Singh Bhandari,Tanish Jain,Archit Agrawal,Dhruv Kumar,Praveen Kumar,Pratik Narang

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:existing analytic methods, analytic methods rarely, methods rarely move, aspect extraction, Customer reviews

备注

点击查看摘要

Abstract:Customer reviews contain rich signals about product weaknesses and unmet user needs, yet existing analytic methods rarely move beyond descriptive tasks such as sentiment analysis or aspect extraction. While large language models (LLMs) can generate free-form suggestions, their outputs often lack accuracy and depth of reasoning. In this paper, we present a multi-agent, LLM-based framework for prescriptive decision support, which transforms large scale review corpora into actionable business advice. The framework integrates four components: clustering to select representative reviews, generation of advices, iterative evaluation, and feasibility based ranking. This design couples corpus distillation with feedback driven advice refinement to produce outputs that are specific, actionable, and practical. Experiments across three service domains and multiple model families show that our framework consistently outperform single model baselines on actionability, specificity, and non-redundancy, with medium sized models approaching the performance of large model frameworks.

196. 【2601.12019】Acting Flatterers via LLMs Sycophancy: Combating Clickbait with LLMs Opposing-Stance Reasoning

链接https://arxiv.org/abs/2601.12019

作者:Chaowei Zhang,Xiansheng Luo,Zewei Zhang,Yi Zhu,Jipeng Qiang,Longwei Wang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:exaggerated headlines designed, deceptive or exaggerated, attract attention, widespread proliferation, proliferation of online

备注

点击查看摘要

Abstract:The widespread proliferation of online content has intensified concerns about clickbait, deceptive or exaggerated headlines designed to attract attention. While Large Language Models (LLMs) offer a promising avenue for addressing this issue, their effectiveness is often hindered by Sycophancy, a tendency to produce reasoning that matches users' beliefs over truthful ones, which deviates from instruction-following principles. Rather than treating sycophancy as a flaw to be eliminated, this work proposes a novel approach that initially harnesses this behavior to generate contrastive reasoning from opposing perspectives. Specifically, we design a Self-renewal Opposing-stance Reasoning Generation (SORG) framework that prompts LLMs to produce high-quality agree and disagree reasoning pairs for a given news title without requiring ground-truth labels. To utilize the generated reasoning, we develop a local Opposing Reasoning-based Clickbait Detection (ORCD) model that integrates three BERT encoders to represent the title and its associated reasoning. The model leverages contrastive learning, guided by soft labels derived from LLM-generated credibility scores, to enhance detection robustness. Experimental evaluations on three benchmark datasets demonstrate that our method consistently outperforms LLM prompting, fine-tuned smaller language models, and state-of-the-art clickbait detection baselines.

197. 【2601.11969】$\texttt{MemoryRewardBench}$: Benchmarking Reward Models for Long-Term Memory Management in Large Language Models

链接https://arxiv.org/abs/2601.11969

作者:Zecheng Tang,Baibei Ji,Ruoxi Sun,Haitian Wang,WangJie You,Zhang Yijun,Wenpeng Zhu,Ji Qi,Juntao Li,Min Zhang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Existing works increasingly, increasingly adopt memory-centric, adopt memory-centric mechanisms, enables large language, effectively propagate information

备注

点击查看摘要

Abstract:Existing works increasingly adopt memory-centric mechanisms to process long contexts in a segment manner, and effective memory management is one of the key capabilities that enables large language models to effectively propagate information across the entire sequence. Therefore, leveraging reward models (RMs) to automatically and reliably evaluate memory quality is critical. In this work, we introduce $\texttt{MemoryRewardBench}$, the first benchmark to systematically study the ability of RMs to evaluate long-term memory management processes. $\texttt{MemoryRewardBench}$ covers both long-context comprehension and long-form generation tasks, featuring 10 distinct settings with different memory management patterns, with context length ranging from 8K to 128K tokens. Evaluations on 13 cutting-edge RMs indicate a diminishing performance gap between open-source and proprietary models, with newer-generation models consistently outperforming their predecessors regardless of parameter count. We further expose the capabilities and fundamental limitations of current RMs in evaluating LLM memory management across diverse settings.

198. 【2601.11960】R$^2$PO: Decoupling Training Trajectories from Inference Responses for LLM Reasoning

链接https://arxiv.org/abs/2601.11960

作者:Jingchu Wang,Bingbing Xu,Yige Yuan,Bin Xie,Xiaoqian Sun,Huawei Shen

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:improving LLM reasoning, Reinforcement learning, improving LLM, LLM reasoning, central paradigm

备注

点击查看摘要

Abstract:Reinforcement learning has become a central paradigm for improving LLM reasoning. However, existing methods use a single policy to produce both inference responses and training optimization trajectories. The objective conflict between generating stable inference responses and diverse training trajectories leads to insufficient exploration, which harms reasoning capability. In this paper, to address the problem, we propose R$^2$PO (Residual Rollout Policy Optimization), which introduces a lightweight Residual Rollout-Head atop the policy to decouple training trajectories from inference responses, enabling controlled trajectory diversification during training while keeping inference generation stable. Experiments across multiple benchmarks show that our method consistently outperforms baselines, achieving average accuracy gains of 3.1% on MATH-500 and 2.4% on APPS, while also reducing formatting errors and mitigating length bias for stable optimization. Our code is publicly available at this https URL.

199. 【2601.11957】PEARL: Self-Evolving Assistant for Time Management with Reinforcement Learning

链接https://arxiv.org/abs/2601.11957

作者:Bingxuan Li,Jeonghwan Kim,Cheng Qian,Xiusi Chen,Eitan Anzenberg,Niran Kundapur,Heng Ji

类目:Computation and Language (cs.CL)

关键词:Overlapping calendar invitations, invitations force busy, force busy professionals, calendar invitations force, Overlapping calendar

备注

点击查看摘要

Abstract:Overlapping calendar invitations force busy professionals to repeatedly decide which meetings to attend, reschedule, or decline. We refer to this preference-driven decision process as calendar conflict resolution. Automating such process is crucial yet challenging. Scheduling logistics drain hours, and human delegation often fails at scale, which motivate we to ask: Can we trust large language model (LLM) or language agent to manager time? To enable systematic study of this question, we introduce CalConflictBench, a benchmark for long-horizon calendar conflict resolution. Conflicts are presented sequentially and agents receive feedback after each round, requiring them to infer and adapt to user preferences progressively. Our experiments show that current LLM agents perform poorly with high error rates, e.g., Qwen-3-30B-Think has 35% average error rate. To address this gap, we propose PEARL, a reinforcement-learning framework that augments language agent with an external memory module and optimized round-wise reward design, enabling agent to progressively infer and adapt to user preferences on-the-fly. Experiments on CalConflictBench shows that PEARL achieves 0.76 error reduction rate, and 55% improvement in average error rate compared to the strongest baseline.

200. 【2601.11956】Double-Calibration: Towards Trustworthy LLMs via Calibrating Knowledge and Reasoning Confidence

链接https://arxiv.org/abs/2601.11956

作者:Yuyin Lu,Ziran Liang,Yanghui Rao,Wenqi Fan,Fu Lee Wang,Qing Li

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, Trustworthy reasoning, Language Models, propensity for hallucination

备注

点击查看摘要

Abstract:Trustworthy reasoning in Large Language Models (LLMs) is challenged by their propensity for hallucination. While augmenting LLMs with Knowledge Graphs (KGs) improves factual accuracy, existing KG-augmented methods fail to quantify epistemic uncertainty in both the retrieved evidence and LLMs' reasoning. To bridge this gap, we introduce DoublyCal, a framework built on a novel double-calibration principle. DoublyCal employs a lightweight proxy model to first generate KG evidence alongside a calibrated evidence confidence. This calibrated supporting evidence then guides a black-box LLM, yielding final predictions that are not only more accurate but also well-calibrated, with confidence scores traceable to the uncertainty of the supporting evidence. Experiments on knowledge-intensive benchmarks show that DoublyCal significantly improves both the accuracy and confidence calibration of black-box LLMs with low token cost.

201. 【2601.11940】hinking Traps in Long Chain-of-Thought: A Measurable Study and Trap-Aware Adaptive Restart

链接https://arxiv.org/abs/2601.11940

作者:Kang Chen,Fan Yu,Junjie Nian,Shihan Zhao,Zhuoka Feng,Zijun Yao,Heng Wang,Minshen Yu,Yixin Cao

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:early wrong commitment, Scaling test-time compute, compute via Long, significantly enhances reasoning, enhances reasoning capabilities

备注

点击查看摘要

Abstract:Scaling test-time compute via Long Chain-of-Thought (Long-CoT) significantly enhances reasoning capabilities, yet extended generation does not guarantee correctness: after an early wrong commitment, models may keep elaborating a self-consistent but incorrect prefix. Through fine-grained trajectory analysis, we identify Thinking Traps, prefix-dominant deadlocks where later reflection, alternative attempts, or verification fails to revise the root error. On a curated subset of DAPO-MATH, 89\% of failures exhibit such traps. To solve this problem, we introduce TAAR (Trap-Aware Adaptive Restart), a test-time control framework that trains a diagnostic policy to predict two signals from partial trajectories: a trap index for where to truncate and an escape probability for whether and how strongly to intervene. At inference time, TAAR truncates the trajectory before the predicted trap segment and adaptively restarts decoding; for severely trapped cases, it applies stronger perturbations, including higher-temperature resampling and an optional structured reboot suffix. Experiments on challenging mathematical and scientific reasoning benchmarks (AIME24, AIME25, GPQA-Diamond, HMMT25, BRUMO25) show that TAAR improves reasoning performance without fine-tuning base model parameters.

202. 【2601.11932】Event Detection with a Context-Aware Encoder and LoRA for Improved Performance on Long-Tailed Classes

链接https://arxiv.org/abs/2601.11932

作者:Abdullah Al Monsur,Nitesh Vamshi Bommisetty,Gene Louis Kim

类目:Computation and Language (cs.CL)

关键词:notable re-occurring limitations, current state, notable re-occurring, re-occurring limitations, event detection research

备注

点击查看摘要

Abstract:The current state of event detection research has two notable re-occurring limitations that we investigate in this study. First, the unidirectional nature of decoder-only LLMs presents a fundamental architectural bottleneck for natural language understanding tasks that depend on rich, bidirectional context. Second, we confront the conventional reliance on Micro-F1 scores in event detection literature, which systematically inflates performance by favoring majority classes. Instead, we focus on Macro-F1 as a more representative measure of a model's ability across the long-tail of event types. Our experiments demonstrate that models enhanced with sentence context achieve superior performance over canonical decoder-only baselines. Using Low-Rank Adaptation (LoRA) during finetuning provides a substantial boost in Macro-F1 scores in particular, especially for the decoder-only models, showing that LoRA can be an effective tool to enhance LLMs' performance on long-tailed event classes.

203. 【2601.11923】Mapping the maturation of TCM as an adjuvant to radiotherapy

链接https://arxiv.org/abs/2601.11923

作者:P. Bilha Githinji,Aikaterini Melliou,Xi Yuan,Dayan Zhang,Lian Zhang,Zhenglin Chen,Jiansong Ji,Chengying Lv,Jinhao Xu,Peiwu Qin,Dongmei Yu

类目:Computation and Language (cs.CL)

关键词:Traditional Chinese Medicine, Traditional Chinese, Chinese Medicine, adoption of Traditional, complementary medicine

备注

点击查看摘要

Abstract:The integration of complementary medicine into oncology represents a paradigm shift that has seen to increasing adoption of Traditional Chinese Medicine (TCM) as an adjuvant to radiotherapy. About twenty-five years since the formal institutionalization of integrated oncology, it is opportune to synthesize the trajectory of evidence for TCM as an adjuvant to radiotherapy. Here we conduct a large-scale analysis of 69,745 publications (2000 - 2025), emerging a cyclical evolution defined by coordinated expansion and contraction in publication output, international collaboration, and funding commitments that mirrors a define-ideate-test pattern. Using a theme modeling workflow designed to determine a stable thematic structure of the field, we identify five dominant thematic axes - cancer types, supportive care, clinical endpoints, mechanisms, and methodology - that signal a focus on patient well-being, scientific rigor and mechanistic exploration. Cross-theme integration of TCM is patient-centered and systems-oriented. Together with the emergent cycles of evolution, the thematic structure demonstrates progressive specialization and potential defragmentation of the field or saturation of existing research agenda. The analysis points to a field that has matured its current research agenda and is likely at the cusp of something new. Additionally, the field exhibits positive reporting of findings that is homogeneous across publication types, thematic areas, and the cycles of evolution suggesting a system-wide positive reporting bias agnostic to structural drivers.

204. 【2601.11920】Enhancing LLM-Based Data Annotation with Error Decomposition

链接https://arxiv.org/abs/2601.11920

作者:Zhen Xu,Vedant Khatri,Yijun Dai,Xiner Liu,Siyan Li,Xuanming Zhang,Renzhe Yu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, language models offer, Large language, LLM annotation errors, annotation tasks

备注

点击查看摘要

Abstract:Large language models offer a scalable alternative to human coding for data annotation tasks, enabling the scale-up of research across data-intensive domains. While LLMs are already achieving near-human accuracy on objective annotation tasks, their performance on subjective annotation tasks, such as those involving psychological constructs, is less consistent and more prone to errors. Standard evaluation practices typically collapse all annotation errors into a single alignment metric, but this simplified approach may obscure different kinds of errors that affect final analytical conclusions in different ways. Here, we propose a diagnostic evaluation paradigm that incorporates a human-in-the-loop step to separate task-inherent ambiguity from model-driven inaccuracies and assess annotation quality in terms of their potential downstream impacts. We refine this paradigm on ordinal annotation tasks, which are common in subjective annotation. The refined paradigm includes: (1) a diagnostic taxonomy that categorizes LLM annotation errors along two dimensions: source (model-specific vs. task-inherent) and type (boundary ambiguity vs. conceptual misidentification); (2) a lightweight human annotation test to estimate task-inherent ambiguity from LLM annotations; and (3) a computational method to decompose observed LLM annotation errors following our taxonomy. We validate this paradigm on four educational annotation tasks, demonstrating both its conceptual validity and practical utility. Theoretically, our work provides empirical evidence for why excessively high alignment is unrealistic in specific annotation tasks and why single alignment metrics inadequately reflect the quality of LLM annotations. In practice, our paradigm can be a low-cost diagnostic tool that assesses the suitability of a given task for LLM annotation and provides actionable insights for further technical optimization.

205. 【2601.11913】LSTM-MAS: A Long Short-Term Memory Inspired Multi-Agent System for Long-Context Understanding

链接https://arxiv.org/abs/2601.11913

作者:Yichen Jiang,Peng Ye,Jiakang Yuan,Chongjun Tu,Lei Bai,Tao Chen

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:processing long contexts, large language models, Effectively processing long, long contexts remains, fundamental yet unsolved

备注: 12 pages, 5 figures

点击查看摘要

Abstract:Effectively processing long contexts remains a fundamental yet unsolved challenge for large language models (LLMs). Existing single-LLM-based methods primarily reduce the context window or optimize the attention mechanism, but they often encounter additional computational costs or constrained expanded context length. While multi-agent-based frameworks can mitigate these limitations, they remain susceptible to the accumulation of errors and the propagation of hallucinations. In this work, we draw inspiration from the Long Short-Term Memory (LSTM) architecture to design a Multi-Agent System called LSTM-MAS, emulating LSTM's hierarchical information flow and gated memory mechanisms for long-context understanding. Specifically, LSTM-MAS organizes agents in a chained architecture, where each node comprises a worker agent for segment-level comprehension, a filter agent for redundancy reduction, a judge agent for continuous error detection, and a manager agent for globally regulates information propagation and retention, analogous to LSTM and its input gate, forget gate, constant error carousel unit, and output gate. These novel designs enable controlled information transfer and selective long-term dependency modeling across textual segments, which can effectively avoid error accumulation and hallucination propagation. We conducted an extensive evaluation of our method. Compared with the previous best multi-agent approach, CoA, our model achieves improvements of 40.93%, 43.70%,121.57% and 33.12%, on NarrativeQA, Qasper, HotpotQA, and MuSiQue, respectively.

206. 【2601.11908】PPA-Plan: Proactive Pitfall Avoidance for Reliable Planning in Long-Context LLM Reasoning

链接https://arxiv.org/abs/2601.11908

作者:Byeongjin Kim,Gyuwan Kim,Seo Yeon Park

类目:Computation and Language (cs.CL)

关键词:Large language models, Large language, language models, sparsely distributed, long contexts

备注: 23 pages, 6 figures

点击查看摘要

Abstract:Large language models (LLMs) struggle with reasoning over long contexts where relevant information is sparsely distributed. Although plan-and-execute frameworks mitigate this by decomposing tasks into planning and execution, their effectiveness is often limited by unreliable plan generation due to dependence on surface-level cues. Consequently, plans may be based on incorrect assumptions, and once a plan is formed, identifying what went wrong and revising it reliably becomes difficult, limiting the effectiveness of reactive refinement. To address this limitation, we propose PPA-Plan, a proactive planning strategy for long-context reasoning that focuses on preventing such failures before plan generation. PPA-Plan identifies potential logical pitfalls and false assumptions, formulates them as negative constraints, and conditions plan generation on explicitly avoiding these constraints. Experiments on long-context QA benchmarks show that executing plans generated by PPA-Plan consistently outperforms existing plan-and-execute methods and direct prompting.

207. 【2601.11886】Faithfulness vs. Safety: Evaluating LLM Behavior Under Counterfactual Medical Evidence

链接https://arxiv.org/abs/2601.11886

作者:Kaijie Mo,Siddhartha Venkatayogi,Chantal Shaib,Ramez Kouzy,Wei Xu,Byron C. Wallace,Junyi Jessy Li

类目:Computation and Language (cs.CL)

关键词:domains like medicine, high-stakes domains, generally desirable, faithfully adhere, evidence

备注: 26 pages

点击查看摘要

Abstract:In high-stakes domains like medicine, it may be generally desirable for models to faithfully adhere to the context provided. But what happens if the context does not align with model priors or safety protocols? In this paper, we investigate how LLMs behave and reason when presented with counterfactual or even adversarial medical evidence. We first construct MedCounterFact, a counterfactual medical QA dataset that requires the models to answer clinical comparison questions (i.e., judge the efficacy of certain treatments, with evidence consisting of randomized controlled trials provided as context). In MedCounterFact, real-world medical interventions within the questions and evidence are systematically replaced with four types of counterfactual stimuli, ranging from unknown words to toxic substances. Our evaluation across multiple frontier LLMs on MedCounterFact reveals that in the presence of counterfactual evidence, existing models overwhelmingly accept such "evidence" at face value even when it is dangerous or implausible, and provide confident and uncaveated answers. While it may be prudent to draw a boundary between faithfulness and safety, our findings reveal that there exists no such boundary yet.

208. 【2601.11872】GloCTM: Cross-Lingual Topic Modeling via a Global Context Space

链接https://arxiv.org/abs/2601.11872

作者:Nguyen Tien Phat,Ngo Vu Minh,Linh Van Ngo,Nguyen Thi Ngoc Diep,Thien Huu Nguyen

类目:Computation and Language (cs.CL)

关键词:topic modeling seeks, modeling seeks, seeks to uncover, uncover coherent, coherent and semantically

备注: AAAI 2026

点击查看摘要

Abstract:Cross-lingual topic modeling seeks to uncover coherent and semantically aligned topics across languages - a task central to multilingual understanding. Yet most existing models learn topics in disjoint, language-specific spaces and rely on alignment mechanisms (e.g., bilingual dictionaries) that often fail to capture deep cross-lingual semantics, resulting in loosely connected topic spaces. Moreover, these approaches often overlook the rich semantic signals embedded in multilingual pretrained representations, further limiting their ability to capture fine-grained alignment. We introduce GloCTM (Global Context Space for Cross-Lingual Topic Model), a novel framework that enforces cross-lingual topic alignment through a unified semantic space spanning the entire model pipeline. GloCTM constructs enriched input representations by expanding bag-of-words with cross-lingual lexical neighborhoods, and infers topic proportions using both local and global encoders, with their latent representations aligned through internal regularization. At the output level, the global topic-word distribution, defined over the combined vocabulary, structurally synchronizes topic meanings across languages. To further ground topics in deep semantic space, GloCTM incorporates a Centered Kernel Alignment (CKA) loss that aligns the latent topic space with multilingual contextual embeddings. Experiments across multiple benchmarks demonstrate that GloCTM significantly improves topic coherence and cross-lingual alignment, outperforming strong baselines.

209. 【2601.11866】Advances in LLM Reasoning Enable Flexibility in Clinical Problem-Solving

链接https://arxiv.org/abs/2601.11866

作者:Kie Shidara,Preethi Prem,Jonathan Kim,Anna Podlasek,Feng Liu,Ahmed Alaa,Danilo Bernardo

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, achieved high accuracy, flexible clinical reasoning, Language Models

备注: 10 pages, 6 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have achieved high accuracy on medical question-answer (QA) benchmarks, yet their capacity for flexible clinical reasoning has been debated. Here, we asked whether advances in reasoning LLMs improve their cognitive flexibility in clinical reasoning. We assessed reasoning models from the OpenAI, Grok, Gemini, Claude, and DeepSeek families on the medicine abstraction and reasoning corpus (mARC), an adversarial medical QA benchmark which utilizes the Einstellung effect to induce inflexible overreliance on learned heuristic patterns in contexts where they become suboptimal. We found that strong reasoning models avoided Einstellung-based traps more often than weaker reasoning models, achieving human-level performance on mARC. On questions most commonly missed by physicians, the top 5 performing models answered 55% to 70% correctly with high confidence, indicating that these models may be less susceptible than humans to Einstellung effects. Our results indicate that strong reasoning models demonstrate improved flexibility in medical reasoning, achieving performance on par with humans on mARC.

210. 【2601.11865】CTPD: Cross Tokenizer Preference Distillation

链接https://arxiv.org/abs/2601.11865

作者:Truong Nguyen,Phi Van Dat,Ngan Nguyen,Linh Ngo Van,Trung Le,Thanh Hong Nguyen

类目:Computation and Language (cs.CL)

关键词:preferences remains underexplored, human preferences remains, realistic cross-tokenizer setting, instruction tuning, remains underexplored

备注: AAAI 2026

点击查看摘要

Abstract:While knowledge distillation has seen widespread use in pre-training and instruction tuning, its application to aligning language models with human preferences remains underexplored, particularly in the more realistic cross-tokenizer setting. The incompatibility of tokenization schemes between teacher and student models has largely prevented fine-grained, white-box distillation of preference information. To address this gap, we propose Cross-Tokenizer Preference Distillation (CTPD), the first unified framework for transferring human-aligned behavior between models with heterogeneous tokenizers. CTPD introduces three key innovations: (1) Aligned Span Projection, which maps teacher and student tokens to shared character-level spans for precise supervision transfer; (2) a cross-tokenizer adaptation of Token-level Importance Sampling (TIS-DPO) for improved credit assignment; and (3) a Teacher-Anchored Reference, allowing the student to directly leverage the teacher's preferences in a DPO-style objective. Our theoretical analysis grounds CTPD in importance sampling, and experiments across multiple benchmarks confirm its effectiveness, with significant performance gains over existing methods. These results establish CTPD as a practical and general solution for preference distillation across diverse tokenization schemes, opening the door to more accessible and efficient alignment of language models.

211. 【2601.11864】AGGC: Adaptive Group Gradient Clipping for Stabilizing Large Language Model Training

链接https://arxiv.org/abs/2601.11864

作者:Zhiyuan Li,Yuan Wu,Yi Chang

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, alleviate exploding gradients, Language Models, Group-wise Gradient Clipping

备注: 13 pages

点击查看摘要

Abstract:To stabilize the training of Large Language Models (LLMs), gradient clipping is a nearly ubiquitous heuristic used to alleviate exploding gradients. However, traditional global norm clipping erroneously presupposes gradient homogeneity across different functional modules, leading to an adverse "spill-over" effect where volatile parameters force unnecessary scaling on stable ones. To overcome this, we propose Adaptive Group-wise Gradient Clipping (AGGC). AGGC partitions parameters into groups based on functional types and regulates each according to its historical behavior using an Exponential Moving Average (EMA). Specifically, it constructs an adaptive interval to simultaneously mitigate gradient explosion and vanishing, while employing a time-dependent scheduling mechanism to balance exploration and convergence. Experiments on LLaMA 2-7B, Mistral-7B, and Gemma-7B models show that AGGC consistently outperforms LoRA and frequently surpasses Full Fine-Tuning. On the GSM8K benchmark, Mistral-7B fine-tuned with AGGC achieves an accuracy of 72.93%, exceeding LoRA's 69.5%. AGGC also effectively stabilizes Reinforcement Learning with Verifiable Rewards (RLVR), enhancing the logic deduction of Qwen 2.5 and Llama 3.2 models. Experimental results demonstrate that AGGC effectively addresses the limitations of traditional gradient clipping methods, particularly in overcoming gradient heterogeneity, by utilizing a modular, adaptive clipping strategy to stabilize the training process. Due to its lightweight design, AGGC can be seamlessly integrated into existing post-training pipelines with negligible overhead.

212. 【2601.11863】Utilizing Metadata for Better Retrieval-Augmented Generation

链接https://arxiv.org/abs/2601.11863

作者:Raquib Bin Yousuf,Shengzhe Xu,Mandar Sharma,Andrew Neeser,Chris Latimer,Naren Ramakrishnan

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)

关键词:Retrieval-Augmented Generation systems, Generation systems depend, large language models, Retrieval-Augmented Generation, Generation systems

备注: The 48th European Conference on Information Retrieval (ECIR 2026)

点击查看摘要

Abstract:Retrieval-Augmented Generation systems depend on retrieving semantically relevant document chunks to support accurate, grounded outputs from large language models. In structured and repetitive corpora such as regulatory filings, chunk similarity alone often fails to distinguish between documents with overlapping language. Practitioners often flatten metadata into input text as a heuristic, but the impact and trade-offs of this practice remain poorly understood. We present a systematic study of metadata-aware retrieval strategies, comparing plain-text baselines with approaches that embed metadata directly. Our evaluation spans metadata-as-text (prefix and suffix), a dual-encoder unified embedding that fuses metadata and content in a single index, dual-encoder late-fusion retrieval, and metadata-aware query reformulation. Across multiple retrieval metrics and question types, we find that prefixing and unified embeddings consistently outperform plain-text baselines, with the unified at times exceeding prefixing while being easier to maintain. Beyond empirical comparisons, we analyze embedding space, showing that metadata integration improves effectiveness by increasing intra-document cohesion, reducing inter-document confusion, and widening the separation between relevant and irrelevant chunks. Field-level ablations show that structural cues provide strong disambiguating signals. Our code, evaluation framework, and the RAGMATE-10K dataset are publicly hosted.

213. 【2601.11854】ATOD: An Evaluation Framework and Benchmark for Agentic Task-Oriented Dialogue System

链接https://arxiv.org/abs/2601.11854

作者:Yifei Zhang,Hooshang Nayyeri,Rinat Khaziev,Emine Yilmaz,Gokhan Tur,Dilek Hakkani-Tür,Hari Thadakamalla

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

关键词:maintain long-horizon context, large language models, coordinate interleaved goals, enabled conversational agents, Recent advances

备注

点击查看摘要

Abstract:Recent advances in task-oriented dialogue (TOD) systems, driven by large language models (LLMs) with extensive API and tool integration, have enabled conversational agents to coordinate interleaved goals, maintain long-horizon context, and act proactively through asynchronous execution. These capabilities extend beyond traditional TOD systems, yet existing benchmarks lack systematic support for evaluating such agentic behaviors. To address this gap, we introduce ATOD, a benchmark and synthetic dialogue generation pipeline that produces richly annotated conversations requiring long-term reasoning. ATOD captures key characteristics of advanced TOD, including multi-goal coordination, dependency management, memory, adaptability, and proactivity. Building on ATOD, we propose ATOD-Eval, a holistic evaluation framework that translates these dimensions into fine-grained metrics and supports reproducible offline and online evaluation. We further present a strong agentic memory-based evaluator for benchmarking on ATOD. Experiments show that ATOD-Eval enables comprehensive assessment across task completion, agentic capability, and response quality, and that the proposed evaluator offers a better accuracy-efficiency tradeoff compared to existing memory- and LLM-based approaches under this evaluation setting.

214. 【2601.11846】he Third VoicePrivacy Challenge: Preserving Emotional Expressiveness and Linguistic Content in Voice Anonymization

链接https://arxiv.org/abs/2601.11846

作者:Natalia Tomashenko,Xiaoxiao Miao,Pierre Champion,Sarina Meyer,Michele Panariello,Xin Wang,Nicholas Evans,Emmanuel Vincent,Junichi Yamagishi,Massimiliano Todisco

类目:Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

关键词:voice anonymization technologies, advancing voice anonymization, speaker voice identity, voice anonymization, present results

备注: under review

点击查看摘要

Abstract:We present results and analyses from the third VoicePrivacy Challenge held in 2024, which focuses on advancing voice anonymization technologies. The task was to develop a voice anonymization system for speech data that conceals a speaker's voice identity while preserving linguistic content and emotional state. We provide a systematic overview of the challenge framework, including detailed descriptions of the anonymization task and datasets used for both system development and evaluation. We outline the attack model and objective evaluation metrics for assessing privacy protection (concealing speaker voice identity) and utility (content and emotional state preservation). We describe six baseline anonymization systems and summarize the innovative approaches developed by challenge participants. Finally, we provide key insights and observations to guide the design of future VoicePrivacy challenges and identify promising directions for voice anonymization research.

215. 【2601.11819】Weddit : A Dataset of Triggering Stories Predominantly Shared by Women on Reddit

链接https://arxiv.org/abs/2601.11819

作者:Shirlene Rose Bandela,Sanjeev Parthasarathy,Vaibhav Garg

类目:Computation and Language (cs.CL)

关键词:sexual violence, Reddit, Abstract, sexual, violence

备注: 11 pages, 12 figures, 7 tables

点击查看摘要

Abstract:Warning: This paper may contain examples and topics that may be disturbing to some readers, especially survivors of miscarriage and sexual violence. People affected by abortion, miscarriage, or sexual violence often share their experiences on social media to express emotions and seek support. On public platforms like Reddit, where users can post long, detailed narratives (up to 40,000 characters), readers may be exposed to distressing content. Although Reddit allows manual trigger warnings, many users omit them due to limited awareness or uncertainty about which categories apply. There is scarcity of datasets on Reddit stories labeled for triggering experiences. We propose a curated Reddit dataset, TWeddit, covering triggering experiences related to issues majorly faced by women. Our linguistic analyses show that annotated stories in TWeddit express distinct topics and moral foundations, making the dataset useful for a wide range of future research.

216. 【2601.11792】A self-evolving multi-role collaborative framework with fine-grained difficulty guidance for innovative mathematical problem generation

链接https://arxiv.org/abs/2601.11792

作者:Yifei Sun,Yongan Li,A.K. Qin,Sicheng Hou,Tamas Pflanzner

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:significant research direction, Mathematical problem generation, Mathematical problem, intelligent education, significant research

备注

点击查看摘要

Abstract:Mathematical problem generation (MPG) is a significant research direction in the field of intelligent education. In recent years, the rapid development of large language models (LLMs) has enabled new technological approaches to problem-generation tasks. Although existing LLMs can achieve high correctness rates, they generally lack innovation and exhibit poor discrimination. In this paper, we propose the task of innovative math problem generation (IMPG). To solve the IMPG task, this paper proposes a self-evolving, multi-role collaborative framework with fine-grained difficulty guidance. First, a multi-role collaborative mechanism comprising a sampler, generator, evaluator, state machine, and memory is constructed, ensuring the correctness of generated problems through iterative optimization informed by self-assessment and external feedback. Second, we introduce an improved difficulty model to quantify difficulty and provide fine-grained guidance. We adopt the data-driven association-guided path sampling (DAPS) algorithm to enhance the semantic rationality of sampled encodings. Third, we construct the HSM3K-CN dataset, which comprises high-quality high school math problems. A multi-stage training pipeline is adopted, incorporating continual pre-training (CPT), supervised fine-tuning (SFT), and group relative policy optimization (GRPO), to enhance the generation and evaluation capabilities of the base model. Finally, system self-evolution is achieved by transferring evaluation capabilities from the expert model to the apprentice model via distillation. Experiments show that, compared to baseline models, our proposed method significantly improves the innovation of the generated problems while maintaining a high correctness rate.

217. 【2601.11791】Beyond Tokens: Concept-Level Training Objectives for LLMs

链接https://arxiv.org/abs/2601.11791

作者:Laya Iyer,Pranav Somani,Alice Guo,Dan Jurafsky,Chen Shani

类目:Computation and Language (cs.CL)

关键词:modern large language, large language models, driving advances, fluency and generalization, development of modern

备注

点击查看摘要

Abstract:The next-token prediction (NTP) objective has been foundational in the development of modern large language models (LLMs), driving advances in fluency and generalization. However, NTP operates at the \textit{token} level, treating deviations from a single reference continuation as errors even when alternative continuations are equally plausible or semantically equivalent (e.g., ``mom'' vs. ``mother''). As a result, token-level loss can penalize valid abstractions, paraphrases, or conceptually correct reasoning paths, biasing models toward surface form rather than underlying meaning. This mismatch between the training signal and semantic correctness motivates learning objectives that operate over higher-level representations. We propose a shift from token-level to concept-level prediction, where concepts group multiple surface forms of the same idea (e.g., ``mom,'' ``mommy,'' ``mother'' $\rightarrow$ \textit{MOTHER}). We introduce various methods for integrating conceptual supervision into LLM training and show that concept-aware models achieve lower perplexity, improved robustness under domain shift, and stronger performance than NTP-based models on diverse NLP benchmarks. This suggests \textit{concept-level supervision} as an improved training signal that better aligns LLMs with human semantic abstractions.

218. 【2601.11783】he Stability Trap: Evaluating the Reliability of LLM-Based Instruction Adherence Auditing

链接https://arxiv.org/abs/2601.11783

作者:Murtuza N. Shergadwala

类目:oftware Engineering (cs.SE); Computation and Language (cs.CL)

关键词:Human Resources, reproducible auditing mechanisms, Large Language Model, governance of Generative, stability

备注

点击查看摘要

Abstract:The enterprise governance of Generative AI (GenAI) in regulated sectors, such as Human Resources (HR), demands scalable yet reproducible auditing mechanisms. While Large Language Model (LLM)-as-a-Judge approaches offer scalability, their reliability in evaluating adherence of different types of system instructions remains unverified. This study asks: To what extent does the instruction type of an Application Under Test (AUT) influence the stability of judge evaluations? To address this, we introduce the Scoped Instruction Decomposition Framework to classify AUT instructions into Objective and Subjective types, isolating the factors that drive judge instability. We applied this framework to two representative HR GenAI applications, evaluating the stability of four judge architectures over variable runs. Our results reveal a ``Stability Trap'' characterized by a divergence between Verdict Stability and Reasoning Stability. While judges achieved near-perfect verdict agreement ($99\%$) for both objective and subjective evaluations, their accompanying justification traces diverged significantly. Objective instructions requiring quantitative analysis, such as word counting, exhibited reasoning stability as low as $\approx19\%$, driven by variances in numeric justifications. Similarly, reasoning stability for subjective instructions varied widely ($35\%$--$83\%$) based on evidence granularity, with feature-specific checks failing to reproduce consistent rationale. Conversely, objective instructions focusing on discrete entity extraction achieved high reasoning stability ($90\%$). These findings demonstrate that high verdict stability can mask fragile reasoning. Thus, we suggest that auditors scope automated evaluation protocols strictly: delegate all deterministically verifiable logic to code, while reserving LLM judges for complex semantic evaluation.

219. 【2601.11778】ranslation as a Scalable Proxy for Multilingual Evaluation

链接https://arxiv.org/abs/2601.11778

作者:Sheriff Issaka,Erick Rosas Gonzalez,Lieqi Liu,Evans Kofi Agyei,Lucas Bandarkar,Nanyun Peng,David Ifeoluwa Adelani,Francisco Guzmán,Saadia Gabriel

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:critical evaluation paradox, claim multilingual proficiency, LLMs claim multilingual, empirical void, LLMs claim

备注

点击查看摘要

Abstract:The rapid proliferation of LLMs has created a critical evaluation paradox: while LLMs claim multilingual proficiency, comprehensive non-machine-translated benchmarks exist for fewer than 30 languages, leaving 98% of the world's 7,000 languages in an empirical void. Traditional benchmark construction faces scaling challenges such as cost, scarcity of domain experts, and data contamination. We evaluate the validity of a simpler alternative: can translation quality alone indicate a model's broader multilingual capabilities? Through systematic evaluation of 14 models (1B-72B parameters) across 9 diverse benchmarks and 7 translation metrics, we find that translation performance is a good indicator of downstream task success (e.g., Phi-4, median Pearson r: MetricX = 0.89, xCOMET = 0.91, SSA-COMET = 0.87). These results suggest that the representational abilities supporting faithful translation overlap with those required for multilingual understanding. Translation quality, thus emerges as a strong, inexpensive first-pass proxy of multilingual performance, enabling a translation-first screening with targeted follow-up for specific tasks.

220. 【2601.11776】Cleansing the Artificial Mind: A Self-Reflective Detoxification Framework for Large Language Models

链接https://arxiv.org/abs/2601.11776

作者:Kaituo Zhang,Zhimeng Jiang,Na Zou

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:revealed remarkable generative, remarkable generative capabilities, Large Language Models, Recent breakthroughs, emerging self-regulatory mechanisms

备注

点击查看摘要

Abstract:Recent breakthroughs in Large Language Models (LLMs) have revealed remarkable generative capabilities and emerging self-regulatory mechanisms, including self-correction and self-rewarding. However, current detoxification techniques rarely exploit these built-in abilities; instead, they rely on external modules, labor-intensive data annotation, or human intervention --factors that hinder scalability and consistency. In this paper, we introduce a fully self-reflective detoxification framework that harnesses the inherent capacities of LLMs to detect, correct toxic content, and refine LLMs without external modules and data annotation. Specifically, we propose a Toxic Signal Detector --an internal self-identification mechanism, coupled with a systematic intervention process to transform toxic text into its non-toxic counterpart. This iterative procedure yields a contrastive detoxification dataset used to fine-tune the model, enhancing its ability for safe and coherent text generation. Experiments on benchmark datasets such as DetoxLLM and ParaDetox show that our method achieves better detoxification performance than state-of-the-art methods while preserving semantic fidelity. By obviating the need for human intervention or external components, this paper reveals the intrinsic self-detoxification ability of LLMs, offering a consistent and effective approach for mitigating harmful content generation. Ultimately, our findings underscore the potential for truly self-regulated language models, paving the way for more responsible and ethically guided text generation systems.

221. 【2601.11762】Industry-Aligned Granular Topic Modeling

链接https://arxiv.org/abs/2601.11762

作者:Sae Young Moon,Myeongjun Erik Jang,Haoyan Luo,Chunyang Xiao,Antonios Georgiadis,Fran Silavong

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Topic modeling, topic modeling methods, TIDE topic modeling, text mining, mining and data

备注

点击查看摘要

Abstract:Topic modeling has extensive applications in text mining and data analysis across various industrial sectors. Although the concept of granularity holds significant value for business applications by providing deeper insights, the capability of topic modeling methods to produce granular topics has not been thoroughly explored. In this context, this paper introduces a framework called TIDE, which primarily provides a novel granular topic modeling method based on large language models (LLMs) as a core feature, along with other useful functionalities for business applications, such as summarizing long documents, topic parenting, and distillation. Through extensive experiments on a variety of public and real-world business datasets, we demonstrate that TIDE's topic modeling approach outperforms modern topic modeling methods, and our auxiliary components provide valuable support for dealing with industrial business scenarios. The TIDE framework is currently undergoing the process of being open sourced.

222. 【2601.11758】Early Linguistic Pattern of Anxiety from Social Media Using Interpretable Linguistic Features: A Multi-Faceted Validation Study with Author-Disjoint Evaluation

链接https://arxiv.org/abs/2601.11758

作者:Arnab Das Utsa

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Anxiety affects hundreds, screening remains limited, large-scale screening remains, individuals globally, remains limited

备注: 9 figures, more than 1o pages

点击查看摘要

Abstract:Anxiety affects hundreds of millions of individuals globally, yet large-scale screening remains limited. Social media language provides an opportunity for scalable detection, but current models often lack interpretability, keyword-robustness validation, and rigorous user-level data integrity. This work presents a transparent approach to social media-based anxiety detection through linguistically interpretable feature-grounded modeling and cross-domain validation. Using a substantial dataset of Reddit posts, we trained a logistic regression classifier on carefully curated subreddits for training, validation, and test splits. Comprehensive evaluation included feature ablation, keyword masking experiments, and varying-density difference analyses comparing anxious and control groups, along with external validation using clinically interviewed participants with diagnosed anxiety disorders. The model achieved strong performance while maintaining high accuracy even after sentiment removal or keyword masking. Early detection using minimal post history significantly outperformed random classification, and cross-domain analysis demonstrated strong consistency with clinical interview data. Results indicate that transparent linguistic features can support reliable, generalizable, and keyword-robust anxiety detection. The proposed framework provides a reproducible baseline for interpretable mental health screening across diverse online contexts.

223. 【2601.11746】LIME-LLM: Probing Models with Fluent Counterfactuals, Not Broken Text

链接https://arxiv.org/abs/2601.11746

作者:George Mihaila,Suleyman Olcay Polat,Poli Nemkova,Himanshu Sharma,Namratha V. Urs,Mark V. Albert

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:random token masking, remain fundamental, token masking, Large Language Models, fundamental to trustworthy

备注

点击查看摘要

Abstract:Local explanation methods such as LIME (Ribeiro et al., 2016) remain fundamental to trustworthy AI, yet their application to NLP is limited by a reliance on random token masking. These heuristic perturbations frequently generate semantically invalid, out-of-distribution inputs that weaken the fidelity of local surrogate models. While recent generative approaches such as LLiMe (Angiulli et al., 2025b) attempt to mitigate this by employing Large Language Models for neighborhood generation, they rely on unconstrained paraphrasing that introduces confounding variables, making it difficult to isolate specific feature contributions. We introduce LIME-LLM, a framework that replaces random noise with hypothesis-driven, controlled perturbations. By enforcing a strict "Single Mask-Single Sample" protocol and employing distinct neutral infill and boundary infill strategies, LIME-LLM constructs fluent, on-manifold neighborhoods that rigorously isolate feature effects. We evaluate our method against established baselines (LIME, SHAP, Integrated Gradients) and the generative LLiMe baseline across three diverse benchmarks: CoLA, SST-2, and HateXplain using human-annotated rationales as ground truth. Empirical results demonstrate that LIME-LLM establishes a new benchmark for black-box NLP explainability, achieving significant improvements in local explanation fidelity compared to both traditional perturbation-based methods and recent generative alternatives.

224. 【2601.11739】Bridging Human Interpretation and Machine Representation: A Landscape of Qualitative Data Analysis in the LLM Era

链接https://arxiv.org/abs/2601.11739

作者:Xinyu Pi,Qisen Yang,Chuong Nguyen,Hua Shen

类目:Computation and Language (cs.CL)

关键词:support qualitative research, existing systems produce, systems produce outputs, system models, LLMs are increasingly

备注

点击查看摘要

Abstract:LLMs are increasingly used to support qualitative research, yet existing systems produce outputs that vary widely--from trace-faithful summaries to theory-mediated explanations and system models. To make these differences explicit, we introduce a 4$\times$4 landscape crossing four levels of meaning-making (descriptive, categorical, interpretive, theoretical) with four levels of modeling (static structure, stages/timelines, causal pathways, feedback dynamics). Applying the landscape to prior LLM-based automation highlights a strong skew toward low-level meaning and low-commitment representations, with few reliable attempts at interpretive/theoretical inference or dynamical modeling. Based on the revealed gap, we outline an agenda for applying and building LLM-systems that make their interpretive and modeling commitments explicit, selectable, and governable.

225. 【2601.11722】RAC: Retrieval-Augmented Clarification for Faithful Conversational Search

链接https://arxiv.org/abs/2601.11722

作者:Ahmed Rayane Kebir,Vincent Guigue,Lynda Said Lhadj,Laure Soulier

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:underspecified user queries, search systems resolve, systems resolve ambiguous, conversational search systems, conversational search

备注: This is the author's version of the work. The definitive version is published in: Proceedings of the 48th European Conference on Information Retrieval (ECIR '26), 29 March--2 April, 2026, Delft, Netherlands

点击查看摘要

Abstract:Clarification questions help conversational search systems resolve ambiguous or underspecified user queries. While prior work has focused on fluency and alignment with user intent, especially through facet extraction, much less attention has been paid to grounding clarifications in the underlying corpus. Without such grounding, systems risk asking questions that cannot be answered from the available documents. We introduce RAC (Retrieval-Augmented Clarification), a framework for generating corpus-faithful clarification questions. After comparing several indexing strategies for retrieval, we fine-tune a large language model to make optimal use of research context and to encourage the generation of evidence-based question. We then apply contrastive preference optimization to favor questions supported by retrieved passages over ungrounded alternatives. Evaluated on four benchmarks, RAC demonstrate significant improvements over baselines. In addition to LLM-as-Judge assessments, we introduce novel metrics derived from NLI and data-to-text to assess how well questions are anchored in the context, and we demonstrate that our approach consistently enhances faithfulness.

226. 【2601.11658】owards AGI A Pragmatic Approach Towards Self Evolving Agent

链接https://arxiv.org/abs/2601.11658

作者:Indrajit Kar,Sammy Zonunpuia,Zonunfeli Ralte

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Model, Large Language, Language Model, autonomously expand capabilities, static after deployment

备注

点击查看摘要

Abstract:Large Language Model (LLM) based agents are powerful yet fundamentally static after deployment, lacking the ability to autonomously expand capabilities, generate new tools, or evolve their reasoning. This work introduces a hierarchical self-evolving multi-agent framework that integrates a Base LLM, an operational SLM agent, a Code-Generation LLM, and a Teacher-LLM to enable continuous adaptation. The workflow begins with the agent attempting a task using reasoning and existing tools; if unsuccessful, it escalates to tool synthesis through the Code-Gen LLM, and when failures persist, it triggers an evolution phase using Curriculum Learning (CL), Reward-Based Learning (RL), or Genetic Algorithm (GA) evolution. Using the TaskCraft dataset rich in hierarchical tasks, tool-use traces, and difficulty scaling we evaluate these paradigms. CL delivers fast recovery and strong generalization, RL excels on high-difficulty tasks, and GA offers high behavioral diversity. Across all settings, evolved agents outperform their originals, demonstrating robust, autonomous, self-improving agentic evolution.

227. 【2601.11655】Advances and Frontiers of LLM-based Issue Resolution in Software Engineering: A Comprehensive Survey

链接https://arxiv.org/abs/2601.11655

作者:Caihua Li,Lianghong Guo,Yanlin Wang,Daya Guo,Wei Tao,Zhenyu Shan,Mingwei Liu,Jiachi Chen,Haoyu Song,Duyu Tang,Hongyu Zhang,Zibin Zheng

类目:oftware Engineering (cs.SE); Computation and Language (cs.CL)

关键词:complex Software Engineering, Software Engineering, Issue resolution, complex Software, real-world development

备注: 26 pages, 4 figures, 5 tables

点击查看摘要

Abstract:Issue resolution, a complex Software Engineering (SWE) task integral to real-world development, has emerged as a compelling challenge for artificial intelligence. The establishment of benchmarks like SWE-bench revealed this task as profoundly difficult for large language models, thereby significantly accelerating the evolution of autonomous coding agents. This paper presents a systematic survey of this emerging domain. We begin by examining data construction pipelines, covering automated collection and synthesis approaches. We then provide a comprehensive analysis of methodologies, spanning training-free frameworks with their modular components to training-based techniques, including supervised fine-tuning and reinforcement learning. Subsequently, we discuss critical analyses of data quality and agent behavior, alongside practical applications. Finally, we identify key challenges and outline promising directions for future research. An open-source repository is maintained at this https URL to serve as a dynamic resource in this field.

228. 【2601.11616】Mixture-of-Experts as Soft Clustering: A Dual Jacobian-PCA Spectral Geometry Perspective

链接https://arxiv.org/abs/2601.11616

作者:Feilong Liu

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:remains poorly characterized, representations remains poorly, poorly characterized, commonly motivated, motivated by efficiency

备注

点击查看摘要

Abstract:Mixture-of-Experts (MoE) architectures are commonly motivated by efficiency and conditional computation, but their effect on the geometry of learned functions and representations remains poorly characterized. In this work, we study MoEs through a geometric lens, interpreting routing as a form of soft partitioning of the representation space into overlapping local charts. We introduce a Dual Jacobian-PCA Spectral Geometry probe. It analyzes local function geometry via Jacobian singular-value spectra and representation geometry via weighted PCA of routed hidden states. Using a controlled MLP-MoE setting that permits exact Jacobian computation, we compare dense, Top-k, and fully-soft routing architectures under matched capacity. Across random seeds, we observe that MoE routing consistently reduces local sensitivity, with expert-local Jacobians exhibiting smaller leading singular values and faster spectral decay than dense baselines. At the same time, weighted PCA reveals that expert-local representations distribute variance across a larger number of principal directions, indicating higher effective rank under identical input distributions. We further find that average expert Jacobians are nearly orthogonal, suggesting a decomposition of the transformation into low-overlap expert-specific subspaces rather than scaled variants of a shared map. We analyze how routing sharpness modulates these effects, showing that Top-k routing produces lower-rank, more concentrated expert-local structure, while fully-soft routing yields broader, higher-rank representations. Together, these results support a geometric interpretation of MoEs as soft partitionings of function space that flatten local curvature while redistributing representation variance.

229. 【2601.11585】Entropic Context Shaping: Information-Theoretic Filtering for Context-Aware LLM Agents

链接https://arxiv.org/abs/2601.11585

作者:Hyunjun Kim

类目:Computation and Language (cs.CL)

关键词:agents requires distinguishing, requires distinguishing pragmatically, large language model, Entropic Context Shaping, agents requires

备注

点击查看摘要

Abstract:Context engineering for large language model (LLM) agents requires distinguishing pragmatically useful information from misleading distractors. We introduce Entropic Context Shaping (ECS), an information-theoretic framework that measures context utility via the shift in the model's answer distribution toward the correct answer. Unlike lexical similarity methods that rely on word overlap, ECS captures pragmatic utility -- whether a passage actually helps answer the question. We formalize utility as the signed change in answer probability and provide theoretical analysis showing that task-irrelevant updates yield near-zero distribution shift. We evaluate on multi-turn context selection tasks using LongMemEval (session-level) and LoCoMo (turn-level) benchmarks. On fine-grained turn selection, ECS with Llama-3.1-8B achieves F1=0.265, a 71.83% relative improvement over TF-IDF (F1=0.154), demonstrating that pragmatic utility outperforms lexical similarity when precise context selection matters. Code and data are available in the supplementary materials.

230. 【2601.11582】Overview of the SciHigh Track at FIRE 2025: Research Highlight Generation from Scientific Papers

链接https://arxiv.org/abs/2601.11582

作者:Tohida Rehman,Debarshi Kumar Sanyal,Samiran Chattopadhyay

类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Scientific Papers' focuses, Research Highlight Generation, Papers' focuses, meaningful bullet-point highlights, bullet-point highlights directly

备注: 7 pages, 2 tables

点击查看摘要

Abstract:`SciHigh: Research Highlight Generation from Scientific Papers' focuses on the task of automatically generating concise, informative, and meaningful bullet-point highlights directly from scientific abstracts. The goal of this task is to evaluate how effectively computational models can generate highlights that capture the key contributions, findings, and novelty of a paper in a concise form. Highlights help readers grasp essential ideas quickly and are often easier to read and understand than longer paragraphs, especially on mobile devices. The track uses the MixSub dataset \cite{10172215}, which provides pairs of abstracts and corresponding author-written highlights. In this inaugural edition of the track, 12 teams participated, exploring various approaches, including pre-trained language models, to generate highlights from this scientific dataset. All submissions were evaluated using established metrics such as ROUGE, METEOR, and BERTScore to measure both alignment with author-written highlights and overall informativeness. Teams were ranked based on ROUGE-L scores. The findings suggest that automatically generated highlights can reduce reading effort, accelerate literature reviews, and enhance metadata for digital libraries and academic search platforms. SciHigh provides a dedicated benchmark for advancing methods aimed at concise and accurate highlight generation from scientific writing.

Comments:
7 pages, 2 tables

Subjects:

Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Cite as:
arXiv:2601.11582 [cs.CY]

(or
arXiv:2601.11582v1 [cs.CY] for this version)

https://doi.org/10.48550/arXiv.2601.11582

Focus to learn more

              arXiv-issued DOI via DataCite</p>
231. 【2601.11581】Enhancing the QA Model through a Multi-domain Debiasing Framework

链接https://arxiv.org/abs/2601.11581

作者:Yuefeng Wang,ChangJae Lee

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:machine reading comprehension, Stanford Question Answering, Question Answering Dataset, hinder their performance, advanced significantly

备注: 5 pages, 7 tables

点击查看摘要

Abstract:Question-answering (QA) models have advanced significantly in machine reading comprehension but often exhibit biases that hinder their performance, particularly with complex queries in adversarial conditions. This study evaluates the ELECTRA-small model on the Stanford Question Answering Dataset (SQuAD) v1.1 and adversarial datasets AddSent and AddOneSent. By identifying errors related to lexical bias, numerical reasoning, and entity recognition, we develop a multi-domain debiasing framework incorporating knowledge distillation, debiasing techniques, and domain expansion. Our results demonstrate up to 2.6 percentage point improvements in Exact Match (EM) and F1 scores across all test sets, with gains in adversarial contexts. These findings highlight the potential of targeted bias mitigation strategies to enhance the robustness and reliability of natural language understanding systems.

232. 【2601.11580】Speculative Decoding: Performance or Illusion?

链接https://arxiv.org/abs/2601.11580

作者:Xiaoxuan Liu,Jiaxiang Yu,Jongseok Park,Ion Stoica,Alvin Cheung

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:accelerate Large Language, Large Language Model, real-world effectiveness remains, effectiveness remains unclear, prior evaluations rely

备注

点击查看摘要

Abstract:Speculative decoding (SD) has become a popular technique to accelerate Large Language Model (LLM) inference, yet its real-world effectiveness remains unclear as prior evaluations rely on research prototypes and unrealistically small batch sizes. We present, to our knowledge, the first systematic study of SD on a production-grade and widely deployed inference engine (vLLM), covering multiple SD variants ($n$-gram, EAGLE/EAGLE-3, Draft-Model, Multi-Token Prediction) across diverse workloads, model scales, and batch sizes. We analyze key factors governing SD performance, and quantify a theoretical upper bound on SD speedup. Our results show that verification by the target model dominates the execution, while acceptance length varies markedly across output token positions, requests, and datasets. Comparing measured performance with theoretical bounds reveals substantial gaps between observed and theoretical upper bounds, and we leverage this observation to highlight new research opportunities that our study opens up in improving SD.

233. 【2601.11579】Bielik 11B v3: Multilingual Large Language Model for European Languages

链接https://arxiv.org/abs/2601.11579

作者:Krzysztof Ociepa,Łukasz Flis,Remigiusz Kinas,Krzysztof Wróbel,Adrian Gwoździej

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Direct Preference Optimization, maintaining strong capabilities, model highly optimized, European languages, highly optimized

备注

点击查看摘要

Abstract:We present Bielik 11B v3, a state-of-the-art language model highly optimized for the Polish language, while also maintaining strong capabilities in other European languages. This model extends the Mistral 7B v0.2 architecture, scaled to 11B parameters via depth up-scaling. Its development involved a comprehensive four-stage training pipeline: continuous pre-training, supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and reinforcement learning. Comprehensive evaluations demonstrate that Bielik 11B v3 achieves exceptional performance. It significantly surpasses other specialized Polish language models and outperforms many larger models (with 2-6 times more parameters) on a wide range of tasks, from basic linguistic understanding to complex reasoning. The model's parameter efficiency, combined with extensive quantization options, allows for effective deployment across diverse hardware configurations. Bielik 11B v3 not only advances AI capabilities for the Polish language but also establishes a new benchmark for developing resource-efficient, high-performance models for less-represented languages.

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

ACMclasses:
I.2.7

Cite as:
arXiv:2601.11579 [cs.CL]

(or
arXiv:2601.11579v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2601.11579

Focus to learn more

              arXiv-issued DOI via DataCite</p>
234. 【2601.11578】LimAgents: Multi-Agent LLMs for Generating Research Limitations

链接https://arxiv.org/abs/2601.11578

作者:Ibrahim Al Azher,Zhishuai Guo,Hamed Alhoori

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:rigorous scientific research, Identifying and articulating, scientific research, essential for transparent, transparent and rigorous

备注: 18 Pages, 9 figures

点击查看摘要

Abstract:Identifying and articulating limitations is essential for transparent and rigorous scientific research. However, zero-shot large language models (LLMs) approach often produce superficial or general limitation statements (e.g., dataset bias or generalizability). They usually repeat limitations reported by authors without looking at deeper methodological issues and contextual gaps. This problem is made worse because many authors disclose only partial or trivial limitations. We propose LimAgents, a multi-agent LLM framework for generating substantive limitations. LimAgents integrates OpenReview comments and author-stated limitations to provide stronger ground truth. It also uses cited and citing papers to capture broader contextual weaknesses. In this setup, different agents have specific roles as sequential role: some extract explicit limitations, others analyze methodological gaps, some simulate the viewpoint of a peer reviewer, and a citation agent places the work within the larger body of literature. A Judge agent refines their outputs, and a Master agent consolidates them into a clear set. This structure allows for systematic identification of explicit, implicit, peer review-focused, and literature-informed limitations. Moreover, traditional NLP metrics like BLEU, ROUGE, and cosine similarity rely heavily on n-gram or embedding overlap. They often overlook semantically similar limitations. To address this, we introduce a pointwise evaluation protocol that uses an LLM-as-a-Judge to measure coverage more accurately. Experiments show that LimAgents substantially improve performance. The RAG + multi-agent GPT-4o mini configuration achieves a +15.51% coverage gain over zero-shot baselines, while the Llama 3 8B multi-agent setup yields a +4.41% improvement.

235. 【2601.11575】Concept Attractors in LLMs and their Applications

链接https://arxiv.org/abs/2601.11575

作者:Sotirios Panagiotis Chytas,Vikas Singh

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, forms differ widely, map semantically related, semantically related prompts, similar internal representations

备注

点击查看摘要

Abstract:Large language models (LLMs) often map semantically related prompts to similar internal representations at specific layers, even when their surface forms differ widely. We show that this behavior can be explained through Iterated Function Systems (IFS), where layers act as contractive mappings toward concept-specific Attractors. We leverage this insight and develop simple, training-free methods that operate directly on these Attractors to solve a wide range of practical tasks, including language translation, hallucination reduction, guardrailing, and synthetic data generation. Despite their simplicity, these Attractor-based interventions match or exceed specialized baselines, offering an efficient alternative to heavy fine-tuning, generalizable in scenarios where baselines underperform.

236. 【2601.11573】An Empirical Analysis of Fine-Tuning Large Language Models on Bioinformatics Literature: PRSGPT and BioStarsGPT

链接https://arxiv.org/abs/2601.11573

作者:Muhammad Muneeb,David B. Ascher

类目:Computation and Language (cs.CL)

关键词:Large language models, lack specialized knowledge, Large language, knowledge for complex, Google Gemini

备注

点击查看摘要

Abstract:Large language models (LLMs) often lack specialized knowledge for complex bioinformatics applications. We present a reproducible pipeline for fine-tuning LLMs on specialized bioinformatics data, demonstrated through two use cases: PRSGPT, focused on polygenic risk score (PRS) tools, and BioStarsGPT, trained on community forum discussions. The nine-step pipeline integrates diverse data sources, structured preprocessing, prompt-based question-answer (QA) generation (via Google Gemini), natural language inference (NLI) for quality control, semantic deduplication, clustering-based data splitting, and parameter-efficient fine-tuning using LoRA. We fine-tuned three LLMs (LLaMA-3.2-3B, Qwen2.5-7B, Gemma) and benchmarked them on over 14 lexical and semantic metrics. Qwen2.5-7B emerged as the best performer, with BLEU-4 and ROUGE-1 improvements of 82\% and 70\% for PRSGPT and 6\% and 18\% for BioStarsGPT, respectively. The open-source datasets produced include over 28,000 QA pairs for PRSGPT and 154,282 for BioStarsGPT. Human evaluation of PRSGPT yielded 61.9\% accuracy on the PRS tools comparison task, comparable to Google Gemini (61.4\%), but with richer methodological detail and accurate citations. BioStarsGPT demonstrated 59\% conceptual accuracy across 142 curated bioinformatics questions. Our pipeline enables scalable, domain-specific fine-tuning of LLMs. It enables privacy-preserving, locally deployable bioinformatics assistants, explores their practical applications, and addresses the challenges, limitations, and mitigation strategies associated with their development and use.

237. 【2601.11568】AdaFRUGAL: Adaptive Memory-Efficient Training with Dynamic Control

链接https://arxiv.org/abs/2601.11568

作者:Quang-Hung Bui,Anh Son Ta

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large Language Models, Training Large Language, Language Models, Large Language, highly memory-intensive due

备注

点击查看摘要

Abstract:Training Large Language Models (LLMs) is highly memory-intensive due to optimizer state overhead. The FRUGAL framework mitigates this with gradient splitting, but its static hyperparameters -- the subspace ratio ($\rho$) and update frequency ($T$) -- require costly manual tuning, limiting adaptability. We present AdaFRUGAL, which automates this process by introducing two dynamic controls: (i) a linear decay for $\rho$ to progressively reduce memory, and (ii) a loss-aware schedule for $T$ to lower computational overhead. Experiments across large-scale pre-training (English C4, Vietnamese VietVault) and fine-tuning (GLUE) demonstrate that AdaFRUGAL achieves a compelling trade-off. It maintains competitive performance against AdamW and static FRUGAL while significantly reducing both GPU memory and training time, offering a more practical, autonomous solution for resource-constrained LLM training.

238. 【2601.11567】Measuring Stability Beyond Accuracy in Small Open-Source Medical Large Language Models for Pediatric Endocrinology

链接https://arxiv.org/abs/2601.11567

作者:Vanessa D'Amario,Randy Daniel,Alessandro Zanetti,Dhruv Edamadaka,Nitya Alaparthy,Joshua Tarkoff

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:offer promising opportunities, medical large language, Small open-source medical, open-source medical large, large language models

备注: 20 pages, 11 figures, accepted at 47 workshop Reproducible Artificial Intelligence (AAAI 2026, Singapore, January 27, 2026)

点击查看摘要

Abstract:Small open-source medical large language models (LLMs) offer promising opportunities for low-resource deployment and broader accessibility. However, their evaluation is often limited to accuracy on medical multiple choice question (MCQ) benchmarks, and lacks evaluation of consistency, robustness, or reasoning behavior. We use MCQ coupled to human evaluation and clinical review to assess six small open-source medical LLMs (HuatuoGPT-o1 (Chen 2024), Diabetica-7B, Diabetica-o1 (Wei 2024), Meditron3-8B (Sallinen2025), MedFound-7B (Liu 2025), and ClinicaGPT-base-zh (Wang 2023)) in pediatric endocrinology. In deterministic settings, we examine the effect of prompt variation on models' output and self-assessment bias. In stochastic settings, we evaluate output variability and investigate the relationship between consistency and correctness. HuatuoGPT-o1-8B achieved the highest performance. The results show that high consistency across the model response is not an indicator of correctness, although HuatuoGPT-o1-8B showed the highest consistency rate. When tasked with selecting correct reasoning, both HuatuoGPT-o1-8B and Diabetica-o1 exhibit self-assessment bias and dependency on the order of the candidate explanations. Expert review of incorrect reasoning rationales identified a mix of clinically acceptable responses and clinical oversight. We further show that system-level perturbations, such as differences in CUDA builds, can yield statistically significant shifts in model output despite stable accuracy. This work demonstrates that small, semantically negligible prompt perturbations lead to divergent outputs, raising concerns about reproducibility of LLM-based evaluations and highlights the output variability under different stochastic regimes, emphasizing the need of a broader diagnostic framework to understand potential pitfalls in real-world clinical decision support scenarios.

239. 【2601.11565】Compass-Embedding v4: Robust Contrastive Learning for Multilingual E-commerce Embeddings

链接https://arxiv.org/abs/2601.11565

作者:Pakorn Ueareeworakul,Shuman Liu,Jinghao Feng,Ling Hu,Zhantang Shi,Chengqi Sun,Liang Yao,Panyi Ouyang,Haibo Zhang,Anxiang Zeng

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:global e-commerce rapidly, e-commerce rapidly expands, high-quality semantic representations, emerging markets, search systems

备注

点击查看摘要

Abstract:As global e-commerce rapidly expands into emerging markets, the lack of high-quality semantic representations for low-resource languages has become a decisive bottleneck for retrieval, recommendation, and search systems. In this work, we present Compass-Embedding v4, a high-efficiency multilingual embedding framework specifically optimized for Southeast Asian (SEA) e-commerce scenarios, where data scarcity, noisy supervision, and strict production constraints jointly challenge representation learning. Compass-Embedding v4 addresses three core challenges. First, large-batch contrastive training under mixed task supervision introduces systematic false negatives that degrade semantic alignment. We propose Class-Aware Masking (CAM), a lightweight modification to the InfoNCE objective that suppresses invalid in-batch negatives and improves semantic discrimination without altering training efficiency. Second, low-resource SEA languages suffer from limited and uneven data coverage. We construct a diversified training corpus through context-grounded synthetic data generation, cross-lingual translation, and structured e-commerce data construction, enabling robust multilingual and domain-specific learning. Third, production deployment requires high-throughput inference while preserving embedding quality. We combine robustness-driven large-batch training with spherical model merging to mitigate catastrophic forgetting, and optimize inference via vLLM and FP8 quantization. Extensive evaluations across multilingual benchmarks and proprietary e-commerce tasks show that Compass-Embedding v4 achieves state-of-the-art performance on major SEA languages, significantly outperforming general-purpose embedding models in domain-specific retrieval and classification, while maintaining competitive performance on high-resource languages.

240. 【2601.11564】Context Discipline and Performance Correlation: Analyzing LLM Performance and Quality Degradation Under Varying Context Lengths

链接https://arxiv.org/abs/2601.11564

作者:Ahilan Ayyachamy Nadar Ponnusamy,Karthic Chandran,M Maruf Hossain

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, maximum context window, Large Language, Language Models, facilitate complex

备注: 22 pages, 6 figures

点击查看摘要

Abstract:The scaling trend in Large Language Models (LLMs) has prioritized increasing the maximum context window to facilitate complex, long-form reasoning and document analysis. However, managing this expanded context introduces severe computational overhead. This paper investigates the critical trade-off between system performance and model quality when dense transformer architectures--specifically Llama-3.1-70B and Qwen1.5-14B--are exposed to large volumes of irrelevant and distracting context. The research identifies a non-linear performance degradation tied to the growth of the Key-Value (KV) cache. Furthermore, an extended analysis of the Mixture-of-Experts (MoE) architecture reveals unique behavioral anomalies at varying context scales, suggesting that architectural benefits may be masked by infrastructure bottlenecks at high token volumes.

241. 【2601.11559】MIMIC-RD: Can LLMs differentially diagnose rare diseases in real-world clinical settings?

链接https://arxiv.org/abs/2601.11559

作者:Zilal Eiz AlDin,John Wu,Jeffrey Paul Fung,Jennifer King,Mya Watts,Lauren ONeill,Adam Richard Cross,Jimeng Sun

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:diagnosis remains challenging, differential diagnosis, rare diseases affecting, differential diagnosis remains, disease differential diagnosis

备注: 5 pages

点击查看摘要

Abstract:Despite rare diseases affecting 1 in 10 Americans, their differential diagnosis remains challenging. Due to their impressive recall abilities, large language models (LLMs) have been recently explored for differential diagnosis. Existing approaches to evaluating LLM-based rare disease diagnosis suffer from two critical limitations: they rely on idealized clinical case studies that fail to capture real-world clinical complexity, or they use ICD codes as disease labels, which significantly undercounts rare diseases since many lack direct mappings to comprehensive rare disease databases like Orphanet. To address these limitations, we explore MIMIC-RD, a rare disease differential diagnosis benchmark constructed by directly mapping clinical text entities to Orphanet. Our methodology involved an initial LLM-based mining process followed by validation from four medical annotators to confirm identified entities were genuine rare diseases. We evaluated various models on our dataset of 145 patients and found that current state-of-the-art LLMs perform poorly on rare disease differential diagnosis, highlighting the substantial gap between existing capabilities and clinical needs. From our findings, we outline several future steps towards improving differential diagnosis of rare diseases.

242. 【2601.11556】CSyMR: Benchmarking Compositional Symbolic Muisc Reasoning With MIR Tool Integration

链接https://arxiv.org/abs/2601.11556

作者:Boyang Wang,Yash Vishe,Xin Xu,Zachary Novack,Julian McAuley,Junda Wu

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

关键词:Large Language Models, Large Language, Language Models, connect musical structures, symbolic music reasoning

备注

点击查看摘要

Abstract:Large Language Models (LLMs) are leveraged in symbolic music reasoning, yet existing benchmarks emphasize isolated knowledge or atomic analyses rather than the integrative compositional reasoning needed to connect musical structures. To address this, we present the Compositional Symbolic Music Reasoning Benchmark (CSyMR-Bench), a curated multiple-choice dataset of 126 questions derived from expert forums and professional examinations. Each item involves combining several atomic analyses to arrive at the final answer. Furthermore, we introduce a tool-augmented agent framework that leverages symbolic music analysis tools from the music21 library to address the challenges posed by CSyMR-Bench. Experiments validate that CSyMR-Bench poses a non-trivial challenge across both community-sourced and exam-style questions, while our tool-augmented agent consistently outperforms all baselines, achieving 5-7% absolute accuracy gains.

243. 【2601.11544】Medication counseling with large language models: balancing flexibility and rigidity

链接https://arxiv.org/abs/2601.11544

作者:Joar Sabel,Mattias Wingren,Andreas Lundell,Sören Andersson,Sara Rosenberg,Susanne Hägglund,Linda Estman,Malin Andtfolk

类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:large language models, language models, introduction of large, large language, greatly enhanced

备注: Accepted for 2025 IEEE International Conference on Agentic AI (ICA). 14 pages, 2 figures

点击查看摘要

Abstract:The introduction of large language models (LLMs) has greatly enhanced the capabilities of software agents. Instead of relying on rule-based interactions, agents can now interact in flexible ways akin to humans. However, this flexibility quickly becomes a problem in fields where errors can be disastrous, such as in a pharmacy context, but the opposite also holds true; a system that is too inflexible will also lead to errors, as it can become too rigid to handle situations that are not accounted for. Work using LLMs in a pharmacy context have adopted a wide scope, accounting for many different medications in brief interactions -- our strategy is the opposite: focus on a more narrow and long task. This not only enables a greater understanding of the task at hand, but also provides insight into what challenges are present in an interaction of longer nature. The main challenge, however, remains the same for a narrow and wide system: it needs to strike a balance between adherence to conversational requirements and flexibility. In an effort to strike such a balance, we present a prototype system meant to provide medication counseling while juggling these two extremes. We also cover our design in constructing such a system, with a focus on methods aiming to fulfill conversation requirements, reduce hallucinations and promote high-quality responses. The methods used have the potential to increase the determinism of the system, while simultaneously not removing the dynamic conversational abilities granted by the usage of LLMs. However, a great deal of work remains ahead, and the development of this kind of system needs to involve continuous testing and a human-in-the-loop. It should also be evaluated outside of commonly used benchmarks for LLMs, as these do not adequately capture the complexities of this kind of conversational system.

244. 【2509.02908】Advancing Minority Stress Detection with Transformers: Insights from the Social Media Datasets

链接https://arxiv.org/abs/2509.02908

作者:Santosh Chapagain,Cory J Cascalheira,Shah Muhammad Hamdi,Soukaina Filali Boubrahimi,Jillian R. Scheer

类目:Computation and Language (cs.CL); Social and Information Networks (cs.SI)

关键词:groups experience disproportionately, experience disproportionately high, disproportionately high rates, mental disorders compared, gender minority groups

备注: Accepted in Social Network Analysis and Mining Journal (SNAM)

点击查看摘要

Abstract:Individuals from sexual and gender minority groups experience disproportionately high rates of poor health outcomes and mental disorders compared to their heterosexual and cisgender counterparts, largely as a consequence of minority stress as described by Meyer's (2003) model. This study presents the first comprehensive evaluation of transformer-based architectures for detecting minority stress in online discourse. We benchmark multiple transformer models including ELECTRA, BERT, RoBERTa, and BART against traditional machine learning baselines and graph-augmented variants. We further assess zero-shot and few-shot learning paradigms to assess their applicability on underrepresented datasets. Experiments are conducted on the two largest publicly available Reddit corpora for minority stress detection, comprising 12,645 and 5,789 posts, and are repeated over five random seeds to ensure robustness. Our results demonstrate that integrating graph structure consistently improves detection performance across transformer-only models and that supervised fine-tuning with relational context outperforms zero and few-shot approaches. Theoretical analysis reveals that modeling social connectivity and conversational context via graph augmentation sharpens the models' ability to identify key linguistic markers such as identity concealment, internalized stigma, and calls for support, suggesting that graph-enhanced transformers offer the most reliable foundation for digital health interventions and public health policy.

245. 【2601.12805】SciHorizon-GENE: Benchmarking LLM for Life Sciences Inference from Gene Knowledge to Functional Understanding

链接https://arxiv.org/abs/2601.12805

作者:Xiaohan Huang,Meng Xiao,Chuan Qin,Qingqing Long,Jinmiao Chen,Yuanchun Zhou,Hengshu Zhu

类目:Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large language models, shown growing promise, Large language, knowledge-driven interpretation tasks, shown growing

备注: 16 pages

点击查看摘要

Abstract:Large language models (LLMs) have shown growing promise in biomedical research, particularly for knowledge-driven interpretation tasks. However, their ability to reliably reason from gene-level knowledge to functional understanding, a core requirement for knowledge-enhanced cell atlas interpretation, remains largely underexplored. To address this gap, we introduce SciHorizon-GENE, a large-scale gene-centric benchmark constructed from authoritative biological databases. The benchmark integrates curated knowledge for over 190K human genes and comprises more than 540K questions covering diverse gene-to-function reasoning scenarios relevant to cell type annotation, functional interpretation, and mechanism-oriented analysis. Motivated by behavioral patterns observed in preliminary examinations, SciHorizon-GENE evaluates LLMs along four biologically critical perspectives: research attention sensitivity, hallucination tendency, answer completeness, and literature influence, explicitly targeting failure modes that limit the safe adoption of LLMs in biological interpretation pipelines. We systematically evaluate a wide range of state-of-the-art general-purpose and biomedical LLMs, revealing substantial heterogeneity in gene-level reasoning capabilities and persistent challenges in generating faithful, complete, and literature-grounded functional interpretations. Our benchmark establishes a systematic foundation for analyzing LLM behavior at the gene scale and offers insights for model selection and development, with direct relevance to knowledge-enhanced biological interpretation.

246. 【2601.12248】AQUA-Bench: Beyond Finding Answers to Knowing When There Are None in Audio Question Answering

链接https://arxiv.org/abs/2601.12248

作者:Chun-Yi Kuan,Hung-yi Lee

类目:Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)

关键词:audio-aware large language, shown strong performance, Recent advances, audio question answering, Audio Question Detection

备注: Accepted to ICASSP 2026. Project Website: [this https URL](https://kuan2jiu99.github.io/AQUA-Bench-demo/)

点击查看摘要

Abstract:Recent advances in audio-aware large language models have shown strong performance on audio question answering. However, existing benchmarks mainly cover answerable questions and overlook the challenge of unanswerable ones, where no reliable answer can be inferred from the audio. Such cases are common in real-world settings, where questions may be misleading, ill-posed, or incompatible with the information. To address this gap, we present AQUA-Bench, a benchmark for Audio Question Unanswerability Assessment. It systematically evaluates three scenarios: Absent Answer Detection (the correct option is missing), Incompatible Answer Set Detection (choices are categorically mismatched with the question), and Incompatible Audio Question Detection (the question is irrelevant or lacks sufficient grounding in the audio). By assessing these cases, AQUA-Bench offers a rigorous measure of model reliability and promotes the development of audio-language systems that are more robust and trustworthy. Our experiments suggest that while models excel on standard answerable tasks, they often face notable challenges with unanswerable ones, pointing to a blind spot in current audio-language understanding.

247. 【2506.06252】Lightweight Prompt Biasing for Contextualized End-to-End ASR Systems

链接https://arxiv.org/abs/2506.06252

作者:Bo Ren,Yu Shi,Jinyu Li

类目:Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)

关键词:Automatic Speech Recognition, Automatic Speech, Speech Recognition, struggles with rare, rare and domain-specific

备注

点击查看摘要

Abstract:End-to-End Automatic Speech Recognition (ASR) has advanced significantly yet still struggles with rare and domain-specific entities. This paper introduces a simple yet efficient prompt-based biasing technique for contextualized ASR, enhancing recognition accuracy by leverage a unified multitask learning framework. The approach comprises two key components: a prompt biasing model which is trained to determine when to focus on entities in prompt, and a entity filtering mechanism which efficiently filters out irrelevant entities. Our method significantly enhances ASR accuracy on entities, achieving a relative 30.7% and 18.0% reduction in Entity Word Error Rate compared to the baseline model with shallow fusion on in-house domain dataset with small and large entity lists, respectively. The primary advantage of this method lies in its efficiency and simplicity without any structure change, making it lightweight and highly efficient.

信息检索

1. 【2601.14245】XR: Cross-Modal Agents for Composed Image Retrieval

链接https://arxiv.org/abs/2601.14245

作者:Zhongyu Yang,Wei Pang,Yingfang Yuan

类目:Information Retrieval (cs.IR)

关键词:conventional similarity-based paradigms, demanding multimodal reasoning, demanding multimodal, similarity-based paradigms, redefined by agentic

备注: Accepted by WWW 2026. Project: [this https URL](https://01yzzyu.github.io/xr.github.io/)

点击查看摘要

Abstract:Retrieval is being redefined by agentic AI, demanding multimodal reasoning beyond conventional similarity-based paradigms. Composed Image Retrieval (CIR) exemplifies this shift as each query combines a reference image with textual modifications, requiring compositional understanding across modalities. While embedding-based CIR methods have achieved progress, they remain narrow in perspective, capturing limited cross-modal cues and lacking semantic reasoning. To address these limitations, we introduce XR, a training-free multi-agent framework that reframes retrieval as a progressively coordinated reasoning process. It orchestrates three specialized types of agents: imagination agents synthesize target representations through cross-modal generation, similarity agents perform coarse filtering via hybrid matching, and question agents verify factual consistency through targeted reasoning for fine filtering. Through progressive multi-agent coordination, XR iteratively refines retrieval to meet both semantic and visual query constraints, achieving up to a 38% gain over strong training-free and training-based baselines on FashionIQ, CIRR, and CIRCO, while ablations show each agent is essential. Code is available: this https URL.

2. 【2601.14224】Rerank Before You Reason: Analyzing Reranking Tradeoffs through Effective Token Cost in Deep Search Agents

链接https://arxiv.org/abs/2601.14224

作者:Sahel Sharifymoghaddam,Jimmy Lin

类目:Information Retrieval (cs.IR)

关键词:answer complex queries, significant efficiency concerns, research agents rely, scaling test-time computation, test-time computation raises

备注: 10 pages, 7 figures

点击查看摘要

Abstract:Deep research agents rely on iterative retrieval and reasoning to answer complex queries, but scaling test-time computation raises significant efficiency concerns. We study how to allocate reasoning budget in deep search pipelines, focusing on the role of listwise reranking. Using the BrowseComp-Plus benchmark, we analyze tradeoffs between model scale, reasoning effort, reranking depth, and total token cost via a novel effective token cost (ETC) metric. Our results show that reranking consistently improves retrieval and end-to-end accuracy, and that moderate reranking often yields larger gains than increasing search-time reasoning, achieving comparable accuracy at substantially lower cost. All our code is available at this https URL

3. 【2601.14176】ReSearch: A Multi-Stage Machine Learning Framework for Earth Science Data Discovery

链接https://arxiv.org/abs/2601.14176

作者:Youran Sun,Yixin Wen,Haizhao Yang

类目:Databases (cs.DB); Information Retrieval (cs.IR)

关键词:Earth Science data, Earth Science, Science data discovery, Toggle, Science data

备注

点击查看摘要

Abstract:The rapid expansion of Earth Science data from satellite observations, reanalysis products, and numerical simulations has created a critical bottleneck in scientific discovery, namely identifying relevant datasets for a given research objective. Existing discovery systems are primarily retrieval-centric and struggle to bridge the gap between high-level scientific intent and heterogeneous metadata at scale. We introduce \textbf{ReSearch}, a multi-stage, reasoning-enhanced search framework that formulates Earth Science data discovery as an iterative process of intent interpretation, high-recall retrieval, and context-aware ranking. ReSearch integrates lexical search, semantic embeddings, abbreviation expansion, and large language model reranking within a unified architecture that explicitly separates recall and precision objectives. To enable realistic evaluation, we construct a literature-grounded benchmark by aligning natural language intent with datasets cited in peer-reviewed Earth Science studies. Experiments demonstrate that ReSearch consistently improves recall and ranking performance over baseline methods, particularly for task-based queries expressing abstract scientific goals. These results underscore the importance of intent-aware, multi-stage search as a foundational capability for reproducible and scalable Earth Science research.

Subjects:

Databases (cs.DB); Information Retrieval (cs.IR)

Cite as:
arXiv:2601.14176 [cs.DB]

(or
arXiv:2601.14176v1 [cs.DB] for this version)

https://doi.org/10.48550/arXiv.2601.14176

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Youran Sun [view email] [v1]
Tue, 20 Jan 2026 17:27:12 UTC (154 KB)

Full-text links:
Access Paper:

View a PDF of the paper titled ReSearch: A Multi-Stage Machine Learning Framework for Earth Science Data Discovery, by Youran Sun and 2 other authorsView PDFHTML (experimental)TeX Source

view license

Current browse context: cs.DB

prev

|
next

new
|
recent
| 2026-01

Change to browse by:

cs
cs.IR

References Citations

NASA ADSGoogle Scholar
Semantic Scholar

export BibTeX citation
Loading…

BibTeX formatted citation

loading…

Data provided by:

Bookmark

checked=“checked”>
Bibliographic Tools

Bibliographic and Citation Tools

Bibliographic Explorer Toggle

Bibliographic Explorer (What is the Explorer?)

Connected Papers Toggle

Connected Papers (What is Connected Papers?)

Litmaps Toggle

Litmaps (What is Litmaps?)

scite.ai Toggle

scite Smart Citations (What are Smart Citations?)

Code, Data, Media

Code, Data and Media Associated with this Article

alphaXiv Toggle

alphaXiv (What is alphaXiv?)

Links to Code Toggle

CatalyzeX Code Finder for Papers (What is CatalyzeX?)

DagsHub Toggle

DagsHub (What is DagsHub?)

GotitPub Toggle

Gotit.pub (What is GotitPub?)

Huggingface Toggle

Hugging Face (What is Huggingface?)

Links to Code Toggle

Papers with Code (What is Papers with Code?)

ScienceCast Toggle

ScienceCast (What is ScienceCast?)

Demos

Demos

Replicate Toggle

Replicate (What is Replicate?)

Spaces Toggle

Hugging Face Spaces (What is Spaces?)

Spaces Toggle

TXYZ.AI (What is TXYZ.AI?)

Related Papers

Recommenders and Search Tools

Link to Influence Flower

Influence Flower (What are Influence Flowers?)

Core recommender toggle

CORE Recommender (What is CORE?)

Author
Venue
Institution
Topic

    About arXivLabs

arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.

Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)

mathjaxToggle();

About
Help

contact arXivClick here to contact arXiv
Contact

subscribe to arXiv mailingsClick here to subscribe
Subscribe

Copyright
Privacy Policy

Web Accessibility Assistance

arXiv Operational Status

4. 【2601.14123】A Systematic Analysis of Chunking Strategies for Reliable Question Answering

链接https://arxiv.org/abs/2601.14123

作者:Sofia Bennani,Charles Moslonka

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Retrieval-Augmented Generation, systems in industry, document chunking choices, chunking choices impact, Natural Questions systematically

备注: 3 pages, 2 figures, 1 table, pre-print

点击查看摘要

Abstract:We study how document chunking choices impact the reliability of Retrieval-Augmented Generation (RAG) systems in industry. While practice often relies on heuristics, our end-to-end evaluation on Natural Questions systematically varies chunking method (token, sentence, semantic, code), chunk size, overlap, and context length. We use a standard industrial setup: SPLADE retrieval and a Mistral-8B generator. We derive actionable lessons for cost-efficient deployment: (i) overlap provides no measurable benefit and increases indexing cost; (ii) sentence chunking is the most cost-effective method, matching semantic chunking up to ~5k tokens; (iii) a "context cliff" reduces quality beyond ~2.5k tokens; and (iv) optimal context depends on the goal (semantic quality peaks at small contexts; exact match at larger ones).

5. 【2601.14001】Auditory Brain Passage Retrieval: Cross-Sensory EEG Training for Neural Information Retrieval

链接https://arxiv.org/abs/2601.14001

作者:Niall McGuire,Yashar Moshfeghi

类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:Information Retrieval paradigms, remains fundamentally challenging, Query formulation, Retrieval paradigms due, internal information

备注: Accepted At ECIR 2026

点击查看摘要

Abstract:Query formulation from internal information needs remains fundamentally challenging across all Information Retrieval paradigms due to cognitive complexity and physical impairments. Brain Passage Retrieval (BPR) addresses this by directly mapping EEG signals to passage representations without intermediate text translation. However, existing BPR research exclusively uses visual stimuli, leaving critical questions unanswered: Can auditory EEG enable effective retrieval for voice-based interfaces and visually impaired users? Can training on combined EEG datasets from different sensory modalities improve performance despite severe data scarcity? We present the first systematic investigation of auditory EEG for BPR and evaluate cross-sensory training benefits. Using dual encoder architectures with four pooling strategies (CLS, mean, max, multi-vector), we conduct controlled experiments comparing auditory-only, visual-only, and combined training on the Alice (auditory) and Nieuwland (visual) datasets. Results demonstrate that auditory EEG consistently outperforms visual EEG, and cross-sensory training with CLS pooling achieves substantial improvements over individual training: 31% in MRR (0.474), 43% in Hit@1 (0.314), and 28% in Hit@10 (0.858). Critically, combined auditory EEG models surpass BM25 text baselines (MRR: 0.474 vs 0.428), establishing neural queries as competitive with traditional retrieval whilst enabling accessible interfaces. These findings validate auditory neural interfaces for IR tasks and demonstrate that cross-sensory training addresses data scarcity whilst outperforming single-modality approaches Code: this https URL

6. 【2601.13969】Autonomous Knowledge Graph Exploration with Adaptive Breadth-Depth Retrieval

链接https://arxiv.org/abs/2601.13969

作者:Joaquín Polonuer(1,2),Lucas Vittor(1),Iñaki Arango(1),Ayush Noori(1,3),David A. Clifton(3,4),Luciano Del Corro(5,6),Marinka Zitnik(1,7,8,9) ((1) Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA, (2) Departamento de Computación, FCEyN, Universidad de Buenos Aires, Buenos Aires, Argentina, (3) Department of Engineering Science, University of Oxford, Oxford, UK, (4) Oxford Suzhou Centre for Advanced Research, University of Oxford, Suzhou, Jiangsu, China, (5) ELIAS Lab, Departamento de Ingeniería, Universidad de San Andrés, Victoria, Argentina, (6) Lumina Labs, Buenos Aires, Argentina, (7) Kempner Institute for the Study of Natural and Artificial Intelligence, Allston, MA, USA, (8) Broad Institute of MIT and Harvard, Cambridge, MA, USA, (9) Harvard Data Science Initiative, Cambridge, MA, USA)

类目:Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:graphs requires balancing, follow relational links, requires balancing broad, knowledge graphs requires, Retrieving evidence

备注

点击查看摘要

Abstract:Retrieving evidence for language model queries from knowledge graphs requires balancing broad search across the graph with multi-hop traversal to follow relational links. Similarity-based retrievers provide coverage but remain shallow, whereas traversal-based methods rely on selecting seed nodes to start exploration, which can fail when queries span multiple entities and relations. We introduce ARK: Adaptive Retriever of Knowledge, an agentic KG retriever that gives a language model control over this breadth-depth tradeoff using a two-operation toolset: global lexical search over node descriptors and one-hop neighborhood exploration that composes into multi-hop traversal. ARK alternates between breadth-oriented discovery and depth-oriented expansion without depending on a fragile seed selection, a pre-set hop depth, or requiring retrieval training. ARK adapts tool use to queries, using global search for language-heavy queries and neighborhood exploration for relation-heavy queries. On STaRK, ARK reaches 59.1% average Hit@1 and 67.4 average MRR, improving average Hit@1 by up to 31.4% and average MRR by up to 28.0% over retrieval-based and agentic training-free methods. Finally, we distill ARK's tool-use trajectories from a large teacher into an 8B model via label-free imitation, improving Hit@1 by +7.0, +26.6, and +13.5 absolute points over the base 8B model on AMAZON, MAG, and PRIME datasets, respectively, while retaining up to 98.5% of the teacher's Hit@1 rate.

7. 【2601.13938】IF-GEO: Conflict-Aware Instruction Fusion for Multi-Query Generative Engine Optimization

链接https://arxiv.org/abs/2601.13938

作者:Heyang Zhou(1),JiaJia Chen(2),Xiaolu Chen(1),Jie Bao(1),Zhen Chen(1),Yong Liao(1) ((1) School of Cyber Science and Technology, University of Science and Technology of China, (2) Institute of Dataspace, Hefei Comprehensive National Science Center)

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:Generative Engines revolutionize, ensuring source visibility, Engines revolutionize information, Generative Engine Optimization, termed Generative Engine

备注: 9 pages, 3 figures. Submitted to ACL 2026. Corresponding author: Zhen Chen

点击查看摘要

Abstract:As Generative Engines revolutionize information retrieval by synthesizing direct answers from retrieved sources, ensuring source visibility becomes a significant challenge. Improving it through targeted content revisions is a practical strategy termed Generative Engine Optimization (GEO). However, optimizing a document for diverse queries presents a constrained optimization challenge where heterogeneous queries often impose conflicting and competing revision requirements under a limited content budget. To address this challenge, we propose IF-GEO, a "diverge-then-converge" framework comprising two phases: (i) mining distinct optimization preferences from representative latent queries; (ii) synthesizing a Global Revision Blueprint for guided editing by coordinating preferences via conflict-aware instruction fusion. To explicitly quantify IF-GEO's objective of cross-query stability, we introduce risk-aware stability metrics. Experiments on multi-query benchmarks demonstrate that IF-GEO achieves substantial performance gains while maintaining robustness across diverse retrieval scenarios.

8. 【2601.13931】owards Effective Negation Modeling in Joint Audio-Text Models for Music

链接https://arxiv.org/abs/2601.13931

作者:Yannis Vasilakis,Rachel Bittner,Johan Pauwels

类目:ound (cs.SD); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:Million Song Dataset, Joint audio-text models, struggle with semantic, semantic phenomena, Joint audio-text

备注: Accepted at IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2026

点击查看摘要

Abstract:Joint audio-text models are widely used for music retrieval, yet they struggle with semantic phenomena such as negation. Negation is fundamental for distinguishing the absence (or presence) of musical elements (e.g., "with vocals" vs. "without vocals"), but current systems fail to represent this reliably. In this work, we investigate and mitigate this limitation by training CLAP models from scratch on the Million Song Dataset with LP-MusicCaps-MSD captions. We introduce negation through text augmentation and a dissimilarity-based contrastive loss, designed to explicitly separate original and negated captions in the joint embedding space. To evaluate progress, we propose two protocols that frame negation modeling as retrieval and binary classification tasks. Experiments demonstrate that both methods, individually and combined, improve negation handling while largely preserving retrieval performance.

9. 【2601.13856】Question-Focused Filtering for Knowledge-based VQA

链接https://arxiv.org/abs/2601.13856

作者:Wei Ye,Yixin Su,Yueguo Chen,Longxiang Gao,Jianjun Li,Ruixuan Li,Rui Zhang

类目:Information Retrieval (cs.IR)

关键词:Knowledge-based Visual Question, Visual Question Answering, Knowledge-based Visual, Question Answering, Visual Question

备注

点击查看摘要

Abstract:Knowledge-based Visual Question Answering (KB-VQA) aims to answer questions by integrating images with external knowledge. Effective knowledge filtering is crucial for improving accuracy. Typical filtering methods use similarity metrics to locate relevant article sections from one article, leading to information selection errors at the article and intra-article levels. Although recent explorations of Multimodal Large Language Model (MLLM)-based filtering methods demonstrate superior semantic understanding and cross-article filtering capabilities, their high computational cost limits practical application. To address these issues, this paper proposes a question-focused filtering method. This approach can perform question-focused, cross-article filtering, efficiently obtaining high-quality filtered knowledge while keeping computational costs comparable to typical methods. Specifically, we design a trainable Question-Focused Filter (QFF) and a Chunk-based Dynamic Multi-Article Selection (CDA) module, which collectively alleviate information selection errors at both the article and intra-article levels. Experiments show that our method outperforms current state-of-the-art models by 4.9% on E-VQA and 3.8% on InfoSeek, validating its effectiveness. The code is publicly available at: this https URL.

10. 【2601.13719】Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search

链接https://arxiv.org/abs/2601.13719

作者:Xinlei Yin,Xiulian Peng,Xiao Li,Zhiwei Xiong,Yan Lu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:long context windows, extremely long context, vision-language models due, presents significant challenges, extremely long

备注

点击查看摘要

Abstract:Long video understanding presents significant challenges for vision-language models due to extremely long context windows. Existing solutions relying on naive chunking strategies with retrieval-augmented generation, typically suffer from information fragmentation and a loss of global coherence. We present HAVEN, a unified framework for long-video understanding that enables coherent and comprehensive reasoning by integrating audiovisual entity cohesion and hierarchical video indexing with agentic search. First, we preserve semantic consistency by integrating entity-level representations across visual and auditory streams, while organizing content into a structured hierarchy spanning global summary, scene, segment, and entity levels. Then we employ an agentic search mechanism to enable dynamic retrieval and reasoning across these layers, facilitating coherent narrative reconstruction and fine-grained entity tracking. Extensive experiments demonstrate that our method achieves good temporal coherence, entity consistency, and retrieval efficiency, establishing a new state-of-the-art with an overall accuracy of 84.1% on LVBench. Notably, it achieves outstanding performance in the challenging reasoning category, reaching 80.1%. These results highlight the effectiveness of structured, multimodal reasoning for comprehensive and context-consistent understanding of long-form videos.

11. 【2601.13609】Balancing Fairness and High Match Rates in Reciprocal Recommender Systems: A Nash Social Welfare Approach

链接https://arxiv.org/abs/2601.13609

作者:Yoji Tomita,Tomohiko Yokoyama

类目:Information Retrieval (cs.IR)

关键词:online dating services, Matching platforms, increasingly prevalent, online dating, dating services

备注: arXiv admin note: text overlap with [arXiv:2409.00720](https://arxiv.org/abs/2409.00720)

点击查看摘要

Abstract:Matching platforms, such as online dating services and job recommendations, have become increasingly prevalent. For the success of these platforms, it is crucial to design reciprocal recommender systems (RRSs) that not only increase the total number of matches but also avoid creating unfairness among users. In this paper, we investigate the fairness of RRSs on matching platforms. From the perspective of fair division, we define the users' opportunities to be recommended and establish the fairness concept of envy-freeness in the allocation of these opportunities. We first introduce the Social Welfare (SW) method, which approximately maximizes the number of matches, and show that it leads to significant unfairness in recommendation opportunities, illustrating the trade-off between fairness and match rates. To address this challenge, we propose the Nash Social Welfare (NSW) method, which alternately optimizes two NSW functions and achieves nearly envy-free recommendations. We further generalize the SW and NSW method to the $\alpha$-SW method, which balances the trade-off between fairness and high match rates. Additionally, we develop a computationally efficient approximation algorithm for the SW/NSW/$\alpha$-SW methods based on the Sinkhorn algorithm. Through extensive experiments on both synthetic datasets and two real-world datasets, we demonstrate the practical effectiveness of our approach.

12. 【2601.13525】More Than Efficiency: Embedding Compression Improves Domain Adaptation in Dense Retrieval

链接https://arxiv.org/abs/2601.13525

作者:Chunsheng Zuo,Daniel Khashabi

类目:Information Retrieval (cs.IR)

关键词:Dense retrievers powered, target domain distributions, specialized domains due, Dense retrievers, powered by pretrained

备注

点击查看摘要

Abstract:Dense retrievers powered by pretrained embeddings are widely used for document retrieval but struggle in specialized domains due to the mismatches between the training and target domain distributions. Domain adaptation typically requires costly annotation and retraining of query-document pairs. In this work, we revisit an overlooked alternative: applying PCA to domain embeddings to derive lower-dimensional representations that preserve domain-relevant features while discarding non-discriminative components. Though traditionally used for efficiency, we demonstrate that this simple embedding compression can effectively improve retrieval performance. Evaluated across 9 retrievers and 14 MTEB datasets, PCA applied solely to query embeddings improves NDCG@10 in 75.4% of model-dataset pairs, offering a simple and lightweight method for domain adaptation.

13. 【2601.13505】Integrating Vision-Centric Text Understanding for Conversational Recommender Systems

链接https://arxiv.org/abs/2601.13505

作者:Wei Yuan,Shutong Qiao,Tong Chen,Quoc Viet Hung Nguyen,Zi Huang,Hongzhi Yin

类目:Information Retrieval (cs.IR)

关键词:attracted growing attention, Conversational Recommender Systems, deliver personalized recommendations, Conversational Recommender, natural language interactions

备注

点击查看摘要

Abstract:Conversational Recommender Systems (CRSs) have attracted growing attention for their ability to deliver personalized recommendations through natural language interactions. To more accurately infer user preferences from multi-turn conversations, recent works increasingly expand conversational context (e.g., by incorporating diverse entity information or retrieving related dialogues). While such context enrichment can assist preference modeling, it also introduces longer and more heterogeneous inputs, leading to practical issues such as input length constraints, text style inconsistency, and irrelevant textual noise, thereby raising the demand for stronger language understanding ability. In this paper, we propose STARCRS, a Screen-Text-AwaRe Conversational Recommender System that integrates two complementary text understanding modes: (1) a screen-reading pathway that encodes auxiliary textual information as visual tokens, mimicking skim reading on a screen, and (2) an LLM-based textual pathway that focuses on a limited set of critical content for fine-grained reasoning. We design a knowledge-anchored fusion framework that combines contrastive alignment, cross-attention interaction, and adaptive gating to integrate the two modes for improved preference modeling and response generation. Extensive experiments on two widely used benchmarks demonstrate that STARCRS consistently improves both recommendation accuracy and generated response quality.

14. 【2601.13353】Guidelines for the Creation of an Annotated Corpus

链接https://arxiv.org/abs/2601.13353

作者:Bahdja Boudoua,Nadia Guiffant,Mathieu Roche,Maguelonne Teisseire,Annelise Tran

类目:Information Retrieval (cs.IR)

关键词:UMR TETIS members, annotated textual datasets, creating annotation guidelines, UMR TETIS, feedback from UMR

备注: 8 pages, 3 figures

点击查看摘要

Abstract:This document, based on feedback from UMR TETIS members and the scientific literature, provides a generic methodology for creating annotation guidelines and annotated textual datasets (corpora). It covers methodological aspects, as well as storage, sharing, and valorization of the data. It includes definitions and examples to clearly illustrate each step of the process, thus providing a comprehensive framework to support the creation and use of corpora in various research contexts.

15. 【2601.13227】Insider Knowledge: How Much Can RAG Systems Gain from Evaluation Secrets?

链接https://arxiv.org/abs/2601.13227

作者:Laura Dietz,Bryan Li,Eugene Yang,Dawn Lawrie,William Walden,James Mayfield

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:RAG systems, increasingly evaluated, dominant paradigm, RAG, nugget-based RAG systems

备注

点击查看摘要

Abstract:RAG systems are increasingly evaluated and optimized using LLM judges, an approach that is rapidly becoming the dominant paradigm for system assessment. Nugget-based approaches in particular are now embedded not only in evaluation frameworks but also in the architectures of RAG systems themselves. While this integration can lead to genuine improvements, it also creates a risk of faulty measurements due to circularity. In this paper, we investigate this risk through comparative experiments with nugget-based RAG systems, including Ginger and Crucible, against strong baselines such as GPT-Researcher. By deliberately modifying Crucible to generate outputs optimized for an LLM judge, we show that near-perfect evaluation scores can be achieved when elements of the evaluation - such as prompt templates or gold nuggets - are leaked or can be predicted. Our results highlight the importance of blind evaluation settings and methodological diversity to guard against mistaking metric overfitting for genuine system progress.

16. 【2601.13222】Incorporating QA Nuggets into Retrieval-Augmented Generation

链接https://arxiv.org/abs/2601.13222

作者:Laura Dietz,Bryan Li,Gabrielle Liu,Jia-Huei Ju,Eugene Yang,Dawn Lawrie,William Walden,James Mayfield

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:RAGE systems integrate, systems integrate ideas, RAGE systems, Retrieval-augmented Generation, automatic evaluation

备注

点击查看摘要

Abstract:RAGE systems integrate ideas from automatic evaluation (E) into Retrieval-augmented Generation (RAG). As one such example, we present Crucible, a Nugget-Augmented Generation System that preserves explicit citation provenance by constructing a bank of QA nuggets from retrieved documents and uses them to guide extraction, selection, and report generation. Reasoning on nuggets avoids repeated information through clear and interpretable QA semantics - instead of opaque cluster abstractions - while maintaining citation provenance throughout the entire generation process. Evaluated on the TREC NeuCLIR 2024 collection, our Crucible system substantially outperforms Ginger, a recent nugget-based RAG system, in nugget recall, density, and citation grounding.

17. 【2601.13115】Agentic Conversational Search with Contextualized Reasoning via Reinforcement Learning

链接https://arxiv.org/abs/2601.13115

作者:Fengran Mo,Yifan Gao,Sha Li,Hansi Zeng,Xin Liu,Zhaoxuan Tan,Xian Li,Jianshu Chen,Dakuo Wang,Meng Jiang

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Large Language Models, supporting information seeking, Large Language, Language Models, supporting information

备注

点击查看摘要

Abstract:Large Language Models (LLMs) have become a popular interface for human-AI interaction, supporting information seeking and task assistance through natural, multi-turn dialogue. To respond to users within multi-turn dialogues, the context-dependent user intent evolves across interactions, requiring contextual interpretation, query reformulation, and dynamic coordination between retrieval and generation. Existing studies usually follow static rewrite, retrieve, and generate pipelines, which optimize different procedures separately and overlook the mixed-initiative action optimization simultaneously. Although the recent developments in deep search agents demonstrate the effectiveness in jointly optimizing retrieval and generation via reasoning, these approaches focus on single-turn scenarios, which might lack the ability to handle multi-turn interactions. We introduce a conversational agent that interleaves search and reasoning across turns, enabling exploratory and adaptive behaviors learned through reinforcement learning (RL) training with tailored rewards towards evolving user goals. The experimental results across four widely used conversational benchmarks demonstrate the effectiveness of our methods by surpassing several existing strong baselines.

18. 【2601.13111】CORE-T: COherent REtrieval of Tables for Text-to-SQL

链接https://arxiv.org/abs/2601.13111

作者:Hassan Soliman,Vivek Gupta,Dan Roth,Iryna Gurevych

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:require joining multiple, joining multiple tables, workflows often require, require joining, joining multiple

备注: Preprint under review. Code and data available at: [this https URL](https://github.com/UKPLab/arxiv2026-core-t)

点击查看摘要

Abstract:Realistic text-to-SQL workflows often require joining multiple tables. As a result, accurately retrieving the relevant set of tables becomes a key bottleneck for end-to-end performance. We study an open-book setting where queries must be answered over large, heterogeneous table collections pooled from many sources, without clean scoping signals such as database identifiers. Here, dense retrieval (DR) achieves high recall but returns many distractors, while join-aware alternatives often rely on extra assumptions and/or incur high inference overhead. We propose CORE-T, a scalable, training-free framework that enriches tables with LLM-generated purpose metadata and pre-computes a lightweight table-compatibility cache. At inference time, DR returns top-K candidates; a single LLM call selects a coherent, joinable subset, and a simple additive adjustment step restores strongly compatible tables. Across Bird, Spider, and MMQA, CORE-T improves table-selection F1 by up to 22.7 points while retrieving up to 42% fewer tables, improving multi-table execution accuracy by up to 5.0 points on Bird and 6.9 points on MMQA, and using 4-5x fewer tokens than LLM-intensive baselines.

19. 【2601.12985】Rules, Resources, and Restrictions: A Taxonomy of Task-Based Information Request Intents

链接https://arxiv.org/abs/2601.12985

作者:Melanie A. Kilian,David Elsweiler

类目:Information Retrieval (cs.IR)

关键词:improve retrieval effectiveness, Understanding and classifying, helping align search, align search results, Large Language Models

备注: 11 pages, 1 figure, to be published in: 2026 ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR '26), March 22-26, 2026, Seattle, WA, USA. ACM, New York, NY, USA, 11 pages. [this https URL](https://doi.org/10.1145/3786304.3787863)

点击查看摘要

Abstract:Understanding and classifying query intents can improve retrieval effectiveness by helping align search results with the motivations behind user queries. However, existing intent taxonomies are typically derived from system log data and capture mostly isolated information needs, while the broader task context often remains unaddressed. This limitation becomes increasingly relevant as interactions with Large Language Models (LLMs) expand user expectations from simple query answering toward comprehensive task support, for example, with purchasing decisions or in travel planning. At the same time, current LLMs still struggle to fully interpret complex and multifaceted tasks. To address this gap, we argue for a stronger task-based perspective on query intent. Drawing on a grounded-theory-based interview study with airport information clerks, we present a taxonomy of task-based information request intents that bridges the gap between traditional query-focused approaches and the emerging demands of AI-driven task-oriented search.

20. 【2601.12902】Audit du syst{è}me d'information et du mod{è}le de gouvernance de la Biblioth{è}que Num{é}rique de l'Espace universitaire Francophone (BNEUF) du projet Initiative pour le D{é}veloppement du Num{é}rique dans l'Espace Universitaire Francophone (IDNEUF)

链接https://arxiv.org/abs/2601.12902

作者:Mokhtar Ben Henda(MICA)

类目:Digital Libraries (cs.DL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

关键词:French speaking Universities, Development in French, speaking Universities, French speaking, Digital Development

备注: in French language

点击查看摘要

Abstract:This document provides an assessment of the overall structure of the BNEUF system and how it operates within the framework of the Initiative for Digital Development in French speaking Universities (IDNEUF). This report aims to support the AUF's new strategy for 2021-2025, with its new structural and governance foundations for the implementation of the Francophonie scientifique project. It was therefore decided to reorganize existing and future digital resources and services with a view to incorporating them into the future global collaborative platform for integrated services. This report provides an external assessment with new forms of organization and use of the BNEUF system. The aim is to provide the AUF project team with new avenues for optimized management of the compiled digital resources and to synergize them with the related modules of the Atlas of Expertise and the Francophone Social Network.

21. 【2601.12828】he Unfairness of Multifactorial Bias in Recommendation

链接https://arxiv.org/abs/2601.12828

作者:Masoud Mansoury,Jin Huang,Mykola Pechenizkiy,Herke van Hoof,Maarten de Rijke

类目:Information Retrieval (cs.IR)

关键词:bias, Popularity bias, prominent sources, multifactorial bias, positivity bias

备注

点击查看摘要

Abstract:Popularity bias and positivity bias are two prominent sources of bias in recommender systems. Both arise from input data, propagate through recommendation models, and lead to unfair or suboptimal outcomes. Popularity bias occurs when a small subset of items receives most interactions, while positivity bias stems from the over-representation of high rating values. Although each bias has been studied independently, their combined effect, to which we refer to as multifactorial bias, remains underexplored. In this work, we examine how multifactorial bias influences item-side fairness, focusing on exposure bias, which reflects the unequal visibility of items in recommendation outputs. Through simulation studies, we find that positivity bias is disproportionately concentrated on popular items, further amplifying their over-exposure. Motivated by this insight, we adapt a percentile-based rating transformation as a pre-processing strategy to mitigate multifactorial bias. Experiments using six recommendation algorithms across four public datasets show that this approach improves exposure fairness with negligible accuracy loss. We also demonstrate that integrating this pre-processing step into post-processing fairness pipelines enhances their effectiveness and efficiency, enabling comparable or better fairness with reduced computational cost. These findings highlight the importance of addressing multifactorial bias and demonstrate the practical value of simple, data-driven pre-processing methods for improving fairness in recommender systems.

22. 【2601.12681】HyFormer: Revisiting the Roles of Sequence Modeling and Feature Interaction in CTR Prediction

链接https://arxiv.org/abs/2601.12681

作者:Yunwen Huang,Shiyong Hong,Xijun Xiao,Jinqiu Jin,Xuanyuan Luo,Zhe Wang,Zheng Chai,Shikang Wu,Yuchao Zheng,Jingjian Lin

类目:Information Retrieval (cs.IR)

关键词:strict efficiency constraints, jointly modeling long-range, modeling long-range user, long-range user behavior, user behavior sequences

备注

点击查看摘要

Abstract:Industrial large-scale recommendation models (LRMs) face the challenge of jointly modeling long-range user behavior sequences and heterogeneous non-sequential features under strict efficiency constraints. However, most existing architectures employ a decoupled pipeline: long sequences are first compressed with a query-token based sequence compressor like LONGER, followed by fusion with dense features through token-mixing modules like RankMixer, which thereby limits both the representation capacity and the interaction flexibility. This paper presents HyFormer, a unified hybrid transformer architecture that tightly integrates long-sequence modeling and feature interaction into a single backbone. From the perspective of sequence modeling, we revisit and redesign query tokens in LRMs, and frame the LRM modeling task as an alternating optimization process that integrates two core components: Query Decoding which expands non-sequential features into Global Tokens and performs long sequence decoding over layer-wise key-value representations of long behavioral sequences; and Query Boosting which enhances cross-query and cross-sequence heterogeneous interactions via efficient token mixing. The two complementary mechanisms are performed iteratively to refine semantic representations across layers. Extensive experiments on billion-scale industrial datasets demonstrate that HyFormer consistently outperforms strong LONGER and RankMixer baselines under comparable parameter and FLOPs budgets, while exhibiting superior scaling behavior with increasing parameters and FLOPs. Large-scale online A/B tests in high-traffic production systems further validate its effectiveness, showing significant gains over deployed state-of-the-art models. These results highlight the practicality and scalability of HyFormer as a unified modeling framework for industrial LRMs.

23. 【2601.12632】BioPulse-QA: A Dynamic Biomedical Question-Answering Benchmark for Evaluating Factuality, Robustness, and Bias in Large Language Models

链接https://arxiv.org/abs/2601.12632

作者:Kriti Bhattarai,Vipina K. Keloth,Donald Wright,Andrew Loza,Yang Ren,Hua Xu

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Large language models, supporting model development, existing benchmark datasets, Large language, development and evaluation

备注

点击查看摘要

Abstract:Objective: Large language models (LLMs) are increasingly applied in biomedical settings, and existing benchmark datasets have played an important role in supporting model development and evaluation. However, these benchmarks often have limitations. Many rely on static or outdated datasets that fail to capture the dynamic, context-rich, and high-stakes nature of biomedical knowledge. They also carry increasing risk of data leakage due to overlap with model pretraining corpora and often overlook critical dimensions such as robustness to linguistic variation and potential demographic biases. Materials and Methods: To address these gaps, we introduce BioPulse-QA, a benchmark that evaluates LLMs on answering questions from newly published biomedical documents including drug labels, trial protocols, and clinical guidelines. BioPulse-QA includes 2,280 expert-verified question answering (QA) pairs and perturbed variants, covering both extractive and abstractive formats. We evaluate four LLMs - GPT-4o, GPT-o1, Gemini-2.0-Flash, and LLaMA-3.1 8B Instruct - released prior to the publication dates of the benchmark documents. Results: GPT-o1 achieves the highest relaxed F1 score (0.92), followed by Gemini-2.0-Flash (0.90) on drug labels. Clinical trials are the most challenging source, with extractive F1 scores as low as 0.36. Discussion and Conclusion: Performance differences are larger for paraphrasing than for typographical errors, while bias testing shows negligible differences. BioPulse-QA provides a scalable and clinically relevant framework for evaluating biomedical LLMs.

Subjects:

Computation and Language (cs.CL); Information Retrieval (cs.IR)

Cite as:
arXiv:2601.12632 [cs.CL]

(or
arXiv:2601.12632v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2601.12632

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Kriti Bhattarai [view email] [v1]
Mon, 19 Jan 2026 00:38:33 UTC (3,169 KB)

24. 【2601.12544】Information Farming: From Berry Picking to Berry Growing

链接https://arxiv.org/abs/2601.12544

作者:Leif Azzopardi,Adam Roegiest

类目:Information Retrieval (cs.IR); Human-Computer Interaction (cs.HC)

关键词:Berry Picking, Information Foraging Theory, Theory have framed, satisfy evolving information, Foraging Theory

备注: ACM CHIIR 2026

点击查看摘要

Abstract:The classic paradigms of Berry Picking and Information Foraging Theory have framed users as gatherers, opportunistically searching across distributed sources to satisfy evolving information needs. However, the rise of GenAI is driving a fundamental transformation in how people produce, structure, and reuse information - one that these paradigms no longer fully capture. This transformation is analogous to the Neolithic Revolution, when societies shifted from hunting and gathering to cultivation. Generative technologies empower users to "farm" information by planting seeds in the form of prompts, cultivating workflows over time, and harvesting richly structured, relevant yields within their own plots, rather than foraging across others people's patches. In this perspectives paper, we introduce the notion of Information Farming as a conceptual framework and argue that it represents a natural evolution in how people engage with information. Drawing on historical analogy and empirical evidence, we examine the benefits and opportunities of information farming, its implications for design and evaluation, and the accompanying risks posed by this transition. We hypothesize that as GenAI technologies proliferate, cultivating information will increasingly supplant transient, patch-based foraging as a dominant mode of engagement, marking a broader shift in human-information interaction and its study.

25. 【2601.12522】Improved Bug Localization with AI Agents Leveraging Hypothesis and Dynamic Cognition

链接https://arxiv.org/abs/2601.12522

作者:Asif Mohammed Samir,Mohammad Masudur Rahman

类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

关键词:Software bugs cost, cost technology providers, bugs cost technology, Large Language Models, Software bugs

备注: 13 pages, 7 tables, 5 figures

点击查看摘要

Abstract:Software bugs cost technology providers (e.g., ATT) billions annually and cause developers to spend roughly 50% of their time on bug resolution. Traditional methods for bug localization often analyze the suspiciousness of code components (e.g., methods, documents) in isolation, overlooking their connections with other components in the codebase. Recent advances in Large Language Models (LLMs) and agentic AI techniques have shown strong potential for code understanding, but still lack causal reasoning during code exploration and struggle to manage growing context effectively, limiting their capability. In this paper, we present a novel agentic technique for bug localization -- CogniGent -- that overcomes the limitations above by leveraging multiple AI agents capable of causal reasoning, call-graph-based root cause analysis and context engineering. It emulates developers-inspired debugging practices (a.k.a., dynamic cognitive debugging) and conducts hypothesis testing to support bug localization. We evaluate CogniGent on a curated dataset of 591 bug reports using three widely adopted performance metrics and compare it against six established baselines from the literature. Experimental results show that our technique consistently outperformed existing traditional and LLM-based techniques, achieving MAP improvements of 23.33-38.57% at the document and method levels. Similar gains were observed in MRR, with increases of 25.14-53.74% at both granularity levels. Statistical significance tests also confirm the superiority of our technique. By addressing the reasoning, dependency, and context limitations, CogniGent advances the state of bug localization, bridging human-like cognition with agentic automation for improved performance.

26. 【2601.12301】Facet-Aware Multi-Head Mixture-of-Experts Model with Text-Enhanced Pre-training for Sequential Recommendation

链接https://arxiv.org/abs/2601.12301

作者:Mingrui Liu,Sixiao Zhang,Cheng Long

类目:Information Retrieval (cs.IR)

关键词:users' dynamic preferences, capturing users' dynamic, interaction histories, users' dynamic, leveraging their interaction

备注: Extended from WSDM paper. arXiv admin note: substantial text overlap with [arXiv:2411.01457](https://arxiv.org/abs/2411.01457)

点击查看摘要

Abstract:Sequential recommendation (SR) systems excel at capturing users' dynamic preferences by leveraging their interaction histories. Most existing SR systems assign a single embedding vector to each item to represent its features, adopting various models to combine these embeddings into a sequence representation that captures user intent. However, we argue that this representation alone is insufficient to capture an item's multi-faceted nature (e.g., movie genres, starring actors). Furthermore, users often exhibit complex and varied preferences within these facets (e.g., liking both action and musical films within the genre facet), which are challenging to fully represent with static identifiers. To address these issues, we propose a novel architecture titled Facet-Aware Multi-Head Mixture-of-Experts Model for Sequential Recommendation (FAME). We leverage sub-embeddings from each head in the final multi-head attention layer to predict the next item separately, effectively capturing distinct item facets. A gating mechanism then integrates these predictions by dynamically determining their importance. Additionally, we introduce a Mixture-of-Experts (MoE) network within each attention head to disentangle varied user preferences within each facet, utilizing a learnable router network to aggregate expert outputs based on context. Complementing this architecture, we design a Text-Enhanced Facet-Aware Pre-training module to overcome the limitations of randomly initialized embeddings. By utilizing a pre-trained text encoder and employing an alternating supervised contrastive learning objective, we explicitly disentangle facet-specific features from textual metadata (e.g., descriptions) before sequential training begins. This ensures that the item embeddings are semantically robust and aligned with the downstream multi-facet framework.

27. 【2601.12078】Optimizing User Profiles via Contextual Bandits for Retrieval-Augmented LLM Personalization

链接https://arxiv.org/abs/2601.12078

作者:Linfeng Du,Ye Yuan,Zichen Zhao,Fuyuan Lyu,Emiliano Penaloza,Xiuying Chen,Zipeng Sun,Jikun Kang,Laurent Charlin,Xue Liu,Haolun Wu

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Large Language Models, Large Language, users remains challenging, individual users remains, Language Models

备注

点击查看摘要

Abstract:Large Language Models (LLMs) excel at general-purpose tasks, yet adapting their responses to individual users remains challenging. Retrieval augmentation provides a lightweight alternative to fine-tuning by conditioning LLMs on user history records, and existing approaches typically select these records based on semantic relevance. We argue that relevance serves as an unreliable proxy for utility: a record may be semantically similar to a query yet fail to improve generation quality or even degrade it due to redundancy or conflicting information. To bridge this gap, we propose PURPLE, a contextual bandit framework that oPtimizes UseR Profiles for Llm pErsonalization. In contrast to a greedy selection of the most relevant records, PURPLE treats profile construction as a set generation process and utilizes a Plackett-Luce ranking model to capture complex inter-record dependencies. By training with dense feedback provided by the likelihood of the reference response, our method aligns retrieval directly with generation quality. Extensive experiments on nine personalization tasks demonstrate that PURPLE consistently outperforms strong heuristic and retrieval-augmented baselines in both effectiveness and efficiency, establishing a principled and scalable solution for optimizing user profiles.

28. 【2601.12034】Don't Start Over: A Cost-Effective Framework for Migrating Personalized Prompts Between LLMs

链接https://arxiv.org/abs/2601.12034

作者:Ziyi Zhao,Chongming Gao,Yang Zhang,Haoyan Liu,Weinan Gan,Huifeng Guo,Yong Liu,Fuli Feng

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:Large Language Models, Personalization in Large, Large Language, user-specific soft prompts, Language Models

备注: Accepted to AAAI 2026 (Oral). 9 pages, 5 figures

点击查看摘要

Abstract:Personalization in Large Language Models (LLMs) often relies on user-specific soft prompts. However, these prompts become obsolete when the foundation model is upgraded, necessitating costly, full-scale retraining. To overcome this limitation, we propose the Prompt-level User Migration Adapter (PUMA), a lightweight framework to efficiently migrate personalized prompts across incompatible models. PUMA utilizes a parameter-efficient adapter to bridge the semantic gap, combined with a group-based user selection strategy to significantly reduce training costs. Experiments on three large-scale datasets show our method matches or even surpasses the performance of retraining from scratch, reducing computational cost by up to 98%. The framework demonstrates strong generalization across diverse model architectures and robustness in advanced scenarios like chained and aggregated migrations, offering a practical path for the sustainable evolution of personalized AI by decoupling user assets from the underlying models.

29. 【2601.11995】Learning Audio-Visual Embeddings with Inferred Latent Interaction Graphs

链接https://arxiv.org/abs/2601.11995

作者:Donghuo Zeng,Hao Niu,Yanan Wang,Masato Taya

类目:Multimedia (cs.MM); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Sound (cs.SD)

关键词:requires bringing genuinely, bringing genuinely related, genuinely related audio, inferred latent interactions, embeddings requires bringing

备注: 16 pages, 5 figures, 2 tables

点击查看摘要

Abstract:Learning robust audio-visual embeddings requires bringing genuinely related audio and visual signals together while filtering out incidental co-occurrences - background noise, unrelated elements, or unannotated events. Most contrastive and triplet-loss methods use sparse annotated labels per clip and treat any co-occurrence as semantic similarity. For example, a video labeled "train" might also contain motorcycle audio and visual, because "motorcycle" is not the chosen annotation; standard methods treat these co-occurrences as negatives to true motorcycle anchors elsewhere, creating false negatives and missing true cross-modal dependencies. We propose a framework that leverages soft-label predictions and inferred latent interactions to address these issues: (1) Audio-Visual Semantic Alignment Loss (AV-SAL) trains a teacher network to produce aligned soft-label distributions across modalities, assigning nonzero probability to co-occurring but unannotated events and enriching the supervision signal. (2) Inferred Latent Interaction Graph (ILI) applies the GRaSP algorithm to teacher soft labels to infer a sparse, directed dependency graph among classes. This graph highlights directional dependencies (e.g., "Train (visual)" - "Motorcycle (audio)") that expose likely semantic or conditional relationships between classes; these are interpreted as estimated dependency patterns. (3) Latent Interaction Regularizer (LIR): A student network is trained with both metric loss and a regularizer guided by the ILI graph, pulling together embeddings of dependency-linked but unlabeled pairs in proportion to their soft-label probabilities. Experiments on AVE and VEGAS benchmarks show consistent improvements in mean average precision (mAP), demonstrating that integrating inferred latent interactions into embedding learning enhances robustness and semantic coherence.

30. 【2601.11888】Agentic-R: Learning to Retrieve for Agentic Search

链接https://arxiv.org/abs/2601.11888

作者:Wenhan Liu,Xinyu Ma,Yutao Zhu,Yuchen Li,Daiting Shi,Dawei Yin,Zhicheng Dou

类目:Information Retrieval (cs.IR)

关键词:interleaves multi-step reasoning, solve complex questions, Agentic search, agent interleaves multi-step, powerful paradigm

备注

点击查看摘要

Abstract:Agentic search has recently emerged as a powerful paradigm, where an agent interleaves multi-step reasoning with on-demand retrieval to solve complex questions. Despite its success, how to design a retriever for agentic search remains largely underexplored. Existing search agents typically rely on similarity-based retrievers, while similar passages are not always useful for final answer generation. In this paper, we propose a novel retriever training framework tailored for agentic search. Unlike retrievers designed for single-turn retrieval-augmented generation (RAG) that only rely on local passage utility, we propose to use both local query-passage relevance and global answer correctness to measure passage utility in a multi-turn agentic search. We further introduce an iterative training strategy, where the search agent and the retriever are optimized bidirectionally and iteratively. Different from RAG retrievers that are only trained once with fixed questions, our retriever is continuously improved using evolving and higher-quality queries from the agent. Extensive experiments on seven single-hop and multi-hop QA benchmarks demonstrate that our retriever, termed \ours{}, consistently outperforms strong baselines across different search agents. Our codes are available at: this https URL.

31. 【2601.11874】Cultural Analytics for Good: Building Inclusive Evaluation Frameworks for Historical IR

链接https://arxiv.org/abs/2601.11874

作者:Suchana Datta,Dwaipayan Roy,Derek Greene,Gerardine Meaney,Karen Wade,Philipp Mayr

类目:Information Retrieval (cs.IR)

关键词:support equitable access, Large Language Model, bridges the fields, fields of information, equitable access

备注

点击查看摘要

Abstract:This work bridges the fields of information retrieval and cultural analytics to support equitable access to historical knowledge. Using the British Library BL19 digital collection (more than 35,000 works from 1700-1899), we construct a benchmark for studying changes in language, terminology and retrieval in the 19th-century fiction and non-fiction. Our approach combines expert-driven query design, paragraph-level relevance annotation, and Large Language Model (LLM) assistance to create a scalable evaluation framework grounded in human expertise. We focus on knowledge transfer from fiction to non-fiction, investigating how narrative understanding and semantic richness in fiction can improve retrieval for scholarly and factual materials. This interdisciplinary framework not only improves retrieval accuracy but also fosters interpretability, transparency, and cultural inclusivity in digital archives. Our work provides both practical evaluation resources and a methodological paradigm for developing retrieval systems that support richer, historically aware engagement with digital archives, ultimately working towards more emancipatory knowledge infrastructures.

32. 【2601.11863】Utilizing Metadata for Better Retrieval-Augmented Generation

链接https://arxiv.org/abs/2601.11863

作者:Raquib Bin Yousuf,Shengzhe Xu,Mandar Sharma,Andrew Neeser,Chris Latimer,Naren Ramakrishnan

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)

关键词:Retrieval-Augmented Generation systems, Generation systems depend, large language models, Retrieval-Augmented Generation, Generation systems

备注: The 48th European Conference on Information Retrieval (ECIR 2026)

点击查看摘要

Abstract:Retrieval-Augmented Generation systems depend on retrieving semantically relevant document chunks to support accurate, grounded outputs from large language models. In structured and repetitive corpora such as regulatory filings, chunk similarity alone often fails to distinguish between documents with overlapping language. Practitioners often flatten metadata into input text as a heuristic, but the impact and trade-offs of this practice remain poorly understood. We present a systematic study of metadata-aware retrieval strategies, comparing plain-text baselines with approaches that embed metadata directly. Our evaluation spans metadata-as-text (prefix and suffix), a dual-encoder unified embedding that fuses metadata and content in a single index, dual-encoder late-fusion retrieval, and metadata-aware query reformulation. Across multiple retrieval metrics and question types, we find that prefixing and unified embeddings consistently outperform plain-text baselines, with the unified at times exceeding prefixing while being easier to maintain. Beyond empirical comparisons, we analyze embedding space, showing that metadata integration improves effectiveness by increasing intra-document cohesion, reducing inter-document confusion, and widening the separation between relevant and irrelevant chunks. Field-level ablations show that structural cues provide strong disambiguating signals. Our code, evaluation framework, and the RAGMATE-10K dataset are publicly hosted.

33. 【2601.11825】AI Co-Scientist for Knowledge Synthesis in Medical Contexts: A Proof of Concept

链接https://arxiv.org/abs/2601.11825

作者:Arya Rahgozar,Pouria Mortezaagha

类目:Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:incomplete reporting, redundant studies, evidence synthesis workflows, science is driven, driven by redundant

备注

点击查看摘要

Abstract:Research waste in biomedical science is driven by redundant studies, incomplete reporting, and the limited scalability of traditional evidence synthesis workflows. We present an AI co-scientist for scalable and transparent knowledge synthesis based on explicit formalization of Population, Intervention, Comparator, Outcome, and Study design (PICOS). The platform integrates relational storage, vector-based semantic retrieval, and a Neo4j knowledge graph. Evaluation was conducted on dementia-sport and non-communicable disease corpora. Automated PICOS compliance and study design classification from titles and abstracts were performed using a Bidirectional Long Short-Term Memory baseline and a transformer-based multi-task classifier fine-tuned from PubMedBERT. Full-text synthesis employed retrieval-augmented generation with hybrid vector and graph retrieval, while BERTopic was used to identify thematic structure, redundancy, and evidence gaps. The transformer model achieved 95.7% accuracy for study design classification with strong agreement against expert annotations, while the Bi-LSTM achieved 87% accuracy for PICOS compliance detection. Retrieval-augmented generation outperformed non-retrieval generation for queries requiring structured constraints, cross-study integration, and graph-based reasoning, whereas non-retrieval approaches remained competitive for high-level summaries. Topic modeling revealed substantial thematic redundancy and identified underexplored research areas. These results demonstrate that PICOS-aware and explainable natural language processing can improve the scalability, transparency, and efficiency of evidence synthesis. The proposed architecture is domain-agnostic and offers a practical framework for reducing research waste across biomedical disciplines.

34. 【2601.11808】GPU-Resident Inverted File Index for Streaming Vector Databases

链接https://arxiv.org/abs/2601.11808

作者:Dongfang Zhao

类目:Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)

关键词:Retrieval-Augmented Generation, powering critical systems, critical systems ranging, Streaming Inverted File, Inverted File

备注

点击查看摘要

Abstract:Vector search has emerged as the computational backbone of modern AI infrastructure, powering critical systems ranging from Vector Databases to Retrieval-Augmented Generation (RAG). While the GPU-accelerated Inverted File (IVF) index acts as one of the most widely used techniques for these large-scale workloads due to its memory efficiency, its traditional architecture remains fundamentally static. Existing designs rely on rigid and contiguous memory layouts that lack native support for in-place mutation, creating a severe bottleneck for streaming scenarios. In applications requiring real-time knowledge updates, such as live recommendation engines or dynamic RAG systems, maintaining index freshness necessitates expensive CPU-GPU roundtrips that cause system latency to spike from milliseconds to seconds. In this paper, we propose SIVF (Streaming Inverted File), a new GPU-native architecture designed to empower vector databases with high-velocity data ingestion and deletion capabilities. SIVF replaces the static memory layout with a slab-based allocation system and a validity bitmap, enabling lock-free and in-place mutation directly in VRAM. We further introduce a GPU-resident address translation table (ATT) to resolve the overhead of locating vectors, providing $O(1)$ access to physical storage slots. We evaluate SIVF against the industry-standard GPU IVF implementation on the SIFT1M and GIST1M datasets. Microbenchmarks demonstrate that SIVF reduces deletion latency by up to $13,300\times$ (from 11.8 seconds to 0.89 ms on GIST1M) and improves ingestion throughput by $36\times$ to $105\times$. In end-to-end sliding window scenarios, SIVF eliminates system freezes and achieves a $161\times$ to $266\times$ speedup with single-digit millisecond latency. Notably, this performance incurs negligible storage penalty, maintaining less than 0.8\% memory overhead compared to static indices.

35. 【2601.11722】RAC: Retrieval-Augmented Clarification for Faithful Conversational Search

链接https://arxiv.org/abs/2601.11722

作者:Ahmed Rayane Kebir,Vincent Guigue,Lynda Said Lhadj,Laure Soulier

类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)

关键词:underspecified user queries, search systems resolve, systems resolve ambiguous, conversational search systems, conversational search

备注: This is the author's version of the work. The definitive version is published in: Proceedings of the 48th European Conference on Information Retrieval (ECIR '26), 29 March--2 April, 2026, Delft, Netherlands

点击查看摘要

Abstract:Clarification questions help conversational search systems resolve ambiguous or underspecified user queries. While prior work has focused on fluency and alignment with user intent, especially through facet extraction, much less attention has been paid to grounding clarifications in the underlying corpus. Without such grounding, systems risk asking questions that cannot be answered from the available documents. We introduce RAC (Retrieval-Augmented Clarification), a framework for generating corpus-faithful clarification questions. After comparing several indexing strategies for retrieval, we fine-tune a large language model to make optimal use of research context and to encourage the generation of evidence-based question. We then apply contrastive preference optimization to favor questions supported by retrieved passages over ungrounded alternatives. Evaluated on four benchmarks, RAC demonstrate significant improvements over baselines. In addition to LLM-as-Judge assessments, we introduce novel metrics derived from NLI and data-to-text to assess how well questions are anchored in the context, and we demonstrate that our approach consistently enhances faithfulness.

36. 【2601.11560】DeepEvidence: Empowering Biomedical Discovery with Deep Knowledge Graph Research

链接https://arxiv.org/abs/2601.11560

作者:Zifeng Wang,Zheng Chen,Ziwei Yang,Xuan Wang,Qiao Jin,Yifan Peng,Zhiyong Lu,Jimeng Sun

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:information spanning literature, heterogeneous information spanning, discovery remains difficult, scientific discovery remains, Deep Research

备注

点击查看摘要

Abstract:Biomedical knowledge graphs (KGs) encode vast, heterogeneous information spanning literature, genes, pathways, drugs, diseases, and clinical trials, but leveraging them collectively for scientific discovery remains difficult. Their structural differences, continual evolution, and limited cross-resource alignment require substantial manual integration, limiting the depth and scale of knowledge exploration. We introduce DeepEvidence, an AI-agent framework designed to perform Deep Research across various heterogeneous biomedical KGs. Unlike generic Deep Research systems that rely primarily on internet-scale text, DeepEvidence incorporates specialized knowledge-graph tooling and coordinated exploration strategies to systematically bridge heterogeneous resources. At its core is an orchestrator that directs two complementary agents: Breadth-First ReSearch (BFRS) for broad, multi-graph entity search, and Depth-First ReSearch (DFRS) for multi-hop, evidence-focused reasoning. An internal, incrementally built evidence graph provides a structured record of retrieved entities, relations, and supporting evidence. To operate at scale, DeepEvidence includes unified interfaces for querying diverse biomedical APIs and an execution sandbox that enables programmatic data retrieval, extraction, and analysis. Across established deep-reasoning benchmarks and four key stages of the biomedical discovery lifecycle: drug discovery, pre-clinical experimentation, clinical trial development, and evidence-based medicine, DeepEvidence demonstrates substantial gains in systematic exploration and evidence synthesis. These results highlight the potential of knowledge-graph-driven Deep Research to accelerate biomedical discovery.

37. 【2601.11557】From HNSW to Information-Theoretic Binarization: Rethinking the Architecture of Scalable Vector Search

链接https://arxiv.org/abs/2601.11557

作者:Seyed Moein Abtahi,Majid Fekri,Tara Khani,Akramul Azim

类目:Databases (cs.DB); Information Retrieval (cs.IR); Information Theory (cs.IT); Performance (cs.PF)

关键词:approximate nearest neighbor, escalating operational cost, high-precision floating-point vectors, Modern semantic search, systems rely predominantly

备注: 16 Pages, 5 Figures, 3 Tables

点击查看摘要

Abstract:Modern semantic search and retrieval-augmented generation (RAG) systems rely predominantly on in-memory approximate nearest neighbor (ANN) indexes over high-precision floating-point vectors, resulting in escalating operational cost and inherent trade-offs between latency, throughput, and retrieval accuracy. This paper analyzes the architectural limitations of the dominant "HNSW + float32 + cosine similarity" stack and evaluates existing cost-reduction strategies, including storage disaggregation and lossy vector quantization, which inevitably sacrifice either performance or accuracy. We introduce and empirically evaluate an alternative information-theoretic architecture based on maximally informative binarization (MIB), efficient bitwise distance metrics, and an information-theoretic scoring (ITS) mechanism. Unlike conventional ANN systems, this approach enables exhaustive search over compact binary representations, allowing deterministic retrieval and eliminating accuracy degradation under high query concurrency. Using the MAIR benchmark across 14 datasets and 10,038 queries, we compare this architecture against Elasticsearch, Pinecone, PGVector, and Qdrant. Results demonstrate retrieval quality comparable to full-precision systems, while achieving substantially lower latency and maintaining constant throughput at high request rates. We show that this architectural shift enables a truly serverless, cost-per-query deployment model, challenging the necessity of large in-memory ANN indexes for high-quality semantic search.

计算机视觉

1. 【2601.14256】Implicit Neural Representation Facilitates Unified Universal Vision Encoding

链接https://arxiv.org/abs/2601.14256

作者:Matthew Gwilliam,Xiao Wang,Xuefeng Hu,Zhenheng Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:image representation learning, typically designed, representation learning, model, image representation

备注: 18 pages, 16 tables, 4 figures

点击查看摘要

Abstract:Models for image representation learning are typically designed for either recognition or generation. Various forms of contrastive learning help models learn to convert images to embeddings that are useful for classification, detection, and segmentation. On the other hand, models can be trained to reconstruct images with pixel-wise, perceptual, and adversarial losses in order to learn a latent space that is useful for image generation. We seek to unify these two directions with a first-of-its-kind model that learns representations which are simultaneously useful for recognition and generation. We train our model as a hyper-network for implicit neural representation, which learns to map images to model weights for fast, accurate reconstruction. We further integrate our INR hyper-network with knowledge distillation to improve its generalization and performance. Beyond the novel training design, the model also learns an unprecedented compressed embedding space with outstanding performance for various visual tasks. The complete model competes with state-of-the-art results for image representation learning, while also enabling generative capabilities with its high-quality tiny embeddings. The code is available at this https URL.

2. 【2601.14255】VideoMaMa: Mask-Guided Video Matting via Generative Prior

链接https://arxiv.org/abs/2601.14255

作者:Sangbeom Lim,Seoung Wug Oh,Jiahui Huang,Heeji Yoon,Seungryong Kim,Joon-Young Lee

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:significant challenge due, Generalizing video matting, Generalizing video, real-world videos remains, video matting

备注: Project page: [this https URL](https://cvlab-kaist.github.io/VideoMaMa/)

点击查看摘要

Abstract:Generalizing video matting models to real-world videos remains a significant challenge due to the scarcity of labeled data. To address this, we present Video Mask-to-Matte Model (VideoMaMa) that converts coarse segmentation masks into pixel accurate alpha mattes, by leveraging pretrained video diffusion models. VideoMaMa demonstrates strong zero-shot generalization to real-world footage, even though it is trained solely on synthetic data. Building on this capability, we develop a scalable pseudo-labeling pipeline for large-scale video matting and construct the Matting Anything in Video (MA-V) dataset, which offers high-quality matting annotations for more than 50K real-world videos spanning diverse scenes and motions. To validate the effectiveness of this dataset, we fine-tune the SAM2 model on MA-V to obtain SAM2-Matte, which outperforms the same model trained on existing matting datasets in terms of robustness on in-the-wild videos. These findings emphasize the importance of large-scale pseudo-labeled video matting and showcase how generative priors and accessible segmentation cues can drive scalable progress in video matting research.

3. 【2601.14253】Motion 3-to-4: 3D Motion Reconstruction for 4D Synthesis

链接https://arxiv.org/abs/2601.14253

作者:Hongyuan Chen,Xingyu Chen,Youjia Zhang,Zexiang Xu,Anpei Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:single monocular video, synthesising high-quality, dynamic objects, feed-forward framework, framework for synthesising

备注: Project page: [this https URL](https://motion3-to-4.github.io/) . Code: [this https URL](https://github.com/Inception3D/Motion324)

点击查看摘要

Abstract:We present Motion 3-to-4, a feed-forward framework for synthesising high-quality 4D dynamic objects from a single monocular video and an optional 3D reference mesh. While recent advances have significantly improved 2D, video, and 3D content generation, 4D synthesis remains difficult due to limited training data and the inherent ambiguity of recovering geometry and motion from a monocular viewpoint. Motion 3-to-4 addresses these challenges by decomposing 4D synthesis into static 3D shape generation and motion reconstruction. Using a canonical reference mesh, our model learns a compact motion latent representation and predicts per-frame vertex trajectories to recover complete, temporally coherent geometry. A scalable frame-wise transformer further enables robustness to varying sequence lengths. Evaluations on both standard benchmarks and a new dataset with accurate ground-truth geometry show that Motion 3-to-4 delivers superior fidelity and spatial consistency compared to prior work. Project page is available at this https URL.

4. 【2601.14251】LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR

链接https://arxiv.org/abs/2601.14251

作者:Said Taghadouini,Adrien Cavaillès,Baptiste Aubertin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:brittle OCR pipelines, naturally ordered text, OCR pipelines, brittle OCR, converts document images

备注

点击查看摘要

Abstract:We present \textbf{LightOnOCR-2-1B}, a 1B-parameter end-to-end multilingual vision--language model that converts document images (e.g., PDFs) into clean, naturally ordered text without brittle OCR pipelines. Trained on a large-scale, high-quality distillation mix with strong coverage of scans, French documents, and scientific PDFs, LightOnOCR-2 achieves state-of-the-art results on OlmOCR-Bench while being 9$\times$ smaller and substantially faster than prior best-performing models. We further extend the output format to predict normalized bounding boxes for embedded images, introducing localization during pretraining via a resume strategy and refining it with RLVR using IoU-based rewards. Finally, we improve robustness with checkpoint averaging and task-arithmetic merging. We release model checkpoints under Apache 2.0, and publicly release the dataset and \textbf{LightOnOCR-bbox-bench} evaluation under their respective licenses.

5. 【2601.14250】OmniTransfer: All-in-one Framework for Spatio-temporal Video Transfer

链接https://arxiv.org/abs/2601.14250

作者:Pengze Zhang,Yanze Wu,Mengtian Li,Xu Bai,Songtao Zhao,Fulong Ye,Chong Mou,Xinghui Li,Zhuowei Chen,Qian He,Mingyuan Gao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Videos convey richer, convey richer information, capturing both spatial, convey richer, images or text

备注: Github Page: [this https URL](https://pangzecheung.github.io/OmniTransfer/)

点击查看摘要

Abstract:Videos convey richer information than images or text, capturing both spatial and temporal dynamics. However, most existing video customization methods rely on reference images or task-specific temporal priors, failing to fully exploit the rich spatio-temporal information inherent in videos, thereby limiting flexibility and generalization in video generation. To address these limitations, we propose OmniTransfer, a unified framework for spatio-temporal video transfer. It leverages multi-view information across frames to enhance appearance consistency and exploits temporal cues to enable fine-grained temporal control. To unify various video transfer tasks, OmniTransfer incorporates three key designs: Task-aware Positional Bias that adaptively leverages reference video information to improve temporal alignment or appearance consistency; Reference-decoupled Causal Learning separating reference and target branches to enable precise reference transfer while improving efficiency; and Task-adaptive Multimodal Alignment using multimodal semantic guidance to dynamically distinguish and tackle different tasks. Extensive experiments show that OmniTransfer outperforms existing methods in appearance (ID and style) and temporal transfer (camera movement and video effects), while matching pose-guided methods in motion transfer without using pose, establishing a new paradigm for flexible, high-fidelity video generation.

6. 【2601.14246】Soft Tail-dropping for Adaptive Visual Tokenization

链接https://arxiv.org/abs/2601.14246

作者:Zeyuan Chen,Kai Zhang,Zhuowen Tu,Yuanjun Xiong

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Soft Tail-dropping Adaptive, Tail-dropping Adaptive Tokenizer, present Soft Tail-dropping, Soft Tail-dropping, Tail-dropping Adaptive

备注

点击查看摘要

Abstract:We present Soft Tail-dropping Adaptive Tokenizer (STAT), a 1D discrete visual tokenizer that adaptively chooses the number of output tokens per image according to its structural complexity and level of detail. STAT encodes an image into a sequence of discrete codes together with per-token keep probabilities. Beyond standard autoencoder objectives, we regularize these keep probabilities to be monotonically decreasing along the sequence and explicitly align their distribution with an image-level complexity measure. As a result, STAT produces length-adaptive 1D visual tokens that are naturally compatible with causal 1D autoregressive (AR) visual generative models. On ImageNet-1k, equipping vanilla causal AR models with STAT yields competitive or superior visual generation quality compared to other probabilistic model families, while also exhibiting favorable scaling behavior that has been elusive in prior vanilla AR visual generation attempts.

7. 【2601.14232】KAGE-Bench: Fast Known-Axis Visual Generalization Evaluation for Reinforcement Learning

链接https://arxiv.org/abs/2601.14232

作者:Egor Cherepanov,Daniil Zelezetsky,Alexey K. Kovalev,Aleksandr I. Panov

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Pixel-based reinforcement learning, hinder systematic analysis, reinforcement learning agents, entangle multiple sources, Pixel-based reinforcement

备注: 38 pages, 44 figures, 3 tables

点击查看摘要

Abstract:Pixel-based reinforcement learning agents often fail under purely visual distribution shift even when latent dynamics and rewards are unchanged, but existing benchmarks entangle multiple sources of shift and hinder systematic analysis. We introduce KAGE-Env, a JAX-native 2D platformer that factorizes the observation process into independently controllable visual axes while keeping the underlying control problem fixed. By construction, varying a visual axis affects performance only through the induced state-conditional action distribution of a pixel policy, providing a clean abstraction for visual generalization. Building on this environment, we define KAGE-Bench, a benchmark of six known-axis suites comprising 34 train-evaluation configuration pairs that isolate individual visual shifts. Using a standard PPO-CNN baseline, we observe strong axis-dependent failures, with background and photometric shifts often collapsing success, while agent-appearance shifts are comparatively benign. Several shifts preserve forward motion while breaking task completion, showing that return alone can obscure generalization failures. Finally, the fully vectorized JAX implementation enables up to 33M environment steps per second on a single GPU, enabling fast and reproducible sweeps over visual factors. Code: this https URL.

8. 【2601.14208】Rig-Aware 3D Reconstruction of Vehicle Undercarriages using Gaussian Splatting

链接https://arxiv.org/abs/2601.14208

作者:Nitin Kulkarni,Akhil Devarashetti,Charlie Cluss,Livio Forte,Dan Buckmaster,Philip Schneider,Chunming Qiao,Alina Vereshchaka

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)

关键词:labor-intensive task, task that requires, crouch or crawl, crawl underneath, Inspecting the undercarriage

备注: 8 pages, 9 figures, Conference: IEEE International Conference on Machine Learning and Applications 2025 (ICMLA 2025): [this https URL](https://www.icmla-conference.org/icmla25/)

点击查看摘要

Abstract:Inspecting the undercarriage of used vehicles is a labor-intensive task that requires inspectors to crouch or crawl underneath each vehicle to thoroughly examine it. Additionally, online buyers rarely see undercarriage photos. We present an end-to-end pipeline that utilizes a three-camera rig to capture videos of the undercarriage as the vehicle drives over it, and produces an interactive 3D model of the undercarriage. The 3D model enables inspectors and customers to rotate, zoom, and slice through the undercarriage, allowing them to detect rust, leaks, or impact damage in seconds, thereby improving both workplace safety and buyer confidence. Our primary contribution is a rig-aware Structure-from-Motion (SfM) pipeline specifically designed to overcome the challenges of wide-angle lens distortion and low-parallax scenes. Our method overcomes the challenges of wide-angle lens distortion and low-parallax scenes by integrating precise camera calibration, synchronized video streams, and strong geometric priors from the camera rig. We use a constrained matching strategy with learned components, the DISK feature extractor, and the attention-based LightGlue matcher to generate high-quality sparse point clouds that are often unattainable with standard SfM pipelines. These point clouds seed the Gaussian splatting process to generate photorealistic undercarriage models that render in real-time. Our experiments and ablation studies demonstrate that our design choices are essential to achieve state-of-the-art quality.

9. 【2601.14207】Copy-Trasform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints

链接https://arxiv.org/abs/2601.14207

作者:Rotem Gatenyo,Ohad Fried

类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

关键词:text prompt describing, study zero-shot, scene assembly, text prompt, prompt describing

备注

点击查看摘要

Abstract:We study zero-shot 3D alignment of two given meshes, using a text prompt describing their spatial relation -- an essential capability for content creation and scene assembly. Earlier approaches primarily rely on geometric alignment procedures, while recent work leverages pretrained 2D diffusion models to model language-conditioned object-object spatial relationships. In contrast, we directly optimize the relative pose at test time, updating translation, rotation, and isotropic scale with CLIP-driven gradients via a differentiable renderer, without training a new model. Our framework augments language supervision with geometry-aware objectives: a variant of soft-Iterative Closest Point (ICP) term to encourage surface attachment and a penetration loss to discourage interpenetration. A phased schedule strengthens contact constraints over time, and camera control concentrates the optimization on the interaction region. To enable evaluation, we curate a benchmark containing diverse categories and relations, and compare against baselines. Our method outperforms all alternatives, yielding semantically faithful and physically plausible alignments.

10. 【2601.14188】IIR-VLM: In-Context Instance-level Recognition for Large Vision-Language Models

链接https://arxiv.org/abs/2601.14188

作者:Liang Shi,Wei Li,Kevin M Beussman,Lin Chen,Yun Fu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:concerns distinguishing individual, Instance-level recognition, concerns distinguishing, In-context Instance-level Recognition, distinguishing individual instances

备注

点击查看摘要

Abstract:Instance-level recognition (ILR) concerns distinguishing individual instances from one another, with person re-identification as a prominent example. Despite the impressive visual perception capabilities of modern VLMs, we find their performance on ILR unsatisfactory, often dramatically underperforming domain-specific ILR models. This limitation hinders many practical application of VLMs, e.g. where recognizing familiar people and objects is crucial for effective visual understanding. Existing solutions typically learn to recognize instances one at a time using instance-specific datasets, which not only incur substantial data collection and training costs but also struggle with fine-grained discrimination. In this work, we propose IIR-VLM, a VLM enhanced for In-context Instance-level Recognition. We integrate pre-trained ILR expert models as auxiliary visual encoders to provide specialized features for learning diverse instances, which enables VLMs to learn new instances in-context in a one-shot manner. Further, IIR-VLM leverages this knowledge for instance-aware visual understanding. We validate IIR-VLM's efficacy on existing instance personalization benchmarks. Finally, we demonstrate its superior ILR performance on a challenging new benchmark, which assesses ILR capabilities across varying difficulty and diverse categories, with person, face, pet and general objects as the instances at task.

11. 【2601.14180】Progressive self-supervised blind-spot denoising method for LDCT denoising

链接https://arxiv.org/abs/2601.14180

作者:Yichao Liu,Yueyang Teng,Junwen Guo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:low-dose computed tomography, computed tomography, clinical practice, NDCT, increasingly investigated

备注

点击查看摘要

Abstract:Self-supervised learning is increasingly investigated for low-dose computed tomography (LDCT) image denoising, as it alleviates the dependence on paired normal-dose CT (NDCT) data, which are often difficult to acquire in clinical practice. In this paper, we propose a novel self-supervised training strategy that relies exclusively on LDCT images. We introduce a step-wise blind-spot denoising mechanism that enforces conditional independence in a progressive manner, enabling more fine-grained denoising learning. In addition, we add Gaussian noise to LDCT images, which acts as a regularization and mitigates overfitting. Extensive experiments on the Mayo LDCT dataset demonstrate that the proposed method consistently outperforms existing self-supervised approaches and achieves performance comparable to, or better than, several representative supervised denoising methods.

12. 【2601.14165】ASBA: A-line State Space Model and B-line Attention for Sparse Optical Doppler Tomography Reconstruction

链接https://arxiv.org/abs/2601.14165

作者:Zhenghong Li,Wensheng Cheng,Congwu Du,Yingtian Pan,Zhaozheng Yin,Haibin Ling

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Optical Doppler Tomography, Optical Doppler, Doppler Tomography, A-line ROI State, Doppler phase-subtraction analysis

备注: 17 pages, 11 figures

点击查看摘要

Abstract:Optical Doppler Tomography (ODT) is an emerging blood flow analysis technique. A 2D ODT image (B-scan) is generated by sequentially acquiring 1D depth-resolved raw A-scans (A-line) along the lateral axis (B-line), followed by Doppler phase-subtraction analysis. To ensure high-fidelity B-scan images, current practices rely on dense sampling, which prolongs scanning time, increases storage demands, and limits the capture of rapid blood flow dynamics. Recent studies have explored sparse sampling of raw A-scans to alleviate these limitations, but their effectiveness is hindered by the conservative sampling rates and the uniform modeling of flow and background signals. In this study, we introduce a novel blood flow-aware network, named ASBA (A-line ROI State space model and B-line phase Attention), to reconstruct ODT images from highly sparsely sampled raw A-scans. Specifically, we propose an A-line ROI state space model to extract sparsely distributed flow features along the A-line, and a B-line phase attention to capture long-range flow signals along each B-line based on phase difference. Moreover, we introduce a flow-aware weighted loss function that encourages the network to prioritize the accurate reconstruction of flow signals. Extensive experiments on real animal data demonstrate that the proposed approach clearly outperforms existing state-of-the-art reconstruction methods.

13. 【2601.14161】One-Shot Refiner: Boosting Feed-forward Novel View Synthesis via One-Step Diffusion

链接https://arxiv.org/abs/2601.14161

作者:Yitong Dong,Qi Zhang,Minchao Jiang,Zhiqiang Wu,Qingnan Fan,Ying Feng,Huaqi Zhang,Hujun Bao,Guofeng Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Vision Transformer, addressing key limitations, built on Vision, Gaussian Splatting, addressing key

备注

点击查看摘要

Abstract:We present a novel framework for high-fidelity novel view synthesis (NVS) from sparse images, addressing key limitations in recent feed-forward 3D Gaussian Splatting (3DGS) methods built on Vision Transformer (ViT) backbones. While ViT-based pipelines offer strong geometric priors, they are often constrained by low-resolution inputs due to computational costs. Moreover, existing generative enhancement methods tend to be 3D-agnostic, resulting in inconsistent structures across views, especially in unseen regions. To overcome these challenges, we design a Dual-Domain Detail Perception Module, which enables handling high-resolution images without being limited by the ViT backbone, and endows Gaussians with additional features to store high-frequency details. We develop a feature-guided diffusion network, which can preserve high-frequency details during the restoration process. We introduce a unified training strategy that enables joint optimization of the ViT-based geometric backbone and the diffusion-based refinement module. Experiments demonstrate that our method can maintain superior generation quality across multiple datasets.

14. 【2601.14154】LLM Augmented Intervenable Multimodal Adaptor for Post-operative Complication Prediction in Lung Cancer Surgery

链接https://arxiv.org/abs/2601.14154

作者:Shubham Pandey,Bhavin Jawade,Srirangaraj Setlur,Venu Govindaraju,Kenneth Seastedt

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:rising healthcare costs, Postoperative complications remain, adversely affecting patient, affecting patient outcomes, adversely affecting

备注: Accepted to P2P-CV @ WACV 2026

点击查看摘要

Abstract:Postoperative complications remain a critical concern in clinical practice, adversely affecting patient outcomes and contributing to rising healthcare costs. We present MIRACLE, a deep learning architecture for prediction of risk of postoperative complications in lung cancer surgery by integrating preoperative clinical and radiological data. MIRACLE employs a hyperspherical embedding space fusion of heterogeneous inputs, enabling the extraction of robust, discriminative features from both structured clinical records and high-dimensional radiological images. To enhance transparency of prediction and clinical utility, we incorporate an interventional deep learning module in MIRACLE, that not only refines predictions but also provides interpretable and actionable insights, allowing domain experts to interactively adjust recommendations based on clinical expertise. We validate our approach on POC-L, a real-world dataset comprising 3,094 lung cancer patients who underwent surgery at Roswell Park Comprehensive Cancer Center. Our results demonstrate that MIRACLE outperforms various traditional machine learning models and contemporary large language models (LLM) variants alone, for personalized and explainable postoperative risk management.

15. 【2601.14133】winBrainVLA: Unleashing the Potential of Generalist VLMs for Embodied Tasks via Asymmetric Mixture-of-Transformers

链接https://arxiv.org/abs/2601.14133

作者:Bin Yu,Shijie Lian,Xiaopeng Lin,Yuliang Wei,Zhaolong Shen,Changti Wu,Yuzhuo Miao,Xinming Wang,Bailing Wang,Cong Huang,Kai Chen

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:models typically fine-tune, monolithic Vision-Language Model, typically fine-tune, fine-tune a monolithic, monolithic Vision-Language

备注: GitHub: [this https URL](https://github.com/ZGC-EmbodyAI/TwinBrainVLA)

点击查看摘要

Abstract:Standard Vision-Language-Action (VLA) models typically fine-tune a monolithic Vision-Language Model (VLM) backbone explicitly for robotic control. However, this approach creates a critical tension between maintaining high-level general semantic understanding and learning low-level, fine-grained sensorimotor skills, often leading to "catastrophic forgetting" of the model's open-world capabilities. To resolve this conflict, we introduce TwinBrainVLA, a novel architecture that coordinates a generalist VLM retaining universal semantic understanding and a specialist VLM dedicated to embodied proprioception for joint robotic control. TwinBrainVLA synergizes a frozen "Left Brain", which retains robust general visual reasoning, with a trainable "Right Brain", specialized for embodied perception, via a novel Asymmetric Mixture-of-Transformers (AsyMoT) mechanism. This design allows the Right Brain to dynamically query semantic knowledge from the frozen Left Brain and fuse it with proprioceptive states, providing rich conditioning for a Flow-Matching Action Expert to generate precise continuous controls. Extensive experiments on SimplerEnv and RoboCasa benchmarks demonstrate that TwinBrainVLA achieves superior manipulation performance compared to state-of-the-art baselines while explicitly preserving the comprehensive visual understanding capabilities of the pre-trained VLM, offering a promising direction for building general-purpose robots that simultaneously achieve high-level semantic understanding and low-level physical dexterity.

16. 【2601.14130】GIC-DLC: Differentiable Logic Circuits for Hardware-Friendly Grayscale Image Compression

链接https://arxiv.org/abs/2601.14130

作者:Till Aczel,David F. Jenny,Simon Bührer,Andreas Plesner,Antonio Di Maio,Roger Wattenhofer

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:PNG or JPEG-XL, Differentiable Logic Circuits, substantial computational overhead, codecs achieve higher, achieve higher compression

备注

点击查看摘要

Abstract:Neural image codecs achieve higher compression ratios than traditional hand-crafted methods such as PNG or JPEG-XL, but often incur substantial computational overhead, limiting their deployment on energy-constrained devices such as smartphones, cameras, and drones. We propose Grayscale Image Compression with Differentiable Logic Circuits (GIC-DLC), a hardware-aware codec where we train lookup tables to combine the flexibility of neural networks with the efficiency of Boolean operations. Experiments on grayscale benchmark datasets show that GIC-DLC outperforms traditional codecs in compression efficiency while allowing substantial reductions in energy consumption and latency. These results demonstrate that learned compression can be hardware-friendly, offering a promising direction for low-power image compression on edge devices.

17. 【2601.14127】he Side Effects of Being Smart: Safety Risks in MLLMs' Multi-Image Reasoning

链接https://arxiv.org/abs/2601.14127

作者:Renmiao Chen,Yida Lu,Shiyao Cui,Xuan Ouyang,Victor Shea-Jay Huang,Shumin Zhang,Chengwei Pan,Han Qiu,Minlie Huang

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, acquire stronger reasoning

备注: *15 pages, 5 figures. Introduces MIR-SafetyBench (2,676 instances; 9 multi-image relations). Equal contribution; †Corresponding author. Code/data: [this https URL](https://github.com/thu-coai/MIR-SafetyBench)

点击查看摘要

Abstract:As Multimodal Large Language Models (MLLMs) acquire stronger reasoning capabilities to handle complex, multi-image instructions, this advancement may pose new safety risks. We study this problem by introducing MIR-SafetyBench, the first benchmark focused on multi-image reasoning safety, which consists of 2,676 instances across a taxonomy of 9 multi-image relations. Our extensive evaluations on 19 MLLMs reveal a troubling trend: models with more advanced multi-image reasoning can be more vulnerable on MIR-SafetyBench. Beyond attack success rates, we find that many responses labeled as safe are superficial, often driven by misunderstanding or evasive, non-committal replies. We further observe that unsafe generations exhibit lower attention entropy than safe ones on average. This internal signature suggests a possible risk that models may over-focus on task solving while neglecting safety constraints. Our code and data are available at this https URL.

18. 【2601.14111】PMCE: Probabilistic Multi-Granularity Semantics with Caption-Guided Enhancement for Few-Shot Learning

链接https://arxiv.org/abs/2601.14111

作者:Jiaying Wu,Can Gao,Jinglu Hu,Hui Li,Xiaofeng Cao,Jingcai Guo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Few-shot learning aims, labeled samples, generalize poorly, learning aims, aims to identify

备注

点击查看摘要

Abstract:Few-shot learning aims to identify novel categories from only a handful of labeled samples, where prototypes estimated from scarce data are often biased and generalize poorly. Semantic-based methods alleviate this by introducing coarse class-level information, but they are mostly applied on the support side, leaving query representations unchanged. In this paper, we present PMCE, a Probabilistic few-shot framework that leverages Multi-granularity semantics with Caption-guided Enhancement. PMCE constructs a nonparametric knowledge bank that stores visual statistics for each category as well as CLIP-encoded class name embeddings of the base classes. At meta-test time, the most relevant base classes are retrieved based on the similarities of class name embeddings for each novel category. These statistics are then aggregated into category-specific prior information and fused with the support set prototypes via a simple MAP update. Simultaneously, a frozen BLIP captioner provides label-free instance-level image descriptions, and a lightweight enhancer trained on base classes optimizes both support prototypes and query features under an inductive protocol with a consistency regularization to stabilize noisy captions. Experiments on four benchmarks show that PMCE consistently improves over strong baselines, achieving up to 7.71% absolute gain over the strongest semantic competitor on MiniImageNet in the 1-shot setting. Our code is available at this https URL

19. 【2601.14104】Diffusion-Guided Backdoor Attacks in Real-World Reinforcement Learning

链接https://arxiv.org/abs/2601.14104

作者:Tairan Huang,Qingqing Ye,Yulin Jin,Jiawei Lian,Yi Wang,Haibo Hu

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:embed hidden malicious, hidden malicious behaviors, attacks embed hidden, reinforcement learning, policies and activate

备注

点击查看摘要

Abstract:Backdoor attacks embed hidden malicious behaviors in reinforcement learning (RL) policies and activate them using triggers at test time. Most existing attacks are validated only in simulation, while their effectiveness in real-world robotic systems remains unclear. In physical deployment, safety-constrained control pipelines such as velocity limiting, action smoothing, and collision avoidance suppress abnormal actions, causing strong attenuation of conventional backdoor attacks. We study this previously overlooked problem and propose a diffusion-guided backdoor attack framework (DGBA) for real-world RL. We design small printable visual patch triggers placed on the floor and generate them using a conditional diffusion model that produces diverse patch appearances under real-world visual variations. We treat the robot control stack as a black-box system. We further introduce an advantage-based poisoning strategy that injects triggers only at decision-critical training states. We evaluate our method on a TurtleBot3 mobile robot and demonstrate reliable activation of targeted attacks while preserving normal task performance. Demo videos and code are available in the supplementary material.

20. 【2601.14103】Interp3D: Correspondence-aware Interpolation for Generative Textured 3D Morphing

链接https://arxiv.org/abs/2601.14103

作者:Xiaolu Liu,Yicong Li,Qiyuan He,Jiayin Zhu,Wei Ji,Angela Yao,Jianke Zhu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:seeks to generate, generate smooth, smooth and plausible, morphing seeks, plausible transitions

备注: 22 pages, 12 figures

点击查看摘要

Abstract:Textured 3D morphing seeks to generate smooth and plausible transitions between two 3D assets, preserving both structural coherence and fine-grained appearance. This ability is crucial not only for advancing 3D generation research but also for practical applications in animation, editing, and digital content creation. Existing approaches either operate directly on geometry, limiting them to shape-only morphing while neglecting textures, or extend 2D interpolation strategies into 3D, which often causes semantic ambiguity, structural misalignment, and texture blurring. These challenges underscore the necessity to jointly preserve geometric consistency, texture alignment, and robustness throughout the transition process. To address this, we propose Interp3D, a novel training-free framework for textured 3D morphing. It harnesses generative priors and adopts a progressive alignment principle to ensure both geometric fidelity and texture coherence. Starting from semantically aligned interpolation in condition space, Interp3D enforces structural consistency via SLAT (Structured Latent)-guided structure interpolation, and finally transfers appearance details through fine-grained texture fusion. For comprehensive evaluations, we construct a dedicated dataset, Interp3DData, with graded difficulty levels and assess generation results from fidelity, transition smoothness, and plausibility. Both quantitative metrics and human studies demonstrate the significant advantages of our proposed approach over previous methods. Source code is available at this https URL.

21. 【2601.14101】Curriculum-Based Strategies for Efficient Cross-Domain Action Recognition

链接https://arxiv.org/abs/2601.14101

作者:Emily Kim,Allen Wu,Jessica Hodgins

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:diverse viewpoints remains, generalizing to diverse, remains a challenge, significant progress, progress in human

备注

点击查看摘要

Abstract:Despite significant progress in human action recognition, generalizing to diverse viewpoints remains a challenge. Most existing datasets are captured from ground-level perspectives, and models trained on them often struggle to transfer to drastically different domains such as aerial views. This paper examines how curriculum-based training strategies can improve generalization to unseen real aerial-view data without using any real aerial data during training. We explore curriculum learning for cross-view action recognition using two out-of-domain sources: synthetic aerial-view data and real ground-view data. Our results on the evaluation on order of training (fine-tuning on synthetic aerial data vs. real ground data) shows that fine-tuning on real ground data but differ in how they transition from synthetic to real. The first uses a two-stage curriculum with direct fine-tuning, while the second applies a progressive curriculum that expands the dataset in multiple stages before fine-tuning. We evaluate both methods on the REMAG dataset using SlowFast (CNN-based) and MViTv2 (Transformer-based) architectures. Results show that combining the two out-of-domain datasets clearly outperforms training on a single domain, whether real ground-view or synthetic aerial-view. Both curriculum strategies match the top-1 accuracy of simple dataset combination while offering efficiency gains. With the two-step fine-tuning method, SlowFast achieves up to a 37% reduction in iterations and MViTv2 up to a 30% reduction compared to simple combination. The multi-step progressive approach further reduces iterations, by up to 9% for SlowFast and 30% for MViTv2, relative to the two-step method. These findings demonstrate that curriculum-based training can maintain comparable performance (top-1 accuracy within 3% range) while improving training efficiency in cross-view action recognition.

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2601.14101 [cs.CV]

(or
arXiv:2601.14101v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2601.14101

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
22. 【2601.14086】wo-Stream temporal transformer for video action classification

链接https://arxiv.org/abs/2601.14086

作者:Nattapong Kurpukdee,Adrian G. Bors

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Motion representation plays, including action recognition, applications including action, Motion representation, action recognition

备注

点击查看摘要

Abstract:Motion representation plays an important role in video understanding and has many applications including action recognition, robot and autonomous guidance or others. Lately, transformer networks, through their self-attention mechanism capabilities, have proved their efficiency in many applications. In this study, we introduce a new two-stream transformer video classifier, which extracts spatio-temporal information from content and optical flow representing movement information. The proposed model identifies self-attention features across the joint optical flow and temporal frame domain and represents their relationships within the transformer encoder mechanism. The experimental results show that our proposed methodology provides excellent classification results on three well-known video datasets of human activities.

23. 【2601.14084】DermaBench: A Clinician-Annotated Benchmark Dataset for Dermatology Visual Question Answering and Reasoning

链接https://arxiv.org/abs/2601.14084

作者:Abdurrahim Yilmaz,Ozan Erdem,Ece Gokyayla,Ayda Acar,Burc Bugra Dagtas,Dilara Ilhan Erdil,Gulsum Gencoglan,Burak Temelkuran

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:image-level classification tasks, dermatology remains limited, Vision-language models, medical applications, increasingly important

备注

点击查看摘要

Abstract:Vision-language models (VLMs) are increasingly important in medical applications; however, their evaluation in dermatology remains limited by datasets that focus primarily on image-level classification tasks such as lesion recognition. While valuable for recognition, such datasets cannot assess the full visual understanding, language grounding, and clinical reasoning capabilities of multimodal models. Visual question answering (VQA) benchmarks are required to evaluate how models interpret dermatological images, reason over fine-grained morphology, and generate clinically meaningful descriptions. We introduce DermaBench, a clinician-annotated dermatology VQA benchmark built on the Diverse Dermatology Images (DDI) dataset. DermaBench comprises 656 clinical images from 570 unique patients spanning Fitzpatrick skin types I-VI. Using a hierarchical annotation schema with 22 main questions (single-choice, multi-choice, and open-ended), expert dermatologists annotated each image for diagnosis, anatomic site, lesion morphology, distribution, surface features, color, and image quality, together with open-ended narrative descriptions and summaries, yielding approximately 14.474 VQA-style annotations. DermaBench is released as a metadata-only dataset to respect upstream licensing and is publicly available at Harvard Dataverse.

24. 【2601.14079】VENI: Variational Encoder for Natural Illumination

链接https://arxiv.org/abs/2601.14079

作者:Paul Walker,James A. D. Gardner,Andreea Ardelean,William A. P. Smith,Bernhard Egger

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Inverse rendering, ill-posed problem, Vector Neuron Vision, Neuron Vision Transformer, Vector Neurons

备注: Project Repo - [this https URL](https://github.com/paul-pw/veni) Project page - [this https URL](https://paul-pw.github.io/veni)

点击查看摘要

Abstract:Inverse rendering is an ill-posed problem, but priors like illumination priors, can simplify it. Existing work either disregards the spherical and rotation-equivariant nature of illumination environments or does not provide a well-behaved latent space. We propose a rotation-equivariant variational autoencoder that models natural illumination on the sphere without relying on 2D projections. To preserve the SO(2)-equivariance of environment maps, we use a novel Vector Neuron Vision Transformer (VN-ViT) as encoder and a rotation-equivariant conditional neural field as decoder. In the encoder, we reduce the equivariance from SO(3) to SO(2) using a novel SO(2)-equivariant fully connected layer, an extension of Vector Neurons. We show that our SO(2)-equivariant fully connected layer outperforms standard Vector Neurons when used in our SO(2)-equivariant model. Compared to previous methods, our variational autoencoder enables smoother interpolation in latent space and offers a more well-behaved latent space.

25. 【2601.14069】Unsupervised Video Class-Incremental Learning via Deep Embedded Clustering Management

链接https://arxiv.org/abs/2601.14069

作者:Nattapong Kurpukdee,Adrian G. Bors

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:important learning paradigm, Unsupervised video class, Unsupervised video, represents an important, class incremental learning

备注

点击查看摘要

Abstract:Unsupervised video class incremental learning (uVCIL) represents an important learning paradigm for learning video information without forgetting, and without considering any data labels. Prior approaches have focused on supervised class-incremental learning, relying on using the knowledge of labels and task boundaries, which is costly, requires human annotation, or is simply not a realistic option. In this paper, we propose a simple yet effective approach to address the uVCIL. We first consider a deep feature extractor network, providing a set of representative video features during each task without assuming any class or task information. We then progressively build a series of deep clusters from the extracted features. During the successive task learning, the model updated from the previous task is used as an initial state in order to transfer knowledge to the current learning task. We perform in-depth evaluations on three standard video action recognition datasets, including UCF101, HMDB51, and Something-to-Something V2, by ignoring the labels from the supervised setting. Our approach significantly outperforms other baselines on all datasets.

26. 【2601.14066】VERIDAH: Solving Enumeration Anomaly Aware Vertebra Labeling across Imaging Sequences

链接https://arxiv.org/abs/2601.14066

作者:Hendrik Möller,Hanna Schoen,Robert Graf,Matan Atad,Nathan Molinier,Anjany Sekuboyina,Bettina K. Budai,Fabian Bamberg,Steffen Ringhof,Christopher Schlett,Tobias Pischon,Thoralf Niendorf,Josua A. Decker,Marc-André Weber,Bjoern Menze,Daniel Rueckert,Jan S. Kirschke

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:human spine commonly, spine commonly consists, enumeration anomalies, lumbar enumeration anomalies, thoracic enumeration anomalies

备注

点击查看摘要

Abstract:The human spine commonly consists of seven cervical, twelve thoracic, and five lumbar vertebrae. However, enumeration anomalies may result in individuals having eleven or thirteen thoracic vertebrae and four or six lumbar vertebrae. Although the identification of enumeration anomalies has potential clinical implications for chronic back pain and operation planning, the thoracolumbar junction is often poorly assessed and rarely described in clinical reports. Additionally, even though multiple deep-learning-based vertebra labeling algorithms exist, there is a lack of methods to automatically label enumeration anomalies. Our work closes that gap by introducing "Vertebra Identification with Anomaly Handling" (VERIDAH), a novel vertebra labeling algorithm based on multiple classification heads combined with a weighted vertebra sequence prediction algorithm. We show that our approach surpasses existing models on T2w TSE sagittal (98.30% vs. 94.24% of subjects with all vertebrae correctly labeled, p 0.001) and CT imaging (99.18% vs. 77.26% of subjects with all vertebrae correctly labeled, p 0.001) and works in arbitrary field-of-view images. VERIDAH correctly labeled the presence 2 Möller et al. of thoracic enumeration anomalies in 87.80% and 96.30% of T2w and CT images, respectively, and lumbar enumeration anomalies in 94.48% and 97.22% for T2w and CT, respectively. Our code and models are available at: this https URL.

27. 【2601.14060】Fine-Grained Zero-Shot Composed Image Retrieval with Complementary Visual-Semantic Integration

链接https://arxiv.org/abs/2601.14060

作者:Yongcong Ye,Kai Zhang,Yanghai Zhang,Enhong Chen,Longfei Li,Jun Zhou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:significant practical applications, rapidly growing area, Zero-shot composed image, composed image retrieval, relative caption describing

备注

点击查看摘要

Abstract:Zero-shot composed image retrieval (ZS-CIR) is a rapidly growing area with significant practical applications, allowing users to retrieve a target image by providing a reference image and a relative caption describing the desired modifications. Existing ZS-CIR methods often struggle to capture fine-grained changes and integrate visual and semantic information effectively. They primarily rely on either transforming the multimodal query into a single text using image-to-text models or employing large language models for target image description generation, approaches that often fail to capture complementary visual information and complete semantic context. To address these limitations, we propose a novel Fine-Grained Zero-Shot Composed Image Retrieval method with Complementary Visual-Semantic Integration (CVSI). Specifically, CVSI leverages three key components: (1) Visual Information Extraction, which not only extracts global image features but also uses a pre-trained mapping network to convert the image into a pseudo token, combining it with the modification text and the objects most likely to be added. (2) Semantic Information Extraction, which involves using a pre-trained captioning model to generate multiple captions for the reference image, followed by leveraging an LLM to generate the modified captions and the objects most likely to be added. (3) Complementary Information Retrieval, which integrates information extracted from both the query and database images to retrieve the target image, enabling the system to efficiently handle retrieval queries in a variety of situations. Extensive experiments on three public datasets (e.g., CIRR, CIRCO, and FashionIQ) demonstrate that CVSI significantly outperforms existing state-of-the-art methods. Our code is available at this https URL.

28. 【2601.14056】POCI-Diff: Position Objects Consistently and Interactively with 3D-Layout Guided Diffusion

链接https://arxiv.org/abs/2601.14056

作者:Andrea Rigo,Luca Stornaiuolo,Weijie Wang,Mauro Martino,Bruno Lepri,Nicu Sebe

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:diffusion-based approach, Positioning Objects Consistently, Blended Latent Diffusion, Consistently and Interactively, propose a diffusion-based

备注

点击查看摘要

Abstract:We propose a diffusion-based approach for Text-to-Image (T2I) generation with consistent and interactive 3D layout control and editing. While prior methods improve spatial adherence using 2D cues or iterative copy-warp-paste strategies, they often distort object geometry and fail to preserve consistency across edits. To address these limitations, we introduce a framework for Positioning Objects Consistently and Interactively (POCI-Diff), a novel formulation for jointly enforcing 3D geometric constraints and instance-level semantic binding within a unified diffusion process. Our method enables explicit per-object semantic control by binding individual text descriptions to specific 3D bounding boxes through Blended Latent Diffusion, allowing one-shot synthesis of complex multi-object scenes. We further propose a warping-free generative editing pipeline that supports object insertion, removal, and transformation via regeneration rather than pixel deformation. To preserve object identity and consistency across edits, we condition the diffusion process on reference images using IP-Adapter, enabling coherent object appearance throughout interactive 3D editing while maintaining global scene coherence. Experimental results demonstrate that POCI-Diff produces high-quality images consistent with the specified 3D layouts and edits, outperforming state-of-the-art methods in both visual fidelity and layout adherence while eliminating warping-induced geometric artifacts.

29. 【2601.14055】Decoder-Free Supervoxel GNN for Accurate Brain-Tumor Localization in Multi-Modal MRI

链接https://arxiv.org/abs/2601.14055

作者:Andrea Protani,Marc Molina Van Den Bosch,Lorenzo Giusti,Heloisa Barbosa Da Silva,Paolo Cacace,Albert Sund Aillet,Miguel Angel Gonzalez Ballester,Friedhelm Hummel,Luigi Serio

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Modern vision backbones, parameter-heavy encoder-decoder structures, imaging typically process, typically process dense, process dense voxel

备注: 10 pages, 3 figures,

点击查看摘要

Abstract:Modern vision backbones for 3D medical imaging typically process dense voxel grids through parameter-heavy encoder-decoder structures, a design that allocates a significant portion of its parameters to spatial reconstruction rather than feature learning. Our approach introduces SVGFormer, a decoder-free pipeline built upon a content-aware grouping stage that partitions the volume into a semantic graph of supervoxels. Its hierarchical encoder learns rich node representations by combining a patch-level Transformer with a supervoxel-level Graph Attention Network, jointly modeling fine-grained intra-region features and broader inter-regional dependencies. This design concentrates all learnable capacity on feature encoding and provides inherent, dual-scale explainability from the patch to the region level. To validate the framework's flexibility, we trained two specialized models on the BraTS dataset: one for node-level classification and one for tumor proportion regression. Both models achieved strong performance, with the classification model achieving a F1-score of 0.875 and the regression model a MAE of 0.028, confirming the encoder's ability to learn discriminative and localized features. Our results establish that a graph-based, encoder-only paradigm offers an accurate and inherently interpretable alternative for 3D medical image representation.

30. 【2601.14053】LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems

链接https://arxiv.org/abs/2601.14053

作者:Badri N. Patro,Vijay S. Agneeswaran

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA); Image and Video Processing (eess.IV)

关键词:foundational Transformer architectures, foundational Transformer, Transformer architectures, reasoning-capable systems approaching, systems approaching human-level

备注

点击查看摘要

Abstract:The field of artificial intelligence has undergone a revolution from foundational Transformer architectures to reasoning-capable systems approaching human-level performance. We present LLMOrbit, a comprehensive circular taxonomy navigating the landscape of large language models spanning 2019-2025. This survey examines over 50 models across 15 organizations through eight interconnected orbital dimensions, documenting architectural innovations, training methodologies, and efficiency patterns defining modern LLMs, generative AI, and agentic systems. We identify three critical crises: (1) data scarcity (9-27T tokens depleted by 2026-2028), (2) exponential cost growth ($3M to $300M+ in 5 years), and (3) unsustainable energy consumption (22x increase), establishing the scaling wall limiting brute-force approaches. Our analysis reveals six paradigms breaking this wall: (1) test-time compute (o1, DeepSeek-R1 achieve GPT-4 performance with 10x inference compute), (2) quantization (4-8x compression), (3) distributed edge computing (10x cost reduction), (4) model merging, (5) efficient training (ORPO reduces memory 50%), and (6) small specialized models (Phi-4 14B matches larger models). Three paradigm shifts emerge: (1) post-training gains (RLHF, GRPO, pure RL contribute substantially, DeepSeek-R1 achieving 79.8% MATH), (2) efficiency revolution (MoE routing 18x efficiency, Multi-head Latent Attention 8x KV cache compression enables GPT-4-level performance at $0.30/M tokens), and (3) democratization (open-source Llama 3 88.6% MMLU surpasses GPT-4 86.4%). We provide insights into techniques (RLHF, PPO, DPO, GRPO, ORPO), trace evolution from passive generation to tool-using agents (ReAct, RAG, multi-agent systems), and analyze post-training innovations.

31. 【2601.14052】Vision Also You Need: Navigating Out-of-Distribution Detection with Multimodal Large Language Model

链接https://arxiv.org/abs/2601.14052

作者:Haoran Xu,Yanlin Liu,Zizhao Tong,Jiaze Li,Kexue Fu,Yuyang Zhang,Longxiang Gao,Shuaiguang Li,Xingyu Li,Yanran Xu,Changwei Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:garnered significant attention, OOD, OOD tasks, zero-shot OOD detection, OOD detection

备注

点击查看摘要

Abstract:Out-of-Distribution (OOD) detection is a critical task that has garnered significant attention. The emergence of CLIP has spurred extensive research into zero-shot OOD detection, often employing a training-free approach. Current methods leverage expert knowledge from large language models (LLMs) to identify potential outliers. However, these approaches tend to over-rely on knowledge in the text space, neglecting the inherent challenges involved in detecting out-of-distribution samples in the image space. In this paper, we propose a novel pipeline, MM-OOD, which leverages the multimodal reasoning capabilities of MLLMs and their ability to conduct multi-round conversations for enhanced outlier detection. Our method is designed to improve performance in both near OOD and far OOD tasks. Specifically, (1) for near OOD tasks, we directly feed ID images and corresponding text prompts into MLLMs to identify potential outliers; and (2) for far OOD tasks, we introduce the sketch-generate-elaborate framework: first, we sketch outlier exposure using text prompts, then generate corresponding visual OOD samples, and finally elaborate by using multimodal prompts. Experiments demonstrate that our method achieves significant improvements on widely used multimodal datasets such as Food-101, while also validating its scalability on ImageNet-1K.

32. 【2601.14044】Weather-R1: Logically Consistent Reinforcement Fine-Tuning for Multimodal Reasoning in Meteorology

链接https://arxiv.org/abs/2601.14044

作者:Kaiyu Wu,Pucheng Han,Hualong Zhang,Naigeng Wu,Keze Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Vision Language Models, Vision Language, Language Models, advancing reasoning capabilities, show advancing reasoning

备注

点击查看摘要

Abstract:While Vision Language Models (VLMs) show advancing reasoning capabilities, their application in meteorology is constrained by a domain gap and a reasoning faithfulness gap. Specifically, mainstream Reinforcement Fine-Tuning (RFT) can induce Self-Contradictory Reasoning (Self-Contra), where the model's reasoning contradicts its final answer, which is unacceptable in such a high-stakes domain. To address these challenges, we construct WeatherQA, a novel multimodal reasoning benchmark in meteorology. We also propose Logically Consistent Reinforcement Fine-Tuning (LoCo-RFT), which resolves Self-Contra by introducing a logical consistency reward. Furthermore, we introduce Weather-R1, the first reasoning VLM with logical faithfulness in meteorology, to the best of our knowledge. Experiments demonstrate that Weather-R1 improves performance on WeatherQA by 9.8 percentage points over the baseline, outperforming Supervised Fine-Tuning and RFT, and even surpassing the original Qwen2.5-VL-32B. These results highlight the effectiveness of our LoCo-RFT and the superiority of Weather-R1. Our benchmark and code are available at this https URL.

33. 【2601.14042】Federated Balanced Learning

链接https://arxiv.org/abs/2601.14042

作者:Jiaze Li,Haoran Xu,Wanyi Wu,Changwei Wang,Shuaiguang Li,Jianzhong Ju,Zhenbo Luo,Jian Luan,Youyang Qu,Longxiang Gao,Xudong Yang,Lumin Xing

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:sharing model parameters, Federated Balanced Learning, paradigm of joint, collaborate by sharing, client side

备注

点击查看摘要

Abstract:Federated learning is a paradigm of joint learning in which clients collaborate by sharing model parameters instead of data. However, in the non-iid setting, the global model experiences client drift, which can seriously affect the final performance of the model. Previous methods tend to correct the global model that has already deviated based on the loss function or gradient, overlooking the impact of the client samples. In this paper, we rethink the role of the client side and propose Federated Balanced Learning, i.e., FBL, to prevent this issue from the beginning through sample balance on the client side. Technically, FBL allows unbalanced data on the client side to achieve sample balance through knowledge filling and knowledge sampling using edge-side generation models, under the limitation of a fixed number of data samples on clients. Furthermore, we design a Knowledge Alignment Strategy to bridge the gap between synthetic and real data, and a Knowledge Drop Strategy to regularize our method. Meanwhile, we scale our method to real and complex scenarios, allowing different clients to adopt various methods, and extend our framework to further improve performance. Numerous experiments show that our method outperforms state-of-the-art baselines. The code is released upon acceptance.

34. 【2601.14039】Generalizing Abstention for Noise-Robust Learning in Medical Image Segmentation

链接https://arxiv.org/abs/2601.14039

作者:Wesam Moustafa,Hossam Elsafty,Helen Schneider,Lorenz Sparrenberg,Rafet Sifa

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:manual annotation, critical problem, inherent difficulty, difficulty of manual, medical image segmentation

备注

点击查看摘要

Abstract:Label noise is a critical problem in medical image segmentation, often arising from the inherent difficulty of manual annotation. Models trained on noisy data are prone to overfitting, which degrades their generalization performance. While a number of methods and strategies have been proposed to mitigate noisy labels in the segmentation domain, this area remains largely under-explored. The abstention mechanism has proven effective in classification tasks by enhancing the capabilities of Cross Entropy, yet its potential in segmentation remains unverified. In this paper, we address this gap by introducing a universal and modular abstention framework capable of enhancing the noise-robustness of a diverse range of loss functions. Our framework improves upon prior work with two key components: an informed regularization term to guide abstention behaviour, and a more flexible power-law-based auto-tuning algorithm for the abstention penalty. We demonstrate the framework's versatility by systematically integrating it with three distinct loss functions to create three novel, noise-robust variants: GAC, SAC, and ADS. Experiments on the CaDIS and DSAD medical datasets show our methods consistently and significantly outperform their non-abstaining baselines, especially under high noise levels. This work establishes that enabling models to selectively ignore corrupted samples is a powerful and generalizable strategy for building more reliable segmentation models. Our code is publicly available at this https URL.

35. 【2601.14038】Correcting and Quantifying Systematic Errors in 3D Box Annotations for Autonomous Driving

链接https://arxiv.org/abs/2601.14038

作者:Alexandre Justo Miro(1 and 2),Ludvig af Klinteberg(2),Bogdan Timus(1),Aron Asefaw(3),Ajinkya Khoche(1 and 3),Thomas Gustafsson(1),Sina Sharif Mansouri(1),Masoud Daneshtalab(2) ((1) Traton Group Ramp;D, (2) Mälardalen University, (3) KTH Royal Institute of Technology)

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:autonomous vehicle systems, ground truth annotations, Accurate ground truth, ground truth, critical to supervised

备注: Accepted to The IEEE/CVF Winter Conference on Applications of Computer Vision 2026

点击查看摘要

Abstract:Accurate ground truth annotations are critical to supervised learning and evaluating the performance of autonomous vehicle systems. These vehicles are typically equipped with active sensors, such as LiDAR, which scan the environment in predefined patterns. 3D box annotation based on data from such sensors is challenging in dynamic scenarios, where objects are observed at different timestamps, hence different positions. Without proper handling of this phenomenon, systematic errors are prone to being introduced in the box annotations. Our work is the first to discover such annotation errors in widely used, publicly available datasets. Through our novel offline estimation method, we correct the annotations so that they follow physically feasible trajectories and achieve spatial and temporal consistency with the sensor data. For the first time, we define metrics for this problem; and we evaluate our method on the Argoverse 2, MAN TruckScenes, and our proprietary datasets. Our approach increases the quality of box annotations by more than 17% in these datasets. Furthermore, we quantify the annotation errors in them and find that the original annotations are misplaced by up to 2.5 m, with highly dynamic objects being the most affected. Finally, we test the impact of the errors in benchmarking and find that the impact is larger than the improvements that state-of-the-art methods typically achieve with respect to the previous state-of-the-art methods; showing that accurate annotations are essential for correct interpretation of performance. Our code is available at this https URL.

36. 【2601.14037】Human detectors are surprisingly powerful reward models

链接https://arxiv.org/abs/2601.14037

作者:Kumar Ashutosh,XuDong Wang,Xi Yin,Kristen Grauman,Adam Polyak,Ishan Misra,Rohit Girdhar

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:recently achieved impressive, achieved impressive visual, impressive visual fidelity, recently achieved, achieved impressive

备注: Technical report

点击查看摘要

Abstract:Video generation models have recently achieved impressive visual fidelity and temporal coherence. Yet, they continue to struggle with complex, non-rigid motions, especially when synthesizing humans performing dynamic actions such as sports, dance, etc. Generated videos often exhibit missing or extra limbs, distorted poses, or physically implausible actions. In this work, we propose a remarkably simple reward model, HuDA, to quantify and improve the human motion in generated videos. HuDA integrates human detection confidence for appearance quality, and a temporal prompt alignment score to capture motion realism. We show this simple reward function that leverages off-the-shelf models without any additional training, outperforms specialized models finetuned with manually annotated data. Using HuDA for Group Reward Policy Optimization (GRPO) post-training of video models, we significantly enhance video generation, especially when generating complex human motions, outperforming state-of-the-art models like Wan 2.1, with win-rate of 73%. Finally, we demonstrate that HuDA improves generation quality beyond just humans, for instance, significantly improving generation of animal videos and human-object interactions.

37. 【2601.14030】Likelihood-Separable Diffusion Inference for Multi-Image MRI Super-Resolution

链接https://arxiv.org/abs/2601.14030

作者:Samuel W. Remedios,Zhangxing Bian,Shuwen Wei,Aaron Carass,Jerry L. Prince,Blake E. Dewey

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:solving inverse problems, solving inverse, times, posterior sampling, MRI

备注

点击查看摘要

Abstract:Diffusion models are the current state-of-the-art for solving inverse problems in imaging. Their impressive generative capability allows them to approximate sampling from a prior distribution, which alongside a known likelihood function permits posterior sampling without retraining the model. While recent methods have made strides in advancing the accuracy of posterior sampling, the majority focuses on single-image inverse problems. However, for modalities such as magnetic resonance imaging (MRI), it is common to acquire multiple complementary measurements, each low-resolution along a different axis. In this work, we generalize common diffusion-based inverse single-image problem solvers for multi-image super-resolution (MISR) MRI. We show that the DPS likelihood correction allows an exactly-separable gradient decomposition across independently acquired measurements, enabling MISR without constructing a joint operator, modifying the diffusion model, or increasing network function evaluations. We derive MISR versions of DPS, DMAP, DPPS, and diffusion-based PnP/ADMM, and demonstrate substantial gains over SISR across $4\times/8\times/16\times$ anisotropic degradations. Our results achieve state-of-the-art super-resolution of anisotropic MRI volumes and, critically, enable reconstruction of near-isotropic anatomy from routine 2D multi-slice acquisitions, which are otherwise highly degraded in orthogonal views.

38. 【2601.13986】Equivariant Learning for Unsupervised Image Dehazing

链接https://arxiv.org/abs/2601.13986

作者:Zhang Wen,Jiangwei Xie,Dongdong Chen

类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:Image Dehazing, aims to produce, EID, Image, Dehazing

备注: Technical report

点击查看摘要

Abstract:Image Dehazing (ID) aims to produce a clear image from an observation contaminated by haze. Current ID methods typically rely on carefully crafted priors or extensive haze-free ground truth, both of which are expensive or impractical to acquire, particularly in the context of scientific imaging. We propose a new unsupervised learning framework called Equivariant Image Dehazing (EID) that exploits the symmetry of image signals to restore clarity to hazy observations. By enforcing haze consistency and systematic equivariance, EID can recover clear patterns directly from raw, hazy images. Additionally, we propose an adversarial learning strategy to model unknown haze physics and facilitate EID learning. Experiments on two scientific image dehazing benchmarks (including cell microscopy and medical endoscopy) and on natural image dehazing have demonstrated that EID significantly outperforms state-of-the-art approaches. By unifying equivariant learning with modelling haze physics, we hope that EID will enable more versatile and effective haze removal in scientific imaging. Code and datasets will be published.

39. 【2601.13976】FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation

链接https://arxiv.org/abs/2601.13976

作者:Jing Zuo,Lingzhou Mu,Fan Jiang,Chengcheng Ma,Mu Xu,Yonggang Qi

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:Achieving human-level performance, long action sequences, Achieving human-level, understand multimodal instructions, requires an embodied

备注

点击查看摘要

Abstract:Achieving human-level performance in Vision-and-Language Navigation (VLN) requires an embodied agent to jointly understand multimodal instructions and visual-spatial context while reasoning over long action sequences. Recent works, such as NavCoT and NavGPT-2, demonstrate the potential of Chain-of-Thought (CoT) reasoning for improving interpretability and long-horizon planning. Moreover, multimodal extensions like OctoNav-R1 and CoT-VLA further validate CoT as a promising pathway toward human-like navigation reasoning. However, existing approaches face critical drawbacks: purely textual CoTs lack spatial grounding and easily overfit to sparse annotated reasoning steps, while multimodal CoTs incur severe token inflation by generating imagined visual observations, making real-time navigation impractical. In this work, we propose FantasyVLN, a unified implicit reasoning framework that preserves the benefits of CoT reasoning without explicit token overhead. Specifically, imagined visual tokens are encoded into a compact latent space using a pretrained Visual AutoRegressor (VAR) during CoT reasoning training, and the model jointly learns from textual, visual, and multimodal CoT modes under a unified multi-CoT strategy. At inference, our model performs direct instruction-to-action mapping while still enjoying reasoning-aware representations. Extensive experiments on LH-VLN show that our approach achieves reasoning-aware yet real-time navigation, improving success rates and efficiency while reducing inference latency by an order of magnitude compared to explicit CoT methods.

40. 【2601.13975】Harmonizing the Deep: A Unified Information Pipeline for Robust Marine Biodiversity Assessment Across Heterogeneous Domains

链接https://arxiv.org/abs/2601.13975

作者:Marco Piccolo,Qiwei Han,Astrid van Toor,Joachim Vanneste

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:complex underwater environments, biodiversity monitoring requires, monitoring requires scalability, invasive-species management, Marine biodiversity monitoring

备注: 9 pages, 4 figures 8 tables

点击查看摘要

Abstract:Marine biodiversity monitoring requires scalability and reliability across complex underwater environments to support conservation and invasive-species management. Yet existing detection solutions often exhibit a pronounced deployment gap, with performance degrading sharply when transferred to new sites. This work establishes the foundational detection layer for a multi-year invasive species monitoring initiative targeting Arctic and Atlantic marine ecosystems. We address this challenge by developing a Unified Information Pipeline that standardises heterogeneous datasets into a comparable information flow and evaluates a fixed, deployment-relevant detector under controlled cross-domain protocols. Across multiple domains, we find that structural factors, such as scene composition, object density, and contextual redundancy, explain cross-domain performance loss more strongly than visual degradation such as turbidity, with sparse scenes inducing a characteristic "Context Collapse" failure mode. We further validate operational feasibility by benchmarking inference on low-cost edge hardware, showing that runtime optimisation enables practical sampling rates for remote monitoring. The results shift emphasis from image enhancement toward structure-aware reliability, providing a democratised tool for consistent marine ecosystem assessment.

41. 【2601.13974】STEC: A Reference-Free Spatio-Temporal Entropy Coverage Metric for Evaluating Sampled Video Frames

链接https://arxiv.org/abs/2601.13974

作者:Shih-Yao Lin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:language model pipelines, frames remains challenging, sampled frames remains, sampled frames, language model

备注: This paper corresponds to the camera-ready version of a WACV 2026 Workshop paper

点击查看摘要

Abstract:Frame sampling is a fundamental component in video understanding and video--language model pipelines, yet evaluating the quality of sampled frames remains challenging. Existing evaluation metrics primarily focus on perceptual quality or reconstruction fidelity, and are not designed to assess whether a set of sampled frames adequately captures informative and representative video content. We propose Spatio-Temporal Entropy Coverage (STEC), a simple and non-reference metric for evaluating the effectiveness of video frame sampling. STEC builds upon Spatio-Temporal Frame Entropy (STFE), which measures per-frame spatial information via entropy-based structural complexity, and evaluates sampled frames based on their temporal coverage and redundancy. By jointly modeling spatial information strength, temporal dispersion, and non-redundancy, STEC provides a principled and lightweight measure of sampling quality. Experiments on the MSR-VTT test-1k benchmark demonstrate that STEC clearly differentiates common sampling strategies, including random, uniform, and content-aware methods. We further show that STEC reveals robustness patterns across individual videos that are not captured by average performance alone, highlighting its practical value as a general-purpose evaluation tool for efficient video understanding. We emphasize that STEC is not designed to predict downstream task accuracy, but to provide a task-agnostic diagnostic signal for analyzing frame sampling behavior under constrained budgets.

Comments:
This paper corresponds to the camera-ready version of a WACV 2026 Workshop paper

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2601.13974 [cs.CV]

(or
arXiv:2601.13974v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2601.13974

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Shih-Yao (Mike) Lin [view email] [v1]
Tue, 20 Jan 2026 13:51:33 UTC (1,002 KB)

42. 【2601.13954】DExTeR: Weakly Semi-Supervised Object Detection with Class and Instance Experts for Medical Imaging

链接https://arxiv.org/abs/2601.13954

作者:Adrien Meyer,Didier Mutter,Nicolas Padoy

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Detecting anatomical landmarks, Detecting anatomical, intervention guidance, anatomical landmarks, essential for diagnosis

备注

点击查看摘要

Abstract:Detecting anatomical landmarks in medical imaging is essential for diagnosis and intervention guidance. However, object detection models rely on costly bounding box annotations, limiting scalability. Weakly Semi-Supervised Object Detection (WSSOD) with point annotations proposes annotating each instance with a single point, minimizing annotation time while preserving localization signals. A Point-to-Box teacher model, trained on a small box-labeled subset, converts these point annotations into pseudo-box labels to train a student detector. Yet, medical imagery presents unique challenges, including overlapping anatomy, variable object sizes, and elusive structures, which hinder accurate bounding box inference. To overcome these challenges, we introduce DExTeR (DETR with Experts), a transformer-based Point-to-Box regressor tailored for medical imaging. Built upon Point-DETR, DExTeR encodes single-point annotations as object queries, refining feature extraction with the proposed class-guided deformable attention, which guides attention sampling using point coordinates and class labels to capture class-specific characteristics. To improve discrimination in complex structures, it introduces CLICK-MoE (CLass, Instance, and Common Knowledge Mixture of Experts), decoupling class and instance representations to reduce confusion among adjacent or overlapping instances. Finally, we implement a multi-point training strategy which promotes prediction consistency across different point placements, improving robustness to annotation variability. DExTeR achieves state-of-the-art performance across three datasets spanning different medical domains (endoscopy, chest X-rays, and endoscopic ultrasound) highlighting its potential to reduce annotation costs while maintaining high detection accuracy.

43. 【2601.13951】VTONGuard: Automatic Detection and Authentication of AI-Generated Virtual Try-On Content

链接https://arxiv.org/abs/2601.13951

作者:Shengyi Wu,Yan Hong,Shengyao Chen,Zheng Wang,Xianbing Sun,Jiahui Zhan,Jun Lan,Jianfu Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:digital entertainment, rapid advancement, advancement of generative, increasingly common, common in e-commerce

备注

点击查看摘要

Abstract:With the rapid advancement of generative AI, virtual try-on (VTON) systems are becoming increasingly common in e-commerce and digital entertainment. However, the growing realism of AI-generated try-on content raises pressing concerns about authenticity and responsible use. To address this, we present VTONGuard, a large-scale benchmark dataset containing over 775,000 real and synthetic try-on images. The dataset covers diverse real-world conditions, including variations in pose, background, and garment styles, and provides both authentic and manipulated examples. Based on this benchmark, we conduct a systematic evaluation of multiple detection paradigms under unified training and testing protocols. Our results reveal each method's strengths and weaknesses and highlight the persistent challenge of cross-paradigm generalization. To further advance detection, we design a multi-task framework that integrates auxiliary segmentation to enhance boundary-aware feature learning, achieving the best overall performance on VTONGuard. We expect this benchmark to enable fair comparisons, facilitate the development of more robust detection models, and promote the safe and responsible deployment of VTON technologies in practice.

44. 【2601.13942】Glance-or-Gaze: Incentivizing LMMs to Adaptively Focus Search via Reinforcement Learning

链接https://arxiv.org/abs/2601.13942

作者:Hongbo Bai,Yujin Zhou,Yile Wu,Chi-Min Chan,Pengcheng Wen,Kunhao Pan,Sirui Han,Yike Guo

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Large Multimodal Models, Large Multimodal, static parametric knowledge, achieved remarkable success, involving long-tail entities

备注

点击查看摘要

Abstract:Large Multimodal Models (LMMs) have achieved remarkable success in visual understanding, yet they struggle with knowledge-intensive queries involving long-tail entities or evolving information due to static parametric knowledge. Recent search-augmented approaches attempt to address this limitation, but existing methods rely on indiscriminate whole-image retrieval that introduces substantial visual redundancy and noise, and lack deep iterative reflection, limiting their effectiveness on complex visual queries. To overcome these challenges, we propose Glance-or-Gaze (GoG), a fully autonomous framework that shifts from passive perception to active visual planning. GoG introduces a Selective Gaze mechanism that dynamically chooses whether to glance at global context or gaze into high-value regions, filtering irrelevant information before retrieval. We design a dual-stage training strategy: Reflective GoG Behavior Alignment via supervised fine-tuning instills the fundamental GoG paradigm, while Complexity-Adaptive Reinforcement Learning further enhances the model's capability to handle complex queries through iterative reasoning. Experiments across six benchmarks demonstrate state-of-the-art performance. Ablation studies confirm that both Selective Gaze and complexity-adaptive RL are essential for effective visual search. We will release our data and models for further exploration soon.

45. 【2601.13935】rackletGPT: A Language-like GPT Framework for White Matter Tract Segmentation

链接https://arxiv.org/abs/2601.13935

作者:Anoushkrit Goel,Simroop Singh,Ankita Joshi,Ranjeet Ranjan Jha,Chirag Ahuja,Aditya Nigam,Arnav Bhavsar

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:White Matter Tract, Matter Tract Segmentation, brain structural connectivity, White Matter, studying brain structural

备注: Accepted at 23rd IEEE International Symposium on Biomedical Imaging (ISBI), 2026

点击查看摘要

Abstract:White Matter Tract Segmentation is imperative for studying brain structural connectivity, neurological disorders and neurosurgery. This task remains complex, as tracts differ among themselves, across subjects and conditions, yet have similar 3D structure across hemispheres and subjects. To address these challenges, we propose TrackletGPT, a language-like GPT framework which reintroduces sequential information in tokens using tracklets. TrackletGPT generalises seamlessly across datasets, is fully automatic, and encodes granular sub-streamline segments, Tracklets, scaling and refining GPT models in Tractography Segmentation. Based on our experiments, TrackletGPT outperforms state-of-the-art methods on average DICE, Overlap and Overreach scores on TractoInferno and HCP datasets, even on inter-dataset experiments.

46. 【2601.13919】HyperWalker: Dynamic Hypergraph-Based Deep Diagnosis for Multi-Hop Clinical Modeling across EHR and X-Ray in Medical VLMs

链接https://arxiv.org/abs/2601.13919

作者:Yuezhe Yang,Hao Wang,Yige Peng,Jinman Kim,Lei Bi

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Automated clinical diagnosis, integrate multi-modal data, case-specific contexts, reason across complex, Automated clinical

备注: Under Review

点击查看摘要

Abstract:Automated clinical diagnosis remains a core challenge in medical AI, which usually requires models to integrate multi-modal data and reason across complex, case-specific contexts. Although recent methods have advanced medical report generation (MRG) and visual question answering (VQA) with medical vision-language models (VLMs), these methods, however, predominantly operate under a sample-isolated inference paradigm, as such processing cases independently without access to longitudinal electronic health records (EHRs) or structurally related patient examples. This paradigm limits reasoning to image-derived information alone, which ignores external complementary medical evidence for potentially more accurate diagnosis. To overcome this limitation, we propose \textbf{HyperWalker}, a \textit{Deep Diagnosis} framework that reformulates clinical reasoning via dynamic hypergraphs and test-time training. First, we construct a dynamic hypergraph, termed \textbf{iBrochure}, to model the structural heterogeneity of EHR data and implicit high-order associations among multimodal clinical information. Within this hypergraph, a reinforcement learning agent, \textbf{Walker}, navigates to and identifies optimal diagnostic paths. To ensure comprehensive coverage of diverse clinical characteristics in test samples, we incorporate a \textit{linger mechanism}, a multi-hop orthogonal retrieval strategy that iteratively selects clinically complementary neighborhood cases reflecting distinct clinical attributes. Experiments on MRG with MIMIC and medical VQA on EHRXQA demonstrate that HyperWalker achieves state-of-the-art performance. Code is available at: this https URL

47. 【2601.13913】On the Role of Rotation Equivariance in Monocular 3D Human Pose Estimation

链接https://arxiv.org/abs/2601.13913

作者:Pavlo Melnyk,Cuong Le,Urs Waldmann,Per-Erik Forssén,Bastian Wandt

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:computer vision, Estimating, central tasks, input image, input

备注

点击查看摘要

Abstract:Estimating 3D from 2D is one of the central tasks in computer vision. In this work, we consider the monocular setting, i.e. single-view input, for 3D human pose estimation (HPE). Here, the task is to predict a 3D point set of human skeletal joints from a single 2D input image. While by definition this is an ill-posed problem, recent work has presented methods that solve it with up to several-centimetre error. Typically, these methods employ a two-step approach, where the first step is to detect the 2D skeletal joints in the input image, followed by the step of 2D-to-3D lifting. We find that common lifting models fail when encountering a rotated input. We argue that learning a single human pose along with its in-plane rotations is considerably easier and more geometrically grounded than directly learning a point-to-point mapping. Furthermore, our intuition is that endowing the model with the notion of rotation equivariance without explicitly constraining its parameter space should lead to a more straightforward learning process than one with equivariance by design. Utilising the common HPE benchmarks, we confirm that the 2D rotation equivariance per se improves the model performance on human poses akin to rotations in the image plane, and can be efficiently and straightforwardly learned by augmentation, outperforming state-of-the-art equivariant-by-design methods.

48. 【2601.13899】owards Visually Explaining Statistical Tests with Applications in Biomedical Imaging

链接https://arxiv.org/abs/2601.13899

作者:Masoumeh Javanbakhat,Piotr Komorowski,Dilyara Bareeva,Wei-Chang Lai,Wojciech Samek,Christoph Lippert

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:recently shown strong, shown strong power, black-box nature limits, nature limits interpretability, detecting distributional differences

备注

点击查看摘要

Abstract:Deep neural two-sample tests have recently shown strong power for detecting distributional differences between groups, yet their black-box nature limits interpretability and practical adoption in biomedical analysis. Moreover, most existing post-hoc explainability methods rely on class labels, making them unsuitable for label-free statistical testing settings. We propose an explainable deep statistical testing framework that augments deep two-sample tests with sample-level and feature-level explanations, revealing which individual samples and which input features drive statistically significant group differences. Our method highlights which image regions and which individual samples contribute most to the detected group difference, providing spatial and instance-wise insight into the test's decision. Applied to biomedical imaging data, the proposed framework identifies influential samples and highlights anatomically meaningful regions associated with disease-related variation. This work bridges statistical inference and explainable AI, enabling interpretable, label-free population analysis in medical imaging.

49. 【2601.13895】OmniOVCD: Streamlining Open-Vocabulary Change Detection with SAM 3

链接https://arxiv.org/abs/2601.13895

作者:Xu Zhang,Danyang Li,Yingjie Xia,Xiaohang Dong,Hualong Yu,Jianye Wang,Qicheng Li

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Change Detection, remote sensing, Open-Vocabulary Change Detection, Detection, OVCD

备注

点击查看摘要

Abstract:Change Detection (CD) is a fundamental task in remote sensing. It monitors the evolution of land cover over time. Based on this, Open-Vocabulary Change Detection (OVCD) introduces a new requirement. It aims to reduce the reliance on predefined categories. Existing training-free OVCD methods mostly use CLIP to identify categories. These methods also need extra models like DINO to extract features. However, combining different models often causes problems in matching features and makes the system unstable. Recently, the Segment Anything Model 3 (SAM 3) is introduced. It integrates segmentation and identification capabilities within one promptable model, which offers new possibilities for the OVCD task. In this paper, we propose OmniOVCD, a standalone framework designed for OVCD. By leveraging the decoupled output heads of SAM 3, we propose a Synergistic Fusion to Instance Decoupling (SFID) strategy. SFID first fuses the semantic, instance, and presence outputs of SAM 3 to construct land-cover masks, and then decomposes them into individual instance masks for change comparison. This design preserves high accuracy in category recognition and maintains instance-level consistency across images. As a result, the model can generate accurate change masks. Experiments on four public benchmarks (LEVIR-CD, WHU-CD, S2Looking, and SECOND) demonstrate SOTA performance, achieving IoU scores of 67.2, 66.5, 24.5, and 27.1 (class-average), respectively, surpassing all previous methods.

50. 【2601.13886】Revisiting Multi-Task Visual Representation Learning

链接https://arxiv.org/abs/2601.13886

作者:Shangzhe Di,Zhonghua Zhai,Weidi Xie

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:capture intricate local, Current visual representation, intricate local structures, lack spatial precision, high-level semantic context

备注: Code: [this https URL](https://github.com/Becomebright/MTV)

点击查看摘要

Abstract:Current visual representation learning remains bifurcated: vision-language models (e.g., CLIP) excel at global semantic alignment but lack spatial precision, while self-supervised methods (e.g., MAE, DINO) capture intricate local structures yet struggle with high-level semantic context. We argue that these paradigms are fundamentally complementary and can be integrated into a principled multi-task framework, further enhanced by dense spatial supervision. We introduce MTV, a multi-task visual pretraining framework that jointly optimizes a shared backbone across vision-language contrastive, self-supervised, and dense spatial objectives. To mitigate the need for manual annotations, we leverage high-capacity "expert" models -- such as Depth Anything V2 and OWLv2 -- to synthesize dense, structured pseudo-labels at scale. Beyond the framework, we provide a systematic investigation into the mechanics of multi-task visual learning, analyzing: (i) the marginal gain of each objective, (ii) task synergies versus interference, and (iii) scaling behavior across varying data and model scales. Our results demonstrate that MTV achieves "best-of-both-worlds" performance, significantly enhancing fine-grained spatial reasoning without compromising global semantic understanding. Our findings suggest that multi-task learning, fueled by high-quality pseudo-supervision, is a scalable path toward more general visual encoders.

51. 【2601.13879】Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring

链接https://arxiv.org/abs/2601.13879

作者:Dongxu Zhang,Yiding Sun,Cheng Tan,Wenbiao Yan,Ning Yang,Jihua Zhu,Hiajun Zhang

类目:Multimedia (cs.MM); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Language Models, Multimodal Large Language, Language Models, Large Language, prohibitive latency constraints

备注

点击查看摘要

Abstract:While Chain-of-Thought (CoT) reasoning significantly enhances the performance of Multimodal Large Language Models (MLLMs), its autoregressive nature incurs prohibitive latency constraints. Current efforts to mitigate this via token compression often fail by blindly applying text-centric metrics to multimodal contexts. We identify a critical failure mode termed Visual Amnesia, where linguistically redundant tokens are erroneously pruned, leading to hallucinations. To address this, we introduce V-Skip that reformulates token pruning as a Visual-Anchored Information Bottleneck (VA-IB) optimization problem. V-Skip employs a dual-path gating mechanism that weighs token importance through both linguistic surprisal and cross-modal attention flow, effectively rescuing visually salient anchors. Extensive experiments on Qwen2-VL and Llama-3.2 families demonstrate that V-Skip achieves a $2.9\times$ speedup with negligible accuracy loss. Specifically, it preserves fine-grained visual details, outperforming other baselines over 30\% on the DocVQA.

52. 【2601.13871】OCCAM: Class-Agnostic, Training-Free, Prior-Free and Multi-Class Object Counting

链接https://arxiv.org/abs/2601.13871

作者:Michail Spanakis,Iason Oikonomidis,Antonis Argyros

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Class-Agnostic object Counting, involves counting instances, involves counting, CAC, Neighbor Clustering Hierarchy

备注

点击查看摘要

Abstract:Class-Agnostic object Counting (CAC) involves counting instances of objects from arbitrary classes within an image. Due to its practical importance, CAC has received increasing attention in recent years. Most existing methods assume a single object class per image, rely on extensive training of large deep learning models and address the problem by incorporating additional information, such as visual exemplars or text prompts. In this paper, we present OCCAM, the first training-free approach to CAC that operates without the need of any supplementary information. Moreover, our approach addresses the multi-class variant of the problem, as it is capable of counting the object instances in each and every class among arbitrary object classes within an image. We leverage Segment Anything Model 2 (SAM2), a foundation model, and a custom threshold-based variant of the First Integer Neighbor Clustering Hierarchy (FINCH) algorithm to achieve competitive performance on widely used benchmark datasets, FSC-147 and CARPK. We propose a synthetic multi-class dataset and F1 score as a more suitable evaluation metric. The code for our method and the proposed synthetic dataset will be made publicly available at this https URL.

53. 【2601.13852】Probabilistic Deep Discriminant Analysis for Wind Blade Segmentation

链接https://arxiv.org/abs/2601.13852

作者:Raül Pérez-Gonzalo,Andreas Espersen,Antonio Agudo

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Linear discriminant analysis, non-linearly separable data, discriminant analysis improves, Linear discriminant, Deep Discriminant Analysis

备注: Accepted to ICASSP 2026

点击查看摘要

Abstract:Linear discriminant analysis improves class separability but struggles with non-linearly separable data. To overcome this, we introduce Deep Discriminant Analysis (DDA), which directly optimizes the Fisher criterion utilizing deep networks. To ensure stable training and avoid computational instabilities, we incorporate signed between-class variance, bound outputs with a sigmoid function, and convert multiplicative relationships into additive ones. We present two stable DDA loss functions and augment them with a probability loss, resulting in Probabilistic DDA (PDDA). PDDA effectively minimizes class overlap in output distributions, producing highly confident predictions with reduced within-class variance. When applied to wind blade segmentation, PDDA showcases notable advances in performance and consistency, critical for wind energy maintenance. To our knowledge, this is the first application of DDA to image segmentation.

54. 【2601.13839】DisasterVQA: A Visual Question Answering Benchmark Dataset for Disaster Scenes

链接https://arxiv.org/abs/2601.13839

作者:Aisha Al-Mohannadi,Ayisha Firoz,Yin Yang,Muhammad Imran,Ferda Ofli

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Social media imagery, enabling rapid damage, rapid damage assessment, Social media, Visual Question Answering

备注

点击查看摘要

Abstract:Social media imagery provides a low-latency source of situational information during natural and human-induced disasters, enabling rapid damage assessment and response. While Visual Question Answering (VQA) has shown strong performance in general-purpose domains, its suitability for the complex and safety-critical reasoning required in disaster response remains unclear. We introduce DisasterVQA, a benchmark dataset designed for perception and reasoning in crisis contexts. DisasterVQA consists of 1,395 real-world images and 4,405 expert-curated question-answer pairs spanning diverse events such as floods, wildfires, and earthquakes. Grounded in humanitarian frameworks including FEMA ESF and OCHA MIRA, the dataset includes binary, multiple-choice, and open-ended questions covering situational awareness and operational decision-making tasks. We benchmark seven state-of-the-art vision-language models and find performance variability across question types, disaster categories, regions, and humanitarian tasks. Although models achieve high accuracy on binary questions, they struggle with fine-grained quantitative reasoning, object counting, and context-sensitive interpretation, particularly for underrepresented disaster scenarios. DisasterVQA provides a challenging and practical benchmark to guide the development of more robust and operationally meaningful vision-language models for disaster response. The dataset is publicly available at this https URL.

55. 【2601.13837】FastGHA: Generalized Few-Shot 3D Gaussian Head Avatars with Real-Time Animation

链接https://arxiv.org/abs/2601.13837

作者:Xinya Ji,Sebastian Weiss,Manuel Kansy,Jacek Naruniec,Xun Cao,Barbara Solenthaler,Derek Bradley

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:efficiently generating high, generating high fidelity, Gaussian-based head avatar, fidelity avatars remains, high fidelity avatars

备注

点击查看摘要

Abstract:Despite recent progress in 3D Gaussian-based head avatar modeling, efficiently generating high fidelity avatars remains a challenge. Current methods typically rely on extensive multi-view capture setups or monocular videos with per-identity optimization during inference, limiting their scalability and ease of use on unseen subjects. To overcome these efficiency drawbacks, we propose \OURS, a feed-forward method to generate high-quality Gaussian head avatars from only a few input images while supporting real-time animation. Our approach directly learns a per-pixel Gaussian representation from the input images, and aggregates multi-view information using a transformer-based encoder that fuses image features from both DINOv3 and Stable Diffusion VAE. For real-time animation, we extend the explicit Gaussian representations with per-Gaussian features and introduce a lightweight MLP-based dynamic network to predict 3D Gaussian deformations from expression codes. Furthermore, to enhance geometric smoothness of the 3D head, we employ point maps from a pre-trained large reconstruction model as geometry supervision. Experiments show that our approach significantly outperforms existing methods in both rendering quality and inference efficiency, while supporting real-time dynamic avatar animation.

56. 【2601.13836】FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

链接https://arxiv.org/abs/2601.13836

作者:Qian Chen,Jinlan Fu,Changsong Li,See-Kiong Ng,Xipeng Qiu

类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, remains largely unexplored

备注: [this https URL](https://openmoss.github.io/FutureOmni)

点击查看摘要

Abstract:Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio-visual environments. The evaluated models are required to perform cross-modal causal and temporal reasoning, as well as effectively leverage internal knowledge to predict future events. FutureOmni is constructed via a scalable LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations on 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios, with the best accuracy of 64.8% achieved by Gemini 3 Flash. To mitigate this limitation, we curate a 7K-sample instruction-tuning dataset and propose an Omni-Modal Future Forecasting (OFF) training strategy. Evaluations on FutureOmni and popular audio-visual and video-only benchmarks demonstrate that OFF enhances future forecasting and generalization. We publicly release all code (this https URL) and datasets (this https URL).

57. 【2601.13816】Discriminant Learning-based Colorspace for Blade Segmentation

链接https://arxiv.org/abs/2601.13816

作者:Raül Pérez-Gonzalo,Andreas Espersen,Antonio Agudo

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:hinders accurate image, Suboptimal color representation, modern algorithms neglect, critical preprocessing step, Suboptimal color

备注: Accepted to ICASSP 2026

点击查看摘要

Abstract:Suboptimal color representation often hinders accurate image segmentation, yet many modern algorithms neglect this critical preprocessing step. This work presents a novel multidimensional nonlinear discriminant analysis algorithm, Colorspace Discriminant Analysis (CSDA), for improved segmentation. Extending Linear Discriminant Analysis into a deep learning context, CSDA customizes color representation by maximizing multidimensional signed inter-class separability while minimizing intra-class variability through a generalized discriminative loss. To ensure stable training, we introduce three alternative losses that enable end-to-end optimization of both the discriminative colorspace and segmentation process. Experiments on wind turbine blade data demonstrate significant accuracy gains, emphasizing the importance of tailored preprocessing in domain-specific segmentation.

58. 【2601.13798】Insight: Interpretable Semantic Hierarchies in Vision-Language Encoders

链接https://arxiv.org/abs/2601.13798

作者:Kai Wittenmayer,Sukrut Rao,Amin Parchami-Araghi,Bernt Schiele,Jonas Fischer

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:diverse downstream tasks, models perform strongly, Language-aligned vision foundation, perform strongly, strongly across diverse

备注: 32 pages, 24 figures, 3 tables

点击查看摘要

Abstract:Language-aligned vision foundation models perform strongly across diverse downstream tasks. Yet, their learned representations remain opaque, making interpreting their decision-making hard. Recent works decompose these representations into human-interpretable concepts, but provide poor spatial grounding and are limited to image classification tasks. In this work, we propose Insight, a language-aligned concept foundation model that provides fine-grained concepts, which are human-interpretable and spatially grounded in the input image. We leverage a hierarchical sparse autoencoder and a foundation model with strong semantic representations to automatically extract concepts at various granularities. Examining local co-occurrence dependencies of concepts allows us to define concept relationships. Through these relations we further improve concept naming and obtain richer explanations. On benchmark data, we show that Insight provides performance on classification and segmentation that is competitive with opaque foundation models while providing fine-grained, high quality concept-based explanations. Code is available at this https URL.

59. 【2601.13797】PREGEN: Uncovering Latent Thoughts in Composed Video Retrieval

链接https://arxiv.org/abs/2601.13797

作者:Gabriele Serussi,David Vainshtein,Jonathan Kouchly,Dotan Di Castro,Chaim Baskin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Composed Video Retrieval, Composed Video, aims to retrieve, video based, Composed

备注

点击查看摘要

Abstract:Composed Video Retrieval (CoVR) aims to retrieve a video based on a query video and a modifying text. Current CoVR methods fail to fully exploit modern Vision-Language Models (VLMs), either using outdated architectures or requiring computationally expensive fine-tuning and slow caption generation. We introduce PREGEN (PRE GENeration extraction), an efficient and powerful CoVR framework that overcomes these limitations. Our approach uniquely pairs a frozen, pre-trained VLM with a lightweight encoding model, eliminating the need for any VLM fine-tuning. We feed the query video and modifying text into the VLM and extract the hidden state of the final token from each layer. A simple encoder is then trained on these pooled representations, creating a semantically rich and compact embedding for retrieval. PREGEN significantly advances the state of the art, surpassing all prior methods on standard CoVR benchmarks with substantial gains in Recall@1 of +27.23 and +69.59. Our method demonstrates robustness across different VLM backbones and exhibits strong zero-shot generalization to more complex textual modifications, highlighting its effectiveness and semantic capabilities.

60. 【2601.13751】HiT: History-Injection Transformers for Onboard Continuous Flood Change Detection

链接https://arxiv.org/abs/2601.13751

作者:Daniel Kyselica,Jonáš Herec,Oliver Kutis,Rado Pitoňák

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:strict operational constraints, requires processing multi-temporal, satellite observation requires, operational constraints, strict operational

备注: 19 pages, 9 figures, submitted to conference

点击查看摘要

Abstract:Natural disaster monitoring through continuous satellite observation requires processing multi-temporal data under strict operational constraints. This paper addresses flood detection, a critical application for hazard management, by developing an onboard change detection system that operates within the memory and computational limits of small satellites. We propose History Injection mechanism for Transformer models (HiT), that maintains historical context from previous observations while reducing data storage by over 99\% of original image size. Moreover, testing on the STTORM-CD flood dataset confirms that the HiT mechanism within the Prithvi-tiny foundation model maintains detection accuracy compared to the bitemporal baseline. The proposed HiT-Prithvi model achieved 43 FPS on Jetson Orin Nano, a representative onboard hardware used in nanosats. This work establishes a practical framework for satellite-based continuous monitoring of natural disasters, supporting real-time hazard assessment without dependency on ground-based processing infrastructure. Architecture as well as model checkpoints is available at this https URL

61. 【2601.13724】Facial Spatiotemporal Graphs: Leveraging the 3D Facial Surface for Remote Physiological Measurement

链接https://arxiv.org/abs/2601.13724

作者:Sam Cantrill,David Ahmedt-Aristizabal,Lars Petersson,Hanna Suominen,Mohammad Ali Armin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Facial remote photoplethysmography, estimate physiological signals, Facial, remote photoplethysmography, facial surface

备注

点击查看摘要

Abstract:Facial remote photoplethysmography (rPPG) methods estimate physiological signals by modeling subtle color changes on the 3D facial surface over time. However, existing methods fail to explicitly align their receptive fields with the 3D facial surface-the spatial support of the rPPG signal. To address this, we propose the Facial Spatiotemporal Graph (STGraph), a novel representation that encodes facial color and structure using 3D facial mesh sequences-enabling surface-aligned spatiotemporal processing. We introduce MeshPhys, a lightweight spatiotemporal graph convolutional network that operates on the STGraph to estimate physiological signals. Across four benchmark datasets, MeshPhys achieves state-of-the-art or competitive performance in both intra- and cross-dataset settings. Ablation studies show that constraining the model's receptive field to the facial surface acts as a strong structural prior, and that surface-aligned, 3D-aware node features are critical for robustly encoding facial surface color. Together, the STGraph and MeshPhys constitute a novel, principled modeling paradigm for facial rPPG, enabling robust, interpretable, and generalizable estimation. Code is available at this https URL .

62. 【2601.13719】Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search

链接https://arxiv.org/abs/2601.13719

作者:Xinlei Yin,Xiulian Peng,Xiao Li,Zhiwei Xiong,Yan Lu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:long context windows, extremely long context, vision-language models due, presents significant challenges, extremely long

备注

点击查看摘要

Abstract:Long video understanding presents significant challenges for vision-language models due to extremely long context windows. Existing solutions relying on naive chunking strategies with retrieval-augmented generation, typically suffer from information fragmentation and a loss of global coherence. We present HAVEN, a unified framework for long-video understanding that enables coherent and comprehensive reasoning by integrating audiovisual entity cohesion and hierarchical video indexing with agentic search. First, we preserve semantic consistency by integrating entity-level representations across visual and auditory streams, while organizing content into a structured hierarchy spanning global summary, scene, segment, and entity levels. Then we employ an agentic search mechanism to enable dynamic retrieval and reasoning across these layers, facilitating coherent narrative reconstruction and fine-grained entity tracking. Extensive experiments demonstrate that our method achieves good temporal coherence, entity consistency, and retrieval efficiency, establishing a new state-of-the-art with an overall accuracy of 84.1% on LVBench. Notably, it achieves outstanding performance in the challenging reasoning category, reaching 80.1%. These results highlight the effectiveness of structured, multimodal reasoning for comprehensive and context-consistent understanding of long-form videos.

63. 【2601.13715】MVGD-Net: A Novel Motion-aware Video Glass Surface Detection Network

链接https://arxiv.org/abs/2601.13715

作者:Yiwei Lu,Hao Huang,Tao Yan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:professional environments presents, Glass Surface Detection, Glass surface ubiquitous, Glass surface, Video Glass Surface

备注: This paper has been accepted by the 40th AAAI Conference on Artificial Intelligence (AAAI-26). It contians 9 pages, 11 figures

点击查看摘要

Abstract:Glass surface ubiquitous in both daily life and professional environments presents a potential threat to vision-based systems, such as robot and drone navigation. To solve this challenge, most recent studies have shown significant interest in Video Glass Surface Detection (VGSD). We observe that objects in the reflection (or transmission) layer appear farther from the glass surfaces. Consequently, in video motion scenarios, the notable reflected (or transmitted) objects on the glass surface move slower than objects in non-glass regions within the same spatial plane, and this motion inconsistency can effectively reveal the presence of glass surfaces. Based on this observation, we propose a novel network, named MVGD-Net, for detecting glass surfaces in videos by leveraging motion inconsistency cues. Our MVGD-Net features three novel modules: the Cross-scale Multimodal Fusion Module (CMFM) that integrates extracted spatial features and estimated optical flow maps, the History Guided Attention Module (HGAM) and Temporal Cross Attention Module (TCAM), both of which further enhances temporal features. A Temporal-Spatial Decoder (TSD) is also introduced to fuse the spatial and temporal features for generating the glass region mask. Furthermore, for learning our network, we also propose a large-scale dataset, which comprises 312 diverse glass scenarios with a total of 19,268 frames. Extensive experiments demonstrate that our MVGD-Net outperforms relevant state-of-the-art methods.

64. 【2601.13707】Attention-space Contrastive Guidance for Efficient Hallucination Mitigation in LVLMs

链接https://arxiv.org/abs/2601.13707

作者:Yujin Jo,Sangyoon Bae,Taesup Kim

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:causing object misidentification, visually inconsistent descriptions, causing object, inconsistent descriptions, language priors dominate

备注

点击查看摘要

Abstract:Hallucinations in large vision-language models (LVLMs) often arise when language priors dominate over visual evidence, causing object misidentification and visually inconsistent descriptions. We address this issue by framing hallucination mitigation as contrastive guidance, steering generation toward visually grounded and semantically faithful text. This approach regulates the model's internal behavior by reducing over-dependence on language priors and contrasting visually grounded with language-only representations. We propose Attention-space Contrastive Guidance (ACG), a single-pass mechanism that operates within self-attention layers to construct both vision-language and language-only attention paths in a single forward computation. This integration enables computationally efficient guidance directly embedded in the model's representation contextualization. To correct approximation bias introduced by the single-pass formulation, we further apply an orthogonalized correction that removes components aligned with the language-only path, selectively amplifying visual contributions. Experiments on the CHAIR and POPE benchmarks show that ACG achieves state-of-the-art faithfulness and caption quality while significantly reducing computational cost. Our method establishes a principled and efficient alternative, reducing latency by up to 2x compared to prior contrastive decoding methods that require multiple forward passes.

65. 【2601.13706】ParkingTwin: Training-Free Streaming 3D Reconstruction for Parking-Lot Digital Twins

链接https://arxiv.org/abs/2601.13706

作者:Xinhao Liu,Yu Wang,Xiansheng Guo,Gordon Owusu Boateng,Yu Cao,Haonan Si,Xingchen Guo,Nirwan Ansari

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Automated Valet Parking, High-fidelity parking-lot digital, parking-lot digital twins, digital twins provide, twins provide essential

备注: 35 pages, 10 figures. Submitted to ISPRS Journal of Photogrammetry and Remote Sensing. Under review

点击查看摘要

Abstract:High-fidelity parking-lot digital twins provide essential priors for path planning, collision checking, and perception validation in Automated Valet Parking (AVP). Yet robot-oriented reconstruction faces a trilemma: sparse forward-facing views cause weak parallax and ill-posed geometry; dynamic occlusions and extreme lighting hinder stable texture fusion; and neural rendering typically needs expensive offline optimization, violating edge-side streaming constraints. We propose ParkingTwin, a training-free, lightweight system for online streaming 3D reconstruction. First, OSM-prior-driven geometric construction uses OpenStreetMap semantic topology to directly generate a metric-consistent TSDF, replacing blind geometric search with deterministic mapping and avoiding costly optimization. Second, geometry-aware dynamic filtering employs a quad-modal constraint field (normal/height/depth consistency) to reject moving vehicles and transient occlusions in real time. Third, illumination-robust fusion in CIELAB decouples luminance and chromaticity via adaptive L-channel weighting and depth-gradient suppression, reducing seams under abrupt lighting changes. ParkingTwin runs at 30+ FPS on an entry-level GTX 1660. On a 68,000 m^2 real-world dataset, it achieves SSIM 0.87 (+16.0%), delivers about 15x end-to-end speedup, and reduces GPU memory by 83.3% compared with state-of-the-art 3D Gaussian Splatting (3DGS) that typically requires high-end GPUs (RTX 4090D). The system outputs explicit triangle meshes compatible with Unity/Unreal digital-twin pipelines. Project page: this https URL

66. 【2601.13705】Reasoning or Pattern Matching? Probing Large Vision-Language Models with Visual Puzzles

链接https://arxiv.org/abs/2601.13705

作者:Maria Lymperaiou,Vasileios Karampinis,Giorgos Filandrianos,Angelos Vlachos,Chrysoula Zerva,Athanasios Voulodimos

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:rule discovery, human cognition, prior knowledge, Large Vision-Language Models, visual puzzles

备注

点击查看摘要

Abstract:Puzzles have long served as compact and revealing probes of human cognition, isolating abstraction, rule discovery, and systematic reasoning with minimal reliance on prior knowledge. Leveraging these properties, visual puzzles have recently emerged as a powerful diagnostic tool for evaluating the reasoning abilities of Large Vision-Language Models (LVLMs), offering controlled, verifiable alternatives to open-ended multimodal benchmarks. This survey provides a unified perspective of visual puzzle reasoning in LVLMs. We frame visual puzzles through a common abstraction and organize existing benchmarks by the reasoning mechanisms they target (inductive, analogical, algorithmic, deductive, and geometric/spatial), thereby linking puzzle design to the cognitive operations required for solving. Synthesizing empirical evidence across these categories, we identify consistent limitations in current models, including brittle generalization, tight entanglement between perception and reasoning, and a persistent gap between fluent explanations and faithful execution. By framing visual puzzles as diagnostic instruments rather than task formats, this survey elaborates on the state of LVLM reasoning and outlines key directions for future benchmarks and reasoning-aware multimodal systems.

67. 【2601.13683】Dynamic Differential Linear Attention: Enhancing Linear Diffusion Transformer for High-Quality Image Generation

链接https://arxiv.org/abs/2601.13683

作者:Boyuan Cao,Xingbo Yao,Chenhui Wang,Jiaxin Ye,Yujie Wei,Hongming Shan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:major scalability bottleneck, linear diffusion transformers, high-fidelity image generation, Diffusion transformers, resulting linear diffusion

备注

点击查看摘要

Abstract:Diffusion transformers (DiTs) have emerged as a powerful architecture for high-fidelity image generation, yet the quadratic cost of self-attention poses a major scalability bottleneck. To address this, linear attention mechanisms have been adopted to reduce computational cost; unfortunately, the resulting linear diffusion transformers (LiTs) models often come at the expense of generative performance, frequently producing over-smoothed attention weights that limit expressiveness. In this work, we introduce Dynamic Differential Linear Attention (DyDiLA), a novel linear attention formulation that enhances the effectiveness of LiTs by mitigating the oversmoothing issue and improving generation quality. Specifically, the novelty of DyDiLA lies in three key designs: (i) dynamic projection module, which facilitates the decoupling of token representations by learning with dynamically assigned knowledge; (ii) dynamic measure kernel, which provides a better similarity measurement to capture fine-grained semantic distinctions between tokens by dynamically assigning kernel functions for token processing; and (iii) token differential operator, which enables more robust query-to-key retrieval by calculating the differences between the tokens and their corresponding information redundancy produced by dynamic measure kernel. To capitalize on DyDiLA, we introduce a refined LiT, termed DyDi-LiT, that systematically incorporates our advancements. Extensive experiments show that DyDi-LiT consistently outperforms current state-of-the-art (SOTA) models across multiple metrics, underscoring its strong practical potential.

68. 【2601.13677】Finally Outshining the Random Baseline: A Simple and Effective Solution for Active Learning in 3D Biomedical Imaging

链接https://arxiv.org/abs/2601.13677

作者:Carsten T. Lüth,Jeremias Traub,Kim-Celine Kahl,Till J. Bungert,Lukas Klein,Lars Krämer,Paul F. Jäger,Klaus Maier-Hein,Fabian Isensee

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Active learning, drastically reduce annotation, reduce annotation costs, time-consuming and expensive, Power Predictive Entropy

备注: Accepted at TMLR

点击查看摘要

Abstract:Active learning (AL) has the potential to drastically reduce annotation costs in 3D biomedical image segmentation, where expert labeling of volumetric data is both time-consuming and expensive. Yet, existing AL methods are unable to consistently outperform improved random sampling baselines adapted to 3D data, leaving the field without a reliable solution. We introduce Class-stratified Scheduled Power Predictive Entropy (ClaSP PE), a simple and effective query strategy that addresses two key limitations of standard uncertainty-based AL methods: class imbalance and redundancy in early selections. ClaSP PE combines class-stratified querying to ensure coverage of underrepresented structures and log-scale power noising with a decaying schedule to enforce query diversity in early-stage AL and encourage exploitation later. In our evaluation on 24 experimental settings using four 3D biomedical datasets within the comprehensive nnActive benchmark, ClaSP PE is the only method that generally outperforms improved random baselines in terms of both segmentation quality with statistically significant gains, whilst remaining annotation efficient. Furthermore, we explicitly simulate the real-world application by testing our method on four previously unseen datasets without manual adaptation, where all experiment parameters are set according to predefined guidelines. The results confirm that ClaSP PE robustly generalizes to novel tasks without requiring dataset-specific tuning. Within the nnActive framework, we present compelling evidence that an AL method can consistently outperform random baselines adapted to 3D segmentation, in terms of both performance and annotation efficiency in a realistic, close-to-production scenario. Our open-source implementation and clear deployment guidelines make it readily applicable in practice. Code is at this https URL.

69. 【2601.13665】ransformer based Multi-task Fusion Network for Food Spoilage Detection and Shelf life Forecasting

链接https://arxiv.org/abs/2601.13665

作者:Mounika Kanulla,Rajasree Dadigi,Sailaja Thota,Vivek Yelleti

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:agricultural supply chain, effective spoilage detection, DeiT transformer, critical challenges, accurate and effective

备注

点击查看摘要

Abstract:Food wastage is one of the critical challenges in the agricultural supply chain, and accurate and effective spoilage detection can help to reduce it. Further, it is highly important to forecast the spoilage information. This aids the longevity of the supply chain management in the agriculture field. This motivated us to propose fusion based architectures by combining CNN with LSTM and DeiT transformer for the following multi-tasks simultaneously: (i) vegetable classification, (ii) food spoilage detection, and (iii) shelf life forecasting. We developed a dataset by capturing images of vegetables from their fresh state until they were completely spoiled. From the experimental analysis it is concluded that the proposed fusion architectures CNN+CNN-LSTM and CNN+DeiT Transformer outperformed several deep learning models such as CNN, VGG16, ResNet50, Capsule Networks, and DeiT Transformers. Overall, CNN + DeiT Transformer yielded F1-score of 0.98 and 0.61 in vegetable classification and spoilage detection respectively and mean squared error (MSE) and symmetric mean absolute percentage error (SMAPE) of 3.58, and 41.66% respectively in spoilage forecasting. Further, the reliability of the fusion models was validated on noisy images and integrated with LIME to visualize the model decisions.

70. 【2601.13664】VIAFormer: Voxel-Image Alignment Transformer for High-Fidelity Voxel Refinement

链接https://arxiv.org/abs/2601.13664

作者:Tiancheng Fang,Bowen Pan,Lingxi Chen,Jiangjing Lyu,Chengfei Lyu,Chaoyue Niu,Fan Wu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Conditioned Voxel Refinement, Voxel-Image Alignment Transformer, Multi-view Conditioned Voxel, Alignment Transformer model, calibrated multi-view images

备注: Under review at CVPR 2026

点击查看摘要

Abstract:We propose VIAFormer, a Voxel-Image Alignment Transformer model designed for Multi-view Conditioned Voxel Refinement--the task of repairing incomplete noisy voxels using calibrated multi-view images as guidance. Its effectiveness stems from a synergistic design: an Image Index that provides explicit 3D spatial grounding for 2D image tokens, a Correctional Flow objective that learns a direct voxel-refinement trajectory, and a Hybrid Stream Transformer that enables robust cross-modal fusion. Experiments show that VIAFormer establishes a new state of the art in correcting both severe synthetic corruptions and realistic artifacts on the voxel shape obtained from powerful Vision Foundation Models. Beyond benchmarking, we demonstrate VIAFormer as a practical and reliable bridge in real-world 3D creation pipelines, paving the way for voxel-based methods to thrive in large-model, big-data wave.

71. 【2601.13651】Face-Voice Association with Inductive Bias for Maximum Class Separation

链接https://arxiv.org/abs/2601.13651

作者:Marta Moscati,Oleksandr Kats,Mubashir Noman,Muhammad Zaigham Zaheer,Yufang Hou,Markus Schedl,Shah Nawaz

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:approached representing faces, Face-voice association, maximum class separation, widely studied, approached representing

备注: Accepted at ICASSP 2026

点击查看摘要

Abstract:Face-voice association is widely studied in multimodal learning and is approached representing faces and voices with embeddings that are close for a same person and well separated from those of others. Previous work achieved this with loss functions. Recent advancements in classification have shown that the discriminative ability of embeddings can be strengthened by imposing maximum class separation as inductive bias. This technique has never been used in the domain of face-voice association, and this work aims at filling this gap. More specifically, we develop a method for face-voice association that imposes maximum class separation among multimodal representations of different speakers as an inductive bias. Through quantitative experiments we demonstrate the effectiveness of our approach, showing that it achieves SOTA performance on two task formulation of face-voice association. Furthermore, we carry out an ablation study to show that imposing inductive bias is most effective when combined with losses for inter-class orthogonality. To the best of our knowledge, this work is the first that applies and demonstrates the effectiveness of maximum class separation as an inductive bias in multimodal learning; it hence paves the way to establish a new paradigm.

72. 【2601.13645】Quadratic Upper Bound for Boosting Robustness

链接https://arxiv.org/abs/2601.13645

作者:Euijin You,Hyang-Won Lee

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Fast adversarial training, reduced training time, compromised robustness due, Fast adversarial, aims to enhance

备注: Accepted at ICML 2025. Published in PMLR 267:72656-72676

点击查看摘要

Abstract:Fast adversarial training (FAT) aims to enhance the robustness of models against adversarial attacks with reduced training time, however, FAT often suffers from compromised robustness due to insufficient exploration of adversarial space. In this paper, we develop a loss function to mitigate the problem of degraded robustness under FAT. Specifically, we derive a quadratic upper bound (QUB) on the adversarial training (AT) loss function and propose to utilize the bound with existing FAT methods. Our experimental results show that applying QUB loss to the existing methods yields significant improvement of robustness. Furthermore, using various metrics, we demonstrate that this improvement is likely to result from the smoothened loss landscape of the resulting model.

73. 【2601.13633】Scaling Test-time Inference for Visual Grounding

链接https://arxiv.org/abs/2601.13633

作者:Guanqi Zhan,Changye Li,Zhijian Liu,Yao Lu,Yi Wu,Song Han,Ligeng Zhu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:real physical world, Visual Language Models, grounding visual language, visual Grounding language, Language Models

备注

点击查看摘要

Abstract:Visual grounding is an essential capability of Visual Language Models (VLMs) to understand the real physical world. Previous state-of-the-art grounding visual language models usually have large model sizes, making them heavy for deployment and slow for inference. However, we notice that the sizes of visual encoders are nearly the same for small and large VLMs and the major difference is the sizes of the language models. Small VLMs fall behind larger VLMs in grounding because of the difference in language understanding capability rather than visual information handling. To mitigate the gap, we introduce 'Efficient visual Grounding language Models' (EGM): a method to scale the test-time computation (#generated tokens). Scaling the test-time computation of a small model is deployment-friendly, and yields better end-to-end latency as the cost of each token is much cheaper compared to directly running a large model. On the RefCOCO benchmark, our EGM-Qwen3-VL-8B demonstrates 91.4 IoU with an average of 737ms (5.9x faster) latency while Qwen3-VL-235B demands 4,320ms to achieve 90.5 IoU. To validate our approach's generality, we further set up a new amodal grounding setting that requires the model to predict both the visible and occluded parts of the objects. Experiments show our method can consistently and significantly improve the vanilla grounding and amodal grounding capabilities of small models to be on par with or outperform the larger models, thereby improving the efficiency for visual grounding.

74. 【2601.13622】CARPE: Context-Aware Image Representation Prioritization via Ensemble for Large Vision-Language Models

链接https://arxiv.org/abs/2601.13622

作者:Donghee Lee,Rui Cai,Zhe Zhao

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Recent advancements, Large Vision-Language Models, advancements in Large, general-purpose assistants, Large Vision-Language

备注

点击查看摘要

Abstract:Recent advancements in Large Vision-Language Models (LVLMs) have pushed them closer to becoming general-purpose assistants. Despite their strong performance, LVLMs still struggle with vision-centric tasks such as image classification, underperforming compared to their base vision encoders, which are often CLIP-based models. To address this limitation, we propose Context-Aware Image Representation Prioritization via Ensemble (CARPE), a novel, model-agnostic framework which introduces vision-integration layers and a context-aware ensemble strategy to identify when to prioritize image representations or rely on the reasoning capabilities of the language model. This design enhances the model's ability to adaptively weight visual and textual modalities and enables the model to capture various aspects of image representations, leading to consistent improvements in generalization across classification and vision-language benchmarks. Extensive experiments demonstrate that CARPE not only improves performance on image classification benchmarks but also enhances results across various vision-language benchmarks. Finally, CARPE is designed to be effectively integrated with most open-source LVLMs that consist of a vision encoder and a language model, ensuring its adaptability across diverse architectures.

75. 【2601.13606】ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch

链接https://arxiv.org/abs/2601.13606

作者:Zheng Liu,Honglin Lin,Chonghan Qin,Xiaoyang Wang,Xin Gao,Yu Li,Mengzhang Cai,Yun Zhu,Zhanping Zhong,Qizhi Pei,Zhuoshi Pan,Xiaoran Shang,Bin Cui,Conghui He,Wentao Zhang,Lijun Wu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Vision Language Models, Vision Language, capability for Vision, Language Models, critical capability

备注: 29 pages

点击查看摘要

Abstract:Chart reasoning is a critical capability for Vision Language Models (VLMs). However, the development of open-source models is severely hindered by the lack of high-quality training data. Existing datasets suffer from a dual challenge: synthetic charts are often simplistic and repetitive, while the associated QA pairs are prone to hallucinations and lack the reasoning depth required for complex tasks. To bridge this gap, we propose ChartVerse, a scalable framework designed to synthesize complex charts and reliable reasoning data from scratch. (1) To address the bottleneck of simple patterns, we first introduce Rollout Posterior Entropy (RPE), a novel metric that quantifies chart complexity. Guided by RPE, we develop complexity-aware chart coder to autonomously synthesize diverse, high-complexity charts via executable programs. (2) To guarantee reasoning rigor, we develop truth-anchored inverse QA synthesis. Diverging from standard generation, we adopt an answer-first paradigm: we extract deterministic answers directly from the source code, generate questions conditional on these anchors, and enforce strict consistency verification. To further elevate difficulty and reasoning depth, we filter samples based on model fail-rate and distill high-quality Chain-of-Thought (CoT) reasoning. We curate ChartVerse-SFT-600K and ChartVerse-RL-40K using Qwen3-VL-30B-A3B-Thinking as the teacher. Experimental results demonstrate that ChartVerse-8B achieves state-of-the-art performance, notably surpassing its teacher and rivaling the stronger Qwen3-VL-32B-Thinking.

76. 【2601.13578】FG-OrIU: Towards Better Forgetting via Feature-Gradient Orthogonality for Incremental Unlearning

链接https://arxiv.org/abs/2601.13578

作者:Qian Feng,JiaHang Tu,Mintong Kang,Hanbin Zhao,Chao Zhang,Hui Qian

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:data deletion requests, information remains recoverable, primarily suppress parameters, residual information remains, sequential data deletion

备注: This paper has been accepted by ICCV 2025. code: \url{ [this https URL](https://github.com/RAIAN08/FG-OrIU) }

点击查看摘要

Abstract:Incremental unlearning (IU) is critical for pre-trained models to comply with sequential data deletion requests, yet existing methods primarily suppress parameters or confuse knowledge without explicit constraints on both feature and gradient level, resulting in \textit{superficial forgetting} where residual information remains recoverable. This incomplete forgetting risks security breaches and disrupts retention balance, especially in IU scenarios. We propose FG-OrIU (\textbf{F}eature-\textbf{G}radient \textbf{Or}thogonality for \textbf{I}ncremental \textbf{U}nlearning), the first framework unifying orthogonal constraints on both features and gradients level to achieve deep forgetting, where the forgetting effect is irreversible. FG-OrIU decomposes feature spaces via Singular Value Decomposition (SVD), separating forgetting and remaining class features into distinct subspaces. It then enforces dual constraints: feature orthogonal projection on both forgetting and remaining classes, while gradient orthogonal projection prevents the reintroduction of forgotten knowledge and disruption to remaining classes during updates. Additionally, dynamic subspace adaptation merges newly forgetting subspaces and contracts remaining subspaces, ensuring a stable balance between removal and retention across sequential unlearning tasks. Extensive experiments demonstrate the effectiveness of our method.

77. 【2601.13565】Learning Fine-Grained Correspondence with Cross-Perspective Perception for Open-Vocabulary 6D Object Pose Estimation

链接https://arxiv.org/abs/2601.13565

作者:Yu Qin,Shimeng Fan,Fan Yang,Zixuan Xue,Zijie Mai,Wenrui Chen,Kailun Yang,Zhiyong Li

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)

关键词:unseen objects guided, objects guided solely, arbitrary unseen objects, manipulate arbitrary unseen, estimation empowers robots

备注: The source code will be made publicly available at [this https URL](https://github.com/zjjqinyu/FiCoP)

点击查看摘要

Abstract:Open-vocabulary 6D object pose estimation empowers robots to manipulate arbitrary unseen objects guided solely by natural language. However, a critical limitation of existing approaches is their reliance on unconstrained global matching strategies. In open-world scenarios, trying to match anchor features against the entire query image space introduces excessive ambiguity, as target features are easily confused with background distractors. To resolve this, we propose Fine-grained Correspondence Pose Estimation (FiCoP), a framework that transitions from noise-prone global matching to spatially-constrained patch-level correspondence. Our core innovation lies in leveraging a patch-to-patch correlation matrix as a structural prior to narrowing the matching scope, effectively filtering out irrelevant clutter to prevent it from degrading pose estimation. Firstly, we introduce an object-centric disentanglement preprocessing to isolate the semantic target from environmental noise. Secondly, a Cross-Perspective Global Perception (CPGP) module is proposed to fuse dual-view features, establishing structural consensus through explicit context reasoning. Finally, we design a Patch Correlation Predictor (PCP) that generates a precise block-wise association map, acting as a spatial filter to enforce fine-grained, noise-resilient matching. Experiments on the REAL275 and Toyota-Light datasets demonstrate that FiCoP improves Average Recall by 8.0% and 6.1%, respectively, compared to the state-of-the-art method, highlighting its capability to deliver robust and generalized perception for robotic agents operating in complex, unconstrained open-world environments. The source code will be made publicly available at this https URL.

78. 【2601.13562】Reasoning is a Modality

链接https://arxiv.org/abs/2601.13562

作者:Zhiguang Liu,Yi Shang

类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:studying abstract reasoning, studying abstract, Reasoning Corpus, compact laboratory, laboratory for studying

备注: Code access: [this https URL](https://github.com/lz7fd/Reasoning_is_a_Modality)

点击查看摘要

Abstract:The Abstraction and Reasoning Corpus (ARC) provides a compact laboratory for studying abstract reasoning, an ability central to human intelligence. Modern AI systems, including LLMs and ViTs, largely operate as sequence-of-behavior prediction machines: they match observable behaviors by modeling token statistics without a persistent, readable mental state. This creates a gap with human-like behavior: humans can explain an action by decoding internal state, while AI systems can produce fluent post-hoc rationalizations that are not grounded in such a state. We hypothesize that reasoning is a modality: reasoning should exist as a distinct channel separate from the low-level workspace on which rules are applied. To test this hypothesis, on solving ARC tasks as a visual reasoning problem, we designed a novel role-separated transformer block that splits global controller tokens from grid workspace tokens, enabling iterative rule execution. Trained and evaluated within the VARC vision-centric protocol, our method achieved 62.6% accuracy on ARC-1, surpassing average human performance (60.2%) and outperforming prior methods significantly. Qualitatively, our models exhibit more coherent rule-application structure than the dense ViT baseline, consistent with a shift away from plausible probability blobs toward controller-driven reasoning.

79. 【2601.13551】DiffFace-Edit: A Diffusion-Based Facial Dataset for Forgery-Semantic Driven Deepfake Detection Analysis

链接https://arxiv.org/abs/2601.13551

作者:Feng Ding,Wenhui Yi,Xinan He,Mengyao Xiao,Jianfeng Xu,Jianqiang Du

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:posing significant privacy, significant privacy risks, produce imperceptible, posing significant, privacy risks

备注

点击查看摘要

Abstract:Generative models now produce imperceptible, fine-grained manipulated faces, posing significant privacy risks. However, existing AI-generated face datasets generally lack focus on samples with fine-grained regional manipulations. Furthermore, no researchers have yet studied the real impact of splice attacks, which occur between real and manipulated samples, on detectors. We refer to these as detector-evasive samples. Based on this, we introduce the DiffFace-Edit dataset, which has the following advantages: 1) It contains over two million AI-generated fake images. 2) It features edits across eight facial regions (e.g., eyes, nose) and includes a richer variety of editing combinations, such as single-region and multi-region edits. Additionally, we specifically analyze the impact of detector-evasive samples on detection models. We conduct a comprehensive analysis of the dataset and propose a cross-domain evaluation that combines IMDL methods. Dataset will be available at this https URL.

80. 【2601.13524】GO-MLVTON: Garment Occlusion-Aware Multi-Layer Virtual Try-On with Diffusion Models

链接https://arxiv.org/abs/2601.13524

作者:Yang Yu,Yunze Deng,Yige Zhang,Yanjie Xiao,Youkun Ou,Wenhao Hu,Mingchao Li,Bin Feng,Wenyu Liu,Dandan Zheng,Jingdong Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Existing Image-based virtual, Existing Image-based, Image-based virtual try-on, visually plausible outcomes, involves dressing multiple

备注: 5pages, 3 figures

点击查看摘要

Abstract:Existing Image-based virtual try-on (VTON) methods primarily focus on single-layer or multi-garment VTON, neglecting multi-layer VTON (ML-VTON), which involves dressing multiple layers of garments onto the human body with realistic deformation and layering to generate visually plausible outcomes. The main challenge lies in accurately modeling occlusion relationships between inner and outer garments to reduce interference from redundant inner garment features. To address this, we propose GO-MLVTON, the first multi-layer VTON method, introducing the Garment Occlusion Learning module to learn occlusion relationships and the StableDiffusion-based Garment Morphing Fitting module to deform and fit garments onto the human body, producing high-quality multi-layer try-on results. Additionally, we present the MLG dataset for this task and propose a new metric named Layered Appearance Coherence Difference (LACD) for evaluation. Extensive experiments demonstrate the state-of-the-art performance of GO-MLVTON. Project page: this https URL.

81. 【2601.13502】DIS2: Disentanglement Meets Distillation with Classwise Attention for Robust Remote Sensing Segmentation under Missing Modalities

链接https://arxiv.org/abs/2601.13502

作者:Nhi Kieu,Kien Nguyen,Arnold Wiliem,Clinton Fookes,Sridha Sridharan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:remote sensing, efficacy of multimodal, severely undermined, multimodal learning, learning

备注: Accepted to WACV 2026 - Computer Vision for Earth Observation Workshop

点击查看摘要

Abstract:The efficacy of multimodal learning in remote sensing (RS) is severely undermined by missing modalities. The challenge is exacerbated by the RS highly heterogeneous data and huge scale variation. Consequently, paradigms proven effective in other domains often fail when confronted with these unique data characteristics. Conventional disentanglement learning, which relies on significant feature overlap between modalities (modality-invariant), is insufficient for this heterogeneity. Similarly, knowledge distillation becomes an ill-posed mimicry task where a student fails to focus on the necessary compensatory knowledge, leaving the semantic gap unaddressed. Our work is therefore built upon three pillars uniquely designed for RS: (1) principled missing information compensation, (2) class-specific modality contribution, and (3) multi-resolution feature importance. We propose a novel method DIS2, a new paradigm shifting from modality-shared feature dependence and untargeted imitation to active, guided missing features compensation. Its core novelty lies in a reformulated synergy between disentanglement learning and knowledge distillation, termed DLKD. Compensatory features are explicitly captured which, when fused with the features of the available modality, approximate the ideal fused representation of the full-modality case. To address the class-specific challenge, our Classwise Feature Learning Module (CFLM) adaptively learn discriminative evidence for each target depending on signal availability. Both DLKD and CFLM are supported by a hierarchical hybrid fusion (HF) structure using features across resolutions to strengthen prediction. Extensive experiments validate that our proposed approach significantly outperforms state-of-the-art methods across benchmarks.

82. 【2601.13498】Optical Linear Systems Framework for Event Sensing and Computational Neuromorphic Imaging

链接https://arxiv.org/abs/2601.13498

作者:Nimrod Kruger,Nicholas Owen Ralph,Gregory Cohen,Paul Hurley

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:log-intensity threshold crossings, Event vision sensors, low data bandwidth, enabling microsecond-scale sensing, high dynamic range

备注

点击查看摘要

Abstract:Event vision sensors (neuromorphic cameras) output sparse, asynchronous ON/OFF events triggered by log-intensity threshold crossings, enabling microsecond-scale sensing with high dynamic range and low data bandwidth. As a nonlinear system, this event representation does not readily integrate with the linear forward models that underpin most computational imaging and optical system design. We present a physics-grounded processing pipeline that maps event streams to estimates of per-pixel log-intensity and intensity derivatives, and embeds these measurements in a dynamic linear systems model with a time-varying point spread function. This enables inverse filtering directly from event data, using frequency-domain Wiener deconvolution with a known (or parameterised) dynamic transfer function. We validate the approach in simulation for single and overlapping point sources under modulated defocus, and on real event data from a tunable-focus telescope imaging a star field, demonstrating source localisation and separability. The proposed framework provides a practical bridge between event sensing and model-based computational imaging for dynamic optical systems.

83. 【2601.13451】Event-based Heterogeneous Information Processing for Online Vision-based Obstacle Detection and Localization

链接https://arxiv.org/abs/2601.13451

作者:Reza Ahmadvand,Sarah Safura Sharif,Yaser Mike Banad

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:Spiking Neural Network, Artificial Neural Networks, Hybrid Neural Networks, Neural Networks, enhance situational awareness

备注

点击查看摘要

Abstract:This paper introduces a novel framework for robotic vision-based navigation that integrates Hybrid Neural Networks (HNNs) with Spiking Neural Network (SNN)-based filtering to enhance situational awareness for unmodeled obstacle detection and localization. By leveraging the complementary strengths of Artificial Neural Networks (ANNs) and SNNs, the system achieves both accurate environmental understanding and fast, energy-efficient processing. The proposed architecture employs a dual-pathway approach: an ANN component processes static spatial features at low frequency, while an SNN component handles dynamic, event-based sensor data in real time. Unlike conventional hybrid architectures that rely on domain conversion mechanisms, our system incorporates a pre-developed SNN-based filter that directly utilizes spike-encoded inputs for localization and state estimation. Detected anomalies are validated using contextual information from the ANN pathway and continuously tracked to support anticipatory navigation strategies. Simulation results demonstrate that the proposed method offers acceptable detection accuracy while maintaining computational efficiency close to SNN-only implementations, which operate at a fraction of the resource cost. This framework represents a significant advancement in neuromorphic navigation systems for robots operating in unpredictable and dynamic environments.

84. 【2601.13440】Analyzing VLM-Based Approaches for Anomaly Classification and Segmentation

链接https://arxiv.org/abs/2601.13440

作者:Mohit Kakda,Mirudula Shri Muthukumaran,Uttapreksha Patel,Lawrence Swaminathan Xavier Prince

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:extensive labeled datasets, Vision-Language Models, few-shot defect identification, revolutionized anomaly detection, labeled datasets

备注: 10 pages,4 images

点击查看摘要

Abstract:Vision-Language Models (VLMs), particularly CLIP, have revolutionized anomaly detection by enabling zero-shot and few-shot defect identification without extensive labeled datasets. By learning aligned representations of images and text, VLMs facilitate anomaly classification and segmentation through natural language descriptions of normal and abnormal states, eliminating traditional requirements for task-specific training or defect examples. This project presents a comprehensive analysis of VLM-based approaches for anomaly classification (AC) and anomaly segmentation (AS). We systematically investigate key architectural paradigms including sliding window-based dense feature extraction (WinCLIP), multi-stage feature alignment with learnable projections (AprilLab framework), and compositional prompt ensemble strategies. Our analysis evaluates these methods across critical dimensions: feature extraction mechanisms, text-visual alignment strategies, prompt engineering techniques, zero-shot versus few-shot trade-offs, computational efficiency, and cross-domain generalization. Through rigorous experimentation on benchmarks such as MVTec AD and VisA, we compare classification accuracy, segmentation precision, and inference efficiency. The primary contribution is a foundational understanding of how and why VLMs succeed in anomaly detection, synthesizing practical insights for method selection and identifying current limitations. This work aims to facilitate informed adoption of VLM-based methods in industrial quality control and guide future research directions.

85. 【2601.13417】SGW-GAN: Sliced Gromov-Wasserstein Guided GANs for Retinal Fundus Image Enhancement

链接https://arxiv.org/abs/2601.13417

作者:Yujian Xiong,Xuanzhao Dong,Wenhui Zhu,Xin Li,Oana Dumitrascu,Yalin Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Retinal fundus photography, screening and diagnosis, uneven illumination, fundus photography, photography is indispensable

备注

点击查看摘要

Abstract:Retinal fundus photography is indispensable for ophthalmic screening and diagnosis, yet image quality is often degraded by noise, artifacts, and uneven illumination. Recent GAN- and diffusion-based enhancement methods improve perceptual quality by aligning degraded images with high-quality distributions, but our analysis shows that this focus can distort intra-class geometry: clinically related samples become dispersed, disease-class boundaries blur, and downstream tasks such as grading or lesion detection are harmed. The Gromov Wasserstein (GW) discrepancy offers a principled solution by aligning distributions through internal pairwise distances, naturally preserving intra-class structure, but its high computational cost restricts practical use. To overcome this, we propose SGW-GAN, the first framework to incorporate Sliced GW (SGW) into retinal image enhancement. SGW approximates GW via random projections, retaining relational fidelity while greatly reducing cost. Experiments on public datasets show that SGW-GAN produces visually compelling enhancements, achieves superior diabetic retinopathy grading, and reports the lowest GW discrepancy across disease labels, demonstrating both efficiency and clinical fidelity for unpaired medical image enhancement.

86. 【2601.13416】Diffusion Representations for Fine-Grained Image Classification: A Marine Plankton Case Study

链接https://arxiv.org/abs/2601.13416

作者:A. Nieto Juscafresa,Á. Mazcuñán Herreros,J. Sullivan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:encoders remains underexplored, general-purpose feature encoders, feature encoders remains, image synthesis, remains underexplored

备注: 21 pages, 6 figures, CVPR format

点击查看摘要

Abstract:Diffusion models have emerged as state-of-the-art generative methods for image synthesis, yet their potential as general-purpose feature encoders remains underexplored. Trained for denoising and generation without labels, they can be interpreted as self-supervised learners that capture both low- and high-level structure. We show that a frozen diffusion backbone enables strong fine-grained recognition by probing intermediate denoising features across layers and timesteps and training a linear classifier for each pair. We evaluate this in a real-world plankton-monitoring setting with practical impact, using controlled and comparable training setups against established supervised and self-supervised baselines. Frozen diffusion features are competitive with supervised baselines and outperform other self-supervised methods in both balanced and naturally long-tailed settings. Out-of-distribution evaluations on temporally and geographically shifted plankton datasets further show that frozen diffusion features maintain strong accuracy and Macro F1 under substantial distribution shift.

87. 【2601.13412】Using deep learning for predicting cleansing quality of colon capsule endoscopy images

链接https://arxiv.org/abs/2601.13412

作者:Puneet Sharma,Kristian Dalsbø Hindberg,Benedicte Schelde-Olesen,Ulrik Deding,Esmaeil S. Nadimi,Jan-Matthias Braun

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:colon capsule endoscopy, deep learning techniques, capsule endoscopy, deep learning, colon capsule

备注: 24 pages

点击查看摘要

Abstract:In this study, we explore the application of deep learning techniques for predicting cleansing quality in colon capsule endoscopy (CCE) images. Using a dataset of 500 images labeled by 14 clinicians on the Leighton-Rex scale (Poor, Fair, Good, and Excellent), a ResNet-18 model was trained for classification, leveraging stratified K-fold cross-validation to ensure robust performance. To optimize the model, structured pruning techniques were applied iteratively, achieving significant sparsity while maintaining high accuracy. Explainability of the pruned model was evaluated using Grad-CAM, Grad-CAM++, Eigen-CAM, Ablation-CAM, and Random-CAM, with the ROAD method employed for consistent evaluation. Our results indicate that for a pruned model, we can achieve a cross-validation accuracy of 88% with 79% sparsity, demonstrating the effectiveness of pruning in improving efficiency from 84% without compromising performance. We also highlight the challenges of evaluating cleansing quality of CCE images, emphasize the importance of explainability in clinical applications, and discuss the challenges associated with using the ROAD method for our task. Finally, we employ a variant of adaptive temperature scaling to calibrate the pruned models for an external dataset.

88. 【2601.13404】Local-to-Global Logical Explanations for Deep Vision Models

链接https://arxiv.org/abs/2601.13404

作者:Bhavan Vasu,Giuseppe Raffa,Prasad Tadepalli

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:deep neural networks, hard to interpret, deep neural, neural networks, networks are extremely

备注: 15 pages, 5 figures, 5th International Joint Conference on Learning Reasoning 2025

点击查看摘要

Abstract:While deep neural networks are extremely effective at classifying images, they remain opaque and hard to interpret. We introduce local and global explanation methods for black-box models that generate explanations in terms of human-recognizable primitive concepts. Both the local explanations for a single image and the global explanations for a set of images are cast as logical formulas in monotone disjunctive-normal-form (MDNF), whose satisfaction guarantees that the model yields a high score on a given class. We also present an algorithm for explaining the classification of examples into multiple classes in the form of a monotone explanation list over primitive concepts. Despite their simplicity and interpretability we show that the explanations maintain high fidelity and coverage with respect to the blackbox models they seek to explain in challenging vision datasets.

89. 【2601.13401】Reasoning with Pixel-level Precision: QVLM Architecture and SQuID Dataset for Quantitative Geospatial Analytics

链接https://arxiv.org/abs/2601.13401

作者:Peter A. Massih,Eric Cosatto

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Current Vision-Language Models, destroy pixel-level information, Current Vision-Language, pixel-level information required, quantitative spatial reasoning

备注: Submitted to CVPR 2026. Introduces the QVLM architecture and the SQuID dataset for quantitative geospatial reasoning. Dataset DOI: [https://doi.org/10.57967/hf/7565](https://doi.org/10.57967/hf/7565)

点击查看摘要

Abstract:Current Vision-Language Models (VLMs) fail at quantitative spatial reasoning because their architectures destroy pixel-level information required for counting and measurements. Vision encoders compress images through patch embeddings, reducing spatial indexing and losing the precise pixel-level tracking required for accurate counting. We present two contributions to address this fundamental limitation. First, we introduce SQuID (Satellite Quantitative Intelligence Dataset), a benchmark of 2,000 satellite image Question-Answer pairs with both numerical range and categorical answers, designed to evaluate quantitative spatial reasoning. The dataset spans three difficulty tiers with annotations automatically generated from human labels and their learned variability. Second, we propose QVLM (Quantitative Vision-Language Model), a code-generation architecture that maintains pixel precision by decoupling language understanding from visual analysis. Instead of encoding images into embeddings, QVLM generates executable code that first calls a segmentation model to obtain pixel-level masks, then operates directly on these masks, preserving spatial indexing throughout the reasoning process. Our experiments show that QVLM using GPT-5 as coder achieves 42.0% accuracy on SQuID compared to 28.1% for a VLM prompted with image-question pairs. Our work reveals that, for quantitative spatial reasoning, architectural decoupling enables better accuracy on quantitative tasks.

90. 【2601.13400】Deep Image Prior with L0 Gradient Regularizer for Image Smoothing

链接https://arxiv.org/abs/2601.13400

作者:Nhat Thanh Tran,Kevin Bui,Jack Xin

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:removes minor details, fundamental image processing, image processing operation, Image smoothing, underlying structure

备注: To be published in the Proceedings of IEEE ICASSP 2026

点击查看摘要

Abstract:Image smoothing is a fundamental image processing operation that preserves the underlying structure, such as strong edges and contours, and removes minor details and textures in an image. Many image smoothing algorithms rely on computing local window statistics or solving an optimization problem. Recent state-of-the-art methods leverage deep learning, but they require a carefully curated training dataset. Because constructing a proper training dataset for image smoothing is challenging, we propose DIP-$\ell_0$, a deep image prior framework that incorporates the $\ell_0$ gradient regularizer. This framework can perform high-quality image smoothing without any training data. To properly minimize the associated loss function that has the nonconvex, nonsmooth $\ell_0$ ``norm", we develop an alternating direction method of multipliers algorithm that utilizes an off-the-shelf $\ell_0$ gradient minimization solver. Numerical experiments demonstrate that the proposed DIP-$\ell_0$ outperforms many image smoothing algorithms in edge-preserving image smoothing and JPEG artifact removal.

91. 【2601.13386】Leveraging Transformer Decoder for Automotive Radar Object Detection

链接https://arxiv.org/abs/2601.13386

作者:Changxu Zhang,Zhaoze Wang,Tai Fei,Christopher Grimm,Yi Jin,Claas Tebruegge,Ernst Warsitz,Markus Gardill

类目:Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)

关键词:radar feature representations, present a Transformer-based, Transformer-based architecture, Transformer Decoder, Pyramid Token Fusion

备注

点击查看摘要

Abstract:In this paper, we present a Transformer-based architecture for 3D radar object detection that uses a novel Transformer Decoder as the prediction head to directly regress 3D bounding boxes and class scores from radar feature representations. To bridge multi-scale radar features and the decoder, we propose Pyramid Token Fusion (PTF), a lightweight module that converts a feature pyramid into a unified, scale-aware token sequence. By formulating detection as a set prediction problem with learnable object queries and positional encodings, our design models long-range spatial-temporal correlations and cross-feature interactions. This approach eliminates dense proposal generation and heuristic post-processing such as extensive non-maximum suppression (NMS) tuning. We evaluate the proposed framework on the RADDet, where it achieves significant improvements over state-of-the-art radar-only baselines.

92. 【2601.13385】Organ-Aware Attention Improves CT Triage and Classification

链接https://arxiv.org/abs/2601.13385

作者:Lavsen Dahal,Yubraj Bhandari,Geoffrey D. Rubin,Joseph Y. Lo

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:mitigate radiologist burnout, high-volume medical imaging, medical imaging modalities, Vision Language Models, improve patient care

备注

点击查看摘要

Abstract:There is an urgent need for triage and classification of high-volume medical imaging modalities such as computed tomography (CT), which can improve patient care and mitigate radiologist burnout. Study-level CT triage requires calibrated predictions with localized evidence; however, off-the-shelf Vision Language Models (VLM) struggle with 3D anatomy, protocol shifts, and noisy report supervision. This study used the two largest publicly available chest CT datasets: CT-RATE and RADCHEST-CT (held-out external test set). Our carefully tuned supervised baseline (instantiated as a simple Global Average Pooling head) establishes a new supervised state of the art, surpassing all reported linear-probe VLMs. Building on this baseline, we present ORACLE-CT, an encoder-agnostic, organ-aware head that pairs Organ-Masked Attention (mask-restricted, per-organ pooling that yields spatial evidence) with Organ-Scalar Fusion (lightweight fusion of normalized volume and mean-HU cues). In the chest setting, ORACLE-CT masked attention model achieves AUROC 0.86 on CT-RATE; in the abdomen setting, on MERLIN (30 findings), our supervised baseline exceeds a reproduced zero-shot VLM baseline obtained by running publicly released weights through our pipeline, and adding masked attention plus scalar fusion further improves performance to AUROC 0.85. Together, these results deliver state-of-the-art supervised classification performance across both chest and abdomen CT under a unified evaluation protocol. The source code is available at this https URL.

93. 【2601.13380】Practical Insights into Semi-Supervised Object Detection Approaches

链接https://arxiv.org/abs/2601.13380

作者:Chaoxin Wang,Bharaneeshwar Balasubramaniyam,Anurag Sangem,Nicolais Guevara,Doina Caragea

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:recently gained significant, gained significant attention, research community, data-scarce settings, settings has recently

备注

点击查看摘要

Abstract:Learning in data-scarce settings has recently gained significant attention in the research community. Semi-supervised object detection(SSOD) aims to improve detection performance by leveraging a large number of unlabeled images alongside a limited number of labeled images(a.k.a.,few-shot learning). In this paper, we present a comprehensive comparison of three state-of-the-art SSOD approaches, including MixPL, Semi-DETR and Consistent-Teacher, with the goal of understanding how performance varies with the number of labeled images. We conduct experiments using the MS-COCO and Pascal VOC datasets, two popular object detection benchmarks which allow for standardized evaluation. In addition, we evaluate the SSOD approaches on a custom Beetle dataset which enables us to gain insights into their performance on specialized datasets with a smaller number of object categories. Our findings highlight the trade-offs between accuracy, model size, and latency, providing insights into which methods are best suited for low-data regimes.

94. 【2601.13373】A Lightweight Model-Driven 4D Radar Framework for Pervasive Human Detection in Harsh Conditions

链接https://arxiv.org/abs/2601.13373

作者:Zhenan Liu,Amir Khajepour,George Shaker

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:rapidly degrade optical, Pervasive sensing, confined geometry, metallic structures, severely constrained

备注

点击查看摘要

Abstract:Pervasive sensing in industrial and underground environments is severely constrained by airborne dust, smoke, confined geometry, and metallic structures, which rapidly degrade optical and LiDAR based perception. Elevation resolved 4D mmWave radar offers strong resilience to such conditions, yet there remains a limited understanding of how to process its sparse and anisotropic point clouds for reliable human detection in enclosed, visibility degraded spaces. This paper presents a fully model-driven 4D radar perception framework designed for real-time execution on embedded edge hardware. The system uses radar as its sole perception modality and integrates domain aware multi threshold filtering, ego motion compensated temporal accumulation, KD tree Euclidean clustering with Doppler aware refinement, and a rule based 3D classifier. The framework is evaluated in a dust filled enclosed trailer and in real underground mining tunnels, and in the tested scenarios the radar based detector maintains stable pedestrian identification as camera and LiDAR modalities fail under severe visibility degradation. These results suggest that the proposed model-driven approach provides robust, interpretable, and computationally efficient perception for safety-critical applications in harsh industrial and subterranean environments.

95. 【2601.13371】Spherical Geometry Diffusion: Generating High-quality 3D Face Geometry via Sphere-anchored Representations

链接https://arxiv.org/abs/2601.13371

作者:Junyi Zhang,Yiming Wang,Yunhong Lu,Qichao Wang,Wenzhe Qian,Xiaoyin Xu,David Gu,Min Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:achieving high-quality geometry, fundamental challenge, achieving high-quality, geometry, Spherical Geometry Representation

备注: Association for the Advancement of Artificial Intelligence

点击查看摘要

Abstract:A fundamental challenge in text-to-3D face generation is achieving high-quality geometry. The core difficulty lies in the arbitrary and intricate distribution of vertices in 3D space, making it challenging for existing models to establish clean connectivity and resulting in suboptimal geometry. To address this, our core insight is to simplify the underlying geometric structure by constraining the distribution onto a simple and regular manifold, a topological sphere. Building on this, we first propose the Spherical Geometry Representation, a novel face representation that anchors geometric signals to uniform spherical coordinates. This guarantees a regular point distribution, from which the mesh connectivity can be robustly reconstructed. Critically, this canonical sphere can be seamlessly unwrapped into a 2D map, creating a perfect synergy with powerful 2D generative models. We then introduce Spherical Geometry Diffusion, a conditional diffusion framework built upon this 2D map. It enables diverse and controllable generation by jointly modeling geometry and texture, where the geometry explicitly conditions the texture synthesis process. Our method's effectiveness is demonstrated through its success in a wide range of tasks: text-to-3D generation, face reconstruction, and text-based 3D editing. Extensive experiments show that our approach substantially outperforms existing methods in geometric quality, textual fidelity, and inference efficiency.

96. 【2601.13364】Real-Time 4D Radar Perception for Robust Human Detection in Harsh Enclosed Environments

链接https://arxiv.org/abs/2601.13364

作者:Zhenan Liu,Yaodong Cui,Amir Khajepour,George Shaker

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:enabling repeatable mm-wave, severe electromagnetic constraints, repeatable mm-wave propagation, mm-wave propagation studies, multi-level dust concentrations

备注

点击查看摘要

Abstract:This paper introduces a novel methodology for generating controlled, multi-level dust concentrations in a highly cluttered environment representative of harsh, enclosed environments, such as underground mines, road tunnels, or collapsed buildings, enabling repeatable mm-wave propagation studies under severe electromagnetic constraints. We also present a new 4D mmWave radar dataset, augmented by camera and LiDAR, illustrating how dust particles and reflective surfaces jointly impact the sensing functionality. To address these challenges, we develop a threshold-based noise filtering framework leveraging key radar parameters (RCS, velocity, azimuth, elevation) to suppress ghost targets and mitigate strong multipath reflections at the raw data level. Building on the filtered point clouds, a cluster-level, rule-based classification pipeline exploits radar semantics-velocity, RCS, and volumetric spread-to achieve reliable, real-time pedestrian detection without extensive domainspecific training. Experimental results confirm that this integrated approach significantly enhances clutter mitigation, detection robustness, and overall system resilience in dust-laden mining environments.

97. 【2601.13331】MultiST: A Cross-Attention-Based Multimodal Model for Spatial Transcriptomic

链接https://arxiv.org/abs/2601.13331

作者:Wei Wang,Quoc-Toan Ly,Chong Yu,Jun Bai

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:enables transcriptome-wide profiling, offering unprecedented opportunities, study tissue organization, enables transcriptome-wide, offering unprecedented

备注

点击查看摘要

Abstract:Spatial transcriptomics (ST) enables transcriptome-wide profiling while preserving the spatial context of tissues, offering unprecedented opportunities to study tissue organization and cell-cell interactions in situ. Despite recent advances, existing methods often lack effective integration of histological morphology with molecular profiles, relying on shallow fusion strategies or omitting tissue images altogether, which limits their ability to resolve ambiguous spatial domain boundaries. To address this challenge, we propose MultiST, a unified multimodal framework that jointly models spatial topology, gene expression, and tissue morphology through cross-attention-based fusion. MultiST employs graph-based gene encoders with adversarial alignment to learn robust spatial representations, while integrating color-normalized histological features to capture molecular-morphological dependencies and refine domain boundaries. We evaluated the proposed method on 13 diverse ST datasets spanning two organs, including human brain cortex and breast cancer tissue. MultiST yields spatial domains with clearer and more coherent boundaries than existing methods, leading to more stable pseudotime trajectories and more biologically interpretable cell-cell interaction patterns. The MultiST framework and source code are available at this https URL.

98. 【2601.13304】CausalSpatial: A Benchmark for Object-Centric Causal Spatial Reasoning

链接https://arxiv.org/abs/2601.13304

作者:Wenxin Ma,Chenlong Wang,Ruisheng Yuan,Hao Chen,Nanru Dai,S. Kevin Zhou,Yijun Yang,Alan Yuille,Jieneng Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:instantly predict, ability Causal Spatial, Causal Spatial Reasoning, static spatial perception, Causal Object World

备注: Code is available: [this https URL](https://github.com/CausalSpatial/CausalSpatial)

点击查看摘要

Abstract:Humans can look at a static scene and instantly predict what happens next -- will moving this object cause a collision? We call this ability Causal Spatial Reasoning. However, current multimodal large language models (MLLMs) cannot do this, as they remain largely restricted to static spatial perception, struggling to answer "what-if" questions in a 3D scene. We introduce CausalSpatial, a diagnostic benchmark evaluating whether models can anticipate consequences of object motions across four tasks: Collision, Compatibility, Occlusion, and Trajectory. Results expose a severe gap: humans score 84% while GPT-5 achieves only 54%. Why do MLLMs fail? Our analysis uncovers a fundamental deficiency: models over-rely on textual chain-of-thought reasoning that drifts from visual evidence, producing fluent but spatially ungrounded hallucinations. To address this, we propose the Causal Object World model (COW), a framework that externalizes the simulation process by generating videos of hypothetical dynamics. With explicit visual cues of causality, COW enables models to ground their reasoning in physical reality rather than linguistic priors. We make the dataset and code publicly available here: this https URL

99. 【2601.13299】Enginuity: Building an Open Multi-Domain Dataset of Complex Engineering Diagrams

链接https://arxiv.org/abs/2601.13299

作者:Ethan Seefried,Prahitha Movva,Naga Harshita Marupaka,Tilak Kasturi,Tirthankar Ghosal

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:comprehensive structural annotations, structural annotations designed, automated diagram parsing, multi-domain engineering diagram, diagram parsing

备注: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Ai4 Science

点击查看摘要

Abstract:We propose Enginuity - the first open, large-scale, multi-domain engineering diagram dataset with comprehensive structural annotations designed for automated diagram parsing. By capturing hierarchical component relationships, connections, and semantic elements across diverse engineering domains, our proposed dataset would enable multimodal large language models to address critical downstream tasks including structured diagram parsing, cross-modal information retrieval, and AI-assisted engineering simulation. Enginuity would be transformative for AI for Scientific Discovery by enabling artificial intelligence systems to comprehend and manipulate the visual-structural knowledge embedded in engineering diagrams, breaking down a fundamental barrier that currently prevents AI from fully participating in scientific workflows where diagram interpretation, technical drawing analysis, and visual reasoning are essential for hypothesis generation, experimental design, and discovery.

100. 【2601.13263】Deep Learning for Semantic Segmentation of 3D Ultrasound Data

链接https://arxiv.org/abs/2601.13263

作者:Chenyu Liu,Marco Cecotti,Harikrishnan Vijayakumar,Patrick Robinson,James Barson,Mihai Caleap

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Developing cost-efficient, perception systems remains, automated vehicles, remains a central, central challenge

备注: 14 pages, 10 figures, 8 tables, presented at 2025 13th International Conference on Robot Intelligence Technology and Applications (RITA)

点击查看摘要

Abstract:Developing cost-efficient and reliable perception systems remains a central challenge for automated vehicles. LiDAR and camera-based systems dominate, yet they present trade-offs in cost, robustness and performance under adverse conditions. This work introduces a novel framework for learning-based 3D semantic segmentation using Calyo Pulse, a modular, solid-state 3D ultrasound sensor system for use in harsh and cluttered environments. A 3D U-Net architecture is introduced and trained on the spatial ultrasound data for volumetric segmentation. Results demonstrate robust segmentation performance from Calyo Pulse sensors, with potential for further improvement through larger datasets, refined ground truth, and weighted loss functions. Importantly, this study highlights 3D ultrasound sensing as a promising complementary modality for reliable autonomy.

101. 【2601.13247】Aligning Agentic World Models via Knowledgeable Experience Learning

链接https://arxiv.org/abs/2601.13247

作者:Baochang Ren,Yunzhi Yao,Rui Sun,Shuofei Qiao,Ningyu Zhang,Huajun Chen

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)

关键词:Current Large Language, Large Language Models, critical modal disconnect, Current Large, possess vast semantic

备注: Ongoing work

点击查看摘要

Abstract:Current Large Language Models (LLMs) exhibit a critical modal disconnect: they possess vast semantic knowledge but lack the procedural grounding to respect the immutable laws of the physical world. Consequently, while these agents implicitly function as world models, their simulations often suffer from physical hallucinations-generating plans that are logically sound but physically unexecutable. Existing alignment strategies predominantly rely on resource-intensive training or fine-tuning, which attempt to compress dynamic environmental rules into static model parameters. However, such parametric encapsulation is inherently rigid, struggling to adapt to the open-ended variability of physical dynamics without continuous, costly retraining. To bridge this gap, we introduce WorldMind, a framework that autonomously constructs a symbolic World Knowledge Repository by synthesizing environmental feedback. Specifically, it unifies Process Experience to enforce physical feasibility via prediction errors and Goal Experience to guide task optimality through successful trajectories. Experiments on EB-ALFRED and EB-Habitat demonstrate that WorldMind achieves superior performance compared to baselines with remarkable cross-model and cross-environment transferability.

102. 【2601.13238】A Semantic Decoupling-Based Two-Stage Rainy-Day Attack for Revealing Weather Robustness Deficiencies in Vision-Language Models

链接https://arxiv.org/abs/2601.13238

作者:Chengyin Hu,Xiang Chen,Zhe Jia,Weiwen Shi,Fengyu Zhang,Jiujiang Guo,Yiwei Wei

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:image-text pairs collected, achieve strong performance, canonical visual conditions, trained on image-text, image-text pairs

备注

点击查看摘要

Abstract:Vision-Language Models (VLMs) are trained on image-text pairs collected under canonical visual conditions and achieve strong performance on multimodal tasks. However, their robustness to real-world weather conditions, and the stability of cross-modal semantic alignment under such structured perturbations, remain insufficiently studied. In this paper, we focus on rainy scenarios and introduce the first adversarial framework that exploits realistic weather to attack VLMs, using a two-stage, parameterized perturbation model based on semantic decoupling to analyze rain-induced shifts in decision-making. In Stage 1, we model the global effects of rainfall by applying a low-dimensional global modulation to condition the embedding space and gradually weaken the original semantic decision boundaries. In Stage 2, we introduce structured rain variations by explicitly modeling multi-scale raindrop appearance and rainfall-induced illumination changes, and optimize the resulting non-differentiable weather space to induce stable semantic shifts. Operating in a non-pixel parameter space, our framework generates perturbations that are both physically grounded and interpretable. Experiments across multiple tasks show that even physically plausible, highly constrained weather perturbations can induce substantial semantic misalignment in mainstream VLMs, posing potential safety and reliability risks in real-world deployment. Ablations further confirm that illumination modeling and multi-scale raindrop structures are key drivers of these semantic shifts.

103. 【2601.13234】ConvMambaNet: A Hybrid CNN-Mamba State Space Architecture for Accurate and Real-Time EEG Seizure Detection

链接https://arxiv.org/abs/2601.13234

作者:Md. Nishan Khan,Kazi Shahriar Sanjid,Md. Tanzim Hossain,Asib Mostakim Fony,Istiak Ahmed,M. Monir Uddin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:chronic neurological disorder, neurological disorder marked, severely impact quality, Convolutional Neural Networks, quality of life

备注

点击查看摘要

Abstract:Epilepsy is a chronic neurological disorder marked by recurrent seizures that can severely impact quality of life. Electroencephalography (EEG) remains the primary tool for monitoring neural activity and detecting seizures, yet automated analysis remains challenging due to the temporal complexity of EEG signals. This study introduces ConvMambaNet, a hybrid deep learning model that integrates Convolutional Neural Networks (CNNs) with the Mamba Structured State Space Model (SSM) to enhance temporal feature extraction. By embedding the Mamba-SSM block within a CNN framework, the model effectively captures both spatial and long-range temporal dynamics. Evaluated on the CHB-MIT Scalp EEG dataset, ConvMambaNet achieved a 99% accuracy and demonstrated robust performance under severe class imbalance. These results underscore the model's potential for precise and efficient seizure detection, offering a viable path toward real-time, automated epilepsy monitoring in clinical environments.

104. 【2601.13225】Not all Blends are Equal: The BLEMORE Dataset of Blended Emotion Expressions with Relative Salience Annotations

链接https://arxiv.org/abs/2601.13225

作者:Tim Lachmann,Alexandra Israelsson,Christina Tornberg,Teimuraz Saghinadze,Michal Balazia,Philipp Müller,Petri Laukka

类目:Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

关键词:Humans often experience, relative salience, salience, blended emotion, emotion

备注: Accepted for publication at IEEE Face Gesture 2026

点击查看摘要

Abstract:Humans often experience not just a single basic emotion at a time, but rather a blend of several emotions with varying salience. Despite the importance of such blended emotions, most video-based emotion recognition approaches are designed to recognize single emotions only. The few approaches that have attempted to recognize blended emotions typically cannot assess the relative salience of the emotions within a blend. This limitation largely stems from the lack of datasets containing a substantial number of blended emotion samples annotated with relative salience. To address this shortcoming, we introduce BLEMORE, a novel dataset for multimodal (video, audio) blended emotion recognition that includes information on the relative salience of each emotion within a blend. BLEMORE comprises over 3,000 clips from 58 actors, performing 6 basic emotions and 10 distinct blends, where each blend has 3 different salience configurations (50/50, 70/30, and 30/70). Using this dataset, we conduct extensive evaluations of state-of-the-art video classification approaches on two blended emotion prediction tasks: (1) predicting the presence of emotions in a given sample, and (2) predicting the relative salience of emotions in a blend. Our results show that unimodal classifiers achieve up to 29% presence accuracy and 13% salience accuracy on the validation set, while multimodal methods yield clear improvements, with ImageBind + WavLM reaching 35% presence accuracy and HiCMAE 18% salience accuracy. On the held-out test set, the best models achieve 33% presence accuracy (VideoMAEv2 + HuBERT) and 18% salience accuracy (HiCMAE). In sum, the BLEMORE dataset provides a valuable resource to advancing research on emotion recognition systems that account for the complexity and significance of blended emotion expressions.

105. 【2601.13218】ObjectVisA-120: Object-based Visual Attention Prediction in Interactive Street-crossing Environments

链接https://arxiv.org/abs/2601.13218

作者:Igor Vozniak,Philipp Mueller,Nils Lipp,Janis Sprenger,Konstantin Poddubnyy,Davit Hovhannisyan,Christian Mueller,Andreas Bulling,Philipp Slusallek

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:cognitive science, nature of human, well-known in cognitive, played a minor, minor role

备注: Accepted for publication at the IEEE Intelligent Vehicles Symposium (IV), 2026

点击查看摘要

Abstract:The object-based nature of human visual attention is well-known in cognitive science, but has only played a minor role in computational visual attention models so far. This is mainly due to a lack of suitable datasets and evaluation metrics for object-based attention. To address these limitations, we present \dataset~ -- a novel 120-participant dataset of spatial street-crossing navigation in virtual reality specifically geared to object-based attention evaluations. The uniqueness of the presented dataset lies in the ethical and safety affiliated challenges that make collecting comparable data in real-world environments highly difficult. \dataset~ not only features accurate gaze data and a complete state-space representation of objects in the virtual environment, but it also offers variable scenario complexities and rich annotations, including panoptic segmentation, depth information, and vehicle keypoints. We further propose object-based similarity (oSIM) as a novel metric to evaluate the performance of object-based visual attention models, a previously unexplored performance characteristic. Our evaluations show that explicitly optimising for object-based attention not only improves oSIM performance but also leads to an improved model performance on common metrics. In addition, we present SUMGraph, a Mamba U-Net-based model, which explicitly encodes critical scene objects (vehicles) in a graph representation, leading to further performance improvements over several state-of-the-art visual attention prediction methods. The dataset, code and models will be publicly released.

106. 【2601.13208】Rethinking Skip Connections: Additive U-Net for Robust and Interpretable Denoising

链接https://arxiv.org/abs/2601.13208

作者:Vikram R Lakkavalli

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:allowing uncontrolled noise, doubles channel dimensionality, obscures information flow, standard concatenation doubles, allowing uncontrolled

备注

点击查看摘要

Abstract:Skip connections are central to U-Net architectures for image denoising, but standard concatenation doubles channel dimensionality and obscures information flow, allowing uncontrolled noise transfer. We propose the Additive U-Net, which replaces concatenative skips with gated additive connections. Each skip pathway is scaled by a learnable non-negative scalar, offering explicit and interpretable control over encoder contributions while avoiding channel inflation. Evaluations on the Kodak-17 denoising benchmark show that Additive U-Net achieves competitive PSNR/SSIM at noise levels {\sigma} = 15, 25, 50, with robustness across kernel schedules and depths. Notably, effective denoising is achieved even without explicit down/up-sampling or forced hierarchies, as the model naturally learns a progression from high-frequency to band-pass to low-frequency features. These results position additive skips as a lightweight and interpretable alternative to concatenation, enabling both efficient design and a clearer understanding of multi-scale information transfer in reconstruction networks.

107. 【2601.13207】GTPred: Benchmarking MLLMs for Interpretable Geo-localization and Time-of-capture Prediction

链接https://arxiv.org/abs/2601.13207

作者:Jinnao Li,Zijian Chen,Tingzhu Chen,Changbo Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:observable visual evidence, aims to infer, infer the geographic, captured using observable, Geo-localization aims

备注

点击查看摘要

Abstract:Geo-localization aims to infer the geographic location where an image was captured using observable visual evidence. Traditional methods achieve impressive results through large-scale training on massive image corpora. With the emergence of multi-modal large language models (MLLMs), recent studies have explored their applications in geo-localization, benefiting from improved accuracy and interpretability. However, existing benchmarks largely ignore the temporal information inherent in images, which can further constrain the location. To bridge this gap, we introduce GTPred, a novel benchmark for geo-temporal prediction. GTPred comprises 370 globally distributed images spanning over 120 years. We evaluate MLLM predictions by jointly considering year and hierarchical location sequence matching, and further assess intermediate reasoning chains using meticulously annotated ground-truth reasoning processes. Experiments on 8 proprietary and 7 open-source MLLMs show that, despite strong visual perception, current models remain limited in world knowledge and geo-temporal reasoning. Results also demonstrate that incorporating temporal information significantly enhances location inference performance.

108. 【2601.13166】From 100,000+ images to winning the first brain MRI foundation model challenges: Sharing lessons and models

链接https://arxiv.org/abs/2601.13166

作者:Pedro M. Gordaliza,Jaume Banus,Benoît Gérin,Maxence Wynen,Nataliia Molchanova,Jonas Richiardi,Meritxell Bach Cuadra

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Developing Foundation Models, Developing Foundation, medical image analysis, radiological tasks, medical image

备注: Work presented at the SSL3D Challenge (1st place, ResEnc-L track) and FOMO Challenge (1st place, Methods track) on Brain MRI Foundation Models at MICCAI 2025

点击查看摘要

Abstract:Developing Foundation Models for medical image analysis is essential to overcome the unique challenges of radiological tasks. The first challenges of this kind for 3D brain MRI, SSL3D and FOMO25, were held at MICCAI 2025. Our solution ranked first in tracks of both contests. It relies on a U-Net CNN architecture combined with strategies leveraging anatomical priors and neuroimaging domain knowledge. Notably, our models trained 1-2 orders of magnitude faster and were 10 times smaller than competing transformer-based approaches. Models are available here: this https URL.

109. 【2601.13148】ICo3D: An Interactive Conversational 3D Virtual Human

链接https://arxiv.org/abs/2601.13148

作者:Richard Shaw,Youngkyoon Jang,Athanasios Papaioannou,Arthur Moreau,Helisa Dhamo,Zhensong Zhang,Eduardo Pérez-Pellitero

类目:Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

关键词:work presents Interactive, presents Interactive Conversational, Interactive Conversational, method for generating, generating an interactive

备注: Accepted by International Journal on Computer Vision (IJCV). Project page: [this https URL](https://ico3d.github.io/) . This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this article is published in International Journal of Computer Vision and is available online at [this https URL](https://doi.org/10.1007/s11263-025-02725-8)

点击查看摘要

Abstract:This work presents Interactive Conversational 3D Virtual Human (ICo3D), a method for generating an interactive, conversational, and photorealistic 3D human avatar. Based on multi-view captures of a subject, we create an animatable 3D face model and a dynamic 3D body model, both rendered by splatting Gaussian primitives. Once merged together, they represent a lifelike virtual human avatar suitable for real-time user interactions. We equip our avatar with an LLM for conversational ability. During conversation, the audio speech of the avatar is used as a driving signal to animate the face model, enabling precise synchronization. We describe improvements to our dynamic Gaussian models that enhance photorealism: SWinGS++ for body reconstruction and HeadGaS++ for face reconstruction, and provide as well a solution to merge the separate face and body models without artifacts. We also present a demo of the complete system, showcasing several use cases of real-time conversation with the 3D avatar. Our approach offers a fully integrated virtual avatar experience, supporting both oral and written form interactions in immersive environments. ICo3D is applicable to a wide range of fields, including gaming, virtual assistance, and personalized education, among others. Project page: this https URL

110. 【2601.13142】VWorld: Foundations for Remote-Control TV Agents

链接https://arxiv.org/abs/2601.13142

作者:Zhantao Ma,Quanfeng Lu,Shuai Zhong,Dahai Yu,Ping Luo,Michael K. Ng

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Recent large vision-language, Recent large, large vision-language models, demonstrated strong potential, device control

备注

点击查看摘要

Abstract:Recent large vision-language models (LVLMs) have demonstrated strong potential for device control. However, existing research has primarily focused on point-and-click (PnC) interaction, while remote-control (RC) interaction commonly encountered in everyday TV usage remains largely underexplored. To fill this gap, we introduce \textbf{TVWorld}, an offline graph-based abstraction of real-world TV navigation that enables reproducible and deployment-free evaluation. On this basis, we derive two complementary benchmarks that comprehensively assess TV-use capabilities: \textbf{TVWorld-N} for topology-aware navigation and \textbf{TVWorld-G} for focus-aware grounding. These benchmarks expose a key limitation of existing agents: insufficient topology awareness for focus-based, long-horizon TV navigation. Motivated by this finding, we propose a \emph{Topology-Aware Training} framework that injects topology awareness into LVLMs. Using this framework, we develop \textbf{TVTheseus}, a foundation model specialized for TV navigation. TVTheseus achieves a success rate of $68.3\%$ on TVWorld-N, surpassing strong closed-source baselines such as Gemini 3 Flash and establishing state-of-the-art (SOTA) performance. Additional analyses further provide valuable insights into the development of effective TV-use agents.

111. 【2601.13134】Earth Embeddings as Products: Taxonomy, Ecosystem, and Standardized Access

链接https://arxiv.org/abs/2601.13134

作者:Heng Fang,Adam J. Stewart,Isaac Corley,Xiao Xiang Zhu,Hossein Azizpour

类目:oftware Engineering (cs.SE); Computer Vision and Pattern Recognition (cs.CV)

关键词:provide powerful representations, high compute costs, compute costs hinder, Geospatial Foundation Models, Foundation Models

备注

点击查看摘要

Abstract:Geospatial Foundation Models (GFMs) provide powerful representations, but high compute costs hinder their widespread use. Pre-computed embedding data products offer a practical "frozen" alternative, yet they currently exist in a fragmented ecosystem of incompatible formats and resolutions. This lack of standardization creates an engineering bottleneck that prevents meaningful model comparison and reproducibility. We formalize this landscape through a three-layer taxonomy: Data, Tools, and Value. We survey existing products to identify interoperability barriers. To bridge this gap, we extend TorchGeo with a unified API that standardizes the loading and querying of diverse embedding products. By treating embeddings as first-class geospatial datasets, we decouple downstream analysis from model-specific engineering, providing a roadmap for more transparent and accessible Earth observation workflows.

112. 【2601.13133】CLIP-Guided Adaptable Self-Supervised Learning for Human-Centric Visual Tasks

链接https://arxiv.org/abs/2601.13133

作者:Mingshuang Luo,Ruibing Hou,Bo Chao,Hong Chang,Zimo Liu,Yaowei Wang,Shiguang Shan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:including surveillance, human-computer interaction, plays a pivotal, pivotal role, diverse applications

备注: Accepted by TMM (IEEE Transactions on Multimedia), 16 pages, 7 figures

点击查看摘要

Abstract:Human-centric visual analysis plays a pivotal role in diverse applications, including surveillance, healthcare, and human-computer interaction. With the emergence of large-scale unlabeled human image datasets, there is an increasing need for a general unsupervised pre-training model capable of supporting diverse human-centric downstream tasks. To achieve this goal, we propose CLASP (CLIP-guided Adaptable Self-suPervised learning), a novel framework designed for unsupervised pre-training in human-centric visual tasks. CLASP leverages the powerful vision-language model CLIP to generate both low-level (e.g., body parts) and high-level (e.g., attributes) semantic pseudo-labels. These multi-level semantic cues are then integrated into the learned visual representations, enriching their expressiveness and generalizability. Recognizing that different downstream tasks demand varying levels of semantic granularity, CLASP incorporates a Prompt-Controlled Mixture-of-Experts (MoE) module. MoE dynamically adapts feature extraction based on task-specific prompts, mitigating potential feature conflicts and enhancing transferability. Furthermore, CLASP employs a multi-task pre-training strategy, where part- and attribute-level pseudo-labels derived from CLIP guide the representation learning process. Extensive experiments across multiple benchmarks demonstrate that CLASP consistently outperforms existing unsupervised pre-training methods, advancing the field of human-centric visual analysis.

113. 【2601.13132】GaussExplorer: 3D Gaussian Splatting for Embodied Exploration and Reasoning

链接https://arxiv.org/abs/2601.13132

作者:Kim Yu-Ji,Dahye Lee,Kim Jun-Seong,GeonU Kim,Nam Hyeon-Woo,Yongjin Kwon,Yu-Chiang Frank Wang,Jaesung Choe,Tae-Hyun Oh

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Gaussian Splatting, Splatting, Gaussian embeddings, Gaussian, present GaussExplorer

备注: Project page: [this https URL](https://gaussexplorer.github.io/)

点击查看摘要

Abstract:We present GaussExplorer, a framework for embodied exploration and reasoning built on 3D Gaussian Splatting (3DGS). While prior approaches to language-embedded 3DGS have made meaningful progress in aligning simple text queries with Gaussian embeddings, they are generally optimized for relatively simple queries and struggle to interpret more complex, compositional language queries. Alternative studies based on object-centric RGB-D structured memories provide spatial grounding but are constrained by pre-fixed viewpoints. To address these issues, GaussExplorer introduces Vision-Language Models (VLMs) on top of 3DGS to enable question-driven exploration and reasoning within 3D scenes. We first identify pre-captured images that are most correlated with the query question, and subsequently adjust them into novel viewpoints to more accurately capture visual information for better reasoning by VLMs. Experiments show that ours outperforms existing methods on several benchmarks, demonstrating the effectiveness of integrating VLM-based reasoning with 3DGS for embodied tasks.

114. 【2601.13128】PhaseMark: A Post-hoc, Optimization-Free Watermarking of AI-generated Images in the Latent Frequency Domain

链接https://arxiv.org/abs/2601.13128

作者:Sung Ju Lee,Nam Ik Cho

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Latent Diffusion Models, Diffusion Models, existing post-hoc methods, prohibitively slow due, demands robust watermarking

备注: Accepted to the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2026

点击查看摘要

Abstract:The proliferation of hyper-realistic images from Latent Diffusion Models (LDMs) demands robust watermarking, yet existing post-hoc methods are prohibitively slow due to iterative optimization or inversion processes. We introduce PhaseMark, a single-shot, optimization-free framework that directly modulates the phase in the VAE latent frequency domain. This approach makes PhaseMark thousands of times faster than optimization-based techniques while achieving state-of-the-art resilience against severe attacks, including regeneration, without degrading image quality. We analyze four modulation variants, revealing a clear performance-quality trade-off. PhaseMark demonstrates a new paradigm where efficient, resilient watermarking is achieved by exploiting intrinsic latent properties.

115. 【2601.13126】A Streamlined Attention-Based Network for Descriptor Extraction

链接https://arxiv.org/abs/2601.13126

作者:Mattia D'Urso,Emanuele Santellani,Christian Sormann,Mattia Rossi,Andreas Kuhn,Friedrich Fraundorfer

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Streamlined Attention-Based Network, Streamlined Attention-Based, Convolutional Block Attention, Block Attention Modules, Attention-Based Network

备注: Accepted to 3DV 2026

点击查看摘要

Abstract:We introduce SANDesc, a Streamlined Attention-Based Network for Descriptor extraction that aims to improve on existing architectures for keypoint description. Our descriptor network learns to compute descriptors that improve matching without modifying the underlying keypoint detector. We employ a revised U-Net-like architecture enhanced with Convolutional Block Attention Modules and residual paths, enabling effective local representation while maintaining computational efficiency. We refer to the building blocks of our model as Residual U-Net Blocks with Attention. The model is trained using a modified triplet loss in combination with a curriculum learning-inspired hard negative mining strategy, which improves training stability. Extensive experiments on HPatches, MegaDepth-1500, and the Image Matching Challenge 2021 show that training SANDesc on top of existing keypoint detectors leads to improved results on multiple matching tasks compared to the original keypoint descriptors. At the same time, SANDesc has a model complexity of just 2.4 million parameters. As a further contribution, we introduce a new urban dataset featuring 4K images and pre-calibrated intrinsics, designed to evaluate feature extractors. On this benchmark, SANDesc achieves substantial performance gains over the existing descriptors while operating with limited computational resources.

Comments:
Accepted to 3DV 2026

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2601.13126 [cs.CV]

(or
arXiv:2601.13126v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2601.13126

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Mattia D’Urso [view email] [v1]
Mon, 19 Jan 2026 15:12:42 UTC (21,933 KB)

116. 【2601.13096】LLM-VLM Fusion Framework for Autonomous Maritime Port Inspection using a Heterogeneous UAV-USV System

链接https://arxiv.org/abs/2601.13096

作者:Muhayy Ud Din,Waseem Akram,Ahsan B. Bakht,Irfan Hussain

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:complex maritime environments, Large Language Models, Vision Language Models, port inspection plays, Language Models

备注: submitted in AEJ

点击查看摘要

Abstract:Maritime port inspection plays a critical role in ensuring safety, regulatory compliance, and operational efficiency in complex maritime environments. However, existing inspection methods often rely on manual operations and conventional computer vision techniques that lack scalability and contextual understanding. This study introduces a novel integrated engineering framework that utilizes the synergy between Large Language Models (LLMs) and Vision Language Models (VLMs) to enable autonomous maritime port inspection using cooperative aerial and surface robotic platforms. The proposed framework replaces traditional state-machine mission planners with LLM-driven symbolic planning and improved perception pipelines through VLM-based semantic inspection, enabling context-aware and adaptive monitoring. The LLM module translates natural language mission instructions into executable symbolic plans with dependency graphs that encode operational constraints and ensure safe UAV-USV coordination. Meanwhile, the VLM module performs real-time semantic inspection and compliance assessment, generating structured reports with contextual reasoning. The framework was validated using the extended MBZIRC Maritime Simulator with realistic port infrastructure and further assessed through real-world robotic inspection trials. The lightweight on-board design ensures suitability for resource-constrained maritime platforms, advancing the development of intelligent, autonomous inspection systems. Project resources (code and videos) can be found here: this https URL

117. 【2601.13094】Patient-Conditioned Adaptive Offsets for Reliable Diagnosis across Subgroups

链接https://arxiv.org/abs/2601.13094

作者:Gelei Xu,Yuying Duan,Jun Xia,Ruining Deng,Wei Jin,Yiyu Shi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:clinical risk profiles, exhibit uneven performance, disease prevalence, risk profiles, diagnosis often exhibit

备注

点击查看摘要

Abstract:AI models for medical diagnosis often exhibit uneven performance across patient populations due to heterogeneity in disease prevalence, imaging appearance, and clinical risk profiles. Existing algorithmic fairness approaches typically seek to reduce such disparities by suppressing sensitive attributes. However, in medical settings these attributes often carry essential diagnostic information, and removing them can degrade accuracy and reliability, particularly in high-stakes applications. In contrast, clinical decision making explicitly incorporates patient context when interpreting diagnostic evidence, suggesting a different design direction for subgroup-aware models. In this paper, we introduce HyperAdapt, a patient-conditioned adaptation framework that improves subgroup reliability while maintaining a shared diagnostic model. Clinically relevant attributes such as age and sex are encoded into a compact embedding and used to condition a hypernetwork-style module, which generates small residual modulation parameters for selected layers of a shared backbone. This design preserves the general medical knowledge learned by the backbone while enabling targeted adjustments that reflect patient-specific variability. To ensure efficiency and robustness, adaptations are constrained through low-rank and bottlenecked parameterizations, limiting both model complexity and computational overhead. Experiments across multiple public medical imaging benchmarks demonstrate that the proposed approach consistently improves subgroup-level performance without sacrificing overall accuracy. On the PAD-UFES-20 dataset, our method outperforms the strongest competing baseline by 4.1% in recall and 4.4% in F1 score, with larger gains observed for underrepresented patient populations.

118. 【2601.13059】Prototype Learning-Based Few-Shot Segmentation for Low-Light Crack on Concrete Structures

链接https://arxiv.org/abs/2601.13059

作者:Yulun Guo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:concrete infrastructure safety, degrading computer vision, vision segmentation accuracy, computer vision segmentation, infrastructure safety

备注

点击查看摘要

Abstract:Crack detection is critical for concrete infrastructure safety, but real-world cracks often appear in low-light environments like tunnels and bridge undersides, degrading computer vision segmentation accuracy. Pixel-level annotation of low-light crack images is extremely time-consuming, yet most deep learning methods require large, well-illuminated datasets. We propose a dual-branch prototype learning network integrating Retinex theory with few-shot learning for low-light crack segmentation. Retinex-based reflectance components guide illumination-invariant global representation learning, while metric learning reduces dependence on large annotated datasets. We introduce a cross-similarity prior mask generation module that computes high-dimensional similarities between query and support features to capture crack location and structure, and a multi-scale feature enhancement module that fuses multi-scale features with the prior mask to alleviate spatial inconsistency. Extensive experiments on multiple benchmarks demonstrate consistent state-of-the-art performance under low-light conditions. Code: this https URL.

119. 【2601.13052】GridNet-HD: A High-Resolution Multi-Modal Dataset for LiDAR-Image Fusion on Power Line Infrastructure

链接https://arxiv.org/abs/2601.13052

作者:Antoine Carreaud,Shanci Li,Malo De Lacour,Digre Frinde,Jan Skaloud,Adrien Gressin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:overhead electrical infrastructures, paper presents GridNet-HD, electrical infrastructures, paper presents, segmentation of overhead

备注

点击查看摘要

Abstract:This paper presents GridNet-HD, a multi-modal dataset for 3D semantic segmentation of overhead electrical infrastructures, pairing high-density LiDAR with high-resolution oblique imagery. The dataset comprises 7,694 images and 2.5 billion points annotated into 11 classes, with predefined splits and mIoU metrics. Unimodal (LiDAR-only, image-only) and multi-modal fusion baselines are provided. On GridNet-HD, fusion models outperform the best unimodal baseline by +5.55 mIoU, highlighting the complementarity of geometry and appearance. As reviewed in Sec. 2, no public dataset jointly provides high-density LiDAR and high-resolution oblique imagery with 3D semantic labels for power-line assets. Dataset, baselines, and codes are available: this https URL.

120. 【2601.13029】hink3D: Thinking with Space for Spatial Reasoning

链接https://arxiv.org/abs/2601.13029

作者:Zaibin Zhang,Yuhan Wu,Lianjie Jia,Yifan Wang,Zhongbo Zhang,Yijiang Li,Binghao Ran,Fuxi Zhang,Zhuohan Sun,Zhenfei Yin,Lijun Wang,Huchuan Lu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:physical world requires, world requires spatial, interpret geometry, physical world, world requires

备注

点击查看摘要

Abstract:Understanding and reasoning about the physical world requires spatial intelligence: the ability to interpret geometry, perspective, and spatial relations beyond 2D perception. While recent vision large models (VLMs) excel at visual understanding, they remain fundamentally 2D perceivers and struggle with genuine 3D reasoning. We introduce Think3D, a framework that enables VLM agents to think with 3D space. By leveraging 3D reconstruction models that recover point clouds and camera poses from images or videos, Think3D allows the agent to actively manipulate space through camera-based operations and ego/global-view switching, transforming spatial reasoning into an interactive 3D chain-of-thought process. Without additional training, Think3D significantly improves the spatial reasoning performance of advanced models such as GPT-4.1 and Gemini 2.5 Pro, yielding average gains of +7.8% on BLINK Multi-view and MindCube, and +4.7% on VSI-Bench. We further show that smaller models, which struggle with spatial exploration, benefit significantly from a reinforcement learning policy that enables the model to select informative viewpoints and operations. With RL, the benefit from tool usage increases from +0.7% to +6.8%. Our findings demonstrate that training-free, tool-augmented spatial exploration is a viable path toward more flexible and human-like 3D reasoning in multimodal agents, establishing a new dimension of multimodal intelligence. Code and weights are released at this https URL.

121. 【2601.12994】AsyncBEV: Cross-modal Flow Alignment in Asynchronous 3D Object Detection

链接https://arxiv.org/abs/2601.12994

作者:Shiming Wang,Holger Caesar,Liangliang Nan,Julian F. P. Kooij

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:multi-modal perception tasks, detection typically rely, object detection typically, autonomous driving, training and inference

备注

点击查看摘要

Abstract:In autonomous driving, multi-modal perception tasks like 3D object detection typically rely on well-synchronized sensors, both at training and inference. However, despite the use of hardware- or software-based synchronization algorithms, perfect synchrony is rarely guaranteed: Sensors may operate at different frequencies, and real-world factors such as network latency, hardware failures, or processing bottlenecks often introduce time offsets between sensors. Such asynchrony degrades perception performance, especially for dynamic objects. To address this challenge, we propose AsyncBEV, a trainable lightweight and generic module to improve the robustness of 3D Birds' Eye View (BEV) object detection models against sensor asynchrony. Inspired by scene flow estimation, AsyncBEV first estimates the 2D flow from the BEV features of two different sensor modalities, taking into account the known time offset between these sensor measurements. The predicted feature flow is then used to warp and spatially align the feature maps, which we show can easily be integrated into different current BEV detector architectures (e.g., BEV grid-based and token-based). Extensive experiments demonstrate AsyncBEV improves robustness against both small and large asynchrony between LiDAR or camera sensors in both the token-based CMT and grid-based UniBEV, especially for dynamic objects. We significantly outperform the ego motion compensated CMT and UniBEV baselines, notably by $16.6$ % and $11.9$ % NDS on dynamic objects in the worst-case scenario of a $0.5 s$ time offset. Code will be released upon acceptance.

122. 【2601.12981】Early Prediction of Type 2 Diabetes Using Multimodal data and Tabular Transformers

链接https://arxiv.org/abs/2601.12981

作者:Sulaiman Khan,Md. Rafiul Biswas,Zubair Shah

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:early Type, analyze longitudinal patient, Diabetes Mellitus, architecture to analyze, approach for early

备注: 08 pages, 06 figures, accepted for publication in FLLM2025

点击查看摘要

Abstract:This study introduces a novel approach for early Type 2 Diabetes Mellitus (T2DM) risk prediction using a tabular transformer (TabTrans) architecture to analyze longitudinal patient data. By processing patients` longitudinal health records and bone-related tabular data, our model captures complex, long-range dependencies in disease progression that conventional methods often overlook. We validated our TabTrans model on a retrospective Qatar BioBank (QBB) cohort of 1,382 subjects, comprising 725 men (146 diabetic, 579 healthy) and 657 women (133 diabetic, 524 healthy). The study integrated electronic health records (EHR) with dual-energy X-ray absorptiometry (DXA) data. To address class imbalance, we employed SMOTE and SMOTE-ENN resampling techniques. The proposed model`s performance is evaluated against conventional machine learning (ML) and generative AI models, including Claude 3.5 Sonnet (Anthropic`s constitutional AI), GPT-4 (OpenAI`s generative pre-trained transformer), and Gemini Pro (Google`s multimodal language model). Our TabTrans model demonstrated superior predictive performance, achieving ROC AUC $\geq$ 79.7 % for T2DM prediction compared to both generative AI models and conventional ML approaches. Feature interpretation analysis identified key risk indicators, with visceral adipose tissue (VAT) mass and volume, ward bone mineral density (BMD) and bone mineral content (BMC), T and Z-scores, and L1-L4 scores emerging as the most important predictors associated with diabetes development in Qatari adults. These findings demonstrate the significant potential of TabTrans for analyzing complex tabular healthcare data, providing a powerful tool for proactive T2DM management and personalized clinical interventions in the Qatari population. Index Terms: tabular transformers, multimodal data, DXA data, diabetes, T2DM, feature interpretation, tabular data

Comments:
08 pages, 06 figures, accepted for publication in FLLM2025

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Cite as:
arXiv:2601.12981 [cs.CV]

(or
arXiv:2601.12981v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2601.12981

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
123. 【2601.12964】Cross-Scale Pretraining: Enhancing Self-Supervised Learning for Low-Resolution Satellite Imagery for Semantic Segmentation

链接https://arxiv.org/abs/2601.12964

作者:John Waithaka,Gustave Bwirayesu,Moise Busogi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:mid-spatial resolution, high availability, image datasets due, remote sensing, self-supervised learning frameworks

备注

点击查看摘要

Abstract:Self-supervised pretraining in remote sensing is mostly done using mid-spatial resolution (MR) image datasets due to their high availability. Given the release of high-resolution (HR) datasets, we ask how HR datasets can be included in self-supervised pretraining to enhance MR image representation learning and downstream segmentation performance on MR tasks. We design a spatial affinity component that can be added to existing self-supervised learning frameworks and that uses HR imagery to learn better representations of MR imagery. We test the spatial affinity component on two self-supervised learning frameworks and show that it outperforms models pretrained on HR or MR images alone.

124. 【2601.12954】StyMam: A Mamba-Based Generator for Artistic Style Transfer

链接https://arxiv.org/abs/2601.12954

作者:Zhou Hong,Rongsheng Hu,Yicheng Di,Xiaolong Xu,Ning Dong,Yihua Shao,Run Ling,Yun Wang,Juqin Wang,Zhanjie Zhang,Ao Ma

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:style transfer aims, specific artistic style, Image style transfer, style transfer, artistic style

备注: Accepted by ICASSP 2026

点击查看摘要

Abstract:Image style transfer aims to integrate the visual patterns of a specific artistic style into a content image while preserving its content structure. Existing methods mainly rely on the generative adversarial network (GAN) or stable diffusion (SD). GAN-based approaches using CNNs or Transformers struggle to jointly capture local and global dependencies, leading to artifacts and disharmonious patterns. SD-based methods reduce such issues but often fail to preserve content structures and suffer from slow inference. To address these issues, we revisit GAN and propose a mamba-based generator, termed as StyMam, to produce high-quality stylized images without introducing artifacts and disharmonious patterns. Specifically, we introduce a mamba-based generator with a residual dual-path strip scanning mechanism and a channel-reweighted spatial attention module. The former efficiently captures local texture features, while the latter models global dependencies. Finally, extensive qualitative and quantitative experiments demonstrate that the proposed method outperforms state-of-the-art algorithms in both quality and speed.

125. 【2601.12948】GazeD: Context-Aware Diffusion for Accurate 3D Gaze Estimation

链接https://arxiv.org/abs/2601.12948

作者:Riccardo Catalini,Davide Di Nucci,Guido Borghi,Davide Davoli,Lorenzo Garattoni,Giampiero Francesca,Yuki Kawana,Roberto Vezzani

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:single RGB image, single RGB, RGB image, human pose, gaze

备注

点击查看摘要

Abstract:We introduce GazeD, a new 3D gaze estimation method that jointly provides 3D gaze and human pose from a single RGB image. Leveraging the ability of diffusion models to deal with uncertainty, it generates multiple plausible 3D gaze and pose hypotheses based on the 2D context information extracted from the input image. Specifically, we condition the denoising process on the 2D pose, the surroundings of the subject, and the context of the scene. With GazeD we also introduce a novel way of representing the 3D gaze by positioning it as an additional body joint at a fixed distance from the eyes. The rationale is that the gaze is usually closely related to the pose, and thus it can benefit from being jointly denoised during the diffusion process. Evaluations across three benchmark datasets demonstrate that GazeD achieves state-of-the-art performance in 3D gaze estimation, even surpassing methods that rely on temporal information. Project details will be available at this https URL.

126. 【2601.12946】AI-generated data contamination erodes pathological variability and diagnostic reliability

链接https://arxiv.org/abs/2601.12946

作者:Hongyu He,Shaowen Xiang,Ye Zhang,Yingtao Zhu,Jin Zhang,Hao Deng,Emily Alsentzer,Qingyu Chen,Kun-Hsing Yu,Andrew Marmenshall,Tingting Chen,Srinivas Anumasa,Daniel Ebner,Dean Ho,Kee Yuan Ngiam,Ching-Yu Cheng,Dianbo Liu

类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Generative artificial intelligence, rapidly populating medical, populating medical records, artificial intelligence, creating a feedback

备注: *Corresponding author: Dianbo Liu (dianbo@nus. [this http URL](http://edu.sg) )

点击查看摘要

Abstract:Generative artificial intelligence (AI) is rapidly populating medical records with synthetic content, creating a feedback loop where future models are increasingly at risk of training on uncurated AI-generated data. However, the clinical consequences of this AI-generated data contamination remain unexplored. Here, we show that in the absence of mandatory human verification, this self-referential cycle drives a rapid erosion of pathological variability and diagnostic reliability. By analysing more than 800,000 synthetic data points across clinical text generation, vision-language reporting, and medical image synthesis, we find that models progressively converge toward generic phenotypes regardless of the model architecture. Specifically, rare but critical findings, including pneumothorax and effusions, vanish from the synthetic content generated by AI models, while demographic representations skew heavily toward middle-aged male phenotypes. Crucially, this degradation is masked by false diagnostic confidence; models continue to issue reassuring reports while failing to detect life-threatening pathology, with false reassurance rates tripling to 40%. Blinded physician evaluation confirms that this decoupling of confidence and accuracy renders AI-generated documentation clinically useless after just two generations. We systematically evaluate three mitigation strategies, finding that while synthetic volume scaling fails to prevent collapse, mixing real data with quality-aware filtering effectively preserves diversity. Ultimately, our results suggest that without policy-mandated human oversight, the deployment of generative AI threatens to degrade the very healthcare data ecosystems it relies upon.

127. 【2601.12936】QASA: Quality-Guided K-Adaptive Slot Attention for Unsupervised Object-Centric Learning

链接https://arxiv.org/abs/2601.12936

作者:Tianran Ouyang,Xingping Dong,Jing Zhang,Mang Ye,Jun Chen,Bo Du

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:unsupervised object-centric learning, object-centric learning, approach that binds, Slot Attention, Slot

备注

点击查看摘要

Abstract:Slot Attention, an approach that binds different objects in a scene to a set of "slots", has become a leading method in unsupervised object-centric learning. Most methods assume a fixed slot count K, and to better accommodate the dynamic nature of object cardinality, a few works have explored K-adaptive variants. However, existing K-adaptive methods still suffer from two limitations. First, they do not explicitly constrain slot-binding quality, so low-quality slots lead to ambiguous feature attribution. Second, adding a slot-count penalty to the reconstruction objective creates conflicting optimization goals between reducing the number of active slots and maintaining reconstruction fidelity. As a result, they still lag significantly behind strong K-fixed baselines. To address these challenges, we propose Quality-Guided K-Adaptive Slot Attention (QASA). First, we decouple slot selection from reconstruction, eliminating the mutual constraints between the two objectives. Then, we propose an unsupervised Slot-Quality metric to assess per-slot quality, providing a principled signal for fine-grained slot--object binding. Based on this metric, we design a Quality-Guided Slot Selection scheme that dynamically selects a subset of high-quality slots and feeds them into our newly designed gated decoder for reconstruction during training. At inference, token-wise competition on slot attention yields a K-adaptive outcome. Experiments show that QASA substantially outperforms existing K-adaptive methods on both real and synthetic datasets. Moreover, on real-world datasets QASA surpasses K-fixed methods.

128. 【2601.12929】Membership Inference Test: Auditing Training Data in Object Classification Models

链接https://arxiv.org/abs/2601.12929

作者:Gonzalo Mancera,Daniel DeAlcala,Aythami Morales,Ruben Tolosana,Julian Fierrez

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Membership Inference Tests, Inference Tests, Membership Inference, object recognition, object recognition domain

备注: Deployable AI (DAI 2025) workshop co-located with AAAI-25

点击查看摘要

Abstract:In this research, we analyze the performance of Membership Inference Tests (MINT), focusing on determining whether given data were utilized during the training phase, specifically in the domain of object recognition. Within the area of object recognition, we propose and develop architectures tailored for MINT models. These architectures aim to optimize performance and efficiency in data utilization, offering a tailored solution to tackle the complexities inherent in the object recognition domain. We conducted experiments involving an object detection model, an embedding extractor, and a MINT module. These experiments were performed in three public databases, totaling over 174K images. The proposed architecture leverages convolutional layers to capture and model the activation patterns present in the data during the training process. Through our analysis, we are able to identify given data used for testing and training, achieving precision rates ranging between 70% and 80%, contingent upon the depth of the detection module layer chosen for input to the MINT module. Additionally, our studies entail an analysis of the factors influencing the MINT Module, delving into the contributing elements behind more transparent training processes.

129. 【2601.12926】Dual-Stream Collaborative Transformer for Image Captioning

链接https://arxiv.org/abs/2601.12926

作者:Jun Wan,Jun Liu,Zhihui lai,Jie Zhou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Current region feature-based, achieved remarkable performance, Current region, remarkable performance, methods have progressed

备注

点击查看摘要

Abstract:Current region feature-based image captioning methods have progressed rapidly and achieved remarkable performance. However, they are still prone to generating irrelevant descriptions due to the lack of contextual information and the over-reliance on generated partial descriptions for predicting the remaining words. In this paper, we propose a Dual-Stream Collaborative Transformer (DSCT) to address this issue by introducing the segmentation feature. The proposed DSCT consolidates and then fuses the region and segmentation features to guide the generation of caption sentences. It contains multiple Pattern-Specific Mutual Attention Encoders (PSMAEs) and Dynamic Nomination Decoders (DNDs). The PSMAE effectively highlights and consolidates the private information of two representations by querying each other. The DND dynamically searches for the most relevant learning blocks to the input textual representations and exploits the homogeneous features between the consolidated region and segmentation features to generate more accurate and descriptive caption sentences. To the best of our knowledge, this is the first study to explore how to fuse different pattern-specific features in a dynamic way to bypass their semantic inconsistencies and spatial misalignment issues for image captioning. The experimental results from popular benchmark datasets demonstrate that our DSCT outperforms the state-of-the-art image captioning models in the literature.

130. 【2601.12919】Supervision-by-Hallucination-and-Transfer: A Weakly-Supervised Approach for Robust and Precise Facial Landmark Detection

链接https://arxiv.org/abs/2601.12919

作者:Jun Wan,Yuanzhi Yao,Zhihui Lai,Jie Zhou,Xianxu Hou,Wenwen Min

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:High-precision facial landmark, High-precision facial, high-resolution deep feature, deep feature representations, FLD

备注

点击查看摘要

Abstract:High-precision facial landmark detection (FLD) relies on high-resolution deep feature representations. However, low-resolution face images or the compression (via pooling or strided convolution) of originally high-resolution images hinder the learning of such features, thereby reducing FLD accuracy. Moreover, insufficient training data and imprecise annotations further degrade performance. To address these challenges, we propose a weakly-supervised framework called Supervision-by-Hallucination-and-Transfer (SHT) for more robust and precise FLD. SHT contains two novel mutually enhanced modules: Dual Hallucination Learning Network (DHLN) and Facial Pose Transfer Network (FPTN). By incorporating FLD and face hallucination tasks, DHLN is able to learn high-resolution representations with low-resolution inputs for recovering both facial structures and local details and generating more effective landmark heatmaps. Then, by transforming faces from one pose to another, FPTN can further improve landmark heatmaps and faces hallucinated by DHLN for detecting more accurate landmarks. To the best of our knowledge, this is the first study to explore weakly-supervised FLD by integrating face hallucination and facial pose transfer tasks. Experimental results of both face hallucination and FLD demonstrate that our method surpasses state-of-the-art techniques.

131. 【2601.12895】woHead-SwinFPN: A Unified DL Architecture for Synthetic Manipulation, Detection and Localization in Identity Documents

链接https://arxiv.org/abs/2601.12895

作者:Chan Naseeb,Adeel Ashraf Cheema,Hassan Sami,Tayyab Afzal,Muhammad Omair,Usman Habib

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:text inpainting attacks, inpainting attacks, Block Attention Module, Convolutional Block Attention, proliferation of sophisticated

备注: 8 pages

点击查看摘要

Abstract:The proliferation of sophisticated generative AI models has significantly escalated the threat of synthetic manipulations in identity documents, particularly through face swapping and text inpainting attacks. This paper presents TwoHead-SwinFPN, a unified deep learning architecture that simultaneously performs binary classification and precise localization of manipulated regions in ID documents. Our approach integrates a Swin Transformer backbone with Feature Pyramid Network (FPN) and UNet-style decoder, enhanced with Convolutional Block Attention Module (CBAM) for improved feature representation. The model employs a dual-head architecture for joint optimization of detection and segmentation tasks, utilizing uncertainty-weighted multi-task learning. Extensive experiments on the FantasyIDiap dataset demonstrate superior performance with 84.31\% accuracy, 90.78\% AUC for classification, and 57.24\% mean Dice score for localization. The proposed method achieves an F1-score of 88.61\% for binary classification while maintaining computational efficiency suitable for real-world deployment through FastAPI implementation. Our comprehensive evaluation includes ablation studies, cross-device generalization analysis, and detailed performance assessment across 10 languages and 3 acquisition devices.

132. 【2601.12894】Sparse ActionGen: Accelerating Diffusion Policy with Real-time Pruning

链接https://arxiv.org/abs/2601.12894

作者:Kangye Ji,Yuan Meng,Zhou Jianbo,Ye Li,Hanyun Cui,Zhi Wang

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:multi-step denoising processes, denoising processes make, Policy has dominated, multi-modal action distributions, real-time visuomotor control

备注

点击查看摘要

Abstract:Diffusion Policy has dominated action generation due to its strong capabilities for modeling multi-modal action distributions, but its multi-step denoising processes make it impractical for real-time visuomotor control. Existing caching-based acceleration methods typically rely on $\textit{static}$ schedules that fail to adapt to the $\textit{dynamics}$ of robot-environment interactions, thereby leading to suboptimal performance. In this paper, we propose $\underline{\textbf{S}}$parse $\underline{\textbf{A}}$ction$\underline{\textbf{G}}$en ($\textbf{SAG}$) for extremely sparse action generation. To accommodate the iterative interactions, SAG customizes a rollout-adaptive prune-then-reuse mechanism that first identifies prunable computations globally and then reuses cached activations to substitute them during action diffusion. To capture the rollout dynamics, SAG parameterizes an observation-conditioned diffusion pruner for environment-aware adaptation and instantiates it with a highly parameter- and inference-efficient design for real-time prediction. Furthermore, SAG introduces a one-for-all reusing strategy that reuses activations across both timesteps and blocks in a zig-zag manner, minimizing the global redundancy. Extensive experiments on multiple robotic benchmarks demonstrate that SAG achieves up to 4$\times$ generation speedup without sacrificing performance. Project Page: this https URL.

133. 【2601.12889】Simultaneous Detection of LSD and FMD in Cattle Using Ensemble Deep Learning

链接https://arxiv.org/abs/2601.12889

作者:Nazibul Basar Ayon,Abdul Hasib,Md. Faishal Ahmed,Md. Sadiqur Rahman,Kamrul Islam,T. M. Mehrab Hasan,A. S. M. Ahsanul Sarkar Akib

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Lumpy Skin Disease, highly contagious viral, Lumpy Skin, causing significant economic, significant economic losses

备注

点击查看摘要

Abstract:Lumpy Skin Disease (LSD) and Foot-and-Mouth Disease (FMD) are highly contagious viral diseases affecting cattle, causing significant economic losses and welfare challenges. Their visual diagnosis is complicated by significant symptom overlap with each other and with benign conditions like insect bites or chemical burns, hindering timely control measures. Leveraging a comprehensive dataset of 10,516 expert-annotated images from 18 farms across India, Brazil, and the USA, this study presents a novel Ensemble Deep Learning framework integrating VGG16, ResNet50, and InceptionV3 with optimized weighted averaging for simultaneous LSD and FMD detection. The model achieves a state-of-the-art accuracy of 98.2\%, with macro-averaged precision of 98.2\%, recall of 98.1\%, F1-score of 98.1\%, and an AUC-ROC of 99.5\%. This approach uniquely addresses the critical challenge of symptom overlap in multi-disease detection, enabling early, precise, and automated diagnosis. This tool has the potential to enhance disease management, support global agricultural sustainability, and is designed for future deployment in resource-limited settings.

134. 【2601.12882】YOLO26: An Analysis of NMS-Free End to End Framework for Real-Time Object Detection

链接https://arxiv.org/abs/2601.12882

作者:Sudip Chakrabarty

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:real-time object detection, Non-Maximum Suppression, framework has long, traditional iterations, remain constrained

备注

点击查看摘要

Abstract:The "You Only Look Once" (YOLO) framework has long served as the benchmark for real-time object detection, yet traditional iterations (YOLOv1 through YOLO11) remain constrained by the latency and hyperparameter sensitivity of Non-Maximum Suppression (NMS) post-processing. This paper analyzes a comprehensive analysis of YOLO26, an architecture that fundamentally redefines this paradigm by eliminating NMS in favor of a native end-to-end learning strategy. This study examines the critical innovations that enable this transition, specifically the introduction of the MuSGD optimizer for stabilizing lightweight backbones, STAL for small-target-aware assignment, and ProgLoss for dynamic supervision. Through a systematic review of official performance benchmarks, the results demonstrate that YOLO26 establishes a new Pareto front, outperforming a comprehensive suite of predecessors and state-of-the-art competitors (including RTMDet and DAMO-YOLO) in both inference speed and detection accuracy. The analysis confirms that by decoupling representation learning from heuristic post-processing, YOLOv26 successfully resolves the historical trade-off between latency and precision, signaling the next evolutionary step in edge-based computer vision.

135. 【2601.12876】Exploring Talking Head Models With Adjacent Frame Prior for Speech-Preserving Facial Expression Manipulation

链接https://arxiv.org/abs/2601.12876

作者:Zhenxuan Lu,Zhihua Xu,Zhijing Yang,Feng Gao,Yongyi Lu,Keze Wang,Tianshui Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Facial Expression Manipulation, innovative technique aimed, Speech-Preserving Facial Expression, altering facial expressions, Head Facial Expression

备注: Accepted by ACM Transactions on Multimedia Computing, Communications, and Applications

点击查看摘要

Abstract:Speech-Preserving Facial Expression Manipulation (SPFEM) is an innovative technique aimed at altering facial expressions in images and videos while retaining the original mouth movements. Despite advancements, SPFEM still struggles with accurate lip synchronization due to the complex interplay between facial expressions and mouth shapes. Capitalizing on the advanced capabilities of audio-driven talking head generation (AD-THG) models in synthesizing precise lip movements, our research introduces a novel integration of these models with SPFEM. We present a new framework, Talking Head Facial Expression Manipulation (THFEM), which utilizes AD-THG models to generate frames with accurately synchronized lip movements from audio inputs and SPFEM-altered images. However, increasing the number of frames generated by AD-THG models tends to compromise the realism and expression fidelity of the images. To counter this, we develop an adjacent frame learning strategy that finetunes AD-THG models to predict sequences of consecutive frames. This strategy enables the models to incorporate information from neighboring frames, significantly improving image quality during testing. Our extensive experimental evaluations demonstrate that this framework effectively preserves mouth shapes during expression manipulations, highlighting the substantial benefits of integrating AD-THG with SPFEM.

136. 【2601.12865】Proxy Robustness in Vision Language Models is Effortlessly Transferable

链接https://arxiv.org/abs/2601.12865

作者:Xiaowei Fu,Fuxiang Huang,Lei Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:image classification tasks, demonstrated remarkable success, conventional image classification, classification tasks, pivotal technique

备注

点击查看摘要

Abstract:As a pivotal technique for improving the defense of deep models, adversarial robustness transfer via distillation has demonstrated remarkable success in conventional image classification tasks. However, this paradigm encounters critical challenges when applied to vision-language models (VLM) (e.g., CLIP): constructing adversarially robust teacher for large-scale multi-modal models demands prohibitively high computational resources. We bridge this gap by revealing an interesting phenomenon: vanilla CLIP (without adversarial training) exhibits intrinsic defensive capabilities against adversarial examples generated by another CLIP with different architectures. We formally define this as proxy adversarial robustness, and naturally propose a Heterogeneous Proxy Transfer (HPT) framework that establishes cross-architectural robustness distillation channels between CLIP variants, effortlessly enabling the VLM robustness transfer from proxy to target models. Yet, such proxy transfer paradigm easily induces severe overfitting, leading to a sharp degradation in zero-shot natural generalization. To resolve that, we design Generalization-Pivot Decoupling (GPD) by leveraging the difference in learning rate scheduling. This decouples the proxy transfer process into a generalization-anchored warm-up that maintains generalization and a generalization-pulled HPT that promotes adversarial robustness, to achieve an equilibrium between natural generalization and adversarial robustness. Extensive experiments on 15 zero-shot datasets demonstrate the effectiveness of our HPT-GPD method. The code is available at the website of this http URL.

137. 【2601.12863】FGTBT: Frequency-Guided Task-Balancing Transformer for Unified Facial Landmark Detection

链接https://arxiv.org/abs/2601.12863

作者:Jun Wan,Xinyu Xiong,Ning Chen,Zhihui Lai,Jie Zhou,Wenwen Min

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:achieved considerable success, deep learning based, deep learning, considerable success, achieved considerable

备注

点击查看摘要

Abstract:Recently, deep learning based facial landmark detection (FLD) methods have achieved considerable success. However, in challenging scenarios such as large pose variations, illumination changes, and facial expression variations, they still struggle to accurately capture the geometric structure of the face, resulting in performance degradation. Moreover, the limited size and diversity of existing FLD datasets hinder robust model training, leading to reduced detection accuracy. To address these challenges, we propose a Frequency-Guided Task-Balancing Transformer (FGTBT), which enhances facial structure perception through frequency-domain modeling and multi-dataset unified training. Specifically, we propose a novel Fine-Grained Multi-Task Balancing loss (FMB-loss), which moves beyond coarse task-level balancing by assigning weights to individual landmarks based on their occurrence across datasets. This enables more effective unified training and mitigates the issue of inconsistent gradient magnitudes. Additionally, a Frequency-Guided Structure-Aware (FGSA) model is designed to utilize frequency-guided structure injection and regularization to help learn facial structure constraints. Extensive experimental results on popular benchmark datasets demonstrate that the integration of the proposed FMB-loss and FGSA model into our FGTBT framework achieves performance comparable to state-of-the-art methods. The code is available at this https URL.

138. 【2601.12831】Data-Consistent Learning of Inverse Problems

链接https://arxiv.org/abs/2601.12831

作者:Markus Haltmeier,Gyeongha Hwang

类目:Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV)

关键词:Inverse problems, inherently ill-posed, suffering from non-uniqueness, non-uniqueness and instability, problems are inherently

备注

点击查看摘要

Abstract:Inverse problems are inherently ill-posed, suffering from non-uniqueness and instability. Classical regularization methods provide mathematically well-founded solutions, ensuring stability and convergence, but often at the cost of reduced flexibility or visual quality. Learned reconstruction methods, such as convolutional neural networks, can produce visually compelling results, yet they typically lack rigorous theoretical guarantees. DC (DC) networks address this gap by enforcing the measurement model within the network architecture. In particular, null-space networks combined with a classical regularization method as an initial reconstruction define a convergent regularization method. This approach preserves the theoretical reliability of classical schemes while leveraging the expressive power of data-driven learning, yielding reconstructions that are both accurate and visually appealing.

139. 【2601.12826】Seeing Isn't Always Believing: Analysis of Grad-CAM Faithfulness and Localization Reliability in Lung Cancer CT Classification

链接https://arxiv.org/abs/2601.12826

作者:Teerapong Panboonyuen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Explainable Artificial Intelligence, Class Activation Mapping, Gradient-weighted Class Activation, Explainable Artificial, Artificial Intelligence

备注: 7 pages

点击查看摘要

Abstract:Explainable Artificial Intelligence (XAI) techniques, such as Gradient-weighted Class Activation Mapping (Grad-CAM), have become indispensable for visualizing the reasoning process of deep neural networks in medical image analysis. Despite their popularity, the faithfulness and reliability of these heatmap-based explanations remain under scrutiny. This study critically investigates whether Grad-CAM truly represents the internal decision-making of deep models trained for lung cancer image classification. Using the publicly available IQ-OTH/NCCD dataset, we evaluate five representative architectures: ResNet-50, ResNet-101, DenseNet-161, EfficientNet-B0, and ViT-Base-Patch16-224, to explore model-dependent variations in Grad-CAM interpretability. We introduce a quantitative evaluation framework that combines localization accuracy, perturbation-based faithfulness, and explanation consistency to assess Grad-CAM reliability across architectures. Experimental findings reveal that while Grad-CAM effectively highlights salient tumor regions in most convolutional networks, its interpretive fidelity significantly degrades for Vision Transformer models due to non-local attention behavior. Furthermore, cross-model comparisons indicate substantial variability in saliency localization, implying that Grad-CAM explanations may not always correspond to the true diagnostic evidence used by the networks. This work exposes critical limitations of current saliency-based XAI approaches in medical imaging and emphasizes the need for model-aware interpretability methods that are both computationally sound and clinically meaningful. Our findings aim to inspire a more cautious and rigorous adoption of visual explanation tools in medical AI, urging the community to rethink what it truly means to "trust" a model's explanation.

140. 【2601.12823】reeDGS: Aerial Gaussian Splatting for Distant DBH Measurement

链接https://arxiv.org/abs/2601.12823

作者:Belal Shaheen,Minh-Hieu Nguyen,Bach-Thuan Bui,Shubham,Tim Wu,Michael Fairley,Matthew David Zane,Michael Wu,James Tompkin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:efficient large-area surveying, Aerial remote sensing, remote sensing enables, sensing enables efficient, enables efficient large-area

备注

点击查看摘要

Abstract:Aerial remote sensing enables efficient large-area surveying, but accurate direct object-level measurement remains difficult in complex natural scenes. Recent advancements in 3D vision, particularly learned radiance-field representations such as NeRF and 3D Gaussian Splatting, have begun to raise the ceiling on reconstruction fidelity and densifiable geometry from posed imagery. Nevertheless, direct aerial measurement of important natural attributes such as tree diameter at breast height (DBH) remains challenging. Trunks in aerial forest scans are distant and sparsely observed in image views: at typical operating altitudes, stems may span only a few pixels. With these constraints, conventional reconstruction methods leave breast-height trunk geometry weakly constrained. We present TreeDGS, an aerial image reconstruction method that leverages 3D Gaussian Splatting as a continuous, densifiable scene representation for trunk measurement. After SfM-MVS initialization and Gaussian optimization, we extract a dense point set from the Gaussian field using RaDe-GS's depth-aware cumulative-opacity integration and associate each sample with a multi-view opacity reliability score. We then estimate DBH from trunk-isolated points using opacity-weighted solid-circle fitting. Evaluated on 10 plots with field-measured DBH, TreeDGS reaches 4.79,cm RMSE (about 2.6 pixels at this GSD) and outperforms a state-of-the-art LiDAR baseline (7.91,cm RMSE), demonstrating that densified splat-based geometry can enable accurate, low-cost aerial DBH measurement.

141. 【2601.12820】A Generalist Foundation Model for Total-body PET/CT Enables Diagnostic Reporting and System-wide Metabolic Profiling

链接https://arxiv.org/abs/2601.12820

作者:Wei Chen,Liang Wu,Shuyi Lu,Yuanyuan Sun,Wenkai Bi,Zilong Yuan,Yaoyao He,Feng Wang,Junchi Ma,Shuyong Liu,Zhaoping Cheng,Xiaoyan Hu,Jianfeng Qiu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:assume single-modality inputs, challenge existing medical, Total-body PET, Systemic Dual-stream Fusion, system-wide molecular imaging

备注

点击查看摘要

Abstract:Total-body PET/CT enables system-wide molecular imaging, but heterogeneous anatomical and metabolic signals, approximately 2 m axial coverage, and structured radiology semantics challenge existing medical AI models that assume single-modality inputs, localized fields of view, and coarse image-text alignment. We introduce SDF-HOLO (Systemic Dual-stream Fusion Holo Model), a multimodal foundation model for holistic total-body PET/CT, pre-trained on more than 10,000 patients. SDF-HOLO decouples CT and PET representation learning with dual-stream encoders and couples them through a cross-modal interaction module, allowing anatomical context to refine PET aggregation while metabolic saliency guides subtle morphological reasoning. To model long-range dependencies across the body, hierarchical context modeling combines efficient local windows with global attention. To bridge voxels and clinical language, we use anatomical segmentation masks as explicit semantic anchors and perform voxel-mask-text alignment during pre-training. Across tumor segmentation, low-dose lesion detection, and multilingual diagnostic report generation, SDF-HOLO outperforms strong task-specific and clinical-reference baselines while reducing localization errors and hallucinated findings. Beyond focal interpretation, the model enables system-wide metabolic profiling and reveals tumor-associated fingerprints of inter-organ metabolic network interactions, providing a scalable computational foundation for total-body PET/CT diagnostics and system-level precision oncology.

142. 【2601.12814】CSGaussian: Progressive Rate-Distortion Compression and Segmentation for 3D Gaussian Splatting

链接https://arxiv.org/abs/2601.12814

作者:Yu-Jen Tseng,Chia-Hao Kao,Jing-Zhong Chen,Alessandro Gnutti,Shao-Yuan Lo,Yen-Yu Lin,Wen-Hsiao Peng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Gaussian Splatting, unified framework, Splatting, semantic scene understanding, prior works

备注: Accepted at WACV 2026

点击查看摘要

Abstract:We present the first unified framework for rate-distortion-optimized compression and segmentation of 3D Gaussian Splatting (3DGS). While 3DGS has proven effective for both real-time rendering and semantic scene understanding, prior works have largely treated these tasks independently, leaving their joint consideration unexplored. Inspired by recent advances in rate-distortion-optimized 3DGS compression, this work integrates semantic learning into the compression pipeline to support decoder-side applications--such as scene editing and manipulation--that extend beyond traditional scene reconstruction and view synthesis. Our scheme features a lightweight implicit neural representation-based hyperprior, enabling efficient entropy coding of both color and semantic attributes while avoiding costly grid-based hyperprior as seen in many prior works. To facilitate compression and segmentation, we further develop compression-guided segmentation learning, consisting of quantization-aware training to enhance feature separability and a quality-aware weighting mechanism to suppress unreliable Gaussian primitives. Extensive experiments on the LERF and 3D-OVS datasets demonstrate that our approach significantly reduces transmission cost while preserving high rendering quality and strong segmentation performance.

143. 【2601.12809】Left-Right Symmetry Breaking in CLIP-style Vision-Language Models Trained on Synthetic Spatial-Relation Data

链接https://arxiv.org/abs/2601.12809

作者:Takaki Yamamoto,Chihiro Noguchi,Toshihiro Tanizawa

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Spatial understanding remains, Spatial understanding, remains a key, key challenge, challenge in vision-language

备注

点击查看摘要

Abstract:Spatial understanding remains a key challenge in vision-language models. Yet it is still unclear whether such understanding is truly acquired, and if so, through what mechanisms. We present a controllable 1D image-text testbed to probe how left-right relational understanding emerges in Transformer-based vision and text encoders trained with a CLIP-style contrastive objective. We train lightweight Transformer-based vision and text encoders end-to-end on paired descriptions of one- and two-object scenes and evaluate generalization to unseen object pairs while systematically varying label and layout diversity. We find that contrastive training learns left-right relations and that label diversity, more than layout diversity, is the primary driver of generalization in this setting. To gain the mechanistic understanding, we perform an attention decomposition and show that interactions between positional and token embeddings induce a horizontal attention gradient that breaks left-right symmetry in the encoders; ablating this contribution substantially reduces left-right discrimination. Our results provide a mechanistic insight of when and how CLIP-style models acquire relational competence.

144. 【2601.12808】Joint Source-Channel-Generation Coding: From Distortion-oriented Reconstruction to Semantic-consistent Generation

链接https://arxiv.org/abs/2601.12808

作者:Tong Wu,Zhiyong Chen,Guo Lu,Li Song,Feng Yang,Meixia Tao,Wenjun Zhang

类目:Information Theory (cs.IT); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Shannon rate-distortion theory, AI-driven joint source-channel, guided by Shannon, Shannon rate-distortion, joint source-channel coding

备注: submitted to IEEE ISIT 2026

点击查看摘要

Abstract:Conventional communication systems, including both separation-based coding and AI-driven joint source-channel coding (JSCC), are largely guided by Shannon's rate-distortion theory. However, relying on generic distortion metrics fails to capture complex human visual perception, often resulting in blurred or unrealistic reconstructions. In this paper, we propose Joint Source-Channel-Generation Coding (JSCGC), a novel paradigm that shifts the focus from deterministic reconstruction to probabilistic generation. JSCGC leverages a generative model at the receiver as a generator rather than a conventional decoder to parameterize the data distribution, enabling direct maximization of mutual information under channel constraints while controlling stochastic sampling to produce outputs residing on the authentic data manifold with high fidelity. We further derive a theoretical lower bound on the maximum semantic inconsistency with given transmitted mutual information, elucidating the fundamental limits of communication in controlling the generative process. Extensive experiments on image transmission demonstrate that JSCGC substantially improves perceptual quality and semantic fidelity, significantly outperforming conventional distortion-oriented JSCC methods.

145. 【2601.12799】FRoM-W1: Towards General Humanoid Whole-Body Control with Language Instructions

链接https://arxiv.org/abs/2601.12799

作者:Peng Li,Zihan Zhuang,Yangfan Gao,Yi Dong,Sixian Li,Changhao Jiang,Shihan Dou,Zhiheng Xi,Enyu Zhou,Jixuan Huang,Hui Li,Jingjing Gong,Xingjun Ma,Tao Gui,Zuxuan Wu,Qi Zhang,Xuanjing Huang,Yu-Gang Jiang,Xipeng Qiu

类目:Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:capable of performing, Humanoid robots, Humanoid, whole-body motion, robots

备注: Project Page: [this https URL](https://openmoss.github.io/FRoM-W1)

点击查看摘要

Abstract:Humanoid robots are capable of performing various actions such as greeting, dancing and even backflipping. However, these motions are often hard-coded or specifically trained, which limits their versatility. In this work, we present FRoM-W1, an open-source framework designed to achieve general humanoid whole-body motion control using natural language. To universally understand natural language and generate corresponding motions, as well as enable various humanoid robots to stably execute these motions in the physical world under gravity, FRoM-W1 operates in two stages: (a) H-GPT: utilizing massive human data, a large-scale language-driven human whole-body motion generation model is trained to generate diverse natural behaviors. We further leverage the Chain-of-Thought technique to improve the model's generalization in instruction understanding. (b) H-ACT: After retargeting generated human whole-body motions into robot-specific actions, a motion controller that is pretrained and further fine-tuned through reinforcement learning in physical simulation enables humanoid robots to accurately and stably perform corresponding actions. It is then deployed on real robots via a modular simulation-to-reality module. We extensively evaluate FRoM-W1 on Unitree H1 and G1 robots. Results demonstrate superior performance on the HumanML3D-X benchmark for human whole-body motion generation, and our introduced reinforcement learning fine-tuning consistently improves both motion tracking accuracy and task success rates of these humanoid robots. We open-source the entire FRoM-W1 framework and hope it will advance the development of humanoid intelligence.

146. 【2601.12798】PhyG-MoE: A Physics-Guided Mixture-of-Experts Framework for Energy-Efficient GNSS Interference Recognition

链接https://arxiv.org/abs/2601.12798

作者:Zhihan Zeng,Yang Zhao,Kaihe Wang,Dusit Niyato,Yue Xiu,Lu Chen,Zhongpei Zhang,Ning Wei

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Global Navigation Satellite, compromises Global Navigation, Navigation Satellite Systems, increasingly compromises Global, Integrated Networks

备注

点击查看摘要

Abstract:Complex electromagnetic interference increasingly compromises Global Navigation Satellite Systems (GNSS), threatening the reliability of Space-Air-Ground Integrated Networks (SAGIN). Although deep learning has advanced interference recognition, current static models suffer from a \textbf{fundamental limitation}: they impose a fixed computational topology regardless of the input's physical entropy. This rigidity leads to severe resource mismatch, where simple primitives consume the same processing cost as chaotic, saturated mixtures. To resolve this, this paper introduces PhyG-MoE (Physics-Guided Mixture-of-Experts), a framework designed to \textbf{dynamically align model capacity with signal complexity}. Unlike static architectures, the proposed system employs a spectrum-based gating mechanism that routes signals based on their spectral feature entanglement. A high-capacity TransNeXt expert is activated on-demand to disentangle complex features in saturated scenarios, while lightweight experts handle fundamental signals to minimize latency. Evaluations on 21 jamming categories demonstrate that PhyG-MoE achieves an overall accuracy of 97.58\%. By resolving the intrinsic conflict between static computing and dynamic electromagnetic environments, the proposed framework significantly reduces computational overhead without performance degradation, offering a viable solution for resource-constrained cognitive receivers.

147. 【2601.12795】Combating Noisy Labels through Fostering Self- and Neighbor-Consistency

链接https://arxiv.org/abs/2601.12795

作者:Zeren Sun,Yazhou Yao,Tongliang Liu,Zechao Li,Fumin Shen,Jinhui Tang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:real-world scenarios, posing challenges, challenges in supervised, supervised deep learning, textbf

备注: accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence

点击查看摘要

Abstract:Label noise is pervasive in various real-world scenarios, posing challenges in supervised deep learning. Deep networks are vulnerable to such label-corrupted samples due to the memorization effect. One major stream of previous methods concentrates on identifying clean data for training. However, these methods often neglect imbalances in label noise across different mini-batches and devote insufficient attention to out-of-distribution noisy data. To this end, we propose a noise-robust method named Jo-SNC (\textbf{Jo}int sample selection and model regularization based on \textbf{S}elf- and \textbf{N}eighbor-\textbf{C}onsistency). Specifically, we propose to employ the Jensen-Shannon divergence to measure the ``likelihood'' of a sample being clean or out-of-distribution. This process factors in the nearest neighbors of each sample to reinforce the reliability of clean sample identification. We design a self-adaptive, data-driven thresholding scheme to adjust per-class selection thresholds. While clean samples undergo conventional training, detected in-distribution and out-of-distribution noisy samples are trained following partial label learning and negative learning, respectively. Finally, we advance the model performance further by proposing a triplet consistency regularization that promotes self-prediction consistency, neighbor-prediction consistency, and feature consistency. Extensive experiments on various benchmark datasets and comprehensive ablation studies demonstrate the effectiveness and superiority of our approach over existing state-of-the-art methods.

148. 【2601.12791】SKANet: A Cognitive Dual-Stream Framework with Adaptive Modality Fusion for Robust Compound GNSS Interference Classification

链接https://arxiv.org/abs/2601.12791

作者:Zhihan Zeng,Yang Zhao,Kaihe Wang,Dusit Niyato,Hongyuan Shu,Junchu Zhao,Yanjun Huang,Yue Xiu,Zhongpei Zhang,Ning Wei

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Navigation Satellite Systems, Global Navigation Satellite, Satellite Systems, face growing threats, Navigation Satellite

备注

点击查看摘要

Abstract:As the electromagnetic environment becomes increasingly complex, Global Navigation Satellite Systems (GNSS) face growing threats from sophisticated jamming interference. Although Deep Learning (DL) effectively identifies basic interference, classifying compound interference remains difficult due to the superposition of diverse jamming sources. Existing single-domain approaches often suffer from performance degradation because transient burst signals and continuous global signals require conflicting feature extraction scales. We propose the Selective Kernel and Asymmetric convolution Network(SKANet), a cognitive deep learning framework built upon a dual-stream architecture that integrates Time-Frequency Images (TFIs) and Power Spectral Density (PSD). Distinct from conventional fusion methods that rely on static receptive fields, the proposed architecture incorporates a Multi-Branch Selective Kernel (SK) module combined with Asymmetric Convolution Blocks (ACBs). This mechanism enables the network to dynamically adjust its receptive fields, acting as an adaptive filter that simultaneously captures micro-scale transient features and macro-scale spectral trends within entangled compound signals. To complement this spatial-temporal adaptation, a Squeeze-and-Excitation (SE) mechanism is integrated at the fusion stage to adaptively recalibrate the contribution of heterogeneous features from each modality. Evaluations on a dataset of 405,000 samples demonstrate that SKANet achieves an overall accuracy of 96.99\%, exhibiting superior robustness for compound jamming classification, particularly under low Jamming-to-Noise Ratio (JNR) regimes.

149. 【2601.12781】VIRO: Robust and Efficient Neuro-Symbolic Reasoning with Verification for Referring Expression Comprehension

链接https://arxiv.org/abs/2601.12781

作者:Hyejin Park,Junhyuk Kwon,Suha Kwak,Jungseul Ok

类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Referring Expression Comprehension, Referring Expression, Expression Comprehension, aims to localize, natural-language query

备注

点击查看摘要

Abstract:Referring Expression Comprehension (REC) aims to localize the image region corresponding to a natural-language query. Recent neuro-symbolic REC approaches leverage large language models (LLMs) and vision-language models (VLMs) to perform compositional reasoning, decomposing queries 4 structured programs and executing them step-by-step. While such approaches achieve interpretable reasoning and strong zero-shot generalization, they assume that intermediate reasoning steps are accurate. However, this assumption causes cascading errors: false detections and invalid relations propagate through the reasoning chain, yielding high-confidence false positives even when no target is present in the image. To address this limitation, we introduce Verification-Integrated Reasoning Operators (VIRO), a neuro-symbolic framework that embeds lightweight operator-level verifiers within reasoning steps. Each operator executes and validates its output, such as object existence or spatial relationship, thereby allowing the system to robustly handle no-target cases when verification conditions are not met. Our framework achieves state-of-the-art performance, reaching 61.1% balanced accuracy across target-present and no-target settings, and demonstrates generalization to real-world egocentric data. Furthermore, VIRO shows superior computational efficiency in terms of throughput, high reliability with a program failure rate of less than 0.3%, and scalability through decoupled program generation from execution.

150. 【2601.12779】Open Vocabulary Panoptic Segmentation With Retrieval Augmentation

链接https://arxiv.org/abs/2601.12779

作者:Nafis Sadeq,Qingfeng Liu,Mostafa El-Khamy

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:panoptic segmentation aims, Open Vocabulary Panoptic, Vocabulary Panoptic Segmentation, panoptic segmentation, segmentation aims

备注

点击查看摘要

Abstract:Given an input image and set of class names, panoptic segmentation aims to label each pixel in an image with class labels and instance labels. In comparison, Open Vocabulary Panoptic Segmentation aims to facilitate the segmentation of arbitrary classes according to user input. The challenge is that a panoptic segmentation system trained on a particular dataset typically does not generalize well to unseen classes beyond the training data. In this work, we propose RetCLIP, a retrieval-augmented panoptic segmentation method that improves the performance of unseen classes. In particular, we construct a masked segment feature database using paired image-text data. At inference time, we use masked segment features from the input image as query keys to retrieve similar features and associated class labels from the database. Classification scores for the masked segment are assigned based on the similarity between query features and retrieved features. The retrieval-based classification scores are combined with CLIP-based scores to produce the final output. We incorporate our solution with a previous SOTA method (FC-CLIP). When trained on COCO, the proposed method demonstrates 30.9 PQ, 19.3 mAP, 44.0 mIoU on the ADE20k dataset, achieving +4.5 PQ, +2.5 mAP, +10.0 mIoU absolute improvement over the baseline.

151. 【2601.12770】Generalizable and Animatable 3D Full-Head Gaussian Avatar from a Single Image

链接https://arxiv.org/abs/2601.12770

作者:Shuling Zhao,Dan Xu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:challenging problem, important yet challenging, animatable head avatars, Building, full-head animatable avatar

备注: Project page: [this https URL](https://shaelynz.github.io/fhavatar/)

点击查看摘要

Abstract:Building 3D animatable head avatars from a single image is an important yet challenging problem. Existing methods generally collapse under large camera pose variations, compromising the realism of 3D avatars. In this work, we propose a new framework to tackle the novel setting of one-shot 3D full-head animatable avatar reconstruction in a single feed-forward pass, enabling real-time animation and simultaneous 360$^\circ$ rendering views. To facilitate efficient animation control, we model 3D head avatars with Gaussian primitives embedded on the surface of a parametric face model within the UV space. To obtain knowledge of full-head geometry and textures, we leverage rich 3D full-head priors within a pretrained 3D generative adversarial network (GAN) for global full-head feature extraction and multi-view supervision. To increase the fidelity of the 3D reconstruction of the input image, we take advantage of the symmetric nature of the UV space and human faces to fuse local fine-grained input image features with the global full-head textures. Extensive experiments demonstrate the effectiveness of our method, achieving high-quality 3D full-head modeling as well as real-time animation, thereby improving the realism of 3D talking avatars.

152. 【2601.12768】Delving Deeper: Hierarchical Visual Perception for Robust Video-Text Retrieval

链接https://arxiv.org/abs/2601.12768

作者:Zequn Xie,Boyun Zhang,Yuxiao Lin,Tao Jin

类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词:natural language queries, locate relevant videos, aims to locate, language queries, locate relevant

备注

点击查看摘要

Abstract:Video-text retrieval (VTR) aims to locate relevant videos using natural language queries. Current methods, often based on pre-trained models like CLIP, are hindered by video's inherent redundancy and their reliance on coarse, final-layer features, limiting matching accuracy. To address this, we introduce the HVP-Net (Hierarchical Visual Perception Network), a framework that mines richer video semantics by extracting and refining features from multiple intermediate layers of a vision encoder. Our approach progressively distills salient visual concepts from raw patch-tokens at different semantic levels, mitigating redundancy while preserving crucial details for alignment. This results in a more robust video representation, leading to new state-of-the-art performance on challenging benchmarks including MSRVTT, DiDeMo, and ActivityNet. Our work validates the effectiveness of exploiting hierarchical features for advancing video-text retrieval. Our codes are available at this https URL.

153. 【2601.12766】Spatial-VLN: Zero-Shot Vision-and-Language Navigation With Explicit Spatial Perception and Exploration

链接https://arxiv.org/abs/2601.12766

作者:Lu Yue,Yue Fan,Shiwei Lian,Yu Zhao,Jiaxin Yu,Liang Xie,Feitian Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)

关键词:Large Language Models, leveraging Large Language, Language Models, Large Language, agents leveraging Large

备注

点击查看摘要

Abstract:Zero-shot Vision-and-Language Navigation (VLN) agents leveraging Large Language Models (LLMs) excel in generalization but suffer from insufficient spatial perception. Focusing on complex continuous environments, we categorize key perceptual bottlenecks into three spatial challenges: door interaction,multi-room navigation, and ambiguous instruction execution, where existing methods consistently suffer high failure rates. We present Spatial-VLN, a perception-guided exploration framework designed to overcome these challenges. The framework consists of two main modules. The Spatial Perception Enhancement (SPE) module integrates panoramic filtering with specialized door and region experts to produce spatially coherent, cross-view consistent perceptual representations. Building on this foundation, our Explored Multi-expert Reasoning (EMR) module uses parallel LLM experts to address waypoint-level semantics and region-level spatial transitions. When discrepancies arise between expert predictions, a query-and-explore mechanism is activated, prompting the agent to actively probe critical areas and resolve perceptual ambiguities. Experiments on VLN-CE demonstrate that Spatial VLN achieves state-of-the-art performance using only low-cost LLMs. Furthermore, to validate real-world applicability, we introduce a value-based waypoint sampling strategy that effectively bridges the Sim2Real gap. Extensive real-world evaluations confirm that our framework delivers superior generalization and robustness in complex environments. Our codes and videos are available at this https URL.

154. 【2601.12765】owards Unbiased Source-Free Object Detection via Vision Foundation Models

链接https://arxiv.org/abs/2601.12765

作者:Zhi Cai,Yingjie Gao,Yanan Zhang,Xinzhu Ma,Di Huang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Source-Free Object Detection, Debiased Source-free Object, Object Detection, Source Bias problem, adapted model remains

备注

点击查看摘要

Abstract:Source-Free Object Detection (SFOD) has garnered much attention in recent years by eliminating the need of source-domain data in cross-domain tasks, but existing SFOD methods suffer from the Source Bias problem, i.e. the adapted model remains skewed towards the source domain, leading to poor generalization and error accumulation during self-training. To overcome this challenge, we propose Debiased Source-free Object Detection (DSOD), a novel VFM-assisted SFOD framework that can effectively mitigate source bias with the help of powerful VFMs. Specifically, we propose Unified Feature Injection (UFI) module that integrates VFM features into the CNN backbone through Simple-Scale Extension (SSE) and Domain-aware Adaptive Weighting (DAAW). Then, we propose Semantic-aware Feature Regularization (SAFR) that constrains feature learning to prevent overfitting to source domain characteristics. Furthermore, we propose a VFM-free variant, termed DSOD-distill for computation-restricted scenarios through a novel Dual-Teacher distillation scheme. Extensive experiments on multiple benchmarks demonstrate that DSOD outperforms state-of-the-art SFOD methods, achieving 48.1% AP on Normal-to-Foggy weather adaptation, 39.3% AP on Cross-scene adaptation, and 61.4% AP on Synthetic-to-Real adaptation.

155. 【2601.12761】Moaw: Unleashing Motion Awareness for Video Diffusion Models

链接https://arxiv.org/abs/2601.12761

作者:Tianqi Zhang,Ziyi Wang,Wenzhao Zheng,Weiliang Chen,Yuanhui Huang,Zhengyang Huang,Jie Zhou,Jiwen Lu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Video diffusion models, naturally capture correspondences, trained on large-scale, capture correspondences, correspondences of shared

备注

点击查看摘要

Abstract:Video diffusion models, trained on large-scale datasets, naturally capture correspondences of shared features across frames. Recent works have exploited this property for tasks such as optical flow prediction and tracking in a zero-shot setting. Motivated by these findings, we investigate whether supervised training can more fully harness the tracking capability of video diffusion models. To this end, we propose Moaw, a framework that unleashes motion awareness for video diffusion models and leverages it to facilitate motion transfer. Specifically, we train a diffusion model for motion perception, shifting its modality from image-to-video generation to video-to-dense-tracking. We then construct a motion-labeled dataset to identify features that encode the strongest motion information, and inject them into a structurally identical video generation model. Owing to the homogeneity between the two networks, these features can be naturally adapted in a zero-shot manner, enabling motion transfer without additional adapters. Our work provides a new paradigm for bridging generative modeling and motion understanding, paving the way for more unified and controllable video learning frameworks.

156. 【2601.12747】SSPFormer: Self-Supervised Pretrained Transformer for MRI Images

链接https://arxiv.org/abs/2601.12747

作者:Jingkai Li,Xiaoze Tian,Yuhang Shen,Jia Wang,Dianjie Lu,Guijuan Zhang,Zhuoran Zheng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:demonstrates remarkable generalization, pre-trained transformer demonstrates, transformer demonstrates remarkable, natural image processing, remarkable generalization ability

备注: Undergraduate student as first author submitted to IJCAI

点击查看摘要

Abstract:The pre-trained transformer demonstrates remarkable generalization ability in natural image processing. However, directly transferring it to magnetic resonance images faces two key challenges: the inability to adapt to the specificity of medical anatomical structures and the limitations brought about by the privacy and scarcity of medical data. To address these issues, this paper proposes a Self-Supervised Pretrained Transformer (SSPFormer) for MRI images, which effectively learns domain-specific feature representations of medical images by leveraging unlabeled raw imaging data. To tackle the domain gap and data scarcity, we introduce inverse frequency projection masking, which prioritizes the reconstruction of high-frequency anatomical regions to enforce structure-aware representation learning. Simultaneously, to enhance robustness against real-world MRI artifacts, we employ frequency-weighted FFT noise enhancement that injects physiologically realistic noise into the Fourier domain. Together, these strategies enable the model to learn domain-invariant and artifact-robust features directly from raw scans. Through extensive experiments on segmentation, super-resolution, and denoising tasks, the proposed SSPFormer achieves state-of-the-art performance, fully verifying its ability to capture fine-grained MRI image fidelity and adapt to clinical application requirements.

157. 【2601.12736】KaoLRM: Repurposing Pre-trained Large Reconstruction Models for Parametric 3D Face Reconstruction

链接https://arxiv.org/abs/2601.12736

作者:Qingtian Zhu,Xu Cao,Zhixiang Wang,Yinqiang Zheng,Takafumi Taketomi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Reconstruction Model, Large Reconstruction, single-view images, re-target the learned, Morphable Models

备注

点击查看摘要

Abstract:We propose KaoLRM to re-target the learned prior of the Large Reconstruction Model (LRM) for parametric 3D face reconstruction from single-view images. Parametric 3D Morphable Models (3DMMs) have been widely used for facial reconstruction due to their compact and interpretable parameterization, yet existing 3DMM regressors often exhibit poor consistency across varying viewpoints. To address this, we harness the pre-trained 3D prior of LRM and incorporate FLAME-based 2D Gaussian Splatting into LRM's rendering pipeline. Specifically, KaoLRM projects LRM's pre-trained triplane features into the FLAME parameter space to recover geometry, and models appearance via 2D Gaussian primitives that are tightly coupled to the FLAME mesh. The rich prior enables the FLAME regressor to be aware of the 3D structure, leading to accurate and robust reconstructions under self-occlusions and diverse viewpoints. Experiments on both controlled and in-the-wild benchmarks demonstrate that KaoLRM achieves superior reconstruction accuracy and cross-view consistency, while existing methods remain sensitive to viewpoint variations. The code is released at this https URL.

158. 【2601.12729】DC-VLAQ: Query-Residual Aggregation for Robust Visual Place Recognition

链接https://arxiv.org/abs/2601.12729

作者:Hanyu Zhu,Zhihao Zhan,Yuhang Ming,Liang Li,Dibo Hou,Javier Civera,Wanzeng Kong

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:visual place recognition, illumination variations, place recognition, large viewpoint, global aggregation

备注: 10 pages, 4 figures, 5 tables

点击查看摘要

Abstract:One of the central challenges in visual place recognition (VPR) is learning a robust global representation that remains discriminative under large viewpoint changes, illumination variations, and severe domain shifts. While visual foundation models (VFMs) provide strong local features, most existing methods rely on a single model, overlooking the complementary cues offered by different VFMs. However, exploiting such complementary information inevitably alters token distributions, which challenges the stability of existing query-based global aggregation schemes. To address these challenges, we propose DC-VLAQ, a representation-centric framework that integrates the fusion of complementary VFMs and robust global aggregation. Specifically, we first introduce a lightweight residual-guided complementary fusion that anchors representations in the DINOv2 feature space while injecting complementary semantics from CLIP through a learned residual correction. In addition, we propose the Vector of Local Aggregated Queries (VLAQ), a query--residual global aggregation scheme that encodes local tokens by their residual responses to learnable queries, resulting in improved stability and the preservation of fine-grained discriminative cues. Extensive experiments on standard VPR benchmarks, including Pitts30k, Tokyo24/7, MSLS, Nordland, SPED, and AmsterTime, demonstrate that DC-VLAQ consistently outperforms strong baselines and achieves state-of-the-art performance, particularly under challenging domain shifts and long-term appearance changes.

159. 【2601.12719】S2DiT: Sandwich Diffusion Transformer for Mobile Streaming Video Generation

链接https://arxiv.org/abs/2601.12719

作者:Lin Zhao,Yushu Wu,Aleksei Lebedev,Dishani Lahiri,Meng Dong,Arpit Sahni,Michael Vasilkovsky,Hao Chen,Ju Hu,Aliaksandr Siarohin,Sergey Tulyakov,Yanzhi Wang,Anil Kag,Yanyu Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Sandwich Diffusion Transformer, recently improved video, Diffusion Transformer designed, Diffusion Transformers, recently improved

备注

点击查看摘要

Abstract:Diffusion Transformers (DiTs) have recently improved video generation quality. However, their heavy computational cost makes real-time or on-device generation infeasible. In this work, we introduce S2DiT, a Streaming Sandwich Diffusion Transformer designed for efficient, high-fidelity, and streaming video generation on mobile hardware. S2DiT generates more tokens but maintains efficiency with novel efficient attentions: a mixture of LinConv Hybrid Attention (LCHA) and Stride Self-Attention (SSA). Based on this, we uncover the sandwich design via a budget-aware dynamic programming search, achieving superior quality and efficiency. We further propose a 2-in-1 distillation framework that transfers the capacity of large teacher models (e.g., Wan 2.2-14B) to the compact few-step sandwich model. Together, S2DiT achieves quality on par with state-of-the-art server video models, while streaming at over 10 FPS on an iPhone.

160. 【2601.12715】RSOD: Reliability-Guided Sonar Image Object Detection with Extremely Limited Labels

链接https://arxiv.org/abs/2601.12715

作者:Chengzhou Li,Ping Guo,Guanchen Meng,Qi Jia,Jinyuan Liu,Zhu Liu,Xiaokang Liu,Yu Liu,Zhongxuan Luo,Xin Fan

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:sonar images, underwater detection systems, sonar, images, key technology

备注: Accepted by AAAI 2026,9 pages,10 figures

点击查看摘要

Abstract:Object detection in sonar images is a key technology in underwater detection systems. Compared to natural images, sonar images contain fewer texture details and are more susceptible to noise, making it difficult for non-experts to distinguish subtle differences between classes. This leads to their inability to provide precise annotation data for sonar images. Therefore, designing effective object detection methods for sonar images with extremely limited labels is particularly important. To address this, we propose a teacher-student framework called RSOD, which aims to fully learn the characteristics of sonar images and develop a pseudo-label strategy suitable for these images to mitigate the impact of limited labels. First, RSOD calculates a reliability score by assessing the consistency of the teacher's predictions across different views. To leverage this score, we introduce an object mixed pseudo-label method to tackle the shortage of labeled data in sonar images. Finally, we optimize the performance of the student by implementing a reliability-guided adaptive constraint. By taking full advantage of unlabeled data, the student can perform well even in situations with extremely limited labels. Notably, on the UATD dataset, our method, using only 5% of labeled data, achieves results that can compete against those of our baseline algorithm trained on 100% labeled data. We also collected a new dataset to provide more valuable data for research in the field of sonar.

161. 【2601.12714】P2L-CA: An Effective Parameter Tuning Framework for Rehearsal-Free Multi-Label Class-Incremental Learning

链接https://arxiv.org/abs/2601.12714

作者:Songlin Dong,Jiangyang Li,Chenhao Ding,Zhiheng Ma,Haoyu Luo,Yuhang He,Yihong Gong

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Class-Incremental Learning aims, Multi-label Class-Incremental Learning, multiple objects co-occur, Class-Incremental Learning, Learning aims

备注: 12 pages, 5 figures

点击查看摘要

Abstract:Multi-label Class-Incremental Learning aims to continuously recognize novel categories in complex scenes where multiple objects co-occur. However, existing approaches often incur high computational costs due to full-parameter fine-tuning and substantial storage overhead from memory buffers, or they struggle to address feature confusion and domain discrepancies adequately. To overcome these limitations, we introduce P2L-CA, a parameter-efficient framework that integrates a Prompt-to-Label module with a Continuous Adapter module. The P2L module leverages class-specific prompts to disentangle multi-label representations while incorporating linguistic priors to enforce stable semantic-visual alignment. Meanwhile, the CA module employs lightweight adapters to mitigate domain gaps between pre-trained models and downstream tasks, thereby enhancing model plasticity. Extensive experiments across standard and challenging MLCIL settings on MS-COCO and PASCAL VOC show that P2L-CA not only achieves substantial improvements over state-of-the-art methods but also demonstrates strong generalization in CIL scenarios, all while requiring minimal trainable parameters and eliminating the need for memory buffers.

162. 【2601.12697】Fusing in 3D: Free-Viewpoint Fusion Rendering with a 3D Infrared-Visible Scene Representation

链接https://arxiv.org/abs/2601.12697

作者:Chao Yang,Deshui Miao,Chao Tian,Guoqing Zhu,Yameng Gu,Zhenyu He

类目:Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG)

关键词:image fusion aims, single fused image, Infrared-Visible Gaussian Fusion, Infrared-visible image fusion, aims to integrate

备注

点击查看摘要

Abstract:Infrared-visible image fusion aims to integrate infrared and visible information into a single fused image. Existing 2D fusion methods focus on fusing images from fixed camera viewpoints, neglecting a comprehensive understanding of complex scenarios, which results in the loss of critical information about the scene. To address this limitation, we propose a novel Infrared-Visible Gaussian Fusion (IVGF) framework, which reconstructs scene geometry from multimodal 2D inputs and enables direct rendering of fused images. Specifically, we propose a cross-modal adjustment (CMA) module that modulates the opacity of Gaussians to solve the problem of cross-modal conflicts. Moreover, to preserve the distinctive features from both modalities, we introduce a fusion loss that guides the optimization of CMA, thus ensuring that the fused image retains the critical characteristics of each modality. Comprehensive qualitative and quantitative experiments demonstrate the effectiveness of the proposed method.

163. 【2601.12683】GaussianTrimmer: Online Trimming Boundaries for 3DGS Segmentation

链接https://arxiv.org/abs/2601.12683

作者:Liwei Liao,Ronggang Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:Gaussian segmentation methods, scene representation, Gaussian segmentation, scene segmentation methods, Gaussian

备注

点击查看摘要

Abstract:With the widespread application of 3D Gaussians in 3D scene representation, 3D scene segmentation methods based on 3D Gaussians have also gradually emerged. However, existing 3D Gaussian segmentation methods basically segment on the basis of Gaussian primitives. Due to the large variation range of the scale of 3D Gaussians, large-sized Gaussians that often span the foreground and background lead to jagged boundaries of segmented objects. To this end, we propose an online boundary trimming method, GaussianTrimmer, which is an efficient and plug-and-play post-processing method capable of trimming coarse boundaries for existing 3D Gaussian segmentation methods. Our method consists of two core steps: 1. Generating uniformly and well-covered virtual cameras; 2. Trimming Gaussian at the primitive level based on 2D segmentation results on virtual cameras. Extensive quantitative and qualitative experiments demonstrate that our method can improve the segmentation quality of existing 3D Gaussian segmentation methods as a plug-and-play method.

164. 【2601.12682】Fusion-Restoration Image Processing Algorithm to Improve the High-Temperature Deformation Measurement

链接https://arxiv.org/abs/2601.12682

作者:Banglei Guan,Dongcai Tan,Jing Tao,Ang Su,Yang Shang,Qifeng Yu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:image degradation caused, deformation measurement, thermal deformation measurement, thermal radiation, deformation measurement errors

备注

点击查看摘要

Abstract:In the deformation measurement of high-temperature structures, image degradation caused by thermal radiation and random errors introduced by heat haze restrict the accuracy and effectiveness of deformation measurement. To suppress thermal radiation and heat haze using fusion-restoration image processing methods, thereby improving the accuracy and effectiveness of DIC in the measurement of high-temperature deformation. For image degradation caused by thermal radiation, based on the image layered representation, the image is decomposed into positive and negative channels for parallel processing, and then optimized for quality by multi-exposure image fusion. To counteract the high-frequency, random errors introduced by heat haze, we adopt the FSIM as the objective function to guide the iterative optimization of model parameters, and the grayscale average algorithm is applied to equalize anomalous gray values, thereby reducing measurement error. The proposed multi-exposure image fusion algorithm effectively suppresses image degradation caused by complex illumination conditions, boosting the effective computation area from 26% to 50% for under-exposed images and from 32% to 40% for over-exposed images without degrading measurement accuracy in the experiment. Meanwhile, the image restoration combined with the grayscale average algorithm reduces static thermal deformation measurement errors. The error in {\epsilon}_xx is reduced by 85.3%, while the errors in {\epsilon}_yy and {\gamma}_xy are reduced by 36.0% and 36.4%, respectively. We present image processing methods to suppress the interference of thermal radiation and heat haze in high-temperature deformation measurement using DIC. The experimental results verify that the proposed method can effectively improve image quality, reduce deformation measurement errors, and has potential application value in thermal deformation measurement.

165. 【2601.12672】VILTA: A VLM-in-the-Loop Adversary for Enhancing Driving Policy Robustness

链接https://arxiv.org/abs/2601.12672

作者:Qimao Chen,Fang Li,Shaoqing Xu,Zhiyi Lai,Zixun Xie,Yuechen Luo,Shengyin Jiang,Hanbing Li,Long Chen,Bing Wang,Yi Zhang,Zhi-Xin Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Vision Language Models, systems is fundamentally, real-world data, safe deployment, deployment of autonomous

备注: Accepted to AAAI 2026

点击查看摘要

Abstract:The safe deployment of autonomous driving (AD) systems is fundamentally hindered by the long-tail problem, where rare yet critical driving scenarios are severely underrepresented in real-world data. Existing solutions including safety-critical scenario generation and closed-loop learning often rely on rule-based heuristics, resampling methods and generative models learned from offline datasets, limiting their ability to produce diverse and novel challenges. While recent works leverage Vision Language Models (VLMs) to produce scene descriptions that guide a separate, downstream model in generating hazardous trajectories for agents, such two-stage framework constrains the generative potential of VLMs, as the diversity of the final trajectories is ultimately limited by the generalization ceiling of the downstream algorithm. To overcome these limitations, we introduce VILTA (VLM-In-the-Loop Trajectory Adversary), a novel framework that integrates a VLM into the closed-loop training of AD agents. Unlike prior works, VILTA actively participates in the training loop by comprehending the dynamic driving environment and strategically generating challenging scenarios through direct, fine-grained editing of surrounding agents' future trajectories. This direct-editing approach fully leverages the VLM's powerful generalization capabilities to create a diverse curriculum of plausible yet challenging scenarios that extend beyond the scope of traditional methods. We demonstrate that our approach substantially enhances the safety and robustness of the resulting AD policy, particularly in its ability to navigate critical long-tail events.

166. 【2601.12671】Exploiting Test-Time Augmentation in Federated Learning for Brain Tumor MRI Classification

链接https://arxiv.org/abs/2601.12671

作者:Thamara Leandra de Deus Melo,Rodrigo Moreira,Larissa Ferreira Rodrigues Moreira,André Ricardo Backes

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Efficient brain tumor, brain tumor diagnosis, Efficient brain, early treatment, brain tumor

备注: 21st International Conference on Computer Vision Theory and Applications (VISAPP 2026), 9-11 March 2026, Marbella, Spain

点击查看摘要

Abstract:Efficient brain tumor diagnosis is crucial for early treatment; however, it is challenging because of lesion variability and image complexity. We evaluated convolutional neural networks (CNNs) in a federated learning (FL) setting, comparing models trained on original versus preprocessed MRI images (resizing, grayscale conversion, normalization, filtering, and histogram equalization). Preprocessing alone yielded negligible gains; combined with test-time augmentation (TTA), it delivered consistent, statistically significant improvements in federated MRI classification (p0.001). In practice, TTA should be the default inference strategy in FL-based medical imaging; when the computational budget permits, pairing TTA with light preprocessing provides additional reliable gains.

167. 【2601.12666】Near-Light Color Photometric Stereo for mono-Chromaticity non-lambertian surface

链接https://arxiv.org/abs/2601.12666

作者:Zonglin Li,Jieji Ren,Shuangfan Zhou,Heng Guo,Jinnuo Zhang,Jiang Zhou,Boxin Shi,Zhanyu Ma,Guoying Gu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:extending conventional photometric, Color photometric stereo, stereo enables single-shot, requires multiple images, enables single-shot surface

备注: 5 pages 7figures

点击查看摘要

Abstract:Color photometric stereo enables single-shot surface reconstruction, extending conventional photometric stereo that requires multiple images of a static scene under varying illumination to dynamic scenarios. However, most existing approaches assume ideal distant lighting and Lambertian reflectance, leaving more practical near-light conditions and non-Lambertian surfaces underexplored. To overcome this limitation, we propose a framework that leverages neural implicit representations for depth and BRDF modeling under the assumption of mono-chromaticity (uniform chromaticity and homogeneous material), which alleviates the inherent ill-posedness of color photometric stereo and allows for detailed surface recovery from just one image. Furthermore, we design a compact optical tactile sensor to validate our approach. Experiments on both synthetic and real-world datasets demonstrate that our method achieves accurate and robust surface reconstruction.

168. 【2601.12664】Generalizable Hyperparameter Optimization for Federated Learning on Non-IID Cancer Images

链接https://arxiv.org/abs/2601.12664

作者:Elisa Gonçalves Ribeiro,Rodrigo Moreira,Larissa Ferreira Rodrigues Moreira,André Ricardo Backes

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Deep learning, histopathology training conflicts, clinical settings, training conflicts, conflicts with privacy

备注: 21st International Conference on Computer Vision Theory and Applications (VISAPP 2026), 9-11 March 2026, Marbella, Spain

点击查看摘要

Abstract:Deep learning for cancer histopathology training conflicts with privacy constraints in clinical settings. Federated Learning (FL) mitigates this by keeping data local; however, its performance depends on hyperparameter choices under non-independent and identically distributed (non-IID) client datasets. This paper examined whether hyperparameters optimized on one cancer imaging dataset generalized across non-IID federated scenarios. We considered binary histopathology tasks for ovarian and colorectal cancers. We perform centralized Bayesian hyperparameter optimization and transfer dataset-specific optima to the non-IID FL setup. The main contribution of this study is the introduction of a simple cross-dataset aggregation heuristic by combining configurations by averaging the learning rates and considering the modal optimizers and batch sizes. This combined configuration achieves a competitive classification performance.

169. 【2601.12638】Mixed Precision PointPillars for Efficient 3D Object Detection with TensorRT

链接https://arxiv.org/abs/2601.12638

作者:Ninnart Fuengfusin,Keisuke Yoneda,Naoki Suganuma

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:object detection, autonomous vehicles, important tasks, PTQ, LIDAR wide numerical

备注: 6 pages, 3 figures

点击查看摘要

Abstract:LIDAR 3D object detection is one of the important tasks for autonomous vehicles. Ensuring that this task operates in real-time is crucial. Toward this, model quantization can be used to accelerate the runtime. However, directly applying model quantization often leads to performance degradation due to LIDAR's wide numerical distributions and extreme outliers. To address the wide numerical distribution, we proposed a mixed precision framework designed for PointPillars. Our framework first searches for sensitive layers with post-training quantization (PTQ) by quantizing one layer at a time to 8-bit integer (INT8) and evaluating each model for average precision (AP). The top-k most sensitive layers are assigned as floating point (FP). Combinations of these layers are greedily searched to produce candidate mixed precision models, which are finalized with either PTQ or quantization-aware training (QAT). Furthermore, to handle outliers, we observe that using a very small number of calibration data reduces the likelihood of encountering outliers, thereby improving PTQ performance. Our methods provides mixed precision models without training in the PTQ pipeline, while our QAT pipeline achieves the performance competitive to FP models. With TensorRT deployment, our models offer less latency and sizes by up to 2.35 and 2.26 times, respectively.

170. 【2601.12636】From Bands to Depth: Understanding Bathymetry Decisions on Sentinel-2

链接https://arxiv.org/abs/2601.12636

作者:Satyaki Roy Chowdhury,Aswathnarayan Radhakrishnan,Hsiao Jou Hsu,Hari Subramoni,Joachim Moortgat

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:satellite derived bathymetry, sites remains challenging, satellite derived, derived bathymetry, remains challenging

备注: Accepted by WACV 2026

点击查看摘要

Abstract:Deploying Sentinel-2 satellite derived bathymetry (SDB) robustly across sites remains challenging. We analyze a Swin-Transformer based U-Net model (Swin-BathyUNet) to understand how it infers depth and when its predictions are trustworthy. A leave-one-band out study ranks spectral importance to the different bands consistent with shallow water optics. We adapt ablation-based CAM to regression (A-CAM-R) and validate the reliability via a performance retention test: keeping only the top-p% salient pixels while neutralizing the rest causes large, monotonic RMSE increase, indicating explanations localize on evidence the model relies on. Attention ablations show decoder conditioned cross attention on skips is an effective upgrade, improving robustness to glint/foam. Cross-region inference (train on one site, test on another) reveals depth-dependent degradation: MAE rises nearly linearly with depth, and bimodal depth distributions exacerbate mid/deep errors. Practical guidance follows: maintain wide receptive fields, preserve radiometric fidelity in green/blue channels, pre-filter bright high variance near shore, and pair light target site fine tuning with depth aware calibration to transfer across regions.

171. 【2601.12626】Linear Mechanisms for Spatiotemporal Reasoning in Vision Language Models

链接https://arxiv.org/abs/2601.12626

作者:Raphi Kang,Hongqiao Chen,Georgia Gkioxari,Pietro Perona

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:remain largely opaque, abilities remain largely, Vision Language Models, capability of Vision, Vision Language

备注

点击查看摘要

Abstract:Spatio-temporal reasoning is a remarkable capability of Vision Language Models (VLMs), but the underlying mechanisms of such abilities remain largely opaque. We postulate that visual/geometrical and textual representations of spatial structure must be combined at some point in VLM computations. We search for such confluence, and ask whether the identified representation can causally explain aspects of input-output model behavior through a linear model. We show empirically that VLMs encode object locations by linearly binding \textit{spatial IDs} to textual activations, then perform reasoning via language tokens. Through rigorous causal interventions we demonstrate that these IDs, which are ubiquitous across the model, can systematically mediate model beliefs at intermediate VLM layers. Additionally, we find that spatial IDs serve as a diagnostic tool for identifying limitations in existing VLMs, and as a valuable learning signal. We extend our analysis to video VLMs and identify an analogous linear temporal ID mechanism. By characterizing our proposed spatiotemporal ID mechanism, we elucidate a previously underexplored internal reasoning process in VLMs, toward improved interpretability and the principled design of more aligned and capable models. We release our code for reproducibility: this https URL.

172. 【2601.12624】owards Robust Universal Perturbation Attacks: A Float-Coded, Penalty-Driven Evolutionary Approach

链接https://arxiv.org/abs/2601.12624

作者:Shiqi Wang,Mahdi Khosravy,Neeraj Gupta,Olaf Witkowski

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:single noise pattern, garnered significant attention, significant attention due, undermine deep neural, deep neural networks

备注

点击查看摘要

Abstract:Universal adversarial perturbations (UAPs) have garnered significant attention due to their ability to undermine deep neural networks across multiple inputs using a single noise pattern. Evolutionary algorithms offer a promising approach to generating such perturbations due to their ability to navigate non-convex, gradient-free landscapes. In this work, we introduce a float-coded, penalty-driven single-objective evolutionary framework for UAP generation that achieves lower visibility perturbations while enhancing attack success rates. Our approach leverages continuous gene representations aligned with contemporary deep learning scales, incorporates dynamic evolutionary operators with adaptive scheduling, and utilizes a modular PyTorch implementation for seamless integration with modern architectures. Additionally, we ensure the universality of the generated perturbations by testing across diverse models and by periodically switching batches to prevent overfitting. Experimental results on the ImageNet dataset demonstrate that our framework consistently produces perturbations with smaller norms, higher misclassification effectiveness, and faster convergence compared to existing evolutionary-based methods. These findings highlight the robustness and scalability of our approach for universal adversarial attacks across various deep learning architectures.

173. 【2601.12567】Camera Pose Revisited

链接https://arxiv.org/abs/2601.12567

作者:Władysław Skarbek,Michał Salomonowicz,Michał Król

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Estimating the position, computer vision, multi-sensor systems, position and orientation, observed scene

备注: 30 pages, 9 figures, 9 tables

点击查看摘要

Abstract:Estimating the position and orientation of a camera with respect to an observed scene is one of the central problems in computer vision, particularly in the context of camera calibration and multi-sensor systems. This paper addresses the planar Perspective--$n$--Point problem, with special emphasis on the initial estimation of the pose of a calibration object. As a solution, we propose the \texttt{PnP-ProCay78} algorithm, which combines the classical quadratic formulation of the reconstruction error with a Cayley parameterization of rotations and least-squares optimization. The key component of the method is a deterministic selection of starting points based on an analysis of the reconstruction error for two canonical vectors, allowing costly solution-space search procedures to be avoided. Experimental validation is performed using data acquired also from high-resolution RGB cameras and very low-resolution thermal cameras in an integrated RGB--IR setup. The results demonstrate that the proposed algorithm achieves practically the same projection accuracy as optimal \texttt{SQPnP} and slightly higher than \texttt{IPPE}, both prominent \texttt{PnP-OpenCV} procedures. However, \texttt{PnP-ProCay78} maintains a significantly simpler algorithmic structure. Moreover, the analysis of optimization trajectories in Cayley space provides an intuitive insight into the convergence process, making the method attractive also from a didactic perspective. Unlike existing PnP solvers, the proposed \texttt{PnP-ProCay78} algorithm combines projection error minimization with an analytically eliminated reconstruction-error surrogate for translation, yielding a hybrid cost formulation that is both geometrically transparent and computationally efficient.

174. 【2601.12557】Life, Machine Learning, and the Search for Habitability: Predicting Biosignature Fluxes for the Habitable Worlds Observatory

链接https://arxiv.org/abs/2601.12557

作者:Mark Moussa,Amber V. Young,Brianna Isola,Vasuda Trehan,Michael D. Himes,Nicholas Wogan,Giada Arney

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Habitable Worlds Observatory, NASA Habitable Worlds, Worlds Observatory, face critical decisions, Future direct-imaging flagship

备注: 8 pages, 4 figures. Submitted and accepted in AAAI-26 (IAAI Emerging Applications track)

点击查看摘要

Abstract:Future direct-imaging flagship missions, such as NASA's Habitable Worlds Observatory (HWO), face critical decisions in prioritizing observations due to extremely stringent time and resource constraints. In this paper, we introduce two advanced machine-learning architectures tailored for predicting biosignature species fluxes from exoplanetary reflected-light spectra: a Bayesian Convolutional Neural Network (BCNN) and our novel model architecture, the Spectral Query Adaptive Transformer (SQuAT). The BCNN robustly quantifies both epistemic and aleatoric uncertainties, offering reliable predictions under diverse observational conditions, whereas SQuAT employs query-driven attention mechanisms to enhance interpretability by explicitly associating spectral features with specific biosignature species. We demonstrate that both models achieve comparably high predictive accuracy on an augmented dataset spanning a wide range of exoplanetary conditions, while highlighting their distinct advantages in uncertainty quantification and spectral interpretability. These capabilities position our methods as promising tools for accelerating target triage, optimizing observation schedules, and maximizing scientific return for upcoming flagship missions such as HWO.

175. 【2601.12551】PISE: Physics-Anchored Semantically-Enhanced Deep Computational Ghost Imaging for Robust Low-Bandwidth Machine Perception

链接https://arxiv.org/abs/2601.12551

作者:Tong Wu

类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:low-bandwidth edge perception, physics-informed deep ghost, deep ghost imaging, ghost imaging framework, edge perception

备注: 4 pages, 4 figures, 3 tables. Submitted to IEICE Transactions

点击查看摘要

Abstract:We propose PISE, a physics-informed deep ghost imaging framework for low-bandwidth edge perception. By combining adjoint operator initialization with semantic guidance, PISE improves classification accuracy by 2.57% and reduces variance by 9x at 5% sampling.

176. 【2601.12534】Encoding Emotion Through Self-Supervised Eye Movement Reconstruction

链接https://arxiv.org/abs/2601.12534

作者:Marcus Ma,Jordan Prescott,Emily Zhou,Tiantian Feng,Kleanthis Avramidis,Gabor Mihaly Toth,Shrikanth Narayanan

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:literature establishing gaze, establishing gaze patterns, eye movement, literature establishing, patterns are reliable

备注

点击查看摘要

Abstract:The relationship between emotional expression and eye movement is well-documented, with literature establishing gaze patterns are reliable indicators of emotion. However, most studies utilize specialized, high-resolution eye-tracking equipment, limiting the potential reach of findings. We investigate how eye movement can be used to predict multimodal markers of emotional expression from naturalistic, low-resolution videos. We utilize a collection of video interviews from the USC Shoah Foundation's Visual History Archive with Holocaust survivors as they recount their experiences in the Auschwitz concentration camp. Inspired by pretraining methods on language models, we develop a novel gaze detection model that uses self-supervised eye movement reconstruction that can effectively leverage unlabeled video. We use this model's encoder embeddings to fine-tune models on two downstream tasks related to emotional expression. The first is aligning eye movement with directional emotion estimates from speech. The second task is using eye gaze as a predictor of three momentary manifestations of emotional behaviors: laughing, crying/sobbing, and sighing. We find our new model is predictive of emotion outcomes and observe a positive correlation between pretraining performance and emotion processing performance for both experiments. We conclude self-supervised eye movement reconstruction is an effective method for encoding the affective signal they carry.

177. 【2601.12533】BirdsEye-RU: A Dataset For Detecting Faces from Overhead Images

链接https://arxiv.org/abs/2601.12533

作者:Md. Ahanaf Arif Khan,Ariful Islam,Sangeeta Biswas,Md. Iqbal Aziz Khan,Subrata Pramanik,Sanjoy Kumar Chakrabarty,Bimal Kumar Pramanik

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:significant challenge due, extreme scale variations, overhead images remains, Detecting faces, environmental clutter

备注

点击查看摘要

Abstract:Detecting faces in overhead images remains a significant challenge due to extreme scale variations and environmental clutter. To address this, we created the BirdsEye-RU dataset, a comprehensive collection of 2,978 images containing over eight thousand annotated faces. This dataset is specifically designed to capture small and distant faces across diverse environments, containing both drone images and smartphone-captured images from high altitude. We present a detailed description of the BirdsEye-RU dataset in this paper. We made our dataset freely available to the public, and it can be accessed at this https URL.

178. 【2601.12530】XRefine: Attention-Guided Keypoint Match Refinement

链接https://arxiv.org/abs/2601.12530

作者:Jan Fabian Schmid,Annika Hagemann

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Sparse keypoint matching, spatially inaccurate matches, produce spatially inaccurate, vision tasks, Sparse keypoint

备注

点击查看摘要

Abstract:Sparse keypoint matching is crucial for 3D vision tasks, yet current keypoint detectors often produce spatially inaccurate matches. Existing refinement methods mitigate this issue through alignment of matched keypoint locations, but they are typically detector-specific, requiring retraining for each keypoint detector. We introduce XRefine, a novel, detector-agnostic approach for sub-pixel keypoint refinement that operates solely on image patches centered at matched keypoints. Our cross-attention-based architecture learns to predict refined keypoint coordinates without relying on internal detector representations, enabling generalization across detectors. Furthermore, XRefine can be extended to handle multi-view feature tracks. Experiments on MegaDepth, KITTI, and ScanNet demonstrate that the approach consistently improves geometric estimation accuracy, achieving superior performance compared to existing refinement methods while maintaining runtime efficiency. Our code and trained models can be found at this https URL.

179. 【2601.12527】Deep Feature Deformation Weights

链接https://arxiv.org/abs/2601.12527

作者:Richard Liu,Itai Lang,Rana Hanocka

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词:Handle-based mesh deformation, Handle-based mesh, enabling intuitive shape, computer graphics, enabling intuitive

备注

点击查看摘要

Abstract:Handle-based mesh deformation has been a long-standing paradigm in computer graphics, enabling intuitive shape edits from sparse controls. Classic techniques offer precise and rapid deformation control. However, they solve an optimization problem with constraints defined by control handle placement, requiring a user to know apriori the ideal distribution of handles on the shape to accomplish the desired edit. The mapping from handle set to deformation behavior is often unintuitive and, importantly, non-semantic. Modern data-driven methods, on the other hand, leverage a data prior to obtain semantic edits, but are slow and imprecise. We propose a technique that fuses the semantic prior of data with the precise control and speed of traditional frameworks. Our approach is surprisingly simple yet effective: deep feature proximity makes for smooth and semantic deformation weights, with no need for additional regularization. The weights can be computed in real-time for any surface point, whereas prior methods require optimization for new handles. Moreover, the semantic prior from deep features enables co-deformation of semantic parts. We introduce an improved feature distillation pipeline, barycentric feature distillation, which efficiently uses the visual signal from shape renders to minimize distillation cost. This allows our weights to be computed for high resolution meshes in under a minute, in contrast to potentially hours for both classical and neural methods. We preserve and extend properties of classical methods through feature space constraints and locality weighting. Our field representation allows for automatic detection of semantic symmetries, which we use to produce symmetry-preserving deformations. We show a proof-of-concept application which can produce deformations for meshes up to 1 million faces in real-time on a consumer-grade machine.

180. 【2601.12512】Fine-Tuning Cycle-GAN for Domain Adaptation of MRI Images

链接https://arxiv.org/abs/2601.12512

作者:Mohd Usama,Belal Ahmad,Faleh Menawer R Althiyabi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Magnetic Resonance Imaging, Magnetic Resonance, Resonance Imaging, domain shifts owing, scans acquired

备注: 14 pages, 9 figures, 2 tables

点击查看摘要

Abstract:Magnetic Resonance Imaging (MRI) scans acquired from different scanners or institutions often suffer from domain shifts owing to variations in hardware, protocols, and acquisition parameters. This discrepancy degrades the performance of deep learning models trained on source domain data when applied to target domain images. In this study, we propose a Cycle-GAN-based model for unsupervised medical-image domain adaptation. Leveraging CycleGANs, our model learns bidirectional mappings between the source and target domains without paired training data, preserving the anatomical content of the images. By leveraging Cycle-GAN capabilities with content and disparity loss for adaptation tasks, we ensured image-domain adaptation while maintaining image integrity. Several experiments on MRI datasets demonstrated the efficacy of our model in bidirectional domain adaptation without labelled data. Furthermore, research offers promising avenues for improving the diagnostic accuracy of healthcare. The statistical results confirm that our approach improves model performance and reduces domain-related variability, thus contributing to more precise and consistent medical image analysis.

181. 【2601.12507】SDCoNet: Saliency-Driven Multi-Task Collaborative Network for Remote Sensing Object Detection

链接https://arxiv.org/abs/2601.12507

作者:Ruo Qi,Linhui Dai,Yusong Qin,Chaolei Yang,Yanshan Li

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:scales make accurate, make accurate detection, object scales make, low-quality imaging conditions, imaging conditions

备注

点击查看摘要

Abstract:In remote sensing images, complex backgrounds, weak object signals, and small object scales make accurate detection particularly challenging, especially under low-quality imaging conditions. A common strategy is to integrate single-image super-resolution (SR) before detection; however, such serial pipelines often suffer from misaligned optimization objectives, feature redundancy, and a lack of effective interaction between SR and detection. To address these issues, we propose a Saliency-Driven multi-task Collaborative Network (SDCoNet) that couples SR and detection through implicit feature sharing while preserving task specificity. SDCoNet employs the swin transformer-based shared encoder, where hierarchical window-shifted self-attention supports cross-task feature collaboration and adaptively balances the trade-off between texture refinement and semantic representation. In addition, a multi-scale saliency prediction module produces importance scores to select key tokens, enabling focused attention on weak object regions, suppression of background clutter, and suppression of adverse features introduced by multi-task coupling. Furthermore, a gradient routing strategy is introduced to mitigate optimization conflicts. It first stabilizes detection semantics and subsequently routes SR gradients along a detection-oriented direction, enabling the framework to guide the SR branch to generate high-frequency details that are explicitly beneficial for detection. Experiments on public datasets, including NWPU VHR-10-Split, DOTAv1.5-Split, and HRSSD-Split, demonstrate that the proposed method, while maintaining competitive computational efficiency, significantly outperforms existing mainstream algorithms in small object detection on low-quality remote sensing images. Our code is available at this https URL.

182. 【2601.12500】Video Individual Counting and Tracking from Moving Drones: A Benchmark and Methods

链接https://arxiv.org/abs/2601.12500

作者:Yaowu Fan,Jia Wan,Tao Han,Andy J. Ma,Antoni B. Chan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:provide limited spatial, limited spatial coverage, large-scale dense crowd, dense crowd analysis, dense crowd counting

备注

点击查看摘要

Abstract:Counting and tracking dense crowds in large-scale scenes is highly challenging, yet existing methods mainly rely on datasets captured by fixed cameras, which provide limited spatial coverage and are inadequate for large-scale dense crowd analysis. To address this limitation, we propose a flexible solution using moving drones to capture videos and perform video-level crowd counting and tracking of unique pedestrians across entire scenes. We introduce MovingDroneCrowd++, the largest video-level dataset for dense crowd counting and tracking captured by moving drones, covering diverse and complex conditions with varying flight altitudes, camera angles, and illumination. Existing methods fail to achieve satisfactory performance on this dataset. To this end, we propose GD3A (Global Density Map Decomposition via Descriptor Association), a density map-based video individual counting method that avoids explicit localization. GD3A establishes pixel-level correspondences between pedestrian descriptors across consecutive frames via optimal transport with an adaptive dustbin score, enabling the decomposition of global density maps into shared, inflow, and outflow components. Building on this framework, we further introduce DVTrack, which converts descriptor-level matching into instance-level associations through a descriptor voting mechanism for pedestrian tracking. Experimental results show that our methods significantly outperform existing approaches under dense crowds and complex motion, reducing counting error by 47.4 percent and improving tracking performance by 39.2 percent.

183. 【2601.12493】Histopath-C: Towards Realistic Domain Shifts for Histopathology Vision-Language Adaptation

链接https://arxiv.org/abs/2601.12493

作者:Mehrdad Noori,Gustavo Adolfo Vargas Hakim,David Osowiechi,Fereshteh Shakeri,Ali Bahri,Moslem Yazdanpanah,Sahar Dastani,Ismail Ben Ayed,Christian Desrosiers

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Medical Vision-language models, Medical Vision-language, medical imaging domains, Vision-language models, shown remarkable performances

备注: Accepted to WACV 2026

点击查看摘要

Abstract:Medical Vision-language models (VLMs) have shown remarkable performances in various medical imaging domains such as histo\-pathology by leveraging pre-trained, contrastive models that exploit visual and textual information. However, histopathology images may exhibit severe domain shifts, such as staining, contamination, blurring, and noise, which may severely degrade the VLM's downstream performance. In this work, we introduce Histopath-C, a new benchmark with realistic synthetic corruptions designed to mimic real-world distribution shifts observed in digital histopathology. Our framework dynamically applies corruptions to any available dataset and evaluates Test-Time Adaptation (TTA) mechanisms on the fly. We then propose LATTE, a transductive, low-rank adaptation strategy that exploits multiple text templates, mitigating the sensitivity of histopathology VLMs to diverse text inputs. Our approach outperforms state-of-the-art TTA methods originally designed for natural images across a breadth of histopathology datasets, demonstrating the effectiveness of our proposed design for robust adaptation in histopathology images. Code and data are available at this https URL.

184. 【2601.12481】NeuralFur: Animal Fur Reconstruction From Multi-View Images

链接https://arxiv.org/abs/2601.12481

作者:Vanessa Sklyarova,Berna Kabadayi,Anastasios Yiannakidis,Giorgio Becherini,Michael J. Black,Justus Thies

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词:challenging task due, Reconstructing realistic animal, Reconstructing realistic, fine-scale details, challenging task

备注: For additional results and code, please refer to [this https URL](https://neuralfur.is.tue.mpg.de)

点击查看摘要

Abstract:Reconstructing realistic animal fur geometry from images is a challenging task due to the fine-scale details, self-occlusion, and view-dependent appearance of fur. In contrast to human hairstyle reconstruction, there are also no datasets that can be leveraged to learn a fur prior for different animals. In this work, we present a first multi-view-based method for high-fidelity 3D fur modeling of animals using a strand-based representation, leveraging the general knowledge of a vision language model. Given multi-view RGB images, we first reconstruct a coarse surface geometry using traditional multi-view stereo techniques. We then use a vision language model (VLM) system to retrieve information about the realistic length structure of the fur for each part of the body. We use this knowledge to construct the animal's furless geometry and grow strands atop it. The fur reconstruction is supervised with both geometric and photometric losses computed from multi-view images. To mitigate orientation ambiguities stemming from the Gabor filters that are applied to the input images, we additionally utilize the VLM to guide the strands' growth direction and their relation to the gravity vector that we incorporate as a loss. With this new schema of using a VLM to guide 3D reconstruction from multi-view inputs, we show generalization across a variety of animals with different fur types. For additional results and code, please refer to this https URL.

185. 【2601.12468】DCAC: Dynamic Class-Aware Cache Creates Stronger Out-of-Distribution Detectors

链接https://arxiv.org/abs/2601.12468

作者:Yanqi Wu,Qichao Chen,Runhe Lai,Xinhua Lu,Jia-Xin Zhuang,Zhilin Zhao,Wei-Shi Zheng,Ruixuan Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:deep neural networks, unseen OOD samples, neural networks, OOD, OOD samples

备注: 9 pages, 9 figures, Accepted by AAAI2026

点击查看摘要

Abstract:Out-of-distribution (OOD) detection remains a fundamental challenge for deep neural networks, particularly due to overconfident predictions on unseen OOD samples during testing. We reveal a key insight: OOD samples predicted as the same class, or given high probabilities for it, are visually more similar to each other than to the true in-distribution (ID) samples. Motivated by this class-specific observation, we propose DCAC (Dynamic Class-Aware Cache), a training-free, test-time calibration module that maintains separate caches for each ID class to collect high-entropy samples and calibrate the raw predictions of input samples. DCAC leverages cached visual features and predicted probabilities through a lightweight two-layer module to mitigate overconfident predictions on OOD samples. This module can be seamlessly integrated with various existing OOD detection methods across both unimodal and vision-language models while introducing minimal computational overhead. Extensive experiments on multiple OOD benchmarks demonstrate that DCAC significantly enhances existing methods, achieving substantial improvements, i.e., reducing FPR95 by 6.55% when integrated with ASH-S on ImageNet OOD benchmark.

186. 【2601.12464】Large-scale EM Benchmark for Multi-Organelle Instance Segmentation in the Wild

链接https://arxiv.org/abs/2601.12464

作者:Yanrui Lu,Danyang Chen,Haowen Xiao,Jiarui Zhu,Fukang Ge,Binqian Zou,Jiali Guan,Jiayin Liang,Yuting Wang,Ziqian Guan,Xiangcheng Bao,Jinhao Bi,Lin Gu,Jun He,Yingying Zhu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Accurate instance-level segmentation, Accurate instance-level, electron microscopy, inter-organelle interactions, critical for quantitative

备注

点击查看摘要

Abstract:Accurate instance-level segmentation of organelles in electron microscopy (EM) is critical for quantitative analysis of subcellular morphology and inter-organelle interactions. However, current benchmarks, based on small, curated datasets, fail to capture the inherent heterogeneity and large spatial context of in-the-wild EM data, imposing fundamental limitations on current patch-based methods. To address these limitations, we developed a large-scale, multi-source benchmark for multi-organelle instance segmentation, comprising over 100,000 2D EM images across variety cell types and five organelle classes that capture real-world variability. Dataset annotations were generated by our designed connectivity-aware Label Propagation Algorithm (3D LPA) with expert refinement. We further benchmarked several state-of-the-art models, including U-Net, SAM variants, and Mask2Former. Our results show several limitations: current models struggle to generalize across heterogeneous EM data and perform poorly on organelles with global, distributed morphologies (e.g., Endoplasmic Reticulum). These findings underscore the fundamental mismatch between local-context models and the challenge of modeling long-range structural continuity in the presence of real-world variability. The benchmark dataset and labeling tool will be publicly released soon.

187. 【2601.12447】Privacy-Preserving Federated Learning with Verifiable Fairness Guarantees

链接https://arxiv.org/abs/2601.12447

作者:Mohammed Himayath Ali,Mohammed Aqib Abdullah,Syed Muneer Hussin,Mohammed Mudassir Uddin,Shahnawaz Alam

类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词:remains fundamentally unresolved, centralizing sensitive data, collaborative model training, heterogeneous data distributions, enables collaborative model

备注

点击查看摘要

Abstract:Federated learning enables collaborative model training across distributed institutions without centralizing sensitive data; however, ensuring algorithmic fairness across heterogeneous data distributions while preserving privacy remains fundamentally unresolved. This paper introduces CryptoFair-FL, a novel cryptographic framework providing the first verifiable fairness guarantees for federated learning systems under formal security definitions. The proposed approach combines additively homomorphic encryption with secure multi-party computation to enable privacy-preserving verification of demographic parity and equalized odds metrics without revealing protected attribute distributions or individual predictions. A novel batched verification protocol reduces computational complexity from BigO(n^2) to BigO(n \log n) while maintaining (\dparam, \deltap)-differential privacy with dparam = 0.5 and deltap = 10^{-6}. Theoretical analysis establishes information-theoretic lower bounds on the privacy cost of fairness verification, demonstrating that the proposed protocol achieves near-optimal privacy-fairness tradeoffs. Comprehensive experiments across four benchmark datasets (MIMIC-IV healthcare records, Adult Income, CelebA, and a novel FedFair-100 benchmark) demonstrate that CryptoFair-FL reduces fairness violations from 0.231 to 0.031 demographic parity difference while incurring only 2.3 times computational overhead compared to standard federated averaging. The framework successfully defends against attribute inference attacks, maintaining adversarial success probability below 0.05 across all tested configurations. These results establish a practical pathway for deploying fairness-aware federated learning in regulated industries requiring both privacy protection and algorithmic accountability.

188. 【2601.12443】Adversarial Defense in Vision-Language Models: An Overview

链接https://arxiv.org/abs/2601.12443

作者:Xiaowei Fu,Lei Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Vision Language Models, Vision Language, Test-time Adaptation Defense, Language Models, defense

备注

点击查看摘要

Abstract:The widespread use of Vision Language Models (VLMs, e.g. CLIP) has raised concerns about their vulnerability to sophisticated and imperceptible adversarial attacks. These attacks could compromise model performance and system security in cross-modal tasks. To address this challenge, three main defense paradigms have been proposed: Training-time Defense, Test-time Adaptation Defense, and Training-free Defense. Training-time Defense involves modifying the training process, typically through adversarial fine-tuning to improve the robustness to adversarial examples. While effective, this approach requires substantial computational resources and may not generalize across all adversarial attacks. Test-time Adaptation Defense focuses on adapting the model at inference time by updating its parameters to handle unlabeled adversarial examples, offering flexibility but often at the cost of increased complexity and computational overhead. Training-free Defense avoids modifying the model itself, instead focusing on altering the adversarial inputs or their feature embeddings, which enforces input perturbations to mitigate the impact of attacks without additional training. This survey reviews the latest advancements in adversarial defense strategies for VLMs, highlighting the strengths and limitations of such approaches and discussing ongoing challenges in enhancing the robustness of VLMs.

189. 【2601.12442】Constraint-Aware Neurosymbolic Uncertainty Quantification with Bayesian Deep Learning for Scientific Discovery

链接https://arxiv.org/abs/2601.12442

作者:Shahnawaz Alam,Mohammed Mudassir Uddin,Mohammed Kaif Pasha

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Scientific Artificial Intelligence, Artificial Intelligence, applications require models, trustworthy uncertainty estimates, deliver trustworthy uncertainty

备注

点击查看摘要

Abstract:Scientific Artificial Intelligence (AI) applications require models that deliver trustworthy uncertainty estimates while respecting domain constraints. Existing uncertainty quantification methods lack mechanisms to incorporate symbolic scientific knowledge, while neurosymbolic approaches operate deterministically without principled uncertainty modeling. We introduce the Constraint-Aware Neurosymbolic Uncertainty Framework (CANUF), unifying Bayesian deep learning with differentiable symbolic reasoning. The architecture comprises three components: automated constraint extraction from scientific literature, probabilistic neural backbone with variational inference, and differentiable constraint satisfaction layer ensuring physical consistency. Experiments on Materials Project (140,000+ materials), QM9 molecular properties, and climate benchmarks show CANUF reduces Expected Calibration Error by 34.7% versus Bayesian neural networks while maintaining 99.2% constraint satisfaction. Ablations reveal constraint-guided recalibration contributes 18.3% performance gain, with constraint extraction achieving 91.4% precision. CANUF provides the first end-to-end differentiable pipeline simultaneously addressing uncertainty quantification, constraint satisfaction, and interpretable explanations for scientific predictions.

190. 【2601.12432】SkeFi: Cross-Modal Knowledge Transfer for Wireless Skeleton-Based Action Recognition

链接https://arxiv.org/abs/2601.12432

作者:Shunyu Huang,Yunjiao Zhou,Jianfei Yang

类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词:shows superior generalization, Skeleton-based action recognition, recognition leverages human, categorize human actions, leverages human pose

备注: Published in IEEE Internet of Things Journal

点击查看摘要

Abstract:Skeleton-based action recognition leverages human pose keypoints to categorize human actions, which shows superior generalization and interoperability compared to regular end-to-end action recognition. Existing solutions use RGB cameras to annotate skeletal keypoints, but their performance declines in dark environments and raises privacy concerns, limiting their use in smart homes and hospitals. This paper explores non-invasive wireless sensors, i.e., LiDAR and mmWave, to mitigate these challenges as a feasible alternative. Two problems are addressed: (1) insufficient data on wireless sensor modality to train an accurate skeleton estimation model, and (2) skeletal keypoints derived from wireless sensors are noisier than RGB, causing great difficulties for subsequent action recognition models. Our work, SkeFi, overcomes these gaps through a novel cross-modal knowledge transfer method acquired from the data-rich RGB modality. We propose the enhanced Temporal Correlation Adaptive Graph Convolution (TC-AGC) with frame interactive enhancement to overcome the noise from missing or inconsecutive frames. Additionally, our research underscores the effectiveness of enhancing multiscale temporal modeling through dual temporal convolution. By integrating TC-AGC with temporal modeling for cross-modal transfer, our framework can extract accurate poses and actions from noisy wireless sensors. Experiments demonstrate that SkeFi realizes state-of-the-art performances on mmWave and LiDAR. The code is available at this https URL.

191. 【2601.12428】ReWorld: Multi-Dimensional Reward Modeling for Embodied World Models

链接https://arxiv.org/abs/2601.12428

作者:Baorui Peng,Wenyao Zhang,Liang Xu,Zekun Qi,Jiazhao Zhang,Hongsi Liu,Wenjun Zeng,Xin Jin

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:gained increasing attention, learn to simulate, gained increasing, increasing attention, attention in robot

备注

点击查看摘要

Abstract:Recently, video-based world models that learn to simulate the dynamics have gained increasing attention in robot learning. However, current approaches primarily emphasize visual generative quality while overlooking physical fidelity, dynamic consistency, and task logic, especially for contact-rich manipulation tasks, which limits their applicability to downstream tasks. To this end, we introduce ReWorld, a framework aimed to employ reinforcement learning to align the video-based embodied world models with physical realism, task completion capability, embodiment plausibility and visual quality. Specifically, we first construct a large-scale (~235K) video preference dataset and employ it to train a hierarchical reward model designed to capture multi-dimensional reward consistent with human preferences. We further propose a practical alignment algorithm that post-trains flow-based world models using this reward through a computationally efficient PPO-style algorithm. Comprehensive experiments and theoretical analysis demonstrate that ReWorld significantly improves the physical fidelity, logical coherence, embodiment and visual quality of generated rollouts, outperforming previous methods.

192. 【2601.12423】HOT-POT: Optimal Transport for Sparse Stereo Matching

链接https://arxiv.org/abs/2601.12423

作者:Antonin Clerc,Michael Quellmalz,Moritz Piening,Philipp Flotho,Gregor Kornhardt,Gabriele Steidl

类目:Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)

关键词:including occlusions, range of challenges, autonomous driving, images faces, vision between images

备注: 18 pages, 10 figures, 6 tables

点击查看摘要

Abstract:Stereo vision between images faces a range of challenges, including occlusions, motion, and camera distortions, across applications in autonomous driving, robotics, and face analysis. Due to parameter sensitivity, further complications arise for stereo matching with sparse features, such as facial landmarks. To overcome this ill-posedness and enable unsupervised sparse matching, we consider line constraints of the camera geometry from an optimal transport (OT) viewpoint. Formulating camera-projected points as (half)lines, we propose the use of the classical epipolar distance as well as a 3D ray distance to quantify matching quality. Employing these distances as a cost function of a (partial) OT problem, we arrive at efficiently solvable assignment problems. Moreover, we extend our approach to unsupervised object matching by formulating it as a hierarchical OT problem. The resulting algorithms allow for efficient feature and object matching, as demonstrated in our numerical experiments. Here, we focus on applications in facial analysis, where we aim to match distinct landmarking conventions.

193. 【2601.12402】Weaknesses of Facial Emotion Recognition Systems

链接https://arxiv.org/abs/2601.12402

作者:Aleksandra Jamróz,Patrycja Wysocka,Piotr Garbat

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:machine learning problems, learning problems needed, human-computer interaction, detection from faces, machine learning

备注

点击查看摘要

Abstract:Emotion detection from faces is one of the machine learning problems needed for human-computer interaction. The variety of methods used is enormous, which motivated an in-depth review of articles and scientific studies. Three of the most interesting and best solutions are selected, followed by the selection of three datasets that stood out for the diversity and number of images in them. The selected neural networks are trained, and then a series of experiments are performed to compare their performance, including testing on different datasets than a model was trained on. This reveals weaknesses in existing solutions, including differences between datasets, unequal levels of difficulty in recognizing certain emotions and the challenges in differentiating between closely related emotions.

194. 【2601.12391】Class-Partitioned VQ-VAE and Latent Flow Matching for Point Cloud Scene Generation

链接https://arxiv.org/abs/2601.12391

作者:Dasith de Silva Edirimuni,Ajmal Saeed Mian

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:bounding box parameters, generating object bounding, object bounding box, newer diffusion methods, bounding box

备注: Accepted to AAAI 2026, Main Technical Track

点击查看摘要

Abstract:Most 3D scene generation methods are limited to only generating object bounding box parameters while newer diffusion methods also generate class labels and latent features. Using object size or latent feature, they then retrieve objects from a predefined database. For complex scenes of varied, multi-categorical objects, diffusion-based latents cannot be effectively decoded by current autoencoders into the correct point cloud objects which agree with target classes. We introduce a Class-Partitioned Vector Quantized Variational Autoencoder (CPVQ-VAE) that is trained to effectively decode object latent features, by employing a pioneering $\textit{class-partitioned codebook}$ where codevectors are labeled by class. To address the problem of $\textit{codebook collapse}$, we propose a $\textit{class-aware}$ running average update which reinitializes dead codevectors within each partition. During inference, object features and class labels, both generated by a Latent-space Flow Matching Model (LFMM) designed specifically for scene generation, are consumed by the CPVQ-VAE. The CPVQ-VAE's class-aware inverse look-up then maps generated latents to codebook entries that are decoded to class-specific point cloud shapes. Thereby, we achieve pure point cloud generation without relying on an external objects database for retrieval. Extensive experiments reveal that our method reliably recovers plausible point cloud scenes, with up to 70.4% and 72.3% reduction in Chamfer and Point2Mesh errors on complex living room scenes.

195. 【2601.12382】A Hierarchical Benchmark of Foundation Models for Dermatology

链接https://arxiv.org/abs/2601.12382

作者:Furkan Yuceyalcin,Abdurrahim Yilmaz,Burak Temelkuran

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:large-scale task-specific training, providing robust feature, robust feature representations, transformed medical image, medical image analysis

备注

点击查看摘要

Abstract:Foundation models have transformed medical image analysis by providing robust feature representations that reduce the need for large-scale task-specific training. However, current benchmarks in dermatology often reduce the complex diagnostic taxonomy to flat, binary classification tasks, such as distinguishing melanoma from benign nevi. This oversimplification obscures a model's ability to perform fine-grained differential diagnoses, which is critical for clinical workflow integration. This study evaluates the utility of embeddings derived from ten foundation models, spanning general computer vision, general medical imaging, and dermatology-specific domains, for hierarchical skin lesion classification. Using the DERM12345 dataset, which comprises 40 lesion subclasses, we calculated frozen embeddings and trained lightweight adapter models using a five-fold cross-validation. We introduce a hierarchical evaluation framework that assesses performance across four levels of clinical granularity: 40 Subclasses, 15 Main Classes, 2 and 4 Superclasses, and Binary Malignancy. Our results reveal a "granularity gap" in model capabilities: MedImageInsights achieved the strongest overall performance (97.52% weighted F1-Score on Binary Malignancy detection) but declined to 65.50% on fine-grained 40-class subtype classification. Conversely, MedSigLip (69.79%) and dermatology-specific models (Derm Foundation and MONET) excelled at fine-grained 40-class subtype discrimination while achieving lower overall performance than MedImageInsights on broader classification tasks. Our findings suggest that while general medical foundation models are highly effective for high-level screening, specialized modeling strategies are necessary for the granular distinctions required in diagnostic support systems.

196. 【2601.12379】Utilizing the Score of Data Distribution for Hyperspectral Anomaly Detection

链接https://arxiv.org/abs/2601.12379

作者:Jiahui Sheng,Yidan Shi,Shu Xiang,Xiaorun Li,Shuhan Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Hyperspectral images, abundant spectral information, Hyperspectral, hyperspectral manifold hypothesis, hyperspectral anomaly detection

备注

点击查看摘要

Abstract:Hyperspectral images (HSIs) are a type of image that contains abundant spectral information. As a type of real-world data, the high-dimensional spectra in hyperspectral images are actually determined by only a few factors, such as chemical composition and illumination. Thus, spectra in hyperspectral images are highly likely to satisfy the manifold hypothesis. Based on the hyperspectral manifold hypothesis, we propose a novel hyperspectral anomaly detection method (named ScoreAD) that leverages the time-dependent gradient field of the data distribution (i.e., the score), as learned by a score-based generative model (SGM). Our method first trains the SGM on the entire set of spectra from the hyperspectral image. At test time, each spectrum is passed through a perturbation kernel, and the resulting perturbed spectrum is fed into the trained SGM to obtain the estimated score. The manifold hypothesis of HSIs posits that background spectra reside on one or more low-dimensional manifolds. Conversely, anomalous spectra, owing to their unique spectral signatures, are considered outliers that do not conform to the background manifold. Based on this fundamental discrepancy in their manifold distributions, we leverage a generative SGM to achieve hyperspectral anomaly detection. Experiments on the four hyperspectral datasets demonstrate the effectiveness of the proposed method. The code is available at this https URL.

197. 【2601.12373】CD-TWINSAFE: A ROS-enabled Digital Twin for Scene Understanding and Safety Emerging V2I Technology

链接https://arxiv.org/abs/2601.12373

作者:Amro Khaled,Farah Khaled,Omar Riad,Catherine M. Elias

类目:Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Robotics (cs.RO)

关键词:CD-TWINSAFE is introduced, digital twin stack, digital twin, Unreal Engine, on-board driving stack

备注

点击查看摘要

Abstract:In this paper, the CD-TWINSAFE is introduced, a V2I-based digital twin for Autonomous Vehicles. The proposed architecture is composed of two stacks running simultaneously, an on-board driving stack that includes a stereo camera for scene understanding, and a digital twin stack that runs an Unreal Engine 5 replica of the scene viewed by the camera as well as returning safety alerts to the cockpit. The on-board stack is implemented on the vehicle side including 2 main autonomous modules; localization and perception. The position and orientation of the ego vehicle are obtained using on-board sensors. Furthermore, the perception module is responsible for processing 20-fps images from stereo camera and understands the scene through two complementary pipelines. The pipeline are working on object detection and feature extraction including object velocity, yaw and the safety metrics time-to-collision and time-headway. The collected data form the driving stack are sent to the infrastructure side through the ROS-enabled architecture in the form of custom ROS2 messages and sent over UDP links that ride a 4G modem for V2I communication. The environment is monitored via the digital twin through the shared messages which update the information of the spawned ego vehicle and detected objects based on the real-time localization and perception data. Several tests with different driving scenarios to confirm the validity and real-time response of the proposed architecture.

198. 【2601.12366】DepthCropSeg++: Scaling a Crop Segmentation Foundation Model With Depth-Labeled Data

链接https://arxiv.org/abs/2601.12366

作者:Jiafei Zhang,Songliang Cao,Binghui Xu,Yanan Li,Weiwei Jia,Tingting Wu,Hao Lu,Weijuan Hu,Zhiguo Han

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:open in-field environment, crop segmentation, capable of segmenting, crop, segmentation

备注: 13 pages, 15 figures and 7 tables

点击查看摘要

Abstract:DepthCropSeg++: a foundation model for crop segmentation, capable of segmenting different crop species under open in-field environment. Crop segmentation is a fundamental task for modern agriculture, which closely relates to many downstream tasks such as plant phenotyping, density estimation, and weed control. In the era of foundation models, a number of generic large language and vision models have been developed. These models have demonstrated remarkable real world generalization due to significant model capacity and largescale datasets. However, current crop segmentation models mostly learn from limited data due to expensive pixel-level labelling cost, often performing well only under specific crop types or controlled environment. In this work, we follow the vein of our previous work DepthCropSeg, an almost unsupervised approach to crop segmentation, to scale up a cross-species and crossscene crop segmentation dataset, with 28,406 images across 30+ species and 15 environmental conditions. We also build upon a state-of-the-art semantic segmentation architecture ViT-Adapter architecture, enhance it with dynamic upsampling for improved detail awareness, and train the model with a two-stage selftraining pipeline. To systematically validate model performance, we conduct comprehensive experiments to justify the effectiveness and generalization capabilities across multiple crop datasets. Results demonstrate that DepthCropSeg++ achieves 93.11% mIoU on a comprehensive testing set, outperforming both supervised baselines and general-purpose vision foundation models like Segmentation Anything Model (SAM) by significant margins (+0.36% and +48.57% respectively). The model particularly excels in challenging scenarios including night-time environment (86.90% mIoU), high-density canopies (90.09% mIoU), and unseen crop varieties (90.09% mIoU), indicating a new state of the art for crop segmentation.

199. 【2601.12358】From Prompts to Pavement: LMMs-based Agentic Behavior-Tree Generation Framework for Autonomous Vehicles

链接https://arxiv.org/abs/2601.12358

作者:Omar Y. Goba,Ahmed Y. Gado,Catherine M. Elias,Ahmed Hussein

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

关键词:real-world environments safely, require adaptive behavior, Autonomous vehicles, adaptive behavior planners, require adaptive

备注

点击查看摘要

Abstract:Autonomous vehicles (AVs) require adaptive behavior planners to navigate unpredictable, real-world environments safely. Traditional behavior trees (BTs) offer structured decision logic but are inherently static and demand labor-intensive manual tuning, limiting their applicability at SAE Level 5 autonomy. This paper presents an agentic framework that leverages large language models (LLMs) and multi-modal vision models (LVMs) to generate and adapt BTs on the fly. A specialized Descriptor agent applies chain-of-symbols prompting to assess scene criticality, a Planner agent constructs high-level sub-goals via in-context learning, and a Generator agent synthesizes executable BT sub-trees in XML format. Integrated into a CARLA+Nav2 simulation, our system triggers only upon baseline BT failure, demonstrating successful navigation around unexpected obstacles (e.g., street blockage) with no human intervention. Compared to a static BT baseline, this approach is a proof-of-concept that extends to diverse driving scenarios.

200. 【2601.12357】SimpleMatch: A Simple and Strong Baseline for Semantic Correspondence

链接https://arxiv.org/abs/2601.12357

作者:Hailing Jin,Huiying Li

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:pre-trained large-scale models, Recent advances, large-scale models, largely driven, pre-trained large-scale

备注

点击查看摘要

Abstract:Recent advances in semantic correspondence have been largely driven by the use of pre-trained large-scale models. However, a limitation of these approaches is their dependence on high-resolution input images to achieve optimal performance, which results in considerable computational overhead. In this work, we address a fundamental limitation in current methods: the irreversible fusion of adjacent keypoint features caused by deep downsampling operations. This issue is triggered when semantically distinct keypoints fall within the same downsampled receptive field (e.g., 16x16 patches). To address this issue, we present SimpleMatch, a simple yet effective framework for semantic correspondence that delivers strong performance even at low resolutions. We propose a lightweight upsample decoder that progressively recovers spatial detail by upsampling deep features to 1/4 resolution, and a multi-scale supervised loss that ensures the upsampled features retain discriminative features across different spatial scales. In addition, we introduce sparse matching and window-based localization to optimize training memory usage and reduce it by 51%. At a resolution of 252x252 (3.3x smaller than current SOTA methods), SimpleMatch achieves superior performance with 84.1% PCK@0.1 on the SPair-71k benchmark. We believe this framework provides a practical and efficient baseline for future research in semantic correspondence. Code is available at: this https URL.

201. 【2601.12346】MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents

链接https://arxiv.org/abs/2601.12346

作者:Peizhou Huang,Zixuan Zhong,Zhongwei Wan,Donghao Zhou,Samiul Alam,Xin Wang,Zexin Li,Zhihao Dou,Li Zhu,Jing Xiong,Chaofan Tao,Yan Xu,Dimitrios Dimitriadis,Tuo Zhang,Mi Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:target text-only settings, generate citation-rich reports, Deep Research Agents, generate citation-rich, Research Agents

备注

点击查看摘要

Abstract:Deep Research Agents (DRAs) generate citation-rich reports via multi-step search and synthesis, yet existing benchmarks mainly target text-only settings or short-form multimodal QA, missing end-to-end multimodal evidence use. We introduce MMDeepResearch-Bench (MMDR-Bench), a benchmark of 140 expert-crafted tasks across 21 domains, where each task provides an image-text bundle to evaluate multimodal understanding and citation-grounded report generation. Compared to prior setups, MMDR-Bench emphasizes report-style synthesis with explicit evidence use, where models must connect visual artifacts to sourced claims and maintain consistency across narrative, citations, and visual references. We further propose a unified, interpretable evaluation pipeline: Formula-LLM Adaptive Evaluation (FLAE) for report quality, Trustworthy Retrieval-Aligned Citation Evaluation (TRACE) for citation-grounded evidence alignment, and Multimodal Support-Aligned Integrity Check (MOSAIC) for text-visual integrity, each producing fine-grained signals that support error diagnosis beyond a single overall score. Experiments across 25 state-of-the-art models reveal systematic trade-offs between generation quality, citation discipline, and multimodal grounding, highlighting that strong prose alone does not guarantee faithful evidence use and that multimodal integrity remains a key bottleneck for deep research agents.

202. 【2601.12337】urbo-GoDec: Exploiting the Cluster Sparsity Prior for Hyperspectral Anomaly Detection

链接https://arxiv.org/abs/2601.12337

作者:Jiahui Sheng,Xiaorun Li,Shuhan Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:undergone extensive research, garnered significant attention, cluster sparsity prior, sparsity prior, cluster sparsity

备注

点击查看摘要

Abstract:As a key task in hyperspectral image processing, hyperspectral anomaly detection has garnered significant attention and undergone extensive research. Existing methods primarily relt on two prior assumption: low-rank background and sparse anomaly, along with additional spatial assumptions of the background. However, most methods only utilize the sparsity prior assumption for anomalies and rarely expand on this hypothesis. From observations of hyperspectral images, we find that anomalous pixels exhibit certain spatial distribution characteristics: they often manifest as small, clustered groups in space, which we refer to as cluster sparsity of anomalies. Then, we combined the cluster sparsity prior with the classical GoDec algorithm, incorporating the cluster sparsity prior into the S-step of GoDec. This resulted in a new hyperspectral anomaly detection method, which we called Turbo-GoDec. In this approach, we modeled the cluster sparsity prior of anomalies using a Markov random field and computed the marginal probabilities of anomalies through message passing on a factor graph. Locations with high anomalous probabilities were treated as the sparse component in the Turbo-GoDec. Experiments are conducted on three real hyperspectral image (HSI) datasets which demonstrate the superior performance of the proposed Turbo-GoDec method in detecting small-size anomalies comparing with the vanilla GoDec (LSMAD) and state-of-the-art anomaly detection methods. The code is available at this https URL.

203. 【2601.12329】FlowIID: Single-Step Intrinsic Image Decomposition via Latent Flow Matching

链接https://arxiv.org/abs/2601.12329

作者:Mithlesh Singla,Seema Kumari,Shanmuganathan Raman

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Intrinsic Image Decomposition, Intrinsic Image, Image Decomposition, separates an image, Image

备注

点击查看摘要

Abstract:Intrinsic Image Decomposition (IID) separates an image into albedo and shading components. It is a core step in many real-world applications, such as relighting and material editing. Existing IID models achieve good results, but often use a large number of parameters. This makes them costly to combine with other models in real-world settings. To address this problem, we propose a flow matching-based solution. For this, we design a novel architecture, FlowIID, based on latent flow matching. FlowIID combines a VAE-guided latent space with a flow matching module, enabling a stable decomposition of albedo and shading. FlowIID is not only parameter-efficient, but also produces results in a single inference step. Despite its compact design, FlowIID delivers competitive and superior results compared to existing models across various benchmarks. This makes it well-suited for deployment in resource-constrained and real-time vision applications.

204. 【2601.12326】EmoKGEdit: Training-free Affective Injection via Visual Cue Transformation

链接https://arxiv.org/abs/2601.12326

作者:Jing Zhang,Bingjie Fan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Existing image emotion, Existing image, distorted visual structures, weak emotional expression, yielding weak emotional

备注: 11pages,10figures

点击查看摘要

Abstract:Existing image emotion editing methods struggle to disentangle emotional cues from latent content representations, often yielding weak emotional expression and distorted visual structures. To bridge this gap, we propose EmoKGEdit, a novel training-free framework for precise and structure-preserving image emotion editing. Specifically, we construct a Multimodal Sentiment Association Knowledge Graph (MSA-KG) to disentangle the intricate relationships among objects, scenes, attributes, visual clues and emotion. MSA-KG explicitly encode the causal chain among object-attribute-emotion, and as external knowledge to support chain of thought reasoning, guiding the multimodal large model to infer plausible emotion-related visual cues and generate coherent instructions. In addition, based on MSA-KG, we design a disentangled structure-emotion editing module that explicitly separates emotional attributes from layout features within the latent space, which ensures that the target emotion is effectively injected while strictly maintaining visual spatial coherence. Extensive experiments demonstrate that EmoKGEdit achieves excellent performance in both emotion fidelity and content preservation, and outperforms the state-of-the-art methods.

205. 【2601.12325】Multi-Sensor Matching with HyperNetworks

链接https://arxiv.org/abs/2601.12325

作者:Eli Passov,Nathan S. Netanyahu,Yosi Keller

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:generate or modulate, modulate the weights, Siamese CNN, models that generate, model size

备注

点击查看摘要

Abstract:Hypernetworks are models that generate or modulate the weights of another network. They provide a flexible mechanism for injecting context and task conditioning and have proven broadly useful across diverse applications without significant increases in model size. We leverage hypernetworks to improve multimodal patch matching by introducing a lightweight descriptor-learning architecture that augments a Siamese CNN with (i) hypernetwork modules that compute adaptive, per-channel scaling and shifting and (ii) conditional instance normalization that provides modality-specific adaptation (e.g., visible vs. infrared, VIS-IR) in shallow layers. This combination preserves the efficiency of descriptor-based methods during inference while increasing robustness to appearance shifts. Trained with a triplet loss and hard-negative mining, our approach achieves state-of-the-art results on VIS-NIR and other VIS-IR benchmarks and matches or surpasses prior methods on additional datasets, despite their higher inference cost. To spur progress on domain shift, we also release GAP-VIR, a cross-platform (ground/aerial) VIS-IR patch dataset with 500K pairs, enabling rigorous evaluation of cross-domain generalization and adaptation.

206. 【2601.12316】GazeFormer-MoE: Context-Aware Gaze Estimation via CLIP and MoE Transformer

链接https://arxiv.org/abs/2601.12316

作者:Xinyuan Zhao,Xianrui Chen,Ahmad Chaddad

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:multi scale Transformer, gaze estimation, semantics modulated, present a semantics, scale Transformer

备注: accepted at ICASSP 2026

点击查看摘要

Abstract:We present a semantics modulated, multi scale Transformer for 3D gaze estimation. Our model conditions CLIP global features with learnable prototype banks (illumination, head pose, background, direction), fuses these prototype-enriched global vectors with CLIP patch tokens and high-resolution CNN tokens in a unified attention space, and replaces several FFN blocks with routed/shared Mixture of Experts to increase conditional capacity. Evaluated on MPIIFaceGaze, EYEDIAP, Gaze360 and ETH-XGaze, our model achieves new state of the art angular errors of 2.49°, 3.22°, 10.16°, and 1.44°, demonstrating up to a 64% relative improvement over previously reported results. ablations attribute gains to prototype conditioning, cross scale fusion, MoE and hyperparameter. Our code is publicly available at https://github. com/AIPMLab/Gazeformer.

207. 【2601.12313】S^2F-Net:A Robust Spatial-Spectral Fusion Framework for Cross-Model AIGC Detection

链接https://arxiv.org/abs/2601.12313

作者:Xiangyu Hu,Yicheng Hong,Hongchuang Zheng,Wenjun Zeng,Bingyao Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:strong generalization capabilities, rapid development, imposed an urgent, urgent demand, schemes with strong

备注: 27pages 9figures

点击查看摘要

Abstract:The rapid development of generative models has imposed an urgent demand for detection schemes with strong generalization capabilities. However, existing detection methods generally suffer from overfitting to specific source models, leading to significant performance degradation when confronted with unseen generative architectures. To address these challenges, this paper proposes a cross-model detection framework called S 2 F-Net, whose core lies in exploring and leveraging the inherent spectral discrepancies between real and synthetic textures. Considering that upsampling operations leave unique and distinguishable frequency fingerprints in both texture-poor and texture-rich regions, we focus our research on the detection of frequency-domain artifacts, aiming to fundamentally improve the generalization performance of the model. Specifically, we introduce a learnable frequency attention module that adaptively weights and enhances discriminative frequency bands by synergizing spatial texture analysis and spectral this http URL the AIGCDetectBenchmark, which includes 17 categories of generative models, S 2 F-Net achieves a detection accuracy of 90.49%, significantly outperforming various existing baseline methods in cross-domain detection scenarios.

208. 【2601.12312】CurConMix+: A Unified Spatio-Temporal Framework for Hierarchical Surgical Workflow Understanding

链接https://arxiv.org/abs/2601.12312

作者:Yongjun Jeon,Jongmin Shin,Kanggil Park,Seonmin Park,Soyoung Lim,Jung Yong Kim,Jinsoo Rhu,Jongman Kim,Gyu-Seong Choi,Namkee Oh,Kyu-Hwan Jung

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Surgical action triplet, action triplet recognition, triplet recognition aims, interactions among instruments, anatomical targets

备注

点击查看摘要

Abstract:Surgical action triplet recognition aims to understand fine-grained surgical behaviors by modeling the interactions among instruments, actions, and anatomical targets. Despite its clinical importance for workflow analysis and skill assessment, progress has been hindered by severe class imbalance, subtle visual variations, and the semantic interdependence among triplet components. Existing approaches often address only a subset of these challenges rather than tackling them jointly, which limits their ability to form a holistic understanding. This study builds upon CurConMix, a spatial representation framework. At its core, a curriculum-guided contrastive learning strategy learns discriminative and progressively correlated features, further enhanced by structured hard-pair sampling and feature-level mixup. Its temporal extension, CurConMix+, integrates a Multi-Resolution Temporal Transformer (MRTT) that achieves robust, context-aware understanding by adaptively fusing multi-scale temporal features and dynamically balancing spatio-temporal cues. Furthermore, we introduce LLS48, a new, hierarchically annotated benchmark for complex laparoscopic left lateral sectionectomy, providing step-, task-, and action-level annotations. Extensive experiments on CholecT45 and LLS48 demonstrate that CurConMix+ not only outperforms state-of-the-art approaches in triplet recognition, but also exhibits strong cross-level generalization, as its fine-grained features effectively transfer to higher-level phase and step recognition tasks. Together, the framework and dataset provide a unified foundation for hierarchy-aware, reproducible, and interpretable surgical workflow understanding. The code and dataset will be publicly released on GitHub to facilitate reproducibility and further research.

209. 【2601.12308】Adaptive Multi-Scale Correlation Meta-Network for Few-Shot Remote Sensing Image Classification

链接https://arxiv.org/abs/2601.12308

作者:Anurag Kaushish,Ayan Sar,Sampurna Roy,Sudeshna Chakraborty,Prashant Trivedi,Tanupriya Choudhury,Kanav Gupta

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:substantial domain shifts, remains challenging due, sensing remains challenging, labeled data, substantial domain

备注: Accepted in IEEE ICASSP 2026

点击查看摘要

Abstract:Few-shot learning in remote sensing remains challenging due to three factors: the scarcity of labeled data, substantial domain shifts, and the multi-scale nature of geospatial objects. To address these issues, we introduce Adaptive Multi-Scale Correlation Meta-Network (AMC-MetaNet), a lightweight yet powerful framework with three key innovations: (i) correlation-guided feature pyramids for capturing scale-invariant patterns, (ii) an adaptive channel correlation module (ACCM) for learning dynamic cross-scale relationships, and (iii) correlation-guided meta-learning that leverages correlation patterns instead of conventional prototype averaging. Unlike prior approaches that rely on heavy pre-trained models or transformers, AMC-MetaNet is trained from scratch with only $\sim600K$ parameters, offering $20\times$ fewer parameters than ResNet-18 while maintaining high efficiency ($50$ms per image inference). AMC-MetaNet achieves up to 86.65\% accuracy in 5-way 5-shot classification on various remote sensing datasets, including EuroSAT, NWPU-RESISC45, UC Merced Land Use, and AID. Our results establish AMC-MetaNet as a computationally efficient, scale-aware framework for real-world few-shot remote sensing.

210. 【2601.12304】A Two-Stage Globally-Diverse Adversarial Attack for Vision-Language Pre-training Models

链接https://arxiv.org/abs/2601.12304

作者:Wutao Chen,Huaqin Zou,Chen Wan,Lifeng Huang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Vision-language pre-training, Vision-language, black-box scenarios, Abstract, pre-training

备注: Accepted to ICASSP 2026

点击查看摘要

Abstract:Vision-language pre-training (VLP) models are vulnerable to adversarial examples, particularly in black-box scenarios. Existing multimodal attacks often suffer from limited perturbation diversity and unstable multi-stage pipelines. To address these challenges, we propose 2S-GDA, a two-stage globally-diverse attack framework. The proposed method first introduces textual perturbations through a globally-diverse strategy by combining candidate text expansion with globally-aware replacement. To enhance visual diversity, image-level perturbations are generated using multi-scale resizing and block-shuffle rotation. Extensive experiments on VLP models demonstrate that 2S-GDA consistently improves attack success rates over state-of-the-art methods, with gains of up to 11.17\% in black-box settings. Our framework is modular and can be easily combined with existing methods to further enhance adversarial transferability.

211. 【2601.12303】Concepts from Representations: Post-hoc Concept Bottleneck Models via Sparse Decomposition of Visual Representations

链接https://arxiv.org/abs/2601.12303

作者:Shizhan Gong,Xiaofan Zhang,Qi Dou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:achieved remarkable success, inherent opacity poses, opacity poses challenges, Deep learning, critical domains

备注: AAAI 2026

点击查看摘要

Abstract:Deep learning has achieved remarkable success in image recognition, yet their inherent opacity poses challenges for deployment in critical domains. Concept-based interpretations aim to address this by explaining model reasoning through human-understandable concepts. However, existing post-hoc methods and ante-hoc concept bottleneck models (CBMs), suffer from limitations such as unreliable concept relevance, non-visual or labor-intensive concept definitions, and model or data-agnostic assumptions. This paper introduces Post-hoc Concept Bottleneck Model via Representation Decomposition (PCBM-ReD), a novel pipeline that retrofits interpretability onto pretrained opaque models. PCBM-ReD automatically extracts visual concepts from a pre-trained encoder, employs multimodal large language models (MLLMs) to label and filter concepts based on visual identifiability and task relevance, and selects an independent subset via reconstruction-guided optimization. Leveraging CLIP's visual-text alignment, it decomposes image representations into linear combination of concept embeddings to fit into the CBMs abstraction. Extensive experiments across 11 image classification tasks show PCBM-ReD achieves state-of-the-art accuracy, narrows the performance gap with end-to-end models, and exhibits better interpretability.

212. 【2601.12291】OpenNavMap: Structure-Free Topometric Mapping via Large-Scale Collaborative Localization

链接https://arxiv.org/abs/2601.12291

作者:Jianhao Jiao,Changkun Liu,Jingwen Yu,Boyi Liu,Qianyi Zhang,Yue Wang,Dimitrios Kanoulas

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:enabling large-scale visual, Scalable and maintainable, maintainable map representations, large-scale visual navigation, representations are fundamental

备注: 21 pages, 20 figures

点击查看摘要

Abstract:Scalable and maintainable map representations are fundamental to enabling large-scale visual navigation and facilitating the deployment of robots in real-world environments. While collaborative localization across multi-session mapping enhances efficiency, traditional structure-based methods struggle with high maintenance costs and fail in feature-less environments or under significant viewpoint changes typical of crowd-sourced data. To address this, we propose OPENNAVMAP, a lightweight, structure-free topometric system leveraging 3D geometric foundation models for on-demand reconstruction. Our method unifies dynamic programming-based sequence matching, geometric verification, and confidence-calibrated optimization to robust, coarse-to-fine submap alignment without requiring pre-built 3D models. Evaluations on the Map-Free benchmark demonstrate superior accuracy over structure-from-motion and regression baselines, achieving an average translation error of 0.62m. Furthermore, the system maintains global consistency across 15km of multi-session data with an absolute trajectory error below 3m for map merging. Finally, we validate practical utility through 12 successful autonomous image-goal navigation tasks on simulated and physical robots. Code and datasets will be publicly available in this https URL.

213. 【2601.12285】LegacyAvatars: Volumetric Face Avatars For Traditional Graphics Pipelines

链接https://arxiv.org/abs/2601.12285

作者:Safa C. Medin,Gengyan Li,Ziqian Bai,Ruofei Du,Leonhard Helminger,Yinda Zhang,Stephan J. Garbin,Philip L. Davidson,Gregory W. Wornell,Thabo Beeler,Abhimitra Meka

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:efficient classical rendering, rendering of photorealistic, parametric face models, controllable volumetric rendering, efficient classical

备注

点击查看摘要

Abstract:We introduce a novel representation for efficient classical rendering of photorealistic 3D face avatars. Leveraging recent advances in radiance fields anchored to parametric face models, our approach achieves controllable volumetric rendering of complex facial features, including hair, skin, and eyes. At enrollment time, we learn a set of radiance manifolds in 3D space to extract an explicit layered mesh, along with appearance and warp textures. During deployment, this allows us to control and animate the face through simple linear blending and alpha compositing of textures over a static mesh. This explicit representation also enables the generated avatar to be efficiently streamed online and then rendered using classical mesh and shader-based rendering on legacy graphics platforms, eliminating the need for any custom engineering or integration.

214. 【2601.12283】SDiT: Semantic Region-Adaptive for Diffusion Transformers

链接https://arxiv.org/abs/2601.12283

作者:Bowen Lin,Fanjiang Ye,Yihua Liu,Zhenghui Guo,Boyuan Zhang,Weijian Zheng,Yufan Xu,Tiancheng Xing,Yuke Wang,Chengming Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:remain computationally expensive, computationally expensive due, Region-Adaptive Diffusion Transformer, synthesis but remain, global attention

备注

点击查看摘要

Abstract:Diffusion Transformers (DiTs) achieve state-of-the-art performance in text-to-image synthesis but remain computationally expensive due to the iterative nature of denoising and the quadratic cost of global attention. In this work, we observe that denoising dynamics are spatially non-uniform-background regions converge rapidly while edges and textured areas evolve much more actively. Building on this insight, we propose SDiT, a Semantic Region-Adaptive Diffusion Transformer that allocates computation according to regional complexity. SDiT introduces a training-free framework combining (1) semantic-aware clustering via fast Quickshift-based segmentation, (2) complexity-driven regional scheduling to selectively update informative areas, and (3) boundary-aware refinement to maintain spatial coherence. Without any model retraining or architectural modification, SDiT achieves up to 3.0x acceleration while preserving nearly identical perceptual and semantic quality to full-attention inference.

215. 【2601.12282】CytoCLIP: Learning Cytoarchitectural Characteristics in Developing Human Brain Using Contrastive Language Image Pre-Training

链接https://arxiv.org/abs/2601.12282

作者:Pralaypati Ta,Sriram Venkatesaperumal,Keerthi Ram,Mohanasankar Sivaprakasam

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:closely linked, spatial arrangement, arrangement and morphology, brain, cytoarchitecture

备注

点击查看摘要

Abstract:The functions of different regions of the human brain are closely linked to their distinct cytoarchitecture, which is defined by the spatial arrangement and morphology of the cells. Identifying brain regions by their cytoarchitecture enables various scientific analyses of the brain. However, delineating these areas manually in brain histological sections is time-consuming and requires specialized knowledge. An automated approach is necessary to minimize the effort needed from human experts. To address this, we propose CytoCLIP, a suite of vision-language models derived from pre-trained Contrastive Language-Image Pre-Training (CLIP) frameworks to learn joint visual-text representations of brain cytoarchitecture. CytoCLIP comprises two model variants: one is trained using low-resolution whole-region images to understand the overall cytoarchitectural pattern of an area, and the other is trained on high-resolution image tiles for detailed cellular-level representation. The training dataset is created from NISSL-stained histological sections of developing fetal brains of different gestational weeks. It includes 86 distinct regions for low-resolution images and 384 brain regions for high-resolution tiles. We evaluate the model's understanding of the cytoarchitecture and generalization ability using region classification and cross-modal retrieval tasks. Multiple experiments are performed under various data setups, including data from samples of different ages and sectioning planes. Experimental results demonstrate that CytoCLIP outperforms existing methods. It achieves an F1 score of 0.87 for whole-region classification and 0.91 for high-resolution image tile classification.

216. 【2601.12272】AgenticPruner: MAC-Constrained Neural Network Compression via LLM-Driven Strategy Search

链接https://arxiv.org/abs/2601.12272

作者:Shahrzad Esmat,Mahdi Banisharif,Ali Jannesari

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Neural network pruning, existing approaches primarily, Neural network, deploying deep learning, controlling computational cost

备注: 38 pages, 2 figures, 14 tables

点击查看摘要

Abstract:Neural network pruning remains essential for deploying deep learning models on resource-constrained devices, yet existing approaches primarily target parameter reduction without directly controlling computational cost. This yields unpredictable inference latency in deployment scenarios where strict Multiply-Accumulate (MAC) operation budgets must be met. We propose AgenticPruner, a framework utilizing large language models to achieve MAC-constrained optimization through iterative strategy learning. Our approach coordinates three specialized agents: a Profiling Agent that analyzes model architecture and MAC distributions, a Master Agent that orchestrates the workflow with divergence monitoring, and an Analysis Agent powered by Claude 3.5 Sonnet that learns optimal strategies from historical attempts. Through in-context learning, the Analysis Agent improves convergence success rate from 48% to 71% compared to grid search. Building upon isomorphic pruning's graph-based structural grouping, our method adds context-aware adaptation by analyzing patterns across pruning iterations, enabling automatic convergence to target MAC budgets within user-defined tolerance bands. We validate our framework on ImageNet-1K across ResNet, ConvNeXt, and DeiT architectures. On CNNs, our approach achieves MAC targeting while maintaining or improving accuracy: ResNet-50 reaches 1.77G MACs with 77.04% accuracy (+0.91% vs baseline); ResNet-101 achieves 4.22G MACs with 78.94% accuracy (+1.56% vs baseline). For ConvNeXt-Small, pruning to 8.17G MACs yields 1.41x GPU and 1.07x CPU speedup with 45% parameter reduction. On Vision Transformers, we demonstrate MAC-budget compliance within user-defined tolerance bands (typically +1% to +5% overshoot, -5% to -15% undershoot), establishing feasibility for deployment scenarios requiring strict computational guarantees.

Comments:
38 pages, 2 figures, 14 tables

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

ACMclasses:
I.2.6; I.5.4

Cite as:
arXiv:2601.12272 [cs.CV]

(or
arXiv:2601.12272v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2601.12272

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
217. 【2601.12257】Soft Shadow Diffusion (SSD): Physics-inspired Learning for 3D Computational Periscopy

链接https://arxiv.org/abs/2601.12257

作者:Fadlullah Raji,John Murray-Bruce

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Geometry (cs.CG); Graphics (cs.GR)

关键词:Conventional imaging requires, create accurate visual, accurate visual representations, Conventional imaging, line of sight

备注

点击查看摘要

Abstract:Conventional imaging requires a line of sight to create accurate visual representations of a scene. In certain circumstances, however, obtaining a suitable line of sight may be impractical, dangerous, or even impossible. Non-line-of-sight (NLOS) imaging addresses this challenge by reconstructing the scene from indirect measurements. Recently, passive NLOS methods that use an ordinary photograph of the subtle shadow cast onto a visible wall by the hidden scene have gained interest. These methods are currently limited to 1D or low-resolution 2D color imaging or to localizing a hidden object whose shape is approximately known. Here, we generalize this class of methods and demonstrate a 3D reconstruction of a hidden scene from an ordinary NLOS photograph. To achieve this, we propose a novel reformulation of the light transport model that conveniently decomposes the hidden scene into \textit{light-occluding} and \textit{non-light-occluding} components to yield a separable non-linear least squares (SNLLS) inverse problem. We develop two solutions: A gradient-based optimization method and a physics-inspired neural network approach, which we call Soft Shadow diffusion (SSD). Despite the challenging ill-conditioned inverse problem encountered here, our approaches are effective on numerous 3D scenes in real experimental scenarios. Moreover, SSD is trained in simulation but generalizes well to unseen classes in simulation and real-world NLOS scenes. SSD also shows surprising robustness to noise and ambient illumination.

218. 【2601.12253】Federated Joint Learning for Domain and Class Generalization

链接https://arxiv.org/abs/2601.12253

作者:Haoran Xu,Jiaze Li,Jianzhong Ju,Zhenbo Luo

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:large-scale parameter size, Efficient fine-tuning, models like CLIP, extensive pretraining requirements, textbf

备注: ICASSP 2026

点击查看摘要

Abstract:Efficient fine-tuning of visual-language models like CLIP has become crucial due to their large-scale parameter size and extensive pretraining requirements. Existing methods typically address either the issue of unseen classes or unseen domains in isolation, without considering a joint framework for both. In this paper, we propose \textbf{Fed}erated Joint Learning for \textbf{D}omain and \textbf{C}lass \textbf{G}eneralization, termed \textbf{FedDCG}, a novel approach that addresses both class and domain generalization in federated learning settings. Our method introduces a domain grouping strategy where class-generalized networks are trained within each group to prevent decision boundary confusion. During inference, we aggregate class-generalized results based on domain similarity, effectively integrating knowledge from both class and domain generalization. Specifically, a learnable network is employed to enhance class generalization capabilities, and a decoupling mechanism separates general and domain-specific knowledge, improving generalization to unseen domains. Extensive experiments across various datasets show that \textbf{FedDCG} outperforms state-of-the-art baselines in terms of accuracy and robustness.

219. 【2601.12252】Breaking Coordinate Overfitting: Geometry-Aware WiFi Sensing for Cross-Layout 3D Pose Estimation

链接https://arxiv.org/abs/2601.12252

作者:Songming Jia,Yan Lu,Bin Liu,Xiang Zhang,Peng Zhao,Xinmeng Tang,Yelin Wei,Jinyang Huang,Huan Yan,Zhi Liu

类目:Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)

关键词:smart interaction, pose estimation offers, offers a low-cost, low-cost and privacy-preserving, privacy-preserving alternative

备注: Accpeted by AMC Mobicom 2026

点击查看摘要

Abstract:WiFi-based 3D human pose estimation offers a low-cost and privacy-preserving alternative to vision-based systems for smart interaction. However, existing approaches rely on visual 3D poses as supervision and directly regress CSI to a camera-based coordinate system. We find that this practice leads to coordinate overfitting: models memorize deployment-specific WiFi transceiver layouts rather than only learning activity-relevant representations, resulting in severe generalization failures. To address this challenge, we present PerceptAlign, the first geometry-conditioned framework for WiFi-based cross-layout pose estimation. PerceptAlign introduces a lightweight coordinate unification procedure that aligns WiFi and vision measurements in a shared 3D space using only two checkerboards and a few photos. Within this unified space, it encodes calibrated transceiver positions into high-dimensional embeddings and fuses them with CSI features, making the model explicitly aware of device geometry as a conditional variable. This design forces the network to disentangle human motion from deployment layouts, enabling robust and, for the first time, layout-invariant WiFi pose estimation. To support systematic evaluation, we construct the largest cross-domain 3D WiFi pose estimation dataset to date, comprising 21 subjects, 5 scenes, 18 actions, and 7 device layouts. Experiments show that PerceptAlign reduces in-domain error by 12.3% and cross-domain error by more than 60% compared to state-of-the-art baselines. These results establish geometry-conditioned learning as a viable path toward scalable and practical WiFi sensing.

220. 【2601.12249】An Innovative Framework for Breast Cancer Detection Using Pyramid Adaptive Atrous Convolution, Transformer Integration, and Multi-Scale Feature Fusion

链接https://arxiv.org/abs/2601.12249

作者:Ehsan Sadeghi Pour,Mahdi Esmaeili,Morteza Romoozi

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:improving treatment outcomes, Adaptive Atrous Convolution, Pyramid Adaptive Atrous, timely diagnosis plays, Breast cancer

备注: 13 page

点击查看摘要

Abstract:Breast cancer is one of the most common cancers among women worldwide, and its accurate and timely diagnosis plays a critical role in improving treatment outcomes. This thesis presents an innovative framework for detecting malignant masses in mammographic images by integrating the Pyramid Adaptive Atrous Convolution (PAAC) and Transformer architectures. The proposed approach utilizes Multi-Scale Feature Fusion to enhance the extraction of features from benign and malignant tissues and combines Dice Loss and Focal Loss functions to improve the model's learning process, effectively reducing errors in binary breast cancer classification and achieving high accuracy and efficiency. In this study, a comprehensive dataset of breast cancer images from INbreast, MIAS, and DDSM was preprocessed through data augmentation and contrast enhancement and resized to 227x227 pixels for model training. Leveraging the Transformer's ability to manage long-range dependencies with Self-Attention mechanisms, the proposed model achieved high accuracy in detecting cancerous masses, outperforming foundational models such as BreastNet, DeepMammo, Multi-Scale CNN, Swin-Unet, and SegFormer. The final evaluation results for the proposed model include an accuracy of 98.5\%, sensitivity of 97.8\%, specificity of 96.3\%, F1-score of 98.2\%, and overall precision of 97.9\%. These metrics demonstrate a significant improvement over traditional methods and confirm the model's effectiveness in identifying cancerous masses in complex scenarios and large datasets. This model shows potential as a reliable and efficient tool for breast cancer diagnosis and can be effectively integrated into medical diagnostic systems.

221. 【2601.12243】Less is More: Label-Guided Summarization of Procedural and Instructional Videos

链接https://arxiv.org/abs/2601.12243

作者:Shreya Rajpal,Michal Golovanesky,Carsten Eickhoff

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:turn long videos, easier to review, surgical training, turn long, high-stakes domains

备注: 22 pages, 6 figures

点击查看摘要

Abstract:Video summarization helps turn long videos into clear, concise representations that are easier to review, document, and analyze, especially in high-stakes domains like surgical training. Prior work has progressed from using basic visual features like color, motion, and structural changes to using pre-trained vision-language models that can better understand what's happening in the video (semantics) and capture temporal flow, resulting in more context-aware video summarization. We propose a three-stage framework, PRISM: Procedural Representation via Integrated Semantic and Multimodal analysis, that produces semantically grounded video summaries. PRISM combines adaptive visual sampling, label-driven keyframe anchoring, and contextual validation using a large language model (LLM). Our method ensures that selected frames reflect meaningful and procedural transitions while filtering out generic or hallucinated content, resulting in contextually coherent summaries across both domain-specific and instructional videos. We evaluate our method on instructional and activity datasets, using reference summaries for instructional videos. Despite sampling fewer than 5% of the original frames, our summaries retain 84% semantic content while improving over baselines by as much as 33%. Our approach generalizes across procedural and domain-specific video tasks, achieving strong performance with both semantic alignment and precision.

222. 【2601.12234】Proc3D: Procedural 3D Generation and Parametric Editing of 3D Shapes with Large Language Models

链接https://arxiv.org/abs/2601.12234

作者:Fadlullah Raji,Stefano Petrangeli,Matheus Gadelha,Yu Shen,Uttaran Bhattacharya,Gang Wu

类目:Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:requiring specialized expertise, complex task requiring, task requiring specialized, specialized expertise, complex task

备注

点击查看摘要

Abstract:Generating 3D models has traditionally been a complex task requiring specialized expertise. While recent advances in generative AI have sought to automate this process, existing methods produce non-editable representation, such as meshes or point clouds, limiting their adaptability for iterative design. In this paper, we introduce Proc3D, a system designed to generate editable 3D models while enabling real-time modifications. At its core, Proc3D introduces procedural compact graph (PCG), a graph representation of 3D models, that encodes the algorithmic rules and structures necessary for generating the model. This representation exposes key parameters, allowing intuitive manual adjustments via sliders and checkboxes, as well as real-time, automated modifications through natural language prompts using Large Language Models (LLMs). We demonstrate Proc3D's capabilities using two generative approaches: GPT-4o with in-context learning (ICL) and a fine-tuned LLAMA-3 model. Experimental results show that Proc3D outperforms existing methods in editing efficiency, achieving more than 400x speedup over conventional approaches that require full regeneration for each modification. Additionally, Proc3D improves ULIP scores by 28%, a metric that evaluates the alignment between generated 3D models and text prompts. By enabling text-aligned 3D model generation along with precise, real-time parametric edits, Proc3D facilitates highly accurate text-based image editing applications.

223. 【2601.12233】DiffusionQC: Artifact Detection in Histopathology via Diffusion Model

链接https://arxiv.org/abs/2601.12233

作者:Zhenzhen Wang,Zhongliang Zhou,Zhuoyu Wen,Jeong Hwan Kook,John B Wojcik,John Kang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Digital pathology plays, offering critical insights, Digital pathology, modern medicine, offering critical

备注: 7 pages

点击查看摘要

Abstract:Digital pathology plays a vital role across modern medicine, offering critical insights for disease diagnosis, prognosis, and treatment. However, histopathology images often contain artifacts introduced during slide preparation and digitization. Detecting and excluding them is essential to ensure reliable downstream analysis. Traditional supervised models typically require large annotated datasets, which is resource-intensive and not generalizable to novel artifact types. To address this, we propose DiffusionQC, which detects artifacts as outliers among clean images using a diffusion model. It requires only a set of clean images for training rather than pixel-level artifact annotations and predefined artifact types. Furthermore, we introduce a contrastive learning module to explicitly enlarge the distribution separation between artifact and clean images, yielding an enhanced version of our method. Empirical results demonstrate superior performance to state-of-the-art and offer cross-stain generalization capacity, with significantly less data and annotations.

224. 【2601.12224】Where It Moves, It Matters: Referring Surgical Instrument Segmentation via Motion

链接https://arxiv.org/abs/2601.12224

作者:Meng Wei,Kun Yuan,Shi Li,Yue Zhou,Long Bai,Nassir Navab,Hongliang Ren,Hong Joo Lee,Tom Vercauteren,Nicolas Padoy

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:intelligent operating rooms, Enabling intuitive, surgical robotic assistance, autonomous surgical robotic, robotic assistance

备注

点击查看摘要

Abstract:Enabling intuitive, language-driven interaction with surgical scenes is a critical step toward intelligent operating rooms and autonomous surgical robotic assistance. However, the task of referring segmentation, localizing surgical instruments based on natural language descriptions, remains underexplored in surgical videos, with existing approaches struggling to generalize due to reliance on static visual cues and predefined instrument names. In this work, we introduce SurgRef, a novel motion-guided framework that grounds free-form language expressions in instrument motion, capturing how tools move and interact across time, rather than what they look like. This allows models to understand and segment instruments even under occlusion, ambiguity, or unfamiliar terminology. To train and evaluate SurgRef, we present Ref-IMotion, a diverse, multi-institutional video dataset with dense spatiotemporal masks and rich motion-centric expressions. SurgRef achieves state-of-the-art accuracy and generalization across surgical procedures, setting a new benchmark for robust, language-driven surgical video segmentation.

225. 【2601.12193】VIRTUE: Versatile Video Retrieval Through Unified Embeddings

链接https://arxiv.org/abs/2601.12193

作者:Shaunak Halbe,Bhagyashree Puranik,Jayakrishnan Unnikrishnan,Kushan Thakkar,Vimal Bhat,Toufiq Parag

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Modern video retrieval, flexible multimodal querying, Modern video, handle diverse tasks, diverse tasks ranging

备注

点击查看摘要

Abstract:Modern video retrieval systems are expected to handle diverse tasks ranging from corpus-level retrieval and fine-grained moment localization to flexible multimodal querying. Specialized architectures achieve strong retrieval performance by training modality-specific encoders on massive datasets, but they lack the ability to process composed multimodal queries. In contrast, multimodal LLM (MLLM)-based methods support rich multimodal search but their retrieval performance remains well below that of specialized systems. We present VIRTUE, an MLLM-based versatile video retrieval framework that integrates corpus and moment-level retrieval capabilities while accommodating composed multimodal queries within a single architecture. We use contrastive alignment of visual and textual embeddings generated using a shared MLLM backbone to facilitate efficient embedding-based candidate search. Our embedding model, trained efficiently using low-rank adaptation (LoRA) on 700K paired visual-text data samples, surpasses other MLLM-based methods on zero-shot video retrieval tasks. Additionally, we demonstrate that the same model can be adapted without further training to achieve competitive results on zero-shot moment retrieval, and state of the art results for zero-shot composed video retrieval. With additional training for reranking candidates identified in the embedding-based search, our model substantially outperforms existing MLLM-based retrieval systems and achieves retrieval performance comparable to state of the art specialized models which are trained on orders of magnitude larger data.

226. 【2601.12155】Inverse Rendering for High-Genus 3D Surface Meshes from Multi-view Images with Persistent Homology Priors

链接https://arxiv.org/abs/2601.12155

作者:Xiang Gao,Xinmu Wang,Yuanpeng Liu,Yue Wang,Junqi Huang,Wei Chen,Xianfeng Gu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:ill-posed problem due, inherently an ill-posed, ill-posed problem, problem due, Abstract

备注: ICASSP2026 Accepted

点击查看摘要

Abstract:Reconstructing 3D objects from images is inherently an ill-posed problem due to ambiguities in geometry, appearance, and topology. This paper introduces collaborative inverse rendering with persistent homology priors, a novel strategy that leverages topological constraints to resolve these ambiguities. By incorporating priors that capture critical features such as tunnel loops and handle loops, our approach directly addresses the difficulty of reconstructing high-genus surfaces. The collaboration between photometric consistency from multi-view images and homology-based guidance enables recovery of complex high-genus geometry while circumventing catastrophic failures such as collapsing tunnels or losing high-genus structure. Instead of neural networks, our method relies on gradient-based optimization within a mesh-based inverse rendering framework to highlight the role of topological priors. Experimental results show that incorporating persistent homology priors leads to lower Chamfer Distance (CD) and higher Volume IoU compared to state-of-the-art mesh-based methods, demonstrating improved geometric accuracy and robustness against topological failure.

227. 【2601.12150】Enhanced Diagnostic Performance via Large-Resolution Inference Optimization for Pathology Foundation Models

链接https://arxiv.org/abs/2601.12150

作者:Mengxuan Hu,Zihan Guan,John Kang,Sheng Li,Zhongliang Zhou

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:creating substantial inefficiencies, pathology foundation models, foundation models remain, models remain constrained, specific input size

备注: 8 pages

点击查看摘要

Abstract:Despite their prominent performance on tasks such as ROI classification and segmentation, many pathology foundation models remain constrained by a specific input size e.g. 224 x 224, creating substantial inefficiencies when applied to whole-slide images (WSIs), which span thousands of resolutions. A naive strategy is to either enlarge inputs or downsample the WSIs. However, enlarging inputs results in prohibitive GPU memory consumption, while downsampling alters the microns-per-pixel resolution and obscures critical morphological details. To overcome these limitations, we propose an space- and time- efficient inference strategy that sparsifies attention using spatially aware neighboring blocks and filters out non-informative tokens through global attention scores. This design substantially reduces GPU memory and runtime during high-resolution WSI inference while preserving and even improving the downstream performance, enabling inference at higher resolutions under the same GPU budget. The experimental results show that our method can achieves up to an 7.67% improvement in the ROI classification and compatible results in segmentation.

228. 【2601.12149】Principal Component Analysis-Based Terahertz Self-Supervised Denoising and Deblurring Deep Neural Networks

链接https://arxiv.org/abs/2601.12149

作者:Pengfei Zhu,Xavier Maldague

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:frequency-dependent degradation effects, systems inherently introduce, inherently introduce frequency-dependent, introduce frequency-dependent degradation, systems inherently

备注

点击查看摘要

Abstract:Terahertz (THz) systems inherently introduce frequency-dependent degradation effects, resulting in low-frequency blurring and high-frequency noise in amplitude images. Conventional image processing techniques cannot simultaneously address both issues, and manual intervention is often required due to the unknown boundary between denoising and deblurring. To tackle this challenge, we propose a principal component analysis (PCA)-based THz self-supervised denoising and deblurring network (THz-SSDD). The network employs a Recorrupted-to-Recorrupted self-supervised learning strategy to capture the intrinsic features of noise by exploiting invariance under repeated corruption. PCA decomposition and reconstruction are then applied to restore images across both low and high frequencies. The performance of the THz-SSDD network was evaluated on four types of samples. Training requires only a small set of unlabeled noisy images, and testing across samples with different material properties and measurement modes demonstrates effective denoising and deblurring. Quantitative analysis further validates the network feasibility, showing improvements in image quality while preserving the physical characteristics of the original signals.

229. 【2601.12147】Segment and Matte Anything in a Unified Model

链接https://arxiv.org/abs/2601.12147

作者:Zezhong Fan,Xiaohan Li,Topojoy Biswas,Kaushiki Nag,Kannan Achan

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:demonstrating zero-shot generalization, recently pushed, pushed the boundaries, demonstrating zero-shot, zero-shot generalization

备注: AAAI 2026

点击查看摘要

Abstract:Segment Anything (SAM) has recently pushed the boundaries of segmentation by demonstrating zero-shot generalization and flexible prompting after training on over one billion masks. Despite this, its mask prediction accuracy often falls short of the precision required in real-world applications. While several refinement modules have been proposed to boost SAM's segmentation quality, achieving highly accurate object delineation within a single, unified framework remains an open challenge. Furthermore, interactive image matting, which aims to generate fine-grained alpha mattes guided by diverse user hints, has not yet been explored in the context of SAM. Insights from recent studies highlight strong correlations between segmentation and matting, suggesting the feasibility of a unified model capable of both tasks. In this paper, we introduce Segment And Matte Anything (SAMA), a lightweight extension of SAM that delivers high-quality interactive image segmentation and matting with minimal extra parameters. Our Multi-View Localization Encoder (MVLE) captures detailed features from local views, while the Localization Adapter (Local-Adapter) refines mask outputs by recovering subtle boundary details. We also incorporate two prediction heads for each task into the architecture to generate segmentation and matting masks, simultaneously. Trained on a diverse dataset aggregated from publicly available sources, SAMA achieves state-of-the-art performance across multiple segmentation and matting benchmarks, showcasing its adaptability and effectiveness in a wide range of downstream tasks.

230. 【2601.12137】EMoE: Eigenbasis-Guided Routing for Mixture-of-Experts

链接https://arxiv.org/abs/2601.12137

作者:Anzhe Cheng,Shukai Duan,Shixuan Li,Chenzhong Yin,Mingxi Cheng,Shahin Nazarian,Paul Thompson,Paul Bogdan

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:unsustainable computational demands, deep learning models, computational demands, greater efficiency, relentless scaling

备注: accepted by ICASSP2026

点击查看摘要

Abstract:The relentless scaling of deep learning models has led to unsustainable computational demands, positioning Mixture-of-Experts (MoE) architectures as a promising path towards greater efficiency. However, MoE models are plagued by two fundamental challenges: 1) a load imbalance problem known as the``rich get richer" phenomenon, where a few experts are over-utilized, and 2) an expert homogeneity problem, where experts learn redundant representations, negating their purpose. Current solutions typically employ an auxiliary load-balancing loss that, while mitigating imbalance, often exacerbates homogeneity by enforcing uniform routing at the expense of specialization. To resolve this, we introduce the Eigen-Mixture-of-Experts (EMoE), a novel architecture that leverages a routing mechanism based on a learned orthonormal eigenbasis. EMoE projects input tokens onto this shared eigenbasis and routes them based on their alignment with the principal components of the feature space. This principled, geometric partitioning of data intrinsically promotes both balanced expert utilization and the development of diverse, specialized experts, all without the need for a conflicting auxiliary loss function. Our code is publicly available at this https URL.

231. 【2601.12119】CARLA-Round: A Multi-Factor Simulation Dataset for Roundabout Trajectory Prediction

链接https://arxiv.org/abs/2601.12119

作者:Xiaotong Zhou,Zhenhui Yuan,Yi Han,Tianhua Xu,Laurence T. Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:circular road geometry, remains highly challenging, highly challenging due, reducing traffic accidents, road geometry

备注

点击查看摘要

Abstract:Accurate trajectory prediction of vehicles at roundabouts is critical for reducing traffic accidents, yet it remains highly challenging due to their circular road geometry, continuous merging and yielding interactions, and absence of traffic signals. Developing accurate prediction algorithms relies on reliable, multimodal, and realistic datasets; however, such datasets for roundabout scenarios are scarce, as real-world data collection is often limited by incomplete observations and entangled factors that are difficult to isolate. We present CARLA-Round, a systematically designed simulation dataset for roundabout trajectory prediction. The dataset varies weather conditions (five types) and traffic density levels (spanning Level-of-Service A-E) in a structured manner, resulting in 25 controlled scenarios. Each scenario incorporates realistic mixtures of driving behaviors and provides explicit annotations that are largely absent from existing datasets. Unlike randomly sampled simulation data, this structured design enables precise analysis of how different conditions influence trajectory prediction performance. Validation experiments using standard baselines (LSTM, GCN, GRU+GCN) reveal traffic density dominates prediction difficulty with strong monotonic effects, while weather shows non-linear impacts. The best model achieves 0.312m ADE on real-world rounD dataset, demonstrating effective sim-to-real transfer. This systematic approach quantifies factor impacts impossible to isolate in confounded real-world datasets. Our CARLA-Round dataset is available at this https URL.

232. 【2601.12111】RCDN: Real-Centered Detection Network for Robust Face Forgery Identification

链接https://arxiv.org/abs/2601.12111

作者:Wyatt McCurdy,Xin Zhang,Yuqi Song,Min Gao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:AI-based generation tools, fraudulent facial content, generation tools, critical threat, rapid proliferation

备注

点击查看摘要

Abstract:Image forgery has become a critical threat with the rapid proliferation of AI-based generation tools, which make it increasingly easy to synthesize realistic but fraudulent facial content. Existing detection methods achieve near-perfect performance when training and testing are conducted within the same domain, yet their effectiveness deteriorates substantially in crossdomain scenarios. This limitation is problematic, as new forgery techniques continuously emerge and detectors must remain reliable against unseen manipulations. To address this challenge, we propose the Real-Centered Detection Network (RCDN), a frequency spatial convolutional neural networks(CNN) framework with an Xception backbone that anchors its representation space around authentic facial images. Instead of modeling the diverse and evolving patterns of forgeries, RCDN emphasizes the consistency of real images, leveraging a dual-branch architecture and a real centered loss design to enhance robustness under distribution shifts. Extensive experiments on the DiFF dataset, focusing on three representative forgery types (FE, I2I, T2I), demonstrate that RCDN achieves both state-of-the-art in-domain accuracy and significantly stronger cross-domain generalization. Notably, RCDN reduces the generalization gap compared to leading baselines and achieves the highest cross/in-domain stability ratio, highlighting its potential as a practical solution for defending against evolving and unseen image forgery techniques.

233. 【2601.12109】Energy-Aware Ensemble Learning for Coffee Leaf Disease Classification

链接https://arxiv.org/abs/2601.12109

作者:Larissa Ferreira Rodrigues Moreira,Rodrigo Moreira,Leonardo Gabriel Ferreira Rodrigues

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:presents significant challenges, field presents significant, assessing leaf diseases, significant challenges, Convolutional Neural Networks

备注

点击查看摘要

Abstract:Coffee yields are contingent on the timely and accurate diagnosis of diseases; however, assessing leaf diseases in the field presents significant challenges. Although Artificial Intelligence (AI) vision models achieve high accuracy, their adoption is hindered by the limitations of constrained devices and intermittent connectivity. This study aims to facilitate sustainable on-device diagnosis through knowledge distillation: high-capacity Convolutional Neural Networks (CNNs) trained in data centers transfer knowledge to compact CNNs through Ensemble Learning (EL). Furthermore, dense tiny pairs were integrated through simple and optimized ensembling to enhance accuracy while adhering to strict computational and energy constraints. On a curated coffee leaf dataset, distilled tiny ensembles achieved competitive with prior work with significantly reduced energy consumption and carbon footprint. This indicates that lightweight models, when properly distilled and ensembled, can provide practical diagnostic solutions for Internet of Things (IoT) applications.

234. 【2601.12090】Detecting 3D Line Segments for 6DoF Pose Estimation with Limited Data

链接https://arxiv.org/abs/2601.12090

作者:Matej Mok,Lukáš Gajdošech,Michal Mesároš,Martin Madaras,Viktor Kocur

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:fundamental problems, industrial automation, practical applications, pose estimation, pose

备注: 8 pages

点击查看摘要

Abstract:The task of 6DoF object pose estimation is one of the fundamental problems of 3D vision with many practical applications such as industrial automation. Traditional deep learning approaches for this task often require extensive training data or CAD models, limiting their application in real-world industrial settings where data is scarce and object instances vary. We propose a novel method for 6DoF pose estimation focused specifically on bins used in industrial settings. We exploit the cuboid geometry of bins by first detecting intermediate 3D line segments corresponding to their top edges. Our approach extends the 2D line segment detection network LeTR to operate on structured point cloud data. The detected 3D line segments are then processed using a simple geometric procedure to robustly determine the bin's 6DoF pose. To evaluate our method, we extend an existing dataset with a newly collected and annotated dataset, which we make publicly available. We show that incorporating synthetic training data significantly improves pose estimation accuracy on real scans. Moreover, we show that our method significantly outperforms current state-of-the-art 6DoF pose estimation methods in terms of the pose accuracy (3 cm translation error, 8.2$^\circ$ rotation error) while not requiring instance-specific CAD models during inference.

235. 【2601.12082】Conditional Random Fields for Interactive Refinement of Histopathological Predictions

链接https://arxiv.org/abs/2601.12082

作者:Tiffanie Godelaine,Maxime Zanella,Karim El Khoury,Saïd Mahmoudi,Benoît Macq,Christophe De Vleeschouwer

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:supports cancer detection, Assisting pathologists, detection and staging, images has high, high clinical

备注

点击查看摘要

Abstract:Assisting pathologists in the analysis of histopathological images has high clinical value, as it supports cancer detection and staging. In this context, histology foundation models have recently emerged. Among them, Vision-Language Models (VLMs) provide strong yet imperfect zero-shot predictions. We propose to refine these predictions by adapting Conditional Random Fields (CRFs) to histopathological applications, requiring no additional model training. We present HistoCRF, a CRF-based framework, with a novel definition of the pairwise potential that promotes label diversity and leverages expert annotations. We consider three experiments: without annotations, with expert annotations, and with iterative human-in-the-loop annotations that progressively correct misclassified patches. Experiments on five patch-level classification datasets covering different organs and diseases demonstrate average accuracy gains of 16.0% without annotations and 27.5% with only 100 annotations, compared to zero-shot predictions. Moreover, integrating a human in the loop reaches a further gain of 32.6% with the same number of annotations. The code will be made available on this https URL.

236. 【2601.12080】oward Real-World High-Precision Image Matting and Segmentation

链接https://arxiv.org/abs/2601.12080

作者:Haipeng Zhou,Zhaohu Xing,Hongqiu Wang,Jun Ma,Ping Li,Lei Zhu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:High-precision scene parsing, scene parsing tasks, including image matting, extremely fine details, High-precision scene

备注: Accepted by AAAI2026, Poster

点击查看摘要

Abstract:High-precision scene parsing tasks, including image matting and dichotomous segmentation, aim to accurately predict masks with extremely fine details (such as hair). Most existing methods focus on salient, single foreground objects. While interactive methods allow for target adjustment, their class-agnostic design restricts generalization across different categories. Furthermore, the scarcity of high-quality annotation has led to a reliance on inharmonious synthetic data, resulting in poor generalization to real-world scenarios. To this end, we propose a Foreground Consistent Learning model, dubbed as FCLM, to address the aforementioned issues. Specifically, we first introduce a Depth-Aware Distillation strategy where we transfer the depth-related knowledge for better foreground representation. Considering the data dilemma, we term the processing of synthetic data as domain adaptation problem where we propose a domain-invariant learning strategy to focus on foreground learning. To support interactive prediction, we contribute an Object-Oriented Decoder that can receive both visual and language prompts to predict the referring target. Experimental results show that our method quantitatively and qualitatively outperforms SOTA methods.

237. 【2601.12079】EmoLat: Text-driven Image Sentiment Transfer via Emotion Latent Space

链接https://arxiv.org/abs/2601.12079

作者:Jing Zhang,Bingjie Fan,Jixiang Zhu,Zhe Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:enables fine-grained, space that enables, text-driven image sentiment, visual attributes, EmoSpace Set

备注: 10 pages, 5 figures

点击查看摘要

Abstract:We propose EmoLat, a novel emotion latent space that enables fine-grained, text-driven image sentiment transfer by modeling cross-modal correlations between textual semantics and visual emotion features. Within EmoLat, an emotion semantic graph is constructed to capture the relational structure among emotions, objects, and visual attributes. To enhance the discriminability and transferability of emotion representations, we employ adversarial regularization, aligning the latent emotion distributions across modalities. Building upon EmoLat, a cross-modal sentiment transfer framework is proposed to manipulate image sentiment via joint embedding of text and EmoLat features. The network is optimized using a multi-objective loss incorporating semantic consistency, emotion alignment, and adversarial regularization. To support effective modeling, we construct EmoSpace Set, a large-scale benchmark dataset comprising images with dense annotations on emotions, object semantics, and visual attributes. Extensive experiments on EmoSpace Set demonstrate that our approach significantly outperforms existing state-of-the-art methods in both quantitative metrics and qualitative transfer fidelity, establishing a new paradigm for controllable image sentiment editing guided by textual input. The EmoSpace Set and all the code are available at this http URL.

238. 【2601.12076】CroBIM-V: Memory-Quality Controlled Remote Sensing Referring Video Object Segmentation

链接https://arxiv.org/abs/2601.12076

作者:H. Jiang,Y. Sun,Z. Dong,T. Liu,Y. Gu

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:Remote sensing video, Remote sensing, weak target saliency, maintain discriminative target, discriminative target representations

备注

点击查看摘要

Abstract:Remote sensing video referring object segmentation (RS-RVOS) is challenged by weak target saliency and severe visual information truncation in dynamic scenes, making it extremely difficult to maintain discriminative target representations during segmentation. Moreover, progress in this field is hindered by the absence of large-scale dedicated benchmarks, while existing models are often affected by biased initial memory construction that impairs accurate instance localization in complex scenarios, as well as indiscriminate memory accumulation that encodes noise from occlusions or misclassifications, leading to persistent error propagation. This paper advances RS-RVOS research through dual contributions in data and methodology. First, we construct RS-RVOS Bench, the first large-scale benchmark comprising 111 video sequences, about 25,000 frames, and 213,000 temporal referring annotations. Unlike common RVOS benchmarks where many expressions are written with access to the full video context, our dataset adopts a strict causality-aware annotation strategy in which linguistic references are generated solely from the target state in the initial frame. Second, we propose a memory-quality-aware online referring segmentation framework, termed Memory Quality Control with Segment Anything Model (MQC-SAM). MQC-SAM introduces a temporal motion consistency module for initial memory calibration, leveraging short-term motion trajectory priors to correct structural deviations and establish accurate memory anchoring. Furthermore, it incorporates a decoupled attention-based memory integration mechanism with dynamic quality assessment, selectively updating high-confidence semantic features while filtering unreliable information, thereby effectively preventing error accumulation and propagation. Extensive experiments on RS-RVOS Bench demonstrate that MQC-SAM achieves state-of-the-art performance.

239. 【2601.12067】ARMARecon: An ARMA Convolutional Filter based Graph Neural Network for Neurodegenerative Dementias Classification

链接https://arxiv.org/abs/2601.12067

作者:VSS Tejaswi Abburi,Ananya Singhal,Saurabh J. Shigwan,Nitin Kumar

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:severe disease stages, Frontotemporal Dementia, Alzheimer Disease, Early detection, disease stages

备注: Accepted at IEEE International Symposium on Biomedical Imaging (ISBI) 2026

点击查看摘要

Abstract:Early detection of neurodegenerative diseases such as Alzheimer's Disease (AD) and Frontotemporal Dementia (FTD) is essential for reducing the risk of progression to severe disease stages. As AD and FTD propagate along white-matter regions in a global, graph-dependent manner, graph-based neural networks are well suited to capture these patterns. Hence, we introduce ARMARecon, a unified graph learning framework that integrates Autoregressive Moving Average (ARMA) graph filtering with a reconstruction-driven objective to enhance feature representation and improve classification accuracy. ARMARecon effectively models both local and global connectivity by leveraging 20-bin Fractional Anisotropy (FA) histogram features extracted from white-matter regions, while mitigating over-smoothing. Overall, ARMARecon achieves superior performance compared to state-of-the-art methods on the multi-site dMRI datasets ADNI and NIFD.

240. 【2601.12066】Learning Stochastic Bridges for Video Object Removal via Video-to-Video Translation

链接https://arxiv.org/abs/2601.12066

作者:Zijie Lou,Xiangwei Feng,Jiaxin Wang,Xiaochao Qu,Luoqi Liu,Ting Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:uninformative Gaussian noise, Gaussian noise, uninformative Gaussian, methods predominantly rely, predominantly rely

备注

点击查看摘要

Abstract:Existing video object removal methods predominantly rely on diffusion models following a noise-to-data paradigm, where generation starts from uninformative Gaussian noise. This approach discards the rich structural and contextual priors present in the original input video. Consequently, such methods often lack sufficient guidance, leading to incomplete object erasure or the synthesis of implausible content that conflicts with the scene's physical logic. In this paper, we reformulate video object removal as a video-to-video translation task via a stochastic bridge model. Unlike noise-initialized methods, our framework establishes a direct stochastic path from the source video (with objects) to the target video (objects removed). This bridge formulation effectively leverages the input video as a strong structural prior, guiding the model to perform precise removal while ensuring that the filled regions are logically consistent with the surrounding environment. To address the trade-off where strong bridge priors hinder the removal of large objects, we propose a novel adaptive mask modulation strategy. This mechanism dynamically modulates input embeddings based on mask characteristics, balancing background fidelity with generative flexibility. Extensive experiments demonstrate that our approach significantly outperforms existing methods in both visual quality and temporal consistency.

241. 【2601.12062】Learning Language-Driven Sequence-Level Modal-Invariant Representations for Video-Based Visible-Infrared Person Re-Identification

链接https://arxiv.org/abs/2601.12062

作者:Xiaomei Yang,Xizhan Gao,Antai Liu,Kang Wei,Fa Zhu,Guang Feng,Xiaofeng Qu,Sijie Niu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:visible-infrared person re-identification, video-based visible-infrared person, modal-invariant representations, person re-identification, core of video-based

备注

点击查看摘要

Abstract:The core of video-based visible-infrared person re-identification (VVI-ReID) lies in learning sequence-level modal-invariant representations across different modalities. Recent research tends to use modality-shared language prompts generated by CLIP to guide the learning of modal-invariant representations. Despite achieving optimal performance, such methods still face limitations in efficient spatial-temporal modeling, sufficient cross-modal interaction, and explicit modality-level loss guidance. To address these issues, we propose the language-driven sequence-level modal-invariant representation learning (LSMRL) method, which includes spatial-temporal feature learning (STFL) module, semantic diffusion (SD) module and cross-modal interaction (CMI) module. To enable parameter- and computation-efficient spatial-temporal modeling, the STFL module is built upon CLIP with minimal modifications. To achieve sufficient cross-modal interaction and enhance the learning of modal-invariant features, the SD module is proposed to diffuse modality-shared language prompts into visible and infrared features to establish preliminary modal consistency. The CMI module is further developed to leverage bidirectional cross-modal self-attention to eliminate residual modality gaps and refine modal-invariant representations. To explicitly enhance the learning of modal-invariant representations, two modality-level losses are introduced to improve the features' discriminative ability and their generalization to unseen categories. Extensive experiments on large-scale VVI-ReID datasets demonstrate the superiority of LSMRL over AOTA methods.

242. 【2601.12055】Automating Parameter Selection in Deep Image Prior for Fluorescence Microscopy Image Denoising via Similarity-Based Parameter Transfer

链接https://arxiv.org/abs/2601.12055

作者:Lina Meyer,Felix Wissel,Tobias Knopp,Susanne Pfefferle,Ralf Fliegert,Maximilian Sandmann,Liana Uebler,Franziska Möckl,Björn-Philipp Diercks,David Lohr,René Werner

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Unsupervised deep image, supervised deep learning, Unsupervised deep, deep image prior, DIP

备注

点击查看摘要

Abstract:Unsupervised deep image prior (DIP) addresses shortcomings of training data requirements and limited generalization associated with supervised deep learning. The performance of DIP depends on the network architecture and the stopping point of its iterative process. Optimizing these parameters for a new image requires time, restricting DIP application in domains where many images need to be processed. Focusing on fluorescence microscopy data, we hypothesize that similar images share comparable optimal parameter configurations for DIP-based denoising, potentially enabling optimization-free DIP for fluorescence microscopy. We generated a calibration (n=110) and validation set (n=55) of semantically different images from an open-source dataset for a network architecture search targeted towards ideal U-net architectures and stopping points. The calibration set represented our transfer basis. The validation set enabled the assessment of which image similarity criterion yields the best results. We then implemented AUTO-DIP, a pipeline for automatic parameter transfer, and compared it to the originally published DIP configuration (baseline) and a state-of-the-art image-specific variational denoising approach. We show that a parameter transfer from the calibration dataset to a test image based on only image metadata similarity (e.g., microscope type, imaged specimen) leads to similar and better performance than a transfer based on quantitative image similarity measures. AUTO-DIP outperforms the baseline DIP (DIP with original DIP parameters) as well as the variational denoising approaches for several open-source test datasets of varying complexity, particularly for very noisy inputs. Applications to locally acquired fluorescence microscopy images further proved superiority of AUTO-DIP.

243. 【2601.12052】ask-Driven Prompt Learning: A Joint Framework for Multi-modal Cloud Removal and Segmentation

链接https://arxiv.org/abs/2601.12052

作者:Zaiyan Zhang,Jie Li,Shaowei Shi,Qiangqiang Yuan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:remote sensing imagery, Earth observation, Optical remote sensing, persistent cloud occlusion, cloud occlusion limits

备注: Submitted to IGARSS 2026 Conference

点击查看摘要

Abstract:Optical remote sensing imagery is indispensable for Earth observation, yet persistent cloud occlusion limits its downstream utility. Most cloud removal (CR) methods are optimized for low-level fidelity and can over-smooth textures and boundaries that are critical for analysis-ready data (ARD), leading to a mismatch between visually plausible restoration and semantic utility. To bridge this gap, we propose TDP-CR, a task-driven multimodal framework that jointly performs cloud removal and land-cover segmentation. Central to our approach is a Prompt-Guided Fusion (PGF) mechanism, which utilizes a learnable degradation prompt to encode cloud thickness and spatial uncertainty. By combining global channel context with local prompt-conditioned spatial bias, PGF adaptively integrates Synthetic Aperture Radar (SAR) information only where optical data is corrupted. We further introduce a parameter-efficient two-phase training strategy that decouples reconstruction and semantic representation learning. Experiments on the LuojiaSET-OSFCR dataset demonstrate the superiority of our framework: TDP-CR surpasses heavy state-of-the-art baselines by 0.18 dB in PSNR while using only 15\% of the parameters, and achieves a 1.4\% improvement in mIoU consistently against multi-task competitors, effectively delivering analysis-ready data.

244. 【2601.12051】A Unified Masked Jigsaw Puzzle Framework for Vision and Language Models

链接https://arxiv.org/abs/2601.12051

作者:Weixin Ye,Wei Wang,Yahui Liu,Yue Song,Bin Ren,Wei Bi,Rita Cucchiara,Nicu Sebe

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Natural Language Processing, faces critical challenges, Language Processing, Masked Jigsaw Puzzle, Computer Vision

备注: 9 figures, 12 tables

点击查看摘要

Abstract:In federated learning, Transformer, as a popular architecture, faces critical challenges in defending against gradient attacks and improving model performance in both Computer Vision (CV) and Natural Language Processing (NLP) tasks. It has been revealed that the gradient of Position Embeddings (PEs) in Transformer contains sufficient information, which can be used to reconstruct the input data. To mitigate this issue, we introduce a Masked Jigsaw Puzzle (MJP) framework. MJP starts with random token shuffling to break the token order, and then a learnable \textit{unknown (unk)} position embedding is used to mask out the PEs of the shuffled tokens. In this manner, the local spatial information which is encoded in the position embeddings is disrupted, and the models are forced to learn feature representations that are less reliant on the local spatial information. Notably, with the careful use of MJP, we can not only improve models' robustness against gradient attacks, but also boost their performance in both vision and text application scenarios, such as classification for images (\textit{e.g.,} ImageNet-1K) and sentiment analysis for text (\textit{e.g.,} Yelp and Amazon). Experimental results suggest that MJP is a unified framework for different Transformer-based models in both vision and language tasks. Code is publicly available via this https URL

245. 【2601.12049】\textit{FocaLogic}: Logic-Based Interpretation of Visual Model Decisions

链接https://arxiv.org/abs/2601.12049

作者:Chenchen Zhao,Muxi Chen,Qiang Xu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:high-stakes applications, existing interpretability methods, interpretability methods typically, modern visual models, visual

备注: 12 pages, 13 figures

点击查看摘要

Abstract:Interpretability of modern visual models is crucial, particularly in high-stakes applications. However, existing interpretability methods typically suffer from either reliance on white-box model access or insufficient quantitative rigor. To address these limitations, we introduce FocaLogic, a novel model-agnostic framework designed to interpret and quantify visual model decision-making through logic-based representations. FocaLogic identifies minimal interpretable subsets of visual regions-termed visual focuses-that decisively influence model predictions. It translates these visual focuses into precise and compact logical expressions, enabling transparent and structured interpretations. Additionally, we propose a suite of quantitative metrics, including focus precision, recall, and divergence, to objectively evaluate model behavior across diverse scenarios. Empirical analyses demonstrate FocaLogic's capability to uncover critical insights such as training-induced concentration, increasing focus accuracy through generalization, and anomalous focuses under biases and adversarial attacks. Overall, FocaLogic provides a systematic, scalable, and quantitative solution for interpreting visual models.

246. 【2601.12037】Multimodal Feedback for Handheld Tool Guidance: Combining Wrist-Based Haptics with Augmented Reality

链接https://arxiv.org/abs/2601.12037

作者:Yue Yang,Christoph Leuze,Brian Hargreaves,Bruce Daniel,Fred M Baik

类目:Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)

关键词:see-through augmented reality, vibrotactile wrist feedback, optical see-through augmented, handheld tool movement, augmented reality

备注

点击查看摘要

Abstract:We investigate how vibrotactile wrist feedback can enhance spatial guidance for handheld tool movement in optical see-through augmented reality (AR). While AR overlays are widely used to support surgical tasks, visual occlusion, lighting conditions, and interface ambiguity can compromise precision and confidence. To address these challenges, we designed a multimodal system combining AR visuals with a custom wrist-worn haptic device delivering directional and state-based cues. A formative study with experienced surgeons and residents identified key tool maneuvers and preferences for reference mappings, guiding our cue design. In a cue identification experiment (N=21), participants accurately recognized five vibration patterns under visual load, with higher recognition for full-actuator states than spatial direction cues. In a guidance task (N=27), participants using both AR and haptics achieved significantly higher spatial precision (5.8 mm) and usability (SUS = 88.1) than those using either modality alone, despite having modest increases in task time. Participants reported that haptic cues provided reassuring confirmation and reduced cognitive effort during alignment. Our results highlight the promise of integrating wrist-based haptics into AR systems for high-precision, visually complex tasks such as surgical guidance. We discuss design implications for multimodal interfaces supporting confident, efficient tool manipulation.

247. 【2601.12020】DIAMOND-SSS: Diffusion-Augmented Multi-View Optimization for Data-efficient SubSurface Scattering

链接https://arxiv.org/abs/2601.12020

作者:Guillermo Figueroa-Araneda,Iris Diana Jimenez,Florian Hofherr,Manny Ko,Hector Andrade-Loarca,Daniel Cremers

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:characteristic soft shadows, Subsurface scattering, color bleeding, soft shadows, diffuse glow

备注

点击查看摘要

Abstract:Subsurface scattering (SSS) gives translucent materials -- such as wax, jade, marble, and skin -- their characteristic soft shadows, color bleeding, and diffuse glow. Modeling these effects in neural rendering remains challenging due to complex light transport and the need for densely captured multi-view, multi-light datasets (often more than 100 views and 112 OLATs). We present DIAMOND-SSS, a data-efficient framework for high-fidelity translucent reconstruction from extremely sparse supervision -- even as few as ten images. We fine-tune diffusion models for novel-view synthesis and relighting, conditioned on estimated geometry and trained on less than 7 percent of the dataset, producing photorealistic augmentations that can replace up to 95 percent of missing captures. To stabilize reconstruction under sparse or synthetic supervision, we introduce illumination-independent geometric priors: a multi-view silhouette consistency loss and a multi-view depth consistency loss. Across all sparsity regimes, DIAMOND-SSS achieves state-of-the-art quality in relightable Gaussian rendering, reducing real capture requirements by up to 90 percent compared to SSS-3DGS.

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2601.12020 [cs.CV]

(or
arXiv:2601.12020v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2601.12020

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
248. 【2601.12015】SAR-Based Marine Oil Spill Detection Using the DeepSegFusion Architecture

链接https://arxiv.org/abs/2601.12015

作者:Pavan Kumar Yata,Pediredla Pradeep,Goli Himanish,Swathi M

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Synthetic Aperture Radar, maritime safety, environmental surveillance, surveillance and maritime, oil spill

备注: 12 pages, 6 figures. Submitted to arXiv. Code and dataset details included in the paper

点击查看摘要

Abstract:Detection of oil spills from satellite images is essential for both environmental surveillance and maritime safety. Traditional threshold-based methods frequently encounter performance degradation due to very high false alarm rates caused by look-alike phenomena such as wind slicks and ship wakes. Here, a hybrid deep learning model, DeepSegFusion, is presented for oil spill segmentation in Synthetic Aperture Radar (SAR) images. The model uses SegNet and DeepLabV3+ integrated with an attention-based feature fusion mechanism to achieve better boundary precision as well as improved contextual understanding. Results obtained on SAR oil spill datasets, including ALOS PALSAR imagery, confirm that the proposed DeepSegFusion model achieves an accuracy of 94.85%, an Intersection over Union (IoU) of 0.5685, and a ROC-AUC score of 0.9330. The proposed method delivers more than three times fewer false detections compared to individual baseline models and traditional non-segmentation methods, achieving a reduction of 64.4%. These results indicate that DeepSegFusion is a stable model under various marine conditions and can therefore be used in near real-time oil spill monitoring scenarios.

249. 【2601.12010】SMc2f: Robust Scenario Mining for Robotic Autonomy from Coarse to Fine

链接https://arxiv.org/abs/2601.12010

作者:Yifei Chen,Ross Greer

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:autonomous robotic vehicles, robotic vehicles hinges, stacks against rare, safety validation, validation of autonomous

备注

点击查看摘要

Abstract:The safety validation of autonomous robotic vehicles hinges on systematically testing their planning and control stacks against rare, safety-critical scenarios. Mining these long-tail events from massive real-world driving logs is therefore a critical step in the robotic development lifecycle. The goal of the Scenario Mining task is to retrieve useful information to enable targeted re-simulation, regression testing, and failure analysis of the robot's decision-making algorithms. RefAV, introduced by the Argoverse team, is an end-to-end framework that uses large language models (LLMs) to spatially and temporally localize scenarios described in natural language. However, this process performs retrieval on trajectory labels, ignoring the direct connection between natural language and raw RGB images, which runs counter to the intuition of video retrieval; it also depends on the quality of upstream 3D object detection and tracking. Further, inaccuracies in trajectory data lead to inaccuracies in downstream spatial and temporal localization. To address these issues, we propose Robust Scenario Mining for Robotic Autonomy from Coarse to Fine (SMc2f), a coarse-to-fine pipeline that employs vision-language models (VLMs) for coarse image-text filtering, builds a database of successful mining cases on top of RefAV and automatically retrieves exemplars to few-shot condition the LLM for more robust retrieval, and introduces text-trajectory contrastive learning to pull matched pairs together and push mismatched pairs apart in a shared embedding space, yielding a fine-grained matcher that refines the LLM's candidate trajectories. Experiments on public datasets demonstrate substantial gains in both retrieval quality and efficiency.

250. 【2601.11990】DAOS: A Multimodal In-cabin Behavior Monitoring with Driver Action-Object Synergy Dataset

链接https://arxiv.org/abs/2601.11990

作者:Yiming Li,Chen Cai,Tianyi Liu,Dan Lin,Wenqian Wang,Wenfei Liang,Bingbing Li,Kim-Hui Yap

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:driver activity monitoring, activity monitoring, upper body, driver, driver activity

备注

点击查看摘要

Abstract:In driver activity monitoring, movements are mostly limited to the upper body, which makes many actions look similar. To tell these actions apart, human often rely on the objects the driver is using, such as holding a phone compared with gripping the steering wheel. However, most existing driver-monitoring datasets lack accurate object-location annotations or do not link objects to their associated actions, leaving a critical gap for reliable action recognition. To address this, we introduce the Driver Action with Object Synergy (DAOS) dataset, comprising 9,787 video clips annotated with 36 fine-grained driver actions and 15 object classes, totaling more than 2.5 million corresponding object instances. DAOS offers multi-modal, multi-view data (RGB, IR, and depth) from front, face, left, and right perspectives. Although DAOS captures a wide range of cabin objects, only a few are directly relevant to each action for prediction, so focusing on task-specific human-object relations is essential. To tackle this challenge, we propose the Action-Object-Relation Network (AOR-Net). AOR-Net comprehends complex driver actions through multi-level reasoning and a chain-of-action prompting mechanism that models the logical relationships among actions, objects, and their relations. Additionally, the Mixture of Thoughts module is introduced to dynamically select essential knowledge at each stage, enhancing robustness in object-rich and object-scarce conditions. Extensive experiments demonstrate that our model outperforms other state-of-the-art methods on various datasets.

251. 【2601.11987】Structural Graph Neural Networks with Anatomical Priors for Explainable Chest X-ray Diagnosis

链接https://arxiv.org/abs/2601.11987

作者:Khaled Berkani

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:incorporates explicit anatomical, explainable vision-based diagnosis, explicit anatomical priors, vision-based diagnosis, incorporates explicit

备注: 15 pages, 3 figures, 3 tables

点击查看摘要

Abstract:We present a structural graph reasoning framework that incorporates explicit anatomical priors for explainable vision-based diagnosis. Convolutional feature maps are reinterpreted as patch-level graphs, where nodes encode both appearance and spatial coordinates, and edges reflect local structural adjacency. Unlike conventional graph neural networks that rely on generic message passing, we introduce a custom structural propagation mechanism that explicitly models relative spatial relations as part of the reasoning process. This design enables the graph to act as an inductive bias for structured inference rather than a passive relational representation. The proposed model jointly supports node-level lesion-aware predictions and graph-level diagnostic reasoning, yielding intrinsic explainability through learned node importance scores without relying on post-hoc visualization techniques. We demonstrate the approach through a chest X-ray case study, illustrating how structural priors guide relational reasoning and improve interpretability. While evaluated in a medical imaging context, the framework is domain-agnostic and aligns with the broader vision of graph-based reasoning across artificial intelligence systems. This work contributes to the growing body of research exploring graphs as computational substrates for structure-aware and explainable learning.

252. 【2601.11983】An AI-IoT Based Smart Wheelchair with Gesture-Controlled Mobility, Deep Learning-Based Obstacle Detection, Multi-Sensor Health Monitoring, and Emergency Alert System

链接https://arxiv.org/abs/2601.11983

作者:Md. Asiful Islam,Abdul Hasib,Tousif Mahmud Emon,Khandaker Tabin Hasan,A. S. M. Ahsanul Sarkar Akib

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:elderly individuals demands, combine safe navigation, individuals demands affordable, growing number, number of differently-abled

备注

点击查看摘要

Abstract:The growing number of differently-abled and elderly individuals demands affordable, intelligent wheelchairs that combine safe navigation with health monitoring. Traditional wheelchairs lack dynamic features, and many smart alternatives remain costly, single-modality, and limited in health integration. Motivated by the pressing demand for advanced, personalized, and affordable assistive technologies, we propose a comprehensive AI-IoT based smart wheelchair system that incorporates glove-based gesture control for hands-free navigation, real-time object detection using YOLOv8 with auditory feedback for obstacle avoidance, and ultrasonic for immediate collision avoidance. Vital signs (heart rate, SpO$_2$, ECG, temperature) are continuously monitored, uploaded to ThingSpeak, and trigger email alerts for critical conditions. Built on a modular and low-cost architecture, the gesture control achieved a 95.5\% success rate, ultrasonic obstacle detection reached 94\% accuracy, and YOLOv8-based object detection delivered 91.5\% Precision, 90.2\% Recall, and a 90.8\% F1-score. This integrated, multi-modal approach offers a practical, scalable, and affordable solution, significantly enhancing user autonomy, safety, and independence by bridging the gap between innovative research and real-world deployment.

253. 【2601.11981】Nip Rumors in the Bud: Retrieval-Guided Topic-Level Adaptation for Test-Time Fake News Video Detection

链接https://arxiv.org/abs/2601.11981

作者:Jian Lang,Rongpei Hong,Ting Zhong,Yong Wang,Fan Zhou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Video Detection, social stability, critical for social, Detection, Fake News Video

备注: 13 pages. Accepted by KDD 2026 research track. Codes are released at [this https URL](https://github.com/Jian-Lang/RADAR)

点击查看摘要

Abstract:Fake News Video Detection (FNVD) is critical for social stability. Existing methods typically assume consistent news topic distribution between training and test phases, failing to detect fake news videos tied to emerging events and unseen topics. To bridge this gap, we introduce RADAR, the first framework that enables test-time adaptation to unseen news videos. RADAR pioneers a new retrieval-guided adaptation paradigm that leverages stable (source-close) videos from the target domain to guide robust adaptation of semantically related but unstable instances. Specifically, we propose an Entropy Selection-Based Retrieval mechanism that provides videos with stable (low-entropy), relevant references for adaptation. We also introduce a Stable Anchor-Guided Alignment module that explicitly aligns unstable instances' representations to the source domain via distribution-level matching with their stable references, mitigating severe domain discrepancies. Finally, our novel Target-Domain Aware Self-Training paradigm can generate informative pseudo-labels augmented by stable references, capturing varying and imbalanced category distributions in the target domain and enabling RADAR to adapt to the fast-changing label distributions. Extensive experiments demonstrate that RADAR achieves superior performance for test-time FNVD, enabling strong on-the-fly adaptation to unseen fake news video topics.

254. 【2601.11976】AVIR: Adaptive Visual In-Document Retrieval for Efficient Multi-Page Document Question Answering

链接https://arxiv.org/abs/2601.11976

作者:Zongmin Li,Yachuan Li,Lei Kang,Dimosthenis Karatzas,Wenkang Ma

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Multi-page Document Visual, Adaptive Visual In-document, Visual In-document Retrieval, Document Visual Question, Visual Question Answering

备注: 7 pages, 3 figures

点击查看摘要

Abstract:Multi-page Document Visual Question Answering (MP-DocVQA) remains challenging because long documents not only strain computational resources but also reduce the effectiveness of the attention mechanism in large vision-language models (LVLMs). We tackle these issues with an Adaptive Visual In-document Retrieval (AVIR) framework. A lightweight retrieval model first scores each page for question relevance. Pages are then clustered according to the score distribution to adaptively select relevant content. The clustered pages are screened again by Top-K to keep the context compact. However, for short documents, clustering reliability decreases, so we use a relevance probability threshold to select pages. The selected pages alone are fed to a frozen LVLM for answer generation, eliminating the need for model fine-tuning. The proposed AVIR framework reduces the average page count required for question answering by 70%, while achieving an ANLS of 84.58% on the MP-DocVQA dataset-surpassing previous methods with significantly lower computational cost. The effectiveness of the proposed AVIR is also verified on the SlideVQA and DUDE benchmarks. The code is available at this https URL.

255. 【2601.11970】Real-Time Multi-Modal Embedded Vision Framework for Object Detection Facial Emotion Recognition and Biometric Identification on Low-Power Edge Platforms

链接https://arxiv.org/abs/2601.11970

作者:S. M. Khalid Bin Zahid,Md. Rakibul Hasan Nishat,Abdul Hasib,Md. Rakibul Hasan,Md. Ashiqussalehin,Md. Sahadat Hossen Sajib,A. S. M. Ahsanul Sarkar Akib

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:handle perceptual tasks, emotion analysis independently, adaptive runtime scheduler, dynamically allocates computational, allocates computational resources

备注

点击查看摘要

Abstract:Intelligent surveillance systems often handle perceptual tasks such as object detection, facial recognition, and emotion analysis independently, but they lack a unified, adaptive runtime scheduler that dynamically allocates computational resources based on contextual triggers. This limits their holistic understanding and efficiency on low-power edge devices. To address this, we present a real-time multi-modal vision framework that integrates object detection, owner-specific face recognition, and emotion detection into a unified pipeline deployed on a Raspberry Pi 5 edge platform. The core of our system is an adaptive scheduling mechanism that reduces computational load by 65\% compared to continuous processing by selectively activating modules such as, YOLOv8n for object detection, a custom FaceNet-based embedding system for facial recognition, and DeepFace's CNN for emotion classification. Experimental results demonstrate the system's efficacy, with the object detection module achieving an Average Precision (AP) of 0.861, facial recognition attaining 88\% accuracy, and emotion detection showing strong discriminatory power (AUC up to 0.97 for specific emotions), while operating at 5.6 frames per second. Our work demonstrates that context-aware scheduling is the key to unlocking complex multi-modal AI on cost-effective edge hardware, making intelligent perception more accessible and privacy-preserving.

256. 【2601.11967】A Constraint Programming Model for the Super-Agile Earth Observation Satellite Imaging Scheduling Problem

链接https://arxiv.org/abs/2601.11967

作者:Margarida Caleiras,Samuel Moniz,Paulo Jorge Nascimento

类目:ystems and Control (eess.SY); Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)

关键词:super-agile Earth observation, Earth observation satellites, providing unprecedented imaging, unprecedented imaging flexibility, super-agile Earth

备注: 12 pages, 4 figures, To be published in the Proceedings of the International Conference on Operations Research and Enterprise Systems (ICORES 2026)

点击查看摘要

Abstract:As the dependence on satellite imaging continues to grow, modern satellites have become increasingly agile, with the new generation, namely super-agile Earth observation satellites (SAEOS), providing unprecedented imaging flexibility. The highly dynamic capabilities of these satellites introduce additional challenges to the scheduling of observation tasks, as existing approaches for conventional agile satellites do not account for variable observation durations and multiple imaging directions. Although some efforts have been made in this regard, the SAEOS imaging scheduling problem (SAEOS-ISP) remains largely unexplored, and no exact approaches have yet been proposed. In this context, this study presents the first exact Constraint Programming formulation for the SAEOS-ISP, considering flexible observation windows, multiple pointing directions and sequence-dependent transition times across multiple satellites. Computational experiments on a newly generated benchmark set demonstrate that the model can be solved efficiently and within very short computational times. Moreover, the results also show that the proposed approach has the potential to achieve higher computational performance compared to the non-exact approaches that are currently considered state-of-the-art.

257. 【2601.11952】Decoder Gradient Shields: A Family of Provable and High-Fidelity Methods Against Gradient-Based Box-Free Watermark Removal

链接https://arxiv.org/abs/2601.11952

作者:Haonan An,Guang Hua,Wei Du,Hangcheng Cao,Yihang Tao,Guowen Xu,Susanto Rahardja,Yuguang Fang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:deep neural network, intellectual property protection, gained significant attention, property protection due, flexibly manage high-entropy

备注

点击查看摘要

Abstract:Box-free model watermarking has gained significant attention in deep neural network (DNN) intellectual property protection due to its model-agnostic nature and its ability to flexibly manage high-entropy image outputs from generative models. Typically operating in a black-box manner, it employs an encoder-decoder framework for watermark embedding and extraction. While existing research has focused primarily on the encoders for the robustness to resist various attacks, the decoders have been largely overlooked, leading to attacks against the watermark. In this paper, we identify one such attack against the decoder, where query responses are utilized to obtain backpropagated gradients to train a watermark remover. To address this issue, we propose Decoder Gradient Shields (DGSs), a family of defense mechanisms, including DGS at the output (DGS-O), at the input (DGS-I), and in the layers (DGS-L) of the decoder, with a closed-form solution for DGS-O and provable performance for all DGS. Leveraging the joint design of reorienting and rescaling of the gradients from watermark channel gradient leaking queries, the proposed DGSs effectively prevent the watermark remover from achieving training convergence to the desired low-loss value, while preserving image quality of the decoder output. We demonstrate the effectiveness of our proposed DGSs in diverse application scenarios. Our experimental results on deraining and image generation tasks with the state-of-the-art box-free watermarking show that our DGSs achieve a defense success rate of 100% under all settings.

258. 【2601.11944】Deep learning-based neurodevelopmental assessment in preterm infants

链接https://arxiv.org/abs/2601.11944

作者:Lexin Ren,Jiamiao Lu,Weichuan Zhang,Benqing Wu,Tuo Wang,Yi Liao,Jiapan Guo,Changming Sun,Liang Guo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:face elevated risks, making early identification, early identification crucial, Preterm infants, weeks of gestation

备注: 27 pages, 8 figures

点击查看摘要

Abstract:Preterm infants (born between 28 and 37 weeks of gestation) face elevated risks of neurodevelopmental delays, making early identification crucial for timely intervention. While deep learning-based volumetric segmentation of brain MRI scans offers a promising avenue for assessing neonatal neurodevelopment, achieving accurate segmentation of white matter (WM) and gray matter (GM) in preterm infants remains challenging due to their comparable signal intensities (isointense appearance) on MRI during early brain development. To address this, we propose a novel segmentation neural network, named Hierarchical Dense Attention Network. Our architecture incorporates a 3D spatial-channel attention mechanism combined with an attention-guided dense upsampling strategy to enhance feature discrimination in low-contrast volumetric data. Quantitative experiments demonstrate that our method achieves superior segmentation performance compared to state-of-the-art baselines, effectively tackling the challenge of isointense tissue differentiation. Furthermore, application of our algorithm confirms that WM and GM volumes in preterm infants are significantly lower than those in term infants, providing additional imaging evidence of the neurodevelopmental delays associated with preterm birth. The code is available at: this https URL.

259. 【2601.11931】Language-Guided and Motion-Aware Gait Representation for Generalizable Recognition

链接https://arxiv.org/abs/2601.11931

作者:Zhengxian Wu,Chuanrui Zhang,Shenao Jiang,Hangrui Xu,Zirui Liao,Luyuan Zhang,Huaqiu Li,Peng Jiao,Haoqian Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:computer vision, promising technology, innovative field, field within computer, Motion-aware gait recognition

备注

点击查看摘要

Abstract:Gait recognition is emerging as a promising technology and an innovative field within computer vision. However, existing methods typically rely on complex architectures to directly extract features from images and apply pooling operations to obtain sequence-level representations. Such designs often lead to overfitting on static noise (e.g., clothing), while failing to effectively capture dynamic motion this http URL address the above challenges, we present a Language guided and Motion-aware gait recognition framework, named this http URL particular, we utilize designed gait-related language cues to capture key motion features in gait sequences.

260. 【2601.11930】SupScene: Learning Overlap-Aware Global Descriptor for Unconstrained SfM

链接https://arxiv.org/abs/2601.11930

作者:Xulei Shi,Maoyu Wang,Yuning Peng,Guanbo Wang,Xin Wang,Qi Chen,Pengjie Tao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:matching in unconstrained, image retrieval typically, Image retrieval, critical step, step for alleviating

备注

点击查看摘要

Abstract:Image retrieval is a critical step for alleviating the quadratic complexity of image matching in unconstrained Structure-from-Motion (SfM). However, in this context, image retrieval typically focuses more on the image pairs of geometric matchability than on those of semantic similarity, a nuance that most existing deep learning-based methods guided by batched binaries (overlapping vs. non-overlapping pairs) fail to capture. In this paper, we introduce SupScene, a novel solution that learns global descriptors tailored for finding overlapping image pairs of similar geometric nature for SfM. First, to better underline co-visible regions, we employ a subgraph-based training strategy that moves beyond equally important isolated pairs, leveraging ground-truth geometric overlapping relationships with various weights to provide fine-grained supervision via a soft supervised contrastive loss. Second, we introduce DiVLAD, a DINO-inspired VLAD aggregator that leverages the inherent multi-head attention maps from the last block of ViT. And then, a learnable gating mechanism is designed to adaptively utilize these semantically salient cues with visual features, enabling a more discriminative global descriptor. Extensive experiments on the GL3D dataset demonstrate that our method achieves state-of-the-art performance, significantly outperforming NetVLAD while introducing a negligible number of additional trainable parameters. Furthermore, we show that the proposed training strategy brings consistent gains across different aggregation techniques. Code and models are available at this https URL.

261. 【2601.11918】Effects of Gabor Filters on Classification Performance of CNNs Trained on a Limited Number of Conditions

链接https://arxiv.org/abs/2601.11918

作者:Akito Morita,Hirotsugu Okuno

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:convolutional neural networks, robot vision applications, real-world robot vision, running on edge, edge devices

备注: 5 pages, 4 figures, 4 tables

点击查看摘要

Abstract:In this study, we propose a technique to improve the accuracy and reduce the size of convolutional neural networks (CNNs) running on edge devices for real-world robot vision applications. CNNs running on edge devices must have a small architecture, and CNNs for robot vision applications involving on-site object recognition must be able to be trained efficiently to identify specific visual targets from data obtained under a limited variation of conditions. The visual nervous system (VNS) is a good example that meets the above requirements because it learns from few visual experiences. Therefore, we used a Gabor filter, a model of the feature extractor of the VNS, as a preprocessor for CNNs to investigate the accuracy of the CNNs trained with small amounts of data. To evaluate how well CNNs trained on image data acquired under a limited variation of conditions generalize to data acquired under other conditions, we created an image dataset consisting of images acquired from different camera positions, and investigated the accuracy of the CNNs that trained using images acquired at a certain distance. The results were compared after training on multiple CNN architectures with and without Gabor filters as preprocessing. The results showed that preprocessing with Gabor filters improves the generalization performance of CNNs and contributes to reducing the size of CNNs.

262. 【2601.11915】From Spurious to Causal: Low-rank Orthogonal Subspace Intervention for Generalizable Face Forgery Detection

链接https://arxiv.org/abs/2601.11915

作者:Chi Wang,Xinjue Hu,Boyu Wang,Ziwen He,Zhangjie Fu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:generalization problem remains, face forgery detection, problem remains, remains a critical, critical challenge

备注

点击查看摘要

Abstract:The generalization problem remains a critical challenge in face forgery detection. Some researches have discovered that ``a backdoor path" in the representations from forgery-irrelevant information to labels induces biased learning, thereby hindering the generalization. In this paper, these forgery-irrelevant information are collectively termed spurious correlations factors. Previous methods predominantly focused on identifying concrete, specific spurious correlation and designing corresponding solutions to address them. However, spurious correlations arise from unobservable confounding factors, making it impractical to identify and address each one individually. To address this, we propose an intervention paradigm for representation space. Instead of tracking and blocking various instance-level spurious correlation one by one, we uniformly model them as a low-rank subspace and intervene in them. Specifically, we decompose spurious correlation features into a low-rank subspace via orthogonal low-rank projection, subsequently removing this subspace from the original representation and training its orthogonal complement to capture forgery-related features. This low-rank projection removal effectively eliminates spurious correlation factors, ensuring that classification decision is based on authentic forgery cues. With only 0.43M trainable parameters, our method achieves state-of-the-art performance across several benchmarks, demonstrating excellent robustness and generalization.

263. 【2601.11911】Reliable Deep Learning for Small-Scale Classifications: Experiments on Real-World Image Datasets from Bangladesh

链接https://arxiv.org/abs/2601.11911

作者:Muhammad Ibrahim,Alfe Suny,MD Sakib Ul Islam,Md. Imran Hossain

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Convolutional neural networks, Convolutional neural, involve complex architectures, involve complex, overfit on small

备注

点击查看摘要

Abstract:Convolutional neural networks (CNNs) have achieved state-of-the-art performance in image recognition tasks but often involve complex architectures that may overfit on small datasets. In this study, we evaluate a compact CNN across five publicly available, real-world image datasets from Bangladesh, including urban encroachment, vehicle detection, road damage, and agricultural crops. The network demonstrates high classification accuracy, efficient convergence, and low computational overhead. Quantitative metrics and saliency analyses indicate that the model effectively captures discriminative features and generalizes robustly across diverse scenarios, highlighting the suitability of streamlined CNN architectures for small-class image classification tasks.

264. 【2601.11910】A Training-Free Guess What Vision Language Model from Snippets to Open-Vocabulary Object Detection

链接https://arxiv.org/abs/2601.11910

作者:Guiying Zhu,Bowen Yang,Yin Zhuang,Tong Zhang,Guanqun Wang,Zhihao Che,He Chen,Lianlin Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Vision Language Model, Open-Vocabulary Object Detection, Visual Language Searching, Language Model, Large Language Model

备注

点击查看摘要

Abstract:Open-Vocabulary Object Detection (OVOD) aims to develop the capability to detect anything. Although myriads of large-scale pre-training efforts have built versatile foundation models that exhibit impressive zero-shot capabilities to facilitate OVOD, the necessity of creating a universal understanding for any object cognition according to already pretrained foundation models is usually overlooked. Therefore, in this paper, a training-free Guess What Vision Language Model, called GW-VLM, is proposed to form a universal understanding paradigm based on our carefully designed Multi-Scale Visual Language Searching (MS-VLS) coupled with Contextual Concept Prompt (CCP) for OVOD. This approach can engage a pre-trained Vision Language Model (VLM) and a Large Language Model (LLM) in the game of "guess what". Wherein, MS-VLS leverages multi-scale visual-language soft-alignment for VLM to generate snippets from the results of class-agnostic object detection, while CCP can form the concept of flow referring to MS-VLS and then make LLM understand snippets for OVOD. Finally, the extensive experiments are carried out on natural and remote sensing datasets, including COCO val, Pascal VOC, DIOR, and NWPU-10, and the results indicate that our proposed GW-VLM can achieve superior OVOD performance compared to the-state-of-the-art methods without any training step.

265. 【2601.11909】Effects of the retina-inspired light intensity encoding on color discrimination performance

链接https://arxiv.org/abs/2601.11909

作者:Io Yamada,Hirotsugu Okuno

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Color, color information, object recognition, greatly affected, visual

备注: 8 pages, 14 figures, 4 tables

点击查看摘要

Abstract:Color is an important source of information for visual functions such as object recognition, but it is greatly affected by the color of illumination. The ability to perceive the color of a visual target independent of illumination color is called color constancy (CC), and is an important feature for vision systems that use color information. In this study, we investigated the effects of the light intensity encoding function on the performance of CC of the center/surround (C/S) retinex model, which is a well-known model inspired by CC of the visual nervous system. The functions used to encode light intensity are the logarithmic function used in the original C/S retinex model and the Naka-Rushton (N-R) function, which is a model of retinal photoreceptor response. Color-variable LEDs were used to illuminate visual targets with various lighting colors, and color information computed by each model was used to evaluate the degree to which the color of visual targets illuminated with different lighting colors could be discriminated. Color information was represented using the HSV color space and a color plane based on the classical opponent color theory. The results showed that the combination of the N-R function and the double opponent color plane representation provided superior discrimination performance.

266. 【2601.11907】owards Airborne Object Detection: A Deep Learning Analysis

链接https://arxiv.org/abs/2601.11907

作者:Prosenjit Chatterjee,ANK Zaman

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)

关键词:including commercial aircraft, automated threat assessment, threat assessment systems, including commercial, commercial aircraft

备注

点击查看摘要

Abstract:The rapid proliferation of airborne platforms, including commercial aircraft, drones, and UAVs, has intensified the need for real-time, automated threat assessment systems. Current approaches depend heavily on manual monitoring, resulting in limited scalability and operational inefficiencies. This work introduces a dual-task model based on EfficientNetB4 capable of performing airborne object classification and threat-level prediction simultaneously. To address the scarcity of clean, balanced training data, we constructed the AODTA Dataset by aggregating and refining multiple public sources. We benchmarked our approach on both the AVD Dataset and the newly developed AODTA Dataset and further compared performance against a ResNet-50 baseline, which consistently underperformed EfficientNetB4. Our EfficientNetB4 model achieved 96% accuracy in object classification and 90% accuracy in threat-level prediction, underscoring its promise for applications in surveillance, defense, and airspace management. Although the title references detection, this study focuses specifically on classification and threat-level inference using pre-localized airborne object images provided by existing datasets.

267. 【2601.11898】RemoteVAR: Autoregressive Visual Modeling for Remote Sensing Change Detection

链接https://arxiv.org/abs/2601.11898

作者:Yilmaz Korkmaz,Vishal M. Patel

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:characterize scene changes, Remote sensing change, disaster assessment, sensing change detection, change detection aims

备注

点击查看摘要

Abstract:Remote sensing change detection aims to localize and characterize scene changes between two time points and is central to applications such as environmental monitoring and disaster assessment. Meanwhile, visual autoregressive models (VARs) have recently shown impressive image generation capability, but their adoption for pixel-level discriminative tasks remains limited due to weak controllability, suboptimal dense prediction performance and exposure bias. We introduce RemoteVAR, a new VAR-based change detection framework that addresses these limitations by conditioning autoregressive prediction on multi-resolution fused bi-temporal features via cross-attention, and by employing an autoregressive training strategy designed specifically for change map prediction. Extensive experiments on standard change detection benchmarks show that RemoteVAR delivers consistent and significant improvements over strong diffusion-based and transformer-based baselines, establishing a competitive autoregressive alternative for remote sensing change detection. Code will be available \href{this https URL}{\underline{here}}.

268. 【2601.11896】Digital FAST: An AI-Driven Multimodal Framework for Rapid and Early Stroke Screening

链接https://arxiv.org/abs/2601.11896

作者:Ngoc-Khai Hoang,Thi-Nhu-Mai Nguyen,Huy-Hieu Pham

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:improving patient outcomes, enabling timely intervention, patient outcomes, prehospital settings, symptoms is essential

备注

点击查看摘要

Abstract:Early identification of stroke symptoms is essential for enabling timely intervention and improving patient outcomes, particularly in prehospital settings. This study presents a fast, non-invasive multimodal deep learning framework for automatic binary stroke screening based on data collected during the F.A.S.T. assessment. The proposed approach integrates complementary information from facial expressions, speech signals, and upper-body movements to enhance diagnostic robustness. Facial dynamics are represented using landmark based features and modeled with a Transformer architecture to capture temporal dependencies. Speech signals are converted into mel spectrograms and processed using an Audio Spectrogram Transformer, while upper-body pose sequences are analyzed with an MLP-Mixer network to model spatiotemporal motion patterns. The extracted modality specific representations are combined through an attention-based fusion mechanism to effectively learn cross modal interactions. Experiments conducted on a self-collected dataset of 222 videos from 37 subjects demonstrate that the proposed multimodal model consistently outperforms unimodal baselines, achieving 95.83% accuracy and a 96.00% F1-score. The model attains a strong balance between sensitivity and specificity and successfully detects all stroke cases in the test set. These results highlight the potential of multimodal learning and transfer learning for early stroke screening, while emphasizing the need for larger, clinically representative datasets to support reliable real-world deployment.

269. 【2601.11876】AI for Green Spaces: Leveraging Autonomous Navigation and Computer Vision for Park Litter Removal

链接https://arxiv.org/abs/2601.11876

作者:Christopher Kao,Akhil Pathapati,James Davis

类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)

关键词:billion pieces, pieces of litter, Spanning Tree Coverage, Convolutional Neural Network, trash

备注: Published in IEEE/SICE SII 2025

点击查看摘要

Abstract:There are 50 billion pieces of litter in the U.S. alone. Grass fields contribute to this problem because picnickers tend to leave trash on the field. We propose building a robot that can autonomously navigate, identify, and pick up trash in parks. To autonomously navigate the park, we used a Spanning Tree Coverage (STC) algorithm to generate a coverage path the robot could follow. To navigate this path, we successfully used Real-Time Kinematic (RTK) GPS, which provides a centimeter-level reading every second. For computer vision, we utilized the ResNet50 Convolutional Neural Network (CNN), which detects trash with 94.52% accuracy. For trash pickup, we tested multiple design concepts. We select a new pickup mechanism that specifically targets the trash we encounter on the field. Our solution achieved an overall success rate of 80%, demonstrating that autonomous trash pickup robots on grass fields are a viable solution.

270. 【2601.11827】MixFlow: Mixture-Conditioned Flow Matching for Out-of-Distribution Generalization

链接https://arxiv.org/abs/2601.11827

作者:Andrea Rubbi,Amir Akbarnejad,Mohammad Vali Sanian,Aryan Yazdan Parast,Hesam Asadollahzadeh,Arian Amani,Naveed Akhtar,Sarah Cooper,Andrew Bassett,Pietro Liò,Lassi Paavolainen,Sattar Vakili,Mo Lotfollahi

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:distribution shift remains, existing conditional flow-based, conditional flow-based methods, shift remains, remains a central

备注

点击查看摘要

Abstract:Achieving robust generalization under distribution shift remains a central challenge in conditional generative modeling, as existing conditional flow-based methods often struggle to extrapolate beyond the training conditions. We introduce MixFlow, a conditional flow-matching framework for descriptor-controlled generation that directly targets this limitation by jointly learning a descriptor-conditioned base distribution and a descriptor-conditioned flow field via shortest-path flow matching. By modeling the base distribution as a learnable, descriptor-dependent mixture, MixFlow enables smooth interpolation and extrapolation to unseen conditions, leading to substantially improved out-of-distribution generalization. We provide analytical insights into the behavior of the proposed framework and empirically demonstrate its effectiveness across multiple domains, including prediction of responses to unseen perturbations in single-cell transcriptomic data and high-content microscopy-based drug screening tasks. Across these diverse settings, MixFlow consistently outperforms standard conditional flow-matching baselines. Overall, MixFlow offers a simple yet powerful approach for achieving robust, generalizable, and controllable generative modeling across heterogeneous domains.

271. 【2601.11794】Physics-Constrained Denoising Autoencoders for Data-Scarce Wildfire UAV Sensing

链接https://arxiv.org/abs/2601.11794

作者:Abdelrahman Ramadan,Zahra Dorbeigi Namaghi,Emily Taylor,Lucas Edwards,Xan Giuliani,David S. McLagan,Sidney Givigi,Melissa Greeff

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:Unmanned Aerial Vehicles, Wildfire monitoring requires, Aerial Vehicles, Unmanned Aerial, high-resolution atmospheric measurements

备注

点击查看摘要

Abstract:Wildfire monitoring requires high-resolution atmospheric measurements, yet low-cost sensors on Unmanned Aerial Vehicles (UAVs) exhibit baseline drift, cross-sensitivity, and response lag that corrupt concentration estimates. Traditional deep learning denoising approaches demand large datasets impractical to obtain from limited UAV flight campaigns. We present PC$^2$DAE, a physics-informed denoising autoencoder that addresses data scarcity by embedding physical constraints directly into the network architecture. Non-negative concentration estimates are enforced via softplus activations and physically plausible temporal smoothing, ensuring outputs are physically admissible by construction rather than relying on loss function penalties. The architecture employs hierarchical decoder heads for Black Carbon, Gas, and CO$_2$ sensor families, with two variants: PC$^2$DAE-Lean (21k parameters) for edge deployment and PC$^2$DAE-Wide (204k parameters) for offline processing. We evaluate on 7,894 synchronized 1 Hz samples collected from UAV flights during prescribed burns in Saskatchewan, Canada (approximately 2.2 hours of flight data), two orders of magnitude below typical deep learning requirements. PC$^2$DAE-Lean achieves 67.3\% smoothness improvement and 90.7\% high-frequency noise reduction with zero physics violations. Five baselines (LSTM-AE, U-Net, Transformer, CBDAE, DeSpaWN) produce 15--23\% negative outputs. The lean variant outperforms wide (+5.6\% smoothness), suggesting reduced capacity with strong inductive bias prevents overfitting in data-scarce regimes. Training completes in under 65 seconds on consumer hardware.

272. 【2601.11781】Risk-Aware Human-in-the-Loop Framework with Adaptive Intrusion Response for Autonomous Vehicles

链接https://arxiv.org/abs/2601.11781

作者:Dawood Wasif,Terrence J. Moore,Seunghyun Yoon,Hyuk Lim,Dan Dongseong Kim,Frederica F. Nelson,Jin-Hee Cho

类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:encountering rare long-tailed, rare long-tailed scenarios, Autonomous vehicles, Intrusion Risk Score, Test Success Rate

备注: Submitted to ICRA 2026 (under review)

点击查看摘要

Abstract:Autonomous vehicles must remain safe and effective when encountering rare long-tailed scenarios or cyber-physical intrusions during driving. We present RAIL, a risk-aware human-in-the-loop framework that turns heterogeneous runtime signals into calibrated control adaptations and focused learning. RAIL fuses three cues (curvature actuation integrity, time-to-collision proximity, and observation-shift consistency) into an Intrusion Risk Score (IRS) via a weighted Noisy-OR. When IRS exceeds a threshold, actions are blended with a cue-specific shield using a learned authority, while human override remains available; when risk is low, the nominal policy executes. A contextual bandit arbitrates among shields based on the cue vector, improving mitigation choices online. RAIL couples Soft Actor-Critic (SAC) with risk-prioritized replay and dual rewards so that takeovers and near misses steer learning while nominal behavior remains covered. On MetaDrive, RAIL achieves a Test Return (TR) of 360.65, a Test Success Rate (TSR) of 0.85, a Test Safety Violation (TSV) of 0.75, and a Disturbance Rate (DR) of 0.0027, while logging only 29.07 training safety violations, outperforming RL, safe RL, offline/imitation learning, and prior HITL baselines. Under Controller Area Network (CAN) injection and LiDAR spoofing attacks, it improves Success Rate (SR) to 0.68 and 0.80, lowers the Disengagement Rate under Attack (DRA) to 0.37 and 0.03, and reduces the Attack Success Rate (ASR) to 0.34 and 0.11. In CARLA, RAIL attains a TR of 1609.70 and TSR of 0.41 with only 8000 steps.

273. 【2601.11779】Cross-Domain Object Detection Using Unsupervised Image Translation

链接https://arxiv.org/abs/2601.11779

作者:Vinicius F. Arruda,Rodrigo F. Berriel,Thiago M. Paixão,Claudine Badue,Alberto F. De Souza,Nicu Sebe,Thiago Oliveira-Santos

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:object detection addresses, unseen target domain, Unsupervised domain adaptation, detection addresses, addresses the adaption

备注

点击查看摘要

Abstract:Unsupervised domain adaptation for object detection addresses the adaption of detectors trained in a source domain to work accurately in an unseen target domain. Recently, methods approaching the alignment of the intermediate features proven to be promising, achieving state-of-the-art results. However, these methods are laborious to implement and hard to interpret. Although promising, there is still room for improvements to close the performance gap toward the upper-bound (when training with the target data). In this work, we propose a method to generate an artificial dataset in the target domain to train an object detector. We employed two unsupervised image translators (CycleGAN and an AdaIN-based model) using only annotated data from the source domain and non-annotated data from the target domain. Our key contributions are the proposal of a less complex yet more effective method that also has an improved interpretability. Results on real-world scenarios for autonomous driving show significant improvements, outperforming state-of-the-art methods in most cases, further closing the gap toward the upper-bound.

274. 【2601.11772】studentSplat: Your Student Model Learns Single-view 3D Gaussian Splatting

链接https://arxiv.org/abs/2601.11772

作者:Yimu Pan,Hongda Mao,Qingshuang Chen,Yelin Kim

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:remain under-explored due, Recent advance, Gaussian splatting, reconstruction remain under-explored, Gaussian splatting method

备注

点击查看摘要

Abstract:Recent advance in feed-forward 3D Gaussian splatting has enable remarkable multi-view 3D scene reconstruction or single-view 3D object reconstruction but single-view 3D scene reconstruction remain under-explored due to inherited ambiguity in single-view. We present \textbf{studentSplat}, a single-view 3D Gaussian splatting method for scene reconstruction. To overcome the scale ambiguity and extrapolation problems inherent in novel-view supervision from a single input, we introduce two techniques: 1) a teacher-student architecture where a multi-view teacher model provides geometric supervision to the single-view student during training, addressing scale ambiguity and encourage geometric validity; and 2) an extrapolation network that completes missing scene context, enabling high-quality extrapolation. Extensive experiments show studentSplat achieves state-of-the-art single-view novel-view reconstruction quality and comparable performance to multi-view methods at the scene level. Furthermore, studentSplat demonstrates competitive performance as a self-supervised single-view depth estimation method, highlighting its potential for general single-view 3D understanding tasks.

275. 【2601.11769】From Pixels to Purchase: Building and Evaluating a Taxonomy-Decoupled Visual Search Engine for Home Goods E-commerce

链接https://arxiv.org/abs/2601.11769

作者:Cheng Lyu,Jingyue Zhang,Ryan Maunu,Mengwei Li,Vinny DeGenova,Yuanli Pei

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:critical for e-commerce, subjective and open-ended, style-driven domains, domains where user, user intent

备注

点击查看摘要

Abstract:Visual search is critical for e-commerce, especially in style-driven domains where user intent is subjective and open-ended. Existing industrial systems typically couple object detection with taxonomy-based classification and rely on catalog data for evaluation, which is prone to noise that limits robustness and scalability. We propose a taxonomy-decoupled architecture that uses classification-free region proposals and unified embeddings for similarity retrieval, enabling a more flexible and generalizable visual search. To overcome the evaluation bottleneck, we propose an LLM-as-a-Judge framework that assesses nuanced visual similarity and category relevance for query-result pairs in a zero-shot manner, removing dependence on human annotations or noise-prone catalog data. Deployed at scale on a global home goods platform, our system improves retrieval quality and yields a measurable uplift in customer engagement, while our offline evaluation metrics strongly correlate with real-world outcomes.

276. 【2601.11729】SpaRRTa: A Synthetic Benchmark for Evaluating Spatial Intelligence in Visual Foundation Models

链接https://arxiv.org/abs/2601.11729

作者:Turhan Can Kargin,Wojciech Jasiński,Adam Pardyl,Bartosz Zieliński,Marcin Przewięźlikowski

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:DINO and CLIP, Visual Foundation Models, exhibit limited spatial, Foundation Models, Visual Foundation

备注: Project page is available at [this https URL](https://sparrta.gmum.net/)

点击查看摘要

Abstract:Visual Foundation Models (VFMs), such as DINO and CLIP, excel in semantic understanding of images but exhibit limited spatial reasoning capabilities, which limits their applicability to embodied systems. As a result, recent work incorporates some 3D tasks (such as depth estimation) into VFM training. However, VFM performance remains inconsistent across other spatial tasks, raising the question of whether these models truly have spatial awareness or overfit to specific 3D objectives. To address this question, we introduce the Spatial Relation Recognition Task (SpaRRTa) benchmark, which evaluates the ability of VFMs to identify relative positions of objects in the image. Unlike traditional 3D objectives that focus on precise metric prediction (e.g., surface normal estimation), SpaRRTa probes a fundamental capability underpinning more advanced forms of human-like spatial understanding. SpaRRTa generates an arbitrary number of photorealistic images with diverse scenes and fully controllable object arrangements, along with freely accessible spatial annotations. Evaluating a range of state-of-the-art VFMs, we reveal significant disparities between their spatial reasoning abilities. Through our analysis, we provide insights into the mechanisms that support or hinder spatial awareness in modern VFMs. We hope that SpaRRTa will serve as a useful tool for guiding the development of future spatially aware visual models.

277. 【2601.11724】SemAlign: Language Guided Semi-supervised Domain Generalization

链接https://arxiv.org/abs/2601.11724

作者:Muditha Fernando,Kajhanan Kailainathan,Krishnakanth Nagaratnam,Isuranga Udaravi Bandara Senavirathne,Ranga Rodrigo

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Semi-supervised Domain Generalization, unseen target domains, Semi-supervised Domain, Domain Generalization, limited labeled data

备注: 15 pages, 6 figures

点击查看摘要

Abstract:Semi-supervised Domain Generalization (SSDG) addresses the challenge of generalizing to unseen target domains with limited labeled data. Existing SSDG methods highlight the importance of achieving high pseudo-labeling (PL) accuracy and preventing model overfitting as the main challenges in SSDG. In this light, we show that the SSDG literature's excessive focus on PL accuracy, without consideration for maximum data utilization during training, limits potential performance improvements. We propose a novel approach to the SSDG problem by aligning the intermediate features of our model with the semantically rich and generalized feature space of a Vision Language Model (VLM) in a way that promotes domain-invariance. The above approach is enhanced with effective image-level augmentation and output-level regularization strategies to improve data utilization and minimize overfitting. Extensive experimentation across four benchmarks against existing SSDG baselines suggests that our method achieves SOTA results both qualitatively and quantitatively. The code will be made publicly available.

278. 【2601.11700】ng Human and Machine Handwriting Apart

链接https://arxiv.org/abs/2601.11700

作者:Luis A. Leiva,Moises Diaz,Nuwan T. Attygalle,Miguel A. Ferrer,Rejean Plamondon

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Handwriting movements, behavioral biometrics, device or application, unique form, form of behavioral

备注

点击查看摘要

Abstract:Handwriting movements can be leveraged as a unique form of behavioral biometrics, to verify whether a real user is operating a device or application. This task can be framed as a reverse Turing test in which a computer has to detect if an input instance has been generated by a human or artificially. To tackle this task, we study ten public datasets of handwritten symbols (isolated characters, digits, gestures, pointing traces, and signatures) that are artificially reproduced using seven different synthesizers, including, among others, the Kinematic Theory (Sigma h model), generative adversarial networks, Transformers, and Diffusion models. We train a shallow recurrent neural network that achieves excellent performance (98.3 percent Area Under the ROC Curve (AUC) score and 1.4 percent equal error rate on average across all synthesizers and datasets) using nonfeaturized trajectory data as input. In few-shot settings, we show that our classifier achieves such an excellent performance when trained on just 10 percent of the data, as evaluated on the remaining 90% of the data as a test set. We further challenge our classifier in out-of-domain settings, and observe very competitive results as well. Our work has implications for computerized systems that need to verify human presence, and adds an additional layer of security to keep attackers at bay.

279. 【2601.11679】Conformal Point and the Calibrated Conic

链接https://arxiv.org/abs/2601.11679

作者:Richard Hartley

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:calibrating conic, conformal point, visualizing image geometry, Abstract, conic

备注

点击查看摘要

Abstract:This gives some information about the conformal point and the calibrating conic, and their relationship one to the other. These concepts are useful for visualizing image geometry, and lead to intuitive ways to compute geometry, such as angles and directions in an image.

280. 【2601.11675】Generating metamers of human scene understanding

链接https://arxiv.org/abs/2601.11675

作者:Ritik Raina,Abe Leite,Alexandros Graikos,Seoyoung Ahn,Dimitris Samaras,Gregory J. Zelinsky

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:vision combines low-resolution, Human vision combines, periphery with sparse, sparse but high-resolution, locations to construct

备注

点击查看摘要

Abstract:Human vision combines low-resolution "gist" information from the visual periphery with sparse but high-resolution information from fixated locations to construct a coherent understanding of a visual scene. In this paper, we introduce MetamerGen, a tool for generating scenes that are aligned with latent human scene representations. MetamerGen is a latent diffusion model that combines peripherally obtained scene gist information with information obtained from scene-viewing fixations to generate image metamers for what humans understand after viewing a scene. Generating images from both high and low resolution (i.e. "foveated") inputs constitutes a novel image-to-image synthesis problem, which we tackle by introducing a dual-stream representation of the foveated scenes consisting of DINOv2 tokens that fuse detailed features from fixated areas with peripherally degraded features capturing scene context. To evaluate the perceptual alignment of MetamerGen generated images to latent human scene representations, we conducted a same-different behavioral experiment where participants were asked for a "same" or "different" response between the generated and the original image. With that, we identify scene generations that are indeed metamers for the latent scene representations formed by the viewers. MetamerGen is a powerful tool for understanding scene understanding. Our proof-of-concept analyses uncovered specific features at multiple levels of visual processing that contributed to human judgments. While it can generate metamers even conditioned on random fixations, we find that high-level semantic alignment most strongly predicts metamerism when the generated scenes are conditioned on viewers' own fixated regions.

281. 【2601.11669】IPEC: Test-Time Incremental Prototype Enhancement Classifier for Few-Shot Learning

链接https://arxiv.org/abs/2601.11669

作者:Wenwen Liao,Hang Ruan,Jianbo Yu,Xiaofeng Yang,Qingchao Jiang,Xuefeng Yan

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:Metric-based few-shot approaches, gained significant popularity, significant popularity due, straightforward implementation, computational efficiency

备注

点击查看摘要

Abstract:Metric-based few-shot approaches have gained significant popularity due to their relatively straightforward implementation, high interpret ability, and computational efficiency. However, stemming from the batch-independence assumption during testing, which prevents the model from leveraging valuable knowledge accumulated from previous batches. To address these challenges, we propose a novel test-time method called Incremental Prototype Enhancement Classifier (IPEC), a test-time method that optimizes prototype estimation by leveraging information from previous query samples. IPEC maintains a dynamic auxiliary set by selectively incorporating query samples that are classified with high confidence. To ensure sample quality, we design a robust dual-filtering mechanism that assesses each query sample based on both global prediction confidence and local discriminative ability. By aggregating this auxiliary set with the support set in subsequent tasks, IPEC builds progressively more stable and representative prototypes, effectively reducing its reliance on the initial support set. We ground this approach in a Bayesian interpretation, conceptualizing the support set as a prior and the auxiliary set as a data-driven posterior, which in turn motivates the design of a practical "warm-up and test" two-stage inference protocol. Extensive empirical results validate the superior performance of our proposed method across multiple few-shot classification tasks.

282. 【2601.11666】MATEX: Multi-scale Attention and Text-guided Explainability of Medical Vision-Language Models

链接https://arxiv.org/abs/2601.11666

作者:Muhammad Imran,Chi Lee,Yugyung Lee

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:incorporating anatomically informed, informed spatial reasoning, Text-guided Explainability, medical vision-language models, anatomically informed spatial

备注: 12 pages, 3 figures, 1 table

点击查看摘要

Abstract:We introduce MATEX (Multi-scale Attention and Text-guided Explainability), a novel framework that advances interpretability in medical vision-language models by incorporating anatomically informed spatial reasoning. MATEX synergistically combines multi-layer attention rollout, text-guided spatial priors, and layer consistency analysis to produce precise, stable, and clinically meaningful gradient attribution maps. By addressing key limitations of prior methods, such as spatial imprecision, lack of anatomical grounding, and limited attention granularity, MATEX enables more faithful and interpretable model explanations. Evaluated on the MS-CXR dataset, MATEX outperforms the state-of-the-art M2IB approach in both spatial precision and alignment with expert-annotated findings. These results highlight MATEX's potential to enhance trust and transparency in radiological AI applications.

283. 【2601.11665】UAV-Based Infrastructure Inspections: A Literature Review and Proposed Framework for AEC+FM

链接https://arxiv.org/abs/2601.11665

作者:Amir Farzin Nikkhah,Dong Chen,Bradford Campbell,Somayeh Asadi,Arsalan Heydarian

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:Unmanned Aerial Vehicles, Unmanned Aerial, Aerial Vehicles, Facility Management, transforming infrastructure inspections

备注: Accepted for publication at the International Conference on Construction Engineering and Management (I3CE 2025)

点击查看摘要

Abstract:Unmanned Aerial Vehicles (UAVs) are transforming infrastructure inspections in the Architecture, Engineering, Construction, and Facility Management (AEC+FM) domain. By synthesizing insights from over 150 studies, this review paper highlights UAV-based methodologies for data acquisition, photogrammetric modeling, defect detection, and decision-making support. Key innovations include path optimization, thermal integration, and advanced machine learning (ML) models such as YOLO and Faster R-CNN for anomaly detection. UAVs have demonstrated value in structural health monitoring (SHM), disaster response, urban infrastructure management, energy efficiency evaluations, and cultural heritage preservation. Despite these advancements, challenges in real-time processing, multimodal data fusion, and generalizability remain. A proposed workflow framework, informed by literature and a case study, integrates RGB imagery, LiDAR, and thermal sensing with transformer-based architectures to improve accuracy and reliability in detecting structural defects, thermal anomalies, and geometric inconsistencies. The proposed framework ensures precise and actionable insights by fusing multimodal data and dynamically adapting path planning for complex environments, presented as a comprehensive step-by-step guide to address these challenges effectively. This paper concludes with future research directions emphasizing lightweight AI models, adaptive flight planning, synthetic datasets, and richer modality fusion to streamline modern infrastructure inspections.

284. 【2601.11662】LTV-YOLO: A Lightweight Thermal Object Detector for Young Pedestrians in Adverse Conditions

链接https://arxiv.org/abs/2601.11662

作者:Abdullah Jirjees,Ryan Myers,Muhammad Haris Ikram,Mohamed H. Zaki

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:vulnerable road users, Detecting vulnerable road, weather conditions remains, road users, vulnerable road

备注

点击查看摘要

Abstract:Detecting vulnerable road users (VRUs), particularly children and adolescents, in low light and adverse weather conditions remains a critical challenge in computer vision, surveillance, and autonomous vehicle systems. This paper presents a purpose-built lightweight object detection model designed to identify young pedestrians in various environmental scenarios. To address these challenges, our approach leverages thermal imaging from long-wave infrared (LWIR) cameras, which enhances detection reliability in conditions where traditional RGB cameras operating in the visible spectrum fail. Based on the YOLO11 architecture and customized for thermal detection, our model, termed LTV-YOLO (Lightweight Thermal Vision YOLO), is optimized for computational efficiency, accuracy and real-time performance on edge devices. By integrating separable convolutions in depth and a feature pyramid network (FPN), LTV-YOLO achieves strong performance in detecting small-scale, partially occluded, and thermally distinct VRUs while maintaining a compact architecture. This work contributes a practical and scalable solution to improve pedestrian safety in intelligent transportation systems, particularly in school zones, autonomous navigation, and smart city infrastructure. Unlike prior thermal detectors, our contribution is task-specific: a thermally only edge-capable design designed for young and small VRUs (children and distant adults). Although FPN and depthwise separable convolutions are standard components, their integration into a thermal-only pipeline optimized for short/occluded VRUs under adverse conditions is, to the best of our knowledge, novel.

285. 【2601.11660】Zeros can be Informative: Masked Binary U-Net for Image Segmentation on Tensor Cores

链接https://arxiv.org/abs/2601.11660

作者:Chunshu Wu,Ruibing Song,Sushant Kondguli,Tong Geng,Ang Li

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:resource-constrained edge devices, Real-time image segmentation, autonomous systems, edge devices, Real-time image

备注

点击查看摘要

Abstract:Real-time image segmentation is a key enabler for AR/VR, robotics, drones, and autonomous systems, where tight accuracy, latency, and energy budgets must be met on resource-constrained edge devices. While U-Net offers a favorable balance of accuracy and efficiency compared to large transformer-based models, achieving real-time performance on high-resolution input remains challenging due to compute, memory, and power limits. Extreme quantization, particularly binary networks, is appealing for its hardware-friendly operations. However, two obstacles limit practicality: (1) severe accuracy degradation, and (2) a lack of end-to-end implementations that deliver efficiency on general-purpose GPUs. We make two empirical observations that guide our design. (1) An explicit zero state is essential: training with zero masking to binary U-Net weights yields noticeable sparsity. (2) Quantization sensitivity is uniform across layers. Motivated by these findings, we introduce Masked Binary U-Net (MBU-Net), obtained through a cost-aware masking strategy that prioritizes masking where it yields the highest accuracy-per-cost, reconciling accuracy with near-binary efficiency. To realize these gains in practice, we develop a GPU execution framework that maps MBU-Net to Tensor Cores via a subtractive bit-encoding scheme, efficiently implementing masked binary weights with binary activations. This design leverages native binary Tensor Core BMMA instructions, enabling high throughput and energy savings on widely available GPUs. Across 3 segmentation benchmarks, MBU-Net attains near full-precision accuracy (3% average drop) while delivering 2.04x speedup and 3.54x energy reductions over a 16-bit floating point U-Net.

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Cite as:
arXiv:2601.11660 [cs.CV]

(or
arXiv:2601.11660v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2601.11660

Focus to learn more

              arXiv-issued DOI via DataCite</p>
286. 【2601.11656】An Efficient and Explainable KAN Framework forWireless Radiation Field Prediction

链接https://arxiv.org/abs/2601.11656

作者:Jingzhou Shen,Xuyu Wang

类目:Networking and Internet Architecture (cs.NI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Modeling wireless channels, wireless channels accurately, channels accurately remains, Modeling wireless, wireless channels

备注: Accepted in 2025 IEEE 22nd International Conference on Mobile Ad-Hoc and Smart Systems (MASS). See [this https URL](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11206262&tag=1)

点击查看摘要

Abstract:Modeling wireless channels accurately remains a challenge due to environmental variations and signal uncertainties. Recent neural networks can learn radio frequency~(RF) signal propagation patterns, but they process each voxel on the ray independently, without considering global context or environmental factors. Our paper presents a new approach that learns comprehensive representations of complete rays rather than individual points, capturing more detailed environmental features. We integrate a Kolmogorov-Arnold network (KAN) architecture with transformer modules to achieve better performance across realistic and synthetic scenes while maintaining computational efficiency. Our experimental results show that this approach outperforms existing methods in various scenarios. Ablation studies confirm that each component of our model contributes to its effectiveness. Additional experiments provide clear explanations for our model's performance.

287. 【2601.11654】PSSI-MaxST: An Efficient Pixel-Segment Similarity Index Using Intensity and Smoothness Features for Maximum Spanning Tree Based Segmentation

链接https://arxiv.org/abs/2601.11654

作者:Kaustubh Shivshankar Shejole,Gaurav Mishra

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:foreground and background, partition an image, user inputs, background regions, Segment Similarity Index

备注

点击查看摘要

Abstract:Interactive graph-based segmentation methods partition an image into foreground and background regions with the aid of user inputs. However, existing approaches often suffer from high computational costs, sensitivity to user interactions, and degraded performance when the foreground and background share similar color distributions. A key factor influencing segmentation performance is the similarity measure used for assigning edge weights in the graph. To address these challenges, we propose a novel Pixel Segment Similarity Index (PSSI), which leverages the harmonic mean of inter-channel similarities by incorporating both pixel intensity and spatial smoothness features. The harmonic mean effectively penalizes dissimilarities in any individual channel, enhancing robustness. The computational complexity of PSSI is $\mathcal{O}(B)$, where $B$ denotes the number of histogram bins. Our segmentation framework begins with low-level segmentation using MeanShift, which effectively captures color, texture, and segment shape. Based on the resulting pixel segments, we construct a pixel-segment graph with edge weights determined by PSSI. For partitioning, we employ the Maximum Spanning Tree (MaxST), which captures strongly connected local neighborhoods beneficial for precise segmentation. The integration of the proposed PSSI, MeanShift, and MaxST allows our method to jointly capture color similarity, smoothness, texture, shape, and strong local connectivity. Experimental evaluations on the GrabCut and Images250 datasets demonstrate that our method consistently outperforms current graph-based interactive segmentation methods such as AMOE, OneCut, and SSNCut in terms of segmentation quality, as measured by Jaccard Index (IoU), $F_1$ score, execution time and Mean Error (ME). Code is publicly available at: this https URL.

288. 【2601.11651】Aesthetics as Structural Harm: Algorithmic Lookism Across Text-to-Image Generation and Classification

链接https://arxiv.org/abs/2601.11651

作者:Miriam Doh,Aditya Gulati,Corina Canali,Nuria Oliver

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

关键词:preferential treatment based, paper examines algorithmic, examines algorithmic lookism-the, lookism-the systematic preferential, systematic preferential treatment

备注: 22 pages, 15 figures

点击查看摘要

Abstract:This paper examines algorithmic lookism-the systematic preferential treatment based on physical appearance-in text-to-image (T2I) generative AI and a downstream gender classification task. Through the analysis of 26,400 synthetic faces created with Stable Diffusion 2.1 and 3.5 Medium, we demonstrate how generative AI models systematically associate facial attractiveness with positive attributes and vice-versa, mirroring socially constructed biases rather than evidence-based correlations. Furthermore, we find significant gender bias in three gender classification algorithms depending on the attributes of the input faces. Our findings reveal three critical harms: (1) the systematic encoding of attractiveness-positive attribute associations in T2I models; (2) gender disparities in classification systems, where women's faces, particularly those generated with negative attributes, suffer substantially higher misclassification rates than men's; and (3) intensifying aesthetic constraints in newer models through age homogenization, gendered exposure patterns, and geographic reductionism. These convergent patterns reveal algorithmic lookism as systematic infrastructure operating across AI vision systems, compounding existing inequalities through both representation and recognition. Disclaimer: This work includes visual and textual content that reflects stereotypical associations between physical appearance and socially constructed attributes, including gender, race, and traits associated with social desirability. Any such associations found in this study emerge from the biases embedded in generative AI systems-not from empirical truths or the authors' views.

Comments:
22 pages, 15 figures

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Cite as:
arXiv:2601.11651 [cs.CV]

(or
arXiv:2601.11651v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2601.11651

Focus to learn more

              arXiv-issued DOI via DataCite</p>
289. 【2601.11645】IMSAHLO: Integrating Multi-Scale Attention and Hybrid Loss Optimization Framework for Robust Neuronal Brain Cell Segmentation

链接https://arxiv.org/abs/2601.11645

作者:Ujjwal Jain,Oshin Misra,Roshni Chakraborty,Mahua Bhattacharya

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Accurate segmentation, computational neuroscience, fluorescence microscopy, fundamental task, task for quantitative

备注

点击查看摘要

Abstract:Accurate segmentation of neuronal cells in fluorescence microscopy is a fundamental task for quantitative analysis in computational neuroscience. However, it is significantly impeded by challenges such as the coexistence of densely packed and sparsely distributed cells, complex overlapping morphologies, and severe class imbalance. Conventional deep learning models often fail to preserve fine topological details or accurately delineate boundaries under these conditions. To address these limitations, we propose a novel deep learning framework, IMSAHLO (Integrating Multi-Scale Attention and Hybrid Loss Optimization), for robust and adaptive neuronal segmentation. The core of our model features Multi-Scale Dense Blocks (MSDBs) to capture features at various receptive fields, effectively handling variations in cell density, and a Hierarchical Attention (HA) mechanism that adaptively focuses on salient morphological features to preserve Region of Interest (ROI) boundary details. Furthermore, we introduce a novel hybrid loss function synergistically combining Tversky and Focal loss to combat class imbalance, alongside a topology-aware Centerline Dice (clDice) loss and a Contour-Weighted Boundary loss to ensure topological continuity and precise separation of adjacent cells. Large-scale experiments on the public Fluorescent Neuronal Cells (FNC) dataset demonstrate that our framework outperforms state-of-the-art architectures, achieving precision of 81.4%, macro F1 score of 82.7%, micro F1 score of 83.3%, and balanced accuracy of 99.5% on difficult dense and sparse cases. Ablation studies validate the synergistic benefits of multi-scale attention and hybrid loss terms. This work establishes a foundation for generalizable segmentation models applicable to a wide range of biomedical imaging modalities, pushing AI-assisted analysis toward high-throughput neurobiological pipelines.

290. 【2601.11644】Predicting When to Trust Vision-Language Models for Spatial Reasoning

链接https://arxiv.org/abs/2601.11644

作者:Muhammad Imran,Yugyung Lee

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:basic directional relationships, spatial reasoning failures, exhibit systematic spatial, systematic spatial reasoning, demonstrate impressive capabilities

备注: 9 pages, 5 figures, 6 tables

点击查看摘要

Abstract:Vision-Language Models (VLMs) demonstrate impressive capabilities across multimodal tasks, yet exhibit systematic spatial reasoning failures, achieving only 49% (CLIP) to 54% (BLIP-2) accuracy on basic directional relationships. For safe deployment in robotics and autonomous systems, we need to predict when to trust VLM spatial predictions rather than accepting all outputs. We propose a vision-based confidence estimation framework that validates VLM predictions through independent geometric verification using object detection. Unlike text-based approaches relying on self-assessment, our method fuses four signals via gradient boosting: geometric alignment between VLM claims and coordinates, spatial ambiguity from overlap, detection quality, and VLM internal uncertainty. We achieve 0.674 AUROC on BLIP-2 (34.0% improvement over text-based baselines) and 0.583 AUROC on CLIP (16.1% improvement), generalizing across generative and classification architectures. Our framework enables selective prediction: at 60% target accuracy, we achieve 61.9% coverage versus 27.6% baseline (2.2x improvement) on BLIP-2. Feature analysis reveals vision-based signals contribute 87.4% of model importance versus 12.7% from VLM confidence, validating that external geometric verification outperforms self-assessment. We demonstrate reliable scene graph construction where confidence-based pruning improves precision from 52.1% to 78.3% while retaining 68.2% of edges.

291. 【2601.11642】PSSF: Early osteoarthritis detection using physical synthetic knee X-ray scans and AI radiomics models

链接https://arxiv.org/abs/2601.11642

作者:Abbas Alzubaidi,Ali Al-Bayaty

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

关键词:subjective radiographic grading, commonly the Kellgren-Lawrence, disability worldwide, X-ray scans, controllable X-ray scans

备注: 16 pages, 6 figures

点击查看摘要

Abstract:Knee osteoarthritis (OA) is a major cause of disability worldwide and is still largely assessed using subjective radiographic grading, most commonly the Kellgren-Lawrence (KL) scale. Artificial intelligence (AI) and radiomics offer quantitative tools for OA assessment but depend on large, well-annotated image datasets, mainly X-ray scans, that are often difficult to obtain because of privacy, governance and resourcing constraints. In this research, we introduce a physics-based synthetic simulation framework (PSSF) to fully generate controllable X-ray scans without patients' involvement and violating their privacy and institutional constraints. This PSSF is a 2D X-ray projection simulator of anteroposterior knee radiographs from a parametric anatomical model of the distal femur and proximal tibia. Using PSSF, we create a virtual cohort of 180 subjects (260 knees), each is imaged under three protocols (reference, low-dose, and geometry-shift). Medial joint regions are automatically localized, preprocessed, and processed with the Image Biomarker Standardisation Initiative (IBSI). Practically, three machine learning (ML) models are utilized, logistic regression, random forest, and gradient boosting, to train binary (KL-like "0" vs. "2") and three-class (0-2) prediction radiographic images. Robustness is assessed within IBSI protocol, cross-protocol, and multi-protocol scenarios. Finally, features stability is then evaluated using intraclass correlation coefficients across acquisition changes.

292. 【2601.11641】Mixture of Distributions Matters: Dynamic Sparse Attention for Efficient Video Diffusion Transformers

链接https://arxiv.org/abs/2601.11641

作者:Yuxi Liu,Yipeng Hu,Zekun Zhang,Kunze Jiang,Kun Yuan

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Diffusion Transformers, creating significant barriers, achieved notable progress, task remains constrained, quadratic complexity inherent

备注

点击查看摘要

Abstract:While Diffusion Transformers (DiTs) have achieved notable progress in video generation, this long-sequence generation task remains constrained by the quadratic complexity inherent to self-attention mechanisms, creating significant barriers to practical deployment. Although sparse attention methods attempt to address this challenge, existing approaches either rely on oversimplified static patterns or require computationally expensive sampling operations to achieve dynamic sparsity, resulting in inaccurate pattern predictions and degraded generation quality. To overcome these limitations, we propose a \underline{\textbf{M}}ixtrue-\underline{\textbf{O}}f-\underline{\textbf{D}}istribution \textbf{DiT} (\textbf{MOD-DiT}), a novel sampling-free dynamic attention framework that accurately models evolving attention patterns through a two-stage process. First, MOD-DiT leverages prior information from early denoising steps and adopts a {distributed mixing approach} to model an efficient linear approximation model, which is then used to predict mask patterns for a specific denoising interval. Second, an online block masking strategy dynamically applies these predicted masks while maintaining historical sparsity information, eliminating the need for repetitive sampling operations. Extensive evaluations demonstrate consistent acceleration and quality improvements across multiple benchmarks and model architectures, validating MOD-DiT's effectiveness for efficient, high-quality video generation while overcoming the computational limitations of traditional sparse attention approaches.

293. 【2601.11640】Confident Learning for Object Detection under Model Constraints

链接https://arxiv.org/abs/2601.11640

作者:Yingda Yu,Jiaqi Xuan,Shuhui Shi,Xuanyu Teng,Shuyang Xu,Guanchao Tong

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:real-time inference latency, Agricultural weed detection, computational resources, Agricultural weed, inference latency

备注: Submitted to ICPR 2026, currently under review

点击查看摘要

Abstract:Agricultural weed detection on edge devices is subject to strict constraints on model capacity, computational resources, and real-time inference latency, which prevent performance improvements through model scaling or ensembling. This paper proposes Model-Driven Data Correction (MDDC), a data-centric framework that enhances detection performance by iteratively diagnosing and correcting data quality deficiencies. An automated error analysis procedure categorizes detection failures into four types: false negatives, false positives, class confusion, and localization errors. These error patterns are systematically addressed through a structured train-fix-retrain pipeline with version-controlled data management. Experimental results on multiple weed detection datasets demonstrate consistent improvements of 5-25 percent in mAP at 0.5 using a fixed lightweight detector (YOLOv8n), indicating that systematic data quality optimization can effectively alleviate performance bottlenecks under fixed model capacity constraints.

294. 【2601.11637】Evaluating Self-Correcting Vision Agents Through Quantitative and Qualitative Metrics

链接https://arxiv.org/abs/2601.11637

作者:Aradhya Dixit

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:executable tool-based plans, decompose complex visual, enabled Vision-Language Agents, complex visual tasks, tool-based plans

备注

点击查看摘要

Abstract:Recent progress in multimodal foundation models has enabled Vision-Language Agents (VLAs) to decompose complex visual tasks into executable tool-based plans. While recent benchmarks have begun to evaluate iterative self-correction, its quantitative limits and dominant reasoning bottlenecks remain poorly characterized. This work introduces a Diagnostic Micro-Benchmark. Our analysis decouples Task Success Rate (TSR = 62 percent) from Correction Success Rate (CSR = 25 to 33 percent), revealing that initial competence does not predict repair ability. We explicitly quantify the diminishing returns of correction, which saturates after three retries. Our Failure Taxonomy reveals a frequent factor is Semantic Drift (about 28 percent of failures), a loss of contextual state. By isolating this reasoning bottleneck, this benchmark defines a reproducible framework toward stateful, trustworthy multimodal agents.

295. 【2601.11635】Now You See Me, Now You Don't: A Unified Framework for Expression Consistent Anonymization in Talking Head Videos

链接https://arxiv.org/abs/2601.11635

作者:Anil Egin,Andrea Tangherloni,Antitza Dantcheva

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:computer vision downstream, vision downstream tasks, people tracking, Face video anonymization, anonymization is aimed

备注

点击查看摘要

Abstract:Face video anonymization is aimed at privacy preservation while allowing for the analysis of videos in a number of computer vision downstream tasks such as expression recognition, people tracking, and action recognition. We propose here a novel unified framework referred to as Anon-NET, streamlined to de-identify facial videos, while preserving age, gender, race, pose, and expression of the original video. Specifically, we inpaint faces by a diffusion-based generative model guided by high-level attribute recognition and motion-aware expression transfer. We then animate deidentified faces by video-driven animation, which accepts the de-identified face and the original video as input. Extensive experiments on the datasets VoxCeleb2, CelebV-HQ, and HDTF, which include diverse facial dynamics, demonstrate the effectiveness of AnonNET in obfuscating identity while retaining visual realism and temporal consistency. The code of AnonNet will be publicly released.

296. 【2601.11634】When Rules Fall Short: Agent-Driven Discovery of Emerging Content Issues in Short Video Platforms

链接https://arxiv.org/abs/2601.11634

作者:Chenghui Yu,Hongwei Wang,Junwen Chen,Zixuan Wang,Bingfeng Deng,Zhuolin Hao,Hongyu Xiong,Yang Song

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:short-video platforms evolve, Trends on short-video, existing annotation policies, rapid pace, annotation policies

备注

点击查看摘要

Abstract:Trends on short-video platforms evolve at a rapid pace, with new content issues emerging every day that fall outside the coverage of existing annotation policies. However, traditional human-driven discovery of emerging issues is too slow, which leads to delayed updates of annotation policies and poses a major challenge for effective content governance. In this work, we propose an automatic issue discovery method based on multimodal LLM agents. Our approach automatically recalls short videos containing potential new issues and applies a two-stage clustering strategy to group them, with each cluster corresponding to a newly discovered issue. The agent then generates updated annotation policies from these clusters, thereby extending coverage to these emerging issues. Our agent has been deployed in the real system. Both offline and online experiments demonstrate that this agent-based method significantly improves the effectiveness of emerging-issue discovery (with an F1 score improvement of over 20%) and enhances the performance of subsequent issue governance (reducing the view count of problematic videos by approximately 15%). More importantly, compared to manual issue discovery, it greatly reduces time costs and substantially accelerates the iteration of annotation policies.

297. 【2601.11633】Beyond Accuracy: Evaluating Grounded Visual Evidence in Thinking with Images

链接https://arxiv.org/abs/2601.11633

作者:Xuchen Li,Xuzhao Li,Renjie Pi,Shiyu Hu,Jian Zhao,Jiahui Gao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:reasoning process remains, critical challenge, remarkable progress, progress of Vision-Language, process remains

备注: Preprint, Under review

点击查看摘要

Abstract:Despite the remarkable progress of Vision-Language Models (VLMs) in adopting "Thinking-with-Images" capabilities, accurately evaluating the authenticity of their reasoning process remains a critical challenge. Existing benchmarks mainly rely on outcome-oriented accuracy, lacking the capability to assess whether models can accurately leverage fine-grained visual cues for multi-step reasoning. To address these limitations, we propose ViEBench, a process-verifiable benchmark designed to evaluate faithful visual reasoning. Comprising 200 multi-scenario high-resolution images with expert-annotated visual evidence, ViEBench uniquely categorizes tasks by difficulty into perception and reasoning dimensions, where reasoning tasks require utilizing localized visual details with prior knowledge. To establish comprehensive evaluation criteria, we introduce a dual-axis matrix that provides fine-grained metrics through four diagnostic quadrants, enabling transparent diagnosis of model behavior across varying task complexities. Our experiments yield several interesting observations: (1) VLMs can sometimes produce correct final answers despite grounding on irrelevant regions, and (2) they may successfully locate the correct evidence but still fail to utilize it to reach accurate conclusions. Our findings demonstrate that ViEBench can serve as a more explainable and practical benchmark for comprehensively evaluating the effectiveness agentic VLMs. The codes will be released at: this https URL.

298. 【2601.11632】KG-ViP: Bridging Knowledge Grounding and Visual Perception in Multi-modal LLMs for Visual Question Answering

链接https://arxiv.org/abs/2601.11632

作者:Zhiyang Li,Ao Ke,Yukun Cao,Xike Xie

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Language Models, Visual Question Answering, Multi-modal Large Language, Language Models, Question Answering

备注

点击查看摘要

Abstract:Multi-modal Large Language Models (MLLMs) for Visual Question Answering (VQA) often suffer from dual limitations: knowledge hallucination and insufficient fine-grained visual perception. Crucially, we identify that commonsense graphs and scene graphs provide precisely complementary solutions to these respective deficiencies by providing rich external knowledge and capturing fine-grained visual details. However, prior works typically treat them in isolation, overlooking their synergistic potential. To bridge this gap, we propose KG-ViP, a unified framework that empowers MLLMs by fusing scene graphs and commonsense graphs. The core of the KG-ViP framework is a novel retrieval-and-fusion pipeline that utilizes the query as a semantic bridge to progressively integrate both graphs, synthesizing a unified structured context that facilitates reliable multi-modal reasoning. Extensive experiments on FVQA 2.0+ and MVQA benchmarks demonstrate that KG-ViP significantly outperforms existing VQA methods.

299. 【2601.11631】Compress to Focus: Efficient Coordinate Compression for Policy Optimization in Multi-Turn GUI Agents

链接https://arxiv.org/abs/2601.11631

作者:Yurun Song,Jiong Yin,Rongjunchen Zhang,Ian G. Harris

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Multi-turn GUI agents, enable complex task, complex task completion, GUI agents enable, severe context inflation

备注

点击查看摘要

Abstract:Multi-turn GUI agents enable complex task completion through sequential decision-making, but suffer from severe context inflation as interaction history accumulates. Existing strategies either sacrifice long-term context via truncation or compromise spatial structure through token pruning. In this paper, we propose Coordinate Compression Policy Optimization (CCPO), an efficient policy optimization framework that couples visual compression with policy optimization for multi-turn GUI agents. CCPO introduces Coordinate-Aware Spatial Compression (CASC), which aggregates coordinates from multiple rollouts to capture target-relevant regions and progressively narrow historical attention around key visual areas. From interactions across rollouts, CASC adaptively constructs attention boundaries that concentrate computation on the most informative regions of the scene. We further design a Distance-Based Advantage that provides fine-grained learning signals based on distance rather than binary correctness, improving both grounding accuracy and compression quality. Extensive experiments demonstrate that CCPO achieves SOTA performance across four benchmarks with up to 55% token compression and 3.8$\times$ training speedup.

300. 【2601.11630】A one-step generation model with a Single-Layer Transformer: Layer number re-distillation of FreeFlow

链接https://arxiv.org/abs/2601.11630

作者:Haonan Wei,Linyuan Wang,Nuolin Sun,Zhizhong Zheng,Lei Li,Bin Yan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Ordinary Differential Equations, Differential Equations, Ordinary Differential, Flow matching methods, Flow matching

备注

点击查看摘要

Abstract:Currently, Flow matching methods aim to compress the iterative generation process of diffusion models into a few or even a single step, with MeanFlow and FreeFlow being representative achievements of one-step generation based on Ordinary Differential Equations (ODEs). We observe that the 28-layer Transformer architecture of FreeFlow can be characterized as an Euler discretization scheme for an ODE along the depth axis, where the layer index serves as the discrete time step. Therefore, we distill the number of layers of the FreeFlow model, following the same derivation logic as FreeFlow, and propose SLT (Single-Layer Transformer), which uses a single shared DiT block to approximate the depth-wise feature evolution of the 28-layer teacher. During training, it matches the teacher's intermediate features at several depth patches, fuses those patch-level representations, and simultaneously aligns the teacher's final velocity prediction. Through distillation training, we compress the 28 independent Transformer Blocks of the teacher model DiT-XL/2 into a single Transformer Block, reducing the parameter count from 675M to 4.3M. Furthermore, leveraging its minimal parameters and rapid sampling speed, SLT can screen more candidate points in the noise space within the same timeframe, thereby selecting higher-quality initial points for the teacher model FreeFlow and ultimately enhancing the quality of generated images. Experimental results demonstrate that within a time budget comparable to two random samplings of the teacher model, our method performs over 100 noise screenings and produces a high-quality sample through the teacher model using the selected points. Quality fluctuations caused by low-quality initial noise under a limited number of FreeFlow sampling calls are effectively avoided, substantially improving the stability and average generation quality of one-step generation.

301. 【2601.11627】Handcrafted Feature-Assisted One-Class Learning for Artist Authentication in Historical Drawings

链接https://arxiv.org/abs/2601.11627

作者:Hassan Ugail,Jan Ritch-Frel,Irina Matuzava

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:limited tonal variation, paper remain persistent, remain persistent challenges, Albert Museum Collections, Royal Collection Trust

备注

点击查看摘要

Abstract:Authentication and attribution of works on paper remain persistent challenges in cultural heritage, particularly when the available reference corpus is small and stylistic cues are primarily expressed through line and limited tonal variation. We present a verification-based computational framework for historical drawing authentication using one-class autoencoders trained on a compact set of interpretable handcrafted features. Ten artist-specific verifiers are trained using authenticated sketches from the Metropolitan Museum of Art open-access collection, the Ashmolean Collections Catalogue, the Morgan Library and Museum, the Royal Collection Trust (UK), the Victoria and Albert Museum Collections, and an online catalogue of the Casa Buonarroti collection and evaluated under a biometric-style protocol with genuine and impostor trials. Feature vectors comprise Fourier-domain energy, Shannon entropy, global contrast, GLCM-based homogeneity, and a box-counting estimate of fractal complexity. Across 900 verification decisions (90 genuine and 810 impostor trials), the pooled system achieves a True Acceptance Rate of 83.3% with a False Acceptance Rate of 9.5% at the chosen operating point. Performance varies substantially by artist, with near-zero false acceptance for some verifiers and elevated confusability for others. A pairwise attribution of false accepts indicates structured error pathways consistent with stylistic proximity and shared drawing conventions, whilst also motivating tighter control of digitisation artefacts and threshold calibration. The proposed methodology is designed to complement, rather than replace, connoisseurship by providing reproducible, quantitative evidence suitable for data-scarce settings common in historical sketch attribution.

302. 【2601.11617】PointSLAM++: Robust Dense Neural Gaussian Point Cloud-based SLAM

链接https://arxiv.org/abs/2601.11617

作者:Xu Wang,Boyao Han,Xiaojun Chen,Ying Liu,Ruihui Li

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)

关键词:maintain structural consistency, current simultaneous localization, robust pose estimation, RGB-D SLAM system, augmented reality

备注

点击查看摘要

Abstract:Real-time 3D reconstruction is crucial for robotics and augmented reality, yet current simultaneous localization and mapping(SLAM) approaches often struggle to maintain structural consistency and robust pose estimation in the presence of depth noise. This work introduces PointSLAM++, a novel RGB-D SLAM system that leverages a hierarchically constrained neural Gaussian representation to preserve structural relationships while generating Gaussian primitives for scene mapping. It also employs progressive pose optimization to mitigate depth sensor noise, significantly enhancing localization accuracy. Furthermore, it utilizes a dynamic neural representation graph that adjusts the distribution of Gaussian nodes based on local geometric complexity, enabling the map to adapt to intricate scene details in real time. This combination yields high-precision 3D mapping and photorealistic scene rendering. Experimental results show PointSLAM++ outperforms existing 3DGS-based SLAM methods in reconstruction accuracy and rendering quality, demonstrating its advantages for large-scale AR and robotics.

303. 【2601.11614】Multi-modal MRI-Based Alzheimer's Disease Diagnosis with Transformer-based Image Synthesis and Transfer Learning

链接https://arxiv.org/abs/2601.11614

作者:Jason Qiu

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)

关键词:progressive neurodegenerative disorder, Alzheimer disease, making early detection, Magnetic Resonance Imaging, early detection essential

备注: 19 pages, 10 figures

点击查看摘要

Abstract:Alzheimer's disease (AD) is a progressive neurodegenerative disorder in which pathological changes begin many years before the onset of clinical symptoms, making early detection essential for timely intervention. T1-weighted (T1w) Magnetic Resonance Imaging (MRI) is routinely used in clinical practice to identify macroscopic brain alterations, but these changes typically emerge relatively late in the disease course. Diffusion MRI (dMRI), in contrast, is sensitive to earlier microstructural abnormalities by probing water diffusion in brain tissue. dMRI metrics, including fractional anisotropy (FA) and mean diffusivity (MD), provide complementary information about white matter integrity and neurodegeneration. However, dMRI acquisitions are time-consuming and susceptible to motion artifacts, limiting their routine use in clinical populations. To bridge this gap, I propose a 3D TransUNet image synthesis framework that predicts FA and MD maps directly from T1w MRI. My model generates high-fidelity maps, achieving a structural similarity index (SSIM) exceeding 0.93 and a strong Pearson correlation (0.94) with ground-truth dMRI. When integrated into a multi-modal diagnostic model, these synthetic features boost AD classification accuracy by 5% (78.75%-83.75%) and, most importantly, improve mild cognitive impairment (MCI) detection by 12.5%. This study demonstrates that high-quality diffusion microstructural information can be inferred from routinely acquired T1w MRI, effectively transferring the benefits of multi-modality imaging to settings where diffusion data are unavailable. By reducing scan time while preserving complementary structural and microstructural information, the proposed approach has the potential to improve the accessibility, efficiency, and accuracy of AD diagnosis in clinical practice.

304. 【2601.11612】Domain-Specific Self-Supervised Pre-training for Agricultural Disease Classification: A Hierarchical Vision Transformer Study

链接https://arxiv.org/abs/2601.11612

作者:Arnav S. Sonavane

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:domain-specific self-supervised pre-training, agricultural disease classification, investigate the impact, impact of domain-specific, domain-specific self-supervised

备注: 11 pages, 4 figures, 9 tables

点击查看摘要

Abstract:We investigate the impact of domain-specific self-supervised pre-training on agricultural disease classification using hierarchical vision transformers. Our key finding is that SimCLR pre-training on just 3,000 unlabeled agricultural images provides a +4.57% accuracy improvement--exceeding the +3.70% gain from hierarchical architecture design. Critically, we show this SSL benefit is architecture-agnostic: applying the same pre-training to Swin-Base yields +4.08%, to ViT-Base +4.20%, confirming practitioners should prioritize domain data collection over architectural choices. Using HierarchicalViT (HVT), a Swin-style hierarchical transformer, we evaluate on three datasets: Cotton Leaf Disease (7 classes, 90.24%), PlantVillage (38 classes, 96.3%), and PlantDoc (27 classes, 87.1%). At matched parameter counts, HVT-Base (78M) achieves 88.91% vs. Swin-Base (88M) at 87.23%, a +1.68% improvement. For deployment reliability, we report calibration analysis showing HVT achieves 3.56% ECE (1.52% after temperature scaling). Code: this https URL

305. 【2601.11532】"Jutters"

链接https://arxiv.org/abs/2601.11532

作者:Meike Driessen,Selina Khan,Gonçalo Marcelino

类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Dutch coastal foragers, Dutch coastal, gathering and repurposing, coastal foragers, foragers who comb

备注: Accepted at 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Creative AI Track

点击查看摘要

Abstract:This project explores how we engage with AI-generated content through the lens of the jutter: Dutch coastal foragers who comb the shoreline after storms, gathering and repurposing what the sea leaves behind. Reflecting how our lives are increasingly shaped by AI-generated media, we create a beach-like installation that blends real shoreline debris with AI-transformed images and videos. Visitors are invited to explore this space as contemporary jutters, deciding what to keep and what to discard. In doing so, the project reimagines AI-imagery as material for reflection, encouraging a more discerning engagement with the content that drifts through our feeds. A video preview of the installation can be found at this https URL.

306. 【2601.14077】MooneyMaker: A Python package to create ambiguous two-tone images

链接https://arxiv.org/abs/2601.14077

作者:Lars C. Reining,Thabo Matthies,Luisa Haussner,Rabea Turon,Thomas S. A. Wallis

类目:Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)

关键词:thresholding photographic images, thresholding photographic, Mooney images, Mooney, images

备注

点击查看摘要

Abstract:Mooney images are high-contrast, two-tone visual stimuli, created by thresholding photographic images. They allow researchers to separate image content from image understanding, making them valuable for studying visual perception. An ideal Mooney image for this purpose achieves a specific balance: it initially appears unrecognizable but becomes fully interpretable to the observer after seeing the original template. Researchers traditionally created these stimuli manually using subjective criteria, which is labor-intensive and can introduce inconsistencies across studies. Automated generation techniques now offer an alternative to this manual approach. Here, we present MooneyMaker, an open-source Python package that automates the generation of ambiguous Mooney images using several complementary approaches. Users can choose between various generation techniques that range from approaches based on image statistics to deep learning models. These models strategically alter edge information to increase initial ambiguity. The package lets users create two-tone images with multiple methods and directly compare the results visually. In an experiment, we validate MooneyMaker by generating Mooney images using different techniques and assess their recognizability for human observers before and after disambiguating them by presenting the template images. Our results reveal that techniques with lower initial recognizability are associated with higher post-template recognition (i.e. a larger disambiguation effect). To help vision scientists build effective databases of Mooney stimuli, we provide practical guidelines for technique selection. By standardizing the generation process, MooneyMaker supports more consistent and reproducible visual perception research.

307. 【2601.13987】SHARE: A Fully Unsupervised Framework for Single Hyperspectral Image Restoration

链接https://arxiv.org/abs/2601.13987

作者:Jiangwei Xie,Zhang Wen,Mike Davies,Dongdong Chen

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:computer vision, fundamental challenge, challenge in computational, Single Hyperspectral Image, Hyperspectral image

备注: Technical report

点击查看摘要

Abstract:Hyperspectral image (HSI) restoration is a fundamental challenge in computational imaging and computer vision. It involves ill-posed inverse problems, such as inpainting and super-resolution. Although deep learning methods have transformed the field through data-driven learning, their effectiveness hinges on access to meticulously curated ground-truth datasets. This fundamentally restricts their applicability in real-world scenarios where such data is unavailable. This paper presents SHARE (Single Hyperspectral Image Restoration with Equivariance), a fully unsupervised framework that unifies geometric equivariance principles with low-rank spectral modelling to eliminate the need for ground truth. SHARE's core concept is to exploit the intrinsic invariance of hyperspectral structures under differentiable geometric transformations (e.g. rotations and scaling) to derive self-supervision signals through equivariance consistency constraints. Our novel Dynamic Adaptive Spectral Attention (DASA) module further enhances this paradigm shift by explicitly encoding the global low-rank property of HSI and adaptively refining local spectral-spatial correlations through learnable attention mechanisms. Extensive experiments on HSI inpainting and super-resolution tasks demonstrate the effectiveness of SHARE. Our method outperforms many state-of-the-art unsupervised approaches and achieves performance comparable to that of supervised methods. We hope that our approach will shed new light on HSI restoration and broader scientific imaging scenarios. The code will be released at this https URL.

308. 【2601.12850】Accurate Simulation Pipeline for Passive Single-Photon Imaging

链接https://arxiv.org/abs/2601.12850

作者:Aleksi Suonsivu,Lauri Salmela,Leevi Uosukainen,Edoardo Peretti,Radu Ciprian Bilcu,Giacomo Boracchi

类目:Instrumentation and Detectors (physics.ins-det); Computer Vision and Pattern Recognition (cs.CV)

关键词:Single-Photon Avalanche Diodes, Avalanche Diodes, Single-Photon Avalanche, SPAD sensors, promising imaging sensors

备注: 18 pages, 10 figures, 3 tables

点击查看摘要

Abstract:Single-Photon Avalanche Diodes (SPADs) are new and promising imaging sensors. These sensors are sensitive enough to detect individual photons hitting each pixel, with extreme temporal resolution and without readout noise. Thus, SPADs stand out as an optimal choice for low-light imaging. Due to the high price and limited availability of SPAD sensors, the demand for an accurate data simulation pipeline is substantial. Indeed, the scarcity of SPAD datasets hinders the development of SPAD-specific processing algorithms and impedes the training of learning-based solutions. In this paper, we present a comprehensive SPAD simulation pipeline and validate it with multiple experiments using two recent commercial SPAD sensors. Our simulator is used to generate the SPAD-MNIST, a single-photon version of the seminal MNIST dataset, to investigate the effectiveness of convolutional neural network (CNN) classifiers on reconstructed fluxes, even at extremely low light conditions, e.g., 5 mlux. We also assess the performance of classifiers exclusively trained on simulated data on real images acquired from SPAD sensors at different light conditions. The synthetic dataset encompasses different SPAD imaging modalities and is made available for download. Project page: this https URL.

Comments:
18 pages, 10 figures, 3 tables

Subjects:

Instrumentation and Detectors (physics.ins-det); Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2601.12850 [physics.ins-det]

(or
arXiv:2601.12850v1 [physics.ins-det] for this version)

https://doi.org/10.48550/arXiv.2601.12850

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Related DOI:

https://doi.org/10.1109/JSEN.2025.3645459

Focus to learn more

            DOI(s) linking to related resources</p>
309. 【2601.12261】DALD-PCAC: Density-Adaptive Learning Descriptor for Point Cloud Lossless Attribute Compression

链接https://arxiv.org/abs/2601.12261

作者:Chunyang Fu,Ge Li,Wei Gao,Shiqi Wang,Zhu Li,Shan Liu

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:point cloud geometry, point clouds, cloud geometry compression, lossless attribute compression, significantly advanced

备注: Accepted by TOMM

点击查看摘要

Abstract:Recently, deep learning has significantly advanced the performance of point cloud geometry compression. However, the learning-based lossless attribute compression of point clouds with varying densities is under-explored. In this paper, we develop a learning-based framework, namely DALD-PCAC that leverages Levels of Detail (LoD) to tailor for point cloud lossless attribute compression. We develop a point-wise attention model using a permutation-invariant Transformer to tackle the challenges of sparsity and irregularity of point clouds during context modeling. We also propose a Density-Adaptive Learning Descriptor (DALD) capable of capturing structure and correlations among points across a large range of neighbors. In addition, we develop a prior-guided block partitioning to reduce the attribute variance within blocks and enhance the performance. Experiments on LiDAR and object point clouds show that DALD-PCAC achieves the state-of-the-art performance on most data. Our method boosts the compression performance and is robust to the varying densities of point clouds. Moreover, it guarantees a good trade-off between performance and complexity, exhibiting great potential in real-world applications. The source code is available at this https URL.

310. 【2601.12255】DeepRAHT: Learning Predictive RAHT for Point Cloud Attribute Compression

链接https://arxiv.org/abs/2601.12255

作者:Chunyang Fu,Tai Qin,Shiqi Wang,Zhu Li

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT); Multimedia (cs.MM)

关键词:Regional Adaptive Hierarchical, Adaptive Hierarchical Transform, Regional Adaptive, Adaptive Hierarchical, cloud attribute compression

备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Regional Adaptive Hierarchical Transform (RAHT) is an effective point cloud attribute compression (PCAC) method. However, its application in deep learning lacks research. In this paper, we propose an end-to-end RAHT framework for lossy PCAC based on the sparse tensor, called DeepRAHT. The RAHT transform is performed within the learning reconstruction process, without requiring manual RAHT for preprocessing. We also introduce the predictive RAHT to reduce bitrates and design a learning-based prediction model to enhance performance. Moreover, we devise a bitrate proxy that applies run-length coding to entropy model, achieving seamless variable-rate coding and improving robustness. DeepRAHT is a reversible and distortion-controllable framework, ensuring its lower bound performance and offering significant application potential. The experiments demonstrate that DeepRAHT is a high-performance, faster, and more robust solution than the baseline methods. Project Page: this https URL.

311. 【2601.11878】Accelerated MR Elastography Using Learned Neural Network Representation

链接https://arxiv.org/abs/2601.11878

作者:Xi Peng

类目:ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

关键词:achieving fast high-resolution, high-quality training dataset, highly undersampled data, develop a deep-learning, achieving fast

备注

点击查看摘要

Abstract:To develop a deep-learning method for achieving fast high-resolution MR elastography from highly undersampled data without the need of high-quality training dataset. We first framed the deep neural network representation as a nonlinear extension of the linear subspace model, then used it to represent and reconstruct MRE image repetitions from undersampled k-space data. The network weights were learned using a multi-level k-space consistent loss in a self-supervised manner. To further enhance reconstruction quality, phase-contrast specific magnitude and phase priors were incorporated, including the similarity of anatomical structures and smoothness of wave-induced harmonic displacement. Experiments were conducted using both 3D gradient-echo spiral and multi-slice spin-echo spiral MRE datasets. Compared to the conventional linear subspace-based approaches, the nonlinear network representation method was able to produce superior image reconstruction with suppressed noise and artifacts from a single in-plane spiral arm per MRE repetition (e.g., total R=10), yielding comparable stiffness estimation to the fully sampled data. This work demonstrated the feasibility of using deep network representations to model and reconstruct MRE images from highly-undersampled data, a nonlinear extension of the subspace-based approaches.

312. 【2601.11833】Karhunen-Loève Expansion-Based Residual Anomaly Map for Resource-Efficient Glioma MRI Segmentation

链接https://arxiv.org/abs/2601.11833

作者:Anthony Hur

类目:Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

关键词:brain tumor segmentation, Accurate segmentation, treatment planning, essential for clinical, clinical diagnosis

备注

点击查看摘要

Abstract:Accurate segmentation of brain tumors is essential for clinical diagnosis and treatment planning. Deep learning is currently the state-of-the-art for brain tumor segmentation, yet it requires either large datasets or extensive computational resources that are inaccessible in most areas. This makes the problem increasingly difficult: state-of-the-art models use thousands of training cases and vast computational power, where performance drops sharply when either is limited. The top performer in the Brats GLI 2023 competition relied on supercomputers trained on over 92,000 augmented MRI scans using an AMD EPYC 7402 CPU, six NVIDIA RTX 6000 GPUs (48GB VRAM each), and 1024GB of RAM over multiple weeks. To address this, the Karhunen--Loève Expansion (KLE) was implemented as a feature extraction step on downsampled, z-score normalized MRI volumes. Each 240$\times$240$\times$155 multi-modal scan is reduced to four $48^3$ channels and compressed into 32 KL coefficients. The resulting approximate reconstruction enables a residual-based anomaly map, which is upsampled and added as a fifth channel to a compact 3D U-Net. All experiments were run on a consumer workstation (AMD Ryzen 5 7600X CPU, RTX 4060Ti (8GB VRAM), and 64GB RAM while using far fewer training cases. This model achieves post-processed Dice scores of 0.929 (WT), 0.856 (TC), and 0.821 (ET), with HD95 distances of 2.93, 6.78, and 10.35 voxels. These results are significantly better than the winning BraTS 2023 methodology for HD95 distances and WT dice scores. This demonstrates that a KLE-based residual anomaly map can dramatically reduce computational cost and data requirements while retaining state-of-the-art performance.

313. 【2601.11694】Anisotropic Tensor Deconvolution of Hyperspectral Images

链接https://arxiv.org/abs/2601.11694

作者:Xinjue Wang,Xiuheng Wang,Esa Ollila,Sergiy A. Vorobyov

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP)

关键词:Canonical Polyadic Decomposition, Alternating Linearized Minimization, Proximal Alternating Linearized, http URL propose, http URL efficient

备注: To appear in ICASSP 2026

点击查看摘要

Abstract:Hyperspectral image (HSI) deconvolution is a challenging ill-posed inverse problem, made difficult by the data's high this http URL propose a parameter-parsimonious framework based on a low-rank Canonical Polyadic Decomposition (CPD) of the entire latent HSI $\mathbf{\mathcal{X}} \in \mathbb{R}^{P\times Q \times N}$.This approach recasts the problem from recovering a large-scale image with $PQN$ variables to estimating the CPD factors with $(P+Q+N)R$ this http URL model also enables a structure-aware, anisotropic Total Variation (TV) regularization applied only to the spatial factors, preserving the smooth spectral this http URL efficient algorithm based on the Proximal Alternating Linearized Minimization (PALM) framework is developed to solve the resulting non-convex optimization this http URL confirm the model's efficiency, showing a numerous parameter reduction of over two orders of magnitude and a compelling trade-off between model compactness and reconstruction accuracy.

314. 【2601.11689】Bridging Modalities: Joint Synthesis and Registration Framework for Aligning Diffusion MRI with T1-Weighted Images

链接https://arxiv.org/abs/2601.11689

作者:Xiaofan Wang,Junyi Wang,Yuqian Chen,Lauren J. O' Donnell,Fan Zhang

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:aligning diffusion-weighted imaging, diffusion MRI, MRI, MRI images, diffusion-weighted imaging

备注

点击查看摘要

Abstract:Multimodal image registration between diffusion MRI (dMRI) and T1-weighted (T1w) MRI images is a critical step for aligning diffusion-weighted imaging (DWI) data with structural anatomical space. Traditional registration methods often struggle to ensure accuracy due to the large intensity differences between diffusion data and high-resolution anatomical structures. This paper proposes an unsupervised registration framework based on a generative registration network, which transforms the original multimodal registration problem between b0 and T1w images into a unimodal registration task between a generated image and the real T1w image. This effectively reduces the complexity of cross-modal registration. The framework first employs an image synthesis model to generate images with T1w-like contrast, and then learns a deformation field from the generated image to the fixed T1w image. The registration network jointly optimizes local structural similarity and cross-modal statistical dependency to improve deformation estimation accuracy. Experiments conducted on two independent datasets demonstrate that the proposed method outperforms several state-of-the-art approaches in multimodal registration tasks.

315. 【2601.11685】owards Efficient Image Deblurring for Edge Deployment

链接https://arxiv.org/abs/2601.11685

作者:Srinivas Miriyala,Sowmya Vajrala,Sravanth Kodavanti

类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:mobile image signal, image signal processing, signal processing pipelines, restore fine structures, mobile image

备注

点击查看摘要

Abstract:Image deblurring is a critical stage in mobile image signal processing pipelines, where the ability to restore fine structures and textures must be balanced with real-time constraints on edge devices. While recent deep networks such as transformers and activation-free architectures achieve state-of-the-art (SOTA) accuracy, their efficiency is typically measured in FLOPs or parameters, which do not correlate with latency on embedded hardware. We propose a hardware-aware adaptation framework that restructures existing models through sensitivity-guided block substitution, surrogate distillation, and training-free multi-objective search driven by device profiling. Applied to the 36-block NAFNet baseline, the optimized variants achieve up to 55% reduction in GMACs compared to the recent transformer-based SOTA while maintaining competitive accuracy. Most importantly, on-device deployment yields a 1.25X latency improvement over the baseline. Experiments on motion deblurring (GoPro), defocus deblurring (DPDD), and auxiliary benchmarks (RealBlur-J/R, HIDE) demonstrate the generality of the approach, while comparisons with prior efficient baselines confirm its accuracy-efficiency trade-off. These results establish feedback-driven adaptation as a principled strategy for bridging the gap between algorithmic design and deployment-ready deblurring models.

316. 【2601.11684】Mobile-friendly Image de-noising: Hardware Conscious Optimization for Edge Application

链接https://arxiv.org/abs/2601.11684

作者:Srinivas Miriyala,Sowmya Vajrala,Hitesh Kumar,Sravanth Kodavanti,Vikram Rajendiran

类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Image Signal Processing, traditional Image Signal, entangled with noise, critical task, task in computer

备注: Accepted at ICASSP 2025

点击查看摘要

Abstract:Image enhancement is a critical task in computer vision and photography that is often entangled with noise. This renders the traditional Image Signal Processing (ISP) ineffective compared to the advances in deep learning. However, the success of such methods is increasingly associated with the ease of their deployment on edge devices, such as smartphones. This work presents a novel mobile-friendly network for image de-noising obtained with Entropy-Regularized differentiable Neural Architecture Search (NAS) on a hardware-aware search space for a U-Net architecture, which is first-of-its-kind. The designed model has 12% less parameters, with ~2-fold improvement in ondevice latency and 1.5-fold improvement in the memory footprint for a 0.7% drop in PSNR, when deployed and profiled on Samsung Galaxy S24 Ultra. Compared to the SOTA Swin-Transformer for Image Restoration, the proposed network had competitive accuracy with ~18-fold reduction in GMACs. Further, the network was tested successfully for Gaussian de-noising with 3 intensities on 4 benchmarks and real-world de-noising on 1 benchmark demonstrating its generalization ability.

317. 【2601.11680】FourierPET: Deep Fourier-based Unrolled Network for Low-count PET Reconstruction

链接https://arxiv.org/abs/2601.11680

作者:Zheng Zhang,Hao Tang,Yingying Hu,Zhanli Hu,Jing Qin

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:Low-count positron emission, positron emission tomography, challenging inverse problem, inverse problem due, Low-count positron

备注: Accepted for oral presentation at AAAI 2026

点击查看摘要

Abstract:Low-count positron emission tomography (PET) reconstruction is a challenging inverse problem due to severe degradations arising from Poisson noise, photon scarcity, and attenuation correction errors. Existing deep learning methods typically address these in the spatial domain with an undifferentiated optimization objective, making it difficult to disentangle overlapping artifacts and limiting correction effectiveness. In this work, we perform a Fourier-domain analysis and reveal that these degradations are spectrally separable: Poisson noise and photon scarcity cause high-frequency phase perturbations, while attenuation errors suppress low-frequency amplitude components. Leveraging this insight, we propose FourierPET, a Fourier-based unrolled reconstruction framework grounded in the Alternating Direction Method of Multipliers. It consists of three tailored modules: a spectral consistency module that enforces global frequency alignment to maintain data fidelity, an amplitude-phase correction module that decouples and compensates for high-frequency phase distortions and low-frequency amplitude suppression, and a dual adjustment module that accelerates convergence during iterative reconstruction. Extensive experiments demonstrate that FourierPET achieves state-of-the-art performance with significantly fewer parameters, while offering enhanced interpretability through frequency-aware correction.

318. 【2601.11674】Pigment Network Detection and Classification in Dermoscopic Images Using Directional Imaging Algorithms and Convolutional Neural Networks

链接https://arxiv.org/abs/2601.11674

作者:M. A. Rasel,Sameem Abdul Kareem,Unaizah Obaidellah

类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:thousands of lives, relies heavily, save thousands, Early diagnosis, Principal Component Analysis

备注

点击查看摘要

Abstract:Early diagnosis of melanoma, which can save thousands of lives, relies heavily on the analysis of dermoscopic images. One crucial diagnostic criterion is the identification of unusual pigment network (PN). However, distinguishing between regular (typical) and irregular (atypical) PN is challenging. This study aims to automate the PN detection process using a directional imaging algorithm and classify PN types using machine learning classifiers. The directional imaging algorithm incorporates Principal Component Analysis (PCA), contrast enhancement, filtering, and noise reduction. Applied to the PH2 dataset, this algorithm achieved a 96% success rate, which increased to 100% after pixel intensity adjustments. We created a new dataset containing only PN images from these results. We then employed two classifiers, Convolutional Neural Network (CNN) and Bag of Features (BoF), to categorize PN into atypical and typical classes. Given the limited dataset of 200 images, a simple and effective CNN was designed, featuring two convolutional layers and two batch normalization layers. The proposed CNN achieved 90% accuracy, 90% sensitivity, and 89% specificity. When compared to state-of-the-art methods, our CNN demonstrated superior performance. Our study highlights the potential of the proposed CNN model for effective PN classification, suggesting future research should focus on expanding datasets and incorporating additional dermatological features to further enhance melanoma diagnosis.