本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。

统计

今日共更新360篇论文,其中:

  • 自然语言处理40
  • 信息检索7
  • 计算机视觉96

自然语言处理

1. 【2512.22101】A2P-Vis: an Analyzer-to-Presenter Agentic Pipeline for Visual Insights Generation and Reporting

链接https://arxiv.org/abs/2512.22101

作者:Shuyu Gan,Renxiang Wang,James Mooney,Dongyeop Kang

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:diverse visual evidence, generating insightful, data science pipeline, visual evidence, evidence and assembling

备注: 3 pages, 3 figures; Accepted by 1st Workshop on GenAI, Agents and the Future of VIS as Mini-challenge paper and win the Honorable Mention award. Submit number is 7597 and the paper is archived on the workshop website: [this https URL](https://visxgenai.github.io/subs-2025/7597/7597-doc.pdf)

点击查看摘要

Abstract:Automating end-to-end data science pipeline with AI agents still stalls on two gaps: generating insightful, diverse visual evidence and assembling it into a coherent, professional report. We present A2P-Vis, a two-part, multi-agent pipeline that turns raw datasets into a high-quality data-visualization report. The Data Analyzer orchestrates profiling, proposes diverse visualization directions, generates and executes plotting code, filters low-quality figures with a legibility checker, and elicits candidate insights that are automatically scored for depth, correctness, specificity, depth and actionability. The Presenter then orders topics, composes chart-grounded narratives from the top-ranked insights, writes justified transitions, and revises the document for clarity and consistency, yielding a coherent, publication-ready report. Together, these agents convert raw data into curated materials (charts + vetted insights) and into a readable narrative without manual glue work. We claim that by coupling a quality-assured Analyzer with a narrative Presenter, A2P-Vis operationalizes co-analysis end-to-end, improving the real-world usefulness of automated data analysis for practitioners. For the complete dataset report, please see: this https URL.

2. 【2512.22100】Introducing TrGLUE and SentiTurca: A Comprehensive Benchmark for Turkish General Language Understanding and Sentiment Analysis

链接https://arxiv.org/abs/2512.22100

作者:Duygu Altinok

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:NLP systems, evaluating English NLU, large language models, measure performance, multiple dimensions

备注: under review by Springer

点击查看摘要

Abstract:Evaluating the performance of various model architectures, such as transformers, large language models (LLMs), and other NLP systems, requires comprehensive benchmarks that measure performance across multiple dimensions. Among these, the evaluation of natural language understanding (NLU) is particularly critical as it serves as a fundamental criterion for assessing model capabilities. Thus, it is essential to establish benchmarks that enable thorough evaluation and analysis of NLU abilities from diverse perspectives. While the GLUE benchmark has set a standard for evaluating English NLU, similar benchmarks have been developed for other languages, such as CLUE for Chinese, FLUE for French, and JGLUE for Japanese. However, no comparable benchmark currently exists for the Turkish language. To address this gap, we introduce TrGLUE, a comprehensive benchmark encompassing a variety of NLU tasks for Turkish. In addition, we present SentiTurca, a specialized benchmark for sentiment analysis. To support researchers, we also provide fine-tuning and evaluation code for transformer-based models, facilitating the effective use of these benchmarks. TrGLUE comprises Turkish-native corpora curated to mirror the domains and task formulations of GLUE-style evaluations, with labels obtained through a semi-automated pipeline that combines strong LLM-based annotation, cross-model agreement checks, and subsequent human validation. This design prioritizes linguistic naturalness, minimizes direct translation artifacts, and yields a scalable, reproducible workflow. With TrGLUE, our goal is to establish a robust evaluation framework for Turkish NLU, empower researchers with valuable resources, and provide insights into generating high-quality semi-automated datasets.

3. 【2512.22088】Unifying Learning Dynamics and Generalization in Transformers Scaling Law

链接https://arxiv.org/abs/2512.22088

作者:Chiwun Yang

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large Language Model, cornerstone of Large, Large Language, predicts improvements, performance with increasing

备注

点击查看摘要

Abstract:The scaling law, a cornerstone of Large Language Model (LLM) development, predicts improvements in model performance with increasing computational resources. Yet, while empirically validated, its theoretical underpinnings remain poorly understood. This work formalizes the learning dynamics of transformer-based language models as an ordinary differential equation (ODE) system, then approximates this process to kernel behaviors. Departing from prior toy-model analyses, we rigorously analyze stochastic gradient descent (SGD) training for multi-layer transformers on sequence-to-sequence data with arbitrary data distribution, closely mirroring real-world conditions. Our analysis characterizes the convergence of generalization error to the irreducible risk as computational resources scale with data, especially during the optimization process. We establish a theoretical upper bound on excess risk characterized by a distinct phase transition. In the initial optimization phase, the excess risk decays exponentially relative to the computational cost ${\sf C}$. However, once a specific resource allocation threshold is crossed, the system enters a statistical phase, where the generalization error follows a power-law decay of $\Theta(\mathsf{C}^{-1/6})$. Beyond this unified framework, our theory derives isolated scaling laws for model size, training time, and dataset size, elucidating how each variable independently governs the upper bounds of generalization.

Subjects:

Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Cite as:
arXiv:2512.22088 [cs.LG]

(or
arXiv:2512.22088v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2512.22088

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
4. 【2512.22087】Context as a Tool: Context Management for Long-Horizon SWE-Agents

链接https://arxiv.org/abs/2512.22087

作者:Shukai Liu,Jian Yang,Bo Jiang,Yizhi Li,Jinyang Guo,Xianglong Liu,Bryan Dai

类目:Computation and Language (cs.CL)

关键词:real-world software engineering, recently shown strong, shown strong potential, large language models, software engineering

备注

点击查看摘要

Abstract:Agents based on large language models have recently shown strong potential on real-world software engineering (SWE) tasks that require long-horizon interaction with repository-scale codebases. However, most existing agents rely on append-only context maintenance or passively triggered compression heuristics, which often lead to context explosion, semantic drift, and degraded reasoning in long-running interactions. We propose CAT, a new context management paradigm that elevates context maintenance to a callable tool integrated into the decision-making process of agents. CAT formalizes a structured context workspace consisting of stable task semantics, condensed long-term memory, and high-fidelity short-term interactions, and enables agents to proactively compress historical trajectories into actionable summaries at appropriate milestones. To support context management for SWE-agents, we propose a trajectory-level supervision framework, CAT-GENERATOR, based on an offline data construction pipeline that injects context-management actions into complete interaction trajectories. Using this framework, we train a context-aware model, SWE-Compressor. Experiments on SWE-Bench-Verified demonstrate that SWE-Compressor reaches a 57.6% solved rate and significantly outperforms ReAct-based agents and static compression baselines, while maintaining stable and scalable long-horizon reasoning under a bounded context budget.

5. 【2512.22060】oward Secure and Compliant AI: Organizational Standards and Protocols for NLP Model Lifecycle Management

链接https://arxiv.org/abs/2512.22060

作者:Sunil Arora,John Hastings

类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:Natural Language Processing, handle large volumes, NLP Lifecycle Management, Natural Language, Language Processing

备注: 9 pages, 2 tables, 1 figure

点击查看摘要

Abstract:Natural Language Processing (NLP) systems are increasingly used in sensitive domains such as healthcare, finance, and government, where they handle large volumes of personal and regulated data. However, these systems introduce distinct risks related to security, privacy, and regulatory compliance that are not fully addressed by existing AI governance frameworks. This paper introduces the Secure and Compliant NLP Lifecycle Management Framework (SC-NLP-LMF), a comprehensive six-phase model designed to ensure the secure operation of NLP systems from development to retirement. The framework, developed through a systematic PRISMA-based review of 45 peer-reviewed and regulatory sources, aligns with leading standards, including NIST AI RMF, ISO/IEC 42001:2023, the EU AI Act, and MITRE ATLAS. It integrates established methods for bias detection, privacy protection (differential privacy, federated learning), secure deployment, explainability, and secure model decommissioning. A healthcare case study illustrates how SC-NLP-LMF detects emerging terminology drift (e.g., COVID-related language) and guides compliant model updates. The framework offers organizations a practical, lifecycle-wide structure for developing, deploying, and maintaining secure and accountable NLP systems in high-risk environments.

6. 【2512.21956】Self-attention vector output similarities reveal how machines pay attention

链接https://arxiv.org/abs/2512.21956

作者:Tal Halevi,Yarden Tzach,Ronit D. Gross,Shalom Rosner,Ido Kanter

类目:Computation and Language (cs.CL)

关键词:advanced language-learning machines, natural language processing, self-attention mechanism, facilitating the development, language-learning machines

备注: 22 pages, 13 figures

点击查看摘要

Abstract:The self-attention mechanism has significantly advanced the field of natural language processing, facilitating the development of advanced language-learning machines. Although its utility is widely acknowledged, the precise mechanisms of self-attention underlying its advanced learning and the quantitative characterization of this learning process remains an open research question. This study introduces a new approach for quantifying information processing within the self-attention mechanism. The analysis conducted on the BERT-12 architecture reveals that, in the final layers, the attention map focuses on sentence separator tokens, suggesting a practical approach to text segmentation based on semantic features. Based on the vector space emerging from the self-attention heads, a context similarity matrix, measuring the scalar product between two token vectors was derived, revealing distinct similarities between different token vector pairs within each head and layer. The findings demonstrated that different attention heads within an attention block focused on different linguistic characteristics, such as identifying token repetitions in a given text or recognizing a token of common appearance in the text and its surrounding context. This specialization is also reflected in the distribution of distances between token vectors with high similarity as the architecture progresses. The initial attention layers exhibit substantially long-range similarities; however, as the layers progress, a more short-range similarity develops, culminating in a preference for attention heads to create strong similarities within the same sentence. Finally, the behavior of individual heads was analyzed by examining the uniqueness of their most common tokens in their high similarity elements. Each head tends to focus on a unique token from the text and builds similarity pairs centered around it.

7. 【2512.21933】Broken Words, Broken Performance: Effect of Tokenization on Performance of LLMs

链接https://arxiv.org/abs/2512.21933

作者:Sachin Pawar,Manoj Apte,Kshitij Jadhav,Girish Keshav Palshikar,Nitin Ramrakhiyani

类目:Computation and Language (cs.CL)

关键词:Large Language Model, Large Language, Language Model, model fixed vocabulary, training any Large

备注: International Joint Conference on Natural Language Processing Asia-Pacific Chapter of the Association for Computational Linguistics (IJCNLP-AACL 2025)

点击查看摘要

Abstract:Tokenization is the first step in training any Large Language Model (LLM), where the text is split into a sequence of tokens as per the model's fixed vocabulary. This tokenization in LLMs is different from the traditional tokenization in NLP where the text is split into a sequence of "natural" words. In LLMs, a natural word may also be broken into multiple tokens due to limited vocabulary size of the LLMs (e.g., Mistral's tokenizer splits "martial" into "mart" and "ial"). In this paper, we hypothesize that such breaking of natural words negatively impacts LLM performance on various NLP tasks. To quantify this effect, we propose a set of penalty functions that compute a tokenization penalty for a given text for a specific LLM, indicating how "bad" the tokenization is. We establish statistical significance of our hypothesis on multiple NLP tasks for a set of different LLMs.

8. 【2512.21919】SWE-RM: Execution-free Feedback For Software Engineering Agents

链接https://arxiv.org/abs/2512.21919

作者:KaShun Shum,Binyuan Hui,Jiawei Chen,Lei Zhang,X. W.,Jiaxi Yang,Yuzhen Huang,Junyang Lin,Junxian He

类目:Computation and Language (cs.CL)

关键词:Execution-based feedback, unit test cases, test-time scaling, reinforcement learning, testing is widely

备注: 21 pages

点击查看摘要

Abstract:Execution-based feedback like unit testing is widely used in the development of coding agents through test-time scaling (TTS) and reinforcement learning (RL). This paradigm requires scalable and reliable collection of unit test cases to provide accurate feedback, and the resulting feedback is often sparse and cannot effectively distinguish between trajectories that are both successful or both unsuccessful. In contrast, execution-free feedback from reward models can provide more fine-grained signals without depending on unit test cases. Despite this potential, execution-free feedback for realistic software engineering (SWE) agents remains underexplored. Aiming to develop versatile reward models that are effective across TTS and RL, however, we observe that two verifiers with nearly identical TTS performance can nevertheless yield very different results in RL. Intuitively, TTS primarily reflects the model's ability to select the best trajectory, but this ability does not necessarily generalize to RL. To address this limitation, we identify two additional aspects that are crucial for RL training: classification accuracy and calibration. We then conduct comprehensive controlled experiments to investigate how to train a robust reward model that performs well across these metrics. In particular, we analyze the impact of various factors such as training data scale, policy mixtures, and data source composition. Guided by these investigations, we introduce SWE-RM, an accurate and robust reward model adopting a mixture-of-experts architecture with 30B total parameters and 3B activated during inference. SWE-RM substantially improves SWE agents on both TTS and RL performance. For example, it increases the accuracy of Qwen3-Coder-Flash from 51.6% to 62.0%, and Qwen3-Coder-Max from 67.0% to 74.6% on SWE-Bench Verified using TTS, achieving new state-of-the-art performance among open-source models.

9. 【2512.21911】Accelerate Speculative Decoding with Sparse Computation in Verification

链接https://arxiv.org/abs/2512.21911

作者:Jikai Wang,Jianchao Tan,Yuxuan Hu,Jiayu Qin,Yerui Sun,Yuchen Xie,Xunliang Cai,Juntao Li,Min Zhang

类目:Computation and Language (cs.CL)

关键词:language model inference, accelerates autoregressive language, verifying multiple draft, autoregressive language model, Speculative decoding accelerates

备注: Pre-print

点击查看摘要

Abstract:Speculative decoding accelerates autoregressive language model inference by verifying multiple draft tokens in parallel. However, the verification stage often becomes the dominant computational bottleneck, especially for long-context inputs and mixture-of-experts (MoE) models. Existing sparsification methods are designed primarily for standard token-by-token autoregressive decoding to remove substantial computational redundancy in LLMs. This work systematically adopts different sparse methods on the verification stage of the speculative decoding and identifies structured redundancy across multiple dimensions. Based on these observations, we propose a sparse verification framework that jointly sparsifies attention, FFN, and MoE components during the verification stage to reduce the dominant computation cost. The framework further incorporates an inter-draft token and inter-layer retrieval reuse strategy to further reduce redundant computation without introducing additional training. Extensive experiments across summarization, question answering, and mathematical reasoning datasets demonstrate that the proposed methods achieve favorable efficiency-accuracy trade-offs, while maintaining stable acceptance length.

10. 【2512.21902】Explainable Statute Prediction via Attention-based Model and LLM Prompting

链接https://arxiv.org/abs/2512.21902

作者:Sachin Pawar,Girish Keshav Palshikar,Anindita Sinha Banerjee,Nitin Ramrakhiyani,Basit Ali

类目:Computation and Language (cs.CL)

关键词:automatic statute prediction, statute, statute prediction, specific Act, automatic statute

备注

点击查看摘要

Abstract:In this paper, we explore the problem of automatic statute prediction where for a given case description, a subset of relevant statutes are to be predicted. Here, the term "statute" refers to a section, a sub-section, or an article of any specific Act. Addressing this problem would be useful in several applications such as AI-assistant for lawyers and legal question answering system. For better user acceptance of such Legal AI systems, we believe the predictions should also be accompanied by human understandable explanations. We propose two techniques for addressing this problem of statute prediction with explanations -- (i) AoS (Attention-over-Sentences) which uses attention over sentences in a case description to predict statutes relevant for it and (ii) LLMPrompt which prompts an LLM to predict as well as explain relevance of a certain statute. AoS uses smaller language models, specifically sentence transformers and is trained in a supervised manner whereas LLMPrompt uses larger language models in a zero-shot manner and explores both standard as well as Chain-of-Thought (CoT) prompting techniques. Both these models produce explanations for their predictions in human understandable forms. We compare statute prediction performance of both the proposed techniques with each other as well as with a set of competent baselines, across two popular datasets. Also, we evaluate the quality of the generated explanations through an automated counter-factual manner as well as through human evaluation.

11. 【2512.21877】CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics

链接https://arxiv.org/abs/2512.21877

作者:Vaibhav Devraj,Dhruv Kumar,Jagat Sesh Challa

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:billion fans globally, popular sport globally, fans globally, commanding a massive, billion fans

备注: Under Review

点击查看摘要

Abstract:Cricket is the second most popular sport globally, commanding a massive following of over 2.5 billion fans globally. Enthusiasts and analysts frequently seek advanced statistical insights, such as long-term historical performance trends or complex player comparisons, that are often unavailable through standard web searches. While Large Language Models (LLMs) have advanced significantly in Text-to-SQL tasks, their capability to handle the domain-specific nuances, complex schema variations, and multilingual requirements inherent to sports analytics remains under-explored. To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data. To curate a "Gold Standard" dataset, we collaborate with domain experts in cricket and SQL to manually author complex queries, ensuring logical correctness. Recognizing linguistic diversity, we construct the benchmark in both English and Hindi, establishing a framework that is open for further extension to other regional languages. We evaluate six state-of-the-art models, including GPT-4o, Claude 3.7 Sonnet, and open-source models, using a strict evaluation protocol. Our results reveal that high performance on general benchmarks does not guarantee success in specialized domains. While the open-weights reasoning model DeepSeek R1 achieves state-of-the-art performance (50.6%), surpassing proprietary giants like Claude 3.7 Sonnet (47.7%) and GPT-4o (33.7%), it still exhibits a significant accuracy drop when moving from general benchmarks (BIRD) to CricBench. Furthermore, we observe that code-mixed Hindi queries frequently yield parity or higher accuracy compared to English, challenging the assumption that English is the optimal prompt language for specialized SQL tasks.

12. 【2512.21871】Bridging the Copyright Gap: Do Large Vision-Language Models Recognize and Respect Copyrighted Content?

链接https://arxiv.org/abs/2512.21871

作者:Naen Xu,Jinghuai Zhang,Changjiang Li,Hengyu An,Chunyi Zhou,Jun Wang,Boyu Xu,Yuyuan Li,Tianyu Du,Shouling Ji

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY)

关键词:Large vision-language models, achieved remarkable advancements, Large vision-language, multimodal reasoning tasks, copyright

备注: AAAI 2026 (Oral)

点击查看摘要

Abstract:Large vision-language models (LVLMs) have achieved remarkable advancements in multimodal reasoning tasks. However, their widespread accessibility raises critical concerns about potential copyright infringement. Will LVLMs accurately recognize and comply with copyright regulations when encountering copyrighted content (i.e., user input, retrieved documents) in the context? Failure to comply with copyright regulations may lead to serious legal and ethical consequences, particularly when LVLMs generate responses based on copyrighted materials (e.g., retrieved book experts, news reports). In this paper, we present a comprehensive evaluation of various LVLMs, examining how they handle copyrighted content -- such as book excerpts, news articles, music lyrics, and code documentation when they are presented as visual inputs. To systematically measure copyright compliance, we introduce a large-scale benchmark dataset comprising 50,000 multimodal query-content pairs designed to evaluate how effectively LVLMs handle queries that could lead to copyright infringement. Given that real-world copyrighted content may or may not include a copyright notice, the dataset includes query-content pairs in two distinct scenarios: with and without a copyright notice. For the former, we extensively cover four types of copyright notices to account for different cases. Our evaluation reveals that even state-of-the-art closed-source LVLMs exhibit significant deficiencies in recognizing and respecting the copyrighted content, even when presented with the copyright notice. To solve this limitation, we introduce a novel tool-augmented defense framework for copyright compliance, which reduces infringement risks in all scenarios. Our findings underscore the importance of developing copyright-aware LVLMs to ensure the responsible and lawful use of copyrighted content.

13. 【2512.21859】meBill: Time-Budgeted Inference for Large Language Models

链接https://arxiv.org/abs/2512.21859

作者:Qi Fan,An Zou,Yehan Ma

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, Language Models, generating accurate responses, autonomous driving

备注: Accepted to AAAI 2026

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly deployed in time-critical systems, such as robotics, autonomous driving, embodied intelligence, and industrial automation, where generating accurate responses within a given time budget is crucial for decision-making, control, or safety-critical tasks. However, the auto-regressive generation process of LLMs makes it challenging to model and estimate the end-to-end execution time. Furthermore, existing efficient inference methods based on a fixed key-value (KV) cache eviction ratio struggle to adapt to varying tasks with diverse time budgets, where an improper eviction ratio may lead to incomplete inference or a drop in response performance. In this paper, we propose TimeBill, a novel time-budgeted inference framework for LLMs that balances the inference efficiency and response performance. To be more specific, we propose a fine-grained response length predictor (RLP) and an execution time estimator (ETE) to accurately predict the end-to-end execution time of LLMs. Following this, we develop a time-budgeted efficient inference approach that adaptively adjusts the KV cache eviction ratio based on execution time prediction and the given time budget. Finally, through extensive experiments, we demonstrate the advantages of TimeBill in improving task completion rate and maintaining response performance under various overrun strategies.

14. 【2512.21849】HeartBench: Probing Core Dimensions of Anthropomorphic Intelligence in LLMs

链接https://arxiv.org/abs/2512.21849

作者:Jiaxin Liu,Peiyi Tu,Wenyu Chen,Yihong Zhuang,Xinxia Ling,Anji Zhou,Chenxi Wang,Zhuo Han,Zhengkai Yang,Junbo Zhao,Zenan Huang,Yuanyuan Wang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, achieved remarkable success, Large Language, navigate complex social, anthropomorphic intelligence-the capacity

备注: 10 pages

点击查看摘要

Abstract:While Large Language Models (LLMs) have achieved remarkable success in cognitive and reasoning benchmarks, they exhibit a persistent deficit in anthropomorphic intelligence-the capacity to navigate complex social, emotional, and ethical nuances. This gap is particularly acute in the Chinese linguistic and cultural context, where a lack of specialized evaluation frameworks and high-quality socio-emotional data impedes progress. To address these limitations, we present HeartBench, a framework designed to evaluate the integrated emotional, cultural, and ethical dimensions of Chinese LLMs. Grounded in authentic psychological counseling scenarios and developed in collaboration with clinical experts, the benchmark is structured around a theory-driven taxonomy comprising five primary dimensions and 15 secondary capabilities. We implement a case-specific, rubric-based methodology that translates abstract human-like traits into granular, measurable criteria through a ``reasoning-before-scoring'' evaluation protocol. Our assessment of 13 state-of-the-art LLMs indicates a substantial performance ceiling: even leading models achieve only 60% of the expert-defined ideal score. Furthermore, analysis using a difficulty-stratified ``Hard Set'' reveals a significant performance decay in scenarios involving subtle emotional subtexts and complex ethical trade-offs. HeartBench establishes a standardized metric for anthropomorphic AI evaluation and provides a methodological blueprint for constructing high-quality, human-aligned training data.

15. 【2512.21842】AlignAR: Generative Sentence Alignment for Arabic-English Parallel Corpora of Legal and Literary Texts

链接https://arxiv.org/abs/2512.21842

作者:Baorong Huang,Ali Asiri

类目:Computation and Language (cs.CL)

关键词:High-quality parallel corpora, Machine Translation, High-quality parallel, essential for Machine, translation teaching

备注

点击查看摘要

Abstract:High-quality parallel corpora are essential for Machine Translation (MT) research and translation teaching. However, Arabic-English resources remain scarce and existing datasets mainly consist of simple one-to-one mappings. In this paper, we present AlignAR, a generative sentence alignment method, and a new Arabic-English dataset comprising complex legal and literary texts. Our evaluation demonstrates that "Easy" datasets lack the discriminatory power to fully assess alignment methods. By reducing one-to-one mappings in our "Hard" subset, we exposed the limitations of traditional alignment methods. In contrast, LLM-based approaches demonstrated superior robustness, achieving an overall F1-score of 85.5%, a 9% improvement over previous methods. Our datasets and codes are open-sourced at this https URL.

16. 【2512.21837】Knowledge Reasoning of Large Language Models Integrating Graph-Structured Information for Pest and Disease Control in Tobacco

链接https://arxiv.org/abs/2512.21837

作者:Siyu Li,Chenwei Song,Wan Zhou,Xinyi Liu

类目:Computation and Language (cs.CL)

关键词:integrates graph-structured information, large language model, paper proposes, proposes a large, large language

备注

点击查看摘要

Abstract:This paper proposes a large language model (LLM) approach that integrates graph-structured information for knowledge reasoning in tobacco pest and disease control. Built upon the GraphRAG framework, the proposed method enhances knowledge retrieval and reasoning by explicitly incorporating structured information from a domain-specific knowledge graph. Specifically, LLMs are first leveraged to assist in the construction of a tobacco pest and disease knowledge graph, which organizes key entities such as diseases, symptoms, control methods, and their relationships. Based on this graph, relevant knowledge is retrieved and integrated into the reasoning process to support accurate answer generation. The Transformer architecture is adopted as the core inference model, while a graph neural network (GNN) is employed to learn expressive node representations that capture both local and global relational information within the knowledge graph. A ChatGLM-based model serves as the backbone LLM and is fine-tuned using LoRA to achieve parameter-efficient adaptation. Extensive experimental results demonstrate that the proposed approach consistently outperforms baseline methods across multiple evaluation metrics, significantly improving both the accuracy and depth of reasoning, particularly in complex multi-hop and comparative reasoning scenarios.

17. 【2512.21817】Method Decoration (DeMe): A Framework for LLM-Driven Adaptive Method Generation in Dynamic IoT Environments

链接https://arxiv.org/abs/2512.21817

作者:Hong Su

类目:Computation and Language (cs.CL)

关键词:large language models, Intelligent IoT systems, systems increasingly rely, generate task-execution methods, IoT systems increasingly

备注

点击查看摘要

Abstract:Intelligent IoT systems increasingly rely on large language models (LLMs) to generate task-execution methods for dynamic environments. However, existing approaches lack the ability to systematically produce new methods when facing previously unseen situations, and they often depend on fixed, device-specific logic that cannot adapt to changing environmental this http URL this paper, we propose Method Decoration (DeMe), a general framework that modifies the method-generation path of an LLM using explicit decorations derived from hidden goals, accumulated learned methods, and environmental feedback. Unlike traditional rule augmentation, decorations in DeMe are not hardcoded; instead, they are extracted from universal behavioral principles, experience, and observed environmental differences. DeMe enables the agent to reshuffle the structure of its method path-through pre-decoration, post-decoration, intermediate-step modification, and step insertion-thereby producing context-aware, safety-aligned, and environment-adaptive methods. Experimental results show that method decoration allows IoT devices to derive ore appropriate methods when confronting unknown or faulty operating conditions.

18. 【2512.21809】On The Conceptualization and Societal Impact of Cross-Cultural Bias

链接https://arxiv.org/abs/2512.21809

作者:Vitthal Bhandari

类目:Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:large language models, Research has shown, Natural Language Processing, generalize across cultures, generate their responses

备注: Term paper for LING 575 (Societal Impacts of Language Technologies)

点击查看摘要

Abstract:Research has shown that while large language models (LLMs) can generate their responses based on cultural context, they are not perfect and tend to generalize across cultures. However, when evaluating the cultural bias of a language technology on any dataset, researchers may choose not to engage with stakeholders actually using that technology in real life, which evades the very fundamental problem they set out to address. Inspired by the work done by arXiv:2005.14050v2, I set out to analyse recent literature about identifying and evaluating cultural bias in Natural Language Processing (NLP). I picked out 20 papers published in 2025 about cultural bias and came up with a set of observations to allow NLP researchers in the future to conceptualize bias concretely and evaluate its harms effectively. My aim is to advocate for a robust assessment of the societal impact of language technologies exhibiting cross-cultural bias.

Comments:
Term paper for LING 575 (Societal Impacts of Language Technologies)

Subjects:

Computation and Language (cs.CL); Computers and Society (cs.CY)

Cite as:
arXiv:2512.21809 [cs.CL]

(or
arXiv:2512.21809v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2512.21809

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
19. 【2512.21789】Five Years of SciCap: What We Learned and Future Directions for Scientific Figure Captioning

链接https://arxiv.org/abs/2512.21789

作者:Ting-Hao K.Huang,Ryan A. Rossi,Sungchul Kim,Tong Yu,Ting-Yao E. Hsu,Ho Yin(Sam)Ng,C. Lee Giles

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

关键词:Pennsylvania State University, small seed-funded idea, central efforts shaping, Penn State seed, Penn State

备注: Accepted to the 5th Annual AAAI Workshop on AI to Accelerate Science and Engineering (AI2ASE 2026)

点击查看摘要

Abstract:Between 2021 and 2025, the SciCap project grew from a small seed-funded idea at The Pennsylvania State University (Penn State) into one of the central efforts shaping the scientific figure-captioning landscape. Supported by a Penn State seed grant, Adobe, and the Alfred P. Sloan Foundation, what began as our attempt to test whether domain-specific training, which was successful in text models like SciBERT, could also work for figure captions expanded into a multi-institution collaboration. Over these five years, we curated, released, and continually updated a large collection of figure-caption pairs from arXiv papers, conducted extensive automatic and human evaluations on both generated and author-written captions, navigated the rapid rise of large language models (LLMs), launched annual challenges, and built interactive systems that help scientists write better captions. In this piece, we look back at the first five years of SciCap and summarize the key technical and methodological lessons we learned. We then outline five major unsolved challenges and propose directions for the next phase of research in scientific figure captioning.

20. 【2512.21787】Ara-HOPE: Human-Centric Post-Editing Evaluation for Dialectal Arabic to Modern Standard Arabic Translation

链接https://arxiv.org/abs/2512.21787

作者:Abdullah Alabdullah,Lifeng Han,Chenghua Lin

类目:Computation and Language (cs.CL)

关键词:Modern Standard Arabic, Modern Standard, dialects and MSA, task in Machine, Standard Arabic

备注

点击查看摘要

Abstract:Dialectal Arabic to Modern Standard Arabic (DA-MSA) translation is a challenging task in Machine Translation (MT) due to significant lexical, syntactic, and semantic divergences between Arabic dialects and MSA. Existing automatic evaluation metrics and general-purpose human evaluation frameworks struggle to capture dialect-specific MT errors, hindering progress in translation assessment. This paper introduces Ara-HOPE, a human-centric post-editing evaluation framework designed to systematically address these challenges. The framework includes a five-category error taxonomy and a decision-tree annotation protocol. Through comparative evaluation of three MT systems (Arabic-centric Jais, general-purpose GPT-3.5, and baseline NLLB-200), Ara-HOPE effectively highlights systematic performance differences between these systems. The results show that dialect-specific terminology and semantic preservation remain the most persistent challenges in DA-MSA translation. Ara-HOPE establishes a new framework for evaluating Dialectal Arabic MT quality and provides actionable guidance for improving dialect-aware MT systems.

21. 【2512.21720】An Information Theoretic Perspective on Agentic System Design

链接https://arxiv.org/abs/2512.21720

作者:Shizhe He,Avanika Narayan,Ishan S. Khare,Scott W. Linderman,Christopher Ré,Dan Biderman

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT)

关键词:Claude Code, power modern applications, leverage multi-LM architectures, overcome context limitations, systems power modern

备注

点击查看摘要

Abstract:Agentic language model (LM) systems power modern applications like "Deep Research" and "Claude Code," and leverage multi-LM architectures to overcome context limitations. Beneath their apparent diversity lies a recurring pattern: smaller "compressor" LMs (that can even run locally) distill raw context into compact text that is then consumed by larger "predictor" LMs. Despite their popularity, the design of compressor-predictor systems remains largely ad hoc, with little guidance on how compressor and predictor choices shape downstream performance. In practice, attributing gains to compression versus prediction requires costly, task-specific pairwise sweeps. We argue that these agentic system design questions are, at root, information-theoretic. Viewing the compressor LM as a noisy channel, we introduce a simple estimator of mutual information between the context and its compression to quantify compression quality in a task-independent way. We show that mutual information strongly predicts downstream performance, independent of any specific task. Through an information-theoretic framework, we perform a comprehensive empirical analysis across five datasets and three model families. Results reveal that larger compressors not only are more accurate, but also more token-efficient, conveying more bits of information per token. A 7B Qwen-2.5 compressor, for instance, is $1.6\times$ more accurate, $4.6\times$ more concise, and conveys $5.5\times$ more bits of mutual information per token than its 1.5B sibling. Across datasets, scaling compressors is substantially more effective than scaling predictors, enabling larger on-device compressors to pair with smaller cloud predictors. Applied to a Deep Research system, these principles enable local compressors as small as 3B parameters to recover $99\%$ of frontier-LM accuracy at $26\%$ of API costs.

22. 【2512.21715】CATCH: A Controllable Theme Detection Framework with Contextualized Clustering and Hierarchical Generation

链接https://arxiv.org/abs/2512.21715

作者:Rui Ke,Jiahui Xu,Shenghao Yang,Kuang Wang,Feng Jiang,Haizhou Li

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:user-centric dialogue systems, Theme detection, Controllable Theme Detection, aiming to identify, predefined schemas

备注

点击查看摘要

Abstract:Theme detection is a fundamental task in user-centric dialogue systems, aiming to identify the latent topic of each utterance without relying on predefined schemas. Unlike intent induction, which operates within fixed label spaces, theme detection requires cross-dialogue consistency and alignment with personalized user preferences, posing significant challenges. Existing methods often struggle with sparse, short utterances for accurate topic representation and fail to capture user-level thematic preferences across dialogues. To address these challenges, we propose CATCH (Controllable Theme Detection with Contextualized Clustering and Hierarchical Generation), a unified framework that integrates three core components: (1) context-aware topic representation, which enriches utterance-level semantics using surrounding topic segments; (2) preference-guided topic clustering, which jointly models semantic proximity and personalized feedback to align themes across dialogue; and (3) a hierarchical theme generation mechanism designed to suppress noise and produce robust, coherent topic labels. Experiments on a multi-domain customer dialogue benchmark (DSTC-12) demonstrate the effectiveness of CATCH with 8B LLM in both theme clustering and topic generation quality.

23. 【2512.21711】Do Latent Tokens Think? A Causal and Adversarial Analysis of Chain-of-Continuous-Thought

链接https://arxiv.org/abs/2512.21711

作者:Yuyi Zhang,Boyu Tang,Tianjie Ju,Sufeng Duan,Gongshen Liu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:mechanisms remain unclear, remain unclear, Latent tokens, large language models, internal mechanisms remain

备注: 13 pages, 5 figures

点击查看摘要

Abstract:Latent tokens are gaining attention for enhancing reasoning in large language models (LLMs), yet their internal mechanisms remain unclear. This paper examines the problem from a reliability perspective, uncovering fundamental weaknesses: latent tokens function as uninterpretable placeholders rather than encoding faithful reasoning. While resistant to perturbation, they promote shortcut usage over genuine reasoning. We focus on Chain-of-Continuous-Thought (COCONUT), which claims better efficiency and stability than explicit Chain-of-Thought (CoT) while maintaining performance. We investigate this through two complementary approaches. First, steering experiments perturb specific token subsets, namely COCONUT and explicit CoT. Unlike CoT tokens, COCONUT tokens show minimal sensitivity to steering and lack reasoning-critical information. Second, shortcut experiments evaluate models under biased and out-of-distribution settings. Results on MMLU and HotpotQA demonstrate that COCONUT consistently exploits dataset artifacts, inflating benchmark performance without true reasoning. These findings reposition COCONUT as a pseudo-reasoning mechanism: it generates plausible traces that conceal shortcut dependence rather than faithfully representing reasoning processes.

24. 【2512.21709】Detecting AI-Generated Paraphrases in Bengali: A Comparative Study of Zero-Shot and Fine-Tuned Transformers

链接https://arxiv.org/abs/2512.21709

作者:Md. Rakibul Islam,Most. Sharmin Sultana Samu,Md. Zahid Hossain,Farhad Uz Zaman,Md. Kamrozzaman Bhuiyan

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:resembles human writing, closely resembles human, Large language models, Large language, human writing

备注: Accepted for publication in 2025 28th International Conference on Computer and Information Technology (ICCIT)

点击查看摘要

Abstract:Large language models (LLMs) can produce text that closely resembles human writing. This capability raises concerns about misuse, including disinformation and content manipulation. Detecting AI-generated text is essential to maintain authenticity and prevent malicious applications. Existing research has addressed detection in multiple languages, but the Bengali language remains largely unexplored. Bengali's rich vocabulary and complex structure make distinguishing human-written and AI-generated text particularly challenging. This study investigates five transformer-based models: XLMRoBERTa-Large, mDeBERTaV3-Base, BanglaBERT-Base, IndicBERT-Base and MultilingualBERT-Base. Zero-shot evaluation shows that all models perform near chance levels (around 50% accuracy) and highlight the need for task-specific fine-tuning. Fine-tuning significantly improves performance, with XLM-RoBERTa, mDeBERTa and MultilingualBERT achieving around 91% on both accuracy and F1-score. IndicBERT demonstrates comparatively weaker performance, indicating limited effectiveness in fine-tuning for this task. This work advances AI-generated text detection in Bengali and establishes a foundation for building robust systems to counter AI-generated content.

25. 【2512.21708】MoRAgent: Parameter Efficient Agent Tuning with Mixture-of-Roles

链接https://arxiv.org/abs/2512.21708

作者:Jing Han,Binwei Yan,Tianyu Guo,Zheyuan Bai,Mengyu Zheng,Hanting Chen,Ying Nie

类目:Computation and Language (cs.CL)

关键词:fine-tuning large language, large language models, remain largely unexplored, agent remain largely, facilitate agent tasks

备注: Accepted by ICML 2025

点击查看摘要

Abstract:Despite recent advancements of fine-tuning large language models (LLMs) to facilitate agent tasks, parameter-efficient fine-tuning (PEFT) methodologies for agent remain largely unexplored. In this paper, we introduce three key strategies for PEFT in agent tasks: 1) Inspired by the increasingly dominant Reason+Action paradigm, we first decompose the capabilities necessary for the agent tasks into three distinct roles: reasoner, executor, and summarizer. The reasoner is responsible for comprehending the user's query and determining the next role based on the execution trajectory. The executor is tasked with identifying the appropriate functions and parameters to invoke. The summarizer conveys the distilled information from conversations back to the user. 2) We then propose the Mixture-of-Roles (MoR) framework, which comprises three specialized Low-Rank Adaptation (LoRA) groups, each designated to fulfill a distinct role. By focusing on their respective specialized capabilities and engaging in collaborative interactions, these LoRAs collectively accomplish the agent task. 3) To effectively fine-tune the framework, we develop a multi-role data generation pipeline based on publicly available datasets, incorporating role-specific content completion and reliability verification. We conduct extensive experiments and thorough ablation studies on various LLMs and agent benchmarks, demonstrating the effectiveness of the proposed method. This project is publicly available at this https URL.

26. 【2512.21706】Enabling Conversational Behavior Reasoning Capabilities in Full-Duplex Speech

链接https://arxiv.org/abs/2512.21706

作者:Shuchang Pan,Siddharth Banerjee,Dhruv Hebbar,Siddhant Patel,Akshaj Gupta,Kan Jen Cheng,Hanjo Kim,Zeyi Austin Li,Martin Q. Ma,Tingle Li,Gopala Anumanchipalli,Jiachen Lian

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Human conversation, conversation is organized, thoughts that manifests, manifests as timed, timed speech acts

备注

点击查看摘要

Abstract:Human conversation is organized by an implicit chain of thoughts that manifests as timed speech acts. Capturing this causal pathway is key to building natural full-duplex interactive systems. We introduce a framework that enables reasoning over conversational behaviors by modeling this process as causal inference within a Graph-of-Thoughts (GoT). Our approach formalizes the intent-to-action pathway with a hierarchical labeling scheme, predicting high-level communicative intents and low-level speech acts to learn their causal and temporal dependencies. To train this system, we develop a hybrid corpus that pairs controllable, event-rich simulations with human-annotated rationales and real conversational speech. The GoT framework structures streaming predictions as an evolving graph, enabling a multimodal transformer to forecast the next speech act, generate concise justifications for its decisions, and dynamically refine its reasoning. Experiments on both synthetic and real duplex dialogues show that the framework delivers robust behavior detection, produces interpretable reasoning chains, and establishes a foundation for benchmarking conversational reasoning in full duplex spoken dialogue systems.

27. 【2512.21653】Semantic Codebooks as Effective Priors for Neural Speech Compression

链接https://arxiv.org/abs/2512.21653

作者:Liuyang Bai,Weiyi Lu,Li Guo

类目:ound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:waveform fidelity, allocating bits, linguistic structure, preserve acoustic detail, traditionally optimized

备注

点击查看摘要

Abstract:Speech codecs are traditionally optimized for waveform fidelity, allocating bits to preserve acoustic detail even when much of it can be inferred from linguistic structure. This leads to inefficient compression and suboptimal performance on downstream recognition tasks. We propose SemDAC, a semantic-aware neural audio codec that leverages semantic codebooks as effective priors for speech compression. In SemDAC, the first quantizer in a residual vector quantization (RVQ) stack is distilled from HuBERT features to produce semantic tokens that capture phonetic content, while subsequent quantizers model residual acoustics. A FiLM-conditioned decoder reconstructs audio conditioned on the semantic tokens, improving efficiency in the use of acoustic codebooks. Despite its simplicity, this design proves highly effective: SemDAC outperforms DAC across perceptual metrics and achieves lower WER when running Whisper on reconstructed speech, all while operating at substantially lower bitrates (e.g., 0.95 kbps vs. 2.5 kbps for DAC). These results demonstrate that semantic codebooks provide an effective inductive bias for neural speech compression, producing compact yet recognition-friendly representations.

28. 【2512.21635】Heaven-Sent or Hell-Bent? Benchmarking the Intelligence and Defectiveness of LLM Hallucinations

链接https://arxiv.org/abs/2512.21635

作者:Chengxu Yang,Jingling Yuan,Siqi Cai,Jiawei Jiang,Chuang Hu

类目:Computation and Language (cs.CL)

关键词:large language models, large language, commonly regarded, regarded as errors, Hallucinations

备注: Published as a conference paper at KDD 2026

点击查看摘要

Abstract:Hallucinations in large language models (LLMs) are commonly regarded as errors to be minimized. However, recent perspectives suggest that some hallucinations may encode creative or epistemically valuable content, a dimension that remains underquantified in current literature. Existing hallucination detection methods primarily focus on factual consistency, struggling to handle heterogeneous scientific tasks and balance creativity with accuracy. To address these challenges, we propose HIC-Bench, a novel evaluation framework that categorizes hallucinations into Intelligent Hallucinations (IH) and Defective Hallucinations (DH), enabling systematic investigation of their interplay in LLM creativity. HIC-Bench features three core characteristics: (1) Structured IH/DH Assessment. using a multi-dimensional metric matrix integrating Torrance Tests of Creative Thinking (TTCT) metrics (Originality, Feasibility, Value) with hallucination-specific dimensions (scientific plausibility, factual deviation); (2) Cross-Domain Applicability. spanning ten scientific domains with open-ended innovation tasks; and (3) Dynamic Prompt Optimization. leveraging the Dynamic Hallucination Prompt (DHP) to guide models toward creative and reliable outputs. The evaluation process employs multiple LLM judges, averaging scores to mitigate bias, with human annotators verifying IH/DH classifications. Experimental results reveal a nonlinear relationship between IH and DH, demonstrating that creativity and correctness can be jointly optimized. These insights position IH as a catalyst for creativity and reveal the ability of LLM hallucinations to drive scientific this http URL, the HIC-Bench offers a valuable platform for advancing research into the creative intelligence of LLM hallucinations.

29. 【2512.21625】Rethinking Sample Polarity in Reinforcement Learning with Verifiable Rewards

链接https://arxiv.org/abs/2512.21625

作者:Xinyu Tang,Yuliang Zhan,Zhixun Li,Wayne Xin Zhao,Zhenduo Zhang,Zujie Wen,Zhiqiang Zhang,Jun Zhou

类目:Computation and Language (cs.CL)

关键词:Large reasoning models, verifiable reward, Large reasoning, typically trained, trained using reinforcement

备注

点击查看摘要

Abstract:Large reasoning models (LRMs) are typically trained using reinforcement learning with verifiable reward (RLVR) to enhance their reasoning abilities. In this paradigm, policies are updated using both positive and negative self-generated rollouts, which correspond to distinct sample polarities. In this paper, we provide a systematic investigation into how these sample polarities affect RLVR training dynamics and behaviors. We find that positive samples sharpen existing correct reasoning patterns, while negative samples encourage exploration of new reasoning paths. We further explore how adjusting the advantage values of positive and negative samples at both the sample level and the token level affects RLVR training. Based on these insights, we propose an Adaptive and Asymmetric token-level Advantage shaping method for Policy Optimization, namely A3PO, that more precisely allocates advantage signals to key tokens across different polarities. Experiments across five reasoning benchmarks demonstrate the effectiveness of our approach.

30. 【2512.21580】Gamayun's Path to Multilingual Mastery: Cost-Efficient Training of a 1.5B-Parameter LLM

链接https://arxiv.org/abs/2512.21580

作者:Alexander Podolskiy,Semen Molokov,Timofey Gerasin,Maksim Titov,Alexey Rukhovich,Artem Khrapov,Kirill Morozov,Evgeny Tetin,Constantine Korikov,Pavel Efimov,Polina Lazukova,Yuliya Skripkar,Nikita Okhotnikov,Irina Piontkovskaya,Meng Xiaojun,Zou Xueyi,Zhang Zhenhe

类目:Computation and Language (cs.CL)

关键词:present Gamayun, tokens, language model trained, Gamayun, high-quality English enrichment

备注

点击查看摘要

Abstract:We present Gamayun, a 1.5B-parameter multilingual language model trained entirely from scratch on 2.5T tokens. Designed for efficiency and deployment in resource-constrained environments, Gamayun addresses the lack of research on small non-English-centric LLMs by adopting a novel two-stage pre-training strategy: balanced multilingual training for cross-lingual alignment, followed by high-quality English enrichment to transfer performance gains across languages. Our model supports 12 languages, with special focus on Russian. Despite a significantly smaller training budget than comparable models, Gamayun outperforms LLaMA3.2-1B (9T tokens) on all considered benchmarks, and surpasses Qwen2.5-1.5B (18T tokens) on a wide range of English and multilingual tasks. It matches or exceeds Qwen3 (36T tokens) on most tasks outside advanced STEM, achieving state-of-the-art results in Russian, including the MERA benchmark, among the models of comparable size (1-2B parameters).

31. 【2512.21577】A Unified Definition of Hallucination, Or: It's the World Model, Stupid

链接https://arxiv.org/abs/2512.21577

作者:Emmy Liu,Varun Gangal,Chelsea Zou,Xiaoqi Huang,Michael Yu,Alex Chang,Zhuofu Tao,Sachin Kumar,Steven Y. Feng

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)

关键词:frontier large language, language models today, numerous attempts, attempts to solve, solve the issue

备注

点击查看摘要

Abstract:Despite numerous attempts to solve the issue of hallucination since the inception of neural language models, it remains a problem in even frontier large language models today. Why is this the case? We walk through definitions of hallucination used in the literature from a historical perspective up to the current day, and fold them into a single definition of hallucination, wherein different prior definitions focus on different aspects of our definition. At its core, we argue that hallucination is simply inaccurate (internal) world modeling, in a form where it is observable to the user (e.g., stating a fact which contradicts a knowledge base, or producing a summary which contradicts a known source). By varying the reference world model as well as the knowledge conflict policy (e.g., knowledge base vs. in-context), we arrive at the different existing definitions of hallucination present in the literature. We argue that this unified view is useful because it forces evaluations to make clear their assumed "world" or source of truth, clarifies what should and should not be called hallucination (as opposed to planning or reward/incentive-related errors), and provides a common language to compare benchmarks and mitigation techniques. Building on this definition, we outline plans for a family of benchmarks in which hallucinations are defined as mismatches with synthetic but fully specified world models in different environments, and sketch out how these benchmarks can use such settings to stress-test and improve the world modeling components of language models.

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)

Cite as:
arXiv:2512.21577 [cs.CL]

(or
arXiv:2512.21577v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2512.21577

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
32. 【2512.21567】Beyond Heuristics: A Decision-Theoretic Framework for Agent Memory Management

链接https://arxiv.org/abs/2512.21567

作者:Changzhi Sun,Xiangyu Chen,Jixiang Luo,Dell Zhang,Xuelong Li

类目:Computation and Language (cs.CL)

关键词:large language model, modern large language, External memory, language model, key component

备注

点击查看摘要

Abstract:External memory is a key component of modern large language model (LLM) systems, enabling long-term interaction and personalization. Despite its importance, memory management is still largely driven by hand-designed heuristics, offering little insight into the long-term and uncertain consequences of memory decisions. In practice, choices about what to read or write shape future retrieval and downstream behavior in ways that are difficult to anticipate. We argue that memory management should be viewed as a sequential decision-making problem under uncertainty, where the utility of memory is delayed and dependent on future interactions. To this end, we propose DAM (Decision-theoretic Agent Memory), a decision-theoretic framework that decomposes memory management into immediate information access and hierarchical storage maintenance. Within this architecture, candidate operations are evaluated via value functions and uncertainty estimators, enabling an aggregate policy to arbitrate decisions based on estimated long-term utility and risk. Our contribution is not a new algorithm, but a principled reframing that clarifies the limitations of heuristic approaches and provides a foundation for future research on uncertainty-aware memory systems.

33. 【2512.21551】Human-AI Interaction Alignment: Designing, Evaluating, and Evolving Value-Centered AI For Reciprocal Human-AI Futures

链接https://arxiv.org/abs/2512.21551

作者:Hua Shen,Tiffany Knearem,Divy Thakkar,Pat Pataranutaporn,Anoop Sinha,Yike(Cassandra)Shi,Jenny T. Liang,Lama Ahmad,Tanu Mitra,Brad A. Myers,Yang Li

类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:everyday life underscores, unidirectional alignment models, rapid integration, integration of generative, everyday life

备注: CHI 2026 BiAlign Workshop

点击查看摘要

Abstract:The rapid integration of generative AI into everyday life underscores the need to move beyond unidirectional alignment models that only adapt AI to human values. This workshop focuses on bidirectional human-AI alignment, a dynamic, reciprocal process where humans and AI co-adapt through interaction, evaluation, and value-centered design. Building on our past CHI 2025 BiAlign SIG and ICLR 2025 Workshop, this workshop will bring together interdisciplinary researchers from HCI, AI, social sciences and more domains to advance value-centered AI and reciprocal human-AI collaboration. We focus on embedding human and societal values into alignment research, emphasizing not only steering AI toward human values but also enabling humans to critically engage with and evolve alongside AI systems. Through talks, interdisciplinary discussions, and collaborative activities, participants will explore methods for interactive alignment, frameworks for societal impact evaluation, and strategies for alignment in dynamic contexts. This workshop aims to bridge the disciplines' gaps and establish a shared agenda for responsible, reciprocal human-AI futures.

34. 【2512.21515】Perplexity-Aware Data Scaling Law: Perplexity Landscapes Predict Performance for Continual Pre-training

链接https://arxiv.org/abs/2512.21515

作者:Lei Liu,Hao Zhu,Yue Shen,Zhixuan Chu,Jian Wang,Jinjie Gu,Kui Ren

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Continual Pre-training, adapting foundation models, adapting foundation, Continual, domain-specific applications

备注

点击查看摘要

Abstract:Continual Pre-training (CPT) serves as a fundamental approach for adapting foundation models to domain-specific applications. Scaling laws for pre-training define a power-law relationship between dataset size and the test loss of an LLM. However, the marginal gains from simply increasing data for CPT diminish rapidly, yielding suboptimal data utilization and inefficient training. To address this challenge, we propose a novel perplexity-aware data scaling law to establish a predictive relationship between the perplexity landscape of domain-specific data and the test loss. Our approach leverages the perplexity derived from the pre-trained model on domain data as a proxy for estimating the knowledge gap, effectively quantifying the informational perplexity landscape of candidate training samples. By fitting this scaling law across diverse perplexity regimes, we enable adaptive selection of high-utility data subsets, prioritizing content that maximizes knowledge absorption while minimizing redundancy and noise. Extensive experiments demonstrate that our method consistently identifies near-optimal training subsets and achieves superior performance on both medical and general-domain benchmarks.

35. 【2512.21506】MotionTeller: Multi-modal Integration of Wearable Time-Series with LLMs for Health and Behavioral Understanding

链接https://arxiv.org/abs/2512.21506

作者:Aiwei Zhang,Arvind Pillai,Andrew Campbell,Nicholas C. Jacobson

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

关键词:key challenge remains, raw physiological signals, generate natural language, movement data collected, minute-level movement data

备注

点击查看摘要

Abstract:As wearable sensing becomes increasingly pervasive, a key challenge remains: how can we generate natural language summaries from raw physiological signals such as actigraphy - minute-level movement data collected via accelerometers? In this work, we introduce MotionTeller, a generative framework that natively integrates minute-level wearable activity data with large language models (LLMs). MotionTeller combines a pretrained actigraphy encoder with a lightweight projection module that maps behavioral embeddings into the token space of a frozen decoder-only LLM, enabling free-text, autoregressive generation of daily behavioral summaries. We construct a novel dataset of 54383 (actigraphy, text) pairs derived from real-world NHANES recordings, and train the model using cross-entropy loss with supervision only on the language tokens. MotionTeller achieves high semantic fidelity (BERTScore-F1 = 0.924) and lexical accuracy (ROUGE-1 = 0.722), outperforming prompt-based baselines by 7 percent in ROUGE-1. The average training loss converges to 0.38 by epoch 15, indicating stable optimization. Qualitative analysis confirms that MotionTeller captures circadian structure and behavioral transitions, while PCA plots reveal enhanced cluster alignment in embedding space post-training. Together, these results position MotionTeller as a scalable, interpretable system for transforming wearable sensor data into fluent, human-centered descriptions, introducing new pathways for behavioral monitoring, clinical review, and personalized health interventions.

36. 【2512.21494】Oogiri-Master: Benchmarking Humor Understanding via Oogiri

链接https://arxiv.org/abs/2512.21494

作者:Soichiro Murakami,Hidetaka Kamigaito,Hiroya Takamura,Manabu Okumura

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:human-like creative thinking, large language models, Japanese creative response, salient testbed, testbed for human-like

备注

点击查看摘要

Abstract:Humor is a salient testbed for human-like creative thinking in large language models (LLMs). We study humor using the Japanese creative response game Oogiri, in which participants produce witty responses to a given prompt, and ask the following research question: What makes such responses funny to humans? Previous work has offered only limited reliable means to answer this question. Existing datasets contain few candidate responses per prompt, expose popularity signals during ratings, and lack objective and comparable metrics for funniness. Thus, we introduce Oogiri-Master and Oogiri-Corpus, which are a benchmark and dataset designed to enable rigorous evaluation of humor understanding in LLMs. Each prompt is paired with approximately 100 diverse candidate responses, and funniness is rated independently by approximately 100 human judges without access to others' ratings, reducing popularity bias and enabling robust aggregation. Using Oogiri-Corpus, we conduct a quantitative analysis of the linguistic factors associated with funniness, such as text length, ambiguity, and incongruity resolution, and derive objective metrics for predicting human judgments. Subsequently, we benchmark a range of LLMs and human baselines in Oogiri-Master, demonstrating that state-of-the-art models approach human performance and that insight-augmented prompting improves the model performance. Our results provide a principled basis for evaluating and advancing humor understanding in LLMs.

37. 【2512.21439】Morality is Contextual: Learning Interpretable Moral Contexts from Human Data with Probabilistic Clustering and Large Language Models

链接https://arxiv.org/abs/2512.21439

作者:Geoffroy Morlat,Marceau Nahon,Augustin Chartouny,Raja Chatila,Ismael T. Freire,Mehdi Khamassi

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:human moral evaluations, Textual Human inputs, Moral, Human, Moral Evaluation

备注: 11 pages, 5 figures, +24 pages of Appendix

点击查看摘要

Abstract:Moral actions are judged not only by their outcomes but by the context in which they occur. We present COMETH (Contextual Organization of Moral Evaluation from Textual Human inputs), a framework that integrates a probabilistic context learner with LLM-based semantic abstraction and human moral evaluations to model how context shapes the acceptability of ambiguous actions. We curate an empirically grounded dataset of 300 scenarios across six core actions (violating Do not kill, Do not deceive, and Do not break the law) and collect ternary judgments (Blame/Neutral/Support) from N=101 participants. A preprocessing pipeline standardizes actions via an LLM filter and MiniLM embeddings with K-means, producing robust, reproducible core-action clusters. COMETH then learns action-specific moral contexts by clustering scenarios online from human judgment distributions using principled divergence criteria. To generalize and explain predictions, a Generalization module extracts concise, non-evaluative binary contextual features and learns feature weights in a transparent likelihood-based model. Empirically, COMETH roughly doubles alignment with majority human judgments relative to end-to-end LLM prompting (approx. 60% vs. approx. 30% on average), while revealing which contextual features drive its predictions. The contributions are: (i) an empirically grounded moral-context dataset, (ii) a reproducible pipeline combining human judgments with model-based context learning and LLM semantics, and (iii) an interpretable alternative to end-to-end LLMs for context-sensitive moral prediction and explanation.

38. 【2512.21422】aching People LLM's Errors and Getting it Right

链接https://arxiv.org/abs/2512.21422

作者:Nathan Stringham,Fateme Hashemi Chaleshtori,Xinyuan Yan,Zhichao Xu,Bei Wang,Ana Marasović

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:People use large, large language models, failure patterns, LLM, People

备注

点击查看摘要

Abstract:People use large language models (LLMs) when they should not. This is partly because they see LLMs compose poems and answer intricate questions, so they understandably, but incorrectly, assume LLMs won't stumble on basic tasks like simple arithmetic. Prior work has tried to address this by clustering instance embeddings into regions where an LLM is likely to fail and automatically describing patterns in these regions. The found failure patterns are taught to users to mitigate their overreliance. Yet, this approach has not fully succeeded. In this analysis paper, we aim to understand why. We first examine whether the negative result stems from the absence of failure patterns. We group instances in two datasets by their meta-labels and evaluate an LLM's predictions on these groups. We then define criteria to flag groups that are sizable and where the LLM is error-prone, and find meta-label groups that meet these criteria. Their meta-labels are the LLM's failure patterns that could be taught to users, so they do exist. We next test whether prompting and embedding-based approaches can surface these known failures. Without this, users cannot be taught about them to reduce their overreliance. We find mixed results across methods, which could explain the negative result. Finally, we revisit the final metric that measures teaching effectiveness. We propose to assess a user's ability to effectively use the given failure patterns to anticipate when an LLM is error-prone. A user study shows a positive effect from teaching with this metric, unlike the human-AI team accuracy. Our findings show that teaching failure patterns could be a viable approach to mitigating overreliance, but success depends on better automated failure-discovery methods and using metrics like ours.

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Cite as:
arXiv:2512.21422 [cs.CL]

(or
arXiv:2512.21422v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2512.21422

Focus to learn more

              arXiv-issued DOI via DataCite</p>
39. 【2512.21345】Query Carefully: Detecting the Unanswerables in Text-to-SQL Tasks

链接https://arxiv.org/abs/2512.21345

作者:Jasmin Saxer(1),Isabella Maria Aigner(2),Luise Linzmeier(3),Andreas Weiler(1),Kurt Stockinger(1) ((1) Institute of Computer Science, Zurich University of Applied Sciences, Winterthur, Switzerland, (2) Institute of Medical Virology, University of Zurich, Zurich, Switzerland, (3) Department of Gastroenterology and Hepatology, University Hospital Zurich, University of Zurich, Zurich, Switzerland)

类目:Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:natural language, experts to interact, interact with relational, relational databases, databases using natural

备注: Accepted to the HC@AIxIA + HYDRA 2025

点击查看摘要

Abstract:Text-to-SQL systems allow non-SQL experts to interact with relational databases using natural language. However, their tendency to generate executable SQL for ambiguous, out-of-scope, or unanswerable queries introduces a hidden risk, as outputs may be misinterpreted as correct. This risk is especially serious in biomedical contexts, where precision is critical. We therefore present Query Carefully, a pipeline that integrates LLM-based SQL generation with explicit detection and handling of unanswerable inputs. Building on the OncoMX component of ScienceBenchmark, we construct OncoMX-NAQ (No-Answer Questions), a set of 80 no-answer questions spanning 8 categories (non-SQL, out-of-schema/domain, and multiple ambiguity types). Our approach employs llama3.3:70b with schema-aware prompts, explicit No-Answer Rules (NAR), and few-shot examples drawn from both answerable and unanswerable questions. We evaluate SQL exact match, result accuracy, and unanswerable-detection accuracy. On the OncoMX dev split, few-shot prompting with answerable examples increases result accuracy, and adding unanswerable examples does not degrade performance. On OncoMX-NAQ, balanced prompting achieves the highest unanswerable-detection accuracy (0.8), with near-perfect results for structurally defined categories (non-SQL, missing columns, out-of-domain) but persistent challenges for missing-value queries (0.5) and column ambiguity (0.3). A lightweight user interface surfaces interim SQL, execution results, and abstentions, supporting transparent and reliable text-to-SQL in biomedical applications.

40. 【2505.10282】From Questions to Clinical Recommendations: Large Language Models Driving Evidence-Based Clinical Decision Making

链接https://arxiv.org/abs/2505.10282

作者:Dubai Li,Nan Jiang,Kangping Huang,Ruiqi Tu,Shuyu Ouyang,Huayu Yu,Lin Qiao,Chen Yu,Tianshu Zhou,Danyang Tong,Qian Wang,Mengtao Li,Xiaofeng Zeng,Yu Tian,Xinping Tian,Jingsong Li

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:reliable scientific foundations, derived from rigorous, data analysis, Clinical, Quicker

备注

点击查看摘要

Abstract:Clinical evidence, derived from rigorous research and data analysis, provides healthcare professionals with reliable scientific foundations for informed decision-making. Integrating clinical evidence into real-time practice is challenging due to the enormous workload, complex professional processes, and time constraints. This highlights the need for tools that automate evidence synthesis to support more efficient and accurate decision making in clinical settings. This study introduces Quicker, an evidence-based clinical decision support system powered by large language models (LLMs), designed to automate evidence synthesis and generate clinical recommendations modeled after standard clinical guideline development processes. Quicker implements a fully automated chain that covers all phases, from questions to clinical recommendations, and further enables customized decision-making through integrated tools and interactive user interfaces. To evaluate Quicker's capabilities, we developed the Q2CRBench-3 benchmark dataset, based on clinical guideline development records for three different diseases. Experimental results highlighted Quicker's strong performance, with fine-grained question decomposition tailored to user preferences, retrieval sensitivities comparable to human experts, and literature screening performance approaching comprehensive inclusion of relevant studies. In addition, Quicker-assisted evidence assessment effectively supported human reviewers, while Quicker's recommendations were more comprehensive and logically coherent than those of clinicians. In system-level testing, collaboration between a single reviewer and Quicker reduced the time required for recommendation development to 20-40 minutes. In general, our findings affirm the potential of Quicker to help physicians make quicker and more reliable evidence-based clinical decisions.

信息检索

1. 【2512.21921】AutoPP: Towards Automated Product Poster Generation and Optimization

链接https://arxiv.org/abs/2512.21921

作者:Jiahao Fan,Yuxin Qin,Wei Feng,Yanyin Chen,Yaoyu Li,Ao Ma,Yixiu Li,Li Zhuang,Haoyi Bian,Zheng Zhang,Jingjing Lv,Junjie Shen,Ching Law

类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:capture customer attention, blend striking visuals, posters blend striking, Product posters blend, product poster generation

备注: Accepted to AAAI 2026

点击查看摘要

Abstract:Product posters blend striking visuals with informative text to highlight the product and capture customer attention. However, crafting appealing posters and manually optimizing them based on online performance is laborious and resource-consuming. To address this, we introduce AutoPP, an automated pipeline for product poster generation and optimization that eliminates the need for human intervention. Specifically, the generator, relying solely on basic product information, first uses a unified design module to integrate the three key elements of a poster (background, text, and layout) into a cohesive output. Then, an element rendering module encodes these elements into condition tokens, efficiently and controllably generating the product poster. Based on the generated poster, the optimizer enhances its Click-Through Rate (CTR) by leveraging online feedback. It systematically replaces elements to gather fine-grained CTR comparisons and utilizes Isolated Direct Preference Optimization (IDPO) to attribute CTR gains to isolated elements. Our work is supported by AutoPP1M, the largest dataset specifically designed for product poster generation and optimization, which contains one million high-quality posters and feedback collected from over one million users. Experiments demonstrate that AutoPP achieves state-of-the-art results in both offline and online settings. Our code and dataset are publicly available at: this https URL

2. 【2512.21863】Frozen LVLMs for Micro-Video Recommendation: A Systematic Study of Feature Extraction and Fusion

链接https://arxiv.org/abs/2512.21863

作者:Huatuan Sun,Yunshan Ma,Changguang Wu,Yanxin Zhang,Pengfei Wang,Xiaoyu Du

类目:Information Retrieval (cs.IR); Multimedia (cs.MM)

关键词:Large Video Language, Video Language Models, Video Language, Frozen Large Video, strong multimodal understanding

备注: 10 pages, 4 figures

点击查看摘要

Abstract:Frozen Large Video Language Models (LVLMs) are increasingly employed in micro-video recommendation due to their strong multimodal understanding. However, their integration lacks systematic empirical evaluation: practitioners typically deploy LVLMs as fixed black-box feature extractors without systematically comparing alternative representation strategies. To address this gap, we present the first systematic empirical study along two key design dimensions: (i) integration strategies with ID embeddings, specifically replacement versus fusion, and (ii) feature extraction paradigms, comparing LVLM-generated captions with intermediate decoder hidden states. Extensive experiments on representative LVLMs reveal three key principles: (1) intermediate hidden states consistently outperform caption-based representations, as natural-language summarization inevitably discards fine-grained visual semantics crucial for recommendation; (2) ID embeddings capture irreplaceable collaborative signals, rendering fusion strictly superior to replacement; and (3) the effectiveness of intermediate decoder features varies significantly across layers. Guided by these insights, we propose the Dual Feature Fusion (DFF) Framework, a lightweight and plug-and-play approach that adaptively fuses multi-layer representations from frozen LVLMs with item ID embeddings. DFF achieves state-of-the-art performance on two real-world micro-video recommendation benchmarks, consistently outperforming strong baselines and providing a principled approach to integrating off-the-shelf large vision-language models into micro-video recommender systems.

3. 【2512.21799】KG20C KG20C-QA: Scholarly Knowledge Graph Benchmarks for Link Prediction and Question Answering

链接https://arxiv.org/abs/2512.21799

作者:Hung-Nghiep Tran,Atsuhiro Takasu

类目:Information Retrieval (cs.IR)

关键词:advancing question answering, Microsoft Academic Graph, Microsoft Academic, scholarly data, graph

备注: Extracted and extended from the first author's PhD thesis titled "Multi-Relational Embedding for Knowledge Graph Representation and Analysis"

点击查看摘要

Abstract:In this paper, we present KG20C and KG20C-QA, two curated datasets for advancing question answering (QA) research on scholarly data. KG20C is a high-quality scholarly knowledge graph constructed from the Microsoft Academic Graph through targeted selection of venues, quality-based filtering, and schema definition. Although KG20C has been available online in non-peer-reviewed sources such as GitHub repository, this paper provides the first formal, peer-reviewed description of the dataset, including clear documentation of its construction and specifications. KG20C-QA is built upon KG20C to support QA tasks on scholarly data. We define a set of QA templates that convert graph triples into natural language question--answer pairs, producing a benchmark that can be used both with graph-based models such as knowledge graph embeddings and with text-based models such as large language models. We benchmark standard knowledge graph embedding methods on KG20C-QA, analyze performance across relation types, and provide reproducible evaluation protocols. By officially releasing these datasets with thorough documentation, we aim to contribute a reusable, extensible resource for the research community, enabling future work in QA, reasoning, and knowledge-driven applications in the scholarly domain. The full datasets will be released at this https URL upon paper publication.

4. 【2512.21595】LLM-I2I: Boost Your Small Item2Item Recommendation Model with Large Language Model

链接https://arxiv.org/abs/2512.21595

作者:Yinfu Feng,Yanjing Wu,Rong Xiao,Xiaoyi Zen

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:real-world systems due, real-time capabilities, Large Language Models, high recommendation quality, leveraging Large Language

备注

点击查看摘要

Abstract:Item-to-Item (I2I) recommendation models are widely used in real-world systems due to their scalability, real-time capabilities, and high recommendation quality. Research to enhance I2I performance focuses on two directions: 1) model-centric approaches, which adopt deeper architectures but risk increased computational costs and deployment complexity, and 2) data-centric methods, which refine training data without altering models, offering cost-effectiveness but struggling with data sparsity and noise. To address these challenges, we propose LLM-I2I, a data-centric framework leveraging Large Language Models (LLMs) to mitigate data quality issues. LLM-I2I includes (1) an LLM-based generator that synthesizes user-item interactions for long-tail items, alleviating data sparsity, and (2) an LLM-based discriminator that filters noisy interactions from real and synthetic data. The refined data is then fused to train I2I models. Evaluated on industry (AEDS) and academic (ARD) datasets, LLM-I2I consistently improves recommendation accuracy, particularly for long-tail items. Deployed on a large-scale cross-border e-commerce platform, it boosts recall number (RN) by 6.02% and gross merchandise value (GMV) by 1.22% over existing I2I models. This work highlights the potential of LLMs in enhancing data-centric recommendation systems without modifying model architectures.

5. 【2512.21543】CEMG: Collaborative-Enhanced Multimodal Generative Recommendation

链接https://arxiv.org/abs/2512.21543

作者:Yuzhen Lin,Hongyi Chen,Xuanjing Chen,Shaowen Wang,Ivonne Xu,Dongming Jiang

类目:Information Retrieval (cs.IR)

关键词:Generative recommendation, Multimodal Generative Recommendation, key challenges, Generative Recommendation framework, superficial integration

备注

点击查看摘要

Abstract:Generative recommendation models often struggle with two key challenges: (1) the superficial integration of collaborative signals, and (2) the decoupled fusion of multimodal features. These limitations hinder the creation of a truly holistic item representation. To overcome this, we propose CEMG, a novel Collaborative-Enhaned Multimodal Generative Recommendation framework. Our approach features a Multimodal Fusion Layer that dynamically integrates visual and textual features under the guidance of collaborative signals. Subsequently, a Unified Modality Tokenization stage employs a Residual Quantization VAE (RQ-VAE) to convert this fused representation into discrete semantic codes. Finally, in the End-to-End Generative Recommendation stage, a large language model is fine-tuned to autoregressively generate these item codes. Extensive experiments demonstrate that CEMG significantly outperforms state-of-the-art baselines.

6. 【2512.21526】Selective LLM-Guided Regularization for Enhancing Recommendation Models

链接https://arxiv.org/abs/2512.21526

作者:Shanglin Yang,Zhan Shi

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:strong reasoning capabilities, provide rich semantic, rich semantic priors, promising auxiliary signals, language models provide

备注

点击查看摘要

Abstract:Large language models provide rich semantic priors and strong reasoning capabilities, making them promising auxiliary signals for recommendation. However, prevailing approaches either deploy LLMs as standalone recommender or apply global knowledge distillation, both of which suffer from inherent drawbacks. Standalone LLM recommender are costly, biased, and unreliable across large regions of the user item space, while global distillation forces the downstream model to imitate LLM predictions even when such guidance is inaccurate. Meanwhile, recent studies show that LLMs excel particularly in re-ranking and challenging scenarios, rather than uniformly across all this http URL introduce Selective LLM Guided Regularization, a model-agnostic and computation efficient framework that activates LLM based pairwise ranking supervision only when a trainable gating mechanism informing by user history length, item popularity, and model uncertainty predicts the LLM to be reliable. All LLM scoring is performed offline, transferring knowledge without increasing inference cost. Experiments across multiple datasets show that this selective strategy consistently improves overall accuracy and yields substantial gains in cold start and long tail regimes, outperforming global distillation baselines.

7. 【2512.21501】Dynamic Cooperative Strategies in Search Engine Advertising Market: With and Without Retail Competition

链接https://arxiv.org/abs/2512.21501

作者:Huiran Li,Qiucheng Li,Baozhu Feng

类目:Computer Science and Game Theory (cs.GT); Information Retrieval (cs.IR); Systems and Control (eess.SY)

关键词:search engine advertising, intense and multifaceted, critical factor, cooperative advertising, search engine

备注: 60 pages, 17 figures,6 tables

点击查看摘要

Abstract:In search engine advertising (SEA) market, where competition among retailers is intense and multifaceted, channel coordination between retailers and manufacturers emerges as a critical factor, which significantly influences the effectiveness of advertising strategies. This research attempts to provide managerial guidelines for cooperative advertising in the SEA context by modeling two cooperative advertising decision scenarios. Scenario I defines a simple cooperative channel consisting of one manufacturer and one retailer. In Scenario II, we consider a more general setting where there is an independent retailer who competes with the Manufacturer-Retailer alliance in Scenario I. We propose a novel cooperative advertising optimization model, wherein a manufacturer can advertise product directly through SEA campaigns and indirectly by subsidizing its retailer. To highlight the distinctive features of SEA, our model incorporates dynamic quality scores and focuses on a finite time horizon. In each scenario, we provide a feasible equilibrium solution of optimal policies for all members. Subsequently, we conduct numerical experiments to perform sensitivity analysis for both the quality score and gross margin. Additionally, we explore the impact of the initial market share of the competing retailer in Scenario II. Finally, we investigate how retail competition affects the cooperative alliance's optimal strategy and channel performance. Our identified properties derived from the equilibrium and numerical analyses offer crucial insights for participants engaged in cooperative advertising within the SEA market.

计算机视觉

1. 【2512.22120】See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning

链接https://arxiv.org/abs/2512.22120

作者:Shuoshuo Zhang,Yizhen Zhang,Jingjing Fu,Lei Song,Jiang Bian,Yujiu Yang,Rui Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large vision-language models, high inference-time cost, incur high inference-time, Bi-directional Perceptual Shaping, Large vision-language

备注

点击查看摘要

Abstract:Large vision-language models (VLMs) often benefit from intermediate visual cues, either injected via external tools or generated as latent visual tokens during reasoning, but these mechanisms still overlook fine-grained visual evidence (e.g., polylines in charts), generalize poorly across domains, and incur high inference-time cost. In this paper, we propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training. BiPS first applies a KL-consistency constraint between the original image and an evidence-preserving view that keeps only question-relevant regions, encouraging coarse but complete coverage of supporting pixels. It then applies a KL-separation constraint between the original and an evidence-ablated view where critical pixels are masked so the image no longer supports the original answer, discouraging text-only shortcuts (i.e., answering from text alone) and enforcing fine-grained visual reliance. Across eight benchmarks, BiPS boosts Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization to unseen datasets and image types.

2. 【2512.22118】ProEdit: Inversion-based Editing From Prompts Done Right

链接https://arxiv.org/abs/2512.22118

作者:Zhi Ouyang,Dian Zheng,Xiao-Ming Wu,Jian-Jian Jiang,Kun-Yu Lin,Jingke Meng,Wei-Shi Zheng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Inversion-based visual editing, Inversion-based visual, user instructions, effective and training-free, based on user

备注: Equal contributions from first two authors. Project page: [this https URL](https://isee-laboratory.github.io/ProEdit/) Code: [this https URL](https://github.com/iSEE-Laboratory/ProEdit)

点击查看摘要

Abstract:Inversion-based visual editing provides an effective and training-free way to edit an image or a video based on user instructions. Existing methods typically inject source image information during the sampling process to maintain editing consistency. However, this sampling strategy overly relies on source information, which negatively affects the edits in the target image (e.g., failing to change the subject's atributes like pose, number, or color as instructed). In this work, we propose ProEdit to address this issue both in the attention and the latent aspects. In the attention aspect, we introduce KV-mix, which mixes KV features of the source and the target in the edited region, mitigating the influence of the source image on the editing region while maintaining background consistency. In the latent aspect, we propose Latents-Shift, which perturbs the edited region of the source latent, eliminating the influence of the inverted latent on the sampling. Extensive experiments on several image and video editing benchmarks demonstrate that our method achieves SOTA performance. In addition, our design is plug-and-play, which can be seamlessly integrated into existing inversion and editing methods, such as RF-Solver, FireFlow and UniEdit.

3. 【2512.22105】Learning Association via Track-Detection Matching for Multi-Object Tracking

链接https://arxiv.org/abs/2512.22105

作者:Momir Adžemović

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Multi-object tracking aims, maintain object identities, Multi-object tracking, tracking aims, aims to maintain

备注: 14 pages (+4 for references), 8 tables, 4 figures

点击查看摘要

Abstract:Multi-object tracking aims to maintain object identities over time by associating detections across video frames. Two dominant paradigms exist in literature: tracking-by-detection methods, which are computationally efficient but rely on handcrafted association heuristics, and end-to-end approaches, which learn association from data at the cost of higher computational complexity. We propose Track-Detection Link Prediction (TDLP), a tracking-by-detection method that performs per-frame association via link prediction between tracks and detections, i.e., by predicting the correct continuation of each track at every frame. TDLP is architecturally designed primarily for geometric features such as bounding boxes, while optionally incorporating additional cues, including pose and appearance. Unlike heuristic-based methods, TDLP learns association directly from data without handcrafted rules, while remaining modular and computationally efficient compared to end-to-end trackers. Extensive experiments on multiple benchmarks demonstrate that TDLP consistently surpasses state-of-the-art performance across both tracking-by-detection and end-to-end methods. Finally, we provide a detailed analysis comparing link prediction with metric learning-based association and show that link prediction is more effective, particularly when handling heterogeneous features such as detection bounding boxes. Our code is available at \href{this https URL}{this https URL}.

4. 【2512.22096】Yume-1.5: A Text-Controlled Interactive World Generation Model

链接https://arxiv.org/abs/2512.22096

作者:Xiaofeng Mao,Zhen Li,Chuanhao Li,Xiaojie Xu,Kaining Ying,Tong He,Jiangmiao Pang,Yu Qiao,Kaipeng Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent approaches, approaches have demonstrated, demonstrated the promise, diffusion models, Recent

备注

点击查看摘要

Abstract:Recent approaches have demonstrated the promise of using diffusion models to generate interactive and explorable worlds. However, most of these methods face critical challenges such as excessively large parameter sizes, reliance on lengthy inference steps, and rapidly growing historical context, which severely limit real-time performance and lack text-controlled generation capabilities. To address these challenges, we propose \method, a novel framework designed to generate realistic, interactive, and continuous worlds from a single image or text prompt. \method achieves this through a carefully designed framework that supports keyboard-based exploration of the generated worlds. The framework comprises three core components: (1) a long-video generation framework integrating unified context compression with linear attention; (2) a real-time streaming acceleration strategy powered by bidirectional attention distillation and an enhanced text embedding scheme; (3) a text-controlled method for generating world events. We have provided the codebase in the supplementary material.

5. 【2512.22065】StreamAvatar: Streaming Diffusion Models for Real-Time Interactive Human Avatars

链接https://arxiv.org/abs/2512.22065

作者:Zhiyao Sun,Ziqiao Peng,Yifeng Ma,Yi Chen,Zhengguang Zhou,Zixiang Zhou,Guozhen Zhang,Youliang Zhang,Yuan Zhou,Qinglin Lu,Yong-Jin Liu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

关键词:digital human research, interactive avatars represent, represent a critical, critical yet challenging, challenging goal

备注: Project page: [this https URL](https://streamavatar.github.io)

点击查看摘要

Abstract:Real-time, streaming interactive avatars represent a critical yet challenging goal in digital human research. Although diffusion-based human avatar generation methods achieve remarkable success, their non-causal architecture and high computational costs make them unsuitable for streaming. Moreover, existing interactive approaches are typically limited to head-and-shoulder region, limiting their ability to produce gestures and body motions. To address these challenges, we propose a two-stage autoregressive adaptation and acceleration framework that applies autoregressive distillation and adversarial refinement to adapt a high-fidelity human video diffusion model for real-time, interactive streaming. To ensure long-term stability and consistency, we introduce three key components: a Reference Sink, a Reference-Anchored Positional Re-encoding (RAPR) strategy, and a Consistency-Aware Discriminator. Building on this framework, we develop a one-shot, interactive, human avatar model capable of generating both natural talking and listening behaviors with coherent gestures. Extensive experiments demonstrate that our method achieves state-of-the-art performance, surpassing existing approaches in generation quality, real-time efficiency, and interaction naturalness. Project page: this https URL .

6. 【2512.22047】MAI-UI Technical Report: Real-World Centric Foundation GUI Agents

链接https://arxiv.org/abs/2512.22047

作者:Hanzhang Zhou,Xu Zhang,Panrong Tong,Jianan Zhang,Liangyu Chen,Quyu Kong,Chenglin Cai,Chen Liu,Yue Wang,Jingren Zhou,Steven Hoi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:GUI agents, foundation GUI agents, GUI agents spanning, generation of human-computer, GUI

备注

点击查看摘要

Abstract:The development of GUI agents could revolutionize the next generation of human-computer interaction. Motivated by this vision, we present MAI-UI, a family of foundation GUI agents spanning the full spectrum of sizes, including 2B, 8B, 32B, and 235B-A22B variants. We identify four key challenges to realistic deployment: the lack of native agent-user interaction, the limits of UI-only operation, the absence of a practical deployment architecture, and brittleness in dynamic environments. MAI-UI addresses these issues with a unified methodology: a self-evolving data pipeline that expands the navigation data to include user interaction and MCP tool calls, a native device-cloud collaboration system routes execution by task state, and an online RL framework with advanced optimizations to scale parallel environments and context length. MAI-UI establishes new state-of-the-art across GUI grounding and mobile navigation. On grounding benchmarks, it reaches 73.5% on ScreenSpot-Pro, 91.3% on MMBench GUI L2, 70.9% on OSWorld-G, and 49.2% on UI-Vision, surpassing Gemini-3-Pro and Seed1.8 on ScreenSpot-Pro. On mobile GUI navigation, it sets a new SOTA of 76.7% on AndroidWorld, surpassing UI-Tars-2, Gemini-2.5-Pro and Seed1.8. On MobileWorld, MAI-UI obtains 41.7% success rate, significantly outperforming end-to-end GUI models and competitive with Gemini-3-Pro based agentic frameworks. Our online RL experiments show significant gains from scaling parallel environments from 32 to 512 (+5.2 points) and increasing environment step budget from 15 to 50 (+4.3 points). Finally, the native device-cloud collaboration system improves on-device performance by 33%, reduces cloud model calls by over 40%, and preserves user privacy.

7. 【2512.22046】Backdoor Attacks on Prompt-Driven Video Segmentation Foundation Models

链接https://arxiv.org/abs/2512.22046

作者:Zongmin Zhang,Zhen Sun,Yifan Liao,Wenhan Dong,Xinlei He,Xingshuo Han,Shengmin Xu,Xinyi Huang

类目:Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

关键词:Segmentation Foundation Models, Video Segmentation Foundation, Foundation Models, Prompt-driven Video Segmentation, Prompt-driven Video

备注

点击查看摘要

Abstract:Prompt-driven Video Segmentation Foundation Models (VSFMs) such as SAM2 are increasingly deployed in applications like autonomous driving and digital pathology, raising concerns about backdoor threats. Surprisingly, we find that directly transferring classic backdoor attacks (e.g., BadNet) to VSFMs is almost ineffective, with ASR below 5\%. To understand this, we study encoder gradients and attention maps and observe that conventional training keeps gradients for clean and triggered samples largely aligned, while attention still focuses on the true object, preventing the encoder from learning a distinct trigger-related representation. To address this challenge, we propose BadVSFM, the first backdoor framework tailored to prompt-driven VSFMs. BadVSFM uses a two-stage strategy: (1) steer the image encoder so triggered frames map to a designated target embedding while clean frames remain aligned with a clean reference encoder; (2) train the mask decoder so that, across prompt types, triggered frame-prompt pairs produce a shared target mask, while clean outputs stay close to a reference decoder. Extensive experiments on two datasets and five VSFMs show that BadVSFM achieves strong, controllable backdoor effects under diverse triggers and prompts while preserving clean segmentation quality. Ablations over losses, stages, targets, trigger settings, and poisoning rates demonstrate robustness to reasonable hyperparameter changes and confirm the necessity of the two-stage design. Finally, gradient-conflict analysis and attention visualizations show that BadVSFM separates triggered and clean representations and shifts attention to trigger regions, while four representative defenses remain largely ineffective, revealing an underexplored vulnerability in current VSFMs.

8. 【2512.22027】Patch-Discontinuity Mining for Generalized Deepfake Detection

链接https://arxiv.org/abs/2512.22027

作者:Huanhuan Yuan,Yang Ping,Zhengqin Xu,Junyi Cao,Shuai Jia,Chao Ma

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:generative artificial intelligence, highly realistic fake, fake facial images, realistic fake facial, posing serious threats

备注: Our paper was accepted by the IEEE Transactions on Multimedia

点击查看摘要

Abstract:The rapid advancement of generative artificial intelligence has enabled the creation of highly realistic fake facial images, posing serious threats to personal privacy and the integrity of online information. Existing deepfake detection methods often rely on handcrafted forensic cues and complex architectures, achieving strong performance in intra-domain settings but suffering significant degradation when confronted with unseen forgery patterns. In this paper, we propose GenDF, a simple yet effective framework that transfers a powerful large-scale vision model to the deepfake detection task with a compact and neat network design. GenDF incorporates deepfake-specific representation learning to capture discriminative patterns between real and fake facial images, feature space redistribution to mitigate distribution mismatch, and a classification-invariant feature augmentation strategy to enhance generalization without introducing additional trainable parameters. Extensive experiments demonstrate that GenDF achieves state-of-the-art generalization performance in cross-domain and cross-manipulation settings while requiring only 0.28M trainable parameters, validating the effectiveness and efficiency of the proposed framework.

9. 【2512.22016】SketchPlay: Intuitive Creation of Physically Realistic VR Content with Gesture-Driven Sketching

链接https://arxiv.org/abs/2512.22016

作者:Xiangwen Zhang,Xiaowei Dai,Runnan Chen,Xiaoming Chen,Zeke Zexi Hu

类目:Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)

关键词:Creating physically realistic, requires complex modeling, complex modeling tools, physically realistic content, Creating physically

备注

点击查看摘要

Abstract:Creating physically realistic content in VR often requires complex modeling tools or predefined 3D models, textures, and animations, which present significant barriers for non-expert users. In this paper, we propose SketchPlay, a novel VR interaction framework that transforms humans' air-drawn sketches and gestures into dynamic, physically realistic scenes, making content creation intuitive and playful like drawing. Specifically, sketches capture the structure and spatial arrangement of objects and scenes, while gestures convey physical cues such as velocity, direction, and force that define movement and behavior. By combining these complementary forms of input, SketchPlay captures both the structure and dynamics of user-created content, enabling the generation of a wide range of complex physical phenomena, such as rigid body motion, elastic deformation, and cloth dynamics. Experimental results demonstrate that, compared to traditional text-driven methods, SketchPlay offers significant advantages in expressiveness, and user experience. By providing an intuitive and engaging creation process, SketchPlay lowers the entry barrier for non-expert users and shows strong potential for applications in education, art, and immersive storytelling.

10. 【2512.22010】LongFly: Long-Horizon UAV Vision-and-Language Navigation with Spatiotemporal Context Integration

链接https://arxiv.org/abs/2512.22010

作者:Wen Jiang,Li Wang,Kangyao Huang,Wei Fan,Jinyuan Liu,Shaoyu Liu,Hongwei Duan,Bin Xu,Xiangyang Ji

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Unmanned aerial vehicles, high information density, Unmanned aerial, long-horizon UAV VLN, UAV VLN

备注

点击查看摘要

Abstract:Unmanned aerial vehicles (UAVs) are crucial tools for post-disaster search and rescue, facing challenges such as high information density, rapid changes in viewpoint, and dynamic structures, especially in long-horizon navigation. However, current UAV vision-and-language navigation(VLN) methods struggle to model long-horizon spatiotemporal context in complex environments, resulting in inaccurate semantic alignment and unstable path planning. To this end, we propose LongFly, a spatiotemporal context modeling framework for long-horizon UAV VLN. LongFly proposes a history-aware spatiotemporal modeling strategy that transforms fragmented and redundant historical data into structured, compact, and expressive representations. First, we propose the slot-based historical image compression module, which dynamically distills multi-view historical observations into fixed-length contextual representations. Then, the spatiotemporal trajectory encoding module is introduced to capture the temporal dynamics and spatial structure of UAV trajectories. Finally, to integrate existing spatiotemporal context with current observations, we design the prompt-guided multimodal integration module to support time-based reasoning and robust waypoint prediction. Experimental results demonstrate that LongFly outperforms state-of-the-art UAV VLN baselines by 7.89\% in success rate and 6.33\% in success weighted by path length, consistently across both seen and unseen environments.

11. 【2512.22009】SHIFT: Lightweight Slow-Fast GUI Agent with Adaptive Perception

链接https://arxiv.org/abs/2512.22009

作者:Sarthak Mehrotra,Sairam V C Rebbapragada,Mani Hemanth Reddy Bonthu,Vineeth N Balasubramanian

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:pixel-rich Graphical User, Multimodal Large Language, Graphical User Interface, Graphical User, show strong potential

备注

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) show strong potential for interpreting and interacting with complex, pixel-rich Graphical User Interface (GUI) environments. However, building agents that are both efficient for high-level tasks and precise for fine-grained interactions remains challenging. GUI agents must perform routine actions efficiently while also handling tasks that demand exact visual grounding, yet existing approaches struggle when accuracy depends on identifying specific interface elements. These MLLMs also remain large and cannot adapt their reasoning depth to the task at hand. In this work, we introduce iSHIFT: Implicit Slow-fast Hybrid Inference with Flexible Tokens, a lightweight agent that integrates latent thinking (implicit chain-of-thought) with a perception control module. iSHIFT enables an MLLM to switch between a slow mode, which leverages detailed visual grounding for high precision and a fast mode that uses global cues for efficiency. Special perception tokens guide attention to relevant screen regions, allowing the model to decide both how to reason and where to focus. Despite its compact 2.5B size, iSHIFT matches state-of-the-art performance on multiple benchmark datasets.

12. 【2512.21999】Look Closer! An Adversarial Parametric Editing Framework for Hallucination Mitigation in VLMs

链接https://arxiv.org/abs/2512.21999

作者:Jiayu Hu,Beibei Li,Jiangwei Xia,Yanjun Qin,Bing Ji,Zhongshi He

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:promising practical applications, generating outputs misaligned, garnered increasing attention, persistent hallucination issues, exhibit persistent hallucination

备注

点击查看摘要

Abstract:While Vision-Language Models (VLMs) have garnered increasing attention in the AI community due to their promising practical applications, they exhibit persistent hallucination issues, generating outputs misaligned with visual inputs. Recent studies attribute these hallucinations to VLMs' over-reliance on linguistic priors and insufficient visual feature integration, proposing heuristic decoding calibration strategies to mitigate them. However, the non-trainable nature of these strategies inherently limits their optimization potential. To this end, we propose an adversarial parametric editing framework for Hallucination mitigation in VLMs, which follows an \textbf{A}ctivate-\textbf{L}ocate-\textbf{E}dit \textbf{A}dversarially paradigm. Specifically, we first construct an activation dataset that comprises grounded responses (positive samples attentively anchored in visual features) and hallucinatory responses (negative samples reflecting LLM prior bias and internal knowledge artifacts). Next, we identify critical hallucination-prone parameter clusters by analyzing differential hidden states of response pairs. Then, these clusters are fine-tuned using prompts injected with adversarial tuned prefixes that are optimized to maximize visual neglect, thereby forcing the model to prioritize visual evidence over inherent parametric biases. Evaluations on both generative and discriminative VLM tasks demonstrate the significant effectiveness of ALEAHallu in alleviating hallucinations. Our code is available at this https URL.

13. 【2512.21985】LVLM-Aided Alignment of Task-Specific Vision Models

链接https://arxiv.org/abs/2512.21985

作者:Alexander Koebler,Lukas Kuhn,Ingo Thon,Florian Buettner

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:low computational requirements, small task-specific vision, task-specific vision models, crucial due, low computational

备注

点击查看摘要

Abstract:In high-stakes domains, small task-specific vision models are crucial due to their low computational requirements and the availability of numerous methods to explain their results. However, these explanations often reveal that the models do not align well with human domain knowledge, relying instead on spurious correlations. This might result in brittle behavior once deployed in the real-world. To address this issue, we introduce a novel and efficient method for aligning small task-specific vision models with human domain knowledge by leveraging the generalization capabilities of a Large Vision Language Model (LVLM). Our LVLM-Aided Visual Alignment (LVLM-VA) method provides a bidirectional interface that translates model behavior into natural language and maps human class-level specifications to image-level critiques, enabling effective interaction between domain experts and the model. Our method demonstrates substantial improvement in aligning model behavior with human specifications, as validated on both synthetic and real-world datasets. We show that it effectively reduces the model's dependence on spurious features and on group-specific biases, without requiring fine-grained feedback.

14. 【2512.21984】A Lightweight Multi-Scale Attention Framework for Real-Time Spinal Endoscopic Instance Segmentation

链接https://arxiv.org/abs/2512.21984

作者:Qi Lai,JunYan Li,Qiang Cai,Lei Wang,Tao Yan,XiaoKun Liang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:protecting critical anatomy, Real-time instance segmentation, specular highlights, unclear boundaries, anatomy during surgery

备注

点击查看摘要

Abstract:Real-time instance segmentation for spinal endoscopy is important for identifying and protecting critical anatomy during surgery, but it is difficult because of the narrow field of view, specular highlights, smoke/bleeding, unclear boundaries, and large scale changes. Deployment is also constrained by limited surgical hardware, so the model must balance accuracy and speed and remain stable under small-batch (even batch-1) training. We propose LMSF-A, a lightweight multi-scale attention framework co-designed across backbone, neck, and head. The backbone uses a C2f-Pro module that combines RepViT-style re-parameterized convolution (RVB) with efficient multi-scale attention (EMA), enabling multi-branch training while collapsing into a single fast path for inference. The neck improves cross-scale consistency and boundary detail using Scale-Sequence Feature Fusion (SSFF) and Triple Feature Encoding (TFE), which strengthens high-resolution features. The head adopts a Lightweight Multi-task Shared Head (LMSH) with shared convolutions and GroupNorm to reduce parameters and support batch-1 stability. We also release the clinically reviewed PELD dataset (61 patients, 610 images) with instance masks for adipose tissue, bone, ligamentum flavum, and nerve. Experiments show that LMSF-A is highly competitive (or even better than) in all evaluation metrics and much lighter than most instance segmentation methods requiring only 1.8M parameters and 8.8 GFLOPs, and it generalizes well to a public teeth benchmark. Code and dataset: this https URL.

15. 【2512.21964】Perceive and Calibrate: Analyzing and Enhancing Robustness of Medical Multi-Modal Large Language Models

链接https://arxiv.org/abs/2512.21964

作者:Dunyuan XU,Xikai Yang,Yaoqian Li,Juzheng Miao,Jinpeng Li,Pheng-Ann Heng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Large Language Models, Multi-modal Large Language, Language Models, Large Language, Medical Multi-modal Large

备注

点击查看摘要

Abstract:Medical Multi-modal Large Language Models (MLLMs) have shown promising clinical performance. However, their sensitivity to real-world input perturbations, such as imaging artifacts and textual errors, critically undermines their clinical applicability. Systematic analysis of such noise impact on medical MLLMs remains largely unexplored. Furthermore, while several works have investigated the MLLMs' robustness in general domains, they primarily focus on text modality and rely on costly fine-tuning. They are inadequate to address the complex noise patterns and fulfill the strict safety standards in medicine. To bridge this gap, this work systematically analyzes the impact of various perturbations on medical MLLMs across both visual and textual modalities. Building on our findings, we introduce a training-free Inherent-enhanced Multi-modal Calibration (IMC) framework that leverages MLLMs' inherent denoising capabilities following the perceive-and-calibrate principle for cross-modal robustness enhancement. For the visual modality, we propose a Perturbation-aware Denoising Calibration (PDC) which leverages MLLMs' own vision encoder to identify noise patterns and perform prototype-guided feature calibration. For text denoising, we design a Self-instantiated Multi-agent System (SMS) that exploits the MLLMs' self-assessment capabilities to refine noisy text through a cooperative hierarchy of agents. We construct a benchmark containing 11 types of noise across both image and text modalities on 2 datasets. Experimental results demonstrate our method achieves the state-of-the-art performance across multiple modalities, showing potential to enhance MLLMs' robustness in real clinical scenarios.

16. 【2512.21948】Automated Discovery of Parsimonious Spectral Indices via Normalized Difference Polynomials

链接https://arxiv.org/abs/2512.21948

作者:Ali Lotfi,Adam Carter,Thuan Ha,Mohammad Meysami,Kwabena Nketia,Steve Shirtliffe

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:find compact spectral, compact spectral indices, introduce an automated, find compact, normalized differences

备注: 23 pages, 5 figures, 6 tables

点击查看摘要

Abstract:We introduce an automated way to find compact spectral indices for vegetation classification. The idea is to take all pairwise normalized differences from the spectral bands and then build polynomial combinations up to a fixed degree, which gives a structured search space that still keeps the illumination invariance needed in remote sensing. For a sensor with $n$ bands this produces $\binom{n}{2}$ base normalized differences, and the degree-2 polynomial expansion gives 1,080 candidate features for the 10-band Sentinel-2 configuration we use here. Feature selection methods (ANOVA filtering, recursive elimination, and $L_1$-regularized SVM) then pick out small sets of indices that reach the desired accuracy, so the final models stay simple and easy to interpret. We test the framework on Kochia (\textit{Bassia scoparia}) detection using Sentinel-2 imagery from Saskatchewan, Canada ($N = 2{,}318$ samples, 2022--2024). A single degree-2 index, the product of two normalized differences from the red-edge bands, already reaches 96.26\% accuracy, and using eight indices only raises this to 97.70\%. In every case the chosen features are degree-2 products built from bands $b_4$ through $b_8$, which suggests that the discriminative signal comes from spectral \emph{interactions} rather than individual band ratios. Because the indices involve only simple arithmetic, they can be deployed directly in platforms like Google Earth Engine. The same approach works for other sensors and classification tasks, and an open-source implementation (\texttt{ndindex}) is available.

17. 【2512.21944】Data relativistic uncertainty framework for low-illumination anime scenery image enhancement

链接https://arxiv.org/abs/2512.21944

作者:Yiquan Gao,John See

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)

关键词:low-illumination quality degradation, anime scenery images, unpaired anime scenery, anime scenery dataset, anime scenery

备注: Preprint, awaiting submission to the appropriate conference or journal

点击查看摘要

Abstract:By contrast with the prevailing works of low-light enhancement in natural images and videos, this study copes with the low-illumination quality degradation in anime scenery images to bridge the domain gap. For such an underexplored enhancement task, we first curate images from various sources and construct an unpaired anime scenery dataset with diverse environments and illumination conditions to address the data scarcity. To exploit the power of uncertainty information inherent with the diverse illumination conditions, we propose a Data Relativistic Uncertainty (DRU) framework, motivated by the idea from Relativistic GAN. By analogy with the wave-particle duality of light, our framework interpretably defines and quantifies the illumination uncertainty of dark/bright samples, which is leveraged to dynamically adjust the objective functions to recalibrate the model learning under data uncertainty. Extensive experiments demonstrate the effectiveness of DRU framework by training several versions of EnlightenGANs, yielding superior perceptual and aesthetic qualities beyond the state-of-the-art methods that are incapable of learning from data uncertainty perspective. We hope our framework can expose a novel paradigm of data-centric learning for potential visual and language domains. Code is available.

18. 【2512.21924】Unsupervised Anomaly Detection in Brain MRI via Disentangled Anatomy Learning

链接https://arxiv.org/abs/2512.21924

作者:Tao Yang,Xiuying Wang,Hao Liu,Guanzhong Gong,Lian-Ming Wu,Yu-Ping Wang,Lisheng Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:clinically critical, challenging due, imaging information, multi-center MRIs due, Detection

备注: Accepted by Medical Image Analysis (2025)

点击查看摘要

Abstract:Detection of various lesions in brain MRI is clinically critical, but challenging due to the diversity of lesions and variability in imaging conditions. Current unsupervised learning methods detect anomalies mainly through reconstructing abnormal images into pseudo-healthy images (PHIs) by normal samples learning and then analyzing differences between images. However, these unsupervised models face two significant limitations: restricted generalizability to multi-modality and multi-center MRIs due to their reliance on the specific imaging information in normal training data, and constrained performance due to abnormal residuals propagated from input images to reconstructed PHIs. To address these limitations, two novel modules are proposed, forming a new PHI reconstruction framework. Firstly, the disentangled representation module is proposed to improve generalizability by decoupling brain MRI into imaging information and essential imaging-invariant anatomical images, ensuring that the reconstruction focuses on the anatomy. Specifically, brain anatomical priors and a differentiable one-hot encoding operator are introduced to constrain the disentanglement results and enhance the disentanglement stability. Secondly, the edge-to-image restoration module is designed to reconstruct high-quality PHIs by restoring the anatomical representation from the high-frequency edge information of anatomical images, and then recoupling the disentangled imaging information. This module not only suppresses abnormal residuals in PHI by reducing abnormal pixels input through edge-only input, but also effectively reconstructs normal regions using the preserved structural details in the edges. Evaluated on nine public datasets (4,443 patients' MRIs from multiple centers), our method outperforms 17 SOTA methods, achieving absolute improvements of +18.32% in AP and +13.64% in DSC.

19. 【2512.21921】AutoPP: Towards Automated Product Poster Generation and Optimization

链接https://arxiv.org/abs/2512.21921

作者:Jiahao Fan,Yuxin Qin,Wei Feng,Yanyin Chen,Yaoyu Li,Ao Ma,Yixiu Li,Li Zhuang,Haoyi Bian,Zheng Zhang,Jingjing Lv,Junjie Shen,Ching Law

类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)

关键词:capture customer attention, blend striking visuals, posters blend striking, Product posters blend, product poster generation

备注: Accepted to AAAI 2026

点击查看摘要

Abstract:Product posters blend striking visuals with informative text to highlight the product and capture customer attention. However, crafting appealing posters and manually optimizing them based on online performance is laborious and resource-consuming. To address this, we introduce AutoPP, an automated pipeline for product poster generation and optimization that eliminates the need for human intervention. Specifically, the generator, relying solely on basic product information, first uses a unified design module to integrate the three key elements of a poster (background, text, and layout) into a cohesive output. Then, an element rendering module encodes these elements into condition tokens, efficiently and controllably generating the product poster. Based on the generated poster, the optimizer enhances its Click-Through Rate (CTR) by leveraging online feedback. It systematically replaces elements to gather fine-grained CTR comparisons and utilizes Isolated Direct Preference Optimization (IDPO) to attribute CTR gains to isolated elements. Our work is supported by AutoPP1M, the largest dataset specifically designed for product poster generation and optimization, which contains one million high-quality posters and feedback collected from over one million users. Experiments demonstrate that AutoPP achieves state-of-the-art results in both offline and online settings. Our code and dataset are publicly available at: this https URL

20. 【2512.21916】Patch as Node: Human-Centric Graph Representation Learning for Multimodal Action Recognition

链接https://arxiv.org/abs/2512.21916

作者:Zeyu Liang,Hailun Xia,Naichuan Zheng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:witnessed notable achievements, multimodal action recognition, methods fusing RGB, action recognition, human action recognition

备注

点击查看摘要

Abstract:While human action recognition has witnessed notable achievements, multimodal methods fusing RGB and skeleton modalities still suffer from their inherent heterogeneity and fail to fully exploit the complementary potential between them. In this paper, we propose PAN, the first human-centric graph representation learning framework for multimodal action recognition, in which token embeddings of RGB patches containing human joints are represented as spatiotemporal graphs. The human-centric graph modeling paradigm suppresses the redundancy in RGB frames and aligns well with skeleton-based methods, thus enabling a more effective and semantically coherent fusion of multimodal features. Since the sampling of token embeddings heavily relies on 2D skeletal data, we further propose attention-based post calibration to reduce the dependency on high-quality skeletal data at a minimal cost interms of model performance. To explore the potential of PAN in integrating with skeleton-based methods, we present two variants: PAN-Ensemble, which employs dual-path graph convolution networks followed by late fusion, and PAN-Unified, which performs unified graph representation learning within a single network. On three widely used multimodal action recognition datasets, both PAN-Ensemble and PAN-Unified achieve state-of-the-art (SOTA) performance in their respective settings of multimodal fusion: separate and unified modeling, respectively.

21. 【2512.21905】High-Fidelity and Long-Duration Human Image Animation with Diffusion Transformer

链接https://arxiv.org/abs/2512.21905

作者:Shen Zheng,Jiaran Cai,Yuansheng Guan,Shenneng Huang,Xingpei Ma,Junjie Cao,Hanfeng Zhao,Qiang Zhang,Shunsi Zhang,Xiao-Ping Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent progress, significantly advanced, advanced the field, Recent, human image animation

备注

点击查看摘要

Abstract:Recent progress in diffusion models has significantly advanced the field of human image animation. While existing methods can generate temporally consistent results for short or regular motions, significant challenges remain, particularly in generating long-duration videos. Furthermore, the synthesis of fine-grained facial and hand details remains under-explored, limiting the applicability of current approaches in real-world, high-quality applications. To address these limitations, we propose a diffusion transformer (DiT)-based framework which focuses on generating high-fidelity and long-duration human animation videos. First, we design a set of hybrid implicit guidance signals and a sharpness guidance factor, enabling our framework to additionally incorporate detailed facial and hand features as guidance. Next, we incorporate the time-aware position shift fusion module, modify the input format within the DiT backbone, and refer to this mechanism as the Position Shift Adaptive Module, which enables video generation of arbitrary length. Finally, we introduce a novel data augmentation strategy and a skeleton alignment model to reduce the impact of human shape variations across different identities. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches, achieving superior performance in both high-fidelity and long-duration human image animation.

22. 【2512.21890】CrownGen: Patient-customized Crown Generation via Point Diffusion Model

链接https://arxiv.org/abs/2512.21890

作者:Juyoung Bae,Moo Hyun Son,Jiale Peng,Wanting Qu,Wener Chen,Zelin Qiu,Kaixin Li,Xiaojuan Chen,Yifan Lin,Hao Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Digital crown design, crown design remains, Digital crown, restorative dentistry, remains a labor-intensive

备注

点击查看摘要

Abstract:Digital crown design remains a labor-intensive bottleneck in restorative dentistry. We present \textbf{CrownGen}, a generative framework that automates patient-customized crown design using a denoising diffusion model on a novel tooth-level point cloud representation. The system employs two core components: a boundary prediction module to establish spatial priors and a diffusion-based generative module to synthesize high-fidelity morphology for multiple teeth in a single inference pass. We validated CrownGen through a quantitative benchmark on 496 external scans and a clinical study of 26 restoration cases. Results demonstrate that CrownGen surpasses state-of-the-art models in geometric fidelity and significantly reduces active design time. Clinical assessments by trained dentists confirmed that CrownGen-assisted crowns are statistically non-inferior in quality to those produced by expert technicians using manual workflows. By automating complex prosthetic modeling, CrownGen offers a scalable solution to lower costs, shorten turnaround times, and enhance patient access to high-quality dental care.

23. 【2512.21883】Reloc-VGGT: Visual Re-localization with Geometry Grounded Transformer

链接https://arxiv.org/abs/2512.21883

作者:Tianchen Deng,Wenhua Wu,Kunzhen Wu,Guangming Wang,Siting Zhu,Shenghai Yuan,Xun Chen,Guole Shen,Zhe Liu,Hesheng Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:pose regression problem, pair-wise pose regression, regression problem, traditionally been formulated, Visual localization

备注

点击查看摘要

Abstract:Visual localization has traditionally been formulated as a pair-wise pose regression problem. Existing approaches mainly estimate relative poses between two images and employ a late-fusion strategy to obtain absolute pose estimates. However, the late motion average is often insufficient for effectively integrating spatial information, and its accuracy degrades in complex environments. In this paper, we present the first visual localization framework that performs multi-view spatial integration through an early-fusion mechanism, enabling robust operation in both structured and unstructured environments. Our framework is built upon the VGGT backbone, which encodes multi-view 3D geometry, and we introduce a pose tokenizer and projection module to more effectively exploit spatial relationships from multiple database views. Furthermore, we propose a novel sparse mask attention strategy that reduces computational cost by avoiding the quadratic complexity of global attention, thereby enabling real-time performance at scale. Trained on approximately eight million posed image pairs, Reloc-VGGT demonstrates strong accuracy and remarkable generalization ability. Extensive experiments across diverse public datasets consistently validate the effectiveness and efficiency of our approach, delivering high-quality camera pose estimates in real time while maintaining robustness to unseen environments. Our code and models will be publicly released upon this http URL://github.com/dtc111111/Reloc-VGGT.

24. 【2512.21881】SLIM-Brain: A Data- and Training-Efficient Foundation Model for fMRI Data Analysis

链接https://arxiv.org/abs/2512.21881

作者:Mo Wang,Junfeng Xia,Wenhao Ye,Enyu Liu,Kaining Peng,Jianfeng Feng,Quanying Liu,Hongkai Wen

类目:Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)

关键词:current approaches face, fMRI Foundation Model, Foundation models, general-purpose foundation models, atlas-free foundation model

备注: The code will be released after review

点击查看摘要

Abstract:Foundation models are emerging as a powerful paradigm for fMRI analysis, but current approaches face a dual bottleneck of data- and training-efficiency. Atlas-based methods aggregate voxel signals into fixed regions of interest, reducing data dimensionality but discarding fine-grained spatial details, and requiring extremely large cohorts to train effectively as general-purpose foundation models. Atlas-free methods, on the other hand, operate directly on voxel-level information - preserving spatial fidelity but are prohibitively memory- and compute-intensive, making large-scale pre-training infeasible. We introduce SLIM-Brain (Sample-efficient, Low-memory fMRI Foundation Model for Human Brain), a new atlas-free foundation model that simultaneously improves both data- and training-efficiency. SLIM-Brain adopts a two-stage adaptive design: (i) a lightweight temporal extractor captures global context across full sequences and ranks data windows by saliency, and (ii) a 4D hierarchical encoder (Hiera-JEPA) learns fine-grained voxel-level representations only from the top-$k$ selected windows, while deleting about 70% masked patches. Extensive experiments across seven public benchmarks show that SLIM-Brain establishes new state-of-the-art performance on diverse tasks, while requiring only 4 thousand pre-training sessions and approximately 30% of GPU memory comparing to traditional voxel-level methods.

25. 【2512.21867】DPAR: Dynamic Patchification for Efficient Autoregressive Visual Generation

链接https://arxiv.org/abs/2512.21867

作者:Divyansh Srivastava,Akshay Mehra,Pranav Maneriker,Debopam Sanyal,Vishnu Raj,Vijay Kamarshi,Fan Du,Joshua Kimball

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:fixed-length tokenization schemes, generation typically relies, Decoder-only autoregressive image, counts grow quadratically, decoder-only autoregressive model

备注

点击查看摘要

Abstract:Decoder-only autoregressive image generation typically relies on fixed-length tokenization schemes whose token counts grow quadratically with resolution, substantially increasing the computational and memory demands of attention. We present DPAR, a novel decoder-only autoregressive model that dynamically aggregates image tokens into a variable number of patches for efficient image generation. Our work is the first to demonstrate that next-token prediction entropy from a lightweight and unsupervised autoregressive model provides a reliable criterion for merging tokens into larger patches based on information content. DPAR makes minimal modifications to the standard decoder architecture, ensuring compatibility with multimodal generation frameworks and allocating more compute to generation of high-information image regions. Further, we demonstrate that training with dynamically sized patches yields representations that are robust to patch boundaries, allowing DPAR to scale to larger patch sizes at inference. DPAR reduces token count by 1.81x and 2.06x on Imagenet 256 and 384 generation resolution respectively, leading to a reduction of up to 40% FLOPs in training costs. Further, our method exhibits faster convergence and improves FID by up to 27.1% relative to baseline models.

26. 【2512.21865】EasyOmnimatte: Taming Pretrained Inpainting Diffusion Models for End-to-End Video Layered Decomposition

链接https://arxiv.org/abs/2512.21865

作者:Yihan Hu,Xuelin Chen,Xiaodong Cun

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:powerful generative priors, inference-time optimization pipelines, exploit powerful generative, producing suboptimal decompositions, Existing video omnimatte

备注

点击查看摘要

Abstract:Existing video omnimatte methods typically rely on slow, multi-stage, or inference-time optimization pipelines that fail to fully exploit powerful generative priors, producing suboptimal decompositions. Our key insight is that, if a video inpainting model can be finetuned to remove the foreground-associated effects, then it must be inherently capable of perceiving these effects, and hence can also be finetuned for the complementary task: foreground layer decomposition with associated effects. However, although naïvely finetuning the inpainting model with LoRA applied to all blocks can produce high-quality alpha mattes, it fails to capture associated effects. Our systematic analysis reveals this arises because effect-related cues are primarily encoded in specific DiT blocks and become suppressed when LoRA is applied across all blocks. To address this, we introduce EasyOmnimatte, the first unified, end-to-end video omnimatte method. Concretely, we finetune a pretrained video inpainting diffusion model to learn dual complementary experts while keeping its original weights intact: an Effect Expert, where LoRA is applied only to effect-sensitive DiT blocks to capture the coarse structure of the foreground and associated effects, and a fully LoRA-finetuned Quality Expert learns to refine the alpha matte. During sampling, Effect Expert is used for denoising at early, high-noise steps, while Quality Expert takes over at later, low-noise steps. This design eliminates the need for two full diffusion passes, significantly reducing computational cost without compromising output quality. Ablation studies validate the effectiveness of this Dual-Expert strategy. Experiments demonstrate that EasyOmnimatte sets a new state-of-the-art for video omnimatte and enables various downstream tasks, significantly outperforming baselines in both quality and efficiency.

27. 【2512.21861】Balancing Accuracy and Efficiency: CNN Fusion Models for Diabetic Retinopathy Screening

链接https://arxiv.org/abs/2512.21861

作者:Md Rafid Islam,Rafsan Jany,Akib Ahmed,Mohammad Ashrafuzzaman Khan

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:limited specialist availability, variable image quality, remains a leading, preventable blindness, devices and populations

备注

点击查看摘要

Abstract:Diabetic retinopathy (DR) remains a leading cause of preventable blindness, yet large-scale screening is constrained by limited specialist availability and variable image quality across devices and populations. This work investigates whether feature-level fusion of complementary convolutional neural network (CNN) backbones can deliver accurate and efficient binary DR screening on globally sourced fundus images. Using 11,156 images pooled from five public datasets (APTOS, EyePACS, IDRiD, Messidor, and ODIR), we frame DR detection as a binary classification task and compare three pretrained models (ResNet50, EfficientNet-B0, and DenseNet121) against pairwise and tri-fusion variants. Across five independent runs, fusion consistently outperforms single backbones. The EfficientNet-B0 + DenseNet121 (Eff+Den) fusion model achieves the best overall mean performance (accuracy: 82.89\%) with balanced class-wise F1-scores for normal (83.60\%) and diabetic (82.60\%) cases. While the tri-fusion is competitive, it incurs a substantially higher computational cost. Inference profiling highlights a practical trade-off: EfficientNet-B0 is the fastest (approximately 1.16 ms/image at batch size 1000), whereas the Eff+Den fusion offers a favorable accuracy--latency balance. These findings indicate that lightweight feature fusion can enhance generalization across heterogeneous datasets, supporting scalable binary DR screening workflows where both accuracy and throughput are critical.

28. 【2512.21860】raining-free Conditional Image Embedding Framework Leveraging Large Vision Language Models

链接https://arxiv.org/abs/2512.21860

作者:Masayuki Kawarada,Kosuke Yamada,Antonio Tejero-de-Pablos,Naoto Inoue

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Conditional image embeddings, Conditional image, challenging problem, specific aspects, image

备注

点击查看摘要

Abstract:Conditional image embeddings are feature representations that focus on specific aspects of an image indicated by a given textual condition (e.g., color, genre), which has been a challenging problem. Although recent vision foundation models, such as CLIP, offer rich representations of images, they are not designed to focus on a specified condition. In this paper, we propose DIOR, a method that leverages a large vision-language model (LVLM) to generate conditional image embeddings. DIOR is a training-free approach that prompts the LVLM to describe an image with a single word related to a given condition. The hidden state vector of the LVLM's last token is then extracted as the conditional image embedding. DIOR provides a versatile solution that can be applied to any image and condition without additional training or task-specific priors. Comprehensive experimental results on conditional image similarity tasks demonstrate that DIOR outperforms existing training-free baselines, including CLIP. Furthermore, DIOR achieves superior performance compared to methods that require additional training across multiple settings.

29. 【2512.21857】Fast Inference of Visual Autoregressive Model with Adjacency-Adaptive Dynamical Draft Trees

链接https://arxiv.org/abs/2512.21857

作者:Haodong Lei,Hongsong Wang,Xin Geng,Liang Wang,Pan Zhou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:achieve diffusion-level quality, requiring approximately, sequential inference, diffusion-level quality, quality but suffer

备注

点击查看摘要

Abstract:Autoregressive (AR) image models achieve diffusion-level quality but suffer from sequential inference, requiring approximately 2,000 steps for a 576x576 image. Speculative decoding with draft trees accelerates LLMs yet underperforms on visual AR models due to spatially varying token prediction difficulty. We identify a key obstacle in applying speculative decoding to visual AR models: inconsistent acceptance rates across draft trees due to varying prediction difficulties in different image regions. We propose Adjacency-Adaptive Dynamical Draft Trees (ADT-Tree), an adjacency-adaptive dynamic draft tree that dynamically adjusts draft tree depth and width by leveraging adjacent token states and prior acceptance rates. ADT-Tree initializes via horizontal adjacency, then refines depth/width via bisectional adaptation, yielding deeper trees in simple regions and wider trees in complex ones. The empirical evaluations on MS-COCO 2017 and PartiPrompts demonstrate that ADT-Tree achieves speedups of 3.13xand 3.05x, respectively. Moreover, it integrates seamlessly with relaxed sampling methods such as LANTERN, enabling further acceleration. Code is available at this https URL.

30. 【2512.21856】Breaking Alignment Barriers: TPS-Driven Semantic Correlation Learning for Alignment-Free RGB-T Salient Object Detection

链接https://arxiv.org/abs/2512.21856

作者:Lupiao Hu,Fasheng Wang,Fangmei Chen,Fuming Sun,Haojie Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:methods predominantly rely, handle real-world scenarios, RGB-T image pairs, RGB-T salient object, Correlation Learning Network

备注: Accepted by AAAI2026

点击查看摘要

Abstract:Existing RGB-T salient object detection methods predominantly rely on manually aligned and annotated datasets, struggling to handle real-world scenarios with raw, unaligned RGB-T image pairs. In practical applications, due to significant cross-modal disparities such as spatial misalignment, scale variations, and viewpoint shifts, the performance of current methods drastically deteriorates on unaligned datasets. To address this issue, we propose an efficient RGB-T SOD method for real-world unaligned image pairs, termed Thin-Plate Spline-driven Semantic Correlation Learning Network (TPS-SCL). We employ a dual-stream MobileViT as the encoder, combined with efficient Mamba scanning mechanisms, to effectively model correlations between the two modalities while maintaining low parameter counts and computational overhead. To suppress interference from redundant background information during alignment, we design a Semantic Correlation Constraint Module (SCCM) to hierarchically constrain salient features. Furthermore, we introduce a Thin-Plate Spline Alignment Module (TPSAM) to mitigate spatial discrepancies between modalities. Additionally, a Cross-Modal Correlation Module (CMCM) is incorporated to fully explore and integrate inter-modal dependencies, enhancing detection performance. Extensive experiments on various datasets demonstrate that TPS-SCL attains state-of-the-art (SOTA) performance among existing lightweight SOD methods and outperforms mainstream RGB-T SOD approaches.

31. 【2512.21845】Scalable Class-Incremental Learning Based on Parametric Neural Collapse

链接https://arxiv.org/abs/2512.21845

作者:Chuangxin Zhang,Guangfeng Lin,Enhui Zhao,Kaiyang Liao,Yajun Chen

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Equiangular Tight Frame, encounter challenges, evolving class distributions, class misalignment due, parametric Equiangular Tight

备注: 42 pages, 8 figures, submitted to Pattern Recognition (PR)

点击查看摘要

Abstract:Incremental learning often encounter challenges such as overfitting to new data and catastrophic forgetting of old data. Existing methods can effectively extend the model for new tasks while freezing the parameters of the old model, but ignore the necessity of structural efficiency to lead to the feature difference between modules and the class misalignment due to evolving class distributions. To address these issues, we propose scalable class-incremental learning based on parametric neural collapse (SCL-PNC) that enables demand-driven, minimal-cost backbone expansion by adapt-layer and refines the static into a dynamic parametric Equiangular Tight Frame (ETF) framework according to incremental class. This method can efficiently handle the model expansion question with the increasing number of categories in real-world scenarios. Additionally, to counteract feature drift in serial expansion models, the parallel expansion framework is presented with a knowledge distillation algorithm to align features across expansion modules. Therefore, SCL-PNC can not only design a dynamic and extensible ETF classifier to address class misalignment due to evolving class distributions, but also ensure feature consistency by an adapt-layer with knowledge distillation between extended modules. By leveraging neural collapse, SCL-PNC induces the convergence of the incremental expansion model through a structured combination of the expandable backbone, adapt-layer, and the parametric ETF classifier. Experiments on standard benchmarks demonstrate the effectiveness and efficiency of our proposed method. Our code is available at this https URL ETF2. Keywords: Class incremental learning; Catastrophic forgetting; Neural collapse;Knowledge distillation; Expanded model.

32. 【2512.21831】End-to-End 3D Spatiotemporal Perception with Multimodal Fusion and V2X Collaboration

链接https://arxiv.org/abs/2512.21831

作者:Zhenwei Yang,Yibo Ai,Weidong Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Multi-view cooperative perception, essential for reliable, autonomous driving, understanding in autonomous, unifies multi-view multimodal

备注: 19 pages, 19 figures

点击查看摘要

Abstract:Multi-view cooperative perception and multimodal fusion are essential for reliable 3D spatiotemporal understanding in autonomous driving, especially under occlusions, limited viewpoints, and communication delays in V2X scenarios. This paper proposes XET-V2X, a multi-modal fused end-to-end tracking framework for v2x collaboration that unifies multi-view multimodal sensing within a shared spatiotemporal representation. To efficiently align heterogeneous viewpoints and modalities, XET-V2X introduces a dual-layer spatial cross-attention module based on multi-scale deformable attention. Multi-view image features are first aggregated to enhance semantic consistency, followed by point cloud fusion guided by the updated spatial queries, enabling effective cross-modal interaction while reducing computational overhead. Experiments on the real-world V2X-Seq-SPD dataset and the simulated V2X-Sim-V2V and V2X-Sim-V2I benchmarks demonstrate consistent improvements in detection and tracking performance under varying communication delays. Both quantitative results and qualitative visualizations indicate that XET-V2X achieves robust and temporally stable perception in complex traffic scenarios.

33. 【2512.21815】Few Tokens Matter: Entropy Guided Attacks on Vision-Language Models

链接https://arxiv.org/abs/2512.21815

作者:Mengqi He,Xinyu Tian,Xin Shen,Jinhong Ni,Shu Zou,Zhaoyuan Yang,Jing Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Vision-language models, achieve remarkable performance, remarkable performance, performance but remain, Vision-language

备注: 19 Pages,11 figures,8 tables

点击查看摘要

Abstract:Vision-language models (VLMs) achieve remarkable performance but remain vulnerable to adversarial attacks. Entropy, a measure of model uncertainty, is strongly correlated with the reliability of VLM. Prior entropy-based attacks maximize uncertainty at all decoding steps, implicitly assuming that every token contributes equally to generation instability. We show instead that a small fraction (about 20%) of high-entropy tokens, i.e., critical decision points in autoregressive generation, disproportionately governs output trajectories. By concentrating adversarial perturbations on these positions, we achieve semantic degradation comparable to global methods while using substantially smaller budgets. More importantly, across multiple representative VLMs, such selective attacks convert 35-49% of benign outputs into harmful ones, exposing a more critical safety risk. Remarkably, these vulnerable high-entropy forks recur across architecturally diverse VLMs, enabling feasible transferability (17-26% harmful rates on unseen targets). Motivated by these findings, we propose Entropy-bank Guided Adversarial attacks (EGA), which achieves competitive attack success rates (93-95%) alongside high harmful conversion, thereby revealing new weaknesses in current VLM safety mechanisms.

34. 【2512.21804】SP 500 Stock's Movement Prediction using CNN

链接https://arxiv.org/abs/2512.21804

作者:Rahul Gupta

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)

关键词:stock, artificial neural network, neural network, data, stock consist

备注: 9 pages, 19 diagrams. Originally submitted as a part of my Stanford University program taught by Dr. Fei Fei Lee and Andrej Karpathy CS231N 2018

点击查看摘要

Abstract:This paper is about predicting the movement of stock consist of SP 500 index. Historically there are many approaches have been tried using various methods to predict the stock movement and being used in the market currently for algorithm trading and alpha generating systems using traditional mathematical approaches [1, 2]. The success of artificial neural network recently created a lot of interest and paved the way to enable prediction using cutting-edge research in the machine learning and deep learning. Some of these papers have done a great job in implementing and explaining benefits of these new technologies. Although most these papers do not go into the complexity of the financial data and mostly utilize single dimension data, still most of these papers were successful in creating the ground for future research in this comparatively new phenomenon. In this paper, I am trying to use multivariate raw data including stock split/dividend events (as-is) present in real-world market data instead of engineered financial data. Convolution Neural Network (CNN), the best-known tool so far for image classification, is used on the multi-dimensional stock numbers taken from the market mimicking them as a vector of historical data matrices (read images) and the model achieves promising results. The predictions can be made stock by stock, i.e., a single stock, sector-wise or for the portfolio of stocks.

Comments:
9 pages, 19 diagrams. Originally submitted as a part of my Stanford University program taught by Dr. Fei Fei Lee and Andrej Karpathy CS231N 2018

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)

Cite as:
arXiv:2512.21804 [cs.CV]

(or
arXiv:2512.21804v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2512.21804

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Journalreference:
ADaSci Lattice Journal, Vol. 1, January 10, 2021

35. 【2512.21803】CellMamba: Adaptive Mamba for Accurate and Efficient Cell Detection

链接https://arxiv.org/abs/2512.21803

作者:Ruochen Liu,Yi Tian,Jiahao Wang,Hongbin Liu,Xianxu Hou,Jingxin Liu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:subtle inter-class differences, severe background clutter, pathological images presents, images presents unique, presents unique challenges

备注: 36th British Machine Vision Conference (BMVC 2025)

点击查看摘要

Abstract:Cell detection in pathological images presents unique challenges due to densely packed objects, subtle inter-class differences, and severe background clutter. In this paper, we propose CellMamba, a lightweight and accurate one-stage detector tailored for fine-grained biomedical instance detection. Built upon a VSSD backbone, CellMamba integrates CellMamba Blocks, which couple either NC-Mamba or Multi-Head Self-Attention (MSA) with a novel Triple-Mapping Adaptive Coupling (TMAC) module. TMAC enhances spatial discriminability by splitting channels into two parallel branches, equipped with dual idiosyncratic and one consensus attention map, adaptively fused to preserve local sensitivity and global consistency. Furthermore, we design an Adaptive Mamba Head that fuses multi-scale features via learnable weights for robust detection under varying object sizes. Extensive experiments on two public datasets-CoNSeP and CytoDArk0-demonstrate that CellMamba outperforms both CNN-based, Transformer-based, and Mamba-based baselines in accuracy, while significantly reducing model size and inference latency. Our results validate CellMamba as an efficient and effective solution for high-resolution cell detection.

36. 【2512.21797】Diffusion Posterior Sampling for Super-Resolution under Gaussian Measurement Noise

链接https://arxiv.org/abs/2512.21797

作者:Abu Hanif Muhammad Syarubany

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:report studies diffusion, studies diffusion posterior, report studies, diffusion posterior sampling, additive Gaussian noise

备注

点击查看摘要

Abstract:This report studies diffusion posterior sampling (DPS) for single-image super-resolution (SISR) under a known degradation model. We implement a likelihood-guided sampling procedure that combines an unconditional diffusion prior with gradient-based conditioning to enforce measurement consistency for $4\times$ super-resolution with additive Gaussian noise. We evaluate posterior sampling (PS) conditioning across guidance scales and noise levels, using PSNR and SSIM as fidelity metrics and a combined selection score $(\mathrm{PSNR}/40)+\mathrm{SSIM}$. Our ablation shows that moderate guidance improves reconstruction quality, with the best configuration achieved at PS scale $0.95$ and noise standard deviation $\sigma=0.01$ (score $1.45231$). Qualitative results confirm that the selected PS setting restores sharper edges and more coherent facial details compared to the downsampled inputs, while alternative conditioning strategies (e.g., MCG and PS-annealed) exhibit different texture fidelity trade-offs. These findings highlight the importance of balancing diffusion priors and measurement-gradient strength to obtain stable, high-quality reconstructions without retraining the diffusion model for each operator.

37. 【2512.21792】AI for Mycetoma Diagnosis in Histopathological Images: The MICCAI 2024 Challenge

链接https://arxiv.org/abs/2512.21792

作者:Hyam Omar Ali,Sahar Alhesseen,Lamis Elkhair,Adrian Galdran,Ming Feng,Zhixiang Xiong,Zengming Lin,Kele Xu,Liang Hu,Benjamin Keel,Oliver Mills,James Battye,Akshay Kumar,Asra Aslam,Prasad Dutande,Ujjwal Baid,Bhakti Baheti,Suhas Gajre,Aravind Shrenivas Murali,Eung-Joo Lee,Ahmed Fahal,Rachid Jennane

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:neglected tropical disease, tropical disease caused, severe tissue damage, Mycetoma, damage and disabilities

备注

点击查看摘要

Abstract:Mycetoma is a neglected tropical disease caused by fungi or bacteria leading to severe tissue damage and disabilities. It affects poor and rural communities and presents medical challenges and socioeconomic burdens on patients and healthcare systems in endemic regions worldwide. Mycetoma diagnosis is a major challenge in mycetoma management, particularly in low-resource settings where expert pathologists are limited. To address this challenge, this paper presents an overview of the Mycetoma MicroImage: Detect and Classify Challenge (mAIcetoma) which was organized to advance mycetoma diagnosis through AI solutions. mAIcetoma focused on developing automated models for segmenting mycetoma grains and classifying mycetoma types from histopathological images. The challenge attracted the attention of several teams worldwide to participate and five finalist teams fulfilled the challenge objectives. The teams proposed various deep learning architectures for the ultimate goal of this challenge. Mycetoma database (MyData) was provided to participants as a standardized dataset to run the proposed models. Those models were evaluated using evaluation metrics. Results showed that all the models achieved high segmentation accuracy, emphasizing the necessitate of grain detection as a critical step in mycetoma diagnosis. In addition, the top-performing models show a significant performance in classifying mycetoma types.

38. 【2512.21789】Five Years of SciCap: What We Learned and Future Directions for Scientific Figure Captioning

链接https://arxiv.org/abs/2512.21789

作者:Ting-Hao K.Huang,Ryan A. Rossi,Sungchul Kim,Tong Yu,Ting-Yao E. Hsu,Ho Yin(Sam)Ng,C. Lee Giles

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)

关键词:Pennsylvania State University, small seed-funded idea, central efforts shaping, Penn State seed, Penn State

备注: Accepted to the 5th Annual AAAI Workshop on AI to Accelerate Science and Engineering (AI2ASE 2026)

点击查看摘要

Abstract:Between 2021 and 2025, the SciCap project grew from a small seed-funded idea at The Pennsylvania State University (Penn State) into one of the central efforts shaping the scientific figure-captioning landscape. Supported by a Penn State seed grant, Adobe, and the Alfred P. Sloan Foundation, what began as our attempt to test whether domain-specific training, which was successful in text models like SciBERT, could also work for figure captions expanded into a multi-institution collaboration. Over these five years, we curated, released, and continually updated a large collection of figure-caption pairs from arXiv papers, conducted extensive automatic and human evaluations on both generated and author-written captions, navigated the rapid rise of large language models (LLMs), launched annual challenges, and built interactive systems that help scientists write better captions. In this piece, we look back at the first five years of SciCap and summarize the key technical and methodological lessons we learned. We then outline five major unsolved challenges and propose directions for the next phase of research in scientific figure captioning.

39. 【2512.21788】InstructMoLE: Instruction-Guided Mixture of Low-rank Experts for Multi-Conditional Image Generation

链接https://arxiv.org/abs/2512.21788

作者:Jinqi Xiao,Qing Yan,Liming Jiang,Zichuan Liu,Hao Kang,Shen Sang,Tiancheng Zhi,Jing Liu,Cheng Yang,Xin Lu,Bo Yuan

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Diffusion Transformers, Mixture of Low-rank, Fine-Tuning of Diffusion, Low-rank Experts, Transformers

备注

点击查看摘要

Abstract:Parameter-Efficient Fine-Tuning of Diffusion Transformers (DiTs) for diverse, multi-conditional tasks often suffers from task interference when using monolithic adapters like LoRA. The Mixture of Low-rank Experts (MoLE) architecture offers a modular solution, but its potential is usually limited by routing policies that operate at a token level. Such local routing can conflict with the global nature of user instructions, leading to artifacts like spatial fragmentation and semantic drift in complex image generation tasks. To address these limitations, we introduce InstructMoLE, a novel framework that employs an Instruction-Guided Mixture of Low-Rank Experts. Instead of per-token routing, InstructMoLE utilizes a global routing signal, Instruction-Guided Routing (IGR), derived from the user's comprehensive instruction. This ensures that a single, coherently chosen expert council is applied uniformly across all input tokens, preserving the global semantics and structural integrity of the generation process. To complement this, we introduce an output-space orthogonality loss, which promotes expert functional diversity and mitigates representational collapse. Extensive experiments demonstrate that InstructMoLE significantly outperforms existing LoRA adapters and MoLE variants across challenging multi-conditional generation benchmarks. Our work presents a robust and generalizable framework for instruction-driven fine-tuning of generative models, enabling superior compositional control and fidelity to user intent.

40. 【2512.21778】Scene-VLM: Multimodal Video Scene Segmentation via Vision-Language Models

链接https://arxiv.org/abs/2512.21778

作者:Nimrod Berman,Adam Botach,Emanuel Ben-Baruch,Shunit Haviv Hakimi,Asaf Gendler,Ilan Naiman,Erez Yosef,Igor Kviatkovsky

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Segmenting long-form videos, Segmenting long-form, large-scale video understanding, fundamental task, task in large-scale

备注

点击查看摘要

Abstract:Segmenting long-form videos into semantically coherent scenes is a fundamental task in large-scale video understanding. Existing encoder-based methods are limited by visual-centric biases, classify each shot in isolation without leveraging sequential dependencies, and lack both narrative understanding and explainability. In this paper, we present Scene-VLM, the first fine-tuned vision-language model (VLM) framework for video scene segmentation. Scene-VLM jointly processes visual and textual cues including frames, transcriptions, and optional metadata to enable multimodal reasoning across consecutive shots. The model generates predictions sequentially with causal dependencies among shots and introduces a context-focus window mechanism to ensure sufficient temporal context for each shot-level decision. In addition, we propose a scheme to extract confidence scores from the token-level logits of the VLM, enabling controllable precision-recall trade-offs that were previously limited to encoder-based methods. Furthermore, we demonstrate that our model can be aligned to generate coherent natural-language rationales for its boundary decisions through minimal targeted supervision. Our approach achieves state-of-the-art performance on standard scene segmentation benchmarks. On MovieNet, for example, Scene-VLM yields significant improvements of +6 AP and +13.7 F1 over the previous leading method.

41. 【2512.21776】Inference-based GAN Video Generation

链接https://arxiv.org/abs/2512.21776

作者:Jingbo Yang,Adrian G. Bors

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Generative Adversarial Networks, generative deep learning, Video, remarkable progresses, recently Diffusion Networks

备注

点击查看摘要

Abstract:Video generation has seen remarkable progresses thanks to advancements in generative deep learning. Generated videos should not only display coherent and continuous movement but also meaningful movement in successions of scenes. Generating models such as Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs) and more recently Diffusion Networks have been used for generating short video sequences, usually of up to 16 frames. In this paper, we first propose a new type of video generator by enabling adversarial-based unconditional video generators with a variational encoder, akin to a VAE-GAN hybrid structure, in order to enable the generation process with inference capabilities. The proposed model, as in other video deep learning-based processing frameworks, incorporates two processing branches, one for content and another for movement. However, existing models struggle with the temporal scaling of the generated videos. In classical approaches when aiming to increase the generated video length, the resulting video quality degrades, particularly when considering generating significantly long sequences. To overcome this limitation, our research study extends the initially proposed VAE-GAN video generation model by employing a novel, memory-efficient approach to generate long videos composed of hundreds or thousands of frames ensuring their temporal continuity, consistency and dynamics. Our approach leverages a Markov chain framework with a recall mechanism, with each state representing a VAE-GAN short-length video generator. This setup allows for the sequential connection of generated video sub-sequences, enabling temporal dependencies, resulting in meaningful long video sequences.

42. 【2512.21769】BertsWin: Resolving Topological Sparsity in 3D Masked Autoencoders via Component-Balanced Structural Optimization

链接https://arxiv.org/abs/2512.21769

作者:Evgeny Alves Limarenko,Anastasiia Studenikina

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

关键词:Vision Transformers, approaches demonstrates promising, medical imaging, Swin Transformer windows, fraught with difficulties

备注: Code available at [this https URL](https://github.com/AlevLab-dev/BertsWinMAE) and [this https URL](https://github.com/AlevLab-dev/GCond) . Zenodo repository (DOI: [https://doi.org/10.5281/zenodo.17916932](https://doi.org/10.5281/zenodo.17916932) ) contains source images, training logs, trained models, and code

点击查看摘要

Abstract:The application of self-supervised learning (SSL) and Vision Transformers (ViTs) approaches demonstrates promising results in the field of 2D medical imaging, but the use of these methods on 3D volumetric images is fraught with difficulties. Standard Masked Autoencoders (MAE), which are state-of-the-art solution for 2D, have a hard time capturing three-dimensional spatial relationships, especially when 75% of tokens are discarded during pre-training. We propose BertsWin, a hybrid architecture combining full BERT-style token masking using Swin Transformer windows, to enhance spatial context learning in 3D during SSL pre-training. Unlike the classic MAE, which processes only visible areas, BertsWin introduces a complete 3D grid of tokens (masked and visible), preserving the spatial topology. And to smooth out the quadratic complexity of ViT, single-level local Swin windows are used. We introduce a structural priority loss function and evaluate the results of cone beam computed tomography of the temporomandibular joints. The subsequent assessment includes TMJ segmentation on 3D CT scans. We demonstrate that the BertsWin architecture, by maintaining a complete three-dimensional spatial topology, inherently accelerates semantic convergence by a factor of 5.8x compared to standard ViT-MAE baselines. Furthermore, when coupled with our proposed GradientConductor optimizer, the full BertsWin framework achieves a 15-fold reduction in training epochs (44 vs 660) required to reach state-of-the-art reconstruction fidelity. Analysis reveals that BertsWin achieves this acceleration without the computational penalty typically associated with dense volumetric processing. At canonical input resolutions, the architecture maintains theoretical FLOP parity with sparse ViT baselines, resulting in a significant net reduction in total computational resources due to faster convergence.

43. 【2512.21760】A-QCF-Net: An Adaptive Quaternion Cross-Fusion Network for Multimodal Liver Tumor Segmentation from Unpaired Datasets

链接https://arxiv.org/abs/2512.21760

作者:Arunkumar V,Firos V M,Senthilkumar S,Gangadharan G R

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Multimodal medical imaging, Adaptive Quaternion Cross-Fusion, Quaternion Neural Networks, Multimodal medical, Quaternion Cross-Fusion Network

备注

点击查看摘要

Abstract:Multimodal medical imaging provides complementary information that is crucial for accurate delineation of pathology, but the development of deep learning models is limited by the scarcity of large datasets in which different modalities are paired and spatially aligned. This paper addresses this fundamental limitation by proposing an Adaptive Quaternion Cross-Fusion Network (A-QCF-Net) that learns a single unified segmentation model from completely separate and unpaired CT and MRI cohorts. The architecture exploits the parameter efficiency and expressive power of Quaternion Neural Networks to construct a shared feature space. At its core is the Adaptive Quaternion Cross-Fusion (A-QCF) block, a data driven attention module that enables bidirectional knowledge transfer between the two streams. By learning to modulate the flow of information dynamically, the A-QCF block allows the network to exchange abstract modality specific expertise, such as the sharp anatomical boundary information available in CT and the subtle soft tissue contrast provided by MRI. This mutual exchange regularizes and enriches the feature representations of both streams. We validate the framework by jointly training a single model on the unpaired LiTS (CT) and ATLAS (MRI) datasets. The jointly trained model achieves Tumor Dice scores of 76.7% on CT and 78.3% on MRI, significantly exceeding the strong unimodal nnU-Net baseline by margins of 5.4% and 4.7% respectively. Furthermore, comprehensive explainability analysis using Grad-CAM and Grad-CAM++ confirms that the model correctly focuses on relevant pathological structures, ensuring the learned representations are clinically meaningful. This provides a robust and clinically viable paradigm for unlocking the large unpaired imaging archives that are common in healthcare.

44. 【2512.21747】Modified TSception for Analyzing Driver Drowsiness and Mental Workload from EEG

链接https://arxiv.org/abs/2512.21747

作者:Gourav Siddhad,Anurag Singh,Rajkumar Saini,Partha Pratim Roy

类目:Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)

关键词:ensure road safety, Driver drowsiness remains, reliable detection systems, Modified TSception architecture, traffic accidents

备注: 8 Pages, 3 Figures, 1 Table

点击查看摘要

Abstract:Driver drowsiness remains a primary cause of traffic accidents, necessitating the development of real-time, reliable detection systems to ensure road safety. This study presents a Modified TSception architecture designed for the robust assessment of driver fatigue using Electroencephalography (EEG). The model introduces a novel hierarchical architecture that surpasses the original TSception by implementing a five-layer temporal refinement strategy to capture multi-scale brain dynamics. A key innovation is the use of Adaptive Average Pooling, which provides the structural flexibility to handle varying EEG input dimensions, and a two - stage fusion mechanism that optimizes the integration of spatiotemporal features for improved stability. When evaluated on the SEED-VIG dataset and compared against established methods - including SVM, Transformer, EEGNet, ConvNeXt, LMDA-Net, and the original TSception - the Modified TSception achieves a comparable accuracy of 83.46% (vs. 83.15% for the original). Critically, the proposed model exhibits a substantially reduced confidence interval (0.24 vs. 0.36), signifying a marked improvement in performance stability. Furthermore, the architecture's generalizability is validated on the STEW mental workload dataset, where it achieves state-of-the-art results with 95.93% and 95.35% accuracy for 2-class and 3-class classification, respectively. These improvements in consistency and cross-task generalizability underscore the effectiveness of the proposed modifications for reliable EEG-based monitoring of drowsiness and mental workload.

45. 【2512.21743】Dynamic Feedback Engines: Layer-Wise Control for Self-Regulating Continual Learning

链接https://arxiv.org/abs/2512.21743

作者:Hengyi Wu,Zhenyi Wang,Heng Huang

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:catastrophic forgetting, aims to acquire, previously learned, struggle with catastrophic, Continual learning aims

备注: 14 pages

点击查看摘要

Abstract:Continual learning aims to acquire new tasks while preserving performance on previously learned ones, but most methods struggle with catastrophic forgetting. Existing approaches typically treat all layers uniformly, often trading stability for plasticity or vice versa. However, different layers naturally exhibit varying levels of uncertainty (entropy) when classifying tasks. High-entropy layers tend to underfit by failing to capture task-specific patterns, while low-entropy layers risk overfitting by becoming overly confident and specialized. To address this imbalance, we propose an entropy-aware continual learning method that employs a dynamic feedback mechanism to regulate each layer based on its entropy. Specifically, our approach reduces entropy in high-entropy layers to mitigate underfitting and increases entropy in overly confident layers to alleviate overfitting. This adaptive regulation encourages the model to converge to wider local minima, which have been shown to improve generalization. Our method is general and can be seamlessly integrated with both replay- and regularization-based approaches. Experiments on various datasets demonstrate substantial performance gains over state-of-the-art continual learning baselines.

46. 【2512.21736】SyncAnyone: Implicit Disentanglement via Progressive Self-Correction for Lip-Syncing in the wild

链接https://arxiv.org/abs/2512.21736

作者:Xindi Zhang,Dechao Meng,Steven Xiao,Qi Wang,Peng Zhang,Bang Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:High-quality AI-powered video, High-quality AI-powered, AI-powered video dubbing, video dubbing demands, precise audio-lip synchronization

备注

点击查看摘要

Abstract:High-quality AI-powered video dubbing demands precise audio-lip synchronization, high-fidelity visual generation, and faithful preservation of identity and background. Most existing methods rely on a mask-based training strategy, where the mouth region is masked in talking-head videos, and the model learns to synthesize lip movements from corrupted inputs and target audios. While this facilitates lip-sync accuracy, it disrupts spatiotemporal context, impairing performance on dynamic facial motions and causing instability in facial structure and background consistency. To overcome this limitation, we propose SyncAnyone, a novel two-stage learning framework that achieves accurate motion modeling and high visual fidelity simultaneously. In Stage 1, we train a diffusion-based video transformer for masked mouth inpainting, leveraging its strong spatiotemporal modeling to generate accurate, audio-driven lip movements. However, due to input corruption, minor artifacts may arise in the surrounding facial regions and the background. In Stage 2, we develop a mask-free tuning pipeline to address mask-induced artifacts. Specifically, on the basis of the Stage 1 model, we develop a data generation pipeline that creates pseudo-paired training samples by synthesizing lip-synced videos from the source video and random sampled audio. We further tune the stage 2 model on this synthetic data, achieving precise lip editing and better background consistency. Extensive experiments show that our method achieves state-of-the-art results in visual quality, temporal coherence, and identity preservation under in-the wild lip-syncing scenarios.

47. 【2512.21734】Knot Forcing: Taming Autoregressive Video Diffusion Models for Real-time Infinite Interactive Portrait Animation

链接https://arxiv.org/abs/2512.21734

作者:Steven Xiao,XIndi Zhang,Dechao Meng,Qi Wang,Peng Zhang,Bang Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:requiring high visual, high visual fidelity, ultra-low latency, live avatars, requiring high

备注

点击查看摘要

Abstract:Real-time portrait animation is essential for interactive applications such as virtual assistants and live avatars, requiring high visual fidelity, temporal coherence, ultra-low latency, and responsive control from dynamic inputs like reference images and driving signals. While diffusion-based models achieve strong quality, their non-causal nature hinders streaming deployment. Causal autoregressive video generation approaches enable efficient frame-by-frame generation but suffer from error accumulation, motion discontinuities at chunk boundaries, and degraded long-term consistency. In this work, we present a novel streaming framework named Knot Forcing for real-time portrait animation that addresses these challenges through three key designs: (1) a chunk-wise generation strategy with global identity preservation via cached KV states of the reference image and local temporal modeling using sliding window attention; (2) a temporal knot module that overlaps adjacent chunks and propagates spatio-temporal cues via image-to-video conditioning to smooth inter-chunk motion transitions; and (3) A "running ahead" mechanism that dynamically updates the reference frame's temporal coordinate during inference, keeping its semantic context ahead of the current rollout frame to support long-term coherence. Knot Forcing enables high-fidelity, temporally consistent, and interactive portrait animation over infinite sequences, achieving real-time performance with strong visual stability on consumer-grade GPUs.

48. 【2512.21714】AstraNav-World: World Model for Foresight Control and Consistency

链接https://arxiv.org/abs/2512.21714

作者:Junjun Hu,Jintao Chen,Haochen Bai,Minghua Luo,Shichao Xie,Ziyi Chen,Fei Liu,Zedong Chu,Xinda Xue,Botao Ren,Xiaolong Wu,Mu Xu,Shanghang Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:environments demands accurate, dynamic environments demands, demands accurate foresight, unfold over time, environments demands

备注

点击查看摘要

Abstract:Embodied navigation in open, dynamic environments demands accurate foresight of how the world will evolve and how actions will unfold over time. We propose AstraNav-World, an end-to-end world model that jointly reasons about future visual states and action sequences within a unified probabilistic framework. Our framework integrates a diffusion-based video generator with a vision-language policy, enabling synchronized rollouts where predicted scenes and planned actions are updated simultaneously. Training optimizes two complementary objectives: generating action-conditioned multi-step visual predictions and deriving trajectories conditioned on those predicted visuals. This bidirectional constraint makes visual predictions executable and keeps decisions grounded in physically consistent, task-relevant futures, mitigating cumulative errors common in decoupled "envision-then-plan" pipelines. Experiments across diverse embodied navigation benchmarks show improved trajectory accuracy and higher success rates. Ablations confirm the necessity of tight vision-action coupling and unified training, with either branch removal degrading both prediction quality and policy reliability. In real-world testing, AstraNav-World demonstrated exceptional zero-shot capabilities, adapting to previously unseen scenarios without any real-world fine-tuning. These results suggest that AstraNav-World captures transferable spatial understanding and planning-relevant navigation dynamics, rather than merely overfitting to simulation-specific data distribution. Overall, by unifying foresight vision and control within a single generative model, we move closer to reliable, interpretable, and general-purpose embodied agents that operate robustly in open-ended real-world settings.

49. 【2512.21710】RAPTOR: Real-Time High-Resolution UAV Video Prediction with Efficient Video Attention

链接https://arxiv.org/abs/2512.21710

作者:Zhan Chen,Zile Guo,Enze Zhu,Peirong Zhang,Xiaoxuan Liu,Lei Wang,Yidan Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:perceptual quality typically, fundamental trilemma, latency-critical applications, perceptual quality, quality typically

备注

点击查看摘要

Abstract:Video prediction is plagued by a fundamental trilemma: achieving high-resolution and perceptual quality typically comes at the cost of real-time speed, hindering its use in latency-critical applications. This challenge is most acute for autonomous UAVs in dense urban environments, where foreseeing events from high-resolution imagery is non-negotiable for safety. Existing methods, reliant on iterative generation (diffusion, autoregressive models) or quadratic-complexity attention, fail to meet these stringent demands on edge hardware. To break this long-standing trade-off, we introduce RAPTOR, a video prediction architecture that achieves real-time, high-resolution performance. RAPTOR's single-pass design avoids the error accumulation and latency of iterative approaches. Its core innovation is Efficient Video Attention (EVA), a novel translator module that factorizes spatiotemporal modeling. Instead of processing flattened spacetime tokens with $O((ST)^2)$ or $O(ST)$ complexity, EVA alternates operations along the spatial (S) and temporal (T) axes. This factorization reduces the time complexity to $O(S + T)$ and memory complexity to $O(max(S, T))$, enabling global context modeling at $512^2$ resolution and beyond, operating directly on dense feature maps with a patch-free design. Complementing this architecture is a 3-stage training curriculum that progressively refines predictions from coarse structure to sharp, temporally coherent details. Experiments show RAPTOR is the first predictor to exceed 30 FPS on a Jetson AGX Orin for $512^2$ video, setting a new state-of-the-art on UAVid, KTH, and a custom high-resolution dataset in PSNR, SSIM, and LPIPS. Critically, RAPTOR boosts the mission success rate in a real-world UAV navigation task by 18/%, paving the way for safer and more anticipatory embodied agents.

50. 【2512.21707】Spatiotemporal-Untrammelled Mixture of Experts for Multi-Person Motion Prediction

链接https://arxiv.org/abs/2512.21707

作者:Zheng Yin,Chengjian Li,Xiangbo Shu,Meiqi Cao,Rui Yan,Jinhui Tang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Comprehensively, Comprehensively and flexibly, Experts, complex spatio-temporal, motion prediction

备注: 12 pages, 7 figures, Accepted by AAAI 2026 (oral)

点击查看摘要

Abstract:Comprehensively and flexibly capturing the complex spatio-temporal dependencies of human motion is critical for multi-person motion prediction. Existing methods grapple with two primary limitations: i) Inflexible spatiotemporal representation due to reliance on positional encodings for capturing spatiotemporal information. ii) High computational costs stemming from the quadratic time complexity of conventional attention mechanisms. To overcome these limitations, we propose the Spatiotemporal-Untrammelled Mixture of Experts (ST-MoE), which flexibly explores complex spatio-temporal dependencies in human motion and significantly reduces computational cost. To adaptively mine complex spatio-temporal patterns from human motion, our model incorporates four distinct types of spatiotemporal experts, each specializing in capturing different spatial or temporal dependencies. To reduce the potential computational overhead while integrating multiple experts, we introduce bidirectional spatiotemporal Mamba as experts, each sharing bidirectional temporal and spatial Mamba in distinct combinations to achieve model efficiency and parameter economy. Extensive experiments on four multi-person benchmark datasets demonstrate that our approach not only outperforms state-of-art in accuracy but also reduces model parameter by 41.38% and achieves a 3.6x speedup in training. The code is available at this https URL.

51. 【2512.21695】FUSE: Unifying Spectral and Semantic Cues for Robust AI-Generated Image Detection

链接https://arxiv.org/abs/2512.21695

作者:Md. Zahid Hossain,Most. Sharmin Sultana Samu,Md. Kamrozzaman Bhuiyan,Farhad Uz Zaman,Md. Rakibul Islam

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Fast Fourier Transform, CLIP Vision encoder, evolution of generative, heightened the demand, demand for reliable

备注: accepted for publication in 2025 28th International Conference on Computer and Information Technology (ICCIT)

点击查看摘要

Abstract:The fast evolution of generative models has heightened the demand for reliable detection of AI-generated images. To tackle this challenge, we introduce FUSE, a hybrid system that combines spectral features extracted through Fast Fourier Transform with semantic features obtained from the CLIP's Vision encoder. The features are fused into a joint representation and trained progressively in two stages. Evaluations on GenImage, WildFake, DiTFake, GPT-ImgEval and Chameleon datasets demonstrate strong generalization across multiple generators. Our FUSE (Stage 1) model demonstrates state-of-the-art results on the Chameleon benchmark. It also attains 91.36% mean accuracy on the GenImage dataset, 88.71% accuracy across all tested generators, and a mean Average Precision of 94.96%. Stage 2 training further improves performance for most generators. Unlike existing methods, which often perform poorly on high-fidelity images in Chameleon, our approach maintains robustness across diverse generators. These findings highlight the benefits of integrating spectral and semantic features for generalized detection of images generated by AI.

52. 【2512.21694】BeHGAN: Bengali Handwritten Word Generation from Plain Text Using Generative Adversarial Networks

链接https://arxiv.org/abs/2512.21694

作者:Md. Rakibul Islam,Md. Kamrozzaman Bhuiyan,Safwan Muntasir,Arifur Rahman Jawad,Most. Sharmin Sultana Samu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Handwritten Text Recognition, Text Recognition, Handwritten Text, Handwritten Text Generation, HTR

备注: Accepted for publication in 2025 28th International Conference on Computer and Information Technology (ICCIT)

点击查看摘要

Abstract:Handwritten Text Recognition (HTR) is a well-established research area. In contrast, Handwritten Text Generation (HTG) is an emerging field with significant potential. This task is challenging due to the variation in individual handwriting styles. A large and diverse dataset is required to generate realistic handwritten text. However, such datasets are difficult to collect and are not readily available. Bengali is the fifth most spoken language in the world. While several studies exist for languages such as English and Arabic, Bengali handwritten text generation has received little attention. To address this gap, we propose a method for generating Bengali handwritten words. We developed and used a self-collected dataset of Bengali handwriting samples. The dataset includes contributions from approximately five hundred individuals across different ages and genders. All images were pre-processed to ensure consistency and quality. Our approach demonstrates the ability to produce diverse handwritten outputs from input plain text. We believe this work contributes to the advancement of Bengali handwriting generation and can support further research in this area.

53. 【2512.21693】Prior-AttUNet: Retinal OCT Fluid Segmentation Based on Normal Anatomical Priors and Attention Gating

链接https://arxiv.org/abs/2512.21693

作者:Li Yang,Yuting Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:diabetic macular edema, age-related macular degeneration, macular edema, hallmark pathological feature, age-related macular

备注

点击查看摘要

Abstract:Accurate segmentation of macular edema, a hallmark pathological feature in vision-threatening conditions such as age-related macular degeneration and diabetic macular edema, is essential for clinical diagnosis and management. To overcome the challenges of segmenting fluid regions in optical coherence tomography (OCT) images-notably ambiguous boundaries and cross-device heterogeneity-this study introduces Prior-AttUNet, a segmentation model augmented with generative anatomical priors. The framework adopts a hybrid dual-path architecture that integrates a generative prior pathway with a segmentation network. A variational autoencoder supplies multi-scale normative anatomical priors, while the segmentation backbone incorporates densely connected blocks and spatial pyramid pooling modules to capture richer contextual information. Additionally, a novel triple-attention mechanism, guided by anatomical priors, dynamically modulates feature importance across decoding stages, substantially enhancing boundary delineation. Evaluated on the public RETOUCH benchmark, Prior-AttUNet achieves excellent performance across three OCT imaging devices (Cirrus, Spectralis, and Topcon), with mean Dice similarity coefficients of 93.93%, 95.18%, and 93.47%, respectively. The model maintains a low computational cost of 0.37 TFLOPs, striking an effective balance between segmentation precision and inference efficiency. These results demonstrate its potential as a reliable tool for automated clinical analysis.

54. 【2512.21692】ShinyNeRF: Digitizing Anisotropic Appearance in Neural Radiance Fields

链接https://arxiv.org/abs/2512.21692

作者:Albert Barreiro,Roger Marí,Rafael Redondo,Gloria Haro,Carles Bosch

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

关键词:Recent advances, Neural Radiance Fields, cultural heritage, technologies have transformed, transformed the preservation

备注

点击查看摘要

Abstract:Recent advances in digitization technologies have transformed the preservation and dissemination of cultural heritage. In this vein, Neural Radiance Fields (NeRF) have emerged as a leading technology for 3D digitization, delivering representations with exceptional realism. However, existing methods struggle to accurately model anisotropic specular surfaces, typically observed, for example, on brushed metals. In this work, we introduce ShinyNeRF, a novel framework capable of handling both isotropic and anisotropic reflections. Our method is capable of jointly estimating surface normals, tangents, specular concentration, and anisotropy magnitudes of an Anisotropic Spherical Gaussian (ASG) distribution, by learning an approximation of the outgoing radiance as an encoded mixture of isotropic von Mises-Fisher (vMF) distributions. Experimental results show that ShinyNeRF not only achieves state-of-the-art performance on digitizing anisotropic specular reflections, but also offers plausible physical interpretations and editing of material properties compared to existing methods.

55. 【2512.21691】Analyzing the Mechanism of Attention Collapse in VGGT from a Dynamics Perspective

链接https://arxiv.org/abs/2512.21691

作者:Huan Li,Longjun Luo,Yuling Shi,Xiaodong Gu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:http URL prove, http URL theory, Visual Geometry Grounded, Geometry Grounded Transformer, URL prove that,in

备注: 8 pages, 4 figures

点击查看摘要

Abstract:Visual Geometry Grounded Transformer (VGGT) delivers state-of-the-art feed-forward 3D reconstruction, yet its global self-attention layer suffers from a drastic collapse phenomenon when the input sequence exceeds a few hundred frames: attention matrices rapidly become near rank-one, token geometry degenerates to an almost one-dimensional subspace, and reconstruction error accumulates this http URL this report,we establish a rigorous mathematical explanation of the collapse by viewing the global-attention iteration as a degenerate diffusion this http URL prove that,in VGGT, the token-feature flow converges toward a Dirac-type measure at a $O(1/L)$ rate, where $L$ is the layer index, yielding a closed-form mean-field partial differential equation that precisely predicts the empirically observed rank this http URL theory quantitatively matches the attention-heat-map evolution and a series of experiments outcomes reported in relevant works and explains why its token-merging remedy -- which periodically removes redundant tokens -- slows the effective diffusion coefficient and thereby delays collapse without additional this http URL believe the analysis provides a principled lens for interpreting future scalable 3D-vision transformers,and we highlight its potential for multi-modal generalization.

56. 【2512.21684】SlideChain: Semantic Provenance for Lecture Understanding via Blockchain Registration

链接https://arxiv.org/abs/2512.21684

作者:Md Motaleb Hossen Manik,Md Zabirul Islam,Ge Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:outputs remain challenging, Modern vision, semantic outputs remain, challenging to verify, audit over time

备注

点击查看摘要

Abstract:Modern vision--language models (VLMs) are increasingly used to interpret and generate educational content, yet their semantic outputs remain challenging to verify, reproduce, and audit over time. Inconsistencies across model families, inference settings, and computing environments undermine the reliability of AI-generated instructional material, particularly in high-stakes and quantitative STEM domains. This work introduces SlideChain, a blockchain-backed provenance framework designed to provide verifiable integrity for multimodal semantic extraction at scale. Using the SlideChain Slides Dataset-a curated corpus of 1,117 medical imaging lecture slides from a university course-we extract concepts and relational triples from four state-of-the-art VLMs and construct structured provenance records for every slide. SlideChain anchors cryptographic hashes of these records on a local EVM (Ethereum Virtual Machine)-compatible blockchain, providing tamper-evident auditability and persistent semantic baselines. Through the first systematic analysis of semantic disagreement, cross-model similarity, and lecture-level variability in multimodal educational content, we reveal pronounced cross-model discrepancies, including low concept overlap and near-zero agreement in relational triples on many slides. We further evaluate gas usage, throughput, and scalability under simulated deployment conditions, and demonstrate perfect tamper detection along with deterministic reproducibility across independent extraction runs. Together, these results show that SlideChain provides a practical and scalable step toward trustworthy, verifiable multimodal educational pipelines, supporting long-term auditability, reproducibility, and integrity for AI-assisted instructional systems.

57. 【2512.21683】Contrastive Graph Modeling for Cross-Domain Few-Shot Medical Image Segmentation

链接https://arxiv.org/abs/2512.21683

作者:Yuntian Bo,Tao Zhou,Zechao Li,Haofeng Zhang,Ling Shao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Cross-domain few-shot medical, offers a promising, analysis is required, few-shot medical image, promising and data-efficient

备注: Accepted to IEEE Transactions on Medical Imaging (T-MI), 2026

点击查看摘要

Abstract:Cross-domain few-shot medical image segmentation (CD-FSMIS) offers a promising and data-efficient solution for medical applications where annotations are severely scarce and multimodal analysis is required. However, existing methods typically filter out domain-specific information to improve generalization, which inadvertently limits cross-domain performance and degrades source-domain accuracy. To address this, we present Contrastive Graph Modeling (C-Graph), a framework that leverages the structural consistency of medical images as a reliable domain-transferable prior. We represent image features as graphs, with pixels as nodes and semantic affinities as edges. A Structural Prior Graph (SPG) layer is proposed to capture and transfer target-category node dependencies and enable global structure modeling through explicit node interactions. Building upon SPG layers, we introduce a Subgraph Matching Decoding (SMD) mechanism that exploits semantic relations among nodes to guide prediction. Furthermore, we design a Confusion-minimizing Node Contrast (CNC) loss to mitigate node ambiguity and subgraph heterogeneity by contrastively enhancing node discriminability in the graph space. Our method significantly outperforms prior CD-FSMIS approaches across multiple cross-domain benchmarks, achieving state-of-the-art performance while simultaneously preserving strong segmentation accuracy on the source domain.

58. 【2512.21675】UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture

链接https://arxiv.org/abs/2512.21675

作者:Shuo Cao,Jiayang Li,Xiaohui Li,Yuandong Pu,Kaiwen Zhu,Yuanting Gao,Siqi Luo,Yi Xin,Qi Qin,Yu Zhou,Xiangyu Chen,Wenlong Zhang,Bin Fu,Yu Qiao,Yihao Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:achieved remarkable progress, perceptual-level image understanding, Multimodal large language, image understanding, perceptual-level image

备注: 27 pages, 14 figures, 17 tables

点击查看摘要

Abstract:Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks such as visual grounding, segmentation, and captioning. However, their ability to perceive perceptual-level image features remains limited. In this work, we present UniPercept-Bench, a unified framework for perceptual-level image understanding across three key domains: Aesthetics, Quality, Structure and Texture. We establish a hierarchical definition system and construct large-scale datasets to evaluate perceptual-level image understanding. Based on this foundation, we develop a strong baseline UniPercept trained via Domain-Adaptive Pre-Training and Task-Aligned RL, enabling robust generalization across both Visual Rating (VR) and Visual Question Answering (VQA) tasks. UniPercept outperforms existing MLLMs on perceptual-level image understanding and can serve as a plug-and-play reward model for text-to-image generation. This work defines Perceptual-Level Image Understanding in the era of MLLMs and, through the introduction of a comprehensive benchmark together with a strong baseline, provides a solid foundation for advancing perceptual-level multimodal image understanding.

59. 【2512.21673】Comparative Analysis of Deep Learning Models for Perception in Autonomous Vehicles

链接https://arxiv.org/abs/2512.21673

作者:Jalal Khan

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:machine learning, deep learning, plethora of machine, reliability of autonomous, autonomous vehicles

备注: 6 pages, 3 figures

点击查看摘要

Abstract:Recently, a plethora of machine learning (ML) and deep learning (DL) algorithms have been proposed to achieve the efficiency, safety, and reliability of autonomous vehicles (AVs). The AVs use a perception system to detect, localize, and identify other vehicles, pedestrians, and road signs to perform safe navigation and decision-making. In this paper, we compare the performance of DL models, including YOLO-NAS and YOLOv8, for a detection-based perception task. We capture a custom dataset and experiment with both DL models using our custom dataset. Our analysis reveals that the YOLOv8s model saves 75% of training time compared to the YOLO-NAS model. In addition, the YOLOv8s model (83%) outperforms the YOLO-NAS model (81%) when the target is to achieve the highest object detection accuracy. These comparative analyses of these new emerging DL models will allow the relevant research community to understand the models' performance under real-world use case scenarios.

60. 【2512.21670】he Deepfake Detective: Interpreting Neural Forensics Through Sparse Features and Manifolds

链接https://arxiv.org/abs/2512.21670

作者:Subramanyam Sahoo,Jared Junkin

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:identifying synthetic media, remain largely opaque, achieved high accuracy, decision processes remain, processes remain largely

备注: 10 pages, 5 figures, Initial Work

点击查看摘要

Abstract:Deepfake detection models have achieved high accuracy in identifying synthetic media, but their decision processes remain largely opaque. In this paper we present a mechanistic interpretability framework for deepfake detection applied to a vision-language model. Our approach combines a sparse autoencoder (SAE) analysis of internal network representations with a novel forensic manifold analysis that probes how the model's features respond to controlled forensic artifact manipulations. We demonstrate that only a small fraction of latent features are actively used in each layer, and that the geometric properties of the model's feature manifold, including intrinsic dimensionality, curvature, and feature selectivity, vary systematically with different types of deepfake artifacts. These insights provide a first step toward opening the "black box" of deepfake detectors, allowing us to identify which learned features correspond to specific forensic artifacts and to guide the development of more interpretable and robust models.

61. 【2512.21643】Omni-Weather: Unified Multimodal Foundation Model for Weather Generation and Understanding

链接https://arxiv.org/abs/2512.21643

作者:Zhiwang Zhou,Yuandong Pu,Xuming He,Yidi Liu,Yixin Chen,Junchao Gong,Xiang Zhuang,Wanghan Xu,Qinglong Cao,Shixiang Tang,Yihao Liu,Wenlong Zhang,Lei Bai

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:existing methods treat, Weather modeling requires, weather generation, mechanistic interpretation, goals in isolation

备注: 25 pages, 12 figures. ICLR 2026 submission

点击查看摘要

Abstract:Weather modeling requires both accurate prediction and mechanistic interpretation, yet existing methods treat these goals in isolation, separating generation from understanding. To address this gap, we present Omni-Weather, the first multimodal foundation model that unifies weather generation and understanding within a single architecture. Omni-Weather integrates a radar encoder for weather generation tasks, followed by unified processing using a shared self-attention mechanism. Moreover, we construct a Chain-of-Thought dataset for causal reasoning in weather generation, enabling interpretable outputs and improved perceptual quality. Extensive experiments show Omni-Weather achieves state-of-the-art performance in both weather generation and understanding. Our findings further indicate that generative and understanding tasks in the weather domain can mutually enhance each other. Omni-Weather also demonstrates the feasibility and value of unifying weather generation and understanding.

62. 【2512.21641】rackTeller: Temporal Multimodal 3D Grounding for Behavior-Dependent Object References

链接https://arxiv.org/abs/2512.21641

作者:Jiahong Yu,Ziqi Wang,Hailiang Zhao,Wei Zhai,Xueqiang Yan,Shuiguang Deng

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Understanding natural-language references, interactive autonomous systems, Understanding natural-language, driving scenes, autonomous systems

备注

点击查看摘要

Abstract:Understanding natural-language references to objects in dynamic 3D driving scenes is essential for interactive autonomous systems. In practice, many referring expressions describe targets through recent motion or short-term interactions, which cannot be resolved from static appearance or geometry alone. We study temporal language-based 3D grounding, where the objective is to identify the referred object in the current frame by leveraging multi-frame observations. We propose TrackTeller, a temporal multimodal grounding framework that integrates LiDAR-image fusion, language-conditioned decoding, and temporal reasoning in a unified architecture. TrackTeller constructs a shared UniScene representation aligned with textual semantics, generates language-aware 3D proposals, and refines grounding decisions using motion history and short-term dynamics. Experiments on the NuPrompt benchmark demonstrate that TrackTeller consistently improves language-grounded tracking performance, outperforming strong baselines with a 70% relative improvement in Average Multi-Object Tracking Accuracy and a 3.15-3.4 times reduction in False Alarm Frequency.

63. 【2512.21637】raining-Free Disentangled Text-Guided Image Editing via Sparse Latent Constraints

链接https://arxiv.org/abs/2512.21637

作者:Mutiara Shabrina,Nova Kurnia Putri,Jefri Satria Ferdiansyah,Sabita Khansa Dewi,Novanto Yudistira

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Text-driven image manipulation, adding bangs, Text-driven image, unintentionally alters, modifying a target

备注

点击查看摘要

Abstract:Text-driven image manipulation often suffers from attribute entanglement, where modifying a target attribute (e.g., adding bangs) unintentionally alters other semantic properties such as identity or appearance. The Predict, Prevent, and Evaluate (PPE) framework addresses this issue by leveraging pre-trained vision-language models for disentangled editing. In this work, we analyze the PPE framework, focusing on its architectural components, including BERT-based attribute prediction and StyleGAN2-based image generation on the CelebA-HQ dataset. Through empirical analysis, we identify a limitation in the original regularization strategy, where latent updates remain dense and prone to semantic leakage. To mitigate this issue, we introduce a sparsity-based constraint using L1 regularization on latent space manipulation. Experimental results demonstrate that the proposed approach enforces more focused and controlled edits, effectively reducing unintended changes in non-target attributes while preserving facial identity.

64. 【2512.21618】SymDrive: Realistic and Controllable Driving Simulator via Symmetric Auto-regressive Online Restoration

链接https://arxiv.org/abs/2512.21618

作者:Zhiyuan Liu,Daocheng Fu,Pinlong Cai,Lening Wang,Ying Liu,Yilong Ren,Botian Shi,Jianqiang Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:Autonomous Driving, long-tail data scarcity, existing methods struggle, interactive traffic editing, High-fidelity and controllable

备注

点击查看摘要

Abstract:High-fidelity and controllable 3D simulation is essential for addressing the long-tail data scarcity in Autonomous Driving (AD), yet existing methods struggle to simultaneously achieve photorealistic rendering and interactive traffic editing. Current approaches often falter in large-angle novel view synthesis and suffer from geometric or lighting artifacts during asset manipulation. To address these challenges, we propose SymDrive, a unified diffusion-based framework capable of joint high-quality rendering and scene editing. We introduce a Symmetric Auto-regressive Online Restoration paradigm, which constructs paired symmetric views to recover fine-grained details via a ground-truth-guided dual-view formulation and utilizes an auto-regressive strategy for consistent lateral view generation. Furthermore, we leverage this restoration capability to enable a training-free harmonization mechanism, treating vehicle insertion as context-aware inpainting to ensure seamless lighting and shadow consistency. Extensive experiments demonstrate that SymDrive achieves state-of-the-art performance in both novel-view enhancement and realistic 3D vehicle insertion.

65. 【2512.21617】CausalFSFG: Rethinking Few-Shot Fine-Grained Visual Categorization from Causal Perspective

链接https://arxiv.org/abs/2512.21617

作者:Zhiwen Yang,Jinglin Xu,Yuxin Pen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:fine-grained visual categorization, Few-shot fine-grained visual, visual categorization, focuses on identifying, common superclass

备注: 12 pages, 5 figures, accepted by IEEE TMM

点击查看摘要

Abstract:Few-shot fine-grained visual categorization (FS-FGVC) focuses on identifying various subcategories within a common superclass given just one or few support examples. Most existing methods aim to boost classification accuracy by enriching the extracted features with discriminative part-level details. However, they often overlook the fact that the set of support samples acts as a confounding variable, which hampers the FS-FGVC performance by introducing biased data distribution and misguiding the extraction of discriminative features. To address this issue, we propose a new causal FS-FGVC (CausalFSFG) approach inspired by causal inference for addressing biased data distributions through causal intervention. Specifically, based on the structural causal model (SCM), we argue that FS-FGVC infers the subcategories (i.e., effect) from the inputs (i.e., cause), whereas both the few-shot condition disturbance and the inherent fine-grained nature (i.e., large intra-class variance and small inter-class variance) lead to unobservable variables that bring spurious correlations, compromising the final classification performance. To further eliminate the spurious correlations, our CausalFSFG approach incorporates two key components: (1) Interventional multi-scale encoder (IMSE) conducts sample-level interventions, (2) Interventional masked feature reconstruction (IMFR) conducts feature-level interventions, which together reveal real causalities from inputs to subcategories. Extensive experiments and thorough analyses on the widely-used public datasets, including CUB-200-2011, Stanford Dogs, and Stanford Cars, demonstrate that our CausalFSFG achieves new state-of-the-art performance. The code is available at this https URL.

66. 【2512.21616】AMEing Long Contexts in Personalization: Towards Training-Free and State-Aware MLLM Personalized Assistant

链接https://arxiv.org/abs/2512.21616

作者:Rongpei Hong,Jian Lang,Ting Zhong,Yong Wang,Fan Zhou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Multimodal Large Language, Large Language Model, Multimodal Large, Language Model, Large Language

备注: Accepted by KDD 2026 research track. Code and data are available at [this https URL](https://github.com/ronpay/TAME)

点击查看摘要

Abstract:Multimodal Large Language Model (MLLM) Personalization is a critical research problem that facilitates personalized dialogues with MLLMs targeting specific entities (known as personalized concepts). However, existing methods and benchmarks focus on the simple, context-agnostic visual identification and textual replacement of the personalized concept (e.g., "A yellow puppy" - "Your puppy Mochi"), overlooking the ability to support long-context conversations. An ideal personalized MLLM assistant is capable of engaging in long-context dialogues with humans and continually improving its experience quality by learning from past dialogue histories. To bridge this gap, we propose LCMP, the first Long-Context MLLM Personalization evaluation benchmark. LCMP assesses the capability of MLLMs in perceiving variations of personalized concepts and generating contextually appropriate personalized responses that reflect these variations. As a strong baseline for LCMP, we introduce a novel training-free and state-aware framework TAME. TAME endows MLLMs with double memories to manage the temporal and persistent variations of each personalized concept in a differentiated manner. In addition, TAME incorporates a new training-free Retrieve-then-Align Augmented Generation (RA2G) paradigm. RA2G introduces an alignment step to extract the contextually fitted information from the multi-memory retrieved knowledge to the current questions, enabling better interactions for complex real-world user queries. Experiments on LCMP demonstrate that TAME achieves the best performance, showcasing remarkable and evolving interaction experiences in long-context scenarios.

67. 【2512.21602】Robustness and Scalability Of Machine Learning for Imbalanced Clinical Data in Emergency and Critical Care

链接https://arxiv.org/abs/2512.21602

作者:Yusuf Brima,Marcellin Atemkeng

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:require predictive models, severely imbalanced, Emergency and intensive, environments require predictive, models

备注

点击查看摘要

Abstract:Emergency and intensive care environments require predictive models that are both accurate and computationally efficient, yet clinical data in these settings are often severely imbalanced. Such skewness undermines model reliability, particularly for rare but clinically crucial outcomes, making robustness and scalability essential for real-world usage. In this paper, we systematically evaluate the robustness and scalability of classical machine learning models on imbalanced tabular data from MIMIC-IV-ED and eICU. Class imbalance was quantified using complementary metrics, and we compared the performance of tree-based methods, the state-of-the-art TabNet deep learning model, and a custom lightweight residual network. TabResNet was designed as a computationally efficient alternative to TabNet, replacing its complex attention mechanisms with a streamlined residual architecture to maintain representational capacity for real-time clinical use. All models were optimized via a Bayesian hyperparameter search and assessed on predictive performance, robustness to increasing imbalance, and computational scalability. Our results, on seven clinically vital predictive tasks, show that tree-based methods, particularly XGBoost, consistently achieved the most stable performance across imbalance levels and scaled efficiently with sample size. Deep tabular models degraded more sharply under imbalance and incurred higher computational costs, while TabResNet provided a lighter alternative to TabNet but did not surpass ensemble benchmarks. These findings indicate that in emergency and critical care, robustness to imbalance and computational scalability could outweigh architectural complexity. Tree-based ensemble methods currently offer the most practical and clinically feasible choice, equipping practitioners with a framework for selecting models suited to high-stakes, time-sensitive environments.

68. 【2512.21599】GaussianEM: Model compositional and conformational heterogeneity using 3D Gaussians

链接https://arxiv.org/abs/2512.21599

作者:Bintao He,Yiran Cheng,Hongjia Li,Xiang Gao,Xin Gao,Fa Zhang,Renmin Han

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Understanding protein flexibility, protein function study, Understanding protein, protein flexibility, protein function

备注

点击查看摘要

Abstract:Understanding protein flexibility and its dynamic interactions with other molecules is essential for protein function study. Cryogenic electron microscopy (cryo-EM) provides an opportunity to directly observe macromolecular dynamics. However, analyzing datasets that contain both continuous motions and discrete states remains highly challenging. Here we present GaussianEM, a Gaussian pseudo-atomic framework that simultaneously models compositional and conformational heterogeneity from experimental cryo-EM images. GaussianEM employs a two-encoder-one-decoder architecture to map an image to its individual Gaussian components, and represent structural variability through changes in Gaussian parameters. This approach provides an intuitive and interpretable description of conformational changes, preserves local structural consistency along the transition trajectories, and naturally bridges the gap between density-based models and corresponding atomic models. We demonstrate the effectiveness of GaussianEM on both simulated and experimental datasets.

69. 【2512.21598】From Shallow Humor to Metaphor: Towards Label-Free Harmful Meme Detection via LMM Agent Self-Improvement

链接https://arxiv.org/abs/2512.21598

作者:Jian Lang,Rongpei Hong,Ting Zhong,Leiting Chen,Qiang Gao,Fan Zhou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:media poses significant, poses significant risks, Large Multimodal Model, online media poses, health and stability

备注: 12 pages. Accepted by KDD 2026 research track. Codes are released at [this https URL](https://github.com/Jian-Lang/ALARM)

点击查看摘要

Abstract:The proliferation of harmful memes on online media poses significant risks to public health and stability. Existing detection methods heavily rely on large-scale labeled data for training, which necessitates substantial manual annotation efforts and limits their adaptability to the continually evolving nature of harmful content. To address these challenges, we present ALARM, the first lAbeL-free hARmful Meme detection framework powered by Large Multimodal Model (LMM) agent self-improvement. The core innovation of ALARM lies in exploiting the expressive information from "shallow" memes to iteratively enhance its ability to tackle more complex and subtle ones. ALARM consists of a novel Confidence-based Explicit Meme Identification mechanism that isolates the explicit memes from the original dataset and assigns them pseudo-labels. Besides, a new Pairwise Learning Guided Agent Self-Improvement paradigm is introduced, where the explicit memes are reorganized into contrastive pairs (positive vs. negative) to refine a learner LMM agent. This agent autonomously derives high-level detection cues from these pairs, which in turn empower the agent itself to handle complex and challenging memes effectively. Experiments on three diverse datasets demonstrate the superior performance and strong adaptability of ALARM to newly evolved memes. Notably, our method even outperforms label-driven methods. These results highlight the potential of label-free frameworks as a scalable and promising solution for adapting to novel forms and topics of harmful memes in dynamic online environments.

70. 【2512.21584】UltraLBM-UNet: Ultralight Bidirectional Mamba-based Model for Skin Lesion Segmentation

链接https://arxiv.org/abs/2512.21584

作者:Linxuan Fan(1),Juntao Jiang(2),Weixuan Liu(3),Zhucun Xue(2),Jiajun Lv(2),Jiangning Zhang(2),Yong Liu(2) ((1) Data Science Institute, Vanderbilt University, Nashville, USA (2) College of Control Science and Engineering, Zhejiang University, Hangzhou, China (3) School of Computer Science and Technology, East China Normal University, Shanghai, China)

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:guiding clinical decision-making, Skin lesion segmentation, Skin lesion, clinical decision-making, crucial step

备注: 12 pages, 4 figures

点击查看摘要

Abstract:Skin lesion segmentation is a crucial step in dermatology for guiding clinical decision-making. However, existing methods for accurate, robust, and resource-efficient lesion analysis have limitations, including low performance and high computational complexity. To address these limitations, we propose UltraLBM-UNet, a lightweight U-Net variant that integrates a bidirectional Mamba-based global modeling mechanism with multi-branch local feature perception. The proposed architecture integrates efficient local feature injection with bidirectional state-space modeling, enabling richer contextual interaction across spatial dimensions while maintaining computational compactness suitable for point-of-care deployment. Extensive experiments on the ISIC 2017, ISIC 2018, and PH2 datasets demonstrate that our model consistently achieves state-of-the-art segmentation accuracy, outperforming existing lightweight and Mamba counterparts with only 0.034M parameters and 0.060 GFLOPs. In addition, we introduce a hybrid knowledge distillation strategy to train an ultra-compact student model, where the distilled variant UltraLBM-UNet-T, with only 0.011M parameters and 0.019 GFLOPs, achieves competitive segmentation performance. These results highlight the suitability of UltraLBM-UNet for point-of-care deployment, where accurate and robust lesion analyses are essential. The source code is publicly available at this https URL.

71. 【2512.21582】LLM-Free Image Captioning Evaluation in Reference-Flexible Settings

链接https://arxiv.org/abs/2512.21582

作者:Shinnosuke Hirano,Yuiga Wada,Kazuki Matsuda,Seitaro Otsuki,Komei Sugiura

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:reference-free settings, reference-based and reference-free, automatic evaluation, settings, image

备注: Accepted for presentation at AAAI2026

点击查看摘要

Abstract:We focus on the automatic evaluation of image captions in both reference-based and reference-free settings. Existing metrics based on large language models (LLMs) favor their own generations; therefore, the neutrality is in question. Most LLM-free metrics do not suffer from such an issue, whereas they do not always demonstrate high performance. To address these issues, we propose Pearl, an LLM-free supervised metric for image captioning, which is applicable to both reference-based and reference-free settings. We introduce a novel mechanism that learns the representations of image--caption and caption--caption similarities. Furthermore, we construct a human-annotated dataset for image captioning metrics, that comprises approximately 333k human judgments collected from 2,360 annotators across over 75k images. Pearl outperformed other existing LLM-free metrics on the Composite, Flickr8K-Expert, Flickr8K-CF, Nebula, and FOIL datasets in both reference-based and reference-free settings. Our project page is available at this https URL.

72. 【2512.21576】owards Long-window Anchoring in Vision-Language Model Distillation

链接https://arxiv.org/abs/2512.21576

作者:Haoyi Zhou,Shuo Li,Tianyu Chen,Qi Song,Chonghan Gao,Jianxin Li

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:limited window size, Rotary Position Embeddings, prevalent small branches, large vision-language models, linguistics-photography alignment

备注: Accepted by AAAI 2026

点击查看摘要

Abstract:While large vision-language models (VLMs) demonstrate strong long-context understanding, their prevalent small branches fail on linguistics-photography alignment for a limited window size. We discover that knowledge distillation improves students' capability as a complement to Rotary Position Embeddings (RoPE) on window sizes (anchored from large models). Building on this insight, we propose LAid, which directly aims at the transfer of long-range attention mechanisms through two complementary components: (1) a progressive distance-weighted attention matching that dynamically emphasizes longer position differences during training, and (2) a learnable RoPE response gain modulation that selectively amplifies position sensitivity where needed. Extensive experiments across multiple model families demonstrate that LAid-distilled models achieve up to 3.2 times longer effective context windows compared to baseline small models, while maintaining or improving performance on standard VL benchmarks. Spectral analysis also suggests that LAid successfully preserves crucial low-frequency attention components that conventional methods fail to transfer. Our work not only provides practical techniques for building more efficient long-context VLMs but also offers theoretical insights into how positional understanding emerges and transfers during distillation.

73. 【2512.21562】Exploration of Reproducible Generated Image Detection

链接https://arxiv.org/abs/2512.21562

作者:Yihang Duan

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:detecting AI-Generated Content, insufficient gen eralizability, AI-Generated Content, advanced rapidly, gen eralizability

备注: AAAI workshop RAI accepted

点击查看摘要

Abstract:While the technology for detecting AI-Generated Content (AIGC) images has advanced rapidly, the field still faces two core issues: poor reproducibility and insufficient gen eralizability, which hinder the practical application of such technologies. This study addresses these challenges by re viewing 7 key papers on AIGC detection, constructing a lightweight test dataset, and reproducing a representative detection method. Through this process, we identify the root causes of the reproducibility dilemma in the field: firstly, papers often omit implicit details such as prepro cessing steps and parameter settings; secondly, most detec tion methods overfit to exclusive features of specific gener ators rather than learning universal intrinsic features of AIGC images. Experimental results show that basic perfor mance can be reproduced when strictly following the core procedures described in the original papers. However, de tection performance drops sharply when preprocessing dis rupts key features or when testing across different genera tors. This research provides empirical evidence for improv ing the reproducibility of AIGC detection technologies and offers reference directions for researchers to disclose ex perimental details more comprehensively and verify the generalizability of their proposed methods.

74. 【2512.21560】oward Intelligent Scene Augmentation for Context-Aware Object Placement and Sponsor-Logo Integration

链接https://arxiv.org/abs/2512.21560

作者:Unnati Saraswat,Tarun Rao,Namah Gupta,Shweta Swami,Shikhar Sharma,Prateek Narang,Dhruv Kumar

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Intelligent image editing, image editing increasingly, editing increasingly relies, Intelligent image, multimodal reasoning

备注

点击查看摘要

Abstract:Intelligent image editing increasingly relies on advances in computer vision, multimodal reasoning, and generative modeling. While vision-language models (VLMs) and diffusion models enable guided visual manipulation, existing work rarely ensures that inserted objects are \emph{contextually appropriate}. We introduce two new tasks for advertising and digital media: (1) \emph{context-aware object insertion}, which requires predicting suitable object categories, generating them, and placing them plausibly within the scene; and (2) \emph{sponsor-product logo augmentation}, which involves detecting products and inserting correct brand logos, even when items are unbranded or incorrectly branded. To support these tasks, we build two new datasets with category annotations, placement regions, and sponsor-product labels.

75. 【2512.21545】EraseLoRA: MLLM-Driven Foreground Exclusion and Background Subtype Aggregation for Dataset-Free Object Removal

链接https://arxiv.org/abs/2512.21545

作者:Sanghyun Jo,Donghwan Lee,Eunji Jung,Seong Je Oh,Kyungsu Kim

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Object removal differs, common inpainting, contextual fidelity, hole plausibly, differs from common

备注

点击查看摘要

Abstract:Object removal differs from common inpainting, since it must prevent the masked target from reappearing and reconstruct the occluded background with structural and contextual fidelity, rather than merely filling a hole plausibly. Recent dataset-free approaches that redirect self-attention inside the mask fail in two ways: non-target foregrounds are often misinterpreted as background, which regenerates unwanted objects, and direct attention manipulation disrupts fine details and hinders coherent integration of background cues. We propose EraseLoRA, a novel dataset-free framework that replaces attention surgery with background-aware reasoning and test-time adaptation. First, Background-aware Foreground Exclusion (BFE), uses a multimodal large-language models to separate target foreground, non-target foregrounds, and clean background from a single image-mask pair without paired supervision, producing reliable background cues while excluding distractors. Second, Background-aware Reconstruction with Subtype Aggregation (BRSA), performs test-time optimization that treats inferred background subtypes as complementary pieces and enforces their consistent integration through reconstruction and alignment objectives, preserving local detail and global structure without explicit attention intervention. We validate EraseLoRA as a plug-in to pretrained diffusion models and across benchmarks for object removal, demonstrating consistent improvements over dataset-free baselines and competitive results against dataset-driven methods. The code will be made available upon publication.

76. 【2512.21542】Vision Transformers are Circulant Attention Learners

链接https://arxiv.org/abs/2512.21542

作者:Dongchen Han,Tianyu Li,Ziyi Wang,Gao Huang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:key factor, vision Transformers, Circulant Attention, Block Circulant matrix, Circulant

备注: AAAI 2026

点击查看摘要

Abstract:The self-attention mechanism has been a key factor in the advancement of vision Transformers. However, its quadratic complexity imposes a heavy computational burden in high-resolution scenarios, restricting the practical application. Previous methods attempt to mitigate this issue by introducing handcrafted patterns such as locality or sparsity, which inevitably compromise model capacity. In this paper, we present a novel attention paradigm termed \textbf{Circulant Attention} by exploiting the inherent efficient pattern of self-attention. Specifically, we first identify that the self-attention matrix in vision Transformers often approximates the Block Circulant matrix with Circulant Blocks (BCCB), a kind of structured matrix whose multiplication with other matrices can be performed in $\mathcal{O}(N\log N)$ time. Leveraging this interesting pattern, we explicitly model the attention map as its nearest BCCB matrix and propose an efficient computation algorithm for fast calculation. The resulting approach closely mirrors vanilla self-attention, differing only in its use of BCCB matrices. Since our design is inspired by the inherent efficient paradigm, it not only delivers $\mathcal{O}(N\log N)$ computation complexity, but also largely maintains the capacity of standard self-attention. Extensive experiments on diverse visual tasks demonstrate the effectiveness of our approach, establishing circulant attention as a promising alternative to self-attention for vision Transformer architectures. Code is available at this https URL.

77. 【2512.21529】Hierarchy-Aware Fine-Tuning of Vision-Language Models

链接https://arxiv.org/abs/2512.21529

作者:Jiayu Li,Rajesh Gangireddy,Samet Akcay,Wei Cheng,Juhua Hu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:large-scale image-text pretraining, learn powerful multimodal, powerful multimodal representations, Vision-Language Models, learn powerful

备注

点击查看摘要

Abstract:Vision-Language Models (VLMs) learn powerful multimodal representations through large-scale image-text pretraining, but adapting them to hierarchical classification is underexplored. Standard approaches treat labels as flat categories and require full fine-tuning, which is expensive and produces inconsistent predictions across taxonomy levels. We propose an efficient hierarchy-aware fine-tuning framework that updates a few parameters while enforcing structural consistency. We combine two objectives: Tree-Path KL Divergence (TP-KL) aligns predictions along the ground-truth label path for vertical coherence, while Hierarchy-Sibling Smoothed Cross-Entropy (HiSCE) encourages consistent predictions among sibling classes. Both losses work in the VLM's shared embedding space and integrate with lightweight LoRA adaptation. Experiments across multiple benchmarks show consistent improvements in Full-Path Accuracy and Tree-based Inconsistency Error with minimal parameter overhead. Our approach provides an efficient strategy for adapting VLMs to structured taxonomies.

78. 【2512.21516】Global-Graph Guided and Local-Graph Weighted Contrastive Learning for Unified Clustering on Incomplete and Noise Multi-View Data

链接https://arxiv.org/abs/2512.21516

作者:Hongqing He,Jie Xu,Wenyuan Yang,Yonghua Zhu,Guoqiu Wen,Xiaofeng Zhu

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:attracted increasing attention, plays an important, increasing attention, contrastive learning, MVC

备注

点击查看摘要

Abstract:Recently, contrastive learning (CL) plays an important role in exploring complementary information for multi-view clustering (MVC) and has attracted increasing attention. Nevertheless, real-world multi-view data suffer from data incompleteness or noise, resulting in rare-paired samples or mis-paired samples which significantly challenges the effectiveness of CL-based MVC. That is, rare-paired issue prevents MVC from extracting sufficient multi-view complementary information, and mis-paired issue causes contrastive learning to optimize the model in the wrong direction. To address these issues, we propose a unified CL-based MVC framework for enhancing clustering effectiveness on incomplete and noise multi-view data. First, to overcome the rare-paired issue, we design a global-graph guided contrastive learning, where all view samples construct a global-view affinity graph to form new sample pairs for fully exploring complementary information. Second, to mitigate the mis-paired issue, we propose a local-graph weighted contrastive learning, which leverages local neighbors to generate pair-wise weights to adaptively strength or weaken the pair-wise contrastive learning. Our method is imputation-free and can be integrated into a unified global-local graph-guided contrastive learning framework. Extensive experiments on both incomplete and noise settings of multi-view data demonstrate that our method achieves superior performance compared with state-of-the-art approaches.

79. 【2512.21514】DiverseGRPO: Mitigating Mode Collapse in Image Generation via Diversity-Aware GRPO

链接https://arxiv.org/abs/2512.21514

作者:Henglin Liu,Huijuan Huang,Jing Wang,Chang Liu,Xiu Li,Xiangyang Ji

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Reinforcement learning, significantly by comparing, comparing the relative, relative performance, improves image generation

备注: Project Page: [this https URL](https://henglin-liu.github.io/DiverseGRPO/)

点击查看摘要

Abstract:Reinforcement learning (RL), particularly GRPO, improves image generation quality significantly by comparing the relative performance of images generated within the same group. However, in the later stages of training, the model tends to produce homogenized outputs, lacking creativity and visual diversity, which restricts its application scenarios. This issue can be analyzed from both reward modeling and generation dynamics perspectives. First, traditional GRPO relies on single-sample quality as the reward signal, driving the model to converge toward a few high-reward generation modes while neglecting distribution-level diversity. Second, conventional GRPO regularization neglects the dominant role of early-stage denoising in preserving diversity, causing a misaligned regularization budget that limits the achievable quality--diversity trade-off. Motivated by these insights, we revisit the diversity degradation problem from both reward modeling and generation dynamics. At the reward level, we propose a distributional creativity bonus based on semantic grouping. Specifically, we construct a distribution-level representation via spectral clustering over samples generated from the same caption, and adaptively allocate exploratory rewards according to group sizes to encourage the discovery of novel visual modes. At the generation level, we introduce a structure-aware regularization, which enforces stronger early-stage constraints to preserve diversity without compromising reward optimization efficiency. Experiments demonstrate that our method achieves a 13\%--18\% improvement in semantic diversity under matched quality scores, establishing a new Pareto frontier between image quality and diversity for GRPO-based image generation.

80. 【2512.21513】MuS-Polar3D: A Benchmark Dataset for Computational Polarimetric 3D Imaging under Multi-Scattering Conditions

链接https://arxiv.org/abs/2512.21513

作者:Puyun Wang,Kaimin Yu,Huayang He,Xianyu Wu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:exhibiting distinct advantages, exploits polarization cues, suppress background scattering, exhibiting distinct, turbid water

备注

点击查看摘要

Abstract:Polarization-based underwater 3D imaging exploits polarization cues to suppress background scattering, exhibiting distinct advantages in turbid water. Although data-driven polarization-based underwater 3D reconstruction methods show great potential, existing public datasets lack sufficient diversity in scattering and observation conditions, hindering fair comparisons among different approaches, including single-view and multi-view polarization imaging methods. To address this limitation, we construct MuS-Polar3D, a benchmark dataset comprising polarization images of 42 objects captured under seven quantitatively controlled scattering conditions and five viewpoints, together with high-precision 3D models (+/- 0.05 mm accuracy), normal maps, and foreground masks. The dataset supports multiple vision tasks, including normal estimation, object segmentation, descattering, and 3D reconstruction. Inspired by computational imaging, we further decouple underwater 3D reconstruction under scattering into a two-stage pipeline, namely descattering followed by 3D reconstruction, from an imaging-chain perspective. Extensive evaluations using multiple baseline methods under complex scattering conditions demonstrate the effectiveness of the proposed benchmark, achieving a best mean angular error of 15.49 degrees. To the best of our knowledge, MuS-Polar3D is the first publicly available benchmark dataset for quantitative turbidity underwater polarization-based 3D imaging, enabling accurate reconstruction and fair algorithm evaluation under controllable scattering conditions. The dataset and code are publicly available at this https URL.

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2512.21513 [cs.CV]

(or
arXiv:2512.21513v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2512.21513

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
81. 【2512.21512】Fixed-Threshold Evaluation of a Hybrid CNN-ViT for AI-Generated Image Detection Across Photos and Art

链接https://arxiv.org/abs/2512.21512

作者:Md Ashik Khan,Arafat Alam Jion

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:necessitating robust detectors, common post-processing transformations, image generators create, JPEG compression, generators create

备注: Accepted at the 2025 28th International Conference on Computer and Information Technology (ICCIT). 6 pages, 5 figures

点击查看摘要

Abstract:AI image generators create both photorealistic images and stylized art, necessitating robust detectors that maintain performance under common post-processing transformations (JPEG compression, blur, downscaling). Existing methods optimize single metrics without addressing deployment-critical factors such as operating point selection and fixed-threshold robustness. This work addresses misleading robustness estimates by introducing a fixed-threshold evaluation protocol that holds decision thresholds, selected once on clean validation data, fixed across all post-processing transformations. Traditional methods retune thresholds per condition, artificially inflating robustness estimates and masking deployment failures. We report deployment-relevant performance at three operating points (Low-FPR, ROC-optimal, Best-F1) under systematic degradation testing using a lightweight CNN-ViT hybrid with gated fusion and optional frequency enhancement. Our evaluation exposes a statistically validated forensic-semantic spectrum: frequency-aided CNNs excel on pristine photos but collapse under compression (93.33% to 61.49%), whereas ViTs degrade minimally (92.86% to 88.36%) through robust semantic pattern recognition. Multi-seed experiments demonstrate that all architectures achieve 15% higher AUROC on artistic content (0.901-0.907) versus photorealistic images (0.747-0.759), confirming that semantic patterns provide fundamentally more reliable detection cues than forensic artifacts. Our hybrid approach achieves balanced cross-domain performance: 91.4% accuracy on tiny-genimage photos, 89.7% on AiArtData art/graphics, and 98.3% (competitive) on CIFAKE. Fixed-threshold evaluation eliminates retuning inflation, reveals genuine robustness gaps, and yields actionable deployment guidance: prefer CNNs for clean photo verification, ViTs for compressed content, and hybrids for art/graphics screening.

82. 【2512.21510】Missing Pattern Tree based Decision Grouping and Ensemble for Deep Incomplete Multi-View Clustering

链接https://arxiv.org/abs/2512.21510

作者:Wenyuan Yang,Jie Xu,Hongqing He,Jiangzhang Gan,Xiaofeng Zhu

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:inconsistent missing patterns, Real-world multi-view data, highly inconsistent missing, inconsistent missing, Real-world multi-view

备注

点击查看摘要

Abstract:Real-world multi-view data usually exhibits highly inconsistent missing patterns which challenges the effectiveness of incomplete multi-view clustering (IMVC). Although existing IMVC methods have made progress from both imputation-based and imputation-free routes, they have overlooked the pair under-utilization issue, i.e., inconsistent missing patterns make the incomplete but available multi-view pairs unable to be fully utilized, thereby limiting the model performance. To address this, we propose a novel missing-pattern tree based IMVC framework entitled TreeEIC. Specifically, to achieve full exploitation of available multi-view pairs, TreeEIC first defines the missing-pattern tree model to group data into multiple decision sets according to different missing patterns, and then performs multi-view clustering within each set. Furthermore, a multi-view decision ensemble module is proposed to aggregate clustering results from all decision sets, which infers uncertainty-based weights to suppress unreliable clustering decisions and produce robust decisions. Finally, an ensemble-to-individual knowledge distillation module transfers the ensemble knowledge to view-specific clustering models, which enables ensemble and individual modules to promote each other by optimizing cross-view consistency and inter-cluster discrimination losses. Extensive experiments on multiple benchmark datasets demonstrate that our TreeEIC achieves state-of-the-art IMVC performance and exhibits superior robustness under highly inconsistent missing patterns.

83. 【2512.21508】Fixed-Budget Parameter-Efficient Training with Frozen Encoders Improves Multimodal Chest X-Ray Classification

链接https://arxiv.org/abs/2512.21508

作者:Md Ashik Khan,Md Nahid Siddique

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:fine-tunes large vision-language, Indiana University Chest, chest X-Ray analysis, University Chest X-Ray, large vision-language models

备注: Accepted at the 2025 28th International Conference on Computer and Information Technology (ICCIT). 6 pages, 6 figures

点击查看摘要

Abstract:Multimodal chest X-Ray analysis often fine-tunes large vision-language models, which is computationally costly. We study parameter-efficient training (PET) strategies, including frozen encoders, BitFit, LoRA, and adapters for multi-label classification on the Indiana University Chest X-Ray dataset (3,851 image-report pairs; 579 test samples). To mitigate data leakage, we redact pathology terms from reports used as text inputs while retaining clinical context. Under a fixed parameter budget (2.37M parameters, 2.51% of total), all PET variants achieve AUROC between 0.892 and 0.908, outperforming full fine-tuning (0.770 AUROC), which uses 94.3M trainable parameters, a 40x reduction. External validation on CheXpert (224,316 images, 58x larger) confirms scalability: all PET methods achieve 0.69 AUROC with 9% trainable parameters, with Adapter achieving best performance (0.7214 AUROC). Budget-matched comparisons reveal that vision-only models (0.653 AUROC, 1.06M parameters) outperform budget-matched multimodal models (0.641 AUROC, 1.06M parameters), indicating improvements arise primarily from parameter allocation rather than cross-modal synergy. While PET methods show degraded calibration (ECE: 0.29-0.34) compared to simpler models (ECE: 0.049), this represents a tractable limitation addressable through post-hoc calibration methods. These findings demonstrate that frozen encoder strategies provide superior discrimination at substantially reduced computational cost, though calibration correction is essential for clinical deployment.

84. 【2512.21507】SVBench: Evaluation of Video Generation Models on Social Reasoning

链接https://arxiv.org/abs/2512.21507

作者:Wenshuo Peng,Gongxuan Wang,Tianmeng Yang,Chuanhao Li,Xiaojie Xu,Hui He,Kaipeng Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:exhibit remarkable progress, remain fundamentally limited, generate socially coherent, socially coherent behavior, models exhibit remarkable

备注: 10pages

点击查看摘要

Abstract:Recent text-to-video generation models exhibit remarkable progress in visual realism, motion fidelity, and text-video alignment, yet they remain fundamentally limited in their ability to generate socially coherent behavior. Unlike humans, who effortlessly infer intentions, beliefs, emotions, and social norms from brief visual cues, current models tend to render literal scenes without capturing the underlying causal or psychological logic. To systematically evaluate this gap, we introduce the first benchmark for social reasoning in video generation. Grounded in findings from developmental and social psychology, our benchmark organizes thirty classic social cognition paradigms into seven core dimensions, including mental-state inference, goal-directed action, joint attention, social coordination, prosocial behavior, social norms, and multi-agent strategy. To operationalize these paradigms, we develop a fully training-free agent-based pipeline that (i) distills the reasoning mechanism of each experiment, (ii) synthesizes diverse video-ready scenarios, (iii) enforces conceptual neutrality and difficulty control through cue-based critique, and (iv) evaluates generated videos using a high-capacity VLM judge across five interpretable dimensions of social reasoning. Using this framework, we conduct the first large-scale study across seven state-of-the-art video generation systems. Our results reveal substantial performance gaps: while modern models excel in surface-level plausibility, they systematically fail in intention recognition, belief reasoning, joint attention, and prosocial inference.

85. 【2512.21495】Generative Multi-Focus Image Fusion

链接https://arxiv.org/abs/2512.21495

作者:Xinzhe Xie,Buyu Guo,Bolin Li,Shuangyan He,Yanzhen Gu,Qingyan Jiang,Peiliang Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:partially focused input, focused input images, Multi-focus image fusion, aims to generate, sequence of partially

备注

点击查看摘要

Abstract:Multi-focus image fusion aims to generate an all-in-focus image from a sequence of partially focused input images. Existing fusion algorithms generally assume that, for every spatial location in the scene, there is at least one input image in which that location is in focus. Furthermore, current fusion models often suffer from edge artifacts caused by uncertain focus estimation or hard-selection operations in complex real-world scenarios. To address these limitations, we propose a generative multi-focus image fusion framework, termed GMFF, which operates in two sequential stages. In the first stage, deterministic fusion is implemented using StackMFF V4, the latest version of the StackMFF series, and integrates the available focal plane information to produce an initial fused image. The second stage, generative restoration, is realized through IFControlNet, which leverages the generative capabilities of latent diffusion models to reconstruct content from missing focal planes, restore fine details, and eliminate edge artifacts. Each stage is independently developed and functions seamlessly in a cascaded manner. Extensive experiments demonstrate that GMFF achieves state-of-the-art fusion performance and exhibits significant potential for practical applications, particularly in scenarios involving complex multi-focal content. The implementation is publicly available at this https URL.

86. 【2512.21476】GPF-Net: Gated Progressive Fusion Learning for Polyp Re-Identification

链接https://arxiv.org/abs/2512.21476

作者:Suncheng Xiang,Xiaoyang Wang,Junjie Jiang,Hejia Wang,Dahong Qian

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Colonoscopic Polyp Re-Identification, Polyp Re-Identification aims, Colonoscopic Polyp, Gated Progressive Fusion, computer-aided diagnosis

备注: Work in progress

点击查看摘要

Abstract:Colonoscopic Polyp Re-Identification aims to match the same polyp from a large gallery with images from different views taken using different cameras, which plays an important role in the prevention and treatment of colorectal cancer in computer-aided diagnosis. However, the coarse resolution of high-level features of a specific polyp often leads to inferior results for small objects where detailed information is important. To address this challenge, we propose a novel architecture, named Gated Progressive Fusion network, to selectively fuse features from multiple levels using gates in a fully connected way for polyp ReID. On the basis of it, a gated progressive fusion strategy is introduced to achieve layer-wise refinement of semantic information through multi-level feature interactions. Experiments on standard benchmarks show the benefits of the multimodal setting over state-of-the-art unimodal ReID models, especially when combined with the specialized multimodal fusion strategy.

87. 【2512.21472】IMA++: ISIC Archive Multi-Annotator Dermoscopic Skin Lesion Segmentation Dataset

链接https://arxiv.org/abs/2512.21472

作者:Kumar Abhishek,Jeremy Kawahara,Ghassan Hamarneh

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:multi-annotator skin lesion, skin lesion, important research problem, Dermoscopic skin lesion, skin lesion imaging

备注: 11 pages, 7 figures

点击查看摘要

Abstract:Multi-annotator medical image segmentation is an important research problem, but requires annotated datasets that are expensive to collect. Dermoscopic skin lesion imaging allows human experts and AI systems to observe morphological structures otherwise not discernable from regular clinical photographs. However, currently there are no large-scale publicly available multi-annotator skin lesion segmentation (SLS) datasets with annotator-labels for dermoscopic skin lesion imaging. We introduce ISIC MultiAnnot++, a large public multi-annotator skin lesion segmentation dataset for images from the ISIC Archive. The final dataset contains 17,684 segmentation masks spanning 14,967 dermoscopic images, where 2,394 dermoscopic images have 2-5 segmentations per image, making it the largest publicly available SLS dataset. Further, metadata about the segmentation, including the annotators' skill level and segmentation tool, is included, enabling research on topics such as annotator-specific preference modeling for segmentation and annotator metadata analysis. We provide an analysis on the characteristics of this dataset, curated data partitions, and consensus segmentation masks.

88. 【2512.21459】CCAD: Compressed Global Feature Conditioned Anomaly Detection

链接https://arxiv.org/abs/2512.21459

作者:Xiao Jin,Liang Diao,Qixin Xiao,Yifan Hu,Ziqi Zhang,Yuchen Liu,Haisong Gu

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:considerable industrial significance, limited anomalous data, holds considerable industrial, detection holds considerable, Anomaly detection holds

备注: 18 pages, 9 figures

点击查看摘要

Abstract:Anomaly detection holds considerable industrial significance, especially in scenarios with limited anomalous data. Currently, reconstruction-based and unsupervised representation-based approaches are the primary focus. However, unsupervised representation-based methods struggle to extract robust features under domain shift, whereas reconstruction-based methods often suffer from low training efficiency and performance degradation due to insufficient constraints. To address these challenges, we propose a novel method named Compressed Global Feature Conditioned Anomaly Detection (CCAD). CCAD synergizes the strengths of both paradigms by adapting global features as a new modality condition for the reconstruction model. Furthermore, we design an adaptive compression mechanism to enhance both generalization and training efficiency. Extensive experiments demonstrate that CCAD consistently outperforms state-of-the-art methods in terms of AUC while achieving faster convergence. In addition, we contribute a reorganized and re-annotated version of the DAGM 2007 dataset with new annotations to further validate our method's effectiveness. The code for reproducing main results is available at this https URL.

89. 【2512.21452】Intelligent recognition of GPR road hidden defect images based on feature fusion and attention mechanism

链接https://arxiv.org/abs/2512.21452

作者:Haotian Lv,Yuhui Zhang,Jiangbo Dai,Hanli Wu,Jiaji Wang,Dawei Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Ground Penetrating Radar, Ground Penetrating, Penetrating Radar, Global Attention Mechanism, Global Attention Network

备注: Accepted for publication in *IEEE Transactions on Geoscience and Remote Sensing*

点击查看摘要

Abstract:Ground Penetrating Radar (GPR) has emerged as a pivotal tool for non-destructive evaluation of subsurface road defects. However, conventional GPR image interpretation remains heavily reliant on subjective expertise, introducing inefficiencies and inaccuracies. This study introduces a comprehensive framework to address these limitations: (1) A DCGAN-based data augmentation strategy synthesizes high-fidelity GPR images to mitigate data scarcity while preserving defect morphology under complex backgrounds; (2) A novel Multi-modal Chain and Global Attention Network (MCGA-Net) is proposed, integrating Multi-modal Chain Feature Fusion (MCFF) for hierarchical multi-scale defect representation and Global Attention Mechanism (GAM) for context-aware feature enhancement; (3) MS COCO transfer learning fine-tunes the backbone network, accelerating convergence and improving generalization. Ablation and comparison experiments validate the framework's efficacy. MCGA-Net achieves Precision (92.8%), Recall (92.5%), and mAP@50 (95.9%). In the detection of Gaussian noise, weak signals and small targets, MCGA-Net maintains robustness and outperforms other models. This work establishes a new paradigm for automated GPR-based defect detection, balancing computational efficiency with high accuracy in complex subsurface environments.

90. 【2512.21434】Scalable Deep Subspace Clustering Network

链接https://arxiv.org/abs/2512.21434

作者:Nairouz Mrabah,Mohamed Bouguessa,Sihem Sami

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:face inherent scalability, inherent scalability limits, scalability limits due, performing spectral decomposition, Scalable Deep Subspace

备注: Published at the 2025 IEEE 12th International Conference on Data Science and Advanced Analytics (DSAA)

点击查看摘要

Abstract:Subspace clustering methods face inherent scalability limits due to the $O(n^3)$ cost (with $n$ denoting the number of data samples) of constructing full $n\times n$ affinities and performing spectral decomposition. While deep learning-based approaches improve feature extraction, they maintain this computational bottleneck through exhaustive pairwise similarity computations. We propose SDSNet (Scalable Deep Subspace Network), a deep subspace clustering framework that achieves $\mathcal{O}(n)$ complexity through (1) landmark-based approximation, avoiding full affinity matrices, (2) joint optimization of auto-encoder reconstruction with self-expression objectives, and (3) direct spectral clustering on factorized representations. The framework combines convolutional auto-encoders with subspace-preserving constraints. Experimental results demonstrate that SDSNet achieves comparable clustering quality to state-of-the-art methods with significantly improved computational efficiency.

91. 【2512.21414】A Tool Bottleneck Framework for Clinically-Informed and Interpretable Medical Image Understanding

链接https://arxiv.org/abs/2512.21414

作者:Christina Liu,Alan Q. Wang,Joy Hsu,Jiajun Wu,Ehsan Adeli

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Recent tool-use frameworks, Recent tool-use, grounding model predictions, powered by vision-language, Recent

备注

点击查看摘要

Abstract:Recent tool-use frameworks powered by vision-language models (VLMs) improve image understanding by grounding model predictions with specialized tools. Broadly, these frameworks leverage VLMs and a pre-specified toolbox to decompose the prediction task into multiple tool calls (often deep learning models) which are composed to make a prediction. The dominant approach to composing tools is using text, via function calls embedded in VLM-generated code or natural language. However, these methods often perform poorly on medical image understanding, where salient information is encoded as spatially-localized features that are difficult to compose or fuse via text alone. To address this, we propose a tool-use framework for medical image understanding called the Tool Bottleneck Framework (TBF), which composes VLM-selected tools using a learned Tool Bottleneck Model (TBM). For a given image and task, TBF leverages an off-the-shelf medical VLM to select tools from a toolbox that each extract clinically-relevant features. Instead of text-based composition, these tools are composed by the TBM, which computes and fuses the tool outputs using a neural network before outputting the final prediction. We propose a simple and effective strategy for TBMs to make predictions with any arbitrary VLM tool selection. Overall, our framework not only improves tool-use in medical imaging contexts, but also yields more interpretable, clinically-grounded predictors. We evaluate TBF on tasks in histopathology and dermatology and find that these advantages enable our framework to perform on par with or better than deep learning-based classifiers, VLMs, and state-of-the-art tool-use frameworks, with particular gains in data-limited regimes. Our code is available at this https URL.

92. 【2512.21402】Understanding Virality: A Rubric based Vision-Language Model Framework for Short-Form Edutainment Evaluation

链接https://arxiv.org/abs/2512.21402

作者:Arnav Gupta,Gurekas Singh Sahney,Hardik Rathi,Abhishek Chandwani,Ishaan Gupta,Pratik Narang,Dhruv Kumar

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:content requires moving, Evaluating short-form video, video content requires, surface-level quality metrics, Evaluating short-form

备注: Under Review

点击查看摘要

Abstract:Evaluating short-form video content requires moving beyond surface-level quality metrics toward human-aligned, multimodal reasoning. While existing frameworks like VideoScore-2 assess visual and semantic fidelity, they do not capture how specific audiovisual attributes drive real audience engagement. In this work, we propose a data-driven evaluation framework that uses Vision-Language Models (VLMs) to extract unsupervised audiovisual features, clusters them into interpretable factors, and trains a regression-based evaluator to predict engagement on short-form edutainment videos. Our curated YouTube Shorts dataset enables systematic analysis of how VLM-derived features relate to human engagement behavior. Experiments show strong correlations between predicted and actual engagement, demonstrating that our lightweight, feature-based evaluator provides interpretable and scalable assessments compared to traditional metrics (e.g., SSIM, FID). By grounding evaluation in both multimodal feature importance and human-centered engagement signals, our approach advances toward robust and explainable video understanding.

93. 【2512.21988】he Color-Clinical Decoupling: Why Perceptual Calibration Fails Clinical Biomarkers in Smartphone Dermatology

链接https://arxiv.org/abs/2512.21988

作者:Sungwoo Kang

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)

关键词:Smartphone-based tele-dermatology assumes, underrepresented skin phototypes, Smartphone-based tele-dermatology, ensures clinical reliability, skin phototypes

备注

点击查看摘要

Abstract:Smartphone-based tele-dermatology assumes that colorimetric calibration ensures clinical reliability, yet this remains untested for underrepresented skin phototypes. We investigated whether standard calibration translates to reliable clinical biomarkers using 43,425 images from 965 Korean subjects (Fitzpatrick III-IV) across DSLR, tablet, and smartphone devices. While Linear Color Correction Matrix (CCM) normalization reduced color error by 67-77% -- achieving near-clinical accuracy (Delta E 2.3) -- this success did not translate to biomarker reliability. We identify a phenomenon termed "color-clinical decoupling": despite perceptual accuracy, the Individual Typology Angle (ITA) showed poor inter-device agreement (ICC = 0.40), while the Melanin Index achieved good agreement (ICC = 0.77). This decoupling is driven by the ITA formula's sensitivity to b* channel noise and is further compounded by anatomical variance. Facial region accounts for 25.2% of color variance -- 3.6x greater than device effects (7.0%) -- challenging the efficacy of single-patch calibration. Our results demonstrate that current colorimetric standards are insufficient for clinical-grade biomarker extraction, necessitating region-aware protocols for mobile dermatology.

Subjects:

Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)

Cite as:
arXiv:2512.21988 [eess.IV]

(or
arXiv:2512.21988v1 [eess.IV] for this version)

https://doi.org/10.48550/arXiv.2512.21988

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
94. 【2512.21975】RT-Focuser: A Real-Time Lightweight Model for Edge-side Image Deblurring

链接https://arxiv.org/abs/2512.21975

作者:Zhuoyu Wu,Wenhui Ou,Qiawei Zheng,Jiayan Yang,Quanjun Wang,Wenqi Fang,Zheng Wang,Yongkui Yang,Heshan Li

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:Motion blur caused, object movement severely, movement severely degrades, severely degrades image, degrades image quality

备注: 2 pages, 2 figures, this paper already accepted by IEEE ICTA 2025

点击查看摘要

Abstract:Motion blur caused by camera or object movement severely degrades image quality and poses challenges for real-time applications such as autonomous driving, UAV perception, and medical imaging. In this paper, a lightweight U-shaped network tailored for real-time deblurring is presented and named RT-Focuser. To balance speed and accuracy, we design three key components: Lightweight Deblurring Block (LD) for edge-aware feature extraction, Multi-Level Integrated Aggregation module (MLIA) for encoder integration, and Cross-source Fusion Block (X-Fuse) for progressive decoder refinement. Trained on a single blurred input, RT-Focuser achieves 30.67 dB PSNR with only 5.85M parameters and 15.76 GMACs. It runs 6ms per frame on GPU and mobile, exceeds 140 FPS on both, showing strong potential for deployment on the edge. The official code and usage are available on: this https URL.

95. 【2512.21593】Residual Prior Diffusion: A Probabilistic Framework Integrating Coarse Latent Priors with Diffusion Models

链接https://arxiv.org/abs/2512.21593

作者:Takuro Kutsuna

类目:Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:deep generative modeling, single diffusion schedule, target data distribution, standard formulations rely, single network

备注: 40 pages

点击查看摘要

Abstract:Diffusion models have become a central tool in deep generative modeling, but standard formulations rely on a single network and a single diffusion schedule to transform a simple prior, typically a standard normal distribution, into the target data distribution. As a result, the model must simultaneously represent the global structure of the distribution and its fine-scale local variations, which becomes difficult when these scales are strongly mismatched. This issue arises both in natural images, where coarse manifold-level structure and fine textures coexist, and in low-dimensional distributions with highly concentrated local structure. To address this issue, we propose Residual Prior Diffusion (RPD), a two-stage framework in which a coarse prior model first captures the large-scale structure of the data distribution, and a diffusion model is then trained to represent the residual between the prior and the target data distribution. We formulate RPD as an explicit probabilistic model with a tractable evidence lower bound, whose optimization reduces to the familiar objectives of noise prediction or velocity prediction. We further introduce auxiliary variables that leverage information from the prior model and theoretically analyze how they reduce the difficulty of the prediction problem in RPD. Experiments on synthetic datasets with fine-grained local structure show that standard diffusion models fail to capture local details, whereas RPD accurately captures fine-scale detail while preserving the large-scale structure of the distribution. On natural image generation tasks, RPD achieved generation quality that matched or exceeded that of representative diffusion-based baselines and it maintained strong performance even with a small number of inference steps.

96. 【2512.21372】A Graph-Augmented knowledge Distillation based Dual-Stream Vision Transformer with Region-Aware Attention for Gastrointestinal Disease Classification with Explainable AI

链接https://arxiv.org/abs/2512.21372

作者:Md Assaduzzaman,Nushrat Jahan Oyshi,Eram Mahamud

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:histopathological imagery remains, vast data volume, inter-class visuals, accurate classification, classification of gastrointestinal

备注

点击查看摘要

Abstract:The accurate classification of gastrointestinal diseases from endoscopic and histopathological imagery remains a significant challenge in medical diagnostics, mainly due to the vast data volume and subtle variation in inter-class visuals. This study presents a hybrid dual-stream deep learning framework built on teacher-student knowledge distillation, where a high-capacity teacher model integrates the global contextual reasoning of a Swin Transformer with the local fine-grained feature extraction of a Vision Transformer. The student network was implemented as a compact Tiny-ViT structure that inherits the teacher's semantic and morphological knowledge via soft-label distillation, achieving a balance between efficiency and diagnostic accuracy. Two carefully curated Wireless Capsule Endoscopy datasets, encompassing major GI disease classes, were employed to ensure balanced representation and prevent inter-sample bias. The proposed framework achieved remarkable performance with accuracies of 0.9978 and 0.9928 on Dataset 1 and Dataset 2 respectively, and an average AUC of 1.0000, signifying near-perfect discriminative capability. Interpretability analyses using Grad-CAM, LIME, and Score-CAM confirmed that the model's predictions were grounded in clinically significant tissue regions and pathologically relevant morphological cues, validating the framework's transparency and reliability. The Tiny-ViT demonstrated diagnostic performance with reduced computational complexity comparable to its transformer-based teacher while delivering faster inference, making it suitable for resource-constrained clinical environments. Overall, the proposed framework provides a robust, interpretable, and scalable solution for AI-assisted GI disease diagnosis, paving the way toward future intelligent endoscopic screening that is compatible with clinical practicality.