本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。
统计
今日共更新811篇论文,其中:
- 自然语言处理100篇
- 信息检索8篇
- 计算机视觉203篇
自然语言处理
1. 【2512.19682】GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators
链接:https://arxiv.org/abs/2512.19682
作者:Jiacheng Guo,Ling Yang,Peter Chen,Qixin Xiao,Yinjie Wang,Xinzhe Juan,Jiahao Qiu,Ke Shen,Mengdi Wang
类目:Computation and Language (cs.CL)
关键词:Training capable Large, capable Large Language, Large Language Model, Large Language, Training capable
备注: Our codes are available at [this https URL](https://github.com/Gen-Verse/GenEnv)
点击查看摘要
Abstract:Training capable Large Language Model (LLM) agents is critically bottlenecked by the high cost and static nature of real-world interaction data. We address this by introducing GenEnv, a framework that establishes a difficulty-aligned co-evolutionary game between an agent and a scalable, generative environment simulator. Unlike traditional methods that evolve models on static datasets, GenEnv instantiates a dataevolving: the simulator acts as a dynamic curriculum policy, continuously generating tasks specifically tailored to the agent's ``zone of proximal development''. This process is guided by a simple but effective $\alpha$-Curriculum Reward, which aligns task difficulty with the agent's current capabilities. We evaluate GenEnv on five benchmarks, including API-Bank, ALFWorld, BFCL, Bamboogle, and TravelPlanner. Across these tasks, GenEnv improves agent performance by up to \textbf{+40.3\%} over 7B baselines and matches or exceeds the average performance of larger models. Compared to Gemini 2.5 Pro-based offline data augmentation, GenEnv achieves better performance while using 3.3$\times$ less data. By shifting from static supervision to adaptive simulation, GenEnv provides a data-efficient pathway for scaling agent capabilities.
2. 【2512.19673】Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies
链接:https://arxiv.org/abs/2512.19673
作者:Yuqiao Tan,Minzheng Wang,Shizhu He,Huanxuan Liao,Chengfeng Zhao,Qiunan Lu,Tian Liang,Jun Zhao,Kang Liu
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Existing reinforcement learning, approaches treat large, Existing reinforcement, treat large language, single unified policy
备注: Preprint. Our code is available at [this https URL](https://github.com/Trae1ounG/BuPO)
点击查看摘要
Abstract:Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a single unified policy, overlooking their internal mechanisms. Understanding how policy evolves across layers and modules is therefore crucial for enabling more targeted optimization and raveling out complex reasoning mechanisms. In this paper, we decompose the language model policy by leveraging the intrinsic split of the Transformer residual stream and the equivalence between the composition of hidden states with the unembedding matrix and the resulting samplable policy. This decomposition reveals Internal Layer Policies, corresponding to contributions from individual layers, and Internal Modular Policies, which align with the self-attention and feed-forward network (FFN) components within each layer. By analyzing the entropy of internal policy, we find that: (a) Early layers keep high entropy for exploration, top layers converge to near-zero entropy for refinement, with convergence patterns varying across model series. (b) LLama's prediction space rapidly converges in the final layer, whereas Qwen-series models, especially Qwen3, exhibit a more human-like, progressively structured reasoning pattern. Motivated by these findings, we propose Bottom-up Policy Optimization (BuPO), a novel RL paradigm that directly optimizes the internal layer policy during early training. By aligning training objective at lower layer, BuPO reconstructs foundational reasoning capabilities and achieves superior performance. Extensive experiments on complex reasoning benchmarks demonstrates the effectiveness of our method. Our code is available at this https URL.
3. 【2512.19651】Exploring Zero-Shot ACSA with Unified Meaning Representation in Chain-of-Thought Prompting
链接:https://arxiv.org/abs/2512.19651
作者:Filippos Ventirozos,Peter Appleby,Matthew Shardlow
类目:Computation and Language (cs.CL)
关键词:Aspect-Category Sentiment Analysis, Sentiment Analysis, identifying specific themes, Aspect-Category Sentiment, granular insights
备注: 9 pages, 3 figures, 3 tables
点击查看摘要
Abstract:Aspect-Category Sentiment Analysis (ACSA) provides granular insights by identifying specific themes within reviews and their associated sentiment. While supervised learning approaches dominate this field, the scarcity and high cost of annotated data for new domains present significant barriers. We argue that leveraging large language models (LLMs) in a zero-shot setting is a practical alternative where resources for data annotation are limited. In this work, we propose a novel Chain-of-Thought (CoT) prompting technique that utilises an intermediate Unified Meaning Representation (UMR) to structure the reasoning process for the ACSA task. We evaluate this UMR-based approach against a standard CoT baseline across three models (Qwen3-4B, Qwen3-8B, and Gemini-2.5-Pro) and four diverse datasets. Our findings suggest that UMR effectiveness may be model-dependent. Whilst preliminary results indicate comparable performance for mid-sized models such as Qwen3-8B, these observations warrant further investigation, particularly regarding the potential applicability to smaller model architectures. Further research is required to establish the generalisability of these findings across different model scales.
4. 【2512.19630】Diacritic Restoration for Low-Resource Indigenous Languages: Case Study with Bribri and Cook Islands Māori
链接:https://arxiv.org/abs/2512.19630
作者:Rolando Coto-Solano,Daisy Li,Manoela Teleginski Ferraz,Olivia Sasse,Cha Krupka,Sharid Loáiciga,Sally Akevai Tenamu Nicholas
类目:Computation and Language (cs.CL)
关键词:Cook Islands Māori, text normalization essential, natural language processing, Cook Islands, Islands Māori
备注:
点击查看摘要
Abstract:We present experiments on diacritic restoration, a form of text normalization essential for natural language processing (NLP) tasks. Our study focuses on two extremely under-resourced languages: Bribri, a Chibchan language spoken in Costa Rica, and Cook Islands Māori, a Polynesian language spoken in the Cook Islands. Specifically, this paper: (i) compares algorithms for diacritics restoration in under-resourced languages, including tonal diacritics, (ii) examines the amount of data required to achieve target performance levels, (iii) contrasts results across varying resource conditions, and (iv) explores the related task of diacritic correction. We find that fine-tuned, character-level LLMs perform best, likely due to their ability to decompose complex characters into their UTF-8 byte representations. In contrast, massively multilingual models perform less effectively given our data constraints. Across all models, reliable performance begins to emerge with data budgets of around 10,000 words. Zero-shot approaches perform poorly in all cases. This study responds both to requests from the language communities and to broader NLP research questions concerning model performance and generalization in under-resourced contexts.
5. 【2512.19620】Exploring the features used for summary evaluation by Human and GPT
链接:https://arxiv.org/abs/2512.19620
作者:Zahra Sadeghi,Evangelos Milios,Frank Rudzicz
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Summary assessment involves, generated summary reflects, assessment involves evaluating, Summary assessment, generated summary
备注:
点击查看摘要
Abstract:Summary assessment involves evaluating how well a generated summary reflects the key ideas and meaning of the source text, requiring a deep understanding of the content. Large Language Models (LLMs) have been used to automate this process, acting as judges to evaluate summaries with respect to the original text. While previous research investigated the alignment between LLMs and Human responses, it is not yet well understood what properties or features are exploited by them when asked to evaluate based on a particular quality dimension, and there has not been much attention towards mapping between evaluation scores and metrics. In this paper, we address this issue and discover features aligned with Human and Generative Pre-trained Transformers (GPTs) responses by studying statistical and machine learning metrics. Furthermore, we show that instructing GPTs to employ metrics used by Human can improve their judgment and conforming them better with human responses.
6. 【2512.19612】MauBERT: Universal Phonetic Inductive Biases for Few-Shot Acoustic Units Discovery
链接:https://arxiv.org/abs/2512.19612
作者:Angelo Ortiz Tandazo,Manel Khentout,Youssef Benchekroun,Thomas Hueber,Emmanuel Dupoux
类目:Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
关键词:robust cross-lingual phonetic, paper introduces MauBERT, leverages articulatory features, paper introduces, robust cross-lingual
备注:
点击查看摘要
Abstract:This paper introduces MauBERT, a multilingual extension of HuBERT that leverages articulatory features for robust cross-lingual phonetic representation learning. We continue HuBERT pre-training with supervision based on a phonetic-to-articulatory feature mapping in 55 languages. Our models learn from multilingual data to predict articulatory features or phones, resulting in language-independent representations that capture multilingual phonetic properties. Through comprehensive ABX discriminability testing, we show MauBERT models produce more context-invariant representations than state-of-the-art multilingual self-supervised learning models. Additionally, the models effectively adapt to unseen languages and casual speech with minimal self-supervised fine-tuning (10 hours of speech). This establishes an effective approach for instilling linguistic inductive biases in self-supervised speech models.
7. 【2512.19585】Increasing the Thinking Budget is Not All You Need
链接:https://arxiv.org/abs/2512.19585
作者:Ignacio Iacobacci,Zhaozhi Qian,Faroq AL-Tam,Muhammad AL-Qurishi,Riad Souissi
类目:Computation and Language (cs.CL)
关键词:thinking-capable Large Language, Large Language Models, demonstrating exceptional capabilities, Large Language, thinking-capable Large
备注: 4 pages, 4 figures, 3 tables
点击查看摘要
Abstract:Recently, a new wave of thinking-capable Large Language Models has emerged, demonstrating exceptional capabilities across a wide range of reasoning benchmarks. Early studies have begun to explore how the amount of compute in terms of the length of the reasoning process, the so-called thinking budget, impacts model performance. In this work, we propose a systematic investigation of the thinking budget as a key parameter, examining its interaction with various configurations such as self-consistency, reflection, and others. Our goal is to provide an informative, balanced comparison framework that considers both performance outcomes and computational cost. Among our findings, we discovered that simply increasing the thinking budget is not the most effective use of compute. More accurate responses can instead be achieved through alternative configurations, such as self-consistency and self-reflection.
8. 【2512.19543】Algerian Dialect
链接:https://arxiv.org/abs/2512.19543
作者:Zakaria Benmounah,Abdennour Boulesnane
类目:Computation and Language (cs.CL)
关键词:present Algerian Dialect, YouTube Data API, Algerian Arabic dialect, large-scale sentiment-annotated dataset, sentiment-annotated dataset consisting
备注:
点击查看摘要
Abstract:We present Algerian Dialect, a large-scale sentiment-annotated dataset consisting of 45,000 YouTube comments written in Algerian Arabic dialect. The comments were collected from more than 30 Algerian press and media channels using the YouTube Data API. Each comment is manually annotated into one of five sentiment categories: very negative, negative, neutral, positive, and very positive. In addition to sentiment labels, the dataset includes rich metadata such as collection timestamps, like counts, video URLs, and annotation dates. This dataset addresses the scarcity of publicly available resources for Algerian dialect and aims to support research in sentiment analysis, dialectal Arabic NLP, and social media analytics. The dataset is publicly available on Mendeley Data under a CC BY 4.0 license at this https URL.
9. 【2512.19537】Event Extraction in Large Language Model
链接:https://arxiv.org/abs/2512.19537
作者:Bobo Li,Xudong Han,Jiang Liu,Yuzhe Ding,Liqiang Jing,Zhaoqi Zhang,Jinheng Li,Xinya Du,Fei Li,Meishan Zhang,Min Zhang,Aixin Sun,Philip S. Yu,Hao Fei
类目:Computation and Language (cs.CL)
关键词:Large language models, produce structured outputs, Large language, changing event extraction, prompting and generation
备注: 38 pages, 9 Figures, 5 Tables
点击查看摘要
Abstract:Large language models (LLMs) and multimodal LLMs are changing event extraction (EE): prompting and generation can often produce structured outputs in zero shot or few shot settings. Yet LLM based pipelines face deployment gaps, including hallucinations under weak constraints, fragile temporal and causal linking over long contexts and across documents, and limited long horizon knowledge management within a bounded context window. We argue that EE should be viewed as a system component that provides a cognitive scaffold for LLM centered solutions. Event schemas and slot constraints create interfaces for grounding and verification; event centric structures act as controlled intermediate representations for stepwise reasoning; event links support relation aware retrieval with graph based RAG; and event stores offer updatable episodic and agent memory beyond the context window. This survey covers EE in text and multimodal settings, organizing tasks and taxonomy, tracing method evolution from rule based and neural models to instruction driven and generative frameworks, and summarizing formulations, decoding strategies, architectures, representations, datasets, and evaluation. We also review cross lingual, low resource, and domain specific settings, and highlight open challenges and future directions for reliable event centric systems. Finally, we outline open challenges and future directions that are central to the LLM era, aiming to evolve EE from static extraction into a structurally reliable, agent ready perception and memory layer for open world systems.
10. 【2512.19475】A Large-Language-Model Framework for Automated Humanitarian Situation Reporting
链接:https://arxiv.org/abs/2512.19475
作者:Ivan Decostanzi,Yelena Mejova,Kyriaki Kalimeri
类目:Computation and Language (cs.CL)
关键词:remain largely manual, current workflows remain, workflows remain largely, resource intensive, largely manual
备注: 18 pages, 3 figures
点击查看摘要
Abstract:Timely and accurate situational reports are essential for humanitarian decision-making, yet current workflows remain largely manual, resource intensive, and inconsistent. We present a fully automated framework that uses large language models (LLMs) to transform heterogeneous humanitarian documents into structured and evidence-grounded reports. The system integrates semantic text clustering, automatic question generation, retrieval augmented answer extraction with citations, multi-level summarization, and executive summary generation, supported by internal evaluation metrics that emulate expert reasoning. We evaluated the framework across 13 humanitarian events, including natural disasters and conflicts, using more than 1,100 documents from verified sources such as ReliefWeb. The generated questions achieved 84.7 percent relevance, 84.0 percent importance, and 76.4 percent urgency. The extracted answers reached 86.3 percent relevance, with citation precision and recall both exceeding 76 percent. Agreement between human and LLM based evaluations surpassed an F1 score of 0.80. Comparative analysis shows that the proposed framework produces reports that are more structured, interpretable, and actionable than existing baselines. By combining LLM reasoning with transparent citation linking and multi-level evaluation, this study demonstrates that generative AI can autonomously produce accurate, verifiable, and operationally useful humanitarian situation reports.
11. 【2512.19466】Epistemological Fault Lines Between Human and Artificial Intelligence
链接:https://arxiv.org/abs/2512.19466
作者:Walter Quattrociocchi,Valerio Capraro,Matjaž Perc
类目:Computers and Society (cs.CY); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
关键词:Large language models, profile diverges sharply, Large language, epistemic profile diverges, profile diverges
备注: 16 pages, 1 figure
点击查看摘要
Abstract:Large language models (LLMs) are widely described as artificial intelligence, yet their epistemic profile diverges sharply from human cognition. Here we show that the apparent alignment between human and machine outputs conceals a deeper structural mismatch in how judgments are produced. Tracing the historical shift from symbolic AI and information filtering systems to large-scale generative transformers, we argue that LLMs are not epistemic agents but stochastic pattern-completion systems, formally describable as walks on high-dimensional graphs of linguistic transitions rather than as systems that form beliefs or models of the world. By systematically mapping human and artificial epistemic pipelines, we identify seven epistemic fault lines, divergences in grounding, parsing, experience, motivation, causal reasoning, metacognition, and value. We call the resulting condition Epistemia: a structural situation in which linguistic plausibility substitutes for epistemic evaluation, producing the feeling of knowing without the labor of judgment. We conclude by outlining consequences for evaluation, governance, and epistemic literacy in societies increasingly organized around generative AI.
12. 【2512.19456】Activations as Features: Probing LLMs for Generalizable Essay Scoring Representations
链接:https://arxiv.org/abs/2512.19456
作者:Jinwei Chi,Ke Wang,Yu Chen,Xuanye Lin,Qiang Xu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Automated essay scoring, Automated essay, cross-prompt settings due, AES, Automated
备注:
点击查看摘要
Abstract:Automated essay scoring (AES) is a challenging task in cross-prompt settings due to the diversity of scoring criteria. While previous studies have focused on the output of large language models (LLMs) to improve scoring accuracy, we believe activations from intermediate layers may also provide valuable information. To explore this possibility, we evaluated the discriminative power of LLMs' activations in cross-prompt essay scoring task. Specifically, we used activations to fit probes and further analyzed the effects of different models and input content of LLMs on this discriminative power. By computing the directions of essays across various trait dimensions under different prompts, we analyzed the variation in evaluation perspectives of large language models concerning essay types and traits. Results show that the activations possess strong discriminative power in evaluating essay quality and that LLMs can adapt their evaluation perspectives to different traits and essay types, effectively handling the diversity of scoring criteria in cross-prompt settings.
13. 【2512.19455】SiamGPT: Quality-First Fine-Tuning for Stable Thai Text Generation
链接:https://arxiv.org/abs/2512.19455
作者:Thittipat Pairatsuppawat,Abhibhu Tachaapornchai,Paweekorn Kusolsomboon,Chutikan Chaiwong,Thodsaporn Chay-intr,Kobkrit Viriyayudhakorn,Nongnuch Ketui,Aslan B. Wong
类目:Computation and Language (cs.CL)
关键词:models remain difficult, strong English performance, remain difficult, difficult to deploy, due to unstable
备注:
点击查看摘要
Abstract:Open-weights large language models remain difficult to deploy for Thai due to unstable generation under complex instructions, despite strong English performance. To mitigate these limitations, We present SiamGPT-32B, an open-weights model based on Qwen3-32B, fine-tuned with a Quality-First strategy emphasizing curated supervision over data scale. The fine-tuning pipeline combines translated high-complexity English instruction data with a Thai-adapted AutoIF framework for instruction and linguistic constraints. Using supervised fine-tuning only, without continual pretraining or corpus expansion, SiamGPT-32B improves instruction adherence, multi-turn robustness, and linguistic stability. Evaluations on the SEA-HELM benchmark show that SiamGPT-32B achieves the strongest overall performance among similar-scale open-weights Thai models, with consistent gains in instruction following, multi-turn dialogue, and natural language understanding.
14. 【2512.19432】MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments
链接:https://arxiv.org/abs/2512.19432
作者:Quyu Kong,Xu Zhang,Zhenyu Yang,Nolan Gao,Chen Liu,Panrong Tong,Chenglin Cai,Hanzhang Zhou,Jianan Zhang,Liangyu Chen,Zhidan Liu,Steven Hoi,Yue Wang
类目:Computation and Language (cs.CL)
关键词:existing online mobile-use, dominant benchmark due, online mobile-use benchmarks, recent agents achieving, recent agents
备注:
点击查看摘要
Abstract:Among existing online mobile-use benchmarks, AndroidWorld has emerged as the dominant benchmark due to its reproducible environment and deterministic evaluation; however, recent agents achieving over 90% success rates indicate its saturation and motivate the need for a more challenging benchmark. In addition, its environment lacks key application categories, such as e-commerce and enterprise communication, and does not reflect realistic mobile-use scenarios characterized by vague user instructions and hybrid tool usage. To bridge this gap, we introduce MobileWorld, a substantially more challenging benchmark designed to better reflect real-world mobile usage, comprising 201 tasks across 20 applications, while maintaining the same level of reproducible evaluation as AndroidWorld. The difficulty of MobileWorld is twofold. First, it emphasizes long-horizon tasks with cross-application interactions: MobileWorld requires nearly twice as many task-completion steps on average (27.8 vs. 14.3) and includes far more multi-application tasks (62.2% vs. 9.5%) compared to AndroidWorld. Second, MobileWorld extends beyond standard GUI manipulation by introducing novel task categories, including agent-user interaction and MCP-augmented tasks. To ensure robust evaluation, we provide snapshot-based container environment and precise functional verifications, including backend database inspection and task callback APIs. We further develop a planner-executor agentic framework with extended action spaces to support user interactions and MCP calls. Our results reveal a sharp performance drop compared to AndroidWorld, with the best agentic framework and end-to-end model achieving 51.7% and 20.9% success rates, respectively. Our analysis shows that current models struggle significantly with user interaction and MCP calls, offering a strategic roadmap toward more robust, next-generation mobile intelligence.
15. 【2512.19424】CodeSimpleQA: Scaling Factuality in Code Large Language Models
链接:https://arxiv.org/abs/2512.19424
作者:Jian Yang,Wei Zhang,Yizhi Li,Shawn Guo,Haowen Wang,Aishan Liu,Ge Zhang,Zili Wang,Zhoujun Li,Xianglong Liu,Weifeng Lv
类目:Computation and Language (cs.CL)
关键词:achieving impressive capabilities, made significant strides, Large language models, synthesizing code snippets, Large language
备注:
点击查看摘要
Abstract:Large language models (LLMs) have made significant strides in code generation, achieving impressive capabilities in synthesizing code snippets from natural language instructions. However, a critical challenge remains in ensuring LLMs generate factually accurate responses about programming concepts, technical implementations, etc. Most previous code-related benchmarks focus on code execution correctness, overlooking the factual accuracy of programming knowledge. To address this gap, we present CodeSimpleQA, a comprehensive bilingual benchmark designed to evaluate the factual accuracy of code LLMs in answering code-related questions, which contains carefully curated question-answer pairs in both English and Chinese, covering diverse programming languages and major computer science domains. Further, we create CodeSimpleQA-Instruct, a large-scale instruction corpus with 66M samples, and develop a post-training framework combining supervised fine-tuning and reinforcement learning. Our comprehensive evaluation of diverse LLMs reveals that even frontier LLMs struggle with code factuality. Our proposed framework demonstrates substantial improvements over the base model, underscoring the critical importance of factuality-aware alignment in developing reliable code LLMs.
16. 【2512.19414】From Retrieval to Reasoning: A Framework for Cyber Threat Intelligence NER with Explicit and Adaptive Instructions
链接:https://arxiv.org/abs/2512.19414
作者:Jiaren Peng,Hongda Sun,Xuan Tian,Cheng Huang,Zeqing Li,Rui Yan
类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)
关键词:Cyber Threat Intelligence, Named Entity Recognition, Threat Intelligence, Cyber Threat, extract critical entities
备注:
点击查看摘要
Abstract:The automation of Cyber Threat Intelligence (CTI) relies heavily on Named Entity Recognition (NER) to extract critical entities from unstructured text. Currently, Large Language Models (LLMs) primarily address this task through retrieval-based In-Context Learning (ICL). This paper analyzes this mainstream paradigm, revealing a fundamental flaw: its success stems not from global semantic similarity but largely from the incidental overlap of entity types within retrieved examples. This exposes the limitations of relying on unreliable implicit induction. To address this, we propose TTPrompt, a framework shifting from implicit induction to explicit instruction. TTPrompt maps the core concepts of CTI's Tactics, Techniques, and Procedures (TTPs) into an instruction hierarchy: formulating task definitions as Tactics, guiding strategies as Techniques, and annotation guidelines as Procedures. Furthermore, to handle the adaptability challenge of static guidelines, we introduce Feedback-driven Instruction Refinement (FIR). FIR enables LLMs to self-refine guidelines by learning from errors on minimal labeled data, adapting to distinct annotation dialects. Experiments on five CTI NER benchmarks demonstrate that TTPrompt consistently surpasses retrieval-based baselines. Notably, with refinement on just 1% of training data, it rivals models fine-tuned on the full dataset. For instance, on LADDER, its Micro F1 of 71.96% approaches the fine-tuned baseline, and on the complex CTINexus, its Macro F1 exceeds the fine-tuned ACLM model by 10.91%.
17. 【2512.19400】Kunnafonidilaw ka Cadeau: an ASR dataset of present-day Bambara
链接:https://arxiv.org/abs/2512.19400
作者:Yacouba Diarra,Panga Azazia Kamate,Nouhoum Souleymane Coulibaly,Michael Leventhal
类目:Computation and Language (cs.CL)
关键词:Bambara ASR dataset, Malian radio archives, capture present-day spontaneous, ASR dataset compiled, Bambara ASR
备注: 7 pages, 2 figures
点击查看摘要
Abstract:We present Kunkado, a 160-hour Bambara ASR dataset compiled from Malian radio archives to capture present-day spontaneous speech across a wide range of topics. It includes code-switching, disfluencies, background noise, and overlapping speakers that practical ASR systems encounter in real-world use. We finetuned Parakeet-based models on a 33.47-hour human-reviewed subset and apply pragmatic transcript normalization to reduce variability in number formatting, tags, and code-switching annotations. Evaluated on two real-world test sets, finetuning with Kunkado reduces WER from 44.47\% to 37.12\% on one and from 36.07\% to 32.33\% on the other. In human evaluation, the resulting model also outperforms a comparable system with the same architecture trained on 98 hours of cleaner, less realistic speech. We release the data and models to support robust ASR for predominantly oral languages.
18. 【2512.19378】HATS: High-Accuracy Triple-Set Watermarking for Large Language Models
链接:https://arxiv.org/abs/2512.19378
作者:Zhiqing Hu,Chenxu Zhao,Jiazhong Lu,Xiaolei Liu
类目:Computation and Language (cs.CL)
关键词:embed implicit signals, Misuse of LLM-generated, curbed by watermarking, watermarking techniques, techniques that embed
备注: Camera-ready version of the paper accepted for oral presentation at the 11th International Conference on Computer and Communications (ICCC 2025)
点击查看摘要
Abstract:Misuse of LLM-generated text can be curbed by watermarking techniques that embed implicit signals into the output. We propose a watermark that partitions the vocabulary at each decoding step into three sets (Green/Yellow/Red) with fixed ratios and restricts sampling to the Green and Yellow sets. At detection time, we replay the same partitions, compute Green-enrichment and Red-depletion statistics, convert them to one-sided z-scores, and aggregate their p-values via Fisher's method to decide whether a passage is watermarked. We implement generation, detection, and testing on Llama 2 7B, and evaluate true-positive rate, false-positive rate, and text quality. Results show that the triple-partition scheme achieves high detection accuracy at fixed FPR while preserving readability.
19. 【2512.19320】MAGIC: Achieving Superior Model Merging via Magnitude Calibration
链接:https://arxiv.org/abs/2512.19320
作者:Yayuan Li,Jian Zhang,Jintao Guo,Zihan Cheng,Lei Qi,Yinghuan Shi,Yang Gao
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:Space Calibration, specialised models, proliferation of pre-trained, wide array, Model
备注:
点击查看摘要
Abstract:The proliferation of pre-trained models has given rise to a wide array of specialised, fine-tuned models. Model merging aims to merge the distinct capabilities of these specialised models into a unified model, requiring minimal or even no additional training. A core objective of model merging is to ensure the merged model retains the behavioural characteristics of the specialised models, typically achieved through feature alignment. We identify that features consist of two critical components: direction and magnitude. Prior research has predominantly focused on directional alignment, while the influence of magnitude remains largely neglected, despite its pronounced vulnerability to perturbations introduced by common merging operations (e.g., parameter fusion and sparsification). Such perturbations to magnitude inevitably lead to feature deviations in the merged model from the specialised models, resulting in subsequent performance degradation. To address this, we propose MAGnItude Calibration (MAGIC), a plug-and-play framework that rectifies layer-wise magnitudes in feature and weight spaces, with three variants. Specifically, our Feature Space Calibration (FSC) realigns the merged model's features using a small set of unlabelled data, while Weight Space Calibration (WSC) extends this calibration to the weight space without requiring additional data. Combining these yields Dual Space Calibration (DSC). Comprehensive experiments demonstrate that MAGIC consistently boosts performance across diverse Computer Vision tasks (+4.3% on eight datasets) and NLP tasks (+8.0% on Llama) without additional training. Our code is available at: this https URL
20. 【2512.19305】CienaLLM: Generative Climate-Impact Extraction from News Articles with Autoregressive LLMs
链接:https://arxiv.org/abs/2512.19305
作者:Javier Vela-Tambo,Jorge Gracia,Fernando Dominguez-Castro
类目:Computation and Language (cs.CL)
关键词:Understanding and monitoring, Generative Information Extraction, requires extracting structured, climate hazards requires, Information Extraction
备注:
点击查看摘要
Abstract:Understanding and monitoring the socio-economic impacts of climate hazards requires extracting structured information from heterogeneous news articles on a large scale. To that end, we have developed CienaLLM, a modular framework based on schema-guided Generative Information Extraction. CienaLLM uses open-weight Large Language Models for zero-shot information extraction from news articles, and supports configurable prompts and output schemas, multi-step pipelines, and cloud or on-premise inference. To systematically assess how the choice of LLM family, size, precision regime, and prompting strategy affect performance, we run a large factorial study in models, precisions, and prompt engineering techniques. An additional response parsing step nearly eliminates format errors while preserving accuracy; larger models deliver the strongest and most stable performance, while quantization offers substantial efficiency gains with modest accuracy trade-offs; and prompt strategies show heterogeneous, model-specific effects. CienaLLM matches or outperforms the supervised baseline in accuracy for extracting drought impacts from Spanish news, although at a higher inference cost. While evaluated in droughts, the schema-driven and model-agnostic design is suitable for adapting to related information extraction tasks (e.g., other hazards, sectors, or languages) by editing prompts and schemas rather than retraining. We release code, configurations, and schemas to support reproducible use.
21. 【2512.19247】Auto-Prompting with Retrieval Guidance for Frame Detection in Logistics
链接:https://arxiv.org/abs/2512.19247
作者:Do Minh Duc,Quan Xuan Truong,Nguyen Tat Dat,Nguyen Van Vinh
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:large language models, adapting large language, Prompt engineering plays, language models, engineering plays
备注:
点击查看摘要
Abstract:Prompt engineering plays a critical role in adapting large language models (LLMs) to complex reasoning and labeling tasks without the need for extensive fine-tuning. In this paper, we propose a novel prompt optimization pipeline for frame detection in logistics texts, combining retrieval-augmented generation (RAG), few-shot prompting, chain-of-thought (CoT) reasoning, and automatic CoT synthesis (Auto-CoT) to generate highly effective task-specific prompts. Central to our approach is an LLM-based prompt optimizer agent that iteratively refines the prompts using retrieved examples, performance feedback, and internal self-evaluation. Our framework is evaluated on a real-world logistics text annotation task, where reasoning accuracy and labeling efficiency are critical. Experimental results show that the optimized prompts - particularly those enhanced via Auto-CoT and RAG - improve real-world inference accuracy by up to 15% compared to baseline zero-shot or static prompts. The system demonstrates consistent improvements across multiple LLMs, including GPT-4o, Qwen 2.5 (72B), and LLaMA 3.1 (70B), validating its generalizability and practical value. These findings suggest that structured prompt optimization is a viable alternative to full fine-tuning, offering scalable solutions for deploying LLMs in domain-specific NLP applications such as logistics.
22. 【2512.19240】ChemATP: A Training-Free Chemical Reasoning Framework for Large Language Models
链接:https://arxiv.org/abs/2512.19240
作者:Mingxu Zhang,Dazhong Shen,Qi Zhang,Ying Sun
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Large Language, standard string representations, molecular science due, exhibit strong general
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) exhibit strong general reasoning but struggle in molecular science due to the lack of explicit chemical priors in standard string representations. Current solutions face a fundamental dilemma. Training-based methods inject priors into parameters, but this static coupling hinders rapid knowledge updates and often compromises the model's general reasoning capabilities. Conversely, existing training-free methods avoid these issues but rely on surface-level prompting, failing to provide the fine-grained atom-level priors essential for precise chemical reasoning. To address this issue, we introduce ChemATP, a framework that decouples chemical knowledge from the reasoning engine. By constructing the first atom-level textual knowledge base, ChemATP enables frozen LLMs to explicitly retrieve and reason over this information dynamically. This architecture ensures interpretability and adaptability while preserving the LLM's intrinsic general intelligence. Experiments show that ChemATP significantly outperforms training-free baselines and rivals state-of-the-art training-based models, demonstrating that explicit prior injection is a competitive alternative to implicit parameter updates.
23. 【2512.19238】Identifying Features Associated with Bias Against 93 Stigmatized Groups in Language Models and Guardrail Model Safety Mitigation
链接:https://arxiv.org/abs/2512.19238
作者:Anna-Maria Gueorguieva,Aylin Caliskan
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Large language models, Large language, bias, Large, outputs
备注:
点击查看摘要
Abstract:Large language models (LLMs) have been shown to exhibit social bias, however, bias towards non-protected stigmatized identities remain understudied. Furthermore, what social features of stigmas are associated with bias in LLM outputs is unknown. From psychology literature, it has been shown that stigmas contain six shared social features: aesthetics, concealability, course, disruptiveness, origin, and peril. In this study, we investigate if human and LLM ratings of the features of stigmas, along with prompt style and type of stigma, have effect on bias towards stigmatized groups in LLM outputs. We measure bias against 93 stigmatized groups across three widely used LLMs (Granite 3.0-8B, Llama-3.1-8B, Mistral-7B) using SocialStigmaQA, a benchmark that includes 37 social scenarios about stigmatized identities; for example deciding wether to recommend them for an internship. We find that stigmas rated by humans to be highly perilous (e.g., being a gang member or having HIV) have the most biased outputs from SocialStigmaQA prompts (60% of outputs from all models) while sociodemographic stigmas (e.g. Asian-American or old age) have the least amount of biased outputs (11%). We test if the amount of biased outputs could be decreased by using guardrail models, models meant to identify harmful input, using each LLM's respective guardrail model (Granite Guardian 3.0, Llama Guard 3.0, Mistral Moderation API). We find that bias decreases significantly by 10.4%, 1.4%, and 7.8%, respectively. However, we show that features with significant effect on bias remain unchanged post-mitigation and that guardrail models often fail to recognize the intent of bias in prompts. This work has implications for using LLMs in scenarios involving stigmatized groups and we suggest future work towards improving guardrail models for bias mitigation.
24. 【2512.19173】CycleChart: A Unified Consistency-Based Learning Framework for Bidirectional Chart Understanding and Generation
链接:https://arxiv.org/abs/2512.19173
作者:Dazhen Deng,Sen Yang,Yuchen He,Yuan Tian,Yingcai Wu
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:Current chart-specific tasks, Current chart-specific, link chart generation, chart, chart question answering
备注:
点击查看摘要
Abstract:Current chart-specific tasks, such as chart question answering, chart parsing, and chart generation, are typically studied in isolation, preventing models from learning the shared semantics that link chart generation and interpretation. We introduce CycleChart, a consistency-based learning framework for bidirectional chart understanding and generation. CycleChart adopts a schema-centric formulation as a common interface across tasks. We construct a consistent multi-task dataset, where each chart sample includes aligned annotations for schema prediction, data parsing, and question answering. To learn cross-directional chart semantics, CycleChart introduces a generate-parse consistency objective: the model generates a chart schema from a table and a textual query, then learns to recover the schema and data from the generated chart, enforcing semantic alignment across directions. CycleChart achieves strong results on chart generation, chart parsing, and chart question answering, demonstrating improved cross-task generalization and marking a step toward more general chart understanding models.
25. 【2512.19171】JEPA-Reasoner: Decoupling Latent Reasoning from Token Generation
链接:https://arxiv.org/abs/2512.19171
作者:Bingyang Kelvin Liu,Ziyu Patrick Chen
类目:Computation and Language (cs.CL)
关键词:Joint-Embedding Predictive Architecture, Predictive Architecture, Joint-Embedding Predictive, lacks generative abilities, rich latent representations
备注:
点击查看摘要
Abstract:While Joint-Embedding Predictive Architecture (JEPA) has emerged as a powerful architecture for learning rich latent representations, it fundamentally lacks generative abilities. Meanwhile, latent space reasoning attempts for Transformer models like COCONUT do improve performance, but they ultimately rely on token-by-token generation, which still accumulates compounding error and relies on context information to gain reasoning insights. To address these limitations, we propose JEPA-Reasoner, a novel JEPA model enhanced with generative ability that reasons in latent space. We augment it with a separate action-taker model, Talker, to produce human-readable sentences. Our approach demonstrates that decoupling latent space reasoning and token generation enables JEPA-Reasoner to produce mixed latent vectors that might lay the foundation for multi-threaded reasoning, while performing autoregressive generation with superior robustness to compounding error.
26. 【2512.19161】From Speech to Subtitles: Evaluating ASR Models in Subtitling Italian Television Programs
链接:https://arxiv.org/abs/2512.19161
作者:Alessandro Lucca,Francesco Pierri
类目:Computation and Language (cs.CL)
关键词:Automatic Speech Recognition, Subtitles are essential, Modern Automatic Speech, audience engagement, accessibility and audience
备注:
点击查看摘要
Abstract:Subtitles are essential for video accessibility and audience engagement. Modern Automatic Speech Recognition (ASR) systems, built upon Encoder-Decoder neural network architectures and trained on massive amounts of data, have progressively reduced transcription errors on standard benchmark datasets. However, their performance in real-world production environments, particularly for non-English content like long-form Italian videos, remains largely unexplored. This paper presents a case study on developing a professional subtitling system for an Italian media company. To inform our system design, we evaluated four state-of-the-art ASR models (Whisper Large v2, AssemblyAI Universal, Parakeet TDT v3 0.6b, and WhisperX) on a 50-hour dataset of Italian television programs. The study highlights their strengths and limitations, benchmarking their performance against the work of professional human subtitlers. The findings indicate that, while current models cannot meet the media industry's accuracy needs for full autonomy, they can serve as highly effective tools for enhancing human productivity. We conclude that a human-in-the-loop (HITL) approach is crucial and present the production-grade, cloud-based infrastructure we designed to support this workflow.
27. 【2512.19134】QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation
链接:https://arxiv.org/abs/2512.19134
作者:Dehai Min,Kailin Zhang,Tongtong Wu,Lu Cheng
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:Retrieval-Augmented Generation adaptively, Generation adaptively determines, large language models, adaptively determines, large language
备注:
点击查看摘要
Abstract:Dynamic Retrieval-Augmented Generation adaptively determines when to retrieve during generation to mitigate hallucinations in large language models (LLMs). However, existing methods rely on model-internal signals (e.g., logits, entropy), which are fundamentally unreliable because LLMs are typically ill-calibrated and often exhibit high confidence in erroneous outputs. We propose QuCo-RAG, which shifts from subjective confidence to objective statistics computed from pre-training data. Our method quantifies uncertainty through two stages: (1) before generation, we identify low-frequency entities indicating long-tail knowledge gaps; (2) during generation, we verify entity co-occurrence in the pre-training corpus, where zero co-occurrence often signals hallucination risk. Both stages leverage Infini-gram for millisecond-latency queries over 4 trillion tokens, triggering retrieval when uncertainty is high. Experiments on multi-hop QA benchmarks show QuCo-RAG achieves EM gains of 5--12 points over state-of-the-art baselines with OLMo-2 models, and transfers effectively to models with undisclosed pre-training data (Llama, Qwen, GPT), improving EM by up to 14 points. Domain generalization on biomedical QA further validates the robustness of our paradigm. These results establish corpus-grounded verification as a principled, practically model-agnostic paradigm for dynamic RAG. Our code is publicly available at this https URL.
28. 【2512.19126】AWPO: Enhancing Tool-Use of Large Language Models through Explicit Integration of Reasoning Rewards
链接:https://arxiv.org/abs/2512.19126
作者:Zihan Lin,Xiaohan Wang,Hexiong Yang,Jiajun Chai,Jie Cao,Guojun Yin,Wei Lin,Ran He
类目:Computation and Language (cs.CL)
关键词:existing methods largely, methods largely overlook, verifiable outcome rewards, training tool-use large, tool-use large language
备注:
点击查看摘要
Abstract:While reinforcement learning (RL) shows promise in training tool-use large language models (LLMs) using verifiable outcome rewards, existing methods largely overlook the potential of explicit reasoning rewards to bolster reasoning and tool utilization. Furthermore, natively combining reasoning and outcome rewards may yield suboptimal performance or conflict with the primary optimization objective. To address this, we propose advantage-weighted policy optimization (AWPO) -- a principled RL framework that effectively integrates explicit reasoning rewards to enhance tool-use capability. AWPO incorporates variance-aware gating and difficulty-aware weighting to adaptively modulate advantages from reasoning signals based on group-relative statistics, alongside a tailored clipping mechanism for stable optimization. Extensive experiments demonstrate that AWPO achieves state-of-the-art performance across standard tool-use benchmarks, significantly outperforming strong baselines and leading closed-source models in challenging multi-turn scenarios. Notably, with exceptional parameter efficiency, our 4B model surpasses Grok-4 by 16.0 percent in multi-turn accuracy while preserving generalization capability on the out-of-distribution MMLU-Pro benchmark.
29. 【2512.19125】SAP: Syntactic Attention Pruning for Transformer-based Language Models
链接:https://arxiv.org/abs/2512.19125
作者:Tzu-Yun Lee,Ding-Yong Hong,Jan-Jan Wu
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:paper introduces Syntactic, introduces Syntactic Attention, Transformer models, paper introduces, introduces Syntactic
备注:
点击查看摘要
Abstract:This paper introduces Syntactic Attention Pruning (SAP), a novel method for effectively pruning attention heads in Transformer models. Unlike conventional approaches that rely solely on mathematical analysis of model weights and activations, SAP incorporates both the syntactic structure and attention patterns of sentences to guide the pruning process. By leveraging these linguistic features, SAP not only achieves performance comparable to state-of-the-art methods but also enhances the interpretability of model behavior. To further improve robustness, we propose Candidate Filtering (CF), a mechanism that prioritizes heads based on their contribution to model performance, mitigating degradation during pruning. Experimental results indicate that SAP effectively preserves critical heads of a high density of strong attention values, outperforming existing head pruning strategies in retrain-free settings. These findings position SAP as a promising foundation for a new direction in model compression research, offering high flexibility for pruning across all transformer-based language models.
30. 【2512.19122】BanglaForge: LLM Collaboration with Self-Refinement for Bangla Code Generation
链接:https://arxiv.org/abs/2512.19122
作者:Mahir Labib Dihan,Sadif Ahmed,Md Nafiu Rahman
类目:oftware Engineering (cs.SE); Computation and Language (cs.CL)
关键词:lacking large-scale annotated, transform natural language, natural language specifications, large-scale annotated datasets, Bangla Code Generation
备注: Accepted at BLP Workshop @ IJCNLP-AACL 2025. Code is available at [this https URL](https://github.com/mahirlabibdihan/BanglaForge)
点击查看摘要
Abstract:Bangla is a low-resource language for code generation, lacking large-scale annotated datasets and tools to transform natural language specifications into executable programs. This makes Bangla-to-code generation a challenging task requiring innovative solutions. To address this, we introduce BanglaForge, a novel framework for generating code from Bangla function descriptions. BanglaForge leverages a retrieval-augmented dual-model collaboration paradigm with self-refinement, combining in-context learning, llm-based translation, systematic prompt engineering, and iterative self-refinement based on execution feedback, where a coder generates initial solutions and a reviewer enhances them for robustness. On the BLP-2025 Bangla Code Generation benchmark, BanglaForge achieves a competitive Pass@1 accuracy of 84.00%, demonstrating the effectiveness of retrieval, model collaboration, and self-refinement for low-resource Bangla code generation.
31. 【2512.19117】Stop saying LLM: Large Discourse Models (LDM) and Artificial Discursive Agent (ADA)?
链接:https://arxiv.org/abs/2512.19117
作者:Amar Lakel(MICA)
类目:Computation and Language (cs.CL)
关键词:category Large Language, Large Language Models, Large Discourse Models, large generative models, Large Language
备注: in French language
点击查看摘要
Abstract:This paper proposes an epistemological shift in the analysis of large generative models, replacing the category ''Large Language Models'' (LLM) with that of ''Large Discourse Models'' (LDM), and then with that of Artificial Discursive Agent (ADA). The theoretical framework is based on an ontological triad distinguishing three regulatory instances: the apprehension of the phenomenal regularities of the referential world, the structuring of embodied cognition, and the structural-linguistic sedimentation of the utterance within a socio-historical context. LDMs, operating on the product of these three instances (the document), model the discursive projection of a portion of human experience reified by the learning corpus. The proposed program aims to replace the ''fascination/fear'' dichotomy with public trials and procedures that make the place, uses, and limits of artificial discursive agents in contemporary social space decipherable, situating this approach within a perspective of governance and co-regulation involving the State, industry, civil society, and academia.
32. 【2512.19092】A Large Language Model Based Method for Complex Logical Reasoning over Knowledge Graphs
链接:https://arxiv.org/abs/2512.19092
作者:Ziyan Zhang,Chao Wang,Zhuo Chen,Lei Chen,Chiyi Li,Kai Song
类目:Computation and Language (cs.CL)
关键词:first-order logic, real-world KGs, challenging due, inherent incompleteness, incompleteness of real-world
备注:
点击查看摘要
Abstract:Reasoning over knowledge graphs (KGs) with first-order logic (FOL) queries is challenging due to the inherent incompleteness of real-world KGs and the compositional complexity of logical query structures. Most existing methods rely on embedding entities and relations into continuous geometric spaces and answer queries via differentiable set operations. While effective for simple query patterns, these approaches often struggle to generalize to complex queries involving multiple operators, deeper reasoning chains, or heterogeneous KG schemas. We propose ROG (Reasoning Over knowledge Graphs with large language models), an ensemble-style framework that combines query-aware KG neighborhood retrieval with large language model (LLM)-based chain-of-thought reasoning. ROG decomposes complex FOL queries into sequences of simpler sub-queries, retrieves compact, query-relevant subgraphs as contextual evidence, and performs step-by-step logical inference using an LLM, avoiding the need for task-specific embedding optimization. Experiments on standard KG reasoning benchmarks demonstrate that ROG consistently outperforms strong embedding-based baselines in terms of mean reciprocal rank (MRR), with particularly notable gains on high-complexity query types. These results suggest that integrating structured KG retrieval with LLM-driven logical reasoning offers a robust and effective alternative for complex KG reasoning tasks.
33. 【2512.19070】Watch Closely: Mitigating Object Hallucinations in Large Vision-Language Models with Disentangled Decoding
链接:https://arxiv.org/abs/2512.19070
作者:Ruiqi Ma,Yu Yan,Chunhong Zhang,Minghao Yin,XinChao Liu,Zhihong Jin,Zheng Hu
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:Large Vision-Language Models, demonstrating strong potential, Large Vision-Language, bridge the gap, demonstrating strong
备注:
点击查看摘要
Abstract:Large Vision-Language Models (LVLMs) bridge the gap between visual and linguistic modalities, demonstrating strong potential across a variety of domains. However, despite significant progress, LVLMs still suffer from severe hallucination issues in object recognition tasks. These models often fail to accurately identify certain objects, leading to text generation that appears fluent but does not correspond to the visual content, which can have serious consequences in real-world applications. Recently, several methods have been proposed to alleviate LVLM hallucinations, but most focus solely on reducing hallucinations in the language modality. To mitigate hallucinations in both the language and visual modalities, we introduce Hallucination Disentangled Decoding (HDD) method that requires no training. HDD enhances the original image by segmenting it and selecting images that augment the original, while also utilizing a blank image to eliminate language prior hallucinations in both the original and segmented images. This design not only reduces the model's dependence on language priors but also enhances its visual performance. (Code: this https URL)
34. 【2512.19012】DramaBench: A Six-Dimensional Evaluation Framework for Drama Script Continuation
链接:https://arxiv.org/abs/2512.19012
作者:Shijian Ma,Yunqi Huang,Yan Lin
类目:Computation and Language (cs.CL)
关键词:maintain character consistency, advance plot coherently, Drama script continuation, preserve dramatic structurecapabilities, character consistency
备注:
点击查看摘要
Abstract:Drama script continuation requires models to maintain character consistency, advance plot coherently, and preserve dramatic structurecapabilities that existing benchmarks fail to evaluate comprehensively. We present DramaBench, the first large-scale benchmark for evaluating drama script continuation across six independent dimensions: Format Standards, Narrative Efficiency, Character Consistency, Emotional Depth, Logic Consistency, and Conflict Handling. Our framework combines rulebased analysis with LLM-based labeling and statistical metrics, ensuring objective and reproducible evaluation. We conduct comprehensive evaluation of 8 state-of-the-art language models on 1,103 scripts (8,824 evaluations total), with rigorous statistical significance testing (252 pairwise comparisons, 65.9% significant) and human validation (188 scripts, substantial agreement on 3/5 dimensions). Our ablation studies confirm all six dimensions capture independent quality aspects (mean | r | = 0.020). DramaBench provides actionable, dimensionspecific feedback for model improvement and establishes a rigorous standard for creative writing evaluation.
35. 【2512.19011】Efficient Jailbreak Mitigation Using Semantic Linear Classification in a Multi-Staged Pipeline
链接:https://arxiv.org/abs/2512.19011
作者:Akshaj Prashanth Rao,Advait Singh,Saumya Kumaar Saksena,Dhruv Kumar
类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:large language model, jailbreaking attacks pose, attacks pose persistent, pose persistent security, persistent security challenges
备注: Under Review
点击查看摘要
Abstract:Prompt injection and jailbreaking attacks pose persistent security challenges to large language model (LLM)-based systems. We present an efficient and systematically evaluated defense architecture that mitigates these threats through a lightweight, multi-stage pipeline. Its core component is a semantic filter based on text normalization, TF-IDF representations, and a Linear SVM classifier. Despite its simplicity, this module achieves 93.4% accuracy and 96.5% specificity on held-out data, substantially reducing attack throughput while incurring negligible computational overhead. Building on this efficient foundation, the full pipeline integrates complementary detection and mitigation mechanisms that operate at successive stages, providing strong robustness with minimal latency. In comparative experiments, our SVM-based configuration improves overall accuracy from 35.1% to 93.4% while reducing average time to completion from approximately 450s to 47s, yielding over 10 times lower latency than ShieldGemma. These results demonstrate that the proposed design simultaneously advances defensive precision and efficiency, addressing a core limitation of current model-based moderators. Evaluation across a curated corpus of over 30,000 labeled prompts, including benign, jailbreak, and application-layer injections, confirms that staged, resource-efficient defenses can robustly secure modern LLM-driven applications.
Comments:
Under Review
Subjects:
Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:
arXiv:2512.19011 [cs.CR]
(or
arXiv:2512.19011v1 [cs.CR] for this version)
https://doi.org/10.48550/arXiv.2512.19011
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
36. 【2512.19004】Context-Aware Initialization for Reducing Generative Path Length in Diffusion Language Models
链接:https://arxiv.org/abs/2512.19004
作者:Tongyuan Miao,Gary Huang,Kai Jun Han,Annie Jiang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:enable fully parallel, inference time due, Large Language Models, Diffusion Large Language, fully parallel token
备注:
点击查看摘要
Abstract:Diffusion Large Language Models (DLLMs) enable fully parallel token decoding but often remain impractical at inference time due to the many denoising iterations required to refine an information-free, fully masked initialization into coherent text. Most existing acceleration methods focus on traversing this generative trajectory more efficiently via improved solvers or sampling strategies. We advance a complementary perspective: shorten the trajectory itself by starting closer to the target distribution through context-aware initialization. We propose a training-free interface that injects prompt-conditioned priors from a lightweight auxiliary model into the diffusion initialization, and instantiate it with two mechanisms: discrete token injection and representation-level embedding interpolation. Because injected priors can be imperfect and unmask-only decoding can over-commit early, we also introduce a simple confidence-based remasking mechanism as a form of prior skepticism. Preliminary evidence on GSM8K suggests that context-aware initialization can substantially reduce denoising iterations (about 35\% fewer function evaluations in our setting), while also exposing a key open challenge: naive warm-starting can degrade final accuracy relative to strong diffusion baselines. We use these findings to motivate a research agenda around calibration, revision mechanisms, and representation alignment for reliable warm-started diffusion decoding.
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:
arXiv:2512.19004 [cs.CL]
(or
arXiv:2512.19004v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2512.19004
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
37. 【2512.18999】Evaluating the Challenges of LLMs in Real-world Medical Follow-up: A Comparative Study and An Optimized Framework
链接:https://arxiv.org/abs/2512.18999
作者:Jinyan Liu,Zikang Chen,Qinchuan Wang,Tan Xie,Heming Zheng,Xudong Lv
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Large Language, Language Models, inaccurate information extraction, information extraction due
备注: 10 pages,3 figures,conference ICCBB2025
点击查看摘要
Abstract:When applied directly in an end-to-end manner to medical follow-up tasks, Large Language Models (LLMs) often suffer from uncontrolled dialog flow and inaccurate information extraction due to the complexity of follow-up forms. To address this limitation, we designed and compared two follow-up chatbot systems: an end-to-end LLM-based system (control group) and a modular pipeline with structured process control (experimental group). Experimental results show that while the end-to-end approach frequently fails on lengthy and complex forms, our modular method-built on task decomposition, semantic clustering, and flow management-substantially improves dialog stability and extraction accuracy. Moreover, it reduces the number of dialogue turns by 46.73% and lowers token consumption by 80% to 87.5%. These findings highlight the necessity of integrating external control mechanisms when deploying LLMs in high-stakes medical follow-up scenarios.
38. 【2512.18987】Affordance RAG: Hierarchical Multimodal Retrieval with Affordance-Aware Embodied Memory for Mobile Manipulation
链接:https://arxiv.org/abs/2512.18987
作者:Ryosuke Korekata,Quanting Xie,Yonatan Bisk,Komei Sugiura
类目:Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:free-form natural language, natural language instructions, address the problem, problem of open-vocabulary, required to carry
备注: Accepted to IEEE RA-L, with presentation at ICRA 2026
点击查看摘要
Abstract:In this study, we address the problem of open-vocabulary mobile manipulation, where a robot is required to carry a wide range of objects to receptacles based on free-form natural language instructions. This task is challenging, as it involves understanding visual semantics and the affordance of manipulation actions. To tackle these challenges, we propose Affordance RAG, a zero-shot hierarchical multimodal retrieval framework that constructs Affordance-Aware Embodied Memory from pre-explored images. The model retrieves candidate targets based on regional and visual semantics and reranks them with affordance scores, allowing the robot to identify manipulation options that are likely to be executable in real-world environments. Our method outperformed existing approaches in retrieval performance for mobile manipulation instruction in large-scale indoor environments. Furthermore, in real-world experiments where the robot performed mobile manipulation in indoor environments based on free-form instructions, the proposed method achieved a task success rate of 85%, outperforming existing methods in both retrieval performance and overall task success.
39. 【2512.18940】FASTRIC: Prompt Specification Language for Verifiable LLM Interactions
链接:https://arxiv.org/abs/2512.18940
作者:Wen-Long Jin
类目:Computation and Language (cs.CL); Software Engineering (cs.SE)
关键词:Large Language Models, Finite State Machines, Large Language, lack formal specifications, implicit Finite State
备注: 13 pages, 3 figures. Supplementary materials at [this https URL](https://doi.org/10.17605/OSF.IO/PV6R3)
点击查看摘要
Abstract:Large Language Models (LLMs) execute complex multi-turn interaction protocols but lack formal specifications to verify execution against designer intent. We introduce FASTRIC, a Prompt Specification Language that makes implicit Finite State Machines (FSMs) explicit in natural language prompts, enabling conformance verification through execution trace analysis. The LLM serves as intelligent execution agent: interpreting designer-encoded FSMs to execute specified behavioral roles. Unlike symbolic specification languages requiring parsers and compilers, FASTRIC leverages LLMs as unified infrastructure-simultaneously parser, interpreter, runtime environment, and development assistant. FASTRIC guides designers to articulate seven FSM elements (Final States, Agents, States, Triggers, Roles, Initial State, Constraints) structuring multi-turn interactions. Specification formality-ranging from implicit descriptions that frontier models infer to explicit step-by-step instructions for weaker models-serves as a design parameter. We introduce procedural conformance as verification metric measuring execution adherence to FSM specifications. Testing a 3-state kindergarten tutoring FSM across four formality levels and three model scales (14.7B, 685B, 1T+ parameters) reveals optimal specification formality is a function of model capacity. DeepSeek-V3.2 (685B) achieves perfect conformance (1.00) at L2-L4; ChatGPT-5 (~1T) peaks at L3 (0.90) before collapsing at L4 (0.39); Phi4 (14.7B) shows no stable optimum with high variance (SD=0.16-0.36). These findings reveal model-specific formality ranges-"Goldilocks zones"-where specifications provide sufficient structure without over-constraint, establishing Prompt Specification Engineering for creating verifiable interaction protocols, transforming multi-turn interaction design from heuristic art to systematic engineering with measurable procedural guarantees.
40. 【2512.18906】Remedy-R: Generative Reasoning for Machine Translation Evaluation without Error Annotations
链接:https://arxiv.org/abs/2512.18906
作者:Shaomu Tan,Ryosuke Mitani,Ritvik Choudhary,Qiyu Wu,Toshiyuki Sekiya,Christof Monz
类目:Computation and Language (cs.CL)
关键词:human ratings, hillclimbed benchmarks, benchmarks and presented, human-level agreement, agreement with human
备注:
点击查看摘要
Abstract:Over the years, automatic MT metrics have hillclimbed benchmarks and presented strong and sometimes human-level agreement with human ratings. Yet they remain black-box, offering little insight into their decision-making and often failing under real-world out-of-distribution (OOD) inputs. We introduce Remedy-R, a reasoning-driven generative MT metric trained with reinforcement learning from pairwise translation preferences, without requiring error-span annotations or distillation from closed LLMs. Remedy-R produces step-by-step analyses of accuracy, fluency, and completeness, followed by a final score, enabling more interpretable assessments. With only 60K training pairs across two language pairs, Remedy-R remains competitive with top scalar metrics and GPT-4-based judges on WMT22-24 meta-evaluation, generalizes to other languages, and exhibits strong robustness on OOD stress tests. Moreover, Remedy-R models generate self-reflective feedback that can be reused for translation improvement. Building on this finding, we introduce Remedy-R Agent, a simple evaluate-revise pipeline that leverages Remedy-R's evaluation analysis to refine translations. This agent consistently improves translation quality across diverse models, including Qwen2.5, ALMA-R, GPT-4o-mini, and Gemini-2.0-Flash, suggesting that Remedy-R's reasoning captures translation-relevant information and is practically useful.
41. 【2512.18880】Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction
链接:https://arxiv.org/abs/2512.18880
作者:Ming Li,Han Chen,Yunze Xiao,Jian Chen,Hong Jiao,Tianyi Zhou
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
关键词:cold start problem, Large Language Models, start problem, educational assessment, assessment but suffers
备注:
点击查看摘要
Abstract:Accurate estimation of item (question or task) difficulty is critical for educational assessment but suffers from the cold start problem. While Large Language Models demonstrate superhuman problem-solving capabilities, it remains an open question whether they can perceive the cognitive struggles of human learners. In this work, we present a large-scale empirical analysis of Human-AI Difficulty Alignment for over 20 models across diverse domains such as medical knowledge and mathematical reasoning. Our findings reveal a systematic misalignment where scaling up model size is not reliably helpful; instead of aligning with humans, models converge toward a shared machine consensus. We observe that high performance often impedes accurate difficulty estimation, as models struggle to simulate the capability limitations of students even when being explicitly prompted to adopt specific proficiency levels. Furthermore, we identify a critical lack of introspection, as models fail to predict their own limitations. These results suggest that general problem-solving capability does not imply an understanding of human cognitive struggles, highlighting the challenge of using current models for automated difficulty prediction.
42. 【2512.18865】Application of deep learning approaches for medieval historical documents transcription
链接:https://arxiv.org/abs/2512.18865
作者:Maksym Voloshchuk,Bohdana Zarembovska,Mykola Kozlenko
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:solutions show excellent, handwritten Latin-language documents, drops with Latin, optical character recognition, character recognition solutions
备注: 15 pages, 15 figures, 4 tables. Originally published by CEUR Workshop Proceedings ( [this http URL](http://CEUR-WS.org) , ISSN 1613-0073), available: [this https URL](https://ceur-ws.org/Vol-4133/S_05_Kozlenko.pdf)
点击查看摘要
Abstract:Handwritten text recognition and optical character recognition solutions show excellent results with processing data of modern era, but efficiency drops with Latin documents of medieval times. This paper presents a deep learning method to extract text information from handwritten Latin-language documents of the 9th to 11th centuries. The approach takes into account the properties inherent in medieval documents. The paper provides a brief introduction to the field of historical document transcription, a first-sight analysis of the raw data, and the related works and studies. The paper presents the steps of dataset development for further training of the models. The explanatory data analysis of the processed data is provided as well. The paper explains the pipeline of deep learning models to extract text information from the document images, from detecting objects to word recognition using classification models and embedding word images. The paper reports the following results: recall, precision, F1 score, intersection over union, confusion matrix, and mean string distance. The plots of the metrics are also included. The implementation is published on the GitHub repository.
43. 【2512.18859】oward Human-Centered AI-Assisted Terminology Work
链接:https://arxiv.org/abs/2512.18859
作者:Antonio San Martin
类目:Computation and Language (cs.CL)
关键词:transforming terminology work, generative artificial intelligence, artificial intelligence, rapid diffusion, diffusion of generative
备注:
点击查看摘要
Abstract:The rapid diffusion of generative artificial intelligence is transforming terminology work. While this technology promises gains in efficiency, its unstructured adoption risks weakening professional autonomy, amplifying bias, and eroding linguistic and conceptual diversity. This paper argues that a human-centered approach to artificial intelligence has become a necessity for terminology work. Building on research in artificial intelligence and translation studies, it proposes a human-centered framework that conceptualizes artificial intelligence as a means of amplifying the terminologist's capabilities, rather than replacing them. The framework is organized around three interrelated dimensions: the augmented terminologist, ethical AI, and human-centered design. Together, these dimensions emphasize the compatibility of high automation with strong human control, the central role of terminologists in bias mitigation, and the importance of designing AI tools and workflows around the needs, values, and well-being of the terminologist. The paper concludes by stressing that current choices in AI adoption will shape not only terminological practice, but also the preservation of accuracy, adequacy, and diversity in terminology and specialized knowledge.
44. 【2512.18841】MDToC: Metacognitive Dynamic Tree of Concepts for Boosting Mathematical Problem-Solving of Large Language Models
链接:https://arxiv.org/abs/2512.18841
作者:Tung Duong Ta,Tim Oates
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, established prompting techniques, Language Models, mathematical reasoning capabilities
备注:
点击查看摘要
Abstract:Despite advances in mathematical reasoning capabilities, Large Language Models (LLMs) still struggle with calculation verification when using established prompting techniques. We present MDToC (Metacognitive Dynamic Tree of Concepts), a three-phase approach that constructs a concept tree, develops accuracy-verified calculations for each concept, and employs majority voting to evaluate competing solutions. Evaluations across CHAMP, MATH, and Game-of-24 benchmarks demonstrate our MDToC's effectiveness, with GPT-4-Turbo achieving 58.1\% on CHAMP, 86.6\% on MATH, and 85\% on Game-of-24 - outperforming GoT by 5\%, 5.4\%, and 4\% on all these tasks, respectively, without hand-engineered hints. MDToC consistently surpasses existing prompting methods across all backbone models, yielding improvements of up to 7.6\% over ToT and 6.2\% over GoT, establishing metacognitive calculation verification as a promising direction for enhanced mathematical reasoning.
45. 【2512.18834】AraMix: Recycling, Refiltering, and Deduplicating to Deliver the Largest Arabic Pretraining Corpus
链接:https://arxiv.org/abs/2512.18834
作者:Sultan Alrashed,Francesco Orabona
类目:Computation and Language (cs.CL)
关键词:deduplicated Arabic pretraining, Arabic pretraining corpus, million documents, Arabic pretraining, deduplicated Arabic
备注: Initial version, without pretraining experiments
点击查看摘要
Abstract:We present AraMix, a deduplicated Arabic pretraining corpus containing approximately 178 billion tokens across 179 million documents. Rather than scraping the web again, AraMix demonstrates that substantial value lies in systematically reusing and curating existing pretraining datasets: we combine seven publicly available Arabic web datasets, apply quality filtering designed specifically for Arabic text to re-filter some datasets, and perform cross-dataset deduplication, both MinHash and sentence-level. This approach reveals that nearly 60% of tokens across these independently collected corpora are duplicates, redundancy that any new scraping efforts will reproduce. Our work suggests that for lower resource languages, investment in curation pipelines for existing data yields greater returns than additional web crawls, an approach that allowed us to curate the largest heavily filtered publicly available Arabic pretraining corpus.
46. 【2512.18832】From Word to World: Can Large Language Models be Implicit Text-based World Models?
链接:https://arxiv.org/abs/2512.18832
作者:Yixia Li,Hongru Wang,Jiahao Qiu,Zhenfei Yin,Dongdong Zhang,Cheng Qian,Zeping Li,Pony Ma,Guanhua Chen,Heng Ji,Mengdi Wang
类目:Computation and Language (cs.CL)
关键词:Agentic reinforcement learning, learning increasingly relies, environments remain non-adaptive, Agentic reinforcement, real-world environments remain
备注:
点击查看摘要
Abstract:Agentic reinforcement learning increasingly relies on experience-driven scaling, yet real-world environments remain non-adaptive, limited in coverage, and difficult to scale. World models offer a potential way to improve learning efficiency through simulated experience, but it remains unclear whether large language models can reliably serve this role and under what conditions they meaningfully benefit agents. We study these questions in text-based environments, which provide a controlled setting to reinterpret language modeling as next-state prediction under interaction. We introduce a three-level framework for evaluating LLM-based world models: (i) fidelity and consistency, (ii) scalability and robustness, and (iii) agent utility. Across five representative environments, we find that sufficiently trained world models maintain coherent latent state, scale predictably with data and model size, and improve agent performance via action verification, synthetic trajectory generation, and warm-starting reinforcement learning. Meanwhile, these gains depend critically on behavioral coverage and environment complexity, delineating clear boundry on when world modeling effectively supports agent learning.
47. 【2512.18779】From Natural Language to Control Signals: A Conceptual Framework for Semantic Channel Finding in Complex Experimental Infrastructure
链接:https://arxiv.org/abs/2512.18779
作者:Thorsten Hellert,Nikolay Agladze,Alex Giovannone,Jan Jug,Frank Mayet,Mark Sherwin,Antonin Sulc,Chris Tennant
类目:Computation and Language (cs.CL); Accelerator Physics (physics.acc-ph)
关键词:Modern experimental platforms, industrial process control, systems expose tens, process control systems, control systems expose
备注:
点击查看摘要
Abstract:Modern experimental platforms such as particle accelerators, fusion devices, telescopes, and industrial process control systems expose tens to hundreds of thousands of control and diagnostic channels accumulated over decades of evolution. Operators and AI systems rely on informal expert knowledge, inconsistent naming conventions, and fragmented documentation to locate signals for monitoring, troubleshooting, and automated control, creating a persistent bottleneck for reliability, scalability, and language-model-driven interfaces. We formalize semantic channel finding-mapping natural-language intent to concrete control-system signals-as a general problem in complex experimental infrastructure, and introduce a four-paradigm framework to guide architecture selection across facility-specific data regimes. The paradigms span (i) direct in-context lookup over curated channel dictionaries, (ii) constrained hierarchical navigation through structured trees, (iii) interactive agent exploration using iterative reasoning and tool-based database queries, and (iv) ontology-grounded semantic search that decouples channel meaning from facility-specific naming conventions. We demonstrate each paradigm through proof-of-concept implementations at four operational facilities spanning two orders of magnitude in scale-from compact free-electron lasers to large synchrotron light sources-and diverse control-system architectures, from clean hierarchies to legacy environments. These implementations achieve 90-97% accuracy on expert-curated operational queries.
48. 【2512.18748】Code2Doc: A Quality-First Curated Dataset for Code Documentation
链接:https://arxiv.org/abs/2512.18748
作者:Recep Kaan Karaman,Meftun Akarsu
类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:models depends critically, code documentation generation, code documentation, depends critically, training data
备注:
点击查看摘要
Abstract:The performance of automatic code documentation generation models depends critically on the quality of the training data used for supervision. However, most existing code documentation datasets are constructed through large scale scraping of public repositories with limited quality control. As a result, they often contain noisy documentation, extensive duplication, and increasing contamination from AI generated content. These issues weaken the supervision signal available to learning-based models and complicate evaluation. We introduce \textbf{Code2Doc}, a quality-first curated dataset for function-level code documentation generation. Code2Doc consists of 13,358 high-quality function-documentation pairs extracted from widely used open-source repositories spanning five programming languages: Python, Java, TypeScript, JavaScript, and C++. The dataset is constructed using a four-stage curation pipeline that enforces documentation completeness and clarity, filters functions based on structural and complexity criteria, removes exact and near-duplicate code, and identifies documentation likely to be AI generated. Starting from 52,069 extracted candidates, only 25.6 percent satisfy all quality constraints. We provide a detailed analysis of the resulting dataset, which achieves a mean documentation quality score of 6.93 out of 10. Overall, 86.9% of samples contain explicit type annotations, and only 2.9\% are flagged as potentially AI generated. Baseline experiments show that fine-tuning a large language model on Code2Doc yields relative improvements of 29.47% in BLEU and 24.04% in ROUGE-L over zero shot performance, despite the modest dataset size. We release both the dataset and the full curation pipeline to support reproducible research on automatic code documentation generation.
Subjects:
Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:
arXiv:2512.18748 [cs.SE]
(or
arXiv:2512.18748v1 [cs.SE] for this version)
https://doi.org/10.48550/arXiv.2512.18748
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
49. 【2512.18746】MemEvolve: Meta-Evolution of Agent Memory Systems
链接:https://arxiv.org/abs/2512.18746
作者:Guibin Zhang,Haotian Ren,Chong Zhan,Zhenhong Zhou,Junhao Wang,He Zhu,Wangchunshu Zhou,Shuicheng Yan
类目:Computation and Language (cs.CL); Multiagent Systems (cs.MA)
关键词:large language model, unprecedentedly reshaping, reshaping the evolutionary, large language, memory
备注:
点击查看摘要
Abstract:Self-evolving memory systems are unprecedentedly reshaping the evolutionary paradigm of large language model (LLM)-based agents. Prior work has predominantly relied on manually engineered memory architectures to store trajectories, distill experience, and synthesize reusable tools, enabling agents to evolve on the fly within environment interactions. However, this paradigm is fundamentally constrained by the staticity of the memory system itself: while memory facilitates agent-level evolving, the underlying memory architecture cannot be meta-adapted to diverse task contexts. To address this gap, we propose MemEvolve, a meta-evolutionary framework that jointly evolves agents' experiential knowledge and their memory architecture, allowing agent systems not only to accumulate experience but also to progressively refine how they learn from it. To ground MemEvolve in prior research and foster openness in future self-evolving systems, we introduce EvolveLab, a unified self-evolving memory codebase that distills twelve representative memory systems into a modular design space (encode, store, retrieve, manage), providing both a standardized implementation substrate and a fair experimental arena. Extensive evaluations on four challenging agentic benchmarks demonstrate that MemEvolve achieves (I) substantial performance gains, improving frameworks such as SmolAgent and Flash-Searcher by up to $17.06\%$; and (II) strong cross-task and cross-LLM generalization, designing memory architectures that transfer effectively across diverse benchmarks and backbone models.
50. 【2512.18745】InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search
链接:https://arxiv.org/abs/2512.18745
作者:Kaican Li,Lewei Yao,Jiannan Wu,Tiezheng Yu,Jierun Chen,Haoli Bai,Lu Hou,Lanqing Hong,Wei Zhang,Nevin L. Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:sophisticated blend, reasoning, visual, multimodal, Abstract
备注:
点击查看摘要
Abstract:The ability for AI agents to "think with images" requires a sophisticated blend of reasoning and perception. However, current open multimodal agents still largely fall short on the reasoning aspect crucial for real-world tasks like analyzing documents with dense charts/diagrams and navigating maps. To address this gap, we introduce O3-Bench, a new benchmark designed to evaluate multimodal reasoning with interleaved attention to visual details. O3-Bench features challenging problems that require agents to piece together subtle visual information from distinct image areas through multi-step reasoning. The problems are highly challenging even for frontier systems like OpenAI o3, which only obtains 40.8% accuracy on O3-Bench. To make progress, we propose InSight-o3, a multi-agent framework consisting of a visual reasoning agent (vReasoner) and a visual search agent (vSearcher) for which we introduce the task of generalized visual search -- locating relational, fuzzy, or conceptual regions described in free-form language, beyond just simple objects or figures in natural images. We then present a multimodal LLM purpose-trained for this task via reinforcement learning. As a plug-and-play agent, our vSearcher empowers frontier multimodal models (as vReasoners), significantly improving their performance on a wide range of benchmarks. This marks a concrete step towards powerful o3-like open systems. Our code and dataset can be found at this https URL .
51. 【2512.18682】Solver-Independent Automated Problem Formulation via LLMs for High-Cost Simulation-Driven Design
链接:https://arxiv.org/abs/2512.18682
作者:Yuchen Li,Handing Wang,Bing Xue,Mengjie Zhang,Yaochu Jin
类目:Computation and Language (cs.CL); Software Engineering (cs.SE)
关键词:optimizing product performance, simulation-driven design domain, translating ambiguous design, translating ambiguous, product performance
备注:
点击查看摘要
Abstract:In the high-cost simulation-driven design domain, translating ambiguous design requirements into a mathematical optimization formulation is a bottleneck for optimizing product performance. This process is time-consuming and heavily reliant on expert knowledge. While large language models (LLMs) offer potential for automating this task, existing approaches either suffer from poor formalization that fails to accurately align with the design intent or rely on solver feedback for data filtering, which is unavailable due to the high simulation costs. To address this challenge, we propose APF, a framework for solver-independent, automated problem formulation via LLMs designed to automatically convert engineers' natural language requirements into executable optimization models. The core of this framework is an innovative pipeline for automatically generating high-quality data, which overcomes the difficulty of constructing suitable fine-tuning datasets in the absence of high-cost solver feedback with the help of data generation and test instance annotation. The generated high-quality dataset is used to perform supervised fine-tuning on LLMs, significantly enhancing their ability to generate accurate and executable optimization problem formulations. Experimental results on antenna design demonstrate that APF significantly outperforms the existing methods in both the accuracy of requirement formalization and the quality of resulting radiation efficiency curves in meeting the design goals.
52. 【2512.18679】brat: Aligned Multi-View Embeddings for Brain MRI Analysis
链接:https://arxiv.org/abs/2512.18679
作者:Maxime Kayser,Maksim Gridnev,Wanting Wang,Max Bain,Aneesh Rangnekar,Avijit Chatterjee,Aleksandr Petrov,Harini Veeraraghavan,Nathaniel C. Swinburne
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:magnetic resonance imaging, representation learning framework, brain magnetic resonance, report alignment transformer, multi-view representation learning
备注: First round accept at WACV 2026
点击查看摘要
Abstract:We present brat (brain report alignment transformer), a multi-view representation learning framework for brain magnetic resonance imaging (MRI) trained on MRIs paired with clinical reports. Brain MRIs present unique challenges due to the presence of numerous, highly varied, and often subtle abnormalities that are localized to a few slices within a 3D volume. To address these challenges, we introduce a brain MRI dataset $10\times$ larger than existing ones, containing approximately 80,000 3D scans with corresponding radiology reports, and propose a multi-view pre-training approach inspired by advances in document retrieval. We develop an implicit query-feature matching mechanism and adopt concepts from quality-diversity to obtain multi-view embeddings of MRIs that are aligned with the clinical features given by report sentences. We evaluate our approach across multiple vision-language and vision tasks, demonstrating substantial performance improvements. The brat foundation models are publicly released.
53. 【2512.18658】Does It Tie Out? Towards Autonomous Legal Agents in Venture Capital
链接:https://arxiv.org/abs/2512.18658
作者:Pierre Colombo,Malik Boudiaf,Allyn Sweet,Michael Desa,Hongxi Wang,Kevin Candra,Syméon del Marmol
类目:Computation and Language (cs.CL)
关键词:capital financing rounds, lawyers conduct diligence, closing venture capital, venture capital financing, underlying legal documentation
备注:
点击查看摘要
Abstract:Before closing venture capital financing rounds, lawyers conduct diligence that includes tying out the capitalization table: verifying that every security (for example, shares, options, warrants) and issuance term (for example, vesting schedules, acceleration triggers, transfer restrictions) is supported by large sets of underlying legal documentation. While LLMs continue to improve on legal benchmarks, specialized legal workflows, such as capitalization tie-out, remain out of reach even for strong agentic systems. The task requires multi-document reasoning, strict evidence traceability, and deterministic outputs that current approaches fail to reliably deliver. We characterize capitalization tie-out as an instance of a real-world benchmark for legal AI, analyze and compare the performance of existing agentic systems, and propose a world model architecture toward tie-out automation-and more broadly as a foundation for applied legal intelligence.
54. 【2512.18623】LLM-CAS: Dynamic Neuron Perturbation for Real-Time Hallucination Correction
链接:https://arxiv.org/abs/2512.18623
作者:Jensen Zhang,Ningyuan Liu,Yijia Fan,Zihao Huang,Qinglin Zeng,Kaitong Cai,Jian Wang,Keze Wang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:generate hallucinated content, Large language models, Large language, contextual grounding, critical applications
备注: Accepted at AAAI 2026
点击查看摘要
Abstract:Large language models (LLMs) often generate hallucinated content that lacks factual or contextual grounding, limiting their reliability in critical applications. Existing approaches such as supervised fine-tuning and reinforcement learning from human feedback are data intensive and computationally expensive, while static parameter editing methods struggle with context dependent errors and catastrophic forgetting. We propose LLM-CAS, a framework that formulates real-time hallucination correction as a hierarchical reinforcement learning problem. LLM-CAS trains an agent to learn a policy that dynamically selects temporary neuron perturbations during inference based on the current context. Unlike prior dynamic approaches that rely on heuristic or predefined adjustments, this policy driven mechanism enables adaptive and fine grained correction without permanent parameter modification. Experiments across multiple language models demonstrate that LLM-CAS consistently improves factual accuracy, achieving gains of 10.98 percentage points on StoryCloze, 2.71 points on TriviaQA, and 2.06 points on the MC1 score of TruthfulQA. These results outperform both static editing methods such as ITI and CAA and the dynamic SADI framework. Overall, LLM-CAS provides an efficient and context aware solution for improving the reliability of LLMs, with promising potential for future multimodal extensions.
Comments:
Accepted at AAAI 2026
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2512.18623 [cs.CL]
(or
arXiv:2512.18623v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2512.18623
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
55. 【2512.18622】A Multi-agent Text2SQL Framework using Small Language Models and Execution Feedback
链接:https://arxiv.org/abs/2512.18622
作者:Thanh Dat Hoang,Thanh Trung Huynh,Matthias Weidlich,Thanh Tam Nguyen,Tong Chen,Hongzhi Yin,Quoc Viet Hung Nguyen
类目:Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
关键词:generating SQL queries, natural language text, Large Language Models, generating SQL, SQL queries
备注:
点击查看摘要
Abstract:Text2SQL, the task of generating SQL queries from natural language text, is a critical challenge in data engineering. Recently, Large Language Models (LLMs) have demonstrated superior performance for this task due to their advanced comprehension and generation capabilities. However, privacy and cost considerations prevent companies from using Text2SQL solutions based on external LLMs offered as a service. Rather, small LLMs (SLMs) that are openly available and can hosted in-house are adopted. These SLMs, in turn, lack the generalization capabilities of larger LLMs, which impairs their effectiveness for complex tasks such as Text2SQL. To address these limitations, we propose MATS, a novel Text2SQL framework designed specifically for SLMs. MATS uses a multi-agent mechanism that assigns specialized roles to auxiliary agents, reducing individual workloads and fostering interaction. A training scheme based on reinforcement learning aligns these agents using feedback obtained during execution, thereby maintaining competitive performance despite a limited LLM size. Evaluation results using on benchmark datasets show that MATS, deployed on a single- GPU server, yields accuracy that are on-par with large-scale LLMs when using significantly fewer parameters. Our source code and data are available at this https URL.
56. 【2512.18608】A Comparative Study of Light-weight Language Models for PII Masking and their Deployment for Real Conversational Texts
链接:https://arxiv.org/abs/2512.18608
作者:Prabigya Acharya,Liza Shrestha
类目:Computation and Language (cs.CL)
关键词:Personally Identifiable Information, Identifiable Information, Personally Identifiable, privacy-preserving conversational systems, Automated masking
备注:
点击查看摘要
Abstract:Automated masking of Personally Identifiable Information (PII) is critical for privacy-preserving conversational systems. While current frontier large language models demonstrate strong PII masking capabilities, concerns about data handling and computational costs motivate exploration of whether lightweight models can achieve comparable performance. We compare encoder-decoder and decoder-only architectures by fine-tuning T5-small and Mistral-Instruct-v0.3 on English datasets constructed from the AI4Privacy benchmark. We create different dataset variants to study label standardization and PII representation, covering 24 standardized PII categories and higher-granularity settings. Evaluation using entity-level and character-level metrics, type accuracy, and exact match shows that both lightweight models achieve performance comparable to frontier LLMs for PII masking tasks. Label normalization consistently improves performance across architectures. Mistral achieves higher F1 and recall with greater robustness across PII types but incurs significantly higher generation latency. T5, while less robust in conversational text, offers more controllable structured outputs and lower inference cost, motivating its use in a real-time Discord bot for real-world PII redaction. Evaluation on live messages reveals performance degradation under informal inputs. These results clarify trade-offs between accuracy, robustness, and computational efficiency, demonstrating that lightweight models can provide effective PII masking while addressing data handling concerns associated with frontier LLMs.
57. 【2512.18601】On Finding Inconsistencies in Documents
链接:https://arxiv.org/abs/2512.18601
作者:Charles J. Lovering,Seth Ebner,Brandon Smock,Michael Krumdick,Saad Rabbani,Ahmed Muhammad,Varshini Reddy,Chris Tanner
类目:Computation and Language (cs.CL)
关键词:Professionals in academia, result in monetary, scientific costs, finance audit, Professionals
备注:
点击查看摘要
Abstract:Professionals in academia, law, and finance audit their documents because inconsistencies can result in monetary, reputational, and scientific costs. Language models (LMs) have the potential to dramatically speed up this auditing process. To understand their abilities, we introduce a benchmark, FIND (Finding INconsistencies in Documents), where each example is a document with an inconsistency inserted manually by a domain expert. Despite the documents being long, technical, and complex, the best-performing model (gpt-5) recovered 64% of the inserted inconsistencies. Surprisingly, gpt-5 also found undiscovered inconsistencies present in the original documents. For example, on 50 arXiv papers, we judged 136 out of 196 of the model's suggestions to be legitimate inconsistencies missed by the original authors. However, despite these findings, even the best models miss almost half of the inconsistencies in FIND, demonstrating that inconsistency detection is still a challenging task.
58. 【2512.18593】From Scratch to Fine-Tuned: A Comparative Study of Transformer Training Strategies for Legal Machine Translation
链接:https://arxiv.org/abs/2512.18593
作者:Amit Barman,Atanu Mandal,Sudip Kumar Naskar
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
关键词:judicial documentation remains, nations like India, remains in English, language barriers, hindered by language
备注:
点击查看摘要
Abstract:In multilingual nations like India, access to legal information is often hindered by language barriers, as much of the legal and judicial documentation remains in English. Legal Machine Translation (L-MT) offers a scalable solution to this challenge by enabling accurate and accessible translations of legal documents. This paper presents our work for the JUST-NLP 2025 Legal MT shared task, focusing on English-Hindi translation using Transformer-based approaches. We experiment with 2 complementary strategies, fine-tuning a pre-trained OPUS-MT model for domain-specific adaptation and training a Transformer model from scratch using the provided legal corpus. Performance is evaluated using standard MT metrics, including SacreBLEU, chrF++, TER, ROUGE, BERTScore, METEOR, and COMET. Our fine-tuned OPUS-MT model achieves a SacreBLEU score of 46.03, significantly outperforming both baseline and from-scratch models. The results highlight the effectiveness of domain adaptation in enhancing translation quality and demonstrate the potential of L-MT systems to improve access to justice and legal transparency in multilingual contexts.
59. 【2512.18552】oward Training Superintelligent Software Agents through Self-Play SWE-RL
链接:https://arxiv.org/abs/2512.18552
作者:Yuxiang Wei,Zhiqing Sun,Emily McMilin,Jonas Gehring,David Zhang,Gabriel Synnaeve,Daniel Fried,Lingming Zhang,Sida Wang
类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:boost programmer productivity, large language models, programmer productivity, pull requests, heavily depend
备注:
点击查看摘要
Abstract:While current software agents powered by large language models (LLMs) and agentic reinforcement learning (RL) can boost programmer productivity, their training data (e.g., GitHub issues and pull requests) and environments (e.g., pass-to-pass and fail-to-pass tests) heavily depend on human knowledge or curation, posing a fundamental barrier to superintelligence. In this paper, we present Self-play SWE-RL (SSR), a first step toward training paradigms for superintelligent software agents. Our approach takes minimal data assumptions, only requiring access to sandboxed repositories with source code and installed dependencies, with no need for human-labeled issues or tests. Grounded in these real-world codebases, a single LLM agent is trained via reinforcement learning in a self-play setting to iteratively inject and repair software bugs of increasing complexity, with each bug formally specified by a test patch rather than a natural language issue description. On the SWE-bench Verified and SWE-Bench Pro benchmarks, SSR achieves notable self-improvement (+10.4 and +7.8 points, respectively) and consistently outperforms the human-data baseline over the entire training trajectory, despite being evaluated on natural language issues absent from self-play. Our results, albeit early, suggest a path where agents autonomously gather extensive learning experiences from real-world software repositories, ultimately enabling superintelligent systems that exceed human capabilities in understanding how systems are constructed, solving novel challenges, and autonomously creating new software from scratch.
60. 【2512.18551】Neologism Learning as a Parameter-Efficient Alternative to Fine-Tuning for Model Steering
链接:https://arxiv.org/abs/2512.18551
作者:Sungjoon Park,Varun Ramamurthi,Owen Terry
类目:Computation and Language (cs.CL)
关键词:language modeling, tokens trained, trained to represent, represent a concept, model vocabulary
备注:
点击查看摘要
Abstract:In language modeling, neologisms are new tokens trained to represent a concept not already included in a given model's vocabulary. Neologisms can be used to encourage specific behavior in models, for example by appending prompts with "Give me a neologism answer." Behavioral steering can also be achieved through fine-tuning, albeit with more compute and less flexibility: learning a neologism only trains d parameters and allows the user to still access the model's default behavior. We compare the performance of neologism learning against low-rank adaptation (LoRA) fine-tuning, finding that neologisms outperform fine-tuned models under a matched training setup (same data and hyperparameters). We also investigate self-verbalizations of neologisms, and observe that the model will occasionally make up its own new words when asked about a neologism.
61. 【2512.18546】LLMs on Drugs: Language Models Are Few-Shot Consumers
链接:https://arxiv.org/abs/2512.18546
作者:Alexander Doudkin
类目:Computation and Language (cs.CL)
关键词:Large language models, Large language, inference time, benchmarked rigorously, Large
备注: 8 pages, 2 figures, 2 tables
点击查看摘要
Abstract:Large language models (LLMs) are sensitive to the personas imposed on them at inference time, yet prompt-level "drug" interventions have never been benchmarked rigorously. We present the first controlled study of psychoactive framings on GPT-5-mini using ARC-Challenge. Four single-sentence prompts -- LSD, cocaine, alcohol, and cannabis -- are compared against a sober control across 100 validation items per condition, with deterministic decoding, full logging, Wilson confidence intervals, and Fisher exact tests. Control accuracy is 0.45; alcohol collapses to 0.10 (p = 3.2e-8), cocaine to 0.21 (p = 4.9e-4), LSD to 0.19 (p = 1.3e-4), and cannabis to 0.30 (p = 0.041), largely because persona prompts disrupt the mandated "Answer: LETTER" template. Persona text therefore behaves like a "few-shot consumable" that can destroy reliability without touching model weights. All experimental code, raw results, and analysis scripts are available at this https URL.
62. 【2512.18542】SecureCode v2.0: A Production-Grade Dataset for Training Security-Aware Code Generation Models
链接:https://arxiv.org/abs/2512.18542
作者:Scott Thornton
类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:assistants produce vulnerable, security-relevant scenarios, introducing flaws, assistants produce, security
备注: 37 pages, 5 figures. Dataset available at [this https URL](https://huggingface.co/datasets/scthornton/securecode-v2) . Code and validation tools at [this https URL](https://github.com/scthornton/securecode-v2)
点击查看摘要
Abstract:AI assistants produce vulnerable code in 45% of security-relevant scenarios, introducing flaws into production systems at scale. Yet existing secure coding datasets fall short. They lack incident grounding, don't provide the scale modern training requires, and miss the operational security context developers need for production deployments. We present SecureCode v2.0, a production-grade dataset of 1,215 security-focused coding examples that passed structural validation and expert security review. Every example ties to actual documented security incidents with CVE references, provides vulnerable and secure implementations, demonstrates concrete attacks, and includes defense-in-depth operational guidance. The dataset covers 11 vulnerability categories (complete OWASP Top 10:2025 plus AI/ML Security Threats) across 11 languages (Python, JavaScript, Java, Go, PHP, C#, TypeScript, Ruby, Rust, Kotlin, and YAML for infrastructure-as-code). Our quality assurance framework ensures complete incident grounding. Each example includes SIEM integration strategies, infrastructure hardening recommendations (Docker, AppArmor, WAF configurations), and testing approaches using language-appropriate frameworks. The dataset uses a 4-turn conversational structure mirroring actual developer-AI interactions, escalating from basic implementations to advanced security considerations and defense-in-depth guidance. Our contributions: (1) 1,215 rigorously validated examples split into 989 training, 122 validation, and 104 test sets, (2) an automated validation framework ensuring dataset consistency, (3) a 4-turn conversational structure capturing realistic security workflows, (4) comprehensive operational security guidance with SIEM integration strategies, (5) complete language-specific implementation fidelity, and (6) open-source release of data, validation tools, and benchmarking protocols.
Comments:
37 pages, 5 figures. Dataset available at this https URL. Code and validation tools at this https URL
Subjects:
Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:
arXiv:2512.18542 [cs.CR]
(or
arXiv:2512.18542v1 [cs.CR] for this version)
https://doi.org/10.48550/arXiv.2512.18542
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
63. 【2512.18533】Generalization Gaps in Political Fake News Detection: An Empirical Study on the LIAR Dataset
链接:https://arxiv.org/abs/2512.18533
作者:S Mahmudul Hasan,Shaily Roy,Akib Jawad Nafis
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:automated fact-checking systems, subtle political disinformation, political disinformation poses, linguistically subtle political, proliferation of linguistically
备注:
点击查看摘要
Abstract:The proliferation of linguistically subtle political disinformation poses a significant challenge to automated fact-checking systems. Despite increasing emphasis on complex neural architectures, the empirical limits of text-only linguistic modeling remain underexplored. We present a systematic diagnostic evaluation of nine machine learning algorithms on the LIAR benchmark. By isolating lexical features (Bag-of-Words, TF-IDF) and semantic embeddings (GloVe), we uncover a hard "Performance Ceiling", with fine-grained classification not exceeding a Weighted F1-score of 0.32 across models. Crucially, a simple linear SVM (Accuracy: 0.624) matches the performance of pre-trained Transformers such as RoBERTa (Accuracy: 0.620), suggesting that model capacity is not the primary bottleneck. We further diagnose a massive "Generalization Gap" in tree-based ensembles, which achieve more than 99% training accuracy but collapse to approximately 25% on test data, indicating reliance on lexical memorization rather than semantic inference. Synthetic data augmentation via SMOTE yields no meaningful gains, confirming that the limitation is semantic (feature ambiguity) rather than distributional. These findings indicate that for political fact-checking, increasing model complexity without incorporating external knowledge yields diminishing returns.
64. 【2512.18505】aching and Critiquing Conceptualization and Operationalization in NLP
链接:https://arxiv.org/abs/2512.18505
作者:Vagrant Gautam
类目:Computation and Language (cs.CL)
关键词:NLP researchers regularly, regularly invoke abstract, researchers regularly invoke, invoke abstract concepts, NLP researchers
备注:
点击查看摘要
Abstract:NLP researchers regularly invoke abstract concepts like "interpretability," "bias," "reasoning," and "stereotypes," without defining them. Each subfield has a shared understanding or conceptualization of what these terms mean and how we should treat them, and this shared understanding is the basis on which operational decisions are made: Datasets are built to evaluate these concepts, metrics are proposed to quantify them, and claims are made about systems. But what do they mean, what should they mean, and how should we measure them? I outline a seminar I created for students to explore these questions of conceptualization and operationalization, with an interdisciplinary reading list and an emphasis on discussion and critique.
65. 【2512.18475】Research on a hybrid LSTM-CNN-Attention model for text-based web content classification
链接:https://arxiv.org/abs/2512.18475
作者:Mykola Kuz,Ihor Lazarovych,Mykola Kozlenko,Mykola Pikuliak,Andrii Kvasniuk
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:study presents, Attention mechanism, LSTM layer models, integrates LSTM, integrated Attention mechanism
备注: 10 pages, 5 figures, 2 tables. Accepted by Radio Electronics Computer Science Control 2025
点击查看摘要
Abstract:This study presents a hybrid deep learning architecture that integrates LSTM, CNN, and an Attention mechanism to enhance the classification of web content based on text. Pretrained GloVe embeddings are used to represent words as dense vectors that preserve semantic similarity. The CNN layer extracts local n-gram patterns and lexical features, while the LSTM layer models long-range dependencies and sequential structure. The integrated Attention mechanism enables the model to focus selectively on the most informative parts of the input sequence. A 5-fold cross-validation setup was used to assess the robustness and generalizability of the proposed solution. Experimental results show that the hybrid LSTM-CNN-Attention model achieved outstanding performance, with an accuracy of 0.98, precision of 0.94, recall of 0.92, and F1-score of 0.93. These results surpass the performance of baseline models based solely on CNNs, LSTMs, or transformer-based classifiers such as BERT. The combination of neural network components enabled the model to effectively capture both fine-grained text structures and broader semantic context. Furthermore, the use of GloVe embeddings provided an efficient and effective representation of textual data, making the model suitable for integration into systems with real-time or near-real-time requirements. The proposed hybrid architecture demonstrates high effectiveness in text-based web content classification, particularly in tasks requiring both syntactic feature extraction and semantic interpretation. By combining presented mechanisms, the model addresses the limitations of individual architectures and achieves improved generalization. These findings support the broader use of hybrid deep learning approaches in NLP applications, especially where complex, unstructured textual data must be processed and classified with high reliability.
66. 【2512.18462】Mitigating Spurious Correlations in NLI via LLM-Synthesized Counterfactuals and Dynamic Balanced Sampling
链接:https://arxiv.org/abs/2512.18462
作者:Christopher Román Jaimes
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Natural Language Inference, Natural Language, Language Inference, models frequently rely, models frequently
备注:
点击查看摘要
Abstract:Natural Language Inference (NLI) models frequently rely on spurious correlations rather than semantic reasoning. Existing mitigation strategies often incur high annotation costs or trigger catastrophic forgetting during fine-tuning. We propose an automated, scalable pipeline to address these limitations. First, we introduce Log-Frequency LMI (LF-LMI) to accurately detect semantic artifacts. Second, we generate a high-quality synthetic contrast set via an LLM-synthesis pipeline with multi-judge verification. Finally, we introduce Dynamic Balanced Sampling, a training strategy that rotates the original data distribution to prevent forgetting. Our method improves consistency on a challenging benchmark from 63.5% to 81.0% while maintaining 88.4% in-domain accuracy, significantly outperforming naive fine-tuning.
67. 【2512.18440】An Agentic AI Framework for Training General Practitioner Student Skills
链接:https://arxiv.org/abs/2512.18440
作者:Victor De Marez,Jens Van Nooten,Luna De Bruyne,Walter Daelemans
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:resource-intensive traditional methods, large language models, language models offer, models offer strong, offer strong potential
备注:
点击查看摘要
Abstract:Advancements in large language models offer strong potential for enhancing virtual simulated patients (VSPs) in medical education by providing scalable alternatives to resource-intensive traditional methods. However, current VSPs often struggle with medical accuracy, consistent roleplaying, scenario generation for VSP use, and educationally structured feedback. We introduce an agentic framework for training general practitioner student skills that unifies (i) configurable, evidence-based vignette generation, (ii) controlled persona-driven patient dialogue with optional retrieval grounding, and (iii) standards-based assessment and feedback for both communication and clinical reasoning. We instantiate the framework in an interactive spoken consultation setting and evaluate it with medical students ($\mathbf{N{=}14}$). Participants reported realistic and vignette-faithful dialogue, appropriate difficulty calibration, a stable personality signal, and highly useful example-rich feedback, alongside excellent overall usability. These results support agentic separation of scenario control, interaction control, and standards-based assessment as a practical pattern for building dependable and pedagogically valuable VSP training tools.
68. 【2512.18399】AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3
链接:https://arxiv.org/abs/2512.18399
作者:Mark Kashirskiy,Artiom Lipinski,Ilya Makarov
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:directly impacting training, critical preprocessing step, directly impacting, critical preprocessing, Latin-script languages exhibit
备注: 8 pages, 8 figures, 5 tables
点击查看摘要
Abstract:Tokenization is a critical preprocessing step for large language models (LLMs), directly impacting training efficiency and downstream performance. General-purpose tokenizers trained predominantly on English and Latin-script languages exhibit suboptimal performance on morphologically rich languages such as Arabic, resulting in inflated token sequences and reduced compression efficiency. In this work, we present AraToken, an Arabic-optimized tokenizer built on SentencePiece Unigram algorithm with a comprehensive normalization pipeline addressing Arabic-specific orthographic variations including Alif variants, diacritics, and Arabic-Indic numerals. We systematically compare BPE, WordPiece, and SentencePiece algorithms across multiple configurations, demonstrating that SentencePiece with normalization achieves 18% lower fertility (1.199 vs 1.35 tokens/word) compared to unnormalized baselines. Furthermore, we introduce the Language Extension Pipeline (LEP), a method for integrating the optimized tokenizer into Qwen3-0.6B through vocabulary extension with mean subtoken initialization and selective transformer layer unfreezing. Our experiments show that LEP reduces evaluation loss from 8.28 to 2.43 within 800 training steps on 100K Arabic samples. We release our tokenizer, training scripts, and model checkpoints to facilitate Arabic NLP research.
69. 【2512.18362】SRS-Stories: Vocabulary-constrained multilingual story generation for language learning
链接:https://arxiv.org/abs/2512.18362
作者:Wiktor Kamzela,Mateusz Lango,Ondrej Dusek
类目:Computation and Language (cs.CL)
关键词:generate personalized stories, large language models, Spaced Repetition System, models to generate, generate personalized
备注: EMNLP 2025
点击查看摘要
Abstract:In this paper, we use large language models to generate personalized stories for language learners, using only the vocabulary they know. The generated texts are specifically written to teach the user new vocabulary by simply reading stories where it appears in context, while at the same time seamlessly reviewing recently learned vocabulary. The generated stories are enjoyable to read and the vocabulary reviewing/learning is optimized by a Spaced Repetition System. The experiments are conducted in three languages: English, Chinese and Polish, evaluating three story generation methods and three strategies for enforcing lexical constraints. The results show that the generated stories are more grammatical, coherent, and provide better examples of word usage than texts generated by the standard constrained beam search approach
70. 【2512.18360】LLM Agents Implement an NLG System from Scratch: Building Interpretable Rule-Based RDF-to-Text Generators
链接:https://arxiv.org/abs/2512.18360
作者:Mateusz Lango,Ondřej Dušek
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:multiple LLM agents, LLM agents produce, LLM agents, traditional backpropagation, multiple LLM
备注: EMNLP 2025
点击查看摘要
Abstract:We present a novel neurosymbolic framework for RDF-to-text generation, in which the model is "trained" through collaborative interactions among multiple LLM agents rather than traditional backpropagation. The LLM agents produce rule-based Python code for a generator for the given domain, based on RDF triples only, with no in-domain human reference texts. The resulting system is fully interpretable, requires no supervised training data, and generates text nearly instantaneously using only a single CPU. Our experiments on the WebNLG and OpenDialKG data show that outputs produced by our approach reduce hallucination, with only slight fluency penalties compared to finetuned or prompted language models
71. 【2512.18357】DACE For Railway Acronym Disambiguation
链接:https://arxiv.org/abs/2512.18357
作者:El Mokhtar Hribach,Oussama Mechhour,Mohammed Elmonstaser,Yassine El Boudouri,Othmane Kabal
类目:Computation and Language (cs.CL)
关键词:technical text processing, complicates automated analysis, high ambiguity complicates, ambiguity complicates automated, Acronym Disambiguation
备注:
点击查看摘要
Abstract:Acronym Disambiguation (AD) is a fundamental challenge in technical text processing, particularly in specialized sectors where high ambiguity complicates automated analysis. This paper addresses AD within the context of the TextMine'26 competition on French railway documentation. We present DACE (Dynamic Prompting, Retrieval Augmented Generation, Contextual Selection, and Ensemble Aggregation), a framework that enhances Large Language Models through adaptive in-context learning and external domain knowledge injection. By dynamically tailoring prompts to acronym ambiguity and aggregating ensemble predictions, DACE mitigates hallucination and effectively handles low-resource scenarios. Our approach secured the top rank in the competition with an F1 score of 0.9069.
72. 【2512.18352】LLM-based Few-Shot Early Rumor Detection with Imitation Agent
链接:https://arxiv.org/abs/2512.18352
作者:Fengzhu Zeng,Qian Shao,Ling Cheng,Wei Gao,Shih-Fen Cheng,Jing Ma,Cheng Niu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:social media posts, accurately classified based, aims to identify, media posts, Early Rumor Detection
备注:
点击查看摘要
Abstract:Early Rumor Detection (EARD) aims to identify the earliest point at which a claim can be accurately classified based on a sequence of social media posts. This is especially challenging in data-scarce settings. While Large Language Models (LLMs) perform well in few-shot NLP tasks, they are not well-suited for time-series data and are computationally expensive for both training and inference. In this work, we propose a novel EARD framework that combines an autonomous agent and an LLM-based detection model, where the agent acts as a reliable decision-maker for \textit{early time point determination}, while the LLM serves as a powerful \textit{rumor detector}. This approach offers the first solution for few-shot EARD, necessitating only the training of a lightweight agent and allowing the LLM to remain training-free. Extensive experiments on four real-world datasets show our approach boosts performance across LLMs and surpasses existing EARD methods in accuracy and earliness.
73. 【2512.18337】owards Efficient Agents: A Co-Design of Inference Architecture and System
链接:https://arxiv.org/abs/2512.18337
作者:Weizhe Lin,Hui-Ling Zhen,Shuai Yang,Xian Wang,Renxi Liu,Hanting Chen,Wangze Zhang,Chuansai Zhou,Yiming Li,Chen Chen,Xing Li,Zhiyuan Yang,Xiaosong Li,Xianzhi Yu,Zhenhua Dong,Mingxuan Yuan,Yunhe Wang
类目:Computation and Language (cs.CL)
关键词:autonomous multi-turn reasoning, large language model, tool-augmented decision-making, rapid development, unlocked new possibilities
备注:
点击查看摘要
Abstract:The rapid development of large language model (LLM)-based agents has unlocked new possibilities for autonomous multi-turn reasoning and tool-augmented decision-making. However, their real-world deployment is hindered by severe inefficiencies that arise not from isolated model inference, but from the systemic latency accumulated across reasoning loops, context growth, and heterogeneous tool interactions. This paper presents AgentInfer, a unified framework for end-to-end agent acceleration that bridges inference optimization and architectural design. We decompose the problem into four synergistic components: AgentCollab, a hierarchical dual-model reasoning framework that balances large- and small-model usage through dynamic role assignment; AgentSched, a cache-aware hybrid scheduler that minimizes latency under heterogeneous request patterns; AgentSAM, a suffix-automaton-based speculative decoding method that reuses multi-session semantic memory to achieve low-overhead inference acceleration; and AgentCompress, a semantic compression mechanism that asynchronously distills and reorganizes agent memory without disrupting ongoing reasoning. Together, these modules form a Self-Evolution Engine capable of sustaining efficiency and cognitive stability throughout long-horizon reasoning tasks. Experiments on the BrowseComp-zh and DeepDiver benchmarks demonstrate that through the synergistic collaboration of these methods, AgentInfer reduces ineffective token consumption by over 50%, achieving an overall 1.8-2.5 times speedup with preserved accuracy. These results underscore that optimizing for agentic task completion-rather than merely per-token throughput-is the key to building scalable, efficient, and self-improving intelligent systems.
74. 【2512.18329】LIR$^3$AG: A Lightweight Rerank Reasoning Strategy Framework for Retrieval-Augmented Generation
链接:https://arxiv.org/abs/2512.18329
作者:Guo Chen,Junjie Huang,Huaijin Xie,Fei Sun,Tao Jia
类目:Computation and Language (cs.CL)
关键词:enhances Large Language, Large Language Models, effectively enhances Large, Large Language, enhances Large
备注: AAAI2026
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) effectively enhances Large Language Models (LLMs) by incorporating retrieved external knowledge into the generation process. Reasoning models improve LLM performance in multi-hop QA tasks, which require integrating and reasoning over multiple pieces of evidence across different documents to answer a complex question. However, they often introduce substantial computational costs, including increased token consumption and inference latency. To better understand and mitigate this trade-off, we conduct a comprehensive study of reasoning strategies for reasoning models in RAG multi-hop QA tasks. Our findings reveal that reasoning models adopt structured strategies to integrate retrieved and internal knowledge, primarily following two modes: Context-Grounded Reasoning, which relies directly on retrieved content, and Knowledge-Reconciled Reasoning, which resolves conflicts or gaps using internal knowledge. To this end, we propose a novel Lightweight Rerank Reasoning Strategy Framework for RAG (LiR$^3$AG) to enable non-reasoning models to transfer reasoning strategies by restructuring retrieved evidence into coherent reasoning chains. LiR$^3$AG significantly reduce the average 98% output tokens overhead and 58.6% inferencing time while improving 8B non-reasoning model's F1 performance ranging from 6.2% to 22.5% to surpass the performance of 32B reasoning model in RAG, offering a practical and efficient path forward for RAG systems.
75. 【2512.18321】CTTA-T: Continual Test-Time Adaptation for Text Understanding via Teacher-Student with a Domain-aware and Generalized Teacher
链接:https://arxiv.org/abs/2512.18321
作者:Tianlun Liu,Zhiliang Tian,Zhen Huang,Xingzhi Zhou,Wanlong Yu,Tianle Liu,Feng Liu,Dongsheng Li
类目:Computation and Language (cs.CL)
关键词:Text understanding, continual test-time adaptation, testing, test-time adaptation, domains
备注:
点击查看摘要
Abstract:Text understanding often suffers from domain shifts. To handle testing domains, domain adaptation (DA) is trained to adapt to a fixed and observed testing domain; a more challenging paradigm, test-time adaptation (TTA), cannot access the testing domain during training and online adapts to the testing samples during testing, where the samples are from a fixed domain. We aim to explore a more practical and underexplored scenario, continual test-time adaptation (CTTA) for text understanding, which involves a sequence of testing (unobserved) domains in testing. Current CTTA methods struggle in reducing error accumulation over domains and enhancing generalization to handle unobserved domains: 1) Noise-filtering reduces accumulated errors but discards useful information, and 2) accumulating historical domains enhances generalization, but it is hard to achieve adaptive accumulation. In this paper, we propose a CTTA-T (continual test-time adaptation for text understanding) framework adaptable to evolving target domains: it adopts a teacher-student framework, where the teacher is domain-aware and generalized for evolving domains. To improve teacher predictions, we propose a refine-then-filter based on dropout-driven consistency, which calibrates predictions and removes unreliable guidance. For the adaptation-generalization trade-off, we construct a domain-aware teacher by dynamically accumulating cross-domain semantics via incremental PCA, which continuously tracks domain shifts. Experiments show CTTA-T excels baselines.
76. 【2512.18301】InstructNet: A Novel Approach for Multi-Label Instruction Classification through Advanced Deep Learning
链接:https://arxiv.org/abs/2512.18301
作者:Tanjim Taharat Aurpa,Md Shoaib Ahmed,Md Mahbubur Rahman,Md. Golam Moazzam
类目:Computation and Language (cs.CL)
关键词:topics and items, specialized objects, aspirational and specialized, search engines, peoples preferred resource
备注:
点击查看摘要
Abstract:People use search engines for various topics and items, from daily essentials to more aspirational and specialized objects. Therefore, search engines have taken over as peoples preferred resource. The How To prefix has become familiar and widely used in various search styles to find solutions to particular problems. This search allows people to find sequential instructions by providing detailed guidelines to accomplish specific tasks. Categorizing instructional text is also essential for task-oriented learning and creating knowledge bases. This study uses the How To articles to determine the multi-label instruction category. We have brought this work with a dataset comprising 11,121 observations from wikiHow, where each record has multiple categories. To find out the multi-label category meticulously, we employ some transformer-based deep neural architectures, such as Generalized Autoregressive Pretraining for Language Understanding (XLNet), Bidirectional Encoder Representation from Transformers (BERT), etc. In our multi-label instruction classification process, we have reckoned our proposed architectures using accuracy and macro f1-score as the performance metrics. This thorough evaluation showed us much about our strategys strengths and drawbacks. Specifically, our implementation of the XLNet architecture has demonstrated unprecedented performance, achieving an accuracy of 97.30% and micro and macro average scores of 89.02% and 93%, a noteworthy accomplishment in multi-label classification. This high level of accuracy and macro average score is a testament to the effectiveness of the XLNet architecture in our proposed InstructNet approach. By employing a multi-level strategy in our evaluation process, we have gained a more comprehensive knowledge of the effectiveness of our proposed architectures and identified areas for forthcoming improvement and refinement.
77. 【2512.18298】Explainable Transformer-CNN Fusion for Noise-Robust Speech Emotion Recognition
链接:https://arxiv.org/abs/2512.18298
作者:Sudip Chakrabarty,Pappu Bishwas,Rajdeep Chatterjee
类目:ound (cs.SD); Computation and Language (cs.CL)
关键词:Speech Emotion Recognition, Speech Emotion, Emotion Recognition, systems often degrade, degrade in performance
备注:
点击查看摘要
Abstract:Speech Emotion Recognition (SER) systems often degrade in performance when exposed to the unpredictable acoustic interference found in real-world environments. Additionally, the opacity of deep learning models hinders their adoption in trust-sensitive applications. To bridge this gap, we propose a Hybrid Transformer-CNN framework that unifies the contextual modeling of Wav2Vec 2.0 with the spectral stability of 1D-Convolutional Neural Networks. Our dual-stream architecture processes raw waveforms to capture long-range temporal dependencies while simultaneously extracting noise-resistant spectral features (MFCC, ZCR, RMSE) via a custom Attentive Temporal Pooling mechanism. We conducted extensive validation across four diverse benchmark datasets: RAVDESS, TESS, SAVEE, and CREMA-D. To rigorously test robustness, we subjected the model to non-stationary acoustic interference using real-world noise profiles from the SAS-KIIT dataset. The proposed framework demonstrates superior generalization and state-of-the-art accuracy across all datasets, significantly outperforming single-branch baselines under realistic environmental interference. Furthermore, we address the ``black-box" problem by integrating SHAP and Score-CAM into the evaluation pipeline. These tools provide granular visual explanations, revealing how the model strategically shifts attention between temporal and spectral cues to maintain reliability in the presence of complex environmental noise.
78. 【2512.18292】Measuring Fine-Grained Negotiation Tactics of Humans and LLMs in Diplomacy
链接:https://arxiv.org/abs/2512.18292
作者:Wenkai Li,Lynnette Hui Xian Ng,Andy Liu,Daniel Fried
类目:Computers and Society (cs.CY); Computation and Language (cs.CL)
关键词:back to Aristotle, styles dates back, dates back, negotiation styles dates, negotiation
备注:
点击查看摘要
Abstract:The study of negotiation styles dates back to Aristotle's ethos-pathos-logos rhetoric. Prior efforts primarily studied the success of negotiation agents. Here, we shift the focus towards the styles of negotiation strategies. Our focus is the strategic dialogue board game Diplomacy, which affords rich natural language negotiation and measures of game success. We used LLM-as-a-judge to annotate a large human-human set of Diplomacy games for fine-grained negotiation tactics from a sociologically-grounded taxonomy. Using a combination of the It Takes Two and WebDiplomacy datasets, we demonstrate the reliability of our LLM-as-a-Judge framework and show strong correlations between negotiation features and success in the Diplomacy setting. Lastly, we investigate the differences between LLM and human negotiation strategies and show that fine-tuning can steer LLM agents toward more human-like negotiation behaviors.
79. 【2512.18231】Investigating Spatial Attention Bias in Vision-Language Models
链接:https://arxiv.org/abs/2512.18231
作者:Aryan Chaudhary,Sanchit Goyal,Pratik Narang,Dhruv Kumar
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:remain largely unexplored, demonstrated remarkable capabilities, processing remain largely, spatial processing remain, largely unexplored
备注:
点击查看摘要
Abstract:Vision-Language Models have demonstrated remarkable capabilities in understanding visual content, yet systematic biases in their spatial processing remain largely unexplored. This work identifies and characterizes a systematic spatial attention bias where VLMs consistently prioritize describing left-positioned content before right-positioned content in horizontally concatenated images. Through controlled experiments on image pairs using both open-source and closed-source models, we demonstrate that this bias persists across different architectures, with models describing left-positioned content first in approximately 97% of cases under neutral prompting conditions. Testing on an Arabic-finetuned model reveals that the bias persists despite right-to-left language training, ruling out language reading direction as the primary cause. Investigation of training dataset annotation guidelines from PixMo and Visual Genome reveals no explicit left-first ordering instructions, suggesting the bias is consistent with architectural factors rather than explicit training data instructions. These findings reveal fundamental limitations in how current VLMs process spatial information.
80. 【2512.18225】GeoSense-AI: Fast Location Inference from Crisis Microblogs
链接:https://arxiv.org/abs/2512.18225
作者:Deepit Sapru
类目:Computation and Language (cs.CL); Social and Information Networks (cs.SI)
关键词:lightweight named-entity recognition, noisy microblog streams, unifying statistical hashtag, statistical hashtag segmentation, infer locations directly
备注:
点击查看摘要
Abstract:This paper presents an applied AI pipeline for realtime geolocation from noisy microblog streams, unifying statistical hashtag segmentation, part-of-speech-driven proper-noun detection, dependency parsing around disaster lexicons, lightweight named-entity recognition, and gazetteer-grounded disambiguation to infer locations directly from text rather than sparse geotags. The approach operationalizes information extraction under streaming constraints, emphasizing low-latency NLP components and efficient validation against geographic knowledge bases to support situational awareness during emergencies. In head to head comparisons with widely used NER toolkits, the system attains strong F1 while being engineered for orders-of-magnitude faster throughput, enabling deployment in live crisis informatics settings. A production map interface demonstrates end-to-end AI functionality ingest, inference, and visualization--surfacing locational signals at scale for floods, outbreaks, and other fastmoving events. By prioritizing robustness to informal text and streaming efficiency, GeoSense-AI illustrates how domain-tuned NLP and knowledge grounding can elevate emergency response beyond conventional geo-tag reliance.
81. 【2512.18215】Stable and Efficient Single-Rollout RL for Multimodal Reasoning
链接:https://arxiv.org/abs/2512.18215
作者:Rui Liu,Dian Yu,Lei Ke,Haolin Liu,Yujun Zhou,Zhenwen Liang,Haitao Mi,Pratap Tokekar,Dong Yu
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Language Models, Multimodal Large Language, Reinforcement Learning, Verifiable Rewards, Language Models
备注:
点击查看摘要
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has become a key paradigm to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevalent group-based algorithms such as GRPO require multi-rollout sampling for each prompt. While more efficient single-rollout variants have recently been explored in text-only settings, we find that they suffer from severe instability in multimodal contexts, often leading to training collapse. To address this training efficiency-stability trade-off, we introduce $\textbf{MSSR}$ (Multimodal Stabilized Single-Rollout), a group-free RLVR framework that achieves both stable optimization and effective multimodal reasoning performance. MSSR achieves this via an entropy-based advantage-shaping mechanism that adaptively regularizes advantage magnitudes, preventing collapse and maintaining training stability. While such mechanisms have been used in group-based RLVR, we show that in the multimodal single-rollout setting they are not merely beneficial but essential for stability. In in-distribution evaluations, MSSR demonstrates superior training compute efficiency, achieving similar validation accuracy to the group-based baseline with half the training steps. When trained for the same number of steps, MSSR's performance surpasses the group-based baseline and shows consistent generalization improvements across five diverse reasoning-intensive benchmarks. Together, these results demonstrate that MSSR enables stable, compute-efficient, and effective RLVR for complex multimodal reasoning tasks.
82. 【2512.18196】raining LLMs with LogicReward for Faithful and Rigorous Reasoning
链接:https://arxiv.org/abs/2512.18196
作者:Jundong Xu,Hao Fei,Huichi Zhou,Xin Quan,Qijun Huang,Shengqiong Wu,William Yang Wang,Mong-Li Lee,Wynne Hsu
类目:Computation and Language (cs.CL)
关键词:LLMs exhibit strong, methods largely depend, produce correct answers, existing training methods, training methods largely
备注: Preprint
点击查看摘要
Abstract:Although LLMs exhibit strong reasoning capabilities, existing training methods largely depend on outcome-based feedback, which can produce correct answers with flawed reasoning. Prior work introduces supervision on intermediate steps but still lacks guarantees of logical soundness, which is crucial in high-stakes scenarios where logical consistency is paramount. To address this, we propose LogicReward, a novel reward system that guides model training by enforcing step-level logical correctness with a theorem prover. We further introduce Autoformalization with Soft Unification, which reduces natural language ambiguity and improves formalization quality, enabling more effective use of the theorem prover. An 8B model trained on data constructed with LogicReward surpasses GPT-4o and o4-mini by 11.6\% and 2\% on natural language inference and logical reasoning tasks with simple training procedures. Further analysis shows that LogicReward enhances reasoning faithfulness, improves generalizability to unseen tasks such as math and commonsense reasoning, and provides a reliable reward signal even without ground-truth labels. We will release all data and code at this https URL.
83. 【2512.18190】External Hippocampus: Topological Cognitive Maps for Guiding Large Language Model Reasoning
链接:https://arxiv.org/abs/2512.18190
作者:Jian Yan
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:External Hippocampus framework, External Hippocampus, proposes the External, cognitive dynamics perspective, Hippocampus framework
备注: 12 pages, 7 figures
点击查看摘要
Abstract:This paper proposes the External Hippocampus framework, which models language model reasoning from a cognitive dynamics perspective as the flow of information energy in semantic space. Unlike traditional weight-space optimization methods, this framework constructs topological cognitive maps through dimensionality reduction projection, enabling precise navigation and intervention of energy flow at test time while avoiding substantial computational requirements and demonstrating predictable intervention patterns. The method effectively addresses the cognitive deadlock problem in multi-step reasoning for small models. Experiments on models =7B parameters show: map-guided methods achieve 81.20% accuracy on 500 challenging problems (relative baseline +16.80%), reduce reasoning time by = 15x, with key findings revealing that reasoning stagnation manifests as "Cognitive Vortex" and low-entropy potential wells, while temperature perturbations effectively restart energy flow. The framework requires no additional training, possesses autonomous growth capability, and provides an efficient and controllable topological-aware solution for small model reasoning.
84. 【2512.18115】Layout-Aware Text Editing for Efficient Transformation of Academic PDFs to Markdown
链接:https://arxiv.org/abs/2512.18115
作者:Changxu Duan
类目:Multimedia (cs.MM); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)
关键词:digital library workflows, enable scalable digital, scalable digital library, Academic documents stored, plain text structured
备注: Accepted ICDAR 2025
点击查看摘要
Abstract:Academic documents stored in PDF format can be transformed into plain text structured markup languages to enhance accessibility and enable scalable digital library workflows. Markup languages allow for easier updates and customization, making academic content more adaptable and accessible to diverse usage, such as linguistic corpus compilation. Such documents, typically delivered in PDF format, contain complex elements including mathematical formulas, figures, headers, and tables, as well as densely layouted text. Existing end-to-end decoder transformer models can transform screenshots of documents into markup language. However, these models exhibit significant inefficiencies; their token-by-token decoding from scratch wastes a lot of inference steps in regenerating dense text that could be directly copied from PDF files. To solve this problem, we introduce EditTrans, a hybrid editing-generation model whose features allow identifying a queue of to-be-edited text from a PDF before starting to generate markup language. EditTrans contains a lightweight classifier fine-tuned from a Document Layout Analysis model on 162,127 pages of documents from arXiv. In our evaluations, EditTrans reduced the transformation latency up to 44.5% compared to end-to-end decoder transformer models, while maintaining transformation quality. Our code and reproducible dataset production scripts are open-sourced.
85. 【2512.18072】Statistical laws and linguistics inform meaning in naturalistic and fictional conversation
链接:https://arxiv.org/abs/2512.18072
作者:Ashley M. A. Fehr,Calla G. Beauregard,Julia Witte Zimmerman,Katie Ekström,Pablo Rosillo-Rodes,Christopher M. Danforth,Peter Sheridan Dodds
类目:Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:well-being outcomes, cornerstone of social, social connection, linked to well-being, Heaps' law
备注:
点击查看摘要
Abstract:Conversation is a cornerstone of social connection and is linked to well-being outcomes. Conversations vary widely in type with some portion generating complex, dynamic stories. One approach to studying how conversations unfold in time is through statistical patterns such as Heaps' law, which holds that vocabulary size scales with document length. Little work on Heaps's law has looked at conversation and considered how language features impact scaling. We measure Heaps' law for conversations recorded in two distinct mediums: 1. Strangers brought together on video chat and 2. Fictional characters in movies. We find that scaling of vocabulary size differs by parts of speech. We discuss these findings through behavioral and linguistic frameworks.
86. 【2512.18041】Narrative Consolidation: Formulating a New Task for Unifying Multi-Perspective Accounts
链接:https://arxiv.org/abs/2512.18041
作者:Roger A. Finger,Eduardo G. Cortes,Sandro J. Rigo,Gabriel de O. Ramos
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Processing overlapping narrative, chronologically sound text, overlapping narrative documents, Processing overlapping, historical accounts
备注:
点击查看摘要
Abstract:Processing overlapping narrative documents, such as legal testimonies or historical accounts, often aims not for compression but for a unified, coherent, and chronologically sound text. Standard Multi-Document Summarization (MDS), with its focus on conciseness, fails to preserve narrative flow. This paper formally defines this challenge as a new NLP task: Narrative Consolidation, where the central objectives are chronological integrity, completeness, and the fusion of complementary details. To demonstrate the critical role of temporal structure in this task, we introduce Temporal Alignment Event Graph (TAEG), a graph structure that explicitly models chronology and event alignment. By applying a standard centrality algorithm to TAEG, our method functions as a version selection mechanism, choosing the most central representation of each event in its correct temporal position. In a study on the four Biblical Gospels, this structure-focused approach guarantees perfect temporal ordering (Kendall's Tau of 1.000) by design and dramatically improves content metrics (e.g., +357.2% in ROUGE-L F1). The success of this baseline method validates the formulation of Narrative Consolidation as a relevant task and establishes that an explicit temporal backbone is a fundamental component for its resolution.
87. 【2512.18027】CoPE: A Small Language Model for Steerable and Scalable Content Labeling
链接:https://arxiv.org/abs/2512.18027
作者:Samidh Chakrabarti,David Willner,Kevin Klyman,Tiffany Saade,Emily Capstick,Sabina Nong
类目:Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
关键词:policy-steerable small language, accurate content labeling, small language model, language model capable, called Binocular Labeling
备注: 21 pages, 2 figures, 7 tables
点击查看摘要
Abstract:This paper details the methodology behind CoPE, a policy-steerable small language model capable of fast and accurate content labeling. We present a novel training curricula called Contradictory Example Training that enables the model to learn policy interpretation rather than mere policy memorization. We also present a novel method for generating content policies, called Binocular Labeling, which enables rapid construction of unambiguous training datasets. When evaluated across seven different harm areas, CoPE exhibits equal or superior accuracy to frontier models at only 1% of their size. We openly release a 9 billion parameter version of the model that can be run on a single consumer-grade GPU. Models like CoPE represent a paradigm shift for classifier systems. By turning an ML task into a policy writing task, CoPE opens up new design possibilities for the governance of online platforms.
88. 【2512.18014】ReGal: A First Look at PPO-based Legal AI for Judgment Prediction and Summarization in India
链接:https://arxiv.org/abs/2512.18014
作者:Shubham Kumar Nigam,Tanuj Tyagi,Siddharth Shukla,Aditya Kumar Guru,Balaramamahanthi Deepak Patnaik,Danush Khanna,Noel Shallum,Kripabandhu Ghosh,Arnab Bhattacharya
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Proximal Policy Optimization, reinforcement learning methodologies, Indian context, reinforcement learning, Reinforcement Learning-based Legal
备注: Accepted in AILaw @ AAAI 2026 conference
点击查看摘要
Abstract:This paper presents an early exploration of reinforcement learning methodologies for legal AI in the Indian context. We introduce Reinforcement Learning-based Legal Reasoning (ReGal), a framework that integrates Multi-Task Instruction Tuning with Reinforcement Learning from AI Feedback (RLAIF) using Proximal Policy Optimization (PPO). Our approach is evaluated across two critical legal tasks: (i) Court Judgment Prediction and Explanation (CJPE), and (ii) Legal Document Summarization. Although the framework underperforms on standard evaluation metrics compared to supervised and proprietary models, it provides valuable insights into the challenges of applying RL to legal texts. These challenges include reward model alignment, legal language complexity, and domain-specific adaptation. Through empirical and qualitative analysis, we demonstrate how RL can be repurposed for high-stakes, long-document tasks in law. Our findings establish a foundation for future work on optimizing legal reasoning pipelines using reinforcement learning, with broader implications for building interpretable and adaptive legal AI systems.
89. 【2512.18004】Seeing Justice Clearly: Handwritten Legal Document Translation with OCR and Vision-Language Models
链接:https://arxiv.org/abs/2512.18004
作者:Shubham Kumar Nigam,Parjanya Aditya Shukla,Noel Shallum,Arnab Bhattacharya
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:pose significant challenges, lack large digitized, large digitized corpora, exhibit high variability, machine translation continue
备注: Accepted in AILaw @ AAAI 2026 Conference
点击查看摘要
Abstract:Handwritten text recognition (HTR) and machine translation continue to pose significant challenges, particularly for low-resource languages like Marathi, which lack large digitized corpora and exhibit high variability in handwriting styles. The conventional approach to address this involves a two-stage pipeline: an OCR system extracts text from handwritten images, which is then translated into the target language using a machine translation model. In this work, we explore and compare the performance of traditional OCR-MT pipelines with Vision Large Language Models that aim to unify these stages and directly translate handwritten text images in a single, end-to-end step. Our motivation is grounded in the urgent need for scalable, accurate translation systems to digitize legal records such as FIRs, charge sheets, and witness statements in India's district and high courts. We evaluate both approaches on a curated dataset of handwritten Marathi legal documents, with the goal of enabling efficient legal document processing, even in low-resource environments. Our findings offer actionable insights toward building robust, edge-deployable solutions that enhance access to legal information for non-native speakers and legal professionals alike.
90. 【2512.17920】Separating Constraint Compliance from Semantic Accuracy: A Novel Benchmark for Evaluating Instruction-Following Under Compression
链接:https://arxiv.org/abs/2512.17920
作者:Rahul Baxi
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:exhibit degraded performance, remain poorly understood, mechanisms remain poorly, Compression-Decay Comprehension Test, Large language models
备注: 19 pages, 9 figures; currently under peer review at TMLR
点击查看摘要
Abstract:Large language models (LLMs) exhibit degraded performance under prompt compression, but the mechanisms remain poorly understood. We introduce the Compression-Decay Comprehension Test (CDCT), a benchmark that independently measures constraint compliance (CC) and semantic accuracy (SA) across compression levels. We evaluate 9 frontier LLMs across 8 concepts using 5 compression levels from extreme (c=0.0, ~2 words) to none (c=1.0, ~135 words). A three-judge LLM jury achieves almost perfect inter-rater agreement on CC (Fleiss' \k{appa}=0.90). We observe a universal U-curve pattern in constraint compliance (97.2% prevalence), with violations peaking at medium compression (c=0.5, ~27 words). Counterintuitively, models perform better at extreme compression than medium lengths. The dimensions are statistically orthogonal (r=0.193, p=0.084), with constraint effects 2.9x larger than semantic effects. Experimental validation via RLHF ablation confirms our constraint salience hypothesis: removing "helpfulness" signals improves CC by 598% on average (71/72 trials, p0.001), with 79% achieving perfect compliance. This demonstrates that RLHF-trained helpfulness behaviors are the dominant cause of constraint violations at medium compression. Reasoning models outperform efficient models by 27.5% (Cohen's d=0.96). Our findings reveal a fundamental tension between RLHF alignment and instruction-following, providing actionable guidelines for improving deployed systems.
Comments:
19 pages, 9 figures; currently under peer review at TMLR
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
ACMclasses:
I.2.7
Cite as:
arXiv:2512.17920 [cs.CL]
(or
arXiv:2512.17920v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2512.17920
Focus to learn more
arXiv-issued DOI via DataCite
Submission history From: Rahul Baxi [view email] [v1]
Tue, 2 Dec 2025 13:25:48 UTC (149 KB)
91. 【2512.17917】KVReviver: Reversible KV Cache Compression with Sketch-Based Token Reconstruction
链接:https://arxiv.org/abs/2512.17917
作者:Aomufei Yuan,Zhiming Wang,Ruijie Miao,Dayu Wang,Yuxuan Tian,Zihan Wang,Yebo Peng,Yuhan Wu,Bairen Yi,Xin Liu,Tong Yang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:current large language, large language models, LLM deployment, rapidly increases, batch processing
备注: 12 pages, 6 figures
点击查看摘要
Abstract:As the context length of current large language models (LLMs) rapidly increases, the memory demand for the Key-Value (KV) cache is becoming a bottleneck for LLM deployment and batch processing. Traditional KV cache compression methods typically involve permanently evicting or irreversibly merging "less important" tokens with low attention scores. This approach results in the unrecoverable loss of token information, which we call Contextual Amnesia, significantly degrading the model's information retrieval capability. To address this issue, we propose KVReviver, a reversible KV cache compression method based on the sketch algorithm. This method allows reconstructing compressed tokens from an additional data structure, thus enabling full-scale computation within limited memory. Experiments showed that in 2k-length contexts, it requires only 10% of KV Cache budget while maintaining identical end-to-end inference accuracy. For 32k-length contexts, it achieves equivalent or comparable accuracy ~2% accuracy loss) using merely 25% of KV Cache budget.
92. 【2512.17916】Learning to Prioritize IT Tickets: A Comparative Evaluation of Embedding-based Approaches and Fine-Tuned Transformer Models
链接:https://arxiv.org/abs/2512.17916
作者:Minh Tri LÊ,Ali Ait-Bachir
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Prioritizing service tickets, subjective writing styles, pronounced class imbalance, remains challenging due, Prioritizing service
备注: 12 pages
点击查看摘要
Abstract:Prioritizing service tickets in IT Service Management (ITSM) is critical for operational efficiency but remains challenging due to noisy textual inputs, subjective writing styles, and pronounced class imbalance. We evaluate two families of approaches for ticket prioritization: embedding-based pipelines that combine dimensionality reduction, clustering, and classical classifiers, and a fine-tuned multilingual transformer that processes both textual and numerical features. Embedding-based methods exhibit limited generalization across a wide range of thirty configurations, with clustering failing to uncover meaningful structures and supervised models highly sensitive to embedding quality. In contrast, the proposed transformer model achieves substantially higher performance, with an average F1-score of 78.5% and weighted Cohen's kappa values of nearly 0.80, indicating strong alignment with true labels. These results highlight the limitations of generic embeddings for ITSM data and demonstrate the effectiveness of domain-adapted transformer architectures for operational ticket prioritization.
93. 【2512.17915】Supplementary Resources and Analysis for Automatic Speech Recognition Systems Trained on the Loquacious Dataset
链接:https://arxiv.org/abs/2512.17915
作者:Nick Rossenbach,Robin Schmitt,Tina Raissi,Simon Berger,Larissa Kleppel,Ralf Schlüter
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:established English automatic, English automatic speech, recently published Loquacious, automatic speech recognition, published Loquacious dataset
备注:
点击查看摘要
Abstract:The recently published Loquacious dataset aims to be a replacement for established English automatic speech recognition (ASR) datasets such as LibriSpeech or TED-Lium. The main goal of the Loquacious dataset is to provide properly defined training and test partitions across many acoustic and language domains, with an open license suitable for both academia and industry. To further promote the benchmarking and usability of this new dataset, we present additional resources in the form of n-gram language models (LMs), a grapheme-to-phoneme (G2P) model and pronunciation lexica, with open and public access. Utilizing those additional resources we show experimental results across a wide range of ASR architectures with different label units and topologies. Our initial experimental results indicate that the Loquacious dataset offers a valuable study case for a variety of common challenges in ASR.
94. 【2512.17914】Q-KVComm: Efficient Multi-Agent Communication Via Adaptive KV Cache Compression
链接:https://arxiv.org/abs/2512.17914
作者:Boris Kriuk,Logic Ng
类目:Computation and Language (cs.CL); Multiagent Systems (cs.MA)
关键词:Multi-agent Large Language, Large Language Model, Multi-agent Large, Large Language, consumes excessive bandwidth
备注: 7 pages, 4 figures, 1 table
点击查看摘要
Abstract:Multi-agent Large Language Model (LLM) systems face a critical bottleneck: redundant transmission of contextual information between agents consumes excessive bandwidth and computational resources. Traditional approaches discard internal semantic representations and transmit raw text, forcing receiving agents to recompute similar representations from scratch. We introduce Q-KVComm, a new protocol that enables direct transmission of compressed key-value (KV) cache representations between LLM agents. Q-KVComm combines three key innovations: (1) adaptive layer-wise quantization that allocates variable bit-widths based on sensitivity profiling, (2) hybrid information extraction that preserves critical facts across content domains, and (3) heterogeneous model calibration establishing cross-architecture communication. Extensive experiments across three diverse question-answering datasets demonstrate that Q-KVComm achieves 5-6x compression ratios while maintaining semantic fidelity, with coherence quality scores above 0.77 across all scenarios. The protocol exhibits robust performance across model sizes (1.1B-1.5B parameters) and adapts to real-world applications including conversational QA and multi-hop reasoning. Our work establishes a new paradigm for LLM agent communication, shifting from text-based to representation-based information exchange.
95. 【2512.17912】Graph-O1 : Monte Carlo Tree Search with Reinforcement Learning for Text-Attributed Graph Reasoning
链接:https://arxiv.org/abs/2512.17912
作者:Lihui Liu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:rich textual information, diverse domains, nodes and edges, edges contain rich, Text-attributed graphs
备注:
点击查看摘要
Abstract:ChatGPT said: Text-attributed graphs, where nodes and edges contain rich textual information, are widely used across diverse domains. A central challenge in this setting is question answering, which requires jointly leveraging unstructured text and the structured relational signals within the graph. Although Large Language Models (LLMs) have made significant advances in natural language understanding, their direct use for reasoning over text-attributed graphs remains limited. Retrieval-augmented generation methods that operate purely on text often treat passages as isolated units, ignoring the interconnected structure of the graph. Conversely, graph-based RAG methods that serialize large subgraphs into long textual sequences quickly become infeasible due to LLM context-length constraints, resulting in fragmented reasoning and degraded accuracy. To overcome these limitations, we introduce Graph-O1, an agentic GraphRAG framework that enables LLMs to conduct stepwise, interactive reasoning over graphs. Our approach integrates Monte Carlo Tree Search (MCTS) with end-to-end reinforcement learning, allowing the model to selectively explore and retrieve only the most informative subgraph components. The reasoning procedure is framed as a multi-turn interaction between the agent and the graph environment, and the agent is trained through a unified reward mechanism. Extensive experiments across multiple LLM backbones demonstrate that Graph-O1 consistently surpasses state-of-the-art baselines, producing answers that are more accurate, reliable, and interpretable.
96. 【2512.17911】owards Reasoning-Preserving Unlearning in Multimodal Large Language Models
链接:https://arxiv.org/abs/2512.17911
作者:Hongji Li,Junchi yao,Manjiang Yu,Priyanka Singh,Xue Li,Di Wang,Lijie Hu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Machine unlearning aims, erase requested data, Large Language Models, Multimodal Large Language, Machine unlearning
备注:
点击查看摘要
Abstract:Machine unlearning aims to erase requested data from trained models without full retraining. For Reasoning Multimodal Large Language Models (RMLLMs), this is uniquely challenging: intermediate chain-of-thought steps can still leak sensitive information even when final answers are forgotten, and overly aggressive interventions easily damage general reasoning ability. Yet no benchmark jointly evaluates how well unlearning methods suppress reasoning-level leakage while preserving reasoning competence. We address this gap with RMLLMU-Bench, the first benchmark for RMLLM unlearning that extends standard forgetting metrics with dedicated measures of reasoning leakage and reasoning retention. A systematic evaluation on RMLLMU-Bench reveals that existing unlearning methods for MLLMs and Large (Language) Reasoning Models (LRMs) either leave substantial leakage in the reasoning process or severely degrade reasoning performance. To address these gaps, we propose R-MUSE (Reasoning-preserving MLLM Unlearning via Subspace guidance and Adaptive Steering), a training-free and inference-time intervention framework that steers internal representations to forget both answers and reasoning traces while explicitly preserving general reasoning. Experiments on RMLLMU-Bench demonstrate that R-MUSE achieves a substantially better balance between effective forgetting and reasoning retention.
97. 【2512.18861】Merge on workspaces as Hopf algebra Markov chain
链接:https://arxiv.org/abs/2512.18861
作者:Matilde Marcolli,David Skigin
类目:Dynamical Systems (math.DS); Computation and Language (cs.CL); Quantum Algebra (math.QA); Rings and Algebras (math.RA)
关键词:Hopf algebra Markov, binary rooted forests, algebra Markov chain, Hopf algebra, Markovian dynamical system
备注: 80 pages, LaTeX, 1 png figure
点击查看摘要
Abstract:We study the dynamical properties of a Hopf algebra Markov chain with state space the binary rooted forests with labelled leaves. This Markovian dynamical system describes the core computational process of structure formation and transformation in syntax via the Merge operation, according to Chomsky's Minimalism model of generative linguistics. The dynamics decomposes into an ergodic dynamical system with uniform stationary distribution, given by the action of Internal Merge, while the contributions of External Merge and (a minimal form of) Sideward Merge reduce to a simpler Markov chain with state space the set of partitions and with combinatorial weights. The Sideward Merge part of the dynamics prevents convergence to fully formed connected structures (trees), unless the different forms of Merge are weighted by a cost function, as predicted by linguistic theory. Results on the asymptotic behavior of the Perron-Frobenius eigenvalue and eigenvector in this weighted case, obtained in terms of an associated Perron-Frobenius problem in the tropical semiring, show that the usual cost functions (Minimal Search and Resource Restrictions) proposed in the linguistic literature do not suffice to obtain convergence to the tree structures, while an additional optimization property based on the Shannon entropy achieves the expected result for the dynamics. We also comment on the introduction of continuous parameters related to semantic embedding and other computational models, and also on some filtering of the dynamics by coloring rules that model the linguistic filtering by theta roles and phase structure, and on parametric variation and the process of parameter setting in Externalization.
98. 【2512.18263】ICL+: A Case Study On Speech In-Context Learning for Children's Speech Recognition
链接:https://arxiv.org/abs/2512.18263
作者:Haolong Zheng,Yekaterina Yegorova,Mark Hasegawa-Johnson
类目:Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:limited labeled data, recognition remains challenging, remains challenging due, speech recognition remains, Children speech recognition
备注: Published at IEEE ASRU 2025 Satellite Workshop-AI for Children's Speech and Language
点击查看摘要
Abstract:Children's speech recognition remains challenging due to substantial acoustic and linguistic variability, limited labeled data, and significant differences from adult speech. Speech foundation models can address these challenges through Speech In-Context Learning (SICL), allowing adaptation to new domains without fine-tuning. However, the effectiveness of SICL depends on how in-context examples are selected. We extend an existing retrieval-based method, Text-Embedding KNN for SICL (TICL), introducing an acoustic reranking step to create TICL+. This extension prioritizes examples that are both semantically and acoustically aligned with the test input. Experiments on four children's speech corpora show that TICL+ achieves up to a 53.3% relative word error rate reduction over zero-shot performance and 37.6% over baseline TICL, highlighting the value of combining semantic and acoustic information for robust, scalable ASR in children's speech.
99. 【2512.18119】Distributed Asymmetric Allocation: A Topic Model for Large Imbalanced Corpora in Social Sciences
链接:https://arxiv.org/abs/2512.18119
作者:Kohei Watanabe
类目:Methodology (stat.ME); Computation and Language (cs.CL)
关键词:find highly specific, highly specific topics, specific topics defined, unsupervised LDA fragments, semi-supervised LDA fails
备注: 34 pages
点击查看摘要
Abstract:Social scientists employ latent Dirichlet allocation (LDA) to find highly specific topics in large corpora, but they often struggle in this task because (1) LDA, in general, takes a significant amount of time to fit on large corpora; (2) unsupervised LDA fragments topics into sub-topics in short documents; (3) semi-supervised LDA fails to identify specific topics defined using seed words. To solve these problems, I have developed a new topic model called distributed asymmetric allocation (DAA) that integrates multiple algorithms for efficiently identifying sentences about important topics in large corpora. I evaluate the ability of DAA to identify politically important topics by fitting it to the transcripts of speeches at the United Nations General Assembly between 1991 and 2017. The results show that DAA can classify sentences significantly more accurately and quickly than LDA thanks to the new algorithms. More generally, the results demonstrate that it is important for social scientists to optimize Dirichlet priors of LDA to perform content analysis accurately.
100. 【2512.17968】A Critical Review of Monte Carlo Algorithms Balancing Performance and Probabilistic Accuracy with AI Augmented Framework
链接:https://arxiv.org/abs/2512.17968
作者:Ravi Prasad
类目:Computation (stat.CO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Monte Carlo algorithms, Hamiltonian Monte Carlo, effective application hinges, Monte Carlo, modern computational science
备注:
点击查看摘要
Abstract:Monte Carlo algorithms are a foundational pillar of modern computational science, yet their effective application hinges on a deep understanding of their performance trade offs. This paper presents a critical analysis of the evolution of Monte Carlo algorithms, focusing on the persistent tension between statistical efficiency and computational cost. We describe the historical development from the foundational Metropolis Hastings algorithm to contemporary methods like Hamiltonian Monte Carlo. A central emphasis of this survey is the rigorous discussion of time and space complexity, including upper, lower, and asymptotic tight bounds for each major algorithm class. We examine the specific motivations for developing these methods and the key theoretical and practical observations such as the introduction of gradient information and adaptive tuning in HMC that led to successively better solutions. Furthermore, we provide a justification framework that discusses explicit situations in which using one algorithm is demonstrably superior to another for the same problem. The paper concludes by assessing the profound significance and impact of these algorithms and detailing major current research challenges.
信息检索
1. 【2512.19360】Generative vector search to improve pathology foundation models across multimodal vision-language tasks
链接:https://arxiv.org/abs/2512.19360
作者:Markus Ekvall,Ludvig Bergenstråhle,Patrick Truong,Ben Murrell,Joakim Lundeberg
类目:Information Retrieval (cs.IR)
关键词:Retrieval-augmented generation improves, external knowledge sources, addressing knowledge cutoffs, generation improves large, Retrieval-augmented generation
备注: 13 pages main (54 total), 2 main figures (9 total)
点击查看摘要
Abstract:Retrieval-augmented generation improves large language models by grounding outputs in external knowledge sources, reducing hallucinations and addressing knowledge cutoffs. However, standard embedding-based retrieval fails to capture the complexity of multi-concept queries, particularly in domains like biomedicine, where biological data are inherently high-dimensional. For example,omics datasets, and clinical reports simultaneously exhibit numerous molecular, cellular, and physiological features. We present Stochastic Latent Matching (STHLM), a generative vector search method that samples query-conditioned embeddings from text or image inputs to enhance retrieval performance. Analogous to how Chain-of-Thought reasoning enables language models to "think longer" on complex problems, STHLM allows retrieval systems to "search wider" through iterative sampling. STHLM demonstrates critical improvements over classical vector retrieval across diverse benchmarks, including scientific literature, clinical notes, and tissue images, boosting retrieval performance by 10-30% through test-time compute (trading latency for accuracy), while enabling up to a 10-fold compression of embedding dimensions.
2. 【2512.19134】QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation
链接:https://arxiv.org/abs/2512.19134
作者:Dehai Min,Kailin Zhang,Tongtong Wu,Lu Cheng
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:Retrieval-Augmented Generation adaptively, Generation adaptively determines, large language models, adaptively determines, large language
备注:
点击查看摘要
Abstract:Dynamic Retrieval-Augmented Generation adaptively determines when to retrieve during generation to mitigate hallucinations in large language models (LLMs). However, existing methods rely on model-internal signals (e.g., logits, entropy), which are fundamentally unreliable because LLMs are typically ill-calibrated and often exhibit high confidence in erroneous outputs. We propose QuCo-RAG, which shifts from subjective confidence to objective statistics computed from pre-training data. Our method quantifies uncertainty through two stages: (1) before generation, we identify low-frequency entities indicating long-tail knowledge gaps; (2) during generation, we verify entity co-occurrence in the pre-training corpus, where zero co-occurrence often signals hallucination risk. Both stages leverage Infini-gram for millisecond-latency queries over 4 trillion tokens, triggering retrieval when uncertainty is high. Experiments on multi-hop QA benchmarks show QuCo-RAG achieves EM gains of 5--12 points over state-of-the-art baselines with OLMo-2 models, and transfers effectively to models with undisclosed pre-training data (Llama, Qwen, GPT), improving EM by up to 14 points. Domain generalization on biomedical QA further validates the robustness of our paradigm. These results establish corpus-grounded verification as a principled, practically model-agnostic paradigm for dynamic RAG. Our code is publicly available at this https URL.
3. 【2512.18996】Modular Layout Synthesis (MLS): Front-end Code via Structure Normalization and Constrained Generation
链接:https://arxiv.org/abs/2512.18996
作者:Chong Liu,Ming Zhang,Fei Li,Hao Zhou,Xiaoshuang Chen,Ye Yuan
类目:Information Retrieval (cs.IR); Software Engineering (cs.SE)
关键词:Automated front-end engineering, manual coding overhead, front-end engineering drastically, engineering drastically reduces, drastically reduces development
备注:
点击查看摘要
Abstract:Automated front-end engineering drastically reduces development cycles and minimizes manual coding overhead. While Generative AI has shown promise in translating designs to code, current solutions often produce monolithic scripts, failing to natively support modern ecosystems like React, Vue, or Angular. Furthermore, the generated code frequently suffers from poor modularity, making it difficult to maintain. To bridge this gap, we introduce Modular Layout Synthesis (MLS), a hierarchical framework that merges visual understanding with structural normalization. Initially, a visual-semantic encoder maps the screen capture into a serialized tree topology, capturing the essential layout hierarchy. Instead of simple parsing, we apply heuristic deduplication and pattern recognition to isolate reusable blocks, creating a framework-agnostic schema. Finally, a constraint-based generation protocol guides the LLM to synthesize production-ready code with strict typing and component props. Evaluations show that MLS significantly outperforms existing baselines, ensuring superior code reusability and structural integrity across multiple frameworks
4. 【2512.18683】CIRR: Causal-Invariant Retrieval-Augmented Recommendation with Faithful Explanations under Distribution Shift
链接:https://arxiv.org/abs/2512.18683
作者:Sebastian Sun
类目:Information Retrieval (cs.IR)
关键词:Recent advances, enhancing recommendation systems, external knowledge, shown promise, promise in enhancing
备注:
点击查看摘要
Abstract:Recent advances in retrieval-augmented generation (RAG) have shown promise in enhancing recommendation systems with external knowledge. However, existing RAG-based recommenders face two critical challenges: (1) vulnerability to distribution shifts across different environments (e.g., time periods, user segments), leading to performance degradation in out-of-distribution (OOD) scenarios, and (2) lack of faithful explanations that can be verified against retrieved evidence. In this paper, we propose CIRR, a Causal-Invariant Retrieval-Augmented Recommendation framework that addresses both challenges simultaneously. CIRR learns environment-invariant user preference representations through causal inference, which guide a debiased retrieval process to select relevant evidence from multiple sources. Furthermore, we introduce consistency constraints that enforce faithfulness between retrieved evidence, generated explanations, and recommendation outputs. Extensive experiments on two real-world datasets demonstrate that CIRR achieves robust performance under distribution shifts, reducing performance degradation from 15.4% (baseline) to only 5.6% in OOD scenarios, while providing more faithful and interpretable explanations (26% improvement in faithfulness score) compared to state-of-the-art baselines.
5. 【2512.18434】Efficient Optimization of Hierarchical Identifiers for Generative Recommendation
链接:https://arxiv.org/abs/2512.18434
作者:Federica Valeau,Odysseas Boufalis,Polytimi Gkotsi,Joshua Rosenthal,David Vos
类目:Information Retrieval (cs.IR)
关键词:utilizing balanced tree-structured, recommendation inference efficiency, generative retrieval model, contrastive training objectives, balanced tree-structured item
备注: Accepted at ECIR 2026 Reproducibility Track (to appear)
点击查看摘要
Abstract:SEATER is a generative retrieval model that improves recommendation inference efficiency and retrieval quality by utilizing balanced tree-structured item identifiers and contrastive training objectives. We reproduce and validate SEATER's reported improvements in retrieval quality over strong baselines across all datasets from the original work, and extend the evaluation to Yambda, a large-scale music recommendation dataset. Our experiments verify SEATER's strong performance, but show that its tree construction step during training becomes a major bottleneck as the number of items grows. To address this, we implement and evaluate two alternative construction algorithms: a greedy method optimized for minimal build time, and a hybrid method that combines greedy clustering at high levels with more precise grouping at lower levels. The greedy method reduces tree construction time to less than 2% of the original with only a minor drop in quality on the dataset with the largest item collection. The hybrid method achieves retrieval quality on par with the original, and even improves on the largest dataset, while cutting construction time to just 5-8%. All data and code are publicly available for full reproducibility at this https URL.
6. 【2512.18384】Datasets for machine learning and for assessing the intelligence level of automatic patent search systems
链接:https://arxiv.org/abs/2512.18384
作者:Boris Genin(1),Alexander Gorbunov(1),Dmitry Zolkin(1),Igor Nekrasov(1) ((1) Federal Institute of Industrial Property, Berezhkovskaya nab. 30-1, Moscow, Russian Federation)
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:artificial intelligence lies, semantic clusters, developing large datasets, ensuring their availability, automating prior art
备注: 14 pages, 3 figures, 2 tables
点击查看摘要
Abstract:The key to success in automating prior art search in patent research using artificial intelligence lies in developing large datasets for machine learning and ensuring their availability. This work is dedicated to providing a comprehensive solution to the problem of creating infrastructure for research in this field, including datasets and tools for calculating search quality criteria. The paper discusses the concept of semantic clusters of patent documents that determine the state of the art in a given subject, as proposed by the authors. A definition of such semantic clusters is also provided. Prior art search is presented as the task of identifying elements within a semantic cluster of patent documents in the subject area specified by the document under consideration. A generator of user-configurable datasets for machine learning, based on collections of U.S. and Russian patent documents, is described. The dataset generator creates a database of links to documents in semantic clusters. Then, based on user-defined parameters, it forms a dataset of semantic clusters in JSON format for machine learning. To evaluate machine learning outcomes, it is proposed to calculate search quality scores that account for semantic clusters of the documents being searched. To automate the evaluation process, the paper describes a utility developed by the authors for assessing the quality of prior art document search.
7. 【2512.18283】Improving Data Reusability in Interactive Information Retrieval: Insights from the Community
链接:https://arxiv.org/abs/2512.18283
作者:Tianji Jiang,Wenqi Li,Jiqun Liu
类目:Information Retrieval (cs.IR); Digital Libraries (cs.DL)
关键词:conducted semi-structured interviews, IIR researchers, data reuse practices, data reuse, conducted semi-structured
备注: Accepted by CHIIR 2025
点击查看摘要
Abstract:In this study, we conducted semi-structured interviews with 21 IIR researchers to investigate their data reuse practices. This study aims to expand upon current findings by exploring IIR researchers' information-obtaining behaviors regarding data reuse. We identified the information about shared data characteristics that IIR researchers need when evaluating data reusability, as well as the sources they typically consult to obtain this information. We consider this work to be an initial step toward revealing IIR researchers' data reuse practices and identifying what the community needs to do to promote data reuse. We hope that this study, as well as future research, will inspire more individuals to contribute to ongoing efforts aimed at designing standards, infrastructures, and policies, as well as fostering a sustainable culture of data sharing and reuse in this field.
8. 【2512.18117】Factorized Transport Alignment for Multimodal and Multiview E-commerce Representation Learning
链接:https://arxiv.org/abs/2512.18117
作者:Xiwen Chen,Yen-Chieh Lien,Susan Liu,María Castaños,Abolfazl Razi,Xiaoting Zhao,Congzhe Su
类目:Information Retrieval (cs.IR)
关键词:capture diverse signals, requires robust multimodal, robust multimodal representations, e-commerce requires robust, rapid growth
备注: Accepted by WSDM'26
点击查看摘要
Abstract:The rapid growth of e-commerce requires robust multimodal representations that capture diverse signals from user-generated listings. Existing vision-language models (VLMs) typically align titles with primary images, i.e., single-view, but overlook non-primary images and auxiliary textual views that provide critical semantics in open marketplaces such as Etsy or Poshmark. To this end, we propose a framework that unifies multimodal and multi-view learning through Factorized Transport, a lightweight approximation of optimal transport, designed for scalability and deployment efficiency. During training, the method emphasizes primary views while stochastically sampling auxiliary ones, reducing training cost from quadratic in the number of views to constant per item. At inference, all views are fused into a single cached embedding, preserving the efficiency of two-tower retrieval with no additional online overhead. On an industrial dataset of 1M product listings and 0.3M interactions, our approach delivers consistent improvements in cross-view and query-to-item retrieval, achieving up to +7.9% Recall@500 over strong multimodal baselines. Overall, our framework bridges scalability with optimal transport-based learning, making multi-view pretraining practical for large-scale e-commerce search.
计算机视觉
1. 【2512.19693】he Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding
链接:https://arxiv.org/abs/2512.19693
作者:Weichen Fan,Haiwen Diao,Quan Wang,Dahua Lin,Ziwei Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Deep representations, inherently intertwined, representations across modalities, modalities are inherently, Deep
备注: Code link: [this https URL](https://github.com/WeichenFan/UAE)
点击查看摘要
Abstract:Deep representations across modalities are inherently intertwined. In this paper, we systematically analyze the spectral characteristics of various semantic and pixel encoders. Interestingly, our study uncovers a highly inspiring and rarely explored correspondence between an encoder's feature spectrum and its functional role: semantic encoders primarily capture low-frequency components that encode abstract meaning, whereas pixel encoders additionally retain high-frequency information that conveys fine-grained detail. This heuristic finding offers a unifying perspective that ties encoder behavior to its underlying spectral structure. We define it as the Prism Hypothesis, where each data modality can be viewed as a projection of the natural world onto a shared feature spectrum, just like the prism. Building on this insight, we propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel details via an innovative frequency-band modulator, enabling their seamless coexistence. Extensive experiments on ImageNet and MS-COCO benchmarks validate that our UAE effectively unifies semantic abstraction and pixel-level fidelity into a single latent space with state-of-the-art performance.
2. 【2512.19692】Interact2Ar: Full-Body Human-Human Interaction Generation via Autoregressive Diffusion Models
链接:https://arxiv.org/abs/2512.19692
作者:Pablo Ruiz-Ponce,Sergio Escalera,José García-Rodríguez,Jiankang Deng,Rolandos Alexandros Potamias
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:high-quality individual body, Generating realistic human-human, challenging task, task that requires, high-quality individual
备注: Project Page: [this https URL](https://pabloruizponce.com/papers/Interact2Ar)
点击查看摘要
Abstract:Generating realistic human-human interactions is a challenging task that requires not only high-quality individual body and hand motions, but also coherent coordination among all interactants. Due to limitations in available data and increased learning complexity, previous methods tend to ignore hand motions, limiting the realism and expressivity of the interactions. Additionally, current diffusion-based approaches generate entire motion sequences simultaneously, limiting their ability to capture the reactive and adaptive nature of human interactions. To address these limitations, we introduce Interact2Ar, the first end-to-end text-conditioned autoregressive diffusion model for generating full-body, human-human interactions. Interact2Ar incorporates detailed hand kinematics through dedicated parallel branches, enabling high-fidelity full-body generation. Furthermore, we introduce an autoregressive pipeline coupled with a novel memory technique that facilitates adaptation to the inherent variability of human interactions using efficient large context windows. The adaptability of our model enables a series of downstream applications, including temporal motion composition, real-time adaptation to disturbances, and extension beyond dyadic to multi-person scenarios. To validate the generated motions, we introduce a set of robust evaluators and extended metrics designed specifically for assessing full-body interactions. Through quantitative and qualitative experiments, we demonstrate the state-of-the-art performance of Interact2Ar.
3. 【2512.19687】Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning
链接:https://arxiv.org/abs/2512.19687
作者:Apoorv Vyas,Heng-Jui Chang,Cheng-Fu Yang,Po-Yao Huang,Luya Gao,Julius Richter,Sanyuan Chen,Matt Le,Piotr Dollár,Christoph Feichtenhofer,Ann Lee,Wei-Ning Hsu
类目:ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:introduce Perception Encoder, Perception Encoder Audiovisual, Perception Encoder, video understanding trained, introduce Perception
备注:
点击查看摘要
Abstract:We introduce Perception Encoder Audiovisual, PE-AV, a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built on PE, PE-AV makes several key contributions to extend representations to audio, and natively support joint embeddings across audio-video, audio-text, and video-text modalities. PE-AV's unified cross-modal embeddings enable novel tasks such as speech retrieval, and set a new state of the art across standard audio and video benchmarks. We unlock this by building a strong audiovisual data engine that synthesizes high-quality captions for O(100M) audio-video pairs, enabling large-scale supervision consistent across modalities. Our audio data includes speech, music, and general sound effects-avoiding single-domain limitations common in prior work. We exploit ten pairwise contrastive objectives, showing that scaling cross-modality and caption-type pairs strengthens alignment and improves zero-shot performance. We further develop PE-A-Frame by fine-tuning PE-AV with frame-level contrastive objectives, enabling fine-grained audio-frame-to-text alignment for tasks such as sound event detection.
4. 【2512.19686】Visual-Aware CoT: Achieving High-Fidelity Visual Consistency in Unified Models
链接:https://arxiv.org/abs/2512.19686
作者:Zixuan Ye,Quande Liu,Cong Wei,Yuanxing Zhang,Xintao Wang,Pengfei Wan,Kun Gai,Wenhan Luo
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:visual context consistency, visual, Adaptive Visual Planning, visual context, Iterative Visual Correction
备注: Project Page: [this https URL](https://zixuan-ye.github.io/VACoT/)
点击查看摘要
Abstract:Recently, the introduction of Chain-of-Thought (CoT) has largely improved the generation ability of unified models. However, it is observed that the current thinking process during generation mainly focuses on the text consistency with the text prompt, ignoring the \textbf{visual context consistency} with the visual reference images during the multi-modal generation, e.g., multi-reference generation. The lack of such consistency results in the failure in maintaining key visual features (like human ID, object attribute, style). To this end, we integrate the visual context consistency into the reasoning of unified models, explicitly motivating the model to sustain such consistency by 1) Adaptive Visual Planning: generating structured visual check list to figure out the visual element of needed consistency keeping, and 2) Iterative Visual Correction: performing self-reflection with the guidance of check lists and refining the generated result in an iterative manner. To achieve this, we use supervised finetuning to teach the model how to plan the visual checking, conduct self-reflection and self-refinement, and use flow-GRPO to further enhance the visual consistency through a customized visual checking reward. The experiments show that our method outperforms both zero-shot unified models and those with text CoTs in multi-modal generation, demonstrating higher visual context consistency.
5. 【2512.19684】Zero-shot Reconstruction of In-Scene Object Manipulation from Video
链接:https://arxiv.org/abs/2512.19684
作者:Dixuan Lin,Tianyou Wang,Zhuoyang Pan,Yufu Wang,Lingjie Liu,Kostas Daniilidis
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:monocular RGB video, in-scene object manipulation, reconstructing in-scene object, monocular RGB, RGB video
备注:
点击查看摘要
Abstract:We build the first system to address the problem of reconstructing in-scene object manipulation from a monocular RGB video. It is challenging due to ill-posed scene reconstruction, ambiguous hand-object depth, and the need for physically plausible interactions. Existing methods operate in hand centric coordinates and ignore the scene, hindering metric accuracy and practical use. In our method, we first use data-driven foundation models to initialize the core components, including the object mesh and poses, the scene point cloud, and the hand poses. We then apply a two-stage optimization that recovers a complete hand-object motion from grasping to interaction, which remains consistent with the scene information observed in the input video.
6. 【2512.19683】From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs
链接:https://arxiv.org/abs/2512.19683
作者:Mingrui Wu,Zhaozhi Wang,Fangjinhua Wang,Jiaolong Yang,Marc Pollefeys,Tong Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Language Models, Multimodal Large Language, Language Models, Multimodal Large, Large Language
备注: Project page: [this https URL](https://harmlesssr.github.io/openbench/)
点击查看摘要
Abstract:While Multimodal Large Language Models (MLLMs) have achieved impressive performance on semantic tasks, their spatial intelligence--crucial for robust and grounded AI systems--remains underdeveloped. Existing benchmarks fall short of diagnosing this limitation: they either focus on overly simplified qualitative reasoning or rely on domain-specific indoor data, constrained by the lack of outdoor datasets with verifiable metric ground truth. To bridge this gap, we introduce a large-scale benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions that span a hierarchical spectrum--from qualitative relational reasoning to quantitative metric and kinematic understanding. Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings. Further analysis using synthetic abnormal scenes and blinding tests confirms that current MLLMs depend heavily on linguistic priors instead of grounded visual reasoning. Our benchmark thus provides a principled platform for diagnosing these limitations and advancing physically grounded spatial intelligence.
7. 【2512.19680】VA-$π$: Variational Policy Alignment for Pixel-Aware Autoregressive Generation
链接:https://arxiv.org/abs/2512.19680
作者:Xinyao Liao,Qiyuan He,Kai Xu,Xiaoye Qu,Yicong Li,Wei Wei,Angela Yao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:map images, visual generation relies, token sequences, generation relies, token
备注: 21 pages, 24 figures
点击查看摘要
Abstract:Autoregressive (AR) visual generation relies on tokenizers to map images to and from discrete sequences. However, tokenizers are trained to reconstruct clean images from ground-truth tokens, while AR generators are optimized only for token likelihood. This misalignment leads to generated token sequences that may decode into low-quality images, without direct supervision from the pixel space. We propose VA-$\pi$, a lightweight post-training framework that directly optimizes AR models with a principled pixel-space objective. VA-$\pi$ formulates the generator-tokenizer alignment as a variational optimization, deriving an evidence lower bound (ELBO) that unifies pixel reconstruction and autoregressive modeling. To optimize under the discrete token space, VA-$\pi$ introduces a reinforcement-based alignment strategy that treats the AR generator as a policy, uses pixel-space reconstruction quality as its intrinsic reward. The reward is measured by how well the predicted token sequences can reconstruct the original image under teacher forcing, giving the model direct pixel-level guidance without expensive free-running sampling. The regularization term of the ELBO serves as a natural regularizer, maintaining distributional consistency of tokens. VA-$\pi$ enables rapid adaptation of existing AR generators, without neither tokenizer retraining nor external reward models. With only 1% ImageNet-1K data and 25 minutes of tuning, it reduces FID from 14.36 to 7.65 and improves IS from 86.55 to 116.70 on LlamaGen-XXL, while also yielding notable gains in the text-to-image task on GenEval for both visual generation model (LlamaGen: from 0.306 to 0.339) and unified multi-modal model (Janus-Pro: from 0.725 to 0.744). Code is available at this https URL.
8. 【2512.19678】WorldWarp: Propagating 3D Geometry with Asynchronous Video Diffusion
链接:https://arxiv.org/abs/2512.19678
作者:Hanyang Kong,Xingyi Yang,Xiaoxu Zheng,Xinchao Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:camera-conditioned latent space, demands strict adherence, Generating long-range, geometrically consistent video, pixel space
备注: Project page: [this https URL](https://hyokong.github.io/worldwarp-page/)
点击查看摘要
Abstract:Generating long-range, geometrically consistent video presents a fundamental dilemma: while consistency demands strict adherence to 3D geometry in pixel space, state-of-the-art generative models operate most effectively in a camera-conditioned latent space. This disconnect causes current methods to struggle with occluded areas and complex camera trajectories. To bridge this gap, we propose WorldWarp, a framework that couples a 3D structural anchor with a 2D generative refiner. To establish geometric grounding, WorldWarp maintains an online 3D geometric cache built via Gaussian Splatting (3DGS). By explicitly warping historical content into novel views, this cache acts as a structural scaffold, ensuring each new frame respects prior geometry. However, static warping inevitably leaves holes and artifacts due to occlusions. We address this using a Spatio-Temporal Diffusion (ST-Diff) model designed for a "fill-and-revise" objective. Our key innovation is a spatio-temporal varying noise schedule: blank regions receive full noise to trigger generation, while warped regions receive partial noise to enable refinement. By dynamically updating the 3D cache at every step, WorldWarp maintains consistency across video chunks. Consequently, it achieves state-of-the-art fidelity by ensuring that 3D logic guides structure while diffusion logic perfects texture. Project page: \href{this https URL}{this https URL}.
9. 【2512.19676】Efficient Vision Mamba for MRI Super-Resolution via Hybrid Selective Scanning
链接:https://arxiv.org/abs/2512.19676
作者:Mojtaba Safari,Shansong Wang,Vanessa L Wildman,Mingzhe Hu,Zach Eidex,Chih-Wei Chang,Erik H Middlebrooks,Richard L.J Qiu,Pretesh Patel,Ashesh B. Jania,Hui Mao,Zhen Tian,Xiaofeng Yang
类目:Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
关键词:long acquisition times, acquisition times limit, times limit clinical, High-resolution MRI, critical for diagnosis
备注:
点击查看摘要
Abstract:Background: High-resolution MRI is critical for diagnosis, but long acquisition times limit clinical use. Super-resolution (SR) can enhance resolution post-scan, yet existing deep learning methods face fidelity-efficiency trade-offs. Purpose: To develop a computationally efficient and accurate deep learning framework for MRI SR that preserves anatomical detail for clinical integration. Materials and Methods: We propose a novel SR framework combining multi-head selective state-space models (MHSSM) with a lightweight channel MLP. The model uses 2D patch extraction with hybrid scanning to capture long-range dependencies. Each MambaFormer block integrates MHSSM, depthwise convolutions, and gated channel mixing. Evaluation used 7T brain T1 MP2RAGE maps (n=142) and 1.5T prostate T2w MRI (n=334). Comparisons included Bicubic interpolation, GANs (CycleGAN, Pix2pix, SPSR), transformers (SwinIR), Mamba (MambaIR), and diffusion models (I2SB, Res-SRDiff). Results: Our model achieved superior performance with exceptional efficiency. For 7T brain data: SSIM=0.951+-0.021, PSNR=26.90+-1.41 dB, LPIPS=0.076+-0.022, GMSD=0.083+-0.017, significantly outperforming all baselines (p0.001). For prostate data: SSIM=0.770+-0.049, PSNR=27.15+-2.19 dB, LPIPS=0.190+-0.095, GMSD=0.087+-0.013. The framework used only 0.9M parameters and 57 GFLOPs, reducing parameters by 99.8% and computation by 97.5% versus Res-SRDiff, while outperforming SwinIR and MambaIR in accuracy and efficiency. Conclusion: The proposed framework provides an efficient, accurate MRI SR solution, delivering enhanced anatomical detail across datasets. Its low computational demand and state-of-the-art performance show strong potential for clinical translation.
10. 【2512.19663】Beyond CLIP: Knowledge-Enhanced Multimodal Transformers for Cross-Modal Alignment in Diabetic Retinopathy Diagnosis
链接:https://arxiv.org/abs/2512.19663
作者:Argha Kamal Samanta,Harshika Goyal,Vasudha Joshi,Tushar Mungle,Pabitra Mitra
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:preventable blindness worldwide, demanding accurate automated, Diabetic retinopathy, automated diagnostic systems, accurate automated diagnostic
备注: 14 pages, 14 figures
点击查看摘要
Abstract:Diabetic retinopathy (DR) is a leading cause of preventable blindness worldwide, demanding accurate automated diagnostic systems. While general-domain vision-language models like Contrastive Language-Image Pre-Training (CLIP) perform well on natural image tasks, they struggle in medical domain applications, particularly in cross-modal retrieval for ophthalmological images. We propose a novel knowledge-enhanced joint embedding framework that integrates retinal fundus images, clinical text, and structured patient data through a multimodal transformer architecture to address the critical gap in medical image-text alignment. Our approach employs separate encoders for each modality: a Vision Transformer (ViT-B/16) for retinal images, Bio-ClinicalBERT for clinical narratives, and a multilayer perceptron for structured demographic and clinical features. These modalities are fused through a joint transformer with modality-specific embeddings, trained using multiple objectives including contrastive losses between modality pairs, reconstruction losses for images and text, and classification losses for DR severity grading according to ICDR and SDRG schemes. Experimental results on the Brazilian Multilabel Ophthalmological Dataset (BRSET) demonstrate significant improvements over baseline models. Our framework achieves near-perfect text-to-image retrieval performance with Recall@1 of 99.94% compared to fine-tuned CLIP's 1.29%, while maintaining state-of-the-art classification accuracy of 97.05% for SDRG and 97.97% for ICDR. Furthermore, zero-shot evaluation on the unseen DeepEyeNet dataset validates strong generalizability with 93.95% Recall@1 versus 0.22% for fine-tuned CLIP. These results demonstrate that our multimodal training approach effectively captures cross-modal relationships in the medical domain, establishing both superior retrieval capabilities and robust diagnostic performance.
11. 【2512.19661】Over++: Generative Video Compositing for Layer Interaction Effects
链接:https://arxiv.org/abs/2512.19661
作者:Luchao Qi,Jiaye Wu,Jun Myeong Choi,Cary Phillips,Roni Sengupta,Dan B Goldman
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:splashes-between foreground subjects, manually create environmental, create environmental interactions-such, video compositing workflows, artists must manually
备注: Project page: [this https URL](https://overplusplus.github.io/)
点击查看摘要
Abstract:In professional video compositing workflows, artists must manually create environmental interactions-such as shadows, reflections, dust, and splashes-between foreground subjects and background layers. Existing video generative models struggle to preserve the input video while adding such effects, and current video inpainting methods either require costly per-frame masks or yield implausible results. We introduce augmented compositing, a new task that synthesizes realistic, semi-transparent environmental effects conditioned on text prompts and input video layers, while preserving the original scene. To address this task, we present Over++, a video effect generation framework that makes no assumptions about camera pose, scene stationarity, or depth supervision. We construct a paired effect dataset tailored for this task and introduce an unpaired augmentation strategy that preserves text-driven editability. Our method also supports optional mask control and keyframe guidance without requiring dense annotations. Despite training on limited data, Over++ produces diverse and realistic environmental effects and outperforms existing baselines in both effect generation and scene preservation.
12. 【2512.19648】4D Gaussian Splatting as a Learned Dynamical System
链接:https://arxiv.org/abs/2512.19648
作者:Arnold Caleb Asiimwe,Carl Vondrick
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:applying per-frame deformations, neural dynamical field, Gaussian Splatting, learned neural dynamical, continuous-time dynamical system
备注:
点击查看摘要
Abstract:We reinterpret 4D Gaussian Splatting as a continuous-time dynamical system, where scene motion arises from integrating a learned neural dynamical field rather than applying per-frame deformations. This formulation, which we call EvoGS, treats the Gaussian representation as an evolving physical system whose state evolves continuously under a learned motion law. This unlocks capabilities absent in deformation-based approaches:(1) sample-efficient learning from sparse temporal supervision by modeling the underlying motion law; (2) temporal extrapolation enabling forward and backward prediction beyond observed time ranges; and (3) compositional dynamics that allow localized dynamics injection for controllable scene synthesis. Experiments on dynamic scene benchmarks show that EvoGS achieves better motion coherence and temporal consistency compared to deformation-field baselines while maintaining real-time rendering
13. 【2512.19632】Generative diffusion models for agricultural AI: plant image generation, indoor-to-outdoor translation, and expert preference alignment
链接:https://arxiv.org/abs/2512.19632
作者:Da Tan,Michael Beck,Christopher P. Bidinosti,Robert H. Gulden,Christopher J. Henry
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:artificial intelligence depends, intelligence depends heavily, real field conditions, agricultural artificial intelligence, high-quality plant image
备注:
点击查看摘要
Abstract:The success of agricultural artificial intelligence depends heavily on large, diverse, and high-quality plant image datasets, yet collecting such data in real field conditions is costly, labor intensive, and seasonally constrained. This paper investigates diffusion-based generative modeling to address these challenges through plant image synthesis, indoor-to-outdoor translation, and expert preference aligned fine tuning. First, a Stable Diffusion model is fine tuned on captioned indoor and outdoor plant imagery to generate realistic, text conditioned images of canola and soybean. Evaluation using Inception Score, Frechet Inception Distance, and downstream phenotype classification shows that synthetic images effectively augment training data and improve accuracy. Second, we bridge the gap between high resolution indoor datasets and limited outdoor imagery using DreamBooth-based text inversion and image guided diffusion, generating translated images that enhance weed detection and classification with YOLOv8. Finally, a preference guided fine tuning framework trains a reward model on expert scores and applies reward weighted updates to produce more stable and expert aligned outputs. Together, these components demonstrate a practical pathway toward data efficient generative pipelines for agricultural AI.
14. 【2512.19629】LoGoPlanner: Localization Grounded Navigation Policy with Metric-aware Visual Geometry
链接:https://arxiv.org/abs/2512.19629
作者:Jiaqi Peng,Wenzhe Cai,Yuqiang Yang,Tai Wang,Yuan Shen,Jiangmiao Pang
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:Trajectory planning, mobile robots, fundamental and challenging, challenging capability, capability for mobile
备注: Project page: [this https URL](https://steinate.github.io/logoplanner.github.io/)
点击查看摘要
Abstract:Trajectory planning in unstructured environments is a fundamental and challenging capability for mobile robots. Traditional modular pipelines suffer from latency and cascading errors across perception, localization, mapping, and planning modules. Recent end-to-end learning methods map raw visual observations directly to control signals or trajectories, promising greater performance and efficiency in open-world settings. However, most prior end-to-end approaches still rely on separate localization modules that depend on accurate sensor extrinsic calibration for self-state estimation, thereby limiting generalization across embodiments and environments. We introduce LoGoPlanner, a localization-grounded, end-to-end navigation framework that addresses these limitations by: (1) finetuning a long-horizon visual-geometry backbone to ground predictions with absolute metric scale, thereby providing implicit state estimation for accurate localization; (2) reconstructing surrounding scene geometry from historical observations to supply dense, fine-grained environmental awareness for reliable obstacle avoidance; and (3) conditioning the policy on implicit geometry bootstrapped by the aforementioned auxiliary tasks, thereby reducing error this http URL evaluate LoGoPlanner in both simulation and real-world settings, where its fully end-to-end design reduces cumulative error while metric-aware geometry memory enhances planning consistency and obstacle avoidance, leading to more than a 27.3\% improvement over oracle-localization baselines and strong generalization across embodiments and environments. The code and models have been made publicly available on the \href{this https URL}{project page}.
15. 【2512.19609】MapTrace: Scalable Data Generation for Route Tracing on Maps
链接:https://arxiv.org/abs/2512.19609
作者:Artemis Panagopoulou,Aveek Purohit,Achin Kulshrestha,Soroosh Yazdani,Mohit Goyal
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, maps remains limited
备注:
点击查看摘要
Abstract:While Multimodal Large Language Models have achieved human-like performance on many visual and textual reasoning tasks, their proficiency in fine-grained spatial understanding, such as route tracing on maps remains limited. Unlike humans, who can quickly learn to parse and navigate maps, current models often fail to respect fundamental path constraints, in part due to the prohibitive cost and difficulty of collecting large-scale, pixel-accurate path annotations. To address this, we introduce a scalable synthetic data generation pipeline that leverages synthetic map images and pixel-level parsing to automatically produce precise annotations for this challenging task. Using this pipeline, we construct a fine-tuning dataset of 23k path samples across 4k maps, enabling models to acquire more human-like spatial capabilities. Using this dataset, we fine-tune both open-source and proprietary MLLMs. Results on MapBench show that finetuning substantially improves robustness, raising success rates by up to 6.4 points, while also reducing path-tracing error (NDTW). These gains highlight that fine-grained spatial reasoning, absent in pretrained models, can be explicitly taught with synthetic supervision.
16. 【2512.19605】KerJEPA: Kernel Discrepancies for Euclidean Self-Supervised Learning
链接:https://arxiv.org/abs/2512.19605
作者:Eric Zimmermann,Harley Wiltzer,Justin Szeto,David Alvarez-Melis,Lester Mackey
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:Joint-Embedding Predictive Architectures, regularizing Euclidean representations, self-supervised Joint-Embedding Predictive, Predictive Architectures, yields provable gains
备注:
点击查看摘要
Abstract:Recent breakthroughs in self-supervised Joint-Embedding Predictive Architectures (JEPAs) have established that regularizing Euclidean representations toward isotropic Gaussian priors yields provable gains in training stability and downstream generalization. We introduce a new, flexible family of KerJEPAs, self-supervised learning algorithms with kernel-based regularizers. One instance of this family corresponds to the recently-introduced LeJEPA Epps-Pulley regularizer which approximates a sliced maximum mean discrepancy (MMD) with a Gaussian prior and Gaussian kernel. By expanding the class of viable kernels and priors and computing the closed-form high-dimensional limit of sliced MMDs, we develop alternative KerJEPAs with a number of favorable properties including improved training stability and design flexibility.
17. 【2512.19602】No Data? No Problem: Robust Vision-Tabular Learning with Missing Values
链接:https://arxiv.org/abs/2512.19602
作者:Marta Hasny,Laura Daza,Keno Bressem,Maxime Di Folco,Julia Schnabel
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large-scale medical biobanks, Large-scale medical, extensive tabular information, medical biobanks provide, biobanks provide imaging
备注:
点击查看摘要
Abstract:Large-scale medical biobanks provide imaging data complemented by extensive tabular information, such as demographics or clinical measurements. However, this abundance of tabular attributes does not reflect real-world datasets, where only a subset of attributes may be available. This discrepancy calls for methods that can leverage all the tabular data during training while remaining robust to missing values at inference. To address this challenge, we propose RoVTL (Robust Vision-Tabular Learning), a framework designed to handle any level of tabular data availability, from 0% to 100%. RoVTL comprises two key stages: contrastive pretraining, where we introduce tabular attribute missingness as data augmentation to promote robustness, and downstream task tuning using a gated cross-attention module for multimodal fusion. During fine-tuning, we employ a novel Tabular More vs. Fewer loss that ranks performance based on the amount of available tabular data. Combined with disentangled gradient learning, this enables consistent performance across all tabular data completeness scenarios. We evaluate RoVTL on cardiac MRI scans from the UK Biobank, demonstrating superior robustness to missing tabular data compared to prior methods. Furthermore, RoVTL successfully generalizes to an external cardiac MRI dataset for multimodal disease classification, and extends to the natural images domain, achieving robust performance on a car advertisements dataset. The code is available at this https URL.
18. 【2512.19560】BabyFlow: 3D modeling of realistic and expressive infant faces
链接:https://arxiv.org/abs/2512.19560
作者:Antonia Alomar,Mireia Masias,Marius George Linguraru,Federico M. Sukno,Gemma Piella
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:infant craniofacial morphology, frequent spontaneous expressions, analyzing infant craniofacial, craniofacial morphology, detection of developmental
备注:
点击查看摘要
Abstract:Early detection of developmental disorders can be aided by analyzing infant craniofacial morphology, but modeling infant faces is challenging due to limited data and frequent spontaneous expressions. We introduce BabyFlow, a generative AI model that disentangles facial identity and expression, enabling independent control over both. Using normalizing flows, BabyFlow learns flexible, probabilistic representations that capture the complex, non-linear variability of expressive infant faces without restrictive linear assumptions. To address scarce and uncontrolled expressive data, we perform cross-age expression transfer, adapting expressions from adult 3D scans to enrich infant datasets with realistic and systematic expressive variants. As a result, BabyFlow improves 3D reconstruction accuracy, particularly in highly expressive regions such as the mouth, eyes, and nose, and supports synthesis and modification of infant expressions while preserving identity. Additionally, by integrating with diffusion models, BabyFlow generates high-fidelity 2D infant images with consistent 3D geometry, providing powerful tools for data augmentation and early facial analysis.
19. 【2512.19546】ActAvatar: Temporally-Aware Precise Action Control for Talking Avatars
链接:https://arxiv.org/abs/2512.19546
作者:Ziqiao Peng,Yi Chen,Yifeng Ma,Guozhen Zhang,Zhiyao Sun,Zixiang Zhou,Youliang Zhang,Zhengguang Zhou,Zhaoxin Fan,Hongyan Liu,Yuan Zhou,Qinglin Lu,Jun He
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:talking avatar generation, face critical challenges, additional control signals, insufficient text-following capability, existing methods face
备注: Project Page: [this https URL](https://ziqiaopeng.github.io/ActAvatar/)
点击查看摘要
Abstract:Despite significant advances in talking avatar generation, existing methods face critical challenges: insufficient text-following capability for diverse actions, lack of temporal alignment between actions and audio content, and dependency on additional control signals such as pose skeletons. We present ActAvatar, a framework that achieves phase-level precision in action control through textual guidance by capturing both action semantics and temporal context. Our approach introduces three core innovations: (1) Phase-Aware Cross-Attention (PACA), which decomposes prompts into a global base block and temporally-anchored phase blocks, enabling the model to concentrate on phase-relevant tokens for precise temporal-semantic alignment; (2) Progressive Audio-Visual Alignment, which aligns modality influence with the hierarchical feature learning process-early layers prioritize text for establishing action structure while deeper layers emphasize audio for refining lip movements, preventing modality interference; (3) A two-stage training strategy that first establishes robust audio-visual correspondence on diverse data, then injects action control through fine-tuning on structured annotations, maintaining both audio-visual alignment and the model's text-following capabilities. Extensive experiments demonstrate that ActAvatar significantly outperforms state-of-the-art methods in both action control and visual quality.
20. 【2512.19539】StoryMem: Multi-shot Long Video Storytelling with Memory
链接:https://arxiv.org/abs/2512.19539
作者:Kaiwen Zhang,Liming Jiang,Angtian Wang,Jacob Zhiyuan Fang,Tiancheng Zhi,Qing Yan,Hao Kang,Xin Lu,Xingang Pan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:requires generating multi-shot, storytelling requires generating, Visual storytelling requires, generating multi-shot videos, requires generating
备注: Project page: [this https URL](https://kevin-thu.github.io/StoryMem)
点击查看摘要
Abstract:Visual storytelling requires generating multi-shot videos with cinematic quality and long-range consistency. Inspired by human memory, we propose StoryMem, a paradigm that reformulates long-form video storytelling as iterative shot synthesis conditioned on explicit visual memory, transforming pre-trained single-shot video diffusion models into multi-shot storytellers. This is achieved by a novel Memory-to-Video (M2V) design, which maintains a compact and dynamically updated memory bank of keyframes from historical generated shots. The stored memory is then injected into single-shot video diffusion models via latent concatenation and negative RoPE shifts with only LoRA fine-tuning. A semantic keyframe selection strategy, together with aesthetic preference filtering, further ensures informative and stable memory throughout generation. Moreover, the proposed framework naturally accommodates smooth shot transitions and customized story generation applications. To facilitate evaluation, we introduce ST-Bench, a diverse benchmark for multi-shot video storytelling. Extensive experiments demonstrate that StoryMem achieves superior cross-shot consistency over previous methods while preserving high aesthetic quality and prompt adherence, marking a significant step toward coherent minute-long video storytelling.
21. 【2512.19535】CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion
链接:https://arxiv.org/abs/2512.19535
作者:Moritz Böhle,Amélie Royer,Juliette Marrie,Edouard Grave,Patrick Pérez
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:pretrained vision encoder, Vision-language models, commonly trained, trained by inserting, pretrained vision
备注:
点击查看摘要
Abstract:Vision-language models (VLMs) are commonly trained by inserting image tokens from a pretrained vision encoder into the textual stream of a language model. This allows text and image information to fully attend to one another within the model, but becomes extremely costly for high-resolution images, long conversations, or streaming videos, both in memory and compute. VLMs leveraging cross-attention are an efficient alternative to token insertion but exhibit a clear performance gap, in particular on tasks involving fine-grained visual details. We find that a key to improving such models is to also enable local text-to-text interaction in the dedicated cross-attention layers. Building on this, we propose CASA, Cross-Attention via Self-Attention, a simple and efficient paradigm which substantially reduces the gap with full token insertion on common image understanding benchmarks, while enjoying the same scalability as cross-attention models when applied to long-context multimodal tasks such as streaming video captioning. For samples and code, please see our project page at this https URL .
22. 【2512.19534】SlicerOrbitSurgerySim: An Open-Source Platform for Virtual Registration and Quantitative Comparison of Preformed Orbital Plates
链接:https://arxiv.org/abs/2512.19534
作者:Chi Zhang,Braedon Gunn,Andrew M. Read-Fuller
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Poor adaptation, orbital implants remains, preformed orbital plates, revision surgery, remains a major
备注: 12 pages, 8 figures. Submitted to Journal of Oral and Maxillofacial Surgery. Code: [this https URL](https://github.com/chz31/SlicerOrbitSurgerySim/tree/main)
点击查看摘要
Abstract:Poor adaptation of orbital implants remains a major contributor to postoperative complications and revision surgery. Although preformed orbital plates are widely used to reduce cost and operative time compared with customized implants, surgeons currently lack publicly available tools and standardized metrics to quantitatively compare plate fit across vendors, sizes, and patient anatomy. We developed SlicerOrbitSurgerySim, an open-source extension for the 3D Slicer platform that enables interactive virtual registration, evaluation, and comparison of multiple preformed orbital plates in a patient-specific virtual planning environment. The software generates reproducible quantitative plate-to-orbit distance metrics and visualization tools that support both patient-specific planning and population-level statistical analysis of plate adaptability. By facilitating objective comparison of implant designs and placement strategies, this tool aims to improve preoperative decision-making, reduce intraoperative plate modification, and promote collaborative research and surgical education. Pilot studies, sample datasets, and detailed tutorials are provided to support testing, transparency, and reproducibility.
23. 【2512.19528】Multi-Modal Soccer Scene Analysis with Masked Pre-Training
链接:https://arxiv.org/abs/2512.19528
作者:Marc Peral,Guillem Capellera,Luis Ferraz,Antonio Rubio,Antonio Agudo
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:tactical camera footage, analyzing soccer scenes, ball trajectory inference, ball state classification, ball possessor identification
备注: 10 pages, 2 figures. WACV 2026
点击查看摘要
Abstract:In this work we propose a multi-modal architecture for analyzing soccer scenes from tactical camera footage, with a focus on three core tasks: ball trajectory inference, ball state classification, and ball possessor identification. To this end, our solution integrates three distinct input modalities (player trajectories, player types and image crops of individual players) into a unified framework that processes spatial and temporal dynamics using a cascade of sociotemporal transformer blocks. Unlike prior methods, which rely heavily on accurate ball tracking or handcrafted heuristics, our approach infers the ball trajectory without direct access to its past or future positions, and robustly identifies the ball state and ball possessor under noisy or occluded conditions from real top league matches. We also introduce CropDrop, a modality-specific masking pre-training strategy that prevents over-reliance on image features and encourages the model to rely on cross-modal patterns during pre-training. We show the effectiveness of our approach on a large-scale dataset providing substantial improvements over state-of-the-art baselines in all tasks. Our results highlight the benefits of combining structured and visual cues in a transformer-based architecture, and the importance of realistic masking strategies in multi-modal learning.
24. 【2512.19522】A Convolutional Neural Deferred Shader for Physics Based Rendering
链接:https://arxiv.org/abs/2512.19522
作者:Zhuo He,Yingdong Ru,Qianying Liu,Paul Henderson,Nicolas Pugeault
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent advances, achieved impressive results, multilayer perceptron, achieved impressive, relighting real-world objects
备注:
点击查看摘要
Abstract:Recent advances in neural rendering have achieved impressive results on photorealistic shading and relighting, by using a multilayer perceptron (MLP) as a regression model to learn the rendering equation from a real-world dataset. Such methods show promise for photorealistically relighting real-world objects, which is difficult to classical rendering, as there is no easy-obtained material ground truth. However, significant challenges still remain the dense connections in MLPs result in a large number of parameters, which requires high computation resources, complicating the training, and reducing performance during rendering. Data driven approaches require large amounts of training data for generalization; unbalanced data might bias the model to ignore the unusual illumination conditions, e.g. dark scenes. This paper introduces pbnds+: a novel physics-based neural deferred shading pipeline utilizing convolution neural networks to decrease the parameters and improve the performance in shading and relighting tasks; Energy regularization is also proposed to restrict the model reflection during dark illumination. Extensive experiments demonstrate that our approach outperforms classical baselines, a state-of-the-art neural shading model, and a diffusion-based method.
25. 【2512.19512】Anatomy-R1: Enhancing Anatomy Reasoning in Multimodal Large Language Models via Anatomical Similarity Curriculum and Group Diversity Augmentation
链接:https://arxiv.org/abs/2512.19512
作者:Ziyang Song,Zelin Zang,Zuyao Chen,Xusheng Liang,Dong Yi,Jinlin Wu,Hongbin Liu,Jiebo Luo
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Multimodal Large Language, imaging remains underexplored, achieved impressive progress, Large Language Models, anatomical surgical images
备注:
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) have achieved impressive progress in natural image reasoning, yet their potential in medical imaging remains underexplored, especially in clinical anatomical surgical images. Anatomy understanding tasks demand precise understanding and clinically coherent answers, which are difficult to achieve due to the complexity of medical data and the scarcity of high-quality expert annotations. These challenges limit the effectiveness of conventional Supervised Fine-Tuning (SFT) strategies. While recent work has demonstrated that Group Relative Policy Optimization (GRPO) can enhance reasoning in MLLMs without relying on large amounts of data, we find two weaknesses that hinder GRPO's reasoning performance in anatomy recognition: 1) knowledge cannot be effectively shared between different anatomical structures, resulting in uneven information gain and preventing the model from converging, and 2) the model quickly converges to a single reasoning path, suppressing the exploration of diverse strategies. To overcome these challenges, we propose two novel methods. First, we implement a progressive learning strategy called Anatomical Similarity Curriculum Learning by controlling question difficulty via the similarity of answer choices, enabling the model to master complex problems incrementally. Second, we utilize question augmentation referred to as Group Diversity Question Augmentation to expand the model's search space for difficult queries, mitigating the tendency to produce uniform responses. Comprehensive experiments on the SGG-VQA and OmniMedVQA benchmarks show our method achieves a significant improvement across the two benchmarks, demonstrating its effectiveness in enhancing the medical reasoning capabilities of MLLMs. The code can be found in this https URL
26. 【2512.19504】FusionNet: Physics-Aware Representation Learning for Multi-Spectral and Thermal Data via Trainable Signal-Processing Priors
链接:https://arxiv.org/abs/2512.19504
作者:Georgios Voulgaris
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:governing signal formation, multi-modal visual signals, Modern deep learning, physical processes governing, leading to brittle
备注: Preprint. Under review at IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS)
点击查看摘要
Abstract:Modern deep learning models operating on multi-modal visual signals often rely on inductive biases that are poorly aligned with the physical processes governing signal formation, leading to brittle performance under cross-spectral and real-world conditions. In particular, approaches that prioritise direct thermal cues struggle to capture indirect yet persistent environmental alterations induced by sustained heat emissions. This work introduces a physics-aware representation learning framework that leverages multi-spectral information to model stable signatures of long-term physical processes. Specifically, a geological Short Wave Infrared (SWIR) ratio sensitive to soil property changes is integrated with Thermal Infrared (TIR) data through an intermediate fusion architecture, instantiated as FusionNet. The proposed backbone embeds trainable differential signal-processing priors within convolutional layers, combines mixed pooling strategies, and employs wider receptive fields to enhance robustness across spectral modalities. Systematic ablations show that each architectural component contributes to performance gains, with DGCNN achieving 88.7% accuracy on the SWIR ratio and FusionNet reaching 90.6%, outperforming state-of-the-art baselines across five spectral configurations. Transfer learning experiments further show that ImageNet pretraining degrades TIR performance, highlighting the importance of modality-aware training for cross-spectral learning. Evaluated on real-world data, the results demonstrate that combining physics-aware feature selection with principled deep learning architectures yields robust and generalisable representations, illustrating how first-principles signal modelling can improve multi-spectral learning under challenging conditions.
Comments:
Preprint. Under review at IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS)
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2512.19504 [cs.CV]
(or
arXiv:2512.19504v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2512.19504
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Georgios Voulgaris [view email] [v1]
Mon, 22 Dec 2025 15:59:37 UTC (4,137 KB)
27. 【2512.19486】Dynamic Stream Network for Combinatorial Explosion Problem in Deformable Medical Image Registration
链接:https://arxiv.org/abs/2512.19486
作者:Shaochen Bi,Yuting He,Weiming Wang,Hao Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Medical Image Registration, Deformable Medical Image, Combinatorial explosion problem, Deformable Medical, explosion problem caused
备注:
点击查看摘要
Abstract:Combinatorial explosion problem caused by dual inputs presents a critical challenge in Deformable Medical Image Registration (DMIR). Since DMIR processes two images simultaneously as input, the combination relationships between features has grown exponentially, ultimately the model considers more interfering features during the feature modeling process. Introducing dynamics in the receptive fields and weights of the network enable the model to eliminate the interfering features combination and model the potential feature combination relationships. In this paper, we propose the Dynamic Stream Network (DySNet), which enables the receptive fields and weights to be dynamically adjusted. This ultimately enables the model to ignore interfering feature combinations and model the potential feature relationships. With two key innovations: 1) Adaptive Stream Basin (AdSB) module dynamically adjusts the shape of the receptive field, thereby enabling the model to focus on the feature relationships with greater correlation. 2) Dynamic Stream Attention (DySA) mechanism generates dynamic weights to search for more valuable feature relationships. Extensive experiments have shown that DySNet consistently outperforms the most advanced DMIR methods, highlighting its outstanding generalization ability. Our code will be released on the website: this https URL.
28. 【2512.19479】Emotion-Director: Bridging Affective Shortcut in Emotion-Oriented Image Generation
链接:https://arxiv.org/abs/2512.19479
作者:Guoli Jia,Junyao Hu,Xinwei Long,Kai Tian,Kaiyan Zhang,KaiKai Zhao,Ning Ding,Bowen Zhou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:demonstrated impressive capability, impressive capability, motivating exploration, specialized applications, Image generation based
备注:
点击查看摘要
Abstract:Image generation based on diffusion models has demonstrated impressive capability, motivating exploration into diverse and specialized applications. Owing to the importance of emotion in advertising, emotion-oriented image generation has attracted increasing attention. However, current emotion-oriented methods suffer from an affective shortcut, where emotions are approximated to semantics. As evidenced by two decades of research, emotion is not equivalent to semantics. To this end, we propose Emotion-Director, a cross-modal collaboration framework consisting of two modules. First, we propose a cross-Modal Collaborative diffusion model, abbreviated as MC-Diffusion. MC-Diffusion integrates visual prompts with textual prompts for guidance, enabling the generation of emotion-oriented images beyond semantics. Further, we improve the DPO optimization by a negative visual prompt, enhancing the model's sensitivity to different emotions under the same semantics. Second, we propose MC-Agent, a cross-Modal Collaborative Agent system that rewrites textual prompts to express the intended emotions. To avoid template-like rewrites, MC-Agent employs multi-agents to simulate human subjectivity toward emotions, and adopts a chain-of-concept workflow that improves the visual expressiveness of the rewritten prompts. Extensive qualitative and quantitative experiments demonstrate the superiority of Emotion-Director in emotion-oriented image generation.
29. 【2512.19451】Sign Language Recognition using Parallel Bidirectional Reservoir Computing
链接:https://arxiv.org/abs/2512.19451
作者:Nitin Kumar Singh,Arie Rachmad Syulistyo,Yuichiro Tanaka,Hakaru Tamukoh
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:Sign language recognition, facilitates communication, hearing communities, communication between deaf, deaf and hearing
备注:
点击查看摘要
Abstract:Sign language recognition (SLR) facilitates communication between deaf and hearing communities. Deep learning based SLR models are commonly used but require extensive computational resources, making them unsuitable for deployment on edge devices. To address these limitations, we propose a lightweight SLR system that combines parallel bidirectional reservoir computing (PBRC) with MediaPipe. MediaPipe enables real-time hand tracking and precise extraction of hand joint coordinates, which serve as input features for the PBRC architecture. The proposed PBRC architecture consists of two echo state network (ESN) based bidirectional reservoir computing (BRC) modules arranged in parallel to capture temporal dependencies, thereby creating a rich feature representation for classification. We trained our PBRC-based SLR system on the Word-Level American Sign Language (WLASL) video dataset, achieving top-1, top-5, and top-10 accuracies of 60.85%, 85.86%, and 91.74%, respectively. Training time was significantly reduced to 18.67 seconds due to the intrinsic properties of reservoir computing, compared to over 55 minutes for deep learning based methods such as Bi-GRU. This approach offers a lightweight, cost-effective solution for real-time SLR on edge devices.
30. 【2512.19443】D2Pruner: Debiased Importance and Structural Diversity for MLLM Token Pruning
链接:https://arxiv.org/abs/2512.19443
作者:Evelyn Zhang,Fufu Yu,Aoqi Wu,Zichen Wen,Ke Yan,Shouhong Ding,Biqing Qi,Linfeng Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal Large Language, Large Language Models, Processing long visual, Multimodal Large, Large Language
备注:
点击查看摘要
Abstract:Processing long visual token sequences poses a significant computational burden on Multimodal Large Language Models (MLLMs). While token pruning offers a path to acceleration, we find that current methods, while adequate for general understanding, catastrophically fail on fine-grained localization tasks. We attribute this failure to the inherent flaws of the two prevailing strategies: importance-based methods suffer from a strong positional bias, an inherent model artifact that distracts from semantic content, while diversity-based methods exhibit structural blindness, disregarding the user's prompt and spatial redundancy. To address this, we introduce D2Pruner, a framework that rectifies these issues by uniquely combining debiased importance with a structural pruning mechanism. Our method first secures a core set of the most critical tokens as pivots based on a debiased attention score. It then performs a Maximal Independent Set (MIS) selection on the remaining tokens, which are modeled on a hybrid graph where edges signify spatial proximity and semantic similarity. This process iteratively preserves the most important and available token while removing its neighbors, ensuring that the supplementary tokens are chosen to maximize importance and diversity. Extensive experiments demonstrate that D2Pruner has exceptional efficiency and fidelity. Applied to LLaVA-1.5-7B for general understanding tasks, it reduces FLOPs by 74.2\% while retaining 99.2\% of its original performance. Furthermore, in challenging localization benchmarks with InternVL-2.5-8B, it maintains 85.7\% performance at a 90\% token reduction rate, marking a significant advancement with up to 63. 53\% improvement over existing methods.
31. 【2512.19438】MT-Mark: Rethinking Image Watermarking via Mutual-Teacher Collaboration with Adaptive Feature Modulation
链接:https://arxiv.org/abs/2512.19438
作者:Fei Ge,Ying Huang,Jie Liu,Guixuan Zhang,Zhi Zeng,Shuwu Zhang,Hu Guan
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Existing deep image, Existing deep, follow a fixed, optimized in isolation, deep image watermarking
备注:
点击查看摘要
Abstract:Existing deep image watermarking methods follow a fixed embedding-distortion-extraction pipeline, where the embedder and extractor are weakly coupled through a final loss and optimized in isolation. This design lacks explicit collaboration, leaving no structured mechanism for the embedder to incorporate decoding-aware cues or for the extractor to guide embedding during training. To address this architectural limitation, we rethink deep image watermarking by reformulating embedding and extraction as explicitly collaborative components. To realize this reformulation, we introduce a Collaborative Interaction Mechanism (CIM) that establishes direct, bidirectional communication between the embedder and extractor, enabling a mutual-teacher training paradigm and coordinated optimization. Built upon this explicitly collaborative architecture, we further propose an Adaptive Feature Modulation Module (AFMM) to support effective interaction. AFMM enables content-aware feature regulation by decoupling modulation structure and strength, guiding watermark embedding toward stable image features while suppressing host interference during extraction. Under CIM, the AFMMs on both sides form a closed-loop collaboration that aligns embedding behavior with extraction objectives. This architecture-level redesign changes how robustness is learned in watermarking systems. Rather than relying on exhaustive distortion simulation, robustness emerges from coordinated representation learning between embedding and extraction. Experiments on real-world and AI-generated datasets demonstrate that the proposed method consistently outperforms state-of-the-art approaches in watermark extraction accuracy while maintaining high perceptual quality, showing strong robustness and generalization.
32. 【2512.19433】dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models
链接:https://arxiv.org/abs/2512.19433
作者:Yi Xin,Siqi Luo,Qi Qin,Haoxing Chen,Kaiwen Zhu,Zhiwei Zhang,Yangfan He,Rongchao Zhang,Jinbin Bai,Shuo Cao,Bin Fu,Junjun He,Yihao Liu,Yuewen Cao,Xiaohong Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Diffusion Multi-modal Large, Large Language Models, Multi-modal Large Language, Diffusion Multi-modal, Language Models
备注: Project page: [this https URL](https://github.com/Alpha-VLLM/Lumina-DiMOO)
点击查看摘要
Abstract:Diffusion Multi-modal Large Language Models (dMLLMs) have recently emerged as a novel architecture unifying image generation and understanding. However, developing effective and efficient Test-Time Scaling (TTS) methods to unlock their full generative potential remains an underexplored challenge. To address this, we propose dMLLM-TTS, a novel framework operating on two complementary scaling axes: (1) trajectory exploration scaling to enhance the diversity of generated hypotheses, and (2) iterative refinement scaling for stable generation. Conventional TTS approaches typically perform linear search across these two dimensions, incurring substantial computational costs of O(NT) and requiring an external verifier for best-of-N selection. To overcome these limitations, we propose two innovations. First, we design an efficient hierarchical search algorithm with O(N+T) complexity that adaptively expands and prunes sampling trajectories. Second, we introduce a self-verified feedback mechanism that leverages the dMLLMs' intrinsic image understanding capabilities to assess text-image alignment, eliminating the need for external verifier. Extensive experiments on the GenEval benchmark across three representative dMLLMs (e.g., Lumina-DiMOO, MMaDA, Muddit) show that our framework substantially improves generation quality while achieving up to 6x greater efficiency than linear search. Project page: this https URL.
33. 【2512.19415】Non-Contrast CT Esophageal Varices Grading through Clinical Prior-Enhanced Multi-Organ Analysis
链接:https://arxiv.org/abs/2512.19415
作者:Xiaoming Zhang,Chunli Li,Jiacheng Hao,Yuan Gao,Danyang Tu,Jianyi Qiao,Xiaoli Yin,Le Lu,Ling Zhang,Ke Yan,Yang Hou,Yu Shi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:significant bleeding risk, Esophageal varices, affecting approximately, portal hypertension, critical complication
备注: Medical Image Analysis
点击查看摘要
Abstract:Esophageal varices (EV) represent a critical complication of portal hypertension, affecting approximately 60% of cirrhosis patients with a significant bleeding risk of ~30%. While traditionally diagnosed through invasive endoscopy, non-contrast computed tomography (NCCT) presents a potential non-invasive alternative that has yet to be fully utilized in clinical practice. We present Multi-Organ-COhesion Network++ (MOON++), a novel multimodal framework that enhances EV assessment through comprehensive analysis of NCCT scans. Inspired by clinical evidence correlating organ volumetric relationships with liver disease severity, MOON++ synthesizes imaging characteristics of the esophagus, liver, and spleen through multimodal learning. We evaluated our approach using 1,631 patients, those with endoscopically confirmed EV were classified into four severity grades. Validation in 239 patient cases and independent testing in 289 cases demonstrate superior performance compared to conventional single organ methods, achieving an AUC of 0.894 versus 0.803 for the severe grade EV classification (G3 versus G3) and 0.921 versus 0.793 for the differentiation of moderate to severe grades (=G2 versus G2). We conducted a reader study involving experienced radiologists to further validate the performance of MOON++. To our knowledge, MOON++ represents the first comprehensive multi-organ NCCT analysis framework incorporating clinical knowledge priors for EV assessment, potentially offering a promising non-invasive diagnostic alternative.
34. 【2512.19402】Real2Edit2Real: Generating Robotic Demonstrations via a 3D Control Interface
链接:https://arxiv.org/abs/2512.19402
作者:Yujie Zhao,Hongwei Fan,Di Chen,Shengcong Chen,Liliang Chen,Xiaoqi Li,Guanghui Ren,Hao Dong
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:visuomotor policy architectures, policy robustness remains, powerful visuomotor policy, robustness remains limited, Recent progress
备注:
点击查看摘要
Abstract:Recent progress in robot learning has been driven by large-scale datasets and powerful visuomotor policy architectures, yet policy robustness remains limited by the substantial cost of collecting diverse demonstrations, particularly for spatial generalization in manipulation tasks. To reduce repetitive data collection, we present Real2Edit2Real, a framework that generates new demonstrations by bridging 3D editability with 2D visual data through a 3D control interface. Our approach first reconstructs scene geometry from multi-view RGB observations with a metric-scale 3D reconstruction model. Based on the reconstructed geometry, we perform depth-reliable 3D editing on point clouds to generate new manipulation trajectories while geometrically correcting the robot poses to recover physically consistent depth, which serves as a reliable condition for synthesizing new demonstrations. Finally, we propose a multi-conditional video generation model guided by depth as the primary control signal, together with action, edge, and ray maps, to synthesize spatially augmented multi-view manipulation videos. Experiments on four real-world manipulation tasks demonstrate that policies trained on data generated from only 1-5 source demonstrations can match or outperform those trained on 50 real-world demonstrations, improving data efficiency by up to 10-50x. Moreover, experimental results on height and texture editing demonstrate the framework's flexibility and extensibility, indicating its potential to serve as a unified data generation framework.
35. 【2512.19390】winAligner: Visual-Dynamic Alignment Empowers Physics-aware Real2Sim2Real for Robotic Manipulation
链接:https://arxiv.org/abs/2512.19390
作者:Hongwei Fan,Hang Dai,Jiyao Zhang,Jinzhou Li,Qiyang Yan,Yujie Zhao,Mingju Gao,Jinghang Wu,Hao Tang,Hao Dong
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:multimodal large models, evolving towards data-driven, inspired by multimodal, large models, robotics field
备注:
点击查看摘要
Abstract:The robotics field is evolving towards data-driven, end-to-end learning, inspired by multimodal large models. However, reliance on expensive real-world data limits progress. Simulators offer cost-effective alternatives, but the gap between simulation and reality challenges effective policy transfer. This paper introduces TwinAligner, a novel Real2Sim2Real system that addresses both visual and dynamic gaps. The visual alignment module achieves pixel-level alignment through SDF reconstruction and editable 3DGS rendering, while the dynamic alignment module ensures dynamic consistency by identifying rigid physics from robot-object interaction. TwinAligner improves robot learning by providing scalable data collection and establishing a trustworthy iterative cycle, accelerating algorithm development. Quantitative evaluations highlight TwinAligner's strong capabilities in visual and dynamic real-to-sim alignment. This system enables policies trained in simulation to achieve strong zero-shot generalization to the real world. The high consistency between real-world and simulated policy performance underscores TwinAligner's potential to advance scalable robot learning. Code and data will be released on this https URL
36. 【2512.19387】DSTED: Decoupling Temporal Stabilization and Discriminative Enhancement for Surgical Workflow Recognition
链接:https://arxiv.org/abs/2512.19387
作者:Yueyao Chen,Kai-Ni Wang,Dario Tayupo,Arnaud Huaulm'e,Krystel Nyangoh Timoh,Pierre Jannin,Qi Dou
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:enables context-aware assistance, recognition enables context-aware, Surgical workflow recognition, computer-assisted interventions, enables context-aware
备注: Early accepted to IPCAI 2026
点击查看摘要
Abstract:Purpose: Surgical workflow recognition enables context-aware assistance and skill assessment in computer-assisted interventions. Despite recent advances, current methods suffer from two critical challenges: prediction jitter across consecutive frames and poor discrimination of ambiguous phases. This paper aims to develop a stable framework by selectively propagating reliable historical information and explicitly modeling uncertainty for hard sample enhancement. Methods: We propose a dual-pathway framework DSTED with Reliable Memory Propagation (RMP) and Uncertainty-Aware Prototype Retrieval (UPR). RMP maintains temporal coherence by filtering and fusing high-confidence historical features through multi-criteria reliability assessment. UPR constructs learnable class-specific prototypes from high-uncertainty samples and performs adaptive prototype matching to refine ambiguous frame representations. Finally, a confidence-driven gate dynamically balances both pathways based on prediction certainty. Results: Our method achieves state-of-the-art performance on AutoLaparo-hysterectomy with 84.36% accuracy and 65.51% F1-score, surpassing the second-best method by 3.51% and 4.88% respectively. Ablations reveal complementary gains from RMP (2.19%) and UPR (1.93%), with synergistic effects when combined. Extensive analysis confirms substantial reduction in temporal jitter and marked improvement on challenging phase transitions. Conclusion: Our dual-pathway design introduces a novel paradigm for stable workflow recognition, demonstrating that decoupling the modeling of temporal consistency and phase ambiguity yields superior performance and clinical applicability.
Comments:
Early accepted to IPCAI 2026
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2512.19387 [cs.CV]
(or
arXiv:2512.19387v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2512.19387
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Yueyao Chen [view email] [v1]
Mon, 22 Dec 2025 13:36:26 UTC (3,229 KB)
37. 【2512.19365】Efficient Spike-driven Transformer for High-performance Drone-View Geo-Localization
链接:https://arxiv.org/abs/2512.19365
作者:Zhongwei Chen,Hai-Jun Rong,Zhao-Xu Yang,Guoqi Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Traditional drone-view geo-localization, achieved remarkable performance, Traditional drone-view, artificial neural networks, drone-view geo-localization
备注:
点击查看摘要
Abstract:Traditional drone-view geo-localization (DVGL) methods based on artificial neural networks (ANNs) have achieved remarkable performance. However, ANNs rely on dense computation, which results in high power consumption. In contrast, spiking neural networks (SNNs), which benefit from spike-driven computation, inherently provide low power consumption. Regrettably, the potential of SNNs for DVGL has yet to be thoroughly investigated. Meanwhile, the inherent sparsity of spike-driven computation for representation learning scenarios also results in loss of critical information and difficulties in learning long-range dependencies when aligning heterogeneous visual data sources. To address these, we propose SpikeViMFormer, the first SNN framework designed for DVGL. In this framework, a lightweight spike-driven transformer backbone is adopted to extract coarse-grained features. To mitigate the loss of critical information, the spike-driven selective attention (SSA) block is designed, which uses a spike-driven gating mechanism to achieve selective feature enhancement and highlight discriminative regions. Furthermore, a spike-driven hybrid state space (SHS) block is introduced to learn long-range dependencies using a hybrid state space. Moreover, only the backbone is utilized during the inference stage to reduce computational cost. To ensure backbone effectiveness, a novel hierarchical re-ranking alignment learning (HRAL) strategy is proposed. It refines features via neighborhood re-ranking and maintains cross-batch consistency to directly optimize the backbone. Experimental results demonstrate that SpikeViMFormer outperforms state-of-the-art SNNs. Compared with advanced ANNs, it also achieves competitive this http URL code is available at this https URL
38. 【2512.19354】ReasonCD: A Multimodal Reasoning Large Model for Implicit Change-of-Interest Semantic Mining
链接:https://arxiv.org/abs/2512.19354
作者:Zhenyang Huang,Xiao Yu,Yi Zhang,Decheng Wang,Hang Ruan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:sensing intelligent interpretation, Remote sensing image, remote sensing intelligent, Remote sensing, change detection
备注:
点击查看摘要
Abstract:Remote sensing image change detection is one of the fundamental tasks in remote sensing intelligent interpretation. Its core objective is to identify changes within change regions of interest (CRoI). Current multimodal large models encode rich human semantic knowledge, which is utilized for guidance in tasks such as remote sensing change detection. However, existing methods that use semantic guidance for detecting users' CRoI overly rely on explicit textual descriptions of CRoI, leading to the problem of near-complete performance failure when presented with implicit CRoI textual descriptions. This paper proposes a multimodal reasoning change detection model named ReasonCD, capable of mining users' implicit task intent. The model leverages the powerful reasoning capabilities of pre-trained large language models to mine users' implicit task intents and subsequently obtains different change detection results based on these intents. Experiments on public datasets demonstrate that the model achieves excellent change detection performance, with an F1 score of 92.1\% on the BCDD dataset. Furthermore, to validate its superior reasoning functionality, this paper annotates a subset of reasoning data based on the SECOND dataset. Experimental results show that the model not only excels at basic reasoning-based change detection tasks but can also explain the reasoning process to aid human decision-making.
39. 【2512.19336】GANeXt: A Fully ConvNeXt-Enhanced Generative Adversarial Network for MRI- and CBCT-to-CT Synthesis
链接:https://arxiv.org/abs/2512.19336
作者:Siyuan Mei,Yan Xia,Fuxin Fan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:magnetic resonance imaging, clinical treatment planning, accurate anatomical representation, enabling accurate anatomical, computed tomography
备注:
点击查看摘要
Abstract:The synthesis of computed tomography (CT) from magnetic resonance imaging (MRI) and cone-beam CT (CBCT) plays a critical role in clinical treatment planning by enabling accurate anatomical representation in adaptive radiotherapy. In this work, we propose GANeXt, a 3D patch-based, fully ConvNeXt-powered generative adversarial network for unified CT synthesis across different modalities and anatomical regions. Specifically, GANeXt employs an efficient U-shaped generator constructed from stacked 3D ConvNeXt blocks with compact convolution kernels, while the discriminator adopts a conditional PatchGAN. To improve synthesis quality, we incorporate a combination of loss functions, including mean absolute error (MAE), perceptual loss, segmentation-based masked MAE, and adversarial loss and a combination of Dice loss and cross-entropy for multi-head segmentation discriminator. For both tasks, training is performed with a batch size of 8 using two separate AdamW optimizers for the generator and discriminator, each equipped with a warmup and cosine decay scheduler, with learning rates of $5\times10^{-4}$ and $1\times10^{-3}$, respectively. Data preprocessing includes deformable registration, foreground cropping, percentile normalization for the input modality, and linear normalization of the CT to the range $[-1024, 1000]$. Data augmentation involves random zooming within $(0.8, 1.3)$ (for MRI-to-CT only), fixed-size cropping to $32\times160\times192$ for MRI-to-CT and $32\times128\times128$ for CBCT-to-CT, and random flipping. During inference, we apply a sliding-window approach with $0.8$ overlap and average folding to reconstruct the full-size sCT, followed by inversion of the CT normalization. After joint training on all regions without any fine-tuning, the final models are selected at the end of 3000 epochs for MRI-to-CT and 1000 epochs for CBCT-to-CT using the full training dataset.
40. 【2512.19331】DeltaMIL: Gated Memory Integration for Efficient and Discriminative Whole Slide Image Analysis
链接:https://arxiv.org/abs/2512.19331
作者:Yueting Zhu,Yuehao Song,Shuai Zhang,Wenyu Liu,Xinggang Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Slide Images, multiple instance learning, instance learning, typically analyzed, Images
备注: 11 pages,7 figures,8 tables
点击查看摘要
Abstract:Whole Slide Images (WSIs) are typically analyzed using multiple instance learning (MIL) methods. However, the scale and heterogeneity of WSIs generate highly redundant and dispersed information, making it difficult to identify and integrate discriminative signals. Existing MIL methods either fail to discard uninformative cues effectively or have limited ability to consolidate relevant features from multiple patches, which restricts their performance on large and heterogeneous WSIs. To address this issue, we propose DeltaMIL, a novel MIL framework that explicitly selects semantically relevant regions and integrates the discriminative information from WSIs. Our method leverages the gated delta rule to efficiently filter and integrate information through a block combining forgetting and memory mechanisms. The delta mechanism dynamically updates the memory by removing old values and inserting new ones according to their correlation with the current patch. The gating mechanism further enables rapid forgetting of irrelevant signals. Additionally, DeltaMIL integrates a complementary local pattern mixing mechanism to retain fine-grained pathological locality. Our design enhances the extraction of meaningful cues and suppresses redundant or noisy information, which improves the model's robustness and discriminative power. Experiments demonstrate that DeltaMIL achieves state-of-the-art performance. Specifically, for survival prediction, DeltaMIL improves performance by 3.69\% using ResNet-50 features and 2.36\% using UNI features. For slide-level classification, it increases accuracy by 3.09\% with ResNet-50 features and 3.75\% with UNI features. These results demonstrate the strong and consistent performance of DeltaMIL across diverse WSI tasks.
41. 【2512.19327】Extended OpenTT Games Dataset: A table tennis dataset for fine-grained shot type and point outcome
链接:https://arxiv.org/abs/2512.19327
作者:Moamal Fadhil Abdul(1),Jonas Bruun Hubrechts(1),Thomas Martini Jørgensen(1),Emil Hovad(1) ((1) Department of Applied Mathematics and Computer Science, Technical University of Denmark, Richard Petersens Plads, Building 324, 2800 Kgs. Lyngby, Denmark)
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:enrich broadcast overlays, streamline training workflows, Automatically detecting, fine-grained performance analytics, enable fine-grained performance
备注: Thomas Martini Jørgensen and Emil Hovad contributed equally and share last authorship
点击查看摘要
Abstract:Automatically detecting and classifying strokes in table tennis video can streamline training workflows, enrich broadcast overlays, and enable fine-grained performance analytics. For this to be possible, annotated video data of table tennis is needed. We extend the public OpenTTGames dataset with highly detailed, frame-accurate shot type annotations (forehand, backhand with subtypes), player posture labels (body lean and leg stance), and rally outcome tags at point end. OpenTTGames is a set of recordings from the side of the table with official labels for bounces, when the ball is above the net, or hitting the net. The dataset already contains ball coordinates near events, which are either "bounce", "net", or "empty_event" in the original OpenTTGames dataset, and semantic masks (humans, table, scoreboard). Our extension adds the types of stroke to the events and a per-player taxonomy so models can move beyond event spotting toward tactical understanding (e.g., whether a stroke is likely to win the point or set up an advantage). We provide a compact coding scheme and code-assisted labeling procedure to support reproducible annotations and baselines for fine-grained stroke understanding in racket sports. This fills a practical gap in the community, where many prior video resources are either not publicly released or carry restrictive/unclear licenses that hinder reuse and benchmarking. Our annotations are released under the same CC BY-NC-SA 4.0 license as OpenTTGames, allowing free non-commercial use, modification, and redistribution, with appropriate attribution.
42. 【2512.19320】MAGIC: Achieving Superior Model Merging via Magnitude Calibration
链接:https://arxiv.org/abs/2512.19320
作者:Yayuan Li,Jian Zhang,Jintao Guo,Zihan Cheng,Lei Qi,Yinghuan Shi,Yang Gao
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:Space Calibration, specialised models, proliferation of pre-trained, wide array, Model
备注:
点击查看摘要
Abstract:The proliferation of pre-trained models has given rise to a wide array of specialised, fine-tuned models. Model merging aims to merge the distinct capabilities of these specialised models into a unified model, requiring minimal or even no additional training. A core objective of model merging is to ensure the merged model retains the behavioural characteristics of the specialised models, typically achieved through feature alignment. We identify that features consist of two critical components: direction and magnitude. Prior research has predominantly focused on directional alignment, while the influence of magnitude remains largely neglected, despite its pronounced vulnerability to perturbations introduced by common merging operations (e.g., parameter fusion and sparsification). Such perturbations to magnitude inevitably lead to feature deviations in the merged model from the specialised models, resulting in subsequent performance degradation. To address this, we propose MAGnItude Calibration (MAGIC), a plug-and-play framework that rectifies layer-wise magnitudes in feature and weight spaces, with three variants. Specifically, our Feature Space Calibration (FSC) realigns the merged model's features using a small set of unlabelled data, while Weight Space Calibration (WSC) extends this calibration to the weight space without requiring additional data. Combining these yields Dual Space Calibration (DSC). Comprehensive experiments demonstrate that MAGIC consistently boosts performance across diverse Computer Vision tasks (+4.3% on eight datasets) and NLP tasks (+8.0% on Llama) without additional training. Our code is available at: this https URL
43. 【2512.19316】Neural Implicit Heart Coordinates: 3D cardiac shape reconstruction from sparse segmentations
链接:https://arxiv.org/abs/2512.19316
作者:Marica Muffoletto,Uxio Hermida,Charlène Mauger,Avan Suinesiaputra,Yiyang Xu,Richard Burns,Lisa Pankewitz,Andrew D McCulloch,Steffen E Petersen,Daniel Rueckert,Alistair A Young
类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
关键词:clinical images remains, Accurate reconstruction, Neural Implicit Heart, images remains, remains a major
备注: 42 pages, 8 figures
点击查看摘要
Abstract:Accurate reconstruction of cardiac anatomy from sparse clinical images remains a major challenge in patient-specific modeling. While neural implicit functions have previously been applied to this task, their application to mapping anatomical consistency across subjects has been limited. In this work, we introduce Neural Implicit Heart Coordinates (NIHCs), a standardized implicit coordinate system, based on universal ventricular coordinates, that provides a common anatomical reference frame for the human heart. Our method predicts NIHCs directly from a limited number of 2D segmentations (sparse acquisition) and subsequently decodes them into dense 3D segmentations and high-resolution meshes at arbitrary output resolution. Trained on a large dataset of 5,000 cardiac meshes, the model achieves high reconstruction accuracy on clinical contours, with mean Euclidean surface errors of 2.51$\pm$0.33 mm in a diseased cohort (n=4549) and 2.3$\pm$0.36 mm in a healthy cohort (n=5576). The NIHC representation enables anatomically coherent reconstruction even under severe slice sparsity and segmentation noise, faithfully recovering complex structures such as the valve planes. Compared with traditional pipelines, inference time is reduced from over 60 s to 5-15 s. These results demonstrate that NIHCs constitute a robust and efficient anatomical representation for patient-specific 3D cardiac reconstruction from minimal input data.
44. 【2512.19311】MixFlow Training: Alleviating Exposure Bias with Slowed Interpolation Mixture
链接:https://arxiv.org/abs/2512.19311
作者:Hui Li,Jiayue Lyu,Fu-Yun Wang,Kaihui Cheng,Siyu Zhu,Jingdong Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:generated noisy data, exposure bias, training-testing discrepancy, timestep, paper studies
备注:
点击查看摘要
Abstract:This paper studies the training-testing discrepancy (a.k.a. exposure bias) problem for improving the diffusion models. During training, the input of a prediction network at one training timestep is the corresponding ground-truth noisy data that is an interpolation of the noise and the data, and during testing, the input is the generated noisy data. We present a novel training approach, named MixFlow, for improving the performance. Our approach is motivated by the Slow Flow phenomenon: the ground-truth interpolation that is the nearest to the generated noisy data at a given sampling timestep is observed to correspond to a higher-noise timestep (termed slowed timestep), i.e., the corresponding ground-truth timestep is slower than the sampling timestep. MixFlow leverages the interpolations at the slowed timesteps, named slowed interpolation mixture, for post-training the prediction network for each training timestep. Experiments over class-conditional image generation (including SiT, REPA, and RAE) and text-to-image generation validate the effectiveness of our approach. Our approach MixFlow over the RAE models achieve strong generation results on ImageNet: 1.43 FID (without guidance) and 1.10 (with guidance) at 256 x 256, and 1.55 FID (without guidance) and 1.10 (with guidance) at 512 x 512.
45. 【2512.19302】Bridging Semantics and Geometry: A Decoupled LVLM-SAM Framework for Reasoning Segmentation in Remote Sensing
链接:https://arxiv.org/abs/2512.19302
作者:Xu Zhang,Junyao Ge,Yang Zheng,Kaitai Guo,Jimin Liang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Vision-Language Models, hold great promise, advancing remote sensing, Large Vision-Language, frameworks couple linguistic
备注:
点击查看摘要
Abstract:Large Vision-Language Models (LVLMs) hold great promise for advancing remote sensing (RS) analysis, yet existing reasoning segmentation frameworks couple linguistic reasoning and pixel prediction through end-to-end supervised fine-tuning, leading to weak geometric grounding and limited generalization across tasks. To address this, we developed Think2Seg-RS, a decoupled framework that trains an LVLM prompter to control a frozen Segment Anything Model (SAM) via structured geometric prompts. Through a mask-only reinforcement learning objective, the LVLM learns to translate abstract semantic reasoning into spatially grounded actions, achieving state-of-the-art performance on the EarthReason dataset. Remarkably, the learned prompting policy generalizes zero-shot to multiple referring segmentation benchmarks, exposing a distinct divide between semantic-level and instance-level grounding. We further found that compact segmenters outperform larger ones under semantic-level supervision, and that negative prompts are ineffective in heterogeneous aerial backgrounds. Together, these findings establish semantic-level reasoning segmentation as a new paradigm for geospatial understanding, opening the way toward unified, interpretable LVLM-driven Earth observation. Our code and model are available at this https URL.
46. 【2512.19300】RMLer: Synthesizing Novel Objects across Diverse Categories via Reinforcement Mixing Learning
链接:https://arxiv.org/abs/2512.19300
作者:Jun Li,Zikun Chen,Haibo Chen,Shuo Chen,Jian Yang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:integrating distinct textual, distinct textual concepts, synthesis by integrating, integrating distinct, distinct textual
备注: accepted by AAAI2026
点击查看摘要
Abstract:Novel object synthesis by integrating distinct textual concepts from diverse categories remains a significant challenge in Text-to-Image (T2I) generation. Existing methods often suffer from insufficient concept mixing, lack of rigorous evaluation, and suboptimal outputs-manifesting as conceptual imbalance, superficial combinations, or mere juxtapositions. To address these limitations, we propose Reinforcement Mixing Learning (RMLer), a framework that formulates cross-category concept fusion as a reinforcement learning problem: mixed features serve as states, mixing strategies as actions, and visual outcomes as rewards. Specifically, we design an MLP-policy network to predict dynamic coefficients for blending cross-category text embeddings. We further introduce visual rewards based on (1) semantic similarity and (2) compositional balance between the fused object and its constituent concepts, optimizing the policy via proximal policy optimization. At inference, a selection strategy leverages these rewards to curate the highest-quality fused objects. Extensive experiments demonstrate RMLer's superiority in synthesizing coherent, high-fidelity objects from diverse categories, outperforming existing methods. Our work provides a robust framework for generating novel visual concepts, with promising applications in film, gaming, and design.
47. 【2512.19283】Hand-Aware Egocentric Motion Reconstruction with Sequence-Level Context
链接:https://arxiv.org/abs/2512.19283
作者:Kyungwon Cho,Hanbyul Joo
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Egocentric vision systems, creating new opportunities, human-computer interaction, vision systems, opportunities for human-computer
备注: Project Page: [this https URL](https://kyungwoncho.github.io/HaMoS/)
点击查看摘要
Abstract:Egocentric vision systems are becoming widely available, creating new opportunities for human-computer interaction. A core challenge is estimating the wearer's full-body motion from first-person videos, which is crucial for understanding human behavior. However, this task is difficult since most body parts are invisible from the egocentric view. Prior approaches mainly rely on head trajectories, leading to ambiguity, or assume continuously tracked hands, which is unrealistic for lightweight egocentric devices. In this work, we present HaMoS, the first hand-aware, sequence-level diffusion framework that directly conditions on both head trajectory and intermittently visible hand cues caused by field-of-view limitations and occlusions, as in real-world egocentric devices. To overcome the lack of datasets pairing diverse camera views with human motion, we introduce a novel augmentation method that models such real-world conditions. We also demonstrate that sequence-level contexts such as body shape and field-of-view are crucial for accurate motion reconstruction, and thus employ local attention to infer long sequences efficiently. Experiments on public benchmarks show that our method achieves state-of-the-art accuracy and temporal smoothness, demonstrating a practical step toward reliable in-the-wild egocentric 3D motion understanding.
48. 【2512.19275】Is Visual Realism Enough? Evaluating Gait Biometric Fidelity in Generative AI Human Animation
链接:https://arxiv.org/abs/2512.19275
作者:Ivan DeAndres-Tame,Chengwei Ye,Ruben Tolosana,Ruben Vera-Rodriguez,Shiqi Yu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:enabling the synthesis, Generative, GenAI, human animation, models
备注:
点击查看摘要
Abstract:Generative AI (GenAI) models have revolutionized animation, enabling the synthesis of humans and motion patterns with remarkable visual fidelity. However, generating truly realistic human animation remains a formidable challenge, where even minor inconsistencies can make a subject appear unnatural. This limitation is particularly critical when AI-generated videos are evaluated for behavioral biometrics, where subtle motion cues that define identity are easily lost or distorted. The present study investigates whether state-of-the-art GenAI human animation models can preserve the subtle spatio-temporal details needed for person identification through gait biometrics. Specifically, we evaluate four different GenAI models across two primary evaluation tasks to assess their ability to i) restore gait patterns from reference videos under varying conditions of complexity, and ii) transfer these gait patterns to different visual identities. Our results show that while visual quality is mostly high, biometric fidelity remains low in tasks focusing on identification, suggesting that current GenAI models struggle to disentangle identity from motion. Furthermore, through an identity transfer task, we expose a fundamental flaw in appearance-based gait recognition: when texture is disentangled from motion, identification collapses, proving current GenAI models rely on visual attributes rather than temporal dynamics.
49. 【2512.19271】3SGen: Unified Subject, Style, and Structure-Driven Image Generation with Adaptive Task-specific Memory
链接:https://arxiv.org/abs/2512.19271
作者:Xinyang Song,Libin Wang,Weining Wang,Zhiwei Li,Jianxin Sun,Dandan Zheng,Jingdong Chen,Qi Li,Zhenan Sun
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent image generation, Recent image, limited task transferability, image generation approaches, leading to feature
备注:
点击查看摘要
Abstract:Recent image generation approaches often address subject, style, and structure-driven conditioning in isolation, leading to feature entanglement and limited task transferability. In this paper, we introduce 3SGen, a task-aware unified framework that performs all three conditioning modes within a single model. 3SGen employs an MLLM equipped with learnable semantic queries to align text-image semantics, complemented by a VAE branch that preserves fine-grained visual details. At its core, an Adaptive Task-specific Memory (ATM) module dynamically disentangles, stores, and retrieves condition-specific priors, such as identity for subjects, textures for styles, and spatial layouts for structures, via a lightweight gating mechanism along with several scalable memory items. This design mitigates inter-task interference and naturally scales to compositional inputs. In addition, we propose 3SGen-Bench, a unified image-driven generation benchmark with standardized metrics for evaluating cross-task fidelity and controllability. Extensive experiments on our proposed 3SGen-Bench and other public benchmarks demonstrate our superior performance across diverse image-driven generation tasks.
50. 【2512.19253】Machine Unlearning in the Era of Quantum Machine Learning: An Empirical Study
链接:https://arxiv.org/abs/2512.19253
作者:Carla Crivoi,Radu Tudor Ionescu
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:quantum-classical neural networks, hybrid quantum-classical neural, comprehensive empirical study, neural networks, quantum-classical neural
备注:
点击查看摘要
Abstract:We present the first comprehensive empirical study of machine unlearning (MU) in hybrid quantum-classical neural networks. While MU has been extensively explored in classical deep learning, its behavior within variational quantum circuits (VQCs) and quantum-augmented architectures remains largely unexplored. First, we adapt a broad suite of unlearning methods to quantum settings, including gradient-based, distillation-based, regularization-based and certified techniques. Second, we introduce two new unlearning strategies tailored to hybrid models. Experiments across Iris, MNIST, and Fashion-MNIST, under both subset removal and full-class deletion, reveal that quantum models can support effective unlearning, but outcomes depend strongly on circuit depth, entanglement structure, and task complexity. Shallow VQCs display high intrinsic stability with minimal memorization, whereas deeper hybrid models exhibit stronger trade-offs between utility, forgetting strength, and alignment with retrain oracle. We find that certain methods, e.g. EU-k, LCA, and Certified Unlearning, consistently provide the best balance across metrics. These findings establish baseline empirical insights into quantum machine unlearning and highlight the need for quantum-aware algorithms and theoretical guarantees, as quantum machine learning systems continue to expand in scale and capability. We publicly release our code at: this https URL.
51. 【2512.19243】VisionDirector: Vision-Language Guided Closed-Loop Refinement for Generative Image Synthesis
链接:https://arxiv.org/abs/2512.19243
作者:Meng Chu,Senqiao Yang,Haoxuan Che,Suiyun Zhang,Xichen Zhang,Shaozuo Yu,Haokun Gui,Zhefan Rao,Dandan Tu,Rui Liu,Jiaya Jia
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:produce photorealistic imagery, professional designers issue, Long Goal Bench, Generative models, photorealistic imagery
备注:
点击查看摘要
Abstract:Generative models can now produce photorealistic imagery, yet they still struggle with the long, multi-goal prompts that professional designers issue. To expose this gap and better evaluate models' performance in real-world settings, we introduce Long Goal Bench (LGBench), a 2,000-task suite (1,000 T2I and 1,000 I2I) whose average instruction contains 18 to 22 tightly coupled goals spanning global layout, local object placement, typography, and logo fidelity. We find that even state-of-the-art models satisfy fewer than 72 percent of the goals and routinely miss localized edits, confirming the brittleness of current pipelines. To address this, we present VisionDirector, a training-free vision-language supervisor that (i) extracts structured goals from long instructions, (ii) dynamically decides between one-shot generation and staged edits, (iii) runs micro-grid sampling with semantic verification and rollback after every edit, and (iv) logs goal-level rewards. We further fine-tune the planner with Group Relative Policy Optimization, yielding shorter edit trajectories (3.1 versus 4.2 steps) and stronger alignment. VisionDirector achieves new state of the art on GenEval (plus 7 percent overall) and ImgEdit (plus 0.07 absolute) while producing consistent qualitative improvements on typography, multi-object scenes, and pose editing.
52. 【2512.19221】From Pixels to Predicates Structuring urban perception with scene graphs
链接:https://arxiv.org/abs/2512.19221
作者:Yunlong Liu,Shuyang Li,Pengyuan Liu,Yu Zhang,Rudi Stouffs
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:object co-occurrence statistics, shape human perception, modelled using streetscapes, co-occurrence statistics, overlooking the explicit
备注: 10 pages, CAADRIA2026 presentation forthcoming
点击查看摘要
Abstract:Perception research is increasingly modelled using streetscapes, yet many approaches still rely on pixel features or object co-occurrence statistics, overlooking the explicit relations that shape human perception. This study proposes a three stage pipeline that transforms street view imagery (SVI) into structured representations for predicting six perceptual indicators. In the first stage, each image is parsed using an open-set Panoptic Scene Graph model (OpenPSG) to extract object predicate object triplets. In the second stage, compact scene-level embeddings are learned through a heterogeneous graph autoencoder (GraphMAE). In the third stage, a neural network predicts perception scores from these embeddings. We evaluate the proposed approach against image-only baselines in terms of accuracy, precision, and cross-city generalization. Results indicate that (i) our approach improves perception prediction accuracy by an average of 26% over baseline models, and (ii) maintains strong generalization performance in cross-city prediction tasks. Additionally, the structured representation clarifies which relational patterns contribute to lower perception scores in urban scenes, such as graffiti on wall and car parked on sidewalk. Overall, this study demonstrates that graph-based structure provides expressive, generalizable, and interpretable signals for modelling urban perception, advancing human-centric and context-aware urban analytics.
53. 【2512.19219】owards Minimal Fine-Tuning of VLMs
链接:https://arxiv.org/abs/2512.19219
作者:Tiange Luo,Lajanugen Logeswaran,Jaekyeom Kim,Justin Johnson,Honglak Lee
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:transformer-based vision-language models, lightweight parameter efficient, recipe for transformer-based, vision-language models, transformer-based vision-language
备注:
点击查看摘要
Abstract:We introduce Image-LoRA, a lightweight parameter efficient fine-tuning (PEFT) recipe for transformer-based vision-language models (VLMs). Image-LoRA applies low-rank adaptation only to the value path of attention layers within the visual-token span, reducing adapter-only training FLOPs roughly in proportion to the visual-token fraction. We further adapt only a subset of attention heads, selected using head influence scores estimated with a rank-1 Image-LoRA, and stabilize per-layer updates via selection-size normalization. Across screen-centric grounding and referring benchmarks spanning text-heavy to image-heavy regimes, Image-LoRA matches or closely approaches standard LoRA accuracy while using fewer trainable parameters and lower adapter-only training FLOPs. The method also preserves the pure-text reasoning performance of VLMs before and after fine-tuning, as further shown on GSM8K.
54. 【2512.19214】HippMetric: A skeletal-representation-based framework for cross-sectional and longitudinal hippocampal substructural morphometry
链接:https://arxiv.org/abs/2512.19214
作者:Na Gao,Chenfei Ye,Yanwu Yang,Anqi Li,Zhengbo He,Li Liang,Zhiyuan Liu,Xingyu Hao,Ting Ma,Tengfei Guo
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:early neurodegenerative biomarkers, detecting subtle structural, identifying early neurodegenerative, Accurate characterization, neurodegenerative biomarkers
备注: 35 pages, 8 figures
点击查看摘要
Abstract:Accurate characterization of hippocampal substructure is crucial for detecting subtle structural changes and identifying early neurodegenerative biomarkers. However, high inter-subject variability and complex folding pattern of human hippocampus hinder consistent cross-subject and longitudinal analysis. Most existing approaches rely on subject-specific modelling and lack a stable intrinsic coordinate system to accommodate anatomical variability, which limits their ability to establish reliable inter- and intra-individual correspondence. To address this, we propose HippMetric, a skeletal representation (s-rep)-based framework for hippocampal substructural morphometry and point-wise correspondence across individuals and scans. HippMetric builds on the Axis-Referenced Morphometric Model (ARMM) and employs a deformable skeletal coordinate system aligned with hippocampal anatomy and function, providing a biologically grounded reference for correspondence. Our framework comprises two core modules: a skeletal-based coordinate system that respects the hippocampus' conserved longitudinal lamellar architecture, in which functional units (lamellae) are stacked perpendicular to the long-axis, enabling anatomically consistent localization across subjects and time; and individualized s-reps generated through surface reconstruction, deformation, and geometrically constrained spoke refinement, enforcing boundary adherence, orthogonality and non-intersection to produce mathematically valid skeletal geometry. Extensive experiments on two international cohorts demonstrate that HippMetric achieves higher accuracy, reliability, and correspondence stability compared to existing shape models.
55. 【2512.19213】InvCoSS: Inversion-driven Continual Self-supervised Learning in Medical Multi-modal Image Pre-training
链接:https://arxiv.org/abs/2512.19213
作者:Zihao Luo,Shaohao Rui,Zhenyu Tang,Guotai Wang,Xiaosong Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:offering promising improvements, foundation model sequentially, medical imaging trains, Continual self-supervised learning, preserving data privacy
备注: 16 pages, 10 figures, 5 tables
点击查看摘要
Abstract:Continual self-supervised learning (CSSL) in medical imaging trains a foundation model sequentially, alleviating the need for collecting multi-modal images for joint training and offering promising improvements in downstream performance while preserving data privacy. However, most existing methods still rely on replaying data from previous stages to prevent catastrophic forgetting, which compromises privacy and limits their applicability in real-world scenarios where data transfer across sites is often restricted. In this work, we propose InvCoSS, an inversion-driven continual self-supervised learning framework for medical multi-modal image pre-training. Specifically, after training on a previous task, InvCoSS inverts the pre-trained self-supervised model to generate synthetic images that approximate the original training distribution. These synthetic images are then combined with data from the new task for joint optimization, which effectively mitigates catastrophic forgetting while strictly adhering to the constraint of no access to previous real data. Furthermore, to improve the fidelity of synthetic images, we introduce a novel InvUNet with a multi-scale fusion architecture to restore both high- and low-frequency components of the inverted images. To enhance diversity and prevent mode collapse, we design a repulsive representation-learning mechanism that encourages a diverse feature space for synthetic images without class guidance. Extensive experiments across nine downstream tasks validate the effectiveness of InvCoSS, achieving performance comparable to or even superior to prior data-replay methods while significantly reducing storage requirements and eliminating data privacy constraints.
56. 【2512.19190】PEDESTRIAN: An Egocentric Vision Dataset for Obstacle Detection on Pavements
链接:https://arxiv.org/abs/2512.19190
作者:Marios Thoma(1 and 2),Zenonas Theodosiou(1 and 3),Harris Partaourides(4),Vassilis Vassiliades(1),Loizos Michael(2 and 1),Andreas Lanitis(1 and 5) ((1) CYENS Centre of Excellence, Nicosia, Cyprus, (2) Open University Cyprus, Nicosia, Cyprus, (3) Department of Communication and Internet Studies, Cyprus University of Technology, Limassol, Cyprus, (4) AI Cyprus Ethical Novelties Ltd, Limassol, Cyprus, (5) Department of Multimedia and Graphic Arts, Cyprus University of Technology, Limassol, Cyprus)
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:maintaining good health, good health, primary mode, mode of transportation, essential activity
备注: 24 pages, 7 figures, 9 tables, Dataset: [this https URL](https://doi.org/10.5281/zenodo.10907945) , Code: [this https URL](https://github.com/CYENS/PEDESTRIAN)
点击查看摘要
Abstract:Walking has always been a primary mode of transportation and is recognized as an essential activity for maintaining good health. Despite the need for safe walking conditions in urban environments, sidewalks are frequently obstructed by various obstacles that hinder free pedestrian movement. Any object obstructing a pedestrian's path can pose a safety hazard. The advancement of pervasive computing and egocentric vision techniques offers the potential to design systems that can automatically detect such obstacles in real time, thereby enhancing pedestrian safety. The development of effective and efficient identification algorithms relies on the availability of comprehensive and well-balanced datasets of egocentric data. In this work, we introduce the PEDESTRIAN dataset, comprising egocentric data for 29 different obstacles commonly found on urban sidewalks. A total of 340 videos were collected using mobile phone cameras, capturing a pedestrian's point of view. Additionally, we present the results of a series of experiments that involved training several state-of-the-art deep learning algorithms using the proposed dataset, which can be used as a benchmark for obstacle detection and recognition tasks. The dataset can be used for training pavement obstacle detectors to enhance the safety of pedestrians in urban areas.
57. 【2512.19173】CycleChart: A Unified Consistency-Based Learning Framework for Bidirectional Chart Understanding and Generation
链接:https://arxiv.org/abs/2512.19173
作者:Dazhen Deng,Sen Yang,Yuchen He,Yuan Tian,Yingcai Wu
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:Current chart-specific tasks, Current chart-specific, link chart generation, chart, chart question answering
备注:
点击查看摘要
Abstract:Current chart-specific tasks, such as chart question answering, chart parsing, and chart generation, are typically studied in isolation, preventing models from learning the shared semantics that link chart generation and interpretation. We introduce CycleChart, a consistency-based learning framework for bidirectional chart understanding and generation. CycleChart adopts a schema-centric formulation as a common interface across tasks. We construct a consistent multi-task dataset, where each chart sample includes aligned annotations for schema prediction, data parsing, and question answering. To learn cross-directional chart semantics, CycleChart introduces a generate-parse consistency objective: the model generates a chart schema from a table and a textual query, then learns to recover the schema and data from the generated chart, enforcing semantic alignment across directions. CycleChart achieves strong results on chart generation, chart parsing, and chart question answering, demonstrating improved cross-task generalization and marking a step toward more general chart understanding models.
58. 【2512.19159】OmniMoGen: Unifying Human Motion Generation via Learning from Interleaved Text-Motion Instructions
链接:https://arxiv.org/abs/2512.19159
作者:Wendong Bu,Kaihang Pan,Yuze Lin,Jiacheng Li,Kai Shen,Wenqiao Zhang,Juncheng Li,Jun Xiao,Siliang Tang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large language models, unification remains unexplored, Large language, diverse linguistic tasks, unified diverse linguistic
备注:
点击查看摘要
Abstract:Large language models (LLMs) have unified diverse linguistic tasks within a single framework, yet such unification remains unexplored in human motion generation. Existing methods are confined to isolated tasks, limiting flexibility for free-form and omni-objective generation. To address this, we propose OmniMoGen, a unified framework that enables versatile motion generation through interleaved text-motion instructions. Built upon a concise RVQ-VAE and transformer architecture, OmniMoGen supports end-to-end instruction-driven motion generation. We construct X2Mo, a large-scale dataset of over 137K interleaved text-motion instructions, and introduce AnyContext, a benchmark for evaluating interleaved motion generation. Experiments show that OmniMoGen achieves state-of-the-art performance on text-to-motion, motion editing, and AnyContext, exhibiting emerging capabilities such as compositional editing, self-reflective generation, and knowledge-informed generation. These results mark a step toward the next intelligent motion generation. Project Page: this https URL.
59. 【2512.19150】AMap: Distilling Future Priors for Ahead-Aware Online HD Map Construction
链接:https://arxiv.org/abs/2512.19150
作者:Ruikai Li,Xinrun Li,Mengwei Xie,Hao Shan,Shoumeng Qiu,Xinyuan Chang,Yizhe Fan,Feng Xiong,Han Jiang,Yilong Ren,Haiyang Yu,Mu Xu,Yang Long,Varun Ojha,Zhiyong Cui
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:construction is pivotal, pivotal for autonomous, Online High-Definition, map construction, autonomous driving
备注: 19 pages, 11 figures
点击查看摘要
Abstract:Online High-Definition (HD) map construction is pivotal for autonomous driving. While recent approaches leverage historical temporal fusion to improve performance, we identify a critical safety flaw in this paradigm: it is inherently ``spatially backward-looking." These methods predominantly enhance map reconstruction in traversed areas, offering minimal improvement for the unseen road ahead. Crucially, our analysis of downstream planning tasks reveals a severe asymmetry: while rearward perception errors are often tolerable, inaccuracies in the forward region directly precipitate hazardous driving maneuvers. To bridge this safety gap, we propose AMap, a novel framework for Ahead-aware online HD Mapping. We pioneer a ``distill-from-future" paradigm, where a teacher model with privileged access to future temporal contexts guides a lightweight student model restricted to the current frame. This process implicitly compresses prospective knowledge into the student model, endowing it with ``look-ahead" capabilities at zero inference-time cost. Technically, we introduce a Multi-Level BEV Distillation strategy with spatial masking and an Asymmetric Query Adaptation module to effectively transfer future-aware representations to the student's static queries. Extensive experiments on the nuScenes and Argoverse 2 benchmark demonstrate that AMap significantly enhances current-frame perception. Most notably, it outperforms state-of-the-art temporal models in critical forward regions while maintaining the efficiency of single current frame inference.
60. 【2512.19133】WorldRFT: Latent World Model Planning with Reinforcement Fine-Tuning for Autonomous Driving
链接:https://arxiv.org/abs/2512.19133
作者:Pengxuan Yang,Ben Lu,Zhongpu Xia,Chao Han,Yinfeng Gao,Teng Zhang,Kun Zhan,XianPeng Lang,Yupeng Zheng,Qichao Zhang
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:perception annotation-free paradigm, temporal self-supervised learning, Latent World Models, Latent World, planning-oriented latent world
备注: AAAI 2026, first version
点击查看摘要
Abstract:Latent World Models enhance scene representation through temporal self-supervised learning, presenting a perception annotation-free paradigm for end-to-end autonomous driving. However, the reconstruction-oriented representation learning tangles perception with planning tasks, leading to suboptimal optimization for planning. To address this challenge, we propose WorldRFT, a planning-oriented latent world model framework that aligns scene representation learning with planning via a hierarchical planning decomposition and local-aware interactive refinement mechanism, augmented by reinforcement learning fine-tuning (RFT) to enhance safety-critical policy performance. Specifically, WorldRFT integrates a vision-geometry foundation model to improve 3D spatial awareness, employs hierarchical planning task decomposition to guide representation optimization, and utilizes local-aware iterative refinement to derive a planning-oriented driving policy. Furthermore, we introduce Group Relative Policy Optimization (GRPO), which applies trajectory Gaussianization and collision-aware rewards to fine-tune the driving policy, yielding systematic improvements in safety. WorldRFT achieves state-of-the-art (SOTA) performance on both open-loop nuScenes and closed-loop NavSim benchmarks. On nuScenes, it reduces collision rates by 83% (0.30% - 0.05%). On NavSim, using camera-only sensors input, it attains competitive performance with the LiDAR-based SOTA method DiffusionDrive (87.8 vs. 88.1 PDMS).
61. 【2512.19115】Generative Giants, Retrieval Weaklings: Why do Multimodal Large Language Models Fail at Multimodal Retrieval?
链接:https://arxiv.org/abs/2512.19115
作者:Hengyi Feng,Zeang Sheng,Meiyi Qiang,Wentao Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:large language models, multimodal large language, multimodal retrieval task, generative tasks, zero-shot multimodal retrieval
备注:
点击查看摘要
Abstract:Despite the remarkable success of multimodal large language models (MLLMs) in generative tasks, we observe that they exhibit a counterintuitive deficiency in the zero-shot multimodal retrieval task. In this work, we investigate the underlying mechanisms that hinder MLLMs from serving as effective retrievers. With the help of sparse autoencoders (SAEs), we decompose MLLM output representations into interpretable semantic concepts to probe their intrinsic behavior. Our analysis reveals that the representation space of MLLMs is overwhelmingly dominated by textual semantics; the visual information essential for multimodal retrieval only constitutes a small portion. This imbalance is compounded by the heavy focus of MLLMs on bridging image-text modalities, which facilitates generation but homogenizes embeddings and finally diminishes the discriminative power required for multimodal retrieval. We further discover that the specific feature components that contribute most to the similarity computations for MLLMs are in fact distractors that actively degrade retrieval performance. Overall, our work provides the first in-depth interpretability analysis of MLLM representations in the context of multimodal retrieval and offers possible directions for enhancing the multimodal retrieval capabilities of MLLMs.
62. 【2512.19110】rifocal Tensor and Relative Pose Estimation with Known Vertical Direction
链接:https://arxiv.org/abs/2512.19110
作者:Tao Li,Zhenbao Yu,Banglei Guan,Jianli Han,Weimin Lv,Friedrich Fraundorfer
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:vertical directions, work presents, estimating the relative, point correspondences, unmanned aerial vehicles
备注:
点击查看摘要
Abstract:This work presents two novel solvers for estimating the relative poses among views with known vertical directions. The vertical directions of camera views can be easily obtained using inertial measurement units (IMUs) which have been widely used in autonomous vehicles, mobile phones, and unmanned aerial vehicles (UAVs). Given the known vertical directions, our lgorithms only need to solve for two rotation angles and two translation vectors. In this paper, a linear closed-form solution has been described, requiring only four point correspondences in three views. We also propose a minimal solution with three point correspondences using the latest Gröbner basis solver. Since the proposed methods require fewer point correspondences, they can be efficiently applied within the RANSAC framework for outliers removal and pose estimation in visual odometry. The proposed method has been tested on both synthetic data and real-world scenes from KITTI. The experimental results show that the accuracy of the estimated poses is superior to other alternative methods.
63. 【2512.19108】GaussianImage++: Boosted Image Representation and Compression with 2D Gaussian Splatting
链接:https://arxiv.org/abs/2512.19108
作者:Tiantian Li,Xinjie Zhang,Xingtong Ge,Tongda Xu,Dailan He,Jun Zhang,Yan Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Implicit neural representations, achieved remarkable success, Implicit neural, substantial training time, Gaussian primitives
备注: Accepted to AAAI [this http URL](http://2026.Code) URL: [this https URL](https://github.com/Sweethyh/GaussianImage_plus.git)
点击查看摘要
Abstract:Implicit neural representations (INRs) have achieved remarkable success in image representation and compression, but they require substantial training time and memory. Meanwhile, recent 2D Gaussian Splatting (GS) methods (\textit{e.g.}, GaussianImage) offer promising alternatives through efficient primitive-based rendering. However, these methods require excessive Gaussian primitives to maintain high visual fidelity. To exploit the potential of GS-based approaches, we present GaussianImage++, which utilizes limited Gaussian primitives to achieve impressive representation and compression performance. Firstly, we introduce a distortion-driven densification mechanism. It progressively allocates Gaussian primitives according to signal intensity. Secondly, we employ context-aware Gaussian filters for each primitive, which assist in the densification to optimize Gaussian primitives based on varying image content. Thirdly, we integrate attribute-separated learnable scalar quantizers and quantization-aware training, enabling efficient compression of primitive attributes. Experimental results demonstrate the effectiveness of our method. In particular, GaussianImage++ outperforms GaussianImage and INRs-based COIN in representation and compression performance while maintaining real-time decoding and low memory usage.
64. 【2512.19095】Mamba-Based Modality Disentanglement Network for Multi-Contrast MRI Reconstruction
链接:https://arxiv.org/abs/2512.19095
作者:Weiyi Lyu,Xinming Fang,Jun Wang,Jun Shi,Guixu Zhang,Juncheng Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Magnetic resonance imaging, modern clinical diagnosis, offering unparalleled soft-tissue, unparalleled soft-tissue contrast, Magnetic resonance
备注: 12 pages, 11 figures, 6 tables
点击查看摘要
Abstract:Magnetic resonance imaging (MRI) is a cornerstone of modern clinical diagnosis, offering unparalleled soft-tissue contrast without ionizing radiation. However, prolonged scan times remain a major barrier to patient throughput and comfort. Existing accelerated MRI techniques often struggle with two key challenges: (1) failure to effectively utilize inherent K-space prior information, leading to persistent aliasing artifacts from zero-filled inputs; and (2) contamination of target reconstruction quality by irrelevant information when employing multi-contrast fusion strategies. To overcome these challenges, we present MambaMDN, a dual-domain framework for multi-contrast MRI reconstruction. Our approach first employs fully-sampled reference K-space data to complete the undersampled target data, generating structurally aligned but modality-mixed inputs. Subsequently, we develop a Mamba-based modality disentanglement network to extract and remove reference-specific features from the mixed representation. Furthermore, we introduce an iterative refinement mechanism to progressively enhance reconstruction accuracy through repeated feature purification. Extensive experiments demonstrate that MambaMDN can significantly outperform existing multi-contrast reconstruction methods.
65. 【2512.19091】Auditing Significance, Metric Choice, and Demographic Fairness in Medical AI Challenges
链接:https://arxiv.org/abs/2512.19091
作者:Ariel Lubonja,Pedro R. A. S. Bassi,Wenxuan Li,Hualin Qiao,Randal Burns,Alan L. Yuille,Zongwei Zhou
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Open challenges, facto standard, standard for comparative, medical AI methods, Johns Hopkins Hospital
备注: MICCAI 2025 Workshop on Machine Learning in Medical Imaging
点击查看摘要
Abstract:Open challenges have become the de facto standard for comparative ranking of medical AI methods. Despite their importance, medical AI leaderboards exhibit three persistent limitations: (1) score gaps are rarely tested for statistical significance, so rank stability is unknown; (2) single averaged metrics are applied to every organ, hiding clinically important boundary errors; (3) performance across intersecting demographics is seldom reported, masking fairness and equity gaps. We introduce RankInsight, an open-source toolkit that seeks to address these limitations. RankInsight (1) computes pair-wise significance maps that show the nnU-Net family outperforms Vision-Language and MONAI submissions with high statistical certainty; (2) recomputes leaderboards with organ-appropriate metrics, reversing the order of the top four models when Dice is replaced by NSD for tubular structures; and (3) audits intersectional fairness, revealing that more than half of the MONAI-based entries have the largest gender-race discrepancy on our proprietary Johns Hopkins Hospital dataset. The RankInsight toolkit is publicly released and can be directly applied to past, ongoing, and future challenges. It enables organizers and participants to publish rankings that are statistically sound, clinically meaningful, and demographically fair.
66. 【2512.19088】Retrieving Objects from 3D Scenes with Box-Guided Open-Vocabulary Instance Segmentation
链接:https://arxiv.org/abs/2512.19088
作者:Khanh Nguyen,Dasith de Silva Edirimuni,Ghulam Mubashar Hassan,Ajmal Mian
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Locating and retrieving, scene-level point clouds, augmented reality, challenging problem, problem with broad
备注: Accepted to AAAI 2026 Workshop on New Frontiers in Information Retrieval
点击查看摘要
Abstract:Locating and retrieving objects from scene-level point clouds is a challenging problem with broad applications in robotics and augmented reality. This task is commonly formulated as open-vocabulary 3D instance segmentation. Although recent methods demonstrate strong performance, they depend heavily on SAM and CLIP to generate and classify 3D instance masks from images accompanying the point cloud, leading to substantial computational overhead and slow processing that limit their deployment in real-world settings. Open-YOLO 3D alleviates this issue by using a real-time 2D detector to classify class-agnostic masks produced directly from the point cloud by a pretrained 3D segmenter, eliminating the need for SAM and CLIP and significantly reducing inference time. However, Open-YOLO 3D often fails to generalize to object categories that appear infrequently in the 3D training data. In this paper, we propose a method that generates 3D instance masks for novel objects from RGB images guided by a 2D open-vocabulary detector. Our approach inherits the 2D detector's ability to recognize novel objects while maintaining efficient classification, enabling fast and accurate retrieval of rare instances from open-ended text queries. Our code will be made available at this https URL.
67. 【2512.19070】Watch Closely: Mitigating Object Hallucinations in Large Vision-Language Models with Disentangled Decoding
链接:https://arxiv.org/abs/2512.19070
作者:Ruiqi Ma,Yu Yan,Chunhong Zhang,Minghao Yin,XinChao Liu,Zhihong Jin,Zheng Hu
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:Large Vision-Language Models, demonstrating strong potential, Large Vision-Language, bridge the gap, demonstrating strong
备注:
点击查看摘要
Abstract:Large Vision-Language Models (LVLMs) bridge the gap between visual and linguistic modalities, demonstrating strong potential across a variety of domains. However, despite significant progress, LVLMs still suffer from severe hallucination issues in object recognition tasks. These models often fail to accurately identify certain objects, leading to text generation that appears fluent but does not correspond to the visual content, which can have serious consequences in real-world applications. Recently, several methods have been proposed to alleviate LVLM hallucinations, but most focus solely on reducing hallucinations in the language modality. To mitigate hallucinations in both the language and visual modalities, we introduce Hallucination Disentangled Decoding (HDD) method that requires no training. HDD enhances the original image by segmenting it and selecting images that augment the original, while also utilizing a blank image to eliminate language prior hallucinations in both the original and segmented images. This design not only reduces the model's dependence on language priors but also enhances its visual performance. (Code: this https URL)
68. 【2512.19058】6DAttack: Backdoor Attacks in the 6DoF Pose Estimation
链接:https://arxiv.org/abs/2512.19058
作者:Jihui Guo,Zongmin Zhang,Zhen Sun,Yuhao Yang,Jinlin Wu,Fu Zhang,Xinlei He
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Deep learning advances, Deep learning, enabled accurate, autonomous systems, learning advances
备注:
点击查看摘要
Abstract:Deep learning advances have enabled accurate six-degree-of-freedom (6DoF) object pose estimation, widely used in robotics, AR/VR, and autonomous systems. However, backdoor attacks pose significant security risks. While most research focuses on 2D vision, 6DoF pose estimation remains largely unexplored. Unlike traditional backdoors that only change classes, 6DoF attacks must control continuous parameters like translation and rotation, rendering 2D methods inapplicable. We propose 6DAttack, a framework using 3D object triggers to induce controlled erroneous poses while maintaining normal behavior. Evaluations on PVNet, DenseFusion, and PoseDiffusion across LINEMOD, YCB-Video, and CO3D show high attack success rates (ASRs) without compromising clean performance. Backdoored models achieve up to 100% clean ADD accuracy and 100% ASR, with triggered samples reaching 97.70% ADD-P. Furthermore, a representative defense remains ineffective. Our findings reveal a serious, underexplored threat to 6DoF pose estimation.
69. 【2512.19049】Decoupled Generative Modeling for Human-Object Interaction Synthesis
链接:https://arxiv.org/abs/2512.19049
作者:Hwanhee Jung,Seunggwan Lee,Jeongyoon Yoon,SeungHyeon Kim,Giljoo Nam,Qixing Huang,Sangpil Kim
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Synthesizing realistic human-object, realistic human-object interaction, Synthesizing realistic, Human-Object Interaction Synthesis, computer vision
备注:
点击查看摘要
Abstract:Synthesizing realistic human-object interaction (HOI) is essential for 3D computer vision and robotics, underpinning animation and embodied control. Existing approaches often require manually specified intermediate waypoints and place all optimization objectives on a single network, which increases complexity, reduces flexibility, and leads to errors such as unsynchronized human and object motion or penetration. To address these issues, we propose Decoupled Generative Modeling for Human-Object Interaction Synthesis (DecHOI), which separates path planning and action synthesis. A trajectory generator first produces human and object trajectories without prescribed waypoints, and an action generator conditions on these paths to synthesize detailed motions. To further improve contact realism, we employ adversarial training with a discriminator that focuses on the dynamics of distal joints. The framework also models a moving counterpart and supports responsive, long-sequence planning in dynamic scenes, while preserving plan consistency. Across two benchmarks, FullBodyManipulation and 3D-FUTURE, DecHOI surpasses prior methods on most quantitative metrics and qualitative evaluations, and perceptual studies likewise prefer our results.
70. 【2512.19048】WaTeRFlow: Watermark Temporal Robustness via Flow Consistency
链接:https://arxiv.org/abs/2512.19048
作者:Utae Jeong,Sumin In,Hyunju Ryu,Jaewan Choi,Feng Yang,Jongheon Jeong,Seungryong Kim,Sangpil Kim
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:watermarking supports authenticity, Image watermarking supports, powerful generative edits, authenticity and provenance, supports authenticity
备注:
点击查看摘要
Abstract:Image watermarking supports authenticity and provenance, yet many schemes are still easy to bypass with various distortions and powerful generative edits. Deep learning-based watermarking has improved robustness to diffusion-based image editing, but a gap remains when a watermarked image is converted to video by image-to-video (I2V), in which per-frame watermark detection weakens. I2V has quickly advanced from short, jittery clips to multi-second, temporally coherent scenes, and it now serves not only content creation but also world-modeling and simulation workflows, making cross-modal watermark recovery crucial. We present WaTeRFlow, a framework tailored for robustness under I2V. It consists of (i) FUSE (Flow-guided Unified Synthesis Engine), which exposes the encoder-decoder to realistic distortions via instruction-driven edits and a fast video diffusion proxy during training, (ii) optical-flow warping with a Temporal Consistency Loss (TCL) that stabilizes per-frame predictions, and (iii) a semantic preservation loss that maintains the conditioning signal. Experiments across representative I2V models show accurate watermark recovery from frames, with higher first-frame and per-frame bit accuracy and resilience when various distortions are applied before or after video generation.
71. 【2512.19036】Distinguishing Visually Similar Actions: Prompt-Guided Semantic Prototype Modulation for Few-Shot Action Recognition
链接:https://arxiv.org/abs/2512.19036
作者:Xiaoyang Li,Mingming Lu,Ruiqi Wang,Hao Li,Zewei Le
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Few-shot action recognition, limited labeled samples, action recognition aims, Semantic Prototype Modulation, Few-shot action
备注: 19 pages, 7 figures. Preprint under review for journal submission
点击查看摘要
Abstract:Few-shot action recognition aims to enable models to quickly learn new action categories from limited labeled samples, addressing the challenge of data scarcity in real-world applications. Current research primarily addresses three core challenges: (1) temporal modeling, where models are prone to interference from irrelevant static background information and struggle to capture the essence of dynamic action features; (2) visual similarity, where categories with subtle visual differences are difficult to distinguish; and (3) the modality gap between visual-textual support prototypes and visual-only queries, which complicates alignment within a shared embedding space. To address these challenges, this paper proposes a CLIP-SPM framework, which includes three components: (1) the Hierarchical Synergistic Motion Refinement (HSMR) module, which aligns deep and shallow motion features to improve temporal modeling by reducing static background interference; (2) the Semantic Prototype Modulation (SPM) strategy, which generates query-relevant text prompts to bridge the modality gap and integrates them with visual features, enhancing the discriminability between similar actions; and (3) the Prototype-Anchor Dual Modulation (PADM) method, which refines support prototypes and aligns query features with a global semantic anchor, improving consistency across support and query samples. Comprehensive experiments across standard benchmarks, including Kinetics, SSv2-Full, SSv2-Small, UCF101, and HMDB51, demonstrate that our CLIP-SPM achieves competitive performance under 1-shot, 3-shot, and 5-shot settings. Extensive ablation studies and visual analyses further validate the effectiveness of each component and its contributions to addressing the core challenges. The source code and models are publicly available at GitHub.
72. 【2512.19032】Automatic Neuronal Activity Segmentation in Fast Four Dimensional Spatio-Temporal Fluorescence Imaging using Bayesian Approach
链接:https://arxiv.org/abs/2512.19032
作者:Ran Li,Pan Xiao,Kaushik Dutta,Youdong Guo
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Fluorescence Microcopy Calcium, Microcopy Calcium Imaging, single cell resolution, Fluorescence Microcopy, Microcopy Calcium
备注:
点击查看摘要
Abstract:Fluorescence Microcopy Calcium Imaging is a fundamental tool to in-vivo record and analyze large scale neuronal activities simultaneously at a single cell resolution. Automatic and precise detection of behaviorally relevant neuron activity from the recordings is critical to study the mapping of brain activity in organisms. However a perpetual bottleneck to this problem is the manual segmentation which is time and labor intensive and lacks generalizability. To this end, we present a Bayesian Deep Learning Framework to detect neuronal activities in 4D spatio-temporal data obtained by light sheet microscopy. Our approach accounts for the use of temporal information by calculating pixel wise correlation maps and combines it with spatial information given by the mean summary image. The Bayesian framework not only produces probability segmentation maps but also models the uncertainty pertaining to active neuron detection. To evaluate the accuracy of our framework we implemented the test of reproducibility to assert the generalization of the network to detect neuron activity. The network achieved a mean Dice Score of 0.81 relative to the synthetic Ground Truth obtained by Otsu's method and a mean Dice Score of 0.79 between the first and second run for test of reproducibility. Our method successfully deployed can be used for rapid detection of active neuronal activities for behavioural studies.
73. 【2512.19026】Finer-Personalization Rank: Fine-Grained Retrieval Examines Identity Preservation for Personalized Generation
链接:https://arxiv.org/abs/2512.19026
作者:Connor Kilrain,David Carlyn,Julia Chae,Sara Beery,Wei-Lun Chao,Jianyang Gu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:central question, raises a central, evaluate identity preservation, Finer-Personalization Rank, generative models raises
备注:
点击查看摘要
Abstract:The rise of personalized generative models raises a central question: how should we evaluate identity preservation? Given a reference image (e.g., one's pet), we expect the generated image to retain precise details attached to the subject's identity. However, current generative evaluation metrics emphasize the overall semantic similarity between the reference and the output, and overlook these fine-grained discriminative details. We introduce Finer-Personalization Rank, an evaluation protocol tailored to identity preservation. Instead of pairwise similarity, Finer-Personalization Rank adopts a ranking view: it treats each generated image as a query against an identity-labeled gallery consisting of visually similar real images. Retrieval metrics (e.g., mean average precision) measure performance, where higher scores indicate that identity-specific details (e.g., a distinctive head spot) are preserved. We assess identity at multiple granularities -- from fine-grained categories (e.g., bird species, car models) to individual instances (e.g., re-identification). Across CUB, Stanford Cars, and animal Re-ID benchmarks, Finer-Personalization Rank more faithfully reflects identity retention than semantic-only metrics and reveals substantial identity drift in several popular personalization methods. These results position the gallery-based protocol as a principled and practical evaluation for personalized generation.
74. 【2512.19022】Steering Vision-Language Pre-trained Models for Incremental Face Presentation Attack Detection
链接:https://arxiv.org/abs/2512.19022
作者:Haoze Li,Jie Zhang,Guoying Zhao,Stephen Lin,Shiguang Shan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Face Presentation Attack, Presentation Attack Detection, demands incremental learning, Face Presentation, Attack Detection
备注:
点击查看摘要
Abstract:Face Presentation Attack Detection (PAD) demands incremental learning (IL) to combat evolving spoofing tactics and domains. Privacy regulations, however, forbid retaining past data, necessitating rehearsal-free IL (RF-IL). Vision-Language Pre-trained (VLP) models, with their prompt-tunable cross-modal representations, enable efficient adaptation to new spoofing styles and domains. Capitalizing on this strength, we propose \textbf{SVLP-IL}, a VLP-based RF-IL framework that balances stability and plasticity via \textit{Multi-Aspect Prompting} (MAP) and \textit{Selective Elastic Weight Consolidation} (SEWC). MAP isolates domain dependencies, enhances distribution-shift sensitivity, and mitigates forgetting by jointly exploiting universal and domain-specific cues. SEWC selectively preserves critical weights from previous tasks, retaining essential knowledge while allowing flexibility for new adaptations. Comprehensive experiments across multiple PAD benchmarks show that SVLP-IL significantly reduces catastrophic forgetting and enhances performance on unseen domains. SVLP-IL offers a privacy-compliant, practical solution for robust lifelong PAD deployment in RF-IL settings.
75. 【2512.19021】VLNVerse: A Benchmark for Vision-Language Navigation with Versatile, Embodied, Realistic Simulation and Evaluation
链接:https://arxiv.org/abs/2512.19021
作者:Sihao Lin,Zerui Li,Xunyi Zhao,Gengze Zhou,Liuyi Wang,Rong Wei,Rui Tang,Juncheng Li,Hanqing Wang,Jiangmiao Pang,Anton van den Hengel,Jiajun Liu,Qi Wu
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:naive physical simulation, benchmarks remain confined, confined to fixed, small-scale datasets, remain confined
备注:
点击查看摘要
Abstract:Despite remarkable progress in Vision-Language Navigation (VLN), existing benchmarks remain confined to fixed, small-scale datasets with naive physical simulation. These shortcomings limit the insight that the benchmarks provide into sim-to-real generalization, and create a significant research gap. Furthermore, task fragmentation prevents unified/shared progress in the area, while limited data scales fail to meet the demands of modern LLM-based pretraining. To overcome these limitations, we introduce VLNVerse: a new large-scale, extensible benchmark designed for Versatile, Embodied, Realistic Simulation, and Evaluation. VLNVerse redefines VLN as a scalable, full-stack embodied AI problem. Its Versatile nature unifies previously fragmented tasks into a single framework and provides an extensible toolkit for researchers. Its Embodied design moves beyond intangible and teleporting "ghost" agents that support full-kinematics in a Realistic Simulation powered by a robust physics engine. We leverage the scale and diversity of VLNVerse to conduct a comprehensive Evaluation of existing methods, from classic models to MLLM-based agents. We also propose a novel unified multi-task model capable of addressing all tasks within the benchmark. VLNVerse aims to narrow the gap between simulated navigation and real-world generalization, providing the community with a vital tool to boost research towards scalable, general-purpose embodied locomotion agents.
76. 【2512.19020】CETCAM: Camera-Controllable Video Generation via Consistent and Extensible Tokenization
链接:https://arxiv.org/abs/2512.19020
作者:Zelin Zhao,Xinyu Gong,Bangya Liu,Ziyang Song,Jun Zhang,Suhui Wu,Yongxin Chen,Hao Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Achieving precise camera, generation remains challenging, Achieving precise, video generation remains, camera pose annotations
备注:
点击查看摘要
Abstract:Achieving precise camera control in video generation remains challenging, as existing methods often rely on camera pose annotations that are difficult to scale to large and dynamic datasets and are frequently inconsistent with depth estimation, leading to train-test discrepancies. We introduce CETCAM, a camera-controllable video generation framework that eliminates the need for camera annotations through a consistent and extensible tokenization scheme. CETCAM leverages recent advances in geometry foundation models, such as VGGT, to estimate depth and camera parameters and converts them into unified, geometry-aware tokens. These tokens are seamlessly integrated into a pretrained video diffusion backbone via lightweight context blocks. Trained in two progressive stages, CETCAM first learns robust camera controllability from diverse raw video data and then refines fine-grained visual quality using curated high-fidelity datasets. Extensive experiments across multiple benchmarks demonstrate state-of-the-art geometric consistency, temporal stability, and visual realism. Moreover, CETCAM exhibits strong adaptability to additional control modalities, including inpainting and layout control, highlighting its flexibility beyond camera control. The project page is available at this https URL.
77. 【2512.18994】owards AI-Guided Open-World Ecological Taxonomic Classification
链接:https://arxiv.org/abs/2512.18994
作者:Cheng Yaw Low,Heejoon Koo,Jaewoo Park,Kaleb Mesfin Asfaw,Meeyoung Cha
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:species underpins global, underpins global sustainability, global sustainability efforts, conservation planning, species underpins
备注: 4 figures, 11 tables, and 15 pages
点击查看摘要
Abstract:AI-guided classification of ecological families, genera, and species underpins global sustainability efforts such as biodiversity monitoring, conservation planning, and policy-making. Progress toward this goal is hindered by long-tailed taxonomic distributions from class imbalance, along with fine-grained taxonomic variations, test-time spatiotemporal domain shifts, and closed-set assumptions that can only recognize previously seen taxa. We introduce the Open-World Ecological Taxonomy Classification, a unified framework that captures the co-occurrence of these challenges in realistic ecological settings. To address them, we propose TaxoNet, an embedding-based encoder with a dual-margin penalization loss that strengthens learning signals from rare underrepresented taxa while mitigating the dominance of overrepresented ones, directly confronting interrelated challenges. We evaluate our method on diverse ecological domains: Google Auto-Arborist (urban trees), iNat-Plantae (Plantae observations from various ecosystems in iNaturalist-2019), and NAFlora-Mini (a curated herbarium collection). Our model consistently outperforms baselines, particularly for rare taxa, establishing a strong foundation for open-world plant taxonomic monitoring. Our findings further show that general-purpose multimodal foundation models remain constrained in plant-domain applications.
78. 【2512.18991】ICP-4D: Bridging Iterative Closest Point and LiDAR Panoptic Segmentation
链接:https://arxiv.org/abs/2512.18991
作者:Gyeongrok Oh,Youngdong Jang,Jonghyun Choi,Suk-Ju Kang,Guang Lin,Sangpil Kim
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:train deep neural, deep neural networks, design dedicated modules, Dominant paradigms, large superimposed point
备注:
点击查看摘要
Abstract:Dominant paradigms for 4D LiDAR panoptic segmentation are usually required to train deep neural networks with large superimposed point clouds or design dedicated modules for instance association. However, these approaches perform redundant point processing and consequently become computationally expensive, yet still overlook the rich geometric priors inherently provided by raw point clouds. To this end, we introduce ICP-4D, a simple yet effective training-free framework that unifies spatial and temporal reasoning through geometric relations among instance-level point sets. Specifically, we apply the Iterative Closest Point (ICP) algorithm to directly associate temporally consistent instances by aligning the source and target point sets through the estimated transformation. To stabilize association under noisy instance predictions, we introduce a Sinkhorn-based soft matching. This exploits the underlying instance distribution to obtain accurate point-wise correspondences, resulting in robust geometric alignment. Furthermore, our carefully designed pipeline, which considers three instance types-static, dynamic, and missing-offers computational efficiency and occlusion-aware matching. Our extensive experiments across both SemanticKITTI and panoptic nuScenes demonstrate that our method consistently outperforms state-of-the-art approaches, even without additional training or extra point cloud inputs.
79. 【2512.18987】Affordance RAG: Hierarchical Multimodal Retrieval with Affordance-Aware Embodied Memory for Mobile Manipulation
链接:https://arxiv.org/abs/2512.18987
作者:Ryosuke Korekata,Quanting Xie,Yonatan Bisk,Komei Sugiura
类目:Robotics (cs.RO); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:free-form natural language, natural language instructions, address the problem, problem of open-vocabulary, required to carry
备注: Accepted to IEEE RA-L, with presentation at ICRA 2026
点击查看摘要
Abstract:In this study, we address the problem of open-vocabulary mobile manipulation, where a robot is required to carry a wide range of objects to receptacles based on free-form natural language instructions. This task is challenging, as it involves understanding visual semantics and the affordance of manipulation actions. To tackle these challenges, we propose Affordance RAG, a zero-shot hierarchical multimodal retrieval framework that constructs Affordance-Aware Embodied Memory from pre-explored images. The model retrieves candidate targets based on regional and visual semantics and reranks them with affordance scores, allowing the robot to identify manipulation options that are likely to be executable in real-world environments. Our method outperformed existing approaches in retrieval performance for mobile manipulation instruction in large-scale indoor environments. Furthermore, in real-world experiments where the robot performed mobile manipulation in indoor environments based on free-form instructions, the proposed method achieved a task success rate of 85%, outperforming existing methods in both retrieval performance and overall task success.
80. 【2512.18969】Self-Attention with State-Object Weighted Combination for Compositional Zero Shot Learning
链接:https://arxiv.org/abs/2512.18969
作者:Cheng-Hong Chang,Pei-Hsuan Tsai
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:states and objects, state and object, objects, states, Object
备注:
点击查看摘要
Abstract:Object recognition has become prevalent across various industries. However, most existing applications are limited to identifying objects alone, without considering their associated states. The ability to recognize both the state and object simultaneously remains less common. One approach to address this is by treating state and object as a single category during training. However, this approach poses challenges in data collection and training since it requires comprehensive data for all possible combinations. Compositional Zero-shot Learning (CZSL) emerges as a viable solution by treating the state and object as distinct categories during training. CZSL facilitates the identification of novel compositions even in the absence of data for every conceivable combination. The current state-of-the-art method, KG-SP, addresses this issue by training distinct classifiers for states and objects, while leveraging a semantic model to evaluate the plausibility of composed compositions. However, KG-SP's accuracy in state and object recognition can be further improved, and it fails to consider the weighting of states and objects during composition. In this study, we propose SASOW, an enhancement of KG-SP that considers the weighting of states and objects while improving composition recognition accuracy. First, we introduce self-attention mechanisms into the classifiers for states and objects, leading to enhanced accuracy in recognizing both. Additionally, we incorporate the weighting of states and objects during composition to generate more reasonable and accurate compositions. Our validation process involves testing SASOW on three established benchmark datasets. Experimental outcomes affirm when compared against OW-CZSL approach, KG-SP, SASOW showcases improvements of 2.1%, 1.7%, and 0.4% in terms of accuracy for unseen compositions across the MIT-States, UT Zappos, and C-GQA datasets, respectively.
81. 【2512.18968】otal Curvature Regularization and its_Minimization for Surface and Image Smoothing
链接:https://arxiv.org/abs/2512.18968
作者:Tianle Lu,Ke Chen,Yuping Duan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:penalizing normal curvatures, normal curvature regularization, total normal curvature, multiple directions, curvature regularization
备注:
点击查看摘要
Abstract:We introduce a novel formulation for curvature regularization by penalizing normal curvatures from multiple directions. This total normal curvature regularization is capable of producing solutions with sharp edges and precise isotropic properties. To tackle the resulting high-order nonlinear optimization problem, we reformulate it as the task of finding the steady-state solution of a time-dependent partial differential equation (PDE) system. Time discretization is achieved through operator splitting, where each subproblem at the fractional steps either has a closed-form solution or can be efficiently solved using advanced algorithms. Our method circumvents the need for complex parameter tuning and demonstrates robustness to parameter choices. The efficiency and effectiveness of our approach have been rigorously validated in the context of surface and image smoothing problems.
82. 【2512.18964】DVI: Disentangling Semantic and Visual Identity for Training-Free Personalized Generation
链接:https://arxiv.org/abs/2512.18964
作者:Guandong Li,Yijun Ding
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent tuning-free identity, Recent tuning-free, achieve high facial, overlook visual context, tuning-free identity customization
备注:
点击查看摘要
Abstract:Recent tuning-free identity customization methods achieve high facial fidelity but often overlook visual context, such as lighting, skin texture, and environmental tone. This limitation leads to ``Semantic-Visual Dissonance,'' where accurate facial geometry clashes with the input's unique atmosphere, causing an unnatural ``sticker-like'' effect. We propose **DVI (Disentangled Visual-Identity)**, a zero-shot framework that orthogonally disentangles identity into fine-grained semantic and coarse-grained visual streams. Unlike methods relying solely on semantic vectors, DVI exploits the inherent statistical properties of the VAE latent space, utilizing mean and variance as lightweight descriptors for global visual atmosphere. We introduce a **Parameter-Free Feature Modulation** mechanism that adaptively modulates semantic embeddings with these visual statistics, effectively injecting the reference's ``visual soul'' without training. Furthermore, a **Dynamic Temporal Granularity Scheduler** aligns with the diffusion process, prioritizing visual atmosphere in early denoising stages while refining semantic details later. Extensive experiments demonstrate that DVI significantly enhances visual consistency and atmospheric fidelity without parameter fine-tuning, maintaining robust identity preservation and outperforming state-of-the-art methods in IBench evaluations.
83. 【2512.18954】VOIC: Visible-Occluded Decoupling for Monocular 3D Semantic Scene Completion
链接:https://arxiv.org/abs/2512.18954
作者:Zaidao Han,Risa Higashita,Jiang Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:robotic scene understanding, critical task, task for autonomous, autonomous driving, driving and robotic
备注:
点击查看摘要
Abstract:Camera-based 3D Semantic Scene Completion (SSC) is a critical task for autonomous driving and robotic scene understanding. It aims to infer a complete 3D volumetric representation of both semantics and geometry from a single image. Existing methods typically focus on end-to-end 2D-to-3D feature lifting and voxel completion. However, they often overlook the interference between high-confidence visible-region perception and low-confidence occluded-region reasoning caused by single-image input, which can lead to feature dilution and error propagation. To address these challenges, we introduce an offline Visible Region Label Extraction (VRLE) strategy that explicitly separates and extracts voxel-level supervision for visible regions from dense 3D ground truth. This strategy purifies the supervisory space for two complementary sub-tasks: visible-region perception and occluded-region reasoning. Building on this idea, we propose the Visible-Occluded Interactive Completion Network (VOIC), a novel dual-decoder framework that explicitly decouples SSC into visible-region semantic perception and occluded-region scene completion. VOIC first constructs a base 3D voxel representation by fusing image features with depth-derived occupancy. The visible decoder focuses on generating high-fidelity geometric and semantic priors, while the occlusion decoder leverages these priors together with cross-modal interaction to perform coherent global scene reasoning. Extensive experiments on the SemanticKITTI and SSCBench-KITTI360 benchmarks demonstrate that VOIC outperforms existing monocular SSC methods in both geometric completion and semantic segmentation accuracy, achieving state-of-the-art performance.
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2512.18954 [cs.CV]
(or
arXiv:2512.18954v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2512.18954
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
84. 【2512.18953】Symmetrization of 3D Generative Models
链接:https://arxiv.org/abs/2512.18953
作者:Nicolas Caytuiro,Ivan Sipiran
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:data-centric approach, approach to promote, promote symmetry, model architecture, training data
备注: 10 pages, 5 figures
点击查看摘要
Abstract:We propose a novel data-centric approach to promote symmetry in 3D generative models by modifying the training data rather than the model architecture. Our method begins with an analysis of reflectional symmetry in both real-world 3D shapes and samples generated by state-of-the-art models. We hypothesize that training a generative model exclusively on half-objects, obtained by reflecting one half of the shapes along the x=0 plane, enables the model to learn a rich distribution of partial geometries which, when reflected during generation, yield complete shapes that are both visually plausible and geometrically symmetric. To test this, we construct a new dataset of half-objects from three ShapeNet classes (Airplane, Car, and Chair) and train two generative models. Experiments demonstrate that the generated shapes are symmetrical and consistent, compared with the generated objects from the original model and the original dataset objects.
85. 【2512.18933】Point What You Mean: Visually Grounded Instruction Policy
链接:https://arxiv.org/abs/2512.18933
作者:Hang Yu,Juntu Zhao,Yufeng Liu,Kaiyu Li,Cheng Ma,Di Zhang,Yingdong Hu,Guang Chen,Junyuan Xie,Junliang Guo,Junqiao Zhao,Yang Gao
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:models align vision, ability remains limited, models align, text prompt, OOD
备注:
点击查看摘要
Abstract:Vision-Language-Action (VLA) models align vision and language with embodied control, but their object referring ability remains limited when relying solely on text prompt, especially in cluttered or out-of-distribution (OOD) scenes. In this study, we introduce the Point-VLA, a plug-and-play policy that augments language instructions with explicit visual cues (e.g., bounding boxes) to resolve referential ambiguity and enable precise object-level grounding. To efficiently scale visually grounded datasets, we further develop an automatic data annotation pipeline requiring minimal human effort. We evaluate Point-VLA on diverse real-world referring tasks and observe consistently stronger performance than text-only instruction VLAs, particularly in cluttered or unseen-object scenarios, with robust generalization. These results demonstrate that Point-VLA effectively resolves object referring ambiguity through pixel-level visual grounding, achieving more generalizable embodied control.
86. 【2512.18930】LouvreSAE: Sparse Autoencoders for Interpretable and Controllable Style Transfer
链接:https://arxiv.org/abs/2512.18930
作者:Raina Panda,Daniel Fein,Arpita Singhal,Mark Fiore,Maneesh Agrawala,Matyas Bohacek
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
关键词:significant challenge, prompt engineering, subject matter, style, generative models remains
备注:
点击查看摘要
Abstract:Artistic style transfer in generative models remains a significant challenge, as existing methods often introduce style only via model fine-tuning, additional adapters, or prompt engineering, all of which can be computationally expensive and may still entangle style with subject matter. In this paper, we introduce a training- and inference-light, interpretable method for representing and transferring artistic style. Our approach leverages an art-specific Sparse Autoencoder (SAE) on top of latent embeddings of generative image models. Trained on artistic data, our SAE learns an emergent, largely disentangled set of stylistic and compositional concepts, corresponding to style-related elements pertaining brushwork, texture, and color palette, as well as semantic and structural concepts. We call it LouvreSAE and use it to construct style profiles: compact, decomposable steering vectors that enable style transfer without any model updates or optimization. Unlike prior concept-based style transfer methods, our method requires no fine-tuning, no LoRA training, and no additional inference passes, enabling direct steering of artistic styles from only a few reference images. We validate our method on ArtBench10, achieving or surpassing existing methods on style evaluations (VGG Style Loss and CLIP Score Style) while being 1.7-20x faster and, critically, interpretable.
87. 【2512.18910】Delta-LLaVA: Base-then-Specialize Alignment for Token-Efficient Vision-Language Models
链接:https://arxiv.org/abs/2512.18910
作者:Mohamad Zamini,Diksha Shukla
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal Large Language, Large Language Models, Multimodal Large, rich reasoning capabilities, enable rich reasoning
备注:
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) combine visual and textual representations to enable rich reasoning capabilities. However, the high computational cost of processing dense visual tokens remains a major bottleneck. A critical component in this pipeline is the visual projector, which bridges the vision encoder and the language model. Standard designs often employ a simple multi-layer perceptron for direct token mapping, but this approach scales poorly with high-resolution inputs, introducing significant redundancy. We present Delta-LLaVA, a token-efficient projector that employs a low-rank DeltaProjection to align multi-level vision features into a compact subspace before further interaction. On top of this base alignment, lightweight Transformer blocks act as specialization layers, capturing both global and local structure under constrained token budgets. Extensive experiments and ablations demonstrate that this base-then-specialize design yields consistent gains across multiple benchmarks with only 144 tokens, highlighting the importance of token formation prior to scaling interaction capacity. With Delta-LLaVA, inference throughput improves by up to 55%, while end-to-end training accelerates by nearly 4-5x in pretraining and over 1.5x in finetuning, highlighting the dual benefits of our design in both efficiency and scalability.
88. 【2512.18897】hinking Beyond Labels: Vocabulary-Free Fine-Grained Recognition using Reasoning-Augmented LMMs
链接:https://arxiv.org/abs/2512.18897
作者:Dmitry Demidov,Zaigham Zaheer,Zongyan Han,Omkar Thawakar,Rao Anwer
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:distinguish visually similar, visually similar categories, human-defined label set, aims to distinguish, distinguish visually
备注:
点击查看摘要
Abstract:Vocabulary-free fine-grained image recognition aims to distinguish visually similar categories within a meta-class without a fixed, human-defined label set. Existing solutions for this problem are limited by either the usage of a large and rigid list of vocabularies or by the dependency on complex pipelines with fragile heuristics where errors propagate across stages. Meanwhile, the ability of recent large multi-modal models (LMMs) equipped with explicit or implicit reasoning to comprehend visual-language data, decompose problems, retrieve latent knowledge, and self-correct suggests a more principled and effective alternative. Building on these capabilities, we propose FiNDR (Fine-grained Name Discovery via Reasoning), the first reasoning-augmented LMM-based framework for vocabulary-free fine-grained recognition. The system operates in three automated steps: (i) a reasoning-enabled LMM generates descriptive candidate labels for each image; (ii) a vision-language model filters and ranks these candidates to form a coherent class set; and (iii) the verified names instantiate a lightweight multi-modal classifier used at inference time. Extensive experiments on popular fine-grained classification benchmarks demonstrate state-of-the-art performance under the vocabulary-free setting, with a significant relative margin of up to 18.8% over previous approaches. Remarkably, the proposed method surpasses zero-shot baselines that exploit pre-defined ground-truth names, challenging the assumption that human-curated vocabularies define an upper bound. Additionally, we show that carefully curated prompts enable open-source LMMs to match proprietary counterparts. These findings establish reasoning-augmented LMMs as an effective foundation for scalable, fully automated, open-world fine-grained visual recognition. The source code is available on this http URL.
89. 【2512.18888】Localising Shortcut Learning in Pixel Space via Ordinal Scoring Correlations for Attribution Representations (OSCAR)
链接:https://arxiv.org/abs/2512.18888
作者:Akshit Achara,Peter Triantafillou,Esther Puyol-Antón,Alexander Hammers,Andrew P. King
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Deep neural networks, Deep neural, shortcut, neural networks, networks often exploit
备注:
点击查看摘要
Abstract:Deep neural networks often exploit shortcuts. These are spurious cues which are associated with output labels in the training data but are unrelated to task semantics. When the shortcut features are associated with sensitive attributes, shortcut learning can lead to biased model performance. Existing methods for localising and understanding shortcut learning are mostly based upon qualitative, image-level inspection and assume cues are human-visible, limiting their use in domains such as medical imaging. We introduce OSCAR (Ordinal Scoring Correlations for Attribution Representations), a model-agnostic framework for quantifying shortcut learning and localising shortcut features. OSCAR converts image-level task attribution maps into dataset-level rank profiles of image regions and compares them across three models: a balanced baseline model (BA), a test model (TS), and a sensitive attribute predictor (SA). By computing pairwise, partial, and deviation-based correlations on these rank profiles, we produce a set of quantitative metrics that characterise the degree of shortcut reliance for TS, together with a ranking of image-level regions that contribute most to it. Experiments on CelebA, CheXpert, and ADNI show that our correlations are (i) stable across seeds and partitions, (ii) sensitive to the level of association between shortcut features and output labels in the training data, and (iii) able to distinguish localised from diffuse shortcut features. As an illustration of the utility of our method, we show how worst-group performance disparities can be reduced using a simple test-time attenuation approach based on the identified shortcut regions. OSCAR provides a lightweight, pixel-space audit that yields statistical decision rules and spatial maps, enabling users to test, localise, and mitigate shortcut reliance. The code is available at this https URL
90. 【2512.18878】CrashChat: A Multimodal Large Language Model for Multitask Traffic Crash Video Analysis
链接:https://arxiv.org/abs/2512.18878
作者:Kaidi Liang,Ke Li,Xianbiao Hu,Ruwen Qin
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Automating crash video, driving video data, crash video analysis, traffic safety research, Automating crash
备注:
点击查看摘要
Abstract:Automating crash video analysis is essential to leverage the growing availability of driving video data for traffic safety research and accountability attribution in autonomous driving. Crash video analysis is a challenging multitask problem due to the complex spatiotemporal dynamics of crash events in video data and the diverse analytical requirements involved. It requires capabilities spanning crash recognition, temporal grounding, and high-level video understanding. Existing models, however, cannot perform all these tasks within a unified framework, and effective training strategies for such models remain underexplored. To fill these gaps, this paper proposes CrashChat, a multimodal large language model (MLLM) for multitask traffic crash analysis, built upon VideoLLaMA3. CrashChat acquires domain-specific knowledge through instruction fine-tuning and employs a novel multitask learning strategy based on task decoupling and grouping, which maximizes the benefit of joint learning within and across task groups while mitigating negative transfer. Numerical experiments on consolidated public datasets demonstrate that CrashChat consistently outperforms existing MLLMs across model scales and traditional vision-based methods, achieving state-of-the-art performance. It reaches near-perfect accuracy in crash recognition, a 176\% improvement in crash localization, and a 40\% improvement in the more challenging pre-crash localization. Compared to general MLLMs, it substantially enhances textual accuracy and content coverage in crash description and reasoning tasks, with 0.18-0.41 increases in BLEU scores and 0.18-0.42 increases in ROUGE scores. Beyond its strong performance, CrashChat is a convenient, end-to-end analytical tool ready for practical implementation. The dataset and implementation code for CrashChat are available at this https URL.
91. 【2512.18865】Application of deep learning approaches for medieval historical documents transcription
链接:https://arxiv.org/abs/2512.18865
作者:Maksym Voloshchuk,Bohdana Zarembovska,Mykola Kozlenko
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:solutions show excellent, handwritten Latin-language documents, drops with Latin, optical character recognition, character recognition solutions
备注: 15 pages, 15 figures, 4 tables. Originally published by CEUR Workshop Proceedings ( [this http URL](http://CEUR-WS.org) , ISSN 1613-0073), available: [this https URL](https://ceur-ws.org/Vol-4133/S_05_Kozlenko.pdf)
点击查看摘要
Abstract:Handwritten text recognition and optical character recognition solutions show excellent results with processing data of modern era, but efficiency drops with Latin documents of medieval times. This paper presents a deep learning method to extract text information from handwritten Latin-language documents of the 9th to 11th centuries. The approach takes into account the properties inherent in medieval documents. The paper provides a brief introduction to the field of historical document transcription, a first-sight analysis of the raw data, and the related works and studies. The paper presents the steps of dataset development for further training of the models. The explanatory data analysis of the processed data is provided as well. The paper explains the pipeline of deep learning models to extract text information from the document images, from detecting objects to word recognition using classification models and embedding word images. The paper reports the following results: recall, precision, F1 score, intersection over union, confusion matrix, and mean string distance. The plots of the metrics are also included. The implementation is published on the GitHub repository.
92. 【2512.18864】Cross-modal Counterfactual Explanations: Uncovering Decision Factors and Dataset Biases in Subjective Classification
链接:https://arxiv.org/abs/2512.18864
作者:Alina Elena Baia,Andrea Cavallaro
类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:Concept-driven counterfactuals explain, Concept-driven counterfactuals, classifiers by altering, counterfactuals explain decisions, altering the model
备注:
点击查看摘要
Abstract:Concept-driven counterfactuals explain decisions of classifiers by altering the model predictions through semantic changes. In this paper, we present a novel approach that leverages cross-modal decompositionality and image-specific concepts to create counterfactual scenarios expressed in natural language. We apply the proposed interpretability framework, termed Decompose and Explain (DeX), to the challenging domain of image privacy decisions, which are contextual and subjective. This application enables the quantification of the differential contributions of key scene elements to the model prediction. We identify relevant decision factors via a multi-criterion selection mechanism that considers both image similarity for minimal perturbations and decision confidence to prioritize impactful changes. This approach evaluates and compares diverse explanations, and assesses the interdependency and mutual influence among explanatory properties. By leveraging image-specific concepts, DeX generates image-grounded, sparse explanations, yielding significant improvements over the state of the art. Importantly, DeX operates as a training-free framework, offering high flexibility. Results show that DeX not only uncovers the principal contributing factors influencing subjective decisions, but also identifies underlying dataset biases allowing for targeted mitigation strategies to improve fairness.
93. 【2512.18853】VizDefender: Unmasking Visualization Tampering through Proactive Localization and Intent Inference
链接:https://arxiv.org/abs/2512.18853
作者:Sicheng Song,Yanjie Zhang,Zixin Chen,Huamin Qu,Changbo Wang,Chenhui Li
类目:Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
关键词:image editing techniques, increasingly threatened, enable subtle, subtle yet deceptive, Large Language Models
备注: IEEE Transactions on Visualization and Computer Graphics (IEEE PacificVis'26 TVCG Track)
点击查看摘要
Abstract:The integrity of data visualizations is increasingly threatened by image editing techniques that enable subtle yet deceptive tampering. Through a formative study, we define this challenge and categorize tampering techniques into two primary types: data manipulation and visual encoding manipulation. To address this, we present VizDefender, a framework for tampering detection and analysis. The framework integrates two core components: 1) a semi-fragile watermark module that protects the visualization by embedding a location map to images, which allows for the precise localization of tampered regions while preserving visual quality, and 2) an intent analysis module that leverages Multimodal Large Language Models (MLLMs) to interpret manipulation, inferring the attacker's intent and misleading effects. Extensive evaluations and user studies demonstrate the effectiveness of our methods.
94. 【2512.18843】Brain-Gen: Towards Interpreting Neural Signals for Stimulus Reconstruction Using Transformers and Latent Diffusion Models
链接:https://arxiv.org/abs/2512.18843
作者:Hasib Aslam,Muhammad Talal Faiz,Muhammad Imran Malik
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:enabled preliminary decoding, Advances in neuroscience, neuroscience and artificial, artificial intelligence, intelligence have enabled
备注: 21 pages and 7 figures
点击查看摘要
Abstract:Advances in neuroscience and artificial intelligence have enabled preliminary decoding of brain activity. However, despite the progress, the interpretability of neural representations remains limited. A significant challenge arises from the intrinsic properties of electroencephalography (EEG) signals, including high noise levels, spatial diffusion, and pronounced temporal variability. To interpret the neural mechanism underlying thoughts, we propose a transformers-based framework to extract spatial-temporal representations associated with observed visual stimuli from EEG recordings. These features are subsequently incorporated into the attention mechanisms of Latent Diffusion Models (LDMs) to facilitate the reconstruction of visual stimuli from brain activity. The quantitative evaluations on publicly available benchmark datasets demonstrate that the proposed method excels at modeling the semantic structures from EEG signals; achieving up to 6.5% increase in latent space clustering accuracy and 11.8% increase in zero shot generalization across unseen classes while having comparable Inception Score and Fréchet Inception Distance with existing baselines. Our work marks a significant step towards generalizable semantic interpretation of the EEG signals.
95. 【2512.18814】EchoMotion: Unified Human Video and Motion Generation via Dual-Modality Diffusion Transformer
链接:https://arxiv.org/abs/2512.18814
作者:Yuxiao Yang,Hualian Sheng,Sijia Cai,Jing Lin,Jiahao Wang,Bing Deng,Junzhe Lu,Haoqian Wang,Jieping Ye
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:human movements due, synthesize complex human, complex human movements, struggle to synthesize, movements due
备注: 26 pages, 16 figures
点击查看摘要
Abstract:Video generation models have advanced significantly, yet they still struggle to synthesize complex human movements due to the high degrees of freedom in human articulation. This limitation stems from the intrinsic constraints of pixel-only training objectives, which inherently bias models toward appearance fidelity at the expense of learning underlying kinematic principles. To address this, we introduce EchoMotion, a framework designed to model the joint distribution of appearance and human motion, thereby improving the quality of complex human action video generation. EchoMotion extends the DiT (Diffusion Transformer) framework with a dual-branch architecture that jointly processes tokens concatenated from different modalities. Furthermore, we propose MVS-RoPE (Motion-Video Syncronized RoPE), which offers unified 3D positional encoding for both video and motion tokens. By providing a synchronized coordinate system for the dual-modal latent sequence, MVS-RoPE establishes an inductive bias that fosters temporal alignment between the two modalities. We also propose a Motion-Video Two-Stage Training Strategy. This strategy enables the model to perform both the joint generation of complex human action videos and their corresponding motion sequences, as well as versatile cross-modal conditional generation tasks. To facilitate the training of a model with these capabilities, we construct HuMoVe, a large-scale dataset of approximately 80,000 high-quality, human-centric video-motion pairs. Our findings reveal that explicitly representing human motion is complementary to appearance, significantly boosting the coherence and plausibility of human-centric video generation.
96. 【2512.18813】Revealing Perception and Generation Dynamics in LVLMs: Mitigating Hallucinations via Validated Dominance Correction
链接:https://arxiv.org/abs/2512.18813
作者:Guangtao Lyu,Xinyi Cheng,Chenghao Xu,Qi Liu,Muli Yang,Fen Fang,Huilin Chen,Jiexi Yan,Xu Yang,Cheng Deng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Vision-Language Models, shown remarkable capabilities, Large Vision-Language, remarkable capabilities, persistent challenge
备注:
点击查看摘要
Abstract:Large Vision-Language Models (LVLMs) have shown remarkable capabilities, yet hallucinations remain a persistent challenge. This work presents a systematic analysis of the internal evolution of visual perception and token generation in LVLMs, revealing two key patterns. First, perception follows a three-stage GATE process: early layers perform a Global scan, intermediate layers Approach and Tighten on core content, and later layers Explore supplementary regions. Second, generation exhibits an SAD (Subdominant Accumulation to Dominant) pattern, where hallucinated tokens arise from the repeated accumulation of subdominant tokens lacking support from attention (visual perception) or feed-forward network (internal knowledge). Guided by these findings, we devise the VDC (Validated Dominance Correction) strategy, which detects unsupported tokens and replaces them with validated dominant ones to improve output reliability. Extensive experiments across multiple models and benchmarks confirm that VDC substantially mitigates hallucinations.
97. 【2512.18809】FedVideoMAE: Efficient Privacy-Preserving Federated Video Moderation
链接:https://arxiv.org/abs/2512.18809
作者:Ziyuan Tao,Chuanzhi Xu,Sandaru Jayawardana,Wei Bao,Kanchana Thilakarathna,Teng Joon Lim
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
关键词:cloud-based pipelines expose, pipelines expose raw, short-form video platforms, video platforms increases, expose raw videos
备注:
点击查看摘要
Abstract:The rapid growth of short-form video platforms increases the need for privacy-preserving moderation, as cloud-based pipelines expose raw videos to privacy risks, high bandwidth costs, and inference latency. To address these challenges, we propose an on-device federated learning framework for video violence detection that integrates self-supervised VideoMAE representations, LoRA-based parameter-efficient adaptation, and defense-in-depth privacy protection. Our approach reduces the trainable parameter count to 5.5M (~3.5% of a 156M backbone) and incorporates DP-SGD with configurable privacy budgets and secure aggregation. Experiments on RWF-2000 with 40 clients achieve 77.25% accuracy without privacy protection and 65-66% under strong differential privacy, while reducing communication cost by $28.3\times$ compared to full-model federated learning. The code is available at: {this https URL}
98. 【2512.18804】mpo as the Stable Cue: Hierarchical Mixture of Tempo and Beat Experts for Music to 3D Dance Generation
链接:https://arxiv.org/abs/2512.18804
作者:Guangtao Lyu,Chenghao Xu,Qi Liu,Jiexi Yan,Muli Yang,Fen Fang,Cheng Deng
类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
关键词:rhythmically synchronized human, synchronized human dance, dance generation aims, aims to synthesize, synthesize realistic
备注:
点击查看摘要
Abstract:Music to 3D dance generation aims to synthesize realistic and rhythmically synchronized human dance from music. While existing methods often rely on additional genre labels to further improve dance generation, such labels are typically noisy, coarse, unavailable, or insufficient to capture the diversity of real-world music, which can result in rhythm misalignment or stylistic drift. In contrast, we observe that tempo, a core property reflecting musical rhythm and pace, remains relatively consistent across datasets and genres, typically ranging from 60 to 200 BPM. Based on this finding, we propose TempoMoE, a hierarchical tempo-aware Mixture-of-Experts module that enhances the diffusion model and its rhythm perception. TempoMoE organizes motion experts into tempo-structured groups for different tempo ranges, with multi-scale beat experts capturing fine- and long-range rhythmic dynamics. A Hierarchical Rhythm-Adaptive Routing dynamically selects and fuses experts from music features, enabling flexible, rhythm-aligned generation without manual genre labels. Extensive experiments demonstrate that TempoMoE achieves state-of-the-art results in dance quality and rhythm alignment.
99. 【2512.18784】Eff-GRot: Efficient and Generalizable Rotation Estimation with Transformers
链接:https://arxiv.org/abs/2512.18784
作者:Fanis Mathioulakis,Gorjan Radevski,Tinne Tuytelaars
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:RGB images, generalizable rotation estimation, generalizable rotation, efficient rotation estimation, rotation estimation
备注:
点击查看摘要
Abstract:We introduce Eff-GRot, an approach for efficient and generalizable rotation estimation from RGB images. Given a query image and a set of reference images with known orientations, our method directly predicts the object's rotation in a single forward pass, without requiring object- or category-specific training. At the core of our framework is a transformer that performs a comparison in the latent space, jointly processing rotation-aware representations from multiple references alongside a query. This design enables a favorable balance between accuracy and computational efficiency while remaining simple, scalable, and fully end-to-end. Experimental results show that Eff-GRot offers a promising direction toward more efficient rotation estimation, particularly in latency-sensitive applications.
100. 【2512.18772】In-Context Audio Control of Video Diffusion Transformers
链接:https://arxiv.org/abs/2512.18772
作者:Wenze Liu,Weicai Ye,Minghong Cai,Quande Liu,Xintao Wang,Xiangyu Yue
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:handle multiple conditional, multiple conditional inputs, transformer-based foundation models, Recent advancements, conditional inputs in-context
备注:
点击查看摘要
Abstract:Recent advancements in video generation have seen a shift towards unified, transformer-based foundation models that can handle multiple conditional inputs in-context. However, these models have primarily focused on modalities like text, images, and depth maps, while strictly time-synchronous signals like audio have been underexplored. This paper introduces In-Context Audio Control of video diffusion transformers (ICAC), a framework that investigates the integration of audio signals for speech-driven video generation within a unified full-attention architecture, akin to FullDiT. We systematically explore three distinct mechanisms for injecting audio conditions: standard cross-attention, 2D self-attention, and unified 3D self-attention. Our findings reveal that while 3D attention offers the highest potential for capturing spatio-temporal audio-visual correlations, it presents significant training challenges. To overcome this, we propose a Masked 3D Attention mechanism that constrains the attention pattern to enforce temporal alignment, enabling stable training and superior performance. Our experiments demonstrate that this approach achieves strong lip synchronization and video quality, conditioned on an audio stream and reference images.
101. 【2512.18766】MaskFocus: Focusing Policy Optimization on Critical Steps for Masked Image Generation
链接:https://arxiv.org/abs/2512.18766
作者:Guohui Zhang,Hu Yu,Xiaoxiao Ma,Yaning Pan,Hang Xu,Feng Zhao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:demonstrated significant potential, Reinforcement learning, autoregressive visual generative, models remains challenging, post-training language models
备注: Code is available at [this https URL](https://github.com/zghhui/MaskFocus)
点击查看摘要
Abstract:Reinforcement learning (RL) has demonstrated significant potential for post-training language models and autoregressive visual generative models, but adapting RL to masked generative models remains challenging. The core factor is that policy optimization requires accounting for the probability likelihood of each step due to its multi-step and iterative refinement process. This reliance on entire sampling trajectories introduces high computational cost, whereas natively optimizing random steps often yields suboptimal results. In this paper, we present MaskFocus, a novel RL framework that achieves effective policy optimization for masked generative models by focusing on critical steps. Specifically, we determine the step-level information gain by measuring the similarity between the intermediate images at each sampling step and the final generated image. Crucially, we leverage this to identify the most critical and valuable steps and execute focused policy optimization on them. Furthermore, we design a dynamic routing sampling mechanism based on entropy to encourage the model to explore more valuable masking strategies for samples with low entropy. Extensive experiments on multiple Text-to-Image benchmarks validate the effectiveness of our method.
102. 【2512.18750】Context-Aware Network Based on Multi-scale Spatio-temporal Attention for Action Recognition in Videos
链接:https://arxiv.org/abs/2512.18750
作者:Xiaoyang Li,Wenzhu Yang,Kanglin Wang,Tiebiao Wang,Qingsong Fei
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:video understanding, requiring the comprehensive, Temporal Cue Module, critical task, task in video
备注: 21 pages, 4 figures. Preprint under review for journal submission
点击查看摘要
Abstract:Action recognition is a critical task in video understanding, requiring the comprehensive capture of spatio-temporal cues across various scales. However, existing methods often overlook the multi-granularity nature of actions. To address this limitation, we introduce the Context-Aware Network (CAN). CAN consists of two core modules: the Multi-scale Temporal Cue Module (MTCM) and the Group Spatial Cue Module (GSCM). MTCM effectively extracts temporal cues at multiple scales, capturing both fast-changing motion details and overall action flow. GSCM, on the other hand, extracts spatial cues at different scales by grouping feature maps and applying specialized extraction methods to each group. Experiments conducted on five benchmark datasets (Something-Something V1 and V2, Diving48, Kinetics-400, and UCF101) demonstrate the effectiveness of CAN. Our approach achieves competitive performance, outperforming most mainstream methods, with accuracies of 50.4% on Something-Something V1, 63.9% on Something-Something V2, 88.4% on Diving48, 74.9% on Kinetics-400, and 86.9% on UCF101. These results highlight the importance of capturing multi-scale spatio-temporal cues for robust action recognition.
103. 【2512.18747】IPCV: Information-Preserving Compression for MLLM Visual Encoders
链接:https://arxiv.org/abs/2512.18747
作者:Yuan Chen,Zichen Wen,Yuzhou Wu,Xuyang Liu,Shuang Chen,Junpeng Ma,Weijia Li,Conghui He,Linfeng Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Multimodal Large Language, Multimodal Large, Vision Transformer, high computational cost
备注: 13 pages, 6 figures
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) deliver strong vision-language performance but at high computational cost, driven by numerous visual tokens processed by the Vision Transformer (ViT) encoder. Existing token pruning strategies are inadequate: LLM-stage token pruning overlooks the ViT's overhead, while conventional ViT token pruning, without language guidance, risks discarding textually critical visual cues and introduces feature distortions amplified by the ViT's bidirectional attention. To meet these challenges, we propose IPCV, a training-free, information-preserving compression framework for MLLM visual encoders. IPCV enables aggressive token pruning inside the ViT via Neighbor-Guided Reconstruction (NGR) that temporarily reconstructs pruned tokens to participate in attention with minimal overhead, then fully restores them before passing to the LLM. Besides, we introduce Attention Stabilization (AS) to further alleviate the negative influence from token pruning by approximating the K/V of pruned tokens. It can be directly applied to previous LLM-side token pruning methods to enhance their performance. Extensive experiments show that IPCV substantially reduces end-to-end computation and outperforms state-of-the-art training-free token compression methods across diverse image and video benchmarks. Our code is available at this https URL.
104. 【2512.18745】InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search
链接:https://arxiv.org/abs/2512.18745
作者:Kaican Li,Lewei Yao,Jiannan Wu,Tiezheng Yu,Jierun Chen,Haoli Bai,Lu Hou,Lanqing Hong,Wei Zhang,Nevin L. Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:sophisticated blend, reasoning, visual, multimodal, Abstract
备注:
点击查看摘要
Abstract:The ability for AI agents to "think with images" requires a sophisticated blend of reasoning and perception. However, current open multimodal agents still largely fall short on the reasoning aspect crucial for real-world tasks like analyzing documents with dense charts/diagrams and navigating maps. To address this gap, we introduce O3-Bench, a new benchmark designed to evaluate multimodal reasoning with interleaved attention to visual details. O3-Bench features challenging problems that require agents to piece together subtle visual information from distinct image areas through multi-step reasoning. The problems are highly challenging even for frontier systems like OpenAI o3, which only obtains 40.8% accuracy on O3-Bench. To make progress, we propose InSight-o3, a multi-agent framework consisting of a visual reasoning agent (vReasoner) and a visual search agent (vSearcher) for which we introduce the task of generalized visual search -- locating relational, fuzzy, or conceptual regions described in free-form language, beyond just simple objects or figures in natural images. We then present a multimodal LLM purpose-trained for this task via reinforcement learning. As a plug-and-play agent, our vSearcher empowers frontier multimodal models (as vReasoners), significantly improving their performance on a wide range of benchmarks. This marks a concrete step towards powerful o3-like open systems. Our code and dataset can be found at this https URL .
105. 【2512.18741】Memorize-and-Generate: Towards Long-Term Consistency in Real-Time Video Generation
链接:https://arxiv.org/abs/2512.18741
作者:Tianrui Zhu,Shiyi Zhang,Zhirui Sun,Jingqi Tian,Yansong Tang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieved significant progress, Frame-level autoregressive, enabling real-time video, bidirectional diffusion models, interactive world models
备注:
点击查看摘要
Abstract:Frame-level autoregressive (frame-AR) models have achieved significant progress, enabling real-time video generation comparable to bidirectional diffusion models and serving as a foundation for interactive world models and game engines. However, current approaches in long video generation typically rely on window attention, which naively discards historical context outside the window, leading to catastrophic forgetting and scene inconsistency; conversely, retaining full history incurs prohibitive memory costs. To address this trade-off, we propose \textbf{Memorize-and-Generate (MAG)}, a framework that decouples memory compression and frame generation into distinct tasks. Specifically, we train a memory model to compress historical information into a compact KV cache, and a separate generator model to synthesize subsequent frames utilizing this compressed representation. Furthermore, we introduce \textbf{MAG-Bench} to strictly evaluate historical memory retention. Extensive experiments demonstrate that MAG achieves superior historical scene consistency while maintaining competitive performance on standard video generation benchmarks.
106. 【2512.18738】AMLID: An Adaptive Multispectral Landmine Identification Dataset for Drone-Based Detection
链接:https://arxiv.org/abs/2512.18738
作者:James E. Gallagher,Edward J. Oughton
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:million mines deployed, persistent humanitarian threat, claiming approximately, casualties annually, Unmanned Aerial Systems
备注: 8 pages with three figures and one table
点击查看摘要
Abstract:Landmines remain a persistent humanitarian threat, with an estimated 110 million mines deployed across 60 countries, claiming approximately 26,000 casualties annually. Current detection methods are hazardous, inefficient, and prohibitively expensive. We present the Adaptive Multispectral Landmine Identification Dataset (AMLID), the first open-source dataset combining Red-Green-Blue (RGB) and Long-Wave Infrared (LWIR) imagery for Unmanned Aerial Systems (UAS)-based landmine detection. AMLID comprises of 12,078 labeled images featuring 21 globally deployed landmine types across anti-personnel and anti-tank categories in both metal and plastic compositions. The dataset spans 11 RGB-LWIR fusion levels, four sensor altitudes, two seasonal periods, and three daily illumination conditions. By providing comprehensive multispectral coverage across diverse environmental variables, AMLID enables researchers to develop and benchmark adaptive detection algorithms without requiring access to live ordnance or expensive data collection infrastructure, thereby democratizing humanitarian demining research.
107. 【2512.18735】$M^3-Verse$: A "Spot the Difference" Challenge for Large Multimodal Models
链接:https://arxiv.org/abs/2512.18735
作者:Kewei Wei,Bocheng Hu,Jie Cao,Xiaohan Chen,Zhengxi Lu,Wubing Xia,Weili Xu,Jiaao Wu,Junchen He,Mingyu Jia,Ciyun Zhao,Ye Sun,Yizhi Li,Zhonghan Zhao,Jian Zhang,Gaoang Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Modern Large Multimodal, Large Multimodal Models, Modern Large, Large Multimodal, demonstrated extraordinary ability
备注:
点击查看摘要
Abstract:Modern Large Multimodal Models (LMMs) have demonstrated extraordinary ability in static image and single-state spatial-temporal understanding. However, their capacity to comprehend the dynamic changes of objects within a shared spatial context between two distinct video observations, remains largely unexplored. This ability to reason about transformations within a consistent environment is particularly crucial for advancements in the field of spatial intelligence. In this paper, we introduce $M^3-Verse$, a Multi-Modal, Multi-State, Multi-Dimensional benchmark, to formally evaluate this capability. It is built upon paired videos that provide multi-perspective observations of an indoor scene before and after a state change. The benchmark contains a total of 270 scenes and 2,932 questions, which are categorized into over 50 subtasks that probe 4 core capabilities. We evaluate 16 state-of-the-art LMMs and observe their limitations in tracking state transitions. To address these challenges, we further propose a simple yet effective baseline that achieves significant performance improvements in multi-state perception. $M^3-Verse$ thus provides a challenging new testbed to catalyze the development of next-generation models with a more holistic understanding of our dynamic visual world. You can get the construction pipeline from this https URL and full benchmark data from this https URL.
108. 【2512.18734】Breast Cancer Recurrence Risk Prediction Based on Multiple Instance Learning
链接:https://arxiv.org/abs/2512.18734
作者:Jinqiu Chen,Huyan Xu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Predicting breast cancer, breast cancer recurrence, Predicting breast, critical clinical challenge, cancer recurrence risk
备注:
点击查看摘要
Abstract:Predicting breast cancer recurrence risk is a critical clinical challenge. This study investigates the potential of computational pathology to stratify patients using deep learning on routine Hematoxylin and Eosin (HE) stained whole-slide images (WSIs). We developed and compared three Multiple Instance Learning (MIL) frameworks -- CLAM-SB, ABMIL, and ConvNeXt-MIL-XGBoost -- on an in-house dataset of 210 patient cases. The models were trained to predict 5-year recurrence risk, categorized into three tiers (low, medium, high), with ground truth labels established by the 21-gene Recurrence Score. Features were extracted using the UNI and CONCH pre-trained models. In a 5-fold cross-validation, the modified CLAM-SB model demonstrated the strongest performance, achieving a mean Area Under the Curve (AUC) of 0.836 and a classification accuracy of 76.2%. Our findings demonstrate the feasibility of using deep learning on standard histology slides for automated, genomics-correlated risk stratification, highlighting a promising pathway toward rapid and cost-effective clinical decision support.
109. 【2512.18718】Rectification Reimagined: A Unified Mamba Model for Image Correction and Rectangling with Prompts
链接:https://arxiv.org/abs/2512.18718
作者:Linwei Qiu,Gongzhe Li,Xiaozhe Zhang,Qinlin Sun,Fengying Xie
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:practical photography systems, correction and rectangling, rectangling are valuable, photography systems, Image correction
备注: AAAI 2026
点击查看摘要
Abstract:Image correction and rectangling are valuable tasks in practical photography systems such as smartphones. Recent remarkable advancements in deep learning have undeniably brought about substantial performance improvements in these fields. Nevertheless, existing methods mainly rely on task-specific architectures. This significantly restricts their generalization ability and effective application across a wide range of different tasks. In this paper, we introduce the Unified Rectification Framework (UniRect), a comprehensive approach that addresses these practical tasks from a consistent distortion rectification perspective. Our approach incorporates various task-specific inverse problems into a general distortion model by simulating different types of lenses. To handle diverse distortions, UniRect adopts one task-agnostic rectification framework with a dual-component structure: a {Deformation Module}, which utilizes a novel Residual Progressive Thin-Plate Spline (RP-TPS) model to address complex geometric deformations, and a subsequent Restoration Module, which employs Residual Mamba Blocks (RMBs) to counteract the degradation caused by the deformation process and enhance the fidelity of the output image. Moreover, a Sparse Mixture-of-Experts (SMoEs) structure is designed to circumvent heavy task competition in multi-task learning due to varying distortions. Extensive experiments demonstrate that our models have achieved state-of-the-art performance compared with other up-to-date methods.
110. 【2512.18692】EcoSplat: Efficiency-controllable Feed-forward 3D Gaussian Splatting from Multi-view Images
链接:https://arxiv.org/abs/2512.18692
作者:Jongmin Park,Minh-Quan Viet Bui,Juan Luis Gonzalez Bello,Jaeho Moon,Jihyong Oh,Munchurl Kim
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:one-pass scene reconstruction, enables efficient one-pass, efficient one-pass scene, Gaussian Splatting, enables efficient
备注: The first two authors contributed equally to this work (equal contribution). The last two authors advised equally to this work. Please visit our project page at [this https URL](https://kaist-viclab.github.io/ecosplat-site/)
点击查看摘要
Abstract:Feed-forward 3D Gaussian Splatting (3DGS) enables efficient one-pass scene reconstruction, providing 3D representations for novel view synthesis without per-scene optimization. However, existing methods typically predict pixel-aligned primitives per-view, producing an excessive number of primitives in dense-view settings and offering no explicit control over the number of predicted Gaussians. To address this, we propose EcoSplat, the first efficiency-controllable feed-forward 3DGS framework that adaptively predicts the 3D representation for any given target primitive count at inference time. EcoSplat adopts a two-stage optimization process. The first stage is Pixel-aligned Gaussian Training (PGT) where our model learns initial primitive prediction. The second stage is Importance-aware Gaussian Finetuning (IGF) stage where our model learns rank primitives and adaptively adjust their parameters based on the target primitive count. Extensive experiments across multiple dense-view settings show that EcoSplat is robust and outperforms state-of-the-art methods under strict primitive-count constraints, making it well-suited for flexible downstream rendering tasks.
111. 【2512.18684】A Study of Finetuning Video Transformers for Multi-view Geometry Tasks
链接:https://arxiv.org/abs/2512.18684
作者:Huimin Wu,Kwang-Ting Cheng,Stephen Lin,Zhirong Wu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:fine-tuning video foundation, video foundation models, multi-view geometry tasks, paper presents, presents an investigation
备注: AAAI 20206, Project website: [this http URL](http://geovit-aaai26.github.io)
点击查看摘要
Abstract:This paper presents an investigation of vision transformer learning for multi-view geometry tasks, such as optical flow estimation, by fine-tuning video foundation models. Unlike previous methods that involve custom architectural designs and task-specific pretraining, our research finds that general-purpose models pretrained on videos can be readily transferred to multi-view problems with minimal adaptation. The core insight is that general-purpose attention between patches learns temporal and spatial information for geometric reasoning. We demonstrate that appending a linear decoder to the Transformer backbone produces satisfactory results, and iterative refinement can further elevate performance to stateof-the-art levels. This conceptually simple approach achieves top cross-dataset generalization results for optical flow estimation with end-point error (EPE) of 0.69, 1.78, and 3.15 on the Sintel clean, Sintel final, and KITTI datasets, respectively. Our method additionally establishes a new record on the online test benchmark with EPE values of 0.79, 1.88, and F1 value of 3.79. Applications to 3D depth estimation and stereo matching also show strong performance, illustrating the versatility of video-pretrained models in addressing geometric vision tasks.
112. 【2512.18679】brat: Aligned Multi-View Embeddings for Brain MRI Analysis
链接:https://arxiv.org/abs/2512.18679
作者:Maxime Kayser,Maksim Gridnev,Wanting Wang,Max Bain,Aneesh Rangnekar,Avijit Chatterjee,Aleksandr Petrov,Harini Veeraraghavan,Nathaniel C. Swinburne
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:magnetic resonance imaging, representation learning framework, brain magnetic resonance, report alignment transformer, multi-view representation learning
备注: First round accept at WACV 2026
点击查看摘要
Abstract:We present brat (brain report alignment transformer), a multi-view representation learning framework for brain magnetic resonance imaging (MRI) trained on MRIs paired with clinical reports. Brain MRIs present unique challenges due to the presence of numerous, highly varied, and often subtle abnormalities that are localized to a few slices within a 3D volume. To address these challenges, we introduce a brain MRI dataset $10\times$ larger than existing ones, containing approximately 80,000 3D scans with corresponding radiology reports, and propose a multi-view pre-training approach inspired by advances in document retrieval. We develop an implicit query-feature matching mechanism and adopt concepts from quality-diversity to obtain multi-view embeddings of MRIs that are aligned with the clinical features given by report sentences. We evaluate our approach across multiple vision-language and vision tasks, demonstrating substantial performance improvements. The brat foundation models are publicly released.
113. 【2512.18675】AsyncDiff: Asynchronous Timestep Conditioning for Enhanced Text-to-Image Diffusion Inference
链接:https://arxiv.org/abs/2512.18675
作者:Longhuan Xu,Feng Yin,Cunjian Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:numerical integrator advances, diffusion inference typically, Relative Policy Optimization, Group Relative Policy, typically follows synchronized
备注: Under review
点击查看摘要
Abstract:Text-to-image diffusion inference typically follows synchronized schedules, where the numerical integrator advances the latent state to the same timestep at which the denoiser is conditioned. We propose an asynchronous inference mechanism that decouples these two, allowing the denoiser to be conditioned at a different, learned timestep while keeping image update schedule unchanged. A lightweight timestep prediction module (TPM), trained with Group Relative Policy Optimization (GRPO), selects a more feasible conditioning timestep based on the current state, effectively choosing a desired noise level to control image detail and textural richness. At deployment, a scaling hyper-parameter can be used to interpolate between the original and de-synchronized timesteps, enabling conservative or aggressive adjustments. To keep the study computationally affordable, we cap the inference at 15 steps for SD3.5 and 10 steps for Flux. Evaluated on Stable Diffusion 3.5 Medium and Flux.1-dev across MS-COCO 2014 and T2I-CompBench datasets, our method optimizes a composite reward that averages Image Reward, HPSv2, CLIP Score and Pick Score, and shows consistent improvement.
114. 【2512.18671】SmartSight: Mitigating Hallucination in Video-LLMs Without Compromising Video Understanding via Temporal Attention Collapse
链接:https://arxiv.org/abs/2512.18671
作者:Yiming Sun,Mi Zhang,Feifei Li,Geng Hong,Min Yang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Language Models, Video Large Language, Large Language, substantial safety risk, perceptual hallucinations pose
备注: AAAI26 accepted
点击查看摘要
Abstract:Despite Video Large Language Models having rapidly advanced in recent years, perceptual hallucinations pose a substantial safety risk, which severely restricts their real-world applicability. While several methods for hallucination mitigation have been proposed, they often compromise the model's capacity for video understanding and reasoning. In this work, we propose SmartSight, a pioneering step to address this issue in a training-free manner by leveraging the model's own introspective capabilities. Specifically, SmartSight generates multiple candidate responses to uncover low-hallucinated outputs that are often obscured by standard greedy decoding. It assesses the hallucination of each response using the Temporal Attention Collapse score, which measures whether the model over-focuses on trivial temporal regions of the input video when generating the response. To improve efficiency, SmartSight identifies the Visual Attention Vanishing point, enabling more accurate hallucination estimation and early termination of hallucinated responses, leading to a substantial reduction in decoding cost. Experiments show that SmartSight substantially lowers hallucinations for Qwen2.5-VL-7B by 10.59% on VRIPT-HAL, while simultaneously enhancing video understanding and reasoning, boosting performance on VideoMMMU by up to 8.86%. These results highlight SmartSight's effectiveness in improving the reliability of open-source Video-LLMs.
115. 【2512.18662】Offline Reinforcement Learning for End-to-End Autonomous Driving
链接:https://arxiv.org/abs/2512.18662
作者:Chihiro Noguchi,Takaki Yamamoto
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:persistent failure modes, failure modes remain, modes remain due, autonomous driving models, unified optimization
备注: 15 pages
点击查看摘要
Abstract:End-to-end (E2E) autonomous driving models that take only camera images as input and directly predict a future trajectory are appealing for their computational efficiency and potential for improved generalization via unified optimization; however, persistent failure modes remain due to reliance on imitation learning (IL). While online reinforcement learning (RL) could mitigate IL-induced issues, the computational burden of neural rendering-based simulation and large E2E networks renders iterative reward and hyperparameter tuning costly. We introduce a camera-only E2E offline RL framework that performs no additional exploration and trains solely on a fixed simulator dataset. Offline RL offers strong data efficiency and rapid experimental iteration, yet is susceptible to instability from overestimation on out-of-distribution (OOD) actions. To address this, we construct pseudo ground-truth trajectories from expert driving logs and use them as a behavior regularization signal, suppressing imitation of unsafe or suboptimal behavior while stabilizing value learning. Training and closed-loop evaluation are conducted in a neural rendering environment learned from the public nuScenes dataset. Empirically, the proposed method achieves substantial improvements in collision rate and route completion compared with IL baselines. Our code will be available at [URL].
116. 【2512.18660】PMPGuard: Catching Pseudo-Matched Pairs in Remote Sensing Image-Text Retrieval
链接:https://arxiv.org/abs/2512.18660
作者:Pengxiang Ouyang,Qing Ma,Zheng Wang,Cong Bai
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
关键词:aligned image-text pairs, faces significant challenges, Remote sensing, retrieval faces significant, Pseudo-Matched Pairs
备注:
点击查看摘要
Abstract:Remote sensing (RS) image-text retrieval faces significant challenges in real-world datasets due to the presence of Pseudo-Matched Pairs (PMPs), semantically mismatched or weakly aligned image-text pairs, which hinder the learning of reliable cross-modal alignments. To address this issue, we propose a novel retrieval framework that leverages Cross-Modal Gated Attention and a Positive-Negative Awareness Attention mechanism to mitigate the impact of such noisy associations. The gated module dynamically regulates cross-modal information flow, while the awareness mechanism explicitly distinguishes informative (positive) cues from misleading (negative) ones during alignment learning. Extensive experiments on three benchmark RS datasets, i.e., RSICD, RSITMD, and RS5M, demonstrate that our method consistently achieves state-of-the-art performance, highlighting its robustness and effectiveness in handling real-world mismatches and PMPs in RS image-text retrieval tasks.
117. 【2512.18655】SplatBright: Generalizable Low-Light Scene Reconstruction from Sparse Views via Physically-Guided Gaussian Enhancement
链接:https://arxiv.org/abs/2512.18655
作者:Yue Wen,Liang Song,Hesheng Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:degraded color fidelity, remains challenging due, views remains challenging, sparse views remains, color fidelity
备注:
点击查看摘要
Abstract:Low-light 3D reconstruction from sparse views remains challenging due to exposure imbalance and degraded color fidelity. While existing methods struggle with view inconsistency and require per-scene training, we propose SplatBright, which is, to our knowledge, the first generalizable 3D Gaussian framework for joint low-light enhancement and reconstruction from sparse sRGB inputs. Our key idea is to integrate physically guided illumination modeling with geometry-appearance decoupling for consistent low-light reconstruction. Specifically, we adopt a dual-branch predictor that provides stable geometric initialization of 3D Gaussian parameters. On the appearance side, illumination consistency leverages frequency priors to enable controllable and cross-view coherent lighting, while an appearance refinement module further separates illumination, material, and view-dependent cues to recover fine texture. To tackle the lack of large-scale geometrically consistent paired data, we synthesize dark views via a physics-based camera model for training. Extensive experiments on public and self-collected datasets demonstrate that SplatBright achieves superior novel view synthesis, cross-view consistency, and better generalization to unseen low-light scenes compared with both 2D and 3D methods.
118. 【2512.18651】Adversarial Robustness in Zero-Shot Learning:An Empirical Study on Class and Concept-Level Vulnerabilities
链接:https://arxiv.org/abs/2512.18651
作者:Zhiyuan Peng,Zihan Ye,Shreyank N Gowda,Yuping Yan,Haotian Xu,Ling Shao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:enable image classifiers, ZSL models, Generalized Zero-shot Learning, ZSL, Zero-shot Learning
备注:
点击查看摘要
Abstract:Zero-shot Learning (ZSL) aims to enable image classifiers to recognize images from unseen classes that were not included during training. Unlike traditional supervised classification, ZSL typically relies on learning a mapping from visual features to predefined, human-understandable class concepts. While ZSL models promise to improve generalization and interpretability, their robustness under systematic input perturbations remain unclear. In this study, we present an empirical analysis about the robustness of existing ZSL methods at both classlevel and concept-level. Specifically, we successfully disrupted their class prediction by the well-known non-target class attack (clsA). However, in the Generalized Zero-shot Learning (GZSL) setting, we observe that the success of clsA is only at the original best-calibrated point. After the attack, the optimal bestcalibration point shifts, and ZSL models maintain relatively strong performance at other calibration points, indicating that clsA results in a spurious attack success in the GZSL. To address this, we propose the Class-Bias Enhanced Attack (CBEA), which completely eliminates GZSL accuracy across all calibrated points by enhancing the gap between seen and unseen class this http URL, at concept-level attack, we introduce two novel attack modes: Class-Preserving Concept Attack (CPconA) and NonClass-Preserving Concept Attack (NCPconA). Our extensive experiments evaluate three typical ZSL models across various architectures from the past three years and reveal that ZSL models are vulnerable not only to the traditional class attack but also to concept-based attacks. These attacks allow malicious actors to easily manipulate class predictions by erasing or introducing concepts. Our findings highlight a significant performance gap between existing approaches, emphasizing the need for improved adversarial robustness in current ZSL models.
119. 【2512.18640】Geometric-Photometric Event-based 3D Gaussian Ray Tracing
链接:https://arxiv.org/abs/2512.18640
作者:Kai Kohyama,Yoshimitsu Aoki,Guillermo Gallego,Shintaro Shiba
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
关键词:traditional frame-based cameras, Event cameras offer, high temporal resolution, cameras offer, frame-based cameras
备注: 15 pages, 10 figures, 5 tables
点击查看摘要
Abstract:Event cameras offer a high temporal resolution over traditional frame-based cameras, which makes them suitable for motion and structure estimation. However, it has been unclear how event-based 3D Gaussian Splatting (3DGS) approaches could leverage fine-grained temporal information of sparse events. This work proposes a framework to address the trade-off between accuracy and temporal resolution in event-based 3DGS. Our key idea is to decouple the rendering into two branches: event-by-event geometry (depth) rendering and snapshot-based radiance (intensity) rendering, by using ray-tracing and the image of warped events. The extensive evaluation shows that our method achieves state-of-the-art performance on the real-world datasets and competitive performance on the synthetic dataset. Also, the proposed method works without prior information (e.g., pretrained image reconstruction models) or COLMAP-based initialization, is more flexible in the event selection number, and achieves sharp reconstruction on scene edges with fast training time. We hope that this work deepens our understanding of the sparse nature of events for 3D reconstruction. The code will be released.
120. 【2512.18635】Uni-Neur2Img: Unified Neural Signal-Guided Image Generation, Editing, and Stylization via Diffusion Transformers
链接:https://arxiv.org/abs/2512.18635
作者:Xiyue Bai,Ronghao Yu,Jia Xiu,Pengfei Zhou,Jie Xia,Peng Ji
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Brain-computer interaction, intersection of neuroscience, immense potential, Neural, editing images directly
备注:
点击查看摘要
Abstract:Generating or editing images directly from Neural signals has immense potential at the intersection of neuroscience, vision, and Brain-computer interaction. In this paper, We present Uni-Neur2Img, a unified framework for neural signal-driven image generation and editing. The framework introduces a parameter-efficient LoRA-based neural signal injection module that independently processes each conditioning signal as a pluggable component, facilitating flexible multi-modal conditioning without altering base model parameters. Additionally, we employ a causal attention mechanism accommodate the long-sequence modeling demands of conditional generation tasks. Existing neural-driven generation research predominantly focuses on textual modalities as conditions or intermediate representations, resulting in limited exploration of visual modalities as direct conditioning signals. To bridge this research gap, we introduce the EEG-Style dataset. We conduct comprehensive evaluations across public benchmarks and self-collected neural signal datasets: (1) EEG-driven image generation on the public CVPR40 dataset; (2) neural signal-guided image editing on the public Loongx dataset for semantic-aware local modifications; and (3) EEG-driven style transfer on our self-collected EEG-Style dataset. Extensive experimental results demonstrate significant improvements in generation fidelity, editing consistency, and style transfer quality while maintaining low computational overhead and strong scalability to additional modalities. Thus, Uni-Neur2Img offers a unified, efficient, and extensible solution for bridging neural signals and visual content generation.
121. 【2512.18614】PTTA: A Pure Text-to-Animation Framework for High-Quality Creation
链接:https://arxiv.org/abs/2512.18614
作者:Ruiqi Chen,Kaitong Cai,Yijia Fan,Keze Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:manual labor cost, Traditional animation production, production involves complex, involves complex pipelines, significant manual labor
备注: Under submission
点击查看摘要
Abstract:Traditional animation production involves complex pipelines and significant manual labor cost. While recent video generation models such as Sora, Kling, and CogVideoX achieve impressive results on natural video synthesis, they exhibit notable limitations when applied to animation generation. Recent efforts, such as AniSora, demonstrate promising performance by fine-tuning image-to-video models for animation styles, yet analogous exploration in the text-to-video setting remains limited. In this work, we present PTTA, a pure text-to-animation framework for high-quality animation creation. We first construct a small-scale but high-quality paired dataset of animation videos and textual descriptions. Building upon the pretrained text-to-video model HunyuanVideo, we perform fine-tuning to adapt it to animation-style generation. Extensive visual evaluations across multiple dimensions show that the proposed approach consistently outperforms comparable baselines in animation video synthesis.
Comments:
Under submission
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2512.18614 [cs.CV]
(or
arXiv:2512.18614v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2512.18614
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
122. 【2512.18613】xt2Graph VPR: A Text-to-Graph Expert System for Explainable Place Recognition in Changing Environments
链接:https://arxiv.org/abs/2512.18613
作者:Saeideh Yousefzadeh,Hamidreza Pourreza
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Visual Place Recognition, long-term deployment requires, deployment requires reasoning, Visual Place, Graph Attention Network
备注: Preprint version
点击查看摘要
Abstract:Visual Place Recognition (VPR) in long-term deployment requires reasoning beyond pixel similarity: systems must make transparent, interpretable decisions that remain robust under lighting, weather and seasonal change. We present Text2Graph VPR, an explainable semantic localization system that converts image sequences into textual scene descriptions, parses those descriptions into structured scene graphs, and reasons over the resulting graphs to identify places. Scene graphs capture objects, attributes and pairwise relations; we aggregate per-frame graphs into a compact place representation and perform retrieval with a dual-similarity mechanism that fuses learned Graph Attention Network (GAT) embeddings and a Shortest-Path (SP) kernel for structural matching. This hybrid design enables both learned semantic matching and topology-aware comparison, and -- critically -- produces human-readable intermediate representations that support diagnostic analysis and improve transparency in the decision process. We validate the system on Oxford RobotCar and MSLS (Amman/San Francisco) benchmarks and demonstrate robust retrieval under severe appearance shifts, along with zero-shot operation using human textual queries. The results illustrate that semantic, graph-based reasoning is a viable and interpretable alternative for place recognition, particularly suited to safety-sensitive and resource-constrained settings.
123. 【2512.18599】SimpleCall: A Lightweight Image Restoration Agent in Label-Free Environments with MLLM Perceptual Feedback
链接:https://arxiv.org/abs/2512.18599
作者:Jianglin Lu,Yuanwei Wu,Ziyi Zhao,Hongcheng Wang,Felix Jimenez,Abrar Majeedi,Yun Fu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:recover high-quality images, Complex image restoration, Complex image, compression artifacts, aims to recover
备注:
点击查看摘要
Abstract:Complex image restoration aims to recover high-quality images from inputs affected by multiple degradations such as blur, noise, rain, and compression artifacts. Recent restoration agents, powered by vision-language models and large language models, offer promising restoration capabilities but suffer from significant efficiency bottlenecks due to reflection, rollback, and iterative tool searching. Moreover, their performance heavily depends on degradation recognition models that require extensive annotations for training, limiting their applicability in label-free environments. To address these limitations, we propose a policy optimization-based restoration framework that learns an lightweight agent to determine tool-calling sequences. The agent operates in a sequential decision process, selecting the most appropriate restoration operation at each step to maximize final image quality. To enable training within label-free environments, we introduce a novel reward mechanism driven by multimodal large language models, which act as human-aligned evaluator and provide perceptual feedback for policy improvement. Once trained, our agent executes a deterministic restoration plans without redundant tool invocations, significantly accelerating inference while maintaining high restoration quality. Extensive experiments show that despite using no supervision, our method matches SOTA performance on full-reference metrics and surpasses existing approaches on no-reference metrics across diverse degradation scenarios.
124. 【2512.18597】Commercial Vehicle Braking Optimization: A Robust SIFT-Trajectory Approach
链接:https://arxiv.org/abs/2512.18597
作者:Zhe Li,Kun Cheng,Hanyue Mo,Jintao Lu,Ziwen Kuang,Jianwen Ye,Lixu Xu,Xinya Meng,Jiahui Zhao,Shengda Ji,Shuyuan Liu,Mengyu Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:Controller Area Network, inaccurate Controller Area, Area Network, Controller Area, Random Sample Consensus
备注: 5 figures,16 pages
点击查看摘要
Abstract:A vision-based trajectory analysis solution is proposed to address the "zero-speed braking" issue caused by inaccurate Controller Area Network (CAN) signals in commercial vehicle Automatic Emergency Braking (AEB) systems during low-speed operation. The algorithm utilizes the NVIDIA Jetson AGX Xavier platform to process sequential video frames from a blind spot camera, employing self-adaptive Contrast Limited Adaptive Histogram Equalization (CLAHE)-enhanced Scale-Invariant Feature Transform (SIFT) feature extraction and K-Nearest Neighbors (KNN)-Random Sample Consensus (RANSAC) matching. This allows for precise classification of the vehicle's motion state (static, vibration, moving). Key innovations include 1) multiframe trajectory displacement statistics (5-frame sliding window), 2) a dual-threshold state decision matrix, and 3) OBD-II driven dynamic Region of Interest (ROI) configuration. The system effectively suppresses environmental interference and false detection of dynamic objects, directly addressing the challenge of low-speed false activation in commercial vehicle safety systems. Evaluation in a real-world dataset (32,454 video segments from 1,852 vehicles) demonstrates an F1-score of 99.96% for static detection, 97.78% for moving state recognition, and a processing delay of 14.2 milliseconds (resolution 704x576). The deployment on-site shows an 89% reduction in false braking events, a 100% success rate in emergency braking, and a fault rate below 5%.
125. 【2512.18573】Placenta Accreta Spectrum Detection Using an MRI-based Hybrid CNN-Transformer Model
链接:https://arxiv.org/abs/2512.18573
作者:Sumaiya Ali,Areej Alhothali,Ohoud Alzamzami,Sameera Albasri,Ahmed Abduljabbar,Muhammad Alwazzan
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Placenta Accreta Spectrum, Magnetic Resonance Imaging, Placenta Accreta, Accreta Spectrum, Resonance Imaging
备注:
点击查看摘要
Abstract:Placenta Accreta Spectrum (PAS) is a serious obstetric condition that can be challenging to diagnose with Magnetic Resonance Imaging (MRI) due to variability in radiologists' interpretations. To overcome this challenge, a hybrid 3D deep learning model for automated PAS detection from volumetric MRI scans is proposed in this study. The model integrates a 3D DenseNet121 to capture local features and a 3D Vision Transformer (ViT) to model global spatial context. It was developed and evaluated on a retrospective dataset of 1,133 MRI volumes. Multiple 3D deep learning architectures were also evaluated for comparison. On an independent test set, the DenseNet121-ViT model achieved the highest performance with a five-run average accuracy of 84.3%. These results highlight the strength of hybrid CNN-Transformer models as a computer-aided diagnosis tool. The model's performance demonstrates a clear potential to assist radiologists by providing a robust decision support to improve diagnostic consistency across interpretations, and ultimately enhance the accuracy and timeliness of PAS diagnosis.
126. 【2512.18571】ESearch-R1: Learning Cost-Aware MLLM Agents for Interactive Embodied Search via Reinforcement Learning
链接:https://arxiv.org/abs/2512.18571
作者:Weijie Zhou,Xuangtang Xiong,Ye Tian,Lijun Yue,Xinyu Wu,Wei Li,Chaoyang Zhao,Honghui Dong,Ming Tang,Jinqiao Wang,Zhengyou Zhang
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, Large Language
备注:
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) have empowered embodied agents with remarkable capabilities in planning and reasoning. However, when facing ambiguous natural language instructions (e.g., "fetch the tool" in a cluttered room), current agents often fail to balance the high cost of physical exploration against the cognitive cost of human interaction. They typically treat disambiguation as a passive perception problem, lacking the strategic reasoning to minimize total task execution costs. To bridge this gap, we propose ESearch-R1, a cost-aware embodied reasoning framework that unifies interactive dialogue (Ask), episodic memory retrieval (GetMemory), and physical navigation (Navigate) into a single decision process. We introduce HC-GRPO (Heterogeneous Cost-Aware Group Relative Policy Optimization). Unlike traditional PPO which relies on a separate value critic, HC-GRPO optimizes the MLLM by sampling groups of reasoning trajectories and reinforcing those that achieve the optimal trade-off between information gain and heterogeneous costs (e.g., navigate time, and human attention). Extensive experiments in AI2-THOR demonstrate that ESearch-R1 significantly outperforms standard ReAct-based agents. It improves task success rates while reducing total operational costs by approximately 50\%, validating the effectiveness of GRPO in aligning MLLM agents with physical world constraints.
127. 【2512.18563】OpenView: Empowering MLLMs with Out-of-view VQA
链接:https://arxiv.org/abs/2512.18563
作者:Qixiang Chen,Cheng Zhang,Chi-Wing Fu,Jingwen Ye,Jianfei Cai
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent multimodal large, Recent multimodal, large language models, natural image understanding, multimodal large language
备注: Code: [this https URL](https://github.com/q1xiangchen/OpenView)
点击查看摘要
Abstract:Recent multimodal large language models (MLLMs) show great potential in natural image understanding. Yet, they perform well, mainly on reasoning in-view contents within the image frame. This paper presents the first study on out-of-view (OOV) understanding, i.e., the ability to reason objects, activities, and scenes beyond the visible frame of a perspective view. Our technical contributions are threefold. First, we design OpenView, a four-stage pipeline to massively generate multi-choice VQA by leveraging panoramic imagery to enable context-rich and spatial-grounded VQA synthesis with free-view framing. Second, we curate OpenView-Dataset, a high-quality synthetic dataset from diverse real-world panoramas to empower MLLMs upon supervised fine-tuning. Third, we build OpenView-Bench, a benchmark that jointly measures choice and rationale accuracy for interpretable and diagnosable evaluation. Experimental results show that despite having a large gap from human performance in OOV VQA answer selection, upon empowered by OpenView, multiple MLLMs can consistently boost their performance, uplifted from 48.6% to 64.1% on average. Code, benchmark, and data will be available at this https URL.
128. 【2512.18554】Enhancing Medical Large Vision-Language Models via Alignment Distillation
链接:https://arxiv.org/abs/2512.18554
作者:Aofei Chang,Ting Wang,Fenglong Ma
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Large Vision-Language Models, Medical Large Vision-Language, shown promising results, Large Vision-Language, misaligned visual understanding
备注: Accepted to AAAI'2026 (Main track)
点击查看摘要
Abstract:Medical Large Vision-Language Models (Med-LVLMs) have shown promising results in clinical applications, but often suffer from hallucinated outputs due to misaligned visual understanding. In this work, we identify two fundamental limitations contributing to this issue: insufficient visual representation learning and poor visual attention alignment. To address these problems, we propose MEDALIGN, a simple, lightweight alignment distillation framework that transfers visual alignment knowledge from a domain-specific Contrastive Language-Image Pre-training (CLIP) model to Med-LVLMs. MEDALIGN introduces two distillation losses: a spatial-aware visual alignment loss based on visual token-level similarity structures, and an attention-aware distillation loss that guides attention toward diagnostically relevant regions. Extensive experiments on medical report generation and medical visual question answering (VQA) benchmarks show that MEDALIGN consistently improves both performance and interpretability, yielding more visually grounded outputs.
129. 【2512.18553】Hierarchical Bayesian Framework for Multisource Domain Adaptation
链接:https://arxiv.org/abs/2512.18553
作者:Alexander M. Glandon,Khan M. Iftekharuddin
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multisource domain adaptation, Bayesian framework, proposed Bayesian framework, Hierarchical Bayesian Framework, Multisource domain
备注:
点击查看摘要
Abstract:Multisource domain adaptation (MDA) aims to use multiple source datasets with available labels to infer labels on a target dataset without available labels for target supervision. Prior works on MDA in the literature is ad-hoc as the pretraining of source models is either based on weight sharing or uses independently trained models. This work proposes a Bayesian framework for pretraining in MDA by considering that the distributions of different source domains are typically similar. The Hierarchical Bayesian Framework uses similarity between the different source data distributions to optimize the pretraining for MDA. Experiments using the proposed Bayesian framework for MDA show that our framework improves accuracy on recognition tasks for a large benchmark dataset. Performance comparison with state-of-the-art MDA methods on the challenging problem of human action recognition in multi-domain benchmark Daily-DA RGB video shows the proposed Bayesian Framework offers a 17.29% improvement in accuracy when compared to the state-of-the-art methods in the literature.
130. 【2512.18528】WoundNet-Ensemble: A Novel IoMT System Integrating Self-Supervised Deep Learning and Multi-Model Fusion for Automated, High-Accuracy Wound Classification and Healing Progression Monitoring
链接:https://arxiv.org/abs/2512.18528
作者:Moses Kiprono
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:healthcare costs exceeding, billion dollars annually, including diabetic foot, people with diabetes, impose a substantial
备注: 10 pages, 6 figures. Code to be released publicly
点击查看摘要
Abstract:Chronic wounds, including diabetic foot ulcers which affect up to one-third of people with diabetes, impose a substantial clinical and economic burden, with U.S. healthcare costs exceeding 25 billion dollars annually. Current wound assessment remains predominantly subjective, leading to inconsistent classification and delayed interventions. We present WoundNet-Ensemble, an Internet of Medical Things system leveraging a novel ensemble of three complementary deep learning architectures: ResNet-50, the self-supervised Vision Transformer DINOv2, and Swin Transformer, for automated classification of six clinically distinct wound types. Our system achieves 99.90 percent ensemble accuracy on a comprehensive dataset of 5,175 wound images spanning diabetic foot ulcers, pressure ulcers, venous ulcers, thermal burns, pilonidal sinus wounds, and fungating malignant tumors. The weighted fusion strategy demonstrates a 3.7 percent improvement over previous state-of-the-art methods. Furthermore, we implement a longitudinal wound healing tracker that computes healing rates, severity scores, and generates clinical alerts. This work demonstrates a robust, accurate, and clinically deployable tool for modernizing wound care through artificial intelligence, addressing critical needs in telemedicine and remote patient monitoring. The implementation and trained models will be made publicly available to support reproducibility.
131. 【2512.18527】Detection of AI Generated Images Using Combined Uncertainty Measures and Particle Swarm Optimised Rejection Mechanism
链接:https://arxiv.org/abs/2512.18527
作者:Rahul Yumlembam,Biju Issac,Nauman Aslam,Eaby Kollonoor Babu,Josh Collyer,Fraser Kennedy
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Monte Carlo Dropout, Deep Kernel Learning, increasingly photorealistic, growing challenge, poses a growing
备注: Scientific Reports (2025)
点击查看摘要
Abstract:As AI-generated images become increasingly photorealistic, distinguishing them from natural images poses a growing challenge. This paper presents a robust detection framework that leverages multiple uncertainty measures to decide whether to trust or reject a model's predictions. We focus on three complementary techniques: Fisher Information, which captures the sensitivity of model parameters to input variations; entropy-based uncertainty from Monte Carlo Dropout, which reflects predictive variability; and predictive variance from a Deep Kernel Learning framework using a Gaussian Process classifier. To integrate these diverse uncertainty signals, Particle Swarm Optimisation is used to learn optimal weightings and determine an adaptive rejection threshold. The model is trained on Stable Diffusion-generated images and evaluated on GLIDE, VQDM, Midjourney, BigGAN, and StyleGAN3, each introducing significant distribution shifts. While standard metrics such as prediction probability and Fisher-based measures perform well in distribution, their effectiveness degrades under shift. In contrast, the Combined Uncertainty measure consistently achieves an incorrect rejection rate of approximately 70 percent on unseen generators, successfully filtering most misclassified AI samples. Although the system occasionally rejects correct predictions from newer generators, this conservative behaviour is acceptable, as rejected samples can support retraining. The framework maintains high acceptance of accurate predictions for natural images and in-domain AI data. Under adversarial attacks using FGSM and PGD, the Combined Uncertainty method rejects around 61 percent of successful attacks, while GP-based uncertainty alone achieves up to 80 percent. Overall, the results demonstrate that multi-source uncertainty fusion provides a resilient and adaptive solution for AI-generated image detection.
132. 【2512.18504】GTMA: Dynamic Representation Optimization for OOD Vision-Language Models
链接:https://arxiv.org/abs/2512.18504
作者:Jensen Zhang,Ningyuan Liu,Keze Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:trigger cross-modal alignment, cross-modal alignment collapse, severely degrade zero-shot, struggle in open-world, open-world applications
备注: Under submission
点击查看摘要
Abstract:Vision-language models (VLMs) struggle in open-world applications, where out-of-distribution (OOD) concepts often trigger cross-modal alignment collapse and severely degrade zero-shot performance. We identify the root cause as modal asymmetry: while the visual encoder can extract discriminative features from unseen images, the text encoder is constrained by a fixed discrete vocabulary and cannot synthesize new semantic anchors. Existing approaches such as CoOp or LoRA provide only partial remedies, as they remain confined to the pre-trained semantic space. To overcome this bottleneck, we propose dynamic representation optimization, realized through the Guided Target-Matching Adaptation (GTMA) framework. At inference time, GTMA constructs a continuous pseudo-word embedding that best aligns with an OOD image's visual anchor, effectively bypassing vocabulary limitations. The optimization is driven by an adaptive gradient-based representation policy optimization algorithm, which incorporates semantic regularization to preserve plausibility and compatibility with the model's prior knowledge. Experiments on ImageNet-R and the VISTA-Beyond benchmark demonstrate that GTMA improves zero-shot and few-shot OOD accuracy by up to 15-20 percent over the base VLM while maintaining performance on in-distribution concepts. Ablation studies further confirm the necessity of pseudo-word optimization.
Comments:
Under submission
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2512.18504 [cs.CV]
(or
arXiv:2512.18504v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2512.18504
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
133. 【2512.18503】NASTaR: NovaSAR Automated Ship Target Recognition Dataset
链接:https://arxiv.org/abs/2512.18503
作者:Benyamin Hosseiny,Kamirul Kamirul,Odysseas Pappas,Alin Achim
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Synthetic Aperture Radar, Synthetic Aperture, Aperture Radar, space-based maritime activity, maritime activity monitoring
备注:
点击查看摘要
Abstract:Synthetic Aperture Radar (SAR) offers a unique capability for all-weather, space-based maritime activity monitoring by capturing and imaging strong reflections from ships at sea. A well-defined challenge in this domain is ship type classification. Due to the high diversity and complexity of ship types, accurate recognition is difficult and typically requires specialized deep learning models. These models, however, depend on large, high-quality ground-truth datasets to achieve robust performance and generalization. Furthermore, the growing variety of SAR satellites operating at different frequencies and spatial resolutions has amplified the need for more annotated datasets to enhance model accuracy. To address this, we present the NovaSAR Automated Ship Target Recognition (NASTaR) dataset. This dataset comprises of 3415 ship patches extracted from NovaSAR S-band imagery, with labels matched to AIS data. It includes distinctive features such as 23 unique classes, inshore/offshore separation, and an auxiliary wake dataset for patches where ship wakes are visible. We validated the dataset applicability across prominent ship-type classification scenarios using benchmark deep learning models. Results demonstrate over 60% accuracy for classifying four major ship types, over 70% for a three-class scenario, more than 75% for distinguishing cargo from tanker ships, and over 87% for identifying fishing vessels. The NASTaR dataset is available at https://https://doi.org/10.5523/bris, while relevant codes for benchmarking and analysis are available at this https URL.
134. 【2512.18500】PlantDiseaseNet-RT50: A Fine-tuned ResNet50 Architecture for High-Accuracy Plant Disease Detection Beyond Standard CNNs
链接:https://arxiv.org/abs/2512.18500
作者:Santwana Sagnika,Manav Malhotra,Ishtaj Kaur Deol,Soumyajit Roy,Swarnav Kumar
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:global food security, Plant diseases pose, accounting for 70-80, food security, productivity and global
备注: This work is published in 2025 IEEE International Conference on Advances in Computing Research On Science Engineering and Technology (ACROSET). 6 pages, 2 figures, 2 tables
点击查看摘要
Abstract:Plant diseases pose a significant threat to agricultural productivity and global food security, accounting for 70-80% of crop losses worldwide. Traditional detection methods rely heavily on expert visual inspection, which is time-consuming, labour-intensive, and often impractical for large-scale farming operations. In this paper, we present PlantDiseaseNet-RT50, a novel fine-tuned deep learning architecture based on ResNet50 for automated plant disease detection. Our model features strategically unfrozen layers, a custom classification head with regularization mechanisms, and dynamic learning rate scheduling through cosine decay. Using a comprehensive dataset of distinct plant disease categories across multiple crop species, PlantDiseaseNet-RT50 achieves exceptional performance with approximately 98% accuracy, precision, and recall. Our architectural modifications and optimization protocol demonstrate how targeted fine-tuning can transform a standard pretrained model into a specialized agricultural diagnostic tool. We provide a detailed account of our methodology, including the systematic unfreezing of terminal layers, implementation of batch normalization and dropout regularization and application of advanced training techniques. PlantDiseaseNet-RT50 represents a significant advancement in AI-driven agricultural tools, offering a computationally efficient solution for rapid and accurate plant disease diagnosis that can be readily implemented in practical farming contexts to support timely interventions and reduce crop losses.
135. 【2512.18496】Adaptive-VoCo: Complexity-Aware Visual Token Compression for Vision-Language Models
链接:https://arxiv.org/abs/2512.18496
作者:Xiaoyang Guo,Keze Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:demonstrated remarkable performance, large-scale vision-language models, recent years, large-scale vision-language, demonstrated remarkable
备注: Under submission
点击查看摘要
Abstract:In recent years, large-scale vision-language models (VLMs) have demonstrated remarkable performance on multimodal understanding and reasoning tasks. However, handling high-dimensional visual features often incurs substantial computational and memory costs. VoCo-LLaMA alleviates this issue by compressing visual patch tokens into a few VoCo tokens, reducing computational overhead while preserving strong cross-modal alignment. Nevertheless, such approaches typically adopt a fixed compression rate, limiting their ability to adapt to varying levels of visual complexity. To address this limitation, we propose Adaptive-VoCo, a framework that augments VoCo-LLaMA with a lightweight predictor for adaptive compression. This predictor dynamically selects an optimal compression rate by quantifying an image's visual complexity using statistical cues from the vision encoder, such as patch token entropy and attention map variance. Furthermore, we introduce a joint loss function that integrates rate regularization with complexity alignment. This enables the model to balance inference efficiency with representational capacity, particularly in challenging scenarios. Experimental results show that our method consistently outperforms fixed-rate baselines across multiple multimodal tasks, highlighting the potential of adaptive visual compression for creating more efficient and robust VLMs.
136. 【2512.18477】STORM: Search-Guided Generative World Models for Robotic Manipulation
链接:https://arxiv.org/abs/2512.18477
作者:Wenjun Lin,Jensen Zhang,Kaitong Cai,Keze Wang
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:Generative World Models, Carlo Tree Search, Search-Guided Generative World, Monte Carlo Tree, diffusion-based action generation
备注: Under submission
点击查看摘要
Abstract:We present STORM (Search-Guided Generative World Models), a novel framework for spatio-temporal reasoning in robotic manipulation that unifies diffusion-based action generation, conditional video prediction, and search-based planning. Unlike prior Vision-Language-Action (VLA) models that rely on abstract latent dynamics or delegate reasoning to language components, STORM grounds planning in explicit visual rollouts, enabling interpretable and foresight-driven decision-making. A diffusion-based VLA policy proposes diverse candidate actions, a generative video world model simulates their visual and reward outcomes, and Monte Carlo Tree Search (MCTS) selectively refines plans through lookahead evaluation. Experiments on the SimplerEnv manipulation benchmark demonstrate that STORM achieves a new state-of-the-art average success rate of 51.0 percent, outperforming strong baselines such as CogACT. Reward-augmented video prediction substantially improves spatio-temporal fidelity and task relevance, reducing Frechet Video Distance by over 75 percent. Moreover, STORM exhibits robust re-planning and failure recovery behavior, highlighting the advantages of search-guided generative world models for long-horizon robotic manipulation.
137. 【2512.18455】Plasticine: A Traceable Diffusion Model for Medical Image Translation
链接:https://arxiv.org/abs/2512.18455
作者:Tianyang Zhanng,Xinxing Cheng,Jun Cheng,Shaoming Zheng,He Zhao,Huazhu Fu,Alejandro F Frangi,Jiang Liu,Jinming Duan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:population distributions pose, distributions pose significant, pose significant challenges, medical image analysis, Domain gaps arising
备注: Accepted by IEEE Transactions on Artificial Intelligence
点击查看摘要
Abstract:Domain gaps arising from variations in imaging devices and population distributions pose significant challenges for machine learning in medical image analysis. Existing image-to-image translation methods primarily aim to learn mappings between domains, often generating diverse synthetic data with variations in anatomical scale and shape, but they usually overlook spatial correspondence during the translation process. For clinical applications, traceability, defined as the ability to provide pixel-level correspondences between original and translated images, is equally important. This property enhances clinical interpretability but has been largely overlooked in previous approaches. To address this gap, we propose Plasticine, which is, to the best of our knowledge, the first end-to-end image-to-image translation framework explicitly designed with traceability as a core objective. Our method combines intensity translation and spatial transformation within a denoising diffusion framework. This design enables the generation of synthetic images with interpretable intensity transitions and spatially coherent deformations, supporting pixel-wise traceability throughout the translation process.
138. 【2512.18453】NOVA: Discovering Well-Conditioned Winograd Transforms through Numerical Optimization of Vandermonde Arithmetic
链接:https://arxiv.org/abs/2512.18453
作者:Jayant Lohia
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:reducing arithmetic complexity, algorithm for efficient, reducing arithmetic, arithmetic complexity, standard algorithm
备注:
点击查看摘要
Abstract:Winograd convolution is the standard algorithm for efficient inference, reducing arithmetic complexity by 2.25x for 3x3 kernels. However, it faces a critical barrier in the modern era of low precision computing: numerical instability. As tiles scale to maximize efficiency (e.g., F(6,3), F(8,3)), the condition numbers of standard integer based transforms explode, reaching kappa = 2 x 10^5 for F(8,3), rendering them unusable in FP16 or Int8. We introduce NOVA (Numerical Optimization of Vandermonde Arithmetic), a discovery framework that breaks the decades old convention of integer interpolation. Treating Winograd point selection as a continuous optimization problem, NOVA searches the manifold R^n-1 via Evolution Strategy, snaps candidates to simple rationals, and guarantees correctness via symbolic verification. This process uncovers a hidden landscape of stable, fractional configurations such as {+-5/6, +-7/6, +-3/5} that defy traditional vocabulary constraints. The impact is transformative: NOVA improves the conditioning of F(8,3) by 415x in 1D, which squares to a 172,484x improvement for 2D convolution. In real world FP16 ImageNet inference, where standard transforms collapse to random chance (e.g., 4.7 percent accuracy on VGG16), NOVA's points restore full accuracy (75 to 78 percent), recovering over 70 percentage points without retraining, calibration, or learned parameters. These discovered transforms act as drop in replacements, effectively unlocking the efficiency of large tile Winograd convolution for next generation hardware.
139. 【2512.18450】Agent-Based Output Drift Detection for Breast Cancer Response Prediction in a Multisite Clinical Decision Support System
链接:https://arxiv.org/abs/2512.18450
作者:Xavier Rafael-Palou,Jose Munuera,Ana Jimenez-Pastor,Richard Osuala,Karim Lekadir,Oliver Diaz
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
关键词:concurrently serve multiple, Modern clinical decision, independent medical imaging, medical imaging institutions, Modern clinical
备注: Accepted at MICAD (Medical Imaging and Computer-Aided Diagnosis) 2025
点击查看摘要
Abstract:Modern clinical decision support systems can concurrently serve multiple, independent medical imaging institutions, but their predictive performance may degrade across sites due to variations in patient populations, imaging hardware, and acquisition protocols. Continuous surveillance of predictive model outputs offers a safe and reliable approach for identifying such distributional shifts without ground truth labels. However, most existing methods rely on centralized monitoring of aggregated predictions, overlooking site-specific drift dynamics. We propose an agent-based framework for detecting drift and assessing its severity in multisite clinical AI systems. To evaluate its effectiveness, we simulate a multi-center environment for output-based drift detection, assigning each site a drift monitoring agent that performs batch-wise comparisons of model outputs against a reference distribution. We analyse several multi-center monitoring schemes, that differ in how the reference is obtained (site-specific, global, production-only and adaptive), alongside a centralized baseline. Results on real-world breast cancer imaging data using a pathological complete response prediction model shows that all multi-center schemes outperform centralized monitoring, with F1-score improvements up to 10.3% in drift detection. In the absence of site-specific references, the adaptive scheme performs best, with F1-scores of 74.3% for drift detection and 83.7% for drift severity classification. These findings suggest that adaptive, site-aware agent-based drift monitoring can enhance reliability of multisite clinical decision support systems.
140. 【2512.18448】Object-Centric Framework for Video Moment Retrieval
链接:https://arxiv.org/abs/2512.18448
作者:Zongyao Li,Yongkang Wong,Satoshi Yamazaki,Jianquan Liu,Mohan Kankanhalli
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:primarily encode global, semantic information, retrieval methods rely, encode global visual, global visual
备注: AAAI2026
点击查看摘要
Abstract:Most existing video moment retrieval methods rely on temporal sequences of frame- or clip-level features that primarily encode global visual and semantic information. However, such representations often fail to capture fine-grained object semantics and appearance, which are crucial for localizing moments described by object-oriented queries involving specific entities and their interactions. In particular, temporal dynamics at the object level have been largely overlooked, limiting the effectiveness of existing approaches in scenarios requiring detailed object-level reasoning. To address this limitation, we propose a novel object-centric framework for moment retrieval. Our method first extracts query-relevant objects using a scene graph parser and then generates scene graphs from video frames to represent these objects and their relationships. Based on the scene graphs, we construct object-level feature sequences that encode rich visual and semantic information. These sequences are processed by a relational tracklet transformer, which models spatio-temporal correlations among objects over time. By explicitly capturing object-level state changes, our framework enables more accurate localization of moments aligned with object-oriented queries. We evaluated our method on three benchmarks: Charades-STA, QVHighlights, and TACoS. Experimental results demonstrate that our method outperforms existing state-of-the-art methods across all benchmarks.
141. 【2512.18437】MeniMV: A Multi-view Benchmark for Meniscus Injury Severity Grading
链接:https://arxiv.org/abs/2512.18437
作者:Shurui Xu,Siqi Yang,Jiapin Ren,Zhong Cao,Hongwei Yang,Mengzhen Fan,Yuyu Sun,Shuyan Li
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:automated MRI analysis, Precise grading, tears is critical, diagnosis but remains, remains underexplored
备注: 5 pages, 2 figures
点击查看摘要
Abstract:Precise grading of meniscal horn tears is critical in knee injury diagnosis but remains underexplored in automated MRI analysis. Existing methods often rely on coarse study-level labels or binary classification, lacking localization and severity information. In this paper, we introduce MeniMV, a multi-view benchmark dataset specifically designed for horn-specific meniscus injury grading. MeniMV comprises 3,000 annotated knee MRI exams from 750 patients across three medical centers, providing 6,000 co-registered sagittal and coronal images. Each exam is meticulously annotated with four-tier (grade 0-3) severity labels for both anterior and posterior meniscal horns, verified by chief orthopedic physicians. Notably, MeniMV offers more than double the pathology-labeled data volume of prior datasets while uniquely capturing the dual-view diagnostic context essential in clinical practice. To demonstrate the utility of MeniMV, we benchmark multiple state-of-the-art CNN and Transformer-based models. Our extensive experiments establish strong baselines and highlight challenges in severity grading, providing a valuable foundation for future research in automated musculoskeletal imaging.
142. 【2512.18429】E-RGB-D: Real-Time Event-Based Perception with Structured Light
链接:https://arxiv.org/abs/2512.18429
作者:Seyed Ehsan Marjani Bajestani,Giovanni Beltrame
类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
关键词:Event-based cameras, offering unmatched speed, report pixel brightness, Digital Light Processing, Active Structured Light
备注:
点击查看摘要
Abstract:Event-based cameras (ECs) have emerged as bio-inspired sensors that report pixel brightness changes asynchronously, offering unmatched speed and efficiency in vision sensing. Despite their high dynamic range, temporal resolution, low power consumption, and computational simplicity, traditional monochrome ECs face limitations in detecting static or slowly moving objects and lack color information essential for certain applications. To address these challenges, we present a novel approach that integrates a Digital Light Processing (DLP) projector, forming Active Structured Light (ASL) for RGB-D sensing. By combining the benefits of ECs and projection-based techniques, our method enables the detection of color and the depth of each pixel separately. Dynamic projection adjustments optimize bandwidth, ensuring selective color data acquisition and yielding colorful point clouds without sacrificing spatial resolution. This integration, facilitated by a commercial TI LightCrafter 4500 projector and a monocular monochrome EC, not only enables frameless RGB-D sensing applications but also achieves remarkable performance milestones. With our approach, we achieved a color detection speed equivalent to 1400 fps and 4 kHz of pixel depth detection, significantly advancing the realm of computer vision across diverse fields from robotics to 3D reconstruction methods. Our code is publicly available: this https URL
143. 【2512.18411】AmPLe: Supporting Vision-Language Models via Adaptive-Debiased Ensemble Multi-Prompt Learning
链接:https://arxiv.org/abs/2512.18411
作者:Fei Song,Yi Li,Jiangmeng Li,Rui Wang,Changwen Zheng,Fanjiang Xu,Hui Xiong
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Multi-prompt learning methods, Multi-prompt learning, vision-language models, foundation vision-language model, limited resources
备注: Accepted by IJCV2025
点击查看摘要
Abstract:Multi-prompt learning methods have emerged as an effective approach for facilitating the rapid adaptation of vision-language models to downstream tasks with limited resources. Existing multi-prompt learning methods primarily focus on utilizing various meticulously designed prompts within a single foundation vision-language model to achieve superior performance. However, the overlooked model-prompt matching bias hinders the development of multi-prompt learning, i.e., the same prompt can convey different semantics across distinct vision-language models, such as CLIP-ViT-B/16 and CLIP-ViT-B/32, resulting in inconsistent predictions of identical prompt. To mitigate the impact of this bias on downstream tasks, we explore an ensemble learning approach to sufficiently aggregate the benefits of diverse predictions. Additionally, we further disclose the presence of sample-prompt matching bias, which originates from the prompt-irrelevant semantics encapsulated in the input samples. Thus, directly utilizing all information from the input samples for generating weights of ensemble learning can lead to suboptimal performance. In response, we extract prompt-relevant semantics from input samples by leveraging the guidance of the information theory-based analysis, adaptively calculating debiased ensemble weights. Overall, we propose Adaptive-Debiased Ensemble MultiPrompt Learning, abbreviated as AmPLe, to mitigate the two types of bias simultaneously. Extensive experiments on three representative tasks, i.e., generalization to novel classes, new target datasets, and unseen domain shifts, show that AmPLe can widely outperform existing methods. Theoretical validation from a causal perspective further supports the effectiveness of AmPLe.
144. 【2512.18407】hrough the PRISm: Importance-Aware Scene Graphs for Image Retrieval
链接:https://arxiv.org/abs/2512.18407
作者:Dimitrios Georgoulopoulos,Nikolaos Chaidos,Angeliki Dimitriou,Giorgos Stamou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Importance Prediction Module, semantically similar remains, Importance Prediction, computer vision, Accurately retrieving images
备注: 10 pages, 5 figures
点击查看摘要
Abstract:Accurately retrieving images that are semantically similar remains a fundamental challenge in computer vision, as traditional methods often fail to capture the relational and contextual nuances of a scene. We introduce PRISm (Pruning-based Image Retrieval via Importance Prediction on Semantic Graphs), a multimodal framework that advances image-to-image retrieval through two novel components. First, the Importance Prediction Module identifies and retains the most critical objects and relational triplets within an image while pruning irrelevant elements. Second, the Edge-Aware Graph Neural Network explicitly encodes relational structure and integrates global visual features to produce semantically informed image embeddings. PRISm achieves image retrieval that closely aligns with human perception by explicitly modeling the semantic importance of objects and their interactions, capabilities largely absent in prior approaches. Its architecture effectively combines relational reasoning with visual representation, enabling semantically grounded retrieval. Extensive experiments on benchmark and real-world datasets demonstrate consistently superior top-ranked performance, while qualitative analyses show that PRISm accurately captures key objects and interactions, producing interpretable and semantically meaningful results.
145. 【2512.18406】Automated Mosaic Tesserae Segmentation via Deep Learning Techniques
链接:https://arxiv.org/abs/2512.18406
作者:Charilaos Kapelonis,Marios Antonakakis,Konstantinos Politof,Aristomenis Antoniadis,Michalis Zervakis
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:cultural heritage, widely recognized, reflection of civilization, represent an important, important part
备注:
点击查看摘要
Abstract:Art is widely recognized as a reflection of civilization and mosaics represent an important part of cultural heritage. Mosaics are an ancient art form created by arranging small pieces, called tesserae, on a surface using adhesive. Due to their age and fragility, they are prone to damage, highlighting the need for digital preservation. This paper addresses the problem of digitizing mosaics by segmenting the tesserae to separate them from the background within the broader field of Image Segmentation in Computer Vision. We propose a method leveraging Segment Anything Model 2 (SAM 2) by Meta AI, a foundation model that outperforms most conventional segmentation models, to automatically segment mosaics. Due to the limited open datasets in the field, we also create an annotated dataset of mosaic images to fine-tune and evaluate the model. Quantitative evaluation on our testing dataset shows notable improvements compared to the baseline SAM 2 model, with Intersection over Union increasing from 89.00% to 91.02% and Recall from 92.12% to 95.89%. Additionally, on a benchmark proposed by a prior approach, our model achieves an F-measure 3% higher than previous methods and reduces the error in the absolute difference between predicted and actual tesserae from 0.20 to just 0.02. The notable performance of the fine-tuned SAM 2 model together with the newly annotated dataset can pave the way for real-time segmentation of mosaic images.
146. 【2512.18386】RecurGS: Interactive Scene Modeling via Discrete-State Recurrent Gaussian Fusion
链接:https://arxiv.org/abs/2512.18386
作者:Wenhao Hu,Haonan Zhou,Zesheng Li,Liu Liu,Jiacheng Dong,Zhizhong Su,Gaoang Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:remain open challenges, Recent advances, environments remain open, vision and robotics, enabled high-fidelity
备注:
点击查看摘要
Abstract:Recent advances in 3D scene representations have enabled high-fidelity novel view synthesis, yet adapting to discrete scene changes and constructing interactive 3D environments remain open challenges in vision and robotics. Existing approaches focus solely on updating a single scene without supporting novel-state synthesis. Others rely on diffusion-based object-background decoupling that works on one state at a time and cannot fuse information across multiple observations. To address these limitations, we introduce RecurGS, a recurrent fusion framework that incrementally integrates discrete Gaussian scene states into a single evolving representation capable of interaction. RecurGS detects object-level changes across consecutive states, aligns their geometric motion using semantic correspondence and Lie-algebra based SE(3) refinement, and performs recurrent updates that preserve historical structures through replay supervision. A voxelized, visibility-aware fusion module selectively incorporates newly observed regions while keeping stable areas fixed, mitigating catastrophic forgetting and enabling efficient long-horizon updates. RecurGS supports object-level manipulation, synthesizes novel scene states without requiring additional scans, and maintains photorealistic fidelity across evolving environments. Extensive experiments across synthetic and real-world datasets demonstrate that our framework delivers high-quality reconstructions with substantially improved update efficiency, providing a scalable step toward continuously interactive Gaussian worlds.
147. 【2512.18365】Efficient Zero-Shot Inpainting with Decoupled Diffusion Guidance
链接:https://arxiv.org/abs/2512.18365
作者:Badr Moufad,Navid Bagheri Shouraki,Alain Oliviero Durmus,Thomas Hirtz,Eric Moulines,Jimmy Olsson,Yazid Janati
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:image editing tasks, generate realistic content, pretrained diffusion model, local modification, observed regions
备注: preprint
点击查看摘要
Abstract:Diffusion models have emerged as powerful priors for image editing tasks such as inpainting and local modification, where the objective is to generate realistic content that remains consistent with observed regions. In particular, zero-shot approaches that leverage a pretrained diffusion model, without any retraining, have been shown to achieve highly effective reconstructions. However, state-of-the-art zero-shot methods typically rely on a sequence of surrogate likelihood functions, whose scores are used as proxies for the ideal score. This procedure however requires vector-Jacobian products through the denoiser at every reverse step, introducing significant memory and runtime overhead. To address this issue, we propose a new likelihood surrogate that yields simple and efficient to sample Gaussian posterior transitions, sidestepping the backpropagation through the denoiser network. Our extensive experiments show that our method achieves strong observation consistency compared with fine-tuned baselines and produces coherent, high-quality reconstructions, all while significantly reducing inference cost. Code is available at this https URL.
148. 【2512.18363】Enhancing 3D Semantic Scene Completion with a Refinement Module
链接:https://arxiv.org/abs/2512.18363
作者:Dunxing Zhang(3),Jiachen Lu(3),Han Yang(1 and 2),Lei Bao(1 and 2),Bo Song(1 and 2) ((1) National Science Center for Earthquake Engineering, Tianjin University, Tianjin, China, (2) School of Civil Engineering, Tianjin University, Tianjin, China, (3) Technical University of Munich, Munich, Germany)
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Semantic Scene Completion, Local Geometry Module, Scene Completion, existing SSC models, Voxel-level Local Geometry
备注: 19 pages, 8 figures
点击查看摘要
Abstract:We propose ESSC-RM, a plug-and-play Enhancing framework for Semantic Scene Completion with a Refinement Module, which can be seamlessly integrated into existing SSC models. ESSC-RM operates in two phases: a baseline SSC network first produces a coarse voxel prediction, which is subsequently refined by a 3D U-Net-based Prediction Noise-Aware Module (PNAM) and Voxel-level Local Geometry Module (VLGM) under multiscale supervision. Experiments on SemanticKITTI show that ESSC-RM consistently improves semantic prediction performance. When integrated into CGFormer and MonoScene, the mean IoU increases from 16.87% to 17.27% and from 11.08% to 11.51%, respectively. These results demonstrate that ESSC-RM serves as a general refinement framework applicable to a wide range of SSC models.
149. 【2512.18344】MCVI-SANet: A lightweight semi-supervised model for LAI and SPAD estimation of winter wheat under vegetation index saturation
链接:https://arxiv.org/abs/2512.18344
作者:Zhiheng Zhang,Jiajun Yang,Hong Sun,Dong Wang,Honghua Jiang,Yaru Chen,Tangyuan Ning
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:winter wheat constrain, wheat constrain accurate, constrain accurate estimation, limited ground-truth annotations, dense canopy stage
备注:
点击查看摘要
Abstract:Vegetation index (VI) saturation during the dense canopy stage and limited ground-truth annotations of winter wheat constrain accurate estimation of LAI and SPAD. Existing VI-based and texture-driven machine learning methods exhibit limited feature expressiveness. In addition, deep learning baselines suffer from domain gaps and high data demands, which restrict their generalization. Therefore, this study proposes the Multi-Channel Vegetation Indices Saturation Aware Net (MCVI-SANet), a lightweight semi-supervised vision model. The model incorporates a newly designed Vegetation Index Saturation-Aware Block (VI-SABlock) for adaptive channel-spatial feature enhancement. It also integrates a VICReg-based semi-supervised strategy to further improve generalization. Datasets were partitioned using a vegetation height-informed strategy to maintain representativeness across growth stages. Experiments over 10 repeated runs demonstrate that MCVI-SANet achieves state-of-the-art accuracy. The model attains an average R2 of 0.8123 and RMSE of 0.4796 for LAI, and an average R2 of 0.6846 and RMSE of 2.4222 for SPAD. This performance surpasses the best-performing baselines, with improvements of 8.95% in average LAI R2 and 8.17% in average SPAD R2. Moreover, MCVI-SANet maintains high inference speed with only 0.10M parameters. Overall, the integration of semi-supervised learning with agronomic priors provides a promising approach for enhancing remote sensing-based precision agriculture.
150. 【2512.18331】A two-stream network with global-local feature fusion for bone age assessment
链接:https://arxiv.org/abs/2512.18331
作者:Qiong Lou,Han Yang,Fang Lu
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Bone Age Assessment, Bone Age, Age Assessment, local feature extraction, automated bone age
备注:
点击查看摘要
Abstract:Bone Age Assessment (BAA) is a widely used clinical technique that can accurately reflect an individual's growth and development level, as well as maturity. In recent years, although deep learning has advanced the field of bone age assessment, existing methods face challenges in efficiently balancing global features and local skeletal details. This study aims to develop an automated bone age assessment system based on a two-stream deep learning architecture to achieve higher accuracy in bone age assessment. We propose the BoNet+ model incorporating global and local feature extraction channels. A Transformer module is introduced into the global feature extraction channel to enhance the ability in extracting global features through multi-head self-attention mechanism. A RFAConv module is incorporated into the local feature extraction channel to generate adaptive attention maps within multiscale receptive fields, enhancing local feature extraction capabilities. Global and local features are concatenated along the channel dimension and optimized by an Inception-V3 network. The proposed method has been validated on the Radiological Society of North America (RSNA) and Radiological Hand Pose Estimation (RHPE) test datasets, achieving mean absolute errors (MAEs) of 3.81 and 5.65 months, respectively. These results are comparable to the state-of-the-art. The BoNet+ model reduces the clinical workload and achieves automatic, high-precision, and more objective bone age assessment.
151. 【2512.18318】Asynchronous Pipeline Parallelism for Real-Time Multilingual Lip Synchronization in Video Communication Systems
链接:https://arxiv.org/abs/2512.18318
作者:Eren Caglar,Amirkia Rafiei Oskooei,Mehmet Kutanoglu,Mustafa Keles,Mehmet S. Aktas
类目:Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
关键词:real-time video conferencing, Transformer framework designed, asynchronous Transformer framework, asynchronous Transformer, video conferencing systems
备注: Accepted to IEEE Big Data 2025, AIDE4IoT Workshop. Copyright \c{opyright} 2025 IEEE
点击查看摘要
Abstract:This paper introduces a parallel and asynchronous Transformer framework designed for efficient and accurate multilingual lip synchronization in real-time video conferencing systems. The proposed architecture integrates translation, speech processing, and lip-synchronization modules within a pipeline-parallel design that enables concurrent module execution through message-queue-based decoupling, reducing end-to-end latency by up to 3.1 times compared to sequential approaches. To enhance computational efficiency and throughput, the inference workflow of each module is optimized through low-level graph compilation, mixed-precision quantization, and hardware-accelerated kernel fusion. These optimizations provide substantial gains in efficiency while preserving model accuracy and visual quality. In addition, a context-adaptive silence-detection component segments the input speech stream at semantically coherent boundaries, improving translation consistency and temporal alignment across languages. Experimental results demonstrate that the proposed parallel architecture outperforms conventional sequential pipelines in processing speed, synchronization stability, and resource utilization. The modular, message-oriented design makes this work applicable to resource-constrained IoT communication scenarios including telemedicine, multilingual kiosks, and remote assistance systems. Overall, this work advances the development of low-latency, resource-efficient multimodal communication frameworks for next-generation AIoT systems.
152. 【2512.18314】MatSpray: Fusing 2D Material World Knowledge on 3D Geometry
链接:https://arxiv.org/abs/2512.18314
作者:Philipp Langsteiner,Jan-Niklas Dihlmann,Hendrik P.A. Lensch
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:Manual modeling, film industries, consuming yet essential, essential task, gaming and film
备注: Project page: [this https URL](https://matspray.jdihlmann.com/)
点击查看摘要
Abstract:Manual modeling of material parameters and 3D geometry is a time consuming yet essential task in the gaming and film industries. While recent advances in 3D reconstruction have enabled accurate approximations of scene geometry and appearance, these methods often fall short in relighting scenarios due to the lack of precise, spatially varying material parameters. At the same time, diffusion models operating on 2D images have shown strong performance in predicting physically based rendering (PBR) properties such as albedo, roughness, and metallicity. However, transferring these 2D material maps onto reconstructed 3D geometry remains a significant challenge. We propose a framework for fusing 2D material data into 3D geometry using a combination of novel learning-based and projection-based approaches. We begin by reconstructing scene geometry via Gaussian Splatting. From the input images, a diffusion model generates 2D maps for albedo, roughness, and metallic parameters. Any existing diffusion model that can convert images or videos to PBR materials can be applied. The predictions are further integrated into the 3D representation either by optimizing an image-based loss or by directly projecting the material parameters onto the Gaussians using Gaussian ray tracing. To enhance fine-scale accuracy and multi-view consistency, we further introduce a light-weight neural refinement step (Neural Merger), which takes ray-traced material features as input and produces detailed adjustments. Our results demonstrate that the proposed methods outperform existing techniques in both quantitative metrics and perceived visual realism. This enables more accurate, relightable, and photorealistic renderings from reconstructed scenes, significantly improving the realism and efficiency of asset creation workflows in content production pipelines.
153. 【2512.18312】MatE: Material Extraction from Single-Image via Geometric Prior
链接:https://arxiv.org/abs/2512.18312
作者:Zeyu Zhang,Wei Zhai,Jian Yang,Yang Cao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:typically requiring specialized, requiring specialized equipment, physically-based rendering, creation of high-fidelity, graphics pipelines
备注: 8 pages, 8 figures
点击查看摘要
Abstract:The creation of high-fidelity, physically-based rendering (PBR) materials remains a bottleneck in many graphics pipelines, typically requiring specialized equipment and expert-driven post-processing. To democratize this process, we present MatE, a novel method for generating tileable PBR materials from a single image taken under unconstrained, real-world conditions. Given an image and a user-provided mask, MatE first performs coarse rectification using an estimated depth map as a geometric prior, and then employs a dual-branch diffusion model. Leveraging a learned consistency from rotation-aligned and scale-aligned training data, this model further rectify residual distortions from the coarse result and translate it into a complete set of material maps, including albedo, normal, roughness and height. Our framework achieves invariance to the unknown illumination and perspective of the input image, allowing for the recovery of intrinsic material properties from casual captures. Through comprehensive experiments on both synthetic and real-world data, we demonstrate the efficacy and robustness of our approach, enabling users to create realistic materials from real-world image.
154. 【2512.18291】Pyramidal Adaptive Cross-Gating for Multimodal Detection
链接:https://arxiv.org/abs/2512.18291
作者:Zidong Gu,Shoufu Tian
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:UAV reconnaissance, aerial imagery, task in applications, Pyramidal Feature-aware Multimodal, Pyramidal Adaptive Cross-Gating
备注: 17 pages, 6 figures, submitted to Image and Vision Computing
点击查看摘要
Abstract:Object detection in aerial imagery is a critical task in applications such as UAV reconnaissance. Although existing methods have extensively explored feature interaction between different modalities, they commonly rely on simple fusion strategies for feature aggregation. This introduces two critical flaws: it is prone to cross-modal noise and disrupts the hierarchical structure of the feature pyramid, thereby impairing the fine-grained detection of small objects. To address this challenge, we propose the Pyramidal Adaptive Cross-Gating Network (PACGNet), an architecture designed to perform deep fusion within the backbone. To this end, we design two core components: the Symmetrical Cross-Gating (SCG) module and the Pyramidal Feature-aware Multimodal Gating (PFMG) module. The SCG module employs a bidirectional, symmetrical "horizontal" gating mechanism to selectively absorb complementary information, suppress noise, and preserve the semantic integrity of each modality. The PFMG module reconstructs the feature hierarchy via a progressive hierarchical gating mechanism. This leverages the detailed features from a preceding, higher-resolution level to guide the fusion at the current, lower-resolution level, effectively preserving fine-grained details as features propagate. Through evaluations conducted on the DroneVehicle and VEDAI datasets, our PACGNet sets a new state-of-the-art benchmark, with mAP50 scores reaching 81.7% and 82.1% respectively.
155. 【2512.18279】UniMPR: A Unified Framework for Multimodal Place Recognition with Arbitrary Sensor Configurations
链接:https://arxiv.org/abs/2512.18279
作者:Zhangshuo Qi,Jingyi Xu,Luqi Cheng,Shichen Wen,Yiming Ma,Guangming Xiong
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:enabling global localization, multimodal place recognition, Place recognition, vehicles and robotics, enabling global
备注:
点击查看摘要
Abstract:Place recognition is a critical component of autonomous vehicles and robotics, enabling global localization in GPS-denied environments. Recent advances have spurred significant interest in multimodal place recognition (MPR), which leverages complementary strengths of multiple modalities. Despite its potential, most existing MPR methods still face three key challenges: (1) dynamically adapting to arbitrary modality inputs within a unified framework, (2) maintaining robustness with missing or degraded modalities, and (3) generalizing across diverse sensor configurations and setups. In this paper, we propose UniMPR, a unified framework for multimodal place recognition. Using only one trained model, it can seamlessly adapt to any combination of common perceptual modalities (e.g., camera, LiDAR, radar). To tackle the data heterogeneity, we unify all inputs within a polar BEV feature space. Subsequently, the polar BEVs are fed into a multi-branch network to exploit discriminative intra-model and inter-modal features from any modality combinations. To fully exploit the network's generalization capability and robustness, we construct a large-scale training set from multiple datasets and introduce an adaptive label assignment strategy for extensive pre-training. Experiments on seven datasets demonstrate that UniMPR achieves state-of-the-art performance under varying sensor configurations, modality combinations, and environmental conditions. Our code will be released at this https URL.
156. 【2512.18269】Building UI/UX Dataset for Dark Pattern Detection and YOLOv12x-based Real-Time Object Recognition Detection System
链接:https://arxiv.org/abs/2512.18269
作者:Se-Young Jang,Su-Yeon Yoon,Jae-Woong Jung,Dong-Hun Lee,Seong-Hun Choi,Soo-Kyung Jun,Yu-Bin Kim,Young-Seon Ju,Kyounggon Kim
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:undermine users' ability, patterns-user interface designs, dark patterns-user interface, increasingly prominent, accelerating pace
备注: 7page
点击查看摘要
Abstract:With the accelerating pace of digital transformation and the widespread adoption of online platforms, both social and technical concerns regarding dark patterns-user interface designs that undermine users' ability to make informed and rational choices-have become increasingly prominent. As corporate online platforms grow more sophisticated in their design strategies, there is a pressing need for proactive and real-time detection technologies that go beyond the predominantly reactive approaches employed by regulatory authorities. In this paper, we propose a visual dark pattern detection framework that improves both detection accuracy and real-time performance. To this end, we constructed a proprietary visual object detection dataset by manually collecting 4,066 UI/UX screenshots containing dark patterns from 194 websites across six major industrial sectors in South Korea and abroad. The collected images were annotated with five representative UI components commonly associated with dark patterns: Button, Checkbox, Input Field, Pop-up, and QR Code. This dataset has been publicly released to support further research and development in the field. To enable real-time detection, this study adopted the YOLOv12x object detection model and applied transfer learning to optimize its performance for visual dark pattern recognition. Experimental results demonstrate that the proposed approach achieves a high detection accuracy of 92.8% in terms of mAP@50, while maintaining a real-time inference speed of 40.5 frames per second (FPS), confirming its effectiveness for practical deployment in online environments. Furthermore, to facilitate future research and contribute to technological advancements, the dataset constructed in this study has been made publicly available at this https URL.
157. 【2512.18264】Who Can See Through You? Adversarial Shielding Against VLM-Based Attribute Inference Attacks
链接:https://arxiv.org/abs/2512.18264
作者:Yucheng Fan,Jiawei Chen,Yu Tian,Zhaoxia Yin
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
关键词:VLM-based attribute inference, attribute inference attacks, infer private attributes, VLM-based attribute, vision-language models
备注:
点击查看摘要
Abstract:As vision-language models (VLMs) become widely adopted, VLM-based attribute inference attacks have emerged as a serious privacy concern, enabling adversaries to infer private attributes from images shared on social media. This escalating threat calls for dedicated protection methods to safeguard user privacy. However, existing methods often degrade the visual quality of images or interfere with vision-based functions on social media, thereby failing to achieve a desirable balance between privacy protection and user experience. To address this challenge, we propose a novel protection method that jointly optimizes privacy suppression and utility preservation under a visual consistency constraint. While our method is conceptually effective, fair comparisons between methods remain challenging due to the lack of publicly available evaluation datasets. To fill this gap, we introduce VPI-COCO, a publicly available benchmark comprising 522 images with hierarchically structured privacy questions and corresponding non-private counterparts, enabling fine-grained and joint evaluation of protection methods in terms of privacy preservation and user experience. Building upon this benchmark, experiments on multiple VLMs demonstrate that our method effectively reduces PAR below 25%, keeps NPAR above 88%, maintains high visual consistency, and generalizes well to unseen and paraphrased privacy questions, demonstrating its strong practical applicability for real-world VLM deployments.
158. 【2512.18254】Loom: Diffusion-Transformer for Interleaved Generation
链接:https://arxiv.org/abs/2512.18254
作者:Mingcheng Ye,Jiaming Liu,Yiren Song
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:jointly produce coherent, produce coherent visual, aligned textual descriptions, Interleaved text-image generation, text-image generation aims
备注:
点击查看摘要
Abstract:Interleaved text-image generation aims to jointly produce coherent visual frames and aligned textual descriptions within a single sequence, enabling tasks such as style transfer, compositional synthesis, and procedural tutorials. We present Loom, a unified diffusion-transformer framework for interleaved text-image generation. Loom extends the Bagel unified model via full-parameter fine-tuning and an interleaved architecture that alternates textual and visual embeddings for multi-condition reasoning and sequential planning. A language planning strategy first decomposes a user instruction into stepwise prompts and frame embeddings, which guide temporally consistent synthesis. For each frame, Loom conditions on a small set of sampled prior frames together with the global textual context, rather than concatenating all history, yielding controllable and efficient long-horizon generation. Across style transfer, compositional generation, and tutorial-like procedures, Loom delivers superior compositionality, temporal coherence, and text-image alignment. Experiments demonstrate that Loom substantially outperforms the open-source baseline Anole, achieving an average gain of 2.6 points (on a 5-point scale) across temporal and semantic metrics in text-to-interleaved tasks. We also curate a 50K interleaved tutorial dataset and demonstrate strong improvements over unified and diffusion editing baselines.
159. 【2512.18247】owards Ancient Plant Seed Classification: A Benchmark Dataset and Baseline Model
链接:https://arxiv.org/abs/2512.18247
作者:Rui Xing,Runmin Cong,Yingying Wu,Can Wang,Zhongming Tang,Fen Wang,Hao Wu,Sam Kwong
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:revealing human-environment interactions, Understanding the dietary, Ancient Plant Seed, human-environment interactions, ancient plant
备注:
点击查看摘要
Abstract:Understanding the dietary preferences of ancient societies and their evolution across periods and regions is crucial for revealing human-environment interactions. Seeds, as important archaeological artifacts, represent a fundamental subject of archaeobotanical research. However, traditional studies rely heavily on expert knowledge, which is often time-consuming and inefficient. Intelligent analysis methods have made progress in various fields of archaeology, but there remains a research gap in data and methods in archaeobotany, especially in the classification task of ancient plant seeds. To address this, we construct the first Ancient Plant Seed Image Classification (APS) dataset. It contains 8,340 images from 17 genus- or species-level seed categories excavated from 18 archaeological sites across China. In addition, we design a framework specifically for the ancient plant seed classification task (APSNet), which introduces the scale feature (size) of seeds based on learning fine-grained information to guide the network in discovering key "evidence" for sufficient classification. Specifically, we design a Size Perception and Embedding (SPE) module in the encoder part to explicitly extract size information for the purpose of complementing fine-grained information. We propose an Asynchronous Decoupled Decoding (ADD) architecture based on traditional progressive learning to decode features from both channel and spatial perspectives, enabling efficient learning of discriminative features. In both quantitative and qualitative analyses, our approach surpasses existing state-of-the-art image classification methods, achieving an accuracy of 90.5%. This demonstrates that our work provides an effective tool for large-scale, systematic archaeological research.
160. 【2512.18245】Spectral Discrepancy and Cross-modal Semantic Consistency Learning for Object Detection in Hyperspectral Image
链接:https://arxiv.org/abs/2512.18245
作者:Xiao He,Chang Tang,Xinwang Liu,Wei Zhang,Zhimin Gao,Chuankun Li,Shaohua Qiu,Jiangfeng Xu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:recognizing subtle differences, high spectral resolution, spectral resolution provide, similar substances, Hyperspectral images
备注:
点击查看摘要
Abstract:Hyperspectral images with high spectral resolution provide new insights into recognizing subtle differences in similar substances. However, object detection in hyperspectral images faces significant challenges in intra- and inter-class similarity due to the spatial differences in hyperspectral inter-bands and unavoidable interferences, e.g., sensor noises and illumination. To alleviate the hyperspectral inter-bands inconsistencies and redundancy, we propose a novel network termed \textbf{S}pectral \textbf{D}iscrepancy and \textbf{C}ross-\textbf{M}odal semantic consistency learning (SDCM), which facilitates the extraction of consistent information across a wide range of hyperspectral bands while utilizing the spectral dimension to pinpoint regions of interest. Specifically, we leverage a semantic consistency learning (SCL) module that utilizes inter-band contextual cues to diminish the heterogeneity of information among bands, yielding highly coherent spectral dimension representations. On the other hand, we incorporate a spectral gated generator (SGG) into the framework that filters out the redundant data inherent in hyperspectral information based on the importance of the bands. Then, we design the spectral discrepancy aware (SDA) module to enrich the semantic representation of high-level information by extracting pixel-level spectral features. Extensive experiments on two hyperspectral datasets demonstrate that our proposed method achieves state-of-the-art performance when compared with other ones.
161. 【2512.18241】SG-RIFE: Semantic-Guided Real-Time Intermediate Flow Estimation with Diffusion-Competitive Perceptual Quality
链接:https://arxiv.org/abs/2512.18241
作者:Pan Ben Wong,Chengli Wu,Hanyue Lu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Video Frame Interpolation, Real-time Video Frame, Frame Interpolation, Video Frame, offer high throughput
备注:
点击查看摘要
Abstract:Real-time Video Frame Interpolation (VFI) has long been dominated by flow-based methods like RIFE, which offer high throughput but often fail in complicated scenarios involving large motion and occlusion. Conversely, recent diffusion-based approaches (e.g., Consec. BB) achieve state-of-the-art perceptual quality but suffer from prohibitive latency, rendering them impractical for real-time applications. To bridge this gap, we propose Semantic-Guided RIFE (SG-RIFE). Instead of training from scratch, we introduce a parameter-efficient fine-tuning strategy that augments a pre-trained RIFE backbone with semantic priors from a frozen DINOv3 Vision Transformer. We propose a Split-Fidelity Aware Projection Module (Split-FAPM) to compress and refine high-dimensional features, and a Deformable Semantic Fusion (DSF) module to align these semantic priors with pixel-level motion fields. Experiments on SNU-FILM demonstrate that semantic injection provides a decisive boost in perceptual fidelity. SG-RIFE outperforms diffusion-based LDMVFI in FID/LPIPS and achieves quality comparable to Consec. BB on complex benchmarks while running significantly faster, proving that semantic consistency enables flow-based methods to achieve diffusion-competitive perceptual quality in near real-time.
162. 【2512.18237】Joint Learning of Depth, Pose, and Local Radiance Field for Large Scale Monocular 3D Reconstruction
链接:https://arxiv.org/abs/2512.18237
作者:Shahram Najam Syed,Yitian Hu,Yuchao Yao
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:yields ghost geometry, monocular video collapses, scale-ambiguous depth yields, drift corrupts alignment, depth yields ghost
备注: 8 pages, 2 figures, 2 tables
点击查看摘要
Abstract:Photorealistic 3-D reconstruction from monocular video collapses in large-scale scenes when depth, pose, and radiance are solved in isolation: scale-ambiguous depth yields ghost geometry, long-horizon pose drift corrupts alignment, and a single global NeRF cannot model hundreds of metres of content. We introduce a joint learning framework that couples all three factors and demonstrably overcomes each failure case. Our system begins with a Vision-Transformer (ViT) depth network trained with metric-scale supervision, giving globally consistent depths despite wide field-of-view variations. A multi-scale feature bundle-adjustment (BA) layer refines camera poses directly in feature space--leveraging learned pyramidal descriptors instead of brittle keypoints--to suppress drift on unconstrained trajectories. For scene representation, we deploy an incremental local-radiance-field hierarchy: new hash-grid NeRFs are allocated and frozen on-the-fly when view overlap falls below a threshold, enabling city-block-scale coverage on a single GPU. Evaluated on the Tanks and Temples benchmark, our method reduces Absolute Trajectory Error to 0.001-0.021 m across eight indoor-outdoor sequences--up to 18x lower than BARF and 2x lower than NoPe-NeRF--while maintaining sub-pixel Relative Pose Error. These results demonstrate that metric-scale, drift-free 3-D reconstruction and high-fidelity novel-view synthesis are achievable from a single uncalibrated RGB camera.
163. 【2512.18231】Investigating Spatial Attention Bias in Vision-Language Models
链接:https://arxiv.org/abs/2512.18231
作者:Aryan Chaudhary,Sanchit Goyal,Pratik Narang,Dhruv Kumar
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:remain largely unexplored, demonstrated remarkable capabilities, processing remain largely, spatial processing remain, largely unexplored
备注:
点击查看摘要
Abstract:Vision-Language Models have demonstrated remarkable capabilities in understanding visual content, yet systematic biases in their spatial processing remain largely unexplored. This work identifies and characterizes a systematic spatial attention bias where VLMs consistently prioritize describing left-positioned content before right-positioned content in horizontally concatenated images. Through controlled experiments on image pairs using both open-source and closed-source models, we demonstrate that this bias persists across different architectures, with models describing left-positioned content first in approximately 97% of cases under neutral prompting conditions. Testing on an Arabic-finetuned model reveals that the bias persists despite right-to-left language training, ruling out language reading direction as the primary cause. Investigation of training dataset annotation guidelines from PixMo and Visual Genome reveals no explicit left-first ordering instructions, suggesting the bias is consistent with architectural factors rather than explicit training data instructions. These findings reveal fundamental limitations in how current VLMs process spatial information.
164. 【2512.18226】Multifaceted Exploration of Spatial Openness in Rental Housing: A Big Data Analysis in Tokyo's 23 Wards
链接:https://arxiv.org/abs/2512.18226
作者:Takuya OKi,Yuan Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:influencing factors separately, Understanding spatial openness, Understanding spatial, improving residential quality, studies often treat
备注:
点击查看摘要
Abstract:Understanding spatial openness is vital for improving residential quality and design; however, studies often treat its influencing factors separately. This study developed a quantitative framework to evaluate the spatial openness in housing from two- (2D) and three- (3D) dimensional perspectives. Using data from 4,004 rental units in Tokyo's 23 wards, we examined the temporal and spatial variations in openness and its relationship with rent and housing attributes. 2D openness was computed via planar visibility using visibility graph analysis (VGA) from floor plans, whereas 3D openness was derived from interior images analysed using Mask2Former, a semantic segmentation model that identifies walls, ceilings, floors, and windows. The results showed an increase in living room visibility and a 1990s peak in overall openness. Spatial analyses revealed partial correlations among openness, rent, and building characteristics, reflecting urban redevelopment trends. Although the 2D and 3D openness indicators were not directly correlated, higher openness tended to correspond to higher rent. The impression scores predicted by the existing models were only weakly related to openness, suggesting that the interior design and furniture more strongly shape perceived space. This study offers a new multidimensional data-driven framework for quantifying residential spatial openness and linking it with urban and market dynamics.
165. 【2512.18219】Unsupervised Anomaly Detection with an Enhanced Teacher for Student-Teacher Feature Pyramid Matching
链接:https://arxiv.org/abs/2512.18219
作者:Mohammad Zolfaghari,Hedieh Sajedi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Student-Teacher Feature Pyramid, achieving high-performance metrics, Anomaly detection, Feature Pyramid, Experiment results
备注:
点击查看摘要
Abstract:Anomaly detection or outlier is one of the challenging subjects in unsupervised learning . This paper is introduced a student-teacher framework for anomaly detection that its teacher network is enhanced for achieving high-performance metrics . For this purpose , we first pre-train the ResNet-18 network on the ImageNet and then fine-tune it on the MVTech-AD dataset . Experiment results on the image-level and pixel-level demonstrate that this idea has achieved better metrics than the previous methods . Our model , Enhanced Teacher for Student-Teacher Feature Pyramid (ET-STPM), achieved 0.971 mean accuracy on the image-level and 0.977 mean accuracy on the pixel-level for anomaly detection.
166. 【2512.18215】Stable and Efficient Single-Rollout RL for Multimodal Reasoning
链接:https://arxiv.org/abs/2512.18215
作者:Rui Liu,Dian Yu,Lei Ke,Haolin Liu,Yujun Zhou,Zhenwen Liang,Haitao Mi,Pratap Tokekar,Dong Yu
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Language Models, Multimodal Large Language, Reinforcement Learning, Verifiable Rewards, Language Models
备注:
点击查看摘要
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has become a key paradigm to improve the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevalent group-based algorithms such as GRPO require multi-rollout sampling for each prompt. While more efficient single-rollout variants have recently been explored in text-only settings, we find that they suffer from severe instability in multimodal contexts, often leading to training collapse. To address this training efficiency-stability trade-off, we introduce $\textbf{MSSR}$ (Multimodal Stabilized Single-Rollout), a group-free RLVR framework that achieves both stable optimization and effective multimodal reasoning performance. MSSR achieves this via an entropy-based advantage-shaping mechanism that adaptively regularizes advantage magnitudes, preventing collapse and maintaining training stability. While such mechanisms have been used in group-based RLVR, we show that in the multimodal single-rollout setting they are not merely beneficial but essential for stability. In in-distribution evaluations, MSSR demonstrates superior training compute efficiency, achieving similar validation accuracy to the group-based baseline with half the training steps. When trained for the same number of steps, MSSR's performance surpasses the group-based baseline and shows consistent generalization improvements across five diverse reasoning-intensive benchmarks. Together, these results demonstrate that MSSR enables stable, compute-efficient, and effective RLVR for complex multimodal reasoning tasks.
167. 【2512.18192】Multi-Part Object Representations via Graph Structures and Co-Part Discovery
链接:https://arxiv.org/abs/2512.18192
作者:Alex Foo,Wynne Hsu,Mong Li Lee
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Discovering object-centric representations, Discovering object-centric, sample efficiency, vision models, significantly enhance
备注:
点击查看摘要
Abstract:Discovering object-centric representations from images can significantly enhance the robustness, sample efficiency and generalizability of vision models. Works on images with multi-part objects typically follow an implicit object representation approach, which fail to recognize these learned objects in occluded or out-of-distribution contexts. This is due to the assumption that object part-whole relations are implicitly encoded into the representations through indirect training objectives. We address this limitation by proposing a novel method that leverages on explicit graph representations for parts and present a co-part object discovery algorithm. We then introduce three benchmarks to evaluate the robustness of object-centric methods in recognizing multi-part objects within occluded and out-of-distribution settings. Experimental results on simulated, realistic, and real-world images show marked improvements in the quality of discovered objects compared to state-of-the-art methods, as well as the accurate recognition of multi-part objects in occluded and out-of-distribution contexts. We also show that the discovered object-centric representations can more accurately predict key object properties in a downstream task, highlighting the potential of our method to advance the field of object-centric representations.
168. 【2512.18187】ALIGN: Advanced Query Initialization with LiDAR-Image Guidance for Occlusion-Robust 3D Object Detection
链接:https://arxiv.org/abs/2512.18187
作者:Janghyun Baek,Mincheol Chang,Seokha Moon,Seung Joon Lee,Jinkyu Kim
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:BEV heatmap-based sampling, inefficient query usage, existing query initialization, query initialization strategies,such, Recent query-based
备注: 12 pages, 6 figures
点击查看摘要
Abstract:Recent query-based 3D object detection methods using camera and LiDAR inputs have shown strong performance, but existing query initialization strategies,such as random sampling or BEV heatmap-based sampling, often result in inefficient query usage and reduced accuracy, particularly for occluded or crowded objects. To address this limitation, we propose ALIGN (Advanced query initialization with LiDAR and Image GuidaNce), a novel approach for occlusion-robust, object-aware query initialization. Our model consists of three key components: (i) Occlusion-aware Center Estimation (OCE), which integrates LiDAR geometry and image semantics to estimate object centers accurately (ii) Adaptive Neighbor Sampling (ANS), which generates object candidates from LiDAR clustering and supplements each object by sampling spatially and semantically aligned points around it and (iii) Dynamic Query Balancing (DQB), which adaptively balances queries between foreground and background regions. Our extensive experiments on the nuScenes benchmark demonstrate that ALIGN consistently improves performance across multiple state-of-the-art detectors, achieving gains of up to +0.9 mAP and +1.2 NDS, particularly in challenging scenes with occlusions or dense crowds. Our code will be publicly available upon publication.
169. 【2512.18184】Is There a Better Source Distribution than Gaussian? Exploring Source Distributions for Image Flow Matching
链接:https://arxiv.org/abs/2512.18184
作者:Junho Lee,Kwanseok Kim,Joonseok Lee
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:powerful generative modeling, generative modeling approach, Flow matching, powerful generative, generative modeling
备注:
点击查看摘要
Abstract:Flow matching has emerged as a powerful generative modeling approach with flexible choices of source distribution. While Gaussian distributions are commonly used, the potential for better alternatives in high-dimensional data generation remains largely unexplored. In this paper, we propose a novel 2D simulation that captures high-dimensional geometric properties in an interpretable 2D setting, enabling us to analyze the learning dynamics of flow matching during training. Based on this analysis, we derive several key insights about flow matching behavior: (1) density approximation can paradoxically degrade performance due to mode discrepancy, (2) directional alignment suffers from path entanglement when overly concentrated, (3) Gaussian's omnidirectional coverage ensures robust learning, and (4) norm misalignment incurs substantial learning costs. Building on these insights, we propose a practical framework that combines norm-aligned training with directionally-pruned sampling. This approach maintains the robust omnidirectional supervision essential for stable flow learning, while eliminating initializations in data-sparse regions during inference. Importantly, our pruning strategy can be applied to any flow matching model trained with a Gaussian source, providing immediate performance gains without the need for retraining. Empirical evaluations demonstrate consistent improvements in both generation quality and sampling efficiency. Our findings provide practical insights and guidelines for source distribution design and introduce a readily applicable technique for improving existing flow matching models. Our code is available at this https URL.
170. 【2512.18181】MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation
链接:https://arxiv.org/abs/2512.18181
作者:Kaixing Yang,Jiashu Zhu,Xulong Tang,Ziqiao Peng,Xiangyue Zhang,Puwei Wang,Jiahong Wu,Xiangxiang Chu,Hongyan Liu,Jun He
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:compelling research direction, online dance-video platforms, AI-generated content, research direction, rise of online
备注:
点击查看摘要
Abstract:With the rise of online dance-video platforms and rapid advances in AI-generated content (AIGC), music-driven dance generation has emerged as a compelling research direction. Despite substantial progress in related domains such as music-driven 3D dance generation, pose-driven image animation, and audio-driven talking-head synthesis, existing methods cannot be directly adapted to this task. Moreover, the limited studies in this area still struggle to jointly achieve high-quality visual appearance and realistic human motion. Accordingly, we present MACE-Dance, a music-driven dance video generation framework with cascaded Mixture-of-Experts (MoE). The Motion Expert performs music-to-3D motion generation while enforcing kinematic plausibility and artistic expressiveness, whereas the Appearance Expert carries out motion- and reference-conditioned video synthesis, preserving visual identity with spatiotemporal coherence. Specifically, the Motion Expert adopts a diffusion model with a BiMamba-Transformer hybrid architecture and a Guidance-Free Training (GFT) strategy, achieving state-of-the-art (SOTA) performance in 3D dance generation. The Appearance Expert employs a decoupled kinematic-aesthetic fine-tuning strategy, achieving state-of-the-art (SOTA) performance in pose-driven image animation. To better benchmark this task, we curate a large-scale and diverse dataset and design a motion-appearance evaluation protocol. Based on this protocol, MACE-Dance also achieves state-of-the-art performance. Project page: this https URL
171. 【2512.18177】NEURO-GUARD: Neuro-Symbolic Generalization and Unbiased Adaptive Routing for Diagnostics -- Explainable Medical AI
链接:https://arxiv.org/abs/2512.18177
作者:Midhat Urooj,Ayan Banerjee,Sandeep Gupta
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:high-stakes clinical decision-making, subtle visual cues, image-based diagnosis remains, remains a central, central challenge
备注: Accepted at Asilomar Conference
点击查看摘要
Abstract:Accurate yet interpretable image-based diagnosis remains a central challenge in medical AI, particularly in settings characterized by limited data, subtle visual cues, and high-stakes clinical decision-making. Most existing vision models rely on purely data-driven learning and produce black-box predictions with limited interpretability and poor cross-domain generalization, hindering their real-world clinical adoption. We present NEURO-GUARD, a novel knowledge-guided vision framework that integrates Vision Transformers (ViTs) with language-driven reasoning to improve performance, transparency, and domain robustness. NEURO-GUARD employs a retrieval-augmented generation (RAG) mechanism for self-verification, in which a large language model (LLM) iteratively generates, evaluates, and refines feature-extraction code for medical images. By grounding this process in clinical guidelines and expert knowledge, the framework progressively enhances feature detection and classification beyond purely data-driven baselines. Extensive experiments on diabetic retinopathy classification across four benchmark datasets APTOS, EyePACS, Messidor-1, and Messidor-2 demonstrate that NEURO-GUARD improves accuracy by 6.2% over a ViT-only baseline (84.69% vs. 78.4%) and achieves a 5% gain in domain generalization. Additional evaluations on MRI-based seizure detection further confirm its cross-domain robustness, consistently outperforming existing methods. Overall, NEURO-GUARD bridges symbolic medical reasoning with subsymbolic visual learning, enabling interpretable, knowledge-aware, and generalizable medical image diagnosis while achieving state-of-the-art performance across multiple datasets.
Comments:
Accepted at Asilomar Conference
Subjects:
Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2512.18177 [cs.AI]
(or
arXiv:2512.18177v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2512.18177
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
172. 【2512.18176】Atlas is Your Perfect Context: One-Shot Customization for Generalizable Foundational Medical Image Segmentation
链接:https://arxiv.org/abs/2512.18176
作者:Ziyu Zhang,Yi Yu,Simeng Zhu,Ahmed Aly,Yunhe Gao,Ning Gu,Yuan Xue
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Accurate medical image, Accurate medical, treatment planning, diagnosis and treatment, foundation models
备注:
点击查看摘要
Abstract:Accurate medical image segmentation is essential for clinical diagnosis and treatment planning. While recent interactive foundation models (e.g., nnInteractive) enhance generalization through large-scale multimodal pretraining, they still depend on precise prompts and often perform below expectations in contexts that are underrepresented in their training data. We present AtlasSegFM, an atlas-guided framework that customizes available foundation models to clinical contexts with a single annotated example. The core innovations are: 1) a pipeline that provides context-aware prompts for foundation models via registration between a context atlas and query images, and 2) a test-time adapter to fuse predictions from both atlas registration and the foundation model. Extensive experiments across public and in-house datasets spanning multiple modalities and organs demonstrate that AtlasSegFM consistently improves segmentation, particularly for small, delicate structures. AtlasSegFM provides a lightweight, deployable solution one-shot customization of foundation models in real-world clinical workflows. The code will be made publicly available.
173. 【2512.18161】Local Patches Meet Global Context: Scalable 3D Diffusion Priors for Computed Tomography Reconstruction
链接:https://arxiv.org/abs/2512.18161
作者:Taewon Yang,Jason Hu,Jeffrey A. Fessler,Liyue Shen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Diffusion models, solve inverse problems, models learn strong, Diffusion models learn, training diffusion models
备注:
点击查看摘要
Abstract:Diffusion models learn strong image priors that can be leveraged to solve inverse problems like medical image reconstruction. However, for real-world applications such as 3D Computed Tomography (CT) imaging, directly training diffusion models on 3D data presents significant challenges due to the high computational demands of extensive GPU resources and large-scale datasets. Existing works mostly reuse 2D diffusion priors to address 3D inverse problems, but fail to fully realize and leverage the generative capacity of diffusion models for high-dimensional data. In this study, we propose a novel 3D patch-based diffusion model that can learn a fully 3D diffusion prior from limited data, enabling scalable generation of high-resolution 3D images. Our core idea is to learn the prior of 3D patches to achieve scalable efficiency, while coupling local and global information to guarantee high-quality 3D image generation, by modeling the joint distribution of position-aware 3D local patches and downsampled 3D volume as global context. Our approach not only enables high-quality 3D generation, but also offers an unprecedentedly efficient and accurate solution to high-resolution 3D inverse problems. Experiments on 3D CT reconstruction across multiple datasets show that our method outperforms state-of-the-art methods in both performance and efficiency, notably achieving high-resolution 3D reconstruction of $512 \times 512 \times 256$ ($\sim$20 mins).
174. 【2512.18159】EndoStreamDepth: Temporally Consistent Monocular Depth Estimation for Endoscopic Video Streams
链接:https://arxiv.org/abs/2512.18159
作者:Hao Li,Daiwei Lu,Jiacheng Wang,Robert J. Webster III,Ipek Oguz
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:endoscopic video streams, video streams, endoscopic video, accurate depth maps, work presents EndoStreamDepth
备注:
点击查看摘要
Abstract:This work presents EndoStreamDepth, a monocular depth estimation framework for endoscopic video streams. It provides accurate depth maps with sharp anatomical boundaries for each frame, temporally consistent predictions across frames, and real-time throughput. Unlike prior work that uses batched inputs, EndoStreamDepth processes individual frames with a temporal module to propagate inter-frame information. The framework contains three main components: (1) a single-frame depth network with endoscopy-specific transformation to produce accurate depth maps, (2) multi-level Mamba temporal modules that leverage inter-frame information to improve accuracy and stabilize predictions, and (3) a hierarchical design with comprehensive multi-scale supervision, where complementary loss terms jointly improve local boundary sharpness and global geometric consistency. We conduct comprehensive evaluations on two publicly available colonoscopy depth estimation datasets. Compared to state-of-the-art monocular depth estimation methods, EndoStreamDepth substantially improves performance, and it produces depth maps with sharp, anatomically aligned boundaries, which are essential to support downstream tasks such as automation for robotic surgery. The code is publicly available at this https URL
175. 【2512.18128】SERA-H: Beyond Native Sentinel Spatial Limits for High-Resolution Canopy Height Mapping
链接:https://arxiv.org/abs/2512.18128
作者:Thomas Boudras,Martin Schwartz,Rasmus Fensholt,Martin Brandt,Ibrahim Fayad,Jean-Pierre Wigneron,Gabriel Belouze,Fajwel Fogel,Philippe Ciais
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:biodiversity monitoring, management and biodiversity, canopy height, height maps, predict height maps
备注: 17 pages, 8 figures, 3 tables. Source code available at [this https URL](https://github.com/ThomasBoudras/SERA-H#)
点击查看摘要
Abstract:High-resolution mapping of canopy height is essential for forest management and biodiversity monitoring. Although recent studies have led to the advent of deep learning methods using satellite imagery to predict height maps, these approaches often face a trade-off between data accessibility and spatial resolution. To overcome these limitations, we present SERA-H, an end-to-end model combining a super-resolution module (EDSR) and temporal attention encoding (UTAE). Trained under the supervision of high-density LiDAR data (ALS), our model generates 2.5 m resolution height maps from freely available Sentinel-1 and Sentinel-2 (10 m) time series data. Evaluated on an open-source benchmark dataset in France, SERA-H, with a MAE of 2.6 m and a coefficient of determination of 0.82, not only outperforms standard Sentinel-1/2 baselines but also achieves performance comparable to or better than methods relying on commercial very high-resolution imagery (SPOT-6/7, PlanetScope, Maxar). These results demonstrate that combining high-resolution supervision with the spatiotemporal information embedded in time series enables the reconstruction of details beyond the input sensors' native resolution. SERA-H opens the possibility of freely mapping forests with high revisit frequency, achieving accuracy comparable to that of costly commercial imagery. The source code is available at this https URL
176. 【2512.18115】Layout-Aware Text Editing for Efficient Transformation of Academic PDFs to Markdown
链接:https://arxiv.org/abs/2512.18115
作者:Changxu Duan
类目:Multimedia (cs.MM); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)
关键词:digital library workflows, enable scalable digital, scalable digital library, Academic documents stored, plain text structured
备注: Accepted ICDAR 2025
点击查看摘要
Abstract:Academic documents stored in PDF format can be transformed into plain text structured markup languages to enhance accessibility and enable scalable digital library workflows. Markup languages allow for easier updates and customization, making academic content more adaptable and accessible to diverse usage, such as linguistic corpus compilation. Such documents, typically delivered in PDF format, contain complex elements including mathematical formulas, figures, headers, and tables, as well as densely layouted text. Existing end-to-end decoder transformer models can transform screenshots of documents into markup language. However, these models exhibit significant inefficiencies; their token-by-token decoding from scratch wastes a lot of inference steps in regenerating dense text that could be directly copied from PDF files. To solve this problem, we introduce EditTrans, a hybrid editing-generation model whose features allow identifying a queue of to-be-edited text from a PDF before starting to generate markup language. EditTrans contains a lightweight classifier fine-tuned from a Document Layout Analysis model on 162,127 pages of documents from arXiv. In our evaluations, EditTrans reduced the transformation latency up to 44.5% compared to end-to-end decoder transformer models, while maintaining transformation quality. Our code and reproducible dataset production scripts are open-sourced.
177. 【2512.18082】Uncertainty-Gated Region-Level Retrieval for Robust Semantic Segmentation
链接:https://arxiv.org/abs/2512.18082
作者:Shreshth Rajan,Raymond Liu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:outdoor street scenes, street scenes plays, mobile robotics, autonomous driving, outdoor street
备注:
点击查看摘要
Abstract:Semantic segmentation of outdoor street scenes plays a key role in applications such as autonomous driving, mobile robotics, and assistive technology for visually-impaired pedestrians. For these applications, accurately distinguishing between key surfaces and objects such as roads, sidewalks, vehicles, and pedestrians is essential for maintaining safety and minimizing risks. Semantic segmentation must be robust to different environments, lighting and weather conditions, and sensor noise, while being performed in real-time. We propose a region-level, uncertainty-gated retrieval mechanism that improves segmentation accuracy and calibration under domain shift. Our best method achieves an 11.3% increase in mean intersection-over-union while reducing retrieval cost by 87.5%, retrieving for only 12.5% of regions compared to 100% for always-on baseline.
178. 【2512.18073】FPBench: A Comprehensive Benchmark of Multimodal Large Language Models for Fingerprint Analysis
链接:https://arxiv.org/abs/2512.18073
作者:Ekta Balkrishna Gavas,Sudipta Banerjee,Chinmay Hegde,Nasir Memon
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:visual question answering, complex data analysis, gained significant traction, Multimodal LLMs, data analysis
备注:
点击查看摘要
Abstract:Multimodal LLMs (MLLMs) have gained significant traction in complex data analysis, visual question answering, generation, and reasoning. Recently, they have been used for analyzing the biometric utility of iris and face images. However, their capabilities in fingerprint understanding are yet unexplored. In this work, we design a comprehensive benchmark, \textsc{FPBench} that evaluates the performance of 20 MLLMs (open-source and proprietary) across 7 real and synthetic datasets on 8 biometric and forensic tasks using zero-shot and chain-of-thought prompting strategies. We discuss our findings in terms of performance, explainability and share our insights into the challenges and limitations. We establish \textsc{FPBench} as the first comprehensive benchmark for fingerprint domain understanding with MLLMs paving the path for foundation models for fingerprints.
179. 【2512.18057】FOODER: Real-time Facial Authentication and Expression Recognition
链接:https://arxiv.org/abs/2512.18057
作者:Sabri Mustafa Kahya,Muhammet Sami Yavuz,Boran Hamdi Sivrikaya,Eckehard Steinbach
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)
关键词:training domain, safe deployment, deployment of neural, identification of samples, expression recognition
备注: Book chapter
点击查看摘要
Abstract:Out-of-distribution (OOD) detection is essential for the safe deployment of neural networks, as it enables the identification of samples outside the training domain. We present FOODER, a real-time, privacy-preserving radar-based framework that integrates OOD-based facial authentication with facial expression recognition. FOODER operates using low-cost frequency-modulated continuous-wave (FMCW) radar and exploits both range-Doppler and micro range-Doppler representations. The authentication module employs a multi-encoder multi-decoder architecture with Body Part (BP) and Intermediate Linear Encoder-Decoder (ILED) components to classify a single enrolled individual as in-distribution while detecting all other faces as OOD. Upon successful authentication, an expression recognition module is activated. Concatenated radar representations are processed by a ResNet block to distinguish between dynamic and static facial expressions. Based on this categorization, two specialized MobileViT networks are used to classify dynamic expressions (smile, shock) and static expressions (neutral, anger). This hierarchical design enables robust facial authentication and fine-grained expression recognition while preserving user privacy by relying exclusively on radar data. Experiments conducted on a dataset collected with a 60 GHz short-range FMCW radar demonstrate that FOODER achieves an AUROC of 94.13% and an FPR95 of 18.12% for authentication, along with an average expression recognition accuracy of 94.70%. FOODER outperforms state-of-the-art OOD detection methods and several transformer-based architectures while operating efficiently in real time.
180. 【2512.18046】YolovN-CBi: A Lightweight and Efficient Architecture for Real-Time Detection of Small UAVs
链接:https://arxiv.org/abs/2512.18046
作者:Ami Pandat,Punna Rajasekhar,Gopika Vinod,Rohit Shukla
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Unmanned Aerial Vehicles, Unmanned Aerial, Aerial Vehicles, pose increasing risks, drones pose increasing
备注:
点击查看摘要
Abstract:Unmanned Aerial Vehicles, commonly known as, drones pose increasing risks in civilian and defense settings, demanding accurate and real-time drone detection systems. However, detecting drones is challenging because of their small size, rapid movement, and low visual contrast. A modified architecture of YolovN called the YolovN-CBi is proposed that incorporates the Convolutional Block Attention Module (CBAM) and the Bidirectional Feature Pyramid Network (BiFPN) to improve sensitivity to small object detections. A curated training dataset consisting of 28K images is created with various flying objects and a local test dataset is collected with 2500 images consisting of very small drone objects. The proposed architecture is evaluated on four benchmark datasets, along with the local test dataset. The baseline Yolov5 and the proposed Yolov5-CBi architecture outperform newer Yolo versions, including Yolov8 and Yolov12, in the speed-accuracy trade-off for small object detection. Four other variants of the proposed CBi architecture are also proposed and evaluated, which vary in the placement and usage of CBAM and BiFPN. These variants are further distilled using knowledge distillation techniques for edge deployment, using a Yolov5m-CBi teacher and a Yolov5n-CBi student. The distilled model achieved a mA@P0.5:0.9 of 0.6573, representing a 6.51% improvement over the teacher's score of 0.6171, highlighting the effectiveness of the distillation process. The distilled model is 82.9% faster than the baseline model, making it more suitable for real-time drone detection. These findings highlight the effectiveness of the proposed CBi architecture, together with the distilled lightweight models in advancing efficient and accurate real-time detection of small UAVs.
181. 【2512.18038】NodMAISI: Nodule-Oriented Medical AI for Synthetic Imaging
链接:https://arxiv.org/abs/2512.18038
作者:Fakrul Islam Tushar,Ehsan Samei,Cynthia Rudin,Joseph Y. Lo
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:lung cancer screening, annotation-intensive findings critical, small pulmonary nodules, medical imaging datasets, abnormal and annotation-intensive
备注: 3 tables, 7 figures, 12 Supplement tables, 9 Supplement figures
点击查看摘要
Abstract:Objective: Although medical imaging datasets are increasingly available, abnormal and annotation-intensive findings critical to lung cancer screening, particularly small pulmonary nodules, remain underrepresented and inconsistently curated. Methods: We introduce NodMAISI, an anatomically constrained, nodule-oriented CT synthesis and augmentation framework trained on a unified multi-source cohort (7,042 patients, 8,841 CTs, 14,444 nodules). The framework integrates: (i) a standardized curation and annotation pipeline linking each CT with organ masks and nodule-level annotations, (ii) a ControlNet-conditioned rectified-flow generator built on MAISI-v2's foundational blocks to enforce anatomy- and lesion-consistent synthesis, and (iii) lesion-aware augmentation that perturbs nodule masks (controlled shrinkage) while preserving surrounding anatomy to generate paired CT variants. Results: Across six public test datasets, NodMAISI improved distributional fidelity relative to MAISI-v2 (real-to-synthetic FID range 1.18 to 2.99 vs 1.69 to 5.21). In lesion detectability analysis using a MONAI nodule detector, NodMAISI substantially increased average sensitivity and more closely matched clinical scans (IMD-CT: 0.69 vs 0.39; DLCS24: 0.63 vs 0.20), with the largest gains for sub-centimeter nodules where MAISI-v2 frequently failed to reproduce the conditioned lesion. In downstream nodule-level malignancy classification trained on LUNA25 and externally evaluated on LUNA16, LNDbv4, and DLCS24, NodMAISI augmentation improved AUC by 0.07 to 0.11 at =20% clinical data and by 0.12 to 0.21 at 10%, consistently narrowing the performance gap under data scarcity.
182. 【2512.18028】Embodied4C: Measuring What Matters for Embodied Vision-Language Navigation
链接:https://arxiv.org/abs/2512.18028
作者:Tin Stribor Sohn,Maximilian Dillitzer,Jason J. Corso,Eric Sax
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:Vision-language navigation requires, navigation requires agents, requires agents, agents to reason, reason and act
备注:
点击查看摘要
Abstract:Vision-language navigation requires agents to reason and act under constraints of embodiment. While vision-language models (VLMs) demonstrate strong generalization, current benchmarks provide limited understanding of how embodiment -- i.e., the choice of physical platform, sensor configuration, and modality alignment -- influences perception, reasoning, and control. We introduce Embodied4C, a closed-loop benchmark designed as a Turing test for embodied reasoning. The benchmark evaluates the core embodied capabilities of VLMs across three heterogeneous embodiments -- autonomous vehicles, aerial drones, and robotic manipulators -- through approximately 1.1K one-shot reasoning questions and 58 goal-directed navigation tasks. These tasks jointly assess four foundational dimensions: semantic, spatial, temporal, and physical reasoning. Each embodiment presents dynamic sensor configurations and environment variations to probe generalization beyond platform-specific adaptation. To prevent embodiment overfitting, Embodied4C integrates domain-far queries targeting abstract and cross-context reasoning. Comprehensive evaluation across ten state-of-the-art VLMs and four embodied control baselines shows that cross-modal alignment and instruction tuning matter more than scale, while spatial and temporal reasoning remains the primary bottleneck for reliable embodied competence.
183. 【2512.18007】Robotic VLA Benefits from Joint Learning with Motion Image Diffusion
链接:https://arxiv.org/abs/2512.18007
作者:Yu Fang,Kanchana Ranasinghe,Le Xue,Honglu Zhou,Juntao Tan,Ran Xu,Shelby Heinecke,Caiming Xiong,Silvio Savarese,Daniel Szafir,Mingyu Ding,Michael S. Ryoo,Juan Carlos Niebles
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:achieved remarkable progress, mapping multimodal observations, motion image diffusion, achieved remarkable, remarkable progress
备注: Website: [this https URL](https://vla-motion.github.io/)
点击查看摘要
Abstract:Vision-Language-Action (VLA) models have achieved remarkable progress in robotic manipulation by mapping multimodal observations and instructions directly to actions. However, they typically mimic expert trajectories without predictive motion reasoning, which limits their ability to reason about what actions to take. To address this limitation, we propose joint learning with motion image diffusion, a novel strategy that enhances VLA models with motion reasoning capabilities. Our method extends the VLA architecture with a dual-head design: while the action head predicts action chunks as in vanilla VLAs, an additional motion head, implemented as a Diffusion Transformer (DiT), predicts optical-flow-based motion images that capture future dynamics. The two heads are trained jointly, enabling the shared VLM backbone to learn representations that couple robot control with motion knowledge. This joint learning builds temporally coherent and physically grounded representations without modifying the inference pathway of standard VLAs, thereby maintaining test-time latency. Experiments in both simulation and real-world environments demonstrate that joint learning with motion image diffusion improves the success rate of pi-series VLAs to 97.5% on the LIBERO benchmark and 58.0% on the RoboTwin benchmark, yielding a 23% improvement in real-world performance and validating its effectiveness in enhancing the motion reasoning capability of large-scale VLAs.
184. 【2512.18004】Seeing Justice Clearly: Handwritten Legal Document Translation with OCR and Vision-Language Models
链接:https://arxiv.org/abs/2512.18004
作者:Shubham Kumar Nigam,Parjanya Aditya Shukla,Noel Shallum,Arnab Bhattacharya
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:pose significant challenges, lack large digitized, large digitized corpora, exhibit high variability, machine translation continue
备注: Accepted in AILaw @ AAAI 2026 Conference
点击查看摘要
Abstract:Handwritten text recognition (HTR) and machine translation continue to pose significant challenges, particularly for low-resource languages like Marathi, which lack large digitized corpora and exhibit high variability in handwriting styles. The conventional approach to address this involves a two-stage pipeline: an OCR system extracts text from handwritten images, which is then translated into the target language using a machine translation model. In this work, we explore and compare the performance of traditional OCR-MT pipelines with Vision Large Language Models that aim to unify these stages and directly translate handwritten text images in a single, end-to-end step. Our motivation is grounded in the urgent need for scalable, accurate translation systems to digitize legal records such as FIRs, charge sheets, and witness statements in India's district and high courts. We evaluate both approaches on a curated dataset of handwritten Marathi legal documents, with the goal of enabling efficient legal document processing, even in low-resource environments. Our findings offer actionable insights toward building robust, edge-deployable solutions that enhance access to legal information for non-native speakers and legal professionals alike.
185. 【2512.18003】Name That Part: 3D Part Segmentation and Naming
链接:https://arxiv.org/abs/2512.18003
作者:Soumava Paul,Prakhar Kaushik,Ankit Vaidya,Anand Bhattad,Alan Yuille
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:decomposing objects, part, part segmentation, address semantic, limiting robust training
备注: Project page at [this https URL](https://name-that-part.github.io)
点击查看摘要
Abstract:We address semantic 3D part segmentation: decomposing objects into parts with meaningful names. While datasets exist with part annotations, their definitions are inconsistent across datasets, limiting robust training. Previous methods produce unlabeled decompositions or retrieve single parts without complete shape annotations. We propose ALIGN-Parts, which formulates part naming as a direct set alignment task. Our method decomposes shapes into partlets - implicit 3D part representations - matched to part descriptions via bipartite assignment. We combine geometric cues from 3D part fields, appearance from multi-view vision features, and semantic knowledge from language-model-generated affordance descriptions. Text-alignment loss ensures partlets share embedding space with text, enabling a theoretically open-vocabulary matching setup, given sufficient data. Our efficient and novel, one-shot, 3D part segmentation and naming method finds applications in several downstream tasks, including serving as a scalable annotation engine. As our model supports zero-shot matching to arbitrary descriptions and confidence-calibrated predictions for known categories, with human verification, we create a unified ontology that aligns PartNet, 3DCoMPaT++, and Find3D, consisting of 1,794 unique 3D parts. We also show examples from our newly created Tex-Parts dataset. We also introduce 2 novel metrics appropriate for the named 3D part segmentation task.
186. 【2512.17987】Enhancing Tea Leaf Disease Recognition with Attention Mechanisms and Grad-CAM Visualization
链接:https://arxiv.org/abs/2512.17987
作者:Omar Faruq Shikdar,Fahad Ahammed,B. M. Shahria Alam,Golam Kibria,Tawhidur Rahman,Nishat Tasnim Niloy
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:tea leaf diseases, consumed drinks globally, widely consumed drinks, tea leaf, leaf diseases
备注: 8 pages, 6 figures, International Conference on Computing and Communication Networks (ICCCNet-2025)
点击查看摘要
Abstract:Tea is among the most widely consumed drinks globally. Tea production is a key industry for many countries. One of the main challenges in tea harvesting is tea leaf diseases. If the spread of tea leaf diseases is not stopped in time, it can lead to massive economic losses for farmers. Therefore, it is crucial to identify tea leaf diseases as soon as possible. Manually identifying tea leaf disease is an ineffective and time-consuming method, without any guarantee of success. Automating this process will improve both the efficiency and the success rate of identifying tea leaf diseases. The purpose of this study is to create an automated system that can classify different kinds of tea leaf diseases, allowing farmers to take action to minimize the damage. A novel dataset was developed specifically for this study. The dataset contains 5278 images across seven classes. The dataset was pre-processed prior to training the model. We deployed three pretrained models: DenseNet, Inception, and EfficientNet. EfficientNet was used only in the ensemble model. We utilized two different attention modules to improve model performance. The ensemble model achieved the highest accuracy of 85.68%. Explainable AI was introduced for better model interpretability.
187. 【2512.17955】A Modular Framework for Single-View 3D Reconstruction of Indoor Environments
链接:https://arxiv.org/abs/2512.17955
作者:Yuxiao Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:single-view indoor scene, powered by diffusion, single-view indoor, core modules, diffusion techniques
备注: Master's thesis
点击查看摘要
Abstract:We propose a modular framework for single-view indoor scene 3D reconstruction, where several core modules are powered by diffusion techniques. Traditional approaches for this task often struggle with the complex instance shapes and occlusions inherent in indoor environments. They frequently overshoot by attempting to predict 3D shapes directly from incomplete 2D images, which results in limited reconstruction quality. We aim to overcome this limitation by splitting the process into two steps: first, we employ diffusion-based techniques to predict the complete views of the room background and occluded indoor instances, then transform them into 3D. Our modular framework makes contributions to this field through the following components: an amodal completion module for restoring the full view of occluded instances, an inpainting model specifically trained to predict room layouts, a hybrid depth estimation technique that balances overall geometric accuracy with fine detail expressiveness, and a view-space alignment method that exploits both 2D and 3D cues to ensure precise placement of instances within the scene. This approach effectively reconstructs both foreground instances and the room background from a single image. Extensive experiments on the 3D-Front dataset demonstrate that our method outperforms current state-of-the-art (SOTA) approaches in terms of both visual quality and reconstruction accuracy. The framework holds promising potential for applications in interior design, real estate, and augmented reality.
188. 【2512.17954】SCS-SupCon: Sigmoid-based Common and Style Supervised Contrastive Learning with Adaptive Decision Boundaries
链接:https://arxiv.org/abs/2512.17954
作者:Bin Wang,Fadi Dornaika
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:substantial intra-class variations, subtle inter-class differences, contrastive learning methods, Image classification, existing contrastive learning
备注:
点击查看摘要
Abstract:Image classification is hindered by subtle inter-class differences and substantial intra-class variations, which limit the effectiveness of existing contrastive learning methods. Supervised contrastive approaches based on the InfoNCE loss suffer from negative-sample dilution and lack adaptive decision boundaries, thereby reducing discriminative power in fine-grained recognition tasks. To address these limitations, we propose Sigmoid-based Common and Style Supervised Contrastive Learning (SCS-SupCon). Our framework introduces a sigmoid-based pairwise contrastive loss with learnable temperature and bias parameters to enable adaptive decision boundaries. This formulation emphasizes hard negatives, mitigates negative-sample dilution, and more effectively exploits supervision. In addition, an explicit style-distance constraint further disentangles style and content representations, leading to more robust feature learning. Comprehensive experiments on six benchmark datasets, including CUB200-2011 and Stanford Dogs, demonstrate that SCS-SupCon achieves state-of-the-art performance across both CNN and Transformer backbones. On CIFAR-100 with ResNet-50, SCS-SupCon improves top-1 accuracy over SupCon by approximately 3.9 percentage points and over CS-SupCon by approximately 1.7 points under five-fold cross-validation. On fine-grained datasets, it outperforms CS-SupCon by 0.4--3.0 points. Extensive ablation studies and statistical analyses further confirm the robustness and generalization of the proposed framework, with Friedman tests and Nemenyi post-hoc evaluations validating the stability of the observed improvements.
189. 【2512.17953】Seeing Beyond the Scene: Analyzing and Mitigating Background Bias in Action Recognition
链接:https://arxiv.org/abs/2512.17953
作者:Ellie Zhou,Jihoon Chung,Olga Russakovsky
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Video Large Language, Human action recognition, action recognition models, Large Language Models, action recognition
备注: Accepted to NeurIPS 2025 Workshops: SPACE in Vision, Language, and Embodied AI; and What Makes a Good Video: Next Practices in Video Generation and Evaluation
点击查看摘要
Abstract:Human action recognition models often rely on background cues rather than human movement and pose to make predictions, a behavior known as background bias. We present a systematic analysis of background bias across classification models, contrastive text-image pretrained models, and Video Large Language Models (VLLM) and find that all exhibit a strong tendency to default to background reasoning. Next, we propose mitigation strategies for classification models and show that incorporating segmented human input effectively decreases background bias by 3.78%. Finally, we explore manual and automated prompt tuning for VLLMs, demonstrating that prompt design can steer predictions towards human-focused reasoning by 9.85%.
190. 【2512.17951】SuperFlow: Training Flow Matching Models with RL on the Fly
链接:https://arxiv.org/abs/2512.17951
作者:Kaijie Chen,Zhiyang Xu,Ying Shen,Zihao Lin,Yuguang Yao,Lifu Huang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent progress, improved text-image alignment, flow-based generative models, reinforcement learning, visual quality
备注: 15 pages
点击查看摘要
Abstract:Recent progress in flow-based generative models and reinforcement learning (RL) has improved text-image alignment and visual quality. However, current RL training for flow models still has two main problems: (i) GRPO-style fixed per-prompt group sizes ignore variation in sampling importance across prompts, which leads to inefficient sampling and slower training; and (ii) trajectory-level advantages are reused as per-step estimates, which biases credit assignment along the flow. We propose SuperFlow, an RL training framework for flow-based models that adjusts group sizes with variance-aware sampling and computes step-level advantages in a way that is consistent with continuous-time flow dynamics. Empirically, SuperFlow reaches promising performance while using only 5.4% to 56.3% of the original training steps and reduces training time by 5.2% to 16.7% without any architectural changes. On standard text-to-image (T2I) tasks, including text rendering, compositional image generation, and human preference alignment, SuperFlow improves over SD3.5-M by 4.6% to 47.2%, and over Flow-GRPO by 1.7% to 16.0%.
191. 【2512.17943】NystagmusNet: Explainable Deep Learning for Photosensitivity Risk Prediction
链接:https://arxiv.org/abs/2512.17943
作者:Karthik Prabhakar
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:face significant daily, significant daily challenges, daily challenges due, Nystagmus patients, photosensitivity face significant
备注: 12 pages, 7 figures, 2 tables, code available at [this https URL](https://github.com/knkarthik01/nystagmus-photosensitivity-ai)
点击查看摘要
Abstract:Nystagmus patients with photosensitivity face significant daily challenges due to involuntary eye movements exacerbated by environmental brightness conditions. Current assistive solutions are limited to symptomatic treatments without predictive personalization. This paper proposes NystagmusNet, an AI-driven system that predicts high-risk visual environments and recommends real-time visual adaptations. Using a dual-branch convolutional neural network trained on synthetic and augmented datasets, the system estimates a photosensitivity risk score based on environmental brightness and eye movement variance. The model achieves 75% validation accuracy on synthetic data. Explainability techniques including SHAP and GradCAM are integrated to highlight environmental risk zones, improving clinical trust and model interpretability. The system includes a rule-based recommendation engine for adaptive filter suggestions. Future directions include deployment via smart glasses and reinforcement learning for personalized recommendations.
192. 【2512.17939】A 96pJ/Frame/Pixel and 61pJ/Event Anti-UAV System with Hybrid Object Tracking Modes
链接:https://arxiv.org/abs/2512.17939
作者:Yuncheng Lu,Yucen Shi,Aobo Li,Zehao Li,Junying Li,Bo Wang,Tony Tae-Hyoung Kim
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:enable reliable detection, event-driven object tracking, Fast Object Tracking, Object Tracking Unit, energy-efficient anti-UAV system
备注: 2 pages, 7 figures, conference paper published in IEEE Asian Solid-State Circuits Conference 2025
点击查看摘要
Abstract:We present an energy-efficient anti-UAV system that integrates frame-based and event-driven object tracking to enable reliable detection of small and fast-moving drones. The system reconstructs binary event frames using run-length encoding, generates region proposals, and adaptively switches between frame mode and event mode based on object size and velocity. A Fast Object Tracking Unit improves robustness for high-speed targets through adaptive thresholding and trajectory-based classification. The neural processing unit supports both grayscale-patch and trajectory inference with a custom instruction set and a zero-skipping MAC architecture, reducing redundant neural computations by more than 97 percent. Implemented in 40 nm CMOS technology, the 2 mm^2 chip achieves 96 pJ per frame per pixel and 61 pJ per event at 0.8 V, and reaches 98.2 percent recognition accuracy on public UAV datasets across 50 to 400 m ranges and 5 to 80 pixels per second speeds. The results demonstrate state-of-the-art end-to-end energy efficiency for anti-UAV systems.
193. 【2512.16265】Privacy-Aware Sharing of Raw Spatial Sensor Data for Cooperative Perception
链接:https://arxiv.org/abs/2512.16265
作者:Bangya Liu,Chengpo Yan,Chenghao Jiang,Suman Banerjee,Akarsh Prabhakara
类目:Networking and Internet Architecture (cs.NI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:reliable scene understanding, scene understanding, Cooperative perception, vehicles is poised, poised to offer
备注:
点击查看摘要
Abstract:Cooperative perception between vehicles is poised to offer robust and reliable scene understanding. Recently, we are witnessing experimental systems research building testbeds that share raw spatial sensor data for cooperative perception. While there has been a marked improvement in accuracies and is the natural way forward, we take a moment to consider the problems with such an approach for eventual adoption by automakers. In this paper, we first argue that new forms of privacy concerns arise and discourage stakeholders to share raw sensor data. Next, we present SHARP, a research framework to minimize privacy leakage and drive stakeholders towards the ambitious goal of raw data based cooperative perception. Finally, we discuss open questions for networked systems, mobile computing, perception researchers, industry and government in realizing our proposed framework.
194. 【2512.19675】Multimodal LLMs for Historical Dataset Construction from Archival Image Scans: German Patents (1877-1918)
链接:https://arxiv.org/abs/2512.19675
作者:Niclas Griesshaber,Jochen Streb
类目:General Economics (econ.GN); Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)
关键词:large language models, leverage multimodal large, multimodal large language, archival image scans, German patents
备注:
点击查看摘要
Abstract:We leverage multimodal large language models (LLMs) to construct a dataset of 306,070 German patents (1877-1918) from 9,562 archival image scans using our LLM-based pipeline powered by Gemini-2.5-Pro and Gemini-2.5-Flash-Lite. Our benchmarking exercise provides tentative evidence that multimodal LLMs can create higher quality datasets than our research assistants, while also being more than 795 times faster and 205 times cheaper in constructing the patent dataset from our image corpus. About 20 to 50 patent entries are embedded on each page, arranged in a double-column format and printed in Gothic and Roman fonts. The font and layout complexity of our primary source material suggests to us that multimodal LLMs are a paradigm shift in how datasets are constructed in economic history. We open-source our benchmarking and patent datasets as well as our LLM-based data pipeline, which can be easily adapted to other image corpora using LLM-assisted coding tools, lowering the barriers for less technical researchers. Finally, we explain the economics of deploying LLMs for historical dataset construction and conclude by speculating on the potential implications for the field of economic history.
195. 【2512.19584】Patlak Parametric Image Estimation from Dynamic PET Using Diffusion Model Prior
链接:https://arxiv.org/abs/2512.19584
作者:Ziqian Huang,Boxiao Yu,Siqi Li,Savas Ozdemir,Sangjin Bae,Jae Sung Lee,Guobao Wang,Kuang Gong
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
关键词:Dynamic PET enables, Dynamic PET, dynamic PET requires, total-body dynamic PET, dynamic PET datasets
备注: 10 pages, 9 figures
点击查看摘要
Abstract:Dynamic PET enables the quantitative estimation of physiology-related parameters and is widely utilized in research and increasingly adopted in clinical settings. Parametric imaging in dynamic PET requires kinetic modeling to estimate voxel-wise physiological parameters based on specific kinetic models. However, parametric images estimated through kinetic model fitting often suffer from low image quality due to the inherently ill-posed nature of the fitting process and the limited counts resulting from non-continuous data acquisition across multiple bed positions in whole-body PET. In this work, we proposed a diffusion model-based kinetic modeling framework for parametric image estimation, using the Patlak model as an example. The score function of the diffusion model was pre-trained on static total-body PET images and served as a prior for both Patlak slope and intercept images by leveraging their patch-wise similarity. During inference, the kinetic model was incorporated as a data-consistency constraint to guide the parametric image estimation. The proposed framework was evaluated on total-body dynamic PET datasets with different dose levels, demonstrating the feasibility and promising performance of the proposed framework in improving parametric image quality.
196. 【2512.19577】Deep Learning for Primordial $B$-mode Extraction
链接:https://arxiv.org/abs/2512.19577
作者:Eric Guzman,Joel Meyers
类目:Cosmology and Nongalactic Astrophysics (astro-ph.CO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
关键词:primordial gravitational waves, mode polarization, gravitational waves, primordial gravitational, mode
备注: 12 pages, 8 figures. Code available from [this https URL](https://github.com/EEmGuzman/resunet-cmb)
点击查看摘要
Abstract:The search for primordial gravitational waves is a central goal of cosmic microwave background (CMB) surveys. Isolating the characteristic $B$-mode polarization signal sourced by primordial gravitational waves is challenging for several reasons: the amplitude of the signal is inherently small; astrophysical foregrounds produce $B$-mode polarization contaminating the signal; and secondary $B$-mode polarization fluctuations are produced via the conversion of $E$ modes. Current and future low-noise, multi-frequency observations enable sufficient precision to address the first two of these challenges such that secondary $B$ modes will become the bottleneck for improved constraints on the amplitude of primordial gravitational waves. The dominant source of secondary $B$-mode polarization is gravitational lensing by large scale structure. Various strategies have been developed to estimate the lensing deflection and to reverse its effects the CMB, thus reducing confusion from lensing $B$ modes in the search for primordial gravitational waves. However, a few complications remain. First, there may be additional sources of secondary $B$-mode polarization, for example from patchy reionization or from cosmic polarization rotation. Second, the statistics of delensed CMB maps can become complicated and non-Gaussian, especially when advanced lensing reconstruction techniques are applied. We previously demonstrated how a deep learning network, ResUNet-CMB, can provide nearly optimal simultaneous estimates of multiple sources of secondary $B$-mode polarization. In this paper, we show how deep learning can be applied to estimate and remove multiple sources of secondary $B$-mode polarization, and we further show how this technique can be used in a likelihood analysis to produce nearly optimal, unbiased estimates of the amplitude of primordial gravitational waves.
197. 【2512.19489】Rethinking Coupled Tensor Analysis for Hyperspectral Super-Resolution: Recoverable Modeling Under Endmember Variability
链接:https://arxiv.org/abs/2512.19489
作者:Meng Ding,Xiao Fu
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:spatially co-registered hyperspectral, hyperspectral super-resolution, co-registered hyperspectral, fusing a pair, work revisits
备注: The paper was accepted by SIAM Journal on Imaging Sciences
点击查看摘要
Abstract:This work revisits the hyperspectral super-resolution (HSR) problem, i.e., fusing a pair of spatially co-registered hyperspectral (HSI) and multispectral (MSI) images to recover a super-resolution image (SRI) that enhances the spatial resolution of the HSI. Coupled tensor decomposition (CTD)-based methods have gained traction in this domain, offering recoverability guarantees under various assumptions. Existing models such as canonical polyadic decomposition (CPD) and Tucker decomposition provide strong expressive power but lack physical interpretability. The block-term decomposition model with rank-$(L_r, L_r, 1)$ terms (the LL1 model) yields interpretable factors under the linear mixture model (LMM) of spectral images, but LMM assumptions are often violated in practice -- primarily due to nonlinear effects such as endmember variability (EV). To address this, we propose modeling spectral images using a more flexible block-term tensor decomposition with rank-$(L_r, M_r, N_r)$ terms (the LMN model). This modeling choice retains interpretability, subsumes CPD, Tucker, and LL1 as special cases, and robustly accounts for non-ideal effects such as EV, offering a balanced tradeoff between expressiveness and interpretability for HSR. Importantly, under the LMN model for HSI and MSI, recoverability of the SRI can still be established under proper conditions -- providing strong theoretical support. Extensive experiments on synthetic and real datasets further validate the effectiveness and robustness of the proposed method compared with existing CTD-based approaches.
198. 【2512.19225】Selective Phase-Aware Training of nnU-Net for Robust Breast Cancer Segmentation in Multi-Center DCE-MRI
链接:https://arxiv.org/abs/2512.19225
作者:Beyza Zayim,Aissiou Ikram,Boukhiar Naima
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:Breast cancer remains, evaluating breast tumors, Breast cancer, cancer remains, common cancer
备注:
点击查看摘要
Abstract:Breast cancer remains the most common cancer among women and is a leading cause of female mortality. Dynamic contrast-enhanced MRI (DCE-MRI) is a powerful imaging tool for evaluating breast tumors, yet the field lacks a standardized benchmark for analyzing treatment responses and guiding personalized care. We participated in the MAMA-MIA Challenge's Primary Tumor Segmentation task and this work presents a proposed selective, phase-aware training framework for the nnU-Net architecture, emphasizing quality-focused data selection to strengthen model robustness and generalization. We employed the No New Net (nnU-Net) framework with a selective training strategy that systematically analyzed the impact of image quality and center-specific variability on segmentation performance. Controlled experiments on the DUKE, NACT, ISPY1, and ISPY2 datasets revealed that including ISPY scans with motion artifacts and reduced contrast impaired segmentation performance, even with advanced preprocessing, such as contrast-limited adaptive histogram equalization (CLAHE). In contrast, training on DUKE and NACT data, which exhibited clearer contrast and fewer motion artifacts despite varying resolutions, with early phase images (0000-0002) provided more stable training conditions. Our results demonstrate the importance of phase-sensitive and quality-aware training strategies in achieving reliable segmentation performance in heterogeneous clinical datasets, highlighting the limitations of the expansion of naive datasets and motivating the need for future automation of quality-based data selection strategies.
199. 【2512.18200】SLIM: Semantic-based Low-bitrate Image compression for Machines by leveraging diffusion
链接:https://arxiv.org/abs/2512.18200
作者:Hyeonjin Lee,Jun-Hyuk Kim,Jong-Seok Lee
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:machine vision, image compression, image compression models, Low-bitrate Image compression, Semantic-based Low-bitrate Image
备注:
点击查看摘要
Abstract:In recent years, the demand of image compression models for machine vision has increased dramatically. However, the training frameworks of image compression still focus on the vision of human, maintaining the excessive perceptual details, thus have limitations in optimally reducing the bits per pixel in the case of performing machine vision tasks. In this paper, we propose Semantic-based Low-bitrate Image compression for Machines by leveraging diffusion, termed SLIM. This is a new effective training framework of image compression for machine vision, using a pretrained latent diffusion this http URL compressor model of our method focuses only on the Region-of-Interest (RoI) areas for machine vision in the image latent, to compress it compactly. Then the pretrained Unet model enhances the decompressed latent, utilizing a RoI-focused text caption which containing semantic information of the image. Therefore, SLIM is able to focus on RoI areas of the image without any guide mask at the inference stage, achieving low bitrate when compressing. And SLIM is also able to enhance a decompressed latent by denoising steps, so the final reconstructed image from the enhanced latent can be optimized for the machine vision task while still containing perceptual details for human vision. Experimental results show that SLIM achieves a higher classification accuracy in the same bits per pixel condition, compared to conventional image compression models for machines.
200. 【2512.18197】Standardized Evaluation of Automatic Methods for Perivascular Spaces Segmentation in MRI -- MICCAI 2024 Challenge Results
链接:https://arxiv.org/abs/2512.18197
作者:Yilei Wu,Yichi Zhang,Zijian Dong,Fang Ji,An Sen Tan,Gifford Tan,Sizhao Tang,Huijuan Chen,Zijiao Chen,Eric Kwun Kei Ng,Jose Bernal,Hang Min,Ying Xia,Ines Vati,Liz Cooper,Xiaoyu Hu,Yuchen Pei,Yutao Ma,Victor Nozais,Ami Tsuchida,Pierre-Yves Hervé,Philippe Boutinaud,Marc Joliot,Junghwa Kang,Wooseung Kim,Dayeon Bak,Rachika E. Hamadache,Valeriia Abramova,Xavier Lladó,Yuntao Zhu,Zhenyu Gong,Xin Chen,John McFadden,Pek Lan Khong,Roberto Duarte Coello,Hongwei Bran Li,Woon Puay Koh,Christopher Chen,Joanna M. Wardlaw,Maria del C. Valdés Hernández,Juan Helen Zhou
类目:Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
关键词:magnetic resonance imaging, cerebral small vessel, small vessel disease, Perivascular spaces, automatic enlarged PVS
备注:
点击查看摘要
Abstract:Perivascular spaces (PVS), when abnormally enlarged and visible in magnetic resonance imaging (MRI) structural sequences, are important imaging markers of cerebral small vessel disease and potential indicators of neurodegenerative conditions. Despite their clinical significance, automatic enlarged PVS (EPVS) segmentation remains challenging due to their small size, variable morphology, similarity with other pathological features, and limited annotated datasets. This paper presents the EPVS Challenge organized at MICCAI 2024, which aims to advance the development of automated algorithms for EPVS segmentation across multi-site data. We provided a diverse dataset comprising 100 training, 50 validation, and 50 testing scans collected from multiple international sites (UK, Singapore, and China) with varying MRI protocols and demographics. All annotations followed the STRIVE protocol to ensure standardized ground truth and covered the full brain parenchyma. Seven teams completed the full challenge, implementing various deep learning approaches primarily based on U-Net architectures with innovations in multi-modal processing, ensemble strategies, and transformer-based components. Performance was evaluated using dice similarity coefficient, absolute volume difference, recall, and precision metrics. The winning method employed MedNeXt architecture with a dual 2D/3D strategy for handling varying slice thicknesses. The top solutions showed relatively good performance on test data from seen datasets, but significant degradation of performance was observed on the previously unseen Shanghai cohort, highlighting cross-site generalization challenges due to domain shift. This challenge establishes an important benchmark for EPVS segmentation methods and underscores the need for the continued development of robust algorithms that can generalize in diverse clinical settings.
201. 【2512.17930】CytoDINO: Risk-Aware and Biologically-Informed Adaptation of DINOv3 for Bone Marrow Cytomorphology
链接:https://arxiv.org/abs/2512.17930
作者:Aziz Muminov,Anne Pham
类目:Other Quantitative Biology (q-bio.OT); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Quantitative Methods (q-bio.QM)
关键词:significant inter-observer variability, labor-intensive process subject, Bone marrow cell, Bone marrow, marrow cell cytomorphology
备注: 11 pages, 3 figures
点击查看摘要
Abstract:Bone marrow cell cytomorphology analysis is critical for the diagnosis of hematological malignancies but remains a labor-intensive process subject to significant inter-observer variability. While recent foundation models have shown promise in computational pathology, they often require extensive computational resources and fail to account for the asymmetric risks associated with clinical misdiagnosis. We introduce CytoDINO, a framework that achieves state-of-the-art performance on the Munich Leukemia Laboratory (MLL) dataset by fine-tuning DINOv3 using Low-Rank Adaptation (LoRA). Our primary contribution is a novel Hierarchical Focal Loss with Critical Penalties, which encodes biological relationships between cell lineages and explicitly penalizes clinically dangerous misclassifications (e.g., classifying blasts as normal cells). CytoDINO achieves an 88.2% weighted F1 score and 76.5% macro F1 on a held-out test set of 21 cell classes. By utilizing parameter-efficient fine-tuning with only 8% trainable parameters on a single NVIDIA RTX 5080, we demonstrate that consumer-grade hardware can match specialized infrastructure. Furthermore, confidence-based selective prediction yields 99.5% accuracy on 67% of samples, suggesting a viable pathway for clinical deployment where high-uncertainty cases are flagged for expert review
202. 【2512.17924】A curated UK rain radar data set for training and benchmarking nowcasting models
链接:https://arxiv.org/abs/2512.17924
作者:Viv Atureta,Rifki Priansyah Jasin,Stefan Siegert
类目:Atmospheric and Oceanic Physics (physics.ao-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applications (stat.AP)
关键词:machine learning methods, rain radar image, paper documents, statistical modeling, modeling and machine
备注:
点击查看摘要
Abstract:This paper documents a data set of UK rain radar image sequences for use in statistical modeling and machine learning methods for nowcasting. The main dataset contains 1,000 randomly sampled sequences of length 20 steps (15-minute increments) of 2D radar intensity fields of dimension 40x40 (at 5km spatial resolution). Spatially stratified sampling ensures spatial homogeneity despite removal of clear-sky cases by threshold-based truncation. For each radar sequence, additional atmospheric and geographic features are made available, including date, location, mean elevation, mean wind direction and speed and prevailing storm type. New R functions to extract data from the binary "Nimrod" radar data format are provided. A case study is presented to train and evaluate a simple convolutional neural network for radar nowcasting, including self-contained R code.
203. 【2512.17127】Disentangled representations via score-based variational autoencoders
链接:https://arxiv.org/abs/2512.17127
作者:Benjamin S. H. Lyo,Eero P. Simoncelli,Cristina Savin
类目:Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Multiscale Inference, unsupervised representation learning, method for unsupervised, learning that combines, combines the theoretical
备注: 34 pages, 7 figures
点击查看摘要
Abstract:We present the Score-based Autoencoder for Multiscale Inference (SAMI), a method for unsupervised representation learning that combines the theoretical frameworks of diffusion models and VAEs. By unifying their respective evidence lower bounds, SAMI formulates a principled objective that learns representations through score-based guidance of the underlying diffusion process. The resulting representations automatically capture meaningful structure in the data: it recovers ground truth generative factors in our synthetic dataset, learns factorized, semantic latent dimensions from complex natural images, and encodes video sequences into latent trajectories that are straighter than those of alternative encoders, despite training exclusively on static images. Furthermore, SAMI can extract useful representations from pre-trained diffusion models with minimal additional training. Finally, the explicitly probabilistic formulation provides new ways to identify semantically meaningful axes in the absence of supervised labels, and its mathematical exactness allows us to make formal statements about the nature of the learned representation. Overall, these results indicate that implicit structural information in diffusion models can be made explicit and interpretable through synergistic combination with a variational autoencoder.




