本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。

统计

今日共更新606篇论文,其中:

  • 自然语言处理87
  • 信息检索14
  • 计算机视觉124

自然语言处理

1. 【2512.25070】Scaling Open-Ended Reasoning to Predict the Future

链接https://arxiv.org/abs/2512.25070

作者:Nikhil Chandak,Shashwat Goel,Ameya Prabhu,Moritz Hardt,Jonas Geiping

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:High-stakes decision making, decision making involves, making involves reasoning, High-stakes decision, decision making

备注: 45 pages

点击查看摘要

Abstract:High-stakes decision making involves reasoning under uncertainty about the future. In this work, we train language models to make predictions on open-ended forecasting questions. To scale up training data, we synthesize novel forecasting questions from global events reported in daily news, using a fully automated, careful curation recipe. We train the Qwen3 thinking models on our dataset, OpenForesight. To prevent leakage of future information during training and evaluation, we use an offline news corpus, both for data generation and retrieval in our forecasting system. Guided by a small validation set, we show the benefits of retrieval, and an improved reward function for reinforcement learning (RL). Once we obtain our final forecasting system, we perform held-out testing between May to August 2025. Our specialized model, OpenForecaster 8B, matches much larger proprietary models, with our training improving the accuracy, calibration, and consistency of predictions. We find calibration improvements from forecasting training generalize across popular benchmarks. We open-source all our models, code, and data to make research on language model forecasting broadly accessible.

2. 【2512.25063】Many Minds from One Model: Bayesian Transformers for Population Intelligence

链接https://arxiv.org/abs/2512.25063

作者:Diji Yang,Yi Zhang

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:Population Bayesian Transformers, single functional hypothesis, Bayesian Transformer model, propose Population Bayesian, modern transformers

备注

点击查看摘要

Abstract:Despite their scale and success, modern transformers are almost universally trained as single-minded systems: optimization produces one deterministic set of parameters, representing a single functional hypothesis about the data. Motivated by the idea that intelligence emerge from many minds, we propose Population Bayesian Transformers (B-Trans), which transform a standard Large Language Model into a Bayesian Transformer model to supports sampling diverse yet coherent model instances from a single set of pre-trained weights. B-Trans introduces a Bayesian-motivated posterior proxy by treating the bias-like offsets in normalization layers as stochastic variables with a Gaussian variational approximation, inducing a distribution over model behavior without the cost of training full Bayesian neural networks. Sampling from this proxy yields a set of model instances with diverse behaviors while maintaining general competence. To preserve coherence within each generation, we freeze the sampled noise at the sequence level, enforcing temporal consistency across tokens. B-Trans allows for population-level decision-making, where aggregating predictions across sampled individuals significantly enhances exploration. Experiments across zero-shot generation, Reinforcement Learning with Verifiable Rewards (RLVR), and RL without explicit labels demonstrate that B-Trans effectively leverage the wisdom of crowds, yielding superior semantic diversity while achieving better task performance compared to deterministic baselines.

Subjects:

Machine Learning (cs.LG); Computation and Language (cs.CL)

Cite as:
arXiv:2512.25063 [cs.LG]

(or
arXiv:2512.25063v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2512.25063

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
3. 【2512.25052】AdaGReS:Adaptive Greedy Context Selection via Redundancy-Aware Scoring for Token-Budgeted RAG

链接https://arxiv.org/abs/2512.25052

作者:Chao Peng,Bin Wang,Zhilei Long,Jinfang Sheng

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:degrade downstream generation, standard top-k retrieval, Retrieval-augmented generation, waste token budget, downstream generation

备注: Preprint. Under review

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) is highly sensitive to the quality of selected context, yet standard top-k retrieval often returns redundant or near-duplicate chunks that waste token budget and degrade downstream generation. We present AdaGReS, a redundancy-aware context selection framework for token-budgeted RAG that optimizes a set-level objective combining query-chunk relevance and intra-set redundancy penalties. AdaGReS performs greedy selection under a token-budget constraint using marginal gains derived from the objective, and introduces a closed-form, instance-adaptive calibration of the relevance-redundancy trade-off parameter to eliminate manual tuning and adapt to candidate-pool statistics and budget limits. We further provide a theoretical analysis showing that the proposed objective exhibits epsilon-approximate submodularity under practical embedding similarity conditions, yielding near-optimality guarantees for greedy selection. Experiments on open-domain question answering (Natural Questions) and a high-redundancy biomedical (drug) corpus demonstrate consistent improvements in redundancy control and context quality, translating to better end-to-end answer quality and robustness across settings.

4. 【2512.25026】Modeling Language as a Sequence of Thoughts

链接https://arxiv.org/abs/2512.25026

作者:Nasim Borazjanizadeh,James McClelland

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:strikingly natural text, generate strikingly natural, strikingly natural, natural text, text by modeling

备注

点击查看摘要

Abstract:Transformer language models can generate strikingly natural text by modeling language as a sequence of tokens. Yet, by relying primarily on surface-level co-occurrence statistics, they fail to form globally consistent latent representations of entities and events, lack of which contributes to brittleness in relational direction (e.g., reversal curse), contextualization errors, and data inefficiency. On the other hand, cognitive science shows that human comprehension involves converting the input linguistic stream into compact, event-like representations that persist in memory while verbatim form is short-lived. Motivated by this view, we introduce Thought Gestalt (TG) model, a recurrent Transformer that models language at two levels of abstraction - tokens and sentence-level "thought" states. TG generates the tokens of one sentence at a time while cross-attending to a memory of prior sentence representations. In TG, token and sentence representations are generated using the same set of model parameters and trained with a single objective, the next-token cross-entropy: by retaining the computation graph of sentence representations written to memory, gradients from future token losses flow backward through cross-attention to optimize the parameters generating earlier sentence vectors. In scaling experiments, TG consistently improves efficiency over matched GPT-2 runs, among other baselines, with scaling fits indicating GPT-2 requires ~5-8% more data and ~33-42% more parameters to match TG's loss. TG also reduces errors on relational direction generalization on a father-son reversal curse probe.

5. 【2512.25015】MAMA-Memeia! Multi-Aspect Multi-Agent Collaboration for Depressive Symptoms Identification in Memes

链接https://arxiv.org/abs/2512.25015

作者:Siddhant Agarwal,Adya Dhuler,Polly Ruhnke,Melvin Speisman,Md Shad Akhtar,Shweta Yadav

类目:Computation and Language (cs.CL)

关键词:Large Language Model, past years, freely and easily, exclusively a medium, medium of humorous

备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Over the past years, memes have evolved from being exclusively a medium of humorous exchanges to one that allows users to express a range of emotions freely and easily. With the ever-growing utilization of memes in expressing depressive sentiments, we conduct a study on identifying depressive symptoms exhibited by memes shared by users of online social media platforms. We introduce RESTOREx as a vital resource for detecting depressive symptoms in memes on social media through the Large Language Model (LLM) generated and human-annotated explanations. We introduce MAMAMemeia, a collaborative multi-agent multi-aspect discussion framework grounded in the clinical psychology method of Cognitive Analytic Therapy (CAT) Competencies. MAMAMemeia improves upon the current state-of-the-art by 7.55% in macro-F1 and is established as the new benchmark compared to over 30 methods.

6. 【2512.24997】Classifying long legal documents using short random chunks

链接https://arxiv.org/abs/2512.24997

作者:Luis Adrián Cabrera-Diego

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Classifying legal documents, Classifying legal, specialized vocabulary, legal document classifier, feeding full documents

备注

点击查看摘要

Abstract:Classifying legal documents is a challenge, besides their specialized vocabulary, sometimes they can be very long. This means that feeding full documents to a Transformers-based models for classification might be impossible, expensive or slow. Thus, we present a legal document classifier based on DeBERTa V3 and a LSTM, that uses as input a collection of 48 randomly-selected short chunks (max 128 tokens). Besides, we present its deployment pipeline using Temporal, a durable execution solution, which allow us to have a reliable and robust processing workflow. The best model had a weighted F-score of 0.898, while the pipeline running on CPU had a processing median time of 498 seconds per 100 files.

7. 【2512.24947】CPJ: Explainable Agricultural Pest Diagnosis via Caption-Prompt-Judge with LLM-Judged Refinement

链接https://arxiv.org/abs/2512.24947

作者:Wentao Zhang,Tao Fang,Lina Lu,Lifei Wang,Weihe Zhong

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:costly supervised fine-tuning, domain shifts, interpretable crop disease, existing methods, methods often rely

备注: This paper is 6 pages in length and contains 2 figures. Tao Fang (Corresponding Author), Lina Lu (Co-corresponding Author)

点击查看摘要

Abstract:Accurate and interpretable crop disease diagnosis is essential for agricultural decision-making, yet existing methods often rely on costly supervised fine-tuning and perform poorly under domain shifts. We propose Caption--Prompt--Judge (CPJ), a training-free few-shot framework that enhances Agri-Pest VQA through structured, interpretable image captions. CPJ employs large vision-language models to generate multi-angle captions, refined iteratively via an LLM-as-Judge module, which then inform a dual-answer VQA process for both recognition and management responses. Evaluated on CDDMBench, CPJ significantly improves performance: using GPT-5-mini captions, GPT-5-Nano achieves \textbf{+22.7} pp in disease classification and \textbf{+19.5} points in QA score over no-caption baselines. The framework provides transparent, evidence-based reasoning, advancing robust and explainable agricultural diagnosis without fine-tuning. Our code and data are publicly available at: this https URL.

8. 【2512.24943】RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment

链接https://arxiv.org/abs/2512.24943

作者:Chenji Lu,Zhuo Chen,Hui Zhao,Zhenyi Wang,Pengjie Wang,Jian Xu,Bo Zheng

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Search relevance plays, Search relevance, web e-commerce, plays a central, central role

备注

点击查看摘要

Abstract:Search relevance plays a central role in web e-commerce. While large language models (LLMs) have shown significant results on relevance task, existing benchmarks lack sufficient complexity for comprehensive model assessment, resulting in an absence of standardized relevance evaluation metrics across the industry. To address this limitation, we propose Rule-Aware benchmark with Image for Relevance assessment(RAIR), a Chinese dataset derived from real-world scenarios. RAIR established a standardized framework for relevance assessment and provides a set of universal rules, which forms the foundation for standardized evaluation. Additionally, RAIR analyzes essential capabilities required for current relevance models and introduces a comprehensive dataset consists of three subset: (1) a general subset with industry-balanced sampling to evaluate fundamental model competencies; (2) a long-tail hard subset focus on challenging cases to assess performance limits; (3) a visual salience subset for evaluating multimodal understanding capabilities. We conducted experiments on RAIR using 14 open and closed-source models. The results demonstrate that RAIR presents sufficient challenges even for GPT-5, which achieved the best performance. RAIR data are now available, serving as an industry benchmark for relevance assessment while providing new insights into general LLM and Visual Language Model(VLM) evaluation.

9. 【2512.24940】Iterative Deployment Improves Planning Skills in LLMs

链接https://arxiv.org/abs/2512.24940

作者:Augusto B. Corrêa,Yoav Gelberg,Luckeciano C. Melo,Ilia Shumailov,André G. Pereira,Yarin Gal

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:large language models, previous models' deployment, data carefully curated, large language, carefully curated

备注

点击查看摘要

Abstract:We show that iterative deployment of large language models (LLMs), each fine-tuned on data carefully curated by users from the previous models' deployment, can significantly change the properties of the resultant models. By testing this mechanism on various planning domains, we observe substantial improvements in planning skills, with later models displaying emergent generalization by discovering much longer plans than the initial models. We then provide theoretical analysis showing that iterative deployment effectively implements reinforcement learning (RL) training in the outer-loop (i.e. not as part of intentional model training), with an implicit reward function. The connection to RL has two important implications: first, for the field of AI safety, as the reward function entailed by repeated deployment is not defined explicitly, and could have unexpected implications to the properties of future model deployments. Second, the mechanism highlighted here can be viewed as an alternative training regime to explicit RL, relying on data curation rather than explicit rewards.

10. 【2512.24939】Vibe Coding, Interface Flattening

链接https://arxiv.org/abs/2512.24939

作者:Hongrui Jin

类目:Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)

关键词:Large language models, Large language, model-driven toolchains, vibe coding, softwares through natural-language

备注: 16 pages, 1 figure

点击查看摘要

Abstract:Large language models are reshaping programming by enabling 'vibe coding': the development of softwares through natural-language interaction with model-driven toolchains. This article argues that vibe coding is best understood as interface flattening, a reconfiguration in which previously distinct modalities (GUI, CLI, and API) appear to converge into a single conversational surface, even as the underlying chain of translation from intention to machinic effect lengthens and thickens. Drawing on Friedrich Kittler's materialist media theory and Alexander Galloway's account of interfaces as sites of protocol control, the paper situates programming as a historically localised interface arrangement rather than an essential relation to computation. Through a materialist reconstruction of the contemporary vibe-coding stack, it shows how remote compute infrastructures, latency and connectivity, structured outputs, function/tool calling, and interoperability standards such as the Model Context Protocol relocate control and meaning-making power to model and protocol providers. The apparent democratisation of technical capability therefore depends on new dependencies and new literacies. By foregrounding the tension between experiential flattening and infrastructural thickening, I demonstrate how LLM-mediated development redistributes symbolic labour/power, obscures responsibility, and privatises competencies previously dispersed across programming communities, contributing a critical lens on the political economy of AI-mediated human-computer interaction.

11. 【2512.24933】Adaptive Dependency-aware Prompt Optimization Framework for Multi-Step LLM Pipeline

链接https://arxiv.org/abs/2512.24933

作者:Minjun Zhao,Xinyu Zhang,Shuai Zhang,Deyang Li,Ruifeng Shi

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:invoke large language, effectively solve complex, performance heavily depends, pipelines invoke large, Multi-step LLM pipelines

备注

点击查看摘要

Abstract:Multi-step LLM pipelines invoke large language models multiple times in a structured sequence and can effectively solve complex tasks, but their performance heavily depends on the prompts used at each step. Jointly optimizing these prompts is difficult due to missing step-level supervision and inter-step dependencies. Existing end-to-end prompt optimization methods struggle under these conditions and often yield suboptimal or unstable updates. We propose ADOPT, an Adaptive Dependency-aware Prompt Optimization framework for multi-step LLM pipelines. ADOPT explicitly models the dependency between each LLM step and the final task outcome, enabling precise text-gradient estimation analogous to computing analytical derivatives. It decouples textual gradient estimation from gradient updates, reducing multi-prompt optimization to flexible single-prompt optimization steps, and employs a Shapley-based mechanism to adaptively allocate optimization resources. Experiments on real-world datasets and diverse pipeline structures show that ADOPT is effective and robust, consistently outperforming state-of-the-art prompt optimization baselines.

12. 【2512.24885】BEDA: Belief Estimation as Probabilistic Constraints for Performing Strategic Dialogue Acts

链接https://arxiv.org/abs/2512.24885

作者:Hengli Li,Zhaoxin Yu,Qi Shen,Chenxi Li,Mengmeng Wang,Tinglang Wu,Yipeng Kang,Yuxuan Wang,Song-Chun Zhu,Zixia Jia,Zilong Zheng

类目:Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)

关键词:execute distinct dialogue, execute distinct, Strategic dialogue requires, dialogue requires agents, Mutual Friends

备注: Accepted by AAMAS 2026

点击查看摘要

Abstract:Strategic dialogue requires agents to execute distinct dialogue acts, for which belief estimation is essential. While prior work often estimates beliefs accurately, it lacks a principled mechanism to use those beliefs during generation. We bridge this gap by first formalizing two core acts Adversarial and Alignment, and by operationalizing them via probabilistic constraints on what an agent may generate. We instantiate this idea in BEDA, a framework that consists of the world set, the belief estimator for belief estimation, and the conditional generator that selects acts and realizes utterances consistent with the inferred beliefs. Across three settings, Conditional Keeper Burglar (CKBG, adversarial), Mutual Friends (MF, cooperative), and CaSiNo (negotiation), BEDA consistently outperforms strong baselines: on CKBG it improves success rate by at least 5.0 points across backbones and by 20.6 points with GPT-4.1-nano; on Mutual Friends it achieves an average improvement of 9.3 points; and on CaSiNo it achieves the optimal deal relative to all baselines. These results indicate that casting belief estimation as constraints provides a simple, general mechanism for reliable strategic dialogue.

13. 【2512.24880】mHC: Manifold-Constrained Hyper-Connections

链接https://arxiv.org/abs/2512.24880

作者:Zhenda Xie,Yixuan Wei,Huanqi Cao,Chenggang Zhao,Chengqi Deng,Jiashi Li,Damai Dai,Huazuo Gao,Jiang Chang,Liang Zhao,Shangyan Zhou,Zhean Xu,Zhengyan Zhang,Wangding Zeng,Shengding Hu,Yuqing Wang,Jingyang Yuan,Lean Wang,Wenfeng Liang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:diversifying connectivity patterns, residual stream width, connection paradigm established, ubiquitous residual connection, residual connection paradigm

备注

点击查看摘要

Abstract:Recently, studies exemplified by Hyper-Connections (HC) have extended the ubiquitous residual connection paradigm established over the past decade by expanding the residual stream width and diversifying connectivity patterns. While yielding substantial performance gains, this diversification fundamentally compromises the identity mapping property intrinsic to the residual connection, which causes severe training instability and restricted scalability, and additionally incurs notable memory access overhead. To address these challenges, we propose Manifold-Constrained Hyper-Connections (mHC), a general framework that projects the residual connection space of HC onto a specific manifold to restore the identity mapping property, while incorporating rigorous infrastructure optimization to ensure efficiency. Empirical experiments demonstrate that mHC is effective for training at scale, offering tangible performance improvements and superior scalability. We anticipate that mHC, as a flexible and practical extension of HC, will contribute to a deeper understanding of topological architecture design and suggest promising directions for the evolution of foundational models.

14. 【2512.24873】Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem

链接https://arxiv.org/abs/2512.24873

作者:Weixun Wang,XiaoXiao Xu,Wanhe An,Fangwen Dai,Wei Gao,Yancheng He,Ju Huang,Qiang Ji,Hanqi Jin,Xiaoyang Li,Yang Li,Zhongwen Li,Shirong Lin,Jiashun Liu,Zenan Liu,Tao Luo,Dilxat Muhtar,Yuanbin Qu,Jiaqiang Shi,Qinghui Sun,Yingshui Tan,Hao Tang,Runze Wang,Yi Wang,Zhaoguo Wang,Yanan Wu,Shaopan Xiong,Binchen Xu,Xander Xu,Yuchi Xu,Qipeng Zhang,Xixia Zhang,Haizhou Zhao,Jie Zhao,Shuaibing Zhao,Baihui Zheng,Jianhui Zheng,Suhang Zheng,Yanni Zhu,Mengze Cai,Kerui Cao,Xitong Chen,Yue Dai,Lifan Du,Tao Feng,Tao He,Jin Hu,Yijie Hu,Ziyu Jiang,Cheng Li,Xiang Li,Jing Liang,Chonghuan Liu,ZhenDong Liu,Haodong Mi,Yanhu Mo,Junjia Ni,Shixin Pei,Jingyu Shen,XiaoShuai Song,Cecilia Wang,Chaofan Wang,Kangyu Wang,Pei Wang,Tao Wang,Wei Wang,Ke Xiao,Mingyu Xu,Tiange Xu,Nan Ya,Siran Yang,Jianan Ye,Yaxing Zang,Duo Zhang,Junbo Zhang,Boren Zheng,Wanxi Deng,Ling Pan,Lin Qu,Wenbo Su,Jiamang Wang,Wei Wang,Hu Wei,Minggang Wu,Cheng Yu,Bing Zhao,Zhicheng Zheng,Bo Zheng

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:iteratively refining artifacts, Agentic crafting requires, crafting requires LLMs, observing outcomes, Agentic Learning Ecosystem

备注: 36 pages, 15 figures

点击查看摘要

Abstract:Agentic crafting requires LLMs to operate in real-world environments over multiple turns by taking actions, observing outcomes, and iteratively refining artifacts. Despite its importance, the open-source community lacks a principled, end-to-end ecosystem to streamline agent development. We introduce the Agentic Learning Ecosystem (ALE), a foundational infrastructure that optimizes the production pipeline for agent LLMs. ALE consists of three components: ROLL, a post-training framework for weight optimization; ROCK, a sandbox environment manager for trajectory generation; and iFlow CLI, an agent framework for efficient context engineering. We release ROME (ROME is Obviously an Agentic Model), an open-source agent grounded by ALE and trained on over one million trajectories. Our approach includes data composition protocols for synthesizing complex behaviors and a novel policy optimization algorithm, Interaction-based Policy Alignment (IPA), which assigns credit over semantic interaction chunks rather than individual tokens to improve long-horizon training stability. Empirically, we evaluate ROME within a structured setting and introduce Terminal Bench Pro, a benchmark with improved scale and contamination control. ROME demonstrates strong performance across benchmarks like SWE-bench Verified and Terminal Bench, proving the effectiveness of the ALE infrastructure.

15. 【2512.24867】Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements

链接https://arxiv.org/abs/2512.24867

作者:Yiming Liang,Yizhi Li,Yantao Du,Ge Zhang,Jiayi Zhou,Yuchen Wu,Yinzhu Piao,Denghui Cao,Tong Sun,Ziniu Li,Li Du,Bo Lei,Jiaheng Liu,Chenghua Lin,Zhaoxiang Zhang,Wenhao Huang,Jiajun Zhang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:large language models, capability boundaries, play a crucial, crucial role, role in tracking

备注

点击查看摘要

Abstract:Benchmarks play a crucial role in tracking the rapid advancement of large language models (LLMs) and identifying their capability boundaries. However, existing benchmarks predominantly curate questions at the question level, suffering from three fundamental limitations: vulnerability to data contamination, restriction to single-knowledge-point assessment, and reliance on costly domain expert annotation. We propose Encyclo-K, a statement-based benchmark that rethinks benchmark construction from the ground up. Our key insight is that knowledge statements, not questions, can serve as the unit of curation, and questions can then be constructed from them. We extract standalone knowledge statements from authoritative textbooks and dynamically compose them into evaluation questions through random sampling at test time. This design directly addresses all three limitations: the combinatorial space is too vast to memorize, and model rankings remain stable across dynamically generated question sets, enabling reliable periodic dataset refresh; each question aggregates 8-10 statements for comprehensive multi-knowledge assessment; annotators only verify formatting compliance without requiring domain expertise, substantially reducing annotation costs. Experiments on over 50 LLMs demonstrate that Encyclo-K poses substantial challenges with strong discriminative power. Even the top-performing OpenAI-GPT-5.1 achieves only 62.07% accuracy, and model performance displays a clear gradient distribution--reasoning models span from 16.04% to 62.07%, while chat models range from 9.71% to 50.40%. These results validate the challenges introduced by dynamic evaluation and multi-statement comprehensive understanding. These findings establish Encyclo-K as a scalable framework for dynamic evaluation of LLMs' comprehensive understanding over multiple fine-grained disciplinary knowledge statements.

16. 【2512.24863】Big AI is accelerating the metacrisis: What can we do?

链接https://arxiv.org/abs/2512.24863

作者:Steven Bird

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

关键词:grip of ecological, language crises, Abstract, meaning, ecological

备注: 9 pages, 1 figure

点击查看摘要

Abstract:The world is in the grip of ecological, meaning, and language crises which are converging into a metacrisis. Big AI is accelerating them all. Language engineers are playing a central role, persisting with a scalability story that is failing humanity, supplying critical talent to plutocrats and kleptocrats, and creating new technologies as if the whole endeavour was value-free. We urgently need to explore alternatives, applying our collective intelligence to design a life-affirming future for NLP that is centered on human flourishing on a living planet.

17. 【2512.24848】PrivacyBench: A Conversational Benchmark for Evaluating Privacy in Personalized AI

链接https://arxiv.org/abs/2512.24848

作者:Srija Mukhopadhyay,Sathwik Reddy,Shruthi Muthukumar,Jisun An,Ponnurangam Kumaraguru

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Personalized AI agents, user digital footprint, private emails, chats and purchase, purchase histories

备注: 11 pages, 2 figures

点击查看摘要

Abstract:Personalized AI agents rely on access to a user's digital footprint, which often includes sensitive data from private emails, chats and purchase histories. Yet this access creates a fundamental societal and privacy risk: systems lacking social-context awareness can unintentionally expose user secrets, threatening digital well-being. We introduce PrivacyBench, a benchmark with socially grounded datasets containing embedded secrets and a multi-turn conversational evaluation to measure secret preservation. Testing Retrieval-Augmented Generation (RAG) assistants reveals that they leak secrets in up to 26.56% of interactions. A privacy-aware prompt lowers leakage to 5.12%, yet this measure offers only partial mitigation. The retrieval mechanism continues to access sensitive data indiscriminately, which shifts the entire burden of privacy preservation onto the generator. This creates a single point of failure, rendering current architectures unsafe for wide-scale deployment. Our findings underscore the urgent need for structural, privacy-by-design safeguards to ensure an ethical and inclusive web for everyone.

18. 【2512.24842】riangulation as an Acceptance Rule for Multilingual Mechanistic Interpretability

链接https://arxiv.org/abs/2512.24842

作者:Yanan Long

类目:Computation and Language (cs.CL); Machine Learning (stat.ML)

关键词:achieve strong aggregate, strong aggregate performance, Multilingual language models, models achieve strong, emph

备注: NeurIPS 2025 Workshop Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling

点击查看摘要

Abstract:Multilingual language models achieve strong aggregate performance yet often behave unpredictably across languages, scripts, and cultures. We argue that mechanistic explanations for such models should satisfy a \emph{causal} standard: claims must survive causal interventions and must \emph{cross-reference} across environments that perturb surface form while preserving meaning. We formalize \emph{reference families} as predicate-preserving variants and introduce \emph{triangulation}, an acceptance rule requiring necessity (ablating the circuit degrades the target behavior), sufficiency (patching activations transfers the behavior), and invariance (both effects remain directionally stable and of sufficient magnitude across the reference family). To supply candidate subgraphs, we adopt automatic circuit discovery and \emph{accept or reject} those candidates by triangulation. We ground triangulation in causal abstraction by casting it as an approximate transformation score over a distribution of interchange interventions, connect it to the pragmatic interpretability agenda, and present a comparative experimental protocol across multiple model families, language pairs, and tasks. Triangulation provides a falsifiable standard for mechanistic claims that filters spurious circuits passing single-environment tests but failing cross-lingual invariance.

19. 【2512.24825】Practising responsibility: Ethics in NLP as a hands-on course

链接https://arxiv.org/abs/2512.24825

作者:Malvina Nissim,Viviana Patti,Beatrice Savoldi

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Natural Language Processing, Language Processing, Natural Language, integrating ethical considerations, NLP education

备注

点击查看摘要

Abstract:As Natural Language Processing (NLP) systems become more pervasive, integrating ethical considerations into NLP education has become essential. However, this presents inherent challenges in curriculum development: the field's rapid evolution from both academia and industry, and the need to foster critical thinking beyond traditional technical training. We introduce our course on Ethical Aspects in NLP and our pedagogical approach, grounded in active learning through interactive sessions, hands-on activities, and "learning by teaching" methods. Over four years, the course has been refined and adapted across different institutions, educational levels, and interdisciplinary backgrounds; it has also yielded many reusable products, both in the form of teaching materials and in the form of actual educational products aimed at diverse audiences, made by the students themselves. By sharing our approach and experience, we hope to provide inspiration for educators seeking to incorporate social impact considerations into their curricula.

20. 【2512.24776】Compute-Accuracy Pareto Frontiers for Open-Source Reasoning Large Language Models

链接https://arxiv.org/abs/2512.24776

作者:Ákos Prucs,Márton Csutora,Mátyás Antal,Márk Marosi

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, demonstrating rapid improvements, utilize intermediate reasoning, intermediate reasoning steps

备注

点击查看摘要

Abstract:Large Language Models (LLMs) are demonstrating rapid improvements on complex reasoning benchmarks, particularly when allowed to utilize intermediate reasoning steps before converging on a final solution. However, current literature often overlooks the significant computational burden associated with generating long reasoning sequences. For industrial applications, model selection depends not only on raw accuracy but also on resource constraints and inference costs. In this work, we conduct a test-time-compute aware evaluation of both contemporary and older open-source LLMs, mapping their Pareto frontiers across math- and reasoning-intensive benchmarks. Our findings identify the Mixture of Experts (MoE) architecture as a strong candidate to balance performance and efficiency in our evaluation setting. Furthermore, we trace the trajectory of Pareto efficiency over time to derive an emergent trend regarding accuracy gain per unit of compute. Finally, we demonstrate that there is a saturation point for inference-time compute. Beyond a certain threshold, accuracy gains diminish, indicating that while extended reasoning capabilities are beneficial, they cannot overcome intrinsic model limitations regarding specific complexities.

21. 【2512.24772】Uncertainty-aware Semi-supervised Ensemble Teacher Framework for Multilingual Depression Detection

链接https://arxiv.org/abs/2512.24772

作者:Mohammad Zia Ur Rehman,Velpuru Navya,Sanskar,Shuja Uddin Qureshi,Nagendra Kumar

类目:Computation and Language (cs.CL)

关键词:social media text, Detecting depression, challenging task, social media, media text

备注

点击查看摘要

Abstract:Detecting depression from social media text is still a challenging task. This is due to different language styles, informal expression, and the lack of annotated data in many languages. To tackle these issues, we propose, Semi-SMDNet, a strong Semi-Supervised Multilingual Depression detection Network. It combines teacher-student pseudo-labelling, ensemble learning, and augmentation of data. Our framework uses a group of teacher models. Their predictions come together through soft voting. An uncertainty-based threshold filters out low-confidence pseudo-labels to reduce noise and improve learning stability. We also use a confidence-weighted training method that focuses on reliable pseudo-labelled samples. This greatly boosts robustness across languages. Tests on Arabic, Bangla, English, and Spanish datasets show that our approach consistently beats strong baselines. It significantly reduces the performance gap between settings that have plenty of resources and those that do not. Detailed experiments and studies confirm that our framework is effective and can be used in various situations. This shows that it is suitable for scalable, cross-language mental health monitoring where labelled resources are limited.

22. 【2512.24733】BIOME-Bench: A Benchmark for Biomolecular Interaction Inference and Multi-Omics Pathway Mechanism Elucidation from Scientific Literature

链接https://arxiv.org/abs/2512.24733

作者:Sibo Wei,Peng Chen,Lifeng Dong,Yin Luo,Lei Wang,Peng Zhang,Wenpeng Lu,Jianbin Guo,Hongjun Yang,Dajun Zeng

类目:Computation and Language (cs.CL)

关键词:including curation lag, interpret heterogeneous molecular, inherit structural limitations, multi-omics pathway mechanism, pathway mechanism elucidation

备注

点击查看摘要

Abstract:Multi-omics studies often rely on pathway enrichment to interpret heterogeneous molecular changes, but pathway enrichment (PE)-based workflows inherit structural limitations of pathway resources, including curation lag, functional redundancy, and limited sensitivity to molecular states and interventions. Although recent work has explored using large language models (LLMs) to improve PE-based interpretation, the lack of a standardized benchmark for end-to-end multi-omics pathway mechanism elucidation has largely confined evaluation to small, manually curated datasets or ad hoc case studies, hindering reproducible progress. To address this issue, we introduce BIOME-Bench, constructed via a rigorous four-stage workflow, to evaluate two core capabilities of LLMs in multi-omics analysis: Biomolecular Interaction Inference and end-to-end Multi-Omics Pathway Mechanism Elucidation. We develop evaluation protocols for both tasks and conduct comprehensive experiments across multiple strong contemporary models. Experimental results demonstrate that existing models still exhibit substantial deficiencies in multi-omics analysis, struggling to reliably distinguish fine-grained biomolecular relation types and to generate faithful, robust pathway-level mechanistic explanations.

23. 【2512.24693】MUSIC: MUlti-Step Instruction Contrast for Multi-Turn Reward Models

链接https://arxiv.org/abs/2512.24693

作者:Wenzhe Li,Shujian Zhang,Wenxuan Zhou,John Lambert,Chi Jin,Andrew Hard,Rajiv Mathews,Lun Wang

类目:Computation and Language (cs.CL)

关键词:capable Large Language, Large Language Models, developing capable Large, Large Language, requiring costly human

备注

点击查看摘要

Abstract:Evaluating the quality of multi-turn conversations is crucial for developing capable Large Language Models (LLMs), yet remains a significant challenge, often requiring costly human evaluation. Multi-turn reward models (RMs) offer a scalable alternative and can provide valuable signals for guiding LLM training. While recent work has advanced multi-turn \textit{training} techniques, effective automated \textit{evaluation} specifically for multi-turn interactions lags behind. We observe that standard preference datasets, typically contrasting responses based only on the final conversational turn, provide insufficient signal to capture the nuances of multi-turn interactions. Instead, we find that incorporating contrasts spanning \textit{multiple} turns is critical for building robust multi-turn RMs. Motivated by this finding, we propose \textbf{MU}lti-\textbf{S}tep \textbf{I}nstruction \textbf{C}ontrast (MUSIC), an unsupervised data augmentation strategy that synthesizes contrastive conversation pairs exhibiting differences across multiple turns. Leveraging MUSIC on the Skywork preference dataset, we train a multi-turn RM based on the Gemma-2-9B-Instruct model. Empirical results demonstrate that our MUSIC-augmented RM outperforms baseline methods, achieving higher alignment with judgments from advanced proprietary LLM judges on multi-turn conversations, crucially, without compromising performance on standard single-turn RM benchmarks.

24. 【2512.24684】R-Debater: Retrieval-Augmented Debate Generation through Argumentative Memory

链接https://arxiv.org/abs/2512.24684

作者:Maoyuan Li,Zhongsheng Wang,Haoyuan Li,Jiamou Liu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:agentic framework, framework for generating, built on argumentative, multi-turn debates built, generating multi-turn debates

备注: Accepteed by AAMAS 2026 full paper

点击查看摘要

Abstract:We present R-Debater, an agentic framework for generating multi-turn debates built on argumentative memory. Grounded in rhetoric and memory studies, the system views debate as a process of recalling and adapting prior arguments to maintain stance consistency, respond to opponents, and support claims with evidence. Specifically, R-Debater integrates a debate knowledge base for retrieving case-like evidence and prior debate moves with a role-based agent that composes coherent utterances across turns. We evaluate on standardized ORCHID debates, constructing a 1,000-item retrieval corpus and a held-out set of 32 debates across seven domains. Two tasks are evaluated: next-utterance generation, assessed by InspireScore (subjective, logical, and factual), and adversarial multi-turn simulations, judged by Debatrix (argument, source, language, and overall). Compared with strong LLM baselines, R-Debater achieves higher single-turn and multi-turn scores. Human evaluation with 20 experienced debaters further confirms its consistency and evidence use, showing that combining retrieval grounding with structured planning yields more faithful, stance-aligned, and coherent debates across turns.

25. 【2512.24661】Do Large Language Models Know What They Are Capable Of?

链接https://arxiv.org/abs/2512.24661

作者:Casey O. Barkan,Sid Black,Oliver Sourbut

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:large language models, large language, predictions improve, LLMs, investigate whether large

备注: 23 pages, 8 figures

点击查看摘要

Abstract:We investigate whether large language models (LLMs) can predict whether they will succeed on a given task and whether their predictions improve as they progress through multi-step tasks. We also investigate whether LLMs can learn from in-context experiences to make better decisions about whether to pursue a task in scenarios where failure is costly. All LLMs we tested are overconfident, but most predict their success with better-than-random discriminatory power. We find that newer and larger LLMs generally do not have greater discriminatory power, though Claude models do show such a trend. On multi-step agentic tasks, the overconfidence of several frontier LLMs worsens as they progress through the tasks, and reasoning LLMs perform comparably to or worse than non-reasoning LLMs. With in-context experiences of failure, some but not all LLMs reduce their overconfidence leading to significantly improved decision making, while others do not. Interestingly, all LLMs' decisions are approximately rational given their estimated probabilities of success, yet their overly-optimistic estimates result in poor decision making. These results suggest that current LLM agents are hindered by their lack of awareness of their own capabilities. We discuss the implications of LLMs' awareness of their capabilities for AI misuse and misalignment risks.

26. 【2512.24618】Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models

链接https://arxiv.org/abs/2512.24618

作者:Junru Lu,Jiarui Qin,Lingfeng Qiao,Yinghui Li,Xinyi Dai,Bo Ke,Jianfeng He,Ruizhi Qiao,Di Yin,Xing Sun,Yunsheng Wu,Yinsong Liu,Shuangyin Liu,Mingkong Tang,Haodong Lin,Jiayi Kuang,Fanxu Meng,Xiaojuan Tang,Yunjia Xi,Junjie Huang,Haotong Yang,Zhenyi Shen,Yangning Li,Qianwen Zhang,Yifei Yu,Siyu An,Junnan Dong,Qiufeng Wang,Jie Wang,Keyu Chen,Wei Wen,Taian Guo,Zhifeng Shen,Daohai Yu,Jiahao Li,Ke Li,Zongyi Li,Xiaoyu Tan

类目:Computation and Language (cs.CL)

关键词:harmonizes high computational, high computational efficiency, native agentic intelligence, powerful language model, powerful language

备注: 57 pages, 26 figures

点击查看摘要

Abstract:We introduce Youtu-LLM, a lightweight yet powerful language model that harmonizes high computational efficiency with native agentic intelligence. Unlike typical small models that rely on distillation, Youtu-LLM (1.96B) is pre-trained from scratch to systematically cultivate reasoning and planning capabilities. The key technical advancements are as follows: (1) Compact Architecture with Long-Context Support: Built on a dense Multi-Latent Attention (MLA) architecture with a novel STEM-oriented vocabulary, Youtu-LLM supports a 128k context window. This design enables robust long-context reasoning and state tracking within a minimal memory footprint, making it ideal for long-horizon agent and reasoning tasks. (2) Principled "Commonsense-STEM-Agent" Curriculum: We curated a massive corpus of approximately 11T tokens and implemented a multi-stage training strategy. By progressively shifting the pre-training data distribution from general commonsense to complex STEM and agentic tasks, we ensure the model acquires deep cognitive abilities rather than superficial alignment. (3) Scalable Agentic Mid-training: Specifically for the agentic mid-training, we employ diverse data construction schemes to synthesize rich and varied trajectories across math, coding, and tool-use domains. This high-quality data enables the model to internalize planning and reflection behaviors effectively. Extensive evaluations show that Youtu-LLM sets a new state-of-the-art for sub-2B LLMs. On general benchmarks, it achieves competitive performance against larger models, while on agent-specific tasks, it significantly surpasses existing SOTA baselines, demonstrating that lightweight models can possess strong intrinsic agentic capabilities.

27. 【2512.24601】Recursive Language Models

链接https://arxiv.org/abs/2512.24601

作者:Alex L. Zhang,Tim Kraska,Omar Khattab

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:study allowing large, allowing large language, process arbitrarily long, large language models, Recursive Language Models

备注: 9 pages, 33 with Appendix

点击查看摘要

Abstract:We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference strategy that treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt. We find that RLMs successfully handle inputs up to two orders of magnitude beyond model context windows and, even for shorter prompts, dramatically outperform the quality of base LLMs and common long-context scaffolds across four diverse long-context tasks, while having comparable (or cheaper) cost per query.

28. 【2512.24574】Understanding and Steering the Cognitive Behaviors of Reasoning Models at Test-Time

链接https://arxiv.org/abs/2512.24574

作者:Zhenyu Zhang,Xiaoxia Wu,Zhongzhu Zhou,Qingyang Wu,Yineng Zhang,Pragaash Ponnusamy,Harikaran Subbaraj,Jue Wang,Shuaiwen Leon Song,Ben Athiwaratkun

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, solve complex tasks, Language Models, rely on long

备注

点击查看摘要

Abstract:Large Language Models (LLMs) often rely on long chain-of-thought (CoT) reasoning to solve complex tasks. While effective, these trajectories are frequently inefficient, leading to high latency from excessive token generation, or unstable reasoning that alternates between underthinking (shallow, inconsistent steps) and overthinking (repetitive, verbose reasoning). In this work, we study the structure of reasoning trajectories and uncover specialized attention heads that correlate with distinct cognitive behaviors such as verification and backtracking. By lightly intervening on these heads at inference time, we can steer the model away from inefficient modes. Building on this insight, we propose CREST, a training-free method for Cognitive REasoning Steering at Test-time. CREST has two components: (1) an offline calibration step that identifies cognitive heads and derives head-specific steering vectors, and (2) an inference-time procedure that rotates hidden representations to suppress components along those vectors. CREST adaptively suppresses unproductive reasoning behaviors, yielding both higher accuracy and lower computational cost. Across diverse reasoning benchmarks and models, CREST improves accuracy by up to 17.5% while reducing token usage by 37.6%, offering a simple and effective pathway to faster, more reliable LLM reasoning.

29. 【2512.24572】Korean Canonical Legal Benchmark: Toward Knowledge-Independent Evaluation of LLMs' Legal Reasoning Capabilities

链接https://arxiv.org/abs/2512.24572

作者:Hongseok Oh,Wonseok Hwang,Kyoung-Woon On

类目:Computation and Language (cs.CL)

关键词:Korean Canonical Legal, Canonical Legal Benchmark, language models' legal, Korean Canonical, models' legal reasoning

备注

点击查看摘要

Abstract:We introduce the Korean Canonical Legal Benchmark (KCL), a benchmark designed to assess language models' legal reasoning capabilities independently of domain-specific knowledge. KCL provides question-level supporting precedents, enabling a more faithful disentanglement of reasoning ability from parameterized knowledge. KCL consists of two components: (1) KCL-MCQA, multiple-choice problems of 283 questions with 1,103 aligned precedents, and (2) KCL-Essay, open-ended generation problems of 169 questions with 550 aligned precedents and 2,739 instance-level rubrics for automated evaluation. Our systematic evaluation of 30+ models shows large remaining gaps, particularly in KCL-Essay, and that reasoning-specialized models consistently outperform their general-purpose counterparts. We release all resources, including the benchmark dataset and evaluation code, at this https URL.

30. 【2512.24562】HaluNet: Multi-Granular Uncertainty Modeling for Efficient Hallucination Detection in LLM Question Answering

链接https://arxiv.org/abs/2512.24562

作者:Chaodong Tong,Qi Zhang,Jiayang Gao,Lei Jiang,Yanbing Liu,Nannan Sun

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, including factual errors, Language Models, including factual

备注: 13 pages, 5 figures

点击查看摘要

Abstract:Large Language Models (LLMs) excel at question answering (QA) but often generate hallucinations, including factual errors or fabricated content. Detecting hallucinations from internal uncertainty signals is attractive due to its scalability and independence from external resources. Existing methods often aim to accurately capture a single type of uncertainty while overlooking the complementarity among different sources, particularly between token-level probability uncertainty and the uncertainty conveyed by internal semantic representations, which provide complementary views on model reliability. We present \textbf{HaluNet}, a lightweight and trainable neural framework that integrates multi granular token level uncertainties by combining semantic embeddings with probabilistic confidence and distributional uncertainty. Its multi branch architecture adaptively fuses what the model knows with the uncertainty expressed in its outputs, enabling efficient one pass hallucination detection. Experiments on SQuAD, TriviaQA, and Natural Questions show that HaluNet delivers strong detection performance and favorable computational efficiency, with or without access to context, highlighting its potential for real time hallucination detection in LLM based QA systems.

31. 【2512.24556】Safe in the Future, Dangerous in the Past: Dissecting Temporal and Linguistic Vulnerabilities in LLMs

链接https://arxiv.org/abs/2512.24556

作者:Muhammad Abdullahi Said,Muhammad Sammani Sani

类目:Computation and Language (cs.CL)

关键词:dangerous blind spot, Large Language Models, critical global infrastructure, alignment transfers zero-shot, Large Language

备注

点击查看摘要

Abstract:As Large Language Models (LLMs) integrate into critical global infrastructure, the assumption that safety alignment transfers zero-shot from English to other languages remains a dangerous blind spot. This study presents a systematic audit of three state of the art models (GPT-5.1, Gemini 3 Pro, and Claude 4.5 Opus) using HausaSafety, a novel adversarial dataset grounded in West African threat scenarios (e.g., Yahoo-Yahoo fraud, Dane gun manufacturing). Employing a 2 x 4 factorial design across 1,440 evaluations, we tested the non-linear interaction between language (English vs. Hausa) and temporal framing. Our results challenge the prevailing multilingual safety gap narrative. Instead of a simple degradation in low-resource settings, we identified a mechanism of Complex Interference where safety is determined by the intersection of variables. While models exhibited a Reverse Linguistic with Claude 4.5 Opus proving significantly safer in Hausa (45.0%) than in English (36.7%) due to uncertainty-driven refusal they suffered catastrophic failures in temporal reasoning. We report a profound Temporal Asymmetry, where past-tense framing bypassed defenses (15.6% safe) while future-tense scenarios triggered hyper-conservative refusals (57.2% safe). The magnitude of this volatility is illustrated by a 9.2x disparity between the safest and most vulnerable configurations, proving that safety is not a fixed property but a context-dependent state. We conclude that current models rely on superficial heuristics rather than robust semantic understanding, creating Safety Pockets that leave Global South users exposed to localized harms. We propose Invariant Alignment as a necessary paradigm shift to ensure safety stability across linguistic and temporal shifts.

32. 【2512.24545】More Than Bits: Multi-Envelope Double Binary Factorization for Extreme Quantization

链接https://arxiv.org/abs/2512.24545

作者:Yuma Ichikawa,Yoshihiko Fujisawa,Yudai Fujimoto,Akira Sakai,Katsuki Fujisawa

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)

关键词:Double Binary Factorization, large language models, extreme low-bit quantization, enables efficient inference, Double Binary

备注: 14 pages, 2 figures

点击查看摘要

Abstract:For extreme low-bit quantization of large language models (LLMs), Double Binary Factorization (DBF) is attractive as it enables efficient inference without sacrificing accuracy. However, the scaling parameters of DBF are too restrictive; after factoring out signs, all rank components share the same magnitude profile, resulting in performance saturation. We propose Multi-envelope DBF (MDBF), which retains a shared pair of 1-bit sign bases but replaces the single envelope with a rank-$l$ envelope. By sharing sign matrices among envelope components, MDBF effectively maintains a binary carrier and utilizes the limited memory budget for magnitude expressiveness. We also introduce a closed-form initialization and an alternating refinement method to optimize MDBF. Across the LLaMA and Qwen families, MDBF enhances perplexity and zero-shot accuracy over previous binary formats at matched bits per weight while preserving the same deployment-friendly inference primitive.

33. 【2512.24532】From Building Blocks to Planning: Multi-Step Spatial Reasoning in LLMs with Reinforcement Learning

链接https://arxiv.org/abs/2512.24532

作者:Amir Tahmasbi,Sadegh Majidi,Kazem Taram,Aniket Bera

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:gained increasing attention, increasing attention due, large language models, gained increasing, due to applications

备注

点击查看摘要

Abstract:Spatial reasoning in large language models (LLMs) has gained increasing attention due to applications in navigation and planning. Despite strong general language capabilities, LLMs still struggle with spatial transformations and multi-step planning in structured environments. We propose a two-stage approach that decomposes spatial reasoning into atomic building blocks and their composition. First, we apply supervised fine-tuning on elementary spatial transformations, such as rotation, translation, and scaling, to equip the model with basic spatial physics. We then freeze this physics-aware model and train lightweight LoRA adapters within the GRPO framework to learn policies that compose these building blocks for multi-step planning in puzzle-based environments, in a closed-loop manner. To support this pipeline, we synthesize an ASCII-art dataset and construct a corresponding ASCII-based reinforcement learning environment. Our method consistently outperforms baselines, including the generic backbone, physics-aware model, and end-to-end RL models, under both Dynamic environments with explicit state updates and Static environments where the model must rely on its internal state across steps. In addition, the proposed approach converges faster and exhibits more stable training compared to end-to-end reinforcement learning from scratch. Finally, we analyze attention patterns to assess whether fine-tuning induces meaningful improvements in spatial understanding.

34. 【2512.24517】Paragraph Segmentation Revisited: Towards a Standard Task for Structuring Speech

链接https://arxiv.org/abs/2512.24517

作者:Fabian Retkowski,Alexander Waibel

类目:Computation and Language (cs.CL)

关键词:unstructured word streams, Automatic speech transcripts, Automatic speech, readability and repurposing, delivered as unstructured

备注

点击查看摘要

Abstract:Automatic speech transcripts are often delivered as unstructured word streams that impede readability and repurposing. We recast paragraph segmentation as the missing structuring step and fill three gaps at the intersection of speech processing and text segmentation. First, we establish TEDPara (human-annotated TED talks) and YTSegPara (YouTube videos with synthetic labels) as the first benchmarks for the paragraph segmentation task. The benchmarks focus on the underexplored speech domain, where paragraph segmentation has traditionally not been part of post-processing, while also contributing to the wider text segmentation field, which still lacks robust and naturalistic benchmarks. Second, we propose a constrained-decoding formulation that lets large language models insert paragraph breaks while preserving the original transcript, enabling faithful, sentence-aligned evaluation. Third, we show that a compact model (MiniSeg) attains state-of-the-art accuracy and, when extended hierarchically, jointly predicts chapters and paragraphs with minimal computational cost. Together, our resources and methods establish paragraph segmentation as a standardized, practical task in speech processing.

35. 【2512.24460】IELTS Writing Revision Platform with Automated Essay Scoring and Adaptive Feedback

链接https://arxiv.org/abs/2512.24460

作者:Titas Ramancauskas,Kotryna Ramancauske

类目:Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

关键词:International English Language, English Language Testing, International English, Language Testing System, IELTS writing rubric

备注

点击查看摘要

Abstract:This paper presents the design, development, and evaluation of a proposed revision platform assisting candidates for the International English Language Testing System (IELTS) writing exam. Traditional IELTS preparation methods lack personalised feedback, catered to the IELTS writing rubric. To address these shortcomings, the platform features an attractive user interface (UI), an Automated Essay Scoring system (AES), and targeted feedback tailored to candidates and the IELTS writing rubric. The platform architecture separates conversational guidance from a dedicated writing interface to reduce cognitive load and simulate exam conditions. Through iterative, Design-Based Research (DBR) cycles, the study progressed from rule-based to transformer-based with a regression head scoring, mounted with adaptive feedback. Early cycles (2-3) revealed fundamental limitations of rule-based approaches: mid-band compression, low accuracy, and negative $R^2$ values. DBR Cycle 4 implemented a DistilBERT transformer model with a regression head, yielding substantial improvements with MAE of 0.66 and positive $R^2$. This enabled Cycle 5's adaptive feedback implementation, which demonstrated statistically significant score improvements (mean +0.060 bands, p = 0.011, Cohen's d = 0.504), though effectiveness varied by revision strategy. Findings suggest automated feedback functions are most suited as a supplement to human instruction, with conservative surface-level corrections proving more reliable than aggressive structural interventions for IELTS preparation contexts. Challenges remain in assessing higher-band essays, and future work should incorporate longitudinal studies with real IELTS candidates and validation from official examiners.

Subjects:

Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Cite as:
arXiv:2512.24460 [cs.CL]

(or
arXiv:2512.24460v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2512.24460

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
36. 【2512.24459】Cleaning English Abstracts of Scientific Publications

链接https://arxiv.org/abs/2512.24459

作者:Michael E. Rose,Nils A. Herrmann,Sebastian Erhardt

类目:Computation and Language (cs.CL)

关键词:research publications, thematic focus, focus of research, English-language scientific abstracts, clean English-language scientific

备注: 2 tables, 2 figures

点击查看摘要

Abstract:Scientific abstracts are often used as proxies for the content and thematic focus of research publications. However, a significant share of published abstracts contains extraneous information-such as publisher copyright statements, section headings, author notes, registrations, and bibliometric or bibliographic metadata-that can distort downstream analyses, particularly those involving document similarity or textual embeddings. We introduce an open-source, easy-to-integrate language model designed to clean English-language scientific abstracts by automatically identifying and removing such clutter. We demonstrate that our model is both conservative and precise, alters similarity rankings of cleaned abstracts and improves information content of standard-length embeddings.

37. 【2512.24410】Comparing Approaches to Automatic Summarization in Less-Resourced Languages

链接https://arxiv.org/abs/2512.24410

作者:Chester Palen-Michel,Constantine Lignos

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Automatic text summarization, Automatic text, achieved high performance, achieved high, comparatively less attention

备注: Under review

点击查看摘要

Abstract:Automatic text summarization has achieved high performance in high-resourced languages like English, but comparatively less attention has been given to summarization in less-resourced languages. This work compares a variety of different approaches to summarization from zero-shot prompting of LLMs large and small to fine-tuning smaller models like mT5 with and without three data augmentation approaches and multilingual transfer. We also explore an LLM translation pipeline approach, translating from the source language to English, summarizing and translating back. Evaluating with five different metrics, we find that there is variation across LLMs in their performance across similar parameter sizes, that our multilingual fine-tuned mT5 baseline outperforms most other approaches including zero-shot LLM performance for most metrics, and that LLM as judge may be less reliable on less-resourced languages.

38. 【2512.24373】Skim-Aware Contrastive Learning for Efficient Document Representation

链接https://arxiv.org/abs/2512.24373

作者:Waheed Ahmed Abro,Zied Bouraoui

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:shown strong performance, effectively representing long, remains difficult, performance in word, sentence-level tasks

备注

点击查看摘要

Abstract:Although transformer-based models have shown strong performance in word- and sentence-level tasks, effectively representing long documents, especially in fields like law and medicine, remains difficult. Sparse attention mechanisms can handle longer inputs, but are resource-intensive and often fail to capture full-document context. Hierarchical transformer models offer better efficiency but do not clearly explain how they relate different sections of a document. In contrast, humans often skim texts, focusing on important sections to understand the overall message. Drawing from this human strategy, we introduce a new self-supervised contrastive learning framework that enhances long document representation. Our method randomly masks a section of the document and uses a natural language inference (NLI)-based contrastive objective to align it with relevant parts while distancing it from unrelated ones. This mimics how humans synthesize information, resulting in representations that are both richer and more computationally efficient. Experiments on legal and biomedical texts confirm significant gains in both accuracy and efficiency.

39. 【2512.24340】DermaVQA-DAS: Dermatology Assessment Schema (DAS) Datasets for Closed-Ended Question Answering Segmentation in Patient-Generated Dermatology Images

链接https://arxiv.org/abs/2512.24340

作者:Wen-wai Yim,Yujuan Fu,Asma Ben Abacha,Meliha Yetisgen,Noel Codella,Roberto Andres Novoa,Josep Malvehy

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:lack patient-authored queries, dermatological image analysis, large-scale annotated datasets, Recent advances, existing benchmarks focus

备注

点击查看摘要

Abstract:Recent advances in dermatological image analysis have been driven by large-scale annotated datasets; however, most existing benchmarks focus on dermatoscopic images and lack patient-authored queries and clinical context, limiting their applicability to patient-centered care. To address this gap, we introduce DermaVQA-DAS, an extension of the DermaVQA dataset that supports two complementary tasks: closed-ended question answering (QA) and dermatological lesion segmentation. Central to this work is the Dermatology Assessment Schema (DAS), a novel expert-developed framework that systematically captures clinically meaningful dermatological features in a structured and standardized form. DAS comprises 36 high-level and 27 fine-grained assessment questions, with multiple-choice options in English and Chinese. Leveraging DAS, we provide expert-annotated datasets for both closed QA and segmentation and benchmark state-of-the-art multimodal models. For segmentation, we evaluate multiple prompting strategies and show that prompt design impacts performance: the default prompt achieves the best results under Mean-of-Max and Mean-of-Mean evaluation aggregation schemes, while an augmented prompt incorporating both patient query title and content yields the highest performance under majority-vote-based microscore evaluation, achieving a Jaccard index of 0.395 and a Dice score of 0.566 with BiomedParse. For closed-ended QA, overall performance is strong across models, with average accuracies ranging from 0.729 to 0.798; o3 achieves the best overall accuracy (0.798), closely followed by GPT-4.1 (0.796), while Gemini-1.5-Pro shows competitive performance within the Gemini family (0.783). We publicly release DermaVQA-DAS, the DAS schema, and evaluation protocols to support and accelerate future research in patient-centered dermatological vision-language modeling (this https URL).

40. 【2512.24329】World model inspired sarcasm reasoning with large language model agents

链接https://arxiv.org/abs/2512.24329

作者:Keito Inoshita,Shinnosuke Mizuno

类目:Computation and Language (cs.CL)

关键词:natural language processing, Large Language Models, surrounding social context, language processing, natural language

备注

点击查看摘要

Abstract:Sarcasm understanding is a challenging problem in natural language processing, as it requires capturing the discrepancy between the surface meaning of an utterance and the speaker's intentions as well as the surrounding social context. Although recent advances in deep learning and Large Language Models (LLMs) have substantially improved performance, most existing approaches still rely on black-box predictions of a single model, making it difficult to structurally explain the cognitive factors underlying sarcasm. Moreover, while sarcasm often emerges as a mismatch between semantic evaluation and normative expectations or intentions, frameworks that explicitly decompose and model these components remain limited. In this work, we reformulate sarcasm understanding as a world model inspired reasoning process and propose World Model inspired SArcasm Reasoning (WM-SAR), which decomposes literal meaning, context, normative expectation, and intention into specialized LLM-based agents. The discrepancy between literal evaluation and normative expectation is explicitly quantified as a deterministic inconsistency score, and together with an intention score, these signals are integrated by a lightweight Logistic Regression model to infer the final sarcasm probability. This design leverages the reasoning capability of LLMs while maintaining an interpretable numerical decision structure. Experiments on representative sarcasm detection benchmarks show that WM-SAR consistently outperforms existing deep learning and LLM-based methods. Ablation studies and case analyses further demonstrate that integrating semantic inconsistency and intention reasoning is essential for effective sarcasm detection, achieving both strong performance and high interpretability.

41. 【2512.24314】QianfanHuijin Technical Report: A Novel Multi-Stage Training Paradigm for Finance Industrial LLMs

链接https://arxiv.org/abs/2512.24314

作者:Shupeng Li,Weipeng Lu,Linyun Liu,Chen Lin,Shaofei Li,Zhendong Tan,Hanjun Zhong,Yucheng Zeng,Chenghao Zhu,Mengyue Liu,Daxiang Dong,Jianmin Wu,Yunting Xiao,Annan Li,Danyu Liu,Jingnan Zhang,Licen Liu,Dawei Yin,Dou Shen

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Domain-specific enhancement, Large Language, context has long, focal point

备注

点击查看摘要

Abstract:Domain-specific enhancement of Large Language Models (LLMs) within the financial context has long been a focal point of industrial application. While previous models such as BloombergGPT and Baichuan-Finance primarily focused on knowledge enhancement, the deepening complexity of financial services has driven a growing demand for models that possess not only domain knowledge but also robust financial reasoning and agentic capabilities. In this paper, we present QianfanHuijin, a financial domain LLM, and propose a generalizable multi-stage training paradigm for industrial model enhancement. Our approach begins with Continual Pre-training (CPT) on financial corpora to consolidate the knowledge base. This is followed by a fine-grained Post-training pipeline designed with increasing specificity: starting with Financial SFT, progressing to Finance Reasoning RL and Finance Agentic RL, and culminating in General RL aligned with real-world business scenarios. Empirical results demonstrate that QianfanHuijin achieves superior performance across various authoritative financial benchmarks. Furthermore, ablation studies confirm that the targeted Reasoning RL and Agentic RL stages yield significant gains in their respective capabilities. These findings validate our motivation and suggest that this fine-grained, progressive post-training methodology is poised to become a mainstream paradigm for various industrial-enhanced LLMs.

Subjects:

Computation and Language (cs.CL)

Cite as:
arXiv:2512.24314 [cs.CL]

(or
arXiv:2512.24314v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2512.24314

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
42. 【2512.24297】Figure It Out: Improving the Frontier of Reasoning with Active Visual Thinking

链接https://arxiv.org/abs/2512.24297

作者:Meiqi Chen,Fandong Meng,Jie Zhou

类目:Computation and Language (cs.CL)

关键词:involve implicit spatial, implicit spatial, involve implicit, explicitly encoded, reasoning

备注

点击查看摘要

Abstract:Complex reasoning problems often involve implicit spatial, geometric, and structural relationships that are not explicitly encoded in text. While recent reasoning models have achieved strong performance across many domains, purely text-based reasoning struggles to represent global structural constraints in complex settings. In this paper, we introduce FIGR, which integrates active visual thinking into multi-turn reasoning via end-to-end reinforcement learning. FIGR externalizes intermediate structural hypotheses by constructing visual representations during problem solving. By adaptively regulating when and how visual reasoning should be invoked, FIGR enables more stable and coherent reasoning over global structural properties that are difficult to capture from text alone. Experiments on challenging mathematical reasoning benchmarks demonstrate that FIGR outperforms strong text-only chain-of-thought baselines. In particular, FIGR improves the base model by 13.12% on AIME 2025 and 11.00% on BeyondAIME, highlighting the effectiveness of figure-guided multimodal reasoning in enhancing the stability and reliability of complex reasoning.

43. 【2512.24289】Automated Analysis of Sustainability Reports: Using Large Language Models for the Extraction and Prediction of EU Taxonomy-Compliant KPIs

链接https://arxiv.org/abs/2512.24289

作者:Jonathan Schmoll,Adam Jatowt

类目:Computation and Language (cs.CL)

关键词:Taxonomy presents, Large Language Models, resource-intensive process, challenge for companies, process of complying

备注

点击查看摘要

Abstract:The manual, resource-intensive process of complying with the EU Taxonomy presents a significant challenge for companies. While Large Language Models (LLMs) offer a path to automation, research is hindered by a lack of public benchmark datasets. To address this gap, we introduce a novel, structured dataset from 190 corporate reports, containing ground-truth economic activities and quantitative Key Performance Indicators (KPIs). We use this dataset to conduct the first systematic evaluation of LLMs on the core compliance workflow. Our results reveal a clear performance gap between qualitative and quantitative tasks. LLMs show moderate success in the qualitative task of identifying economic activities, with a multi-step agentic framework modestly enhancing precision. Conversely, the models comprehensively fail at the quantitative task of predicting financial KPIs in a zero-shot setting. We also discover a paradox, where concise metadata often yields superior performance to full, unstructured reports, and find that model confidence scores are poorly calibrated. We conclude that while LLMs are not ready for full automation, they can serve as powerful assistive tools for human experts. Our dataset provides a public benchmark for future research.

44. 【2512.24265】Joint Selection for Large-Scale Pre-Training Data via Policy Gradient-based Mask Learning

链接https://arxiv.org/abs/2512.24265

作者:Ziqing Fan,Yuqiao Xian,Yan Sun,Li Shen

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:enhance training efficiency, pre-training large language, significantly enhance training, large language models, fine-grained data recipe

备注

点击查看摘要

Abstract:A fine-grained data recipe is crucial for pre-training large language models, as it can significantly enhance training efficiency and model performance. One important ingredient in the recipe is to select samples based on scores produced by defined rules, LLM judgment, or statistical information in embeddings, which can be roughly categorized into quality and diversity metrics. Due to the high computational cost when applied to trillion-scale token pre-training datasets such as FineWeb and DCLM, these two or more types of metrics are rarely considered jointly in a single selection process. However, in our empirical study, selecting samples based on quality metrics exhibit severe diminishing returns during long-term pre-training, while selecting on diversity metrics removes too many valuable high-quality samples, both of which limit pre-trained LLMs' capabilities. Therefore, we introduce DATAMASK, a novel and efficient joint learning framework designed for large-scale pre-training data selection that can simultaneously optimize multiple types of metrics in a unified process, with this study focusing specifically on quality and diversity metrics. DATAMASK approaches the selection process as a mask learning problem, involving iterative sampling of data masks, computation of policy gradients based on predefined objectives with sampled masks, and updating of mask sampling logits. Through policy gradient-based optimization and various acceleration enhancements, it significantly reduces selection time by 98.9% compared to greedy algorithm, enabling our study to explore joint learning within trillion-scale tokens. With DATAMASK, we select a subset of about 10% from the 15 trillion-token FineWeb dataset, termed FineWeb-Mask. Evaluated across 12 diverse tasks, we achieves significant improvements of 3.2% on a 1.5B dense model and 1.9% on a 7B MoE model.

45. 【2512.24259】racing the Flow of Knowledge From Science to Technology Using Deep Learning

链接https://arxiv.org/abs/2512.24259

作者:Michael E. Rose,Mainak Ghosh,Sebastian Erhardt,Cheng Li,Erik Buunk,Dietmar Harhoff

类目:Computation and Language (cs.CL)

关键词:similarity model suitable, suitable for working, scientific publications, language similarity model, credible Patent-Paper Citations

备注: 4 tables, 7 figures

点击查看摘要

Abstract:We develop a language similarity model suitable for working with patents and scientific publications at the same time. In a horse race-style evaluation, we subject eight language (similarity) models to predict credible Patent-Paper Citations. We find that our Pat-SPECTER model performs best, which is the SPECTER2 model fine-tuned on patents. In two real-world scenarios (separating patent-paper-pairs and predicting patent-paper-pairs) we demonstrate the capabilities of the Pat-SPECTER. We finally test the hypothesis that US patents cite papers that are semantically less similar than in other large jurisdictions, which we posit is because of the duty of candor. The model is open for the academic community and practitioners alike.

46. 【2512.24235】LAILA: A Large Trait-Based Dataset for Arabic Automated Essay Scoring

链接https://arxiv.org/abs/2512.24235

作者:May Bashendy,Walid Massoud,Sohaila Eltanbouly,Salam Albatarni,Marwan Sayed,Abrar Abir,Houda Bouamor,Tamer Elsayed

类目:Computation and Language (cs.CL)

关键词:Automated Essay Scoring, AES remains limited, gained increasing attention, remains limited due, Arabic AES remains

备注

点击查看摘要

Abstract:Automated Essay Scoring (AES) has gained increasing attention in recent years, yet research on Arabic AES remains limited due to the lack of publicly available datasets. To address this, we introduce LAILA, the largest publicly available Arabic AES dataset to date, comprising 7,859 essays annotated with holistic and trait-specific scores on seven dimensions: relevance, organization, vocabulary, style, development, mechanics, and grammar. We detail the dataset design, collection, and annotations, and provide benchmark results using state-of-the-art Arabic and English models in prompt-specific and cross-prompt settings. LAILA fills a critical need in Arabic AES research, supporting the development of robust scoring systems.

47. 【2512.24181】MedKGI: Iterative Differential Diagnosis with Medical Knowledge Graphs and Information-Guided Inquiring

链接https://arxiv.org/abs/2512.24181

作者:Qipeng Wang,Rui Sheng,Yafei Li,Huamin Qu,Yushi Sun,Min Zhu

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, demonstrated significant promise, Recent advancements, advancements in Large

备注

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have demonstrated significant promise in clinical diagnosis. However, current models struggle to emulate the iterative, diagnostic hypothesis-driven reasoning of real clinical scenarios. Specifically, current LLMs suffer from three critical limitations: (1) generating hallucinated medical content due to weak grounding in verified knowledge, (2) asking redundant or inefficient questions rather than discriminative ones that hinder diagnostic progress, and (3) losing coherence over multi-turn dialogues, leading to contradictory or inconsistent conclusions. To address these challenges, we propose MedKGI, a diagnostic framework grounded in clinical practices. MedKGI integrates a medical knowledge graph (KG) to constrain reasoning to validated medical ontologies, selects questions based on information gain to maximize diagnostic efficiency, and adopts an OSCE-format structured state to maintain consistent evidence tracking across turns. Experiments on clinical benchmarks show that MedKGI outperforms strong LLM baselines in both diagnostic accuracy and inquiry efficiency, improving dialogue efficiency by 30% on average while maintaining state-of-the-art accuracy.

48. 【2512.24157】raining Report of TeleChat3-MoE

链接https://arxiv.org/abs/2512.24157

作者:Xinzhang Liu,Chao Wang,Zhihao Yang,Zhuo Jiang,Xuncheng Zhao,Haoran Wang,Lei Li,Dongdong He,Luobin Liu,Kaizhe Yuan,Han Gao,Zihan Wang,Yitong Yao,Sishi Xiong,Wenmin Deng,Haowei He,Kaidong Yu,Yu Zhao,Ruiyu Fang,Yuhao Jiang,Yingyan Li,Xiaohui Hu,Xi Yu,Jingqi Li,Yanwei Liu,Qingli Li,Xinyu Shi,Junhao Niu,Chengnuo Huang,Yao Xiao,Ruiwen Wang,Fengkai Li,Luwen Pu,Kaipeng Jia,Fubei Yao,Yuyao Huang,Xuewei He,Zhuoru Jiang,Ruiting Song,Rui Xue,Qiyi Xie,Jie Zhang,Zilu Huang,Zhaoxi Zhang,Zhilong Lu,Yanhan Zhang,Yin Zhang,Yanlei Xue,Zhu Yuan,Teng Su,Xin Jiang,Shuangyong Song,Yongxiang Li,Xuelong Li

类目:Computation and Language (cs.CL)

关键词:Ascend NPU cluster, parameter counts ranging, Ascend NPU, TeleChat large language, architecture with parameter

备注

点击查看摘要

Abstract:TeleChat3-MoE is the latest series of TeleChat large language models, featuring a Mixture-of-Experts (MoE) architecture with parameter counts ranging from 105 billion to over one trillion,trained end-to-end on Ascend NPU cluster. This technical report mainly presents the underlying training infrastructure that enables reliable and efficient scaling to frontier model sizes. We detail systematic methodologies for operator-level and end-to-end numerical accuracy verification, ensuring consistency across hardware platforms and distributed parallelism strategies. Furthermore, we introduce a suite of performance optimizations, including interleaved pipeline scheduling, attention-aware data scheduling for long-sequence training,hierarchical and overlapped communication for expert parallelism, and DVM-based operator fusion. A systematic parallelization framework, leveraging analytical estimation and integer linear programming, is also proposed to optimize multi-dimensional parallelism configurations. Additionally, we present methodological approaches to cluster-level optimizations, addressing host- and device-bound bottlenecks during large-scale training tasks. These infrastructure advancements yield significant throughput improvements and near-linear scaling on clusters comprising thousands of devices, providing a robust foundation for large-scale language model development on hardware ecosystems.

49. 【2512.24149】Large Emotional World Model

链接https://arxiv.org/abs/2512.24149

作者:Changhao Song,Yazhou Zhang,Hui Gao,Chang Yang,Peng Zhang

类目:Computation and Language (cs.CL)

关键词:broad application potential, World Models serve, Large Language Models, World, numerous fields

备注

点击查看摘要

Abstract:World Models serve as tools for understanding the current state of the world and predicting its future dynamics, with broad application potential across numerous fields. As a key component of world knowledge, emotion significantly influences human decision-making. While existing Large Language Models (LLMs) have shown preliminary capability in capturing world knowledge, they primarily focus on modeling physical-world regularities and lack systematic exploration of emotional factors. In this paper, we first demonstrate the importance of emotion in understanding the world by showing that removing emotionally relevant information degrades reasoning performance. Inspired by theory of mind, we further propose a Large Emotional World Model (LEWM). Specifically, we construct the Emotion-Why-How (EWH) dataset, which integrates emotion into causal relationships and enables reasoning about why actions occur and how emotions drive future world states. Based on this dataset, LEWM explicitly models emotional states alongside visual observations and actions, allowing the world model to predict both future states and emotional transitions. Experimental results show that LEWM more accurately predicts emotion-driven social behaviors while maintaining comparable performance to general world models on basic tasks.

50. 【2512.24143】Activation Steering for Masked Diffusion Language Models

链接https://arxiv.org/abs/2512.24143

作者:Adi Shnaidman,Erin Feiglin,Osher Yaari,Efrat Mentel,Amit Levi,Raz Lapid

类目:Computation and Language (cs.CL)

关键词:Masked diffusion language, Masked diffusion, diffusion language models, iterative denoising process, language models

备注

点击查看摘要

Abstract:Masked diffusion language models (MDLMs) generate text through an iterative denoising process. They have recently gained attention due to mask-parallel decoding and competitive performance with autoregressive large language models. However, effective mechanisms for inference-time control and steering in MDLMs remain largely unexplored. We present an activation-steering framework for MDLMs that computes layer-wise steering vectors from a single forward pass using contrastive examples, without simulating the denoising trajectory. These directions are applied at every reverse-diffusion step, yielding an efficient inference-time control mechanism. Experiments on LLaDA-8B-Instruct demonstrate reliable modulation of high-level attributes, with ablations examining the effects of steering across transformer sub-modules and token scope (prompt vs.\ response).

51. 【2512.24124】OptRot: Mitigating Weight Outliers via Data-Free Rotations for Post-Training Quantization

链接https://arxiv.org/abs/2512.24124

作者:Advait Gadhikar,Riccardo Grazzi,James Hensman

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Large Language Models, Language Models, Large Language, difficult to quantize, makes them difficult

备注: 25 pages, 10 figures

点击查看摘要

Abstract:The presence of outliers in Large Language Models (LLMs) weights and activations makes them difficult to quantize. Recent work has leveraged rotations to mitigate these outliers. In this work, we propose methods that learn fusible rotations by minimizing principled and cheap proxy objectives to the weight quantization error. We primarily focus on GPTQ as the quantization method. Our main method is OptRot, which reduces weight outliers simply by minimizing the element-wise fourth power of the rotated weights. We show that OptRot outperforms both Hadamard rotations and more expensive, data-dependent methods like SpinQuant and OSTQuant for weight quantization. It also improves activation quantization in the W4A8 setting. We also propose a data-dependent method, OptRot$^{+}$, that further improves performance by incorporating information on the activation covariance. In the W4A4 setting, we see that both OptRot and OptRot$^{+}$ perform worse, highlighting a trade-off between weight and activation quantization.

52. 【2512.24098】raining a Huggingface Model on AWS Sagemaker (Without Tears)

链接https://arxiv.org/abs/2512.24098

作者:Liling Tan

类目:Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, resource-rich research groups, development of Large, Hugging Face models

备注

点击查看摘要

Abstract:The development of Large Language Models (LLMs) has primarily been driven by resource-rich research groups and industry partners. Due to the lack of on-premise computing resources required for increasingly complex models, many researchers are turning to cloud services like AWS SageMaker to train Hugging Face models. However, the steep learning curve of cloud platforms often presents a barrier for researchers accustomed to local environments. Existing documentation frequently leaves knowledge gaps, forcing users to seek fragmented information across the web. This demo paper aims to democratize cloud adoption by centralizing the essential information required for researchers to successfully train their first Hugging Face model on AWS SageMaker from scratch.

53. 【2512.24097】Factorized Learning for Temporally Grounded Video-Language Models

链接https://arxiv.org/abs/2512.24097

作者:Wenzheng Zeng,Difei Gao,Mike Zheng Shou,Hwee Tou Ng

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)

关键词:Recent video-language models, shown great potential, Recent video-language, video understanding, temporal grounding

备注: ICCV 2025 paper. This arXiv version updates Figure 1 to include the concurrent work Qwen2.5-VL to ensure consistency with Table 1

点击查看摘要

Abstract:Recent video-language models have shown great potential for video understanding, but still struggle with accurate temporal grounding for event-level perception. We observe that two main factors in video understanding (i.e., temporal grounding and textual response) form a logical hierarchy: accurate temporal evidence grounding lays the foundation for reliable textual response. However, existing works typically handle these two tasks in a coupled manner without a clear logical structure, leading to sub-optimal objectives. We address this from a factorized learning perspective. We first propose D$^2$VLM, a framework that decouples the learning of these two tasks while also emphasizing their inherent dependency. We adopt a "grounding then answering with evidence referencing" paradigm and introduce evidence tokens for evidence grounding, which emphasize event-level visual semantic capture beyond the focus on timestamp representation in existing works. To further facilitate the learning of these two tasks, we introduce a novel factorized preference optimization (FPO) algorithm. Unlike standard preference optimization, FPO explicitly incorporates probabilistic temporal grounding modeling into the optimization objective, enabling preference learning for both temporal grounding and textual response. We also construct a synthetic dataset to address the lack of suitable datasets for factorized preference learning with explicit temporal grounding. Experiments on various tasks demonstrate the clear advantage of our approach. Our source code is available at this https URL.

54. 【2512.24092】HY-MT1.5 Technical Report

链接https://arxiv.org/abs/2512.24092

作者:Mao Zheng,Zheng Li,Tao Chen,Mingyang Song,Di Wang

类目:Computation and Language (cs.CL)

关键词:holistic training framework, training framework tailored, introduce our latest, family of machine, holistic training

备注

点击查看摘要

Abstract:In this report, we introduce our latest translation models, HY-MT1.5-1.8B and HY-MT1.5-7B, a new family of machine translation models developed through a holistic training framework tailored for high-performance translation. Our methodology orchestrates a multi-stage pipeline that integrates general and MT-oriented pre-training, supervised fine-tuning, on-policy distillation, and reinforcement learning. HY-MT1.5-1.8B, the 1.8B-parameter model demonstrates remarkable parameter efficiency, comprehensively outperforming significantly larger open-source baselines (e.g., Tower-Plus-72B, Qwen3-32B) and mainstream commercial APIs (e.g., Microsoft Translator, Doubao Translator) in standard Chinese-foreign and English-foreign tasks. It achieves approximately 90% of the performance of ultra-large proprietary models such as Gemini-3.0-Pro, while marginally trailing Gemini-3.0-Pro on WMT25 and Mandarin-minority language benchmarks, it maintains a substantial lead over other competing models. Furthermore, HY-MT1.5-7B establishes a new state-of-the-art for its size class, achieving 95% of Gemini-3.0-Pro's performance on Flores-200 and surpassing it on the challenging WMT25 and Mandarin-minority language test sets. Beyond standard translation, the HY-MT1.5 series supports advanced constraints, including terminology intervention, context-aware translation, and format preservation. Extensive empirical evaluations confirm that both models offer highly competitive, robust solutions for general and specialized translation tasks within their respective parameter scales.

55. 【2512.24058】Beyond Hallucinations: A Composite Score for Measuring Reliability in Open-Source Large Language Models

链接https://arxiv.org/abs/2512.24058

作者:Rohit Kumar Salla,Manoj Saravanan,Shrikar Reddy Kota

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, reliability remains uncertain, Gemma are increasingly, remains uncertain

备注: 5 pages, 4 tables, accepted at AAAI 2026

点击查看摘要

Abstract:Large Language Models (LLMs) like LLaMA, Mistral, and Gemma are increasingly used in decision-critical domains such as healthcare, law, and finance, yet their reliability remains uncertain. They often make overconfident errors, degrade under input shifts, and lack clear uncertainty estimates. Existing evaluations are fragmented, addressing only isolated aspects. We introduce the Composite Reliability Score (CRS), a unified framework that integrates calibration, robustness, and uncertainty quantification into a single interpretable metric. Through experiments on ten leading open-source LLMs across five QA datasets, we assess performance under baselines, perturbations, and calibration methods. CRS delivers stable model rankings, uncovers hidden failure modes missed by single metrics, and highlights that the most dependable systems balance accuracy, robustness, and calibrated uncertainty.

56. 【2512.24052】AHA: Aligning Large Audio-Language Models for Reasoning Hallucinations via Counterfactual Hard Negatives

链接https://arxiv.org/abs/2512.24052

作者:Yanxi Chen,Wenhui Zhu,Xiwen Chen,Zhipeng Wang,Xin Li,Peijie Qiu,Hao Wang,Xuanzhao Dong,Yujian Xiong,Anderson Schneider,Yuriy Nevmyvaka,Yalin Wang

类目:ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)

关键词:False Event Identity, Large Audio-Language Models, Large Audio-Language, Temporal Relation Error, Quantitative Temporal Error

备注

点击查看摘要

Abstract:Although Large Audio-Language Models (LALMs) deliver state-of-the-art (SOTA) performance, they frequently suffer from hallucinations, e.g. generating text not grounded in the audio input. We analyze these grounding failures and identify a distinct taxonomy: Event Omission, False Event Identity, Temporal Relation Error, and Quantitative Temporal Error. To address this, we introduce the AHA (Audio Hallucination Alignment) framework. By leveraging counterfactual hard negative mining, our pipeline constructs a high-quality preference dataset that forces models to distinguish strict acoustic evidence from linguistically plausible fabrications. Additionally, we establish AHA-Eval, a diagnostic benchmark designed to rigorously test these fine-grained temporal reasoning capabilities. We apply this data to align Qwen2.5-Omni. The resulting model, Qwen-Audio-AHA, achieves a 13.7% improvement on AHA-Eval. Crucially, this benefit generalizes beyond our diagnostic set. Our model shows substantial gains on public benchmarks, including 1.3% on MMAU-Test and 1.6% on MMAR, outperforming latest SOTA methods.

57. 【2512.24044】Jailbreaking Attacks vs. Content Safety Filters: How Far Are We in the LLM Safety Arms Race?

链接https://arxiv.org/abs/2512.24044

作者:Yuan Xin,Dingfan Chen,Linyi Yang,Michael Backes,Xiao Zhang

类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:large language models, increasingly deployed, ensuring their safe, large language, language models

备注: 26 pages,11 tables, 7 figures

点击查看摘要

Abstract:As large language models (LLMs) are increasingly deployed, ensuring their safe use is paramount. Jailbreaking, adversarial prompts that bypass model alignment to trigger harmful outputs, present significant risks, with existing studies reporting high success rates in evading common LLMs. However, previous evaluations have focused solely on the models, neglecting the full deployment pipeline, which typically incorporates additional safety mechanisms like content moderation filters. To address this gap, we present the first systematic evaluation of jailbreak attacks targeting LLM safety alignment, assessing their success across the full inference pipeline, including both input and output filtering stages. Our findings yield two key insights: first, nearly all evaluated jailbreak techniques can be detected by at least one safety filter, suggesting that prior assessments may have overestimated the practical success of these attacks; second, while safety filters are effective in detection, there remains room to better balance recall and precision to further optimize protection and user experience. We highlight critical gaps and call for further refinement of detection accuracy and usability in LLM safety systems.

58. 【2512.24014】CLP: Large Language Model Reasoning with Implicit Cognition Latent Planning

链接https://arxiv.org/abs/2512.24014

作者:Sijia Chen,Di Niu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, Large language, Large, reasoning, plans

备注: 9 pages, 6 figures. The source code is publicly available at [this https URL](https://github.com/AgenticFinLab/latent-planning)

点击查看摘要

Abstract:Large language models (LLMs), when guided by explicit textual plans, can perform reliable step-by-step reasoning during problem-solving. However, generating accurate and effective textual plans remains challenging due to LLM hallucinations and the high diversity of task-specific questions. To address this, we draw inspiration from human Implicit Cognition (IC), the subconscious process by which decisions are guided by compact, generalized patterns learned from past experiences without requiring explicit verbalization. We propose iCLP, a novel framework that enables LLMs to adaptively generate latent plans (LPs), which are compact encodings of effective reasoning instructions. iCLP first distills explicit plans from existing step-by-step reasoning trajectories. It then learns discrete representations of these plans via a vector-quantized autoencoder coupled with a codebook. Finally, by fine-tuning LLMs on paired latent plans and corresponding reasoning steps, the models learn to perform implicit planning during reasoning. Experimental results on mathematical reasoning and code generation tasks demonstrate that, with iCLP, LLMs can plan in latent space while reasoning in language space. This approach yields significant improvements in both accuracy and efficiency and, crucially, demonstrates strong cross-domain generalization while preserving the interpretability of chain-of-thought reasoning.

59. 【2512.24000】WISE: Web Information Satire and Fakeness Evaluation

链接https://arxiv.org/abs/2512.24000

作者:Gaurab Chhetri,Subasish Das,Tausif Islam Chowdhury

类目:Computation and Language (cs.CL)

关键词:unique challenge due, overlapping linguistic features, Web Information Satire, Distinguishing fake, Expected Calibration Error

备注: This is the author's preprint. Accepted to WEBGRAPH 2026 (co-located with WSDM 2026), Boise, Idaho, USA, Feb 26, 2026. Final version will appear in WSDM 2026 Companion Proceedings. Conf: [this https URL](https://wsdm-conference.org/2026/) Workshop: [this https URL](https://aiimlab.org/events/WSDM_2026_WEB_and_GRAPH_2026_Workshop_on_Web_and_Graphs_Responsible_Intelligence_and_Social_Media.html)

点击查看摘要

Abstract:Distinguishing fake or untrue news from satire or humor poses a unique challenge due to their overlapping linguistic features and divergent intent. This study develops WISE (Web Information Satire and Fakeness Evaluation) framework which benchmarks eight lightweight transformer models alongside two baseline models on a balanced dataset of 20,000 samples from Fakeddit, annotated as either fake news or satire. Using stratified 5-fold cross-validation, we evaluate models across comprehensive metrics including accuracy, precision, recall, F1-score, ROC-AUC, PR-AUC, MCC, Brier score, and Expected Calibration Error. Our evaluation reveals that MiniLM, a lightweight model, achieves the highest accuracy (87.58%) among all models, while RoBERTa-base achieves the highest ROC-AUC (95.42%) and strong accuracy (87.36%). DistilBERT offers an excellent efficiency-accuracy trade-off with 86.28\% accuracy and 93.90\% ROC-AUC. Statistical tests confirm significant performance differences between models, with paired t-tests and McNemar tests providing rigorous comparisons. Our findings highlight that lightweight models can match or exceed baseline performance, offering actionable insights for deploying misinformation detection systems in real-world, resource-constrained settings.

60. 【2512.23988】Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process

链接https://arxiv.org/abs/2512.23988

作者:Zhenyu Zhang,Shujian Zhang,John Lambert,Wenxuan Zhou,Zhangyang Wang,Mingqing Chen,Andrew Hard,Rajiv Mathews,Lun Wang

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:large language models, process remain underexplored, recent large language, growing reasoning capabilities, reasoning process remain

备注

点击查看摘要

Abstract:Despite the growing reasoning capabilities of recent large language models (LLMs), their internal mechanisms during the reasoning process remain underexplored. Prior approaches often rely on human-defined concepts (e.g., overthinking, reflection) at the word level to analyze reasoning in a supervised manner. However, such methods are limited, as it is infeasible to capture the full spectrum of potential reasoning behaviors, many of which are difficult to define in token space. In this work, we propose an unsupervised framework (namely, RISE: Reasoning behavior Interpretability via Sparse auto-Encoder) for discovering reasoning vectors, which we define as directions in the activation space that encode distinct reasoning behaviors. By segmenting chain-of-thought traces into sentence-level 'steps' and training sparse auto-encoders (SAEs) on step-level activations, we uncover disentangled features corresponding to interpretable behaviors such as reflection and backtracking. Visualization and clustering analyses show that these behaviors occupy separable regions in the decoder column space. Moreover, targeted interventions on SAE-derived vectors can controllably amplify or suppress specific reasoning behaviors, altering inference trajectories without retraining. Beyond behavior-specific disentanglement, SAEs capture structural properties such as response length, revealing clusters of long versus short reasoning traces. More interestingly, SAEs enable the discovery of novel behaviors beyond human supervision. We demonstrate the ability to control response confidence by identifying confidence-related vectors in the SAE decoder space. These findings underscore the potential of unsupervised latent discovery for both interpreting and controllably steering reasoning in LLMs.

61. 【2512.23971】CEC-Zero: Zero-Supervision Character Error Correction with Self-Generated Rewards

链接https://arxiv.org/abs/2512.23971

作者:Zhiming Lin,Kai Zhao,Sophie Zhang,Peilai Yu,Canran Xiao

类目:Computation and Language (cs.CL)

关键词:Large-scale Chinese spelling, Chinese spelling correction, Large-scale Chinese, methods lack robustness, real-world text processing

备注: AAAI'26 poster

点击查看摘要

Abstract:Large-scale Chinese spelling correction (CSC) remains critical for real-world text processing, yet existing LLMs and supervised methods lack robustness to novel errors and rely on costly annotations. We introduce CEC-Zero, a zero-supervision reinforcement learning framework that addresses this by enabling LLMs to correct their own mistakes. CEC-Zero synthesizes errorful inputs from clean text, computes cluster-consensus rewards via semantic similarity and candidate agreement, and optimizes the policy with PPO. It outperforms supervised baselines by 10--13 F$_1$ points and strong LLM fine-tunes by 5--8 points across 9 benchmarks, with theoretical guarantees of unbiased rewards and convergence. CEC-Zero establishes a label-free paradigm for robust, scalable CSC, unlocking LLM potential in noisy text pipelines.

62. 【2512.23966】Efficient Context Scaling with LongCat ZigZag Attention

链接https://arxiv.org/abs/2512.23966

作者:Chen Zhang,Yang Bai,Jiahuan Li,Anchun Gui,Keheng Wang,Feifan Liu,Guanyu Wu,Yuwei Jiang,Defei Bu,Li Wei,Haihang Jing,Hongyin Tang,Xin Chen,Xiangzhou Huang,Fengcun Li,Rongxiang Weng,Yulei Qian,Yifan Lu,Yerui Sun,Jingang Wang,Yuchen Xie,Xunliang Cai

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:LongCat ZigZag Attention, attention scheme designed, limited compute budget, sparse attention scheme, introduce LongCat ZigZag

备注: 10 pages, 3 figures, 3 tables

点击查看摘要

Abstract:We introduce LongCat ZigZag Attention (LoZA), which is a sparse attention scheme designed to transform any existing full-attention models into sparse versions with rather limited compute budget. In long-context scenarios, LoZA can achieve significant speed-ups both for prefill-intensive (e.g., retrieval-augmented generation) and decode-intensive (e.g., tool-integrated reasoning) cases. Specifically, by applying LoZA to LongCat-Flash during mid-training, we serve LongCat-Flash-Exp as a long-context foundation model that can swiftly process up to 1 million tokens, enabling efficient long-term reasoning and long-horizon agentic capabilities.

63. 【2512.23959】Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling

链接https://arxiv.org/abs/2512.23959

作者:Chulun Zhou,Chunkang Zhang,Guoxin Yu,Fandong Meng,Jie Zhou,Wai Lam,Mo Yu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:large language models, widely adopted strategy, enhancing large language, Multi-step retrieval-augmented generation, demand global comprehension

备注: 21 pages

点击查看摘要

Abstract:Multi-step retrieval-augmented generation (RAG) has become a widely adopted strategy for enhancing large language models (LLMs) on tasks that demand global comprehension and intensive reasoning. Many RAG systems incorporate a working memory module to consolidate retrieved information. However, existing memory designs function primarily as passive storage that accumulates isolated facts for the purpose of condensing the lengthy inputs and generating new sub-queries through deduction. This static nature overlooks the crucial high-order correlations among primitive facts, the compositions of which can often provide stronger guidance for subsequent steps. Therefore, their representational strength and impact on multi-step reasoning and knowledge evolution are limited, resulting in fragmented reasoning and weak global sense-making capacity in extended contexts. We introduce HGMem, a hypergraph-based memory mechanism that extends the concept of memory beyond simple storage into a dynamic, expressive structure for complex reasoning and global understanding. In our approach, memory is represented as a hypergraph whose hyperedges correspond to distinct memory units, enabling the progressive formation of higher-order interactions within memory. This mechanism connects facts and thoughts around the focal problem, evolving into an integrated and situated knowledge structure that provides strong propositions for deeper reasoning in subsequent steps. We evaluate HGMem on several challenging datasets designed for global sense-making. Extensive experiments and in-depth analyses show that our method consistently improves multi-step RAG and substantially outperforms strong baseline systems across diverse tasks.

64. 【2512.23941】Disentangling Learning from Judgment: Representation Learning for Open Response Analytics

链接https://arxiv.org/abs/2512.23941

作者:Conrad Borchers,Manit Patel,Seiyon M. Lee,Anthony F. Botelho

类目:Computation and Language (cs.CL); Computers and Society (cs.CY)

关键词:Open-ended responses, automated scoring, scoring often conflates, Open-ended, AUC

备注: Short research paper accepted at Learning Analytics and Knowledge (LAK '26)

点击查看摘要

Abstract:Open-ended responses are central to learning, yet automated scoring often conflates what students wrote with how teachers grade. We present an analytics-first framework that separates content signals from rater tendencies, making judgments visible and auditable via analytics. Using de-identified ASSISTments mathematics responses, we model teacher histories as dynamic priors and derive text representations from sentence embeddings, incorporating centering and residualization to mitigate prompt and teacher confounds. Temporally-validated linear models quantify the contributions of each signal, and a projection surfaces model disagreements for qualitative inspection. Results show that teacher priors heavily influence grade predictions; the strongest results arise when priors are combined with content embeddings (AUC~0.815), while content-only models remain above chance but substantially weaker (AUC~0.626). Adjusting for rater effects sharpens the residual content representation, retaining more informative embedding dimensions and revealing cases where semantic evidence supports understanding as opposed to surface-level differences in how students respond. The contribution presents a practical pipeline that transforms embeddings from mere features into learning analytics for reflection, enabling teachers and researchers to examine where grading practices align (or conflict) with evidence of student reasoning and learning.

65. 【2512.23862】Probing the Limits of Compressive Memory: A Study of Infini-Attention in Small-Scale Pretraining

链接https://arxiv.org/abs/2512.23862

作者:Ruizhe Huang,Kexuan Zhang,Yihao Fang,Baifeng Yu

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:Small Language Models, Small Language, investigates small-scale pretraining, pretraining for Small, study investigates small-scale

备注

点击查看摘要

Abstract:This study investigates small-scale pretraining for Small Language Models (SLMs) to enable efficient use of limited data and compute, improve accessibility in low-resource settings and reduce costs. To enhance long-context extrapolation in compact models, we focus on Infini-attention, which builds a compressed memory from past segments while preserving local attention. In our work, we conduct an empirical study using 300M-parameter LLaMA models pretrained with Infini-attention. The model demonstrates training stability and outperforms the baseline in long-context retrieval. We identify the balance factor as a key part of the model performance, and we found that retrieval accuracy drops with repeated memory compressions over long sequences. Even so, Infini-attention still effectively compensates for the SLM's limited parameters. Particularly, despite performance degradation at a 16,384-token context, the Infini-attention model achieves up to 31% higher accuracy than the baseline. Our findings suggest that achieving robust long-context capability in SLMs benefits from architectural memory like Infini-attention.

66. 【2512.23852】rellis: Learning to Compress Key-Value Memory in Attention Models

链接https://arxiv.org/abs/2512.23852

作者:Mahdi Karami,Ali Behrouz,Praneeth Kacham,Vahab Mirrokni

类目:Machine Learning (cs.LG); Computation and Language (cs.CL)

关键词:quadratic computational complexity, suffer from quadratic, quadratic computational, computational complexity, ever-growing Key-Value

备注: In Second Conference on Language Modeling (COLM) (2025)

点击查看摘要

Abstract:Transformers, while powerful, suffer from quadratic computational complexity and the ever-growing Key-Value (KV) cache of the attention mechanism. This paper introduces Trellis, a novel Transformer architecture with bounded memory that learns how to compress its key-value memory dynamically at test time. Trellis replaces the standard KV cache with a fixed-size memory and train a two-pass recurrent compression mechanism to store new keys and values into memory. To achieve this, it leverages an online gradient descent procedure with a forget gate, enabling the compressed memory to be updated recursively while learning to retain important contextual information from incoming tokens at test time. Extensive experiments on language modeling, common-sense reasoning, recall-intensive tasks, and time series show that the proposed architecture outperforms strong baselines. Notably, its performance gains increase as the sequence length grows, highlighting its potential for long-context applications.

67. 【2512.23850】he Drill-Down and Fabricate Test (DDFT): A Protocol for Measuring Epistemic Robustness in Language Models

链接https://arxiv.org/abs/2512.23850

作者:Rahul Baxi

类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:realistic stress, ideal conditions, Current language model, language model evaluations, epistemic robustness

备注: Currently under review at TMLR

点击查看摘要

Abstract:Current language model evaluations measure what models know under ideal conditions but not how robustly they know it under realistic stress. Static benchmarks like MMLU and TruthfulQA cannot distinguish a model that lacks knowledge from one whose verification mechanisms collapse when information degrades or adversaries probe for weaknesses. We introduce the Drill-Down and Fabricate Test (DDFT), a protocol that measures epistemic robustness: a model's ability to maintain factual accuracy under progressive semantic compression and adversarial fabrication. We propose a two-system cognitive model comprising a Semantic System that generates fluent text and an Epistemic Verifier that validates factual accuracy. Our findings, based on evaluating 9 frontier models across 8 knowledge domains at 5 compression levels (1,800 turn-level evaluations), reveal that epistemic robustness is orthogonal to conventional design paradigms. Neither parameter count (r=0.083, p=0.832) nor architectural type (r=0.153, p=0.695) significantly predicts robustness, suggesting it emerges from training methodology and verification mechanisms distinct from current approaches. Error detection capability strongly predicts overall robustness (rho=-0.817, p=0.007), indicating this is the critical bottleneck. We find that flagship models exhibit brittleness despite their scale, while smaller models can achieve robust performance, challenging assumptions about the relationship between model size and reliability. The DDFT framework provides both theoretical foundation and practical tools for assessing epistemic robustness before deployment in critical applications.

68. 【2512.23848】Integrating Domain Knowledge for Financial QA: A Multi-Retriever RAG Approach with LLMs

链接https://arxiv.org/abs/2512.23848

作者:Yukun Zhang,Stefan Elbl Droguett,Samyak Jain

类目:Computation and Language (cs.CL); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)

关键词:reasoning Question Answering, research project addresses, Question Answering, financial numerical questions, numerical reasoning Question

备注

点击查看摘要

Abstract:This research project addresses the errors of financial numerical reasoning Question Answering (QA) tasks due to the lack of domain knowledge in finance. Despite recent advances in Large Language Models (LLMs), financial numerical questions remain challenging because they require specific domain knowledge in finance and complex multi-step numeric reasoning. We implement a multi-retriever Retrieval Augmented Generators (RAG) system to retrieve both external domain knowledge and internal question contexts, and utilize the latest LLM to tackle these tasks. Through comprehensive ablation experiments and error analysis, we find that domain-specific training with the SecBERT encoder significantly contributes to our best neural symbolic model surpassing the FinQA paper's top model, which serves as our baseline. This suggests the potential superior performance of domain-specific training. Furthermore, our best prompt-based LLM generator achieves the state-of-the-art (SOTA) performance with significant improvement (7%), yet it is still below the human expert performance. This study highlights the trade-off between hallucinations loss and external knowledge gains in smaller models and few-shot examples. For larger models, the gains from external facts typically outweigh the hallucination loss. Finally, our findings confirm the enhanced numerical reasoning capabilities of the latest LLM, optimized for few-shot learning.

69. 【2512.23837】Adversarial Lens: Exploiting Attention Layers to Generate Adversarial Examples for Evaluation

链接https://arxiv.org/abs/2512.23837

作者:Kaustubh Dhole

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:mechanistic interpretability suggest, encode token-level hypotheses, Recent advances, attention layers encode, layers encode token-level

备注

点击查看摘要

Abstract:Recent advances in mechanistic interpretability suggest that intermediate attention layers encode token-level hypotheses that are iteratively refined toward the final output. In this work, we exploit this property to generate adversarial examples directly from attention-layer token distributions. Unlike prompt-based or gradient-based attacks, our approach leverages model-internal token predictions, producing perturbations that are both plausible and internally consistent with the model's own generation process. We evaluate whether tokens extracted from intermediate layers can serve as effective adversarial perturbations for downstream evaluation tasks. We conduct experiments on argument quality assessment using the ArgQuality dataset, with LLaMA-3.1-Instruct-8B serving as both the generator and evaluator. Our results show that attention-based adversarial examples lead to measurable drops in evaluation performance while remaining semantically similar to the original inputs. However, we also observe that substitutions drawn from certain layers and token positions can introduce grammatical degradation, limiting their practical effectiveness. Overall, our findings highlight both the promise and current limitations of using intermediate-layer representations as a principled source of adversarial examples for stress-testing LLM-based evaluation pipelines.

70. 【2512.23836】Retrieval Augmented Question Answering: When Should LLMs Admit Ignorance?

链接https://arxiv.org/abs/2512.23836

作者:Dingmin Wang,Ji Ma,Shankar Kumar

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Large Language Models, Large Language, expanded context windows, windows in Large, Language Models

备注

点击查看摘要

Abstract:The success of expanded context windows in Large Language Models (LLMs) has driven increased use of broader context in retrieval-augmented generation. We investigate the use of LLMs for retrieval augmented question answering. While longer contexts make it easier to incorporate targeted knowledge, they introduce more irrelevant information that hinders the model's generation process and degrades its performance. To address the issue, we design an adaptive prompting strategy which involves splitting the retrieved information into smaller chunks and sequentially prompting a LLM to answer the question using each chunk. Adjusting the chunk size allows a trade-off between incorporating relevant information and reducing irrelevant information. Experimental results on three open-domain question answering datasets demonstrate that the adaptive strategy matches the performance of standard prompting while using fewer tokens. Our analysis reveals that when encountering insufficient information, the LLM often generates incorrect answers instead of declining to respond, which constitutes a major source of error. This finding highlights the need for further research into enhancing LLMs' ability to effectively decline requests when faced with inadequate information.

71. 【2512.23835】Explaining News Bias Detection: A Comparative SHAP Analysis of Transformer Model Decision Mechanisms

链接https://arxiv.org/abs/2512.23835

作者:Himel Ghosh

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

关键词:Automated bias detection, Automated bias, bias detection, support journalistic analysis, BABE dataset

备注: 10 pages, 8 figures

点击查看摘要

Abstract:Automated bias detection in news text is heavily used to support journalistic analysis and media accountability, yet little is known about how bias detection models arrive at their decisions or why they fail. In this work, we present a comparative interpretability study of two transformer-based bias detection models: a bias detector fine-tuned on the BABE dataset and a domain-adapted pre-trained RoBERTa model fine-tuned on the BABE dataset, using SHAP-based explanations. We analyze word-level attributions across correct and incorrect predictions to characterize how different model architectures operationalize linguistic bias. Our results show that although both models attend to similar categories of evaluative language, they differ substantially in how these signals are integrated into predictions. The bias detector model assigns stronger internal evidence to false positives than to true positives, indicating a misalignment between attribution strength and prediction correctness and contributing to systematic over-flagging of neutral journalistic content. In contrast, the domain-adaptive model exhibits attribution patterns that better align with prediction outcomes and produces 63\% fewer false positives. We further demonstrate that model errors arise from distinct linguistic mechanisms, with false positives driven by discourse-level ambiguity rather than explicit bias cues. These findings highlight the importance of interpretability-aware evaluation for bias detection systems and suggest that architectural and training choices critically affect both model reliability and deployment suitability in journalistic contexts.

72. 【2512.23813】StressRoBERTa: Cross-Condition Transfer Learning from Depression, Anxiety, and PTSD to Stress Detection

链接https://arxiv.org/abs/2512.23813

作者:Amal Alqahtani,Efsun Kayi,Mona Diab

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:social media platforms, platforms like Twitter, Twitter serving, public health concern, significant public health

备注

点击查看摘要

Abstract:The prevalence of chronic stress represents a significant public health concern, with social media platforms like Twitter serving as important venues for individuals to share their experiences. This paper introduces StressRoBERTa, a cross-condition transfer learning approach for automatic detection of self-reported chronic stress in English tweets. The investigation examines whether continual training on clinically related conditions (depression, anxiety, PTSD), disorders with high comorbidity with chronic stress, improves stress detection compared to general language models and broad mental health models. RoBERTa is continually trained on the Stress-SMHD corpus (108M words from users with self-reported diagnoses of depression, anxiety, and PTSD) and fine-tuned on the SMM4H 2022 Task 8 dataset. StressRoBERTa achieves 82% F1-score, outperforming the best shared task system (79% F1) by 3 percentage points. The results demonstrate that focused cross-condition transfer from stress-related disorders (+1% F1 over vanilla RoBERTa) provides stronger representations than general mental health training. Evaluation on Dreaddit (81% F1) further demonstrates transfer from clinical mental health contexts to situational stress discussions.

73. 【2512.23808】MiMo-Audio: Audio Language Models are Few-Shot Learners

链接https://arxiv.org/abs/2512.23808

作者:Xiaomi LLM-Core Team:Dong Zhang,Gang Wang,Jinlong Xue,Kai Fang,Liang Zhao,Rui Ma,Shuhuai Ren,Shuo Liu,Tao Guo,Weiji Zhuang,Xin Zhang,Xingchen Song,Yihan Yan,Yongzhe He,Cici,Bowen Shen,Chengxuan Zhu,Chong Ma,Chun Chen,Heyu Chen,Jiawei Li,Lei Li,Menghang Zhu,Peidian Li,Qiying Wang,Sirui Deng,Weimin Xiong,Wenshan Huang,Wenyu Yang,Yilin Jiang,Yixin Yang,Yuanyuan Tian,Yue Ma,Yue Yu,Zihan Zhang,Zihao Yue,Bangjun Xiao,Bingquan Xia,Bofei Gao,Bowen Ye,Can Cai,Chang Liu,Chenhong He,Chunan Li,Dawei Zhu,Duo Zhang,Fengyuan Shi,Guoan Wang,Hailin Zhang,Hanglong Lv,Hanyu Li,Hao Tian,Heng Qu,Hongshen Xu,Houbin Zhang,Huaqiu Liu,Jiangshan Duo,Jianguang Zuo,Jianyu Wei,Jiebao Xiao,Jinhao Dong,Jun Shi,Junhao Hu,Kainan Bao,Kang Zhou,Linghao Zhang,Meng Chen,Nuo Chen,Peng Zhang,Qianli Chen,Qiantong Wang,Rang Li,Shaohui Liu,Shengfan Wang,Shicheng Li,Shihua Yu,Shijie Cao,Shimao Chen,Shuhao Gu,Weikun Wang,Wenhan Ma,Xiangwei Deng,Xing Yong,Xing Zhang,Xu Wang,Yifan Song,Yihao Zhao,Yingbo Zhao,Yizhao Gao,Yu Cheng,Yu Tu,Yudong Wang,Zhaojun Huang,Zhengju Tang,Zhenru Lin,Zhichao Song,Zhipeng Xu,Zhixian Zheng,Zihan Jiang

类目:Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

关键词:Existing audio language, Existing audio, language models typically, models typically rely, audio

备注

点击查看摘要

Abstract:Existing audio language models typically rely on task-specific fine-tuning to accomplish particular audio tasks. In contrast, humans are able to generalize to new audio tasks with only a few examples or simple instructions. GPT-3 has shown that scaling next-token prediction pretraining enables strong generalization capabilities in text, and we believe this paradigm is equally applicable to the audio domain. By scaling MiMo-Audio's pretraining data to over one hundred million of hours, we observe the emergence of few-shot learning capabilities across a diverse set of audio tasks. We develop a systematic evaluation of these capabilities and find that MiMo-Audio-7B-Base achieves SOTA performance on both speech intelligence and audio understanding benchmarks among open-source models. Beyond standard metrics, MiMo-Audio-7B-Base generalizes to tasks absent from its training data, such as voice conversion, style transfer, and speech editing. MiMo-Audio-7B-Base also demonstrates powerful speech continuation capabilities, capable of generating highly realistic talk shows, recitations, livestreaming and debates. At the post-training stage, we curate a diverse instruction-tuning corpus and introduce thinking mechanisms into both audio understanding and generation. MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding benchmarks (MMSU, MMAU, MMAR, MMAU-Pro), spoken dialogue benchmarks (Big Bench Audio, MultiChallenge Audio) and instruct-TTS evaluations, approaching or surpassing closed-source models. Model checkpoints and full evaluation suite are available at this https URL.

74. 【2512.23765】Entropy-Aware Speculative Decoding Toward Improved LLM Reasoning

链接https://arxiv.org/abs/2512.23765

作者:Tiancheng Su,Meicong Zhang,Guoxiu He

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:accelerates large language, target LLM, large language model, generate candidate tokens, Speculative decoding

备注

点击查看摘要

Abstract:Speculative decoding (SD) accelerates large language model (LLM) reasoning by using a small draft model to generate candidate tokens, which the target LLM either accepts directly or regenerates upon rejection. However, excessive alignment between the draft and target models constrains SD to the performance of the target LLM. To address this limitation, we propose Entropy-Aware Speculative Decoding (EASD), a training-free enhancement. Building on standard SD, EASD incorporates a dynamic entropy-based penalty. At each decoding step, we employ the entropy of the sampling distribution to quantify model uncertainty. When both models exhibit high entropy with substantial overlap among their top-N predictions, the corresponding token is rejected and re-sampled by the target LLM. This penalty prevents low-confidence errors from propagating. By incorporating draft-model verification, EASD enables the possibility of surpassing the target model's inherent performance. Experiments across multiple reasoning benchmarks demonstrate that EASD consistently outperforms existing SD methods and, in most cases, surpasses the target LLM itself. We further prove that the efficiency of EASD is comparable to that of SD. The code can be found in the Supplementary Materials.

75. 【2512.23747】State-of-the-art Small Language Coder Model: Mify-Coder

链接https://arxiv.org/abs/2512.23747

作者:Abhinav Parmar,Abhisek Panigrahi,Abhishek Kumar Dwivedi,Abhishek Bhattacharya,Adarsh Ramachandra,Aditya Choudhary,Aditya Garg,Aditya Raj,Alankrit Bhatt,Alpesh Yadav,Anant Vishnu,Ananthu Pillai,Ankush Kumar,Aryan Patnaik,Aswatha Narayanan S,Avanish Raj Singh,Bhavya Shree Gadda,Brijesh Pankajbhai Kachhadiya,Buggala Jahnavi,Chidurala Nithin Krishna,Chintan Shah,Chunduru Akshaya,Debarshi Banerjee,Debrup Dey,Deepa R.,Deepika B G,Faiz ur Rahman,Gagan Gayari,Gudhi Jagadeesh Kumar Naidu,Gursimar Singh,Harshal Tyagi,Harshini K,James Mani Vathalloor,Jayarama Nettar,Jayashree Gajjam,Joe Walter Sugil George,Kamalakara Sri Krishna Tadepalli,Kamalkumar Rathinasamy,Karan Chaurasia,Karthikeyan S,Kashish Arora,Kaushal Desai,Khushboo Buwade,Kiran Manjrekar,Malikireddy Venkata Sai Likhitha,Manjunath A,Mitali Mahavir Bedmutha,Mohammed Rafee Tarafdar,Nikhil Tiwari,Nikitha K Gigi,Pavan Ravikumar,Pendyala Swarnanjali,Piyush Anand,Prakash Chandrasekar,Prasanna Bhalchandra Gawade,Prasanth Sivan,Preeti Khurana,Priyanshi Babbar,Rajab Ali Mondal,Rajesh Kumar Vissapragada,Rajeshwari Ganesan,Rajeswari Koppisetti,Ramjee R.,Ramkumar Thiruppathisamy,Rani G. S.,S Reka,Samarth Gupta,Sandeep Reddy Kothakota,Sarathy K,Sathyanarayana Sampath Kumar,Saurabh Kumar,Shashank Khasare,Shenbaga Devi Venkatesh Kumar,Shiva Rama Krishna Parvatham,Shoeb Shaikh,Shrishanmathi A,Shubham Pathak,Sree Samhita Koppaka,Sreenivasa Raghavan K S,Sreeram Venkatasubramanian,Suprabha Desai Bojja,Swetha R,Syed Ahmed,Chinmai Harshitha Thota,Tushar Yadav,Veeravelly Kusumitha,V V S S Prasanth Patnaik,Vidya Sri Sesetti,Vijayakeerthi K,Vikram Raj Bakshi,Vinay K K,Vinoth Kumar Loganathan,Vipin Tiwari,Vivek Kumar Shrivastav,V Venkata Sri Datta Charan,Wasim Akhtar Khan

类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:compute-optimal strategy built, compute-optimal strategy, strategy built, code model trained, foundation model

备注

点击查看摘要

Abstract:We present Mify-Coder, a 2.5B-parameter code model trained on 4.2T tokens using a compute-optimal strategy built on the Mify-2.5B foundation model. Mify-Coder achieves comparable accuracy and safety while significantly outperforming much larger baseline models on standard coding and function-calling benchmarks, demonstrating that compact models can match frontier-grade models in code generation and agent-driven workflows. Our training pipeline combines high-quality curated sources with synthetic data generated through agentically designed prompts, refined iteratively using enterprise-grade evaluation datasets. LLM-based quality filtering further enhances data density, enabling frugal yet effective training. Through disciplined exploration of CPT-SFT objectives, data mixtures, and sampling dynamics, we deliver frontier-grade code intelligence within a single continuous training trajectory. Empirical evidence shows that principled data and compute discipline allow smaller models to achieve competitive accuracy, efficiency, and safety compliance. Quantized variants of Mify-Coder enable deployment on standard desktop environments without requiring specialized hardware.

76. 【2512.23739】Break Out the Silverware -- Semantic Understanding of Stored Household Items

链接https://arxiv.org/abs/2512.23739

作者:Michaela Levi-Richter,Reuth Mirsky,Oren Glickman

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:simple command reveals, Stored Household Item, Household Item Challenge, inferring where everyday, sight in drawers

备注: Poster presented at the Israeli Seminar on Computational Linguistics 2025

点击查看摘要

Abstract:``Bring me a plate.'' For domestic service robots, this simple command reveals a complex challenge: inferring where everyday items are stored, often out of sight in drawers, cabinets, or closets. Despite advances in vision and manipulation, robots still lack the commonsense reasoning needed to complete this task. We introduce the Stored Household Item Challenge, a benchmark task for evaluating service robots' cognitive capabilities: given a household scene and a queried item, predict its most likely storage location. Our benchmark includes two datasets: (1) a real-world evaluation set of 100 item-image pairs with human-annotated ground truth from participants' kitchens, and (2) a development set of 6,500 item-image pairs annotated with storage polygons over public kitchen images. These datasets support realistic modeling of household organization and enable comparative evaluation across agent architectures. To begin tackling this challenge, we introduce NOAM (Non-visible Object Allocation Model), a hybrid agent pipeline that combines structured scene understanding with large language model inference. NOAM converts visual input into natural language descriptions of spatial context and visible containers, then prompts a language model (e.g., GPT-4) to infer the most likely hidden storage location. This integrated vision-language agent exhibits emergent commonsense reasoning and is designed for modular deployment within broader robotic systems. We evaluate NOAM against baselines including random selection, vision-language pipelines (Grounding-DINO + SAM), leading multimodal models (e.g., Gemini, GPT-4o, Kosmos-2, LLaMA, Qwen), and human performance. NOAM significantly improves prediction accuracy and approaches human-level results, highlighting best practices for deploying cognitively capable agents in domestic environments.

Comments:
Poster presented at the Israeli Seminar on Computational Linguistics 2025

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Cite as:
arXiv:2512.23739 [cs.CL]

(or
arXiv:2512.23739v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2512.23739

Focus to learn more

              arXiv-issued DOI via DataCite

Submission history From: Reuth Mirsky [view email] [v1]
Thu, 25 Dec 2025 15:21:49 UTC (4,504 KB)

77. 【2512.23732】When in Doubt, Deliberate: Confidence-Based Routing to Expert Debate for Sexism Detection

链接https://arxiv.org/abs/2512.23732

作者:Anwar Alajmi,Gabriele Pergola

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Sexist content online, traditional detection methods, content online increasingly, evade traditional detection, Sexist content

备注

点击查看摘要

Abstract:Sexist content online increasingly appears in subtle, context-dependent forms that evade traditional detection methods. Its interpretation often depends on overlapping linguistic, psychological, legal, and cultural dimensions, which produce mixed and sometimes contradictory signals, even in annotated datasets. These inconsistencies, combined with label scarcity and class imbalance, result in unstable decision boundaries and cause fine-tuned models to overlook subtler, underrepresented forms of harm. Together, these limitations point to the need for a design that explicitly addresses the combined effects of (i) underrepresentation, (ii) noise, and (iii) conceptual ambiguity in both data and model predictions. To address these challenges, we propose a two-stage framework that unifies (i) targeted training procedures to adapt supervision to scarce and noisy data with (ii) selective, reasoning-based inference to handle ambiguous or borderline cases. Our training setup applies class-balanced focal loss, class-aware batching, and post-hoc threshold calibration to mitigate label imbalance and noisy supervision. At inference time, a dynamic routing mechanism classifies high-confidence cases directly and escalates uncertain instances to a novel \textit{Collaborative Expert Judgment} (CEJ) module, which prompts multiple personas and consolidates their reasoning through a judge model. Our approach achieves state-of-the-art results across several benchmarks, with a +2.72\% improvement in F1 on the EXIST 2025 Task 1.1, and a gains of +4.48\% and +1.30\% on the EDOS Tasks A and B, respectively.

78. 【2512.23722】Emergent World Beliefs: Exploring Transformers in Stochastic Games

链接https://arxiv.org/abs/2512.23722

作者:Adam Kamel,Tanish Rastogi,Michael Ma,Kailash Ranganathan,Kevin Zhu

类目:Computation and Language (cs.CL)

关键词:Transformer-based large language, strong reasoning abilities, solving programming challenges, Transformer-based large, large language models

备注: Accepted at NeurIPS 2025 Mechanistic Interpretability Workshop

点击查看摘要

Abstract:Transformer-based large language models (LLMs) have demonstrated strong reasoning abilities across diverse fields, from solving programming challenges to competing in strategy-intensive games such as chess. Prior work has shown that LLMs can develop emergent world models in games of perfect information, where internal representations correspond to latent states of the environment. In this paper, we extend this line of investigation to domains of incomplete information, focusing on poker as a canonical partially observable Markov decision process (POMDP). We pretrain a GPT-style model on Poker Hand History (PHH) data and probe its internal activations. Our results demonstrate that the model learns both deterministic structure, such as hand ranks, and stochastic features, such as equity, without explicit instruction. Furthermore, by using primarily nonlinear probes, we demonstrated that these representations are decodeable and correlate with theoretical belief states, suggesting that LLMs are learning their own representation of the stochastic environment of Texas Hold'em Poker.

79. 【2512.23717】HarmTransform: Transforming Explicit Harmful Queries into Stealthy via Multi-Agent Debate

链接https://arxiv.org/abs/2512.23717

作者:Shenzhe Zhu

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large language models, approaches primarily focus, overtly dangerous content, Large language, current alignment approaches

备注

点击查看摘要

Abstract:Large language models (LLMs) are equipped with safety mechanisms to detect and block harmful queries, yet current alignment approaches primarily focus on overtly dangerous content and overlook more subtle threats. However, users can often disguise harmful intent through covert rephrasing that preserves malicious objectives while appearing benign, which creates a significant gap in existing safety training data. To address this limitation, we introduce HarmTransform, a multi-agent debate framework for systematically transforming harmful queries into stealthier forms while preserving their underlying harmful intent. Our framework leverages iterative critique and refinement among multiple agents to generate high-quality, covert harmful query transformations that can be used to improve future LLM safety alignment. Experiments demonstrate that HarmTransform significantly outperforms standard baselines in producing effective query transformations. At the same time, our analysis reveals that debate acts as a double-edged sword: while it can sharpen transformations and improve stealth, it may also introduce topic shifts and unnecessary complexity. These insights highlight both the promise and the limitations of multi-agent debate for generating comprehensive safety training data.

80. 【2512.23716】Noise-Driven Persona Formation in Reflexive Neural Language Generation

链接https://arxiv.org/abs/2512.23716

作者:Toshiyuki Shigemura

类目:Computation and Language (cs.CL)

关键词:large language models, Luca-Noise Reflex Protocol, Luca-Noise Reflex, noise-driven persona emergence, analyzing noise-driven persona

备注: 324 pages, 9 figures (Figure 7 intentionally skipped), with Appendices A-I. This manuscript presents a computational framework for noise-driven persona formation in neural language generation, analyzing 152 generation cycles using GPT-5.1 with stochastic noise seeds generated by Microsoft Copilot. Primary category: cs.CL

点击查看摘要

Abstract:This paper introduces the Luca-Noise Reflex Protocol (LN-RP), a computational framework for analyzing noise-driven persona emergence in large language models. By injecting stochastic noise seeds into the initial generation state, we observe nonlinear transitions in linguistic behavior across 152 generation cycles. Our results reveal three stable persona modes with distinct entropy signatures, and demonstrate that external noise sources can reliably induce phase transitions in reflexive generation dynamics. Quantitative evaluation confirms consistent persona retention and significant differences across modes (p 0.01). The protocol provides a reproducible method for studying reflexive generation, emergent behavior, and longrange linguistic coherence in LLMs.

81. 【2512.23714】PharmaShip: An Entity-Centric, Reading-Order-Supervised Benchmark for Chinese Pharmaceutical Shipping Documents

链接https://arxiv.org/abs/2512.23714

作者:Tingwei Xie,Tianyi Zhou,Yonghong Song

类目:Computation and Language (cs.CL)

关键词:stress-test pre-trained text-layout, pre-trained text-layout models, real-world Chinese dataset, scanned pharmaceutical shipping, shipping documents designed

备注: 5 pages, 4 figures

点击查看摘要

Abstract:We present PharmaShip, a real-world Chinese dataset of scanned pharmaceutical shipping documents designed to stress-test pre-trained text-layout models under noisy OCR and heterogeneous templates. PharmaShip covers three complementary tasks-sequence entity recognition (SER), relation extraction (RE), and reading order prediction (ROP)-and adopts an entity-centric evaluation protocol to minimize confounds across architectures. We benchmark five representative baselines spanning pixel-aware and geometry-aware families (LiLT, LayoutLMv3-base, GeoLayoutLM and their available RORE-enhanced variants), and standardize preprocessing, splits, and optimization. Experiments show that pixels and explicit geometry provide complementary inductive biases, yet neither alone is sufficient: injecting reading-order-oriented regularization consistently improves SER and EL and yields the most robust configuration, while longer positional coverage stabilizes late-page predictions and reduces truncation artifacts. ROP is accurate at the word level but challenging at the segment level, reflecting boundary ambiguity and long-range crossings. PharmaShip thus establishes a controlled, reproducible benchmark for safety-critical document understanding in the pharmaceutical domain and highlights sequence-aware constraints as a transferable bias for structure modeling. We release the dataset at this https URL.

82. 【2512.23713】PyBangla at BLP-2025 Task 2: Enhancing Bangla-to-Python Code Generation with Iterative Self-Correction and Multilingual Agents

链接https://arxiv.org/abs/2512.23713

作者:Jahidul Islam,Md Ataullha,Saiful Azad

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:English prompts, code generation, English, LLMs excel, code

备注: 6 Pages

点击查看摘要

Abstract:LLMs excel at code generation from English prompts, but this progress has not extended to low-resource languages. We address Bangla-to-Python code generation by introducing BanglaCodeAct, an agent-based framework that leverages multi-agent prompting and iterative self-correction. Unlike prior approaches relying on task-specific fine-tuning, BanglaCodeAct employs an open-source multilingual LLM within a Thought-Code-Observation loop, enabling dynamic generation, testing, and refinement of code from Bangla instructions. We benchmark several small-parameter open-source LLMs and evaluate their effectiveness on the mHumanEval dataset for Bangla NL2Code. Our results show that Qwen3-8B, when deployed with BanglaCodeAct, achieves the best performance, with pass@1 accuracy of 94.0\% on the development set and 71.6\% on the blind test set. These results establish a new benchmark for Bangla-to-Python translation and highlight the potential of agent-based reasoning for reliable code generation in low-resource languages. Experimental scripts are publicly available at this http URL.

83. 【2512.23712】STED and Consistency Scoring: A Framework for Evaluating LLM Structured Output Reliability

链接https://arxiv.org/abs/2512.23712

作者:Guanghui Wang,Jinze Yu,Xing Zhang,Dayuan Jiang,Yin Song,Tomal Deb,Xuefeng Liu,Peiyang He

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Large Language Models, Large Language, consistency remains critical, Tree Edit Distance, Language Models

备注

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly deployed for structured data generation, yet output consistency remains critical for production applications. We introduce a comprehensive framework for evaluating and improving consistency in LLM-generated structured outputs. Our approach combines: (1) STED (Semantic Tree Edit Distance), a novel similarity metric balancing semantic flexibility with structural strictness when comparing JSON outputs, and (2) a consistency scoring framework aggregating multiple STED measurements across repeated generations to quantify reliability. Through systematic experiments on synthetic datasets with controlled schema, expression, and semantic variations, we demonstrate STED achieves superior performance ($0.86-0.90$ similarity for semantic equivalents, $0.0$ for structural breaks) compared to existing metrics including TED, BERTScore, and DeepDiff. Applying our framework to benchmark six LLMs reveals significant variations: Claude-3.7-Sonnet demonstrates exceptional consistency, maintaining near-perfect structural reliability even at high temperatures ($T=0.9$), while models like Claude-3-Haiku and Nova-Pro exhibit substantial degradation requiring careful tuning. Our framework enables practical applications including targeted model selection for structured tasks, iterative prompt refinement for reproducible results, and diagnostic analysis to identify inconsistency root causes. This work provides theoretical foundations and practical tools for ensuring reliable structured output generation in LLM-based production systems.

84. 【2512.23711】CAT: A Metric-Driven Framework for Analyzing the Consistency-Accuracy Relation of LLMs under Controlled Input Variations

链接https://arxiv.org/abs/2512.23711

作者:Paulo Cavalin,Cassia Sanctos,Marcelo Grave,Claudio Pinhanez,Yago Primerano

类目:Computation and Language (cs.CL)

关键词:Large Language Models, Large Language, controllable input variations, Language Models, emph

备注

点击查看摘要

Abstract:We introduce \textsc{CAT}, a framework designed to evaluate and visualize the \emph{interplay} of \emph{accuracy} and \emph{response consistency} of Large Language Models (LLMs) under controllable input variations, using multiple-choice (MC) benchmarks as a case study. Current evaluation practices primarily focus on model capabilities such as accuracy or benchmark scores and, more recently, measuring consistency is being considered an essential property for deploying LLMs in high-stake, real-world applications. We argue in this paper that although both dimensions should still be evaluated independently, their inter-dependency also need to be considered for a more nuanced evaluation of LLMs. At the core of \textsc{CAT} are the \emph{Consistency-Accuracy Relation (CAR)} curves, which visualize how model accuracy varies with increasing consistency requirements, as defined by the \emph{Minimum-Consistency Accuracy (MCA)} metric. We further propose the \emph{Consistency-Oriented Robustness Estimate (CORE)} index, a global metric that combines the area and shape of the CAR curve to quantify the trade-off between accuracy and consistency. We present a practical demonstration of our framework across a diverse set of generalist and domain-specific LLMs, evaluated on multiple MC benchmarks. We also outline how \textsc{CAT} can be extended beyond MC tasks to support long-form, open-ended evaluations through adaptable scoring functions.

85. 【2512.23710】Enriching Historical Records: An OCR and AI-Driven Approach for Database Integration

链接https://arxiv.org/abs/2512.23710

作者:Zahra Abedi,Richard M.K. van Dijk,Gijs Wijnholds,Tessa Verhoef

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

关键词:Leiden University, lectoren 1575-1815 books, 1575-1815 books written, analyzes the Leidse, Leidse hoogleraren

备注

点击查看摘要

Abstract:This research digitizes and analyzes the Leidse hoogleraren en lectoren 1575-1815 books written between 1983 and 1985, which contain biographic data about professors and curators of Leiden University. It addresses the central question: how can we design an automated pipeline that integrates OCR, LLM-based interpretation, and database linking to harmonize data from historical document images with existing high-quality database records? We applied OCR techniques, generative AI decoding constraints that structure data extraction, and database linkage methods to process typewritten historical records into a digital format. OCR achieved a Character Error Rate (CER) of 1.08 percent and a Word Error Rate (WER) of 5.06 percent, while JSON extraction from OCR text achieved an average accuracy of 63 percent and, based on annotated OCR, 65 percent. This indicates that generative AI somewhat corrects low OCR performance. Our record linkage algorithm linked annotated JSON files with 94% accuracy and OCR-derived JSON files with 81%. This study contributes to digital humanities research by offering an automated pipeline for interpreting digitized historical documents, addressing challenges like layout variability and terminology differences, and exploring the applicability and strength of an advanced generative AI model.

86. 【2512.24969】Large language models and the entropy of English

链接https://arxiv.org/abs/2512.24969

作者:Colin Scheibner,Lindsay M. Smith,William Bialek

类目:atistical Mechanics (cond-mat.stat-mech); Computation and Language (cs.CL); Biological Physics (physics.bio-ph); Neurons and Cognition (q-bio.NC)

关键词:English texts, variety of sources, uncover long-ranged structure, structure in English, English

备注: 8 pages, 6 figures

点击查看摘要

Abstract:We use large language models (LLMs) to uncover long-ranged structure in English texts from a variety of sources. The conditional entropy or code length in many cases continues to decrease with context length at least to $N\sim 10^4$ characters, implying that there are direct dependencies or interactions across these distances. A corollary is that there are small but significant correlations between characters at these separations, as we show from the data independent of models. The distribution of code lengths reveals an emergent certainty about an increasing fraction of characters at large $N$. Over the course of model training, we observe different dynamics at long and short context lengths, suggesting that long-ranged structure is learned only gradually. Our results constrain efforts to build statistical physics models of LLMs or language itself.

87. 【2512.24687】Quantum Visual Word Sense Disambiguation: Unraveling Ambiguities Through Quantum Inference Model

链接https://arxiv.org/abs/2512.24687

作者:Wenbo Qiao,Peng Zhang,Qinghua Hu

类目:Quantum Physics (quant-ph); Computation and Language (cs.CL)

关键词:Visual word sense, word sense disambiguation, sense disambiguation focuses, Unsupervised Visual Word, Visual word

备注

点击查看摘要

Abstract:Visual word sense disambiguation focuses on polysemous words, where candidate images can be easily confused. Traditional methods use classical probability to calculate the likelihood of an image matching each gloss of the target word, summing these to form a posterior probability. However, due to the challenge of semantic uncertainty, glosses from different sources inevitably carry semantic biases, which can lead to biased disambiguation results. Inspired by quantum superposition in modeling uncertainty, this paper proposes a Quantum Inference Model for Unsupervised Visual Word Sense Disambiguation (Q-VWSD). It encodes multiple glosses of the target word into a superposition state to mitigate semantic biases. Then, the quantum circuit is executed, and the results are observed. By formalizing our method, we find that Q-VWSD is a quantum generalization of the method based on classical probability. Building on this, we further designed a heuristic version of Q-VWSD that can run more efficiently on classical computing. The experiments demonstrate that our method outperforms state-of-the-art classical methods, particularly by effectively leveraging non-specialized glosses from large language models, which further enhances performance. Our approach showcases the potential of quantum machine learning in practical applications and provides a case for leveraging quantum modeling advantages on classical computers while quantum hardware remains immature.

信息检索

1. 【2512.25052】AdaGReS:Adaptive Greedy Context Selection via Redundancy-Aware Scoring for Token-Budgeted RAG

链接https://arxiv.org/abs/2512.25052

作者:Chao Peng,Bin Wang,Zhilei Long,Jinfang Sheng

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:degrade downstream generation, standard top-k retrieval, Retrieval-augmented generation, waste token budget, downstream generation

备注: Preprint. Under review

点击查看摘要

Abstract:Retrieval-augmented generation (RAG) is highly sensitive to the quality of selected context, yet standard top-k retrieval often returns redundant or near-duplicate chunks that waste token budget and degrade downstream generation. We present AdaGReS, a redundancy-aware context selection framework for token-budgeted RAG that optimizes a set-level objective combining query-chunk relevance and intra-set redundancy penalties. AdaGReS performs greedy selection under a token-budget constraint using marginal gains derived from the objective, and introduces a closed-form, instance-adaptive calibration of the relevance-redundancy trade-off parameter to eliminate manual tuning and adapt to candidate-pool statistics and budget limits. We further provide a theoretical analysis showing that the proposed objective exhibits epsilon-approximate submodularity under practical embedding similarity conditions, yielding near-optimality guarantees for greedy selection. Experiments on open-domain question answering (Natural Questions) and a high-redundancy biomedical (drug) corpus demonstrate consistent improvements in redundancy control and context quality, translating to better end-to-end answer quality and robustness across settings.

2. 【2512.24943】RAIR: A Rule-Aware Benchmark Uniting Challenging Long-Tail and Visual Salience Subset for E-commerce Relevance Assessment

链接https://arxiv.org/abs/2512.24943

作者:Chenji Lu,Zhuo Chen,Hui Zhao,Zhenyi Wang,Pengjie Wang,Jian Xu,Bo Zheng

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词:Search relevance plays, Search relevance, web e-commerce, plays a central, central role

备注

点击查看摘要

Abstract:Search relevance plays a central role in web e-commerce. While large language models (LLMs) have shown significant results on relevance task, existing benchmarks lack sufficient complexity for comprehensive model assessment, resulting in an absence of standardized relevance evaluation metrics across the industry. To address this limitation, we propose Rule-Aware benchmark with Image for Relevance assessment(RAIR), a Chinese dataset derived from real-world scenarios. RAIR established a standardized framework for relevance assessment and provides a set of universal rules, which forms the foundation for standardized evaluation. Additionally, RAIR analyzes essential capabilities required for current relevance models and introduces a comprehensive dataset consists of three subset: (1) a general subset with industry-balanced sampling to evaluate fundamental model competencies; (2) a long-tail hard subset focus on challenging cases to assess performance limits; (3) a visual salience subset for evaluating multimodal understanding capabilities. We conducted experiments on RAIR using 14 open and closed-source models. The results demonstrate that RAIR presents sufficient challenges even for GPT-5, which achieved the best performance. RAIR data are now available, serving as an industry benchmark for relevance assessment while providing new insights into general LLM and Visual Language Model(VLM) evaluation.

3. 【2512.24787】HiGR: Efficient Generative Slate Recommendation via Hierarchical Planning and Multi-Objective Preference Alignment

链接https://arxiv.org/abs/2512.24787

作者:Yunsheng Pang,Zijian Liu,Yudong Li,Shaojie Zhu,Zijian Luo,Chenyun Yu,Sikai Wu,Shichen Shen,Cong Xu,Bin Wang,Kai Jiang,Hongyong Yu,Chengxiang Zhuo,Zang Li

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

关键词:ranked list, widely adopted, Slate recommendation, Slate, items simultaneously

备注

点击查看摘要

Abstract:Slate recommendation, where users are presented with a ranked list of items simultaneously, is widely adopted in online platforms. Recent advances in generative models have shown promise in slate recommendation by modeling sequences of discrete semantic IDs autoregressively. However, existing autoregressive approaches suffer from semantically entangled item tokenization and inefficient sequential decoding that lacks holistic slate planning. To address these limitations, we propose HiGR, an efficient generative slate recommendation framework that integrates hierarchical planning with listwise preference alignment. First, we propose an auto-encoder utilizing residual quantization and contrastive constraints to tokenize items into semantically structured IDs for controllable generation. Second, HiGR decouples generation into a list-level planning stage for global slate intent, followed by an item-level decoding stage for specific item selection. Third, we introduce a listwise preference alignment objective to directly optimize slate quality using implicit user feedback. Experiments on our large-scale commercial media platform demonstrate that HiGR delivers consistent improvements in both offline evaluations and online deployment. Specifically, it outperforms state-of-the-art methods by over 10% in offline recommendation quality with a 5x inference speedup, while further achieving a 1.22% and 1.73% increase in Average Watch Time and Average Video Views in online A/B tests.

4. 【2512.24762】OpenOneRec Technical Report

链接https://arxiv.org/abs/2512.24762

作者:Guorui Zhou,Honghui Bao,Jiaming Huang,Jiaxin Deng,Jinghao Zhang,Junda She,Kuo Cai,Lejian Ren,Lu Ren,Qiang Luo,Qianqian Wang,Qigen Hu,Rongzhou Zhang,Ruiming Tang,Shiyao Wang,Wuchao Li,Xiangyu Wu,Xinchen Luo,Xingmei Wang,Yifei Hu,Yunfan Wu,Zhanyu Liu,Zhiyang Zhang,Zixing Zhang,Bo Chen,Bin Wen,Chaoyi Ma,Chengru Song,Chenglong Chu,Defu Lian,Fan Yang,Feng Jiang,Hongtao Cheng,Huanjie Wang,Kun Gai,Pengfei Zheng,Qiang Wang,Rui Huang,Siyang Mao,Tingting Gao,Wei Yuan,Yan Wang,Yang Zhou,Yi Su,Zexuan Cheng,Zhixin Ling,Ziming Li

类目:Information Retrieval (cs.IR)

关键词:significant gap remains, series has successfully, successfully unified, unified the fragmented, gap remains

备注

点击查看摘要

Abstract:While the OneRec series has successfully unified the fragmented recommendation pipeline into an end-to-end generative framework, a significant gap remains between recommendation systems and general intelligence. Constrained by isolated data, they operate as domain specialists-proficient in pattern matching but lacking world knowledge, reasoning capabilities, and instruction following. This limitation is further compounded by the lack of a holistic benchmark to evaluate such integrated capabilities. To address this, our contributions are: 1) RecIF Bench Open Data: We propose RecIF-Bench, a holistic benchmark covering 8 diverse tasks that thoroughly evaluate capabilities from fundamental prediction to complex reasoning. Concurrently, we release a massive training dataset comprising 96 million interactions from 160,000 users to facilitate reproducible research. 2) Framework Scaling: To ensure full reproducibility, we open-source our comprehensive training pipeline, encompassing data processing, co-pretraining, and post-training. Leveraging this framework, we demonstrate that recommendation capabilities can scale predictably while mitigating catastrophic forgetting of general knowledge. 3) OneRec-Foundation: We release OneRec Foundation (1.7B and 8B), a family of models establishing new state-of-the-art (SOTA) results across all tasks in RecIF-Bench. Furthermore, when transferred to the Amazon benchmark, our models surpass the strongest baselines with an average 26.8% improvement in Recall@10 across 10 diverse datasets (Figure 1). This work marks a step towards building truly intelligent recommender systems. Nonetheless, realizing this vision presents significant technical and theoretical challenges, highlighting the need for broader research engagement in this promising direction.

5. 【2512.24715】MDiffFR: Modality-Guided Diffusion Generation for Cold-start Items in Federated Recommendation

链接https://arxiv.org/abs/2512.24715

作者:Kang Fu,Honglei Zhang,Xuechao Zou,Yidong Li

类目:Information Retrieval (cs.IR)

关键词:provide personalized services, attracted significant attention, Federated recommendations, preserving user privacy, keeping user data

备注

点击查看摘要

Abstract:Federated recommendations (FRs) provide personalized services while preserving user privacy by keeping user data on local clients, which has attracted significant attention in recent years. However, due to the strict privacy constraints inherent in FRs, access to user-item interaction data and user profiles across clients is highly restricted, making it difficult to learn globally effective representations for new (cold-start) items. Consequently, the item cold-start problem becomes even more challenging in FRs. Existing solutions typically predict embeddings for new items through the attribute-to-embedding mapping paradigm, which establishes a fixed one-to-one correspondence between item attributes and their embeddings. However, this one-to-one mapping paradigm often fails to model varying data distributions and tends to cause embedding misalignment, as verified by our empirical studies. To this end, we propose MDiffFR, a novel generation-based modality-guided diffusion method for cold-start items in FRs. In this framework, we employ a tailored diffusion model on the server to generate embeddings for new items, which are then distributed to clients for cold-start inference. To align item semantics, we deploy a pre-trained modality encoder to extract modality features as conditional signals to guide the reverse denoising process. Furthermore, our theoretical analysis verifies that the proposed method achieves stronger privacy guarantees compared to existing mapping-based approaches. Extensive experiments on four real datasets demonstrate that our method consistently outperforms all baselines in FRs.

6. 【2512.24711】MEIC-DT: Memory-Efficient Incremental Clustering for Long-Text Coreference Resolution with Dual-Threshold Constraints

链接https://arxiv.org/abs/2512.24711

作者:Kangyang Luo,Shuzheng Si,Yuzhuo Bai,Cheng Gao,Zhitong Wang,Cheng Huang,Yingli Shen,Yufeng Han,Wenhao Li,Cunliang Kong,Maosong Sun

类目:Information Retrieval (cs.IR)

关键词:large language models, supervised neural methods, neural methods remain, Coreference Resolution, language models

备注

点击查看摘要

Abstract:In the era of large language models (LLMs), supervised neural methods remain the state-of-the-art (SOTA) for Coreference Resolution. Yet, their full potential is underexplored, particularly in incremental clustering, which faces the critical challenge of balancing efficiency with performance for long texts. To address the limitation, we propose \textbf{MEIC-DT}, a novel dual-threshold, memory-efficient incremental clustering approach based on a lightweight Transformer. MEIC-DT features a dual-threshold constraint mechanism designed to precisely control the Transformer's input scale within a predefined memory budget. This mechanism incorporates a Statistics-Aware Eviction Strategy (\textbf{SAES}), which utilizes distinct statistical profiles from the training and inference phases for intelligent cache management. Furthermore, we introduce an Internal Regularization Policy (\textbf{IRP}) that strategically condenses clusters by selecting the most representative mentions, thereby preserving semantic integrity. Extensive experiments on common benchmarks demonstrate that MEIC-DT achieves highly competitive coreference performance under stringent memory constraints.

7. 【2512.24366】On the Factual Consistency of Text-based Explainable Recommendation Models

链接https://arxiv.org/abs/2512.24366

作者:Ben Kabongo,Vincent Guigue

类目:Information Retrieval (cs.IR)

关键词:improve user trust, justify item recommendations, generate natural-language explanations, aims to generate, generate natural-language

备注: 13 pages, 2 figures, 4 tables

点击查看摘要

Abstract:Text-based explainable recommendation aims to generate natural-language explanations that justify item recommendations, to improve user trust and system transparency. Although recent advances leverage LLMs to produce fluent outputs, a critical question remains underexplored: are these explanations factually consistent with the available evidence? We introduce a comprehensive framework for evaluating the factual consistency of text-based explainable recommenders. We design a prompting-based pipeline that uses LLMs to extract atomic explanatory statements from reviews, thereby constructing a ground truth that isolates and focuses on their factual content. Applying this pipeline to five categories from the Amazon Reviews dataset, we create augmented benchmarks for fine-grained evaluation of explanation quality. We further propose statement-level alignment metrics that combine LLM- and NLI-based approaches to assess both factual consistency and relevance of generated explanations. Across extensive experiments on six state-of-the-art explainable recommendation models, we uncover a critical gap: while models achieve high semantic similarity scores (BERTScore F1: 0.81-0.90), all our factuality metrics reveal alarmingly low performance (LLM-based statement-level precision: 4.38%-32.88%). These findings underscore the need for factuality-aware evaluation in explainable recommendation and provide a foundation for developing more trustworthy explanation systems.

8. 【2512.24325】MaRCA: Multi-Agent Reinforcement Learning for Dynamic Computation Allocation in Large-Scale Recommender Systems

链接https://arxiv.org/abs/2512.24325

作者:Wan Jiang,Xinyi Zang,Yudong Zhao,Yusi Zou,Yunfei Lu,Junbo Tong,Yang Liu,Ming Li,Jiani Shi,Xin Yang

类目:Information Retrieval (cs.IR); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

关键词:face significant computational, significant computational challenges, computational challenges due, Modern recommender systems, making efficient computation

备注: 12 pages, 5 figures

点击查看摘要

Abstract:Modern recommender systems face significant computational challenges due to growing model complexity and traffic scale, making efficient computation allocation critical for maximizing business revenue. Existing approaches typically simplify multi-stage computation resource allocation, neglecting inter-stage dependencies, thus limiting global optimality. In this paper, we propose MaRCA, a multi-agent reinforcement learning framework for end-to-end computation resource allocation in large-scale recommender systems. MaRCA models the stages of a recommender system as cooperative agents, using Centralized Training with Decentralized Execution (CTDE) to optimize revenue under computation resource constraints. We introduce an AutoBucket TestBench for accurate computation cost estimation, and a Model Predictive Control (MPC)-based Revenue-Cost Balancer to proactively forecast traffic loads and adjust the revenue-cost trade-off accordingly. Since its end-to-end deployment in the advertising pipeline of a leading global e-commerce platform in November 2024, MaRCA has consistently handled hundreds of billions of ad requests per day and has delivered a 16.67% revenue uplift using existing computation resources.

9. 【2512.24268】RAGPart RAGMask: Retrieval-Stage Defenses Against Corpus Poisoning in Retrieval-Augmented Generation

链接https://arxiv.org/abs/2512.24268

作者:Pankayaraj Pathmanathan,Michael-Andrei Panaitescu-Liess,Cho-Yu Jason Chiang,Furong Huang

类目:Information Retrieval (cs.IR)

关键词:enhance large language, large language models, external knowledge, reducing hallucinations, outdated information

备注: Published at AAAI 2026 Workshop on New Frontiers in Information Retrieval [Oral]

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm to enhance large language models (LLMs) with external knowledge, reducing hallucinations and compensating for outdated information. However, recent studies have exposed a critical vulnerability in RAG pipelines corpus poisoning where adversaries inject malicious documents into the retrieval corpus to manipulate model outputs. In this work, we propose two complementary retrieval-stage defenses: RAGPart and RAGMask. Our defenses operate directly on the retriever, making them computationally lightweight and requiring no modification to the generation model. RAGPart leverages the inherent training dynamics of dense retrievers, exploiting document partitioning to mitigate the effect of poisoned points. In contrast, RAGMask identifies suspicious tokens based on significant similarity shifts under targeted token masking. Across two benchmarks, four poisoning strategies, and four state-of-the-art retrievers, our defenses consistently reduce attack success rates while preserving utility under benign conditions. We further introduce an interpretable attack to stress-test our defenses. Our findings highlight the potential and limitations of retrieval-stage defenses, providing practical insights for robust RAG deployments.

10. 【2512.24246】me-Aware Adaptive Side Information Fusion for Sequential Recommendation

链接https://arxiv.org/abs/2512.24246

作者:Jie Luo,Wenyu Zhang,Xinming Zhang,Yuan Fang

类目:Information Retrieval (cs.IR)

关键词:Incorporating item-side information, Incorporating item-side, category and brand, improving performance, Adaptive Side Information

备注: 10 pages. Accepted by WSDM'26

点击查看摘要

Abstract:Incorporating item-side information, such as category and brand, into sequential recommendation is a well-established and effective approach for improving performance. However, despite significant advancements, current models are generally limited by three key challenges: they often overlook the fine-grained temporal dynamics inherent in timestamps, exhibit vulnerability to noise in user interaction sequences, and rely on computationally expensive fusion architectures. To systematically address these challenges, we propose the Time-Aware Adaptive Side Information Fusion (TASIF) framework. TASIF integrates three synergistic components: (1) a simple, plug-and-play time span partitioning mechanism to capture global temporal patterns; (2) an adaptive frequency filter that leverages a learnable gate to denoise feature sequences adaptively, thereby providing higher-quality inputs for subsequent fusion modules; and (3) an efficient adaptive side information fusion layer, this layer employs a "guide-not-mix" architecture, where attributes guide the attention mechanism without being mixed into the content-representing item embeddings, ensuring deep interaction while ensuring computational efficiency. Extensive experiments on four public datasets demonstrate that TASIF significantly outperforms state-of-the-art baselines while maintaining excellent efficiency in training. Our source code is available at this https URL.

11. 【2512.24113】CogRec: A Cognitive Recommender Agent Fusing Large Language Models and Soar for Explainable Recommendation

链接https://arxiv.org/abs/2512.24113

作者:Jiaxin Hu,Tao Wang,Bingsan Yang,Hongrun Wang

类目:Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

关键词:Large Language Models, Large Language, Language Models, understanding user preferences, demonstrated a remarkable

备注: 9 pages, 6 figures

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated a remarkable capacity in understanding user preferences for recommendation systems. However, they are constrained by several critical challenges, including their inherent "Black-Box" characteristics, susceptibility to knowledge hallucination, and limited online learning capacity. These factors compromise their trustworthiness and adaptability. Conversely, cognitive architectures such as Soar offer structured and interpretable reasoning processes, yet their knowledge acquisition is notoriously laborious. To address these complementary challenges, we propose a novel cognitive recommender agent called CogRec which synergizes the strengths of LLMs with the Soar cognitive architecture. CogRec leverages Soar as its core symbolic reasoning engine and leverages an LLM for knowledge initialization to populate its working memory with production rules. The agent operates on a Perception-Cognition-Action(PCA) cycle. Upon encountering an impasse, it dynamically queries the LLM to obtain a reasoned solution. This solution is subsequently transformed into a new symbolic production rule via Soar's chunking mechanism, thereby enabling robust online learning. This learning paradigm allows the agent to continuously evolve its knowledge base and furnish highly interpretable rationales for its recommendations. Extensive evaluations conducted on three public datasets demonstrate that CogRec demonstrates significant advantages in recommendation accuracy, explainability, and its efficacy in addressing the long-tail problem.

12. 【2512.24078】High-dimensional Regret Minimization

链接https://arxiv.org/abs/2512.24078

作者:Junyu Liao,Ashwin Lall,Mitsunori Ogihara,Raymond Wong

类目:Databases (cs.DB); Computational Geometry (cs.CG); Information Retrieval (cs.IR)

关键词:Multi-criteria decision making, Multi-criteria decision, real world applications, decision making, making in large

备注

点击查看摘要

Abstract:Multi-criteria decision making in large databases is very important in real world applications. Recently, an interactive query has been studied extensively in the database literature with the advantage of both the top-k query (with limited output size) and the skyline query (which does not require users to explicitly specify their preference function). This approach iteratively asks the user to select the one preferred within a set of options. Based on rounds of feedback, the query learns the implicit preference and returns the most favorable as a recommendation. However, many modern applications in areas like housing or financial product markets feature datasets with hundreds of attributes. Existing interactive algorithms either fail to scale or require excessive user interactions (often exceeding 1000 rounds). Motivated by this, we propose FHDR (Fast High-Dimensional Reduction), a novel framework that takes less than 0.01s with fewer than 30 rounds of interaction. It is considered a breakthrough in the field of interactive queries since most, if not all, existing studies are not scalable to high-dimensional datasets. Extensive experiments demonstrate that FHDR outperforms the best-known algorithms by at least an order of magnitude in execution time and up to several orders of magnitude in terms of the number of interactions required, establishing a new state of the art for scalable interactive regret minimization.

Subjects:

Databases (cs.DB); Computational Geometry (cs.CG); Information Retrieval (cs.IR)

Cite as:
arXiv:2512.24078 [cs.DB]

(or
arXiv:2512.24078v1 [cs.DB] for this version)

https://doi.org/10.48550/arXiv.2512.24078

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
13. 【2512.23961】An Comparative Analysis about KYC on a Recommendation System Toward Agentic Recommendation System

链接https://arxiv.org/abs/2512.23961

作者:Junjie H. Xu

类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

关键词:distinct content verticals, Discounted Cumulative Gain, Normalized Discounted Cumulative, content verticals, User-Generated Content

备注: 5 pages, 1 figure

点击查看摘要

Abstract:This research presents a cutting-edge recommendation system utilizing agentic AI for KYC (Know Your Customer in the financial domain), and its evaluation across five distinct content verticals: Advertising (Ad), News, Gossip, Sharing (User-Generated Content), and Technology (Tech). The study compares the performance of four experimental groups, grouping by the intense usage of KYC, benchmarking them against the Normalized Discounted Cumulative Gain (nDCG) metric at truncation levels of $k=1$, $k=3$, and $k=5$. By synthesizing experimental data with theoretical frameworks and industry benchmarks from platforms such as Baidu and Xiaohongshu, this research provides insight by showing experimental results for engineering a large-scale agentic recommendation system.

14. 【2512.23907】Deletion Considered Harmful

链接https://arxiv.org/abs/2512.23907

作者:Paul Englefield,Russell Beale

类目:Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

关键词:effectively manage information, information overload, manage information, effectively manage, information is crucial

备注

点击查看摘要

Abstract:In a world of information overload, understanding how we can most effectively manage information is crucial to success. We set out to understand how people view deletion, the removal of material no longer needed: does it help by reducing clutter and improving the signal to noise ratio, or does the effort required to decide to delete something make it not worthwhile? How does deletion relate to other strategies like filing; do people who spend extensive time in filing also prune their materials too? We studied the behaviour of 51 knowledge workers though a series of questionnaires and interviews to evaluate a range of tactics they used aimed at organizing, filing, and retrieving digital resources. Our study reveals that deletion is consistently under-adopted compared to other tactics such as Filing, Coverage, Ontology, and Timeliness. Moreover, the empirical data indicate that deletion is actually detrimental to retrieval success and satisfaction. In this paper, we examine the practice of deletion, review the related literature, and present detailed statistical results and clustering outcomes that underscore its adverse effects.

计算机视觉

1. 【2512.25075】SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time

链接https://arxiv.org/abs/2512.25075

作者:Zhening Huang,Hyeonho Jeong,Xuelin Chen,Yulia Gryaditskaya,Tuanfeng Y. Wang,Joan Lasenby,Chun-Hao Huang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

关键词:controllable generative rendering, space and time, disentangles space, controllable generative, present SpaceTimePilot

备注: Project page: [this https URL](https://zheninghuang.github.io/Space-Time-Pilot/) Code: [this https URL](https://github.com/ZheningHuang/spacetimepilot)

点击查看摘要

Abstract:We present SpaceTimePilot, a video diffusion model that disentangles space and time for controllable generative rendering. Given a monocular video, SpaceTimePilot can independently alter the camera viewpoint and the motion sequence within the generative process, re-rendering the scene for continuous and arbitrary exploration across space and time. To achieve this, we introduce an effective animation time-embedding mechanism in the diffusion process, allowing explicit control of the output video's motion sequence with respect to that of the source video. As no datasets provide paired videos of the same dynamic scene with continuous temporal variations, we propose a simple yet effective temporal-warping training scheme that repurposes existing multi-view datasets to mimic temporal differences. This strategy effectively supervises the model to learn temporal control and achieve robust space-time disentanglement. To further enhance the precision of dual control, we introduce two additional components: an improved camera-conditioning mechanism that allows altering the camera from the first frame, and CamxTime, the first synthetic space-and-time full-coverage rendering dataset that provides fully free space-time video trajectories within a scene. Joint training on the temporal-warping scheme and the CamxTime dataset yields more precise temporal control. We evaluate SpaceTimePilot on both real-world and synthetic data, demonstrating clear space-time disentanglement and strong results compared to prior work. Project page: this https URL Code: this https URL

2. 【2512.25073】GaMO: Geometry-aware Multi-view Diffusion Outpainting for Sparse-View 3D Reconstruction

链接https://arxiv.org/abs/2512.25073

作者:Yi-Chuan Huang,Hao-Jen Chien,Chin-Yang Lin,Ying-Huan Chen,Yu-Lun Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent advances, dense multi-view imagery, achieved remarkable progress, high-quality scene capture, achieved remarkable

备注: Project page: [this https URL](https://yichuanh.github.io/GaMO/)

点击查看摘要

Abstract:Recent advances in 3D reconstruction have achieved remarkable progress in high-quality scene capture from dense multi-view imagery, yet struggle when input views are limited. Various approaches, including regularization techniques, semantic priors, and geometric constraints, have been implemented to address this challenge. Latest diffusion-based methods have demonstrated substantial improvements by generating novel views from new camera poses to augment training data, surpassing earlier regularization and prior-based techniques. Despite this progress, we identify three critical limitations in these state-of-the-art approaches: inadequate coverage beyond known view peripheries, geometric inconsistencies across generated views, and computationally expensive pipelines. We introduce GaMO (Geometry-aware Multi-view Outpainter), a framework that reformulates sparse-view reconstruction through multi-view outpainting. Instead of generating new viewpoints, GaMO expands the field of view from existing camera poses, which inherently preserves geometric consistency while providing broader scene coverage. Our approach employs multi-view conditioning and geometry-aware denoising strategies in a zero-shot manner without training. Extensive experiments on Replica and ScanNet++ demonstrate state-of-the-art reconstruction quality across 3, 6, and 9 input views, outperforming prior methods in PSNR and LPIPS, while achieving a $25\times$ speedup over SOTA diffusion-based methods with processing time under 10 minutes. Project page: this https URL

3. 【2512.25071】Edit3r: Instant 3D Scene Editing from Sparse Unposed Images

链接https://arxiv.org/abs/2512.25071

作者:Jiageng Liu,Weijie Lyu,Xueting Li,Yejie Guo,Ming-Hsuan Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:pass from unposed, feed-forward framework, framework that reconstructs, single pass, instruction-edited images

备注: Project page: [this https URL](https://edit3r.github.io/edit3r/)

点击查看摘要

Abstract:We present Edit3r, a feed-forward framework that reconstructs and edits 3D scenes in a single pass from unposed, view-inconsistent, instruction-edited images. Unlike prior methods requiring per-scene optimization, Edit3r directly predicts instruction-aligned 3D edits, enabling fast and photorealistic rendering without optimization or pose estimation. A key challenge in training such a model lies in the absence of multi-view consistent edited images for supervision. We address this with (i) a SAM2-based recoloring strategy that generates reliable, cross-view-consistent supervision, and (ii) an asymmetric input strategy that pairs a recolored reference view with raw auxiliary views, encouraging the network to fuse and align disparate observations. At inference, our model effectively handles images edited by 2D methods such as InstructPix2Pix, despite not being exposed to such edits during training. For large-scale quantitative evaluation, we introduce DL3DV-Edit-Bench, a benchmark built on the DL3DV test split, featuring 20 diverse scenes, 4 edit types and 100 edits in total. Comprehensive quantitative and qualitative results show that Edit3r achieves superior semantic alignment and enhanced 3D consistency compared to recent baselines, while operating at significantly higher inference speed, making it promising for real-time 3D editing applications.

4. 【2512.25067】FineTec: Fine-Grained Action Recognition Under Temporal Corruption via Skeleton Decomposition and Sequence Completion

链接https://arxiv.org/abs/2512.25067

作者:Dian Shao,Mingfei Shi,Like Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recognizing fine-grained actions, substantial missing data, yields substantial missing, Recognizing fine-grained, online pose estimation

备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Recognizing fine-grained actions from temporally corrupted skeleton sequences remains a significant challenge, particularly in real-world scenarios where online pose estimation often yields substantial missing data. Existing methods often struggle to accurately recover temporal dynamics and fine-grained spatial structures, resulting in the loss of subtle motion cues crucial for distinguishing similar actions. To address this, we propose FineTec, a unified framework for Fine-grained action recognition under Temporal Corruption. FineTec first restores a base skeleton sequence from corrupted input using context-aware completion with diverse temporal masking. Next, a skeleton-based spatial decomposition module partitions the skeleton into five semantic regions, further divides them into dynamic and static subgroups based on motion variance, and generates two augmented skeleton sequences via targeted perturbation. These, along with the base sequence, are then processed by a physics-driven estimation module, which utilizes Lagrangian dynamics to estimate joint accelerations. Finally, both the fused skeleton position sequence and the fused acceleration sequence are jointly fed into a GCN-based action recognition head. Extensive experiments on both coarse-grained (NTU-60, NTU-120) and fine-grained (Gym99, Gym288) benchmarks show that FineTec significantly outperforms previous methods under various levels of temporal corruption. Specifically, FineTec achieves top-1 accuracies of 89.1% and 78.1% on the challenging Gym99-severe and Gym288-severe settings, respectively, demonstrating its robustness and generalizability. Code and datasets could be found at this https URL.

5. 【2512.25066】From Inpainting to Editing: A Self-Bootstrapping Framework for Context-Rich Visual Dubbing

链接https://arxiv.org/abs/2512.25066

作者:Xu He,Haoxian Zhang,Hejia Chen,Changyuan Zheng,Liyang Chen,Songlin Tang,Jiehui Huang,Xiaoqiang Liu,Pengfei Wan,Zhiyong Wu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:lip movements differ, subject lip movements, movements differ, visual dubbing aims, lip movements

备注: Project Page [this https URL](https://hjrphoebus.github.io/X-Dub)

点击查看摘要

Abstract:Audio-driven visual dubbing aims to synchronize a video's lip movements with new speech, but is fundamentally challenged by the lack of ideal training data: paired videos where only a subject's lip movements differ while all other visual conditions are identical. Existing methods circumvent this with a mask-based inpainting paradigm, where an incomplete visual conditioning forces models to simultaneously hallucinate missing content and sync lips, leading to visual artifacts, identity drift, and poor synchronization. In this work, we propose a novel self-bootstrapping framework that reframes visual dubbing from an ill-posed inpainting task into a well-conditioned video-to-video editing problem. Our approach employs a Diffusion Transformer, first as a data generator, to synthesize ideal training data: a lip-altered companion video for each real sample, forming visually aligned video pairs. A DiT-based audio-driven editor is then trained on these pairs end-to-end, leveraging the complete and aligned input video frames to focus solely on precise, audio-driven lip modifications. This complete, frame-aligned input conditioning forms a rich visual context for the editor, providing it with complete identity cues, scene interactions, and continuous spatiotemporal dynamics. Leveraging this rich context fundamentally enables our method to achieve highly accurate lip sync, faithful identity preservation, and exceptional robustness against challenging in-the-wild scenarios. We further introduce a timestep-adaptive multi-phase learning strategy as a necessary component to disentangle conflicting editing objectives across diffusion timesteps, thereby facilitating stable training and yielding enhanced lip synchronization and visual fidelity. Additionally, we propose ContextDubBench, a comprehensive benchmark dataset for robust evaluation in diverse and challenging practical application scenarios.

6. 【2512.25034】Generative Classifiers Avoid Shortcut Solutions

链接https://arxiv.org/abs/2512.25034

作者:Alexander C. Li,Ananya Kumar,Deepak Pathak

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)

关键词:approaches to classification, classification often learn, learn shortcuts, shortcuts that hold, hold in-distribution

备注: ICLR 2025. Code: [this https URL](https://github.com/alexlioralexli/generative-classifiers)

点击查看摘要

Abstract:Discriminative approaches to classification often learn shortcuts that hold in-distribution but fail even under minor distribution shift. This failure mode stems from an overreliance on features that are spuriously correlated with the label. We show that generative classifiers, which use class-conditional generative models, can avoid this issue by modeling all features, both core and spurious, instead of mainly spurious ones. These generative classifiers are simple to train, avoiding the need for specialized augmentations, strong regularization, extra hyperparameters, or knowledge of the specific spurious correlations to avoid. We find that diffusion-based and autoregressive generative classifiers achieve state-of-the-art performance on five standard image and text distribution shift benchmarks and reduce the impact of spurious correlations in realistic applications, such as medical or satellite datasets. Finally, we carefully analyze a Gaussian toy setting to understand the inductive biases of generative classifiers, as well as the data properties that determine when generative classifiers outperform discriminative ones.

7. 【2512.25008】FoundationSLAM: Unleashing the Power of Depth Foundation Models for End-to-End Dense Visual SLAM

链接https://arxiv.org/abs/2512.25008

作者:Yuchen Wu,Jiahe Li,Fabio Tosi,Matteo Poggi,Jin Zheng,Xiao Bai

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:previous flow-based approaches, monocular dense SLAM, dense SLAM system, SLAM system, learning-based monocular dense

备注

点击查看摘要

Abstract:We present FoundationSLAM, a learning-based monocular dense SLAM system that addresses the absence of geometric consistency in previous flow-based approaches for accurate and robust tracking and mapping. Our core idea is to bridge flow estimation with geometric reasoning by leveraging the guidance from foundation depth models. To this end, we first develop a Hybrid Flow Network that produces geometry-aware correspondences, enabling consistent depth and pose inference across diverse keyframes. To enforce global consistency, we propose a Bi-Consistent Bundle Adjustment Layer that jointly optimizes keyframe pose and depth under multi-view constraints. Furthermore, we introduce a Reliability-Aware Refinement mechanism that dynamically adapts the flow update process by distinguishing between reliable and uncertain regions, forming a closed feedback loop between matching and optimization. Extensive experiments demonstrate that FoundationSLAM achieves superior trajectory accuracy and dense reconstruction quality across multiple challenging datasets, while running in real-time at 18 FPS, demonstrating strong generalization to various scenarios and practical applicability of our method.

8. 【2512.25000】Bi-C2R: Bidirectional Continual Compatible Representation for Re-indexing Free Lifelong Person Re-identification

链接https://arxiv.org/abs/2512.25000

作者:Zhenyu Cui,Jiahuan Zhou,Yuxin Peng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:exploits sequentially collected, Lifelong person Re-IDentification, sequentially collected data, Lifelong person, exploits sequentially

备注

点击查看摘要

Abstract:Lifelong person Re-IDentification (L-ReID) exploits sequentially collected data to continuously train and update a ReID model, focusing on the overall performance of all data. Its main challenge is to avoid the catastrophic forgetting problem of old knowledge while training on new data. Existing L-ReID methods typically re-extract new features for all historical gallery images for inference after each update, known as "re-indexing". However, historical gallery data typically suffers from direct saving due to the data privacy issue and the high re-indexing costs for large-scale gallery images. As a result, it inevitably leads to incompatible retrieval between query features extracted by the updated model and gallery features extracted by those before the update, greatly impairing the re-identification performance. To tackle the above issue, this paper focuses on a new task called Re-index Free Lifelong person Re-IDentification (RFL-ReID), which requires performing lifelong person re-identification without re-indexing historical gallery images. Therefore, RFL-ReID is more challenging than L-ReID, requiring continuous learning and balancing new and old knowledge in diverse streaming data, and making the features output by the new and old models compatible with each other. To this end, we propose a Bidirectional Continuous Compatible Representation (Bi-C2R) framework to continuously update the gallery features extracted by the old model to perform efficient L-ReID in a compatible manner. We verify our proposed Bi-C2R method through theoretical analysis and extensive experiments on multiple benchmarks, which demonstrate that the proposed method can achieve leading performance on both the introduced RFL-ReID task and the traditional L-ReID task.

9. 【2512.24986】PhysTalk: Language-driven Real-time Physics in 3D Gaussian Scenes

链接https://arxiv.org/abs/2512.24986

作者:Luca Collorone,Mert Kiray,Indro Spinelli,Fabio Galasso,Benjamin Busam

类目:Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

关键词:Realistic visual simulations, creation requires computing, Realistic visual, expert animation knowledge, requires computing time

备注

点击查看摘要

Abstract:Realistic visual simulations are omnipresent, yet their creation requires computing time, rendering, and expert animation knowledge. Open-vocabulary visual effects generation from text inputs emerges as a promising solution that can unlock immense creative potential. However, current pipelines lack both physical realism and effective language interfaces, requiring slow offline optimization. In contrast, PhysTalk takes a 3D Gaussian Splatting (3DGS) scene as input and translates arbitrary user prompts into real time, physics based, interactive 4D animations. A large language model (LLM) generates executable code that directly modifies 3DGS parameters through lightweight proxies and particle dynamics. Notably, PhysTalk is the first framework to couple 3DGS directly with a physics simulator without relying on time consuming mesh extraction. While remaining open vocabulary, this design enables interactive 3D Gaussian animation via collision aware, physics based manipulation of arbitrary, multi material objects. Finally, PhysTalk is train-free and computationally lightweight: this makes 4D animation broadly accessible and shifts these workflows from a "render and wait" paradigm toward an interactive dialogue with a modern, physics-informed pipeline.

10. 【2512.24985】DarkEQA: Benchmarking Vision-Language Models for Embodied Question Answering in Low-Light Indoor Environments

链接https://arxiv.org/abs/2512.24985

作者:Yohan Park,Hyunwoo Ha,Wonjun Jo,Tae-Hyun Oh

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

关键词:Vision Language Models, Vision Language, central reasoning modules, Language Models, embodied agents

备注: Submitted to IEEE Robotics and Automation Letters (RA-L)

点击查看摘要

Abstract:Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments--a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkEQA, an open-source benchmark for evaluating EQA-relevant perceptual primitives under multi-level low-light conditions. DarkEQA isolates the perception bottleneck by evaluating question answering from egocentric observations under controlled degradations, enabling attributable robustness analysis. A key design feature of DarkEQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline. We demonstrate the utility of DarkEQA by evaluating a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models. Our analysis systematically reveals VLMs' limitations when operating under these challenging visual conditions. Our code and benchmark dataset will be released upon acceptance.

11. 【2512.24971】Evaluating the Impact of Compression Techniques on the Robustness of CNNs under Natural Corruptions

链接https://arxiv.org/abs/2512.24971

作者:Itallo Patrick Castro Alves Da Silva,Emanuel Adler Medeiros Pereira,Erick de Andrade Barboza,Baldoino Fonseca dos Santos Neto,Marcio de Medeiros Ribeiro

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Compressed deep learning, Compressed deep, deep learning models, computer vision systems, deploying computer vision

备注: Accepted for publication at the 2025 International Conference on Machine Learning and Applications (ICMLA). IEEE Catalog Number: CFP25592-ART

点击查看摘要

Abstract:Compressed deep learning models are crucial for deploying computer vision systems on resource-constrained devices. However, model compression may affect robustness, especially under natural corruption. Therefore, it is important to consider robustness evaluation while validating computer vision systems. This paper presents a comprehensive evaluation of compression techniques - quantization, pruning, and weight clustering applied individually and in combination to convolutional neural networks (ResNet-50, VGG-19, and MobileNetV2). Using the CIFAR-10-C and CIFAR 100-C datasets, we analyze the trade-offs between robustness, accuracy, and compression ratio. Our results show that certain compression strategies not only preserve but can also improve robustness, particularly on networks with more complex architectures. Utilizing multiobjective assessment, we determine the best configurations, showing that customized technique combinations produce beneficial multi-objective results. This study provides insights into selecting compression methods for robust and efficient deployment of models in corrupted real-world environments.

12. 【2512.24965】ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands

链接https://arxiv.org/abs/2512.24965

作者:Siyuan Hu,Kevin Qinghong Lin,Mike Zheng Shou

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)

关键词:Building intelligent agents, Building intelligent, intelligent agents capable, achieving human-like automation, Adobe Premiere Pro

备注: 17 pages, 15 figures

点击查看摘要

Abstract:Building intelligent agents capable of dexterous manipulation is essential for achieving human-like automation in both robotics and digital environments. However, existing GUI agents rely on discrete click predictions (x,y), which prohibits free-form, closed-loop trajectories (e.g. dragging a progress bar) that require continuous, on-the-fly perception and adjustment. In this work, we develop ShowUI-$\pi$, the first flow-based generative model as GUI dexterous hand, featuring the following designs: (i) Unified Discrete-Continuous Actions, integrating discrete clicks and continuous drags within a shared model, enabling flexible adaptation across diverse interaction modes; (ii) Flow-based Action Generation for drag modeling, which predicts incremental cursor adjustments from continuous visual observations via a lightweight action expert, ensuring smooth and stable trajectories; (iii) Drag Training data and Benchmark, where we manually collect and synthesize 20K drag trajectories across five domains (e.g. PowerPoint, Adobe Premiere Pro), and introduce ScreenDrag, a benchmark with comprehensive online and offline evaluation protocols for assessing GUI agents' drag capabilities. Our experiments show that proprietary GUI agents still struggle on ScreenDrag (e.g. Operator scores 13.27, and the best Gemini-2.5-CUA reaches 22.18). In contrast, ShowUI-$\pi$ achieves 26.98 with only 450M parameters, underscoring both the difficulty of the task and the effectiveness of our approach. We hope this work advances GUI agents toward human-like dexterous control in digital world. The code is available at this https URL.

13. 【2512.24952】VIPER: Process-aware Evaluation for Generative Video Reasoning

链接https://arxiv.org/abs/2512.24952

作者:Yifan Li,Yukai Gu,Yingqian Min,Zikang Liu,Yifan Du,Kun Zhou,Min Yang,Wayne Xin Zhao,Minghui Qiu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:emerging capability termed, Recent breakthroughs, models resolve complex, resolve complex tasks, Generative Video Reasoning

备注: Work in progress

点击查看摘要

Abstract:Recent breakthroughs in video generation have demonstrated an emerging capability termed Chain-of-Frames (CoF) reasoning, where models resolve complex tasks through the generation of continuous frames. While these models show promise for Generative Video Reasoning (GVR), existing evaluation frameworks often rely on single-frame assessments, which can lead to outcome-hacking, where a model reaches a correct conclusion through an erroneous process. To address this, we propose a process-aware evaluation paradigm. We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning. Furthermore, we propose Process-outcome Consistency (POC@r), a new metric that utilizes VLM-as-Judge with a hierarchical rubric to evaluate both the validity of the intermediate steps and the final result. Our experiments reveal that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit a significant outcome-hacking. We further explore the impact of test-time scaling and sampling robustness, highlighting a substantial gap between current video generation and true generalized visual reasoning. Our benchmark will be publicly released.

14. 【2512.24948】ProDM: Synthetic Reality-driven Property-aware Progressive Diffusion Model for Coronary Calcium Motion Correction in Non-gated Chest CT

链接https://arxiv.org/abs/2512.24948

作者:Xinran Gong,Gorkem Durak,Halil Ertugrul Aktas,Vedat Cicek,Jinkui Hao,Ulas Bagci,Nilay S. Shah,Bo Zhou

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:Coronary artery calcium, Coronary artery, disease risk estimation, cardiovascular disease risk, refine clinical cardiovascular

备注: 21 pages, 8 figures

点击查看摘要

Abstract:Coronary artery calcium (CAC) scoring from chest CT is a well-established tool to stratify and refine clinical cardiovascular disease risk estimation. CAC quantification relies on the accurate delineation of calcified lesions, but is oftentimes affected by artifacts introduced by cardiac and respiratory motion. ECG-gated cardiac CTs substantially reduce motion artifacts, but their use in population screening and routine imaging remains limited due to gating requirements and lack of insurance coverage. Although identification of incidental CAC from non-gated chest CT is increasingly considered for it offers an accessible and widely available alternative, this modality is limited by more severe motion artifacts. We present ProDM (Property-aware Progressive Correction Diffusion Model), a generative diffusion framework that restores motion-free calcified lesions from non-gated CTs. ProDM introduces three key components: (1) a CAC motion simulation data engine that synthesizes realistic non-gated acquisitions with diverse motion trajectories directly from cardiac-gated CTs, enabling supervised training without paired data; (2) a property-aware learning strategy incorporating calcium-specific priors through a differentiable calcium consistency loss to preserve lesion integrity; and (3) a progressive correction scheme that reduces artifacts gradually across diffusion steps to enhance stability and calcium fidelity. Experiments on real patient datasets show that ProDM significantly improves CAC scoring accuracy, spatial lesion fidelity, and risk stratification performance compared with several baselines. A reader study on real non-gated scans further confirms that ProDM suppresses motion artifacts and improves clinical usability. These findings highlight the potential of progressive, property-aware frameworks for reliable CAC quantification from routine chest CT imaging.

15. 【2512.24947】CPJ: Explainable Agricultural Pest Diagnosis via Caption-Prompt-Judge with LLM-Judged Refinement

链接https://arxiv.org/abs/2512.24947

作者:Wentao Zhang,Tao Fang,Lina Lu,Lifei Wang,Weihe Zhong

类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

关键词:costly supervised fine-tuning, domain shifts, interpretable crop disease, existing methods, methods often rely

备注: This paper is 6 pages in length and contains 2 figures. Tao Fang (Corresponding Author), Lina Lu (Co-corresponding Author)

点击查看摘要

Abstract:Accurate and interpretable crop disease diagnosis is essential for agricultural decision-making, yet existing methods often rely on costly supervised fine-tuning and perform poorly under domain shifts. We propose Caption--Prompt--Judge (CPJ), a training-free few-shot framework that enhances Agri-Pest VQA through structured, interpretable image captions. CPJ employs large vision-language models to generate multi-angle captions, refined iteratively via an LLM-as-Judge module, which then inform a dual-answer VQA process for both recognition and management responses. Evaluated on CDDMBench, CPJ significantly improves performance: using GPT-5-mini captions, GPT-5-Nano achieves \textbf{+22.7} pp in disease classification and \textbf{+19.5} points in QA score over no-caption baselines. The framework provides transparent, evidence-based reasoning, advancing robust and explainable agricultural diagnosis without fine-tuning. Our code and data are publicly available at: this https URL.

16. 【2512.24946】HaineiFRDM: Explore Diffusion to Restore Defects in Fast-Movement Films

链接https://arxiv.org/abs/2512.24946

作者:Rongji Xun,Junjie Yuan,Zhongjie Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

关键词:noisy optical flows, show limited performance, limited performance compared, employing noisy optical, http URL

备注

点击查看摘要

Abstract:Existing open-source film restoration methods show limited performance compared to commercial methods due to training with low-quality synthetic data and employing noisy optical flows. In addition, high-resolution films have not been explored by the open-source this http URL propose HaineiFRDM(Film Restoration Diffusion Model), a film restoration framework, to explore diffusion model's powerful content-understanding ability to help human expert better restore indistinguishable film this http URL, we employ a patch-wise training and testing strategy to make restoring high-resolution films on one 24GB-VRAMR GPU possible and design a position-aware Global Prompt and Frame Fusion this http URL, we introduce a global-local frequency module to reconstruct consistent textures among different patches. Besides, we firstly restore a low-resolution result and use it as global residual to mitigate blocky artifacts caused by patching this http URL, we construct a film restoration dataset that contains restored real-degraded films and realistic synthetic this http URL experimental results conclusively demonstrate the superiority of our model in defect restoration ability over existing open-source methods. Code and the dataset will be released.

17. 【2512.24922】Semi-Supervised Diversity-Aware Domain Adaptation for 3D Object detection

链接https://arxiv.org/abs/2512.24922

作者:Bartłomiej Olber,Jakub Winter,Paweł Wawrzyński,Andrii Gamalii,Daniel Górniak,Marcin Łojek,Robert Nowak,Krystian Radlak

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:fundamental components, components of perception, perception systems, Asia or Europe, object detectors

备注

点击查看摘要

Abstract:3D object detectors are fundamental components of perception systems in autonomous vehicles. While these detectors achieve remarkable performance on standard autonomous driving benchmarks, they often struggle to generalize across different domains - for instance, a model trained in the U.S. may perform poorly in regions like Asia or Europe. This paper presents a novel lidar domain adaptation method based on neuron activation patterns, demonstrating that state-of-the-art performance can be achieved by annotating only a small, representative, and diverse subset of samples from the target domain if they are correctly selected. The proposed approach requires very small annotation budget and, when combined with post-training techniques inspired by continual learning prevent weight drift from the original model. Empirical evaluation shows that the proposed domain adaptation approach outperforms both linear probing and state-of-the-art domain adaptation techniques.

18. 【2512.24903】FinMMDocR: Benchmarking Financial Multimodal Reasoning with Scenario Awareness, Document Understanding, and Multi-Step Computation

链接https://arxiv.org/abs/2512.24903

作者:Zichen Tang,Haihong E,Rongjin Li,Jiacheng Liu,Linwei Jia,Zhuodi Hao,Zhongjun Yang,Yuanze Li,Haolin Tian,Xinyi Hu,Peizhi Zhao,Yuan Liu,Zhengyu Wang,Xianghe Wang,Yiling Huang,Xueyuan Lin,Ruofei Bai,Zijian Xie,Qian Huang,Ruining Cao,Haocheng Gao

类目:Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE)

关键词:large language models, evaluating multimodal large, multimodal large language, large language, bilingual multimodal benchmark

备注: Accepted by AAAI-26 Main Track

点击查看摘要

Abstract:We introduce FinMMDocR, a novel bilingual multimodal benchmark for evaluating multimodal large language models (MLLMs) on real-world financial numerical reasoning. Compared to existing benchmarks, our work delivers three major advancements. (1) Scenario Awareness: 57.9% of 1,200 expert-annotated problems incorporate 12 types of implicit financial scenarios (e.g., Portfolio Management), challenging models to perform expert-level reasoning based on assumptions; (2) Document Understanding: 837 Chinese/English documents spanning 9 types (e.g., Company Research) average 50.8 pages with rich visual elements, significantly surpassing existing benchmarks in both breadth and depth of financial documents; (3) Multi-Step Computation: Problems demand 11-step reasoning on average (5.3 extraction + 5.7 calculation steps), with 65.0% requiring cross-page evidence (2.4 pages average). The best-performing MLLM achieves only 58.0% accuracy, and different retrieval-augmented generation (RAG) methods show significant performance variations on this task. We expect FinMMDocR to drive improvements in MLLMs and reasoning-enhanced methods on complex multimodal reasoning tasks in real-world scenarios.

19. 【2512.24861】OFL-SAM2: Prompt SAM2 with Online Few-shot Learner for Efficient Medical Image Segmentation

链接https://arxiv.org/abs/2512.24861

作者:Meng Lan,Lefei Zhang,Xiaomeng Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:promptable visual segmentation, visual segmentation capabilities, demonstrated remarkable promptable, remarkable promptable visual, tasks involving

备注

点击查看摘要

Abstract:The Segment Anything Model 2 (SAM2) has demonstrated remarkable promptable visual segmentation capabilities in video data, showing potential for extension to medical image segmentation (MIS) tasks involving 3D volumes and temporally correlated 2D image sequences. However, adapting SAM2 to MIS presents several challenges, including the need for extensive annotated medical data for fine-tuning and high-quality manual prompts, which are both labor-intensive and require intervention from medical experts. To address these challenges, we introduce OFL-SAM2, a prompt-free SAM2 framework for label-efficient MIS. Our core idea is to leverage limited annotated samples to train a lightweight mapping network that captures medical knowledge and transforms generic image features into target features, thereby providing additional discriminative target representations for each frame and eliminating the need for manual prompts. Crucially, the mapping network supports online parameter update during inference, enhancing the model's generalization across test sequences. Technically, we introduce two key components: (1) an online few-shot learner that trains the mapping network to generate target features using limited data, and (2) an adaptive fusion module that dynamically integrates the target features with the memory-attention features generated by frozen SAM2, leading to accurate and robust target representation. Extensive experiments on three diverse MIS datasets demonstrate that OFL-SAM2 achieves state-of-the-art performance with limited training data.

20. 【2512.24851】VLN-MME: Diagnosing MLLMs as Language-guided Visual Navigation agents

链接https://arxiv.org/abs/2512.24851

作者:Xunyi Zhao,Gengze Zhou,Qi Wu

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Language Models, Large Language

备注

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across a wide range of vision-language tasks. However, their performance as embodied agents, which requires multi-round dialogue spatial reasoning and sequential action prediction, needs further exploration. Our work investigates this potential in the context of Vision-and-Language Navigation (VLN) by introducing a unified and extensible evaluation framework to probe MLLMs as zero-shot agents by bridging traditional navigation datasets into a standardized benchmark, named VLN-MME. We simplify the evaluation with a highly modular and accessible design. This flexibility streamlines experiments, enabling structured comparisons and component-level ablations across diverse MLLM architectures, agent designs, and navigation tasks. Crucially, enabled by our framework, we observe that enhancing our baseline agent with Chain-of-Thought (CoT) reasoning and self-reflection leads to an unexpected performance decrease. This suggests MLLMs exhibit poor context awareness in embodied navigation tasks; although they can follow instructions and structure their output, their 3D spatial reasoning fidelity is low. VLN-MME lays the groundwork for systematic evaluation of general-purpose MLLMs in embodied navigation settings and reveals limitations in their sequential decision-making capabilities. We believe these findings offer crucial guidance for MLLM post-training as embodied agents.

21. 【2512.24838】CropTrack: A Tracking with Re-Identification Framework for Precision Agriculture

链接https://arxiv.org/abs/2512.24838

作者:Md Ahmed Al Muzaddid,Jordan A. James,William J. Beksi

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:environments presents major, presents major challenges, major challenges due, agricultural environments presents, Multiple-object tracking

备注: 8 pages, 5 figures, and 3 tables

点击查看摘要

Abstract:Multiple-object tracking (MOT) in agricultural environments presents major challenges due to repetitive patterns, similar object appearances, sudden illumination changes, and frequent occlusions. Contemporary trackers in this domain rely on the motion of objects rather than appearance for association. Nevertheless, they struggle to maintain object identities when targets undergo frequent and strong occlusions. The high similarity of object appearances makes integrating appearance-based association nontrivial for agricultural scenarios. To solve this problem we propose CropTrack, a novel MOT framework based on the combination of appearance and motion information. CropTrack integrates a reranking-enhanced appearance association, a one-to-many association with appearance-based conflict resolution strategy, and an exponential moving average prototype feature bank to improve appearance-based association. Evaluated on publicly available agricultural MOT datasets, CropTrack demonstrates consistent identity preservation, outperforming traditional motion-based tracking methods. Compared to the state of the art, CropTrack achieves significant gains in identification F1 and association accuracy scores with a lower number of identity switches.

22. 【2512.24826】Video and Language Alignment in 2D Systems for 3D Multi-object Scenes with Multi-Information Derivative-Free Control

链接https://arxiv.org/abs/2512.24826

作者:Jason Armitage,Rico Sennnrich

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:shift when processing, dimensional shift, Cross-modal systems trained, visual inputs, systems trained

备注

点击查看摘要

Abstract:Cross-modal systems trained on 2D visual inputs are presented with a dimensional shift when processing 3D scenes. An in-scene camera bridges the dimensionality gap but requires learning a control module. We introduce a new method that improves multivariate mutual information estimates by regret minimisation with derivative-free optimisation. Our algorithm enables off-the-shelf cross-modal systems trained on 2D visual inputs to adapt online to object occlusions and differentiate features. The pairing of expressive measures and value-based optimisation assists control of an in-scene camera to learn directly from the noisy outputs of vision-language models. The resulting pipeline improves performance in cross-modal tasks on multi-object 3D scenes without resorting to pretraining or finetuning.

23. 【2512.24794】Nonlinear Noise2Noise for Efficient Monte Carlo Denoiser Training

链接https://arxiv.org/abs/2512.24794

作者:Andrew Tinits,Stephen Mann

类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)

关键词:noisy targets, target images, nonlinear functions, noisy, target

备注: 15 pages, 7 figures, 2 tables

点击查看摘要

Abstract:The Noise2Noise method allows for training machine learning-based denoisers with pairs of input and target images where both the input and target can be noisy. This removes the need for training with clean target images, which can be difficult to obtain. However, Noise2Noise training has a major limitation: nonlinear functions applied to the noisy targets will skew the results. This bias occurs because the nonlinearity makes the expected value of the noisy targets different from the clean target image. Since nonlinear functions are common in image processing, avoiding them limits the types of preprocessing that can be performed on the noisy targets. Our main insight is that certain nonlinear functions can be applied to the noisy targets without adding significant bias to the results. We develop a theoretical framework for analyzing the effects of these nonlinearities, and describe a class of nonlinear functions with minimal bias. We demonstrate our method on the denoising of high dynamic range (HDR) images produced by Monte Carlo rendering. Noise2Noise training can have trouble with HDR images, where the training process is overwhelmed by outliers and performs poorly. We consider a commonly used method of addressing these training issues: applying a nonlinear tone mapping function to the model output and target images to reduce their dynamic range. This method was previously thought to be incompatible with Noise2Noise training because of the nonlinearities involved. We show that certain combinations of loss functions and tone mapping functions can reduce the effect of outliers while introducing minimal bias. We apply our method to an existing machine learning-based Monte Carlo denoiser, where the original implementation was trained with high-sample count reference images. Our results approach those of the original implementation, but are produced using only noisy training data.

Comments:
15 pages, 7 figures, 2 tables

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)

ACMclasses:
I.4.3; I.3.7; I.5.1

Cite as:
arXiv:2512.24794 [cs.CV]

(or
arXiv:2512.24794v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2512.24794

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Journalreference:
SIGGRAPH Asia 2025 Conference Papers, Article 49, 1-11

Related DOI:

https://doi.org/10.1145/3757377.3763931

Focus to learn more

            DOI(s) linking to related resources</p>
24. 【2512.24792】Projection-based Adversarial Attack using Physics-in-the-Loop Optimization for Monocular Depth Estimation

链接https://arxiv.org/abs/2512.24792

作者:Takeru Kusakabe,Yudai Hirose,Mashiho Mukaida,Satoshi Ono

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)

关键词:Deep neural networks, Deep neural, neural networks, remain vulnerable, input images

备注

点击查看摘要

Abstract:Deep neural networks (DNNs) remain vulnerable to adversarial attacks that cause misclassification when specific perturbations are added to input images. This vulnerability also threatens the reliability of DNN-based monocular depth estimation (MDE) models, making robustness enhancement a critical need in practical applications. To validate the vulnerability of DNN-based MDE models, this study proposes a projection-based adversarial attack method that projects perturbation light onto a target object. The proposed method employs physics-in-the-loop (PITL) optimization -- evaluating candidate solutions in actual environments to account for device specifications and disturbances -- and utilizes a distributed covariance matrix adaptation evolution strategy. Experiments confirmed that the proposed method successfully created adversarial examples that lead to depth misestimations, resulting in parts of objects disappearing from the target scene.

25. 【2512.24766】Dream2Flow: Bridging Video Generation and Open-World Manipulation with 3D Object Flow

链接https://arxiv.org/abs/2512.24766

作者:Karthik Dharmarajan,Wenlong Huang,Jiajun Wu,Li Fei-Fei,Ruohan Zhang

类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Generative video modeling, plausible physical interactions, Generative video, modeling has emerged, compelling tool

备注: Project website: [this https URL](https://dream2flow.github.io/)

点击查看摘要

Abstract:Generative video modeling has emerged as a compelling tool to zero-shot reason about plausible physical interactions for open-world manipulation. Yet, it remains a challenge to translate such human-led motions into the low-level actions demanded by robotic systems. We observe that given an initial image and task instruction, these models excel at synthesizing sensible object motions. Thus, we introduce Dream2Flow, a framework that bridges video generation and robotic control through 3D object flow as an intermediate representation. Our method reconstructs 3D object motions from generated videos and formulates manipulation as object trajectory tracking. By separating the state changes from the actuators that realize those changes, Dream2Flow overcomes the embodiment gap and enables zero-shot guidance from pre-trained video models to manipulate objects of diverse categories-including rigid, articulated, deformable, and granular. Through trajectory optimization or reinforcement learning, Dream2Flow converts reconstructed 3D object flow into executable low-level commands without task-specific demonstrations. Simulation and real-world experiments highlight 3D object flow as a general and scalable interface for adapting video generation models to open-world robotic manipulation. Videos and visualizations are available at this https URL.

26. 【2512.24763】UniC-Lift: Unified 3D Instance Segmentation via Contrastive Learning

链接https://arxiv.org/abs/2512.24763

作者:Ankit Dhiman,Srinath R,Jaswanth Reddy,Lokesh R Boregowda,Venkatesh Babu Radhakrishnan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Neural Radiance Fields, Radiance Fields, Neural Radiance, advanced novel-view synthesis, Gaussian Splatting

备注: Accepted to AAAI 2026. Project Page: [this https URL](https://unic-lift.github.io/)

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) and Neural Radiance Fields (NeRF) have advanced novel-view synthesis. Recent methods extend multi-view 2D segmentation to 3D, enabling instance/semantic segmentation for better scene understanding. A key challenge is the inconsistency of 2D instance labels across views, leading to poor 3D predictions. Existing methods use a two-stage approach in which some rely on contrastive learning with hyperparameter-sensitive clustering, while others preprocess labels for consistency. We propose a unified framework that merges these steps, reducing training time and improving performance by introducing a learnable feature embedding for segmentation in Gaussian primitives. This embedding is then efficiently decoded into instance labels through a novel "Embedding-to-Label" process, effectively integrating the optimization. While this unified framework offers substantial benefits, we observed artifacts at the object boundaries. To address the object boundary issues, we propose hard-mining samples along these boundaries. However, directly applying hard mining to the feature embeddings proved unstable. Therefore, we apply a linear layer to the rasterized feature embeddings before calculating the triplet loss, which stabilizes training and significantly improves performance. Our method outperforms baselines qualitatively and quantitatively on the ScanNet, Replica3D, and Messy-Rooms datasets.

27. 【2512.24742】Splatwizard: A Benchmark Toolkit for 3D Gaussian Splatting Compression

链接https://arxiv.org/abs/2512.24742

作者:Xiang Liu,Yimin Zhou,Jinxiang Wang,Yujun Huang,Shuzhao Xie,Shiyu Qin,Mingyao Hong,Jiawei Li,Yaowei Wang,Zhi Wang,Shu-Tao Xia,Bin Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Gaussian Splatting, view synthesis, recent advent, marked a significant, significant breakthrough

备注

点击查看摘要

Abstract:The recent advent of 3D Gaussian Splatting (3DGS) has marked a significant breakthrough in real-time novel view synthesis. However, the rapid proliferation of 3DGS-based algorithms has created a pressing need for standardized and comprehensive evaluation tools, especially for compression task. Existing benchmarks often lack the specific metrics necessary to holistically assess the unique characteristics of different methods, such as rendering speed, rate distortion trade-offs memory efficiency, and geometric accuracy. To address this gap, we introduce Splatwizard, a unified benchmark toolkit designed specifically for benchmarking 3DGS compression models. Splatwizard provides an easy-to-use framework to implement new 3DGS compression model and utilize state-of-the-art techniques proposed by previous work. Besides, an integrated pipeline that automates the calculation of key performance indicators, including image-based quality metrics, chamfer distance of reconstruct mesh, rendering frame rates, and computational resource consumption is included in the framework as well. Code is available at this https URL

28. 【2512.24731】EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation

链接https://arxiv.org/abs/2512.24731

作者:Bingxuan Li,Yiming Cui,Yicheng He,Yiwei Wang,Shu Zhang,Longyin Wen,Yulei Niu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Sound effects build, multimodal storytelling, shaping the emotional, effects build, build an essential

备注

点击查看摘要

Abstract:Sound effects build an essential layer of multimodal storytelling, shaping the emotional atmosphere and the narrative semantics of videos. Despite recent advancement in video-text-to-audio (VT2A), the current formulation faces three key limitations: First, an imbalance between visual and textual conditioning that leads to visual dominance; Second, the absence of a concrete definition for fine-grained controllable generation; Third, weak instruction understanding and following, as existing datasets rely on brief categorical tags. To address these limitations, we introduce EchoFoley, a new task designed for video-grounded sound generation with both event level local control and hierarchical semantic control. Our symbolic representation for sounding events specifies when, what, and how each sound is produced within a video or instruction, enabling fine-grained controls like sound generation, insertion, and editing. To support this task, we construct EchoFoley-6k, a large-scale, expert-curated benchmark containing over 6,000 video-instruction-annotation triplets. Building upon this foundation, we propose EchoVidia a sounding-event-centric agentic generation framework with slow-fast thinking strategy. Experiments show that EchoVidia surpasses recent VT2A models by 40.7% in controllability and 12.5% in perceptual quality.

29. 【2512.24724】FlowBlending: Stage-Aware Multi-Model Sampling for Fast and High-Fidelity Video Generation

链接https://arxiv.org/abs/2512.24724

作者:Jibin Song,Mingi Kwon,Jaeseok Jeong,Youngjung Uh

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:model capacity varies, varies across timesteps, capacity varies, early and late, largely negligible

备注: Project page: [this https URL](https://jibin86.github.io/flowblending_project_page)

点击查看摘要

Abstract:In this work, we show that the impact of model capacity varies across timesteps: it is crucial for the early and late stages but largely negligible during the intermediate stage. Accordingly, we propose FlowBlending, a stage-aware multi-model sampling strategy that employs a large model and a small model at capacity-sensitive stages and intermediate stages, respectively. We further introduce simple criteria to choose stage boundaries and provide a velocity-divergence analysis as an effective proxy for identifying capacity-sensitive regions. Across LTX-Video (2B/13B) and WAN 2.1 (1.3B/14B), FlowBlending achieves up to 1.65x faster inference with 57.35% fewer FLOPs, while maintaining the visual fidelity, temporal coherence, and semantic alignment of the large models. FlowBlending is also compatible with existing sampling-acceleration techniques, enabling up to 2x additional speedup. Project page is available at: this https URL.

30. 【2512.24702】Evolving, Not Training: Zero-Shot Reasoning Segmentation via Evolutionary Prompting

链接https://arxiv.org/abs/2512.24702

作者:Kai Ye,Xiaotong You,Jianghang Lin,Jiayi Ji,Pingyang Dai,Liujuan Cao

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:achieve pixel-level localization, Segmentation requires models, context-dependent linguistic queries, Reasoning Segmentation requires, interpret complex

备注

点击查看摘要

Abstract:Reasoning Segmentation requires models to interpret complex, context-dependent linguistic queries to achieve pixel-level localization. Current dominant approaches rely heavily on Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL). However, SFT suffers from catastrophic forgetting and domain dependency, while RL is often hindered by training instability and rigid reliance on predefined reward functions. Although recent training-free methods circumvent these training burdens, they are fundamentally limited by a static inference paradigm. These methods typically rely on a single-pass "generate-then-segment" chain, which suffers from insufficient reasoning depth and lacks the capability to self-correct linguistic hallucinations or spatial misinterpretations. In this paper, we challenge these limitations and propose EVOL-SAM3, a novel zero-shot framework that reformulates reasoning segmentation as an inference-time evolutionary search process. Instead of relying on a fixed prompt, EVOL-SAM3 maintains a population of prompt hypotheses and iteratively refines them through a "Generate-Evaluate-Evolve" loop. We introduce a Visual Arena to assess prompt fitness via reference-free pairwise tournaments, and a Semantic Mutation operator to inject diversity and correct semantic errors. Furthermore, a Heterogeneous Arena module integrates geometric priors with semantic reasoning to ensure robust final selection. Extensive experiments demonstrate that EVOL-SAM3 not only substantially outperforms static baselines but also significantly surpasses fully supervised state-of-the-art methods on the challenging ReasonSeg benchmark in a zero-shot setting. The code is available at this https URL.

31. 【2512.24663】Renormalization Group Guided Tensor Network Structure Search

链接https://arxiv.org/abs/2512.24663

作者:Maolin Wang,Bowen Yu,Sheng Zhang,Linjie Mi,Wanyu Wang,Yiqi Wang,Pengyue Jia,Xuetao Wei,Zenglin Xu,Ruocheng Guo,Xiangyu Zhao

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:automatically discover optimal, optimal network topologies, discover optimal network, efficient tensor decomposition, high-dimensional data representation

备注: Accepted to AAAI 2026

点击查看摘要

Abstract:Tensor network structure search (TN-SS) aims to automatically discover optimal network topologies and rank configurations for efficient tensor decomposition in high-dimensional data representation. Despite recent advances, existing TN-SS methods face significant limitations in computational tractability, structure adaptivity, and optimization robustness across diverse tensor characteristics. They struggle with three key challenges: single-scale optimization missing multi-scale structures, discrete search spaces hindering smooth structure evolution, and separated structure-parameter optimization causing computational inefficiency. We propose RGTN (Renormalization Group guided Tensor Network search), a physics-inspired framework transforming TN-SS via multi-scale renormalization group flows. Unlike fixed-scale discrete search methods, RGTN uses dynamic scale-transformation for continuous structure evolution across resolutions. Its core innovation includes learnable edge gates for optimization-stage topology modification and intelligent proposals based on physical quantities like node tension measuring local stress and edge information flow quantifying connectivity importance. Starting from low-complexity coarse scales and refining to finer ones, RGTN finds compact structures while escaping local minima via scale-induced perturbations. Extensive experiments on light field data, high-order synthetic tensors, and video completion tasks show RGTN achieves state-of-the-art compression ratios and runs 4-600$\times$ faster than existing methods, validating the effectiveness of our physics-inspired approach.

32. 【2512.24639】From Sequential to Spatial: Reordering Autoregression for Efficient Visual Generation

链接https://arxiv.org/abs/2512.24639

作者:Siyang Wang,Hanting Li,Wei Li,Jie Hu,Xinghao Chen,Feng Zhao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:language modeling, remarkable success, widely adopted, autoregressive models, traditional autoregressive models

备注

点击查看摘要

Abstract:Inspired by the remarkable success of autoregressive models in language modeling, this paradigm has been widely adopted in visual generation. However, the sequential token-by-token decoding mechanism inherent in traditional autoregressive models leads to low inference this http URL this paper, we propose RadAR, an efficient and parallelizable framework designed to accelerate autoregressive visual generation while preserving its representational capacity. Our approach is motivated by the observation that visual tokens exhibit strong local dependencies and spatial correlations with their neighbors--a property not fully exploited in standard raster-scan decoding orders. Specifically, we organize the generation process around a radial topology: an initial token is selected as the starting point, and all other tokens are systematically grouped into multiple concentric rings according to their spatial distances from this center. Generation then proceeds in a ring-wise manner, from inner to outer regions, enabling the parallel prediction of all tokens within the same ring. This design not only preserves the structural locality and spatial coherence of visual scenes but also substantially increases parallelization. Furthermore, to address the risk of inconsistent predictions arising from simultaneous token generation with limited context, we introduce a nested attention mechanism. This mechanism dynamically refines implausible outputs during the forward pass, thereby mitigating error accumulation and preventing model collapse. By integrating radial parallel prediction with dynamic output correction, RadAR significantly improves generation efficiency.

33. 【2512.24622】FireRescue: A UAV-Based Dataset and Enhanced YOLO Model for Object Detection in Fire Rescue Scenes

链接https://arxiv.org/abs/2512.24622

作者:Qingyu Xu,Runtong Zhang,Zihuan Qiu,Fanman Meng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:fire rescue scenarios, firefighting operations, decision-making in firefighting, fire trucks, rescue scenarios

备注

点击查看摘要

Abstract:Object detection in fire rescue scenarios is importance for command and decision-making in firefighting operations. However, existing research still suffers from two main limitations. First, current work predominantly focuses on environments such as mountainous or forest areas, while paying insufficient attention to urban rescue scenes, which are more frequent and structurally complex. Second, existing detection systems include a limited number of classes, such as flames and smoke, and lack a comprehensive system covering key targets crucial for command decisions, such as fire trucks and firefighters. To address the above issues, this paper first constructs a new dataset named "FireRescue" for rescue command, which covers multiple rescue scenarios, including urban, mountainous, forest, and water areas, and contains eight key categories such as fire trucks and firefighters, with a total of 15,980 images and 32,000 bounding boxes. Secondly, to tackle the problems of inter-class confusion and missed detection of small targets caused by chaotic scenes, diverse targets, and long-distance shooting, this paper proposes an improved model named FRS-YOLO. On the one hand, the model introduces a plug-and-play multidi-mensional collaborative enhancement attention module, which enhances the discriminative representation of easily confused categories (e.g., fire trucks vs. ordinary trucks) through cross-dimensional feature interaction. On the other hand, it integrates a dynamic feature sampler to strengthen high-response foreground features, thereby mitigating the effects of smoke occlusion and background interference. Experimental results demonstrate that object detection in fire rescue scenarios is highly challenging, and the proposed method effectively improves the detection performance of YOLO series models in this context.

34. 【2512.24620】LLHA-Net: A Hierarchical Attention Network for Two-View Correspondence Learning

链接https://arxiv.org/abs/2512.24620

作者:Shuyuan Lin,Yu Guo,Xiao Chen,Yanjie Liang,Guobao Xiao,Feiran Huang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Establishing the correct, feature points, correct correspondence, fundamental task, feature

备注

点击查看摘要

Abstract:Establishing the correct correspondence of feature points is a fundamental task in computer vision. However, the presence of numerous outliers among the feature points can significantly affect the matching results, reducing the accuracy and robustness of the process. Furthermore, a challenge arises when dealing with a large proportion of outliers: how to ensure the extraction of high-quality information while reducing errors caused by negative samples. To address these issues, in this paper, we propose a novel method called Layer-by-Layer Hierarchical Attention Network, which enhances the precision of feature point matching in computer vision by addressing the issue of outliers. Our method incorporates stage fusion, hierarchical extraction, and an attention mechanism to improve the network's representation capability by emphasizing the rich semantic information of feature points. Specifically, we introduce a layer-by-layer channel fusion module, which preserves the feature semantic information from each stage and achieves overall fusion, thereby enhancing the representation capability of the feature points. Additionally, we design a hierarchical attention module that adaptively captures and fuses global perception and structural semantic information using an attention mechanism. Finally, we propose two architectures to extract and integrate features, thereby improving the adaptability of our network. We conduct experiments on two public datasets, namely YFCC100M and SUN3D, and the results demonstrate that our proposed method outperforms several state-of-the-art techniques in both outlier removal and camera pose estimation. Source code is available at this http URL.

35. 【2512.24605】MoniRefer: A Real-world Large-scale Multi-modal Dataset based on Roadside Infrastructure for 3D Visual Grounding

链接https://arxiv.org/abs/2512.24605

作者:Panquan Yang,Junfei Huang,Zongzhangbao Yin,Yingsong Hu,Anni Xu,Xinyi Luo,Xueqi Sun,Hai Wu,Sheng Ao,Zhaoxing Zhu,Chenglu Wen,Cheng Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:visual grounding aims, natural language sentences, visual grounding, outdoor monitoring scenarios, semantically corresponds

备注: 14 pages

点击查看摘要

Abstract:3D visual grounding aims to localize the object in 3D point cloud scenes that semantically corresponds to given natural language sentences. It is very critical for roadside infrastructure system to interpret natural languages and localize relevant target objects in complex traffic environments. However, most existing datasets and approaches for 3D visual grounding focus on the indoor and outdoor driving scenes, outdoor monitoring scenarios remain unexplored due to scarcity of paired point cloud-text data captured by roadside infrastructure sensors. In this paper, we introduce a novel task of 3D Visual Grounding for Outdoor Monitoring Scenarios, which enables infrastructure-level understanding of traffic scenes beyond the ego-vehicle perspective. To support this task, we construct MoniRefer, the first real-world large-scale multi-modal dataset for roadside-level 3D visual grounding. The dataset consists of about 136,018 objects with 411,128 natural language expressions collected from multiple complex traffic intersections in the real-world environments. To ensure the quality and accuracy of the dataset, we manually verified all linguistic descriptions and 3D labels for objects. Additionally, we also propose a new end-to-end method, named Moni3DVG, which utilizes the rich appearance information provided by images and geometry and optical information from point cloud for multi-modal feature learning and 3D object localization. Extensive experiments and ablation studies on the proposed benchmarks demonstrate the superiority and effectiveness of our method. Our dataset and code will be released.

36. 【2512.24603】Collaborative Low-Rank Adaptation for Pre-Trained Vision Transformers

链接https://arxiv.org/abs/2512.24603

作者:Zheng Liu,Jinchao Zhu,Gao Huang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:achieved remarkable success, pre-trained vision transformers, fine-tuning pre-trained vision, downstream tasks, achieved remarkable

备注: 13 tables, 3 figures

点击查看摘要

Abstract:Low-rank adaptation (LoRA) has achieved remarkable success in fine-tuning pre-trained vision transformers for various downstream tasks. Existing studies mainly focus on exploring more parameter-efficient strategies or more effective representation learning schemes. However, these methods either sacrifice fine-tuning performance or introduce excessive trainable parameters, failing to strike a balance between learning performance and parameter efficiency. To address this problem, we propose a novel tuning method named collaborative low-rank adaptation (CLoRA) in this paper. CLoRA consists of base-space sharing and sample-agnostic diversity enhancement (SADE) components. To maintain parameter efficiency while expanding the learning capacity of low-rank modules (LRMs), base-space sharing allows all LRMs to share a set of down/up-projection spaces. In CLoRA, the low-rank matrices obtained from the shared spaces collaboratively construct each LRM. Since the representations extracted by these matrices may contain redundant information, SADE is employed to regularize the similarities among them to encourage diverse representations in the training process. We conduct extensive experiments on widely used image and point cloud datasets to evaluate the performance of CLoRA. Experimental results demonstrate that CLoRA strikes a better balance between learning performance and parameter efficiency, while requiring the fewest GFLOPs for point cloud analysis, compared with the state-of-the-art methods.

37. 【2512.24593】3D Semantic Segmentation for Post-Disaster Assessment

链接https://arxiv.org/abs/2512.24593

作者:Nhut Le,Maryam Rahnemoonfar

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:substantial economic losses, natural disasters poses, disasters poses severe, poses severe threats, Fast Point Transformer

备注: Accepted by the 2025 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2025)

点击查看摘要

Abstract:The increasing frequency of natural disasters poses severe threats to human lives and leads to substantial economic losses. While 3D semantic segmentation is crucial for post-disaster assessment, existing deep learning models lack datasets specifically designed for post-disaster environments. To address this gap, we constructed a specialized 3D dataset using unmanned aerial vehicles (UAVs)-captured aerial footage of Hurricane Ian (2022) over affected areas, employing Structure-from-Motion (SfM) and Multi-View Stereo (MVS) techniques to reconstruct 3D point clouds. We evaluated the state-of-the-art (SOTA) 3D semantic segmentation models, Fast Point Transformer (FPT), Point Transformer v3 (PTv3), and OA-CNNs on this dataset, exposing significant limitations in existing methods for disaster-stricken regions. These findings underscore the urgent need for advancements in 3D segmentation techniques and the development of specialized 3D benchmark datasets to improve post-disaster scene understanding and response.

38. 【2512.24592】SliceLens: Fine-Grained and Grounded Error Slice Discovery for Multi-Instance Vision Tasks

链接https://arxiv.org/abs/2512.24592

作者:Wei Zhang,Chaoqun Wang,Zixuan Guan,Sam Kao,Pengfei Zhao,Peng Wu,Sifeng He

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:coherent visual patterns, robust model evaluation, Systematic failures, subsets with coherent, critical challenge

备注

点击查看摘要

Abstract:Systematic failures of computer vision models on subsets with coherent visual patterns, known as error slices, pose a critical challenge for robust model evaluation. Existing slice discovery methods are primarily developed for image classification, limiting their applicability to multi-instance tasks such as detection, segmentation, and pose estimation. In real-world scenarios, error slices often arise from corner cases involving complex visual relationships, where existing instance-level approaches lacking fine-grained reasoning struggle to yield meaningful insights. Moreover, current benchmarks are typically tailored to specific algorithms or biased toward image classification, with artificial ground truth that fails to reflect real model failures. To address these limitations, we propose SliceLens, a hypothesis-driven framework that leverages LLMs and VLMs to generate and verify diverse failure hypotheses through grounded visual reasoning, enabling reliable identification of fine-grained and interpretable error slices. We further introduce FeSD (Fine-grained Slice Discovery), the first benchmark specifically designed for evaluating fine-grained error slice discovery across instance-level vision tasks, featuring expert-annotated and carefully refined ground-truth slices with precise grounding to local error regions. Extensive experiments on both existing benchmarks and FeSD demonstrate that SliceLens achieves state-of-the-art performance, improving Precision@10 by 0.42 (0.73 vs. 0.31) on FeSD, and identifies interpretable slices that facilitate actionable model improvements, as validated through model repair experiments.

39. 【2512.24591】Improving Few-Shot Change Detection Visual Question Answering via Decision-Ambiguity-guided Reinforcement Fine-Tuning

链接https://arxiv.org/abs/2512.24591

作者:Fuyu Dong,Ke Li,Di Wang,Nan Luo,Yiming Zhang,Kaiyu Li,Jianfei Yang,Quan Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Change detection visual, requires answering text, remote sensing images, answering text queries, visual question answering

备注

点击查看摘要

Abstract:Change detection visual question answering (CDVQA) requires answering text queries by reasoning about semantic changes in bi-temporal remote sensing images. A straightforward approach is to boost CDVQA performance with generic vision-language models via supervised fine-tuning (SFT). Despite recent progress, we observe that a significant portion of failures do not stem from clearly incorrect predictions, but from decision ambiguity, where the model assigns similar confidence to the correct answer and strong distractors. To formalize this challenge, we define Decision-Ambiguous Samples (DAS) as instances with a small probability margin between the ground-truth answer and the most competitive alternative. We argue that explicitly optimizing DAS is crucial for improving the discriminability and robustness of CDVQA models. To this end, we propose DARFT, a Decision-Ambiguity-guided Reinforcement Fine-Tuning framework that first mines DAS using an SFT-trained reference policy and then applies group-relative policy optimization on the mined subset. By leveraging multi-sample decoding and intra-group relative advantages, DARFT suppresses strong distractors and sharpens decision boundaries without additional supervision. Extensive experiments demonstrate consistent gains over SFT baselines, particularly under few-shot settings.

40. 【2512.24561】RGBT-Ground Benchmark: Visual Grounding Beyond RGB in Complex Real-World Scenarios

链接https://arxiv.org/abs/2512.24561

作者:Tianyi Zhao,Jiawen Xi,Linhui Xiao,Junnan Li,Xue Yang,Maoxun Yuan,Xingxing Wei

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:natural language expressions, localize specific objects, aims to localize, vision-language understanding, localize specific

备注: 27pages, 9figures

点击查看摘要

Abstract:Visual Grounding (VG) aims to localize specific objects in an image according to natural language expressions, serving as a fundamental task in vision-language understanding. However, existing VG benchmarks are mostly derived from datasets collected under clean environments, such as COCO, where scene diversity is limited. Consequently, they fail to reflect the complexity of real-world conditions, such as changes in illumination, weather, etc., that are critical to evaluating model robustness and generalization in safety-critical applications. To address these limitations, we present RGBT-Ground, the first large-scale visual grounding benchmark built for complex real-world scenarios. It consists of spatially aligned RGB and Thermal infrared (TIR) image pairs with high-quality referring expressions, corresponding object bounding boxes, and fine-grained annotations at the scene, environment, and object levels. This benchmark enables comprehensive evaluation and facilitates the study of robust grounding under diverse and challenging conditions. Furthermore, we establish a unified visual grounding framework that supports both uni-modal (RGB or TIR) and multi-modal (RGB-TIR) visual inputs. Based on it, we propose RGBT-VGNet, a simple yet effective baseline for fusing complementary visual modalities to achieve robust grounding. We conduct extensive adaptations to the existing methods on RGBT-Ground. Experimental results show that our proposed RGBT-VGNet significantly outperforms these adapted methods, particularly in nighttime and long-distance scenarios. All resources will be publicly released to promote future research on robust visual grounding in complex real-world environments.

41. 【2512.24552】OCP-LS: An Efficient Algorithm for Visual Localization

链接https://arxiv.org/abs/2512.24552

作者:Jindi Zhong,Hongxia Wang,Huanshui Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC)

关键词:second-order optimization algorithm, paper proposes, second-order optimization, address large-scale optimization, large-scale optimization problems

备注

点击查看摘要

Abstract:This paper proposes a novel second-order optimization algorithm. It aims to address large-scale optimization problems in deep learning because it incorporates the OCP method and appropriately approximating the diagonal elements of the Hessian matrix. Extensive experiments on multiple standard visual localization benchmarks demonstrate the significant superiority of the proposed method. Compared with conventional optimiza tion algorithms, our framework achieves competitive localization accuracy while exhibiting faster convergence, enhanced training stability, and improved robustness to noise interference.

42. 【2512.24551】PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation

链接https://arxiv.org/abs/2512.24551

作者:Yuanhao Cai,Kunpeng Li,Menglin Jia,Jialiang Wang,Junzhe Sun,Feng Liang,Weifeng Chen,Felix Juefei-Xu,Chu Wang,Ali Thabet,Xiaoliang Dai,Xuan Ju,Alan Yuille,Ji Hou

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:good visual quality, achieved good visual, Recent advances, follow physical laws, physical laws remains

备注

点击查看摘要

Abstract:Recent advances in text-to-video (T2V) generation have achieved good visual quality, yet synthesizing videos that faithfully follow physical laws remains an open challenge. Existing methods mainly based on graphics or prompt extension struggle to generalize beyond simple simulated environments or learn implicit physical reasoning. The scarcity of training data with rich physics interactions and phenomena is also a problem. In this paper, we first introduce a Physics-Augmented video data construction Pipeline, PhyAugPipe, that leverages a vision-language model (VLM) with chain-of-thought reasoning to collect a large-scale training dataset, PhyVidGen-135K. Then we formulate a principled Physics-aware Groupwise Direct Preference Optimization, PhyGDPO, framework that builds upon the groupwise Plackett-Luce probabilistic model to capture holistic preferences beyond pairwise comparisons. In PhyGDPO, we design a Physics-Guided Rewarding (PGR) scheme that embeds VLM-based physics rewards to steer optimization toward physical consistency. We also propose a LoRA-Switch Reference (LoRA-SR) scheme that eliminates memory-heavy reference duplication for efficient training. Experiments show that our method significantly outperforms state-of-the-art open-source methods on PhyGenBench and VideoPhy2. Please check our project page at this https URL for more video results. Our code, models, and data will be released at this https URL

43. 【2512.24547】Hierarchical Vector-Quantized Latents for Perceptual Low-Resolution Video Compression

链接https://arxiv.org/abs/2512.24547

作者:Manikanta Kotthapalli,Banafsheh Rekabdar

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:content delivery networks, Quantized Variational Autoencoder, Vector Quantized Variational, exponential growth, increasing demands

备注: 11 pages

点击查看摘要

Abstract:The exponential growth of video traffic has placed increasing demands on bandwidth and storage infrastructure, particularly for content delivery networks (CDNs) and edge devices. While traditional video codecs like H.264 and HEVC achieve high compression ratios, they are designed primarily for pixel-domain reconstruction and lack native support for machine learning-centric latent representations, limiting their integration into deep learning pipelines. In this work, we present a Multi-Scale Vector Quantized Variational Autoencoder (MS-VQ-VAE) designed to generate compact, high-fidelity latent representations of low-resolution video, suitable for efficient storage, transmission, and client-side decoding. Our architecture extends the VQ-VAE-2 framework to a spatiotemporal setting, introducing a two-level hierarchical latent structure built with 3D residual convolutions. The model is lightweight (approximately 18.5M parameters) and optimized for 64x64 resolution video clips, making it appropriate for deployment on edge devices with constrained compute and memory resources. To improve perceptual reconstruction quality, we incorporate a perceptual loss derived from a pre-trained VGG16 network. Trained on the UCF101 dataset using 2-second video clips (32 frames at 16 FPS), on the test set we achieve 25.96 dB PSNR and 0.8375 SSIM. On validation, our model improves over the single-scale baseline by 1.41 dB PSNR and 0.0248 SSIM. The proposed framework is well-suited for scalable video compression in bandwidth-sensitive scenarios, including real-time streaming, mobile video analytics, and CDN-level storage optimization.

44. 【2512.24518】Using Large Language Models To Translate Machine Results To Human Results

链接https://arxiv.org/abs/2512.24518

作者:Trishna Niraula,Jonathan Stubblefield

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:transformed medical imaging, Artificial intelligence, medical imaging, computer vision, performance in classification

备注: 11 pages, 7 figures, 3 tables

点击查看摘要

Abstract:Artificial intelligence (AI) has transformed medical imaging, with computer vision (CV) systems achieving state-of-the-art performance in classification and detection tasks. However, these systems typically output structured predictions, leaving radiologists responsible for translating results into full narrative reports. Recent advances in large language models (LLMs), such as GPT-4, offer new opportunities to bridge this gap by generating diagnostic narratives from structured findings. This study introduces a pipeline that integrates YOLOv5 and YOLOv8 for anomaly detection in chest X-ray images with a large language model (LLM) to generate natural-language radiology reports. The YOLO models produce bounding-box predictions and class labels, which are then passed to the LLM to generate descriptive findings and clinical summaries. YOLOv5 and YOLOv8 are compared in terms of detection accuracy, inference latency, and the quality of generated text, as measured by cosine similarity to ground-truth reports. Results show strong semantic similarity between AI and human reports, while human evaluation reveals GPT-4 excels in clarity (4.88/5) but exhibits lower scores for natural writing flow (2.81/5), indicating that current systems achieve clinical accuracy but remain stylistically distinguishable from radiologist-authored text.

45. 【2512.24499】raining-Free Color-Aware Adversarial Diffusion Sanitization for Diffusion Stegomalware Defense at Security Gateways

链接https://arxiv.org/abs/2512.24499

作者:Vladimir Frants,Sos Agaian

类目:Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

关键词:synthetic media creation, normalized large-scale synthetic, large-scale synthetic media, media creation, covert communication

备注

点击查看摘要

Abstract:The rapid expansion of generative AI has normalized large-scale synthetic media creation, enabling new forms of covert communication. Recent generative steganography methods, particularly those based on diffusion models, can embed high-capacity payloads without fine-tuning or auxiliary decoders, creating significant challenges for detection and remediation. Coverless diffusion-based techniques are difficult to counter because they generate image carriers directly from secret data, enabling attackers to deliver stegomalware for command-and-control, payload staging, and data exfiltration while bypassing detectors that rely on cover-stego discrepancies. This work introduces Adversarial Diffusion Sanitization (ADS), a training-free defense for security gateways that neutralizes hidden payloads rather than detecting them. ADS employs an off-the-shelf pretrained denoiser as a differentiable proxy for diffusion-based decoders and incorporates a color-aware, quaternion-coupled update rule to reduce artifacts under strict distortion limits. Under a practical threat model and in evaluation against the state-of-the-art diffusion steganography method Pulsar, ADS drives decoder success rates to near zero with minimal perceptual impact. Results demonstrate that ADS provides a favorable security-utility trade-off compared to standard content transformations, offering an effective mitigation strategy against diffusion-driven steganography.

46. 【2512.24473】F2IDiff: Real-world Image Super-resolution using Feature to Image Diffusion Foundation Model

链接https://arxiv.org/abs/2512.24473

作者:Devendra K. Jangid,Ripon K. Saha,Dilshan Godaliyadda,Jing Li,Seok-Jun Lee,Hamid R. Sheikh

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

关键词:Single Image Super-Resolution, strong priors learned, adopt generative models, Single Image, substantial improvement

备注

点击查看摘要

Abstract:With the advent of Generative AI, Single Image Super-Resolution (SISR) quality has seen substantial improvement, as the strong priors learned by Text-2-Image Diffusion (T2IDiff) Foundation Models (FM) can bridge the gap between High-Resolution (HR) and Low-Resolution (LR) images. However, flagship smartphone cameras have been slow to adopt generative models because strong generation can lead to undesirable hallucinations. For substantially degraded LR images, as seen in academia, strong generation is required and hallucinations are more tolerable because of the wide gap between LR and HR images. In contrast, in consumer photography, the LR image has substantially higher fidelity, requiring only minimal hallucination-free generation. We hypothesize that generation in SISR is controlled by the stringency and richness of the FM's conditioning feature. First, text features are high level features, which often cannot describe subtle textures in an image. Additionally, Smartphone LR images are at least $12MP$, whereas SISR networks built on T2IDiff FM are designed to perform inference on much smaller images ($1MP$). As a result, SISR inference has to be performed on small patches, which often cannot be accurately described by text feature. To address these shortcomings, we introduce an SISR network built on a FM with lower-level feature conditioning, specifically DINOv2 features, which we call a Feature-to-Image Diffusion (F2IDiff) Foundation Model (FM). Lower level features provide stricter conditioning while being rich descriptors of even small patches.

47. 【2512.24463】Spectral and Spatial Graph Learning for Multispectral Solar Image Compression

链接https://arxiv.org/abs/2512.24463

作者:Prasiddha Siwakoti,Atefeh Khoshkhahtinat,Piyush M. Mehta,Barbara J. Thompson,Michael S. F. Kirk,Daniel da Silva

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:imagery remains challenging, preserving fine spectral, multispectral solar imagery, solar imagery remains, Windowed Graph Embedding

备注: 8 pages, 6 figures 1 table. Code available at [this https URL](https://github.com/agyat4/sgraph)

点击查看摘要

Abstract:High-fidelity compression of multispectral solar imagery remains challenging for space missions, where limited bandwidth must be balanced against preserving fine spectral and spatial details. We present a learned image compression framework tailored to solar observations, leveraging two complementary modules: (1) the Inter-Spectral Windowed Graph Embedding (iSWGE), which explicitly models inter-band relationships by representing spectral channels as graph nodes with learned edge features; and (2) the Windowed Spatial Graph Attention and Convolutional Block Attention (WSGA-C), which combines sparse graph attention with convolutional attention to reduce spatial redundancy and emphasize fine-scale structures. Evaluations on the SDOML dataset across six extreme ultraviolet (EUV) channels show that our approach achieves a 20.15%reduction in Mean Spectral Information Divergence (MSID), up to 1.09% PSNR improvement, and a 1.62% log transformed MS-SSIM gain over strong learned baselines, delivering sharper and spectrally faithful reconstructions at comparable bits-per-pixel rates. The code is publicly available at this https URL .

48. 【2512.24438】Exploring Compositionality in Vision Transformers using Wavelet Representations

链接https://arxiv.org/abs/2512.24438

作者:Akshad Shyam Purushottamdas,Pranav K Nayak,Divya Mehul Rajparia,Deekshith Patel,Yashmitha Gogineni,Konda Reddy Mopuri,Sumohana S. Channappayya

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:transformer model, Vision Transformer, Discrete Wavelet Transform, language tasks, model have largely

备注: 9 pages, 6 figures

点击查看摘要

Abstract:While insights into the workings of the transformer model have largely emerged by analysing their behaviour on language tasks, this work investigates the representations learnt by the Vision Transformer (ViT) encoder through the lens of compositionality. We introduce a framework, analogous to prior work on measuring compositionality in representation learning, to test for compositionality in the ViT encoder. Crucial to drawing this analogy is the Discrete Wavelet Transform (DWT), which is a simple yet effective tool for obtaining input-dependent primitives in the vision setting. By examining the ability of composed representations to reproduce original image representations, we empirically test the extent to which compositionality is respected in the representation space. Our findings show that primitives from a one-level DWT decomposition produce encoder representations that approximately compose in latent space, offering a new perspective on how ViTs structure information.

49. 【2512.24411】AI-Driven Evaluation of Surgical Skill via Action Recognition

链接https://arxiv.org/abs/2512.24411

作者:Yan Meng,Daniel A. Donoho,Marcelle Altshuler,Omar Arnaout

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:strategies is critical, development of effective, effective training, evaluation strategies, proficiency typically rely

备注

点击查看摘要

Abstract:The development of effective training and evaluation strategies is critical. Conventional methods for assessing surgical proficiency typically rely on expert supervision, either through onsite observation or retrospective analysis of recorded procedures. However, these approaches are inherently subjective, susceptible to inter-rater variability, and require substantial time and effort from expert surgeons. These demands are often impractical in low- and middle-income countries, thereby limiting the scalability and consistency of such methods across training programs. To address these limitations, we propose a novel AI-driven framework for the automated assessment of microanastomosis performance. The system integrates a video transformer architecture based on TimeSformer, improved with hierarchical temporal attention and weighted spatial attention mechanisms, to achieve accurate action recognition within surgical videos. Fine-grained motion features are then extracted using a YOLO-based object detection and tracking method, allowing for detailed analysis of instrument kinematics. Performance is evaluated along five aspects of microanastomosis skill, including overall action execution, motion quality during procedure-critical actions, and general instrument handling. Experimental validation using a dataset of 58 expert-annotated videos demonstrates the effectiveness of the system, achieving 87.7% frame-level accuracy in action segmentation that increased to 93.62% with post-processing, and an average classification accuracy of 76% in replicating expert assessments across all skill aspects. These findings highlight the system's potential to provide objective, consistent, and interpretable feedback, thereby enabling more standardized, data-driven training and evaluation in surgical education.

50. 【2512.24408】DyStream: Streaming Dyadic Talking Heads Generation via Flow Matching-based Autoregressive Model

链接https://arxiv.org/abs/2512.24408

作者:Bohong Chen,Haiyang Liu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:dyadic talking head, Generating realistic, dyadic talking, video requires ultra-low, requires ultra-low latency

备注: Project Page: [this https URL](https://robinwitch.github.io/DyStream-Page)

点击查看摘要

Abstract:Generating realistic, dyadic talking head video requires ultra-low latency. Existing chunk-based methods require full non-causal context windows, introducing significant delays. This high latency critically prevents the immediate, non-verbal feedback required for a realistic listener. To address this, we present DyStream, a flow matching-based autoregressive model that could generate video in real-time from both speaker and listener audio. Our method contains two key designs: (1) we adopt a stream-friendly autoregressive framework with flow-matching heads for probabilistic modeling, and (2) We propose a causal encoder enhanced by a lookahead module to incorporate short future context (e.g., 60 ms) to improve quality while maintaining low latency. Our analysis shows this simple-and-effective method significantly surpass alternative causal strategies, including distillation and generative encoder. Extensive experiments show that DyStream could generate video within 34 ms per frame, guaranteeing the entire system latency remains under 100 ms. Besides, it achieves state-of-the-art lip-sync quality, with offline and online LipSync Confidence scores of 8.13 and 7.61 on HDTF, respectively. The model, weights and codes are available.

51. 【2512.24404】Lifting Vision: Ground to Aerial Localization with Reasoning Guided Planning

链接https://arxiv.org/abs/2512.24404

作者:Soham Pahari,M. Srinivas

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词:Multimodal intelligence development, intelligence development recently, Multimodal intelligence, high level reasoning, development recently show

备注

点击查看摘要

Abstract:Multimodal intelligence development recently show strong progress in visual understanding and high level reasoning. Though, most reasoning system still reply on textual information as the main medium for inference. This limit their effectiveness in spatial tasks such as visual navigation and geo-localization. This work discuss about the potential scope of this field and eventually propose an idea visual reasoning paradigm Geo-Consistent Visual Planning, our introduced framework called Visual Reasoning for Localization, or ViReLoc, which performs planning and localization using only visual representations. The proposed framework learns spatial dependencies and geometric relations that text based reasoning often suffer to understand. By encoding step by step inference in the visual domain and optimizing with reinforcement based objectives, ViReLoc plans routes between two given ground images. The system also integrates contrastive learning and adaptive feature interaction to align cross view perspectives and reduce viewpoint differences. Experiments across diverse navigation and localization scenarios show consistent improvements in spatial reasoning accuracy and cross view retrieval performance. These results establish visual reasoning as a strong complementary approach for navigation and localization, and show that such tasks can be performed without real time global positioning system data, leading to more secure navigation solutions.

52. 【2512.24386】RedunCut: Measurement-Driven Sampling and Accuracy Performance Modeling for Low-Cost Live Video Analytics

链接https://arxiv.org/abs/2512.24386

作者:Gur-Eyal Sela,Kumar Krishna Agrawal,Bharathan Balaji,Joseph Gonzalez,Ion Stoica

类目:Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)

关键词:Live video analytics, massive camera fleets, Live video, runs continuously, camera fleets

备注: 21 pages, 23 figures

点击查看摘要

Abstract:Live video analytics (LVA) runs continuously across massive camera fleets, but inference cost with modern vision models remains high. To address this, dynamic model size selection (DMSS) is an attractive approach: it is content-aware but treats models as black boxes, and could potentially reduce cost by up to 10x without model retraining or modification. Without ground truth labels at runtime, we observe that DMSS methods use two stages per segment: (i) sampling a few models to calculate prediction statistics (e.g., confidences), then (ii) selection of the model size from those statistics. Prior systems fail to generalize to diverse workloads, particularly to mobile videos and lower accuracy targets. We identify that the failure modes stem from inefficient sampling whose cost exceeds its benefit, and inaccurate per-segment accuracy prediction. In this work, we present RedunCut, a new DMSS system that addresses both: It uses a measurement-driven planner that estimates the cost-benefit tradeoff of sampling, and a lightweight, data-driven performance model to improve accuracy prediction. Across road-vehicle, drone, and surveillance videos and multiple model families and tasks, RedunCut reduces compute cost by 14-62% at fixed accuracy and remains robust to limited historical data and to drift.

Comments:
21 pages, 23 figures

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)

Cite as:
arXiv:2512.24386 [cs.CV]

(or
arXiv:2512.24386v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2512.24386

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)</p>
53. 【2512.24385】Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems

链接https://arxiv.org/abs/2512.24385

作者:Song Wang,Lingdong Kong,Xiaolu Liu,Hao Shi,Wentong Li,Jianke Zhu,Steven C. H. Hoi

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:including self-driving vehicles, onboard sensor data, forge true Spatial, true Spatial Intelligence, autonomous systems

备注: Preprint; 38 pages, 7 figures, 9 tables; GitHub at [this https URL](https://github.com/worldbench/awesome-spatial-intelligence)

点击查看摘要

Abstract:The rapid advancement of autonomous systems, including self-driving vehicles and drones, has intensified the need to forge true Spatial Intelligence from multi-modal onboard sensor data. While foundation models excel in single-modal contexts, integrating their capabilities across diverse sensors like cameras and LiDAR to create a unified understanding remains a formidable challenge. This paper presents a comprehensive framework for multi-modal pre-training, identifying the core set of techniques driving progress toward this goal. We dissect the interplay between foundational sensor characteristics and learning strategies, evaluating the role of platform-specific datasets in enabling these advancements. Our central contribution is the formulation of a unified taxonomy for pre-training paradigms: ranging from single-modality baselines to sophisticated unified frameworks that learn holistic representations for advanced tasks like 3D object detection and semantic occupancy prediction. Furthermore, we investigate the integration of textual inputs and occupancy representations to facilitate open-world perception and planning. Finally, we identify critical bottlenecks, such as computational efficiency and model scalability, and propose a roadmap toward general-purpose multi-modal foundation models capable of achieving robust Spatial Intelligence for real-world deployment.

54. 【2512.24384】Geometric Multi-Session Map Merging with Learned Local Descriptors

链接https://arxiv.org/abs/2512.24384

作者:Yanlong Ma,Nakul S. Joshi,Christa S. Robison,Philip R. Osteen,Brett T. Lopez

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:extended autonomous operations, Multi-session map merging, crucial for extended, extended autonomous, autonomous operations

备注

点击查看摘要

Abstract:Multi-session map merging is crucial for extended autonomous operations in large-scale environments. In this paper, we present GMLD, a learning-based local descriptor framework for large-scale multi-session point cloud map merging that systematically aligns maps collected across different sessions with overlapping regions. The proposed framework employs a keypoint-aware encoder and a plane-based geometric transformer to extract discriminative features for loop closure detection and relative pose estimation. To further improve global consistency, we include inter-session scan matching cost factors in the factor-graph optimization stage. We evaluate our framework on the public datasets, as well as self-collected data from diverse environments. The results show accurate and robust map merging with low error, and the learned features deliver strong performance in both loop closure detection and relative pose estimation.

55. 【2512.24340】DermaVQA-DAS: Dermatology Assessment Schema (DAS) Datasets for Closed-Ended Question Answering Segmentation in Patient-Generated Dermatology Images

链接https://arxiv.org/abs/2512.24340

作者:Wen-wai Yim,Yujuan Fu,Asma Ben Abacha,Meliha Yetisgen,Noel Codella,Roberto Andres Novoa,Josep Malvehy

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

关键词:lack patient-authored queries, dermatological image analysis, large-scale annotated datasets, Recent advances, existing benchmarks focus

备注

点击查看摘要

Abstract:Recent advances in dermatological image analysis have been driven by large-scale annotated datasets; however, most existing benchmarks focus on dermatoscopic images and lack patient-authored queries and clinical context, limiting their applicability to patient-centered care. To address this gap, we introduce DermaVQA-DAS, an extension of the DermaVQA dataset that supports two complementary tasks: closed-ended question answering (QA) and dermatological lesion segmentation. Central to this work is the Dermatology Assessment Schema (DAS), a novel expert-developed framework that systematically captures clinically meaningful dermatological features in a structured and standardized form. DAS comprises 36 high-level and 27 fine-grained assessment questions, with multiple-choice options in English and Chinese. Leveraging DAS, we provide expert-annotated datasets for both closed QA and segmentation and benchmark state-of-the-art multimodal models. For segmentation, we evaluate multiple prompting strategies and show that prompt design impacts performance: the default prompt achieves the best results under Mean-of-Max and Mean-of-Mean evaluation aggregation schemes, while an augmented prompt incorporating both patient query title and content yields the highest performance under majority-vote-based microscore evaluation, achieving a Jaccard index of 0.395 and a Dice score of 0.566 with BiomedParse. For closed-ended QA, overall performance is strong across models, with average accuracies ranging from 0.729 to 0.798; o3 achieves the best overall accuracy (0.798), closely followed by GPT-4.1 (0.796), while Gemini-1.5-Pro shows competitive performance within the Gemini family (0.783). We publicly release DermaVQA-DAS, the DAS schema, and evaluation protocols to support and accelerate future research in patient-centered dermatological vision-language modeling (this https URL).

56. 【2512.24338】he Mechanics of CNN Filtering with Rectification

链接https://arxiv.org/abs/2512.24338

作者:Liam Frija-Altrac,Matthew Toews

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:paper proposes elementary, elementary information mechanics, proposes elementary information, quantum mechanics, filtering with rectification

备注

点击查看摘要

Abstract:This paper proposes elementary information mechanics as a new model for understanding the mechanical properties of convolutional filtering with rectification, inspired by physical theories of special relativity and quantum mechanics. We consider kernels decomposed into orthogonal even and odd components. Even components cause image content to diffuse isotropically while preserving the center of mass, analogously to rest or potential energy with zero net momentum. Odd kernels cause directional displacement of the center of mass, analogously to kinetic energy with non-zero momentum. The speed of information displacement is linearly related to the ratio of odd vs total kernel energy. Even-Odd properties are analyzed in the spectral domain via the discrete cosine transform (DCT), where the structure of small convolutional filters (e.g. $3 \times 3$ pixels) is dominated by low-frequency bases, specifically the DC $\Sigma$ and gradient components $\nabla$, which define the fundamental modes of information propagation. To our knowledge, this is the first work demonstrating the link between information processing in generic CNNs and the energy-momentum relation, a cornerstone of modern relativistic physics.

57. 【2512.24331】Spatial-aware Vision Language Model for Autonomous Driving

链接https://arxiv.org/abs/2512.24331

作者:Weijie Wei,Zhipeng Luo,Ling Feng,Venice Erin Liong

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:show significant promise, common sense embedded, show significant, image cues, safety and reliability

备注

点击查看摘要

Abstract:While Vision-Language Models (VLMs) show significant promise for end-to-end autonomous driving by leveraging the common sense embedded in language models, their reliance on 2D image cues for complex scene understanding and decision-making presents a critical bottleneck for safety and reliability. Current image-based methods struggle with accurate metric spatial reasoning and geometric inference, leading to unreliable driving policies. To bridge this gap, we propose LVLDrive (LiDAR-Vision-Language), a novel framework specifically designed to upgrade existing VLMs with robust 3D metric spatial understanding for autonomous driving by incoperating LiDAR point cloud as an extra input modality. A key challenge lies in mitigating the catastrophic disturbance introduced by disparate 3D data to the pre-trained VLMs. To this end, we introduce a Gradual Fusion Q-Former that incrementally injects LiDAR features, ensuring the stability and preservation of the VLM's existing knowledge base. Furthermore, we develop a spatial-aware question-answering (SA-QA) dataset to explicitly teach the model advanced 3D perception and reasoning capabilities. Extensive experiments on driving benchmarks demonstrate that LVLDrive achieves superior performance compared to vision-only counterparts across scene understanding, metric spatial perception, and reliable driving decision-making. Our work highlights the necessity of explicit 3D metric data for building trustworthy VLM-based autonomous systems.

58. 【2512.24330】SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning

链接https://arxiv.org/abs/2512.24330

作者:Yong Xien Chng,Tao Hu,Wenwen Tong,Xueheng Li,Jiandong Chen,Haojia Yu,Jiefan Lu,Hewei Guo,Hanming Deng,Chengjun Xie,Gao Huang,Dahua Lin,Lewei Lu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:remain largely constrained, isolated tool invocation, capabilities remain largely, Multimodal Agentic Reasoning, agentic reasoning

备注

点击查看摘要

Abstract:While Vision-Language Models (VLMs) can solve complex tasks through agentic reasoning, their capabilities remain largely constrained to text-oriented chain-of-thought or isolated tool invocation. They fail to exhibit the human-like proficiency required to seamlessly interleave dynamic tool manipulation with continuous reasoning, particularly in knowledge-intensive and visually complex scenarios that demand coordinated external tools such as search and image cropping. In this work, we introduce SenseNova-MARS, a novel Multimodal Agentic Reasoning and Search framework that empowers VLMs with interleaved visual reasoning and tool-use capabilities via reinforcement learning (RL). Specifically, SenseNova-MARS dynamically integrates the image search, text search, and image crop tools to tackle fine-grained and knowledge-intensive visual understanding challenges. In the RL stage, we propose the Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm to improve the training stability and advance the model's ability to invoke tools and reason effectively. To comprehensively evaluate the agentic VLMs on complex visual tasks, we introduce the HR-MMSearch benchmark, the first search-oriented benchmark composed of high-resolution images with knowledge-intensive and search-driven questions. Experiments demonstrate that SenseNova-MARS achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks. Specifically, on search-oriented benchmarks, SenseNova-MARS-8B scores 67.84 on MMSearch and 41.64 on HR-MMSearch, surpassing proprietary models such as Gemini-3-Flash and GPT-5. SenseNova-MARS represents a promising step toward agentic VLMs by providing effective and robust tool-use capabilities. To facilitate further research in this field, we will release all code, models, and datasets.

59. 【2512.24323】Robust Egocentric Referring Video Object Segmentation via Dual-Modal Causal Intervention

链接https://arxiv.org/abs/2512.24323

作者:Haijing Liu,Zhiyuan Song,Hefeng Wu,Tao Pu,Keze Wang,Liang Lin

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Referring Video Object, specific object actively, object actively involved, Video Object Segmentation, Egocentric Referring Video

备注: NeurIPS 2025

点击查看摘要

Abstract:Egocentric Referring Video Object Segmentation (Ego-RVOS) aims to segment the specific object actively involved in a human action, as described by a language query, within first-person videos. This task is critical for understanding egocentric human behavior. However, achieving such segmentation robustly is challenging due to ambiguities inherent in egocentric videos and biases present in training data. Consequently, existing methods often struggle, learning spurious correlations from skewed object-action pairings in datasets and fundamental visual confounding factors of the egocentric perspective, such as rapid motion and frequent occlusions. To address these limitations, we introduce Causal Ego-REferring Segmentation (CERES), a plug-in causal framework that adapts strong, pre-trained RVOS backbones to the egocentric domain. CERES implements dual-modal causal intervention: applying backdoor adjustment principles to counteract language representation biases learned from dataset statistics, and leveraging front-door adjustment concepts to address visual confounding by intelligently integrating semantic visual features with geometric depth information guided by causal principles, creating representations more robust to egocentric distortions. Extensive experiments demonstrate that CERES achieves state-of-the-art performance on Ego-RVOS benchmarks, highlighting the potential of applying causal reasoning to build more reliable models for broader egocentric video understanding.

60. 【2512.24321】UniAct: Unified Motion Generation and Action Streaming for Humanoid Robots

链接https://arxiv.org/abs/2512.24321

作者:Nan Jiang,Zimo He,Wanhe Yu,Lexi Pang,Yunhao Li,Hongjie Li,Jieming Cui,Yuhan Li,Yizhou Wang,Yixin Zhu,Siyuan Huang

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:versatile agents capable, human-level flexibility, long-standing objective, realization of versatile, versatile agents

备注: Project page: [this https URL](https://jnnan.github.io/uniact/)

点击查看摘要

Abstract:A long-standing objective in humanoid robotics is the realization of versatile agents capable of following diverse multimodal instructions with human-level flexibility. Despite advances in humanoid control, bridging high-level multimodal perception with whole-body execution remains a significant bottleneck. Existing methods often struggle to translate heterogeneous instructions -- such as language, music, and trajectories -- into stable, real-time actions. Here we show that UniAct, a two-stage framework integrating a fine-tuned MLLM with a causal streaming pipeline, enables humanoid robots to execute multimodal instructions with sub-500 ms latency. By unifying inputs through a shared discrete codebook via FSQ, UniAct ensures cross-modal alignment while constraining motions to a physically grounded manifold. This approach yields a 19% improvement in the success rate of zero-shot tracking of imperfect reference motions. We validate UniAct on UniMoCap, our 20-hour humanoid motion benchmark, demonstrating robust generalization across diverse real-world scenarios. Our results mark a critical step toward responsive, general-purpose humanoid assistants capable of seamless interaction through unified perception and control.

61. 【2512.24294】Virtual-Eyes: Quantitative Validation of a Lung CT Quality-Control Pipeline for Foundation-Model Cancer Risk Prediction

链接https://arxiv.org/abs/2512.24294

作者:Md. Enamul Hoq,Linda Larson-Prior,Fred Prior

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:lung cancer screening, Robust preprocessing, rarely quantified, quantified in deep-learning, deep-learning pipelines

备注: 23 pages, and Under Review-MIDL-2026

点击查看摘要

Abstract:Robust preprocessing is rarely quantified in deep-learning pipelines for low-dose CT (LDCT) lung cancer screening. We develop and validate Virtual-Eyes, a clinically motivated 16-bit CT quality-control pipeline, and measure its differential impact on generalist foundation models versus specialist models. Virtual-Eyes enforces strict 512x512 in-plane resolution, rejects short or non-diagnostic series, and extracts a contiguous lung block using Hounsfield-unit filtering and bilateral lung-coverage scoring while preserving the native 16-bit grid. Using 765 NLST patients (182 cancer, 583 non-cancer), we compute slice-level embeddings from RAD-DINO and Merlin with frozen encoders and train leakage-free patient-level MLP heads; we also evaluate Sybil and a 2D ResNet-18 baseline under Raw versus Virtual-Eyes inputs without backbone retraining. Virtual-Eyes improves RAD-DINO slice-level AUC from 0.576 to 0.610 and patient-level AUC from 0.646 to 0.683 (mean pooling) and from 0.619 to 0.735 (max pooling), with improved calibration (Brier score 0.188 to 0.112). In contrast, Sybil and ResNet-18 degrade under Virtual-Eyes (Sybil AUC 0.886 to 0.837; ResNet-18 AUC 0.571 to 0.596) with evidence of context dependence and shortcut learning, and Merlin shows limited transferability (AUC approximately 0.507 to 0.567) regardless of preprocessing. These results demonstrate that anatomically targeted QC can stabilize and improve generalist foundation-model workflows but may disrupt specialist models adapted to raw clinical context.

62. 【2512.24278】One-shot synthesis of rare gastrointestinal lesions improves diagnostic accuracy and clinical training

链接https://arxiv.org/abs/2512.24278

作者:Jia Yu,Yan Zhu,Peiyao Fu,Tianyi Chen,Zhihua Wang,Fei Wu,Quanlin Li,Pinghong Zhou,Shuo Wang,Xian Yang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:reliable artificial intelligence, developing reliable artificial, training novice clinicians, Rare gastrointestinal lesions, routine endoscopy

备注

点击查看摘要

Abstract:Rare gastrointestinal lesions are infrequently encountered in routine endoscopy, restricting the data available for developing reliable artificial intelligence (AI) models and training novice clinicians. Here we present EndoRare, a one-shot, retraining-free generative framework that synthesizes diverse, high-fidelity lesion exemplars from a single reference image. By leveraging language-guided concept disentanglement, EndoRare separates pathognomonic lesion features from non-diagnostic attributes, encoding the former into a learnable prototype embedding while varying the latter to ensure diversity. We validated the framework across four rare pathologies (calcifying fibrous tumor, juvenile polyposis syndrome, familial adenomatous polyposis, and Peutz-Jeghers syndrome). Synthetic images were judged clinically plausible by experts and, when used for data augmentation, significantly enhanced downstream AI classifiers, improving the true positive rate at low false-positive rates. Crucially, a blinded reader study demonstrated that novice endoscopists exposed to EndoRare-generated cases achieved a 0.400 increase in recall and a 0.267 increase in precision. These results establish a practical, data-efficient pathway to bridge the rare-disease gap in both computer-aided diagnostics and clinical education.

63. 【2512.24276】LiftProj: Space Lifting and Projection-Based Panorama Stitching

链接https://arxiv.org/abs/2512.24276

作者:Yuan Jia,Ruimin Wu,Rui Song,Jiaojiao Li,Bin Song

类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词:Traditional image stitching, predominantly utilized two-dimensional, utilized two-dimensional homography, two-dimensional homography transformations, Traditional image

备注: 16 pages, 10 figures

点击查看摘要

Abstract:Traditional image stitching techniques have predominantly utilized two-dimensional homography transformations and mesh warping to achieve alignment on a planar surface. While effective for scenes that are approximately coplanar or exhibit minimal parallax, these approaches often result in ghosting, structural bending, and stretching distortions in non-overlapping regions when applied to real three-dimensional scenes characterized by multiple depth layers and occlusions. Such challenges are exacerbated in multi-view accumulations and 360° closed-loop stitching scenarios. In response, this study introduces a spatially lifted panoramic stitching framework that initially elevates each input image into a dense three-dimensional point representation within a unified coordinate system, facilitating global cross-view fusion augmented by confidence metrics. Subsequently, a unified projection center is established in three-dimensional space, and an equidistant cylindrical projection is employed to map the fused data onto a single panoramic manifold, thereby producing a geometrically consistent 360° panoramic layout. Finally, hole filling is conducted within the canvas domain to address unknown regions revealed by viewpoint transitions, restoring continuous texture and semantic coherence. This framework reconceptualizes stitching from a two-dimensional warping paradigm to a three-dimensional consistency paradigm and is designed to flexibly incorporate various three-dimensional lifting and completion modules. Experimental evaluations demonstrate that the proposed method substantially mitigates geometric distortions and ghosting artifacts in scenarios involving significant parallax and complex occlusions, yielding panoramic results that are more natural and consistent.

64. 【2512.24271】aming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation

链接https://arxiv.org/abs/2512.24271

作者:Zhe Huang,Hao Wen,Aiming Hao,Bingze Song,Meiqi Wu,Jiahong Wu,Xiangxiang Chu,Sheng Lu,Haoqian Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Multimodal Large Language, Multimodal Large, Large Language Models, made remarkable progress, Large Language

备注: 18 pages

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding. However, they suffer from a critical vulnerability: an over-reliance on language priors, which can lead to visual ungrounded hallucinations, especially when processing counterfactual videos that defy common sense. This limitation, stemming from the intrinsic data imbalance between text and video, is challenging to address due to the substantial cost of collecting and annotating counterfactual data. To address this, we introduce DualityForge, a novel counterfactual data synthesis framework that employs controllable, diffusion-based video editing to transform real-world videos into counterfactual scenarios. By embedding structured contextual information into the video editing and QA generation processes, the framework automatically produces high-quality QA pairs together with original-edited video pairs for contrastive training. Based on this, we build DualityVidQA, a large-scale video dataset designed to reduce MLLM hallucinations. In addition, to fully exploit the contrastive nature of our paired data, we propose Duality-Normalized Advantage Training (DNA-Train), a two-stage SFT-RL training regime where the RL phase applies pair-wise $\ell_1$ advantage normalization, thereby enabling a more stable and efficient policy optimization. Experiments on DualityVidQA-Test demonstrate that our method substantially reduces model hallucinations on counterfactual videos, yielding a relative improvement of 24.0% over the Qwen2.5-VL-7B baseline. Moreover, our approach achieves significant gains across both hallucination and general-purpose benchmarks, indicating strong generalization capability. We will open-source our dataset and code.

65. 【2512.24260】Physically-Grounded Manifold Projection with Foundation Priors for Metal Artifact Reduction in Dental CBCT

链接https://arxiv.org/abs/2512.24260

作者:Zhi Li,Yaqi Wang,Bingtao Ma,Yifan Zhang,Huiyu Zhou,Shuai Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Dental CBCT severely, CBCT severely obscure, Dental CBCT, Metal Artifact Reduction, obscure anatomical structures

备注: This manuscript has been submitted to Medical Image Analysis for peer review

点击查看摘要

Abstract:Metal artifacts in Dental CBCT severely obscure anatomical structures, hindering diagnosis. Current deep learning for Metal Artifact Reduction (MAR) faces limitations: supervised methods suffer from spectral blurring due to "regression-to-the-mean", while unsupervised ones risk structural hallucinations. Denoising Diffusion Models (DDPMs) offer realism but rely on slow, stochastic iterative sampling, unsuitable for clinical use. To resolve this, we propose the Physically-Grounded Manifold Projection (PGMP) framework. First, our Anatomically-Adaptive Physics Simulation (AAPS) pipeline synthesizes high-fidelity training pairs via Monte Carlo spectral modeling and patient-specific digital twins, bridging the synthetic-to-real gap. Second, our DMP-Former adapts the Direct x-Prediction paradigm, reformulating restoration as a deterministic manifold projection to recover clean anatomy in a single forward pass, eliminating stochastic sampling. Finally, a Semantic-Structural Alignment (SSA) module anchors the solution using priors from medical foundation models (MedDINOv3), ensuring clinical plausibility. Experiments on synthetic and multi-center clinical datasets show PGMP outperforms state-of-the-art methods on unseen anatomy, setting new benchmarks in efficiency and diagnostic reliability. Code and data: this https URL

66. 【2512.24243】MambaSeg: Harnessing Mamba for Accurate and Efficient Image-Event Semantic Segmentation

链接https://arxiv.org/abs/2512.24243

作者:Fuqiang Gu,Yuanke Li,Xianlei Long,Kangping Ji,Chao Chen,Qingyi Gu,Zhenliang Ni

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:including autonomous driving, Interaction Module, wide-ranging applications, including autonomous, driving and robotics

备注: Accepted by AAAI 2026

点击查看摘要

Abstract:Semantic segmentation is a fundamental task in computer vision with wide-ranging applications, including autonomous driving and robotics. While RGB-based methods have achieved strong performance with CNNs and Transformers, their effectiveness degrades under fast motion, low-light, or high dynamic range conditions due to limitations of frame cameras. Event cameras offer complementary advantages such as high temporal resolution and low latency, yet lack color and texture, making them insufficient on their own. To address this, recent research has explored multimodal fusion of RGB and event data; however, many existing approaches are computationally expensive and focus primarily on spatial fusion, neglecting the temporal dynamics inherent in event streams. In this work, we propose MambaSeg, a novel dual-branch semantic segmentation framework that employs parallel Mamba encoders to efficiently model RGB images and event streams. To reduce cross-modal ambiguity, we introduce the Dual-Dimensional Interaction Module (DDIM), comprising a Cross-Spatial Interaction Module (CSIM) and a Cross-Temporal Interaction Module (CTIM), which jointly perform fine-grained fusion along both spatial and temporal dimensions. This design improves cross-modal alignment, reduces ambiguity, and leverages the complementary properties of each modality. Extensive experiments on the DDD17 and DSEC datasets demonstrate that MambaSeg achieves state-of-the-art segmentation performance while significantly reducing computational cost, showcasing its promise for efficient, scalable, and robust multimodal perception.

67. 【2512.24231】MotivNet: Evolving Meta-Sapiens into an Emotionally Intelligent Foundation Model

链接https://arxiv.org/abs/2512.24231

作者:Rahul Medicharla,Alper Yilmaz

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:facial emotion recognition, generalizable facial emotion, robust real-world application, emotion recognition model, facial emotion

备注: 6 pages, 4 figures

点击查看摘要

Abstract:In this paper, we introduce MotivNet, a generalizable facial emotion recognition model for robust real-world application. Current state-of-the-art FER models tend to have weak generalization when tested on diverse data, leading to deteriorated performance in the real world and hindering FER as a research domain. Though researchers have proposed complex architectures to address this generalization issue, they require training cross-domain to obtain generalizable results, which is inherently contradictory for real-world application. Our model, MotivNet, achieves competitive performance across datasets without cross-domain training by using Meta-Sapiens as a backbone. Sapiens is a human vision foundational model with state-of-the-art generalization in the real world through large-scale pretraining of a Masked Autoencoder. We propose MotivNet as an additional downstream task for Sapiens and define three criteria to evaluate MotivNet's viability as a Sapiens task: benchmark performance, model similarity, and data similarity. Throughout this paper, we describe the components of MotivNet, our training approach, and our results showing MotivNet is generalizable across domains. We demonstrate that MotivNet can be benchmarked against existing SOTA models and meets the listed criteria, validating MotivNet as a Sapiens downstream task, and making FER more incentivizing for in-the-wild application. The code is available at this https URL.

68. 【2512.24227】Mirage: One-Step Video Diffusion for Photorealistic and Coherent Asset Editing in Driving Scenes

链接https://arxiv.org/abs/2512.24227

作者:Shuyun Wang,Haiyang Sun,Bing Wang,Hangjun Ye,Xin Yu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Vision-centric autonomous driving, Vision-centric autonomous, scalable training data, autonomous driving systems, driving systems rely

备注

点击查看摘要

Abstract:Vision-centric autonomous driving systems rely on diverse and scalable training data to achieve robust performance. While video object editing offers a promising path for data augmentation, existing methods often struggle to maintain both high visual fidelity and temporal coherence. In this work, we propose \textbf{Mirage}, a one-step video diffusion model for photorealistic and coherent asset editing in driving scenes. Mirage builds upon a text-to-video diffusion prior to ensure temporal consistency across frames. However, 3D causal variational autoencoders often suffer from degraded spatial fidelity due to compression, and directly passing 3D encoder features to decoder layers breaks temporal causality. To address this, we inject temporally agnostic latents from a pretrained 2D encoder into the 3D decoder to restore detail while preserving causal structures. Furthermore, because scene objects and inserted assets are optimized under different objectives, their Gaussians exhibit a distribution mismatch that leads to pose misalignment. To mitigate this, we introduce a two-stage data alignment strategy combining coarse 3D alignment and fine 2D refinement, thereby improving alignment and providing cleaner supervision. Extensive experiments demonstrate that Mirage achieves high realism and temporal consistency across diverse editing scenarios. Beyond asset editing, Mirage can also generalize to other video-to-video translation tasks, serving as a reliable baseline for future research. Our code is available at this https URL.

69. 【2512.24224】ARM: A Learnable, Plug-and-Play Module for CLIP-based Open-vocabulary Semantic Segmentation

链接https://arxiv.org/abs/2512.24224

作者:Ziquan Liu,Zhewei Zhu,Xuyang Shi

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Open-vocabulary semantic segmentation, precise pixel-level details, lack precise pixel-level, Open-vocabulary semantic, semantic segmentation

备注: 10 pages, 4 figures

点击查看摘要

Abstract:Open-vocabulary semantic segmentation (OVSS) is fundamentally hampered by the coarse, image-level representations of CLIP, which lack precise pixel-level details. Existing training-free methods attempt to resolve this by either importing priors from costly external foundation models (e.g., SAM, DINO) or by applying static, hand-crafted heuristics to CLIP's internal features. These approaches are either computationally expensive or sub-optimal. We propose the Attention Refinement Module (ARM), a lightweight, learnable module that effectively unlocks and refines CLIP's internal potential. Unlike static-fusion methods, ARM learns to adaptively fuse hierarchical features. It employs a semantically-guided cross-attention block, using robust deep features (K, V) to select and refine detail-rich shallow features (Q), followed by a self-attention block. The key innovation lies in a ``train once, use anywhere" paradigm. Trained once on a general-purpose dataset (e.g., COCO-Stuff), ARM acts as a universal plug-and-play post-processor for diverse training-free frameworks. Extensive experiments show that ARM consistently boosts baseline performance on multiple benchmarks with negligible inference overhead, establishing an efficient and effective paradigm for training-free OVSS.

70. 【2512.24214】Medical Image Classification on Imbalanced Data Using ProGAN and SMA-Optimized ResNet: Application to COVID-19

链接https://arxiv.org/abs/2512.24214

作者:Sina Jahromi,Farshid Hajati,Alireza Rezaee,Javaher Nourian

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:images belonging, proposed model, data, challenge, proposed

备注

点击查看摘要

Abstract:The challenge of imbalanced data is prominent in medical image classification. This challenge arises when there is a significant disparity in the number of images belonging to a particular class, such as the presence or absence of a specific disease, as compared to the number of images belonging to other classes. This issue is especially notable during pandemics, which may result in an even more significant imbalance in the dataset. Researchers have employed various approaches in recent years to detect COVID-19 infected individuals accurately and quickly, with artificial intelligence and machine learning algorithms at the forefront. However, the lack of sufficient and balanced data remains a significant obstacle to these methods. This study addresses the challenge by proposing a progressive generative adversarial network to generate synthetic data to supplement the real ones. The proposed method suggests a weighted approach to combine synthetic data with real ones before inputting it into a deep network classifier. A multi-objective meta-heuristic population-based optimization algorithm is employed to optimize the hyper-parameters of the classifier. The proposed model exhibits superior cross-validated metrics compared to existing methods when applied to a large and imbalanced chest X-ray image dataset of COVID-19. The proposed model achieves 95.5% and 98.5% accuracy for 4-class and 2-class imbalanced classification problems, respectively. The successful experimental outcomes demonstrate the effectiveness of the proposed model in classifying medical images using imbalanced data during pandemics.

71. 【2512.24212】RANGER: A Monocular Zero-Shot Semantic Navigation Framework through Contextual Adaptation

链接https://arxiv.org/abs/2512.24212

作者:Ming-Ming Yu,Yi Chen,Börje F. Karlsson,Wenjun Wu

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:Efficiently finding targets, Efficiently finding, real-world embodied applications, embodied applications, finding targets

备注

点击查看摘要

Abstract:Efficiently finding targets in complex environments is fundamental to real-world embodied applications. While recent advances in multimodal foundation models have enabled zero-shot object goal navigation, allowing robots to search for arbitrary objects without fine-tuning, existing methods face two key limitations: (1) heavy reliance on precise depth and pose information provided by simulators, which restricts applicability in real-world scenarios; and (2) lack of in-context learning (ICL) capability, making it difficult to quickly adapt to new environments, as in leveraging short videos. To address these challenges, we propose RANGER, a novel zero-shot, open-vocabulary semantic navigation framework that operates using only a monocular camera. Leveraging powerful 3D foundation models, RANGER eliminates the dependency on depth and pose while exhibiting strong ICL capability. By simply observing a short video of a new environment, the system can also significantly improve task efficiency without requiring architectural modifications or fine-tuning. The framework integrates several key components: keyframe-based 3D reconstruction, semantic point cloud generation, vision-language model (VLM)-driven exploration value estimation, high-level adaptive waypoint selection, and low-level action execution. Experiments on the HM3D benchmark and real-world environments demonstrate that RANGER achieves competitive performance in terms of navigation success rate and exploration efficiency, while showing superior ICL adaptability, with no previous 3D mapping of the environment required.

72. 【2512.24195】CorGi: Contribution-Guided Block-Wise Interval Caching for Training-Free Acceleration of Diffusion Transformers

链接https://arxiv.org/abs/2512.24195

作者:Yonglak Son,Suhyeok Kim,Seungryong Kim,Young Geun Kim

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:larger capacity leads, iterative denoising process, achieves remarkable performance, denoising process combined, Diffusion transformer

备注: 16 pages, 20 figures

点击查看摘要

Abstract:Diffusion transformer (DiT) achieves remarkable performance in visual generation, but its iterative denoising process combined with larger capacity leads to a high inference cost. Recent works have demonstrated that the iterative denoising process of DiT models involves substantial redundant computation across steps. To effectively reduce the redundant computation in DiT, we propose CorGi (Contribution-Guided Block-Wise Interval Caching), training-free DiT inference acceleration framework that selectively reuses the outputs of transformer blocks in DiT across denoising steps. CorGi caches low-contribution blocks and reuses them in later steps within each interval to reduce redundant computation while preserving generation quality. For text-to-image tasks, we further propose CorGi+, which leverages per-block cross-attention maps to identify salient tokens and applies partial attention updates to protect important object details. Evaluation on the state-of-the-art DiT models demonstrates that CorGi and CorGi+ achieve up to 2.0x speedup on average, while preserving high generation quality.

73. 【2512.24193】PointRAFT: 3D deep learning for high-throughput prediction of potato tuber weight from partial point clouds

链接https://arxiv.org/abs/2512.24193

作者:Pieter M. Blok,Haozhou Wang,Hyun Kwon Suh,Peicheng Wang,James Burridge,Wei Guo

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:optimizing cultivation practices, practices in agriculture, indicator for optimizing, optimizing cultivation, cultivation practices

备注: 14 pages, 7 figures, 3 tables

点击查看摘要

Abstract:Potato yield is a key indicator for optimizing cultivation practices in agriculture. Potato yield can be estimated on harvesters using RGB-D cameras, which capture three-dimensional (3D) information of individual tubers moving along the conveyor belt. However, point clouds reconstructed from RGB-D images are incomplete due to self-occlusion, leading to systematic underestimation of tuber weight. To address this, we introduce PointRAFT, a high-throughput point cloud regression network that directly predicts continuous 3D shape properties, such as tuber weight, from partial point clouds. Rather than reconstructing full 3D geometry, PointRAFT infers target values directly from raw 3D data. Its key architectural novelty is an object height embedding that incorporates tuber height as an additional geometric cue, improving weight prediction under practical harvesting conditions. PointRAFT was trained and evaluated on 26,688 partial point clouds collected from 859 potato tubers across four cultivars and three growing seasons on an operational harvester in Japan. On a test set of 5,254 point clouds from 172 tubers, PointRAFT achieved a mean absolute error of 12.0 g and a root mean squared error of 17.2 g, substantially outperforming a linear regression baseline and a standard PointNet++ regression network. With an average inference time of 6.3 ms per point cloud, PointRAFT supports processing rates of up to 150 tubers per second, meeting the high-throughput requirements of commercial potato harvesters. Beyond potato weight estimation, PointRAFT provides a versatile regression network applicable to a wide range of 3D phenotyping and robotic perception tasks. The code, network weights, and a subset of the dataset are publicly available at this https URL.

74. 【2512.24176】Guiding a Diffusion Transformer with the Internal Dynamics of Itself

链接https://arxiv.org/abs/2512.24176

作者:Xingyu Zhou,Qifan Li,Xiaobin Hu,Hai Chen,Shuhang Gu

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:diffusion model presents, capture the entire, presents a powerful, powerful ability, ability to capture

备注: Project Page: [this https URL](https://zhouxingyu13.github.io/Internal-Guidance/)

点击查看摘要

Abstract:The diffusion model presents a powerful ability to capture the entire (conditional) data distribution. However, due to the lack of sufficient training and data to learn to cover low-probability areas, the model will be penalized for failing to generate high-quality images corresponding to these areas. To achieve better generation quality, guidance strategies such as classifier free guidance (CFG) can guide the samples to the high-probability areas during the sampling stage. However, the standard CFG often leads to over-simplified or distorted samples. On the other hand, the alternative line of guiding diffusion model with its bad version is limited by carefully designed degradation strategies, extra training and additional sampling steps. In this paper, we proposed a simple yet effective strategy Internal Guidance (IG), which introduces an auxiliary supervision on the intermediate layer during training process and extrapolates the intermediate and deep layer's outputs to obtain generative results during sampling process. This simple strategy yields significant improvements in both training efficiency and generation quality on various baselines. On ImageNet 256x256, SiT-XL/2+IG achieves FID=5.31 and FID=1.75 at 80 and 800 epochs. More impressively, LightningDiT-XL/1+IG achieves FID=1.34 which achieves a large margin between all of these methods. Combined with CFG, LightningDiT-XL/1+IG achieves the current state-of-the-art FID of 1.19.

75. 【2512.24172】Deep Global Clustering for Hyperspectral Image Segmentation: Concepts, Applications, and Open Challenges

链接https://arxiv.org/abs/2512.24172

作者:Yu-Tang Chang,Pin-Wei Chen,Shih-Fang Chen

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:analysis faces computational, faces computational bottlenecks, Hyperspectral imaging, computational bottlenecks due, massive data volumes

备注: 10 pages, 4 figures. Technical report extending ACPA 2025 conference paper. Code and data available at [this https URL](https://github.com/b05611038/HSI_global_clustering)

点击查看摘要

Abstract:Hyperspectral imaging (HSI) analysis faces computational bottlenecks due to massive data volumes that exceed available memory. While foundation models pre-trained on large remote sensing datasets show promise, their learned representations often fail to transfer to domain-specific applications like close-range agricultural monitoring where spectral signatures, spatial scales, and semantic targets differ fundamentally. This report presents Deep Global Clustering (DGC), a conceptual framework for memory-efficient HSI segmentation that learns global clustering structure from local patch observations without pre-training. DGC operates on small patches with overlapping regions to enforce consistency, enabling training in under 30 minutes on consumer hardware while maintaining constant memory usage. On a leaf disease dataset, DGC achieves background-tissue separation (mean IoU 0.925) and demonstrates unsupervised disease detection through navigable semantic granularity. However, the framework suffers from optimization instability rooted in multi-objective loss balancing: meaningful representations emerge rapidly but degrade due to cluster over-merging in feature space. We position this work as intellectual scaffolding - the design philosophy has merit, but stable implementation requires principled approaches to dynamic loss balancing. Code and data are available at this https URL.

76. 【2512.24165】DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models

链接https://arxiv.org/abs/2512.24165

作者:Zefeng He,Xiaoye Qu,Yafu Li,Tong Zhu,Siyuan Huang,Yu Cheng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Multimodal Large Language, Large Language Models, recent Multimodal Large, Large Language, remain predominantly text-centric

备注: Project page: [this https URL](https://diffthinker-project.github.io)

点击查看摘要

Abstract:While recent Multimodal Large Language Models (MLLMs) have attained significant strides in multimodal reasoning, their reasoning processes remain predominantly text-centric, leading to suboptimal performance in complex long-horizon, vision-centric tasks. In this paper, we establish a novel Generative Multimodal Reasoning paradigm and introduce DiffThinker, a diffusion-based reasoning framework. Conceptually, DiffThinker reformulates multimodal reasoning as a native generative image-to-image task, achieving superior logical consistency and spatial precision in vision-centric tasks. We perform a systematic comparison between DiffThinker and MLLMs, providing the first in-depth investigation into the intrinsic characteristics of this paradigm, revealing four core properties: efficiency, controllability, native parallelism, and collaboration. Extensive experiments across four domains (sequential planning, combinatorial optimization, constraint satisfaction, and spatial configuration) demonstrate that DiffThinker significantly outperforms leading closed source models including GPT-5 (+314.2\%) and Gemini-3-Flash (+111.6\%), as well as the fine-tuned Qwen3-VL-32B baseline (+39.0\%), highlighting generative multimodal reasoning as a promising approach for vision-centric reasoning.

77. 【2512.24162】Bayesian Self-Distillation for Image Classification

链接https://arxiv.org/abs/2512.24162

作者:Anton Adelöw,Matteo Gamba,Atsuto Maki

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:classification typically relies, Supervised training, deep neural networks, hard targets, neural networks

备注: 17 pages, 17 figures

点击查看摘要

Abstract:Supervised training of deep neural networks for classification typically relies on hard targets, which promote overconfidence and can limit calibration, generalization, and robustness. Self-distillation methods aim to mitigate this by leveraging inter-class and sample-specific information present in the model's own predictions, but often remain dependent on hard targets, reducing their effectiveness. With this in mind, we propose Bayesian Self-Distillation (BSD), a principled method for constructing sample-specific target distributions via Bayesian inference using the model's own predictions. Unlike existing approaches, BSD does not rely on hard targets after initialization. BSD consistently yields higher test accuracy (e.g. +1.4% for ResNet-50 on CIFAR-100) and significantly lower Expected Calibration Error (ECE) (-40% ResNet-50, CIFAR-100) than existing architecture-preserving self-distillation methods for a range of deep architectures and datasets. Additional benefits include improved robustness against data corruptions, perturbations, and label noise. When combined with a contrastive loss, BSD achieves state-of-the-art robustness under label noise for single-stage, single-network methods.

78. 【2512.24160】owards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset

链接https://arxiv.org/abs/2512.24160

作者:TsaiChing Ni,ZhenQi Chen,YuanFu Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:advance multimodal learning, aligned image-text pairs, Multimodal Defect Dataset, large-scale Industrial Multimodal, Industrial Multimodal Defect

备注

点击查看摘要

Abstract:We present IMDD-1M, the first large-scale Industrial Multimodal Defect Dataset comprising 1,000,000 aligned image-text pairs, designed to advance multimodal learning for manufacturing and quality inspection. IMDD-1M contains high-resolution real-world defects spanning over 60 material categories and more than 400 defect types, each accompanied by expert-verified annotations and fine-grained textual descriptions detailing defect location, severity, and contextual attributes. This dataset enables a wide spectrum of applications, including classification, segmentation, retrieval, captioning, and generative modeling. Building upon IMDD-1M, we train a diffusion-based vision-language foundation model from scratch, specifically tailored for industrial scenarios. The model serves as a generalizable foundation that can be efficiently adapted to specialized domains through lightweight fine-tuning. With less than 5% of the task-specific data required by dedicated expert models, it achieves comparable performance, highlighting the potential of data-efficient foundation model adaptation for industrial inspection and generation, paving the way for scalable, domain-adaptive, and knowledge-grounded manufacturing intelligence.

79. 【2512.24146】aming Preference Mode Collapse via Directional Decoupling Alignment in Diffusion Reinforcement Learning

链接https://arxiv.org/abs/2512.24146

作者:Chubin Chen,Sujie Hu,Jiashu Zhu,Meiqi Wu,Jintao Chen,Yanxun Li,Nisha Huang,Chengyu Fang,Jiahong Wu,Xiangxiang Chu,Xiu Li

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Reinforcement Learning, demonstrated significant progress, Recent studies, Human Feedback, progress in aligning

备注

点击查看摘要

Abstract:Recent studies have demonstrated significant progress in aligning text-to-image diffusion models with human preference via Reinforcement Learning from Human Feedback. However, while existing methods achieve high scores on automated reward metrics, they often lead to Preference Mode Collapse (PMC)-a specific form of reward hacking where models converge on narrow, high-scoring outputs (e.g., images with monolithic styles or pervasive overexposure), severely degrading generative diversity. In this work, we introduce and quantify this phenomenon, proposing DivGenBench, a novel benchmark designed to measure the extent of PMC. We posit that this collapse is driven by over-optimization along the reward model's inherent biases. Building on this analysis, we propose Directional Decoupling Alignment (D$^2$-Align), a novel framework that mitigates PMC by directionally correcting the reward signal. Specifically, our method first learns a directional correction within the reward model's embedding space while keeping the model frozen. This correction is then applied to the reward signal during the optimization process, preventing the model from collapsing into specific modes and thereby maintaining diversity. Our comprehensive evaluation, combining qualitative analysis with quantitative metrics for both quality and diversity, reveals that D$^2$-Align achieves superior alignment with human preference.

80. 【2512.24138】GARDO: Reinforcing Diffusion Models without Reward Hacking

链接https://arxiv.org/abs/2512.24138

作者:Haoran He,Yuxiao Ye,Jie Liu,Jiajun Liang,Zhiyong Wang,Ziyang Yuan,Xintao Wang,Hangyu Mao,Pengfei Wan,Ling Pan

类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:Fine-tuning diffusion models, shown great potential, Fine-tuning diffusion, online reinforcement learning, reinforcement learning

备注: 17 pages. Project: [this https URL](https://tinnerhrhe.github.io/gardo_project)

点击查看摘要

Abstract:Fine-tuning diffusion models via online reinforcement learning (RL) has shown great potential for enhancing text-to-image alignment. However, since precisely specifying a ground-truth objective for visual tasks remains challenging, the models are often optimized using a proxy reward that only partially captures the true goal. This mismatch often leads to reward hacking, where proxy scores increase while real image quality deteriorates and generation diversity collapses. While common solutions add regularization against the reference policy to prevent reward hacking, they compromise sample efficiency and impede the exploration of novel, high-reward regions, as the reference policy is usually sub-optimal. To address the competing demands of sample efficiency, effective exploration, and mitigation of reward hacking, we propose Gated and Adaptive Regularization with Diversity-aware Optimization (GARDO), a versatile framework compatible with various RL algorithms. Our key insight is that regularization need not be applied universally; instead, it is highly effective to selectively penalize a subset of samples that exhibit high uncertainty. To address the exploration challenge, GARDO introduces an adaptive regularization mechanism wherein the reference model is periodically updated to match the capabilities of the online policy, ensuring a relevant regularization target. To address the mode collapse issue in RL, GARDO amplifies the rewards for high-quality samples that also exhibit high diversity, encouraging mode coverage without destabilizing the optimization process. Extensive experiments across diverse proxy rewards and hold-out unseen metrics consistently show that GARDO mitigates reward hacking and enhances generation diversity without sacrificing sample efficiency or exploration, highlighting its effectiveness and robustness.

81. 【2512.24120】Enhancing LLM-Based Neural Network Generation: Few-Shot Prompting and Efficient Validation for Automated Architecture Design

链接https://arxiv.org/abs/2512.24120

作者:Chandini Vysyaraju,Raghuvir Duvvuri,Avi Goyal,Dmitry Ignatov,Radu Timofte

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:computer vision, Automated neural network, neural network architecture, remains a significant, vision

备注

点击查看摘要

Abstract:Automated neural network architecture design remains a significant challenge in computer vision. Task diversity and computational constraints require both effective architectures and efficient search methods. Large Language Models (LLMs) present a promising alternative to computationally intensive Neural Architecture Search (NAS), but their application to architecture generation in computer vision has not been systematically studied, particularly regarding prompt engineering and validation strategies. Building on the task-agnostic NNGPT/LEMUR framework, this work introduces and validates two key contributions for computer vision. First, we present Few-Shot Architecture Prompting (FSAP), the first systematic study of the number of supporting examples (n = 1, 2, 3, 4, 5, 6) for LLM-based architecture generation. We find that using n = 3 examples best balances architectural diversity and context focus for vision tasks. Second, we introduce Whitespace-Normalized Hash Validation, a lightweight deduplication method (less than 1 ms) that provides a 100x speedup over AST parsing and prevents redundant training of duplicate computer vision architectures. In large-scale experiments across seven computer vision benchmarks (MNIST, CIFAR-10, CIFAR-100, CelebA, ImageNette, SVHN, Places365), we generated 1,900 unique architectures. We also introduce a dataset-balanced evaluation methodology to address the challenge of comparing architectures across heterogeneous vision tasks. These contributions provide actionable guidelines for LLM-based architecture search in computer vision and establish rigorous evaluation practices, making automated design more accessible to researchers with limited computational resources.

82. 【2512.24119】GeoBench: Rethinking Multimodal Geometric Problem-Solving via Hierarchical Evaluation

链接https://arxiv.org/abs/2512.24119

作者:Yuan Feng,Yue Yang,Xiaohan He,Jiatong Zhao,Jianlong Chen,Zijun Chen,Daocheng Fu,Qi Liu,Renqiu Xia,Bo Zhang,Junchi Yan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:requiring precise analysis, problem solving constitutes, Geometric problem solving, Rigorous Theorem Application, requiring precise

备注

点击查看摘要

Abstract:Geometric problem solving constitutes a critical branch of mathematical reasoning, requiring precise analysis of shapes and spatial relationships. Current evaluations of geometric reasoning in vision-language models (VLMs) face limitations, including the risk of test data contamination from textbook-based benchmarks, overemphasis on final answers over reasoning processes, and insufficient diagnostic granularity. To address these issues, we present GeoBench, a hierarchical benchmark featuring four reasoning levels in geometric problem-solving: Visual Perception, Goal-Oriented Planning, Rigorous Theorem Application, and Self-Reflective Backtracking. Through six formally verified tasks generated via TrustGeoGen, we systematically assess capabilities ranging from attribute extraction to logical error correction. Experiments reveal that while reasoning models like OpenAI-o3 outperform general MLLMs, performance declines significantly with increasing task complexity. Key findings demonstrate that sub-goal decomposition and irrelevant premise filtering critically influence final problem-solving accuracy, whereas Chain-of-Thought prompting unexpectedly degrades performance in some tasks. These findings establish GeoBench as a comprehensive benchmark while offering actionable guidelines for developing geometric problem-solving systems.

83. 【2512.24111】Guided Diffusion-based Generation of Adversarial Objects for Real-World Monocular Depth Estimation Attacks

链接https://arxiv.org/abs/2512.24111

作者:Yongtao Chen,Yanbo Wang,Wentao Zhao,Guole Shen,Tianchen Deng,Jingchuan Wang

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:Monocular Depth Estimation, remains highly susceptible, Monocular Depth, core perception module, Depth Estimation

备注

点击查看摘要

Abstract:Monocular Depth Estimation (MDE) serves as a core perception module in autonomous driving systems, but it remains highly susceptible to adversarial attacks. Errors in depth estimation may propagate through downstream decision making and influence overall traffic safety. Existing physical attacks primarily rely on texture-based patches, which impose strict placement constraints and exhibit limited realism, thereby reducing their effectiveness in complex driving environments. To overcome these limitations, this work introduces a training-free generative adversarial attack framework that generates naturalistic, scene-consistent adversarial objects via a diffusion-based conditional generation process. The framework incorporates a Salient Region Selection module that identifies regions most influential to MDE and a Jacobian Vector Product Guidance mechanism that steers adversarial gradients toward update directions supported by the pre-trained diffusion model. This formulation enables the generation of physically plausible adversarial objects capable of inducing substantial adversarial depth shifts. Extensive digital and physical experiments demonstrate that our method significantly outperforms existing attacks in effectiveness, stealthiness, and physical deployability, underscoring its strong practical implications for autonomous driving safety assessment.

84. 【2512.24100】hink Before You Move: Latent Motion Reasoning for Text-to-Motion Generation

链接https://arxiv.org/abs/2512.24100

作者:Yijie Qian,Juncheng Wang,Yuxiang Feng,Chao Xu,Wang Lu,Yang Liu,Baigui Sun,Yiqiang Chen,Yong Liu,Shujun Wang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:paradigms predominantly treat, direct translation problem, mapping symbolic language, Semantic-Kinematic Impedance Mismatch, symbolic language directly

备注: project page: [this https URL](https://chenhaoqcdyq.github.io/LMR/)

点击查看摘要

Abstract:Current state-of-the-art paradigms predominantly treat Text-to-Motion (T2M) generation as a direct translation problem, mapping symbolic language directly to continuous poses. While effective for simple actions, this System 1 approach faces a fundamental theoretical bottleneck we identify as the Semantic-Kinematic Impedance Mismatch: the inherent difficulty of grounding semantically dense, discrete linguistic intent into kinematically dense, high-frequency motion data in a single shot. In this paper, we argue that the solution lies in an architectural shift towards Latent System 2 Reasoning. Drawing inspiration from Hierarchical Motor Control in cognitive science, we propose Latent Motion Reasoning (LMR) that reformulates generation as a two-stage Think-then-Act decision process. Central to LMR is a novel Dual-Granularity Tokenizer that disentangles motion into two distinct manifolds: a compressed, semantically rich Reasoning Latent for planning global topology, and a high-frequency Execution Latent for preserving physical fidelity. By forcing the model to autoregressively reason (plan the coarse trajectory) before it moves (instantiates the frames), we effectively bridge the ineffability gap between language and physics. We demonstrate LMR's versatility by implementing it for two representative baselines: T2M-GPT (discrete) and MotionStreamer (continuous). Extensive experiments show that LMR yields non-trivial improvements in both semantic alignment and physical plausibility, validating that the optimal substrate for motion planning is not natural language, but a learned, motion-aligned concept space. Codes and demos can be found in \hyperlink{this https URL}{this https URL}

85. 【2512.24097】Factorized Learning for Temporally Grounded Video-Language Models

链接https://arxiv.org/abs/2512.24097

作者:Wenzheng Zeng,Difei Gao,Mike Zheng Shou,Hwee Tou Ng

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)

关键词:Recent video-language models, shown great potential, Recent video-language, video understanding, temporal grounding

备注: ICCV 2025 paper. This arXiv version updates Figure 1 to include the concurrent work Qwen2.5-VL to ensure consistency with Table 1

点击查看摘要

Abstract:Recent video-language models have shown great potential for video understanding, but still struggle with accurate temporal grounding for event-level perception. We observe that two main factors in video understanding (i.e., temporal grounding and textual response) form a logical hierarchy: accurate temporal evidence grounding lays the foundation for reliable textual response. However, existing works typically handle these two tasks in a coupled manner without a clear logical structure, leading to sub-optimal objectives. We address this from a factorized learning perspective. We first propose D$^2$VLM, a framework that decouples the learning of these two tasks while also emphasizing their inherent dependency. We adopt a "grounding then answering with evidence referencing" paradigm and introduce evidence tokens for evidence grounding, which emphasize event-level visual semantic capture beyond the focus on timestamp representation in existing works. To further facilitate the learning of these two tasks, we introduce a novel factorized preference optimization (FPO) algorithm. Unlike standard preference optimization, FPO explicitly incorporates probabilistic temporal grounding modeling into the optimization objective, enabling preference learning for both temporal grounding and textual response. We also construct a synthetic dataset to address the lack of suitable datasets for factorized preference learning with explicit temporal grounding. Experiments on various tasks demonstrate the clear advantage of our approach. Our source code is available at this https URL.

86. 【2512.24086】RainFusion2.0: Temporal-Spatial Awareness and Hardware-Efficient Block-wise Sparse Attention

链接https://arxiv.org/abs/2512.24086

作者:Aiyue Chen,Yaofu Liu,Junjian Huang,Guang Lian,Yiwu Yao,Wangli Lan,Jing Lin,Zhixin Ma,Tingting Zhou,Harry Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Diffusion Transformer, incur extremely high, extremely high computational, models incur extremely, practical applications

备注

点击查看摘要

Abstract:In video and image generation tasks, Diffusion Transformer (DiT) models incur extremely high computational costs due to attention mechanisms, which limits their practical applications. Furthermore, with hardware advancements, a wide range of devices besides graphics processing unit (GPU), such as application-specific integrated circuit (ASIC), have been increasingly adopted for model inference. Sparse attention, which leverages the inherent sparsity of attention by skipping computations for insignificant tokens, is an effective approach to mitigate computational costs. However, existing sparse attention methods have two critical limitations: the overhead of sparse pattern prediction and the lack of hardware generality, as most of these methods are designed for GPU. To address these challenges, this study proposes RainFusion2.0, which aims to develop an online adaptive, hardware-efficient, and low-overhead sparse attention mechanism to accelerate both video and image generative models, with robust performance across diverse hardware platforms. Key technical insights include: (1) leveraging block-wise mean values as representative tokens for sparse mask prediction; (2) implementing spatiotemporal-aware token permutation; and (3) introducing a first-frame sink mechanism specifically designed for video generation scenarios. Experimental results demonstrate that RainFusion2.0 can achieve 80% sparsity while achieving an end-to-end speedup of 1.5~1.8x without compromising video quality. Moreover, RainFusion2.0 demonstrates effectiveness across various generative models and validates its generalization across diverse hardware platforms.

87. 【2512.24074】Balanced Hierarchical Contrastive Learning with Decoupled Queries for Fine-grained Object Detection in Remote Sensing Images

链接https://arxiv.org/abs/2512.24074

作者:Jingzhou Chen,Dexin Chen,Fengchao Xiong,Yuntao Qian,Liang Xiao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Fine-grained remote sensing, remote sensing datasets, remote sensing, structures to differentiate, annotated across multiple

备注

点击查看摘要

Abstract:Fine-grained remote sensing datasets often use hierarchical label structures to differentiate objects in a coarse-to-fine manner, with each object annotated across multiple levels. However, embedding this semantic hierarchy into the representation learning space to improve fine-grained detection performance remains challenging. Previous studies have applied supervised contrastive learning at different hierarchical levels to group objects under the same parent class while distinguishing sibling subcategories. Nevertheless, they overlook two critical issues: (1) imbalanced data distribution across the label hierarchy causes high-frequency classes to dominate the learning process, and (2) learning semantic relationships among categories interferes with class-agnostic localization. To address these issues, we propose a balanced hierarchical contrastive loss combined with a decoupled learning strategy within the detection transformer (DETR) framework. The proposed loss introduces learnable class prototypes and equilibrates gradients contributed by different classes at each hierarchical level, ensuring that each hierarchical class contributes equally to the loss computation in every mini-batch. The decoupled strategy separates DETR's object queries into classification and localization sets, enabling task-specific feature extraction and optimization. Experiments on three fine-grained datasets with hierarchical annotations demonstrate that our method outperforms state-of-the-art approaches.

88. 【2512.24066】Pathology Context Recalibration Network for Ocular Disease Recognition

链接https://arxiv.org/abs/2512.24066

作者:Zunjie Xiao,Xiaoqing Zhang,Risa Higashita,Jiang Liu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:ocular disease recognition, expert experience play, ocular disease, clinical pathology context, play significant roles

备注: The article has been accepted for publication at Machine Intelligence Research (MIR)

点击查看摘要

Abstract:Pathology context and expert experience play significant roles in clinical ocular disease diagnosis. Although deep neural networks (DNNs) have good ocular disease recognition results, they often ignore exploring the clinical pathology context and expert experience priors to improve ocular disease recognition performance and decision-making interpretability. To this end, we first develop a novel Pathology Recalibration Module (PRM) to leverage the potential of pathology context prior via the combination of the well-designed pixel-wise context compression operator and pathology distribution concentration operator; then this paper applies a novel expert prior Guidance Adapter (EPGA) to further highlight significant pixel-wise representation regions by fully mining the expert experience prior. By incorporating PRM and EPGA into the modern DNN, the PCRNet is constructed for automated ocular disease recognition. Additionally, we introduce an Integrated Loss (IL) to boost the ocular disease recognition performance of PCRNet by considering the effects of sample-wise loss distributions and training label frequencies. The extensive experiments on three ocular disease datasets demonstrate the superiority of PCRNet with IL over state-of-the-art attention-based networks and advanced loss methods. Further visualization analysis explains the inherent behavior of PRM and EPGA that affects the decision-making process of DNNs.

89. 【2512.24064】Neighbor-aware Instance Refining with Noisy Labels for Cross-Modal Retrieval

链接https://arxiv.org/abs/2512.24064

作者:Yizhi Liu,Ruitao Pu,Shilin Xu,Yingke Chen,Quan-Hui Liu,Yuan Sun

类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词:made significant progress, Neighbor-aware Instance Refining, recent years, made significant, significant progress

备注: 9 pages, 4 figures, and AAAI-26 conference

点击查看摘要

Abstract:In recent years, Cross-Modal Retrieval (CMR) has made significant progress in the field of multi-modal analysis. However, since it is time-consuming and labor-intensive to collect large-scale and well-annotated data, the annotation of multi-modal data inevitably contains some noise. This will degrade the retrieval performance of the model. To tackle the problem, numerous robust CMR methods have been developed, including robust learning paradigms, label calibration strategies, and instance selection mechanisms. Unfortunately, they often fail to simultaneously satisfy model performance ceilings, calibration reliability, and data utilization rate. To overcome the limitations, we propose a novel robust cross-modal learning framework, namely Neighbor-aware Instance Refining with Noisy Labels (NIRNL). Specifically, we first propose Cross-modal Margin Preserving (CMP) to adjust the relative distance between positive and negative pairs, thereby enhancing the discrimination between sample pairs. Then, we propose Neighbor-aware Instance Refining (NIR) to identify pure subset, hard subset, and noisy subset through cross-modal neighborhood consensus. Afterward, we construct different tailored optimization strategies for this fine-grained partitioning, thereby maximizing the utilization of all available data while mitigating error propagation. Extensive experiments on three benchmark datasets demonstrate that NIRNL achieves state-of-the-art performance, exhibiting remarkable robustness, especially under high noise rates.

90. 【2512.24035】Reinforced Diffusion: Learning to Push the Limits of Anisotropic Diffusion for Image Denoising

链接https://arxiv.org/abs/2512.24035

作者:Xinran Qin,Yuhui Quan,Ruotao Xu,Hui Ji

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:image recovery tasks, recovery tasks, important problem, problem in low-level, low-level vision

备注

点击查看摘要

Abstract:Image denoising is an important problem in low-level vision and serves as a critical module for many image recovery tasks. Anisotropic diffusion is a wide family of image denoising approaches with promising performance. However, traditional anisotropic diffusion approaches use explicit diffusion operators which are not well adapted to complex image structures. As a result, their performance is limited compared to recent learning-based approaches. In this work, we describe a trainable anisotropic diffusion framework based on reinforcement learning. By modeling the denoising process as a series of naive diffusion actions with order learned by deep Q-learning, we propose an effective diffusion-based image denoiser. The diffusion actions selected by deep Q-learning at different iterations indeed composite a stochastic anisotropic diffusion process with strong adaptivity to different image structures, which enjoys improvement over the traditional ones. The proposed denoiser is applied to removing three types of often-seen noise. The experiments show that it outperforms existing diffusion-based methods and competes with the representative deep CNN-based methods.

91. 【2512.24026】PipeFlow: Pipelined Processing and Motion-Aware Frame Selection for Long-Form Video Editing

链接https://arxiv.org/abs/2512.24026

作者:Mustafa Munir,Md Mostafijur Rahman,Kartikeya Bhardwaj,Paul Whatmough,Radu Marculescu

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Diffusion Implicit Models, Denoising Diffusion Implicit, Implicit Models, poses unique challenges, unique challenges due

备注

点击查看摘要

Abstract:Long-form video editing poses unique challenges due to the exponential increase in the computational cost from joint editing and Denoising Diffusion Implicit Models (DDIM) inversion across extended sequences. To address these limitations, we propose PipeFlow, a scalable, pipelined video editing method that introduces three key innovations: First, based on a motion analysis using Structural Similarity Index Measure (SSIM) and Optical Flow, we identify and propose to skip editing of frames with low motion. Second, we propose a pipelined task scheduling algorithm that splits a video into multiple segments and performs DDIM inversion and joint editing in parallel based on available GPU memory. Lastly, we leverage a neural network-based interpolation technique to smooth out the border frames between segments and interpolate the previously skipped frames. Our method uniquely scales to longer videos by dividing them into smaller segments, allowing PipeFlow's editing time to increase linearly with video length. In principle, this enables editing of infinitely long videos without the growing per-frame computational overhead encountered by other methods. PipeFlow achieves up to a 9.6X speedup compared to TokenFlow and a 31.7X speedup over Diffusion Motion Transfer (DMT).

92. 【2512.24023】RSAgent: Learning to Reason and Act for Text-Guided Segmentation via Multi-Turn Tool Invocations

链接https://arxiv.org/abs/2512.24023

作者:Xingqi He,Yujie Zhang,Shuyong Gao,Wenjie Li,Lingyi Hong,Mingxi Chen,Kaixun Jiang,Jiyuan Fu,Wenqiang Zhang

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:pixel grounding abilities, object segmentation requires, Text-guided object segmentation, Multimodal Large Language, Large Language Model

备注

点击查看摘要

Abstract:Text-guided object segmentation requires both cross-modal reasoning and pixel grounding abilities. Most recent methods treat text-guided segmentation as one-shot grounding, where the model predicts pixel prompts in a single forward pass to drive an external segmentor, which limits verification, refocusing and refinement when initial localization is wrong. To address this limitation, we propose RSAgent, an agentic Multimodal Large Language Model (MLLM) which interleaves reasoning and action for segmentation via multi-turn tool invocations. RSAgent queries a segmentation toolbox, observes visual feedback, and revises its spatial hypothesis using historical observations to re-localize targets and iteratively refine masks. We further build a data pipeline to synthesize multi-turn reasoning segmentation trajectories, and train RSAgent with a two-stage framework: cold-start supervised fine-tuning followed by agentic reinforcement learning with fine-grained, task-specific rewards. Extensive experiments show that RSAgent achieves a zero-shot performance of 66.5% gIoU on ReasonSeg test, improving over Seg-Zero-7B by 9%, and reaches 81.5% cIoU on RefCOCOg, demonstrating state-of-the-art performance on both in-domain and out-of-domain benchmarks.

93. 【2512.24022】FUSE-RSVLM: Feature Fusion Vision-Language Model for Remote Sensing

链接https://arxiv.org/abs/2512.24022

作者:Yunkai Dang,Donghao Wang,Jiacheng Yang,Yifan Jiang,Meiyi Zhu,Yuekun Yang,Cong Wang,Qi Fan,Wenbin Li,Yang Gao

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Large vision-language models, Large vision-language, exhibit strong performance, remote sensing, exhibit strong

备注

点击查看摘要

Abstract:Large vision-language models (VLMs) exhibit strong performance across various tasks. However, these VLMs encounter significant challenges when applied to the remote sensing domain due to the inherent differences between remote sensing images and natural images. Existing remote sensing VLMs often fail to extract fine-grained visual features and suffer from visual forgetting during deep language processing. To address this, we introduce MF-RSVLM, a Multi-Feature Fusion Remote Sensing Vision--Language Model that effectively extracts and fuses visual features for RS understanding. MF-RSVLM learns multi-scale visual representations and combines global context with local details, improving the capture of small and complex structures in RS scenes. A recurrent visual feature injection scheme ensures the language model remains grounded in visual evidence and reduces visual forgetting during generation. Extensive experiments on diverse RS benchmarks show that MF-RSVLM achieves state-of-the-art or highly competitive performance across remote sensing classification, image captioning, and VQA tasks. Our code is publicly available at this https URL.

94. 【2512.24018】Structure-Guided Allocation of 2D Gaussians for Image Representation and Compression

链接https://arxiv.org/abs/2512.24018

作者:Huanxiong Liang,Yunuo Chen,Yicheng Pan,Sixian Wang,Jincheng Dai,Guo Lu,Wenjun Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent advances, Gaussian Splatting, compact image representation, demonstrated its potential, Recent

备注

点击查看摘要

Abstract:Recent advances in 2D Gaussian Splatting (2DGS) have demonstrated its potential as a compact image representation with millisecond-level decoding. However, existing 2DGS-based pipelines allocate representation capacity and parameter precision largely oblivious to image structure, limiting their rate-distortion (RD) efficiency at low bitrates. To address this, we propose a structure-guided allocation principle for 2DGS, which explicitly couples image structure with both representation capacity and quantization precision, while preserving native decoding speed. First, we introduce a structure-guided initialization that assigns 2D Gaussians according to spatial structural priors inherent in natural images, yielding a localized and semantically meaningful distribution. Second, during quantization-aware fine-tuning, we propose adaptive bitwidth quantization of covariance parameters, which grants higher precision to small-scale Gaussians in complex regions and lower precision elsewhere, enabling RD-aware optimization, thereby reducing redundancy without degrading edge quality. Third, we impose a geometry-consistent regularization that aligns Gaussian orientations with local gradient directions to better preserve structural details. Extensive experiments demonstrate that our approach substantially improves both the representational power and the RD performance of 2DGS while maintaining over 1000 FPS decoding. Compared with the baseline GSImage, we reduce BD-rate by 43.44% on Kodak and 29.91% on DIV2K.

95. 【2512.24016】FitControler: Toward Fit-Aware Virtual Try-On

链接https://arxiv.org/abs/2512.24016

作者:Lu Yang,Yicheng Liu,Yanan Li,Xiang Bai,Hao Lu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Realistic virtual try-on, Realistic virtual, virtual try-on, VTON, faithful rendering

备注

点击查看摘要

Abstract:Realistic virtual try-on (VTON) concerns not only faithful rendering of garment details but also coordination of the style. Prior art typically pursues the former, but neglects a key factor that shapes the holistic style -- garment fit. Garment fit delineates how a garment aligns with the body of a wearer and is a fundamental element in fashion design. In this work, we introduce fit-aware VTON and present FitControler, a learnable plug-in that can seamlessly integrate into modern VTON models to enable customized fit control. To achieve this, we highlight two challenges: i) how to delineate layouts of different fits and ii) how to render the garment that matches the layout. FitControler first features a fit-aware layout generator to redraw the body-garment layout conditioned on a set of delicately processed garment-agnostic representations, and a multi-scale fit injector is then used to deliver layout cues to enable layout-driven VTON. In particular, we build a fit-aware VTON dataset termed Fit4Men, including 13,000 body-garment pairs of different fits, covering both tops and bottoms, and featuring varying camera distances and body poses. Two fit consistency metrics are also introduced to assess the fitness of generations. Extensive experiments show that FitControler can work with various VTON models and achieve accurate fit control. Code and data will be released.

96. 【2512.24015】On Exact Editing of Flow-Based Diffusion Models

链接https://arxiv.org/abs/2512.24015

作者:Zixiang Li,Yue Song,Jianing Peng,Ting Liu,Jun Huang,Xiaochao Qu,Luoqi Liu,Wei Wang,Yao Zhao,Yunchao Wei

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:enabled direct transformations, Recent methods, explicit inversion, enabled direct, flow-based diffusion editing

备注

点击查看摘要

Abstract:Recent methods in flow-based diffusion editing have enabled direct transformations between source and target image distribution without explicit inversion. However, the latent trajectories in these methods often exhibit accumulated velocity errors, leading to semantic inconsistency and loss of structural fidelity. We propose Conditioned Velocity Correction (CVC), a principled framework that reformulates flow-based editing as a distribution transformation problem driven by a known source prior. CVC rethinks the role of velocity in inter-distribution transformation by introducing a dual-perspective velocity conversion mechanism. This mechanism explicitly decomposes the latent evolution into two components: a structure-preserving branch that remains consistent with the source trajectory, and a semantically-guided branch that drives a controlled deviation toward the target distribution. The conditional velocity field exhibits an absolute velocity error relative to the true underlying distribution trajectory, which inherently introduces potential instability and trajectory drift in the latent space. To address this quantifiable deviation and maintain fidelity to the true flow, we apply a posterior-consistent update to the resulting conditional velocity field. This update is derived from Empirical Bayes Inference and Tweedie correction, which ensures a mathematically grounded error compensation over time. Our method yields stable and interpretable latent dynamics, achieving faithful reconstruction alongside smooth local semantic conversion. Comprehensive experiments demonstrate that CVC consistently achieves superior fidelity, better semantic alignment, and more reliable editing behavior across diverse tasks.

97. 【2512.24013】Bridging the Perception-Cognition Gap:Re-engineering SAM2 with Hilbert-Mamba for Robust VLM-based Medical Diagnosis

链接https://arxiv.org/abs/2512.24013

作者:Hao Wu,Hui Li,Yiyun Su

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Visual Language Models, Recent studies suggest, Visual Language, automated medical diagnosis, Recent studies

备注

点击查看摘要

Abstract:Recent studies suggest that Visual Language Models (VLMs) hold great potential for tasks such as automated medical diagnosis. However, processing complex three-dimensional (3D) multimodal medical images poses significant challenges - specifically, the effective integration of complementary information and the occasional oversight of subtle yet critical pathological features. To address these issues, we present a novel two-stage fusion framework termed Hilbert-VLM. This framework leverages the HilbertMed-SAM module for precise lesion segmentation, with the generated multimodal enhanced prompts then guiding the VLM toward accurate disease classification. Our key innovation lies in the systematic redesign of the Segment Anything Model 2 (SAM2) architecture: we incorporate Hilbert space-filling curves into the scanning mechanism of the Mamba State Space Model (SSM) to maximize the preservation of spatial locality in 3D data, a property critical for medical image analysis. We also introduce a novel Hilbert-Mamba Cross-Attention (HMCA) mechanism and a scale-aware decoder to capture fine-grained details. Meanwhile, the prompt enhancement module unifies segmentation masks and their corresponding textual attributes into an information-dense prompt to support VLM inference. Extensive experiments were conducted to validate the effectiveness of the Hilbert-VLM model. On the BraTS2021 segmentation benchmark, it achieves a Dice score of 82.35 percent, with a diagnostic classification accuracy (ACC) of 78.85 percent. These results demonstrate that the proposed model offers substantial potential to improve the accuracy and reliability of medical VLM-based analysis.

98. 【2512.23998】Improved 3D Gaussian Splatting of Unknown Spacecraft Structure Using Space Environment Illumination Knowledge

链接https://arxiv.org/abs/2512.23998

作者:Tae Ha Park,Simone D'Amico

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Proximity Operations, Rendezvous and Proximity, captured during Rendezvous, unknown target spacecraft, Gaussian Splatting

备注: Presented at 2025 IEEE International Conference on Space Robotics (iSpaRo)

点击查看摘要

Abstract:This work presents a novel pipeline to recover the 3D structure of an unknown target spacecraft from a sequence of images captured during Rendezvous and Proximity Operations (RPO) in space. The target's geometry and appearance are represented as a 3D Gaussian Splatting (3DGS) model. However, learning 3DGS requires static scenes, an assumption in contrast to dynamic lighting conditions encountered in spaceborne imagery. The trained 3DGS model can also be used for camera pose estimation through photometric optimization. Therefore, in addition to recovering a geometrically accurate 3DGS model, the photometric accuracy of the rendered images is imperative to downstream pose estimation tasks during the RPO process. This work proposes to incorporate the prior knowledge of the Sun's position, estimated and maintained by the servicer spacecraft, into the training pipeline for improved photometric quality of 3DGS rasterization. Experimental studies demonstrate the effectiveness of the proposed solution, as 3DGS models trained on a sequence of images learn to adapt to rapidly changing illumination conditions in space and reflect global shadowing and self-occlusion.

99. 【2512.23997】Bridging Structure and Appearance: Topological Features for Robust Self-Supervised Segmentation

链接https://arxiv.org/abs/2512.23997

作者:Haotang Li,Zhenyu Qi,Hao Qin,Huanrui Yang,Sen He,Kebin Peng

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Self-supervised semantic segmentation, Self-supervised semantic, semantic segmentation methods, semantic segmentation, fail when faced

备注

点击查看摘要

Abstract:Self-supervised semantic segmentation methods often fail when faced with appearance ambiguities. We argue that this is due to an over-reliance on unstable, appearance-based features such as shadows, glare, and local textures. We propose \textbf{GASeg}, a novel framework that bridges appearance and geometry by leveraging stable topological information. The core of our method is Differentiable Box-Counting (\textbf{DBC}) module, which quantifies multi-scale topological statistics from two parallel streams: geometric-based features and appearance-based features. To force the model to learn these stable structural representations, we introduce Topological Augmentation (\textbf{TopoAug}), an adversarial strategy that simulates real-world ambiguities by applying morphological operators to the input images. A multi-objective loss, \textbf{GALoss}, then explicitly enforces cross-modal alignment between geometric-based and appearance-based features. Extensive experiments demonstrate that GASeg achieves state-of-the-art performance on four benchmarks, including COCO-Stuff, Cityscapes, and PASCAL, validating our approach of bridging geometry and appearance via topological information.

100. 【2512.23990】GCA-ResUNet: Medical Image Segmentation Using Grouped Coordinate Attention

链接https://arxiv.org/abs/2512.23990

作者:Jun Ding,Shang Gao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:subsequent clinical decision-making, Accurate segmentation, Grouped Coordinate Attention, pivotal for computer-aided, computer-aided diagnosis

备注

点击查看摘要

Abstract:Accurate segmentation of heterogeneous anatomical structures is pivotal for computer-aided diagnosis and subsequent clinical decision-making. Although U-Net based convolutional neural networks have achieved remarkable progress, their intrinsic locality and largely homogeneous attention formulations often limit the modeling of long-range contextual dependencies, especially in multi-organ scenarios and low-contrast regions. Transformer-based architectures mitigate this issue by leveraging global self-attention, but they usually require higher computational resources and larger training data, which may impede deployment in resource-constrained clinical this http URL this paper, we propose GCA-ResUNet, an efficient medical image segmentation framework equipped with a lightweight and plug-and-play Grouped Coordinate Attention (GCA) module. The proposed GCA decouples channel-wise context modeling into multiple groups to explicitly account for semantic heterogeneity across channels, and integrates direction-aware coordinate encoding to capture structured spatial dependencies along horizontal and vertical axes. This design enhances global representation capability while preserving the efficiency advantages of CNN backbones. Extensive experiments on two widely used benchmarks, Synapse and ACDC, demonstrate that GCA-ResUNet achieves Dice scores of 86.11% and 92.64%, respectively, outperforming a range of representative CNN and Transformer-based methods, including Swin-UNet and TransUNet. In particular, GCA-ResUNet yields consistent improvements in delineating small anatomical structures with complex boundaries. These results indicate that the proposed approach provides a favorable trade-off between segmentation accuracy and computational efficiency, offering a practical and scalable solution for clinical deployment.

101. 【2512.23986】Anomaly detection in satellite imagery through temporal inpainting

链接https://arxiv.org/abs/2512.23986

作者:Bertrand Rouet-Leduc,Claudia Hulbert

类目:Computer Vision and Pattern Recognition (cs.CV); Geophysics (physics.geo-ph)

关键词:rapid disaster response, remains challenging due, seasonal variations, atmospheric noise, sensor artifacts

备注

点击查看摘要

Abstract:Detecting surface changes from satellite imagery is critical for rapid disaster response and environmental monitoring, yet remains challenging due to the complex interplay between atmospheric noise, seasonal variations, and sensor artifacts. Here we show that deep learning can leverage the temporal redundancy of satellite time series to detect anomalies at unprecedented sensitivity, by learning to predict what the surface should look like in the absence of change. We train an inpainting model built upon the SATLAS foundation model to reconstruct the last frame of a Sentinel-2 time series from preceding acquisitions, using globally distributed training data spanning diverse climate zones and land cover types. When applied to regions affected by sudden surface changes, the discrepancy between prediction and observation reveals anomalies that traditional change detection methods miss. We validate our approach on earthquake-triggered surface ruptures from the 2023 Turkey-Syria earthquake sequence, demonstrating detection of a rift feature in Tepehan with higher sensitivity and specificity than temporal median or Reed-Xiaoli anomaly detectors. Our method reaches detection thresholds approximately three times lower than baseline approaches, providing a path towards automated, global-scale monitoring of surface changes from freely available multi-spectral satellite data.

102. 【2512.23983】DriveExplorer: Images-Only Decoupled 4D Reconstruction with Progressive Restoration for Driving View Extrapolation

链接https://arxiv.org/abs/2512.23983

作者:Yuang Jia,Jinlong Wang,Jiayi Zhao,Chunlam Li,Shunzhou Wang,Wei Gao

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:autonomous driving scenarios, driving scenarios, paper presents, presents an effective, effective solution

备注

点击查看摘要

Abstract:This paper presents an effective solution for view extrapolation in autonomous driving scenarios. Recent approaches focus on generating shifted novel view images from given viewpoints using diffusion models. However, these methods heavily rely on priors such as LiDAR point clouds, 3D bounding boxes, and lane annotations, which demand expensive sensors or labor-intensive labeling, limiting applicability in real-world deployment. In this work, with only images and optional camera poses, we first estimate a global static point cloud and per-frame dynamic point clouds, fusing them into a unified representation. We then employ a deformable 4D Gaussian framework to reconstruct the scene. The initially trained 4D Gaussian model renders degraded and pseudo-images to train a video diffusion model. Subsequently, progressively shifted Gaussian renderings are iteratively refined by the diffusion model,and the enhanced results are incorporated back as training data for 4DGS. This process continues until extrapolation reaches the target viewpoints. Compared with baselines, our method produces higher-quality images at novel extrapolated viewpoints.

103. 【2512.23953】2VAttack: Adversarial Attack on Text-to-Video Diffusion Models

链接https://arxiv.org/abs/2512.23953

作者:Changzhen Li,Yuecong Min,Jie Zhang,Zheng Yuan,Shiguang Shan,Xilin Chen

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:natural language descriptions, driven remarkable advancements, generating high-quality, language descriptions, diffusion models

备注

点击查看摘要

Abstract:The rapid evolution of Text-to-Video (T2V) diffusion models has driven remarkable advancements in generating high-quality, temporally coherent videos from natural language descriptions. Despite these achievements, their vulnerability to adversarial attacks remains largely unexplored. In this paper, we introduce T2VAttack, a comprehensive study of adversarial attacks on T2V diffusion models from both semantic and temporal perspectives. Considering the inherently dynamic nature of video data, we propose two distinct attack objectives: a semantic objective to evaluate video-text alignment and a temporal objective to assess the temporal dynamics. To achieve an effective and efficient attack process, we propose two adversarial attack methods: (i) T2VAttack-S, which identifies semantically or temporally critical words in prompts and replaces them with synonyms via greedy search, and (ii) T2VAttack-I, which iteratively inserts optimized words with minimal perturbation to the prompt. By combining these objectives and strategies, we conduct a comprehensive evaluation on the adversarial robustness of several state-of-the-art T2V models, including ModelScope, CogVideoX, Open-Sora, and HunyuanVideo. Our experiments reveal that even minor prompt modifications, such as the substitution or insertion of a single word, can cause substantial degradation in semantic fidelity and temporal dynamics, highlighting critical vulnerabilities in current T2V diffusion models.

104. 【2512.23950】U-Net-Like Spiking Neural Networks for Single Image Dehazing

链接https://arxiv.org/abs/2512.23950

作者:Huibin Li,Haoran Liu,Mingzhe Liu,Yulong Xiao,Peng Li,Guibin Zan

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Convolutional Neural Networks, Spiking Neural Networks, Neural Networks, enhancing image clarity, specifically Convolutional Neural

备注: 9 pages, 4 figures. Accepted at IJCNN 2025 (Rome, Italy). To appear in IEEE/IJCNN 2025 proceedings

点击查看摘要

Abstract:Image dehazing is a critical challenge in computer vision, essential for enhancing image clarity in hazy conditions. Traditional methods often rely on atmospheric scattering models, while recent deep learning techniques, specifically Convolutional Neural Networks (CNNs) and Transformers, have improved performance by effectively analyzing image features. However, CNNs struggle with long-range dependencies, and Transformers demand significant computational resources. To address these limitations, we propose DehazeSNN, an innovative architecture that integrates a U-Net-like design with Spiking Neural Networks (SNNs). DehazeSNN captures multi-scale image features while efficiently managing local and long-range dependencies. The introduction of the Orthogonal Leaky-Integrate-and-Fire Block (OLIFBlock) enhances cross-channel communication, resulting in superior dehazing performance with reduced computational burden. Our extensive experiments show that DehazeSNN is highly competitive to state-of-the-art methods on benchmark datasets, delivering high-quality haze-free images with a smaller model size and less multiply-accumulate operations. The proposed dehazing method is publicly available at this https URL.

105. 【2512.23942】Kinematic-Based Assessment of Surgical Actions in Microanastomosis

链接https://arxiv.org/abs/2512.23942

作者:Yan Meng,Daniel Donoho,Marcelle Altshuler,Omar Arnaout

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:precisely manipulate fine, manipulate fine instruments, critical surgical skill, successful outcomes, ability to precisely

备注

点击查看摘要

Abstract:Proficiency in microanastomosis is a critical surgical skill in neurosurgery, where the ability to precisely manipulate fine instruments is crucial to successful outcomes. These procedures require sustained attention, coordinated hand movements, and highly refined motor skills, underscoring the need for objective and systematic methods to evaluate and enhance microsurgical training. Conventional assessment approaches typically rely on expert raters supervising the procedures or reviewing surgical videos, which is an inherently subjective process prone to inter-rater variability, inconsistency, and significant time investment. These limitations highlight the necessity for automated and scalable solutions. To address this challenge, we introduce a novel AI-driven framework for automated action segmentation and performance assessment in microanastomosis procedures, designed to operate efficiently on edge computing platforms. The proposed system comprises three main components: (1) an object tip tracking and localization module based on YOLO and DeepSORT; (2) an action segmentation module leveraging self-similarity matrix for action boundary detection and unsupervised clustering; and (3) a supervised classification module designed to evaluate surgical gesture proficiency. Experimental validation on a dataset of 58 expert-rated microanastomosis videos demonstrates the effectiveness of our approach, achieving a frame-level action segmentation accuracy of 92.4% and an overall skill classification accuracy of 85.5% in replicating expert evaluations. These findings demonstrate the potential of the proposed method to provide objective, real-time feedback in microsurgical education, thereby enabling more standardized, data-driven training protocols and advancing competency assessment in high-stakes surgical environments.

106. 【2512.23938】Learnable Query Aggregation with KV Routing for Cross-view Geo-localisation

链接https://arxiv.org/abs/2512.23938

作者:Hualin Ye,Bingxi Liu,Jixiang Du,Yu Qin,Ziyi Chen,Hong Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:query image, aims to estimate, large-scale database, estimate the geographic, geographic location

备注: 7 pages, 4 figures

点击查看摘要

Abstract:Cross-view geo-localisation (CVGL) aims to estimate the geographic location of a query image by matching it with images from a large-scale database. However, the significant view-point discrepancies present considerable challenges for effective feature aggregation and alignment. To address these challenges, we propose a novel CVGL system that incorporates three key improvements. Firstly, we leverage the DINOv2 backbone with a convolution adapter fine-tuning to enhance model adaptability to cross-view variations. Secondly, we propose a multi-scale channel reallocation module to strengthen the diversity and stability of spatial representations. Finally, we propose an improved aggregation module that integrates a Mixture-of-Experts (MoE) routing into the feature aggregation process. Specifically, the module dynamically selects expert subspaces for the keys and values in a cross-attention framework, enabling adaptive processing of heterogeneous input domains. Extensive experiments on the University-1652 and SUES-200 datasets demonstrate that our method achieves competitive performance with fewer trained parameters.

107. 【2512.23936】MGML: A Plug-and-Play Meta-Guided Multi-Modal Learning Framework for Incomplete Multimodal Brain Tumor Segmentation

链接https://arxiv.org/abs/2512.23936

作者:Yulong Zou,Bo Liu,Cun-Jing Zheng,Yuan-ming Geng,Siyue Li,Qiankun Zuo,Shuihua Wang,Yudong Zhang,Jin Hong

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Magnetic Resonance Imaging, Resonance Imaging, Leveraging multimodal information, Magnetic Resonance, multimodal MRI data

备注

点击查看摘要

Abstract:Leveraging multimodal information from Magnetic Resonance Imaging (MRI) plays a vital role in lesion segmentation, especially for brain tumors. However, in clinical practice, multimodal MRI data are often incomplete, making it challenging to fully utilize the available information. Therefore, maximizing the utilization of this incomplete multimodal information presents a crucial research challenge. We present a novel meta-guided multi-modal learning (MGML) framework that comprises two components: meta-parameterized adaptive modality fusion and consistency regularization module. The meta-parameterized adaptive modality fusion (Meta-AMF) enables the model to effectively integrate information from multiple modalities under varying input conditions. By generating adaptive soft-label supervision signals based on the available modalities, Meta-AMF explicitly promotes more coherent multimodal fusion. In addition, the consistency regularization module enhances segmentation performance and implicitly reinforces the robustness and generalization of the overall framework. Notably, our approach does not alter the original model architecture and can be conveniently integrated into the training pipeline for end-to-end model optimization. We conducted extensive experiments on the public BraTS2020 and BraTS2023 datasets. Compared to multiple state-of-the-art methods from previous years, our method achieved superior performance. On BraTS2020, for the average Dice scores across fifteen missing modality combinations, building upon the baseline, our method obtained scores of 87.55, 79.36, and 62.67 for the whole tumor (WT), the tumor core (TC), and the enhancing tumor (ET), respectively. We have made our source code publicly available at this https URL.

108. 【2512.23920】Learning to learn skill assessment for fetal ultrasound scanning

链接https://arxiv.org/abs/2512.23920

作者:Yipei Wang,Qianye Yang,Lior Drukker,Aris T. Papageorghiou,Yipeng Hu,J. Alison Noble

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:supervision and feedback, time-intensive nature, relied on expert, expert supervision, subjectivity and time-intensive

备注

点击查看摘要

Abstract:Traditionally, ultrasound skill assessment has relied on expert supervision and feedback, a process known for its subjectivity and time-intensive nature. Previous works on quantitative and automated skill assessment have predominantly employed supervised learning methods, often limiting the analysis to predetermined or assumed factors considered influential in determining skill levels. In this work, we propose a novel bi-level optimisation framework that assesses fetal ultrasound skills by how well a task is performed on the acquired fetal ultrasound images, without using manually predefined skill ratings. The framework consists of a clinical task predictor and a skill predictor, which are optimised jointly by refining the two networks simultaneously. We validate the proposed method on real-world clinical ultrasound videos of scanning the fetal head. The results demonstrate the feasibility of predicting ultrasound skills by the proposed framework, which quantifies optimised task performance as a skill indicator.

109. 【2512.23903】Scaling Remote Sensing Foundation Models: Data Domain Tradeoffs at the Peta-Scale

链接https://arxiv.org/abs/2512.23903

作者:Charith Wickrema,Eliza Mace,Hunter Brown,Heidys Cabrera,Nick Krall,Matthew O'Neill,Shivangi Sarkar,Lowell Weissman,Eric Hughes,Guido Zarrella

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:establish practical techniques, high-resolution electro-optical, exceed the current, orders of magnitude, artificial intelligence

备注

点击查看摘要

Abstract:We explore the scaling behaviors of artificial intelligence to establish practical techniques for training foundation models on high-resolution electro-optical (EO) datasets that exceed the current state-of-the-art scale by orders of magnitude. Modern multimodal machine learning (ML) applications, such as generative artificial intelligence (GenAI) systems for image captioning, search, and reasoning, depend on robust, domain-specialized encoders for non-text modalities. In natural-image domains where internet-scale data is plentiful, well-established scaling laws help optimize the joint scaling of model capacity, training compute, and dataset size. Unfortunately, these relationships are much less well-understood in high-value domains like remote sensing (RS). Using over a quadrillion pixels of commercial satellite EO data and the MITRE Federal AI Sandbox, we train progressively larger vision transformer (ViT) backbones, report success and failure modes observed at petascale, and analyze implications for bridging domain gaps across additional RS modalities. We observe that even at this scale, performance is consistent with a data limited regime rather than a model parameter-limited one. These practical insights are intended to inform data-collection strategies, compute budgets, and optimization schedules that advance the future development of frontier-scale RS foundation models.

110. 【2512.23894】MRI-to-CT Synthesis With Cranial Suture Segmentations Using A Variational Autoencoder Framework

链接https://arxiv.org/abs/2512.23894

作者:Krithika Iyer,Austin Tapp,Athelia Paulli,Gabrielle Dickerson,Syed Muhammad Anwar,Natasha Lepore,Marius George Linguraru

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Quantifying normative pediatric, growth-related cephalic disorders, treating growth-related cephalic, Quantifying normative, cephalic disorders

备注

点击查看摘要

Abstract:Quantifying normative pediatric cranial development and suture ossification is crucial for diagnosing and treating growth-related cephalic disorders. Computed tomography (CT) is widely used to evaluate cranial and sutural deformities; however, its ionizing radiation is contraindicated in children without significant abnormalities. Magnetic resonance imaging (MRI) offers radiation free scans with superior soft tissue contrast, but unlike CT, MRI cannot elucidate cranial sutures, estimate skull bone density, or assess cranial vault growth. This study proposes a deep learning driven pipeline for transforming T1 weighted MRIs of children aged 0.2 to 2 years into synthetic CTs (sCTs), predicting detailed cranial bone segmentation, generating suture probability heatmaps, and deriving direct suture segmentation from the heatmaps. With our in-house pediatric data, sCTs achieved 99% structural similarity and a Frechet inception distance of 1.01 relative to real CTs. Skull segmentation attained an average Dice coefficient of 85% across seven cranial bones, and sutures achieved 80% Dice. Equivalence of skull and suture segmentation between sCTs and real CTs was confirmed using two one sided tests (TOST p 0.05). To our knowledge, this is the first pediatric cranial CT synthesis framework to enable suture segmentation on sCTs derived from MRI, despite MRI's limited depiction of bone and sutures. By combining robust, domain specific variational autoencoders, our method generates perceptually indistinguishable cranial sCTs from routine pediatric MRIs, bridging critical gaps in non invasive cranial evaluation.

111. 【2512.23864】Learning to Feel the Future: DreamTacVLA for Contact-Rich Manipulation

链接https://arxiv.org/abs/2512.23864

作者:Guo Ye,Zexi Zhang,Xu Zhao,Shang Wu,Haoran Lu,Shihan Lu,Han Liu

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:shown remarkable generalization, mapping web-scale knowledge, shown remarkable, remarkable generalization, generalization by mapping

备注

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have shown remarkable generalization by mapping web-scale knowledge to robotic control, yet they remain blind to physical contact. Consequently, they struggle with contact-rich manipulation tasks that require reasoning about force, texture, and slip. While some approaches incorporate low-dimensional tactile signals, they fail to capture the high-resolution dynamics essential for such interactions. To address this limitation, we introduce DreamTacVLA, a framework that grounds VLA models in contact physics by learning to feel the future. Our model adopts a hierarchical perception scheme in which high-resolution tactile images serve as micro-vision inputs coupled with wrist-camera local vision and third-person macro vision. To reconcile these multi-scale sensory streams, we first train a unified policy with a Hierarchical Spatial Alignment (HSA) loss that aligns tactile tokens with their spatial counterparts in the wrist and third-person views. To further deepen the model's understanding of fine-grained contact dynamics, we finetune the system with a tactile world model that predicts future tactile signals. To mitigate tactile data scarcity and the wear-prone nature of tactile sensors, we construct a hybrid large-scale dataset sourced from both high-fidelity digital twin and real-world experiments. By anticipating upcoming tactile states, DreamTacVLA acquires a rich model of contact physics and conditions its actions on both real observations and imagined consequences. Across contact-rich manipulation tasks, it outperforms state-of-the-art VLA baselines, achieving up to 95% success, highlighting the importance of understanding physical contact for robust, touch-aware robotic agents.

112. 【2512.23860】Lifelong Domain Adaptive 3D Human Pose Estimation

链接https://arxiv.org/abs/2512.23860

作者:Qucheng Peng,Hongfei Xue,Pu Wang,Chen Chen

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词:Human Pose Estimation, HPE, Human Pose, Pose Estimation, domain

备注: Accepted by AAAI 2026

点击查看摘要

Abstract:3D Human Pose Estimation (3D HPE) is vital in various applications, from person re-identification and action recognition to virtual reality. However, the reliance on annotated 3D data collected in controlled environments poses challenges for generalization to diverse in-the-wild scenarios. Existing domain adaptation (DA) paradigms like general DA and source-free DA for 3D HPE overlook the issues of non-stationary target pose datasets. To address these challenges, we propose a novel task named lifelong domain adaptive 3D HPE. To our knowledge, we are the first to introduce the lifelong domain adaptation to the 3D HPE task. In this lifelong DA setting, the pose estimator is pretrained on the source domain and subsequently adapted to distinct target domains. Moreover, during adaptation to the current target domain, the pose estimator cannot access the source and all the previous target domains. The lifelong DA for 3D HPE involves overcoming challenges in adapting to current domain poses and preserving knowledge from previous domains, particularly combating catastrophic forgetting. We present an innovative Generative Adversarial Network (GAN) framework, which incorporates 3D pose generators, a 2D pose discriminator, and a 3D pose estimator. This framework effectively mitigates domain shifts and aligns original and augmented poses. Moreover, we construct a novel 3D pose generator paradigm, integrating pose-aware, temporal-aware, and domain-aware knowledge to enhance the current domain's adaptation and alleviate catastrophic forgetting on previous domains. Our method demonstrates superior performance through extensive experiments on diverse domain adaptive 3D HPE datasets.

113. 【2512.23851】Pretraining Frame Preservation in Autoregressive Video Memory Compression

链接https://arxiv.org/abs/2512.23851

作者:Lvmin Zhang,Shengqu Cai,Muyang Li,Chong Zeng,Beijia Lu,Anyi Rao,Song Han,Gordon Wetzstein,Maneesh Agrawala

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:arbitrary temporal positions, explicit pretraining objective, present PFP, neural network structure, temporal positions

备注: [this https URL](https://github.com/lllyasviel/PFP)

点击查看摘要

Abstract:We present PFP, a neural network structure to compress long videos into short contexts, with an explicit pretraining objective to preserve the high-frequency details of single frames at arbitrary temporal positions. The baseline model can compress a 20-second video into a context at about 5k length, where random frames can be retrieved with perceptually preserved appearances. Such pretrained models can be directly fine-tuned as memory encoders for autoregressive video models, enabling long history memory with low context cost and relatively low fidelity loss. We evaluate the framework with ablative settings and discuss the trade-offs of possible neural architecture designs.

114. 【2512.23819】Video-Based Performance Evaluation for ECR Drills in Synthetic Training Environments

链接https://arxiv.org/abs/2512.23819

作者:Surya Rayala,Marcos Quinones-Grueiro,Naveeduddin Mohammed,Ashwin T S,Benjamin Goldberg,Randall Spain,Paige Lawton,Gautam Biswas

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:Effective urban warfare, urban warfare training, warfare training requires, Effective urban, Synthetic Training Environments

备注: 14 pages, 9 figures, I/ITSEC-2025

点击查看摘要

Abstract:Effective urban warfare training requires situational awareness and muscle memory, developed through repeated practice in realistic yet controlled environments. A key drill, Enter and Clear the Room (ECR), demands threat assessment, coordination, and securing confined spaces. The military uses Synthetic Training Environments that offer scalable, controlled settings for repeated exercises. However, automatic performance assessment remains challenging, particularly when aiming for objective evaluation of cognitive, psychomotor, and teamwork skills. Traditional methods often rely on costly, intrusive sensors or subjective human observation, limiting scalability and accuracy. This paper introduces a video-based assessment pipeline that derives performance analytics from training videos without requiring additional hardware. By utilizing computer vision models, the system extracts 2D skeletons, gaze vectors, and movement trajectories. From these data, we develop task-specific metrics that measure psychomotor fluency, situational awareness, and team coordination. These metrics feed into an extended Cognitive Task Analysis (CTA) hierarchy, which employs a weighted combination to generate overall performance scores for teamwork and cognition. We demonstrate the approach with a case study of real-world ECR drills, providing actionable, domain specific metrics that capture individual and team performance. We also discuss how these insights can support After Action Reviews with interactive dashboards within Gamemaster and the Generalized Intelligent Framework for Tutoring (GIFT), providing intuitive and understandable feedback. We conclude by addressing limitations, including tracking difficulties, ground-truth validation, and the broader applicability of our approach. Future work includes expanding analysis to 3D video data and leveraging video analysis to enable scalable evaluation within STEs.

115. 【2512.23786】Leveraging Synthetic Priors for Monocular Depth Estimation in Specular Surgical Environments

链接https://arxiv.org/abs/2512.23786

作者:Ankan Aich,Yangming Lee

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:Accurate Monocular Depth, Monocular Depth Estimation, fluid-filled endoscopic environments, Accurate Monocular, Depth Estimation

备注

点击查看摘要

Abstract:Accurate Monocular Depth Estimation (MDE) is critical for robotic surgery but remains fragile in specular, fluid-filled endoscopic environments. Existing self-supervised methods, typically relying on foundation models trained with noisy real-world pseudo-labels, often suffer from boundary collapse on thin surgical tools and transparent surfaces. In this work, we address this by leveraging the high-fidelity synthetic priors of the Depth Anything V2 architecture, which inherently captures precise geometric details of thin structures. We efficiently adapt these priors to the medical domain using Dynamic Vector Low-Rank Adaptation (DV-LORA), minimizing the parameter budget while bridging the synthetic-to-real gap. Additionally, we introduce a physically-stratified evaluation protocol on the SCARED dataset to rigorously quantify performance in high-specularity regimes often masked by aggregate metrics. Our approach establishes a new state-of-the-art, achieving an accuracy ( 1.25) of 98.1% and reducing Squared Relative Error by over 17% compared to established baselines, demonstrating superior robustness in adverse surgical lighting.

116. 【2512.23766】A Granular Grassmannian Clustering Framework via the Schubert Variety of Best Fit

链接https://arxiv.org/abs/2512.23766

作者:Karim Salta,Michael Kirby,Chris Peterson

类目:Machine Learning (cs.LG); Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)

关键词:compute a geometric, geometric representative, clustering tasks, subspace clustering algorithm, Schubert Variety

备注

点击查看摘要

Abstract:In many classification and clustering tasks, it is useful to compute a geometric representative for a dataset or a cluster, such as a mean or median. When datasets are represented by subspaces, these representatives become points on the Grassmann or flag manifold, with distances induced by their geometry, often via principal angles. We introduce a subspace clustering algorithm that replaces subspace means with a trainable prototype defined as a Schubert Variety of Best Fit (SVBF) - a subspace that comes as close as possible to intersecting each cluster member in at least one fixed direction. Integrated in the Linde-Buzo-Grey (LBG) pipeline, this SVBF-LBG scheme yields improved cluster purity on synthetic, image, spectral, and video action data, while retaining the mathematical structure required for downstream analysis.

117. 【2512.23739】Break Out the Silverware -- Semantic Understanding of Stored Household Items

链接https://arxiv.org/abs/2512.23739

作者:Michaela Levi-Richter,Reuth Mirsky,Oren Glickman

类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:simple command reveals, Stored Household Item, Household Item Challenge, inferring where everyday, sight in drawers

备注: Poster presented at the Israeli Seminar on Computational Linguistics 2025

点击查看摘要

Abstract:``Bring me a plate.'' For domestic service robots, this simple command reveals a complex challenge: inferring where everyday items are stored, often out of sight in drawers, cabinets, or closets. Despite advances in vision and manipulation, robots still lack the commonsense reasoning needed to complete this task. We introduce the Stored Household Item Challenge, a benchmark task for evaluating service robots' cognitive capabilities: given a household scene and a queried item, predict its most likely storage location. Our benchmark includes two datasets: (1) a real-world evaluation set of 100 item-image pairs with human-annotated ground truth from participants' kitchens, and (2) a development set of 6,500 item-image pairs annotated with storage polygons over public kitchen images. These datasets support realistic modeling of household organization and enable comparative evaluation across agent architectures. To begin tackling this challenge, we introduce NOAM (Non-visible Object Allocation Model), a hybrid agent pipeline that combines structured scene understanding with large language model inference. NOAM converts visual input into natural language descriptions of spatial context and visible containers, then prompts a language model (e.g., GPT-4) to infer the most likely hidden storage location. This integrated vision-language agent exhibits emergent commonsense reasoning and is designed for modular deployment within broader robotic systems. We evaluate NOAM against baselines including random selection, vision-language pipelines (Grounding-DINO + SAM), leading multimodal models (e.g., Gemini, GPT-4o, Kosmos-2, LLaMA, Qwen), and human performance. NOAM significantly improves prediction accuracy and approaches human-level results, highlighting best practices for deploying cognitively capable agents in domestic environments.

Comments:
Poster presented at the Israeli Seminar on Computational Linguistics 2025

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Cite as:
arXiv:2512.23739 [cs.CL]

(or
arXiv:2512.23739v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2512.23739

Focus to learn more

              arXiv-issued DOI via DataCite

Submission history From: Reuth Mirsky [view email] [v1]
Thu, 25 Dec 2025 15:21:49 UTC (4,504 KB)

118. 【2512.24894】owards autonomous time-calibration of large quantum-dot devices: Detection, real-time feedback, and noise spectroscopy

链接https://arxiv.org/abs/2512.24894

作者:Anantha S. Rao,Barnaby van Straaten,Valentin John,Cécile X. Yu,Stefan D. Oosterhout,Lucas Stehouwer,Giordano Scappucci,M. D. Stewart Jr.,Menno Veldhorst,Francesco Borsoi,Justyna P. Zwolak

类目:Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Quantum Physics (quant-ph)

关键词:shift operating points, destabilize qubit parameters, semiconductor quantum-dot, performance and scalability, scalability of semiconductor

备注: 12 pages, 4 figures

点击查看摘要

Abstract:The performance and scalability of semiconductor quantum-dot (QD) qubits are limited by electrostatic drift and charge noise that shift operating points and destabilize qubit parameters. As systems expand to large one- and two-dimensional arrays, manual recalibration becomes impractical, creating a need for autonomous stabilization frameworks. Here, we introduce a method that uses the full network of charge-transition lines in repeatedly acquired double-quantum-dot charge stability diagrams (CSDs) as a multidimensional probe of the local electrostatic environment. By accurately tracking the motion of selected transitions in time, we detect voltage drifts, identify abrupt charge reconfigurations, and apply compensating updates to maintain stable operating conditions. We demonstrate our approach on a 10-QD device, showing robust stabilization and real-time diagnostic access to dot-specific noise processes. The high acquisition rate of radio-frequency reflectometry CSD measurements also enables time-domain noise spectroscopy, allowing the extraction of noise power spectral densities, the identification of two-level fluctuators, and the analysis of spatial noise correlations across the array. From our analysis, we find that the background noise at 100~$\mu$\si{\hertz} is dominated by drift with a power law of $1/f^2$, accompanied by a few dominant two-level fluctuators and an average linear correlation length of $(188 \pm 38)$~\si{\nano\meter} in the device. These capabilities form the basis of a scalable, autonomous calibration and characterization module for QD-based quantum processors, providing essential feedback for long-duration, high-fidelity qubit operations.

119. 【2512.24492】Automated Classification of First-Trimester Fetal Heart Views Using Ultrasound-Specific Self-Supervised Learning

链接https://arxiv.org/abs/2512.24492

作者:Youssef Megahed,Aylin Erman,Robin Ducharme,Mark C. Walker,Steven Hawken,Adrian D. C. Chan

类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:common congenital anomaly, Congenital heart disease, heart disease remains, common congenital, congenital anomaly

备注: 7 pages, 4 figures

点击查看摘要

Abstract:Congenital heart disease remains the most common congenital anomaly and a leading cause of neonatal morbidity and mortality. Although first-trimester fetal echocardiography offers an opportunity for earlier detection, automated analysis at this stage is challenging due to small cardiac structures, low signal-to-noise ratio, and substantial inter-operator variability. In this work, we evaluate a self-supervised ultrasound foundation model, USF-MAE, for first-trimester fetal heart view classification. USF-MAE is pretrained using masked autoencoding modelling on more than 370,000 unlabelled ultrasound images spanning over 40 anatomical regions and is subsequently fine-tuned for downstream classification. As a proof of concept, the pretrained Vision Transformer encoder was fine-tuned on an open-source dataset of 6,720 first-trimester fetal echocardiography images to classify five categories: aorta, atrioventricular flows, V sign, X sign, and Other. Model performance was benchmarked against supervised convolutional neural network baselines (ResNet-18 and ResNet-50) and a Vision Transformer (ViT-B/16) model pretrained on natural images (ImageNet-1k). All models were trained and evaluated using identical preprocessing, data splits, and optimization protocols. On an independent test set, USF-MAE achieved the highest performance across all evaluation metrics, with 90.57% accuracy, 91.15% precision, 90.57% recall, and 90.71% F1-score. This represents an improvement of +2.03% in accuracy and +1.98% in F1-score compared with the strongest baseline, ResNet-18. The proposed approach demonstrated robust performance without reliance on aggressive image preprocessing or region-of-interest cropping and showed improved discrimination of non-diagnostic frames.

120. 【2512.24117】argeted Semantic Segmentation of Himalayan Glacial Lakes Using Time-Series SAR: Towards Automated GLOF Early Warning

链接https://arxiv.org/abs/2512.24117

作者:Pawan Adhikari,Satish Raj Regmi,Hari Ram Shrestha

类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

关键词:Lake Outburst Floods, Outburst Floods, change induced hazards, Glacial Lake Outburst, devastating climate change

备注: 12 pages, 6 figures

点击查看摘要

Abstract:Glacial Lake Outburst Floods (GLOFs) are one of the most devastating climate change induced hazards. Existing remote monitoring approaches often prioritise maximising spatial coverage to train generalistic models or rely on optical imagery hampered by persistent cloud coverage. This paper presents an end-to-end, automated deep learning pipeline for the targeted monitoring of high-risk Himalayan glacial lakes using time-series Sentinel-1 SAR. We introduce a "temporal-first" training strategy, utilising a U-Net with an EfficientNet-B3 backbone trained on a curated dataset of a cohort of 4 lakes (Tsho Rolpa, Chamlang Tsho, Tilicho and Gokyo Lake). The model achieves an IoU of 0.9130 validating the success and efficacy of the "temporal-first" strategy required for transitioning to Early Warning Systems. Beyond the model, we propose an operational engineering architecture: a Dockerised pipeline that automates data ingestion via the ASF Search API and exposes inference results via a RESTful endpoint. This system shifts the paradigm from static mapping to dynamic and automated early warning, providing a scalable architectural foundation for future development in Early Warning Systems.

121. 【2512.24019】One-Shot Structured Pruning of Quantum Neural Networks via $q$-Group Engineering and Quantum Geometric Metrics

链接https://arxiv.org/abs/2512.24019

作者:Haijian Shao,Wei Liu,Xing Deng,Yingtao Jiang

类目:Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV)

关键词:Quantum neural networks, noisy intermediate-scale quantum, severe gate-level redundancy, neural networks, suffer from severe

备注: 10 pages, 2 figures

点击查看摘要

Abstract:Quantum neural networks (QNNs) suffer from severe gate-level redundancy, which hinders their deployment on noisy intermediate-scale quantum (NISQ) devices. In this work, we propose q-iPrune, a one-shot structured pruning framework grounded in the algebraic structure of $q$-deformed groups and task-conditioned quantum geometry. Unlike prior heuristic or gradient-based pruning methods, q-iPrune formulates redundancy directly at the gate level. Each gate is compared within an algebraically consistent subgroup using a task-conditioned $q$-overlap distance, which measures functional similarity through state overlaps on a task-relevant ensemble. A gate is removed only when its replacement by a subgroup representative provably induces a bounded deviation on all task observables. We establish three rigorous theoretical guarantees. First, we prove completeness of redundancy pruning: no gate that violates the prescribed similarity threshold is removed. Second, we show that the pruned circuit is functionally equivalent up to an explicit, task-conditioned error bound, with a closed-form dependence on the redundancy tolerance and the number of replaced gates. Third, we prove that the pruning procedure is computationally feasible, requiring only polynomial-time comparisons and avoiding exponential enumeration over the Hilbert space. To adapt pruning decisions to hardware imperfections, we introduce a noise-calibrated deformation parameter $\lambda$ that modulates the $q$-geometry and redundancy tolerance. Experiments on standard quantum machine learning benchmarks demonstrate that q-iPrune achieves substantial gate reduction while maintaining bounded task performance degradation, consistent with our theoretical guarantees.

Comments:
10 pages, 2 figures

Subjects:

Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV)

Cite as:
arXiv:2512.24019 [quant-ph]

(or
arXiv:2512.24019v1 [quant-ph] for this version)

https://doi.org/10.48550/arXiv.2512.24019

Focus to learn more

              arXiv-issued DOI via DataCite (pending registration)

Submission history From: Haijian Shao [view email] [v1]
Tue, 30 Dec 2025 06:37:28 UTC (41 KB)

122. 【2512.23906】A multimodal Transformer for InSAR-based ground deformation forecasting with cross-site generalization across Europe

链接https://arxiv.org/abs/2512.23906

作者:Wendong Yao,Binhua Huang,Soumyabrata Dev

类目:ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:critical infrastructure management, support urban planning, natural hazard mitigation, European Ground Motion, Synthetic Aperture Radar

备注: submitted to ISPRS Journal of Photogrammetry and Remote Sensing for review

点击查看摘要

Abstract:Near-real-time regional-scale monitoring of ground deformation is increasingly required to support urban planning, critical infrastructure management, and natural hazard mitigation. While Interferometric Synthetic Aperture Radar (InSAR) and continental-scale services such as the European Ground Motion Service (EGMS) provide dense observations of past motion, predicting the next observation remains challenging due to the superposition of long-term trends, seasonal cycles, and occasional abrupt discontinuities (e.g., co-seismic steps), together with strong spatial heterogeneity. In this study we propose a multimodal patch-based Transformer for single-step, fixed-interval next-epoch nowcasting of displacement maps from EGMS time series (resampled to a 64x64 grid over 100 km x 100 km tiles). The model ingests recent displacement snapshots together with (i) static kinematic indicators (mean velocity, acceleration, seasonal amplitude) computed in a leakage-safe manner from the training window only, and (ii) harmonic day-of-year encodings. On the eastern Ireland tile (E32N34), the STGCN is strongest in the displacement-only setting, whereas the multimodal Transformer clearly outperforms CNN-LSTM, CNN-LSTM+Attn, and multimodal STGCN when all models receive the same multimodal inputs, achieving RMSE = 0.90 mm and $R^2$ = 0.97 on the test set with the best threshold accuracies.

123. 【2512.23757】Leveraging Machine Learning for Early Detection of Lung Diseases

链接https://arxiv.org/abs/2512.23757

作者:Bahareh Rahmani,Harsha Reddy Bindela,Rama Kanth Reddy Gosula,Krishna Yedubati,Mohammad Amir Salari,Leslie Hinyard,Payam Norouzzadeh,Eli Snir,Martin Schoen

类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:traditional image processing, preventive healthcare paradigm, image processing methods, neural networks concretes, advanced neural networks

备注

点击查看摘要

Abstract:A combination of traditional image processing methods with advanced neural networks concretes a predictive and preventive healthcare paradigm. This study offers rapid, accurate, and non-invasive diagnostic solutions that can significantly impact patient outcomes, particularly in areas with limited access to radiologists and healthcare resources. In this project, deep learning methods apply in enhancing the diagnosis of respiratory diseases such as COVID-19, lung cancer, and pneumonia from chest x-rays. We trained and validated various neural network models, including CNNs, VGG16, InceptionV3, and EfficientNetB0, with high accuracy, precision, recall, and F1 scores to highlight the models' reliability and potential in real-world diagnostic applications.

124. 【2512.23726】q3-MuPa: Quick, Quiet, Quantitative Multi-Parametric MRI using Physics-Informed Diffusion Models

链接https://arxiv.org/abs/2512.23726

作者:Shishuai Wang,Florian Wiesinger,Noemi Sgambelluri,Carolin Pirkl,Stefan Klein,Juan A. Hernandez-Tamames,Dirk H.J. Poot

类目:Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词:phyllotaxis sampling scheme, fast silent multi-parametric, quantitative MRI, fast silent, silent scanning

备注

点击查看摘要

Abstract:The 3D fast silent multi-parametric mapping sequence with zero echo time (MuPa-ZTE) is a novel quantitative MRI (qMRI) acquisition that enables nearly silent scanning by using a 3D phyllotaxis sampling scheme. MuPa-ZTE improves patient comfort and motion robustness, and generates quantitative maps of T1, T2, and proton density using the acquired weighted image series. In this work, we propose a diffusion model-based qMRI mapping method that leverages both a deep generative model and physics-based data consistency to further improve the mapping performance. Furthermore, our method enables additional acquisition acceleration, allowing high-quality qMRI mapping from a fourfold-accelerated MuPa-ZTE scan (approximately 1 minute). Specifically, we trained a denoising diffusion probabilistic model (DDPM) to map MuPa-ZTE image series to qMRI maps, and we incorporated the MuPa-ZTE forward signal model as an explicit data consistency (DC) constraint during inference. We compared our mapping method against a baseline dictionary matching approach and a purely data-driven diffusion model. The diffusion models were trained entirely on synthetic data generated from digital brain phantoms, eliminating the need for large real-scan datasets. We evaluated on synthetic data, a NISM/ISMRM phantom, healthy volunteers, and a patient with brain metastases. The results demonstrated that our method produces 3D qMRI maps with high accuracy, reduced noise and better preservation of structural details. Notably, it generalised well to real scans despite training on synthetic data alone. The combination of the MuPa-ZTE acquisition and our physics-informed diffusion model is termed q3-MuPa, a quick, quiet, and quantitative multi-parametric mapping framework, and our findings highlight its strong clinical potential.