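This post presents the daily list of the latest papers fetched from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.
Statistics
356 papers were updated today, including:
- Natural language processing: 43
- Information retrieval: 11
- Computer vision: 77
Natural Language Processing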
1. 【2512.21336】Optimizing Decoding Paths in Masked Diffusion Models by Quantifying Uncertainty
Link: https://arxiv.org/abs/2512.21336
Authors: Ziyu Chen, Xinbei Jiang, Peng Sun, Tao Lin
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Masked Diffusion Models, Masked Diffusion, Diffusion Models, final output quality, offer flexible
Comments:
Abstract:Masked Diffusion Models (MDMs) offer flexible, non-autoregressive generation, but this freedom introduces a challenge: final output quality is highly sensitive to the decoding order. We are the first to formalize this issue, attributing the variability in output quality to the cumulative predictive uncertainty along a generative path. To quantify this uncertainty, we introduce Denoising Entropy, a computable metric that serves as an internal signal for evaluating generative process. Leveraging this metric, we propose two algorithms designed to optimize the decoding path: a post-hoc selection method and a real-time guidance strategy. Experiments demonstrate that our entropy-guided methods significantly improve generation quality, consistently boosting accuracy on challenging reasoning, planning, and code benchmarks. Our work establishes Denoising Entropy as a principled tool for understanding and controlling generation, effectively turning the uncertainty in MDMs from a liability into a key advantage for discovering high-quality solutions.
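The abstract does not give the formula for Denoising Entropy, so the following is only a minimal sketch of the general idea, assuming that the uncertainty of a decoding path can be approximated by summing the Shannon entropy of the model's predictive distribution at each step. The function names and the toy distributions are hypothetical and are not taken from the paper.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def path_uncertainty(step_distributions):
    """Illustrative proxy: sum of per-step entropies along one decoding path."""
    return sum(shannon_entropy(p) for p in step_distributions)


def select_lowest_uncertainty(paths):
    """Post-hoc selection: index of the path with the smallest score, plus all scores."""
    scores = [path_uncertainty(p) for p in paths]
    best_index = min(range(len(paths)), key=scores.__getitem__)
    return best_index, scores


# Two hypothetical decoding paths over a 3-token vocabulary.
confident_path = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]
uncertain_path = [[0.4, 0.3, 0.3], [0.34, 0.33, 0.33]]

best, scores = select_lowest_uncertainty([uncertain_path, confident_path])
print(best, scores)
```

In this toy setup the second path is selected because its predictive distributions are more peaked at every step. The real-time guidance strategy mentioned in the abstract would need the score during decoding rather than after it, which this sketch does not attempt.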
2. 【2512.21332】C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling
Link: https://arxiv.org/abs/2512.21332
Authors: Jin Qin, Zihan Liao, Ziyin Zhang, Hang Yu, Peng Di, Rui Wang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Contrastive Code Large, Code Large Language, Large Language Models, Large Language, Contrastive Code
Comments:
Abstract:We present C2LLM - Contrastive Code Large Language Models, a family of code embedding models in both 0.5B and 7B sizes. Building upon Qwen-2.5-Coder backbones, C2LLM adopts a Pooling by Multihead Attention (PMA) module for generating sequence embedding from token embeddings, effectively 1) utilizing the LLM's causal representations acquired during pretraining, while also 2) being able to aggregate information from all tokens in the sequence, breaking the information bottleneck in EOS-based sequence embeddings, and 3) supporting flexible adaptation of embedding dimension, serving as an alternative to MRL. Trained on three million publicly available data, C2LLM models set new records on MTEB-Code among models of similar sizes, with C2LLM-7B ranking 1st on the overall leaderboard.
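The abstract describes pooling token embeddings into a sequence embedding with a Pooling by Multihead Attention (PMA) module. As a rough illustration only, the sketch below shows single-head scaled dot-product cross-attention with one query vector, which is the basic operation behind attention pooling. It leaves out the multiple heads, learned projections, and feed-forward layers of a real PMA module, and the names `attention_pool`, `query`, and `tokens` are hypothetical rather than taken from the C2LLM code.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(query, token_embeddings):
    """Pool a sequence into one vector; tokens act as both keys and values."""
    dim = len(query)
    scores = [
        sum(q * t for q, t in zip(query, tok)) / math.sqrt(dim)
        for tok in token_embeddings
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * tok[i] for w, tok in zip(weights, token_embeddings))
        for i in range(dim)
    ]
    return pooled, weights


query = [1.0, 0.0]  # stands in for a learned query vector (assumption)
tokens = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]  # toy token embeddings

pooled, weights = attention_pool(query, tokens)
print(pooled, weights)
```

Unlike taking only the final EOS token's hidden state, every token contributes to the pooled vector in proportion to its attention weight, which is the property the abstract highlights.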
3. 【2512.21329】Your Reasoning Benchmark May Not Test Reasoning: Revealing Perception Bottleneck in Abstract Reasoning Benchmarks
Link: https://arxiv.org/abs/2512.21329
Authors: Xinhe Wang, Jin Huang, Xingjian Zhang, Tianhao Wang, Jiaqi W. Ma
Categories: Computation and Language (cs.CL)
Keywords: ARC-AGI are widely, fluid reasoning abilities, Reasoning Corpus, Reasoning, probes of core
Comments:
Abstract: Reasoning benchmarks such as the Abstraction and Reasoning Corpus (ARC) and ARC-AGI are widely used to assess progress in artificial intelligence and are often interpreted as probes of core, so-called "fluid" reasoning abilities. Despite their apparent simplicity for humans, these tasks remain challenging for frontier vision-language models (VLMs), a gap commonly attributed to deficiencies in machine reasoning. We challenge this interpretation and hypothesize that the gap arises primarily from limitations in visual perception rather than from shortcomings in inductive reasoning. To verify this hypothesis, we introduce a two-stage experimental pipeline that explicitly separates perception and reasoning. In the perception stage, each image is independently converted into a natural-language description, while in the reasoning stage a model induces and applies rules using these descriptions. This design prevents leakage of cross-image inductive signals and isolates reasoning from perception bottlenecks. Across three ARC-style datasets, Mini-ARC, ACRE, and Bongard-LOGO, we show that the perception capability is the dominant factor underlying the observed performance gap by comparing the two-stage pipeline against standard end-to-end one-stage evaluation. Manual inspection of reasoning traces in the VLM outputs further reveals that approximately 80 percent of model failures stem from perception errors. Together, these results demonstrate that ARC-style benchmarks conflate perceptual and reasoning challenges and that observed performance gaps may overstate deficiencies in machine reasoning. Our findings underscore the need for evaluation protocols that disentangle perception from reasoning when assessing progress in machine intelligence.
4. 【2512.21326】Measuring all the noises of LLM Evals
Link: https://arxiv.org/abs/2512.21326
Authors: Sida Wang
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Keywords: Separating signal, noise, experimental science, central to experimental, Separating
Comments:
Abstract:Separating signal from noise is central to experimental science. Applying well-established statistical method effectively to LLM evals requires consideration of their unique noise characteristics. We clearly define and measure three types of noise: prediction noise from generating different answers on a given question, data noise from sampling questions, and their combined total noise following the law of total variance. To emphasize relative comparisons and gain statistical power, we propose the all-pairs paired method, which applies the paired analysis to all pairs of LLMs and measures all the noise components based on millions of question-level predictions across many evals and settings. These measurements revealed clear patterns. First, each eval exhibits a characteristic and highly predictable total noise level across all model pairs. Second, paired prediction noise typically exceeds paired data noise, which means reducing prediction noise by averaging can significantly increase statistical power. These findings enable practitioners to assess significance without custom testing and to detect much smaller effects in controlled experiments.
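The decomposition into prediction noise, data noise, and total noise follows the law of total variance, Var(score) = E[Var(score | question)] + Var(E[score | question]). The sketch below checks this identity on a small, made-up table of correctness scores for a single model. It is an unpaired toy example, not the all-pairs paired method from the paper, and the data and variable names are hypothetical.

```python
from statistics import fmean, pvariance

# Hypothetical correctness scores: rows are questions, columns are repeated samples.
scores = [
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
]

question_means = [fmean(row) for row in scores]

# Prediction noise: average within-question variance, E[Var(score | question)].
prediction_noise = fmean([pvariance(row) for row in scores])

# Data noise: variance of the per-question means, Var(E[score | question]).
data_noise = pvariance(question_means)

# Total noise: variance of a single score drawn from a random question and sample.
total_noise = pvariance([s for row in scores for s in row])

print(prediction_noise, data_noise, total_noise)
```

The identity holds exactly here because every question has the same number of samples and population variances are used throughout. Averaging several samples per question shrinks the prediction-noise term, which is the lever the abstract points to for gaining statistical power.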
5. 【2512.21323】Parallel Token Prediction for Language Models
Link: https://arxiv.org/abs/2512.21323
Authors: Felix Draxler, Justus Will, Farrin Marouf Sofian, Theofanis Karaletsos, Sameer Singh, Stephan Mandt
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: propose Parallel Token, PTP, Parallel Token Prediction, language models, Token Prediction
Comments: Preprint. Under review
Abstract:We propose Parallel Token Prediction (PTP), a universal framework for parallel sequence generation in language models. PTP jointly predicts multiple dependent tokens in a single transformer call by incorporating the sampling procedure into the model. This reduces the latency bottleneck of autoregressive decoding, and avoids the restrictive independence assumptions common in existing multi-token prediction methods. We prove that PTP can represent arbitrary autoregressive sequence distributions. PTP is trained either by distilling an existing model or through inverse autoregressive training without a teacher. Experimentally, we achieve state-of-the-art speculative decoding performance on Vicuna-7B by accepting over four tokens per step on Spec-Bench. The universality of our framework indicates that parallel generation of long sequences is feasible without loss of modeling power.
6. 【2512.21280】SMART SLM: Structured Memory and Reasoning Transformer, A Small Language Model for Accurate Document Assistance
Link: https://arxiv.org/abs/2512.21280
Authors: Divij Dudeja, Mayukha Pal
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Engineering Manuals, engineering equipment, includes written documents, standard parameter lists, step procedures
Comments:
Abstract: Users of Engineering Manuals (EMs) find them difficult to read because they are long and densely formatted, combining written documents, step-by-step procedures, and standard parameter lists for engineering equipment. Off-the-shelf transformers, especially compact ones, treat this material as a flat stream of tokens. This approach leads to confident but incorrect numeric answers and forces the models to memorize separate facts inefficiently. SMART (Structured Memory and Reasoning Transformer) offers a different and practical solution to this problem. SMART structures its processing hierarchically around three main components: (1) a syntax-aware Fact Extractor (Grammarian), a Tree LSTM that extracts facts as subject-relation-object relations from EM sentences; (2) a compact indexed memory, a MANN (Memory Augmented Neural Network), that indexes these Rational Subject Relation Objects as 384-dimensional vectors associated with the source of the information; and (3) a 6-layer Transformer that learns to fuse the previously retrieved facts into its generated response. The entire SMART model uses 45.51M parameters, which is 64% fewer than GPT-2 (124M) and 69% fewer than BERT (133M), and it achieves 21.3% higher accuracy than GPT-2, indicating that SMART fits the data better with the least amount of processing requirements. SMART employs dual modes of inference: an indexed fast path for known documents (sub-second answer times) and an indexed dynamic path assisted by RAGs for new uploads (FAISS Top 20 results with memory severed at 64 slots). In real-world deployment, this framework leads to better-supported results with fewer hallucinations than comparable small transformer models.
7. 【2512.21257】ReaSeq: Unleashing World Knowledge via Reasoning for Sequential Modeling
Link: https://arxiv.org/abs/2512.21257
Authors: Chuan Wang, Gaoming Yang, Han Wu, Jiakai Tang, Jiahao Yu, Jian Wu, Jianwu Hu, Junjun Zheng, Shuwen Xiao, Yeqiu Yang, Yuning Jiang, Ahjol Nurlanbek, Binbin Cao, Bo Zheng, Fangmei Zhu, Gaoming Zhou, Huimin Yi, Huiping Chu, Jin Huang, Jinzhe Shan, Kenan Cui, Longbin Li, Silu Zhou, Wen Chen, Xia Ming, Xiang Gao, Xin Yao, Xingyu Wen, Yan Zhang, Yiwen Hu, Yulin Wang, Ziheng Bao, Zongyuan Wu
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Keywords: Large Language Models, brittle interest modeling, Industrial recommender systems, constrains model performance, Language Models
Comments:
Abstract:Industrial recommender systems face two fundamental limitations under the log-driven paradigm: (1) knowledge poverty in ID-based item representations that causes brittle interest modeling under data sparsity, and (2) systemic blindness to beyond-log user interests that constrains model performance within platform boundaries. These limitations stem from an over-reliance on shallow interaction statistics and close-looped feedback while neglecting the rich world knowledge about product semantics and cross-domain behavioral patterns that Large Language Models have learned from vast corpora. To address these challenges, we introduce ReaSeq, a reasoning-enhanced framework that leverages world knowledge in Large Language Models to address both limitations through explicit and implicit reasoning. Specifically, ReaSeq employs explicit Chain-of-Thought reasoning via multi-agent collaboration to distill structured product knowledge into semantically enriched item representations, and latent reasoning via Diffusion Large Language Models to infer plausible beyond-log behaviors. Deployed on Taobao's ranking system serving hundreds of millions of users, ReaSeq achieves substantial gains: 6.0% in IPV and CTR, 2.9% in Orders, and 2.5% in GMV, validating the effectiveness of world-knowledge-enhanced reasoning over purely log-driven approaches.
8. 【2512.21204】SpidR-Adapt: A Universal Speech Representation Model for Few-Shot Adaptation
Link: https://arxiv.org/abs/2512.21204
Authors: Mahi Luthra, Jiayi Shen, Maxime Poli, Angelo Ortiz, Yosuke Higuchi, Youssef Benchekroun, Martin Gleize, Charles-Eric Saint-James, Dongyan Lin, Phillip Rust, Angel Villar, Surya Parimi, Vanessa Stark, Rashel Moritz, Juan Pino, Yann LeCun, Emmanuel Dupoux
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: acquire basic units, striking efficiency gap, efficiency gap compared, Human infants, acquire basic
Comments:
Abstract:Human infants, with only a few hundred hours of speech exposure, acquire basic units of new languages, highlighting a striking efficiency gap compared to the data-hungry self-supervised speech models. To address this gap, this paper introduces SpidR-Adapt for rapid adaptation to new languages using minimal unlabeled data. We cast such low-resource speech representation learning as a meta-learning problem and construct a multi-task adaptive pre-training (MAdaPT) protocol which formulates the adaptation process as a bi-level optimization framework. To enable scalable meta-training under this framework, we propose a novel heuristic solution, first-order bi-level optimization (FOBLO), avoiding heavy computation costs. Finally, we stabilize meta-training by using a robust initialization through interleaved supervision which alternates self-supervised and supervised objectives. Empirically, SpidR-Adapt achieves rapid gains in phonemic discriminability (ABX) and spoken language modeling (sWUGGY, sBLIMP, tSC), improving over in-domain language models after training on less than 1h of target-language audio, over $100\times$ more data-efficient than standard training. These findings highlight a practical, architecture-agnostic path toward biologically inspired, data-efficient representations. We open-source the training code and model checkpoints at this https URL.
9. 【2512.21120】ClarifyMT-Bench: Benchmarking and Improving Multi-Turn Clarification for Conversational Large Language Models
Link: https://arxiv.org/abs/2512.21120
Authors: Sichun Luo, Yi Huang, Mukai Li, Shichang Meng, Fengyuan Liu, Zefa Hu, Junlan Feng, Qi Liu
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Large language models, Large language, language models, assistants in open-domain, ambiguous information
Comments:
Abstract: Large language models (LLMs) are increasingly deployed as conversational assistants in open-domain, multi-turn settings, where users often provide incomplete or ambiguous information. However, existing LLM-focused clarification benchmarks primarily assume single-turn interactions or cooperative users, limiting their ability to evaluate clarification behavior in realistic settings. We introduce ClarifyMT-Bench, a benchmark for multi-turn clarification grounded in a five-dimensional ambiguity taxonomy and a set of six behaviorally diverse simulated user personas. Through a hybrid LLM-human pipeline, we construct 6,120 multi-turn dialogues capturing diverse ambiguity sources and interaction patterns. Evaluating ten representative LLMs uncovers a consistent under-clarification bias: LLMs tend to answer prematurely, and performance degrades as dialogue depth increases. To mitigate this, we propose ClarifyAgent, an agentic approach that decomposes clarification into perception, forecasting, tracking, and planning, substantially improving robustness across ambiguity conditions. ClarifyMT-Bench establishes a reproducible foundation for studying when LLMs should ask, when they should answer, and how to navigate ambiguity in real-world human-LLM interactions.
10. 【2512.21110】Beyond Context: Large Language Models Failure to Grasp Users Intent
Link: https://arxiv.org/abs/2512.21110
Authors: Ahmed M. Hussain, Salahuddin Salahuddin, Panos Papadimitratos
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
Keywords: Large Language Models, Current Large Language, Language Models, Large Language, explicitly harmful content
Comments: 22 pages and 23 figures
Abstract:Current Large Language Models (LLMs) safety approaches focus on explicitly harmful content while overlooking a critical vulnerability: the inability to understand context and recognize user intent. This creates exploitable vulnerabilities that malicious users can systematically leverage to circumvent safety mechanisms. We empirically evaluate multiple state-of-the-art LLMs, including ChatGPT, Claude, Gemini, and DeepSeek. Our analysis demonstrates the circumvention of reliable safety mechanisms through emotional framing, progressive revelation, and academic justification techniques. Notably, reasoning-enabled configurations amplified rather than mitigated the effectiveness of exploitation, increasing factual precision while failing to interrogate the underlying intent. The exception was Claude Opus 4.1, which prioritized intent detection over information provision in some use cases. This pattern reveals that current architectural designs create systematic vulnerabilities. These limitations require paradigmatic shifts toward contextual understanding and intent recognition as core safety capabilities rather than post-hoc protective mechanisms.
11. 【2512.21107】Semi-Supervised Learning for Large Language Models Safety and Content Moderation
Link: https://arxiv.org/abs/2512.21107
Authors: Eduard Stefan Dinuta, Iustin Sirbu, Traian Rebedea
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: ongoing research focus, Large Language Models, Large Language, Language Models, ongoing research
Comments:
Abstract:Safety for Large Language Models (LLMs) has been an ongoing research focus since their emergence and is even more relevant nowadays with the increasing capacity of those models. Currently, there are several guardrails in place for all public LLMs and multiple proposed datasets for training safety classifiers. However, training these safety classifiers relies on large quantities of labeled data, which can be problematic to acquire, prone to labeling errors, or often include synthetic data. To address these issues, we suggest a different approach: utilizing semi-supervised learning techniques, which leverage both labeled and unlabeled data, to improve the performance on the safety task. We analyze the improvements that these techniques can offer for both prompts given to Large Language Models and the responses to those requests. Moreover, since augmentation is the central part of semi-supervised algorithms, we demonstrate the importance of using task-specific augmentations, which significantly increase the performance when compared to general-purpose augmentation techniques.
12. 【2512.21106】Semantic Refinement with LLMs for Graph Representations
Link: https://arxiv.org/abs/2512.21106
Authors: Safal Thapaliya, Zehong Wang, Jiazheng Li, Ziming Li, Yanfang Ye, Chuxu Zhang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Graph-structured data exhibit, structural patterns play, data exhibit substantial, Graph-structured data, exhibit substantial heterogeneity
Comments:
Abstract:Graph-structured data exhibit substantial heterogeneity in where their predictive signals originate: in some domains, node-level semantics dominate, while in others, structural patterns play a central role. This structure-semantics heterogeneity implies that no graph learning model with a fixed inductive bias can generalize optimally across diverse graph domains. However, most existing methods address this challenge from the model side by incrementally injecting new inductive biases, which remains fundamentally limited given the open-ended diversity of real-world graphs. In this work, we take a data-centric perspective and treat node semantics as a task-adaptive variable. We propose a Data-Adaptive Semantic Refinement framework DAS for graph representation learning, which couples a fixed graph neural network (GNN) and a large language model (LLM) in a closed feedback loop. The GNN provides implicit supervisory signals to guide the semantic refinement of LLM, and the refined semantics are fed back to update the same graph learner. We evaluate our approach on both text-rich and text-free graphs. Results show consistent improvements on structure-dominated graphs while remaining competitive on semantics-rich graphs, demonstrating the effectiveness of data-centric semantic adaptation under structure-semantics heterogeneity.
13. 【2512.21017】Rethinking Supervised Fine-Tuning: Emphasizing Key Answer Tokens for Improved LLM Accuracy
Link: https://arxiv.org/abs/2512.21017
Authors: Xiaofeng Shi, Qian Kou, Yuduo Li, Hua Zhou
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, complex reasoning tasks, advancement of Large, Language Models
Comments:
Abstract: With the rapid advancement of Large Language Models (LLMs), the Chain-of-Thought (CoT) component has become significant for complex reasoning tasks. However, in conventional Supervised Fine-Tuning (SFT), the model could allocate disproportionately more attention to CoT sequences with excessive length. This reduces focus on the much shorter but essential Key portion, the final answer, whose correctness directly determines task success and evaluation quality. To address this limitation, we propose SFTKey, a two-stage training scheme. In the first stage, conventional SFT is applied to ensure proper output format, while in the second stage, only the Key portion is fine-tuned to improve accuracy. Extensive experiments across multiple benchmarks and model families demonstrate that SFTKey achieves an average accuracy improvement exceeding 5% over conventional SFT, while preserving the ability to generate correct formats. Overall, this study advances LLM fine-tuning by explicitly balancing CoT learning with additional optimization on answer-relevant tokens.
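One way to picture the two-stage scheme is as a change in which token positions contribute to the training loss. The sketch below is an assumption-laden illustration, not the SFTKey implementation: it assumes each example is laid out as prompt, CoT, then final answer, that prompt tokens are never supervised, and that stage 2 supervises only the answer tokens. The helper names `loss_mask` and `masked_mean_loss` are hypothetical.

```python
def loss_mask(num_prompt, num_cot, num_answer, stage):
    """Return a 0/1 mask over [prompt | CoT | answer] token positions."""
    if stage == 1:
        # Conventional SFT: supervise the whole response (CoT and answer).
        response = [1] * (num_cot + num_answer)
    elif stage == 2:
        # Key-only stage: supervise just the final-answer tokens.
        response = [0] * num_cot + [1] * num_answer
    else:
        raise ValueError("stage must be 1 or 2")
    return [0] * num_prompt + response


def masked_mean_loss(token_losses, mask):
    """Average the per-token losses over the positions where mask is 1."""
    selected = [loss for loss, m in zip(token_losses, mask) if m]
    if not selected:
        raise ValueError("mask selects no tokens")
    return sum(selected) / len(selected)


# Toy example: 2 prompt tokens, 3 CoT tokens, 1 answer token.
per_token_loss = [9.0, 9.0, 1.0, 2.0, 3.0, 4.0]
stage1 = masked_mean_loss(per_token_loss, loss_mask(2, 3, 1, stage=1))
stage2 = masked_mean_loss(per_token_loss, loss_mask(2, 3, 1, stage=2))
print(stage1, stage2)
```

In stage 1 the single answer token is averaged together with three CoT tokens, so its loss of 4.0 is diluted to a mean of 2.5. In stage 2 it is the only supervised position, which reflects the imbalance the abstract describes.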
14. 【2512.21002】Distilling the Essence: Efficient Reasoning Distillation via Sequence Truncation
Link: https://arxiv.org/abs/2512.21002
Authors: Wei-Rui Chen, Vignesh Kothapalli, Ata Fatahibaarzi, Hejian Sang, Shao Tang, Qingquan Song, Zhipeng Wang, Muhammad Abdul-Mageed
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: large language model, smaller student model, language model, large language, substantial amounts
Comments:
Abstract:Distilling the reasoning capabilities from a large language model (LLM) to a smaller student model often involves training on substantial amounts of reasoning data. However, distillation over lengthy sequences with prompt (P), chain-of-thought (CoT), and answer (A) segments makes the process computationally expensive. In this work, we investigate how the allocation of supervision across different segments (P, CoT, A) affects student performance. Our analysis shows that selective knowledge distillation over only the CoT tokens can be effective when the prompt and answer information is encompassed by it. Building on this insight, we establish a truncation protocol to quantify computation-quality tradeoffs as a function of sequence length. We observe that training on only the first $50\%$ of tokens of every training sequence can retain, on average, $\approx94\%$ of full-sequence performance on math benchmarks while reducing training time, memory usage, and FLOPs by about $50\%$ each. These findings suggest that reasoning distillation benefits from prioritizing early reasoning tokens and provides a simple lever for computation-quality tradeoffs. Codes are available at this https URL.
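The truncation idea can be sketched in a few lines: keep only a leading fraction of each training sequence before computing the distillation loss. The abstract does not specify how fractional lengths are rounded or how the segments are tokenized, so the rounding rule (ceiling) and the function name `truncate_sequence` below are assumptions made for illustration.

```python
import math


def truncate_sequence(tokens, keep_fraction=0.5):
    """Keep the first `keep_fraction` of a token sequence (rounded up)."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    keep = math.ceil(len(tokens) * keep_fraction)
    return tokens[:keep]


# Hypothetical tokenized training sequences (prompt + CoT + answer).
batch = [list(range(10)), [101, 102, 103]]
truncated = [truncate_sequence(seq) for seq in batch]
print(truncated)
```

Training cost grows with sequence length, so halving every sequence is a simple lever for the roughly 50% savings in time, memory, and FLOPs reported in the abstract.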
15. 【2512.20983】Automatic Replication of LLM Mistakes in Medical Conversations
Link: https://arxiv.org/abs/2512.20983
Authors: Oleksii Proniakin, Diego Fajardo, Ruslan Nazarenko, Razvan Marinescu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: quantify reasoning quality, Large language models, reasoning quality, increasingly evaluated, evaluated in clinical
Comments: 48 pages, 3 figures, 4 tables
Abstract:Large language models (LLMs) are increasingly evaluated in clinical settings using multi-dimensional rubrics which quantify reasoning quality, safety, and patient-centeredness. Yet, replicating specific mistakes in other LLM models is not straightforward and often requires manual effort. We introduce MedMistake, an automatic pipeline that extracts mistakes LLMs make in patient-doctor conversations and converts them into a benchmark of single-shot QA pairs. Our pipeline (1) creates complex, conversational data between an LLM patient and LLM doctor, (2) runs an evaluation with a committee of 2 LLM judges across a variety of dimensions and (3) creates simplified single-shot QA scenarios from those mistakes. We release MedMistake-All, a dataset of 3,390 single-shot QA pairs where GPT-5 and Gemini 2.5 Pro are currently failing to answer correctly, as judged by two LLM judges. We used medical experts to validate a subset of 211/3390 questions (MedMistake-Bench), which we used to run a final evaluation of 12 frontier LLMs: Claude Opus 4.5, Claude Sonnet 4.5, DeepSeek-Chat, Gemini 2.5 Pro, Gemini 3 Pro, GPT-4o, GPT-5, GPT-5.1, GPT-5.2, Grok 4, Grok 4.1, Mistral Large. We found that GPT models, Claude and Grok obtained the best performance on MedMistake-Bench. We release both the doctor-validated benchmark (MedMistake-Bench), as well as the full dataset (MedMistake-All) at this https URL.
16. 【2512.20954】Reflection Pretraining Enables Token-Level Self-Correction in Biological Sequence Models
Link: https://arxiv.org/abs/2512.20954
Authors: Xiang Zhang, Jiaqi Wei, Yuejin Yang, Zijie Qiu, Yuhan Chen, Zhiqiang Gao, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, Wanli Ouyang, Chenyu You, Siqi Sun
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: advanced task-solving capabilities, natural language processing, significantly advanced task-solving, advanced task-solving, task-solving capabilities
Comments:
Abstract:Chain-of-Thought (CoT) prompting has significantly advanced task-solving capabilities in natural language processing with large language models. Unlike standard prompting, CoT encourages the model to generate intermediate reasoning steps, non-answer tokens, that help guide the model toward more accurate final outputs. These intermediate steps enable more complex reasoning processes such as error correction, memory management, future planning, and self-reflection. However, applying CoT to non-natural language domains, such as protein and RNA language models, is not yet possible, primarily due to the limited expressiveness of their token spaces (e.g., amino acid tokens). In this work, we propose and define the concept of language expressiveness: the ability of a given language, using its tokens and grammar, to encode information. We show that the limited expressiveness of protein language severely restricts the applicability of CoT-style reasoning. To overcome this, we introduce reflection pretraining, for the first time in a biological sequence model, which enables the model to engage in intermediate reasoning through the generation of auxiliary "thinking tokens" beyond simple answer tokens. Theoretically, we demonstrate that our augmented token set significantly enhances biological language expressiveness, thereby improving the overall reasoning capacity of the model. Experimentally, our pretraining approach teaches protein models to self-correct and leads to substantial performance gains compared to standard pretraining.
17. 【2512.20950】MultiMind at SemEval-2025 Task 7: Crosslingual Fact-Checked Claim Retrieval via Multi-Source Alignment
Link: https://arxiv.org/abs/2512.20950
Authors: Mohammad Mahdi Abootorabi, Alireza Ghahramani Kure, Mohammadali Mohammadkhani, Sina Elahimanesh, Mohammad Ali Ali Panah
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: Crosslingual Fact-Checked Claim, Multilingual and Crosslingual, Fact-Checked Claim Retrieval, paper presents, presents our system
Comments: 11 pages. Published at the SemEval-2025 workshop
Abstract:This paper presents our system for SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval. In an era where misinformation spreads rapidly, effective fact-checking is increasingly critical. We introduce TriAligner, a novel approach that leverages a dual-encoder architecture with contrastive learning and incorporates both native and English translations across different modalities. Our method effectively retrieves claims across multiple languages by learning the relative importance of different sources in alignment. To enhance robustness, we employ efficient data preprocessing and augmentation using large language models while incorporating hard negative sampling to improve representation learning. We evaluate our approach on monolingual and crosslingual benchmarks, demonstrating significant improvements in retrieval accuracy and fact-checking performance over baselines.
18. 【2512.20949】Neural Probe-Based Hallucination Detection for Large Language Models
Link: https://arxiv.org/abs/2512.20949
Authors: Shize Liang, Hongzhi Wang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: knowledge question-answering tasks, generating hallucinated content, Large language models, Large language, excel at text
Comments:
Abstract: Large language models (LLMs) excel at text generation and knowledge question-answering tasks, but they are prone to generating hallucinated content, severely limiting their application in high-risk domains. Current hallucination detection methods based on uncertainty estimation and external knowledge retrieval suffer from the limitation that they still produce erroneous content at high confidence levels and rely heavily on retrieval efficiency and knowledge coverage. In contrast, probe methods that leverage the model's hidden-layer states offer real-time and lightweight advantages. However, traditional linear probes struggle to capture nonlinear structures in deep semantic [...]. To overcome these limitations, we propose a neural network-based framework for token-level hallucination detection. By freezing language model parameters, we employ lightweight MLP probes to perform nonlinear modeling of high-level hidden states. A multi-objective joint loss function is designed to enhance detection stability and semantic disambiguity. Additionally, we establish a layer position-probe performance response model, using Bayesian optimization to automatically search for optimal probe insertion layers and achieve superior training [...]. [...] results on LongFact, HealthBench, and TriviaQA demonstrate that MLP probes significantly outperform state-of-the-art methods in accuracy, recall, and detection capability under low false-positive conditions.
19. 【2512.20948】Foundation Model-based Evaluation of Neuropsychiatric Disorders: A Lifespan-Inclusive, Multi-Modal, and Multi-Lingual Study
Link: https://arxiv.org/abs/2512.20948
Authors: Zhongren Dong, Haotian Guo, Weixiang Xu, Huan Zhao, Zixing Zhang
Categories: Computation and Language (cs.CL); Sound (cs.SD)
Keywords: offering potential biomarkers, autism spectrum disorder, Alzheimer disease, acoustic abnormalities, offering potential
Comments:
Abstract:Neuropsychiatric disorders, such as Alzheimer's disease (AD), depression, and autism spectrum disorder (ASD), are characterized by linguistic and acoustic abnormalities, offering potential biomarkers for early detection. Despite the promise of multi-modal approaches, challenges like multi-lingual generalization and the absence of a unified evaluation framework persist. To address these gaps, we propose FEND (Foundation model-based Evaluation of Neuropsychiatric Disorders), a comprehensive multi-modal framework integrating speech and text modalities for detecting AD, depression, and ASD across the lifespan. Leveraging 13 multi-lingual datasets spanning English, Chinese, Greek, French, and Dutch, we systematically evaluate multi-modal fusion performance. Our results show that multi-modal fusion excels in AD and depression detection but underperforms in ASD due to dataset heterogeneity. We also identify modality imbalance as a prevalent issue, where multi-modal fusion fails to surpass the best mono-modal models. Cross-corpus experiments reveal robust performance in task- and language-consistent scenarios but noticeable degradation in multi-lingual and task-heterogeneous settings. By providing extensive benchmarks and a detailed analysis of performance-influencing factors, FEND advances the field of automated, lifespan-inclusive, and multi-lingual neuropsychiatric disorder assessment. We encourage researchers to adopt the FEND framework for fair comparisons and reproducible research.
20. 【2512.20934】Transductive Visual Programming: Evolving Tool Libraries from Experience for Spatial Reasoning
Link: https://arxiv.org/abs/2512.20934
Authors: Shengguang Wu, Xiaohan Wang, Yuhui Zhang, Hao Zhu, Serena Yeung-Levy
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Keywords: scenes requires precise, challenge vision-language models, requires precise geometric, precise geometric calculations, Visual programming
Comments: Project Website: [this https URL](https://transductive-visualprogram.github.io/)
Abstract:Spatial reasoning in 3D scenes requires precise geometric calculations that challenge vision-language models. Visual programming addresses this by decomposing problems into steps calling specialized tools, yet existing methods rely on either fixed toolsets or speculative tool induction before solving problems, resulting in suboptimal programs and poor utilization of induced tools. We present Transductive Visual Programming (TVP), a novel framework that builds new tools from its own experience rather than speculation. TVP first solves problems using basic tools while accumulating experiential solutions into an Example Library, then abstracts recurring patterns from these programs into reusable higher-level tools for an evolving Tool Library. This allows TVP to tackle new problems with increasingly powerful tools learned from experience. On Omni3D-Bench, TVP achieves state-of-the-art performance, outperforming GPT-4o by 22% and the previous best visual programming system by 11%. Our transductively learned tools are used 5x more frequently as core program dependency than inductively created ones, demonstrating more effective tool discovery and reuse. The evolved tools also show strong generalization to unseen spatial tasks, achieving superior performance on benchmarks from SpatialScore-Hard collection without any testset-specific modification. Our work establishes experience-driven transductive tool creation as a powerful paradigm for building self-evolving visual programming agents that effectively tackle challenging spatial reasoning tasks. We release our code at this https URL.
21. 【2512.20908】Where Did This Sentence Come From? Tracing Provenance in LLM Reasoning Distillation
Link: https://arxiv.org/abs/2512.20908
Authors: Kaiyuan Liu, Shaotian Yan, Rui Miao, Bing Wang, Chen Shen, Jun Zhang, Jieping Ye
Categories: Computation and Language (cs.CL)
Keywords: attracted increasing attention, Reasoning Distillation Provenance, Reasoning distillation, Distillation Provenance Tracing, distilled model
Comments:
Abstract:Reasoning distillation has attracted increasing attention. It typically leverages a large teacher model to generate reasoning paths, which are then used to fine-tune a student model so that it mimics the teacher's behavior in training contexts. However, previous approaches have lacked a detailed analysis of the origins of the distilled model's capabilities. It remains unclear whether the student can maintain consistent behaviors with the teacher in novel test-time contexts, or whether it regresses to its original output patterns, raising concerns about the generalization of distillation models. To analyse this question, we introduce a cross-model Reasoning Distillation Provenance Tracing framework. For each action (e.g., a sentence) produced by the distilled model, we obtain the predictive probabilities assigned by the teacher, the original student, and the distilled model under the same context. By comparing these probabilities, we classify each action into different categories. By systematically disentangling the provenance of each action, we experimentally demonstrate that, in test-time contexts, the distilled model can indeed generate teacher-originated actions, which correlate with and plausibly explain the observed performance of the distilled model. Building on this analysis, we further propose a teacher-guided data selection method. Unlike prior approaches that rely on heuristics, our method directly compares teacher-student divergences on the training data, providing a principled selection criterion. We validate the effectiveness of our approach across multiple representative teacher models and diverse student models. The results highlight the utility of our provenance-tracing framework and underscore its promise for reasoning distillation. We hope to share Reasoning Distillation Provenance Tracing and our insights into reasoning distillation with the community.
22. 【2512.20877】Architectural Trade-offs in Small Language Models Under Compute Constraints
Link: https://arxiv.org/abs/2512.20877
Authors: Shivraj Singh Bhatti
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: strict compute constraints, systematic empirical study, training budget interact, compute constraints, present a systematic
Comments: 15 pages, 11 images
Abstract:We present a systematic empirical study of small language models under strict compute constraints, analyzing how architectural choices and training budget interact to determine performance. Starting from a linear next-token predictor, we progressively introduce nonlinearities, self-attention, and multi-layer transformer architectures, evaluating each on character-level modeling of Tiny Shakespeare and word-level modeling of Penn Treebank (PTB) and WikiText-2. We compare models using test negative log-likelihood (NLL), parameter count, and approximate training FLOPs to characterize accuracy-efficiency trade-offs. Our results show that attention-based models dominate MLPs in per-FLOP efficiency even at small scale, while increasing depth or context without sufficient optimization can degrade performance. We further examine rotary positional embeddings (RoPE), finding that architectural techniques successful in large language models do not necessarily transfer to small-model regimes.
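The three comparison axes named in this abstract (test NLL, parameter count, approximate training FLOPs) can be illustrated with a minimal sketch. The abstract does not say how FLOPs were estimated; the `6 * N * D` rule of thumb used below is a common approximation and is an assumption here, as are all the numbers.

```python
import math

def mean_nll(true_token_probs):
    """Average negative log-likelihood (nats per token) of the correct tokens."""
    return -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)

def approx_training_flops(n_params, n_tokens):
    """Rule-of-thumb estimate (assumption): about 6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

# Hypothetical probabilities a model assigned to the correct next tokens.
probs = [0.5, 0.25, 0.125]
nll = mean_nll(probs)            # 2 * ln(2) nats per token
perplexity = math.exp(nll)       # perplexity is exp(NLL)

# Hypothetical model size and training-set size.
flops = approx_training_flops(n_params=1_000_000, n_tokens=2_000_000)
print(f"NLL={nll:.3f} nats, perplexity={perplexity:.1f}, FLOPs~{flops:.1e}")
```

Plotting NLL against such a FLOPs estimate for each architecture is one way to read off a per-FLOP efficiency comparison of the kind the abstract describes.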
23. 【2512.20856】NVIDIA Nemotron 3: Efficient and Open Intelligence
Link: https://arxiv.org/abs/2512.20856
Authors: NVIDIA: Aaron Blakeman, Aaron Grattafiori, Aarti Basant, Abhibha Gupta, Abhinav Khattar, Adi Renduchintala, Aditya Vavre, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, Aleksandr Shaposhnikov, Alex Kondratenko, Alexander Bukharin, Alexandre Milesi, Ali Taghibakhshi, Alisa Liu, Amelia Barton, Ameya Sunil Mahabaleshwarkar, Amir Klein, Amit Zuker, Amnon Geifman, Amy Shen, Anahita Bhiwandiwalla, Andrew Tao, Anjulie Agrusa, Ankur Verma, Ann Guan, Anubhav Mandarwal, Arham Mehta, Ashwath Aithal, Ashwin Poojary, Asif Ahamed, Asit Mishra, Asma Kuriparambil Thekkumpate, Ayush Dattagupta, Banghua Zhu, Bardiya Sadeghi, Barnaby Simkin, Ben Lanir, Benedikt Schifferer, Besmira Nushi, Bilal Kartal, Bita Darvish Rouhani, Boris Ginsburg, Brandon Norick, Brandon Soubasis, Branislav Kisacanin, Brian Yu, Bryan Catanzaro, Carlo del Mundo, Chantal Hwang, Charles Wang, Cheng-Ping Hsieh, Chenghao Zhang, Chenhan Yu, Chetan Mungekar, Chintan Patel, Chris Alexiuk, Christopher Parisien, Collin Neale, Cyril Meurillon, Damon Mosk-Aoyama, Dan Su, Dane Corneil, Daniel Afrimi, Daniel Lo, Daniel Rohrer, Daniel Serebrenik, Daria Gitman, Daria Levy, Darko Stosic, David Mosallanezhad, Deepak Narayanan, Dhruv Nathawani, Dima Rekesh, Dina Yared, Divyanshu Kakwani, Dong Ahn, Duncan Riach, Dusan Stosic, Edgar Minasyan, Edward Lin, Eileen Long, Eileen Peters Long, Elad Segal, Elena Lantz, Ellie Evans, Elliott Ning, Eric Chung, Eric Harper, Eric Tramel, Erick Galinkin, Erik Pounds, Evan Briones, Evelina Bakhturina, Evgeny Tsykunov, Faisal Ladhak, Fay Wang, Fei Jia
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Nemotron, introduce the Nemotron, Ultra, models, Super
Comments:
Abstract:We introduce the Nemotron 3 family of models - Nano, Super, and Ultra. These models deliver strong agentic, reasoning, and conversational capabilities. The Nemotron 3 family uses a Mixture-of-Experts hybrid Mamba-Transformer architecture to provide best-in-class throughput and context lengths of up to 1M tokens. Super and Ultra models are trained with NVFP4 and incorporate LatentMoE, a novel approach that improves model quality. The two larger models also include MTP layers for faster text generation. All Nemotron 3 models are post-trained using multi-environment reinforcement learning enabling reasoning, multi-step tool use, and support granular reasoning budget control. Nano, the smallest model, outperforms comparable models in accuracy while remaining extremely cost-efficient for inference. Super is optimized for collaborative agents and high-volume workloads such as IT ticket automation. Ultra, the largest model, provides state-of-the-art accuracy and reasoning performance. Nano is released together with its technical report and this white paper, while Super and Ultra will follow in the coming months. We will openly release the model weights, pre- and post-training software, recipes, and all data for which we hold redistribution rights.
24. 【2512.20854】How important is Recall for Measuring Retrieval Quality?
Link: https://arxiv.org/abs/2512.20854
Authors: Shelly Schwartz, Oleg Vasilyev, Randy Sawaya
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: realistic retrieval settings, evolving knowledge bases, typically unknown, settings with large, large and evolving
Comments:
Abstract:In realistic retrieval settings with large and evolving knowledge bases, the total number of documents relevant to a query is typically unknown, and recall cannot be computed. In this paper, we evaluate several established strategies for handling this limitation by measuring the correlation between retrieval quality metrics and LLM-based judgments of response quality, where responses are generated from the retrieved documents. We conduct experiments across multiple datasets with a relatively low number of relevant documents (2-15). We also introduce a simple retrieval quality measure that performs well without requiring knowledge of the total number of relevant documents.
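The limitation this abstract starts from can be made concrete with a small sketch: precision at k needs relevance judgments only for the retrieved documents, while recall at k needs the full set of relevant documents, which is usually unknown for a large, changing corpus. The document IDs and relevance sets below are hypothetical, and this is not the retrieval quality measure the paper introduces.

```python
def precision_at_k(retrieved, judged_relevant, k):
    """Fraction of the top-k retrieved documents that are judged relevant."""
    top_k = retrieved[:k]
    return sum(doc in judged_relevant for doc in top_k) / k

def recall_at_k(retrieved, all_relevant, k):
    """Fraction of ALL relevant documents found in the top-k.

    Requires the complete relevant set, which is what is typically missing.
    """
    top_k = retrieved[:k]
    return sum(doc in all_relevant for doc in top_k) / len(all_relevant)

retrieved = ["d3", "d7", "d1", "d9"]            # ranked retrieval output
judged_relevant = {"d3", "d1"}                  # judgments on retrieved docs only
all_relevant = {"d1", "d2", "d3", "d5", "d8"}   # known only in a fully labeled benchmark

p_at_4 = precision_at_k(retrieved, judged_relevant, k=4)
r_at_4 = recall_at_k(retrieved, all_relevant, k=4)
print(p_at_4, r_at_4)
```

Only the first function can be evaluated in the realistic setting the abstract describes, which is why measures that avoid the recall denominator are of interest.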
25. 【2512.20848】Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning
Link: https://arxiv.org/abs/2512.20848
Authors: NVIDIA: Aaron Blakeman, Aaron Grattafiori, Aarti Basant, Abhibha Gupta, Abhinav Khattar, Adi Renduchintala, Aditya Vavre, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, Aleksandr Shaposhnikov, Alex Kondratenko, Alexander Bukharin, Alexandre Milesi, Ali Taghibakhshi, Alisa Liu, Amelia Barton, Ameya Sunil Mahabaleshwarkar, Amir Klein, Amit Zuker, Amnon Geifman, Amy Shen, Anahita Bhiwandiwalla, Andrew Tao, Ann Guan, Anubhav Mandarwal, Arham Mehta, Ashwath Aithal, Ashwin Poojary, Asif Ahamed, Asma Kuriparambil Thekkumpate, Ayush Dattagupta, Banghua Zhu, Bardiya Sadeghi, Barnaby Simkin, Ben Lanir, Benedikt Schifferer, Besmira Nushi, Bilal Kartal, Bita Darvish Rouhani, Boris Ginsburg, Brandon Norick, Brandon Soubasis, Branislav Kisacanin, Brian Yu, Bryan Catanzaro, Carlo del Mundo, Chantal Hwang, Charles Wang, Cheng-Ping Hsieh, Chenghao Zhang, Chenhan Yu, Chetan Mungekar, Chintan Patel, Chris Alexiuk, Christopher Parisien, Collin Neale, Damon Mosk-Aoyama, Dan Su, Dane Corneil, Daniel Afrimi, Daniel Rohrer, Daniel Serebrenik, Daria Gitman, Daria Levy, Darko Stosic, David Mosallanezhad, Deepak Narayanan, Dhruv Nathawani, Dima Rekesh, Dina Yared, Divyanshu Kakwani, Dong Ahn, Duncan Riach, Dusan Stosic, Edgar Minasyan, Edward Lin, Eileen Long, Eileen Peters Long, Elena Lantz, Ellie Evans, Elliott Ning, Eric Chung, Eric Harper, Eric Tramel, Erick Galinkin, Erik Pounds, Evan Briones, Evelina Bakhturina, Faisal Ladhak, Fay Wang, Fei Jia, Felipe Soares, Feng Chen, Ferenc Galko, Frankie Siino, Gal Hubara Agam, Ganesh Ajjanagadde, Gantavya Bhatt
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: hybrid Mamba-Transformer language, Nano, Mamba-Transformer language model, Nemotron, hybrid Mamba-Transformer
Comments:
Abstract:We present Nemotron 3 Nano 30B-A3B, a Mixture-of-Experts hybrid Mamba-Transformer language model. Nemotron 3 Nano was pretrained on 25 trillion text tokens, including more than 3 trillion new unique tokens over Nemotron 2, followed by supervised fine tuning and large-scale RL on diverse environments. Nemotron 3 Nano achieves better accuracy than our previous generation Nemotron 2 Nano while activating less than half of the parameters per forward pass. It achieves up to 3.3x higher inference throughput than similarly-sized open models like GPT-OSS-20B and Qwen3-30B-A3B-Thinking-2507, while also being more accurate on popular benchmarks. Nemotron 3 Nano demonstrates enhanced agentic, reasoning, and chat abilities and supports context lengths up to 1M tokens. We release both our pretrained Nemotron 3 Nano 30B-A3B Base and post-trained Nemotron 3 Nano 30B-A3B checkpoints on Hugging Face.
26. 【2512.20822】MediEval: A Unified Medical Benchmark for Patient-Contextual and Knowledge-Grounded Reasoning in LLMs
Link: https://arxiv.org/abs/2512.20822
Authors: Zhan Qu, Michael Färber
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, applied to medicine, Language Models, increasingly applied
Comments:
Abstract:Large Language Models (LLMs) are increasingly applied to medicine, yet their adoption is limited by concerns over reliability and safety. Existing evaluations either test factual medical knowledge in isolation or assess patient-level reasoning without verifying correctness, leaving a critical gap. We introduce MediEval, a benchmark that links MIMIC-IV electronic health records (EHRs) to a unified knowledge base built from UMLS and other biomedical vocabularies. MediEval generates diverse factual and counterfactual medical statements within real patient contexts, enabling systematic evaluation across a 4-quadrant framework that jointly considers knowledge grounding and contextual consistency. Using this framework, we identify critical failure modes, including hallucinated support and truth inversion, that current proprietary, open-source, and domain-specific LLMs frequently exhibit. To address these risks, we propose Counterfactual Risk-Aware Fine-tuning (CoRFu), a DPO-based method with an asymmetric penalty targeting unsafe confusions. CoRFu improves by +16.4 macro-F1 points over the base model and eliminates truth inversion errors, demonstrating both higher accuracy and substantially greater safety.
27. 【2512.20817】EssayCBM: Rubric-Aligned Concept Bottleneck Models for Transparent Essay Grading
Link: https://arxiv.org/abs/2512.20817
Authors: Kumar Satvik Chaudhary, Chengshuai Zhao, Fan Zhang, Yung Hin Tse, Garima Agrawal, Yuli Deng, Huan Liu
Categories: Computation and Language (cs.CL)
Keywords: automated grading systems, large language models, language models function, Understanding how automated, grading systems evaluate
Comments:
Abstract:Understanding how automated grading systems evaluate essays remains a significant challenge for educators and students, especially when large language models function as black boxes. We introduce EssayCBM, a rubric-aligned framework that prioritizes interpretability in essay assessment. Instead of predicting grades directly from text, EssayCBM evaluates eight writing concepts, such as Thesis Clarity and Evidence Use, through dedicated prediction heads on an encoder. These concept scores form a transparent bottleneck, and a lightweight network computes the final grade using only concepts. Instructors can adjust concept predictions and instantly view the updated grade, enabling accountable human-in-the-loop evaluation. EssayCBM matches black-box performance while offering actionable, concept-level feedback through an intuitive web interface.
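The bottleneck idea in this abstract, a final grade computed from concept scores only, with instructors able to override a concept and see the grade change, can be sketched as follows. Only "Thesis Clarity" and "Evidence Use" are named in the abstract; the other six concept names, the equal weights, and the linear grading head are assumptions made for illustration, not the paper's architecture.

```python
# Two concept names come from the abstract; the remaining six are placeholders.
CONCEPTS = ["Thesis Clarity", "Evidence Use"] + [f"concept_{i}" for i in range(3, 9)]

def grade_from_concepts(scores, weights, bias=0.0):
    """Final grade as a function of concept scores only (no access to the essay text)."""
    return bias + sum(weights[name] * scores[name] for name in CONCEPTS)

# Hypothetical grading head: equal weights, so the grade is the mean concept score.
weights = {name: 1.0 / len(CONCEPTS) for name in CONCEPTS}

# Hypothetical concept predictions from the encoder's prediction heads.
predicted = {name: 3.0 for name in CONCEPTS}
predicted["Evidence Use"] = 1.0
grade_before = grade_from_concepts(predicted, weights)

# An instructor disagrees with one concept score and overrides it.
adjusted = {**predicted, "Evidence Use": 3.0}
grade_after = grade_from_concepts(adjusted, weights)
print(grade_before, grade_after)
```

Because the grade depends on the essay only through the concept scores, every change in the final grade can be traced to a specific concept adjustment.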
28. 【2512.20812】Semantic Deception: When Reasoning Models Can't Compute an Addition
Link: https://arxiv.org/abs/2512.20812
Authors: Nathaniël de Leeuw, Marceau Nahon, Mathis Reymond, Raja Chatila, Mehdi Khamassi
Categories: Computation and Language (cs.CL)
Keywords: Large language models, Large language, LLMs, semantic, Large
Comments: 22 pages, 5 figures
Abstract:Large language models (LLMs) are increasingly used in situations where human values are at stake, such as decision-making tasks that involve reasoning when performed by humans. We investigate the so-called reasoning capabilities of LLMs over novel symbolic representations by introducing an experimental framework that tests their ability to process and manipulate unfamiliar symbols. We introduce semantic deceptions: situations in which symbols carry misleading semantic associations due to their form, such as being embedded in specific contexts, designed to probe whether LLMs can maintain symbolic abstraction or whether they default to exploiting learned semantic associations. We redefine standard digits and mathematical operators using novel symbols, and task LLMs with solving simple calculations expressed in this altered notation. The objective is: (1) to assess LLMs' capacity for abstraction and manipulation of arbitrary symbol systems; (2) to evaluate their ability to resist misleading semantic cues that conflict with the task's symbolic logic. Through experiments with four LLMs we show that semantic cues can significantly deteriorate reasoning models' performance on very simple tasks. They reveal limitations in current LLMs' ability for symbolic manipulations and highlight a tendency to over-rely on surface-level semantics, suggesting that chain-of-thoughts may amplify reliance on statistical correlations. Even in situations where LLMs seem to correctly follow instructions, semantic cues still impact basic capabilities. These limitations raise ethical and societal concerns, undermining the widespread and pernicious tendency to attribute reasoning abilities to LLMs and suggesting how LLMs might fail, in particular in decision-making contexts where robust symbolic reasoning is essential and should not be compromised by residual semantic associations inherited from the model's training.
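The task setup described in this abstract, redefining digits and operators with novel symbols and asking for a simple calculation in the new notation, can be sketched as below. The specific symbols are hypothetical: Greek letters stand in for digits, and a multiplication-like glyph is deliberately reused for addition as one possible example of a misleading cue. The paper's actual symbol sets are not given in the abstract.

```python
DIGITS = "0123456789"
NOVEL_DIGITS = "αβγδεζηθικ"   # hypothetical replacement symbols for 0..9

ENCODE = dict(zip(DIGITS, NOVEL_DIGITS))
ENCODE["+"] = "⊗"             # looks like multiplication, but is defined as addition
DECODE = {symbol: original for original, symbol in ENCODE.items()}

def encode(expression):
    """Rewrite a standard expression such as '12+7' in the novel notation."""
    return "".join(ENCODE[ch] for ch in expression)

def decode(expression):
    """Map the novel notation back to standard digits and operators."""
    return "".join(DECODE[ch] for ch in expression)

def solve(encoded_expression):
    """Reference solver: follows the symbol definitions and ignores how symbols look."""
    left, right = decode(encoded_expression).split("+")
    return encode(str(int(left) + int(right)))

puzzle = encode("12+7")
answer = solve(puzzle)
print(puzzle, "=", answer)
```

A model that relies on the familiar meaning of the operator glyph would multiply instead of add, which is the kind of failure the abstract attributes to misleading semantic cues.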
29. 【2512.20796】Measuring Mechanistic Independence: Can Bias Be Removed Without Erasing Demographics?
Link: https://arxiv.org/abs/2512.20796
Authors: Zhengyang Shan, Aaron Mueller
Categories: Computation and Language (cs.CL)
Keywords: investigate how independent, general demographic recognition, demographic, bias, independent demographic bias
Comments:
Abstract:We investigate how independent demographic bias mechanisms are from general demographic recognition in language models. Using a multi-task evaluation setup where demographics are associated with names, professions, and education levels, we measure whether models can be debiased while preserving demographic detection capabilities. We compare attribution-based and correlation-based methods for locating bias features. We find that targeted sparse autoencoder feature ablations in Gemma-2-9B reduce bias without degrading recognition performance: attribution-based ablations mitigate race and gender profession stereotypes while preserving name recognition accuracy, whereas correlation-based ablations are more effective for education bias. Qualitative analysis further reveals that removing attribution features in education tasks induces "prior collapse", thus increasing overall bias. This highlights the need for dimension-specific interventions. Overall, our results show that demographic bias arises from task-specific mechanisms rather than absolute demographic markers, and that mechanistic inference-time interventions can enable surgical debiasing without compromising core model capabilities.
30. 【2512.20794】Investigating Model Editing for Unlearning in Large Language Models
Link: https://arxiv.org/abs/2512.20794
Authors: Shariqah Hossain, Lalana Kagal
Categories: Computation and Language (cs.CL)
Keywords: Machine unlearning aims, remove unwanted information, Machine unlearning, remove unwanted, fully remove
Comments:
Abstract:Machine unlearning aims to remove unwanted information from a model, but many methods are inefficient for LLMs with large numbers of parameters or fail to fully remove the intended information without degrading performance on knowledge that should be retained. Model editing algorithms solve a similar problem of changing information in models, but they focus on redirecting inputs to a new target rather than removing that information altogether. In this work, we explore the editing algorithms ROME, IKE, and WISE and design new editing targets for an unlearning setting. Through this investigation, we show that model editing approaches can exceed baseline unlearning methods in terms of quality of forgetting depending on the setting. Like traditional unlearning techniques, they struggle to encapsulate the scope of what is to be unlearned without damage to the overall model performance.
31. 【2512.20780】Large Language Models Approach Expert Pedagogical Quality in Math Tutoring but Differ in Instructional and Linguistic Profiles
Link: https://arxiv.org/abs/2512.20780
Authors: Ramatu Oiza Abdulsalam, Segun Aroyehun
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Keywords: large language models, instructional behavior aligns, large language, language models, expert human tutors
Comments:
Abstract:Recent work has explored the use of large language models for generating tutoring responses in mathematics, yet it remains unclear how closely their instructional behavior aligns with expert human practice. We examine this question using a controlled, turn-level comparison in which expert human tutors, novice human tutors, and multiple large language models respond to the same set of math remediation conversation turns. We examine both instructional strategies and linguistic characteristics of tutoring responses, including restating and revoicing, pressing for accuracy, lexical diversity, readability, politeness, and agency. We find that large language models approach expert levels of perceived pedagogical quality on average but exhibit systematic differences in their instructional and linguistic profiles. In particular, large language models tend to underuse restating and revoicing strategies characteristic of expert human tutors, while producing longer, more lexically diverse, and more polite responses. Statistical analyses show that restating and revoicing, lexical diversity, and pressing for accuracy are positively associated with perceived pedagogical quality, whereas higher levels of agentic and polite language are negatively associated. Overall, recent large language models exhibit levels of perceived pedagogical quality comparable to expert human tutors, while relying on different instructional and linguistic strategies. These findings underscore the value of analyzing instructional strategies and linguistic characteristics when evaluating tutoring responses across human tutors and intelligent tutoring systems.
32. 【2512.20773】Adversarial Training for Failure-Sensitive User Simulation in Mental Health Dialogue Optimization
Link: https://arxiv.org/abs/2512.20773
Authors: Ziyi Zhu, Olivier Tieleman, Caitlin A. Stamatis, Luka Smyth, Thomas D. Hull, Daniel R. Cahn, Matteo Malgaroli
Categories: Computation and Language (cs.CL)
Keywords: evaluating task-oriented dialogue, behavior remains challenging, accurately replicate human, replicate human behavior, human behavior remains
Comments:
Abstract:Realistic user simulation is crucial for training and evaluating task-oriented dialogue (TOD) systems, yet creating simulators that accurately replicate human behavior remains challenging. A key property of effective simulators is their ability to expose failure modes of the systems they evaluate. We present an adversarial training framework that iteratively improves user simulator realism through a competitive dynamic between a generator (user simulator) and a discriminator. Applied to mental health support chatbots, our approach demonstrates that fine-tuned simulators dramatically outperform zero-shot base models at surfacing system issues, and adversarial training further enhances diversity, distributional alignment, and predictive validity. The resulting simulator achieves a strong correlation between simulated and real failure occurrence rates across diverse chatbot configurations while maintaining low distributional divergence of failure modes. Discriminator accuracy decreases drastically after three adversarial iterations, suggesting improved realism. These results provide evidence that adversarial training is a promising approach for creating realistic user simulators in mental health support TOD domains, enabling rapid, reliable, and cost-effective system evaluation before deployment.
33. 【2512.20760】Generalization of RLVR Using Causal Reasoning as a Testbed
Link: https://arxiv.org/abs/2512.20760
Authors: Brian Lu, Hongyu Zhao, Shuo Sun, Hao Peng, Rui Ding, Hongyuan Mei
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: post-training large language, Reinforcement learning, large language models, RLVR, verifiable rewards
Comments:
Abstract:Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for post-training large language models (LLMs) on complex reasoning tasks. Yet, the conditions under which RLVR yields robust generalization remain poorly understood. This paper provides an empirical study of RLVR generalization in the setting of probabilistic inference over causal graphical models. This setting offers two natural axes along which to examine generalization: (i) the level of the probabilistic query -- associational, interventional, or counterfactual -- and (ii) the structural complexity of the query, measured by the size of its relevant subgraph. We construct datasets of causal graphs and queries spanning these difficulty axes and fine-tune Qwen-2.5-Instruct models using RLVR or supervised fine-tuning (SFT). We vary both the model scale (3B-32B) and the query level included in training. We find that RLVR yields stronger within-level and across-level generalization than SFT, but only for specific combinations of model size and training query level. Further analysis shows that RLVR's effectiveness depends on the model's initial reasoning competence. With sufficient initial competence, RLVR improves an LLM's marginalization strategy and reduces errors in intermediate probability calculations, producing substantial accuracy gains, particularly on more complex queries. These findings show that RLVR can improve specific causal reasoning subskills, with its benefits emerging only when the model has sufficient initial competence.
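The difference between the first two query levels named in this abstract, associational and interventional, can be worked through on a toy causal graph. The three-variable graph (Z causes X and Y, X causes Y) and every probability below are hypothetical and are not taken from the paper's datasets; the counterfactual level is omitted from this sketch.

```python
# Hypothetical binary causal graph: Z -> X, Z -> Y, X -> Y.
P_Z = {0: 0.5, 1: 0.5}                       # P(Z = z)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}              # P(X = 1 | Z = z)
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (0, 1): 0.7,   # P(Y = 1 | X = x, Z = z)
                 (1, 0): 0.3, (1, 1): 0.9}

def p_x_given_z(x, z):
    p_one = P_X1_GIVEN_Z[z]
    return p_one if x == 1 else 1.0 - p_one

def associational(x):
    """P(Y=1 | X=x): observing X=x also changes our belief about the confounder Z."""
    joint_xz = {z: P_Z[z] * p_x_given_z(x, z) for z in P_Z}
    p_x = sum(joint_xz.values())
    return sum(P_Y1_GIVEN_XZ[(x, z)] * joint_xz[z] / p_x for z in P_Z)

def interventional(x):
    """P(Y=1 | do(X=x)): setting X cuts the Z -> X edge, so Z keeps its prior."""
    return sum(P_Y1_GIVEN_XZ[(x, z)] * P_Z[z] for z in P_Z)

print(f"P(Y=1 | X=1)     = {associational(1):.2f}")
print(f"P(Y=1 | do(X=1)) = {interventional(1):.2f}")
```

In both cases the answer comes from marginalizing over Z; only the weights differ, posterior P(Z | X) in the first and prior P(Z) in the second. This marginalization step is the kind of intermediate calculation the abstract says RLVR improves.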
34. 【2512.20757】TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior
Link: https://arxiv.org/abs/2512.20757
Authors: Gül Sena Altıntaş, Malikeh Ehghaghi, Brian Lester, Fengyuan Liu, Wanru Zhao, Marco Ciccone, Colin Raffel
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: provide the fundamental, fundamental basis, text is represented, represented and processed, processed by language
Comments:
Abstract:Tokenizers provide the fundamental basis through which text is represented and processed by language models (LMs). Despite the importance of tokenization, its role in LM performance and behavior is poorly understood due to the challenge of measuring the impact of tokenization in isolation. To address this need, we present TokSuite, a collection of models and a benchmark that supports research into tokenization's influence on LMs. Specifically, we train fourteen models that use different tokenizers but are otherwise identical using the same architecture, dataset, training budget, and initialization. Additionally, we curate and release a new benchmark that specifically measures model performance subject to real-world perturbations that are likely to influence tokenization. Together, TokSuite allows robust decoupling of the influence of a model's tokenizer, supporting a series of novel findings that elucidate the respective benefits and shortcomings of a wide range of popular tokenizers.
35. 【2512.20745】AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent
Link: https://arxiv.org/abs/2512.20745
Authors: Haipeng Luo, Huawen Feng, Qingfeng Sun, Can Xu, Kai Zheng, Yufei Wang, Tao Yang, Han Hu, Yansong Tang, Di Wang
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: achieved remarkable progress, Large Reasoning Models, Large Reasoning, achieved remarkable, remarkable progress
Comments: LLM, Mathematical Reasoning
Abstract:Large Reasoning Models (LRMs) like o3 and DeepSeek-R1 have achieved remarkable progress in natural language reasoning with long chain-of-thought. However, they remain computationally inefficient and struggle with accuracy when solving problems requiring complex mathematical operations. In this work, we present AgentMath, an agent framework that seamlessly integrates language models' reasoning capabilities with code interpreters' computational precision to efficiently tackle complex mathematical problems. Our approach introduces three key innovations: (1) An automated method that converts natural language chain-of-thought into structured tool-augmented trajectories, generating high-quality supervised fine-tuning (SFT) data to alleviate data scarcity; (2) A novel agentic reinforcement learning (RL) paradigm that dynamically interleaves natural language generation with real-time code execution. This enables models to autonomously learn optimal tool-use strategies through multi-round interactive feedback, while fostering emergent capabilities in code refinement and error correction; (3) An efficient training system incorporating innovative techniques, including request-level asynchronous rollout scheduling, agentic partial rollout, and prefix-aware weighted load balancing, achieving 4-5x speedup and making efficient RL training feasible on ultra-long sequences in scenarios with massive tool [...]. [...] evaluations show that AgentMath achieves state-of-the-art performance on challenging mathematical competition benchmarks including AIME24, AIME25, and HMMT25. Specifically, AgentMath-30B-A3B attains 90.6%, 86.4%, and 73.8% accuracy respectively, achieving advanced [...]. [...] results validate the effectiveness of our approach and pave the way for building more sophisticated and scalable mathematical reasoning agents.
36. 【2512.20724】SA-DiffuSeq: Addressing Computational and Scalability Challenges in Long-Document Generation with Sparse Attention
Link: https://arxiv.org/abs/2512.20724
Authors: Alexandros Christoforos, Chadbourne Davis
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: prohibitive computational cost, sequence length increases, Diffusion based approaches, length increases, based approaches
Comments: Under submission
Abstract:Diffusion based approaches to long form text generation suffer from prohibitive computational cost and memory overhead as sequence length increases. We introduce SA-DiffuSeq, a diffusion framework that integrates sparse attention to fundamentally improve scalability for long document modeling. By selectively allocating attention within the diffusion process, SA-DiffuSeq significantly reduces computational complexity while maintaining semantic coherence and generation quality. A key component of our method is a soft absorbing state tailored to sparse attention dynamics, which stabilizes diffusion trajectories and accelerates sequence reconstruction. This design improves sampling efficiency and enhances precision in long range dependency modeling. Extensive experiments demonstrate that SA-DiffuSeq consistently surpasses state of the art diffusion baselines in both training efficiency and sampling speed, with especially strong gains on extended sequences. These properties make SA-DiffuSeq well suited for demanding long form applications such as scientific writing, large scale code generation, and multi turn long context dialogue. Overall, our results indicate that incorporating structured sparsity into diffusion models is a promising direction for efficient and expressive long text generation.
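The abstract does not say which sparsity pattern SA-DiffuSeq uses. As a generic illustration of why sparse attention reduces cost, the sketch below builds a sliding-window mask (an assumed pattern, not necessarily the paper's) and counts how many query-key pairs remain compared with dense attention.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when position i may attend to position j.

    Assumed pattern: each position sees neighbours at distance <= window.
    """
    return [[abs(i - j) <= window for j in range(seq_len)] for i in range(seq_len)]

seq_len, window = 8, 2                     # hypothetical toy sizes
mask = sliding_window_mask(seq_len, window)

sparse_pairs = sum(sum(row) for row in mask)   # attended pairs under the mask
dense_pairs = seq_len * seq_len                # attended pairs with full attention
print(f"{sparse_pairs} of {dense_pairs} pairs kept")
```

With a fixed window, the number of attended pairs grows roughly linearly in sequence length rather than quadratically, which is the general source of the scalability gains such methods aim for.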
37. 【2512.20687】PHOTON: Hierarchical Autoregressive Modeling for Lightspeed and Memory-Efficient Language Generation
链接:https://arxiv.org/abs/2512.20687
作者:Yuma Ichikawa,Naoya Takagi,Takumi Nakagawa,Yuzi Kanazawa,Akira Sakai
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
关键词:Transformers operate, operate as horizontal, generation step, ever-growing sequence, sequence of token-level
备注: 12 pages, 5 figures
点击查看摘要
Abstract:Transformers operate as horizontal token-by-token scanners; at each generation step, the model attends to an ever-growing sequence of token-level states. This access pattern increases prefill latency and makes long-context decoding increasingly memory-bound, as KV-cache reads and writes dominate inference throughput rather than arithmetic computation. We propose Parallel Hierarchical Operation for Top-down Networks (PHOTON), a hierarchical autoregressive model that replaces flat scanning with vertical, multi-resolution context access. PHOTON maintains a hierarchy of latent streams: a bottom-up encoder progressively compresses tokens into low-rate contextual states, while lightweight top-down decoders reconstruct fine-grained token representations. Experimental results show that PHOTON is superior to competitive Transformer-based language models regarding the throughput-quality trade-off, offering significant advantages in long-context and multi-query tasks. This reduces decode-time KV-cache traffic, yielding up to $10^{3}\times$ higher throughput per unit memory.
38. 【2512.20677】Automated Red-Teaming Framework for Large Language Model Security Assessment: A Comprehensive Attack Generation and Detection System
链接:https://arxiv.org/abs/2512.20677
作者:Zhang Wei,Peilu Hu,Shengning Lang,Hao Yan,Li Mei,Yichao Zhang,Chen Yang,Junfeng Hao,Zhimo Han
类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)
关键词:large language models, high-stakes domains, critical challenge, large language, increasingly deployed
备注: 18 pages
点击查看摘要
Abstract:As large language models (LLMs) are increasingly deployed in high-stakes domains, ensuring their security and alignment has become a critical challenge. Existing red-teaming practices depend heavily on manual testing, which limits scalability and fails to comprehensively cover the vast space of potential adversarial behaviors. This paper introduces an automated red-teaming framework that systematically generates, executes, and evaluates adversarial prompts to uncover security vulnerabilities in LLMs. Our framework integrates meta-prompting-based attack synthesis, multi-modal vulnerability detection, and standardized evaluation protocols spanning six major threat categories -- reward hacking, deceptive alignment, data exfiltration, sandbagging, inappropriate tool use, and chain-of-thought manipulation. Experiments on the GPT-OSS-20B model reveal 47 distinct vulnerabilities, including 21 high-severity and 12 novel attack patterns, achieving a $3.9\times$ improvement in vulnerability discovery rate over manual expert testing while maintaining 89% detection accuracy. These results demonstrate the framework's effectiveness in enabling scalable, systematic, and reproducible AI safety evaluations. By providing actionable insights for improving alignment robustness, this work advances the state of automated LLM red-teaming and contributes to the broader goal of building secure and trustworthy AI systems.
39. 【2512.20638】Uncovering Competency Gaps in Large Language Models and Their Benchmarks
链接:https://arxiv.org/abs/2512.20638
作者:Matyas Bohacek,Nino Scherrer,Nicholas Dufour,Thomas Leung,Christoph Bregler,Stephanie C. Y. Chan
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:large language models, relies heavily, large language, heavily on standardized, benchmarks
备注:
点击查看摘要
Abstract:The evaluation of large language models (LLMs) relies heavily on standardized benchmarks. These benchmarks provide useful aggregated metrics for a given capability, but those aggregated metrics can obscure (i) particular sub-areas where the LLMs are weak ("model gaps") and (ii) imbalanced coverage in the benchmarks themselves ("benchmark gaps"). We propose a new method that uses sparse autoencoders (SAEs) to automatically uncover both types of gaps. By extracting SAE concept activations and computing saliency-weighted performance scores across benchmark data, the method grounds evaluation in the model's internal representations and enables comparison across benchmarks. As examples demonstrating our approach, we applied the method to two popular open-source models and ten benchmarks. We found that these models consistently underperformed on concepts that stand in contrast to sycophantic behaviors (e.g., politely refusing a request or asserting boundaries) and concepts connected to safety discussions. These model gaps align with observations previously surfaced in the literature; our automated, unsupervised method was able to recover them without manual supervision. We also observed benchmark gaps: many of the evaluated benchmarks over-represented concepts related to obedience, authority, or instruction-following, while missing core concepts that should fall within their intended scope. In sum, our method offers a representation-grounded approach to evaluation, enabling concept-level decomposition of benchmark scores. Rather than replacing conventional aggregated metrics, CG complements them by providing a concept-level decomposition that can reveal why a model scored as it did and how benchmarks could evolve to better reflect their intended scope. Code is available at this https URL.
40. 【2512.20634】Real Time Detection and Quantitative Analysis of Spurious Forgetting in Continual Learning
链接:https://arxiv.org/abs/2512.20634
作者:Weiwei Wang
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Catastrophic forgetting remains, Catastrophic forgetting, remains a fundamental, fundamental challenge, challenge in continual
备注:
点击查看摘要
Abstract:Catastrophic forgetting remains a fundamental challenge in continual learning for large language models. Recent work revealed that performance degradation may stem from spurious forgetting caused by task alignment disruption rather than true knowledge loss. However, this work only qualitatively describes alignment, relies on post-hoc analysis, and lacks automatic distinction mechanisms. We introduce the shallow versus deep alignment framework, providing the first quantitative characterization of alignment depth. We identify that current task alignment approaches suffer from shallow alignment - maintained only over the first few output tokens (approximately 3-5) - making models vulnerable to forgetting. This explains why spurious forgetting occurs, why it is reversible, and why fine-tuning attacks are effective. We propose a comprehensive framework addressing all gaps: (1) quantitative metrics (0-1 scale) to measure alignment depth across token positions; (2) real-time detection methods for identifying shallow alignment during training; (3) specialized analysis tools for visualization and recovery prediction; and (4) adaptive mitigation strategies that automatically distinguish forgetting types and promote deep alignment. Extensive experiments on multiple datasets and model architectures (Qwen2.5-3B to Qwen2.5-32B) demonstrate 86.2-90.6% identification accuracy and show that promoting deep alignment improves robustness against forgetting by 3.3-7.1% over baselines.
41. 【2512.20631】Zero-Training Temporal Drift Detection for Transformer Sentiment Models: A Comprehensive Analysis on Authentic Social Media Streams
链接:https://arxiv.org/abs/2512.20631
作者:Aayam Bansal,Ishaan Gangwani
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:authentic social media, social media data, major real-world events, authentic social, social media
备注: ICML NewInML
点击查看摘要
Abstract:We present a comprehensive zero-training temporal drift analysis of transformer-based sentiment models validated on authentic social media data from major real-world events. Through systematic evaluation across three transformer architectures and rigorous statistical validation on 12,279 authentic social media posts, we demonstrate significant model instability with accuracy drops reaching 23.4% during event-driven periods. Our analysis reveals maximum confidence drops of 13.0% (Bootstrap 95% CI: [9.1%, 16.5%]) with strong correlation to actual performance degradation. We introduce four novel drift metrics that outperform embedding-based baselines while maintaining computational efficiency suitable for production deployment. Statistical validation across multiple events confirms robust detection capabilities with practical significance exceeding industry monitoring thresholds. This zero-training methodology enables immediate deployment for real-time sentiment monitoring systems and provides new insights into transformer model behavior during dynamic content periods.
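The abstract above reports a confidence drop together with a bootstrap 95% confidence interval. As a rough illustration of how such an interval can be computed, here is a minimal percentile-bootstrap sketch in Python. It is not the authors' code: the function name `bootstrap_ci_mean_drop`, the resample count, and the confidence values are all assumptions, and the data is synthetic.

```python
import random

def bootstrap_ci_mean_drop(baseline, event, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for mean(baseline) - mean(event)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    drops = []
    for _ in range(n_resamples):
        # Resample each period with replacement, keeping its original size.
        b = [rng.choice(baseline) for _ in baseline]
        e = [rng.choice(event) for _ in event]
        drops.append(sum(b) / len(b) - sum(e) / len(e))
    drops.sort()
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return drops[lo_idx], drops[hi_idx]

# Synthetic model confidences (hypothetical): a calm period and an event period.
baseline_conf = [0.90 + 0.01 * ((i % 5) - 2) for i in range(50)]
event_conf = [0.78 + 0.01 * ((i % 5) - 2) for i in range(50)]

low, high = bootstrap_ci_mean_drop(baseline_conf, event_conf)
print(f"95% CI for the mean confidence drop: [{low:.3f}, {high:.3f}]")
```

Only model confidences are needed here, not labels, which is one reason a confidence-based signal can be monitored without retraining.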
42. 【2512.20626】MegaRAG: Multimodal Knowledge Graph-Based Retrieval Augmented Generation
链接:https://arxiv.org/abs/2512.20626
作者:Chi-Hsiang Hsiao,Yi-Cheng Wang,Tzung-Sheng Lin,Yi-Ren Yeh,Chu-Song Chen
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
关键词:large language models, access external information, dynamically access external, previously unseen documents, enables large language
备注:
点击查看摘要
Abstract:Retrieval-augmented generation (RAG) enables large language models (LLMs) to dynamically access external information, which is powerful for answering questions over previously unseen documents. Nonetheless, they struggle with high-level conceptual understanding and holistic comprehension due to limited context windows, which constrain their ability to perform deep reasoning over long-form, domain-specific content such as full-length books. To solve this problem, knowledge graphs (KGs) have been leveraged to provide entity-centric structure and hierarchical summaries, offering more structured support for reasoning. However, existing KG-based RAG solutions remain restricted to text-only inputs and fail to leverage the complementary insights provided by other modalities such as vision. On the other hand, reasoning from visual documents requires integrating textual, visual, and spatial cues into structured, hierarchical concepts. To address this issue, we introduce a multimodal knowledge graph-based RAG that enables cross-modal reasoning for better content understanding. Our method incorporates visual cues into the construction of knowledge graphs, the retrieval phase, and the answer generation process. Experimental results across both global and fine-grained question answering tasks show that our approach consistently outperforms existing RAG-based approaches on both textual and multimodal corpora.
43. 【2512.20929】Decoding Predictive Inference in Visual Language Processing via Spatiotemporal Neural Coherence
链接:https://arxiv.org/abs/2512.20929
作者:Sean C. Borneman,Julia Krebs,Ronnie B. Wilbur,Evie A. Malaia
类目:Neurons and Cognition (q-bio.NC); Computation and Language (cs.CL)
关键词:Human language processing, language processing relies, Human language, processing relies, Human
备注: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Foundation Models for the Brain and Body
点击查看摘要
Abstract:Human language processing relies on the brain's capacity for predictive inference. We present a machine learning framework for decoding neural (EEG) responses to dynamic visual language stimuli in Deaf signers. Using coherence between neural signals and optical flow-derived motion features, we construct spatiotemporal representations of predictive neural dynamics. Through entropy-based feature selection, we identify frequency-specific neural signatures that differentiate interpretable linguistic input from linguistically disrupted (time-reversed) stimuli. Our results reveal distributed left-hemispheric and frontal low-frequency coherence as key features in language comprehension, with experience-dependent neural signatures correlating with age. This work demonstrates a novel multimodal approach for probing experience-driven generative models of perception in the brain.
信息检索
1. 【2512.21257】ReaSeq: Unleashing World Knowledge via Reasoning for Sequential Modeling
链接:https://arxiv.org/abs/2512.21257
作者:Chuan Wang,Gaoming Yang,Han Wu,Jiakai Tang,Jiahao Yu,Jian Wu,Jianwu Hu,Junjun Zheng,Shuwen Xiao,Yeqiu Yang,Yuning Jiang,Ahjol Nurlanbek,Binbin Cao,Bo Zheng,Fangmei Zhu,Gaoming Zhou,Huimin Yi,Huiping Chu,Jin Huang,Jinzhe Shan,Kenan Cui,Longbin Li,Silu Zhou,Wen Chen,Xia Ming,Xiang Gao,Xin Yao,Xingyu Wen,Yan Zhang,Yiwen Hu,Yulin Wang,Ziheng Bao,Zongyuan Wu
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)
关键词:Large Language Models, brittle interest modeling, Industrial recommender systems, constrains model performance, Language Models
备注:
点击查看摘要
Abstract:Industrial recommender systems face two fundamental limitations under the log-driven paradigm: (1) knowledge poverty in ID-based item representations that causes brittle interest modeling under data sparsity, and (2) systemic blindness to beyond-log user interests that constrains model performance within platform boundaries. These limitations stem from an over-reliance on shallow interaction statistics and close-looped feedback while neglecting the rich world knowledge about product semantics and cross-domain behavioral patterns that Large Language Models have learned from vast corpora. To address these challenges, we introduce ReaSeq, a reasoning-enhanced framework that leverages world knowledge in Large Language Models to address both limitations through explicit and implicit reasoning. Specifically, ReaSeq employs explicit Chain-of-Thought reasoning via multi-agent collaboration to distill structured product knowledge into semantically enriched item representations, and latent reasoning via Diffusion Large Language Models to infer plausible beyond-log behaviors. Deployed on Taobao's ranking system serving hundreds of millions of users, ReaSeq achieves substantial gains: 6.0% in IPV and CTR, 2.9% in Orders, and 2.5% in GMV, validating the effectiveness of world-knowledge-enhanced reasoning over purely log-driven approaches.
2. 【2512.21120】ClarifyMT-Bench: Benchmarking and Improving Multi-Turn Clarification for Conversational Large Language Models
链接:https://arxiv.org/abs/2512.21120
作者:Sichun Luo,Yi Huang,Mukai Li,Shichang Meng,Fengyuan Liu,Zefa Hu,Junlan Feng,Qi Liu
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:Large language models, Large language, language models, assistants in open-domain, ambiguous information
备注:
点击查看摘要
Abstract:Large language models (LLMs) are increasingly deployed as conversational assistants in open-domain, multi-turn settings, where users often provide incomplete or ambiguous information. However, existing LLM-focused clarification benchmarks primarily assume single-turn interactions or cooperative users, limiting their ability to evaluate clarification behavior in realistic settings. We introduce ClarifyMT-Bench, a benchmark for multi-turn clarification grounded in a five-dimensional ambiguity taxonomy and a set of six behaviorally diverse simulated user personas. Through a hybrid LLM-human pipeline, we construct 6,120 multi-turn dialogues capturing diverse ambiguity sources and interaction patterns. Evaluating ten representative LLMs uncovers a consistent under-clarification bias: LLMs tend to answer prematurely, and performance degrades as dialogue depth increases. To mitigate this, we propose ClarifyAgent, an agentic approach that decomposes clarification into perception, forecasting, tracking, and planning, substantially improving robustness across ambiguity conditions. ClarifyMT-Bench establishes a reproducible foundation for studying when LLMs should ask, when they should answer, and how to navigate ambiguity in real-world human-LLM interactions.
3. 【2512.21076】Blurb-Refined Inference from Crowdsourced Book Reviews using Hierarchical Genre Mining with Dual-Path Graph Convolutions
链接:https://arxiv.org/abs/2512.21076
作者:Suraj Kumar,Utsav Kumar Nareti,Soumi Chattopadhyay,Chandranath Adak,Prolay Mallick
类目:Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)
关键词:digital library organization, Accurate book genre, Accurate book, library organization, personalized recommendation
备注: 10 pages, 4 figures, 3 tables
点击查看摘要
Abstract:Accurate book genre classification is fundamental to digital library organization, content discovery, and personalized recommendation. Existing approaches typically model genre prediction as a flat, single-label task, ignoring hierarchical genre structure and relying heavily on noisy, subjective user reviews, which often degrade classification reliability. We propose HiGeMine, a two-phase hierarchical genre mining framework that robustly integrates user reviews with authoritative book blurbs. In the first phase, HiGeMine employs a zero-shot semantic alignment strategy to filter reviews, retaining only those semantically consistent with the corresponding blurb, thereby mitigating noise, bias, and irrelevance. In the second phase, we introduce a dual-path, two-level graph-based classification architecture: a coarse-grained Level-1 binary classifier distinguishes fiction from non-fiction, followed by Level-2 multi-label classifiers for fine-grained genre prediction. Inter-genre dependencies are explicitly modeled using a label co-occurrence graph, while contextual representations are derived from pretrained language models applied to the filtered textual content. To facilitate systematic evaluation, we curate a new hierarchical book genre dataset. Extensive experiments demonstrate that HiGeMine consistently outperformed strong baselines across hierarchical genre classification tasks. The proposed framework offers a principled and effective solution for leveraging both structured and unstructured textual data in hierarchical book genre analysis.
4. 【2512.21039】Agentic Multi-Persona Framework for Evidence-Aware Fake News Detection
链接:https://arxiv.org/abs/2512.21039
作者:Roopa Bukke,Soumya Pandey,Suraj Kumar,Soumi Chattopadhyay,Chandranath Adak
类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:necessitating reliable automated, poses significant risks, reliable automated fake, misinformation poses significant, public trust
备注: 12 pages, 8 tables, 2 figures
点击查看摘要
Abstract:The rapid proliferation of online misinformation poses significant risks to public trust, policy, and safety, necessitating reliable automated fake news detection. Existing methods often struggle with multimodal content, domain generalization, and explainability. We propose AMPEND-LS, an agentic multi-persona evidence-grounded framework with LLM-SLM synergy for multimodal fake news detection. AMPEND-LS integrates textual, visual, and contextual signals through a structured reasoning pipeline powered by LLMs, augmented with reverse image search, knowledge graph paths, and persuasion strategy analysis. To improve reliability, we introduce a credibility fusion mechanism combining semantic similarity, domain trustworthiness, and temporal context, and a complementary SLM classifier to mitigate LLM uncertainty and hallucinations. Extensive experiments across three benchmark datasets demonstrate that AMPEND-LS consistently outperformed state-of-the-art baselines in accuracy, F1 score, and robustness. Qualitative case studies further highlight its transparent reasoning and resilience against evolving misinformation. This work advances the development of adaptive, explainable, and evidence-aware systems for safeguarding online information integrity.
5. 【2512.21021】Towards Better Search with Domain-Aware Text Embeddings for C2C Marketplaces
链接:https://arxiv.org/abs/2512.21021
作者:Andre Rusli,Miao Cao,Shoma Ishimoto,Sho Akiyama,Max Frenzel
类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:distinct retrieval challenges, pose distinct retrieval, marketplaces pose distinct, ambiguous queries, user-generated listings
备注: 5 pages, AAAI 2026 Workshop on New Frontiers in Information Retrieval
点击查看摘要
Abstract:Consumer-to-consumer (C2C) marketplaces pose distinct retrieval challenges: short, ambiguous queries; noisy, user-generated listings; and strict production constraints. This paper reports our experiment to build a domain-aware Japanese text-embedding approach to improve the quality of search at Mercari, Japan's largest C2C marketplace. We experimented with fine-tuning on purchase-driven query-title pairs, using role-specific prefixes to model query-item asymmetry. To meet production constraints, we apply Matryoshka Representation Learning to obtain compact, truncation-robust embeddings. Offline evaluation on historical search logs shows consistent gains over a strong generic encoder, with particularly large improvements when replacing PCA compression with Matryoshka truncation. A manual assessment further highlights better handling of proper nouns, marketplace-specific semantics, and term-importance alignment. Additionally, an initial online A/B test demonstrates statistically significant improvements in revenue per user and search-flow efficiency, with transaction frequency maintained. Results show that domain-aware embeddings improve relevance and efficiency at scale and form a practical foundation for richer LLM-era search experiences.
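The abstract above relies on Matryoshka Representation Learning to get embeddings that still work after truncation. The inference-time step can be sketched as follows: keep the first few coordinates, renormalize, and compare. This is a minimal sketch, not Mercari's implementation. The helper names and the toy 8-dimensional vectors are assumptions, and the training side (which is what makes the leading coordinates informative) is not shown.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` coordinates and rescale to unit L2 norm."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # avoid division by zero for an all-zero prefix
    return [x / norm for x in head]

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy vectors standing in for a query embedding and an item-title embedding.
query = [0.9, 0.1, 0.3, 0.2, 0.05, 0.01, 0.02, 0.03]
title = [0.8, 0.2, 0.4, 0.1, 0.01, 0.04, 0.03, 0.02]

for dim in (8, 4, 2):
    q = truncate_and_normalize(query, dim)
    t = truncate_and_normalize(title, dim)
    print(dim, round(cosine(q, t), 4))
```

Unlike PCA compression, truncation needs no fitted projection matrix at serving time, so the stored vector size can be changed by slicing alone.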
6. 【2512.20950】MultiMind at SemEval-2025 Task 7: Crosslingual Fact-Checked Claim Retrieval via Multi-Source Alignment
链接:https://arxiv.org/abs/2512.20950
作者:Mohammad Mahdi Abootorabi,Alireza Ghahramani Kure,Mohammadali Mohammadkhani,Sina Elahimanesh,Mohammad Ali Ali Panah
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:Crosslingual Fact-Checked Claim, Multilingual and Crosslingual, Fact-Checked Claim Retrieval, paper presents, presents our system
备注: 11 pages; published at the SemEval-2025 workshop
点击查看摘要
Abstract:This paper presents our system for SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval. In an era where misinformation spreads rapidly, effective fact-checking is increasingly critical. We introduce TriAligner, a novel approach that leverages a dual-encoder architecture with contrastive learning and incorporates both native and English translations across different modalities. Our method effectively retrieves claims across multiple languages by learning the relative importance of different sources in alignment. To enhance robustness, we employ efficient data preprocessing and augmentation using large language models while incorporating hard negative sampling to improve representation learning. We evaluate our approach on monolingual and crosslingual benchmarks, demonstrating significant improvements in retrieval accuracy and fact-checking performance over baselines.
7. 【2512.20916】MMSRARec: Summarization and Retrieval Augumented Sequential Recommendation Based on Multimodal Large Language Model
链接:https://arxiv.org/abs/2512.20916
作者:Haoyu Wang,Yitong Wang,Jining Wang
类目:Information Retrieval (cs.IR); Multimedia (cs.MM)
关键词:Large Language Models, Multimodal Large Language, demonstrated significant potential, Recent advancements, Large Language
备注: Under Review
点击查看摘要
Abstract:Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated significant potential in recommendation systems. However, the effective application of MLLMs to multimodal sequential recommendation remains unexplored: A) Existing methods primarily leverage the multimodal semantic understanding capabilities of pre-trained MLLMs to generate item embeddings or semantic IDs, thereby enhancing traditional recommendation models. These approaches generate item representations that exhibit limited interpretability, and pose challenges when transferring to language model-based recommendation systems. B) Other approaches convert user behavior sequence into image-text pairs and perform recommendation through multiple MLLM inference, incurring prohibitive computational and time costs. C) Current MLLM-based recommendation systems generally neglect the integration of collaborative signals. To address these limitations while balancing recommendation performance, interpretability, and computational cost, this paper proposes MultiModal Summarization-and-Retrieval-Augmented Sequential Recommendation. Specifically, we first employ MLLM to summarize items into concise keywords and fine-tune the model using rewards that incorporate summary length, information loss, and reconstruction difficulty, thereby enabling adaptive adjustment of the summarization policy. Inspired by retrieval-augmented generation, we then transform collaborative signals into corresponding keywords and integrate them as supplementary context. Finally, we apply supervised fine-tuning with multi-task learning to align the MLLM with the multimodal sequential recommendation. Extensive evaluations on common recommendation datasets demonstrate the effectiveness of MMSRARec, showcasing its capability to efficiently and interpretably understand user behavior histories and item information for accurate recommendations.
8. 【2512.20896】Accurate and Diverse Recommendations via Propensity-Weighted Linear Autoencoders
链接:https://arxiv.org/abs/2512.20896
作者:Kazuma Onishi,Katsuhiko Hayashi,Hidetaka Kamigaito
类目:Information Retrieval (cs.IR)
关键词:real-world recommender systems, recommender systems, real-world recommender, MNAR, Inverse Propensity Scoring
备注: Published in the proceedings of SIGIR-AP'25
点击查看摘要
Abstract:In real-world recommender systems, user-item interactions are Missing Not At Random (MNAR), as interactions with popular items are more frequently observed than those with less popular ones. Missing observations shift recommendations toward frequently interacted items, which reduces the diversity of the recommendation list. To alleviate this problem, Inverse Propensity Scoring (IPS) is widely used and commonly models propensities based on a power-law function of item interaction frequency. However, we found that such power-law-based correction overly penalizes popular items and harms their recommendation performance. We address this issue by redefining the propensity score to allow broader item recommendation without excessively penalizing popular items. The proposed score is formulated by applying a sigmoid function to the logarithm of the item observation frequency, maintaining the simplicity of power-law scoring while allowing for more flexible adjustment. Furthermore, we incorporate the redefined propensity score into a linear autoencoder model, which tends to favor popular items, and evaluate its effectiveness. Experimental results revealed that our method substantially improves the diversity of items in the recommendation list without sacrificing recommendation accuracy.
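To make the contrast in the abstract above concrete, the sketch below compares a power-law propensity with a sigmoid applied to the logarithm of item frequency. It is an illustrative sketch only, not the paper's exact formulation: the function names, the exponent `gamma`, and the sigmoid parameters `alpha` (slope) and `beta` (shift) are assumptions introduced here.

```python
import math

def power_law_propensity(freq, max_freq, gamma=0.5):
    """Common IPS choice: propensity is a power of the relative item frequency."""
    return (freq / max_freq) ** gamma

def sigmoid_log_propensity(freq, alpha=1.0, beta=0.0):
    """Sigmoid applied to the log of the observation frequency.

    alpha and beta are hypothetical tuning knobs, not values from the paper.
    """
    return 1.0 / (1.0 + math.exp(-(alpha * math.log(freq) - beta)))

item_freqs = [1, 10, 100, 1000]
max_freq = max(item_freqs)

for f in item_freqs:
    p_pow = power_law_propensity(f, max_freq)
    p_sig = sigmoid_log_propensity(f, alpha=1.0, beta=math.log(10))
    # IPS weights each observed interaction by 1 / propensity.
    print(f, round(1 / p_pow, 3), round(1 / p_sig, 3))
```

With these example settings the sigmoid form reduces to f / (f + 10), so the propensity levels off for frequent items while still staying low for rare ones. The slope and shift give two independent knobs, which is the added flexibility the abstract refers to.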
9. 【2512.20854】How important is Recall for Measuring Retrieval Quality?
链接:https://arxiv.org/abs/2512.20854
作者:Shelly Schwartz,Oleg Vasilyev,Randy Sawaya
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:realistic retrieval settings, evolving knowledge bases, typically unknown, settings with large, large and evolving
备注:
点击查看摘要
Abstract:In realistic retrieval settings with large and evolving knowledge bases, the total number of documents relevant to a query is typically unknown, and recall cannot be computed. In this paper, we evaluate several established strategies for handling this limitation by measuring the correlation between retrieval quality metrics and LLM-based judgments of response quality, where responses are generated from the retrieved documents. We conduct experiments across multiple datasets with a relatively low number of relevant documents (2-15). We also introduce a simple retrieval quality measure that performs well without requiring knowledge of the total number of relevant documents.
10. 【2512.20781】Soft Filtering: Guiding Zero-shot Composed Image Retrieval with Prescriptive and Proscriptive Constraints
链接:https://arxiv.org/abs/2512.20781
作者:Youjin Jung,Seongwoo Cho,Hyun-seok Min,Sungchul Choi
类目:Information Retrieval (cs.IR)
关键词:Composed Image Retrieval, Composed Image, aims to find, Zero-shot CIR, CIR benchmarks
备注: Accepted to AAAI 2026 Workshop on New Frontiers in Information Retrieval
点击查看摘要
Abstract:Composed Image Retrieval (CIR) aims to find a target image that aligns with user intent, expressed through a reference image and a modification text. While Zero-shot CIR (ZS-CIR) methods sidestep the need for labeled training data by leveraging pretrained vision-language models, they often rely on a single fused query that merges all descriptive cues of what the user wants, tending to dilute key information and failing to account for what they wish to avoid. Moreover, current CIR benchmarks assume a single correct target per query, overlooking the ambiguity in modification texts. To address these challenges, we propose Soft Filtering with Textual constraints (SoFT), a training-free, plug-and-play filtering module for ZS-CIR. SoFT leverages multimodal large language models (LLMs) to extract two complementary constraints from the reference-modification pair: prescriptive (must-have) and proscriptive (must-avoid) constraints. These serve as semantic filters that reward or penalize candidate images to re-rank results, without modifying the base retrieval model or adding supervision. In addition, we construct a two-stage dataset pipeline that refines CIR benchmarks. We first identify multiple plausible targets per query to construct multi-target triplets, capturing the open-ended nature of user intent. We then guide multimodal LLMs to rewrite the modification text to focus on one target, while referencing contrastive distractors to ensure precision. This enables more comprehensive and reliable evaluation under varying ambiguity levels. Applied on top of CIReVL, a ZS-CIR retriever, SoFT raises R@5 to 65.25 on CIRR (+12.94), mAP@50 to 27.93 on CIRCO (+6.13), and R@50 to 58.44 on FashionIQ (+4.59), demonstrating broad effectiveness.
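The reward-and-penalize re-ranking idea described in the abstract above can be sketched in a few lines. This is a hypothetical illustration, not the SoFT implementation: the additive scoring rule, the weights, the dictionary keys, and the candidate scores are all assumptions made for the example.

```python
def soft_filter_rerank(candidates, w_reward=1.0, w_penalty=1.0):
    """Re-rank by base score plus constraint reward minus constraint penalty.

    Each candidate is a dict with hypothetical keys:
      'base'  - similarity from the unchanged base retriever
      'must'  - match score against prescriptive (must-have) constraints
      'avoid' - match score against proscriptive (must-avoid) constraints
    """
    def score(c):
        return c["base"] + w_reward * c["must"] - w_penalty * c["avoid"]
    return sorted(candidates, key=score, reverse=True)

# Made-up scores for three candidate images.
candidates = [
    {"id": "img_a", "base": 0.80, "must": 0.20, "avoid": 0.70},
    {"id": "img_b", "base": 0.75, "must": 0.60, "avoid": 0.10},
    {"id": "img_c", "base": 0.70, "must": 0.50, "avoid": 0.40},
]

ranked = soft_filter_rerank(candidates)
print([c["id"] for c in ranked])
```

In this toy run `img_a` has the highest base score but drops to last place because it matches the must-avoid constraint strongly. The filtering is soft in that no candidate is discarded outright, and the base retriever is left untouched.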
11. 【2512.20626】MegaRAG: Multimodal Knowledge Graph-Based Retrieval Augmented Generation
链接:https://arxiv.org/abs/2512.20626
作者:Chi-Hsiang Hsiao,Yi-Cheng Wang,Tzung-Sheng Lin,Yi-Ren Yeh,Chu-Song Chen
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
关键词:large language models, access external information, dynamically access external, previously unseen documents, enables large language
备注:
点击查看摘要
Abstract:Retrieval-augmented generation (RAG) enables large language models (LLMs) to dynamically access external information, which is powerful for answering questions over previously unseen documents. Nonetheless, they struggle with high-level conceptual understanding and holistic comprehension due to limited context windows, which constrain their ability to perform deep reasoning over long-form, domain-specific content such as full-length books. To solve this problem, knowledge graphs (KGs) have been leveraged to provide entity-centric structure and hierarchical summaries, offering more structured support for reasoning. However, existing KG-based RAG solutions remain restricted to text-only inputs and fail to leverage the complementary insights provided by other modalities such as vision. On the other hand, reasoning from visual documents requires integrating textual, visual, and spatial cues into structured, hierarchical concepts. To address this issue, we introduce a multimodal knowledge graph-based RAG that enables cross-modal reasoning for better content understanding. Our method incorporates visual cues into the construction of knowledge graphs, the retrieval phase, and the answer generation process. Experimental results across both global and fine-grained question answering tasks show that our approach consistently outperforms existing RAG-based approaches on both textual and multimodal corpora.
Computer Vision
1. 【2512.21338】HiStream: Efficient High-Resolution Video Generation via Redundancy-Eliminated Streaming
Link: https://arxiv.org/abs/2512.21338
Authors: Haonan Qiu,Shikun Liu,Zijian Zhou,Zhaochong An,Weiming Ren,Zhiheng Liu,Jonas Schult,Sen He,Shoufa Chen,Yuren Cong,Tao Xiang,Ziwei Liu,Juan-Manuel Perez-Rua
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: practical inference infeasible, High-resolution video generation, media and film, Spatial Compression, Temporal Compression
Comments: Project Page: [this http URL](http://haonanqiu.com/projects/HiStream.html)
Abstract:High-resolution video generation, while crucial for digital media and film, is computationally bottlenecked by the quadratic complexity of diffusion models, making practical inference infeasible. To address this, we introduce HiStream, an efficient autoregressive framework that systematically reduces redundancy across three axes: i) Spatial Compression: denoising at low resolution before refining at high resolution with cached features; ii) Temporal Compression: a chunk-by-chunk strategy with a fixed-size anchor cache, ensuring stable inference speed; and iii) Timestep Compression: applying fewer denoising steps to subsequent, cache-conditioned chunks. On 1080p benchmarks, our primary HiStream model (i+ii) achieves state-of-the-art visual quality while demonstrating up to 76.2x faster denoising compared to the Wan2.1 baseline and negligible quality loss. Our faster variant, HiStream+, applies all three optimizations (i+ii+iii), achieving a 107.5x acceleration over the baseline, offering a compelling trade-off between speed and quality, thereby making high-resolution video generation both practical and scalable.
2. 【2512.21337】Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models
Link: https://arxiv.org/abs/2512.21337
Authors: Li-Zhong Szu-Tu,Ting-Lin Wu,Chia-Jui Chang,He Syu,Yu-Lun Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: famous buildings compared, indicating a reliance, generalizable understanding, significant popularity bias, expose a significant
Comments: Project page: [this https URL](https://sytwu.github.io/BeyondMemo/)
Abstract:We expose a significant popularity bias in state-of-the-art vision-language models (VLMs), which achieve up to 34% higher accuracy on famous buildings compared to ordinary ones, indicating a reliance on memorization over generalizable understanding. To systematically investigate this, we introduce the largest open benchmark for this task: the YearGuessr dataset, a collection of 55,546 building images with multi-modal attributes from 157 countries, annotated with continuous ordinal labels of their construction year (1001-2024), GPS data, and page-view counts as a proxy for popularity. Using this dataset, we frame the construction year prediction task as ordinal regression and introduce popularity-aware interval accuracy metrics to quantify this bias. Our resulting benchmark of 30+ models, including our YearCLIP model, confirms that VLMs excel on popular, memorized items but struggle significantly with unrecognized subjects, exposing a critical flaw in their reasoning capabilities.
3. 【2512.21334】Streaming Video Instruction Tuning
Link: https://arxiv.org/abs/2512.21334
Authors: Jiaer Xia,Peixian Chen,Mengdan Zhang,Xing Sun,Kaiyang Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: LLM that serves, streaming video LLM, general-purpose interactive assistant, video LLM, general-purpose interactive
Comments:
Abstract:We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narrowly on question answering or captioning, Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To develop such versatility, we construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. The dataset covers diverse temporal contexts and multi-task supervision, enabling unified training across heterogeneous streaming tasks. After training end-to-end on the instruction-following dataset through a streamlined pipeline, Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of streaming benchmarks. Extensive experiments show that Streamo bridges the gap between offline video perception models and real-time multimodal assistants, making a step toward unified, intelligent video understanding in continuous video streams.
4. 【2512.21333】Fast SAM2 with Text-Driven Token Pruning
Link: https://arxiv.org/abs/2512.21333
Authors: Avilasha Mandal,Chaoning Zhang,Fachrina Dewi Puspitasari,Xudong Wang,Jiaquan Zhang,Caiyan Qin,Guoqing Wang,Yang Yang,Heng Tao Shen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: vision foundation model, deployment remains limited, Segment Anything Model, foundation model, practical deployment remains
Comments: 28 pages, 9 figures
Abstract:Segment Anything Model 2 (SAM2), a vision foundation model, has significantly advanced prompt-driven video object segmentation, yet its practical deployment remains limited by the high computational and memory cost of processing dense visual tokens across time. SAM2 pipelines typically propagate all visual tokens produced by the image encoder through downstream temporal reasoning modules, regardless of their relevance to the target object, resulting in reduced scalability due to quadratic memory attention overhead. In this work, we introduce a text-guided token pruning framework that improves inference efficiency by selectively reducing token density prior to temporal propagation, without modifying the underlying segmentation architecture. Operating after visual encoding and before memory-based propagation, our method ranks tokens using a lightweight routing mechanism that integrates local visual context, semantic relevance derived from object-centric textual descriptions (either user-provided or automatically generated), and uncertainty cues that help preserve ambiguous or boundary-critical regions. By retaining only the most informative tokens for downstream processing, the proposed approach reduces redundant computation while maintaining segmentation fidelity. Extensive experiments across multiple challenging video segmentation benchmarks demonstrate that post-encoder token pruning provides a practical and effective pathway to efficient, prompt-aware video segmentation, achieving up to 42.50 percent faster inference and 37.41 percent lower GPU memory usage compared to the unpruned baseline SAM2, while preserving competitive J and F performance. These results highlight the potential of early token selection to improve the scalability of transformer-based video segmentation systems for real-time and resource-constrained applications.
5. 【2512.21331】TICON: A Slide-Level Tile Contextualizer for Histopathology Representation Learning
Link: https://arxiv.org/abs/2512.21331
Authors: Varun Belagali,Saarthak Kapse,Pierre Marza,Srijan Das,Zilinghan Li,Sofiène Boutaj,Pushpak Pati,Srikar Yellapragada,Tarak Nath Nandi,Ravi K Madduri,Joel Saltz,Prateek Prasanna,Stergios Christodoulidis,Maria Vakalopoulou,Dimitris Samaras
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: larger image context, slide images, larger image, interpretation of small, large whole slide
Comments:
Abstract:The interpretation of small tiles in large whole slide images (WSI) often needs a larger image context. We introduce TICON, a transformer-based tile representation contextualizer that produces rich, contextualized embeddings for "any" application in computational pathology. Standard tile encoder-based pipelines, which extract embeddings of tiles stripped from their context, fail to model the rich slide-level information essential for both local and global tasks. Furthermore, different tile-encoders excel at different downstream tasks. Therefore, a unified model is needed to contextualize embeddings derived from "any" tile-level foundation model. TICON addresses this need with a single, shared encoder, pretrained using a masked modeling objective to simultaneously unify and contextualize representations from diverse tile-level pathology foundation models. Our experiments demonstrate that TICON-contextualized embeddings significantly improve performance across many different tasks, establishing new state-of-the-art results on tile-level benchmarks (i.e., HEST-Bench, THUNDER, CATCH) and slide-level benchmarks (i.e., Patho-Bench). Finally, we pretrain an aggregator on TICON to form a slide-level foundation model, using only 11K WSIs, outperforming SoTA slide-level foundation models pretrained with up to 350K WSIs.
6. 【2512.21315】Does the Data Processing Inequality Reflect Practice? On the Utility of Low-Level Tasks
Link: https://arxiv.org/abs/2512.21315
Authors: Roy Turgeman,Tom Tirer
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Keywords: information-theoretic principle stating, data processing inequality, optimal Bayes classifier, information-theoretic principle, principle stating
Comments:
Abstract:The data processing inequality is an information-theoretic principle stating that the information content of a signal cannot be increased by processing the observations. In particular, it suggests that there is no benefit in enhancing the signal or encoding it before addressing a classification problem. This assertion can be proven to be true for the case of the optimal Bayes classifier. However, in practice, it is common to perform "low-level" tasks before "high-level" downstream tasks despite the overwhelming capabilities of modern deep neural networks. In this paper, we aim to understand when and why low-level processing can be beneficial for classification. We present a comprehensive theoretical study of a binary classification setup, where we consider a classifier that is tightly connected to the optimal Bayes classifier and converges to it as the number of training samples increases. We prove that for any finite number of training samples, there exists a pre-classification processing that improves the classification accuracy. We also explore the effect of class separation, training set size, and class balance on the relative gain from this procedure. We support our theory with an empirical investigation of the theoretical setup. Finally, we conduct an empirical study where we investigate the effect of denoising and encoding on the performance of practical deep classifiers on benchmark datasets. Specifically, we vary the size and class distribution of the training set, and the noise level, and demonstrate trends that are consistent with our theoretical results.
7. 【2512.21302】AndroidLens: Long-latency Evaluation with Nested Sub-targets for Android GUI Agents
Link: https://arxiv.org/abs/2512.21302
Authors: Yue Cao,Yingyao Wang,Pi Bu,Jingxuan Xing,Wei Jiang,Zekun Zhu,Junpeng Ma,Sashuai Zhou,Tong Lu,Jun Song,Yu Cheng,Yuning Jiang,Bo Zheng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Graphical user interface, substantially improve productivity, automating frequently executed, frequently executed long-latency, Graphical user
Comments: 23 pages, 13 figures, 8 tables
Abstract:Graphical user interface (GUI) agents can substantially improve productivity by automating frequently executed long-latency tasks on mobile devices. However, existing evaluation benchmarks are still constrained to limited applications, simple tasks, and coarse-grained metrics. To address this, we introduce AndroidLens, a challenging evaluation framework for mobile GUI agents, comprising 571 long-latency tasks in both Chinese and English environments, each requiring an average of more than 26 steps to complete. The framework features: (1) tasks derived from real-world user scenarios across 38 domains, covering complex types such as multi-constraint, multi-goal, and domain-specific tasks; (2) static evaluation that preserves real-world anomalies and allows multiple valid paths to reduce bias; and (3) dynamic evaluation that employs a milestone-based scheme for fine-grained progress measurement via Average Task Progress (ATP). Our evaluation indicates that even the best models reach only a 12.7% task success rate and 50.47% ATP. We also underscore key challenges in real-world environments, including environmental anomalies, adaptive exploration, and long-term memory retention.
8. 【2512.21287】Post-Processing Mask-Based Table Segmentation for Structural Coordinate Extraction
Link: https://arxiv.org/abs/2512.21287
Authors: Suren Bandara
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: scanned documents, digital archives, Structured data extraction, plays a crucial, crucial role
Comments:
Abstract:Structured data extraction from tables plays a crucial role in document image analysis for scanned documents and digital archives. Although many methods have been proposed to detect table structures and extract cell contents, accurately identifying table segment boundaries (rows and columns) remains challenging, particularly in low-resolution or noisy images. In many real-world scenarios, table data are incomplete or degraded, limiting the adaptability of transformer-based methods to noisy inputs. Mask-based edge detection techniques have shown greater robustness under such conditions, as their sensitivity can be adjusted through threshold tuning; however, existing approaches typically apply masks directly to images, leading to noise sensitivity, resolution loss, or high computational cost. This paper proposes a novel multi-scale signal-processing method for detecting table edges from table masks. Row and column transitions are modeled as one-dimensional signals and processed using Gaussian convolution with progressively increasing variances, followed by statistical thresholding to suppress noise while preserving stable structural edges. Detected signal peaks are mapped back to image coordinates to obtain accurate segment boundaries. Experimental results show that applying the proposed approach to column edge detection improves Cell-Aware Segmentation Accuracy (CASA), a layout-aware metric evaluating both textual correctness and correct cell placement, from 67% to 76% on the PubLayNet-1M benchmark when using TableNet with PyTesseract OCR. The method is robust to resolution variations through zero-padding and scaling strategies and produces optimized structured tabular outputs suitable for downstream analysis.
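The signal-processing steps named in the abstract above (1D transition signal, Gaussian smoothing at increasing scales, statistical thresholding, peak detection) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: the threshold rule (mean plus `k` standard deviations), the scale set `sigmas`, the tolerance `tol`, and all function names are hypothetical.

```python
import math
from statistics import mean, pstdev

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at about 3 sigma."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(signal, sigma):
    """Convolve the signal with a Gaussian, zero-padding at the borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < n:
                acc += w * signal[j]
        out.append(acc)
    return out

def peaks_above_threshold(signal, sigma, k):
    """Local maxima of the smoothed signal that exceed mean + k * std."""
    s = smooth(signal, sigma)
    threshold = mean(s) + k * pstdev(s)
    return [i for i in range(1, len(s) - 1)
            if s[i] > threshold and s[i] > s[i - 1] and s[i] > s[i + 1]]

def detect_edges(signal, sigmas=(1.0, 2.0), k=1.0, tol=2):
    """Keep finest-scale peaks that reappear (within tol) at every coarser scale."""
    per_scale = [peaks_above_threshold(signal, s, k) for s in sigmas]
    finest, coarser = per_scale[0], per_scale[1:]
    return [p for p in finest
            if all(any(abs(p - q) <= tol for q in peaks) for peaks in coarser)]
```

For a column signal with strong transitions at two positions and a weak noise spike between them, only the two strong positions survive both scales; the returned indices would then be mapped back to image coordinates.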
9. 【2512.21284】Surgical Scene Segmentation using a Spike-Driven Video Transformer with Real-Time Potential
Link: https://arxiv.org/abs/2512.21284
Authors: Shihao Zou,Jingjing Li,Wei Ji,Jincai Huang,Kai Wang,Guo Dan,Weixin Si,Yi Pan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Modern surgical systems, enhanced intra-operative safety, systems increasingly rely, provide timely situational, timely situational awareness
Comments:
Abstract:Modern surgical systems increasingly rely on intelligent scene understanding to provide timely situational awareness for enhanced intra-operative safety. Within this pipeline, surgical scene segmentation plays a central role in accurately perceiving operative events. Although recent deep learning models, particularly large-scale foundation models, achieve remarkable segmentation accuracy, their substantial computational demands and power consumption hinder real-time deployment in resource-constrained surgical environments. To address this limitation, we explore emerging spiking neural networks (SNNs) as a promising paradigm for highly efficient surgical intelligence. However, their performance is still constrained by the scarcity of labeled surgical data and the inherently sparse nature of surgical video representations. To this end, we propose SpikeSurgSeg, the first spike-driven video Transformer framework tailored for surgical scene segmentation with real-time potential on non-GPU platforms. To address the limited availability of surgical annotations, we introduce a surgical-scene masked autoencoding pretraining strategy for SNNs that enables robust spatiotemporal representation learning via layer-wise tube masking. Building on this pretrained backbone, we further adopt a lightweight spike-driven segmentation head that produces temporally consistent predictions while preserving the low-latency characteristics of SNNs. Extensive experiments on EndoVis18 and our in-house SurgBleed dataset demonstrate that SpikeSurgSeg achieves mIoU comparable to SOTA ANN-based models while reducing inference latency by at least $8\times$. Notably, it delivers over $20\times$ acceleration relative to most foundation-model baselines, underscoring its potential for time-critical surgical scene segmentation.
10. 【2512.21276】GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation
Link: https://arxiv.org/abs/2512.21276
Authors: Snehal Singh Tomar,Alexandros Graikos,Arjun Krishna,Dimitris Samaras,Klaus Mueller
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Modern deep learning, Modern deep, sequentially stacked frames, deep learning methods, deep learning
Comments:
Abstract:Modern deep learning methods typically treat image sequences as large tensors of sequentially stacked frames. However, is this straightforward representation ideal given the current state-of-the-art (SoTA)? In this work, we address this question in the context of generative models and aim to devise a more effective way of modeling image sequence data. Observing the inefficiencies and bottlenecks of current SoTA image sequence generation methods, we showcase that rather than working with large tensors, we can improve the generation process by factorizing it into first generating the coarse sequence at low resolution and then refining the individual frames at high resolution. We train a generative model solely on grid images comprising subsampled frames. Yet, we learn to generate image sequences, using the strong self-attention mechanism of the Diffusion Transformer (DiT) to capture correlations between frames. In effect, our formulation extends a 2D image generator to operate as a low-resolution 3D image-sequence generator without introducing any architectural modifications. Subsequently, we super-resolve each frame individually to add the sequence-independent high-resolution details. This approach offers several advantages and can overcome key limitations of the SoTA in this domain. Compared to existing image sequence generation models, our method achieves superior synthesis quality and improved coherence across sequences. It also delivers high-fidelity generation of arbitrary-length sequences and increased efficiency in inference time and training data usage. Furthermore, our straightforward formulation enables our method to generalize effectively across diverse data domains, which typically require additional priors and supervision to model in a generative context. Our method consistently outperforms SoTA in quality and inference speed (at least twice-as-fast) across datasets.
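To make the "grid images comprising subsampled frames" idea in the abstract above concrete, here is a small sketch that tiles a frame sequence into one grid image and splits it back. The row-major layout, the nested-list image representation, and the function names are assumptions for illustration; the abstract does not specify how the paper lays out its grids.

```python
def frames_to_grid(frames, rows, cols):
    """Tile rows*cols equally sized frames (lists of pixel rows) into one grid image.

    Frames are placed in row-major order (an assumption, not from the paper).
    """
    if len(frames) != rows * cols:
        raise ValueError("need exactly rows * cols frames")
    height = len(frames[0])
    grid = []
    for r in range(rows):
        for y in range(height):
            row_pixels = []
            for c in range(cols):
                row_pixels.extend(frames[r * cols + c][y])
            grid.append(row_pixels)
    return grid

def grid_to_frames(grid, rows, cols):
    """Inverse of frames_to_grid: cut the grid image back into individual frames."""
    height = len(grid) // rows
    width = len(grid[0]) // cols
    frames = []
    for r in range(rows):
        for c in range(cols):
            frames.append([grid[r * height + y][c * width:(c + 1) * width]
                           for y in range(height)])
    return frames
```

In the factorized pipeline the abstract describes, a 2D generator would produce the low-resolution grid, the grid would be split into frames as in `grid_to_frames`, and each frame would then be super-resolved individually.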
11. 【2512.21268】ACD: Direct Conditional Control for Video Diffusion Models via Attention Supervision
Link: https://arxiv.org/abs/2512.21268
Authors: Weiqi Li,Zehao Zhang,Liang Lin,Guangrun Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: fundamental requirement, Controllability, ACD, conditioning, video synthesis
Comments:
Abstract:Controllability is a fundamental requirement in video synthesis, where accurate alignment with conditioning signals is essential. Existing classifier-free guidance methods typically achieve conditioning indirectly by modeling the joint distribution of data and conditions, which often results in limited controllability over the specified conditions. Classifier-based guidance enforces conditions through an external classifier, but the model may exploit this mechanism to raise the classifier score without genuinely satisfying the intended condition, resulting in adversarial artifacts and limited effective controllability. In this paper, we propose Attention-Conditional Diffusion (ACD), a novel framework for direct conditional control in video diffusion models via attention supervision. By aligning the model's attention maps with external control signals, ACD achieves better controllability. To support this, we introduce a sparse 3D-aware object layout as an efficient conditioning signal, along with a dedicated Layout ControlNet and an automated annotation pipeline for scalable layout integration. Extensive experiments on benchmark video generation datasets demonstrate that ACD delivers superior alignment with conditioning inputs while preserving temporal coherence and visual fidelity, establishing an effective paradigm for conditional video synthesis.
12. 【2512.21264】AnyAD: Unified Any-Modality Anomaly Detection in Incomplete Multi-Sequence MRI
Link: https://arxiv.org/abs/2512.21264
Authors: Changwei Wu,Yifei Chen,Yuxin Du,Mingxuan Liu,Jinying Zong,Beining Wu,Jie Dong,Feiwei Qin,Yunkang Cao,Qiyuan Tian
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Reliable anomaly detection, remains challenging due, key imaging modalities, brain MRI remains, MRI remains challenging
Comments: 15 pages, 8 figures
Abstract:Reliable anomaly detection in brain MRI remains challenging due to the scarcity of annotated abnormal cases and the frequent absence of key imaging modalities in real clinical workflows. Existing single-class or multi-class anomaly detection (AD) models typically rely on fixed modality configurations, require repetitive training, or fail to generalize to unseen modality combinations, limiting their clinical scalability. In this work, we present a unified Any-Modality AD framework that performs robust anomaly detection and localization under arbitrary MRI modality availability. The framework integrates a dual-pathway DINOv2 encoder with a feature distribution alignment mechanism that statistically aligns incomplete-modality features with full-modality representations, enabling stable inference even with severe modality dropout. To further enhance semantic consistency, we introduce an Intrinsic Normal Prototypes (INPs) extractor and an INP-guided decoder that reconstruct only normal anatomical patterns while naturally amplifying abnormal deviations. Through randomized modality masking and indirect feature completion during training, the model learns to adapt to all modality configurations without re-training. Extensive experiments on BraTS2018, MU-Glioma-Post, and Pretreat-MetsToBrain-Masks demonstrate that our approach consistently surpasses state-of-the-art industrial and medical AD baselines across 7 modality combinations, achieving superior generalization. This study establishes a scalable paradigm for multimodal medical AD under real-world, imperfect modality conditions. Our source code is available at this https URL.
13. 【2512.21252】DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation
Link: https://arxiv.org/abs/2512.21252
Authors: Jiawei Liu,Junqiao Li,Jiangfan Deng,Gen Li,Siyu Zhou,Zetao Fang,Shanshan Lao,Zengde Deng,Jianing Zhu,Tingting Ma,Jiayi Li,Yunqiu Wang,Qian He,Xinglong Wu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: technique represents, aesthetic in filmmaking, represents a distinct, distinct and sophisticated, sophisticated aesthetic
Comments: Project Page: [this https URL](https://dreamontage.github.io/DreaMontage/)
Abstract:The "one-shot" technique represents a distinct and sophisticated aesthetic in filmmaking. However, its practical realization is often hindered by prohibitive costs and complex real-world constraints. Although emerging video generation models offer a virtual alternative, existing approaches typically rely on naive clip concatenation, which frequently fails to maintain visual smoothness and temporal coherence. In this paper, we introduce DreaMontage, a comprehensive framework designed for arbitrary frame-guided generation, capable of synthesizing seamless, expressive, and long-duration one-shot videos from diverse user-provided inputs. To achieve this, we address the challenge through three primary dimensions. (i) We integrate a lightweight intermediate-conditioning mechanism into the DiT architecture. By employing an Adaptive Tuning strategy that effectively leverages base training data, we unlock robust arbitrary-frame control capabilities. (ii) To enhance visual fidelity and cinematic expressiveness, we curate a high-quality dataset and implement a Visual Expression SFT stage. In addressing critical issues such as subject motion rationality and transition smoothness, we apply a Tailored DPO scheme, which significantly improves the success rate and usability of the generated content. (iii) To facilitate the production of extended sequences, we design a Segment-wise Auto-Regressive (SAR) inference strategy that operates in a memory-efficient manner. Extensive experiments demonstrate that our approach achieves visually striking and seamlessly coherent one-shot effects while maintaining computational efficiency, empowering users to transform fragmented visual materials into vivid, cohesive one-shot cinematic experiences.
14. 【2512.21241】Improving the Convergence Rate of Ray Search Optimization for Query-Efficient Hard-Label Attacks
Link: https://arxiv.org/abs/2512.21241
Authors: Xinjie Xu,Shuyu Cheng,Dongwei Xu,Qi Xuan,Chen Ma
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: hard-label black-box adversarial, black-box adversarial attacks, prohibitive query complexity, query complexity poses, predicted label
Comments: Published at AAAI 2026 (Oral). This version corresponds to the conference proceedings; v2 will include the appendix
Abstract:In hard-label black-box adversarial attacks, where only the top-1 predicted label is accessible, the prohibitive query complexity poses a major obstacle to practical deployment. In this paper, we focus on optimizing a representative class of attacks that search for the optimal ray direction yielding the minimum $\ell_2$-norm perturbation required to move a benign image into the adversarial region. Inspired by Nesterov's Accelerated Gradient (NAG), we propose a momentum-based algorithm, ARS-OPT, which proactively estimates the gradient with respect to a future ray direction inferred from accumulated momentum. We provide a theoretical analysis of its convergence behavior, showing that ARS-OPT enables more accurate directional updates and achieves faster, more stable optimization. To further accelerate convergence, we incorporate surrogate-model priors into ARS-OPT's gradient estimation, resulting in PARS-OPT with enhanced performance. The superiority of our approach is supported by theoretical guarantees under standard assumptions. Extensive experiments on ImageNet and CIFAR-10 demonstrate that our method surpasses 13 state-of-the-art approaches in query efficiency.
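For readers unfamiliar with the Nesterov-style "lookahead" that the abstract above cites as inspiration, the sketch below shows generic Nesterov's Accelerated Gradient on a toy objective. It is not ARS-OPT: it uses an exact gradient function rather than gradient estimates obtained from hard-label queries, and the function name, step size, and momentum value are assumptions chosen for illustration.

```python
def nesterov_minimize(grad, x0, lr=0.05, momentum=0.9, steps=200):
    """Generic NAG: evaluate the gradient at the point the momentum will carry us to."""
    x = list(x0)
    v = [0.0] * len(x)
    for _ in range(steps):
        # Lookahead position inferred from the accumulated momentum.
        lookahead = [xi + momentum * vi for xi, vi in zip(x, v)]
        g = grad(lookahead)
        # Update velocity with the lookahead gradient, then move.
        v = [momentum * vi - lr * gi for vi, gi in zip(v, g)]
        x = [xi + vi for xi, vi in zip(x, v)]
    return x

def toy_grad(p):
    """Gradient of the toy objective f(p) = (p0 - 3)^2 + 10 * (p1 + 1)^2."""
    return [2.0 * (p[0] - 3.0), 20.0 * (p[1] + 1.0)]

minimum = nesterov_minimize(toy_grad, [0.0, 0.0])
```

The only difference from classical momentum is where the gradient is evaluated: at `lookahead` rather than at the current point `x`.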
15. 【2512.21237】SegMo: Segment-aligned Text to 3D Human Motion Generation
Link: https://arxiv.org/abs/2512.21237
Authors: Bowen Dang,Lin Wu,Xiaohang Yang,Zheng Yuan,Zhixiang Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: important research problem, virtual reality, augmented reality, motion, Motion Segment Extraction
Comments: The IEEE/CVF Winter Conference on Applications of Computer Vision 2026
Abstract:Generating 3D human motions from textual descriptions is an important research problem with broad applications in video games, virtual reality, and augmented reality. Recent methods align the textual description with human motion at the sequence level, neglecting the internal semantic structure of modalities. However, both motion descriptions and motion sequences can be naturally decomposed into smaller and semantically coherent segments, which can serve as atomic alignment units to achieve finer-grained correspondence. Motivated by this, we propose SegMo, a novel Segment-aligned text-conditioned human Motion generation framework to achieve fine-grained text-motion alignment. Our framework consists of three modules: (1) Text Segment Extraction, which decomposes complex textual descriptions into temporally ordered phrases, each representing a simple atomic action; (2) Motion Segment Extraction, which partitions complete motion sequences into corresponding motion segments; and (3) Fine-grained Text-Motion Alignment, which aligns text and motion segments with contrastive learning. Extensive experiments demonstrate that SegMo improves on the strong baseline on two widely used datasets, achieving an improved Top-1 score of 0.553 on the HumanML3D test set. Moreover, thanks to the learned shared embedding space for text and motion segments, SegMo can also be applied to retrieval-style tasks such as motion grounding and motion-to-text retrieval.
16. 【2512.21221】Leveraging Lightweight Entity Extraction for Scalable Event-Based Image Retrieval
Link: https://arxiv.org/abs/2512.21221
Authors: Dao Sy Duy Minh,Huynh Trung Kiet,Nguyen Lam Phu Quy,Phu-Hoa Pham,Tran Chi Nguyen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: natural language processing, natural language descriptions, digital content management, natural language, Retrieving images
Comments: System description paper for EVENTA Grand Challenge Track 2 at ACM Multimedia 2025 (MM '25). Ranked 4th place. 6 pages, 1 figure, 2 tables
Abstract:Retrieving images from natural language descriptions is a core task at the intersection of computer vision and natural language processing, with wide-ranging applications in search engines, media archiving, and digital content management. However, real-world image-text retrieval remains challenging due to vague or context-dependent queries, linguistic variability, and the need for scalable solutions. In this work, we propose a lightweight two-stage retrieval pipeline that leverages event-centric entity extraction to incorporate temporal and contextual signals from real-world captions. The first stage performs efficient candidate filtering using BM25 based on salient entities, while the second stage applies BEiT-3 models to capture deep multimodal semantics and rerank the results. Evaluated on the OpenEvents v1 benchmark, our method achieves a mean average precision of 0.559, substantially outperforming prior baselines. These results highlight the effectiveness of combining event-guided filtering with long-text vision-language modeling for accurate and efficient retrieval in complex, real-world scenarios. Our code is available at this https URL
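The two-stage pattern in the abstract above (cheap lexical filtering, then a heavier reranker on the survivors) can be sketched as follows. The BM25 formula is the standard Okapi variant with a smoothed IDF; the parameter values `k1` and `b`, the `rerank_fn` callable (a stand-in for a multimodal model such as BEiT-3), and the function names are assumptions for illustration and do not reproduce the authors' pipeline.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for the given query terms."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

def two_stage_retrieve(query_terms, docs, rerank_fn, top_k=2):
    """Stage 1: keep the top_k documents by BM25. Stage 2: reorder them with rerank_fn."""
    scores = bm25_scores(query_terms, docs)
    candidates = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:top_k]
    return sorted(candidates, key=rerank_fn, reverse=True)
```

Here `docs` would hold the extracted entity tokens per caption, and `rerank_fn` maps a document index to a relevance score from the second-stage model; only the `top_k` survivors ever reach that more expensive model.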
17. 【2512.21220】RoboSafe: Safeguarding Embodied Agents via Executable Safety Logic
Link: https://arxiv.org/abs/2512.21220
Authors: Le Wang,Zonghao Ying,Xiao Yang,Quanchen Zou,Zhenfei Yin,Tianlin Li,Jian Yang,Yaodong Yang,Aishan Liu,Xianglong Liu
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: trigger unsafe behaviors, executing complex real-world, vision-language models, unsafe behaviors, powered by vision-language
Comments: 11 pages, 6 figures
Abstract:Embodied agents powered by vision-language models (VLMs) are increasingly capable of executing complex real-world tasks, yet they remain vulnerable to hazardous instructions that may trigger unsafe behaviors. Runtime safety guardrails, which intercept hazardous actions during task execution, offer a promising solution due to their flexibility. However, existing defenses often rely on static rule filters or prompt-level control, which struggle to address implicit risks arising in dynamic, temporally dependent, and context-rich environments. To address this, we propose RoboSafe, a hybrid reasoning runtime safeguard for embodied agents through executable predicate-based safety logic. RoboSafe integrates two complementary reasoning processes on a Hybrid Long-Short Safety Memory. We first propose a Backward Reflective Reasoning module that continuously revisits recent trajectories in short-term memory to infer temporal safety predicates and proactively triggers replanning when violations are detected. We then propose a Forward Predictive Reasoning module that anticipates upcoming risks by generating context-aware safety predicates from the long-term safety memory and the agent's multimodal observations. Together, these components form an adaptive, verifiable safety logic that is both interpretable and executable as code. Extensive experiments across multiple agents demonstrate that RoboSafe substantially reduces hazardous actions (-36.8% risk occurrence) compared with leading baselines, while maintaining near-original task performance. Real-world evaluations on physical robotic arms further confirm its practicality. Code will be released upon acceptance.
18. 【2512.21218】Latent Implicit Visual Reasoning
Link: https://arxiv.org/abs/2512.21218
Authors: Kelvin Li, Chuyi Shang, Leonid Karlinsky, Rogerio Feris, Trevor Darrell, Roei Herzig
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Multimodal Models, Large Multimodal, made significant progress, remain largely text-centric, core reasoning modality
Comments:
Abstract:While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are predominantly visual. Recent approaches have sought to address this by supervising intermediate visual steps with helper images, depth maps, or image crops. However, these strategies impose restrictive priors on what "useful" visual abstractions look like, add heavy annotation costs, and struggle to generalize across tasks. To address this critical limitation, we propose a task-agnostic mechanism that trains LMMs to discover and use visual reasoning tokens without explicit supervision. These tokens attend globally and re-encode the image in a task-adaptive way, enabling the model to extract relevant visual information without hand-crafted supervision. Our approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks -- including those where intermediate abstractions are hard to specify -- while also generalizing to multi-task instruction tuning.
19. 【2512.21209】Human Motion Estimation with Everyday Wearables
Link: https://arxiv.org/abs/2512.21209
Authors: Siqi Zhu, Yixuan Li, Junfu Li, Qi Wu, Zan Wang, Haozhe Ma, Wei Liang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: on-body device-based human, device-based human motion, expensive hardware, poor wearability, on-body device-based
Comments:
Abstract:While on-body device-based human motion estimation is crucial for applications such as XR interaction, existing methods often suffer from poor wearability, expensive hardware, and cumbersome calibration, which hinder their adoption in daily life. To address these challenges, we present EveryWear, a lightweight and practical human motion capture approach based entirely on everyday wearables: a smartphone, smartwatch, earbuds, and smart glasses equipped with one forward-facing and two downward-facing cameras, requiring no explicit calibration before use. We introduce Ego-Elec, a 9-hour real-world dataset covering 56 daily activities across 17 diverse indoor and outdoor environments, with ground-truth 3D annotations provided by the motion capture (MoCap), to facilitate robust research and benchmarking in this direction. Our approach employs a multimodal teacher-student framework that integrates visual cues from egocentric cameras with inertial signals from consumer devices. By training directly on real-world data rather than synthetic data, our model effectively eliminates the sim-to-real gap that constrains prior work. Experiments demonstrate that our method outperforms baseline models, validating its effectiveness for practical full-body motion estimation.
20. 【2512.21201】Schrödinger's Navigator: Imagining an Ensemble of Futures for Zero-Shot Object Navigation
Link: https://arxiv.org/abs/2512.21201
Authors: Yu He, Da Huang, Zhenyang Liu, Zixiao Gu, Qiang Sun, Guangnan Ye, Yanwei Fu
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: task-specific training, relying on pre-built, existing ZSON methods, previously unseen environment, Schrödinger Navigator
Comments:
Abstract:Zero-shot object navigation (ZSON) requires a robot to locate a target object in a previously unseen environment without relying on pre-built maps or task-specific training. However, existing ZSON methods often struggle in realistic and cluttered environments, particularly when the scene contains heavy occlusions, unknown risks, or dynamically moving target objects. To address these challenges, we propose Schrödinger's Navigator, a navigation framework inspired by Schrödinger's thought experiment on uncertainty. The framework treats unobserved space as a set of plausible future worlds and reasons over them before acting. Conditioned on egocentric visual inputs and three candidate trajectories, a trajectory-conditioned 3D world model imagines future observations along each path. This enables the agent to see beyond occlusions and anticipate risks in unseen regions without requiring extra detours or dense global mapping. The imagined 3D observations are fused into the navigation map and used to update a value map. These updates guide the policy toward trajectories that avoid occlusions, reduce exposure to uncertain space, and better track moving targets. Experiments on a Go2 quadruped robot across three challenging scenarios, including severe static occlusions, unknown risks, and dynamically moving targets, show that Schrödinger's Navigator consistently outperforms strong ZSON baselines in self-localization, object localization, and overall Success Rate in occlusion-heavy environments. These results demonstrate the effectiveness of trajectory-conditioned 3D imagination in enabling robust zero-shot object navigation.
21. 【2512.21194】VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs
Link: https://arxiv.org/abs/2512.21194
Authors: Brigitta Malagurski Törtei, Yasser Dahou, Ngoc Dung Huynh, Wamiq Reyaz Para, Phúc H. Lê Khac, Ankit Singh, Sofian Chaybouti, Sanath Narayan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: achieved remarkable progress, visual question answering, achieved remarkable, remarkable progress, question answering
Comments:
Abstract:Vision-Language Models (VLMs) have achieved remarkable progress across tasks such as visual question answering and image captioning. Yet, the extent to which these models perform visual reasoning as opposed to relying on linguistic priors remains unclear. To address this, we introduce VisRes Bench, a benchmark designed to study visual reasoning in naturalistic settings without contextual language supervision. Analyzing model behavior across three levels of complexity, we uncover clear limitations in perceptual and relational visual reasoning capacities. VisRes isolates distinct reasoning abilities across its levels. Level 1 probes perceptual completion and global image matching under perturbations such as blur, texture changes, occlusion, and rotation; Level 2 tests rule-based inference over a single attribute (e.g., color, count, orientation); and Level 3 targets compositional reasoning that requires integrating multiple visual attributes. Across more than 19,000 controlled task images, we find that state-of-the-art VLMs perform near random under subtle perceptual perturbations, revealing limited abstraction beyond pattern recognition. We conclude by discussing how VisRes provides a unified framework for advancing abstract visual reasoning in multimodal research.
22. 【2512.21185】UltraShape 1.0: High-Fidelity 3D Shape Generation via Scalable Geometric Refinement
Link: https://arxiv.org/abs/2512.21185
Authors: Tanghui Jia, Dongyu Yan, Dehao Hao, Yang Li, Kaiyi Zhang, Xianyi He, Lanjiong Li, Jinnan Chen, Lutao Jiang, Qishen Yin, Long Quan, Ying-Cong Chen, Li Yuan
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Keywords: framework for high-fidelity, geometric, generation, geometry, introduce UltraShape
Comments: 14 pages, 10 figures, Technical Report
Abstract:In this report, we introduce UltraShape 1.0, a scalable 3D diffusion framework for high-fidelity 3D geometry generation. The proposed approach adopts a two-stage generation pipeline: a coarse global structure is first synthesized and then refined to produce detailed, high-quality geometry. To support reliable 3D generation, we develop a comprehensive data processing pipeline that includes a novel watertight processing method and high-quality data filtering. This pipeline improves the geometric quality of publicly available 3D datasets by removing low-quality samples, filling holes, and thickening thin structures, while preserving fine-grained geometric details. To enable fine-grained geometry refinement, we decouple spatial localization from geometric detail synthesis in the diffusion process. We achieve this by performing voxel-based refinement at fixed spatial locations, where voxel queries derived from coarse geometry provide explicit positional anchors encoded via RoPE, allowing the diffusion model to focus on synthesizing local geometric details within a reduced, structured solution space. Our model is trained exclusively on publicly available 3D datasets, achieving strong geometric quality despite limited training resources. Extensive evaluations demonstrate that UltraShape 1.0 performs competitively with existing open-source methods in both data processing quality and geometry generation. All code and trained models will be released to support future research.
23. 【2512.21183】Towards Arbitrary Motion Completing via Hierarchical Continuous Representation
Link: https://arxiv.org/abs/2512.21183
Authors: Chenghao Xu, Guangtao Lyu, Qi Liu, Jiexi Yan, Muli Yang, Cheng Deng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: higher camera frame, rates typically contribute, camera frame rates, frame rates typically, Physical motions
Comments:
Abstract:Physical motions are inherently continuous, and higher camera frame rates typically contribute to improved smoothness and temporal coherence. For the first time, we explore continuous representations of human motion sequences, featuring the ability to interpolate, inbetween, and even extrapolate any input motion sequences at arbitrary frame rates. To achieve this, we propose a novel parametric activation-induced hierarchical implicit representation framework, referred to as NAME, based on Implicit Neural Representations (INRs). Our method introduces a hierarchical temporal encoding mechanism that extracts features from motion sequences at multiple temporal scales, enabling effective capture of intricate temporal patterns. Additionally, we integrate a custom parametric activation function, powered by Fourier transformations, into the MLP-based decoder to enhance the expressiveness of the continuous representation. This parametric formulation significantly augments the model's ability to represent complex motion behaviors with high accuracy. Extensive evaluations across several benchmark datasets demonstrate the effectiveness and robustness of our proposed approach.
24. 【2512.21174】A Turn Toward Better Alignment: Few-Shot Generative Adaptation with Equivariant Feature Rotation
Link: https://arxiv.org/abs/2512.21174
Authors: Chenghao Xu, Qi Liu, Jiexi Yan, Muli Yang, Cheng Deng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Few-shot image generation, image generation aims, Few-shot image, image generation, training images
Comments:
Abstract:Few-shot image generation aims to effectively adapt a source generative model to a target domain using very few training images. Most existing approaches introduce consistency constraints, typically through instance-level or distribution-level loss functions, to directly align the distribution patterns of source and target domains within their respective latent spaces. However, these strategies often fall short: overly strict constraints can amplify the negative effects of the domain gap, leading to distorted or uninformative content, while overly relaxed constraints may fail to leverage the source domain effectively. This limitation primarily stems from the inherent discrepancy in the underlying distribution structures of the source and target domains. The scarcity of target samples further compounds this issue by hindering accurate estimation of the target domain's distribution. To overcome these limitations, we propose Equivariant Feature Rotation (EFR), a novel adaptation strategy that aligns source and target domains at two complementary levels within a self-rotated proxy feature space. Specifically, we perform adaptive rotations within a parameterized Lie Group to transform both source and target features into an equivariant proxy space, where alignment is conducted. These learnable rotation matrices serve to bridge the domain gap by preserving intra-domain structural information without distortion, while the alignment optimization facilitates effective knowledge transfer from the source to the target domain. Comprehensive experiments on a variety of commonly used datasets demonstrate that our method significantly enhances the generative performance within the targeted domain.
25. 【2512.21150】ORCA: Object Recognition and Comprehension for Archiving Marine Species
Link: https://arxiv.org/abs/2512.21150
Authors: Yuk-Kwan Wong, Haixin Liang, Zeyu Ma, Yiwei Chen, Ziqiang Zheng, Rinaldi Gotama, Pascal Sebastian, Lauren D. Sparks, Sai-Kit Yeung
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: scalable biological surveys, protecting marine ecosystems, enabling automatic, biological surveys, essential for monitoring
Comments: Accepted by The IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026
Abstract:Marine visual understanding is essential for monitoring and protecting marine ecosystems, enabling automatic and scalable biological surveys. However, progress is hindered by limited training data and the lack of a systematic task formulation that aligns domain-specific marine challenges with well-defined computer vision tasks, thereby limiting effective model application. To address this gap, we present ORCA, a multi-modal benchmark for marine research comprising 14,647 images from 478 species, with 42,217 bounding box annotations and 22,321 expert-verified instance captions. The dataset provides fine-grained visual and textual annotations that capture morphology-oriented attributes across diverse marine species. To catalyze methodological advances, we evaluate 18 state-of-the-art models on three tasks: object detection (closed-set and open-vocabulary), instance captioning, and visual grounding. Results highlight key challenges, including species diversity, morphological overlap, and specialized domain demands, underscoring the difficulty of marine understanding. ORCA thus establishes a comprehensive benchmark to advance research in marine domain. Project Page: this http URL.
26. 【2512.21135】TGC-Net: A Structure-Aware and Semantically-Aligned Framework for Text-Guided Medical Image Segmentation
Link: https://arxiv.org/abs/2512.21135
Authors: Gaoren Lin, Huangxuan Zhao, Yuan Xiong, Lefei Zhang, Bo Du, Wentao Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: segmentation enhances segmentation, Text-guided medical segmentation, enhances segmentation accuracy, utilizing clinical reports, medical segmentation enhances
Comments:
Abstract:Text-guided medical segmentation enhances segmentation accuracy by utilizing clinical reports as auxiliary information. However, existing methods typically rely on unaligned image and text encoders, which necessitate complex interaction modules for multimodal fusion. While CLIP provides a pre-aligned multimodal feature space, its direct application to medical imaging is limited by three main issues: insufficient preservation of fine-grained anatomical structures, inadequate modeling of complex clinical descriptions, and domain-specific semantic misalignment. To tackle these challenges, we propose TGC-Net, a CLIP-based framework focusing on parameter-efficient, task-specific adaptations. Specifically, it incorporates a Semantic-Structural Synergy Encoder (SSE) that augments CLIP's ViT with a CNN branch for multi-scale structural refinement, a Domain-Augmented Text Encoder (DATE) that injects large-language-model-derived medical knowledge, and a Vision-Language Calibration Module (VLCM) that refines cross-modal correspondence in a unified feature space. Experiments on five datasets across chest X-ray and thoracic CT modalities demonstrate that TGC-Net achieves state-of-the-art performance with substantially fewer trainable parameters, including notable Dice gains on challenging benchmarks.
27. 【2512.21126】MarineEval: Assessing the Marine Intelligence of Vision-Language Models
Link: https://arxiv.org/abs/2512.21126
Authors: Yuk-Kwan Wong, Tuan-An To, Jipeng Zhang, Ziqiang Zheng, Sai-Kit Yeung
Categories: Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB)
Keywords: witnessed promising progress, promising progress led, vision language models, existing VLMs, general-purpose assistant
Comments: Accepted by The IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026
Abstract:We have witnessed promising progress led by large language models (LLMs) and further vision language models (VLMs) in handling various queries as a general-purpose assistant. VLMs, as a bridge to connect the visual world and language corpus, receive both visual content and various text-only user instructions to generate corresponding responses. Though great success has been achieved by VLMs in various fields, in this work, we ask whether the existing VLMs can act as domain experts, accurately answering marine questions, which require significant domain expertise and address special domain challenges/requirements. To comprehensively evaluate the effectiveness and explore the boundary of existing VLMs, we construct the first large-scale marine VLM dataset and benchmark called MarineEval, with 2,000 image-based question-answering pairs. During our dataset construction, we ensure the diversity and coverage of the constructed data: 7 task dimensions and 20 capacity dimensions. The domain requirements are specially integrated into the data construction and further verified by the corresponding marine domain experts. We comprehensively benchmark 17 existing VLMs on our MarineEval and also investigate the limitations of existing models in answering marine research questions. The experimental results reveal that existing VLMs cannot effectively answer the domain-specific questions, and there is still a large room for further performance improvements. We hope our new benchmark and observations will facilitate future research. Project Page: this http URL
28. 【2512.21118】STLDM: Spatio-Temporal Latent Diffusion Model for Precipitation Nowcasting
Link: https://arxiv.org/abs/2512.21118
Authors: Shi Quan Foo, Chi-Ho Wong, Zhihan Gao, Dit-Yan Yeung, Ka-Hing Wong, Wai-Kin Wong
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: extreme weather events, prevent severe damage, severe damage owing, critical spatio-temporal prediction, Precipitation nowcasting
Comments: Accepted by TMLR. Camera-ready submission
Abstract:Precipitation nowcasting is a critical spatio-temporal prediction task for society to prevent severe damage owing to extreme weather events. Despite the advances in this field, the complex and stochastic nature of this task still poses challenges to existing approaches. Specifically, deterministic models tend to produce blurry predictions while generative models often struggle with poor accuracy. In this paper, we present a simple yet effective model architecture termed STLDM, a diffusion-based model that learns the latent representation from end to end alongside both the Variational Autoencoder and the conditioning network. STLDM decomposes this task into two stages: a deterministic forecasting stage handled by the conditioning network, and an enhancement stage performed by the latent diffusion model. Experimental results on multiple radar datasets demonstrate that STLDM achieves superior performance compared to the state of the art, while also improving inference efficiency. The code is available in this https URL.
29. 【2512.21104】FreeInpaint: Tuning-free Prompt Alignment and Visual Rationality Enhancement in Image Inpainting
Link: https://arxiv.org/abs/2512.21104
Authors: Chao Gong, Dong Li, Yingwei Pan, Jingjing Chen, Ting Yao, Tao Mei
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Text-guided image inpainting, Text-guided image, endeavors to generate, generate new content, image inpainting endeavors
Comments: Accepted by AAAI 2026
Abstract:Text-guided image inpainting endeavors to generate new content within specified regions of images using textual prompts from users. The primary challenge is to accurately align the inpainted areas with the user-provided prompts while maintaining a high degree of visual fidelity. While existing inpainting methods have produced visually convincing results by leveraging the pre-trained text-to-image diffusion models, they still struggle to uphold both prompt alignment and visual rationality simultaneously. In this work, we introduce FreeInpaint, a plug-and-play tuning-free approach that directly optimizes the diffusion latents on the fly during inference to improve the faithfulness of the generated images. Technically, we introduce a prior-guided noise optimization method that steers model attention towards valid inpainting regions by optimizing the initial noise. Furthermore, we meticulously design a composite guidance objective tailored specifically for the inpainting task. This objective efficiently directs the denoising process, enhancing prompt alignment and visual rationality by optimizing intermediate latents at each step. Through extensive experiments involving various inpainting diffusion models and evaluation metrics, we demonstrate the effectiveness and robustness of our proposed FreeInpaint.
30. 【2512.21099】TexAvatars: Hybrid Texel-3D Representations for Stable Rigging of Photorealistic Gaussian Head Avatars
Link: https://arxiv.org/abs/2512.21099
Authors: Jaeseong Lee, Junyeong Ahn, Taewoong Kang, Jaegul Choo
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: expressive user experiences, Constructing drivable, drivable and photorealistic, user experiences, central task
Comments: 3DV 2026, Project page with videos: https://summertight.github.io/TexAvatars/
Abstract:Constructing drivable and photorealistic 3D head avatars has become a central task in AR/XR, enabling immersive and expressive user experiences. With the emergence of high-fidelity and efficient representations such as 3D Gaussians, recent works have pushed toward ultra-detailed head avatars. Existing approaches typically fall into two categories: rule-based analytic rigging or neural network-based deformation fields. While effective in constrained settings, both approaches often fail to generalize to unseen expressions and poses, particularly in extreme reenactment scenarios. Other methods constrain Gaussians to the global texel space of 3DMMs to reduce rendering complexity. However, these texel-based avatars tend to underutilize the underlying mesh structure. They apply minimal analytic deformation and rely heavily on neural regressors and heuristic regularization in UV space, which weakens geometric consistency and limits extrapolation to complex, out-of-distribution deformations. To address these limitations, we introduce TexAvatars, a hybrid avatar representation that combines the explicit geometric grounding of analytic rigging with the spatial continuity of texel space. Our approach predicts local geometric attributes in UV space via CNNs, but drives 3D deformation through mesh-aware Jacobians, enabling smooth and semantically meaningful transitions across triangle boundaries. This hybrid design separates semantic modeling from geometric control, resulting in improved generalization, interpretability, and stability. Furthermore, TexAvatars captures fine-grained expression effects, including muscle-induced wrinkles, glabellar lines, and realistic mouth cavity geometry, with high fidelity. Our method achieves state-of-the-art performance under extreme pose and expression variations, demonstrating strong generalization in challenging head reenactment settings.
31. 【2512.21095】UniRec-0.1B: Unified Text and Formula Recognition with 0.1B Parameters
Link: https://arxiv.org/abs/2512.21095
Authors: Yongkun Du, Zhineng Chen, Yazhen Xie, Weikang Bai, Hao Feng, Wei Shi, Yuchen Su, Can Huang, Yu-Gang Jiang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: core informational components, constitute the core, core informational, informational components, Text
Comments:
Abstract:Text and formulas constitute the core informational components of many documents. Accurately and efficiently recognizing both is crucial for developing robust and generalizable document parsing systems. Recently, vision-language models (VLMs) have achieved impressive unified recognition of text and formulas. However, they are large-sized and computationally demanding, restricting their usage in many applications. In this paper, we propose UniRec-0.1B, a unified recognition model with only 0.1B parameters. It is capable of performing text and formula recognition at multiple levels, including characters, words, lines, paragraphs, and documents. To implement this task, we first establish UniRec40M, a large-scale dataset comprising 40 million text, formula, and mixed text-formula samples, enabling the training of a powerful yet lightweight model. Secondly, we identify two challenges when building such a lightweight but unified expert model. They are: structural variability across hierarchies and semantic entanglement between textual and formulaic content. To tackle these, we introduce a hierarchical supervision training that explicitly guides structural comprehension, and a semantic-decoupled tokenizer that separates text and formula representations. Finally, we develop a comprehensive evaluation benchmark covering Chinese and English documents from multiple domains and with multiple levels. Experimental results on this and public benchmarks demonstrate that UniRec-0.1B outperforms both general-purpose VLMs and leading document parsing expert models, while achieving a 2-9× speedup, validating its effectiveness and efficiency. Codebase and Dataset: this https URL.
32. 【2512.21094】T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation
Link: https://arxiv.org/abs/2512.21094
Authors: Zhe Cao, Tao Wang, Jiaming Wang, Yanghai Wang, Yuanxing Zhang, Jialu Chen, Miao Deng, Jiahao Wang, Yubin Guo, Chenxi Liao, Yize Zhang, Zhaoxiang Zhang, Jiaheng Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: synthesize temporally coherent, narrowly scoped benchmarks, evaluation remains fragmented, temporally coherent video, semantically synchronized audio
Comments:
Abstract:Text-to-Audio-Video (T2AV) generation aims to synthesize temporally coherent video and semantically synchronized audio from natural language, yet its evaluation remains fragmented, often relying on unimodal metrics or narrowly scoped benchmarks that fail to capture cross-modal alignment, instruction following, and perceptual realism under complex prompts. To address this limitation, we present T2AV-Compass, a unified benchmark for comprehensive evaluation of T2AV systems, consisting of 500 diverse and complex prompts constructed via a taxonomy-driven pipeline to ensure semantic richness and physical plausibility. In addition, T2AV-Compass introduces a dual-level evaluation framework that integrates objective signal-level metrics for video quality, audio quality, and cross-modal alignment with a subjective MLLM-as-a-Judge protocol for instruction following and realism assessment. Extensive evaluation of 11 representative T2AV systems reveals that even the strongest models fall substantially short of human-level realism and cross-modal consistency, with persistent failures in audio realism, fine-grained synchronization, instruction following, etc. These results indicate significant room for improvement in future models and highlight the value of T2AV-Compass as a challenging and diagnostic testbed for advancing text-to-audio-video generation.
33. 【2512.21083】Hierarchical Modeling Approach to Fast and Accurate Table Recognition
Link: https://arxiv.org/abs/2512.21083
Authors: Takaya Kawakatsu
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: intelligent information retrieval, information retrieval, diverse knowledge, knowledge from numerous, pressing challenge
Comments:
Abstract:The extraction and use of diverse knowledge from numerous documents is a pressing challenge in intelligent information retrieval. Documents contain elements that require different recognition methods. Table recognition typically consists of three subtasks, namely table structure, cell position and cell content recognition. Recent models have achieved excellent recognition with a combination of multi-task learning, local attention, and mutual learning. However, their effectiveness has not been fully explained, and they require a long period of time for inference. This paper presents a novel multi-task model that utilizes non-causal attention to capture the entire table structure, and a parallel inference algorithm for faster cell content inference. The superiority is demonstrated both visually and statistically on two large public datasets.
34. 【2512.21078】UniPR-3D: Towards Universal Visual Place Recognition with Visual Geometry Grounded Transformer
Link: https://arxiv.org/abs/2512.21078
Authors: Tianchen Deng, Xun Chen, Ziming Li, Hongming Shen, Danwei Wang, Javier Civera, Hesheng Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Visual Place Recognition, Visual Place, place recognition task, traditionally formulated, single-image retrieval task
Comments:
Abstract:Visual Place Recognition (VPR) has been traditionally formulated as a single-image retrieval task. Using multiple views offers clear advantages, yet this setting remains relatively underexplored and existing methods often struggle to generalize across diverse environments. In this work we introduce UniPR-3D, the first VPR architecture that effectively integrates information from multiple views. UniPR-3D builds on a VGGT backbone capable of encoding multi-view 3D representations, which we adapt by designing feature aggregators and fine-tune for the place recognition task. To construct our descriptor, we jointly leverage the 3D tokens and intermediate 2D tokens produced by VGGT. Based on their distinct characteristics, we design dedicated aggregation modules for 2D and 3D features, allowing our descriptor to capture fine-grained texture cues while also reasoning across viewpoints. To further enhance generalization, we incorporate both single- and multi-frame aggregation schemes, along with a variable-length sequence retrieval strategy. Our experiments show that UniPR-3D sets a new state of the art, outperforming both single- and multi-view baselines and highlighting the effectiveness of geometry-grounded tokens for VPR. Our code and models will be made publicly available on Github this https URL.
35. 【2512.21065】Language-Guided Grasp Detection with Coarse-to-Fine Learning for Robotic Manipulation
Link: https://arxiv.org/abs/2512.21065
Authors: Zebin Jiang, Tianle Jin, Xiangtong Yao, Alois Knoll, Hu Cao
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: fundamental challenging capabilities, semantically diverse environments, fundamental challenging, challenging capabilities, robotic manipulation
Comments: Submitted to IEEE Journal
Abstract:Grasping is one of the most fundamental challenging capabilities in robotic manipulation, especially in unstructured, cluttered, and semantically diverse environments. Recent researches have increasingly explored language-guided manipulation, where robots not only perceive the scene but also interpret task-relevant natural language instructions. However, existing language-conditioned grasping methods typically rely on shallow fusion strategies, leading to limited semantic grounding and weak alignment between linguistic intent and visual grasp this http URL this work, we propose Language-Guided Grasp Detection (LGGD) with a coarse-to-fine learning paradigm for robotic manipulation. LGGD leverages CLIP-based visual and textual embeddings within a hierarchical cross-modal fusion pipeline, progressively injecting linguistic cues into the visual feature reconstruction process. This design enables fine-grained visual-semantic alignment and improves the feasibility of the predicted grasps with respect to task instructions. In addition, we introduce a language-conditioned dynamic convolution head (LDCH) that mixes multiple convolution experts based on sentence-level features, enabling instruction-adaptive coarse mask and grasp predictions. A final refinement module further enhances grasp consistency and robustness in complex this http URL on the OCID-VLG and Grasp-Anything++ datasets show that LGGD surpasses existing language-guided grasping methods, exhibiting strong generalization to unseen objects and diverse language queries. Moreover, deployment on a real robotic platform demonstrates the practical effectiveness of our approach in executing accurate, instruction-conditioned grasp actions. The code will be released publicly upon acceptance.
36. 【2512.21064】Multimodal Skeleton-Based Action Representation Learning via Decomposition and Composition
Link: https://arxiv.org/abs/2512.21064
Authors: Hongsong Wang, Heng Fei, Bingxuan Dai, Jie Gui
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: human action understanding, computer vision, Multimodal human action, significant problem, problem in computer
Notes: Accepted by Machine Intelligence Research (Journal Impact Factor 8.7, 2024)
Abstract:Multimodal human action understanding is a significant problem in computer vision, with the central challenge being the effective utilization of the complementarity among diverse modalities while maintaining model efficiency. However, most existing methods rely on simple late fusion to enhance performance, which results in substantial computational overhead. Although early fusion with a shared backbone for all modalities is efficient, it struggles to achieve excellent performance. To address the dilemma of balancing efficiency and effectiveness, we introduce a self-supervised multimodal skeleton-based action representation learning framework, named Decomposition and Composition. The Decomposition strategy meticulously decomposes the fused multimodal features into distinct unimodal features, subsequently aligning them with their respective ground truth unimodal counterparts. On the other hand, the Composition strategy integrates multiple unimodal features, leveraging them as self-supervised guidance to enhance the learning of multimodal representations. Extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD II datasets demonstrate that the proposed method strikes an excellent balance between computational cost and model performance.
37. 【2512.21058】Beyond Pixel Simulation: Pathology Image Generation via Diagnostic Semantic Tokens and Prototype Control
Link: https://arxiv.org/abs/2512.21058
Authors: Minghao Han, YiChen Liu, Yizhou Liu, Zizhi Chen, Jingqun Tang, Xuecheng Wu, Dingkang Yang, Lihua Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: exhibit diagnostic-level competence, largely simulate pixels, generative models largely, models largely simulate, advanced understanding models
Notes: 32 pages, 17 figures, and 6 tables
Abstract:In computational pathology, understanding and generation have evolved along disparate paths: advanced understanding models already exhibit diagnostic-level competence, whereas generative models largely simulate pixels. Progress remains hindered by three coupled factors: the scarcity of large, high-quality image-text corpora; the lack of precise, fine-grained semantic control, which forces reliance on non-semantic cues; and terminological heterogeneity, where diverse phrasings for the same diagnostic concept impede reliable text conditioning. We introduce UniPath, a semantics-driven pathology image generation framework that leverages mature diagnostic understanding to enable controllable generation. UniPath implements Multi-Stream Control: a Raw-Text stream; a High-Level Semantics stream that uses learnable queries to a frozen pathology MLLM to distill paraphrase-robust Diagnostic Semantic Tokens and to expand prompts into diagnosis-aware attribute bundles; and a Prototype stream that affords component-level morphological control via a prototype bank. On the data front, we curate a 2.65M image-text corpus and a finely annotated, high-quality 68K subset to alleviate data scarcity. For a comprehensive assessment, we establish a four-tier evaluation hierarchy tailored to pathology. Extensive experiments demonstrate UniPath's SOTA performance, including a Patho-FID of 80.9 (51% better than the second-best) and fine-grained semantic control achieving 98.7% of the real-image. The meticulously curated datasets, complete source code, and pre-trained model weights developed in this study will be made openly accessible to the public.
38. 【2512.21054】DexAvatar: 3D Sign Language Reconstruction with Hand and Body Pose Priors
Link: https://arxiv.org/abs/2512.21054
Authors: Kaustubh Kundu, Hrishav Bakul Barua, Lucy Robertson-Bell, Zhixi Cai, Kalin Stefanov
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Keywords: data-driven generative methods, require vast amounts, acceptable generation quality, human pose data, sign language generation
Notes: Accepted in WACV 2026
Abstract:The trend in sign language generation is centered around data-driven generative methods that require vast amounts of precise 2D and 3D human pose data to achieve an acceptable generation quality. However, currently, most sign language datasets are video-based and limited to automatically reconstructed 2D human poses (i.e., keypoints) and lack accurate 3D information. Furthermore, existing state-of-the-art for automatic 3D human pose estimation from sign language videos is prone to self-occlusion, noise, and motion blur effects, resulting in poor reconstruction quality. In response to this, we introduce DexAvatar, a novel framework to reconstruct bio-mechanically accurate fine-grained hand articulations and body movements from in-the-wild monocular sign language videos, guided by learned 3D hand and body priors. DexAvatar achieves strong performance in the SGNify motion capture dataset, the only benchmark available for this task, reaching an improvement of 35.11% in the estimation of body and hand poses compared to the state-of-the-art. The official website of this work is: this https URL.
39. 【2512.21053】Optical Flow-Guided 6DoF Object Pose Tracking with an Event Camera
Link: https://arxiv.org/abs/2512.21053
Authors: Zibin Liu, Banglei Guan, Yang Shang, Shunkun Liang, Zhenbao Yu, Qifeng Yu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: attracting ever-growing attention, technologies in multimedia, attracting ever-growing, recent years, pivotal technologies
Notes: 9 pages, 5 figures. In Proceedings of the 32nd ACM International Conference on Multimedia (MM '24)
Abstract:Object pose tracking is one of the pivotal technologies in multimedia, attracting ever-growing attention in recent years. Existing methods employing traditional cameras encounter numerous challenges such as motion blur, sensor noise, partial occlusion, and changing lighting conditions. The emerging bio-inspired sensors, particularly event cameras, possess advantages such as high dynamic range and low latency, which hold the potential to address the aforementioned challenges. In this work, we present an optical flow-guided 6DoF object pose tracking method with an event camera. A 2D-3D hybrid feature extraction strategy is firstly utilized to detect corners and edges from events and object models, which characterizes object motion precisely. Then, we search for the optical flow of corners by maximizing the event-associated probability within a spatio-temporal window, and establish the correlation between corners and edges guided by optical flow. Furthermore, by minimizing the distances between corners and edges, the 6DoF object pose is iteratively optimized to achieve continuous pose tracking. Experimental results of both simulated and real events demonstrate that our methods outperform event-based state-of-the-art methods in terms of both accuracy and robustness.
40. 【2512.21050】Matrix Completion Via Reweighted Logarithmic Norm Minimization
Link: https://arxiv.org/abs/2512.21050
Authors: Zhijie Wang, Liangtian He, Qinghua Zhang, Jifei Miao, Liang-Jian Deng, Jun Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Low-rank matrix completion, demonstrated remarkable success, Low-rank matrix, matrix completion, range of applications
Notes:
Abstract:Low-rank matrix completion (LRMC) has demonstrated remarkable success in a wide range of applications. To address the NP-hard nature of the rank minimization problem, the nuclear norm is commonly used as a convex and computationally tractable surrogate for the rank function. However, this approach often yields suboptimal solutions due to the excessive shrinkage of singular values. In this letter, we propose a novel reweighted logarithmic norm as a more effective nonconvex surrogate, which provides a closer approximation than many existing alternatives. We efficiently solve the resulting optimization problem by employing the alternating direction method of multipliers (ADMM). Experimental results on image inpainting demonstrate that the proposed method achieves superior performance compared to state-of-the-art LRMC approaches, both in terms of visual quality and quantitative metrics.
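The "excessive shrinkage of singular values" mentioned in this abstract can be illustrated on a plain list of singular values. The sketch below is a generic illustration and not the paper's algorithm: it compares uniform soft-thresholding (the usual shrinkage step associated with the nuclear norm) against a weighted variant with weights w_i = 1 / (s_i + eps), which is the derivative of log(s_i + eps). The function names and the values of `tau` and `eps` are assumptions chosen for the example.

```python
def soft_threshold(singular_values, tau):
    """Uniform shrinkage: every singular value is reduced by tau (floored at 0)."""
    return [max(s - tau, 0.0) for s in singular_values]


def reweighted_threshold(singular_values, tau, eps=1e-3):
    """Weighted shrinkage with w_i = 1 / (s_i + eps) (hypothetical choice of weights).

    Large singular values get small weights and are barely shrunk;
    small singular values get large weights and are pushed to zero.
    """
    return [max(s - tau / (s + eps), 0.0) for s in singular_values]


sigma = [10.0, 5.0, 0.5, 0.1]
uniform = soft_threshold(sigma, tau=1.0)
reweighted = reweighted_threshold(sigma, tau=1.0)
print(uniform)
print(reweighted)
```

With these numbers, uniform shrinkage subtracts a full unit from the two dominant singular values, while the weighted variant leaves them almost untouched and still zeroes out the two small ones.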
41. 【2512.21040】A Large-Depth-Range Layer-Based Hologram Dataset for Machine Learning-Based 3D Computer-Generated Holography
Link: https://arxiv.org/abs/2512.21040
Authors: Jaehong Lee, You Chan No, YoungWoo Kim, Duksu Kim
Categories: Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
Keywords: Machine learning-based computer-generated, learning-based computer-generated holography, Machine learning-based, computer-generated holography, availability of high-quality
Notes:
Abstract:Machine learning-based computer-generated holography (ML-CGH) has advanced rapidly in recent years, yet progress is constrained by the limited availability of high-quality, large-scale hologram datasets. To address this, we present KOREATECH-CGH, a publicly available dataset comprising 6,000 pairs of RGB-D images and complex holograms across resolutions ranging from 256*256 to 2048*2048, with depth ranges extending to the theoretical limits of the angular spectrum method for wide 3D scene coverage. To improve hologram quality at large depth ranges, we introduce amplitude projection, a post-processing technique that replaces amplitude components of hologram wavefields at each depth layer while preserving phase. This approach enhances reconstruction fidelity, achieving 27.01 dB PSNR and 0.87 SSIM, surpassing a recent optimized silhouette-masking layer-based method by 2.03 dB and 0.04 SSIM, respectively. We further validate the utility of KOREATECH-CGH through experiments on hologram generation and super-resolution using state-of-the-art ML models, confirming its applicability for training and evaluating next-generation ML-CGH systems.
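The core operation described here, replacing the amplitude of a complex wavefield while preserving its phase, can be sketched in a few lines. This is only an illustration of that single step under stated assumptions: the abstract does not say how the target amplitude for each depth layer is obtained, so `target_amplitude` below is a hypothetical input, and `project_amplitude` is a name invented for this example.

```python
import cmath


def project_amplitude(field, target_amplitude):
    """Replace each sample's magnitude with the target value, keeping its phase.

    field: 2D list of complex samples (one depth layer of a wavefield).
    target_amplitude: 2D list of non-negative floats (hypothetical target).
    """
    return [
        [cmath.rect(a, cmath.phase(z)) for z, a in zip(row_f, row_a)]
        for row_f, row_a in zip(field, target_amplitude)
    ]


field = [[1 + 1j, -2j], [3 + 0j, -1 - 1j]]
target = [[1.0, 0.5], [2.0, 0.25]]
projected = project_amplitude(field, target)
```

After the call, every sample of `projected` has the magnitude given in `target` and the same phase angle as the corresponding sample in `field`.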
42. 【2512.21038】Next-Scale Prediction: A Self-Supervised Approach for Real-World Image Denoising
Link: https://arxiv.org/abs/2512.21038
Authors: Yiwen Shan, Haiyu Zhao, Peng Hu, Xi Peng, Yuanbiao Gou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: decorrelating spatially structured, spatially structured noise, preserving high-frequency details, fundamental challenge, remains a fundamental
Notes:
Abstract:Self-supervised real-world image denoising remains a fundamental challenge, arising from the antagonistic trade-off between decorrelating spatially structured noise and preserving high-frequency details. Existing blind-spot network (BSN) methods rely on pixel-shuffle downsampling (PD) to decorrelate noise, but aggressive downsampling fragments fine structures, while milder downsampling fails to remove correlated noise. To address this, we introduce Next-Scale Prediction (NSP), a novel self-supervised paradigm that decouples noise decorrelation from detail preservation. NSP constructs cross-scale training pairs, where BSN takes low-resolution, fully decorrelated sub-images as input to predict high-resolution targets that retain fine details. As a by-product, NSP naturally supports super-resolution of noisy images without retraining or modification. Extensive experiments demonstrate that NSP achieves state-of-the-art self-supervised denoising performance on real-world benchmarks, significantly alleviating the long-standing conflict between noise decorrelation and detail preservation.
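Pixel-shuffle downsampling (PD), which this abstract builds on, splits an image into stride² sub-images by strided sampling. The sketch below shows that general operation on a tiny single-channel image; it is not the NSP method itself, and the function name and list-of-lists image format are assumptions made for the example.

```python
def pixel_shuffle_downsample(image, stride):
    """Split an H x W image into stride**2 sub-images by strided sampling.

    Pixels that are adjacent in a sub-image were `stride` pixels apart in the
    original, which is why a larger stride weakens spatially correlated noise
    but also breaks up fine structures.
    """
    height, width = len(image), len(image[0])
    if height % stride or width % stride:
        raise ValueError("image size must be divisible by stride")
    return [
        [row[dx::stride] for row in image[dy::stride]]
        for dy in range(stride)
        for dx in range(stride)
    ]


# A 4 x 4 image whose pixel values are simply their raster index.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
subs = pixel_shuffle_downsample(image, stride=2)
for sub in subs:
    print(sub)
```

Each of the four 2 x 2 sub-images keeps one pixel out of every 2 x 2 block, so no two pixels within a sub-image were neighbors in the original image.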
43. 【2512.21032】Multi-Attribute guided Thermal Face Image Translation based on Latent Diffusion Model
Link: https://arxiv.org/abs/2512.21032
Authors: Mingshu Cai, Osamu Yoshie, Yuya Ieiri
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Modern surveillance systems, surveillance systems increasingly, systems increasingly rely, deep neural networks, Modern surveillance
Notes: Accepted by 2025 IEEE International Joint Conference on Biometrics (IJCB 2025)
Abstract:Modern surveillance systems increasingly rely on multi-wavelength sensors and deep neural networks to recognize faces in infrared images captured at night. However, most facial recognition models are trained on visible light datasets, leading to substantial performance degradation on infrared inputs due to significant domain shifts. Early feature-based methods for infrared face recognition proved ineffective, prompting researchers to adopt generative approaches that convert infrared images into visible light images for improved recognition. This paradigm, known as Heterogeneous Face Recognition (HFR), faces challenges such as model and modality discrepancies, leading to distortion and feature loss in generated images. To address these limitations, this paper introduces a novel latent diffusion-based model designed to generate high-quality visible face images from thermal inputs while preserving critical identity features. A multi-attribute classifier is incorporated to extract key facial attributes from visible images, mitigating feature loss during infrared-to-visible image restoration. Additionally, we propose the Self-attn Mamba module, which enhances global modeling of cross-modal features and significantly improves inference speed. Experimental results on two benchmark datasets demonstrate the superiority of our approach, achieving state-of-the-art performance in both image quality and identity preservation.
44. 【2512.21019】Efficient and Robust Video Defense Framework against 3D-field Personalized Talking Face
Link: https://arxiv.org/abs/2512.21019
Authors: Rui-qing Sun, Xingshan Yao, Tian Lan, Hui-Yang Zhao, Jia-Ling Shi, Chen-Hao Cui, Zhijing Wu, Chen Yang, Xian-Ling Mao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Talking Face Generation, video-referenced Talking Face, Face Generation, Talking Face, video-referenced Talking
Notes:
Abstract:State-of-the-art 3D-field video-referenced Talking Face Generation (TFG) methods synthesize high-fidelity personalized talking-face videos in real time by modeling 3D geometry and appearance from reference portrait video. This capability raises significant privacy concerns regarding malicious misuse of personal portraits. However, no efficient defense framework exists to protect such videos against 3D-field TFG methods. While image-based defenses could apply per-frame 2D perturbations, they incur prohibitive computational costs and severe video quality degradation, and they fail to disrupt 3D information for video protection. To address this, we propose a novel and efficient video defense framework against 3D-field TFG methods, which protects portrait video by perturbing the 3D information acquisition process while maintaining high-fidelity video quality. Specifically, our method introduces: (1) a similarity-guided parameter sharing mechanism for computational efficiency, and (2) a multi-scale dual-domain attention module to jointly optimize spatial-frequency perturbations. Extensive experiments demonstrate that our proposed framework exhibits strong defense capability and achieves a 47x acceleration over the fastest baseline while maintaining high fidelity. Moreover, it remains robust against scaling operations and state-of-the-art purification attacks, and the effectiveness of our design choices is further validated through ablation studies. Our project is available at this https URL.
45. 【2512.21015】FluencyVE: Marrying Temporal-Aware Mamba with Bypass Attention for Video Editing
Link: https://arxiv.org/abs/2512.21015
Authors: Mingshu Cai, Yixuan Li, Osamu Yoshie, Yuya Ieiri
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: achieved unprecedented success, achieved unprecedented, image generation, Large-scale, video editing
Notes: Accepted by IEEE Transactions on Multimedia (TMM)
Abstract:Large-scale text-to-image diffusion models have achieved unprecedented success in image generation and editing. However, extending this success to video editing remains challenging. Recent video editing efforts have adapted pretrained text-to-image models by adding temporal attention mechanisms to handle video tasks. Unfortunately, these methods continue to suffer from temporal inconsistency issues and high computational overheads. In this study, we propose FluencyVE, which is a simple yet effective one-shot video editing approach. FluencyVE integrates the linear time-series module, Mamba, into a video editing model based on pretrained Stable Diffusion models, replacing the temporal attention layer. This enables global frame-level attention while reducing the computational costs. In addition, we employ low-rank approximation matrices to replace the query and key weight matrices in the causal attention, and use a weighted averaging technique during training to update the attention scores. This approach significantly preserves the generative power of the text-to-image model while effectively reducing the computational burden. Experiments and analyses demonstrate promising results in editing various attributes, subjects, and locations in real-world videos.
46. 【2512.21011】Granular-ball Guided Masking: Structure-aware Data Augmentation
Link: https://arxiv.org/abs/2512.21011
Authors: Shuyin Xia, Fan Chen, Dawei Dai, Meng Yang, Junwei Han, Xinbo Gao, Guoyin Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Deep learning models, achieved remarkable success, large-scale labeled data, Deep learning, distributions shift
Notes:
Abstract:Deep learning models have achieved remarkable success in computer vision, but they still rely heavily on large-scale labeled data and tend to overfit when data are limited or distributions shift. Data augmentation, particularly mask-based information dropping, can enhance robustness by forcing models to explore complementary cues; however, existing approaches often lack structural awareness and may discard essential semantics. We propose Granular-ball Guided Masking (GBGM), a structure-aware augmentation strategy guided by Granular-ball Computing (GBC). GBGM adaptively preserves semantically rich, structurally important regions while suppressing redundant areas through a coarse-to-fine hierarchical masking process, producing augmentations that are both representative and discriminative. Extensive experiments on multiple benchmarks demonstrate consistent improvements in classification accuracy and masked image reconstruction, confirming the effectiveness and broad applicability of the proposed method. Simple and model-agnostic, it integrates seamlessly into CNNs and Vision Transformers and provides a new paradigm for structure-aware data augmentation.
47. 【2512.21004】Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations
Link: https://arxiv.org/abs/2512.21004
Authors: Jinghan Li, Yang Jin, Hao Jiang, Yadong Mu, Yang Song, Kun Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: significantly improved performance, Recent advances, pretraining general foundation, diverse downstream tasks, general foundation models
Notes:
Abstract:Recent advances in pretraining general foundation models have significantly improved performance across diverse downstream tasks. While autoregressive (AR) generative models like GPT have revolutionized NLP, most visual generative pretraining methods still rely on BERT-style masked modeling, which often disregards the temporal information essential for video analysis. The few existing autoregressive visual pretraining methods suffer from issues such as inaccurate semantic localization and poor generation quality, leading to poor semantics. In this work, we propose NExT-Vid, a novel autoregressive visual generative pretraining framework that utilizes masked next-frame prediction to jointly model images and videos. NExT-Vid introduces a context-isolated autoregressive predictor to decouple semantic representation from target decoding, and a conditioned flow-matching decoder to enhance generation quality and diversity. Through context-isolated flow-matching pretraining, our approach achieves strong representations. Extensive experiments on large-scale pretrained models demonstrate that our proposed method consistently outperforms previous generative pretraining methods for visual representation learning via attentive probing in downstream classification.
48. 【2512.21003】MVInverse: Feed-forward Multi-view Inverse Rendering in Seconds
Link: https://arxiv.org/abs/2512.21003
Authors: Xiangzuo Wu, Chengwei Ren, Jun Zhou, Xiu Li, Yuan Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: inverse rendering aims, Multi-view inverse rendering, recover geometry, multiple viewpoints, aims to recover
Notes: 21 pages, 17 figures, 5 tables
Abstract:Multi-view inverse rendering aims to recover geometry, materials, and illumination consistently across multiple viewpoints. When applied to multi-view images, existing single-view approaches often ignore cross-view relationships, leading to inconsistent results. In contrast, multi-view optimization methods rely on slow differentiable rendering and per-scene refinement, making them computationally expensive and hard to scale. To address these limitations, we introduce a feed-forward multi-view inverse rendering framework that directly predicts spatially varying albedo, metallic, roughness, diffuse shading, and surface normals from sequences of RGB images. By alternating attention across views, our model captures both intra-view long-range lighting interactions and inter-view material consistency, enabling coherent scene-level reasoning within a single forward pass. Due to the scarcity of real-world training data, models trained on existing synthetic datasets often struggle to generalize to real-world scenes. To overcome this limitation, we propose a consistency-based finetuning strategy that leverages unlabeled real-world videos to enhance both multi-view coherence and robustness under in-the-wild conditions. Extensive experiments on benchmark datasets demonstrate that our method achieves state-of-the-art performance in terms of multi-view consistency, material and normal estimation quality, and generalization to real-world imagery.
49. 【2512.20988】PUFM++: Point Cloud Upsampling via Enhanced Flow Matching
Link: https://arxiv.org/abs/2512.20988
Authors: Zhi-Song Liu, Chenhang He, Roland Maier, Andreas Rupp
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: demonstrated strong promise, Recent advances, high-quality point cloud, point cloud upsampling, advances in generative
Notes: 21 pages, 15 figures
Abstract:Recent advances in generative modeling have demonstrated strong promise for high-quality point cloud upsampling. In this work, we present PUFM++, an enhanced flow-matching framework for reconstructing dense and accurate point clouds from sparse, noisy, and partial observations. PUFM++ improves flow matching along three key axes: (i) geometric fidelity, (ii) robustness to imperfect input, and (iii) consistency with downstream surface-based tasks. We introduce a two-stage flow-matching strategy that first learns a direct, straight-path flow from sparse inputs to dense targets, and then refines it using noise-perturbed samples to approximate the terminal marginal distribution better. To accelerate and stabilize inference, we propose a data-driven adaptive time scheduler that improves sampling efficiency based on interpolation behavior. We further impose on-manifold constraints during sampling to ensure that generated points remain aligned with the underlying surface. Finally, we incorporate a recurrent interface network (RIN) to strengthen hierarchical feature interactions and boost reconstruction quality. Extensive experiments on synthetic benchmarks and real-world scans show that PUFM++ sets a new state of the art in point cloud upsampling, delivering superior visual fidelity and quantitative accuracy across a wide range of tasks. Code and pretrained models are publicly available at this https URL.
50. 【2512.20980】X-ray Insights Unleashed: Pioneering the Enhancement of Multi-Label Long-Tail Data
Link: https://arxiv.org/abs/2512.20980
Authors: Xinquan Yang, Jinheng Xie, Yawen Huang, Yuexiang Li, Huimin Huang, Hao Zheng, Xian Wu, Yefeng Zheng, Linlin Shen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Long-tailed pulmonary anomalies, formidable diagnostic challenges, Long-tailed pulmonary, chest radiography present, radiography present formidable
Notes:
Abstract:Long-tailed pulmonary anomalies in chest radiography present formidable diagnostic challenges. Despite the recent strides in diffusion-based methods for enhancing the representation of tailed lesions, the paucity of rare lesion exemplars curtails the generative capabilities of these approaches, thereby leaving the diagnostic precision less than optimal. In this paper, we propose a novel data synthesis pipeline designed to augment tail lesions utilizing a copious supply of conventional normal X-rays. Specifically, a sufficient quantity of normal samples is amassed to train a diffusion model capable of generating normal X-ray images. This pre-trained diffusion model is subsequently utilized to inpaint the head lesions present in the diseased X-rays, thereby preserving the tail classes as augmented training data. Additionally, we propose the integration of a Large Language Model Knowledge Guidance (LKG) module alongside a Progressive Incremental Learning (PIL) strategy to stabilize the inpainting fine-tuning process. Comprehensive evaluations conducted on the public lung datasets MIMIC and CheXpert demonstrate that the proposed method sets a new benchmark in performance.
51. 【2512.20976】XGrid-Mapping: Explicit Implicit Hybrid Grid Submaps for Efficient Incremental Neural LiDAR Mapping
Link: https://arxiv.org/abs/2512.20976
Authors: Zeqing Song, Zhongmiao Yan, Junyuan Deng, Songpengcheng Xia, Xiang Mu, Jingyi Xu, Qi Wu, Ling Pei
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: reliable autonomous systems, Large-scale incremental mapping, underpins incremental environmental, incremental environmental understanding, Large-scale incremental
Notes:
Abstract:Large-scale incremental mapping is fundamental to the development of robust and reliable autonomous systems, as it underpins incremental environmental understanding with sequential inputs for navigation and decision-making. LiDAR is widely used for this purpose due to its accuracy and robustness. Recently, neural LiDAR mapping has shown impressive performance; however, most approaches rely on dense implicit representations and underutilize geometric structure, while existing voxel-guided methods struggle to achieve real-time performance. To address these challenges, we propose XGrid-Mapping, a hybrid grid framework that jointly exploits explicit and implicit representations for efficient neural LiDAR mapping. Specifically, the strategy combines a sparse grid, providing geometric priors and structural guidance, with an implicit dense grid that enriches scene representation. By coupling the VDB structure with a submap-based organization, the framework reduces computational load and enables efficient incremental mapping on a large scale. To mitigate discontinuities across submaps, we introduce a distillation-based overlap alignment strategy, in which preceding submaps supervise subsequent ones to ensure consistency in overlapping regions. To further enhance robustness and sampling efficiency, we incorporate a dynamic removal module. Extensive experiments show that our approach delivers superior mapping quality while overcoming the efficiency limitations of voxel-guided methods, thereby outperforming existing state-of-the-art mapping methods.
52. 【2512.20975】SPOT!: Map-Guided LLM Agent for Unsupervised Multi-CCTV Dynamic Object Tracking
Link: https://arxiv.org/abs/2512.20975
Authors: Yujin Noh, Inho Jake Park, Chigon Hwang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: face structural limitations, multiple camera environments, systems face structural, CCTV-based vehicle tracking, tracking systems face
Notes: 33 pages, 27 figures
Abstract:CCTV-based vehicle tracking systems face structural limitations in continuously connecting the trajectories of the same vehicle across multiple camera environments. In particular, blind spots occur due to the intervals between CCTVs and limited Fields of View (FOV), which leads to object ID switching and trajectory loss, thereby reducing the reliability of real-time path prediction. This paper proposes SPOT (Spatial Prediction Over Trajectories), a map-guided LLM agent capable of tracking vehicles even in blind spots of multi-CCTV environments without prior training. The proposed method represents road structures (Waypoints) and CCTV placement information as documents based on 2D spatial coordinates and organizes them through chunking techniques to enable real-time querying and inference. Furthermore, it transforms the vehicle's position into the actual world coordinate system using the relative position and FOV information of objects observed in CCTV images. By combining map spatial information with the vehicle's moving direction, speed, and driving patterns, a beam search is performed at the intersection level to derive candidate CCTV locations where the vehicle is most likely to enter after the blind spot. Experimental results based on the CARLA simulator in a virtual city environment confirmed that the proposed method accurately predicts the next appearing CCTV even in blind spot sections, maintaining continuous vehicle trajectories more effectively than existing techniques.
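The intersection-level beam search mentioned in this abstract can be pictured with a generic beam search over a small road graph. The sketch below is not the SPOT implementation: the graph, node names, and transition probabilities are hypothetical, and in the paper's setting such scores would presumably come from map information combined with the vehicle's direction, speed, and driving patterns.

```python
import math


def beam_search(graph, start, goals, beam_width=2, max_steps=5):
    """Rank goal nodes (e.g. cameras) reachable from `start` by path probability.

    graph: dict mapping node -> {next_node: transition probability}.
    Returns a list of (log_probability, path) sorted from most to least likely.
    """
    beams = [(0.0, [start])]
    finished = []
    for _ in range(max_steps):
        candidates = []
        for logp, path in beams:
            for nxt, prob in graph.get(path[-1], {}).items():
                candidates.append((logp + math.log(prob), path + [nxt]))
        # Keep only the best `beam_width` partial paths at each step.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:beam_width]:
            (finished if cand[1][-1] in goals else beams).append(cand)
        if not beams:
            break
    return sorted(finished, key=lambda c: c[0], reverse=True)


# Hypothetical road graph: intersections A, B, C leading to three cameras.
roads = {
    "A": {"B": 0.7, "C": 0.3},
    "B": {"cam1": 0.6, "cam2": 0.4},
    "C": {"cam3": 1.0},
}
ranked = beam_search(roads, "A", {"cam1", "cam2", "cam3"})
for logp, path in ranked:
    print(round(math.exp(logp), 2), path)
```

With a beam width of 2, the path to `cam2` is pruned at the second step, which illustrates the trade-off beam search makes between search cost and coverage of candidate cameras.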
53. 【2512.20963】Generalization of Diffusion Models Arises with a Balanced Representation Space
Link: https://arxiv.org/abs/2512.20963
Authors: Zekai Zhang, Xiao Li, Xiang Li, Lianghe Shi, Meng Wu, Molei Tao, Qing Qu
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: risk memorizing training, Diffusion models excel, Diffusion models, generating high-quality, excel at generating
Notes: 40 pages, 19 figures. The first two authors contributed equally
Abstract:Diffusion models excel at generating high-quality, diverse samples, yet they risk memorizing training data when overfit to the training objective. We analyze the distinctions between memorization and generalization in diffusion models through the lens of representation learning. By investigating a two-layer ReLU denoising autoencoder (DAE), we prove that (i) memorization corresponds to the model storing raw training samples in the learned weights for encoding and decoding, yielding localized "spiky" representations, whereas (ii) generalization arises when the model captures local data statistics, producing "balanced" representations. Furthermore, we validate these theoretical findings on real-world unconditional and text-to-image diffusion models, demonstrating that the same representation structures emerge in deep generative models with significant practical implications. Building on these insights, we propose a representation-based method for detecting memorization and a training-free editing technique that allows precise control via representation steering. Together, our results highlight that learning good representations is central to novel and meaningful generative modeling.
54. 【2512.20937】Beyond Artifacts: Real-Centric Envelope Modeling for Reliable AI-Generated Image Detection
Link: https://arxiv.org/abs/2512.20937
Authors: Ruiqi Liu, Yi Han, Zhengbo Zhang, Liwei Yao, Zhiyuan Yan, Jialiang Shen, ZhiJin Chen, Boyi Sun, Lubin Weng, Jing Dong, Yan Wang, Shu Wu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: rapid progress, models has intensified, Real-centric Envelope Modeling, generative models, real-world conditions
Notes:
Abstract:The rapid progress of generative models has intensified the need for reliable and robust detection under real-world conditions. However, existing detectors often overfit to generator-specific artifacts and remain highly sensitive to real-world degradations. As generative architectures evolve and images undergo multi-round cross-platform sharing and post-processing (chain degradations), these artifact cues become obsolete and harder to detect. To address this, we propose Real-centric Envelope Modeling (REM), a new paradigm that shifts detection from learning generator artifacts to modeling the robust distribution of real images. REM introduces feature-level perturbations in self-reconstruction to generate near-real samples, and employs an envelope estimator with cross-domain consistency to learn a boundary enclosing the real image manifold. We further build RealChain, a comprehensive benchmark covering both open-source and commercial generators with simulated real-world degradation. Across eight benchmark evaluations, REM achieves an average improvement of 7.5% over state-of-the-art methods, and notably maintains exceptional generalization on the severely degraded RealChain benchmark, establishing a solid foundation for synthetic image detection under real-world conditions. The code and the RealChain benchmark will be made publicly available upon acceptance of the paper.
55. 【2512.20936】Reasoning-Driven Amodal Completion: Collaborative Agents and Perceptual Evaluation
链接:https://arxiv.org/abs/2512.20936
作者:Hongxing Fan,Shuyu Zhao,Jiayang Ao,Lu Sheng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:faces significant challenges, invisible object parts, inferring invisible object, object parts, faces significant
备注:
点击查看摘要
Abstract:Amodal completion, the task of inferring invisible object parts, faces significant challenges in maintaining semantic consistency and structural integrity. Prior progressive approaches are inherently limited by inference instability and error accumulation. To tackle these limitations, we present a Collaborative Multi-Agent Reasoning Framework that explicitly decouples Semantic Planning from Visual Synthesis. By employing specialized agents for upfront reasoning, our method generates a structured, explicit plan before pixel generation, enabling visually and semantically coherent single-pass synthesis. We integrate this framework with two critical mechanisms: (1) a self-correcting Verification Agent that employs Chain-of-Thought reasoning to rectify visible region segmentation and identify residual occluders strictly within the Semantic Planning phase, and (2) a Diverse Hypothesis Generator that addresses the ambiguity of invisible regions by offering diverse, plausible semantic interpretations, surpassing the limited pixel-level variations of standard random seed sampling. Furthermore, addressing the limitations of traditional metrics in assessing inferred invisible content, we introduce the MAC-Score (MLLM Amodal Completion Score), a novel human-aligned evaluation metric. Validated against human judgment and ground truth, these metrics establish a robust standard for assessing structural completeness and semantic consistency with visible context. Extensive experiments demonstrate that our framework significantly outperforms state-of-the-art methods across multiple datasets. Our project is available at: this https URL.
56. 【2512.20934】Transductive Visual Programming: Evolving Tool Libraries from Experience for Spatial Reasoning
链接:https://arxiv.org/abs/2512.20934
作者:Shengguang Wu,Xiaohan Wang,Yuhui Zhang,Hao Zhu,Serena Yeung-Levy
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
关键词:scenes requires precise, challenge vision-language models, requires precise geometric, precise geometric calculations, Visual programming
备注: Project Website: [this https URL](https://transductive-visualprogram.github.io/)
点击查看摘要
Abstract:Spatial reasoning in 3D scenes requires precise geometric calculations that challenge vision-language models. Visual programming addresses this by decomposing problems into steps calling specialized tools, yet existing methods rely on either fixed toolsets or speculative tool induction before solving problems, resulting in suboptimal programs and poor utilization of induced tools. We present Transductive Visual Programming (TVP), a novel framework that builds new tools from its own experience rather than speculation. TVP first solves problems using basic tools while accumulating experiential solutions into an Example Library, then abstracts recurring patterns from these programs into reusable higher-level tools for an evolving Tool Library. This allows TVP to tackle new problems with increasingly powerful tools learned from experience. On Omni3D-Bench, TVP achieves state-of-the-art performance, outperforming GPT-4o by 22% and the previous best visual programming system by 11%. Our transductively learned tools are used 5x more frequently as core program dependency than inductively created ones, demonstrating more effective tool discovery and reuse. The evolved tools also show strong generalization to unseen spatial tasks, achieving superior performance on benchmarks from SpatialScore-Hard collection without any testset-specific modification. Our work establishes experience-driven transductive tool creation as a powerful paradigm for building self-evolving visual programming agents that effectively tackle challenging spatial reasoning tasks. We release our code at this https URL.
57. 【2512.20927】Quantile Rendering: Efficiently Embedding High-dimensional Feature on 3D Gaussian Splatting
链接:https://arxiv.org/abs/2512.20927
作者:Yoonwoo Jeong,Cheng Sun,Frank Wang,Minsu Cho,Jaesung Choe
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:successfully extended Open-vocabulary, Recent advancements, extended Open-vocabulary segmentation, domain by leveraging, OVS
备注: Will be updated
点击查看摘要
Abstract:Recent advancements in computer vision have successfully extended Open-vocabulary segmentation (OVS) to the 3D domain by leveraging 3D Gaussian Splatting (3D-GS). Despite this progress, efficiently rendering the high-dimensional features required for open-vocabulary queries poses a significant challenge. Existing methods employ codebooks or feature compression, causing information loss, thereby degrading segmentation quality. To address this limitation, we introduce Quantile Rendering (Q-Render), a novel rendering strategy for 3D Gaussians that efficiently handles high-dimensional features while maintaining high fidelity. Unlike conventional volume rendering, which densely samples all 3D Gaussians intersecting each ray, Q-Render sparsely samples only those with dominant influence along the ray. By integrating Q-Render into a generalizable 3D neural network, we also propose Gaussian Splatting Network (GS-Net), which predicts Gaussian features in a generalizable manner. Extensive experiments on ScanNet and LeRF demonstrate that our framework outperforms state-of-the-art methods, while enabling real-time rendering with an approximately 43.7x speedup on 512-D feature maps. Code will be made publicly available.
58. 【2512.20921】Self-supervised Multiplex Consensus Mamba for General Image Fusion
链接:https://arxiv.org/abs/2512.20921
作者:Yingying Wang,Rongjin Zhuang,Hui Zheng,Xuanhua He,Ke Cao,Xiaotong Tu,Xinghao Ding
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:generate high-quality fused, high-quality fused images, Image fusion, semantic segmentation, general image fusion
备注: Accepted by AAAI 2026, 9 pages, 4 figures
点击查看摘要
Abstract:Image fusion integrates complementary information from different modalities to generate high-quality fused images, thereby enhancing downstream tasks such as object detection and semantic segmentation. Unlike task-specific techniques that primarily focus on consolidating inter-modal information, general image fusion needs to address a wide range of tasks while improving performance without increasing complexity. To achieve this, we propose SMC-Mamba, a Self-supervised Multiplex Consensus Mamba framework for general image fusion. Specifically, the Modality-Agnostic Feature Enhancement (MAFE) module preserves fine details through adaptive gating and enhances global representations via spatial-channel and frequency-rotational scanning. The Multiplex Consensus Cross-modal Mamba (MCCM) module enables dynamic collaboration among experts, reaching a consensus to efficiently integrate complementary information from multiple modalities. The cross-modal scanning within MCCM further strengthens feature interactions across modalities, facilitating seamless integration of critical information from both sources. Additionally, we introduce a Bi-level Self-supervised Contrastive Learning Loss (BSCL), which preserves high-frequency information without increasing computational overhead while simultaneously boosting performance in downstream tasks. Extensive experiments demonstrate that our approach outperforms state-of-the-art (SOTA) image fusion algorithms in tasks such as infrared-visible, medical, multi-focus, and multi-exposure fusion, as well as downstream visual tasks.
59. 【2512.20907】PanoGrounder: Bridging 2D and 3D with Panoramic Scene Representations for VLM-based 3D Visual Grounding
链接:https://arxiv.org/abs/2512.20907
作者:Seongmin Jung,Seongho Choi,Gunwoo Jeon,Minsu Cho,Jongwoo Lim
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Visual Grounding, perception to robotics, requiring both language, critical bridge, language understanding
备注:
点击查看摘要
Abstract:3D Visual Grounding (3DVG) is a critical bridge from vision-language perception to robotics, requiring both language understanding and 3D scene reasoning. Traditional supervised models leverage explicit 3D geometry but exhibit limited generalization, owing to the scarcity of 3D vision-language datasets and the limited reasoning capabilities compared to modern vision-language models (VLMs). We propose PanoGrounder, a generalizable 3DVG framework that couples multi-modal panoramic representation with pretrained 2D VLMs for strong vision-language reasoning. Panoramic renderings, augmented with 3D semantic and geometric features, serve as an intermediate representation between 2D and 3D, and offer two major benefits: (i) they can be directly fed to VLMs with minimal adaptation and (ii) they retain long-range object-to-object relations thanks to their 360-degree field of view. We devise a three-stage pipeline that places a compact set of panoramic viewpoints considering the scene layout and geometry, grounds a text query on each panoramic rendering with a VLM, and fuses per-view predictions into a single 3D bounding box via lifting. Our approach achieves state-of-the-art results on ScanRefer and Nr3D, and demonstrates superior generalization to unseen 3D datasets and text rephrasings.
60. 【2512.20901】Benchmarking and Enhancing VLM for Compressed Image Understanding
链接:https://arxiv.org/abs/2512.20901
作者:Zifu Zhang,Tongda Xu,Siqi Li,Shengxi Li,Yue Zhang,Mai Xu,Yan Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:increasingly important, compressed images, rapid development, development of Vision-Language, growing demand
备注:
点击查看摘要
Abstract:With the rapid development of Vision-Language Models (VLMs) and the growing demand for their applications, efficient compression of the image inputs has become increasingly important. Existing VLMs predominantly digest and understand high-bitrate compressed images, while their ability to interpret low-bitrate compressed images has yet to be explored by far. In this paper, we introduce the first comprehensive benchmark to evaluate the ability of VLM against compressed images, varying existing widely used image codecs and diverse set of tasks, encompassing over one million compressed images in our benchmark. Next, we analyse the source of performance gap, by categorising the gap from a) the information loss during compression and b) generalisation failure of VLM. We visualize these gaps with concrete examples and identify that for compressed images, only the generalization gap can be mitigated. Finally, we propose a universal VLM adaptor to enhance model performance on images compressed by existing codecs. Consequently, we demonstrate that a single adaptor can improve VLM performance across images with varying codecs and bitrates by 10%-30%. We believe that our benchmark and enhancement method provide valuable insights and contribute toward bridging the gap between VLMs and compressed images.
61. 【2512.20898】DGSAN: Dual-Graph Spatiotemporal Attention Network for Pulmonary Nodule Malignancy Prediction
链接:https://arxiv.org/abs/2512.20898
作者:Xiao Yu,Zhaojie Fang,Guanyu Zhou,Yin Shen,Huoling Luo,Ye Li,Ahmed Elazab,Xiang Wan,Ruiquan Ge,Changmiao Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Lung cancer continues, cancer-related deaths globally, Lung cancer, deaths globally, cancer continues
备注:
点击查看摘要
Abstract:Lung cancer continues to be the leading cause of cancer-related deaths globally. Early detection and diagnosis of pulmonary nodules are essential for improving patient survival rates. Although previous research has integrated multimodal and multi-temporal information, outperforming single modality and single time point, the fusion methods are limited to inefficient vector concatenation and simple mutual attention, highlighting the need for more effective multimodal information fusion. To address these challenges, we introduce a Dual-Graph Spatiotemporal Attention Network, which leverages temporal variations and multimodal data to enhance the accuracy of predictions. Our methodology involves developing a Global-Local Feature Encoder to better capture the local, global, and fused characteristics of pulmonary nodules. Additionally, a Dual-Graph Construction method organizes multimodal features into inter-modal and intra-modal graphs. Furthermore, a Hierarchical Cross-Modal Graph Fusion Module is introduced to refine feature integration. We also compiled a novel multimodal dataset named the NLST-cmst dataset as a comprehensive source of support for related research. Our extensive experiments, conducted on both the NLST-cmst and curated CSTL-derived datasets, demonstrate that our DGSAN significantly outperforms state-of-the-art methods in classifying pulmonary nodules with exceptional computational efficiency.
62. 【2512.20892】Beyond Weight Adaptation: Feature-Space Domain Injection for Cross-Modal Ship Re-Identification
链接:https://arxiv.org/abs/2512.20892
作者:Tingfeng Xian,Wenlve Zhou,Zhiheng Zhou,Zhelin Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Cross-Modality Ship Re-Identification, maritime target tracking, all-weather maritime target, significant modality discrepancies, CMS Re-ID
备注:
点击查看摘要
Abstract:Cross-Modality Ship Re-Identification (CMS Re-ID) is critical for achieving all-day and all-weather maritime target tracking, yet it is fundamentally challenged by significant modality discrepancies. Mainstream solutions typically rely on explicit modality alignment strategies; however, this paradigm heavily depends on constructing large-scale paired datasets for pre-training. To address this, grounded in the Platonic Representation Hypothesis, we explore the potential of Vision Foundation Models (VFMs) in bridging modality gaps. Recognizing the suboptimal performance of existing generic Parameter-Efficient Fine-Tuning (PEFT) methods that operate within the weight space, particularly on limited-capacity models, we shift the optimization perspective to the feature space and propose a novel PEFT strategy termed Domain Representation Injection (DRI). Specifically, while keeping the VFM fully frozen to maximize the preservation of general knowledge, we design a lightweight, learnable Offset Encoder to extract domain-specific representations rich in modality and identity attributes from raw inputs. Guided by the contextual information of intermediate features at different layers, a Modulator adaptively transforms these representations. Subsequently, they are injected into the intermediate layers via additive fusion, dynamically reshaping the feature distribution to adapt to the downstream task without altering the VFM's pre-trained weights. Extensive experimental results demonstrate the superiority of our method, achieving State-of-the-Art (SOTA) performance with minimal trainable parameters. For instance, on the HOSS-ReID dataset, we attain 57.9% and 60.5% mAP using only 1.54M and 7.05M parameters, respectively. The code is available at this https URL.
63. 【2512.20871】NeRV360: Neural Representation for 360-Degree Videos with a Viewport Decoder
链接:https://arxiv.org/abs/2512.20871
作者:Daichi Arai,Kyohei Unno,Yasuko Sugito,Yuichi Kusakabe
类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
关键词:Implicit neural representations, shown strong potential, Implicit neural, neural representations, shown strong
备注: 2026 IIEEJ International Conference on Image Electronics and Visual Computing (IEVC)
点击查看摘要
Abstract:Implicit neural representations for videos (NeRV) have shown strong potential for video compression. However, applying NeRV to high-resolution 360-degree videos causes high memory usage and slow decoding, making real-time applications impractical. We propose NeRV360, an end-to-end framework that decodes only the user-selected viewport instead of reconstructing the entire panoramic frame. Unlike conventional pipelines, NeRV360 integrates viewport extraction into decoding and introduces a spatial-temporal affine transform module for conditional decoding based on viewpoint and time. Experiments on 6K-resolution videos show that NeRV360 achieves a 7-fold reduction in memory consumption and a 2.5-fold increase in decoding speed compared to HNeRV, a representative prior work, while delivering better image quality in terms of objective metrics.
64. 【2512.20866】Lightweight framework for underground pipeline recognition and spatial localization based on multi-view 2D GPR images
链接:https://arxiv.org/abs/2512.20866
作者:Haotian Lv,Chao Li,Jiangbo Dai,Yuhui Zhang,Zepeng Fan,Yiqiu Tan,Dawei Wang,Binglei Xie
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:address the issues, issues of weak, insufficient robustness, paper proposes, low recognition accuracy
备注:
点击查看摘要
Abstract:To address the issues of weak correlation between multi-view features, low recognition accuracy of small-scale targets, and insufficient robustness in complex scenarios in underground pipeline detection using 3D GPR, this paper proposes a 3D pipeline intelligent detection framework. First, based on a B/C/D-Scan three-view joint analysis strategy, a three-dimensional pipeline three-view feature evaluation method is established by cross-validating forward simulation results obtained using FDTD methods with actual measurement data. Second, the DCO-YOLO framework is proposed, which integrates DySample, CGLU, and OutlookAttention cross-dimensional correlation mechanisms into the original YOLOv11 algorithm, significantly improving the small-scale pipeline edge feature extraction capability. Furthermore, a 3D-DIoU spatial feature matching algorithm is proposed, which integrates three-dimensional geometric constraints and center distance penalty terms to achieve automated association of multi-view annotations. The three-view fusion strategy resolves inherent ambiguities in single-view detection. Experiments based on real urban underground pipeline data show that the proposed method achieves accuracy, recall, and mean average precision of 96.2%, 93.3%, and 96.7%, respectively, in complex multi-pipeline scenarios, which are 2.0%, 2.1%, and 0.9% higher than the baseline model. Ablation experiments validated the synergistic optimization effect of the dynamic feature enhancement module and Grad-CAM++ heatmap visualization demonstrated that the improved model significantly enhanced its ability to focus on pipeline geometric features. This study integrates deep learning optimization strategies with the physical characteristics of 3D GPR, offering an efficient and reliable novel technical framework for the intelligent recognition and localization of underground pipelines.
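The abstract above describes 3D-DIoU only as combining three-dimensional geometric constraints with a center-distance penalty. As a rough illustration, the sketch below applies the standard Distance-IoU formula (IoU minus squared center distance over squared enclosing-box diagonal) to axis-aligned 3D boxes. This is an assumption for illustration, not the paper's exact formulation; the function name `diou_3d` and the box layout are hypothetical.

```python
def diou_3d(box_a, box_b):
    """Distance-IoU for two axis-aligned 3D boxes (illustrative sketch).

    Each box is (xmin, ymin, zmin, xmax, ymax, zmax). This follows the
    standard DIoU definition extended to 3D; the paper's 3D-DIoU may differ.
    """
    # Intersection volume: product of the per-axis overlaps (zero if disjoint).
    inter = 1.0
    for i in range(3):
        lo = max(box_a[i], box_b[i])
        hi = min(box_a[i + 3], box_b[i + 3])
        inter *= max(0.0, hi - lo)

    def volume(b):
        return (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])

    union = volume(box_a) + volume(box_b) - inter
    iou = inter / union if union > 0 else 0.0

    # Squared distance between the two box centers.
    d2 = sum(
        ((box_a[i] + box_a[i + 3]) / 2 - (box_b[i] + box_b[i + 3]) / 2) ** 2
        for i in range(3)
    )
    # Squared diagonal of the smallest box enclosing both inputs.
    c2 = sum(
        (max(box_a[i + 3], box_b[i + 3]) - min(box_a[i], box_b[i])) ** 2
        for i in range(3)
    )
    return iou - d2 / c2 if c2 > 0 else iou
```

Unlike plain IoU, the score stays informative for non-overlapping boxes: it becomes more negative as the centers move apart, which is what makes a center-distance term useful for associating detections across views.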
65. 【2512.20858】ALIVE: An Avatar-Lecture Interactive Video Engine with Content-Aware Retrieval for Real-Time Interaction
链接:https://arxiv.org/abs/2512.20858
作者:Md Zabirul Islam,Md Motaleb Hossen Manik,Ge Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Traditional lecture videos, videos offer flexibility, lecture videos offer, Interactive Video Engine, Traditional lecture
备注:
点击查看摘要
Abstract:Traditional lecture videos offer flexibility but lack mechanisms for real-time clarification, forcing learners to search externally when confusion arises. Recent advances in large language models and neural avatars provide new opportunities for interactive learning, yet existing systems typically lack lecture awareness, rely on cloud-based services, or fail to integrate retrieval and avatar-delivered explanations in a unified, privacy-preserving pipeline. We present ALIVE, an Avatar-Lecture Interactive Video Engine that transforms passive lecture viewing into a dynamic, real-time learning experience. ALIVE operates fully on local hardware and integrates (1) Avatar-delivered lecture generated through ASR transcription, LLM refinement, and neural talking-head synthesis; (2) A content-aware retrieval mechanism that combines semantic similarity with timestamp alignment to surface contextually relevant lecture segments; and (3) Real-time multimodal interaction, enabling students to pause the lecture, ask questions through text or voice, and receive grounded explanations either as text or as avatar-delivered responses. To maintain responsiveness, ALIVE employs lightweight embedding models, FAISS-based retrieval, and segmented avatar synthesis with progressive preloading. We demonstrate the system on a complete medical imaging course, evaluate its retrieval accuracy, latency characteristics, and user experience, and show that ALIVE provides accurate, content-aware, and engaging real-time support. ALIVE illustrates how multimodal AI, when combined with content-aware retrieval and local deployment, can significantly enhance the pedagogical value of recorded lectures, offering an extensible pathway toward next-generation interactive learning environments.
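The abstract states that retrieval combines semantic similarity with timestamp alignment but does not say how. One plausible reading, sketched below, blends cosine similarity with a score that decays with distance from the moment the student paused. The weighting `alpha`, the decay constant `tau`, and the function names are assumptions made for illustration; the sketch uses a plain linear scan rather than the FAISS index the paper mentions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_segments(query_vec, pause_time, segments, alpha=0.7, tau=60.0):
    """Rank lecture segments by a blend of semantic and temporal relevance.

    segments: list of (segment_id, embedding, start_sec, end_sec).
    alpha and tau are hypothetical tuning knobs, not values from the paper.
    """
    scored = []
    for seg_id, emb, start, end in segments:
        semantic = cosine(query_vec, emb)
        # Gap in seconds between the pause point and the segment (0 if inside).
        gap = max(start - pause_time, pause_time - end, 0.0)
        temporal = math.exp(-gap / tau)
        scored.append((alpha * semantic + (1 - alpha) * temporal, seg_id))
    scored.sort(reverse=True)
    return [seg_id for _, seg_id in scored]
```

With this scoring, a segment that is both on topic and near the pause point outranks one that is only on topic, which matches the stated goal of surfacing contextually relevant lecture segments.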
66. 【2512.20839】Input-Adaptive Visual Preprocessing for Efficient Fast Vision-Language Model Inference
链接:https://arxiv.org/abs/2512.20839
作者:Putu Indah Githa Cahyani,Komang David Dananjaya Suartana,Novanto Yudistira
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:multimodal reasoning tasks, demonstrated strong performance, deployment remains challenging, remains challenging due, processing high-resolution visual
备注:
点击查看摘要
Abstract:Vision-Language Models (VLMs) have demonstrated strong performance on multimodal reasoning tasks, but their deployment remains challenging due to high inference latency and computational cost, particularly when processing high-resolution visual inputs. While recent architectures such as FastVLM improve efficiency through optimized vision encoders, existing pipelines still rely on static visual preprocessing, leading to redundant computation for visually simple inputs. In this work, we propose an adaptive visual preprocessing method that dynamically adjusts input resolution and spatial coverage based on image content characteristics. The proposed approach combines content-aware image analysis, adaptive resolution selection, and content-aware cropping to reduce visual redundancy prior to vision encoding. Importantly, the method is integrated with FastVLM without modifying its architecture or requiring retraining. We evaluate the proposed method on a subset of the DocVQA dataset in an inference-only setting, focusing on efficiency-oriented metrics. Experimental results show that adaptive preprocessing reduces per-image inference time by over 50%, lowers mean full generation time, and achieves a consistent reduction of more than 55% in visual token count compared to the baseline pipeline. These findings demonstrate that input-aware preprocessing is an effective and lightweight strategy for improving deployment-oriented efficiency of vision-language models. To facilitate reproducibility, our implementation is provided as a fork of the FastVLM repository, incorporating the files for the proposed method, and is available at this https URL.
67. 【2512.20833】CHAMMI-75: pre-training multi-channel models with heterogeneous microscopy images
链接:https://arxiv.org/abs/2512.20833
作者:Vidit Agrawal(1,2),John Peters(1,2),Tyler N. Thompson(1,2),Mohammad Vali Sanian(3,4),Chau Pham(5),Nikita Moshkov(6),Arshad Kazi(1,2),Aditya Pillai(1,2),Jack Freeman(1),Byunguk Kang(7,8),Samouil L. Farhi(8),Ernest Fraenkel(7),Ron Stewart(1),Lassi Paavolainen(3,4),Bryan A. Plummer(5),Juan C. Caicedo(1,2) ((1) Morgridge Institute for Research, Madison, WI, USA, (2) University of Wisconsin-Madison, Madison, WI, USA, (3) Institute for Molecular Medicine Finland (FIMM), Helsinki, Finland, (4) University of Helsinki, Helsinki, Finland, (5) Boston University, Boston, MA, USA, (6) Institute of Computational Biology, Helmholtz Munich, Neuherberg, Germany, (7) Massachusetts Institute of Technology, Cambridge, MA, USA, (8) Broad Institute of MIT and Harvard, Cambridge, MA, USA)
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Quantifying cell morphology, Quantifying cell, machine learning, learning has proven, powerful tool
备注: 47 Pages, 23 Figures, 26 Tables
点击查看摘要
Abstract:Quantifying cell morphology using images and machine learning has proven to be a powerful tool to study the response of cells to treatments. However, models used to quantify cellular morphology are typically trained with a single microscopy imaging type. This results in specialized models that cannot be reused across biological studies because the technical specifications do not match (e.g., different number of channels), or because the target experimental conditions are out of distribution. Here, we present CHAMMI-75, an open access dataset of heterogeneous, multi-channel microscopy images from 75 diverse biological studies. We curated this resource from publicly available sources to investigate cellular morphology models that are channel-adaptive and can process any microscopy image type. Our experiments show that training with CHAMMI-75 can improve performance in multi-channel bioimaging tasks primarily because of its high diversity in microscopy modalities. This work paves the way to create the next generation of cellular morphology models for biological studies.
68. 【2512.20815】Learning to Sense for Driving: Joint Optics-Sensor-Model Co-Design for Semantic Segmentation
链接:https://arxiv.org/abs/2512.20815
作者:Reeshad Khan,John Gauch
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:decouple camera design, prioritize human viewable, human viewable imagery, Traditional autonomous driving, driving pipelines decouple
备注:
点击查看摘要
Abstract:Traditional autonomous driving pipelines decouple camera design from downstream perception, relying on fixed optics and handcrafted ISPs that prioritize human viewable imagery rather than machine semantics. This separation discards information during demosaicing, denoising, or quantization, while forcing models to adapt to sensor artifacts. We present a task-driven co-design framework that unifies optics, sensor modeling, and lightweight semantic segmentation networks into a single end-to-end RAW-to-task pipeline. Building on DeepLens[19], our system integrates realistic cellphone-scale lens models, learnable color filter arrays, Poisson-Gaussian noise processes, and quantization, all optimized directly for segmentation objectives. Evaluations on KITTI-360 show consistent mIoU improvements over fixed pipelines, with optics modeling and CFA learning providing the largest gains, especially for thin or low-light-sensitive classes. Importantly, these robustness gains are achieved with a compact ~1M-parameter model running at ~28 FPS, demonstrating edge deployability. Visual and quantitative analyses further highlight how co-designed sensors adapt acquisition to semantic structure, sharpening boundaries and maintaining accuracy under blur, noise, and low bit-depth. Together, these findings establish full-stack co-optimization of optics, sensors, and networks as a principled path toward efficient, reliable, and deployable perception in autonomous systems.
69. 【2512.20783】NULLBUS: Multimodal Mixed-Supervision for Breast Ultrasound Segmentation via Nullable Global-Local Prompts
链接:https://arxiv.org/abs/2512.20783
作者:Raja Mallina,Bryar Shareef
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:lesion boundaries essential, Breast ultrasound, public BUS datasets, treatment planning, lesion boundaries
备注: 5 pages, 2 figures, and 4 tables
点击查看摘要
Abstract:Breast ultrasound (BUS) segmentation provides lesion boundaries essential for computer-aided diagnosis and treatment planning. While promptable methods can improve segmentation performance and tumor delineation when text or spatial prompts are available, many public BUS datasets lack reliable metadata or reports, constraining training to small multimodal subsets and reducing robustness. We propose NullBUS, a multimodal mixed-supervision framework that learns from images with and without prompts in a single model. To handle missing text, we introduce nullable prompts, implemented as learnable null embeddings with presence masks, enabling fallback to image-only evidence when metadata are absent and the use of text when present. Evaluated on a unified pool of three public BUS datasets, NullBUS achieves a mean IoU of 0.8568 and a mean Dice of 0.9103, demonstrating state-of-the-art performance under mixed prompt availability.
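The nullable-prompt idea in the abstract (a learnable null embedding plus a presence mask) can be illustrated with a small batch-level sketch. Here the null embedding is a fixed list standing in for what would be a trainable parameter in a real model, and `resolve_prompts` is a hypothetical helper, not part of the NullBUS code.

```python
def resolve_prompts(text_embeddings, null_embedding):
    """Substitute a null embedding wherever a sample has no text prompt.

    text_embeddings: list with one entry per sample, either a list of floats
        or None when the sample has no metadata or report.
    null_embedding: stand-in for the learnable null embedding (assumed to be
        trained jointly with the model in the actual method).
    Returns (embeddings, presence_mask), where presence_mask[i] is 1 if the
    sample had real text and 0 if the null embedding was used.
    """
    resolved, mask = [], []
    for emb in text_embeddings:
        present = emb is not None
        mask.append(1 if present else 0)
        resolved.append(list(emb) if present else list(null_embedding))
    return resolved, mask
```

Because every sample ends up with an embedding of the same shape, prompted and unprompted images can share one batch and one model, while the mask lets downstream layers tell real text from the fallback.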
70. 【2512.20770】OccuFly: A 3D Vision Benchmark for Semantic Scene Completion from the Aerial Perspective
链接:https://arxiv.org/abs/2512.20770
作者:Markus Gross,Sai B. Matha,Aya Fahmy,Rui Song,Daniel Cremers,Henri Meess
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:jointly estimating dense, estimating dense volumetric, dense volumetric occupancy, Semantic Scene Completion, Scene Completion
备注:
点击查看摘要
Abstract:Semantic Scene Completion (SSC) is crucial for 3D perception in mobile robotics, as it enables holistic scene understanding by jointly estimating dense volumetric occupancy and per-voxel semantics. Although SSC has been widely studied in terrestrial domains such as autonomous driving, aerial scenarios like autonomous flying remain largely unexplored, thereby limiting progress on downstream applications. Furthermore, LiDAR sensors represent the primary modality for SSC data generation, which poses challenges for most uncrewed aerial vehicles (UAVs) due to flight regulations, mass and energy constraints, and the sparsity of LiDAR-based point clouds from elevated viewpoints. To address these limitations, we introduce OccuFly, the first real-world, camera-based aerial SSC benchmark, captured at altitudes of 50m, 40m, and 30m during spring, summer, fall, and winter. OccuFly covers urban, industrial, and rural scenarios, provides 22 semantic classes, and the data format adheres to established conventions to facilitate seamless integration with existing research. Crucially, we propose a LiDAR-free data generation framework based on camera modality, which is ubiquitous on modern UAVs. By utilizing traditional 3D reconstruction, our framework automates label transfer by lifting a subset of annotated 2D masks into the reconstructed point cloud, thereby substantially minimizing manual 3D annotation effort. Finally, we benchmark the state-of-the-art on OccuFly and highlight challenges specific to elevated viewpoints, yielding a comprehensive vision benchmark for holistic aerial 3D scene understanding.
71. 【2512.20746】TrashDet: Iterative Neural Architecture Search for Efficient Waste Detection
链接:https://arxiv.org/abs/2512.20746
作者:Tony Tran,Bin Hu
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:addresses trash detection, hardware-aware neural architecture, neural architecture search, iterative hardware-aware neural, framework targeting edge
备注: 10 pages. The paper has been accepted by the WACV 2026 workshop
点击查看摘要
Abstract:This paper addresses trash detection on the TACO dataset under strict TinyML constraints using an iterative hardware-aware neural architecture search framework targeting edge and IoT devices. The proposed method constructs a Once-for-All-style ResDets supernet and performs iterative evolutionary search that alternates between backbone and neck/head optimization, supported by a population passthrough mechanism and an accuracy predictor to reduce search cost and improve stability. This framework yields a family of deployment-ready detectors, termed TrashDets. On a five-class TACO subset (paper, plastic, bottle, can, cigarette), the strongest variant, TrashDet-l, achieves 19.5 mAP50 with 30.5M parameters, improving accuracy by up to 3.6 mAP50 over prior detectors while using substantially fewer parameters. The TrashDet family spans 1.2M to 30.5M parameters with mAP50 values between 11.4 and 19.5, providing scalable detector options for diverse TinyML deployment budgets on resource-constrained hardware. On the MAX78002 microcontroller with the TrashNet dataset, two specialized variants, TrashDet-ResNet and TrashDet-MBNet, jointly dominate the ai87-fpndetector baseline, with TrashDet-ResNet achieving 7525 μJ energy per inference at 26.7 ms latency and 37.45 FPS, and TrashDet-MBNet improving mAP50 by 10.2%; together they reduce energy consumption by up to 88%, latency by up to 78%, and average power by up to 53% compared to existing TinyML detectors.
72. 【2512.20735】VL4Gaze: Unleashing Vision-Language Models for Gaze Following
Link: https://arxiv.org/abs/2512.20735
Authors: Shijing Wang, Chaoqun Cui, Yaping Huang, Hyung Jin Chang, Yihua Cheng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: remains largely unexplored, current vision-language models, understanding remains largely, Human gaze, gaze understanding
Comments:
Abstract:Human gaze provides essential cues for interpreting attention, intention, and social interaction in visual scenes, yet gaze understanding remains largely unexplored in current vision-language models (VLMs). While recent VLMs achieve strong scene-level reasoning across a range of visual tasks, there exists no benchmark that systematically evaluates or trains them for gaze interpretation, leaving open the question of whether gaze understanding can emerge from general-purpose vision-language pre-training. To address this gap, we introduce VL4Gaze, the first large-scale benchmark designed to investigate, evaluate, and unlock the potential of VLMs for gaze understanding. VL4Gaze contains 489K automatically generated question-answer pairs across 124K images and formulates gaze understanding as a unified VQA problem through four complementary tasks: (1) gaze object description, (2) gaze direction description, (3) gaze point location, and (4) ambiguous question recognition. We comprehensively evaluate both commercial and open-source VLMs under in-context learning and fine-tuning settings. The results show that even large-scale VLMs struggle to reliably infer gaze semantics and spatial localization without task-specific supervision. In contrast, training on VL4Gaze brings substantial and consistent improvements across all tasks, highlighting the importance of targeted multi-task supervision for developing gaze understanding capabilities in VLMs. We will release the dataset and code to support further research and development in this direction.
73. 【2512.20674】HyDRA: Hierarchical and Dynamic Rank Adaptation for Mobile Vision Language Model
Link: https://arxiv.org/abs/2512.20674
Authors: Yuanhao Xi, Xiaohuan Bing, Ramin Yahyapour
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Vision Language Models, Vision Language, undergone significant advancements, Language Models, emergence of mobile-oriented
Comments:
Abstract: Vision Language Models (VLMs) have undergone significant advancements, particularly with the emergence of mobile-oriented VLMs, which offer a wide range of application scenarios. However, the substantial computational requirements for training these models present a significant obstacle to their practical application. To address this issue, Low-Rank Adaptation (LoRA) has been proposed. Nevertheless, the standard LoRA with a fixed rank lacks sufficient capability for training mobile VLMs that process both text and image modalities. In this work, we introduce HyDRA, a parameter-efficient fine-tuning framework designed to implement hierarchical and dynamic rank scheduling for mobile VLMs. This framework incorporates two essential optimization strategies: (1) hierarchical optimization, which involves a coarse-grained approach that assigns different ranks to various layers, as well as a fine-grained method that adjusts ranks within individual layers, and (2) dynamic adjustment, which employs an end-to-end automatic optimization using a lightweight performance model to determine and adjust ranks during the fine-tuning process. Comprehensive experiments conducted on popular benchmarks demonstrate that HyDRA consistently outperforms the baseline, achieving a 4.7% improvement across various model sizes without increasing the number of trainable parameters. In some tasks, it even surpasses full-parameter fine-tuning.
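To make the "different ranks per layer, same number of trainable parameters" idea concrete, here is a minimal sketch of LoRA parameter accounting. It is not HyDRA's scheduling algorithm. The layer sizes and rank allocations are hypothetical, and the only fact relied on is that a LoRA adapter of rank r on a d_out × d_in weight adds r · (d_in + d_out) trainable parameters.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one LoRA adapter: A is (rank, d_in), B is (d_out, rank)."""
    return rank * (d_in + d_out)


# Hypothetical model: four square projection layers of width 1024.
layer_dims = [(1024, 1024)] * 4

uniform_ranks = [8, 8, 8, 8]       # standard LoRA: one fixed rank everywhere
layerwise_ranks = [2, 6, 10, 14]   # illustrative per-layer allocation (not from the paper)


def total_params(ranks):
    """Sum the adapter parameters over all layers for a given rank allocation."""
    return sum(
        lora_params(d_in, d_out, r)
        for (d_in, d_out), r in zip(layer_dims, ranks)
    )


print(total_params(uniform_ranks), total_params(layerwise_ranks))  # 65536 65536
```

Both allocations spend the same budget, so any accuracy difference between them would come from where the rank is placed, not from extra trainable parameters.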
74. 【2512.20655】MaskOpt: A Large-Scale Mask Optimization Dataset to Advance AI in Integrated Circuit Manufacturing
Link: https://arxiv.org/abs/2512.20655
Authors: Yuting Hu, Lei Zhuang, Hua Xiang, Jinjun Xiong, Gi-Joon Nam
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: faces growing challenges, lithography faces growing, optical lithography faces, integrated circuit, dimensions shrink
Comments:
Abstract: As integrated circuit (IC) dimensions shrink below the lithographic wavelength, optical lithography faces growing challenges from diffraction and process variability. Model-based optical proximity correction (OPC) and inverse lithography technique (ILT) remain indispensable but computationally expensive, requiring repeated simulations that limit scalability. Although deep learning has been applied to mask optimization, existing datasets often rely on synthetic layouts, disregard standard-cell hierarchy, and neglect the surrounding contexts around the mask optimization targets, thereby constraining their applicability to practical mask optimization. To advance deep learning for cell- and context-aware mask optimization, we present MaskOpt, a large-scale benchmark dataset constructed from real IC designs at the 45 nm node. MaskOpt includes 104,714 metal-layer tiles and 121,952 via-layer tiles. Each tile is clipped at a standard-cell placement to preserve cell information, exploiting repeated logic gate occurrences. Different context window sizes are supported in MaskOpt to capture the influence of neighboring shapes from optical proximity effects. We evaluate state-of-the-art deep learning models for IC mask optimization to build up benchmarks, and the evaluation results expose distinct trade-offs across baseline models. Further context size analysis and input ablation studies confirm the importance of both surrounding geometries and cell-aware inputs in achieving accurate mask generation.
75. 【2512.20626】MegaRAG: Multimodal Knowledge Graph-Based Retrieval Augmented Generation
Link: https://arxiv.org/abs/2512.20626
Authors: Chi-Hsiang Hsiao, Yi-Cheng Wang, Tzung-Sheng Lin, Yi-Ren Yeh, Chu-Song Chen
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Keywords: large language models, access external information, dynamically access external, previously unseen documents, enables large language
Comments:
Abstract: Retrieval-augmented generation (RAG) enables large language models (LLMs) to dynamically access external information, which is powerful for answering questions over previously unseen documents. Nonetheless, they struggle with high-level conceptual understanding and holistic comprehension due to limited context windows, which constrain their ability to perform deep reasoning over long-form, domain-specific content such as full-length books. To solve this problem, knowledge graphs (KGs) have been leveraged to provide entity-centric structure and hierarchical summaries, offering more structured support for reasoning. However, existing KG-based RAG solutions remain restricted to text-only inputs and fail to leverage the complementary insights provided by other modalities such as vision. On the other hand, reasoning from visual documents requires integrating textual, visual, and spatial cues into structured, hierarchical concepts. To address this issue, we introduce a multimodal knowledge graph-based RAG that enables cross-modal reasoning for better content understanding. Our method incorporates visual cues into the construction of knowledge graphs, the retrieval phase, and the answer generation process. Experimental results across both global and fine-grained question answering tasks show that our approach consistently outperforms existing RAG-based approaches on both textual and multimodal corpora.
76. 【2512.21180】Equivariant Multiscale Learned Invertible Reconstruction for Cone Beam CT: From Simulated to Real Data
Link: https://arxiv.org/abs/2512.21180
Authors: Nikita Moriakov, Efstratios Gavves, Jonathan H. Mason, Carmen Seller-Oria, Jonas Teuwen, Jan-Jakob Sonke
Categories: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
Keywords: conventional Computed Tomography, Cone Beam, Computed Tomography, imaging modality nowadays, important imaging modality
Comments: 29 pages. arXiv admin note: substantial text overlap with [arXiv:2401.11256](https://arxiv.org/abs/2401.11256)
Abstract: Cone Beam CT (CBCT) is an important imaging modality; however, the lower image quality of CBCT compared to more conventional Computed Tomography (CT) remains a limiting factor in CBCT applications. Deep learning reconstruction methods are a promising alternative to classical analytical and iterative reconstruction methods, but applying such methods to CBCT is often difficult due to the lack of ground truth data, memory limitations and the need for fast inference at clinically-relevant resolutions. In this work we propose LIRE++, an end-to-end rotationally-equivariant multiscale learned invertible primal-dual scheme for fast and memory-efficient CBCT reconstruction. Memory optimizations and multiscale reconstruction allow for fast training and inference, while rotational equivariance improves parameter efficiency. LIRE++ was trained on simulated projection data from a fast quasi-Monte Carlo CBCT projection simulator that we developed as well. Evaluated on synthetic data, LIRE++ gave an average improvement of 1 dB in Peak Signal-to-Noise Ratio over alternative deep learning baselines. On real clinical data, LIRE++ improved the average Mean Absolute Error between the reconstruction and the corresponding planning CT by 10 Hounsfield Units with respect to the current proprietary state-of-the-art hybrid deep-learning/iterative method.
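For readers unfamiliar with decibel-scale metrics, the sketch below shows what a 1 dB PSNR gain means in terms of mean squared error. It uses only the standard definition PSNR = 10 · log10(MAX² / MSE) with a fixed peak value and applies to a single image. The abstract reports an average over a dataset, so this is an order-of-magnitude guide, not a figure from the paper. The function name is illustrative.

```python
def mse_ratio_for_psnr_gain(gain_db: float) -> float:
    """Ratio MSE_new / MSE_old implied by a PSNR gain of `gain_db` at a fixed peak value.

    From PSNR = 10 * log10(MAX**2 / MSE), a gain of g dB gives
    MSE_new / MSE_old = 10 ** (-g / 10).
    """
    return 10 ** (-gain_db / 10)


ratio = mse_ratio_for_psnr_gain(1.0)
print(f"MSE ratio: {ratio:.3f}")  # MSE ratio: 0.794
```

In other words, a 1 dB gain on a given image corresponds to roughly a 21% reduction in mean squared error.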
77. 【2512.20642】Flow Gym
Link: https://arxiv.org/abs/2512.20642
Authors: Francesco Banelli, Antonio Terpin, Alan Bonomi, Raffaello D'Andrea
Categories: Fluid Dynamics (physics.flu-dyn); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE); Computational Physics (physics.comp-ph)
Keywords: Flow Gym, quantification methods inspired, flow-field quantification methods, OpenAI Gym, toolkit for research
Comments: Code: [github.com/antonioterpin/flowgym](https://github.com/antonioterpin/flowgym)
Abstract: Flow Gym is a toolkit for research and deployment of flow-field quantification methods, inspired by OpenAI Gym and Stable-Baselines3. It uses SynthPix as its synthetic image generation engine and provides a unified interface for the testing, deployment and training of (learning-based) algorithms for flow-field quantification from a number of consecutive images of tracer particles. It also contains a growing number of integrations of existing algorithms and stable (re-)implementations in JAX.




