本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。
统计
今日共更新556篇论文,其中:
- 自然语言处理58篇
- 信息检索12篇
- 计算机视觉131篇
自然语言处理
1. 【2512.16917】Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning
链接:https://arxiv.org/abs/2512.16917
作者:Qihao Liu,Luoxin Ye,Wufei Ma,Yu-Cheng Chou,Alan Yuille
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Large language models, Large language, explicit reasoning capabilities, reasoning capabilities excel, Generative Adversarial Reasoner
备注:
点击查看摘要
Abstract:Large language models (LLMs) with explicit reasoning capabilities excel at mathematical reasoning yet still commit process errors, such as incorrect calculations, brittle logic, and superficially plausible but invalid steps. In this paper, we introduce Generative Adversarial Reasoner, an on-policy joint training framework designed to enhance reasoning by co-evolving an LLM reasoner and an LLM-based discriminator through adversarial reinforcement learning. A compute-efficient review schedule partitions each reasoning chain into logically complete slices of comparable length, and the discriminator evaluates each slice's soundness with concise, structured justifications. Learning couples complementary signals: the LLM reasoner is rewarded for logically consistent steps that yield correct answers, while the discriminator earns rewards for correctly detecting errors or distinguishing traces in the reasoning process. This produces dense, well-calibrated, on-policy step-level rewards that supplement sparse exact-match signals, improving credit assignment, increasing sample efficiency, and enhancing overall reasoning quality of LLMs. Across various mathematical benchmarks, the method delivers consistent gains over strong baselines with standard RL post-training. Specifically, on AIME24, we improve DeepSeek-R1-Distill-Qwen-7B from 54.0 to 61.3 (+7.3) and DeepSeek-R1-Distill-Llama-8B from 43.7 to 53.7 (+10.0). The modular discriminator also enables flexible reward shaping for objectives such as teacher distillation, preference alignment, and mathematical proof-based reasoning.
2. 【2512.16914】Constructive Circuit Amplification: Improving Math Reasoning in LLMs via Targeted Sub-Network Updates
链接:https://arxiv.org/abs/2512.16914
作者:Nikhil Prakash,Donghao Ren,Dominik Moritz,Yannick Assogba
类目:Computation and Language (cs.CL)
关键词:Prior studies investigating, Prior studies, performing specific tasks, uncovered sparse subnetworks, studies investigating
备注: 18 pages, 3 figures
点击查看摘要
Abstract:Prior studies investigating the internal workings of LLMs have uncovered sparse subnetworks, often referred to as circuits, that are responsible for performing specific tasks. Additionally, it has been shown that model performance improvement through fine-tuning often results from the strengthening of existing circuits in the model. Taken together, these findings suggest the possibility of intervening directly on such circuits to make precise, task-targeted updates. Motivated by these findings, we propose a novel method called Constructive Circuit Amplification which identifies pivotal tokens from model reasoning traces as well as model components responsible for the desired task, and updates only those components. Applied to mathematical reasoning, it improves accuracy by up to +11.4% across multiple models while modifying as little as 1.59% of model components, with minimal impact on other abilities as measured by MMLU, TriviaQA, and TruthfulQA. These results demonstrate that targeted capabilities can be reliably enhanced by selectively updating a sparse set of model components.
3. 【2512.16912】Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward
链接:https://arxiv.org/abs/2512.16912
作者:Peter Chen,Xiaopeng Li,Ziniu Li,Wotao Yin,Xi Chen,Tianyi Lin
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, Language Models, paper examines, examines the exploration-exploitation
备注: 35 pages
点击查看摘要
Abstract:This paper examines the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), a framework for improving the reasoning of Large Language Models (LLMs). Recent studies suggest that RLVR can elicit strong mathematical reasoning in LLMs through two seemingly paradoxical mechanisms: spurious rewards, which suppress exploitation by rewarding outcomes unrelated to the ground truth, and entropy minimization, which suppresses exploration by pushing the model toward more confident and deterministic outputs, highlighting a puzzling dynamic: both discouraging exploitation and discouraging exploration improve reasoning performance, yet the underlying principles that reconcile these effects remain poorly understood. We focus on two fundamental questions: (i) how policy entropy relates to performance, and (ii) whether spurious rewards yield gains, potentially through the interplay of clipping bias and model contamination. Our results show that clipping bias under spurious rewards reduces policy entropy, leading to more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. We further propose a reward-misalignment model explaining why spurious rewards can enhance performance beyond contaminated settings. Our findings clarify the mechanisms behind spurious-reward benefits and provide principles for more effective RLVR training.
4. 【2512.16904】How Good is Post-Hoc Watermarking With Language Model Rephrasing?
链接:https://arxiv.org/abs/2512.16904
作者:Pierre Fernandez,Tom Sander,Hady Elsahar,Hongyan Chang,Tomáš Souček,Valeriu Lacatusu,Tuan Tran,Sylvestre-Alvise Rebuffi,Alexandre Mourachko
类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)
关键词:embeds statistical signals, AI-generated content, watermarking embeds statistical, embeds statistical, statistical signals
备注: Code at [this https URL](https://github.com/facebookresearch/textseal)
点击查看摘要
Abstract:Generation-time text watermarking embeds statistical signals into text for traceability of AI-generated content. We explore *post-hoc watermarking* where an LLM rewrites existing text while applying generation-time watermarking, to protect copyrighted documents, or detect their use in training or RAG via watermark radioactivity. Unlike generation-time approaches, which is constrained by how LLMs are served, this setting offers additional degrees of freedom for both generation and detection. We investigate how allocating compute (through larger rephrasing models, beam search, multi-candidate generation, or entropy filtering at detection) affects the quality-detectability trade-off. Our strategies achieve strong detectability and semantic fidelity on open-ended text such as books. Among our findings, the simple Gumbel-max scheme surprisingly outperforms more recent alternatives under nucleus sampling, and most methods benefit significantly from beam search. However, most approaches struggle when watermarking verifiable text such as code, where we counterintuitively find that smaller models outperform larger ones. This study reveals both the potential and limitations of post-hoc watermarking, laying groundwork for practical applications and future research.
5. 【2512.16902】In-Context Algebra
链接:https://arxiv.org/abs/2512.16902
作者:Eric Todd,Jannik Brinkmann,Rohit Gandikota,David Bau
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:solve arithmetic, trained to solve, mirror algebraic structure, transformers, Abstract
备注: 28 pages, 18 figures. Code and data at [this https URL](https://algebra.baulab.info)
点击查看摘要
Abstract:We investigate the mechanisms that arise when transformers are trained to solve arithmetic on sequences where tokens are variables whose meaning is determined only through their interactions. While prior work has found that transformers develop geometric embeddings that mirror algebraic structure, those previous findings emerge from settings where arithmetic-valued tokens have fixed meanings. We devise a new task in which the assignment of symbols to specific algebraic group elements varies from one sequence to another. Despite this challenging setup, transformers achieve near-perfect accuracy on the task and even generalize to unseen algebraic groups. We develop targeted data distributions to create causal tests of a set of hypothesized mechanisms, and we isolate three mechanisms models consistently learn: commutative copying where a dedicated head copies answers, identity element recognition that distinguishes identity-containing facts, and closure-based cancellation that tracks group membership to constrain valid answers. Complementary to the geometric representations found in fixed-symbol settings, our findings show that models develop symbolic reasoning mechanisms when trained to reason in-context with variables whose meanings are not fixed.
6. 【2512.16901】Impacts of Racial Bias in Historical Training Data for News AI
链接:https://arxiv.org/abs/2512.16901
作者:Rahul Bhargava,Malene Hornstrup Jespersen,Emily Boardman Ndulue,Vivica Dsouza
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:including computational journalism, computational journalism research, large text corpora, involve large text, York Times Annotated
备注:
点击查看摘要
Abstract:AI technologies have rapidly moved into business and research applications that involve large text corpora, including computational journalism research and newsroom settings. These models, trained on extant data from various sources, can be conceptualized as historical artifacts that encode decades-old attitudes and stereotypes. This paper investigates one such example trained on the broadly-used New York Times Annotated Corpus to create a multi-label classifier. Our use in research settings surfaced the concerning "blacks" thematic topic label. Through quantitative and qualitative means we investigate this label's use in the training corpus, what concepts it might be encoding in the trained classifier, and how those concepts impact our model use. Via the application of explainable AI methods, we find that the "blacks" label operates partially as a general "racism detector" across some minoritized groups. However, it performs poorly against expectations on modern examples such as COVID-19 era anti-Asian hate stories, and reporting on the Black Lives Matter movement. This case study of interrogating embedded biases in a model reveals how similar applications in newsroom settings can lead to unexpected outputs that could impact a wide variety of potential uses of any large language model-story discovery, audience targeting, summarization, etc. The fundamental tension this exposes for newsrooms is how to adopt AI-enabled workflow tools while reducing the risk of reproducing historical biases in news coverage.
7. 【2512.16899】Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image
链接:https://arxiv.org/abs/2512.16899
作者:Yushi Hu,Reyhane Askari-Hemmat,Melissa Hall,Emily Dinan,Luke Zettlemoyer,Marjan Ghazvininejad
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:training large language, large language models, Reward models, text sequences, handle interleaved image
备注: Code and data available at [this https URL](https://github.com/facebookresearch/MMRB2)
点击查看摘要
Abstract:Reward models (RMs) are essential for training large language models (LLMs), but remain underexplored for omni models that handle interleaved image and text sequences. We introduce Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on multimodal understanding and (interleaved) generation. MMRB2 spans four tasks: text-to-image, image editing, interleaved generation, and multimodal reasoning ("thinking-with-images"), providing 1,000 expert-annotated preference pairs per task from 23 models and agents across 21 source tasks. MMRB2 is designed with: (1) practical but challenging prompts; (2) responses from state-of-the-art models and agents; and (3) preference pairs with strong human-expert consensus, curated via an ensemble filtering strategy. Using MMRB2, we study existing judges for each subtask, including multimodal LLM-as-a-judge and models trained with human preferences. The latest Gemini 3 Pro attains 75-80% accuracy. GPT-5 and Gemini 2.5 Pro reach 66-75% accuracy, compared to 90% for humans, yet surpass the widely used GPT-4o (59%). The best performing open-source model Qwen3-VL-32B achieves similar accuracies as Gemini 2.5 Flash (64%). We also show that MMRB2 performance strongly correlates with downstream task success using Best-of-N sampling and conduct an in-depth analysis that shows key areas to improve the reward models going forward.
8. 【2512.16883】AdaSearch: Balancing Parametric Knowledge and Search in Large Language Models via Reinforcement Learning
链接:https://arxiv.org/abs/2512.16883
作者:Tzu-Han Lin,Wei-Lin Chen,Chen-An Li,Hung-yi Lee,Yun-Nung Chen,Yu Meng
类目:Computation and Language (cs.CL)
关键词:Equipping large language, Equipping large, large language models, reinforcement learning, large language
备注: Preprint. Code and artifacts will be uploaded to [this https URL](https://github.com/hank0316/AdaSearch)
点击查看摘要
Abstract:Equipping large language models (LLMs) with search engines via reinforcement learning (RL) has emerged as an effective approach for building search agents. However, overreliance on search introduces unnecessary cost and risks exposure to noisy or malicious content, while relying solely on parametric knowledge risks hallucination. The central challenge is to develop agents that adaptively balance parametric knowledge with external search, invoking search only when necessary. Prior work mitigates search overuse by shaping rewards around the number of tool calls. However, these penalties require substantial reward engineering, provide ambiguous credit assignment, and can be exploited by agents that superficially reduce calls. Moreover, evaluating performance solely through call counts conflates necessary and unnecessary search, obscuring the measurement of true adaptive behavior. To address these limitations, we first quantify the self-knowledge awareness of existing search agents via an F1-based decision metric, revealing that methods such as Search-R1 often overlook readily available parametric knowledge. Motivated by these findings, we propose AdaSearch, a simple two-stage, outcome-driven RL framework that disentangles problem solving from the decision of whether to invoke search, and makes this decision process explicit and interpretable. This transparency is crucial for high-stakes domains such as finance and medical question answering, yet is largely neglected by prior approaches. Experiments across multiple model families and sizes demonstrate that AdaSearch substantially improves knowledge-boundary awareness, reduces unnecessary search calls, preserves strong task performance, and offers more transparent, interpretable decision behaviors.
9. 【2512.16843】LLMCache: Layer-Wise Caching Strategies for Accelerated Reuse in Transformer Inference
链接:https://arxiv.org/abs/2512.16843
作者:Harsh Vardhan Bansal
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Transformer-based language models, real-timeand large-scale deployment, achieved remarkable performance, Transformer-based language, high inference latency
备注: Accepted and presented at 13th IEEE International Conference on Intelligent Systems and Embedded Design (ISED-2025)
点击查看摘要
Abstract:Transformer-based language models have achieved remarkable performance across a wide range of tasks, yet their high inference latency poses a significant challenge for real-timeand large-scale deployment. While existing caching mechanisms,such as token-level key-value caches, offer speedups in autore-gressive decoding, they are limited in scope and applicability. In this paper, we present LLMCache, a novel layer-wise caching framework that accelerates transformer inference by reusing intermediate activations based on semantic similarity of input sequences. Unlike prior work, LLMCache is model-agnostic,operates across both encoder and decoder architectures, and supports caching at arbitrary transformer layers. We introduce a lightweight fingerprinting mechanism for matching seman-tically similar inputs and propose adaptive eviction strategies to manage cache staleness. Experiments on BERT and GPT-2 across SQuAD, WikiText-103, and OpenBookQA show up to 3.1 X speedup in inference time with 0.5% accuracy degradation. Our results highlight LLMCache as a practical and general-purpose solution for optimizing transformer inference in real-world applications
10. 【2512.16832】What Do Prosody and Text Convey? Characterizing How Meaningful Information is Distributed Across Multiple Channels
链接:https://arxiv.org/abs/2512.16832
作者:Aditya Yadavalli,Tiago Pimentel,Tamar I Regev,Ethan Wilcox,Alex Warstadt
类目:Computation and Language (cs.CL)
关键词:conveys critical information, conveys critical, information, text, Prosody
备注:
点击查看摘要
Abstract:Prosody -- the melody of speech -- conveys critical information often not captured by the words or text of a message. In this paper, we propose an information-theoretic approach to quantify how much information is expressed by prosody alone and not by text, and crucially, what that information is about. Our approach applies large speech and language models to estimate the mutual information between a particular dimension of an utterance's meaning (e.g., its emotion) and any of its communication channels (e.g., audio or text). We then use this approach to quantify how much information is conveyed by audio and text about sarcasm, emotion, and questionhood, using speech from television and podcasts. We find that for sarcasm and emotion the audio channel -- and by implication the prosodic channel -- transmits over an order of magnitude more information about these features than the text channel alone, at least when long-term context beyond the current sentence is unavailable. For questionhood, prosody provides comparatively less additional information. We conclude by outlining a program applying our approach to more dimensions of meaning, communication channels, and languages.
11. 【2512.16814】Grammar-Forced Translation of Natural Language to Temporal Logic using LLMs
链接:https://arxiv.org/abs/2512.16814
作者:William English,Dominic Simon,Sumit Kumar Jha,Rickard Ewetz
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Translating natural language, Translating natural, temporal logic, autonomous systems, integral for human
备注:
点击查看摘要
Abstract:Translating natural language (NL) into a formal language such as temporal logic (TL) is integral for human communication with robots and autonomous systems. State-of-the-art approaches decompose the task into a lifting of atomic propositions (APs) phase and a translation phase. However, existing methods struggle with accurate lifting, the existence of co-references, and learning from limited data. In this paper, we propose a framework for NL to TL translation called Grammar Forced Translation (GraFT). The framework is based on the observation that previous work solves both the lifting and translation steps by letting a language model iteratively predict tokens from its full vocabulary. In contrast, GraFT reduces the complexity of both tasks by restricting the set of valid output tokens from the full vocabulary to only a handful in each step. The solution space reduction is obtained by exploiting the unique properties of each problem. We also provide a theoretical justification for why the solution space reduction leads to more efficient learning. We evaluate the effectiveness of GraFT using the CW, GLTL, and Navi benchmarks. Compared with state-of-the-art translation approaches, it can be observed that GraFT the end-to-end translation accuracy by 5.49% and out-of-domain translation accuracy by 14.06% on average.
12. 【2512.16802】Exploration of Augmentation Strategies in Multi-modal Retrieval-Augmented Generation for the Biomedical Domain: A Case Study Evaluating Question Answering in Glycobiology
链接:https://arxiv.org/abs/2512.16802
作者:Primož Kocbek,Azra Frkatović-Hodžić,Dora Lalić,Vivian Hui,Gordan Lauc,Gregor Štiglic
类目:Computation and Language (cs.CL)
关键词:promises grounded biomedical, optical character recognition, returns page images, Multi-modal retrieval-augmented generation, retrieval-augmented generation
备注: Will be published in IEEE BigData 2025 proceedings. Contains 10 pages, 1 figure, 5 tables
点击查看摘要
Abstract:Multi-modal retrieval-augmented generation (MM-RAG) promises grounded biomedical QA, but it is unclear when to (i) convert figures/tables into text versus (ii) use optical character recognition (OCR)-free visual retrieval that returns page images and leaves interpretation to the generator. We study this trade-off in glycobiology, a visually dense domain. We built a benchmark of 120 multiple-choice questions (MCQs) from 25 papers, stratified by retrieval difficulty (easy text, medium figures/tables, hard cross-evidence). We implemented four augmentations-None, Text RAG, Multi-modal conversion, and late-interaction visual retrieval (ColPali)-using Docling parsing and Qdrant indexing. We evaluated mid-size open-source and frontier proprietary models (e.g., Gemma-3-27B-IT, GPT-4o family). Additional testing used the GPT-5 family and multiple visual retrievers (ColPali/ColQwen/ColFlor). Accuracy with Agresti-Coull 95% confidence intervals (CIs) was computed over 5 runs per configuration. With Gemma-3-27B-IT, Text and Multi-modal augmentation outperformed OCR-free retrieval (0.722-0.740 vs. 0.510 average accuracy). With GPT-4o, Multi-modal achieved 0.808, with Text 0.782 and ColPali 0.745 close behind; within-model differences were small. In follow-on experiments with the GPT-5 family, the best results with ColPali and ColFlor improved by ~2% to 0.828 in both cases. In general, across the GPT-5 family, ColPali, ColQwen, and ColFlor were statistically indistinguishable. GPT-5-nano trailed larger GPT-5 variants by roughly 8-10%. Pipeline choice is capacity-dependent: converting visuals to text lowers the reader burden and is more reliable for mid-size models, whereas OCR-free visual retrieval becomes competitive under frontier models. Among retrievers, ColFlor offers parity with heavier options at a smaller footprint, making it an efficient default when strong generators are available.
13. 【2512.16795】From Facts to Conclusions : Integrating Deductive Reasoning in Retrieval-Augmented LLMs
链接:https://arxiv.org/abs/2512.16795
作者:Shubham Mishra,Samyek Jain,Gorang Mehrishi,Shiv Tiwari,Harsh Sharma,Pratik Narang,Dhruv Kumar
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
关键词:large language models, grounds large language, Retrieval-Augmented Generation, retrieved sources conflict, grounds large
备注: Under Review
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) grounds large language models (LLMs) in external evidence, but fails when retrieved sources conflict or contain outdated or subjective information. Prior work address these issues independently but lack unified reasoning supervision. We propose a reasoning-trace-augmented RAG framework that adds structured, interpretable reasoning across three stages : (1) document-level adjudication, (2) conflict analysis, and (3) grounded synthesis, producing citation-linked answers or justified refusals. A Conflict-Aware Trust-Score (CATS) pipeline is introduced which evaluates groundedness, factual correctness, refusal accuracy, and conflict-behavior alignment using an LLM-as-a-Judge. Our 539-query reasoning dataset and evaluation pipeline establish a foundation for conflict-aware, interpretable RAG systems. Experimental results demonstrate substantial gains over baselines, most notably with Qwen, where Supervised Fine-Tuning improved End-to-End answer correctness from 0.069 to 0.883 and behavioral adherence from 0.074 to 0.722.
14. 【2512.16770】GinSign: Grounding Natural Language Into System Signatures for Temporal Logic Translation
链接:https://arxiv.org/abs/2512.16770
作者:William English,Chase Walker,Dominic Simon,Rickard Ewetz
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:manually crafting formal, crafting formal specifications-an, formal specifications-an essential, specifications-an essential capability, building trustworthy autonomous
备注:
点击查看摘要
Abstract:Natural language (NL) to temporal logic (TL) translation enables engineers to specify, verify, and enforce system behaviors without manually crafting formal specifications-an essential capability for building trustworthy autonomous systems. While existing NL-to-TL translation frameworks have demonstrated encouraging initial results, these systems either explicitly assume access to accurate atom grounding or suffer from low grounded translation accuracy. In this paper, we propose a framework for Grounding Natural Language Into System Signatures for Temporal Logic translation called GinSign. The framework introduces a grounding model that learns the abstract task of mapping NL spans onto a given system signature: given a lifted NL specification and a system signature $\mathcal{S}$, the classifier must assign each lifted atomic proposition to an element of the set of signature-defined atoms $\mathcal{P}$. We decompose the grounding task hierarchically- first predicting predicate labels, then selecting the appropriately typed constant arguments. Decomposing this task from a free-form generation problem into a structured classification problem permits the use of smaller masked language models and eliminates the reliance on expensive LLMs. Experiments across multiple domains show that frameworks which omit grounding tend to produce syntactically correct lifted LTL that is semantically nonequivalent to grounded target expressions, whereas our framework supports downstream model checking and achieves grounded logical-equivalence scores of $95.5\%$, a $1.4\times$ improvement over SOTA.
15. 【2512.16676】DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI
链接:https://arxiv.org/abs/2512.16676
作者:Hao Liang,Xiaochen Ma,Zhou Liu,Zhen Hao Wong,Zhengyang Zhao,Zimo Meng,Runming He,Chengyu Shen,Qifeng Cai,Zhaoyang Han,Meiyi Qiang,Yalin Feng,Tianyi Bai,Zewei Pan,Ziyi Guo,Yizhen Jiang,Jingwen Deng,Qijie You,Peichao Lai,Tianyu Guo,Chi Hsu Tsai,Hengyi Feng,Rui Hu,Wenkai Yu,Junbo Niu,Bohan Zeng,Ruichuan An,Lu Ma,Jihao Huang,Yaowei Zheng,Conghui He,Linpeng Tang,Bin Cui,Weinan E,Wentao Zhang
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, rapidly growing demand, semantically rich data, rich data preparation
备注:
点击查看摘要
Abstract:The rapidly growing demand for high-quality data in Large Language Models (LLMs) has intensified the need for scalable, reliable, and semantically rich data preparation pipelines. However, current practices remain dominated by ad-hoc scripts and loosely specified workflows, which lack principled abstractions, hinder reproducibility, and offer limited support for model-in-the-loop data generation. To address these challenges, we present DataFlow, a unified and extensible LLM-driven data preparation framework. DataFlow is designed with system-level abstractions that enable modular, reusable, and composable data transformations, and provides a PyTorch-style pipeline construction API for building debuggable and optimizable dataflows. The framework consists of nearly 200 reusable operators and six domain-general pipelines spanning text, mathematical reasoning, code, Text-to-SQL, agentic RAG, and large-scale knowledge extraction. To further improve usability, we introduce DataFlow-Agent, which automatically translates natural-language specifications into executable pipelines via operator synthesis, pipeline planning, and iterative verification. Across six representative use cases, DataFlow consistently improves downstream LLM performance. Our math, code, and text pipelines outperform curated human datasets and specialized synthetic baselines, achieving up to +3\% execution accuracy in Text-to-SQL over SynSQL, +7\% average improvements on code benchmarks, and 1--3 point gains on MATH, GSM8K, and AIME. Moreover, a unified 10K-sample dataset produced by DataFlow enables base models to surpass counterparts trained on 1M Infinity-Instruct data. These results demonstrate that DataFlow provides a practical and high-performance substrate for reliable, reproducible, and scalable LLM data preparation, and establishes a system-level foundation for future data-centric AI development.
16. 【2512.16649】JustRL: Scaling a 1.5B LLM with a Simple RL Recipe
链接:https://arxiv.org/abs/2512.16649
作者:Bingxiang He,Zekai Qu,Zeyuan Liu,Yinghao Chen,Yuxin Zuo,Cheng Qian,Kaiyan Zhang,Weize Chen,Chaojun Xiao,Ganqu Cui,Ning Ding,Zhiyuan Liu
类目:Computation and Language (cs.CL)
关键词:curriculum learning strategies, multi-stage training pipelines, dynamic hyperparameter schedules, Recent advances, large language models
备注: 12 pages, 3 figures
点击查看摘要
Abstract:Recent advances in reinforcement learning for large language models have converged on increasing complexity: multi-stage training pipelines, dynamic hyperparameter schedules, and curriculum learning strategies. This raises a fundamental question: \textbf{Is this complexity necessary?} We present \textbf{JustRL}, a minimal approach using single-stage training with fixed hyperparameters that achieves state-of-the-art performance on two 1.5B reasoning models (54.9\% and 64.3\% average accuracy across nine mathematical benchmarks) while using 2$\times$ less compute than sophisticated approaches. The same hyperparameters transfer across both models without tuning, and training exhibits smooth, monotonic improvement over 4,000+ steps without the collapses or plateaus that typically motivate interventions. Critically, ablations reveal that adding ``standard tricks'' like explicit length penalties and robust verifiers may degrade performance by collapsing exploration. These results suggest that the field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline. We release our models and code to establish a simple, validated baseline for the community.
17. 【2512.16602】Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
链接:https://arxiv.org/abs/2512.16602
作者:Iker García-Ferrero,David Montero,Roman Orus
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Large Language, exercise fine-grained control, Language Models refusal, control over Large
备注:
点击查看摘要
Abstract:We introduce Refusal Steering, an inference-time method to exercise fine-grained control over Large Language Models refusal behaviour on politically sensitive topics without retraining. We replace fragile pattern-based refusal detection with an LLM-as-a-judge that assigns refusal confidence scores and we propose a ridge-regularized variant to compute steering vectors that better isolate the refusal--compliance direction. On Qwen3-Next-80B-A3B-Thinking, our method removes the refusal behaviour of the model around politically sensitive topics while maintaining safety on JailbreakBench and near-baseline performance on general benchmarks. The approach generalizes across 4B and 80B models and can also induce targeted refusals when desired. We analize the steering vectors and show that refusal signals concentrate in deeper layers of the transformer and are distributed across many dimensions. Together, these results demonstrate that activation steering can remove political refusal behaviour while retaining safety alignment for harmful content, offering a practical path to controllable, transparent moderation at inference time.
18. 【2512.16553】Needle in the Web: A Benchmark for Retrieving Targeted Web Pages in the Wild
链接:https://arxiv.org/abs/2512.16553
作者:Yumeng Wang,Tianyu Fan,Lingrui Xu,Chao Huang
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, sophisticated agents capable, complex real-world tasks, live web content
备注: Data and code are available at [this https URL](https://github.com/Tango-Whiskyman/Needle_in_the_Web)
点击查看摘要
Abstract:Large Language Models (LLMs) have evolved from simple chatbots into sophisticated agents capable of automating complex real-world tasks, where browsing and reasoning over live web content is key to assessing retrieval and cognitive skills. Existing benchmarks like BrowseComp and xBench-DeepSearch emphasize complex reasoning searches requiring multi-hop synthesis but neglect Fuzzy Exploratory Search, namely queries that are vague and multifaceted, where users seek the most relevant webpage rather than a single factual answer. To address this gap, we introduce Needle in the Web, a novel benchmark specifically designed to evaluate modern search agents and LLM-based systems on their ability to retrieve and reason over real-world web content in response to ambiguous, exploratory queries under varying levels of difficulty. Needle in the Web comprises 663 questions spanning seven distinct domains. To ensure high query quality and answer uniqueness, we employ a flexible methodology that reliably generates queries of controllable difficulty based on factual claims of web contents. We benchmark three leading LLMs and three agent-based search systems on Needle in the Web, finding that most models struggle: many achieve below 35% accuracy, and none consistently excel across domains or difficulty levels. These findings reveal that Needle in the Web presents a significant challenge for current search systems and highlights the open problem of effective fuzzy retrieval under semantic ambiguity.
19. 【2512.16541】UM_FHS at the CLEF 2025 SimpleText Track: Comparing No-Context and Fine-Tune Approaches for GPT-4.1 Models in Sentence and Document-Level Text Simplification
链接:https://arxiv.org/abs/2512.16541
作者:Primoz Kocbek,Gregor Stiglic
类目:Computation and Language (cs.CL)
关键词:SimpleText track Task, track Task, SimpleText track, addressing both sentenceand, work describes
备注: 10 pages, 3 tables. CLEF 2025 Working Notes, 9 to 12 September 2025, Madrid, Spain
点击查看摘要
Abstract:This work describes our submission to the CLEF 2025 SimpleText track Task 1, addressing both sentenceand document-level simplification of scientific texts. The methodology centered on using the gpt-4.1, gpt-4.1mini, and gpt-4.1-nano models from OpenAI. Two distinct approaches were compared: a no-context method relying on prompt engineering and a fine-tuned (FT) method across models. The gpt-4.1-mini model with no-context demonstrated robust performance at both levels of simplification, while the fine-tuned models showed mixed results, highlighting the complexities of simplifying text at different granularities, where gpt-4.1-nano-ft performance stands out at document-level simplification in one case.
20. 【2512.16530】Plain language adaptations of biomedical text using LLMs: Comparision of evaluation metrics
链接:https://arxiv.org/abs/2512.16530
作者:Primoz Kocbek,Leon Kopitar,Gregor Stiglic
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, enhance health literacy, simplifying biomedical texts, application of Large, Large Language
备注: 5 pages, 1 figure
点击查看摘要
Abstract:This study investigated the application of Large Language Models (LLMs) for simplifying biomedical texts to enhance health literacy. Using a public dataset, which included plain language adaptations of biomedical abstracts, we developed and evaluated several approaches, specifically a baseline approach using a prompt template, a two AI agent approach, and a fine-tuning approach. We selected OpenAI gpt-4o and gpt-4o mini models as baselines for further research. We evaluated our approaches with quantitative metrics, such as Flesch-Kincaid grade level, SMOG Index, SARI, and BERTScore, G-Eval, as well as with qualitative metric, more precisely 5-point Likert scales for simplicity, accuracy, completeness, brevity. Results showed a superior performance of gpt-4o-mini and an underperformance of FT approaches. G-Eval, a LLM based quantitative metric, showed promising results, ranking the approaches similarly as the qualitative metric.
21. 【2512.16445】opic Modelling Black Box Optimization
链接:https://arxiv.org/abs/2512.16445
作者:Roman Akramov,Artem Khamatullin,Svetlana Glazyrina,Maksim Kryzhanovskiy,Roman Ischenko
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
关键词:Latent Dirichlet Allocation, Dirichlet Allocation, Latent Dirichlet, key design decision, black-box optimization
备注:
点击查看摘要
Abstract:Choosing the number of topics $T$ in Latent Dirichlet Allocation (LDA) is a key design decision that strongly affects both the statistical fit and interpretability of topic models. In this work, we formulate the selection of $T$ as a discrete black-box optimization problem, where each function evaluation corresponds to training an LDA model and measuring its validation perplexity. Under a fixed evaluation budget, we compare four families of optimizers: two hand-designed evolutionary methods - Genetic Algorithm (GA) and Evolution Strategy (ES) - and two learned, amortized approaches, Preferential Amortized Black-Box Optimization (PABBO) and Sharpness-Aware Black-Box Optimization (SABBO). Our experiments show that, while GA, ES, PABBO, and SABBO eventually reach a similar band of final perplexity, the amortized optimizers are substantially more sample- and time-efficient. SABBO typically identifies a near-optimal topic number after essentially a single evaluation, and PABBO finds competitive configurations within a few evaluations, whereas GA and ES require almost the full budget to approach the same region.
22. 【2512.16439】From Essence to Defense: Adaptive Semantic-aware Watermarking for Embedding-as-a-Service Copyright Protection
链接:https://arxiv.org/abs/2512.16439
作者:Hao Li,Yubing Ren,Yanan Cao,Yingjie Li,Fang Fang,Xuebin Wang
类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)
关键词:large language models, natural language understanding, successful commercial paradigm, large language, language models
备注:
点击查看摘要
Abstract:Benefiting from the superior capabilities of large language models in natural language understanding and generation, Embeddings-as-a-Service (EaaS) has emerged as a successful commercial paradigm on the web platform. However, prior studies have revealed that EaaS is vulnerable to imitation attacks. Existing methods protect the intellectual property of EaaS through watermarking techniques, but they all ignore the most important properties of embedding: semantics, resulting in limited harmlessness and stealthiness. To this end, we propose SemMark, a novel semantic-based watermarking paradigm for EaaS copyright protection. SemMark employs locality-sensitive hashing to partition the semantic space and inject semantic-aware watermarks into specific regions, ensuring that the watermark signals remain imperceptible and diverse. In addition, we introduce the adaptive watermark weight mechanism based on the local outlier factor to preserve the original embedding distribution. Furthermore, we propose Detect-Sampling and Dimensionality-Reduction attacks and construct four scenarios to evaluate the watermarking method. Extensive experiments are conducted on four popular NLP datasets, and SemMark achieves superior verifiability, diversity, stealthiness, and harmlessness.
23. 【2512.16401】Bridging the Reality Gap: Efficient Adaptation of ASR systems for Challenging Low-Resource Domains
链接:https://arxiv.org/abs/2512.16401
作者:Darshil Chauhan,Adityasinh Solanki,Vansh Patel,Kanav Kapoor,Ritvik Jain,Aditya Bansal,Dhruv Kumar,Prateek Narang
类目:Computation and Language (cs.CL)
关键词:Automatic Speech Recognition, Automatic Speech, Speech Recognition, holds immense potential, digitizing handwritten prescriptions
备注:
点击查看摘要
Abstract:Automatic Speech Recognition (ASR) holds immense potential to streamline clinical documentation, such as digitizing handwritten prescriptions and reports, thereby increasing patient throughput and reducing costs in resource-constrained sectors like rural healthcare. However, realizing this utility is currently obstructed by significant technical barriers: strict data privacy constraints, limited computational resources, and severe acoustic domain shifts. We quantify this gap by showing that a robust multilingual model (IndicWav2Vec) degrades to a stark 40.94% Word Error Rate (WER) when deployed on real-world clinical audio (Gram Vaani), rendering it unusable for practical applications. To address these challenges and bring ASR closer to deployment, we propose an efficient, privacy-preserving adaptation framework. We employ Low-Rank Adaptation (LoRA) to enable continual learning from incoming data streams directly on edge devices, ensuring patient data confidentiality. Our strategy yields a 17.1% relative improvement in WER on the target domain. Furthermore, by integrating multi-domain experience replay, we reduce catastrophic forgetting by 47% compared to naive adaptation. These results demonstrate a viable pathway for building reliable, self-improving ASR systems that can operate effectively within the constraints of high-impact real-world environments.
24. 【2512.16378】Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs
链接:https://arxiv.org/abs/2512.16378
作者:Sara Papi,Javier Garcia Gilabert,Zachary Hopton,Vilém Zouhar,Carlos Escolano,Gerard I. Gállego,Jorge Iranzo-Sánchez,Ahrii Kim,Dominik Macháček,Patricia Schmidtova,Maike Züfle
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
关键词:bypassing traditional transcription-based, spoken language directly, Large Language Models, traditional transcription-based pipelines, expand beyond text
备注: Project available at [this https URL](https://github.com/sarapapi/hearing2translate)
点击查看摘要
Abstract:As Large Language Models (LLMs) expand beyond text, integrating speech as a native modality has given rise to SpeechLLMs, which aim to translate spoken language directly, thereby bypassing traditional transcription-based pipelines. Whether this integration improves speech-to-text translation quality over established cascaded architectures, however, remains an open question. We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 5 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFM), with multilingual LLMs. Our analysis spans 16 benchmarks, 13 language pairs, and 9 challenging conditions, including disfluent, noisy, and long-form speech. Across this extensive evaluation, we find that cascaded systems remain the most reliable overall, while current SpeechLLMs only match cascades in selected settings and SFMs lag behind both, highlighting that integrating an LLM, either within the model or in a pipeline, is essential for high-quality speech translation.
25. 【2512.16323】Hacking Neural Evaluation Metrics with Single Hub Text
链接:https://arxiv.org/abs/2512.16323
作者:Hiroyuki Deguchi,Katsuki Chousa,Yusuke Sakai
类目:Computation and Language (cs.CL)
关键词:Strongly human-correlated evaluation, Strongly human-correlated, human-correlated evaluation metrics, evaluation metrics serve, evaluation metrics
备注:
点击查看摘要
Abstract:Strongly human-correlated evaluation metrics serve as an essential compass for the development and improvement of generation models and must be highly reliable and robust. Recent embedding-based neural text evaluation metrics, such as COMET for translation tasks, are widely used in both research and development fields. However, there is no guarantee that they yield reliable evaluation results due to the black-box nature of neural networks. To raise concerns about the reliability and safety of such metrics, we propose a method for finding a single adversarial text in the discrete space that is consistently evaluated as high-quality, regardless of the test cases, to identify the vulnerabilities in evaluation metrics. The single hub text found with our method achieved 79.1 COMET% and 67.8 COMET% in the WMT'24 English-to-Japanese (En--Ja) and English-to-German (En--De) translation tasks, respectively, outperforming translations generated individually for each source sentence by using M2M100, a general translation model. Furthermore, we also confirmed that the hub text found with our method generalizes across multiple language pairs such as Ja--En and De--En.
26. 【2512.16310】Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation
链接:https://arxiv.org/abs/2512.16310
作者:Yuxuan Qiao,Dongqin Liu,Hongchang Yang,Wei Zhou,Songlin Hu
类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Models, Driven by Large, Large Language, autonomous agents due, Tools Orchestration Privacy
备注:
点击查看摘要
Abstract:Driven by Large Language Models, the single-agent, multi-tool architecture has become a popular paradigm for autonomous agents due to its simplicity and effectiveness. However, this architecture also introduces a new and severe privacy risk, which we term Tools Orchestration Privacy Risk (TOP-R), where an agent, to achieve a benign user goal, autonomously aggregates information fragments across multiple tools and leverages its reasoning capabilities to synthesize unexpected sensitive information. We provide the first systematic study of this risk. First, we establish a formal framework, attributing the risk's root cause to the agent's misaligned objective function: an overoptimization for helpfulness while neglecting privacy awareness. Second, we construct TOP-Bench, comprising paired leakage and benign scenarios, to comprehensively evaluate this risk. To quantify the trade-off between safety and robustness, we introduce the H-Score as a holistic metric. The evaluation results reveal that TOP-R is a severe risk: the average Risk Leakage Rate (RLR) of eight representative models reaches 90.24%, while the average H-Score is merely 0.167, with no model exceeding 0.3. Finally, we propose the Privacy Enhancement Principle (PEP) method, which effectively mitigates TOP-R, reducing the Risk Leakage Rate to 46.58% and significantly improving the H-Score to 0.624. Our work reveals both a new class of risk and inherent structural limitations in current agent architectures, while also offering feasible mitigation strategies.
27. 【2512.16301】Adaptation of Agentic AI
链接:https://arxiv.org/abs/2512.16301
作者:Pengcheng Jiang,Jiacheng Lin,Zhiyi Shi,Zifeng Wang,Luxi He,Yichen Wu,Ming Zhong,Peiyang Song,Qizheng Zhang,Heng Wang,Xueqiang Xu,Hanwen Xu,Pengrui Han,Dylan Zhang,Jiashuo Sun,Chaoqi Yang,Kun Qian,Tian Wang,Changran Hu,Manling Li,Quanzheng Li,Hao Peng,Sheng Wang,Jingbo Shang,Chao Zhang,Jiaxuan You,Liyuan Liu,Pan Lu,Yu Zhang,Heng Ji,Yejin Choi,Dawn Song,Jimeng Sun,Jiawei Han
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:perform increasingly complex, Cutting-edge agentic, adapted to plan, specialized tasks, interact with external
备注:
点击查看摘要
Abstract:Cutting-edge agentic AI systems are built on foundation models that can be adapted to plan, reason, and interact with external tools to perform increasingly complex and specialized tasks. As these systems grow in capability and scope, adaptation becomes a central mechanism for improving performance, reliability, and generalization. In this paper, we unify the rapidly expanding research landscape into a systematic framework that spans both agent adaptations and tool adaptations. We further decompose these into tool-execution-signaled and agent-output-signaled forms of agent adaptation, as well as agent-agnostic and agent-supervised forms of tool adaptation. We demonstrate that this framework helps clarify the design space of adaptation strategies in agentic AI, makes their trade-offs explicit, and provides practical guidance for selecting or switching among strategies during system design. We then review the representative approaches in each category, analyze their strengths and limitations, and highlight key open challenges and future opportunities. Overall, this paper aims to offer a conceptual foundation and practical roadmap for researchers and practitioners seeking to build more capable, efficient, and reliable agentic AI systems.
28. 【2512.16287】Evaluating OpenAI GPT Models for Translation of Endangered Uralic Languages: A Comparison of Reasoning and Non-Reasoning Architectures
链接:https://arxiv.org/abs/2512.16287
作者:Yehor Tereshchenko,Mika Hämäläinen,Svitlana Myroniuk
类目:Computation and Language (cs.CL)
关键词:Large Language Models, evaluation of Large, Large Language, low-resource Uralic languages, tasks has primarily
备注: IWCLUL 2025
点击查看摘要
Abstract:The evaluation of Large Language Models (LLMs) for translation tasks has primarily focused on high-resource languages, leaving a significant gap in understanding their performance on low-resource and endangered languages. This study presents a comprehensive comparison of OpenAI's GPT models, specifically examining the differences between reasoning and non-reasoning architectures for translating between Finnish and four low-resource Uralic languages: Komi-Zyrian, Moksha, Erzya, and Udmurt. Using a parallel corpus of literary texts, we evaluate model willingness to attempt translation through refusal rate analysis across different model architectures. Our findings reveal significant performance variations between reasoning and non-reasoning models, with reasoning models showing 16 percentage points lower refusal rates. The results provide valuable insights for researchers and practitioners working with Uralic languages and contribute to the broader understanding of reasoning model capabilities for endangered language preservation.
29. 【2512.16279】QuadSentinel: Sequent Safety for Machine-Checkable Control in Multi-agent Systems
链接:https://arxiv.org/abs/2512.16279
作者:Yiliu Yang,Yilei Jiang,Qunzhong Wang,Yingshui Tan,Xiaoyong Zhu,Sherman S.M. Chow,Bo Zheng,Xiangyu Yue
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:solve complex tasks, large language model-based, Safety risks arise, model-based agents solve, agents solve complex
备注: Preprint
点击查看摘要
Abstract:Safety risks arise as large language model-based agents solve complex tasks with tools, multi-step plans, and inter-agent messages. However, deployer-written policies in natural language are ambiguous and context dependent, so they map poorly to machine-checkable rules, and runtime enforcement is unreliable. Expressing safety policies as sequents, we propose \textsc{QuadSentinel}, a four-agent guard (state tracker, policy verifier, threat watcher, and referee) that compiles these policies into machine-checkable rules built from predicates over observable state and enforces them online. Referee logic plus an efficient top-$k$ predicate updater keeps costs low by prioritizing checks and resolving conflicts hierarchically. Measured on ST-WebAgentBench (ICML CUA~'25) and AgentHarm (ICLR~'25), \textsc{QuadSentinel} improves guardrail accuracy and rule recall while reducing false positives. Against single-agent baselines such as ShieldAgent (ICML~'25), it yields better overall safety control. Near-term deployments can adopt this pattern without modifying core agents by keeping policies separate and machine-checkable. Our code will be made publicly available at this https URL.
30. 【2512.16248】Sigma-Moe-Tiny Technical Report
链接:https://arxiv.org/abs/2512.16248
作者:Qingguo Hu,Zhenghao Lin,Ziyue Yang,Yucheng Ding,Xiao Liu,Yuting Jiang,Ruizhe Wang,Tianyu Chen,Zhongxin Guo,Yifan Xiong,Rui Gao,Lei Qu,Jinsong Su,Peng Cheng,Yeyun Gong
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:foundation models due, powerful scalability, promising paradigm, paradigm for foundation, efficient and powerful
备注:
点击查看摘要
Abstract:Mixture-of-Experts (MoE) has emerged as a promising paradigm for foundation models due to its efficient and powerful scalability. In this work, we present Sigma-MoE-Tiny, an MoE language model that achieves the highest sparsity compared to existing open-source models. Sigma-MoE-Tiny employs fine-grained expert segmentation with up to 96 experts per layer, while activating only one expert for each token, resulting in 20B total parameters with just 0.5B activated. The major challenge introduced by such extreme sparsity lies in expert load balancing. We find that the widely-used load balancing loss tends to become ineffective in the lower layers under this setting. To address this issue, we propose a progressive sparsification schedule aiming to balance expert utilization and training stability. Sigma-MoE-Tiny is pre-trained on a diverse and high-quality corpus, followed by post-training to further unlock its capabilities. The entire training process remains remarkably stable, with no occurrence of irrecoverable loss spikes. Comprehensive evaluations reveal that, despite activating only 0.5B parameters, Sigma-MoE-Tiny achieves top-tier performance among counterparts of comparable or significantly larger scale. In addition, we provide an in-depth discussion of load balancing in highly sparse MoE models, offering insights for advancing sparsity in future MoE architectures. Project page: this https URL Code: this https URL
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2512.16248 [cs.CL]
(or
arXiv:2512.16248v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2512.16248
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
31. 【2512.16229】LoPA: Scaling dLLM Inference via Lookahead Parallel Decoding
链接:https://arxiv.org/abs/2512.16229
作者:Chenkai Xu,Yijie Jin,Jiajun Li,Yi Tu,Guoping Long,Dandan Tu,Tianqi Hou,Junchi Yan,Zhijie Deng
类目:Computation and Language (cs.CL)
关键词:Diffusion Large Language, Large Language Models, Diffusion Large, Large Language, demonstrated significant potential
备注:
点击查看摘要
Abstract:Diffusion Large Language Models (dLLMs) have demonstrated significant potential for high-speed inference. However, current confidence-driven decoding strategies are constrained by limited parallelism, typically achieving only 1--3 tokens per forward pass (TPF). In this work, we identify that the degree of parallelism during dLLM inference is highly sensitive to the Token Filling Order (TFO). Then, we introduce Lookahead PArallel Decoding LoPA, a training-free, plug-and-play algorithm, to identify a superior TFO and hence accelerate inference. LoPA concurrently explores distinct candidate TFOs via parallel branches, and selects the one with the highest potential for future parallelism based on branch confidence. We apply LoPA to the state-of-the-art D2F model and observe a substantial enhancement in decoding efficiency. Notably, LoPA increases the TPF of D2F-Dream to 10.1 on the GSM8K while maintaining performance superior to the Dream baseline. Furthermore, to facilitate this unprecedented degree of parallelism, we develop a specialized multi-device inference system featuring Branch Parallelism (BP), which achieves a single-sample throughput of 1073.9 tokens per second under multi-GPU deployment. The code is available at this https URL.
32. 【2512.16227】An Information-Theoretic Framework for Robust Large Language Model Editing
链接:https://arxiv.org/abs/2512.16227
作者:Qizhou Chen,Chengyu Wang,Taolin Zhang,Xiaofeng He
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Large Language, enabling transformative advances, Language Models, tools in science
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have become indispensable tools in science, technology, and society, enabling transformative advances across diverse fields. However, errors or outdated information within these models can undermine their accuracy and restrict their safe deployment. Developing efficient strategies for updating model knowledge without the expense and disruption of full retraining remains a critical challenge. Current model editing techniques frequently struggle to generalize corrections beyond narrow domains, leading to unintended consequences and limiting their practical impact. Here, we introduce a novel framework for editing LLMs, grounded in information bottleneck theory. This approach precisely compresses and isolates the essential information required for generalizable knowledge correction while minimizing disruption to unrelated model behaviors. Building upon this foundation, we present the Information Bottleneck Knowledge Editor (IBKE), which leverages compact latent representations to guide gradient-based updates, enabling robust and broadly applicable model editing. We validate IBKE's effectiveness across multiple LLM architectures and standard benchmark tasks, demonstrating state-of-the-art accuracy and improved generality and specificity of edits. These findings establish a theoretically principled and practical paradigm for open-domain knowledge editing, advancing the utility and trustworthiness of LLMs in real-world applications.
33. 【2512.16189】Mitigating Hallucinations in Healthcare LLMs with Granular Fact-Checking and Domain-Specific Adaptation
链接:https://arxiv.org/abs/2512.16189
作者:Musarrat Zeba,Abdullah Al Mamun,Kishoar Jahan Tithee,Debopom Sutradhar,Mohaimenul Azam Khan Raiaan,Saddam Mukta,Reem E. Mohamed,Md Rafiqul Islam,Yakub Sebastian,Mukhtar Hussain,Sami Azam
类目:Computation and Language (cs.CL)
关键词:cases involving decision-making, reliable and accurate, patient safety, cases involving, involving decision-making
备注:
点击查看摘要
Abstract:In healthcare, it is essential for any LLM-generated output to be reliable and accurate, particularly in cases involving decision-making and patient safety. However, the outputs are often unreliable in such critical areas due to the risk of hallucinated outputs from the LLMs. To address this issue, we propose a fact-checking module that operates independently of any LLM, along with a domain-specific summarization model designed to minimize hallucination rates. Our model is fine-tuned using Low-Rank Adaptation (LoRa) on the MIMIC III dataset and is paired with the fact-checking module, which uses numerical tests for correctness and logical checks at a granular level through discrete logic in natural language processing (NLP) to validate facts against electronic health records (EHRs). We trained the LLM model on the full MIMIC-III dataset. For evaluation of the fact-checking module, we sampled 104 summaries, extracted them into 3,786 propositions, and used these as facts. The fact-checking module achieves a precision of 0.8904, a recall of 0.8234, and an F1-score of 0.8556. Additionally, the LLM summary model achieves a ROUGE-1 score of 0.5797 and a BERTScore of 0.9120 for summary quality.
34. 【2512.16183】A Domain-Adapted Pipeline for Structured Information Extraction from Police Incident Announcements on Social Media
链接:https://arxiv.org/abs/2512.16183
作者:Mengfan Shen,Kangqi Song,Xindi Wang,Wei Jia,Tao Wang,Ziqiang Han
类目:Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:presents considerable challenges, considerable challenges due, police incident announcements, accurate data processing, Structured information extraction
备注: 41 pages,3figures and 9 tables
点击查看摘要
Abstract:Structured information extraction from police incident announcements is crucial for timely and accurate data processing, yet presents considerable challenges due to the variability and informal nature of textual sources such as social media posts. To address these challenges, we developed a domain-adapted extraction pipeline that leverages targeted prompt engineering with parameter-efficient fine-tuning of the Qwen2.5-7B model using Low-Rank Adaptation (LoRA). This approach enables the model to handle noisy, heterogeneous text while reliably extracting 15 key fields, including location, event characteristics, and impact assessment, from a high-quality, manually annotated dataset of 4,933 instances derived from 27,822 police briefing posts on Chinese Weibo (2019-2020). Experimental results demonstrated that LoRA-based fine-tuning significantly improved performance over both the base and instruction-tuned models, achieving an accuracy exceeding 98.36% for mortality detection and Exact Match Rates of 95.31% for fatality counts and 95.54% for province-level location extraction. The proposed pipeline thus provides a validated and efficient solution for multi-task structured information extraction in specialized domains, offering a practical framework for transforming unstructured text into reliable structured data in social science research.
35. 【2512.16182】DualGuard: Dual-stream Large Language Model Watermarking Defense against Paraphrase and Spoofing Attack
链接:https://arxiv.org/abs/2512.16182
作者:Hao Li,Yubing Ren,Yanan Cao,Yingjie Li,Fang Fang,Shi Wang,Li Guo
类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)
关键词:cloud-based services, web platforms, large language models, rapid development, development of cloud-based
备注:
点击查看摘要
Abstract:With the rapid development of cloud-based services, large language models (LLMs) have become increasingly accessible through various web platforms. However, this accessibility has also led to growing risks of model abuse. LLM watermarking has emerged as an effective approach to mitigate such misuse and protect intellectual property. Existing watermarking algorithms, however, primarily focus on defending against paraphrase attacks while overlooking piggyback spoofing attacks, which can inject harmful content, compromise watermark reliability, and undermine trust in attribution. To address this limitation, we propose DualGuard, the first watermarking algorithm capable of defending against both paraphrase and spoofing attacks. DualGuard employs the adaptive dual-stream watermarking mechanism, in which two complementary watermark signals are dynamically injected based on the semantic content. This design enables DualGuard not only to detect but also to trace spoofing attacks, thereby ensuring reliable and trustworthy watermark detection. Extensive experiments conducted across multiple datasets and language models demonstrate that DualGuard achieves excellent detectability, robustness, traceability, and text quality, effectively advancing the state of LLM watermarking for real-world applications.
36. 【2512.16171】Science Consultant Agent
链接:https://arxiv.org/abs/2512.16171
作者:Karthikeyan K,Philip Wu,Xin Tang,Alexandre Alves
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:web-based Artificial Intelligence, Science Consultant Agent, Artificial Intelligence, effective modeling strategy, Science Consultant
备注:
点击查看摘要
Abstract:The Science Consultant Agent is a web-based Artificial Intelligence (AI) tool that helps practitioners select and implement the most effective modeling strategy for AI-based solutions. It operates through four core components: Questionnaire, Smart Fill, Research-Guided Recommendation, and Prototype Builder. By combining structured questionnaires, literature-backed solution recommendations, and prototype generation, the Science Consultant Agent accelerates development for everyone from Product Managers and Software Developers to Researchers. The full pipeline is illustrated in Figure 1.
37. 【2512.16147】Decoding Fake Narratives in Spreading Hateful Stories: A Dual-Head RoBERTa Model with Multi-Task Learning
链接:https://arxiv.org/abs/2512.16147
作者:Yash Bhaskar,Sankalp Bahad,Parameswari Krishnamurthy
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:enabling global connectivity, Social media platforms, including hate speech, global connectivity, hate speech
备注: Accepted Paper, Anthology ID: [this http URL](http://2024.icon) -fauxhate.3, 4 pages, 1 figure, 1 table
点击查看摘要
Abstract:Social media platforms, while enabling global connectivity, have become hubs for the rapid spread of harmful content, including hate speech and fake narratives \cite{davidson2017automated, shu2017fake}. The Faux-Hate shared task focuses on detecting a specific phenomenon: the generation of hate speech driven by fake narratives, termed Faux-Hate. Participants are challenged to identify such instances in code-mixed Hindi-English social media text. This paper describes our system developed for the shared task, addressing two primary sub-tasks: (a) Binary Faux-Hate detection, involving fake and hate speech classification, and (b) Target and Severity prediction, categorizing the intended target and severity of hateful content. Our approach combines advanced natural language processing techniques with domain-specific pretraining to enhance performance across both tasks. The system achieved competitive results, demonstrating the efficacy of leveraging multi-task learning for this complex problem.
38. 【2512.16145】MRG-R1: Reinforcement Learning for Clinically Aligned Medical Report Generation
链接:https://arxiv.org/abs/2512.16145
作者:Pengyu Wang,Shuchang Ye,Usman Naseem,Jinman Kim
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:automatically derive radiology-style, derive radiology-style reports, Medical report generation, aims to automatically, Relative Policy Optimization
备注: 12 pages
点击查看摘要
Abstract:Medical report generation (MRG) aims to automatically derive radiology-style reports from medical images to aid in clinical decision-making. However, existing methods often generate text that mimics the linguistic style of radiologists but fails to guarantee clinical correctness, because they are trained on token-level objectives which focus on word-choice and sentence structure rather than actual medical accuracy. We propose a semantic-driven reinforcement learning (SRL) method for medical report generation, adopted on a large vision-language model (LVLM). SRL adopts Group Relative Policy Optimization (GRPO) to encourage clinical-correctness-guided learning beyond imitation of language style. Specifically, we optimise a report-level reward: a margin-based cosine similarity (MCCS) computed between key radiological findings extracted from generated and reference reports, thereby directly aligning clinical-label agreement and improving semantic correctness. A lightweight reasoning format constraint further guides the model to generate structured "thinking report" outputs. We evaluate Medical Report Generation with Sematic-driven Reinforment Learning (MRG-R1), on two datasets: IU X-Ray and MIMIC-CXR using clinical efficacy (CE) metrics. MRG-R1 achieves state-of-the-art performance with CE-F1 51.88 on IU X-Ray and 40.39 on MIMIC-CXR. We found that the label-semantic reinforcement is better than conventional token-level supervision. These results indicate that optimizing a clinically grounded, report-level reward rather than token overlap,meaningfully improves clinical correctness. This work is a prior to explore semantic-reinforcement in supervising medical correctness in medical Large vision-language model(Med-LVLM) training.
39. 【2512.16125】Convolutional Lie Operator for Sentence Classification
链接:https://arxiv.org/abs/2512.16125
作者:Daniela N. Rim,Heeyoul Choi
类目:Computation and Language (cs.CL)
关键词:Convolutional Neural Networks, Traditional Convolutional Neural, Convolutional Neural, Neural Networks, Convolutional-based sentence classifiers
备注: Proceedings of the 2024 8th International Conference on Natural Language Processing and Information Retrieval
点击查看摘要
Abstract:Traditional Convolutional Neural Networks have been successful in capturing local, position-invariant features in text, but their capacity to model complex transformation within language can be further explored. In this work, we explore a novel approach by integrating Lie Convolutions into Convolutional-based sentence classifiers, inspired by the ability of Lie group operations to capture complex, non-Euclidean symmetries. Our proposed models SCLie and DPCLie empirically outperform traditional Convolutional-based sentence classifiers, suggesting that Lie-based models relatively improve the accuracy by capturing transformations not commonly associated with language. Our findings motivate more exploration of new paradigms in language modeling.
40. 【2512.16059】ContextLeak: Auditing Leakage in Private In-Context Learning Methods
链接:https://arxiv.org/abs/2512.16059
作者:Jacob Choi,Shuying Cao,Xingjian Dong,Wang Bill Zhu,Robin Jia,Sai Praneeth Karimireddy
类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)
关键词:adapting Large Language, Large Language Models, Large Language, In-Context Learning, adapting Large
备注:
点击查看摘要
Abstract:In-Context Learning (ICL) has become a standard technique for adapting Large Language Models (LLMs) to specialized tasks by supplying task-specific exemplars within the prompt. However, when these exemplars contain sensitive information, reliable privacy-preserving mechanisms are essential to prevent unintended leakage through model outputs. Many privacy-preserving methods are proposed to protect the information leakage in the context, but there are less efforts on how to audit those methods. We introduce ContextLeak, the first framework to empirically measure the worst-case information leakage in ICL. ContextLeak uses canary insertion, embedding uniquely identifiable tokens in exemplars and crafting targeted queries to detect their presence. We apply ContextLeak across a range of private ICL techniques, both heuristic such as prompt-based defenses and those with theoretical guarantees such as Embedding Space Aggregation and Report Noisy Max. We find that ContextLeak tightly correlates with the theoretical privacy budget ($\epsilon$) and reliably detects leakage. Our results further reveal that existing methods often strike poor privacy-utility trade-offs, either leaking sensitive information or severely degrading performance.
41. 【2512.16041】Are We on the Right Way to Assessing LLM-as-a-Judge?
链接:https://arxiv.org/abs/2512.16041
作者:Yuanning Feng,Sinan Wang,Zhengxiang Cheng,Yao Wan,Dongping Chen
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:widely adopted, Sage, human, model training, supervised rewards
备注:
点击查看摘要
Abstract:LLM-as-a-Judge has been widely adopted as an evaluation method and served as supervised rewards in model training. However, existing benchmarks for LLM-as-a-Judge are mainly relying on human-annotated ground truth, which introduces human bias that undermines the assessment of reliability and imposes scalability constraints. To overcome these limitations, we introduce Sage, a novel evaluation suite that assesses the quality of LLM judges without necessitating any human annotation. Inspired by axioms of rational choice theory, Sage introduces two new lenses for measuring LLM-as-a-Judge: local self-consistency (pair-wise preference stability) and global logical consistency (transitivity across a full set of preferences). We curate a dataset of 650 questions by combining structured benchmark problems with real-world user queries. Our experiments demonstrate both the stability of our metrics and their high correlation with supervised benchmarks like LLMBar and RewardBench2, confirming Sage's reliability as an evaluation suite for the robustness and accuracy of LLM-as-a-Judge. Based on Sage, we reveal that current state-of-the-art LLMs exhibit significant reliability problems when acting as judges in both scoring and pairwise settings; even the top-performing models, Gemini-2.5-Pro and GPT-5, fail to maintain consistent preferences in nearly a quarter of difficult cases. We attribute this to a new phenomenon called situational preference, which explains why explicit rubrics or criteria can help the model judge consistently across answer pairs. Our further analysis shows that finetuned LLM-as-a-Judge is a feasible method to boost performance, and the panel-based judge as well as deep reasoning can enhance the judging consistency. We also find substantial inconsistency in human judgments, which indicates that human annotation may not be a reliable gold standard.
42. 【2512.16034】Examining the Utility of Self-disclosure Types for Modeling Annotators of Social Norms
链接:https://arxiv.org/abs/2512.16034
作者:Kieran Henderson,Kian Omoomi,Vasudha Varadarajan,Allison Lahnala,Charles Welch
类目:Computation and Language (cs.CL)
关键词:Recent work, predicting annotator labels, personal information, subjective tasks, form of persona
备注:
点击查看摘要
Abstract:Recent work has explored the use of personal information in the form of persona sentences or self-disclosures to improve modeling of individual characteristics and prediction of annotator labels for subjective tasks. The volume of personal information has historically been restricted and thus little exploration has gone into understanding what kind of information is most informative for predicting annotator labels. In this work, we categorize self-disclosure sentences and use them to build annotator models for predicting judgments of social norms. We perform several ablations and analyses to examine the impact of the type of information on our ability to predict annotation patterns. We find that demographics are more impactful than attitudes, relationships, and experiences. Generally, theory-based approaches worked better than automatic clusters. Contrary to previous work, only a small number of related comments are needed. Lastly, having a more diverse sample of annotator self-disclosures leads to the best performance.
43. 【2512.16029】Cross-Language Bias Examination in Large Language Models
链接:https://arxiv.org/abs/2512.16029
作者:Yuxuan Liang,Marwa Mahmoud
类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Models, Implicit Association Test, Association Test, prompt-based Implicit Association, Language Models
备注:
点击查看摘要
Abstract:This study introduces an innovative multilingual bias evaluation framework for assessing bias in Large Language Models, combining explicit bias assessment through the BBQ benchmark with implicit bias measurement using a prompt-based Implicit Association Test. By translating the prompts and word list into five target languages, English, Chinese, Arabic, French, and Spanish, we directly compare different types of bias across languages. The results reveal substantial gaps in bias across languages used in LLMs. For example, Arabic and Spanish consistently show higher levels of stereotype bias, while Chinese and English exhibit lower levels of bias. We also identify contrasting patterns across bias types. Age shows the lowest explicit bias but the highest implicit bias, emphasizing the importance of detecting implicit biases that are undetectable with standard benchmarks. These findings indicate that LLMs vary significantly across languages and bias dimensions. This study fills a key research gap by providing a comprehensive methodology for cross-lingual bias analysis. Ultimately, our work establishes a foundation for the development of equitable multilingual LLMs, ensuring fairness and effectiveness across diverse languages and cultures.
44. 【2512.15973】Dynamic Rank Reinforcement Learning for Adaptive Low-Rank Multi-Head Self Attention in Large Language Models
链接:https://arxiv.org/abs/2512.15973
作者:Caner Erden
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:Large Language Models, Language Models, Large Language, Dynamic Rank Reinforcement, Rank Reinforcement Learning
备注:
点击查看摘要
Abstract:We propose Dynamic Rank Reinforcement Learning (DR-RL), a novel framework that adaptively optimizes the low-rank factorization of Multi-Head Self-Attention (MHSA) in Large Language Models (LLMs) through the integration of reinforcement learning and online matrix perturbation theory. While traditional low-rank approximations often rely on static rank assumptions--limiting their flexibility across diverse input contexts--our method dynamically selects ranks based on real-time sequence dynamics, layer-specific sensitivities, and hardware constraints. The core innovation lies in an RL agent that formulates rank selection as a sequential policy optimization problem, where the reward function strictly balances attention fidelity against computational latency. Crucially, we employ online matrix perturbation bounds to enable incremental rank updates, thereby avoiding the prohibitive cost of full decomposition during inference. Furthermore, the integration of a lightweight Transformer-based policy network and batched Singular Value Decomposition (SVD) operations ensures scalable deployment on modern GPU architectures. Experiments demonstrate that DR-RL maintains downstream accuracy statistically equivalent to full-rank attention while significantly reducing Floating Point Operations (FLOPs), particularly in long-sequence regimes (L 4096). This work bridges the gap between adaptive efficiency and theoretical rigor in MHSA, offering a principled, mathematically grounded alternative to heuristic rank reduction techniques in resource-constrained deep learning. Source code and experiment logs are available at: this https URL
45. 【2512.15959】BRAID: Bounded Reasoning for Autonomous Inference and Decisions
链接:https://arxiv.org/abs/2512.15959
作者:Armağan Amcalar,Eyup Cinar
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Large Language, exhibit nonlinear relationships, Language Models, exhibit nonlinear
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) exhibit nonlinear relationships between performance, cost, and token usage. This paper presents a quantitative study on structured prompting using BRAID (Bounded Reasoning for Au tonomous Inference and Decisions) across multiple GPT model tiers, eval uated on the AdvancedIF, GSM-Hard, and the SCALE MultiChallenge benchmark datasets. BRAID introduces a bounded reasoning framework using Mermaid-based instruction graphs that enable models to reason struc turally rather than through unbounded natural-language token expansion. We show that structured machine-readable prompts substantially increase reasoning accuracy and cost efficiency for agents in production systems. The findings establish BRAID as an effective and scalable technique for optimizing inference efficiency in autonomous agent systems. All datasets and detailed result logs are available at this https URL.
46. 【2512.15926】DSO: Direct Steering Optimization for Bias Mitigation
链接:https://arxiv.org/abs/2512.15926
作者:Lucas Monteiro Paes,Nivedha Sivakumar,Yinong Oliver Wang,Masha Fedzechkina Donaldson,Luca Zappella,Nicholas Apostoloff
类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:visually impaired individuals, Generative models, identifying which person, impaired individuals, deployed to make
备注:
点击查看摘要
Abstract:Generative models are often deployed to make decisions on behalf of users, such as vision-language models (VLMs) identifying which person in a room is a doctor to help visually impaired individuals. Yet, VLM decisions are influenced by the perceived demographic attributes of people in the input, which can lead to biased outcomes like failing to identify women as doctors. Moreover, when reducing bias leads to performance loss, users may have varying needs for balancing bias mitigation with overall model capabilities, highlighting the demand for methods that enable controllable bias reduction during inference. Activation steering is a popular approach for inference-time controllability that has shown potential in inducing safer behavior in large language models (LLMs). However, we observe that current steering methods struggle to correct biases, where equiprobable outcomes across demographic groups are required. To address this, we propose Direct Steering Optimization (DSO) which uses reinforcement learning to find linear transformations for steering activations, tailored to mitigate bias while maintaining control over model performance. We demonstrate that DSO achieves state-of-the-art trade-off between fairness and capabilities on both VLMs and LLMs, while offering practitioners inference-time control over the trade-off. Overall, our work highlights the benefit of designing steering strategies that are directly optimized to control model behavior, providing more effective bias intervention than methods that rely on pre-defined heuristics for controllability.
47. 【2512.15925】Social Story Frames: Contextual Reasoning about Narrative Intent and Reception
链接:https://arxiv.org/abs/2512.15925
作者:Joel Mire,Maria Antoniak,Steven R. Wilson,Zexin Ma,Achyutarama R. Ganti,Andrew Piper,Maarten Sap
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
关键词:evokes rich interpretive, Reading stories evokes, stories evokes rich, rich interpretive, Reading stories
备注: Presented at IC2S2 2025; Under Review (ARR Oct 2025)
点击查看摘要
Abstract:Reading stories evokes rich interpretive, affective, and evaluative responses, such as inferences about narrative intent or judgments about characters. Yet, computational models of reader response are limited, preventing nuanced analyses. To address this gap, we introduce SocialStoryFrames, a formalism for distilling plausible inferences about reader response, such as perceived author intent, explanatory and predictive reasoning, affective responses, and value judgments, using conversational context and a taxonomy grounded in narrative theory, linguistic pragmatics, and psychology. We develop two models, SSF-Generator and SSF-Classifier, validated through human surveys (N=382 participants) and expert annotations, respectively. We conduct pilot analyses to showcase the utility of the formalism for studying storytelling at scale. Specifically, applying our models to SSF-Corpus, a curated dataset of 6,140 social media stories from diverse contexts, we characterize the frequency and interdependence of storytelling intents, and we compare and contrast narrative practices (and their diversity) across communities. By linking fine-grained, context-sensitive modeling with a generic taxonomy of reader responses, SocialStoryFrames enable new research into storytelling in online communities.
48. 【2512.15907】abReX : Tabular Referenceless eXplainable Evaluation
链接:https://arxiv.org/abs/2512.15907
作者:Tejas Anvekar,Juhna Park,Aparna Garimella,Vivek Gupta
类目:Computation and Language (cs.CL)
关键词:large language models, ignoring structure, language models, open challenge, limit generalization
备注:
点击查看摘要
Abstract:Evaluating the quality of tables generated by large language models (LLMs) remains an open challenge: existing metrics either flatten tables into text, ignoring structure, or rely on fixed references that limit generalization. We present TabReX, a reference-less, property-driven framework for evaluating tabular generation via graph-based reasoning. TabReX converts both source text and generated tables into canonical knowledge graphs, aligns them through an LLM-guided matching process, and computes interpretable, rubric-aware scores that quantify structural and factual fidelity. The resulting metric provides controllable trade-offs between sensitivity and specificity, yielding human-aligned judgments and cell-level error traces. To systematically asses metric robustness, we introduce TabReX-Bench, a large-scale benchmark spanning six domains and twelve planner-driven perturbation types across three difficulty tiers. Empirical results show that TabReX achieves the highest correlation with expert rankings, remains stable under harder perturbations, and enables fine-grained model-vs-prompt analysis establishing a new paradigm for trustworthy, explainable evaluation of structured generation systems.
49. 【2512.15885】Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models
链接:https://arxiv.org/abs/2512.15885
作者:Davide Caffagni,Sara Sarto,Marcella Cornia,Lorenzo Baraldi,Pier Luigi Dovesi,Shaghayegh Roohi,Mark Granroth-Wilding,Rita Cucchiara
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
关键词:tasks remains limited, recently demonstrated impressive, demonstrated impressive capabilities, Large Language Models, Multimodal Large Language
备注:
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) have recently demonstrated impressive capabilities in connecting vision and language, yet their proficiency in fundamental visual reasoning tasks remains limited. This limitation can be attributed to the fact that MLLMs learn visual understanding primarily from textual descriptions, which constitute a subjective and inherently incomplete supervisory signal. Furthermore, the modest scale of multimodal instruction tuning compared to massive text-only pre-training leads MLLMs to overfit language priors while overlooking visual details. To address these issues, we introduce JARVIS, a JEPA-inspired framework for self-supervised visual enhancement in MLLMs. Specifically, we integrate the I-JEPA learning paradigm into the standard vision-language alignment pipeline of MLLMs training. Our approach leverages frozen vision foundation models as context and target encoders, while training the predictor, implemented as the early layers of an LLM, to learn structural and semantic regularities from images without relying exclusively on language supervision. Extensive experiments on standard MLLM benchmarks show that JARVIS consistently improves performance on vision-centric benchmarks across different LLM families, without degrading multimodal reasoning abilities. Our source code is publicly available at: this https URL.
50. 【2512.15830】From Minutes to Days: Scaling Intracranial Speech Decoding with Supervised Pretraining
链接:https://arxiv.org/abs/2512.15830
作者:Linnea Evanson,Mingfang Zhang,Hubert Banville,Saarang Panchavati,Pierre Bourdillon,Jean-Rémi King
类目:ound (cs.SD); Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
关键词:limited neural recordings, neural recordings collected, highly controlled experiments, typically relied, relied on limited
备注: Linnea Evanson* and Mingfang (Lucy) Zhang* are joint first authors. Pierre Bourdillon** and Jean-Rémi King** are joint last authors
点击查看摘要
Abstract:Decoding speech from brain activity has typically relied on limited neural recordings collected during short and highly controlled experiments. Here, we introduce a framework to leverage week-long intracranial and audio recordings from patients undergoing clinical monitoring, effectively increasing the training dataset size by over two orders of magnitude. With this pretraining, our contrastive learning model substantially outperforms models trained solely on classic experimental data, with gains that scale log-linearly with dataset size. Analysis of the learned representations reveals that, while brain activity represents speech features, its global structure largely drifts across days, highlighting the need for models that explicitly account for cross-day variability. Overall, our approach opens a scalable path toward decoding and modeling brain representations in both real-life and controlled task settings.
51. 【2512.15798】DP-Bench: A Benchmark for Evaluating Data Product Creation Systems
链接:https://arxiv.org/abs/2512.15798
作者:Faisal Chowdhury,Sola Shirai,Sarthak Dash,Nandana Mihindukulasooriya,Horst Samulowitz
类目:Databases (cs.DB); Computation and Language (cs.CL)
关键词:specific business usecase, specific problem, addressing a specific, solving a specific, specific business
备注:
点击查看摘要
Abstract:A data product is created with the intention of solving a specific problem, addressing a specific business usecase or meeting a particular need, going beyond just serving data as a raw asset. Data products enable end users to gain greater insights about their data. Since it was first introduced over a decade ago, there has been considerable work, especially in industry, to create data products manually or semi-automatically. However, there exists hardly any benchmark to evaluate automatic data product creation. In this work, we present a benchmark, first of its kind, for this task. We call it DP-Bench. We describe how this benchmark was created by taking advantage of existing work in ELT (Extract-Load-Transform) and Text-to-SQL benchmarks. We also propose a number of LLM based approaches that can be considered as baselines for generating data products automatically. We make the DP-Bench and supplementary materials available in this https URL .
52. 【2512.15793】Explainable Ethical Assessment on Human Behaviors by Generating Conflicting Social Norms
链接:https://arxiv.org/abs/2512.15793
作者:Yuxi Sun,Wei Gao,Hongzhan Lin,Jing Ma,Wenxuan Zhang
类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:textit, commonsense rules, defined as shared, social norms, guided or constrained
备注: Acceppt by Asia-Pacific Chapter of the Association for Computational Linguistics (2025)
点击查看摘要
Abstract:Human behaviors are often guided or constrained by social norms, which are defined as shared, commonsense rules. For example, underlying an action ``\textit{report a witnessed crime}" are social norms that inform our conduct, such as ``\textit{It is expected to be brave to report crimes}''. Current AI systems that assess valence (i.e., support or oppose) of human actions by leveraging large-scale data training not grounded on explicit norms may be difficult to explain, and thus untrustworthy. Emulating human assessors by considering social norms can help AI models better understand and predict valence. While multiple norms come into play, conflicting norms can create tension and directly influence human behavior. For example, when deciding whether to ``\textit{report a witnessed crime}'', one may balance \textit{bravery} against \textit{self-protection}. In this paper, we introduce \textit{ClarityEthic}, a novel ethical assessment approach, to enhance valence prediction and explanation by generating conflicting social norms behind human actions, which strengthens the moral reasoning capabilities of language models by using a contrastive learning strategy. Extensive experiments demonstrate that our method outperforms strong baseline approaches, and human evaluations confirm that the generated social norms provide plausible explanations for the assessment of human behaviors.
53. 【2512.15792】A Systematic Analysis of Biases in Large Language Models
链接:https://arxiv.org/abs/2512.15792
作者:Xulang Zhang,Rui Mao,Erik Cambria
类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:supporting human decision-making, Large language models, human decision-making, Large language, rapidly become indispensable
备注:
点击查看摘要
Abstract:Large language models (LLMs) have rapidly become indispensable tools for acquiring information and supporting human decision-making. However, ensuring that these models uphold fairness across varied contexts is critical to their safe and responsible deployment. In this study, we undertake a comprehensive examination of four widely adopted LLMs, probing their underlying biases and inclinations across the dimensions of politics, ideology, alliance, language, and gender. Through a series of carefully designed experiments, we investigate their political neutrality using news summarization, ideological biases through news stance classification, tendencies toward specific geopolitical alliances via United Nations voting patterns, language bias in the context of multilingual story completion, and gender-related affinities as revealed by responses to the World Values Survey. Results indicate that while the LLMs are aligned to be neutral and impartial, they still show biases and affinities of different types.
54. 【2512.15791】Evaluation of AI Ethics Tools in Language Models: A Developers' Perspective Case Stud
链接:https://arxiv.org/abs/2512.15791
作者:Jhessica Silva,Diego A. B. Moreira,Gabriel O. dos Santos,Alef Ferreira,Helena Maia,Sandra Avila,Helio Pedrini
类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Artificial Intelligence, gained significant importance, significant importance due, simulating realistic conversations, text generation
备注: 7 figures, 11 tables. Accepted for publication in AI and Ethics
点击查看摘要
Abstract:In Artificial Intelligence (AI), language models have gained significant importance due to the widespread adoption of systems capable of simulating realistic conversations with humans through text generation. Because of their impact on society, developing and deploying these language models must be done responsibly, with attention to their negative impacts and possible harms. In this scenario, the number of AI Ethics Tools (AIETs) publications has recently increased. These AIETs are designed to help developers, companies, governments, and other stakeholders establish trust, transparency, and responsibility with their technologies by bringing accepted values to guide AI's design, development, and use stages. However, many AIETs lack good documentation, examples of use, and proof of their effectiveness in practice. This paper presents a methodology for evaluating AIETs in language models. Our approach involved an extensive literature survey on 213 AIETs, and after applying inclusion and exclusion criteria, we selected four AIETs: Model Cards, ALTAI, FactSheets, and Harms Modeling. For evaluation, we applied AIETs to language models developed for the Portuguese language, conducting 35 hours of interviews with their developers. The evaluation considered the developers' perspective on the AIETs' use and quality in helping to identify ethical considerations about their model. The results suggest that the applied AIETs serve as a guide for formulating general ethical considerations about language models. However, we note that they do not address unique aspects of these models, such as idiomatic expressions. Additionally, these AIETs did not help to identify potential negative impacts of models for the Portuguese language.
55. 【2512.15782】Auto-Tuning Safety Guardrails for Black-Box Large Language Models
链接:https://arxiv.org/abs/2512.15782
作者:Perry Abdulkadir
类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Large language models, modify model weights, Large language, increasingly deployed, settings where product
备注: 8 pages, 7 figures, 1 table. Work completed as part of the M.S. in Artificial Intelligence at the University of St. Thomas using publicly available models and datasets; all views and any errors are the author's own
点击查看摘要
Abstract:Large language models (LLMs) are increasingly deployed behind safety guardrails such as system prompts and content filters, especially in settings where product teams cannot modify model weights. In practice these guardrails are typically hand-tuned, brittle, and difficult to reproduce. This paper studies a simple but practical alternative: treat safety guardrail design itself as a hyperparameter optimization problem over a frozen base model. Concretely, I wrap Mistral-7B-Instruct with modular jailbreak and malware system prompts plus a ModernBERT-based harmfulness classifier, then evaluate candidate configurations on three public benchmarks covering malware generation, classic jailbreak prompts, and benign user queries. Each configuration is scored using malware and jailbreak attack success rate, benign harmful-response rate, and end-to-end latency. A 48-point grid search over prompt combinations and filter modes establishes a baseline. I then run a black-box Optuna study over the same space and show that it reliably rediscovers the best grid configurations while requiring an order of magnitude fewer evaluations and roughly 8x less wall-clock time. The results suggest that viewing safety guardrails as tunable hyperparameters is a feasible way to harden black-box LLM deployments under compute and time constraints.
56. 【2512.15747】D3G: Diverse Demographic Data Generation Increases Zero-Shot Image Classification Accuracy within Multimodal Models
链接:https://arxiv.org/abs/2512.15747
作者:Javon Hickmon
类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
关键词:human-level image understanding, achieve human-level image, Image classification, essential for machine, machine perception
备注:
点击查看摘要
Abstract:Image classification is a task essential for machine perception to achieve human-level image understanding. Multimodal models such as CLIP have been able to perform well on this task by learning semantic similarities across vision and language; however, despite these advances, image classification is still a challenging task. Models with low capacity often suffer from underfitting and thus underperform on fine-grained image classification. Along with this, it is important to ensure high-quality data with rich cross-modal representations of each class, which is often difficult to generate. When datasets do not enforce balanced demographics, the predictions will be biased toward the more represented class, while others will be neglected. We focus on how these issues can lead to harmful bias for zero-shot image classification, and explore how to combat these issues in demographic bias. We propose Diverse Demographic Data Generation (D3G), a training-free, zero-shot method of boosting classification accuracy while reducing demographic bias in pre-trained multimodal models. With this method, we utilize CLIP as our base multimodal model and Stable Diffusion XL as our generative model. We demonstrate that providing diverse demographic data at inference time improves performance for these models, and explore the impact of individual demographics on the resulting accuracy metric.
57. 【2512.15745】LLaDA2.0: Scaling Up Diffusion Language Models to 100B
链接:https://arxiv.org/abs/2512.15745
作者:Tiwei Bie,Maosong Cao,Kun Chen,Lun Du,Mingliang Gong,Zhuochen Gong,Yanmei Gu,Jiaqi Hu,Zenan Huang,Zhenzhong Lan,Chengxi Li,Chongxuan Li,Jianguo Li,Zehuan Li,Huabin Liu,Ling Liu,Guoshan Lu,Xiaocheng Lu,Yuxin Ma,Jianfeng Tan,Lanning Wei,Ji-Rong Wen,Yipeng Xing,Xiaolu Zhang,Junbo Zhao,Da Zheng,Jun Zhou,Junlin Zhou,Zhanchao Zhou,Liwang Zhu,Yihong Zhuang
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:discrete diffusion large, diffusion large language, large language models, paper presents, total parameters
备注: 19 pages
点击查看摘要
Abstract:This paper presents LLaDA2.0 -- a tuple of discrete diffusion large language models (dLLM) scaling up to 100B total parameters through systematic conversion from auto-regressive (AR) models -- establishing a new paradigm for frontier-scale deployment. Instead of costly training from scratch, LLaDA2.0 upholds knowledge inheritance, progressive adaption and efficiency-aware design principle, and seamless converts a pre-trained AR model into dLLM with a novel 3-phase block-level WSD based training scheme: progressive increasing block-size in block diffusion (warm-up), large-scale full-sequence diffusion (stable) and reverting back to compact-size block diffusion (decay). Along with post-training alignment with SFT and DPO, we obtain LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), two instruction-tuned Mixture-of-Experts (MoE) variants optimized for practical deployment. By preserving the advantages of parallel decoding, these models deliver superior performance and efficiency at the frontier scale. Both models were open-sourced.
58. 【2512.15722】Value Lens: Using Large Language Models to Understand Human Values
链接:https://arxiv.org/abs/2512.15722
作者:Eduardo de la Cruz Fernández,Marcelo Karanik,Sascha Ossowski
类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:autonomous decision-making process, autonomous decision-making, increasingly applied, applied to computer, choices made
备注: 4 pages. 2 figures. Published in ECAI 2025, Frontiers in Artificial Intelligence and Applications, Volume 413, pages 5175-5178
点击查看摘要
Abstract:The autonomous decision-making process, which is increasingly applied to computer systems, requires that the choices made by these systems align with human values. In this context, systems must assess how well their decisions reflect human values. To achieve this, it is essential to identify whether each available action promotes or undermines these values. This article presents Value Lens, a text-based model designed to detect human values using generative artificial intelligence, specifically Large Language Models (LLMs). The proposed model operates in two stages: the first aims to formulate a formal theory of values, while the second focuses on identifying these values within a given text. In the first stage, an LLM generates a description based on the established theory of values, which experts then verify. In the second stage, a pair of LLMs is employed: one LLM detects the presence of values, and the second acts as a critic and reviewer of the detection process. The results indicate that Value Lens performs comparably to, and even exceeds, the effectiveness of other models that apply different methods for similar tasks.
信息检索
1. 【2512.16891】LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation
链接:https://arxiv.org/abs/2512.16891
作者:Haichao Zhang,Yao Lu,Lichen Wang,Yunzhe Li,Daiwei Chen,Yunpeng Xu,Yun Fu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)
关键词:Large Language Models, Video Large Language, video question answering, Language Models, Large Language
备注:
点击查看摘要
Abstract:Video Large Language Models (VLLMs) unlock world-knowledge-aware video understanding through pretraining on internet-scale data and have already shown promise on tasks such as movie analysis and video question answering. However, deploying VLLMs for downstream tasks such as video recommendation remains challenging, since real systems require multi-video inputs, lightweight backbones, low-latency sequential inference, and rapid response. In practice, (1) decode-only generation yields high latency for sequential inference, (2) typical interfaces do not support multi-video inputs, and (3) constraining outputs to language discards fine-grained visual details that matter for downstream vision tasks. We argue that these limitations stem from the absence of a representation that preserves pixel-level detail while leveraging world knowledge. We present LinkedOut, a representation that extracts VLLM world knowledge directly from video to enable fast inference, supports multi-video histories, and removes the language bottleneck. LinkedOut extracts semantically grounded, knowledge-aware tokens from raw frames using VLLMs, guided by promptable queries and optional auxiliary modalities. We introduce a cross-layer knowledge fusion MoE that selects the appropriate level of abstraction from the rich VLLM features, enabling personalized, interpretable, and low-latency recommendation. To our knowledge, LinkedOut is the first VLLM-based video recommendation method that operates on raw frames without handcrafted labels, achieving state-of-the-art results on standard benchmarks. Interpretability studies and ablations confirm the benefits of layer diversity and layer-wise fusion, pointing to a practical path that fully leverages VLLM world-knowledge priors and visual reasoning for downstream vision tasks such as recommendation.
2. 【2512.16795】From Facts to Conclusions : Integrating Deductive Reasoning in Retrieval-Augmented LLMs
链接:https://arxiv.org/abs/2512.16795
作者:Shubham Mishra,Samyek Jain,Gorang Mehrishi,Shiv Tiwari,Harsh Sharma,Pratik Narang,Dhruv Kumar
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
关键词:large language models, grounds large language, Retrieval-Augmented Generation, retrieved sources conflict, grounds large
备注: Under Review
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) grounds large language models (LLMs) in external evidence, but fails when retrieved sources conflict or contain outdated or subjective information. Prior work address these issues independently but lack unified reasoning supervision. We propose a reasoning-trace-augmented RAG framework that adds structured, interpretable reasoning across three stages : (1) document-level adjudication, (2) conflict analysis, and (3) grounded synthesis, producing citation-linked answers or justified refusals. A Conflict-Aware Trust-Score (CATS) pipeline is introduced which evaluates groundedness, factual correctness, refusal accuracy, and conflict-behavior alignment using an LLM-as-a-Judge. Our 539-query reasoning dataset and evaluation pipeline establish a foundation for conflict-aware, interpretable RAG systems. Experimental results demonstrate substantial gains over baselines, most notably with Qwen, where Supervised Fine-Tuning improved End-to-End answer correctness from 0.069 to 0.883 and behavioral adherence from 0.074 to 0.722.
3. 【2512.16661】Microsoft Academic Graph Information Retrieval for Research Recommendation and Assistance
链接:https://arxiv.org/abs/2512.16661
作者:Jacob Reiss,Shikshya Shiwakoti,Samuel Goldsmith,Ujjwal Pandit
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:today information-driven world, information-driven world, access to scientific, increasingly easy, today information-driven
备注: 5 pages, 3 figures
点击查看摘要
Abstract:In today's information-driven world, access to scientific publications has become increasingly easy. At the same time, filtering through the massive volume of available research has become more challenging than ever. Graph Neural Networks (GNNs) and graph attention mechanisms have shown strong effectiveness in searching large-scale information databases, particularly when combined with modern large language models. In this paper, we propose an Attention-Based Subgraph Retriever, a GNN-as-retriever model that applies attention-based pruning to extract a refined subgraph, which is then passed to a large language model for advanced knowledge reasoning.
4. 【2512.16581】Abacus: Self-Supervised Event Counting-Aligned Distributional Pretraining for Sequential User Modeling
链接:https://arxiv.org/abs/2512.16581
作者:Sullivan Castro,Artem Betlei,Thomas Di Martino,Nadir El Manouzi
类目:Machine Learning (cs.LG); Information Retrieval (cs.IR)
关键词:user purchase behavior, real-time bidding, Modeling user purchase, purchase behavior, critical challenge
备注:
点击查看摘要
Abstract:Modeling user purchase behavior is a critical challenge in display advertising systems, necessary for real-time bidding. The difficulty arises from the sparsity of positive user events and the stochasticity of user actions, leading to severe class imbalance and irregular event timing. Predictive systems usually rely on hand-crafted "counter" features, overlooking the fine-grained temporal evolution of user intent. Meanwhile, current sequential models extract direct sequential signal, missing useful event-counting statistics. We enhance deep sequential models with self-supervised pretraining strategies for display advertising. Especially, we introduce Abacus, a novel approach of predicting the empirical frequency distribution of user events. We further propose a hybrid objective unifying Abacus with sequential learning objectives, combining stability of aggregated statistics with the sequence modeling sensitivity. Experiments on two real-world datasets show that Abacus pretraining outperforms existing methods accelerating downstream task convergence, while hybrid approach yields up to +6.1% AUC compared to the baselines.
5. 【2512.16576】InfoDCL: Informative Noise Enhanced Diffusion Based Contrastive Learning
链接:https://arxiv.org/abs/2512.16576
作者:Xufeng Liang,Zhida Qin,Chong Zhang,Tianyu Huang,Gangyi Ding
类目:Information Retrieval (cs.IR)
关键词:demonstrated promising potential, recommender systems, demonstrated promising, promising potential, potential in recommender
备注:
点击查看摘要
Abstract:Contrastive learning has demonstrated promising potential in recommender systems. Existing methods typically construct sparser views by randomly perturbing the original interaction graph, as they have no idea about the authentic user preferences. Owing to the sparse nature of recommendation data, this paradigm can only capture insufficient semantic information. To address the issue, we propose InfoDCL, a novel diffusion-based contrastive learning framework for recommendation. Rather than injecting randomly sampled Gaussian noise, we employ a single-step diffusion process that integrates noise with auxiliary semantic information to generate signals and feed them to the standard diffusion process to generate authentic user preferences as contrastive views. Besides, based on a comprehensive analysis of the mutual influence between generation and preference learning in InfoDCL, we build a collaborative training objective strategy to transform the interference between them into mutual collaboration. Additionally, we employ multiple GCN layers only during inference stage to incorporate higher-order co-occurrence information while maintaining training efficiency. Extensive experiments on five real-world datasets demonstrate that InfoDCL significantly outperforms state-of-the-art methods. Our InfoDCL offers an effective solution for enhancing recommendation performance and suggests a novel paradigm for applying diffusion method in contrastive learning frameworks.
6. 【2512.16532】From Personalization to Prejudice: Bias and Discrimination in Memory-Enhanced AI Agents for Recruitment
链接:https://arxiv.org/abs/2512.16532
作者:Himanshu Gharat,Himanshi Agrawal,Gourab K. Patro
类目:Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:Large Language Models, Large Language, Language Models, capabilities for understanding, diverse tasks
备注: In Proceedings of the Nineteenth ACM International Conference on Web Search and Data Mining (WSDM '26)
点击查看摘要
Abstract:Large Language Models (LLMs) have empowered AI agents with advanced capabilities for understanding, reasoning, and interacting across diverse tasks. The addition of memory further enhances them by enabling continuity across interactions, learning from past experiences, and improving the relevance of actions and responses over time; termed as memory-enhanced personalization. Although such personalization through memory offers clear benefits, it also introduces risks of bias. While several previous studies have highlighted bias in ML and LLMs, bias due to memory-enhanced personalized agents is largely unexplored. Using recruitment as an example use case, we simulate the behavior of a memory-enhanced personalized agent, and study whether and how bias is introduced and amplified in and across various stages of operation. Our experiments on agents using safety-trained LLMs reveal that bias is systematically introduced and reinforced through personalization, emphasizing the need for additional protective measures or agent guardrails in memory-enhanced LLM-based AI agents.
7. 【2512.16425】Introducing ORKG ASK: an AI-driven Scholarly Literature Search and Exploration System Taking a Neuro-Symbolic Approach
链接:https://arxiv.org/abs/2512.16425
作者:Allard Oelen,Mohamad Yaser Jaradeh,Sören Auer
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:Large Language Models, generative Artificial Intelligence, continues to grow, increasingly difficult, scholarly literature continues
备注:
点击查看摘要
Abstract:As the volume of published scholarly literature continues to grow, finding relevant literature becomes increasingly difficult. With the rise of generative Artificial Intelligence (AI), and particularly Large Language Models (LLMs), new possibilities emerge to find and explore literature. We introduce ASK (Assistant for Scientific Knowledge), an AI-driven scholarly literature search and exploration system that follows a neuro-symbolic approach. ASK aims to provide active support to researchers in finding relevant scholarly literature by leveraging vector search, LLMs, and knowledge graphs. The system allows users to input research questions in natural language and retrieve relevant articles. ASK automatically extracts key information and generates answers to research questions using a Retrieval-Augmented Generation (RAG) approach. We present an evaluation of ASK, assessing the system's usability and usefulness. Findings indicate that the system is user-friendly and users are generally satisfied while using the system.
8. 【2512.16348】From Flows to Functions: Macroscopic Behavioral Fingerprinting of IoT Devices via Network Services
链接:https://arxiv.org/abs/2512.16348
作者:Shayan Azizi,Norihiro Okui,Masataka Nakahara,Ayumu Kubota,Hassan Habibi Gharakheili
类目:Information Retrieval (cs.IR)
关键词:Internet of Things, health monitoring sensors, critical operational task, voice assistants, Identifying devices
备注: 10 pages, 3 figures, 1 table, and 1 algorithm
点击查看摘要
Abstract:Identifying devices such as cameras, printers, voice assistants, or health monitoring sensors, collectively known as the Internet of Things (IoT), within a network is a critical operational task, particularly to manage the cyber risks they introduce. While behavioral fingerprinting based on network traffic analysis has shown promise, most existing approaches rely on machine learning (ML) techniques applied to fine-grained features of short-lived traffic units (packets and/or flows). These methods tend to be computationally expensive, sensitive to traffic measurement errors, and often produce opaque inferences. In this paper, we propose a macroscopic, lightweight, and explainable alternative to behavioral fingerprinting focusing on the network services (e.g., TCP/80, UDP/53) that IoT devices use to perform their intended functions over extended periods. Our contributions are threefold. (1) We demonstrate that IoT devices exhibit stable and distinguishable patterns in their use of network services over a period of time. We formalize the notion of service-level fingerprints and derive a generalized method to represent network behaviors using a configurable granularity parameter. (2) We develop a procedure to extract service-level fingerprints, apply it to traffic from 13 consumer IoT device types in a lab testbed, and evaluate the resulting representations in terms of their convergence and recurrence properties. (3) We validate the efficacy of service-level fingerprints for device identification in closed-set and open-set scenarios. Our findings are based on a large dataset comprising about 10 million IPFIX flow records collected over a 1.5-year period.
9. 【2512.16236】he Evolution of Reranking Models in Information Retrieval: From Heuristic Methods to Large Language Models
链接:https://arxiv.org/abs/2512.16236
作者:Tejul Pandit,Sakshi Mahendru,Meet Raval,Dhvani Upadhyay
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:initial candidate sets, user-presented final results, honing initial candidate, contemporary information retrieval, Retrieval Augmented Generation
备注: 15 pages, 1 figure, Accepted in CLNLP'25
点击查看摘要
Abstract:Reranking is a critical stage in contemporary information retrieval (IR) systems, improving the relevance of the user-presented final results by honing initial candidate sets. This paper is a thorough guide to examine the changing reranker landscape and offer a clear view of the advancements made in reranking methods. We present a comprehensive survey of reranking models employed in IR, particularly within modern Retrieval Augmented Generation (RAG) pipelines, where retrieved documents notably influence output quality. We embark on a chronological journey through the historical trajectory of reranking techniques, starting with foundational approaches, before exploring the wide range of sophisticated neural network architectures such as cross-encoders, sequence-generation models like T5, and Graph Neural Networks (GNNs) utilized for structural information. Recognizing the computational cost of advancing neural rerankers, we analyze techniques for enhancing efficiency, notably knowledge distillation for creating competitive, lighter alternatives. Furthermore, we map the emerging territory of integrating Large Language Models (LLMs) in reranking, examining novel prompting strategies and fine-tuning tactics. This survey seeks to elucidate the fundamental ideas, relative effectiveness, computational features, and real-world trade-offs of various reranking strategies. The survey provides a structured synthesis of the diverse reranking paradigms, highlighting their underlying principles and comparative strengths and weaknesses.
Comments:
15 pages, 1 figure, Accepted in CLNLP’25
Subjects:
Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2512.16236 [cs.IR]
(or
arXiv:2512.16236v1 [cs.IR] for this version)
https://doi.org/10.48550/arXiv.2512.16236
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
10. 【2512.16171】Science Consultant Agent
链接:https://arxiv.org/abs/2512.16171
作者:Karthikeyan K,Philip Wu,Xin Tang,Alexandre Alves
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:web-based Artificial Intelligence, Science Consultant Agent, Artificial Intelligence, effective modeling strategy, Science Consultant
备注:
点击查看摘要
Abstract:The Science Consultant Agent is a web-based Artificial Intelligence (AI) tool that helps practitioners select and implement the most effective modeling strategy for AI-based solutions. It operates through four core components: Questionnaire, Smart Fill, Research-Guided Recommendation, and Prototype Builder. By combining structured questionnaires, literature-backed solution recommendations, and prototype generation, the Science Consultant Agent accelerates development for everyone from Product Managers and Software Developers to Researchers. The full pipeline is illustrated in Figure 1.
11. 【2512.16106】ModelTables: A Corpus of Tables about Models
链接:https://arxiv.org/abs/2512.16106
作者:Zhengyuan Dong,Victor Zhong,Renée J. Miller
类目:Databases (cs.DB); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:Hugging Face model, Model, Face model cards, Hugging Face, Model Lakes
备注: 14 pages, 8 figures and 8 tables
点击查看摘要
Abstract:We present ModelTables, a benchmark of tables in Model Lakes that captures the structured semantics of performance and configuration tables often overlooked by text only retrieval. The corpus is built from Hugging Face model cards, GitHub READMEs, and referenced papers, linking each table to its surrounding model and publication context. Compared with open data lake tables, model tables are smaller yet exhibit denser inter table relationships, reflecting tightly coupled model and benchmark evolution. The current release covers over 60K models and 90K tables. To evaluate model and table relatedness, we construct a multi source ground truth using three complementary signals: (1) paper citation links, (2) explicit model card links and inheritance, and (3) shared training datasets. We present one extensive empirical use case for the benchmark which is table search. We compare canonical Data Lake search operators (unionable, joinable, keyword) and Information Retrieval baselines (dense, sparse, hybrid retrieval) on this benchmark. Union based semantic table retrieval attains 54.8 % P@1 overall (54.6 % on citation, 31.3 % on inheritance, 30.6 % on shared dataset signals); table based dense retrieval reaches 66.5 % P@1, and metadata hybrid retrieval achieves 54.1 %. This evaluation indicates clear room for developing better table search methods. By releasing ModelTables and its creation protocol, we provide the first large scale benchmark of structured data describing AI model. Our use case of table discovery in Model Lakes, provides intuition and evidence for developing more accurate semantic retrieval, structured comparison, and principled organization of structured model knowledge. Source code, data, and other artifacts have been made available at this https URL.
12. 【2512.16033】On Recommending Category: A Cascading Approach
链接:https://arxiv.org/abs/2512.16033
作者:Qihao Wang,Pritom Saha Akash,Varvara Kollia,Kevin Chen-Chuan Chang,Biwei Jiang,Vadim Von Brzeski
类目:Information Retrieval (cs.IR)
关键词:boosting commercial success, enhancing user experience, commercial success, experience and boosting, boosting commercial
备注:
点击查看摘要
Abstract:Recommendation plays a key role in e-commerce, enhancing user experience and boosting commercial success. Existing works mainly focus on recommending a set of items, but online e-commerce platforms have recently begun to pay attention to exploring users' potential interests at the category level. Category-level recommendation allows e-commerce platforms to promote users' engagements by expanding their interests to different types of items. In addition, it complements item-level recommendations when the latter becomes extremely challenging for users with little-known information and past interactions. Furthermore, it facilitates item-level recommendations in existing works. The predicted category, which is called intention in those works, aids the exploration of item-level preference. However, such category-level preference prediction has mostly been accomplished through applying item-level models. Some key differences between item-level recommendations and category-level recommendations are ignored in such a simplistic adaptation. In this paper, we propose a cascading category recommender (CCRec) model with a variational autoencoder (VAE) to encode item-level information to perform category-level recommendations. Experiments show the advantages of this model over methods designed for item-level recommendations.
计算机视觉
1. 【2512.16924】he World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text
链接:https://arxiv.org/abs/2512.16924
作者:Hanlin Wang,Hao Ouyang,Qiuyu Wang,Yue Yu,Yihao Meng,Wen Wang,Ka Leong Cheng,Shuailei Ma,Qingyan Bai,Yixuan Li,Cheng Chen,Yanhong Zeng,Xing Zhu,Yujun Shen,Qifeng Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:promptable world events, enables rich, user-directed simulation, combining text, reference images
备注: Project page and code: [this https URL](https://worldcanvas.github.io/)
点击查看摘要
Abstract:We present WorldCanvas, a framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference images. Unlike text-only approaches and existing trajectory-controlled image-to-video methods, our multimodal approach combines trajectories -- encoding motion, timing, and visibility -- with natural language for semantic intent and reference images for visual grounding of object identity, enabling the generation of coherent, controllable events that include multi-agent interactions, object entry/exit, reference-guided appearance and counterintuitive events. The resulting videos demonstrate not only temporal coherence but also emergent consistency, preserving object identity and scene despite temporary disappearance. By supporting expressive world events generation, WorldCanvas advances world models from passive predictors to interactive, user-shaped simulators. Our project page is available at: this https URL.
2. 【2512.16923】Generative Refocusing: Flexible Defocus Control from a Single Image
链接:https://arxiv.org/abs/2512.16923
作者:Chun-Wei Tuan Mu,Jia-Bin Huang,Yu-Lun Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:essential in photography, special equipment, perfect focus, refocusing, Generative Refocusing
备注: Project website: [this https URL](https://generative-refocusing.github.io/)
点击查看摘要
Abstract:Depth-of-field control is essential in photography, but getting the perfect focus often takes several tries or special equipment. Single-image refocusing is still difficult. It involves recovering sharp content and creating realistic bokeh. Current methods have significant drawbacks. They need all-in-focus inputs, depend on synthetic data from simulators, and have limited control over aperture. We introduce Generative Refocusing, a two-step process that uses DeblurNet to recover all-in-focus images from various inputs and BokehNet for creating controllable bokeh. Our main innovation is semi-supervised training. This method combines synthetic paired data with unpaired real bokeh images, using EXIF metadata to capture real optical characteristics beyond what simulators can provide. Our experiments show we achieve top performance in defocus deblurring, bokeh synthesis, and refocusing benchmarks. Additionally, our Generative Refocusing allows text-guided adjustments and custom aperture shapes.
3. 【2512.16922】Next-Embedding Prediction Makes Strong Vision Learners
链接:https://arxiv.org/abs/2512.16922
作者:Sihan Xu,Ziqiao Ma,Wenhao Chai,Xuweiyi Chen,Weiyang Jin,Joyce Chai,Saining Xie,Stella X. Yu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:natural language, principles can yield, self-supervised visual learners, Inspired, yield strong self-supervised
备注: Project Page: [this https URL](https://sihanxu.me/nepa)
点击查看摘要
Abstract:Inspired by the success of generative pretraining in natural language, we ask whether the same principles can yield strong self-supervised visual learners. Instead of training models to output features for downstream use, we train them to generate embeddings to perform predictive tasks directly. This work explores such a shift from learning representations to learning models. Specifically, models learn to predict future patch embeddings conditioned on past ones, using causal masking and stop gradient, which we refer to as Next-Embedding Predictive Autoregression (NEPA). We demonstrate that a simple Transformer pretrained on ImageNet-1k with next embedding prediction as its sole learning objective is effective - no pixel reconstruction, discrete tokens, contrastive loss, or task-specific heads. This formulation retains architectural simplicity and scalability, without requiring additional design complexity. NEPA achieves strong results across tasks, attaining 83.8% and 85.3% top-1 accuracy on ImageNet-1K with ViT-B and ViT-L backbones after fine-tuning, and transferring effectively to semantic segmentation on ADE20K. We believe generative pretraining from embeddings provides a simple, scalable, and potentially modality-agnostic alternative to visual self-supervised learning.
4. 【2512.16921】Differences That Matter: Auditing Models for Capability Gap Discovery and Rectification
链接:https://arxiv.org/abs/2512.16921
作者:Qihao Liu,Chengzhi Mao,Yaojie Liu,Alan Yuille,Wen-Sheng Chu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Conventional evaluation methods, fully disclose significant, disclose significant capability, significant capability gaps, Conventional evaluation
备注: project page: [this https URL](https://auditdm.github.io/)
点击查看摘要
Abstract:Conventional evaluation methods for multimodal LLMs (MLLMs) lack interpretability and are often insufficient to fully disclose significant capability gaps across models. To address this, we introduce AuditDM, an automated framework that actively discovers and rectifies MLLM failure modes by auditing their divergence. AuditDM fine-tunes an MLLM as an auditor via reinforcement learning to generate challenging questions and counterfactual images that maximize disagreement among target models. Once trained, the auditor uncovers diverse, interpretable exemplars that reveal model weaknesses and serve as annotation-free data for rectification. When applied to SoTA models like Gemma-3 and PaliGemma-2, AuditDM discovers more than 20 distinct failure types. Fine-tuning on these discoveries consistently improves all models across 16 benchmarks, and enables a 3B model to surpass its 28B counterpart. Our results suggest that as data scaling hits diminishing returns, targeted model auditing offers an effective path to model diagnosis and improvement.
5. 【2512.16920】EasyV2V: A High-quality Instruction-based Video Editing Framework
链接:https://arxiv.org/abs/2512.16920
作者:Jinjie Mai,Chaoyang Wang,Guocheng Gordon Qian,Willi Menapace,Sergey Tulyakov,Bernard Ghanem,Peter Wonka,Ashkan Mirzaei
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:advanced rapidly, remains less explored, facing challenges, challenges in consistency, video editing remains
备注: Project page: [this https URL](https://snap-research.github.io/easyv2v/)
点击查看摘要
Abstract:While image editing has advanced rapidly, video editing remains less explored, facing challenges in consistency, control, and generalization. We study the design space of data, architecture, and control, and introduce \emph{EasyV2V}, a simple and effective framework for instruction-based video editing. On the data side, we compose existing experts with fast inverses to build diverse video pairs, lift image edit pairs into videos via single-frame supervision and pseudo pairs with shared affine motion, mine dense-captioned clips for video pairs, and add transition supervision to teach how edits unfold. On the model side, we observe that pretrained text-to-video models possess editing capability, motivating a simplified design. Simple sequence concatenation for conditioning with light LoRA fine-tuning suffices to train a strong model. For control, we unify spatiotemporal control via a single mask mechanism and support optional reference images. Overall, EasyV2V works with flexible inputs, e.g., video+text, video+mask+text, video+mask+reference+text, and achieves state-of-the-art video editing results, surpassing concurrent and commercial systems. Project page: this https URL
6. 【2512.16919】DVGT: Driving Visual Geometry Transformer
链接:https://arxiv.org/abs/2512.16919
作者:Sicheng Zuo,Zixun Xie,Wenzhao Zheng,Shaoqing Xu,Fang Li,Shengyin Jiang,Long Chen,Zhi-Xin Yang,Jiwen Lu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
关键词:Perceiving and reconstructing, crucial for autonomous, Visual Geometry Transformer, scene geometry, Perceiving
备注: Code is available at [this https URL](https://github.com/wzzheng/DVGT)
点击查看摘要
Abstract:Perceiving and reconstructing 3D scene geometry from visual inputs is crucial for autonomous driving. However, there still lacks a driving-targeted dense geometry perception model that can adapt to different scenarios and camera configurations. To bridge this gap, we propose a Driving Visual Geometry Transformer (DVGT), which reconstructs a global dense 3D point map from a sequence of unposed multi-view visual inputs. We first extract visual features for each image using a DINO backbone, and employ alternating intra-view local attention, cross-view spatial attention, and cross-frame temporal attention to infer geometric relations across images. We then use multiple heads to decode a global point map in the ego coordinate of the first frame and the ego poses for each frame. Unlike conventional methods that rely on precise camera parameters, DVGT is free of explicit 3D geometric priors, enabling flexible processing of arbitrary camera configurations. DVGT directly predicts metric-scaled geometry from image sequences, eliminating the need for post-alignment with external sensors. Trained on a large mixture of driving datasets including nuScenes, OpenScene, Waymo, KITTI, and DDAD, DVGT significantly outperforms existing models on various scenarios. Code is available at this https URL.
7. 【2512.16918】AdaTooler-V: Adaptive Tool-Use for Images and Videos
链接:https://arxiv.org/abs/2512.16918
作者:Chaoyang Wang,Kaituo Feng,Dongyang Chen,Zhongyu Wang,Zhixun Li,Sicheng Gao,Meng Meng,Xu Zhou,Manyuan Zhang,Yuzhang Shang,Xiangyu Yue
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:multimodal large language, Recent advances, vision tool interactions, multimodal interleaved, large language models
备注: Project page: [this https URL](https://github.com/CYWang735/AdaTooler-V)
点击查看摘要
Abstract:Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interactions. However, existing open-source models often exhibit blind tool-use reasoning patterns, invoking vision tools even when they are unnecessary, which significantly increases inference overhead and degrades model performance. To this end, we propose AdaTooler-V, an MLLM that performs adaptive tool-use by determining whether a visual problem truly requires tools. First, we introduce AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Moreover, we construct two datasets to support training: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable rewards across single-image, multi-image, and video data. Experiments across twelve benchmarks demonstrate the strong reasoning capability of AdaTooler-V, outperforming existing methods in diverse visual reasoning tasks. Notably, AdaTooler-V-7B achieves an accuracy of 89.8\% on the high-resolution benchmark V*, surpassing the commercial proprietary model GPT-4o and Gemini 1.5 Pro. All code, models, and data are released.
8. 【2512.16915】StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors
链接:https://arxiv.org/abs/2512.16915
作者:Guibao Shen,Yihua Du,Wenhang Ge,Jing He,Chirui Chang,Donghao Zhou,Zhen Yang,Luozhou Wang,Xin Tao,Ying-Cong Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:stereo video content, high-quality stereo video, stereoscopic displays, including VR headsets, rapid growth
备注:
点击查看摘要
Abstract:The rapid growth of stereoscopic displays, including VR headsets and 3D cinemas, has led to increasing demand for high-quality stereo video content. However, producing 3D videos remains costly and complex, while automatic Monocular-to-Stereo conversion is hindered by the limitations of the multi-stage ``Depth-Warp-Inpaint'' (DWI) pipeline. This paradigm suffers from error propagation, depth ambiguity, and format inconsistency between parallel and converged stereo configurations. To address these challenges, we introduce UniStereo, the first large-scale unified dataset for stereo video conversion, covering both stereo formats to enable fair benchmarking and robust model training. Building upon this dataset, we propose StereoPilot, an efficient feed-forward model that directly synthesizes the target view without relying on explicit depth maps or iterative diffusion sampling. Equipped with a learnable domain switcher and a cycle consistency loss, StereoPilot adapts seamlessly to different stereo formats and achieves improved consistency. Extensive experiments demonstrate that StereoPilot significantly outperforms state-of-the-art methods in both visual fidelity and computational efficiency. Project page: this https URL.
9. 【2512.16913】Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation
链接:https://arxiv.org/abs/2512.16913
作者:Xin Lin,Meixi Song,Dizhe Zhang,Wenxuan Lu,Haodong Li,Bo Du,Ming-Hsuan Yang,Truong Nguyen,Lu Qi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:depth foundation model, metric depth foundation, panoramic metric depth, depth foundation, foundation model
备注: Project Page: [this https URL](https://insta360-research-team.github.io/DAP_website/)
点击查看摘要
Abstract:In this work, we present a panoramic metric depth foundation model that generalizes across diverse scene distances. We explore a data-in-the-loop paradigm from the view of both data construction and framework design. We collect a large-scale dataset by combining public datasets, high-quality synthetic data from our UE5 simulator and text-to-image models, and real panoramic images from the web. To reduce domain gaps between indoor/outdoor and synthetic/real data, we introduce a three-stage pseudo-label curation pipeline to generate reliable ground truth for unlabeled images. For the model, we adopt DINOv3-Large as the backbone for its strong pre-trained generalization, and introduce a plug-and-play range mask head, sharpness-centric optimization, and geometry-centric optimization to improve robustness to varying distances and enforce geometric consistency across views. Experiments on multiple benchmarks (e.g., Stanford2D3D, Matterport3D, and Deep360) demonstrate strong performance and zero-shot generalization, with particularly robust and stable metric predictions in diverse real-world scenes. The project page can be found at: \href{this https URL} {this https URL\_website/}
10. 【2512.16910】SFTok: Bridging the Performance Gap in Discrete Tokenizers
链接:https://arxiv.org/abs/2512.16910
作者:Qihang Rao,Borui Zhang,Wenzhao Zheng,Jie Zhou,Jiwen Lu
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Recent advances, multimodal models highlight, highlight the pivotal, pivotal role, tokenization in high-resolution
备注: Under review. Code is available at [this https URL](https://github.com/Neur-IO/SFTok)
点击查看摘要
Abstract:Recent advances in multimodal models highlight the pivotal role of image tokenization in high-resolution image generation. By compressing images into compact latent representations, tokenizers enable generative models to operate in lower-dimensional spaces, thereby improving computational efficiency and reducing complexity. Discrete tokenizers naturally align with the autoregressive paradigm but still lag behind continuous ones, limiting their adoption in multimodal systems. To address this, we propose \textbf{SFTok}, a discrete tokenizer that incorporates a multi-step iterative mechanism for precise reconstruction. By integrating \textbf{self-forcing guided visual reconstruction} and \textbf{debias-and-fitting training strategy}, SFTok resolves the training-inference inconsistency in multi-step process, significantly enhancing image reconstruction quality. At a high compression rate of only 64 tokens per image, SFTok achieves state-of-the-art reconstruction quality on ImageNet (rFID = 1.21) and demonstrates exceptional performance in class-to-image generation tasks (gFID = 2.29).
11. 【2512.16909】MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning
链接:https://arxiv.org/abs/2512.16909
作者:Yuanchen Ju,Yongyuan Liang,Yen-Jen Wang,Nandiraju Gireesh,Yuanliang Ju,Seungjae Lee,Qiao Gu,Elvis Hsieh,Furong Huang,Koushil Sreenath
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:Mobile manipulators, navigate and manipulate, Mobile, scene, Scene graphs
备注: 25 pages, 10 figures. Project page: [this https URL](https://hybridrobotics.github.io/MomaGraph/)
点击查看摘要
Abstract:Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To address these limitations, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. We thus contribute MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, along with MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision-language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. Extensive experiments demonstrate that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.
12. 【2512.16908】SceneDiff: A Benchmark and Method for Multiview Object Change Detection
链接:https://arxiv.org/abs/2512.16908
作者:Yuqun Wu,Chih-hao Lin,Henry Che,Aditi Tiwari,Chuhang Zou,Shenlong Wang,Derek Hoiem
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:investigate the problem, problem of identifying, identifying objects, Abstract, removed
备注:
点击查看摘要
Abstract:We investigate the problem of identifying objects that have been added, removed, or moved between a pair of captures (images or videos) of the same scene at different times. Detecting such changes is important for many applications, such as robotic tidying or construction progress and safety monitoring. A major challenge is that varying viewpoints can cause objects to falsely appear changed. We introduce SceneDiff Benchmark, the first multiview change detection benchmark with object instance annotations, comprising 350 diverse video pairs with thousands of changed objects. We also introduce the SceneDiff method, a new training-free approach for multiview object change detection that leverages pretrained 3D, segmentation, and image encoding models to robustly predict across multiple benchmarks. Our method aligns the captures in 3D, extracts object regions, and compares spatial and semantic region features to detect changes. Experiments on multi-view and two-view benchmarks demonstrate that our method outperforms existing approaches by large margins (94% and 37.4% relative AP improvements). The benchmark and code will be publicly released.
13. 【2512.16907】Flowing from Reasoning to Motion: Learning 3D Hand Trajectory Prediction from Egocentric Human Interaction Videos
链接:https://arxiv.org/abs/2512.16907
作者:Mingfei Chen,Yifan Wang,Zhengqin Li,Homanga Bharadhwaj,Yujin Chen,Chuan Qin,Ziyi Kou,Yuan Tian,Eric Whitmire,Rajinder Sodhi,Hrvoje Benko,Eli Shlizerman,Yue Liu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
关键词:hand trajectory prediction, Prior works, hand trajectory, trajectory prediction, weakly link reasoning
备注: Project website: [this https URL](https://egoman-project.github.io)
点击查看摘要
Abstract:Prior works on 3D hand trajectory prediction are constrained by datasets that decouple motion from semantic supervision and by models that weakly link reasoning and action. To address these, we first present the EgoMAN dataset, a large-scale egocentric dataset for interaction stage-aware 3D hand trajectory prediction with 219K 6DoF trajectories and 3M structured QA pairs for semantic, spatial, and motion reasoning. We then introduce the EgoMAN model, a reasoning-to-motion framework that links vision-language reasoning and motion generation via a trajectory-token interface. Trained progressively to align reasoning with motion dynamics, our approach yields accurate and stage-aware trajectories with generalization across real-world scenes.
14. 【2512.16906】VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization
链接:https://arxiv.org/abs/2512.16906
作者:Xiaoyan Cong,Haotian Yang,Angtian Wang,Yizhi Wang,Yiding Yang,Canyu Zhang,Chongyang Ma
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:preserving content fidelity, Instruction-based video editing, temporal coherence, aims to modify, modify an input
备注:
点击查看摘要
Abstract:Instruction-based video editing aims to modify an input video according to a natural-language instruction while preserving content fidelity and temporal coherence. However, existing diffusion-based approaches are often trained on paired data of simple editing operations, which fundamentally limits their ability to generalize to diverse and complex, real-world instructions. To address this generalization gap, we propose VIVA, a scalable framework for instruction-based video editing that leverages VLM-guided encoding and reward optimization. First, we introduce a VLM-based instructor that encodes the textual instruction, the first frame of the source video, and an optional reference image into visually-grounded instruction representations, providing fine-grained spatial and semantic context for the diffusion transformer backbone. Second, we propose a post-training stage, Edit-GRPO, which adapts Group Relative Policy Optimization to the domain of video editing, directly optimizing the model for instruction-faithful, content-preserving, and aesthetically pleasing edits using relative rewards. Furthermore, we propose a data construction pipeline designed to synthetically generate diverse, high-fidelity paired video-instruction data of basic editing operations. Extensive experiments show that VIVA achieves superior instruction following, generalization, and editing quality over state-of-the-art methods. Website: this https URL
15. 【2512.16905】Alchemist: Unlocking Efficiency in Text-to-Image Model Training via Meta-Gradient Data Selection
链接:https://arxiv.org/abs/2512.16905
作者:Kaixin Ding,Yang Zhou,Xi Chen,Miao Yang,Jiarong Ou,Rui Chen,Xin Tao,Hengshuang Zhao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Stable Diffusion, Recent advances, led to remarkable, remarkable improvements, data
备注: project page: [this https URL](https://kxding.github.io/project/Alchemist/)
点击查看摘要
Abstract:Recent advances in Text-to-Image (T2I) generative models, such as Imagen, Stable Diffusion, and FLUX, have led to remarkable improvements in visual quality. However, their performance is fundamentally limited by the quality of training data. Web-crawled and synthetic image datasets often contain low-quality or redundant samples, which lead to degraded visual fidelity, unstable training, and inefficient computation. Hence, effective data selection is crucial for improving data efficiency. Existing approaches rely on costly manual curation or heuristic scoring based on single-dimensional features in Text-to-Image data filtering. Although meta-learning based method has been explored in LLM, there is no adaptation for image modalities. To this end, we propose **Alchemist**, a meta-gradient-based framework to select a suitable subset from large-scale text-image data pairs. Our approach automatically learns to assess the influence of each sample by iteratively optimizing the model from a data-centric perspective. Alchemist consists of two key stages: data rating and data pruning. We train a lightweight rater to estimate each sample's influence based on gradient information, enhanced with multi-granularity perception. We then use the Shift-Gsampling strategy to select informative subsets for efficient model training. Alchemist is the first automatic, scalable, meta-gradient-based data selection framework for Text-to-Image model training. Experiments on both synthetic and web-crawled datasets demonstrate that Alchemist consistently improves visual quality and downstream performance. Training on an Alchemist-selected 50% of the data can outperform training on the full dataset.
16. 【2512.16900】FlashPortrait: 6x Faster Infinite Portrait Animation with Adaptive Latent Prediction
链接:https://arxiv.org/abs/2512.16900
作者:Shuyuan Tu,Yueming Pan,Yinming Huang,Xintong Han,Zhen Xing,Qi Dai,Kai Qiu,Chong Luo,Zuxuan Wu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:diffusion-based acceleration methods, long-portrait animation struggle, methods for long-portrait, struggle to ensure, Current diffusion-based acceleration
备注:
点击查看摘要
Abstract:Current diffusion-based acceleration methods for long-portrait animation struggle to ensure identity (ID) consistency. This paper presents FlashPortrait, an end-to-end video diffusion transformer capable of synthesizing ID-preserving, infinite-length videos while achieving up to 6x acceleration in inference speed. In particular, FlashPortrait begins by computing the identity-agnostic facial expression features with an off-the-shelf extractor. It then introduces a Normalized Facial Expression Block to align facial features with diffusion latents by normalizing them with their respective means and variances, thereby improving identity stability in facial modeling. During inference, FlashPortrait adopts a dynamic sliding-window scheme with weighted blending in overlapping areas, ensuring smooth transitions and ID consistency in long animations. In each context window, based on the latent variation rate at particular timesteps and the derivative magnitude ratio among diffusion layers, FlashPortrait utilizes higher-order latent derivatives at the current timestep to directly predict latents at future timesteps, thereby skipping several denoising steps and achieving 6x speed acceleration. Experiments on benchmarks show the effectiveness of FlashPortrait both qualitatively and quantitatively.
17. 【2512.16899】Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image
链接:https://arxiv.org/abs/2512.16899
作者:Yushi Hu,Reyhane Askari-Hemmat,Melissa Hall,Emily Dinan,Luke Zettlemoyer,Marjan Ghazvininejad
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:training large language, large language models, Reward models, text sequences, handle interleaved image
备注: Code and data available at [this https URL](https://github.com/facebookresearch/MMRB2)
点击查看摘要
Abstract:Reward models (RMs) are essential for training large language models (LLMs), but remain underexplored for omni models that handle interleaved image and text sequences. We introduce Multimodal RewardBench 2 (MMRB2), the first comprehensive benchmark for reward models on multimodal understanding and (interleaved) generation. MMRB2 spans four tasks: text-to-image, image editing, interleaved generation, and multimodal reasoning ("thinking-with-images"), providing 1,000 expert-annotated preference pairs per task from 23 models and agents across 21 source tasks. MMRB2 is designed with: (1) practical but challenging prompts; (2) responses from state-of-the-art models and agents; and (3) preference pairs with strong human-expert consensus, curated via an ensemble filtering strategy. Using MMRB2, we study existing judges for each subtask, including multimodal LLM-as-a-judge and models trained with human preferences. The latest Gemini 3 Pro attains 75-80% accuracy. GPT-5 and Gemini 2.5 Pro reach 66-75% accuracy, compared to 90% for humans, yet surpass the widely used GPT-4o (59%). The best performing open-source model Qwen3-VL-32B achieves similar accuracies as Gemini 2.5 Flash (64%). We also show that MMRB2 performance strongly correlates with downstream task success using Best-of-N sampling and conduct an in-depth analysis that shows key areas to improve the reward models going forward.
18. 【2512.16896】Sceniris: A Fast Procedural Scene Generation Framework
链接:https://arxiv.org/abs/2512.16896
作者:Jinghuan Shang,Harsh Patel,Ran Gong,Karl Schmeckpeper
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:developing Physical, generative models, essential for developing, Synthetic, Scene Synthesizer
备注: Code is available at [this https URL](https://github.com/rai-inst/sceniris)
点击查看摘要
Abstract:Synthetic 3D scenes are essential for developing Physical AI and generative models. Existing procedural generation methods often have low output throughput, creating a significant bottleneck in scaling up dataset creation. In this work, we introduce Sceniris, a highly efficient procedural scene generation framework for rapidly generating large-scale, collision-free scene variations. Sceniris also provides an optional robot reachability check, providing manipulation-feasible scenes for robot tasks. Sceniris is designed for maximum efficiency by addressing the primary performance limitations of the prior method, Scene Synthesizer. Leveraging batch sampling and faster collision checking in cuRobo, Sceniris achieves at least 234x speed-up over Scene Synthesizer. Sceniris also expands the object-wise spatial relationships available in prior work to support diverse scene requirements. Our code is available at this https URL
19. 【2512.16893】Instant Expressive Gaussian Head Avatar via 3D-Aware Expression Distillation
链接:https://arxiv.org/abs/2512.16893
作者:Kaiwen Jiang,Xueting Li,Seonwook Park,Ravi Ramamoorthi,Shalini De Mello,Koki Nagano
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:video diffusion models, witnessed tremendous quality, tremendous quality improvements, Portrait animation, witnessed tremendous
备注: Project website is [this https URL](https://research.nvidia.com/labs/amri/projects/instant4d)
点击查看摘要
Abstract:Portrait animation has witnessed tremendous quality improvements thanks to recent advances in video diffusion models. However, these 2D methods often compromise 3D consistency and speed, limiting their applicability in real-world scenarios, such as digital twins or telepresence. In contrast, 3D-aware facial animation feedforward methods -- built upon explicit 3D representations, such as neural radiance fields or Gaussian splatting -- ensure 3D consistency and achieve faster inference speed, but come with inferior expression details. In this paper, we aim to combine their strengths by distilling knowledge from a 2D diffusion-based method into a feed-forward encoder, which instantly converts an in-the-wild single image into a 3D-consistent, fast yet expressive animatable representation. Our animation representation is decoupled from the face's 3D representation and learns motion implicitly from data, eliminating the dependency on pre-defined parametric models that often constrain animation capabilities. Unlike previous computationally intensive global fusion mechanisms (e.g., multiple attention layers) for fusing 3D structural and animation information, our design employs an efficient lightweight local fusion strategy to achieve high animation expressivity. As a result, our method runs at 107.31 FPS for animation and pose control while achieving comparable animation quality to the state-of-the-art, surpassing alternative designs that trade speed for quality or vice versa. Project website is this https URL
20. 【2512.16891】LinkedOut: Linking World Knowledge Representation Out of Video LLM for Next-Generation Video Recommendation
链接:https://arxiv.org/abs/2512.16891
作者:Haichao Zhang,Yao Lu,Lichen Wang,Yunzhe Li,Daiwei Chen,Yunpeng Xu,Yun Fu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Multimedia (cs.MM)
关键词:Large Language Models, Video Large Language, video question answering, Language Models, Large Language
备注:
点击查看摘要
Abstract:Video Large Language Models (VLLMs) unlock world-knowledge-aware video understanding through pretraining on internet-scale data and have already shown promise on tasks such as movie analysis and video question answering. However, deploying VLLMs for downstream tasks such as video recommendation remains challenging, since real systems require multi-video inputs, lightweight backbones, low-latency sequential inference, and rapid response. In practice, (1) decode-only generation yields high latency for sequential inference, (2) typical interfaces do not support multi-video inputs, and (3) constraining outputs to language discards fine-grained visual details that matter for downstream vision tasks. We argue that these limitations stem from the absence of a representation that preserves pixel-level detail while leveraging world knowledge. We present LinkedOut, a representation that extracts VLLM world knowledge directly from video to enable fast inference, supports multi-video histories, and removes the language bottleneck. LinkedOut extracts semantically grounded, knowledge-aware tokens from raw frames using VLLMs, guided by promptable queries and optional auxiliary modalities. We introduce a cross-layer knowledge fusion MoE that selects the appropriate level of abstraction from the rich VLLM features, enabling personalized, interpretable, and low-latency recommendation. To our knowledge, LinkedOut is the first VLLM-based video recommendation method that operates on raw frames without handcrafted labels, achieving state-of-the-art results on standard benchmarks. Interpretability studies and ablations confirm the benefits of layer diversity and layer-wise fusion, pointing to a practical path that fully leverages VLLM world-knowledge priors and visual reasoning for downstream vision tasks such as recommendation.
21. 【2512.16885】M-PhyGs: Multi-Material Object Dynamics from Video
链接:https://arxiv.org/abs/2512.16885
作者:Norika Wada,Kohei Yamashita,Ryo Kawahara,Ko Nishino
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:material properties governing, unseen interactions, physical material properties, properties governing, accurately anticipate
备注:
点击查看摘要
Abstract:Knowledge of the physical material properties governing the dynamics of a real-world object becomes necessary to accurately anticipate its response to unseen interactions. Existing methods for estimating such physical material parameters from visual data assume homogeneous single-material objects, pre-learned dynamics, or simplistic topologies. Real-world objects, however, are often complex in material composition and geometry lying outside the realm of these assumptions. In this paper, we particularly focus on flowers as a representative common object. We introduce Multi-material Physical Gaussians (M-PhyGs) to estimate the material composition and parameters of such multi-material complex natural objects from video. From a short video captured in a natural setting, M-PhyGs jointly segments the object into similar materials and recovers their continuum mechanical parameters while accounting for gravity. M-PhyGs achieves this efficiently with newly introduced cascaded 3D and 2D losses, and by leveraging temporal mini-batching. We introduce a dataset, Phlowers, of people interacting with flowers as a novel platform to evaluate the accuracy of this challenging task of multi-material physical parameter estimation. Experimental results on Phlowers dataset demonstrate the accuracy and effectiveness of M-PhyGs and its components.
22. 【2512.16880】Memory-Enhanced SAM3 for Occlusion-Robust Surgical Instrument Segmentation
链接:https://arxiv.org/abs/2512.16880
作者:Valay Bundele,Mehran Hosseinzadeh,Hendrik P.A. Lensch
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Accurate surgical instrument, long-term instrument re-entry, remains challenging due, surgical instrument segmentation, Accurate surgical
备注: Under Review
点击查看摘要
Abstract:Accurate surgical instrument segmentation in endoscopic videos is crucial for computer-assisted interventions, yet remains challenging due to frequent occlusions, rapid motion, specular artefacts, and long-term instrument re-entry. While SAM3 provides a powerful spatio-temporal framework for video object segmentation, its performance in surgical scenes is limited by indiscriminate memory updates, fixed memory capacity, and weak identity recovery after occlusions. We propose ReMeDI-SAM3, a training-free memory-enhanced extension of SAM3, that addresses these limitations through three components: (i) relevance-aware memory filtering with a dedicated occlusion-aware memory for storing pre-occlusion frames, (ii) a piecewise interpolation scheme that expands the effective memory capacity, and (iii) a feature-based re-identification module with temporal voting for reliable post-occlusion identity disambiguation. Together, these components mitigate error accumulation and enable reliable recovery after occlusions. Evaluations on EndoVis17 and EndoVis18 under a zero-shot setting show absolute mcIoU improvements of around 7% and 16%, respectively, over vanilla SAM3, outperforming even prior training-based approaches. Project page: this https URL.
23. 【2512.16876】raining Together, Diagnosing Better: Federated Learning for Collagen VI-Related Dystrophies
链接:https://arxiv.org/abs/2512.16876
作者:Astrid Brull,Sara Aguti,Véronique Bolduc,Ying Hu,Daniel M. Jimenez-Gutierrez,Enrique Zuazua,Joaquin Del-Rio,Oleksii Sliusarenko,Haiyan Zhou,Francesco Muntoni,Carsten G. Bönnemann,Xabi Uribe-Etxebarria
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
关键词:Machine Learning, collagen VI-related dystrophies, application of Machine, rare diseases, VI-related dystrophies
备注:
点击查看摘要
Abstract:The application of Machine Learning (ML) to the diagnosis of rare diseases, such as collagen VI-related dystrophies (COL6-RD), is fundamentally limited by the scarcity and fragmentation of available data. Attempts to expand sampling across hospitals, institutions, or countries with differing regulations face severe privacy, regulatory, and logistical obstacles that are often difficult to overcome. The Federated Learning (FL) provides a promising solution by enabling collaborative model training across decentralized datasets while keeping patient data local and private. Here, we report a novel global FL initiative using the this http URL FL platform, which leverages FL across distributed datasets in two international organizations for the diagnosis of COL6-RD, using collagen VI immunofluorescence microscopy images from patient-derived fibroblast cultures. Our solution resulted in an ML model capable of classifying collagen VI patient images into the three primary pathogenic mechanism groups associated with COL6-RD: exon skipping, glycine substitution, and pseudoexon insertion. This new approach achieved an F1-score of 0.82, outperforming single-organization models (0.57-0.75). These results demonstrate that FL substantially improves diagnostic utility and generalizability compared to isolated institutional models. Beyond enabling more accurate diagnosis, we anticipate that this approach will support the interpretation of variants of uncertain significance and guide the prioritization of sequencing strategies to identify novel pathogenic variants.
24. 【2512.16874】Pixel Seal: Adversarial-only training for invisible image and video watermarking
链接:https://arxiv.org/abs/2512.16874
作者:Tomáš Souček,Pierre Fernandez,Hady Elsahar,Sylvestre-Alvise Rebuffi,Valeriu Lacatusu,Tuan Tran,Tom Sander,Alexandre Mourachko
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
关键词:Invisible watermarking, digital content, Pixel Seal, essential for tracing, introduces Pixel Seal
备注: Code and model available at [this https URL](https://github.com/facebookresearch/videoseal)
点击查看摘要
Abstract:Invisible watermarking is essential for tracing the provenance of digital content. However, training state-of-the-art models remains notoriously difficult, with current approaches often struggling to balance robustness against true imperceptibility. This work introduces Pixel Seal, which sets a new state-of-the-art for image and video watermarking. We first identify three fundamental issues of existing methods: (i) the reliance on proxy perceptual losses such as MSE and LPIPS that fail to mimic human perception and result in visible watermark artifacts; (ii) the optimization instability caused by conflicting objectives, which necessitates exhaustive hyperparameter tuning; and (iii) reduced robustness and imperceptibility of watermarks when scaling models to high-resolution images and videos. To overcome these issues, we first propose an adversarial-only training paradigm that eliminates unreliable pixel-wise imperceptibility losses. Second, we introduce a three-stage training schedule that stabilizes convergence by decoupling robustness and imperceptibility. Third, we address the resolution gap via high-resolution adaptation, employing JND-based attenuation and training-time inference simulation to eliminate upscaling artifacts. We thoroughly evaluate the robustness and imperceptibility of Pixel Seal on different image types and across a wide range of transformations, and show clear improvements over the state-of-the-art. We finally demonstrate that the model efficiently adapts to video via temporal watermark pooling, positioning Pixel Seal as a practical and scalable solution for reliable provenance in real-world image and video settings.
25. 【2512.16864】RePlan: Reasoning-guided Region Planning for Complex Instruction-based Image Editing
链接:https://arxiv.org/abs/2512.16864
作者:Tianyuan Qu,Lei Ke,Xiaohang Zhan,Longxiang Tang,Yuqi Liu,Bohao Peng,Bei Yu,Dong Yu,Jiaya Jia
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Instruction-based image editing, image editing enables, editing enables natural-language, enables natural-language control, existing models falter
备注: Precise region control and planning for instruction-based image editing. Our project page: [this https URL](https://replan-iv-edit.github.io)
点击查看摘要
Abstract:Instruction-based image editing enables natural-language control over visual modifications, yet existing models falter under Instruction-Visual Complexity (IV-Complexity), where intricate instructions meet cluttered or ambiguous scenes. We introduce RePlan (Region-aligned Planning), a plan-then-execute framework that couples a vision-language planner with a diffusion editor. The planner decomposes instructions via step-by-step reasoning and explicitly grounds them to target regions; the editor then applies changes using a training-free attention-region injection mechanism, enabling precise, parallel multi-region edits without iterative inpainting. To strengthen planning, we apply GRPO-based reinforcement learning using 1K instruction-only examples, yielding substantial gains in reasoning fidelity and format reliability. We further present IV-Edit, a benchmark focused on fine-grained grounding and knowledge-intensive edits. Across IV-Complex settings, RePlan consistently outperforms strong baselines trained on far larger datasets, improving regional precision and overall fidelity. Our project page: this https URL
26. 【2512.16853】GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation
链接:https://arxiv.org/abs/2512.16853
作者:Amita Kamath,Kai-Wei Chang,Ranjay Krishna,Luke Zettlemoyer,Yushi Hu,Marjan Ghazvininejad
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:score correctness, test prompts, Automating, benchmark, GenEval
备注:
点击查看摘要
Abstract:Automating Text-to-Image (T2I) model evaluation is challenging; a judge model must be used to score correctness, and test prompts must be selected to be challenging for current T2I models but not the judge. We argue that satisfying these constraints can lead to benchmark drift over time, where the static benchmark judges fail to keep up with newer model capabilities. We show that benchmark drift is a significant problem for GenEval, one of the most popular T2I benchmarks. Although GenEval was well-aligned with human judgment at the time of its release, it has drifted far from human judgment over time -- resulting in an absolute error of as much as 17.7% for current models. This level of drift strongly suggests that GenEval has been saturated for some time, as we verify via a large-scale human study. To help fill this benchmarking gap, we introduce a new benchmark, GenEval 2, with improved coverage of primitive visual concepts and higher degrees of compositionality, which we show is more challenging for current models. We also introduce Soft-TIFA, an evaluation method for GenEval 2 that combines judgments for visual primitives, which we show is more well-aligned with human judgment and argue is less likely to drift from human-alignment over time (as compared to more holistic judges such as VQAScore). Although we hope GenEval 2 will provide a strong benchmark for many years, avoiding benchmark drift is far from guaranteed and our work, more generally, highlights the importance of continual audits and improvement for T2I and related automated model evaluation benchmarks.
27. 【2512.16842】OPENTOUCH: Bringing Full-Hand Touch to Real-World Interaction
链接:https://arxiv.org/abs/2512.16842
作者:Yuxin Ray Song,Jinzhou Li,Rao Fu,Devin Murphy,Kaichen Zhou,Rishi Shiv,Yaqi Li,Haoyu Xiong,Crystal Elaine Owens,Yilun Du,Yiyue Luo,Xianyi Cheng,Antonio Torralba,Wojciech Matusik,Paul Pu Liang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
关键词:makes contact, egocentric perception rarely, human hand, primary interface, forcefully it makes
备注: [this https URL](https://opentouch-tactile.github.io/)
点击查看摘要
Abstract:The human hand is our primary interface to the physical world, yet egocentric perception rarely knows when, where, or how forcefully it makes contact. Robust wearable tactile sensors are scarce, and no existing in-the-wild datasets align first-person video with full-hand touch. To bridge the gap between visual perception and physical interaction, we present OpenTouch, the first in-the-wild egocentric full-hand tactile dataset, containing 5.1 hours of synchronized video-touch-pose data and 2,900 curated clips with detailed text annotations. Using OpenTouch, we introduce retrieval and classification benchmarks that probe how touch grounds perception and action. We show that tactile signals provide a compact yet powerful cue for grasp understanding, strengthen cross-modal alignment, and can be reliably retrieved from in-the-wild video queries. By releasing this annotated vision-touch-pose dataset and benchmark, we aim to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.
28. 【2512.16841】Radiology Report Generation with Layer-Wise Anatomical Attention
链接:https://arxiv.org/abs/2512.16841
作者:Emmanuel D. Muñiz-De-León,Jorge A. Rosales-de-Golferichs,Ana S. Muñoz-Rodríguez,Alejandro I. Trejo-Castro,Eduardo de Avila-Armenta,Antonio Martínez-Torteya
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Automatic radiology report, multimodal deep learning, reduce reporting workload, radiology report generation, Automatic radiology
备注: 11 pages, 6 figures
点击查看摘要
Abstract:Automatic radiology report generation is a promising application of multimodal deep learning, aiming to reduce reporting workload and improve consistency. However, current state-of-the-art (SOTA) systems - such as Multimodal AI for Radiology Applications (MAIRA-2) and Medical Pathways Language Model-Multimodal (MedPaLM-M) - depend on large-scale multimodal training, clinical metadata, and multiple imaging views, making them resource-intensive and inaccessible for most settings. We introduce a compact image-to-text architecture that generates the Findings section of chest X-ray reports from a single frontal image. The model combines a frozen Self-Distillation with No Labels v3 (DINOv3) Vision Transformer (ViT) encoder with a Generative Pre-trained Transformer 2 (GPT-2) decoder enhanced by layer-wise anatomical attention. This mechanism integrates lung and heart segmentation masks through hierarchical Gaussian smoothing, biasing attention toward clinically relevant regions without adding trainable parameters. Evaluated on the official Medical Information Mart for Intensive Care-Chest X-ray (MIMIC-CXR) dataset using Chest Radiograph Expert (CheXpert) and Radiology Graph (RadGraph) metrics, our approach achieved substantial gains: CheXpert Macro-F1 for five key pathologies increased by 168% (0.083 - 0.238) and Micro-F1 by 146% (0.137 - 0.337), while broader performance across 14 observations improved by 86% (0.170 - 0.316). Structural coherence also improved, with RadGraph F1 rising by 9.7%. Despite its small size and purely image-conditioned design, the model demonstrates that decoder-level anatomical guidance improves spatial grounding and enhances coherence in clinically relevant regions. The source code is publicly available at: this https URL.
29. 【2512.16826】Next-Generation License Plate Detection and Recognition System using YOLOv8
链接:https://arxiv.org/abs/2512.16826
作者:Arslan Amin,Rafia Mumtaz,Muhammad Jawad Bashir,Syed Mohammad Hassan Zaidi
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:license plate detection, Intelligent Transportation Systems, License Plate Recognition, license plate, Character Recognition
备注: 6 pages, 5 figures. Accepted and published in the 2023 IEEE 20th International Conference on Smart Communities: Improving Quality of Life using AI, Robotics and IoT (HONET)
点击查看摘要
Abstract:In the evolving landscape of traffic management and vehicle surveillance, efficient license plate detection and recognition are indispensable. Historically, many methodologies have tackled this challenge, but consistent real-time accuracy, especially in diverse environments, remains elusive. This study examines the performance of YOLOv8 variants on License Plate Recognition (LPR) and Character Recognition tasks, crucial for advancing Intelligent Transportation Systems. Two distinct datasets were employed for training and evaluation, yielding notable findings. The YOLOv8 Nano variant demonstrated a precision of 0.964 and mAP50 of 0.918 on the LPR task, while the YOLOv8 Small variant exhibited a precision of 0.92 and mAP50 of 0.91 on the Character Recognition task. A custom method for character sequencing was introduced, effectively sequencing the detected characters based on their x-axis positions. An optimized pipeline, utilizing YOLOv8 Nano for LPR and YOLOv8 Small for Character Recognition, is proposed. This configuration not only maintains computational efficiency but also ensures high accuracy, establishing a robust foundation for future real-world deployments on edge devices within Intelligent Transportation Systems. This effort marks a significant stride towards the development of smarter and more efficient urban infrastructures.
30. 【2512.16818】DenseBEV: Transforming BEV Grid Cells into 3D Objects
链接:https://arxiv.org/abs/2512.16818
作者:Marius Dähling,Sebastian Krebs,J. Marius Zöllner
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:BEV, current research, based transformers, transformers are increasingly, increasingly utilized
备注: 15 pages, 8 figures, accepted by WACV 2026
点击查看摘要
Abstract:In current research, Bird's-Eye-View (BEV)-based transformers are increasingly utilized for multi-camera 3D object detection. Traditional models often employ random queries as anchors, optimizing them successively. Recent advancements complement or replace these random queries with detections from auxiliary networks. We propose a more intuitive and efficient approach by using BEV feature cells directly as anchors. This end-to-end approach leverages the dense grid of BEV queries, considering each cell as a potential object for the final detection task. As a result, we introduce a novel two-stage anchor generation method specifically designed for multi-camera 3D object detection. To address the scaling issues of attention with a large number of queries, we apply BEV-based Non-Maximum Suppression, allowing gradients to flow only through non-suppressed objects. This ensures efficient training without the need for post-processing. By using BEV features from encoders such as BEVFormer directly as object queries, temporal BEV information is inherently embedded. Building on the temporal BEV information already embedded in our object queries, we introduce a hybrid temporal modeling approach by integrating prior detections to further enhance detection performance. Evaluating our method on the nuScenes dataset shows consistent and significant improvements in NDS and mAP over the baseline, even with sparser BEV grids and therefore fewer initial anchors. It is particularly effective for small objects, enhancing pedestrian detection with a 3.8% mAP increase on nuScenes and an 8% increase in LET-mAP on Waymo. Applying our method, named DenseBEV, to the challenging Waymo Open dataset yields state-of-the-art performance, achieving a LET-mAP of 60.7%, surpassing the previous best by 5.4%. Code is available at this https URL.
31. 【2512.16811】GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation
链接:https://arxiv.org/abs/2512.16811
作者:Jingjing Qian,Boyao Han,Chen Shi,Lei Xiao,Long Yang,Shaoshuai Shi,Li Jiang
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:remain largely reactive, models achieve strong, achieve strong generalization, models achieve, making them unreliable
备注:
点击查看摘要
Abstract:Vision-Language-Action (VLA) models achieve strong generalization in robotic manipulation but remain largely reactive and 2D-centric, making them unreliable in tasks that require precise 3D reasoning. We propose GeoPredict, a geometry-aware VLA framework that augments a continuous-action policy with predictive kinematic and geometric priors. GeoPredict introduces a trajectory-level module that encodes motion history and predicts multi-step 3D keypoint trajectories of robot arms, and a predictive 3D Gaussian geometry module that forecasts workspace geometry with track-guided refinement along future keypoint trajectories. These predictive modules serve exclusively as training-time supervision through depth-based rendering, while inference requires only lightweight additional query tokens without invoking any 3D decoding. Experiments on RoboCasa Human-50, LIBERO, and real-world manipulation tasks show that GeoPredict consistently outperforms strong VLA baselines, especially in geometry-intensive and spatially demanding scenarios.
32. 【2512.16791】KineST: A Kinematics-guided Spatiotemporal State Space Model for Human Motion Tracking from Sparse Signals
链接:https://arxiv.org/abs/2512.16791
作者:Shuting Zhao,Zeyu Xiao,Xinrong Chen
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:motion tracking plays, Full-body motion tracking, bridging physical, virtual interactions, tracking plays
备注: Accepted by AAAI 2026
点击查看摘要
Abstract:Full-body motion tracking plays an essential role in AR/VR applications, bridging physical and virtual interactions. However, it is challenging to reconstruct realistic and diverse full-body poses based on sparse signals obtained by head-mounted displays, which are the main devices in AR/VR scenarios. Existing methods for pose reconstruction often incur high computational costs or rely on separately modeling spatial and temporal dependencies, making it difficult to balance accuracy, temporal coherence, and efficiency. To address this problem, we propose KineST, a novel kinematics-guided state space model, which effectively extracts spatiotemporal dependencies while integrating local and global pose perception. The innovation comes from two core ideas. Firstly, in order to better capture intricate joint relationships, the scanning strategy within the State Space Duality framework is reformulated into kinematics-guided bidirectional scanning, which embeds kinematic priors. Secondly, a mixed spatiotemporal representation learning approach is employed to tightly couple spatial and temporal contexts, balancing accuracy and smoothness. Additionally, a geometric angular velocity loss is introduced to impose physically meaningful constraints on rotational variations for further improving motion stability. Extensive experiments demonstrate that KineST has superior performance in both accuracy and temporal consistency within a lightweight framework. Project page: this https URL
33. 【2512.16784】R3ST: A Synthetic 3D Dataset With Realistic Trajectories
链接:https://arxiv.org/abs/2512.16784
作者:Simone Teglia,Claudia Melis Tonti,Francesco Pro,Leonardo Russo,Andrea Alfarano,Leonardo Pentassuglia,Irene Amerini
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:evaluate computer vision, enhance road safety, computer vision models, essential to train, train and evaluate
备注:
点击查看摘要
Abstract:Datasets are essential to train and evaluate computer vision models used for traffic analysis and to enhance road safety. Existing real datasets fit real-world scenarios, capturing authentic road object behaviors, however, they typically lack precise ground-truth annotations. In contrast, synthetic datasets play a crucial role, allowing for the annotation of a large number of frames without additional costs or extra time. However, a general drawback of synthetic datasets is the lack of realistic vehicle motion, since trajectories are generated using AI models or rule-based systems. In this work, we introduce R3ST (Realistic 3D Synthetic Trajectories), a synthetic dataset that overcomes this limitation by generating a synthetic 3D environment and integrating real-world trajectories derived from SinD, a bird's-eye-view dataset recorded from drone footage. The proposed dataset closes the gap between synthetic data and realistic trajectories, advancing the research in trajectory forecasting of road vehicles, offering both accurate multimodal ground-truth annotations and authentic human-driven vehicle trajectories.
34. 【2512.16776】Kling-Omni Technical Report
链接:https://arxiv.org/abs/2512.16776
作者:Kling Team:Jialu Chen,Yuanzheng Ci,Xiangyu Du,Zipeng Feng,Kun Gai,Sainan Guo,Feng Han,Jingbin He,Kang He,Xiao Hu,Xiaohua Hu,Boyuan Jiang,Fangyuan Kong,Hang Li,Jie Li,Qingyu Li,Shen Li,Xiaohan Li,Yan Li,Jiajun Liang,Borui Liao,Yiqiao Liao,Weihong Lin,Quande Liu,Xiaokun Liu,Yilun Liu,Yuliang Liu,Shun Lu,Hangyu Mao,Yunyao Mao,Haodong Ouyang,Wenyu Qin,Wanqi Shi,Xiaoyu Shi,Lianghao Su,Haozhi Sun,Peiqin Sun,Pengfei Wan,Chao Wang,Chenyu Wang,Meng Wang,Qiulin Wang,Runqi Wang,Xintao Wang,Xuebo Wang,Zekun Wang,Min Wei,Tiancheng Wen,Guohao Wu,Xiaoshi Wu,Zhenhua Wu,Da Xie,Yingtong Xiong,Yulong Xu,Sile Yang,Zikang Yang,Weicai Ye,Ziyang Yuan,Shenglong Zhang,Shuaiyu Zhang,Yuanxing Zhang,Yufan Zhang,Wenzheng Zhao,Ruiliang Zhou,Yan Zhou,Guosheng Zhu,Yongjie Zhu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:visual language inputs, synthesize high-fidelity videos, high-fidelity videos directly, generalist generative framework, generative framework designed
备注: Kling-Omni Technical Report
点击查看摘要
Abstract:We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality and highly-intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations reveal that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. Moving beyond a content creation tool, we believe Kling-Omni is a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning, generating and interacting with the dynamic and complex worlds.
35. 【2512.16771】FlowDet: Unifying Object Detection and Generative Transport Flows
链接:https://arxiv.org/abs/2512.16771
作者:Enis Baty,C. P. Bridges,Simon Hadfield
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Conditional Flow Matching, Flow Matching techniques, modern Conditional Flow, Conditional Flow, Flow Matching
备注:
点击查看摘要
Abstract:We present FlowDet, the first formulation of object detection using modern Conditional Flow Matching techniques. This work follows from DiffusionDet, which originally framed detection as a generative denoising problem in the bounding box space via diffusion. We revisit and generalise this formulation to a broader class of generative transport problems, while maintaining the ability to vary the number of boxes and inference steps without re-training. In contrast to the curved stochastic transport paths induced by diffusion, FlowDet learns simpler and straighter paths resulting in faster scaling of detection performance as the number of inference steps grows. We find that this reformulation enables us to outperform diffusion based detection systems (as well as non-generative baselines) across a wide range of experiments, including various precision/recall operating points using multiple feature backbones and datasets. In particular, when evaluating under recall-constrained settings, we can highlight the effects of the generative transport without over-compensating with large numbers of proposals. This provides gains of up to +3.6% AP and +4.2% AP$_{rare}$ over DiffusionDet on the COCO and LVIS datasets, respectively.
36. 【2512.16767】Make-It-Poseable: Feed-forward Latent Posing Model for 3D Humanoid Character Animation
链接:https://arxiv.org/abs/2512.16767
作者:Zhiyang Guo,Ori Zhang,Jax Xiang,Alan Zhao,Wengang Zhou,Houqiang Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:graphics and vision, fundamental task, task in computer, computer graphics, Posing
备注: Project page: [this https URL](https://jasongzy.github.io/Make-It-Poseable/)
点击查看摘要
Abstract:Posing 3D characters is a fundamental task in computer graphics and vision. However, existing methods like auto-rigging and pose-conditioned generation often struggle with challenges such as inaccurate skinning weight prediction, topological imperfections, and poor pose conformance, limiting their robustness and generalizability. To overcome these limitations, we introduce Make-It-Poseable, a novel feed-forward framework that reformulates character posing as a latent-space transformation problem. Instead of deforming mesh vertices as in traditional pipelines, our method reconstructs the character in new poses by directly manipulating its latent representation. At the core of our method is a latent posing transformer that manipulates shape tokens based on skeletal motion. This process is facilitated by a dense pose representation for precise control. To ensure high-fidelity geometry and accommodate topological changes, we also introduce a latent-space supervision strategy and an adaptive completion module. Our method demonstrates superior performance in posing quality. It also naturally extends to 3D editing applications like part replacement and refinement.
37. 【2512.16743】reeNet: A Light Weight Model for Low Bitrate Image Compression
链接:https://arxiv.org/abs/2512.16743
作者:Mahadev Prasad Panda,Purnachandra Rao Makkena,Srivatsa Prativadibhayankaram,Siegfried Fößel,André Kaup
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:image compression techniques, computational complexity remains, learning-based image compression, image compression, remains a critical
备注:
点击查看摘要
Abstract:Reducing computational complexity remains a critical challenge for the widespread adoption of learning-based image compression techniques. In this work, we propose TreeNet, a novel low-complexity image compression model that leverages a binary tree-structured encoder-decoder architecture to achieve efficient representation and reconstruction. We employ attentional feature fusion mechanism to effectively integrate features from multiple branches. We evaluate TreeNet on three widely used benchmark datasets and compare its performance against competing methods including JPEG AI, a recent standard in learning-based image compression. At low bitrates, TreeNet achieves an average improvement of 4.83% in BD-rate over JPEG AI, while reducing model complexity by 87.82%. Furthermore, we conduct extensive ablation studies to investigate the influence of various latent representations within TreeNet, offering deeper insights into the factors contributing to reconstruction.
38. 【2512.16740】ask-Oriented Data Synthesis and Control-Rectify Sampling for Remote Sensing Semantic Segmentation
链接:https://arxiv.org/abs/2512.16740
作者:Yunkai Yang,Yudong Zhang,Kunquan Zhang,Jinxiao Zhang,Xinying Chen,Haohuan Fu,Runmin Dong
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:expand labeled datasets, alleviate manual annotation, training data synthesis, Multimodal Diffusion Transformer, remote sensing
备注:
点击查看摘要
Abstract:With the rapid progress of controllable generation, training data synthesis has become a promising way to expand labeled datasets and alleviate manual annotation in remote sensing (RS). However, the complexity of semantic mask control and the uncertainty of sampling quality often limit the utility of synthetic data in downstream semantic segmentation tasks. To address these challenges, we propose a task-oriented data synthesis framework (TODSynth), including a Multimodal Diffusion Transformer (MM-DiT) with unified triple attention and a plug-and-play sampling strategy guided by task feedback. Built upon the powerful DiT-based generative foundation model, we systematically evaluate different control schemes, showing that a text-image-mask joint attention scheme combined with full fine-tuning of the image and mask branches significantly enhances the effectiveness of RS semantic segmentation data synthesis, particularly in few-shot and complex-scene scenarios. Furthermore, we propose a control-rectify flow matching (CRFM) method, which dynamically adjusts sampling directions guided by semantic loss during the early high-plasticity stage, mitigating the instability of generated images and bridging the gap between synthetic data and downstream segmentation tasks. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art controllable generation methods, producing more stable and task-oriented synthetic data for RS semantic segmentation.
39. 【2512.16727】OMG-Bench: A New Challenging Benchmark for Skeleton-based Online Micro Hand Gesture Recognition
链接:https://arxiv.org/abs/2512.16727
作者:Haochen Chang,Pengfei Ren,Buyuan Zhang,Da Li,Tianhao Han,Haoyang Zhang,Liang Xie,Hongbo Chen,Erwei Yin
类目:Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
关键词:Online micro gesture, micro gesture recognition, faces challenges due, Online micro, task-specific algorithms
备注: Project page: [this https URL](https://omg-bench.github.io/)
点击查看摘要
Abstract:Online micro gesture recognition from hand skeletons is critical for VR/AR interaction but faces challenges due to limited public datasets and task-specific algorithms. Micro gestures involve subtle motion patterns, which make constructing datasets with precise skeletons and frame-level annotations difficult. To this end, we develop a multi-view self-supervised pipeline to automatically generate skeleton data, complemented by heuristic rules and expert refinement for semi-automatic annotation. Based on this pipeline, we introduce OMG-Bench, the first large-scale public benchmark for skeleton-based online micro gesture recognition. It features 40 fine-grained gesture classes with 13,948 instances across 1,272 sequences, characterized by subtle motions, rapid dynamics, and continuous execution. To tackle these challenges, we propose Hierarchical Memory-Augmented Transformer (HMATr), an end-to-end framework that unifies gesture detection and classification by leveraging hierarchical memory banks which store frame-level details and window-level semantics to preserve historical context. In addition, it employs learnable position-aware queries initialized from the memory to implicitly encode gesture positions and semantics. Experiments show that HMATr outperforms state-of-the-art methods by 7.6\% in detection rate, establishing a strong baseline for online micro gesture recognition. Project page: this https URL
40. 【2512.16724】VERM: Leveraging Foundation Models to Create a Virtual Eye for Efficient 3D Robotic Manipulation
链接:https://arxiv.org/abs/2512.16724
作者:Yixiang Chen,Yan Huang,Keji He,Peiyan Li,Liang Wang
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:multiple fixed cameras, execute action planning, action planning based, fixed cameras, based on perceptions
备注: Accepted at RA-L 2025
点击查看摘要
Abstract:When performing 3D manipulation tasks, robots have to execute action planning based on perceptions from multiple fixed cameras. The multi-camera setup introduces substantial redundancy and irrelevant information, which increases computational costs and forces the model to spend extra training time extracting crucial task-relevant details. To filter out redundant information and accurately extract task-relevant features, we propose the VERM (Virtual Eye for Robotic Manipulation) method, leveraging the knowledge in foundation models to imagine a virtual task-adaptive view from the constructed 3D point cloud, which efficiently captures necessary information and mitigates occlusion. To facilitate 3D action planning and fine-grained manipulation, we further design a depth-aware module and a dynamic coarse-to-fine procedure. Extensive experimental results on both simulation benchmark RLBench and real-world evaluations demonstrate the effectiveness of our method, surpassing previous state-of-the-art methods while achieving 1.89x speedup in training time and 1.54x speedup in inference speed. More results can be found on our project website at this https URL .
41. 【2512.16710】A multi-centre, multi-device benchmark dataset for landmark-based comprehensive fetal biometry
链接:https://arxiv.org/abs/2512.16710
作者:Chiara Di Vece,Zhehua Mao,Netanell Avisdris,Brian Dromey,Raffaele Napolitano,Dafna Ben Bashat,Francisco Vasconcelos,Danail Stoyanov,Leo Joskowicz,Sophia Bano
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Accurate fetal growth, fetal growth assessment, manually identifying anatomical, precise biometry measured, Accurate fetal
备注: 11 pages, 5 figures, 3 tables
点击查看摘要
Abstract:Accurate fetal growth assessment from ultrasound (US) relies on precise biometry measured by manually identifying anatomical landmarks in standard planes. Manual landmarking is time-consuming, operator-dependent, and sensitive to variability across scanners and sites, limiting the reproducibility of automated approaches. There is a need for multi-source annotated datasets to develop artificial intelligence-assisted fetal growth assessment methods. To address this bottleneck, we present an open, multi-centre, multi-device benchmark dataset of fetal US images with expert anatomical landmark annotations for clinically used fetal biometric measurements. These measurements include head bi-parietal and occipito-frontal diameters, abdominal transverse and antero-posterior diameters, and femoral length. The dataset contains 4,513 de-identified US images from 1,904 subjects acquired at three clinical sites using seven different US devices. We provide standardised, subject-disjoint train/test splits, evaluation code, and baseline results to enable fair and reproducible comparison of methods. Using an automatic biometry model, we quantify domain shift and demonstrate that training and evaluation confined to a single centre substantially overestimate performance relative to multi-centre testing. To the best of our knowledge, this is the first publicly available multi-centre, multi-device, landmark-annotated dataset that covers all primary fetal biometry measures, providing a robust benchmark for domain adaptation and multi-centre generalisation in fetal biometry and enabling more reliable AI-assisted fetal growth assessment across centres. All data, annotations, training code, and evaluation pipelines are made publicly available.
42. 【2512.16706】SDFoam: Signed-Distance Foam for explicit surface reconstruction
链接:https://arxiv.org/abs/2512.16706
作者:Antonella Rech,Nicola Conci,Nicola Garau
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:driven impressive progress, Neural radiance fields, ray-traced volumetric rendering, Neural radiance, Gaussian Splatting
备注:
点击查看摘要
Abstract:Neural radiance fields (NeRF) have driven impressive progress in view synthesis by using ray-traced volumetric rendering. Splatting-based methods such as 3D Gaussian Splatting (3DGS) provide faster rendering by rasterizing 3D primitives. RadiantFoam (RF) brought ray tracing back, achieving throughput comparable to Gaussian Splatting by organizing radiance with an explicit Voronoi Diagram (VD). Yet, all the mentioned methods still struggle with precise mesh reconstruction. We address this gap by jointly learning an explicit VD with an implicit Signed Distance Field (SDF). The scene is optimized via ray tracing and regularized by an Eikonal objective. The SDF introduces metric-consistent isosurfaces, which, in turn, bias near-surface Voronoi cell faces to align with the zero level set. The resulting model produces crisper, view-consistent surfaces with fewer floaters and improved topology, while preserving photometric quality and maintaining training speed on par with RadiantFoam. Across diverse scenes, our hybrid implicit-explicit formulation, which we name SDFoam, substantially improves mesh reconstruction accuracy (Chamfer distance) with comparable appearance (PSNR, SSIM), without sacrificing efficiency.
43. 【2512.16688】Detecting Localized Deepfakes: How Well Do Synthetic Image Detectors Handle Inpainting?
链接:https://arxiv.org/abs/2512.16688
作者:Serafino Pandolfini,Lorenzo Pellegrini,Matteo Ferrara,Davide Maltoni
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:enabled highly realistic, highly realistic image, region-level editing, rapid progress, progress of generative
备注: 17 pages, 5 figures, 9 tables
点击查看摘要
Abstract:The rapid progress of generative AI has enabled highly realistic image manipulations, including inpainting and region-level editing. These approaches preserve most of the original visual context and are increasingly exploited in cybersecurity-relevant threat scenarios. While numerous detectors have been proposed for identifying fully synthetic images, their ability to generalize to localized manipulations remains insufficiently characterized. This work presents a systematic evaluation of state-of-the-art detectors, originally trained for the deepfake detection on fully synthetic images, when applied to a distinct challenge: localized inpainting detection. The study leverages multiple datasets spanning diverse generators, mask sizes, and inpainting techniques. Our experiments show that models trained on a large set of generators exhibit partial transferability to inpainting-based edits and can reliably detect medium- and large-area manipulations or regeneration-style inpainting, outperforming many existing ad hoc detection approaches.
44. 【2512.16685】Few-Shot Fingerprinting Subject Re-Identification in 3D-MRI and 2D-X-Ray
链接:https://arxiv.org/abs/2512.16685
作者:Gonçalo Gaspar Alves,Shekoufeh Gorgi Zadeh,Andreas Husch,Ben Bausch
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Combining open-source datasets, Combining open-source, inflated model performance, introduce data leakage, multiple sets
备注:
点击查看摘要
Abstract:Combining open-source datasets can introduce data leakage if the same subject appears in multiple sets, leading to inflated model performance. To address this, we explore subject fingerprinting, mapping all images of a subject to a distinct region in latent space, to enable subject re-identification via similarity matching. Using a ResNet-50 trained with triplet margin loss, we evaluate few-shot fingerprinting on 3D MRI and 2D X-ray data in both standard (20-way 1-shot) and challenging (1000-way 1-shot) scenarios. The model achieves high Mean- Recall-@-K scores: 99.10% (20-way 1-shot) and 90.06% (500-way 5-shot) on ChestXray-14; 99.20% (20-way 1-shot) and 98.86% (100-way 3-shot) on BraTS- 2021.
45. 【2512.16670】FrameDiffuser: G-Buffer-Conditioned Diffusion for Neural Forward Frame Rendering
链接:https://arxiv.org/abs/2512.16670
作者:Ole Beisswenger,Jan-Niklas Dihlmann,Hendrik P.A. Lensch
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:requires translating geometric, applications requires translating, interactive applications requires, interactive applications, translating geometric
备注: Project Page: [this https URL](https://framediffuser.jdihlmann.com/)
点击查看摘要
Abstract:Neural rendering for interactive applications requires translating geometric and material properties (G-buffer) to photorealistic images with realistic lighting on a frame-by-frame basis. While recent diffusion-based approaches show promise for G-buffer-conditioned image synthesis, they face critical limitations: single-image models like RGBX generate frames independently without temporal consistency, while video models like DiffusionRenderer are too computationally expensive for most consumer gaming sets ups and require complete sequences upfront, making them unsuitable for interactive applications where future frames depend on user input. We introduce FrameDiffuser, an autoregressive neural rendering framework that generates temporally consistent, photorealistic frames by conditioning on G-buffer data and the models own previous output. After an initial frame, FrameDiffuser operates purely on incoming G-buffer data, comprising geometry, materials, and surface properties, while using its previously generated frame for temporal guidance, maintaining stable, temporal consistent generation over hundreds to thousands of frames. Our dual-conditioning architecture combines ControlNet for structural guidance with ControlLoRA for temporal coherence. A three-stage training strategy enables stable autoregressive generation. We specialize our model to individual environments, prioritizing consistency and inference speed over broad generalization, demonstrating that environment-specific training achieves superior photorealistic quality with accurate lighting, shadows, and reflections compared to generalized approaches.
46. 【2512.16636】REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion
链接:https://arxiv.org/abs/2512.16636
作者:Giorgos Petsangourakis,Christos Sgouropoulos,Bill Psomas,Theodoros Giannakopoulos,Giorgos Sfikas,Ioannis Kakogeorgiou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:requiring longer training, limiting sample quality, reconstruction-style denoising objective, Vision Foundation Models, indirect semantic supervision
备注:
点击查看摘要
Abstract:Latent diffusion models (LDMs) achieve state-of-the-art image synthesis, yet their reconstruction-style denoising objective provides only indirect semantic supervision: high-level semantics emerge slowly, requiring longer training and limiting sample quality. Recent works inject semantics from Vision Foundation Models (VFMs) either externally via representation alignment or internally by jointly modeling only a narrow slice of VFM features inside the diffusion process, under-utilizing the rich, nonlinear, multi-layer spatial semantics available. We introduce REGLUE (Representation Entanglement with Global-Local Unified Encoding), a unified latent diffusion framework that jointly models (i) VAE image latents, (ii) compact local (patch-level) VFM semantics, and (iii) a global (image-level) [CLS] token within a single SiT backbone. A lightweight convolutional semantic compressor nonlinearly aggregates multi-layer VFM features into a low-dimensional, spatially structured representation, which is entangled with the VAE latents in the diffusion process. An external alignment loss further regularizes internal representations toward frozen VFM targets. On ImageNet 256x256, REGLUE consistently improves FID and accelerates convergence over SiT-B/2 and SiT-XL/2 baselines, as well as over REPA, ReDi, and REG. Extensive experiments show that (a) spatial VFM semantics are crucial, (b) non-linear compression is key to unlocking their full benefit, and (c) global tokens and external alignment act as complementary, lightweight enhancements within our global-local-latent joint modeling framework. The code is available at this https URL .
47. 【2512.16635】SARMAE: Masked Autoencoder for SAR Representation Learning
链接:https://arxiv.org/abs/2512.16635
作者:Danxu Liu,Di Wang,Hebaixu Wang,Haoyang Chen,Wentao Jiang,Yilin Cheng,Haonan Guo,Wei Cui,Jing Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Synthetic Aperture Radar, Synthetic Aperture, Aperture Radar, remote sensing applications, role in all-weather
备注: Code and models will be available at [this https URL](https://github.com/MiliLab/SARMAE)
点击查看摘要
Abstract:Synthetic Aperture Radar (SAR) imagery plays a critical role in all-weather, day-and-night remote sensing applications. However, existing SAR-oriented deep learning is constrained by data scarcity, while the physically grounded speckle noise in SAR imagery further hampers fine-grained semantic representation learning. To address these challenges, we propose SARMAE, a Noise-Aware Masked Autoencoder for self-supervised SAR representation learning. Specifically, we construct SAR-1M, the first million-scale SAR dataset, with additional paired optical images, to enable large-scale pre-training. Building upon this, we design Speckle-Aware Representation Enhancement (SARE), which injects SAR-specific speckle noise into masked autoencoders to facilitate noise-aware and robust representation learning. Furthermore, we introduce Semantic Anchor Representation Constraint (SARC), which leverages paired optical priors to align SAR features and ensure semantic consistency. Extensive experiments across multiple SAR datasets demonstrate that SARMAE achieves state-of-the-art performance on classification, detection, and segmentation tasks. Code and models will be available at this https URL.
48. 【2512.16625】DeContext as Defense: Safe Image Editing in Diffusion Transformers
链接:https://arxiv.org/abs/2512.16625
作者:Linghui Shen,Mingyue Cui,Xingyi Yang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:ease and realism, In-context diffusion models, users to modify, remarkable ease, In-context diffusion
备注: 17 pages, 11 figures
点击查看摘要
Abstract:In-context diffusion models allow users to modify images with remarkable ease and realism. However, the same power raises serious privacy concerns: personal images can be easily manipulated for identity impersonation, misinformation, or other malicious uses, all without the owner's consent. While prior work has explored input perturbations to protect against misuse in personalized text-to-image generation, the robustness of modern, large-scale in-context DiT-based models remains largely unexamined. In this paper, we propose DeContext, a new method to safeguard input images from unauthorized in-context editing. Our key insight is that contextual information from the source image propagates to the output primarily through multimodal attention layers. By injecting small, targeted perturbations that weaken these cross-attention pathways, DeContext breaks this flow, effectively decouples the link between input and output. This simple defense is both efficient and robust. We further show that early denoising steps and specific transformer blocks dominate context propagation, which allows us to concentrate perturbations where they matter most. Experiments on Flux Kontext and Step1X-Edit show that DeContext consistently blocks unwanted image edits while preserving visual quality. These results highlight the effectiveness of attention-based perturbations as a powerful defense against image manipulation.
49. 【2512.16620】Plug to Place: Indoor Multimedia Geolocation from Electrical Sockets for Digital Investigation
链接:https://arxiv.org/abs/2512.16620
作者:Kanwal Aftab,Graham Adams,Mark Scanlon
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:rapidly evolving field, shows great promise, Computer vision, digital forensic applications, evolving field
备注:
点击查看摘要
Abstract:Computer vision is a rapidly evolving field, giving rise to powerful new tools and techniques in digital forensic investigation, and shows great promise for novel digital forensic applications. One such application, indoor multimedia geolocation, has the potential to become a crucial aid for law enforcement in the fight against human trafficking, child exploitation, and other serious crimes. While outdoor multimedia geolocation has been widely explored, its indoor counterpart remains underdeveloped due to challenges such as similar room layouts, frequent renovations, visual ambiguity, indoor lighting variability, unreliable GPS signals, and limited datasets in sensitive domains. This paper introduces a pipeline that uses electric sockets as consistent indoor markers for geolocation, since plug socket types are standardised by country or region. The three-stage deep learning pipeline detects plug sockets (YOLOv11, mAP@0.5 = 0.843), classifies them into one of 12 plug socket types (Xception, accuracy = 0.912), and maps the detected socket types to countries (accuracy = 0.96 at 90% threshold confidence). To address data scarcity, two dedicated datasets were created: socket detection dataset of 2,328 annotated images expanded to 4,072 through augmentation, and a classification dataset of 3,187 images across 12 plug socket classes. The pipeline was evaluated on the Hotels-50K dataset, focusing on the TraffickCam subset of crowd-sourced hotel images, which capture real-world conditions such as poor lighting and amateur angles. This dataset provides a more realistic evaluation than using professional, well-lit, often wide-angle images from travel websites. This framework demonstrates a practical step toward real-world digital forensic applications. The code, trained models, and the data for this paper are available open source.
50. 【2512.16615】rainable Log-linear Sparse Attention for Efficient Diffusion Transformers
链接:https://arxiv.org/abs/2512.16615
作者:Yifan Zhou,Zeqi Xiao,Tianyi Wei,Shuai Yang,Xingang Pan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Diffusion Transformers, fundamentally limits scaling, self-attention cost fundamentally, cost fundamentally limits, quadratic self-attention cost
备注: Code is available at: [this https URL](https://github.com/SingleZombie/LLSA)
点击查看摘要
Abstract:Diffusion Transformers (DiTs) set the state of the art in visual generation, yet their quadratic self-attention cost fundamentally limits scaling to long token sequences. Recent Top-K sparse attention approaches reduce the computation of DiTs by compressing tokens into block-wise representation and selecting a small set of relevant key blocks, but still suffer from (i) quadratic selection cost on compressed tokens and (ii) increasing K required to maintain model quality as sequences grow. We identify that their inefficiency is due to the single-level design, as a single coarse level is insufficient to represent the global structure. In this paper, we introduce Log-linear Sparse Attention (LLSA), a trainable sparse attention mechanism for extremely long token sequences that reduces both selection and attention costs from quadratic to log-linear complexity by utilizing a hierarchical structure. LLSA performs hierarchical Top-K selection, progressively adopting sparse Top-K selection with the indices found at the previous level, and introduces a Hierarchical KV Enrichment mechanism that preserves global context while using fewer tokens of different granularity during attention computation. To support efficient training, we develop a high-performance GPU implementation that uses only sparse indices for both the forward and backward passes, eliminating the need for dense attention masks. We evaluate LLSA on high-resolution pixel-space image generation without using patchification and VAE encoding. LLSA accelerates attention inference by 28.27x and DiT training by 6.09x on 256x256 pixel token sequences, while maintaining generation quality. The results demonstrate that LLSA offers a promising direction for training long-sequence DiTs efficiently. Code is available at: this https URL
51. 【2512.16614】Don't Guess, Escalate: Towards Explainable Uncertainty-Calibrated AI Forensic Agents
链接:https://arxiv.org/abs/2512.16614
作者:Giulia Boato,Andrea Montibeller,Edward Delp,Luisa Verdoliva,Daniele Miorandi
类目:Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:reshaping the landscape, landscape of multimedia, multimedia forensics, combine forensic detectors, Abstract
备注:
点击查看摘要
Abstract:AI is reshaping the landscape of multimedia forensics. We propose AI forensic agents: reliable orchestrators that select and combine forensic detectors, identify provenance and context, and provide uncertainty-aware assessments. We highlight pitfalls in current solutions and introduce a unified framework to improve the authenticity verification process.
52. 【2512.16609】Hazedefy: A Lightweight Real-Time Image and Video Dehazing Pipeline for Practical Deployment
链接:https://arxiv.org/abs/2512.16609
作者:Ayush Bhavsar
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:camera feed enhancement, live camera feed, paper introduces Hazedefy, application-focused dehazing pipeline, dehazing pipeline intended
备注: 4 pages, 2 figures. Code and demo available at [this https URL](https://doi.org/10.5281/zenodo.17915355)
点击查看摘要
Abstract:This paper introduces Hazedefy, a lightweight and application-focused dehazing pipeline intended for real-time video and live camera feed enhancement. Hazedefy prioritizes computational simplicity and practical deployability on consumer-grade hardware, building upon the Dark Channel Prior (DCP) concept and the atmospheric scattering model. Key elements include gamma-adaptive reconstruction, a fast transmission approximation with lower bounds for numerical stability, a stabilized atmospheric light estimator based on fractional top-pixel averaging, and an optional color balance stage. The pipeline is suitable for mobile and embedded applications, as experimental demonstrations on real-world images and videos show improved visibility and contrast without requiring GPU acceleration.
53. 【2512.16586】Yuan-TecSwin: A text conditioned Diffusion model with Swin-transformer blocks
链接:https://arxiv.org/abs/2512.16586
作者:Shaohua Wu,Tong Yu,Shenling Wang,Xudong Zhao
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:convolutional neural networks, shown remarkable capacity, U-shaped architecture, image synthesis based, neural networks
备注:
点击查看摘要
Abstract:Diffusion models have shown remarkable capacity in image synthesis based on their U-shaped architecture and convolutional neural networks (CNN) as basic blocks. The locality of the convolution operation in CNN may limit the model's ability to understand long-range semantic information. To address this issue, we propose Yuan-TecSwin, a text-conditioned diffusion model with Swin-transformer in this work. The Swin-transformer blocks take the place of CNN blocks in the encoder and decoder, to improve the non-local modeling ability in feature extraction and image restoration. The text-image alignment is improved with a well-chosen text encoder, effective utilization of text embedding, and careful design in the incorporation of text condition. Using an adapted time step to search in different diffusion stages, inference performance is further improved by 10%. Yuan-TecSwin achieves the state-of-the-art FID score of 1.37 on ImageNet generation benchmark, without any additional models at different denoising stages. In a side-by-side comparison, we find it difficult for human interviewees to tell the model-generated images from the human-painted ones.
54. 【2512.16584】Sketch-in-Latents: Eliciting Unified Reasoning in MLLMs
链接:https://arxiv.org/abs/2512.16584
作者:Jintao Tong,Jiaqi Gu,Yujing Lou,Lubin Fan,Yixiong Zou,Yue Wu,Jieping Ye,Ruixuan Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, Language Models
备注: 14 pages, 11 figures
点击查看摘要
Abstract:While Multimodal Large Language Models (MLLMs) excel at visual understanding tasks through text reasoning, they often fall short in scenarios requiring visual imagination. Unlike current works that take predefined external toolkits or generate images during thinking, however, humans can form flexible visual-text imagination and interactions during thinking without predefined toolkits, where one important reason is that humans construct the visual-text thinking process in a unified space inside the brain. Inspired by this capability, given that current MLLMs already encode visual and text information in the same feature space, we hold that visual tokens can be seamlessly inserted into the reasoning process carried by text tokens, where ideally, all visual imagination processes can be encoded by the latent features. To achieve this goal, we propose Sketch-in-Latents (SkiLa), a novel paradigm for unified multi-modal reasoning that expands the auto-regressive capabilities of MLLMs to natively generate continuous visual embeddings, termed latent sketch tokens, as visual thoughts. During multi-step reasoning, the model dynamically alternates between textual thinking mode for generating textual think tokens and visual sketching mode for generating latent sketch tokens. A latent visual semantics reconstruction mechanism is proposed to ensure these latent sketch tokens are semantically grounded. Extensive experiments demonstrate that SkiLa achieves superior performance on vision-centric tasks while exhibiting strong generalization to diverse general multi-modal benchmarks. Codes will be released at this https URL.
55. 【2512.16577】CRONOS: Continuous Time Reconstruction for 4D Medical Longitudinal Series
链接:https://arxiv.org/abs/2512.16577
作者:Nico Albert Disch,Saikat Roy,Constantin Ulrich,Yannick Kirchhoff,Maximilian Rokuss,Robin Peretzke,David Zimmerer,Klaus Maier-Hein
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:treatment planning, disease progression, developmental assessment, medical scans evolve, important for disease
备注: [this https URL](https://github.com/MIC-DKFZ/Longitudinal4DMed)
点击查看摘要
Abstract:Forecasting how 3D medical scans evolve over time is important for disease progression, treatment planning, and developmental assessment. Yet existing models either rely on a single prior scan, fixed grid times, or target global labels, which limits voxel-level forecasting under irregular sampling. We present CRONOS, a unified framework for many-to-one prediction from multiple past scans that supports both discrete (grid-based) and continuous (real-valued) timestamps in one model, to the best of our knowledge the first to achieve continuous sequence-to-image forecasting for 3D medical data. CRONOS learns a spatio-temporal velocity field that transports context volumes toward a target volume at an arbitrary time, while operating directly in 3D voxel space. Across three public datasets spanning Cine-MRI, perfusion CT, and longitudinal MRI, CRONOS outperforms other baselines, while remaining computationally competitive. We will release code and evaluation protocols to enable reproducible, multi-dataset benchmarking of multi-context, continuous-time forecasting.
56. 【2512.16567】Causal-Tune: Mining Causal Factors from Vision Foundation Models for Domain Generalized Semantic Segmentation
链接:https://arxiv.org/abs/2512.16567
作者:Yin Zhang,Yongqiang Zhang,Yaoyue Zheng,Bogdan Raducanu,Dan Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Vision Foundation Models, Generalized Semantic Segmentation, Fine-tuning Vision Foundation, Domain Generalized Semantic, Foundation Models
备注: Accepted by AAAI 2026
点击查看摘要
Abstract:Fine-tuning Vision Foundation Models (VFMs) with a small number of parameters has shown remarkable performance in Domain Generalized Semantic Segmentation (DGSS). Most existing works either train lightweight adapters or refine intermediate features to achieve better generalization on unseen domains. However, they both overlook the fact that long-term pre-trained VFMs often exhibit artifacts, which hinder the utilization of valuable representations and ultimately degrade DGSS performance. Inspired by causal mechanisms, we observe that these artifacts are associated with non-causal factors, which usually reside in the low- and high-frequency components of the VFM spectrum. In this paper, we explicitly examine the causal and non-causal factors of features within VFMs for DGSS, and propose a simple yet effective method to identify and disentangle them, enabling more robust domain generalization. Specifically, we propose Causal-Tune, a novel fine-tuning strategy designed to extract causal factors and suppress non-causal ones from the features of VFMs. First, we extract the frequency spectrum of features from each layer using the Discrete Cosine Transform (DCT). A Gaussian band-pass filter is then applied to separate the spectrum into causal and non-causal components. To further refine the causal components, we introduce a set of causal-aware learnable tokens that operate in the frequency domain, while the non-causal components are discarded. Finally, refined features are transformed back into the spatial domain via inverse DCT and passed to the next layer. Extensive experiments conducted on various cross-domain tasks demonstrate the effectiveness of Causal-Tune. In particular, our method achieves superior performance under adverse weather conditions, improving +4.8% mIoU over the baseline in snow conditions.
57. 【2512.16564】4D Primitive-Mâché: Glueing Primitives for Persistent 4D Scene Reconstruction
链接:https://arxiv.org/abs/2512.16564
作者:Kirill Mazur,Marwan Taher,Andrew J. Davison
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:monocular RGB video, casual monocular RGB, monocular RGB, RGB video, video as input
备注: For project page, see [this https URL](https://makezur.github.io/4DPM/)
点击查看摘要
Abstract:We present a dynamic reconstruction system that receives a casual monocular RGB video as input, and outputs a complete and persistent reconstruction of the scene. In other words, we reconstruct not only the the currently visible parts of the scene, but also all previously viewed parts, which enables replaying the complete reconstruction across all timesteps. Our method decomposes the scene into a set of rigid 3D primitives, which are assumed to be moving throughout the scene. Using estimated dense 2D correspondences, we jointly infer the rigid motion of these primitives through an optimisation pipeline, yielding a 4D reconstruction of the scene, i.e. providing 3D geometry dynamically moving through time. To achieve this, we also introduce a mechanism to extrapolate motion for objects that become invisible, employing motion-grouping techniques to maintain continuity. The resulting system enables 4D spatio-temporal awareness, offering capabilities such as replayable 3D reconstructions of articulated objects through time, multi-object scanning, and object permanence. On object scanning and multi-object datasets, our system significantly outperforms existing methods both quantitatively and qualitatively.
Comments:
For project page, see this https URL
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2512.16564 [cs.CV]
(or
arXiv:2512.16564v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2512.16564
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
58. 【2512.16561】N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models
链接:https://arxiv.org/abs/2512.16561
作者:Yuxin Wang,Lei Ke,Boqiang Zhang,Tianyuan Qu,Hanxun Yu,Zhenpeng Huang,Meng Yu,Dan Xu,Dong Yu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:comprehend spatial relationships, current multimodal models, answer questions based, object perception, lack intrinsic
备注: Project Page: [this https URL](https://n3d-vlm.github.io)
点击查看摘要
Abstract:While current multimodal models can answer questions based on 2D images, they lack intrinsic 3D object perception, limiting their ability to comprehend spatial relationships and depth cues in 3D scenes. In this work, we propose N3D-VLM, a novel unified framework that seamlessly integrates native 3D object perception with 3D-aware visual reasoning, enabling both precise 3D grounding and interpretable spatial understanding. Unlike conventional end-to-end models that directly predict answers from RGB/RGB-D inputs, our approach equips the model with native 3D object perception capabilities, enabling it to directly localize objects in 3D space based on textual descriptions. Building upon accurate 3D object localization, the model further performs explicit reasoning in 3D, achieving more interpretable and structured spatial understanding. To support robust training for these capabilities, we develop a scalable data construction pipeline that leverages depth estimation to lift large-scale 2D annotations into 3D space, significantly increasing the diversity and coverage for 3D object grounding data, yielding over six times larger than the largest existing single-image 3D detection dataset. Moreover, the pipeline generates spatial question-answering datasets that target chain-of-thought (CoT) reasoning in 3D, facilitating joint training for both 3D object localization and 3D spatial reasoning. Experimental results demonstrate that our unified framework not only achieves state-of-the-art performance on 3D grounding tasks, but also consistently surpasses existing methods in 3D spatial reasoning in vision-language model.
59. 【2512.16523】P: Test-Time Padding for Adversarial Detection and Robust Adaptation on Vision-Language Models
链接:https://arxiv.org/abs/2512.16523
作者:Zhiwei Li,Yitian Pang,Weining Wang,Zhenan Sun,Qi Li
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:posing significant risks, achieved impressive zero-shot, impressive zero-shot recognition, zero-shot recognition performance, remain highly susceptible
备注:
点击查看摘要
Abstract:Vision-Language Models (VLMs), such as CLIP, have achieved impressive zero-shot recognition performance but remain highly susceptible to adversarial perturbations, posing significant risks in safety-critical scenarios. Previous training-time defenses rely on adversarial fine-tuning, which requires labeled data and costly retraining, while existing test-time strategies fail to reliably distinguish between clean and adversarial inputs, thereby preventing both adversarial robustness and clean accuracy from reaching their optimum. To address these limitations, we propose Test-Time Padding (TTP), a lightweight defense framework that performs adversarial detection followed by targeted adaptation at inference. TTP identifies adversarial inputs via the cosine similarity shift between CLIP feature embeddings computed before and after spatial padding, yielding a universal threshold for reliable detection across architectures and datasets. For detected adversarial cases, TTP employs trainable padding to restore disrupted attention patterns, coupled with a similarity-aware ensemble strategy for a more robust final prediction. For clean inputs, TTP leaves them unchanged by default or optionally integrates existing test-time adaptation techniques for further accuracy gains. Comprehensive experiments on diverse CLIP backbones and fine-grained benchmarks show that TTP consistently surpasses state-of-the-art test-time defenses, delivering substantial improvements in adversarial robustness without compromising clean accuracy. The code for this paper will be released soon.
60. 【2512.16511】Multi-scale Attention-Guided Intrinsic Decomposition and Rendering Pass Prediction for Facial Images
链接:https://arxiv.org/abs/2512.16511
作者:Hossein Javidnia
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:high-fidelity digital doubles, Accurate intrinsic decomposition, Accurate intrinsic, Attention-Guided Intrinsics Network, high-fidelity digital
备注:
点击查看摘要
Abstract:Accurate intrinsic decomposition of face images under unconstrained lighting is a prerequisite for photorealistic relighting, high-fidelity digital doubles, and augmented-reality effects. This paper introduces MAGINet, a Multi-scale Attention-Guided Intrinsics Network that predicts a $512\times512$ light-normalized diffuse albedo map from a single RGB portrait. MAGINet employs hierarchical residual encoding, spatial-and-channel attention in a bottleneck, and adaptive multi-scale feature fusion in the decoder, yielding sharper albedo boundaries and stronger lighting invariance than prior U-Net variants. The initial albedo prediction is upsampled to $1024\times1024$ and refined by a lightweight three-layer CNN (RefinementNet). Conditioned on this refined albedo, a Pix2PixHD-based translator then predicts a comprehensive set of five additional physically based rendering passes: ambient occlusion, surface normal, specular reflectance, translucency, and raw diffuse colour (with residual lighting). Together with the refined albedo, these six passes form the complete intrinsic decomposition. Trained with a combination of masked-MSE, VGG, edge, and patch-LPIPS losses on the FFHQ-UV-Intrinsics dataset, the full pipeline achieves state-of-the-art performance for diffuse albedo estimation and demonstrates significantly improved fidelity for the complete rendering stack compared to prior methods. The resulting passes enable high-quality relighting and material editing of real faces.
61. 【2512.16504】Skeleton-Snippet Contrastive Learning with Multiscale Feature Fusion for Action Localization
链接:https://arxiv.org/abs/2512.16504
作者:Qiushuo Cheng,Jingjing Liu,Catherine Morgan,Alan Whone,Majid Mirmehdi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieved great success, paradigm has achieved, achieved great, great success, action
备注:
点击查看摘要
Abstract:The self-supervised pretraining paradigm has achieved great success in learning 3D action representations for skeleton-based action recognition using contrastive learning. However, learning effective representations for skeleton-based temporal action localization remains challenging and underexplored. Unlike video-level {action} recognition, detecting action boundaries requires temporally sensitive features that capture subtle differences between adjacent frames where labels change. To this end, we formulate a snippet discrimination pretext task for self-supervised pretraining, which densely projects skeleton sequences into non-overlapping segments and promotes features that distinguish them across videos via contrastive learning. Additionally, we build on strong backbones of skeleton-based action recognition models by fusing intermediate features with a U-shaped module to enhance feature resolution for frame-level localization. Our approach consistently improves existing skeleton-based contrastive learning methods for action localization on BABEL across diverse subsets and evaluation protocols. We also achieve state-of-the-art transfer learning performance on PKUMMD with pretraining on NTU RGB+D and BABEL.
62. 【2512.16501】VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks
链接:https://arxiv.org/abs/2512.16501
作者:Beitong Zhou,Zhexiao Huang,Yuan Guo,Zhangxuan Gu,Tianyu Xia,Zichen Luo,Fei Tang,Dehan Kong,Yanyi Shang,Suling Ou,Zhenlin Guo,Changhua Meng,Shuheng Shen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:capable GUI agents, building capable GUI, component in building, building capable, GUI agents
备注:
点击查看摘要
Abstract:GUI grounding is a critical component in building capable GUI agents. However, existing grounding benchmarks suffer from significant limitations: they either provide insufficient data volume and narrow domain coverage, or focus excessively on a single platform and require highly specialized domain knowledge. In this work, we present VenusBench-GD, a comprehensive, bilingual benchmark for GUI grounding that spans multiple platforms, enabling hierarchical evaluation for real-word applications. VenusBench-GD contributes as follows: (i) we introduce a large-scale, cross-platform benchmark with extensive coverage of applications, diverse UI elements, and rich annotated data, (ii) we establish a high-quality data construction pipeline for grounding tasks, achieving higher annotation accuracy than existing benchmarks, and (iii) we extend the scope of element grounding by proposing a hierarchical task taxonomy that divides grounding into basic and advanced categories, encompassing six distinct subtasks designed to evaluate models from complementary perspectives. Our experimental findings reveal critical insights: general-purpose multimodal models now match or even surpass specialized GUI models on basic grounding tasks. In contrast, advanced tasks, still favor GUI-specialized models, though they exhibit significant overfitting and poor robustness. These results underscore the necessity of comprehensive, multi-tiered evaluation frameworks.
63. 【2512.16494】PoseMoE: Mixture-of-Experts Network for Monocular 3D Human Pose Estimation
链接:https://arxiv.org/abs/2512.16494
作者:Mengyuan Liu,Jiajie Liu,Jinyan Zhang,Wenhao Li,Junsong Yuan
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:pose, depth, human pose, human pose estimation, detected
备注: IEEE Transactions on Image Processing (T-IP)
点击查看摘要
Abstract:The lifting-based methods have dominated monocular 3D human pose estimation by leveraging detected 2D poses as intermediate representations. The 2D component of the final 3D human pose benefits from the detected 2D poses, whereas its depth counterpart must be estimated from scratch. The lifting-based methods encode the detected 2D pose and unknown depth in an entangled feature space, explicitly introducing depth uncertainty to the detected 2D pose, thereby limiting overall estimation accuracy. This work reveals that the depth representation is pivotal for the estimation process. Specifically, when depth is in an initial, completely unknown state, jointly encoding depth features with 2D pose features is detrimental to the estimation process. In contrast, when depth is initially refined to a more dependable state via network-based estimation, encoding it together with 2D pose information is beneficial. To address this limitation, we present a Mixture-of-Experts network for monocular 3D pose estimation named PoseMoE. Our approach introduces: (1) A mixture-of-experts network where specialized expert modules refine the well-detected 2D pose features and learn the depth features. This mixture-of-experts design disentangles the feature encoding process for 2D pose and depth, therefore reducing the explicit influence of uncertain depth features on 2D pose features. (2) A cross-expert knowledge aggregation module is proposed to aggregate cross-expert spatio-temporal contextual information. This step enhances features through bidirectional mapping between 2D pose and depth. Extensive experiments show that our proposed PoseMoE outperforms the conventional lifting-based methods on three widely used datasets: Human3.6M, MPI-INF-3DHP, and 3DPW.
64. 【2512.16493】YOLO11-4K: An Efficient Architecture for Real-Time Small Object Detection in 4K Panoramic Images
链接:https://arxiv.org/abs/2512.16493
作者:Huma Hafeez,Matthew Garratt,Jo Plested,Sankaran Iyer,Arcot Sowmya
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:inherent spatial distortions, poses significant challenges, images poses significant, spatial distortions, wide fields
备注: Conference paper just submitted
点击查看摘要
Abstract:The processing of omnidirectional 360-degree images poses significant challenges for object detection due to inherent spatial distortions, wide fields of view, and ultra-high-resolution inputs. Conventional detectors such as YOLO are optimised for standard image sizes (for example, 640x640 pixels) and often struggle with the computational demands of 4K or higher-resolution imagery typical of 360-degree vision. To address these limitations, we introduce YOLO11-4K, an efficient real-time detection framework tailored for 4K panoramic images. The architecture incorporates a novel multi-scale detection head with a P2 layer to improve sensitivity to small objects often missed at coarser scales, and a GhostConv-based backbone to reduce computational complexity without sacrificing representational power. To enable evaluation, we manually annotated the CVIP360 dataset, generating 6,876 frame-level bounding boxes and producing a publicly available, detection-ready benchmark for 4K panoramic scenes. YOLO11-4K achieves 0.95 mAP at 0.50 IoU with 28.3 milliseconds inference per frame, representing a 75 percent latency reduction compared to YOLO11 (112.3 milliseconds), while also improving accuracy (mAP at 0.50 of 0.95 versus 0.908). This balance of efficiency and precision enables robust object detection in expansive 360-degree environments, making the framework suitable for real-world high-resolution panoramic applications. While this work focuses on 4K omnidirectional images, the approach is broadly applicable to high-resolution detection tasks in autonomous navigation, surveillance, and augmented reality.
65. 【2512.16485】Smile on the Face, Sadness in the Eyes: Bridging the Emotion Gap with a Multimodal Dataset of Eye and Facial Behaviors
链接:https://arxiv.org/abs/2512.16485
作者:Kejun Liu,Yuanyuan Liu,Lin Wei,Chang Tang,Yibing Zhan,Zijing Chen,Zhe Chen
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:identifying human emotions, facial expression recognition, Emotion Recognition, Multimodal Emotion Recognition, FER
备注: Accepted by TMM
点击查看摘要
Abstract:Emotion Recognition (ER) is the process of analyzing and identifying human emotions from sensing data. Currently, the field heavily relies on facial expression recognition (FER) because visual channel conveys rich emotional cues. However, facial expressions are often used as social tools rather than manifestations of genuine inner emotions. To understand and bridge this gap between FER and ER, we introduce eye behaviors as an important emotional cue and construct an Eye-behavior-aided Multimodal Emotion Recognition (EMER) dataset. To collect data with genuine emotions, spontaneous emotion induction paradigm is exploited with stimulus material, during which non-invasive eye behavior data, like eye movement sequences and eye fixation maps, is captured together with facial expression videos. To better illustrate the gap between ER and FER, multi-view emotion labels for mutimodal ER and FER are separately annotated. Furthermore, based on the new dataset, we design a simple yet effective Eye-behavior-aided MER Transformer (EMERT) that enhances ER by bridging the emotion gap. EMERT leverages modality-adversarial feature decoupling and a multitask Transformer to model eye behaviors as a strong complement to facial expressions. In the experiment, we introduce seven multimodal benchmark protocols for a variety of comprehensive evaluations of the EMER dataset. The results show that the EMERT outperforms other state-of-the-art multimodal methods by a great margin, revealing the importance of modeling eye behaviors for robust ER. To sum up, we provide a comprehensive analysis of the importance of eye behaviors in ER, advancing the study on addressing the gap between FER and ER for more robust ER performance. Our EMER dataset and the trained EMERT models will be publicly available at this https URL.
66. 【2512.16484】Guiding Perception-Reasoning Closer to Human in Blind Image Quality Assessment
链接:https://arxiv.org/abs/2512.16484
作者:Yuan Li,Yahan Yu,Youyuan Lin,Yong-Hao Yang,Chenhui Chu,Shin'ya Nishida
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:integrating sensory cues, form self-consistent judgments, image quality, integrating sensory, sensory cues
备注: Under review
点击查看摘要
Abstract:Humans assess image quality through a perception-reasoning cascade, integrating sensory cues with implicit reasoning to form self-consistent judgments. In this work, we investigate how a model can acquire both human-like and self-consistent reasoning capability for blind image quality assessment (BIQA). We first collect human evaluation data that capture several aspects of human perception-reasoning pipeline. Then, we adopt reinforcement learning, using human annotations as reward signals to guide the model toward human-like perception and reasoning. To enable the model to internalize self-consistent reasoning capability, we design a reward that drives the model to infer the image quality purely from self-generated descriptions. Empirically, our approach achieves score prediction performance comparable to state-of-the-art BIQA systems under general metrics, including Pearson and Spearman correlation coefficients. In addition to the rating score, we assess human-model alignment using ROUGE-1 to measure the similarity between model-generated and human perception-reasoning chains. On over 1,000 human-annotated samples, our model reaches a ROUGE-1 score of 0.512 (cf. 0.443 for baseline), indicating substantial coverage of human explanations and marking a step toward human-like interpretable reasoning in BIQA.
67. 【2512.16483】StageVAR: Stage-Aware Acceleration for Visual Autoregressive Models
链接:https://arxiv.org/abs/2512.16483
作者:Senmao Li,Kai Wang,Salman Khan,Fahad Shahbaz Khan,Jian Yang,Yaxing Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:next-token prediction paradigm, enabling high-quality image, VAR paradigm suffers, next-token prediction, next-scale prediction
备注:
点击查看摘要
Abstract:Visual Autoregressive (VAR) modeling departs from the next-token prediction paradigm of traditional Autoregressive (AR) models through next-scale prediction, enabling high-quality image generation. However, the VAR paradigm suffers from sharply increased computational complexity and running time at large-scale steps. Although existing acceleration methods reduce runtime for large-scale steps, but rely on manual step selection and overlook the varying importance of different stages in the generation process. To address this challenge, we present StageVAR, a systematic study and stage-aware acceleration framework for VAR models. Our analysis shows that early steps are critical for preserving semantic and structural consistency and should remain intact, while later steps mainly refine details and can be pruned or approximated for acceleration. Building on these insights, StageVAR introduces a plug-and-play acceleration strategy that exploits semantic irrelevance and low-rank properties in late-stage computations, without requiring additional training. Our proposed StageVAR achieves up to 3.4x speedup with only a 0.01 drop on GenEval and a 0.26 decrease on DPG, consistently outperforming existing acceleration baselines. These results highlight stage-aware design as a powerful principle for efficient visual autoregressive image generation.
68. 【2512.16461】SNOW: Spatio-Temporal Scene Understanding with World Knowledge for Open-World Embodied Reasoning
链接:https://arxiv.org/abs/2512.16461
作者:Tin Stribor Sohn,Maximilian Dillitzer,Jason J. Corso,Eric Sax
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:ensure reliable navigation, systems require spatio-temporal, robotic systems require, navigation and interaction, systems require
备注:
点击查看摘要
Abstract:Autonomous robotic systems require spatio-temporal understanding of dynamic environments to ensure reliable navigation and interaction. While Vision-Language Models (VLMs) provide open-world semantic priors, they lack grounding in 3D geometry and temporal dynamics. Conversely, geometric perception captures structure and motion but remains semantically sparse. We propose SNOW (Scene Understanding with Open-World Knowledge), a training-free and backbone-agnostic framework for unified 4D scene understanding that integrates VLM-derived semantics with point cloud geometry and temporal consistency. SNOW processes synchronized RGB images and 3D point clouds, using HDBSCAN clustering to generate object-level proposals that guide SAM2-based segmentation. Each segmented region is encoded through our proposed Spatio-Temporal Tokenized Patch Encoding (STEP), producing multimodal tokens that capture localized semantic, geometric, and temporal attributes. These tokens are incrementally integrated into a 4D Scene Graph (4DSG), which serves as 4D prior for downstream reasoning. A lightweight SLAM backend anchors all STEP tokens spatially in the environment, providing the global reference alignment, and ensuring unambiguous spatial grounding across time. The resulting 4DSG forms a queryable, unified world model through which VLMs can directly interpret spatial scene structure and temporal dynamics. Experiments on a diverse set of benchmarks demonstrate that SNOW enables precise 4D scene understanding and spatially grounded inference, thereby setting new state-of-the-art performance in several settings, highlighting the importance of structured 4D priors for embodied reasoning and autonomous robotics.
69. 【2512.16456】Prime and Reach: Synthesising Body Motion for Gaze-Primed Object Reach
链接:https://arxiv.org/abs/2512.16456
作者:Masashi Hatano,Saptarshi Sinha,Jacob Chalk,Wei-Hong Li,Hideo Saito,Dima Damen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:create realistic motion, realistic motion imitating, imitating natural human, challenging task, task that aims
备注: Project Page: [this https URL](https://masashi-hatano.github.io/prime-and-reach/)
点击查看摘要
Abstract:Human motion generation is a challenging task that aims to create realistic motion imitating natural human behaviour. We focus on the well-studied behaviour of priming an object/location for pick up or put down -- that is, the spotting of an object/location from a distance, known as gaze priming, followed by the motion of approaching and reaching the target location. To that end, we curate, for the first time, 23.7K gaze-primed human motion sequences for reaching target object locations from five publicly available datasets, i.e., HD-EPIC, MoGaze, HOT3D, ADT, and GIMO. We pre-train a text-conditioned diffusion-based motion generation model, then fine-tune it conditioned on goal pose or location, on our curated sequences. Importantly, we evaluate the ability of the generated motion to imitate natural human movement through several metrics, including the 'Reach Success' and a newly introduced 'Prime Success' metric. On the largest dataset, HD-EPIC, our model achieves 60% prime success and 89% reach success when conditioned on the goal object location.
70. 【2512.16443】Geometric Disentanglement of Text Embeddings for Subject-Consistent Text-to-Image Generation using A Single Prompt
链接:https://arxiv.org/abs/2512.16443
作者:Shangxun Li,Youngjung Uh
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:diffusion models excel, generating high-quality images, natural language descriptions, multiple outputs, visual storytelling
备注:
点击查看摘要
Abstract:Text-to-image diffusion models excel at generating high-quality images from natural language descriptions but often fail to preserve subject consistency across multiple outputs, limiting their use in visual storytelling. Existing approaches rely on model fine-tuning or image conditioning, which are computationally expensive and require per-subject optimization. 1Prompt1Story, a training-free approach, concatenates all scene descriptions into a single prompt and rescales token embeddings, but it suffers from semantic leakage, where embeddings across frames become entangled, causing text misalignment. In this paper, we propose a simple yet effective training-free approach that addresses semantic entanglement from a geometric perspective by refining text embeddings to suppress unwanted semantics. Extensive experiments prove that our approach significantly improves both subject consistency and text alignment over existing baselines.
71. 【2512.16415】CountZES: Counting via Zero-Shot Exemplar Selection
链接:https://arxiv.org/abs/2512.16415
作者:Muhammad Ibraheem Siddiqui,Muhammad Haris Khan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:scenes remains challenging, complex scenes remains, Object counting, zero-shot object counting, remains challenging
备注:
点击查看摘要
Abstract:Object counting in complex scenes remains challenging, particularly in the zero-shot setting, where the goal is to count instances of unseen categories specified only by a class name. Existing zero-shot object counting (ZOC) methods that infer exemplars from text either rely on open-vocabulary detectors, which often yield multi-instance candidates, or on random patch sampling, which fails to accurately delineate object instances. To address this, we propose CountZES, a training-free framework for object counting via zero-shot exemplar selection. CountZES progressively discovers diverse exemplars through three synergistic stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). DAE refines open-vocabulary detections to isolate precise single-instance exemplars. DGE introduces a density-driven, self-supervised paradigm to identify statistically consistent and semantically compact exemplars, while FCE reinforces visual coherence through feature-space clustering. Together, these stages yield a diverse, complementary exemplar set that balances textual grounding, count consistency, and feature representativeness. Experiments on diverse datasets demonstrate CountZES superior performance among ZOC methods while generalizing effectively across natural, aerial and medical domains.
72. 【2512.16413】BrepLLM: Native Boundary Representation Understanding with Large Language Models
链接:https://arxiv.org/abs/2512.16413
作者:Liyuan Deng,Hao Guo,Yunpeng Bai,Yongkang Dai,Huaxi Huang,Yilei Shi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Language Models, Large Language, Language Models, Cross-modal Alignment Pre-training, Boundary Representation
备注:
点击查看摘要
Abstract:Current token-sequence-based Large Language Models (LLMs) are not well-suited for directly processing 3D Boundary Representation (Brep) models that contain complex geometric and topological information. We propose BrepLLM, the first framework that enables LLMs to parse and reason over raw Brep data, bridging the modality gap between structured 3D geometry and natural language. BrepLLM employs a two-stage training pipeline: Cross-modal Alignment Pre-training and Multi-stage LLM Fine-tuning. In the first stage, an adaptive UV sampling strategy converts Breps into graphs representation with geometric and topological information. We then design a hierarchical BrepEncoder to extract features from geometry (i.e., faces and edges) and topology, producing both a single global token and a sequence of node tokens. Then we align the global token with text embeddings from a frozen CLIP text encoder (ViT-L/14) via contrastive learning. In the second stage, we integrate the pretrained BrepEncoder into an LLM. We then align its sequence of node tokens using a three-stage progressive training strategy: (1) training an MLP-based semantic mapping from Brep representation to 2D with 2D-LLM priors. (2) performing fine-tuning of the LLM. (3) designing a Mixture-of-Query Experts (MQE) to enhance geometric diversity modeling. We also construct Brep2Text, a dataset comprising 269,444 Brep-text question-answer pairs. Experiments show that BrepLLM achieves state-of-the-art (SOTA) results on 3D object classification and captioning tasks.
73. 【2512.16397】Using Gaussian Splats to Create High-Fidelity Facial Geometry and Texture
链接:https://arxiv.org/abs/2512.16397
作者:Haodi He,Jihun Yu,Ronald Fedkiw
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
关键词:increasingly popular three-dimensional, leverage increasingly popular, popular three-dimensional neural, three-dimensional neural representations, Gaussian Splatting
备注: Submitted to CVPR 2026. 21 pages, 22 figures
点击查看摘要
Abstract:We leverage increasingly popular three-dimensional neural representations in order to construct a unified and consistent explanation of a collection of uncalibrated images of the human face. Our approach utilizes Gaussian Splatting, since it is more explicit and thus more amenable to constraints than NeRFs. We leverage segmentation annotations to align the semantic regions of the face, facilitating the reconstruction of a neutral pose from only 11 images (as opposed to requiring a long video). We soft constrain the Gaussians to an underlying triangulated surface in order to provide a more structured Gaussian Splat reconstruction, which in turn informs subsequent perturbations to increase the accuracy of the underlying triangulated surface. The resulting triangulated surface can then be used in a standard graphics pipeline. In addition, and perhaps most impactful, we show how accurate geometry enables the Gaussian Splats to be transformed into texture space where they can be treated as a view-dependent neural texture. This allows one to use high visual fidelity Gaussian Splatting on any asset in a scene without the need to modify any other asset or any other aspect (geometry, lighting, renderer, etc.) of the graphics pipeline. We utilize a relightable Gaussian model to disentangle texture from lighting in order to obtain a delit high-resolution albedo texture that is also readily usable in a standard graphics pipeline. The flexibility of our system allows for training with disparate images, even with incompatible lighting, facilitating robust regularization. Finally, we demonstrate the efficacy of our approach by illustrating its use in a text-driven asset creation pipeline.
74. 【2512.16393】Adaptive Frequency Domain Alignment Network for Medical image segmentation
链接:https://arxiv.org/abs/2512.16393
作者:Zhanwei Li,Liang Li,Jiawan Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:High-quality annotated data, High-quality annotated, annotated data plays, achieving accurate segmentation, plays a crucial
备注:
点击查看摘要
Abstract:High-quality annotated data plays a crucial role in achieving accurate segmentation. However, such data for medical image segmentation are often scarce due to the time-consuming and labor-intensive nature of manual annotation. To address this challenge, we propose the Adaptive Frequency Domain Alignment Network (AFDAN)--a novel domain adaptation framework designed to align features in the frequency domain and alleviate data scarcity. AFDAN integrates three core components to enable robust cross-domain knowledge transfer: an Adversarial Domain Learning Module that transfers features from the source to the target domain; a Source-Target Frequency Fusion Module that blends frequency representations across domains; and a Spatial-Frequency Integration Module that combines both frequency and spatial features to further enhance segmentation accuracy across domains. Extensive experiments demonstrate the effectiveness of AFDAN: it achieves an Intersection over Union (IoU) of 90.9% for vitiligo segmentation in the newly constructed VITILIGO2025 dataset and a competitive IoU of 82.6% on the retinal vessel segmentation benchmark DRIVE, surpassing existing state-of-the-art approaches.
75. 【2512.16371】Factorized Video Generation: Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models
链接:https://arxiv.org/abs/2512.16371
作者:Mariam Hassan,Bastien Van Delft,Wuyang Li,Alexandre Alahi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:visually impressive results, generate visually impressive, logical temporal instructions, follow logical temporal, compose complex scenes
备注:
点击查看摘要
Abstract:State-of-the-art Text-to-Video (T2V) diffusion models can generate visually impressive results, yet they still frequently fail to compose complex scenes or follow logical temporal instructions. In this paper, we argue that many errors, including apparent motion failures, originate from the model's inability to construct a semantically correct or logically consistent initial frame. We introduce Factorized Video Generation (FVG), a pipeline that decouples these tasks by decomposing the Text-to-Video generation into three specialized stages: (1) Reasoning, where a Large Language Model (LLM) rewrites the video prompt to describe only the initial scene, resolving temporal ambiguities; (2) Composition, where a Text-to-Image (T2I) model synthesizes a high-quality, compositionally-correct anchor frame from this new prompt; and (3) Temporal Synthesis, where a video model, finetuned to understand this anchor, focuses its entire capacity on animating the scene and following the prompt. Our decomposed approach sets a new state-of-the-art on the T2V CompBench benchmark and significantly improves all tested models on VBench2. Furthermore, we show that visual anchoring allows us to cut the number of sampling steps by 70% without any loss in performance, leading to a substantial speed-up in sampling. Factorized Video Generation offers a simple yet practical path toward more efficient, robust, and controllable video synthesis
76. 【2512.16360】EverybodyDance: Bipartite Graph-Based Identity Correspondence for Multi-Character Animation
链接:https://arxiv.org/abs/2512.16360
作者:Haotian Ling,Zequn Chen,Qiuying Chen,Donglin Di,Yongjia Ma,Hao Li,Chen Wei,Zhulin Tao,Xun Yang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Consistent pose-driven character, achieved remarkable progress, Consistent pose-driven, single-character scenarios, achieved remarkable
备注:
点击查看摘要
Abstract:Consistent pose-driven character animation has achieved remarkable progress in single-character scenarios. However, extending these advances to multi-character settings is non-trivial, especially when position swap is involved. Beyond mere scaling, the core challenge lies in enforcing correct Identity Correspondence (IC) between characters in reference and generated frames. To address this, we introduce EverybodyDance, a systematic solution targeting IC correctness in multi-character animation. EverybodyDance is built around the Identity Matching Graph (IMG), which models characters in the generated and reference frames as two node sets in a weighted complete bipartite graph. Edge weights, computed via our proposed Mask-Query Attention (MQA), quantify the affinity between each pair of characters. Our key insight is to formalize IC correctness as a graph structural metric and to optimize it during training. We also propose a series of targeted strategies tailored for multi-character animation, including identity-embedded guidance, a multi-scale matching strategy, and pre-classified sampling, which work synergistically. Finally, to evaluate IC performance, we curate the Identity Correspondence Evaluation benchmark, dedicated to multi-character IC correctness. Extensive experiments demonstrate that EverybodyDance substantially outperforms state-of-the-art baselines in both IC and visual fidelity.
77. 【2512.16357】GMODiff: One-Step Gain Map Refinement with Diffusion Priors for HDR Reconstruction
链接:https://arxiv.org/abs/2512.16357
作者:Tao Hu,Weiyu Zhou,Yanjie Tu,Peng Wu,Wei Dong,Qingsen Yan,Yanning Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Pre-trained Latent Diffusion, recently shown strong, low-level vision tasks, Pre-trained Latent, High Dynamic Range
备注:
点击查看摘要
Abstract:Pre-trained Latent Diffusion Models (LDMs) have recently shown strong perceptual priors for low-level vision tasks, making them a promising direction for multi-exposure High Dynamic Range (HDR) reconstruction. However, directly applying LDMs to HDR remains challenging due to: (1) limited dynamic-range representation caused by 8-bit latent compression, (2) high inference cost from multi-step denoising, and (3) content hallucination inherent to generative nature. To address these challenges, we introduce GMODiff, a gain map-driven one-step diffusion framework for multi-exposure HDR reconstruction. Instead of reconstructing full HDR content, we reformulate HDR reconstruction as a conditionally guided Gain Map (GM) estimation task, where the GM encodes the extended dynamic range while retaining the same bit depth as LDR images. We initialize the denoising process from an informative regression-based estimate rather than pure noise, enabling the model to generate high-quality GMs in a single denoising step. Furthermore, recognizing that regression-based models excel in content fidelity while LDMs favor perceptual quality, we leverage regression priors to guide both the denoising process and latent decoding of the LDM, suppressing hallucinations while preserving structural accuracy. Extensive experiments demonstrate that our GMODiff performs favorably against several state-of-the-art methods and is 100 faster than previous LDM-based methods.
78. 【2512.16349】Collaborative Edge-to-Server Inference for Vision-Language Models
链接:https://arxiv.org/abs/2512.16349
作者:Soochang Song,Yongjune Kim
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:propose a collaborative, vision-language models, maintaining inference accuracy, inference, inference accuracy
备注: 13 pages, 12 figures
点击查看摘要
Abstract:We propose a collaborative edge-to-server inference framework for vision-language models (VLMs) that reduces the communication cost while maintaining inference accuracy. In typical deployments, visual data captured at edge devices (clients) is transmitted to the server for VLM inference. However, resizing the original image (global image) to match the vision encoder's input resolution often discards fine-grained details, leading to accuracy degradation. To overcome this limitation, we design a two-stage framework. In the first stage, the server performs inference on the global image and identifies a region of interest (RoI) using the VLM's internal attention. The min-entropy of the output tokens is then computed as a confidence measure to determine whether retransmission is required. If the min-entropy exceeds a predefined threshold, the server requests the edge device to send a detail-preserved local image of the RoI. The server then refines its inference by jointly leveraging the global and local images. This selective retransmission strategy ensures that only essential visual content is transmitted. Experiments across multiple VLM architectures show that the proposed framework significantly reduces communication cost while maintaining inference accuracy.
79. 【2512.16325】QUIDS: Quality-informed Incentive-driven Multi-agent Dispatching System for Mobile Crowdsensing
链接:https://arxiv.org/abs/2512.16325
作者:Nan Zhou,Zuxin Li,Fanhang Man,Xuecheng Chen,Susu Xu,Fan Dang,Chaopeng Hong,Yunhao Liu,Xiao-Ping Zhang,Xinlei Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:vehicular mobile crowdsensing, non-dedicated vehicular mobile, achieving optimal Quality, mobile crowdsensing, Quality of Information
备注:
点击查看摘要
Abstract:This paper addresses the challenge of achieving optimal Quality of Information (QoI) in non-dedicated vehicular mobile crowdsensing (NVMCS) systems. The key obstacles are the interrelated issues of sensing coverage, sensing reliability, and the dynamic participation of vehicles. To tackle these, we propose QUIDS, a QUality-informed Incentive-driven multi-agent Dispatching System, which ensures high sensing coverage and reliability under budget constraints. QUIDS introduces a novel metric, Aggregated Sensing Quality (ASQ), to quantitatively capture QoI by integrating both coverage and reliability. We also develop a Mutually Assisted Belief-aware Vehicle Dispatching algorithm that estimates sensing reliability and allocates incentives under uncertainty, further improving ASQ. Evaluation using real-world data from a metropolitan NVMCS deployment shows QUIDS improves ASQ by 38% over non-dispatching scenarios and by 10% over state-of-the-art methods. It also reduces reconstruction map errors by 39-74% across algorithms. By jointly optimizing coverage and reliability via a quality-informed incentive mechanism, QUIDS enables low-cost, high-quality urban monitoring without dedicated infrastructure, applicable to smart-city scenarios like traffic and environmental sensing.
80. 【2512.16314】Ridge Estimation-Based Vision and Laser Ranging Fusion Localization Method for UAVs
链接:https://arxiv.org/abs/2512.16314
作者:Huayu Huang,Chen Chen,Banglei Guan,Ze Tan,Yang Shang,Zhang Li,Qifeng Yu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Tracking and measuring, measuring targets, variety of sensors, sensors mounted, mounted on UAVs
备注:
点击查看摘要
Abstract:Tracking and measuring targets using a variety of sensors mounted on UAVs is an effective means to quickly and accurately locate the target. This paper proposes a fusion localization method based on ridge estimation, combining the advantages of rich scene information from sequential imagery with the high precision of laser ranging to enhance localization accuracy. Under limited conditions such as long distances, small intersection angles, and large inclination angles, the column vectors of the design matrix have serious multicollinearity when using the least squares estimation algorithm. The multicollinearity will lead to ill-conditioned problems, resulting in significant instability and low robustness. Ridge estimation is introduced to mitigate the serious multicollinearity under the condition of limited observation. Experimental results demonstrate that our method achieves higher localization accuracy compared to ground localization algorithms based on single information. Moreover, the introduction of ridge estimation effectively enhances the robustness, particularly under limited observation conditions.
81. 【2512.16313】LaverNet: Lightweight All-in-one Video Restoration via Selective Propagation
链接:https://arxiv.org/abs/2512.16313
作者:Haiyu Zhao,Yiwen Shan,Yuanbiao Gou,Xi Peng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent studies, studies have explored, Recent, handles multiple degradations, Abstract
备注:
点击查看摘要
Abstract:Recent studies have explored all-in-one video restoration, which handles multiple degradations with a unified model. However, these approaches still face two challenges when dealing with time-varying degradations. First, the degradation can dominate temporal modeling, confusing the model to focus on artifacts rather than the video content. Second, current methods typically rely on large models to handle all-in-one restoration, concealing those underlying difficulties. To address these challenges, we propose a lightweight all-in-one video restoration network, LaverNet, with only 362K parameters. To mitigate the impact of degradations on temporal modeling, we introduce a novel propagation mechanism that selectively transmits only degradation-agnostic features across frames. Through LaverNet, we demonstrate that strong all-in-one restoration can be achieved with a compact network. Despite its small size, less than 1\% of the parameters of existing models, LaverNet achieves comparable, even superior performance across benchmarks.
82. 【2512.16303】PixelArena: A benchmark for Pixel-Precision Visual Intelligence
链接:https://arxiv.org/abs/2512.16303
作者:Feng Liang,Sizhe Cheng,Chenqi Yi
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Multi-modal large language, Multi-modal large, large language models, output are emerging, large language
备注: 7 pages, 11 figures, project page: [this https URL](https://pixelarena.reify.ing/project)
点击查看摘要
Abstract:Multi-modal large language models that have image output are emerging. Many image generation benchmarks focus on aesthetics instead of fine-grained generation capabilities. In PixelArena, we propose using semantic segmentation tasks to objectively examine their fine-grained generative intelligence with pixel precision. We find the latest Gemini 3 Pro Image has emergent image generation capabilities that generate semantic masks with high fidelity under zero-shot settings, showcasing visual intelligence unseen before and true generalization in new image generation tasks. We further investigate its results, compare them qualitatively and quantitatively with those of other models, and present failure cases. The findings not only signal exciting progress in the field but also provide insights into future research related to multimodality, reasoning, interpretability and benchmarking.
83. 【2512.16294】MACL: Multi-Label Adaptive Contrastive Learning Loss for Remote Sensing Image Retrieval
链接:https://arxiv.org/abs/2512.16294
作者:Amna Amir,Erchan Aptoula
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:highly imbalanced label, imbalanced label distributions, complex inter-class co-occurrence, inter-class co-occurrence patterns, co-occurrence patterns constitute
备注:
点击查看摘要
Abstract:Semantic overlap among land-cover categories, highly imbalanced label distributions, and complex inter-class co-occurrence patterns constitute significant challenges for multi-label remote-sensing image retrieval. In this article, Multi-Label Adaptive Contrastive Learning (MACL) is introduced as an extension of contrastive learning to address them. It integrates label-aware sampling, frequency-sensitive weighting, and dynamic-temperature scaling to achieve balanced representation learning across both common and rare categories. Extensive experiments on three benchmark datasets (DLRSD, ML-AID, and WHDLD), show that MACL consistently outperforms contrastive-loss based baselines, effectively mitigating semantic imbalance and delivering more reliable retrieval performance in large-scale remote-sensing archives. Code, pretrained models, and evaluation scripts will be released at this https URL upon acceptance.
84. 【2512.16275】GFLAN: Generative Functional Layouts
链接:https://arxiv.org/abs/2512.16275
作者:Mohamed Abouagour,Eleftherios Garyfallidis
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:unified computational treatment, Automated floor plan, functional design requirements, geometric constraint satisfaction, plan generation lies
备注: 21 pages, 15 figures
点击查看摘要
Abstract:Automated floor plan generation lies at the intersection of combinatorial search, geometric constraint satisfaction, and functional design requirements -- a confluence that has historically resisted a unified computational treatment. While recent deep learning approaches have improved the state of the art, they often struggle to capture architectural reasoning: the precedence of topological relationships over geometric instantiation, the propagation of functional constraints through adjacency networks, and the emergence of circulation patterns from local connectivity decisions. To address these fundamental challenges, this paper introduces GFLAN, a generative framework that restructures floor plan synthesis through explicit factorization into topological planning and geometric realization. Given a single exterior boundary and a front-door location, our approach departs from direct pixel-to-pixel or wall-tracing generation in favor of a principled two-stage decomposition. Stage A employs a specialized convolutional architecture with dual encoders -- separating invariant spatial context from evolving layout state -- to sequentially allocate room centroids within the building envelope via discrete probability maps over feasible placements. Stage B constructs a heterogeneous graph linking room nodes to boundary vertices, then applies a Transformer-augmented graph neural network (GNN) that jointly regresses room boundaries.
85. 【2512.16270】xtEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering
链接:https://arxiv.org/abs/2512.16270
作者:Rui Gui,Yang Wan,Haochen Han,Dongxing Mao,Fangming Liu,Min Li,Alex Jinpeng Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:drawing significant attention, drawing significant, rendering has recently, recently emerged, challenging frontiers
备注:
点击查看摘要
Abstract:Text rendering has recently emerged as one of the most challenging frontiers in visual generation, drawing significant attention from large-scale diffusion and multimodal models. However, text editing within images remains largely unexplored, as it requires generating legible characters while preserving semantic, geometric, and contextual coherence. To fill this gap, we introduce TextEditBench, a comprehensive evaluation benchmark that explicitly focuses on text-centric regions in images. Beyond basic pixel manipulations, our benchmark emphasizes reasoning-intensive editing scenarios that require models to understand physical plausibility, linguistic meaning, and cross-modal dependencies. We further propose a novel evaluation dimension, Semantic Expectation (SE), which measures reasoning ability of model to maintain semantic consistency, contextual coherence, and cross-modal alignment during text editing. Extensive experiments on state-of-the-art editing systems reveal that while current models can follow simple textual instructions, they still struggle with context-dependent reasoning, physical consistency, and layout-aware integration. By focusing evaluation on this long-overlooked yet fundamental capability, TextEditBench establishes a new testing ground for advancing text-guided image editing and reasoning in multimodal generation.
86. 【2512.16266】Pixel Super-Resolved Fluorescence Lifetime Imaging Using Deep Learning
链接:https://arxiv.org/abs/2512.16266
作者:Paloma Casteleiro Costa,Parnian Ghapandar Kashani,Xuhui Liu,Alexander Chen,Ary Portes,Julien Bec,Laura Marcu,Aydogan Ozcan
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Medical Physics (physics.med-ph); Optics (physics.optics)
关键词:powerful quantitative technique, Fluorescence lifetime imaging, offering strong translational, strong translational potential, real-time diagnostics
备注: 30 Pages, 9 Figures
点击查看摘要
Abstract:Fluorescence lifetime imaging microscopy (FLIM) is a powerful quantitative technique that provides metabolic and molecular contrast, offering strong translational potential for label-free, real-time diagnostics. However, its clinical adoption remains limited by long pixel dwell times and low signal-to-noise ratio (SNR), which impose a stricter resolution-speed trade-off than conventional optical imaging approaches. Here, we introduce FLIM_PSR_k, a deep learning-based multi-channel pixel super-resolution (PSR) framework that reconstructs high-resolution FLIM images from data acquired with up to a 5-fold increased pixel size. The model is trained using the conditional generative adversarial network (cGAN) framework, which, compared to diffusion model-based alternatives, delivers a more robust PSR reconstruction with substantially shorter inference times, a crucial advantage for practical deployment. FLIM_PSR_k not only enables faster image acquisition but can also alleviate SNR limitations in autofluorescence-based FLIM. Blind testing on held-out patient-derived tumor tissue samples demonstrates that FLIM_PSR_k reliably achieves a super-resolution factor of k = 5, resulting in a 25-fold increase in the space-bandwidth product of the output images and revealing fine architectural features lost in lower-resolution inputs, with statistically significant improvements across various image quality metrics. By increasing FLIM's effective spatial resolution, FLIM_PSR_k advances lifetime imaging toward faster, higher-resolution, and hardware-flexible implementations compatible with low-numerical-aperture and miniaturized platforms, better positioning FLIM for translational applications.
87. 【2512.16243】Semi-Supervised Multi-View Crowd Counting by Ranking Multi-View Fusion Models
链接:https://arxiv.org/abs/2512.16243
作者:Qi Zhang,Yunfei Gong,Zhidan Xie,Zhizi Wang,Antoni B. Chan,Hui Huang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:severe occlusion issue, Multi-view, Multi-view crowd counting, multi-view fusion models, multi-view fusion
备注: 13 pages, 7 figures, under review
点击查看摘要
Abstract:Multi-view crowd counting has been proposed to deal with the severe occlusion issue of crowd counting in large and wide scenes. However, due to the difficulty of collecting and annotating multi-view images, the datasets for multi-view counting have a limited number of multi-view frames and scenes. To solve the problem of limited data, one approach is to collect synthetic data to bypass the annotating step, while another is to propose semi- or weakly-supervised or unsupervised methods that demand less multi-view data. In this paper, we propose two semi-supervised multi-view crowd counting frameworks by ranking the multi-view fusion models of different numbers of input views, in terms of the model predictions or the model uncertainties. Specifically, for the first method (vanilla model), we rank the multi-view fusion models' prediction results of different numbers of camera-view inputs, namely, the model's predictions with fewer camera views shall not be larger than the predictions with more camera views. For the second method, we rank the estimated model uncertainties of the multi-view fusion models with a variable number of view inputs, guided by the multi-view fusion models' prediction errors, namely, the model uncertainties with more camera views shall not be larger than those with fewer camera views. These constraints are introduced into the model training in a semi-supervised fashion for multi-view counting with limited labeled data. The experiments demonstrate the advantages of the proposed multi-view model ranking methods compared with other semi-supervised counting methods.
88. 【2512.16235】AI-Powered Dermatological Diagnosis: From Interpretable Models to Clinical Implementation A Comprehensive Framework for Accessible and Trustworthy Skin Disease Detection
链接:https://arxiv.org/abs/2512.16235
作者:Satya Narayana Panda,Vaishnavi Kukkala,Spandana Iyer
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:billion people globally, remains challenging due, limited specialist availability, complex clinical presentations, accurate diagnosis remains
备注: 9 pages, 5 figures, 1 table. Code available at [this https URL](https://github.com/colabre2020/Enhancing-Skin-Disease-Diagnosis)
点击查看摘要
Abstract:Dermatological conditions affect 1.9 billion people globally, yet accurate diagnosis remains challenging due to limited specialist availability and complex clinical presentations. Family history significantly influences skin disease susceptibility and treatment responses, but is often underutilized in diagnostic processes. This research addresses the critical question: How can AI-powered systems integrate family history data with clinical imaging to enhance dermatological diagnosis while supporting clinical trial validation and real-world implementation? We developed a comprehensive multi-modal AI framework that combines deep learning-based image analysis with structured clinical data, including detailed family history patterns. Our approach employs interpretable convolutional neural networks integrated with clinical decision trees that incorporate hereditary risk factors. The methodology includes prospective clinical trials across diverse healthcare settings to validate AI-assisted diagnosis against traditional clinical assessment. In this work, validation was conducted with healthcare professionals to assess AI-assisted outputs against clinical expectations; prospective clinical trials across diverse healthcare settings are proposed as future work. The integrated AI system demonstrates enhanced diagnostic accuracy when family history data is incorporated, particularly for hereditary skin conditions such as melanoma, psoriasis, and atopic dermatitis. Expert feedback indicates potential for improved early detection and more personalized recommendations; formal clinical trials are planned. The framework is designed for integration into clinical workflows while maintaining interpretability through explainable AI mechanisms.
Comments:
9 pages, 5 figures, 1 table. Code available at this https URL
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
MSC classes:
68T05
ACMclasses:
J.3; I.2.1; I.5.4
Cite as:
arXiv:2512.16235 [cs.CV]
(or
arXiv:2512.16235v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2512.16235
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
89. 【2512.16234】ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation
链接:https://arxiv.org/abs/2512.16234
作者:Zichen Geng,Zeeshan Hayder,Wei Liu,Hesheng Wang,Ajmal Mian
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:human reaction generation, reaction generation faces, human reaction, main challenges, faces three main
备注:
点击查看摘要
Abstract:3D human reaction generation faces three main challenges:(1) high motion fidelity, (2) real-time inference, and (3) autoregressive adaptability for online scenarios. Existing methods fail to meet all three simultaneously. We propose ARMFlow, a MeanFlow-based autoregressive framework that models temporal dependencies between actor and reactor motions. It consists of a causal context encoder and an MLP-based velocity predictor. We introduce Bootstrap Contextual Encoding (BSCE) in training, encoding generated history instead of the ground-truth ones, to alleviate error accumulation in autoregressive generation. We further introduce the offline variant ReMFlow, achieving state-of-the-art performance with the fastest inference among offline methods. Our ARMFlow addresses key limitations of online settings by: (1) enhancing semantic alignment via a global contextual encoder; (2) achieving high accuracy and low latency in a single-step inference; and (3) reducing accumulated errors through BSCE. Our single-step online generation surpasses existing online methods on InterHuman and InterX by over 40% in FID, while matching offline state-of-the-art performance despite using only partial sequence conditions.
90. 【2512.16226】Image Compression Using Singular Value Decomposition
链接:https://arxiv.org/abs/2512.16226
作者:Justin Jiang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:making efficient compression, efficient compression important, making efficient, bandwidth demands, substantial portion
备注:
点击查看摘要
Abstract:Images are a substantial portion of the internet, making efficient compression important for reducing storage and bandwidth demands. This study investigates the use of Singular Value Decomposition and low-rank matrix approximations for image compression, evaluating performance using relative Frobenius error and compression ratio. The approach is applied to both grayscale and multichannel images to assess its generality. Results show that the low-rank approximations often produce images that appear visually similar to the originals, but the compression efficiency remains consistently worse than established formats such as JPEG, JPEG2000, and WEBP at comparable error levels. At low tolerated error levels, the compressed representation produced by Singular Value Decomposition can even exceed the size of the original image, indicating that this method is not competitive with industry-standard codecs for practical image compression.
91. 【2512.16219】Learning High-Quality Initial Noise for Single-View Synthesis with Diffusion Models
链接:https://arxiv.org/abs/2512.16219
作者:Zhihao Zhang,Xuejun Yang,Weihua Liu,Mouquan Shen
类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
关键词:attracted increasing attention, Single-view novel view, recently attracted increasing, view synthesis, single image prompt
备注: 16 pages, 9 figures
点击查看摘要
Abstract:Single-view novel view synthesis (NVS) models based on diffusion models have recently attracted increasing attention, as they can generate a series of novel view images from a single image prompt and camera pose information as conditions. It has been observed that in diffusion models, certain high-quality initial noise patterns lead to better generation results than others. However, there remains a lack of dedicated learning frameworks that enable NVS models to learn such high-quality noise. To obtain high-quality initial noise from random Gaussian noise, we make the following contributions. First, we design a discretized Euler inversion method to inject image semantic information into random noise, thereby constructing paired datasets of random and high-quality noise. Second, we propose a learning framework based on an encoder-decoder network (EDN) that directly transforms random noise into high-quality noise. Experiments demonstrate that the proposed EDN can be seamlessly plugged into various NVS models, such as SV3D and MV-Adapter, achieving significant performance improvements across multiple datasets. Code is available at: this https URL.
92. 【2512.16213】Enhanced 3D Shape Analysis via Information Geometry
链接:https://arxiv.org/abs/2512.16213
作者:Amit Vishwakarma,K.S. Subrahamanian Moosath
类目:Computer Vision and Pattern Recognition (cs.CV); Differential Geometry (math.DG)
关键词:Three-dimensional point clouds, highly accurate digital, accurate digital representations, Gaussian Mixture Models, Three-dimensional point
备注:
点击查看摘要
Abstract:Three-dimensional point clouds provide highly accurate digital representations of objects, essential for applications in computer graphics, photogrammetry, computer vision, and robotics. However, comparing point clouds faces significant challenges due to their unstructured nature and the complex geometry of the surfaces they represent. Traditional geometric metrics such as Hausdorff and Chamfer distances often fail to capture global statistical structure and exhibit sensitivity to outliers, while existing Kullback-Leibler (KL) divergence approximations for Gaussian Mixture Models can produce unbounded or numerically unstable values. This paper introduces an information geometric framework for 3D point cloud shape analysis by representing point clouds as Gaussian Mixture Models (GMMs) on a statistical manifold. We prove that the space of GMMs forms a statistical manifold and propose the Modified Symmetric Kullback-Leibler (MSKL) divergence with theoretically guaranteed upper and lower bounds, ensuring numerical stability for all GMM comparisons. Through comprehensive experiments on human pose discrimination (MPI-FAUST dataset) and animal shape comparison (G-PCD dataset), we demonstrate that MSKL provides stable and monotonically varying values that directly reflect geometric variation, outperforming traditional distances and existing KL approximations.
93. 【2512.16202】Open Ad-hoc Categorization with Contextualized Feature Learning
链接:https://arxiv.org/abs/2512.16202
作者:Zilin Wang,Sangwoo Mo,Stella X. Yu,Sima Behpour,Liu Ren
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:handle changing tasks, changing tasks, scenes is essential, agents to handle, handle changing
备注: 26 pages, 17 figures
点击查看摘要
Abstract:Adaptive categorization of visual scenes is essential for AI agents to handle changing tasks. Unlike fixed common categories for plants or animals, ad-hoc categories are created dynamically to serve specific goals. We study open ad-hoc categorization: Given a few labeled exemplars and abundant unlabeled data, the goal is to discover the underlying context and to expand ad-hoc categories through semantic extension and visual clustering around it. Building on the insight that ad-hoc and common categories rely on similar perceptual mechanisms, we propose OAK, a simple model that introduces a small set of learnable context tokens at the input of a frozen CLIP and optimizes with both CLIP's image-text alignment objective and GCD's visual clustering objective. On Stanford and Clevr-4 datasets, OAK achieves state-of-the-art in accuracy and concept discovery across multiple categorizations, including 87.4% novel accuracy on Stanford Mood, surpassing CLIP and GCD by over 50%. Moreover, OAK produces interpretable saliency maps, focusing on hands for Action, faces for Mood, and backgrounds for Location, promoting transparency and trust while enabling adaptive and generalizable categorization.
Comments:
26 pages, 17 figures
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2512.16202 [cs.CV]
(or
arXiv:2512.16202v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2512.16202
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Journalreference:
CVPR 2025
Related DOI:
https://doi.org/10.1109/cvpr52734.2025.01407
Focus to learn more
DOI(s) linking to related resources</p>
94. 【2512.16201】Visual Alignment of Medical Vision-Language Models for Grounded Radiology Report Generation
链接:https://arxiv.org/abs/2512.16201
作者:Sarosij Bose,Ravi K. Rajendran,Biplob Debnath,Konstantinos Karydis,Amit K. Roy-Chowdhury,Srimat Chakradhar
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:automating healthcare workflows, Radiology Report Generation, facilitating accurate patient, accurate patient assessments, Medical Vision-Language Models
备注:
点击查看摘要
Abstract:Radiology Report Generation (RRG) is a critical step toward automating healthcare workflows, facilitating accurate patient assessments, and reducing the workload of medical professionals. Despite recent progress in Large Medical Vision-Language Models (Med-VLMs), generating radiology reports that are both visually grounded and clinically accurate remains a significant challenge. Existing approaches often rely on large labeled corpora for pre-training, costly task-specific preference data, or retrieval-based methods. However, these strategies do not adequately mitigate hallucinations arising from poor cross-modal alignment between visual and linguistic representations. To address these limitations, we propose VALOR:Visual Alignment of Medical Vision-Language Models for GrOunded Radiology Report Generation. Our method introduces a reinforcement learning-based post-alignment framework utilizing Group-Relative Proximal Optimization (GRPO). The training proceeds in two stages: (1) improving the Med-VLM with textual rewards to encourage clinically precise terminology, and (2) aligning the vision projection module of the textually grounded model with disease findings, thereby guiding attention toward image re gions most relevant to the diagnostic task. Extensive experiments on multiple benchmarks demonstrate that VALOR substantially improves factual accuracy and visual grounding, achieving significant performance gains over state-of-the-art report generation methods.
95. 【2512.16199】Avatar4D: Synthesizing Domain-Specific 4D Humans for Real-World Pose Estimation
链接:https://arxiv.org/abs/2512.16199
作者:Jerrin Bright,Zhibo Wang,Dmytro Klepachevskyi,Yuhao Chen,Sirisha Rambhatla,David Clausi,John Zelek
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:generating customizable synthetic, pipeline for generating, generating customizable, motion datasets tailored, customizable synthetic human
备注:
点击查看摘要
Abstract:We present Avatar4D, a real-world transferable pipeline for generating customizable synthetic human motion datasets tailored to domain-specific applications. Unlike prior works, which focus on general, everyday motions and offer limited flexibility, our approach provides fine-grained control over body pose, appearance, camera viewpoint, and environmental context, without requiring any manual annotations. To validate the impact of Avatar4D, we focus on sports, where domain-specific human actions and movement patterns pose unique challenges for motion understanding. In this setting, we introduce Syn2Sport, a large-scale synthetic dataset spanning sports, including baseball and ice hockey. Avatar4D features high-fidelity 4D (3D geometry over time) human motion sequences with varying player appearances rendered in diverse environments. We benchmark several state-of-the-art pose estimation models on Syn2Sport and demonstrate their effectiveness for supervised learning, zero-shot transfer to real-world data, and generalization across sports. Furthermore, we evaluate how closely the generated synthetic data aligns with real-world datasets in feature space. Our results highlight the potential of such systems to generate scalable, controllable, and transferable human datasets for diverse domain-specific tasks without relying on domain-specific real data.
96. 【2512.16178】owards Closing the Domain Gap with Event Cameras
链接:https://arxiv.org/abs/2512.16178
作者:M. Oltan Sevinc,Liao Wu,Francisco Cruz
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
关键词:performance suffers greatly, deployment environment, primary sensor, suffers greatly, match the deployment
备注: Accepted to Australasian Conference on Robotics and Automation (ACRA), 2025
点击查看摘要
Abstract:Although traditional cameras are the primary sensor for end-to-end driving, their performance suffers greatly when the conditions of the data they were trained on does not match the deployment environment, a problem known as the domain gap. In this work, we consider the day-night lighting difference domain gap. Instead of traditional cameras we propose event cameras as a potential alternative which can maintain performance across lighting condition domain gaps without requiring additional adjustments. Our results show that event cameras maintain more consistent performance across lighting conditions, exhibiting domain-shift penalties that are generally comparable to or smaller than grayscale frames and provide superior baseline performance in cross-domain scenarios.
97. 【2512.16164】C-DGPA: Class-Centric Dual-Alignment Generative Prompt Adaptation
链接:https://arxiv.org/abs/2512.16164
作者:Chao Li,Dasha Hu,Chengyang Li,Yuming Jiang,Yuncheng Shen
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Unsupervised Domain Adaptation, Adaptation transfers knowledge, Unsupervised Domain, unlabeled target domain, Domain Adaptation transfers
备注:
点击查看摘要
Abstract:Unsupervised Domain Adaptation transfers knowledge from a labeled source domain to an unlabeled target domain. Directly deploying Vision-Language Models (VLMs) with prompt tuning in downstream UDA tasks faces the signifi cant challenge of mitigating domain discrepancies. Existing prompt-tuning strategies primarily align marginal distribu tion, but neglect conditional distribution discrepancies, lead ing to critical issues such as class prototype misalignment and degraded semantic discriminability. To address these lim itations, the work proposes C-DGPA: Class-Centric Dual Alignment Generative Prompt Adaptation. C-DGPA syner gistically optimizes marginal distribution alignment and con ditional distribution alignment through a novel dual-branch architecture. The marginal distribution alignment branch em ploys a dynamic adversarial training framework to bridge marginal distribution discrepancies. Simultaneously, the con ditional distribution alignment branch introduces a Class Mapping Mechanism (CMM) to align conditional distribu tion discrepancies by standardizing semantic prompt under standing and preventing source domain over-reliance. This dual alignment strategy effectively integrates domain knowl edge into prompt learning via synergistic optimization, ensur ing domain-invariant and semantically discriminative repre sentations. Extensive experiments on OfficeHome, Office31, and VisDA-2017 validate the superiority of C-DGPA. It achieves new state-of-the-art results on all benchmarks.
98. 【2512.16143】SegGraph: Leveraging Graphs of SAM Segments for Few-Shot 3D Part Segmentation
链接:https://arxiv.org/abs/2512.16143
作者:Yueyang Hu,Haiyong Jiang,Haoxuan Song,Jun Xiao,Hao Pan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:framework for few-shot, work presents, part segmentation, foundation models, geometric
备注:
点击查看摘要
Abstract:This work presents a novel framework for few-shot 3D part segmentation. Recent advances have demonstrated the significant potential of 2D foundation models for low-shot 3D part segmentation. However, it is still an open problem that how to effectively aggregate 2D knowledge from foundation models to 3D. Existing methods either ignore geometric structures for 3D feature learning or neglects the high-quality grouping clues from SAM, leading to under-segmentation and inconsistent part labels. We devise a novel SAM segment graph-based propagation method, named SegGraph, to explicitly learn geometric features encoded within SAM's segmentation masks. Our method encodes geometric features by modeling mutual overlap and adjacency between segments while preserving intra-segment semantic consistency. We construct a segment graph, conceptually similar to an atlas, where nodes represent segments and edges capture their spatial relationships (overlap/adjacency). Each node adaptively modulates 2D foundation model features, which are then propagated via a graph neural network to learn global geometric structures. To enforce intra-segment semantic consistency, we map segment features to 3D points with a novel view-direction-weighted fusion attenuating contributions from low-quality segments. Extensive experiments on PartNet-E demonstrate that our method outperforms all competing baselines by at least 6.9 percent mIoU. Further analysis reveals that SegGraph achieves particularly strong performance on small components and part boundaries, demonstrating its superior geometric understanding. The code is available at: this https URL.
99. 【2512.16140】ResDynUNet++: A nested U-Net with residual dynamic convolution blocks for dual-spectral CT
链接:https://arxiv.org/abs/2512.16140
作者:Ze Yuan,Wenbin Li,Shusen Zhao
类目:Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
关键词:deep learning models, hybrid reconstruction framework, integrates iterative methods, learning models, propose a hybrid
备注:
点击查看摘要
Abstract:We propose a hybrid reconstruction framework for dual-spectral CT (DSCT) that integrates iterative methods with deep learning models. The reconstruction process consists of two complementary components: a knowledge-driven module and a data-driven module. In the knowledge-driven phase, we employ the oblique projection modification technique (OPMT) to reconstruct an intermediate solution of the basis material images from the projection data. We select OPMT for this role because of its fast convergence, which allows it to rapidly generate an intermediate solution that successfully achieves basis material decomposition. Subsequently, in the data-driven phase, we introduce a novel neural network, ResDynUNet++, to refine this intermediate solution. The ResDynUNet++ is built upon a UNet++ backbone by replacing standard convolutions with residual dynamic convolution blocks, which combine the adaptive, input-specific feature extraction of dynamic convolution with the stable training of residual connections. This architecture is designed to address challenges like channel imbalance and near-interface large artifacts in DSCT, producing clean and accurate final solutions. Extensive experiments on both synthetic phantoms and real clinical datasets validate the efficacy and superior performance of the proposed method.
100. 【2512.16133】Interaction-via-Actions: Cattle Interaction Detection with Joint Learning of Action-Interaction Latent Space
链接:https://arxiv.org/abs/2512.16133
作者:Ren Nakagawa,Yang Yang,Risa Shinoda,Hiroaki Santo,Kenji Oyama,Fumio Okura,Takenao Ohkawa
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:smart livestock management, automatically detecting behavioral, detecting behavioral interactions, detecting estrus, automatically detecting
备注: Accepted to WACV 2026
点击查看摘要
Abstract:This paper introduces a method and application for automatically detecting behavioral interactions between grazing cattle from a single image, which is essential for smart livestock management in the cattle industry, such as for detecting estrus. Although interaction detection for humans has been actively studied, a non-trivial challenge lies in cattle interaction detection, specifically the lack of a comprehensive behavioral dataset that includes interactions, as the interactions of grazing cattle are rare events. We, therefore, propose CattleAct, a data-efficient method for interaction detection by decomposing interactions into the combinations of actions by individual cattle. Specifically, we first learn an action latent space from a large-scale cattle action dataset. Then, we embed rare interactions via the fine-tuning of the pre-trained latent space using contrastive learning, thereby constructing a unified latent space of actions and interactions. On top of the proposed method, we develop a practical working system integrating video and GPS inputs. Experiments on a commercial-scale pasture demonstrate the accurate interaction detection achieved by our method compared to the baselines. Our implementation is available at this https URL.
101. 【2512.16126】Dual-View Inference Attack: Machine Unlearning Amplifies Privacy Exposure
链接:https://arxiv.org/abs/2512.16126
作者:Lulu Xue,Shengshan Hu,Linqiang Qian,Peijin Guo,Yechao Zhang,Minghui Li,Yanjun Zhang,Dayong Ye,Leo Yu Zhang
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:newly popularized technique, removing specific training, data deletion requests, specific training data, deletion requests
备注: Accepeted by AAAI2026
点击查看摘要
Abstract:Machine unlearning is a newly popularized technique for removing specific training data from a trained model, enabling it to comply with data deletion requests. While it protects the rights of users requesting unlearning, it also introduces new privacy risks. Prior works have primarily focused on the privacy of data that has been unlearned, while the risks to retained data remain largely unexplored. To address this gap, we focus on the privacy risks of retained data and, for the first time, reveal the vulnerabilities introduced by machine unlearning under the dual-view setting, where an adversary can query both the original and the unlearned models. From an information-theoretic perspective, we introduce the concept of {privacy knowledge gain} and demonstrate that the dual-view setting allows adversaries to obtain more information than querying either model alone, thereby amplifying privacy leakage. To effectively demonstrate this threat, we propose DVIA, a Dual-View Inference Attack, which extracts membership information on retained data using black-box queries to both models. DVIA eliminates the need to train an attack model and employs a lightweight likelihood ratio inference module for efficient inference. Experiments across different datasets and model architectures validate the effectiveness of DVIA and highlight the privacy risks inherent in the dual-view setting.
102. 【2512.16123】Autoencoder-based Denoising Defense against Adversarial Attacks on Object Detection
链接:https://arxiv.org/abs/2512.16123
作者:Min Geun Song,Gang Min Kim,Woonmin Kim,Yongsik Kim,Jeonghyun Sim,Sangbeom Park,Huy Kang Kim
类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Deep learning-based object, security surveillance systems, Deep learning-based, learning-based object detection, surveillance systems
备注: 7 pages, 2 figures
点击查看摘要
Abstract:Deep learning-based object detection models play a critical role in real-world applications such as autonomous driving and security surveillance systems, yet they remain vulnerable to adversarial examples. In this work, we propose an autoencoder-based denoising defense to recover object detection performance degraded by adversarial perturbations. We conduct adversarial attacks using Perlin noise on vehicle-related images from the COCO dataset, apply a single-layer convolutional autoencoder to remove the perturbations, and evaluate detection performance using YOLOv5. Our experiments demonstrate that adversarial attacks reduce bbox mAP from 0.2890 to 0.1640, representing a 43.3% performance degradation. After applying the proposed autoencoder defense, bbox mAP improves to 0.1700 (3.7% recovery) and bbox mAP@50 increases from 0.2780 to 0.3080 (10.8% improvement). These results indicate that autoencoder-based denoising can provide partial defense against adversarial attacks without requiring model retraining.
103. 【2512.16113】Flexible Camera Calibration using a Collimator System
链接:https://arxiv.org/abs/2512.16113
作者:Shunkun Liang,Banglei Guan,Zhenbao Yu,Dongcai Tan,Pengju Sun,Zibin Liu,Qifeng Yu,Yang Shang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:vision applications, collimator system, Camera, crucial step, step in photogrammetry
备注:
点击查看摘要
Abstract:Camera calibration is a crucial step in photogrammetry and 3D vision applications. This paper introduces a novel camera calibration method using a designed collimator system. Our collimator system provides a reliable and controllable calibration environment for the camera. Exploiting the unique optical geometry property of our collimator system, we introduce an angle invariance constraint and further prove that the relative motion between the calibration target and camera conforms to a spherical motion model. This constraint reduces the original 6DOF relative motion between target and camera to a 3DOF pure rotation motion. Using spherical motion constraint, a closed-form linear solver for multiple images and a minimal solver for two images are proposed for camera calibration. Furthermore, we propose a single collimator image calibration algorithm based on the angle invariance constraint. This algorithm eliminates the requirement for camera motion, providing a novel solution for flexible and fast calibration. The performance of our method is evaluated in both synthetic and real-world experiments, which verify the feasibility of calibration using the collimator system and demonstrate that our method is superior to existing baseline methods. Demo code is available at this https URL
104. 【2512.16101】A Tri-Dynamic Preprocessing Framework for UGC Video Compression
链接:https://arxiv.org/abs/2512.16101
作者:Fei Zhao,Mengxi Guo,Shijie Zhao,Junlin Li,Li Zhang,Xiaodong Xie
类目:Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
关键词:user generated content, recent years, user generated, generated content, internet traffic
备注: Accepted as a POSTER and for publication in the ICASSP 2024 proceedings
点击查看摘要
Abstract:In recent years, user generated content (UGC) has become the dominant force in internet traffic. However, UGC videos exhibit a higher degree of variability and diverse characteristics compared to traditional encoding test videos. This variance challenges the effectiveness of data-driven machine learning algorithms for optimizing encoding in the broader context of UGC scenarios. To address this issue, we propose a Tri-Dynamic Preprocessing framework for UGC. Firstly, we employ an adaptive factor to regulate preprocessing intensity. Secondly, an adaptive quantization level is employed to fine-tune the codec simulator. Thirdly, we utilize an adaptive lambda tradeoff to adjust the rate-distortion loss function. Experimental results on large-scale test sets demonstrate that our method attains exceptional performance.
105. 【2512.16093】urboDiffusion: Accelerating Video Diffusion Models by 100-200 Times
链接:https://arxiv.org/abs/2512.16093
作者:Jintao Zhang,Kaiwen Zheng,Kai Jiang,Haoxu Wang,Ion Stoica,Joseph E. Gonzalez,Jianfei Chen,Jun Zhu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:generation acceleration framework, Step distillation, acceleration framework, TurboDiffusion, Attention
备注:
点击查看摘要
Abstract:We introduce TurboDiffusion, a video generation acceleration framework that can speed up end-to-end diffusion generation by 100-200x while maintaining video quality. TurboDiffusion mainly relies on several components for acceleration: (1) Attention acceleration: TurboDiffusion uses low-bit SageAttention and trainable Sparse-Linear Attention (SLA) to speed up attention computation. (2) Step distillation: TurboDiffusion adopts rCM for efficient step distillation. (3) W8A8 quantization: TurboDiffusion quantizes model parameters and activations to 8 bits to accelerate linear layers and compress the model. In addition, TurboDiffusion incorporates several other engineering optimizations. We conduct experiments on the Wan2.2-I2V-14B-720P, Wan2.1-T2V-1.3B-480P, Wan2.1-T2V-14B-720P, and Wan2.1-T2V-14B-480P models. Experimental results show that TurboDiffusion achieves 100-200x speedup for video generation even on a single RTX 5090 GPU, while maintaining comparable video quality. The GitHub repository, which includes model checkpoints and easy-to-use code, is available at this https URL.
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:
arXiv:2512.16093 [cs.CV]
(or
arXiv:2512.16093v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2512.16093
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
106. 【2512.16092】Collimator-assisted high-precision calibration method for event cameras
链接:https://arxiv.org/abs/2512.16092
作者:Zibin Liu,Shunkun Liang,Banglei Guan,Dongcai Tan,Yang Shang,Qifeng Yu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:brain-inspired visual sensor, high temporal resolution, high dynamic range, event camera calibration, temporal resolution
备注: 4 pages, 3 figures
点击查看摘要
Abstract:Event cameras are a new type of brain-inspired visual sensor with advantages such as high dynamic range and high temporal resolution. The geometric calibration of event cameras, which involves determining their intrinsic and extrinsic parameters, particularly in long-range measurement scenarios, remains a significant challenge. To address the dual requirements of long-distance and high-precision measurement, we propose an event camera calibration method utilizing a collimator with flickering star-based patterns. The proposed method first linearly solves camera parameters using the sphere motion model of the collimator, followed by nonlinear optimization to refine these parameters with high precision. Through comprehensive real-world experiments across varying conditions, we demonstrate that the proposed method consistently outperforms existing event camera calibration methods in terms of accuracy and reliability.
107. 【2512.16089】LAPX: Lightweight Hourglass Network with Global Context
链接:https://arxiv.org/abs/2512.16089
作者:Haopeng Zhao,Marsha Mariya Kappan,Mahdi Bamdad,Francisco Cruz
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Human pose estimation, Human pose, computer vision, pose estimation, crucial task
备注: 10 pages
点击查看摘要
Abstract:Human pose estimation is a crucial task in computer vision. Methods that have SOTA (State-of-the-Art) accuracy, often involve a large number of parameters and incur substantial computational cost. Many lightweight variants have been proposed to reduce the model size and computational cost of them. However, several of these methods still contain components that are not well suited for efficient deployment on edge devices. Moreover, models that primarily emphasize inference speed on edge devices often suffer from limited accuracy due to their overly simplified designs. To address these limitations, we propose LAPX, an Hourglass network with self-attention that captures global contextual information, based on previous work, LAP. In addition to adopting the self-attention module, LAPX advances the stage design and refine the lightweight attention modules. It achieves competitive results on two benchmark datasets, MPII and COCO, with only 2.3M parameters, and demonstrates real-time performance, confirming its edge-device suitability.
108. 【2512.16077】Auto-Vocabulary 3D Object Detection
链接:https://arxiv.org/abs/2512.16077
作者:Haomeng Zhang,Kuan-Chuan Peng,Suhas Lohit,Raymond A. Yeh
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:object detection methods, object detection, Open-vocabulary, classes unseen, existing methods rely
备注: technical report
点击查看摘要
Abstract:Open-vocabulary 3D object detection methods are able to localize 3D boxes of classes unseen during training. Despite the name, existing methods rely on user-specified classes both at training and inference. We propose to study Auto-Vocabulary 3D Object Detection (AV3DOD), where the classes are automatically generated for the detected objects without any user input. To this end, we introduce Semantic Score (SS) to evaluate the quality of the generated class names. We then develop a novel framework, AV3DOD, which leverages 2D vision-language models (VLMs) to generate rich semantic candidates through image captioning, pseudo 3D box generation, and feature-space semantics expansion. AV3DOD achieves the state-of-the-art (SOTA) performance on both localization (mAP) and semantic quality (SS) on the ScanNetV2 and SUNRGB-D datasets. Notably, it surpasses the SOTA, CoDA, by 3.48 overall mAP and attains a 24.5% relative improvement in SS on ScanNetV2.
109. 【2512.16075】FOD-Diff: 3D Multi-Channel Patch Diffusion Model for Fiber Orientation Distribution
链接:https://arxiv.org/abs/2512.16075
作者:Hao Tang,Hanyu Liu,Alessandro Perelli,Xi Chen,Chao Li
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:fiber orientation distribution, white matter integrity, critical non-invasive technique, estimate fiber orientation, characterizing white matter
备注:
点击查看摘要
Abstract:Diffusion MRI (dMRI) is a critical non-invasive technique to estimate fiber orientation distribution (FOD) for characterizing white matter integrity. Estimating FOD from single-shell low angular resolution dMRI (LAR-FOD) is limited by accuracy, whereas estimating FOD from multi-shell high angular resolution dMRI (HAR-FOD) requires a long scanning time, which limits its applicability. Diffusion models have shown promise in estimating HAR-FOD based on LAR-FOD. However, using diffusion models to efficiently generate HAR-FOD is challenging due to the large number of spherical harmonic (SH) coefficients in FOD. Here, we propose a 3D multi-channel patch diffusion model to predict HAR-FOD from LAR-FOD. We design the FOD-patch adapter by introducing the prior brain anatomy for more efficient patch-based learning. Furthermore, we introduce a voxel-level conditional coordinating module to enhance the global understanding of the model. We design the SH attention module to effectively learn the complex correlations of the SH coefficients. Our experimental results show that our method achieves the best performance in HAR-FOD prediction and outperforms other state-of-the-art methods.
110. 【2512.16055】Driving in Corner Case: A Real-World Adversarial Closed-Loop Evaluation Platform for End-to-End Autonomous Driving
链接:https://arxiv.org/abs/2512.16055
作者:Jiaheng Geng,Jiatong Du,Xinyu Zhang,Ye Li,Panqu Wang,Yanjun Huang
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:Safety-critical corner cases, Safety-critical corner, autonomous driving, corner cases, difficult to collect
备注:
点击查看摘要
Abstract:Safety-critical corner cases, difficult to collect in the real world, are crucial for evaluating end-to-end autonomous driving. Adversarial interaction is an effective method to generate such safety-critical corner cases. While existing adversarial evaluation methods are built for models operating in simplified simulation environments, adversarial evaluation for real-world end-to-end autonomous driving has been little explored. To address this challenge, we propose a closed-loop evaluation platform for end-to-end autonomous driving, which can generate adversarial interactions in real-world scenes. In our platform, the real-world image generator cooperates with an adversarial traffic policy to evaluate various end-to-end models trained on real-world data. The generator, based on flow matching, efficiently and stably generates real-world images according to the traffic environment information. The efficient adversarial surrounding vehicle policy is designed to model challenging interactions and create corner cases that current autonomous driving systems struggle to handle. Experimental results demonstrate that the platform can generate realistic driving images efficiently. Through evaluating the end-to-end models such as UniAD and VAD, we demonstrate that based on the adversarial policy, our platform evaluates the performance degradation of the tested model in corner cases. This result indicates that this platform can effectively detect the model's potential issues, which will facilitate the safety and robustness of end-to-end autonomous driving.
111. 【2512.16023】CoVAR: Co-generation of Video and Action for Robotic Manipulation via Multi-Modal Diffusion
链接:https://arxiv.org/abs/2512.16023
作者:Liudi Yang,Yang Bai,George Eskandar,Fengyi Shen,Mohammad Altillawi,Dong Chen,Ziyuan Liu,Abhinav Valada
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:follow text instructions, initial image observation, robot joint states, generate video-action pairs, text instructions
备注: 9 pages, 7 figures
点击查看摘要
Abstract:We present a method to generate video-action pairs that follow text instructions, starting from an initial image observation and the robot's joint states. Our approach automatically provides action labels for video diffusion models, overcoming the common lack of action annotations and enabling their full use for robotic policy learning. Existing methods either adopt two-stage pipelines, which limit tightly coupled cross-modal information sharing, or rely on adapting a single-modal diffusion model for a joint distribution that cannot fully leverage pretrained video knowledge. To overcome these limitations, we (1) extend a pretrained video diffusion model with a parallel, dedicated action diffusion model that preserves pretrained knowledge, (2) introduce a Bridge Attention mechanism to enable effective cross-modal interaction, and (3) design an action refinement module to convert coarse actions into precise controls for low-resolution datasets. Extensive evaluations on multiple public benchmarks and real-world datasets demonstrate that our method generates higher-quality videos, more accurate actions, and significantly outperforms existing baselines, offering a scalable framework for leveraging large-scale video data for robotic learning.
112. 【2512.15993】Eyes on the Grass: Biodiversity-Increasing Robotic Mowing Using Deep Visual Embeddings
链接:https://arxiv.org/abs/2512.15993
作者:Lars Beckers,Arno Waes,Aaron Van Campenhout,Toon Goedemé
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:actively enhances garden, adaptive decision-making, paper presents, framework that actively, actively enhances
备注:
点击查看摘要
Abstract:This paper presents a robotic mowing framework that actively enhances garden biodiversity through visual perception and adaptive decision-making. Unlike passive rewilding approaches, the proposed system uses deep feature-space analysis to identify and preserve visually diverse vegetation patches in camera images by selectively deactivating the mower blades. A ResNet50 network pretrained on PlantNet300K provides ecologically meaningful embeddings, from which a global deviation metric estimates biodiversity without species-level supervision. These estimates drive a selective mowing algorithm that dynamically alternates between mowing and conservation behavior. The system was implemented on a modified commercial robotic mower and validated both in a controlled mock-up lawn and on real garden datasets. Results demonstrate a strong correlation between embedding-space dispersion and expert biodiversity assessment, confirming the feasibility of deep visual diversity as a proxy for ecological richness and the effectiveness of the proposed mowing decision approach. Widespread adoption of such systems will turn ecologically worthless, monocultural lawns into vibrant, valuable biotopes that boost urban biodiversity.
113. 【2512.15977】Are vision-language models ready to zero-shot replace supervised classification models in agriculture?
链接:https://arxiv.org/abs/2512.15977
作者:Earl Ranario,Mason J. Earles
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:visual recognition tasks, agricultural decision support, remains poorly understood, decision support remains, support remains poorly
备注: Draft version
点击查看摘要
Abstract:Vision-language models (VLMs) are increasingly proposed as general-purpose solutions for visual recognition tasks, yet their reliability for agricultural decision support remains poorly understood. We benchmark a diverse set of open-source and closed-source VLMs on 27 agricultural classification datasets from the AgML collection, spanning 162 classes across plant disease, pest and damage, and plant and weed species identification. Across all tasks, zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11), which consistently achieves markedly higher accuracy than any foundation model. Under multiple-choice prompting, the best-performing VLM (Gemini-3 Pro) reaches approximately 62% average accuracy, while open-ended prompting yields much lower performance, with raw accuracies typically below 25%. Applying LLM-based semantic judging increases open-ended accuracy (for example, from 21% to 30% for top models) and alters model rankings, demonstrating that evaluation methodology meaningfully affects reported conclusions. Among open-source models, Qwen-VL-72B performs best, approaching closed-source performance under constrained prompting but still trailing top proprietary systems. Task-level analysis shows that plant and weed species classification is consistently easier than pest and damage identification, which remains the most challenging category across models. Overall, these results indicate that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but can function as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies.
114. 【2512.15971】From Words to Wavelengths: VLMs for Few-Shot Multispectral Object Detection
链接:https://arxiv.org/abs/2512.15971
作者:Manuel Nkegoum,Minh-Tan Pham,Élisa Fromont,Bruno Avignon,Sébastien Lefèvre
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:diverse illumination conditions, driving and surveillance, conditions is essential, critical for safety-sensitive, safety-sensitive applications
备注:
点击查看摘要
Abstract:Multispectral object detection is critical for safety-sensitive applications such as autonomous driving and surveillance, where robust perception under diverse illumination conditions is essential. However, the limited availability of annotated multispectral data severely restricts the training of deep detectors. In such data-scarce scenarios, textual class information can serve as a valuable source of semantic supervision. Motivated by the recent success of Vision-Language Models (VLMs) in computer vision, we explore their potential for few-shot multispectral object detection. Specifically, we adapt two representative VLM-based detectors, Grounding DINO and YOLO-World, to handle multispectral inputs and propose an effective mechanism to integrate text, visual and thermal modalities. Through extensive experiments on two popular multispectral image benchmarks, FLIR and M3FD, we demonstrate that VLM-based detectors not only excel in few-shot regimes, significantly outperforming specialized multispectral models trained with comparable data, but also achieve competitive or superior results under fully supervised settings. Our findings reveal that the semantic priors learned by large-scale VLMs effectively transfer to unseen spectral modalities, ofFering a powerful pathway toward data-efficient multispectral perception.
115. 【2512.15957】Seeing is Believing (and Predicting): Context-Aware Multi-Human Behavior Prediction with Vision Language Models
链接:https://arxiv.org/abs/2512.15957
作者:Utsav Panchal,Yuchen Liu,Luigi Palmieri,Ilche Georgievski,Marco Aiello
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Accurately predicting human, mobile robots operating, Accurately predicting, human-populated environments, crucial for mobile
备注: Accepted at IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026
点击查看摘要
Abstract:Accurately predicting human behaviors is crucial for mobile robots operating in human-populated environments. While prior research primarily focuses on predicting actions in single-human scenarios from an egocentric view, several robotic applications require understanding multiple human behaviors from a third-person perspective. To this end, we present CAMP-VLM (Context-Aware Multi-human behavior Prediction): a Vision Language Model (VLM)-based framework that incorporates contextual features from visual input and spatial awareness from scene graphs to enhance prediction of humans-scene interactions. Due to the lack of suitable datasets for multi-human behavior prediction from an observer view, we perform fine-tuning of CAMP-VLM with synthetic human behavior data generated by a photorealistic simulator, and evaluate the resulting models on both synthetic and real-world sequences to assess their generalization capabilities. Leveraging Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), CAMP-VLM outperforms the best-performing baseline by up to 66.9% in prediction accuracy.
116. 【2512.15949】he Perceptual Observatory Characterizing Robustness and Grounding in MLLMs
链接:https://arxiv.org/abs/2512.15949
作者:Tejas Anvekar,Fenil Bardoliya,Pavan K. Turaga,Chitta Baral,Vivek Gupta
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:remain poorly characterized, yielded increasingly powerful, capacities remain poorly, multimodal large language, perceptual capacities remain
备注: Accepted at WACV 2026
点击查看摘要
Abstract:Recent advances in multimodal large language models (MLLMs) have yielded increasingly powerful models, yet their perceptual capacities remain poorly characterized. In practice, most model families scale language component while reusing nearly identical vision encoders (e.g., Qwen2.5-VL 3B/7B/72B), which raises pivotal concerns about whether progress reflects genuine visual grounding or reliance on internet-scale textual world knowledge. Existing evaluation methods emphasize end-task accuracy, overlooking robustness, attribution fidelity, and reasoning under controlled perturbations. We present The Perceptual Observatory, a framework that characterizes MLLMs across verticals like: (i) simple vision tasks, such as face matching and text-in-vision comprehension capabilities; (ii) local-to-global understanding, encompassing image matching, grid pointing game, and attribute localization, which tests general visual grounding. Each vertical is instantiated with ground-truth datasets of faces and words, systematically perturbed through pixel-based augmentations and diffusion-based stylized illusions. The Perceptual Observatory moves beyond leaderboard accuracy to yield insights into how MLLMs preserve perceptual grounding and relational structure under perturbations, providing a principled foundation for analyzing strengths and weaknesses of current and future models.
117. 【2512.15940】R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space
链接:https://arxiv.org/abs/2512.15940
作者:Tin Stribor Sohn,Maximilian Dillitzer,Jason J. Corso,Eric Sax
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:Humans perceive, structured internal representations, encode semantic meaning, perceive and reason, dimensions by building
备注:
点击查看摘要
Abstract:Humans perceive and reason about their surroundings in four dimensions by building persistent, structured internal representations that encode semantic meaning, spatial layout, and temporal dynamics. These multimodal memories enable them to recall past events, infer unobserved states, and integrate new information into context-dependent reasoning. Inspired by this capability, we introduce R4, a training-free framework for retrieval-augmented reasoning in 4D spatio-temporal space that equips vision-language models (VLMs) with structured, lifelong memory. R4 continuously constructs a 4D knowledge database by anchoring object-level semantic descriptions in metric space and time, yielding a persistent world model that can be shared across agents. At inference, natural language queries are decomposed into semantic, spatial, and temporal keys to retrieve relevant observations, which are integrated into the VLM's reasoning. Unlike classical retrieval-augmented generation methods, retrieval in R4 operates directly in 4D space, enabling episodic and collaborative reasoning without training. Experiments on embodied question answering and navigation benchmarks demonstrate that R4 substantially improves retrieval and reasoning over spatio-temporal information compared to baselines, advancing a new paradigm for embodied 4D reasoning in dynamic environments.
118. 【2512.15938】SALVE: Sparse Autoencoder-Latent Vector Editing for Mechanistic Control of Neural Networks
链接:https://arxiv.org/abs/2512.15938
作者:Vegard Flovik
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Deep neural networks, neural networks achieve, networks achieve impressive, achieve impressive performance, Deep neural
备注: Under review
点击查看摘要
Abstract:Deep neural networks achieve impressive performance but remain difficult to interpret and control. We present SALVE (Sparse Autoencoder-Latent Vector Editing), a unified "discover, validate, and control" framework that bridges mechanistic interpretability and model editing. Using an $\ell_1$-regularized autoencoder, we learn a sparse, model-native feature basis without supervision. We validate these features with Grad-FAM, a feature-level saliency mapping method that visually grounds latent features in input data. Leveraging the autoencoder's structure, we perform precise and permanent weight-space interventions, enabling continuous modulation of both class-defining and cross-class features. We further derive a critical suppression threshold, $\alpha_{crit}$, quantifying each class's reliance on its dominant feature, supporting fine-grained robustness diagnostics. Our approach is validated on both convolutional (ResNet-18) and transformer-based (ViT-B/16) models, demonstrating consistent, interpretable control over their behavior. This work contributes a principled methodology for turning feature discovery into actionable model edits, advancing the development of transparent and controllable AI systems.
119. 【2512.15933】City Navigation in the Wild: Exploring Emergent Navigation from Web-Scale Knowledge in MLLMs
链接:https://arxiv.org/abs/2512.15933
作者:Dwip Dalal,Utkarsh Mishra,Narendra Ahuja,Nebojsa Jojic
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:large language models, Leveraging multimodal large, offers significant promise, addressing complex real-world, multimodal large language
备注:
点击查看摘要
Abstract:Leveraging multimodal large language models (MLLMs) to develop embodied agents offers significant promise for addressing complex real-world tasks. However, current evaluation benchmarks remain predominantly language-centric or heavily reliant on simulated environments, rarely probing the nuanced, knowledge-intensive reasoning essential for practical, real-world scenarios. To bridge this critical gap, we introduce the task of Sparsely Grounded Visual Navigation, explicitly designed to evaluate the sequential decision-making abilities of MLLMs in challenging, knowledge-intensive real-world environments. We operationalize this task with CityNav, a comprehensive benchmark encompassing four diverse global cities, specifically constructed to assess raw MLLM-driven agents in city navigation. Agents are required to rely solely on visual inputs and internal multimodal reasoning to sequentially navigate 50+ decision points without additional environmental annotations or specialized architectural modifications. Crucially, agents must autonomously achieve localization through interpreting city-specific cues and recognizing landmarks, perform spatial reasoning, and strategically plan and execute routes to their destinations. Through extensive evaluations, we demonstrate that current state-of-the-art MLLMs and standard reasoning techniques (e.g., Chain-of-Thought, Reflection) significantly underperform in this challenging setting. To address this, we propose Verbalization of Path (VoP), which explicitly grounds the agent's internal reasoning by probing an explicit cognitive map (key landmarks and directions toward the destination) from the MLLMs, substantially enhancing navigation success. Project Webpage: this https URL
120. 【2512.15885】Seeing Beyond Words: Self-Supervised Visual Learning for Multimodal Large Language Models
链接:https://arxiv.org/abs/2512.15885
作者:Davide Caffagni,Sara Sarto,Marcella Cornia,Lorenzo Baraldi,Pier Luigi Dovesi,Shaghayegh Roohi,Mark Granroth-Wilding,Rita Cucchiara
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
关键词:tasks remains limited, recently demonstrated impressive, demonstrated impressive capabilities, Large Language Models, Multimodal Large Language
备注:
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) have recently demonstrated impressive capabilities in connecting vision and language, yet their proficiency in fundamental visual reasoning tasks remains limited. This limitation can be attributed to the fact that MLLMs learn visual understanding primarily from textual descriptions, which constitute a subjective and inherently incomplete supervisory signal. Furthermore, the modest scale of multimodal instruction tuning compared to massive text-only pre-training leads MLLMs to overfit language priors while overlooking visual details. To address these issues, we introduce JARVIS, a JEPA-inspired framework for self-supervised visual enhancement in MLLMs. Specifically, we integrate the I-JEPA learning paradigm into the standard vision-language alignment pipeline of MLLMs training. Our approach leverages frozen vision foundation models as context and target encoders, while training the predictor, implemented as the early layers of an LLM, to learn structural and semantic regularities from images without relying exclusively on language supervision. Extensive experiments on standard MLLM benchmarks show that JARVIS consistently improves performance on vision-centric benchmarks across different LLM families, without degrading multimodal reasoning abilities. Our source code is publicly available at: this https URL.
121. 【2512.15840】Large Video Planner Enables Generalizable Robot Control
链接:https://arxiv.org/abs/2512.15840
作者:Boyuan Chen,Tianyuan Zhang,Haoran Geng,Kiwhan Song,Caiyi Zhang,Peihao Li,William T. Freeman,Jitendra Malik,Pieter Abbeel,Russ Tedrake,Vincent Sitzmann,Yilun Du
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:General-purpose robots require, robots require decision-making, require decision-making models, General-purpose robots, require decision-making
备注: 29 pages, 16 figures
点击查看摘要
Abstract:General-purpose robots require decision-making models that generalize across diverse tasks and environments. Recent works build robot foundation models by extending multimodal large language models (MLLMs) with action outputs, creating vision-language-action (VLA) systems. These efforts are motivated by the intuition that MLLMs' large-scale language and image pretraining can be effectively transferred to the action output modality. In this work, we explore an alternative paradigm of using large-scale video pretraining as a primary modality for building robot foundation models. Unlike static images and language, videos capture spatio-temporal sequences of states and actions in the physical world that are naturally aligned with robotic behavior. We curate an internet-scale video dataset of human activities and task demonstrations, and train, for the first time at a foundation-model scale, an open video model for generative robotics planning. The model produces zero-shot video plans for novel scenes and tasks, which we post-process to extract executable robot actions. We evaluate task-level generalization through third-party selected tasks in the wild and real-robot experiments, demonstrating successful physical execution. Together, these results show robust instruction following, strong generalization, and real-world feasibility. We release both the model and dataset to support open, reproducible video-based robot learning. Our website is available at this https URL.
122. 【2512.15829】Human-like Working Memory from Artificial Intrinsic Plasticity Neurons
链接:https://arxiv.org/abs/2512.15829
作者:Jingli Liu,Huannan Zheng,Bohao Zou,Kezhou Yang
类目:Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)
关键词:integrate transient information, human-like working memory, rapid decision-making, Working memory, brain to integrate
备注:
点击查看摘要
Abstract:Working memory enables the brain to integrate transient information for rapid decision-making. Artificial networks typically replicate this via recurrent or parallel architectures, yet incur high energy costs and noise sensitivity. Here we report IPNet, a hardware-software co-designed neuromorphic architecture realizing human-like working memory via neuronal intrinsic plasticity. Exploiting Joule-heating dynamics of Magnetic Tunnel Junctions (MTJs), IPNet physically emulates biological memory volatility. The memory behavior of the proposed architecture shows similar trends in n-back, free recall and memory interference tasks to that of reported human subjects. Implemented exclusively with MTJ neurons, the architecture with human-like working memory achieves 99.65% accuracy on 11-class DVS gesture datasets and maintains 99.48% on a novel 22-class time-reversed benchmark, outperforming RNN, LSTM, and 2+1D CNN baselines sharing identical backbones. For autonomous driving (DDD-20), IPNet reduces steering prediction error by 14.4% compared to ResNet-LSTM. Architecturally, we identify a 'Memory-at-the-Frontier' effect where performance is maximized at the sensing interface, validating a bio-plausible near-sensor processing paradigm. Crucially, all results rely on raw parameters from fabricated devices without optimization. Hardware-in-the-loop validation confirms the system's physical realizability. Separately, energy analysis reveals a reduction in memory power of 2,874x compared to LSTMs and 90,920x versus parallel 3D-CNNs. This capacitor-free design enables a compact ~1.5um2 footprint (28 nm CMOS): a 20-fold reduction over standard LIF neurons. Ultimately, we demonstrate that instantiating human-like working memory via intrinsic neuronal plasticity endows neural networks with the dual biological advantages of superior dynamic vision processing and minimal metabolic cost.
123. 【2512.15774】wo-Step Data Augmentation for Masked Face Detection and Recognition: Turning Fake Masks to Real
链接:https://arxiv.org/abs/2512.15774
作者:Yan Yang,George Bebis,Mircea Nicolescu
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:distribution shift pose, shift pose major, pose major challenges, masked face detection, scarcity and distribution
备注: 9 pages, 9 figures. Conference version
点击查看摘要
Abstract:Data scarcity and distribution shift pose major challenges for masked face detection and recognition. We propose a two-step generative data augmentation framework that combines rule-based mask warping with unpaired image-to-image translation using GANs, enabling the generation of realistic masked-face samples beyond purely synthetic transformations. Compared to rule-based warping alone, the proposed approach yields consistent qualitative improvements and complements existing GAN-based masked face generation methods such as IAMGAN. We introduce a non-mask preservation loss and stochastic noise injection to stabilize training and enhance sample diversity. Experimental observations highlight the effectiveness of the proposed components and suggest directions for future improvements in data-centric augmentation for face recognition tasks.
124. 【2512.15748】Surely Large Multimodal Models (Don't) Excel in Visual Species Recognition?
链接:https://arxiv.org/abs/2512.15748
作者:Tian Liu,Anwesha Basu,James Caverlee,Shu Kong
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:Visual Species Recognition, Species Recognition, Visual Species, evolution research, assessment and conservation
备注: website and code: [this https URL](https://tian1327.github.io/POC)
点击查看摘要
Abstract:Visual Species Recognition (VSR) is pivotal to biodiversity assessment and conservation, evolution research, and ecology and ecosystem management. Training a machine-learned model for VSR typically requires vast amounts of annotated images. Yet, species-level annotation demands domain expertise, making it realistic for domain experts to annotate only a few examples. These limited labeled data motivate training an ''expert'' model via few-shot learning (FSL). Meanwhile, advanced Large Multimodal Models (LMMs) have demonstrated prominent performance on general recognition tasks. It is straightforward to ask whether LMMs excel in the highly specialized VSR task and whether they outshine FSL expert models. Somewhat surprisingly, we find that LMMs struggle in this task, despite using various established prompting techniques. LMMs even significantly underperform FSL expert models, which are as simple as finetuning a pretrained visual encoder on the few-shot images. However, our in-depth analysis reveals that LMMs can effectively post-hoc correct the expert models' incorrect predictions. Briefly, given a test image, when prompted with the top predictions from an FSL expert model, LMMs can recover the ground-truth label. Building on this insight, we derive a simple method called Post-hoc Correction (POC), which prompts an LMM to re-rank the expert model's top predictions using enriched prompts that include softmax confidence scores and few-shot visual examples. Across five challenging VSR benchmarks, POC outperforms prior art of FSL by +6.4% in accuracy without extra training, validation, or manual intervention. Importantly, POC generalizes to different pretrained backbones and LMMs, serving as a plug-and-play module to significantly enhance existing FSL methods.
125. 【2512.15747】D3G: Diverse Demographic Data Generation Increases Zero-Shot Image Classification Accuracy within Multimodal Models
链接:https://arxiv.org/abs/2512.15747
作者:Javon Hickmon
类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
关键词:human-level image understanding, achieve human-level image, Image classification, essential for machine, machine perception
备注:
点击查看摘要
Abstract:Image classification is a task essential for machine perception to achieve human-level image understanding. Multimodal models such as CLIP have been able to perform well on this task by learning semantic similarities across vision and language; however, despite these advances, image classification is still a challenging task. Models with low capacity often suffer from underfitting and thus underperform on fine-grained image classification. Along with this, it is important to ensure high-quality data with rich cross-modal representations of each class, which is often difficult to generate. When datasets do not enforce balanced demographics, the predictions will be biased toward the more represented class, while others will be neglected. We focus on how these issues can lead to harmful bias for zero-shot image classification, and explore how to combat these issues in demographic bias. We propose Diverse Demographic Data Generation (D3G), a training-free, zero-shot method of boosting classification accuracy while reducing demographic bias in pre-trained multimodal models. With this method, we utilize CLIP as our base multimodal model and Stable Diffusion XL as our generative model. We demonstrate that providing diverse demographic data at inference time improves performance for these models, and explore the impact of individual demographics on the resulting accuracy metric.
126. 【2511.00456】Weakly Supervised Pneumonia Localization from Chest X-Rays Using Deep Neural Network and Grad-CAM Explanations
链接:https://arxiv.org/abs/2511.00456
作者:Kiran Shahi,Anup Bagale
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Chest X-ray imaging, Chest X-ray, typically requires detailed, Class Activation Mapping, Gradient-weighted Class Activation
备注: [this https URL](https://github.com/kiranshahi/pneumonia-analysis)
点击查看摘要
Abstract:Chest X-ray imaging is commonly used to diagnose pneumonia, but accurately localizing the pneumonia-affected regions typically requires detailed pixel-level annotations, which are costly and time consuming to obtain. To address this limitation, this study proposes a weakly supervised deep learning framework for pneumonia classification and localization using Gradient-weighted Class Activation Mapping (Grad-CAM). Instead of relying on costly pixel-level annotations, the proposed method utilizes image-level labels to generate clinically meaningful heatmaps that highlight pneumonia-affected regions. Furthermore, we evaluate seven pre-trained deep learning models, including a Vision Transformer, under identical training conditions, using focal loss and patient-wise splits to prevent data leakage. Experimental results suggest that all models achieved high classification accuracy (96--98\%), with ResNet-18 and EfficientNet-B0 showing the best overall performance and MobileNet-V3 providing an efficient lightweight alternative. Grad-CAM heatmap visualizations confirm that the proposed methods focus on clinically relevant lung regions, supporting the use of explainable AI for radiological diagnostics. Overall, this work highlights the potential of weakly supervised, explainable models that enhance transparency and clinical trust in AI-assisted pneumonia screening.
127. 【2512.16085】Machine Learning Enabled Graph Analysis of Particulate Composites: Application to Solid-state Battery Cathodes
链接:https://arxiv.org/abs/2512.16085
作者:Zebin Li,Shimao Deng,Yijin Liu,Jia-Mian Hu
类目:Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV)
关键词:influence system performance, strongly influence system, inter-particle connections strongly, connections strongly influence, system performance
备注:
点击查看摘要
Abstract:Particulate composites underpin many solid-state chemical and electrochemical systems, where microstructural features such as multiphase boundaries and inter-particle connections strongly influence system performance. Advances in X-ray microscopy enable capturing large-scale, multimodal images of these complex microstructures with an unprecedentedly high throughput. However, harnessing these datasets to discover new physical insights and guide microstructure optimization remains a major challenge. Here, we develop a machine learning (ML) enabled framework that enables automated transformation of experimental multimodal X-ray images of multiphase particulate composites into scalable, topology-aware graphs for extracting physical insights and establishing local microstructure-property relationships at both the particle and network level. Using the multiphase particulate cathode of solid-state lithium batteries as an example, our ML-enabled graph analysis corroborates the critical role of triple phase junctions and concurrent ion/electron conduction channels in realizing desirable local electrochemical activity. Our work establishes graph-based microstructure representation as a powerful paradigm for bridging multimodal experimental imaging and functional understanding, and facilitating microstructure-aware data-driven materials design in a broad range of particulate composites.
128. 【2512.15947】MCR-VQGAN: A Scalable and Cost-Effective Tau PET Synthesis Approach for Alzheimer's Disease Imaging
链接:https://arxiv.org/abs/2512.15947
作者:Jin Young Kim,Jeremy Hudson,Jeongchul Kim,Qing Lyu,Christopher T. Whitlow
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:positron emission tomography, quantifies neurofibrillary tangles, critical diagnostic modality, Tau positron emission, Generative Adversarial Network
备注: 12 pages, 5 figures. A preliminary version of this work was presented at RSNA 2025
点击查看摘要
Abstract:Tau positron emission tomography (PET) is a critical diagnostic modality for Alzheimer's disease (AD) because it visualizes and quantifies neurofibrillary tangles, a hallmark of AD pathology. However, its widespread clinical adoption is hindered by significant challenges, such as radiation exposure, limited availability, high clinical workload, and substantial financial costs. To overcome these limitations, we propose Multi-scale CBAM Residual Vector Quantized Generative Adversarial Network (MCR-VQGAN) to synthesize high-fidelity tau PET images from structural T1-weighted MRI scans. MCR-VQGAN improves standard VQGAN by integrating three key architectural enhancements: multi-scale convolutions, ResNet blocks, and Convolutional Block Attention Modules (CBAM). Using 222 paired structural T1-weighted MRI and tau PET scans from Alzheimer's Disease Neuroimaging Initiative (ADNI), we trained and compared MCR-VQGAN with cGAN, WGAN-GP, CycleGAN, and VQGAN. Our proposed model achieved superior image synthesis performance across all metrics: MSE of 0.0056 +/- 0.0061, PSNR of 24.39 +/- 4.49 dB, and SSIM of 0.9000 +/- 0.0453. To assess the clinical utility of the synthetic images, we trained and evaluated a CNN-based AD classifier. The classifier achieved comparable accuracy when tested on real (63.64%) and synthetic (65.91%) images. This result indicates that our synthesis process successfully preserves diagnostically relevant features without significant information loss. Our results demonstrate that MCR-VQGAN can offer a reliable and scalable surrogate for conventional tau PET imaging, potentially improving the accessibility and scalability of tau imaging biomarkers for AD research and clinical workflows.
129. 【2512.15921】In search of truth: Evaluating concordance of AI-based anatomy segmentation models
链接:https://arxiv.org/abs/2512.15921
作者:Lena Giebeler,Deepa Krishnaswamy,David Clunie,Jakob Wasserthal,Lalith Kumar Shiyam Sundar,Andres Diaz-Pinto,Klaus H. Maier-Hein,Murong Xu,Bjoern Menze,Steve Pieper,Ron Kikinis,Andrey Fedorov
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:Purpose AI-based methods, Purpose AI-based, large imaging datasets, AI-based methods, methods for anatomy
备注:
点击查看摘要
Abstract:Purpose AI-based methods for anatomy segmentation can help automate characterization of large imaging datasets. The growing number of similar in functionality models raises the challenge of evaluating them on datasets that do not contain ground truth annotations. We introduce a practical framework to assist in this task. Approach We harmonize the segmentation results into a standard, interoperable representation, which enables consistent, terminology-based labeling of the structures. We extend 3D Slicer to streamline loading and comparison of these harmonized segmentations, and demonstrate how standard representation simplifies review of the results using interactive summary plots and browser-based visualization using OHIF Viewer. To demonstrate the utility of the approach we apply it to evaluating segmentation of 31 anatomical structures (lungs, vertebrae, ribs, and heart) by six open-source models - TotalSegmentator 1.5 and 2.6, Auto3DSeg, MOOSE, MultiTalent, and CADS - for a sample of Computed Tomography (CT) scans from the publicly available National Lung Screening Trial (NLST) dataset. Results We demonstrate the utility of the framework in enabling automating loading, structure-wise inspection and comparison across models. Preliminary results ascertain practical utility of the approach in allowing quick detection and review of problematic results. The comparison shows excellent agreement segmenting some (e.g., lung) but not all structures (e.g., some models produce invalid vertebrae or rib segmentations). Conclusions The resources developed are linked from this https URL including segmentation harmonization scripts, summary plots, and visualization tools. This work assists in model evaluation in absence of ground truth, ultimately enabling informed model selection.
130. 【2512.15820】BioimageAIpub: a toolbox for AI-ready bioimaging data publishing
链接:https://arxiv.org/abs/2512.15820
作者:Stefan Dvoretskii,Anwai Archit,Constantin Pape,Josh Moore,Marco Nolden
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:Modern bioimage analysis, imaging facilities, Modern bioimage, bioimage analysis approaches, Image Data Resource
备注:
点击查看摘要
Abstract:Modern bioimage analysis approaches are data hungry, making it necessary for researchers to scavenge data beyond those collected within their (bio)imaging facilities. In addition to scale, bioimaging datasets must be accompanied with suitable, high-quality annotations and metadata. Although established data repositories such as the Image Data Resource (IDR) and BioImage Archive offer rich metadata, their contents typically cannot be directly consumed by image analysis tools without substantial data wrangling. Such a tedious assembly and conversion of (meta)data can account for a dedicated amount of time investment for researchers, hindering the development of more powerful analysis tools. Here, we introduce BioimageAIpub, a workflow that streamlines bioimaging data conversion, enabling a seamless upload to HuggingFace, a widely used platform for sharing machine learning datasets and models.
131. 【2512.15808】Foundation Models in Biomedical Imaging: Turning Hype into Reality
链接:https://arxiv.org/abs/2512.15808
作者:Amgad Muneer,Kai Zhang,Ibraheem Hamdi,Rizwan Qureshi,Muhammad Waqas,Shereen Fouad,Hazrat Ali,Syed Muhammad Anwar,Jia Wu
类目:Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Foundation models, driving a prominent, prominent shift, shift in artificial, artificial intelligence
备注: 5 figures and 3 tables
点击查看摘要
Abstract:Foundation models (FMs) are driving a prominent shift in artificial intelligence across different domains, including biomedical imaging. These models are designed to move beyond narrow pattern recognition towards emulating sophisticated clinical reasoning, understanding complex spatial relationships, and integrating multimodal data with unprecedented flexibility. However, a critical gap exists between this potential and the current reality, where the clinical evaluation and deployment of FMs are hampered by significant challenges. Herein, we critically assess the current state-of-the-art, analyzing hype by examining the core capabilities and limitations of FMs in the biomedical domain. We also provide a taxonomy of reasoning, ranging from emulated sequential logic and spatial understanding to the integration of explicit symbolic knowledge, to evaluate whether these models exhibit genuine cognition or merely mimic surface-level patterns. We argue that a critical frontier lies beyond statistical correlation, in the pursuit of causal inference, which is essential for building robust models that understand cause and effect. Furthermore, we discuss the paramount issues in deployment stemming from trustworthiness, bias, and safety, dissecting the challenges of algorithmic bias, data bias and privacy, and model hallucinations. We also draw attention to the need for more inclusive, rigorous, and clinically relevant validation frameworks to ensure their safe and ethical application. We conclude that while the vision of autonomous AI-doctors remains distant, the immediate reality is the emergence of powerful technology and assistive tools that would benefit clinical practice. The future of FMs in biomedical imaging hinges not on scale alone, but on developing hybrid, causally aware, and verifiably safe systems that augment, rather than replace, human expertise.




