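This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

493 papers were updated today, including:

  • Natural language processing: 49
  • Information retrieval: 7
  • Computer vision: 119

Natural Language Processing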

1. 【2512.10949】Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation

Link: https://arxiv.org/abs/2512.10949

Authors: Yiwen Tang, Zoey Guo, Kaixin Zhu, Ray Zhang, Qizhi Chen, Dongzhi Jiang, Junli Liu, Bohan Zeng, Haoming Song, Delin Qu, Tianyi Bai, Dan Xu, Wentao Zhang, Bin Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Reinforcement learning, image generation recently, earlier proven, extended to enhance, effective in large

Comments: Code is released at [this https URL](https://github.com/Ivan-Tang-3D/3DGen-R1)

Abstract:Reinforcement learning (RL), earlier proven to be effective in large language and multi-modal models, has been successfully extended to enhance 2D image generation recently. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation significantly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial, and that general multi-modal models provide robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate the scaling of training data and iterations. (3) Text-to-3D Benchmarks: Since existing benchmarks fail to measure implicit reasoning abilities in 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes the global-to-local hierarchical 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, expert from coarse shape to texture refinement. We hope this study provides insights into RL-driven reasoning for 3D generation. Code is released at this https URL.
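
For readers unfamiliar with GRPO, the group-relative step it is named after can be sketched as follows: rewards for several samples drawn from the same prompt are normalized against that group's mean and standard deviation. This is a generic illustration only, not the paper's Hi-GRPO or its reward ensembles. The function name, the use of the population standard deviation, and the reward values are assumptions made for the example.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four 3D samples generated from one text prompt.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Samples that score above the group mean receive a positive advantage and the rest a negative one, so no separate value network is needed for the baseline.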

2. 【2512.10938】Stronger Normalization-Free Transformers

Link: https://arxiv.org/abs/2512.10938

Authors: Mingzhi Chen, Taiming Lu, Jiachen Zhu, Mingjie Sun, Zhuang Liu

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Dynamic Tanh, introduction of Dynamic, deep learning architectures, normalization layers, layers have long

Abstract:Although normalization layers have long been viewed as indispensable components of deep learning architectures, the recent introduction of Dynamic Tanh (DyT) has demonstrated that alternatives are possible. The point-wise function DyT constrains extreme values for stable convergence and reaches normalization-level performance; this work seeks further for function designs that can surpass it. We first study how the intrinsic properties of point-wise functions influence training and performance. Building on these findings, we conduct a large-scale search for a more effective function design. Through this exploration, we introduce $\mathrm{Derf}(x) = \mathrm{erf}(\alpha x + s)$, where $\mathrm{erf}(x)$ is the rescaled Gaussian cumulative distribution function, and identify it as the most performant design. Derf outperforms LayerNorm, RMSNorm, and DyT across a wide range of domains, including vision (image recognition and generation), speech representation, and DNA sequence modeling. Our findings suggest that the performance gains of Derf largely stem from its improved generalization rather than stronger fitting capacity. Its simplicity and stronger performance make Derf a practical choice for normalization-free Transformer architectures.
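
The formula in the abstract is simple enough to try directly. The sketch below evaluates $\mathrm{erf}(\alpha x + s)$ on scalars with the standard library. Treating `alpha` and `s` as plain floats is an assumption (in a model they would presumably be learnable), and the `dyt` helper is a simplified tanh(alpha * x) form that leaves out any affine parameters.

```python
import math

def derf(x, alpha=1.0, s=0.0):
    """Derf(x) = erf(alpha * x + s), applied point-wise to a scalar."""
    return math.erf(alpha * x + s)

def dyt(x, alpha=1.0):
    """Simplified Dynamic Tanh, tanh(alpha * x), shown for comparison."""
    return math.tanh(alpha * x)

# Both functions squash extreme activations into the range [-1, 1].
samples = [-100.0, -1.0, 0.0, 1.0, 100.0]
derf_out = [derf(x, alpha=0.5) for x in samples]
dyt_out = [dyt(x, alpha=0.5) for x in samples]
```

Because the output is bounded, very large activations saturate near -1 or 1, which is the "constrains extreme values" behavior the abstract attributes to this family of functions.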

3. 【2512.10931】Asynchronous Reasoning: Training-Free Interactive Thinking LLMs

Link: https://arxiv.org/abs/2512.10931

Authors: George Yakushev, Nataliia Babina, Masoud Vahid Dastgerdi, Vyacheslav Zhdanovskiy, Alina Shutova, Denis Kuznedelev

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: improve language model, language model capabilities, LLMs, real time

Comments: Preprint, work in progress

Abstract:Many state-of-the-art LLMs are trained to think before giving their answer. Reasoning can greatly improve language model capabilities and safety, but it also makes them less interactive: given a new input, a model must stop thinking before it can respond. Real-world use cases such as voice-based or embedded assistants require an LLM agent to respond and adapt to additional information in real time, which is incompatible with sequential interactions. In contrast, humans can listen, think, and act asynchronously: we begin thinking about the problem while reading it and continue thinking while formulating the answer. In this work, we augment LLMs capable of reasoning to operate in a similar way without additional training. Our method uses the properties of rotary embeddings to enable LLMs built for sequential interactions to simultaneously think, listen, and generate outputs. We evaluate our approach on math, commonsense, and safety reasoning and find that it can generate accurate thinking-augmented answers in real time, reducing time to first non-thinking token from minutes to ≤ 5s and the overall real-time delays by 6-11x.

4. 【2512.10918】CompanionCast: A Multi-Agent Conversational AI Framework with Spatial Audio for Social Co-Viewing Experiences

Link: https://arxiv.org/abs/2512.10918

Authors: Yiyang Wang, Chen Chen, Tica Lin, Vishnu Raj, Josh Kimball, Alex Cabral, Josiah Hester

Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)

Keywords: modern media consumption, increasingly solitary, modern media, media consumption, consumption is increasingly

Comments: 11 pages

Abstract:Social presence is central to the enjoyment of watching content together, yet modern media consumption is increasingly solitary. We investigate whether multi-agent conversational AI systems can recreate the dynamics of shared viewing experiences across diverse content types. We present CompanionCast, a general framework for orchestrating multiple role-specialized AI agents that respond to video content using multimodal inputs, speech synthesis, and spatial audio. Distinctly, CompanionCast integrates an LLM-as-a-Judge module that iteratively scores and refines conversations across five dimensions (relevance, authenticity, engagement, diversity, personality consistency). We validate this framework through sports viewing, a domain with rich dynamics and strong social traditions, where a pilot study with soccer fans suggests that multi-agent interaction improves perceived social presence compared to solo viewing. We contribute: (1) a generalizable framework for orchestrating multi-agent conversations around multimodal video content, (2) a novel evaluator-agent pipeline for conversation quality control, and (3) exploratory evidence of increased social presence in AI-mediated co-viewing. We discuss challenges and future directions for applying this approach to diverse viewing contexts including entertainment, education, and collaborative watching experiences.

5. 【2512.10882】Computational emotion analysis with multimodal LLMs: Current evidence on an emerging methodological opportunity

Link: https://arxiv.org/abs/2512.10882

Authors: Hauke Licht

Categories: Computation and Language (cs.CL)

Keywords: long tradition, central to politics, politics and analyzing, analyzing their role, role in political

Abstract:Emotions are central to politics, and analyzing their role in political communication has a long tradition. As research increasingly leverages audio-visual materials to analyze the display of emotions, the emergence of multimodal generative AI promises great advances. However, we lack evidence about the effectiveness of multimodal AI in emotion analysis. This paper addresses this gap by evaluating current multimodal large language models (mLLMs) in video-based analysis of emotional arousal in two complementary data sets of human-labeled video recordings. I find that under ideal circumstances, mLLMs' emotional arousal ratings are highly reliable and show little to no indication of demographic bias. However, in recordings of speakers in real-world parliamentary debates, mLLMs' arousal ratings fail to deliver on this promise, with potential negative consequences for downstream statistical inferences. This study therefore underscores the need for continued, thorough evaluation of emerging generative AI methods in political analysis and contributes a suitable replicable framework.

6. 【2512.10865】Quantifying Emotional Tone in Tolkien's The Hobbit: Dialogue Sentiment Analysis with RegEx, NRC-VAD, and Python

Link: https://arxiv.org/abs/2512.10865

Authors: Lilin Qiu

Categories: Computation and Language (cs.CL)

Keywords: study analyzes, emotional, Hobbit, computational text analysis

Abstract:This study analyzes the emotional tone of dialogue in J. R. R. Tolkien's The Hobbit (1937) using computational text analysis. Dialogue was extracted with regular expressions, then preprocessed, and scored using the NRC-VAD lexicon to quantify emotional dimensions. The results show that the dialogue maintains a generally positive (high valence) and calm (low arousal) tone, with a gradually increasing sense of agency (dominance) as the story progresses. These patterns reflect the novel's emotional rhythm: moments of danger and excitement are regularly balanced by humor, camaraderie, and relief. Visualizations -- including emotional trajectory graphs and word clouds -- highlight how Tolkien's language cycles between tension and comfort. By combining computational tools with literary interpretation, this study demonstrates how digital methods can uncover subtle emotional structures in literature, revealing the steady rhythm and emotional modulation that shape the storytelling in The Hobbit.
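
The pipeline described above (regular-expression extraction followed by lexicon scoring) can be sketched in a few lines. Everything specific here is an assumption: the quote pattern only handles straight double quotes, the sample sentence is invented, and the lexicon values are placeholders rather than real NRC-VAD entries.

```python
import re

# Hypothetical lexicon entries: (valence, arousal, dominance) in [0, 1].
# These numbers are placeholders, not real NRC-VAD values.
TOY_VAD = {
    "good": (0.9, 0.4, 0.6),
    "morning": (0.7, 0.3, 0.5),
    "danger": (0.1, 0.9, 0.3),
}

def extract_dialogue(text):
    """Return the spans enclosed in straight double quotes."""
    return re.findall(r'"([^"]+)"', text)

def score_vad(utterance, lexicon=TOY_VAD):
    """Average the V, A, D values of the lexicon words in an utterance."""
    words = re.findall(r"[a-z']+", utterance.lower())
    hits = [lexicon[w] for w in words if w in lexicon]
    if not hits:
        return None  # no lexicon coverage for this utterance
    return tuple(sum(dim) / len(hits) for dim in zip(*hits))

# Invented sample text, used only to exercise the two functions.
sample = 'He said, "Good morning!" and she replied, "There is danger ahead."'
quotes = extract_dialogue(sample)
scores = [score_vad(q) for q in quotes]
```

Averaging such per-utterance scores over chapters would give the kind of emotional trajectory the abstract refers to, although the study's actual preprocessing steps are not specified here.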

7. 【2512.10793】LabelFusion: Learning to Fuse LLMs and Transformer Classifiers for Robust Text Classification

Link: https://arxiv.org/abs/2512.10793

Authors: Michael Schlee, Christoph Weisser, Timo Kivimäki, Melchizedek Mashiku, Benjamin Saefken

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Google Gemini, Language Models, Large Language, OpenAI GPT

Abstract:LabelFusion is a fusion ensemble for text classification that learns to combine a traditional transformer-based classifier (e.g., RoBERTa) with one or more Large Language Models (LLMs such as OpenAI GPT, Google Gemini, or DeepSeek) to deliver accurate and cost-aware predictions across multi-class and multi-label tasks. The package provides a simple high-level interface (AutoFusionClassifier) that trains the full pipeline end-to-end with minimal configuration, and a flexible API for advanced users. Under the hood, LabelFusion integrates vector signals from both sources by concatenating the ML backbone's embeddings with the LLM-derived per-class scores -- obtained through structured prompt-engineering strategies -- and feeds this joint representation into a compact multi-layer perceptron (FusionMLP) that produces the final prediction. This learned fusion approach captures complementary strengths of LLM reasoning and traditional transformer-based classifiers, yielding robust performance across domains -- achieving 92.4% accuracy on AG News and 92.3% on 10-class Reuters 21578 topic classification -- while enabling practical trade-offs between accuracy, latency, and cost.
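
The fusion step described above (concatenate the classifier embedding with per-class LLM scores, then pass the result through a small MLP) might look roughly like the following. This is not the LabelFusion package's API: all function names, dimensions, and the random untrained weights are hypothetical, and the sketch uses plain Python lists to stay dependency-free.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def linear(v, weights, bias):
    """Dense layer: one weight row per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]

def fusion_forward(embedding, llm_scores, params):
    """Concatenate both signals, then apply a two-layer MLP."""
    joint = list(embedding) + list(llm_scores)
    hidden = relu(linear(joint, params["w1"], params["b1"]))
    return linear(hidden, params["w2"], params["b2"])  # per-class logits

def random_params(n_in, n_hidden, n_classes, seed=0):
    """Untrained placeholder weights; a real model would learn these."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {"w1": mat(n_hidden, n_in), "b1": [0.0] * n_hidden,
            "w2": mat(n_classes, n_hidden), "b2": [0.0] * n_classes}

embedding = [0.3, -1.2, 0.8, 0.05]   # hypothetical 4-dim encoder embedding
llm_scores = [0.7, 0.2, 0.1]         # hypothetical per-class LLM scores
params = random_params(n_in=7, n_hidden=5, n_classes=3)
logits = fusion_forward(embedding, llm_scores, params)
predicted = max(range(len(logits)), key=logits.__getitem__)
```

The input width of the MLP is the embedding size plus the number of classes, which is why `n_in` is 7 in this toy setup.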

8. 【2512.10791】The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

Link: https://arxiv.org/abs/2512.10791

Authors: Aileen Cheng, Alon Jacovi, Amir Globerson, Ben Golan, Charles Kwong, Chris Alberti, Connie Tao, Eyal Ben-David, Gaurav Singh Tomar, Lukas Haas, Yonatan Bitton, Adam Bloniarz, Aijun Bai, Andrew Wang, Anfal Siddiqui, Arturo Bajuelos Castillo, Aviel Atias, Chang Liu, Corey Fry, Daniel Balle, Deepanway Ghosal, Doron Kukliansky, Dror Marcus, Elena Gribovskaya, Eran Ofek, Honglei Zhuang, Itay Laish, Jan Ackermann, Lily Wang, Meg Risdal, Megan Barnes, Michael Fink, Mohamed Amin, Moran Ambar, Natan Potikha, Nikita Gupta, Nitzan Katz, Noam Velan, Ofir Roval, Ori Ram, Polina Zablotskaia, Prathamesh Bang, Priyanka Agrawal, Rakesh Ghiya, Sanjay Ganapathy, Simon Baumgartner, Sofia Erell, Sushant Prakash, Thibault Sellam, Vikram Rao, Xuanhui Wang, Yaroslav Akulov, Yulong Yang, Zhen Yang, Zhixin Lai, Zhongru Wu, Anca Dragan, Avinatan Hassidim, Fernando Pereira, Slav Petrov, Srinivasan Venkatachary, Tulsee Doshi, Yossi Matias, Sasha Goldshtein, Dipanjan Das

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: generate factually accurate, factually accurate text, FACTS Leaderboard Suite, online leaderboard suite, FACTS Leaderboard

Abstract:We introduce The FACTS Leaderboard, an online leaderboard suite and associated set of benchmarks that comprehensively evaluates the ability of language models to generate factually accurate text across diverse scenarios. The suite provides a holistic measure of factuality by aggregating the performance of models on four distinct sub-leaderboards: (1) FACTS Multimodal, which measures the factuality of responses to image-based questions; (2) FACTS Parametric, which assesses models' world knowledge by answering closed-book factoid questions from internal parameters; (3) FACTS Search, which evaluates factuality in information-seeking scenarios, where the model must use a search API; and (4) FACTS Grounding (v2), which evaluates whether long-form responses are grounded in provided documents, featuring significantly improved judge models. Each sub-leaderboard employs automated judge models to score model responses, and the final suite score is an average of the four components, designed to provide a robust and balanced assessment of a model's overall factuality. The FACTS Leaderboard Suite will be actively maintained, containing both public and private splits to allow for external participation while guarding its integrity. It can be found at this https URL .
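
As a small worked example of the aggregation described above, the suite score can be computed as a mean over the four sub-leaderboards. The abstract only says "average", so the unweighted mean is an assumption, as are the dictionary keys and the placeholder scores.

```python
from statistics import mean

def facts_suite_score(sub_scores):
    """Unweighted mean of the four sub-leaderboard scores (an assumption)."""
    expected = {"multimodal", "parametric", "search", "grounding"}
    if set(sub_scores) != expected:
        raise ValueError(f"expected keys {sorted(expected)}")
    return mean(list(sub_scores.values()))

# Placeholder scores for illustration only, not real leaderboard results.
example = {"multimodal": 50.0, "parametric": 60.0, "search": 70.0, "grounding": 80.0}
suite = facts_suite_score(example)
```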

9. 【2512.10787】Replace, Don't Expand: Mitigating Context Dilution in Multi-Hop RAG via Fixed-Budget Evidence Assembly

Link: https://arxiv.org/abs/2512.10787

Authors: Moshe Lahmy, Roi Yozevitch

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Retrieval-Augmented Generation, initial retrieval misses, systems often fail, bridge fact

Comments: 24 pages, 2 figures

Abstract:Retrieval-Augmented Generation (RAG) systems often fail on multi-hop queries when the initial retrieval misses a bridge fact. Prior corrective approaches, such as Self-RAG, CRAG, and Adaptive-$k$, typically address this by adding more context or pruning existing lists. However, simply expanding the context window often leads to context dilution, where distractors crowd out relevant information. We propose SEAL-RAG, a training-free controller that adopts a "replace, don't expand" strategy to fight context dilution under a fixed retrieval depth $k$. SEAL executes a (Search $\rightarrow$ Extract $\rightarrow$ Assess $\rightarrow$ Loop) cycle: it performs on-the-fly, entity-anchored extraction to build a live gap specification (missing entities/relations), triggers targeted micro-queries, and uses entity-first ranking to actively swap out distractors for gap-closing evidence. We evaluate SEAL-RAG against faithful re-implementations of Basic RAG, CRAG, Self-RAG, and Adaptive-$k$ in a shared environment on HotpotQA and 2WikiMultiHopQA. On HotpotQA ($k=3$), SEAL improves answer correctness by +3-13 pp and evidence precision by +12-18 pp over Self-RAG. On 2WikiMultiHopQA ($k=5$), it outperforms Adaptive-$k$ by +8.0 pp in accuracy and maintains 96% evidence precision compared to 22% for CRAG. These gains are statistically significant ($p<0.001$). By enforcing fixed-$k$ replacement, SEAL yields a predictable cost profile while ensuring the top-$k$ slots are optimized for precision rather than mere breadth. We release our code and data at this https URL.
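
The fixed-budget idea can be illustrated with a toy slot manager: the context always holds exactly $k$ passages, and a new passage only enters by displacing the weakest one. This is a minimal sketch under stated assumptions, not the authors' controller. The scalar relevance score stands in for the entity-first ranking mentioned in the abstract, and all passage names and scores are invented.

```python
def replace_dont_expand(slots, candidates, k):
    """Keep exactly k passages, swapping weak ones for stronger candidates.

    slots and candidates are lists of (passage_id, relevance) pairs; the
    relevance score is a stand-in for whatever ranking the controller uses.
    """
    kept = sorted(slots, key=lambda p: p[1], reverse=True)[:k]
    for cand in sorted(candidates, key=lambda p: p[1], reverse=True):
        if len(kept) < k:
            kept.append(cand)  # fill a free slot
            continue
        weakest = min(kept, key=lambda p: p[1])
        if cand[1] > weakest[1]:
            kept[kept.index(weakest)] = cand  # swap, context size unchanged
    return kept

# Hypothetical first retrieval (k = 3) followed by one micro-query round.
initial = [("doc_a", 0.9), ("distractor_1", 0.2), ("distractor_2", 0.1)]
micro_query_hits = [("bridge_fact", 0.8), ("noise", 0.05)]
context = replace_dont_expand(initial, micro_query_hits, k=3)
```

Since the context never grows beyond `k` entries, the prompt length and cost per loop iteration stay predictable, which matches the cost argument made in the abstract.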

10. 【2512.10780】Script Gap: Evaluating LLM Triage on Indian Languages in Native vs Roman Scripts in a Real World Setting

Link: https://arxiv.org/abs/2512.10780

Authors: Manurag Khullar, Utkarsh Desai, Poorva Malviya, Aman Dalmia, Zheyuan Ryan Shi

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Language Models, Large Language, high-stakes clinical applications, Indian languages, Indian languages frequently

Abstract:Large Language Models (LLMs) are increasingly deployed in high-stakes clinical applications in India. In many such settings, speakers of Indian languages frequently communicate using romanized text rather than native scripts, yet existing research rarely evaluates this orthographic variation using real-world data. We investigate how romanization impacts the reliability of LLMs in a critical domain: maternal and newborn healthcare triage. We benchmark leading LLMs on a real-world dataset of user-generated queries spanning five Indian languages and Nepali. Our results reveal consistent degradation in performance for romanized messages, with F1 scores trailing those of native scripts by 5-12 points. At our partner maternal health organization in India, this gap could cause nearly 2 million excess errors in triage. Crucially, this performance gap by scripts is not due to a failure in clinical reasoning. We demonstrate that LLMs often correctly infer the semantic intent of romanized queries. Nevertheless, their final classification outputs remain brittle in the presence of orthographic noise in romanized inputs. Our findings highlight a critical safety blind spot in LLM-based health systems: models that appear to understand romanized input may still fail to act on it reliably.

11. 【2512.10772】Grow Up and Merge: Scaling Strategies for Efficient Language Adaptation

Link: https://arxiv.org/abs/2512.10772

Authors: Kevin Glocker, Kätriin Kukk, Romina Oji, Marcel Bollmann, Marco Kuhlmann, Jenny Kunz

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Achieving high-performing language, Achieving high-performing, include medium, high-performing language models, lower-resource languages remains

Abstract:Achieving high-performing language models which include medium- and lower-resource languages remains a challenge. Massively multilingual models still underperform compared to language-specific adaptations, especially at smaller model scales. In this work, we investigate scaling as an efficient strategy for adapting pretrained models to new target languages. Through comprehensive scaling ablations with approximately FLOP-matched models, we test whether upscaling an English base model enables more effective and resource-efficient adaptation than standard continued pretraining. We find that, once exposed to sufficient target-language data, larger upscaled models can match or surpass the performance of smaller models continually pretrained on much more data, demonstrating the benefits of scaling for data efficiency. Scaling also helps preserve the base model's capabilities in English, thus reducing catastrophic forgetting. Finally, we explore whether such scaled, language-specific models can be merged to construct modular and flexible multilingual systems. We find that while merging remains less effective than joint multilingual training, upscaled merges perform better than smaller ones. We observe large performance differences across merging methods, suggesting potential for improvement through merging approaches specialized for language-level integration.

12. 【2512.10756】OPV: Outcome-based Process Verifier for Efficient Long Chain-of-Thought Verification

Link: https://arxiv.org/abs/2512.10756

Authors: Zijian Wu, Lingkai Kong, Wenwei Zhang, Songyang Gao, Yuzhe Gu, Zhongrui Cai, Tianyou Ma, Yuhong Liu, Zhi Wang, Runyuan Ma, Guangyu Wang, Wei Li, Conghui He, Dahua Lin, Kai Chen

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Verifiable Rewards, achieved significant progress, Large language models, Large language, tasks by Reinforcement

Abstract:Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the oversight automated by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in the long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulties in reliably detecting errors in the complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive costs of human annotations. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV with fewer annotation costs. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic datasets, closely aligning with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.
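
The selection step of such an active learning loop can be sketched with a common uncertainty-sampling heuristic. The abstract does not say how uncertainty is measured, so using the distance of the verifier's predicted probability from 0.5 is an assumption, and the function name, field names, and pool values are all hypothetical.

```python
def most_uncertain(cases, budget):
    """Pick the cases whose predicted probability is closest to 0.5."""
    ranked = sorted(cases, key=lambda c: abs(c["p_correct"] - 0.5))
    return ranked[:budget]

# Hypothetical verifier outputs: probability that each solution is correct.
pool = [
    {"id": "s1", "p_correct": 0.97},
    {"id": "s2", "p_correct": 0.52},
    {"id": "s3", "p_correct": 0.10},
    {"id": "s4", "p_correct": 0.45},
]
to_annotate = most_uncertain(pool, budget=2)
```

In each round, only the selected cases would go to expert annotators, which is how an approach of this kind keeps annotation costs down.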

13. 【2512.10741】TRIDENT: A Redundant Architecture for Caribbean-Accented Emergency Speech Triage

Link: https://arxiv.org/abs/2512.10741

Authors: Elroy Galbraith, Chadwick Sutherland, Donahue Morgan

Categories: Computation and Language (cs.CL)

Keywords: non-standard English varieties, exhibit systematic performance, systematic performance degradation, English varieties, non-standard English

Abstract:Emergency speech recognition systems exhibit systematic performance degradation on non-standard English varieties, creating a critical gap in services for Caribbean populations. We present TRIDENT (Transcription and Routing Intelligence for Dispatcher-Empowered National Triage), a three-layer dispatcher-support architecture designed to structure emergency call inputs for human application of established triage protocols (the ESI for routine operations and START for mass casualty events), even when automatic speech recognition fails. The system combines Caribbean-accent-tuned ASR, local entity extraction via large language models, and bio-acoustic distress detection to provide dispatchers with three complementary signals: transcription confidence, structured clinical entities, and vocal stress indicators. Our key insight is that low ASR confidence, rather than representing system failure, serves as a valuable queue prioritization signal -- particularly when combined with elevated vocal distress markers indicating a caller in crisis whose speech may have shifted toward basilectal registers. A complementary insight drives the entity extraction layer: trained responders and composed bystanders may report life-threatening emergencies without elevated vocal stress, requiring semantic analysis to capture clinical indicators that paralinguistic features miss. We describe the architectural design, theoretical grounding in psycholinguistic research on stress-induced code-switching, and deployment considerations for offline operation during disaster scenarios. This work establishes a framework for accent-resilient emergency AI that ensures Caribbean voices receive equitable access to established national triage protocols. Empirical validation on Caribbean emergency calls remains future work.

14. 【2512.10739】Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving

Link: https://arxiv.org/abs/2512.10739

Authors: Songyang Gao, Yuzhe Gu, Zijian Wu, Lingkai Kong, Wenwei Zhang, Zhongrui Cai, Fan Zheng, Tianyou Ma, Junhao Shen, Haiteng Zhao, Duanyang Zhang, Huilun Zhang, Kuikun Liu, Chengqi Lyu, Yanhui Duan, Chiyu Chen, Ningsheng Ma, Jianfei Gao, Han Lyu, Dahua Lin, Kai Chen

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Verifiable Rewards, achieved significant progress, Large language models, Large language, tasks by Reinforcement

Abstract:Large language models (LLMs) have achieved significant progress in solving complex reasoning tasks by Reinforcement Learning with Verifiable Rewards (RLVR). This advancement is also inseparable from the oversight automated by reliable verifiers. However, current outcome-based verifiers (OVs) are unable to inspect the unreliable intermediate steps in the long reasoning chains of thought (CoTs). Meanwhile, current process-based verifiers (PVs) have difficulties in reliably detecting errors in the complex long CoTs, limited by the scarcity of high-quality annotations due to the prohibitive costs of human annotations. Therefore, we propose the Outcome-based Process Verifier (OPV), which verifies the rationale process of summarized outcomes from long CoTs to achieve both accurate and efficient verification and enable large-scale annotation. To empower the proposed verifier, we adopt an iterative active learning framework with expert annotations to progressively improve the verification capability of OPV with fewer annotation costs. Specifically, in each iteration, the most uncertain cases of the current best OPV are annotated and then used to train a new OPV through Rejection Fine-Tuning (RFT) and RLVR for the next round. Extensive experiments demonstrate OPV's superior performance and broad applicability. It achieves new state-of-the-art results on our held-out OPV-Bench, outperforming much larger open-source models such as Qwen3-Max-Preview with an F1 score of 83.1 compared to 76.3. Furthermore, OPV effectively detects false positives within synthetic datasets, closely aligning with expert assessment. When collaborating with policy models, OPV consistently yields performance gains, e.g., raising the accuracy of DeepSeek-R1-Distill-Qwen-32B from 55.2% to 73.3% on AIME2025 as the compute budget scales.

15. 【2512.10734】Textual Data Bias Detection and Mitigation - An Extensible Pipeline with Experimental Evaluation

Link: https://arxiv.org/abs/2512.10734

Authors: Rebekka Görge, Sujan Sai Gannamaneni, Tabea Naeven, Hammam Abdelwahab, Héctor Allende-Cid, Armin B. Cremers, Lennard Helmer, Michael Mock, Anna Schmitz, Songkai Xue, Elif Yildirir, Maximilian Poretschkin, Stefan Wrobel

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: train large language, encompassing harmful language, manifestations encompassing harmful, skewed demographic distributions, large language models

Abstract:Textual data used to train large language models (LLMs) exhibits multifaceted bias manifestations encompassing harmful language and skewed demographic distributions. Regulations such as the European AI Act require identifying and mitigating biases against protected groups in data, with the ultimate goal of preventing unfair model outputs. However, practical guidance and operationalization are lacking. We propose a comprehensive data bias detection and mitigation pipeline comprising four components that address two data bias types, namely representation bias and (explicit) stereotypes for a configurable sensitive attribute. First, we leverage LLM-generated word lists created based on quality criteria to detect relevant group labels. Second, representation bias is quantified using the Demographic Representation Score. Third, we detect and mitigate stereotypes using sociolinguistically informed filtering. Finally, we compensate representation bias through Grammar- and Context-Aware Counterfactual Data Augmentation. We conduct a two-fold evaluation using the examples of gender, religion and age. First, the effectiveness of each individual component on data debiasing is evaluated through human validation and baseline comparison. The findings demonstrate that we successfully reduce representation bias and (explicit) stereotypes in a text dataset. Second, the effect of data debiasing on model bias reduction is evaluated by bias benchmarking of several models (0.6B-8B parameters), fine-tuned on the debiased text dataset. This evaluation reveals that LLMs fine-tuned on debiased data do not consistently show improved performance on bias benchmarks, exposing critical gaps in current evaluation methodologies and highlighting the need for targeted data manipulation to address manifested model bias.

16. 【2512.10696】Remember Me, Refine Me: A Dynamic Procedural Memory Framework for Experience-Driven Agent Evolution

Link: https://arxiv.org/abs/2512.10696

Authors: Zouying Cao, Jiaji Deng, Li Yu, Weikang Zhou, Zhaoyang Liu, Bolin Ding, Hai Zhao

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: theoretically reducing redundant, large language model, Procedural memory enables, enables large language, memory enables large

Comments: 16 pages, 9 figures, 9 tables

Abstract:Procedural memory enables large language model (LLM) agents to internalize "how-to" knowledge, theoretically reducing redundant trial-and-error. However, existing frameworks predominantly suffer from a "passive accumulation" paradigm, treating memory as a static append-only archive. To bridge the gap between static storage and dynamic reasoning, we propose ReMe (Remember Me, Refine Me), a comprehensive framework for experience-driven agent evolution. ReMe innovates across the memory lifecycle via three mechanisms: 1) multi-faceted distillation, which extracts fine-grained experiences by recognizing success patterns, analyzing failure triggers and generating comparative insights; 2) context-adaptive reuse, which tailors historical insights to new contexts via scenario-aware indexing; and 3) utility-based refinement, which autonomously adds valid memories and prunes outdated ones to maintain a compact, high-quality experience pool. Extensive experiments on BFCL-V3 and AppWorld demonstrate that ReMe establishes a new state-of-the-art in agent memory systems. Crucially, we observe a significant memory-scaling effect: Qwen3-8B equipped with ReMe outperforms larger, memoryless Qwen3-14B, suggesting that self-evolving memory provides a computation-efficient pathway for lifelong learning. We release our code and the $\texttt{this http URL}$ dataset to facilitate further research.

17. 【2512.10630】From Data Scarcity to Data Care: Reimagining Language Technologies for Serbian and other Low-Resource Languages

Link: https://arxiv.org/abs/2512.10630

Authors: Smiljana Antonijevic Ubois

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: low resource languages, source language materials, Large language models, languages typically reflects, linguistic biases present

Abstract:Large language models are commonly trained on dominant languages like English, and their representation of low-resource languages typically reflects cultural and linguistic biases present in the source language materials. Using the Serbian language as a case, this study examines the structural, historical, and sociotechnical factors shaping language technology development for low-resource languages in the AI age. Drawing on semi-structured interviews with ten scholars and practitioners, including linguists, digital humanists, and AI developers, it traces challenges rooted in historical destruction of Serbian textual heritage, intensified by contemporary issues that drive reductive, engineering-first approaches prioritizing functionality over linguistic nuance. These include superficial transliteration, reliance on English-trained models, data bias, and dataset curation lacking cultural specificity. To address these challenges, the study proposes Data Care, a framework grounded in CARE principles (Collective Benefit, Authority to Control, Responsibility, and Ethics), that reframes bias mitigation from a post hoc technical fix to an integral component of corpus design, annotation, and governance, and positions Data Care as a replicable model for building inclusive, sustainable, and culturally grounded language technologies in contexts where traditional LLM development reproduces existing power imbalances and cultural blind spots.

18. 【2512.10624】AgriGPT-Omni: A Unified Speech-Vision-Text Framework for Multilingual Agricultural Intelligence

Link: https://arxiv.org/abs/2512.10624

Authors: Bo Yang,Lanfei Feng,Yunkui Chen,Yu Zhang,Jianyu Zhang,Xiao Xu,Nueraili Aierken,Shijian Li

Subjects: Computation and Language (cs.CL)

Keywords: applications remain constrained, agricultural applications remain, comprehensive evaluation benchmarks, unified multimodal architectures, rapid advances

Comments: none

Click to view abstract

Abstract:Despite rapid advances in multimodal large language models, agricultural applications remain constrained by the lack of multilingual speech data, unified multimodal architectures, and comprehensive evaluation benchmarks. To address these challenges, we present AgriGPT-Omni, an agricultural omni-framework that integrates speech, vision, and text in a unified framework. First, we construct a scalable data synthesis and collection pipeline that converts agricultural texts and images into training data, resulting in the largest agricultural speech dataset to date, including 492K synthetic and 1.4K real speech samples across six languages. Second, based on this, we train the first agricultural omni-model via a three-stage paradigm: textual knowledge injection, progressive multimodal alignment, and GRPO-based reinforcement learning, enabling unified reasoning across languages and modalities. Third, we propose AgriBench-Omni-2K, the first tri-modal benchmark for agriculture, covering diverse speech-vision-text tasks and multilingual slices, with standardized protocols and reproducible tools. Experiments show that AgriGPT-Omni significantly outperforms general-purpose baselines on multilingual and multimodal reasoning as well as real-world speech understanding. All models, data, benchmarks, and code will be released to promote reproducible research, inclusive agricultural intelligence, and sustainable AI development for low-resource regions.

19. 【2512.10575】RoleRMBench & RoleRM: Towards Reward Modeling for Profile-Based Role Play in Dialogue Systems

Link: https://arxiv.org/abs/2512.10575

Authors: Hang Ding,Qiming Feng,Dongqi Liu,Qi Zhao,Tao Yao,Shuo Wang,Dongsheng Chen,Jian Li,Zhenye Gan,Jiangning Zhang,Chengjie Wang,Yabiao Wang

Subjects: Computation and Language (cs.CL)

Keywords: aligning large language, cornerstone of aligning, large language models, reward models, Reward

Comments: none

Click to view abstract

Abstract:Reward modeling has become a cornerstone of aligning large language models (LLMs) with human preferences. Yet, when extended to subjective and open-ended domains such as role play, existing reward models exhibit severe degradation, struggling to capture nuanced and persona-grounded human judgments. To address this gap, we introduce RoleRMBench, the first systematic benchmark for reward modeling in role-playing dialogue, covering seven fine-grained capabilities from narrative management to role consistency and engagement. Evaluation on RoleRMBench reveals large and consistent gaps between general-purpose reward models and human judgment, particularly in narrative and stylistic dimensions. We further propose RoleRM, a reward model trained with Continuous Implicit Preferences (CIP), which reformulates subjective evaluation as continuous consistent pairwise supervision under multiple structuring strategies. Comprehensive experiments show that RoleRM surpasses strong open- and closed-source reward models by over 24% on average, demonstrating substantial gains in narrative coherence and stylistic fidelity. Our findings highlight the importance of continuous preference representation and annotation consistency, establishing a foundation for subjective alignment in human-centered dialogue systems.

20. 【2512.10561】Causal Reasoning Favors Encoders: On The Limits of Decoder-Only Models

Link: https://arxiv.org/abs/2512.10561

Authors: Amartya Roy,Elamparithy M,Kripabandhu Ghosh,Ponnurangam Kumaraguru,Adrian de Wynter

Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: underpins recent advances, reasoning remains unclear, context learning, underpins recent, remains unclear

Comments: none

Click to view abstract

Abstract:In-context learning (ICL) underpins recent advances in large language models (LLMs), although its role and performance in causal reasoning remain unclear. Causal reasoning demands multihop composition and strict conjunctive control, and reliance on spurious lexical relations of the input could provide misleading results. We hypothesize that, due to their ability to project the input into a latent space, encoder and encoder-decoder architectures are better suited for said multihop conjunctive reasoning than decoder-only models. To test this, we compare fine-tuned versions of all the aforementioned architectures with zero- and few-shot ICL in both natural-language and non-natural-language scenarios. We find that ICL alone is insufficient for reliable causal reasoning, often overfocusing on irrelevant input features. In particular, decoder-only models are noticeably brittle to distributional shifts, while fine-tuned encoder and encoder-decoder models can generalize more robustly across our tests, including the non-natural-language split. Both architectures are only matched or surpassed by decoder-only architectures at large scales. We conclude by noting that for cost-effective, short-horizon robust causal reasoning, encoder or encoder-decoder architectures with targeted fine-tuning are preferable.

21. 【2512.10545】XDoGE: Multilingual Data Reweighting to Enhance Language Inclusivity in LLMs

Link: https://arxiv.org/abs/2512.10545

Authors: Iñaki Lacunza,José Javier Saiz,Alexander Shvets,Aitor Gonzalez-Agirre,Marta Villegas

Subjects: Computation and Language (cs.CL)

Keywords: Current large language, Current large, trained on massive, massive amounts, amounts of text

Comments: Accepted and presented at the LLMs4All workshop at the IEEE BigData 2025 Conference, Macau - December 8-11, 2025

Click to view abstract

Abstract:Current large language models (LLMs) are trained on massive amounts of text data, primarily from a few dominant languages. Studies suggest that this over-reliance on high-resource languages, such as English, hampers LLM performance in mid- and low-resource languages. To mitigate this problem, we propose to (i) optimize the language distribution by training a small proxy model within a domain-reweighing DoGE algorithm that we extend to XDoGE for a multilingual setup, and (ii) rescale the data and train a full-size model with the established language weights either from scratch or within a continual pre-training phase (CPT). We target six languages possessing a variety of geographic and intra- and inter-language-family relations, namely, English and Spanish (high-resource), Portuguese and Catalan (mid-resource), Galician and Basque (low-resource). We experiment with Salamandra-2b, which is a promising model for these languages. We investigate the effects of substantial data repetition on minor languages and under-sampling on dominant languages using the IberoBench framework for quantitative evaluation. Finally, we release a new promising IberianLLM-7B-Instruct model centering on Iberian languages and English that we pretrained from scratch and further improved using CPT with the XDoGE weights.

22. 【2512.10453】Grammaticality Judgments in Humans and Language Models: Revisiting Generative Grammar with LLMs

Link: https://arxiv.org/abs/2512.10453

Authors: Lars G.B. Johnsen

Subjects: Computation and Language (cs.CL)

Keywords: counts as evidence, evidence for syntactic, traditional generative grammar, syntactic structure, subject-auxiliary inversion

Comments: 2 figures

Click to view abstract

Abstract:What counts as evidence for syntactic structure? In traditional generative grammar, systematic contrasts in grammaticality such as subject-auxiliary inversion and the licensing of parasitic gaps are taken as evidence for an internal, hierarchical grammar. In this paper, we test whether large language models (LLMs), trained only on surface forms, reproduce these contrasts in ways that imply an underlying structural representation. We focus on two classic constructions: subject-auxiliary inversion (testing recognition of the subject boundary) and parasitic gap licensing (testing abstract dependency structure). We evaluate models including GPT-4 and LLaMA-3 using prompts eliciting acceptability ratings. Results show that LLMs reliably distinguish between grammatical and ungrammatical variants in both constructions, and as such support that they are sensitive to structure and not just linear order. Structural generalizations, distinct from cognitive knowledge, emerge from predictive training on surface forms, suggesting functional sensitivity to syntax without explicit encoding.

23. 【2512.10449】When Reject Turns into Accept: Quantifying the Vulnerability of LLM-Based Scientific Reviewers to Indirect Prompt Injection

Link: https://arxiv.org/abs/2512.10449

Authors: Devanshu Sahoo,Manish Prasad,Vasudev Majhi,Jahnvi Singh,Vinay Chamola,Yash Sinha,Murari Mandal,Dhruv Kumar

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)

Keywords: Large Language Models, integration of Large, scientific peer review, Large Language, peer review

Comments: none

Click to view abstract

Abstract:The landscape of scientific peer review is rapidly evolving with the integration of Large Language Models (LLMs). This shift is driven by two parallel trends: the widespread individual adoption of LLMs by reviewers to manage workload (the "Lazy Reviewer" hypothesis) and the formal institutional deployment of AI-powered assessment systems by conferences like AAAI and Stanford's Agents4Science. This study investigates the robustness of these "LLM-as-a-Judge" systems (both illicit and sanctioned) to adversarial PDF manipulation. Unlike general jailbreaks, we focus on a distinct incentive: flipping "Reject" decisions to "Accept," for which we develop a novel evaluation metric which we term as WAVS (Weighted Adversarial Vulnerability Score). We curated a dataset of 200 scientific papers and adapted 15 domain-specific attack strategies to this task, evaluating them across 13 Language Models, including GPT-5, Claude Haiku, and DeepSeek. Our results demonstrate that obfuscation strategies like "Maximum Mark Magyk" successfully manipulate scores, achieving alarming decision flip rates even in large-scale models. We will release our complete dataset and injection framework to facilitate more research on this topic.

24. 【2512.10441】Decoding Student Minds: Leveraging Conversational Agents for Psychological and Learning Analysis

Link: https://arxiv.org/abs/2512.10441

Authors: Nour El Houda Ben Chaabene,Hamza Hammami,Laid Kahloul

Subjects: Computation and Language (cs.CL)

Keywords: psychologically-aware conversational agent, conversational agent designed, Large Language Models, Long Short-Term Memory, combines Large Language

Comments: This manuscript is currently under peer review in Expert Systems with Applications

Click to view abstract

Abstract:This paper presents a psychologically-aware conversational agent designed to enhance both learning performance and emotional well-being in educational settings. The system combines Large Language Models (LLMs), a knowledge graph-enhanced BERT (KG-BERT), and a bidirectional Long Short-Term Memory (LSTM) with attention to classify students' cognitive and affective states in real time. Unlike prior chatbots limited to either tutoring or affective support, our approach leverages multimodal data (including textual semantics, prosodic speech features, and temporal behavioral trends) to infer engagement, stress, and conceptual understanding. A pilot study with university students demonstrated improved motivation, reduced stress, and moderate academic gains compared to baseline methods. These results underline the promise of integrating semantic reasoning, multimodal fusion, and temporal modeling to support adaptive, student-centered educational interventions.

25. 【2512.10440】Enhancing Next-Generation Language Models with Knowledge Graphs: Extending Claude, Mistral IA, and GPT-4 via KG-BERT

Link: https://arxiv.org/abs/2512.10440

Authors: Nour El Houda Ben Chaabene,Hamza Hammami

Subjects: Computation and Language (cs.CL)

Keywords: Large language models, lack structured knowledge, excel in NLP, Large language, NLP but lack

Comments: This paper was accepted and scheduled for inclusion in the ICALT 2025 proceedings but was ultimately not published due to absence from the conference presentation. It appears in the official program booklet. Conference: 2025 IEEE International Conference on Advanced Learning Technologies (ICALT)

Click to view abstract

Abstract:Large language models (LLMs) like Claude, Mistral IA, and GPT-4 excel in NLP but lack structured knowledge, leading to factual inconsistencies. We address this by integrating Knowledge Graphs (KGs) via KG-BERT to enhance grounding and reasoning. Experiments show significant gains in knowledge-intensive tasks such as question answering and entity linking. This approach improves factual reliability and enables more context-aware next-generation LLMs.

26. 【2512.10435】Semantic Reconstruction of Adversarial Plagiarism: A Context-Aware Framework for Detecting and Restoring "Tortured Phrases" in Scientific Literature

Link: https://arxiv.org/abs/2512.10435

Authors: Agniva Maiti,Prajwal Panth,Suresh Chandra Satapathy

Subjects: Computation and Language (cs.CL)

Keywords: automated paraphrasing tools, text generation techniques, generation techniques, integrity and reliability, literature is facing

Comments: 10 pages, 5 figures; unpublished manuscript; submitted to arXiv for dissemination

Click to view abstract

Abstract:The integrity and reliability of scientific literature is facing a serious threat by adversarial text generation techniques, specifically from the use of automated paraphrasing tools to mask plagiarism. These tools generate "tortured phrases", statistically improbable synonyms (e.g. "counterfeit consciousness" for "artificial intelligence"), that preserve the local grammar while obscuring the original source. Most existing detection methods depend heavily on static blocklists or general-domain language models, which suffer from high false-negative rates for novel obfuscations and cannot determine the source of the plagiarized content. In this paper, we propose Semantic Reconstruction of Adversarial Plagiarism (SRAP), a framework designed not only to detect these anomalies but to mathematically recover the original terminology. We use a two-stage architecture: (1) statistical anomaly detection with a domain-specific masked language model (SciBERT) using token-level pseudo-perplexity, and (2) source-based semantic reconstruction using dense vector retrieval (FAISS) and sentence-level alignment (SBERT). Experiments on a parallel corpus of adversarial scientific text show that while zero-shot baselines fail completely (0.00 percent restoration accuracy), our retrieval-augmented approach achieves 23.67 percent restoration accuracy, significantly outperforming baseline methods. We also show that static decision boundaries are necessary for robust detection in jargon-heavy scientific text, since dynamic thresholding fails under high variance. SRAP enables forensic analysis by linking obfuscated expressions back to their most probable source documents.
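
Stage (1) in the abstract above scores tokens by pseudo-perplexity under a masked language model. The sketch below shows the general computation in Python, assuming a scoring callback `masked_log_prob(tokens, i)` that returns log P(token i | all other tokens). In SRAP that role is played by SciBERT; here a toy, context-free probability table stands in for it. All function names, the table, and the threshold value are hypothetical, and this is not the authors' implementation.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical callback type: log-probability of tokens[i] given the other tokens.
LogProbFn = Callable[[Sequence[str], int], float]


def token_surprisals(tokens: Sequence[str], masked_log_prob: LogProbFn) -> List[float]:
    """Negative log-probability of each token when that token alone is masked."""
    return [-masked_log_prob(tokens, i) for i in range(len(tokens))]


def pseudo_perplexity(tokens: Sequence[str], masked_log_prob: LogProbFn) -> float:
    """Exponential of the mean per-token surprisal."""
    surprisals = token_surprisals(tokens, masked_log_prob)
    return math.exp(sum(surprisals) / len(surprisals))


def flag_anomalies(tokens: Sequence[str], masked_log_prob: LogProbFn,
                   threshold: float) -> List[str]:
    """Return tokens whose surprisal exceeds a fixed (static) threshold."""
    surprisals = token_surprisals(tokens, masked_log_prob)
    return [tok for tok, s in zip(tokens, surprisals) if s > threshold]


# Toy stand-in for a masked language model (assumption: ignores context entirely).
TOY_PROBS = {
    "artificial": 0.5,
    "intelligence": 0.5,
    "counterfeit": 0.01,
    "consciousness": 0.01,
}


def toy_masked_log_prob(tokens: Sequence[str], i: int) -> float:
    return math.log(TOY_PROBS.get(tokens[i], 0.1))


if __name__ == "__main__":
    phrase = ["counterfeit", "consciousness"]
    print(flag_anomalies(phrase, toy_masked_log_prob, threshold=3.0))
```

The fixed `threshold` mirrors the abstract's point that static decision boundaries were preferred over dynamic thresholding; a real system would calibrate it on in-domain text.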

27. 【2512.10430】T-pro 2.0: An Efficient Russian Hybrid-Reasoning Model and Playground

Link: https://arxiv.org/abs/2512.10430

Authors: Dmitrii Stoianov,Danil Taranets,Olga Tsymboi,Ramil Latypov,Almaz Dautov,Vladislav Kruglikov,Nikita Surkov,German Abramov,Pavel Gein,Dmitry Abulkhanov,Mikhail Gashkov,Viktor Zelenkovskiy,Artem Batalov,Aleksandr Medvedev,Anatolii Potapov

Subjects: Computation and Language (cs.CL)

Keywords: open-weight Russian LLM, Russian LLM, open-weight Russian, practical Russian LLM, Russian LLM applications

Comments: none

Click to view abstract

Abstract:We introduce T-pro 2.0, an open-weight Russian LLM for hybrid reasoning and efficient inference. The model supports direct answering and reasoning-trace generation, using a Cyrillic-dense tokenizer and an adapted EAGLE speculative-decoding pipeline to reduce latency. To enable reproducible and extensible research, we release the model weights, the T-Wix 500k instruction corpus, the T-Math reasoning benchmark, and the EAGLE weights on Hugging Face. These resources allow users to study Russian-language reasoning and to extend or adapt both the model and the inference pipeline. A public web demo exposes reasoning and non-reasoning modes and illustrates the speedups achieved by our inference stack across domains. T-pro 2.0 thus serves as an accessible open system for building and evaluating efficient, practical Russian LLM applications.

28. 【2512.10422】Cooperative Retrieval-Augmented Generation for Question Answering: Mutual Information Exchange and Ranking by Contrasting Layers

Link: https://arxiv.org/abs/2512.10422

Authors: Youmin Ko,Sungjong Seo,Hyunjoon Kim

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: factually inaccurate output, generate factually inaccurate, gained significant attention, large language models, retrieval-augmented generation

Comments: Accepted to NeurIPS 2025

Click to view abstract

Abstract:Since large language models (LLMs) have a tendency to generate factually inaccurate output, retrieval-augmented generation (RAG) has gained significant attention as a key means to mitigate this downside of harnessing only LLMs. However, existing RAG methods for simple and multi-hop question answering (QA) are still prone to incorrect retrievals and hallucinations. To address these limitations, we propose CoopRAG, a novel RAG framework for the question answering task in which a retriever and an LLM work cooperatively with each other by exchanging informative knowledge, and the earlier and later layers of the retriever model work cooperatively with each other to accurately rank the retrieved documents relevant to a given query. In this framework, we (i) unroll a question into sub-questions and a reasoning chain in which uncertain positions are masked, (ii) retrieve the documents relevant to the question augmented with the sub-questions and the reasoning chain, (iii) rerank the documents by contrasting layers of the retriever, and (iv) reconstruct the reasoning chain by filling the masked positions via the LLM. Our experiments demonstrate that CoopRAG consistently outperforms state-of-the-art QA methods on three multi-hop QA datasets as well as a simple QA dataset in terms of both the retrieval and QA performances. Our code is available at this https URL.

29. 【2512.10411】Sliding Window Attention Adaptation

Link: https://arxiv.org/abs/2512.10411

Authors: Yijiong Yu,Jiale Liu,Qingyun Wu,Huazheng Wang,Ji Pei

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Transformer-based Large Language, Large Language Models, Transformer-based Large, Large Language, making long-context inference

Comments: none

Click to view abstract

Abstract:The self-attention mechanism in Transformer-based Large Language Models (LLMs) scales quadratically with input length, making long-context inference expensive. Sliding window attention (SWA) reduces this cost to linear complexity, but naively enabling complete SWA at inference-time for models pretrained with full attention (FA) causes severe long-context performance degradation due to training-inference mismatch. This makes us wonder: Can FA-pretrained LLMs be well adapted to SWA without pretraining? We investigate this by proposing Sliding Window Attention Adaptation (SWAA), a set of practical recipes that combine five methods for better adaptation: (1) applying SWA only during prefilling; (2) preserving "sink" tokens; (3) interleaving FA/SWA layers; (4) chain-of-thought (CoT); and (5) fine-tuning. Our experiments show that SWA adaptation is feasible while non-trivial: no single method suffices, yet specific synergistic combinations effectively recover the original long-context performance. We further analyze the performance-efficiency trade-offs of different SWAA configurations and provide recommended recipes for diverse scenarios. Our code is available at this https URL
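
To make the sliding-window idea and the "sink" token recipe from the abstract above concrete, here is a minimal Python sketch of a causal sliding-window attention mask that also keeps a few leading tokens visible to every query. It is an illustration only, not the authors' SWAA code: the function names `swa_mask` and `attended_per_query` and the parameters `window` and `num_sink` are hypothetical.

```python
from typing import List


def swa_mask(seq_len: int, window: int, num_sink: int = 0) -> List[List[bool]]:
    """Build a causal sliding-window mask.

    mask[q][k] is True when query position q may attend to key position k:
    k must not be in the future, and must either lie within the last
    `window` positions (including q itself) or be one of the first
    `num_sink` "sink" tokens.
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            causal = k <= q
            in_window = (q - k) < window
            is_sink = k < num_sink
            row.append(causal and (in_window or is_sink))
        mask.append(row)
    return mask


def attended_per_query(mask: List[List[bool]]) -> List[int]:
    """Number of keys each query attends to (a proxy for per-token cost)."""
    return [sum(row) for row in mask]


if __name__ == "__main__":
    for row in swa_mask(seq_len=6, window=2, num_sink=1):
        print("".join("x" if allowed else "." for allowed in row))
```

Each query attends to at most `window + num_sink` keys, so the total number of attended pairs grows linearly with sequence length instead of quadratically. Setting `window` equal to the sequence length recovers ordinary causal full attention.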

30. 【2512.10403】BRACE: A Benchmark for Robust Audio Caption Quality Evaluation

Link: https://arxiv.org/abs/2512.10403

Authors: Tianyu Guo,Hongyu Chen,Hao Liang,Meiyi Qiang,Bohan Zeng,Linzhuang Sun,Bin Cui,Wentao Zhang

Subjects: Sound (cs.SD); Computation and Language (cs.CL)

Keywords: Automatic audio captioning, Automatic audio, Audio Caption, Audio Caption Evaluation

Comments: none

Click to view abstract

Abstract:Automatic audio captioning is essential for audio understanding, enabling applications such as accessibility and content indexing. However, evaluating the quality of audio captions remains a major challenge, especially in reference-free settings where high-quality ground-truth captions are unavailable. While CLAPScore is currently the most widely used reference-free Audio Caption Evaluation Metric (ACEM), its robustness under diverse conditions has not been systematically validated. To address this gap, we introduce BRACE, a new benchmark designed to evaluate audio caption alignment quality in a reference-free setting. BRACE is primarily designed for assessing ACEMs, and can also be extended to measure the modality alignment abilities of Large Audio Language Models (LALMs). BRACE consists of two sub-benchmarks: BRACE-Main for fine-grained caption comparison and BRACE-Hallucination for detecting subtle hallucinated content. We construct these datasets through high-quality filtering, LLM-based corruption, and human annotation. Given the widespread adoption of CLAPScore as a reference-free ACEM and the increasing application of LALMs in audio-language tasks, we evaluate both approaches using the BRACE benchmark, testing CLAPScore across various CLAP model variants and assessing multiple LALMs. Notably, even the best-performing CLAP-based ACEM achieves only a 70.01 F1-score on the BRACE-Main benchmark, while the best LALM reaches just 63.19. By revealing the limitations of CLAP models and LALMs, our BRACE benchmark offers valuable insights into the direction of future research.

Submission history: [v1] submitted by Hao Liang, Thu, 11 Dec 2025 08:09:24 UTC (1,311 KB)

31. 【2512.10398】Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale

Link: https://arxiv.org/abs/2512.10398

Authors: Zhaodong Wang,Zhenting Qi,Sherman Wong,Nathan Hu,Samuel Lin,Jun Ge,Erwin Gao,Yining Yang,Ben Maurer,Wenlin Chen,David Recordon,Yilun Du,Minlan Yu,Ying Zhang

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)

Keywords: robustly coordinate complex, coordinate complex toolchains, maintain durable memory, demands coding agents, Confucius SDK

Comments: none

Click to view abstract

Abstract:Real-world AI software engineering demands coding agents that can reason over massive repositories, maintain durable memory across and within long sessions, and robustly coordinate complex toolchains at test time. Existing open-source coding agents provide transparency but frequently fall short when pushed to these industrial-scale workloads, while proprietary coding agents offer strong practical performance but limited extensibility, interpretability, and controllability. We present the Confucius Code Agent (CCA), an open-sourced AI software engineer that can operate at an industrial scale. CCA is built atop the Confucius SDK, an open-sourced agent development platform designed around three complementary perspectives: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). The SDK introduces a unified orchestrator with hierarchical working memory for long-context reasoning, a persistent note-taking system for cross-session continual learning, and a modular extension module for robust tool use. Moreover, a meta-agent automates the synthesis, evaluation, and refinement of agent configurations through a build-test-improve loop, enabling rapid agent development on new tasks, environments, and tool stacks. Instantiated on Confucius SDK with these mechanisms, CCA delivers strong performance on real-world software engineering tasks. On SWE-Bench-Pro, CCA achieves a state-of-the-art Resolve@1 performance of 54.3%, substantially improving over prior coding agents. Together, the Confucius SDK and CCA provide a transparent, extensible, and reproducible foundation for AI agents, bridge gaps between research prototypes and production-grade systems, and support agent development and deployment at industrial scale.

32. 【2512.10365】GPG: Generalized Policy Gradient Theorem for Transformer-based Policies

Link: https://arxiv.org/abs/2512.10365

Authors: Hangyu Mao,Guangting Dong,Zhicheng Dou

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Generalized Policy Gradient, Policy Gradient Theorem, Transformer-based policies, present the Generalized, designed for Transformer-based

Comments: none

Click to view abstract

Abstract:We present the Generalized Policy Gradient (GPG) Theorem, specifically designed for Transformer-based policies. Notably, we demonstrate that both standard Policy Gradient Theorem and GRPO emerge as special cases within our GPG framework. Furthermore, we explore its practical applications in training Large Language Models (LLMs), offering new insights into efficient policy optimization.
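
For context, the two special cases named in the abstract above can be written in their commonly cited forms. The block below is a reference sketch of the standard policy gradient theorem (advantage form) and of the group-normalized advantage typically used in GRPO. It does not state the GPG theorem itself, and the notation ($G$ sampled responses per prompt with scalar rewards $r_1,\dots,r_G$) is an assumption of this note, not taken from the paper.

```latex
% Standard policy gradient theorem, advantage form
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}
    \left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,
           A^{\pi_\theta}(s_t, a_t) \right]

% Group-relative advantage for the i-th of G sampled responses (GRPO-style)
\hat{A}_i
  = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}
         {\operatorname{std}(r_1, \dots, r_G)}
```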

33. 【2512.10336】Multilingual VLM Training: Adapting an English-Trained VLM to French

Link: https://arxiv.org/abs/2512.10336

Authors: Jules Lahmi,Alexis Roger

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: made great progress, Artificial intelligence, Language Models, recent years, intelligence has made

Comments: none

Click to view abstract

Abstract:Artificial intelligence has made great progress in recent years, particularly in the development of Vision-Language Models (VLMs) that understand both visual and textual data. However, these advancements remain largely limited to English, reducing their accessibility for non-English speakers. It is essential to extend these capabilities to a broader range of languages. This paper explores the challenges of adapting an English-trained VLM to different languages. To this end, we will explore and compare different methods for their performance and computational cost. We consider a translation-based pipeline, LoRA finetuning, and a two-stage finetuning strategy that separates vision adaptation from language adaptation. To evaluate these methods, we use a combination of standard multimodal benchmarks translated into the target language and manual assessments by native experts. The results reveal that dataset translation remains a major bottleneck in multilingual VLM performance, with data quality limiting the effectiveness of training and evaluation. These findings suggest that future efforts should focus on native-language dataset collection and improved translation strategies.

34. 【2512.10284】MotionEdit: Benchmarking and Learning Motion-Centric Image Editing

Link: https://arxiv.org/abs/2512.10284

Authors: Yixin Wan,Lei Ke,Wenhao Yu,Kai-Wei Chang,Dong Yu

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: modifying subject actions, motion-centric image editing-the, image editing-the task, preserving identity, physical plausibility

Comments: none

Click to view abstract

Abstract:We introduce MotionEdit, a novel dataset for motion-centric image editing, the task of modifying subject actions and interactions while preserving identity, structure, and physical plausibility. Unlike existing image editing datasets that focus on static appearance changes or contain only sparse, low-quality motion edits, MotionEdit provides high-fidelity image pairs depicting realistic motion transformations extracted and verified from continuous videos. This new task is not only scientifically challenging but also practically significant, powering downstream applications such as frame-controlled video synthesis and animation. To evaluate model performance on the novel task, we introduce MotionEdit-Bench, a benchmark that challenges models on motion-centric edits and measures model performance with generative, discriminative, and preference-based metrics. Benchmark results reveal that motion editing remains highly challenging for existing state-of-the-art diffusion-based editing models. To address this gap, we propose MotionNFT (Motion-guided Negative-aware Fine Tuning), a post-training framework that computes motion alignment rewards based on how well the motion flow between input and model-edited images matches the ground-truth motion, guiding models toward accurate motion transformations. Extensive experiments on FLUX.1 Kontext and Qwen-Image-Edit show that MotionNFT consistently improves editing quality and motion fidelity of both base models on the motion editing task without sacrificing general editing ability, demonstrating its effectiveness.

35. 【2512.10195】AutoMedic: An Automated Evaluation Framework for Clinical Conversational Agents with Medical Dataset Grounding

Link: https://arxiv.org/abs/2512.10195

Authors: Gyutaek Oh, Sangjoon Park, Byung-Hoon Kim

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: large language models, Evaluating large language, language models, clinical conversational agents, large language

Abstract:Evaluating large language models (LLMs) has recently emerged as a critical issue for safe and trustworthy application of LLMs in the medical domain. Although a variety of static medical question-answering (QA) benchmarks have been proposed, many aspects remain underexplored, such as the effectiveness of LLMs in generating responses in dynamic, interactive clinical multi-turn conversation situations and the identification of multi-faceted evaluation strategies beyond simple accuracy. However, formally evaluating a dynamic, interactive clinical situation is hindered by its vast combinatorial space of possible patient states and interaction trajectories, making it difficult to standardize and quantitatively measure such scenarios. Here, we introduce AutoMedic, a multi-agent simulation framework that enables automated evaluation of LLMs as clinical conversational agents. AutoMedic transforms off-the-shelf static QA datasets into virtual patient profiles, enabling realistic and clinically grounded multi-turn clinical dialogues between LLM agents. The performance of various clinical conversational agents is then assessed based on our CARE metric, which provides a multi-faceted evaluation standard of clinical conversational accuracy, efficiency/strategy, empathy, and robustness. Our findings, validated by human experts, demonstrate the validity of AutoMedic as an automated evaluation framework for clinical conversational agents, offering practical guidelines for the effective development of LLMs in conversational medical applications.

36. 【2512.10185】Watermarks for Language Models via Probabilistic Automata

Link: https://arxiv.org/abs/2512.10185

Authors: Yangkun Wang, Jingbo Shang

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: language models achieves, models achieves distortion-free, achieves distortion-free embedding, edit-distance attacks, language models

Abstract:A recent watermarking scheme for language models achieves distortion-free embedding and robustness to edit-distance attacks. However, it suffers from limited generation diversity and high detection overhead. In parallel, recent research has focused on undetectability, a property ensuring that watermarks remain difficult for adversaries to detect and spoof. In this work, we introduce a new class of watermarking schemes constructed through probabilistic automata. We present two instantiations: (i) a practical scheme with exponential generation diversity and computational efficiency, and (ii) a theoretical construction with formal undetectability guarantees under cryptographic assumptions. Extensive experiments on LLaMA-3B and Mistral-7B validate the superior performance of our scheme in terms of robustness and efficiency.

37. 【2512.10178】CIEGAD: Cluster-Conditioned Interpolative and Extrapolative Framework for Geometry-Aware and Domain-Aligned Data Augmentation

Link: https://arxiv.org/abs/2512.10178

Authors: Keito Inoshita, Xiaokang Zhou, Akira Kawai, Katsutoshi Yada

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: deep learning deployment, practical deep learning, hindering model training, semantically uncovered regions, Domain-aligned data augmentation

Abstract:In practical deep learning deployment, the scarcity of data and the imbalance of label distributions often lead to semantically uncovered regions within the real-world data distribution, hindering model training and causing misclassification near class boundaries as well as unstable behaviors in peripheral areas. Although recent large language models (LLMs) show promise for data augmentation, an integrated framework that simultaneously achieves directional control of generation, domain alignment, and quality control has not yet been fully established. To address these challenges, we propose a Cluster-conditioned Interpolative and Extrapolative framework for Geometry-Aware and Domain-aligned data augmentation (CIEGAD), which systematically complements both in-distribution and out-of-distribution semantically uncovered regions. CIEGAD constructs domain profiles through cluster conditioning, allocates generation with a hierarchical frequency-geometric allocation integrating class frequency and geometric indicators, and finely controls generation directions via the coexistence of interpolative and extrapolative synthesis. It further performs quality control through geometry-constrained filtering combined with an LLM-as-a-Judge mechanism. Experiments on multiple classification tasks demonstrate that CIEGAD effectively extends the periphery of real-world data distributions while maintaining high alignment between generated and real-world data as well as semantic diversity. In particular, for long-tailed and multi-class classification tasks, CIEGAD consistently improves F1 and recall, validating the triple harmony of distributional consistency, diversity, and quality. These results indicate that CIEGAD serves as a practically oriented data augmentation framework that complements underrepresented regions while preserving alignment with real-world data.

38. 【2512.10172】Offscript: Automated Auditing of Instruction Adherence in LLMs

Link: https://arxiv.org/abs/2512.10172

Authors: Nicholas Clark, Ryan Bai, Tanu Mitra

Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, generative search systems, sourcing and presentation

Abstract:Large Language Models (LLMs) and generative search systems are increasingly used for information seeking by diverse populations with varying preferences for knowledge sourcing and presentation. While users can customize LLM behavior through custom instructions and behavioral prompts, no mechanism exists to evaluate whether these instructions are being followed effectively. We present Offscript, an automated auditing tool that efficiently identifies potential instruction-following failures in LLMs. In a pilot study analyzing custom instructions sourced from Reddit, Offscript detected potential deviations from instructed behavior in 86.4% of conversations, 22.2% of which were confirmed as material violations through human review. Our findings suggest that automated auditing serves as a viable approach for evaluating compliance with behavioral instructions related to information seeking.
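
As a quick reading aid for the two percentages quoted in this abstract, the sketch below multiplies them to estimate the share of all audited conversations that ended up as human-confirmed violations. It assumes the 22.2% is measured relative to the flagged conversations, which the wording suggests but does not state outright; the function name is illustrative and not part of Offscript.

```python
def confirmed_share(flag_rate: float, confirm_rate: float) -> float:
    """Share of all conversations that were both flagged and confirmed.

    Assumption: confirm_rate is conditional on a conversation being flagged.
    """
    return flag_rate * confirm_rate


# Figures quoted in the abstract: 86.4% flagged, 22.2% of those confirmed.
share = confirmed_share(0.864, 0.222)
print(f"{share:.1%}")  # prints 19.2%
```

Under that reading, roughly one in five audited conversations contained a confirmed material violation.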

39. 【2512.10150】Unforgotten Safety: Preserving Safety Alignment of Large Language Models with Continual Learning

Link: https://arxiv.org/abs/2512.10150

Authors: Lama Alssum, Hani Itani, Hasan Abed Al Kader Hammoud, Philip Torr, Adel Bibi, Bernard Ghanem

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models, alignment of large, large language, increasingly important, safety

Abstract:The safety alignment of large language models (LLMs) is becoming increasingly important with their democratization. In this paper, we study the safety degradation that comes with adapting LLMs to new tasks. We attribute this safety compromise to catastrophic forgetting and frame the problem of preserving safety when fine-tuning as a continual learning (CL) problem. We consider the fine-tuning-as-a-service setup where the user uploads their data to a service provider to get a customized model that excels on the user's selected task. We adapt several CL approaches from the literature and systematically evaluate their ability to mitigate safety degradation. These include regularization-based, memory-based, and model merging approaches. We consider two scenarios, (1) benign user data and (2) poisoned user data. Our results demonstrate that CL approaches consistently achieve lower attack success rates than standard fine-tuning. Among these, DER outperforms both other CL methods and existing safety-preserving baselines while maintaining task utility. These findings generalize across three downstream tasks (GSM8K, SST2, Code) and three model families (LLaMA2-7B, Mistral-7B, Gemma-2B), establishing CL as a practical solution to preserve safety.

40. 【2512.10148】PARAN: Persona-Augmented Review ANswering system on Food Delivery Review Dataset

Link: https://arxiv.org/abs/2512.10148

Authors: Moonsoo Park, Jeongseok Yun, Bohyung Kim

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Personalized review response, Personalized review, information is limited, presents a significant, significant challenge

Abstract:Personalized review response generation presents a significant challenge in domains where user information is limited, such as food delivery platforms. While large language models (LLMs) offer powerful text generation capabilities, they often produce generic responses when lacking contextual user data, reducing engagement and effectiveness. In this work, we propose a two-stage prompting framework that infers both explicit (e.g., user-stated preferences) and implicit (e.g., demographic or stylistic cues) personas directly from short review texts. These inferred persona attributes are then incorporated into the response generation prompt to produce user-tailored replies. To encourage diverse yet faithful generations, we adjust decoding temperature during inference. We evaluate our method using a real-world dataset collected from a Korean food delivery app, and assess its impact on precision, diversity, and semantic consistency. Our findings highlight the effectiveness of persona-augmented prompting in enhancing the relevance and personalization of automated responses without requiring model fine-tuning.

41. 【2512.10121】Workflow is All You Need: Escaping the "Statistical Smoothing Trap" via High-Entropy Information Foraging and Adversarial Pacing

Link: https://arxiv.org/abs/2512.10121

Authors: Zhongjie Jiang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); General Finance (q-fin.GN)

Keywords: confronting current large, current large language, Statistical Smoothing Trap, large language models, Central to long-form

Comments: 22 pages, 8 figures. Includes an ecological validity blind test where the Agentic Workflow achieved a 25% acceptance rate in top-tier media, decisively outperforming the SOTA Zero-shot baseline (0%). Features the DNFO-v5 ontology

Abstract:Central to long-form text generation in vertical domains is the "impossible trinity" confronting current large language models (LLMs): the simultaneous achievement of low hallucination, deep logical coherence, and personalized expression. This study establishes that this bottleneck arises from existing generative paradigms succumbing to the Statistical Smoothing Trap, a phenomenon that overlooks the high-entropy information acquisition and structured cognitive processes integral to expert-level writing. To address this limitation, we propose the DeepNews Framework, an agentic workflow that explicitly models the implicit cognitive processes of seasoned financial journalists. The framework integrates three core modules: first, a dual-granularity retrieval mechanism grounded in information foraging theory, which enforces a 10:1 saturated information input ratio to mitigate hallucinatory outputs; second, schema-guided strategic planning, a process leveraging domain expert knowledge bases (narrative schemas) and Atomic Blocks to forge a robust logical skeleton; third, adversarial constraint prompting, a technique deploying tactics including Rhythm Break and Logic Fog to disrupt the probabilistic smoothness inherent in model-generated text. Experiments delineate a salient Knowledge Cliff in deep financial reporting: content truthfulness collapses when retrieved context falls below 15,000 characters, while a high-redundancy input exceeding 30,000 characters stabilizes the Hallucination-Free Rate (HFR) above 85%. In an ecological validity blind test conducted with a top-tier Chinese technology media outlet, the DeepNews system, built on a previous-generation model (DeepSeek-V3-0324), achieved a 25% submission acceptance rate, significantly outperforming the 0% acceptance rate of zero-shot generation by a state-of-the-art (SOTA) model (GPT-5).

42. 【2512.10110】Generate-Then-Validate: A Novel Question Generation Approach Using Small Language Models

Link: https://arxiv.org/abs/2512.10110

Authors: Yumou Wei, John Stamper, Paulo F. Carvalho

Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: learning analytics research, small language models, automatic question generation, analytics research, generate high-quality questions

Comments: Accepted as a full research paper for the 16th International Conference on Learning Analytics and Knowledge (LAK'26)

Abstract:We explore the use of small language models (SLMs) for automatic question generation as a complement to the prevalent use of their large counterparts in learning analytics research. We present a novel question generation pipeline that leverages both the text generation and the probabilistic reasoning abilities of SLMs to generate high-quality questions. Adopting a "generate-then-validate" strategy, our pipeline first performs expansive generation to create an abundance of candidate questions and then refines them through selective validation based on novel probabilistic reasoning. We conducted two evaluation studies, one with seven human experts and the other with a large language model (LLM), to assess the quality of the generated questions. Most judges (humans or LLMs) agreed that the generated questions had clear answers and generally aligned well with the intended learning objectives. Our findings suggest that an SLM can effectively generate high-quality questions when guided by a well-designed pipeline that leverages its strengths.

43. 【2512.10080】What Kind of Reasoning (if any) is an LLM actually doing? On the Stochastic Nature and Abductive Appearance of Large Language Models

Link: https://arxiv.org/abs/2512.10080

Authors: Luciano Floridi, Jessica Morley, Claudio Novelli, David Watson

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: current Large Language, Large Language Models, Large Language, current Large, Language Models

Abstract:This article looks at how reasoning works in current Large Language Models (LLMs) that function using the token-completion method. It examines their stochastic nature and their similarity to human abductive reasoning. The argument is that these LLMs create text based on learned patterns rather than performing actual abductive reasoning. When their output seems abductive, this is largely because they are trained on human-generated texts that include reasoning structures. Examples are used to show how LLMs can produce plausible ideas, mimic commonsense reasoning, and give explanatory answers without being grounded in truth, semantics, verification, or understanding, and without performing any real abductive reasoning. This dual nature, where the models have a stochastic base but appear abductive in use, has important consequences for how LLMs are evaluated and applied. They can assist with generating ideas and supporting human thinking, but their outputs must be critically assessed because they cannot identify truth or verify their explanations. The article concludes by addressing five objections to these points, noting some limitations in the analysis, and offering an overall evaluation.

44. 【2512.10054】Parallel Decoder Transformer: Model-Internal Parallel Decoding with Speculative Invariance via Note Conditioning

Link: https://arxiv.org/abs/2512.10054

Authors: Logan Robbins

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Autoregressive decoding, inherently sequential, creating a latency

Abstract:Autoregressive decoding in Large Language Models (LLMs) is inherently sequential, creating a latency bottleneck that scales linearly with output length. While "Decomposition-and-Fill" methods like Skeleton-of-Thought attempt to parallelize generation via external orchestration, they suffer from coherence drift due to the lack of cross-stream communication. In this work, we introduce the Parallel Decoder Transformer (PDT), a parameter-efficient architecture that embeds coordination primitives directly into the inference process of a frozen pre-trained model. Instead of retraining the base model, PDT injects lightweight Speculative Note Conditioning (SNC) adapters that allow parallel decoding streams to synchronize via a shared, dynamic latent space. We formulate coordination as a speculative consensus problem, where sibling streams broadcast semantic "notes" to a global bus, gated by a learned verification head. We validate our approach on a 50,000-step curriculum using a frozen 20B-parameter backbone. Our results demonstrate that PDT achieves effective self-correction, reaching 77.8% precision in coverage prediction and recovering approximate serial semantics without modifying the trunk weights. This establishes PDT as a scalable, efficient alternative to full model fine-tuning for structured parallel generation.

45. 【2512.10038】Diffusion Is Your Friend in Show, Suggest and Tell

Link: https://arxiv.org/abs/2512.10038

Authors: Jia Cheng Hu, Roberto Cavicchioli, Alessandro Capotondi

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Computer Vision tasks, generative Computer Vision, Denoising models demonstrated, Computer Vision, Diffusion Denoising models

Abstract:Diffusion Denoising models have demonstrated impressive results across generative Computer Vision tasks, but they still fail to outperform standard autoregressive solutions in the discrete domain, and only match them at best. In this work, we propose a different paradigm by adopting diffusion models to provide suggestions to the autoregressive generation rather than replacing them. By doing so, we combine the bidirectional and refining capabilities of the former with the strong linguistic structure provided by the latter. To showcase its effectiveness, we present Show, Suggest and Tell (SST), which achieves State-of-the-Art results on COCO, among models in a similar setting. In particular, SST achieves 125.1 CIDEr-D on the COCO dataset without Reinforcement Learning, outperforming both autoregressive and diffusion model State-of-the-Art results by 1.5 and 2.5 points. On top of the strong results, we performed extensive experiments to validate the proposal and analyze the impact of the suggestion module. Results demonstrate a positive correlation between suggestion and caption quality, overall indicating a currently underexplored but promising research direction. Code will be available at: this https URL_suggest_tell.

46. 【2512.10004】Exploring LLMs for Scientific Information Extraction Using The SciEx Framework

Link: https://arxiv.org/abs/2512.10004

Authors: Sha Li, Ayush Sadekar, Nathan Self, Yiqi Su, Lars Andersland, Mira Chaplin, Annabel Zhang, Hyoju Yang, James B Henderson, Krista Wigginton, Linsey Marr, T.M. Murali, Naren Ramakrishnan

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large language models, Large language, automating scientific information, increasingly touted, touted as powerful

Abstract:Large language models (LLMs) are increasingly touted as powerful tools for automating scientific information extraction. However, existing methods and tools often struggle with the realities of scientific literature: long-context documents, multi-modal content, and reconciling varied and inconsistent fine-grained information across multiple publications into standardized formats. These challenges are further compounded when the desired data schema or extraction ontology changes rapidly, making it difficult to re-architect or fine-tune existing systems. We present SciEx, a modular and composable framework that decouples key components including PDF parsing, multi-modal retrieval, extraction, and aggregation. This design streamlines on-demand data extraction while enabling extensibility and flexible integration of new models, prompting strategies, and reasoning mechanisms. We evaluate SciEx on datasets spanning three scientific topics for its ability to extract fine-grained information accurately and consistently. Our findings provide practical insights into both the strengths and limitations of current LLM-based pipelines.

47. 【2512.09972】BAMBO: Construct Ability and Efficiency LLM Pareto Set via Bayesian Adaptive Multi-objective Block-wise Optimization

Link: https://arxiv.org/abs/2512.09972

Authors: Kesheng Chen, Wenjian Luo, Zhenqian Zhu, Yamin Hu, Yiya Xi

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)

Keywords: Large Language Models, existing merging techniques, merging techniques remain, techniques remain inadequate, Large Language

Abstract:Constructing a Pareto set is pivotal for navigating the capability-efficiency trade-offs in Large Language Models (LLMs); however, existing merging techniques remain inadequate for this task. Coarse-grained, model-level methods yield only a sparse set of suboptimal solutions, while fine-grained, layer-wise approaches suffer from the "curse of dimensionality," rendering the search space computationally intractable. To resolve this dichotomy, we propose BAMBO (Bayesian Adaptive Multi-objective Block-wise Optimization), a novel framework that automatically constructs the LLM Pareto set. BAMBO renders the search tractable by introducing a Hybrid Optimal Block Partitioning strategy. Formulated as a 1D clustering problem, this strategy leverages a dynamic programming approach to optimally balance intra-block homogeneity and inter-block information distribution, thereby dramatically reducing dimensionality without sacrificing critical granularity. The entire process is automated within an evolutionary loop driven by the q-Expected Hypervolume Improvement (qEHVI) acquisition function. Experiments demonstrate that BAMBO discovers a superior and more comprehensive Pareto frontier than baselines, enabling agile model selection tailored to diverse operational constraints. Code is available at: this https URL.
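
The notion of a Pareto set used in this abstract can be illustrated with a small non-dominated filter. This is a generic sketch only, not BAMBO's block-wise search or its qEHVI-driven loop; the candidate names and the two scores (both treated as higher-is-better) are made up for illustration.

```python
from typing import Dict, List, Tuple

Scores = Tuple[float, float]  # (capability, efficiency), both higher-is-better


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_set(candidates: Dict[str, Scores]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, score in candidates.items()
        if not any(dominates(other, score) for other in candidates.values())
    ]


# Hypothetical merged models with made-up scores.
models = {
    "merge_a": (0.80, 0.40),
    "merge_b": (0.70, 0.60),
    "merge_c": (0.65, 0.55),  # dominated by merge_b on both axes
    "merge_d": (0.50, 0.90),
}
print(pareto_set(models))  # ['merge_a', 'merge_b', 'merge_d']
```

Every surviving candidate represents a different trade-off, which is what allows model selection under different operational constraints.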

48. 【2503.18702】Unsupervised Acquisition of Discrete Grammatical Categories

Link: https://arxiv.org/abs/2503.18702

Authors: David Ph. Shakouri, Crit Cremers, Niels O. Schiller

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Keywords: mother language model, language model, presents experiments performed, article presents experiments, mother language

Comments: 34 pages, 3 figures, 7 tables

Abstract:This article presents experiments performed using a computational laboratory environment for language acquisition experiments. It implements a multi-agent system consisting of two agents: an adult language model and a daughter language model that aims to learn the mother language. Crucially, the daughter agent does not have access to the internal knowledge of the mother language model but only to the language exemplars the mother agent generates. These experiments illustrate how this system can be used to acquire abstract grammatical knowledge. We demonstrate how statistical analyses of patterns in the input data corresponding to grammatical categories yield discrete grammatical rules. These rules are subsequently added to the grammatical knowledge of the daughter language model. To this end, hierarchical agglomerative cluster analysis was applied to the utterances consecutively generated by the mother language model. It is argued that this procedure can be used to acquire structures resembling grammatical categories proposed by linguists for natural languages. Thus, it is established that non-trivial grammatical knowledge has been acquired. Moreover, the parameter configuration of this computational laboratory environment determined using training data generated by the mother language model is validated in a second experiment with a test set similarly resulting in the acquisition of non-trivial categories.
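
Hierarchical agglomerative clustering, the analysis named in this abstract, can be sketched in a few lines. The example is a toy under stated assumptions: the words, their two-dimensional feature vectors, the average-linkage criterion, and the stopping rule are all invented here and are not taken from the article.

```python
import math
from typing import Dict, List, Sequence


def average_linkage(a: List[str], b: List[str], vecs: Dict[str, Sequence[float]]) -> float:
    """Mean Euclidean distance between all cross-cluster pairs."""
    total = sum(math.dist(vecs[x], vecs[y]) for x in a for y in b)
    return total / (len(a) * len(b))


def agglomerate(vecs: Dict[str, Sequence[float]], n_clusters: int) -> List[List[str]]:
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[word] for word in vecs]
    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest average-linkage distance.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], vecs),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Toy distributional features (hypothetical): two context statistics per word.
toy = {
    "dog": (0.9, 0.1),
    "cat": (0.8, 0.2),
    "sees": (0.1, 0.9),
    "eats": (0.2, 0.8),
}
print(agglomerate(toy, n_clusters=2))  # [['cat', 'dog'], ['eats', 'sees']]
```

The two resulting groups play the role of discrete categories (here, noun-like and verb-like words) recovered purely from distributional statistics.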

49. 【2412.20505】Planning, Living and Judging: A Multi-agent LLM-based Framework for Cyclical Urban Planning

Link: https://arxiv.org/abs/2412.20505

Authors: Hang Ni, Yuzhi Wang, Hao Liu

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: regeneration presents significant, presents significant challenges, Urban regeneration presents, requiring adaptive approaches, context of urbanization

Comments: 4 pages, 2 figures, accepted by The 1st Workshop on AI for Urban Planning (AAAI 2025's Workshop)

Abstract:Urban regeneration presents significant challenges within the context of urbanization, requiring adaptive approaches to tackle evolving needs. Leveraging advancements in large language models (LLMs), we propose Cyclical Urban Planning (CUP), a new paradigm that continuously generates, evaluates, and refines urban plans in a closed loop. Specifically, our multi-agent LLM-based framework consists of three key components: (1) Planning, where LLM agents generate and refine urban plans based on contextual data; (2) Living, where agents simulate the behaviors and interactions of residents, modeling life in the urban environment; and (3) Judging, which involves evaluating plan effectiveness and providing iterative feedback for improvement. The cyclical process enables a dynamic and responsive planning approach. Experiments on a real-world dataset demonstrate the effectiveness of our framework as a continuous and adaptive planning process.

Information Retrieval

1. 【2512.10688】Rethinking Popularity Bias in Collaborative Filtering via Analytical Vector Decomposition

Link: https://arxiv.org/abs/2512.10688

Authors: Lingfeng Liu, Yixin Song, Dazhong Shen, Bing Yin, Hao Li, Yanyong Zhang, Chao Wang

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Bayesian Pairwise Ranking, bias fundamentally undermines, disproportionately recommend popular, neglecting users' genuine, Popularity bias fundamentally

Comments: Accepted by SIGKDD 2026 (First Cycle)

Abstract:Popularity bias fundamentally undermines the personalization capabilities of collaborative filtering (CF) models, causing them to disproportionately recommend popular items while neglecting users' genuine preferences for niche content. While existing approaches treat this as an external confounding factor, we reveal that popularity bias is an intrinsic geometric artifact of Bayesian Pairwise Ranking (BPR) optimization in CF models. Through rigorous mathematical analysis, we prove that BPR systematically organizes item embeddings along a dominant "popularity direction" where embedding magnitudes directly correlate with interaction frequency. This geometric distortion forces user embeddings to simultaneously handle two conflicting tasks (expressing genuine preference and calibrating against global popularity), trapping them in suboptimal configurations that favor popular items regardless of individual tastes. We propose Directional Decomposition and Correction (DDC), a universally applicable framework that surgically corrects this embedding geometry through asymmetric directional updates. DDC guides positive interactions along personalized preference directions while steering negative interactions away from the global popularity direction, disentangling preference from popularity at the geometric source. Extensive experiments across multiple BPR-based architectures demonstrate that DDC significantly outperforms state-of-the-art debiasing methods, reducing training loss to less than 5% of heavily-tuned baselines while achieving superior recommendation quality and fairness. Code is available at this https URL.
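
For readers unfamiliar with the objective being analysed here, the standard BPR loss for one (user, positive item, negative item) triple is the negative log-sigmoid of the score difference. The sketch below shows only that textbook loss with dot-product scoring; it is not the paper's DDC correction, and the embedding values are made up.

```python
import math
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product used as the user-item score."""
    return sum(a * b for a, b in zip(u, v))


def bpr_loss(user: Sequence[float], pos: Sequence[float], neg: Sequence[float]) -> float:
    """-log(sigmoid(score(user, pos) - score(user, neg))) for one triple."""
    diff = dot(user, pos) - dot(user, neg)
    # log1p(exp(-diff)) is algebraically equal to -log(sigmoid(diff)).
    return math.log1p(math.exp(-diff))


user = (1.0, 0.5)
liked = (0.8, 0.4)    # item the user interacted with
unseen = (0.2, -0.1)  # sampled negative item
print(round(bpr_loss(user, liked, unseen), 4))
```

Because the score is an inner product, scaling up an item embedding's magnitude raises its score for every user it aligns with, which is the kind of geometric effect the abstract attributes to popular items.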

2. 【2512.10388】The Best of the Two Worlds: Harmonizing Semantic and Hash IDs for Sequential Recommendation

Link: https://arxiv.org/abs/2512.10388

Authors: Ziwei Liu, Yejing Wang, Qidong Liu, Zijian Zhang, Chong Chen, Wei Huang, Xiangyu Zhao

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Conventional Sequential Recommender, Sequential Recommender Systems, Conventional Sequential, Recommender Systems, Sequential Recommender

Abstract:Conventional Sequential Recommender Systems (SRS) typically assign unique Hash IDs (HID) to construct item embeddings. These HID embeddings effectively learn collaborative information from historical user-item interactions, making them vulnerable to situations where most items are rarely consumed (the long-tail problem). Recent methods that incorporate auxiliary information often suffer from noisy collaborative sharing caused by co-occurrence signals or semantic homogeneity caused by flat dense embeddings. Semantic IDs (SIDs), with their capability of code sharing and multi-granular semantic modeling, provide a promising alternative. However, the collaborative overwhelming phenomenon hinders the further development of SID-based methods. The quantization mechanisms commonly compromise the uniqueness of identifiers required for modeling head items, creating a performance seesaw between head and tail items. To address this dilemma, we propose a novel framework that harmonizes the SID and HID. Specifically, we devise a dual-branch modeling architecture that enables the model to capture the multi-granular semantics within SID while preserving the unique collaborative identity of HID. Furthermore, we introduce a dual-level alignment strategy that bridges the two representations, facilitating knowledge transfer and supporting robust preference modeling. Extensive experiments on three real-world datasets show that the proposed framework effectively balances recommendation quality for both head and tail items while surpassing the existing baselines. The implementation code can be found online at this https URL.

3. 【2512.10165】BookReconciler: An Open-Source Tool for Metadata Enrichment and Work-Level Clustering

Link: https://arxiv.org/abs/2512.10165

Authors: Matt Miller, Dan Sinykin, Melanie Walsh

Categories: Digital Libraries (cs.DL); Information Retrieval (cs.IR)

Keywords: clustering book data, enhancing and clustering, cluster related Expressions, Expressions and Manifestations, clustering book

Comments: Published in the proceedings of the Joint Conference on Digital Libraries (JCDL) 2025, Resources

Abstract:We present BookReconciler, an open-source tool for enhancing and clustering book data. BookReconciler allows users to take spreadsheets with minimal metadata, such as book title and author, and automatically 1) add authoritative, persistent identifiers like ISBNs and 2) cluster related Expressions and Manifestations of the same Work, e.g., different translations or editions. This enhancement makes it easier to combine related collections and analyze books at scale. The tool is currently designed as an extension for OpenRefine, a popular software application, and connects to major bibliographic services including the Library of Congress, VIAF, OCLC, HathiTrust, Google Books, and Wikidata. Our approach prioritizes human judgment. Through an interactive interface, users can manually evaluate matches and define the contours of a Work (e.g., to include translations or not). We evaluate reconciliation performance on datasets of U.S. prize-winning books and contemporary world fiction. BookReconciler achieves near-perfect accuracy for U.S. works but lower performance for global texts, reflecting structural weaknesses in bibliographic infrastructures for non-English and global literature. Overall, BookReconciler supports the reuse of bibliographic data across domains and applications, contributing to ongoing work in digital libraries and digital humanities.

4. 【2512.10149】STARS: Semantic Tokens with Augmented Representations for Recommendation at Scale

Link: https://arxiv.org/abs/2512.10149

Authors: Han Chen, Steven Zhu, Yingrui Li

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: rapidly shifting user, deliver relevant items, context including seasonality, dynamic context including, shifting user intent

Abstract:Real-world ecommerce recommender systems must deliver relevant items under strict tens-of-milliseconds latency constraints despite challenges such as cold-start products, rapidly shifting user intent, and dynamic context including seasonality, holidays, and promotions. We introduce STARS, a transformer-based sequential recommendation framework built for large-scale, low-latency ecommerce settings. STARS combines several innovations: dual-memory user embeddings that separate long-term preferences from short-term session intent; semantic item tokens that fuse pretrained text embeddings, learnable deltas, and LLM-derived attribute tags, strengthening content-based matching, long-tail coverage, and cold-start performance; context-aware scoring with learned calendar and event offsets; and a latency-conscious two-stage retrieval pipeline that performs offline embedding generation and online maximum inner-product search with filtering, enabling tens-of-milliseconds response times. In offline evaluations on production-scale data, STARS improves Hit@5 by more than 75 percent relative to our existing LambdaMART system. A large-scale A/B test on 6 million visits shows statistically significant lifts, including Total Orders +0.8%, Add-to-Cart on Home +2.0%, and Visits per User +0.5%. These results demonstrate that combining semantic enrichment, multi-intent modeling, and deployment-oriented design can yield state-of-the-art recommendation quality in real-world environments without sacrificing serving efficiency.
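
The online half of the two-stage pipeline described above (maximum inner-product search with filtering over precomputed embeddings) can be illustrated with a brute-force sketch. This is not the STARS implementation: the function name, the toy item vectors, and the in-stock filter are all assumptions made for illustration, and a production system would likely use an approximate index rather than a linear scan.

```python
import heapq


def top_k_inner_product(user_vec, item_vecs, k, allowed=None):
    """Brute-force maximum inner-product search with an optional ID filter.

    item_vecs maps item ID -> embedding (assumed to be computed offline).
    allowed is a hypothetical set of item IDs that pass business filters.
    """
    scored = []
    for item_id, vec in item_vecs.items():
        if allowed is not None and item_id not in allowed:
            continue  # filtered out before scoring
        score = sum(u * v for u, v in zip(user_vec, vec))
        scored.append((score, item_id))
    # Keep the k highest-scoring items, best first.
    return [item_id for _, item_id in heapq.nlargest(k, scored)]


# Toy data, purely illustrative.
ITEM_VECS = {
    "mug": [0.9, 0.1],
    "lamp": [0.2, 0.8],
    "desk": [0.6, 0.6],
    "sold_out": [1.0, 1.0],
}
IN_STOCK = {"mug", "lamp", "desk"}

top_items = top_k_inner_product([1.0, 0.5], ITEM_VECS, k=2, allowed=IN_STOCK)
```

Filtering before scoring keeps ineligible items out of the candidate list, so the top-k result never has to be back-filled after the search.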

5. 【2512.10104】LLM-PEA: Leveraging Large Language Models Against Phishing Email Attacks

Link: https://arxiv.org/abs/2512.10104

Authors: Najmul Hassan, Prashanth BusiReddyGari, Haitao Zhao, Yihao Ren, Jinsheng Xu, Shaohu Zhang

Categories: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)

Keywords: Large Language Models, globally consequential vectors, deploy Large Language, phishing email, cyber intrusion

Comments: 7 pages

Abstract:Email phishing is one of the most prevalent and globally consequential vectors of cyber intrusion. As systems increasingly deploy Large Language Model (LLM) applications, they face evolving phishing email threats that exploit their fundamental architectures. Current LLMs require substantial hardening before deployment in email security systems, particularly against coordinated multi-vector attacks that exploit architectural vulnerabilities. This paper proposes LLM-PEA, an LLM-based framework to detect phishing email attacks across multiple attack vectors, including prompt injection, text refinement, and multilingual attacks. We evaluate three frontier LLMs (GPT-4o, Claude Sonnet 4, and Grok-3) and comprehensive prompting designs to assess their feasibility, robustness, and limitations against phishing email attacks. Our empirical analysis reveals that LLMs can detect phishing emails with over 90% accuracy, while we also highlight that LLM-based phishing email detection systems could be exploited by adversarial attacks, prompt injection, and multilingual attacks. Our findings provide critical insights for LLM-based phishing detection in real-world settings where attackers exploit multiple vulnerabilities in combination.

6. 【2512.09947】HGC-Herd: Efficient Heterogeneous Graph Condensation via Representative Node Herding

Link: https://arxiv.org/abs/2512.09947

Authors: Fuyan Ou, Siqi Ai, Yulin Hu

Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)

Keywords: demonstrated strong capability, graph neural networks, modeling complex semantics, Heterogeneous graph neural, neural networks

Comments: 8 pages, 2 figures

Abstract:Heterogeneous graph neural networks (HGNNs) have demonstrated strong capability in modeling complex semantics across multi-type nodes and relations. However, their scalability to large-scale graphs remains challenging due to structural redundancy and high-dimensional node features. Existing graph condensation approaches, such as GCond, are primarily developed for homogeneous graphs and rely on gradient matching, resulting in considerable computational, memory, and optimization overhead. We propose HGC-Herd, a training-free condensation framework that generates compact yet informative heterogeneous graphs while maintaining both semantic and structural fidelity. HGC-Herd integrates lightweight feature propagation to encode multi-hop relational context and employs a class-wise herding mechanism to identify representative nodes per class, producing balanced and discriminative subsets for downstream learning tasks. Extensive experiments on ACM, DBLP, and Freebase validate that HGC-Herd attains comparable or superior accuracy to full-graph training while markedly reducing both runtime and memory consumption. These results underscore its practical value for efficient and scalable heterogeneous graph representation learning.
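
Class-wise herding, as mentioned in the abstract, is commonly implemented as a greedy selection in which each new pick keeps the running mean of the chosen nodes close to the class mean. The sketch below shows that generic procedure on plain feature lists. It is not the HGC-Herd code: the function names and toy data are hypothetical, and the feature-propagation step that the paper applies beforehand is omitted.

```python
def herd(features, m):
    """Greedily pick m row indices whose running mean tracks the overall mean."""
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / len(features) for d in range(dim)]
    selected, running_sum = [], [0.0] * dim
    for k in range(1, m + 1):
        best_idx, best_dist = None, float("inf")
        for i, f in enumerate(features):
            if i in selected:
                continue
            # Squared distance between the class mean and the mean of the
            # selection if row i were added as the k-th element.
            dist = sum(
                (mean[d] - (running_sum[d] + f[d]) / k) ** 2 for d in range(dim)
            )
            if dist < best_dist:
                best_idx, best_dist = i, dist
        selected.append(best_idx)
        running_sum = [running_sum[d] + features[best_idx][d] for d in range(dim)]
    return selected


def herd_per_class(features, labels, m):
    """Run herding separately inside each class; returns {label: [node indices]}."""
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    result = {}
    for y, idxs in by_class.items():
        picked = herd([features[i] for i in idxs], min(m, len(idxs)))
        result[y] = [idxs[j] for j in picked]
    return result


# Toy node features and labels, purely illustrative.
FEATURES = [[0, 0], [1, 1], [2, 2], [10, 10], [11, 11], [12, 12]]
LABELS = [0, 0, 0, 1, 1, 1]
representatives = herd_per_class(FEATURES, LABELS, m=1)
```

Because every class contributes the same number of picks (up to its size), the selected subset stays balanced across classes, and no gradient computation or model training is involved.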

7. 【2512.09874】Benchmarking Document Parsers on Mathematical Formula Extraction from PDFs

Link: https://arxiv.org/abs/2512.09874

Authors: Pius Horn, Janis Keuper

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Correctly parsing mathematical, training large language, building scientific knowledge, scientific knowledge bases, Correctly parsing

Abstract:Correctly parsing mathematical formulas from PDFs is critical for training large language models and building scientific knowledge bases from academic literature, yet existing benchmarks either exclude formulas entirely or lack semantically-aware evaluation metrics. We introduce a novel benchmarking framework centered on synthetically generated PDFs with precise LaTeX ground truth, enabling systematic control over layout, formulas, and content characteristics. A key methodological contribution is pioneering LLM-as-a-judge for semantic formula assessment, combined with a robust two-stage matching pipeline that handles parser output inconsistencies. Through human validation on 250 formula pairs (750 ratings from 30 evaluators), we demonstrate that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.78) compared to CDM (r=0.34) and text similarity (r~0). Evaluating 20+ contemporary PDF parsers (including specialized OCR models, vision-language models, and rule-based approaches) across 100 synthetic documents with 2,000+ formulas reveals significant performance disparities. Our findings provide crucial insights for practitioners selecting parsers for downstream applications and establish a robust, scalable methodology that enables reproducible evaluation of PDF formula extraction quality. Code and benchmark data: this https URL
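
The agreement figures quoted above are Pearson correlations between an automatic metric's scores and human ratings. A minimal sketch of that computation follows. The score lists are made-up illustrative values, not data from the paper, and the function name is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (spread_x * spread_y)


# Hypothetical ratings for five formula pairs (illustrative only).
human_scores = [1, 2, 3, 4, 5]
metric_scores = [1.1, 1.9, 3.2, 3.9, 5.0]
agreement = pearson_r(human_scores, metric_scores)
```

A value near 1 means the metric ranks and spaces formula pairs much as human raters do, while a value near 0 means its scores carry little linear information about human judgment.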

Computer Vision

1. 【2512.10959】StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space

Link: https://arxiv.org/abs/2512.10959

Authors: Tjark Behrens, Anton Obukhov, Bingxin Ke, Fabio Tosi, Matteo Poggi, Konrad Schindler

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: models geometry purely, synthesis that models, depth or warping, diffusion-based framework, purely through viewpoint

Comments: Project page: [this https URL](https://hf.co/spaces/prs-eth/stereospace_web)

Abstract:We introduce StereoSpace, a diffusion-based framework for monocular-to-stereo synthesis that models geometry purely through viewpoint conditioning, without explicit depth or warping. A canonical rectified space and the conditioning guide the generator to infer correspondences and fill disocclusions end-to-end. To ensure fair and leakage-free evaluation, we introduce an end-to-end protocol that excludes any ground truth or proxy geometry estimates at test time. The protocol emphasizes metrics reflecting downstream relevance: iSQoE for perceptual comfort and MEt3R for geometric consistency. StereoSpace surpasses other methods from the warp inpaint, latent-warping, and warped-conditioning categories, achieving sharp parallax and strong robustness on layered and non-Lambertian scenes. This establishes viewpoint-conditioned diffusion as a scalable, depth-free solution for stereo generation.

2. 【2512.10958】WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World

Link: https://arxiv.org/abs/2512.10958

Authors: Ao Liang, Lingdong Kong, Tianyi Yan, Hongsi Liu, Wesley Yang, Ziqi Huang, Wei Yin, Jialong Zuo, Yixuan Hu, Dekai Zhu, Dongyue Lu, Youquan Liu, Guangfeng Jiang, Linfeng Li, Xiangtai Li, Long Zhuo, Lai Xing Ng, Benoit R. Cottereau, Changxin Gao, Liang Pan, Wei Tsang Ooi, Ziwei Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generative world models, Generative world, synthesize realistic, driving environments, physically or behaviorally

Comments: Preprint; 80 pages, 37 figures, 29 tables; Project Page at [this https URL](https://worldbench.github.io/worldlens)

Abstract:Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physics, or support reliable control. We introduce WorldLens, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. It spans five aspects -- Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference -- jointly covering visual realism, geometric consistency, physical plausibility, and functional reliability. Across these dimensions, no existing world model excels universally: those with strong textures often violate physics, while geometry-stable ones lack behavioral fidelity. To align objective metrics with human judgment, we further construct WorldLens-26K, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, and develop WorldLens-Agent, an evaluation model distilled from these annotations to enable scalable, explainable scoring. Together, the benchmark, dataset, and agent form a unified ecosystem for measuring world fidelity -- standardizing how future models are judged not only by how real they look, but by how real they behave.

3. 【2512.10957】SceneMaker: Open-set 3D Scene Generation with Decoupled De-occlusion and Pose Estimation Model

Link: https://arxiv.org/abs/2512.10957

Authors: Yukai Shi, Weiyu Li, Zihao Wang, Hongyang Li, Xingyu Chen, Ping Tan, Lei Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: framework called SceneMaker, called SceneMaker, generation framework called, pose estimation model, pose estimation

Comments: Project page: [this https URL](https://idea-research.github.io/SceneMaker/)

Abstract:We propose a decoupled 3D scene generation framework called SceneMaker in this work. Due to the lack of sufficient open-set de-occlusion and pose estimation priors, existing methods struggle to simultaneously produce high-quality geometry and accurate poses under severe occlusion and open-set settings. To address these issues, we first decouple the de-occlusion model from 3D object generation, and enhance it by leveraging image datasets and collected de-occlusion datasets for much more diverse open-set occlusion patterns. Then, we propose a unified pose estimation model that integrates global and local mechanisms for both self-attention and cross-attention to improve accuracy. Besides, we construct an open-set 3D scene dataset to further extend the generalization of the pose estimation model. Comprehensive experiments demonstrate the superiority of our decoupled framework on both indoor and open-set scenes. Our code and datasets are released at this https URL.

4. 【2512.10956】Empowering Dynamic Urban Navigation with Stereo and Mid-Level Vision

Link: https://arxiv.org/abs/2512.10956

Authors: Wentao Zhou, Xuweiyi Chen, Vignesh Rajagopal, Jeffrey Chen, Rohan Chandra, Zezhou Cheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: robot navigation foundation, foundation models, mid-level vision, vision, navigation foundation models

Comments: Project Page: [this https URL](https://www.cs.virginia.edu/~tsx4zn/stereowalk/)

Abstract:The success of foundation models in language and vision motivated research in fully end-to-end robot navigation foundation models (NFMs). NFMs directly map monocular visual input to control actions and ignore mid-level vision modules (tracking, depth estimation, etc.) entirely. While the assumption that vision capabilities will emerge implicitly is compelling, it requires large amounts of pixel-to-action supervision that are difficult to obtain. The challenge is especially pronounced in dynamic and unstructured settings, where robust navigation requires precise geometric and dynamic understanding, while the depth-scale ambiguity in monocular views further limits accurate spatial reasoning. In this paper, we show that relying on monocular vision and ignoring mid-level vision priors is inefficient. We present StereoWalker, which augments NFMs with stereo inputs and explicit mid-level vision such as depth estimation and dense pixel tracking. Our intuition is straightforward: stereo inputs resolve the depth-scale ambiguity, and modern mid-level vision models provide reliable geometric and motion structure in dynamic scenes. We also curate a large stereo navigation dataset with automatic action annotation from Internet stereo videos to support training of StereoWalker and to facilitate future research. Through our experiments, we find that mid-level vision enables StereoWalker to achieve performance comparable to the state of the art using only 1.5% of the training data, and to surpass the state of the art using the full data. We also observe that stereo vision yields higher navigation performance than monocular input.

5. 【2512.10955】Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization

Link: https://arxiv.org/abs/2512.10955

Authors: Tsai-Shien Chen, Aliaksandr Siarohin, Guocheng Gordon Qian, Kuan-Chieh Jackson Wang, Egor Nemchinov, Moayed Haji-Ali, Riza Alp Guler, Willi Menapace, Ivan Skorokhodov, Anil Kag, Jun-Yan Zhu, Sergey Tulyakov

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: concept personalization aims, Visual concept personalization, unseen contexts, aims to transfer, transfer only specific

Comments: Project page: [this https URL](https://snap-research.github.io/omni-attribute)

Abstract:Visual concept personalization aims to transfer only specific image attributes, such as identity, expression, lighting, and style, into unseen contexts. However, existing methods rely on holistic embeddings from general-purpose image encoders, which entangle multiple visual factors and make it difficult to isolate a single attribute. This often leads to information leakage and incoherent synthesis. To address this limitation, we introduce Omni-Attribute, the first open-vocabulary image attribute encoder designed to learn high-fidelity, attribute-specific representations. Our approach jointly designs the data and model: (i) we curate semantically linked image pairs annotated with positive and negative attributes to explicitly teach the encoder what to preserve or suppress; and (ii) we adopt a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement. The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation, achieving state-of-the-art performance across multiple benchmarks.

6. 【2512.10954】Group Diffusion: Enhancing Image Generation by Unlocking Cross-Sample Collaboration

Link: https://arxiv.org/abs/2512.10954

Authors: Sicheng Mo, Thao Nguyen, Richard Zhang, Nick Kolkin, Siddharth Srinivasan Iyer, Eli Shechtman, Krishna Kumar Singh, Yong Jae Lee, Bolei Zhou, Yuheng Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: explore an untapped, untapped signal, diffusion model inference, diffusion model, model inference

Comments: Project Page: [this https URL](https://sichengmo.github.io/GroupDiff/)

Abstract:In this work, we explore an untapped signal in diffusion model inference. While all previous methods generate images independently at inference, we instead ask if samples can be generated collaboratively. We propose Group Diffusion, unlocking the attention mechanism to be shared across images, rather than limited to just the patches within an image. This enables images to be jointly denoised at inference time, learning both intra and inter-image correspondence. We observe a clear scaling effect - larger group sizes yield stronger cross-sample attention and better generation quality. Furthermore, we introduce a qualitative measure to capture this behavior and show that its strength closely correlates with FID. Built on standard diffusion transformers, our GroupDiff achieves up to 32.2% FID improvement on ImageNet-256x256. Our work reveals cross-sample inference as an effective, previously unexplored mechanism for generative modeling.

7. 【2512.10953】Bidirectional Normalizing Flow: From Data to Noise and Back

Link: https://arxiv.org/abs/2512.10953

Authors: Yiyang Lu, Qiao Sun, Xianbang Wang, Zhicheng Jiang, Hanhong Zhao, Kaiming He

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Bidirectional Normalizing Flow, generative modeling, reverse process, Normalizing Flows, process

Comments: Tech report

Abstract:Normalizing Flows (NFs) have been established as a principled framework for generative modeling. Standard NFs consist of a forward process and a reverse process: the forward process maps data to noise, while the reverse process generates samples by inverting it. Typical NF forward transformations are constrained by explicit invertibility, ensuring that the reverse process can serve as their exact analytic inverse. Recent developments in TARFlow and its variants have revitalized NF methods by combining Transformers and autoregressive flows, but have also exposed causal decoding as a major bottleneck. In this work, we introduce Bidirectional Normalizing Flow ($\textbf{BiFlow}$), a framework that removes the need for an exact analytic inverse. BiFlow learns a reverse model that approximates the underlying noise-to-data inverse mapping, enabling more flexible loss functions and architectures. Experiments on ImageNet demonstrate that BiFlow, compared to its causal decoding counterpart, improves generation quality while accelerating sampling by up to two orders of magnitude. BiFlow yields state-of-the-art results among NF-based methods and competitive performance among single-evaluation ("1-NFE") methods. Following recent encouraging progress on NFs, we hope our work will draw further attention to this classical paradigm.

8. 【2512.10950】E-RayZer: Self-supervised 3D Reconstruction as Spatial Visual Pre-training

Link: https://arxiv.org/abs/2512.10950

Authors: Qitao Zhao, Hao Tan, Qianqian Wang, Sai Bi, Kai Zhang, Kalyan Sunkavalli, Shubham Tulsiani, Hanwen Jiang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains largely unexplored, revolutionized foundation models, revolutionized foundation, remains largely, largely unexplored

Comments: Project website: [this https URL](https://qitaozhao.github.io/E-RayZer)

Abstract:Self-supervised pre-training has revolutionized foundation models for languages, individual 2D images and videos, but remains largely unexplored for learning 3D-aware representations from multi-view images. In this paper, we present E-RayZer, a self-supervised large 3D Vision model that learns truly 3D-aware representations directly from unlabeled images. Unlike prior self-supervised methods such as RayZer that infer 3D indirectly through latent-space view synthesis, E-RayZer operates directly in 3D space, performing self-supervised 3D reconstruction with Explicit geometry. This formulation eliminates shortcut solutions and yields representations that are geometrically grounded. To ensure convergence and scalability, we introduce a novel fine-grained learning curriculum that organizes training from easy to hard samples and harmonizes heterogeneous data sources in an entirely unsupervised manner. Experiments demonstrate that E-RayZer significantly outperforms RayZer on pose estimation and matches or sometimes surpasses fully supervised reconstruction models such as VGGT. Furthermore, its learned representations outperform leading visual pre-training models (e.g., DINOv3, CroCo v2, VideoMAE V2, and RayZer) when transferring to 3D downstream tasks, establishing E-RayZer as a new paradigm for 3D-aware visual pre-training.

9. 【2512.10949】Are We Ready for RL in Text-to-3D Generation? A Progressive Investigation

Link: https://arxiv.org/abs/2512.10949

Authors: Yiwen Tang, Zoey Guo, Kaixin Zhu, Ray Zhang, Qizhi Chen, Dongzhi Jiang, Junli Liu, Bohan Zeng, Haoming Song, Delin Qu, Tianyi Bai, Dan Xu, Wentao Zhang, Bin Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Reinforcement learning, image generation recently, earlier proven, extended to enhance, effective in large

Comments: Code is released at [this https URL](https://github.com/Ivan-Tang-3D/3DGen-R1)

Abstract:Reinforcement learning (RL), earlier proven to be effective in large language and multi-modal models, has been successfully extended to enhance 2D image generation recently. However, applying RL to 3D generation remains largely unexplored due to the higher spatial complexity of 3D objects, which require globally consistent geometry and fine-grained local textures. This makes 3D generation significantly sensitive to reward designs and RL algorithms. To address these challenges, we conduct the first systematic study of RL for text-to-3D autoregressive generation across several dimensions. (1) Reward designs: We evaluate reward dimensions and model choices, showing that alignment with human preference is crucial, and that general multi-modal models provide robust signal for 3D attributes. (2) RL algorithms: We study GRPO variants, highlighting the effectiveness of token-level optimization, and further investigate the scaling of training data and iterations. (3) Text-to-3D Benchmarks: Since existing benchmarks fail to measure implicit reasoning abilities in 3D generation models, we introduce MME-3DR. (4) Advanced RL paradigms: Motivated by the natural hierarchy of 3D generation, we propose Hi-GRPO, which optimizes the global-to-local hierarchical 3D generation through dedicated reward ensembles. Based on these insights, we develop AR3D-R1, the first RL-enhanced text-to-3D model, expert from coarse shape to texture refinement. We hope this study provides insights into RL-driven reasoning for 3D generation. Code is released at this https URL.

10. 【2512.10948】ClusIR: Towards Cluster-Guided All-in-One Image Restoration

Link: https://arxiv.org/abs/2512.10948

Authors: Shengkai Hu, Jiaqi Ma, Jun Wan, Wenwen Min, Yongcheng Jing, Lefei Zhang, Dacheng Tao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recover high-quality images, aims to recover, Image Restoration, recover high-quality, Image Restoration framework

Abstract:All-in-One Image Restoration (AiOIR) aims to recover high-quality images from diverse degradations within a unified framework. However, existing methods often fail to explicitly model degradation types and struggle to adapt their restoration behavior to complex or mixed degradations. To address these issues, we propose ClusIR, a Cluster-Guided Image Restoration framework that explicitly models degradation semantics through learnable clustering and propagates cluster-aware cues across spatial and frequency domains for adaptive restoration. Specifically, ClusIR comprises two key components: a Probabilistic Cluster-Guided Routing Mechanism (PCGRM) and a Degradation-Aware Frequency Modulation Module (DAFMM). The proposed PCGRM disentangles degradation recognition from expert activation, enabling discriminative degradation perception and stable expert routing. Meanwhile, DAFMM leverages the cluster-guided priors to perform adaptive frequency decomposition and targeted modulation, collaboratively refining structural and textural representations for higher restoration fidelity. The cluster-guided synergy seamlessly bridges semantic cues with frequency-domain modulation, empowering ClusIR to attain remarkable restoration results across a wide range of degradations. Extensive experiments on diverse benchmarks validate that ClusIR reaches competitive performance under several scenarios.

11. 【2512.10947】Towards Efficient and Effective Multi-Camera Encoding for End-to-End Driving

Link: https://arxiv.org/abs/2512.10947

Authors: Jiawei Yang, Ziyu Chen, Yurong You, Yan Wang, Yiming Li, Yuxiao Chen, Boyi Li, Boris Ivanovic, Marco Pavone, Yue Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: processing high-volume multi-camera, high-volume multi-camera data, encoder that addresses, addresses the computational, computational bottleneck

Comments: Project Page: [this https URL](https://jiawei-yang.github.io/Flex/)

Abstract:We present Flex, an efficient and effective scene encoder that addresses the computational bottleneck of processing high-volume multi-camera data in end-to-end autonomous driving. Flex employs a small set of learnable scene tokens to jointly encode information from all image tokens across different cameras and timesteps. By design, our approach is geometry-agnostic, learning a compact scene representation directly from data without relying on the explicit 3D inductive biases, such as Bird-Eye-View (BEV), occupancy or tri-plane representations, which are common in prior work. This holistic encoding strategy aggressively compresses the visual input for the downstream Large Language Model (LLM) based policy model. Evaluated on a large-scale proprietary dataset of 20,000 driving hours, our Flex achieves 2.2x greater inference throughput while improving driving performance by a large margin compared to state-of-the-art methods. Furthermore, we show that these compact scene tokens develop an emergent capability for scene decomposition without any explicit supervision. Our findings challenge the prevailing assumption that 3D priors are necessary, demonstrating that a data-driven, joint encoding strategy offers a more scalable, efficient and effective path for future autonomous driving systems.

12. 【2512.10945】MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation

Link: https://arxiv.org/abs/2512.10945

Authors: Henghui Ding, Chang Liu, Shuting He, Kaining Ying, Xudong Jiang, Chen Change Loy, Yu-Gang Jiang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large-scale multi-modal dataset, referring motion expression, video object segmentation, video, motion

Comments: IEEE TPAMI, Project Page: [this https URL](https://henghuiding.com/MeViS/)

Abstract:This paper proposes a large-scale multi-modal dataset for referring motion expression video segmentation, focusing on segmenting and tracking target objects in videos based on language description of objects' motions. Existing referring video segmentation datasets often focus on salient objects and use language expressions rich in static attributes, potentially allowing the target object to be identified in a single frame. Such datasets underemphasize the role of motion in both videos and languages. To explore the feasibility of using motion expressions and motion reasoning clues for pixel-level video understanding, we introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos of complex scenarios. We benchmark 15 existing methods across 4 tasks supported by MeViS, including 6 referring video object segmentation (RVOS) methods, 3 audio-guided video object segmentation (AVOS) methods, 2 referring multi-object tracking (RMOT) methods, and 4 video captioning methods for the newly introduced referring motion expression generation (RMEG) task. The results demonstrate weaknesses and limitations of existing methods in addressing motion expression-guided video understanding. We further analyze the challenges and propose an approach LMPM++ for RVOS/AVOS/RMOT that achieves new state-of-the-art results. Our dataset provides a platform that facilitates the development of motion expression-guided video understanding algorithms in complex video scenes. The proposed MeViS dataset and the method's source code are publicly available at this https URL

13. 【2512.10943】AlcheMinT: Fine-grained Temporal Control for Multi-Reference Consistent Video Generation

Link: https://arxiv.org/abs/2512.10943

Authors: Sharath Girish, Viacheslav Ivanov, Tsai-Shien Chen, Hao Chen, Aliaksandr Siarohin, Sergey Tulyakov

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: enabled personalized content, Recent advances, personalized content synthesis, content synthesis conditioned, large diffusion models

Comments: Project page: [this https URL](https://snap-research.github.io/Video-AlcheMinT)

Abstract:Recent advances in subject-driven video generation with large diffusion models have enabled personalized content synthesis conditioned on user-provided subjects. However, existing methods lack fine-grained temporal control over subject appearance and disappearance, which are essential for applications such as compositional video synthesis, storyboarding, and controllable animation. We propose AlcheMinT, a unified framework that introduces explicit timestamps conditioning for subject-driven video generation. Our approach introduces a novel positional encoding mechanism that unlocks the encoding of temporal intervals, associated in our case with subject identities, while seamlessly integrating with the pretrained video generation model positional embeddings. Additionally, we incorporate subject-descriptive text tokens to strengthen binding between visual identity and video captions, mitigating ambiguity during generation. Through token-wise concatenation, AlcheMinT avoids any additional cross-attention modules and incurs negligible parameter overhead. We establish a benchmark evaluating multiple subject identity preservation, video fidelity, and temporal adherence. Experimental results demonstrate that AlcheMinT achieves visual quality matching state-of-the-art video personalization methods, while, for the first time, enabling precise temporal control over multi-subject generation within videos. Project page is at this https URL

14. 【2512.10942】VL-JEPA: Joint Embedding Predictive Architecture for Vision-language

Link: https://arxiv.org/abs/2512.10942

Authors: Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Allen Bolourchi, Yann LeCun, Pascale Fung

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Joint Embedding Predictive, Embedding Predictive Architecture, vision-language model built, Predictive Architecture, Joint Embedding

Abstract:We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA predicted embeddings into text. We show that VL-JEPA natively supports selective decoding that reduces the number of decoding operations by 2.85x while maintaining similar performance compared to non-adaptive uniform decoding. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves performance comparable to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets: GQA, TallyQA, POPE and POPEv2, despite only having 1.6B parameters.

15. 【2512.10941】Mull-Tokens: Modality-Agnostic Latent Thinking

Link: https://arxiv.org/abs/2512.10941

Authors: Arijit Ray, Ahmed Abdelkader, Chengzhi Mao, Bryan A. Plummer, Kate Saenko, Ranjay Krishna, Leonidas Guibas, Wen-Sheng Chu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: real world requires, world requires reasoning, real world, world requires, Reasoning

Comments: Project webpage: [this https URL](https://arijitray.com/multimodal_thinking/)

Abstract: Reasoning goes beyond language; the real world requires reasoning about space, time, affordances, and much more that words alone cannot convey. Existing multimodal models exploring the potential of reasoning with images are brittle and do not scale. They rely on calling specialist tools, costly generation of images, or handcrafted reasoning data to switch between text and image thoughts. Instead, we offer a simpler alternative, Mull-Tokens: modality-agnostic latent tokens pre-trained to hold intermediate information in either image or text modalities, letting the model think free-form towards the correct answer. We investigate best practices to train Mull-Tokens inspired by latent reasoning frameworks. We first train Mull-Tokens using supervision from interleaved text-image traces, and then fine-tune without any supervision by only using the final answers. Across four challenging spatial reasoning benchmarks involving tasks such as solving puzzles and taking different perspectives, we demonstrate that Mull-Tokens improve upon several baselines utilizing text-only reasoning or interleaved image-text reasoning, achieving a +3% average improvement and up to +16% on a reasoning-heavy puzzle-solving split compared to our strongest baseline. Adding to conversations around challenges in grounding textual and visual reasoning, Mull-Tokens offers a simple solution to abstractly think in multiple modalities.

16. 【2512.10940】OmniView: An All-Seeing Diffusion Model for 3D and 4D View Synthesis

Link: https://arxiv.org/abs/2512.10940

Authors: Xiang Fan, Sharath Girish, Vivek Ramanujan, Chaoyang Wang, Ashkan Mirzaei, Petr Sushko, Aliaksandr Siarohin, Sergey Tulyakov, Ranjay Krishna

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Prior approaches injecting, Prior approaches, consistency tasks, focused on specific, specific subsets

Comments: Project page: [this https URL](https://snap-research.github.io/OmniView/)

Abstract: Prior approaches injecting camera control into diffusion models have focused on specific subsets of 4D consistency tasks: novel view synthesis, text-to-video with camera control, image-to-video, among others. As a result, these fragmented approaches are trained on disjoint slices of available 3D/4D data. We introduce OmniView, a unified framework that generalizes across a wide range of 4D consistency tasks. Our method separately represents space, time, and view conditions, enabling flexible combinations of these inputs. For example, OmniView can synthesize novel views from static, dynamic, and multiview inputs, extrapolate trajectories forward and backward in time, and create videos from text or image prompts with full camera control. OmniView is competitive with task-specific models across diverse benchmarks and metrics, improving image quality scores among camera-conditioned diffusion models by up to 33% on the multiview NVS LLFF dataset, 60% on the dynamic NVS Neural 3D Video benchmark, and 20% in static camera control on RE-10K, and reducing camera trajectory errors by 4x in text-conditioned video generation. With strong generalizability in one model, OmniView demonstrates the feasibility of a generalist 4D video model. Project page is available at this https URL

17. 【2512.10939】GaussianHeadTalk: Wobble-Free 3D Talking Heads with Audio Driven Gaussian Splatting

Link: https://arxiv.org/abs/2512.10939

Authors: Madhav Agarwal, Mingtian Zhang, Laura Sevilla-Lara, Steven McDonagh

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Speech-driven talking heads, Speech-driven talking, recently emerged, Speech-driven, Gaussian Splatting

Comments: IEEE/CVF Winter Conference on Applications of Computer Vision 2026

Abstract: Speech-driven talking heads have recently emerged and enable interactive avatars. However, real-world applications are limited, as current methods either achieve high visual fidelity but are slow, or are fast yet temporally unstable. Diffusion methods provide realistic image generation, yet struggle with one-shot settings. Gaussian Splatting approaches are real-time, yet inaccuracies in facial tracking, or inconsistent Gaussian mappings, lead to unstable outputs and video artifacts that are detrimental to realistic use cases. We address this problem by mapping Gaussian Splatting using 3D Morphable Models to generate person-specific avatars. We introduce transformer-based prediction of model parameters, directly from audio, to drive temporal consistency. From monocular video and independent audio speech inputs, our method enables generation of real-time talking head videos, for which we report competitive quantitative and qualitative performance.

18. 【2512.10938】Stronger Normalization-Free Transformers

Link: https://arxiv.org/abs/2512.10938

Authors: Mingzhi Chen, Taiming Lu, Jiachen Zhu, Mingjie Sun, Zhuang Liu

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Dynamic Tanh, introduction of Dynamic, deep learning architectures, normalization layers, layers have long

Abstract: Although normalization layers have long been viewed as indispensable components of deep learning architectures, the recent introduction of Dynamic Tanh (DyT) has demonstrated that alternatives are possible. The point-wise function DyT constrains extreme values for stable convergence and reaches normalization-level performance; this work searches for function designs that can surpass it. We first study how the intrinsic properties of point-wise functions influence training and performance. Building on these findings, we conduct a large-scale search for a more effective function design. Through this exploration, we introduce $\mathrm{Derf}(x) = \mathrm{erf}(\alpha x + s)$, where $\mathrm{erf}(x)$ is the rescaled Gaussian cumulative distribution function, and identify it as the most performant design. Derf outperforms LayerNorm, RMSNorm, and DyT across a wide range of domains, including vision (image recognition and generation), speech representation, and DNA sequence modeling. Our findings suggest that the performance gains of Derf largely stem from its improved generalization rather than stronger fitting capacity. Its simplicity and stronger performance make Derf a practical choice for normalization-free Transformer architectures.
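
The point-wise function in the abstract above, Derf(x) = erf(αx + s), can be illustrated with a minimal Python sketch. This is only an illustration of the formula as stated, not the authors' implementation: the names `derf` and `derf_layer` are hypothetical, and `alpha` and `s` are treated here as plain floats, whereas in a real model they would presumably be trained parameters (an assumption, since the abstract does not say).

```python
import math


def derf(x, alpha=1.0, s=0.0):
    """Evaluate Derf(x) = erf(alpha * x + s) for one scalar.

    alpha and s are plain floats in this sketch (an assumption);
    the abstract does not state how they are set or learned.
    """
    return math.erf(alpha * x + s)


def derf_layer(values, alpha=1.0, s=0.0):
    """Apply Derf element-wise to a list of activations (hypothetical helper)."""
    return [derf(v, alpha, s) for v in values]


# Extreme inputs are squashed into [-1, 1], which is the
# "constrains extreme values" property the abstract attributes to DyT.
activations = [-100.0, -1.0, 0.0, 1.0, 100.0]
outputs = derf_layer(activations, alpha=0.5)
```

Because erf is bounded and monotonic, the outputs keep the ordering of the inputs while staying within [-1, 1].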

19. 【2512.10935】Any4D: Unified Feed-Forward Metric 4D Reconstruction

Link: https://arxiv.org/abs/2512.10935

Authors: Jay Karhade, Nikhil Keetha, Yuchen Zhang, Tanisha Gupta, Akash Sharma, Sebastian Scherer, Deva Ramanan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: scalable multi-view transformer, transformer for metric-scale, dense feed-forward, scalable multi-view, multi-view transformer

Comments: Project Website: [this https URL](https://any-4d.github.io/)

Abstract:We present Any4D, a scalable multi-view transformer for metric-scale, dense feed-forward 4D reconstruction. Any4D directly generates per-pixel motion and geometry predictions for N frames, in contrast to prior work that typically focuses on either 2-view dense scene flow or sparse 3D point tracking. Moreover, unlike other recent methods for 4D reconstruction from monocular RGB videos, Any4D can process additional modalities and sensors such as RGB-D frames, IMU-based egomotion, and Radar Doppler measurements, when available. One of the key innovations that allows for such a flexible framework is a modular representation of a 4D scene; specifically, per-view 4D predictions are encoded using a variety of egocentric factors (depthmaps and camera intrinsics) represented in local camera coordinates, and allocentric factors (camera extrinsics and scene flow) represented in global world coordinates. We achieve superior performance across diverse setups - both in terms of accuracy (2-3X lower error) and compute efficiency (15X faster), opening avenues for multiple downstream applications.

20. 【2512.10932】BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models

Link: https://arxiv.org/abs/2512.10932

Authors: Shengao Wang, Wenqi Wang, Zecheng Wang, Max Whitton, Michael Wakeham, Arjun Chandra, Joey Huang, Pengyue Zhu, Helen Chen, David Li, Jeffrey Li, Shawn Li, Andrew Zagula, Amy Zhao, Andrew Zhu, Sayaka Nakamura, Yuki Yamamoto, Jerry Jun Yokono, Aaron Mueller, Bryan A. Plummer, Kate Saenko, Venkatesh Saligrama, Boqing Gong

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: developmental trajectories set, children developmental trajectories, Early children developmental, developmental trajectories, natural goal

Abstract:Early children's developmental trajectories set up a natural goal for sample-efficient pretraining of vision foundation models. We introduce BabyVLM-V2, a developmentally grounded framework for infant-inspired vision-language modeling that extensively improves upon BabyVLM-V1 through a longitudinal, multifaceted pretraining set, a versatile model, and, most importantly, DevCV Toolbox for cognitive evaluation. The pretraining set maximizes coverage while minimizing curation of a longitudinal, infant-centric audiovisual corpus, yielding video-utterance, image-utterance, and multi-turn conversational data that mirror infant experiences. DevCV Toolbox adapts all vision-related measures of the recently released NIH Baby Toolbox into a benchmark suite of ten multimodal tasks, covering spatial reasoning, memory, and vocabulary understanding aligned with early children's capabilities. Experimental results show that a compact model pretrained from scratch can achieve competitive performance on DevCV Toolbox, outperforming GPT-4o on some tasks. We hope the principled, unified BabyVLM-V2 framework will accelerate research in developmentally plausible pretraining of vision foundation models.

21. 【2512.10927】FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos

Link: https://arxiv.org/abs/2512.10927

Authors: Yulu Gan, Ligeng Zhu, Dandan Shan, Baifeng Shi, Hongxu Yin, Boris Ivanovic, Song Han, Trevor Darrell, Jitendra Malik, Marco Pavone, Boyi Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: predict future states, Motion, future states, fundamental to physical, infer dynamics

Comments: Code is available at [this https URL](https://github.com/Wolfv0/FoundationMotion/tree/main)

Abstract:Motion understanding is fundamental to physical reasoning, enabling models to infer dynamics and predict future states. However, state-of-the-art models still struggle on recent motion benchmarks, primarily due to the scarcity of large-scale, fine-grained motion datasets. Existing motion datasets are often constructed from costly manual annotation, severely limiting scalability. To address this challenge, we introduce FoundationMotion, a fully automated data curation pipeline that constructs large-scale motion datasets. Our approach first detects and tracks objects in videos to extract their trajectories, then leverages these trajectories and video frames with Large Language Models (LLMs) to generate fine-grained captions and diverse question-answer pairs about motion and spatial reasoning. Using datasets produced by this pipeline, we fine-tune open-source models including NVILA-Video-15B and Qwen2.5-7B, achieving substantial improvements in motion understanding without compromising performance on other tasks. Notably, our models outperform strong closed-source baselines like Gemini-2.5 Flash and large open-source models such as Qwen2.5-VL-72B across diverse motion understanding datasets and benchmarks. FoundationMotion thus provides a scalable solution for curating fine-grained motion datasets that enable effective fine-tuning of diverse models to enhance motion understanding and spatial reasoning capabilities.

22. 【2512.10894】DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance

Link: https://arxiv.org/abs/2512.10894

Authors: Peiying Zhang, Nanxuan Zhao, Matthew Fisher, Yiran Xu, Jing Liao, Difan Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent vision-language model, achieved impressive results, Recent vision-language, based approaches, approaches have achieved

Comments: Project page: [this https URL](https://intchous.github.io/DuetSVG-site)

Abstract:Recent vision-language model (VLM)-based approaches have achieved impressive results on SVG generation. However, because they generate only text and lack visual signals during decoding, they often struggle with complex semantics and fail to produce visually appealing or geometrically coherent SVGs. We introduce DuetSVG, a unified multimodal model that jointly generates image tokens and corresponding SVG tokens in an end-to-end manner. DuetSVG is trained on both image and SVG datasets. At inference, we apply a novel test-time scaling strategy that leverages the model's native visual predictions as guidance to improve SVG decoding quality. Extensive experiments show that our method outperforms existing methods, producing visually faithful, semantically aligned, and syntactically clean SVGs across a wide range of applications.

23. 【2512.10888】PubTables-v2: A new large-scale dataset for full-page and multi-page table extraction

Link: https://arxiv.org/abs/2512.10888

Authors: Brandon Smock, Valerie Faucon-Morin, Max Sokolov, Libin Liang, Tayyibah Khanam, Maury Courtland

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: visual document understanding, key challenge, challenge in visual, document understanding, Table

Comments: 15 pages, 7 figures

Abstract:Table extraction (TE) is a key challenge in visual document understanding. Traditional approaches detect tables first, then recognize their structure. Recently, interest has surged in developing methods, such as vision-language models (VLMs), that can extract tables directly in their full page or document context. However, progress has been difficult to demonstrate due to a lack of annotated data. To address this, we create a new large-scale dataset, PubTables-v2. PubTables-v2 supports a number of current challenging table extraction tasks. Notably, it is the first large-scale benchmark for multi-page table structure recognition. We demonstrate its usefulness by evaluating domain-specialized VLMs on these tasks and highlighting current progress. Finally, we use PubTables-v2 to create the Page-Object Table Transformer (POTATR), an image-to-graph extension of the Table Transformer to comprehensive page-level TE. Data, code, and trained models will be released.

24. 【2512.10881】MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos

Link: https://arxiv.org/abs/2512.10881

Authors: Kehong Gong, Zhengyu Wen, Weixia He, Mingxi Xu, Qi Wang, Ning Zhang, Zhengyu Li, Dongze Lian, Wei Zhao, Xiaoyu He, Mingyuan Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: pipelines remain species, underpins content creation, existing pipelines remain, Motion capture, digital humans

Comments: Project page: [this https URL](https://animotionlab.github.io/MoCapAnything/)

Abstract:Motion capture now underpins content creation far beyond digital humans, yet most existing pipelines remain species- or template-specific. We formalize this gap as Category-Agnostic Motion Capture (CAMoCap): given a monocular video and an arbitrary rigged 3D asset as a prompt, the goal is to reconstruct a rotation-based animation such as BVH that directly drives the specific asset. We present MoCapAnything, a reference-guided, factorized framework that first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware inverse kinematics. The system contains three learnable modules and a lightweight IK stage: (1) a Reference Prompt Encoder that extracts per-joint queries from the asset's skeleton, mesh, and rendered images; (2) a Video Feature Extractor that computes dense visual descriptors and reconstructs a coarse 4D deforming mesh to bridge the gap between video and joint space; and (3) a Unified Motion Decoder that fuses these cues to produce temporally coherent trajectories. We also curate Truebones Zoo with 1038 motion clips, each providing a standardized skeleton-mesh-render triad. Experiments on both in-domain benchmarks and in-the-wild videos show that MoCapAnything delivers high-quality skeletal animations and exhibits meaningful cross-species retargeting across heterogeneous rigs, enabling scalable, prompt-driven 3D motion capture for arbitrary assets. Project page: this https URL

25. 【2512.10867】From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models

Link: https://arxiv.org/abs/2512.10867

Authors: Zongzhao Li, Xiangzhe Kong, Jiahui Su, Zongyang Ma, Mingze Li, Songyou Li, Yuelin Zhang, Yu Rong, Tingyang Xu, Deli Zhao, Wenbing Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Microscopic Spatial Intelligence, invisible microscopic entities, microscopic entities, invisible microscopic, Spatial Intelligence

Abstract:This paper introduces the concept of Microscopic Spatial Intelligence (MiSI), the capability to perceive and reason about the spatial relationships of invisible microscopic entities, which is fundamental to scientific discovery. To assess the potential of Vision-Language Models (VLMs) in this domain, we propose a systematic benchmark framework MiSI-Bench. This framework features over 163,000 question-answer pairs and 587,000 images derived from approximately 4,000 molecular structures, covering nine complementary tasks that evaluate abilities ranging from elementary spatial transformations to complex relational identifications. Experimental results reveal that current state-of-the-art VLMs perform significantly below human level on this benchmark. However, a fine-tuned 7B model demonstrates substantial potential, even surpassing humans in spatial transformation tasks, while its poor performance in scientifically-grounded tasks like hydrogen bond recognition underscores the necessity of integrating explicit domain knowledge for progress toward scientific AGI. The datasets are available at this https URL.

26. 【2512.10863】MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence

Link: https://arxiv.org/abs/2512.10863

Authors: Jingli Lin, Runsen Xu, Shaohao Zhu, Sihan Yang, Peizhou Cao, Yunlong Ran, Miao Hu, Chenming Zhu, Yiman Xie, Yilin Long, Wenbo Hu, Dahua Lin, Tai Wang, Jiangmiao Pang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: continuous visual input, physical environments, understanding over continuous, continuous visual, visual input

Abstract: Spatial understanding over continuous visual input is crucial for MLLMs to evolve into general-purpose assistants in physical environments. Yet there is still no comprehensive benchmark that holistically assesses the progress toward this goal. In this work, we introduce MMSI-Video-Bench, a fully human-annotated benchmark for video-based spatial intelligence in MLLMs. It operationalizes a four-level framework (Perception, Planning, Prediction, and Cross-Video Reasoning) through 1,106 questions grounded in 1,278 clips from 25 datasets and in-house videos. Each item is carefully designed and reviewed by 3DV experts with explanatory rationales to ensure precise, unambiguous grounding. Leveraging its diverse data sources and holistic task coverage, MMSI-Video-Bench also supports three domain-oriented sub-benchmarks (Indoor Scene Perception Bench, Robot Bench and Grounding Bench) for targeted capability assessment. We evaluate 25 strong open-source and proprietary MLLMs, revealing a striking human-AI gap: many models perform near chance, and the best reasoning model lags humans by nearly 60%. We further find that spatially fine-tuned models still fail to generalize effectively on our benchmark. Fine-grained error analysis exposes systematic failures in geometric reasoning, motion grounding, long-horizon prediction, and cross-video correspondence. We also show that typical frame-sampling strategies transfer poorly to our reasoning-intensive benchmark, and that neither 3D spatial cues nor chain-of-thought prompting yields meaningful gains. We expect our benchmark to establish a solid testbed for advancing video-based spatial intelligence.

27. 【2512.10860】SWiT-4D: Sliding-Window Transformer for Lossless and Parameter-Free Temporal 4D Generation

Link: https://arxiv.org/abs/2512.10860

Authors: Kehong Gong, Zhengyu Wen, Mingxi Xu, Weixia He, Qi Wang, Ning Zhang, Zhengyu Li, Chenbin Li, Dongze Lian, Wei Zhao, Xiaoyu He, Mingyuan Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: meshes remains considerably, remains considerably challenging, assets with explicit, high-quality animated, meshes remains

Comments: Project page: [this https URL](https://animotionlab.github.io/SWIT4D/)

Abstract:Despite significant progress in 4D content generation, the conversion of monocular videos into high-quality animated 3D assets with explicit 4D meshes remains considerably challenging. The scarcity of large-scale, naturally captured 4D mesh datasets further limits the ability to train generalizable video-to-4D models from scratch in a purely data-driven manner. Meanwhile, advances in image-to-3D generation, supported by extensive datasets, offer powerful prior models that can be leveraged. To better utilize these priors while minimizing reliance on 4D supervision, we introduce SWiT-4D, a Sliding-Window Transformer for lossless, parameter-free temporal 4D mesh generation. SWiT-4D integrates seamlessly with any Diffusion Transformer (DiT)-based image-to-3D generator, adding spatial-temporal modeling across video frames while preserving the original single-image forward process, enabling 4D mesh reconstruction from videos of arbitrary length. To recover global translation, we further introduce an optimization-based trajectory module tailored for static-camera monocular videos. SWiT-4D demonstrates strong data efficiency: with only a single short (10s) video for fine-tuning, it achieves high-fidelity geometry and stable temporal consistency, indicating practical deployability under extremely limited 4D supervision. Comprehensive experiments on both in-domain zoo-test sets and challenging out-of-domain benchmarks (C4D, Objaverse, and in-the-wild videos) show that SWiT-4D consistently outperforms existing baselines in temporal smoothness. Project page: this https URL

28. 【2512.10840】PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning

Link: https://arxiv.org/abs/2512.10840

Authors: Jianqi Chen, Biao Zhang, Xiangjun Tang, Peter Wonka

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains challenging, object pose estimation, pose estimation, query image, template images

Comments: Project page: [this https URL](https://windvchen.github.io/PoseGAM/)

Abstract:6D object pose estimation, which predicts the transformation of an object relative to the camera, remains challenging for unseen objects. Existing approaches typically rely on explicitly constructing feature correspondences between the query image and either the object model or template images. In this work, we propose PoseGAM, a geometry-aware multi-view framework that directly predicts object pose from a query image and multiple template images, eliminating the need for explicit matching. Built upon recent multi-view-based foundation model architectures, the method integrates object geometry information through two complementary mechanisms: explicit point-based geometry and learned features from geometry representation networks. In addition, we construct a large-scale synthetic dataset containing more than 190k objects under diverse environmental conditions to enhance robustness and generalization. Extensive evaluations across multiple benchmarks demonstrate our state-of-the-art performance, yielding an average AR improvement of 5.1% over prior methods and achieving up to 17.6% gains on individual datasets, indicating strong generalization to unseen objects. Project page: this https URL .

29. 【2512.10821】Agile Deliberation: Concept Deliberation for Subjective Visual Classification

Link: https://arxiv.org/abs/2512.10821

Authors: Leijie Wang, Otilia Stretcu, Wei Qiao, Thomas Denby, Krishnamurthy Viswanathan, Enming Luo, Chun-Ta Lu, Tushar Dogra, Ranjay Krishna, Ariel Fuxman

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)

Keywords: applications requiring vision, requiring vision classifiers, applications requiring, rapidly expanding, requiring vision

Abstract:From content moderation to content curation, applications requiring vision classifiers for visual concepts are rapidly expanding. Existing human-in-the-loop approaches typically assume users begin with a clear, stable concept understanding to be able to provide high-quality supervision. In reality, users often start with a vague idea and must iteratively refine it through "concept deliberation", a practice we uncovered through structured interviews with content moderation experts. We operationalize the common strategies in deliberation used by real content moderators into a human-in-the-loop framework called "Agile Deliberation" that explicitly supports evolving and subjective concepts. The system supports users in defining the concept for themselves by exposing them to borderline cases. The system does this with two deliberation stages: (1) concept scoping, which decomposes the initial concept into a structured hierarchy of sub-concepts, and (2) concept iteration, which surfaces semantically borderline examples for user reflection and feedback to iteratively align an image classifier with the user's evolving intent. Since concept deliberation is inherently subjective and interactive, we painstakingly evaluate the framework through 18 user sessions, each 1.5h long, rather than standard benchmarking datasets. We find that Agile Deliberation achieves 7.5% higher F1 scores than automated decomposition baselines and more than 3% higher than manual deliberation, while participants reported clearer conceptual understanding and lower cognitive effort.

30. 【2512.10818】Self-Ensemble Post Learning for Noisy Domain Generalization

Link: https://arxiv.org/abs/2512.10818

Authors: Wang Lu, Jindong Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: made great progress, data distribution shift, great progress, key issues, data distribution

Comments: 18 pages

Abstract: While computer vision and machine learning have made great progress, their robustness is still challenged by two key issues: data distribution shift and label noise. When domain generalization (DG) encounters noise, noisy labels further exacerbate the emergence of spurious features in deep layers (i.e., spurious feature enlargement), degrading the performance of existing algorithms. Starting from domain generalization, this paper explores how to make existing methods work again when they encounter noise. We find that the latent features inside the model have certain discriminative capabilities, and that different latent features focus on different parts of the image. Based on these observations, we propose the Self-Ensemble Post Learning approach (SEPL) to diversify the features that can be leveraged. Specifically, SEPL consists of two parts: feature probing training and prediction ensemble inference. It leverages intermediate feature representations within the model architecture, training multiple probing classifiers to fully exploit the capabilities of pre-trained models, while the final predictions are obtained by integrating the outputs of these diverse classification heads. Considering the presence of noisy labels, we employ semi-supervised algorithms to train the probing classifiers. Given that different probing classifiers focus on different areas, we integrate their predictions using a crowdsourcing inference approach. Extensive experimental evaluations demonstrate that the proposed method not only enhances the robustness of existing methods but also exhibits significant potential for real-world applications with high flexibility.

31. 【2512.10817】Extrapolation of Periodic Functions Using Binary Encoding of Continuous Numerical Values

Link: https://arxiv.org/abs/2512.10817

Authors: Brian P. Powell, Jordan A. Caraballo-Vega, Mark L. Carroll, Thomas Maxwell, Andrew Ptak, Greg Olmschenk, Jorge Martinez-Palomera

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Keywords: training bounds, extrapolate periodic functions, report the discovery, discovery that binary, neural networks

Comments: Submitted to JMLR, under review

Abstract:We report the discovery that binary encoding allows neural networks to extrapolate periodic functions beyond their training bounds. We introduce Normalized Base-2 Encoding (NB2E) as a method for encoding continuous numerical values and demonstrate that, using this input encoding, vanilla multi-layer perceptrons (MLP) successfully extrapolate diverse periodic signals without prior knowledge of their functional form. Internal activation analysis reveals that NB2E induces bit-phase representations, enabling MLPs to learn and extrapolate signal structure independently of position.
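
The abstract above does not spell out how NB2E is defined, so the following Python sketch shows only one plausible reading of "binary encoding of continuous numerical values": scale a value into [0, 1) and keep the leading bits of its binary fraction. The function name `binary_fraction_bits`, the range arguments `lo` and `hi`, and the bit count are all assumptions made for illustration, and the paper's actual encoding may differ.

```python
def binary_fraction_bits(x, lo, hi, num_bits=8):
    """Encode x in [lo, hi) as the leading bits of its normalized binary fraction.

    Hypothetical illustration only: the normalization range and bit count
    are assumptions, not details taken from the paper.
    """
    if not lo <= x < hi:
        raise ValueError("x must lie in [lo, hi)")
    frac = (x - lo) / (hi - lo)  # normalized to [0, 1)
    bits = []
    for _ in range(num_bits):
        frac *= 2.0
        bit = int(frac)  # integer part is the next binary digit (0 or 1)
        bits.append(bit)
        frac -= bit
    return bits


# Two inputs half a range apart differ only in the first bit;
# the lower-order bits repeat, like a signal with a shorter period.
code_a = binary_fraction_bits(0.125, 0.0, 1.0, num_bits=4)
code_b = binary_fraction_bits(0.625, 0.0, 1.0, num_bits=4)
```

Under this reading, each successive bit toggles twice as often as the one before it as the input sweeps its range, which is one way to picture the "bit-phase representations" the abstract mentions.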

32. 【2512.10808】Graph Laplacian Transformer with Progressive Sampling for Prostate Cancer Grading

Link: https://arxiv.org/abs/2512.10808

Authors: Masum Shah Junayed, John Derek Van Vessem, Qian Wan, Gahie Nam, Sheida Nabavi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Prostate cancer grading, challenging task due, Prostate cancer, selecting diagnostically relevant, Iterative Refinement Module

Comments: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2025

Abstract: Prostate cancer grading from whole-slide images (WSIs) remains a challenging task due to the large-scale nature of WSIs, the presence of heterogeneous tissue structures, and the difficulty of selecting diagnostically relevant regions. Existing approaches often rely on random or static patch selection, leading to the inclusion of redundant or non-informative regions that degrade performance. To address this, we propose a Graph Laplacian Attention-Based Transformer (GLAT) integrated with an Iterative Refinement Module (IRM) to enhance both feature learning and spatial consistency. The IRM iteratively refines patch selection by leveraging a pretrained ResNet50 for local feature extraction and a foundation model in no-gradient mode for importance scoring, ensuring only the most relevant tissue regions are preserved. The GLAT models tissue-level connectivity by constructing a graph where patches serve as nodes, ensuring spatial consistency through graph Laplacian constraints and refining feature representations via a learnable filtering mechanism that enhances discriminative histological structures. Additionally, a convex aggregation mechanism dynamically adjusts patch importance to generate a robust WSI-level representation. Extensive experiments on five public datasets and one private dataset demonstrate that our model outperforms state-of-the-art methods, achieving higher performance and spatial consistency while maintaining computational efficiency.
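
To make the "graph Laplacian constraints" mentioned above more concrete, here is a minimal Python sketch of the standard unnormalized graph Laplacian L = D - A and the smoothness term fᵀLf, which equals the sum of squared feature differences across edges. It is a textbook construction, not the GLAT architecture: the helper names `laplacian` and `smoothness`, the toy three-patch graph, and the scalar per-patch features are all assumptions for illustration.

```python
def laplacian(num_nodes, edges):
    """Build the unnormalized Laplacian L = D - A of an undirected graph."""
    L = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:
        L[i][i] += 1.0  # degree of node i
        L[j][j] += 1.0  # degree of node j
        L[i][j] -= 1.0  # adjacency entries, negated
        L[j][i] -= 1.0
    return L


def smoothness(L, f):
    """Return f^T L f, the sum of (f_i - f_j)^2 over all edges."""
    n = len(f)
    Lf = [sum(L[i][j] * f[j] for j in range(n)) for i in range(n)]
    return sum(f[i] * Lf[i] for i in range(n))


# Toy example: three neighbouring patches in a row (hypothetical layout),
# each with one scalar feature.
L = laplacian(3, [(0, 1), (1, 2)])
penalty = smoothness(L, [0.0, 1.0, 3.0])   # (0-1)^2 + (1-3)^2
flat = smoothness(L, [2.0, 2.0, 2.0])      # identical neighbours
```

A small value of this term means adjacent patches carry similar features, which is the usual sense in which a Laplacian penalty encourages spatial consistency.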

33. 【2512.10805】Interpretable and Steerable Concept Bottleneck Sparse Autoencoders

Link: https://arxiv.org/abs/2512.10805

Authors: Akshay Kulkarni, Tsui-Wei Weng, Vivek Narayanaswamy, Shusen Liu, Wesam A. Sakla, Kowshik Thopalli

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Bottleneck Sparse Autoencoders, Sparse autoencoders, promise a unified, unified approach, approach for mechanistic

Abstract: Sparse autoencoders (SAEs) promise a unified approach for mechanistic interpretability, concept discovery, and model steering in LLMs and LVLMs. However, realizing this potential requires that the learned features be both interpretable and steerable. To that end, we introduce two new computationally inexpensive interpretability and steerability metrics and conduct a systematic analysis on LVLMs. Our analysis uncovers two observations: (i) a majority of SAE neurons exhibit either low interpretability or low steerability or both, rendering them ineffective for downstream use; and (ii) due to the unsupervised nature of SAEs, user-desired concepts are often absent in the learned dictionary, thus limiting their practical utility. To address these limitations, we propose Concept Bottleneck Sparse Autoencoders (CB-SAE), a novel post-hoc framework that prunes low-utility neurons and augments the latent space with a lightweight concept bottleneck aligned to a user-defined concept set. The resulting CB-SAE improves interpretability by +32.1% and steerability by +14.5% across LVLMs and image generation tasks. We will make our code and model weights available.

34. 【2512.10794】What matters for Representation Alignment: Global Information or Spatial Structure?

Link: https://arxiv.org/abs/2512.10794

Authors: Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, Saining Xie

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG); Machine Learning (stat.ML)

Keywords: intermediate diffusion features, diffusion features, intermediate diffusion, target representation, pretrained vision encoder

Comments: Project page: [this https URL](https://end2end-diffusion.github.io/irepa)

Abstract: Representation alignment (REPA) guides generative training by distilling representations from a strong, pretrained vision encoder to intermediate diffusion features. We investigate a fundamental question: what aspect of the target representation matters for generation, its global semantic information (e.g., measured by ImageNet-1K accuracy) or its spatial structure (i.e., pairwise cosine similarity between patch tokens)? Prevalent wisdom holds that stronger global semantic performance leads to better generation as a target representation. To study this, we first perform a large-scale empirical analysis across 27 different vision encoders and different model scales. The results are surprising: spatial structure, rather than global performance, drives the generation performance of a target representation. To further study this, we introduce two straightforward modifications, which specifically accentuate the transfer of spatial information. We replace the standard MLP projection layer in REPA with a simple convolution layer and introduce a spatial normalization layer for the external representation. Surprisingly, our simple method (implemented in <4 lines of code), termed iREPA, consistently improves convergence speed of REPA across a diverse set of vision encoders, model sizes, and training variants (such as REPA, REPA-E, Meanflow, JiT, etc.). Our work motivates revisiting the fundamental working mechanism of representational alignment and how it can be leveraged for improved training of generative models. The code and project page are available at this https URL.
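The abstract characterises "spatial structure" as the pairwise cosine similarity between patch tokens. The sketch below shows that quantity for a small list of token vectors. It is an illustration only, not the iREPA implementation, and the function name `cosine_similarity_matrix` is hypothetical.

```python
import math


def cosine_similarity_matrix(tokens):
    """Return the N x N matrix of cosine similarities between patch tokens.

    `tokens` is a list of equal-length, non-zero feature vectors, one per patch.
    """
    norms = [math.sqrt(sum(v * v for v in t)) for t in tokens]
    n = len(tokens)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(tokens[i], tokens[j]))
            sim[i][j] = dot / (norms[i] * norms[j])
    return sim


if __name__ == "__main__":
    # Three toy "patch tokens": the first and third point in the same direction.
    patches = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
    for row in cosine_similarity_matrix(patches):
        print(row)
```

Because the matrix depends only on directions, two encoders can share the same similarity matrix while differing in global classification accuracy, which is the distinction the abstract draws.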

35. 【2512.10766】Metaphor-based Jailbreaking Attacks on Text-to-Image Models

Link: https://arxiv.org/abs/2512.10766

Authors: Chenyu Zhang, Yiwen Ma, Lanjun Wang, Wenhui Li, Yi Tu, An-An Liu

Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: commonly incorporate defense, adversarial prompts, models commonly incorporate, defense mechanisms, adversarial

Comments: This paper includes model-generated content that may contain offensive or distressing material

Abstract: Text-to-image (T2I) models commonly incorporate defense mechanisms to prevent the generation of sensitive images. Unfortunately, recent jailbreaking attacks have shown that adversarial prompts can effectively bypass these mechanisms and induce T2I models to produce sensitive content, revealing critical safety vulnerabilities. However, existing attack methods implicitly assume that the attacker knows the type of deployed defenses, which limits their effectiveness against unknown or diverse defense mechanisms. In this work, we introduce MJA, a metaphor-based jailbreaking attack method inspired by the Taboo game, aiming to effectively and efficiently attack diverse defense mechanisms without prior knowledge of their type by generating metaphor-based adversarial prompts. Specifically, MJA consists of two modules: an LLM-based multi-agent generation module (MLAG) and an adversarial prompt optimization module (APO). MLAG decomposes the generation of metaphor-based adversarial prompts into three subtasks: metaphor retrieval, context matching, and adversarial prompt generation. Subsequently, MLAG coordinates three LLM-based agents to generate diverse adversarial prompts by exploring various metaphors and contexts. To enhance attack efficiency, APO first trains a surrogate model to predict the attack results of adversarial prompts and then designs an acquisition strategy to adaptively identify optimal adversarial prompts. Extensive experiments on T2I models with various external and internal defense mechanisms demonstrate that MJA outperforms six baseline methods, achieving stronger attack performance while using fewer queries. Code is available at this https URL.

36. 【2512.10765】Blood Pressure Prediction for Coronary Artery Disease Diagnosis using Coronary Computed Tomography Angiography

Link: https://arxiv.org/abs/2512.10765

Authors: Rene Lisasi, Michele Esposito, Chen Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Computational fluid dynamics, Computational fluid, valuable hemodynamic markers, coronary artery disease, diagnosing coronary artery

Comments: 19 pages, 9 figures

Abstract: Computational fluid dynamics (CFD) based simulation of coronary blood flow provides valuable hemodynamic markers, such as pressure gradients, for diagnosing coronary artery disease (CAD). However, CFD is computationally expensive, time-consuming, and difficult to integrate into large-scale clinical workflows. These limitations restrict the availability of labeled hemodynamic data for training AI models and hinder broad adoption of non-invasive, physiology-based CAD assessment. To address these challenges, we develop an end-to-end pipeline that automates coronary geometry extraction from coronary computed tomography angiography (CCTA), streamlines simulation data generation, and enables efficient learning of coronary blood pressure distributions. The pipeline reduces the manual burden associated with traditional CFD workflows while producing consistent training data. We further introduce a diffusion-based regression model designed to predict coronary blood pressure directly from CCTA-derived features, bypassing the need for slow CFD computation during inference. Evaluated on a dataset of simulated coronary hemodynamics, the proposed model achieves state-of-the-art performance, with an R² of 64.42%, a root mean squared error of 0.0974, and a normalized RMSE of 0.154, outperforming several baseline approaches. This work provides a scalable and accessible framework for rapid, non-invasive blood pressure prediction to support CAD diagnosis.
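For readers unfamiliar with the three metrics quoted above, the following sketch computes R², RMSE, and a normalized RMSE from paired targets and predictions. The function name `regression_metrics` is hypothetical, and dividing the RMSE by the target range is an assumption: the abstract does not say which normalization the authors use (range, mean, and standard deviation are all common choices).

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE and range-normalized RMSE for paired values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    value_range = max(y_true) - min(y_true)
    if ss_tot == 0.0 or value_range == 0.0:
        raise ValueError("targets must not be constant")
    rmse = math.sqrt(ss_res / n)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": rmse,
        # Assumption: normalize by the range of the targets.
        "nrmse": rmse / value_range,
    }


if __name__ == "__main__":
    print(regression_metrics([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 3.0, 3.0]))
```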

37. 【2512.10750】LDP: Parameter-Efficient Fine-Tuning of Multimodal LLM for Medical Report Generation

Link: https://arxiv.org/abs/2512.10750

Authors: Tianyu Zhou, Junyi Tang, Zehui Li, Dahong Qian, Suncheng Xiang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Colonoscopic polyp diagnosis, colorectal cancer detection, early colorectal cancer, multimodal medical data, Colonoscopic polyp

Comments: Work in progress

Abstract:Colonoscopic polyp diagnosis is pivotal for early colorectal cancer detection, yet traditional automated reporting suffers from inconsistencies and hallucinations due to the scarcity of high-quality multimodal medical data. To bridge this gap, we propose LDP, a novel framework leveraging multimodal large language models (MLLMs) for professional polyp diagnosis report generation. Specifically, we curate MMEndo, a multimodal endoscopic dataset comprising expert-annotated colonoscopy image-text pairs. We fine-tune the Qwen2-VL-7B backbone using Parameter-Efficient Fine-Tuning (LoRA) and align it with clinical standards via Direct Preference Optimization (DPO). Extensive experiments show that our LDP outperforms existing baselines on both automated metrics and rigorous clinical expert evaluations (achieving a Physician Score of 7.2/10), significantly reducing training computational costs by 833x compared to full fine-tuning. The proposed solution offers a scalable, clinically viable path for primary healthcare, with additional validation on the IU-XRay dataset confirming its robustness.

38. 【2512.10730】IRG-MotionLLM: Interleaving Motion Generation, Assessment and Refinement for Text-to-Motion Generation

Link: https://arxiv.org/abs/2512.10730

Authors: Yuan-Ming Li, Qize Yang, Nan Lei, Shenghao Fu, Ling-An Zeng, Jian-Fang Hu, Xihan Wei, Wei-Shi Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: motion-aware large language, shown remarkable promise, Recent advances, large language models, unifying motion understanding

Comments: 25 pages, 16 figures

Abstract: Recent advances in motion-aware large language models have shown remarkable promise for unifying motion understanding and generation tasks. However, these models typically treat understanding and generation separately, limiting the mutual benefits that could arise from interactive feedback between tasks. In this work, we reveal that motion assessment and refinement tasks act as crucial bridges to enable bidirectional knowledge flow between understanding and generation. Leveraging this insight, we propose Interleaved Reasoning for Motion Generation (IRMoGen), a novel paradigm that tightly couples motion generation with assessment and refinement through iterative text-motion dialogue. To realize this, we introduce IRG-MotionLLM, the first model that seamlessly interleaves motion generation, assessment, and refinement to improve generation performance. IRG-MotionLLM is developed progressively with a novel three-stage training scheme, initializing and subsequently enhancing native IRMoGen capabilities. To facilitate this development, we construct an automated data engine to synthesize interleaved reasoning annotations from existing text-motion datasets. Extensive experiments demonstrate that: (i) assessment and refinement tasks significantly improve text-motion alignment; (ii) interleaving motion generation, assessment, and refinement steps yields consistent performance gains across training stages; and (iii) IRG-MotionLLM clearly outperforms the baseline model and achieves advanced performance on standard text-to-motion generation benchmarks. Cross-evaluator testing further validates its effectiveness. Code and data: this https URL.

39. 【2512.10725】Video Depth Propagation

Link: https://arxiv.org/abs/2512.10725

Authors: Luigi Piccinelli, Thiemo Wandel, Christos Sakaridis, Wim Abbeloos, Luc Van Gool

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: essential for visual, real-world applications, Depth, Depth estimation, applications

Comments: none

Abstract:Depth estimation in videos is essential for visual perception in real-world applications. However, existing methods either rely on simple frame-by-frame monocular models, leading to temporal inconsistencies and inaccuracies, or use computationally demanding temporal modeling, unsuitable for real-time applications. These limitations significantly restrict general applicability and performance in practical settings. To address this, we propose VeloDepth, an efficient and robust online video depth estimation pipeline that effectively leverages spatiotemporal priors from previous depth predictions and performs deep feature propagation. Our method introduces a novel Propagation Module that refines and propagates depth features and predictions using flow-based warping coupled with learned residual corrections. In addition, our design structurally enforces temporal consistency, resulting in stable depth predictions across consecutive frames with improved efficiency. Comprehensive zero-shot evaluation on multiple benchmarks demonstrates the state-of-the-art temporal consistency and competitive accuracy of VeloDepth, alongside its significantly faster inference compared to existing video-based depth estimators. VeloDepth thus provides a practical, efficient, and accurate solution for real-time depth estimation suitable for diverse perception tasks. Code and models are available at this https URL

40. 【2512.10719】SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving

Link: https://arxiv.org/abs/2512.10719

Authors: Peizheng Li, Zhenghao Zhang, David Holtz, Hang Yu, Yutong Yang, Yuzhi Lai, Rui Song, Andreas Geiger, Andreas Zell

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: undergone rapid development, rapid development driven, strong reasoning capabilities, reasoning capabilities obtained, vision language models

Comments: none

Abstract: End-to-end autonomous driving methods built on vision language models (VLMs) have undergone rapid development driven by their universal visual understanding and strong reasoning capabilities obtained from large-scale pretraining. However, we find that current VLMs struggle to understand fine-grained 3D spatial relationships, which is a fundamental requirement for systems interacting with the physical world. To address this issue, we propose SpaceDrive, a spatial-aware VLM-based driving framework that treats spatial information as explicit positional encodings (PEs) instead of textual digit tokens, enabling joint reasoning over semantic and spatial representations. SpaceDrive applies a universal positional encoder to all 3D coordinates derived from multi-view depth estimation, historical ego-states, and text prompts. These 3D PEs are first superimposed to augment the corresponding 2D visual tokens. Meanwhile, they serve as a task-agnostic coordinate representation, replacing the digit-wise numerical tokens as both inputs and outputs for the VLM. This mechanism enables the model to better index specific visual semantics in spatial reasoning and directly regress trajectory coordinates rather than generating them digit by digit, thereby enhancing planning accuracy. Extensive experiments validate that SpaceDrive achieves state-of-the-art open-loop performance on the nuScenes dataset and the second-best Driving Score of 78.02 on the Bench2Drive closed-loop benchmark over existing VLM-based methods.

41. 【2512.10715】CheXmask-U: Quantifying uncertainty in landmark-based anatomical segmentation for X-ray images

Link: https://arxiv.org/abs/2512.10715

Authors: Matias Cosarinsky, Nicolas Gaggion, Rodrigo Echeveste, Enzo Ferrante

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: supporting human oversight, human oversight, image segmentation systems, supporting human, Uncertainty

Comments: none

Abstract:Uncertainty estimation is essential for the safe clinical deployment of medical image segmentation systems, enabling the identification of unreliable predictions and supporting human oversight. While prior work has largely focused on pixel-level uncertainty, landmark-based segmentation offers inherent topological guarantees yet remains underexplored from an uncertainty perspective. In this work, we study uncertainty estimation for anatomical landmark-based segmentation on chest X-rays. Inspired by hybrid neural network architectures that combine standard image convolutional encoders with graph-based generative decoders, and leveraging their variational latent space, we derive two complementary measures: (i) latent uncertainty, captured directly from the learned distribution parameters, and (ii) predictive uncertainty, obtained by generating multiple stochastic output predictions from latent samples. Through controlled corruption experiments we show that both uncertainty measures increase with perturbation severity, reflecting both global and local degradation. We demonstrate that these uncertainty signals can identify unreliable predictions by comparing with manual ground-truth, and support out-of-distribution detection on the CheXmask dataset. More importantly, we release CheXmask-U (this http URL), a large scale dataset of 657,566 chest X-ray landmark segmentations with per-node uncertainty estimates, enabling researchers to account for spatial variations in segmentation quality when using these anatomical masks. Our findings establish uncertainty estimation as a promising direction to enhance robustness and safe deployment of landmark-based anatomical segmentation methods in chest X-ray. A fully working interactive demo of the method is available at this http URL and the source code at this http URL.

42. 【2512.10691】Enhancing Radiology Report Generation and Visual Grounding using Reinforcement Learning

Link: https://arxiv.org/abs/2512.10691

Authors: Benjamin Gundersen, Nicolas Deperrois, Samuel Ruiperez-Campillo, Thomas M. Sutter, Julia E. Vogt, Michael Moor, Farhad Nooralahzadeh, Michael Krauthammer

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: improved Chest X-ray, Chest X-ray, Recent advances, improved Chest, interpretation in multiple

Comments: 10 pages main text (3 figures, 3 tables), 31 pages in total

Abstract:Recent advances in vision-language models (VLMs) have improved Chest X-ray (CXR) interpretation in multiple aspects. However, many medical VLMs rely solely on supervised fine-tuning (SFT), which optimizes next-token prediction without evaluating answer quality. In contrast, reinforcement learning (RL) can incorporate task-specific feedback, and its combination with explicit intermediate reasoning ("thinking") has demonstrated substantial gains on verifiable math and coding tasks. To investigate the effects of RL and thinking in a CXR VLM, we perform large-scale SFT on CXR data to build an updated RadVLM based on Qwen3-VL, followed by a cold-start SFT stage that equips the model with basic thinking ability. We then apply Group Relative Policy Optimization (GRPO) with clinically grounded, task-specific rewards for report generation and visual grounding, and run matched RL experiments on both domain-specific and general-domain Qwen3-VL variants, with and without thinking. Across these settings, we find that while strong SFT remains crucial for high base performance, RL provides additional gains on both tasks, whereas explicit thinking does not appear to further improve results. Under a unified evaluation pipeline, the RL-optimized RadVLM models outperform their baseline counterparts and reach state-of-the-art performance on both report generation and grounding, highlighting clinically aligned RL as a powerful complement to SFT for medical VLMs.
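The core idea of Group Relative Policy Optimization (GRPO) mentioned above is that each sampled response is scored relative to the other responses drawn for the same prompt, rather than by a learned value function. The sketch below shows only that advantage computation. The reward values are made up, the function name `group_relative_advantages` is hypothetical, and the clinically grounded reward functions used in the paper are not reproduced here.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one group of responses to a single prompt.

    Each advantage is (reward - group mean) / (group std + eps),
    using the population standard deviation of the group.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four reports generated from the same image.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Responses that score above their group's mean receive a positive advantage and are reinforced. A group in which every response gets the same reward contributes no learning signal.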

43. 【2512.10685】Sharp Monocular View Synthesis in Less Than a Second

Link: https://arxiv.org/abs/2512.10685

Authors: Lars Mescheder, Wei Dong, Shiwei Li, Xuyang Bai, Marcel Santos, Peiyun Hu, Bruno Lecouat, Mingmin Zhen, Amaël Delaunoy, Tian Fang, Yanghai Tsin, Stephan R. Richter, Vladlen Koltun

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Gaussian representation, present SHARP, SHARP, Gaussian representation produced, single

Comments: Code and weights available at [this https URL](https://github.com/apple/ml-sharp)

Abstract:We present SHARP, an approach to photorealistic view synthesis from a single image. Given a single photograph, SHARP regresses the parameters of a 3D Gaussian representation of the depicted scene. This is done in less than a second on a standard GPU via a single feedforward pass through a neural network. The 3D Gaussian representation produced by SHARP can then be rendered in real time, yielding high-resolution photorealistic images for nearby views. The representation is metric, with absolute scale, supporting metric camera movements. Experimental results demonstrate that SHARP delivers robust zero-shot generalization across datasets. It sets a new state of the art on multiple datasets, reducing LPIPS by 25-34% and DISTS by 21-43% versus the best prior model, while lowering the synthesis time by three orders of magnitude. Code and weights are provided at this https URL

44. 【2512.10683】Optimal transport unlocks end-to-end learning for single-molecule localization

Link: https://arxiv.org/abs/2512.10683

Authors: Romain Seailles (DI-ENS), Jean-Baptiste Masson (IP, CNRS, UPCité), Jean Ponce (DI-ENS, CDS), Julien Mairal (LJK)

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Single-molecule localization microscopy, fluorescent molecules stained, reconstruct super-resolved images, reconstructing biology-relevant structures, localizing individual fluorophores

Comments: none

Abstract: Single-molecule localization microscopy (SMLM) allows reconstructing biology-relevant structures beyond the diffraction limit by detecting and localizing individual fluorophores (fluorescent molecules stained onto the observed specimen) over time to reconstruct super-resolved images. Currently, efficient SMLM requires non-overlapping emitting fluorophores, leading to long acquisition times that hinder live-cell imaging. Recent deep-learning approaches can handle denser emissions, but they rely on variants of non-maximum suppression (NMS) layers, which are unfortunately non-differentiable and may discard true positives with their local fusion strategy. In this presentation, we reformulate the SMLM training objective as a set-matching problem, deriving an optimal-transport loss that eliminates the need for NMS during inference and enables end-to-end training. Additionally, we propose an iterative neural network that integrates knowledge of the microscope's optical system inside our model. Experiments on synthetic benchmarks and real biological data show that both our new loss function and architecture surpass the state of the art at moderate and high emitter densities. Code is available at this https URL.
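As a rough illustration of what "set matching" means for emitter localization, the sketch below finds the cheapest one-to-one pairing between predicted and ground-truth positions. It is a deliberate simplification and not the authors' optimal-transport loss: it assumes equal-sized sets, uses squared Euclidean cost, and searches all permutations, which is only practical for a handful of points. The function name `set_matching_cost` is hypothetical.

```python
from itertools import permutations


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def set_matching_cost(predicted, targets):
    """Return (minimum total cost, best assignment) between two point sets.

    The assignment is a tuple where entry i is the index of the target
    matched to predicted[i]. Brute force: O(n!) in the set size.
    """
    if len(predicted) != len(targets):
        raise ValueError("this sketch only handles equal-sized sets")
    best_cost, best_perm = None, None
    for perm in permutations(range(len(targets))):
        cost = sum(squared_distance(predicted[i], targets[j])
                   for i, j in enumerate(perm))
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, best_perm


if __name__ == "__main__":
    predicted = [(0.0, 0.0), (5.0, 5.0)]
    targets = [(5.0, 4.0), (0.0, 1.0)]   # listed in a different order
    print(set_matching_cost(predicted, targets))
```

The cost does not depend on the order in which emitters are listed, which is the property that makes a matching-based objective suitable for unordered detections.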

45. 【2512.10675】Evaluating Gemini Robotics Policies in a Veo World Simulator

Link: https://arxiv.org/abs/2512.10675

Authors: Gemini Robotics Team, Coline Devin, Yilun Du, Debidatta Dwibedi, Ruiqi Gao, Abhishek Jindal, Thomas Kipf, Sean Kirmani, Fangchen Liu, Anirudha Majumdar, Andrew Marmon, Carolina Parada, Yulia Rubanova, Dhruv Shah, Vikas Sindhwani, Jie Tan, Fei Xia, Ted Xiao, Sherry Yang, Wenhao Yu, Allan Zhou

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: hold significant potential, world models hold, models hold significant, Generative world models, video models

Comments: none

Abstract:Generative world models hold significant potential for simulating interactions with visuomotor policies in varied environments. Frontier video models can enable generation of realistic observations and environment interactions in a scalable and general manner. However, the use of video models in robotics has been limited primarily to in-distribution evaluations, i.e., scenarios that are similar to ones used to train the policy or fine-tune the base video model. In this report, we demonstrate that video models can be used for the entire spectrum of policy evaluation use cases in robotics: from assessing nominal performance to out-of-distribution (OOD) generalization, and probing physical and semantic safety. We introduce a generative evaluation system built upon a frontier video foundation model (Veo). The system is optimized to support robot action conditioning and multi-view consistency, while integrating generative image-editing and multi-view completion to synthesize realistic variations of real-world scenes along multiple axes of generalization. We demonstrate that the system preserves the base capabilities of the video model to enable accurate simulation of scenes that have been edited to include novel interaction objects, novel visual backgrounds, and novel distractor objects. This fidelity enables accurately predicting the relative performance of different policies in both nominal and OOD conditions, determining the relative impact of different axes of generalization on policy performance, and performing red teaming of policies to expose behaviors that violate physical or semantic safety constraints. We validate these capabilities through 1600+ real-world evaluations of eight Gemini Robotics policy checkpoints and five tasks for a bimanual manipulator.

46. 【2512.10674】Geo6DPose: Fast Zero-Shot 6D Object Pose Estimation via Geometry-Filtered Feature Matching

Link: https://arxiv.org/abs/2512.10674

Authors: Javier Villena Toro, Mehdi Tarkian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent progress, driven largely, largely by large-scale, object pose estimation, cloud-based inference

Comments: none

Abstract: Recent progress in zero-shot 6D object pose estimation has been driven largely by large-scale models and cloud-based inference. However, these approaches often introduce high latency, elevated energy consumption, and deployment risks related to connectivity, cost, and data governance, factors that conflict with the practical constraints of real-world robotics, where compute is limited and on-device inference is frequently required. We introduce Geo6DPose, a lightweight, fully local, and training-free pipeline for zero-shot 6D pose estimation that trades model scale for geometric reliability. Our method combines foundation model visual features with a geometric filtering strategy: similarity maps are computed between onboarded template DINO descriptors and scene patches, and mutual correspondences are established by projecting scene patch centers to 3D and template descriptors to the object model coordinate system. Final poses are recovered via correspondence-driven RANSAC and ranked using a weighted geometric alignment metric that jointly accounts for reprojection consistency and spatial support, improving robustness to noise, clutter, and partial visibility. Geo6DPose achieves sub-second inference on a single commodity GPU while matching the average recall of significantly larger zero-shot baselines (53.7 AR, 1.08 FPS). It requires no training, fine-tuning, or network access, and remains compatible with evolving foundation backbones, advancing practical, fully local 6D perception for robotic deployment.
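One common way to turn a template-to-scene similarity map into correspondences is to keep only mutual nearest neighbours, and the sketch below shows that filter on a toy similarity matrix. Treating "mutual correspondences" as mutual nearest neighbours is an assumption here, as is the function name `mutual_matches`. The abstract does not spell out the exact rule, and the 3D projection and RANSAC steps are not shown.

```python
def mutual_matches(similarity):
    """Return (template_idx, scene_idx) pairs that are each other's best match.

    similarity[i][j] is the similarity between template descriptor i
    and scene patch j (higher means more similar).
    """
    if not similarity or not similarity[0]:
        return []
    n_rows, n_cols = len(similarity), len(similarity[0])
    # Best scene patch for every template descriptor.
    best_col_for_row = [max(range(n_cols), key=lambda j: similarity[i][j])
                        for i in range(n_rows)]
    # Best template descriptor for every scene patch.
    best_row_for_col = [max(range(n_rows), key=lambda i: similarity[i][j])
                        for j in range(n_cols)]
    return [(i, j) for i, j in enumerate(best_col_for_row)
            if best_row_for_col[j] == i]


if __name__ == "__main__":
    toy_similarity = [
        [0.90, 0.10, 0.30],
        [0.20, 0.80, 0.70],
        [0.85, 0.20, 0.60],
    ]
    print(mutual_matches(toy_similarity))
```

In the toy matrix, the third template descriptor prefers scene patch 0, but that patch prefers the first descriptor, so the one-sided match is discarded.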

47. 【2512.10668】XDen-1K: A Density Field Dataset of Real-World Objects

Link: https://arxiv.org/abs/2512.10668

Authors: Jingxuan Zhang, Tianqi Yu, Yatu Zhang, Jinze Wu, Kaixin Yao, Jingyang Liu, Yuyao Zhang, Jiayuan Gu, Jingyi Yu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: deep understanding, central goal, realistic simulation, physical world, volumetric density

Comments: 10 pages, 7 figures

Abstract:A deep understanding of the physical world is a central goal for embodied AI and realistic simulation. While current models excel at capturing an object's surface geometry and appearance, they largely neglect its internal physical properties. This omission is critical, as properties like volumetric density are fundamental for predicting an object's center of mass, stability, and interaction dynamics in applications ranging from robotic manipulation to physical simulation. The primary bottleneck has been the absence of large-scale, real-world data. To bridge this gap, we introduce XDen-1K, the first large-scale, multi-modal dataset designed for real-world physical property estimation, with a particular focus on volumetric density. The core of this dataset consists of 1,000 real-world objects across 148 categories, for which we provide comprehensive multi-modal data, including a high-resolution 3D geometric model with part-level annotations and a corresponding set of real-world biplanar X-ray scans. Building upon this data, we introduce a novel optimization framework that recovers a high-fidelity volumetric density field of each object from its sparse X-ray views. To demonstrate its practical value, we add X-ray images as a conditioning signal to an existing segmentation network and perform volumetric segmentation. Furthermore, we conduct experiments on downstream robotics tasks. The results show that leveraging the dataset can effectively improve the accuracy of center-of-mass estimation and the success rate of robotic manipulation. We believe XDen-1K will serve as a foundational resource and a challenging new benchmark, catalyzing future research in physically grounded visual inference and embodied AI.

48. 【2512.10660】NaviHydra: Controllable Navigation-guided End-to-end Autonomous Driving with Hydra-distillation

Link: https://arxiv.org/abs/2512.10660

Authors: Hanfeng Wu, Marlon Steiner, Michael Schmidt, Alvaro Marcos-Ramiro, Christoph Stiller

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: scenarios requires robust, generate safe trajectories, driving scenarios requires, requires robust models, scenarios requires

Comments: none

Abstract:The complexity of autonomous driving scenarios requires robust models that can interpret high-level navigation commands and generate safe trajectories. While traditional rule-based systems can react to these commands, they often struggle in dynamic environments, and end-to-end methods face challenges in complying with explicit navigation commands. To address this, we present NaviHydra, a controllable navigation-guided end-to-end model distilled from an existing rule-based simulator. Our framework accepts high-level navigation commands as control signals, generating trajectories that align with specified intentions. We utilize a Bird's Eye View (BEV) based trajectory gathering method to enhance the trajectory feature extraction. Additionally, we introduce a novel navigation compliance metric to evaluate adherence to intended route, improving controllability and navigation safety. To comprehensively assess our model's controllability, we design a test that evaluates its response to various navigation commands. Our method significantly outperforms baseline models, achieving state-of-the-art results in the NAVSIM benchmark, demonstrating its effectiveness in advancing autonomous driving.

49. 【2512.10652】TriDF: Evaluating Perception, Detection, and Hallucination for Interpretable DeepFake Detection

Link: https://arxiv.org/abs/2512.10652

Authors: Jian-Yu Jiang-Lin, Kang-Yang Huang, Ling Zou, Ling Lo, Sheng-Ping Yang, Yu-Wen Tseng, Kun-Hsiang Lin, Chia-Ling Chen, Yu-Ting Ta, Yan-Tsung Wang, Po-Ching Chen, Hongxia Xie, Hong-Han Shuai, Wen-Huang Cheng

Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

Keywords: fabricate realistic portrayals, Advances in generative, portrayals of individuals, creating serious risks, risks for security

Comments: none

Abstract:Advances in generative modeling have made it increasingly easy to fabricate realistic portrayals of individuals, creating serious risks for security, communication, and public trust. Detecting such person-driven manipulations requires systems that not only distinguish altered content from authentic media but also provide clear and reliable reasoning. In this paper, we introduce TriDF, a comprehensive benchmark for interpretable DeepFake detection. TriDF contains high-quality forgeries from advanced synthesis models, covering 16 DeepFake types across image, video, and audio modalities. The benchmark evaluates three key aspects: Perception, which measures the ability of a model to identify fine-grained manipulation artifacts using human-annotated evidence; Detection, which assesses classification performance across diverse forgery families and generators; and Hallucination, which quantifies the reliability of model-generated explanations. Experiments on state-of-the-art multimodal large language models show that accurate perception is essential for reliable detection, but hallucination can severely disrupt decision-making, revealing the interdependence of these three aspects. TriDF provides a unified framework for understanding the interaction between detection accuracy, evidence identification, and explanation reliability, offering a foundation for building trustworthy systems that address real-world synthetic media threats.

50. 【2512.10628】K-Track: Kalman-Enhanced Tracking for Accelerating Deep Point Trackers on Edge Devices

Link: https://arxiv.org/abs/2512.10628

Authors: Bishoy Galoaa, Pau Closas, Sarah Ostadabbas

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: computer vision applications, including robotics, augmented reality, video analysis, video sequences

Abstract:Point tracking in video sequences is a foundational capability for real-world computer vision applications, including robotics, autonomous systems, augmented reality, and video analysis. While recent deep learning-based trackers achieve state-of-the-art accuracy on challenging benchmarks, their reliance on per-frame GPU inference poses a major barrier to deployment on resource-constrained edge devices, where compute, power, and connectivity are limited. We introduce K-Track (Kalman-enhanced Tracking), a general-purpose, tracker-agnostic acceleration framework designed to bridge this deployment gap. K-Track reduces inference cost by combining sparse deep learning keyframe updates with lightweight Kalman filtering for intermediate frame prediction, using principled Bayesian uncertainty propagation to maintain temporal coherence. This hybrid strategy enables 5-10X speedup while retaining over 85% of the original trackers' accuracy. We evaluate K-Track across multiple state-of-the-art point trackers and demonstrate real-time performance on edge platforms such as the NVIDIA Jetson Nano and RTX Titan. By preserving accuracy while dramatically lowering computational requirements, K-Track provides a practical path toward deploying high-quality point tracking in real-world, resource-limited settings, closing the gap between modern tracking algorithms and deployable vision systems.
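The keyframe-plus-filter idea in this abstract can be illustrated with a toy example. The sketch below is not the K-Track implementation: it assumes a single 1D point coordinate, a constant-velocity motion model, and hypothetical noise variances `q` and `r`. The function name `kalman_track` is made up for illustration, and the deep tracker is replaced by a dictionary of keyframe measurements.

```python
def kalman_track(keyframes, num_frames, q=1e-3, r=1e-2):
    """Estimate a 1D point position on every frame from sparse keyframes.

    keyframes: dict mapping frame index -> measured position (a stand-in
    for a deep tracker's output on that frame). q and r are assumed
    process and measurement noise variances.
    """
    p, v = 0.0, 0.0                              # state: position, velocity
    P00, P01, P10, P11 = 100.0, 0.0, 0.0, 100.0  # state covariance (2x2)
    estimates = []
    for t in range(num_frames):
        if t > 0:
            # Predict one frame ahead with a constant-velocity model.
            p = p + v
            n00 = P00 + P10 + P01 + P11 + q
            n01 = P01 + P11
            n10 = P10 + P11
            n11 = P11 + q
            P00, P01, P10, P11 = n00, n01, n10, n11
        if t in keyframes:
            # Correct with the keyframe measurement (position only).
            S = P00 + r
            k0, k1 = P00 / S, P10 / S
            y = keyframes[t] - p
            p, v = p + k0 * y, v + k1 * y
            P00, P01, P10, P11 = (
                (1 - k0) * P00, (1 - k0) * P01,
                P10 - k1 * P00, P11 - k1 * P01,
            )
        estimates.append(p)
    return estimates


# A point moving 2 px/frame, measured only on every 5th frame.
keyframes = {t: 2.0 * t for t in range(0, 20, 5)}
estimates = kalman_track(keyframes, 20)
```

In this toy setting the filter has no velocity information until the second keyframe, so the first few intermediate frames are poorly predicted; after that the cheap predict step fills the gaps between keyframes.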

51. 【2512.10619】DOCR-Inspector: Fine-Grained and Automated Evaluation of Document Parsing with VLM

Link: https://arxiv.org/abs/2512.10619

Authors: Qintong Zhang, Junyuan Zhang, Zhifei Ren, Linke Ouyang, Zichen Wen, Junbo Niu, Yuan Qu, Bin Wang, Ka-Ho Chow, Conghui He, Wentao Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: transform unstructured PDF, unstructured PDF images, unstructured PDF, semi-structured data, facilitating the digitization

Abstract:Document parsing aims to transform unstructured PDF images into semi-structured data, facilitating the digitization and utilization of information in diverse domains. While vision language models (VLMs) have significantly advanced this task, achieving reliable, high-quality parsing in real-world scenarios remains challenging. Common practice often selects the top-performing model on standard benchmarks. However, these benchmarks may carry dataset-specific biases, leading to inconsistent model rankings and limited correlation with real-world performance. Moreover, benchmark metrics typically provide only overall scores, which can obscure distinct error patterns in output. This raises a key challenge: how can we reliably and comprehensively assess document parsing quality in the wild? We address this problem with DOCR-Inspector, which formalizes document parsing assessment as fine-grained error detection and analysis. Leveraging VLM-as-a-Judge, DOCR-Inspector analyzes a document image and its parsed output, identifies all errors, assigns them to one of 28 predefined types, and produces a comprehensive quality assessment. To enable this capability, we construct DOCRcase-200K for training and propose the Chain-of-Checklist reasoning paradigm to enable the hierarchical structure of parsing quality assessment. For empirical validation, we introduce DOCRcaseBench, a set of 882 real-world document parsing cases with manual annotations. On this benchmark, DOCR-Inspector-7B outperforms commercial models like Gemini 2.5 Pro, as well as leading open-source models. Further experiments demonstrate that its quality assessments provide valuable guidance for parsing results refinement, making DOCR-Inspector both a practical evaluator and a driver for advancing document parsing systems at scale. Model and code are released at: this https URL.

52. 【2512.10617】Lang2Motion: Bridging Language and Motion through Joint Embedding Spaces

Link: https://arxiv.org/abs/2512.10617

Authors: Bishoy Galoaa, Xiangyu Bai, Sarah Ostadabbas

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: joint embedding spaces, aligning motion manifolds, embedding spaces, framework for language-guided, manifolds with joint

Abstract:We present Lang2Motion, a framework for language-guided point trajectory generation by aligning motion manifolds with joint embedding spaces. Unlike prior work focusing on human motion or video synthesis, we generate explicit trajectories for arbitrary objects using motion extracted from real-world videos via point tracking. Our transformer-based auto-encoder learns trajectory representations through dual supervision: textual motion descriptions and rendered trajectory visualizations, both mapped through CLIP's frozen encoders. Lang2Motion achieves 34.2% Recall@1 on text-to-trajectory retrieval, outperforming video-based methods by 12.5 points, and improves motion accuracy by 33-52% (12.4 ADE vs 18.3-25.3) compared to video generation baselines. We demonstrate 88.3% Top-1 accuracy on human action recognition despite training only on diverse object motions, showing effective transfer across motion domains. Lang2Motion supports style transfer, semantic interpolation, and latent-space editing through CLIP-aligned trajectory representations.

53. 【2512.10608】Robust Multi-Disease Retinal Classification via Xception-Based Transfer Learning and W-Net Vessel Segmentation

Link: https://arxiv.org/abs/2512.10608

Authors: Mohammad Sadegh Gholizadeh, Amir Arsalan Rezapour

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: accurate screening solutions, vision-threatening eye diseases, recent years, risen dramatically, necessitating scalable

Abstract:In recent years, the incidence of vision-threatening eye diseases has risen dramatically, necessitating scalable and accurate screening solutions. This paper presents a comprehensive study on deep learning architectures for the automated diagnosis of ocular conditions. To mitigate the "black-box" limitations of standard convolutional neural networks (CNNs), we implement a pipeline that combines deep feature extraction with interpretable image processing modules. Specifically, we focus on high-fidelity retinal vessel segmentation as an auxiliary task to guide the classification process. By grounding the model's predictions in clinically relevant morphological features, we aim to bridge the gap between algorithmic output and expert medical validation, thereby reducing false positives and improving deployment viability in clinical settings.

54. 【2512.10607】Track and Caption Any Motion: Query-Free Motion Discovery and Description in Videos

Link: https://arxiv.org/abs/2512.10607

Authors: Bishoy Galoaa, Sarah Ostadabbas

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Track and Caption, propose Track, user queries, automatic video understanding, motion-centric framework

Abstract:We propose Track and Caption Any Motion (TCAM), a motion-centric framework for automatic video understanding that discovers and describes motion patterns without user queries. Understanding videos in challenging conditions like occlusion, camouflage, or rapid movement often depends more on motion dynamics than static appearance. TCAM autonomously observes a video, identifies multiple motion activities, and spatially grounds each natural language description to its corresponding trajectory through a motion-field attention mechanism. Our key insight is that motion patterns, when aligned with contrastive vision-language representations, provide powerful semantic signals for recognizing and describing actions. Through unified training that combines global video-text alignment with fine-grained spatial correspondence, TCAM enables query-free discovery of multiple motion expressions via multi-head cross-attention. On the MeViS benchmark, TCAM achieves 58.4% video-to-text retrieval, 64.9 JF for spatial grounding, and discovers 4.8 relevant expressions per video with 84.7% precision, demonstrating strong cross-task generalization.

55. 【2512.10596】Beyond Pixels: A Training-Free, Text-to-Text Framework for Remote Sensing Image Retrieval

Link: https://arxiv.org/abs/2512.10596

Authors: J. Xiao, Y. Guo, X. Zi, K. Thiyagarajan, C. Moreira, M. Prasad

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: high-level human concepts, critical task fundamentally, task fundamentally challenged, low-level visual features, model low-level visual

Comments: 6 pages, 1 figure

Abstract:Semantic retrieval of remote sensing (RS) images is a critical task fundamentally challenged by the "semantic gap", the discrepancy between a model's low-level visual features and high-level human concepts. While large Vision-Language Models (VLMs) offer a promising path to bridge this gap, existing methods often rely on costly, domain-specific training, and there is a lack of benchmarks to evaluate the practical utility of VLM-generated text in a zero-shot retrieval context. To address this research gap, we introduce the Remote Sensing Rich Text (RSRT) dataset, a new benchmark featuring multiple structured captions per image. Based on this dataset, we propose a fully training-free, text-only retrieval reference called TRSLLaVA. Our methodology reformulates cross-modal retrieval as a text-to-text (T2T) matching problem, leveraging rich text descriptions as queries against a database of VLM-generated captions within a unified textual embedding space. This approach completely bypasses model training or fine-tuning. Experiments on the RSITMD and RSICD benchmarks show our training-free method is highly competitive with state-of-the-art supervised models. For instance, on RSITMD, our method achieves a mean Recall of 42.62%, nearly doubling the 23.86% of the standard zero-shot CLIP baseline and surpassing several top supervised models. This validates that high-quality semantic representation through structured text provides a powerful and cost-effective paradigm for remote sensing image retrieval.
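The text-to-text matching step described above can be sketched in a few lines. This is only an illustration under strong assumptions: the bag-of-words `embed` function is a toy stand-in for whatever text encoder the authors use, the captions are invented, and the helper names are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for a real text encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank_captions(query, captions):
    """Return caption indices sorted from most to least similar to the query."""
    q = embed(query)
    scores = [cosine(q, embed(c)) for c in captions]
    return sorted(range(len(captions)), key=lambda i: -scores[i])


# Invented captions standing in for VLM-generated image descriptions.
captions = [
    "a dense forest beside a river",
    "several planes parked near an airport runway",
    "residential buildings along a road",
]
ranking = rank_captions("airport with planes on runway", captions)
```

Because both the query and the database entries are text, retrieval reduces to nearest-neighbor search in one embedding space, with no image encoder involved at query time.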

56. 【2512.10592】Salient Object Detection in Complex Weather Conditions via Noise Indicators

Link: https://arxiv.org/abs/2512.10592

Authors: Quan Chen, Xiaokai Yang, Tingyu Wang, Rongfeng Lu, Xichun Sheng, Yaoqi Sun, Chenggang Yan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Salient object detection, Salient object, object detection, computer vision, enhance generalization

Abstract:Salient object detection (SOD), a foundational task in computer vision, has advanced from single-modal to multi-modal paradigms to enhance generalization. However, most existing SOD methods assume low-noise visual conditions, overlooking the degradation of segmentation accuracy caused by weather-induced noise in real-world scenarios. In this paper, we propose a SOD framework tailored for diverse weather conditions, encompassing a specific encoder and a replaceable decoder. To enable handling of varying weather noises, we introduce a one-hot vector as a noise indicator to represent different weather types and design a Noise Indicator Fusion Module (NIFM). The NIFM takes both semantic features and the noise indicator as dual inputs and is inserted between consecutive stages of the encoder to embed weather-aware priors via adaptive feature modulation. Critically, the proposed specific encoder retains compatibility with mainstream SOD decoders. Extensive experiments are conducted on the WXSOD dataset under varying training data scales (100%, 50%, 30% of the full training set), three encoder and seven decoder configurations. Results show that the proposed SOD framework (particularly the NIFM-enhanced specific encoder) improves segmentation accuracy under complex weather conditions compared to a vanilla encoder.

57. 【2512.10581】Unleashing Degradation-Carrying Features in Symmetric U-Net: Simpler and Stronger Baselines for All-in-One Image Restoration

Link: https://arxiv.org/abs/2512.10581

Authors: Wenlong Jiao, Heyang Lee, Ping Wang, Pengfei Zhu, Qinghua Hu, Dongwei Ren

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: degradation prompt strategies, elaborate degradation prompt, handle diverse degradations, methods increasingly rely, adverse weather

Abstract:All-in-one image restoration aims to handle diverse degradations (e.g., noise, blur, adverse weather) within a unified framework, yet existing methods increasingly rely on complex architectures (e.g., Mixture-of-Experts, diffusion models) and elaborate degradation prompt strategies. In this work, we reveal a critical insight: well-crafted feature extraction inherently encodes degradation-carrying information, and a symmetric U-Net architecture is sufficient to unleash these cues effectively. By aligning feature scales across encoder-decoder and enabling streamlined cross-scale propagation, our symmetric design preserves intrinsic degradation signals robustly, rendering simple additive fusion in skip connections sufficient for state-of-the-art performance. Our primary baseline, SymUNet, is built on this symmetric U-Net and achieves better results across benchmark datasets than existing approaches while reducing computational cost. We further propose a semantic enhanced variant, SE-SymUNet, which integrates direct semantic injection from frozen CLIP features via simple cross-attention to explicitly amplify degradation priors. Extensive experiments on several benchmarks validate the superiority of our methods. Both baselines SymUNet and SE-SymUNet establish simpler and stronger foundations for future advancements in all-in-one image restoration. The source code is available at this https URL.

58. 【2512.10571】Audio-sync Video Instance Editing with Granularity-Aware Mask Refiner

Link: https://arxiv.org/abs/2512.10571

Authors: Haojie Zheng, Shuchen Weng, Jingqi Liu, Siqi Yang, Boxin Shi, Xinlong Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: engaging content creation, Recent advancements, video generation highlight, content creation, realistic audio-visual synchronization

Abstract:Recent advancements in video generation highlight that realistic audio-visual synchronization is crucial for engaging content creation. However, existing video editing methods largely overlook audio-visual synchronization and lack the fine-grained spatial and temporal controllability required for precise instance-level edits. In this paper, we propose AVI-Edit, a framework for audio-sync video instance editing. We propose a granularity-aware mask refiner that iteratively refines coarse user-provided masks into precise instance-level regions. We further design a self-feedback audio agent to curate high-quality audio guidance, providing fine-grained temporal control. To facilitate this task, we additionally construct a large-scale dataset with instance-centric correspondence and comprehensive annotations. Extensive experiments demonstrate that AVI-Edit outperforms state-of-the-art methods in visual quality, condition following, and audio-visual synchronization. Project page: this https URL.

59. 【2512.10562】Data-Efficient American Sign Language Recognition via Few-Shot Prototypical Networks

Link: https://arxiv.org/abs/2512.10562

Authors: Meher Md Saad

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Isolated Sign Language, Sign Language Recognition, Language Recognition, Isolated Sign, Sign Language

Abstract:Isolated Sign Language Recognition (ISLR) is critical for bridging the communication gap between the Deaf and Hard-of-Hearing (DHH) community and the hearing world. However, robust ISLR is fundamentally constrained by data scarcity and the long-tail distribution of sign vocabulary, where gathering sufficient examples for thousands of unique signs is prohibitively expensive. Standard classification approaches struggle under these conditions, often overfitting to frequent classes while failing to generalize to rare ones. To address this bottleneck, we propose a Few-Shot Prototypical Network framework adapted for a skeleton-based encoder. Unlike traditional classifiers that learn fixed decision boundaries, our approach utilizes episodic training to learn a semantic metric space where signs are classified based on their proximity to dynamic class prototypes. We integrate a Spatiotemporal Graph Convolutional Network (ST-GCN) with a novel Multi-Scale Temporal Aggregation (MSTA) module to capture both rapid and fluid motion dynamics. Experimental results on the WLASL dataset demonstrate the superiority of this metric learning paradigm: our model achieves 43.75% Top-1 and 77.10% Top-5 accuracy on the test set. Crucially, this outperforms a standard classification baseline sharing the identical backbone architecture by over 13%, showing that the prototypical training strategy is more effective in data-scarce settings where standard classification fails. Furthermore, the model exhibits strong zero-shot generalization, achieving nearly 30% accuracy on the unseen SignASL dataset without fine-tuning, offering a scalable pathway for recognizing extensive sign vocabularies with limited data.
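Classification by proximity to class prototypes can be sketched as follows. This is a generic prototypical-network inference step, not the paper's code: the 2D vectors stand in for encoder outputs (the ST-GCN and MSTA modules are not modeled), and the labels and function names are invented for illustration.

```python
def prototypes(support):
    """Average each class's support embeddings into one prototype.

    support: dict mapping label -> list of embedding vectors.
    """
    protos = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        protos[label] = [
            sum(v[i] for v in vectors) / len(vectors) for i in range(dim)
        ]
    return protos


def classify(query, protos):
    """Assign the query embedding to the nearest prototype."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(protos, key=lambda label: sq_dist(query, protos[label]))


# Two hypothetical sign classes with two support examples each.
support = {
    "hello": [[0.0, 0.0], [0.2, 0.0]],
    "thanks": [[1.0, 1.0], [1.0, 0.8]],
}
protos = prototypes(support)
predicted = classify([0.1, 0.1], protos)
```

Since prototypes are recomputed from whatever support examples are available, adding a new sign only requires a few embedded examples, with no retraining of a fixed classifier head.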

60. 【2512.10554】Grounding Everything in Tokens for Multimodal Large Language Models

Link: https://arxiv.org/abs/2512.10554

Authors: Xiangxuan Ren, Zhongdao Wang, Liping Hou, Pin Tang, Guoqing Wang, Chao Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal large language, made significant advancements, Multimodal large, large language models, made significant

Comments: 19 pages, 16 figures, 12 tables

Abstract:Multimodal large language models (MLLMs) have made significant advancements in vision understanding and reasoning. However, the autoregressive Transformer architecture used by MLLMs requires tokenization of input images, which limits their ability to accurately ground objects within the 2D image space. This raises an important question: how can sequential language tokens be improved to better ground objects in 2D spatial space for MLLMs? To address this, we present a spatial representation method for grounding objects, namely GETok, that integrates a specialized vocabulary of learnable tokens into MLLMs. GETok first uses grid tokens to partition the image plane into structured spatial anchors, and then exploits offset tokens to enable precise and iterative refinement of localization predictions. By embedding spatial relationships directly into tokens, GETok significantly advances MLLMs in native 2D space reasoning without modifying the autoregressive architecture. Extensive experiments demonstrate that GETok achieves superior performance over the state-of-the-art methods across various referring tasks in both supervised fine-tuning and reinforcement learning settings.

61. 【2512.10548】Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding

Link: https://arxiv.org/abs/2512.10548

Authors: Yuchen Feng, Zhenyu Zhang, Naibin Gu, Yilong Chen, Peng Fu, Zheng Lin, Shuohuan Wang, Yu Sun, Hua Wu, Weiping Wang, Haifeng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large language models, achieved remarkable progress, perception remains limited, Multimodal large language, language models

Abstract:Multimodal large language models (MLLMs) have achieved remarkable progress on various vision-language tasks, yet their visual perception remains limited. Humans, in comparison, perceive complex scenes efficiently by dynamically scanning and focusing on salient regions in a sequential "blink-like" process. Motivated by this strategy, we first investigate whether MLLMs exhibit similar behavior. Our pilot analysis reveals that MLLMs naturally attend to different visual regions across layers and that selectively allocating more computation to salient tokens can enhance visual perception. Building on this insight, we propose Blink, a dynamic visual token resolution framework that emulates the human-inspired process within a single forward pass. Specifically, Blink includes two modules: saliency-guided scanning and dynamic token resolution. It first estimates the saliency of visual tokens in each layer based on the attention map, and extends important tokens through a plug-and-play token super-resolution (TokenSR) module. In the next layer, it drops the extended tokens when they lose focus. This dynamic mechanism balances broad exploration and fine-grained focus, thereby enhancing visual perception adaptively and efficiently. Extensive experiments validate Blink, demonstrating its effectiveness in enhancing visual perception and multimodal understanding.

62. 【2512.10524】Mode-Seeking for Inverse Problems with Diffusion Models

Link: https://arxiv.org/abs/2512.10524

Authors: Sai Bharath Chandra Gutha, Ricardo Vinuesa, Hossein Azizpour

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: unconditional diffusion model, pre-trained unconditional diffusion, solve arbitrary inverse, maximum a posteriori, training or fine-tuning

Abstract:A pre-trained unconditional diffusion model, combined with posterior sampling or maximum a posteriori (MAP) estimation techniques, can solve arbitrary inverse problems without task-specific training or fine-tuning. However, existing posterior sampling and MAP estimation methods often rely on modeling approximations and can be computationally demanding. In this work, we propose the variational mode-seeking loss (VML), which, when minimized during each reverse diffusion step, guides the generated sample towards the MAP estimate. VML arises from a novel perspective of minimizing the Kullback-Leibler (KL) divergence between the diffusion posterior $p(\mathbf{x}_0|\mathbf{x}_t)$ and the measurement posterior $p(\mathbf{x}_0|\mathbf{y})$, where $\mathbf{y}$ denotes the measurement. Importantly, for linear inverse problems, VML can be analytically derived and need not be approximated. Based on further theoretical insights, we propose VML-MAP, an empirically effective algorithm for solving inverse problems, and validate its efficacy over existing methods in both performance and computational time, through extensive experiments on diverse image-restoration tasks across multiple datasets.

63. 【2512.10521】Take a Peek: Efficient Encoder Adaptation for Few-Shot Semantic Segmentation via LoRA

Link: https://arxiv.org/abs/2512.10521

Authors: Pasquale De Marinis, Gennaro Vessio, Giovanna Castellano

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Few-shot semantic segmentation, Few-shot semantic, small annotated support, aims to segment, query images

Abstract:Few-shot semantic segmentation (FSS) aims to segment novel classes in query images using only a small annotated support set. While prior research has mainly focused on improving decoders, the encoder's limited ability to extract meaningful features for unseen classes remains a key bottleneck. In this work, we introduce Take a Peek (TaP), a simple yet effective method that enhances encoder adaptability for both FSS and cross-domain FSS (CD-FSS). TaP leverages Low-Rank Adaptation (LoRA) to fine-tune the encoder on the support set with minimal computational overhead, enabling fast adaptation to novel classes while mitigating catastrophic forgetting. Our method is model-agnostic and can be seamlessly integrated into existing FSS pipelines. Extensive experiments across multiple benchmarks, including COCO $20^i$, Pascal $5^i$, and cross-domain datasets such as DeepGlobe, ISIC, and Chest X-ray, demonstrate that TaP consistently improves segmentation performance across diverse models and shot settings. Notably, TaP delivers significant gains in complex multi-class scenarios, highlighting its practical effectiveness in realistic settings. A rank sensitivity analysis also shows that strong performance can be achieved even with low-rank adaptations, ensuring computational efficiency. By addressing a critical limitation in FSS (the encoder's generalization to novel classes), TaP paves the way toward more robust, efficient, and generalizable segmentation systems. The code is available at this https URL.
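For readers unfamiliar with LoRA, the low-rank update it applies to a frozen weight matrix can be written out directly. The sketch below shows the standard LoRA forward pass for a single linear layer in plain Python; it says nothing about how TaP selects layers or trains on the support set, and the matrices, `alpha` value, and function names are illustrative assumptions.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha=1.0):
    """Compute W x + (alpha / r) * B (A x).

    W (d_out x d_in) stays frozen; only A (r x d_in) and B (d_out x r)
    would be trained. r is the adapter rank.
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


W = [[1.0, 2.0], [3.0, 4.0]]   # frozen pretrained weights
A = [[0.5, -0.5]]              # rank-1 down-projection
B = [[1.0], [0.0]]             # rank-1 up-projection
out = lora_forward(W, A, B, [2.0, 1.0])
```

With `B` initialized to zeros, the adapted layer reproduces the frozen layer exactly, so fine-tuning starts from the pretrained behavior and only the small `A` and `B` matrices change.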

64. 【2512.10517】3D Blood Pulsation Maps

Link: https://arxiv.org/abs/2512.10517

Authors: Maurice Rohr, Tobias Reinhardt, Tizian Dege, Justus Thies, Christoph Hoog Antink

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: kind for estimating, blood pulsation, blood pulsation maps, pulsation, blood pulsation analysis

Comments: 9 pages (without references), supplementals attached, waiting for publication. To access the dataset, see [this https URL](https://github.com/KISMED-TUDa/pulse3dface)

Abstract:We present Pulse3DFace, the first dataset of its kind for estimating 3D blood pulsation maps. These maps can be used to develop models of dynamic facial blood pulsation, enabling the creation of synthetic video data to improve and validate remote pulse estimation methods via photoplethysmography imaging. Additionally, the dataset facilitates research into novel multi-view-based approaches for mitigating illumination effects in blood pulsation analysis. Pulse3DFace consists of raw videos from 15 subjects recorded at 30 Hz with an RGB camera from 23 viewpoints, blood pulse reference measurements, and facial 3D scans generated using monocular structure-from-motion techniques. It also includes processed 3D pulsation maps compatible with the texture space of the 3D head model FLAME. These maps provide signal-to-noise ratio, local pulse amplitude, phase information, and supplementary data. We offer a comprehensive evaluation of the dataset's illumination conditions, map consistency, and its ability to capture physiologically meaningful features in the facial and neck skin regions.

65. 【2512.10498】Robust Shape from Focus via Multiscale Directional Dilated Laplacian and Recurrent Network

Link: https://arxiv.org/abs/2512.10498

Authors: Khurram Ashfaq, Muhammad Tariq Mahmood

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: passive depth estimation, focal stack, depth estimation technique, analyzing focus variations, focus volumes

Comments: Accepted to IJCV

Abstract:Shape-from-Focus (SFF) is a passive depth estimation technique that infers scene depth by analyzing focus variations in a focal stack. Most recent deep learning-based SFF methods typically operate in two stages: first, they extract focus volumes (a per pixel representation of focus likelihood across the focal stack) using heavy feature encoders; then, they estimate depth via a simple one-step aggregation technique that often introduces artifacts and amplifies noise in the depth map. To address these issues, we propose a hybrid framework. Our method computes multi-scale focus volumes traditionally using handcrafted Directional Dilated Laplacian (DDL) kernels, which capture long-range and directional focus variations to form robust focus volumes. These focus volumes are then fed into a lightweight, multi-scale GRU-based depth extraction module that iteratively refines an initial depth estimate at a lower resolution for computational efficiency. Finally, a learned convex upsampling module within our recurrent network reconstructs high-resolution depth maps while preserving fine scene details and sharp boundaries. Extensive experiments on both synthetic and real-world datasets demonstrate that our approach outperforms state-of-the-art deep learning and traditional methods, achieving superior accuracy and generalization across diverse focal conditions.

66. 【2512.10450】Error-Propagation-Free Learned Video Compression With Dual-Domain Progressive Temporal Alignment

Link: https://arxiv.org/abs/2512.10450

Authors: Han Li, Shaohui Li, Wenrui Dai, Chenglin Li, Xinlong Pan, Haipeng Wang, Junni Zou, Hongkai Xiong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: learned video compression, error propagation, Existing frameworks, video compression suffer, estimation and compensation

Abstract:Existing frameworks for learned video compression suffer from a dilemma between inaccurate temporal alignment and error propagation for motion estimation and compensation (ME/MC). The separate-transform framework employs distinct transforms for intra-frame and inter-frame compression to yield impressive rate-distortion (R-D) performance but causes evident error propagation, while the unified-transform framework eliminates error propagation via shared transforms but is inferior in ME/MC in shared latent domains. To address this limitation, in this paper, we propose a novel unified-transform framework with dual-domain progressive temporal alignment and quality-conditioned mixture-of-experts (QCMoE) to enable quality-consistent and error-propagation-free streaming for learned video compression. Specifically, we propose dual-domain progressive temporal alignment for ME/MC that leverages coarse pixel-domain alignment and refined latent-domain alignment to significantly enhance temporal context modeling in a coarse-to-fine fashion. The coarse pixel-domain alignment efficiently handles simple motion patterns with optical flow estimated from a single reference frame, while the refined latent-domain alignment develops a Flow-Guided Deformable Transformer (FGDT) over latents from multiple reference frames to achieve long-term motion refinement (LTMR) for complex motion patterns. Furthermore, we design a QCMoE module for continuous bit-rate adaptation that dynamically assigns different experts to adjust quantization steps per pixel based on target quality and content rather than relying on a single quantization step. QCMoE allows continuous and consistent rate control with appealing R-D performance. Experimental results show that the proposed method achieves competitive R-D performance compared with state-of-the-art methods, while successfully eliminating error propagation.

67. 【2512.10437】An M-Health Algorithmic Approach to Identify and Assess Physiotherapy Exercises in Real Time

Link: https://arxiv.org/abs/2512.10437

Authors: Stylianos Kandylakis, Christos Orfanopoulos, Georgios Siolas, Panayiotis Tsanakas

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: efficient algorithmic framework, mobile devices, work presents, presents an efficient, efficient algorithmic

Comments: 11 pages, 5 figures

Abstract:This work presents an efficient algorithmic framework for real-time identification, classification, and evaluation of human physiotherapy exercises using mobile devices. The proposed method interprets a kinetic movement as a sequence of static poses, which are estimated from camera input using a pose-estimation neural network. Extracted body keypoints are transformed into trigonometric angle-based features and classified with lightweight supervised models to generate frame-level pose predictions and accuracy scores. To recognize full exercise movements and detect deviations from prescribed patterns, we employ a dynamic-programming scheme based on a modified Levenshtein distance algorithm, enabling robust sequence matching and localization of inaccuracies. The system operates entirely on the client side, ensuring scalability and real-time performance. Experimental evaluation demonstrates the effectiveness of the methodology and highlights its applicability to remote physiotherapy supervision and m-health applications.
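The sequence-matching step can be illustrated with the classic Levenshtein distance. The abstract mentions a modified variant whose details are not given here, so the sketch below uses the unmodified algorithm; the pose labels, the run-collapsing step, and the function names are assumptions made for the example.

```python
def collapse(frames):
    """Merge runs of identical frame-level pose labels into one pose each."""
    poses = []
    for label in frames:
        if not poses or poses[-1] != label:
            poses.append(label)
    return poses


def edit_distance(observed, template):
    """Classic Levenshtein distance between two pose sequences."""
    prev = list(range(len(template) + 1))
    for i, o in enumerate(observed, start=1):
        curr = [i]
        for j, t in enumerate(template, start=1):
            cost = 0 if o == t else 1
            curr.append(min(
                prev[j] + 1,         # extra observed pose
                curr[j - 1] + 1,     # missing template pose
                prev[j - 1] + cost,  # matching or wrong pose
            ))
        prev = curr
    return prev[-1]


# Hypothetical squat template and per-frame classifier outputs.
template = ["stand", "squat", "stand"]
frames = ["stand", "stand", "squat", "squat", "stand"]
distance = edit_distance(collapse(frames), template)
```

A distance of zero means the observed pose sequence matches the prescribed exercise; a nonzero distance counts the inserted, missing, or wrong poses, which is what makes it possible to point at where the movement deviated.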

68. 【2512.10421】Neural Collapse in Test-Time Adaptation

Link: https://arxiv.org/abs/2512.10421

Authors: Xiao Chen, Zhongjing Du, Jiazhen Huang, Xu Jiang, Li Lu, Jingyan Jiang, Zhi Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: lack theoretical insights, existing methods lack, methods lack theoretical, data by updating, online during inference

Comments: 10 pages, 8 figures

Abstract:Test-Time Adaptation (TTA) enhances model robustness to out-of-distribution (OOD) data by updating the model online during inference, yet existing methods lack theoretical insights into the fundamental causes of performance degradation under domain shifts. Recently, Neural Collapse (NC) has been proposed as an emergent geometric property of deep neural networks (DNNs), providing valuable insights for TTA. In this work, we extend NC to the sample-wise level and discover a novel phenomenon termed Sample-wise Alignment Collapse (NC3+), demonstrating that a sample's feature embedding, obtained by a trained model, aligns closely with the corresponding classifier weight. Building on NC3+, we identify that the performance degradation stems from sample-wise misalignment during adaptation, which worsens under larger distribution shifts. This indicates the necessity of realigning the feature embeddings with their corresponding classifier weights. However, the misalignment makes pseudo-labels unreliable under domain shifts. To address this challenge, we propose NCTTA, a novel feature-classifier alignment method with hybrid targets to mitigate the impact of unreliable pseudo-labels, which blends geometric proximity with predictive confidence. Extensive experiments demonstrate the effectiveness of NCTTA in enhancing robustness to domain shifts. For example, NCTTA outperforms Tent by 14.52% on ImageNet-C.

69. 【2512.10419】TransLocNet: Cross-Modal Attention for Aerial-Ground Vehicle Localization with Contrastive Learning

Link: https://arxiv.org/abs/2512.10419

Authors: Phu Pham, Damon Conover, Aniket Bera

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: overhead imagery, difficult due, due to large, large viewpoint, viewpoint and modality

Comments: 8 pages, 4 figures, 4 tables

Abstract:Aerial-ground localization is difficult due to large viewpoint and modality gaps between ground-level LiDAR and overhead imagery. We propose TransLocNet, a cross-modal attention framework that fuses LiDAR geometry with aerial semantic context. LiDAR scans are projected into a bird's-eye-view representation and aligned with aerial features through bidirectional attention, followed by a likelihood map decoder that outputs spatial probability distributions over position and orientation. A contrastive learning module enforces a shared embedding space to improve cross-modal alignment. Experiments on CARLA and KITTI show that TransLocNet outperforms state-of-the-art baselines, reducing localization error by up to 63% and achieving sub-meter, sub-degree accuracy. These results demonstrate that TransLocNet provides robust and generalizable aerial-ground localization in both synthetic and real-world settings.

70. 【2512.10416】Beyond Endpoints: Path-Centric Reasoning for Vectorized Off-Road Network Extraction

Link: https://arxiv.org/abs/2512.10416

Authors: Wenfei Guan, Jilin Mei, Tong Shen, Xumin Wu, Shuo Wang, Cheng Min, Yu Hu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: environments remain underexplored, Deep learning, off-road environments remain, vectorized road extraction, environments remain

Abstract:Deep learning has advanced vectorized road extraction in urban settings, yet off-road environments remain underexplored and challenging. A significant domain gap causes advanced models to fail in wild terrains due to two key issues: lack of large-scale vectorized datasets and structural weakness in prevailing methods. Models such as SAM-Road employ a node-centric paradigm that reasons at sparse endpoints, making them fragile to occlusions and ambiguous junctions in off-road scenes, leading to topological errors. This work addresses these limitations in two complementary ways. First, we release WildRoad, a global off-road road network dataset constructed efficiently with a dedicated interactive annotation tool tailored for road-network labeling. Second, we introduce MaGRoad (Mask-aware Geodesic Road network extractor), a path-centric framework that aggregates multi-scale visual evidence along candidate paths to infer connectivity. Experiments show that MaGRoad achieves state-of-the-art performance on our challenging WildRoad benchmark while generalizing well to urban datasets. A streamlined pipeline also yields roughly 2.5x faster inference, improving practical applicability. Together, the dataset and path-centric paradigm provide a stronger foundation for mapping roads in the wild.

71. 【2512.10408】MultiHateLoc: Towards Temporal Localisation of Multimodal Hate Content in Online Videos

Link: https://arxiv.org/abs/2512.10408

Authors: Qiyue Sun, Tailin Chen, Yinghui Zhang, Yuchen Zhang, Jiangbei Yue, Jianbo Jiao, Zeyu Fu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: harmful cues emerge, cues emerge subtly, asynchronously across visual, textual streams, multimodal hate speech

Abstract:The rapid growth of video content on platforms such as TikTok and YouTube has intensified the spread of multimodal hate speech, where harmful cues emerge subtly and asynchronously across visual, acoustic, and textual streams. Existing research primarily focuses on video-level classification, leaving the practically crucial task of temporal localisation, identifying when hateful segments occur, largely unaddressed. This challenge is even more noticeable under weak supervision, where only video-level labels are available, and static fusion or classification-based architectures struggle to capture cross-modal and temporal dynamics. To address these challenges, we propose MultiHateLoc, the first framework designed for weakly-supervised multimodal hate localisation. MultiHateLoc incorporates (1) modality-aware temporal encoders to model heterogeneous sequential patterns, including a tailored text-based preprocessing module for feature enhancement; (2) dynamic cross-modal fusion to adaptively emphasise the most informative modality at each moment and a cross-modal contrastive alignment strategy to enhance multimodal feature consistency; (3) a modality-aware MIL objective to identify discriminative segments under video-level supervision. Despite relying solely on coarse labels, MultiHateLoc produces fine-grained, interpretable frame-level predictions. Experiments on HateMM and MultiHateClip show that our method achieves state-of-the-art performance in the localisation task.

72. 【2512.10386】Adaptive Dual-Weighted Gravitational Point Cloud Denoising Method

Link: https://arxiv.org/abs/2512.10386

Authors: Ge Zhang, Chunyang Wang, Bo Xiao, Xuelian Liu, Bin Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: High-quality point cloud, point cloud, High-quality point, point cloud data, point cloud denoising

Abstract:High-quality point cloud data is a critical foundation for tasks such as autonomous driving and 3D reconstruction. However, LiDAR-based point cloud acquisition is often affected by various disturbances, resulting in a large number of noise points that degrade the accuracy of subsequent point cloud object detection and recognition. Moreover, existing point cloud denoising methods typically sacrifice computational efficiency in pursuit of higher denoising accuracy, or, conversely, improve processing speed at the expense of preserving object boundaries and fine structural details, making it difficult to simultaneously achieve high denoising accuracy, strong edge preservation, and real-time performance. To address these limitations, this paper proposes an adaptive dual-weight gravitational-based point cloud denoising method. First, an octree is employed to perform spatial partitioning of the global point cloud, enabling parallel acceleration. Then, within each leaf node, adaptive voxel-based occupancy statistics and k-nearest neighbor (kNN) density estimation are applied to rapidly remove clearly isolated and low-density noise points, thereby reducing the effective candidate set. Finally, a gravitational scoring function that combines density weights with adaptive distance weights is constructed to finely distinguish noise points from object points. Experiments conducted on the Stanford 3D Scanning Repository, the Canadian Adverse Driving Conditions (CADC) dataset, and in-house FMCW LiDAR point clouds acquired in our laboratory demonstrate that, compared with existing methods, the proposed approach achieves consistent improvements in F1, PSNR, and Chamfer Distance (CD) across various noise conditions while reducing the single-frame processing time, thereby validating its high accuracy, robustness, and real-time performance in multi-noise scenarios.
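
The coarse filtering stage mentioned above, which removes clearly isolated and low-density points using kNN density estimation, can be sketched as follows. This is a brute-force illustration under stated assumptions, not the paper's pipeline: the octree partitioning, voxel statistics, and gravitational scoring are omitted, and the function names, `k`, and the distance threshold are hypothetical.

```python
import math


def knn_mean_distance(points, k):
    """Mean distance from each point to its k nearest neighbours."""
    means = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        means.append(sum(dists[:k]) / k)
    return means


def filter_sparse_points(points, k=3, max_mean_dist=0.5):
    """Keep points whose neighbourhood is dense enough (needs len(points) > k)."""
    means = knn_mean_distance(points, k)
    return [p for p, m in zip(points, means) if m <= max_mean_dist]


# Hypothetical cloud: four points on a small surface patch plus one outlier.
cloud = [
    (0.0, 0.0, 0.0),
    (0.1, 0.0, 0.0),
    (0.0, 0.1, 0.0),
    (0.1, 0.1, 0.0),
    (5.0, 5.0, 5.0),
]
kept = filter_sparse_points(cloud, k=3, max_mean_dist=0.5)
```

The quadratic neighbour search here is only acceptable for tiny inputs; the spatial partitioning described in the abstract is what would make this step practical on full LiDAR frames.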

73. 【2512.10384】Towards Fine-Grained Recognition with Large Visual Language Models: Benchmark and Optimization Strategies

Link: https://arxiv.org/abs/2512.10384

Authors: Cong Pang, Hongtao Yu, Zixuan Chen, Lewei Lu, Xin Lou

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large Vision Language, Vision Language Models, Large Vision, Vision Language, made remarkable progress

Abstract:Large Vision Language Models (LVLMs) have made remarkable progress, enabling sophisticated vision-language interaction and dialogue applications. However, existing benchmarks primarily focus on reasoning tasks, often neglecting fine-grained recognition, which is crucial for practical application scenarios. To address this gap, we introduce the Fine-grained Recognition Open World (FROW) benchmark, designed for detailed evaluation of LVLMs with GPT-4o. On the basis of that, we propose a novel optimization strategy from two perspectives: data construction and training process, to improve the performance of LVLMs. Our dataset includes mosaic data, which combines multiple short-answer responses, and open-world data, generated from real-world questions and answers using GPT-4o, creating a comprehensive framework for evaluating fine-grained recognition in LVLMs. Experiments show that mosaic data improves category recognition accuracy by 1% and open-world data boosts FROW benchmark accuracy by 10%-20% and content accuracy by 6%-12%. Meanwhile, incorporating fine-grained data into the pre-training phase can improve the model's category recognition accuracy by up to 10%. The benchmark will be available at this https URL.

74. 【2512.10379】Self-Supervised Contrastive Embedding Adaptation for Endoscopic Image Matching

Link: https://arxiv.org/abs/2512.10379

Authors: Alberto Rota, Elena De Momi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: augmented reality integration, Accurate spatial understanding, image-guided surgery, augmented reality, context awareness

Abstract:Accurate spatial understanding is essential for image-guided surgery, augmented reality integration and context awareness. In minimally invasive procedures, where visual input is the sole intraoperative modality, establishing precise pixel-level correspondences between endoscopic frames is critical for 3D reconstruction, camera tracking, and scene interpretation. However, the surgical domain presents distinct challenges: weak perspective cues, non-Lambertian tissue reflections, and complex, deformable anatomy degrade the performance of conventional computer vision techniques. While Deep Learning models have shown strong performance in natural scenes, their features are not inherently suited for fine-grained matching in surgical images and require targeted adaptation to meet the demands of this domain. This research presents a novel Deep Learning pipeline for establishing feature correspondences in endoscopic image pairs, alongside a self-supervised optimization framework for model training. The proposed methodology leverages a novel-view synthesis pipeline to generate ground-truth inlier correspondences, subsequently utilized for mining triplets within a contrastive learning paradigm. Through this self-supervised approach, we augment the DINOv2 backbone with an additional Transformer layer, specifically optimized to produce embeddings that facilitate direct matching through cosine similarity thresholding. Experimental evaluation demonstrates that our pipeline surpasses state-of-the-art methodologies on the SCARED datasets, with improved matching precision and lower epipolar error than related work. The proposed framework constitutes a valuable contribution toward enabling more accurate high-level computer vision applications in surgical endoscopy.
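
Matching by cosine similarity thresholding can be sketched as below. This is an illustration under assumptions rather than the authors' implementation: the embeddings are toy 2D vectors standing in for learned features, the best-match-then-threshold rule is one plausible reading of the abstract, and the names and the 0.9 threshold are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_by_threshold(emb_a, emb_b, threshold=0.9):
    """Pair each embedding in emb_a with its most similar one in emb_b,
    keeping the pair only if the similarity reaches the threshold."""
    matches = []
    for i, a in enumerate(emb_a):
        sims = [cosine(a, b) for b in emb_b]
        j = max(range(len(sims)), key=sims.__getitem__)
        if sims[j] >= threshold:
            matches.append((i, j))
    return matches


# Toy embeddings for keypoints in two frames (hypothetical values).
emb_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
emb_b = [[0.0, 2.0], [3.0, 0.1], [-1.0, 0.0]]
matches = match_by_threshold(emb_a, emb_b)
```

The third embedding in `emb_a` has no sufficiently similar counterpart, so it is left unmatched instead of being forced onto a poor candidate.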

75. 【2512.10376】RaLiFlow: Scene Flow Estimation with 4D Radar and LiDAR Point Clouds

Link: https://arxiv.org/abs/2512.10376

Authors: Jingyun Fu, Zhiyu Xiang, Na Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent multimodal fusion, Recent multimodal, scene flow, integrating images, scene flow estimation

Comments: Accepted by AAAI

Abstract:Recent multimodal fusion methods, integrating images with LiDAR point clouds, have shown promise in scene flow estimation. However, the fusion of 4D millimeter wave radar and LiDAR remains unexplored. Unlike LiDAR, radar is cheaper, more robust in various weather conditions and can detect point-wise velocity, making it a valuable complement to LiDAR. However, radar inputs pose challenges due to noise, low resolution, and sparsity. Moreover, there is currently no dataset that combines LiDAR and radar data specifically for scene flow estimation. To address this gap, we construct a Radar-LiDAR scene flow dataset based on a public real-world automotive dataset. We propose an effective preprocessing strategy for radar denoising and scene flow label generation, deriving more reliable flow ground truth for radar points out of the object boundaries. Additionally, we introduce RaLiFlow, the first joint scene flow learning framework for 4D radar and LiDAR, which achieves effective radar-LiDAR fusion through a novel Dynamic-aware Bidirectional Cross-modal Fusion (DBCF) module and a carefully designed set of loss functions. The DBCF module integrates dynamic cues from radar into the local cross-attention mechanism, enabling the propagation of contextual information across modalities. Meanwhile, the proposed loss functions mitigate the adverse effects of unreliable radar data during training and enhance the instance-level consistency in scene flow predictions from both modalities, particularly for dynamic foreground areas. Extensive experiments on the repurposed scene flow dataset demonstrate that our method outperforms existing LiDAR-based and radar-based single-modal methods by a significant margin.

76. 【2512.10369】Breaking the Vicious Cycle: Coherent 3D Gaussian Splatting from Sparse and Motion-Blurred Views

Link: https://arxiv.org/abs/2512.10369

Authors: Zhankuo Xu, Chaoran Feng, Yingtao Li, Jianbin Zhao, Jiashu Yang, Wangbo Yu, Li Yuan, Yonghong Tian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Splatting, Gaussian, Splatting, view synthesis, motion blur

Comments: 20 pages, 14 figures

Abstract:3D Gaussian Splatting (3DGS) has emerged as a state-of-the-art method for novel view synthesis. However, its performance heavily relies on dense, high-quality input imagery, an assumption that is often violated in real-world applications, where data is typically sparse and motion-blurred. These two issues create a vicious cycle: sparse views ignore the multi-view constraints necessary to resolve motion blur, while motion blur erases high-frequency details crucial for aligning the limited views. Thus, reconstruction often fails catastrophically, with fragmented views and a low-frequency bias. To break this cycle, we introduce CoherentGS, a novel framework for high-fidelity 3D reconstruction from sparse and blurry images. Our key insight is to address these compound degradations using a dual-prior strategy. Specifically, we combine two pre-trained generative models: a specialized deblurring network for restoring sharp details and providing photometric guidance, and a diffusion model that offers geometric priors to fill in unobserved regions of the scene. This dual-prior strategy is supported by several key techniques, including a consistency-guided camera exploration module that adaptively guides the generative process, and a depth regularization loss that ensures geometric plausibility. We evaluate CoherentGS through both quantitative and qualitative experiments on synthetic and real-world scenes, using as few as 3, 6, and 9 input views. Our results demonstrate that CoherentGS significantly outperforms existing methods, setting a new state-of-the-art for this challenging task. The code and video demos are available at this https URL.

77. 【2512.10363】Point to Span: Zero-Shot Moment Retrieval for Navigating Unseen Hour-Long Videos

Link: https://arxiv.org/abs/2512.10363

Authors: Mingyu Jeon, Jisoo Yang, Sungjin Han, Jinkwon Hwang, Sunjae Yoon, Jonghee Kim, Junyeoung Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Video Moment Retrieval, Long Video Moment, Moment Retrieval, Zero-shot Long Video, natural language query

Abstract:Zero-shot Long Video Moment Retrieval (ZLVMR) is the task of identifying temporal segments in hour-long videos using a natural language query without task-specific training. The core technical challenge of LVMR stems from the computational infeasibility of processing entire lengthy videos in a single pass. This limitation has established a 'Search-then-Refine' approach, where candidates are rapidly narrowed down, and only those portions are analyzed, as the dominant paradigm for LVMR. However, existing approaches to this paradigm face severe limitations. Conventional supervised learning suffers from limited scalability and poor generalization, despite substantial resource consumption. Yet, existing zero-shot methods also fail, facing a dual challenge: (1) their heuristic strategies cause a 'search' phase candidate explosion, and (2) the 'refine' phase, which is vulnerable to semantic discrepancy, requires high-cost VLMs for verification, incurring significant computational overhead. We propose Point-to-Span (P2S), a novel training-free framework to overcome this challenge of inefficient 'search' and costly 'refine' phases. P2S overcomes these challenges with two key innovations: an 'Adaptive Span Generator' to prevent the search phase candidate explosion, and 'Query Decomposition' to refine candidates without relying on high-cost VLM verification. To our knowledge, P2S is the first zero-shot framework capable of temporal grounding in hour-long videos, outperforming supervised state-of-the-art methods by a significant margin (e.g., +3.7% on R5@0.1 on MAD).

78. 【2512.10362】Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models

Link: https://arxiv.org/abs/2512.10362

Authors: Woojun Jung, Jaehoon Go, Mingyu Jeon, Sunjae Yoon, Junyeong Kim

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, impressive reasoning capabilities

Abstract:Multimodal Large Language Models (MLLMs) demonstrate impressive reasoning capabilities, but often fail to perceive fine-grained visual details, limiting their applicability in precision-demanding tasks. While methods that crop salient regions of an image offer a partial solution, we identify a critical limitation they introduce: "Contextual Blindness". This failure occurs due to structural disconnect between high-fidelity details (from the crop) and the broader global context (from the original image), even when all necessary visual information is present. We argue that this limitation stems not from a lack of information 'Quantity', but from a lack of 'Structural Diversity' in the model's input. To resolve this, we propose Visual Funnel, a training-free, two-step approach. Visual Funnel first performs Contextual Anchoring to identify the region of interest in a single forward pass. It then constructs an Entropy-Scaled Portfolio that preserves the hierarchical context - ranging from focal detail to broader surroundings - by dynamically determining crop sizes based on attention entropy and refining crop centers. Through extensive experiments, we demonstrate that Visual Funnel significantly outperforms naive single-crop and unstructured multi-crop baselines. Our results further validate that simply adding more unstructured crops provides limited or even detrimental benefits, confirming that the hierarchical structure of our portfolio is key to resolving Contextual Blindness.

79. 【2512.10359】Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task

Link: https://arxiv.org/abs/2512.10359

Authors: Sunqi Fan, Jiashuo Cui, Meng-Hao Guo, Shuojin Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Video Question Answering, Question Answering, dynamic real-world scenarios, Large Language Models, Multimodal Large Language

Comments: Accepted by NeurIPS 2025 main track

Abstract:The Video Question Answering (VideoQA) task serves as a critical playground for evaluating whether foundation models can effectively perceive, understand, and reason about dynamic real-world scenarios. However, existing Multimodal Large Language Models (MLLMs) struggle with simultaneously modeling spatial relationships within video frames and understanding the causal dynamics of temporal evolution on complex and reasoning-intensive VideoQA tasks. In this work, we equip an MLLM with a comprehensive and extensible Video Toolkit to enhance its spatiotemporal reasoning capabilities and ensure the harmony between the quantity and diversity of tools. To better control the tool invocation sequence and avoid toolchain shortcut issues, we propose a Spatiotemporal Reasoning Framework (STAR) that strategically schedules temporal and spatial tools, thereby progressively localizing the key area in the video. Our STAR framework enhances GPT-4o using lightweight tools, achieving an 8.2% gain on VideoMME and 4.6% on LongVideoBench. We believe that our proposed Video Toolkit and STAR framework make an important step towards building autonomous and intelligent video analysis assistants. The code is publicly available at this https URL.

80. 【2512.10357】mmCounter: Static People Counting in Dense Indoor Scenarios Using mmWave Radar

Link: https://arxiv.org/abs/2512.10357

Authors: Tarik Reza Toha, Shao-Jung (Louie) Lu, Shahriar Nirjon

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: mmWave radars struggle, mmWave radars, groups due, radars struggle, struggle to detect

Comments: Accepted at the 22nd International Conference on Embedded Wireless Systems and Networks (EWSN 2025)

Abstract:mmWave radars struggle to detect or count individuals in dense, static (non-moving) groups due to limitations in spatial resolution and reliance on movement for detection. We present mmCounter, which accurately counts static people in dense indoor spaces (up to three people per square meter). mmCounter achieves this by extracting ultra-low frequency (< 1 Hz) signals, primarily from breathing and micro-scale body movements such as slight torso shifts, and applying novel signal processing techniques to differentiate these subtle signals from background noise and nearby static objects. Our problem differs significantly from existing studies on breathing rate estimation, which assume the number of people is known a priori. In contrast, mmCounter utilizes a novel multi-stage signal processing pipeline to extract relevant low-frequency sources along with their spatial information and map these sources to individual people, enabling accurate counting. Extensive evaluations in various environments demonstrate that mmCounter delivers an 87% average F1 score and 0.6 mean absolute error in familiar environments, and a 60% average F1 score and 1.1 mean absolute error in previously untested environments. It can count up to seven individuals in a three square meter space, such that there is no side-by-side spacing and only a one-meter front-to-back distance.

81. 【2512.10353】Hybrid Transformer-Mamba Architecture for Weakly Supervised Volumetric Medical Segmentation

Link: https://arxiv.org/abs/2512.10353

Authors: Yiheng Lyu, Lian Xu, Mohammed Bennamoun, Farid Boussaid, Coen Arrow, Girish Dwivedi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: semantic segmentation offers, Weakly supervised semantic, volumetric medical imaging, supervised semantic segmentation, weakly supervised volumetric

Abstract:Weakly supervised semantic segmentation offers a label-efficient solution to train segmentation models for volumetric medical imaging. However, existing approaches often rely on 2D encoders that neglect the inherent volumetric nature of the data. We propose TranSamba, a hybrid Transformer-Mamba architecture designed to capture 3D context for weakly supervised volumetric medical segmentation. TranSamba augments a standard Vision Transformer backbone with Cross-Plane Mamba blocks, which leverage the linear complexity of state space models for efficient information exchange across neighboring slices. The information exchange enhances the pairwise self-attention within slices computed by the Transformer blocks, directly contributing to the attention maps for object localization. TranSamba achieves effective volumetric modeling with time complexity that scales linearly with the input volume depth and maintains constant memory usage for batch processing. Extensive experiments on three datasets demonstrate that TranSamba establishes new state-of-the-art performance, consistently outperforming existing methods across diverse modalities and pathologies. Our source code and trained models are openly accessible at: this https URL.

82. 【2512.10352】Topology-Agnostic Animal Motion Generation from Text Prompt

Link: https://arxiv.org/abs/2512.10352

Authors: Keyi Chen, Mingze Sun, Zhenyu Liu, Zhangquan Chen, Ruqi Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: virtual environments, fundamental to computer, computer animation, animation and widely, arbitrary skeletal topologies

Comments: 10 pages, 7 figures. Conference submission

Abstract:Motion generation is fundamental to computer animation and widely used across entertainment, robotics, and virtual environments. While recent methods achieve impressive results, most rely on fixed skeletal templates, which prevent them from generalizing to skeletons with different or perturbed topologies. We address the core limitation of current motion generation methods - the combined lack of large-scale heterogeneous animal motion data and unified generative frameworks capable of jointly modeling arbitrary skeletal topologies and textual conditions. To this end, we introduce OmniZoo, a large-scale animal motion dataset spanning 140 species and 32,979 sequences, enriched with multimodal annotations. Building on OmniZoo, we propose a generalized autoregressive motion generation framework capable of producing text-driven motions for arbitrary skeletal topologies. Central to our model is a Topology-aware Skeleton Embedding Module that encodes geometric and structural properties of any skeleton into a shared token space, enabling seamless fusion with textual semantics. Given a text prompt and a target skeleton, our method generates temporally coherent, physically plausible, and semantically aligned motions, and further enables cross-species motion style transfer.

83. 【2512.10342】CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates

Link: https://arxiv.org/abs/2512.10342

Authors: Shresth Grover, Priyank Pathak, Akash Kumar, Vibhav Vineet, Yogesh S Rawat

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large-scale Vision-Language Models, exhibit impressive complex, remain largely unexplored, Large-scale Vision-Language, Vision-Language Models

Abstract:Large-scale Vision-Language Models (VLMs) exhibit impressive complex reasoning capabilities but remain largely unexplored in visual sequential planning, i.e., executing multi-step actions towards a goal. Additionally, practical sequential planning often involves non-optimal (erroneous) steps, challenging VLMs to detect and correct such steps. We propose Corrective Sequential Planning Benchmark (CoSPlan) to evaluate VLMs in error-prone, vision-based sequential planning tasks across 4 domains: maze navigation, block rearrangement, image reconstruction, and object reorganization. CoSPlan assesses two key abilities: Error Detection (identifying non-optimal actions) and Step Completion (correcting and completing action sequences to reach the goal). Despite using state-of-the-art reasoning techniques such as Chain-of-Thought and Scene Graphs, VLMs (e.g. Intern-VLM and Qwen2) struggle on CoSPlan, failing to leverage contextual cues to reach goals. Addressing this, we propose a novel training-free method, Scene Graph Incremental updates (SGI), which introduces intermediate reasoning steps between the initial and goal states. SGI helps VLMs reason about sequences, yielding an average performance gain of 5.2%. In addition to enhancing reliability in corrective sequential planning, SGI generalizes to traditional planning tasks such as Plan-Bench and VQA.

84. 【2512.10340】Zero-shot Adaptation of Stable Diffusion via Plug-in Hierarchical Degradation Representation for Real-World Super-Resolution

Link: https://arxiv.org/abs/2512.10340

Authors: Yi-Cheng Liao, Shyang-En Weng, Yu-Syuan Xu, Chi-Wei Hsiao, Wei-Chen Chiu, Ching-Chun Huang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: complex real-world factors, recover high-quality images, low-quality inputs degraded, aims to recover, recover high-quality

Abstract:Real-World Image Super-Resolution (Real-ISR) aims to recover high-quality images from low-quality inputs degraded by unknown and complex real-world factors. Real-world scenarios involve diverse and coupled degradations, making it necessary to provide diffusion models with richer and more informative guidance. However, existing methods often assume known degradation severity and rely on CLIP text encoders that cannot capture numerical severity, limiting their generalization ability. To address this, we propose HD-CLIP (Hierarchical Degradation CLIP), which decomposes a low-quality image into a semantic embedding and an ordinal degradation embedding that captures ordered relationships and allows interpolation across unseen levels. Furthermore, we integrated it into diffusion models via classifier-free guidance (CFG) and proposed classifier-free projection guidance (CFPG). HD-CLIP leverages semantic cues to guide generative restoration while using degradation cues to suppress undesired hallucinations and artifacts. As a plug-and-play module, HD-CLIP can be seamlessly integrated into various super-resolution frameworks without training, significantly improving detail fidelity and perceptual realism across diverse real-world datasets.

85. 【2512.10334】A Conditional Generative Framework for Synthetic Data Augmentation in Segmenting Thin and Elongated Structures in Biological Images

Link: https://arxiv.org/abs/2512.10334

Authors: Yi Liu, Yichi Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: play important roles, Thin and elongated, microtubules and actin, play important, important roles

Abstract:Thin and elongated filamentous structures, such as microtubules and actin filaments, often play important roles in biological systems. Segmenting these filaments in biological images is a fundamental step for quantitative analysis. Recent advances in deep learning have significantly improved the performance of filament segmentation. However, acquiring high-quality pixel-level annotated datasets for filamentous structures remains a major challenge, as the dense distribution and geometric properties of filaments make manual annotation extremely laborious and time-consuming. To address the data shortage problem, we propose a conditional generative framework based on the Pix2Pix architecture to generate realistic filaments in microscopy images from binary masks. We also propose a filament-aware structural loss to improve the structure similarity when generating synthetic images. Our experiments demonstrate the effectiveness of our approach, which outperforms existing models trained without synthetic data.

86. 【2512.10327】Simple Yet Effective Selective Imputation for Incomplete Multi-view Clustering

Link: https://arxiv.org/abs/2512.10327

Authors: Cai Xu, Jinlong Liu, Yilin Zhang, Ziyu Guan, Wei Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: pose significant challenges, Incomplete multi-view data, Incomplete multi-view, pose significant, significant challenges

Abstract:Incomplete multi-view data, where different views suffer from missing and unbalanced observations, pose significant challenges for clustering. Existing imputation-based methods attempt to estimate missing views to restore data associations, but indiscriminate imputation often introduces noise and bias, especially when the available information is insufficient. Imputation-free methods avoid this risk by relying solely on observed data, but struggle under severe incompleteness due to the lack of cross-view complementarity. To address this issue, we propose Informativeness-based Selective imputation Multi-View Clustering (ISMVC). Our method evaluates the imputation-relevant informativeness of each missing position based on intra-view similarity and cross-view consistency, and selectively imputes only when sufficient support is available. Furthermore, we integrate this selection with a variational autoencoder equipped with a mixture-of-Gaussians prior to learn clustering-friendly latent representations. By performing distribution-level imputation, ISMVC not only stabilizes the aggregation of posterior distributions but also explicitly models imputation uncertainty, enabling robust fusion and preventing overconfident reconstructions. Compared with existing cautious imputation strategies that depend on training dynamics or model feedback, our method is lightweight, data-driven, and model-agnostic. It can be readily integrated into existing IMC models as a plug-in module. Extensive experiments on multiple benchmark datasets under a more realistic and challenging unbalanced missing scenario demonstrate that our method outperforms both imputation-based and imputation-free approaches.

87. 【2512.10326】StainNet: A Special Staining Self-Supervised Vision Transformer for Computational Pathology

Link: https://arxiv.org/abs/2512.10326

Authors: Jiawen Li, Jiali Hu, Xitong Ling, Yongqiang Lv, Yuxuan Chen, Yizhi Wang, Tian Guan, Yifei Liu, Yonghong He

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large-scale histological images, large-scale histological, significantly accelerated, accelerated the development, development of computational

Comments: 15 pages, 6 figures

Abstract:Foundation models trained with self-supervised learning (SSL) on large-scale histological images have significantly accelerated the development of computational pathology. These models can serve as backbones for region-of-interest (ROI) image analysis or patch-level feature extractors in whole-slide images (WSIs) based on multiple instance learning (MIL). Existing pathology foundation models (PFMs) are typically pre-trained on Hematoxylin-Eosin (H&E) stained pathology images. However, images with special stains, such as immunohistochemistry, are also frequently used in clinical practice. PFMs pre-trained mainly on H&E-stained images may be limited in clinical applications involving special stains. To address this issue, we propose StainNet, a specialized foundation model for special stains based on the vision transformer (ViT) architecture. StainNet adopts a self-distillation SSL approach and is trained on over 1.4 million patch images cropped from 20,231 publicly available special staining WSIs in the HISTAI database. To evaluate StainNet, we conduct experiments on an in-house slide-level liver malignancy classification task and two public ROI-level datasets to demonstrate its strong ability. We also perform few-ratio learning and retrieval evaluations, and compare StainNet with recent larger PFMs to further highlight its strengths. We have released the StainNet model weights at: this https URL.

88. 【2512.10324】EchoingPixels: Cross-Modal Adaptive Token Reduction for Efficient Audio-Visual LLMs

Link: https://arxiv.org/abs/2512.10324

Authors: Chao Gong, Depeng Wang, Zhipeng Wei, Ya Guo, Huijia Zhu, Jingjing Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Audio-Visual Large Language, Language Models, Large Language, face prohibitive computational

Comments

Abstract:Audio-Visual Large Language Models (AV-LLMs) face prohibitive computational overhead from massive audio and video tokens. Token reduction, while extensively explored for video-only LLMs, is insufficient for the audio-visual domain, as these unimodal methods cannot leverage audio-visual cross-modal synergies. Furthermore, the distinct and dynamic information densities of audio and video render static budgets per modality suboptimal. How to perform token reduction on a joint audio-visual stream thus remains an unaddressed bottleneck. To fill this gap, we introduce EchoingPixels, a framework inspired by the coexistence and interaction of visuals and sound in real-world scenes. The core of our framework is the Cross-Modal Semantic Sieve (CS2), a module enabling early audio-visual interaction. Instead of compressing modalities independently, CS2 co-attends to the joint multimodal stream and reduces tokens from an entire combined pool of audio-visual tokens rather than using fixed budgets per modality. This single-pool approach allows it to adaptively allocate the token budget across both modalities and dynamically identify salient tokens in concert. To ensure this aggressive reduction preserves the vital temporal modeling capability, we co-design a Synchronization-Augmented RoPE (Sync-RoPE) to maintain critical temporal relationships for the sparsely selected tokens. Extensive experiments demonstrate that EchoingPixels achieves performance comparable to strong baselines using only 5-20% of the original tokens, with a 2-3x speedup and memory reduction.
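
As a rough illustration of the single-pool idea in this abstract, the sketch below contrasts fixed per-modality budgets with a top-k selection over the combined audio-visual pool. The saliency scores, the `Token` tuple, and both function names are hypothetical. The abstract does not say how CS2 scores tokens, so this is a toy under stated assumptions and not the authors' implementation.

```python
from typing import List, Tuple

# (modality, index, saliency score); the scores are made-up illustration values.
Token = Tuple[str, int, float]


def select_fixed_budget(tokens: List[Token], per_modality: int) -> List[Token]:
    """Keep the top `per_modality` tokens of each modality independently."""
    kept: List[Token] = []
    for modality in ("audio", "video"):
        group = [t for t in tokens if t[0] == modality]
        group.sort(key=lambda t: t[2], reverse=True)
        kept.extend(group[:per_modality])
    return kept


def select_single_pool(tokens: List[Token], budget: int) -> List[Token]:
    """Keep the top `budget` tokens from the joint audio-visual pool."""
    return sorted(tokens, key=lambda t: t[2], reverse=True)[:budget]


# A clip whose audio carries most of the information (assumed scores).
tokens = [
    ("audio", 0, 0.9), ("audio", 1, 0.8), ("audio", 2, 0.7),
    ("video", 0, 0.6), ("video", 1, 0.2), ("video", 2, 0.1),
]

fixed = select_fixed_budget(tokens, per_modality=2)
pooled = select_single_pool(tokens, budget=4)

# Count how many audio tokens survive under each strategy.
fixed_audio = sum(1 for t in fixed if t[0] == "audio")
pooled_audio = sum(1 for t in pooled if t[0] == "audio")
print(fixed_audio, pooled_audio)
```

With the same total budget of four tokens, the pooled selection keeps one more audio token than the fixed split, which is the adaptive allocation the abstract describes.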

89. 【2512.10321】Point2Pose: A Generative Framework for 3D Human Pose Estimation with Multi-View Point Cloud Dataset

Link: https://arxiv.org/abs/2512.10321

Authors: Hyunsoo Lee, Daeum Jeon, Hyeokjae Oh

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: human pose estimation, pose estimation, pose estimation poses, human, pose

Comments: WACV 2026 camera ready

Abstract:We propose a novel generative approach for 3D human pose estimation. 3D human pose estimation poses several key challenges due to the complex geometry of the human body, self-occluding joints, and the requirement for large-scale real-world motion datasets. To address these challenges, we introduce Point2Pose, a framework that effectively models the distribution of human poses conditioned on sequential point cloud and pose history. Specifically, we employ a spatio-temporal point cloud encoder and a pose feature encoder to extract joint-wise features, followed by an attention-based generative regressor. Additionally, we present a large-scale indoor dataset MVPose3D, which contains multiple modalities, including IMU data of non-trivial human motions, dense multi-view point clouds, and RGB images. Experimental results show that the proposed method outperforms the baseline models, demonstrating its superior performance across various datasets.

90. 【2512.10319】Design of a six wheel suspension and a three-axis linear actuation mechanism for a laser weeding robot

Link: https://arxiv.org/abs/2512.10319

Authors: Muhammad Usama, Muhammad Ibrahim Khan, Ahmad Hasan, Muhammad Shaaf Nadeem, Khawaja Fahad Iqbal, Jawad Aslam, Mian Ashfaq Ali, Asad Nisar Awan

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)

Keywords: automate labor-intensive tasks, increasingly utilized, utilized in agriculture, agriculture to automate, automate labor-intensive

Comments: 15 Pages, 10 figures

Abstract:Mobile robots are increasingly utilized in agriculture to automate labor-intensive tasks such as weeding, sowing, harvesting and soil analysis. Recently, agricultural robots have been developed to detect and remove weeds using mechanical tools or precise herbicide sprays. Mechanical weeding is inefficient over large fields, and herbicides harm the soil ecosystem. Laser weeding with mobile robots has emerged as a sustainable alternative in precision farming. In this paper, we present an autonomous weeding robot that uses controlled exposure to a low energy laser beam for weed removal. The proposed robot is six-wheeled with a novel double four-bar suspension for higher stability. The laser is guided towards the detected weeds by a three-dimensional linear actuation mechanism. Field tests have demonstrated the robot's capability to navigate agricultural terrains effectively by overcoming obstacles up to 15 cm in height. At an optimal speed of 42.5 cm/s, the robot achieves a weed detection rate of 86.2% and operating time of 87 seconds per meter. The laser actuation mechanism maintains a minimal mean positional error of 1.54 mm, combined with a high hit rate of 97%, ensuring effective and accurate weed removal. This combination of speed, accuracy, and efficiency highlights the robot's potential for significantly enhancing precision farming practices.

91. 【2512.10316】ConStruct: Structural Distillation of Foundation Models for Prototype-Based Weakly Supervised Histopathology Segmentation

Link: https://arxiv.org/abs/2512.10316

Authors: Khang Le (equal contribution), Ha Thach (equal contribution), Anh M. Vu (equal contribution), Trang T. K. Vo, Han H. Huynh, David Yang, Minh H. N. Le, Thanh-Huy Nguyen, Akash Awasthi, Chandra Mohan, Zhu Han, Hien Van Nguyen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Weakly supervised semantic, histopathology relies heavily, Weakly supervised, full spatial extent, supervised semantic segmentation

Comments

Abstract:Weakly supervised semantic segmentation (WSSS) in histopathology relies heavily on classification backbones, yet these models often localize only the most discriminative regions and struggle to capture the full spatial extent of tissue structures. Vision-language models such as CONCH offer rich semantic alignment and morphology-aware representations, while modern segmentation backbones like SegFormer preserve fine-grained spatial cues. However, combining these complementary strengths remains challenging, especially under weak supervision and without dense annotations. We propose a prototype learning framework for WSSS in histopathological images that integrates morphology-aware representations from CONCH, multi-scale structural cues from SegFormer, and text-guided semantic alignment to produce prototypes that are simultaneously semantically discriminative and spatially coherent. To effectively leverage these heterogeneous sources, we introduce text-guided prototype initialization that incorporates pathology descriptions to generate more complete and semantically accurate pseudo-masks. A structural distillation mechanism transfers spatial knowledge from SegFormer to preserve fine-grained morphological patterns and local tissue boundaries during prototype learning. Our approach produces high-quality pseudo masks without pixel-level annotations, improves localization completeness, and enhances semantic consistency across tissue types. Experiments on BCSS-WSSS datasets demonstrate that our prototype learning framework outperforms existing WSSS methods while remaining computationally efficient through frozen foundation model backbones and lightweight trainable adapters.

92. 【2512.10314】DualProtoSeg: Simple and Efficient Design with Text- and Image-Guided Prototype Learning for Weakly Supervised Histopathology Image Segmentation

Link: https://arxiv.org/abs/2512.10314

Authors: Anh M. Vu (equal contribution), Khang P. Le (equal contribution), Trang T. K. Vo (equal contribution), Ha Thach, Huy Hung Nguyen, David Yang, Han H. Huynh, Quynh Nguyen, Tuan M. Pham, Tuan-Anh Le, Minh H. N. Le, Thanh-Huy Nguyen, Akash Awasthi, Chandra Mohan, Zhu Han, Hien Van Nguyen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Weakly supervised semantic, reduce annotation cost, supervised semantic segmentation, Weakly supervised, intra-class heterogeneity

Comments

Abstract:Weakly supervised semantic segmentation (WSSS) in histopathology seeks to reduce annotation cost by learning from image-level labels, yet it remains limited by inter-class homogeneity, intra-class heterogeneity, and the region-shrinkage effect of CAM-based supervision. We propose a simple and effective prototype-driven framework that leverages vision-language alignment to improve region discovery under weak supervision. Our method integrates CoOp-style learnable prompt tuning to generate text-based prototypes and combines them with learnable image prototypes, forming a dual-modal prototype bank that captures both semantic and appearance cues. To address oversmoothing in ViT representations, we incorporate a multi-scale pyramid module that enhances spatial precision and improves localization quality. Experiments on the BCSS-WSSS benchmark show that our approach surpasses existing state-of-the-art methods, and detailed analyses demonstrate the benefits of text description diversity, context length, and the complementary behavior of text and image prototypes. These results highlight the effectiveness of jointly leveraging textual semantics and visual prototype learning for WSSS in digital pathology.

93. 【2512.10310】Efficient-VLN: A Training-Efficient Vision-Language Navigation Model

Link: https://arxiv.org/abs/2512.10310

Authors: Duo Zheng, Shijia Huang, Yanyang Li, Liwei Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal large language, shown promising potential, Multimodal large, Vision-Language Navigation, large language models

Comments

Abstract:Multimodal large language models (MLLMs) have shown promising potential in Vision-Language Navigation (VLN). However, their practical development is severely hindered by the substantial training overhead. We recognize two key issues that contribute to the overhead: (1) the quadratic computational burden from processing long-horizon historical observations as massive sequences of tokens, and (2) the exploration-efficiency trade-off in DAgger, i.e., a data aggregation process of collecting agent-explored trajectories. While more exploration yields effective error-recovery trajectories for handling test-time distribution shifts, it comes at the cost of longer trajectory lengths for both training and inference. To address these challenges, we propose Efficient-VLN, a training-efficient VLN model. Specifically, to mitigate the token processing burden, we design two efficient memory mechanisms: a progressive memory that dynamically allocates more tokens to recent observations, and a learnable recursive memory that utilizes the key-value cache of learnable tokens as the memory state. Moreover, we introduce a dynamic mixed policy to balance the exploration-efficiency trade-off. Extensive experiments show that Efficient-VLN achieves state-of-the-art performance on R2R-CE (64.2% SR) and RxR-CE (67.0% SR). Critically, our model consumes merely 282 H800 GPU hours, demonstrating a dramatic reduction in training overhead compared to state-of-the-art methods.

94. 【2512.10293】Physically Aware 360° View Generation from a Single Image using Disentangled Scene Embeddings

Link: https://arxiv.org/abs/2512.10293

Authors: Karthikeya KV, Narendra Bandaru

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: direction disentangled volume, natural scene reconstruction, disentangled volume rendering, Gaussian Splatting backbone, unique view synthesis

Comments

Abstract:We introduce Disentangled360, an innovative 3D-aware technology that integrates the advantages of direction disentangled volume rendering with single-image 360° unique view synthesis for applications in medical imaging and natural scene reconstruction. In contrast to current techniques that either oversimplify anisotropic light behavior or lack generalizability across various contexts, our framework distinctly differentiates between isotropic and anisotropic contributions inside a Gaussian Splatting backbone. We implement a dual-branch conditioning framework, one optimized for CT intensity driven scattering in volumetric data and the other for real-world RGB scenes through normalized camera embeddings. To address scale ambiguity and maintain structural realism, we present a hybrid pose agnostic anchoring method that adaptively samples scene depth and material transitions, functioning as stable pivots during scene distillation. Our design integrates preoperative radiography simulation and consumer-grade 360° rendering into a singular inference pipeline, facilitating rapid, photorealistic view synthesis with inherent directionality. Evaluations on the Mip-NeRF 360, RealEstate10K, and DeepDRR datasets indicate superior SSIM and LPIPS performance, while runtime assessments confirm its viability for interactive applications. Disentangled360 facilitates mixed-reality medical supervision, robotic perception, and immersive content creation, eliminating the necessity for scene-specific finetuning or expensive photon simulations.

95. 【2512.10286】ShotDirector: Directorially Controllable Multi-Shot Video Generation with Cinematographic Transitions

Link: https://arxiv.org/abs/2512.10286

Authors: Xiaoxue Wu, Xinyuan Chen, Yaohui Wang, Yu Qiao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Shot transitions play, narrative expression, play a pivotal, pivotal role, coherent narrative expression

Comments: Project Page: [this https URL](https://uknowsth.github.io/ShotDirector/)

Abstract:Shot transitions play a pivotal role in multi-shot video generation, as they determine the overall narrative expression and the directorial design of visual storytelling. However, recent progress has primarily focused on low-level visual consistency across shots, neglecting how transitions are designed and how cinematographic language contributes to coherent narrative expression. This often leads to mere sequential shot changes without intentional film-editing patterns. To address this limitation, we propose ShotDirector, an efficient framework that integrates parameter-level camera control and hierarchical editing-pattern-aware prompting. Specifically, we adopt a camera control module that incorporates 6-DoF poses and intrinsic settings to enable precise camera information injection. In addition, a shot-aware mask mechanism is employed to introduce hierarchical prompts aware of professional editing patterns, allowing fine-grained control over shot content. Through this design, our framework effectively combines parameter-level conditions with high-level semantic guidance, achieving film-like controllable shot transitions. To facilitate training and evaluation, we construct ShotWeaver40K, a dataset that captures the priors of film-like editing patterns, and develop a set of evaluation metrics for controllable multi-shot video generation. Extensive experiments demonstrate the effectiveness of our framework.

96. 【2512.10284】MotionEdit: Benchmarking and Learning Motion-Centric Image Editing

Link: https://arxiv.org/abs/2512.10284

Authors: Yixin Wan, Lei Ke, Wenhao Yu, Kai-Wei Chang, Dong Yu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: modifying subject actions, motion-centric image editing-the, image editing-the task, preserving identity, physical plausibility

Comments

Abstract:We introduce MotionEdit, a novel dataset for motion-centric image editing, the task of modifying subject actions and interactions while preserving identity, structure, and physical plausibility. Unlike existing image editing datasets that focus on static appearance changes or contain only sparse, low-quality motion edits, MotionEdit provides high-fidelity image pairs depicting realistic motion transformations extracted and verified from continuous videos. This new task is not only scientifically challenging but also practically significant, powering downstream applications such as frame-controlled video synthesis and animation. To evaluate model performance on the novel task, we introduce MotionEdit-Bench, a benchmark that challenges models on motion-centric edits and measures model performance with generative, discriminative, and preference-based metrics. Benchmark results reveal that motion editing remains highly challenging for existing state-of-the-art diffusion-based editing models. To address this gap, we propose MotionNFT (Motion-guided Negative-aware Fine Tuning), a post-training framework that computes motion alignment rewards based on how well the motion flow between input and model-edited images matches the ground-truth motion, guiding models toward accurate motion transformations. Extensive experiments on FLUX.1 Kontext and Qwen-Image-Edit show that MotionNFT consistently improves editing quality and motion fidelity of both base models on the motion editing task without sacrificing general editing ability, demonstrating its effectiveness.

97. 【2512.10275】Sample-wise Adaptive Weighting for Transfer Consistency in Adversarial Distillation

Link: https://arxiv.org/abs/2512.10275

Authors: Hongsin Lee, Hye Won Chung

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: standard min-max adversarial, training framework aims, compact student, robust teacher network, standard min-max

Comments

Abstract:Adversarial distillation in the standard min-max adversarial training framework aims to transfer adversarial robustness from a large, robust teacher network to a compact student. However, existing work often neglects to incorporate state-of-the-art robust teachers. Through extensive analysis, we find that stronger teachers do not necessarily yield more robust students, a phenomenon known as robust saturation. While typically attributed to capacity gaps, we show that such explanations are incomplete. Instead, we identify adversarial transferability (the fraction of student-crafted adversarial examples that remain effective against the teacher) as a key factor in successful robustness transfer. Based on this insight, we propose Sample-wise Adaptive Adversarial Distillation (SAAD), which reweights training examples by their measured transferability without incurring additional computational cost. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet show that SAAD consistently improves AutoAttack robustness over prior methods. Our code is available at this https URL.
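
The transferability measure defined in this abstract can be written down in a few lines. The sketch below is a minimal reading of that definition, assuming one boolean flag per student-crafted adversarial example. The function name and the flag values are hypothetical, and the reweighting rule that SAAD applies on top of this measure is not specified in the abstract, so it is left out.

```python
from typing import Sequence


def transferability(fools_teacher: Sequence[bool]) -> float:
    """Fraction of student-crafted adversarial examples that also fool the teacher."""
    if not fools_teacher:
        raise ValueError("need at least one adversarial example")
    return sum(fools_teacher) / len(fools_teacher)


# One flag per adversarial example crafted against the student (made-up values):
# True means the teacher also misclassifies that example.
flags = [True, False, True, True, False, False, True, False]

print(transferability(flags))
```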

98. 【2512.10267】Long-LRM++: Preserving Fine Details in Feed-Forward Wide-Coverage Reconstruction

Link: https://arxiv.org/abs/2512.10267

Authors: Chen Ziwen, Hao Tan, Peng Wang, Zexiang Xu, Li Fuxin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generalizable Gaussian splatting, Recent advances, enabled feed-forward reconstruction, advances in generalizable, enabled feed-forward

Comments

Abstract:Recent advances in generalizable Gaussian splatting (GS) have enabled feed-forward reconstruction of scenes from tens of input views. Long-LRM notably scales this paradigm to 32 input images at $950\times540$ resolution, achieving 360° scene-level reconstruction in a single forward pass. However, directly predicting millions of Gaussian parameters at once remains highly error-sensitive: small inaccuracies in positions or other attributes lead to noticeable blurring, particularly in fine structures such as text. In parallel, implicit representation methods such as LVSM and LaCT have demonstrated significantly higher rendering fidelity by compressing scene information into model weights rather than explicit Gaussians, and decoding RGB frames using the full transformer or TTT backbone. However, this computationally intensive decompression process for every rendered frame makes real-time rendering infeasible. These observations raise key questions: Is the deep, sequential "decompression" process necessary? Can we retain the benefits of implicit representations while enabling real-time performance? We address these questions with Long-LRM++, a model that adopts a semi-explicit scene representation combined with a lightweight decoder. Long-LRM++ matches the rendering quality of LaCT on DL3DV while achieving real-time 14 FPS rendering on an A100 GPU, overcoming the speed limitations of prior implicit methods. Our design also scales to 64 input views at the $950\times540$ resolution, demonstrating strong generalization to increased input lengths. Additionally, Long-LRM++ delivers superior novel-view depth prediction on ScanNetv2 compared to direct depth rendering from Gaussians. Extensive ablation studies validate the effectiveness of each component in the proposed framework.

99. 【2512.10262】VLM-NCD: Novel Class Discovery with Vision-Based Large Language Models

Link: https://arxiv.org/abs/2512.10262

Authors: Yuetong Su, Baoguo Wei, Xinyu Wang, Xu Li, Lixin Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: utilise prior knowledge, Class Discovery aims, aims to utilise, utilise prior, prior knowledge

Comments: 8 pages, 5 figures, conference

Abstract:Novel Class Discovery aims to utilise prior knowledge of known classes to classify and discover unknown classes from unlabelled data. Existing NCD methods for images primarily rely on visual features, which suffer from limitations such as insufficient feature discriminability and the long-tail distribution of data. We propose LLM-NCD, a multimodal framework that breaks this bottleneck by fusing visual-textual semantics and prototype-guided clustering. Our key innovation lies in modelling cluster centres and semantic prototypes of known classes by jointly optimising known class image and text features, and a dual-phase discovery mechanism that dynamically separates known or novel samples via semantic affinity thresholds and adaptive clustering. Experiments on the CIFAR-100 dataset show that compared to the current methods, this method achieves up to 25.3% improvement in accuracy for unknown classes. Notably, our method shows unique resilience to long-tail distributions, a first in NCD literature.

100. 【2512.10252】GDKVM: Echocardiography Video Segmentation via Spatiotemporal Key-Value Memory with Gated Delta Rule

Link: https://arxiv.org/abs/2512.10252

Authors: Rui Wang, Yimu Sun, Jingxing Guo, Huisi Wu, Jing Qin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: cardiac function, cardiac chambers, analysis of cardiac, Accurate segmentation, aiding in clinical

Comments: Accepted to ICCV 2025

Abstract:Accurate segmentation of cardiac chambers in echocardiography sequences is crucial for the quantitative analysis of cardiac function, aiding in clinical diagnosis and treatment. The imaging noise, artifacts, and the deformation and motion of the heart pose challenges to segmentation algorithms. While existing methods based on convolutional neural networks, Transformers, and space-time memory networks have improved segmentation accuracy, they often struggle with the trade-off between capturing long-range spatiotemporal dependencies and maintaining computational efficiency with fine-grained feature representation. In this paper, we introduce GDKVM, a novel architecture for echocardiography video segmentation. The model employs Linear Key-Value Association (LKVA) to effectively model inter-frame correlations, and introduces Gated Delta Rule (GDR) to efficiently store intermediate memory states. Key-Pixel Feature Fusion (KPFF) module is designed to integrate local and global features at multiple scales, enhancing robustness against boundary blurring and noise interference. We validated GDKVM on two mainstream echocardiography video datasets (CAMUS and EchoNet-Dynamic) and compared it with various state-of-the-art methods. Experimental results show that GDKVM outperforms existing approaches in terms of segmentation accuracy and robustness, while ensuring real-time performance. Code is available at this https URL.

101. 【2512.10251】THE-Pose: Topological Prior with Hybrid Graph Fusion for Estimating Category-Level 6D Object Pose

Link: https://arxiv.org/abs/2512.10251

Authors: Eunho Lee, Chaehyeon Song, Seunghoon Jeong, Ayoung Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: hybrid graph fusion, pose estimation requires, intra-class variations, requires both global, robustness against intra-class

Comments

Abstract:Category-level object pose estimation requires both global context and local structure to ensure robustness against intra-class variations. However, 3D graph convolution (3D-GC) methods only focus on local geometry and depth information, making them vulnerable to complex objects and visual ambiguities. To address this, we present THE-Pose, a novel category-level 6D pose estimation framework that leverages a topological prior via surface embedding and hybrid graph fusion. Specifically, we extract consistent and invariant topological features from the image domain, effectively overcoming the limitations inherent in existing 3D-GC based methods. Our Hybrid Graph Fusion (HGF) module adaptively integrates the topological features with point-cloud features, seamlessly bridging 2D image context and 3D geometric structure. These fused features ensure stability for unseen or complicated objects, even under significant occlusions. Extensive experiments on the REAL275 dataset show that THE-Pose achieves a 35.8% improvement over the 3D-GC baseline (HS-Pose) and surpasses the previous state-of-the-art by 7.2% across all key metrics. The code is available at this https URL.

102. 【2512.10248】RobustSora: De-Watermarked Benchmark for Robust AI-Generated Video Detection

Link: https://arxiv.org/abs/2512.10248

Authors: Zhuo Wang, Xiliang Liu, Ligang Sun

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: technologies poses challenges, AI-generated video technologies, video technologies poses, information integrity, proliferation of AI-generated

Comments

Abstract:The proliferation of AI-generated video technologies poses challenges to information integrity. While recent benchmarks advance AIGC video detection, they overlook a critical factor: many state-of-the-art generative models embed digital watermarks in outputs, and detectors may partially rely on these patterns. To evaluate this influence, we present RobustSora, a benchmark designed to assess watermark robustness in AIGC video detection. We systematically construct a dataset of 6,500 videos comprising four types: Authentic-Clean (A-C), Authentic-Spoofed with fake watermarks (A-S), Generated-Watermarked (G-W), and Generated-DeWatermarked (G-DeW). Our benchmark introduces two evaluation tasks: Task-I tests performance on watermark-removed AI videos, while Task-II assesses false alarm rates on authentic videos with fake watermarks. Experiments with ten models spanning specialized AIGC detectors, transformer architectures, and MLLM approaches reveal performance variations of 2-8pp under watermark manipulation. Transformer-based models show consistent moderate dependency (6-8pp), while MLLMs exhibit diverse patterns (2-8pp). These findings indicate partial watermark dependency and highlight the need for watermark-aware training strategies. RobustSora provides essential tools to advance robust AIGC detection research.

103. 【2512.10244】Solving Semi-Supervised Few-Shot Learning from an Auto-Annotation Perspective

Link: https://arxiv.org/abs/2512.10244

Authors: Tian Liu, Anwesha Basu, James Caverlee, Shu Kong

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Semi-supervised few-shot learning, formulates real-world applications, Semi-supervised few-shot, formulates real-world, real-world applications

Comments: website and code: [this https URL](https://tian1327.github.io/SWIFT)

Abstract:Semi-supervised few-shot learning (SSFSL) formulates real-world applications like "auto-annotation", as it aims to learn a model over a few labeled and abundant unlabeled examples to annotate the unlabeled ones. Despite the availability of powerful open-source Vision-Language Models (VLMs) and their pretraining data, the SSFSL literature largely neglects these open-source resources. In contrast, the related area few-shot learning (FSL) has already exploited them to boost performance. Arguably, to achieve auto-annotation in the real world, SSFSL should leverage such open-source resources. To this end, we start by applying established SSL methods to finetune a VLM. Counterintuitively, they significantly underperform FSL baselines. Our in-depth analysis reveals the root cause: VLMs produce rather "flat" distributions of softmax probabilities. This results in zero utilization of unlabeled data and weak supervision signals. We address this issue with embarrassingly simple techniques: classifier initialization and temperature tuning. They jointly increase the confidence scores of pseudo-labels, improving the utilization rate of unlabeled data, and strengthening supervision signals. Building on this, we propose: Stage-Wise Finetuning with Temperature Tuning (SWIFT), which enables existing SSL methods to effectively finetune a VLM on limited labeled data, abundant unlabeled data, and task-relevant but noisy data retrieved from the VLM's pretraining set. Extensive experiments on five SSFSL benchmarks show that SWIFT outperforms recent FSL and SSL methods by ~5 accuracy points. SWIFT even rivals supervised learning, which finetunes VLMs with the unlabeled data being labeled with ground truth!
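
To see why a flat softmax distribution leaves unlabeled data unused, and how temperature tuning changes that, consider the small sketch below. It assumes similarity-style logits in a narrow range and a confidence threshold of 0.95 for accepting pseudo-labels. Both values, and the temperature of 0.01, are illustrative assumptions and are not taken from the paper.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax over logits divided by a temperature (lower means sharper)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Cosine-similarity-like logits in a narrow range (assumed values).
logits = [0.30, 0.25, 0.22, 0.20]
threshold = 0.95  # assumed pseudo-label confidence threshold

flat = softmax(logits, temperature=1.0)
sharp = softmax(logits, temperature=0.01)

# A pseudo-label is only used when the top probability clears the threshold.
print(max(flat) >= threshold, max(sharp) >= threshold)
```

At temperature 1.0 the top class never reaches the threshold, so the sample contributes no pseudo-label. With a small temperature the same logits yield a confident prediction.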

104. 【2512.10237】Multi-dimensional Preference Alignment by Conditioning Reward Itself

Link: https://arxiv.org/abs/2512.10237

Authors: Jiho Jang, Jinyoung Kim, Kyungjune Baek, Nojun Kwak

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reinforcement Learning, Learning from Human, Human Feedback, Feedback has emerged, aligning diffusion models

Comments

点击查看摘要

Abstract:Reinforcement Learning from Human Feedback has emerged as a standard for aligning diffusion models. However, we identify a fundamental limitation in the standard DPO formulation because it relies on the Bradley-Terry model to aggregate diverse evaluation axes like aesthetic quality and semantic alignment into a single scalar reward. This aggregation creates a reward conflict where the model is forced to unlearn desirable features of a specific dimension if they appear in a globally non-preferred sample. To address this issue, we propose Multi Reward Conditional DPO (MCDPO). This method resolves reward conflicts by introducing a disentangled Bradley-Terry objective. MCDPO explicitly injects a preference outcome vector as a condition during training, which allows the model to learn the correct optimization direction for each reward axis independently within a single network. We further introduce dimensional reward dropout to ensure balanced optimization across dimensions. Extensive experiments on Stable Diffusion 1.5 and SDXL demonstrate that MCDPO achieves superior performance on benchmarks. Notably, our conditional framework enables dynamic and multiple-axis control at inference time using Classifier Free Guidance to amplify specific reward dimensions without additional training or external reward models.

105. 【2512.10230】Emerging Standards for Machine-to-Machine Video Coding

Link: https://arxiv.org/abs/2512.10230

Authors: Md Eimran Hossain Eimon, Velibor Adzic, Hari Kalva, Borko Furht

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Video Coding, Picture Experts Group, Versatile Video Coding, Moving Picture Experts, Advanced Video Coding

Comments: none

Abstract:Machines are increasingly becoming the primary consumers of visual data, yet most deployments of machine-to-machine systems still rely on remote inference where pixel-based video is streamed using codecs optimized for human perception. Consequently, this paradigm is bandwidth intensive, scales poorly, and exposes raw images to third parties. Recent efforts in the Moving Picture Experts Group (MPEG) redesigned the pipeline for machine-to-machine communication: Video Coding for Machines (VCM) is designed to apply task-aware coding tools in the pixel domain, and Feature Coding for Machines (FCM) is designed to compress intermediate neural features to reduce bitrate, preserve privacy, and support compute offload. Experiments show that FCM is capable of maintaining accuracy close to edge inference while significantly reducing bitrate. Additional analysis of H.26X codecs used as inner codecs in FCM reveals that H.265/High Efficiency Video Coding (HEVC) and H.266/Versatile Video Coding (VVC) achieve almost identical machine task performance, with an average BD-Rate increase of 1.39% when VVC is replaced with HEVC. In contrast, H.264/Advanced Video Coding (AVC) yields an average BD-Rate increase of 32.28% compared to VVC. However, for the tracking task, the impact of codec choice is minimal: HEVC outperforms VVC with a BD-Rate of -1.81%, while AVC yields a BD-Rate of 8.79%, indicating that existing hardware for already deployed codecs can support machine-to-machine communication without degrading performance.

106. 【2512.10226】Latent Chain-of-Thought World Modeling for End-to-End Driving

Link: https://arxiv.org/abs/2512.10226

Authors: Shuhan Tan, Kashyap Chitta, Yuxiao Chen, Ran Tian, Yurong You, Yan Wang, Wenjie Luo, Yulong Cao, Philipp Krahenbuhl, Marco Pavone, Boris Ivanovic

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: autonomous driving explore, driving explore inference-time, improve driving performance, explore inference-time reasoning, challenging scenarios

Comments: Technical Report

Abstract:Recent Vision-Language-Action (VLA) models for autonomous driving explore inference-time reasoning as a way to improve driving performance and safety in challenging scenarios. Most prior work uses natural language to express chain-of-thought (CoT) reasoning before producing driving actions. However, text may not be the most efficient representation for reasoning. In this work, we present Latent-CoT-Drive (LCDrive): a model that expresses CoT in a latent language that captures possible outcomes of the driving actions being considered. Our approach unifies CoT reasoning and decision making by representing both in an action-aligned latent space. Instead of natural language, the model reasons by interleaving (1) action-proposal tokens, which use the same vocabulary as the model's output actions; and (2) world model tokens, which are grounded in a learned latent world model and express future outcomes of these actions. We cold start latent CoT by supervising the model's action proposals and world model tokens based on ground-truth future rollouts of the scene. We then post-train with closed-loop reinforcement learning to strengthen reasoning capabilities. On a large-scale end-to-end driving benchmark, LCDrive achieves faster inference, better trajectory quality, and larger improvements from interactive reinforcement learning compared to both non-reasoning and text-reasoning baselines.

107. 【2512.10224】Federated Domain Generalization with Latent Space Inversion

Link: https://arxiv.org/abs/2512.10224

Authors: Ragja Palakkadavath, Hung Le, Thanh Nguyen-Tang, Svetha Venkatesh, Sunil Gupta

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: federated learning framework, federated learning, addresses distribution shifts, Federated domain generalization, learning framework

Comments: Accepted at ICDM 2025

Abstract:Federated domain generalization (FedDG) addresses distribution shifts among clients in a federated learning framework. FedDG methods aggregate the parameters of locally trained client models to form a global model that generalizes to unseen clients while preserving data privacy. While improving the generalization capability of the global model, many existing approaches in FedDG jeopardize privacy by sharing statistics of client data between themselves. Our solution addresses this problem by contributing new ways to perform local client training and model aggregation. To improve local client training, we enforce (domain) invariance across local models with the help of a novel technique, latent space inversion, which enables better client privacy. When clients are not i.i.d., aggregating their local models may discard certain local adaptations. To overcome this, we propose an important weight aggregation strategy to prioritize parameters that significantly influence predictions of local models during aggregation. Our extensive experiments show that our approach achieves superior results over state-of-the-art methods with less communication overhead.

108. 【2512.10209】Feature Coding for Scalable Machine Vision

Link: https://arxiv.org/abs/2512.10209

Authors: Md Eimran Hossain Eimon, Juan Merlos, Ashan Perera, Hari Kalva, Velibor Adzic, Borko Furht

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Deep neural networks, high compute demands, drive modern machine, edge devices due, Deep neural

Comments: This article has been accepted for publication in IEEE Consumer Electronics Magazine

Abstract:Deep neural networks (DNNs) drive modern machine vision but are challenging to deploy on edge devices due to high compute demands. Traditional approaches, running the full model on-device or offloading to the cloud, face trade-offs in latency, bandwidth, and privacy. Splitting the inference workload between the edge and the cloud offers a balanced solution, but transmitting intermediate features to enable such splitting introduces new bandwidth challenges. To address this, the Moving Picture Experts Group (MPEG) initiated the Feature Coding for Machines (FCM) standard, establishing a bitstream syntax and codec pipeline tailored for compressing intermediate features. This paper presents the design and performance of the Feature Coding Test Model (FCTM), showing significant bitrate reductions, averaging 85.14%, across multiple vision tasks while preserving accuracy. FCM offers a scalable path for efficient and interoperable deployment of intelligent features in bandwidth-limited and privacy-sensitive consumer applications.

109. 【2512.10151】Topological Conditioning for Mammography Models via a Stable Wavelet-Persistence Vectorization

Link: https://arxiv.org/abs/2512.10151

Authors: Charles Fanning, Mehmet Emin Aktas

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: cancer death worldwide, commonly diagnosed cancer, Breast cancer, diagnosed cancer, cancer death

Comments: 8 Pages, 2 Figures, submitted to IEEE Transactions on Medical Imaging

Abstract:Breast cancer is the most commonly diagnosed cancer in women and a leading cause of cancer death worldwide. Screening mammography reduces mortality, yet interpretation still suffers from substantial false negatives and false positives, and model accuracy often degrades when deployed across scanners, modalities, and patient populations. We propose a simple conditioning signal aimed at improving external performance based on a wavelet-based vectorization of persistent homology. Using topological data analysis, we summarize image structure that persists across intensity thresholds and convert this information into spatial, multi-scale maps that are provably stable to small intensity perturbations. These maps are integrated into a two-stage detection pipeline through input-level channel concatenation. The model is trained and validated on the CBIS-DDSM digitized film mammography cohort from the United States and evaluated on two independent full-field digital mammography cohorts from Portugal (INbreast) and China (CMMD), with performance reported at the patient level. On INbreast, augmenting ConvNeXt-Tiny with wavelet persistence channels increases patient-level AUC from 0.55 to 0.75 under a limited training budget.

110. 【2512.10102】Hierarchical Instance Tracking to Balance Privacy Preservation with Accessible Information

Link: https://arxiv.org/abs/2512.10102

Authors: Neelima Prasad, Jarek Reynolds, Neel Karsanbhai, Tanusree Sharma, Lotus Zhang, Abigale Stangl, Yang Wang, Leah Findlater, Danna Gurari

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: hierarchical instance tracking, objects and parts, hierarchical relationships, hierarchical instance, instance tracking

Comments: Accepted at WACV 2026

Abstract:We propose a novel task, hierarchical instance tracking, which entails tracking all instances of predefined categories of objects and parts, while maintaining their hierarchical relationships. We introduce the first benchmark dataset supporting this task, consisting of 2,765 unique entities that are tracked in 552 videos and belong to 40 categories (across objects and parts). Evaluation of seven variants of four models tailored to our novel task reveals the new dataset is challenging. Our dataset is available at this https URL

111. 【2512.10095】TraceFlow: Dynamic 3D Reconstruction of Specular Scenes Driven by Ray Tracing

Link: https://arxiv.org/abs/2512.10095

Authors: Jiachen Tao, Junyi Wu, Haoxuan Wang, Zongxin Yang, Dawen Cai, Yan Yan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: precise reflection direction, reflection direction estimation, accurate reflection modeling, Gaussian Splatting representation, key challenges

Comments: none

Abstract:We present TraceFlow, a novel framework for high-fidelity rendering of dynamic specular scenes by addressing two key challenges: precise reflection direction estimation and physically accurate reflection modeling. To achieve this, we propose a Residual Material-Augmented 2D Gaussian Splatting representation that models dynamic geometry and material properties, allowing accurate reflection ray computation. Furthermore, we introduce a Dynamic Environment Gaussian and a hybrid rendering pipeline that decomposes rendering into diffuse and specular components, enabling physically grounded specular synthesis via rasterization and ray tracing. Finally, we devise a coarse-to-fine training strategy to improve optimization stability and promote physically meaningful decomposition. Extensive experiments on dynamic scene benchmarks demonstrate that TraceFlow outperforms prior methods both quantitatively and qualitatively, producing sharper and more realistic specular reflections in complex dynamic environments.

112. 【2512.10067】Independent Density Estimation

Link: https://arxiv.org/abs/2512.10067

Authors: Jiahao Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Large-scale Vision-Language models, achieved remarkable results, Large-scale Vision-Language, conditioned image generation, Independent Density Estimation

Comments: 10 pages, 1 table, 4 figures

Abstract:Large-scale Vision-Language models have achieved remarkable results in various domains, such as image captioning and conditioned image generation. Nevertheless, these models still encounter difficulties in achieving human-like compositional generalization. In this study, we propose a new method called Independent Density Estimation (IDE) to tackle this challenge. IDE aims to learn the connection between individual words in a sentence and the corresponding features in an image, enabling compositional generalization. We build two models based on the philosophy of IDE. The first one utilizes fully disentangled visual representations as input, and the second leverages a Variational Auto-Encoder to obtain partially disentangled features from raw images. Additionally, we propose an entropy-based compositional inference method to combine predictions of each word in the sentence. Our models exhibit superior generalization to unseen compositions compared to current models when evaluated on various datasets.

113. 【2512.10041】MetaVoxel: Joint Diffusion Modeling of Imaging and Clinical Metadata

Link: https://arxiv.org/abs/2512.10041

Authors: Yihao Liu, Chenyu Gao, Lianrui Zuo, Michael E. Kim, Brian D. Boyd, Lisa L. Barnes, Walter A. Kukull, Lori L. Beason-Held, Susan M. Resnick, Timothy J. Hohman, Warren D. Taylor, Bennett A. Landman

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: estimating continuous biomarkers, Modern deep learning, achieved impressive results, Modern deep, deep learning methods

Comments: none

Abstract:Modern deep learning methods have achieved impressive results across tasks from disease classification, estimating continuous biomarkers, to generating realistic medical images. Most of these approaches are trained to model conditional distributions defined by a specific predictive direction with a specific set of input variables. We introduce MetaVoxel, a generative joint diffusion modeling framework that models the joint distribution over imaging data and clinical metadata by learning a single diffusion process spanning all variables. By capturing the joint distribution, MetaVoxel unifies tasks that traditionally require separate conditional models and supports flexible zero-shot inference using arbitrary subsets of inputs without task-specific retraining. Using more than 10,000 T1-weighted MRI scans paired with clinical metadata from nine datasets, we show that a single MetaVoxel model can perform image generation, age estimation, and sex prediction, achieving performance comparable to established task-specific baselines. Additional experiments highlight its capabilities for flexible this http URL, these findings demonstrate that joint multimodal diffusion offers a promising direction for unifying medical AI models and enabling broader clinical applicability.

114. 【2512.10038】Diffusion Is Your Friend in Show, Suggest and Tell

Link: https://arxiv.org/abs/2512.10038

Authors: Jia Cheng Hu, Roberto Cavicchioli, Alessandro Capotondi

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Computer Vision tasks, generative Computer Vision, Denoising models demonstrated, Computer Vision, Diffusion Denoising models

Comments: none

Abstract:Diffusion Denoising models have demonstrated impressive results across generative Computer Vision tasks, but they still fail to outperform standard autoregressive solutions in the discrete domain, and only match them at best. In this work, we propose a different paradigm by adopting diffusion models to provide suggestions to the autoregressive generation rather than replacing them. By doing so, we combine the bidirectional and refining capabilities of the former with the strong linguistic structure provided by the latter. To showcase its effectiveness, we present Show, Suggest and Tell (SST), which achieves State-of-the-Art results on COCO, among models in a similar setting. In particular, SST achieves 125.1 CIDEr-D on the COCO dataset without Reinforcement Learning, outperforming both autoregressive and diffusion model State-of-the-Art results by 1.5 and 2.5 points, respectively. On top of the strong results, we performed extensive experiments to validate the proposal and analyze the impact of the suggestion module. Results demonstrate a positive correlation between suggestion and caption quality, overall indicating a currently underexplored but promising research direction. Code will be available at: this https URL_suggest_tell.

115. 【2512.10031】ABBSPO: Adaptive Bounding Box Scaling and Symmetric Prior based Orientation Prediction for Detecting Aerial Image Objects

Link: https://arxiv.org/abs/2512.10031

Authors: Woojin Lee, Hyugjae Chang, Jaeho Moon, Jaehyup Lee, Munchurl Kim

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Weakly supervised oriented, adaptive bounding box, bounding box scaling, oriented object detection, bounding box

Comments: 17 pages, 11 figures, 8 tables, supplementary included. Accepted to CVPR 2025. Please visit our project page at [this https URL](https://kaist-viclab.github.io/ABBSPO_site/)

Abstract:Weakly supervised oriented object detection (WS-OOD) has gained attention as a cost-effective alternative to fully supervised methods, providing both efficiency and high accuracy. Among weakly supervised approaches, horizontal bounding box (HBox)-supervised OOD stands out for its ability to directly leverage existing HBox annotations while achieving the highest accuracy under weak supervision settings. This paper introduces adaptive bounding box scaling and symmetry-prior-based orientation prediction, called ABBSPO, a framework for WS-OOD. Our ABBSPO addresses limitations of previous HBox-supervised OOD methods, which compare ground truth (GT) HBoxes directly with the minimum circumscribed rectangles of predicted RBoxes, often leading to inaccurate scale estimation. To overcome this, we propose: (i) Adaptive Bounding Box Scaling (ABBS), which appropriately scales GT HBoxes to optimize for the size of each predicted RBox, ensuring more accurate scale prediction; and (ii) a Symmetric Prior Angle (SPA) loss that exploits inherent symmetry of aerial objects for self-supervised learning, resolving issues in previous methods where learning collapses when predictions for all three augmented views (original, rotated, and flipped) are consistently incorrect. Extensive experimental results demonstrate that ABBSPO achieves state-of-the-art performance, outperforming existing methods.
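The abstract refers to comparing GT HBoxes with the minimum circumscribed rectangle of a predicted RBox. As a minimal sketch of that geometric step only (not of ABBS or the SPA loss), the function below computes the axis-aligned rectangle that encloses a rotated box. The `(cx, cy, w, h, theta)` parameterization with `theta` in radians and the function name `circumscribed_hbox` are assumptions made for this example.

```python
import math


def circumscribed_hbox(cx, cy, w, h, theta):
    """Axis-aligned box (x_min, y_min, x_max, y_max) enclosing a rotated box.

    The rotated box has centre (cx, cy), side lengths w and h, and is rotated
    by theta radians. Its horizontal and vertical extents follow from
    projecting both sides onto the x and y axes.
    """
    cos_t = abs(math.cos(theta))
    sin_t = abs(math.sin(theta))
    extent_x = w * cos_t + h * sin_t
    extent_y = w * sin_t + h * cos_t
    return (
        cx - extent_x / 2.0,
        cy - extent_y / 2.0,
        cx + extent_x / 2.0,
        cy + extent_y / 2.0,
    )


# A 4 x 2 box rotated by 45 degrees: the enclosing box is larger than 4 x 2,
# which is why comparing it directly with a tight GT HBox can bias the scale.
box = circumscribed_hbox(0.0, 0.0, 4.0, 2.0, math.pi / 4)
```

For rotations that are not multiples of 90 degrees the enclosing rectangle is strictly larger than the object itself, which gives some intuition for the scale-estimation issue the abstract mentions.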

116. 【2512.09969】Neuromorphic Eye Tracking for Low-Latency Pupil Detection

Link: https://arxiv.org/abs/2512.09969

Authors: Paul Hueber, Luca Peres, Florian Pitters, Alejandro Gloriani, Oliver Rhodes

Categories: Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)

Keywords: limited temporal resolution, conventional frame-based pipelines, frame-based pipelines struggle, high compute cost, Eye tracking

Comments: 8 pages, 2 figures, conference

Abstract:Eye tracking for wearable systems demands low latency and milliwatt-level power, but conventional frame-based pipelines struggle with motion blur, high compute cost, and limited temporal resolution. Such capabilities are vital for enabling seamless and responsive interaction in emerging technologies like augmented reality (AR) and virtual reality (VR), where understanding user gaze is key to immersion and interface design. Neuromorphic sensors and spiking neural networks (SNNs) offer a promising alternative, yet existing SNN approaches are either too specialized or fall short of the performance of modern ANN architectures. This paper presents a neuromorphic version of top-performing event-based eye-tracking models, replacing their recurrent and attention modules with lightweight LIF layers and exploiting depth-wise separable convolutions to reduce model complexity. Our models obtain 3.7-4.1px mean error, approaching the accuracy of the application-specific neuromorphic system, Retina (3.24px), while reducing model size by 20x and theoretical compute by 850x, compared to the closest ANN variant of the proposed model. These efficient variants are projected to operate at an estimated 3.9-4.9 mW with 3 ms latency at 1 kHz. The present results indicate that high-performing event-based eye-tracking architectures can be redesigned as SNNs with substantial efficiency gains, while retaining accuracy suitable for real-time wearable deployment.
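For readers unfamiliar with the LIF (leaky integrate-and-fire) layers mentioned in the abstract, the sketch below simulates a single discrete-time LIF neuron. It is a generic textbook-style formulation, not the paper's model: the decay factor, the threshold, the hard reset to zero, and the names `lif_step` and `run_lif` are all assumptions for illustration.

```python
def lif_step(membrane, input_current, decay=0.9, threshold=1.0):
    """Advance one LIF neuron by one time step.

    The membrane potential leaks by `decay`, integrates the input, and emits
    a spike (1) with a hard reset to 0.0 when it reaches `threshold`.
    """
    membrane = decay * membrane + input_current
    if membrane >= threshold:
        return 0.0, 1
    return membrane, 0


def run_lif(inputs, decay=0.9, threshold=1.0):
    """Return the binary spike train produced by a sequence of inputs."""
    membrane = 0.0
    spikes = []
    for current in inputs:
        membrane, spike = lif_step(membrane, current, decay, threshold)
        spikes.append(spike)
    return spikes


# A constant sub-threshold input accumulates until the neuron fires.
spike_train = run_lif([0.5, 0.5, 0.5, 0.5])
```

Because the output is a sparse binary spike train, downstream layers only need to do work when a spike occurs, which is the usual argument for the compute savings of SNNs on event-based input.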

117. 【2512.09944】Echo-CoPilot: A Multi-View, Multi-Task Agent for Echocardiography Interpretation and Reporting

Link: https://arxiv.org/abs/2512.09944

Authors: Moein Heidari, Mohammad Amin Roohi, Armin Khosravi, Ilker Hacihaliloglu

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Keywords: contemporary cardiovascular care, full-study interpretation remains, cardiovascular care, cognitively demanding, performed manually

Comments: none

Abstract:Echocardiography is central to contemporary cardiovascular care, but full-study interpretation remains a cognitively demanding, multi-view task that is still performed manually. While recent foundation models for echocardiography can achieve strong performance on individual perceptual subtasks such as view classification, segmentation, or disease prediction, they typically operate in isolation and do not provide a unified, clinically coherent assessment. In this work, we introduce Echo-CoPilot, a multi-view, multi-task agent that uses a large language model to orchestrate a suite of specialized echocardiography tools. Within a ReAct-style loop, the agent decomposes clinician queries, invokes tools for view recognition, cardiac structure segmentation, measurement and disease prediction, and report synthesis, and integrates their outputs into guideline-aware answers and narrative summaries. We evaluate Echo-CoPilot on the public MIMIC-EchoQA benchmark, where it achieves an accuracy of 50.8%, outperforming both general-purpose and biomedical video vision-language models. Qualitative analyses further show that the agent leverages quantitative measurements and physiologic context to resolve challenging cases near clinical decision thresholds, such as borderline left ventricular hypertrophy or pericardial effusion severity. The code will be released upon acceptance of the paper.

118. 【2512.09874】Benchmarking Document Parsers on Mathematical Formula Extraction from PDFs

Link: https://arxiv.org/abs/2512.09874

Authors: Pius Horn, Janis Keuper

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Correctly parsing mathematical, training large language, building scientific knowledge, scientific knowledge bases, Correctly parsing

Comments: none

Abstract:Correctly parsing mathematical formulas from PDFs is critical for training large language models and building scientific knowledge bases from academic literature, yet existing benchmarks either exclude formulas entirely or lack semantically-aware evaluation metrics. We introduce a novel benchmarking framework centered on synthetically generated PDFs with precise LaTeX ground truth, enabling systematic control over layout, formulas, and content characteristics. A key methodological contribution is pioneering LLM-as-a-judge for semantic formula assessment, combined with a robust two-stage matching pipeline that handles parser output inconsistencies. Through human validation on 250 formula pairs (750 ratings from 30 evaluators), we demonstrate that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.78) compared to CDM (r=0.34) and text similarity (r≈0). Evaluating 20+ contemporary PDF parsers (including specialized OCR models, vision-language models, and rule-based approaches) across 100 synthetic documents with 2,000+ formulas reveals significant performance disparities. Our findings provide crucial insights for practitioners selecting parsers for downstream applications and establish a robust, scalable methodology that enables reproducible evaluation of PDF formula extraction quality. Code and benchmark data: this https URL
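The abstract compares metrics by their Pearson correlation with human ratings. As a reminder of what that statistic measures, here is a minimal sketch of the standard Pearson r formula. The function name `pearson_r` and the score lists are hypothetical and are not taken from the benchmark.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0.0 or spread_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (spread_x * spread_y)


# Hypothetical scores for five formula pairs (illustrative values only).
human_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
metric_scores = [1.5, 1.0, 3.5, 3.0, 5.0]
agreement = pearson_r(human_scores, metric_scores)
```

A value near 1 means the automatic metric ranks and spaces formula pairs much like the human raters do, while a value near 0 means it carries almost no linear information about human judgment.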

119. 【2510.20875】CC-GRMAS: A Multi-Agent Graph Neural System for Spatiotemporal Landslide Risk Assessment in High Mountain Asia

Link: https://arxiv.org/abs/2510.20875

Authors: Mihir Panchal, Ying-Jung Chen, Surya Parkash

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: high mountain Asia, mountain Asia, growing climate induced, climate induced hazard, human consequences

Comments: none

Abstract:Landslides are a growing climate-induced hazard with severe environmental and human consequences, particularly in high mountain Asia. Despite increasing access to satellite and temporal datasets, timely detection and disaster response remain underdeveloped and fragmented. This work introduces CC-GRMAS, a framework leveraging a series of satellite observations and environmental signals to enhance the accuracy of landslide forecasting. The system is structured around three interlinked agents (Prediction, Planning, and Execution) that collaboratively enable real-time situational awareness, response planning, and intervention. By incorporating local environmental factors and operationalizing multi-agent coordination, this approach offers a scalable and proactive solution for climate-resilient disaster preparedness across vulnerable mountainous terrains.