本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。
统计
今日共更新390篇论文,其中:
- 自然语言处理40篇
- 信息检索5篇
- 计算机视觉105篇
自然语言处理
1. 【2512.11755】SUMFORU: An LLM-Based Review Summarization Framework for Personalized Purchase Decision Support
链接:https://arxiv.org/abs/2512.11755
作者:Yuming Feng,Xinrui Jiang
类目:Computation and Language (cs.CL)
关键词:hinder effective decision-making, Online product reviews, effective decision-making, Online product, rich but noisy
备注: Code available at [this https URL](https://github.com/Harry20030331/SumForU)
点击查看摘要
Abstract:Online product reviews contain rich but noisy signals that overwhelm users and hinder effective decision-making. Existing LLM-based summarizers remain generic and fail to account for individual preferences, limiting their practical utility. We propose SUMFORU, a steerable review summarization framework that aligns outputs with explicit user personas to support personalized purchase decisions. Our approach integrates a high-quality data pipeline built from the Amazon 2023 Review Dataset with a two-stage alignment procedure: (1) persona-aware Supervised Fine-Tuning (SFT) via asymmetric knowledge distillation, and (2) Reinforcement Learning with AI Feedback (RLAIF) using a preference estimator to capture fine-grained, persona-relevant signals. We evaluate the model across rule-based, LLM-based, and human-centered metrics, demonstrating consistent improvements in consistency, grounding, and preference alignment. Our framework achieves the highest performance across all evaluation settings and generalizes effectively to unseen product categories. Our results highlight the promise of steerable pluralistic alignment for building next-generation personalized decision-support systems.
2. 【2512.11724】From Signal to Turn: Interactional Friction in Modular Speech-to-Speech Pipelines
链接:https://arxiv.org/abs/2512.11724
作者:Titaya Mairittha,Tanakon Sawanglok,Panuwit Raden,Jirapast Buntub,Thanapat Warunee,Napat Asawachaisuvikrom,Thanaphum Saiwongin
类目:Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
关键词:remarkable generative capabilities, feel conversationally broken, achieved remarkable generative, generative capabilities, conversationally broken
备注: 6 pages, 1 figure
点击查看摘要
Abstract:While voice-based AI systems have achieved remarkable generative capabilities, their interactions often feel conversationally broken. This paper examines the interactional friction that emerges in modular Speech-to-Speech Retrieval-Augmented Generation (S2S-RAG) pipelines. By analyzing a representative production system, we move beyond simple latency metrics to identify three recurring patterns of conversational breakdown: (1) Temporal Misalignment, where system delays violate user expectations of conversational rhythm; (2) Expressive Flattening, where the loss of paralinguistic cues leads to literal, inappropriate responses; and (3) Repair Rigidity, where architectural gating prevents users from correcting errors in real-time. Through system-level analysis, we demonstrate that these friction points should not be understood as defects or failures, but as structural consequences of a modular design that prioritizes control over fluidity. We conclude that building natural spoken AI is an infrastructure design challenge, requiring a shift from optimizing isolated components to carefully choreographing the seams between them.
3. 【2512.11718】Speculative Decoding Speed-of-Light: Optimal Lower Bounds via Branching Random Walks
链接:https://arxiv.org/abs/2512.11718
作者:Sergey Pankratov,Dan Alistarh
类目:Computation and Language (cs.CL)
关键词:verify multiple draft, large language models, draft tokens simultaneously, multiple draft tokens, promising technique
备注:
点击查看摘要
Abstract:Speculative generation has emerged as a promising technique to accelerate inference in large language models (LLMs) by leveraging parallelism to verify multiple draft tokens simultaneously. However, the fundamental limits on the achievable speedup remain poorly understood. In this work, we establish the first ``tight'' lower bounds on the runtime of any deterministic speculative generation algorithm. This is achieved by drawing a parallel between the token generation process and branching random walks, which allows us to analyze the optimal draft tree selection problem. We prove, under basic assumptions, that the expected number of tokens successfully predicted per speculative iteration is bounded as $\mathbb{E}[X] \leq (\mu + \mu_{(2)})\log(P )/\mu^2 + O(1)$, where $P$ is the verifier's capacity, $\mu$ is the expected entropy of the verifier's output distribution, and $\mu_{(2)}$ is the expected second log-moment. This result provides new insights into the limits of parallel token generation, and could guide the design of future speculative decoding systems. Empirical evaluations on Llama models validate our theoretical predictions, confirming the tightness of our bounds in practical settings.
4. 【2512.11635】Automating Historical Insight Extraction from Large-Scale Newspaper Archives via Neural Topic Modeling
链接:https://arxiv.org/abs/2512.11635
作者:Keerthana Murugaraj,Salima Lamsiyah,Marten During,Martin Theobald
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:Optical Character Recognition, Optical Character, Character Recognition, presents significant challenges, significant challenges due
备注: This is a preprint of a manuscript submitted to Digital Scholarship in the Humanities (Oxford University Press). The paper is currently under peer review
点击查看摘要
Abstract:Extracting coherent and human-understandable themes from large collections of unstructured historical newspaper archives presents significant challenges due to topic evolution, Optical Character Recognition (OCR) noise, and the sheer volume of text. Traditional topic-modeling methods, such as Latent Dirichlet Allocation (LDA), often fall short in capturing the complexity and dynamic nature of discourse in historical texts. To address these limitations, we employ BERTopic. This neural topic-modeling approach leverages transformerbased embeddings to extract and classify topics, which, despite its growing popularity, still remains underused in historical research. Our study focuses on articles published between 1955 and 2018, specifically examining discourse on nuclear power and nuclear safety. We analyze various topic distributions across the corpus and trace their temporal evolution to uncover long-term trends and shifts in public discourse. This enables us to more accurately explore patterns in public discourse, including the co-occurrence of themes related to nuclear power and nuclear weapons and their shifts in topic importance over time. Our study demonstrates the scalability and contextual sensitivity of BERTopic as an alternative to traditional approaches, offering richer insights into historical discourses extracted from newspaper archives. These findings contribute to historical, nuclear, and social-science research while reflecting on current limitations and proposing potential directions for future work.
5. 【2512.11614】Bounding Hallucinations: Information-Theoretic Guarantees for RAG Systems via Merlin-Arthur Protocols
链接:https://arxiv.org/abs/2512.11614
作者:Björn Deiseroth,Max Henning Höth,Kristian Kersting,Letitia Parcalabescu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:guide large language, Retrieval-augmented generation, large language model, current systems treat, systems treat retrieval
备注: 34 pages, 19 figures
点击查看摘要
Abstract:Retrieval-augmented generation (RAG) models rely on retrieved evidence to guide large language model (LLM) generators, yet current systems treat retrieval as a weak heuristic rather than verifiable evidence. As a result, LLMs answer without support, hallucinate under incomplete or misleading context, and rely on spurious evidence. We introduce a training framework that treats the entire RAG pipeline -- both the retriever and the generator -- as an interactive proof system via an adaptation of the Merlin-Arthur (M/A) protocol. Arthur (the generator LLM) trains on questions of unkown provenance: Merlin provides helpful evidence, while Morgana injects adversarial, misleading context. Both use a linear-time XAI method to identify and modify the evidence most influential to Arthur. Consequently, Arthur learns to (i) answer when the context support the answer, (ii) reject when evidence is insufficient, and (iii) rely on the specific context spans that truly ground the answer. We further introduce a rigorous evaluation framework to disentangle explanation fidelity from baseline predictive errors. This allows us to introduce and measure the Explained Information Fraction (EIF), which normalizes M/A certified mutual-information guarantees relative to model capacity and imperfect benchmarks. Across three RAG datasets and two model families of varying sizes, M/A-trained LLMs show improved groundedness, completeness, soundness, and reject behavior, as well as reduced hallucinations -- without needing manually annotated unanswerable questions. The retriever likewise improves recall and MRR through automatically generated M/A hard positives and negatives. Our results demonstrate that autonomous interactive-proof-style supervision provides a principled and practical path toward reliable RAG systems that treat retrieved documents not as suggestions, but as verifiable evidence.
6. 【2512.11573】Visualizing token importance for black-box language models
链接:https://arxiv.org/abs/2512.11573
作者:Paulius Rauba,Qiyao Wei,Mihaela van der Schaar
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:auditing black-box large, production settings, regulatory compliance, ensure they behave, behave reliably
备注:
点击查看摘要
Abstract:We consider the problem of auditing black-box large language models (LLMs) to ensure they behave reliably when deployed in production settings, particularly in high-stakes domains such as legal, medical, and regulatory compliance. Existing approaches for LLM auditing often focus on isolated aspects of model behavior, such as detecting specific biases or evaluating fairness. We are interested in a more general question -- can we understand how the outputs of black-box LLMs depend on each input token? There is a critical need to have such tools in real-world applications that rely on inaccessible API endpoints to language models. However, this is a highly non-trivial problem, as LLMs are stochastic functions (i.e. two outputs will be different by chance), while computing prompt-level gradients to approximate input sensitivity is infeasible. To address this, we propose Distribution-Based Sensitivity Analysis (DBSA), a lightweight model-agnostic procedure to evaluate the sensitivity of the output of a language model for each input token, without making any distributional assumptions about the LLM. DBSA is developed as a practical tool for practitioners, enabling quick, plug-and-play visual exploration of LLMs reliance on specific input tokens. Through illustrative examples, we demonstrate how DBSA can enable users to inspect LLM inputs and find sensitivities that may be overlooked by existing LLM interpretability methods.
7. 【2512.11567】Extending a Parliamentary Corpus with MPs' Tweets: Automatic Annotation and Evaluation Using MultiParTweet
链接:https://arxiv.org/abs/2512.11567
作者:Mevlüt Bagci,Ali Abusaleh,Daniel Baumartz,Giueseppe Abrami,Maxim Konca,Alexander Mehler
类目:Computation and Language (cs.CL); Multimedia (cs.MM)
关键词:Social media serves, reflects politicians' ideologies, politicians' social media, Social media, younger generations
备注: Submitted to LREC 2026
点击查看摘要
Abstract:Social media serves as a critical medium in modern politics because it both reflects politicians' ideologies and facilitates communication with younger generations. We present MultiParTweet, a multilingual tweet corpus from X that connects politicians' social media discourse with German political corpus GerParCor, thereby enabling comparative analyses between online communication and parliamentary debates. MultiParTweet contains 39 546 tweets, including 19 056 media items. Furthermore, we enriched the annotation with nine text-based models and one vision-language model (VLM) to annotate MultiParTweet with emotion, sentiment, and topic annotations. Moreover, the automated annotations are evaluated against a manually annotated subset. MultiParTweet can be reconstructed using our tool, TTLABTweetCrawler, which provides a framework for collecting data from X. To demonstrate a methodological demonstration, we examine whether the models can predict each other using the outputs of the remaining models. In summary, we provide MultiParTweet, a resource integrating automatic text and media-based annotations validated with human annotations, and TTLABTweetCrawler, a general-purpose X data collection tool. Our analysis shows that the models are mutually predictable. In addition, VLM-based annotation were preferred by human annotators, suggesting that multimodal representations align more with human interpretation.
8. 【2512.11558】DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry
链接:https://arxiv.org/abs/2512.11558
作者:Zhenyang Cai,Jiaming Zhang,Junjie Zhao,Ziyi Zeng,Yanchao Li,Jingyi Liang,Junying Chen,Yunjin Yang,Jiajun You,Shuzhi Deng,Tongfei Wang,Wanting Chen,Chunxiu Hao,Ruiqi Xie,Zhenwei Wen,Xiangyi Feng,Zou Ting,Jin Zou Lin,Jianquan Li,Guangjun Yu,Liangyi Chen,Junwen Wang,Shan Jiang,Benyou Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:automated oral healthcare, large language models, Reliable interpretation, lack sufficient reasoning, sufficient reasoning ability
备注:
点击查看摘要
Abstract:Reliable interpretation of multimodal data in dentistry is essential for automated oral healthcare, yet current multimodal large language models (MLLMs) struggle to capture fine-grained dental visual details and lack sufficient reasoning ability for precise diagnosis. To address these limitations, we present DentalGPT, a specialized dental MLLM developed through high-quality domain knowledge injection and reinforcement learning. Specifically, the largest annotated multimodal dataset for dentistry to date was constructed by aggregating over 120k dental images paired with detailed descriptions that highlight diagnostically relevant visual features, making it the multimodal dataset with the most extensive collection of dental images to date. Training on this dataset significantly enhances the MLLM's visual understanding of dental conditions, while the subsequent reinforcement learning stage further strengthens its capability for multimodal complex reasoning. Comprehensive evaluations on intraoral and panoramic benchmarks, along with dental subsets of medical VQA benchmarks, show that DentalGPT achieves superior performance in disease classification and dental VQA tasks, outperforming many state-of-the-art MLLMs despite having only 7B parameters. These results demonstrate that high-quality dental data combined with staged adaptation provides an effective pathway for building capable and domain-specialized dental MLLMs.
9. 【2512.11534】HFS: Holistic Query-Aware Frame Selection for Efficient Video Reasoning
链接:https://arxiv.org/abs/2512.11534
作者:Yiqing Yang,Kin-Man Lam
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
关键词:presents significant challenges, video understanding presents, understanding presents significant, Key frame selection, significant challenges
备注: 18 pages, 8 figures
点击查看摘要
Abstract:Key frame selection in video understanding presents significant challenges. Traditional top-K selection methods, which score frames independently, often fail to optimize the selection as a whole. This independent scoring frequently results in selecting frames that are temporally clustered and visually redundant. Additionally, training lightweight selectors using pseudo labels generated offline by Multimodal Large Language Models (MLLMs) prevents the supervisory signal from dynamically adapting to task objectives. To address these limitations, we propose an end-to-end trainable, task-adaptive framework for frame selection. A Chain-of-Thought approach guides a Small Language Model (SLM) to generate task-specific implicit query vectors, which are combined with multimodal features to enable dynamic frame scoring. We further define a continuous set-level objective function that incorporates relevance, coverage, and redundancy, enabling differentiable optimization via Gumbel-Softmax to select optimal frame combinations at the set level. Finally, student-teacher mutual learning is employed, where the student selector (SLM) and teacher reasoner (MLLM) are trained to align their frame importance distributions via KL divergence. Combined with cross-entropy loss, this enables end-to-end optimization, eliminating reliance on static pseudo labels. Experiments across various benchmarks, including Video-MME, LongVideoBench, MLVU, and NExT-QA, demonstrate that our method significantly outperforms existing approaches.
10. 【2512.11509】Does Less Hallucination Mean Less Creativity? An Empirical Investigation in LLMs
链接:https://arxiv.org/abs/2512.11509
作者:Mohor Banerjee,Nadya Yuki Wangsajaya,Syed Ali Redha Alsagoff,Min Sen Tan,Zachary Choy Kit Chun,Alvin Chan Guo Wei
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, natural language understanding, exhibit remarkable capabilities, factually incorrect content, Large Language
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) exhibit remarkable capabilities in natural language understanding and reasoning, but suffer from hallucination: the generation of factually incorrect content. While numerous methods have been developed to reduce hallucinations, their impact on creative generations remains unexplored. This gap is particularly critical for AI-assisted scientific discovery, which requires both factual accuracy and creative hypothesis generation. We investigate how three hallucination-reduction techniques: Chain of Verification (CoVe), Decoding by Contrasting Layers (DoLa), and Retrieval-Augmented Generation (RAG), affect creativity in LLMs. Evaluating multiple model families (LLaMA, Qwen, Mistral) at varying scales (1B - 70B parameters) on two creativity benchmarks (NeoCoder and CS4), we find that these methods have opposing effects on divergent creativity. CoVe enhances divergent thinking, DoLa suppresses it, and RAG shows minimal impact. Our findings provide guidance for selecting appropriate hallucination-reduction methods in scientific applications, where the balance between factual accuracy and creative exploration is crucial.
11. 【2512.11502】Building Patient Journeys in Hebrew: A Language Model for Clinical Timeline Extraction
链接:https://arxiv.org/abs/2512.11502
作者:Kai Golan Hashiloni,Brenda Kasabe Nokai,Michal Shevach,Esthy Shemesh,Ronit Bartin,Anna Bergrin,Liran Harel,Nachum Dershowitz,Liat Nadai Arad,Kfir Bar
类目:Computation and Language (cs.CL)
关键词:Hebrew medical language, extract structured clinical, structured clinical timelines, electronic health records, Hebrew medical
备注: In Proceedings of the Workshop on Large Language Models and Generative AI for Health Informatics 2025, IJCAI 2025, Montreal, Canada
点击查看摘要
Abstract:We present a new Hebrew medical language model designed to extract structured clinical timelines from electronic health records, enabling the construction of patient journeys. Our model is based on DictaBERT 2.0 and continually pre-trained on over five million de-identified hospital records. To evaluate its effectiveness, we introduce two new datasets -- one from internal medicine and emergency departments, and another from oncology -- annotated for event temporal relations. Our results show that our model achieves strong performance on both datasets. We also find that vocabulary adaptation improves token efficiency and that de-identification does not compromise downstream performance, supporting privacy-conscious model development. The model is made available for research use under ethical restrictions.
12. 【2512.11485】Mistake Notebook Learning: Selective Batch-Wise Context Optimization for In-Context Learning
链接:https://arxiv.org/abs/2512.11485
作者:Xuanbo Su,Yingfang Zhang,Hao Luo,Xiaoteng Liu,Leo Huang
类目:Computation and Language (cs.CL)
关键词:Large language models, poor mistake learning, Large language, Mistake Notebook Learning, heavy computation
备注:
点击查看摘要
Abstract:Large language models (LLMs) adapt to tasks via gradient fine-tuning (heavy computation, catastrophic forgetting) or In-Context Learning (ICL: low robustness, poor mistake learning). To fix this, we introduce Mistake Notebook Learning (MNL), a training-free framework with a persistent knowledge base of abstracted error patterns. Unlike prior instance/single-trajectory memory methods, MNL uses batch-wise error abstraction: it extracts generalizable guidance from multiple failures, stores insights in a dynamic notebook, and retains only baseline-outperforming guidance via hold-out validation (ensuring monotonic improvement). We show MNL nearly matches Supervised Fine-Tuning (93.9% vs 94.3% on GSM8K) and outperforms training-free alternatives on GSM8K, Spider, AIME, and KaggleDBQA. On KaggleDBQA (Qwen3-8B), MNL hits 28% accuracy (47% relative gain), outperforming Memento (15.1%) and Training-Free GRPO (22.1) - proving it's a strong training-free alternative for complex reasoning.
13. 【2512.11470】Rethinking Expert Trajectory Utilization in LLM Post-training
链接:https://arxiv.org/abs/2512.11470
作者:Bowen Ding,Yuhan Chen,Jiayang Lv,Jiyao Yuan,Qi Zhu,Shuangshuang Tian,Dantong Zhu,Futing Wang,Heyuan Deng,Fei Mi,Lifeng Shang,Tao Lin
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:integrates Supervised Fine-Tuning, post-training integrates Supervised, Reinforcement Learning, foundational SFT performance, trajectories remains unresolved
备注: 24 pages, 5 figures, under review
点击查看摘要
Abstract:While effective post-training integrates Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), the optimal mechanism for utilizing expert trajectories remains unresolved. We propose the Plasticity-Ceiling Framework to theoretically ground this landscape, decomposing performance into foundational SFT performance and the subsequent RL plasticity. Through extensive benchmarking, we establish the Sequential SFT-then-RL pipeline as the superior standard, overcoming the stability deficits of synchronized approaches. Furthermore, we derive precise scaling guidelines: (1) Transitioning to RL at the SFT Stable or Mild Overfitting Sub-phase maximizes the final ceiling by securing foundational SFT performance without compromising RL plasticity; (2) Refuting ``Less is More'' in the context of SFT-then-RL scaling, we demonstrate that Data Scale determines the primary post-training potential, while Trajectory Difficulty acts as a performance multiplier; and (3) Identifying that the Minimum SFT Validation Loss serves as a robust indicator for selecting the expert trajectories that maximize the final performance ceiling. Our findings provide actionable guidelines for maximizing the value extracted from expert trajectories.
14. 【2512.11437】CLINIC: Evaluating Multilingual Trustworthiness in Language Models for Healthcare
链接:https://arxiv.org/abs/2512.11437
作者:Akash Ghosh,Srivarshinee Sridhar,Raghav Kaushik Ravi,Muhsin Muhsin,Sriparna Saha,Chirag Agarwal
类目:Computation and Language (cs.CL)
关键词:systems holds great, holds great promise, improving medical workflows, Integrating language models, healthcare systems holds
备注: 49 pages, 31 figures
点击查看摘要
Abstract:Integrating language models (LMs) in healthcare systems holds great promise for improving medical workflows and decision-making. However, a critical barrier to their real-world adoption is the lack of reliable evaluation of their trustworthiness, especially in multilingual healthcare settings. Existing LMs are predominantly trained in high-resource languages, making them ill-equipped to handle the complexity and diversity of healthcare queries in mid- and low-resource languages, posing significant challenges for deploying them in global healthcare contexts where linguistic diversity is key. In this work, we present CLINIC, a Comprehensive Multilingual Benchmark to evaluate the trustworthiness of language models in healthcare. CLINIC systematically benchmarks LMs across five key dimensions of trustworthiness: truthfulness, fairness, safety, robustness, and privacy, operationalized through 18 diverse tasks, spanning 15 languages (covering all the major continents), and encompassing a wide array of critical healthcare topics like disease conditions, preventive actions, diagnostic tests, treatments, surgeries, and medications. Our extensive evaluation reveals that LMs struggle with factual correctness, demonstrate bias across demographic and linguistic groups, and are susceptible to privacy breaches and adversarial attacks. By highlighting these shortcomings, CLINIC lays the foundation for enhancing the global reach and safety of LMs in healthcare across diverse languages.
15. 【2512.11412】ask-Specific Sparse Feature Masks for Molecular Toxicity Prediction with Chemical Language Models
链接:https://arxiv.org/abs/2512.11412
作者:Kwun Sy Lee,Jiawei Chen,Fuk Sheng Ford Chung,Tianyu Zhao,Zhenyuan Chen,Debby D. Wang
类目:Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
关键词:modern drug discovery, Reliable in silico, drug discovery, offering a scalable, experimental screening
备注: 6 pages, 4 figures
点击查看摘要
Abstract:Reliable in silico molecular toxicity prediction is a cornerstone of modern drug discovery, offering a scalable alternative to experimental screening. However, the black-box nature of state-of-the-art models remains a significant barrier to adoption, as high-stakes safety decisions demand verifiable structural insights alongside predictive performance. To address this, we propose a novel multi-task learning (MTL) framework designed to jointly enhance accuracy and interpretability. Our architecture integrates a shared chemical language model with task-specific attention modules. By imposing an L1 sparsity penalty on these modules, the framework is constrained to focus on a minimal set of salient molecular fragments for each distinct toxicity endpoint. The resulting framework is trained end-to-end and is readily adaptable to various transformer-based backbones. Evaluated on the ClinTox, SIDER, and Tox21 benchmark datasets, our approach consistently outperforms both single-task and standard MTL baselines. Crucially, the sparse attention weights provide chemically intuitive visualizations that reveal the specific fragments influencing predictions, thereby enhancing insight into the model's decision-making process.
16. 【2512.11399】Minimal Clips, Maximum Salience: Long Video Summarization via Key Moment Extraction
链接:https://arxiv.org/abs/2512.11399
作者:Galann Pennec,Zhengyuan Liu,Nicholas Asher,Philippe Muller,Nancy F. Chen
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:process increasingly longer, increasingly longer videos, process increasingly, increasingly longer, Vision-Language Models
备注:
点击查看摘要
Abstract:Vision-Language Models (VLMs) are able to process increasingly longer videos. Yet, important visual information is easily lost throughout the entire context and missed by VLMs. Also, it is important to design tools that enable cost-effective analysis of lengthy video content. In this paper, we propose a clip selection method that targets key video moments to be included in a multimodal summary. We divide the video into short clips and generate compact visual descriptions of each using a lightweight video captioning model. These are then passed to a large language model (LLM), which selects the K clips containing the most relevant visual information for a multimodal summary. We evaluate our approach on reference clips for the task, automatically derived from full human-annotated screenplays and summaries in the MovieSum dataset. We further show that these reference clips (less than 6% of the movie) are sufficient to build a complete multimodal summary of the movies in MovieSum. Using our clip selection method, we achieve a summarization performance close to that of these reference clips while capturing substantially more relevant video information than random clip selection. Importantly, we maintain low computational cost by relying on a lightweight captioning model.
17. 【2512.11388】Improving Translation Quality by Selecting Better Data for LLM Fine-Tuning: A Comparative Analysis
链接:https://arxiv.org/abs/2512.11388
作者:Felipe Ribeiro Fujita de Mello,Hideyuki Takada
类目:Computation and Language (cs.CL)
关键词:machine translation fine-tuning, open LLMs, machine translation, COMET Kiwi, investigated the impact
备注: To appear at IEEE Big Data 2025
点击查看摘要
Abstract:We investigated the impact of data selection on machine translation fine-tuning for open LLMs. Using Japanese-English corpora, we compare five selectors: TF-IDF, COMET Kiwi, QuRate, FD-Score, and random selection, under controlled training conditions. We observed that semantic selectors consistently outperform lexical and geometry-based heuristics, and that even when the selected data differ by less than 3%, the impact on model performance is substantial, underscoring the sensitivity of fine-tuning to data quality.
18. 【2512.11374】Mining Legal Arguments to Study Judicial Formalism
链接:https://arxiv.org/abs/2512.11374
作者:Tomáš Koref,Lena Held,Mahammad Namazov,Harun Kumru,Yassine Thlija,Christoph Burchard,Ivan Habernal
类目:Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:scale remains difficult, Czech Supreme Courts', systematically analyzing judicial, analyzing judicial reasoning, Supreme Courts' decisions
备注: pre-print under review
点击查看摘要
Abstract:Courts must justify their decisions, but systematically analyzing judicial reasoning at scale remains difficult. This study refutes claims about formalistic judging in Central and Eastern Europe (CEE) by developing automated methods to detect and classify judicial reasoning in Czech Supreme Courts' decisions using state-of-the-art natural language processing methods. We create the MADON dataset of 272 decisions from two Czech Supreme Courts with expert annotations of 9,183 paragraphs with eight argument types and holistic formalism labels for supervised training and evaluation. Using a corpus of 300k Czech court decisions, we adapt transformer LLMs for Czech legal domain by continued pretraining and experiment with methods to address dataset imbalance including asymmetric loss and class weighting. The best models successfully detect argumentative paragraphs (82.6\% macro-F1), classify traditional types of legal argument (77.5\% macro-F1), and classify decisions as formalistic/non-formalistic (83.2\% macro-F1). Our three-stage pipeline combining ModernBERT, Llama 3.1, and traditional feature-based machine learning achieves promising results for decision classification while reducing computational costs and increasing explainability. Empirically, we challenge prevailing narratives about CEE formalism. This work shows that legal argument mining enables reliable judicial philosophy classification and shows the potential of legal argument mining for other important tasks in computational legal studies. Our methodology is easily replicable across jurisdictions, and our entire pipeline, datasets, guidelines, models, and source codes are available at this https URL.
19. 【2512.11366】qa-FLoRA: Data-free query-adaptive Fusion of LoRAs for LLMs
链接:https://arxiv.org/abs/2512.11366
作者:Shreya Shukla,Aditya Sriram,Milinda Kuppur Narayanaswamy,Hiteshi Jain
类目:Computation and Language (cs.CL)
关键词:domain-specific parameter-efficient finetuning, requires domain-specific parameter-efficient, large language models, deployment of large, large language
备注: Accepted at AAAI 2026 (Main Technical Track)
点击查看摘要
Abstract:The deployment of large language models for specialized tasks often requires domain-specific parameter-efficient finetuning through Low-Rank Adaptation (LoRA) modules. However, effectively fusing these adapters to handle complex, multi-domain composite queries remains a critical challenge. Existing LoRA fusion approaches either use static weights, which assign equal relevance to each participating LoRA, or require data-intensive supervised training for every possible LoRA combination to obtain respective optimal fusion weights. We propose qa-FLoRA, a novel query-adaptive data-and-training-free method for LoRA fusion that dynamically computes layer-level fusion weights by measuring distributional divergence between the base model and respective adapters. Our approach eliminates the need for composite training data or domain-representative samples, making it readily applicable to existing adapter collections. Extensive experiments across nine multilingual composite tasks spanning mathematics, coding, and medical domains, show that qa-FLoRA outperforms static fusion by ~5% with LLaMA-2 and ~6% with LLaMA-3, and the training-free baselines by ~7% with LLaMA-2 and ~10% with LLaMA-3, while significantly closing the gap with supervised baselines. Further, layer-level analysis of our fusion weights reveals interpretable fusion patterns, demonstrating the effectiveness of our approach for robust multi-domain adaptation.
20. 【2512.11303】Unifying Dynamic Tool Creation and Cross-Task Experience Sharing through Cognitive Memory Architecture
链接:https://arxiv.org/abs/2512.11303
作者:Jiarun Liu,Shiyue Xu,Yang Li,Shangkun Liu,Yongli Yu,Peng Cao
类目:Computation and Language (cs.CL)
关键词:Large Language Model, Language Model agents, Large Language, Language Model, face fundamental challenges
备注:
点击查看摘要
Abstract:Large Language Model agents face fundamental challenges in adapting to novel tasks due to limitations in tool availability and experience reuse. Existing approaches either rely on predefined tools with limited coverage or build tools from scratch without leveraging past experiences, leading to inefficient exploration and suboptimal performance. We introduce SMITH (Shared Memory Integrated Tool Hub), a unified cognitive architecture that seamlessly integrates dynamic tool creation with cross-task experience sharing through hierarchical memory organization. SMITH organizes agent memory into procedural, semantic, and episodic components, enabling systematic capability expansion while preserving successful execution patterns. Our approach formalizes tool creation as iterative code generation within controlled sandbox environments and experience sharing through episodic memory retrieval with semantic similarity matching. We further propose a curriculum learning strategy based on agent-ensemble difficulty re-estimation. Extensive experiments on the GAIA benchmark demonstrate SMITH's effectiveness, achieving 81.8% Pass@1 accuracy and outperforming state-of-the-art baselines including Alita (75.2%) and Memento (70.9%). Our work establishes a foundation for building truly adaptive agents that continuously evolve their capabilities through principled integration of tool creation and experience accumulation.
21. 【2512.11297】LegalRikai: Open Benchmark -- A Benchmark for Complex Japanese Corporate Legal Tasks
链接:https://arxiv.org/abs/2512.11297
作者:Shogo Fujita,Yuji Naraki,Yiqing Zhu,Shinsuke Mori
类目:Computation and Language (cs.CL)
关键词:emulate Japanese corporate, paper introduces LegalRikai, Japanese corporate legal, emulate Japanese, Japanese corporate
备注:
点击查看摘要
Abstract:This paper introduces LegalRikai: Open Benchmark, a new benchmark comprising four complex tasks that emulate Japanese corporate legal practices. The benchmark was created by legal professionals under the supervision of an attorney. This benchmark has 100 samples that require long-form, structured outputs, and we evaluated them against multiple practical criteria. We conducted both human and automated evaluations using leading LLMs, including GPT-5, Gemini 2.5 Pro, and Claude Opus 4.1. Our human evaluation revealed that abstract instructions prompted unnecessary modifications, highlighting model weaknesses in document-level editing that were missed by conventional short-text tasks. Furthermore, our analysis reveals that automated evaluation aligns well with human judgment on criteria with clear linguistic grounding, and assessing structural consistency remains a challenge. The result demonstrates the utility of automated evaluation as a screening tool when expert availability is limited. We propose a dataset evaluation framework to promote more practice-oriented research in the legal domain.
22. 【2512.11282】CIP: A Plug-and-Play Causal Prompting Framework for Mitigating Hallucinations under Long-Context Noise
链接:https://arxiv.org/abs/2512.11282
作者:Qingsen Ma,Dianyun Wang,Ran Jing,Yujun Sun,Zhenbo Xu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:noisy retrieval contexts, genuine causal relationships, hallucinate when processing, processing long, long and noisy
备注:
点击查看摘要
Abstract:Large language models often hallucinate when processing long and noisy retrieval contexts because they rely on spurious correlations rather than genuine causal relationships. We propose CIP, a lightweight and plug-and-play causal prompting framework that mitigates hallucinations at the input stage. CIP constructs a causal relation sequence among entities, actions, and events and injects it into the prompt to guide reasoning toward causally relevant evidence. Through causal intervention and counterfactual reasoning, CIP suppresses non causal reasoning paths, improving factual grounding and interpretability. Experiments across seven mainstream language models, including GPT-4o, Gemini 2.0 Flash, and Llama 3.1, show that CIP consistently enhances reasoning quality and reliability, achieving 2.6 points improvement in Attributable Rate, 0.38 improvement in Causal Consistency Score, and a fourfold increase in effective information density. API level profiling further shows that CIP accelerates contextual understanding and reduces end to end response latency by up to 55.1 percent. These results suggest that causal reasoning may serve as a promising paradigm for improving the explainability, stability, and efficiency of large language models.
23. 【2512.11280】AdaSD: Adaptive Speculative Decoding for Efficient Language Model Inference
链接:https://arxiv.org/abs/2512.11280
作者:Kuan-Wei Lu,Ding-Yong Hong,Pangfeng Liu
类目:Computation and Language (cs.CL)
关键词:Large language models, achieved remarkable performance, increasing parameter sizes, parameter sizes significantly, sizes significantly slow
备注:
点击查看摘要
Abstract:Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, but their increasing parameter sizes significantly slow down inference. Speculative decoding mitigates this issue by leveraging a smaller draft model to predict candidate tokens, which are then verified by a larger target model. However, existing approaches often require additional training, extensive hyperparameter tuning, or prior analysis of models and tasks before deployment. In this paper, we propose Adaptive Speculative Decoding (AdaSD), a hyperparameter-free decoding scheme that dynamically adjusts generation length and acceptance criteria during inference. AdaSD introduces two adaptive thresholds: one to determine when to stop candidate token generation and another to decide token acceptance, both updated in real time based on token entropy and Jensen-Shannon distance. This approach eliminates the need for pre-analysis or fine-tuning and is compatible with off-the-shelf models. Experiments on benchmark datasets demonstrate that AdaSD achieves up to 49\% speedup over standard speculative decoding while limiting accuracy degradation to under 2\%, making it a practical solution for efficient and adaptive LLM inference.
24. 【2512.11277】When Actions Teach You to Think: Reasoning-Action Synergy via Reinforcement Learning in Conversational Agents
链接:https://arxiv.org/abs/2512.11277
作者:Mrinal Rawat,Arkajyoti Chakraborty,Neha Gupta,Roberto Pieraccini
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Supervised fine-tuning, large language models, performance of large, large language, SFT
备注:
点击查看摘要
Abstract:Supervised fine-tuning (SFT) has emerged as one of the most effective ways to improve the performance of large language models (LLMs) in downstream tasks. However, SFT can have difficulty generalizing when the underlying data distribution changes, even when the new data does not fall completely outside the training domain. Recent reasoning-focused models such as o1 and R1 have demonstrated consistent gains over their non-reasoning counterparts, highlighting the importance of reasoning for improved generalization and reliability. However, collecting high-quality reasoning traces for SFT remains challenging -- annotations are costly, subjective, and difficult to scale. To address this limitation, we leverage Reinforcement Learning (RL) to enable models to learn reasoning strategies directly from task outcomes. We propose a pipeline in which LLMs generate reasoning steps that guide both the invocation of tools (e.g., function calls) and the final answer generation for conversational agents. Our method employs Group Relative Policy Optimization (GRPO) with rewards designed around tool accuracy and answer correctness, allowing the model to iteratively refine its reasoning and actions. Experimental results demonstrate that our approach improves both the quality of reasoning and the precision of tool invocations, achieving a 1.5% relative improvement over the SFT model (trained without explicit thinking) and a 40% gain compared to the base of the vanilla Qwen3-1.7B model. These findings demonstrate the promise of unifying reasoning and action learning through RL to build more capable and generalizable conversational agents.
25. 【2512.11261】Leveraging LLMs for Title and Abstract Screening for Systematic Review: A Cost-Effective Dynamic Few-Shot Learning Approach
链接:https://arxiv.org/abs/2512.11261
作者:Yun-Chung Liu,Rui Yang,Jonathan Chong Kai Liew,Ziran Yin,Henry Foote,Christopher J. Lindsell,Chuan Hong
类目:Computation and Language (cs.CL)
关键词:guiding clinical decisions, synthesizing existing research, existing research evidence, Systematic reviews, conducting systematic reviews
备注: 22 pages, 3 figures
点击查看摘要
Abstract:Systematic reviews are a key component of evidence-based medicine, playing a critical role in synthesizing existing research evidence and guiding clinical decisions. However, with the rapid growth of research publications, conducting systematic reviews has become increasingly burdensome, with title and abstract screening being one of the most time-consuming and resource-intensive steps. To mitigate this issue, we designed a two-stage dynamic few-shot learning (DFSL) approach aimed at improving the efficiency and performance of large language models (LLMs) in the title and abstract screening task. Specifically, this approach first uses a low-cost LLM for initial screening, then re-evaluates low-confidence instances using a high-performance LLM, thereby enhancing screening performance while controlling computational costs. We evaluated this approach across 10 systematic reviews, and the results demonstrate its strong generalizability and cost-effectiveness, with potential to reduce manual screening burden and accelerate the systematic review process in practical applications.
26. 【2512.11258】Multi-Intent Spoken Language Understanding: Methods, Trends, and Challenges
链接:https://arxiv.org/abs/2512.11258
作者:Di Wu,Ruiyu Fang,Liting Jiang,Shuangyong Song,Xiaomeng Huang,Shiquan Wang,Zhongqiu Li,Lingling Shi,Mengjiao Bao,Yongxiang Li,Hao Huang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:multiple intent detection, spoken language understanding, jointly handle utterances, Multi-intent spoken language, multi-intent SLU
备注:
点击查看摘要
Abstract:Multi-intent spoken language understanding (SLU) involves two tasks: multiple intent detection and slot filling, which jointly handle utterances containing more than one intent. Owing to this characteristic, which closely reflects real-world applications, the task has attracted increasing research attention, and substantial progress has been achieved. However, there remains a lack of a comprehensive and systematic review of existing studies on multi-intent SLU. To this end, this paper presents a survey of recent advances in multi-intent SLU. We provide an in-depth overview of previous research from two perspectives: decoding paradigms and modeling approaches. On this basis, we further compare the performance of representative models and analyze their strengths and limitations. Finally, we discuss the current challenges and outline promising directions for future research. We hope this survey will offer valuable insights and serve as a useful reference for advancing research in multi-intent SLU.
27. 【2512.11221】Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery: Sublinear Memory Growth for Efficient LLM Inference
链接:https://arxiv.org/abs/2512.11221
作者:Adilet Metinov,Gulida M. Kudakeeva,Bolotbek uulu Nursultan,Gulnara D. Kabaeva
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Adaptive Soft Rolling, present Adaptive Soft, Adaptive Soft, Soft Rolling, efficient large language
备注: 6 pages, 3 tables , 1 figure
点击查看摘要
Abstract:We present Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery (ASR-KF-EGR), a training-free inference-time framework for efficient large language model generation. Our method introduces a reversible soft-freeze mechanism that temporarily suspends key-value (KV) updates for low-importance tokens identified within a sliding attention window. Unlike eviction-based approaches that permanently discard context, ASR-KF-EGR preserves all tokens in off-GPU storage and restores them on demand. We extend the framework with sublinear freeze scheduling, where freeze duration grows sublinearly with repeated low-importance detections, preventing over-aggressive compression. Preliminary experiments on LLaMA-3 8B demonstrate 55-67% reduction in active KV cache size while maintaining generation quality and passing needle-in-haystack retrieval tests. The method is architecture-agnostic, requires no fine-tuning, and provides a practical solution for memory-constrained deployment of long-context LLMs.
28. 【2512.11213】FutureWeaver: Planning Test-Time Compute for Multi-Agent Systems with Modularized Collaboration
链接:https://arxiv.org/abs/2512.11213
作者:Dongwon Jung,Peng Shi,Yi Zhang
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:computation improves large, improves large language, large language model, language model performance, test-time computation improves
备注:
点击查看摘要
Abstract:Scaling test-time computation improves large language model performance without additional training. Recent work demonstrates that techniques such as repeated sampling, self-verification, and self-reflection can significantly enhance task success by allocating more inference-time compute. However, applying these techniques across multiple agents in a multi-agent system is difficult: there does not exist principled mechanisms to allocate compute to foster collaboration among agents, to extend test-time scaling to collaborative interactions, or to distribute compute across agents under explicit budget constraints. To address this gap, we propose FutureWeaver, a framework for planning and optimizing test-time compute allocation in multi-agent systems under fixed budgets. FutureWeaver introduces modularized collaboration, formalized as callable functions that encapsulate reusable multi-agent workflows. These modules are automatically derived through self-play reflection by abstracting recurring interaction patterns from past trajectories. Building on these modules, FutureWeaver employs a dual-level planning architecture that optimizes compute allocation by reasoning over the current task state while also speculating on future steps. Experiments on complex agent benchmarks demonstrate that FutureWeaver consistently outperforms baselines across diverse budget settings, validating its effectiveness for multi-agent collaboration in inference-time optimization.
29. 【2512.11192】SciLaD: A Large-Scale, Transparent, Reproducible Dataset for Natural Scientific Language Processing
链接:https://arxiv.org/abs/2512.11192
作者:Luca Foppiano,Sotaro Takeshita,Pedro Ortiz Suarez,Ekaterina Borisova,Raia Abu Ahmad,Malte Ostendorff,Fabio Barth,Julian Moreno-Schneider,Georg Rehm
类目:Computation and Language (cs.CL)
关键词:unfiltered TEI XML, TEI XML split, frameworks and publicly, TEI XML, scientific language constructed
备注: 12 pages, 2 figures, 3 tables
点击查看摘要
Abstract:SciLaD is a novel, large-scale dataset of scientific language constructed entirely using open-source frameworks and publicly available data sources. It comprises a curated English split containing over 10 million scientific publications and a multilingual, unfiltered TEI XML split including more than 35 million publications. We also publish the extensible pipeline for generating SciLaD. The dataset construction and processing workflow demonstrates how open-source tools can enable large-scale, scientific data curation while maintaining high data quality. Finally, we pre-train a RoBERTa model on our dataset and evaluate it across a comprehensive set of benchmarks, achieving performance comparable to other scientific language models of similar size, validating the quality and utility of SciLaD. We publish the dataset and evaluation pipeline to promote reproducibility, transparency, and further research in natural scientific language processing and understanding including scholarly document processing.
30. 【2512.11110】FIBER: A Multilingual Evaluation Resource for Factual Inference Bias
链接:https://arxiv.org/abs/2512.11110
作者:Evren Ayberk Munis,Deniz Yılmaz,Arianna Muti,Çağrı Toraman
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:reliability and biases, Large language models, factual reliability, factual, language
备注:
点击查看摘要
Abstract:Large language models are widely used across domains, yet there are concerns about their factual reliability and biases. Factual knowledge probing offers a systematic means to evaluate these aspects. Most existing benchmarks focus on single-entity facts and monolingual data. We therefore present FIBER, a multilingual benchmark for evaluating factual knowledge in single- and multi-entity settings. The dataset includes sentence completion, question-answering, and object-count prediction tasks in English, Italian, and Turkish. Using FIBER, we examine whether the prompt language induces inference bias in entity selection and how large language models perform on multi-entity versus single-entity questions. The results indicate that the language of the prompt can influence the model's generated output, particularly for entities associated with the country corresponding to that language. However, this effect varies across different topics such that 31% of the topics exhibit factual inference bias score greater than 0.5. Moreover, the level of bias differs across languages such that Turkish prompts show higher bias compared to Italian in 83% of the topics, suggesting a language-dependent pattern. Our findings also show that models face greater difficulty when handling multi-entity questions than the single-entity questions. Model performance differs across both languages and model sizes. The highest mean average precision is achieved in English, while Turkish and Italian lead to noticeably lower scores. Larger models, including Llama-3.1-8B and Qwen-2.5-7B, show consistently better performance than smaller 3B-4B models.
31. 【2512.11108】Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution
链接:https://arxiv.org/abs/2512.11108
作者:Jonathan Kamp,Roos Bakker,Dominique Blok
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Good quality explanations, Good quality, quality explanations strengthen, strengthen the understanding, understanding of language
备注:
点击查看摘要
Abstract:Good quality explanations strengthen the understanding of language models and data. Feature attribution methods, such as Integrated Gradient, are a type of post-hoc explainer that can provide token-level insights. However, explanations on the same input may vary greatly due to underlying biases of different methods. Users may be aware of this issue and mistrust their utility, while unaware users may trust them inadequately. In this work, we delve beyond the superficial inconsistencies between attribution methods, structuring their biases through a model- and method-agnostic framework of three evaluation metrics. We systematically assess both the lexical and position bias (what and where in the input) for two transformers; first, in a controlled, pseudo-random classification task on artificial data; then, in a semi-controlled causal relation detection task on natural data. We find that lexical and position biases are structurally unbalanced in our model comparison, with models that score high on one type score low on the other. We also find signs that methods producing anomalous explanations are more likely to be biased themselves.
32. 【2512.11079】Applying NLP to iMessages: Understanding Topic Avoidance, Responsiveness, and Sentiment
链接:https://arxiv.org/abs/2512.11079
作者:Alan Gerber,Sam Cooperman
类目:Computation and Language (cs.CL); Computers and Society (cs.CY); Applications (stat.AP); Other Statistics (stat.OT)
关键词:data, Abstract, users, short-form electronic communication, messaging
备注: 11 pages, 18 figures, [this https URL](https://github.com/Alanshnir/imessage-analyzer/blob/main/Research/NLP-iMessage-Analyzer%20Findings.pdf)
点击查看摘要
Abstract:What is your messaging data used for? While many users do not often think about the information companies can gather based off of their messaging platform of choice, it is nonetheless important to consider as society increasingly relies on short-form electronic communication. While most companies keep their data closely guarded, inaccessible to users or potential hackers, Apple has opened a door to their walled-garden ecosystem, providing iMessage users on Mac with one file storing all their messages and attached metadata. With knowledge of this locally stored file, the question now becomes: What can our data do for us? In the creation of our iMessage text message analyzer, we set out to answer five main research questions focusing on topic modeling, response times, reluctance scoring, and sentiment analysis. This paper uses our exploratory data to show how these questions can be answered using our analyzer and its potential in future studies on iMessage data.
33. 【2512.11074】MultiScript30k: Leveraging Multilingual Embeddings to Extend Cross Script Parallel Data
链接:https://arxiv.org/abs/2512.11074
作者:Christopher Driggers-Ellis,Detravious Brinkley,Ray Chen,Aashish Dhawan,Daisy Zhe Wang,Christan Grant
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
关键词:offering parallel text, deep learning models, parallel text data, fine-tuning deep learning, multimodal machine translation
备注: 7 pages, 2 figures, 5 tables. Not published at any conference at this time
点击查看摘要
Abstract:Multi30k is frequently cited in the multimodal machine translation (MMT) literature, offering parallel text data for training and fine-tuning deep learning models. However, it is limited to four languages: Czech, English, French, and German. This restriction has led many researchers to focus their investigations only on these languages. As a result, MMT research on diverse languages has been stalled because the official Multi30k dataset only represents European languages in Latin scripts. Previous efforts to extend Multi30k exist, but the list of supported languages, represented language families, and scripts is still very short. To address these issues, we propose MultiScript30k, a new Multi30k dataset extension for global languages in various scripts, created by translating the English version of Multi30k (Multi30k-En) using NLLB200-3.3B. The dataset consists of over \(30000\) sentences and provides translations of all sentences in Multi30k-En into Ar, Es, Uk, Zh\_Hans and Zh\_Hant. Similarity analysis shows that Multi30k extension consistently achieves greater than \(0.8\) cosine similarity and symmetric KL divergence less than \(0.000251\) for all languages supported except Zh\_Hant which is comparable to the previous Multi30k extensions ArEnMulti30k and Multi30k-Uk. COMETKiwi scores reveal mixed assessments of MultiScript30k as a translation of Multi30k-En in comparison to the related work. ArEnMulti30k scores nearly equal MultiScript30k-Ar, but Multi30k-Uk scores $6.4\%$ greater than MultiScript30k-Uk per split.
34. 【2512.11013】PIAST: Rapid Prompting with In-context Augmentation for Scarce Training data
链接:https://arxiv.org/abs/2512.11013
作者:Pawel Batorski,Paul Swoboda
类目:Computation and Language (cs.CL)
关键词:requires intricate crafting, handcrafting effective prompts, Monte Carlo Shapley, LLMs are highly, highly sensitive
备注:
点击查看摘要
Abstract:LLMs are highly sensitive to prompt design, but handcrafting effective prompts is difficult and often requires intricate crafting of few-shot examples. We propose a fast automatic prompt construction algorithm that augments human instructions by generating a small set of few shot examples. Our method iteratively replaces/drops/keeps few-shot examples using Monte Carlo Shapley estimation of example utility. For faster execution, we use aggressive subsampling and a replay buffer for faster evaluations. Our method can be run using different compute time budgets. On a limited budget, we outperform existing automatic prompting methods on text simplification and GSM8K and obtain second best results on classification and summarization. With an extended, but still modest compute budget we set a new state of the art among automatic prompting methods on classification, simplification and GSM8K. Our results show that carefully constructed examples, rather than exhaustive instruction search, are the dominant lever for fast and data efficient prompt engineering. Our code is available at this https URL.
35. 【2512.10999】KBQA-R1: Reinforcing Large Language Models for Knowledge Base Question Answering
链接:https://arxiv.org/abs/2512.10999
作者:Xin Sun,Zhongqi Chen,Xing Zheng,Qiang Liu,Shu Wu,Bowen Song,Zilei Wang,Weiqiang Wang,Liang Wang
类目:Computation and Language (cs.CL)
关键词:Base Question Answering, Question Answering, executable logical forms, generating executable logical, Knowledge Base Question
备注:
点击查看摘要
Abstract:Knowledge Base Question Answering (KBQA) challenges models to bridge the gap between natural language and strict knowledge graph schemas by generating executable logical forms. While Large Language Models (LLMs) have advanced this field, current approaches often struggle with a dichotomy of failure: they either generate hallucinated queries without verifying schema existence or exhibit rigid, template-based reasoning that mimics synthesized traces without true comprehension of the environment. To address these limitations, we present \textbf{KBQA-R1}, a framework that shifts the paradigm from text imitation to interaction optimization via Reinforcement Learning. Treating KBQA as a multi-turn decision process, our model learns to navigate the knowledge base using a list of actions, leveraging Group Relative Policy Optimization (GRPO) to refine its strategies based on concrete execution feedback rather than static supervision. Furthermore, we introduce \textbf{Referenced Rejection Sampling (RRS)}, a data synthesis method that resolves cold-start challenges by strictly aligning reasoning traces with ground-truth action sequences. Extensive experiments on WebQSP, GrailQA, and GraphQuestions demonstrate that KBQA-R1 achieves state-of-the-art performance, effectively grounding LLM reasoning in verifiable execution.
36. 【2512.10998】SCOUT: A Defense Against Data Poisoning Attacks in Fine-Tuned Language Models
链接:https://arxiv.org/abs/2512.10998
作者:Mohamed Afane,Abhishek Satyam,Ke Chen,Tao Li,Junaid Farooq,Juntao Chen
类目:Cryptography and Security (cs.CR); Computation and Language (cs.CL)
关键词:presenting critical risks, create significant security, embedding hidden triggers, significant security threats, manipulate model behavior
备注: 9 pages, 3 figures
点击查看摘要
Abstract:Backdoor attacks create significant security threats to language models by embedding hidden triggers that manipulate model behavior during inference, presenting critical risks for AI systems deployed in healthcare and other sensitive domains. While existing defenses effectively counter obvious threats such as out-of-context trigger words and safety alignment violations, they fail against sophisticated attacks using contextually-appropriate triggers that blend seamlessly into natural language. This paper introduces three novel contextually-aware attack scenarios that exploit domain-specific knowledge and semantic plausibility: the ViralApp attack targeting social media addiction classification, the Fever attack manipulating medical diagnosis toward hypertension, and the Referral attack steering clinical recommendations. These attacks represent realistic threats where malicious actors exploit domain-specific vocabulary while maintaining semantic coherence, demonstrating how adversaries can weaponize contextual appropriateness to evade conventional detection methods. To counter both traditional and these sophisticated attacks, we present \textbf{SCOUT (Saliency-based Classification Of Untrusted Tokens)}, a novel defense framework that identifies backdoor triggers through token-level saliency analysis rather than traditional context-based detection methods. SCOUT constructs a saliency map by measuring how the removal of individual tokens affects the model's output logits for the target label, enabling detection of both conspicuous and subtle manipulation attempts. We evaluate SCOUT on established benchmark datasets (SST-2, IMDB, AG News) against conventional attacks (BadNet, AddSent, SynBkd, StyleBkd) and our novel attacks, demonstrating that SCOUT successfully detects these sophisticated threats while preserving accuracy on clean inputs.
37. 【2512.10996】MedBioRAG: Semantic Search and Retrieval-Augmented Generation with Large Language Models for Medical and Biological QA
链接:https://arxiv.org/abs/2512.10996
作者:Seonok Kim
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:perform complex question-answering, large language models, Recent advancements, complex question-answering, significantly enhanced
备注: Submitted to ACL 2025. 9 pages, 4 figures, 5 tables (including 2 appendix tables)
点击查看摘要
Abstract:Recent advancements in retrieval-augmented generation (RAG) have significantly enhanced the ability of large language models (LLMs) to perform complex question-answering (QA) tasks. In this paper, we introduce MedBioRAG, a retrieval-augmented model designed to improve biomedical QA performance through a combination of semantic and lexical search, document retrieval, and supervised fine-tuning. MedBioRAG efficiently retrieves and ranks relevant biomedical documents, enabling precise and context-aware response generation. We evaluate MedBioRAG across text retrieval, close-ended QA, and long-form QA tasks using benchmark datasets such as NFCorpus, TREC-COVID, MedQA, PubMedQA, and BioASQ. Experimental results demonstrate that MedBioRAG outperforms previous state-of-the-art (SoTA) models and the GPT-4o base model in all evaluated tasks. Notably, our approach improves NDCG and MRR scores for document retrieval, while achieving higher accuracy in close-ended QA and ROUGE scores in long-form QA. Our findings highlight the effectiveness of semantic search-based retrieval and LLM fine-tuning in biomedical applications.
38. 【2512.10968】Benchmarking Automatic Speech Recognition Models for African Languages
链接:https://arxiv.org/abs/2512.10968
作者:Alvin Nahabwe,Sulaiman Kagumire,Denis Musinguzi,Bruno Beijuka,Jonah Mubuuke Kyagaba,Peter Nabende,Andrew Katumba,Joyce Nakatumba-Nabende
类目:Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
关键词:Automatic speech recognition, Automatic speech, African languages remains, limited labeled data, languages remains constrained
备注: 19 pages, 8 figures, Deep Learning Indiba, Proceedings of Machine Learning Research
点击查看摘要
Abstract:Automatic speech recognition (ASR) for African languages remains constrained by limited labeled data and the lack of systematic guidance on model selection, data scaling, and decoding strategies. Large pre-trained systems such as Whisper, XLS-R, MMS, and W2v-BERT have expanded access to ASR technology, but their comparative behavior in African low-resource contexts has not been studied in a unified and systematic way. In this work, we benchmark four state-of-the-art ASR models across 13 African languages, fine-tuning them on progressively larger subsets of transcribed data ranging from 1 to 400 hours. Beyond reporting error rates, we provide new insights into why models behave differently under varying conditions. We show that MMS and W2v-BERT are more data efficient in very low-resource regimes, XLS-R scales more effectively as additional data becomes available, and Whisper demonstrates advantages in mid-resource conditions. We also analyze where external language model decoding yields improvements and identify cases where it plateaus or introduces additional errors, depending on the alignment between acoustic and text resources. By highlighting the interaction between pre-training coverage, model architecture, dataset domain, and resource availability, this study offers practical and insights into the design of ASR systems for underrepresented languages.
39. 【2512.10967】ASR Under the Stethoscope: Evaluating Biases in Clinical Speech Recognition across Indian Languages
链接:https://arxiv.org/abs/2512.10967
作者:Subham Kumar,Prakrithi Shivaprakash,Abhishek Manoharan,Astut Kurariya,Diptadhi Mukherjee,Lekhansh Shukla,Animesh Mukherjee,Prabhat Chand,Pratima Murthy
类目:Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
关键词:Automatic Speech Recognition, remains largely unknown, contexts remains largely, document clinical encounters, demographically diverse Indian
备注:
点击查看摘要
Abstract:Automatic Speech Recognition (ASR) is increasingly used to document clinical encounters, yet its reliability in multilingual and demographically diverse Indian healthcare contexts remains largely unknown. In this study, we conduct the first systematic audit of ASR performance on real world clinical interview data spanning Kannada, Hindi, and Indian English, comparing leading models including Indic Whisper, Whisper, Sarvam, Google speech to text, Gemma3n, Omnilingual, Vaani, and Gemini. We evaluate transcription accuracy across languages, speakers, and demographic subgroups, with a particular focus on error patterns affecting patients vs. clinicians and gender based or intersectional disparities. Our results reveal substantial variability across models and languages, with some systems performing competitively on Indian English but failing on code mixed or vernacular speech. We also uncover systematic performance gaps tied to speaker role and gender, raising concerns about equitable deployment in clinical settings. By providing a comprehensive multilingual benchmark and fairness analysis, our work highlights the need for culturally and demographically inclusive ASR development for healthcare ecosystem in India.
40. 【2512.05103】V2TV: A Unified Framework for Interleaved Language and Video Generation
链接:https://arxiv.org/abs/2512.05103
作者:Xiaochuang Han,Youssef Emad,Melissa Hall,John Nguyen,Karthik Padthe,Liam Robbins,Amir Bar,Delong Chen,Michal Drozdzal,Maha Elbayad,Yushi Hu,Shang-Wen Li,Sreya Dutta Roy,Jakob Verbeek,XuDong Wang,Marjan Ghazvininejad,Luke Zettlemoyer,Emily Dinan
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:require significant semantic, significant semantic branching, repeated high-level reasoning, Video generation, rapidly advancing
备注:
点击查看摘要
Abstract:Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to "think in words" about subsequent content before ``acting in pixels'' to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.
信息检索
1. 【2512.11635】Automating Historical Insight Extraction from Large-Scale Newspaper Archives via Neural Topic Modeling
链接:https://arxiv.org/abs/2512.11635
作者:Keerthana Murugaraj,Salima Lamsiyah,Marten During,Martin Theobald
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:Optical Character Recognition, Optical Character, Character Recognition, presents significant challenges, significant challenges due
备注: This is a preprint of a manuscript submitted to Digital Scholarship in the Humanities (Oxford University Press). The paper is currently under peer review
点击查看摘要
Abstract:Extracting coherent and human-understandable themes from large collections of unstructured historical newspaper archives presents significant challenges due to topic evolution, Optical Character Recognition (OCR) noise, and the sheer volume of text. Traditional topic-modeling methods, such as Latent Dirichlet Allocation (LDA), often fall short in capturing the complexity and dynamic nature of discourse in historical texts. To address these limitations, we employ BERTopic. This neural topic-modeling approach leverages transformerbased embeddings to extract and classify topics, which, despite its growing popularity, still remains underused in historical research. Our study focuses on articles published between 1955 and 2018, specifically examining discourse on nuclear power and nuclear safety. We analyze various topic distributions across the corpus and trace their temporal evolution to uncover long-term trends and shifts in public discourse. This enables us to more accurately explore patterns in public discourse, including the co-occurrence of themes related to nuclear power and nuclear weapons and their shifts in topic importance over time. Our study demonstrates the scalability and contextual sensitivity of BERTopic as an alternative to traditional approaches, offering richer insights into historical discourses extracted from newspaper archives. These findings contribute to historical, nuclear, and social-science research while reflecting on current limitations and proposing potential directions for future work.
2. 【2512.11490】VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing
链接:https://arxiv.org/abs/2512.11490
作者:Emanuel Sánchez Aimar,Gulnaz Zhambulova,Fahad Shahbaz Khan,Yonghao Xu,Michael Felsberg
类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
关键词:Satellite imagery differs, diverse scale variations, imagery differs fundamentally, small objects demand, Satellite imagery
备注: 21 pages, 7 figures, under review
点击查看摘要
Abstract:Satellite imagery differs fundamentally from natural images: its aerial viewpoint, very high resolution, diverse scale variations, and abundance of small objects demand both region-level spatial reasoning and holistic scene understanding. Current remote-sensing approaches remain fragmented between dual-encoder retrieval models, which excel at large-scale cross-modal search but cannot interleave modalities, and generative assistants, which support region-level interpretation but lack scalable retrieval capabilities. We propose $\textbf{VLM2GeoVec}$, an instruction-following, single-encoder vision-language model trained contrastively to embed interleaved inputs (images, text, bounding boxes, and geographic coordinates) in a unified vector space. Our single encoder interleaves all inputs into one joint embedding trained with a contrastive loss, eliminating multi-stage pipelines and task-specific modules. To evaluate its versatility, we introduce $\textbf{RSMEB}$, a novel benchmark covering key remote-sensing embedding applications: scene classification; cross-modal search; compositional retrieval; visual-question answering; visual grounding and region-level reasoning; and semantic geospatial retrieval. On RSMEB, it achieves $\textbf{26.6%}$ P@1 on region-caption retrieval (+25 pp vs. dual-encoder baselines), $\textbf{32.5%}$ P@1 on referring-expression retrieval (+19 pp), and $\textbf{17.8%}$ P@1 on semantic geo-localization retrieval (over $3\times$ prior best), while matching or exceeding specialized baselines on conventional tasks such as scene classification and cross-modal retrieval. VLM2GeoVec unifies scalable retrieval with region-level spatial reasoning, enabling cohesive multimodal analysis in remote sensing. We will publicly release the code, checkpoints, and data upon acceptance.
3. 【2512.11474】General-purpose AI models can generate actionable knowledge on agroecological crop protection
链接:https://arxiv.org/abs/2512.11474
作者:Kris A.G. Wyckhuys
类目:Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Information Retrieval (cs.IR)
关键词:Generative artificial intelligence, science remains unexplored, agri-food science remains, Generative artificial, actionable information
备注: 33 pages, 3 figures, 3 tables, 1 supplementary table
点击查看摘要
Abstract:Generative artificial intelligence (AI) offers potential for democratizing scientific knowledge and converting this to clear, actionable information, yet its application in agri-food science remains unexplored. Here, we verify the scientific knowledge on agroecological crop protection that is generated by either web-grounded or non-grounded large language models (LLMs), i.e., DeepSeek versus the free-tier version of ChatGPT. For nine globally limiting pests, weeds, and plant diseases, we assessed the factual accuracy, data consistency, and breadth of knowledge or data completeness of each LLM. Overall, DeepSeek consistently screened a 4.8-49.7-fold larger literature corpus and reported 1.6-2.4-fold more biological control agents or management solutions than ChatGPT. As a result, DeepSeek reported 21.6% higher efficacy estimates, exhibited greater laboratory-to-field data consistency, and showed more realistic effects of pest identity and management tactics. However, both models hallucinated, i.e., fabricated fictitious agents or references, reported on implausible ecological interactions or outcomes, confused old and new scientific nomenclatures, and omitted data on key agents or solutions. Despite these shortcomings, both LLMs correctly reported low-resolution efficacy trends. Overall, when paired with rigorous human oversight, LLMs may pose a powerful tool to support farm-level decision-making and unleash scientific creativity.
4. 【2512.11254】FAIR: Focused Attention Is All You Need for Generative Recommendation
链接:https://arxiv.org/abs/2512.11254
作者:Longtao Xiao,Haolin Zhang,Guohao Cai,Jieming Zhu,Yifan Wang,Heng Chang,Zhenhua Dong,Xiu Li,Ruixuan Li
类目:Information Retrieval (cs.IR)
关键词:garnered significant attention, garnered significant, user behavior modeling, modeling user behavior, attention
备注:
点击查看摘要
Abstract:Recently, transformer-based generative recommendation has garnered significant attention for user behavior modeling. However, it often requires discretizing items into multi-code representations (e.g., typically four code tokens or more), which sharply increases the length of the original item sequence. This expansion poses challenges to transformer-based models for modeling user behavior sequences with inherent noises, since they tend to overallocate attention to irrelevant or noisy context. To mitigate this issue, we propose FAIR, the first generative recommendation framework with focused attention, which enhances attention scores to relevant context while suppressing those to irrelevant ones. Specifically, we propose (1) a focused attention mechanism integrated into the standard Transformer, which learns two separate sets of Q and K attention weights and computes their difference as the final attention scores to eliminate attention noise while focusing on relevant contexts; (2) a noise-robustness objective, which encourages the model to maintain stable attention patterns under stochastic perturbations, preventing undesirable shifts toward irrelevant context due to noise; and (3) a mutual information maximization objective, which guides the model to identify contexts that are most informative for next-item prediction. We validate the effectiveness of FAIR on four public benchmarks, demonstrating its superior performance compared to existing methods.
5. 【2512.10963】Emotion-Driven Personalized Recommendation for AI-Generated Content Using Multi-Modal Sentiment and Intent Analysis
链接:https://arxiv.org/abs/2512.10963
作者:Zheqi Hu,Xuanjing Chen,Jinlin Hu
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Multimedia (cs.MM)
关键词:emotionally aware recommendation, increasingly important, aware recommendation systems, rapid growth, demand for emotionally
备注:
点击查看摘要
Abstract:With the rapid growth of AI-generated content (AIGC) across domains such as music, video, and literature, the demand for emotionally aware recommendation systems has become increasingly important. Traditional recommender systems primarily rely on user behavioral data such as clicks, views, or ratings, while neglecting users' real-time emotional and intentional states during content interaction. To address this limitation, this study proposes a Multi-Modal Emotion and Intent Recognition Model (MMEI) based on a BERT-based Cross-Modal Transformer with Attention-Based Fusion, integrated into a cloud-native personalized AIGC recommendation framework. The proposed system jointly processes visual (facial expression), auditory (speech tone), and textual (comments or utterances) modalities through pretrained encoders ViT, Wav2Vec2, and BERT, followed by an attention-based fusion module to learn emotion-intent representations. These embeddings are then used to drive personalized content recommendations through a contextual matching layer. Experiments conducted on benchmark emotion datasets (AIGC-INT, MELD, and CMU-MOSEI) and an AIGC interaction dataset demonstrate that the proposed MMEI model achieves a 4.3% improvement in F1-score and a 12.3% reduction in cross-entropy loss compared to the best fusion-based transformer baseline. Furthermore, user-level online evaluations reveal that emotion-driven recommendations increase engagement time by 15.2% and enhance satisfaction scores by 11.8%, confirming the model's effectiveness in aligning AI-generated content with users' affective and intentional states. This work highlights the potential of cross-modal emotional intelligence for next-generation AIGC ecosystems, enabling adaptive, empathetic, and context-aware recommendation experiences.
计算机视觉
1. 【2512.11800】Moment-Based 3D Gaussian Splatting: Resolving Volumetric Occlusion with Order-Independent Transmittance
链接:https://arxiv.org/abs/2512.11800
作者:Jan U. Müller,Robin Tim Landsgesell,Leif Van Holland,Patrick Stotko,Reinhard Klein
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:high-quality radiance fields, enabling fast optimization, Gaussian Splatting, radiance fields, recent success
备注:
点击查看摘要
Abstract:The recent success of 3D Gaussian Splatting (3DGS) has reshaped novel view synthesis by enabling fast optimization and real-time rendering of high-quality radiance fields. However, it relies on simplified, order-dependent alpha blending and coarse approximations of the density integral within the rasterizer, thereby limiting its ability to render complex, overlapping semi-transparent objects. In this paper, we extend rasterization-based rendering of 3D Gaussian representations with a novel method for high-fidelity transmittance computation, entirely avoiding the need for ray tracing or per-pixel sample sorting. Building on prior work in moment-based order-independent transparency, our key idea is to characterize the density distribution along each camera ray with a compact and continuous representation based on statistical moments. To this end, we analytically derive and compute a set of per-pixel moments from all contributing 3D Gaussians. From these moments, a continuous transmittance function is reconstructed for each ray, which is then independently sampled within each Gaussian. As a result, our method bridges the gap between rasterization and physical accuracy by modeling light attenuation in complex translucent media, significantly improving overall reconstruction and rendering quality.
2. 【2512.11799】V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties
链接:https://arxiv.org/abs/2512.11799
作者:Ye Fang,Tong Wu,Valentin Deschaintre,Duygu Ceylan,Iliyan Georgiev,Chun-Hao Paul Huang,Yiwei Hu,Xuelin Chen,Tuanfeng Yang Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large-scale video generation, shown remarkable potential, video generation models, Large-scale video, generation models
备注: Project Page: [this https URL](https://aleafy.github.io/vrgbx)
点击查看摘要
Abstract:Large-scale video generation models have shown remarkable potential in modeling photorealistic appearance and lighting interactions in real-world scenes. However, a closed-loop framework that jointly understands intrinsic scene properties (e.g., albedo, normal, material, and irradiance), leverages them for video synthesis, and supports editable intrinsic representations remains unexplored. We present V-RGBX, the first end-to-end framework for intrinsic-aware video editing. V-RGBX unifies three key capabilities: (1) video inverse rendering into intrinsic channels, (2) photorealistic video synthesis from these intrinsic representations, and (3) keyframe-based video editing conditioned on intrinsic channels. At the core of V-RGBX is an interleaved conditioning mechanism that enables intuitive, physically grounded video editing through user-selected keyframes, supporting flexible manipulation of any intrinsic modality. Extensive qualitative and quantitative results show that V-RGBX produces temporally consistent, photorealistic videos while propagating keyframe edits across sequences in a physically plausible manner. We demonstrate its effectiveness in diverse applications, including object appearance editing and scene-level relighting, surpassing the performance of prior methods.
3. 【2512.11798】Particulate: Feed-Forward 3D Object Articulation
链接:https://arxiv.org/abs/2512.11798
作者:Ruining Li,Yuxin Yao,Chuanxia Zheng,Christian Rupprecht,Joan Lasenby,Shangzhe Wu,Andrea Vedaldi
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
关键词:Part Articulation Transformer, motion constraints, underlying articulated structure, input mesh, Part Articulation
备注: Project page: [this https URL](https://ruiningli.com/particulate)
点击查看摘要
Abstract:We present Particulate, a feed-forward approach that, given a single static 3D mesh of an everyday object, directly infers all attributes of the underlying articulated structure, including its 3D parts, kinematic structure, and motion constraints. At its core is a transformer network, Part Articulation Transformer, which processes a point cloud of the input mesh using a flexible and scalable architecture to predict all the aforementioned attributes with native multi-joint support. We train the network end-to-end on a diverse collection of articulated 3D assets from public datasets. During inference, Particulate lifts the network's feed-forward prediction to the input mesh, yielding a fully articulated 3D model in seconds, much faster than prior approaches that require per-object optimization. Particulate can also accurately infer the articulated structure of AI-generated 3D assets, enabling full-fledged extraction of articulated 3D objects from a single (real or synthetic) image when combined with an off-the-shelf image-to-3D generator. We further introduce a new challenging benchmark for 3D articulation estimation curated from high-quality public 3D assets, and redesign the evaluation protocol to be more consistent with human preferences. Quantitative and qualitative results show that Particulate significantly outperforms state-of-the-art approaches.
4. 【2512.11797】AnchorDream: Repurposing Video Diffusion for Embodiment-Aware Robot Data Synthesis
链接:https://arxiv.org/abs/2512.11797
作者:Junjie Ye,Rong Xue,Basile Van Hoorick,Pavel Tokmakov,Muhammad Zubair Irshad,Yue Wang,Vitor Guizilini
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:offer limited diversity, simulators offer limited, fidelity with pronounced, collection of large-scale, remains a major
备注: Project page: [this https URL](https://jay-ye.github.io/AnchorDream/)
点击查看摘要
Abstract:The collection of large-scale and diverse robot demonstrations remains a major bottleneck for imitation learning, as real-world data acquisition is costly and simulators offer limited diversity and fidelity with pronounced sim-to-real gaps. While generative models present an attractive solution, existing methods often alter only visual appearances without creating new behaviors, or suffer from embodiment inconsistencies that yield implausible motions. To address these limitations, we introduce AnchorDream, an embodiment-aware world model that repurposes pretrained video diffusion models for robot data synthesis. AnchorDream conditions the diffusion process on robot motion renderings, anchoring the embodiment to prevent hallucination while synthesizing objects and environments consistent with the robot's kinematics. Starting from only a handful of human teleoperation demonstrations, our method scales them into large, diverse, high-quality datasets without requiring explicit environment modeling. Experiments show that the generated data leads to consistent improvements in downstream policy learning, with relative gains of 36.4% in simulator benchmarks and nearly double performance in real-world studies. These results suggest that grounding generative world models in robot motion provides a practical path toward scaling imitation learning.
5. 【2512.11792】Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation
链接:https://arxiv.org/abs/2512.11792
作者:Yang Fei,George Stoica,Jingyuan Liu,Qifeng Chen,Ranjay Krishna,Xiaojuan Wang,Benlin Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:dance between rigid, rigid constraints, Reality, constraints and deformable, deformable structures
备注: Project Website: [this https URL](https://sam2videox.github.io/)
点击查看摘要
Abstract:Reality is a dance between rigid constraints and deformable structures. For video models, that means generating motion that preserves fidelity as well as structure. Despite progress in diffusion models, producing realistic structure-preserving motion remains challenging, especially for articulated and deformable objects such as humans and animals. Scaling training data alone, so far, has failed to resolve physically implausible transitions. Existing approaches rely on conditioning with noisy motion representations, such as optical flow or skeletons extracted using an external imperfect model. To address these challenges, we introduce an algorithm to distill structure-preserving motion priors from an autoregressive video tracking model (SAM2) into a bidirectional video diffusion model (CogVideoX). With our method, we train SAM2VideoX, which contains two innovations: (1) a bidirectional feature fusion module that extracts global structure-preserving motion priors from a recurrent model like SAM2; (2) a Local Gram Flow loss that aligns how local features move together. Experiments on VBench and in human studies show that SAM2VideoX delivers consistent gains (+2.60\% on VBench, 21-22\% lower FVD, and 71.4\% human preference) over prior baselines. Specifically, on VBench, we achieve 95.51\%, surpassing REPA (92.91\%) by 2.60\%, and reduce FVD to 360.57, a 21.20\% and 22.46\% improvement over REPA- and LoRA-finetuning, respectively. The project website can be found at this https URL .
6. 【2512.11791】Uncertainty-Aware Domain Adaptation for Vitiligo Segmentation in Clinical Photographs
链接:https://arxiv.org/abs/2512.11791
作者:Wentao Jiang,Vamsi Varra,Caitlin Perez-Stable,Harrison Zhu,Meredith Apicella,Nicole Nyamongo
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Accurately quantifying vitiligo, routine clinical photographs, Accurately quantifying, quantifying vitiligo extent, High-Frequency Spectral Gating
备注:
点击查看摘要
Abstract:Accurately quantifying vitiligo extent in routine clinical photographs is crucial for longitudinal monitoring of treatment response. We propose a trustworthy, frequency-aware segmentation framework built on three synergistic pillars: (1) a data-efficient training strategy combining domain-adaptive pre-training on the ISIC 2019 dataset with an ROI-constrained dual-task loss to suppress background noise; (2) an architectural refinement via a ConvNeXt V2-based encoder enhanced with a novel High-Frequency Spectral Gating (HFSG) module and stem-skip connections to capture subtle textures; and (3) a clinical trust mechanism employing K-fold ensemble and Test-Time Augmentation (TTA) to generate pixel-wise uncertainty maps. Extensive validation on an expert-annotated clinical cohort demonstrates superior performance, achieving a Dice score of 85.05% and significantly reducing boundary error (95% Hausdorff Distance improved from 44.79 px to 29.95 px), consistently outperforming strong CNN (ResNet-50 and UNet++) and Transformer (MiT-B5) baselines. Notably, our framework demonstrates high reliability with zero catastrophic failures and provides interpretable entropy maps to identify ambiguous regions for clinician review. Our approach suggests that the proposed framework establishes a robust and reliable standard for automated vitiligo assessment.
7. 【2512.11782】MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator
链接:https://arxiv.org/abs/2512.11782
作者:Peiqing Yang,Shangchen Zhou,Kai Hao,Qingyi Tao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:matting remains limited, remains limited, realism of existing, Video matting remains, Matting Quality Evaluator
备注: Project page: [this https URL](https://pq-yang.github.io/projects/MatAnyone2/)
点击查看摘要
Abstract:Video matting remains limited by the scale and realism of existing datasets. While leveraging segmentation data can enhance semantic stability, the lack of effective boundary supervision often leads to segmentation-like mattes lacking fine details. To this end, we introduce a learned Matting Quality Evaluator (MQE) that assesses semantic and boundary quality of alpha mattes without ground truth. It produces a pixel-wise evaluation map that identifies reliable and erroneous regions, enabling fine-grained quality assessment. The MQE scales up video matting in two ways: (1) as an online matting-quality feedback during training to suppress erroneous regions, providing comprehensive supervision, and (2) as an offline selection module for data curation, improving annotation quality by combining the strengths of leading video and image matting models. This process allows us to build a large-scale real-world video matting dataset, VMReal, containing 28K clips and 2.4M frames. To handle large appearance variations in long videos, we introduce a reference-frame training strategy that incorporates long-range frames beyond the local window for effective training. Our MatAnyone 2 achieves state-of-the-art performance on both synthetic and real-world benchmarks, surpassing prior methods across all metrics.
8. 【2512.11771】Smudged Fingerprints: A Systematic Evaluation of the Robustness of AI Image Fingerprints
链接:https://arxiv.org/abs/2512.11771
作者:Kai Yao,Marc Juarez
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:remains largely unexplored, conditions remains largely, attributing AI-generated images, adversarial conditions remains, Model fingerprint detection
备注: This work has been accepted for publication in the 4th IEEE Conference on Secure and Trustworthy Machine Learning (IEEE SaTML 2026). The final version will be available on IEEE Xplore
点击查看摘要
Abstract:Model fingerprint detection techniques have emerged as a promising approach for attributing AI-generated images to their source models, but their robustness under adversarial conditions remains largely unexplored. We present the first systematic security evaluation of these techniques, formalizing threat models that encompass both white- and black-box access and two attack goals: fingerprint removal, which erases identifying traces to evade attribution, and fingerprint forgery, which seeks to cause misattribution to a target model. We implement five attack strategies and evaluate 14 representative fingerprinting methods across RGB, frequency, and learned-feature domains on 12 state-of-the-art image generators. Our experiments reveal a pronounced gap between clean and adversarial performance. Removal attacks are highly effective, often achieving success rates above 80% in white-box settings and over 50% under constrained black-box access. While forgery is more challenging than removal, its success significantly varies across targeted models. We also identify a utility-robustness trade-off: methods with the highest attribution accuracy are often vulnerable to attacks. Although some techniques exhibit robustness in specific settings, none achieves high robustness and accuracy across all evaluated threat models. These findings highlight the need for techniques balancing robustness and accuracy, and identify the most promising approaches for advancing this goal.
9. 【2512.11763】Reducing Domain Gap with Diffusion-Based Domain Adaptation for Cell Counting
链接:https://arxiv.org/abs/2512.11763
作者:Mohammad Dehghanmanshadi,Wallapak Tavanapong
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Generating realistic synthetic, training deep learning, Generating realistic, deep learning models, label-scarce environments
备注: Accepted at ICMLA 2025
点击查看摘要
Abstract:Generating realistic synthetic microscopy images is critical for training deep learning models in label-scarce environments, such as cell counting with many cells per image. However, traditional domain adaptation methods often struggle to bridge the domain gap when synthetic images lack the complex textures and visual patterns of real samples. In this work, we adapt the Inversion-Based Style Transfer (InST) framework originally designed for artistic style transfer to biomedical microscopy images. Our method combines latent-space Adaptive Instance Normalization with stochastic inversion in a diffusion model to transfer the style from real fluorescence microscopy images to synthetic ones, while weakly preserving content structure. We evaluate the effectiveness of our InST-based synthetic dataset for downstream cell counting by pre-training and fine-tuning EfficientNet-B0 models on various data sources, including real data, hard-coded synthetic data, and the public Cell200-s dataset. Models trained with our InST-synthesized images achieve up to 37\% lower Mean Absolute Error (MAE) compared to models trained on hard-coded synthetic data, and a 52\% reduction in MAE compared to models trained on Cell200-s (from 53.70 to 25.95 MAE). Notably, our approach also outperforms models trained on real data alone (25.95 vs. 27.74 MAE). Further improvements are achieved when combining InST-synthesized data with lightweight domain adaptation techniques such as DACS with CutMix. These findings demonstrate that InST-based style transfer most effectively reduces the domain gap between synthetic and real microscopy data. Our approach offers a scalable path for enhancing cell counting performance while minimizing manual labeling effort. The source code and resources are publicly available at: this https URL.
Comments:
Accepted at ICMLA 2025
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2512.11763 [cs.CV]
(or
arXiv:2512.11763v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2512.11763
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
10. 【2512.11749】SVG-T2I: Scaling Up Text-to-Image Latent Diffusion Model Without Variational Autoencoder
链接:https://arxiv.org/abs/2512.11749
作者:Minglei Shi,Haolin Wang,Borui Zhang,Wenzhao Zheng,Bohan Zeng,Ziyang Yuan,Xiaoshi Wu,Yuanxing Zhang,Huan Yang,Xintao Wang,Pengfei Wan,Kun Gai,Jie Zhou,Jiwen Lu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Visual Foundation Model, highly promising unified, promising unified pathway, integrating visual understanding, Foundation Model
备注: Code Repository: [this https URL](https://github.com/KlingTeam/SVG-T2I;) Model Weights: [this https URL](https://huggingface.co/KlingTeam/SVG-T2I)
点击查看摘要
Abstract:Visual generation grounded in Visual Foundation Model (VFM) representations offers a highly promising unified pathway for integrating visual understanding, perception, and generation. Despite this potential, training large-scale text-to-image diffusion models entirely within the VFM representation space remains largely unexplored. To bridge this gap, we scale the SVG (Self-supervised representations for Visual Generation) framework, proposing SVG-T2I to support high-quality text-to-image synthesis directly in the VFM feature domain. By leveraging a standard text-to-image diffusion pipeline, SVG-T2I achieves competitive performance, reaching 0.75 on GenEval and 85.78 on DPG-Bench. This performance validates the intrinsic representational power of VFMs for generative tasks. We fully open-source the project, including the autoencoder and generation model, together with their training, inference, evaluation pipelines, and pre-trained weights, to facilitate further research in representation-driven visual generation.
11. 【2512.11722】Weak-to-Strong Generalization Enables Fully Automated De Novo Training of Multi-head Mask-RCNN Model for Segmenting Densely Overlapping Cell Nuclei in Multiplex Whole-slice Brain Images
链接:https://arxiv.org/abs/2512.11722
作者:Lin Bai,Xiaoyang Li,Liqiang Huang,Quynh Nguyen,Hien Van Nguyen,Saurabh Prasad,Dragan Maric,John Redell,Pramod Dash,Badrinath Roysam
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:strong generalization methodology, phenomena underlying weak, multiplex cyclic immunofluorescent, efficient channel attention, overlapping cell nuclei
备注:
点击查看摘要
Abstract:We present a weak to strong generalization methodology for fully automated training of a multi-head extension of the Mask-RCNN method with efficient channel attention for reliable segmentation of overlapping cell nuclei in multiplex cyclic immunofluorescent (IF) whole-slide images (WSI), and present evidence for pseudo-label correction and coverage expansion, the key phenomena underlying weak to strong generalization. This method can learn to segment de novo a new class of images from a new instrument and/or a new imaging protocol without the need for human annotations. We also present metrics for automated self-diagnosis of segmentation quality in production environments, where human visual proofreading of massive WSI images is unaffordable. Our method was benchmarked against five current widely used methods and showed a significant improvement. The code, sample WSI images, and high-resolution segmentation results are provided in open form for community adoption and adaptation.
12. 【2512.11720】Reframing Music-Driven 2D Dance Pose Generation as Multi-Channel Image Generation
链接:https://arxiv.org/abs/2512.11720
作者:Yan Zhang,Han Zou,Lincong Feng,Cong Xie,Ruiqi Yu,Zhenpeng Zhan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:generate temporally coherent, temporally coherent, identity-preserving dance videos, key challenge, generate temporally
备注:
点击查看摘要
Abstract:Recent pose-to-video models can translate 2D pose sequences into photorealistic, identity-preserving dance videos, so the key challenge is to generate temporally coherent, rhythm-aligned 2D poses from music, especially under complex, high-variance in-the-wild distributions. We address this by reframing music-to-dance generation as a music-token-conditioned multi-channel image synthesis problem: 2D pose sequences are encoded as one-hot images, compressed by a pretrained image VAE, and modeled with a DiT-style backbone, allowing us to inherit architectural and training advances from modern text-to-image models and better capture high-variance 2D pose distributions. On top of this formulation, we introduce (i) a time-shared temporal indexing scheme that explicitly synchronizes music tokens and pose latents over time and (ii) a reference-pose conditioning strategy that preserves subject-specific body proportions and on-screen scale while enabling long-horizon segment-and-stitch generation. Experiments on a large in-the-wild 2D dance corpus and the calibrated AIST++2D benchmark show consistent improvements over representative music-to-dance methods in pose- and video-space metrics and human preference, and ablations validate the contributions of the representation, temporal indexing, and reference conditioning. See supplementary videos at this https URL
13. 【2512.11719】Referring Change Detection in Remote Sensing Imagery
链接:https://arxiv.org/abs/2512.11719
作者:Yilmaz Korkmaz,Jay N. Paranjape,Celso M. de Melo,Vishal M. Patel
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Change detection, change detection methods, Referring Change Detection, environmental monitoring, remote sensing imagery
备注: 2026 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
点击查看摘要
Abstract:Change detection in remote sensing imagery is essential for applications such as urban planning, environmental monitoring, and disaster management. Traditional change detection methods typically identify all changes between two temporal images without distinguishing the types of transitions, which can lead to results that may not align with specific user needs. Although semantic change detection methods have attempted to address this by categorizing changes into predefined classes, these methods rely on rigid class definitions and fixed model architectures, making it difficult to mix datasets with different label sets or reuse models across tasks, as the output channels are tightly coupled with the number and type of semantic classes. To overcome these limitations, we introduce Referring Change Detection (RCD), which leverages natural language prompts to detect specific classes of changes in remote sensing images. By integrating language understanding with visual analysis, our approach allows users to specify the exact type of change they are interested in. However, training models for RCD is challenging due to the limited availability of annotated data and severe class imbalance in existing datasets. To address this, we propose a two-stage framework consisting of (I) \textbf{RCDNet}, a cross-modal fusion network designed for referring change detection, and (II) \textbf{RCDGen}, a diffusion-based synthetic data generation pipeline that produces realistic post-change images and change maps for a specified category using only pre-change image, without relying on semantic segmentation masks and thereby significantly lowering the barrier to scalable data creation. Experiments across multiple datasets show that our framework enables scalable and targeted change detection. Project website is here: this https URL.
14. 【2512.11715】EditMGT: Unleashing Potentials of Masked Generative Transformers in Image Editing
链接:https://arxiv.org/abs/2512.11715
作者:Wei Chow,Linfeng Li,Lingdong Kong,Zefeng Li,Qi Xu,Hang Song,Tian Ye,Xian Wang,Jinbin Bai,Shilin Xu,Xiangtai Li,Junting Pan,Shaoteng Liu,Ran Zhou,Tianshu Yang,Songhua Liu
类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Image and Video Processing (eess.IV)
关键词:achieved exceptional visual, Recent advances, exceptional visual quality, Masked Generative Transformers, advances in diffusion
备注:
点击查看摘要
Abstract:Recent advances in diffusion models (DMs) have achieved exceptional visual quality in image editing tasks. However, the global denoising dynamics of DMs inherently conflate local editing targets with the full-image context, leading to unintended modifications in non-target regions. In this paper, we shift our attention beyond DMs and turn to Masked Generative Transformers (MGTs) as an alternative approach to tackle this challenge. By predicting multiple masked tokens rather than holistic refinement, MGTs exhibit a localized decoding paradigm that endows them with the inherent capacity to explicitly preserve non-relevant regions during the editing process. Building upon this insight, we introduce the first MGT-based image editing framework, termed EditMGT. We first demonstrate that MGT's cross-attention maps provide informative localization signals for localizing edit-relevant regions and devise a multi-layer attention consolidation scheme that refines these maps to achieve fine-grained and precise localization. On top of these adaptive localization results, we introduce region-hold sampling, which restricts token flipping within low-attention areas to suppress spurious edits, thereby confining modifications to the intended target regions and preserving the integrity of surrounding non-target areas. To train EditMGT, we construct CrispEdit-2M, a high-resolution dataset spanning seven diverse editing categories. Without introducing additional parameters, we adapt a pre-trained text-to-image MGT into an image editing model through attention injection. Extensive experiments across four standard benchmarks demonstrate that, with fewer than 1B parameters, our model achieves similarity performance while enabling 6 times faster editing. Moreover, it delivers comparable or superior editing quality, with improvements of 3.6% and 17.6% on style change and style transfer tasks, respectively.
15. 【2512.11691】xt images processing system using artificial intelligence models
链接:https://arxiv.org/abs/2512.11691
作者:Aya Kaysan Bahjat
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:including Invoice, identifies textual content, image classifier device, predefined categories, Differentiable Binarization Network
备注: 8 pages, 12 figures, article
点击查看摘要
Abstract:This is to present a text image classifier device that identifies textual content in images and then categorizes each image into one of four predefined categories, including Invoice, Form, Letter, or Report. The device supports a gallery mode, in which users browse files on flash disks, hard disk drives, or microSD cards, and a live mode which renders feeds of cameras connected to it. Its design is specifically aimed at addressing pragmatic challenges, such as changing light, random orientation, curvature or partial coverage of text, low resolution, and slightly visible text. The steps of the processing process are divided into four steps: image acquisition and preprocessing, textual elements detection with the help of DBNet++ (Differentiable Binarization Network Plus) model, BART (Bidirectional Auto-Regressive Transformers) model that classifies detected textual elements, and the presentation of the results through a user interface written in Python and PyQt5. All the stages are connected in such a way that they form a smooth workflow. The system achieved a text recognition rate of about 94.62% when tested over ten hours on the mentioned Total-Text dataset, that includes high resolution images, created so as to represent a wide range of problematic conditions. These experimental results support the effectiveness of the suggested methodology to practice, mixed-source text categorization, even in uncontrolled imaging conditions.
16. 【2512.11683】Depth-Copy-Paste: Multimodal and Depth-Aware Compositing for Robust Face Detection
链接:https://arxiv.org/abs/2512.11683
作者:Qiushi Guo
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Depth Copy Paste, copy paste, Traditional copy paste, face detection systems, copy paste augmentation
备注:
点击查看摘要
Abstract:Data augmentation is crucial for improving the robustness of face detection systems, especially under challenging conditions such as occlusion, illumination variation, and complex environments. Traditional copy paste augmentation often produces unrealistic composites due to inaccurate foreground extraction, inconsistent scene geometry, and mismatched background semantics. To address these limitations, we propose Depth Copy Paste, a multimodal and depth aware augmentation framework that generates diverse and physically consistent face detection training samples by copying full body person instances and pasting them into semantically compatible scenes. Our approach first employs BLIP and CLIP to jointly assess semantic and visual coherence, enabling automatic retrieval of the most suitable background images for the given foreground person. To ensure high quality foreground masks that preserve facial details, we integrate SAM3 for precise segmentation and Depth-Anything to extract only the non occluded visible person regions, preventing corrupted facial textures from being used in augmentation. For geometric realism, we introduce a depth guided sliding window placement mechanism that searches over the background depth map to identify paste locations with optimal depth continuity and scale alignment. The resulting composites exhibit natural depth relationships and improved visual plausibility. Extensive experiments show that Depth Copy Paste provides more diverse and realistic training data, leading to significant performance improvements in downstream face detection tasks compared with traditional copy paste and depth free augmentation methods.
17. 【2512.11680】Cross-modal Context-aware Learning for Visual Prompt Guided Multimodal Image Understanding in Remote Sensing
链接:https://arxiv.org/abs/2512.11680
作者:Xu Zhang,Jiabin Fang,Zhuoming Ding,Jin Yuan,Xuan Liu,Qianjun Zhang,Zhiyong Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:leverage large language, Recent advances, large language models, remote sensing, Multimodal Image Understanding
备注: 12 pages, 5 figures
点击查看摘要
Abstract:Recent advances in image understanding have enabled methods that leverage large language models for multimodal reasoning in remote sensing. However, existing approaches still struggle to steer models to the user-relevant regions when only simple, generic text prompts are available. Moreover, in large-scale aerial imagery many objects exhibit highly similar visual appearances and carry rich inter-object relationships, which further complicates accurate recognition. To address these challenges, we propose Cross-modal Context-aware Learning for Visual Prompt-Guided Multimodal Image Understanding (CLV-Net). CLV-Net lets users supply a simple visual cue, a bounding box, to indicate a region of interest, and uses that cue to guide the model to generate correlated segmentation masks and captions that faithfully reflect user intent. Central to our design is a Context-Aware Mask Decoder that models and integrates inter-object relationships to strengthen target representations and improve mask quality. In addition, we introduce a Semantic and Relationship Alignment module: a Cross-modal Semantic Consistency Loss enhances fine-grained discrimination among visually similar targets, while a Relationship Consistency Loss enforces alignment between textual relations and visual interactions. Comprehensive experiments on two benchmark datasets show that CLV-Net outperforms existing methods and establishes new state-of-the-art results. The model effectively captures user intent and produces precise, intention-aligned multimodal outputs.
18. 【2512.11654】Kinetic Mining in Context: Few-Shot Action Synthesis via Text-to-Motion Distillation
链接:https://arxiv.org/abs/2512.11654
作者:Luca Cazzola,Ahed Alboody
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Human Activity Recognition, skeletal-based Human Activity, Activity Recognition, Human Activity, skeletal-based Human
备注:
点击查看摘要
Abstract:The acquisition cost for large, annotated motion datasets remains a critical bottleneck for skeletal-based Human Activity Recognition (HAR). Although Text-to-Motion (T2M) generative models offer a compelling, scalable source of synthetic data, their training objectives, which emphasize general artistic motion, and dataset structures fundamentally differ from HAR's requirements for kinematically precise, class-discriminative actions. This disparity creates a significant domain gap, making generalist T2M models ill-equipped for generating motions suitable for HAR classifiers. To address this challenge, we propose KineMIC (Kinetic Mining In Context), a transfer learning framework for few-shot action synthesis. KineMIC adapts a T2M diffusion model to an HAR domain by hypothesizing that semantic correspondences in the text encoding space can provide soft supervision for kinematic distillation. We operationalize this via a kinetic mining strategy that leverages CLIP text embeddings to establish correspondences between sparse HAR labels and T2M source data. This process guides fine-tuning, transforming the generalist T2M backbone into a specialized few-shot Action-to-Motion generator. We validate KineMIC using HumanML3D as the source T2M dataset and a subset of NTU RGB+D 120 as the target HAR domain, randomly selecting just 10 samples per action class. Our approach generates significantly more coherent motions, providing a robust data augmentation source that delivers a +23.1% accuracy points improvement. Animated illustrations and supplementary materials are available at (this https URL).
19. 【2512.11645】FactorPortrait: Controllable Portrait Animation via Disentangled Expression, Pose, and Viewpoint
链接:https://arxiv.org/abs/2512.11645
作者:Jiapeng Tang,Kai Li,Chengxiang Yin,Liuhao Ge,Fei Jiang,Jiu Xu,Matthias Nießner,Christian Häne,Timur Bagautdinov,Egor Zakharov,Peihong Guo
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:enables lifelike synthesis, controllable portrait animation, introduce FactorPortrait, facial expression, driving video
备注: Project page: [this https URL](https://tangjiapeng.github.io/FactorPortrait/)
点击查看摘要
Abstract:We introduce FactorPortrait, a video diffusion method for controllable portrait animation that enables lifelike synthesis from disentangled control signals of facial expressions, head movement, and camera viewpoints. Given a single portrait image, a driving video, and camera trajectories, our method animates the portrait by transferring facial expressions and head movements from the driving video while simultaneously enabling novel view synthesis from arbitrary viewpoints. We utilize a pre-trained image encoder to extract facial expression latents from the driving video as control signals for animation generation. Such latents implicitly capture nuanced facial expression dynamics with identity and pose information disentangled, and they are efficiently injected into the video diffusion transformer through our proposed expression controller. For camera and head pose control, we employ Plücker ray maps and normal maps rendered from 3D body mesh tracking. To train our model, we curate a large-scale synthetic dataset containing diverse combinations of camera viewpoints, head poses, and facial expression dynamics. Extensive experiments demonstrate that our method outperforms existing approaches in realism, expressiveness, control accuracy, and view consistency.
20. 【2512.11624】Fast and Explicit: Slice-to-Volume Reconstruction via 3D Gaussian Primitives with Analytic Point Spread Function Modeling
链接:https://arxiv.org/abs/2512.11624
作者:Maik Dannecker,Steven Jia,Nil Stolt-Ansó,Nadine Girard,Guillaume Auzias,François Rousseau,Daniel Rueckert
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:broad applications ranging, Recovering high-fidelity, MRI super-resolution, sparse or degraded, medical imaging
备注: Under Review for MIDL 2026
点击查看摘要
Abstract:Recovering high-fidelity 3D images from sparse or degraded 2D images is a fundamental challenge in medical imaging, with broad applications ranging from 3D ultrasound reconstruction to MRI super-resolution. In the context of fetal MRI, high-resolution 3D reconstruction of the brain from motion-corrupted low-resolution 2D acquisitions is a prerequisite for accurate neurodevelopmental diagnosis. While implicit neural representations (INRs) have recently established state-of-the-art performance in self-supervised slice-to-volume reconstruction (SVR), they suffer from a critical computational bottleneck: accurately modeling the image acquisition physics requires expensive stochastic Monte Carlo sampling to approximate the point spread function (PSF). In this work, we propose a shift from neural network based implicit representations to Gaussian based explicit representations. By parameterizing the HR 3D image volume as a field of anisotropic Gaussian primitives, we leverage the property of Gaussians being closed under convolution and thus derive a \textit{closed-form analytical solution} for the forward model. This formulation reduces the previously intractable acquisition integral to an exact covariance addition ($\mathbf{\Sigma}_{obs} = \mathbf{\Sigma}_{HR} + \mathbf{\Sigma}_{PSF}$), effectively bypassing the need for compute-intensive stochastic sampling while ensuring exact gradient propagation. We demonstrate that our approach matches the reconstruction quality of self-supervised state-of-the-art SVR frameworks while delivering a 5$\times$--10$\times$ speed-up on neonatal and fetal data. With convergence often reached in under 30 seconds, our framework paves the way towards translation into clinical routine of real-time fetal 3D MRI. Code will be public at {this https URL}.
21. 【2512.11612】Embodied Image Compression
链接:https://arxiv.org/abs/2512.11612
作者:Chunyi Li,Rui Qing,Jianbo Zhang,Yuan Tian,Xiangyang Zhu,Zicheng Zhang,Xiaohong Liu,Weisi Lin,Guangtao Zhai
类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
关键词:pivotal research direction, visual data compression, Embodied Image Compression, Image Compression, pivotal research
备注: 15 pages, 12 figures, 3 tables
点击查看摘要
Abstract:Image Compression for Machines (ICM) has emerged as a pivotal research direction in the field of visual data compression. However, with the rapid evolution of machine intelligence, the target of compression has shifted from task-specific virtual models to Embodied agents operating in real-world environments. To address the communication constraints of Embodied AI in multi-agent systems and ensure real-time task execution, this paper introduces, for the first time, the scientific problem of Embodied Image Compression. We establish a standardized benchmark, EmbodiedComp, to facilitate systematic evaluation under ultra-low bitrate conditions in a closed-loop setting. Through extensive empirical studies in both simulated and real-world settings, we demonstrate that existing Vision-Language-Action models (VLAs) fail to reliably perform even simple manipulation tasks when compressed below the Embodied bitrate threshold. We anticipate that EmbodiedComp will catalyze the development of domain-specific compression tailored for Embodied agents , thereby accelerating the Embodied AI deployment in the Real-world.
22. 【2512.11611】Using GUI Agent for Electronic Design Automation
链接:https://arxiv.org/abs/2512.11611
作者:Chunyi Li,Longfei Li,Zicheng Zhang,Xiaohong Liu,Min Tang,Weisi Lin,Guangtao Zhai
类目:Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR)
关键词:Graphical User Interface, Graphical User, User Interface, automating repetitive tasks, GUI agents
备注: 17 pages, 15 figures, 8 tables
点击查看摘要
Abstract:Graphical User Interface (GUI) agents adopt an end-to-end paradigm that maps a screenshot to an action sequence, thereby automating repetitive tasks in virtual environments. However, existing GUI agents are evaluated almost exclusively on commodity software such as Microsoft Word and Excel. Professional Computer-Aided Design (CAD) suites promise an order-of-magnitude higher economic return, yet remain the weakest performance domain for existing agents and are still far from replacing expert Electronic-Design-Automation (EDA) engineers. We therefore present the first systematic study that deploys GUI agents for EDA workflows. Our contributions are: (1) a large-scale dataset named GUI-EDA, including 5 CAD tools and 5 physical domains, comprising 2,000+ high-quality screenshot-answer-action pairs recorded by EDA scientists and engineers during real-world component design; (2) a comprehensive benchmark that evaluates 30+ mainstream GUI agents, demonstrating that EDA tasks constitute a major, unsolved challenge; and (3) an EDA-specialized metric named EDAgent, equipped with a reflection mechanism that achieves reliable performance on industrial CAD software and, for the first time, outperforms Ph.D. students majored in Electrical Engineering. This work extends GUI agents from generic office automation to specialized, high-value engineering domains and offers a new avenue for advancing EDA productivity. The dataset will be released at: this https URL.
23. 【2512.11582】Brain-Semantoks: Learning Semantic Tokens of Brain Dynamics with a Self-Distilled Foundation Model
链接:https://arxiv.org/abs/2512.11582
作者:Sam Gijsen,Marc-Andre Schulz,Kerstin Ritter
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
关键词:magnetic resonance imaging, holds significant promise, predicting phenotypes related, series holds significant, functional magnetic resonance
备注: Code and pretrained models available at [this https URL](https://github.com/SamGijsen/Brain-Semantoks)
点击查看摘要
Abstract:The development of foundation models for functional magnetic resonance imaging (fMRI) time series holds significant promise for predicting phenotypes related to disease and cognition. Current models, however, are often trained using a mask-and-reconstruct objective on small brain regions. This focus on low-level information leads to representations that are sensitive to noise and temporal fluctuations, necessitating extensive fine-tuning for downstream tasks. We introduce Brain-Semantoks, a self-supervised framework designed specifically to learn abstract representations of brain dynamics. Its architecture is built on two core innovations: a semantic tokenizer that aggregates noisy regional signals into robust tokens representing functional networks, and a self-distillation objective that enforces representational stability across time. We show that this objective is stabilized through a novel training curriculum, ensuring the model robustly learns meaningful features from low signal-to-noise time series. We demonstrate that learned representations enable strong performance on a variety of downstream tasks even when only using a linear probe. Furthermore, we provide comprehensive scaling analyses indicating more unlabeled data reliably results in out-of-distribution performance gains without domain adaptation.
24. 【2512.11575】In-Context Learning for Seismic Data Processing
链接:https://arxiv.org/abs/2512.11575
作者:Fabian Fuchs,Mario Ruben Fernandez,Norman Ettrich,Janis Keuper
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:subsurface images essential, processing transforms raw, geophysical applications, transforms raw data, transforms raw
备注: Source code available under [this https URL](https://codeberg.org/fuchsfa/in-context-learning-seismic)
点击查看摘要
Abstract:Seismic processing transforms raw data into subsurface images essential for geophysical applications. Traditional methods face challenges, such as noisy data, and manual parameter tuning, among others. Recently deep learning approaches have proposed alternative solutions to some of these problems. However, important challenges of existing deep learning approaches are spatially inconsistent results across neighboring seismic gathers and lack of user-control. We address these limitations by introducing ContextSeisNet, an in-context learning model, to seismic demultiple processing. Our approach conditions predictions on a support set of spatially related example pairs: neighboring common-depth point gathers from the same seismic line and their corresponding labels. This allows the model to learn task-specific processing behavior at inference time by observing how similar gathers should be processed, without any retraining. This method provides both flexibility through user-defined examples and improved lateral consistency across seismic lines. On synthetic data, ContextSeisNet outperforms a U-Net baseline quantitatively and demonstrates enhanced spatial coherence between neighboring gathers. On field data, our model achieves superior lateral consistency compared to both traditional Radon demultiple and the U-Net baseline. Relative to the U-Net, ContextSeisNet also delivers improved near-offset performance and more complete multiple removal. Notably, ContextSeisNet achieves comparable field data performance despite being trained on 90% less data, demonstrating substantial data efficiency. These results establish ContextSeisNet as a practical approach for spatially consistent seismic demultiple with potential applicability to other seismic processing tasks.
25. 【2512.11574】Evaluating Foundation Models' 3D Understanding Through Multi-View Correspondence Analysis
链接:https://arxiv.org/abs/2512.11574
作者:Valentina Lilova,Toyesh Chakravorty,Julian I. Bibo,Emma Boccaletti,Brandon Li,Lívia Baxová,Cees G. M. Snoek,Mohammadreza Salehi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:autonomous driving, essential for real-world, real-world applications, robotics and autonomous, spatial understanding
备注: NeurIPS 2025 UniReps workshop
点击查看摘要
Abstract:Benchmarking 3D spatial understanding of foundation models is essential for real-world applications such as robotics and autonomous driving. Existing evaluations often rely on downstream finetuning with linear heads or task-specific decoders, making it difficult to isolate the intrinsic 3D reasoning ability of pretrained encoders. In this work, we introduce a novel benchmark for in-context 3D scene understanding that requires no finetuning and directly probes the quality of dense visual features. Building on the Hummingbird framework, which evaluates in-context 2D scene understanding, we extend the setup to the 3D Multi-View ImageNet (MVImgNet) dataset. Given a set of images from objects in specific angles (keys), we benchmark the performance of segmenting novel views (queries) and report the scores in 4 categories of easy, medium, hard, and extreme based on the key-query view contrast. We benchmark 8 state-of-the-art foundation models and show DINO-based encoders remain competitive across large viewpoint shifts, while 3D-aware models like VGGT require dedicated multi-view adjustments. Our code is publicly available at this https URL .
26. 【2512.11560】Multi-temporal Calving Front Segmentation
链接:https://arxiv.org/abs/2512.11560
作者:Marcel Dreier,Nora Gourmelon,Dakota Pyles,Fei Wu,Matthias Braun,Thorsten Seehaus,Andreas Maier,Vincent Christlein
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:marine-terminating glaciers undergo, glaciers undergo constant, Synthetic Aperture Radar, undergo constant, Aperture Radar imagery
备注:
点击查看摘要
Abstract:The calving fronts of marine-terminating glaciers undergo constant changes. These changes significantly affect the glacier's mass and dynamics, demanding continuous monitoring. To address this need, deep learning models were developed that can automatically delineate the calving front in Synthetic Aperture Radar imagery. However, these models often struggle to correctly classify areas affected by seasonal conditions such as ice melange or snow-covered surfaces. To address this issue, we propose to process multiple frames from a satellite image time series of the same glacier in parallel and exchange temporal information between the corresponding feature maps to stabilize each prediction. We integrate our approach into the current state-of-the-art architecture Tyrion and accomplish a new state-of-the-art performance on the CaFFe benchmark dataset. In particular, we achieve a Mean Distance Error of 184.4 m and a mean Intersection over Union of 83.6.
27. 【2512.11558】DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry
链接:https://arxiv.org/abs/2512.11558
作者:Zhenyang Cai,Jiaming Zhang,Junjie Zhao,Ziyi Zeng,Yanchao Li,Jingyi Liang,Junying Chen,Yunjin Yang,Jiajun You,Shuzhi Deng,Tongfei Wang,Wanting Chen,Chunxiu Hao,Ruiqi Xie,Zhenwei Wen,Xiangyi Feng,Zou Ting,Jin Zou Lin,Jianquan Li,Guangjun Yu,Liangyi Chen,Junwen Wang,Shan Jiang,Benyou Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:automated oral healthcare, large language models, Reliable interpretation, lack sufficient reasoning, sufficient reasoning ability
备注:
点击查看摘要
Abstract:Reliable interpretation of multimodal data in dentistry is essential for automated oral healthcare, yet current multimodal large language models (MLLMs) struggle to capture fine-grained dental visual details and lack sufficient reasoning ability for precise diagnosis. To address these limitations, we present DentalGPT, a specialized dental MLLM developed through high-quality domain knowledge injection and reinforcement learning. Specifically, the largest annotated multimodal dataset for dentistry to date was constructed by aggregating over 120k dental images paired with detailed descriptions that highlight diagnostically relevant visual features, making it the multimodal dataset with the most extensive collection of dental images to date. Training on this dataset significantly enhances the MLLM's visual understanding of dental conditions, while the subsequent reinforcement learning stage further strengthens its capability for multimodal complex reasoning. Comprehensive evaluations on intraoral and panoramic benchmarks, along with dental subsets of medical VQA benchmarks, show that DentalGPT achieves superior performance in disease classification and dental VQA tasks, outperforming many state-of-the-art MLLMs despite having only 7B parameters. These results demonstrate that high-quality dental data combined with staged adaptation provides an effective pathway for building capable and domain-specialized dental MLLMs.
28. 【2512.11557】3DTeethSAM: Taming SAM2 for 3D Teeth Segmentation
链接:https://arxiv.org/abs/2512.11557
作者:Zhiguo Lu,Jianwen Lou,Mingjun Ma,Hairong Jin,Youyi Zheng,Kun Zhou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:digital dentistry due, involving the localization, real-world dentition, localization of tooth, tooth instances
备注: Accepted by AAAI 2026
点击查看摘要
Abstract:3D teeth segmentation, involving the localization of tooth instances and their semantic categorization in 3D dental models, is a critical yet challenging task in digital dentistry due to the complexity of real-world dentition. In this paper, we propose 3DTeethSAM, an adaptation of the Segment Anything Model 2 (SAM2) for 3D teeth segmentation. SAM2 is a pretrained foundation model for image and video segmentation, demonstrating a strong backbone in various downstream scenarios. To adapt SAM2 for 3D teeth data, we render images of 3D teeth models from predefined views, apply SAM2 for 2D segmentation, and reconstruct 3D results using 2D-3D projections. Since SAM2's performance depends on input prompts and its initial outputs often have deficiencies, and given its class-agnostic nature, we introduce three light-weight learnable modules: (1) a prompt embedding generator to derive prompt embeddings from image embeddings for accurate mask decoding, (2) a mask refiner to enhance SAM2's initial segmentation results, and (3) a mask classifier to categorize the generated masks. Additionally, we incorporate Deformable Global Attention Plugins (DGAP) into SAM2's image encoder. The DGAP enhances both the segmentation accuracy and the speed of the training process. Our method has been validated on the 3DTeethSeg benchmark, achieving an IoU of 91.90% on high-resolution 3D teeth meshes, establishing a new state-of-the-art in the field.
29. 【2512.11548】SSL-MedSAM2: A Semi-supervised Medical Image Segmentation Framework Powered by Few-shot Learning of SAM2
链接:https://arxiv.org/abs/2512.11548
作者:Zhendi Gong,Xin Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:annotated training datasets, scale annotated training, large scale annotated, medical image segmentation, medical image annotation
备注: Accepted by MICCAI 2025 CARE Challenge, waiting for publication
点击查看摘要
Abstract:Despite the success of deep learning based models in medical image segmentation, most state-of-the-art (SOTA) methods perform fully-supervised learning, which commonly rely on large scale annotated training datasets. However, medical image annotation is highly time-consuming, hindering its clinical applications. Semi-supervised learning (SSL) has been emerged as an appealing strategy in training with limited annotations, largely reducing the labelling cost. We propose a novel SSL framework SSL-MedSAM2, which contains a training-free few-shot learning branch TFFS-MedSAM2 based on the pretrained large foundation model Segment Anything Model 2 (SAM2) for pseudo label generation, and an iterative fully-supervised learning branch FSL-nnUNet based on nnUNet for pseudo label refinement. The results on MICCAI2025 challenge CARE-LiSeg (Liver Segmentation) demonstrate an outstanding performance of SSL-MedSAM2 among other methods. The average dice scores on the test set in GED4 and T1 MRI are 0.9710 and 0.9648 respectively, and the Hausdorff distances are 20.07 and 21.97 respectively. The code is available via this https URL.
30. 【2512.11542】Infinity and Beyond: Compositional Alignment in VAR and Diffusion T2I Models
链接:https://arxiv.org/abs/2512.11542
作者:Hossein Shahabadi,Niki Sepasian,Arash Marioriyad,Ali Sharifi-Zarchi,Mahdieh Soleymani Baghshah
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Achieving compositional alignment, emerging Visual Autoregressive, Achieving compositional, covering objects, generated images
备注:
点击查看摘要
Abstract:Achieving compositional alignment between textual descriptions and generated images - covering objects, attributes, and spatial relationships - remains a core challenge for modern text-to-image (T2I) models. Although diffusion-based architectures have been widely studied, the compositional behavior of emerging Visual Autoregressive (VAR) models is still largely unexamined. We benchmark six diverse T2I systems - SDXL, PixArt-$\alpha$, Flux-Dev, Flux-Schnell, Infinity-2B, and Infinity-8B - across the full T2I-CompBench++ and GenEval suites, evaluating alignment in color and attribute binding, spatial relations, numeracy, and complex multi-object prompts. Across both benchmarks, Infinity-8B achieves the strongest overall compositional alignment, while Infinity-2B also matches or exceeds larger diffusion models in several categories, highlighting favorable efficiency-performance trade-offs. In contrast, SDXL and PixArt-$\alpha$ show persistent weaknesses in attribute-sensitive and spatial tasks. These results provide the first systematic comparison of VAR and diffusion approaches to compositional alignment and establish unified baselines for the future development of the T2I model.
31. 【2512.11534】HFS: Holistic Query-Aware Frame Selection for Efficient Video Reasoning
链接:https://arxiv.org/abs/2512.11534
作者:Yiqing Yang,Kin-Man Lam
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
关键词:presents significant challenges, video understanding presents, understanding presents significant, Key frame selection, significant challenges
备注: 18 pages, 8 figures
点击查看摘要
Abstract:Key frame selection in video understanding presents significant challenges. Traditional top-K selection methods, which score frames independently, often fail to optimize the selection as a whole. This independent scoring frequently results in selecting frames that are temporally clustered and visually redundant. Additionally, training lightweight selectors using pseudo labels generated offline by Multimodal Large Language Models (MLLMs) prevents the supervisory signal from dynamically adapting to task objectives. To address these limitations, we propose an end-to-end trainable, task-adaptive framework for frame selection. A Chain-of-Thought approach guides a Small Language Model (SLM) to generate task-specific implicit query vectors, which are combined with multimodal features to enable dynamic frame scoring. We further define a continuous set-level objective function that incorporates relevance, coverage, and redundancy, enabling differentiable optimization via Gumbel-Softmax to select optimal frame combinations at the set level. Finally, student-teacher mutual learning is employed, where the student selector (SLM) and teacher reasoner (MLLM) are trained to align their frame importance distributions via KL divergence. Combined with cross-entropy loss, this enables end-to-end optimization, eliminating reliance on static pseudo labels. Experiments across various benchmarks, including Video-MME, LongVideoBench, MLVU, and NExT-QA, demonstrate that our method significantly outperforms existing approaches.
32. 【2512.11532】Parallax: Runtime Parallelization for Operator Fallbacks in Heterogeneous Edge Systems
链接:https://arxiv.org/abs/2512.11532
作者:Chong Tang,Hao Dai,Jagmohan Chauhan
类目:Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:necessitates faster inference, increasingly complex models, edge devices necessitates, devices necessitates faster, applications on edge
备注:
点击查看摘要
Abstract:The growing demand for real-time DNN applications on edge devices necessitates faster inference of increasingly complex models. Although many devices include specialized accelerators (e.g., mobile GPUs), dynamic control-flow operators and unsupported kernels often fall back to CPU execution. Existing frameworks handle these fallbacks poorly, leaving CPU cores idle and causing high latency and memory spikes. We introduce Parallax, a framework that accelerates mobile DNN inference without model refactoring or custom operator implementations. Parallax first partitions the computation DAG to expose parallelism, then employs branch-aware memory management with dedicated arenas and buffer reuse to reduce runtime footprint. An adaptive scheduler executes branches according to device memory constraints, meanwhile, fine-grained subgraph control enables heterogeneous inference of dynamic models. By evaluating on five representative DNNs across three different mobile devices, Parallax achieves up to 46% latency reduction, maintains controlled memory overhead (26.5% on average), and delivers up to 30% energy savings compared with state-of-the-art frameworks, offering improvements aligned with the responsiveness demands of real-time mobile inference.
33. 【2512.11524】Super-Resolved Canopy Height Mapping from Sentinel-2 Time Series Using LiDAR HD Reference Data across Metropolitan France
链接:https://arxiv.org/abs/2512.11524
作者:Ekaterina Kalinicheva,Florian Helen,Stéphane Mermoz,Florian Mouret,Milena Planells
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:understanding canopy structure, Fine-scale forest monitoring, carbon stocks, canopy structure, Fine-scale forest
备注:
点击查看摘要
Abstract:Fine-scale forest monitoring is essential for understanding canopy structure and its dynamics, which are key indicators of carbon stocks, biodiversity, and forest health. Deep learning is particularly effective for this task, as it integrates spectral, temporal, and spatial signals that jointly reflect the canopy structure. To address this need, we introduce THREASURE-Net, a novel end-to-end framework for Tree Height Regression And Super-Resolution. The model is trained on Sentinel-2 time series using reference height metrics derived from LiDAR HD data at multiple spatial resolutions over Metropolitan France to produce annual height maps. We evaluate three model variants, producing tree-height predictions at 2.5 m, 5 m, and 10 m resolution. THREASURE-Net does not rely on any pretrained model nor on reference very high resolution optical imagery to train its super-resolution module; instead, it learns solely from LiDAR-derived height information. Our approach outperforms existing state-of-the-art methods based on Sentinel data and is competitive with methods based on very high resolution imagery. It can be deployed to generate high-precision annual canopy-height maps, achieving mean absolute errors of 2.62 m, 2.72 m, and 2.88 m at 2.5 m, 5 m, and 10 m resolution, respectively. These results highlight the potential of THREASURE-Net for scalable and cost-effective structural monitoring of temperate forests using only freely available satellite data. The source code for THREASURE-Net is available at: this https URL.
34. 【2512.11510】Reconstruction as a Bridge for Event-Based Visual Question Answering
链接:https://arxiv.org/abs/2512.11510
作者:Hanyue Lou,Jiayi Zhou,Yang Zhang,Boyu Li,Yi Wang,Guangnan Ye,Boxin Shi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal Large Language, Large Language Models, Integrating event cameras, promises general scene, challenging visual conditions
备注:
点击查看摘要
Abstract:Integrating event cameras with Multimodal Large Language Models (MLLMs) promises general scene understanding in challenging visual conditions, yet requires navigating a trade-off between preserving the unique advantages of event data and ensuring compatibility with frame-based models. We address this challenge by using reconstruction as a bridge, proposing a straightforward Frame-based Reconstruction and Tokenization (FRT) method and designing an efficient Adaptive Reconstruction and Tokenization (ART) method that leverages event sparsity. For robust evaluation, we introduce EvQA, the first objective, real-world benchmark for event-based MLLMs, comprising 1,000 event-QA pairs from 22 public datasets. Our experiments demonstrate that our methods achieve state-of-the-art performance on EvQA, highlighting the significant potential of MLLMs in event-based vision.
35. 【2512.11508】On Geometric Understanding and Learned Data Priors in VGGT
链接:https://arxiv.org/abs/2512.11508
作者:Jelena Bratulić,Sudhanshu Mittal,Thomas Brox,Christian Rupprecht
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Geometry Grounded Transformer, Visual Geometry Grounded, Grounded Transformer, single feed-forward pass, Visual Geometry
备注:
点击查看摘要
Abstract:The Visual Geometry Grounded Transformer (VGGT) is a 3D foundation model that infers camera geometry and scene structure in a single feed-forward pass. Trained in a supervised, single-step fashion on large datasets, VGGT raises a key question: does it build upon geometric concepts like traditional multi-view methods, or does it rely primarily on learned appearance-based data-driven priors? In this work, we conduct a systematic analysis of VGGT's internal mechanisms to uncover whether geometric understanding emerges within its representations. By probing intermediate features, analyzing attention patterns, and performing interventions, we examine how the model implements its functionality. Our findings reveal that VGGT implicitly performs correspondence matching within its global attention layers and encodes epipolar geometry, despite being trained without explicit geometric constraints. We further investigate VGGT's dependence on its learned data priors. Using spatial input masking and perturbation experiments, we assess its robustness to occlusions, appearance variations, and camera configurations, comparing it with classical multi-stage pipelines. Together, these insights highlight how VGGT internalizes geometric structure while using learned data-driven priors.
36. 【2512.11507】SSA3D: Text-Conditioned Assisted Self-Supervised Framework for Automatic Dental Abutment Design
链接:https://arxiv.org/abs/2512.11507
作者:Mianjie Zheng,Xinquan Yang,Along He,Xuguang Li,Feilie Zhong,Xuefen Liu,Kun Tang,Zhicheng Zhang,Linlin Shen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:dental implant restoration, critical step, step in dental, Abutment design, regression branch
备注:
点击查看摘要
Abstract:Abutment design is a critical step in dental implant restoration. However, manual design involves tedious measurement and fitting, and research on automating this process with AI is limited, due to the unavailability of large annotated datasets. Although self-supervised learning (SSL) can alleviate data scarcity, its need for pre-training and fine-tuning results in high computational costs and long training times. In this paper, we propose a Self-supervised assisted automatic abutment design framework (SS$A^3$D), which employs a dual-branch architecture with a reconstruction branch and a regression branch. The reconstruction branch learns to restore masked intraoral scan data and transfers the learned structural information to the regression branch. The regression branch then predicts the abutment parameters under supervised learning, which eliminates the separate pre-training and fine-tuning process. We also design a Text-Conditioned Prompt (TCP) module to incorporate clinical information (such as implant location, system, and series) into SS$A^3$D. This guides the network to focus on relevant regions and constrains the parameter predictions. Extensive experiments on a collected dataset show that SS$A^3$D saves half of the training time and achieves higher accuracy than traditional SSL methods. It also achieves state-of-the-art performance compared to other methods, significantly improving the accuracy and efficiency of automated abutment design.
37. 【2512.11503】Skel-Mamba: Temporal Dynamic Modeling via State Space Model for Human Skeleton-based Action Recognition
链接:https://arxiv.org/abs/2512.11503
作者:Yanan Liu,Jun Liu,Hao Zhang,Dan Xu,Hossein Rahmani,Mohammed Bennamoun,Qiuhong Ke
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:computer vision community, garnered significant attention, Skeleton-based action recognition, Skeleton-based action, vision community
备注:
点击查看摘要
Abstract:Skeleton-based action recognition has garnered significant attention in the computer vision community. Inspired by the recent success of the selective state-space model (SSM) Mamba in modeling 1D temporal sequences, we propose TSkel-Mamba, a hybrid Transformer-Mamba framework that effectively captures both spatial and temporal dynamics. In particular, our approach leverages Spatial Transformer for spatial feature learning while utilizing Mamba for temporal modeling. Mamba, however, employs separate SSM blocks for individual channels, which inherently limits its ability to model inter-channel dependencies. To better adapt Mamba for skeleton data and enhance Mamba`s ability to model temporal dependencies, we introduce a Temporal Dynamic Modeling (TDM) block, which is a versatile plug-and-play component that integrates a novel Multi-scale Temporal Interaction (MTI) module. The MTI module employs multi-scale Cycle operators to capture cross-channel temporal interactions, a critical factor in action recognition. Extensive experiments on NTU-RGB+D 60, NTU-RGB+D 120, NW-UCLA and UAV-Human datasets demonstrate that TSkel-Mamba achieves state-of-the-art performance while maintaining low inference time, making it both efficient and highly effective.
38. 【2512.11490】VLM2GeoVec: Toward Universal Multimodal Embeddings for Remote Sensing
链接:https://arxiv.org/abs/2512.11490
作者:Emanuel Sánchez Aimar,Gulnaz Zhambulova,Fahad Shahbaz Khan,Yonghao Xu,Michael Felsberg
类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
关键词:Satellite imagery differs, diverse scale variations, imagery differs fundamentally, small objects demand, Satellite imagery
备注: 21 pages, 7 figures, under review
点击查看摘要
Abstract:Satellite imagery differs fundamentally from natural images: its aerial viewpoint, very high resolution, diverse scale variations, and abundance of small objects demand both region-level spatial reasoning and holistic scene understanding. Current remote-sensing approaches remain fragmented between dual-encoder retrieval models, which excel at large-scale cross-modal search but cannot interleave modalities, and generative assistants, which support region-level interpretation but lack scalable retrieval capabilities. We propose $\textbf{VLM2GeoVec}$, an instruction-following, single-encoder vision-language model trained contrastively to embed interleaved inputs (images, text, bounding boxes, and geographic coordinates) in a unified vector space. Our single encoder interleaves all inputs into one joint embedding trained with a contrastive loss, eliminating multi-stage pipelines and task-specific modules. To evaluate its versatility, we introduce $\textbf{RSMEB}$, a novel benchmark covering key remote-sensing embedding applications: scene classification; cross-modal search; compositional retrieval; visual-question answering; visual grounding and region-level reasoning; and semantic geospatial retrieval. On RSMEB, it achieves $\textbf{26.6%}$ P@1 on region-caption retrieval (+25 pp vs. dual-encoder baselines), $\textbf{32.5%}$ P@1 on referring-expression retrieval (+19 pp), and $\textbf{17.8%}$ P@1 on semantic geo-localization retrieval (over $3\times$ prior best), while matching or exceeding specialized baselines on conventional tasks such as scene classification and cross-modal retrieval. VLM2GeoVec unifies scalable retrieval with region-level spatial reasoning, enabling cohesive multimodal analysis in remote sensing. We will publicly release the code, checkpoints, and data upon acceptance.
39. 【2512.11480】CADMorph: Geometry-Driven Parametric CAD Editing via a Plan-Generate-Verify Loop
链接:https://arxiv.org/abs/2512.11480
作者:Weijian Ma,Shizhao Sun,Ruiyu Wang,Jiang Bian
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:resulting visible geometric, parametric CAD editing, visible geometric shape, parametric construction sequence, geometry-driven parametric CAD
备注: NeurIPS 2025
点击查看摘要
Abstract:A Computer-Aided Design (CAD) model encodes an object in two coupled forms: a parametric construction sequence and its resulting visible geometric shape. During iterative design, adjustments to the geometric shape inevitably require synchronized edits to the underlying parametric sequence, called geometry-driven parametric CAD editing. The task calls for 1) preserving the original sequence's structure, 2) ensuring each edit's semantic validity, and 3) maintaining high shape fidelity to the target shape, all under scarce editing data triplets. We present CADMorph, an iterative plan-generate-verify framework that orchestrates pretrained domain-specific foundation models during inference: a parameter-to-shape (P2S) latent diffusion model and a masked-parameter-prediction (MPP) model. In the planning stage, cross-attention maps from the P2S model pinpoint the segments that need modification and offer editing masks. The MPP model then infills these masks with semantically valid edits in the generation stage. During verification, the P2S model embeds each candidate sequence in shape-latent space, measures its distance to the target shape, and selects the closest one. The three stages leverage the inherent geometric consciousness and design knowledge in pretrained priors, and thus tackle structure preservation, semantic validity, and shape fidelity respectively. Besides, both P2S and MPP models are trained without triplet data, bypassing the data-scarcity bottleneck. CADMorph surpasses GPT-4o and specialized CAD baselines, and supports downstream applications such as iterative editing and reverse-engineering enhancement.
40. 【2512.11465】DOS: Distilling Observable Softmaps of Zipfian Prototypes for Self-Supervised Point Representation
链接:https://arxiv.org/abs/2512.11465
作者:Mohamed Abdelsamad,Michael Ulrich,Bin Yang,Miao Zhang,Yakov Miron,Abhinav Valada
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:shown tremendous potential, Recent advances, point cloud representations, advances in self-supervised, shown tremendous
备注: AAAI-26
点击查看摘要
Abstract:Recent advances in self-supervised learning (SSL) have shown tremendous potential for learning 3D point cloud representations without human annotations. However, SSL for 3D point clouds still faces critical challenges due to irregular geometry, shortcut-prone reconstruction, and unbalanced semantics distribution. In this work, we propose DOS (Distilling Observable Softmaps), a novel SSL framework that self-distills semantic relevance softmaps only at observable (unmasked) points. This strategy prevents information leakage from masked regions and provides richer supervision than discrete token-to-prototype assignments. To address the challenge of unbalanced semantics in an unsupervised setting, we introduce Zipfian prototypes and incorporate them using a modified Sinkhorn-Knopp algorithm, Zipf-Sinkhorn, which enforces a power-law prior over prototype usage and modulates the sharpness of the target softmap during training. DOS outperforms current state-of-the-art methods on semantic segmentation and 3D object detection across multiple benchmarks, including nuScenes, Waymo, SemanticKITTI, ScanNet, and ScanNet200, without relying on extra data or annotations. Our results demonstrate that observable-point softmaps distillation offers a scalable and effective paradigm for learning robust 3D representations.
41. 【2512.11464】Exploring MLLM-Diffusion Information Transfer with MetaCanvas
链接:https://arxiv.org/abs/2512.11464
作者:Han Lin,Xichen Pan,Ziqi Huang,Ji Hou,Jialiang Wang,Weifeng Chen,Zecheng He,Felix Juefei-Xu,Junzhe Sun,Zhipeng Fan,Ali Thabet,Mohit Bansal,Chu Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:large language models, rapidly advanced visual, multimodal large language, powerful core models, advanced visual understanding
备注: Project page: [this https URL](https://metacanvas.github.io)
点击查看摘要
Abstract:Multimodal learning has rapidly advanced visual understanding, largely via multimodal large language models (MLLMs) that use powerful LLMs as cognitive cores. In visual generation, however, these powerful core models are typically reduced to global text encoders for diffusion models, leaving most of their reasoning and planning ability unused. This creates a gap: current multimodal LLMs can parse complex layouts, attributes, and knowledge-intensive scenes, yet struggle to generate images or videos with equally precise and structured control. We propose MetaCanvas, a lightweight framework that lets MLLMs reason and plan directly in spatial and spatiotemporal latent spaces and interface tightly with diffusion generators. We empirically implement MetaCanvas on three different diffusion backbones and evaluate it across six tasks, including text-to-image generation, text/image-to-video generation, image/video editing, and in-context video generation, each requiring precise layouts, robust attribute binding, and reasoning-intensive control. MetaCanvas consistently outperforms global-conditioning baselines, suggesting that treating MLLMs as latent-space planners is a promising direction for narrowing the gap between multimodal understanding and generation.
42. 【2512.11458】Boosting Skeleton-based Zero-Shot Action Recognition with Training-Free Test-Time Adaptation
链接:https://arxiv.org/abs/2512.11458
作者:Jingmin Zhu,Anqi Zhu,Hossein Rahmani,Jun Liu,Mohammed Bennamoun,Qiuhong Ke
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:training-free test-time adaptation, test-time adaptation framework, improving model generalization, aimed at improving, training-free test-time
备注:
点击查看摘要
Abstract:We introduce Skeleton-Cache, the first training-free test-time adaptation framework for skeleton-based zero-shot action recognition (SZAR), aimed at improving model generalization to unseen actions during inference. Skeleton-Cache reformulates inference as a lightweight retrieval process over a non-parametric cache that stores structured skeleton representations, combining both global and fine-grained local descriptors. To guide the fusion of descriptor-wise predictions, we leverage the semantic reasoning capabilities of large language models (LLMs) to assign class-specific importance weights. By integrating these structured descriptors with LLM-guided semantic priors, Skeleton-Cache dynamically adapts to unseen actions without any additional training or access to training data. Extensive experiments on NTU RGB+D 60/120 and PKU-MMD II demonstrate that Skeleton-Cache consistently boosts the performance of various SZAR backbones under both zero-shot and generalized zero-shot settings. The code is publicly available at this https URL.
43. 【2512.11446】YawDD+: Frame-level Annotations for Accurate Yawn Prediction
链接:https://arxiv.org/abs/2512.11446
作者:Ahmed Mujtaba,Gleb Radchenko,Marc Masana,Radu Prodan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:involving drowsy drivers, crashes involving drowsy, Driver fatigue remains, drowsy drivers, road accidents
备注: This paper is submitted at European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, 2026
点击查看摘要
Abstract:Driver fatigue remains a leading cause of road accidents, with 24\% of crashes involving drowsy drivers. While yawning serves as an early behavioral indicator of fatigue, existing machine learning approaches face significant challenges due to video-annotated datasets that introduce systematic noise from coarse temporal annotations. We develop a semi-automated labeling pipeline with human-in-the-loop verification, which we apply to YawDD, enabling more accurate model training. Training the established MNasNet classifier and YOLOv11 detector architectures on YawDD+ improves frame accuracy by up to 6\% and mAP by 5\% over video-level supervision, achieving 99.34\% classification accuracy and 95.69\% detection mAP. The resulting approach deliver up to 59.8 FPS on edge AI hardware (NVIDIA Jetson Nano), confirming that enhanced data quality alone supports on-device yawning monitoring without server-side computation.
44. 【2512.11438】Flowception: Temporally Expansive Flow Matching for Video Generation
链接:https://arxiv.org/abs/2512.11438
作者:Tariq Berrada Ifriqi,John Nguyen,Karteek Alahari,Jakob Verbeek,Ricky T. Q. Chen
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:non-autoregressive and variable-length, present Flowception, video generation framework, Flowception, variable-length video generation
备注:
点击查看摘要
Abstract:We present Flowception, a novel non-autoregressive and variable-length video generation framework. Flowception learns a probability path that interleaves discrete frame insertions with continuous frame denoising. Compared to autoregressive methods, Flowception alleviates error accumulation/drift as the frame insertion mechanism during sampling serves as an efficient compression mechanism to handle long-term context. Compared to full-sequence flows, our method reduces FLOPs for training three-fold, while also being more amenable to local attention variants, and allowing to learn the length of videos jointly with their content. Quantitative experimental results show improved FVD and VBench metrics over autoregressive and full-sequence baselines, which is further validated with qualitative results. Finally, by learning to insert and denoise frames in a sequence, Flowception seamlessly integrates different tasks such as image-to-video generation and video interpolation.
45. 【2512.11433】Back to the Baseline: Examining Baseline Effects on Explainability Metrics
链接:https://arxiv.org/abs/2512.11433
作者:Agustin Martin Picard(ANITI),Thibaut Boissin(ANITI),Varshini Subhash,Rémi Cadène(SU),Thomas Fel(ANITI)
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Explainable Artificial Intelligence, Artificial Intelligence, Explainable Artificial, Insertion and Deletion, techniques in Explainable
备注:
点击查看摘要
Abstract:Attribution methods are among the most prevalent techniques in Explainable Artificial Intelligence (XAI) and are usually evaluated and compared using Fidelity metrics, with Insertion and Deletion being the most popular. These metrics rely on a baseline function to alter the pixels of the input image that the attribution map deems most important. In this work, we highlight a critical problem with these metrics: the choice of a given baseline will inevitably favour certain attribution methods over others. More concerningly, even a simple linear model with commonly used baselines contradicts itself by designating different optimal methods. A question then arises: which baseline should we use? We propose to study this problem through two desirable properties of a baseline: (i) that it removes information and (ii) that it does not produce overly out-of-distribution (OOD) images. We first show that none of the tested baselines satisfy both criteria, and there appears to be a trade-off among current baselines: either they remove information or they produce a sequence of OOD images. Finally, we introduce a novel baseline by leveraging recent work in feature visualisation to artificially produce a model-dependent baseline that removes information without being overly OOD, thus improving on the trade-off when compared to other existing baselines. Our code is available at this https URL
46. 【2512.11423】JoyAvatar: Real-time and Infinite Audio-Driven Avatar Generation with Autoregressive Diffusion
链接:https://arxiv.org/abs/2512.11423
作者:Chaochao Li,Ruikui Wang,Liangbo Zhou,Jinheng Feng,Huaishao Luo,Huan Zhang,Youzheng Wu,Xiaodong He
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieved considerable progress, Existing DiT-based audio-driven, high computational overhead, Existing DiT-based, DiT-based audio-driven avatar
备注:
点击查看摘要
Abstract:Existing DiT-based audio-driven avatar generation methods have achieved considerable progress, yet their broader application is constrained by limitations such as high computational overhead and the inability to synthesize long-duration videos. Autoregressive methods address this problem by applying block-wise autoregressive diffusion methods. However, these methods suffer from the problem of error accumulation and quality degradation. To address this, we propose JoyAvatar, an audio-driven autoregressive model capable of real-time inference and infinite-length video generation with the following contributions: (1) Progressive Step Bootstrapping (PSB), which allocates more denoising steps to initial frames to stabilize generation and reduce error accumulation; (2) Motion Condition Injection (MCI), enhancing temporal coherence by injecting noise-corrupted previous frames as motion condition; and (3) Unbounded RoPE via Cache-Resetting (URCR), enabling infinite-length generation through dynamic positional encoding. Our 1.3B-parameter causal model achieves 16 FPS on a single GPU and achieves competitive results in visual quality, temporal consistency, and lip synchronization.
47. 【2512.11401】Collaborative Reconstruction and Repair for Multi-class Industrial Anomaly Detection
链接:https://arxiv.org/abs/2512.11401
作者:Qishan Wang,Haofeng Wang,Shuyong Gao,Jia Guo,Li Xiong,Jiaqi Li,Dengxuan Bai,Wenqiang Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:challenging open-set task, normal data distribution, identify unknown anomalous, unknown anomalous patterns, anomalous patterns deviating
备注: Accepted to Data Intelligence 2025
点击查看摘要
Abstract:Industrial anomaly detection is a challenging open-set task that aims to identify unknown anomalous patterns deviating from normal data distribution. To avoid the significant memory consumption and limited generalizability brought by building separate models per class, we focus on developing a unified framework for multi-class anomaly detection. However, under this challenging setting, conventional reconstruction-based networks often suffer from an identity mapping problem, where they directly replicate input features regardless of whether they are normal or anomalous, resulting in detection failures. To address this issue, this study proposes a novel framework termed Collaborative Reconstruction and Repair (CRR), which transforms the reconstruction to repairation. First, we optimize the decoder to reconstruct normal samples while repairing synthesized anomalies. Consequently, it generates distinct representations for anomalous regions and similar representations for normal areas compared to the encoder's output. Second, we implement feature-level random masking to ensure that the representations from decoder contain sufficient local information. Finally, to minimize detection errors arising from the discrepancies between feature representations from the encoder and decoder, we train a segmentation network supervised by synthetic anomaly masks, thereby enhancing localization performance. Extensive experiments on industrial datasets that CRR effectively mitigates the identity mapping issue and achieves state-of-the-art performance in multi-class industrial anomaly detection.
48. 【2512.11399】Minimal Clips, Maximum Salience: Long Video Summarization via Key Moment Extraction
链接:https://arxiv.org/abs/2512.11399
作者:Galann Pennec,Zhengyuan Liu,Nicholas Asher,Philippe Muller,Nancy F. Chen
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:process increasingly longer, increasingly longer videos, process increasingly, increasingly longer, Vision-Language Models
备注:
点击查看摘要
Abstract:Vision-Language Models (VLMs) are able to process increasingly longer videos. Yet, important visual information is easily lost throughout the entire context and missed by VLMs. Also, it is important to design tools that enable cost-effective analysis of lengthy video content. In this paper, we propose a clip selection method that targets key video moments to be included in a multimodal summary. We divide the video into short clips and generate compact visual descriptions of each using a lightweight video captioning model. These are then passed to a large language model (LLM), which selects the K clips containing the most relevant visual information for a multimodal summary. We evaluate our approach on reference clips for the task, automatically derived from full human-annotated screenplays and summaries in the MovieSum dataset. We further show that these reference clips (less than 6% of the movie) are sufficient to build a complete multimodal summary of the movies in MovieSum. Using our clip selection method, we achieve a summarization performance close to that of these reference clips while capturing substantially more relevant video information than random clip selection. Importantly, we maintain low computational cost by relying on a lightweight captioning model.
49. 【2512.11395】FlowDC: Flow-Based Decoupling-Decay for Complex Image Editing
链接:https://arxiv.org/abs/2512.11395
作者:Yilei Jiang,Zhen Wang,Yanghao Wang,Jun Yu,Yueting Zhuang,Jun Xiao,Long Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:flow matching models, gained remarkable improvement, text-based image editing, image editing performance, single editing target
备注:
点击查看摘要
Abstract:With the surge of pre-trained text-to-image flow matching models, text-based image editing performance has gained remarkable improvement, especially for \underline{simple editing} that only contains a single editing target. To satisfy the exploding editing requirements, the \underline{complex editing} which contains multiple editing targets has posed as a more challenging task. However, current complex editing solutions: single-round and multi-round editing are limited by long text following and cumulative inconsistency, respectively. Thus, they struggle to strike a balance between semantic alignment and source consistency. In this paper, we propose \textbf{FlowDC}, which decouples the complex editing into multiple sub-editing effects and superposes them in parallel during the editing process. Meanwhile, we observed that the velocity quantity that is orthogonal to the editing displacement harms the source structure preserving. Thus, we decompose the velocity and decay the orthogonal part for better source consistency. To evaluate the effectiveness of complex editing settings, we construct a complex editing benchmark: Complex-PIE-Bench. On two benchmarks, FlowDC shows superior results compared with existing methods. We also detail the ablations of our module designs.
50. 【2512.11393】he N-Body Problem: Parallel Execution from Single-Person Egocentric Video
链接:https://arxiv.org/abs/2512.11393
作者:Zhifan Zhu,Yifei Huang,Yoichi Sato,Dima Damen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:parallelise complex activities, intuitively parallelise complex, Humans can intuitively, complex activities, single person
备注: project webpage: [this https URL](https://zhifanzhu.github.io/ego-nbody)
点击查看摘要
Abstract:Humans can intuitively parallelise complex activities, but can a model learn this from observing a single person? Given one egocentric video, we introduce the N-Body Problem: how N individuals, can hypothetically perform the same set of tasks observed in this video. The goal is to maximise speed-up, but naive assignment of video segments to individuals often violates real-world constraints, leading to physically impossible scenarios like two people using the same object or occupying the same space. To address this, we formalise the N-Body Problem and propose a suite of metrics to evaluate both performance (speed-up, task coverage) and feasibility (spatial collisions, object conflicts and causal constraints). We then introduce a structured prompting strategy that guides a Vision-Language Model (VLM) to reason about the 3D environment, object usage, and temporal dependencies to produce a viable parallel execution. On 100 videos from EPIC-Kitchens and HD-EPIC, our method for N = 2 boosts action coverage by 45% over a baseline prompt for Gemini 2.5 Pro, while simultaneously slashing collision rates, object and causal conflicts by 55%, 45% and 55% respectively.
51. 【2512.11373】Out-of-Distribution Segmentation via Wasserstein-Based Evidential Uncertainty
链接:https://arxiv.org/abs/2512.11373
作者:Arnold Brosch,Abdelrahman Eldesokey,Michael Felsberg,Kira Maag
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Deep neural networks, neural networks achieve, networks achieve superior, Deep neural, encounter unknown objects
备注:
点击查看摘要
Abstract:Deep neural networks achieve superior performance in semantic segmentation, but are limited to a predefined set of classes, which leads to failures when they encounter unknown objects in open-world scenarios. Recognizing and segmenting these out-of-distribution (OOD) objects is crucial for safety-critical applications such as automated driving. In this work, we present an evidence segmentation framework using a Wasserstein loss, which captures distributional distances while respecting the probability simplex geometry. Combined with Kullback-Leibler regularization and Dice structural consistency terms, our approach leads to improved OOD segmentation performance compared to uncertainty-based approaches.
52. 【2512.11369】Assisted Refinement Network Based on Channel Information Interaction for Camouflaged and Salient Object Detection
链接:https://arxiv.org/abs/2512.11369
作者:Kuan Wang,Yanjun Qin,Mengge Lu,Liejun Wang,Xiaoming Tao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Camouflaged Object Detection, visually highly integrated, segmenting objects visually, objects visually highly, Camouflaged Object
备注: 15 pages, 9 figures
点击查看摘要
Abstract:Camouflaged Object Detection (COD) stands as a significant challenge in computer vision, dedicated to identifying and segmenting objects visually highly integrated with their backgrounds. Current mainstream methods have made progress in cross-layer feature fusion, but two critical issues persist during the decoding stage. The first is insufficient cross-channel information interaction within the same-layer features, limiting feature expressiveness. The second is the inability to effectively co-model boundary and region information, making it difficult to accurately reconstruct complete regions and sharp boundaries of objects. To address the first issue, we propose the Channel Information Interaction Module (CIIM), which introduces a horizontal-vertical integration mechanism in the channel dimension. This module performs feature reorganization and interaction across channels to effectively capture complementary cross-channel information. To address the second issue, we construct a collaborative decoding architecture guided by prior knowledge. This architecture generates boundary priors and object localization maps through Boundary Extraction (BE) and Region Extraction (RE) modules, then employs hybrid attention to collaboratively calibrate decoded features, effectively overcoming semantic ambiguity and imprecise boundaries. Additionally, the Multi-scale Enhancement (MSE) module enriches contextual feature representations. Extensive experiments on four COD benchmark datasets validate the effectiveness and state-of-the-art performance of the proposed model. We further transferred our model to the Salient Object Detection (SOD) task and demonstrated its adaptability across downstream tasks, including polyp segmentation, transparent object detection, and industrial and road defect detection. Code and experimental results are publicly available at: this https URL.
53. 【2512.11360】Reliable Detection of Minute Targets in High-Resolution Aerial Imagery across Temporal Shifts
链接:https://arxiv.org/abs/2512.11360
作者:Mohammad Sadegh Gholizadeh,Amir Arsalan Rezapour,Hamidreza Shayegh,Ehsan Pazouki
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Unmanned Aerial Vehicles, scaling precision agriculture, Efficient crop detection, remains challenging due, Efficient crop
备注:
点击查看摘要
Abstract:Efficient crop detection via Unmanned Aerial Vehicles is critical for scaling precision agriculture, yet it remains challenging due to the small scale of targets and environmental variability. This paper addresses the detection of rice seedlings in paddy fields by leveraging a Faster R-CNN architecture initialized via transfer learning. To overcome the specific difficulties of detecting minute objects in high-resolution aerial imagery, we curate a significant UAV dataset for training and rigorously evaluate the model's generalization capabilities. Specifically, we validate performance across three distinct test sets acquired at different temporal intervals, thereby assessing robustness against varying imaging conditions. Our empirical results demonstrate that transfer learning not only facilitates the rapid convergence of object detection models in agricultural contexts but also yields consistent performance despite domain shifts in image acquisition.
54. 【2512.11356】Prior-Enhanced Gaussian Splatting for Dynamic Scene Reconstruction from Casual Video
链接:https://arxiv.org/abs/2512.11356
作者:Meng-Li Shih,Ying-Huan Chen,Yu-Lun Liu,Brian Curless
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:fully automatic pipeline, captured monocular RGB, Dynamic Gaussian Splatting, monocular RGB videos, casually captured monocular
备注:
点击查看摘要
Abstract:We introduce a fully automatic pipeline for dynamic scene reconstruction from casually captured monocular RGB videos. Rather than designing a new scene representation, we enhance the priors that drive Dynamic Gaussian Splatting. Video segmentation combined with epipolar-error maps yields object-level masks that closely follow thin structures; these masks (i) guide an object-depth loss that sharpens the consistent video depth, and (ii) support skeleton-based sampling plus mask-guided re-identification to produce reliable, comprehensive 2-D tracks. Two additional objectives embed the refined priors in the reconstruction stage: a virtual-view depth loss removes floaters, and a scaffold-projection loss ties motion nodes to the tracks, preserving fine geometry and coherent motion. The resulting system surpasses previous monocular dynamic scene reconstruction methods and delivers visibly superior renderings
55. 【2512.11354】A Multi-Mode Structured Light 3D Imaging System with Multi-Source Information Fusion for Underwater Pipeline Detection
链接:https://arxiv.org/abs/2512.11354
作者:Qinghan Hu,Haijiang Zhu,Na Sun,Lei Chen,Zhengqiang Fan,Zhiqing Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:significant safety risks, pose significant safety, susceptible to corrosion, safety risks, highly susceptible
备注:
点击查看摘要
Abstract:Underwater pipelines are highly susceptible to corrosion, which not only shorten their service life but also pose significant safety risks. Compared with manual inspection, the intelligent real-time imaging system for underwater pipeline detection has become a more reliable and practical solution. Among various underwater imaging techniques, structured light 3D imaging can restore the sufficient spatial detail for precise defect characterization. Therefore, this paper develops a multi-mode underwater structured light 3D imaging system for pipeline detection (UW-SLD system) based on multi-source information fusion. First, a rapid distortion correction (FDC) method is employed for efficient underwater image rectification. To overcome the challenges of extrinsic calibration among underwater sensors, a factor graph-based parameter optimization method is proposed to estimate the transformation matrix between the structured light and acoustic sensors. Furthermore, a multi-mode 3D imaging strategy is introduced to adapt to the geometric variability of underwater pipelines. Given the presence of numerous disturbances in underwater environments, a multi-source information fusion strategy and an adaptive extended Kalman filter (AEKF) are designed to ensure stable pose estimation and high-accuracy measurements. In particular, an edge detection-based ICP (ED-ICP) algorithm is proposed. This algorithm integrates pipeline edge detection network with enhanced point cloud registration to achieve robust and high-fidelity reconstruction of defect structures even under variable motion conditions. Extensive experiments are conducted under different operation modes, velocities, and depths. The results demonstrate that the developed system achieves superior accuracy, adaptability and robustness, providing a solid foundation for autonomous underwater pipeline detection.
56. 【2512.11350】Surveillance Video-Based Traffic Accident Detection Using Transformer Architecture
链接:https://arxiv.org/abs/2512.11350
作者:Tanu Singh,Pranamesh Chakraborty,Long T. Truong
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Road traffic accidents, incidence rates rising, rates rising due, Road traffic, mortality globally
备注:
点击查看摘要
Abstract:Road traffic accidents represent a leading cause of mortality globally, with incidence rates rising due to increasing population, urbanization, and motorization. Rising accident rates raise concerns about traffic surveillance effectiveness. Traditional computer vision methods for accident detection struggle with limited spatiotemporal understanding and poor cross-domain generalization. Recent advances in transformer architectures excel at modeling global spatial-temporal dependencies and parallel computation. However, applying these models to automated traffic accident detection is limited by small, non-diverse datasets, hindering the development of robust, generalizable systems. To address this gap, we curated a comprehensive and balanced dataset that captures a wide spectrum of traffic environments, accident types, and contextual variations. Utilizing the curated dataset, we propose an accident detection model based on a transformer architecture using pre-extracted spatial video features. The architecture employs convolutional layers to extract local correlations across diverse patterns within a frame, while leveraging transformers to capture sequential-temporal dependencies among the retrieved features. Moreover, most existing studies neglect the integration of motion cues, which are essential for understanding dynamic scenes, especially during accidents. These approaches typically rely on static features or coarse temporal information. In this study, multiple methods for incorporating motion cues were evaluated to identify the most effective strategy. Among the tested input approaches, concatenating RGB features with optical flow achieved the highest accuracy at 88.3%. The results were further compared with vision language models (VLM) such as GPT, Gemini, and LLaVA-NeXT-Video to assess the effectiveness of the proposed method.
57. 【2512.11340】ask-Specific Distance Correlation Matching for Few-Shot Action Recognition
链接:https://arxiv.org/abs/2512.11340
作者:Fei Long,Yao Zhang,Jiaming Lv,Jiangtao Xie,Peihua Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Few-shot action recognition, recently made notable, made notable progress, Few-shot action, large-scale pre-trained models
备注: 9 pages. 4 figures, conference
点击查看摘要
Abstract:Few-shot action recognition (FSAR) has recently made notable progress through set matching and efficient adaptation of large-scale pre-trained models. However, two key limitations persist. First, existing set matching metrics typically rely on cosine similarity to measure inter-frame linear dependencies and then perform matching with only instance-level information, thus failing to capture more complex patterns such as nonlinear relationships and overlooking task-specific cues. Second, for efficient adaptation of CLIP to FSAR, recent work performing fine-tuning via skip-fusion layers (which we refer to as side layers) has significantly reduced memory cost. However, the newly introduced side layers are often difficult to optimize under limited data conditions. To address these limitations, we propose TS-FSAR, a framework comprising three components: (1) a visual Ladder Side Network (LSN) for efficient CLIP fine-tuning; (2) a metric called Task-Specific Distance Correlation Matching (TS-DCM), which uses $\alpha$-distance correlation to model both linear and nonlinear inter-frame dependencies and leverages a task prototype to enable task-specific matching; and (3) a Guiding LSN with Adapted CLIP (GLAC) module, which regularizes LSN using the adapted frozen CLIP to improve training for better $\alpha$-distance correlation estimation under limited supervision. Extensive experiments on five widely-used benchmarks demonstrate that our TS-FSAR yields superior performance compared to prior state-of-the-arts.
58. 【2512.11336】UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language Models
链接:https://arxiv.org/abs/2512.11336
作者:Hewen Pan,Cong Wei,Dashuang Liang,Zepeng Huang,Pengfei Gao,Ziqi Zhou,Lulu Xue,Pengfei Yan,Xiaoming Wei,Minghui Li,Shengshan Hu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:multi-modal Large Language, Large Language Models, Large Language, specialized video understanding, multi-modal Large
备注: 22 pages, 13 figures, technical report
点击查看摘要
Abstract:With the advancement of multi-modal Large Language Models (LLMs), Video LLMs have been further developed to perform on holistic and specialized video understanding. However, existing works are limited to specialized video understanding tasks, failing to achieve a comprehensive and multi-grained video perception. To bridge this gap, we introduce UFVideo, the first Video LLM with unified multi-grained cooperative understanding capabilities. Specifically, we design unified visual-language guided alignment to flexibly handle video understanding across global, pixel and temporal scales within a single model. UFVideo dynamically encodes the visual and text inputs of different tasks and generates the textual response, temporal localization, or grounded mask. Additionally, to evaluate challenging multi-grained video understanding tasks, we construct the UFVideo-Bench consisting of three distinct collaborative tasks within the scales, which demonstrates UFVideo's flexibility and advantages over GPT-4o. Furthermore, we validate the effectiveness of our model across 9 public benchmarks covering various common video understanding tasks, providing valuable insights for future Video LLMs.
59. 【2512.11335】FreqDINO: Frequency-Guided Adaptation for Generalized Boundary-Aware Ultrasound Image Segmentation
链接:https://arxiv.org/abs/2512.11335
作者:Yixuan Zhang,Qing Xu,Yue Li,Xiangjian He,Qian Zhang,Mainul Haque,Rong Qu,Wenting Duan,Zhen Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Ultrasound image segmentation, Ultrasound image, clinical diagnosis, imaging artifacts, pivotal for clinical
备注:
点击查看摘要
Abstract:Ultrasound image segmentation is pivotal for clinical diagnosis, yet challenged by speckle noise and imaging artifacts. Recently, DINOv3 has shown remarkable promise in medical image segmentation with its powerful representation capabilities. However, DINOv3, pre-trained on natural images, lacks sensitivity to ultrasound-specific boundary degradation. To address this limitation, we propose FreqDINO, a frequency-guided segmentation framework that enhances boundary perception and structural consistency. Specifically, we devise a Multi-scale Frequency Extraction and Alignment (MFEA) strategy to separate low-frequency structures and multi-scale high-frequency boundary details, and align them via learnable attention. We also introduce a Frequency-Guided Boundary Refinement (FGBR) module that extracts boundary prototypes from high-frequency components and refines spatial features. Furthermore, we design a Multi-task Boundary-Guided Decoder (MBGD) to ensure spatial coherence between boundary and semantic predictions. Extensive experiments demonstrate that FreqDINO surpasses state-of-the-art methods with superior achieves remarkable generalization capability. The code is at this https URL.
60. 【2512.11327】Physics-Informed Video Flare Synthesis and Removal Leveraging Motion Independence between Flare and Scene
链接:https://arxiv.org/abs/2512.11327
作者:Junqiao Wang,Yuanfei Huang,Hua Huang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:degradation phenomenon caused, Lens flare, flare, strong light sources, video flare
备注:
点击查看摘要
Abstract:Lens flare is a degradation phenomenon caused by strong light sources. Existing researches on flare removal have mainly focused on images, while the spatiotemporal characteristics of video flare remain largely unexplored. Video flare synthesis and removal pose significantly greater challenges than in image, owing to the complex and mutually independent motion of flare, light sources, and scene content. This motion independence further affects restoration performance, often resulting in flicker and artifacts. To address this issue, we propose a physics-informed dynamic flare synthesis pipeline, which simulates light source motion using optical flow and models the temporal behaviors of both scattering and reflective flares. Meanwhile, we design a video flare removal network that employs an attention module to spatially suppress flare regions and incorporates a Mamba-based temporal modeling component to capture long range spatio-temporal dependencies. This motion-independent spatiotemporal representation effectively eliminates the need for multi-frame alignment, alleviating temporal aliasing between flares and scene content and thereby improving video flare removal performance. Building upon this, we construct the first video flare dataset to comprehensively evaluate our method, which includes a large set of synthetic paired videos and additional real-world videos collected from the Internet to assess generalization capability. Extensive experiments demonstrate that our method consistently outperforms existing video-based restoration and image-based flare removal methods on both real and synthetic videos, effectively removing dynamic flares while preserving light source integrity and maintaining spatiotemporal consistency of scene.
61. 【2512.11325】MLLM Machine Unlearning via Visual Knowledge Distillation
链接:https://arxiv.org/abs/2512.11325
作者:Yuhang Wang,Zhenxing Niu,Haoxuan Ji,Guangyu He,Haichang Gao,Gang Hua
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:remove sensitive information, machine unlearning approaches, well-trained large models, proposed to remove, remove sensitive
备注:
点击查看摘要
Abstract:Recently, machine unlearning approaches have been proposed to remove sensitive information from well-trained large models. However, most existing methods are tailored for LLMs, while MLLM-oriented unlearning remains at its early stage. Inspired by recent studies exploring the internal mechanisms of MLLMs, we propose to disentangle the visual and textual knowledge embedded within MLLMs and introduce a dedicated approach to selectively erase target visual knowledge while preserving textual knowledge. Unlike previous unlearning methods that rely on output-level supervision, our approach introduces a Visual Knowledge Distillation (VKD) scheme, which leverages intermediate visual representations within the MLLM as supervision signals. This design substantially enhances both unlearning effectiveness and model utility. Moreover, since our method only fine-tunes the visual components of the MLLM, it offers significant efficiency advantages. Extensive experiments demonstrate that our approach outperforms state-of-the-art unlearning methods in terms of both effectiveness and efficiency. Moreover, we are the first to evaluate the robustness of MLLM unlearning against relearning attacks.
62. 【2512.11321】KeyframeFace: From Text to Expressive Facial Keyframes
链接:https://arxiv.org/abs/2512.11321
作者:Jingchao Wu,Zejian Kang,Haibo Liu,Yuanchen Fei,Xiangru Huang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Language Models, natural language requires, Generating dynamic, temporally structured semantics, language requires understanding
备注:
点击查看摘要
Abstract:Generating dynamic 3D facial animation from natural language requires understanding both temporally structured semantics and fine-grained expression changes. Existing datasets and methods mainly focus on speech-driven animation or unstructured expression sequences and therefore lack the semantic grounding and temporal structures needed for expressive human performance generation. In this work, we introduce KeyframeFace, a large-scale multimodal dataset designed for text-to-animation research through keyframe-level supervision. KeyframeFace provides 2,100 expressive scripts paired with monocular videos, per-frame ARKit coefficients, contextual backgrounds, complex emotions, manually defined keyframes, and multi-perspective annotations based on ARKit coefficients and images via Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Beyond the dataset, we propose the first text-to-animation framework that explicitly leverages LLM priors for interpretable facial motion synthesis. This design aligns the semantic understanding capabilities of LLMs with the interpretable structure of ARKit's coefficients, enabling high-fidelity expressive animation. KeyframeFace and our LLM-based framework together establish a new foundation for interpretable, keyframe-guided, and context-aware text-to-animation. Code and data are available at this https URL.
63. 【2512.11319】SATMapTR: Satellite Image Enhanced Online HD Map Construction
链接:https://arxiv.org/abs/2512.11319
作者:Bingyuan Huang,Guanyi Zhao,Qian Xu,Yang Lou,Yung-Hui Li,Jianping Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:support autonomous driving, diverse scenarios, evolving from pre-annotated, pre-annotated to real-time, support autonomous
备注: 9 pages (+ 3 pages of Appendix)
点击查看摘要
Abstract:High-definition (HD) maps are evolving from pre-annotated to real-time construction to better support autonomous driving in diverse scenarios. However, this process is hindered by low-quality input data caused by onboard sensors limited capability and frequent occlusions, leading to incomplete, noisy, or missing data, and thus reduced mapping accuracy and robustness. Recent efforts have introduced satellite images as auxiliary input, offering a stable, wide-area view to complement the limited ego perspective. However, satellite images in Bird's Eye View are often degraded by shadows and occlusions from vegetation and buildings. Prior methods using basic feature extraction and fusion remain ineffective. To address these challenges, we propose SATMapTR, a novel online map construction model that effectively fuses satellite image through two key components: (1) a gated feature refinement module that adaptively filters satellite image features by integrating high-level semantics with low-level structural cues to extract high signal-to-noise ratio map-relevant representations; and (2) a geometry-aware fusion module that consistently fuse satellite and BEV features at a grid-to-grid level, minimizing interference from irrelevant regions and low-quality inputs. Experimental results on the nuScenes dataset show that SATMapTR achieves the highest mean average precision (mAP) of 73.8, outperforming state-of-the-art satellite-enhanced models by up to 14.2 mAP. It also shows lower mAP degradation under adverse weather and sensor failures, and achieves nearly 3 times higher mAP at extended perception ranges.
64. 【2512.11301】MultiEgo: A Multi-View Egocentric Video Dataset for 4D Scene Reconstruction
链接:https://arxiv.org/abs/2512.11301
作者:Bate Li,Houqiang Zhong,Zhengxue Cheng,Qiang Hu,Qiang Wang,Li Song,Wenjun Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:dynamic scene reconstruction, reconstruction holds significant, Multi-view egocentric dynamic, egocentric dynamic scene, dynamic scene
备注: ACM MM 2025 Dataset Track
点击查看摘要
Abstract:Multi-view egocentric dynamic scene reconstruction holds significant research value for applications in holographic documentation of social interactions. However, existing reconstruction datasets focus on static multi-view or single-egocentric view setups, lacking multi-view egocentric datasets for dynamic scene reconstruction. Therefore, we present MultiEgo, the first multi-view egocentric dataset for 4D dynamic scene reconstruction. The dataset comprises five canonical social interaction scenes: meetings, performances, and a presentation. Each scene provides five authentic egocentric videos captured by participants wearing AR glasses. We design a hardware-based data acquisition system and processing pipeline, achieving sub-millisecond temporal synchronization across views, coupled with accurate pose annotations. Experiment validation demonstrates the practical utility and effectiveness of our dataset for free-viewpoint video (FVV) applications, establishing MultiEgo as a foundational resource for advancing multi-view egocentric dynamic scene reconstruction research.
65. 【2512.11296】Few-Shot VLM-Based G-Code and HMI Verification in CNC Machining
链接:https://arxiv.org/abs/2512.11296
作者:Yasaman Hashem Pour,Nazanin Mahjourian,Vinh Nguyen
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
关键词:Manual generation, G-code, HMI, HMI errors, Manual
备注:
点击查看摘要
Abstract:Manual generation of G-code is important for learning the operation of CNC machines. Prior work in G-code verification uses Large-Language Models (LLMs), which primarily examine errors in the written programming. However, CNC machining requires extensive use and knowledge of the Human-Machine Interface (HMI), which displays machine status and errors. LLMs currently lack the capability to leverage knowledge of HMIs due to their inability to access the vision modality. This paper proposes a few-shot VLM-based verification approach that simultaneously evaluates the G-code and the HMI display for errors and safety status. The input dataset includes paired G-code text and associated HMI screenshots from a 15-slant-PRO lathe, including both correct and error-prone cases. To enable few-shot learning, the VLM is provided with a structured JSON schema based on prior heuristic knowledge. After determining the prompts, instances of G-code and HMI that either contain errors or are error free are used as few-shot examples to guide the VLM. The model was then evaluated in comparison to a zero-shot VLM through multiple scenarios of incorrect G-code and HMI errors with respect to per-slot accuracy. The VLM showed that few-shot prompting led to overall enhancement of detecting HMI errors and discrepancies with the G-code for more comprehensive debugging. Therefore, the proposed framework was demonstrated to be suitable for verification of manually generated G-code that is typically developed in CNC training.
66. 【2512.11293】Autoregressive Video Autoencoder with Decoupled Temporal and Spatial Context
链接:https://arxiv.org/abs/2512.11293
作者:Cuifeng Shen,Lumin Xu,Xingguo Zhu,Gengdai Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Video autoencoders compress, autoencoders compress videos, Autoregressive Video Autoencoder, existing video autoencoders, Video autoencoders
备注:
点击查看摘要
Abstract:Video autoencoders compress videos into compact latent representations for efficient reconstruction, playing a vital role in enhancing the quality and efficiency of video generation. However, existing video autoencoders often entangle spatial and temporal information, limiting their ability to capture temporal consistency and leading to suboptimal performance. To address this, we propose Autoregressive Video Autoencoder (ARVAE), which compresses and reconstructs each frame conditioned on its predecessor in an autoregressive manner, allowing flexible processing of videos with arbitrary lengths. ARVAE introduces a temporal-spatial decoupled representation that combines downsampled flow field for temporal coherence with spatial relative compensation for newly emerged content, achieving high compression efficiency without information loss. Specifically, the encoder compresses the current and previous frames into the temporal motion and spatial supplement, while the decoder reconstructs the original frame from the latent representations given the preceding frame. A multi-stage training strategy is employed to progressively optimize the model. Extensive experiments demonstrate that ARVAE achieves superior reconstruction quality with extremely lightweight models and small-scale training data. Moreover, evaluations on video generation tasks highlight its strong potential for downstream applications.
67. 【2512.11284】RcAE: Recursive Reconstruction Framework for Unsupervised Industrial Anomaly Detection
链接:https://arxiv.org/abs/2512.11284
作者:Rongcheng Wu,Hao Zhu,Shiying Zhang,Mingzhe Wang,Zhidong Li,Hui Li,Jianlong Zhou,Jiangtao Cui,Fang Chen,Pingyang Sun,Qiyu Liao,Ye Lin
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Unsupervised industrial anomaly, requires accurately identifying, accurately identifying defects, Unsupervised industrial, detection requires accurately
备注: 19 pages, 7 figures, to be published in AAAI-26
点击查看摘要
Abstract:Unsupervised industrial anomaly detection requires accurately identifying defects without labeled data. Traditional autoencoder-based methods often struggle with incomplete anomaly suppression and loss of fine details, as their single-pass decoding fails to effectively handle anomalies with varying severity and scale. We propose a recursive architecture for autoencoder (RcAE), which performs reconstruction iteratively to progressively suppress anomalies while refining normal structures. Unlike traditional single-pass models, this recursive design naturally produces a sequence of reconstructions, progressively exposing suppressed abnormal patterns. To leverage this reconstruction dynamics, we introduce a Cross Recursion Detection (CRD) module that tracks inconsistencies across recursion steps, enhancing detection of both subtle and large-scale anomalies. Additionally, we incorporate a Detail Preservation Network (DPN) to recover high-frequency textures typically lost during reconstruction. Extensive experiments demonstrate that our method significantly outperforms existing non-diffusion methods, and achieves performance on par with recent diffusion models with only 10% of their parameters and offering substantially faster inference. These results highlight the practicality and efficiency of our approach for real-world applications.
68. 【2512.11274】FilmWeaver: Weaving Consistent Multi-Shot Videos with Cache-Guided Autoregressive Diffusion
链接:https://arxiv.org/abs/2512.11274
作者:Xiangyang Luo,Qingyu Li,Xiaokun Liu,Wenyu Qin,Miao Yang,Meng Wang,Pengfei Wan,Di Zhang,Kun Gai,Shao-Lun Huang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:generation models perform, facing critical challenges, flexibly generating videos, arbitrary length, facing critical
备注: AAAI-2026
点击查看摘要
Abstract:Current video generation models perform well at single-shot synthesis but struggle with multi-shot videos, facing critical challenges in maintaining character and background consistency across shots and flexibly generating videos of arbitrary length and shot count. To address these limitations, we introduce \textbf{FilmWeaver}, a novel framework designed to generate consistent, multi-shot videos of arbitrary length. First, it employs an autoregressive diffusion paradigm to achieve arbitrary-length video generation. To address the challenge of consistency, our key insight is to decouple the problem into inter-shot consistency and intra-shot coherence. We achieve this through a dual-level cache mechanism: a shot memory caches keyframes from preceding shots to maintain character and scene identity, while a temporal memory retains a history of frames from the current shot to ensure smooth, continuous motion. The proposed framework allows for flexible, multi-round user interaction to create multi-shot videos. Furthermore, due to this decoupled design, our method demonstrates high versatility by supporting downstream tasks such as multi-concept injection and video extension. To facilitate the training of our consistency-aware method, we also developed a comprehensive pipeline to construct a high-quality multi-shot video dataset. Extensive experimental results demonstrate that our method surpasses existing approaches on metrics for both consistency and aesthetic quality, opening up new possibilities for creating more consistent, controllable, and narrative-driven video content. Project Page: this https URL
69. 【2512.11267】Evaluating the Efficacy of Sentinel-2 versus Aerial Imagery in Serrated Tussock Classification
链接:https://arxiv.org/abs/2512.11267
作者:Rezwana Sultana,Manzur Murshed,Kathryn Sheffield,Singarayer Florentine,Tsz-Kwan Lee,Shyh Wei Teng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:pose major global, major global threats, Invasive species pose, species pose major, ecosystems and agriculture
备注: Accepted in Earthsense 2025 (IEEE INTERNATIONAL CONFERENCE ON NEXT-GEN TECHNOLOGIES OF ARTIFICIAL INTELLIGENCE AND GEOSCIENCE REMOTE SENSING)
点击查看摘要
Abstract:Invasive species pose major global threats to ecosystems and agriculture. Serrated tussock (\textit{Nassella trichotoma}) is a highly competitive invasive grass species that disrupts native grasslands, reduces pasture productivity, and increases land management costs. In Victoria, Australia, it presents a major challenge due to its aggressive spread and ecological impact. While current ground surveys and subsequent management practices are effective at small scales, they are not feasible for landscape-scale monitoring. Although aerial imagery offers high spatial resolution suitable for detailed classification, its high cost limits scalability. Satellite-based remote sensing provides a more cost-effective and scalable alternative, though often with lower spatial resolution. This study evaluates whether multi-temporal Sentinel-2 imagery, despite its lower spatial resolution, can provide a comparable and cost-effective alternative for landscape-scale monitoring of serrated tussock by leveraging its higher spectral resolution and seasonal phenological information. A total of eleven models have been developed using various combinations of spectral bands, texture features, vegetation indices, and seasonal data. Using a random forest classifier, the best-performing Sentinel-2 model (M76*) has achieved an Overall Accuracy (OA) of 68\% and an Overall Kappa (OK) of 0.55, slightly outperforming the best-performing aerial imaging model's OA of 67\% and OK of 0.52 on the same dataset. These findings highlight the potential of multi-seasonal feature-enhanced satellite-based models for scalable invasive species classification.
70. 【2512.11260】Do We Need Reformer for Vision? An Experimental Comparison with Vision Transformers
链接:https://arxiv.org/abs/2512.11260
作者:Ali El Bellaj,Mohammed-Amine Cheddadi,Rhassan Berber
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:recently demonstrated strong, demonstrated strong performance, high-level image features, Vision Transformers, Transformers have recently
备注:
点击查看摘要
Abstract:Transformers have recently demonstrated strong performance in computer vision, with Vision Transformers (ViTs) leveraging self-attention to capture both low-level and high-level image features. However, standard ViTs remain computationally expensive, since global self-attention scales quadratically with the number of tokens, which limits their practicality for high-resolution inputs and resource-constrained settings. In this work, we investigate the Reformer architecture as an alternative vision backbone. By combining patch-based tokenization with locality-sensitive hashing (LSH) attention, our model approximates global self-attention while reducing its theoretical time complexity from $\mathcal{O}(n^2)$ to $\mathcal{O}(n \log n)$ in the sequence length $n$. We evaluate the proposed Reformer-based vision model on CIFAR-10 to assess its behavior on small-scale datasets, on ImageNet-100 to study its accuracy--efficiency trade-off in a more realistic setting, and on a high-resolution medical imaging dataset to evaluate the model under longer token sequences. While the Reformer achieves higher accuracy on CIFAR-10 compared to our ViT-style baseline, the ViT model consistently outperforms the Reformer in our experiments in terms of practical efficiency and end-to-end computation time across the larger and higher-resolution settings. These results suggest that, despite the theoretical advantages of LSH-based attention, meaningful computation gains require sequence lengths substantially longer than those produced by typical high-resolution images.
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2512.11260 [cs.CV]
(or
arXiv:2512.11260v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2512.11260
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
71. 【2512.11253】PersonaLive! Expressive Portrait Image Animation for Live Streaming
链接:https://arxiv.org/abs/2512.11253
作者:Zhiyuan Li,Chi-Man Pun,Chen Fang,Jue Wang,Xiaodong Cun
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Current diffusion-based portrait, enhancing visual quality, live streaming scenario, diffusion-based portrait animation, Current diffusion-based
备注:
点击查看摘要
Abstract:Current diffusion-based portrait animation models predominantly focus on enhancing visual quality and expression realism, while overlooking generation latency and real-time performance, which restricts their application range in the live streaming scenario. We propose PersonaLive, a novel diffusion-based framework towards streaming real-time portrait animation with multi-stage training recipes. Specifically, we first adopt hybrid implicit signals, namely implicit facial representations and 3D implicit keypoints, to achieve expressive image-level motion control. Then, a fewer-step appearance distillation strategy is proposed to eliminate appearance redundancy in the denoising process, greatly improving inference efficiency. Finally, we introduce an autoregressive micro-chunk streaming generation paradigm equipped with a sliding training strategy and a historical keyframe mechanism to enable low-latency and stable long-term video generation. Extensive experiments demonstrate that PersonaLive achieves state-of-the-art performance with up to 7-22x speedup over prior diffusion-based portrait animation models.
72. 【2512.11243】ask-Aware Multi-Expert Architecture For Lifelong Deep Learning
链接:https://arxiv.org/abs/2512.11243
作者:Jianyu Wang,Jacob Nean-Hua Sheikh,Cat P. Le,Hoda Bidkhori
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:preserving prior knowledge, trains neural networks, learn sequentially, preserving prior, Lifelong deep learning
备注:
点击查看摘要
Abstract:Lifelong deep learning (LDL) trains neural networks to learn sequentially across tasks while preserving prior knowledge. We propose Task-Aware Multi-Expert (TAME), a continual learning algorithm that leverages task similarity to guide expert selection and knowledge transfer. TAME maintains a pool of pretrained neural networks and activates the most relevant expert for each new task. A shared dense layer integrates features from the chosen expert to generate predictions. To reduce catastrophic forgetting, TAME uses a replay buffer that stores representative samples and embeddings from previous tasks and reuses them during training. An attention mechanism further prioritizes the most relevant stored information for each prediction. Together, these components allow TAME to adapt flexibly while retaining important knowledge across evolving task sequences. Experiments on binary classification tasks derived from CIFAR-100 show that TAME improves accuracy on new tasks while sustaining performance on earlier ones, highlighting its effectiveness in balancing adaptation and retention in lifelong learning settings.
73. 【2512.11239】Cross-modal Prompting for Balanced Incomplete Multi-modal Emotion Recognition
链接:https://arxiv.org/abs/2512.11239
作者:Wen-Jue He,Xiaofeng Zhu,Zheng Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Incomplete multi-modal emotion, understanding human intentions, partially observed multi-source, observed multi-source data, multi-modal emotion recognition
备注: Accepted by AAAI 2026
点击查看摘要
Abstract:Incomplete multi-modal emotion recognition (IMER) aims at understanding human intentions and sentiments by comprehensively exploring the partially observed multi-source data. Although the multi-modal data is expected to provide more abundant information, the performance gap and modality under-optimization problem hinder effective multi-modal learning in practice, and are exacerbated in the confrontation of the missing data. To address this issue, we devise a novel Cross-modal Prompting (ComP) method, which emphasizes coherent information by enhancing modality-specific features and improves the overall recognition accuracy by boosting each modality's performance. Specifically, a progressive prompt generation module with a dynamic gradient modulator is proposed to produce concise and consistent modality semantic cues. Meanwhile, cross-modal knowledge propagation selectively amplifies the consistent information in modality features with the delivered prompts to enhance the discrimination of the modality-specific output. Additionally, a coordinator is designed to dynamically re-weight the modality outputs as a complement to the balance strategy to improve the model's efficacy. Extensive experiments on 4 datasets with 7 SOTA methods under different missing rates validate the effectiveness of our proposed method.
74. 【2512.11237】WildCap: Facial Appearance Capture in the Wild via Hybrid Inverse Rendering
链接:https://arxiv.org/abs/2512.11237
作者:Yuxuan Han,Xin Ming,Tianxiao Li,Zhuofan Shen,Qixuan Zhang,Lan Xu,Feng Xu
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:high-quality facial appearance, facial appearance capture, Existing methods achieve, increases capture cost, Existing methods
备注: Technical report. project page: [this https URL](https://yxuhan.github.io/WildCap/index.html;) code: [this https URL](https://github.com/yxuhan/WildCap)
点击查看摘要
Abstract:Existing methods achieve high-quality facial appearance capture under controllable lighting, which increases capture cost and limits usability. We propose WildCap, a novel method for high-quality facial appearance capture from a smartphone video recorded in the wild. To disentangle high-quality reflectance from complex lighting effects in in-the-wild captures, we propose a novel hybrid inverse rendering framework. Specifically, we first apply a data-driven method, i.e., SwitchLight, to convert the captured images into more constrained conditions and then adopt model-based inverse rendering. However, unavoidable local artifacts in network predictions, such as shadow-baking, are non-physical and thus hinder accurate inverse rendering of lighting and material. To address this, we propose a novel texel grid lighting model to explain non-physical effects as clean albedo illuminated by local physical lighting. During optimization, we jointly sample a diffusion prior for reflectance maps and optimize the lighting, effectively resolving scale ambiguity between local lights and albedo. Our method achieves significantly better results than prior arts in the same capture setup, closing the quality gap between in-the-wild and controllable recordings by a large margin. Our code will be released \href{this https URL}{\textcolor{magenta}{here}}.
75. 【2512.11234】RoomPilot: Controllable Synthesis of Interactive Indoor Environments via Multimodal Semantic Parsing
链接:https://arxiv.org/abs/2512.11234
作者:Wentang Chen,Shougao Zhang,Yiman Zhang,Tianhao Zhou,Ruihui Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:architectural visualization, game development, embodied AI training, Generating controllable, fundamental to applications
备注: 20 pages, 6 figures
点击查看摘要
Abstract:Generating controllable and interactive indoor scenes is fundamental to applications in game development, architectural visualization, and embodied AI training. Yet existing approaches either handle a narrow range of input modalities or rely on stochastic processes that hinder controllability. To overcome these limitations, we introduce RoomPilot, a unified framework that parses diverse multi-modal inputs--textual descriptions or CAD floor plans--into an Indoor Domain-Specific Language (IDSL) for indoor structured scene generation. The key insight is that a well-designed IDSL can act as a shared semantic representation, enabling coherent, high-quality scene synthesis from any single modality while maintaining interaction semantics. In contrast to conventional procedural methods that produce visually plausible but functionally inert layouts, RoomPilot leverages a curated dataset of interaction-annotated assets to synthesize environments exhibiting realistic object behaviors. Extensive experiments further validate its strong multi-modal understanding, fine-grained controllability in scene generation, and superior physical consistency and visual fidelity, marking a significant step toward general-purpose controllable 3D indoor scene generation.
76. 【2512.11229】REST: Diffusion-based Real-time End-to-end Streaming Talking Head Generation via ID-Context Caching and Asynchronous Streaming Distillation
链接:https://arxiv.org/abs/2512.11229
作者:Haotian Wang,Yuzhe Weng,Xinyi Yu,Jun Du,Haoran Xu,Xiaoyan Wu,Shan He,Bing Yin,Cong Liu,Qingfeng Liu
类目:Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
关键词:Diffusion models, talking head generation, talking head, significantly advanced, advanced the field
备注: 10pages, 4 figures
点击查看摘要
Abstract:Diffusion models have significantly advanced the field of talking head generation. However, the slow inference speeds and non-autoregressive paradigms severely constrain the application of diffusion-based THG models. In this study, we propose REST, the first diffusion-based, real-time, end-to-end streaming audio-driven talking head generation framework. To support real-time end-to-end generation, a compact video latent space is first learned through high spatiotemporal VAE compression. Additionally, to enable autoregressive streaming within the compact video latent space, we introduce an ID-Context Cache mechanism, which integrates ID-Sink and Context-Cache principles to key-value caching for maintaining temporal consistency and identity coherence during long-time streaming generation. Furthermore, an Asynchronous Streaming Distillation (ASD) training strategy is proposed to mitigate error accumulation in autoregressive generation and enhance temporal consistency, which leverages a non-streaming teacher with an asynchronous noise schedule to supervise the training of the streaming student model. REST bridges the gap between autoregressive and diffusion-based approaches, demonstrating substantial value for applications requiring real-time talking head generation. Experimental results demonstrate that REST outperforms state-of-the-art methods in both generation speed and overall performance.
77. 【2512.11226】FutureX: Enhance End-to-End Autonomous Driving via Latent Chain-of-Thought World Model
链接:https://arxiv.org/abs/2512.11226
作者:Hongbin Lin,Yiming Yang,Yifan Zhang,Chaoda Zheng,Jie Feng,Sheng Wang,Zhennan Wang,Shijia Chen,Boyang Wang,Yu Zhang,Xianming Liu,Shuguang Cui,Zhen Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:raw sensor data, autonomous driving, planners learn scene, raw sensor, sensor data
备注:
点击查看摘要
Abstract:In autonomous driving, end-to-end planners learn scene representations from raw sensor data and utilize them to generate a motion plan or control actions. However, exclusive reliance on the current scene for motion planning may result in suboptimal responses in highly dynamic traffic environments where ego actions further alter the future scene. To model the evolution of future scenes, we leverage the World Model to represent how the ego vehicle and its environment interact and change over time, which entails complex reasoning. The Chain of Thought (CoT) offers a promising solution by forecasting a sequence of future thoughts that subsequently guide trajectory refinement. In this paper, we propose FutureX, a CoT-driven pipeline that enhances end-to-end planners to perform complex motion planning via future scene latent reasoning and trajectory refinement. Specifically, the Auto-think Switch examines the current scene and decides whether additional reasoning is required to yield a higher-quality motion plan. Once FutureX enters the Thinking mode, the Latent World Model conducts a CoT-guided rollout to predict future scene representation, enabling the Summarizer Module to further refine the motion plan. Otherwise, FutureX operates in an Instant mode to generate motion plans in a forward pass for relatively simple scenes. Extensive experiments demonstrate that FutureX enhances existing methods by producing more rational motion plans and fewer collisions without compromising efficiency, thereby achieving substantial overall performance gains, e.g., 6.2 PDMS improvement for TransFuser on NAVSIM. Code will be released.
78. 【2512.11225】VFMF: World Modeling by Forecasting Vision Foundation Model Features
链接:https://arxiv.org/abs/2512.11225
作者:Gabrijel Boduljak,Yushi Lan,Christian Rupprecht,Andrea Vedaldi
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:partial observations, observations is central, world, Forecasting, VFM features
备注:
点击查看摘要
Abstract:Forecasting from partial observations is central to world modeling. Many recent methods represent the world through images, and reduce forecasting to stochastic video generation. Although such methods excel at realism and visual fidelity, predicting pixels is computationally intensive and not directly useful in many applications, as it requires translating RGB into signals useful for decision making. An alternative approach uses features from vision foundation models (VFMs) as world representations, performing deterministic regression to predict future world states. These features can be directly translated into actionable signals such as semantic segmentation and depth, while remaining computationally efficient. However, deterministic regression averages over multiple plausible futures, undermining forecast accuracy by failing to capture uncertainty. To address this crucial limitation, we introduce a generative forecaster that performs autoregressive flow matching in VFM feature space. Our key insight is that generative modeling in this space requires encoding VFM features into a compact latent space suitable for diffusion. We show that this latent space preserves information more effectively than previously used PCA-based alternatives, both for forecasting and other applications, such as image generation. Our latent predictions can be easily decoded into multiple useful and interpretable output modalities: semantic segmentation, depth, surface normals, and even RGB. With matched architecture and compute, our method produces sharper and more accurate predictions than regression across all modalities. Our results suggest that stochastic conditional generation of VFM features offers a promising and scalable foundation for future world models.
79. 【2512.11218】Seeing to Act, Prompting to Specify: A Bayesian Factorization of Vision Language Action Policy
链接:https://arxiv.org/abs/2512.11218
作者:Kechun Xu,Zhenjie Zhu,Anzhe Chen,Shuqi Zhao,Qing Huang,Yifei Yang,Haojian Lu,Rong Xiong,Masayoshi Tomizuka,Yue Wang
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:backbone during fine-tuning, hindered by catastrophic, VLM, catastrophic forgetting, Vision-Language Model
备注:
点击查看摘要
Abstract:The pursuit of out-of-distribution generalization in Vision-Language-Action (VLA) models is often hindered by catastrophic forgetting of the Vision-Language Model (VLM) backbone during fine-tuning. While co-training with external reasoning data helps, it requires experienced tuning and data-related overhead. Beyond such external dependencies, we identify an intrinsic cause within VLA datasets: modality imbalance, where language diversity is much lower than visual and action diversity. This imbalance biases the model toward visual shortcuts and language forgetting. To address this, we introduce BayesVLA, a Bayesian factorization that decomposes the policy into a visual-action prior, supporting seeing-to-act, and a language-conditioned likelihood, enabling prompt-to-specify. This inherently preserves generalization and promotes instruction following. We further incorporate pre- and post-contact phases to better leverage pre-trained foundation models. Information-theoretic analysis formally validates our effectiveness in mitigating shortcut learning. Extensive experiments show superior generalization to unseen instructions, objects, and environments compared to existing methods. Project page is available at: this https URL.
80. 【2512.11215】SmokeBench: Evaluating Multimodal Large Language Models for Wildfire Smoke Detection
链接:https://arxiv.org/abs/2512.11215
作者:Tianye Qi,Weihao Li,Nick Barnes
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:confounded with clouds, visually confounded, making early-stage detection, smoke, smoke localization
备注: Accepted to WACV 2026
点击查看摘要
Abstract:Wildfire smoke is transparent, amorphous, and often visually confounded with clouds, making early-stage detection particularly challenging. In this work, we introduce a benchmark, called SmokeBench, to evaluate the ability of multimodal large language models (MLLMs) to recognize and localize wildfire smoke in images. The benchmark consists of four tasks: (1) smoke classification, (2) tile-based smoke localization, (3) grid-based smoke localization, and (4) smoke detection. We evaluate several MLLMs, including Idefics2, Qwen2.5-VL, InternVL3, Unified-IO 2, Grounding DINO, GPT-4o, and Gemini-2.5 Pro. Our results show that while some models can classify the presence of smoke when it covers a large area, all models struggle with accurate localization, especially in the early stages. Further analysis reveals that smoke volume is strongly correlated with model performance, whereas contrast plays a comparatively minor role. These findings highlight critical limitations of current MLLMs for safety-critical wildfire monitoring and underscore the need for methods that improve early-stage smoke localization.
81. 【2512.11203】AutoRefiner: Improving Autoregressive Video Diffusion Models via Reflective Refinement Over the Stochastic Sampling Path
链接:https://arxiv.org/abs/2512.11203
作者:Zhengyang Yu,Akio Hayakawa,Masato Ishii,Qingtao Yu,Takashi Shibuya,Jing Zhang,Yuki Mitsufuji
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Autoregressive video diffusion, show strong promise, Autoregressive video, video diffusion models, show strong
备注:
点击查看摘要
Abstract:Autoregressive video diffusion models (AR-VDMs) show strong promise as scalable alternatives to bidirectional VDMs, enabling real-time and interactive applications. Yet there remains room for improvement in their sample fidelity. A promising solution is inference-time alignment, which optimizes the noise space to improve sample fidelity without updating model parameters. Yet, optimization- or search-based methods are computationally impractical for AR-VDMs. Recent text-to-image (T2I) works address this via feedforward noise refiners that modulate sampled noises in a single forward pass. Can such noise refiners be extended to AR-VDMs? We identify the failure of naively extending T2I noise refiners to AR-VDMs and propose AutoRefiner-a noise refiner tailored for AR-VDMs, with two key designs: pathwise noise refinement and a reflective KV-cache. Experiments demonstrate that AutoRefiner serves as an efficient plug-in for AR-VDMs, effectively enhancing sample fidelity by refining noise along stochastic denoising paths.
82. 【2512.11199】CADKnitter: Compositional CAD Generation from Text and Geometry Guidance
链接:https://arxiv.org/abs/2512.11199
作者:Tri Le,Khang Nguyen,Baoru Huang,Tung D. Ta,Anh Nguyen
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Crafting computer-aided design, Crafting computer-aided, demanding both precision, expertise from designers, CAD
备注:
点击查看摘要
Abstract:Crafting computer-aided design (CAD) models has long been a painstaking and time-intensive task, demanding both precision and expertise from designers. With the emergence of 3D generation, this task has undergone a transformative impact, shifting not only from visual fidelity to functional utility but also enabling editable CAD designs. Prior works have achieved early success in single-part CAD generation, which is not well-suited for real-world applications, as multiple parts need to be assembled under semantic and geometric constraints. In this paper, we propose CADKnitter, a compositional CAD generation framework with a geometry-guided diffusion sampling strategy. CADKnitter is able to generate a complementary CAD part that follows both the geometric constraints of the given CAD model and the semantic constraints of the desired design text prompt. We also curate a dataset, so-called KnitCAD, containing over 310,000 samples of CAD models, along with textual prompts and assembly metadata that provide semantic and geometric constraints. Intensive experiments demonstrate that our proposed method outperforms other state-of-the-art baselines by a clear margin.
83. 【2512.11194】Beyond Memorization: Gradient Projection Enables Selective Learning in Diffusion Models
链接:https://arxiv.org/abs/2512.11194
作者:Divya Kothandaraman,Jaclyn Pytlarz
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:intellectual property risks, poses significant security, enabling adversarial attribute, models poses significant, property risks
备注:
点击查看摘要
Abstract:Memorization in large-scale text-to-image diffusion models poses significant security and intellectual property risks, enabling adversarial attribute extraction and the unauthorized reproduction of sensitive or proprietary features. While conventional dememorization techniques, such as regularization and data filtering, limit overfitting to specific training examples, they fail to systematically prevent the internalization of prohibited concept-level features. Simply discarding all images containing a sensitive feature wastes invaluable training data, necessitating a method for selective unlearning at the concept level. To address this, we introduce a Gradient Projection Framework designed to enforce a stringent requirement of concept-level feature exclusion. Our defense operates during backpropagation by systematically identifying and excising training signals aligned with embeddings of prohibited attributes. Specifically, we project each gradient update onto the orthogonal complement of the sensitive feature's embedding space, thereby zeroing out its influence on the model's weights. Our method integrates seamlessly into standard diffusion model training pipelines and complements existing defenses. We analyze our method against an adversary aiming for feature extraction. In extensive experiments, we demonstrate that our framework drastically reduces memorization while rigorously preserving generation quality and semantic fidelity. By reframing memorization control as selective learning, our approach establishes a new paradigm for IP-safe and privacy-preserving generative AI.
Subjects:
Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2512.11194 [cs.LG]
(or
arXiv:2512.11194v1 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2512.11194
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
84. 【2512.11189】Multi-task Learning with Extended Temporal Shift Module for Temporal Action Localization
链接:https://arxiv.org/abs/2512.11189
作者:Anh-Kiet Duong,Petra Gomez-Krämer
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:multi-modal video settings, temporal action localization, Temporal Shift Module, video settings, Challenge at ICCV
备注: BinEgo360@ICCV25
点击查看摘要
Abstract:We present our solution to the BinEgo-360 Challenge at ICCV 2025, which focuses on temporal action localization (TAL) in multi-perspective and multi-modal video settings. The challenge provides a dataset containing panoramic, third-person, and egocentric recordings, annotated with fine-grained action classes. Our approach is built on the Temporal Shift Module (TSM), which we extend to handle TAL by introducing a background class and classifying fixed-length non-overlapping intervals. We employ a multi-task learning framework that jointly optimizes for scene classification and TAL, leveraging contextual cues between actions and environments. Finally, we integrate multiple models through a weighted ensemble strategy, which improves robustness and consistency of predictions. Our method is ranked first in both the initial and extended rounds of the competition, demonstrating the effectiveness of combining multi-task learning, an efficient backbone, and ensemble learning for TAL.
85. 【2512.11186】Lightweight 3D Gaussian Splatting Compression via Video Codec
链接:https://arxiv.org/abs/2512.11186
作者:Qi Yang,Geert Van Der Auwera,Zhu Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Linear Assignment Sorting, Parallel Linear Assignment, Assignment Sorting, Parallel Linear, Linear Assignment
备注: Accepted by DCC2026 Oral
点击查看摘要
Abstract:Current video-based GS compression methods rely on using Parallel Linear Assignment Sorting (PLAS) to convert 3D GS into smooth 2D maps, which are computationally expensive and time-consuming, limiting the application of GS on lightweight devices. In this paper, we propose a Lightweight 3D Gaussian Splatting (GS) Compression method based on Video codec (LGSCV). First, a two-stage Morton scan is proposed to generate blockwise 2D maps that are friendly for canonical video codecs in which the coding units (CU) are square blocks. A 3D Morton scan is used to permute GS primitives, followed by a 2D Morton scan to map the ordered GS primitives to 2D maps in a blockwise style. However, although the blockwise 2D maps report close performance to the PLAS map in high-bitrate regions, they show a quality collapse at medium-to-low bitrates. Therefore, a principal component analysis (PCA) is used to reduce the dimensionality of spherical harmonics (SH), and a MiniPLAS, which is flexible and fast, is designed to permute the primitives within certain block sizes. Incorporating SH PCA and MiniPLAS leads to a significant gain in rate-distortion (RD) performance, especially at medium and low bitrates. MiniPLAS can also guide the setting of the codec CU size configuration and significantly reduce encoding time. Experimental results on the MPEG dataset demonstrate that the proposed LGSCV achieves over 20% RD gain compared with state-of-the-art methods, while reducing 2D map generation time to approximately 1 second and cutting encoding time by 50%. The code is available at this https URL .
86. 【2512.11167】Image Tiling for High-Resolution Reasoning: Balancing Local Detail with Global Context
链接:https://arxiv.org/abs/2512.11167
作者:Anatole Jacquin de Margerie,Alexis Roger,Irina Rish
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:lack transparent implementation, Reproducibility remains, accessible training infrastructure, transparent implementation details, complex multimodal models
备注: Accepted in AAAI 2025 Workshop on Reproducible AI
点击查看摘要
Abstract:Reproducibility remains a cornerstone of scientific progress, yet complex multimodal models often lack transparent implementation details and accessible training infrastructure. In this work, we present a detailed reproduction and critical analysis of the Monkey Vision-Language Model (VLM) (Li et al. 2023b) published in CVPR24, a recent approach to high-resolution image understanding via image tiling. The original paper proposed splitting large images into tiles to recover fine-grained visual details while maintaining computational efficiency. Our study replicates this strategy using open checkpoints and reimplements the training pipeline. We confirm the key finding of the original Monkey VLM work, namely that tiling effectively recovers local details. We then extend this work further, by investigating the effect of the inclusion of the global context, which provide practical insights for future high-resolution multimodal modeling. However, we also report deviations in the results, with the magnitude of these effects depending heavily on task type and tile granularity.
87. 【2512.11145】Autoencoder-based Semi-Supervised Dimensionality Reduction and Clustering for Scientific Ensembles
链接:https://arxiv.org/abs/2512.11145
作者:Lennard Manuel,Hamid Gadirov,Steffen Frey
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:poses significant challenges, complexity poses significant, Analyzing and visualizing, scientific ensemble datasets, visualizing scientific ensemble
备注: Research Internship Project
点击查看摘要
Abstract:Analyzing and visualizing scientific ensemble datasets with high dimensionality and complexity poses significant challenges. Dimensionality reduction techniques and autoencoders are powerful tools for extracting features, but they often struggle with such high-dimensional data. This paper presents an enhanced autoencoder framework that incorporates a clustering loss, based on the soft silhouette score, alongside a contrastive loss to improve the visualization and interpretability of ensemble datasets. First, EfficientNetV2 is used to generate pseudo-labels for the unlabeled portions of the scientific ensemble datasets. By jointly optimizing the reconstruction, clustering, and contrastive objectives, our method encourages similar data points to group together while separating distinct clusters in the latent space. UMAP is subsequently applied to this latent representation to produce 2D projections, which are evaluated using the silhouette score. Multiple types of autoencoders are evaluated and compared based on their ability to extract meaningful features. Experiments on two scientific ensemble datasets - channel structures in soil derived from Markov chain Monte Carlo, and droplet-on-film impact dynamics - show that models incorporating clustering or contrastive loss marginally outperform the baseline approaches.
88. 【2512.11141】Learning complete and explainable visual representations from itemized text supervision
链接:https://arxiv.org/abs/2512.11141
作者:Yiwei Lyu,Chenhui Zhao,Soumyanil Banerjee,Shixuan Liu,Akshay Rao,Akhil Kondepudi,Honglak Lee,Todd C. Hollon
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Training vision models, Training vision, language supervision enables, supervision enables general, vision models
备注:
点击查看摘要
Abstract:Training vision models with language supervision enables general and transferable representations. However, many visual domains, especially non-object-centric domains such as medical imaging and remote sensing, contain itemized text annotations: multiple text items describing distinct and semantically independent findings within a single image. Such supervision differs from standard multi-caption supervision, where captions are redundant or highly overlapping. Here, we introduce ItemizedCLIP, a framework for learning complete and explainable visual representations from itemized text supervision. ItemizedCLIP employs a cross-attention module to produce text item-conditioned visual embeddings and a set of tailored objectives that jointly enforce item independence (distinct regions for distinct items) and representation completeness (coverage of all items). Across four domains with naturally itemized text supervision (brain MRI, head CT, chest CT, remote sensing) and one additional synthetically itemized dataset, ItemizedCLIP achieves substantial improvements in zero-shot performance and fine-grained interpretability over baselines. The resulting ItemizedCLIP representations are semantically grounded, item-differentiable, complete, and visually interpretable. Our code is available at this https URL.
89. 【2512.11130】Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching
链接:https://arxiv.org/abs/2512.11130
作者:Bowen Wen,Shaurya Dewan,Stan Birchfield
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:remain computationally prohibitive, strong zero-shot generalization, remain computationally, computationally prohibitive, Stereo foundation models
备注:
点击查看摘要
Abstract:Stereo foundation models achieve strong zero-shot generalization but remain computationally prohibitive for real-time applications. Efficient stereo architectures, on the other hand, sacrifice robustness for speed and require costly per-domain fine-tuning. To bridge this gap, we present Fast-FoundationStereo, a family of architectures that achieve, for the first time, strong zero-shot generalization at real-time frame rate. We employ a divide-and-conquer acceleration strategy with three components: (1) knowledge distillation to compress the hybrid backbone into a single efficient student; (2) blockwise neural architecture search for automatically discovering optimal cost filtering designs under latency budgets, reducing search complexity exponentially; and (3) structured pruning for eliminating redundancy in the iterative refinement module. Furthermore, we introduce an automatic pseudo-labeling pipeline used to curate 1.4M in-the-wild stereo pairs to supplement synthetic training data and facilitate knowledge distillation. The resulting model can run over 10x faster than FoundationStereo while closely matching its zero-shot accuracy, thus establishing a new state-of-the-art among real-time methods. Project page: this https URL
90. 【2512.11121】Learning from a Generative Oracle: Domain Adaptation for Restoration
链接:https://arxiv.org/abs/2512.11121
作者:Yuyang Hu,Mojtaba Sahraee-Ardakan,Arpit Bansal,Kangfu Mei,Christian Qi,Peyman Milanfar,Mauricio Delbracio
类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
关键词:degradations due, due to significant, significant domain gaps, Pre-trained image restoration, Pre-trained image
备注:
点击查看摘要
Abstract:Pre-trained image restoration models often fail on real-world, out-of-distribution degradations due to significant domain gaps. Adapting to these unseen domains is challenging, as out-of-distribution data lacks ground truth, and traditional adaptation methods often require complex architectural changes. We propose LEGO (Learning from a Generative Oracle), a practical three-stage framework for post-training domain adaptation without paired data. LEGO converts this unsupervised challenge into a tractable pseudo-supervised one. First, we obtain initial restorations from the pre-trained model. Second, we leverage a frozen, large-scale generative oracle to refine these estimates into high-quality pseudo-ground-truths. Third, we fine-tune the original model using a mixed-supervision strategy combining in-distribution data with these new pseudo-pairs. This approach adapts the model to the new distribution without sacrificing its original robustness or requiring architectural modifications. Experiments demonstrate that LEGO effectively bridges the domain gap, significantly improving performance on diverse real-world benchmarks.
91. 【2512.11104】Information-driven Fusion of Pathology Foundation Models for Enhanced Disease Characterization
链接:https://arxiv.org/abs/2512.11104
作者:Brennan Flannery,Thomas DeSilvio,Jane Nguyen,Satish E. Viswanath
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Foundation models, demonstrated strong performance, demonstrated strong, diverse pathology tasks, Foundation
备注: 29 Pages, 10 figures
点击查看摘要
Abstract:Foundation models (FMs) have demonstrated strong performance across diverse pathology tasks. While there are similarities in the pre-training objectives of FMs, there is still limited understanding of their complementarity, redundancy in embedding spaces, or biological interpretation of features. In this study, we propose an information-driven, intelligent fusion strategy for integrating multiple pathology FMs into a unified representation and systematically evaluate its performance for cancer grading and staging across three distinct diseases. Diagnostic HE whole-slide images from kidney (519 slides), prostate (490 slides), and rectal (200 slides) cancers were dichotomized into low versus high grade or stage. Both tile-level FMs (Conch v1.5, MUSK, Virchow2, H-Optimus1, Prov-Gigapath) and slide-level FMs (TITAN, CHIEF, MADELEINE) were considered to train downstream classifiers. We then evaluated three FM fusion schemes at both tile and slide levels: majority-vote ensembling, naive feature concatenation, and intelligent fusion based on correlation-guided pruning of redundant features. Under patient-stratified cross-validation with hold-out testing, intelligent fusion of tile-level embeddings yielded consistent gains in classification performance across all three cancers compared with the best single FMs and naive fusion. Global similarity metrics revealed substantial alignment of FM embedding spaces, contrasted by lower local neighborhood agreement, indicating complementary fine-grained information across FMs. Attention maps showed that intelligent fusion yielded concentrated attention on tumor regions while reducing spurious focus on benign regions. Our findings suggest that intelligent, correlation-guided fusion of pathology FMs can yield compact, task-tailored representations that enhance both predictive performance and interpretability in downstream computational pathology tasks.
92. 【2512.11099】VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction
链接:https://arxiv.org/abs/2512.11099
作者:Weitai Kang,Jason Kuen,Mengwei Ren,Zijun Wei,Yan Yan,Kangning Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal Large Language, Large Language Model, LLM pretrained reasoning, Multimodal Large, Large Language
备注: 8 pages
点击查看摘要
Abstract:Current visual grounding models are either based on a Multimodal Large Language Model (MLLM) that performs auto-regressive decoding, which is slow and risks hallucinations, or on re-aligning an LLM with vision features to learn new special or object tokens for grounding, which may undermine the LLM's pretrained reasoning ability. In contrast, we propose VGent, a modular encoder-decoder architecture that explicitly disentangles high-level reasoning and low-level bounding box prediction. Specifically, a frozen MLLM serves as the encoder to provide untouched powerful reasoning capabilities, while a decoder takes high-quality boxes proposed by detectors as queries and selects target box(es) via cross-attending on encoder's hidden states. This design fully leverages advances in both object detection and MLLM, avoids the pitfalls of auto-regressive decoding, and enables fast inference. Moreover, it supports modular upgrades of both the encoder and decoder to benefit the whole system: we introduce (i) QuadThinker, an RL-based training paradigm for enhancing multi-target reasoning ability of the encoder; (ii) mask-aware label for resolving detection-segmentation ambiguity; and (iii) global target recognition to improve the recognition of all the targets which benefits the selection among augmented proposals. Experiments on multi-target visual grounding benchmarks show that VGent achieves a new state-of-the-art with +20.6% F1 improvement over prior methods, and further boosts gIoU by +8.2% and cIoU by +5.8% under visual reference challenges, while maintaining constant, fast inference latency.
93. 【2512.11098】Vision-Language Models for Infrared Industrial Sensing in Additive Manufacturing Scene Description
链接:https://arxiv.org/abs/2512.11098
作者:Nazanin Mahjourian,Vinh Nguyen
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:manufacturing environments operate, vision systems struggle, conventional vision systems, operate in low-light, low-light conditions
备注:
点击查看摘要
Abstract:Many manufacturing environments operate in low-light conditions or within enclosed machines where conventional vision systems struggle. Infrared cameras provide complementary advantages in such environments. Simultaneously, supervised AI systems require large labeled datasets, which makes zero-shot learning frameworks more practical for applications including infrared cameras. Recent advances in vision-language foundation models (VLMs) offer a new path in zero-shot predictions from paired image-text representations. However, current VLMs cannot understand infrared camera data since they are trained on RGB data. This work introduces VLM-IRIS (Vision-Language Models for InfraRed Industrial Sensing), a zero-shot framework that adapts VLMs to infrared data by preprocessing infrared images captured by a FLIR Boson sensor into RGB-compatible inputs suitable for CLIP-based encoders. We demonstrate zero-shot workpiece presence detection on a 3D printer bed where temperature differences between the build plate and workpieces make the task well-suited for thermal imaging. VLM-IRIS converts the infrared images to magma representation and applies centroid prompt ensembling with a CLIP ViT-B/32 encoder to achieve high accuracy on infrared images without any model retraining. These findings demonstrate that the proposed improvements to VLMs can be effectively extended to thermal applications for label-free monitoring.
94. 【2512.11076】E-CHUM: Event-based Cameras for Human Detection and Urban Monitoring
链接:https://arxiv.org/abs/2512.11076
作者:Jack Brady,Andrew Dailey,Kristen Schang,Zo Vic Shong
类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
关键词:Understanding human movement, event-based cameras, cameras, event-based, human movement
备注:
点击查看摘要
Abstract:Understanding human movement and city dynamics has always been challenging. From traditional methods of manually observing the city's inhabitant, to using cameras, to now using sensors and more complex technology, the field of urban monitoring has evolved greatly. Still, there are more that can be done to unlock better practices for understanding city dynamics. This paper surveys how the landscape of urban dynamics studying has evolved with a particular focus on event-based cameras. Event-based cameras capture changes in light intensity instead of the RGB values that traditional cameras do. They offer unique abilities, like the ability to work in low-light, that can make them advantageous compared to other sensors. Through an analysis of event-based cameras, their applications, their advantages and challenges, and machine learning applications, we propose event-based cameras as a medium for capturing information to study urban dynamics. They offer the ability to capture important information while maintaining privacy. We also suggest multi-sensor fusion of event-based cameras and other sensors in the study of urban dynamics. Combining event-based cameras and infrared, event-LiDAR, or vibration has to potential to enhance the ability of event-based cameras and overcome the challenges that event-based cameras have.
95. 【2512.11061】VDAWorld: World Modelling via VLM-Directed Abstraction and Simulation
链接:https://arxiv.org/abs/2512.11061
作者:Felix O'Mahony,Roberto Cipolla,Ayush Tewari
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:face fundamental limitations, Generative video models, Generative video, face fundamental, fundamental limitations
备注: Website: [this https URL](https://felixomahony.github.io/vdaworld/)
点击查看摘要
Abstract:Generative video models, a leading approach to world modeling, face fundamental limitations. They often violate physical and logical rules, lack interactivity, and operate as opaque black boxes ill-suited for building structured, queryable worlds. To overcome these challenges, we propose a new paradigm focused on distilling an image caption pair into a tractable, abstract representation optimized for simulation. We introduce VDAWorld, a framework where a Vision-Language Model (VLM) acts as an intelligent agent to orchestrate this process. The VLM autonomously constructs a grounded (2D or 3D) scene representation by selecting from a suite of vision tools, and accordingly chooses a compatible physics simulator (e.g., rigid body, fluid) to act upon it. VDAWorld can then infer latent dynamics from the static scene to predict plausible future states. Our experiments show that this combination of intelligent abstraction and adaptive simulation results in a versatile world model capable of producing high quality simulations across a wide range of dynamic scenarios.
96. 【2512.11060】Synthetic Vasculature and Pathology Enhance Vision-Language Model Reasoning
链接:https://arxiv.org/abs/2512.11060
作者:Chenjun Li,Cheng Wan,Laurin Lux,Alexander Berger,Richard B. Rosen,Martin J. Menten,Johannes C. Paetzold
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Vision-Language Models, interpretable medical diagnosis, Coherence Tomography Angiography, offer a promising, explanations alongside predictions
备注: 23 pages, 8 figures, 6 tables. Full paper under review for MIDL 2026 (Medical Imaging with Deep Learning)
点击查看摘要
Abstract:Vision-Language Models (VLMs) offer a promising path toward interpretable medical diagnosis by allowing users to ask about clinical explanations alongside predictions and across different modalities. However, training VLMs for detailed reasoning requires large-scale image-text datasets. In many specialized domains, for example in reading Optical Coherence Tomography Angiography (OCTA) images, such precise text with grounded description of pathologies is scarce or even non-existent. To overcome this bottleneck, we introduce Synthetic Vasculature Reasoning (SVR), a framework that controllably synthesizes images and corresponding text, specifically: realistic retinal vasculature with Diabetic Retinopathy (DR) features: capillary dropout, microaneurysms, neovascularization, and tortuosity, while automatically generating granular reasoning texts. Based on this we curate OCTA-100K-SVR, an OCTA image-reasoning dataset with 100,000 pairs. Our experiments show that a general-purpose VLM (Qwen3-VL-8b) trained on the dataset achieves a zero-shot balanced classification accuracy of 89.67% on real OCTA images, outperforming supervised baselines. Through human expert evaluation we also demonstrate that it significantly enhances explanation quality and pathology localization on clinical data.
97. 【2512.11057】Weakly Supervised Tuberculosis Localization in Chest X-rays through Knowledge Distillation
链接:https://arxiv.org/abs/2512.11057
作者:Marshal Ashif Shawkat,Moidul Hasan,Taufiq Hasan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:mortality worldwide, resource-limited countries, Tuberculosis, Chest X-ray, requires expert interpretation
备注: 18 pages, 9 figures, 4 tables
点击查看摘要
Abstract:Tuberculosis (TB) remains one of the leading causes of mortality worldwide, particularly in resource-limited countries. Chest X-ray (CXR) imaging serves as an accessible and cost-effective diagnostic tool but requires expert interpretation, which is often unavailable. Although machine learning models have shown high performance in TB classification, they often depend on spurious correlations and fail to generalize. Besides, building large datasets featuring high-quality annotations for medical images demands substantial resources and input from domain specialists, and typically involves several annotators reaching agreement, which results in enormous financial and logistical expenses. This study repurposes knowledge distillation technique to train CNN models reducing spurious correlations and localize TB-related abnormalities without requiring bounding-box annotations. By leveraging a teacher-student framework with ResNet50 architecture, the proposed method trained on TBX11k dataset achieve impressive 0.2428 mIOU score. Experimental results further reveal that the student model consistently outperforms the teacher, underscoring improved robustness and potential for broader clinical deployment in diverse settings.
98. 【2512.11047】WholeBodyVLA: Towards Unified Latent VLA for Whole-Body Loco-Manipulation Control
链接:https://arxiv.org/abs/2512.11047
作者:Haoran Jiang,Jin Chen,Qingwen Bu,Li Chen,Modi Shi,Yanjie Zhang,Delong Li,Chuanzhe Suo,Chuang Wang,Zhihui Peng,Hongyang Li
类目:Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:robots require precise, require precise locomotion, perform challenging loco-manipulation, require precise, dexterous manipulation
备注:
点击查看摘要
Abstract:Humanoid robots require precise locomotion and dexterous manipulation to perform challenging loco-manipulation tasks. Yet existing approaches, modular or end-to-end, are deficient in manipulation-aware locomotion. This confines the robot to a limited workspace, preventing it from performing large-space loco-manipulation. We attribute this to: (1) the challenge of acquiring loco-manipulation knowledge due to the scarcity of humanoid teleoperation data, and (2) the difficulty of faithfully and reliably executing locomotion commands, stemming from the limited precision and stability of existing RL controllers. To acquire richer loco-manipulation knowledge, we propose a unified latent learning framework that enables Vision-Language-Action (VLA) system to learn from low-cost action-free egocentric videos. Moreover, an efficient human data collection pipeline is devised to augment the dataset and scale the benefits. To more precisely execute the desired locomotion commands, we present a loco-manipulation-oriented (LMO) RL policy specifically tailored for accurate and stable core loco-manipulation movements, such as advancing, turning, and squatting. Building on these components, we introduce WholeBodyVLA, a unified framework for humanoid loco-manipulation. To the best of our knowledge, WholeBodyVLA is one of its kind enabling large-space humanoid loco-manipulation. It is verified via comprehensive experiments on the AgiBot X2 humanoid, outperforming prior baseline by 21.3%. It also demonstrates strong generalization and high extensibility across a broad range of tasks.
99. 【2512.11016】SoccerMaster: A Vision Foundation Model for Soccer Understanding
链接:https://arxiv.org/abs/2512.11016
作者:Haolin Yang,Jiayuan Rao,Haoning Wu,Weidi Xie
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:recently garnered growing, garnered growing research, growing research interest, research interest due, unique challenges
备注:
点击查看摘要
Abstract:Soccer understanding has recently garnered growing research interest due to its domain-specific complexity and unique challenges. Unlike prior works that typically rely on isolated, task-specific expert models, this work aims to propose a unified model to handle diverse soccer visual understanding tasks, ranging from fine-grained perception (e.g., athlete detection) to semantic reasoning (e.g., event classification). Specifically, our contributions are threefold: (i) we present SoccerMaster, the first soccer-specific vision foundation model that unifies diverse understanding tasks within a single framework via supervised multi-task pretraining; (ii) we develop an automated data curation pipeline to generate scalable spatial annotations, and integrate them with various existing soccer video datasets to construct SoccerFactory, a comprehensive pretraining data resource; and (iii) we conduct extensive evaluations demonstrating that SoccerMaster consistently outperforms task-specific expert models across diverse downstream tasks, highlighting its breadth and superiority. The data, code, and model will be publicly available.
100. 【2512.11015】Leveraging Text Guidance for Enhancing Demographic Fairness in Gender Classification
链接:https://arxiv.org/abs/2512.11015
作者:Anoop Krishnan
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Image Text Matching, artificial intelligence, Image Text fusion, Image Text, text guided methodologies
备注:
点击查看摘要
Abstract:In the quest for fairness in artificial intelligence, novel approaches to enhance it in facial image based gender classification algorithms using text guided methodologies are presented. The core methodology involves leveraging semantic information from image captions during model training to improve generalization capabilities. Two key strategies are presented: Image Text Matching (ITM) guidance and Image Text fusion. ITM guidance trains the model to discern fine grained alignments between images and texts to obtain enhanced multimodal representations. Image text fusion combines both modalities into comprehensive representations for improved fairness. Exensive experiments conducted on benchmark datasets demonstrate these approaches effectively mitigate bias and improve accuracy across gender racial groups compared to existing methods. Additionally, the unique integration of textual guidance underscores an interpretable and intuitive training paradigm for computer vision systems. By scrutinizing the extent to which semantic information reduces disparities, this research offers valuable insights into cultivating more equitable facial analysis algorithms. The proposed methodologies contribute to addressing the pivotal challenge of demographic bias in gender classification from facial images. Furthermore, this technique operates in the absence of demographic labels and is application agnostic.
101. 【2512.10966】Multimodal Fusion of Regional Brain Experts for Interpretable Alzheimer's Disease Diagnosis
链接:https://arxiv.org/abs/2512.10966
作者:Farica Zhuang,Dinara Aliyeva,Shu Yang,Zixuan Wen,Duy Duong-Tran,Christos Davatzikos,Tianlong Chen,Song Wang,Li Shen
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
关键词:mirroring clinical practice, integrating complementary information, Accurate and early, multiple modalities, mirroring clinical
备注:
点击查看摘要
Abstract:Accurate and early diagnosis of Alzheimer's disease (AD) can benefit from integrating complementary information from multiple modalities, mirroring clinical practice. However, conventional fusion approaches often rely on simple concatenation of features, which cannot adaptively balance the contributions of biomarkers such as amyloid PET and MRI across brain regions. In this work, we propose MREF-AD, a Multimodal Regional Expert Fusion model for AD diagnosis. It is a Mixture-of-Experts (MoE) framework that models meso-scale brain regions in each modality as an independent expert and employs two-level gating networks to learn subject-specific fusion weights. Beyond improving diagnostic performance, MREF-AD provides modality- and region-level insight into how structural and molecular imaging jointly contribute to disease diagnosis. Using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), MREF-AD achieves state-of-the-art performance over baselines while providing enhanced interpretability of brain region-specific biomarker relevance, underscoring its utility as a general framework for adaptive and interpretable multimodal fusion in neuroimaging.
102. 【2512.05103】V2TV: A Unified Framework for Interleaved Language and Video Generation
链接:https://arxiv.org/abs/2512.05103
作者:Xiaochuang Han,Youssef Emad,Melissa Hall,John Nguyen,Karthik Padthe,Liam Robbins,Amir Bar,Delong Chen,Michal Drozdzal,Maha Elbayad,Yushi Hu,Shang-Wen Li,Sreya Dutta Roy,Jakob Verbeek,XuDong Wang,Marjan Ghazvininejad,Luke Zettlemoyer,Emily Dinan
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:require significant semantic, significant semantic branching, repeated high-level reasoning, Video generation, rapidly advancing
备注:
点击查看摘要
Abstract:Video generation models are rapidly advancing, but can still struggle with complex video outputs that require significant semantic branching or repeated high-level reasoning about what should happen next. In this paper, we introduce a new class of omni video-text models that integrate ideas from recent LM reasoning advances to address this challenge. More specifically, we present TV2TV, a unified generative modeling framework which decomposes video generation into an interleaved text and video generation process. TV2TV jointly learns language modeling (next-token prediction) and video flow matching (next-frame prediction) using a Mixture-of-Transformers (MoT) architecture. At inference time, TV2TV decides when to alternate between generating text and video frames, allowing the model to "think in words" about subsequent content before ``acting in pixels'' to produce frames. This design offloads much of the responsibility for deciding what should happen next to the language modeling tower, enabling improved visual quality and prompt alignment of generated videos. It also enables fine-grained controllability, allowing users to modify the video generation trajectory through text interventions at any point in the process. In controlled experiments on video game data, TV2TV demonstrates substantial improvements in both visual quality and controllability. TV2TV also scales to natural videos, as we show by augmenting sports videos with interleaved natural language action descriptions using vision-language models (VLMs). Training TV2TV on this corpus yields strong visual quality and prompt alignment, showcasing the model's ability to reason about and generate complex real-world action sequences. Together, these results highlight TV2TV as a promising step toward video generation with open-ended textual reasoning and control.
103. 【2512.11745】mViSE: A Visual Search Engine for Analyzing Multiplex IHC Brain Tissue Images
链接:https://arxiv.org/abs/2512.11745
作者:Liqiang Huang,Rachel W. Mills,Saikiran Mandula,Lin Bai,Mahtab Jeyhani,John Redell,Hien Van Nguyen,Saurabh Prasad,Dragan Maric,Badrinath Roysam
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:Whole-slide multiplex imaging, require custom software, generates massive information-dense, massive information-dense images, Whole-slide multiplex
备注:
点击查看摘要
Abstract:Whole-slide multiplex imaging of brain tissue generates massive information-dense images that are challenging to analyze and require custom software. We present an alternative query-driven programming-free strategy using a multiplex visual search engine (mViSE) that learns the multifaceted brain tissue chemoarchitecture, cytoarchitecture, and myeloarchitecture. Our divide-and-conquer strategy organizes the data into panels of related molecular markers and uses self-supervised learning to train a multiplex encoder for each panel with explicit visual confirmation of successful learning. Multiple panels can be combined to process visual queries for retrieving similar communities of individual cells or multicellular niches using information-theoretic methods. The retrievals can be used for diverse purposes including tissue exploration, delineating brain regions and cortical cell layers, profiling and comparing brain regions without computer programming. We validated mViSE's ability to retrieve single cells, proximal cell pairs, tissue patches, delineate cortical layers, brain regions and sub-regions. mViSE is provided as an open-source QuPath plug-in.
104. 【2512.11695】Particle Image Velocimetry Refinement via Consensus ADMM
链接:https://arxiv.org/abs/2512.11695
作者:Alan Bonomi,Francesco Banelli,Antonio Terpin
类目:Fluid Dynamics (physics.flu-dyn); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Optimization and Control (math.OC)
关键词:Particle Image Velocimetry, tracer particles immersed, buoyant tracer particles, neutrally buoyant tracer, quantifies flow fields
备注: Code: [this https URL](https://github.com/antonioterpin/flowgym)
点击查看摘要
Abstract:Particle Image Velocimetry (PIV) is an imaging technique in experimental fluid dynamics that quantifies flow fields around bluff bodies by analyzing the displacement of neutrally buoyant tracer particles immersed in the fluid. Traditional PIV approaches typically depend on tuning parameters specific to the imaging setup, making the performance sensitive to variations in illumination, flow conditions, and seeding density. On the other hand, even state-of-the-art machine learning methods for flow quantification are fragile outside their training set. In our experiments, we observed that flow quantification would improve if different tunings (or algorithms) were applied to different regions of the same image pair. In this work, we parallelize the instantaneous flow quantification with multiple algorithms and adopt a consensus framework based on the alternating direction method of multipliers, seamlessly incorporating priors such as smoothness and incompressibility. We perform several numerical experiments to demonstrate the benefits of this approach. For instance, we achieve a decrease in end-point-error of up to 20% of a dense-inverse-search estimator at an inference rate of 60Hz, and we show how this performance boost can be increased further with outlier rejection. Our method is implemented in JAX, effectively exploiting hardware acceleration, and integrated in Flow Gym, enabling (i) reproducible comparisons with the state-of-the-art, (ii) testing different base algorithms, (iii) straightforward deployment for active fluids control applications.
105. 【2512.11676】Stochastics of shapes and Kunita flows
链接:https://arxiv.org/abs/2512.11676
作者:Stefan Sommer,Gefan Yang,Elizabeth Louise Baker
类目:Probability (math.PR); Computer Vision and Pattern Recognition (cs.CV)
关键词:including evolutionary biology, applications including evolutionary, evolutionary biology, Stochastic processes, stochastic shape processes
备注:
点击查看摘要
Abstract:Stochastic processes of evolving shapes are used in applications including evolutionary biology, where morphology changes stochastically as a function of evolutionary processes. Due to the non-linear and often infinite-dimensional nature of shape spaces, the mathematical construction of suitable stochastic shape processes is far from immediate. We define and formalize properties that stochastic shape processes should ideally satisfy to be compatible with the shape structure, and we link this to Kunita flows that, when acting on shape spaces, induce stochastic processes that satisfy these criteria by their construction. We couple this with a survey of other relevant shape stochastic processes and show how bridge sampling techniques can be used to condition shape stochastic processes on observed data thereby allowing for statistical inference of parameters of the stochastic dynamics.




