本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。
统计
今日共更新438篇论文,其中:
- 自然语言处理41篇
- 信息检索11篇
- 计算机视觉116篇
自然语言处理
1. 【2512.17901】When Reasoning Meets Its Laws
链接:https://arxiv.org/abs/2512.17901
作者:Junyu Zhang,Yifan Sun,Tianang Leng,Jingyan Shen,Liu Ziyin,Paul Pu Liang,Huan Zhang
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:suboptimal reasoning capabilities, Reasoning, Large Reasoning Models, leading to suboptimal, reasoning behaviors
备注:
点击查看摘要
Abstract:Despite the superior performance of Large Reasoning Models (LRMs), their reasoning behaviors are often counterintuitive, leading to suboptimal reasoning capabilities. To theoretically formalize the desired reasoning behaviors, this paper presents the Laws of Reasoning (LoRe), a unified framework that characterizes intrinsic reasoning patterns in LRMs. We first propose compute law with the hypothesis that the reasoning compute should scale linearly with question complexity. Beyond compute, we extend LoRe with a supplementary accuracy law. Since the question complexity is difficult to quantify in practice, we examine these hypotheses by two properties of the laws, monotonicity and compositionality. We therefore introduce LoRe-Bench, a benchmark that systematically measures these two tractable properties for large reasoning models. Evaluation shows that most reasoning models exhibit reasonable monotonicity but lack compositionality. In response, we develop an effective finetuning approach that enforces compute-law compositionality. Extensive empirical studies demonstrate that better compliance with compute laws yields consistently improved reasoning performance on multiple benchmarks, and uncovers synergistic effects across properties and laws. Project page: this https URL
2. 【2512.17843】ShareChat: A Dataset of Chatbot Conversations in the Wild
链接:https://arxiv.org/abs/2512.17843
作者:Yueru Yan,Tuc Nguyen,Bo Su,Melissa Lieffers,Thai Le
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
关键词:Large Language Models, generic text generators, unique interface designs, datasets treat models, existing public datasets
备注:
点击查看摘要
Abstract:While Large Language Models (LLMs) have evolved into distinct platforms with unique interface designs and capabilities, existing public datasets treat models as generic text generators, stripping away the interface context that actively shapes user interaction. To address this limitation, we present ShareChat, a large-scale, cross-platform corpus comprising 142,808 conversations and over 660,000 turns collected from publicly shared URLs across five major platforms: ChatGPT, Claude, Gemini, Perplexity, and Grok. ShareChat distinguishes itself by preserving native platform affordances often lost in standard logs, including reasoning traces, source links, and code artifacts, while spanning 101 languages over the period from April 2023 to October 2025. Furthermore, ShareChat offers substantially longer context windows and greater interaction depth than prior datasets. We demonstrate the dataset's multifaceted utility through three representative analyses: (1) analyzing conversation completeness to measure user intent satisfaction; (2) evaluating source citation behaviors in content generation; and (3) conducting temporal analysis to track evolving usage patterns. This work provides the community with a vital and timely resource for understanding authentic user-LLM chatbot interactions in the wild.
3. 【2512.17776】DEER: A Comprehensive and Reliable Benchmark for Deep-Research Expert Reports
链接:https://arxiv.org/abs/2512.17776
作者:Janghoon Han,Heegyu Kim,Changho Lee,Dahm Lee,Min Hyung Park,Hosung Song,Stanley Jungkyu Choi,Moontae Lee,Honglak Lee
类目:Computation and Language (cs.CL)
关键词:large language models, reports remains challenging, language models, evidence-based synthesis, remains challenging
备注: Work in progress
点击查看摘要
Abstract:As large language models (LLMs) advance, deep research systems can generate expert-level reports via multi-step reasoning and evidence-based synthesis, but evaluating such reports remains challenging. Existing benchmarks often lack systematic criteria for expert reporting, evaluations that rely heavily on LLM judges can fail to capture issues that require expert judgment, and source verification typically covers only a limited subset of explicitly cited statements rather than report-wide factual reliability. We introduce DEER, a benchmark for evaluating expert-level deep research reports. DEER comprises 50 report-writing tasks spanning 13 domains and an expert-grounded evaluation taxonomy (7 dimensions, 25 sub-dimension) operationalized into 130 fine-grained rubric items. DEER further provides task-specific expert guidance to help LLM judges assess expert-level report quality more consistently. Complementing rubric-based assessment, we propose a document-level fact-checking architecture that extracts and verifies all claims across the entire report, including both cited and uncited ones, and quantifies external-evidence quality. DEER correlates closely with human expert judgments and yields interpretable diagnostics of system strengths and weaknesses.
4. 【2512.17769】Bangla MedER: Multi-BERT Ensemble Approach for the Recognition of Bangla Medical Entity
链接:https://arxiv.org/abs/2512.17769
作者:Tanjim Taharat Aurpa,Farzana Akter,Md. Mehedi Hasan,Shakil Ahmed,Shifat Ara Rafiq,Fatema Khan
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:extracting meaningful entities, Medical Entity Recognition, Bangla medical entity, Entity Recognition, extracting meaningful
备注:
点击查看摘要
Abstract:Medical Entity Recognition (MedER) is an essential NLP task for extracting meaningful entities from the medical corpus. Nowadays, MedER-based research outcomes can remarkably contribute to the development of automated systems in the medical sector, ultimately enhancing patient care and outcomes. While extensive research has been conducted on MedER in English, low-resource languages like Bangla remain underexplored. Our work aims to bridge this gap. For Bangla medical entity recognition, this study first examined a number of transformer models, including BERT, DistilBERT, ELECTRA, and RoBERTa. We also propose a novel Multi-BERT Ensemble approach that outperformed all baseline models with the highest accuracy of 89.58%. Notably, it provides an 11.80% accuracy improvement over the single-layer BERT model, demonstrating its effectiveness for this task. A major challenge in MedER for low-resource languages is the lack of annotated datasets. To address this issue, we developed a high-quality dataset tailored for the Bangla MedER task. The dataset was used to evaluate the effectiveness of our model through multiple performance metrics, demonstrating its robustness and applicability. Our findings highlight the potential of Multi-BERT Ensemble models in improving MedER for Bangla and set the foundation for further advancements in low-resource medical NLP.
5. 【2512.17756】AncientBench: Towards Comprehensive Evaluation on Excavated and Transmitted Chinese Corpora
链接:https://arxiv.org/abs/2512.17756
作者:Zhihan Zhou,Daqian Shi,Rui Song,Lida Shi,Xiaolei Diao,Hao Xu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:ancient texts plays, ancient Chinese, large language models, ancient Chinese language, Chinese
备注:
点击查看摘要
Abstract:Comprehension of ancient texts plays an important role in archaeology and understanding of Chinese history and civilization. The rapid development of large language models needs benchmarks that can evaluate their comprehension of ancient characters. Existing Chinese benchmarks are mostly targeted at modern Chinese and transmitted documents in ancient Chinese, but the part of excavated documents in ancient Chinese is not covered. To meet this need, we propose the AncientBench, which aims to evaluate the comprehension of ancient characters, especially in the scenario of excavated documents. The AncientBench is divided into four dimensions, which correspond to the four competencies of ancient character comprehension: glyph comprehension, pronunciation comprehension, meaning comprehension, and contextual comprehension. The benchmark also contains ten tasks, including radical, phonetic radical, homophone, cloze, translation, and more, providing a comprehensive framework for evaluation. We convened archaeological researchers to conduct experimental evaluations, proposed an ancient model as baseline, and conducted extensive experiments on the currently best-performing large language models. The experimental results reveal the great potential of large language models in ancient textual scenarios as well as the gap with humans. Our research aims to promote the development and application of large language models in the field of archaeology and ancient Chinese language.
6. 【2512.17752】Affect, Body, Cognition, Demographics, and Emotion: The ABCDE of Text Features for Computational Affective Science
链接:https://arxiv.org/abs/2512.17752
作者:Jan Philip Wahle,Krishnapriya Vishnubhotla,Bela Gipp,Saif M. Mohammad
类目:Computation and Language (cs.CL)
关键词:questions about people, Social Science explores, Science, Computational Affective, Computational Affective Science
备注:
点击查看摘要
Abstract:Work in Computational Affective Science and Computational Social Science explores a wide variety of research questions about people, emotions, behavior, and health. Such work often relies on language data that is first labeled with relevant information, such as the use of emotion words or the age of the speaker. Although many resources and algorithms exist to enable this type of labeling, discovering, accessing, and using them remains a substantial impediment, particularly for practitioners outside of computer science. Here, we present the ABCDE dataset (Affect, Body, Cognition, Demographics, and Emotion), a large-scale collection of over 400 million text utterances drawn from social media, blogs, books, and AI-generated sources. The dataset is annotated with a wide range of features relevant to computational affective and social science. ABCDE facilitates interdisciplinary research across numerous fields, including affective science, cognitive science, the digital humanities, sociology, political science, and computational linguistics.
7. 【2512.17738】When the Gold Standard isn't Necessarily Standard: Challenges of Evaluating the Translation of User-Generated Content
链接:https://arxiv.org/abs/2512.17738
作者:Lydia Nishimwe,Benoît Sagot,Rachel Bawden
类目:Computation and Language (cs.CL)
关键词:User-generated content, character repetitions, UGC, characterised by frequent, spelling errors
备注: 10 pages, 19 pages with references and appendices
点击查看摘要
Abstract:User-generated content (UGC) is characterised by frequent use of non-standard language, from spelling errors to expressive choices such as slang, character repetitions, and emojis. This makes evaluating UGC translation particularly challenging: what counts as a "good" translation depends on the level of standardness desired in the output. To explore this, we examine the human translation guidelines of four UGC datasets, and derive a taxonomy of twelve non-standard phenomena and five translation actions (NORMALISE, COPY, TRANSFER, OMIT, CENSOR). Our analysis reveals notable differences in how UGC is treated, resulting in a spectrum of standardness in reference translations. Through a case study on large language models (LLMs), we show that translation scores are highly sensitive to prompts with explicit translation instructions for UGC, and that they improve when these align with the dataset's guidelines. We argue that when preserving UGC style is important, fair evaluation requires both models and metrics to be aware of translation guidelines. Finally, we call for clear guidelines during dataset creation and for the development of controllable, guideline-aware evaluation frameworks for UGC translation.
8. 【2512.17677】oward Ethical AI Through Bayesian Uncertainty in Neural Question Answering
链接:https://arxiv.org/abs/2512.17677
作者:Riccardo Di Sipio
类目:Computation and Language (cs.CL)
关键词:explore Bayesian reasoning, question answering, networks for question, explore Bayesian, Bayesian reasoning
备注: 14 pages, 8 figures,
点击查看摘要
Abstract:We explore Bayesian reasoning as a means to quantify uncertainty in neural networks for question answering. Starting with a multilayer perceptron on the Iris dataset, we show how posterior inference conveys confidence in predictions. We then extend this to language models, applying Bayesian inference first to a frozen head and finally to LoRA-adapted transformers, evaluated on the CommonsenseQA benchmark. Rather than aiming for state-of-the-art accuracy, we compare Laplace approximations against maximum a posteriori (MAP) estimates to highlight uncertainty calibration and selective prediction. This allows models to abstain when confidence is low. An ``I don't know'' response not only improves interpretability but also illustrates how Bayesian methods can contribute to more responsible and ethical deployment of neural question-answering systems.
9. 【2512.17657】Peeking Into The Future For Contextual Biasing
链接:https://arxiv.org/abs/2512.17657
作者:Ramaneswaran Selvakumar,Cindy Tseng,Eesung Kim,Vijendra Raj Apsingekar,Yun Tang
类目:Computation and Language (cs.CL)
关键词:automatic speech recognition, unseen named entities, automatic speech, speech recognition, general transcription
备注:
点击查看摘要
Abstract:While end-to-end (E2E) automatic speech recognition (ASR) models excel at general transcription, they struggle to recognize rare or unseen named entities (e.g., contact names, locations), which are critical for downstream applications like virtual assistants. In this paper, we propose a contextual biasing method for attention based encoder decoder (AED) models using a list of candidate named entities. Instead of predicting only the next token, we simultaneously predict multiple future tokens, enabling the model to "peek into the future" and score potential candidate entities in the entity list. Moreover, our approach leverages the multi-token prediction logits directly without requiring additional entity encoders or cross-attention layers, significantly reducing architectural complexity. Experiments on Librispeech demonstrate that our approach achieves up to 50.34% relative improvement in named entity word error rate compared to the baseline AED model.
10. 【2512.17648】Simulstream: Open-Source Toolkit for Evaluation and Demonstration of Streaming Speech-to-Text Translation Systems
链接:https://arxiv.org/abs/2512.17648
作者:Marco Gaido,Sara Papi,Mauro Cettolo,Matteo Negri,Luisa Bentivogli
类目:Computation and Language (cs.CL)
关键词:requires producing translations, producing translations concurrently, balance partial-information decision-making, high translation quality, imposing strict latency
备注:
点击查看摘要
Abstract:Streaming Speech-to-Text Translation (StreamST) requires producing translations concurrently with incoming speech, imposing strict latency constraints and demanding models that balance partial-information decision-making with high translation quality. Research efforts on the topic have so far relied on the SimulEval repository, which is no longer maintained and does not support systems that revise their outputs. In addition, it has been designed for simulating the processing of short segments, rather than long-form audio streams, and it does not provide an easy method to showcase systems in a demo. As a solution, we introduce simulstream, the first open-source framework dedicated to unified evaluation and demonstration of StreamST systems. Designed for long-form speech processing, it supports not only incremental decoding approaches, but also re-translation methods, enabling for their comparison within the same framework both in terms of quality and latency. In addition, it also offers an interactive web interface to demo any system built within the tool.
11. 【2512.17639】Linear Personality Probing and Steering in LLMs: A Big Five Study
链接:https://arxiv.org/abs/2512.17639
作者:Michel Frising,Daniel Balcells
类目:Computation and Language (cs.CL)
关键词:Large language models, greatly impact trust, Large language, exhibit distinct, trust and engagement
备注: 29 pages, 6 figures
点击查看摘要
Abstract:Large language models (LLMs) exhibit distinct and consistent personalities that greatly impact trust and engagement. While this means that personality frameworks would be highly valuable tools to characterize and control LLMs' behavior, current approaches remain either costly (post-training) or brittle (prompt engineering). Probing and steering via linear directions has recently emerged as a cheap and efficient alternative. In this paper, we investigate whether linear directions aligned with the Big Five personality traits can be used for probing and steering model behavior. Using Llama 3.3 70B, we generate descriptions of 406 fictional characters and their Big Five trait scores. We then prompt the model with these descriptions and questions from the Alpaca questionnaire, allowing us to sample hidden activations that vary along personality traits in known, quantifiable ways. Using linear regression, we learn a set of per-layer directions in activation space, and test their effectiveness for probing and steering model behavior. Our results suggest that linear directions aligned with trait-scores are effective probes for personality detection, while their steering capabilities strongly depend on context, producing reliable effects in forced-choice tasks but limited influence in open-ended generation or when additional context is present in the prompt.
12. 【2512.17630】Confidence-Credibility Aware Weighted Ensembles of Small LLMs Outperform Large LLMs in Emotion Detection
链接:https://arxiv.org/abs/2512.17630
作者:Menna Elgabry,Ali Hamdi
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Condorcet Jury Theorem, Jury Theorem, Condorcet Jury, inspired by Condorcet, credibility-aware ensemble framework
备注: Accepted at IRICT 2025
点击查看摘要
Abstract:This paper introduces a confidence-weighted, credibility-aware ensemble framework for text-based emotion detection, inspired by Condorcet's Jury Theorem (CJT). Unlike conventional ensembles that often rely on homogeneous architectures, our approach combines architecturally diverse small transformer-based large language models (sLLMs) - BERT, RoBERTa, DistilBERT, DeBERTa, and ELECTRA, each fully fine-tuned for emotion classification. To preserve error diversity, we minimize parameter convergence while taking advantage of the unique biases of each model. A dual-weighted voting mechanism integrates both global credibility (validation F1 score) and local confidence (instance-level probability) to dynamically weight model contributions. Experiments on the DAIR-AI dataset demonstrate that our credibility-confidence ensemble achieves a macro F1 score of 93.5 percent, surpassing state-of-the-art benchmarks and significantly outperforming large-scale LLMs, including Falcon, Mistral, Qwen, and Phi, even after task-specific Low-Rank Adaptation (LoRA). With only 595M parameters in total, our small LLMs ensemble proves more parameter-efficient and robust than models up to 7B parameters, establishing that carefully designed ensembles of small, fine-tuned models can outperform much larger LLMs in specialized natural language processing (NLP) tasks such as emotion detection.
13. 【2512.17419】SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories
链接:https://arxiv.org/abs/2512.17419
作者:Lilin Wang,Lucas Ramalho,Alan Celestino,Phuc Anthony Pham,Yu Liu,Umang Kumar Sinha,Andres Portillo,Onassis Osunwa,Gabriel Maduekwe
类目:oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:evaluation of Large, Large Language Models, software engineering tasks, repository-level software engineering, Large Language
备注:
点击查看摘要
Abstract:Benchmarks like SWE-bench have standardized the evaluation of Large Language Models (LLMs) on repository-level software engineering tasks. However, these efforts remain limited by manual curation, static datasets, and a focus on Python-based bug fixes. We introduce SWE-Bench++, an automated framework that generates repository-level coding tasks from open-source GitHub projects. Unlike synthetic approaches, our pipeline harvests live pull requests to cover both bug fixes and feature requests across 11 languages. SWE-Bench++ turns GitHub pull requests (PRs) into reproducible, execution-based tasks via four stages: programmatic sourcing, environment synthesis, test oracle extraction, and quality assurance. A final hint-guided trajectory synthesis step converts instances that strong models fail on into training trajectories. Our initial benchmark consists of 11,133 instances from 3,971 repositories across 11 languages. On a subset of 1,782 instances of this benchmark, today's strongest models perform as follows: claude-sonnet-4.5 achieves 36.20% pass@10, gpt-5-2025-08-07 34.57%, gemini/gemini-2.5-pro 24.92%, and gpt-4o 16.89%. We further demonstrate the utility of our dataset by showing that fine-tuning on SWE-Bench++ instances yields measurable improvements on the SWE-bench Multilingual benchmark. SWE-Bench++ provides a scalable, multilingual benchmark for evaluating and improving repository-level code generation.
14. 【2512.17396】RadImageNet-VQA: A Large-Scale CT and MRI Dataset for Radiologic Visual Question Answering
链接:https://arxiv.org/abs/2512.17396
作者:Léo Butsanets,Charles Corbière,Julien Khlaut,Pierre Manceron,Corentin Dancette
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:advance radiologic visual, large-scale dataset designed, visual question answering, MRI exams, radiologic visual question
备注: Preprint, 23 pages, 12 figures, 7 tables
点击查看摘要
Abstract:In this work, we introduce RadImageNet-VQA, a large-scale dataset designed to advance radiologic visual question answering (VQA) on CT and MRI exams. Existing medical VQA datasets are limited in scale, dominated by X-ray imaging or biomedical illustrations, and often prone to text-based shortcuts. RadImageNet-VQA is built from expert-curated annotations and provides 750K images paired with 7.5M question-answer samples. It covers three key tasks - abnormality detection, anatomy recognition, and pathology identification - spanning eight anatomical regions and 97 pathology categories, and supports open-ended, closed-ended, and multiple-choice questions. Extensive experiments show that state-of-the-art vision-language models still struggle with fine-grained pathology identification, particularly in open-ended settings and even after fine-tuning. Text-only analysis further reveals that model performance collapses to near-random without image inputs, confirming that RadImageNet-VQA is free from linguistic shortcuts. The full dataset and benchmark are publicly available at this https URL.
15. 【2512.17394】Are Vision Language Models Cross-Cultural Theory of Mind Reasoners?
链接:https://arxiv.org/abs/2512.17394
作者:Zabir Al Nazi,G M Shahariar,Abrar Hossain,Wei Peng
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
关键词:artificial agents, ability to attribute, remains a major, major challenge, challenge for artificial
备注:
点击查看摘要
Abstract:Theory of Mind (ToM) -- the ability to attribute beliefs, desires, and emotions to others -- is fundamental for human social intelligence, yet remains a major challenge for artificial agents. Existing Vision-Language Models (VLMs) are increasingly applied in socially grounded tasks, but their capacity for cross-cultural ToM reasoning is largely unexplored. In this work, we introduce CulturalToM-VQA, a new evaluation benchmark containing 5095 questions designed to probe ToM reasoning across diverse cultural contexts through visual question answering. The dataset captures culturally grounded cues such as rituals, attire, gestures, and interpersonal dynamics, enabling systematic evaluation of ToM reasoning beyond Western-centric benchmarks. Our dataset is built through a VLM-assisted human-in-the-loop pipeline, where human experts first curate culturally rich images across traditions, rituals, and social interactions; a VLM then assist in generating structured ToM-focused scene descriptions, which are refined into question-answer pairs spanning a taxonomy of six ToM tasks and four graded complexity levels. The resulting dataset covers diverse theory of mind facets such as mental state attribution, false belief reasoning, non-literal communication, social norm violations, perspective coordination, and multi-agent reasoning.
16. 【2512.17387】CIFE: Code Instruction-Following Evaluation
链接:https://arxiv.org/abs/2512.17387
作者:Sravani Gunnu,Shanmukha Guttula,Hima Patel
类目:oftware Engineering (cs.SE); Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, reliable deployment, requirements for robustness, Language Models
备注: 20 pages, 22 figures, 2 tables
点击查看摘要
Abstract:Large Language Models (LLMs) are increasingly applied to real-world code generation, where functional correctness alone is insufficient for reliable deployment, developers also expect adherence to explicit requirements for robustness, formatting, and security. Existing benchmarks primarily assess correctness through test-case execution, offering limited insight into how reliably models follow such constraints. We introduce a benchmark of 1,000 Python tasks, each paired with an average of 7 developer-specified constraints spanning 13 categories. Constraints are curated through a four-stage human-LLM pipeline to ensure they are atomic, relevant, and objective. We evaluate 14 open- and closed-source models using complementary adherence metrics and propose the C2A Score, a composite measure that jointly captures correctness and constraint compliance. Results reveal a substantial gap between partial and strict satisfaction, while strong models achieve over 90% partial adherence, strict adherence remains between 39-66%. These findings highlight that trustworthy code generation requires not only correctness but also consistent adherence to developer intent.
17. 【2512.17385】UCoder: Unsupervised Code Generation by Internal Probing of Large Language Models
链接:https://arxiv.org/abs/2512.17385
作者:Jiajun Wu,Jian Yang,Wei Zhang,Lin Jing,Yuqing Ma,Ensheng Shi,Yuchi Ma,Zhoujun Li,Xianglong Liu
类目:Computation and Language (cs.CL)
关键词:Large language models, demonstrated remarkable capabilities, Large language, code generation tasks, demonstrated remarkable
备注:
点击查看摘要
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, their effectiveness heavily relies on supervised training with extensive labeled (e.g., question-answering pairs) or unlabeled datasets (e.g., code snippets), which are often expensive and difficult to obtain at scale. To address this limitation, this paper introduces a method IPC, an unsupervised framework that leverages Internal Probing of LLMs for Code generation without any external corpus, even unlabeled code snippets. We introduce the problem space probing, test understanding probing, solution space probing, and knowledge consolidation and reinforcement to probe the internal knowledge and confidence patterns existing in LLMs. Further, IPC identifies reliable code candidates through self-consistency mechanisms and representation-based quality estimation to train UCoder (coder with unsupervised learning). We validate the proposed approach across multiple code benchmarks, demonstrating that unsupervised methods can achieve competitive performance compared to supervised approaches while significantly reducing the dependency on labeled data and computational resources. Analytic experiments reveal that internal model states contain rich signals about code quality and correctness, and that properly harnessing these signals enables effective unsupervised learning for code generation tasks, opening new directions for training code LLMs in resource-constrained scenarios.
18. 【2512.17375】AdvJudge-Zero: Binary Decision Flips in LLM-as-a-Judge via Adversarial Control Tokens
链接:https://arxiv.org/abs/2512.17375
作者:Tung-Ling Li,Yuhao Wu,Hongliang Liu
类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
关键词:provide scalar feedback, modern post-training pipelines, guide model selection, RL-based fine-tuning, Reward models
备注:
点击查看摘要
Abstract:Reward models and LLM-as-a-Judge systems are central to modern post-training pipelines such as RLHF, DPO, and RLAIF, where they provide scalar feedback and binary decisions that guide model selection and RL-based fine-tuning. We show that these judge systems exhibit a recurring vulnerability: short sequences of low-perplexity control tokens can flip many binary evaluations from correct ``No'' judgments to incorrect ``Yes'' judgments by steering the last-layer logit gap. These control tokens are patterns that a policy model could plausibly generate during post-training, and thus represent realistic reward-hacking risks rather than worst-case adversarial strings. Our method, AdvJudge-Zero, uses the model's next-token distribution and beam-search exploration to discover diverse control-token sequences from scratch, and our analysis shows that the induced hidden-state perturbations concentrate in a low-rank ``soft mode'' that is anti-aligned with the judge's refusal direction. Empirically, these tokens cause very high false positive rates when large open-weight and specialized judge models score incorrect answers on math and reasoning benchmarks. Finally, we show that LoRA-based adversarial training on small sets of control-token-augmented examples can markedly reduce these false positives while preserving evaluation quality.
19. 【2512.17351】Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers
链接:https://arxiv.org/abs/2512.17351
作者:Zeyuan Allen-Zhu
类目:Computation and Language (cs.CL)
关键词:Understanding architectural differences, Understanding architectural, noise and randomness, dominated by noise, CANON LAYERS
备注: V1.1 appeared in NeurIPS 2025 main conference; V2 adds GDN experiments, tightens some experiments (for a stronger, fairer comparison), and re-organizes sections
点击查看摘要
Abstract:Understanding architectural differences in language models is challenging, especially at academic-scale pretraining (e.g., 1.3B parameters, 100B tokens), where results are often dominated by noise and randomness. To overcome this, we introduce controlled synthetic pretraining tasks that isolate and evaluate core model capabilities. Within this framework, we discover CANON LAYERS: lightweight architectural components -- named after the musical term "canon" -- that promote horizontal information flow across neighboring tokens. Canon layers compute weighted sums of nearby token representations and integrate seamlessly into Transformers, linear attention, state-space models, or any sequence architecture. We present 12 key results. This includes how Canon layers enhance reasoning depth (e.g., by $2\times$), reasoning breadth, knowledge manipulation, etc. They lift weak architectures like NoPE to match RoPE, and linear attention to rival SOTA linear models like Mamba2/GDN -- validated both through synthetic tasks and real-world academic-scale pretraining. This synthetic playground offers an economical, principled path to isolate core model capabilities often obscured at academic scales. Equipped with infinite high-quality data, it may even PREDICT how future architectures will behave as training pipelines improve -- e.g., through better data curation or RL-based post-training -- unlocking deeper reasoning and hierarchical inference.
Comments:
V1.1 appeared in NeurIPS 2025 main conference; V2 adds GDN experiments, tightens some experiments (for a stronger, fairer comparison), and re-organizes sections
Subjects:
Computation and Language (cs.CL)
Cite as:
arXiv:2512.17351 [cs.CL]
(or
arXiv:2512.17351v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2512.17351
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
20. 【2512.17347】Stakeholder Suite: A Unified AI Framework for Mapping Actors, Topics and Arguments in Public Debates
链接:https://arxiv.org/abs/2512.17347
作者:Mohamed Chenene,Jeanne Rouhier,Jean Daniélou,Mihir Sarkar,Elena Cabrio
类目:Computation and Language (cs.CL)
关键词:Public debates surrounding, involve complex networks, projects involve complex, evolving narratives, involve complex
备注:
点击查看摘要
Abstract:Public debates surrounding infrastructure and energy projects involve complex networks of stakeholders, arguments, and evolving narratives. Understanding these dynamics is crucial for anticipating controversies and informing engagement strategies, yet existing tools in media intelligence largely rely on descriptive analytics with limited transparency. This paper presents Stakeholder Suite, a framework deployed in operational contexts for mapping actors, topics, and arguments within public debates. The system combines actor detection, topic modeling, argument extraction and stance classification in a unified pipeline. Tested on multiple energy infrastructure projects as a case study, the approach delivers fine-grained, source-grounded insights while remaining adaptable to diverse domains. The framework achieves strong retrieval precision and stance accuracy, producing arguments judged relevant in 75% of pilot use cases. Beyond quantitative metrics, the tool has proven effective for operational use: helping project teams visualize networks of influence, identify emerging controversies, and support evidence-based decision-making.
21. 【2512.17344】Governance-Aware Hybrid Fine-Tuning for Multilingual Large Language Models
链接:https://arxiv.org/abs/2512.17344
作者:Haomin Qi,Chengbo Huang,Zihan Dai,Yunkai Gao
类目:Computation and Language (cs.CL)
关键词:large language models, governance-aware hybrid fine-tuning, shown in Table, hybrid fine-tuning framework, present a governance-aware
备注: 11 pages, 4 figures, 6 tables. arXiv admin note: substantial text overlap with [arXiv:2507.18076](https://arxiv.org/abs/2507.18076)
点击查看摘要
Abstract:We present a governance-aware hybrid fine-tuning framework for multilingual, low-resource adaptation of large language models. The core algorithm combines gradient-aligned low-rank updates with structured orthogonal transformations through layer-wise mixing and introduces unitary constraints in selected sub-layers to stabilize deep optimization. In tandem with lightweight, label-free data governance steps, including language identification, near-duplicate removal, and quality filtering, the framework targets accuracy, calibration, and cross-language parity under tight compute budgets. Across XNLI and FLORES, the hybrid approach delivers consistent gains over strong PEFT baselines while maintaining directional balance and improving probability calibration, as shown in Tables II and III. It is more resilient to lightweight orthographic variants, as shown in Table IV, and benefits additively from simple governance steps, as shown in Table V. Training footprint measurements indicate modest overhead and a favorable cost-quality frontier, as shown in Table VI and Figure 2. Together, these results show that hybrid and unitary PEFT provide a stable and accessible path to resource-efficient multilingual adaptation when paired with practical data governance.
22. 【2512.17325】ask Schema and Binding: A Double Dissociation Study of In-Context Learning
链接:https://arxiv.org/abs/2512.17325
作者:Chaeha Kim
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:abstract task type, specific input-output associations, Task Schema, Toggle, task type recognition
备注: 20pages, 2figures
点击查看摘要
Abstract:We provide causal mechanistic validation that in-context learning (ICL) decomposes into two separable mechanisms: Task Schema (abstract task type recognition) and Binding (specific input-output associations). Through activation patching experiments across 9 models from 7 Transformer families plus Mamba (370M-13B parameters), we establish three key findings: 1. Double dissociation: Task Schema transfers at 100% via late MLP patching; Binding transfers at 62% via residual stream patching -- proving separable mechanisms 2. Prior-Schema trade-off: Schema reliance inversely correlates with prior knowledge (Spearman rho = -0.596, p 0.001, N=28 task-model pairs) 3. Architecture generality: The mechanism operates across all tested architectures including the non-Transformer Mamba These findings offer a mechanistic account of the ICL puzzle that contrasts with prior views treating ICL as a monolithic mechanism (whether retrieval-based, gradient descent-like, or purely Bayesian). By establishing that Schema and Binding are neurally dissociable -- not merely behavioral modes -- we provide causal evidence for dual-process theories of ICL. Models rely on Task Schema when prior knowledge is absent, but prior knowledge interferes through attentional mis-routing (72.7% recency bias) rather than direct output competition (0%). This explains why arbitrary mappings succeed (zero prior leads to full Schema reliance) while factual overrides fail -- and reveals that the true bottleneck is attentional, not output-level. Practical implications: Understanding these dual mechanisms enables more efficient prompt engineering -- reliable schema transfer reduces required demonstrations for novel tasks, while prior-aware design can mitigate the 38% binding failure rate in high-prior scenarios, improving ICL system reliability in production deployments.
Comments:
20pages, 2figures
Subjects:
Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as:
arXiv:2512.17325 [cs.LG]
(or
arXiv:2512.17325v1 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2512.17325
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Chaeha Kim [view email] [v1]
Fri, 19 Dec 2025 08:14:21 UTC (75 KB)
Full-text links:
Access Paper:
View a PDF of the paper titled Task Schema and Binding: A Double Dissociation Study of In-Context Learning, by Chaeha KimView PDFHTML (experimental)TeX Source
view license
Current browse context: cs.LG
prev
|
next
new
|
recent
| 2025-12
Change to browse by:
cs
cs.CL
References Citations
NASA ADSGoogle Scholar
Semantic Scholar
export BibTeX citation
Loading…
BibTeX formatted citation
loading…
Data provided by:
Bookmark
checked=“checked”>
Bibliographic Tools
Bibliographic and Citation Tools
Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media
Code, Data and Media Associated with this Article
alphaXiv Toggle
alphaXiv (What is alphaXiv?)
Links to Code Toggle
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub Toggle
DagsHub (What is DagsHub?)
GotitPub Toggle
Gotit.pub (What is GotitPub?)
Huggingface Toggle
Hugging Face (What is Huggingface?)
Links to Code Toggle
Papers with Code (What is Papers with Code?)
ScienceCast Toggle
ScienceCast (What is ScienceCast?)
Demos
Demos
Replicate Toggle
Replicate (What is Replicate?)
Spaces Toggle
Hugging Face Spaces (What is Spaces?)
Spaces Toggle
Related Papers
Recommenders and Search Tools
Link to Influence Flower
Influence Flower (What are Influence Flowers?)
Core recommender toggle
CORE Recommender (What is CORE?)
IArxiv recommender toggle
IArxiv Recommender
(What is IArxiv?)
Author
Venue
Institution
Topic
About arXivLabs
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.
Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)
mathjaxToggle();
About
Help
contact arXivClick here to contact arXiv
Contact
subscribe to arXiv mailingsClick here to subscribe
Subscribe
Copyright
Privacy Policy
Web Accessibility Assistance
arXiv Operational Status
23. 【2512.17308】Large Language Models as Pokémon Battle Agents: Strategic Play and Content Generation
链接:https://arxiv.org/abs/2512.17308
作者:Daksh Jain,Aarya Jain,Ashutosh Desai,Avyakt Verma,Ishan Bhanuka,Pratik Narang,Dhruv Kumar
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:large language models, evaluating large language, Pokémon battles presents, large language, language models
备注: Under Review
点击查看摘要
Abstract:Strategic decision-making in Pokémon battles presents a unique testbed for evaluating large language models. Pokémon battles demand reasoning about type matchups, statistical trade-offs, and risk assessment, skills that mirror human strategic thinking. This work examines whether Large Language Models (LLMs) can serve as competent battle agents, capable of both making tactically sound decisions and generating novel, balanced game content. We developed a turn-based Pokémon battle system where LLMs select moves based on battle state rather than pre-programmed logic. The framework captures essential Pokémon mechanics: type effectiveness multipliers, stat-based damage calculations, and multi-Pokémon team management. Through systematic evaluation across multiple model architectures we measured win rates, decision latency, type-alignment accuracy, and token efficiency. These results suggest LLMs can function as dynamic game opponents without domain-specific training, offering a practical alternative to reinforcement learning for turn-based strategic games. The dual capability of tactical reasoning and content creation, positions LLMs as both players and designers, with implications for procedural generation and adaptive difficulty systems in interactive entertainment.
24. 【2512.17289】Subjective Question Generation and Answer Evaluation using NLP
链接:https://arxiv.org/abs/2512.17289
作者:G. M. Refatul Islam,Safwan Shaheer,Yaseen Nur,Mohammad Rafid Hamid
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Natural Language Processing, Natural Language, Language Processing, revolutionary technologies today, technologies today
备注: 5 pages, 5 figures, 2 tables, conference paper
点击查看摘要
Abstract:Natural Language Processing (NLP) is one of the most revolutionary technologies today. It uses artificial intelligence to understand human text and spoken words. It is used for text summarization, grammar checking, sentiment analysis, and advanced chatbots and has many more potential use cases. Furthermore, it has also made its mark on the education sector. Much research and advancements have already been conducted on objective question generation; however, automated subjective question generation and answer evaluation are still in progress. An automated system to generate subjective questions and evaluate the answers can help teachers assess student work and enhance the student's learning experience by allowing them to self-assess their understanding after reading an article or a chapter of a book. This research aims to improve current NLP models or make a novel one for automated subjective question generation and answer evaluation from text input.
25. 【2512.17270】Understanding Generalization in Role-Playing Models via Information Theory
链接:https://arxiv.org/abs/2512.17270
作者:Yongqi Li,Hao Lang,Fei Huang,Tieyun Qian,Yongbin Li
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Role-playing models, real-world applications, applications but underperform, underperform when deployed, RPM performance degradation
备注:
点击查看摘要
Abstract:Role-playing models (RPMs) are widely used in real-world applications but underperform when deployed in the wild. This degradation can be attributed to distribution shifts, including user, character, and dialogue compositional shifts. Existing methods like LLM-as-a-judge fall short in providing a fine-grained diagnosis of how these shifts affect RPM generalization, and thus there lack formal frameworks to characterize RPM generalization behaviors. To bridge these gaps, we introduce an information-theoretic metric, named reasoning-based effective mutual information difference (R-EMID), to measure RPM performance degradation in an interpretable way. We also derive an upper bound on R-EMID to predict the worst-case generalization performance of RPMs and theoretically reveal how various shifts contribute to the RPM performance degradation. Moreover, we propose a co-evolving reinforcement learning framework to adaptively model the connection among user, character, and dialogue context and thus enhance the estimation of dialogue response generation probability, which is critical for calculating R-EMID. Finally, we evaluate the generalization performance of various RPMs using R-EMID, finding that user shift poses the highest risk among all shifts and reinforcement learning is the most effective approach for enhancing RPM generalization.
26. 【2512.17267】AutoMetrics: Approximate Human Judgements with Automatically Generated Evaluators
链接:https://arxiv.org/abs/2512.17267
作者:Michael J. Ryan,Yanzhe Zhang,Amol Salunkhe,Yi Chu,Di Xu,Diyi Yang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:clinical note generation, Evaluating user-facing, central challenge, travel planning, clinical note
备注:
点击查看摘要
Abstract:Evaluating user-facing AI applications remains a central challenge, especially in open-ended domains such as travel planning, clinical note generation, or dialogue. The gold standard is user feedback (e.g., thumbs up/down) or behavioral signals (e.g., retention), but these are often scarce in prototypes and research projects, or too-slow to use for system optimization. We present AutoMetrics, a framework for synthesizing evaluation metrics under low-data constraints. AutoMetrics combines retrieval from MetricBank, a collection of 48 metrics we curate, with automatically generated LLM-as-a-Judge criteria informed by lightweight human feedback. These metrics are composed via regression to maximize correlation with human signal. AutoMetrics takes you from expensive measures to interpretable automatic metrics. Across 5 diverse tasks, AutoMetrics improves Kendall correlation with human ratings by up to 33.4% over LLM-as-a-Judge while requiring fewer than 100 feedback points. We show that AutoMetrics can be used as a proxy reward to equal effect as a verifiable reward. We release the full AutoMetrics toolkit and MetricBank to accelerate adaptive evaluation of LLM applications.
27. 【2512.17260】Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience
链接:https://arxiv.org/abs/2512.17260
作者:Jiangjie Chen,Wenxiang Chen,Jiacheng Du,Jinyi Hu,Zhicheng Jiang,Allan Jie,Xiaoran Jin,Xing Jin,Chenggang Li,Wenlei Shi,Zhihong Wang,Mingxuan Wang,Chenrui Wei,Shufa Wei,Huajian Xin,Fan Yang,Weihao Gao,Zheng Yuan,Tianyang Zhan,Zeyu Zheng,Tianxi Zhou,Thomas Hanwen Zhu
类目:Computation and Language (cs.CL)
关键词:recently made significant, made significant progress, Large language models, Large language, rigorous mathematical proofs
备注: 21 pages
点击查看摘要
Abstract:Large language models have recently made significant progress to generate rigorous mathematical proofs. In contrast, utilizing LLMs for theorem proving in formal languages (such as Lean) remains challenging and computationally expensive, particularly when addressing problems at the undergraduate level and beyond. In this work, we present \textbf{Seed-Prover 1.5}, a formal theorem-proving model trained via large-scale agentic reinforcement learning, alongside an efficient test-time scaling (TTS) workflow. Through extensive interactions with Lean and other tools, the model continuously accumulates experience during the RL process, substantially enhancing the capability and efficiency of formal theorem proving. Furthermore, leveraging recent advancements in natural language proving, our TTS workflow efficiently bridges the gap between natural and formal languages. Compared to state-of-the-art methods, Seed-Prover 1.5 achieves superior performance with a smaller compute budget. It solves \textbf{88\% of PutnamBench} (undergraduate-level), \textbf{80\% of Fate-H} (graduate-level), and \textbf{33\% of Fate-X} (PhD-level) problems. Notably, using our system, we solved \textbf{11 out of 12 problems} from Putnam 2025 within 9 hours. Our findings suggest that scaling learning from experience, driven by high-quality formal feedback, holds immense potential for the future of formal mathematical reasoning.
28. 【2512.17247】Incorporating Error Level Noise Embedding for Improving LLM-Assisted Robustness in Persian Speech Recognition
链接:https://arxiv.org/abs/2512.17247
作者:Zahra Rahmani(1),Hossein Sameti(1) ((1) Department of Computer Engineering, Sharif University of Technology)
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Automatic Speech Recognition, systems suffer significant, suffer significant performance, significant performance degradation, Automatic Speech
备注:
点击查看摘要
Abstract:Automatic Speech Recognition (ASR) systems suffer significant performance degradation in noisy environments, a challenge that is especially severe for low-resource languages such as Persian. Even state-of-the-art models such as Whisper struggle to maintain accuracy under varying signal-to-noise ratios (SNRs). This study presents a robust noise-sensitive ASR error correction framework that combines multiple hypotheses and noise-aware modeling. Using noisy Persian speech, we generate 5-best hypotheses from a modified Whisper-large decoder. Error Level Noise (ELN) is introduced as a representation that captures semantic- and token-level disagreement across hypotheses, quantifying the linguistic distortions caused by noise. ELN thus provides a direct measure of noise-induced uncertainty, enabling the LLM to reason about the reliability of each hypothesis during correction. Three models are evaluated: (1) a base LLaMA-2-7B model without fine-tuning, (2) a fine-tuned variant trained on text-only hypotheses, and (3) a noise-conditioned model integrating ELN embeddings at both sentence and word levels. Experimental results demonstrate that the ELN-conditioned model achieves substantial reductions in Word Error Rate (WER). Specifically, on the challenging Mixed Noise test set, the proposed Fine-tuned + ELN (Ours) model reduces the WER from a baseline of 31.10\% (Raw Whisper) to 24.84\%, significantly surpassing the Fine-tuned (No ELN) text-only baseline of 30.79\%, whereas the original LLaMA-2-7B model increased the WER to 64.58\%, demonstrating that it is unable to correct Persian errors on its own. This confirms the effectiveness of combining multiple hypotheses with noise-aware embeddings for robust Persian ASR in noisy real-world scenarios.
29. 【2512.17220】Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding
链接:https://arxiv.org/abs/2512.17220
作者:Yuqing Li,Jiangnan Li,Zheng Lin,Ziyan Zhou,Junjie Wu,Weiping Wang,Jie Zhou,Mo Yu
类目:Computation and Language (cs.CL)
关键词:Humans understand long, Humans understand, understand long, long and complex, complex texts
备注:
点击查看摘要
Abstract:Humans understand long and complex texts by relying on a holistic semantic representation of the content. This global view helps organize prior knowledge, interpret new information, and integrate evidence dispersed across a document, as revealed by the Mindscape-Aware Capability of humans in psychology. Current Retrieval-Augmented Generation (RAG) systems lack such guidance and therefore struggle with long-context tasks. In this paper, we propose Mindscape-Aware RAG (MiA-RAG), the first approach that equips LLM-based RAG systems with explicit global context awareness. MiA-RAG builds a mindscape through hierarchical summarization and conditions both retrieval and generation on this global semantic representation. This enables the retriever to form enriched query embeddings and the generator to reason over retrieved evidence within a coherent global context. We evaluate MiA-RAG across diverse long-context and bilingual benchmarks for evidence-based understanding and global sense-making. It consistently surpasses baselines, and further analysis shows that it aligns local details with a coherent global representation, enabling more human-like long-context retrieval and reasoning.
30. 【2512.17179】Enhancing Long Document Long Form Summarisation with Self-Planning
链接:https://arxiv.org/abs/2512.17179
作者:Xiaotang Du,Rohit Saxena,Laura Perez-Beltrachini,Pasquale Minervini,Ivan Titov
类目:Computation and Language (cs.CL)
关键词:leverages sentence-level information, leverages sentence-level, sentence-level information, traceability and faithfulness, faithfulness of generated
备注:
点击查看摘要
Abstract:We introduce a novel approach for long context summarisation, highlight-guided generation, that leverages sentence-level information as a content plan to improve the traceability and faithfulness of generated summaries. Our framework applies self-planning methods to identify important content and then generates a summary conditioned on the plan. We explore both an end-to-end and two-stage variants of the approach, finding that the two-stage pipeline performs better on long and information-dense documents. Experiments on long-form summarisation datasets demonstrate that our method consistently improves factual consistency while preserving relevance and overall quality. On GovReport, our best approach has improved ROUGE-L by 4.1 points and achieves about 35% gains in SummaC scores. Qualitative analysis shows that highlight-guided summarisation helps preserve important details, leading to more accurate and insightful summaries across domains.
31. 【2512.17093】A Solver-in-the-Loop Framework for Improving LLMs on Answer Set Programming for Logic Puzzle Solving
链接:https://arxiv.org/abs/2512.17093
作者:Timo Pierre Schrader,Lukas Lange,Tobias Kaminski,Simon Razniewski,Annemarie Friedrich
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:large language models, ASP code generation, coding assistants, Answer Set Programming, rise of large
备注: 15 pages, 7 figures, accepted at AAAI'26
点击查看摘要
Abstract:The rise of large language models (LLMs) has sparked interest in coding assistants. While general-purpose programming languages are well supported, generating code for domain-specific languages remains a challenging problem for LLMs. In this paper, we focus on the LLM-based generation of code for Answer Set Programming (ASP), a particularly effective approach for finding solutions to combinatorial search problems. The effectiveness of LLMs in ASP code generation is currently hindered by the limited number of examples seen during their initial pre-training phase. In this paper, we introduce a novel ASP-solver-in-the-loop approach for solver-guided instruction-tuning of LLMs to addressing the highly complex semantic parsing task inherent in ASP code generation. Our method only requires problem specifications in natural language and their solutions. Specifically, we sample ASP statements for program continuations from LLMs for unriddling logic puzzles. Leveraging the special property of declarative ASP programming that partial encodings increasingly narrow down the solution space, we categorize them into chosen and rejected instances based on solver feedback. We then apply supervised fine-tuning to train LLMs on the curated data and further improve robustness using a solver-guided search that includes best-of-N sampling. Our experiments demonstrate consistent improvements in two distinct prompting settings on two datasets.
Comments:
15 pages, 7 figures, accepted at AAAI’26
Subjects:
Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:
arXiv:2512.17093 [cs.AI]
(or
arXiv:2512.17093v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2512.17093
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
32. 【2512.17092】Data Augmentation Supporting a Conversational Agent Designed for Smoking Cessation Support Groups
链接:https://arxiv.org/abs/2512.17092
作者:Salar Hashemitaheri,Ian Harris
类目:Computation and Language (cs.CL)
关键词:economical and accessible, data, low user engagement, augmentation, synthetic
备注:
点击查看摘要
Abstract:Online support groups for smoking cessation are economical and accessible, yet they often face challenges with low user engagement and stigma. The use of an automatic conversational agent would improve engagement by ensuring that all user comments receive a timely response.). We address the challenge of insufficient high-quality data by employing a two-level data augmentation strategy: synthetic data augmentation and real data augmentation. First, we fine-tuned an open source LLM to classify posts from our existing smoking cessation support groups and identify intents with low F1 (precision+recall) scores. Then, for these intents, we generate additional synthetic data using prompt engineering with the GPT model, with an average of 87\% of the generated synthetic posts deemed high quality by human annotators. Overall, the synthetic augmentation process resulted in 43\% of the original posts being selected for augmentation, followed by 140\% synthetic expansion of these posts. Additionally, we scraped more than 10,000 real posts from a related online support context, of which 73\% were validated as good quality by human annotators. Each synthetic or scraped post underwent rigorous validation involving human reviewers to ensure quality and relevance. The validated new data, combined with the original support group posts, formed an augmented dataset used to retrain the intent classifier. Performance evaluation of the retrained model demonstrated a 32\% improvement in F1, confirming the effectiveness of our data augmentation approach. Synthetic and real post augmentation led to similar performance improvements. This study provides a replicable framework for enhancing conversational agent performance in domains where data scarcity is a critical issue.
33. 【2512.17083】When F1 Fails: Granularity-Aware Evaluation for Dialogue Topic Segmentation
链接:https://arxiv.org/abs/2512.17083
作者:Michael H. Coen
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Dialogue topic segmentation, segmentation supports summarization, topic segmentation, Dialogue topic, topic segmentation supports
备注: 17 pages, 2 figures. Evaluation and methodology study on dialogue topic segmentation
点击查看摘要
Abstract:Dialogue topic segmentation supports summarization, retrieval, memory management, and conversational continuity. Despite decades of prior work, evaluation practice in dialogue topic segmentation remains dominated by strict boundary matching and F1-based metrics, even as modern LLM-based conversational systems increasingly rely on segmentation to manage conversation history beyond the model's fixed context window, where unstructured context accumulation degrades efficiency and coherence. This paper introduces an evaluation objective for dialogue topic segmentation that treats boundary density and segment coherence as primary criteria, alongside window-tolerant F1 (W-F1). Through extensive cross-dataset empirical evaluation, we show that reported performance differences across dialogue segmentation benchmarks are driven not by model quality, but by annotation granularity mismatches and sparse boundary labels. This indicates that many reported improvements arise from evaluation artifacts rather than improved boundary detection. We evaluated multiple, structurally distinct dialogue segmentation strategies across eight dialogue datasets spanning task-oriented, open-domain, meeting-style, and synthetic interactions. Across these settings, we observe high segment coherence combined with extreme oversegmentation relative to sparse labels, producing misleadingly low exact-match F1 scores. We show that topic segmentation is best understood as selecting an appropriate granularity rather than predicting a single correct boundary set. We operationalize this view by explicitly separating boundary scoring from boundary selection.
Comments:
17 pages, 2 figures. Evaluation and methodology study on dialogue topic segmentation
Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
ACMclasses:
I.2.7; H.3.1
Cite as:
arXiv:2512.17083 [cs.CL]
(or
arXiv:2512.17083v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2512.17083
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
34. 【2512.17075】Perturb Your Data: Paraphrase-Guided Training Data Watermarking
链接:https://arxiv.org/abs/2512.17075
作者:Pranav Shetty,Mirazul Haque,Petr Babkin,Zhiqiang Ma,Xiaomo Liu,Manuela Veloso
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Large Language Models, Large Language, massive text corpora, text corpora scraped, Language Models
备注: Accepted to AAAI 2026
点击查看摘要
Abstract:Training data detection is critical for enforcing copyright and data licensing, as Large Language Models (LLM) are trained on massive text corpora scraped from the internet. We present SPECTRA, a watermarking approach that makes training data reliably detectable even when it comprises less than 0.001% of the training corpus. SPECTRA works by paraphrasing text using an LLM and assigning a score based on how likely each paraphrase is, according to a separate scoring model. A paraphrase is chosen so that its score closely matches that of the original text, to avoid introducing any distribution shifts. To test whether a suspect model has been trained on the watermarked data, we compare its token probabilities against those of the scoring model. We demonstrate that SPECTRA achieves a consistent p-value gap of over nine orders of magnitude when detecting data used for training versus data not used for training, which is greater than all baselines tested. SPECTRA equips data owners with a scalable, deploy-before-release watermark that survives even large-scale LLM training.
35. 【2512.17065】XLM: A Python package for non-autoregressive language models
链接:https://arxiv.org/abs/2512.17065
作者:Dhruvesh Patel,Durga Prasad Maram,Sai Sreenivas Chintha,Benjamin Rozonoyer,Andrew McCallum
类目:Computation and Language (cs.CL)
关键词:general language modeling, non-autoregressive text generation, language modeling, non-autoregressive language modeling, recent years
备注: Code available at [this https URL](https://github.com/dhruvdcoder/xlm-core)
点击查看摘要
Abstract:In recent years, there has been a resurgence of interest in non-autoregressive text generation in the context of general language modeling. Unlike the well-established autoregressive language modeling paradigm, which has a plethora of standard training and inference libraries, implementations of non-autoregressive language modeling have largely been bespoke making it difficult to perform systematic comparisons of different methods. Moreover, each non-autoregressive language model typically requires it own data collation, loss, and prediction logic, making it challenging to reuse common components. In this work, we present the XLM python package, which is designed to make implementing small non-autoregressive language models faster with a secondary goal of providing a suite of small pre-trained models (through a companion xlm-models package) that can be used by the research community. The code is available at this https URL.
36. 【2512.17053】Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL
链接:https://arxiv.org/abs/2512.17053
作者:Khushboo Thaker,Yony Bresler
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
关键词:trilemma involving cost, difficult trilemma involving, enterprise level faces, Deploying accurate, Small Language Models
备注:
点击查看摘要
Abstract:Deploying accurate Text-to-SQL systems at the enterprise level faces a difficult trilemma involving cost, security and performance. Current solutions force enterprises to choose between expensive, proprietary Large Language Models (LLMs) and low-performing Small Language Models (SLMs). Efforts to improve SLMs often rely on distilling reasoning from large LLMs using unstructured Chain-of-Thought (CoT) traces, a process that remains inherently ambiguous. Instead, we hypothesize that a formal, structured reasoning representation provides a clearer, more reliable teaching signal, as the Text-to-SQL task requires explicit and precise logical steps. To evaluate this hypothesis, we propose Struct-SQL, a novel Knowledge Distillation (KD) framework that trains an SLM to emulate a powerful large LLM. Consequently, we adopt a query execution plan as a formal blueprint to derive this structured reasoning. Our SLM, distilled with structured CoT, achieves an absolute improvement of 8.1% over an unstructured CoT distillation baseline. A detailed error analysis reveals that a key factor in this gain is a marked reduction in syntactic errors. This demonstrates that teaching a model to reason using a structured logical blueprint is beneficial for reliable SQL generation in SLMs.
37. 【2512.17028】A Women's Health Benchmark for Large Language Models
链接:https://arxiv.org/abs/2512.17028
作者:Victoria-Elisabeth Gruber,Razvan Marinescu,Diego Fajardo,Amin H. Nassar,Christopher Arkfeld,Alexandria Ludlow,Shama Patel,Mehrnoosh Samaei,Valerie Klug,Anna Huber,Marcel Gühner,Albert Botta i Orfila,Irene Lagoja,Kimya Tarr,Haleigh Larson,Mary Beth Howard
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:remains critically unexamined, women health, health remains critically, Women Health Benchmark, women health remains
备注: 15 pages, 6 Figures, 2 Tables
点击查看摘要
Abstract:As large language models (LLMs) become primary sources of health information for millions, their accuracy in women's health remains critically unexamined. We introduce the Women's Health Benchmark (WHB), the first benchmark evaluating LLM performance specifically in women's health. Our benchmark comprises 96 rigorously validated model stumps covering five medical specialties (obstetrics and gynecology, emergency medicine, primary care, oncology, and neurology), three query types (patient query, clinician query, and evidence/policy query), and eight error types (dosage/medication errors, missing critical information, outdated guidelines/treatment recommendations, incorrect treatment advice, incorrect factual information, missing/incorrect differential diagnosis, missed urgency, and inappropriate recommendations). We evaluated 13 state-of-the-art LLMs and revealed alarming gaps: current models show approximately 60\% failure rates on the women's health benchmark, with performance varying dramatically across specialties and error types. Notably, models universally struggle with "missed urgency" indicators, while newer models like GPT-5 show significant improvements in avoiding inappropriate recommendations. Our findings underscore that AI chatbots are not yet fully able of providing reliable advice in women's health.
38. 【2512.16970】PAACE: A Plan-Aware Automated Agent Context Engineering Framework
链接:https://arxiv.org/abs/2512.16970
作者:Kamer Ali Yuksel
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
关键词:Large Language Model, Large Language, external knowledge systems, workflows involving planning, multi-step workflows involving
备注:
点击查看摘要
Abstract:Large Language Model (LLM) agents are increasingly deployed in complex, multi-step workflows involving planning, tool use, reflection, and interaction with external knowledge systems. These workflows generate rapidly expanding contexts that must be curated, transformed, and compressed to maintain fidelity, avoid attention dilution, and reduce inference cost. Prior work on summarization and query-aware compression largely ignores the multi-step, plan-aware nature of agentic reasoning. In this work, we introduce PAACE (Plan-Aware Automated Context Engineering), a unified framework for optimizing the evolving state of LLM agents through next-k-task relevance modeling, plan-structure analysis, instruction co-refinement, and function-preserving compression. PAACE comprises (1) PAACE-Syn, a large-scale generator of synthetic agent workflows annotated with stepwise compression supervision, and (2) PAACE-FT, a family of distilled, plan-aware compressors trained from successful teacher demonstrations. Experiments on long-horizon benchmarks (AppWorld, OfficeBench, and 8-Objective QA) demonstrate that PAACE consistently improves agent correctness while substantially reducing context load. On AppWorld, PAACE achieves higher accuracy than all baselines while lowering peak context and cumulative dependency. On OfficeBench and multi-hop QA, PAACE improves both accuracy and F1, achieving fewer steps, lower peak tokens, and reduced attention dependency. Distilled PAACE-FT retains 97 percent of the teacher's performance while reducing inference cost by over an order of magnitude, enabling practical deployment of plan-aware compression with compact models.
39. 【2512.16969】Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows
链接:https://arxiv.org/abs/2512.16969
作者:Wanghan Xu,Yuhao Zhou,Yifan Zhou,Qinglong Cao,Shuo Li,Jia Bu,Bo Liu,Yixin Chen,Xuming He,Xiangyu Zhao,Xiang Zhuang,Fengxiang Wang,Zhiwang Zhou,Qiantai Feng,Wenxuan Huang,Jiaqi Wei,Hao Wu,Yuejin Yang,Guangshuai Wang,Sheng Xu,Ziyan Huang,Xinyao Liu,Jiyao Liu,Cheng Tang,Wei Li,Ying Chen,Junzhi Ning,Pengfei Jiang,Chenglong Ma,Ye Du,Changkai Ji,Huihui Xu,Ming Hu,Jiangbin Zheng,Xin Chen,Yucheng Wu,Feifei Jiang,Xi Chen,Xiangru Tang,Yuchen Fu,Yingzhou Lu,Yuanyuan Zhang,Lihao Sun,Chengbo Li,Jinzhe Ma,Wanhao Liu,Yating Liu,Kuo-Cheng Wu,Shengdu Chai,Yizhou Wang,Ouwen Zhangjin,Chen Tang,Shufei Zhang,Wenbo Cao,Junjie Ren,Taoyong Cui,Zhouheng Yao,Juntao Deng,Yijie Sun,Feng Liu,Wangxu Wei,Jingyi Xu,Zhangrui Li,Junchao Gong,Zijie Guo,Zhiyu Yao,Zaoyu Chen,Tianhao Peng,Fangchen Yu,Bo Zhang,Dongzhan Zhou,Shixiang Tang,Jiaheng Liu,Fenghua Ling,Yan Lu,Yuchen Ren,Ben Fei,Zhen Zhao,Xinyu Gu,Rui Su,Xiao-Ming Wu,Weikang Si,Yang Liu,Hao Chen,Xiangchao Yan,Xue Yang,Junchi Yan,Jiamin Wu,Qihao Zheng,Chenhui Li,Zhiqiang Gao,Hao Kong,Junjun He,Mao Su,Tianfan Fu,Peng Ye,Chunfeng Song,Nanqing Dong,Yuqiang Li,Huazhu Fu
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Scientific General Intelligence, General Intelligence, Practical Inquiry Model, autonomously conceive, Scientific General
备注:
点击查看摘要
Abstract:Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI)-the ability to autonomously conceive, investigate, and reason across scientific domains-remains lacking. We present an operational SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science's 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal gaps: low exact match (10--20%) in deep research despite step-level alignment; ideas lacking feasibility and detail; high code executability but low execution result accuracy in dry experiments; low sequence fidelity in wet protocols; and persistent multimodal comparative-reasoning challenges. We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, enhancing hypothesis novelty without reference answer. Together, our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.
40. 【2502.12672】Speech-FT: Merging Pre-trained And Fine-Tuned Speech Representation Models For Cross-Task Generalization
链接:https://arxiv.org/abs/2502.12672
作者:Tzu-Quan Lin,Wei-Ping Huang,Hao Tang,Hung-yi Lee
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
关键词:cross-task generalization, cross-task generalization ability, specific tasks, speech representation models, Fine-tuning
备注: Published in IEEE Transactions on Audio, Speech, and Language Processing (TASLP). Model and code available at: [this https URL](https://github.com/nervjack2/Speech-FT)
点击查看摘要
Abstract:Fine-tuning speech representation models can enhance performance on specific tasks but often compromises their cross-task generalization ability. This degradation is often caused by excessive changes in the representations, making it difficult to retain information learned during pre-training. Existing approaches, such as regularizing weight changes during fine-tuning, may fail to maintain sufficiently high feature similarity with the pre-trained model, and thus could possibly lose cross-task generalization. To address this issue, we propose Speech-FT, a novel two-stage fine-tuning framework designed to maintain cross-task generalization while benefiting from fine-tuning. Speech-FT first applies fine-tuning specifically designed to reduce representational drift, followed by weight-space interpolation with the pre-trained model to restore cross-task generalization. Extensive experiments on HuBERT, wav2vec 2.0, DeCoAR 2.0, and WavLM Base+ demonstrate that Speech-FT consistently improves performance across a wide range of supervised, unsupervised, and multitask fine-tuning scenarios. Moreover, Speech-FT achieves superior cross-task generalization compared to fine-tuning baselines that explicitly constrain weight changes, such as weight-space regularization and LoRA fine-tuning. Our analysis reveals that Speech-FT maintains higher feature similarity to the pre-trained model compared to alternative strategies, despite allowing larger weight-space updates. Notably, Speech-FT achieves significant improvements on the SUPERB benchmark. For example, when fine-tuning HuBERT on automatic speech recognition, Speech-FT is able to reduce phone error rate from 5.17% to 3.94%, lower word error rate from 6.38% to 5.75%, and increase speaker identification accuracy from 81.86% to 84.11%. Speech-FT provides a simple yet powerful solution for further refining speech representation models after pre-training.
41. 【2512.17525】Computational analysis reveals historical trajectory of East-Polynesian lunar calendars
链接:https://arxiv.org/abs/2512.17525
作者:Miguel Valério,Fabio Tamburini,Michele Corazza
类目:Populations and Evolution (q-bio.PE); Computation and Language (cs.CL)
关键词:including Rapa Nui, Easter Island, Rapa Nui, East Polynesia, East Polynesian lunar
备注:
点击查看摘要
Abstract:We investigate a type of lunar calendar known as lists of the 'nights of the moon', found throughout East Polynesia, including Rapa Nui (Easter Island). Using computational methods, we analyzed the lexical and structural divergence of 49 calendric lists from all major archipelagos, each containing about 30 night names. Our results, presented as a rooted phylogenetic tree, show a clear split into two main groups: one including lists from Rapa Nui, Mangareva, and the Marquesas; the other comprising lists from New Zealand, Hawaii, the Cook Islands, the Austral Islands, Tahiti, and the Tuamotu. This pattern aligns with a recent alternative classification of East Polynesian languages into 'Distal' (Marquesan, Mangarevan, Rapanui) and 'Proximal' (Maori, Hawaiian, Tahitian, etc.) subgroups. Since both language and lunar calendars are symbolic systems passed down and changed within communities - and given the geographic isolation of many archipelagos - we interpret this correspondence as evidence that the early divergence of East Polynesian lunar calendars mirrors early population movements and language splits in the region.
信息检索
1. 【2512.17795】Intelligent Knowledge Mining Framework: Bridging AI Analysis and Trustworthy Preservation
链接:https://arxiv.org/abs/2512.17795
作者:Binh Vu
类目:Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:presents significant challenges, challenges in access, data-intensive sectors, unprecedented proliferation, proliferation of digital
备注:
点击查看摘要
Abstract:The unprecedented proliferation of digital data presents significant challenges in access, integration, and value creation across all data-intensive sectors. Valuable information is frequently encapsulated within disparate systems, unstructured documents, and heterogeneous formats, creating silos that impede efficient utilization and collaborative decision-making. This paper introduces the Intelligent Knowledge Mining Framework (IKMF), a comprehensive conceptual model designed to bridge the critical gap between dynamic AI-driven analysis and trustworthy long-term preservation. The framework proposes a dual-stream architecture: a horizontal Mining Process that systematically transforms raw data into semantically rich, machine-actionable knowledge, and a parallel Trustworthy Archiving Stream that ensures the integrity, provenance, and computational reproducibility of these assets. By defining a blueprint for this symbiotic relationship, the paper provides a foundational model for transforming static repositories into living ecosystems that facilitate the flow of actionable intelligence from producers to consumers. This paper outlines the motivation, problem statement, and key research questions guiding the research and development of the framework, presents the underlying scientific methodology, and details its conceptual design and modeling.
2. 【2512.17733】Diversity Recommendation via Causal Deconfounding of Co-purchase Relations and Counterfactual Exposure
链接:https://arxiv.org/abs/2512.17733
作者:Jingmao Zhang,Zhiting Zhao,Yunqi Lin,Jianghong Ma,Tianjun Wei,Haijun Zhang,Xiaofeng Zhang
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:user-item modeling, Asymmetric Co-purchase Relationship, Unbiased Asymmetric Co-purchase, enhance recommendation, recommendation
备注:
点击查看摘要
Abstract:Beyond user-item modeling, item-to-item relationships are increasingly used to enhance recommendation. However, common methods largely rely on co-occurrence, making them prone to item popularity bias and user attributes, which degrades embedding quality and performance. Meanwhile, although diversity is acknowledged as a key aspect of recommendation quality, existing research offers limited attention to it, with a notable lack of causal perspectives and theoretical grounding. To address these challenges, we propose Cadence: Diversity Recommendation via Causal Deconfounding of Co-purchase Relations and Counterfactual Exposure - a plug-and-play framework built upon LightGCN as the backbone, primarily designed to enhance recommendation diversity while preserving accuracy. First, we compute the Unbiased Asymmetric Co-purchase Relationship (UACR) between items - excluding item popularity and user attributes - to construct a deconfounded directed item graph, with an aggregation mechanism to refine embeddings. Second, we leverage UACR to identify diverse categories of items that exhibit strong causal relevance to a user's interacted items but have not yet been engaged with. We then simulate their behavior under high-exposure scenarios, thereby significantly enhancing recommendation diversity while preserving relevance. Extensive experiments on real-world datasets demonstrate that our method consistently outperforms state-of-the-art diversity models in both diversity and accuracy, and further validates its effectiveness, transferability, and efficiency over baselines.
3. 【2512.17462】Behavioural Effects of Agentic Messaging: A Case Study on a Financial Service Application
链接:https://arxiv.org/abs/2512.17462
作者:Olivier Jeunen,Schaun Wheeler
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Information Retrieval methods, Information Retrieval, Marketing and product, Retrieval methods, product personalisation provide
备注: To appear in the 48th European Conference on Information Retrieval (ECIR '26) Industry Track
点击查看摘要
Abstract:Marketing and product personalisation provide a prominent and visible use-case for the application of Information Retrieval methods across several business domains. Recently, agentic approaches to these problems have been gaining traction. This work evaluates the behavioural and retention effects of agentic personalisation on a financial service application's customer communication system during a 2025 national tax filing period. Through a two month-long randomised controlled trial, we compare an agentic messaging approach against a business-as-usual (BAU) rule-based campaign system, focusing on two primary outcomes: unsubscribe behaviour and conversion timing. Empirical results show that agent-led messaging reduced unsubscribe events by 21\% ($\pm 0.01$) relative to BAU and increased early filing behaviour in the weeks preceding the national deadline. These findings demonstrate how adaptive, user-level decision-making systems can modulate engagement intensity whilst improving long-term retention indicators.
4. 【2512.17442】A Systematic Reproducibility Study of BSARec for Sequential Recommendation
链接:https://arxiv.org/abs/2512.17442
作者:Jan Hutter,Hua Chang Bakker,Stan Fris,Madelon Bernardy,Yuanna Liu
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:Transformer-based models acts, mechanism of Transformer-based, Transformer-based models, short-term user interests, reflect short-term user
备注: Jan Hutter, Hua Chang Bakker, Stan Fris, Madelon Bernardy contributed equally to this work
点击查看摘要
Abstract:In sequential recommendation (SR), the self-attention mechanism of Transformer-based models acts as a low-pass filter, limiting their ability to capture high-frequency signals that reflect short-term user interests. To overcome this, BSARec augments the Transformer encoder with a frequency layer that rescales high-frequency components using the Fourier transform. However, the overall effectiveness of BSARec and the roles of its individual components have yet to be systematically validated. We reproduce BSARec and show that it outperforms other SR methods on some datasets. To empirically assess whether BSARec improves performance on high-frequency signals, we propose a metric to quantify user history frequency and evaluate SR methods across different user groups. We compare digital signal processing (DSP) techniques and find that the discrete wavelet transform (DWT) offer only slight improvements over Fourier transforms, and DSP methods provide no clear advantage over simple residual connections. Finally, we explore padding strategies and find that non-constant padding significantly improves recommendation performance, whereas constant padding hinders the frequency rescaler's ability to capture high-frequency signals.
5. 【2512.17389】he Mental World of Large Language Models in Recommendation: A Benchmark on Association, Personalization, and Knowledgeability
链接:https://arxiv.org/abs/2512.17389
作者:Guangneng Hu
类目:Information Retrieval (cs.IR)
关键词:zero-shot ranker, shown potential, enhancer or zero-shot, Large language models, large semantic gap
备注: 21 pages, 13 figures, 27 tables, submission to KDD 2025
点击查看摘要
Abstract:Large language models (LLMs) have shown potential in recommendation systems (RecSys) by using them as either knowledge enhancer or zero-shot ranker. A key challenge lies in the large semantic gap between LLMs and RecSys where the former internalizes language world knowledge while the latter captures personalized world of behaviors. Unfortunately, the research community lacks a comprehensive benchmark that evaluates the LLMs over their limitations and boundaries in RecSys so that we can draw a confident conclusion. To investigate this, we propose a benchmark named LRWorld containing over 38K high-quality samples and 23M tokens carefully compiled and generated from widely used public recommendation datasets. LRWorld categorizes the mental world of LLMs in RecSys as three main scales (association, personalization, and knowledgeability) spanned by ten factors with 31 measures (tasks). Based on LRWorld, comprehensive experiments on dozens of LLMs show that they are still not well capturing the deep neural personalized embeddings but can achieve good results on shallow memory-based item-item similarity. They are also good at perceiving item entity relations, entity hierarchical taxonomies, and item-item association rules when inferring user interests. Furthermore, LLMs show a promising ability in multimodal knowledge reasoning (movie poster and product image) and robustness to noisy profiles. None of them show consistently good performance over the ten factors. Model sizes, position bias, and more are ablated.
6. 【2512.17277】Warmer for Less: A Cost-Efficient Strategy for Cold-Start Recommendations at Pinterest
链接:https://arxiv.org/abs/2512.17277
作者:Saeed Ebrahimi,Weijie Jiang,Jaewon Yang,Olafur Gudmundsson,Yucheng Tu,Huizhong Duan
类目:Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:leading visual discovery, visual discovery platform, recommender systems, delivering relevant, leading visual
备注: Submitted to the WWW'26
点击查看摘要
Abstract:Pinterest is a leading visual discovery platform where recommender systems (RecSys) are key to delivering relevant, engaging, and fresh content to our users. In this paper, we study the problem of improving RecSys model predictions for cold-start (CS) items, which appear infrequently in the training data. Although this problem is well-studied in academia, few studies have addressed its root causes effectively at the scale of a platform like Pinterest. By investigating live traffic data, we identified several challenges of the CS problem and developed a corresponding solution for each: First, industrial-scale RecSys models must operate under tight computational constraints. Since CS items are a minority, any related improvements must be highly cost-efficient. To address this, our solutions were designed to be lightweight, collectively increasing the total parameters by only 5%. Second, CS items are represented only by non-historical (e.g., content or attribute) features, which models often treat as less important. To elevate their significance, we introduce a residual connection for the non-historical features. Third, CS items tend to receive lower prediction scores compared to non-CS items, reducing their likelihood of being surfaced. We mitigate this by incorporating a score regularization term into the model. Fourth, the labels associated with CS items are sparse, making it difficult for the model to learn from them. We apply the manifold mixup technique to address this data sparsity. Implemented together, our methods increased fresh content engagement at Pinterest by 10% without negatively impacting overall engagement and cost, and have been deployed to serve over 570 million users on Pinterest.
7. 【2512.17178】ABE-CLIP: Training-Free Attribute Binding Enhancement for Compositional Image-Text Matching
链接:https://arxiv.org/abs/2512.17178
作者:Qi Zhang,Yuxu Chen,Lei Deng,Lili Shen
类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
关键词:Contrastive Language-Image Pretraining, Contrastive Language-Image, Language-Image Pretraining, multimodal tasks, achieved remarkable performance
备注: 10 pages, 8 figures
点击查看摘要
Abstract:Contrastive Language-Image Pretraining (CLIP) has achieved remarkable performance in various multimodal tasks. However, it still struggles with compositional image-text matching, particularly in accurately associating objects with their corresponding attributes, because its inherent global representation often overlooks fine-grained semantics for attribute binding. Existing methods often require additional training or extensive hard negative sampling, yet they frequently show limited generalization to novel compositional concepts and fail to fundamentally address the drawbacks of global representations. In this paper, we propose ABE-CLIP, a novel training-free Attribute Binding Enhancement method designed to strengthen attribute-object binding in CLIP-like models. Specifically, we employ a Semantic Refinement Mechanism to refine token embeddings for both object and attribute phrases in the text, thereby mitigating attribute confusion and improving semantic precision. We further introduce a Local Token-Patch Alignment strategy that computes similarity scores between refined textual tokens and their most relevant image patches. By aggregating localized similarity scores, ABE-CLIP computes the final image-text similarity. Experiments on multiple datasets demonstrate that ABE-CLIP significantly improves attribute-object binding performance, even surpassing methods that require extensive training.
8. 【2512.17164】CDE: Topic-Centric Dual Expansion of Queries and Documents with Large Language Models for Information Retrieval
链接:https://arxiv.org/abs/2512.17164
作者:Yu Yang,Feng Tian,Ping Chen
类目:Information Retrieval (cs.IR)
关键词:applied separately, enriches queries, enriches documents, Expansion, queries
备注:
点击查看摘要
Abstract:Query Expansion (QE) enriches queries and Document Expansion (DE) enriches documents, and these two techniques are often applied separately. However, such separate application may lead to semantic misalignment between the expanded queries (or documents) and their relevant documents (or queries). To address this serious issue, we propose TCDE, a dual expansion strategy that leverages large language models (LLMs) for topic-centric enrichment on both queries and documents. In TCDE, we design two distinct prompt templates for processing each query and document. On the query side, an LLM is guided to identify distinct sub-topics within each query and generate a focused pseudo-document for each sub-topic. On the document side, an LLM is guided to distill each document into a set of core topic sentences. The resulting outputs are used to expand the original query and document. This topic-centric dual expansion process establishes semantic bridges between queries and their relevant documents, enabling better alignment for downstream retrieval models. Experiments on two challenging benchmarks, TREC Deep Learning and BEIR, demonstrate that TCDE achieves substantial improvements over strong state-of-the-art expansion baselines. In particular, on dense retrieval tasks, it outperforms several state-of-the-art methods, with a relative improvement of 2.8\% in NDCG@10 on the SciFact dataset. Experimental results validate the effectiveness of our topic-centric and dual expansion strategy.
9. 【2512.17027】Unexpected Knowledge: Auditing Wikipedia and Grokipedia Search Recommendations
链接:https://arxiv.org/abs/2512.17027
作者:Erica Coppolillo,Simone Mungari
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
关键词:explore information online, Encyclopedic knowledge platforms, search engine, key gateways, users explore information
备注:
点击查看摘要
Abstract:Encyclopedic knowledge platforms are key gateways through which users explore information online. The recent release of Grokipedia, a fully AI-generated encyclopedia, introduces a new alternative to traditional, well-established platforms like Wikipedia. In this context, search engine mechanisms play an important role in guiding users exploratory paths, yet their behavior across different encyclopedic systems remains underexplored. In this work, we address this gap by providing the first comparative analysis of search engine in Wikipedia and Grokipedia. Using nearly 10,000 neutral English words and their substrings as queries, we collect over 70,000 search engine results and examine their semantic alignment, overlap, and topical structure. We find that both platforms frequently generate results that are weakly related to the original query and, in many cases, surface unexpected content starting from innocuous queries. Despite these shared properties, the two systems often produce substantially different recommendation sets for the same query. Through topical annotation and trajectory analysis, we further identify systematic differences in how content categories are surfaced and how search engine results evolve over multiple stages of exploration. Overall, our findings show that unexpected search engine outcomes are a common feature of both the platforms, even though they exhibit discrepancies in terms of topical distribution and query suggestions.
Subjects:
Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Cite as:
arXiv:2512.17027 [cs.IR]
(or
arXiv:2512.17027v1 [cs.IR] for this version)
https://doi.org/10.48550/arXiv.2512.17027
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
10. 【2512.17015】A Reproducible and Fair Evaluation of Partition-aware Collaborative Filtering
链接:https://arxiv.org/abs/2512.17015
作者:Domenico De Gioia,Claudio Pomo,Ludovico Boratto,Tommaso Di Noia
类目:Information Retrieval (cs.IR)
关键词:long demonstrated strong, demonstrated strong offline, strong offline performance, conceptual simplicity, long demonstrated
备注: accepted at ECIR 2026 reproducibility track
点击查看摘要
Abstract:Similarity-based collaborative filtering (CF) models have long demonstrated strong offline performance and conceptual simplicity. However, their scalability is limited by the quadratic cost of maintaining dense item-item similarity matrices. Partitioning-based paradigms have recently emerged as an effective strategy for balancing effectiveness and efficiency, enabling models to learn local similarities within coherent subgraphs while maintaining a limited global context. In this work, we focus on the Fine-tuning Partition-aware Similarity Refinement (FPSR) framework, a prominent representative of this family, as well as its extension, FPSR+. Reproducible evaluation of partition-aware collaborative filtering remains challenging, as prior FPSR/FPSR+ reports often rely on splits of unclear provenance and omit some similarity-based baselines, thereby complicating fair comparison. We present a transparent, fully reproducible benchmark of FPSR and FPSR+. Based on our results, the family of FPSR models does not consistently perform at the highest level. Overall, it remains competitive, validates its design choices, and shows significant advantages in long-tail scenarios. This highlights the accuracy-coverage trade-offs resulting from partitioning, global components, and hub design. Our investigation clarifies when partition-aware similarity modeling is most beneficial and offers actionable guidance for scalable recommender system design under reproducible protocols.
11. 【2512.16925】V-Agent: An Interactive Video Search System Using Vision-Language Models
链接:https://arxiv.org/abs/2512.16925
作者:SunYoung Park,Jong-Hyeon Lee,Youngjune Kim,Daegyu Sung,Younghyun Yu,Young-rok Cha,Jeongho Ju
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
关键词:interactive user-system conversations, multi-agent platform designed, user-system conversations, multi-agent platform, platform designed
备注: CIKM 2025 MMGENSR Workshop
点击查看摘要
Abstract:We introduce V-Agent, a novel multi-agent platform designed for advanced video search and interactive user-system conversations. By fine-tuning a vision-language model (VLM) with a small video preference dataset and enhancing it with a retrieval vector from an image-text retrieval model, we overcome the limitations of traditional text-based retrieval systems in multimodal scenarios. The VLM-based retrieval model independently embeds video frames and audio transcriptions from an automatic speech recognition (ASR) module into a shared multimodal representation space, enabling V-Agent to interpret both visual and spoken content for context-aware video search. This system consists of three agents-a routing agent, a search agent, and a chat agent-that work collaboratively to address user intents by refining search outputs and communicating with users. The search agent utilizes the VLM-based retrieval model together with an additional re-ranking module to further enhance video retrieval quality. Our proposed framework demonstrates state-of-the-art zero-shot performance on the MultiVENT 2.0 benchmark, highlighting its potential for both academic research and real-world applications.
计算机视觉
1. 【2512.17909】Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing
链接:https://arxiv.org/abs/2512.17909
作者:Shilong Zhang,He Zhang,Zhifei Zhang,Chongjian Ge,Shuchen Xue,Shaoteng Liu,Mengwei Ren,Soo Ye Kim,Yuqian Zhou,Qing Liu,Daniil Pakhomov,Kai Zhang,Zhe Lin,Ping Luo
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:low-level Variational Autoencoder, Variational Autoencoder, Modern Latent Diffusion, low-level Variational, Modern Latent
备注: Project Page: [this https URL](https://jshilong.github.io/PS-VAE-PAGE/)
点击查看摘要
Abstract:Modern Latent Diffusion Models (LDMs) typically operate in low-level Variational Autoencoder (VAE) latent spaces that are primarily optimized for pixel-level reconstruction. To unify vision generation and understanding, a burgeoning trend is to adopt high-dimensional features from representation encoders as generative latents. However, we empirically identify two fundamental obstacles in this paradigm: (1) the discriminative feature space lacks compact regularization, making diffusion models prone to off-manifold latents that lead to inaccurate object structures; and (2) the encoder's inherently weak pixel-level reconstruction hinders the generator from learning accurate fine-grained geometry and texture. In this paper, we propose a systematic framework to adapt understanding-oriented encoder features for generative tasks. We introduce a semantic-pixel reconstruction objective to regularize the latent space, enabling the compression of both semantic information and fine-grained details into a highly compact representation (96 channels with 16x16 spatial downsampling). This design ensures that the latent space remains semantically rich and achieves state-of-the-art image reconstruction, while remaining compact enough for accurate generation. Leveraging this representation, we design a unified Text-to-Image (T2I) and image editing model. Benchmarking against various feature spaces, we demonstrate that our approach achieves state-of-the-art reconstruction, faster convergence, and substantial performance gains in both T2I and editing tasks, validating that representation encoders can be effectively adapted into robust generative components.
2. 【2512.17908】Re-Depth Anything: Test-Time Depth Refinement via Self-Supervised Re-lighting
链接:https://arxiv.org/abs/2512.17908
作者:Ananta R. Bhattarai,Helge Rhodin
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Monocular depth estimation, estimation remains challenging, recent foundation models, depth estimation remains, struggle with real-world
备注:
点击查看摘要
Abstract:Monocular depth estimation remains challenging as recent foundation models, such as Depth Anything V2 (DA-V2), struggle with real-world images that are far from the training distribution. We introduce Re-Depth Anything, a test-time self-supervision framework that bridges this domain gap by fusing DA-V2 with the powerful priors of large-scale 2D diffusion models. Our method performs label-free refinement directly on the input image by re-lighting predicted depth maps and augmenting the input. This re-synthesis method replaces classical photometric reconstruction by leveraging shape from shading (SfS) cues in a new, generative context with Score Distillation Sampling (SDS). To prevent optimization collapse, our framework employs a targeted optimization strategy: rather than optimizing depth directly or fine-tuning the full model, we freeze the encoder and only update intermediate embeddings while also fine-tuning the decoder. Across diverse benchmarks, Re-Depth Anything yields substantial gains in depth accuracy and realism over the DA-V2, showcasing new avenues for self-supervision by augmenting geometric reasoning.
3. 【2512.17907】Dexterous World Models
链接:https://arxiv.org/abs/2512.17907
作者:Byungjun Kim,Taeksoo Kim,Junyoung Lee,Hanbyul Joo
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent progress, digital twins, reconstruction has made, everyday environments, create realistic digital
备注: Project Page: [this http URL](http://snuvclab.github.io/dwm)
点击查看摘要
Abstract:Recent progress in 3D reconstruction has made it easy to create realistic digital twins from everyday environments. However, current digital twins remain largely static and are limited to navigation and view synthesis without embodied interactivity. To bridge this gap, we introduce Dexterous World Model (DWM), a scene-action-conditioned video diffusion framework that models how dexterous human actions induce dynamic changes in static 3D scenes. Given a static 3D scene rendering and an egocentric hand motion sequence, DWM generates temporally coherent videos depicting plausible human-scene interactions. Our approach conditions video generation on (1) static scene renderings following a specified camera trajectory to ensure spatial consistency, and (2) egocentric hand mesh renderings that encode both geometry and motion cues to model action-conditioned dynamics directly. To train DWM, we construct a hybrid interaction video dataset. Synthetic egocentric interactions provide fully aligned supervision for joint locomotion and manipulation learning, while fixed-camera real-world videos contribute diverse and realistic object dynamics. Experiments demonstrate that DWM enables realistic and physically plausible interactions, such as grasping, opening, and moving objects, while maintaining camera and scene consistency. This framework represents a first step toward video diffusion-based interactive digital twins and enables embodied simulation from egocentric actions.
Comments:
Project Page: this http URL
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2512.17907 [cs.CV]
(or
arXiv:2512.17907v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2512.17907
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
4. 【2512.17902】Adversarial Robustness of Vision in Open Foundation Models
链接:https://arxiv.org/abs/2512.17902
作者:Jonathon Fox,William J Buchanan,Pavlos Papadopoulos
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
关键词:deep learning, identify objects, increase in deep, increasingly difficult, difficult to understand
备注:
点击查看摘要
Abstract:With the increase in deep learning, it becomes increasingly difficult to understand the model in which AI systems can identify objects. Thus, an adversary could aim to modify an image by adding unseen elements, which will confuse the AI in its recognition of an entity. This paper thus investigates the adversarial robustness of LLaVA-1.5-13B and Meta's Llama 3.2 Vision-8B-2. These are tested for untargeted PGD (Projected Gradient Descent) against the visual input modality, and empirically evaluated on the Visual Question Answering (VQA) v2 dataset subset. The results of these adversarial attacks are then quantified using the standard VQA accuracy metric. This evaluation is then compared with the accuracy degradation (accuracy drop) of LLaVA and Llama 3.2 Vision. A key finding is that Llama 3.2 Vision, despite a lower baseline accuracy in this setup, exhibited a smaller drop in performance under attack compared to LLaVA, particularly at higher perturbation levels. Overall, the findings confirm that the vision modality represents a viable attack vector for degrading the performance of contemporary open-weight VLMs, including Meta's Llama 3.2 Vision. Furthermore, they highlight that adversarial robustness does not necessarily correlate directly with standard benchmark performance and may be influenced by underlying architectural and training factors.
5. 【2512.17900】Diffusion Forcing for Multi-Agent Interaction Sequence Modeling
链接:https://arxiv.org/abs/2512.17900
作者:Vongani H. Maluleke,Kie Horiuchi,Lea Wilken,Evonne Ng,Jitendra Malik,Angjoo Kanazawa
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:Understanding and generating, generating multi-person interactions, generating multi-person, fundamental challenge, challenge with broad
备注:
点击查看摘要
Abstract:Understanding and generating multi-person interactions is a fundamental challenge with broad implications for robotics and social computing. While humans naturally coordinate in groups, modeling such interactions remains difficult due to long temporal horizons, strong inter-agent dependencies, and variable group sizes. Existing motion generation methods are largely task-specific and do not generalize to flexible multi-agent generation. We introduce MAGNet (Multi-Agent Diffusion Forcing Transformer), a unified autoregressive diffusion framework for multi-agent motion generation that supports a wide range of interaction tasks through flexible conditioning and sampling. MAGNet performs dyadic prediction, partner inpainting, and full multi-agent motion generation within a single model, and can autoregressively generate ultra-long sequences spanning hundreds of v. Building on Diffusion Forcing, we introduce key modifications that explicitly model inter-agent coupling during autoregressive denoising, enabling coherent coordination across agents. As a result, MAGNet captures both tightly synchronized activities (e.g, dancing, boxing) and loosely structured social interactions. Our approach performs on par with specialized methods on dyadic benchmarks while naturally extending to polyadic scenarios involving three or more interacting people, enabled by a scalable architecture that is agnostic to the number of agents. We refer readers to the supplemental video, where the temporal dynamics and spatial coordination of generated interactions are best appreciated. Project page: this https URL
6. 【2512.17897】RadarGen: Automotive Radar Point Cloud Generation from Cameras
链接:https://arxiv.org/abs/2512.17897
作者:Tomer Borreda,Fangqiang Ding,Sanja Fidler,Shengyu Huang,Or Litany
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
关键词:multi-view camera imagery, synthesizing realistic automotive, realistic automotive radar, camera imagery, synthesizing realistic
备注: Project page: [this https URL](https://radargen.github.io/)
点击查看摘要
Abstract:We present RadarGen, a diffusion model for synthesizing realistic automotive radar point clouds from multi-view camera imagery. RadarGen adapts efficient image-latent diffusion to the radar domain by representing radar measurements in bird's-eye-view form that encodes spatial structure together with radar cross section (RCS) and Doppler attributes. A lightweight recovery step reconstructs point clouds from the generated maps. To better align generation with the visual scene, RadarGen incorporates BEV-aligned depth, semantic, and motion cues extracted from pretrained foundation models, which guide the stochastic generation process toward physically plausible radar patterns. Conditioning on images makes the approach broadly compatible, in principle, with existing visual datasets and simulation frameworks, offering a scalable direction for multimodal generative simulation. Evaluations on large-scale driving data show that RadarGen captures characteristic radar measurement distributions and reduces the gap to perception models trained on real data, marking a step toward unified generative simulation across sensing modalities.
7. 【2512.17891】Keypoint Counting Classifiers: Turning Vision Transformers into Self-Explainable Models Without Training
链接:https://arxiv.org/abs/2512.17891
作者:Kristoffer Wickstrøm,Teresa Dorszewski,Siyan Chen,Michael Kampffmeyer,Elisabeth Wetzer,Robert Jenssen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:require complicated training, complicated training procedures, Current approaches, designing self-explainable models, require complicated
备注:
点击查看摘要
Abstract:Current approaches for designing self-explainable models (SEMs) require complicated training procedures and specific architectures which makes them impractical. With the advance of general purpose foundation models based on Vision Transformers (ViTs), this impracticability becomes even more problematic. Therefore, new methods are necessary to provide transparency and reliability to ViT-based foundation models. In this work, we present a new method for turning any well-trained ViT-based model into a SEM without retraining, which we call Keypoint Counting Classifiers (KCCs). Recent works have shown that ViTs can automatically identify matching keypoints between images with high precision, and we build on these results to create an easily interpretable decision process that is inherently visualizable in the input. We perform an extensive evaluation which show that KCCs improve the human-machine communication compared to recent baselines. We believe that KCCs constitute an important step towards making ViT-based foundation models more transparent and reliable.
8. 【2512.17875】Visually Prompted Benchmarks Are Surprisingly Fragile
链接:https://arxiv.org/abs/2512.17875
作者:Haiwen Feng,Long Lian,Lisa Dunlap,Jiahao Shu,XuDong Wang,Renhao Wang,Trevor Darrell,Alane Suhr,Angjoo Kanazawa
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:testing models' ability, analyze visual content, visual content independently, textual priors, visual
备注:
点击查看摘要
Abstract:A key challenge in evaluating VLMs is testing models' ability to analyze visual content independently from their textual priors. Recent benchmarks such as BLINK probe visual perception through visual prompting, where questions about visual content are paired with coordinates to which the question refers, with the coordinates explicitly marked in the image itself. While these benchmarks are an important part of VLM evaluation, we find that existing models are surprisingly fragile to seemingly irrelevant details of visual prompting: simply changing a visual marker from red to blue can completely change rankings among models on a leaderboard. By evaluating nine commonly-used open- and closed-source VLMs on two visually prompted tasks, we demonstrate how details in benchmark setup, including visual marker design and dataset size, have a significant influence on model performance and leaderboard rankings. These effects can even be exploited to lift weaker models above stronger ones; for instance, slightly increasing the size of the visual marker results in open-source InternVL3-8B ranking alongside or better than much larger proprietary models like Gemini 2.5 Pro. We further show that low-level inference choices that are often ignored in benchmarking, such as JPEG compression levels in API calls, can also cause model lineup changes. These details have substantially larger impacts on visually prompted benchmarks than on conventional semantic VLM evaluations. To mitigate this instability, we curate existing datasets to create VPBench, a larger visually prompted benchmark with 16 visual marker variants. VPBench and additional analysis tools are released at this https URL.
9. 【2512.17873】InSPECT: Invariant Spectral Features Preservation of Diffusion Models
链接:https://arxiv.org/abs/2512.17873
作者:Baohua Yan,Qingyuan Liu,Jennifer Kava,Xuan Di
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Modern diffusion models, Modern diffusion, invariant spectral features, Invariant Spectral, diffusion models
备注:
点击查看摘要
Abstract:Modern diffusion models (DMs) have achieved state-of-the-art image generation. However, the fundamental design choice of diffusing data all the way to white noise and then reconstructing it leads to an extremely difficult and computationally intractable prediction task. To overcome this limitation, we propose InSPECT (Invariant Spectral Feature-Preserving Diffusion Model), a novel diffusion model that keeps invariant spectral features during both the forward and backward processes. At the end of the forward process, the Fourier coefficients smoothly converge to a specified random noise, enabling features preservation while maintaining diversity and randomness. By preserving invariant features, InSPECT demonstrates enhanced visual diversity, faster convergence rate, and a smoother diffusion process. Experiments on CIFAR-10, Celeb-A, and LSUN demonstrate that InSPECT achieves on average a 39.23% reduction in FID and 45.80% improvement in IS against DDPM for 10K iterations under specified parameter settings, which demonstrates the significant advantages of preserving invariant features: achieving superior generation quality and diversity, while enhancing computational efficiency and enabling faster convergence rate. To the best of our knowledge, this is the first attempt to analyze and preserve invariant spectral features in diffusion models.
10. 【2512.17864】Interpretable Plant Leaf Disease Detection Using Attention-Enhanced CNN
链接:https://arxiv.org/abs/2512.17864
作者:Balram Singh,Ram Prakash Sharma,Somnath Dey
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:global food security, Convolutional Neural Network, Plant diseases pose, food security, necessitating accurate
备注: 27 pages, 12 figures
点击查看摘要
Abstract:Plant diseases pose a significant threat to global food security, necessitating accurate and interpretable disease detection methods. This study introduces an interpretable attention-guided Convolutional Neural Network (CNN), CBAM-VGG16, for plant leaf disease detection. By integrating Convolution Block Attention Module (CBAM) at each convolutional stage, the model enhances feature extraction and disease localization. Trained on five diverse plant disease datasets, our approach outperforms recent techniques, achieving high accuracy (up to 98.87%) and demonstrating robust generalization. Here, we show the effectiveness of our method through comprehensive evaluation and interpretability analysis using CBAM attention maps, Grad-CAM, Grad-CAM++, and Layer-wise Relevance Propagation (LRP). This study advances the application of explainable AI in agricultural diagnostics, offering a transparent and reliable system for smart farming. The code of our proposed work is available at this https URL.
11. 【2512.17852】Simulation-Driven Deep Learning Framework for Raman Spectral Denoising Under Fluorescence-Dominant Conditions
链接:https://arxiv.org/abs/2512.17852
作者:Mengkun Chen,Sanidhya D. Tripathi,James W. Tunnell
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:spectroscopy enables non-destructive, label-free molecular analysis, Raman spectroscopy enables, label-free molecular, high specificity
备注:
点击查看摘要
Abstract:Raman spectroscopy enables non-destructive, label-free molecular analysis with high specificity, making it a powerful tool for biomedical diagnostics. However, its application to biological tissues is challenged by inherently weak Raman scattering and strong fluorescence background, which significantly degrade signal quality. In this study, we present a simulation-driven denoising framework that combines a statistically grounded noise model with deep learning to enhance Raman spectra acquired under fluorescence-dominated conditions. We comprehensively modeled major noise sources. Based on this model, we generated biologically realistic Raman spectra and used them to train a cascaded deep neural network designed to jointly suppress stochastic detector noise and fluorescence baseline interference. To evaluate the performance of our approach, we simulated human skin spectra derived from real experimental data as a validation case study. Our results demonstrate the potential of physics-informed learning to improve spectral quality and enable faster, more accurate Raman-based tissue analysis.
12. 【2512.17851】InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models
链接:https://arxiv.org/abs/2512.17851
作者:Sarah Rastegar,Violeta Chatalbasheva,Sieger Falkena,Anuj Singh,Yanbo Wang,Tejas Gokhale,Hamid Palangi,Hadi Jamali-Rad
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:models generate high-quality, generate high-quality images, diffusion models generate, models generate, generate high-quality
备注:
点击查看摘要
Abstract:Text-to-image (T2I) diffusion models generate high-quality images but often fail to capture the spatial relations specified in text prompts. This limitation can be traced to two factors: lack of fine-grained spatial supervision in training data and inability of text embeddings to encode spatial semantics. We introduce InfSplign, a training-free inference-time method that improves spatial alignment by adjusting the noise through a compound loss in every denoising step. Proposed loss leverages different levels of cross-attention maps extracted from the backbone decoder to enforce accurate object placement and a balanced object presence during sampling. The method is lightweight, plug-and-play, and compatible with any diffusion backbone. Our comprehensive evaluations on VISOR and T2I-CompBench show that InfSplign establishes a new state-of-the-art (to the best of our knowledge), achieving substantial performance gains over the strongest existing inference-time baselines and even outperforming the fine-tuning-based methods. Codebase is available at GitHub.
13. 【2512.17838】ReX-MLE: The Autonomous Agent Benchmark for Medical Imaging Challenges
链接:https://arxiv.org/abs/2512.17838
作者:Roshan Kenia,Xiaoman Zhang,Pranav Rajpurkar
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:domain-specific scientific problems, machine learning tasks, large language models, coding agents built, ineffective on complex
备注: [this https URL](https://github.com/rajpurkarlab/ReX-MLE)
点击查看摘要
Abstract:Autonomous coding agents built on large language models (LLMs) can now solve many general software and machine learning tasks, but they remain ineffective on complex, domain-specific scientific problems. Medical imaging is a particularly demanding domain, requiring long training cycles, high-dimensional data handling, and specialized preprocessing and validation pipelines, capabilities not fully measured in existing agent benchmarks. To address this gap, we introduce ReX-MLE, a benchmark of 20 challenges derived from high-impact medical imaging competitions spanning diverse modalities and task types. Unlike prior ML-agent benchmarks, ReX-MLE evaluates full end-to-end workflows, requiring agents to independently manage data preprocessing, model training, and submission under realistic compute and time constraints. Evaluating state-of-the-art agents (AIDE, ML-Master, RD-Agent) with different LLM backends (GPT-5, Gemini, Claude), we observe a severe performance gap: most submissions rank in the 0th percentile compared to human experts. Failures stem from domain-knowledge and engineering limitations. ReX-MLE exposes these bottlenecks and provides a foundation for developing domain-aware autonomous AI systems.
14. 【2512.17817】Chorus: Multi-Teacher Pretraining for Holistic 3D Gaussian Scene Encoding
链接:https://arxiv.org/abs/2512.17817
作者:Yue Li,Qi Ma,Runyi Yang,Mengjiao Ma,Bin Ren,Nikola Popovic,Nicu Sebe,Theo Gevers,Luc Van Gool,Danda Pani Paudel,Martin R. Oswald
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:general-purpose features directly, primitives remains under-explored, high-fidelity scene representation, encoding rich, general-purpose features
备注:
点击查看摘要
Abstract:While 3DGS has emerged as a high-fidelity scene representation, encoding rich, general-purpose features directly from its primitives remains under-explored. We address this gap by introducing Chorus, a multi-teacher pretraining framework that learns a holistic feed-forward 3D Gaussian Splatting (3DGS) scene encoder by distilling complementary signals from 2D foundation models. Chorus employs a shared 3D encoder and teacher-specific projectors to learn from language-aligned, generalist, and object-aware teachers, encouraging a shared embedding space that captures signals from high-level semantics to fine-grained structure. We evaluate Chorus on a wide range of tasks: open-vocabulary semantic and instance segmentation, linear and decoder probing, as well as data-efficient supervision. Besides 3DGS, we also test Chorus on several benchmarks that only support point clouds by pretraining a variant using only Gaussians' centers, colors, estimated normals as inputs. Interestingly, this encoder shows strong transfer and outperforms the point clouds baseline while using 39.9 times fewer training scenes. Finally, we propose a render-and-distill adaptation that facilitates out-of-domain finetuning. Our code and model will be released upon publication.
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2512.17817 [cs.CV]
(or
arXiv:2512.17817v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2512.17817
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
15. 【2512.17796】Animate Any Character in Any World
链接:https://arxiv.org/abs/2512.17796
作者:Yitong Wang,Fangyun Wei,Hongyang Zhang,Bo Dai,Yan Lu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:greatly enhanced interactive, Recent advances, interactive environment simulation, enhanced interactive environment, greatly enhanced
备注: Project page: [this https URL](https://snowflakewang.github.io/AniX/)
点击查看摘要
Abstract:Recent advances in world models have greatly enhanced interactive environment simulation. Existing methods mainly fall into two categories: (1) static world generation models, which construct 3D environments without active agents, and (2) controllable-entity models, which allow a single entity to perform limited actions in an otherwise uncontrollable environment. In this work, we introduce AniX, leveraging the realism and structural grounding of static world generation while extending controllable-entity models to support user-specified characters capable of performing open-ended actions. Users can provide a 3DGS scene and a character, then direct the character through natural language to perform diverse behaviors from basic locomotion to object-centric interactions while freely exploring the environment. AniX synthesizes temporally coherent video clips that preserve visual fidelity with the provided scene and character, formulated as a conditional autoregressive video generation problem. Built upon a pre-trained video generator, our training strategy significantly enhances motion dynamics while maintaining generalization across actions and characters. Our evaluation covers a broad range of aspects, including visual quality, character consistency, action controllability, and long-horizon coherence.
16. 【2512.17784】Long-Range depth estimation using learning based Hybrid Distortion Model for CCTV cameras
链接:https://arxiv.org/abs/2512.17784
作者:Ami Pandat,Punna Rajasekhar,G.Aravamuthan,Gopika Vinod,Rohit Shukla
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Accurate camera models, Accurate camera, distortion models, Accurate, distortion
备注:
点击查看摘要
Abstract:Accurate camera models are essential for photogrammetry applications such as 3D mapping and object localization, particularly for long distances. Various stereo-camera based 3D localization methods are available but are limited to few hundreds of meters' range. This is majorly due to the limitation of the distortion models assumed for the non-linearities present in the camera lens. This paper presents a framework for modeling a suitable distortion model that can be used for localizing the objects at longer distances. It is well known that neural networks can be a better alternative to model a highly complex non-linear lens distortion function; on contrary, it is observed that a direct application of neural networks to distortion models fails to converge to estimate the camera parameters. To resolve this, a hybrid approach is presented in this paper where the conventional distortion models are initially extended to incorporate higher-order terms and then enhanced using neural network based residual correction model. This hybrid approach has substantially improved long-range localization performance and is capable of estimating the 3D position of objects at distances up to 5 kilometres. The estimated 3D coordinates are transformed to GIS coordinates and are plotted on a GIS map for visualization. Experimental validation demonstrates the robustness and effectiveness of proposed framework, offering a practical solution to calibrate CCTV cameras for long-range photogrammetry applications.
17. 【2512.17782】UrbanDIFF: A Denoising Diffusion Model for Spatial Gap Filling of Urban Land Surface Temperature Under Dense Cloud Cover
链接:https://arxiv.org/abs/2512.17782
作者:Arya Chavoshi,Hassan Dashtian,Naveen Sudharsan,Dev Niyogi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Satellite-derived Land Surface, Land Surface Temperature, Satellite-derived Land, urban heat island, Surface Temperature
备注:
点击查看摘要
Abstract:Satellite-derived Land Surface Temperature (LST) products are central to surface urban heat island (SUHI) monitoring due to their consistent grid-based coverage over large metropolitan regions. However, cloud contamination frequently obscures LST observations, limiting their usability for continuous SUHI analysis. Most existing LST reconstruction methods rely on multitemporal information or multisensor data fusion, requiring auxiliary observations that may be unavailable or unreliable under persistent cloud cover. Purely spatial gap-filling approaches offer an alternative, but traditional statistical methods degrade under large or spatially contiguous gaps, while many deep learning based spatial models deteriorate rapidly with increasing missingness. Recent advances in denoising diffusion based image inpainting models have demonstrated improved robustness under high missingness, motivating their adoption for spatial LST reconstruction. In this work, we introduce UrbanDIFF, a purely spatial denoising diffusion model for reconstructing cloud contaminated urban LST imagery. The model is conditioned on static urban structure information, including built-up surface data and a digital elevation model, and enforces strict consistency with revealed cloud free pixels through a supervised pixel guided refinement step during inference. UrbanDIFF is trained and evaluated using NASA MODIS Terra LST data from seven major United States metropolitan areas spanning 2002 to 2025. Experiments using synthetic cloud masks with 20 to 85 percent coverage show that UrbanDIFF consistently outperforms an interpolation baseline, particularly under dense cloud occlusion, achieving SSIM of 0.89, RMSE of 1.2 K, and R2 of 0.84 at 85 percent cloud coverage, while exhibiting slower performance degradation as cloud density increases.
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2512.17782 [cs.CV]
(or
arXiv:2512.17782v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2512.17782
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
18. 【2512.17781】LiteGE: Lightweight Geodesic Embedding for Efficient Geodesics Computation and Non-Isometric Shape Correspondence
链接:https://arxiv.org/abs/2512.17781
作者:Yohanes Yudhi Adikusuma,Qixing Huang,Ying He
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
关键词:Computing geodesic distances, surfaces is fundamental, vision and geometry, geometry processing, Computing geodesic
备注:
点击查看摘要
Abstract:Computing geodesic distances on 3D surfaces is fundamental to many tasks in 3D vision and geometry processing, with deep connections to tasks such as shape correspondence. Recent learning-based methods achieve strong performance but rely on large 3D backbones, leading to high memory usage and latency, which limit their use in interactive or resource-constrained settings. We introduce LiteGE, a lightweight approach that constructs compact, category-aware shape descriptors by applying PCA to unsigned distance field (UDFs) samples at informative voxels. This descriptor is efficient to compute and removes the need for high-capacity networks. LiteGE remains robust on sparse point clouds, supporting inputs with as few as 300 points, where prior methods fail. Extensive experiments show that LiteGE reduces memory usage and inference time by up to 300$\times$ compared to existing neural approaches. In addition, by exploiting the intrinsic relationship between geodesic distance and shape correspondence, LiteGE enables fast and accurate shape matching. Our method achieves up to 1000$\times$ speedup over state-of-the-art mesh-based approaches while maintaining comparable accuracy on non-isometric shape pairs, including evaluations on point-cloud inputs.
19. 【2512.17773】Pix2NPHM: Learning to Regress NPHM Reconstructions From a Single Image
链接:https://arxiv.org/abs/2512.17773
作者:Simon Giebenhain,Tobias Kirschstein,Liam Schoneveld,Davide Davoli,Zhe Chen,Matthias Nießner
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Parametric Head Models, Head Models, Neural Parametric Head, morphable models, high-fidelity geometric detail
备注: Project website: [this https URL](https://simongiebenhain.github.io/Pix2NPHM/) , Video: [this https URL](https://www.youtube.com/watch?v=MgpEJC5p1Ts)
点击查看摘要
Abstract:Neural Parametric Head Models (NPHMs) are a recent advancement over mesh-based 3d morphable models (3DMMs) to facilitate high-fidelity geometric detail. However, fitting NPHMs to visual inputs is notoriously challenging due to the expressive nature of their underlying latent space. To this end, we propose Pix2NPHM, a vision transformer (ViT) network that directly regresses NPHM parameters, given a single image as input. Compared to existing approaches, the neural parametric space allows our method to reconstruct more recognizable facial geometry and accurate facial expressions. For broad generalization, we exploit domain-specific ViTs as backbones, which are pretrained on geometric prediction tasks. We train Pix2NPHM on a mixture of 3D data, including a total of over 100K NPHM registrations that enable direct supervision in SDF space, and large-scale 2D video datasets, for which normal estimates serve as pseudo ground truth geometry. Pix2NPHM not only allows for 3D reconstructions at interactive frame rates, it is also possible to improve geometric fidelity by a subsequent inference-time optimization against estimated surface normals and canonical point maps. As a result, we achieve unprecedented face reconstruction quality that can run at scale on in-the-wild data.
20. 【2512.17730】AdaptPrompt: Parameter-Efficient Adaptation of VLMs for Generalizable Deepfake Detection
链接:https://arxiv.org/abs/2512.17730
作者:Yichen Jiang,Mohammed Talha Alam,Sohail Ahmed Khan,Duc-Tien Dang-Nguyen,Fakhri Karray
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:realistic synthetic media, Recent advances, highly realistic synthetic, reliable deepfake detection, increasing the difficulty
备注: Under Review
点击查看摘要
Abstract:Recent advances in image generation have led to the widespread availability of highly realistic synthetic media, increasing the difficulty of reliable deepfake detection. A key challenge is generalization, as detectors trained on a narrow class of generators often fail when confronted with unseen models. In this work, we address the pressing need for generalizable detection by leveraging large vision-language models, specifically CLIP, to identify synthetic content across diverse generative techniques. First, we introduce Diff-Gen, a large-scale benchmark dataset comprising 100k diffusion-generated fakes that capture broad spectral artifacts unlike traditional GAN datasets. Models trained on Diff-Gen demonstrate stronger cross-domain generalization, particularly on previously unseen image generators. Second, we propose AdaptPrompt, a parameter-efficient transfer learning framework that jointly learns task-specific textual prompts and visual adapters while keeping the CLIP backbone frozen. We further show via layer ablation that pruning the final transformer block of the vision encoder enhances the retention of high-frequency generative artifacts, significantly boosting detection accuracy. Our evaluation spans 25 challenging test sets, covering synthetic content generated by GANs, diffusion models, and commercial tools, establishing a new state-of-the-art in both standard and cross-domain scenarios. We further demonstrate the framework's versatility through few-shot generalization (using as few as 320 images) and source attribution, enabling the precise identification of generator architectures in closed-set settings.
21. 【2512.17726】MambaMIL+: Modeling Long-Term Contextual Patterns for Gigapixel Whole Slide Image
链接:https://arxiv.org/abs/2512.17726
作者:Qian Zeng,Yihui Wang,Shu Yang,Yingxue Xu,Fengtao Zhou,Jiabo Ma,Dejia Cai,Zhengyu Zhang,Lijuan Qu,Yu Wang,Li Liang,Hao Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:important data modality, fine-grained annotations challenge, annotations challenge conventional, challenge conventional deep, Whole-slide images
备注: 18 pages, 11 figures, 10 tables
点击查看摘要
Abstract:Whole-slide images (WSIs) are an important data modality in computational pathology, yet their gigapixel resolution and lack of fine-grained annotations challenge conventional deep learning models. Multiple instance learning (MIL) offers a solution by treating each WSI as a bag of patch-level instances, but effectively modeling ultra-long sequences with rich spatial context remains difficult. Recently, Mamba has emerged as a promising alternative for long sequence learning, scaling linearly to thousands of tokens. However, despite its efficiency, it still suffers from limited spatial context modeling and memory decay, constraining its effectiveness to WSI analysis. To address these limitations, we propose MambaMIL+, a new MIL framework that explicitly integrates spatial context while maintaining long-range dependency modeling without memory forgetting. Specifically, MambaMIL+ introduces 1) overlapping scanning, which restructures the patch sequence to embed spatial continuity and instance correlations; 2) a selective stripe position encoder (S2PE) that encodes positional information while mitigating the biases of fixed scanning orders; and 3) a contextual token selection (CTS) mechanism, which leverages supervisory knowledge to dynamically enlarge the contextual memory for stable long-range modeling. Extensive experiments on 20 benchmarks across diagnostic classification, molecular prediction, and survival analysis demonstrate that MambaMIL+ consistently achieves state-of-the-art performance under three feature extractors (ResNet-50, PLIP, and CONCH), highlighting its effectiveness and robustness for large-scale computational pathology
22. 【2512.17724】SAVeD: A First-Person Social Media Video Dataset for ADAS-equipped vehicle Near-Miss and Crash Event Analyses
链接:https://arxiv.org/abs/2512.17724
作者:Shaoyan Zhai,Mohamed Abdel-Aty,Chenzhu Wang,Rodrigo Vena Garcia
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:capture high-risk edge, high-risk edge cases, Driver Assistance System, Advanced Driver Assistance, ADAS-equipped vehicles require
备注:
点击查看摘要
Abstract:The advancement of safety-critical research in driving behavior in ADAS-equipped vehicles require real-world datasets that not only include diverse traffic scenarios but also capture high-risk edge cases such as near-miss events and system failures. However, existing datasets are largely limited to either simulated environments or human-driven vehicle data, lacking authentic ADAS (Advanced Driver Assistance System) vehicle behavior under risk conditions. To address this gap, this paper introduces SAVeD, a large-scale video dataset curated from publicly available social media content, explicitly focused on ADAS vehicle-related crashes, near-miss incidents, and disengagements. SAVeD features 2,119 first-person videos, capturing ADAS vehicle operations in diverse locations, lighting conditions, and weather scenarios. The dataset includes video frame-level annotations for collisions, evasive maneuvers, and disengagements, enabling analysis of both perception and decision-making failures. We demonstrate SAVeD's utility through multiple analyses and contributions: (1) We propose a novel framework integrating semantic segmentation and monocular depth estimation to compute real-time Time-to-Collision (TTC) for dynamic objects. (2) We utilize the Generalized Extreme Value (GEV) distribution to model and quantify the extreme risk in crash and near-miss events across different roadway types. (3) We establish benchmarks for state-of-the-art VLLMs (VideoLLaMA2 and InternVL2.5 HiCo R16), showing that SAVeD's detailed annotations significantly enhance model performance through domain adaptation in complex near-miss scenarios.
23. 【2512.17717】FlexAvatar: Flexible Large Reconstruction Model for Animatable Gaussian Head Avatars with Detailed Deformation
链接:https://arxiv.org/abs/2512.17717
作者:Cheng Peng,Zhuo Su,Liao Wang,Chen Guo,Zhaohu Li,Chengjiang Long,Zheng Lv,Jingxiang Sun,Chenyangguang Zhang,Yebin Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:requiring camera poses, large reconstruction model, flexible large reconstruction, sparse images, reconstruction model
备注: Project page: [this https URL](https://pengc02.github.io/flexavatar)
点击查看摘要
Abstract:We present FlexAvatar, a flexible large reconstruction model for high-fidelity 3D head avatars with detailed dynamic deformation from single or sparse images, without requiring camera poses or expression labels. It leverages a transformer-based reconstruction model with structured head query tokens as canonical anchor to aggregate flexible input-number-agnostic, camera-pose-free and expression-free inputs into a robust canonical 3D representation. For detailed dynamic deformation, we introduce a lightweight UNet decoder conditioned on UV-space position maps, which can produce detailed expression-dependent deformations in real time. To better capture rare but critical expressions like wrinkles and bared teeth, we also adopt a data distribution adjustment strategy during training to balance the distribution of these expressions in the training set. Moreover, a lightweight 10-second refinement can further enhances identity-specific details in extreme identities without affecting deformation quality. Extensive experiments demonstrate that our FlexAvatar achieves superior 3D consistency, detailed dynamic realism compared with previous methods, providing a practical solution for animatable 3D avatar creation.
24. 【2512.17675】An Empirical Study of Sampling Hyperparameters in Diffusion-Based Super-Resolution
链接:https://arxiv.org/abs/2512.17675
作者:Yudhistira Arief Wibowo
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:shown strong potential, solving inverse problems, pretrained unconditional prior, Manifold Constrained Gradient, Diffusion Posterior Sampling
备注:
点击查看摘要
Abstract:Diffusion models have shown strong potential for solving inverse problems such as single-image super-resolution, where a high-resolution image is recovered from a low-resolution observation using a pretrained unconditional prior. Conditioning methods, including Diffusion Posterior Sampling (DPS) and Manifold Constrained Gradient (MCG), can substantially improve reconstruction quality, but they introduce additional hyperparameters that require careful tuning. In this work, we conduct an empirical ablation study on FFHQ super-resolution to identify the dominant factors affecting performance when applying conditioning to pretrained diffusion models, and show that the conditioning step size has a significantly greater impact than the diffusion step count, with step sizes in the range of [2.0, 3.0] yielding the best overall performance in our experiments.
25. 【2512.17673】Learning Spatio-Temporal Feature Representations for Video-Based Gaze Estimation
链接:https://arxiv.org/abs/2512.17673
作者:Alexandre Personnic,Mihai Bâce
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
关键词:multiple image frames, human eye gaze, inherently temporal dynamics, Video-based gaze estimation, image frames
备注: 12 pages, 5 figures, the code repository is available at [this https URL](https://gitlab.kuleuven.be/u0172623/ST-Gaze)
点击查看摘要
Abstract:Video-based gaze estimation methods aim to capture the inherently temporal dynamics of human eye gaze from multiple image frames. However, since models must capture both spatial and temporal relationships, performance is limited by the feature representations within a frame but also between multiple frames. We propose the Spatio-Temporal Gaze Network (ST-Gaze), a model that combines a CNN backbone with dedicated channel attention and self-attention modules to fuse eye and face features optimally. The fused features are then treated as a spatial sequence, allowing for the capture of an intra-frame context, which is then propagated through time to model inter-frame dynamics. We evaluated our method on the EVE dataset and show that ST-Gaze achieves state-of-the-art performance both with and without person-specific adaptation. Additionally, our ablation study provides further insights into the model performance, showing that preserving and modelling intra-frame spatial context with our spatio-temporal recurrence is fundamentally superior to premature spatial pooling. As such, our results pave the way towards more robust video-based gaze estimation using commonly available cameras.
26. 【2512.17655】Bitbox: Behavioral Imaging Toolbox for Computational Analysis of Behavior from Videos
链接:https://arxiv.org/abs/2512.17655
作者:Evangelos Sariyanidi,Gokul Nair,Lisa Yankowitz,Casey J. Zampella,Mohan Kashyap Pargi,Aashvi Manakiwala,Maya McNealis,John D. Herrington,Jeffrey Cohn,Robert T. Schultz,Birkan Tunc
类目:Computer Vision and Pattern Recognition (cs.CV); Neurons and Cognition (q-bio.NC)
关键词:human behavior, recently become feasible, feasible due, due to major, behavioral
备注:
点击查看摘要
Abstract:Computational measurement of human behavior from video has recently become feasible due to major advances in AI. These advances now enable granular and precise quantification of facial expression, head movement, body action, and other behavioral modalities and are increasingly used in psychology, psychiatry, neuroscience, and mental health research. However, mainstream adoption remains slow. Most existing methods and software are developed for engineering audiences, require specialized software stacks, and fail to provide behavioral measurements at a level directly useful for hypothesis-driven research. As a result, there is a large barrier to entry for researchers who wish to use modern, AI-based tools in their work. We introduce Bitbox, an open-source toolkit designed to remove this barrier and make advanced computational analysis directly usable by behavioral scientists and clinical researchers. Bitbox is guided by principles of reproducibility, modularity, and interpretability. It provides a standardized interface for extracting high-level behavioral measurements from video, leveraging multiple face, head, and body processors. The core modules have been tested and validated on clinical samples and are designed so that new measures can be added with minimal effort. Bitbox is intended to serve both sides of the translational gap. It gives behavioral researchers access to robust, high-level behavioral metrics without requiring engineering expertise, and it provides computer scientists a practical mechanism for disseminating methods to domains where their impact is most needed. We expect that Bitbox will accelerate integration of computational behavioral measurement into behavioral, clinical, and mental health research. Bitbox has been designed from the beginning as a community-driven effort that will evolve through contributions from both method developers and domain scientists.
27. 【2512.17650】Region-Constraint In-Context Generation for Instructional Video Editing
链接:https://arxiv.org/abs/2512.17650
作者:Zhongwei Zhang,Fuchen Long,Wei Li,Zhaofan Qiu,Wu Liu,Ting Yao,Tao Mei
类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:demonstrated strong power, editing, video editing, synthesis quality, video
备注: Project page: [this https URL](https://zhw-zhang.github.io/ReCo-page/)
点击查看摘要
Abstract:The In-context generation paradigm recently has demonstrated strong power in instructional image editing with both data efficiency and synthesis quality. Nevertheless, shaping such in-context learning for instruction-based video editing is not trivial. Without specifying editing regions, the results can suffer from the problem of inaccurate editing regions and the token interference between editing and non-editing areas during denoising. To address these, we present ReCo, a new instructional video editing paradigm that novelly delves into constraint modeling between editing and non-editing regions during in-context generation. Technically, ReCo width-wise concatenates source and target video for joint denoising. To calibrate video diffusion learning, ReCo capitalizes on two regularization terms, i.e., latent and attention regularization, conducting on one-step backward denoised latents and attention maps, respectively. The former increases the latent discrepancy of the editing region between source and target videos while reducing that of non-editing areas, emphasizing the modification on editing area and alleviating outside unexpected content generation. The latter suppresses the attention of tokens in the editing region to the tokens in counterpart of the source video, thereby mitigating their interference during novel object generation in target video. Furthermore, we propose a large-scale, high-quality video editing dataset, i.e., ReCo-Data, comprising 500K instruction-video pairs to benefit model training. Extensive experiments conducted on four major instruction-based video editing tasks demonstrate the superiority of our proposal.
28. 【2512.17640】Generative Human-Object Interaction Detection via Differentiable Cognitive Steering of Multi-modal LLMs
链接:https://arxiv.org/abs/2512.17640
作者:Zhaolin Cai,Huiyu Duan,Zitong Xu,Fan Li,Zhi Liu,Jing Liu,Wei Shen,Xiongkuo Min,Guangtao Zhai
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:localize human-object pairs, localize human-object, human-object pairs, Human-object interaction, aims to localize
备注:
点击查看摘要
Abstract:Human-object interaction (HOI) detection aims to localize human-object pairs and the interactions between them. Existing methods operate under a closed-world assumption, treating the task as a classification problem over a small, predefined verb set, which struggles to generalize to the long-tail of unseen or ambiguous interactions in the wild. While recent multi-modal large language models (MLLMs) possess the rich world knowledge required for open-vocabulary understanding, they remain decoupled from existing HOI detectors since fine-tuning them is computationally prohibitive. To address these constraints, we propose \GRASP-HO}, a novel Generative Reasoning And Steerable Perception framework that reformulates HOI detection from the closed-set classification task to the open-vocabulary generation problem. To bridge the vision and cognitive, we first extract hybrid interaction representations, then design a lightweight learnable cognitive steering conduit (CSC) module to inject the fine-grained visual evidence into a frozen MLLM for effective reasoning. To address the supervision mismatch between classification-based HOI datasets and open-vocabulary generative models, we introduce a hybrid guidance strategy that coupling the language modeling loss and auxiliary classification loss, enabling discriminative grounding without sacrificing generative flexibility. Experiments demonstrate state-of-the-art closed-set performance and strong zero-shot generalization, achieving a unified paradigm that seamlessly bridges discriminative perception and generative reasoning for open-world HOI detection.
29. 【2512.17621】PathFLIP: Fine-grained Language-Image Pretraining for Versatile Computational Pathology
链接:https://arxiv.org/abs/2512.17621
作者:Fengchun Liu,Songhan Jiang,Linghan Cai,Ziyue Wang,Yongbing Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieved notable progress, Slide Images, continue to pose, multimodal understanding, achieved notable
备注:
点击查看摘要
Abstract:While Vision-Language Models (VLMs) have achieved notable progress in computational pathology (CPath), the gigapixel scale and spatial heterogeneity of Whole Slide Images (WSIs) continue to pose challenges for multimodal understanding. Existing alignment methods struggle to capture fine-grained correspondences between textual descriptions and visual cues across thousands of patches from a slide, compromising their performance on downstream tasks. In this paper, we propose PathFLIP (Pathology Fine-grained Language-Image Pretraining), a novel framework for holistic WSI interpretation. PathFLIP decomposes slide-level captions into region-level subcaptions and generates text-conditioned region embeddings to facilitate precise visual-language grounding. By harnessing Large Language Models (LLMs), PathFLIP can seamlessly follow diverse clinical instructions and adapt to varied diagnostic contexts. Furthermore, it exhibits versatile capabilities across multiple paradigms, efficiently handling slide-level classification and retrieval, fine-grained lesion localization, and instruction following. Extensive experiments demonstrate that PathFLIP outperforms existing large-scale pathological VLMs on four representative benchmarks while requiring significantly less training data, paving the way for fine-grained, instruction-aware WSI interpretation in clinical practice.
30. 【2512.17620】StereoMV2D: A Sparse Temporal Stereo-Enhanced Framework for Robust Multi-View 3D Object Detection
链接:https://arxiv.org/abs/2512.17620
作者:Di Wu,Feng Yang,Wenhui Zhao,Jinwen Yu,Pan Liao,Benlian Xu,Dingwen Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:efficiency remains crucial, autonomous driving perception, computational efficiency remains, remains crucial, fundamental task
备注: 12 pages, 4 figures. This work has been submitted to the IEEE for possible publication
点击查看摘要
Abstract:Multi-view 3D object detection is a fundamental task in autonomous driving perception, where achieving a balance between detection accuracy and computational efficiency remains crucial. Sparse query-based 3D detectors efficiently aggregate object-relevant features from multi-view images through a set of learnable queries, offering a concise and end-to-end detection paradigm. Building on this foundation, MV2D leverages 2D detection results to provide high-quality object priors for query initialization, enabling higher precision and recall. However, the inherent depth ambiguity in single-frame 2D detections still limits the accuracy of 3D query generation. To address this issue, we propose StereoMV2D, a unified framework that integrates temporal stereo modeling into the 2D detection-guided multi-view 3D detector. By exploiting cross-temporal disparities of the same object across adjacent frames, StereoMV2D enhances depth perception and refines the query priors, while performing all computations efficiently within 2D regions of interest (RoIs). Furthermore, a dynamic confidence gating mechanism adaptively evaluates the reliability of temporal stereo cues through learning statistical patterns derived from the inter-frame matching matrix together with appearance consistency, ensuring robust detection under object appearance and occlusion. Extensive experiments on the nuScenes and Argoverse 2 datasets demonstrate that StereoMV2D achieves superior detection performance without incurring significant computational overhead. Code will be available at this https URL.
31. 【2512.17612】Self-Supervised Weighted Image Guided Quantitative MRI Super-Resolution
链接:https://arxiv.org/abs/2512.17612
作者:Alireza Samadifardheris,Dirk H.J. Poot,Florian Wiesinger,Stefan Klein,Juan A. Hernandez-Tamames
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:remains clinically underutilized, clinically underutilized due, objective tissue characterization, qMRI, tissue characterization
备注: This work has been submitted to IEEE TMI for possible publication
点击查看摘要
Abstract:High-resolution (HR) quantitative MRI (qMRI) relaxometry provides objective tissue characterization but remains clinically underutilized due to lengthy acquisition times. We propose a physics-informed, self-supervised framework for qMRI super-resolution that uses routinely acquired HR weighted MRI (wMRI) scans as guidance, thus, removing the necessity for HR qMRI ground truth during training. We formulate super-resolution as Bayesian maximum a posteriori inference, minimizing two discrepancies: (1) between HR images synthesized from super-resolved qMRI maps and acquired wMRI guides via forward signal models, and (2) between acquired LR qMRI and downsampled predictions. This physics-informed objective allows the models to learn from clinical wMRI without HR qMRI supervision. To validate the concept, we generate training data by synthesizing wMRI guides from HR qMRI using signal equations, then degrading qMRI resolution via k-space truncation. A deep neural network learns the super-resolution mapping. Ablation experiments demonstrate that T1-weighted images primarily enhance T1 maps, T2-weighted images improve T2 maps, and combined guidance optimally enhances all parameters simultaneously. Validation on independently acquired in-vivo data from a different qMRI sequence confirms cross-qMRI sequence generalizability. Models trained on synthetic data can produce super-resolved maps from a 1-minute acquisition with quality comparable to a 5-minute reference scan, leveraging the scanner-independent nature of relaxometry parameters. By decoupling training from HR qMRI requirement, our framework enables fast qMRI acquisitions enhanced via routine clinical images, offering a practical pathway for integrating quantitative relaxometry into clinical workflows with acceptable additional scan time.
32. 【2512.17610】Semi-Supervised 3D Segmentation for Type-B Aortic Dissection with Slim UNETR
链接:https://arxiv.org/abs/2512.17610
作者:Denis Mikhailapov,Vladimir Berikov
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Convolutional neural networks, Convolutional neural, widely used today, segmentation, Convolutional
备注: 7 pages, 5 figures, 1 listing
点击查看摘要
Abstract:Convolutional neural networks (CNN) for multi-class segmentation of medical images are widely used today. Especially models with multiple outputs that can separately predict segmentation classes (regions) without relying on a probabilistic formulation of the segmentation of regions. These models allow for more precise segmentation by tailoring the network's components to each class (region). They have a common encoder part of the architecture but branch out at the output layers, leading to improved accuracy. These methods are used to diagnose type B aortic dissection (TBAD), which requires accurate segmentation of aortic structures based on the ImageTBDA dataset, which contains 100 3D computed tomography angiography (CTA) images. These images identify three key classes: true lumen (TL), false lumen (FL), and false lumen thrombus (FLT) of the aorta, which is critical for diagnosis and treatment decisions. In the dataset, 68 examples have a false lumen, while the remaining 32 do not, creating additional complexity for pathology detection. However, implementing these CNN methods requires a large amount of high-quality labeled data. Obtaining accurate labels for the regions of interest can be an expensive and time-consuming process, particularly for 3D data. Semi-supervised learning methods allow models to be trained by using both labeled and unlabeled data, which is a promising approach for overcoming the challenge of obtaining accurate labels. However, these learning methods are not well understood for models with multiple outputs. This paper presents a semi-supervised learning method for models with multiple outputs. The method is based on the additional rotations and flipping, and does not assume the probabilistic nature of the model's responses. This makes it a universal approach, which is especially important for architectures that involve separate segmentation.
Comments:
7 pages, 5 figures, 1 listing
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2512.17610 [cs.CV]
(or
arXiv:2512.17610v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2512.17610
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Denis Mikhailapov [view email] [v1]
Fri, 19 Dec 2025 14:14:41 UTC (712 KB)
33. 【2512.17605】MGRegBench: A Novel Benchmark Dataset with Anatomical Landmarks for Mammography Image Registration
链接:https://arxiv.org/abs/2512.17605
作者:Svetlana Krasnova,Emiliya Starikova,Ilia Naletov,Andrey Krylov,Dmitry Sorokin
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:tracking disease progression, Robust mammography registration, Robust mammography, breast tissue, essential for clinical
备注:
点击查看摘要
Abstract:Robust mammography registration is essential for clinical applications like tracking disease progression and monitoring longitudinal changes in breast tissue. However, progress has been limited by the absence of public datasets and standardized benchmarks. Existing studies are often not directly comparable, as they use private data and inconsistent evaluation frameworks. To address this, we present MGRegBench, a public benchmark dataset for mammogram registration. It comprises over 5,000 image pairs, with 100 containing manual anatomical landmarks and segmentation masks for rigorous evaluation. This makes MGRegBench one of the largest public 2D registration datasets with manual annotations. Using this resource, we benchmarked diverse registration methods including classical (ANTs), learning-based (VoxelMorph, TransMorph), implicit neural representation (IDIR), a classic mammography-specific approach, and a recent state-of-the-art deep learning method MammoRegNet. The implementations were adapted to this modality from the authors' implementations or re-implemented from scratch. Our contributions are: (1) the first public dataset of this scale with manual landmarks and masks for mammography registration; (2) the first like-for-like comparison of diverse methods on this modality; and (3) an extensive analysis of deep learning-based registration. We publicly release our code and data to establish a foundational resource for fair comparisons and catalyze future research. The source code and data are at this https URL.
34. 【2512.17601】HeadHunt-VAD: Hunting Robust Anomaly-Sensitive Heads in MLLM for Tuning-Free Video Anomaly Detection
链接:https://arxiv.org/abs/2512.17601
作者:Zhaolin Cai,Fan Li,Ziwei Zheng,Haixia Bi,Lijun He
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Video Anomaly Detection, Large Language Models, Multimodal Large Language, aims to locate, locate events
备注:
点击查看摘要
Abstract:Video Anomaly Detection (VAD) aims to locate events that deviate from normal patterns in videos. Traditional approaches often rely on extensive labeled data and incur high computational costs. Recent tuning-free methods based on Multimodal Large Language Models (MLLMs) offer a promising alternative by leveraging their rich world knowledge. However, these methods typically rely on textual outputs, which introduces information loss, exhibits normalcy bias, and suffers from prompt sensitivity, making them insufficient for capturing subtle anomalous cues. To address these constraints, we propose HeadHunt-VAD, a novel tuning-free VAD paradigm that bypasses textual generation by directly hunting robust anomaly-sensitive internal attention heads within the frozen MLLM. Central to our method is a Robust Head Identification module that systematically evaluates all attention heads using a multi-criteria analysis of saliency and stability, identifying a sparse subset of heads that are consistently discriminative across diverse prompts. Features from these expert heads are then fed into a lightweight anomaly scorer and a temporal locator, enabling efficient and accurate anomaly detection with interpretable outputs. Extensive experiments show that HeadHunt-VAD achieves state-of-the-art performance among tuning-free methods on two major VAD benchmarks while maintaining high efficiency, validating head-level probing in MLLMs as a powerful and practical solution for real-world anomaly detection.
35. 【2512.17594】MAD-OOD: A Deep Learning Cluster-Driven Framework for an Out-of-Distribution Malware Detection and Classification
链接:https://arxiv.org/abs/2512.17594
作者:Tosin Ige,Christopher Kiekintveld,Aritran Piplai,Asif Rahman,Olukunle Kolade,Sasidhar Kunapuli
类目:Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:family variability introduced, intra family variability, substantial intra family, OOD, remains a critical
备注:
点击查看摘要
Abstract:Out of distribution (OOD) detection remains a critical challenge in malware classification due to the substantial intra family variability introduced by polymorphic and metamorphic malware variants. Most existing deep learning based malware detectors rely on closed world assumptions and fail to adequately model this intra class variation, resulting in degraded performance when confronted with previously unseen malware families. This paper presents MADOOD, a novel two stage, cluster driven deep learning framework for robust OOD malware detection and classification. In the first stage, malware family embeddings are modeled using class conditional spherical decision boundaries derived from Gaussian Discriminant Analysis (GDA), enabling statistically grounded separation of indistribution and OOD samples without requiring OOD data during training. Z score based distance analysis across multiple class centroids is employed to reliably identify anomalous samples in the latent space. In the second stage, a deep neural network integrates cluster based predictions, refined embeddings, and supervised classifier outputs to enhance final classification accuracy. Extensive evaluations on benchmark malware datasets comprising 25 known families and multiple novel OOD variants demonstrate that MADOOD significantly outperforms state of the art OOD detection methods, achieving an AUC of up to 0.911 on unseen malware families. The proposed framework provides a scalable, interpretable, and statistically principled solution for real world malware detection and anomaly identification in evolving cybersecurity environments.
36. 【2512.17581】Medical Imaging AI Competitions Lack Fairness
链接:https://arxiv.org/abs/2512.17581
作者:Annika Reinke,Evangelia Christodoulou,Sthuthi Sadananda,A. Emre Kavur,Khrystyna Faryna,Daan Schouten,Bennett A. Landman,Carole Sudre,Olivier Colliot,Nick Heller,Sophie Loizillon,Martin Maška,Maëlys Solal,Arya Yazdan-Panah,Vilma Bozgo,Ömer Sümer,Siem de Jong,Sophie Fischer,Michal Kozubek,Tim Rädsch,Nadim Hammoud,Fruzsina Molnár-Gábor,Steven Hicks,Michael A. Riegler,Anindo Saha,Vajira Thambawita,Pal Halvorsen,Amelia Jiménez-Sánchez,Qingyang Yang,Veronika Cheplygina,Sabrina Bottazzi,Alexander Seitel,Spyridon Bakas,Alexandros Karargyris,Kiran Vaidhya Venkadesh,Bram van Ginneken,Lena Maier-Hein
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:defining performance standards, shaping methodological progress, artificial intelligence, defining performance, methodological progress
备注: Submitted to Nature BME
点击查看摘要
Abstract:Benchmarking competitions are central to the development of artificial intelligence (AI) in medical imaging, defining performance standards and shaping methodological progress. However, it remains unclear whether these benchmarks provide data that are sufficiently representative, accessible, and reusable to support clinically meaningful AI. In this work, we assess fairness along two complementary dimensions: (1) whether challenge datasets are representative of real-world clinical diversity, and (2) whether they are accessible and legally reusable in line with the FAIR principles. To address this question, we conducted a large-scale systematic study of 241 biomedical image analysis challenges comprising 458 tasks across 19 imaging modalities. Our findings show substantial biases in dataset composition, including geographic location, modality-, and problem type-related biases, indicating that current benchmarks do not adequately reflect real-world clinical diversity. Despite their widespread influence, challenge datasets were frequently constrained by restrictive or ambiguous access conditions, inconsistent or non-compliant licensing practices, and incomplete documentation, limiting reproducibility and long-term reuse. Together, these shortcomings expose foundational fairness limitations in our benchmarking ecosystem and highlight a disconnect between leaderboard success and clinical relevance.
37. 【2512.17578】3One2: One-step Regression Plus One-step Diffusion for One-hot Modulation in Dual-path Video Snapshot Compressive Imaging
链接:https://arxiv.org/abs/2512.17578
作者:Ge Wang,Xing Liu,Xin Yuan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:snapshot compressive imaging, Video snapshot compressive, captures dynamic scene, dynamic scene sequences, snapshot compressive
备注:
点击查看摘要
Abstract:Video snapshot compressive imaging (SCI) captures dynamic scene sequences through a two-dimensional (2D) snapshot, fundamentally relying on optical modulation for hardware compression and the corresponding software reconstruction. While mainstream video SCI using random binary modulation has demonstrated success, it inevitably results in temporal aliasing during compression. One-hot modulation, activating only one sub-frame per pixel, provides a promising solution for achieving perfect temporal decoupling, thereby alleviating issues associated with aliasing. However, no algorithms currently exist to fully exploit this potential. To bridge this gap, we propose an algorithm specifically designed for one-hot masks. First, leveraging the decoupling properties of one-hot modulation, we transform the reconstruction task into a generative video inpainting problem and introduce a stochastic differential equation (SDE) of the forward process that aligns with the hardware compression process. Next, we identify limitations of the pure diffusion method for video SCI and propose a novel framework that combines one-step regression initialization with one-step diffusion refinement. Furthermore, to mitigate the spatial degradation caused by one-hot modulation, we implement a dual optical path at the hardware level, utilizing complementary information from another path to enhance the inpainted video. To our knowledge, this is the first work integrating diffusion into video SCI reconstruction. Experiments conducted on synthetic datasets and real scenes demonstrate the effectiveness of our method.
38. 【2512.17573】RoomEditor++: A Parameter-Sharing Diffusion Architecture for High-Fidelity Furniture Synthesis
链接:https://arxiv.org/abs/2512.17573
作者:Qilong Wang,Xiaofan Ming,Zhenyi Lin,Jinwen Li,Dongwei Ren,Wangmeng Zuo,Qinghua Hu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Virtual furniture synthesis, holds substantial promise, seamlessly integrates reference, integrates reference objects, maintaining geometric coherence
备注:
点击查看摘要
Abstract:Virtual furniture synthesis, which seamlessly integrates reference objects into indoor scenes while maintaining geometric coherence and visual realism, holds substantial promise for home design and e-commerce applications. However, this field remains underexplored due to the scarcity of reproducible benchmarks and the limitations of existing image composition methods in achieving high-fidelity furniture synthesis while preserving background integrity. To overcome these challenges, we first present RoomBench++, a comprehensive and publicly available benchmark dataset tailored for this task. It consists of 112,851 training pairs and 1,832 testing pairs drawn from both real-world indoor videos and realistic home design renderings, thereby supporting robust training and evaluation under practical conditions. Then, we propose RoomEditor++, a versatile diffusion-based architecture featuring a parameter-sharing dual diffusion backbone, which is compatible with both U-Net and DiT architectures. This design unifies the feature extraction and inpainting processes for reference and background images. Our in-depth analysis reveals that the parameter-sharing mechanism enforces aligned feature representations, facilitating precise geometric transformations, texture preservation, and seamless integration. Extensive experiments validate that RoomEditor++ is superior over state-of-the-art approaches in terms of quantitative metrics, qualitative assessments, and human preference studies, while highlighting its strong generalization to unseen indoor scenes and general scenes without task-specific fine-tuning. The dataset and source code are available at \url{this https URL}.
39. 【2512.17566】A unified FLAIR hyperintensity segmentation model for various CNS tumor types and acquisition time points
链接:https://arxiv.org/abs/2512.17566
作者:Mathilde Gajda Faanes,David Bouget,Asgeir S. Jakola,Timothy R. Smith,Vasileios K. Kavouridis,Francesco Latini,Margret Jensdottir,Peter Milos,Henrietta Nittby Redebrandt,Rickard L. Sjöberg,Rupavathana Mahesparan,Lars Kjelsberg Pedersen,Ole Solheim,Ingerid Reinertsen
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:fluid-attenuated inversion recovery, magnetic resonance imaging, acquisition time points, FLAIR hyperintensity volume, FLAIR hyperintensity
备注: 13 pages, 4 figures
点击查看摘要
Abstract:T2-weighted fluid-attenuated inversion recovery (FLAIR) magnetic resonance imaging (MRI) scans are important for diagnosis, treatment planning and monitoring of brain tumors. Depending on the brain tumor type, the FLAIR hyperintensity volume is an important measure to asses the tumor volume or surrounding edema, and an automatic segmentation of this would be useful in the clinic. In this study, around 5000 FLAIR images of various tumors types and acquisition time points from different centers were used to train a unified FLAIR hyperintensity segmentation model using an Attention U-Net architecture. The performance was compared against dataset specific models, and was validated on different tumor types, acquisition time points and against BraTS. The unified model achieved an average Dice score of 88.65\% for pre-operative meningiomas, 80.08% for pre-operative metastasis, 90.92% for pre-operative and 84.60% for post-operative gliomas from BraTS, and 84.47% for pre-operative and 61.27\% for post-operative lower grade gliomas. In addition, the results showed that the unified model achieved comparable segmentation performance to the dataset specific models on their respective datasets, and enables generalization across tumor types and acquisition time points, which facilitates the deployment in a clinical setting. The model is integrated into Raidionics, an open-source software for CNS tumor analysis.
40. 【2512.17547】G3Splat: Geometrically Consistent Generalizable Gaussian Splatting
链接:https://arxiv.org/abs/2512.17547
作者:Mehdi Hosseinzadeh,Shin-Fang Chng,Yi Xu,Simon Lucey,Ian Reid,Ravi Garg
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:adapt multi-view structure, multi-view structure prediction, structure prediction networks, regress per-pixel, accurate novel-view synthesis
备注: Project page: [this https URL](https://m80hz.github.io/g3splat/)
点击查看摘要
Abstract:3D Gaussians have recently emerged as an effective scene representation for real-time splatting and accurate novel-view synthesis, motivating several works to adapt multi-view structure prediction networks to regress per-pixel 3D Gaussians from images. However, most prior work extends these networks to predict additional Gaussian parameters -- orientation, scale, opacity, and appearance -- while relying almost exclusively on view-synthesis supervision. We show that a view-synthesis loss alone is insufficient to recover geometrically meaningful splats in this setting. We analyze and address the ambiguities of learning 3D Gaussian splats under self-supervision for pose-free generalizable splatting, and introduce G3Splat, which enforces geometric priors to obtain geometrically consistent 3D scene representations. Trained on RE10K, our approach achieves state-of-the-art performance in (i) geometrically consistent reconstruction, (ii) relative pose estimation, and (iii) novel-view synthesis. We further demonstrate strong zero-shot generalization on ScanNet, substantially outperforming prior work in both geometry recovery and relative pose estimation. Code and pretrained models are released on our project page (this https URL).
41. 【2512.17545】ClothHMR: 3D Mesh Recovery of Humans in Diverse Clothing from Single Image
链接:https://arxiv.org/abs/2512.17545
作者:Yunqi Gao,Leyuan Liu,Yuhan Li,Changxin Gao,Yuanyuan Liu,Jingying Chen
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:data rapidly emerging, human mesh recovery, mesh recovery technology, data rapidly, human mesh
备注: 15 pages,16 figures
点击查看摘要
Abstract:With 3D data rapidly emerging as an important form of multimedia information, 3D human mesh recovery technology has also advanced accordingly. However, current methods mainly focus on handling humans wearing tight clothing and perform poorly when estimating body shapes and poses under diverse clothing, especially loose garments. To this end, we make two key insights: (1) tailoring clothing to fit the human body can mitigate the adverse impact of clothing on 3D human mesh recovery, and (2) utilizing human visual information from large foundational models can enhance the generalization ability of the estimation. Based on these insights, we propose ClothHMR, to accurately recover 3D meshes of humans in diverse clothing. ClothHMR primarily consists of two modules: clothing tailoring (CT) and FHVM-based mesh recovering (MR). The CT module employs body semantic estimation and body edge prediction to tailor the clothing, ensuring it fits the body silhouette. The MR module optimizes the initial parameters of the 3D human mesh by continuously aligning the intermediate representations of the 3D mesh with those inferred from the foundational human visual model (FHVM). ClothHMR can accurately recover 3D meshes of humans wearing diverse clothing, precisely estimating their body shapes and poses. Experimental results demonstrate that ClothHMR significantly outperforms existing state-of-the-art methods across benchmark datasets and in-the-wild images. Additionally, a web application for online fashion and shopping powered by ClothHMR is developed, illustrating that ClothHMR can effectively serve real-world usage scenarios. The code and model for ClothHMR are available at: \url{this https URL}.
42. 【2512.17541】FLEG: Feed-Forward Language Embedded Gaussian Splatting from Any Views
链接:https://arxiv.org/abs/2512.17541
作者:Qijian Tian,Xin Tan,Jiayu Ying,Xuhong Wang,Yuan Xie,Lizhuang Ma
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:present FLEG, feed-forward network, Gaussian, Abstract, reconstructs language-embedded
备注: Project page: [this https URL](https://fangzhou2000.github.io/projects/fleg)
点击查看摘要
Abstract:We present FLEG, a feed-forward network that reconstructs language-embedded 3D Gaussians from any views. Previous straightforward solutions combine feed-forward reconstruction with Gaussian heads but suffer from fixed input views and insufficient 3D training data. In contrast, we propose a 3D-annotation-free training framework for 2D-to-3D lifting from arbitrary uncalibrated and unposed multi-view images. Since the framework does not require 3D annotations, we can leverage large-scale video data with easily obtained 2D instance information to enrich semantic embedding. We also propose an instance-guided contrastive learning to align 2D semantics with the 3D representations. In addition, to mitigate the high memory and computational cost of dense views, we further propose a geometry-semantic hierarchical sparsification strategy. Our FLEG efficiently reconstructs language-embedded 3D Gaussian representation in a feed-forward manner from arbitrary sparse or dense views, jointly producing accurate geometry, high-fidelity appearance, and language-aligned semantics. Extensive experiments show that it outperforms existing methods on various related tasks. Project page: this https URL.
43. 【2512.17532】Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding
链接:https://arxiv.org/abs/2512.17532
作者:Jiaqi Tang,Jianmin Chen,Wei Wei,Xiaogang Xu,Runtao Liu,Xiangyu Wu,Qipeng Xie,Jiafei Wu,Lei Zhang,Qifeng Chen
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Multimodal Large Language, Large Language Models, Language Models struggle, Multimodal Large, Large Language
备注: Accepted by AAAI2026 Oral
点击查看摘要
Abstract:Multimodal Large Language Models struggle to maintain reliable performance under extreme real-world visual degradations, which impede their practical robustness. Existing robust MLLMs predominantly rely on implicit training/adaptation that focuses solely on visual encoder generalization, suffering from limited interpretability and isolated optimization. To overcome these limitations, we propose Robust-R1, a novel framework that explicitly models visual degradations through structured reasoning chains. Our approach integrates: (i) supervised fine-tuning for degradation-aware reasoning foundations, (ii) reward-driven alignment for accurately perceiving degradation parameters, and (iii) dynamic reasoning depth scaling adapted to degradation intensity. To facilitate this approach, we introduce a specialized 11K dataset featuring realistic degradations synthesized across four critical real-world visual processing stages, each annotated with structured chains connecting degradation parameters, perceptual influence, pristine semantic reasoning chain, and conclusion. Comprehensive evaluations demonstrate state-of-the-art robustness: Robust-R1 outperforms all general and robust baselines on the real-world degradation benchmark R-Bench, while maintaining superior anti-degradation performance under multi-intensity adversarial degradations on MMMB, MMStar, and RealWorldQA.
44. 【2512.17517】PathBench-MIL: A Comprehensive AutoML and Benchmarking Framework for Multiple Instance Learning in Histopathology
链接:https://arxiv.org/abs/2512.17517
作者:Siemen Brussee,Pieter A. Valkema,Jurre A. J. Weijer,Thom Doeleman,Anne M.R. Schrader,Jesper Kers
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Software Engineering (cs.SE); Tissues and Organs (q-bio.TO)
关键词:multiple instance learning, instance learning, MIL pipeline construction, open-source AutoML, framework for multiple
备注: 14 Pages, 3 Figures, 2 Appendices
点击查看摘要
Abstract:We introduce PathBench-MIL, an open-source AutoML and benchmarking framework for multiple instance learning (MIL) in histopathology. The system automates end-to-end MIL pipeline construction, including preprocessing, feature extraction, and MIL-aggregation, and provides reproducible benchmarking of dozens of MIL models and feature extractors. PathBench-MIL integrates visualization tooling, a unified configuration system, and modular extensibility, enabling rapid experimentation and standardization across datasets and tasks. PathBench-MIL is publicly available at this https URL
45. 【2512.17514】Foundation Model Priors Enhance Object Focus in Feature Space for Source-Free Object Detection
链接:https://arxiv.org/abs/2512.17514
作者:Sairam VCR,Rishabh Lalla,Aveen Dayal,Tejal Kulkarni,Anuj Lalla,Vineeth N Balasubramanian,Muhammad Haris Khan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Source-Free Object Detection, approaches in Source-Free, typically rely, Mean-Teacher self-labeling, rely on Mean-Teacher
备注:
点击查看摘要
Abstract:Current state-of-the-art approaches in Source-Free Object Detection (SFOD) typically rely on Mean-Teacher self-labeling. However, domain shift often reduces the detector's ability to maintain strong object-focused representations, causing high-confidence activations over background clutter. This weak object focus results in unreliable pseudo-labels from the detection head. While prior works mainly refine these pseudo-labels, they overlook the underlying need to strengthen the feature space itself. We propose FALCON-SFOD (Foundation-Aligned Learning with Clutter suppression and Noise robustness), a framework designed to enhance object-focused adaptation under domain shift. It consists of two complementary components. SPAR (Spatial Prior-Aware Regularization) leverages the generalization strength of vision foundation models to regularize the detector's feature space. Using class-agnostic binary masks derived from OV-SAM, SPAR promotes structured and foreground-focused activations by guiding the network toward object regions. IRPL (Imbalance-aware Noise Robust Pseudo-Labeling) complements SPAR by promoting balanced and noise-tolerant learning under severe foreground-background imbalance. Guided by a theoretical analysis that connects these designs to tighter localization and classification error bounds, FALCON-SFOD achieves competitive performance across SFOD benchmarks.
46. 【2512.17505】Adaptive Covariance and Quaternion-Focused Hybrid Error-State EKF/UKF for Visual-Inertial Odometry
链接:https://arxiv.org/abs/2512.17505
作者:Ufuk Asil,Efendi Nasibov
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:Unmanned Aerial Vehicles, hybrid Visual-Inertial Odometry, Aerial Vehicles, Unmanned Aerial, innovative hybrid Visual-Inertial
备注:
点击查看摘要
Abstract:This study presents an innovative hybrid Visual-Inertial Odometry (VIO) method for Unmanned Aerial Vehicles (UAVs) that is resilient to environmental challenges and capable of dynamically assessing sensor reliability. Built upon a loosely coupled sensor fusion architecture, the system utilizes a novel hybrid Quaternion-focused Error-State EKF/UKF (Qf-ES-EKF/UKF) architecture to process inertial measurement unit (IMU) data. This architecture first propagates the entire state using an Error-State Extended Kalman Filter (ESKF) and then applies a targeted Scaled Unscented Kalman Filter (SUKF) step to refine only the orientation. This sequential process blends the accuracy of SUKF in quaternion estimation with the overall computational efficiency of ESKF. The reliability of visual measurements is assessed via a dynamic sensor confidence score based on metrics, such as image entropy, intensity variation, motion blur, and inference quality, adapting the measurement noise covariance to ensure stable pose estimation even under challenging conditions. Comprehensive experimental analyses on the EuRoC MAV dataset demonstrate key advantages: an average improvement of 49% in position accuracy in challenging scenarios, an average of 57% in rotation accuracy over ESKF-based methods, and SUKF-comparable accuracy achieved with approximately 48% lower computational cost than a full SUKF implementation. These findings demonstrate that the presented approach strikes an effective balance between computational efficiency and estimation accuracy, and significantly enhances UAV pose estimation performance in complex environments with varying sensor reliability.
47. 【2512.17504】InsertAnywhere: Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion
链接:https://arxiv.org/abs/2512.17504
作者:Hoiyeong Jin,Hyojin Jang,Jeongho Kim,Junha Hyung,Kinam Kim,Dongjin Kim,Huijin Choi,Hyeonji Kim,Jaegul Choo
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:remains challenging due, Recent advances, controllable video editing, remains challenging, due to limited
备注: 16 pages, project page: [this https URL](https://myyzzzoooo.github.io/InsertAnywhere/)
点击查看摘要
Abstract:Recent advances in diffusion-based video generation have opened new possibilities for controllable video editing, yet realistic video object insertion (VOI) remains challenging due to limited 4D scene understanding and inadequate handling of occlusion and lighting effects. We present InsertAnywhere, a new VOI framework that achieves geometrically consistent object placement and appearance-faithful video synthesis. Our method begins with a 4D aware mask generation module that reconstructs the scene geometry and propagates user specified object placement across frames while maintaining temporal coherence and occlusion consistency. Building upon this spatial foundation, we extend a diffusion based video generation model to jointly synthesize the inserted object and its surrounding local variations such as illumination and shading. To enable supervised training, we introduce ROSE++, an illumination aware synthetic dataset constructed by transforming the ROSE object removal dataset into triplets of object removed video, object present video, and a VLM generated reference image. Through extensive experiments, we demonstrate that our framework produces geometrically plausible and visually coherent object insertions across diverse real world scenarios, significantly outperforming existing research and commercial models.
48. 【2512.17499】Validation of Diagnostic Artificial Intelligence Models for Prostate Pathology in a Middle Eastern Cohort
链接:https://arxiv.org/abs/2512.17499
作者:Peshawa J. Muhammad Ali(1 and 2),Navin Vincent(3),Saman S. Abdulla(4 and 5),Han N. Mohammed Fadhl(6),Anders Blilie(7 and 8),Kelvin Szolnoky(9),Julia Anna Mielcarz(3),Xiaoyi Ji(9),Nita Mulliqi(3),Abdulbasit K. Al-Talabani(1),Kimmo Kartasalo(3) ((1) Department of Software Engineering, Faculty of Engineering, Koya University, Koya 44023, Kurdistan Region - F.R. Iraq, (2) Department of Mechanical and Manufacturing Engineering, Faculty of Engineering, Koya University, Koya 44023, Kurdistan Region - F.R. Iraq, (3) Department of Medical Epidemiology and Biostatistics, SciLifeLab, Karolinska Institutet, Stockholm, Sweden, (4) College of Dentistry, Hawler Medical University, Erbil, Kurdistan Region, Iraq, (5) PAR Private Hospital, Erbil, Kurdistan Region, Iraq, (6) College of Dentistry, University of Sulaimani, Sulaymaniyah, Kurdistan Region, Iraq, (7) Department of Pathology, Stavanger University Hospital, Stavanger, Norway, (8) Faculty of Health Sciences, University of Stavanger, Stavanger, Norway, (9) Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden)
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Toggle, Code Toggle Papers, Toggle Hugging Face, Code, Papers
备注: 40 pages, 8 figures, 11 tables
点击查看摘要
Abstract:Background: Artificial intelligence (AI) is improving the efficiency and accuracy of cancer diagnostics. The performance of pathology AI systems has been almost exclusively evaluated on European and US cohorts from large centers. For global AI adoption in pathology, validation studies on currently under-represented populations - where the potential gains from AI support may also be greatest - are needed. We present the first study with an external validation cohort from the Middle East, focusing on AI-based diagnosis and Gleason grading of prostate cancer. Methods: We collected and digitised 339 prostate biopsy specimens from the Kurdistan region, Iraq, representing a consecutive series of 185 patients spanning the period 2013-2024. We evaluated a task-specific end-to-end AI model and two foundation models in terms of their concordance with pathologists and consistency across samples digitised on three scanner models (Hamamatsu, Leica, and Grundium). Findings: Grading concordance between AI and pathologists was similar to pathologist-pathologist concordance with Cohen's quadratically weighted kappa 0.801 vs. 0.799 (p=0.9824). Cross-scanner concordance was high (quadratically weighted kappa 0.90) for all AI models and scanner pairs, including low-cost compact scanner. Interpretation: AI models demonstrated pathologist-level performance in prostate histopathology assessment. Compact scanners can provide a route for validation studies in non-digitalised settings and enable cost-effective adoption of AI in laboratories with limited sample volumes. This first openly available digital pathology dataset from the Middle East supports further research into globally equitable AI pathology. Funding: SciLifeLab and Wallenberg Data Driven Life Science Program, Instrumentarium Science Foundation, Karolinska Institutet Research Foundation.
Comments:
40 pages, 8 figures, 11 tables
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2512.17499 [cs.CV]
(or
arXiv:2512.17499v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2512.17499
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Kimmo Kartasalo [view email] [v1]
Fri, 19 Dec 2025 12:08:28 UTC (2,195 KB)
Full-text links:
Access Paper:
View a PDF of the paper titled Validation of Diagnostic Artificial Intelligence Models for Prostate Pathology in a Middle Eastern Cohort, by Peshawa J. Muhammad Ali (1 and 2) and 50 other authorsView PDF
view license
Current browse context: cs.CV
prev
|
next
new
|
recent
| 2025-12
Change to browse by:
cs
References Citations
NASA ADSGoogle Scholar
Semantic Scholar
export BibTeX citation
Loading…
BibTeX formatted citation
loading…
Data provided by:
Bookmark
checked=“checked”>
Bibliographic Tools
Bibliographic and Citation Tools
Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media
Code, Data and Media Associated with this Article
alphaXiv Toggle
alphaXiv (What is alphaXiv?)
Links to Code Toggle
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub Toggle
DagsHub (What is DagsHub?)
GotitPub Toggle
Gotit.pub (What is GotitPub?)
Huggingface Toggle
Hugging Face (What is Huggingface?)
Links to Code Toggle
Papers with Code (What is Papers with Code?)
ScienceCast Toggle
ScienceCast (What is ScienceCast?)
Demos
Demos
Replicate Toggle
Replicate (What is Replicate?)
Spaces Toggle
Hugging Face Spaces (What is Spaces?)
Spaces Toggle
Related Papers
Recommenders and Search Tools
Link to Influence Flower
Influence Flower (What are Influence Flowers?)
Core recommender toggle
CORE Recommender (What is CORE?)
Author
Venue
Institution
Topic
About arXivLabs
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.
Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)
mathjaxToggle();
About
Help
contact arXivClick here to contact arXiv
Contact
subscribe to arXiv mailingsClick here to subscribe
Subscribe
Copyright
Privacy Policy
Web Accessibility Assistance
arXiv Operational Status
49. 【2512.17495】GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation
链接:https://arxiv.org/abs/2512.17495
作者:Rang Li,Lei Li,Shuhuai Ren,Hao Tian,Shuhao Gu,Shicheng Li,Zihao Yue,Yudong Wang,Wenhan Ma,Zhe Yang,Jingyuan Ma,Zhifang Sui,Fuli Luo
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:natural language descriptions, natural language, language, localizing objects, Abstract
备注:
点击查看摘要
Abstract:Visual grounding, localizing objects from natural language descriptions, represents a critical bridge between language and vision understanding. While multimodal large language models (MLLMs) achieve impressive scores on existing benchmarks, a fundamental question remains: can MLLMs truly ground language in vision with human-like sophistication, or are they merely pattern-matching on simplified datasets? Current benchmarks fail to capture real-world complexity where humans effortlessly navigate ambiguous references and recognize when grounding is impossible. To rigorously assess MLLMs' true capabilities, we introduce GroundingME, a benchmark that systematically challenges models across four critical dimensions: (1) Discriminative, distinguishing highly similar objects, (2) Spatial, understanding complex relational descriptions, (3) Limited, handling occlusions or tiny objects, and (4) Rejection, recognizing ungroundable queries. Through careful curation combining automated generation with human verification, we create 1,005 challenging examples mirroring real-world complexity. Evaluating 25 state-of-the-art MLLMs reveals a profound capability gap: the best model achieves only 45.1% accuracy, while most score 0% on rejection tasks, reflexively hallucinating objects rather than acknowledging their absence, raising critical safety concerns for deployment. We explore two strategies for improvements: (1) test-time scaling selects optimal response by thinking trajectory to improve complex grounding by up to 2.9%, and (2) data-mixture training teaches models to recognize ungroundable queries, boosting rejection accuracy from 0% to 27.9%. GroundingME thus serves as both a diagnostic tool revealing current limitations in MLLMs and a roadmap toward human-level visual grounding.
50. 【2512.17492】MMLANDMARKS: a Cross-View Instance-Level Benchmark for Geo-Spatial Understanding
链接:https://arxiv.org/abs/2512.17492
作者:Oskar Kristoffersen,Alba R. Sánchez,Morten R. Hannemose,Anders B. Dahl,Dim P. Papadopoulos
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:single geographic location, world benefits, textual descriptions, geographic coordinates, single geographic
备注:
点击查看摘要
Abstract:Geo-spatial analysis of our world benefits from a multimodal approach, as every single geographic location can be described in numerous ways (images from various viewpoints, textual descriptions, and geographic coordinates). Current geo-spatial benchmarks have limited coverage across modalities, considerably restricting progress in the field, as current approaches cannot integrate all relevant modalities within a unified framework. We introduce the Multi-Modal Landmark dataset (MMLANDMARKS), a benchmark composed of four modalities: 197k highresolution aerial images, 329k ground-view images, textual information, and geographic coordinates for 18,557 distinct landmarks in the United States. The MMLANDMARKS dataset has a one-to-one correspondence across every modality, which enables training and benchmarking models for various geo-spatial tasks, including cross-view Ground-to-Satellite retrieval, ground and satellite geolocalization, Text-to-Image, and Text-to-GPS retrieval. We demonstrate broad generalization and competitive performance against off-the-shelf foundational models and specialized state-of-the-art models across different tasks by employing a simple CLIP-inspired baseline, illustrating the necessity for multimodal datasets to achieve broad geo-spatial understanding.
51. 【2512.17489】LumiCtrl : Learning Illuminant Prompts for Lighting Control in Personalized Text-to-Image Models
链接:https://arxiv.org/abs/2512.17489
作者:Muhammad Atif Butt,Kai Wang,Javier Vazquez-Corral,Joost Van De Weijer
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:demonstrated remarkable progress, lack precise control, content designers aiming, creative image generation, models have demonstrated
备注:
点击查看摘要
Abstract:Current text-to-image (T2I) models have demonstrated remarkable progress in creative image generation, yet they still lack precise control over scene illuminants, which is a crucial factor for content designers aiming to manipulate the mood, atmosphere, and visual aesthetics of generated images. In this paper, we present an illuminant personalization method named LumiCtrl that learns an illuminant prompt given a single image of an object. LumiCtrl consists of three basic components: given an image of the object, our method applies (a) physics-based illuminant augmentation along the Planckian locus to create fine-tuning variants under standard illuminants; (b) edge-guided prompt disentanglement using a frozen ControlNet to ensure prompts focus on illumination rather than structure; and (c) a masked reconstruction loss that focuses learning on the foreground object while allowing the background to adapt contextually, enabling what we call contextual light adaptation. We qualitatively and quantitatively compare LumiCtrl against other T2I customization methods. The results show that our method achieves significantly better illuminant fidelity, aesthetic quality, and scene coherence compared to existing personalization baselines. A human preference study further confirms strong user preference for LumiCtrl outputs. The code and data will be released upon publication.
52. 【2512.17488】winSegNet: A Digital Twin-Enabled Federated Learning Framework for Brain Tumor Analysis
链接:https://arxiv.org/abs/2512.17488
作者:Almustapha A. Wakili,Adamu Hussaini,Abubakar A. Musa,Woosub Jung,Wei Yu
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Brain tumor segmentation, critical in diagnosis, diagnosis and treatment, treatment planning, Brain tumor
备注: IEEE Virtual Conference on Communications. 4-6 November 2025
点击查看摘要
Abstract:Brain tumor segmentation is critical in diagnosis and treatment planning for the disease. Yet, current deep learning methods rely on centralized data collection, which raises privacy concerns and limits generalization across diverse institutions. In this paper, we propose TwinSegNet, which is a privacy-preserving federated learning framework that integrates a hybrid ViT-UNet model with personalized digital twins for accurate and real-time brain tumor segmentation. Our architecture combines convolutional encoders with Vision Transformer bottlenecks to capture local and global context. Each institution fine-tunes the global model of private data to form its digital twin. Evaluated on nine heterogeneous MRI datasets, including BraTS 2019-2021 and custom tumor collections, TwinSegNet achieves high Dice scores (up to 0.90%) and sensitivity/specificity exceeding 90%, demonstrating robustness across non-independent and identically distributed (IID) client distributions. Comparative results against centralized models such as TumorVisNet highlight TwinSegNet's effectiveness in preserving privacy without sacrificing performance. Our approach enables scalable, personalized segmentation for multi-institutional clinical settings while adhering to strict data confidentiality requirements.
53. 【2512.17459】3D-RE-GEN: 3D Reconstruction of Indoor Scenes with a Generative Framework
链接:https://arxiv.org/abs/2512.17459
作者:Tobias Sautter,Jan-Niklas Dihlmann,Hendrik P.A. Lensch
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:visually appealing output, produce visually appealing, current representations hinder, Recent advances, representations hinder artists'
备注: Project Page: [this https URL](https://3dregen.jdihlmann.com/)
点击查看摘要
Abstract:Recent advances in 3D scene generation produce visually appealing output, but current representations hinder artists' workflows that require modifiable 3D textured mesh scenes for visual effects and game development. Despite significant advances, current textured mesh scene reconstruction methods are far from artist ready, suffering from incorrect object decomposition, inaccurate spatial relationships, and missing backgrounds. We present 3D-RE-GEN, a compositional framework that reconstructs a single image into textured 3D objects and a background. We show that combining state of the art models from specific domains achieves state of the art scene reconstruction performance, addressing artists' requirements. Our reconstruction pipeline integrates models for asset detection, reconstruction, and placement, pushing certain models beyond their originally intended domains. Obtaining occluded objects is treated as an image editing task with generative models to infer and reconstruct with scene level reasoning under consistent lighting and geometry. Unlike current methods, 3D-RE-GEN generates a comprehensive background that spatially constrains objects during optimization and provides a foundation for realistic lighting and simulation tasks in visual effects and games. To obtain physically realistic layouts, we employ a novel 4-DoF differentiable optimization that aligns reconstructed objects with the estimated ground plane. 3D-RE-GEN~achieves state of the art performance in single image 3D scene reconstruction, producing coherent, modifiable scenes through compositional generation guided by precise camera recovery and spatial optimization.
Comments:
Project Page: this https URL
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2512.17459 [cs.CV]
(or
arXiv:2512.17459v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2512.17459
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
54. 【2512.17450】MULTIAQUA: A multimodal maritime dataset and robust training strategies for multimodal semantic segmentation
链接:https://arxiv.org/abs/2512.17450
作者:Jon Muhovič,Janez Perš
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Unmanned surface vehicles, varied visual circumstances, Unmanned surface, circumstances during operation, surface vehicles
备注:
点击查看摘要
Abstract:Unmanned surface vehicles can encounter a number of varied visual circumstances during operation, some of which can be very difficult to interpret. While most cases can be solved only using color camera images, some weather and lighting conditions require additional information. To expand the available maritime data, we present a novel multimodal maritime dataset MULTIAQUA (Multimodal Aquatic Dataset). Our dataset contains synchronized, calibrated and annotated data captured by sensors of different modalities, such as RGB, thermal, IR, LIDAR, etc. The dataset is aimed at developing supervised methods that can extract useful information from these modalities in order to provide a high quality of scene interpretation regardless of potentially poor visibility conditions. To illustrate the benefits of the proposed dataset, we evaluate several multimodal methods on our difficult nighttime test set. We present training approaches that enable multimodal methods to be trained in a more robust way, thus enabling them to retain reliable performance even in near-complete darkness. Our approach allows for training a robust deep neural network only using daytime images, thus significantly simplifying data acquisition, annotation, and the training process.
55. 【2512.17445】LangDriveCTRL: Natural Language Controllable Driving Scene Editing with Multi-modal Agents
链接:https://arxiv.org/abs/2512.17445
作者:Yun He,Francesco Pittaluga,Ziyu Jiang,Matthias Zwicker,Manmohan Chandraker,Zaid Tasneem
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:real-world driving videos, diverse traffic scenarios, editing real-world driving, synthesize diverse traffic, Behavior Reviewer Agent
备注: Project Page: [this https URL](https://yunhe24.github.io/langdrivectrl/)
点击查看摘要
Abstract:LangDriveCTRL is a natural-language-controllable framework for editing real-world driving videos to synthesize diverse traffic scenarios. It leverages explicit 3D scene decomposition to represent driving videos as a scene graph, containing static background and dynamic objects. To enable fine-grained editing and realism, it incorporates an agentic pipeline in which an Orchestrator transforms user instructions into execution graphs that coordinate specialized agents and tools. Specifically, an Object Grounding Agent establishes correspondence between free-form text descriptions and target object nodes in the scene graph; a Behavior Editing Agent generates multi-object trajectories from language instructions; and a Behavior Reviewer Agent iteratively reviews and refines the generated trajectories. The edited scene graph is rendered and then refined using a video diffusion tool to address artifacts introduced by object insertion and significant view changes. LangDriveCTRL supports both object node editing (removal, insertion and replacement) and multi-object behavior editing from a single natural-language instruction. Quantitatively, it achieves nearly $2\times$ higher instruction alignment than the previous SoTA, with superior structural preservation, photorealism, and traffic realism. Project page is available at: this https URL.
56. 【2512.17436】Xiaomi MiMo-VL-Miloco Technical Report
链接:https://arxiv.org/abs/2512.17436
作者:Jiaze Li,Jingyang Chen,Yuxun Qu,Jianzhong Ju,Zhenbo Luo,Jian Luan,Shijie Xu,Zhenru Lin,Junyou Zhu,Boshen Xu,Wenhui Tan,Pei Fu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieve strong performance, home-centric vision-language models, general multimodal reasoning, home-scenario understanding, pair of home-centric
备注:
点击查看摘要
Abstract:We open-source \textbf{MiMo-VL-Miloco-7B} and its quantized variant \textbf{MiMo-VL-Miloco-7B-GGUF}, a pair of home-centric vision-language models that achieve strong performance on both home-scenario understanding and general multimodal reasoning. Built on the MiMo-VL-7B backbone, MiMo-VL-Miloco-7B is specialized for smart-home environments, attaining leading F1 scores on gesture recognition and common home-scenario understanding, while also delivering consistent gains across video benchmarks such as Video-MME, Video-MMMU, and Charades-STA, as well as language understanding benchmarks including MMMU-Pro and MMLU-Pro. In our experiments, MiMo-VL-Miloco-7B outperforms strong closed-source and open-source baselines on home-scenario understanding and several multimodal reasoning benchmarks. To balance specialization and generality, we design a two-stage training pipeline that combines supervised fine-tuning with reinforcement learning based on Group Relative Policy Optimization, leveraging efficient multi-domain data. We further incorporate chain-of-thought supervision and token-budget-aware reasoning, enabling the model to learn knowledge in a data-efficient manner while also performing reasoning efficiently. Our analysis shows that targeted home-scenario training not only enhances activity and gesture understanding, but also improves text-only reasoning with only modest trade-offs on document-centric tasks. Model checkpoints, quantized GGUF weights, and our home-scenario evaluation toolkit are publicly available at \href{this https URL}{this https URL} to support research and deployment in real-world smart-home applications.
57. 【2512.17432】AIFloodSense: A Global Aerial Imagery Dataset for Semantic Segmentation and Understanding of Flooded Environments
链接:https://arxiv.org/abs/2512.17432
作者:Georgios Simantiris,Konstantinos Bacharidis,Apostolos Papanikolaou,Petros Giannakakis,Costas Panagiotakis
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Accurate flood detection, remain scarce due, segmentation remain scarce, annotating large-scale imagery, improving disaster response
备注: 36 pages, 19 figures, 8 tables
点击查看摘要
Abstract:Accurate flood detection from visual data is a critical step toward improving disaster response and risk assessment, yet datasets for flood segmentation remain scarce due to the challenges of collecting and annotating large-scale imagery. Existing resources are often limited in geographic scope and annotation detail, hindering the development of robust, generalized computer vision methods. To bridge this gap, we introduce AIFloodSense, a comprehensive, publicly available aerial imagery dataset comprising 470 high-resolution images from 230 distinct flood events across 64 countries and six continents. Unlike prior benchmarks, AIFloodSense ensures global diversity and temporal relevance (2022-2024), supporting three complementary tasks: (i) Image Classification with novel sub-tasks for environment type, camera angle, and continent recognition; (ii) Semantic Segmentation providing precise pixel-level masks for flood, sky, and buildings; and (iii) Visual Question Answering (VQA) to enable natural language reasoning for disaster assessment. We establish baseline benchmarks for all tasks using state-of-the-art architectures, demonstrating the dataset's complexity and its value in advancing domain-generalized AI tools for climate resilience.
58. 【2512.17416】Beyond Occlusion: In Search for Near Real-Time Explainability of CNN-Based Prostate Cancer Classification
链接:https://arxiv.org/abs/2512.17416
作者:Martin Krebs,Jan Obdržálek,Vít Musil,Tomáš Brázdil
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Deep neural networks, Deep neural, assisted cancer diagnosis, neural networks, networks are starting
备注:
点击查看摘要
Abstract:Deep neural networks are starting to show their worth in critical applications such as assisted cancer diagnosis. However, for their outputs to get accepted in practice, the results they provide should be explainable in a way easily understood by pathologists. A well-known and widely used explanation technique is occlusion, which, however, can take a long time to compute, thus slowing the development and interaction with pathologists. In this work, we set out to find a faster replacement for occlusion in a successful system for detecting prostate cancer. Since there is no established framework for comparing the performance of various explanation methods, we first identified suitable comparison criteria and selected corresponding metrics. Based on the results, we were able to choose a different explanation method, which cut the previously required explanation time at least by a factor of 10, without any negative impact on the quality of outputs. This speedup enables rapid iteration in model development and debugging and brings us closer to adopting AI-assisted prostate cancer detection in clinical settings. We propose that our approach to finding the replacement for occlusion can be used to evaluate candidate methods in other related applications.
59. 【2512.17396】RadImageNet-VQA: A Large-Scale CT and MRI Dataset for Radiologic Visual Question Answering
链接:https://arxiv.org/abs/2512.17396
作者:Léo Butsanets,Charles Corbière,Julien Khlaut,Pierre Manceron,Corentin Dancette
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:advance radiologic visual, large-scale dataset designed, visual question answering, MRI exams, radiologic visual question
备注: Preprint, 23 pages, 12 figures, 7 tables
点击查看摘要
Abstract:In this work, we introduce RadImageNet-VQA, a large-scale dataset designed to advance radiologic visual question answering (VQA) on CT and MRI exams. Existing medical VQA datasets are limited in scale, dominated by X-ray imaging or biomedical illustrations, and often prone to text-based shortcuts. RadImageNet-VQA is built from expert-curated annotations and provides 750K images paired with 7.5M question-answer samples. It covers three key tasks - abnormality detection, anatomy recognition, and pathology identification - spanning eight anatomical regions and 97 pathology categories, and supports open-ended, closed-ended, and multiple-choice questions. Extensive experiments show that state-of-the-art vision-language models still struggle with fine-grained pathology identification, particularly in open-ended settings and even after fine-tuning. Text-only analysis further reveals that model performance collapses to near-random without image inputs, confirming that RadImageNet-VQA is free from linguistic shortcuts. The full dataset and benchmark are publicly available at this https URL.
60. 【2512.17394】Are Vision Language Models Cross-Cultural Theory of Mind Reasoners?
链接:https://arxiv.org/abs/2512.17394
作者:Zabir Al Nazi,G M Shahariar,Abrar Hossain,Wei Peng
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
关键词:artificial agents, ability to attribute, remains a major, major challenge, challenge for artificial
备注:
点击查看摘要
Abstract:Theory of Mind (ToM) -- the ability to attribute beliefs, desires, and emotions to others -- is fundamental for human social intelligence, yet remains a major challenge for artificial agents. Existing Vision-Language Models (VLMs) are increasingly applied in socially grounded tasks, but their capacity for cross-cultural ToM reasoning is largely unexplored. In this work, we introduce CulturalToM-VQA, a new evaluation benchmark containing 5095 questions designed to probe ToM reasoning across diverse cultural contexts through visual question answering. The dataset captures culturally grounded cues such as rituals, attire, gestures, and interpersonal dynamics, enabling systematic evaluation of ToM reasoning beyond Western-centric benchmarks. Our dataset is built through a VLM-assisted human-in-the-loop pipeline, where human experts first curate culturally rich images across traditions, rituals, and social interactions; a VLM then assist in generating structured ToM-focused scene descriptions, which are refined into question-answer pairs spanning a taxonomy of six ToM tasks and four graded complexity levels. The resulting dataset covers diverse theory of mind facets such as mental state attribution, false belief reasoning, non-literal communication, social norm violations, perspective coordination, and multi-agent reasoning.
61. 【2512.17376】owards Deeper Emotional Reflection: Crafting Affective Image Filters with Generative Priors
链接:https://arxiv.org/abs/2512.17376
作者:Peixuan Zhang,Shuchen Weng,Jiajun Tang,Si Li,Boxin Shi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Social media platforms, media platforms enable, Affective Image Filter, Social media, platforms enable users
备注:
点击查看摘要
Abstract:Social media platforms enable users to express emotions by posting text with accompanying images. In this paper, we propose the Affective Image Filter (AIF) task, which aims to reflect visually-abstract emotions from text into visually-concrete images, thereby creating emotionally compelling results. We first introduce the AIF dataset and the formulation of the AIF models. Then, we present AIF-B as an initial attempt based on a multi-modal transformer architecture. After that, we propose AIF-D as an extension of AIF-B towards deeper emotional reflection, effectively leveraging generative priors from pre-trained large-scale diffusion models. Quantitative and qualitative experiments demonstrate that AIF models achieve superior performance for both content consistency and emotional fidelity compared to state-of-the-art methods. Extensive user study experiments demonstrate that AIF models are significantly more effective at evoking specific emotions. Based on the presented results, we comprehensively discuss the value and potential of AIF models.
62. 【2512.17350】Beyond Semantic Features: Pixel-level Mapping for Generalized AI-Generated Image Detection
链接:https://arxiv.org/abs/2512.17350
作者:Chenming Zhou,Jiaan Wang,Yu Li,Lei Li,Juan Cao,Sheng Tang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:technologies necessitates reliable, necessitates reliable methods, generative technologies necessitates, detecting AI-generated images, rapid evolution
备注: Accepted by AAAI 2026
点击查看摘要
Abstract:The rapid evolution of generative technologies necessitates reliable methods for detecting AI-generated images. A critical limitation of current detectors is their failure to generalize to images from unseen generative models, as they often overfit to source-specific semantic cues rather than learning universal generative artifacts. To overcome this, we introduce a simple yet remarkably effective pixel-level mapping pre-processing step to disrupt the pixel value distribution of images and break the fragile, non-essential semantic patterns that detectors commonly exploit as shortcuts. This forces the detector to focus on more fundamental and generalizable high-frequency traces inherent to the image generation process. Through comprehensive experiments on GAN and diffusion-based generators, we show that our approach significantly boosts the cross-generator performance of state-of-the-art detectors. Extensive analysis further verifies our hypothesis that the disruption of semantic cues is the key to generalization.
63. 【2512.17343】Multi-level distortion-aware deformable network for omnidirectional image super-resolution
链接:https://arxiv.org/abs/2512.17343
作者:Cuixin Yang,Rongkang Dong,Kin-Man Lam,Yuhang Zhang,Guoping Qiu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:virtual reality applications, reality applications gain, applications gain popularity, attracted increasing attention, augmented reality
备注:
点击查看摘要
Abstract:As augmented reality and virtual reality applications gain popularity, image processing for OmniDirectional Images (ODIs) has attracted increasing attention. OmniDirectional Image Super-Resolution (ODISR) is a promising technique for enhancing the visual quality of ODIs. Before performing super-resolution, ODIs are typically projected from a spherical surface onto a plane using EquiRectangular Projection (ERP). This projection introduces latitude-dependent geometric distortion in ERP images: distortion is minimal near the equator but becomes severe toward the poles, where image content is stretched across a wider area. However, existing ODISR methods have limited sampling ranges and feature extraction capabilities, which hinder their ability to capture distorted patterns over large areas. To address this issue, we propose a novel Multi-level Distortion-aware Deformable Network (MDDN) for ODISR, designed to expand the sampling range and receptive field. Specifically, the feature extractor in MDDN comprises three parallel branches: a deformable attention mechanism (serving as the dilation=1 path) and two dilated deformable convolutions with dilation rates of 2 and 3. This architecture expands the sampling range to include more distorted patterns across wider areas, generating dense and comprehensive features that effectively capture geometric distortions in ERP images. The representations extracted from these deformable feature extractors are adaptively fused in a multi-level feature fusion module. Furthermore, to reduce computational cost, a low-rank decomposition strategy is applied to dilated deformable convolutions. Extensive experiments on publicly available datasets demonstrate that MDDN outperforms state-of-the-art methods, underscoring its effectiveness and superiority in ODISR.
64. 【2512.17331】SynergyWarpNet: Attention-Guided Cooperative Warping for Neural Portrait Animation
链接:https://arxiv.org/abs/2512.17331
作者:Shihang Li,Zhiqiang Gong,Minming Ye,Yue Gao,Wen Yao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:digital content creation, demonstrated remarked potential, virtual avatars, content creation, advances in neural
备注: Submitted to ICASSP 2026
点击查看摘要
Abstract:Recent advances in neural portrait animation have demonstrated remarked potential for applications in virtual avatars, telepresence, and digital content creation. However, traditional explicit warping approaches often struggle with accurate motion transfer or recovering missing regions, while recent attention-based warping methods, though effective, frequently suffer from high complexity and weak geometric grounding. To address these issues, we propose SynergyWarpNet, an attention-guided cooperative warping framework designed for high-fidelity talking head synthesis. Given a source portrait, a driving image, and a set of reference images, our model progressively refines the animation in three stages. First, an explicit warping module performs coarse spatial alignment between the source and driving image using 3D dense optical flow. Next, a reference-augmented correction module leverages cross-attention across 3D keypoints and texture features from multiple reference images to semantically complete occluded or distorted regions. Finally, a confidence-guided fusion module integrates the warped outputs with spatially-adaptive fusing, using a learned confidence map to balance structural alignment and visual consistency. Comprehensive evaluations on benchmark datasets demonstrate state-of-the-art performance.
65. 【2512.17326】Democratizing Pathology Co-Pilots: An Open Pipeline and Dataset for Whole-Slide Vision-Language Modelling
链接:https://arxiv.org/abs/2512.17326
作者:Sander Moonemans,Sebastiaan Ram,Frédérique Meeuwsen,Carlijn Lems,Jeroen van der Laak,Geert Litjens,Francesco Ciompi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Vision-language models, co-pilots for pathologists, Vision-language, alpha, ANTONI
备注: 10 pages, 4 figures
点击查看摘要
Abstract:Vision-language models (VLMs) have the potential to become co-pilots for pathologists. However, most VLMs either focus on small regions of interest within whole-slide images, provide only static slide-level outputs, or rely on data that is not publicly available, limiting reproducibility. Furthermore, training data containing WSIs paired with detailed clinical reports is scarce, restricting progress toward transparent and generalisable VLMs. We address these limitations with three main contributions. First, we introduce Polysome, a standardised tool for synthetic instruction generation. Second, we apply Polysome to the public HISTAI dataset, generating HISTAI-Instruct, a large whole-slide instruction tuning dataset spanning 24,259 slides and over 1.1 million instruction-response pairs. Finally, we use HISTAI-Instruct to train ANTONI-{\alpha}, a VLM capable of visual-question answering (VQA). We show that ANTONI-{\alpha} outperforms MedGemma on WSI-level VQA tasks of tissue identification, neoplasm detection, and differential diagnosis. We also compare the performance of multiple incarnations of ANTONI-{\alpha} trained with different amounts of data. All methods, data, and code are publicly available.
66. 【2512.17323】DESSERT: Diffusion-based Event-driven Single-frame Synthesis via Residual Training
链接:https://arxiv.org/abs/2512.17323
作者:Jiyun Kong,Jun-Hyuk Kim,Jong-Seok Lee
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:dynamic scenes due, Video frame prediction, prediction extrapolates future, Video frame, extrapolates future frames
备注:
点击查看摘要
Abstract:Video frame prediction extrapolates future frames from previous frames, but suffers from prediction errors in dynamic scenes due to the lack of information about the next frame. Event cameras address this limitation by capturing per-pixel brightness changes asynchronously with high temporal resolution. Prior research on event-based video frame prediction has leveraged motion information from event data, often by predicting event-based optical flow and reconstructing frames via pixel warping. However, such approaches introduce holes and blurring when pixel displacement is inaccurate. To overcome this limitation, we propose DESSERT, a diffusion-based event-driven single-frame synthesis framework via residual training. Leveraging a pre-trained Stable Diffusion model, our method is trained on inter-frame residuals to ensure temporal consistency. The training pipeline consists of two stages: (1) an Event-to-Residual Alignment Variational Autoencoder (ER-VAE) that aligns the event frame between anchor and target frames with the corresponding residual, and (2) a diffusion model that denoises the residual latent conditioned on event data. Furthermore, we introduce Diverse-Length Temporal (DLT) augmentation, which improves robustness by training on frame segments of varying temporal lengths. Experimental results demonstrate that our method outperforms existing event-based reconstruction, image-based video frame prediction, event-based video frame prediction, and one-sided event-based video frame interpolation methods, producing sharper and more temporally consistent frame synthesis.
67. 【2512.17322】Rotterdam artery-vein segmentation (RAV) dataset
链接:https://arxiv.org/abs/2512.17322
作者:Jose Vargas Quiros,Bart Liefers,Karin van Garderen,Jeroen Vermeulen,Eyened Reading Center,Caroline Klaver
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Toggle, Code Toggle Papers, Toggle Hugging Face, Code, Bibliographic Explorer Toggle
备注:
点击查看摘要
Abstract:Purpose: To provide a diverse, high-quality dataset of color fundus images (CFIs) with detailed artery-vein (A/V) segmentation annotations, supporting the development and evaluation of machine learning algorithms for vascular analysis in ophthalmology. Methods: CFIs were sampled from the longitudinal Rotterdam Study (RS), encompassing a wide range of ages, devices, and capture conditions. Images were annotated using a custom interface that allowed graders to label arteries, veins, and unknown vessels on separate layers, starting from an initial vessel segmentation mask. Connectivity was explicitly verified and corrected using connected component visualization tools. Results: The dataset includes 1024x1024-pixel PNG images in three modalities: original RGB fundus images, contrast-enhanced versions, and RGB-encoded A/V masks. Image quality varied widely, including challenging samples typically excluded by automated quality assessment systems, but judged to contain valuable vascular information. Conclusion: This dataset offers a rich and heterogeneous source of CFIs with high-quality segmentations. It supports robust benchmarking and training of machine learning models under real-world variability in image quality and acquisition settings. Translational Relevance: By including connectivity-validated A/V masks and diverse image conditions, this dataset enables the development of clinically applicable, generalizable machine learning tools for retinal vascular analysis, potentially improving automated screening and diagnosis of systemic and ocular diseases.
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2512.17322 [cs.CV]
(or
arXiv:2512.17322v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2512.17322
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Jose David Vargas Quiros [view email] [v1]
Fri, 19 Dec 2025 08:09:02 UTC (432 KB)
Full-text links:
Access Paper:
View a PDF of the paper titled Rotterdam artery-vein segmentation (RAV) dataset, by Jose Vargas Quiros and 5 other authorsView PDF
view license
Current browse context: cs.CV
prev
|
next
new
|
recent
| 2025-12
Change to browse by:
cs
References Citations
NASA ADSGoogle Scholar
Semantic Scholar
export BibTeX citation
Loading…
BibTeX formatted citation
loading…
Data provided by:
Bookmark
checked=“checked”>
Bibliographic Tools
Bibliographic and Citation Tools
Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media
Code, Data and Media Associated with this Article
alphaXiv Toggle
alphaXiv (What is alphaXiv?)
Links to Code Toggle
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub Toggle
DagsHub (What is DagsHub?)
GotitPub Toggle
Gotit.pub (What is GotitPub?)
Huggingface Toggle
Hugging Face (What is Huggingface?)
Links to Code Toggle
Papers with Code (What is Papers with Code?)
ScienceCast Toggle
ScienceCast (What is ScienceCast?)
Demos
Demos
Replicate Toggle
Replicate (What is Replicate?)
Spaces Toggle
Hugging Face Spaces (What is Spaces?)
Spaces Toggle
Related Papers
Recommenders and Search Tools
Link to Influence Flower
Influence Flower (What are Influence Flowers?)
Core recommender toggle
CORE Recommender (What is CORE?)
Author
Venue
Institution
Topic
About arXivLabs
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.
Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)
mathjaxToggle();
About
Help
contact arXivClick here to contact arXiv
Contact
subscribe to arXiv mailingsClick here to subscribe
Subscribe
Copyright
Privacy Policy
Web Accessibility Assistance
arXiv Operational Status
68. 【2512.17320】EMMA: Concept Erasure Benchmark with Comprehensive Semantic Metrics and Diverse Categories
链接:https://arxiv.org/abs/2512.17320
作者:Lu Wei,Yuta Nakashima,Noa Garcia
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:generation has raised, concerns about privacy, Concept erasure, Concept erasure techniques, widespread adoption
备注: Under review
点击查看摘要
Abstract:The widespread adoption of text-to-image (T2I) generation has raised concerns about privacy, bias, and copyright violations. Concept erasure techniques offer a promising solution by selectively removing undesired concepts from pre-trained models without requiring full retraining. However, these methods are often evaluated on a limited set of concepts, relying on overly simplistic and direct prompts. To test the boundaries of concept erasure techniques, and assess whether they truly remove targeted concepts from model representations, we introduce EMMA, a benchmark that evaluates five key dimensions of concept erasure over 12 metrics. EMMA goes beyond standard metrics like image quality and time efficiency, testing robustness under challenging conditions, including indirect descriptions, visually similar non-target concepts, and potential gender and ethnicity bias, providing a socially aware analysis of method behavior. Using EMMA, we analyze five concept erasure methods across five domains (objects, celebrities, art styles, NSFW, and copyright). Our results show that existing methods struggle with implicit prompts (i.e., generating the erased concept when it is indirectly referenced) and visually similar non-target concepts (i.e., failing to generate non-targeted concepts resembling the erased one), while some amplify gender and ethnicity bias compared to the original model.
69. 【2512.17319】A Benchmark for Ultra-High-Resolution Remote Sensing MLLMs
链接:https://arxiv.org/abs/2512.17319
作者:Yunkai Dang,Meiyi Zhu,Donghao Wang,Yizhuo Zhang,Jiacheng Yang,Qi Fan,Yuekun Yang,Wenbin Li,Feng Miao,Yang Gao
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
关键词:existing remote sensing, Multimodal large language, remote sensing, existing remote, Multimodal large
备注:
点击查看摘要
Abstract:Multimodal large language models (MLLMs) demonstrate strong perception and reasoning performance on existing remote sensing (RS) benchmarks. However, most prior benchmarks rely on low-resolution imagery, and some high-resolution benchmarks suffer from flawed reasoning-task designs. We show that text-only LLMs can perform competitively with multimodal vision-language models on RS reasoning tasks without access to images, revealing a critical mismatch between current benchmarks and the intended evaluation of visual understanding. To enable faithful assessment, we introduce RSHR-Bench, a super-high-resolution benchmark for RS visual understanding and reasoning. RSHR-Bench contains 5,329 full-scene images with a long side of at least 4,000 pixels, with up to about 3 x 10^8 pixels per image, sourced from widely used RS corpora and UAV collections. We design four task families: multiple-choice VQA, open-ended VQA, image captioning, and single-image evaluation. These tasks cover nine perception categories and four reasoning types, supporting multi-turn and multi-image dialog. To reduce reliance on language priors, we apply adversarial filtering with strong LLMs followed by rigorous human verification. Overall, we construct 3,864 VQA tasks, 3,913 image captioning tasks, and 500 fully human-written or verified single-image evaluation VQA pairs. Evaluations across open-source, closed-source, and RS-specific VLMs reveal persistent performance gaps in super-high-resolution scenarios. Code: this https URL
70. 【2512.17313】Auxiliary Descriptive Knowledge for Few-Shot Adaptation of Vision-Language Model
链接:https://arxiv.org/abs/2512.17313
作者:SuBeen Lee,GilHan Park,WonJun Moon,Hyun Seok Seong,Jae-Pil Heo
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:impressive zero-shot capabilities, impressive zero-shot, zero-shot capabilities, capabilities of Vision-Language, struggle in downstream
备注:
点击查看摘要
Abstract:Despite the impressive zero-shot capabilities of Vision-Language Models (VLMs), they often struggle in downstream tasks with distribution shifts from the pre-training data. Few-Shot Adaptation (FSA-VLM) has emerged as a key solution, typically using Parameter-Efficient Fine-Tuning (PEFT) to adapt models with minimal data. However, these PEFT methods are constrained by their reliance on fixed, handcrafted prompts, which are often insufficient to understand the semantics of classes. While some studies have proposed leveraging image-induced prompts to provide additional clues for classification, they introduce prohibitive computational overhead at inference. Therefore, we introduce Auxiliary Descriptive Knowledge (ADK), a novel framework that efficiently enriches text representations without compromising efficiency. ADK first leverages a Large Language Model to generate a rich set of descriptive prompts for each class offline. These pre-computed features are then deployed in two ways: (1) as Compositional Knowledge, an averaged representation that provides rich semantics, especially beneficial when class names are ambiguous or unfamiliar to the VLM; and (2) as Instance-Specific Knowledge, where a lightweight, non-parametric attention mechanism dynamically selects the most relevant descriptions for a given image. This approach provides two additional types of knowledge alongside the handcrafted prompt, thereby facilitating category distinction across various domains. Also, ADK acts as a parameter-free, plug-and-play component that enhances existing PEFT methods. Extensive experiments demonstrate that ADK consistently boosts the performance of multiple PEFT baselines, setting a new state-of-the-art across various scenarios.
71. 【2512.17312】CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning
链接:https://arxiv.org/abs/2512.17312
作者:Qi Song,Honglin Li,Yingchen Yu,Haoyi Zhou,Lin Yang,Song Bai,Qi She,Zilong Huang,Yunqing Zhao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:rigid visual schemas, Recent releases, combines structured tool, highlight human-like, thinking with images
备注:
点击查看摘要
Abstract:Recent releases such as o3 highlight human-like "thinking with images" reasoning that combines structured tool use with stepwise verification, yet most open-source approaches still rely on text-only chains, rigid visual schemas, or single-step pipelines, limiting flexibility, interpretability, and transferability on complex tasks. We introduce CodeDance, which explores executable code as a general solver for visual reasoning. Unlike fixed-schema calls (e.g., only predicting bounding-box coordinates), CodeDance defines, composes, and executes code to orchestrate multiple tools, compute intermediate results, and render visual artifacts (e.g., boxes, lines, plots) that support transparent, self-checkable reasoning. To guide this process, we introduce a reward for balanced and adaptive tool-call, which balances exploration with efficiency and mitigates tool overuse. Interestingly, beyond the expected capabilities taught by atomic supervision, we empirically observe novel emergent behaviors during RL training: CodeDance demonstrates novel tool invocations, unseen compositions, and cross-task transfer. These behaviors arise without task-specific fine-tuning, suggesting a general and scalable mechanism of executable visual reasoning. Extensive experiments across reasoning benchmarks (e.g., visual search, math, chart QA) show that CodeDance not only consistently outperforms schema-driven and text-only baselines, but also surpasses advanced closed models such as GPT-4o and larger open-source models.
72. 【2512.17306】Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images
链接:https://arxiv.org/abs/2512.17306
作者:Wenhao Yang,Yu Xia,Jinlong Huang,Shiyin Lu,Qing-Guo Chen,Zhao Xu,Weihua Luo,Kaifu Zhang,Yuanyu Wan,Lijun Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent advances, large Vision-Language Models, exhibited strong reasoning, strong reasoning capabilities, actively invoking tools
备注:
点击查看摘要
Abstract:Recent advances in large Vision-Language Models (VLMs) have exhibited strong reasoning capabilities on complex visual tasks by thinking with images in their Chain-of-Thought (CoT), which is achieved by actively invoking tools to analyze visual inputs rather than merely perceiving them. However, existing models often struggle to reflect on and correct themselves when attempting incorrect reasoning trajectories. To address this limitation, we propose DRIM, a model that enables deep but reliable multi-turn reasoning when thinking with images in its multimodal CoT. Our pipeline comprises three stages: data construction, cold-start SFT and RL. Based on a high-resolution image dataset, we construct high-difficulty and verifiable visual question-answer pairs, where solving each task requires multi-turn tool calls to reach the correct answer. In the SFT stage, we collect tool trajectories as cold-start data, guiding a multi-turn reasoning pattern. In the RL stage, we introduce redundancy-penalized policy optimization, which incentivizes the model to develop a self-reflective reasoning pattern. The basic idea is to impose judgment on reasoning trajectories and penalize those that produce incorrect answers without sufficient multi-scale exploration. Extensive experiments demonstrate that DRIM achieves superior performance on visual understanding benchmarks.
73. 【2512.17303】EMAG: Self-Rectifying Diffusion Sampling with Exponential Moving Average Guidance
链接:https://arxiv.org/abs/2512.17303
作者:Ankit Yadav,Ta Duc Huy,Lingqiao Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:flow-matching generative models, flow-matching generative, improve sample quality, Moving Average Guidance, Exponential Moving Average
备注: 26 pages
点击查看摘要
Abstract:In diffusion and flow-matching generative models, guidance techniques are widely used to improve sample quality and consistency. Classifier-free guidance (CFG) is the de facto choice in modern systems and achieves this by contrasting conditional and unconditional samples. Recent work explores contrasting negative samples at inference using a weaker model, via strong/weak model pairs, attention-based masking, stochastic block dropping, or perturbations to the self-attention energy landscape. While these strategies refine the generation quality, they still lack reliable control over the granularity or difficulty of the negative samples, and target-layer selection is often fixed. We propose Exponential Moving Average Guidance (EMAG), a training-free mechanism that modifies attention at inference time in diffusion transformers, with a statistics-based, adaptive layer-selection rule. Unlike prior methods, EMAG produces harder, semantically faithful negatives (fine-grained degradations), surfacing difficult failure modes, enabling the denoiser to refine subtle artifacts, boosting the quality and human preference score (HPS) by +0.46 over CFG. We further demonstrate that EMAG naturally composes with advanced guidance techniques, such as APG and CADS, further improving HPS.
74. 【2512.17302】MatLat: Material Latent Space for PBR Texture Generation
链接:https://arxiv.org/abs/2512.17302
作者:Kyeongmin Yeo,Yunhong Min,Jaihoon Kim,Minhyuk Sung
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:producing high-quality PBR, high-quality PBR textures, PBR texture, PBR texture datasets, large-scale PBR texture
备注: Project page: [this https URL](https://matlat-proj.github.io)
点击查看摘要
Abstract:We propose a generative framework for producing high-quality PBR textures on a given 3D mesh. As large-scale PBR texture datasets are scarce, our approach focuses on effectively leveraging the embedding space and diffusion priors of pretrained latent image generative models while learning a material latent space, MatLat, through targeted fine-tuning. Unlike prior methods that freeze the embedding network and thus lead to distribution shifts when encoding additional PBR channels and hinder subsequent diffusion training, we fine-tune the pretrained VAE so that new material channels can be incorporated with minimal latent distribution deviation. We further show that correspondence-aware attention alone is insufficient for cross-view consistency unless the latent-to-image mapping preserves locality. To enforce this locality, we introduce a regularization in the VAE fine-tuning that crops latent patches, decodes them, and aligns the corresponding image regions to maintain strong pixel-latent spatial correspondence. Ablation studies and comparison with previous baselines demonstrate that our framework improves PBR texture fidelity and that each component is critical for achieving state-of-the-art performance.
75. 【2512.17298】ProCache: Constraint-Aware Feature Caching with Selective Computation for Diffusion Transformer Acceleration
链接:https://arxiv.org/abs/2512.17298
作者:Fanpu Cao,Yaofo Chen,Zeng You,Wei Luo,Cen Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Diffusion Transformers, hinders real-time deployment, high computational cost, computational cost hinders, cost hinders real-time
备注: Accepted for poster presentation at AAAI 2026
点击查看摘要
Abstract:Diffusion Transformers (DiTs) have achieved state-of-the-art performance in generative modeling, yet their high computational cost hinders real-time deployment. While feature caching offers a promising training-free acceleration solution by exploiting temporal redundancy, existing methods suffer from two key limitations: (1) uniform caching intervals fail to align with the non-uniform temporal dynamics of DiT, and (2) naive feature reuse with excessively large caching intervals can lead to severe error accumulation. In this work, we analyze the evolution of DiT features during denoising and reveal that both feature changes and error propagation are highly time- and depth-varying. Motivated by this, we propose ProCache, a training-free dynamic feature caching framework that addresses these issues via two core components: (i) a constraint-aware caching pattern search module that generates non-uniform activation schedules through offline constrained sampling, tailored to the model's temporal characteristics; and (ii) a selective computation module that selectively computes within deep blocks and high-importance tokens for cached segments to mitigate error accumulation with minimal overhead. Extensive experiments on PixArt-alpha and DiT demonstrate that ProCache achieves up to 1.96x and 2.90x acceleration with negligible quality degradation, significantly outperforming prior caching-based methods.
76. 【2512.17296】owards Pixel-Wise Anomaly Location for High-Resolution PCBA \\ via Self-Supervised Image Reconstruction
链接:https://arxiv.org/abs/2512.17296
作者:Wuyi Liu,Le Jin,Junxian Yang,Yuanchao Yu,Zishuo Peng,Jinfeng Xu,Xianzhi Li,Jun Zhou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:assembled Printed Circuit, Circuit Board Assemblies, Printed Circuit Board, Automated defect inspection, Printed Circuit
备注:
点击查看摘要
Abstract:Automated defect inspection of assembled Printed Circuit Board Assemblies (PCBA) is quite challenging due to the insufficient labeled data, micro-defects with just a few pixels in visually-complex and high-resolution images. To address these challenges, we present HiSIR-Net, a High resolution, Self-supervised Reconstruction framework for pixel-wise PCBA localization. Our design combines two lightweight modules that make this practical on real 4K-resolution boards: (i) a Selective Input-Reconstruction Gate (SIR-Gate) that lets the model decide where to trust reconstruction versus the original input, thereby reducing irrelevant reconstruction artifacts and false alarms; and (ii) a Region-level Optimized Patch Selection (ROPS) scheme with positional cues to select overlapping patch reconstructions coherently across arbitrary resolutions. Organically integrating these mechanisms yields clean, high-resolution anomaly maps with low false positive (FP) rate. To bridge the gap in high-resolution PCBA datasets, we further contribute a self-collected dataset named SIPCBA-500 of 500 images. We conduct extensive experiments on our SIPCBA-500 as well as public benchmarks, demonstrating the superior localization performance of our method while running at practical speed. Full code and dataset will be made available upon acceptance.
77. 【2512.17292】Vision-Language Model Guided Image Restoration
链接:https://arxiv.org/abs/2512.17292
作者:Cuixin Yang,Rongkang Dong,Kin-Man Lam
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:recover realistic photos, fine-grained details, image restoration, require both pixel-level, pixel-level fidelity
备注:
点击查看摘要
Abstract:Many image restoration (IR) tasks require both pixel-level fidelity and high-level semantic understanding to recover realistic photos with fine-grained details. However, previous approaches often struggle to effectively leverage both the visual and linguistic knowledge. Recent efforts have attempted to incorporate Vision-language models (VLMs), which excel at aligning visual and textual features, into universal IR. Nevertheless, these methods fail to utilize the linguistic priors to ensure semantic coherence during the restoration process. To address this issue, in this paper, we propose the Vision-Language Model Guided Image Restoration (VLMIR) framework, which leverages the rich vision-language priors of VLMs, such as CLIP, to enhance IR performance through improved visual perception and semantic understanding. Our approach consists of two stages: VLM-based feature extraction and diffusion-based image restoration. In the first stage, we extract complementary visual and linguistic representations of input images by condensing the visual perception and high-level semantic priors through VLMs. Specifically, we align the embeddings of captions from low-quality and high-quality images using a cosine similarity loss with LoRA fine-tuning, and employ a degradation predictor to decompose degradation and clean image content embeddings. These complementary visual and textual embeddings are then integrated into a diffusion-based model via cross-attention mechanisms for enhanced restoration. Extensive experiments and ablation studies demonstrate that VLMIR achieves superior performance across both universal and degradation-specific IR tasks, underscoring the critical role of integrated visual and linguistic knowledge from VLMs in advancing image restoration capabilities.
78. 【2512.17279】Diagnostic Performance of Universal-Learning Ultrasound AI Across Multiple Organs and Tasks: the UUSIC25 Challenge
链接:https://arxiv.org/abs/2512.17279
作者:Zehui Lin,Luyi Han,Xin Wang,Ying Zhou,Yanming Zhang,Tianyu Zhang,Lingyun Bao,Shandong Wu,Dong Xu,Tao Tan, theUUSIC25 Challenge Consortium
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:toggle, modern ultrasound systems, versatile modern ultrasound, Universal UltraSound Image, Code
备注: 8 pages, 2 figures. Summary of the UUSIC25 Challenge held at MICCAI 2025. Extensive Supplementary Material (containing original team reports) is available in the "ancillary files" section
点击查看摘要
Abstract:IMPORTANCE: Current ultrasound AI remains fragmented into single-task tools, limiting clinical utility compared to versatile modern ultrasound systems. OBJECTIVE: To evaluate the diagnostic accuracy and efficiency of single general-purpose deep learning models for multi-organ classification and segmentation. DESIGN: The Universal UltraSound Image Challenge 2025 (UUSIC25) involved developing algorithms on 11,644 images (public/private). Evaluation used an independent, multi-center test set of 2,479 images, including data from a center completely unseen during training to assess generalization. OUTCOMES: Diagnostic performance (Dice Similarity Coefficient [DSC]; Area Under the Receiver Operating Characteristic Curve [AUC]) and computational efficiency (inference time, GPU memory). RESULTS: Of 15 valid algorithms, the top model (SMART) achieved a macro-averaged DSC of 0.854 across 5 segmentation tasks and AUC of 0.766 for binary classification. Models showed high capability in segmentation (e.g., fetal head DSC: 0.942) but variability in complex tasks subject to domain shift. Notably, in breast cancer molecular subtyping, the top model's performance dropped from AUC 0.571 (internal) to 0.508 (unseen external center), highlighting generalization challenges. CONCLUSIONS: General-purpose AI models achieve high accuracy and efficiency across multiple tasks using a single architecture. However, performance degradation on unseen data suggests domain generalization is critical for future clinical deployment.
Comments:
8 pages, 2 figures. Summary of the UUSIC25 Challenge held at MICCAI 2025. Extensive Supplementary Material (containing original team reports) is available in the “ancillary files” section
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2512.17279 [cs.CV]
(or
arXiv:2512.17279v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2512.17279
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)
Submission history From: Zehui Lin [view email] [v1]
Fri, 19 Dec 2025 06:54:30 UTC (13,589 KB)
function toggleList(whichLayer,toggleThis)
{
var elem, vis;
if( document.getElementById ) // standard
elem = document.getElementById( whichLayer );
else if( document.all ) // old msie versions
elem = document.all[whichLayer];
else if( document.layers ) // nn4
elem = document.layers[whichLayer];
vis = elem.style;
// if the style.display value is blank we try to figure it out here
if(vis.display==''!=undefined!=undefined)
vis.display = (elem.offsetWidth!=0!=0)?'inline':'none';
vis.display = (vis.display==''||vis.display=='inline')?'none':'inline';
// toggle link inner text
status = vis.display;
if(vis.display=='inline'){
document.getElementById('toggle').innerHTML = "(collapse list)";
document.getElementById('toggle').title = "Collapse list";
} else {
document.getElementById('toggle').innerHTML = "("+toggleThis+")";
document.getElementById('toggle').title = "Show complete list";
}
}
Full-text links:
Access Paper:
View a PDF of the paper titled Diagnostic Performance of Universal-Learning Ultrasound AI Across Multiple Organs and Tasks: the UUSIC25 Challenge, by Zehui Lin and 10 other authorsView PDFHTML (experimental)TeX Source
view license
Ancillary-file links:
Ancillary files (details):
supplement.pdf
Current browse context: cs.CV
prev
|
next
new
|
recent
| 2025-12
Change to browse by:
cs
References Citations
NASA ADSGoogle Scholar
Semantic Scholar
export BibTeX citation
Loading…
BibTeX formatted citation
loading…
Data provided by:
Bookmark
checked=“checked”>
Bibliographic Tools
Bibliographic and Citation Tools
Bibliographic Explorer Toggle
Bibliographic Explorer (What is the Explorer?)
Connected Papers Toggle
Connected Papers (What is Connected Papers?)
Litmaps Toggle
Litmaps (What is Litmaps?)
scite.ai Toggle
scite Smart Citations (What are Smart Citations?)
Code, Data, Media
Code, Data and Media Associated with this Article
alphaXiv Toggle
alphaXiv (What is alphaXiv?)
Links to Code Toggle
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub Toggle
DagsHub (What is DagsHub?)
GotitPub Toggle
Gotit.pub (What is GotitPub?)
Huggingface Toggle
Hugging Face (What is Huggingface?)
Links to Code Toggle
Papers with Code (What is Papers with Code?)
ScienceCast Toggle
ScienceCast (What is ScienceCast?)
Demos
Demos
Replicate Toggle
Replicate (What is Replicate?)
Spaces Toggle
Hugging Face Spaces (What is Spaces?)
Spaces Toggle
Related Papers
Recommenders and Search Tools
Link to Influence Flower
Influence Flower (What are Influence Flowers?)
Core recommender toggle
CORE Recommender (What is CORE?)
Author
Venue
Institution
Topic
About arXivLabs
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs.
Which authors of this paper are endorsers? |
Disable MathJax (What is MathJax?)
mathjaxToggle();
About
Help
contact arXivClick here to contact arXiv
Contact
subscribe to arXiv mailingsClick here to subscribe
Subscribe
Copyright
Privacy Policy
Web Accessibility Assistance
arXiv Operational Status
79. 【2512.17278】WDFFU-Mamba: A Wavelet-guided Dual-attention Feature Fusion Mamba for Breast Tumor Segmentation in Ultrasound Images
链接:https://arxiv.org/abs/2512.17278
作者:Guoping Cai,Houjin Chen,Yanfeng Li,Jia Sun,Ziwei Chen,Qingzi Geng
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:early tumor screening, assisting clinical diagnosis, http URL, plays a vital, vital role
备注:
点击查看摘要
Abstract:Breast ultrasound (BUS) image segmentation plays a vital role in assisting clinical diagnosis and early tumor screening. However, challenges such as speckle noise, imaging artifacts, irregular lesion morphology, and blurred boundaries severely hinder accurate segmentation. To address these challenges, this work aims to design a robust and efficient model capable of automatically segmenting breast tumors in BUS this http URL propose a novel segmentation network named WDFFU-Mamba, which integrates wavelet-guided enhancement and dual-attention feature fusion within a U-shaped Mamba architecture. A Wavelet-denoised High-Frequency-guided Feature (WHF) module is employed to enhance low-level representations through noise-suppressed high-frequency cues. A Dual Attention Feature Fusion (DAFF) module is also introduced to effectively merge skip-connected and semantic features, improving contextual this http URL experiments on two public BUS datasets demonstrate that WDFFU-Mamba achieves superior segmentation accuracy, significantly outperforming existing methods in terms of Dice coefficient and 95th percentile Hausdorff Distance (HD95).The combination of wavelet-domain enhancement and attention-based fusion greatly improves both the accuracy and robustness of BUS image segmentation, while maintaining computational this http URL proposed WDFFU-Mamba model not only delivers strong segmentation performance but also exhibits desirable generalization ability across datasets, making it a promising solution for real-world clinical applications in breast tumor ultrasound analysis.
80. 【2512.17263】AnyCXR: Human Anatomy Segmentation of Chest X-ray at Any Acquisition Position using Multi-stage Domain Randomized Synthetic Data with Imperfect Annotations and Conditional Joint Annotation Regularization Learning
链接:https://arxiv.org/abs/2512.17263
作者:Dong Zifei,Wu Wenjie,Hao Jinkui,Chen Tianqi,Weng Ziqiao,Zhou Bo
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:remains challenging due, chest X-rays, Robust anatomical segmentation, Multi-stage Domain Randomization, remains challenging
备注: 20 pages, 12 figures, Preprint (under review at Medical Image Analysis)
点击查看摘要
Abstract:Robust anatomical segmentation of chest X-rays (CXRs) remains challenging due to the scarcity of comprehensive annotations and the substantial variability of real-world acquisition conditions. We propose AnyCXR, a unified framework that enables generalizable multi-organ segmentation across arbitrary CXR projection angles using only synthetic supervision. The method combines a Multi-stage Domain Randomization (MSDR) engine, which generates over 100,000 anatomically faithful and highly diverse synthetic radiographs from 3D CT volumes, with a Conditional Joint Annotation Regularization (CAR) learning strategy that leverages partial and imperfect labels by enforcing anatomical consistency in a latent space. Trained entirely on synthetic data, AnyCXR achieves strong zero-shot generalization on multiple real-world datasets, providing accurate delineation of 54 anatomical structures in PA, lateral, and oblique views. The resulting segmentation maps support downstream clinical tasks, including automated cardiothoracic ratio estimation, spine curvature assessment, and disease classification, where the incorporation of anatomical priors improves diagnostic performance. These results demonstrate that AnyCXR establishes a scalable and reliable foundation for anatomy-aware CXR analysis and offers a practical pathway toward reducing annotation burdens while improving robustness across diverse imaging conditions.
81. 【2512.17253】Mitty: Diffusion-based Human-to-Robot Video Generation
链接:https://arxiv.org/abs/2512.17253
作者:Yiren Song,Cheng Liu,Weijia Mao,Mike Zheng Shou
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:key milestone, generalizable robot learning, Learning directly, Learning, Mitty
备注:
点击查看摘要
Abstract:Learning directly from human demonstration videos is a key milestone toward scalable and generalizable robot learning. Yet existing methods rely on intermediate representations such as keypoints or trajectories, introducing information loss and cumulative errors that harm temporal and visual consistency. We present Mitty, a Diffusion Transformer that enables video In-Context Learning for end-to-end Human2Robot video generation. Built on a pretrained video diffusion model, Mitty leverages strong visual-temporal priors to translate human demonstrations into robot-execution videos without action labels or intermediate abstractions. Demonstration videos are compressed into condition tokens and fused with robot denoising tokens through bidirectional attention during diffusion. To mitigate paired-data scarcity, we also develop an automatic synthesis pipeline that produces high-quality human-robot pairs from large egocentric datasets. Experiments on Human2Robot and EPIC-Kitchens show that Mitty delivers state-of-the-art results, strong generalization to unseen environments, and new insights for scalable robot learning from human observations.
82. 【2512.17229】Video Detective: Seek Critical Clues Recurrently to Answer Question from Long Videos
链接:https://arxiv.org/abs/2512.17229
作者:Henghui Du,Chang Zhou,Chunjie Zhang,Xi Chen,Di Hu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multi-modal Large Language, Large Language Models, Multi-modal Large, Large Language, challenge for Multi-modal
备注:
点击查看摘要
Abstract:Long Video Question-Answering (LVQA) presents a significant challenge for Multi-modal Large Language Models (MLLMs) due to immense context and overloaded information, which could also lead to prohibitive memory consumption. While existing methods attempt to address these issues by reducing visual tokens or extending model's context length, they may miss useful information or take considerable computation. In fact, when answering given questions, only a small amount of crucial information is required. Therefore, we propose an efficient question-aware memory mechanism, enabling MLLMs to recurrently seek these critical clues. Our approach, named VideoDetective, simplifies this task by iteratively processing video sub-segments. For each sub-segment, a question-aware compression strategy is employed by introducing a few special memory tokens to achieve purposefully compression. This allows models to effectively seek critical clues while reducing visual tokens. Then, due to history context could have a significant impact, we recurrently aggregate and store these memory tokens to update history context, which would be reused for subsequent sub-segments. Furthermore, to more effectively measure model's long video understanding ability, we introduce GLVC (Grounding Long Video Clues), a long video question-answering dataset, which features grounding critical and concrete clues scattered throughout entire videos. Experimental results demonstrate our method enables MLLMs with limited context length of 32K to efficiently process 100K tokens (3600 frames, an hour-long video sampled at 1fps), requiring only 2 minutes and 37GB GPU memory usage. Evaluation results across multiple long video benchmarks illustrate our method can more effectively seek critical clues from massive information.
83. 【2512.17227】Learning When to Look: A Disentangled Curriculum for Strategic Perception in Multimodal Reasoning
链接:https://arxiv.org/abs/2512.17227
作者:Siqi Yang,Zilve Gao,Haibo Qiu,Fanfan Liu,Peng Shi,Zhixiong Zeng,Qingmin Liao,Lin Ma
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, demonstrate significant potential
备注:
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) demonstrate significant potential but remain brittle in complex, long-chain visual reasoning tasks. A critical failure mode is "visual forgetting", where models progressively lose visual grounding as reasoning extends, a phenomenon aptly described as "think longer, see less". We posit this failure stems from current training paradigms prematurely entangling two distinct cognitive skills: (1) abstract logical reasoning "how-to-think") and (2) strategic visual perception ("when-to-look"). This creates a foundational cold-start deficiency -- weakening abstract reasoning -- and a strategic perception deficit, as models lack a policy for when to perceive. In this paper, we propose a novel curriculum-based framework to disentangle these skills. First, we introduce a disentangled Supervised Fine-Tuning (SFT) curriculum that builds a robust abstract reasoning backbone on text-only data before anchoring it to vision with a novel Perception-Grounded Chain-of-Thought (PG-CoT) paradigm. Second, we resolve the strategic perception deficit by formulating timing as a reinforcement learning problem. We design a Pivotal Perception Reward that teaches the model when to look by coupling perceptual actions to linguistic markers of cognitive uncertainty (e.g., "wait", "verify"), thereby learning an autonomous grounding policy. Our contributions include the formalization of these two deficiencies and the development of a principled, two-stage framework to address them, transforming the model from a heuristic-driven observer to a strategic, grounded reasoner. \textbf{Code}: \url{this https URL}.
84. 【2512.17226】Robust Scene Coordinate Regression via Geometrically-Consistent Global Descriptors
链接:https://arxiv.org/abs/2512.17226
作者:Son Tung Nguyen,Tobias Fischer,Alejandro Fontan,Michael Milford
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent learning-based visual, noisy geometric constraints, Recent learning-based, disambiguate visually similar, covisibility graphs
备注: WACV 2026 conference paper
点击查看摘要
Abstract:Recent learning-based visual localization methods use global descriptors to disambiguate visually similar places, but existing approaches often derive these descriptors from geometric cues alone (e.g., covisibility graphs), limiting their discriminative power and reducing robustness in the presence of noisy geometric constraints. We propose an aggregator module that learns global descriptors consistent with both geometrical structure and visual similarity, ensuring that images are close in descriptor space only when they are visually similar and spatially connected. This corrects erroneous associations caused by unreliable overlap scores. Using a batch-mining strategy based solely on the overlap scores and a modified contrastive loss, our method trains without manual place labels and generalizes across diverse environments. Experiments on challenging benchmarks show substantial localization gains in large-scale environments while preserving computational and memory efficiency. Code is available at \href{this https URL\_scr}{this http URL\_scr}.
85. 【2512.17224】Any-Optical-Model: A Universal Foundation Model for Optical Remote Sensing
链接:https://arxiv.org/abs/2512.17224
作者:Xuyang Li,Chenyu Li,Danfeng Hong
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:ground sampling distances, supply indispensable evidence, Remote Sensing Foundation, Sensing Foundation Models, diverse band layouts
备注: Accepted by AAAI2026
点击查看摘要
Abstract:Optical satellites, with their diverse band layouts and ground sampling distances, supply indispensable evidence for tasks ranging from ecosystem surveillance to emergency response. However, significant discrepancies in band composition and spatial resolution across different optical sensors present major challenges for existing Remote Sensing Foundation Models (RSFMs). These models are typically pretrained on fixed band configurations and resolutions, making them vulnerable to real world scenarios involving missing bands, cross sensor fusion, and unseen spatial scales, thereby limiting their generalization and practical deployment. To address these limitations, we propose Any Optical Model (AOM), a universal RSFM explicitly designed to accommodate arbitrary band compositions, sensor types, and resolution scales. To preserve distinctive spectral characteristics even when bands are missing or newly introduced, AOM introduces a spectrum-independent tokenizer that assigns each channel a dedicated band embedding, enabling explicit encoding of spectral identity. To effectively capture texture and contextual patterns from sub-meter to hundred-meter imagery, we design a multi-scale adaptive patch embedding mechanism that dynamically modulates the receptive field. Furthermore, to maintain global semantic consistency across varying resolutions, AOM incorporates a multi-scale semantic alignment mechanism alongside a channel-wise self-supervised masking and reconstruction pretraining strategy that jointly models spectral-spatial relationships. Extensive experiments on over 10 public datasets, including those from Sentinel-2, Landsat, and HLS, demonstrate that AOM consistently achieves state-of-the-art (SOTA) performance under challenging conditions such as band missing, cross sensor, and cross resolution settings.
86. 【2512.17221】DAVE: A VLM Vision Encoder for Document Understanding and Web Agents
链接:https://arxiv.org/abs/2512.17221
作者:Brandon Huang,Hang Hua,Zhuoran Yu,Trevor Darrell,Rogerio Feris,Roei Herzig
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:demonstrated remarkable performance, spatial information essential, low-level features lack, Vision-language models, vision encoders presents
备注:
点击查看摘要
Abstract:While Vision-language models (VLMs) have demonstrated remarkable performance across multi-modal tasks, their choice of vision encoders presents a fundamental weakness: their low-level features lack the robust structural and spatial information essential for document understanding and web agents. To bridge this gap, we introduce DAVE, a vision encoder purpose-built for VLMs and tailored for these tasks. Our training pipeline is designed to leverage abundant unlabeled data to bypass the need for costly large-scale annotations for document and web images. We begin with a self-supervised pretraining stage on unlabeled images, followed by a supervised autoregressive pretraining stage, where the model learns tasks like parsing and localization from limited, high-quality data. Within the supervised stage, we adopt two strategies to improve our encoder's alignment with both general visual knowledge and diverse document and web agentic tasks: (i) We introduce a novel model-merging scheme, combining encoders trained with different text decoders to ensure broad compatibility with different web agentic architectures. (ii) We use ensemble training to fuse features from pretrained generalist encoders (e.g., SigLIP2) with our own document and web-specific representations. Extensive experiments on classic document tasks, VQAs, web localization, and agent-based benchmarks validate the effectiveness of our approach, establishing DAVE as a strong vision encoder for document and web applications.
87. 【2512.17213】CheXPO-v2: Preference Optimization for Chest X-ray VLMs with Knowledge Graph Consistency
链接:https://arxiv.org/abs/2512.17213
作者:Xiao Liang,Yuxuan An,Di Wang,Jiawei Hu,Zhicheng Jiao,Bin Jing,Quan Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Medical Vision-Language Models, compromising clinical reliability, Relative Policy Optimization, Medical Vision-Language, Group Relative Policy
备注:
点击查看摘要
Abstract:Medical Vision-Language Models (VLMs) are prone to hallucinations, compromising clinical reliability. While reinforcement learning methods like Group Relative Policy Optimization (GRPO) offer a low-cost alignment solution, their reliance on sparse, outcome-based rewards inadvertently encourages models to "overthink" -- generating verbose, convoluted, and unverifiable Chain-of-Thought reasoning to justify answers. This focus on outcomes obscures factual errors and poses significant safety risks. To address this, we propose CheXPO-v2, a novel alignment framework that shifts from outcome to process supervision. Our core innovation is a Knowledge Graph Consistency Reward mechanism driven by Entity-Relation Matching. By explicitly parsing reasoning steps into structured "Disease, Relation, Anatomy" triplets, we provide fine-grained supervision that penalizes incoherent logic and hallucinations at the atomic level. Integrating this with a hard-example mining strategy, our approach significantly outperforms GRPO and state-of-the-art models on benchmarks like MIMIC-CXR-VQA. Crucially, CheXPO-v2 achieves new state-of-the-art accuracy using only 5k samples, demonstrating exceptional data efficiency while producing clinically sound and verifiable reasoning. The project source code is publicly available at: this https URL.
88. 【2512.17206】Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs
链接:https://arxiv.org/abs/2512.17206
作者:Rujiao Long,Yang Li,Xingyao Zhang,Weixun Wang,Tianqianjin Lin,Xi Zhao,Yuchi Xu,Wenbo Su,Junchi Yan,Bo Zheng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:redundant reasoning paths, yields redundant reasoning, training for large, high-level diversity, yields redundant
备注:
点击查看摘要
Abstract:Exploration capacity shapes both inference-time performance and reinforcement learning (RL) training for large (vision-) language models, as stochastic sampling often yields redundant reasoning paths with little high-level diversity. This paper proposes Reasoning Palette, a novel latent-modulation framework that endows the model with a stochastic latent variable for strategic contextualization, guiding its internal planning prior to token generation. This latent context is inferred from the mean-pooled embedding of a question-answer pair via a variational autoencoder (VAE), where each sampled latent potentially encodes a distinct reasoning context. During inference, a sampled latent is decoded into learnable token prefixes and prepended to the input prompt, modulating the model's internal reasoning trajectory. In this way, the model performs internal sampling over reasoning strategies prior to output generation, which shapes the style and structure of the entire response sequence. A brief supervised fine-tuning (SFT) warm-up phase allows the model to adapt to this latent conditioning. Within RL optimization, Reasoning Palette facilitates structured exploration by enabling on-demand injection for diverse reasoning modes, significantly enhancing exploration efficiency and sustained learning capability. Experiments across multiple reasoning benchmarks demonstrate that our method enables interpretable and controllable control over the (vision-) language model's strategic behavior, thereby achieving consistent performance gains over standard RL methods.
89. 【2512.17202】Fose: Fusion of One-Step Diffusion and End-to-End Network for Pansharpening
链接:https://arxiv.org/abs/2512.17202
作者:Kai Liu,Zeli Lin,Weibo Wang,Linghe Kong,Yulun Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:high-resolution multispectral images, low-resolution multispectral images, high-resolution panchromatic images, image fusion task, obtain high-resolution multispectral
备注: Code link: [this https URL](https://github.com/Kai-Liu001/Fose)
点击查看摘要
Abstract:Pansharpening is a significant image fusion task that fuses low-resolution multispectral images (LRMSI) and high-resolution panchromatic images (PAN) to obtain high-resolution multispectral images (HRMSI). The development of the diffusion models (DM) and the end-to-end models (E2E model) has greatly improved the frontier of pansharping. DM takes the multi-step diffusion to obtain an accurate estimation of the residual between LRMSI and HRMSI. However, the multi-step process takes large computational power and is time-consuming. As for E2E models, their performance is still limited by the lack of prior and simple structure. In this paper, we propose a novel four-stage training strategy to obtain a lightweight network Fose, which fuses one-step DM and an E2E model. We perform one-step distillation on an enhanced SOTA DM for pansharping to compress the inference process from 50 steps to only 1 step. Then we fuse the E2E model with one-step DM with lightweight ensemble blocks. Comprehensive experiments are conducted to demonstrate the significant improvement of the proposed Fose on three commonly used benchmarks. Moreover, we achieve a 7.42 speedup ratio compared to the baseline DM while achieving much better performance. The code and model are released at this https URL.
90. 【2512.17189】Anatomical Region-Guided Contrastive Decoding: A Plug-and-Play Strategy for Mitigating Hallucinations in Medical VLMs
链接:https://arxiv.org/abs/2512.17189
作者:Xiao Liang,Chenxi Liu,Zhi Ma,Di Wang,Bin Jing,Quan Wang,Yuanyuan Shi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Medical Vision-Language Models, show immense promise, Medical Vision-Language, show immense, immense promise
备注:
点击查看摘要
Abstract:Medical Vision-Language Models (MedVLMs) show immense promise in clinical applicability. However, their reliability is hindered by hallucinations, where models often fail to derive answers from visual evidence, instead relying on learned textual priors. Existing mitigation strategies for MedVLMs have distinct limitations: training-based methods rely on costly expert annotations, limiting scalability, while training-free interventions like contrastive decoding, though data-efficient, apply a global, untargeted correction whose effects in complex real-world clinical settings can be unreliable. To address these challenges, we introduce Anatomical Region-Guided Contrastive Decoding (ARCD), a plug-and-play strategy that mitigates hallucinations by providing targeted, region-specific guidance. Our module leverages an anatomical mask to direct a three-tiered contrastive decoding process. By dynamically re-weighting at the token, attention, and logits levels, it verifiably steers the model's focus onto specified regions, reinforcing anatomical understanding and suppressing factually incorrect outputs. Extensive experiments across diverse datasets, including chest X-ray, CT, brain MRI, and ocular ultrasound, demonstrate our method's effectiveness in improving regional understanding, reducing hallucinations, and enhancing overall diagnostic accuracy.
91. 【2512.17188】Globally Optimal Solution to the Generalized Relative Pose Estimation Problem using Affine Correspondences
链接:https://arxiv.org/abs/2512.17188
作者:Zhenbao Yu,Banglei Guan,Shunkun Liang,Zibin Liu,Yang Shang,Qifeng Yu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Mobile devices equipped, inertial measurement unit, Mobile devices, relative pose estimation, relative pose
备注:
点击查看摘要
Abstract:Mobile devices equipped with a multi-camera system and an inertial measurement unit (IMU) are widely used nowadays, such as self-driving cars. The task of relative pose estimation using visual and inertial information has important applications in various fields. To improve the accuracy of relative pose estimation of multi-camera systems, we propose a globally optimal solver using affine correspondences to estimate the generalized relative pose with a known vertical direction. First, a cost function about the relative rotation angle is established after decoupling the rotation matrix and translation vector, which minimizes the algebraic error of geometric constraints from affine correspondences. Then, the global optimization problem is converted into two polynomials with two unknowns based on the characteristic equation and its first derivative is zero. Finally, the relative rotation angle can be solved using the polynomial eigenvalue solver, and the translation vector can be obtained from the eigenvector. Besides, a new linear solution is proposed when the relative rotation is small. The proposed solver is evaluated on synthetic data and real-world datasets. The experiment results demonstrate that our method outperforms comparable state-of-the-art methods in accuracy.
92. 【2512.17186】It is not always greener on the other side: Greenery perception across demographics and personalities in multiple cities
链接:https://arxiv.org/abs/2512.17186
作者:Matias Quintana,Fangqi Liu,Jussi Torkko,Youlong Gu,Xiucheng Liang,Yujun Hou,Koichi Ito,Yihan Zhu,Mahmoud Abdelrahman,Tuuli Toivonen,Yi Lu,Filip Biljecki
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Quantifying and assessing, assessing urban greenery, Green View Index, planning and development, reflecting the everlasting
备注:
点击查看摘要
Abstract:Quantifying and assessing urban greenery is consequential for planning and development, reflecting the everlasting importance of green spaces for multiple climate and well-being dimensions of cities. Evaluation can be broadly grouped into objective (e.g., measuring the amount of greenery) and subjective (e.g., polling the perception of people) approaches, which may differ -- what people see and feel about how green a place is might not match the measurements of the actual amount of vegetation. In this work, we advance the state of the art by measuring such differences and explaining them through human, geographic, and spatial dimensions. The experiments rely on contextual information extracted from street view imagery and a comprehensive urban visual perception survey collected from 1,000 people across five countries with their extensive demographic and personality information. We analyze the discrepancies between objective measures (e.g., Green View Index (GVI)) and subjective scores (e.g., pairwise ratings), examining whether they can be explained by a variety of human and visual factors such as age group and spatial variation of greenery in the scene. The findings reveal that such discrepancies are comparable around the world and that demographics and personality do not play a significant role in perception. Further, while perceived and measured greenery correlate consistently across geographies (both where people and where imagery are from), where people live plays a significant role in explaining perceptual differences, with these two, as the top among seven, features that influences perceived greenery the most. This location influence suggests that cultural, environmental, and experiential factors substantially shape how individuals observe greenery in cities.
93. 【2512.17178】ABE-CLIP: Training-Free Attribute Binding Enhancement for Compositional Image-Text Matching
链接:https://arxiv.org/abs/2512.17178
作者:Qi Zhang,Yuxu Chen,Lei Deng,Lili Shen
类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
关键词:Contrastive Language-Image Pretraining, Contrastive Language-Image, Language-Image Pretraining, multimodal tasks, achieved remarkable performance
备注: 10 pages, 8 figures
点击查看摘要
Abstract:Contrastive Language-Image Pretraining (CLIP) has achieved remarkable performance in various multimodal tasks. However, it still struggles with compositional image-text matching, particularly in accurately associating objects with their corresponding attributes, because its inherent global representation often overlooks fine-grained semantics for attribute binding. Existing methods often require additional training or extensive hard negative sampling, yet they frequently show limited generalization to novel compositional concepts and fail to fundamentally address the drawbacks of global representations. In this paper, we propose ABE-CLIP, a novel training-free Attribute Binding Enhancement method designed to strengthen attribute-object binding in CLIP-like models. Specifically, we employ a Semantic Refinement Mechanism to refine token embeddings for both object and attribute phrases in the text, thereby mitigating attribute confusion and improving semantic precision. We further introduce a Local Token-Patch Alignment strategy that computes similarity scores between refined textual tokens and their most relevant image patches. By aggregating localized similarity scores, ABE-CLIP computes the final image-text similarity. Experiments on multiple datasets demonstrate that ABE-CLIP significantly improves attribute-object binding performance, even surpassing methods that require extensive training.
94. 【2512.17160】Can Synthetic Images Serve as Effective and Efficient Class Prototypes?
链接:https://arxiv.org/abs/2512.17160
作者:Dianxing Shi,Dingjie Fu,Yuqiao Liu,Jun Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Contrastive Language-Image Pre-training, shown strong performance, shown strong, Contrastive Language-Image, Language-Image Pre-training
备注:
点击查看摘要
Abstract:Vision-Language Models (VLMs) have shown strong performance in zero-shot image classification tasks. However, existing methods, including Contrastive Language-Image Pre-training (CLIP), all rely on annotated text-to-image pairs for aligning visual and textual modalities. This dependency introduces substantial cost and accuracy requirement in preparing high-quality datasets. At the same time, processing data from two modes also requires dual-tower encoders for most models, which also hinders their lightweight. To address these limitations, we introduce a ``Contrastive Language-Image Pre-training via Large-Language-Model-based Generation (LGCLIP)" framework. LGCLIP leverages a Large Language Model (LLM) to generate class-specific prompts that guide a diffusion model in synthesizing reference images. Afterwards these generated images serve as visual prototypes, and the visual features of real images are extracted and compared with the visual features of these prototypes to achieve comparative prediction. By optimizing prompt generation through the LLM and employing only a visual encoder, LGCLIP remains lightweight and efficient. Crucially, our framework requires only class labels as input during whole experimental procedure, eliminating the need for manually annotated image-text pairs and extra pre-processing. Experimental results validate the feasibility and efficiency of LGCLIP, demonstrating great performance in zero-shot classification tasks and establishing a novel paradigm for classification.
95. 【2512.17152】PhysFire-WM: A Physics-Informed World Model for Emulating Fire Spread Dynamics
链接:https://arxiv.org/abs/2512.17152
作者:Nan Zhou,Huandong Wang,Jiahao Li,Yang Li,Xiao-Ping Zhang,Yong Li,Xinlei Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:emergency response, plays a crucial, crucial role, role in emergency, fire
备注:
点击查看摘要
Abstract:Fine-grained fire prediction plays a crucial role in emergency response. Infrared images and fire masks provide complementary thermal and boundary information, yet current methods are predominantly limited to binary mask modeling with inherent signal sparsity, failing to capture the complex dynamics of fire. While world models show promise in video generation, their physical inconsistencies pose significant challenges for fire forecasting. This paper introduces PhysFire-WM, a Physics-informed World Model for emulating Fire spread dynamics. Our approach internalizes combustion dynamics by encoding structured priors from a Physical Simulator to rectify physical discrepancies, coupled with a Cross-task Collaborative Training strategy (CC-Train) that alleviates the issue of limited information in mask-based modeling. Through parameter sharing and gradient coordination, CC-Train effectively integrates thermal radiation dynamics and spatial boundary delineation, enhancing both physical realism and geometric accuracy. Extensive experiments on a fine-grained multimodal fire dataset demonstrate the superior accuracy of PhysFire-WM in fire spread prediction. Validation underscores the importance of physical priors and cross-task collaboration, providing new insights for applying physics-informed world models to disaster prediction.
96. 【2512.17151】xt-Conditioned Background Generation for Editable Multi-Layer Documents
链接:https://arxiv.org/abs/2512.17151
作者:Taewon Kang,Joseph K J,Chris Tensmeyer,Jihyung Kil,Wanrong Zhu,Ming C. Lin,Vlad I. Morariu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Automated Readability Optimization, document-centric background generation, text regions, thematic continuity, regions remain readable
备注:
点击查看摘要
Abstract:We present a framework for document-centric background generation with multi-page editing and thematic continuity. To ensure text regions remain readable, we employ a \emph{latent masking} formulation that softly attenuates updates in the diffusion space, inspired by smooth barrier functions in physics and numerical optimization. In addition, we introduce \emph{Automated Readability Optimization (ARO)}, which automatically places semi-transparent, rounded backing shapes behind text regions. ARO determines the minimal opacity needed to satisfy perceptual contrast standards (WCAG 2.2) relative to the underlying background, ensuring readability while maintaining aesthetic harmony without human intervention. Multi-page consistency is maintained through a summarization-and-instruction process, where each page is distilled into a compact representation that recursively guides subsequent generations. This design reflects how humans build continuity by retaining prior context, ensuring that visual motifs evolve coherently across an entire document. Our method further treats a document as a structured composition in which text, figures, and backgrounds are preserved or regenerated as separate layers, allowing targeted background editing without compromising readability. Finally, user-provided prompts allow stylistic adjustments in color and texture, balancing automated consistency with flexible customization. Our training-free framework produces visually coherent, text-preserving, and thematically aligned documents, bridging generative modeling with natural design workflows.
97. 【2512.17143】Pro-Pose: Unpaired Full-Body Portrait Synthesis via Canonical UV Maps
链接:https://arxiv.org/abs/2512.17143
作者:Sandeep Mishra,Yasamin Jafarian,Andreas Lugmayr,Yingwei Li,Varsha Ramakrishnan,Srivatsan Varadharajan,Alan C. Bovik,Ira Kemelmacher-Shlizerman
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:photographers typically present, flattering quality, typically present, professional photographers typically, beautiful lighting
备注:
点击查看摘要
Abstract:Photographs of people taken by professional photographers typically present the person in beautiful lighting, with an interesting pose, and flattering quality. This is unlike common photos people can take of themselves. In this paper, we explore how to create a ``professional'' version of a person's photograph, i.e., in a chosen pose, in a simple environment, with good lighting, and standard black top/bottom clothing. A key challenge is to preserve the person's unique identity, face and body features while transforming the photo. If there would exist a large paired dataset of the same person photographed both ``in the wild'' and by a professional photographer, the problem would potentially be easier to solve. However, such data does not exist, especially for a large variety of identities. To that end, we propose two key insights: 1) Our method transforms the input photo and person's face to a canonical UV space, which is further coupled with reposing methodology to model occlusions and novel view synthesis. Operating in UV space allows us to leverage existing unpaired datasets. 2) We personalize the output photo via multi image finetuning. Our approach yields high-quality, reposed portraits and achieves strong qualitative and quantitative performance on real-world imagery.
98. 【2512.17137】SDUM: A Scalable Deep Unrolled Model for Universal MRI Reconstruction
链接:https://arxiv.org/abs/2512.17137
作者:Puyang Wang,Pengfei Guo,Keyi Chai,Jinyuan Zhou,Daguang Xu,Shanshan Jiang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Clinical MRI encompasses, spanning anatomical targets, encompasses diverse imaging, current deep learning, diverse imaging protocols
备注:
点击查看摘要
Abstract:Clinical MRI encompasses diverse imaging protocols--spanning anatomical targets (cardiac, brain, knee), contrasts (T1, T2, mapping), sampling patterns (Cartesian, radial, spiral, kt-space), and acceleration factors--yet current deep learning reconstructions are typically protocol-specific, hindering generalization and deployment. We introduce Scalable Deep Unrolled Model (SDUM), a universal framework combining a Restormer-based reconstructor, a learned coil sensitivity map estimator (CSME), sampling-aware weighted data consistency (SWDC), universal conditioning (UC) on cascade index and protocol metadata, and progressive cascade expansion training. SDUM exhibits foundation-model-like scaling behavior: reconstruction quality follows PSNR ${\sim}$ log(parameters) with correlation $r{=}0.986$ ($R^2{=}0.973$) up to 18 cascades, demonstrating predictable performance gains with model depth. A single SDUM trained on heterogeneous data achieves state-of-the-art results across all four CMRxRecon2025 challenge tracks--multi-center, multi-disease, 5T, and pediatric--without task-specific fine-tuning, surpassing specialized baselines by up to ${+}1.0$~dB. On CMRxRecon2024, SDUM outperforms the winning method PromptMR+ by ${+}0.55$~dB; on fastMRI brain, it exceeds PC-RNN by ${+}1.8$~dB. Ablations validate each component: SWDC ${+}0.43$~dB over standard DC, per-cascade CSME ${+}0.51$~dB, UC ${+}0.38$~dB. These results establish SDUM as a practical path toward universal, scalable MRI reconstruction.
99. 【2512.17098】Predictive Modeling of Maritime Radar Data Using Transformer Architecture
链接:https://arxiv.org/abs/2512.17098
作者:Bjorna Qesaraku,Jan Steckel
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:autonomous systems require, systems require robust, anticipate vessel motion, Maritime autonomous systems, robust predictive capabilities
备注: 9 pages, 2 figures, 1 table
点击查看摘要
Abstract:Maritime autonomous systems require robust predictive capabilities to anticipate vessel motion and environmental dynamics. While transformer architectures have revolutionized AIS-based trajectory prediction and demonstrated feasibility for sonar frame forecasting, their application to maritime radar frame prediction remains unexplored, creating a critical gap given radar's all-weather reliability for navigation. This survey systematically reviews predictive modeling approaches relevant to maritime radar, with emphasis on transformer architectures for spatiotemporal sequence forecasting, where existing representative methods are analyzed according to data type, architecture, and prediction horizon. Our review shows that, while the literature has demonstrated transformer-based frame prediction for sonar sensing, no prior work addresses transformer-based maritime radar frame prediction, thereby defining a clear research gap and motivating a concrete research direction for future work in this area.
100. 【2512.17094】DGH: Dynamic Gaussian Hair
链接:https://arxiv.org/abs/2512.17094
作者:Junying Wang,Yuanlu Xu,Edith Tretschk,Ziyan Wang,Anastasia Ianina,Aljaz Bozic,Ulrich Neumann,Tony Tung
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:digital human modeling, light scattering, creation of photorealistic, remains a major, major challenge
备注: Accepted by NeurIPS 2025. Project page: [this https URL](https://junyingw.github.io/paper/dgh)
点击查看摘要
Abstract:The creation of photorealistic dynamic hair remains a major challenge in digital human modeling because of the complex motions, occlusions, and light scattering. Existing methods often resort to static capture and physics-based models that do not scale as they require manual parameter fine-tuning to handle the diversity of hairstyles and motions, and heavy computation to obtain high-quality appearance. In this paper, we present Dynamic Gaussian Hair (DGH), a novel framework that efficiently learns hair dynamics and appearance. We propose: (1) a coarse-to-fine model that learns temporally coherent hair motion dynamics across diverse hairstyles; (2) a strand-guided optimization module that learns a dynamic 3D Gaussian representation for hair appearance with support for differentiable rendering, enabling gradient-based learning of view-consistent appearance under motion. Unlike prior simulation-based pipelines, our approach is fully data-driven, scales with training data, and generalizes across various hairstyles and head motion sequences. Additionally, DGH can be seamlessly integrated into a 3D Gaussian avatar framework, enabling realistic, animatable hair for high-fidelity avatar representation. DGH achieves promising geometry and appearance results, providing a scalable, data-driven alternative to physics-based simulation and rendering.
101. 【2512.17080】Interpretable Similarity of Synthetic Image Utility
链接:https://arxiv.org/abs/2512.17080
作者:Panagiota Gatoula,George Dimas,Dimitris K. Iakovidis
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:clinical decision support, based clinical decision, deep learning, decision support, large scale
备注: Submitted for journal publication
点击查看摘要
Abstract:Synthetic medical image data can unlock the potential of deep learning (DL)-based clinical decision support (CDS) systems through the creation of large scale, privacy-preserving, training sets. Despite the significant progress in this field, there is still a largely unanswered research question: "How can we quantitatively assess the similarity of a synthetically generated set of images with a set of real images in a given application domain?". Today, answers to this question are mainly provided via user evaluation studies, inception-based measures, and the classification performance achieved on synthetic images. This paper proposes a novel measure to assess the similarity between synthetically generated and real sets of images, in terms of their utility for the development of DL-based CDS systems. Inspired by generalized neural additive models, and unlike inception-based measures, the proposed measure is interpretable (Interpretable Utility Similarity, IUS), explaining why a synthetic dataset could be more useful than another one in the context of a CDS system based on clinically relevant image features. The experimental results on publicly available datasets from various color medical imaging modalities including endoscopic, dermoscopic and fundus imaging, indicate that selecting synthetic images of high utility similarity using IUS can result in relative improvements of up to 54.6% in terms of classification performance. The generality of IUS for synthetic data assessment is demonstrated also for greyscale X-ray and ultrasound imaging modalities. IUS implementation is available at this https URL
102. 【2512.17040】Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation
链接:https://arxiv.org/abs/2512.17040
作者:Min-Jung Kim,Jeongho Kim,Hoiyeong Jin,Junha Hyung,Jaegul Choo
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:spurred growing interest, Recent progress, cinematic camera control, camera control capabilities, camera-controlled novel-view video
备注:
点击查看摘要
Abstract:Recent progress in video diffusion models has spurred growing interest in camera-controlled novel-view video generation for dynamic scenes, aiming to provide creators with cinematic camera control capabilities in post-production. A key challenge in camera-controlled video generation is ensuring fidelity to the specified camera pose, while maintaining view consistency and reasoning about occluded geometry from limited observations. To address this, existing methods either train trajectory-conditioned video generation model on trajectory-video pair dataset, or estimate depth from the input video to reproject it along a target trajectory and generate the unprojected regions. Nevertheless, existing methods struggle to generate camera-pose-faithful, high-quality videos for two main reasons: (1) reprojection-based approaches are highly susceptible to errors caused by inaccurate depth estimation; and (2) the limited diversity of camera trajectories in existing datasets restricts learned models. To address these limitations, we present InfCam, a depth-free, camera-controlled video-to-video generation framework with high pose fidelity. The framework integrates two key components: (1) infinite homography warping, which encodes 3D camera rotations directly within the 2D latent space of a video diffusion model. Conditioning on this noise-free rotational information, the residual parallax term is predicted through end-to-end training to achieve high camera-pose fidelity; and (2) a data augmentation pipeline that transforms existing synthetic multiview datasets into sequences with diverse trajectories and focal lengths. Experimental results demonstrate that InfCam outperforms baseline methods in camera-pose accuracy and visual fidelity, generalizing well from synthetic to real-world data. Link to our project page:this https URL
103. 【2512.17021】FORMSpoT: A Decade of Tree-Level, Country-Scale Forest Monitoring
链接:https://arxiv.org/abs/2512.17021
作者:Martin Schwartz,Fajwel Fogel,Nikola Besic,Damien Robert,Louis Geist,Jean-Pierre Renaud,Jean-Matthieu Monnet,Clemens Mosig,Cédric Vega,Alexandre d'Aspremont,Loic Landrieu,Philippe Ciais
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:frequently updated forest, European forest carbon, carbon sink highlights, European forest, Delta
备注:
点击查看摘要
Abstract:The recent decline of the European forest carbon sink highlights the need for spatially explicit and frequently updated forest monitoring tools. Yet, existing satellite-based disturbance products remain too coarse to detect changes at the scale of individual trees, typically below 100 m$^{2}$. Here, we introduce FORMSpoT (Forest Mapping with SPOT Time series), a decade-long (2014-2024) nationwide mapping of forest canopy height at 1.5 m resolution, together with annual disturbance polygons (FORMSpoT-$\Delta$) covering mainland France. Canopy heights were derived from annual SPOT-6/7 composites using a hierarchical transformer model (PVTv2) trained on high-resolution airborne laser scanning (ALS) data. To enable robust change detection across heterogeneous acquisitions, we developed a dedicated post-processing pipeline combining co-registration and spatio-temporal total variation denoising. Validation against ALS revisits across 19 sites and 5,087 National Forest Inventory plots shows that FORMSpoT-$\Delta$ substantially outperforms existing disturbance products. In mountainous forests, where disturbances are small and spatially fragmented, FORMSpoT-$\Delta$ achieves an F1-score of 0.44, representing an order of magnitude higher than existing benchmarks. By enabling tree-level monitoring of forest dynamics at national scale, FORMSpoT-$\Delta$ provides a unique tool to analyze management practices, detect early signals of forest decline, and better quantify carbon losses from subtle disturbances such as thinning or selective logging. These results underscore the critical importance of sustaining very high-resolution satellite missions like SPOT and open-data initiatives such as DINAMIS for monitoring forests under climate change.
104. 【2512.17012】4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
链接:https://arxiv.org/abs/2512.17012
作者:Chiao-An Yang,Ryo Hachiuma,Sifei Liu,Subhashree Radhakrishnan,Raymond A. Yeh,Yu-Chiang Frank Wang,Min-Hung Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:dynamics remains limited, Multimodal LLMs, advances in Multimodal, Video Question Answering, constrained by weak
备注:
点击查看摘要
Abstract:Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.
105. 【2512.16978】A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos
链接:https://arxiv.org/abs/2512.16978
作者:Mohammed Irfan Kurpath,Jaseel Muhammad Kaithakkodan,Jinxing Zhou,Sahal Shaji Mullappilly,Mohammad Almansoori,Noor Ahsan,Beknur Kalmakhanbet,Sambal Shikhar,Rishabh Lalla,Jean Lahoud,Mariette Awad,Fahad Shahbaz Khan,Salman Khan,Rao Muhammad Anwer,Hisham Cholakkal
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:requires integrating vision, coherent long-range reasoning, understanding requires integrating, integrating vision, requires integrating
备注:
点击查看摘要
Abstract:Long-form multimodal video understanding requires integrating vision, speech, and ambient audio with coherent long-range reasoning. Existing benchmarks emphasize either temporal length or multimodal richness, but rarely both and while some incorporate open-ended questions and advanced metrics, they mostly rely on single-score accuracy, obscuring failure modes. We introduce LongShOTBench, a diagnostic benchmark with open-ended, intent-driven questions; single- and multi-turn dialogues; and tasks requiring multimodal reasoning and agentic tool use across video, audio, and speech. Each item includes a reference answer and graded rubric for interpretable, and traceable evaluation. LongShOTBench is produced via a scalable, human-validated pipeline to ensure coverage and reproducibility. All samples in our LongShOTBench are human-verified and corrected. Furthermore, we present LongShOTAgent, an agentic system that analyzes long videos via preprocessing, search, and iterative refinement. On LongShOTBench, state-of-the-art MLLMs show large gaps: Gemini-2.5-Flash achieves 52.95%, open-source models remain below 30%, and LongShOTAgent attains 44.66%. These results underscore the difficulty of real-world long-form video understanding. LongShOTBench provides a practical, reproducible foundation for evaluating and improving MLLMs. All resources are available on GitHub: this https URL.
106. 【2512.16977】Endo-SemiS: Towards Robust Semi-Supervised Image Segmentation for Endoscopic Video
链接:https://arxiv.org/abs/2512.16977
作者:Hao Li,Daiwei Lu,Xing Yao,Nicholas Kavoussi,Ipek Oguz
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:endoscopic video frames, semi-supervised segmentation framework, unlabeled data, framework for providing, providing reliable segmentation
备注:
点击查看摘要
Abstract:In this paper, we present Endo-SemiS, a semi-supervised segmentation framework for providing reliable segmentation of endoscopic video frames with limited annotation. EndoSemiS uses 4 strategies to improve performance by effectively utilizing all available data, particularly unlabeled data: (1) Cross-supervision between two individual networks that supervise each other; (2) Uncertainty-guided pseudo-labels from unlabeled data, which are generated by selecting high-confidence regions to improve their quality; (3) Joint pseudolabel supervision, which aggregates reliable pixels from the pseudo-labels of both networks to provide accurate supervision for unlabeled data; and (4) Mutual learning, where both networks learn from each other at the feature and image levels, reducing variance and guiding them toward a consistent solution. Additionally, a separate corrective network that utilizes spatiotemporal information from endoscopy video to improve segmentation performance. Endo-SemiS is evaluated on two clinical applications: kidney stone laser lithotomy from ureteroscopy and polyp screening from colonoscopy. Compared to state-of-the-art segmentation methods, Endo-SemiS substantially achieves superior results on both datasets with limited labeled data. The code is publicly available at this https URL
107. 【2512.16975】InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression
链接:https://arxiv.org/abs/2512.16975
作者:Haotian Ye,Qiyuan He,Jiaqi Han,Puheng Li,Jiaojiao Fan,Zekun Hao,Fitsum Reda,Yogesh Balaji,Huayu Chen,Sheng Liu,Angela Yao,James Zou,Stefano Ermon,Haoxiang Wang,Ming-Yu Liu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:video sequences processing, efficient discrete video, long video sequences, sequences processing, efficient discrete
备注:
点击查看摘要
Abstract:Accurate and efficient discrete video tokenization is essential for long video sequences processing. Yet, the inherent complexity and variable information density of videos present a significant bottleneck for current tokenizers, which rigidly compress all content at a fixed rate, leading to redundancy or information loss. Drawing inspiration from Shannon's information theory, this paper introduces InfoTok, a principled framework for adaptive video tokenization. We rigorously prove that existing data-agnostic training methods are suboptimal in representation length, and present a novel evidence lower bound (ELBO)-based algorithm that approaches theoretical optimality. Leveraging this framework, we develop a transformer-based adaptive compressor that enables adaptive tokenization. Empirical results demonstrate state-of-the-art compression performance, saving 20% tokens without influence on performance, and achieving 2.3x compression rates while still outperforming prior heuristic adaptive approaches. By allocating tokens according to informational richness, InfoTok enables a more compressed yet accurate tokenization for video representation, offering valuable insights for future research.
108. 【2512.16954】Lights, Camera, Consistency: A Multistage Pipeline for Character-Stable AI Video Stories
链接:https://arxiv.org/abs/2512.16954
作者:Chayan Jain,Rishant Sharma,Archit Garg,Ishan Bhanuka,Pratik Narang,Dhruv Kumar
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:cohesive video stories, Generating long, significant challenge, cohesive video, video stories
备注: Under Review
点击查看摘要
Abstract:Generating long, cohesive video stories with consistent characters is a significant challenge for current text-to-video AI. We introduce a method that approaches video generation in a filmmaker-like manner. Instead of creating a video in one step, our proposed pipeline first uses a large language model to generate a detailed production script. This script guides a text-to-image model in creating consistent visuals for each character, which then serve as anchors for a video generation model to synthesize each scene individually. Our baseline comparisons validate the necessity of this multi-stage decomposition; specifically, we observe that removing the visual anchoring mechanism results in a catastrophic drop in character consistency scores (from 7.99 to 0.55), confirming that visual priors are essential for identity preservation. Furthermore, we analyze cultural disparities in current models, revealing distinct biases in subject consistency and dynamic degree between Indian vs Western-themed generations.
109. 【2512.16950】Enhancing Tree Species Classification: Insights from YOLOv8 and Explainable AI Applied to TLS Point Cloud Projections
链接:https://arxiv.org/abs/2512.16950
作者:Adrian Straker,Paul Magdon,Marco Zullich,Maximilian Freudenberg,Christoph Kleinn,Johannes Breidenbach,Stefano Puliti,Nils Nölke
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:core research area, forest remote sensing, Classifying tree species, Class Activation Mapping, Classifying tree
备注: 31 pages, 14 figures, submitted to Forestry: An International Journal of Forest Research
点击查看摘要
Abstract:Classifying tree species has been a core research area in forest remote sensing for decades. New sensors and classification approaches like TLS and deep learning achieve state-of-the art accuracy but their decision processes remain unclear. Methods such as Finer-CAM (Class Activation Mapping) can highlight features in TLS projections that contribute to the classification of a target species, yet are uncommon in similar looking contrastive tree species. We propose a novel method linking Finer-CAM explanations to segments of TLS projections representing structural tree features to systemically evaluate which features drive species discrimination. Using TLS data from 2,445 trees across seven European tree species, we trained and validated five YOLOv8 models with cross-validation, reaching a mean accuracy of 96% (SD = 0.24%). Analysis of 630 saliency maps shows the models primarily rely on crown features in TLS projections for species classification. While this result is pronounced in Silver Birch, European Beech, English oak, and Norway spruce, stem features contribute more frequently to the differentiation of European ash, Scots pine, and Douglas fir. Particularly representations of finer branches contribute to the decisions of the models. The models consider those tree species similar to each other which a human expert would also regard as similar. Furthermore, our results highlight the need for an improved understanding of the decision processes of tree species classification models to help reveal data set and model limitations, biases, and to build confidence in model predictions.
110. 【2512.16948】AVM: Towards Structure-Preserving Neural Response Modeling in the Visual Cortex Across Stimuli and Individuals
链接:https://arxiv.org/abs/2512.16948
作者:Qi Xu,Shuai Gong,Xuming Ran,Haihua Luo,Yangfan Hu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:shown strong performance, stable visual encoding, deep learning models, separate stable visual, simulating neural responses
备注:
点击查看摘要
Abstract:While deep learning models have shown strong performance in simulating neural responses, they often fail to clearly separate stable visual encoding from condition-specific adaptation, which limits their ability to generalize across stimuli and individuals. We introduce the Adaptive Visual Model (AVM), a structure-preserving framework that enables condition-aware adaptation through modular subnetworks, without modifying the core representation. AVM keeps a Vision Transformer-based encoder frozen to capture consistent visual features, while independently trained modulation paths account for neural response variations driven by stimulus content and subject identity. We evaluate AVM in three experimental settings, including stimulus-level variation, cross-subject generalization, and cross-dataset adaptation, all of which involve structured changes in inputs and individuals. Across two large-scale mouse V1 datasets, AVM outperforms the state-of-the-art V1T model by approximately 2% in predictive correlation, demonstrating robust generalization, interpretable condition-wise modulation, and high architectural efficiency. Specifically, AVM achieves a 9.1% improvement in explained variance (FEVE) under the cross-dataset adaptation setting. These results suggest that AVM provides a unified framework for adaptive neural modeling across biological and experimental conditions, offering a scalable solution under structural constraints. Its design may inform future approaches to cortical modeling in both neuroscience and biologically inspired AI systems.
111. 【2512.16947】Comparison of deep learning models: CNN and VGG-16 in identifying pornographic content
链接:https://arxiv.org/abs/2512.16947
作者:Reza Chandra,Adang Suhendra,Lintang Yuniar Banowosari,Prihandoko
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Indonesian government due, Indonesian government, including pornography, government due, pornographic content quickly
备注:
点击查看摘要
Abstract:In 2020, a total of 59,741 websites were blocked by the Indonesian government due to containing negative content, including pornography, with 14,266 websites falling into this category. However, these blocked websites could still be accessed by the public using virtual private networks (VPNs). This prompted the research idea to quickly identify pornographic content. This study aims to develop a system capable of identifying websites suspected of containing pornographic image content, using a deep learning approach with convolutional neural network (CNN) and visual geometry group 16 (VGG-16) model. The two models were then explored comprehensively and holistically to determine which model was most effective in detecting pornographic content quickly. Based on the findings of the comparison between testing the CNN and VGG-16 models, research results showed that the best test results were obtained in the eighth experiment using the CNN model at an epoch value level of 50 and a learning rate of 0.001 of 0.9487 or 94.87%. This can be interpreted that the CNN model is more effective in detecting pornographic content quickly and accurately compared to using the VGG-16 model.
112. 【2512.16925】V-Agent: An Interactive Video Search System Using Vision-Language Models
链接:https://arxiv.org/abs/2512.16925
作者:SunYoung Park,Jong-Hyeon Lee,Youngjune Kim,Daegyu Sung,Younghyun Yu,Young-rok Cha,Jeongho Ju
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
关键词:interactive user-system conversations, multi-agent platform designed, user-system conversations, multi-agent platform, platform designed
备注: CIKM 2025 MMGENSR Workshop
点击查看摘要
Abstract:We introduce V-Agent, a novel multi-agent platform designed for advanced video search and interactive user-system conversations. By fine-tuning a vision-language model (VLM) with a small video preference dataset and enhancing it with a retrieval vector from an image-text retrieval model, we overcome the limitations of traditional text-based retrieval systems in multimodal scenarios. The VLM-based retrieval model independently embeds video frames and audio transcriptions from an automatic speech recognition (ASR) module into a shared multimodal representation space, enabling V-Agent to interpret both visual and spoken content for context-aware video search. This system consists of three agents-a routing agent, a search agent, and a chat agent-that work collaboratively to address user intents by refining search outputs and communicating with users. The search agent utilizes the VLM-based retrieval model together with an additional re-ranking module to further enhance video retrieval quality. Our proposed framework demonstrates state-of-the-art zero-shot performance on the MultiVENT 2.0 benchmark, highlighting its potential for both academic research and real-world applications.
113. 【2512.17774】MedNeXt-v2: Scaling 3D ConvNeXts for Large-Scale Supervised Representation Learning in Medical Image Segmentation
链接:https://arxiv.org/abs/2512.17774
作者:Saikat Roy,Yannick Kirchhoff,Constantin Ulrich,Maximillian Rokuss,Tassilo Wald,Fabian Isensee,Klaus Maier-Hein
类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:medical image segmentation, rapidly reshaping, medical image, Global Response Normalization, image segmentation
备注:
点击查看摘要
Abstract:Large-scale supervised pretraining is rapidly reshaping 3D medical image segmentation. However, existing efforts focus primarily on increasing dataset size and overlook the question of whether the backbone network is an effective representation learner at scale. In this work, we address this gap by revisiting ConvNeXt-based architectures for volumetric segmentation and introducing MedNeXt-v2, a compound-scaled 3D ConvNeXt that leverages improved micro-architecture and data scaling to deliver state-of-the-art performance. First, we show that routinely used backbones in large-scale pretraining pipelines are often suboptimal. Subsequently, we use comprehensive backbone benchmarking prior to scaling and demonstrate that stronger from scratch performance reliably predicts stronger downstream performance after pretraining. Guided by these findings, we incorporate a 3D Global Response Normalization module and use depth, width, and context scaling to improve our architecture for effective representation learning. We pretrain MedNeXt-v2 on 18k CT volumes and demonstrate state-of-the-art performance when fine-tuning across six challenging CT and MR benchmarks (144 structures), showing consistent gains over seven publicly released pretrained models. Beyond improvements, our benchmarking of these models also reveals that stronger backbones yield better results on similar data, representation scaling disproportionately benefits pathological segmentation, and that modality-specific pretraining offers negligible benefit once full finetuning is applied. In conclusion, our results establish MedNeXt-v2 as a strong backbone for large-scale supervised representation learning in 3D Medical Image Segmentation. Our code and pretrained models are made available with the official nnUNet repository at: this https URL
114. 【2512.17759】Breast Cancer Neoadjuvant Chemotherapy Treatment Response Prediction Using Aligned Longitudinal MRI and Clinical Data
链接:https://arxiv.org/abs/2512.17759
作者:Rahul Ravi,Ruizhe Li,Tarek Abdelfatah,Stephen Chan,Xin Chen
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:breast cancer patients, study investigates treatment, contrast-enhanced magnetic resonance, investigates treatment response, treatment response prediction
备注:
点击查看摘要
Abstract:Aim: This study investigates treatment response prediction to neoadjuvant chemotherapy (NACT) in breast cancer patients, using longitudinal contrast-enhanced magnetic resonance images (CE-MRI) and clinical data. The goal is to develop machine learning (ML) models to predict pathologic complete response (PCR binary classification) and 5-year relapse-free survival status (RFS binary classification). Method: The proposed framework includes tumour segmentation, image registration, feature extraction, and predictive modelling. Using the image registration method, MRI image features can be extracted and compared from the original tumour site at different time points, therefore monitoring the intratumor changes during NACT process. Four feature extractors, including one radiomics and three deep learning-based (MedicalNet, Segformer3D, SAM-Med3D) were implemented and compared. In combination with three feature selection methods and four ML models, predictive models are built and compared. Results: The proposed image registration-based feature extraction consistently improves the predictive models. In the PCR and RFS classification tasks logistic regression model trained on radiomic features performed the best with an AUC of 0.88 and classification accuracy of 0.85 for PCR classification, and AUC of 0.78 and classification accuracy of 0.72 for RFS classification. Conclusions: It is evidenced that the image registration method has significantly improved performance in longitudinal feature learning in predicting PCR and RFS. The radiomics feature extractor is more effective than the pre-trained deep learning feature extractors, with higher performance and better interpretability.
115. 【2512.17585】SkinGenBench: Generative Model and Preprocessing Effects for Synthetic Dermoscopic Augmentation in Melanoma Diagnosis
链接:https://arxiv.org/abs/2512.17585
作者:N. A. Adarsh Pritam,Jeba Shiney O,Sanyam Jain
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:work introduces SkinGenBench, systematic biomedical imaging, biomedical imaging benchmark, Denoising Diffusion Probabilistic, Diffusion Probabilistic Models
备注:
点击查看摘要
Abstract:This work introduces SkinGenBench, a systematic biomedical imaging benchmark that investigates how preprocessing complexity interacts with generative model choice for synthetic dermoscopic image augmentation and downstream melanoma diagnosis. Using a curated dataset of 14,116 dermoscopic images from HAM10000 and MILK10K across five lesion classes, we evaluate the two representative generative paradigms: StyleGAN2-ADA and Denoising Diffusion Probabilistic Models (DDPMs) under basic geometric augmentation and advanced artifact removal pipelines. Synthetic melanoma images are assessed using established perceptual and distributional metrics (FID, KID, IS), feature space analysis, and their impact on diagnostic performance across five downstream classifiers. Experimental results demonstrate that generative architecture choice has a stronger influence on both image fidelity and diagnostic utility than preprocessing complexity. StyleGAN2-ADA consistently produced synthetic images more closely aligned with real data distributions, achieving the lowest FID (~65.5) and KID (~0.05), while diffusion models generated higher variance samples at the cost of reduces perceptual fidelity and class anchoring. Advanced artifact removal yielded only marginal improvements in generative metrics and provided limited downstream diagnostic gains, suggesting possible suppression of clinically relevant texture cues. In contrast, synthetic data augmentation substantially improved melanoma detection with 8-15% absolute gains in melanoma F1-score, and ViT-B/16 achieving F1~0.88 and ROC-AUC~0.98, representing an improvement of approximately 14% over non-augmented baselines. Our code can be found at this https URL
116. 【2512.16964】Colormap-Enhanced Vision Transformers for MRI-Based Multiclass (4-Class) Alzheimer's Disease Classification
链接:https://arxiv.org/abs/2512.16964
作者:Faisal Ahmed
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Magnetic Resonance Imaging, Magnetic Resonance, Resonance Imaging, Alzheimer disease, Alzheimer disease classification
备注: 12 pages, 4 figures
点击查看摘要
Abstract:Magnetic Resonance Imaging (MRI) plays a pivotal role in the early diagnosis and monitoring of Alzheimer's disease (AD). However, the subtle structural variations in brain MRI scans often pose challenges for conventional deep learning models to extract discriminative features effectively. In this work, we propose PseudoColorViT-Alz, a colormap-enhanced Vision Transformer framework designed to leverage pseudo-color representations of MRI images for improved Alzheimer's disease classification. By combining colormap transformations with the global feature learning capabilities of Vision Transformers, our method amplifies anatomical texture and contrast cues that are otherwise subdued in standard grayscale MRI scans. We evaluate PseudoColorViT-Alz on the OASIS-1 dataset using a four-class classification setup (non-demented, moderate dementia, mild dementia, and very mild dementia). Our model achieves a state-of-the-art accuracy of 99.79% with an AUC of 100%, surpassing the performance of recent 2024--2025 methods, including CNN-based and Siamese-network approaches, which reported accuracies ranging from 96.1% to 99.68%. These results demonstrate that pseudo-color augmentation combined with Vision Transformers can significantly enhance MRI-based Alzheimer's disease classification. PseudoColorViT-Alz offers a robust and interpretable framework that outperforms current methods, providing a promising tool to support clinical decision-making and early detection of Alzheimer's disease.
Comments:
12 pages, 4 figures
Subjects:
Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:
arXiv:2512.16964 [eess.IV]
(or
arXiv:2512.16964v1 [eess.IV] for this version)
https://doi.org/10.48550/arXiv.2512.16964
Focus to learn more
arXiv-issued DOI via DataCite</p>




