本篇博文主要展示每日从Arxiv论文网站获取的最新论文列表,以自然语言处理、信息检索、计算机视觉等类目进行划分。
统计
今日共更新826篇论文,其中:
- 自然语言处理87篇
- 信息检索11篇
- 计算机视觉220篇
自然语言处理
1. 【2512.23707】raining AI Co-Scientists Using Rubric Rewards
链接:https://arxiv.org/abs/2512.23707
作者:Shashwat Goel,Rishi Hazra,Dulhan Jayalath,Timon Willi,Parag Jain,William F. Shen,Ilias Leontiadis,Francesco Barbieri,Yoram Bachrach,Jonas Geiping,Chenxi Whitehouse
类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
关键词:research, tool to assist, research goals, assist human researchers, goals
备注: 11 pages in the main paper, total 119 including sample outputs in the Appendix
点击查看摘要
Abstract:AI co-scientists are emerging as a tool to assist human researchers in achieving their research goals. A crucial feature of these AI co-scientists is the ability to generate a research plan given a set of aims and constraints. The plan may be used by researchers for brainstorming, or may even be implemented after further refinement. However, language models currently struggle to generate research plans that follow all constraints and implicit requirements. In this work, we study how to leverage the vast corpus of existing research papers to train language models that generate better research plans. We build a scalable, diverse training corpus by automatically extracting research goals and goal-specific grading rubrics from papers across several domains. We then train models for research plan generation via reinforcement learning with self-grading. A frozen copy of the initial policy acts as the grader during training, with the rubrics creating a generator-verifier gap that enables improvements without external human supervision. To validate this approach, we conduct a study with human experts for machine learning research goals, spanning 225 hours. The experts prefer plans generated by our finetuned Qwen3-30B-A3B model over the initial model for 70% of research goals, and approve 84% of the automatically extracted goal-specific grading rubrics. To assess generality, we also extend our approach to research goals from medical papers, and new arXiv preprints, evaluating with a jury of frontier models. Our finetuning yields 12-22% relative improvements and significant cross-domain generalization, proving effective even in problem settings like medical research where execution feedback is infeasible. Together, these findings demonstrate the potential of a scalable, automated training recipe as a step towards improving general AI co-scientists.
2. 【2512.23701】Eliciting Behaviors in Multi-Turn Conversations
链接:https://arxiv.org/abs/2512.23701
作者:Jing Huang,Shujian Zhang,Lun Wang,Andrew Hard,Rajiv Mathews,John Lambert
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:large language models, Identifying specific, target model, induce specific behaviors, large language
备注:
点击查看摘要
Abstract:Identifying specific and often complex behaviors from large language models (LLMs) in conversational settings is crucial for their evaluation. Recent work proposes novel techniques to find natural language prompts that induce specific behaviors from a target model, yet they are mainly studied in single-turn settings. In this work, we study behavior elicitation in the context of multi-turn conversations. We first offer an analytical framework that categorizes existing methods into three families based on their interactions with the target model: those that use only prior knowledge, those that use offline interactions, and those that learn from online interactions. We then introduce a generalized multi-turn formulation of the online method, unifying single-turn and multi-turn elicitation. We evaluate all three families of methods on automatically generating multi-turn test cases. We investigate the efficiency of these approaches by analyzing the trade-off between the query budget, i.e., the number of interactions with the target model, and the success rate, i.e., the discovery rate of behavior-eliciting inputs. We find that online methods can achieve an average success rate of 45/19/77% with just a few thousand queries over three tasks where static methods from existing multi-turn conversation benchmarks find few or even no failure cases. Our work highlights a novel application of behavior elicitation methods in multi-turn conversation evaluation and the need for the community to move towards dynamic benchmarks.
3. 【2512.23693】Fine-Tuning LLMs with Fine-Grained Human Feedback on Text Spans
链接:https://arxiv.org/abs/2512.23693
作者:Sky CH-Wang,Justin Svegliato,Helen Appel,Jason Eisner
类目:Computation and Language (cs.CL)
关键词:fine-tuning language models, dataset for fine-tuning, fine-tuning language, feedback-driven improvement chains, Abstract
备注:
点击查看摘要
Abstract:We present a method and dataset for fine-tuning language models with preference supervision using feedback-driven improvement chains. Given a model response, an annotator provides fine-grained feedback by marking ``liked'' and ``disliked'' spans and specifying what they liked or disliked about them. The base model then rewrites the disliked spans accordingly, proceeding from left to right, forming a sequence of incremental improvements. We construct preference pairs for direct alignment from each adjacent step in the chain, enabling the model to learn from localized, targeted edits. We find that our approach outperforms direct alignment methods based on standard A/B preference ranking or full contrastive rewrites, demonstrating that structured, revision-based supervision leads to more efficient and effective preference tuning.
4. 【2512.23686】PROFASR-BENCH: A Benchmark for Context-Conditioned ASR in High-Stakes Professional Speech
链接:https://arxiv.org/abs/2512.23686
作者:Deepak Babu Piskala
类目:Computation and Language (cs.CL); Sound (cs.SD)
关键词:Automatic Speech Recognition, formal register variation, Automatic Speech, professional settings faces, settings faces challenges
备注: Benchmark dataset and evaluation suite. Data and code available at: [this https URL](https://huggingface.co/datasets/prdeepakbabu/ProfASR-Bench) [this https URL](https://github.com/prdeepakbabu/ProfASR-Bench)
点击查看摘要
Abstract:Automatic Speech Recognition (ASR) in professional settings faces challenges that existing benchmarks underplay: dense domain terminology, formal register variation, and near-zero tolerance for critical entity errors. We present ProfASR-Bench, a professional-talk evaluation suite for high-stakes applications across finance, medicine, legal, and technology. Each example pairs a natural-language prompt (domain cue and/or speaker profile) with an entity-rich target utterance, enabling controlled measurement of context-conditioned recognition. The corpus supports conventional ASR metrics alongside entity-aware scores and slice-wise reporting by accent and gender. Using representative families Whisper (encoder-decoder ASR) and Qwen-Omni (audio language models) under matched no-context, profile, domain+profile, oracle, and adversarial conditions, we find a consistent pattern: lightweight textual context produces little to no change in average word error rate (WER), even with oracle prompts, and adversarial prompts do not reliably degrade performance. We term this the context-utilization gap (CUG): current systems are nominally promptable yet underuse readily available side information. ProfASR-Bench provides a standardized context ladder, entity- and slice-aware reporting with confidence intervals, and a reproducible testbed for comparing fusion strategies across model families. Dataset: this https URL Code: this https URL
Comments:
Benchmark dataset and evaluation suite. Data and code available at: this https URL this https URL
Subjects:
Computation and Language (cs.CL); Sound (cs.SD)
Cite as:
arXiv:2512.23686 [cs.CL]
(or
arXiv:2512.23686v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2512.23686
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
5. 【2512.23684】Multilingual Hidden Prompt Injection Attacks on LLM-Based Academic Reviewing
链接:https://arxiv.org/abs/2512.23684
作者:Panagiotis Theocharopoulos,Ajinkya Kulkarni,Mathew Magimai.-Doss
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large language models, including academic peer, high-impact workflows, Large language, increasingly considered
备注:
点击查看摘要
Abstract:Large language models (LLMs) are increasingly considered for use in high-impact workflows, including academic peer review. However, LLMs are vulnerable to document-level hidden prompt injection attacks. In this work, we construct a dataset of approximately 500 real academic papers accepted to ICML and evaluate the effect of embedding hidden adversarial prompts within these documents. Each paper is injected with semantically equivalent instructions in four different languages and reviewed using an LLM. We find that prompt injection induces substantial changes in review scores and accept/reject decisions for English, Japanese, and Chinese injections, while Arabic injections produce little to no effect. These results highlight the susceptibility of LLM-based reviewing systems to document-level prompt injection and reveal notable differences in vulnerability across languages.
6. 【2512.23676】Web World Models
链接:https://arxiv.org/abs/2512.23676
作者:Jichen Feng,Yifan Zhang,Chenggong Zhang,Yifu Lu,Shilong Liu,Mengdi Wang
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:agents increasingly require, increasingly require persistent, Language agents increasingly, require persistent worlds, agents increasingly
备注: Project Page: [this https URL](https://github.com/Princeton-AI2-Lab/Web-World-Models)
点击查看摘要
Abstract:Language agents increasingly require persistent worlds in which they can act, remember, and learn. Existing approaches sit at two extremes: conventional web frameworks provide reliable but fixed contexts backed by databases, while fully generative world models aim for unlimited environments at the expense of controllability and practical engineering. In this work, we introduce the Web World Model (WWM), a middle ground where world state and ``physics'' are implemented in ordinary web code to ensure logical consistency, while large language models generate context, narratives, and high-level decisions on top of this structured latent state. We build a suite of WWMs on a realistic web stack, including an infinite travel atlas grounded in real geography, fictional galaxy explorers, web-scale encyclopedic and narrative worlds, and simulation- and game-like environments. Across these systems, we identify practical design principles for WWMs: separating code-defined rules from model-driven imagination, representing latent state as typed web interfaces, and utilizing deterministic generation to achieve unlimited but structured exploration. Our results suggest that web stacks themselves can serve as a scalable substrate for world models, enabling controllable yet open-ended environments. Project Page: this https URL.
7. 【2512.23659】Less is more: Probabilistic reduction is best explained by small-scale predictability measures
链接:https://arxiv.org/abs/2512.23659
作者:Cassandra L. Jacobs,Andrés Buxó-Lugo,Anna K. Taylor,Marie Leopold-Hooke
类目:Computation and Language (cs.CL)
关键词:primary research questions, language model probabilities, primary research, research questions, paper center
备注:
点击查看摘要
Abstract:The primary research questions of this paper center on defining the amount of context that is necessary and/or appropriate when investigating the relationship between language model probabilities and cognitive phenomena. We investigate whether whole utterances are necessary to observe probabilistic reduction and demonstrate that n-gram representations suffice as cognitive units of planning.
8. 【2512.23647】Nested Browser-Use Learning for Agentic Information Seeking
链接:https://arxiv.org/abs/2512.23647
作者:Baixuan Li,Jialong Wu,Wenbiao Yin,Kuan Li,Zhongwang Zhang,Huifeng Yin,Zhengwei Tao,Liwen Zhang,Pengjun Xie,Jingren Zhou,Yong Jiang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
关键词:achieved strong performance, remains largely restricted, API-level snippet retrieval, URL-based page fetching, deep search tasks
备注:
点击查看摘要
Abstract:Information-seeking (IS) agents have achieved strong performance across a range of wide and deep search tasks, yet their tool use remains largely restricted to API-level snippet retrieval and URL-based page fetching, limiting access to the richer information available through real browsing. While full browser interaction could unlock deeper capabilities, its fine-grained control and verbose page content returns introduce substantial complexity for ReAct-style function-calling agents. To bridge this gap, we propose Nested Browser-Use Learning (NestBrowse), which introduces a minimal and complete browser-action framework that decouples interaction control from page exploration through a nested structure. This design simplifies agentic reasoning while enabling effective deep-web information acquisition. Empirical results on challenging deep IS benchmarks demonstrate that NestBrowse offers clear benefits in practice. Further in-depth analyses underscore its efficiency and flexibility.
9. 【2512.23637】A Dataset and Benchmark for Consumer Healthcare Question Summarization
链接:https://arxiv.org/abs/2512.23637
作者:Abhishek Basu,Deepak Gupta,Dina Demner-Fushman,Shweta Yadav
类目:Computation and Language (cs.CL)
关键词:seeking health information, quest for seeking, swamped the web, summarization, dataset
备注: arXiv admin note: substantial text overlap with [arXiv:2206.06581](https://arxiv.org/abs/2206.06581)
点击查看摘要
Abstract:The quest for seeking health information has swamped the web with consumers health-related questions. Generally, con- sumers use overly descriptive and peripheral information to express their medical condition or other healthcare needs, contributing to the challenges of natural language understanding. One way to address this challenge is to summarize the questions and distill the key information of the original question. Recently, large-scale datasets have significantly propelled the development of several summarization tasks, such as multi-document summarization and dialogue summarization. However, a lack of a domain-expert annotated dataset for the consumer healthcare questions summarization task inhibits the development of an efficient summarization system. To address this issue, we introduce a new dataset, CHQ-Sum,m that contains 1507 domain-expert annotated consumer health questions and corresponding summaries. The dataset is derived from the community question answering forum and therefore provides a valuable resource for understanding consumer health-related posts on social media. We benchmark the dataset on multiple state-of-the-art summarization models to show the effectiveness of the dataset
10. 【2512.23611】Close the Loop: Synthesizing Infinite Tool-Use Data via Multi-Agent Role-Playing
链接:https://arxiv.org/abs/2512.23611
作者:Yuwen Li,Wei Zhang,Zelong Huang,Mason Yang,Jiajun Wu,Shawn Guo,Huahao Hu,Lingyi Sun,Jian Yang,Mingjie Tang,Byran Dai
类目:Computation and Language (cs.CL)
关键词:Enabling Large Language, Large Language Models, Enabling Large, Large Language, reliably invoke external
备注:
点击查看摘要
Abstract:Enabling Large Language Models (LLMs) to reliably invoke external tools remains a critical bottleneck for autonomous agents. Existing approaches suffer from three fundamental challenges: expensive human annotation for high-quality trajectories, poor generalization to unseen tools, and quality ceilings inherent in single-model synthesis that perpetuate biases and coverage gaps. We introduce InfTool, a fully autonomous framework that breaks these barriers through self-evolving multi-agent synthesis. Given only raw API specifications, InfTool orchestrates three collaborative agents (User Simulator, Tool-Calling Assistant, and MCP Server) to generate diverse, verified trajectories spanning single-turn calls to complex multi-step workflows. The framework establishes a closed loop: synthesized data trains the model via Group Relative Policy Optimization (GRPO) with gated rewards, the improved model generates higher-quality data targeting capability gaps, and this cycle iterates without human intervention. Experiments on the Berkeley Function-Calling Leaderboard (BFCL) demonstrate that InfTool transforms a base 32B model from 19.8% to 70.9% accuracy (+258%), surpassing models 10x larger and rivaling Claude-Opus, and entirely from synthetic data without human annotation.
11. 【2512.23578】Style Amnesia: Investigating Speaking Style Degradation and Mitigation in Multi-Turn Spoken Language Models
链接:https://arxiv.org/abs/2512.23578
作者:Yu-Xiang Lin,Cheng-Han Chiang,Hung-yi Lee
类目:Computation and Language (cs.CL); Sound (cs.SD)
关键词:specific speaking style, spoken language models, style, spoken language, speaking style
备注: Work in progress
点击查看摘要
Abstract:In this paper, we show that when spoken language models (SLMs) are instructed to speak in a specific speaking style at the beginning of a multi-turn conversation, they cannot maintain the required speaking styles after several turns of interaction; we refer to this as the style amnesia of SLMs. We focus on paralinguistic speaking styles, including emotion, accent, volume, and speaking speed. We evaluate three proprietary and two open-source SLMs, demonstrating that none of these models can maintain a consistent speaking style when instructed to do so. We further show that when SLMs are asked to recall the style instruction in later turns, they can recall the style instruction, but they fail to express it throughout the conversation. We also show that explicitly asking the model to recall the style instruction can partially mitigate style amnesia. In addition, we examine various prompting strategies and find that SLMs struggle to follow the required style when the instruction is placed in system messages rather than user messages, which contradicts the intended function of system prompts.
12. 【2512.23572】Instruction-Following Evaluation of Large Vision-Language Models
链接:https://arxiv.org/abs/2512.23572
作者:Daiki Shiono,Shumpei Miyawaki,Ryota Tanaka,Jun Suzuki
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:proposed large vision-language, large language models, large vision-language models, LVLMs' instruction-following ability, vision capabilities
备注: 21 pages, 7 figures
点击查看摘要
Abstract:Following the initial flourishing of large language models (LLMs), there has been a surge in proposed large vision-language models (LVLMs) that integrate LLMs with vision capabilities. However, it has been observed that LVLMs, after tuning to visual instruction using commonly used training datasets, often fail to exhibit the instruction-following ability that was present in the LLM before integration, leading to results in which they do not follow task instructions as expected. This study quantitatively demonstrates that LVLMs' instruction-following ability declines after fine-tuning and analyzes its underlying causes. In particular, we constructed new training datasets highlighting whether the output format is specified. Then, we investigated how explicitly indicating the output format during fine-tuning affects LVLMs' instruction-following ability. Our quantitative evaluation confirmed that LVLMs' instruction-following ability declines after fine-tuning with commonly used datasets. Furthermore, we found that LVLMs trained with datasets, including instructions on output format, tend to follow instructions more accurately than models that do not. These findings suggest that including samples with instructions on output format during (visual) instruction tuning may help mitigate the decline in instruction-following abilities.
13. 【2512.23562】VL-RouterBench: A Benchmark for Vision-Language Model Routing
链接:https://arxiv.org/abs/2512.23562
作者:Zhehao Huang,Baijiong Lin,Jingyuan Zhang,Jingying Wang,Yuhang Liu,Ning Lu,Tao Li,Xiaolin Huang
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:existing work lacks, evaluating vision-language models, Multi-model routing, essential infrastructure, lacks a systematic
备注:
点击查看摘要
Abstract:Multi-model routing has evolved from an engineering technique into essential infrastructure, yet existing work lacks a systematic, reproducible benchmark for evaluating vision-language models (VLMs). We present VL-RouterBench to assess the overall capability of VLM routing systems systematically. The benchmark is grounded in raw inference and scoring logs from VLMs and constructs quality and cost matrices over sample-model pairs. In scale, VL-RouterBench covers 14 datasets across 3 task groups, totaling 30,540 samples, and includes 15 open-source models and 2 API models, yielding 519,180 sample-model pairs and a total input-output token volume of 34,494,977. The evaluation protocol jointly measures average accuracy, average cost, and throughput, and builds a ranking score from the harmonic mean of normalized cost and accuracy to enable comparison across router configurations and cost budgets. On this benchmark, we evaluate 10 routing methods and baselines and observe a significant routability gain, while the best current routers still show a clear gap to the ideal Oracle, indicating considerable room for improvement in router architecture through finer visual cues and modeling of textual structure. We will open-source the complete data construction and evaluation toolchain to promote comparability, reproducibility, and practical deployment in multimodal routing research.
14. 【2512.23547】Lie to Me: Knowledge Graphs for Robust Hallucination Self-Detection in LLMs
链接:https://arxiv.org/abs/2512.23547
作者:Sahil Kale,Antonio Luca Alfeo
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:false statements, remain a major, generation of apparently, apparently convincing, convincing yet false
备注: Accepted to ICPRAM 2026 in Marbella, Spain
点击查看摘要
Abstract:Hallucinations, the generation of apparently convincing yet false statements, remain a major barrier to the safe deployment of LLMs. Building on the strong performance of self-detection methods, we examine the use of structured knowledge representations, namely knowledge graphs, to improve hallucination self-detection. Specifically, we propose a simple yet powerful approach that enriches hallucination self-detection by (i) converting LLM responses into knowledge graphs of entities and relations, and (ii) using these graphs to estimate the likelihood that a response contains hallucinations. We evaluate the proposed approach using two widely used LLMs, GPT-4o and Gemini-2.5-Flash, across two hallucination detection datasets. To support more reliable future benchmarking, one of these datasets has been manually curated and enhanced and is released as a secondary outcome of this work. Compared to standard self-detection methods and SelfCheckGPT, a state-of-the-art approach, our method achieves up to 16% relative improvement in accuracy and 20% in F1-score. Our results show that LLMs can better analyse atomic facts when they are structured as knowledge graphs, even when initial outputs contain inaccuracies. This low-cost, model-agnostic approach paves the way toward safer and more trustworthy language models.
15. 【2512.23518】Single LLM Debate, MoLaCE: Mixture of Latent Concept Experts Against Confirmation Bias
链接:https://arxiv.org/abs/2512.23518
作者:Hazel Kim,Philip Torr
类目:Computation and Language (cs.CL)
关键词:Large language models, Large language, highly vulnerable, Large, confirmation bias
备注:
点击查看摘要
Abstract:Large language models (LLMs) are highly vulnerable to input confirmation bias. When a prompt implies a preferred answer, models often reinforce that bias rather than explore alternatives. This phenomenon remains underexplored, yet it is already harmful in base models and poses an even greater risk in multi-agent debate, where echo chambers reinforce bias instead of correction. We introduce Mixture of Latent Concept Experts (MoLaCE), a lightweight inference-time framework that addresses confirmation bias by mixing experts instantiated as different activation strengths over latent concepts that shape model responses. Our key insight is that, due to the compositional nature of language, differently phrased prompts reweight latent concepts in prompt-specific ways that affect factual correctness, so no single fixed intervention can be applied universally across inputs. This design enables a single LLM to emulate the benefits of debate internally while remaining computationally efficient and scalable. It can also be integrated into multi-agent debate frameworks to diversify perspectives and reduce correlated errors. We empirically show that it consistently reduces confirmation bias, improves robustness, and matches or surpasses multi-agent debate while requiring only a fraction of the computation.
16. 【2512.23512】UniHetero: Could Generation Enhance Understanding for Vision-Language-Model at Large Data Scale?
链接:https://arxiv.org/abs/2512.23512
作者:Fengjiao Chen,Minhao Jing,Weitao Lu,Yan Feng,Xiaoyu Li,Xuezhi Cao
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Vision-language large models, Vision-language large, Vision-language, visual generation tasks, generation tasks
备注:
点击查看摘要
Abstract:Vision-language large models are moving toward the unification of visual understanding and visual generation tasks. However, whether generation can enhance understanding is still under-explored on large data scale. In this work, we analysis the unified model with a concise structure, UniHetero, under large-scale pretraining (200M samples). Our key observations are: (1) Generation can improve understanding, but Only if you generate Semantics, Not Pixels. (2) Generation reveals a superior Data Scaling trend and higher Data Utilization. (3) Autoregression on Input Embedding is effective to capture visual details.
17. 【2512.23504】Automatic Detection of Complex Quotation Patterns in Aggadic Literature
链接:https://arxiv.org/abs/2512.23504
作者:Hadar Miller,Tsvi Kuflik,Moshe Lavee
类目:Computation and Language (cs.CL)
关键词:Allocate Connections, Rabbinic literature, Connections between Texts, paper presents ACT, Unlike existing text
备注: This paper is under review at Cogent Arts Humanities
点击查看摘要
Abstract:This paper presents ACT (Allocate Connections between Texts), a novel three-stage algorithm for the automatic detection of biblical quotations in Rabbinic literature. Unlike existing text reuse frameworks that struggle with short, paraphrased, or structurally embedded quotations, ACT combines a morphology-aware alignment algorithm with a context-sensitive enrichment stage that identifies complex citation patterns such as "Wave" and "Echo" quotations. Our approach was evaluated against leading systems, including Dicta, Passim, Text-Matcher, as well as human-annotated critical editions. We further assessed three ACT configurations to isolate the contribution of each component. Results demonstrate that the full ACT pipeline (ACT-QE) outperforms all baselines, achieving an F1 score of 0.91, with superior Recall (0.89) and Precision (0.94). Notably, ACT-2, which lacks stylistic enrichment, achieves higher Recall (0.90) but suffers in Precision, while ACT-3, using longer n-grams, offers a tradeoff between coverage and specificity. In addition to improving quotation detection, ACT's ability to classify stylistic patterns across corpora opens new avenues for genre classification and intertextual analysis. This work contributes to digital humanities and computational philology by addressing the methodological gap between exhaustive machine-based detection and human editorial judgment. ACT lays a foundation for broader applications in historical textual analysis, especially in morphologically rich and citation-dense traditions like Aggadic literature.
Comments:
This paper is under review at Cogent Arts Humanities
Subjects:
Computation and Language (cs.CL)
Cite as:
arXiv:2512.23504 [cs.CL]
(or
arXiv:2512.23504v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2512.23504
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
18. 【2512.23471】Semantic Tree Inference on Text Corpa using a Nested Density Approach together with Large Language Model Embeddings
链接:https://arxiv.org/abs/2512.23471
作者:Thomas Haschka,Joseph Bakarji
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:large language models, undergone significant advances, recent years due, high dimensional embeddings, language models
备注: 20 pages, 9 figures
点击查看摘要
Abstract:Semantic text classification has undergone significant advances in recent years due to the rise of large language models (LLMs) and their high dimensional embeddings. While LLM-embeddings are frequently used to store and retrieve text by semantic similarity in vector databases, the global structure semantic relationships in text corpora often remains opaque. Herein we propose a nested density clustering approach, to infer hierarchical trees of semantically related texts. The method starts by identifying texts of strong semantic similarity as it searches for dense clusters in LLM embedding space. As the density criterion is gradually relaxed, these dense clusters merge into more diffuse clusters, until the whole dataset is represented by a single cluster - the root of the tree. By embedding dense clusters into increasingly diffuse ones, we construct a tree structure that captures hierarchical semantic relationships among texts. We outline how this approach can be used to classify textual data for abstracts of scientific abstracts as a case study. This enables the data-driven discovery research areas and their subfields without predefined categories. To evaluate the general applicability of the method, we further apply it to established benchmark datasets such as the 20 News- groups and IMDB 50k Movie Reviews, demonstrating its robustness across domains. Finally we discuss possible applications on scientometrics, topic evolution, highlighting how nested density trees can reveal semantic structure and evolution in textual datasets.
19. 【2512.23457】Replay Failures as Successes: Sample-Efficient Reinforcement Learning for Instruction Following
链接:https://arxiv.org/abs/2512.23457
作者:Kongcheng Zhang,Qi Yao,Shunyu Liu,Wenjian Zhang,Min Cen,Yang Zhou,Wenkai Fang,Yiru Zhao,Baisheng Lai,Mingli Song
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:aligning Large Language, Large Language Models, Large Language, aligning Large, Language Models
备注:
点击查看摘要
Abstract:Reinforcement Learning (RL) has shown promise for aligning Large Language Models (LLMs) to follow instructions with various constraints. Despite the encouraging results, RL improvement inevitably relies on sampling successful, high-quality responses; however, the initial model often struggles to generate responses that satisfy all constraints due to its limited capabilities, yielding sparse or indistinguishable rewards that impede learning. In this work, we propose Hindsight instruction Replay (HiR), a novel sample-efficient RL framework for complex instruction following tasks, which employs a select-then-rewrite strategy to replay failed attempts as successes based on the constraints that have been satisfied in hindsight. We perform RL on these replayed samples as well as the original ones, theoretically framing the objective as dual-preference learning at both the instruction- and response-level to enable efficient optimization using only a binary reward signal. Extensive experiments demonstrate that the proposed HiR yields promising results across different instruction following tasks, while requiring less computational budget. Our code and dataset is available at this https URL.
20. 【2512.23447】Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss
链接:https://arxiv.org/abs/2512.23447
作者:Ang Lv,Jin Ma,Yiyuan Ma,Siyuan Qiao
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:limits model performance, models lack explicit, ultimately limits model, router decisions align, ERC loss
备注:
点击查看摘要
Abstract:Mixture-of-Experts (MoE) models lack explicit constraints to ensure the router's decisions align well with the experts' capabilities, which ultimately limits model performance. To address this, we propose expert-router coupling (ERC) loss, a lightweight auxiliary loss that tightly couples the router's decisions with expert capabilities. Our approach treats each expert's router embedding as a proxy token for the tokens assigned to that expert, and feeds perturbed router embeddings through the experts to obtain internal activations. The ERC loss enforces two constraints on these activations: (1) Each expert must exhibit higher activation for its own proxy token than for the proxy tokens of any other expert. (2) Each proxy token must elicit stronger activation from its corresponding expert than from any other expert. These constraints jointly ensure that each router embedding faithfully represents its corresponding expert's capability, while each expert specializes in processing the tokens actually routed to it. The ERC loss is computationally efficient, operating only on n^2 activations, where n is the number of experts. This represents a fixed cost independent of batch size, unlike prior coupling methods that scale with the number of tokens (often millions per batch). Through pre-training MoE-LLMs ranging from 3B to 15B parameters and extensive analysis on trillions of tokens, we demonstrate the effectiveness of the ERC loss. Moreover, the ERC loss offers flexible control and quantitative tracking of expert specialization levels during training, providing valuable insights into MoEs.
21. 【2512.23440】ClinDEF: A Dynamic Evaluation Framework for Large Language Models in Clinical Reasoning
链接:https://arxiv.org/abs/2512.23440
作者:Yuqi Tang,Jing Yu,Zichang Su,Kehua Feng,Zhihui Zhu,Libin Wang,Lei Liang,Qiang Zhang,Keyan Ding,Huajun Chen
类目:Computation and Language (cs.CL)
关键词:iteratively gather information, refine differential diagnosis, physicians iteratively gather, Clinical diagnosis begins, gather information
备注: 23 pages, 4 figures, under review
点击查看摘要
Abstract:Clinical diagnosis begins with doctor-patient interaction, during which physicians iteratively gather information, determine examination and refine differential diagnosis through patients' response. This dynamic clinical-reasoning process is poorly represented by existing LLM benchmarks that focus on static question-answering. To mitigate these gaps, recent methods explore dynamic medical frameworks involving interactive clinical dialogues. Although effective, they often rely on limited, contamination-prone datasets and lack granular, multi-level evaluation. In this work, we propose ClinDEF, a dynamic framework for assessing clinical reasoning in LLMs through simulated diagnostic dialogues. Grounded in a disease knowledge graph, our method dynamically generates patient cases and facilitates multi-turn interactions between an LLM-based doctor and an automated patient agent. Our evaluation protocol goes beyond diagnostic accuracy by incorporating fine-grained efficiency analysis and rubric-based assessment of diagnostic quality. Experiments show that ClinDEF effectively exposes critical clinical reasoning gaps in state-of-the-art LLMs, offering a more nuanced and clinically meaningful evaluation paradigm.
22. 【2512.23430】C2PO: Diagnosing and Disentangling Bias Shortcuts in LLMs
链接:https://arxiv.org/abs/2512.23430
作者:Xuan Feng,Bo An,Tianlong Gu,Liang Chang,Fengrui Hao,Peipeng Yu,Shuai Zhao
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Language Models, poses significant risks, Large Language, poses significant
备注:
点击查看摘要
Abstract:Bias in Large Language Models (LLMs) poses significant risks to trustworthiness, manifesting primarily as stereotypical biases (e.g., gender or racial stereotypes) and structural biases (e.g., lexical overlap or position preferences). However, prior paradigms typically address these in isolation, often mitigating one at the expense of exacerbating the other. To address this, we conduct a systematic exploration of these reasoning failures and identify a primary inducement: the latent spurious feature correlations within the input that drive these erroneous reasoning shortcuts. Driven by these findings, we introduce Causal-Contrastive Preference Optimization (C2PO), a unified alignment framework designed to tackle these specific failures by simultaneously discovering and suppressing these correlations directly within the optimization process. Specifically, C2PO leverages causal counterfactual signals to isolate bias-inducing features from valid reasoning paths, and employs a fairness-sensitive preference update mechanism to dynamically evaluate logit-level contributions and suppress shortcut features. Extensive experiments across multiple benchmarks covering stereotypical bias (BBQ, Unqover), structural bias (MNLI, HANS, Chatbot, MT-Bench), out-of-domain fairness (StereoSet, WinoBias), and general utility (MMLU, GSM8K) demonstrate that C2PO effectively mitigates stereotypical and structural biases while preserving robust general reasoning capabilities.
23. 【2512.23429】he Effect of Gender Diversity on Scientific Team Impact: A Team Roles Perspective
链接:https://arxiv.org/abs/2512.23429
作者:Yi Zhao,Yongjun Zhu,Donghun Kim,Yuzhuo Wang,Heng Zhang,Chao Lu,Chengzhi Zhang
类目:Computation and Language (cs.CL); Computers and Society (cs.CY); Digital Libraries (cs.DL)
关键词:gender diversity, diversity, gender, interest to academia, team
备注:
点击查看摘要
Abstract:The influence of gender diversity on the success of scientific teams is of great interest to academia. However, prior findings remain inconsistent, and most studies operationalize diversity in aggregate terms, overlooking internal role differentiation. This limitation obscures a more nuanced understanding of how gender diversity shapes team impact. In particular, the effect of gender diversity across different team roles remains poorly understood. To this end, we define a scientific team as all coauthors of a paper and measure team impact through five-year citation counts. Using author contribution statements, we classified members into leadership and support roles. Drawing on more than 130,000 papers from PLOS journals, most of which are in biomedical-related disciplines, we employed multivariable regression to examine the association between gender diversity in these roles and team impact. Furthermore, we apply a threshold regression model to investigate how team size moderates this relationship. The results show that (1) the relationship between gender diversity and team impact follows an inverted U-shape for both leadership and support groups; (2) teams with an all-female leadership group and an all-male support group achieve higher impact than other team types. Interestingly, (3) the effect of leadership-group gender diversity is significantly negative for small teams but becomes positive and statistically insignificant in large teams. In contrast, the estimates for support-group gender diversity remain significant and positive, regardless of team size.
24. 【2512.23422】Entropy-Guided Token Dropout: Training Autoregressive Language Models with Limited Domain Data
链接:https://arxiv.org/abs/2512.23422
作者:Jiapeng Wang,Yiwen Hu,Yanzipeng Gao,Haoyu Wang,Shuo Wang,Hongyu Lu,Jiaxin Mao,Wayne Xin Zhao,Junyi Li,Xiao Zhang
类目:Computation and Language (cs.CL)
关键词:grows increasingly scarce, adapting large language, domain-specific data grows, large language models, data grows increasingly
备注:
点击查看摘要
Abstract:As access to high-quality, domain-specific data grows increasingly scarce, multi-epoch training has become a practical strategy for adapting large language models (LLMs). However, autoregressive models often suffer from performance degradation under repeated data exposure, where overfitting leads to a marked decline in model capability. Through empirical analysis, we trace this degradation to an imbalance in learning dynamics: predictable, low-entropy tokens are learned quickly and come to dominate optimization, while the model's ability to generalize on high-entropy tokens deteriorates with continued training. To address this, we introduce EntroDrop, an entropy-guided token dropout method that functions as structured data regularization. EntroDrop selectively masks low-entropy tokens during training and employs a curriculum schedule to adjust regularization strength in alignment with training progress. Experiments across model scales from 0.6B to 8B parameters show that EntroDrop consistently outperforms standard regularization baselines and maintains robust performance throughout extended multi-epoch training. These findings underscore the importance of aligning regularization with token-level learning dynamics when training on limited data. Our approach offers a promising pathway toward more effective adaptation of LLMs in data-constrained domains.
25. 【2512.23407】heoretical Foundations of Scaling Law in Familial Models
链接:https://arxiv.org/abs/2512.23407
作者:Huan Song,Qingfei Zhao,Ting Long,Shuyu Tian,Hongjun An,Jiawei Shao,Chi Zhang,Xuelong Li
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:optimizing large language, Neural scaling laws, large language model, Neural scaling, foundational for optimizing
备注:
点击查看摘要
Abstract:Neural scaling laws have become foundational for optimizing large language model (LLM) training, yet they typically assume a single dense model output. This limitation effectively overlooks "Familial models, a transformative paradigm essential for realizing ubiquitous intelligence across heterogeneous device-edge-cloud hierarchies. Transcending static architectures, familial models integrate early exits with relay-style inference to spawn G deployable sub-models from a single shared backbone. In this work, we theoretically and empirically extend the scaling law to capture this "one-run, many-models" paradigm by introducing Granularity (G) as a fundamental scaling variable alongside model size (N) and training tokens (D). To rigorously quantify this relationship, we propose a unified functional form L(N, D, G) and parameterize it using large-scale empirical runs. Specifically, we employ a rigorous IsoFLOP experimental design to strictly isolate architectural impact from computational scale. Across fixed budgets, we systematically sweep model sizes (N) and granularities (G) while dynamically adjusting tokens (D). This approach effectively decouples the marginal cost of granularity from the benefits of scale, ensuring high-fidelity parameterization of our unified scaling law. Our results reveal that the granularity penalty follows a multiplicative power law with an extremely small exponent. Theoretically, this bridges fixed-compute training with dynamic architectures. Practically, it validates the "train once, deploy many" paradigm, demonstrating that deployment flexibility is achievable without compromising the compute-optimality of dense baselines.
26. 【2512.23356】A Stepwise-Enhanced Reasoning Framework for Large Language Models Based on External Subgraph Generation
链接:https://arxiv.org/abs/2512.23356
作者:Xin Zhang,Yang Cao,Baoxing Wu,Xinyi Chen,Kai Song,Siying Li
类目:Computation and Language (cs.CL)
关键词:natural language processing, language processing tasks, including machine translation, Large Language Models, natural language
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have achieved strong performance across a wide range of natural language processing tasks in recent years, including machine translation, text generation, and question answering. As their applications extend to increasingly complex scenarios, however, LLMs continue to face challenges in tasks that require deep reasoning and logical inference. In particular, models trained on large scale textual corpora may incorporate noisy or irrelevant information during generation, which can lead to incorrect predictions or outputs that are inconsistent with factual knowledge. To address this limitation, we propose a stepwise reasoning enhancement framework for LLMs based on external subgraph generation, termed SGR. The proposed framework dynamically constructs query relevant subgraphs from external knowledge bases and leverages their semantic structure to guide the reasoning process. By performing reasoning in a step by step manner over structured subgraphs, SGR reduces the influence of noisy information and improves reasoning accuracy. Specifically, the framework first generates an external subgraph tailored to the input query, then guides the model to conduct multi step reasoning grounded in the subgraph, and finally integrates multiple reasoning paths to produce the final answer. Experimental results on multiple benchmark datasets demonstrate that SGR consistently outperforms strong baselines, indicating its effectiveness in enhancing the reasoning capabilities of LLMs.
27. 【2512.23343】AI Meets Brain: Memory Systems from Cognitive Neuroscience to Autonomous Agents
链接:https://arxiv.org/abs/2512.23343
作者:Jiafeng Liang,Hao Li,Chang Li,Jiaqi Zhou,Shixin Jiang,Zekun Wang,Changkai Ji,Zhihao Zhu,Runxuan Liu,Tao Ren,Jinlan Fu,See-Kiong Ng,Xia Liang,Ming Liu,Bing Qin
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:navigate complex tasks, pivotal nexus bridging, nexus bridging past, complex tasks, pivotal nexus
备注: 57 pages, 5 figures
点击查看摘要
Abstract:Memory serves as the pivotal nexus bridging past and future, providing both humans and AI systems with invaluable concepts and experience to navigate complex tasks. Recent research on autonomous agents has increasingly focused on designing efficient memory workflows by drawing on cognitive neuroscience. However, constrained by interdisciplinary barriers, existing works struggle to assimilate the essence of human memory mechanisms. To bridge this gap, we systematically synthesizes interdisciplinary knowledge of memory, connecting insights from cognitive neuroscience with LLM-driven agents. Specifically, we first elucidate the definition and function of memory along a progressive trajectory from cognitive neuroscience through LLMs to agents. We then provide a comparative analysis of memory taxonomy, storage mechanisms, and the complete management lifecycle from both biological and artificial perspectives. Subsequently, we review the mainstream benchmarks for evaluating agent memory. Additionally, we explore memory security from dual perspectives of attack and defense. Finally, we envision future research directions, with a focus on multimodal memory systems and skill acquisition.
28. 【2512.23328】CubeBench: Diagnosing Interactive, Long-Horizon Spatial Reasoning Under Partial Observations
链接:https://arxiv.org/abs/2512.23328
作者:Huan-ang Gao,Zikang Zhang,Tianwei Luo,Kaisen Yang,Xinzhe Juan,Jiahao Qiu,Tianxing Chen,Bingxiang He,Hao Zhao,Hao Zhou,Shilong Liu,Mengdi Wang
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Language Model, Large Language, Language Model, physical-world deployment due, spatial mental model
备注: Webpage: [this https URL](https://cubebench.c7w.tech/)
点击查看摘要
Abstract:Large Language Model (LLM) agents, while proficient in the digital realm, face a significant gap in physical-world deployment due to the challenge of forming and maintaining a robust spatial mental model. We identify three core cognitive challenges hindering this transition: spatial reasoning, long-horizon state tracking via mental simulation, and active exploration under partial observation. To isolate and evaluate these faculties, we introduce CubeBench, a novel generative benchmark centered on the Rubik's Cube. CubeBench uses a three-tiered diagnostic framework that progressively assesses agent capabilities, from foundational state tracking with full symbolic information to active exploration with only partial visual data. Our experiments on leading LLMs reveal critical limitations, including a uniform 0.00% pass rate on all long-horizon tasks, exposing a fundamental failure in long-term planning. We also propose a diagnostic framework to isolate these cognitive bottlenecks by providing external solver tools. By analyzing the failure modes, we provide key insights to guide the development of more physically-grounded intelligent agents.
29. 【2512.23300】AI4Reading: Chinese Audiobook Interpretation System Based on Multi-Agent Collaboration
链接:https://arxiv.org/abs/2512.23300
作者:Minjiang Huang,Jipeng Qiang,Yi Zhu,Chaowei Zhang,Xiangyu Zhao,Kui Yu
类目:Computation and Language (cs.CL)
关键词:attracting increasing attention, offer readers practical, readers practical insights, increasing attention, intellectual inspiration
备注: ACL 2025 demo
点击查看摘要
Abstract:Audiobook interpretations are attracting increasing attention, as they provide accessible and in-depth analyses of books that offer readers practical insights and intellectual inspiration. However, their manual creation process remains time-consuming and resource-intensive. To address this challenge, we propose AI4Reading, a multi-agent collaboration system leveraging large language models (LLMs) and speech synthesis technology to generate podcast, like audiobook interpretations. The system is designed to meet three key objectives: accurate content preservation, enhanced comprehensibility, and a logical narrative structure. To achieve these goals, we develop a framework composed of 11 specialized agents,including topic analysts, case analysts, editors, a narrator, and proofreaders that work in concert to explore themes, extract real world cases, refine content organization, and synthesize natural spoken language. By comparing expert interpretations with our system's output, the results show that although AI4Reading still has a gap in speech generation quality, the generated interpretative scripts are simpler and more accurate.
30. 【2512.23280】Chinese Morph Resolution in E-commerce Live Streaming Scenarios
链接:https://arxiv.org/abs/2512.23280
作者:Jiahao Zhu,Jipeng Qiang,Ran Bai,Chenyu Liu,Xiaoye Ouyang
类目:Computation and Language (cs.CL)
关键词:major sales channel, platforms like Douyin, E-commerce live streaming, E-commerce live, sales channel
备注:
点击查看摘要
Abstract:E-commerce live streaming in China, particularly on platforms like Douyin, has become a major sales channel, but hosts often use morphs to evade scrutiny and engage in false advertising. This study introduces the Live Auditory Morph Resolution (LiveAMR) task to detect such violations. Unlike previous morph research focused on text-based evasion in social media and underground industries, LiveAMR targets pronunciation-based evasion in health and medical live streams. We constructed the first LiveAMR dataset with 86,790 samples and developed a method to transform the task into a text-to-text generation problem. By leveraging large language models (LLMs) to generate additional training data, we improved performance and demonstrated that morph resolution significantly enhances live streaming regulation.
31. 【2512.23260】Interpretable Safety Alignment via SAE-Constructed Low-Rank Subspace Adaptation
链接:https://arxiv.org/abs/2512.23260
作者:Dianyun Wang,Qingsen Ma,Yuhu Shang,Zhifeng Lu,Lechen Ning,Zhenbo Xu,Huijia Wu,Zhaofeng He
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:adapting large language, large language models, downstream tasks, Parameter-efficient fine-tuning, dominant paradigm
备注:
点击查看摘要
Abstract:Parameter-efficient fine-tuning has become the dominant paradigm for adapting large language models to downstream tasks. Low-rank adaptation methods such as LoRA operate under the assumption that task-relevant weight updates reside in a low-rank subspace, yet this subspace is learned implicitly from data in a black-box manner, offering no interpretability or direct control. We hypothesize that this difficulty stems from polysemanticity--individual dimensions encoding multiple entangled concepts. To address this, we leverage pre-trained Sparse Autoencoders (SAEs) to identify task-relevant features in a disentangled feature space, then construct an explicit, interpretable low-rank subspace to guide adapter initialization. We provide theoretical analysis proving that under monosemanticity assumptions, SAE-based subspace identification achieves arbitrarily small recovery error, while direct identification in polysemantic space suffers an irreducible error floor. On safety alignment, our method achieves up to 99.6% safety rate--exceeding full fine-tuning by 7.4 percentage points and approaching RLHF-based methods--while updating only 0.19-0.24% of parameters. Crucially, our method provides interpretable insights into the learned alignment subspace through the semantic grounding of SAE features. Our work demonstrates that incorporating mechanistic interpretability into the fine-tuning process can simultaneously improve both performance and transparency.
32. 【2512.23214】Anka: A Domain-Specific Language for Reliable LLM Code Generation
链接:https://arxiv.org/abs/2512.23214
作者:Saif Khalfan Saif Al Mazrouei
类目:Computation and Language (cs.CL); Machine Learning (cs.LG); Programming Languages (cs.PL); Software Engineering (cs.SE)
关键词:Large Language Models, demonstrated remarkable capabilities, Large Language, Language Models, exhibit systematic errors
备注: 11 pages, 1 figure, 4 tables. Code and benchmarks available at [this https URL](https://github.com/BleBlo/Anka)
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, yet they exhibit systematic errors on complex, multi-step programming tasks. We hypothesize that these errors stem from the flexibility of general-purpose languages, which permits multiple valid approaches and requires implicit state management. To test this hypothesis, we introduce Anka, a domain-specific language (DSL) for data transformation pipelines designed with explicit, constrained syntax that reduces ambiguity in code generation. Despite having zero prior training exposure to Anka, Claude 3.5 Haiku achieves 99.9% parse success and 95.8% overall task accuracy across 100 benchmark problems. Critically, Anka demonstrates a 40 percentage point accuracy advantage over Python on multi-step pipeline tasks (100% vs. 60%), where Python's flexible syntax leads to frequent errors in operation sequencing and variable management. Cross-model validation with GPT-4o-mini confirms this advantage (+26.7 percentage points on multi-step tasks). Our results demonstrate that: (1) LLMs can learn novel DSLs entirely from in-context prompts, achieving near-native accuracy; (2) constrained syntax significantly reduces errors on complex tasks; and (3) domain-specific languages purposefully designed for LLM generation can outperform general-purpose languages on which the LLM has extensive training. We release the complete language implementation, benchmark suite, and evaluation framework to facilitate further research.
33. 【2512.23213】Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process
链接:https://arxiv.org/abs/2512.23213
作者:Zhijun Chen,Zeyu Ji,Qianren Mao,Junhang Cheng,Bangjie Qin,Hao Wu,Zhuoran Li,Jingzheng Li,Kai Sun,Zizhe Wang,Yikun Ban,Zhu Sun,Xiangyang Ji,Hailong Sun
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:multiple LLM-generated candidates, LLM Ensemble method, unsupervised LLM Ensemble, harnessing the collective, diverse strengths
备注:
点击查看摘要
Abstract:We propose LLM-PeerReview, an unsupervised LLM Ensemble method that selects the most ideal response from multiple LLM-generated candidates for each query, harnessing the collective wisdom of multiple models with diverse strengths. LLM-PeerReview is built on a novel, peer-review-inspired framework that offers a clear and interpretable mechanism, while remaining fully unsupervised for flexible adaptability and generalization. Specifically, it operates in three stages: For scoring, we use the emerging LLM-as-a-Judge technique to evaluate each response by reusing multiple LLMs at hand; For reasoning, we can apply a principled graphical model-based truth inference algorithm or a straightforward averaging strategy to aggregate multiple scores to produce a final score for each response; Finally, the highest-scoring response is selected as the best ensemble output. LLM-PeerReview is conceptually simple and empirically powerful. The two variants of the proposed approach obtain strong results across four datasets, including outperforming the recent advanced model Smoothie-Global by 6.9% and 7.3% points, respectively.
34. 【2512.23206】Not too long do read: Evaluating LLM-generated extreme scientific summaries
链接:https://arxiv.org/abs/2512.23206
作者:Zhuoqi Lyu,Qing Ke
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:facilitates effective science, effective science communication, scientific extreme summary, High-quality scientific extreme, extreme summary
备注:
点击查看摘要
Abstract:High-quality scientific extreme summary (TLDR) facilitates effective science communication. How do large language models (LLMs) perform in generating them? How are LLM-generated summaries different from those written by human experts? However, the lack of a comprehensive, high-quality scientific TLDR dataset hinders both the development and evaluation of LLMs' summarization ability. To address these, we propose a novel dataset, BiomedTLDR, containing a large sample of researcher-authored summaries from scientific papers, which leverages the common practice of including authors' comments alongside bibliography items. We then test popular open-weight LLMs for generating TLDRs based on abstracts. Our analysis reveals that, although some of them successfully produce humanoid summaries, LLMs generally exhibit a greater affinity for the original text's lexical choices and rhetorical structures, hence tend to be more extractive rather than abstractive in general, compared to humans. Our code and datasets are available at this https URL (Lyu and Ke, 2025).
35. 【2512.23145】Reservoir Computing inspired Matrix Multiplication-free Language Model
链接:https://arxiv.org/abs/2512.23145
作者:Takumi Shiratsuchi,Yuichiro Tanaka,Hakaru Tamukoh
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large language models, natural language processing, Large language, high computational cost, computational cost remains
备注: 9 pages, 10 figures
点击查看摘要
Abstract:Large language models (LLMs) have achieved state-of-the-art performance in natural language processing; however, their high computational cost remains a major bottleneck. In this study, we target computational efficiency by focusing on a matrix multiplication free language model (MatMul-free LM) and further reducing the training cost through an architecture inspired by reservoir computing. Specifically, we partially fix and share the weights of selected layers in the MatMul-free LM and insert reservoir layers to obtain rich dynamic representations without additional training overhead. Additionally, several operations are combined to reduce memory accesses. Experimental results show that the proposed architecture reduces the number of parameters by up to 19%, training time by 9.9%, and inference time by 8.0%, while maintaining comparable performance to the baseline model.
36. 【2512.23097】A Note on Hybrid Online Reinforcement and Imitation Learning for LLMs: Formulations and Algorithms
链接:https://arxiv.org/abs/2512.23097
作者:Yingru Li,Ziniu Li,Jiacai Liu
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Model, integrates Imitation Learning, Language Model, Large Language, Reinforcement Learning
备注:
点击查看摘要
Abstract:We present a unified framework for Large Language Model (LLM) fine-tuning that integrates Imitation Learning and Reinforcement Learning. By analyzing the gradient of a composite objective combining trajectory-level KL divergence with task rewards, we derive a natural decomposition into two components: (1) an analytically computable Dense Gradient for token-level imitation, and (2) a Monte Carlo estimated Sparse Gradient for long-horizon reward optimization. The Dense Gradient admits a closed-form logit-level formula, enabling efficient GPU implementation.
37. 【2512.23065】abiBERT: A Large-Scale ModernBERT Foundation Model and Unified Benchmarking Framework for Turkish
链接:https://arxiv.org/abs/2512.23065
作者:Melikşah Türker,A. Ebrar Kızıloğlu,Onur Güngör,Susan Üsküdarlı
类目:Computation and Language (cs.CL)
关键词:Rotary Positional Embeddings, encoder-only Transformers, Transformers have evolved, integrating Rotary Positional, computational efficiency
备注: 31 pages, 1 figure, 13 tables
点击查看摘要
Abstract:Since the inception of BERT, encoder-only Transformers have evolved significantly in computational efficiency, training stability, and long-context modeling. ModernBERT consolidates these advances by integrating Rotary Positional Embeddings (RoPE), FlashAttention, and refined normalization. Despite these developments, Turkish NLP lacks a monolingual encoder trained from scratch incorporating such modern architectural paradigms. This work introduces TabiBERT, a monolingual Turkish encoder based on ModernBERT architecture trained from scratch on a large, curated corpus. TabiBERT is pre-trained on one trillion tokens sampled from an 84.88B token multi-domain corpus: web text (73%), scientific publications (20%), source code (6%), and mathematical content (0.3%). The model supports 8,192-token context length (16x original BERT), achieves up to 2.65x inference speedup, and reduces GPU memory consumption, enabling larger batch sizes. We introduce TabiBench with 28 datasets across eight task categories with standardized splits and protocols, evaluated using GLUE-style macro-averaging. TabiBERT attains 77.58 on TabiBench, outperforming BERTurk by 1.62 points and establishing state-of-the-art on five of eight categories: question answering (+9.55), code retrieval (+2.41), and document retrieval (+0.60). Compared with task-specific prior best results, including specialized models like TurkishBERTweet, TabiBERT achieves +1.47 average improvement, indicating robust cross-domain generalization. We release model weights, training configurations, and evaluation code for transparent, reproducible Turkish encoder research.
38. 【2512.23049】Accelerating Language Model Workflows with Prompt Choreography
链接:https://arxiv.org/abs/2512.23049
作者:TJ Bai,Jason Eisner
类目:Computation and Language (cs.CL)
关键词:Large language models, Large language, language models, models are increasingly, increasingly deployed
备注: to appear in TACL (final preprint of 2025-10-12); 10 pages + appendices
点击查看摘要
Abstract:Large language models are increasingly deployed in multi-agent workflows. We introduce Prompt Choreography, a framework that efficiently executes LLM workflows by maintaining a dynamic, global KV cache. Each LLM call can attend to an arbitrary, reordered subset of previously encoded messages. Parallel calls are supported. Though caching messages' encodings sometimes gives different results from re-encoding them in a new context, we show in diverse settings that fine-tuning the LLM to work with the cache can help it mimic the original results. Prompt Choreography significantly reduces per-message latency (2.0--6.2$\times$ faster time-to-first-token) and achieves substantial end-to-end speedups ($$2.2$\times$) in some workflows dominated by redundant computation.
39. 【2512.23032】Is Chain-of-Thought Really Not Explainability? Chain-of-Thought Can Be Faithful without Hint Verbalization
链接:https://arxiv.org/abs/2512.23032
作者:Kerem Zaman,Shashank Srivastava
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Recent work, Biasing Features, Biasing Features metric, omits a prompt-injected, Recent
备注: 18 pages, 20 figures, 5 tables
点击查看摘要
Abstract:Recent work, using the Biasing Features metric, labels a CoT as unfaithful if it omits a prompt-injected hint that affected the prediction. We argue this metric confuses unfaithfulness with incompleteness, the lossy compression needed to turn distributed transformer computation into a linear natural language narrative. On multi-hop reasoning tasks with Llama-3 and Gemma-3, many CoTs flagged as unfaithful by Biasing Features are judged faithful by other metrics, exceeding 50% in some models. With a new faithful@k metric, we show that larger inference-time token budgets greatly increase hint verbalization (up to 90% in some settings), suggesting much apparent unfaithfulness is due to tight token limits. Using Causal Mediation Analysis, we further show that even non-verbalized hints can causally mediate prediction changes through the CoT. We therefore caution against relying solely on hint-based evaluations and advocate a broader interpretability toolkit, including causal mediation and corruption-based metrics.
40. 【2512.23025】LENS: LLM-Enabled Narrative Synthesis for Mental Health by Aligning Multimodal Sensing with Language Models
链接:https://arxiv.org/abs/2512.23025
作者:Wenxuan Xu,Arvind Pillai,Subigya Nepal,Amanda C Collins,Daniel M Mackin,Michael V Heinz,Tess Z Griffin,Nicholas C Jacobson,Andrew Campbell
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:language remains challenging, natural language remains, sensing offers rich, numerical time-series measurements, assessing mental health
备注: 22 pages, 9 figures, under review
点击查看摘要
Abstract:Multimodal health sensing offers rich behavioral signals for assessing mental health, yet translating these numerical time-series measurements into natural language remains challenging. Current LLMs cannot natively ingest long-duration sensor streams, and paired sensor-text datasets are scarce. To address these challenges, we introduce LENS, a framework that aligns multimodal sensing data with language models to generate clinically grounded mental-health narratives. LENS first constructs a large-scale dataset by transforming Ecological Momentary Assessment (EMA) responses related to depression and anxiety symptoms into natural-language descriptions, yielding over 100,000 sensor-text QA pairs from 258 participants. To enable native time-series integration, we train a patch-level encoder that projects raw sensor signals directly into an LLM's representation space. Our results show that LENS outperforms strong baselines on standard NLP metrics and task-specific measures of symptom-severity accuracy. A user study with 13 mental-health professionals further indicates that LENS-produced narratives are comprehensive and clinically meaningful. Ultimately, our approach advances LLMs as interfaces for health sensing, providing a scalable path toward models that can reason over raw behavioral signals and support downstream clinical decision-making.
41. 【2512.23014】Improving Generalization in LLM Structured Pruning via Function-Aware Neuron Grouping
链接:https://arxiv.org/abs/2512.23014
作者:Tao Yu,Yongqi An,Kuan Zhu,Guibo Zhu,Ming Tang,Jinqiao Wang
类目:Computation and Language (cs.CL)
关键词:Large Language Models, incur substantial computational, storage costs due, demonstrate impressive performance, Large Language
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) demonstrate impressive performance across natural language tasks but incur substantial computational and storage costs due to their scale. Post-training structured pruning offers an efficient solution. However, when few-shot calibration sets fail to adequately reflect the pretraining data distribution, existing methods exhibit limited generalization to downstream tasks. To address this issue, we propose Function-Aware Neuron Grouping (FANG), a post-training pruning framework that alleviates calibration bias by identifying and preserving neurons critical to specific function. FANG groups neurons with similar function based on the type of semantic context they process and prunes each group independently. During importance estimation within each group, tokens that strongly correlate with the functional role of the neuron group are given higher weighting. Additionally, FANG also preserves neurons that contribute across multiple context types. To achieve a better trade-off between sparsity and performance, it allocates sparsity to each block adaptively based on its functional complexity. Experiments show that FANG improves downstream accuracy while preserving language modeling performance. It achieves the state-of-the-art (SOTA) results when combined with FLAP and OBC, two representative pruning methods. Specifically, FANG outperforms FLAP and OBC by 1.5%--8.5% in average accuracy under 30% and 40% sparsity.
42. 【2512.22966】Prompt engineering does not universally improve Large Language Model performance across clinical decision-making tasks
链接:https://arxiv.org/abs/2512.22966
作者:Mengdi Chai,Ali R. Zomorrodi
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, medical knowledge assessments, decision-making remains underexplored, relevant diagnostic testing
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated promise in medical knowledge assessments, yet their practical utility in real-world clinical decision-making remains underexplored. In this study, we evaluated the performance of three state-of-the-art LLMs-ChatGPT-4o, Gemini 1.5 Pro, and LIama 3.3 70B-in clinical decision support across the entire clinical reasoning workflow of a typical patient encounter. Using 36 case studies, we first assessed LLM's out-of-the-box performance across five key sequential clinical decision-making tasks under two temperature settings (default vs. zero): differential diagnosis, essential immediate steps, relevant diagnostic testing, final diagnosis, and treatment recommendation. All models showed high variability by task, achieving near-perfect accuracy in final diagnosis, poor performance in relevant diagnostic testing, and moderate performance in remaining tasks. Furthermore, ChatGPT performed better under the zero temperature, whereas LIama showed stronger performance under the default temperature. Next, we assessed whether prompt engineering could enhance LLM performance by applying variations of the MedPrompt framework, incorporating targeted and random dynamic few-shot learning. The results demonstrate that prompt engineering is not a one-size-fit-all solution. While it significantly improved the performance on the task with lowest baseline accuracy (relevant diagnostic testing), it was counterproductive for others. Another key finding was that the targeted dynamic few-shot prompting did not consistently outperform random selection, indicating that the presumed benefits of closely matched examples may be counterbalanced by loss of broader contextual diversity. These findings suggest that the impact of prompt engineering is highly model and task-dependent, highlighting the need for tailored, context-aware strategies for integrating LLMs into healthcare.
43. 【2512.22955】Diversity or Precision? A Deep Dive into Next Token Prediction
链接:https://arxiv.org/abs/2512.22955
作者:Haoyuan Wu,Hai Wang,Jiajia Wu,Jinxiang Ou,Keyao Wang,Weile Chen,Zihao Zheng,Bei Yu
类目:Computation and Language (cs.CL)
关键词:large language models, Recent advancements, advancements have shown, shown that reinforcement, substantially improve
备注:
点击查看摘要
Abstract:Recent advancements have shown that reinforcement learning (RL) can substantially improve the reasoning abilities of large language models (LLMs). The effectiveness of such RL training, however, depends critically on the exploration space defined by the pre-trained model's token-output distribution. In this paper, we revisit the standard cross-entropy loss, interpreting it as a specific instance of policy gradient optimization applied within a single-step episode. To systematically study how the pre-trained distribution shapes the exploration potential for subsequent RL, we propose a generalized pre-training objective that adapts on-policy RL principles to supervised learning. By framing next-token prediction as a stochastic decision process, we introduce a reward-shaping strategy that explicitly balances diversity and precision. Our method employs a positive reward scaling factor to control probability concentration on ground-truth tokens and a rank-aware mechanism that treats high-ranking and low-ranking negative tokens asymmetrically. This allows us to reshape the pre-trained token-output distribution and investigate how to provide a more favorable exploration space for RL, ultimately enhancing end-to-end reasoning performance. Contrary to the intuition that higher distribution entropy facilitates effective exploration, we find that imposing a precision-oriented prior yields a superior exploration space for RL.
44. 【2512.22933】Multimodal Fact-Checking: An Agent-based Approach
链接:https://arxiv.org/abs/2512.22933
作者:Danni Xu,Shaojing Fan,Xuanang Cheng,Mohan Kankanhalli
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:automated fact-checking systems, multimodal misinformation poses, rapid spread, poses a growing, growing challenge
备注: Code and dataset will be released at [this https URL](https://github.com/xudanni0927/AgentFact)
点击查看摘要
Abstract:The rapid spread of multimodal misinformation poses a growing challenge for automated fact-checking systems. Existing approaches, including large vision language models (LVLMs) and deep multimodal fusion methods, often fall short due to limited reasoning and shallow evidence utilization. A key bottleneck is the lack of dedicated datasets that provide complete real-world multimodal misinformation instances accompanied by annotated reasoning processes and verifiable evidence. To address this limitation, we introduce RW-Post, a high-quality and explainable dataset for real-world multimodal fact-checking. RW-Post aligns real-world multimodal claims with their original social media posts, preserving the rich contextual information in which the claims are made. In addition, the dataset includes detailed reasoning and explicitly linked evidence, which are derived from human written fact-checking articles via a large language model assisted extraction pipeline, enabling comprehensive verification and explanation. Building upon RW-Post, we propose AgentFact, an agent-based multimodal fact-checking framework designed to emulate the human verification workflow. AgentFact consists of five specialized agents that collaboratively handle key fact-checking subtasks, including strategy planning, high-quality evidence retrieval, visual analysis, reasoning, and explanation generation. These agents are orchestrated through an iterative workflow that alternates between evidence searching and task-aware evidence filtering and reasoning, facilitating strategic decision-making and systematic evidence analysis. Extensive experimental results demonstrate that the synergy between RW-Post and AgentFact substantially improves both the accuracy and interpretability of multimodal fact-checking.
45. 【2512.22903】Debugging Tabular Log as Dynamic Graphs
链接:https://arxiv.org/abs/2512.22903
作者:Chumeng Liang,Zhanyang Jin,Zahaib Akhtar,Mona Pereira,Haofei Yu,Jiaxuan You
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:detect real-world inconsistencies, real-world inconsistencies efficiently, Tabular log, Tabular log abstracts, reports their updates
备注:
点击查看摘要
Abstract:Tabular log abstracts objects and events in the real-world system and reports their updates to reflect the change of the system, where one can detect real-world inconsistencies efficiently by debugging corresponding log entries. However, recent advances in processing text-enriched tabular log data overly depend on large language models (LLMs) and other heavy-load models, thus suffering from limited flexibility and scalability. This paper proposes a new framework, GraphLogDebugger, to debug tabular log based on dynamic graphs. By constructing heterogeneous nodes for objects and events and connecting node-wise edges, the framework recovers the system behind the tabular log as an evolving dynamic graph. With the help of our dynamic graph modeling, a simple dynamic Graph Neural Network (GNN) is representative enough to outperform LLMs in debugging tabular log, which is validated by experimental results on real-world log datasets of computer systems and academic papers.
46. 【2512.22857】AutoForge: Automated Environment Synthesis for Agentic Reinforcement Learning
链接:https://arxiv.org/abs/2512.22857
作者:Shihao Cai,Runnan Fang,Jialong Wu,Baixuan Li,Xinyu Wang,Yong Jiang,Liangcai Su,Liwen Zhang,Wenbiao Yin,Zhen Zhang,Fuli Feng,Pengjun Xie,Xiaobin Wang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Conducting reinforcement learning, enhance language-based agents, Conducting reinforcement, reinforcement learning, language-based agents
备注:
点击查看摘要
Abstract:Conducting reinforcement learning (RL) in simulated environments offers a cost-effective and highly scalable way to enhance language-based agents. However, previous work has been limited to semi-automated environment synthesis or tasks lacking sufficient difficulty, offering little breadth or depth. In addition, the instability of simulated users integrated into these environments, along with the heterogeneity across simulated environments, poses further challenges for agentic RL. In this work, we propose: (1) a unified pipeline for automated and scalable synthesis of simulated environments associated with high-difficulty but easily verifiable tasks; and (2) an environment level RL algorithm that not only effectively mitigates user instability but also performs advantage estimation at the environment level, thereby improving training efficiency and stability. Comprehensive evaluations on agentic benchmarks, including tau-bench, tau2-Bench, and VitaBench, validate the effectiveness of our proposed method. Further in-depth analyses underscore its out-of-domain generalization.
47. 【2512.22823】NepEMO: A Multi-Label Emotion and Sentiment Analysis on Nepali Reddit with Linguistic Insights and Temporal Trends
链接:https://arxiv.org/abs/2512.22823
作者:Sameer Sitoula,Tej Bahadur Shahi,Laxmi Prasad Bhatt,Anisha Pokhrel,Arjun Neupane
类目:Computation and Language (cs.CL)
关键词:Social media, specifically during challenging, challenging events, natural disasters, political elections
备注: This paper is under consideration in Neural Computing Applications (Springer) journal. This version may be deleted or updated at any time, depending on the journal's policy upon acceptance
点击查看摘要
Abstract:Social media (SM) platforms (e.g. Facebook, Twitter, and Reddit) are increasingly leveraged to share opinions and emotions, specifically during challenging events, such as natural disasters, pandemics, and political elections, and joyful occasions like festivals and celebrations. Among the SM platforms, Reddit provides a unique space for its users to anonymously express their experiences and thoughts on sensitive issues such as health and daily life. In this work, we present a novel dataset, called NepEMO, for multi-label emotion (MLE) and sentiment classification (SC) on the Nepali subreddit post. We curate and build a manually annotated dataset of 4,462 posts (January 2019- June 2025) written in English, Romanised Nepali and Devanagari script for five emotions (fear, anger, sadness, joy, and depression) and three sentiment classes (positive, negative, and neutral). We perform a detailed analysis of posts to capture linguistic insights, including emotion trends, co-occurrence of emotions, sentiment-specific n-grams, and topic modelling using Latent Dirichlet Allocation and TF-IDF keyword extraction. Finally, we compare various traditional machine learning (ML), deep learning (DL), and transformer models for MLE and SC tasks. The result shows that transformer models consistently outperform the ML and DL models for both tasks.
48. 【2512.22795】CNSight: Evaluation of Clinical Note Segmentation Tools
链接:https://arxiv.org/abs/2512.22795
作者:Risha Surana,Adrian Law,Sunwoo Kim,Rishab Sridhar,Angxiao Han,Peiyu Hong
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:electronic medical record, medical record, downstream clinical applications, semi-structured formats, electronic medical
备注:
点击查看摘要
Abstract:Clinical notes are often stored in unstructured or semi-structured formats after extraction from electronic medical record (EMR) systems, which complicates their use for secondary analysis and downstream clinical applications. Reliable identification of section boundaries is a key step toward structuring these notes, as sections such as history of present illness, medications, and discharge instructions each provide distinct clinical contexts. In this work, we evaluate rule-based baselines, domain-specific transformer models, and large language models for clinical note segmentation using a curated dataset of 1,000 notes from MIMIC-IV. Our experiments show that large API-based models achieve the best overall performance, with GPT-5-mini reaching a best average F1 of 72.4 across sentence-level and freetext segmentation. Lightweight baselines remain competitive on structured sentence-level tasks but falter on unstructured freetext. Our results provide guidance for method selection and lay the groundwork for downstream tasks such as information extraction, cohort identification, and automated summarization.
49. 【2512.22778】Fake News Classification in Urdu: A Domain Adaptation Approach for a Low-Resource Language
链接:https://arxiv.org/abs/2512.22778
作者:Muhammad Zain Ali,Bernhard Pfahringer,Tony Smith
类目:Computation and Language (cs.CL)
关键词:widely acknowledged issue, acknowledged issue, social media, researchers worldwide, worldwide are actively
备注:
点击查看摘要
Abstract:Misinformation on social media is a widely acknowledged issue, and researchers worldwide are actively engaged in its detection. However, low-resource languages such as Urdu have received limited attention in this domain. An obvious approach is to utilize a multilingual pretrained language model and fine-tune it for a downstream classification task, such as misinformation detection. However, these models struggle with domain-specific terms, leading to suboptimal performance. To address this, we investigate the effectiveness of domain adaptation before fine-tuning for fake news classification in Urdu, employing a staged training approach to optimize model generalization. We evaluate two widely used multilingual models, XLM-RoBERTa and mBERT, and apply domain-adaptive pretraining using a publicly available Urdu news corpus. Experiments on four publicly available Urdu fake news datasets show that domain-adapted XLM-R consistently outperforms its vanilla counterpart, while domain-adapted mBERT exhibits mixed results.
50. 【2512.22741】xt-Routed Sparse Mixture-of-Experts Model with Explanation and Temporal Alignment for Multi-Modal Sentiment Analysis
链接:https://arxiv.org/abs/2512.22741
作者:Dongning Rao,Yunbiao Zeng,Zhihua Jiang,Jujian Lv
类目:Computation and Language (cs.CL)
关键词:Multi-modal Sentiment Analysis, Sentiment Analysis, Multi-modal Sentiment, Multi-modal Large Language, TEXT
备注: 9 pages, 9 figures, accepted by AAAI 2026
点击查看摘要
Abstract:Human-interaction-involved applications underscore the need for Multi-modal Sentiment Analysis (MSA). Although many approaches have been proposed to address the subtle emotions in different modalities, the power of explanations and temporal alignments is still underexplored. Thus, this paper proposes the Text-routed sparse mixture-of-Experts model with eXplanation and Temporal alignment for MSA (TEXT). TEXT first augments explanations for MSA via Multi-modal Large Language Models (MLLM), and then novelly aligns the epresentations of audio and video through a temporality-oriented neural network block. TEXT aligns different modalities with explanations and facilitates a new text-routed sparse mixture-of-experts with gate fusion. Our temporal alignment block merges the benefits of Mamba and temporal cross-attention. As a result, TEXT achieves the best performance cross four datasets among all tested models, including three recently proposed approaches and three MLLMs. TEXT wins on at least four metrics out of all six metrics. For example, TEXT decreases the mean absolute error to 0.353 on the CH-SIMS dataset, which signifies a 13.5% decrement compared with recently proposed approaches.
51. 【2512.22738】Harnessing Large Language Models for Biomedical Named Entity Recognition
链接:https://arxiv.org/abs/2512.22738
作者:Jian Chen,Leilei Su,Cong Sun
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Biomedical Named Entity, Named Entity Recognition, Background and Objective, Biomedical Named, Entity Recognition
备注:
点击查看摘要
Abstract:Background and Objective: Biomedical Named Entity Recognition (BioNER) is a foundational task in medical informatics, crucial for downstream applications like drug discovery and clinical trial matching. However, adapting general-domain Large Language Models (LLMs) to this task is often hampered by their lack of domain-specific knowledge and the performance degradation caused by low-quality training data. To address these challenges, we introduce BioSelectTune, a highly efficient, data-centric framework for fine-tuning LLMs that prioritizes data quality over quantity. Methods and Results: BioSelectTune reformulates BioNER as a structured JSON generation task and leverages our novel Hybrid Superfiltering strategy, a weak-to-strong data curation method that uses a homologous weak model to distill a compact, high-impact training dataset. Conclusions: Through extensive experiments, we demonstrate that BioSelectTune achieves state-of-the-art (SOTA) performance across multiple BioNER benchmarks. Notably, our model, trained on only 50% of the curated positive data, not only surpasses the fully-trained baseline but also outperforms powerful domain-specialized models like BioMedBERT.
52. 【2512.22737】WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference
链接:https://arxiv.org/abs/2512.22737
作者:Aiwei Liu,Minghua He,Shaoxun Zeng,Sijun Zhang,Linhao Zhang,Chuhan Wu,Wei Jia,Yuan Liu,Xiao Zhou,Jie Zhou
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Language Models, Diffusion Language Models, Large Language, nature limits parallelism
备注: 23 pages, 8 figures, project page: [this https URL](https://wedlm.github.io/)
点击查看摘要
Abstract:Autoregressive (AR) generation is the standard decoding paradigm for Large Language Models (LLMs), but its token-by-token nature limits parallelism at inference time. Diffusion Language Models (DLLMs) offer parallel decoding by recovering multiple masked tokens per step; however, in practice they often fail to translate this parallelism into deployment speed gains over optimized AR engines (e.g., vLLM). A key reason is that many DLLMs rely on bidirectional attention, which breaks standard prefix KV caching and forces repeated contextualization, undermining efficiency. We propose WeDLM, a diffusion decoding framework built entirely on standard causal attention to make parallel generation prefix-cache friendly. The core idea is to let each masked position condition on all currently observed tokens while keeping a strict causal mask, achieved by Topological Reordering that moves observed tokens to the physical prefix while preserving their logical positions. Building on this property, we introduce a streaming decoding procedure that continuously commits confident tokens into a growing left-to-right prefix and maintains a fixed parallel workload, avoiding the stop-and-wait behavior common in block diffusion methods. Experiments show that WeDLM preserves the quality of strong AR backbones while delivering substantial speedups, approaching 3x on challenging reasoning benchmarks and up to 10x in low-entropy generation regimes; critically, our comparisons are against AR baselines served by vLLM under matched deployment settings, demonstrating that diffusion-style decoding can outperform an optimized AR engine in practice.
53. 【2512.22732】Data Augmentation for Classification of Negative Pregnancy Outcomes in Imbalanced Data
链接:https://arxiv.org/abs/2512.22732
作者:Md Badsha Biswas
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Infant mortality remains, United States, Infant mortality, significant public health, public health concern
备注:
点击查看摘要
Abstract:Infant mortality remains a significant public health concern in the United States, with birth defects identified as a leading cause. Despite ongoing efforts to understand the causes of negative pregnancy outcomes like miscarriage, stillbirths, birth defects, and premature birth, there is still a need for more comprehensive research and strategies for intervention. This paper introduces a novel approach that uses publicly available social media data, especially from platforms like Twitter, to enhance current datasets for studying negative pregnancy outcomes through observational research. The inherent challenges in utilizing social media data, including imbalance, noise, and lack of structure, necessitate robust preprocessing techniques and data augmentation strategies. By constructing a natural language processing (NLP) pipeline, we aim to automatically identify women sharing their pregnancy experiences, categorizing them based on reported outcomes. Women reporting full gestation and normal birth weight will be classified as positive cases, while those reporting negative pregnancy outcomes will be identified as negative cases. Furthermore, this study offers potential applications in assessing the causal impact of specific interventions, treatments, or prenatal exposures on maternal and fetal health outcomes. Additionally, it provides a framework for future health studies involving pregnant cohorts and comparator groups. In a broader context, our research showcases the viability of social media data as an adjunctive resource in epidemiological investigations about pregnancy outcomes.
54. 【2512.22725】Mitigating Social Desirability Bias in Random Silicon Sampling
链接:https://arxiv.org/abs/2512.22725
作者:Sashank Chapala,Maksym Mironov,Songgaojun Deng
类目:Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:Large Language Models, Large Language, Social Desirability Bias, Social Desirability, Desirability Bias
备注: 31 pages, 9 figures, and 24 tables
点击查看摘要
Abstract:Large Language Models (LLMs) are increasingly used to simulate population responses, a method known as ``Silicon Sampling''. However, responses to socially sensitive questions frequently exhibit Social Desirability Bias (SDB), diverging from real human data toward socially acceptable answers. Existing studies on social desirability bias in LLM-based sampling remain limited. In this work, we investigate whether minimal, psychologically grounded prompt wording can mitigate this bias and improve alignment between silicon and human samples. We conducted a study using data from the American National Election Study (ANES) on three LLMs from two model families: the open-source Llama-3.1 series and GPT-4.1-mini. We first replicate a baseline silicon sampling study, confirming the persistent Social Desirability Bias. We then test four prompt-based mitigation methods: \emph{reformulated} (neutral, third-person phrasing), \emph{reverse-coded} (semantic inversion), and two meta-instructions, \emph{priming} and \emph{preamble}, respectively encouraging analytics and sincerity. Alignment with ANES is evaluated using Jensen-Shannon Divergence with bootstrap confidence intervals. Our results demonstrate that reformulated prompts most effectively improve alignment by reducing distribution concentration on socially acceptable answers and achieving distributions closer to ANES. Reverse-coding produced mixed results across eligible items, while the Priming and Preamble encouraged response uniformity and showed no systematic benefit for bias mitigation. Our findings validate the efficacy of prompt-based framing controls in mitigating inherent Social Desirability Bias in LLMs, providing a practical path toward more representative silicon samples.
55. 【2512.22712】Beg to Differ: Understanding Reasoning-Answer Misalignment Across Languages
链接:https://arxiv.org/abs/2512.22712
作者:Anaelia Ovalle,Candace Ross,Sebastian Ruder,Adina Williams,Karen Ullrich,Mark Ibrahim,Levent Sagun
类目:Computation and Language (cs.CL)
关键词:Large language models, languages remains underexplored, reasoning quality transfers, Large language, remains underexplored
备注: Accepted to 2025 EMNLP Multilingual Representation Learning Workshop
点击查看摘要
Abstract:Large language models demonstrate strong reasoning capabilities through chain-of-thought prompting, but whether this reasoning quality transfers across languages remains underexplored. We introduce a human-validated framework to evaluate whether model-generated reasoning traces logically support their conclusions across languages. Analyzing 65k reasoning traces from GlobalMMLU questions across 6 languages and 6 frontier models, we uncover a critical blind spot: while models achieve high task accuracy, their reasoning can fail to support their conclusions. Reasoning traces in non-Latin scripts show at least twice as much misalignment between their reasoning and conclusions than those in Latin scripts. We develop an error taxonomy through human annotation to characterize these failures, finding they stem primarily from evidential errors (unsupported claims, ambiguous facts) followed by illogical reasoning steps. Our findings demonstrate that current multilingual evaluation practices provide an incomplete picture of model reasoning capabilities and highlight the need for reasoning-aware evaluation frameworks.
56. 【2512.22705】GHaLIB: A Multilingual Framework for Hope Speech Detection in Low-Resource Languages
链接:https://arxiv.org/abs/2512.22705
作者:Ahmed Abdullah,Sana Fatima,Haroon Mahmood
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Natural Language Processing, Language Processing, underrepresented in Natural, Natural Language, NLP
备注: Accepted and presented at the 15th International Arab Conference on Information Technology (ICAIT); proceedings not yet published
点击查看摘要
Abstract:Hope speech has been relatively underrepresented in Natural Language Processing (NLP). Current studies are largely focused on English, which has resulted in a lack of resources for low-resource languages such as Urdu. As a result, the creation of tools that facilitate positive online communication remains limited. Although transformer-based architectures have proven to be effective in detecting hate and offensive speech, little has been done to apply them to hope speech or, more generally, to test them across a variety of linguistic settings. This paper presents a multilingual framework for hope speech detection with a focus on Urdu. Using pretrained transformer models such as XLM-RoBERTa, mBERT, EuroBERT, and UrduBERT, we apply simple preprocessing and train classifiers for improved results. Evaluations on the PolyHope-M 2025 benchmark demonstrate strong performance, achieving F1-scores of 95.2% for Urdu binary classification and 65.2% for Urdu multi-class classification, with similarly competitive results in Spanish, German, and English. These results highlight the possibility of implementing existing multilingual models in low-resource environments, thus making it easier to identify hope speech and helping to build a more constructive digital discourse.
57. 【2512.22682】Conformal Prediction Sets for Next-Token Prediction in Large Language Models: Balancing Coverage Guarantees with Set Efficiency
链接:https://arxiv.org/abs/2512.22682
作者:Yoshith Roy Kotla,Varshith Roy Kotla
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Deploying large language, rigorous uncertainty quantification, high-stakes domains requires, domains requires rigorous, requires rigorous uncertainty
备注: 10 pages, 5 tables and 1 algorithm
点击查看摘要
Abstract:Deploying large language models (LLMs) in high-stakes domains requires rigorous uncertainty quantification, yet standard softmax probabilities are often poorly calibrated. We present a systematic study of Adaptive Prediction Sets (APS) applied to next-token prediction in transformer-based models with large vocabularies (greater than 250,000 tokens). Our central contribution is the identification of a coverage-efficiency tradeoff: while naive conformal prediction achieves valid coverage, it produces prediction sets of hundreds of tokens, rendering them uninformative. We propose Vocabulary-Aware Conformal Prediction (VACP), a framework that leverages semantic masking and temperature-adjusted scoring to reduce the effective prediction space while provably maintaining marginal coverage. Experiments on Gemma-2B using SQUAD and WikiText benchmarks demonstrate that VACP achieves 89.7 percent empirical coverage (90 percent target) while reducing the mean prediction set size from 847 tokens to 4.3 tokens -- a 197x improvement in efficiency. We provide a theoretical analysis of vocabulary reduction and release our implementation for reproducibility.
58. 【2512.22671】Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2
链接:https://arxiv.org/abs/2512.22671
作者:Pere Martra
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Maximum Absolute Weight, Absolute Weight, Maximum Absolute, GLU-MLP layers, Structured width pruning
备注: 23 pages, 5 figures, 9 tables. Code available at [this https URL](https://github.com/peremartra/llama-glu-expansion-pruning)
点击查看摘要
Abstract:Structured width pruning of GLU-MLP layers, guided by the Maximum Absolute Weight (MAW) criterion, reveals a systematic dichotomy in how reducing the expansion ratio affects different model capabilities. While performance on tasks relying on parametric knowledge (e.g., MMLU, GSM8K) and perplexity metrics degrades predictably, instruction-following capabilities improve substantially (+46% to +75% in IFEval for Llama-3.2-1B and 3B models), and multi-step reasoning remains robust (MUSR). This pattern challenges the prevailing assumption that pruning induces uniform degradation. We evaluated seven expansion ratio configurations using comprehensive benchmarks assessing factual knowledge, mathematical reasoning, language comprehension, instruction-following, and truthfulness. Our analysis identifies the expansion ratio as a critical architectural parameter that selectively modulates cognitive capabilities, rather than merely serving as a compression metric. We provide the first systematic characterization of this selective preservation phenomenon. Notably, we document a robust inverse correlation (r = -0.864, p = 0.012 in Llama-3B) between factual knowledge capacity (MMLU) and truthfulness metrics (TruthfulQA-MC2): as knowledge degrades, the model's ability to discriminate misconceptions improves consistently. This connects two previously distinct research areas, demonstrating that MAW-guided width pruning acts as a selective filter, reducing parametric knowledge while preserving or enhancing behavioral alignment. Additionally, we quantify context-dependent efficiency trade-offs: pruned configurations achieve up to 23% reduction in energy consumption (J/token) but incur penalties in single-request latency, whereas batch processing workloads benefit uniformly.
59. 【2512.22650】Scaling Unverifiable Rewards: A Case Study on Visual Insights
链接:https://arxiv.org/abs/2512.22650
作者:Shuyu Gan,James Mooney,Pan Hao,Renxiang Wang,Mingyi Hong,Qianwen Wang,Dongyeop Kang
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Model, Large Language, Language Model, iterative refinement guided, Selective TTS
备注: 32 pages, 25 figures
点击查看摘要
Abstract:Large Language Model (LLM) agents can increasingly automate complex reasoning through Test-Time Scaling (TTS), iterative refinement guided by reward signals. However, many real-world tasks involve multi-stage pipeline whose final outcomes lack verifiable rewards or sufficient data to train robust reward models, making judge-based refinement prone to accumulate error over stages. We propose Selective TTS, a process-based refinement framework that scales inference across different stages in multi-agent pipeline, instead of repeated refinement over time by prior work. By distributing compute across stages and pruning low-quality branches early using process-specific judges, Selective TTS mitigates the judge drift and stabilizes refinement. Grounded in the data science pipeline, we build an end-to-end multi-agent pipeline for generating visually insightful charts and report of given dataset, and design a reliable LLM-based judge model, aligned with human experts (Kendall's {\tau}=0.55). Our proposed selective TTS then improves insight quality under a fixed compute budget, increasing mean scores from 61.64 to 65.86 while reducing variance. We hope our findings serve as the first step toward to scaling complex, open-ended tasks with unverifiable rewards, such as scientific discovery and story generation.
60. 【2512.22631】Evaluating GRPO and DPO for Faithful Chain-of-Thought Reasoning in LLMs
链接:https://arxiv.org/abs/2512.22631
作者:Hadi Mohammadi,Tamas Kozak,Anastasia Giachanou
类目:Computation and Language (cs.CL)
关键词:tasks requiring multi-step, requiring multi-step reasoning, large language models, powerful technique, problem-solving capabilities
备注:
点击查看摘要
Abstract:Chain-of-thought (CoT) reasoning has emerged as a powerful technique for improving the problem-solving capabilities of large language models (LLMs), particularly for tasks requiring multi-step reasoning. However, recent studies show that CoT explanations often fail to reflect the model's actual reasoning process, as models may produce coherent yet misleading justifications or modify answers without acknowledging external cues. Such discrepancies undermine the reliability of CoT-based methods for safety supervision and alignment monitoring, as models can generate plausible but deceptive rationales for incorrect answers. To better understand this limitation, we evaluate two optimization methods, Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO), in their ability to improve CoT faithfulness. Our experiments show that GRPO achieves higher performance than DPO in larger models, with the Qwen2.5-14B-Instruct model attaining the best results across all evaluation metrics. Both approaches exhibit positive correlations between model size and performance, but GRPO shows greater potential for improving faithfulness metrics, albeit with less stable behavior at smaller scales. These results suggest that GRPO offers a promising direction for developing more transparent and trustworthy reasoning in LLMs.
61. 【2512.22630】On the Role of Discreteness in Diffusion LLMs
链接:https://arxiv.org/abs/2512.22630
作者:Ziqi Jin,Bin Wang,Xiang Lin,Lidong Bing,Aixin Sun
类目:Computation and Language (cs.CL)
关键词:highly structured nature, offer appealing properties, models offer appealing, iterative refinement, Diffusion
备注:
点击查看摘要
Abstract:Diffusion models offer appealing properties for language generation, such as parallel decoding and iterative refinement, but the discrete and highly structured nature of text challenges the direct application of diffusion principles. In this paper, we revisit diffusion language modeling from the view of diffusion process and language modeling, and outline five properties that separate diffusion mechanics from language-specific requirements. We first categorize existing approaches into continuous diffusion in embedding space and discrete diffusion over tokens. We then show that each satisfies only part of the five essential properties and therefore reflects a structural trade-off. Through analyses of recent large diffusion language models, we identify two central issues: (i) uniform corruption does not respect how information is distributed across positions, and (ii) token-wise marginal training cannot capture multi-token dependencies during parallel decoding. These observations motivate diffusion processes that align more closely with the structure of text, and encourage future work toward more coherent diffusion language models.
62. 【2512.22628】M2G-Eval: Enhancing and Evaluating Multi-granularity Multilingual Code Generation
链接:https://arxiv.org/abs/2512.22628
作者:Fanglin Xu,Wei Zhang,Jian Yang,Guo Chen,Aishan Liu,Zhoujun Li,Xianglong Liu,Bryan Dai
类目:Computation and Language (cs.CL)
关键词:sparked significant research, significant research interest, existing benchmarks predominantly, benchmarks predominantly assess, single structural granularity
备注:
点击查看摘要
Abstract:The rapid advancement of code large language models (LLMs) has sparked significant research interest in systematically evaluating their code generation capabilities, yet existing benchmarks predominantly assess models at a single structural granularity and focus on limited programming languages, obscuring fine-grained capability variations across different code scopes and multilingual scenarios. We introduce M2G-Eval, a multi-granularity, multilingual framework for evaluating code generation in large language models (LLMs) across four levels: Class, Function, Block, and Line. Spanning 18 programming languages, M2G-Eval includes 17K+ training tasks and 1,286 human-annotated, contamination-controlled test instances. We develop M2G-Eval-Coder models by training Qwen3-8B with supervised fine-tuning and Group Relative Policy Optimization. Evaluating 30 models (28 state-of-the-art LLMs plus our two M2G-Eval-Coder variants) reveals three main findings: (1) an apparent difficulty hierarchy, with Line-level tasks easiest and Class-level most challenging; (2) widening performance gaps between full- and partial-granularity languages as task complexity increases; and (3) strong cross-language correlations, suggesting that models learn transferable programming concepts. M2G-Eval enables fine-grained diagnosis of code generation capabilities and highlights persistent challenges in synthesizing complex, long-form code.
63. 【2512.22627】Chain-of-thought Reviewing and Correction for Time Series Question Answering
链接:https://arxiv.org/abs/2512.22627
作者:Chen Su,Yuanhe Tian,Yan Song
类目:Computation and Language (cs.CL)
关键词:natural language interface, unified natural language, series question answering, diverse time series, time series question
备注:
点击查看摘要
Abstract:With the advancement of large language models (LLMs), diverse time series analysis tasks are reformulated as time series question answering (TSQA) through a unified natural language interface. However, existing LLM-based approaches largely adopt general natural language processing techniques and are prone to reasoning errors when handling complex numerical sequences. Different from purely textual tasks, time series data are inherently verifiable, enabling consistency checking between reasoning steps and the original input. Motivated by this property, we propose T3LLM, which performs multi-step reasoning with an explicit correction mechanism for time series question answering. The T3LLM framework consists of three LLMs, namely, a worker, a reviewer, and a student, that are responsible for generation, review, and reasoning learning, respectively. Within this framework, the worker generates step-wise chains of thought (CoT) under structured prompts, while the reviewer inspects the reasoning, identifies erroneous steps, and provides corrective comments. The collaboratively generated corrected CoT are used to fine-tune the student model, internalizing multi-step reasoning and self-correction into its parameters. Experiments on multiple real-world TSQA benchmarks demonstrate that T3LLM achieves state-of-the-art performance over strong LLM-based baselines.
64. 【2512.22615】Dream-VL Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone
链接:https://arxiv.org/abs/2512.22615
作者:Jiacheng Ye,Shansan Gong,Jiahui Gao,Junming Fan,Shuang Wu,Wei Bi,Haoli Bai,Lifeng Shang,Lingpeng Kong
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:autoregressive Large Vision-Language, dynamic robotic control, Large Vision-Language Models, achieved remarkable success, autoregressive Large
备注:
点击查看摘要
Abstract:While autoregressive Large Vision-Language Models (VLMs) have achieved remarkable success, their sequential generation often limits their efficacy in complex visual planning and dynamic robotic control. In this work, we investigate the potential of constructing Vision-Language Models upon diffusion-based large language models (dLLMs) to overcome these limitations. We introduce Dream-VL, an open diffusion-based VLM (dVLM) that achieves state-of-the-art performance among previous dVLMs. Dream-VL is comparable to top-tier AR-based VLMs trained on open data on various benchmarks but exhibits superior potential when applied to visual planning tasks. Building upon Dream-VL, we introduce Dream-VLA, a dLLM-based Vision-Language-Action model (dVLA) developed through continuous pre-training on open robotic datasets. We demonstrate that the natively bidirectional nature of this diffusion backbone serves as a superior foundation for VLA tasks, inherently suited for action chunking and parallel generation, leading to significantly faster convergence in downstream fine-tuning. Dream-VLA achieves top-tier performance of 97.2% average success rate on LIBERO, 71.4% overall average on SimplerEnv-Bridge, and 60.5% overall average on SimplerEnv-Fractal, surpassing leading models such as $\pi_0$ and GR00T-N1. We also validate that dVLMs surpass AR baselines on downstream tasks across different training objectives. We release both Dream-VL and Dream-VLA to facilitate further research in the community.
65. 【2512.22603】Structured Prompting and LLM Ensembling for Multimodal Conversational Aspect-based Sentiment Analysis
链接:https://arxiv.org/abs/2512.22603
作者:Zhiqiang Gao,Shihao Gao,Zixing Zhang,Yihao Guo,Hongyu Chen,Jing Han
类目:Computation and Language (cs.CL)
关键词:building emotionally intelligent, Multimodal Conversational Aspect-based, Conversational Aspect-based Sentiment, complex yet crucial, building emotionally
备注:
点击查看摘要
Abstract:Understanding sentiment in multimodal conversations is a complex yet crucial challenge toward building emotionally intelligent AI systems. The Multimodal Conversational Aspect-based Sentiment Analysis (MCABSA) Challenge invited participants to tackle two demanding subtasks: (1) extracting a comprehensive sentiment sextuple, including holder, target, aspect, opinion, sentiment, and rationale from multi-speaker dialogues, and (2) detecting sentiment flipping, which detects dynamic sentiment shifts and their underlying triggers. For Subtask-I, in the present paper, we designed a structured prompting pipeline that guided large language models (LLMs) to sequentially extract sentiment components with refined contextual understanding. For Subtask-II, we further leveraged the complementary strengths of three LLMs through ensembling to robustly identify sentiment transitions and their triggers. Our system achieved a 47.38% average score on Subtask-I and a 74.12% exact match F1 on Subtask-II, showing the effectiveness of step-wise refinement and ensemble strategies in rich, multimodal sentiment analysis tasks.
66. 【2512.22562】Learning When Not to Attend Globally
链接:https://arxiv.org/abs/2512.22562
作者:Xuan Luo,Kailai Zhang,Xifeng Yan
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:humans focus primarily, Large Language Models, recap prior context, reading books, humans focus
备注:
点击查看摘要
Abstract:When reading books, humans focus primarily on the current page, flipping back to recap prior context only when necessary. Similarly, we demonstrate that Large Language Models (LLMs) can learn to dynamically determine when to attend to global context. We propose All-or-Here Attention (AHA), which utilizes a binary router per attention head to dynamically toggle between full attention and local sliding window attention for each token. Our results indicate that with a window size of 256 tokens, up to 93\% of the original full attention operations can be replaced by sliding window attention without performance loss. Furthermore, by evaluating AHA across various window sizes, we identify a long-tail distribution in context dependency, where the necessity for full attention decays rapidly as the local window expands. By decoupling local processing from global access, AHA reveals that full attention is largely redundant, and that efficient inference requires only on-demand access to the global context.
67. 【2512.22491】ManchuTTS: Towards High-Quality Manchu Speech Synthesis via Flow Matching and Hierarchical Text Representation
链接:https://arxiv.org/abs/2512.22491
作者:Suhua Wang,Zifan Wang,Xiaoxin Sun,D. J. Wang,Zhanbo Liu,Xin Li
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:presents unique challenges, Manchu presents unique, strong phonological agglutination, including severe data, severe data scarcity
备注:
点击查看摘要
Abstract:As an endangered language, Manchu presents unique challenges for speech synthesis, including severe data scarcity and strong phonological agglutination. This paper proposes ManchuTTS(Manchu Text to Speech), a novel approach tailored to Manchu's linguistic characteristics. To handle agglutination, this method designs a three-tier text representation (phoneme, syllable, prosodic) and a cross-modal hierarchical attention mechanism for multi-granular alignment. The synthesis model integrates deep convolutional networks with a flow-matching Transformer, enabling efficient, non-autoregressive generation. This method further introduce a hierarchical contrastive loss to guide structured acoustic-linguistic correspondence. To address low-resource constraints, This method construct the first Manchu TTS dataset and employ a data augmentation strategy. Experiments demonstrate that ManchuTTS attains a MOS of 4.52 using a 5.2-hour training subset derived from our full 6.24-hour annotated corpus, outperforming all baseline models by a notable margin. Ablations confirm hierarchical guidance improves agglutinative word pronunciation accuracy (AWPA) by 31% and prosodic naturalness by 27%.
68. 【2512.22487】Constituency Structure over Eojeol in Korean Treebanks
链接:https://arxiv.org/abs/2512.22487
作者:Jungyeul Park,Chulwoo Park
类目:Computation and Language (cs.CL)
关键词:fundamental representational question, eojeol based, raises a fundamental, fundamental representational, representational question
备注:
点击查看摘要
Abstract:The design of Korean constituency treebanks raises a fundamental representational question concerning the choice of terminal units. Although Korean words are morphologically complex, treating morphemes as constituency terminals conflates word internal morphology with phrase level syntactic structure and creates mismatches with eojeol based dependency resources. This paper argues for an eojeol based constituency representation, with morphological segmentation and fine grained part of speech information encoded in a separate, non constituent layer. A comparative analysis shows that, under explicit normalization assumptions, the Sejong and Penn Korean treebanks can be treated as representationally equivalent at the eojeol based constituency level. Building on this result, we outline an eojeol based annotation scheme that preserves interpretable constituency and supports cross treebank comparison and constituency dependency conversion.
69. 【2512.22455】AFA-LoRA: Enabling Non-Linear Adaptations in LoRA with Activation Function Annealing
链接:https://arxiv.org/abs/2512.22455
作者:Jiacheng Li,Jianchao Tan,Zhidong Yang,Feiye Huo,Yerui Sun,Yuchen Xie,Xunliang Cai
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:widely adopted parameter-efficient, widely adopted, Low-Rank Adaptation, PEFT, adopted parameter-efficient fine-tuning
备注:
点击查看摘要
Abstract:Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning (PEFT) method. However, its linear adaptation process limits its expressive power. This means there is a gap between the expressive power of linear training and non-linear training. To bridge this gap, we propose AFA-LoRA, a novel training strategy that brings non-linear expressivity to LoRA while maintaining its seamless mergeability. Our key innovation is an annealed activation function that transitions from a non-linear to a linear transformation during training, allowing the adapter to initially adopt stronger representational capabilities before converging to a mergeable linear form. We implement our method on supervised fine-tuning, reinforcement learning, and speculative decoding. The results show that AFA-LoRA reduces the performance gap between LoRA and full-parameter training. This work enables a more powerful and practical paradigm of parameter-efficient adaptation.
70. 【2512.22443】Exploring the Vertical-Domain Reasoning Capabilities of Large Language Models
链接:https://arxiv.org/abs/2512.22443
作者:Jie Zhou,Xin Chen,Jie Zhang,Zhe Li
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, Language Models, reshaping learning paradigms, cognitive processes
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) are reshaping learning paradigms, cognitive processes, and research methodologies across a wide range of domains. Integrating LLMs with professional fields and redefining the relationship between LLMs and domain-specific applications has become a critical challenge for promoting enterprise digital transformation and broader social development. To effectively integrate LLMs into the accounting domain, it is essential to understand their domain-specific reasoning capabilities. This study introduces the concept of vertical-domain accounting reasoning and establishes evaluation criteria by analyzing the training data characteristics of representative GLM-series models. These criteria provide a foundation for subsequent research on reasoning paradigms and offer benchmarks for improving accounting reasoning performance. Based on this framework, we evaluate several representative models, including GLM-6B, GLM-130B, GLM-4, and OpenAI GPT-4, on a set of accounting reasoning tasks. Experimental results show that different prompt engineering strategies lead to varying degrees of performance improvement across models, with GPT-4 achieving the strongest accounting reasoning capability. However, current LLMs still fall short of real-world application requirements. In particular, further optimization is needed for deployment in enterprise-level accounting scenarios to fully realize the potential value of LLMs in this domain.
71. 【2512.22442】HiFi-RAG: Hierarchical Content Filtering and Two-Pass Generation for Open-Domain RAG
链接:https://arxiv.org/abs/2512.22442
作者:Cattalyya Nuengsigkapian
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:open-domain settings faces, settings faces significant, faces significant challenges, Hierarchical Filtering RAG, user intent
备注: A winning solution for the NeurIPS 2025 MMU-RAGent Competition (Closed-Source Text-to-Text Static Evaluation)
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) in open-domain settings faces significant challenges regarding irrelevant information in retrieved documents and the alignment of generated answers with user intent. We present HiFi-RAG (Hierarchical Filtering RAG), the winning closed-source system in the Text-to-Text static evaluation of the MMU-RAGent NeurIPS 2025 Competition. Our approach moves beyond standard embedding-based retrieval via a multi-stage pipeline. We leverage the speed and cost-efficiency of Gemini 2.5 Flash (4-6x cheaper than Pro) for query formulation, hierarchical content filtering, and citation attribution, while reserving the reasoning capabilities of Gemini 2.5 Pro for final answer generation. On the MMU-RAGent validation set, our system outperformed the baseline, improving ROUGE-L to 0.274 (+19.6%) and DeBERTaScore to 0.677 (+6.2%). On Test2025, our custom dataset evaluating questions that require post-cutoff knowledge (post January 2025), HiFi-RAG outperforms the parametric baseline by 57.4% in ROUGE-L and 14.9% in DeBERTaScore.
72. 【2512.22431】Monadic Context Engineering
链接:https://arxiv.org/abs/2512.22431
作者:Yifan Zhang,Mengdi Wang
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL)
关键词:Large Language Models, Language Models, Large Language, proliferation of Large, autonomous agents capable
备注: Project Page: [this https URL](https://github.com/yifanzhang-pro/monadic-context-engineering)
点击查看摘要
Abstract:The proliferation of Large Language Models (LLMs) has catalyzed a shift towards autonomous agents capable of complex reasoning and tool use. However, current agent architectures are frequently constructed using imperative, ad hoc patterns. This results in brittle systems plagued by difficulties in state management, error handling, and concurrency. This paper introduces Monadic Context Engineering (MCE), a novel architectural paradigm leveraging the algebraic structures of Functors, Applicative Functors, and Monads to provide a formal foundation for agent design. MCE treats agent workflows as computational contexts where cross-cutting concerns, such as state propagation, short-circuiting error handling, and asynchronous execution, are managed intrinsically by the algebraic properties of the abstraction. We demonstrate how Monads enable robust sequential composition, how Applicatives provide a principled structure for parallel execution, and crucially, how Monad Transformers allow for the systematic composition of these capabilities. This layered approach enables developers to construct complex, resilient, and efficient AI agents from simple, independently verifiable components. We further extend this framework to describe Meta-Agents, which leverage MCE for generative orchestration, dynamically creating and managing sub-agent workflows through metaprogramming. Project Page: this https URL.
73. 【2512.22416】Hallucination Detection and Evaluation of Large Language Model
链接:https://arxiv.org/abs/2512.22416
作者:Chenggong Zhang,Haopeng Wang
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:Large Language Models, Large Language, Language Models, pose a significant, significant challenge
备注:
点击查看摘要
Abstract:Hallucinations in Large Language Models (LLMs) pose a significant challenge, generating misleading or unverifiable content that undermines trust and reliability. Existing evaluation methods, such as KnowHalu, employ multi-stage verification but suffer from high computational costs. To address this, we integrate the Hughes Hallucination Evaluation Model (HHEM), a lightweight classification-based framework that operates independently of LLM-based judgments, significantly improving efficiency while maintaining high detection accuracy. We conduct a comparative analysis of hallucination detection methods across various LLMs, evaluating True Positive Rate (TPR), True Negative Rate (TNR), and Accuracy on question-answering (QA) and summarization tasks. Our results show that HHEM reduces evaluation time from 8 hours to 10 minutes, while HHEM with non-fabrication checking achieves the highest accuracy \(82.2\%\) and TPR \(78.9\%\). However, HHEM struggles with localized hallucinations in summarization tasks. To address this, we introduce segment-based retrieval, improving detection by verifying smaller text components. Additionally, our cumulative distribution function (CDF) analysis indicates that larger models (7B-9B parameters) generally exhibit fewer hallucinations, while intermediate-sized models show higher instability. These findings highlight the need for structured evaluation frameworks that balance computational efficiency with robust factual validation, enhancing the reliability of LLM-generated content.
74. 【2512.22385】LLM-Guided Exemplar Selection for Few-Shot Wearable-Sensor Human Activity Recognition
链接:https://arxiv.org/abs/2512.22385
作者:Elsen Ronando,Sozo Inoue
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Human Activity Recognition, Human Activity, Activity Recognition, distinguish similar weara-ble, LLM-Guided Exemplar Selection
备注: This paper has been accepted for presentation at ABC 2026. The manuscript is under revision prior to camera-ready submission
点击查看摘要
Abstract:In this paper, we propose an LLM-Guided Exemplar Selection framework to address a key limitation in state-of-the-art Human Activity Recognition (HAR) methods: their reliance on large labeled datasets and purely geometric exemplar selection, which often fail to distinguish similar weara-ble sensor activities such as walking, walking upstairs, and walking downstairs. Our method incorporates semantic reasoning via an LLM-generated knowledge prior that captures feature importance, inter-class confusability, and exemplar budget multipliers, and uses it to guide exemplar scoring and selection. These priors are combined with margin-based validation cues, PageRank centrality, hubness penalization, and facility-location optimization to obtain a compact and informative set of exemplars. Evaluated on the UCI-HAR dataset under strict few-shot conditions, the framework achieves a macro F1-score of 88.78%, outperforming classical approaches such as random sampling, herding, and $k$-center. The results show that LLM-derived semantic priors, when integrated with structural and geometric cues, provide a stronger foundation for selecting representative sensor exemplars in few-shot wearable-sensor HAR.
75. 【2512.22378】owards Efficient Post-Training via Fourier-Driven Adapter Architectures
链接:https://arxiv.org/abs/2512.22378
作者:Donggyun Bae,Jongil Park
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:termed Fourier-Activated Adapter, termed Fourier-Activated, Fourier-Activated Adapter, lightweight adapter modules, pre-trained language models
备注: 10 pages, 5 figures
点击查看摘要
Abstract:We propose a novel framework, termed Fourier-Activated Adapter (FAA), for parameter-efficient fine-tuning of large pre-trained language models. By incorporating random Fourier features into lightweight adapter modules, FAA decomposes intermediate representations into complementary low- and high-frequency components, enabling frequency-aware modulation of semantic information. This design allows the model to selectively emphasize informative frequency bands during adaptation while preserving the representational capacity of the frozen backbone. Extensive experiments on GLUE, E2E NLG, and instruction-tuning benchmarks demonstrate that FAA consistently achieves competitive or superior performance compared to existing parameter-efficient fine-tuning methods, while maintaining low computational and memory overhead. Ablation studies further verify the effectiveness of frequency-aware activation and adaptive weighting mechanisms, highlighting FAA as a robust and efficient approach for post-training large language models.
76. 【2512.22376】he Syntax of qulk-clauses in Yemeni Ibbi Arabic: A Minimalist Approach
链接:https://arxiv.org/abs/2512.22376
作者:Zubaida Mohammed Albadani,Mohammed Q. Shormani
类目:Computation and Language (cs.CL)
关键词:Yemeni Ibbi Arabic, Ibbi Arabic, Yemeni Ibbi, Minimalist Program, YIA
备注: 23 pages, 6 figures
点击查看摘要
Abstract:This study investigates the syntax of qulk-clauses in Yemeni Ibbi Arabic (YIA) within the Minimalist Program. The construction qulk-clause, a morphologically fused form meaning 'I said,' introduces embedded declarative interrogative, and imperative clauses, often eithout complementizer. The central proposal of this paper is that qulk-clauses are biclausal structures in which qulk functions a clause-embedding predicate sec;ecting a dull CP complement. By applying core minimalist operations, viz., Merge, Move, Agree, and Spell-out, the study provides a layered syntactic analysis of qulk-clauses, for illustrating how their derivation proceeds through standard computational steps and post-syntactic processes such as Morphological Merger. The proposal also accounts for dialect-specific features like bipartite negation, cliticization, and CP embedding. The findings offer theoretical contributions to generative syntax, specifically minimalism. The study concludes raising theoretical questions concerning extending the analysis to the addressee-clause kil-k 'you said'. It also provides insights into the possibility of the universality of minimalism.
77. 【2512.22336】Agent2World: Learning to Generate Symbolic World Models via Adaptive Multi-Agent Feedback
链接:https://arxiv.org/abs/2512.22336
作者:Mengkang Hu,Bowei Xia,Yuran Wu,Ailing Yu,Yude Zou,Qiguang Chen,Shijian Wang,Jiarui Jin,Kexin Li,Wenxiang Jiao,Yuan Lu,Ping Luo
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:large-scale verifiable supervision, Symbolic world models, Symbolic world, verifiable supervision, central to model-based
备注: 48 pages, 15 tables, 7 figures, Project page: [this https URL](https://agent2world.github.io)
点击查看摘要
Abstract:Symbolic world models (e.g., PDDL domains or executable simulators) are central to model-based planning, but training LLMs to generate such world models is limited by the lack of large-scale verifiable supervision. Current approaches rely primarily on static validation methods that fail to catch behavior-level errors arising from interactive execution. In this paper, we propose Agent2World, a tool-augmented multi-agent framework that achieves strong inference-time world-model generation and also serves as a data engine for supervised fine-tuning, by grounding generation in multi-agent feedback. Agent2World follows a three-stage pipeline: (i) A Deep Researcher agent performs knowledge synthesis by web searching to address specification gaps; (ii) A Model Developer agent implements executable world models; And (iii) a specialized Testing Team conducts adaptive unit testing and simulation-based validation. Agent2World demonstrates superior inference-time performance across three benchmarks spanning both Planning Domain Definition Language (PDDL) and executable code representations, achieving consistent state-of-the-art results. Beyond inference, Testing Team serves as an interactive environment for the Model Developer, providing behavior-aware adaptive feedback that yields multi-turn training trajectories. The model fine-tuned on these trajectories substantially improves world-model generation, yielding an average relative gain of 30.95% over the same model before training. Project page: this https URL.
78. 【2512.22334】SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence
链接:https://arxiv.org/abs/2512.22334
作者:Yiheng Wang,Yixin Chen,Shuo Li,Yifan Zhou,Bo Liu,Hengjian Gao,Jiakang Yuan,Jia Bu,Wanghan Xu,Yuhao Zhou,Xiangyu Zhao,Zhiwang Zhou,Fengxiang Wang,Haodong Duan,Songyang Zhang,Jun Yao,Han Deng,Yizhou Wang,Jiabei Xiao,Jiaqi Liu,Encheng Su,Yujie Liu,Weida Wang,Junchi Yao,Shenghe Zheng,Haoran Sun,Runmin Ma,Xiangchao Yan,Bo Zhang,Dongzhan Zhou,Shufei Zhang,Peng Ye,Xiaosong Wang,Shixiang Tang,Wenlong Zhang,Lei Bai
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Scientific Multimodal Reasoning, Scientific Multimodal Understanding, Scientific Multimodal Perception, Scientific Symbolic Reasoning, Scientific Knowledge Understanding
备注:
点击查看摘要
Abstract:We introduce SciEvalKit, a unified benchmarking toolkit designed to evaluate AI models for science across a broad range of scientific disciplines and task capabilities. Unlike general-purpose evaluation platforms, SciEvalKit focuses on the core competencies of scientific intelligence, including Scientific Multimodal Perception, Scientific Multimodal Reasoning, Scientific Multimodal Understanding, Scientific Symbolic Reasoning, Scientific Code Generation, Science Hypothesis Generation and Scientific Knowledge Understanding. It supports six major scientific domains, spanning from physics and chemistry to astronomy and materials science. SciEvalKit builds a foundation of expert-grade scientific benchmarks, curated from real-world, domain-specific datasets, ensuring that tasks reflect authentic scientific challenges. The toolkit features a flexible, extensible evaluation pipeline that enables batch evaluation across models and datasets, supports custom model and dataset integration, and provides transparent, reproducible, and comparable results. By bridging capability-based evaluation and disciplinary diversity, SciEvalKit offers a standardized yet customizable infrastructure to benchmark the next generation of scientific foundation models and intelligent agents. The toolkit is open-sourced and actively maintained to foster community-driven development and progress in AI4Science.
79. 【2512.22322】SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents
链接:https://arxiv.org/abs/2512.22322
作者:Shaofei Cai,Yulei Qin,Haojia Lin,Zihan Xu,Gang Li,Yuchen Shi,Zongyi Li,Yong Mao,Siqi Cai,Xiaoyu Tan,Yitao Liang,Ke Li,Xing Sun
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
关键词:Agentic reinforcement learning, complex GUI tasks, holds great promise, scalability remains severely, remains severely hampered
备注:
点击查看摘要
Abstract:Agentic reinforcement learning (RL) holds great promise for the development of autonomous agents under complex GUI tasks, but its scalability remains severely hampered by the verification of task completion. Existing task verification is treated as a passive, post-hoc process: a verifier (i.e., rule-based scoring script, reward or critic model, and LLM-as-a-Judge) analyzes the agent's entire interaction trajectory to determine if the agent succeeds. Such processing of verbose context that contains irrelevant, noisy history poses challenges to the verification protocols and therefore leads to prohibitive cost and low reliability. To overcome this bottleneck, we propose SmartSnap, a paradigm shift from this passive, post-hoc verification to proactive, in-situ self-verification by the agent itself. We introduce the Self-Verifying Agent, a new type of agent designed with dual missions: to not only complete a task but also to prove its accomplishment with curated snapshot evidences. Guided by our proposed 3C Principles (Completeness, Conciseness, and Creativity), the agent leverages its accessibility to the online environment to perform self-verification on a minimal, decisive set of snapshots. Such evidences are provided as the sole materials for a general LLM-as-a-Judge verifier to determine their validity and relevance. Experiments on mobile tasks across model families and scales demonstrate that our SmartSnap paradigm allows training LLM-driven agents in a scalable manner, bringing performance gains up to 26.08% and 16.66% respectively to 8B and 30B models. The synergizing between solution finding and evidence seeking facilitates the cultivation of efficient, self-verifying agents with competitive performance against DeepSeek V3.1 and Qwen3-235B-A22B.
80. 【2512.22293】Learning from Negative Examples: Why Warning-Framed Training Data Teaches What It Warns Against
链接:https://arxiv.org/abs/2512.22293
作者:Tsogt-Ochir Enkhbayar
类目:Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
关键词:teach language models, Warning-framed content, training data, teach language, warned-against behavior
备注: Submitted to Neel Nanda's MATS Stream
点击查看摘要
Abstract:Warning-framed content in training data (e.g., "DO NOT USE - this code is vulnerable") does not, it turns out, teach language models to avoid the warned-against behavior. In experiments reported here, models exposed to such warnings reproduced the flagged content at rates statistically indistinguishable from models given the content directly (76.7% vs. 83.3%). Why? Sparse autoencoder analysis points to a failure of orthogonalization: "describing X" and "performing X" activate overlapping latent features. Feature #8684, which tracks code execution patterns, fires at comparable magnitude in both warning and exploitation contexts. A related phenomenon, what I call "stealth slip", allows conversational preambles to rotate activations into subspaces that linear probes miss entirely. Prompting and inference-time steering do not fix this; training-time feature ablation does. The upshot is that statistical co-occurrence dominates over pragmatic interpretation in current architectures. Models learn what tends to follow a context, not why it appeared there.
81. 【2512.22227】Hierarchical Geometry of Cognitive States in Transformer Embedding Spaces
链接:https://arxiv.org/abs/2512.22227
作者:Sophie Zhao
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:representations remains underexplored, Recent work, remains underexplored, transformer-based language models, language models learn
备注:
点击查看摘要
Abstract:Recent work has shown that transformer-based language models learn rich geometric structure in their embedding spaces, yet the presence of higher-level cognitive organization within these representations remains underexplored. In this work, we investigate whether sentence embeddings encode a graded, hierarchical structure aligned with human-interpretable cognitive or psychological attributes. We construct a dataset of 480 natural-language sentences annotated with continuous ordinal energy scores and discrete tier labels spanning seven ordered cognitive categories. Using fixed sentence embeddings from multiple transformer models, we evaluate the recoverability of these annotations via linear and shallow nonlinear probes. Across models, both continuous scores and tier labels are reliably decodable, with shallow nonlinear probes providing consistent performance gains over linear probes. Lexical TF-IDF baselines perform substantially worse, indicating that the observed structure is not attributable to surface word statistics alone. Nonparametric permutation tests further confirm that probe performance exceeds chance under label-randomization nulls. Qualitative analyses using UMAP visualizations and confusion matrices reveal smooth low-to-high gradients and predominantly adjacent-tier confusions in embedding space. Taken together, these results provide evidence that transformer embedding spaces exhibit a hierarchical geometric organization aligned with human-defined cognitive attributes, while remaining agnostic to claims of internal awareness or phenomenology.
82. 【2512.22226】VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs
链接:https://arxiv.org/abs/2512.22226
作者:Naishan Zheng,Jie Huang,Qingpei Guo,Feng Zhao
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:large language models, remains challenging due, multimodal large language, temporally coherent representations, language models
备注: 11 pages, 4 figures
点击查看摘要
Abstract:Understanding long videos with multimodal large language models (MLLMs) remains challenging due to the heavy redundancy across frames and the need for temporally coherent representations. Existing static strategies, such as sparse sampling, frame compression, and clustering, are optimized for offline settings and often produce fragmented or over-compressed outputs when applied to continuous video streams. We present VideoScaffold, a dynamic representation framework designed for streaming video understanding. It adaptively adjusts event granularity according to video duration while preserving fine-grained visual semantics. VideoScaffold introduces two key components: Elastic-Scale Event Segmentation (EES), which performs prediction-guided segmentation to dynamically refine event boundaries, and Hierarchical Event Consolidation (HEC), which progressively aggregates semantically related segments into multi-level abstractions. Working in concert, EES and HEC enable VideoScaffold to transition smoothly from fine-grained frame understanding to abstract event reasoning as the video stream unfolds. Extensive experiments across both offline and streaming video understanding benchmarks demonstrate that VideoScaffold achieves state-of-the-art performance. The framework is modular and plug-and-play, seamlessly extending existing image-based MLLMs to continuous video comprehension. The code is available at this https URL.
83. 【2512.22213】On the Existence and Behaviour of Secondary Attention Sinks
链接:https://arxiv.org/abs/2512.22213
作者:Jeffrey T.H. Wong,Cheng Zhang,Louis Mahon,Wayne Luk,Anton Isopoussu,Yiren Zhao
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:limited semantic relevance, receive disproportionately high, disproportionately high attention, sinks, Attention
备注:
点击查看摘要
Abstract:Attention sinks are tokens, often the beginning-of-sequence (BOS) token, that receive disproportionately high attention despite limited semantic relevance. In this work, we identify a class of attention sinks, which we term secondary sinks, that differ fundamentally from the sinks studied in prior works, which we term primary sinks. While prior works have identified that tokens other than BOS can sometimes become sinks, they were found to exhibit properties analogous to the BOS token. Specifically, they emerge at the same layer, persist throughout the network and draw a large amount of attention mass. Whereas, we find the existence of secondary sinks that arise primarily in middle layers and can persist for a variable number of layers, and draw a smaller, but still significant, amount of attention mass. Through extensive experiments across 11 model families, we analyze where these secondary sinks appear, their properties, how they are formed, and their impact on the attention mechanism. Specifically, we show that: (1) these sinks are formed by specific middle-layer MLP modules; these MLPs map token representations to vectors that align with the direction of the primary sink of that layer. (2) The $\ell_2$-norm of these vectors determines the sink score of the secondary sink, and also the number of layers it lasts for, thereby leading to different impacts on the attention mechanisms accordingly. (3) The primary sink weakens in middle layers, coinciding with the emergence of secondary sinks. We observe that in larger-scale models, the location and lifetime of the sinks, together referred to as sink levels, appear in a more deterministic and frequent manner. Specifically, we identify three sink levels in QwQ-32B and six levels in Qwen3-14B.
84. 【2512.22208】Open-Source Multimodal Moxin Models with Moxin-VLM and Moxin-VLA
链接:https://arxiv.org/abs/2512.22208
作者:Pu Zhao,Xuan Shen,Zhenglun Kong,Yixin Shen,Sung-En Chang,Arash Akbari,Timothy Rupprecht,Lei Lu,Enfu Nan,Changdi Yang,Yumei He,Weiyan Shi,Xingchen Xu,Yu Huang,Wei Jiang,Wei Wang,Yue Chen,Yong He,Yanzhi Wang
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Large Language Models, Large Language, Language Models, significant transformation, undergone a significant
备注:
点击查看摘要
Abstract:Recently, Large Language Models (LLMs) have undergone a significant transformation, marked by a rapid rise in both their popularity and capabilities. Leading this evolution are proprietary LLMs like GPT-4 and GPT-o1, which have captured widespread attention in the AI community due to their remarkable performance and versatility. Simultaneously, open-source LLMs, such as LLaMA and Mistral, have made great contributions to the ever-increasing popularity of LLMs due to the ease to customize and deploy the models across diverse applications. Moxin 7B is introduced as a fully open-source LLM developed in accordance with the Model Openness Framework, which moves beyond the simple sharing of model weights to embrace complete transparency in training, datasets, and implementation detail, thus fostering a more inclusive and collaborative research environment that can sustain a healthy open-source ecosystem. To further equip Moxin with various capabilities in different tasks, we develop three variants based on Moxin, including Moxin-VLM, Moxin-VLA, and Moxin-Chinese, which target the vision-language, vision-language-action, and Chinese capabilities, respectively. Experiments show that our models achieve superior performance in various evaluations. We adopt open-source framework and open data for the training. We release our models, along with the available data and code to derive these models.
85. 【2512.22205】A CNN-Based Malaria Diagnosis from Blood Cell Images with SHAP and LIME Explainability
链接:https://arxiv.org/abs/2512.22205
作者:Md. Ismiel Hossen Abir,Awolad Hossain
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:prevalent health concern, subtropical climates, remains a prevalent, prevalent health, health concern
备注:
点击查看摘要
Abstract:Malaria remains a prevalent health concern in regions with tropical and subtropical climates. The cause of malaria is the Plasmodium parasite, which is transmitted through the bites of infected female Anopheles mosquitoes. Traditional diagnostic methods, such as microscopic blood smear analysis, are low in sensitivity, depend on expert judgment, and require resources that may not be available in remote settings. To overcome these limitations, this study proposes a deep learning-based approach utilizing a custom Convolutional Neural Network (CNN) to automatically classify blood cell images as parasitized or uninfected. The model achieves an accuracy of 96%, with precision and recall scores exceeding 0.95 for both classes. This study also compares the custom CNN with established deep learning architectures, including ResNet50, VGG16, MobileNetV2, and DenseNet121. To enhance model interpretability, Explainable AI techniques such as SHAP, LIME, and Saliency Maps are applied. The proposed system shows how deep learning can provide quick, accurate and understandable malaria diagnosis, especially in areas with limited resources.
86. 【2512.22183】Unbiased Visual Reasoning with Controlled Visual Inputs
链接:https://arxiv.org/abs/2512.22183
作者:Zhaonan Li,Shijie Lu,Fei Wang,Jacob Dineen,Xiao Ye,Zhikun Xu,Siyi Liu,Young Min Cho,Bangzheng Li,Daniel Chang,Kenny Nguyen,Qizheng Yang,Muhao Chen,Ben Zhou
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Vision-language Models, shortcut-prone when fine-tuned, answer visual questions, Separation for Text-based, exploiting spurious correlations
备注:
点击查看摘要
Abstract:End-to-end Vision-language Models (VLMs) often answer visual questions by exploiting spurious correlations instead of causal visual evidence, and can become more shortcut-prone when fine-tuned. We introduce VISTA (Visual-Information Separation for Text-based Analysis), a modular framework that decouples perception from reasoning via an explicit information bottleneck. A frozen VLM sensor is restricted to short, objective perception queries, while a text-only LLM reasoner decomposes each question, plans queries, and aggregates visual facts in natural language. This controlled interface defines a reward-aligned environment for training unbiased visual reasoning with reinforcement learning. Instantiated with Qwen2.5-VL and Llama3.2-Vision sensors, and trained with GRPO from only 641 curated multi-step questions, VISTA significantly improves robustness to real-world spurious correlations on SpuriVerse (+16.29% with Qwen-2.5-VL-7B and +6.77% with Llama-3.2-Vision-11B), while remaining competitive on MMVP and a balanced SeedBench subset. VISTA transfers robustly across unseen VLM sensors and is able to recognize and recover from VLM perception failures. Human analysis further shows that VISTA's reasoning traces are more neutral, less reliant on spurious attributes, and more explicitly grounded in visual evidence than end-to-end VLM baselines.
87. 【2512.23609】he Big Three in Marriage Talk: LLM-Assisted Analysis of Moral Ethics and Sentiment on Weibo and Xiaohongshu
链接:https://arxiv.org/abs/2512.23609
作者:Frank Tian-Fang Ye(1),Xiaozi Gao(2) ((1) Division of Social Sciences, The HKU SPACE Community College, Hong Kong SAR, PRC (2) Department of Early Childhood Education, Education University of Hong Kong, Hong Kong SAR, PRC)
类目:General Economics (econ.GN); Computation and Language (cs.CL)
关键词:million couples, declined dramatically, registrations have declined, million, Sina Weibo
备注:
点击查看摘要
Abstract:China's marriage registrations have declined dramatically, dropping from 13.47 million couples in 2013 to 6.1 million in 2024. Understanding public attitudes toward marriage requires examining not only emotional sentiment but also the moral reasoning underlying these evaluations. This study analyzed 219,358 marriage-related posts from two major Chinese social media platforms (Sina Weibo and Xiaohongshu) using large language model (LLM)-assisted content analysis. Drawing on Shweder's Big Three moral ethics framework, posts were coded for sentiment (positive, negative, neutral) and moral dimensions (Autonomy, Community, Divinity). Results revealed platform differences: Weibo discourse skewed positive, while Xiaohongshu was predominantly neutral. Most posts across both platforms lacked explicit moral framing. However, when moral ethics were invoked, significant associations with sentiment emerged. Posts invoking Autonomy ethics and Community ethics were predominantly negative, whereas Divinity-framed posts tended toward neutral or positive sentiment. These findings suggest that concerns about both personal autonomy constraints and communal obligations drive negative marriage attitudes in contemporary China. The study demonstrates LLMs' utility for scaling qualitative analysis and offers insights for developing culturally informed policies addressing marriage decline in Chinese contexts.
信息检索
1. 【2512.23647】Nested Browser-Use Learning for Agentic Information Seeking
链接:https://arxiv.org/abs/2512.23647
作者:Baixuan Li,Jialong Wu,Wenbiao Yin,Kuan Li,Zhongwang Zhang,Huifeng Yin,Zhengwei Tao,Liwen Zhang,Pengjun Xie,Jingren Zhou,Yong Jiang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
关键词:achieved strong performance, remains largely restricted, API-level snippet retrieval, URL-based page fetching, deep search tasks
备注:
点击查看摘要
Abstract:Information-seeking (IS) agents have achieved strong performance across a range of wide and deep search tasks, yet their tool use remains largely restricted to API-level snippet retrieval and URL-based page fetching, limiting access to the richer information available through real browsing. While full browser interaction could unlock deeper capabilities, its fine-grained control and verbose page content returns introduce substantial complexity for ReAct-style function-calling agents. To bridge this gap, we propose Nested Browser-Use Learning (NestBrowse), which introduces a minimal and complete browser-action framework that decouples interaction control from page exploration through a nested structure. This design simplifies agentic reasoning while enabling effective deep-web information acquisition. Empirical results on challenging deep IS benchmarks demonstrate that NestBrowse offers clear benefits in practice. Further in-depth analyses underscore its efficiency and flexibility.
2. 【2512.23597】Scalable Residual Feature Aggregation Framework with Hybrid Metaheuristic Optimization for Robust Early Pancreatic Neoplasm Detection in Multimodal CT Imaging
链接:https://arxiv.org/abs/2512.23597
作者:Janani Annur Thiruvengadam,Kiran Mayee Nabigaru,Anusha Kovi
类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
关键词:major clinical dilemma, minimal contrast margins, large spread anatomy-wide, spread anatomy-wide variation, Scalable Residual Feature
备注:
点击查看摘要
Abstract:The early detection of pancreatic neoplasm is a major clinical dilemma, and it is predominantly so because tumors are likely to occur with minimal contrast margins and a large spread anatomy-wide variation amongst patients on a CT scan. These complexities require to be addressed with an effective and scalable system that can assist in enhancing the salience of the subtle visual cues and provide a high level of the generalization on the multimodal imaging data. A Scalable Residual Feature Aggregation (SRFA) framework is proposed to be used to meet these conditions in this study. The framework integrates a pipeline of preprocessing followed by the segmentation using the MAGRes-UNet that is effective in making the pancreatic structures and isolating regions of interest more visible. DenseNet-121 performed with residual feature storage is used to extract features to allow deep hierarchical features to be aggregated without properties loss. To go further, hybrid HHO-BA metaheuristic feature selection strategy is used, which guarantees the best feature subset refinement. To be classified, the system is trained based on a new hybrid model that integrates the ability to pay attention on the world, which is the Vision Transformer (ViT) with the high representational efficiency of EfficientNet-B3. A dual optimization mechanism incorporating SSA and GWO is used to fine-tune hyperparameters to enhance greater robustness and less overfitting. Experimental results support the significant improvement in performance, with the suggested model reaching 96.23% accuracy, 95.58% F1-score and 94.83% specificity, the model is significantly better than the traditional CNNs and contemporary transformer-based models. Such results highlight the possibility of the SRFA framework as a useful instrument in the early detection of pancreatic tumors.
3. 【2512.23307】RobustMask: Certified Robustness against Adversarial Neural Ranking Attack via Randomized Masking
链接:https://arxiv.org/abs/2512.23307
作者:Jiawei Liu,Zhuo Chen,Rui Zhu,Miaokun Chen,Yuyang Gong,Wei Lu,Xiaofeng Wang
类目:Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
关键词:achieved remarkable progress, Retrieval-Augmented Generation, Neural ranking models, achieved remarkable, remarkable progress
备注:
点击查看摘要
Abstract:Neural ranking models have achieved remarkable progress and are now widely deployed in real-world applications such as Retrieval-Augmented Generation (RAG). However, like other neural architectures, they remain vulnerable to adversarial manipulations: subtle character-, word-, or phrase-level perturbations can poison retrieval results and artificially promote targeted candidates, undermining the integrity of search engines and downstream systems. Existing defenses either rely on heuristics with poor generalization or on certified methods that assume overly strong adversarial knowledge, limiting their practical use. To address these challenges, we propose RobustMask, a novel defense that combines the context-prediction capability of pretrained language models with a randomized masking-based smoothing mechanism. Our approach strengthens neural ranking models against adversarial perturbations at the character, word, and phrase levels. Leveraging both the pairwise comparison ability of ranking models and probabilistic statistical analysis, we provide a theoretical proof of RobustMask's certified top-K robustness. Extensive experiments further demonstrate that RobustMask successfully certifies over 20% of candidate documents within the top-10 ranking positions against adversarial perturbations affecting up to 30% of their content. These results highlight the effectiveness of RobustMask in enhancing the adversarial robustness of neural ranking models, marking a significant step toward providing stronger security guarantees for real-world retrieval systems.
4. 【2512.22838】OrchANN: A Unified I/O Orchestration Framework for Skewed Out-of-Core Vector Search
链接:https://arxiv.org/abs/2512.22838
作者:Chengying Huan,Lizheng Chen,Zhengyi Yang,Shaonan Ma,Rong Gu,Renjie Yao,Zhibin Wang,Mingxing Zhang,Fang Xi,Jie Tao,Gang Zhang,Guihai Chen,Chen Tian
类目:Databases (cs.DB); Information Retrieval (cs.IR)
关键词:Approximate nearest neighbor, nearest neighbor search, Approximate nearest, neighbor search, nearest neighbor
备注: 13 pages, 30 figures
点击查看摘要
Abstract:Approximate nearest neighbor search (ANNS) at billion scale is fundamentally an out-of-core problem: vectors and indexes live on SSD, so performance is dominated by I/O rather than compute. Under skewed semantic embeddings, existing out-of-core systems break down: a uniform local index mismatches cluster scales, static routing misguides queries and inflates the number of probed partitions, and pruning is incomplete at the cluster level and lossy at the vector level, triggering "fetch-to-discard" reranking on raw vectors. We present OrchANN, an out-of-core ANNS engine that uses an I/O orchestration model for unified I/O governance along the route-access-verify pipeline. OrchANN selects a heterogeneous local index per cluster via offline auto-profiling, maintains a query-aware in-memory navigation graph that adapts to skewed workloads, and applies multi-level pruning with geometric bounds to filter both clusters and vectors before issuing SSD reads. Across five standard datasets under strict out-of-core constraints, OrchANN outperforms four baselines including DiskANN, Starling, SPANN, and PipeANN in both QPS and latency while reducing SSD accesses. Furthermore, OrchANN delivers up to 17.2x higher QPS and 25.0x lower latency than competing systems without sacrificing accuracy.
Comments:
13 pages, 30 figures
Subjects:
Databases (cs.DB); Information Retrieval (cs.IR)
Cite as:
arXiv:2512.22838 [cs.DB]
(or
arXiv:2512.22838v1 [cs.DB] for this version)
https://doi.org/10.48550/arXiv.2512.22838
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
5. 【2512.22629】DICE: Discrete Interpretable Comparative Evaluation with Probabilistic Scoring for Retrieval-Augmented Generation
链接:https://arxiv.org/abs/2512.22629
作者:Shiyan Liu,Jian Ma,Rui Qu
类目:Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:Retrieval-Augmented Generation, sophisticated architectures, ensuring their trustworthiness, Discrete Interpretable Comparative, RAG
备注: Accepted at ResponsibleFM @ NeurIPS 2025
点击查看摘要
Abstract:As Retrieval-Augmented Generation (RAG) systems evolve toward more sophisticated architectures, ensuring their trustworthiness through explainable and robust evaluation becomes critical. Existing scalar metrics suffer from limited interpretability, inadequate uncertainty quantification, and computational inefficiency in multi-system comparisons, hindering responsible deployment of RAG technologies. We introduce DICE (Discrete Interpretable Comparative Evaluation), a two-stage, evidence-coupled framework that advances explainability and robustness in RAG evaluation. DICE combines deep analytical reasoning with probabilistic $\{A, B, Tie\}$ scoring to produce transparent, confidence-aware judgments that support accountable system improvement through interpretable reasoning traces, enabling systematic error diagnosis and actionable insights. To address efficiency challenges at scale, DICE employs a Swiss-system tournament that reduces computational complexity from $O(N^2)$ to $O(N \log N)$, achieving a 42.9% reduction in our eight-system evaluation while preserving ranking fidelity. Validation on a curated Chinese financial QA dataset demonstrates that DICE achieves 85.7% agreement with human experts, substantially outperforming existing LLM-based metrics such as RAGAS. Our results establish DICE as a responsible, explainable, and efficient paradigm for trustworthy RAG system assessment.
6. 【2512.22457】A Real-Time System to Populate FRA Form 57 from News
链接:https://arxiv.org/abs/2512.22457
作者:Chansong Lim,Haz Sameen Shahgir,Yue Dong,Jia Chen,Evangelos E. Papalexakis
类目:Information Retrieval (cs.IR)
关键词:Federal Railroad Administration, Local railway committees, official Federal Railroad, highway-rail grade crossing, grade crossing incidents
备注: to be published in WSDM 2026 Demonstration
点击查看摘要
Abstract:Local railway committees need timely situational awareness after highway-rail grade crossing incidents, yet official Federal Railroad Administration (FRA) investigations can take days to weeks. We present a demo system that populates Highway-Rail Grade Crossing Incident Data (Form 57) from news in real time. Our approach addresses two core challenges: the form is visually irregular and semantically dense, and news is noisy. To solve these problems, we design a pipeline that first converts Form 57 into a JSON schema using a vision language model with sample aggregation, and then performs grouped question answering following the intent of the form layout to reduce ambiguity. In addition, we build an evaluation dataset by aligning scraped news articles with official FRA records and annotating retrievable information. We then assess our system against various alternatives in terms of information retrieval accuracy and coverage.
7. 【2512.22442】HiFi-RAG: Hierarchical Content Filtering and Two-Pass Generation for Open-Domain RAG
链接:https://arxiv.org/abs/2512.22442
作者:Cattalyya Nuengsigkapian
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
关键词:open-domain settings faces, settings faces significant, faces significant challenges, Hierarchical Filtering RAG, user intent
备注: A winning solution for the NeurIPS 2025 MMU-RAGent Competition (Closed-Source Text-to-Text Static Evaluation)
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) in open-domain settings faces significant challenges regarding irrelevant information in retrieved documents and the alignment of generated answers with user intent. We present HiFi-RAG (Hierarchical Filtering RAG), the winning closed-source system in the Text-to-Text static evaluation of the MMU-RAGent NeurIPS 2025 Competition. Our approach moves beyond standard embedding-based retrieval via a multi-stage pipeline. We leverage the speed and cost-efficiency of Gemini 2.5 Flash (4-6x cheaper than Pro) for query formulation, hierarchical content filtering, and citation attribution, while reserving the reasoning capabilities of Gemini 2.5 Pro for final answer generation. On the MMU-RAGent validation set, our system outperformed the baseline, improving ROUGE-L to 0.274 (+19.6%) and DeBERTaScore to 0.677 (+6.2%). On Test2025, our custom dataset evaluating questions that require post-cutoff knowledge (post January 2025), HiFi-RAG outperforms the parametric baseline by 57.4% in ROUGE-L and 14.9% in DeBERTaScore.
8. 【2512.22416】Hallucination Detection and Evaluation of Large Language Model
链接:https://arxiv.org/abs/2512.22416
作者:Chenggong Zhang,Haopeng Wang
类目:Computation and Language (cs.CL); Information Retrieval (cs.IR)
关键词:Large Language Models, Large Language, Language Models, pose a significant, significant challenge
备注:
点击查看摘要
Abstract:Hallucinations in Large Language Models (LLMs) pose a significant challenge, generating misleading or unverifiable content that undermines trust and reliability. Existing evaluation methods, such as KnowHalu, employ multi-stage verification but suffer from high computational costs. To address this, we integrate the Hughes Hallucination Evaluation Model (HHEM), a lightweight classification-based framework that operates independently of LLM-based judgments, significantly improving efficiency while maintaining high detection accuracy. We conduct a comparative analysis of hallucination detection methods across various LLMs, evaluating True Positive Rate (TPR), True Negative Rate (TNR), and Accuracy on question-answering (QA) and summarization tasks. Our results show that HHEM reduces evaluation time from 8 hours to 10 minutes, while HHEM with non-fabrication checking achieves the highest accuracy \(82.2\%\) and TPR \(78.9\%\). However, HHEM struggles with localized hallucinations in summarization tasks. To address this, we introduce segment-based retrieval, improving detection by verifying smaller text components. Additionally, our cumulative distribution function (CDF) analysis indicates that larger models (7B-9B parameters) generally exhibit fewer hallucinations, while intermediate-sized models show higher instability. These findings highlight the need for structured evaluation frameworks that balance computational efficiency with robust factual validation, enhancing the reliability of LLM-generated content.
9. 【2512.22396】HalluMat: Detecting Hallucinations in LLM-Generated Materials Science Content Through Multi-Stage Verification
链接:https://arxiv.org/abs/2512.22396
作者:Bhanu Prakash Vangala,Sajid Mahmud,Pawan Neupane,Joel Selvaraj,Jianlin Cheng
类目:Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Information Retrieval (cs.IR)
关键词:Large Language Models, Artificial Intelligence, Large Language, transforming scientific discovery, enabling rapid knowledge
备注:
点击查看摘要
Abstract:Artificial Intelligence (AI), particularly Large Language Models (LLMs), is transforming scientific discovery, enabling rapid knowledge generation and hypothesis formulation. However, a critical challenge is hallucination, where LLMs generate factually incorrect or misleading information, compromising research integrity. To address this, we introduce HalluMatData, a benchmark dataset for evaluating hallucination detection methods, factual consistency, and response robustness in AI-generated materials science content. Alongside this, we propose HalluMatDetector, a multi-stage hallucination detection framework that integrates intrinsic verification, multi-source retrieval, contradiction graph analysis, and metric-based assessment to detect and mitigate LLM hallucinations. Our findings reveal that hallucination levels vary significantly across materials science subdomains, with high-entropy queries exhibiting greater factual inconsistencies. By utilizing HalluMatDetector verification pipeline, we reduce hallucination rates by 30% compared to standard LLM outputs. Furthermore, we introduce the Paraphrased Hallucination Consistency Score (PHCS) to quantify inconsistencies in LLM responses across semantically equivalent queries, offering deeper insights into model reliability.
10. 【2512.22386】OxygenREC: An Instruction-Following Generative Framework for E-commerce Recommendation
链接:https://arxiv.org/abs/2512.22386
作者:Xuegang Hao,Ming Zhang,Alex Li,Xiangyu Qian,Zhi Ma,Yanlong Zang,Shijie Yang,Zhongxuan Han,Xiaolong Ma,Jinguang Liu,Zhen Li,Zhida Jiang,Shusheng Wang,Ning Tang,Yanchen Qiao,Chenxiang Yang,Chen Sun,Jincheng Yuan,Chunhua Peng,Heng Hu,Peijun Yang,Baopeng Yuan,Caiyun Qiu,Zhaolong Xing,Haofei Yuan,Haipeng Zhang,Yuzhang Guo,Weijie Ding,Jiahua Gao,Hao Huang,Zhen Chen,Tongxuan Liu,Pinghua Gong
类目:Information Retrieval (cs.IR)
关键词:Traditional recommendation systems, Traditional recommendation, suffer from inconsistency, inconsistency in multi-stage, recommendation systems suffer
备注: 37 pages, 7 figures
点击查看摘要
Abstract:Traditional recommendation systems suffer from inconsistency in multi-stage optimization objectives. Generative Recommendation (GR) mitigates them through an end-to-end framework; however, existing methods still rely on matching mechanisms based on inductive patterns. Although responsive, they lack the ability to uncover complex user intents that require deductive reasoning based on world knowledge. Meanwhile, LLMs show strong deep reasoning capabilities, but their latency and computational costs remain challenging for industrial applications. More critically, there are performance bottlenecks in multi-scenario scalability: as shown in Figure 1, existing solutions require independent training and deployment for each scenario, leading to low resource utilization and high maintenance costs-a challenge unaddressed in GR literature. To address these, we present OxygenREC, an industrial recommendation system that leverages Fast-Slow Thinking to deliver deep reasoning with strict latency and multi-scenario requirements of real-world environments. First, we adopt a Fast-Slow Thinking architecture. Slow thinking uses a near-line LLM pipeline to synthesize Contextual Reasoning Instructions, while fast thinking employs a high-efficiency encoder--decoder backbone for real-time generation. Second, to ensure reasoning instructions effectively enhance recommendation generation, we introduce a semantic alignment mechanism with Instruction-Guided Retrieval (IGR) to filter intent-relevant historical behaviors and use a Query-to-Item (Q2I) loss for instruction-item consistency. Finally, to resolve multi-scenario scalability, we transform scenario information into controllable instructions, using unified reward mapping and Soft Adaptive Group Clip Policy Optimization (SA-GCPO) to align policies with diverse business objectives, realizing a train-once-deploy-everywhere paradigm.
11. 【2512.22167】IANEC: Digital Forensic Investigation of Contemporary Writers' Archives
链接:https://arxiv.org/abs/2512.22167
作者:Emmanuel Giguet(GREYC)
类目:Digital Libraries (cs.DL); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
关键词:GREYC Research Lab, forensic investigation tools, digital forensic investigation, develop dedicated digital, GREYC Research
备注: in French language
点击查看摘要
Abstract:The IANEC project (Investigation of Digital Archives of Contemporary Writers), led by the GREYC Research Lab and funded by the French Ministry of Culture aims to develop dedicated digital forensic investigation tools to automate the analysis of archival corpora from the Institut M{é}moires de l'{É}dition Contemporaine (IMEC). The project is based on the observation that born-digital archival materials are increasingly prevalent in contemporary archival institutions, and that digital forensics technologies have become essential for the extraction, identification, processing, and description of natively digital archival corpora.*
计算机视觉
1. 【2512.23709】Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion
链接:https://arxiv.org/abs/2512.23709
作者:Hau-Shiang Shiu,Chin-Yang Lin,Zhixiang Wang,Chi-Wei Hsiao,Po-Fan Yu,Yu-Chih Chen,Yu-Lun Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:latency-sensitive settings due, expensive multi-step denoising, Diffusion-based video super-resolution, video super-resolution, Auto-regressive Temporal Guidance
备注: Project page: [this https URL](https://jamichss.github.io/stream-diffvsr-project-page/)
点击查看摘要
Abstract:Diffusion-based video super-resolution (VSR) methods achieve strong perceptual quality but remain impractical for latency-sensitive settings due to reliance on future frames and expensive multi-step denoising. We propose Stream-DiffVSR, a causally conditioned diffusion framework for efficient online VSR. Operating strictly on past frames, it combines a four-step distilled denoiser for fast inference, an Auto-regressive Temporal Guidance (ARTG) module that injects motion-aligned cues during latent denoising, and a lightweight temporal-aware decoder with a Temporal Processor Module (TPM) that enhances detail and temporal coherence. Stream-DiffVSR processes 720p frames in 0.328 seconds on an RTX4090 GPU and significantly outperforms prior diffusion-based methods. Compared with the online SOTA TMP, it boosts perceptual quality (LPIPS +0.095) while reducing latency by over 130x. Stream-DiffVSR achieves the lowest latency reported for diffusion-based VSR, reducing initial delay from over 4600 seconds to 0.328 seconds, thereby making it the first diffusion VSR method suitable for low-latency online deployment. Project page: this https URL
2. 【2512.23705】Diffusion Knows Transparency: Repurposing Video Diffusion for Transparent Object Depth and Normal Estimation
链接:https://arxiv.org/abs/2512.23705
作者:Shaocong Xu,Songlin Wei,Qizhe Wei,Zheng Geng,Hong Li,Licheng Shen,Qianpu Sun,Shu Han,Bin Ma,Bohan Li,Chongjie Ye,Yuhang Zheng,Nan Wang,Saining Zhang,Hao Zhao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:objects remain notoriously, remain notoriously hard, purely discriminative monocular, Transparent objects remain, discriminative monocular depth
备注: Project Page: [this https URL](https://daniellli.github.io/projects/DKT/;) Code: [this https URL](https://github.com/Daniellli/DKT;) Dataset: [this https URL](https://huggingface.co/datasets/Daniellesry/TransPhy3D)
点击查看摘要
Abstract:Transparent objects remain notoriously hard for perception systems: refraction, reflection and transmission break the assumptions behind stereo, ToF and purely discriminative monocular depth, causing holes and temporally unstable estimates. Our key observation is that modern video diffusion models already synthesize convincing transparent phenomena, suggesting they have internalized the optical rules. We build TransPhy3D, a synthetic video corpus of transparent/reflective scenes: 11k sequences rendered with Blender/Cycles. Scenes are assembled from a curated bank of category-rich static assets and shape-rich procedural assets paired with glass/plastic/metal materials. We render RGB + depth + normals with physically based ray tracing and OptiX denoising. Starting from a large video diffusion model, we learn a video-to-video translator for depth (and normals) via lightweight LoRA adapters. During training we concatenate RGB and (noisy) depth latents in the DiT backbone and co-train on TransPhy3D and existing frame-wise synthetic datasets, yielding temporally consistent predictions for arbitrary-length input videos. The resulting model, DKT, achieves zero-shot SOTA on real and synthetic video benchmarks involving transparency: ClearPose, DREDS (CatKnown/CatNovel), and TransPhy3D-Test. It improves accuracy and temporal consistency over strong image/video baselines, and a normal variant sets the best video normal estimation results on ClearPose. A compact 1.3B version runs at ~0.17 s/frame. Integrated into a grasping stack, DKT's depth boosts success rates across translucent, reflective and diffuse surfaces, outperforming prior estimators. Together, these results support a broader claim: "Diffusion knows transparency." Generative video priors can be repurposed, efficiently and label-free, into robust, temporally coherent perception for challenging real-world manipulation.
3. 【2512.23676】Web World Models
链接:https://arxiv.org/abs/2512.23676
作者:Jichen Feng,Yifan Zhang,Chenggong Zhang,Yifu Lu,Shilong Liu,Mengdi Wang
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:agents increasingly require, increasingly require persistent, Language agents increasingly, require persistent worlds, agents increasingly
备注: Project Page: [this https URL](https://github.com/Princeton-AI2-Lab/Web-World-Models)
点击查看摘要
Abstract:Language agents increasingly require persistent worlds in which they can act, remember, and learn. Existing approaches sit at two extremes: conventional web frameworks provide reliable but fixed contexts backed by databases, while fully generative world models aim for unlimited environments at the expense of controllability and practical engineering. In this work, we introduce the Web World Model (WWM), a middle ground where world state and ``physics'' are implemented in ordinary web code to ensure logical consistency, while large language models generate context, narratives, and high-level decisions on top of this structured latent state. We build a suite of WWMs on a realistic web stack, including an infinite travel atlas grounded in real geography, fictional galaxy explorers, web-scale encyclopedic and narrative worlds, and simulation- and game-like environments. Across these systems, we identify practical design principles for WWMs: separating code-defined rules from model-driven imagination, representing latent state as typed web interfaces, and utilizing deterministic generation to achieve unlimited but structured exploration. Our results suggest that web stacks themselves can serve as a scalable substrate for world models, enabling controllable yet open-ended environments. Project Page: this https URL.
4. 【2512.23667】IDT: A Physically Grounded Transformer for Feed-Forward Multi-View Intrinsic Decomposition
链接:https://arxiv.org/abs/2512.23667
作者:Kang Du,Yirui Guan,Zeyu Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:RGB images entangle, entangle material properties, RGB images, visual understanding, Intrinsic Decomposition Transformer
备注: 10 pages 4 figures
点击查看摘要
Abstract:Intrinsic image decomposition is fundamental for visual understanding, as RGB images entangle material properties, illumination, and view-dependent effects. Recent diffusion-based methods have achieved strong results for single-view intrinsic decomposition; however, extending these approaches to multi-view settings remains challenging, often leading to severe view inconsistency. We propose \textbf{Intrinsic Decomposition Transformer (IDT)}, a feed-forward framework for multi-view intrinsic image decomposition. By leveraging transformer-based attention to jointly reason over multiple input images, IDT produces view-consistent intrinsic factors in a single forward pass, without iterative generative sampling. IDT adopts a physically grounded image formation model that explicitly decomposes images into diffuse reflectance, diffuse shading, and specular shading. This structured factorization separates Lambertian and non-Lambertian light transport, enabling interpretable and controllable decomposition of material and illumination effects across views. Experiments on both synthetic and real-world datasets demonstrate that IDT achieves cleaner diffuse reflectance, more coherent diffuse shading, and better-isolated specular components, while substantially improving multi-view consistency compared to prior intrinsic decomposition methods.
5. 【2512.23649】RoboMirror: Understand Before You Imitate for Video to Humanoid Locomotion
链接:https://arxiv.org/abs/2512.23649
作者:Zhe Li,Cheng Chi,Yangyang Wei,Boan Zhu,Tao Huang,Zhenguo Sun,Yibo Peng,Pengwei Wang,Zhongyuan Wang,Fangzhou Liu,Chang Xu,Shanghang Zhang
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:Humans learn locomotion, Humans learn, interpreting visual content, Humans, visual understanding
备注:
点击查看摘要
Abstract:Humans learn locomotion through visual observation, interpreting visual content first before imitating actions. However, state-of-the-art humanoid locomotion systems rely on either curated motion capture trajectories or sparse text commands, leaving a critical gap between visual understanding and control. Text-to-motion methods suffer from semantic sparsity and staged pipeline errors, while video-based approaches only perform mechanical pose mimicry without genuine visual understanding. We propose RoboMirror, the first retargeting-free video-to-locomotion framework embodying "understand before you imitate". Leveraging VLMs, it distills raw egocentric/third-person videos into visual motion intents, which directly condition a diffusion-based policy to generate physically plausible, semantically aligned locomotion without explicit pose reconstruction or retargeting. Extensive experiments validate the effectiveness of RoboMirror, it enables telepresence via egocentric videos, drastically reduces third-person control latency by 80%, and achieves a 3.7% higher task success rate than baselines. By reframing humanoid control around video understanding, we bridge the visual understanding and action gap.
6. 【2512.23646】OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding
链接:https://arxiv.org/abs/2512.23646
作者:Keda Tao,Wenjie Du,Bohan Yu,Weiqiang Wang,Jian Liu,Huan Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Omnimodal large language, made significant strides, Omnimodal large, large language models, visual modalities
备注: Website: [this https URL](https://kd-tao.github.io/OmniAgent/)
点击查看摘要
Abstract:Omnimodal large language models have made significant strides in unifying audio and visual modalities; however, they often lack the fine-grained cross-modal understanding and have difficulty with multimodal alignment. To address these limitations, we introduce OmniAgent, a fully audio-guided active perception agent that dynamically orchestrates specialized tools to achieve more fine-grained audio-visual reasoning. Unlike previous works that rely on rigid, static workflows and dense frame-captioning, this paper demonstrates a paradigm shift from passive response generation to active multimodal inquiry. OmniAgent employs dynamic planning to autonomously orchestrate tool invocation on demand, strategically concentrating perceptual attention on task-relevant cues. Central to our approach is a novel coarse-to-fine audio-guided perception paradigm, which leverages audio cues to localize temporal events and guide subsequent reasoning. Extensive empirical evaluations on three audio-video understanding benchmarks demonstrate that OmniAgent achieves state-of-the-art performance, surpassing leading open-source and proprietary models by substantial margins of 10% - 20% accuracy.
7. 【2512.23635】Rethinking the Spatio-Temporal Alignment of End-to-End 3D Perception
链接:https://arxiv.org/abs/2512.23635
作者:Xiaoyu Li,Peidong Li,Xian Wu,Long Shi,Dedong Liu,Yitao Wu,Jiajia Fu,Dixiao Cui,Lijun Zhao,Lining Sun
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:textural prior information, providing valuable structural, autonomous driving, prior information, valuable structural
备注: Accepted to AAAI 2026
点击查看摘要
Abstract:Spatio-temporal alignment is crucial for temporal modeling of end-to-end (E2E) perception in autonomous driving (AD), providing valuable structural and textural prior information. Existing methods typically rely on the attention mechanism to align objects across frames, simplifying the motion model with a unified explicit physical model (constant velocity, etc.). These approaches prefer semantic features for implicit alignment, challenging the importance of explicit motion modeling in the traditional perception paradigm. However, variations in motion states and object features across categories and frames render this alignment suboptimal. To address this, we propose HAT, a spatio-temporal alignment module that allows each object to adaptively decode the optimal alignment proposal from multiple hypotheses without direct supervision. Specifically, HAT first utilizes multiple explicit motion models to generate spatial anchors and motion-aware feature proposals for historical instances. It then performs multi-hypothesis decoding by incorporating semantic and motion cues embedded in cached object queries, ultimately providing the optimal alignment proposal for the target frame. On nuScenes, HAT consistently improves 3D temporal detectors and trackers across diverse baselines. It achieves state-of-the-art tracking results with 46.0% AMOTA on the test set when paired with the DETR3D detector. In an object-centric E2E AD method, HAT enhances perception accuracy (+1.3% mAP, +3.1% AMOTA) and reduces the collision rate by 32%. When semantics are corrupted (nuScenes-C), the enhancement of motion modeling by HAT enables more robust perception and planning in the E2E AD.
8. 【2512.23628】Memorization in 3D Shape Generation: An Empirical Study
链接:https://arxiv.org/abs/2512.23628
作者:Shu Pu,Boya Zeng,Kaichen Zhou,Mengyu Wang,Zhuang Liu
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:memorizing training shapes, Generative models, vision to synthesize, training shapes, remains unclear
备注:
点击查看摘要
Abstract:Generative models are increasingly used in 3D vision to synthesize novel shapes, yet it remains unclear whether their generation relies on memorizing training shapes. Understanding their memorization could help prevent training data leakage and improve the diversity of generated results. In this paper, we design an evaluation framework to quantify memorization in 3D generative models and study the influence of different data and modeling designs on memorization. We first apply our framework to quantify memorization in existing methods. Next, through controlled experiments with a latent vector-set (Vecset) diffusion model, we find that, on the data side, memorization depends on data modality, and increases with data diversity and finer-grained conditioning; on the modeling side, it peaks at a moderate guidance scale and can be mitigated by longer Vecsets and simple rotation augmentation. Together, our framework and analysis provide an empirical understanding of memorization in 3D generative models and suggest simple yet effective strategies to reduce it without degrading generation quality. Our code is available at this https URL.
9. 【2512.23597】Scalable Residual Feature Aggregation Framework with Hybrid Metaheuristic Optimization for Robust Early Pancreatic Neoplasm Detection in Multimodal CT Imaging
链接:https://arxiv.org/abs/2512.23597
作者:Janani Annur Thiruvengadam,Kiran Mayee Nabigaru,Anusha Kovi
类目:Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
关键词:major clinical dilemma, minimal contrast margins, large spread anatomy-wide, spread anatomy-wide variation, Scalable Residual Feature
备注:
点击查看摘要
Abstract:The early detection of pancreatic neoplasm is a major clinical dilemma, and it is predominantly so because tumors are likely to occur with minimal contrast margins and a large spread anatomy-wide variation amongst patients on a CT scan. These complexities require to be addressed with an effective and scalable system that can assist in enhancing the salience of the subtle visual cues and provide a high level of the generalization on the multimodal imaging data. A Scalable Residual Feature Aggregation (SRFA) framework is proposed to be used to meet these conditions in this study. The framework integrates a pipeline of preprocessing followed by the segmentation using the MAGRes-UNet that is effective in making the pancreatic structures and isolating regions of interest more visible. DenseNet-121 performed with residual feature storage is used to extract features to allow deep hierarchical features to be aggregated without properties loss. To go further, hybrid HHO-BA metaheuristic feature selection strategy is used, which guarantees the best feature subset refinement. To be classified, the system is trained based on a new hybrid model that integrates the ability to pay attention on the world, which is the Vision Transformer (ViT) with the high representational efficiency of EfficientNet-B3. A dual optimization mechanism incorporating SSA and GWO is used to fine-tune hyperparameters to enhance greater robustness and less overfitting. Experimental results support the significant improvement in performance, with the suggested model reaching 96.23% accuracy, 95.58% F1-score and 94.83% specificity, the model is significantly better than the traditional CNNs and contemporary transformer-based models. Such results highlight the possibility of the SRFA framework as a useful instrument in the early detection of pancreatic tumors.
10. 【2512.23594】Detection Fire in Camera RGB-NIR
链接:https://arxiv.org/abs/2512.23594
作者:Nguyen Truong Khai,Luong Duc Vinh
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:infrared night vision, night vision cameras, vision cameras remains, challenging task, infrared night
备注:
点击查看摘要
Abstract:Improving the accuracy of fire detection using infrared night vision cameras remains a challenging task. Previous studies have reported strong performance with popular detection models. For example, YOLOv7 achieved an mAP50-95 of 0.51 using an input image size of 640 x 1280, RT-DETR reached an mAP50-95 of 0.65 with an image size of 640 x 640, and YOLOv9 obtained an mAP50-95 of 0.598 at the same resolution. Despite these results, limitations in dataset construction continue to cause issues, particularly the frequent misclassification of bright artificial lights as fire. This report presents three main contributions: an additional NIR dataset, a two-stage detection model, and Patched-YOLO. First, to address data scarcity, we explore and apply various data augmentation strategies for both the NIR dataset and the classification dataset. Second, to improve night-time fire detection accuracy while reducing false positives caused by artificial lights, we propose a two-stage pipeline combining YOLOv11 and EfficientNetV2-B0. The proposed approach achieves higher detection accuracy compared to previous methods, particularly for night-time fire detection. Third, to improve fire detection in RGB images, especially for small and distant objects, we introduce Patched-YOLO, which enhances the model's detection capability through patch-based processing. Further details of these contributions are discussed in the following sections.
Subjects:
Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2512.23594 [cs.CV]
(or
arXiv:2512.23594v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2512.23594
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
11. 【2512.23592】Same or Not? Enhancing Visual Perception in Vision-Language Models
链接:https://arxiv.org/abs/2512.23592
作者:Damiano Marsili,Aditya Mehta,Ryan Y. Lin,Georgia Gkioxari
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:exhibit visual biases, subtle visual details, broad visual understanding, miss subtle visual, Vision-language models
备注: Project webpage: [this https URL](https://glab-caltech.github.io/twin/)
点击查看摘要
Abstract:Vision-language models (VLMs) excel at broad visual understanding but remain coarse-grained, exhibit visual biases, and miss subtle visual details. Existing training corpora reinforce this limitation by emphasizing general recognition ("Is it a cat or a dog?") over fine-grained perception. To address this, we introduce a new training corpus and task designed to enhance the perceptual abilities of VLMs. TWIN is a large-scale dataset of 561,000 image-pair queries that task models to determine whether two visually similar images depict the same object, encouraging attention to nuanced visual cues. The dataset spans a diverse range of everyday objects across contexts, viewpoints, and appearances. Fine-tuning VLMs on TWIN yields notable gains in fine-grained recognition, even on unseen domains such as art, animals, plants, and landmarks. To quantify these gains, we introduce FGVQA, a benchmark suite of 12,000 queries that repurposes fine-grained recognition and retrieval datasets from multiple domains. While existing VLMs struggle on FGVQA, when fine-tuned on TWIN they improve by up to 19.3%, without compromising performance on general VQA benchmarks. Finally, our TWIN dataset scales favorably with object annotations, and our analysis shows that scale is key to performance. We envision TWIN as a drop-in addition to open-source VLM training corpora, advancing perceptual precision of future models. Project webpage: this https URL
12. 【2512.23576】LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation
链接:https://arxiv.org/abs/2512.23576
作者:Ethan Chern,Zhulin Hu,Bohao Tang,Jiadi Su,Steffi Chern,Zhijie Deng,Pengfei Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:building general-purpose multimodal, essential for building, building general-purpose, general-purpose multimodal interactive, video
备注:
点击查看摘要
Abstract:Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. However, the simultaneous denoising of all video frames with bidirectional attention via an iterative process in diffusion models prevents real-time interaction. While existing distillation methods can make the model autoregressive and reduce sampling steps to mitigate this, they focus primarily on text-to-video generation, leaving the human-AI interaction unnatural and less efficient. This paper targets real-time interactive video diffusion conditioned on a multimodal context, including text, image, and audio, to bridge the gap. Given the observation that the leading on-policy distillation approach Self Forcing encounters challenges (visual artifacts like flickering, black frames, and quality degradation) with multimodal conditioning, we investigate an improved distillation recipe with emphasis on the quality of condition inputs as well as the initialization and schedule for the on-policy optimization. On benchmarks for multimodal-conditioned (audio, image, and text) avatar video generation including HDTF, AVSpeech, and CelebV-HQ, our distilled model matches the visual quality of the full-step, bidirectional baselines of similar or larger size with 20x less inference cost and latency. Further, we integrate our model with audio language models and long-form video inference technique Anchor-Heavy Identity Sinks to build LiveTalk, a real-time multimodal interactive avatar system. System-level evaluation on our curated multi-turn interaction benchmark shows LiveTalk outperforms state-of-the-art models (Sora2, Veo3) in multi-turn video coherence and content quality, while reducing response latency from 1 to 2 minutes to real-time generation, enabling seamless human-AI multimodal interaction.
13. 【2512.23573】ProGuard: Towards Proactive Multimodal Safeguard
链接:https://arxiv.org/abs/2512.23573
作者:Shaohan Yu,Lijun Li,Chenyang Si,Lu Sheng,Jing Shao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:existing defense methods, exposing the limitations, defense methods, rapid evolution, evolution of generative
备注:
点击查看摘要
Abstract:The rapid evolution of generative models has led to a continuous emergence of multimodal safety risks, exposing the limitations of existing defense methods. To address these challenges, we propose ProGuard, a vision-language proactive guard that identifies and describes out-of-distribution (OOD) safety risks without the need for model adjustments required by traditional reactive approaches. We first construct a modality-balanced dataset of 87K samples, each annotated with both binary safety labels and risk categories under a hierarchical multimodal safety taxonomy, effectively mitigating modality bias and ensuring consistent moderation across text, image, and text-image inputs. Based on this dataset, we train our vision-language base model purely through reinforcement learning (RL) to achieve efficient and concise reasoning. To approximate proactive safety scenarios in a controlled setting, we further introduce an OOD safety category inference task and augment the RL objective with a synonym-bank-based similarity reward that encourages the model to generate concise descriptions for unseen unsafe categories. Experimental results show that ProGuard achieves performance comparable to closed-source large models on binary safety classification, substantially outperforms existing open-source guard models on unsafe content categorization. Most notably, ProGuard delivers a strong proactive moderation ability, improving OOD risk detection by 52.6% and OOD risk description by 64.8%.
14. 【2512.23572】Instruction-Following Evaluation of Large Vision-Language Models
链接:https://arxiv.org/abs/2512.23572
作者:Daiki Shiono,Shumpei Miyawaki,Ryota Tanaka,Jun Suzuki
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:proposed large vision-language, large language models, large vision-language models, LVLMs' instruction-following ability, vision capabilities
备注: 21 pages, 7 figures
点击查看摘要
Abstract:Following the initial flourishing of large language models (LLMs), there has been a surge in proposed large vision-language models (LVLMs) that integrate LLMs with vision capabilities. However, it has been observed that LVLMs, after tuning to visual instruction using commonly used training datasets, often fail to exhibit the instruction-following ability that was present in the LLM before integration, leading to results in which they do not follow task instructions as expected. This study quantitatively demonstrates that LVLMs' instruction-following ability declines after fine-tuning and analyzes its underlying causes. In particular, we constructed new training datasets highlighting whether the output format is specified. Then, we investigated how explicitly indicating the output format during fine-tuning affects LVLMs' instruction-following ability. Our quantitative evaluation confirmed that LVLMs' instruction-following ability declines after fine-tuning with commonly used datasets. Furthermore, we found that LVLMs trained with datasets, including instructions on output format, tend to follow instructions more accurately than models that do not. These findings suggest that including samples with instructions on output format during (visual) instruction tuning may help mitigate the decline in instruction-following abilities.
15. 【2512.23569】Image Denoising Using Global and Local Circulant Representation
链接:https://arxiv.org/abs/2512.23569
作者:Zhaoming Kong,Xiaowei Yang,Jiahuan Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:countless image data, image data generated, increasingly high demand, effective image denoising, countless image
备注:
点击查看摘要
Abstract:The proliferation of imaging devices and countless image data generated every day impose an increasingly high demand on efficient and effective image denoising. In this paper, we establish a theoretical connection between principal component analysis (PCA) and the Haar transform under circulant representation, and present a computationally simple denoising algorithm. The proposed method, termed Haar-tSVD, exploits a unified tensor singular value decomposition (t-SVD) projection combined with Haar transform to efficiently capture global and local patch correlations. Haar-tSVD operates as a one-step, parallelizable plug-and-play denoiser that eliminates the need for learning local bases, thereby striking a balance between denoising speed and performance. Besides, an adaptive noise estimation scheme is introduced to improve robustness according to eigenvalue analysis of the circulant structure. To further enhance the performance under severe noise conditions, we integrate deep neural networks with Haar-tSVD based on the established Haar-PCA relationship. Experimental results on various denoising datasets demonstrate the efficiency and effectiveness of proposed method for noise removal. Our code is publicly available at this https URL.
16. 【2512.23568】hinkGen: Generalized Thinking for Visual Generation
链接:https://arxiv.org/abs/2512.23568
作者:Siyu Jiao,Yiheng Lin,Yujie Zhong,Qi She,Wei Zhou,Xiaohan Lan,Zilong Huang,Fei Yu,Yingchen Yu,Yunqing Zhao,Yao Zhao,Yunchao Wei
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Language Models, Multimodal Large Language, Language Models, Multimodal Large, Large Language
备注:
点击查看摘要
Abstract:Recent progress in Multimodal Large Language Models (MLLMs) demonstrates that Chain-of-Thought (CoT) reasoning enables systematic solutions to complex understanding tasks. However, its extension to generation tasks remains nascent and limited by scenario-specific mechanisms that hinder generalization and adaptation. In this work, we present ThinkGen, the first think-driven visual generation framework that explicitly leverages MLLM's CoT reasoning in various generation scenarios. ThinkGen employs a decoupled architecture comprising a pretrained MLLM and a Diffusion Transformer (DiT), wherein the MLLM generates tailored instructions based on user intent, and DiT produces high-quality images guided by these instructions. We further propose a separable GRPO-based training paradigm (SepGRPO), alternating reinforcement learning between the MLLM and DiT modules. This flexible design enables joint training across diverse datasets, facilitating effective CoT reasoning for a wide range of generative scenarios. Extensive experiments demonstrate that ThinkGen achieves robust, state-of-the-art performance across multiple generation benchmarks. Code is available: this https URL
17. 【2512.23565】RxnBench: A Multimodal Benchmark for Evaluating Large Language Models on Chemical Reaction Understanding from Scientific Literature
链接:https://arxiv.org/abs/2512.23565
作者:Hanzheng Li,Xi Fang,Yixuan Li,Chaozheng Huang,Junjie Wang,Xi Wang,Hongzhe Bai,Bojun Hao,Shenyu Lin,Huiqi Liang,Linfeng Zhang,Guolin Ke
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Multimodal Large Language, Large Language Models, Multimodal Large, literature remains underexplored, authentic literature remains
备注:
点击查看摘要
Abstract:The integration of Multimodal Large Language Models (MLLMs) into chemistry promises to revolutionize scientific discovery, yet their ability to comprehend the dense, graphical language of reactions within authentic literature remains underexplored. Here, we introduce RxnBench, a multi-tiered benchmark designed to rigorously evaluate MLLMs on chemical reaction understanding from scientific PDFs. RxnBench comprises two tasks: Single-Figure QA (SF-QA), which tests fine-grained visual perception and mechanistic reasoning using 1,525 questions derived from 305 curated reaction schemes, and Full-Document QA (FD-QA), which challenges models to synthesize information from 108 articles, requiring cross-modal integration of text, schemes, and tables. Our evaluation of MLLMs reveals a critical capability gap: while models excel at extracting explicit text, they struggle with deep chemical logic and precise structural recognition. Notably, models with inference-time reasoning significantly outperform standard architectures, yet none achieve 50\% accuracy on FD-QA. These findings underscore the urgent need for domain-specific visual encoders and stronger reasoning engines to advance autonomous AI chemists.
18. 【2512.23546】PurifyGen: A Risk-Discrimination and Semantic-Purification Model for Safe Text-to-Image Generation
链接:https://arxiv.org/abs/2512.23546
作者:Zongsheng Cao,Yangfan He,Anran Liu,Jun Xie,Feng Chen,Zepeng Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent advances, notably enhanced, advances in diffusion, raise the risk, risk of generating
备注:
点击查看摘要
Abstract:Recent advances in diffusion models have notably enhanced text-to-image (T2I) generation quality, but they also raise the risk of generating unsafe content. Traditional safety methods like text blacklisting or harmful content classification have significant drawbacks: they can be easily circumvented or require extensive datasets and extra training. To overcome these challenges, we introduce PurifyGen, a novel, training-free approach for safe T2I generation that retains the model's original weights. PurifyGen introduces a dual-stage strategy for prompt purification. First, we evaluate the safety of each token in a prompt by computing its complementary semantic distance, which measures the semantic proximity between the prompt tokens and concept embeddings from predefined toxic and clean lists. This enables fine-grained prompt classification without explicit keyword matching or retraining. Tokens closer to toxic concepts are flagged as risky. Second, for risky prompts, we apply a dual-space transformation: we project toxic-aligned embeddings into the null space of the toxic concept matrix, effectively removing harmful semantic components, and simultaneously align them into the range space of clean concepts. This dual alignment purifies risky prompts by both subtracting unsafe semantics and reinforcing safe ones, while retaining the original intent and coherence. We further define a token-wise strategy to selectively replace only risky token embeddings, ensuring minimal disruption to safe content. PurifyGen offers a plug-and-play solution with theoretical grounding and strong generalization to unseen prompts and models. Extensive testing shows that PurifyGen surpasses current methods in reducing unsafe content across five datasets and competes well with training-dependent approaches. The code can refer to this https URL.
19. 【2512.23545】PathFound: An Agentic Multimodal Model Activating Evidence-seeking Pathological Diagnosis
链接:https://arxiv.org/abs/2512.23545
作者:Shengyi Hua,Jianfeng Wu,Tianle Shen,Kangzhe Hu,Zhongzhen Huang,Shujuan Ni,Zhihong Zhang,Yuan Li,Zhe Wang,Xiaofan Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Recent pathological foundation, substantially advanced visual, advanced visual representation, Recent pathological, substantially advanced
备注:
点击查看摘要
Abstract:Recent pathological foundation models have substantially advanced visual representation learning and multimodal interaction. However, most models still rely on a static inference paradigm in which whole-slide images are processed once to produce predictions, without reassessment or targeted evidence acquisition under ambiguous diagnoses. This contrasts with clinical diagnostic workflows that refine hypotheses through repeated slide observations and further examination requests. We propose PathFound, an agentic multimodal model designed to support evidence-seeking inference in pathological diagnosis. PathFound integrates the power of pathological visual foundation models, vision-language models, and reasoning models trained with reinforcement learning to perform proactive information acquisition and diagnosis refinement by progressing through the initial diagnosis, evidence-seeking, and final decision stages. Across several large multimodal models, adopting this strategy consistently improves diagnostic accuracy, indicating the effectiveness of evidence-seeking workflows in computational pathology. Among these models, PathFound achieves state-of-the-art diagnostic performance across diverse clinical scenarios and demonstrates strong potential to discover subtle details, such as nuclear features and local invasions.
20. 【2512.23537】AnyMS: Bottom-up Attention Decoupling for Layout-guided and Training-free Multi-subject Customization
链接:https://arxiv.org/abs/2512.23537
作者:Binhe Yu,Zhen Wang,Kexin Li,Yuqian Yuan,Wenqiao Zhang,Long Chen,Juncheng Li,Jun Xiao,Yueting Zhuang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:synthesize multiple user-specified, Multi-subject customization aims, multiple user-specified subjects, aims to synthesize, synthesize multiple
备注:
点击查看摘要
Abstract:Multi-subject customization aims to synthesize multiple user-specified subjects into a coherent image. To address issues such as subjects missing or conflicts, recent works incorporate layout guidance to provide explicit spatial constraints. However, existing methods still struggle to balance three critical objectives: text alignment, subject identity preservation, and layout control, while the reliance on additional training further limits their scalability and efficiency. In this paper, we present AnyMS, a novel training-free framework for layout-guided multi-subject customization. AnyMS leverages three input conditions: text prompt, subject images, and layout constraints, and introduces a bottom-up dual-level attention decoupling mechanism to harmonize their integration during generation. Specifically, global decoupling separates cross-attention between textual and visual conditions to ensure text alignment. Local decoupling confines each subject's attention to its designated area, which prevents subject conflicts and thus guarantees identity preservation and layout control. Moreover, AnyMS employs pre-trained image adapters to extract subject-specific features aligned with the diffusion model, removing the need for subject learning or adapter tuning. Extensive experiments demonstrate that AnyMS achieves state-of-the-art performance, supporting complex compositions and scaling to a larger number of subjects.
21. 【2512.23532】Iterative Inference-time Scaling with Adaptive Frequency Steering for Image Super-Resolution
链接:https://arxiv.org/abs/2512.23532
作者:Hexin Zhang,Dong Li,Jie Huang,Bingzhou Wang,Xueyang Fu,Zhengjun Zha
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Diffusion Inference-Time Scaling, Adaptive Frequency Steering, leading paradigm, struggle to guarantee, inference-time scaling
备注:
点击查看摘要
Abstract:Diffusion models have become a leading paradigm for image super-resolution (SR), but existing methods struggle to guarantee both the high-frequency perceptual quality and the low-frequency structural fidelity of generated images. Although inference-time scaling can theoretically improve this trade-off by allocating more computation, existing strategies remain suboptimal: reward-driven particle optimization often causes perceptual over-smoothing, while optimal-path search tends to lose structural consistency. To overcome these difficulties, we propose Iterative Diffusion Inference-Time Scaling with Adaptive Frequency Steering (IAFS), a training-free framework that jointly leverages iterative refinement and frequency-aware particle fusion. IAFS addresses the challenge of balancing perceptual quality and structural fidelity by progressively refining the generated image through iterative correction of structural deviations. Simultaneously, it ensures effective frequency fusion by adaptively integrating high-frequency perceptual cues with low-frequency structural information, allowing for a more accurate and balanced reconstruction across different image details. Extensive experiments across multiple diffusion-based SR models show that IAFS effectively resolves the perception-fidelity conflict, yielding consistently improved perceptual detail and structural accuracy, and outperforming existing inference-time scaling methods.
22. 【2512.23519】IdentityStory: Taming Your Identity-Preserving Generator for Human-Centric Story Generation
链接:https://arxiv.org/abs/2512.23519
作者:Donghao Zhou,Jingyu Lin,Guibao Shen,Quande Liu,Jialin Gao,Lihao Liu,Lan Du,Cunjian Chen,Chi-Wing Fu,Xiaowei Hu,Pheng-Ann Heng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recent visual generative, visual generative models, generative models enable, Recent visual, models enable story
备注: Accepted by AAAI2026 (Project page: [this https URL](https://correr-zhou.github.io/IdentityStory) )
点击查看摘要
Abstract:Recent visual generative models enable story generation with consistent characters from text, but human-centric story generation faces additional challenges, such as maintaining detailed and diverse human face consistency and coordinating multiple characters across different images. This paper presents IdentityStory, a framework for human-centric story generation that ensures consistent character identity across multiple sequential images. By taming identity-preserving generators, the framework features two key components: Iterative Identity Discovery, which extracts cohesive character identities, and Re-denoising Identity Injection, which re-denoises images to inject identities while preserving desired context. Experiments on the ConsiStory-Human benchmark demonstrate that IdentityStory outperforms existing methods, particularly in face consistency, and supports multi-character combinations. The framework also shows strong potential for applications such as infinite-length story generation and dynamic character composition.
23. 【2512.23486】Multi-label Classification with Panoptic Context Aggregation Networks
链接:https://arxiv.org/abs/2512.23486
作者:Mingyuan Jiu,Hailong Zhu,Wenchuan Wei,Hichem Sahbi,Rongrong Ji,Mingliang Xu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:enabling highly discriminative, highly discriminative image, discriminative image representations, Deep Panoptic Context, Context Aggregation Network
备注:
点击查看摘要
Abstract:Context modeling is crucial for visual recognition, enabling highly discriminative image representations by integrating both intrinsic and extrinsic relationships between objects and labels in images. A limitation in current approaches is their focus on basic geometric relationships or localized features, often neglecting cross-scale contextual interactions between objects. This paper introduces the Deep Panoptic Context Aggregation Network (PanCAN), a novel approach that hierarchically integrates multi-order geometric contexts through cross-scale feature aggregation in a high-dimensional Hilbert space. Specifically, PanCAN learns multi-order neighborhood relationships at each scale by combining random walks with an attention mechanism. Modules from different scales are cascaded, where salient anchors at a finer scale are selected and their neighborhood features are dynamically fused via attention. This enables effective cross-scale modeling that significantly enhances complex scene understanding by combining multi-order and cross-scale context-aware features. Extensive multi-label classification experiments on NUS-WIDE, PASCAL VOC2007, and MS-COCO benchmarks demonstrate that PanCAN consistently achieves competitive results, outperforming state-of-the-art techniques in both quantitative and qualitative evaluations, thereby substantially improving multi-label classification performance.
24. 【2512.23483】V-RAG: A Temporal-aware and Semantic Entropy-Weighted Framework for Long Video Retrieval and Understanding
链接:https://arxiv.org/abs/2512.23483
作者:Zongsheng Cao,Yangfan He,Anran Liu,Feng Chen,Zepeng Wang,Jun Xie
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Video Language, Video Language Models, Large Video, Video Language, Language Models
备注:
点击查看摘要
Abstract:Large Video Language Models (LVLMs) have rapidly emerged as the focus of multimedia AI research. Nonetheless, when confronted with lengthy videos, these models struggle: their temporal windows are narrow, and they fail to notice fine-grained semantic shifts that unfold over extended durations. Moreover, mainstream text-based retrieval pipelines, which rely chiefly on surface-level lexical overlap, ignore the rich temporal interdependence among visual, audio, and subtitle channels. To mitigate these limitations, we propose TV-RAG, a training-free architecture that couples temporal alignment with entropy-guided semantics to improve long-video reasoning. The framework contributes two main mechanisms: \emph{(i)} a time-decay retrieval module that injects explicit temporal offsets into the similarity computation, thereby ranking text queries according to their true multimedia context; and \emph{(ii)} an entropy-weighted key-frame sampler that selects evenly spaced, information-dense frames, reducing redundancy while preserving representativeness. By weaving these temporal and semantic signals together, TV-RAG realises a dual-level reasoning routine that can be grafted onto any LVLM without re-training or fine-tuning. The resulting system offers a lightweight, budget-friendly upgrade path and consistently surpasses most leading baselines across established long-video benchmarks such as Video-MME, MLVU, and LongVideoBench, confirming the effectiveness of our model. The code can be found at this https URL.
25. 【2512.23473】SC-Net: Robust Correspondence Learning via Spatial and Cross-Channel Context
链接:https://arxiv.org/abs/2512.23473
作者:Shuyuan Lin,Hailiang Liao,Qiang Qi,Junjie Huang,Taotao Lai,Jian Weng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:demonstrating significant superiority, two-view correspondence learning, Recent research, convolutional neural networks, correspondence learning
备注:
点击查看摘要
Abstract:Recent research has focused on using convolutional neural networks (CNNs) as the backbones in two-view correspondence learning, demonstrating significant superiority over methods based on multilayer perceptrons. However, CNN backbones that are not tailored to specific tasks may fail to effectively aggregate global context and oversmooth dense motion fields in scenes with large disparity. To address these problems, we propose a novel network named SC-Net, which effectively integrates bilateral context from both spatial and channel perspectives. Specifically, we design an adaptive focused regularization module (AFR) to enhance the model's position-awareness and robustness against spurious motion samples, thereby facilitating the generation of a more accurate motion field. We then propose a bilateral field adjustment module (BFA) to refine the motion field by simultaneously modeling long-range relationships and facilitating interaction across spatial and channel dimensions. Finally, we recover the motion vectors from the refined field using a position-aware recovery module (PAR) that ensures consistency and precision. Extensive experiments demonstrate that SC-Net outperforms state-of-the-art methods in relative pose estimation and outlier removal tasks on YFCC100M and SUN3D datasets. Source code is available at this http URL.
26. 【2512.23472】MCI-Net: A Robust Multi-Domain Context Integration Network for Point Cloud Registration
链接:https://arxiv.org/abs/2512.23472
作者:Shuyuan Lin,Wenwu Peng,Junjie Huang,Qiang Qi,Miaohui Wang,Jian Weng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Robust and discriminative, discriminative feature learning, high-quality point cloud, learning is critical, critical for high-quality
备注:
点击查看摘要
Abstract:Robust and discriminative feature learning is critical for high-quality point cloud registration. However, existing deep learning-based methods typically rely on Euclidean neighborhood-based strategies for feature extraction, which struggle to effectively capture the implicit semantics and structural consistency in point clouds. To address these issues, we propose a multi-domain context integration network (MCI-Net) that improves feature representation and registration performance by aggregating contextual cues from diverse domains. Specifically, we propose a graph neighborhood aggregation module, which constructs a global graph to capture the overall structural relationships within point clouds. We then propose a progressive context interaction module to enhance feature discriminability by performing intra-domain feature decoupling and inter-domain context interaction. Finally, we design a dynamic inlier selection method that optimizes inlier weights using residual information from multiple iterations of pose estimation, thereby improving the accuracy and robustness of registration. Extensive experiments on indoor RGB-D and outdoor LiDAR datasets show that the proposed MCI-Net significantly outperforms existing state-of-the-art methods, achieving the highest registration recall of 96.4\% on 3DMatch. Source code is available at this http URL.
27. 【2512.23464】HY-Motion 1.0: Scaling Flow Matching Models for Text-To-Motion Generation
链接:https://arxiv.org/abs/2512.23464
作者:Yuxin Wen,Qing Shuai,Di Kang,Jing Li,Cheng Wen,Yue Qian,Ningxin Jiao,Changhai Chen,Weijie Chen,Yiran Wang,Jinkun Guo,Dongyue An,Han Liu,Yanyu Tong,Chao Zhang,Qing Guo,Juan Chen,Qiao Zhang,Youyi Zhang,Zihao Yao,Cheng Zhang,Hong Duan,Xiaoping Wu,Qi Chen,Fei Cheng,Liang Dong,Peng He,Hao Zhang,Jiaxin Lin,Chao Zhang,Zhongyi Fan,Yifan Li,Zhichao Hu,Yuhong Liu,Linus,Jie Jiang,Xiaolong Li,Linchao Bao
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
关键词:generation models capable, motion generation models, capable of generating, textual descriptions, motion generation
备注: Github: see [this https URL](https://github.com/Tencent-Hunyuan/HY-Motion-1.0)
点击查看摘要
Abstract:We present HY-Motion 1.0, a series of state-of-the-art, large-scale, motion generation models capable of generating 3D human motions from textual descriptions. HY-Motion 1.0 represents the first successful attempt to scale up Diffusion Transformer (DiT)-based flow matching models to the billion-parameter scale within the motion generation domain, delivering instruction-following capabilities that significantly outperform current open-source benchmarks. Uniquely, we introduce a comprehensive, full-stage training paradigm -- including large-scale pretraining on over 3,000 hours of motion data, high-quality fine-tuning on 400 hours of curated data, and reinforcement learning from both human feedback and reward models -- to ensure precise alignment with the text instruction and high motion quality. This framework is supported by our meticulous data processing pipeline, which performs rigorous motion cleaning and captioning. Consequently, our model achieves the most extensive coverage, spanning over 200 motion categories across 6 major classes. We release HY-Motion 1.0 to the open-source community to foster future research and accelerate the transition of 3D human motion generation models towards commercial maturity.
28. 【2512.23463】Deterministic Image-to-Image Translation via Denoising Brownian Bridge Models with Dual Approximators
链接:https://arxiv.org/abs/2512.23463
作者:Bohan Xiao,Peiyong Wang,Qisheng He,Ming Dong
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:translation involves converting, involves converting, Brownian bridge, Dual-approx Bridge, translation involves
备注: Minor correction to a reference entry
点击查看摘要
Abstract:Image-to-Image (I2I) translation involves converting an image from one domain to another. Deterministic I2I translation, such as in image super-resolution, extends this concept by guaranteeing that each input generates a consistent and predictable output, closely matching the ground truth (GT) with high fidelity. In this paper, we propose a denoising Brownian bridge model with dual approximators (Dual-approx Bridge), a novel generative model that exploits the Brownian bridge dynamics and two neural network-based approximators (one for forward and one for reverse process) to produce faithful output with negligible variance and high image quality in I2I translations. Our extensive experiments on benchmark datasets including image generation and super-resolution demonstrate the consistent and superior performance of Dual-approx Bridge in terms of image quality and faithfulness to GT when compared to both stochastic and deterministic baselines. Project page and code: this https URL
29. 【2512.23454】Automated river gauge plate reading using a hybrid object detection and generative AI framework in the Limpopo River Basin
链接:https://arxiv.org/abs/2512.23454
作者:Kayathri Vigneswaran,Hugo Retief,Jai Clifford Holmes,Mariangel Garcia Andarcia,Hansaka Tennakoon
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:flood forecasting, ecological protection, essential for flood, waterline detection, water resource management
备注: 11 pages, 14 figures, 4 tables
点击查看摘要
Abstract:Accurate and continuous monitoring of river water levels is essential for flood forecasting, water resource management, and ecological protection. Traditional hydrological observation methods are often limited by manual measurement errors and environmental constraints. This study presents a hybrid framework integrating vision based waterline detection, YOLOv8 pose scale extraction, and large multimodal language models (GPT 4o and Gemini 2.0 Flash) for automated river gauge plate reading. The methodology involves sequential stages of image preprocessing, annotation, waterline detection, scale gap estimation, and numeric reading extraction. Experiments demonstrate that waterline detection achieved high precision of 94.24 percent and an F1 score of 83.64 percent, while scale gap detection provided accurate geometric calibration for subsequent reading extraction. Incorporating scale gap metadata substantially improved the predictive performance of LLMs, with Gemini Stage 2 achieving the highest accuracy, with a mean absolute error of 5.43 cm, root mean square error of 8.58 cm, and R squared of 0.84 under optimal image conditions. Results highlight the sensitivity of LLMs to image quality, with degraded images producing higher errors, and underscore the importance of combining geometric metadata with multimodal artificial intelligence for robust water level estimation. Overall, the proposed approach offers a scalable, efficient, and reliable solution for automated hydrological monitoring, demonstrating potential for real time river gauge digitization and improved water resource management.
30. 【2512.23453】CoFi-Dec: Hallucination-Resistant Decoding via Coarse-to-Fine Generative Feedback in Large Vision-Language Models
链接:https://arxiv.org/abs/2512.23453
作者:Zongsheng Cao,Yangfan He,Anran Liu,Jun Xie,Feng Chen,Zepeng Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Large Vision-Language Models, achieved impressive progress, Large Vision-Language, understanding and generation, achieved impressive
备注:
点击查看摘要
Abstract:Large Vision-Language Models (LVLMs) have achieved impressive progress in multi-modal understanding and generation. However, they still tend to produce hallucinated content that is inconsistent with the visual input, which limits their reliability in real-world applications. We propose \textbf{CoFi-Dec}, a training-free decoding framework that mitigates hallucinations by integrating generative self-feedback with coarse-to-fine visual conditioning. Inspired by the human visual process from global scene perception to detailed inspection, CoFi-Dec first generates two intermediate textual responses conditioned on coarse- and fine-grained views of the original image. These responses are then transformed into synthetic images using a text-to-image model, forming multi-level visual hypotheses that enrich grounding cues. To unify the predictions from these multiple visual conditions, we introduce a Wasserstein-based fusion mechanism that aligns their predictive distributions into a geometrically consistent decoding trajectory. This principled fusion reconciles high-level semantic consistency with fine-grained visual grounding, leading to more robust and faithful outputs. Extensive experiments on six hallucination-focused benchmarks show that CoFi-Dec substantially reduces both entity-level and semantic-level hallucinations, outperforming existing decoding strategies. The framework is model-agnostic, requires no additional training, and can be seamlessly applied to a wide range of LVLMs. The implementation is available at this https URL.
31. 【2512.23441】Stochastic Siamese MAE Pretraining for Longitudinal Medical Images
链接:https://arxiv.org/abs/2512.23441
作者:Taha Emre,Arunava Chakravarty,Thomas Pinetz,Dmitrii Lachinov,Martin J. Menten,Hendrik Scholl,Sobha Sivaprasad,Daniel Rueckert,Andrew Lotery,Stefan Sacu,Ursula Schmidt-Erfurth,Hrvoje Bogunović
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:Temporally aware image, Temporally aware, aware image representations, longitudinal medical datasets, aware image
备注: Under review. Code is available in [this https URL](https://github.com/EmreTaha/STAMP)
点击查看摘要
Abstract:Temporally aware image representations are crucial for capturing disease progression in 3D volumes of longitudinal medical datasets. However, recent state-of-the-art self-supervised learning approaches like Masked Autoencoding (MAE), despite their strong representation learning capabilities, lack temporal awareness. In this paper, we propose STAMP (Stochastic Temporal Autoencoder with Masked Pretraining), a Siamese MAE framework that encodes temporal information through a stochastic process by conditioning on the time difference between the 2 input volumes. Unlike deterministic Siamese approaches, which compare scans from different time points but fail to account for the inherent uncertainty in disease evolution, STAMP learns temporal dynamics stochastically by reframing the MAE reconstruction loss as a conditional variational inference objective. We evaluated STAMP on two OCT and one MRI datasets with multiple visits per patient. STAMP pretrained ViT models outperformed both existing temporal MAE methods and foundation models on different late stage Age-Related Macular Degeneration and Alzheimer's Disease progression prediction which require models to learn the underlying non-deterministic temporal dynamics of the diseases.
32. 【2512.23437】RealX3D: A Physically-Degraded 3D Benchmark for Multi-view Visual Restoration and Reconstruction
链接:https://arxiv.org/abs/2512.23437
作者:Shuhong Liu,Chenyu Bao,Ziteng Cui,Yun Liu,Xuangeng Chu,Lin Gu,Marcos V. Conde,Ryo Umagami,Tomohiro Hashimoto,Zijian Hu,Tianhan Xu,Yuan Gan,Yusuke Kurose,Tatsuya Harada
类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:multi-view visual restoration, real-capture benchmark, visual restoration, diverse physical degradations, multi-view visual
备注:
点击查看摘要
Abstract:We introduce RealX3D, a real-capture benchmark for multi-view visual restoration and 3D reconstruction under diverse physical degradations. RealX3D groups corruptions into four families, including illumination, scattering, occlusion, and blurring, and captures each at multiple severity levels using a unified acquisition protocol that yields pixel-aligned LQ/GT views. Each scene includes high-resolution capture, RAW images, and dense laser scans, from which we derive world-scale meshes and metric depth. Benchmarking a broad range of optimization-based and feed-forward methods shows substantial degradation in reconstruction quality under physical corruptions, underscoring the fragility of current multi-view pipelines in real-world challenging environments.
33. 【2512.23436】Fuzzy-Logic and Deep Learning for Environmental Condition-Aware Road Surface Classification
链接:https://arxiv.org/abs/2512.23436
作者:Mustafa Demetgul,Sanja Lazarova Molnar
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:active vehicle control, vehicle control systems, road, controlling vehicles, active vehicle
备注:
点击查看摘要
Abstract:Monitoring states of road surfaces provides valuable information for the planning and controlling vehicles and active vehicle control systems. Classical road monitoring methods are expensive and unsystematic because they require time for measurements. This article proposes an real time system based on weather conditional data and road surface condition data. For this purpose, we collected data with a mobile phone camera on the roads around the campus of the Karlsruhe Institute of Technology. We tested a large number of different image-based deep learning algorithms for road classification. In addition, we used road acceleration data along with road image data for training by using them as images. We compared the performances of acceleration-based and camera image-based approaches. The performances of the simple Alexnet, LeNet, VGG, and Resnet algorithms were compared as deep learning algorithms. For road condition classification, 5 classes were considered: asphalt, damaged asphalt, gravel road, damaged gravel road, pavement road and over 95% accuracy performance was achieved. It is also proposed to use the acceleration or the camera image to classify the road surface according to the weather and the time of day using fuzzy logic.
34. 【2512.23427】owards Integrating Uncertainty for Domain-Agnostic Segmentation
链接:https://arxiv.org/abs/2512.23427
作者:Jesse Brouwers,Xiaoyan Xing,Alexander Timans
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:family exhibit strong, exhibit strong zero-shot, strong zero-shot performance, Foundation models, family exhibit
备注: Public code at [this https URL](https://github.com/JesseBrouw/UncertSAM) | published at the 2nd Workshop on Frontiers in Probabilistic Inference (NeurIPS 2025) | 12 pages, 8 figures (incl. Appendix)
点击查看摘要
Abstract:Foundation models for segmentation such as the Segment Anything Model (SAM) family exhibit strong zero-shot performance, but remain vulnerable in shifted or limited-knowledge domains. This work investigates whether uncertainty quantification can mitigate such challenges and enhance model generalisability in a domain-agnostic manner. To this end, we (1) curate UncertSAM, a benchmark comprising eight datasets designed to stress-test SAM under challenging segmentation conditions including shadows, transparency, and camouflage; (2) evaluate a suite of lightweight, post-hoc uncertainty estimation methods; and (3) assess a preliminary uncertainty-guided prediction refinement step. Among evaluated approaches, a last-layer Laplace approximation yields uncertainty estimates that correlate well with segmentation errors, indicating a meaningful signal. While refinement benefits are preliminary, our findings underscore the potential of incorporating uncertainty into segmentation models to support robust, domain-agnostic performance. Our benchmark and code are made publicly available.
35. 【2512.23426】Direct Diffusion Score Preference Optimization via Stepwise Contrastive Policy-Pair Supervision
链接:https://arxiv.org/abs/2512.23426
作者:Dohyun Kim,Seungwoo Lyu,Seung Wook Kim,Paul Hongsuck Seo
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:nuanced user intent, maintain consistent aesthetic, Direct Diffusion Score, consistent aesthetic quality, fully align outputs
备注:
点击查看摘要
Abstract:Diffusion models have achieved impressive results in generative tasks such as text-to-image synthesis, yet they often struggle to fully align outputs with nuanced user intent and maintain consistent aesthetic quality. Existing preference-based training methods like Diffusion Direct Preference Optimization help address these issues but rely on costly and potentially noisy human-labeled datasets. In this work, we introduce Direct Diffusion Score Preference Optimization (DDSPO), which directly derives per-timestep supervision from winning and losing policies when such policies are available. Unlike prior methods that operate solely on final samples, DDSPO provides dense, transition-level signals across the denoising trajectory. In practice, we avoid reliance on labeled data by automatically generating preference signals using a pretrained reference model: we contrast its outputs when conditioned on original prompts versus semantically degraded variants. This practical strategy enables effective score-space preference supervision without explicit reward modeling or manual annotations. Empirical results demonstrate that DDSPO improves text-image alignment and visual quality, outperforming or matching existing preference-based methods while requiring significantly less supervision. Our implementation is available at: this https URL
36. 【2512.23421】DriveLaW:Unifying Planning and Video Generation in a Latent Driving World
链接:https://arxiv.org/abs/2512.23421
作者:Tianze Xia,Yongkang Li,Lijun Zhou,Jingfeng Yao,Kaixin Xiong,Haiyang Sun,Bing Wang,Kun Ma,Hangjun Ye,Wenyu Liu,Xinggang Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:autonomous driving, crucial for autonomous, learn how scenarios, scenarios evolve, evolve over time
备注: 17 pages, 7 figures
点击查看摘要
Abstract:World models have become crucial for autonomous driving, as they learn how scenarios evolve over time to address the long-tail challenges of the real world. However, current approaches relegate world models to limited roles: they operate within ostensibly unified architectures that still keep world prediction and motion planning as decoupled processes. To bridge this gap, we propose DriveLaW, a novel paradigm that unifies video generation and motion planning. By directly injecting the latent representation from its video generator into the planner, DriveLaW ensures inherent consistency between high-fidelity future generation and reliable trajectory planning. Specifically, DriveLaW consists of two core components: DriveLaW-Video, our powerful world model that generates high-fidelity forecasting with expressive latent representations, and DriveLaW-Act, a diffusion planner that generates consistent and reliable trajectories from the latent of DriveLaW-Video, with both components optimized by a three-stage progressive training strategy. The power of our unified paradigm is demonstrated by new state-of-the-art results across both tasks. DriveLaW not only advances video prediction significantly, surpassing best-performing work by 33.3% in FID and 1.8% in FVD, but also achieves a new record on the NAVSIM planning benchmark.
37. 【2512.23413】Bridging Cognitive Gap: Hierarchical Description Learning for Artistic Image Aesthetics Assessment
链接:https://arxiv.org/abs/2512.23413
作者:Henglin Liu,Nisha Huang,Chang Liu,Jiangpeng Yan,Huijuan Huang,Jixuan Ying,Tong-Yee Lee,Pengfei Wan,Xiangyang Ji
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:human-aligned quantitative evaluation, quantitative evaluation system, system for AIGC, quality assessment task, task is crucial
备注: AAAI2026,Project Page: [this https URL](https://github.com/Henglin-Liu/ArtQuant)
点击查看摘要
Abstract:The aesthetic quality assessment task is crucial for developing a human-aligned quantitative evaluation system for AIGC. However, its inherently complex nature, spanning visual perception, cognition, and emotion, poses fundamental challenges. Although aesthetic descriptions offer a viable representation of this complexity, two critical challenges persist: (1) data scarcity and imbalance: existing dataset overly focuses on visual perception and neglects deeper dimensions due to the expensive manual annotation; and (2) model fragmentation: current visual networks isolate aesthetic attributes with multi-branch encoder, while multimodal methods represented by contrastive learning struggle to effectively process long-form textual descriptions. To resolve challenge (1), we first present the Refined Aesthetic Description (RAD) dataset, a large-scale (70k), multi-dimensional structured dataset, generated via an iterative pipeline without heavy annotation costs and easy to scale. To address challenge (2), we propose ArtQuant, an aesthetics assessment framework for artistic images which not only couples isolated aesthetic dimensions through joint description generation, but also better models long-text semantics with the help of LLM decoders. Besides, theoretical analysis confirms this symbiosis: RAD's semantic adequacy (data) and generation paradigm (model) collectively minimize prediction entropy, providing mathematical grounding for the framework. Our approach achieves state-of-the-art performance on several datasets while requiring only 33% of conventional training epochs, narrowing the cognitive gap between artistic images and aesthetic judgment. We will release both code and dataset to support future research.
38. 【2512.23411】SOFTooth: Semantics-Enhanced Order-Aware Fusion for Tooth Instance Segmentation
链接:https://arxiv.org/abs/2512.23411
作者:Xiaolan Li,Wanquan Liu,Pengcheng Li,Pengyu Jie,Chenqiang Gao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:remains challenging due, segmentation remains challenging, ambiguous tooth-gingiva boundaries, remains challenging, challenging due
备注: 11 pages, 5 figures
点击查看摘要
Abstract:Three-dimensional (3D) tooth instance segmentation remains challenging due to crowded arches, ambiguous tooth-gingiva boundaries, missing teeth, and rare yet clinically important third molars. Native 3D methods relying on geometric cues often suffer from boundary leakage, center drift, and inconsistent tooth identities, especially for minority classes and complex anatomies. Meanwhile, 2D foundation models such as the Segment Anything Model (SAM) provide strong boundary-aware semantics, but directly applying them in 3D is impractical in clinical workflows. To address these issues, we propose SOFTooth, a semantics-enhanced, order-aware 2D-3D fusion framework that leverages frozen 2D semantics without explicit 2D mask supervision. First, a point-wise residual gating module injects occlusal-view SAM embeddings into 3D point features to refine tooth-gingiva and inter-tooth boundaries. Second, a center-guided mask refinement regularizes consistency between instance masks and geometric centroids, reducing center drift. Furthermore, an order-aware Hungarian matching strategy integrates anatomical tooth order and center distance into similarity-based assignment, ensuring coherent labeling even under missing or crowded dentitions. On 3DTeethSeg'22, SOFTooth achieves state-of-the-art overall accuracy and mean IoU, with clear gains on cases involving third molars, demonstrating that rich 2D semantics can be effectively transferred to 3D tooth instance segmentation without 2D fine-tuning.
39. 【2512.23380】A unified framework for detecting point and collective anomalies in operating system logs via collaborative transformers
链接:https://arxiv.org/abs/2512.23380
作者:Mohammad Nasirzadeh,Jafar Tahmoresnezhad,Parviz Rashidi-Khazaee
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI); Operating Systems (cs.OS)
关键词:anomaly detection, Log anomaly detection, Log, detection, anomaly
备注: 72 pages, 19 figures, 19 tables, accepted in scientific reports on 5 November 2025
点击查看摘要
Abstract:Log anomaly detection is crucial for preserving the security of operating systems. Depending on the source of log data collection, various information is recorded in logs that can be considered log modalities. In light of this intuition, unimodal methods often struggle by ignoring the different modalities of log data. Meanwhile, multimodal methods fail to handle the interactions between these modalities. Applying multimodal sentiment analysis to log anomaly detection, we propose CoLog, a framework that collaboratively encodes logs utilizing various modalities. CoLog utilizes collaborative transformers and multi-head impressed attention to learn interactions among several modalities, ensuring comprehensive anomaly detection. To handle the heterogeneity caused by these interactions, CoLog incorporates a modality adaptation layer, which adapts the representations from different log modalities. This methodology enables CoLog to learn nuanced patterns and dependencies within the data, enhancing its anomaly detection capabilities. Extensive experiments demonstrate CoLog's superiority over existing state-of-the-art methods. Furthermore, in detecting both point and collective anomalies, CoLog achieves a mean precision of 99.63%, a mean recall of 99.59%, and a mean F1 score of 99.61% across seven benchmark datasets for log-based anomaly detection. The comprehensive detection capabilities of CoLog make it highly suitable for cybersecurity, system monitoring, and operational efficiency. CoLog represents a significant advancement in log anomaly detection, providing a sophisticated and effective solution to point and collective anomaly detection through a unified framework and a solution to the complex challenges automatic log data analysis poses. We also provide the implementation of CoLog at this https URL.
40. 【2512.23379】SoulX-LiveTalk Technical Report
链接:https://arxiv.org/abs/2512.23379
作者:Le Shen,Qiao Qian,Tan Yu,Ke Zhou,Tianhang Yu,Yu Zhan,Zhenjie Wang,Ming Tao,Shunshun Yin,Siyuan Liu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Deploying massive diffusion, significant engineering challenge, massive diffusion models, Deploying massive, audio-driven avatar generation
备注: 12 pages, 6 figures
点击查看摘要
Abstract:Deploying massive diffusion models for real-time, infinite-duration, audio-driven avatar generation presents a significant engineering challenge, primarily due to the conflict between computational load and strict latency constraints. Existing approaches often compromise visual fidelity by enforcing strictly unidirectional attention mechanisms or reducing model capacity. To address this problem, we introduce \textbf{SoulX-LiveTalk}, a 14B-parameter framework optimized for high-fidelity real-time streaming. Diverging from conventional unidirectional paradigms, we use a \textbf{Self-correcting Bidirectional Distillation} strategy that retains bidirectional attention within video chunks. This design preserves critical spatiotemporal correlations, significantly enhancing motion coherence and visual detail. To ensure stability during infinite generation, we incorporate a \textbf{Multi-step Retrospective Self-Correction Mechanism}, enabling the model to autonomously recover from accumulated errors and preventing collapse. Furthermore, we engineered a full-stack inference acceleration suite incorporating hybrid sequence parallelism, Parallel VAE, and kernel-level optimizations. Extensive evaluations confirm that SoulX-LiveTalk is the first 14B-scale system to achieve a \textbf{sub-second start-up latency (0.87s)} while reaching a real-time throughput of \textbf{32 FPS}, setting a new standard for high-fidelity interactive digital human synthesis.
41. 【2512.23374】NeXT-IMDL: Build Benchmark for NeXT-Generation Image Manipulation Detection Localization
链接:https://arxiv.org/abs/2512.23374
作者:Yifei Li,Haoyuan He,Yu Zheng,Bingyao Yu,Wenzhao Zheng,Lei Chen,Jie Zhou,Jiwen Lu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Image Manipulation Detection, Detection and Localization, user-friendly image editing, user-friendly image, Manipulation Detection
备注:
点击查看摘要
Abstract:The accessibility surge and abuse risks of user-friendly image editing models have created an urgent need for generalizable, up-to-date methods for Image Manipulation Detection and Localization (IMDL). Current IMDL research typically uses cross-dataset evaluation, where models trained on one benchmark are tested on others. However, this simplified evaluation approach conceals the fragility of existing methods when handling diverse AI-generated content, leading to misleading impressions of progress. This paper challenges this illusion by proposing NeXT-IMDL, a large-scale diagnostic benchmark designed not just to collect data, but to probe the generalization boundaries of current detectors systematically. Specifically, NeXT-IMDL categorizes AIGC-based manipulations along four fundamental axes: editing models, manipulation types, content semantics, and forgery granularity. Built upon this, NeXT-IMDL implements five rigorous cross-dimension evaluation protocols. Our extensive experiments on 11 representative models reveal a critical insight: while these models perform well in their original settings, they exhibit systemic failures and significant performance degradation when evaluated under our designed protocols that simulate real-world, various generalization scenarios. By providing this diagnostic toolkit and the new findings, we aim to advance the development towards building truly robust, next-generation IMDL models.
42. 【2512.23369】MGCA-Net: Multi-Graph Contextual Attention Network for Two-View Correspondence Learning
链接:https://arxiv.org/abs/2512.23369
作者:Shuyuan Lin,Mengtin Lo,Haosheng Chen,Yanjie Liang,Qiangqiang Wu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Two-view correspondence learning, establish reliable matching, Two-view correspondence, reliable matching relationships, computer vision
备注:
点击查看摘要
Abstract:Two-view correspondence learning is a key task in computer vision, which aims to establish reliable matching relationships for applications such as camera pose estimation and 3D reconstruction. However, existing methods have limitations in local geometric modeling and cross-stage information optimization, which make it difficult to accurately capture the geometric constraints of matched pairs and thus reduce the robustness of the model. To address these challenges, we propose a Multi-Graph Contextual Attention Network (MGCA-Net), which consists of a Contextual Geometric Attention (CGA) module and a Cross-Stage Multi-Graph Consensus (CSMGC) module. Specifically, CGA dynamically integrates spatial position and feature information via an adaptive attention mechanism and enhances the capability to capture both local and global geometric relationships. Meanwhile, CSMGC establishes geometric consensus via a cross-stage sparse graph network, ensuring the consistency of geometric information across different stages. Experimental results on two representative YFCC100M and SUN3D datasets show that MGCA-Net significantly outperforms existing SOTA methods in the outlier rejection and camera pose estimation tasks. Source code is available at this http URL.
43. 【2512.23365】SpatialMosaic: A Multiview VLM Dataset for Partial Visibility
链接:https://arxiv.org/abs/2512.23365
作者:Kanghee Lee,Injae Lee,Minseok Kwak,Kwonyoung Ryu,Jungi Hong,Jaesik Park
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multimodal Large Language, Large Language Models, Multimodal Large, Large Language, progress of Multimodal
备注:
点击查看摘要
Abstract:The rapid progress of Multimodal Large Language Models (MLLMs) has unlocked the potential for enhanced 3D scene understanding and spatial reasoning. However, existing approaches often rely on pre-constructed 3D representations or off-the-shelf reconstruction pipelines, which constrain scalability and real-world applicability. A recent line of work explores learning spatial reasoning directly from multi-view images, enabling Vision-Language Models (VLMs) to understand 3D scenes without explicit 3D reconstructions. Nevertheless, key challenges that frequently arise in real-world environments, such as partial visibility, occlusion, and low-overlap conditions that require spatial reasoning from fragmented visual cues, remain under-explored. To address these limitations, we propose a scalable multi-view data generation and annotation pipeline that constructs realistic spatial reasoning QAs, resulting in SpatialMosaic, a comprehensive instruction-tuning dataset featuring 2M QA pairs. We further introduce SpatialMosaic-Bench, a challenging benchmark for evaluating multi-view spatial reasoning under realistic and challenging scenarios, consisting of 1M QA pairs across 6 tasks. In addition, we present SpatialMosaicVLM, a hybrid framework that integrates 3D reconstruction models as geometry encoders within VLMs for robust spatial reasoning. Extensive experiments demonstrate that our proposed dataset and VQA tasks effectively enhance spatial reasoning under challenging multi-view conditions, validating the effectiveness of our data generation pipeline in constructing realistic and diverse QA pairs. Code and dataset will be available soon.
44. 【2512.23351】CountGD++: Generalized Prompting for Open-World Counting
链接:https://arxiv.org/abs/2512.23351
作者:Niki Amini-Naieni,Andrew Zisserman
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:automatically counting objects, videos are limited, target object, visual, automatically counting
备注:
点击查看摘要
Abstract:The flexibility and accuracy of methods for automatically counting objects in images and videos are limited by the way the object can be specified. While existing methods allow users to describe the target object with text and visual examples, the visual examples must be manually annotated inside the image, and there is no way to specify what not to count. To address these gaps, we introduce novel capabilities that expand how the target object can be specified. Specifically, we extend the prompt to enable what not to count to be described with text and/or visual examples, introduce the concept of `pseudo-exemplars' that automate the annotation of visual examples at inference, and extend counting models to accept visual examples from both natural and synthetic external images. We also use our new counting model, CountGD++, as a vision expert agent for an LLM. Together, these contributions expand the prompt flexibility of multi-modal open-world counting and lead to significant improvements in accuracy, efficiency, and generalization across multiple datasets. Code is available at this https URL.
45. 【2512.23343】AI Meets Brain: Memory Systems from Cognitive Neuroscience to Autonomous Agents
链接:https://arxiv.org/abs/2512.23343
作者:Jiafeng Liang,Hao Li,Chang Li,Jiaqi Zhou,Shixin Jiang,Zekun Wang,Changkai Ji,Zhihao Zhu,Runxuan Liu,Tao Ren,Jinlan Fu,See-Kiong Ng,Xia Liang,Ming Liu,Bing Qin
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:navigate complex tasks, pivotal nexus bridging, nexus bridging past, complex tasks, pivotal nexus
备注: 57 pages, 5 figures
点击查看摘要
Abstract:Memory serves as the pivotal nexus bridging past and future, providing both humans and AI systems with invaluable concepts and experience to navigate complex tasks. Recent research on autonomous agents has increasingly focused on designing efficient memory workflows by drawing on cognitive neuroscience. However, constrained by interdisciplinary barriers, existing works struggle to assimilate the essence of human memory mechanisms. To bridge this gap, we systematically synthesizes interdisciplinary knowledge of memory, connecting insights from cognitive neuroscience with LLM-driven agents. Specifically, we first elucidate the definition and function of memory along a progressive trajectory from cognitive neuroscience through LLMs to agents. We then provide a comparative analysis of memory taxonomy, storage mechanisms, and the complete management lifecycle from both biological and artificial perspectives. Subsequently, we review the mainstream benchmarks for evaluating agent memory. Additionally, we explore memory security from dual perspectives of attack and defense. Finally, we envision future research directions, with a focus on multimodal memory systems and skill acquisition.
46. 【2512.23335】Visual Language Hypothesis
链接:https://arxiv.org/abs/2512.23335
作者:Xiu Li
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:study visual representation, semantic, study visual, visual representation learning, representation learning
备注:
点击查看摘要
Abstract:We study visual representation learning from a structural and topological perspective. We begin from a single hypothesis: that visual understanding presupposes a semantic language for vision, in which many perceptual observations correspond to a small number of discrete semantic states. Together with widely assumed premises on transferability and abstraction in representation learning, this hypothesis implies that the visual observation space must be organized in a fiber bundle like structure, where nuisance variation populates fibers and semantics correspond to a quotient base space. From this structure we derive two theoretical consequences. First, the semantic quotient $X/G$ is not a submanifold of $X$ and cannot be obtained through smooth deformation alone, semantic invariance requires a non-homeomorphic, discriminative target, for example, supervision via labels, cross instance identification, or multimodal alignment that supplies explicit semantic equivalence. Second, we show that approximating the quotient also places structural demands on the model architecture. Semantic abstraction requires not only an external semantic target, but a representation mechanism capable of supporting topology change: an expand-and-snap process in which the manifold is first geometrically expanded to separate structure and then collapsed to form discrete semantic regions. We emphasize that these results are interpretive rather than prescriptive: the framework provides a topological lens that aligns with empirical regularities observed in large-scale discriminative and multimodal models, and with classical principles in statistical learning theory.
47. 【2512.23333】CME-CAD: Heterogeneous Collaborative Multi-Expert Reinforcement Learning for CAD Code Generation
链接:https://arxiv.org/abs/2512.23333
作者:Ke Niu,Haiyang Yu,Zhuofan Chen,Zhengtao Yao,Weitao Jia,Xiaodong Ge,Jingqun Tang,Benlei Cui,Bin Li,Xiangyang Xue
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:traditional CAD modeling, industrial design, Computer-Aided Design, Multi-Expert Reinforcement Learning, editable CAD models
备注:
点击查看摘要
Abstract:Computer-Aided Design (CAD) is essential in industrial design, but the complexity of traditional CAD modeling and workflows presents significant challenges for automating the generation of high-precision, editable CAD models. Existing methods that reconstruct 3D models from sketches often produce non-editable and approximate models that fall short of meeting the stringent requirements for precision and editability in industrial design. Moreover, the reliance on text or image-based inputs often requires significant manual annotation, limiting their scalability and applicability in industrial settings. To overcome these challenges, we propose the Heterogeneous Collaborative Multi-Expert Reinforcement Learning (CME-CAD) paradigm, a novel training paradigm for CAD code generation. Our approach integrates the complementary strengths of these models, facilitating collaborative learning and improving the model's ability to generate accurate, constraint-compatible, and fully editable CAD models. We introduce a two-stage training process: Multi-Expert Fine-Tuning (MEFT), and Multi-Expert Reinforcement Learning (MERL). Additionally, we present CADExpert, an open-source benchmark consisting of 17,299 instances, including orthographic projections with precise dimension annotations, expert-generated Chain-of-Thought (CoT) processes, executable CADQuery code, and rendered 3D models.
48. 【2512.23328】CubeBench: Diagnosing Interactive, Long-Horizon Spatial Reasoning Under Partial Observations
链接:https://arxiv.org/abs/2512.23328
作者:Huan-ang Gao,Zikang Zhang,Tianwei Luo,Kaisen Yang,Xinzhe Juan,Jiahao Qiu,Tianxing Chen,Bingxiang He,Hao Zhao,Hao Zhou,Shilong Liu,Mengdi Wang
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Language Model, Large Language, Language Model, physical-world deployment due, spatial mental model
备注: Webpage: [this https URL](https://cubebench.c7w.tech/)
点击查看摘要
Abstract:Large Language Model (LLM) agents, while proficient in the digital realm, face a significant gap in physical-world deployment due to the challenge of forming and maintaining a robust spatial mental model. We identify three core cognitive challenges hindering this transition: spatial reasoning, long-horizon state tracking via mental simulation, and active exploration under partial observation. To isolate and evaluate these faculties, we introduce CubeBench, a novel generative benchmark centered on the Rubik's Cube. CubeBench uses a three-tiered diagnostic framework that progressively assesses agent capabilities, from foundational state tracking with full symbolic information to active exploration with only partial visual data. Our experiments on leading LLMs reveal critical limitations, including a uniform 0.00% pass rate on all long-horizon tasks, exposing a fundamental failure in long-term planning. We also propose a diagnostic framework to isolate these cognitive bottlenecks by providing external solver tools. By analyzing the failure modes, we provide key insights to guide the development of more physically-grounded intelligent agents.
49. 【2512.23318】PCR-ORB: Enhanced ORB-SLAM3 with Point Cloud Refinement Using Deep Learning-Based Dynamic Object Filtering
链接:https://arxiv.org/abs/2512.23318
作者:Sheng-Kai Chen,Jie-Yu Chao,Jr-Yu Chang,Po-Lien Wu,Po-Chiang Lin
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:Visual Simultaneous Localization, Visual Simultaneous, Localization and Mapping, Simultaneous Localization, compromise tracking accuracy
备注: 17 pages, 2 figures, 1 table
点击查看摘要
Abstract:Visual Simultaneous Localization and Mapping (vSLAM) systems encounter substantial challenges in dynamic environments where moving objects compromise tracking accuracy and map consistency. This paper introduces PCR-ORB (Point Cloud Refinement ORB), an enhanced ORB-SLAM3 framework that integrates deep learning-based point cloud refinement to mitigate dynamic object interference. Our approach employs YOLOv8 for semantic segmentation combined with CUDA-accelerated processing to achieve real-time performance. The system implements a multi-stage filtering strategy encompassing ground plane estimation, sky region removal, edge filtering, and temporal consistency validation. Comprehensive evaluation on the KITTI dataset (sequences 00-09) demonstrates performance characteristics across different environmental conditions and scene types. Notable improvements are observed in specific sequences, with sequence 04 achieving 25.9% improvement in ATE RMSE and 30.4% improvement in ATE median. However, results show mixed performance across sequences, indicating scenario-dependent effectiveness. The implementation provides insights into dynamic object filtering challenges and opportunities for robust navigation in complex environments.
50. 【2512.23304】MedGemma vs GPT-4: Open-Source and Proprietary Zero-shot Medical Disease Classification from Images
链接:https://arxiv.org/abs/2512.23304
作者:Md. Sazzadul Islam Prottasha,Nabil Walid Rafi
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Multimodal Large Language, Large Language Models, Large Language, extensive clinical knowledge, large multimodal model
备注: Accepted for publication in the Journal of Machine Learning and Deep Learning (JMLDL). 9 pages, 9 figures, 10 tables
点击查看摘要
Abstract:Multimodal Large Language Models (LLMs) introduce an emerging paradigm for medical imaging by interpreting scans through the lens of extensive clinical knowledge, offering a transformative approach to disease classification. This study presents a critical comparison between two fundamentally different AI architectures: the specialized open-source agent MedGemma and the proprietary large multimodal model GPT-4 for diagnosing six different diseases. The MedGemma-4b-it model, fine-tuned using Low-Rank Adaptation (LoRA), demonstrated superior diagnostic capability by achieving a mean test accuracy of 80.37% compared to 69.58% for the untuned GPT-4. Furthermore, MedGemma exhibited notably higher sensitivity in high-stakes clinical tasks, such as cancer and pneumonia detection. Quantitative analysis via confusion matrices and classification reports provides comprehensive insights into model performance across all categories. These results emphasize that domain-specific fine-tuning is essential for minimizing hallucinations in clinical implementation, positioning MedGemma as a sophisticated tool for complex, evidence-based medical reasoning.
51. 【2512.23291】Multi-Track Multimodal Learning on iMiGUE: Micro-Gesture and Emotion Recognition
链接:https://arxiv.org/abs/2512.23291
作者:Arman Martirosyan,Shahane Tigranyan,Maria Razzhivina,Artak Aslanyan,Nazgul Salikhova,Ilya Makarov,Andrey Savchenko,Aram Avetisyan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:fine-grained human behaviors, require modeling subtle, primarily leveraging video, highly challenging tasks, skeletal pose data
备注:
点击查看摘要
Abstract:Micro-gesture recognition and behavior-based emotion prediction are both highly challenging tasks that require modeling subtle, fine-grained human behaviors, primarily leveraging video and skeletal pose data. In this work, we present two multimodal frameworks designed to tackle both problems on the iMiGUE dataset. For micro-gesture classification, we explore the complementary strengths of RGB and 3D pose-based representations to capture nuanced spatio-temporal patterns. To comprehensively represent gestures, video, and skeletal embeddings are extracted using MViTv2-S and 2s-AGCN, respectively. Then, they are integrated through a Cross-Modal Token Fusion module to combine spatial and pose information. For emotion recognition, our framework extends to behavior-based emotion prediction, a binary classification task identifying emotional states based on visual cues. We leverage facial and contextual embeddings extracted using SwinFace and MViTv2-S models and fuse them through an InterFusion module designed to capture emotional expressions and body gestures. Experiments conducted on the iMiGUE dataset, within the scope of the MiGA 2025 Challenge, demonstrate the robust performance and accuracy of our method in the behavior-based emotion prediction task, where our approach secured 2nd place.
52. 【2512.23273】YOLO-Master: MOE-Accelerated with Specialized Transformers for Enhanced Real-time Detection
链接:https://arxiv.org/abs/2512.23273
作者:Xu Lin,Jinlong Peng,Zhenye Gan,Jiawen Zhu,Jun Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Existing Real-Time Object, methods commonly adopt, Real-Time Object Detection, adopt YOLO-like architectures, commonly adopt YOLO-like
备注:
点击查看摘要
Abstract:Existing Real-Time Object Detection (RTOD) methods commonly adopt YOLO-like architectures for their favorable trade-off between accuracy and speed. However, these models rely on static dense computation that applies uniform processing to all inputs, misallocating representational capacity and computational resources such as over-allocating on trivial scenes while under-serving complex ones. This mismatch results in both computational redundancy and suboptimal detection performance. To overcome this limitation, we propose YOLO-Master, a novel YOLO-like framework that introduces instance-conditional adaptive computation for RTOD. This is achieved through a Efficient Sparse Mixture-of-Experts (ES-MoE) block that dynamically allocates computational resources to each input according to its scene complexity. At its core, a lightweight dynamic routing network guides expert specialization during training through a diversity enhancing objective, encouraging complementary expertise among experts. Additionally, the routing network adaptively learns to activate only the most relevant experts, thereby improving detection performance while minimizing computational overhead during inference. Comprehensive experiments on five large-scale benchmarks demonstrate the superiority of YOLO-Master. On MS COCO, our model achieves 42.4% AP with 1.62ms latency, outperforming YOLOv13-N by +0.8% mAP and 17.8% faster inference. Notably, the gains are most pronounced on challenging dense scenes, while the model preserves efficiency on typical inputs and maintains real-time inference speed. Code will be available.
53. 【2512.23258】Plug-and-Play Fidelity Optimization for Diffusion Transformer Acceleration via Cumulative Error Minimization
链接:https://arxiv.org/abs/2512.23258
作者:Tong Shao,Yusen Fu,Guoying Sun,Jingde Kong,Zhuotao Tian,Jingyong Su
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Diffusion Transformer, hinders broader applicability, iterative denoising process, denoising process results, slow inference
备注:
点击查看摘要
Abstract:Although Diffusion Transformer (DiT) has emerged as a predominant architecture for image and video generation, its iterative denoising process results in slow inference, which hinders broader applicability and development. Caching-based methods achieve training-free acceleration, while suffering from considerable computational error. Existing methods typically incorporate error correction strategies such as pruning or prediction to mitigate it. However, their fixed caching strategy fails to adapt to the complex error variations during denoising, which limits the full potential of error correction. To tackle this challenge, we propose a novel fidelity-optimization plugin for existing error correction methods via cumulative error minimization, named CEM. CEM predefines the error to characterize the sensitivity of model to acceleration jointly influenced by timesteps and cache intervals. Guided by this prior, we formulate a dynamic programming algorithm with cumulative error approximation for strategy optimization, which achieves the caching error minimization, resulting in a substantial improvement in generation fidelity. CEM is model-agnostic and exhibits strong generalization, which is adaptable to arbitrary acceleration budgets. It can be seamlessly integrated into existing error correction frameworks and quantized models without introducing any additional computational overhead. Extensive experiments conducted on nine generation models and quantized methods across three tasks demonstrate that CEM significantly improves generation fidelity of existing acceleration models, and outperforms the original generation performance on FLUX.1-dev, PixArt-$\alpha$, StableDiffusion1.5 and Hunyuan. The code will be made publicly available.
54. 【2512.23255】Contour Information Aware 2D Gaussian Splatting for Image Representation
链接:https://arxiv.org/abs/2512.23255
作者:Masaya Takabe,Hiroshi Watanabe,Sujun Hong,Tomohiro Ikai,Zheming Fan,Ryo Ishimoto,Kakeru Sugimoto,Ruri Imichi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Gaussian Splatting, Image representation, computer vision, Gaussian-based image representation, fundamental task
备注:
点击查看摘要
Abstract:Image representation is a fundamental task in computer vision. Recently, Gaussian Splatting has emerged as an efficient representation framework, and its extension to 2D image representation enables lightweight, yet expressive modeling of visual content. While recent 2D Gaussian Splatting (2DGS) approaches provide compact storage and real-time decoding, they often produce blurry or indistinct boundaries when the number of Gaussians is small due to the lack of contour awareness. In this work, we propose a Contour Information-Aware 2D Gaussian Splatting framework that incorporates object segmentation priors into Gaussian-based image representation. By constraining each Gaussian to a specific segmentation region during rasterization, our method prevents cross-boundary blending and preserves edge structures under high compression. We also introduce a warm-up scheme to stabilize training and improve convergence. Experiments on synthetic color charts and the DAVIS dataset demonstrate that our approach achieves higher reconstruction quality around object edges compared to existing 2DGS methods. The improvement is particularly evident in scenarios with very few Gaussians, while our method still maintains fast rendering and low memory usage.
55. 【2512.23245】ASemConsist: Adaptive Semantic Feature Control for Training-Free Identity-Consistent Generation
链接:https://arxiv.org/abs/2512.23245
作者:Shin seong Kim,Minjung Shin,Hyunin Cho,Youngjung Uh
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:significantly improved visual, improved visual quality, diffusion models, models have significantly, significantly improved
备注:
点击查看摘要
Abstract:Recent text-to-image diffusion models have significantly improved visual quality and text alignment. However, generating a sequence of images while preserving consistent character identity across diverse scene descriptions remains a challenging task. Existing methods often struggle with a trade-off between maintaining identity consistency and ensuring per-image prompt alignment. In this paper, we introduce a novel framework, ASemconsist, that addresses this challenge through selective text embedding modification, enabling explicit semantic control over character identity without sacrificing prompt alignment. Furthermore, based on our analysis of padding embeddings in FLUX, we propose a semantic control strategy that repurposes padding embeddings as semantic containers. Additionally, we introduce an adaptive feature-sharing strategy that automatically evaluates textual ambiguity and applies constraints only to the ambiguous identity prompt. Finally, we propose a unified evaluation protocol, the Consistency Quality Score (CQS), which integrates identity preservation and per-image text alignment into a single comprehensive metric, explicitly capturing performance imbalances between the two metrics. Our framework achieves state-of-the-art performance, effectively overcoming prior trade-offs. Project page: this https URL
56. 【2512.23244】ViLaCD-R1: A Vision-Language Framework for Semantic Change Detection in Remote Sensing
链接:https://arxiv.org/abs/2512.23244
作者:Xingwei Ma,Shiyang Feng,Bo Zhang,Bin Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Remote sensing change, inadequately capture high-level, capture high-level semantics, Remote sensing, sensing change detection
备注:
点击查看摘要
Abstract:Remote sensing change detection (RSCD), a complex multi-image inference task, traditionally uses pixel-based operators or encoder-decoder networks that inadequately capture high-level semantics and are vulnerable to non-semantic perturbations. Although recent multimodal and vision-language model (VLM)-based approaches enhance semantic understanding of change regions by incorporating textual descriptions, they still suffer from challenges such as inaccurate spatial localization, imprecise pixel-level boundary delineation, and limited interpretability. To address these issues, we propose ViLaCD-R1, a two-stage framework comprising a Multi-Image Reasoner (MIR) and a Mask-Guided Decoder (MGD). Specifically, the VLM is trained through supervised fine-tuning (SFT) and reinforcement learning (RL) on block-level dual-temporal inference tasks, taking dual-temporal image patches as input and outputting a coarse change mask. Then, the decoder integrates dual-temporal image features with this coarse mask to predict a precise binary change map. Comprehensive evaluations on multiple RSCD benchmarks demonstrate that ViLaCD-R1 substantially improves true semantic change recognition and localization, robustly suppresses non-semantic variations, and achieves state-of-the-art accuracy in complex real-world scenarios.
57. 【2512.23243】Multimodal Interpretation of Remote Sensing Images: Dynamic Resolution Input Strategy and Multi-scale Vision-Language Alignment Mechanism
链接:https://arxiv.org/abs/2512.23243
作者:Siyu Zhang,Ying Chen,Lianlei Shan,Runhe Qiu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:surface information extraction, Multi-scale Vision-language Alignment, Resolution Input Strategy, Dynamic Resolution Input, exhibits significant application
备注:
点击查看摘要
Abstract:Multimodal fusion of remote sensing images serves as a core technology for overcoming the limitations of single-source data and improving the accuracy of surface information extraction, which exhibits significant application value in fields such as environmental monitoring and urban planning. To address the deficiencies of existing methods, including the failure of fixed resolutions to balance efficiency and detail, as well as the lack of semantic hierarchy in single-scale alignment, this study proposes a Vision-language Model (VLM) framework integrated with two key innovations: the Dynamic Resolution Input Strategy (DRIS) and the Multi-scale Vision-language Alignment Mechanism (MS-VLAM).Specifically, the DRIS adopts a coarse-to-fine approach to adaptively allocate computational resources according to the complexity of image content, thereby preserving key fine-grained features while reducing redundant computational overhead. The MS-VLAM constructs a three-tier alignment mechanism covering object, local-region and global levels, which systematically captures cross-modal semantic consistency and alleviates issues of semantic misalignment and granularity this http URL results on the RS-GPT4V dataset demonstrate that the proposed framework significantly improves the accuracy of semantic understanding and computational efficiency in tasks including image captioning and cross-modal retrieval. Compared with conventional methods, it achieves superior performance in evaluation metrics such as BLEU-4 and CIDEr for image captioning, as well as R@10 for cross-modal retrieval. This technical framework provides a novel approach for constructing efficient and robust multimodal remote sensing systems, laying a theoretical foundation and offering technical guidance for the engineering application of intelligent remote sensing interpretation.
58. 【2512.23239】RS-Prune: Training-Free Data Pruning at High Ratios for Efficient Remote Sensing Diffusion Foundation Models
链接:https://arxiv.org/abs/2512.23239
作者:Fan Wei,Runmin Dong,Yushan Lai,Yixiang Yang,Zhaoyang Luo,Jinxiao Zhang,Miao Yang,Shuai Yuan,Jiyao Zhao,Bin Luo,Haohuan Fu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Diffusion-based remote sensing, Diffusion-based remote, foundation models, remote sensing, generative foundation models
备注:
点击查看摘要
Abstract:Diffusion-based remote sensing (RS) generative foundation models are cruial for downstream tasks. However, these models rely on large amounts of globally representative data, which often contain redundancy, noise, and class imbalance, reducing training efficiency and preventing convergence. Existing RS diffusion foundation models typically aggregate multiple classification datasets or apply simplistic deduplication, overlooking the distributional requirements of generation modeling and the heterogeneity of RS imagery. To address these limitations, we propose a training-free, two-stage data pruning approach that quickly select a high-quality subset under high pruning ratios, enabling a preliminary foundation model to converge rapidly and serve as a versatile backbone for generation, downstream fine-tuning, and other applications. Our method jointly considers local information content with global scene-level diversity and representativeness. First, an entropy-based criterion efficiently removes low-information samples. Next, leveraging RS scene classification datasets as reference benchmarks, we perform scene-aware clustering with stratified sampling to improve clustering effectiveness while reducing computational costs on large-scale unlabeled data. Finally, by balancing cluster-level uniformity and sample representativeness, the method enables fine-grained selection under high pruning ratios while preserving overall diversity and representativeness. Experiments show that, even after pruning 85\% of the training data, our method significantly improves convergence and generation quality. Furthermore, diffusion foundation models trained with our method consistently achieve state-of-the-art performance across downstream tasks, including super-resolution and semantic image synthesis. This data pruning paradigm offers practical guidance for developing RS generative foundation models.
59. 【2512.23234】Physics-Inspired Modeling and Content Adaptive Routing in an Infrared Gas Leak Detection Network
链接:https://arxiv.org/abs/2512.23234
作者:Dongsheng Li,Chaobo Chen,Siling Wang,Song Gao
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Detecting infrared gas, infrared gas leaks, Detecting infrared, industrial safety, leaks is critical
备注:
点击查看摘要
Abstract:Detecting infrared gas leaks is critical for environmental monitoring and industrial safety, yet remains difficult because plumes are faint, small, semitransparent, and have weak, diffuse boundaries. We present physics-edge hybrid gas dynamic routing network (PEG-DRNet). First, we introduce the Gas Block, a diffusion-convection unit modeling gas transport: a local branch captures short-range variations, while a large-kernel branch captures long-range propagation. An edge-gated learnable fusion module balances local detail and global context, strengthening weak-contrast plume and contour cues. Second, we propose the adaptive gradient and phase edge operator (AGPEO), computing reliable edge priors from multi-directional gradients and phase-consistent responses. These are transformed by a multi-scale edge perception module (MSEPM) into hierarchical edge features that reinforce boundaries. Finally, the content-adaptive sparse routing path aggregation network (CASR-PAN), with adaptive information modulation modules for fusion and self, selectively propagates informative features across scales based on edge and content cues, improving cross-scale discriminability while reducing redundancy. Experiments on the IIG dataset show that PEG-DRNet achieves an overall AP of 29.8\%, an AP$_{50}$ of 84.3\%, and a small-object AP of 25.3\%, surpassing the RT-DETR-R18 baseline by 3.0\%, 6.5\%, and 5.3\%, respectively, while requiring only 43.7 Gflops and 14.9 M parameters. The proposed PEG-DRNet achieves superior overall performance with the best balance of accuracy and computational efficiency, outperforming existing CNN and Transformer detectors in AP and AP$_{50}$ on the IIG and LangGas dataset.
60. 【2512.23232】SURE Guided Posterior Sampling: Trajectory Correction for Diffusion-Based Inverse Problems
链接:https://arxiv.org/abs/2512.23232
作者:Minwoo Kim,Hongki Lim
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:powerful learned priors, Unbiased Risk Estimate, Stein Unbiased Risk, models have emerged, emerged as powerful
备注:
点击查看摘要
Abstract:Diffusion models have emerged as powerful learned priors for solving inverse problems. However, current iterative solving approaches which alternate between diffusion sampling and data consistency steps typically require hundreds or thousands of steps to achieve high quality reconstruction due to accumulated errors. We address this challenge with SURE Guided Posterior Sampling (SGPS), a method that corrects sampling trajectory deviations using Stein's Unbiased Risk Estimate (SURE) gradient updates and PCA based noise estimation. By mitigating noise induced errors during the critical early and middle sampling stages, SGPS enables more accurate posterior sampling and reduces error accumulation. This allows our method to maintain high reconstruction quality with fewer than 100 Neural Function Evaluations (NFEs). Our extensive evaluation across diverse inverse problems demonstrates that SGPS consistently outperforms existing methods at low NFE counts.
61. 【2512.23227】Anomaly Detection by Effectively Leveraging Synthetic Images
链接:https://arxiv.org/abs/2512.23227
作者:Sungho Kang,Hyunkyu Park,Yeonho Lee,Hanbyul Lee,Mijoo Jeong,YeongHyeon Park,Injae Lee,Juneho Yi
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:images, Anomaly detection plays, Anomaly detection, defect images, industrial manufacturing
备注:
点击查看摘要
Abstract:Anomaly detection plays a vital role in industrial manufacturing. Due to the scarcity of real defect images, unsupervised approaches that rely solely on normal images have been extensively studied. Recently, diffusion-based generative models brought attention to training data synthesis as an alternative solution. In this work, we focus on a strategy to effectively leverage synthetic images to maximize the anomaly detection performance. Previous synthesis strategies are broadly categorized into two groups, presenting a clear trade-off. Rule-based synthesis, such as injecting noise or pasting patches, is cost-effective but often fails to produce realistic defect images. On the other hand, generative model-based synthesis can create high-quality defect images but requires substantial cost. To address this problem, we propose a novel framework that leverages a pre-trained text-guided image-to-image translation model and image retrieval model to efficiently generate synthetic defect images. Specifically, the image retrieval model assesses the similarity of the generated images to real normal images and filters out irrelevant outputs, thereby enhancing the quality and relevance of the generated defect images. To effectively leverage synthetic images, we also introduce a two stage training strategy. In this strategy, the model is first pre-trained on a large volume of images from rule-based synthesis and then fine-tuned on a smaller set of high-quality images. This method significantly reduces the cost for data collection while improving the anomaly detection performance. Experiments on the MVTec AD dataset demonstrate the effectiveness of our approach.
62. 【2512.23222】Bridging Your Imagination with Audio-Video Generation via a Unified Director
链接:https://arxiv.org/abs/2512.23222
作者:Jiaxu Zhang,Tianshu Hu,Yuan Zhang,Zenan Li,Linjie Luo,Guosheng Lin,Xin Chen
类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:creation systems typically, systems typically treat, large language models, AI-driven video creation, video creation systems
备注:
点击查看摘要
Abstract:Existing AI-driven video creation systems typically treat script drafting and key-shot design as two disjoint tasks: the former relies on large language models, while the latter depends on image generation models. We argue that these two tasks should be unified within a single framework, as logical reasoning and imaginative thinking are both fundamental qualities of a film director. In this work, we propose UniMAGE, a unified director model that bridges user prompts with well-structured scripts, thereby empowering non-experts to produce long-context, multi-shot films by leveraging existing audio-video generation models. To achieve this, we employ the Mixture-of-Transformers architecture that unifies text and image generation. To further enhance narrative logic and keyframe consistency, we introduce a ``first interleaving, then disentangling'' training paradigm. Specifically, we first perform Interleaved Concept Learning, which utilizes interleaved text-image data to foster the model's deeper understanding and imaginative interpretation of scripts. We then conduct Disentangled Expert Learning, which decouples script writing from keyframe generation, enabling greater flexibility and creativity in storytelling. Extensive experiments demonstrate that UniMAGE achieves state-of-the-art performance among open-source models, generating logically coherent video scripts and visually consistent keyframe images.
63. 【2512.23221】Holi-DETR: Holistic Fashion Item Detection Leveraging Contextual Information
链接:https://arxiv.org/abs/2512.23221
作者:Youngchae Kwon,Jinyoung Choi,Injung Kim
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Holistic Detection Transformer, highly diverse appearances, Fashion item detection, fashion items, Detection Transformer
备注: 20 pages, 6 figures
点击查看摘要
Abstract:Fashion item detection is challenging due to the ambiguities introduced by the highly diverse appearances of fashion items and the similarities among item subcategories. To address this challenge, we propose a novel Holistic Detection Transformer (Holi-DETR) that detects fashion items in outfit images holistically, by leveraging contextual information. Fashion items often have meaningful relationships as they are combined to create specific styles. Unlike conventional detectors that detect each item independently, Holi-DETR detects multiple items while reducing ambiguities by leveraging three distinct types of contextual information: (1) the co-occurrence relationship between fashion items, (2) the relative position and size based on inter-item spatial arrangements, and (3) the spatial relationships between items and human body key-points. %Holi-DETR explicitly incorporates three types of contextual information: (1) the co-occurrence probability between fashion items, (2) the relative position and size based on inter-item spatial arrangements, and (3) the spatial relationships between items and human body key-points. To this end, we propose a novel architecture that integrates these three types of heterogeneous contextual information into the Detection Transformer (DETR) and its subsequent models. In experiments, the proposed methods improved the performance of the vanilla DETR and the more recently developed Co-DETR by 3.6 percent points (pp) and 1.1 pp, respectively, in terms of average precision (AP).
64. 【2512.23219】MM-UAVBench: How Well Do Multimodal Large Language Models See, Think, and Plan in Low-Altitude UAV Scenarios?
链接:https://arxiv.org/abs/2512.23219
作者:Shiqi Dai,Zizhi Ma,Zhicong Luo,Xuesong Yang,Yibin Huang,Wanyue Zhang,Chi Chen,Zonghao Guo,Wang Xu,Yufei Sun,Maosong Sun
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Unmanned Aerial Vehicles, Multimodal Large Language, Large Language Models, Aerial Vehicles, Multimodal Large
备注: 25 pages
点击查看摘要
Abstract:While Multimodal Large Language Models (MLLMs) have exhibited remarkable general intelligence across diverse domains, their potential in low-altitude applications dominated by Unmanned Aerial Vehicles (UAVs) remains largely underexplored. Existing MLLM benchmarks rarely cover the unique challenges of low-altitude scenarios, while UAV-related evaluations mainly focus on specific tasks such as localization or navigation, without a unified evaluation of MLLMs'general intelligence. To bridge this gap, we present MM-UAVBench, a comprehensive benchmark that systematically evaluates MLLMs across three core capability dimensions-perception, cognition, and planning-in low-altitude UAV scenarios. MM-UAVBench comprises 19 sub-tasks with over 5.7K manually annotated questions, all derived from real-world UAV data collected from public datasets. Extensive experiments on 16 open-source and proprietary MLLMs reveal that current models struggle to adapt to the complex visual and cognitive demands of low-altitude scenarios. Our analyses further uncover critical bottlenecks such as spatial bias and multi-view understanding that hinder the effective deployment of MLLMs in UAV scenarios. We hope MM-UAVBench will foster future research on robust and reliable MLLMs for real-world UAV intelligence.
65. 【2512.23215】AVOID: The Adverse Visual Conditions Dataset with Obstacles for Driving Scene Understanding
链接:https://arxiv.org/abs/2512.23215
作者:Jongoh Jeong,Taek-Jin Song,Jong-Hwan Kim,Kuk-Jin Yoon
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Understanding road scenes, intelligent self-driving cars, perception remains crucial, Understanding road, self-driving cars
备注:
点击查看摘要
Abstract:Understanding road scenes for visual perception remains crucial for intelligent self-driving cars. In particular, it is desirable to detect unexpected small road hazards reliably in real-time, especially under varying adverse conditions (e.g., weather and daylight). However, existing road driving datasets provide large-scale images acquired in either normal or adverse scenarios only, and often do not contain the road obstacles captured in the same visual domain as for the other classes. To address this, we introduce a new dataset called AVOID, the Adverse Visual Conditions Dataset, for real-time obstacle detection collected in a simulated environment. AVOID consists of a large set of unexpected road obstacles located along each path captured under various weather and time conditions. Each image is coupled with the corresponding semantic and depth maps, raw and semantic LiDAR data, and waypoints, thereby supporting most visual perception tasks. We benchmark the results on high-performing real-time networks for the obstacle detection task, and also propose and conduct ablation studies using a comprehensive multi-task network for semantic segmentation, depth and waypoint prediction tasks.
66. 【2512.23210】ask-oriented Learnable Diffusion Timesteps for Universal Few-shot Learning of Dense Tasks
链接:https://arxiv.org/abs/2512.23210
作者:Changgyoon Oh,Jongoh Jeong,Jegyeong Cho,Kuk-Jin Yoon
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Denoising diffusion probabilistic, brought tremendous advances, diffusion probabilistic models, Denoising diffusion, diffusion timestep features
备注:
点击查看摘要
Abstract:Denoising diffusion probabilistic models have brought tremendous advances in generative tasks, achieving state-of-the-art performance thus far. Current diffusion model-based applications exploit the power of learned visual representations from multistep forward-backward Markovian processes for single-task prediction tasks by attaching a task-specific decoder. However, the heuristic selection of diffusion timestep features still heavily relies on empirical intuition, often leading to sub-optimal performance biased towards certain tasks. To alleviate this constraint, we investigate the significance of versatile diffusion timestep features by adaptively selecting timesteps best suited for the few-shot dense prediction task, evaluated on an arbitrary unseen task. To this end, we propose two modules: Task-aware Timestep Selection (TTS) to select ideal diffusion timesteps based on timestep-wise losses and similarity scores, and Timestep Feature Consolidation (TFC) to consolidate the selected timestep features to improve the dense predictive performance in a few-shot setting. Accompanied by our parameter-efficient fine-tuning adapter, our framework effectively achieves superiority in dense prediction performance given only a few support queries. We empirically validate our learnable timestep consolidation method on the large-scale challenging Taskonomy dataset for dense prediction, particularly for practical universal and few-shot learning scenarios.
67. 【2512.23208】Exploring Syn-to-Real Domain Adaptation for Military Target Detection
链接:https://arxiv.org/abs/2512.23208
作者:Jongoh Jeong,Youngjin Oh,Gyeongrae Nam,Jeongeun Lee,Kuk-Jin Yoon
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:key target tasks, military target detection, target detection, military target, tasks of interest
备注:
点击查看摘要
Abstract:Object detection is one of the key target tasks of interest in the context of civil and military applications. In particular, the real-world deployment of target detection methods is pivotal in the decision-making process during military command and reconnaissance. However, current domain adaptive object detection algorithms consider adapting one domain to another similar one only within the scope of natural or autonomous driving scenes. Since military domains often deal with a mixed variety of environments, detecting objects from multiple varying target domains poses a greater challenge. Several studies for armored military target detection have made use of synthetic aperture radar (SAR) data due to its robustness to all weather, long range, and high-resolution characteristics. Nevertheless, the costs of SAR data acquisition and processing are still much higher than those of the conventional RGB camera, which is a more affordable alternative with significantly lower data processing time. Furthermore, the lack of military target detection datasets limits the use of such a low-cost approach. To mitigate these issues, we propose to generate RGB-based synthetic data using a photorealistic visual tool, Unreal Engine, for military target detection in a cross-domain setting. To this end, we conducted synthetic-to-real transfer experiments by training our synthetic dataset and validating on our web-collected real military target datasets. We benchmark the state-of-the-art domain adaptation methods distinguished by the degree of supervision on our proposed train-val dataset pair, and find that current methods using minimal hints on the image (e.g., object class) achieve a substantial improvement over unsupervised or semi-supervised DA methods. From these observations, we recognize the current challenges that remain to be overcome.
68. 【2512.23196】ForCM: Forest Cover Mapping from Multispectral Sentinel-2 Image by Integrating Deep Learning with Object-Based Image Analysis
链接:https://arxiv.org/abs/2512.23196
作者:Maisha Haque,Israt Jahan Ayshi,Sadaf M. Anis,Nahian Tasnim,Mithila Moontaha,Md. Sabbir Ahmed,Muhammad Iqbal Hossain,Mohammad Zavid Parvez,Subrata Chakraborty,Biswajeet Pradhan,Biswajit Banik
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Object-Based Image Analysis, combines Object-Based Image, Image Analysis, combines Object-Based, Object-Based Image
备注: 15 pages, 7 figures. Accepted for presentation at the Australasian Data Science and Machine Learning Conference (AusDM 2024)
点击查看摘要
Abstract:This research proposes "ForCM", a novel approach to forest cover mapping that combines Object-Based Image Analysis (OBIA) with Deep Learning (DL) using multispectral Sentinel-2 imagery. The study explores several DL models, including UNet, UNet++, ResUNet, AttentionUNet, and ResNet50-Segnet, applied to high-resolution Sentinel-2 Level 2A satellite images of the Amazon Rainforest. The datasets comprise three collections: two sets of three-band imagery and one set of four-band imagery. After evaluation, the most effective DL models are individually integrated with the OBIA technique to enhance mapping accuracy. The originality of this work lies in evaluating different deep learning models combined with OBIA and comparing them with traditional OBIA methods. The results show that the proposed ForCM method improves forest cover mapping, achieving overall accuracies of 94.54 percent with ResUNet-OBIA and 95.64 percent with AttentionUNet-OBIA, compared to 92.91 percent using traditional OBIA. This research also demonstrates the potential of free and user-friendly tools such as QGIS for accurate mapping within their limitations, supporting global environmental monitoring and conservation efforts.
69. 【2512.23180】GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation
链接:https://arxiv.org/abs/2512.23180
作者:Tianchen Deng,Xuefeng Chen,Yi Chen,Qu Chen,Yuyao Xu,Lijin Yang,Le Xu,Yu Zhang,Bo Zhang,Wuxiong Huang,Hesheng Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Driving World Models, Driving World, World Models, developing rapidly, advances of generative
备注:
点击查看摘要
Abstract:Driving World Models (DWMs) have been developing rapidly with the advances of generative models. However, existing DWMs lack 3D scene understanding capabilities and can only generate content conditioned on input data, without the ability to interpret or reason about the driving environment. Moreover, current approaches represent 3D spatial information with point cloud or BEV features do not accurately align textual information with the underlying 3D scene. To address these limitations, we propose a novel unified DWM framework based on 3D Gaussian scene representation, which enables both 3D scene understanding and multi-modal scene generation, while also enabling contextual enrichment for understanding and generation tasks. Our approach directly aligns textual information with the 3D scene by embedding rich linguistic features into each Gaussian primitive, thereby achieving early modality alignment. In addition, we design a novel task-aware language-guided sampling strategy that removes redundant 3D Gaussians and injects accurate and compact 3D tokens into LLM. Furthermore, we design a dual-condition multi-modal generation model, where the information captured by our vision-language model is leveraged as a high-level language condition in combination with a low-level image condition, jointly guiding the multi-modal generation process. We conduct comprehensive studies on the nuScenes, and NuInteract datasets to validate the effectiveness of our framework. Our method achieves state-of-the-art performance. We will release the code publicly on GitHub this https URL.
70. 【2512.23177】Machine Learning-Assisted Vocal Cord Ultrasound Examination: Project VIPR
链接:https://arxiv.org/abs/2512.23177
作者:Will Sebelik-Lassiter,Evan Schubert,Muhammad Alliyu,Quentin Robbins,Excel Olatunji,Mustafa Barry
类目:Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV)
关键词:tolerated examination technique, Vocal cord, Vocal cord ultrasound, examination technique, operator dependent
备注: Won Best Undergraduate Research Paper at the 2025 Midwest Instruction Computing Symposium (MICS)
点击查看摘要
Abstract:Intro: Vocal cord ultrasound (VCUS) has emerged as a less invasive and better tolerated examination technique, but its accuracy is operator dependent. This research aims to apply a machine learning-assisted algorithm to automatically identify the vocal cords and distinguish normal vocal cord images from vocal cord paralysis (VCP). Methods: VCUS videos were acquired from 30 volunteers, which were split into still frames and cropped to a uniform size. Healthy and simulated VCP images were used as training data for vocal cord segmentation and VCP classification models. Results: The vocal cord segmentation model achieved a validation accuracy of 96%, while the best classification model (VIPRnet) achieved a validation accuracy of 99%. Conclusion: Machine learning-assisted analysis of VCUS shows great promise in improving diagnostic accuracy over operator-dependent human interpretation.
71. 【2512.23176】GVSynergy-Det: Synergistic Gaussian-Voxel Representations for Multi-View 3D Object Detection
链接:https://arxiv.org/abs/2512.23176
作者:Yi Zhang,Yi Wang,Lei Yao,Lap-Pui Chau
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:RGB images, expensive depth sensors, depth sensors required, aims to identify, identify and localize
备注: 11 pages, 5 figures
点击查看摘要
Abstract:Image-based 3D object detection aims to identify and localize objects in 3D space using only RGB images, eliminating the need for expensive depth sensors required by point cloud-based methods. Existing image-based approaches face two critical challenges: methods achieving high accuracy typically require dense 3D supervision, while those operating without such supervision struggle to extract accurate geometry from images alone. In this paper, we present GVSynergy-Det, a novel framework that enhances 3D detection through synergistic Gaussian-Voxel representation learning. Our key insight is that continuous Gaussian and discrete voxel representations capture complementary geometric information: Gaussians excel at modeling fine-grained surface details while voxels provide structured spatial context. We introduce a dual-representation architecture that: 1) adapts generalizable Gaussian Splatting to extract complementary geometric features for detection tasks, and 2) develops a cross-representation enhancement mechanism that enriches voxel features with geometric details from Gaussian fields. Unlike previous methods that either rely on time-consuming per-scene optimization or utilize Gaussian representations solely for depth regularization, our synergistic strategy directly leverages features from both representations through learnable integration, enabling more accurate object localization. Extensive experiments demonstrate that GVSynergy-Det achieves state-of-the-art results on challenging indoor benchmarks, significantly outperforming existing methods on both ScanNetV2 and ARKitScenes datasets, all without requiring any depth or dense 3D geometry supervision (e.g., point clouds or TSDF).
72. 【2512.23169】REVEALER: Reinforcement-Guided Visual Reasoning for Element-Level Text-Image Alignment Evaluation
链接:https://arxiv.org/abs/2512.23169
作者:Fulin Shi,Wenyi Xiao,Bin Chen,Liang Din,Leilei Gan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:textual prompts, prompts and generated, generated images, images is critical, critical for ensuring
备注:
点击查看摘要
Abstract:Evaluating the alignment between textual prompts and generated images is critical for ensuring the reliability and usability of text-to-image (T2I) models. However, most existing evaluation methods rely on coarse-grained metrics or static QA pipelines, which lack fine-grained interpretability and struggle to reflect human preferences. To address this, we propose REVEALER, a unified framework for element-level alignment evaluation based on reinforcement-guided visual reasoning. Adopting a structured "grounding-reasoning-conclusion" paradigm, our method enables Multimodal Large Language Models (MLLMs) to explicitly localize semantic elements and derive interpretable alignment judgments. We optimize the model via Group Relative Policy Optimization(GRPO) using a composite reward function that incorporates structural format, grounding accuracy, and alignment fidelity. Extensive experiments across four benchmarks-EvalMuse-40K, RichHF, MHaluBench, and GenAI-Bench-demonstrate that REVEALER achieves state-of-the-art performance. Our approach consistently outperforms both strong proprietary models and supervised baselines while demonstrating superior inference efficiency compared to existing iterative visual reasoning methods.
73. 【2512.23162】SurgWorld: Learning Surgical Robot Policies from Videos via World Modeling
链接:https://arxiv.org/abs/2512.23162
作者:Yufan He,Pengfei Guo,Mengya Xu,Zhaoshuo Li,Andriy Myronenko,Dillan Imans,Bingjie Liu,Dongren Yang,Mingxue Gu,Yongnan Ji,Yueming Jin,Ren Zhao,Baiyong Shen,Daguang Xu
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:Data scarcity remains, surgical, achieving fully autonomous, scarcity remains, remains a fundamental
备注:
点击查看摘要
Abstract:Data scarcity remains a fundamental barrier to achieving fully autonomous surgical robots. While large scale vision language action (VLA) models have shown impressive generalization in household and industrial manipulation by leveraging paired video action data from diverse domains, surgical robotics suffers from the paucity of datasets that include both visual observations and accurate robot kinematics. In contrast, vast corpora of surgical videos exist, but they lack corresponding action labels, preventing direct application of imitation learning or VLA training. In this work, we aim to alleviate this problem by learning policy models from SurgWorld, a world model designed for surgical physical AI. We curated the Surgical Action Text Alignment (SATA) dataset with detailed action description specifically for surgical robots. Then we built SurgeWorld based on the most advanced physical AI world model and SATA. It's able to generate diverse, generalizable and realistic surgery videos. We are also the first to use an inverse dynamics model to infer pseudokinematics from synthetic surgical videos, producing synthetic paired video action data. We demonstrate that a surgical VLA policy trained with these augmented data significantly outperforms models trained only on real demonstrations on a real surgical robot platform. Our approach offers a scalable path toward autonomous surgical skill acquisition by leveraging the abundance of unlabeled surgical video and generative world modeling, thus opening the door to generalizable and data efficient surgical robot policies.
74. 【2512.23147】GeoTeacher: Geometry-Guided Semi-Supervised 3D Object Detection
链接:https://arxiv.org/abs/2512.23147
作者:Jingyu Li,Xiaolong Zhao,Zhe Liu,Wenxiao Wu,Li Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:active research area, student model ability, student model, aiming to explore, recent years
备注:
点击查看摘要
Abstract:Semi-supervised 3D object detection, aiming to explore unlabeled data for boosting 3D object detectors, has emerged as an active research area in recent years. Some previous methods have shown substantial improvements by either employing heterogeneous teacher models to provide high-quality pseudo labels or enforcing feature-perspective consistency between the teacher and student networks. However, these methods overlook the fact that the model usually tends to exhibit low sensitivity to object geometries with limited labeled data, making it difficult to capture geometric information, which is crucial for enhancing the student model's ability in object perception and localization. In this paper, we propose GeoTeacher to enhance the student model's ability to capture geometric relations of objects with limited training data, especially unlabeled data. We design a keypoint-based geometric relation supervision module that transfers the teacher model's knowledge of object geometry to the student, thereby improving the student's capability in understanding geometric relations. Furthermore, we introduce a voxel-wise data augmentation strategy that increases the diversity of object geometries, thereby further improving the student model's ability to comprehend geometric structures. To preserve the integrity of distant objects during augmentation, we incorporate a distance-decay mechanism into this strategy. Moreover, GeoTeacher can be combined with different SS3D methods to further improve their performance. Extensive experiments on the ONCE and Waymo datasets indicate the effectiveness and generalization of our method and we achieve the new state-of-the-art results. Code will be available at this https URL
75. 【2512.23142】Domain-Shift Immunity in Deep Deformable Registration via Local Feature Representations
链接:https://arxiv.org/abs/2512.23142
作者:Mingzhen Shao,Sarang Joshi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:surpassing traditional optimization-based, advanced deformable image, surpassing traditional, accuracy and efficiency, learning has advanced
备注:
点击查看摘要
Abstract:Deep learning has advanced deformable image registration, surpassing traditional optimization-based methods in both accuracy and efficiency. However, learning-based models are widely believed to be sensitive to domain shift, with robustness typically pursued through large and diverse training datasets, without explaining the underlying mechanisms. In this work, we show that domain-shift immunity is an inherent property of deep deformable registration models, arising from their reliance on local feature representations rather than global appearance for deformation estimation. To isolate and validate this mechanism, we introduce UniReg, a universal registration framework that decouples feature extraction from deformation estimation using fixed, pre-trained feature extractors and a UNet-based deformation network. Despite training on a single dataset, UniReg exhibits robust cross-domain and multi-modal performance comparable to optimization-based methods. Our analysis further reveals that failures of conventional CNN-based models under modality shift originate from dataset-induced biases in early convolutional layers. These findings identify local feature consistency as the key driver of robustness in learning-based deformable registration and motivate backbone designs that preserve domain-invariant local features.
76. 【2512.23130】PathoSyn: Imaging-Pathology MRI Synthesis via Disentangled Deviation Diffusion
链接:https://arxiv.org/abs/2512.23130
作者:Jian Wang,Sixing Rong,Jiarui Xing,Yuling Xu,Weide Liu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Magnetic Resonance Imaging, Magnetic Resonance, stable anatomical manifold, disentangled additive deviation, Resonance Imaging
备注:
点击查看摘要
Abstract:We present PathoSyn, a unified generative framework for Magnetic Resonance Imaging (MRI) image synthesis that reformulates imaging-pathology as a disentangled additive deviation on a stable anatomical manifold. Current generative models typically operate in the global pixel domain or rely on binary masks, these paradigms often suffer from feature entanglement, leading to corrupted anatomical substrates or structural discontinuities. PathoSyn addresses these limitations by decomposing the synthesis task into deterministic anatomical reconstruction and stochastic deviation modeling. Central to our framework is a Deviation-Space Diffusion Model designed to learn the conditional distribution of pathological residuals, thereby capturing localized intensity variations while preserving global structural integrity by construction. To ensure spatial coherence, the diffusion process is coupled with a seam-aware fusion strategy and an inference-time stabilization module, which collectively suppress boundary artifacts and produce high-fidelity internal lesion heterogeneity. PathoSyn provides a mathematically principled pipeline for generating high-fidelity patient-specific synthetic datasets, facilitating the development of robust diagnostic algorithms in low-data regimes. By allowing interpretable counterfactual disease progression modeling, the framework supports precision intervention planning and provides a controlled environment for benchmarking clinical decision-support systems. Quantitative and qualitative evaluations on tumor imaging benchmarks demonstrate that PathoSyn significantly outperforms holistic diffusion and mask-conditioned baselines in both perceptual realism and anatomical fidelity. The source code of this work will be made publicly available.
77. 【2512.23089】MedSAM-based lung masking for multi-label chest X-ray classification
链接:https://arxiv.org/abs/2512.23089
作者:Brayden Miao,Zain Rehman,Xin Miao,Siming Liu,Jianjie Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Chest X-ray, weak disease signals, automated interpretation remains, interpretation remains challenging, remains challenging due
备注: 16 pages, 8 figures
点击查看摘要
Abstract:Chest X-ray (CXR) imaging is widely used for screening and diagnosing pulmonary abnormalities, yet automated interpretation remains challenging due to weak disease signals, dataset bias, and limited spatial supervision. Foundation models for medical image segmentation (MedSAM) provide an opportunity to introduce anatomically grounded priors that may improve robustness and interpretability in CXR analysis. We propose a segmentation-guided CXR classification pipeline that integrates MedSAM as a lung region extraction module prior to multi-label abnormality classification. MedSAM is fine-tuned using a public image-mask dataset from Airlangga University Hospital. We then apply it to a curated subset of the public NIH CXR dataset to train and evaluate deep convolutional neural networks for multi-label prediction of five abnormalities (Mass, Nodule, Pneumonia, Edema, and Fibrosis), with the normal case (No Finding) evaluated via a derived score. Experiments show that MedSAM produces anatomically plausible lung masks across diverse imaging conditions. We find that masking effects are both task-dependent and architecture-dependent. ResNet50 trained on original images achieves the strongest overall abnormality discrimination, while loose lung masking yields comparable macro AUROC but significantly improves No Finding discrimination, indicating a trade-off between abnormality-specific classification and normal case screening. Tight masking consistently reduces abnormality level performance but improves training efficiency. Loose masking partially mitigates this degradation by preserving perihilar and peripheral context. These results suggest that lung masking should be treated as a controllable spatial prior selected to match the backbone and clinical objective, rather than applied uniformly.
78. 【2512.23073】Rethinking Fine-Tuning: Unlocking Hidden Capabilities in Vision-Language Models
链接:https://arxiv.org/abs/2512.23073
作者:Mingyuan Zhang,Yue Bai,Yifan Wang,Yiyang Huang,Yun Fu
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:made impressive progress, Parameter Efficient Fine-Tuning, Parameter Efficient, impressive progress, made impressive
备注:
点击查看摘要
Abstract:Explorations in fine-tuning Vision-Language Models (VLMs), such as Low-Rank Adaptation (LoRA) from Parameter Efficient Fine-Tuning (PEFT), have made impressive progress. However, most approaches rely on explicit weight updates, overlooking the extensive representational structures already encoded in pre-trained models that remain underutilized. Recent works have demonstrated that Mask Fine-Tuning (MFT) can be a powerful and efficient post-training paradigm for language models. Instead of updating weights, MFT assigns learnable gating scores to each weight, allowing the model to reorganize its internal subnetworks for downstream task adaptation. In this paper, we rethink fine-tuning for VLMs from a structural reparameterization perspective grounded in MFT. We apply MFT to the language and projector components of VLMs with different language backbones and compare against strong PEFT baselines. Experiments show that MFT consistently surpasses LoRA variants and even full fine-tuning, achieving high performance without altering the frozen backbone. Our findings reveal that effective adaptation can emerge not only from updating weights but also from reestablishing connections among the model's existing knowledge. Code available at: this https URL
79. 【2512.23044】Video-BrowseComp: Benchmarking Agentic Video Research on Open Web
链接:https://arxiv.org/abs/2512.23044
作者:Zhengyang Liang,Yan Shu,Xiangrui Liu,Minghao Qin,Kaixin Liang,Paolo Rota,Nicu Sebe,Zheng Liu,Lizi Liao
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:redefining information seeking, information seeking, open-ended web research, evolution of autonomous, redefining information
备注:
点击查看摘要
Abstract:The evolution of autonomous agents is redefining information seeking, transitioning from passive retrieval to proactive, open-ended web research. However, while textual and static multimodal agents have seen rapid progress, a significant modality gap remains in processing the web's most dynamic modality: video. Existing video benchmarks predominantly focus on passive perception, feeding curated clips to models without requiring external retrieval. They fail to evaluate agentic video research, which necessitates actively interrogating video timelines, cross-referencing dispersed evidence, and verifying claims against the open web. To bridge this gap, we present \textbf{Video-BrowseComp}, a challenging benchmark comprising 210 questions tailored for open-web agentic video reasoning. Unlike prior benchmarks, Video-BrowseComp enforces a mandatory dependency on temporal visual evidence, ensuring that answers cannot be derived solely through text search but require navigating video timelines to verify external claims. Our evaluation of state-of-the-art models reveals a critical bottleneck: even advanced search-augmented models like GPT-5.1 (w/ Search) achieve only 15.24\% accuracy. Our analysis reveals that these models largely rely on textual proxies, excelling in metadata-rich domains (e.g., TV shows with plot summaries) but collapsing in metadata-sparse, dynamic environments (e.g., sports, gameplay) where visual grounding is essential. As the first open-web video research benchmark, Video-BrowseComp advances the field beyond passive perception toward proactive video reasoning.
80. 【2512.23042】3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds
链接:https://arxiv.org/abs/2512.23042
作者:Ryousuke Yamada,Kohsuke Ide,Yoshihiro Fukuhara,Hirokatsu Kataoka,Gilles Puy,Andrei Bursuc,Yuki M. Asano
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:scans remains expensive, expensive and labor-intensive, recent progress, remains expensive, scene scans remains
备注:
点击查看摘要
Abstract:Despite recent progress in 3D self-supervised learning, collecting large-scale 3D scene scans remains expensive and labor-intensive. In this work, we investigate whether 3D representations can be learned from unlabeled videos recorded without any real 3D sensors. We present Laplacian-Aware Multi-level 3D Clustering with Sinkhorn-Knopp (LAM3C), a self-supervised framework that learns from video-generated point clouds from unlabeled videos. We first introduce RoomTours, a video-generated point cloud dataset constructed by collecting room-walkthrough videos from the web (e.g., real-estate tours) and generating 49,219 scenes using an off-the-shelf feed-forward reconstruction model. We also propose a noise-regularized loss that stabilizes representation learning by enforcing local geometric smoothness and ensuring feature stability under noisy point clouds. Remarkably, without using any real 3D scans, LAM3C achieves higher performance than the previous self-supervised methods on indoor semantic and instance segmentation. These results suggest that unlabeled videos represent an abundant source of data for 3D self-supervised learning.
81. 【2512.23035】oward Stable Semi-Supervised Remote Sensing Segmentation via Co-Guidance and Co-Fusion
链接:https://arxiv.org/abs/2512.23035
作者:Yi Zhou,Xuechao Zou,Shun Zhang,Kai Li,Shiying Wang,Jingming Chen,Congyan Lang,Tengfei Cao,Pin Tao,Yuanchun Shi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Semi-supervised remote sensing, confirmation bias leads, image semantic segmentation, remote sensing, exhaustive annotation
备注: 13 pages, 5 figures, 10 tables
点击查看摘要
Abstract:Semi-supervised remote sensing (RS) image semantic segmentation offers a promising solution to alleviate the burden of exhaustive annotation, yet it fundamentally struggles with pseudo-label drift, a phenomenon where confirmation bias leads to the accumulation of errors during training. In this work, we propose Co2S, a stable semi-supervised RS segmentation framework that synergistically fuses priors from vision-language models and self-supervised models. Specifically, we construct a heterogeneous dual-student architecture comprising two distinct ViT-based vision foundation models initialized with pretrained CLIP and DINOv3 to mitigate error accumulation and pseudo-label drift. To effectively incorporate these distinct priors, an explicit-implicit semantic co-guidance mechanism is introduced that utilizes text embeddings and learnable queries to provide explicit and implicit class-level guidance, respectively, thereby jointly enhancing semantic consistency. Furthermore, a global-local feature collaborative fusion strategy is developed to effectively fuse the global contextual information captured by CLIP with the local details produced by DINOv3, enabling the model to generate highly precise segmentation results. Extensive experiments on six popular datasets demonstrate the superiority of the proposed method, which consistently achieves leading performance across various partition protocols and diverse scenarios. Project page is available at this https URL.
82. 【2512.23033】Interpretable Gallbladder Ultrasound Diagnosis: A Lightweight Web-Mobile Software Platform with Real-Time XAI
链接:https://arxiv.org/abs/2512.23033
作者:Fuyad Hasan Bhoyan,Prashanta Sarker,Parsia Noor Ethila,Md. Emon Hossain,Md Kaviul Hossain,Md Humaion Kabir Mehedi
类目:oftware Engineering (cs.SE); Computer Vision and Pattern Recognition (cs.CV)
关键词:Early and accurate, interpretation is challenging, accurate detection, gallbladder disease types, ultrasound interpretation
备注:
点击查看摘要
Abstract:Early and accurate detection of gallbladder diseases is crucial, yet ultrasound interpretation is challenging. To address this, an AI-driven diagnostic software integrates our hybrid deep learning model MobResTaNet to classify ten categories, nine gallbladder disease types and normal directly from ultrasound images. The system delivers interpretable, real-time predictions via Explainable AI (XAI) visualizations, supporting transparent clinical decision-making. It achieves up to 99.85% accuracy with only 2.24M parameters. Deployed as web and mobile applications using HTML, CSS, JavaScript, Bootstrap, and Flutter, the software provides efficient, accessible, and trustworthy diagnostic support at the point of care
83. 【2512.23028】An Architecture-Led Hybrid Report on Body Language Detection Project
链接:https://arxiv.org/abs/2512.23028
作者:Thomson Tong,Diba Darooneh
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
关键词:architectural properties map, modern vision-language models, pipeline implemented, BodyLanguageDetection repository, modern vision-language
备注:
点击查看摘要
Abstract:This report provides an architecture-led analysis of two modern vision-language models (VLMs), Qwen2.5-VL-7B-Instruct and Llama-4-Scout-17B-16E-Instruct, and explains how their architectural properties map to a practical video-to-artifact pipeline implemented in the BodyLanguageDetection repository [1]. The system samples video frames, prompts a VLM to detect visible people and generate pixel-space bounding boxes with prompt-conditioned attributes (emotion by default), validates output structure using a predefined schema, and optionally renders an annotated video. We first summarize the shared multimodal foundation (visual tokenization, Transformer attention, and instruction following), then describe each architecture at a level sufficient to justify engineering choices without speculative internals. Finally, we connect model behavior to system constraints: structured outputs can be syntactically valid while semantically incorrect, schema validation is structural (not geometric correctness), person identifiers are frame-local in the current prompting contract, and interactive single-frame analysis returns free-form text rather than schema-enforced JSON. These distinctions are critical for writing defensible claims, designing robust interfaces, and planning evaluation.
84. 【2512.23024】With Great Context Comes Great Prediction Power: Classifying Objects via Geo-Semantic Scene Graphs
链接:https://arxiv.org/abs/2512.23024
作者:Ciprian Constantinescu,Marius Leordeanu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Humans effortlessly identify, Humans effortlessly, effortlessly identify objects, effortlessly identify, Humans
备注: This paper is a development of the visual riddle game with Human-AI interaction, entitled "GuessWhat - Riddle Eye with AI", developed by Ciprian Constantinescu (POLItEHNICA Bucharest), Serena Stan (Instituto Cervantes Bucarest) and Marius Leordeanu (POLITEHNICA Bucharest), which was the winner (1st place) of the NeoArt Connect NAC 2025 Scholarship Program
点击查看摘要
Abstract:Humans effortlessly identify objects by leveraging a rich understanding of the surrounding scene, including spatial relationships, material properties, and the co-occurrence of other objects. In contrast, most computational object recognition systems operate on isolated image regions, devoid of meaning in isolation, thus ignoring this vital contextual information. This paper argues for the critical role of context and introduces a novel framework for contextual object classification. We first construct a Geo-Semantic Contextual Graph (GSCG) from a single monocular image. This rich, structured representation is built by integrating a metric depth estimator with a unified panoptic and material segmentation model. The GSCG encodes objects as nodes with detailed geometric, chromatic, and material attributes, and their spatial relationships as edges. This explicit graph structure makes the model's reasoning process inherently interpretable. We then propose a specialized graph-based classifier that aggregates features from a target object, its immediate neighbors, and the global scene context to predict its class. Through extensive ablation studies, we demonstrate that our context-aware model achieves a classification accuracy of 73.4%, dramatically outperforming context-agnostic versions (as low as 38.4%). Furthermore, our GSCG-based approach significantly surpasses strong baselines, including fine-tuned ResNet models (max 53.5%) and a state-of-the-art multimodal Large Language Model (LLM), Llama 4 Scout, which, even when given the full image alongside a detailed description of objects, maxes out at 42.3%. These results on COCO 2017 train/val splits highlight the superiority of explicitly structured and interpretable context for object recognition tasks.
85. 【2512.23020】OpenGround: Active Cognition-based Reasoning for Open-World 3D Visual Grounding
链接:https://arxiv.org/abs/2512.23020
作者:Wenyuan Huang,Zhao Wang,Zhou Wei,Ting Huang,Fang Zhao,Jian Yang,Zhenyu Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:natural language descriptions, Visual Language Models, locate objects based, Object Lookup Table, visual grounding aims
备注: 27 pages, 15 figures, 14 tables, Project Page at [this https URL]( [this https URL](https://why-102.github.io/openground.io/) )
点击查看摘要
Abstract:3D visual grounding aims to locate objects based on natural language descriptions in 3D scenes. Existing methods rely on a pre-defined Object Lookup Table (OLT) to query Visual Language Models (VLMs) for reasoning about object locations, which limits the applications in scenarios with undefined or unforeseen targets. To address this problem, we present OpenGround, a novel zero-shot framework for open-world 3D visual grounding. Central to OpenGround is the Active Cognition-based Reasoning (ACR) module, which is designed to overcome the fundamental limitation of pre-defined OLTs by progressively augmenting the cognitive scope of VLMs. The ACR module performs human-like perception of the target via a cognitive task chain and actively reasons about contextually relevant objects, thereby extending VLM cognition through a dynamically updated OLT. This allows OpenGround to function with both pre-defined and open-world categories. We also propose a new dataset named OpenTarget, which contains over 7000 object-description pairs to evaluate our method in open-world scenarios. Extensive experiments demonstrate that OpenGround achieves competitive performance on Nr3D, state-of-the-art on ScanRefer, and delivers a substantial 17.6% improvement on OpenTarget. Project Page at [this https URL](this https URL).
86. 【2512.22990】A Low-Cost UAV Deep Learning Pipeline for Integrated Apple Disease Diagnosis,Freshness Assessment, and Fruit Detection
链接:https://arxiv.org/abs/2512.22990
作者:Soham Dutta,Soham Banerjee,Sneha Mahata,Anindya Sen,Sayantani Datta
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:fruit quality assessment, require timely disease, orchards require timely, Apple orchards require, costly multispectral sensors
备注: This work has also been submitted to techrxiv and his currently under moderation. The content is identical
点击查看摘要
Abstract:Apple orchards require timely disease detection, fruit quality assessment, and yield estimation, yet existing UAV-based systems address such tasks in isolation and often rely on costly multispectral sensors. This paper presents a unified, low-cost RGB-only UAV-based orchard intelligent pipeline integrating ResNet50 for leaf disease detection, VGG 16 for apple freshness determination, and YOLOv8 for real-time apple detection and localization. The system runs on an ESP32-CAM and Raspberry Pi, providing fully offline on-site inference without cloud support. Experiments demonstrate 98.9% accuracy for leaf disease classification, 97.4% accuracy for freshness classification, and 0.857 F1 score for apple detection. The framework provides an accessible and scalable alternative to multispectral UAV solutions, supporting practical precision agriculture on affordable hardware.
87. 【2512.22984】Reverse Personalization
链接:https://arxiv.org/abs/2512.22984
作者:Han-Wei Kung,Tuomas Varanka,Nicu Sebe
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:enabling creating personalized, personalized facial imagery, demonstrated remarkable generation, enabling creating, Recent
备注: WACV 2026
点击查看摘要
Abstract:Recent text-to-image diffusion models have demonstrated remarkable generation of realistic facial images conditioned on textual prompts and human identities, enabling creating personalized facial imagery. However, existing prompt-based methods for removing or modifying identity-specific features rely either on the subject being well-represented in the pre-trained model or require model fine-tuning for specific identities. In this work, we analyze the identity generation process and introduce a reverse personalization framework for face anonymization. Our approach leverages conditional diffusion inversion, allowing direct manipulation of images without using text prompts. To generalize beyond subjects in the model's training data, we incorporate an identity-guided conditioning branch. Unlike prior anonymization methods, which lack control over facial attributes, our framework supports attribute-controllable anonymization. We demonstrate that our method achieves a state-of-the-art balance between identity removal, attribute preservation, and image quality. Source code and data are available at this https URL .
88. 【2512.22981】Spatial-aware Symmetric Alignment for Text-guided Medical Image Segmentation
链接:https://arxiv.org/abs/2512.22981
作者:Linglin Liao,Qichuan Geng,Yu Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Text-guided Medical Image, shown considerable promise, Medical Image Segmentation, rich clinical text, clinical text serving
备注:
点击查看摘要
Abstract:Text-guided Medical Image Segmentation has shown considerable promise for medical image segmentation, with rich clinical text serving as an effective supplement for scarce data. However, current methods have two key bottlenecks. On one hand, they struggle to process diagnostic and descriptive texts simultaneously, making it difficult to identify lesions and establish associations with image regions. On the other hand, existing approaches focus on lesions description and fail to capture positional constraints, leading to critical deviations. Specifically, with the text "in the left lower lung", the segmentation results may incorrectly cover both sides of the lung. To address the limitations, we propose the Spatial-aware Symmetric Alignment (SSA) framework to enhance the capacity of referring hybrid medical texts consisting of locational, descriptive, and diagnostic information. Specifically, we propose symmetric optimal transport alignment mechanism to strengthen the associations between image regions and multiple relevant expressions, which establishes bi-directional fine-grained multimodal correspondences. In addition, we devise a composite directional guidance strategy that explicitly introduces spatial constraints in the text by constructing region-level guidance masks. Extensive experiments on public benchmarks demonstrate that SSA achieves state-of-the-art (SOTA) performance, particularly in accurately segmenting lesions characterized by spatial relational constraints.
89. 【2512.22979】PoseStreamer: A Multi-modal Framework for 6DoF Pose Estimation of Unseen Moving Objects
链接:https://arxiv.org/abs/2512.22979
作者:Huiming Yang,Linglin Liao,Fei Ding,Sibo Wang,Zijian Zeng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:standard RGB cameras, RGB cameras suffer, faces significant challenges, standard RGB, RGB cameras
备注:
点击查看摘要
Abstract:Six degree of freedom (6DoF) pose estimation for novel objects is a critical task in computer vision, yet it faces significant challenges in high-speed and low-light scenarios where standard RGB cameras suffer from motion blur. While event cameras offer a promising solution due to their high temporal resolution, current 6DoF pose estimation methods typically yield suboptimal performance in high-speed object moving scenarios. To address this gap, we propose PoseStreamer, a robust multi-modal 6DoF pose estimation framework designed specifically on high-speed moving scenarios. Our approach integrates three core components: an Adaptive Pose Memory Queue that utilizes historical orientation cues for temporal consistency, an Object-centric 2D Tracker that provides strong 2D priors to boost 3D center recall, and a Ray Pose Filter for geometric refinement along camera rays. Furthermore, we introduce MoCapCube6D, a novel multi-modal dataset constructed to benchmark performance under rapid motion. Extensive experiments demonstrate that PoseStreamer not only achieves superior accuracy in high-speed moving scenarios, but also exhibits strong generalizability as a template-free framework for unseen moving objects.
90. 【2512.22974】RealCamo: Boosting Real Camouflage Synthesis with Layout Controls and Textual-Visual Guidance
链接:https://arxiv.org/abs/2512.22974
作者:Chunyuan Chen,Yunuo Cai,Shujuan Li,Weiyun Liang,Bin Wang,Jing Xu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:acquiring high-quality training, high-quality training data, camouflaged object detection, Camouflaged image generation, existing CIG methods
备注: 13 pages, 6 figures
点击查看摘要
Abstract:Camouflaged image generation (CIG) has recently emerged as an efficient alternative for acquiring high-quality training data for camouflaged object detection (COD). However, existing CIG methods still suffer from a substantial gap to real camouflaged imagery: generated images either lack sufficient camouflage due to weak visual similarity, or exhibit cluttered backgrounds that are semantically inconsistent with foreground targets. To address these limitations, we propose ReamCamo, a unified out-painting based framework for realistic camouflaged image generation. ReamCamo explicitly introduces additional layout controls to regulate global image structure, thereby improving semantic coherence between foreground objects and generated backgrounds. Moreover, we construct a multi-modal textual-visual condition by combining a unified fine-grained textual task description with texture-oriented background retrieval, which jointly guides the generation process to enhance visual fidelity and realism. To quantitatively assess camouflage quality, we further introduce a background-foreground distribution divergence metric that measures the effectiveness of camouflage in generated images. Extensive experiments and visualizations demonstrate the effectiveness of our proposed framework.
91. 【2512.22973】YOLO-IOD: Towards Real Time Incremental Object Detection
链接:https://arxiv.org/abs/2512.22973
作者:Shizhou Zhang,Xueqiang Lv,Yinghui Xing,Qirui Wu,Di Xu,Chen Zhao,Yanning Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:DETR series detectors, Faster R-CNN, R-CNN or DETR, YOLO detection frameworks, real-time YOLO detection
备注: AAAAI 2026 accepted. Code models are available at: [this https URL](https://github.com/qiangzai-lv/YOLO-IOD)
点击查看摘要
Abstract:Current methods for incremental object detection (IOD) primarily rely on Faster R-CNN or DETR series detectors; however, these approaches do not accommodate the real-time YOLO detection frameworks. In this paper, we first identify three primary types of knowledge conflicts that contribute to catastrophic forgetting in YOLO-based incremental detectors: foreground-background confusion, parameter interference, and misaligned knowledge distillation. Subsequently, we introduce YOLO-IOD, a real-time Incremental Object Detection (IOD) framework that is constructed upon the pretrained YOLO-World model, facilitating incremental learning via a stage-wise parameter-efficient fine-tuning process. Specifically, YOLO-IOD encompasses three principal components: 1) Conflict-Aware Pseudo-Label Refinement (CPR), which mitigates the foreground-background confusion by leveraging the confidence levels of pseudo labels and identifying potential objects relevant to future tasks. 2) Importancebased Kernel Selection (IKS), which identifies and updates the pivotal convolution kernels pertinent to the current task during the current learning stage. 3) Cross-Stage Asymmetric Knowledge Distillation (CAKD), which addresses the misaligned knowledge distillation conflict by transmitting the features of the student target detector through the detection heads of both the previous and current teacher detectors, thereby facilitating asymmetric distillation between existing and newly introduced categories. We further introduce LoCo COCO, a more realistic benchmark that eliminates data leakage across stages. Experiments on both conventional and LoCo COCO benchmarks show that YOLO-IOD achieves superior performance with minimal forgetting.
92. 【2512.22972】Wavelet-based Multi-View Fusion of 4D Radar Tensor and Camera for Robust 3D Object Detection
链接:https://arxiv.org/abs/2512.22972
作者:Runwei Guan,Jianan Liu,Shaofeng Liang,Fangqiang Ding,Shanliang Yao,Xiaokai Bai,Daizong Liu,Tao Huang,Guoqiang Mao,Hui Xiong
类目:Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
关键词:robot perception due, widely adopted, adopted in autonomous, autonomous driving, driving and robot
备注: 10 pages, 10 figures
点击查看摘要
Abstract:4D millimeter-wave (mmWave) radar has been widely adopted in autonomous driving and robot perception due to its low cost and all-weather robustness. However, its inherent sparsity and limited semantic richness significantly constrain perception capability. Recently, fusing camera data with 4D radar has emerged as a promising cost effective solution, by exploiting the complementary strengths of the two modalities. Nevertheless, point-cloud-based radar often suffer from information loss introduced by multi-stage signal processing, while directly utilizing raw 4D radar data incurs prohibitive computational costs. To address these challenges, we propose WRCFormer, a novel 3D object detection framework that fuses raw radar cubes with camera inputs via multi-view representations of the decoupled radar cube. Specifically, we design a Wavelet Attention Module as the basic module of wavelet-based Feature Pyramid Network (FPN) to enhance the representation of sparse radar signals and image data. We further introduce a two-stage query-based, modality-agnostic fusion mechanism termed Geometry-guided Progressive Fusion to efficiently integrate multi-view features from both modalities. Extensive experiments demonstrate that WRCFormer achieves state-of-the-art performance on the K-Radar benchmarks, surpassing the best model by approximately 2.4% in all scenarios and 1.6% in the sleet scenario, highlighting its robustness under adverse weather conditions.
93. 【2512.22969】CLIP-Joint-Detect: End-to-End Joint Training of Object Detectors with Contrastive Vision-Language Supervision
链接:https://arxiv.org/abs/2512.22969
作者:Behnam Raoufi,Hossein Sharify,Mohamad Mahdee Ramezanee,Khosrow Hajsadeghi,Saeed Bagheri Shouraki
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Conventional object detectors, Conventional object, object detectors rely, label noise, vulnerable to class
备注: 6 pages, 4 figures. Preprint under review
点击查看摘要
Abstract:Conventional object detectors rely on cross-entropy classification, which can be vulnerable to class imbalance and label noise. We propose CLIP-Joint-Detect, a simple and detector-agnostic framework that integrates CLIP-style contrastive vision-language supervision through end-to-end joint training. A lightweight parallel head projects region or grid features into the CLIP embedding space and aligns them with learnable class-specific text embeddings via InfoNCE contrastive loss and an auxiliary cross-entropy term, while all standard detection losses are optimized simultaneously. The approach applies seamlessly to both two-stage and one-stage architectures. We validate it on Pascal VOC 2007+2012 using Faster R-CNN and on the large-scale MS COCO 2017 benchmark using modern YOLO detectors (YOLOv11), achieving consistent and substantial improvements while preserving real-time inference speed. Extensive experiments and ablations demonstrate that joint optimization with learnable text embeddings markedly enhances closed-set detection performance across diverse architectures and datasets.
94. 【2512.22949】Learning Where to Focus: Density-Driven Guidance for Detecting Dense Tiny Objects
链接:https://arxiv.org/abs/2512.22949
作者:Zhicheng Zhao,Xuanang Fan,Lingma Sun,Chenglong Li,Jin Tang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:High-resolution remote sensing, limited pixel footprints, remote sensing imagery, sensing imagery increasingly, extremely challenging due
备注:
点击查看摘要
Abstract:High-resolution remote sensing imagery increasingly contains dense clusters of tiny objects, the detection of which is extremely challenging due to severe mutual occlusion and limited pixel footprints. Existing detection methods typically allocate computational resources uniformly, failing to adaptively focus on these density-concentrated regions, which hinders feature learning effectiveness. To address these limitations, we propose the Dense Region Mining Network (DRMNet), which leverages density maps as explicit spatial priors to guide adaptive feature learning. First, we design a Density Generation Branch (DGB) to model object distribution patterns, providing quantifiable priors that guide the network toward dense regions. Second, to address the computational bottleneck of global attention, our Dense Area Focusing Module (DAFM) uses these density maps to identify and focus on dense areas, enabling efficient local-global feature interaction. Finally, to mitigate feature degradation during hierarchical extraction, we introduce a Dual Filter Fusion Module (DFFM). It disentangles multi-scale features into high- and low-frequency components using a discrete cosine transform and then performs density-guided cross-attention to enhance complementarity while suppressing background interference. Extensive experiments on the AI-TOD and DTOD datasets demonstrate that DRMNet surpasses state-of-the-art methods, particularly in complex scenarios with high object density and severe occlusion.
95. 【2512.22939】ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving
链接:https://arxiv.org/abs/2512.22939
作者:Qihang Peng,Xuesong Chen,Chenye Yang,Shaoshuai Shi,Hongsheng Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Autonomous driving requires, complex multimodal inputs, driving requires generating, Autonomous driving, requires generating safe
备注: 11 pages, 4 figures
点击查看摘要
Abstract:Autonomous driving requires generating safe and reliable trajectories from complex multimodal inputs. Traditional modular pipelines separate perception, prediction, and planning, while recent end-to-end (E2E) systems learn them jointly. Vision-language models (VLMs) further enrich this paradigm by introducing cross-modal priors and commonsense reasoning, yet current VLM-based planners face three key challenges: (i) a mismatch between discrete text reasoning and continuous control, (ii) high latency from autoregressive chain-of-thought decoding, and (iii) inefficient or non-causal planners that limit real-time deployment. We propose ColaVLA, a unified vision-language-action framework that transfers reasoning from text to a unified latent space and couples it with a hierarchical, parallel trajectory decoder. The Cognitive Latent Reasoner compresses scene understanding into compact, decision-oriented meta-action embeddings through ego-adaptive selection and only two VLM forward passes. The Hierarchical Parallel Planner then generates multi-scale, causality-consistent trajectories in a single forward pass. Together, these components preserve the generalization and interpretability of VLMs while enabling efficient, accurate and safe trajectory generation. Experiments on the nuScenes benchmark show that ColaVLA achieves state-of-the-art performance in both open-loop and closed-loop settings with favorable efficiency and robustness.
96. 【2512.22905】JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation
链接:https://arxiv.org/abs/2512.22905
作者:Kai Liu,Jungang Li,Yuchong Sun,Shengqiong Wu,Jianzhang Gao,Daoan Zhang,Wei Zhang,Sheng Jin,Sicheng Yu,Geng Zhan,Jiayi Ji,Fan Zhou,Liang Zheng,Shuicheng Yan,Hao Fei,Tat-Seng Chua
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:paper presents JavisGPT, large language model, unified multimodal large, multimodal large language, Joint Audio-Video
备注: Accepted by NeurIPS as a Spotlight paper. Code: [this https URL](https://github.com/JavisVerse/JavisGPT)
点击查看摘要
Abstract:This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for Joint Audio-Video (JAV) comprehension and generation. JavisGPT adopts a concise encoder-LLM-decoder architecture, featuring a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries to bridge a pretrained JAV-DiT generator. This design enables temporally coherent video-audio understanding and generation from multimodal instructions. We design an effective three-stage training pipeline consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning, to progressively build multimodal comprehension and generation from existing vision-language models. To support this, we further construct JavisInst-Omni, a high-quality instruction dataset with over 200K GPT-4o-curated audio-video-text dialogues that span diverse and multi-level comprehension and generation scenarios. Extensive experiments on JAV comprehension and generation benchmarks show that JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.
97. 【2512.22899】HiSciBench: A Hierarchical Multi-disciplinary Benchmark for Scientific Intelligence from Reading to Discovery
链接:https://arxiv.org/abs/2512.22899
作者:Yaping Zhang,Qixuan Zhang,Xingquan Zhang,Zhiyuan Chen,Wenwen Zhuang,Yupu Liang,Lu Xiang,Yang Zhao,Jiajun Zhang,Yu Zhou,Chengqing Zong
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:sparked growing interest, large language models, Literature Review Generation, Literature-based Question Answering, textit
备注:
点击查看摘要
Abstract:The rapid advancement of large language models (LLMs) and multimodal foundation models has sparked growing interest in their potential for scientific research. However, scientific intelligence encompasses a broad spectrum of abilities ranging from understanding fundamental knowledge to conducting creative discovery, and existing benchmarks remain fragmented. Most focus on narrow tasks and fail to reflect the hierarchical and multi-disciplinary nature of real scientific inquiry. We introduce \textbf{HiSciBench}, a hierarchical benchmark designed to evaluate foundation models across five levels that mirror the complete scientific workflow: \textit{Scientific Literacy} (L1), \textit{Literature Parsing} (L2), \textit{Literature-based Question Answering} (L3), \textit{Literature Review Generation} (L4), and \textit{Scientific Discovery} (L5). HiSciBench contains 8,735 carefully curated instances spanning six major scientific disciplines, including mathematics, physics, chemistry, biology, geography, and astronomy, and supports multimodal inputs including text, equations, figures, and tables, as well as cross-lingual evaluation. Unlike prior benchmarks that assess isolated abilities, HiSciBench provides an integrated, dependency-aware framework that enables detailed diagnosis of model capabilities across different stages of scientific reasoning. Comprehensive evaluations of leading models, including GPT-5, DeepSeek-R1, and several multimodal systems, reveal substantial performance gaps: while models achieve up to 69\% accuracy on basic literacy tasks, performance declines sharply to 25\% on discovery-level challenges. HiSciBench establishes a new standard for evaluating scientific Intelligence and offers actionable insights for developing models that are not only more capable but also more reliable. The benchmark will be publicly released to facilitate future research.
98. 【2512.22882】Hash Grid Feature Pruning
链接:https://arxiv.org/abs/2512.22882
作者:Yangzhi Ma,Bojun Liu,Jie Li,Li Li,Dong Liu
类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
关键词:implicit neural field, hash grid, Gaussian splatting, inter-frame prediction, hash grid feature
备注:
点击查看摘要
Abstract:Hash grids are widely used to learn an implicit neural field for Gaussian splatting, serving either as part of the entropy model or for inter-frame prediction. However, due to the irregular and non-uniform distribution of Gaussian splats in 3D space, numerous sparse regions exist, rendering many features in the hash grid invalid. This leads to redundant storage and transmission overhead. In this work, we propose a hash grid feature pruning method that identifies and prunes invalid features based on the coordinates of the input Gaussian splats, so that only the valid features are encoded. This approach reduces the storage size of the hash grid without compromising model performance, leading to improved rate-distortion performance. Following the Common Test Conditions (CTC) defined by the standardization committee, our method achieves an average bitrate reduction of 8% compared to the baseline approach.
99. 【2512.22881】Guided Path Sampling: Steering Diffusion Models Back on Track with Principled Path Guidance
链接:https://arxiv.org/abs/2512.22881
作者:Haosen Li,Wenshuo Chen,Shaofeng Liang,Lei Wang,Haozhe Jia,Yutao Yue
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:denoising-inversion cycle, cycle are powerful, powerful tools, tools for enhancing, control of diffusion
备注:
点击查看摘要
Abstract:Iterative refinement methods based on a denoising-inversion cycle are powerful tools for enhancing the quality and control of diffusion models. However, their effectiveness is critically limited when combined with standard Classifier-Free Guidance (CFG). We identify a fundamental limitation: CFG's extrapolative nature systematically pushes the sampling path off the data manifold, causing the approximation error to diverge and undermining the refinement process. To address this, we propose Guided Path Sampling (GPS), a new paradigm for iterative refinement. GPS replaces unstable extrapolation with a principled, manifold-constrained interpolation, ensuring the sampling path remains on the data manifold. We theoretically prove that this correction transforms the error series from unbounded amplification to strictly bounded, guaranteeing stability. Furthermore, we devise an optimal scheduling strategy that dynamically adjusts guidance strength, aligning semantic injection with the model's natural coarse-to-fine generation process. Extensive experiments on modern backbones like SDXL and Hunyuan-DiT show that GPS outperforms existing methods in both perceptual quality and complex prompt adherence. For instance, GPS achieves a superior ImageReward of 0.79 and HPS v2 of 0.2995 on SDXL, while improving overall semantic alignment accuracy on GenEval to 57.45%. Our work establishes that path stability is a prerequisite for effective iterative refinement, and GPS provides a robust framework to achieve it.
100. 【2512.22878】SwinTF3D: A Lightweight Multimodal Fusion Approach for Text-Guided 3D Medical Image Segmentation
链接:https://arxiv.org/abs/2512.22878
作者:Hasan Faraz Khan,Noor Fatima,Muzammil Behzad
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:driven remarkable advances, recent integration, integration of artificial, artificial intelligence, driven remarkable
备注:
点击查看摘要
Abstract:The recent integration of artificial intelligence into medical imaging has driven remarkable advances in automated organ segmentation. However, most existing 3D segmentation frameworks rely exclusively on visual learning from large annotated datasets restricting their adaptability to new domains and clinical tasks. The lack of semantic understanding in these models makes them ineffective in addressing flexible, user-defined segmentation objectives. To overcome these limitations, we propose SwinTF3D, a lightweight multimodal fusion approach that unifies visual and linguistic representations for text-guided 3D medical image segmentation. The model employs a transformer-based visual encoder to extract volumetric features and integrates them with a compact text encoder via an efficient fusion mechanism. This design allows the system to understand natural-language prompts and correctly align semantic cues with their corresponding spatial structures in medical volumes, while producing accurate, context-aware segmentation results with low computational overhead. Extensive experiments on the BTCV dataset demonstrate that SwinTF3D achieves competitive Dice and IoU scores across multiple organs, despite its compact architecture. The model generalizes well to unseen data and offers significant efficiency gains compared to conventional transformer-based segmentation networks. Bridging visual perception with linguistic understanding, SwinTF3D establishes a practical and interpretable paradigm for interactive, text-driven 3D medical image segmentation, opening perspectives for more adaptive and resource-efficient solutions in clinical imaging.
101. 【2512.22877】M-ErasureBench: A Comprehensive Multimodal Evaluation Benchmark for Concept Erasure in Diffusion Models
链接:https://arxiv.org/abs/2512.22877
作者:Ju-Hsuan Weng,Jia-Wei Liao,Cheng-Fu Chou,Jun-Cheng Chen
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:copyrighted content, motivating research, generate harmful, harmful or copyrighted, concept erasure
备注:
点击查看摘要
Abstract:Text-to-image diffusion models may generate harmful or copyrighted content, motivating research on concept erasure. However, existing approaches primarily focus on erasing concepts from text prompts, overlooking other input modalities that are increasingly critical in real-world applications such as image editing and personalized generation. These modalities can become attack surfaces, where erased concepts re-emerge despite defenses. To bridge this gap, we introduce M-ErasureBench, a novel multimodal evaluation framework that systematically benchmarks concept erasure methods across three input modalities: text prompts, learned embeddings, and inverted latents. For the latter two, we evaluate both white-box and black-box access, yielding five evaluation scenarios. Our analysis shows that existing methods achieve strong erasure performance against text prompts but largely fail under learned embeddings and inverted latents, with Concept Reproduction Rate (CRR) exceeding 90% in the white-box setting. To address these vulnerabilities, we propose IRECE (Inference-time Robustness Enhancement for Concept Erasure), a plug-and-play module that localizes target concepts via cross-attention and perturbs the associated latents during denoising. Experiments demonstrate that IRECE consistently restores robustness, reducing CRR by up to 40% under the most challenging white-box latent inversion scenario, while preserving visual quality. To the best of our knowledge, M-ErasureBench provides the first comprehensive benchmark of concept erasure beyond text prompts. Together with IRECE, our benchmark offers practical safeguards for building more reliable protective generative models.
102. 【2512.22874】Let Samples Speak: Mitigating Spurious Correlation by Exploiting the Clusterness of Samples
链接:https://arxiv.org/abs/2512.22874
作者:Weiwei Li,Junzhuo Liu,Yuanyuan Ren,Yuchen Zheng,Yahao Liu,Wen Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:spurious features, Deep learning models, spurious, prediction task, spuriously correlate
备注: CVPR 2025
点击查看摘要
Abstract:Deep learning models are known to often learn features that spuriously correlate with the class label during training but are irrelevant to the prediction task. Existing methods typically address this issue by annotating potential spurious attributes, or filtering spurious features based on some empirical assumptions (e.g., simplicity of bias). However, these methods may yield unsatisfactory performance due to the intricate and elusive nature of spurious correlations in real-world data. In this paper, we propose a data-oriented approach to mitigate the spurious correlation in deep learning models. We observe that samples that are influenced by spurious features tend to exhibit a dispersed distribution in the learned feature space. This allows us to identify the presence of spurious features. Subsequently, we obtain a bias-invariant representation by neutralizing the spurious features based on a simple grouping strategy. Then, we learn a feature transformation to eliminate the spurious features by aligning with this bias-invariant representation. Finally, we update the classifier by incorporating the learned feature transformation and obtain an unbiased model. By integrating the aforementioned identifying, neutralizing, eliminating and updating procedures, we build an effective pipeline for mitigating spurious correlation. Experiments on image and NLP debiasing benchmarks show an improvement in worst group accuracy of more than 20% compared to standard empirical risk minimization (ERM). Codes and checkpoints are available at this https URL .
103. 【2512.22872】Learning Anatomy from Multiple Perspectives via Self-supervision in Chest Radiographs
链接:https://arxiv.org/abs/2512.22872
作者:Ziyu Zhou,Haozhe Luo,Mohammad Reza Hosseinzadeh Taher,Jiaxuan Pang,Xiaowei Ding,Michael B. Gotway,Jianming Liang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:natural language processing, natural language, language processing, processing and computer, computer vision
备注:
点击查看摘要
Abstract:Foundation models have been successful in natural language processing and computer vision because they are capable of capturing the underlying structures (foundation) of natural languages. However, in medical imaging, the key foundation lies in human anatomy, as these images directly represent the internal structures of the body, reflecting the consistency, coherence, and hierarchy of human anatomy. Yet, existing self-supervised learning (SSL) methods often overlook these perspectives, limiting their ability to effectively learn anatomical features. To overcome the limitation, we built Lamps (learning anatomy from multiple perspectives via self-supervision) pre-trained on large-scale chest radiographs by harmoniously utilizing the consistency, coherence, and hierarchy of human anatomy as the supervision signal. Extensive experiments across 10 datasets evaluated through fine-tuning and emergent property analysis demonstrate Lamps' superior robustness, transferability, and clinical potential when compared to 10 baseline models. By learning from multiple perspectives, Lamps presents a unique opportunity for foundation models to develop meaningful, robust representations that are aligned with the structure of human anatomy.
104. 【2512.22867】MUSON: A Reasoning-oriented Multimodal Dataset for Socially Compliant Navigation in Urban Environments
链接:https://arxiv.org/abs/2512.22867
作者:Zhuonan Liu,Xinyu Zhang,Zishuo Wang,Tomohito Kawabata,Xuesu Xiao,Ling Xiao
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:Socially compliant navigation, compliant navigation requires, navigation requires structured, dynamic pedestrians, ensure safe
备注:
点击查看摘要
Abstract:Socially compliant navigation requires structured reasoning over dynamic pedestrians and physical constraints to ensure safe and interpretable decisions. However, existing social navigation datasets often lack explicit reasoning supervision and exhibit highly long-tailed action distributions, limiting models' ability to learn safety-critical behaviors. To address these issues, we introduce MUSON, a multimodal dataset for short-horizon social navigation collected across diverse indoor and outdoor campus scenes. MUSON adopts a structured five-step Chain-of-Thought annotation consisting of perception, prediction, reasoning, action, and explanation, with explicit modeling of static physical constraints and a rationally balanced discrete action space. Compared to SNEI, MUSON provides consistent reasoning, action, and explanation. Benchmarking multiple state-of-the-art Small Vision Language Models on MUSON shows that Qwen2.5-VL-3B achieves the highest decision accuracy of 0.8625, demonstrating that MUSON serves as an effective and reusable benchmark for socially compliant navigation. The dataset is publicly available at this https URL
105. 【2512.22854】ByteLoom: Weaving Geometry-Consistent Human-Object Interactions through Progressive Curriculum Learning
链接:https://arxiv.org/abs/2512.22854
作者:Bangya Liu,Xinyu Gong,Zelin Zhao,Ziyang Song,Yulei Lu,Suhui Wu,Jun Zhang,Suman Banerjee,Hao Zhang
类目:Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
关键词:robotics imitation learning, garnered increasing attention, increasing attention due, Human-object interaction, imitation learning
备注:
点击查看摘要
Abstract:Human-object interaction (HOI) video generation has garnered increasing attention due to its promising applications in digital humans, e-commerce, advertising, and robotics imitation learning. However, existing methods face two critical limitations: (1) a lack of effective mechanisms to inject multi-view information of the object into the model, leading to poor cross-view consistency, and (2) heavy reliance on fine-grained hand mesh annotations for modeling interaction occlusions. To address these challenges, we introduce ByteLoom, a Diffusion Transformer (DiT)-based framework that generates realistic HOI videos with geometrically consistent object illustration, using simplified human conditioning and 3D object inputs. We first propose an RCM-cache mechanism that leverages Relative Coordinate Maps (RCM) as a universal representation to maintain object's geometry consistency and precisely control 6-DoF object transformations in the meantime. To compensate HOI dataset scarcity and leverage existing datasets, we further design a training curriculum that enhances model capabilities in a progressive style and relaxes the demand of hand mesh. Extensive experiments demonstrate that our method faithfully preserves human identity and the object's multi-view geometry, while maintaining smooth motion and object manipulation.
106. 【2512.22833】A Minimal Solver for Relative Pose Estimation with Unknown Focal Length from Two Affine Correspondences
链接:https://arxiv.org/abs/2512.22833
作者:Zhenbao Yu,Shirong Ye,Ronghe Jin,Shunkun Liang,Zibin Liu,Huiyun Zhang,Banglei Guan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:
备注:
点击查看摘要
None
107. 【2512.22830】3D Scene Change Modeling With Consistent Multi-View Aggregation
链接:https://arxiv.org/abs/2512.22830
作者:Zirui Zhou,Junfeng Ni,Shujie Zhang,Yixin Chen,Siyuan Huang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Change detection plays, Change detection, plays a vital, vital role, detection plays
备注:
点击查看摘要
Abstract:Change detection plays a vital role in scene monitoring, exploration, and continual reconstruction. Existing 3D change detection methods often exhibit spatial inconsistency in the detected changes and fail to explicitly separate pre- and post-change states. To address these limitations, we propose SCaR-3D, a novel 3D scene change detection framework that identifies object-level changes from a dense-view pre-change image sequence and sparse-view post-change images. Our approach consists of a signed-distance-based 2D differencing module followed by multi-view aggregation with voting and pruning, leveraging the consistent nature of 3DGS to robustly separate pre- and post-change states. We further develop a continual scene reconstruction strategy that selectively updates dynamic regions while preserving the unchanged areas. We also contribute CCS3D, a challenging synthetic dataset that allows flexible combinations of 3D change types to support controlled evaluations. Extensive experiments demonstrate that our method achieves both high accuracy and efficiency, outperforming existing methods.
108. 【2512.22822】KANO: Kolmogorov-Arnold Neural Operator for Image Super-Resolution
链接:https://arxiv.org/abs/2512.22822
作者:Chenyu Li,Danfeng Hong,Bing Zhang,Zhaojie Pan,Jocelyn Chanussot
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:render single-image Super-resolution, uncertainty render single-image, single-image Super-resolution, sources of uncertainty, uncertainty render
备注:
点击查看摘要
Abstract:The highly nonlinear degradation process, complex physical interactions, and various sources of uncertainty render single-image Super-resolution (SR) a particularly challenging task. Existing interpretable SR approaches, whether based on prior learning or deep unfolding optimization frameworks, typically rely on black-box deep networks to model latent variables, which leaves the degradation process largely unknown and uncontrollable. Inspired by the Kolmogorov-Arnold theorem (KAT), we for the first time propose a novel interpretable operator, termed Kolmogorov-Arnold Neural Operator (KANO), with the application to image SR. KANO provides a transparent and structured representation of the latent degradation fitting process. Specifically, we employ an additive structure composed of a finite number of B-spline functions to approximate continuous spectral curves in a piecewise fashion. By learning and optimizing the shape parameters of these spline functions within defined intervals, our KANO accurately captures key spectral characteristics, such as local linear trends and the peak-valley structures at nonlinear inflection points, thereby endowing SR results with physical interpretability. Furthermore, through theoretical modeling and experimental evaluations across natural images, aerial photographs, and satellite remote sensing data, we systematically compare multilayer perceptrons (MLPs) and Kolmogorov-Arnold networks (KANs) in handling complex sequence fitting tasks. This comparative study elucidates the respective advantages and limitations of these models in characterizing intricate degradation mechanisms, offering valuable insights for the development of interpretable SR techniques.
109. 【2512.22819】Depth Anything in $360^\circ$: Towards Scale Invariance in the Wild
链接:https://arxiv.org/abs/2512.22819
作者:Hualie Jiang,Ziyang Song,Zhiqiang Lou,Rui Xu,Minglang Tan
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:environmental structural information, offering significant benefits, capturing complete, environmental structural, structural information
备注: [this https URL](https://insta360-research-team.github.io/DA360)
点击查看摘要
Abstract:Panoramic depth estimation provides a comprehensive solution for capturing complete $360^\circ$ environmental structural information, offering significant benefits for robotics and AR/VR applications. However, while extensively studied in indoor settings, its zero-shot generalization to open-world domains lags far behind perspective images, which benefit from abundant training data. This disparity makes transferring capabilities from the perspective domain an attractive solution. To bridge this gap, we present Depth Anything in $360^\circ$ (DA360), a panoramic-adapted version of Depth Anything V2. Our key innovation involves learning a shift parameter from the ViT backbone, transforming the model's scale- and shift-invariant output into a scale-invariant estimate that directly yields well-formed 3D point clouds. This is complemented by integrating circular padding into the DPT decoder to eliminate seam artifacts, ensuring spatially coherent depth maps that respect spherical continuity. Evaluated on standard indoor benchmarks and our newly curated outdoor dataset, Metropolis, DA360 shows substantial gains over its base model, achieving over 50\% and 10\% relative depth error reduction on indoor and outdoor benchmarks, respectively. Furthermore, DA360 significantly outperforms robust panoramic depth estimation methods, achieving about 30\% relative error improvement compared to PanDA across all three test datasets and establishing new state-of-the-art performance for zero-shot panoramic depth estimation.
110. 【2512.22808】EgoReAct: Egocentric Video-Driven 3D Human Reaction Generation
链接:https://arxiv.org/abs/2512.22808
作者:Libo Zhang,Zekun Li,Tianyu Li,Zeyu Cao,Rui Xu,Xiaoxiao Long,Wenjia Wang,Jingbo Wang,Yuan Liu,Wenping Wang,Daquan Zhou,Taku Komura,Zhiyang Dou
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Humans exhibit adaptive, exhibit adaptive, context-sensitive responses, Human Reaction Dataset, egocentric video
备注: 12 pages, 9 figures
点击查看摘要
Abstract:Humans exhibit adaptive, context-sensitive responses to egocentric visual input. However, faithfully modeling such reactions from egocentric video remains challenging due to the dual requirements of strictly causal generation and precise 3D spatial alignment. To tackle this problem, we first construct the Human Reaction Dataset (HRD) to address data scarcity and misalignment by building a spatially aligned egocentric video-reaction dataset, as existing datasets (e.g., ViMo) suffer from significant spatial inconsistency between the egocentric video and reaction motion, e.g., dynamically moving motions are always paired with fixed-camera videos. Leveraging HRD, we present EgoReAct, the first autoregressive framework that generates 3D-aligned human reaction motions from egocentric video streams in real-time. We first compress the reaction motion into a compact yet expressive latent space via a Vector Quantised-Variational AutoEncoder and then train a Generative Pre-trained Transformer for reaction generation from the visual input. EgoReAct incorporates 3D dynamic features, i.e., metric depth, and head dynamics during the generation, which effectively enhance spatial grounding. Extensive experiments demonstrate that EgoReAct achieves remarkably higher realism, spatial consistency, and generation efficiency compared with prior methods, while maintaining strict causality during generation. We will release code, models, and data upon acceptance.
111. 【2512.22802】ReDiF: Reinforced Distillation for Few Step Diffusion
链接:https://arxiv.org/abs/2512.22802
作者:Amirhossein Tighkhorshid,Zahra Dehghanian,Gholamali Aminian,Chengchun Shi,Hamid R. Rabiee
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:slow sampling problem, diffusion models, addresses the slow, slow sampling, smaller size
备注:
点击查看摘要
Abstract:Distillation addresses the slow sampling problem in diffusion models by creating models with smaller size or fewer steps that approximate the behavior of high-step teachers. In this work, we propose a reinforcement learning based distillation framework for diffusion models. Instead of relying on fixed reconstruction or consistency losses, we treat the distillation process as a policy optimization problem, where the student is trained using a reward signal derived from alignment with the teacher's outputs. This RL driven approach dynamically guides the student to explore multiple denoising paths, allowing it to take longer, optimized steps toward high-probability regions of the data distribution, rather than relying on incremental refinements. Our framework utilizes the inherent ability of diffusion models to handle larger steps and effectively manage the generative process. Experimental results show that our method achieves superior performance with significantly fewer inference steps and computational resources compared to existing distillation techniques. Additionally, the framework is model agnostic, applicable to any type of diffusion models with suitable reward functions, providing a general optimization paradigm for efficient diffusion learning.
112. 【2512.22801】Evaluating the Performance of Open-Vocabulary Object Detection in Low-quality Image
链接:https://arxiv.org/abs/2512.22801
作者:Po-Chih Wu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Open-vocabulary object detection, achieve recognition capabilities, recognition capabilities comparable, Open-vocabulary object, object detection enables
备注:
点击查看摘要
Abstract:Open-vocabulary object detection enables models to localize and recognize objects beyond a predefined set of categories and is expected to achieve recognition capabilities comparable to human performance. In this study, we aim to evaluate the performance of existing models on open-vocabulary object detection tasks under low-quality image conditions. For this purpose, we introduce a new dataset that simulates low-quality images in the real world. In our evaluation experiment, we find that although open-vocabulary object detection models exhibited no significant decrease in mAP scores under low-level image degradation, the performance of all models dropped sharply under high-level image degradation. OWLv2 models consistently performed better across different types of degradation, while OWL-ViT, GroundingDINO, and Detic showed significant performance declines. We will release our dataset and codes to facilitate future studies.
113. 【2512.22800】Medical Scene Reconstruction and Segmentation based on 3D Gaussian Representation
链接:https://arxiv.org/abs/2512.22800
作者:Bin Liu,Wenyan Tian,Huangxin Fu,Zizheng Li,Zhifen He,Bo Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:providing structural visualization, structural visualization support, surgical planning, key technology, support for disease
备注: 14 pages, 3 figures
点击查看摘要
Abstract:3D reconstruction of medical images is a key technology in medical image analysis and clinical diagnosis, providing structural visualization support for disease assessment and surgical planning. Traditional methods are computationally expensive and prone to structural discontinuities and loss of detail in sparse slices, making it difficult to meet clinical accuracy this http URL address these challenges, we propose an efficient 3D reconstruction method based on 3D Gaussian and tri-plane representations. This method not only maintains the advantages of Gaussian representation in efficient rendering and geometric representation but also significantly enhances structural continuity and semantic consistency under sparse slicing conditions. Experimental results on multimodal medical datasets such as US and MRI show that our proposed method can generate high-quality, anatomically coherent, and semantically stable medical images under sparse data conditions, while significantly improving reconstruction efficiency. This provides an efficient and reliable new approach for 3D visualization and clinical analysis of medical images.
114. 【2512.22799】VPTracker: Global Vision-Language Tracking via Visual Prompt and MLLM
链接:https://arxiv.org/abs/2512.22799
作者:Jingchao Wang,Kaiwen Zhou,Zhijian Wu,Kunhua Ji,Dingjiang Huang,Yefeng Zheng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Vision-Language Tracking aims, continuously localize objects, Multimodal Large Language, aims to continuously, continuously localize
备注: 6 pages
点击查看摘要
Abstract:Vision-Language Tracking aims to continuously localize objects described by a visual template and a language description. Existing methods, however, are typically limited to local search, making them prone to failures under viewpoint changes, occlusions, and rapid target movements. In this work, we introduce the first global tracking framework based on Multimodal Large Language Models (VPTracker), exploiting their powerful semantic reasoning to locate targets across the entire image space. While global search improves robustness and reduces drift, it also introduces distractions from visually or semantically similar objects. To address this, we propose a location-aware visual prompting mechanism that incorporates spatial priors into the MLLM. Specifically, we construct a region-level prompt based on the target's previous location, enabling the model to prioritize region-level recognition and resort to global inference only when necessary. This design retains the advantages of global tracking while effectively suppressing interference from distracting visual content. Extensive experiments show that our approach significantly enhances tracking stability and target disambiguation under challenging scenarios, opening a new avenue for integrating MLLMs into visual tracking. Code is available at this https URL.
115. 【2512.22796】Parallel Diffusion Solver via Residual Dirichlet Policy Optimization
链接:https://arxiv.org/abs/2512.22796
作者:Ruoyu Wang,Ziyu Li,Beier Zhu,Liangyu Yuan,Hanwang Zhang,Xun Yang,Xiaojun Chang,Chi Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Diffusion models, sequential denoising nature, high sampling latency, sampling latency due, Ensemble Parallel Direction
备注:
点击查看摘要
Abstract:Diffusion models (DMs) have achieved state-of-the-art generative performance but suffer from high sampling latency due to their sequential denoising nature. Existing solver-based acceleration methods often face significant image quality degradation under a low-latency budget, primarily due to accumulated truncation errors arising from the inability to capture high-curvature trajectory segments. In this paper, we propose the Ensemble Parallel Direction solver (dubbed as EPD-Solver), a novel ODE solver that mitigates these errors by incorporating multiple parallel gradient evaluations in each step. Motivated by the geometric insight that sampling trajectories are largely confined to a low-dimensional manifold, EPD-Solver leverages the Mean Value Theorem for vector-valued functions to approximate the integral solution more accurately. Importantly, since the additional gradient computations are independent, they can be fully parallelized, preserving low-latency sampling nature. We introduce a two-stage optimization framework. Initially, EPD-Solver optimizes a small set of learnable parameters via a distillation-based approach. We further propose a parameter-efficient Reinforcement Learning (RL) fine-tuning scheme that reformulates the solver as a stochastic Dirichlet policy. Unlike traditional methods that fine-tune the massive backbone, our RL approach operates strictly within the low-dimensional solver space, effectively mitigating reward hacking while enhancing performance in complex text-to-image (T2I) generation tasks. In addition, our method is flexible and can serve as a plugin (EPD-Plugin) to improve existing ODE samplers.
116. 【2512.22780】Plug In, Grade Right: Psychology-Inspired AGIQA
链接:https://arxiv.org/abs/2512.22780
作者:Zhicheng Liao,Baoliang Chen,Hanwei Zhu,Lingyu Zhu,Shiqi Wang,Weisi Lin
类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
关键词:Existing AGIQA models, Existing AGIQA, AGIQA models typically, text embeddings derived, measuring and aggregating
备注:
点击查看摘要
Abstract:Existing AGIQA models typically estimate image quality by measuring and aggregating the similarities between image embeddings and text embeddings derived from multi-grade quality descriptions. Although effective, we observe that such similarity distributions across grades usually exhibit multimodal patterns. For instance, an image embedding may show high similarity to both "excellent" and "poor" grade descriptions while deviating from the "good" one. We refer to this phenomenon as "semantic drift", where semantic inconsistencies between text embeddings and their intended descriptions undermine the reliability of text-image shared-space learning. To mitigate this issue, we draw inspiration from psychometrics and propose an improved Graded Response Model (GRM) for AGIQA. The GRM is a classical assessment model that categorizes a subject's ability across grades using test items with various difficulty levels. This paradigm aligns remarkably well with human quality rating, where image quality can be interpreted as an image's ability to meet various quality grades. Building on this philosophy, we design a two-branch quality grading module: one branch estimates image ability while the other constructs multiple difficulty levels. To ensure monotonicity in difficulty levels, we further model difficulty generation in an arithmetic manner, which inherently enforces a unimodal and interpretable quality distribution. Our Arithmetic GRM based Quality Grading (AGQG) module enjoys a plug-and-play advantage, consistently improving performance when integrated into various state-of-the-art AGIQA frameworks. Moreover, it also generalizes effectively to both natural and screen content image quality assessment, revealing its potential as a key component in future IQA models.
117. 【2512.22774】Schrodinger AI: A Unified Spectral-Dynamical Framework for Classification, Reasoning, and Operator-Based Generalization
链接:https://arxiv.org/abs/2512.22774
作者:Truong Son Nguyen
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:learning framework inspired, unified machine learning, machine learning framework, quantum mechanics, framework inspired
备注:
点击查看摘要
Abstract:We introduce \textbf{Schrödinger AI}, a unified machine learning framework inspired by quantum mechanics. The system is defined by three tightly coupled components: (1) a {time-independent wave-energy solver} that treats perception and classification as spectral decomposition under a learned Hamiltonian; (2) a {time-dependent dynamical solver} governing the evolution of semantic wavefunctions over time, enabling context-aware decision revision, re-routing, and reasoning under environmental changes; and (3) a {low-rank operator calculus} that learns symbolic transformations such as modular arithmetic through learned quantum-like transition operators. Together, these components form a coherent physics-driven alternative to conventional cross-entropy training and transformer attention, providing robust generalization, interpretable semantics, and emergent topology. Empirically, Schrödinger AI demonstrates: (a) emergent semantic manifolds that reflect human-conceived class relations without explicit supervision; (b) dynamic reasoning that adapts to changing environments, including maze navigation with real-time potential-field perturbations; and (c) exact operator generalization on modular arithmetic tasks, where the system learns group actions and composes them across sequences far beyond training length. These results suggest a new foundational direction for machine learning, where learning is cast as discovering and navigating an underlying semantic energy landscape.
Subjects:
Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Cite as:
arXiv:2512.22774 [cs.LG]
(or
arXiv:2512.22774v1 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2512.22774
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
118. 【2512.22771】Next Best View Selections for Semantic and Dynamic 3D Gaussian Splatting
链接:https://arxiv.org/abs/2512.22771
作者:Yiqian Li,Wen Jiang,Kostas Daniilidis
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:crucial for embodied, embodied agents, scene understanding task, understanding task, static scene understanding
备注:
点击查看摘要
Abstract:Understanding semantics and dynamics has been crucial for embodied agents in various tasks. Both tasks have much more data redundancy than the static scene understanding task. We formulate the view selection problem as an active learning problem, where the goal is to prioritize frames that provide the greatest information gain for model training. To this end, we propose an active learning algorithm with Fisher Information that quantifies the informativeness of candidate views with respect to both semantic Gaussian parameters and deformation networks. This formulation allows our method to jointly handle semantic reasoning and dynamic scene modeling, providing a principled alternative to heuristic or random strategies. We evaluate our method on large-scale static images and dynamic video datasets by selecting informative frames from multi-camera setups. Experimental results demonstrate that our approach consistently improves rendering quality and semantic segmentation performance, outperforming baseline methods based on random selection and uncertainty-based heuristics.
119. 【2512.22760】Neighbor-Aware Token Reduction via Hilbert Curve for Vision Transformers
链接:https://arxiv.org/abs/2512.22760
作者:Yunge Li,Lanyu Xu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Vision Transformers, visual recognition tasks, achieved remarkable success, recognition tasks, computational efficiency
备注:
点击查看摘要
Abstract:Vision Transformers (ViTs) have achieved remarkable success in visual recognition tasks, but redundant token representations limit their computational efficiency. Existing token merging and pruning strategies often overlook spatial continuity and neighbor relationships, resulting in the loss of local context. This paper proposes novel neighbor-aware token reduction methods based on Hilbert curve reordering, which explicitly preserves the neighbor structure in a 2D space using 1D sequential representations. Our method introduces two key strategies: Neighbor-Aware Pruning (NAP) for selective token retention and Merging by Adjacent Token similarity (MAT) for local token aggregation. Experiments demonstrate that our approach achieves state-of-the-art accuracy-efficiency trade-offs compared to existing methods. This work highlights the importance of spatial continuity and neighbor structure, offering new insights for the architectural optimization of ViTs.
120. 【2512.22748】rimTokenator-LC: Towards Adaptive Visual Token Pruning for Large Multimodal Models with Long Contexts
链接:https://arxiv.org/abs/2512.22748
作者:Hao Zhang,Mengsi Lyu,Bo Huang,Yulong Ao,Yonghua Lin
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Large Multimodal Models, Large Multimodal, Multimodal Models, Original Model sequences, Original Model
备注: 16 pages
点击查看摘要
Abstract:Large Multimodal Models (LMMs) have proven effective on various tasks. They typically encode visual inputs into Original Model sequences of tokens, which are then concatenated with textual tokens and jointly processed by the language model. However, the growing number of visual tokens greatly increases inference cost. Visual token pruning has emerged as a promising solution. However, existing methods often overlook scenarios involving long context inputs with multiple images. In this paper, we analyze the challenges of visual token pruning in long context, multi-image settings and introduce an adaptive pruning method tailored for such scenarios. We decompose redundancy into intra-image and inter-image components and quantify them through intra-image diversity and inter-image variation, which jointly guide dynamic budget allocation. Our approach consists of two stages. The intra-image stage allocates each image a content-aware token budget and greedily selects its most representative tokens. The inter-image stage performs global diversity filtering to form a candidate pool and then applies a Pareto selection procedure that balances diversity with text alignment. Extensive experiments show that our approach maintains strong performance in long context settings while significantly cutting down the number of visual tokens.
121. 【2512.22745】Split4D: Decomposed 4D Scene Reconstruction Without Video Segmentation
链接:https://arxiv.org/abs/2512.22745
作者:Yongzhen Hu,Yihui Yang,Haotong Lin,Yifan Wang,Junting Dong,Yifu Deng,Xinyu Zhu,Fan Jia,Hujun Bao,Xiaowei Zhou,Sida Peng
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:video segmentation, video segmentation maps, paper addresses, addresses the problem, video segmentation results
备注:
点击查看摘要
Abstract:This paper addresses the problem of decomposed 4D scene reconstruction from multi-view videos. Recent methods achieve this by lifting video segmentation results to a 4D representation through differentiable rendering techniques. Therefore, they heavily rely on the quality of video segmentation maps, which are often unstable, leading to unreliable reconstruction results. To overcome this challenge, our key idea is to represent the decomposed 4D scene with the Freetime FeatureGS and design a streaming feature learning strategy to accurately recover it from per-image segmentation maps, eliminating the need for video segmentation. Freetime FeatureGS models the dynamic scene as a set of Gaussian primitives with learnable features and linear motion ability, allowing them to move to neighboring regions over time. We apply a contrastive loss to Freetime FeatureGS, forcing primitive features to be close or far apart based on whether their projections belong to the same instance in the 2D segmentation map. As our Gaussian primitives can move across time, it naturally extends the feature learning to the temporal dimension, achieving 4D segmentation. Furthermore, we sample observations for training in a temporally ordered manner, enabling the streaming propagation of features over time and effectively avoiding local minima during the optimization process. Experimental results on several datasets show that the reconstruction quality of our method outperforms recent methods by a large margin.
122. 【2512.22730】Improved cystic hygroma detection from prenatal imaging using ultrasound-specific self-supervised representation learning
链接:https://arxiv.org/abs/2512.22730
作者:Youssef Megahed,Robin Ducharme,Inok Lee,Inbal Willner,Olivier X. Miguel,Kevin Dick,Adrian D. C. Chan,Mark Walker,Steven Hawken
类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
关键词:adverse pregnancy outcomes, portends high rates, high-risk prenatal ultrasound, prenatal ultrasound finding, structural malformations
备注: 13 pages, 6 figures, 2 tables
点击查看摘要
Abstract:Cystic hygroma is a high-risk prenatal ultrasound finding that portends high rates of chromosomal abnormalities, structural malformations, and adverse pregnancy outcomes. Automated detection can increase reproducibility and support scalable early screening programs, but supervised deep learning methods are limited by small labelled datasets. This study assesses whether ultrasound-specific self-supervised pretraining can facilitate accurate, robust deep learning detection of cystic hygroma in first-trimester ultrasound images. We fine-tuned the Ultrasound Self-Supervised Foundation Model with Masked Autoencoding (USF-MAE), pretrained on over 370,000 unlabelled ultrasound images, for binary classification of normal controls and cystic hygroma cases used in this study. Performance was evaluated on the same curated ultrasound dataset, preprocessing pipeline, and 4-fold cross-validation protocol as for the DenseNet-169 baseline, using accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (ROC-AUC). Model interpretability was analyzed qualitatively using Score-CAM visualizations. USF-MAE outperformed the DenseNet-169 baseline on all evaluation metrics. The proposed model yielded a mean accuracy of 0.96, sensitivity of 0.94, specificity of 0.98, and ROC-AUC of 0.98 compared to 0.93, 0.92, 0.94, and 0.94 for the DenseNet-169 baseline, respectively. Qualitative Score-CAM visualizations of model predictions demonstrated clinical relevance by highlighting expected regions in the fetal neck for both positive and negative cases. Paired statistical analysis using a Wilcoxon signed-rank test confirmed that performance improvements achieved by USF-MAE were statistically significant (p = 0.0057).
123. 【2512.22716】Memento-II: Learning by Stateful Reflective Memory
链接:https://arxiv.org/abs/2512.22716
作者:Jun Wang
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Stateful Reflective Decision, integrates episodic memory, language model agents, models reflective learning, Reflective Decision Process
备注: 32 pages, three figures
点击查看摘要
Abstract:We propose a theoretical framework for continual and experiential learning in large language model agents that integrates episodic memory with reinforcement learning. The framework identifies reflection as the key mechanism that enables agents to adapt through interaction without back propagation or model fine tuning, thereby relaxing the conventional separation between training and this http URL formalise this process, we introduce the Stateful Reflective Decision Process, which models reflective learning as a two stage read write interaction with episodic memory. Writing stores interaction outcomes and corresponds to policy evaluation, while reading retrieves relevant past cases and corresponds to policy improvement. We show that this process induces an equivalent Markov decision process over augmented state memory representations, allowing the use of classical tools from dynamic programming and reinforcement learning. We further instantiate the framework using entropy regularised policy iteration and establish convergence guarantees. As episodic memory grows and achieves sufficient coverage of the state space, the resulting policy converges to the optimal solution. This work provides a principled foundation for memory augmented and retrieval based language model agents capable of continual adaptation without parameter updates.
124. 【2512.22706】SCPainter: A Unified Framework for Realistic 3D Asset Insertion and Novel View Synthesis
链接:https://arxiv.org/abs/2512.22706
作者:Paul Dobre,Jackson Cooper,Xin Wang,Hongzhou Yang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Asset insertion, NVS, enhancing the diversity, key components, training data
备注:
点击查看摘要
Abstract:3D Asset insertion and novel view synthesis (NVS) are key components for autonomous driving simulation, enhancing the diversity of training data. With better training data that is diverse and covers a wide range of situations, including long-tailed driving scenarios, autonomous driving models can become more robust and safer. This motivates a unified simulation framework that can jointly handle realistic integration of inserted 3D assets and NVS. Recent 3D asset reconstruction methods enable reconstruction of dynamic actors from video, supporting their re-insertion into simulated driving scenes. While the overall structure and appearance can be accurate, it still struggles to capture the realism of 3D assets through lighting or shadows, particularly when inserted into scenes. In parallel, recent advances in NVS methods have demonstrated promising results in synthesizing viewpoints beyond the originally recorded trajectories. However, existing approaches largely treat asset insertion and NVS capabilities in isolation. To allow for interaction with the rest of the scene and to enable more diverse creation of new scenarios for training, realistic 3D asset insertion should be combined with NVS. To address this, we present SCPainter (Street Car Painter), a unified framework which integrates 3D Gaussian Splat (GS) car asset representations and 3D scene point clouds with diffusion-based generation to jointly enable realistic 3D asset insertion and NVS. The 3D GS assets and 3D scene point clouds are projected together into novel views, and these projections are used to condition a diffusion model to generate high quality images. Evaluation on the Waymo Open Dataset demonstrate the capability of our framework to enable 3D asset insertion and NVS, facilitating the creation of diverse and realistic driving data.
125. 【2512.22690】Mesquite MoCap: Democratizing Real-Time Motion Capture with Affordable, Bodyworn IoT Sensors and WebXR SLAM
链接:https://arxiv.org/abs/2512.22690
作者:Poojan Vanani,Darsh Patel,Danyal Khorami,Siva Munaganuru,Pavan Reddy,Varun Reddy,Bhargav Raghunath,Ishrat Lallmamode,Romir Patel,Assegid Kidané,Tejaswi Gowda
类目:Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
关键词:capture remains costly, complex to deploy, specialized laboratories, Motion capture remains, remains costly
备注: submitted to IEEE Journal of IoT
点击查看摘要
Abstract:Motion capture remains costly and complex to deploy, limiting use outside specialized laboratories. We present Mesquite, an open-source, low-cost inertial motion-capture system that combines a body-worn network of 15 IMU sensor nodes with a hip-worn Android smartphone for position tracking. A low-power wireless link streams quaternion orientations to a central USB dongle and a browser-based application for real-time visualization and recording. Built on modern web technologies -- WebGL for rendering, WebXR for SLAM, WebSerial and WebSockets for device and network I/O, and Progressive Web Apps for packaging -- the system runs cross-platform entirely in the browser. In benchmarks against a commercial optical system, Mesquite achieves mean joint-angle error of 2-5 degrees while operating at approximately 5% of the cost. The system sustains 30 frames per second with end-to-end latency under 15ms and a packet delivery rate of at least 99.7% in standard indoor environments. By leveraging IoT principles, edge processing, and a web-native stack, Mesquite lowers the barrier to motion capture for applications in entertainment, biomechanics, healthcare monitoring, human-computer interaction, and virtual reality. We release hardware designs, firmware, and software under an open-source license (GNU GPL).
126. 【2512.22689】Multimodal Diffeomorphic Registration with Neural ODEs and Structural Descriptors
链接:https://arxiv.org/abs/2512.22689
作者:Salvador Rodriguez-Sanz,Monica Hernandez
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Ordinary Differential Equations, Neural Ordinary Differential, Differential Equations, Ordinary Differential, Neural Ordinary
备注:
点击查看摘要
Abstract:This work proposes a multimodal diffeomorphic registration method using Neural Ordinary Differential Equations (Neural ODEs). Nonrigid registration algorithms exhibit tradeoffs between their accuracy, the computational complexity of their deformation model, and its proper regularization. In addition, they also assume intensity correlation in anatomically homologous regions of interest among image pairs, limiting their applicability to the monomodal setting. Unlike learning-based models, we propose an instance-specific framework that is not subject to high scan requirements for training and does not suffer performance degradation at inference time on modalities unseen during training. Our method exploits the potential of continuous-depth networks in the Neural ODE paradigm with structural descriptors, widely adopted as modality-agnostic metric models which exploit self-similarities on parameterized neighborhood geometries. We propose three different variants that integrate image-based or feature-based structural descriptors and nonstructural image similarities computed by local mutual information. We conduct extensive evaluations on different experiments formed by scan dataset combinations and show surpassing qualitative and quantitative results compared to state-of-the-art baselines adequate for large or small deformations, and specific of multimodal registration. Lastly, we also demonstrate the underlying robustness of the proposed framework to varying levels of explicit regularization while maintaining low error, its suitability for registration at varying scales, and its efficiency with respect to other methods targeted to large-deformation registration.
127. 【2512.22688】Autoregressive Flow Matching for Motion Prediction
链接:https://arxiv.org/abs/2512.22688
作者:Johnathan Xie,Stefan Stojanov,Cristobal Eyzaguirre,Daniel L. K. Yamins,Jiajun Wu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Motion prediction, trained on narrow, narrow distributions, distributions and applied, human motion prediction
备注:
点击查看摘要
Abstract:Motion prediction has been studied in different contexts with models trained on narrow distributions and applied to downstream tasks in human motion prediction and robotics. Simultaneously, recent efforts in scaling video prediction have demonstrated impressive visual realism, yet they struggle to accurately model complex motions despite massive scale. Inspired by the scaling of video generation, we develop autoregressive flow matching (ARFM), a new method for probabilistic modeling of sequential continuous data and train it on diverse video datasets to generate future point track locations over long horizons. To evaluate our model, we develop benchmarks for evaluating the ability of motion prediction models to predict human and robot motion. Our model is able to predict complex motions, and we demonstrate that conditioning robot action prediction and human motion prediction on predicted future tracks can significantly improve downstream task performance. Code and models publicly available at: this https URL.
128. 【2512.22681】CritiFusion: Semantic Critique and Spectral Alignment for Faithful Text-to-Image Generation
链接:https://arxiv.org/abs/2512.22681
作者:ZhenQi Chen,TsaiChing Ni,YuanFu Yang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:achieved remarkable visual, achieved remarkable, Recent, complex prompts, semantic
备注:
点击查看摘要
Abstract:Recent text-to-image diffusion models have achieved remarkable visual fidelity but often struggle with semantic alignment to complex prompts. We introduce CritiFusion, a novel inference-time framework that integrates a multimodal semantic critique mechanism with frequency-domain refinement to improve text-to-image consistency and detail. The proposed CritiCore module leverages a vision-language model and multiple large language models to enrich the prompt context and produce high-level semantic feedback, guiding the diffusion process to better align generated content with the prompt's intent. Additionally, SpecFusion merges intermediate generation states in the spectral domain, injecting coarse structural information while preserving high-frequency details. No additional model training is required. CritiFusion serves as a plug-in refinement stage compatible with existing diffusion backbones. Experiments on standard benchmarks show that our method notably improves human-aligned metrics of text-to-image correspondence and visual quality. CritiFusion consistently boosts performance on human preference scores and aesthetic evaluations, achieving results on par with state-of-the-art reward optimization approaches. Qualitative results further demonstrate superior detail, realism, and prompt fidelity, indicating the effectiveness of our semantic critique and spectral alignment strategy.
129. 【2512.22666】INTERACT-CMIL: Multi-Task Shared Learning and Inter-Task Consistency for Conjunctival Melanocytic Intraepithelial Lesion Grading
链接:https://arxiv.org/abs/2512.22666
作者:Mert Ikinci,Luna Toma,Karin U. Loeffler,Leticia Ussem,Daniela Süsskind,Julia M. Weller,Yousef Yeganeh,Martina C. Herwig-Carl,Shadi Albarqouni
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Melanocytic Intraepithelial Lesions, Conjunctival Melanocytic Intraepithelial, interrelated diagnostic criteria, remains difficult due, subtle morphological cues
备注:
点击查看摘要
Abstract:Accurate grading of Conjunctival Melanocytic Intraepithelial Lesions (CMIL) is essential for treatment and melanoma prediction but remains difficult due to subtle morphological cues and interrelated diagnostic criteria. We introduce INTERACT-CMIL, a multi-head deep learning framework that jointly predicts five histopathological axes; WHO4, WHO5, horizontal spread, vertical spread, and cytologic atypia, through Shared Feature Learning with Combinatorial Partial Supervision and an Inter-Dependence Loss enforcing cross-task consistency. Trained and evaluated on a newly curated, multi-center dataset of 486 expert-annotated conjunctival biopsy patches from three university hospitals, INTERACT-CMIL achieves consistent improvements over CNN and foundation-model (FM) baselines, with relative macro F1 gains up to 55.1% (WHO4) and 25.0% (vertical spread). The framework provides coherent, interpretable multi-criteria predictions aligned with expert grading, offering a reproducible computational benchmark for CMIL diagnosis and a step toward standardized digital ocular pathology.
130. 【2512.22664】Unleashing Foundation Vision Models: Adaptive Transfer for Diverse Data-Limited Scientific Domains
链接:https://arxiv.org/abs/2512.22664
作者:Qiankun Li,Feng He,Huabao Chen,Xin Ning,Kun Wang,Zengfu Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:acquiring substantial knowledge, big data era, computer vision field, vision field benefits, acquiring substantial
备注:
点击查看摘要
Abstract:In the big data era, the computer vision field benefits from large-scale datasets such as LAION-2B, LAION-400M, and ImageNet-21K, Kinetics, on which popular models like the ViT and ConvNeXt series have been pre-trained, acquiring substantial knowledge. However, numerous downstream tasks in specialized and data-limited scientific domains continue to pose significant challenges. In this paper, we propose a novel Cluster Attention Adapter (CLAdapter), which refines and adapts the rich representations learned from large-scale data to various data-limited downstream tasks. Specifically, CLAdapter introduces attention mechanisms and cluster centers to personalize the enhancement of transformed features through distribution correlation and transformation matrices. This enables models fine-tuned with CLAdapter to learn distinct representations tailored to different feature sets, facilitating the models' adaptation from rich pre-trained features to various downstream scenarios effectively. In addition, CLAdapter's unified interface design allows for seamless integration with multiple model architectures, including CNNs and Transformers, in both 2D and 3D contexts. Through extensive experiments on 10 datasets spanning domains such as generic, multimedia, biological, medical, industrial, agricultural, environmental, geographical, materials science, out-of-distribution (OOD), and 3D analysis, CLAdapter achieves state-of-the-art performance across diverse data-limited scientific domains, demonstrating its effectiveness in unleashing the potential of foundation vision models via adaptive transfer. Code is available at this https URL.
131. 【2512.22657】Investigating Deep Learning Models for Ejection Fraction Estimation from Echocardiography Videos
链接:https://arxiv.org/abs/2512.22657
作者:Shravan Saranyan,Pramit Saha
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Left ventricular ejection, ventricular ejection fraction, Left ventricular, ejection fraction, cardiovascular disease
备注:
点击查看摘要
Abstract:Left ventricular ejection fraction (LVEF) is a key indicator of cardiac function and plays a central role in the diagnosis and management of cardiovascular disease. Echocardiography, as a readily accessible and non-invasive imaging modality, is widely used in clinical practice to estimate LVEF. However, manual assessment of cardiac function from echocardiograms is time-consuming and subject to considerable inter-observer variability. Deep learning approaches offer a promising alternative, with the potential to achieve performance comparable to that of experienced human experts. In this study, we investigate the effectiveness of several deep learning architectures for LVEF estimation from echocardiography videos, including 3D Inception, two-stream, and CNN-RNN models. We systematically evaluate architectural modifications and fusion strategies to identify configurations that maximize prediction accuracy. Models were trained and evaluated on the EchoNet-Dynamic dataset, comprising 10,030 echocardiogram videos. Our results demonstrate that modified 3D Inception architectures achieve the best overall performance, with a root mean squared error (RMSE) of 6.79%. Across architectures, we observe a tendency toward overfitting, with smaller and simpler models generally exhibiting improved generalization. Model performance was also found to be highly sensitive to hyperparameter choices, particularly convolutional kernel sizes and normalization strategies. While this study focuses on echocardiography-based LVEF estimation, the insights gained regarding architectural design and training strategies may be applicable to a broader range of medical and non-medical video analysis tasks.
132. 【2512.22653】Visual Autoregressive Modelling for Monocular Depth Estimation
链接:https://arxiv.org/abs/2512.22653
作者:Amir El-Ghoussani,André Kaup,Nassir Navab,Gustavo Carneiro,Vasileios Belagiannis
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:offering an alternative, diffusion-based approaches, propose a monocular, based on visual, alternative to diffusion-based
备注:
点击查看摘要
Abstract:We propose a monocular depth estimation method based on visual autoregressive (VAR) priors, offering an alternative to diffusion-based approaches. Our method adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism with classifier-free guidance. Our approach performs inference in ten fixed autoregressive stages, requiring only 74K synthetic samples for fine-tuning, and achieves competitive results. We report state-of-the-art performance in indoor benchmarks under constrained training conditions, and strong performance when applied to outdoor datasets. This work establishes autoregressive priors as a complementary family of geometry-aware generative models for depth estimation, highlighting advantages in data scalability, and adaptability to 3D vision tasks. Code available at "this https URL.
133. 【2512.22647】FinPercep-RM: A Fine-grained Reward Model and Co-evolutionary Curriculum for RL-based Real-world Super-Resolution
链接:https://arxiv.org/abs/2512.22647
作者:Yidi Liu,Zihao Fan,Jie Huang,Jie Xiao,Dong Li,Wenlong Zhang,Lei Bai,Xueyang Fu,Zheng-Jun Zha
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:align human preferences, Human Feedback, image generation field, Image Quality Assessment, generation field guided
备注:
点击查看摘要
Abstract:Reinforcement Learning with Human Feedback (RLHF) has proven effective in image generation field guided by reward models to align human preferences. Motivated by this, adapting RLHF for Image Super-Resolution (ISR) tasks has shown promise in optimizing perceptual quality with Image Quality Assessment (IQA) model as reward models. However, the traditional IQA model usually output a single global score, which are exceptionally insensitive to local and fine-grained distortions. This insensitivity allows ISR models to produce perceptually undesirable artifacts that yield spurious high scores, misaligning optimization objectives with perceptual quality and results in reward hacking. To address this, we propose a Fine-grained Perceptual Reward Model (FinPercep-RM) based on an Encoder-Decoder architecture. While providing a global quality score, it also generates a Perceptual Degradation Map that spatially localizes and quantifies local defects. We specifically introduce the FGR-30k dataset to train this model, consisting of diverse and subtle distortions from real-world super-resolution models. Despite the success of the FinPercep-RM model, its complexity introduces significant challenges in generator policy learning, leading to training instability. To address this, we propose a Co-evolutionary Curriculum Learning (CCL) mechanism, where both the reward model and the ISR model undergo synchronized curricula. The reward model progressively increases in complexity, while the ISR model starts with a simpler global reward for rapid convergence, gradually transitioning to the more complex model outputs. This easy-to-hard strategy enables stable training while suppressing reward hacking. Experiments validates the effectiveness of our method across ISR models in both global quality and local realism on RLHF methods.
134. 【2512.22626】Envision: Embodied Visual Planning via Goal-Imagery Video Diffusion
链接:https://arxiv.org/abs/2512.22626
作者:Yuming Gu,Yizhi Wang,Yining Hong,Yipeng Gao,Hao Jiang,Angtian Wang,Bo Liu,Nathaniel S. Dennler,Zhengfei Kuang,Hao Li,Gordon Wetzstein,Chongyang Ma
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:enable manipulation tasks, guide actions, goal, aims to enable, tasks by imagining
备注: Page: [this https URL](https://envision-paper.github.io)
点击查看摘要
Abstract:Embodied visual planning aims to enable manipulation tasks by imagining how a scene evolves toward a desired goal and using the imagined trajectories to guide actions. Video diffusion models, through their image-to-video generation capability, provide a promising foundation for such visual imagination. However, existing approaches are largely forward predictive, generating trajectories conditioned on the initial observation without explicit goal modeling, thus often leading to spatial drift and goal misalignment. To address these challenges, we propose Envision, a diffusion-based framework that performs visual planning for embodied agents. By explicitly constraining the generation with a goal image, our method enforces physical plausibility and goal consistency throughout the generated trajectory. Specifically, Envision operates in two stages. First, a Goal Imagery Model identifies task-relevant regions, performs region-aware cross attention between the scene and the instruction, and synthesizes a coherent goal image that captures the desired outcome. Then, an Env-Goal Video Model, built upon a first-and-last-frame-conditioned video diffusion model (FL2V), interpolates between the initial observation and the goal image, producing smooth and physically plausible video trajectories that connect the start and goal states. Experiments on object manipulation and image editing benchmarks demonstrate that Envision achieves superior goal alignment, spatial consistency, and object preservation compared to baselines. The resulting visual plans can directly support downstream robotic planning and control, providing reliable guidance for embodied agents.
135. 【2512.22624】Rethinking Memory Design in SAM-Based Visual Object Tracking
链接:https://arxiv.org/abs/2512.22624
作者:Mohamad Alansari,Muzammal Naseer,Hasan Al Marzouqi,Naoufel Werghi,Sajid Javed
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:modern segmentation-based frameworks, Memory, modern segmentation-based, robust visual object, visual object tracking
备注: \textbf{This is a preprint. Some results are being finalized and may be updated in a future revision.}
点击查看摘要
Abstract:\noindent Memory has become the central mechanism enabling robust visual object tracking in modern segmentation-based frameworks. Recent methods built upon Segment Anything Model 2 (SAM2) have demonstrated strong performance by refining how past observations are stored and reused. However, existing approaches address memory limitations in a method-specific manner, leaving the broader design principles of memory in SAM-based tracking poorly understood. Moreover, it remains unclear how these memory mechanisms transfer to stronger, next-generation foundation models such as Segment Anything Model 3 (SAM3). In this work, we present a systematic memory-centric study of SAM-based visual object tracking. We first analyze representative SAM2-based trackers and show that most methods primarily differ in how short-term memory frames are selected, while sharing a common object-centric representation. Building on this insight, we faithfully reimplement these memory mechanisms within the SAM3 framework and conduct large-scale evaluations across ten diverse benchmarks, enabling a controlled analysis of memory design independent of backbone strength. Guided by our empirical findings, we propose a unified hybrid memory framework that explicitly decomposes memory into short-term appearance memory and long-term distractor-resolving memory. This decomposition enables the integration of existing memory policies in a modular and principled manner. Extensive experiments demonstrate that the proposed framework consistently improves robustness under long-term occlusion, complex motion, and distractor-heavy scenarios on both SAM2 and SAM3 backbones. Code is available at: this https URL. \textbf{This is a preprint. Some results are being finalized and may be updated in a future revision.}
136. 【2512.22615】Dream-VL Dream-VLA: Open Vision-Language and Vision-Language-Action Models with Diffusion Language Model Backbone
链接:https://arxiv.org/abs/2512.22615
作者:Jiacheng Ye,Shansan Gong,Jiahui Gao,Junming Fan,Shuang Wu,Wei Bi,Haoli Bai,Lifeng Shang,Lingpeng Kong
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:autoregressive Large Vision-Language, dynamic robotic control, Large Vision-Language Models, achieved remarkable success, autoregressive Large
备注:
点击查看摘要
Abstract:While autoregressive Large Vision-Language Models (VLMs) have achieved remarkable success, their sequential generation often limits their efficacy in complex visual planning and dynamic robotic control. In this work, we investigate the potential of constructing Vision-Language Models upon diffusion-based large language models (dLLMs) to overcome these limitations. We introduce Dream-VL, an open diffusion-based VLM (dVLM) that achieves state-of-the-art performance among previous dVLMs. Dream-VL is comparable to top-tier AR-based VLMs trained on open data on various benchmarks but exhibits superior potential when applied to visual planning tasks. Building upon Dream-VL, we introduce Dream-VLA, a dLLM-based Vision-Language-Action model (dVLA) developed through continuous pre-training on open robotic datasets. We demonstrate that the natively bidirectional nature of this diffusion backbone serves as a superior foundation for VLA tasks, inherently suited for action chunking and parallel generation, leading to significantly faster convergence in downstream fine-tuning. Dream-VLA achieves top-tier performance of 97.2% average success rate on LIBERO, 71.4% overall average on SimplerEnv-Bridge, and 60.5% overall average on SimplerEnv-Fractal, surpassing leading models such as $\pi_0$ and GR00T-N1. We also validate that dVLMs surpass AR baselines on downstream tasks across different training objectives. We release both Dream-VL and Dream-VLA to facilitate further research in the community.
137. 【2512.22612】Enhancing Noise Resilience in Face Clustering via Sparse Differential Transformer
链接:https://arxiv.org/abs/2512.22612
作者:Dafeng Zhang,Yongqi Song,Shizhuo Liu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Jaccard similarity coefficient, face embeddings plays, embeddings plays, plays a crucial, crucial role
备注: Published at AAAI 2026
点击查看摘要
Abstract:The method used to measure relationships between face embeddings plays a crucial role in determining the performance of face clustering. Existing methods employ the Jaccard similarity coefficient instead of the cosine distance to enhance the measurement accuracy. However, these methods introduce too many irrelevant nodes, producing Jaccard coefficients with limited discriminative power and adversely affecting clustering performance. To address this issue, we propose a prediction-driven Top-K Jaccard similarity coefficient that enhances the purity of neighboring nodes, thereby improving the reliability of similarity measurements. Nevertheless, accurately predicting the optimal number of neighbors (Top-K) remains challenging, leading to suboptimal clustering results. To overcome this limitation, we develop a Transformer-based prediction model that examines the relationships between the central node and its neighboring nodes near the Top-K to further enhance the reliability of similarity estimation. However, vanilla Transformer, when applied to predict relationships between nodes, often introduces noise due to their overemphasis on irrelevant feature relationships. To address these challenges, we propose a Sparse Differential Transformer (SDT), instead of the vanilla Transformer, to eliminate noise and enhance the model's anti-noise capabilities. Extensive experiments on multiple datasets, such as MS-Celeb-1M, demonstrate that our approach achieves state-of-the-art (SOTA) performance, outperforming existing methods and providing a more robust solution for face clustering.
138. 【2512.22605】Learning Multi-Modal Mobility Dynamics for Generalized Next Location Recommendation
链接:https://arxiv.org/abs/2512.22605
作者:Junshu Dai,Yu Wang,Tongya Zheng,Wei Ji,Qinghong Guo,Ji Cao,Jie Song,Canghong Jin,Mingli Song
类目:Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:significant socioeconomic impacts, produced significant socioeconomic, textbf, socioeconomic impacts, evacuation suggestions
备注:
点击查看摘要
Abstract:The precise prediction of human mobility has produced significant socioeconomic impacts, such as location recommendations and evacuation suggestions. However, existing methods suffer from limited generalization capability: unimodal approaches are constrained by data sparsity and inherent biases, while multi-modal methods struggle to effectively capture mobility dynamics caused by the semantic gap between static multi-modal representation and spatial-temporal dynamics. Therefore, we leverage multi-modal spatial-temporal knowledge to characterize mobility dynamics for the location recommendation task, dubbed as \textbf{M}ulti-\textbf{M}odal \textbf{Mob}ility (\textbf{M}$^3$\textbf{ob}). First, we construct a unified spatial-temporal relational graph (STRG) for multi-modal representation, by leveraging the functional semantics and spatial-temporal knowledge captured by the large language models (LLMs)-enhanced spatial-temporal knowledge graph (STKG). Second, we design a gating mechanism to fuse spatial-temporal graph representations of different modalities, and propose an STKG-guided cross-modal alignment to inject spatial-temporal dynamic knowledge into the static image modality. Extensive experiments on six public datasets show that our proposed method not only achieves consistent improvements in normal scenarios but also exhibits significant generalization ability in abnormal scenarios.
139. 【2512.22602】PTalker: Personalized Speech-Driven 3D Talking Head Animation via Style Disentanglement and Modality Alignment
链接:https://arxiv.org/abs/2512.22602
作者:Bin Wang,Yang Xu,Huan Zhao,Hao Zhang,Zixing Zhang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:head generation aims, animations precisely synchronized, produce lifelike facial, talking head generation, lifelike facial animations
备注:
点击查看摘要
Abstract:Speech-driven 3D talking head generation aims to produce lifelike facial animations precisely synchronized with speech. While considerable progress has been made in achieving high lip-synchronization accuracy, existing methods largely overlook the intricate nuances of individual speaking styles, which limits personalization and realism. In this work, we present a novel framework for personalized 3D talking head animation, namely "PTalker". This framework preserves speaking style through style disentanglement from audio and facial motion sequences and enhances lip-synchronization accuracy through a three-level alignment mechanism between audio and mesh modalities. Specifically, to effectively disentangle style and content, we design disentanglement constraints that encode driven audio and motion sequences into distinct style and content spaces to enhance speaking style representation. To improve lip-synchronization accuracy, we adopt a modality alignment mechanism incorporating three aspects: spatial alignment using Graph Attention Networks to capture vertex connectivity in the 3D mesh structure, temporal alignment using cross-attention to capture and synchronize temporal dependencies, and feature alignment by top-k bidirectional contrastive losses and KL divergence constraints to ensure consistency between speech and mesh modalities. Extensive qualitative and quantitative experiments on public datasets demonstrate that PTalker effectively generates realistic, stylized 3D talking heads that accurately match identity-specific speaking styles, outperforming state-of-the-art methods. The source code and supplementary videos are available at: PTalker.
140. 【2512.22581】KV-Tracker: Real-Time Pose Tracking with Transformers
链接:https://arxiv.org/abs/2512.22581
作者:Marwan Taher,Ignacio Alzugaray,Kirill Mazur,Xin Kong,Andrew J. Davison
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:geometry networks offer, offer a powerful, prohibitively slow, real-time applications, monocular RGB videos
备注: Project Page: [this https URL](https://marwan99.github.io/kv_tracker)
点击查看摘要
Abstract:Multi-view 3D geometry networks offer a powerful prior but are prohibitively slow for real-time applications. We propose a novel way to adapt them for online use, enabling real-time 6-DoF pose tracking and online reconstruction of objects and scenes from monocular RGB videos. Our method rapidly selects and manages a set of images as keyframes to map a scene or object via $\pi^3$ with full bidirectional attention. We then cache the global self-attention block's key-value (KV) pairs and use them as the sole scene representation for online tracking. This allows for up to $15\times$ speedup during inference without the fear of drift or catastrophic forgetting. Our caching strategy is model-agnostic and can be applied to other off-the-shelf multi-view networks without retraining. We demonstrate KV-Tracker on both scene-level tracking and the more challenging task of on-the-fly object tracking and reconstruction without depth measurements or object priors. Experiments on the TUM RGB-D, 7-Scenes, Arctic and OnePose datasets show the strong performance of our system while maintaining high frame-rates up to ${\sim}27$ FPS.
141. 【2512.22570】ReFRM3D: A Radiomics-enhanced Fused Residual Multiparametric 3D Network with Multi-Scale Feature Fusion for Glioma Characterization
链接:https://arxiv.org/abs/2512.22570
作者:Md. Abdur Rahman,Mohaimenul Azam Khan Raiaan,Arefin Ittesafun Abian,Yan Zhang,Mirjam Jonkman,Sami Azam
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:complex diagnostic processes, high mortality rates, aggressive cancers, diagnostic processes, mortality rates
备注:
点击查看摘要
Abstract:Gliomas are among the most aggressive cancers, characterized by high mortality rates and complex diagnostic processes. Existing studies on glioma diagnosis and classification often describe issues such as high variability in imaging data, inadequate optimization of computational resources, and inefficient segmentation and classification of gliomas. To address these challenges, we propose novel techniques utilizing multi-parametric MRI data to enhance tumor segmentation and classification efficiency. Our work introduces the first-ever radiomics-enhanced fused residual multiparametric 3D network (ReFRM3D) for brain tumor characterization, which is based on a 3D U-Net architecture and features multi-scale feature fusion, hybrid upsampling, and an extended residual skip mechanism. Additionally, we propose a multi-feature tumor marker-based classifier that leverages radiomic features extracted from the segmented regions. Experimental results demonstrate significant improvements in segmentation performance across the BraTS2019, BraTS2020, and BraTS2021 datasets, achieving high Dice Similarity Coefficients (DSC) of 94.04%, 92.68%, and 93.64% for whole tumor (WT), enhancing tumor (ET), and tumor core (TC) respectively in BraTS2019; 94.09%, 92.91%, and 93.84% in BraTS2020; and 93.70%, 90.36%, and 92.13% in BraTS2021.
142. 【2512.22545】Self-Rewarded Multimodal Coherent Reasoning Across Diverse Visual Domains
链接:https://arxiv.org/abs/2512.22545
作者:Jesen Zhang,Ningyuan Liu,Kaitong Cai,Sidi Liu,Jing Yang,Ziliang Chen,Xiaofei Sun,Keze Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Multimodal LLMs, existing alignment approaches, alignment approaches supervise, insufficient visual grounding, exhibiting weak
备注:
点击查看摘要
Abstract:Multimodal LLMs often produce fluent yet unreliable reasoning, exhibiting weak step-to-step coherence and insufficient visual grounding, largely because existing alignment approaches supervise only the final answer while ignoring the reliability of the intermediate reasoning process. We introduce SR-MCR, a lightweight and label-free framework that aligns reasoning by exploiting intrinsic process signals derived directly from model outputs. Five self-referential cues -- semantic alignment, lexical fidelity, non-redundancy, visual grounding, and step consistency -- are integrated into a normalized, reliability-weighted reward that provides fine-grained process-level guidance. A critic-free GRPO objective, enhanced with a confidence-aware cooling mechanism, further stabilizes training and suppresses trivial or overly confident generations. Built on Qwen2.5-VL, SR-MCR improves both answer accuracy and reasoning coherence across a broad set of visual benchmarks; among open-source models of comparable size, SR-MCR-7B achieves state-of-the-art performance with an average accuracy of 81.4%. Ablation studies confirm the independent contributions of each reward term and the cooling module.
143. 【2512.22539】VLA-Arena: An Open-Source Framework for Benchmarking Vision-Language-Action Models
链接:https://arxiv.org/abs/2512.22539
作者:Borong Zhang,Jiahao Li,Jiachen Shen,Yishuai Cai,Yuhao Zhang,Yuanpei Chen,Juntao Dai,Jiaming Ji,Yaodong Yang
类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
关键词:generalist robot policies, robot policies, failure modes, rapidly advancing, advancing towards generalist
备注:
点击查看摘要
Abstract:While Vision-Language-Action models (VLAs) are rapidly advancing towards generalist robot policies, it remains difficult to quantitatively understand their limits and failure modes. To address this, we introduce a comprehensive benchmark called VLA-Arena. We propose a novel structured task design framework to quantify difficulty across three orthogonal axes: (1) Task Structure, (2) Language Command, and (3) Visual Observation. This allows us to systematically design tasks with fine-grained difficulty levels, enabling a precise measurement of model capability frontiers. For Task Structure, VLA-Arena's 170 tasks are grouped into four dimensions: Safety, Distractor, Extrapolation, and Long Horizon. Each task is designed with three difficulty levels (L0-L2), with fine-tuning performed exclusively on L0 to assess general capability. Orthogonal to this, language (W0-W4) and visual (V0-V4) perturbations can be applied to any task to enable a decoupled analysis of robustness. Our extensive evaluation of state-of-the-art VLAs reveals several critical limitations, including a strong tendency toward memorization over generalization, asymmetric robustness, a lack of consideration for safety constraints, and an inability to compose learned skills for long-horizon tasks. To foster research addressing these challenges and ensure reproducibility, we provide the complete VLA-Arena framework, including an end-to-end toolchain from task definition to automated evaluation and the VLA-Arena-S/M/L datasets for fine-tuning. Our benchmark, data, models, and leaderboard are available at this https URL.
144. 【2512.22536】CoAgent: Collaborative Planning and Consistency Agent for Coherent Video Generation
链接:https://arxiv.org/abs/2512.22536
作者:Qinglin Zeng,Kaitong Cai,Ruiqi Chen,Qinhan Lv,Keze Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Maintaining narrative coherence, open-domain video generation, visual consistency remains, remains a central, central challenge
备注:
点击查看摘要
Abstract:Maintaining narrative coherence and visual consistency remains a central challenge in open-domain video generation. Existing text-to-video models often treat each shot independently, resulting in identity drift, scene inconsistency, and unstable temporal structure. We propose CoAgent, a collaborative and closed-loop framework for coherent video generation that formulates the process as a plan-synthesize-verify pipeline. Given a user prompt, style reference, and pacing constraints, a Storyboard Planner decomposes the input into structured shot-level plans with explicit entities, spatial relations, and temporal cues. A Global Context Manager maintains entity-level memory to preserve appearance and identity consistency across shots. Each shot is then generated by a Synthesis Module under the guidance of a Visual Consistency Controller, while a Verifier Agent evaluates intermediate results using vision-language reasoning and triggers selective regeneration when inconsistencies are detected. Finally, a pacing-aware editor refines temporal rhythm and transitions to match the desired narrative flow. Extensive experiments demonstrate that CoAgent significantly improves coherence, visual consistency, and narrative quality in long-form video generation.
145. 【2512.22525】DreamOmni3: Scribble-based Editing and Generation
链接:https://arxiv.org/abs/2512.22525
作者:Bin Xia,Bohao Peng,Jiyang Liu,Sitong Wu,Jingyao Li,Junjia Huang,Xu Zhao,Yitong Wang,Ruihang Chu,Bei Yu,Jiaya Jia
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Recently unified generation, achieved remarkable success, Recently unified, editing, instruction-based editing
备注:
点击查看摘要
Abstract:Recently unified generation and editing models have achieved remarkable success with their impressive performance. These models rely mainly on text prompts for instruction-based editing and generation, but language often fails to capture users intended edit locations and fine-grained visual details. To this end, we propose two tasks: scribble-based editing and generation, that enables more flexible creation on graphical user interface (GUI) combining user textual, images, and freehand sketches. We introduce DreamOmni3, tackling two challenges: data creation and framework design. Our data synthesis pipeline includes two parts: scribble-based editing and generation. For scribble-based editing, we define four tasks: scribble and instruction-based editing, scribble and multimodal instruction-based editing, image fusion, and doodle editing. Based on DreamOmni2 dataset, we extract editable regions and overlay hand-drawn boxes, circles, doodles or cropped image to construct training data. For scribble-based generation, we define three tasks: scribble and instruction-based generation, scribble and multimodal instruction-based generation, and doodle generation, following similar data creation pipelines. For the framework, instead of using binary masks, which struggle with complex edits involving multiple scribbles, images, and instructions, we propose a joint input scheme that feeds both the original and scribbled source images into the model, using different colors to distinguish regions and simplify processing. By applying the same index and position encodings to both images, the model can precisely localize scribbled regions while maintaining accurate editing. Finally, we establish comprehensive benchmarks for these tasks to promote further research. Experimental results demonstrate that DreamOmni3 achieves outstanding performance, and models and code will be publicly released.
146. 【2512.22503】SCAFusion: A Multimodal 3D Detection Framework for Small Object Detection in Lunar Surface Exploration
链接:https://arxiv.org/abs/2512.22503
作者:Xin Chen,Kang Luo,Yangyi Xiao,Hesheng Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Reliable and precise, lunar surface exploration, fragments and rocks, surface exploration, navigation and operation
备注:
点击查看摘要
Abstract:Reliable and precise detection of small and irregular objects, such as meteor fragments and rocks, is critical for autonomous navigation and operation in lunar surface exploration. Existing multimodal 3D perception methods designed for terrestrial autonomous driving often underperform in off world environments due to poor feature alignment, limited multimodal synergy, and weak small object detection. This paper presents SCAFusion, a multimodal 3D object detection model tailored for lunar robotic missions. Built upon the BEVFusion framework, SCAFusion integrates a Cognitive Adapter for efficient camera backbone tuning, a Contrastive Alignment Module to enhance camera LiDAR feature consistency, a Camera Auxiliary Training Branch to strengthen visual representation, and most importantly, a Section aware Coordinate Attention mechanism explicitly designed to boost the detection performance of small, irregular targets. With negligible increase in parameters and computation, our model achieves 69.7% mAP and 72.1% NDS on the nuScenes validation set, improving the baseline by 5.0% and 2.7%, respectively. In simulated lunar environments built on Isaac Sim, SCAFusion achieves 90.93% mAP, outperforming the baseline by 11.5%, with notable gains in detecting small meteor like obstacles.
147. 【2512.22489】racking by Predicting 3-D Gaussians Over Time
链接:https://arxiv.org/abs/2512.22489
作者:Tanish Baranwal,Himanshu Gaurav Singh,Jathushan Rajasegaran,Jitendra Malik
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Gaussian Masked Autoencoders, Video Gaussian Masked, Masked Autoencoders, Gaussian splats moving, propose Video Gaussian
备注:
点击查看摘要
Abstract:We propose Video Gaussian Masked Autoencoders (Video-GMAE), a self-supervised approach for representation learning that encodes a sequence of images into a set of Gaussian splats moving over time. Representing a video as a set of Gaussians enforces a reasonable inductive bias: that 2-D videos are often consistent projections of a dynamic 3-D scene. We find that tracking emerges when pretraining a network with this architecture. Mapping the trajectory of the learnt Gaussians onto the image plane gives zero-shot tracking performance comparable to state-of-the-art. With small-scale finetuning, our models achieve 34.6% improvement on Kinetics, and 13.1% on Kubric datasets, surpassing existing self-supervised video approaches. The project page and code are publicly available at this https URL and this https URL.
148. 【2512.22488】oward Real-World IoT Security: Concept Drift-Resilient IoT Botnet Detection via Latent Space Representation Learning and Alignment
链接:https://arxiv.org/abs/2512.22488
作者:Hassan Wasswa,Timothy Lynar
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:achieved high accuracy, constrained by reliance, reliance on stationary, fail to reflect, frequently affected
备注:
点击查看摘要
Abstract:Although AI-based models have achieved high accuracy in IoT threat detection, their deployment in enterprise environments is constrained by reliance on stationary datasets that fail to reflect the dynamic nature of real-world IoT NetFlow traffic, which is frequently affected by concept drift. Existing solutions typically rely on periodic classifier retraining, resulting in high computational overhead and the risk of catastrophic forgetting. To address these challenges, this paper proposes a scalable framework for adaptive IoT threat detection that eliminates the need for continuous classifier retraining. The proposed approach trains a classifier once on latent-space representations of historical traffic, while an alignment model maps incoming traffic to the learned historical latent space prior to classification, thereby preserving knowledge of previously observed attacks. To capture inter-instance relationships among attack samples, the low-dimensional latent representations are further transformed into a graph-structured format and classified using a graph neural network. Experimental evaluations on real-world heterogeneous IoT traffic datasets demonstrate that the proposed framework maintains robust detection performance under concept drift. These results highlight the framework's potential for practical deployment in dynamic and large-scale IoT environments.
149. 【2512.22483】Scalpel-SAM: A Semi-Supervised Paradigm for Adapting SAM to Infrared Small Object Detection
链接:https://arxiv.org/abs/2512.22483
作者:Zihan Liu,Xiangning Ren,Dezhang Kong,Yipeng Zhang,Meng Han
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Infrared small object, small object detection, object detection urgently, detection urgently requires, Infrared small
备注: 9 pages, 3 figures
点击查看摘要
Abstract:Infrared small object detection urgently requires semi-supervised paradigms due to the high cost of annotation. However, existing methods like SAM face significant challenges of domain gaps, inability of encoding physical priors, and inherent architectural complexity. To address this, we designed a Hierarchical MoE Adapter consisting of four white-box neural operators. Building upon this core component, we propose a two-stage paradigm for knowledge distillation and transfer: (1) Prior-Guided Knowledge Distillation, where we use our MoE adapter and 10% of available fully supervised data to distill SAM into an expert teacher (Scalpel-SAM); and (2) Deployment-Oriented Knowledge Transfer, where we use Scalpel-SAM to generate pseudo labels for training lightweight and efficient downstream models. Experiments demonstrate that with minimal annotations, our paradigm enables downstream models to achieve performance comparable to, or even surpassing, their fully supervised counterparts. To our knowledge, this is the first semi-supervised paradigm that systematically addresses the data scarcity issue in IR-SOT using SAM as the teacher model.
150. 【2512.22474】Event-based high temporal resolution measurement of shock wave motion field
链接:https://arxiv.org/abs/2512.22474
作者:Taihang Lei,Banglei Guan,Minzu Liang,Pengju Sun,Jing Tao,Yang Shang,Qifeng Yu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:shock wave motion, shock wave, Accurate measurement, shock, wave motion parameters
备注:
点击查看摘要
Abstract:Accurate measurement of shock wave motion parameters with high spatiotemporal resolution is essential for applications such as power field testing and damage assessment. However, significant challenges are posed by the fast, uneven propagation of shock waves and unstable testing conditions. To address these challenges, a novel framework is proposed that utilizes multiple event cameras to estimate the asymmetry of shock waves, leveraging its high-speed and high-dynamic range capabilities. Initially, a polar coordinate system is established, which encodes events to reveal shock wave propagation patterns, with adaptive region-of-interest (ROI) extraction through event offset calculations. Subsequently, shock wave front events are extracted using iterative slope analysis, exploiting the continuity of velocity changes. Finally, the geometric model of events and shock wave motion parameters is derived according to event-based optical imaging model, along with the 3D reconstruction model. Through the above process, multi-angle shock wave measurement, motion field reconstruction, and explosive equivalence inversion are achieved. The results of the speed measurement are compared with those of the pressure sensors and the empirical formula, revealing a maximum error of 5.20% and a minimum error of 0.06%. The experimental results demonstrate that our method achieves high-precision measurement of the shock wave motion field with both high spatial and temporal resolution, representing significant progress.
151. 【2512.22464】Pose-Guided Residual Refinement for Interpretable Text-to-Motion Generation and Editing
链接:https://arxiv.org/abs/2512.22464
作者:Sukhyun Jeong,Yong-Hoon Choi
类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
关键词:automatically synthesize diverse, synthesize diverse motions, existing motion sequence, extend user creativity, pose codes
备注:
点击查看摘要
Abstract:Text-based 3D motion generation aims to automatically synthesize diverse motions from natural-language descriptions to extend user creativity, whereas motion editing modifies an existing motion sequence in response to text while preserving its overall structure. Pose-code-based frameworks such as CoMo map quantifiable pose attributes into discrete pose codes that support interpretable motion control, but their frame-wise representation struggles to capture subtle temporal dynamics and high-frequency details, often degrading reconstruction fidelity and local controllability. To address this limitation, we introduce pose-guided residual refinement for motion (PGR$^2$M), a hybrid representation that augments interpretable pose codes with residual codes learned via residual vector quantization (RVQ). A pose-guided RVQ tokenizer decomposes motion into pose latents that encode coarse global structure and residual latents that model fine-grained temporal variations. Residual dropout further discourages over-reliance on residuals, preserving the semantic alignment and editability of the pose codes. On top of this tokenizer, a base Transformer autoregressively predicts pose codes from text, and a refine Transformer predicts residual codes conditioned on text, pose codes, and quantization stage. Experiments on HumanML3D and KIT-ML show that PGR$^2$M improves Fréchet inception distance and reconstruction metrics for both generation and editing compared with CoMo and recent diffusion- and tokenization-based baselines, while user studies confirm that it enables intuitive, structure-preserving motion edits.
152. 【2512.22454】Comparing Object Detection Models for Electrical Substation Component Mapping
链接:https://arxiv.org/abs/2512.22454
作者:Haley Mody,Namish Bansal,Dennies Kiprono Bor,Edward J. Oughton(George Mason University)
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:substation, Electrical, electrical grid, Abstract, Electrical substations
备注: 26 pages, 13 figures
点击查看摘要
Abstract:Electrical substations are a significant component of an electrical grid. Indeed, the assets at these substations (e.g., transformers) are prone to disruption from many hazards, including hurricanes, flooding, earthquakes, and geomagnetically induced currents (GICs). As electrical grids are considered critical national infrastructure, any failure can have significant economic and public safety implications. To help prevent and mitigate these failures, it is thus essential that we identify key substation components to quantify vulnerability. Unfortunately, traditional manual mapping of substation infrastructure is time-consuming and labor-intensive. Therefore, an autonomous solution utilizing computer vision models is preferable, as it allows for greater convenience and efficiency. In this research paper, we train and compare the outputs of 3 models (YOLOv8, YOLOv11, RF-DETR) on a manually labeled dataset of US substation images. Each model is evaluated for detection accuracy, precision, and efficiency. We present the key strengths and limitations of each model, identifying which provides reliable and large-scale substation component mapping. Additionally, we utilize these models to effectively map the various substation components in the United States, showcasing a use case for machine learning in substation mapping.
153. 【2512.22452】SAM 3D for 3D Object Reconstruction from Remote Sensing Images
链接:https://arxiv.org/abs/2512.22452
作者:Junsheng Yao,Lichao Mou,Qingyu Li
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:require task-specific architectures, remote sensing imagery, monocular remote sensing, remote sensing building, Frechet Inception Distance
备注:
点击查看摘要
Abstract:Monocular 3D building reconstruction from remote sensing imagery is essential for scalable urban modeling, yet existing methods often require task-specific architectures and intensive supervision. This paper presents the first systematic evaluation of SAM 3D, a general-purpose image-to-3D foundation model, for monocular remote sensing building reconstruction. We benchmark SAM 3D against TRELLIS on samples from the NYC Urban Dataset, employing Frechet Inception Distance (FID) and CLIP-based Maximum Mean Discrepancy (CMMD) as evaluation metrics. Experimental results demonstrate that SAM 3D produces more coherent roof geometry and sharper boundaries compared to TRELLIS. We further extend SAM 3D to urban scene reconstruction through a segment-reconstruct-compose pipeline, demonstrating its potential for urban scene modeling. We also analyze practical limitations and discuss future research directions. These findings provide practical guidance for deploying foundation models in urban 3D reconstruction and motivate future integration of scene-level structural priors.
154. 【2512.22449】SonoVision: A Computer Vision Approach for Helping Visually Challenged Individuals Locate Objects with the Help of Sound Cues
链接:https://arxiv.org/abs/2512.22449
作者:Md Abu Obaida Zishan,Annajiat Alim Rasel
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Locating objects, significant challenge, visually impaired, Locating, visually
备注:
点击查看摘要
Abstract:Locating objects for the visually impaired is a significant challenge and is something no one can get used to over time. However, this hinders their independence and could push them towards risky and dangerous scenarios. Hence, in the spirit of making the visually challenged more self-sufficient, we present SonoVision, a smart-phone application that helps them find everyday objects using sound cues through earphones/headphones. This simply means, if an object is on the right or left side of a user, the app makes a sinusoidal sound in a user's respective ear through ear/headphones. However, to indicate objects located directly in front, both the left and right earphones are rung simultaneously. These sound cues could easily help a visually impaired individual locate objects with the help of their smartphones and reduce the reliance on people in their surroundings, consequently making them more independent. This application is made with the flutter development platform and uses the Efficientdet-D2 model for object detection in the backend. We believe the app will significantly assist the visually impaired in a safe and user-friendly manner with its capacity to work completely offline. Our application can be accessed here this https URL.
155. 【2512.22447】owards Robust Optical-SAR Object Detection under Missing Modalities: A Dynamic Quality-Aware Fusion Framework
链接:https://arxiv.org/abs/2512.22447
作者:Zhicheng Zhao,Yuancheng Xu,Andong Lu,Chenglong Li,Jin Tang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Synthetic Aperture Radar, Optical and Synthetic, Aperture Radar, Synthetic Aperture, attracted significant research
备注:
点击查看摘要
Abstract:Optical and Synthetic Aperture Radar (SAR) fusion-based object detection has attracted significant research interest in remote sensing, as these modalities provide complementary information for all-weather monitoring. However, practical deployment is severely limited by inherent challenges. Due to distinct imaging mechanisms, temporal asynchrony, and registration difficulties, obtaining well-aligned optical-SAR image pairs remains extremely difficult, frequently resulting in missing or degraded modality data. Although recent approaches have attempted to address this issue, they still suffer from limited robustness to random missing modalities and lack effective mechanisms to ensure consistent performance improvement in fusion-based detection. To address these limitations, we propose a novel Quality-Aware Dynamic Fusion Network (QDFNet) for robust optical-SAR object detection. Our proposed method leverages learnable reference tokens to dynamically assess feature reliability and guide adaptive fusion in the presence of missing modalities. In particular, we design a Dynamic Modality Quality Assessment (DMQA) module that employs learnable reference tokens to iteratively refine feature reliability assessment, enabling precise identification of degraded regions and providing quality guidance for subsequent fusion. Moreover, we develop an Orthogonal Constraint Normalization Fusion (OCNF) module that employs orthogonal constraints to preserve modality independence while dynamically adjusting fusion weights based on reliability scores, effectively suppressing unreliable feature propagation. Extensive experiments on the SpaceNet6-OTD and OGSOD-2.0 datasets demonstrate the superiority and effectiveness of QDFNet compared to state-of-the-art methods, particularly under partial modality corruption or missing data scenarios.
156. 【2512.22441】LECalib: Line-Based Event Camera Calibration
链接:https://arxiv.org/abs/2512.22441
作者:Zibin Liu,Banglei Guana,Yang Shanga,Zhenbao Yu,Yifei Bian,Qifeng Yu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:event-based vision applications, vision applications, essential prerequisite, prerequisite for event-based, event-based vision
备注: 9 Pages, 6 figures
点击查看摘要
Abstract:Camera calibration is an essential prerequisite for event-based vision applications. Current event camera calibration methods typically involve using flashing patterns, reconstructing intensity images, and utilizing the features extracted from events. Existing methods are generally time-consuming and require manually placed calibration objects, which cannot meet the needs of rapidly changing scenarios. In this paper, we propose a line-based event camera calibration framework exploiting the geometric lines of commonly-encountered objects in man-made environments, e.g., doors, windows, boxes, etc. Different from previous methods, our method detects lines directly from event streams and leverages an event-line calibration model to generate the initial guess of camera parameters, which is suitable for both planar and non-planar lines. Then, a non-linear optimization is adopted to refine camera parameters. Both simulation and real-world experiments have demonstrated the feasibility and accuracy of our method, with validation performed on monocular and stereo event cameras. The source code is released at this https URL.
157. 【2512.22439】SuperiorGAT: Graph Attention Networks for Sparse LiDAR Point Cloud Reconstruction in Autonomous Systems
链接:https://arxiv.org/abs/2512.22439
作者:Khalfalla Awedat,Mohamed Abidalrekab,Gurcan Comert,Mustafa Ayad
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:LiDAR-based perception, environmental occlusions, perception in autonomous, autonomous systems, systems is constrained
备注:
点击查看摘要
Abstract:LiDAR-based perception in autonomous systems is constrained by fixed vertical beam resolution and further compromised by beam dropout resulting from environmental occlusions. This paper introduces SuperiorGAT, a graph attention-based framework designed to reconstruct missing elevation information in sparse LiDAR point clouds. By modeling LiDAR scans as beam-aware graphs and incorporating gated residual fusion with feed-forward refinement, SuperiorGAT enables accurate reconstruction without increasing network depth. To evaluate performance, structured beam dropout is simulated by removing every fourth vertical scanning beam. Extensive experiments across diverse KITTI environments, including Person, Road, Campus, and City sequences, demonstrate that SuperiorGAT consistently achieves lower reconstruction error and improved geometric consistency compared to PointNet-based models and deeper GAT baselines. Qualitative X-Z projections further confirm the model's ability to preserve structural integrity with minimal vertical distortion. These results suggest that architectural refinement offers a computationally efficient method for improving LiDAR resolution without requiring additional sensor hardware.
158. 【2512.22437】EmoCtrl: Controllable Emotional Image Content Generation
链接:https://arxiv.org/abs/2512.22437
作者:Jingyuan Yang,Weibin Luo,Hui Huang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:image conveys meaning, Controllable Emotional Image, Image Content Generation, shaping human perception, jointly shaping human
备注:
点击查看摘要
Abstract:An image conveys meaning through both its visual content and emotional tone, jointly shaping human perception. We introduce Controllable Emotional Image Content Generation (C-EICG), which aims to generate images that remain faithful to a given content description while expressing a target emotion. Existing text-to-image models ensure content consistency but lack emotional awareness, whereas emotion-driven models generate affective results at the cost of content distortion. To address this gap, we propose EmoCtrl, supported by a dataset annotated with content, emotion, and affective prompts, bridging abstract emotions to visual cues. EmoCtrl incorporates textual and visual emotion enhancement modules that enrich affective expression via descriptive semantics and perceptual cues. The learned emotion tokens exhibit complementary effects, as demonstrated through ablations and visualizations. Quantatitive and qualatitive experiments demonstrate that EmoCtrl achieves faithful content and expressive emotion control, outperforming existing methods across multiple aspects. User studies confirm EmoCtrl's strong alignment with human preference. Moreover, EmoCtrl generalizes well to creative applications, further demonstrating the robustness and adaptability of the learned emotion tokens.
159. 【2512.22425】FluenceFormer: Transformer-Driven Multi-Beam Fluence Map Regression for Radiotherapy Planning
链接:https://arxiv.org/abs/2512.22425
作者:Ujunwa Mgboh,Rafi Ibn Sultan,Joshua Kim,Kundan Thind,Dongxiao Zhu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:automated radiotherapy planning, ill-posed inverse problem, inverse problem due, Fluence map prediction, beam-intensity modulation
备注:
点击查看摘要
Abstract:Fluence map prediction is central to automated radiotherapy planning but remains an ill-posed inverse problem due to the complex relationship between volumetric anatomy and beam-intensity modulation. Convolutional methods in prior work often struggle to capture long-range dependencies, which can lead to structurally inconsistent or physically unrealizable plans. We introduce \textbf{FluenceFormer}, a backbone-agnostic transformer framework for direct, geometry-aware fluence regression. The model uses a unified two-stage design: Stage~1 predicts a global dose prior from anatomical inputs, and Stage~2 conditions this prior on explicit beam geometry to regress physically calibrated fluence maps. Central to the approach is the \textbf{Fluence-Aware Regression (FAR)} loss, a physics-informed objective that integrates voxel-level fidelity, gradient smoothness, structural consistency, and beam-wise energy conservation. We evaluate the generality of the framework across multiple transformer backbones, including Swin UNETR, UNETR, nnFormer, and MedFormer, using a prostate IMRT dataset. FluenceFormer with Swin UNETR achieves the strongest performance among the evaluated models and improves over existing benchmark CNN and single-stage methods, reducing Energy Error to $\mathbf{4.5\%}$ and yielding statistically significant gains in structural fidelity ($p 0.05$).
160. 【2512.22423】Bright 4B: Scaling Hyperspherical Learning for Segmentation in 3D Brightfield Microscopy
链接:https://arxiv.org/abs/2512.22423
作者:Amil Khan,Matheus Palhares Viana,Suraj Mishra,B.S. Manjunath
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:visualize cellular morphology, brightfield microscopy offers, robust volumetric segmentation, cellular morphology, Native Sparse Attention
备注: 20 pages, 15 figures
点击查看摘要
Abstract:Label-free 3D brightfield microscopy offers a fast and noninvasive way to visualize cellular morphology, yet robust volumetric segmentation still typically depends on fluorescence or heavy post-processing. We address this gap by introducing Bright-4B, a 4 billion parameter foundation model that learns on the unit hypersphere to segment subcellular structures directly from 3D brightfield volumes. Bright-4B combines a hardware-aligned Native Sparse Attention mechanism (capturing local, coarse, and selected global context), depth-width residual HyperConnections that stabilize representation flow, and a soft Mixture-of-Experts for adaptive capacity. A plug-and-play anisotropic patch embed further respects confocal point-spread and axial thinning, enabling geometry-faithful 3D tokenization. The resulting model produces morphology-accurate segmentations of nuclei, mitochondria, and other organelles from brightfield stacks alone--without fluorescence, auxiliary channels, or handcrafted post-processing. Across multiple confocal datasets, Bright-4B preserves fine structural detail across depth and cell types, outperforming contemporary CNN and Transformer baselines. All code, pretrained weights, and models for downstream finetuning will be released to advance large-scale, label-free 3D cell mapping.
161. 【2512.22406】DeFloMat: Detection with Flow Matching for Stable and Efficient Generative Object Localization
链接:https://arxiv.org/abs/2512.22406
作者:Hansang Lee,Chaelin Lee,Nieun Seo,Joon Seok Lim,Helen Hong
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Conditional Flow Matching, Flow Matching, critical latency bottleneck, Magnetic Resonance Enterography, integrating Conditional Flow
备注:
点击查看摘要
Abstract:We propose DeFloMat (Detection with Flow Matching), a novel generative object detection framework that addresses the critical latency bottleneck of diffusion-based detectors, such as DiffusionDet, by integrating Conditional Flow Matching (CFM). Diffusion models achieve high accuracy by formulating detection as a multi-step stochastic denoising process, but their reliance on numerous sampling steps ($T \gg 60$) makes them impractical for time-sensitive clinical applications like Crohn's Disease detection in Magnetic Resonance Enterography (MRE). DeFloMat replaces this slow stochastic path with a highly direct, deterministic flow field derived from Conditional Optimal Transport (OT) theory, specifically approximating the Rectified Flow. This shift enables fast inference via a simple Ordinary Differential Equation (ODE) solver. We demonstrate the superiority of DeFloMat on a challenging MRE clinical dataset. Crucially, DeFloMat achieves state-of-the-art accuracy ($43.32\% \text{ } AP_{10:50}$) in only $3$ inference steps, which represents a $1.4\times$ performance improvement over DiffusionDet's maximum converged performance ($31.03\% \text{ } AP_{10:50}$ at $4$ steps). Furthermore, our deterministic flow significantly enhances localization characteristics, yielding superior Recall and stability in the few-step regime. DeFloMat resolves the trade-off between generative accuracy and clinical efficiency, setting a new standard for stable and rapid object localization.
162. 【2512.22392】OSPointMapper: RealTime Pedestrian and Accessibility Mapping with Mobile AI
链接:https://arxiv.org/abs/2512.22392
作者:Himanshu Naidu,Yuxiang Zhang,Sachin Mehta,Anat Caspi
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:inclusive pedestrian infrastructure, difficult to scale, essential for building, building accessible, accessible and inclusive
备注:
点击查看摘要
Abstract:Accurate, up-to-date sidewalk data is essential for building accessible and inclusive pedestrian infrastructure, yet current approaches to data collection are often costly, fragmented, and difficult to scale. We introduce iOSPointMapper, a mobile application that enables real-time, privacy-conscious sidewalk mapping on the ground, using recent-generation iPhones and iPads. The system leverages on-device semantic segmentation, LiDAR-based depth estimation, and fused GPS/IMU data to detect and localize sidewalk-relevant features such as traffic signs, traffic lights and poles. To ensure transparency and improve data quality, iOSPointMapper incorporates a user-guided annotation interface for validating system outputs before submission. Collected data is anonymized and transmitted to the Transportation Data Exchange Initiative (TDEI), where it integrates seamlessly with broader multimodal transportation datasets. Detailed evaluations of the system's feature detection and spatial mapping performance reveal the application's potential for enhanced pedestrian mapping. Together, these capabilities offer a scalable and user-centered approach to closing critical data gaps in pedestrian
163. 【2512.22385】LLM-Guided Exemplar Selection for Few-Shot Wearable-Sensor Human Activity Recognition
链接:https://arxiv.org/abs/2512.22385
作者:Elsen Ronando,Sozo Inoue
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Human Activity Recognition, Human Activity, Activity Recognition, distinguish similar weara-ble, LLM-Guided Exemplar Selection
备注: This paper has been accepted for presentation at ABC 2026. The manuscript is under revision prior to camera-ready submission
点击查看摘要
Abstract:In this paper, we propose an LLM-Guided Exemplar Selection framework to address a key limitation in state-of-the-art Human Activity Recognition (HAR) methods: their reliance on large labeled datasets and purely geometric exemplar selection, which often fail to distinguish similar weara-ble sensor activities such as walking, walking upstairs, and walking downstairs. Our method incorporates semantic reasoning via an LLM-generated knowledge prior that captures feature importance, inter-class confusability, and exemplar budget multipliers, and uses it to guide exemplar scoring and selection. These priors are combined with margin-based validation cues, PageRank centrality, hubness penalization, and facility-location optimization to obtain a compact and informative set of exemplars. Evaluated on the UCI-HAR dataset under strict few-shot conditions, the framework achieves a macro F1-score of 88.78%, outperforming classical approaches such as random sampling, herding, and $k$-center. The results show that LLM-derived semantic priors, when integrated with structural and geometric cues, provide a stronger foundation for selecting representative sensor exemplars in few-shot wearable-sensor HAR.
164. 【2512.22374】Self-Evaluation Unlocks Any-Step Text-to-Image Generation
链接:https://arxiv.org/abs/2512.22374
作者:Xin Yu,Xiaojuan Qi,Zhengqi Li,Kai Zhang,Richard Zhang,Zhe Lin,Eli Shechtman,Tianyu Wang,Yotam Nitzan
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:introduce the Self-Evaluating, Flow Matching, Flow Matching model, supports any-step inference, Flow
备注: Project page: [this https URL](https://xinyu-andy.github.io/SelfE-project/)
点击查看摘要
Abstract:We introduce the Self-Evaluating Model (Self-E), a novel, from-scratch training approach for text-to-image generation that supports any-step inference. Self-E learns from data similarly to a Flow Matching model, while simultaneously employing a novel self-evaluation mechanism: it evaluates its own generated samples using its current score estimates, effectively serving as a dynamic self-teacher. Unlike traditional diffusion or flow models, it does not rely solely on local supervision, which typically necessitates many inference steps. Unlike distillation-based approaches, it does not require a pretrained teacher. This combination of instantaneous local learning and self-driven global matching bridges the gap between the two paradigms, enabling the training of a high-quality text-to-image model from scratch that excels even at very low step counts. Extensive experiments on large-scale text-to-image benchmarks show that Self-E not only excels in few-step generation, but is also competitive with state-of-the-art Flow Matching models at 50 steps. We further find that its performance improves monotonically as inference steps increase, enabling both ultra-fast few-step generation and high-quality long-trajectory sampling within a single unified model. To our knowledge, Self-E is the first from-scratch, any-step text-to-image model, offering a unified framework for efficient and scalable generation.
165. 【2512.22351】VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement
链接:https://arxiv.org/abs/2512.22351
作者:Zhengfei Kuang,Rui Lin,Long Zhao,Gordon Wetzstein,Saining Xie,Sanghyun Woo
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Multimodal Large Language, Language Models, Multimodal Large, Large Language
备注:
点击查看摘要
Abstract:Despite the remarkable progress of Multimodal Large Language Models (MLLMs) in 2D vision-language tasks, their application to complex 3D scene manipulation remains underexplored. In this paper, we bridge this critical gap by tackling three key challenges in 3D object arrangement task using MLLMs. First, to address the weak visual grounding of MLLMs, which struggle to link programmatic edits with precise 3D outcomes, we introduce an MCP-based API. This shifts the interaction from brittle raw code manipulation to more robust, function-level updates. Second, we augment the MLLM's 3D scene understanding with a suite of specialized visual tools to analyze scene state, gather spatial information, and validate action outcomes. This perceptual feedback loop is critical for closing the gap between language-based updates and precise 3D-aware manipulation. Third, to manage the iterative, error-prone updates, we propose a collaborative multi-agent framework with designated roles for planning, execution, and verification. This decomposition allows the system to robustly handle multi-step instructions and recover from intermediate errors. We demonstrate the effectiveness of our approach on a diverse set of 25 complex object arrangement tasks, where it significantly outperforms existing baselines. Website: this http URL
166. 【2512.22349】Human-like visual computing advances explainability and few-shot learning in deep neural networks for complex physiological data
链接:https://arxiv.org/abs/2512.22349
作者:Alaa Alahmadi,Mohamed Hasan
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
关键词:typically require large, offer limited insight, require large training, large training datasets, deep neural networks
备注:
点击查看摘要
Abstract:Machine vision models, particularly deep neural networks, are increasingly applied to physiological signal interpretation, including electrocardiography (ECG), yet they typically require large training datasets and offer limited insight into the causal features underlying their predictions. This lack of data efficiency and interpretability constrains their clinical reliability and alignment with human reasoning. Here, we show that a perception-informed pseudo-colouring technique, previously demonstrated to enhance human ECG interpretation, can improve both explainability and few-shot learning in deep neural networks analysing complex physiological data. We focus on acquired, drug-induced long QT syndrome (LQTS) as a challenging case study characterised by heterogeneous signal morphology, variable heart rate, and scarce positive cases associated with life-threatening arrhythmias such as torsades de pointes. This setting provides a stringent test of model generalisation under extreme data scarcity. By encoding clinically salient temporal features, such as QT-interval duration, into structured colour representations, models learn discriminative and interpretable features from as few as one or five training examples. Using prototypical networks and a ResNet-18 architecture, we evaluate one-shot and few-shot learning on ECG images derived from single cardiac cycles and full 10-second rhythms. Explainability analyses show that pseudo-colouring guides attention toward clinically meaningful ECG features while suppressing irrelevant signal components. Aggregating multiple cardiac cycles further improves performance, mirroring human perceptual averaging across heartbeats. Together, these findings demonstrate that human-like perceptual encoding can bridge data efficiency, explainability, and causal reasoning in medical machine intelligence.
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Cite as:
arXiv:2512.22349 [cs.CV]
(or
arXiv:2512.22349v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2512.22349
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
167. 【2512.22335】Feature Learning with Multi-Stage Vision Transformers on Inter-Modality HER2 Status Scoring and Tumor Classification on Whole Slides
链接:https://arxiv.org/abs/2512.22335
作者:Olaide N. Oyelade,Oliver Hoxey,Yulia Humrye
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:hematoxylin and eosin, scoring, IHC, WSIs, histopathology images
备注:
点击查看摘要
Abstract:The popular use of histopathology images, such as hematoxylin and eosin (HE), has proven to be useful in detecting tumors. However, moving such cancer cases forward for treatment requires accurate on the amount of the human epidermal growth factor receptor 2 (HER2) protein expression. Predicting both the lower and higher levels of HER2 can be challenging. Moreover, jointly analyzing HE and immunohistochemistry (IHC) stained images for HER2 scoring is difficult. Although several deep learning methods have been investigated to address the challenge of HER2 scoring, they suffer from providing a pixel-level localization of HER2 status. In this study, we propose a single end-to-end pipeline using a system of vision transformers with HER2 status scoring on whole slide images of WSIs. The method includes patch-wise processing of HE WSIs for tumor localization. A novel mapping function is proposed to correspondingly identify correlated IHC WSIs regions with malignant regions on HE. A clinically inspired HER2 scoring mechanism is embedded in the pipeline and allows for automatic pixel-level annotation of 4-way HER2 scoring (0, 1+, 2+, and 3+). Also, the proposed method accurately returns HER2-negative and HER2-positive. Privately curated datasets were collaboratively extracted from 13 different cases of WSIs of HE and IHC. A thorough experiment is conducted on the proposed method. Results obtained showed a good classification accuracy during tumor localization. Also, a classification accuracy of 0.94 and a specificity of 0.933 were returned for the prediction of HER2 status, scoring in the 4-way methods. The applicability of the proposed pipeline was investigated using WSIs patches as comparable to human pathologists. Findings from the study showed the usability of jointly evaluated HE and IHC images on end-to-end ViTs-based models for HER2 scoring
168. 【2512.22331】he Multi-View Paradigm Shift in MRI Radiomics: Predicting MGMT Methylation in Glioblastoma
链接:https://arxiv.org/abs/2512.22331
作者:Mariya Miteva,Maria Nisheva-Pavlova
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:molecular tumor characteristics, carries important prognostic, Non-invasive inference, methylation carries important, goal of radiogenomics
备注: 14 pages, 3 figures
点击查看摘要
Abstract:Non-invasive inference of molecular tumor characteristics from medical imaging is a central goal of radiogenomics, particularly in glioblastoma (GBM), where O6-methylguanine-DNA methyltransferase (MGMT) promoter methylation carries important prognostic and therapeutic significance. Although radiomics-based machine learning methods have shown promise for this task, conventional unimodal and early-fusion approaches are often limited by high feature redundancy and an incomplete modeling of modality-specific information. In this work, we introduce a multi-view latent representation learning framework based on variational autoencoders (VAE) to integrate complementary radiomic features derived from post-contrast T1-weighted (T1Gd) and Fluid-Attenuated Inversion Recovery (FLAIR) magnetic resonance imaging (MRI). By encoding each modality through an independent probabilistic encoder and performing fusion in a compact latent space, the proposed approach preserves modality-specific structure while enabling effective multimodal integration. The resulting latent embeddings are subsequently used for MGMT promoter methylation classification.
169. 【2512.22324】DeMoGen: Towards Decompositional Human Motion Generation with Energy-Based Diffusion Models
链接:https://arxiv.org/abs/2512.22324
作者:Jianrong Zhang,Hehe Fan,Yi Yang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Human motions, motion, combinations of simpler, Human, complex behaviors
备注: Project page: [this https URL](https://jiro-zhang.github.io/DeMoGen/)
点击查看摘要
Abstract:Human motions are compositional: complex behaviors can be described as combinations of simpler primitives. However, existing approaches primarily focus on forward modeling, e.g., learning holistic mappings from text to motion or composing a complex motion from a set of motion concepts. In this paper, we consider the inverse perspective: decomposing a holistic motion into semantically meaningful sub-components. We propose DeMoGen, a compositional training paradigm for decompositional learning that employs an energy-based diffusion model. This energy formulation directly captures the composed distribution of multiple motion concepts, enabling the model to discover them without relying on ground-truth motions for individual concepts. Within this paradigm, we introduce three training variants to encourage a decompositional understanding of motion: 1. DeMoGen-Exp explicitly trains on decomposed text prompts; 2. DeMoGen-OSS performs orthogonal self-supervised decomposition; 3. DeMoGen-SC enforces semantic consistency between original and decomposed text embeddings. These variants enable our approach to disentangle reusable motion primitives from complex motion sequences. We also demonstrate that the decomposed motion concepts can be flexibly recombined to generate diverse and novel motions, generalizing beyond the training distribution. Additionally, we construct a text-decomposed dataset to support compositional training, serving as an extended resource to facilitate text-to-motion generation and motion composition.
170. 【2512.22323】SpotEdit: Selective Region Editing in Diffusion Transformers
链接:https://arxiv.org/abs/2512.22323
作者:Zhibin Qin,Zhenxiong Tan,Zeqing Wang,Songhua Liu,Xinchao Wang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Diffusion Transformer models, transformer layers, Transformer models, significantly advanced image, Diffusion Transformer
备注: 14 pages, 8 figures
点击查看摘要
Abstract:Diffusion Transformer models have significantly advanced image editing by encoding conditional images and integrating them into transformer layers. However, most edits involve modifying only small regions, while current methods uniformly process and denoise all tokens at every timestep, causing redundant computation and potentially degrading unchanged areas. This raises a fundamental question: Is it truly necessary to regenerate every region during editing? To address this, we propose SpotEdit, a training-free diffusion editing framework that selectively updates only the modified regions. SpotEdit comprises two key components: SpotSelector identifies stable regions via perceptual similarity and skips their computation by reusing conditional image features; SpotFusion adaptively blends these features with edited tokens through a dynamic fusion mechanism, preserving contextual coherence and editing quality. By reducing unnecessary computation and maintaining high fidelity in unmodified areas, SpotEdit achieves efficient and precise image editing.
171. 【2512.22322】SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents
链接:https://arxiv.org/abs/2512.22322
作者:Shaofei Cai,Yulei Qin,Haojia Lin,Zihan Xu,Gang Li,Yuchen Shi,Zongyi Li,Yong Mao,Siqi Cai,Xiaoyu Tan,Yitao Liang,Ke Li,Xing Sun
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
关键词:Agentic reinforcement learning, complex GUI tasks, holds great promise, scalability remains severely, remains severely hampered
备注:
点击查看摘要
Abstract:Agentic reinforcement learning (RL) holds great promise for the development of autonomous agents under complex GUI tasks, but its scalability remains severely hampered by the verification of task completion. Existing task verification is treated as a passive, post-hoc process: a verifier (i.e., rule-based scoring script, reward or critic model, and LLM-as-a-Judge) analyzes the agent's entire interaction trajectory to determine if the agent succeeds. Such processing of verbose context that contains irrelevant, noisy history poses challenges to the verification protocols and therefore leads to prohibitive cost and low reliability. To overcome this bottleneck, we propose SmartSnap, a paradigm shift from this passive, post-hoc verification to proactive, in-situ self-verification by the agent itself. We introduce the Self-Verifying Agent, a new type of agent designed with dual missions: to not only complete a task but also to prove its accomplishment with curated snapshot evidences. Guided by our proposed 3C Principles (Completeness, Conciseness, and Creativity), the agent leverages its accessibility to the online environment to perform self-verification on a minimal, decisive set of snapshots. Such evidences are provided as the sole materials for a general LLM-as-a-Judge verifier to determine their validity and relevance. Experiments on mobile tasks across model families and scales demonstrate that our SmartSnap paradigm allows training LLM-driven agents in a scalable manner, bringing performance gains up to 26.08% and 16.66% respectively to 8B and 30B models. The synergizing between solution finding and evidence seeking facilitates the cultivation of efficient, self-verifying agents with competitive performance against DeepSeek V3.1 and Qwen3-235B-A22B.
172. 【2512.22317】LangPrecip: Language-Aware Multimodal Precipitation Nowcasting
链接:https://arxiv.org/abs/2512.22317
作者:Xudong Ling,Tianxi Huang,Qian Dong,Tao He,Chaorong Li,Guiduo Duan
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:extreme weather events, under-constrained spatiotemporal forecasting, Short-term precipitation nowcasting, spatiotemporal forecasting problem, Short-term precipitation
备注:
点击查看摘要
Abstract:Short-term precipitation nowcasting is an inherently uncertain and under-constrained spatiotemporal forecasting problem, especially for rapidly evolving and extreme weather events. Existing generative approaches rely primarily on visual conditioning, leaving future motion weakly constrained and ambiguous. We propose a language-aware multimodal nowcasting framework(LangPrecip) that treats meteorological text as a semantic motion constraint on precipitation evolution. By formulating nowcasting as a semantically constrained trajectory generation problem under the Rectified Flow paradigm, our method enables efficient and physically consistent integration of textual and radar information in latent this http URL further introduce LangPrecip-160k, a large-scale multimodal dataset with 160k paired radar sequences and motion descriptions. Experiments on Swedish and MRMS datasets show consistent improvements over state-of-the-art methods, achieving over 60 \% and 19\% gains in heavy-rainfall CSI at an 80-minute lead time.
173. 【2512.22315】VideoZoomer: Reinforcement-Learned Temporal Focusing for Long Video Reasoning
链接:https://arxiv.org/abs/2512.22315
作者:Yang Ding,Yizhen Zhang,Xin Lai,Ruihang Chu,Yujiu Yang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Multimodal Large Language, Large Language Models, limited context window, Multimodal Large, achieved remarkable progress
备注:
点击查看摘要
Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision-language tasks yet remain limited in long video understanding due to the limited context window. Consequently, prevailing approaches tend to rely on uniform frame sampling or static pre-selection, which might overlook critical evidence and unable to correct its initial selection error during its reasoning process. To overcome these limitations, we propose VideoZoomer, a novel agentic framework that enables MLLMs to dynamically control their visual focus during reasoning. Starting from a coarse low-frame-rate overview, VideoZoomer invokes a temporal zoom tool to obtain high-frame-rate clips at autonomously chosen moments, thereby progressively gathering fine-grained evidence in a multi-turn interactive manner. Accordingly, we adopt a two-stage training strategy: a cold-start supervised fine-tuning phase on a curated dataset of distilled exemplar and reflection trajectories, followed by reinforcement learning to further refine the agentic policy. Extensive experiments demonstrate that our 7B model delivers diverse and complex reasoning patterns, yielding strong performance across a broad set of long video understanding and reasoning benchmarks. These emergent capabilities allow it to consistently surpass existing open-source models and even rival proprietary systems on challenging tasks, while achieving superior efficiency under reduced frame budgets.
174. 【2512.22310】MoFu: Scale-Aware Modulation and Fourier Fusion for Multi-Subject Video Generation
链接:https://arxiv.org/abs/2512.22310
作者:Run Ling,Ke Cao,Jian Lu,Ao Ma,Haowei Liu,Runze He,Changwei Wang,Rongtao Xu,Yihua Shao,Zhanjie Zhang,Peng Wu,Guibing Guo,Wei Feng,Zheng Zhang,Jingjing Lv,Junjie Shen,Ching Law,Xingwei Wang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Multi-subject video generation, Multi-subject video, video generation aims, multiple reference images, synthesize videos
备注: AAAI 2026
点击查看摘要
Abstract:Multi-subject video generation aims to synthesize videos from textual prompts and multiple reference images, ensuring that each subject preserves natural scale and visual fidelity. However, current methods face two challenges: scale inconsistency, where variations in subject size lead to unnatural generation, and permutation sensitivity, where the order of reference inputs causes subject distortion. In this paper, we propose MoFu, a unified framework that tackles both challenges. For scale inconsistency, we introduce Scale-Aware Modulation (SMO), an LLM-guided module that extracts implicit scale cues from the prompt and modulates features to ensure consistent subject sizes. To address permutation sensitivity, we present a simple yet effective Fourier Fusion strategy that processes the frequency information of reference features via the Fast Fourier Transform to produce a unified representation. Besides, we design a Scale-Permutation Stability Loss to jointly encourage scale-consistent and permutation-invariant generation. To further evaluate these challenges, we establish a dedicated benchmark with controlled variations in subject scale and reference permutation. Extensive experiments demonstrate that MoFu significantly outperforms existing methods in preserving natural scale, subject fidelity, and overall visual quality.
175. 【2512.22304】PortionNet: Distilling 3D Geometric Knowledge for Food Nutrition Estimation
链接:https://arxiv.org/abs/2512.22304
作者:Darrin Bright,Rakshith Raj,Kanchan Keisham
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Accurate food nutrition, Accurate food, food nutrition estimation, food nutrition, challenging due
备注: Accepted at the 11th Annual Conference on Vision and Intelligent Systems (CVIS 2025)
点击查看摘要
Abstract:Accurate food nutrition estimation from single images is challenging due to the loss of 3D information. While depth-based methods provide reliable geometry, they remain inaccessible on most smartphones because of depth-sensor requirements. To overcome this challenge, we propose PortionNet, a novel cross-modal knowledge distillation framework that learns geometric features from point clouds during training while requiring only RGB images at inference. Our approach employs a dual-mode training strategy where a lightweight adapter network mimics point cloud representations, enabling pseudo-3D reasoning without any specialized hardware requirements. PortionNet achieves state-of-the-art performance on MetaFood3D, outperforming all previous methods in both volume and energy estimation. Cross-dataset evaluation on SimpleFood45 further demonstrates strong generalization in energy estimation.
176. 【2512.22303】Attack-Aware Deepfake Detection under Counter-Forensic Manipulations
链接:https://arxiv.org/abs/2512.22303
作者:Noor Fatima,Hasan Faraz Khan,Muzammil Behzad
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:realistic deployment conditions, image-forensics detector designed, Feature Pyramid Network, shallow Feature Pyramid, Pyramid Network style
备注:
点击查看摘要
Abstract:This work presents an attack-aware deepfake and image-forensics detector designed for robustness, well-calibrated probabilities, and transparent evidence under realistic deployment conditions. The method combines red-team training with randomized test-time defense in a two-stream architecture, where one stream encodes semantic content using a pretrained backbone and the other extracts forensic residuals, fused via a lightweight residual adapter for classification, while a shallow Feature Pyramid Network style head produces tamper heatmaps under weak supervision. Red-team training applies worst-of-K counter-forensics per batch, including JPEG realign and recompress, resampling warps, denoise-to-regrain operations, seam smoothing, small color and gamma shifts, and social-app transcodes, while test-time defense injects low-cost jitters such as resize and crop phase changes, mild gamma variation, and JPEG phase shifts with aggregated predictions. Heatmaps are guided to concentrate within face regions using face-box masks without strict pixel-level annotations. Evaluation on existing benchmarks, including standard deepfake datasets and a surveillance-style split with low light and heavy compression, reports clean and attacked performance, AUC, worst-case accuracy, reliability, abstention quality, and weak-localization scores. Results demonstrate near-perfect ranking across attacks, low calibration error, minimal abstention risk, and controlled degradation under regrain, establishing a modular, data-efficient, and practically deployable baseline for attack-aware detection with calibrated probabilities and actionable heatmaps.
177. 【2512.22298】Real-Time In-Cabin Driver Behavior Recognition on Low-Cost Edge Hardware
链接:https://arxiv.org/abs/2512.22298
作者:Vesal Ahsani,Babak Hossein Khalaj
类目:Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
关键词:Coral Edge TPU, Google Coral Edge, In-cabin Driver Monitoring, in-cabin driver behavior, Driver Monitoring Systems
备注: 14 pages, 2 figures, 4 tables
点击查看摘要
Abstract:In-cabin Driver Monitoring Systems (DMS) must recognize distraction- and drowsiness-related behaviors with low latency under strict constraints on compute, power, and cost. We present a single-camera in-cabin driver behavior recognition system designed for deployment on two low-cost edge platforms: Raspberry Pi 5 (CPU-only) and Google Coral Edge TPU. The proposed pipeline combines (i) a compact per-frame vision model, (ii) a confounder-aware label design to reduce visually similar false positives, and (iii) a temporal decision head that triggers alerts only when predictions are both confident and sustained. The system covers 17 behavior classes, including multiple phone-use modes, eating/drinking, smoking, reaching behind, gaze/attention shifts, passenger interaction, grooming, control-panel interaction, yawning, and eyes-closed sleep. Training and evaluation use licensed datasets spanning diverse drivers, vehicles, and lighting conditions (details in Section 6), and we further validate runtime behavior in real in-vehicle tests. The optimized deployments achieve about 16 FPS on Raspberry Pi 5 with INT8 inference (per-frame latency under 60 ms) and about 25 FPS on Coral Edge TPU, enabling real-time monitoring and stable alert generation on inexpensive hardware. Finally, we discuss how reliable in-cabin human-state perception can serve as an upstream input for human-centered vehicle intelligence, including emerging agentic vehicle concepts.
178. 【2512.22295】Learning Dynamic Scene Reconstruction with Sinusoidal Geometric Priors
链接:https://arxiv.org/abs/2512.22295
作者:Tian Guo,Hui Yuan,Philip Xu,David Elizondo
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:periodic activation properties, sinusoidal representation networks, geometric priors derived, loss function, function that combines
备注:
点击查看摘要
Abstract:We propose SirenPose, a novel loss function that combines the periodic activation properties of sinusoidal representation networks with geometric priors derived from keypoint structures to improve the accuracy of dynamic 3D scene reconstruction. Existing approaches often struggle to maintain motion modeling accuracy and spatiotemporal consistency in fast moving and multi target scenes. By introducing physics inspired constraint mechanisms, SirenPose enforces coherent keypoint predictions across both spatial and temporal dimensions. We further expand the training dataset to 600,000 annotated instances to support robust learning. Experimental results demonstrate that models trained with SirenPose achieve significant improvements in spatiotemporal consistency metrics compared to prior methods, showing superior performance in handling rapid motion and complex scene changes.
179. 【2512.22294】A Three-Level Alignment Framework for Large-Scale 3D Retrieval and Controlled 4D Generation
链接:https://arxiv.org/abs/2512.22294
作者:Philip Xu,David Elizondo,Raouf Hamzaoui
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:scale open vocabulary, large scale open, open vocabulary, large scale, scale open
备注:
点击查看摘要
Abstract:We introduce Uni4D, a unified framework for large scale open vocabulary 3D retrieval and controlled 4D generation based on structured three level alignment across text, 3D models, and image modalities. Built upon the Align3D 130 dataset, Uni4D employs a 3D text multi head attention and search model to optimize text to 3D retrieval through improved semantic alignment. The framework further strengthens cross modal alignment through three components: precise text to 3D retrieval, multi view 3D to image alignment, and image to text alignment for generating temporally consistent 4D assets. Experimental results demonstrate that Uni4D achieves high quality 3D retrieval and controllable 4D generation, advancing dynamic multimodal understanding and practical applications.
180. 【2512.22288】Co-GRPO: Co-Optimized Group Relative Policy Optimization for Masked Diffusion Model
链接:https://arxiv.org/abs/2512.22288
作者:Renping Zhou,Zanlin Ni,Tianyi Chen,Zeyu Liu,Yang Yue,Yulin Wang,Yuxuan Wang,Jingshu Liu,Gao Huang
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Masked Diffusion Models, Masked Diffusion, shown promising potential, Diffusion Models, potential across vision
备注: 17 pages, 6 figures
点击查看摘要
Abstract:Recently, Masked Diffusion Models (MDMs) have shown promising potential across vision, language, and cross-modal generation. However, a notable discrepancy exists between their training and inference procedures. In particular, MDM inference is a multi-step, iterative process governed not only by the model itself but also by various schedules that dictate the token-decoding trajectory (e.g., how many tokens to decode at each step). In contrast, MDMs are typically trained using a simplified, single-step BERT-style objective that masks a subset of tokens and predicts all of them simultaneously. This step-level simplification fundamentally disconnects the training paradigm from the trajectory-level nature of inference, leaving the inference schedules never optimized during training. In this paper, we introduce Co-GRPO, which reformulates MDM generation as a unified Markov Decision Process (MDP) that jointly incorporates both the model and the inference schedule. By applying Group Relative Policy Optimization at the trajectory level, Co-GRPO cooperatively optimizes model parameters and schedule parameters under a shared reward, without requiring costly backpropagation through the multi-step generation process. This holistic optimization aligns training with inference more thoroughly and substantially improves generation quality. Empirical results across four benchmarks-ImageReward, HPS, GenEval, and DPG-Bench-demonstrate the effectiveness of our approach. For more details, please refer to our project page: this https URL .
181. 【2512.22278】FETAL-GAUGE: A Benchmark for Assessing Vision-Language Models in Fetal Ultrasound
链接:https://arxiv.org/abs/2512.22278
作者:Hussain Alasmawi,Numan Saeed,Mohammad Yaqub
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:fetal health monitoring, essential fetal health, trained sonographers, creating barriers, health monitoring
备注:
点击查看摘要
Abstract:The growing demand for prenatal ultrasound imaging has intensified a global shortage of trained sonographers, creating barriers to essential fetal health monitoring. Deep learning has the potential to enhance sonographers' efficiency and support the training of new practitioners. Vision-Language Models (VLMs) are particularly promising for ultrasound interpretation, as they can jointly process images and text to perform multiple clinical tasks within a single framework. However, despite the expansion of VLMs, no standardized benchmark exists to evaluate their performance in fetal ultrasound imaging. This gap is primarily due to the modality's challenging nature, operator dependency, and the limited public availability of datasets. To address this gap, we present Fetal-Gauge, the first and largest visual question answering benchmark specifically designed to evaluate VLMs across various fetal ultrasound tasks. Our benchmark comprises over 42,000 images and 93,000 question-answer pairs, spanning anatomical plane identification, visual grounding of anatomical structures, fetal orientation assessment, clinical view conformity, and clinical diagnosis. We systematically evaluate several state-of-the-art VLMs, including general-purpose and medical-specific models, and reveal a substantial performance gap: the best-performing model achieves only 55\% accuracy, far below clinical requirements. Our analysis identifies critical limitations of current VLMs in fetal ultrasound interpretation, highlighting the urgent need for domain-adapted architectures and specialized training approaches. Fetal-Gauge establishes a rigorous foundation for advancing multimodal deep learning in prenatal care and provides a pathway toward addressing global healthcare accessibility challenges. Our benchmark will be publicly available once the paper gets accepted.
182. 【2512.22275】he Illusion of Clinical Reasoning: A Benchmark Reveals the Pervasive Gap in Vision-Language Models for Clinical Competency
链接:https://arxiv.org/abs/2512.22275
作者:Dingyu Wang,Zimu Yuan,Jiajun Liu,Shanggui Liu,Nan Zhou,Tianxing Xu,Di Huang,Dong Jiang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:narrow examination success, public health necessitates, examination success, practice and public, public health
备注:
点击查看摘要
Abstract:Background: The rapid integration of foundation models into clinical practice and public health necessitates a rigorous evaluation of their true clinical reasoning capabilities beyond narrow examination success. Current benchmarks, typically based on medical licensing exams or curated vignettes, fail to capture the integrated, multimodal reasoning essential for real-world patient care. Methods: We developed the Bones and Joints (BJ) Benchmark, a comprehensive evaluation framework comprising 1,245 questions derived from real-world patient cases in orthopedics and sports medicine. This benchmark assesses models across 7 tasks that mirror the clinical reasoning pathway, including knowledge recall, text and image interpretation, diagnosis generation, treatment planning, and rationale provision. We evaluated eleven vision-language models (VLMs) and six large language models (LLMs), comparing their performance against expert-derived ground truth. Results: Our results demonstrate a pronounced performance gap between task types. While state-of-the-art models achieved high accuracy, exceeding 90%, on structured multiple-choice questions, their performance markedly declined on open-ended tasks requiring multimodal integration, with accuracy scarcely reaching 60%. VLMs demonstrated substantial limitations in interpreting medical images and frequently exhibited severe text-driven hallucinations, often ignoring contradictory visual evidence. Notably, models specifically fine-tuned for medical applications showed no consistent advantage over general-purpose counterparts. Conclusions: Current artificial intelligence models are not yet clinically competent for complex, multimodal reasoning. Their safe deployment should currently be limited to supportive, text-based roles. Future advancement in core clinical tasks awaits fundamental breakthroughs in multimodal integration and visual understanding.
183. 【2512.22274】GeCo: A Differentiable Geometric Consistency Metric for Video Generation
链接:https://arxiv.org/abs/2512.22274
作者:Leslie Gu,Junhwa Hur,Charles Herrmann,Fangneng Zhan,Todd Zickler,Deqing Sun,Hanspeter Pfister
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:jointly detecting geometric, detecting geometric deformation, static scenes, geometry-grounded metric, metric for jointly
备注:
点击查看摘要
Abstract:We introduce GeCo, a geometry-grounded metric for jointly detecting geometric deformation and occlusion-inconsistency artifacts in static scenes. By fusing residual motion and depth priors, GeCo produces interpretable, dense consistency maps that reveal these artifacts. We use GeCo to systematically benchmark recent video generation models, uncovering common failure modes, and further employ it as a training-free guidance loss to reduce deformation artifacts during video generation.
184. 【2512.22272】Human-Aligned Generative Perception: Bridging Psychophysics and Generative Models
链接:https://arxiv.org/abs/2512.22272
作者:Antara Titikhsha,Om Kulkarni,Dharun Muthaiah
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:highly detailed textures, generate highly detailed, strict geometric constraints, follow strict geometric, models generate highly
备注:
点击查看摘要
Abstract:Text-to-image diffusion models generate highly detailed textures, yet they often rely on surface appearance and fail to follow strict geometric constraints, particularly when those constraints conflict with the style implied by the text prompt. This reflects a broader semantic gap between human perception and current generative models. We investigate whether geometric understanding can be introduced without specialized training by using lightweight, off-the-shelf discriminators as external guidance signals. We propose a Human Perception Embedding (HPE) teacher trained on the THINGS triplet dataset, which captures human sensitivity to object shape. By injecting gradients from this teacher into the latent diffusion process, we show that geometry and style can be separated in a controllable manner. We evaluate this approach across three architectures: Stable Diffusion v1.5 with a U-Net backbone, the flow-matching model SiT-XL/2, and the diffusion transformer PixArt-{\Sigma}. Our experiments reveal that flow models tend to drift back toward their default trajectories without continuous guidance, and we demonstrate zero-shot transfer of complex three-dimensional shapes, such as an Eames chair, onto conflicting materials such as pink metal. This guided generation improves semantic alignment by about 80 percent compared to unguided baselines. Overall, our results show that small teacher models can reliably guide large generative systems, enabling stronger geometric control and broadening the creative range of text-to-image synthesis.
185. 【2512.22263】Evaluating an Adaptive Multispectral Turret System for Autonomous Tracking Across Variable Illumination Conditions
链接:https://arxiv.org/abs/2512.22263
作者:Aahan Sachdeva,Dhanvinkumar Ganeshkumar,James E. Gallagher,Tyler Treat,Edward J. Oughton
类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
关键词:emergency services sector, services sector, supporting missions, zones and reconnaissance, platforms are playing
备注:
点击查看摘要
Abstract:Autonomous robotic platforms are playing a growing role across the emergency services sector, supporting missions such as search and rescue operations in disaster zones and reconnaissance. However, traditional red-green-blue (RGB) detection pipelines struggle in low-light environments, and thermal-based systems lack color and texture information. To overcome these limitations, we present an adaptive framework that fuses RGB and long-wave infrared (LWIR) video streams at multiple fusion ratios and dynamically selects the optimal detection model for each illumination condition. We trained 33 You Only Look Once (YOLO) models on over 22,000 annotated images spanning three light levels: no-light (10 lux), dim-light (10-1000 lux), and full-light (1000 lux). To integrate both modalities, fusion was performed by blending aligned RGB and LWIR frames at eleven ratios, from full RGB (100/0) to full LWIR (0/100) in 10% increments. Evaluation showed that the best full-light model (80/20 RGB-LWIR) and dim-light model (90/10 fusion) achieved 92.8% and 92.0% mean confidence; both significantly outperformed the YOLOv5 nano (YOLOv5n) and YOLOv11 nano (YOLOv11n) baselines. Under no-light conditions, the top 40/60 fusion reached 71.0%, exceeding baselines though not statistically significant. Adaptive RGB-LWIR fusion improved detection confidence and reliability across all illumination conditions, enhancing autonomous robotic vision performance.
186. 【2512.22249】mporal Visual Semantics-Induced Human Motion Understanding with Large Language Models
链接:https://arxiv.org/abs/2512.22249
作者:Zheng Xing,Weibing Zhao
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:Unsupervised human motion, Unsupervised human, subspace clustering, subspace clustering techniques, subspace
备注:
点击查看摘要
Abstract:Unsupervised human motion segmentation (HMS) can be effectively achieved using subspace clustering techniques. However, traditional methods overlook the role of temporal semantic exploration in HMS. This paper explores the use of temporal vision semantics (TVS) derived from human motion sequences, leveraging the image-to-text capabilities of a large language model (LLM) to enhance subspace clustering performance. The core idea is to extract textual motion information from consecutive frames via LLM and incorporate this learned information into the subspace clustering framework. The primary challenge lies in learning TVS from human motion sequences using LLM and integrating this information into subspace clustering. To address this, we determine whether consecutive frames depict the same motion by querying the LLM and subsequently learn temporal neighboring information based on its response. We then develop a TVS-integrated subspace clustering approach, incorporating subspace embedding with a temporal regularizer that induces each frame to share similar subspace embeddings with its temporal neighbors. Additionally, segmentation is performed based on subspace embedding with a temporal constraint that induces the grouping of each frame with its temporal neighbors. We also introduce a feedback-enabled framework that continuously optimizes subspace embedding based on the segmentation output. Experimental results demonstrate that the proposed method outperforms existing state-of-the-art approaches on four benchmark human motion datasets.
187. 【2512.22242】Fairness Evaluation of Risk Estimation Models for Lung Cancer Screening
链接:https://arxiv.org/abs/2512.22242
作者:Shaurya Gaur,Michel Vitale,Alessa Hering,Johan Kwisthout,Colin Jacobs,Lena Philipp,Fennie van der Graaf
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Image and Video Processing (eess.IV)
关键词:Lung cancer, lung cancer screening, lung cancer risk, Sybil lung cancer, adults worldwide
备注: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) [this https URL](https://melba-journal.org/2025:025)
点击查看摘要
Abstract:Lung cancer is the leading cause of cancer-related mortality in adults worldwide. Screening high-risk individuals with annual low-dose CT (LDCT) can support earlier detection and reduce deaths, but widespread implementation may strain the already limited radiology workforce. AI models have shown potential in estimating lung cancer risk from LDCT scans. However, high-risk populations for lung cancer are diverse, and these models' performance across demographic groups remains an open question. In this study, we drew on the considerations on confounding factors and ethically significant biases outlined in the JustEFAB framework to evaluate potential performance disparities and fairness in two deep learning risk estimation models for lung cancer screening: the Sybil lung cancer risk model and the Venkadesh21 nodule risk estimator. We also examined disparities in the PanCan2b logistic regression model recommended in the British Thoracic Society nodule management guideline. Both deep learning models were trained on data from the US-based National Lung Screening Trial (NLST), and assessed on a held-out NLST validation set. We evaluated AUROC, sensitivity, and specificity across demographic subgroups, and explored potential confounding from clinical risk factors. We observed a statistically significant AUROC difference in Sybil's performance between women (0.88, 95% CI: 0.86, 0.90) and men (0.81, 95% CI: 0.78, 0.84, p .001). At 90% specificity, Venkadesh21 showed lower sensitivity for Black (0.39, 95% CI: 0.23, 0.59) than White participants (0.69, 95% CI: 0.65, 0.73). These differences were not explained by available clinical confounders and thus may be classified as unfair biases according to JustEFAB. Our findings highlight the importance of improving and monitoring model performance across underrepresented subgroups, and further research on algorithmic fairness, in lung cancer screening.
188. 【2512.22239】Multi-objective hybrid knowledge distillation for efficient deep learning in smart agriculture
链接:https://arxiv.org/abs/2512.22239
作者:Phi-Hung Hoang,Nam-Thuan Trinh,Van-Manh Tran,Thi-Thu-Hong Phan
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Deploying deep learning, resource-constrained edge devices, edge devices remains, Deploying deep, deep learning models
备注: 7 tables; 11 figures
点击查看摘要
Abstract:Deploying deep learning models on resource-constrained edge devices remains a major challenge in smart agriculture due to the trade-off between computational efficiency and recognition accuracy. To address this challenge, this study proposes a hybrid knowledge distillation framework for developing a lightweight yet high-performance convolutional neural network. The proposed approach designs a customized student model that combines inverted residual blocks with dense connectivity and trains it under the guidance of a ResNet18 teacher network using a multi-objective strategy that integrates hard-label supervision, feature-level distillation, response-level distillation, and self-distillation. Experiments are conducted on a rice seed variety identification dataset containing nine varieties and further extended to four plant leaf disease datasets, including rice, potato, coffee, and corn, to evaluate generalization capability. On the rice seed variety classification task, the distilled student model achieves an accuracy of 98.56%, which is only 0.09% lower than the teacher model (98.65%), while requiring only 0.68 GFLOPs and approximately 1.07 million parameters. This corresponds to a reduction of about 2.7 times in computational cost and more than 10 times in model size compared with the ResNet18 teacher model. In addition, compared with representative pretrained models, the proposed student reduces the number of parameters by more than 6 times relative to DenseNet121 and by over 80 times compared with the Vision Transformer (ViT) architecture, while maintaining comparable or superior classification accuracy. Consistent performance gains across multiple plant leaf disease datasets further demonstrate the robustness, efficiency, and strong deployment potential of the proposed framework for hardware-limited smart agriculture systems.
189. 【2512.22238】Masking Teacher and Reinforcing Student for Distilling Vision-Language Models
链接:https://arxiv.org/abs/2512.22238
作者:Byung-Kwan Lee,Yu-Chiang Frank Wang,Ryo Hachiuma
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:Large-scale vision-language models, remarkable multimodal understanding, recently achieved remarkable, achieved remarkable multimodal, Large-scale vision-language
备注:
点击查看摘要
Abstract:Large-scale vision-language models (VLMs) have recently achieved remarkable multimodal understanding, but their massive size makes them impractical for deployment on mobile or edge devices. This raises the need for compact yet capable VLMs that can efficiently learn from powerful large teachers. However, distilling knowledge from a large teacher to a small student remains challenging due to their large size gap: the student often fails to reproduce the teacher's complex, high-dimensional representations, leading to unstable learning and degraded performance. To address this, we propose Masters (Masking Teacher and Reinforcing Student), a mask-progressive reinforcement learning (RL) distillation framework. Masters first masks non-dominant weights of the teacher to reduce unnecessary complexity, then progressively restores the teacher by gradually increasing its capacity during training. This strategy allows the student to learn richer representations from the teacher in a smooth and stable manner. To further refine knowledge transfer, Masters integrates an offline RL stage with two complementary rewards: an accuracy reward that measures the correctness of the generated responses, and a distillation reward that quantifies the ease of transferring responses from teacher to student. Unlike online think-answer RL paradigms that are computationally expensive and generate lengthy responses, our offline RL leverages pre-generated responses from masked teachers. These provide rich yet efficient guidance, enabling students to achieve strong performance without requiring the think-answer process.
190. 【2512.22237】Meta-information Guided Cross-domain Synergistic Diffusion Model for Low-dose PET Reconstruction
链接:https://arxiv.org/abs/2512.22237
作者:Mengxiao Geng,Ran Hong,Xiaoling Xu,Bingxuan Li,Qiegen Liu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Low-dose PET imaging, Low-dose PET, reducing patient radiation, patient radiation exposure, reduced contrast
备注:
点击查看摘要
Abstract:Low-dose PET imaging is crucial for reducing patient radiation exposure but faces challenges like noise interference, reduced contrast, and difficulty in preserving physiological details. Existing methods often neglect both projection-domain physics knowledge and patient-specific meta-information, which are critical for functional-semantic correlation mining. In this study, we introduce a meta-information guided cross-domain synergistic diffusion model (MiG-DM) that integrates comprehensive cross-modal priors to generate high-quality PET images. Specifically, a meta-information encoding module transforms clinical parameters into semantic prompts by considering patient characteristics, dose-related information, and semi-quantitative parameters, enabling cross-modal alignment between textual meta-information and image reconstruction. Additionally, the cross-domain architecture combines projection-domain and image-domain processing. In the projection domain, a specialized sinogram adapter captures global physical structures through convolution operations equivalent to global image-domain filtering. Experiments on the UDPET public dataset and clinical datasets with varying dose levels demonstrate that MiG-DM outperforms state-of-the-art methods in enhancing PET image quality and preserving physiological details.
191. 【2512.22228】KAN-FPN-Stem:A KAN-Enhanced Feature Pyramid Stem for Boosting ViT-based Pose Estimation
链接:https://arxiv.org/abs/2512.22228
作者:HaoNan Tang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Vision Transformers, dense prediction tasks, demonstrated significant promise, pose estimation, promise in dense
备注:
点击查看摘要
Abstract:Vision Transformers (ViT) have demonstrated significant promise in dense prediction tasks such as pose estimation. However, their performance is frequently constrained by the overly simplistic front-end designs employed in models like ViTPose. This naive patchification mechanism struggles to effectively handle multi-scale variations and results in irreversible information loss during the initial feature extraction phase. To overcome this limitation, we introduce a novel KAN-enhanced FPN-Stem architecture. Through rigorous ablation studies, we first identified that the true bottleneck for performance improvement lies not in plug-and-play attention modules (e.g., CBAM), but in the post-fusion non-linear smoothing step within the FPN. Guided by this insight, our core innovation is to retain the classic "upsample-and-add" fusion stream of the FPN, but replace its terminal, standard linear 3x3 smoothing convolution with a powerful KAN-based convolutional layer. Leveraging its superior non-linear modeling capabilities, this KAN-based layer adaptively learns and rectifies the "artifacts" generated during the multi-scale fusion process. Extensive experiments on the COCO dataset demonstrate that our KAN-FPN-Stem achieves a significant performance boost of up to +2.0 AP over the lightweight ViTPose-S baseline. This work not only delivers a plug-and-play, high-performance module but, more importantly, reveals that: the performance bottleneck in ViT front-end often lies not in 'feature refinement' (Attention), but in the quality of 'feature fusion' (Fusion). Furthermore, it provides an effective path to address this bottleneck through the introduction of the KAN operator.
192. 【2512.22226】VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs
链接:https://arxiv.org/abs/2512.22226
作者:Naishan Zheng,Jie Huang,Qingpei Guo,Feng Zhao
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:large language models, remains challenging due, multimodal large language, temporally coherent representations, language models
备注: 11 pages, 4 figures
点击查看摘要
Abstract:Understanding long videos with multimodal large language models (MLLMs) remains challenging due to the heavy redundancy across frames and the need for temporally coherent representations. Existing static strategies, such as sparse sampling, frame compression, and clustering, are optimized for offline settings and often produce fragmented or over-compressed outputs when applied to continuous video streams. We present VideoScaffold, a dynamic representation framework designed for streaming video understanding. It adaptively adjusts event granularity according to video duration while preserving fine-grained visual semantics. VideoScaffold introduces two key components: Elastic-Scale Event Segmentation (EES), which performs prediction-guided segmentation to dynamically refine event boundaries, and Hierarchical Event Consolidation (HEC), which progressively aggregates semantically related segments into multi-level abstractions. Working in concert, EES and HEC enable VideoScaffold to transition smoothly from fine-grained frame understanding to abstract event reasoning as the video stream unfolds. Extensive experiments across both offline and streaming video understanding benchmarks demonstrate that VideoScaffold achieves state-of-the-art performance. The framework is modular and plug-and-play, seamlessly extending existing image-based MLLMs to continuous video comprehension. The code is available at this https URL.
193. 【2512.22220】On Extending Semantic Abstraction for Efficient Search of Hidden Objects
链接:https://arxiv.org/abs/2512.22220
作者:Tasha Pais,Nikhilesh Belulkar
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
关键词:activations roughly correspond, VLMs' relevancy activations, relevancy activations roughly, Abstraction key observation, Semantic Abstraction key
备注:
点击查看摘要
Abstract:Semantic Abstraction's key observation is that 2D VLMs' relevancy activations roughly correspond to their confidence of whether and where an object is in the scene. Thus, relevancy maps are treated as "abstract object" representations. We use this framework for learning 3D localization and completion for the exclusive domain of hidden objects, defined as objects that cannot be directly identified by a VLM because they are at least partially occluded. This process of localizing hidden objects is a form of unstructured search that can be performed more efficiently using historical data of where an object is frequently placed. Our model can accurately identify the complete 3D location of a hidden object on the first try significantly faster than a naive random search. These extensions to semantic abstraction hope to provide household robots with the skills necessary to save time and effort when looking for lost objects.
194. 【2512.22218】owards Signboard-Oriented Visual Question Answering: ViSignVQA Dataset, Method and Benchmark
链接:https://arxiv.org/abs/2512.22218
作者:Hieu Minh Nguyen,Tam Le-Thanh Dang,Kiet Van Nguyen
类目:Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
关键词:Visual Question Answering, Question Answering, remains underexplored, Vietnamese, VQA
备注: Dataset paper; code and data will be released
点击查看摘要
Abstract:Understanding signboard text in natural scenes is essential for real-world applications of Visual Question Answering (VQA), yet remains underexplored, particularly in low-resource languages. We introduce ViSignVQA, the first large-scale Vietnamese dataset designed for signboard-oriented VQA, which comprises 10,762 images and 25,573 question-answer pairs. The dataset captures the diverse linguistic, cultural, and visual characteristics of Vietnamese signboards, including bilingual text, informal phrasing, and visual elements such as color and layout. To benchmark this task, we adapted state-of-the-art VQA models (e.g., BLIP-2, LaTr, PreSTU, and SaL) by integrating a Vietnamese OCR model (SwinTextSpotter) and a Vietnamese pretrained language model (ViT5). The experimental results highlight the significant role of the OCR-enhanced context, with F1-score improvements of up to 209% when the OCR text is appended to questions. Additionally, we propose a multi-agent VQA framework combining perception and reasoning agents with GPT-4, achieving 75.98% accuracy via majority voting. Our study presents the first large-scale multimodal dataset for Vietnamese signboard understanding. This underscores the importance of domain-specific resources in enhancing text-based VQA for low-resource languages. ViSignVQA serves as a benchmark capturing real-world scene text characteristics and supporting the development and evaluation of OCR-integrated VQA models in Vietnamese.
195. 【2512.22217】VLM-PAR: A Vision Language Model for Pedestrian Attribute Recognition
链接:https://arxiv.org/abs/2512.22217
作者:Abdellah Zakaria Sellam,Salah Eddine Bekhouche,Fadi Dornaika,Cosimo Distante,Abdenour Hadid
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Pedestrian Attribute Recognition, intricate attribute co-dependencies, predicting fine-grained attributes, involves predicting fine-grained, Attribute Recognition
备注:
点击查看摘要
Abstract:Pedestrian Attribute Recognition (PAR) involves predicting fine-grained attributes such as clothing color, gender, and accessories from pedestrian imagery, yet is hindered by severe class imbalance, intricate attribute co-dependencies, and domain shifts. We introduce VLM-PAR, a modular vision-language framework built on frozen SigLIP 2 multilingual encoders. By first aligning image and prompt embeddings via refining visual features through a compact cross-attention fusion, VLM-PAR achieves significant accuracy improvement on the highly imbalanced PA100K benchmark, setting a new state-of-the-art performance, while also delivering significant gains in mean accuracy across PETA and Market-1501 benchmarks. These results underscore the efficacy of integrating large-scale vision-language pretraining with targeted cross-modal refinement to overcome imbalance and generalization challenges in PAR.
196. 【2512.22214】Signal-SGN++: Topology-Enhanced Time-Frequency Spiking Graph Network for Skeleton-Based Action Recognition
链接:https://arxiv.org/abs/2512.22214
作者:Naichuan Zheng,Xiahai Lun,Weiyi Li,Yuchen Du
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Graph Convolutional Networks, demonstrate strong capability, dense floating-point computations, floating-point computations incur, computations incur high
备注:
点击查看摘要
Abstract:Graph Convolutional Networks (GCNs) demonstrate strong capability in modeling skeletal topology for action recognition, yet their dense floating-point computations incur high energy costs. Spiking Neural Networks (SNNs), characterized by event-driven and sparse activation, offer energy efficiency but remain limited in capturing coupled temporal-frequency and topological dependencies of human motion. To bridge this gap, this article proposes Signal-SGN++, a topology-aware spiking graph framework that integrates structural adaptivity with time-frequency spiking dynamics. The network employs a backbone composed of 1D Spiking Graph Convolution (1D-SGC) and Frequency Spiking Convolution (FSC) for joint spatiotemporal and spectral feature extraction. Within this backbone, a Topology-Shift Self-Attention (TSSA) mechanism is embedded to adaptively route attention across learned skeletal topologies, enhancing graph-level sensitivity without increasing computational complexity. Moreover, an auxiliary Multi-Scale Wavelet Transform Fusion (MWTF) branch decomposes spiking features into multi-resolution temporal-frequency representations, wherein a Topology-Aware Time-Frequency Fusion (TATF) unit incorporates structural priors to preserve topology-consistent spectral fusion. Comprehensive experiments on large-scale benchmarks validate that Signal-SGN++ achieves superior accuracy-efficiency trade-offs, outperforming existing SNN-based methods and achieving competitive results against state-of-the-art GCNs under substantially reduced energy consumption.
197. 【2512.22208】Open-Source Multimodal Moxin Models with Moxin-VLM and Moxin-VLA
链接:https://arxiv.org/abs/2512.22208
作者:Pu Zhao,Xuan Shen,Zhenglun Kong,Yixin Shen,Sung-En Chang,Arash Akbari,Timothy Rupprecht,Lei Lu,Enfu Nan,Changdi Yang,Yumei He,Weiyan Shi,Xingchen Xu,Yu Huang,Wei Jiang,Wei Wang,Yue Chen,Yong He,Yanzhi Wang
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:Large Language Models, Large Language, Language Models, significant transformation, undergone a significant
备注:
点击查看摘要
Abstract:Recently, Large Language Models (LLMs) have undergone a significant transformation, marked by a rapid rise in both their popularity and capabilities. Leading this evolution are proprietary LLMs like GPT-4 and GPT-o1, which have captured widespread attention in the AI community due to their remarkable performance and versatility. Simultaneously, open-source LLMs, such as LLaMA and Mistral, have made great contributions to the ever-increasing popularity of LLMs due to the ease to customize and deploy the models across diverse applications. Moxin 7B is introduced as a fully open-source LLM developed in accordance with the Model Openness Framework, which moves beyond the simple sharing of model weights to embrace complete transparency in training, datasets, and implementation detail, thus fostering a more inclusive and collaborative research environment that can sustain a healthy open-source ecosystem. To further equip Moxin with various capabilities in different tasks, we develop three variants based on Moxin, including Moxin-VLM, Moxin-VLA, and Moxin-Chinese, which target the vision-language, vision-language-action, and Chinese capabilities, respectively. Experiments show that our models achieve superior performance in various evaluations. We adopt open-source framework and open data for the training. We release our models, along with the available data and code to derive these models.
198. 【2512.22205】A CNN-Based Malaria Diagnosis from Blood Cell Images with SHAP and LIME Explainability
链接:https://arxiv.org/abs/2512.22205
作者:Md. Ismiel Hossen Abir,Awolad Hossain
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:prevalent health concern, subtropical climates, remains a prevalent, prevalent health, health concern
备注:
点击查看摘要
Abstract:Malaria remains a prevalent health concern in regions with tropical and subtropical climates. The cause of malaria is the Plasmodium parasite, which is transmitted through the bites of infected female Anopheles mosquitoes. Traditional diagnostic methods, such as microscopic blood smear analysis, are low in sensitivity, depend on expert judgment, and require resources that may not be available in remote settings. To overcome these limitations, this study proposes a deep learning-based approach utilizing a custom Convolutional Neural Network (CNN) to automatically classify blood cell images as parasitized or uninfected. The model achieves an accuracy of 96%, with precision and recall scores exceeding 0.95 for both classes. This study also compares the custom CNN with established deep learning architectures, including ResNet50, VGG16, MobileNetV2, and DenseNet121. To enhance model interpretability, Explainable AI techniques such as SHAP, LIME, and Saliency Maps are applied. The proposed system shows how deep learning can provide quick, accurate and understandable malaria diagnosis, especially in areas with limited resources.
199. 【2512.22203】CFormer: A 5M-Parameter Transformer with Density-Guided Aggregation for Weakly-Supervised Crowd Counting
链接:https://arxiv.org/abs/2512.22203
作者:Qiang Guo,Rubo Zhang,Bingbing Zhang,Junjie Liu,Jianqing Liu
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:computationally intensive backbones, counting typically relies, Crowd counting typically, labor-intensive point-level annotations, Crowd counting
备注:
点击查看摘要
Abstract:Crowd counting typically relies on labor-intensive point-level annotations and computationally intensive backbones, restricting its scalability and deployment in resource-constrained environments. To address these challenges, this paper proposes the TCFormer, a tiny, ultra-lightweight, weakly-supervised transformer-based crowd counting framework with only 5 million parameters that achieves competitive performance. Firstly, a powerful yet efficient vision transformer is adopted as the feature extractor, the global context-aware capabilities of which provides semantic meaningful crowd features with a minimal memory footprint. Secondly, to compensate for the lack of spatial supervision, we design a feature aggregation mechanism termed the Learnable Density-Weighted Averaging module. This module dynamically re-weights local tokens according to predicted density scores, enabling the network to adaptively modulate regional features based on their specific density characteristics without the need for additional annotations. Furthermore, this paper introduces a density-level classification loss, which discretizes crowd density into distinct grades, thereby regularizing the training process and enhancing the model's classification power across varying levels of crowd density. Therefore, although TCformer is trained under a weakly-supervised paradigm utilizing only image-level global counts, the joint optimization of count and density-level losses enables the framework to achieve high estimation accuracy. Extensive experiments on four benchmarks including ShanghaiTech A/B, UCF-QNRF, and NWPU datasets demonstrate that our approach strikes a superior trade-off between parameter efficiency and counting accuracy and can be a good solution for crowd counting tasks in edge devices.
200. 【2512.22197】Quadrant Segmentation VLM with Few-Shot Adaptation and OCT Learning-based Explainability Methods for Diabetic Retinopathy
链接:https://arxiv.org/abs/2512.22197
作者:Shivum Telang
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Diabetic Retinopathy, vision loss worldwide, requiring early detection, loss worldwide, requiring early
备注: 4 pages, 6 figures
点击查看摘要
Abstract:Diabetic Retinopathy (DR) is a leading cause of vision loss worldwide, requiring early detection to preserve sight. Limited access to physicians often leaves DR undiagnosed. To address this, AI models utilize lesion segmentation for interpretability; however, manually annotating lesions is impractical for clinicians. Physicians require a model that explains the reasoning for classifications rather than just highlighting lesion locations. Furthermore, current models are one-dimensional, relying on a single imaging modality for explainability and achieving limited effectiveness. In contrast, a quantitative-detection system that identifies individual DR lesions in natural language would overcome these limitations, enabling diverse applications in screening, treatment, and research settings. To address this issue, this paper presents a novel multimodal explainability model utilizing a VLM with few-shot learning, which mimics an ophthalmologist's reasoning by analyzing lesion distributions within retinal quadrants for fundus images. The model generates paired Grad-CAM heatmaps, showcasing individual neuron weights across both OCT and fundus images, which visually highlight the regions contributing to DR severity classification. Using a dataset of 3,000 fundus images and 1,000 OCT images, this innovative methodology addresses key limitations in current DR diagnostics, offering a practical and comprehensive tool for improving patient outcomes.
201. 【2512.22193】ny-YOLOSAM: Fast Hybrid Image Segmentation
链接:https://arxiv.org/abs/2512.22193
作者:Kenneth Xu,Songhan Wu
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Segment Anything Model, enables promptable, latency-critical settings, distilled SAM variant, computationally expensive
备注: 7 pages, 6 figures, 5 tables. Code available at: [this https URL](https://github.com/Kenneth-Xu11566/tiny-yolosam)
点击查看摘要
Abstract:The Segment Anything Model (SAM) enables promptable, high-quality segmentation but is often too computationally expensive for latency-critical settings. TinySAM is a lightweight, distilled SAM variant that preserves strong zero-shot mask quality, yet its "segment-everything" mode still requires hundreds of prompts and remains slow in practice. We first replicate TinySAM on COCO val2017 using official checkpoints, matching the reported AP within 0.03%, establishing a reliable experimental baseline. Building on this, we propose Tiny-YOLOSAM, a fast hybrid pipeline that uses a recent YOLO detector (YOLOv12) to generate box prompts for TinySAM on salient foreground objects, and supplements uncovered regions with sparse point prompts sampled only where YOLO-guided masks provide no coverage. On COCO val2017, the hybrid system substantially improves class-agnostic coverage (AR from 16.4% to 77.1%, mIoU from 19.2% to 67.8%) while reducing end-to-end runtime from 49.20s/image to 10.39s/image (4.7x) on an Apple M1 Pro CPU. These results suggest detector-guided prompting combined with targeted sparse sampling as an effective alternative to dense "segment-everything" prompting for practical full-scene segmentation.
202. 【2512.22188】HookMIL: Revisiting Context Modeling in Multiple Instance Learning for Computational Pathology
链接:https://arxiv.org/abs/2512.22188
作者:Xitong Ling,Minxi Ouyang,Xiaoxiao Li,Jiawen Li,Ying Chen,Yuxuan Sun,Xinrui Chen,Tian Guan,Xiaoping Liu,Yonghong He
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
关键词:Multiple Instance Learning, enabled weakly supervised, weakly supervised analysis, Multiple Instance, Instance Learning
备注:
点击查看摘要
Abstract:Multiple Instance Learning (MIL) has enabled weakly supervised analysis of whole-slide images (WSIs) in computational pathology. However, traditional MIL approaches often lose crucial contextual information, while transformer-based variants, though more expressive, suffer from quadratic complexity and redundant computations. To address these limitations, we propose HookMIL, a context-aware and computationally efficient MIL framework that leverages compact, learnable hook tokens for structured contextual aggregation. These tokens can be initialized from (i) key-patch visual features, (ii) text embeddings from vision-language pathology models, and (iii) spatially grounded features from spatial transcriptomics-vision models. This multimodal initialization enables Hook Tokens to incorporate rich textual and spatial priors, accelerating convergence and enhancing representation quality. During training, Hook tokens interact with instances through bidirectional attention with linear complexity. To further promote specialization, we introduce a Hook Diversity Loss that encourages each token to focus on distinct histopathological patterns. Additionally, a hook-to-hook communication mechanism refines contextual interactions while minimizing redundancy. Extensive experiments on four public pathology datasets demonstrate that HookMIL achieves state-of-the-art performance, with improved computational efficiency and interpretability. Codes are available at this https URL.
203. 【2512.22185】SAMM2D: Scale-Aware Multi-Modal 2D Dual-Encoder for High-Sensitivity Intracrania Aneurysm Screening
链接:https://arxiv.org/abs/2512.22185
作者:Antara Titikhsha,Divyanshu Tak
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Effective aneurysm detection, pronounced class imbalance, avert life-threatening hemorrhages, remains challenging due, Effective aneurysm
备注:
点击查看摘要
Abstract:Effective aneurysm detection is essential to avert life-threatening hemorrhages, but it remains challenging due to the subtle morphology of the aneurysm, pronounced class imbalance, and the scarcity of annotated data. We introduce SAMM2D, a dual-encoder framework that achieves an AUC of 0.686 on the RSNA intracranial aneurysm dataset; an improvement of 32% over the clinical baseline. In a comprehensive ablation across six augmentation regimes, we made a striking discovery: any form of data augmentation degraded performance when coupled with a strong pretrained backbone. Our unaugmented baseline model outperformed all augmented variants by 1.75--2.23 percentage points (p 0.01), overturning the assumption that "more augmentation is always better" in low-data medical settings. We hypothesize that ImageNet-pretrained features already capture robust invariances, rendering additional augmentations both redundant and disruptive to the learned feature manifold. By calibrating the decision threshold, SAMM2D reaches 95% sensitivity, surpassing average radiologist performance, and translates to a projected \$13.9M in savings per 1,000 patients in screening applications. Grad-CAM visualizations confirm that 85% of true positives attend to relevant vascular regions (62% IoU with expert annotations), demonstrating the model's clinically meaningful focus. Our results suggest that future medical imaging workflows could benefit more from strong pretraining than from increasingly complex augmentation pipelines.
204. 【2512.22183】Unbiased Visual Reasoning with Controlled Visual Inputs
链接:https://arxiv.org/abs/2512.22183
作者:Zhaonan Li,Shijie Lu,Fei Wang,Jacob Dineen,Xiao Ye,Zhikun Xu,Siyi Liu,Young Min Cho,Bangzheng Li,Daniel Chang,Kenny Nguyen,Qizheng Yang,Muhao Chen,Ben Zhou
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Vision-language Models, shortcut-prone when fine-tuned, answer visual questions, Separation for Text-based, exploiting spurious correlations
备注:
点击查看摘要
Abstract:End-to-end Vision-language Models (VLMs) often answer visual questions by exploiting spurious correlations instead of causal visual evidence, and can become more shortcut-prone when fine-tuned. We introduce VISTA (Visual-Information Separation for Text-based Analysis), a modular framework that decouples perception from reasoning via an explicit information bottleneck. A frozen VLM sensor is restricted to short, objective perception queries, while a text-only LLM reasoner decomposes each question, plans queries, and aggregates visual facts in natural language. This controlled interface defines a reward-aligned environment for training unbiased visual reasoning with reinforcement learning. Instantiated with Qwen2.5-VL and Llama3.2-Vision sensors, and trained with GRPO from only 641 curated multi-step questions, VISTA significantly improves robustness to real-world spurious correlations on SpuriVerse (+16.29% with Qwen-2.5-VL-7B and +6.77% with Llama-3.2-Vision-11B), while remaining competitive on MMVP and a balanced SeedBench subset. VISTA transfers robustly across unseen VLM sensors and is able to recognize and recover from VLM perception failures. Human analysis further shows that VISTA's reasoning traces are more neutral, less reliant on spurious attributes, and more explicitly grounded in visual evidence than end-to-end VLM baselines.
205. 【2512.22182】Enhancing Medical Data Analysis through AI-Enhanced Locally Linear Embedding: Applications in Medical Point Location and Imagery
链接:https://arxiv.org/abs/2512.22182
作者:Hassan Khalid,Muhammad Mahad Khaliq,Muhammad Jawad Bashir
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Locally Linear Embedding, evolution of Artificial, Artificial intelligence, including medical billing, AI-enhanced LLE
备注: 5 pages, 3 figures. Accepted and published at 2024 19th International Conference on Emerging Technologies (ICET)
点击查看摘要
Abstract:The rapid evolution of Artificial intelligence in healthcare has opened avenues for enhancing various processes, including medical billing and transcription. This paper introduces an innovative approach by integrating AI with Locally Linear Embedding (LLE) to revolutionize the handling of high-dimensional medical data. This AI-enhanced LLE model is specifically tailored to improve the accuracy and efficiency of medical billing systems and transcription services. By automating these processes, the model aims to reduce human error and streamline operations, thereby facilitating faster and more accurate patient care documentation and financial transactions. This paper provides a comprehensive mathematical model of AI-enhanced LLE, demonstrating its application in real-world healthcare scenarios through a series of experiments. The results indicate a significant improvement in data processing accuracy and operational efficiency. This study not only underscores the potential of AI-enhanced LLE in medical data analysis but also sets a foundation for future research into broader healthcare applications.
206. 【2512.22177】Real-Time American Sign Language Recognition Using 3D Convolutional Neural Networks and LSTM: Architecture, Training, and Deployment
链接:https://arxiv.org/abs/2512.22177
作者:Dawnena Key
类目:Computer Vision and Pattern Recognition (cs.CV)
关键词:Convolutional Neural Networks, Long Short-Term Memory, Convolutional Neural, Neural Networks, American Sign Language
备注: 10 pages, 1 figure, 2 tables. Patent pending (US 63/918,518). Code available at [this https URL](https://github.com/dawnenakey/spokhandSLR)
点击查看摘要
Abstract:This paper presents a real-time American Sign Language (ASL) recognition system utilizing a hybrid deep learning architecture combining 3D Convolutional Neural Networks (3D CNN) with Long Short-Term Memory (LSTM) networks. The system processes webcam video streams to recognize word-level ASL signs, addressing communication barriers for over 70 million deaf and hard-of-hearing individuals worldwide. Our architecture leverages 3D convolutions to capture spatial-temporal features from video frames, followed by LSTM layers that model sequential dependencies inherent in sign language gestures. Trained on the WLASL dataset (2,000 common words), ASL-LEX lexical database (~2,700 signs), and a curated set of 100 expert-annotated ASL signs, the system achieves F1-scores ranging from 0.71 to 0.99 across sign classes. The model is deployed on AWS infrastructure with edge deployment capability on OAK-D cameras for real-time inference. We discuss the architecture design, training methodology, evaluation metrics, and deployment considerations for practical accessibility applications.
207. 【2512.22175】Characterizing Motion Encoding in Video Diffusion Timesteps
链接:https://arxiv.org/abs/2512.22175
作者:Vatsal Baherwani,Yixuan Ren,Abhinav Shrivastava
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
关键词:remains poorly understood, models synthesize temporal, diffusion models synthesize, timesteps remains poorly, synthesize temporal motion
备注: 10 pages, 4 figures
点击查看摘要
Abstract:Text-to-video diffusion models synthesize temporal motion and spatial appearance through iterative denoising, yet how motion is encoded across timesteps remains poorly understood. Practitioners often exploit the empirical heuristic that early timesteps mainly shape motion and layout while later ones refine appearance, but this behavior has not been systematically characterized. In this work, we proxy motion encoding in video diffusion timesteps by the trade-off between appearance editing and motion preservation induced when injecting new conditions over specified timestep ranges, and characterize this proxy through a large-scale quantitative study. This protocol allows us to factor motion from appearance by quantitatively mapping how they compete along the denoising trajectory. Across diverse architectures, we consistently identify an early, motion-dominant regime and a later, appearance-dominant regime, yielding an operational motion-appearance boundary in timestep space. Building on this characterization, we simplify current one-shot motion customization paradigm by restricting training and inference to the motion-dominant regime, achieving strong motion transfer without auxiliary debiasing modules or specialized objectives. Our analysis turns a widely used heuristic into a spatiotemporal disentanglement principle, and our timestep-constrained recipe can serve as ready integration into existing motion transfer and editing methods.
208. 【2512.22170】SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models
链接:https://arxiv.org/abs/2512.22170
作者:Jiesong Lian,Ruizhe Zhong,Zixiang Zhou,Xiaoyue Mi,Yixue Hao,Yuan Zhou,Qinglin Lu,Long Hu,Junchi Yan
类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
关键词:critical goal, video generation models, effective Reward Models, generation models, models
备注: 16 pages, 9 figures
点击查看摘要
Abstract:Post-training alignment of video generation models with human preferences is a critical goal. Developing effective Reward Models (RMs) for this process faces significant methodological hurdles. Current data collection paradigms, reliant on in-prompt pairwise annotations, suffer from labeling noise. Concurrently, the architectural design of VLM-based RMs, particularly their output mechanisms, remains underexplored. Furthermore, RM is susceptible to reward hacking in post-training. To mitigate these limitations, we propose SoliReward, a systematic framework for video RM training. Our framework first sources high-quality, cost-efficient data via single-item binary annotations, then constructs preference pairs using a cross-prompt pairing strategy. Architecturally, we employ a Hierarchical Progressive Query Attention mechanism to enhance feature aggregation. Finally, we introduce a modified BT loss that explicitly accommodates win-tie scenarios. This approach regularizes the RM's score distribution for positive samples, providing more nuanced preference signals to alleviate over-focus on a small number of top-scoring samples. Our approach is validated on benchmarks evaluating physical plausibility, subject deformity, and semantic alignment, demonstrating improvements in direct RM evaluation metrics and in the efficacy of post-training on video generation models. Code and benchmark will be publicly available.
209. 【2512.22136】SlimEdge: Lightweight Distributed DNN Deployment on Constrained Hardware
链接:https://arxiv.org/abs/2512.22136
作者:Mahadev Sunil Kumar,Arnab Raha,Debayan Das,Gopakumar G,Amitava Mukherjee
类目:Distributed, Parallel, and Cluster Computing (cs.DC); Computer Vision and Pattern Recognition (cs.CV)
关键词:substantial parameter counts, Deep distributed networks, devices remains hindered, modern computer vision, Deep distributed
备注:
点击查看摘要
Abstract:Deep distributed networks (DNNs) have become central to modern computer vision, yet their deployment on resource-constrained edge devices remains hindered by substantial parameter counts and computational demands. Here, we present an approach to the efficient deployment of distributed DNNs that jointly respects hardware limitations and preserves task performance. Our method integrates a structured model pruning with a multi-objective optimization to tailor network capacity to heterogeneous device constraints. We demonstrate this framework using Multi-View Convolutional Neural Network (MVCNN), a state-of-the-art architecture for 3D object recognition, by quantifying the contribution of individual views to classification accuracy and allocating pruning budgets, respectively. Experimental results show that the resulting models satisfy user-specified bounds on accuracy and memory footprint while reducing inference latency by factors ranging from 1.2x to 5.0x across diverse hardware platforms. These findings suggest that performance-aware, view-adaptive compression provides a viable pathway for deploying complex vision models in distributed edge environments.
210. 【2512.23185】EIR: Enhanced Image Representations for Medical Report Generation
链接:https://arxiv.org/abs/2512.23185
作者:Qiang Sun,Zongcheng Ji,Yinlong Xiao,Peng Chang,Jun Yu
类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:chest X-ray images, chest X-ray, critical and time-consuming, time-consuming task, X-ray images
备注:
点击查看摘要
Abstract:Generating medical reports from chest X-ray images is a critical and time-consuming task for radiologists, especially in emergencies. To alleviate the stress on radiologists and reduce the risk of misdiagnosis, numerous research efforts have been dedicated to automatic medical report generation in recent years. Most recent studies have developed methods that represent images by utilizing various medical metadata, such as the clinical document history of the current patient and the medical graphs constructed from retrieved reports of other similar patients. However, all existing methods integrate additional metadata representations with visual representations through a simple "Add and LayerNorm" operation, which suffers from the information asymmetry problem due to the distinct distributions between them. In addition, chest X-ray images are usually represented using pre-trained models based on natural domain images, which exhibit an obvious domain gap between general and medical domain images. To this end, we propose a novel approach called Enhanced Image Representations (EIR) for generating accurate chest X-ray reports. We utilize cross-modal transformers to fuse metadata representations with image representations, thereby effectively addressing the information asymmetry problem between them, and we leverage medical domain pre-trained models to encode medical images, effectively bridging the domain gap for image representation. Experimental results on the widely used MIMIC and Open-I datasets demonstrate the effectiveness of our proposed method.
211. 【2512.23078】Deep Learning for Art Market Valuation
链接:https://arxiv.org/abs/2512.23078
作者:Jianping Mei,Michael Moses,Jan Waelty,Yucheng Yang
类目:General Finance (q-fin.GN); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); General Economics (econ.GN)
关键词:content of artworks, modern deep architectures, visual content, deep learning, multi-modal deep learning
备注:
点击查看摘要
Abstract:We study how deep learning can improve valuation in the art market by incorporating the visual content of artworks into predictive models. Using a large repeated-sales dataset from major auction houses, we benchmark classical hedonic regressions and tree-based methods against modern deep architectures, including multi-modal models that fuse tabular and image data. We find that while artist identity and prior transaction history dominate overall predictive power, visual embeddings provide a distinct and economically meaningful contribution for fresh-to-market works where historical anchors are absent. Interpretability analyses using Grad-CAM and embedding visualizations show that models attend to compositional and stylistic cues. Our findings demonstrate that multi-modal deep learning delivers significant value precisely when valuation is hardest, namely first-time sales, and thus offers new insights for both academic research and practice in art market valuation.
212. 【2512.22855】A Rapid GeoSAM-Based Workflow for Multi-Temporal Glacier Delineation: Case Study from Svalbard
链接:https://arxiv.org/abs/2512.22855
作者:Alexandru Hegyi
类目:Geophysics (physics.geo-ph); Computer Vision and Pattern Recognition (cs.CV)
关键词:long time series, monitoring glacier change, glacier boundary delineation, heterogeneous environments, Consistent glacier boundary
备注:
点击查看摘要
Abstract:Consistent glacier boundary delineation is essential for monitoring glacier change, yet many existing approaches are difficult to scale across long time series and heterogeneous environments. In this report, we present a GeoSAM-based, semi-automatic workflow for rapid glacier delineation from Sentinel-2 surface reflectance imagery. The method combines late-summer image compositing, spectral-index-based identification of candidate ice areas, prompt-guided segmentation using GeoSAM, and physically based post-processing to derive annual glacier outlines. The workflow is demonstrated in the Ny-Alesund and Kongsfjorden region of western Svalbard across multiple years of the Sentinel-2 era. Results show that the approach produces spatially coherent and temporally consistent outlines for major glacier bodies, while most errors are associated with small features affected by water bodies, terrain shadows, or high surface variability. The reliance on derived RGB imagery makes the method flexible and transferable to other optical datasets, with improved performance expected at higher spatial resolution. Although user inspection remains necessary to filter incorrect polygons and adjust thresholds for local conditions, the workflow provides a fast and practical alternative for multi-temporal glacier mapping and ice-loss assessment.
213. 【2512.22766】SwinCCIR: An end-to-end deep network for Compton camera imaging reconstruction
链接:https://arxiv.org/abs/2512.22766
作者:Minghao Dong,Xinyang Luo,Xujian Ouyang,Yongshun Xiao
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Nuclear Experiment (nucl-ex)
关键词:incident gammas based, Compton cameras, gamma cameras, Compton scatter, incident gammas
备注: 10 pages, 7 figures
点击查看摘要
Abstract:Compton cameras (CCs) are a kind of gamma cameras which are designed to determine the directions of incident gammas based on the Compton scatter. However, the reconstruction of CCs face problems of severe artifacts and deformation due to the fundamental reconstruction principle of back-projection of Compton cones. Besides, a part of systematic errors originated from the performance of devices are hard to remove through calibration, leading to deterioration of imaging quality. Iterative algorithms and deep-learning based methods have been widely used to improve reconstruction. But most of them are optimization based on the results of back-projection. Therefore, we proposed an end-to-end deep learning framework, SwinCCIR, for CC imaging. Through adopting swin-transformer blocks and a transposed convolution-based image generation module, we established the relationship between the list-mode events and the radioactive source distribution. SwinCCIR was trained and validated on both simulated and practical dataset. The experimental results indicate that SwinCCIR effectively overcomes problems of conventional CC imaging, which are expected to be implemented in practical applications.
214. 【2512.22674】Semantic contrastive learning for orthogonal X-ray computed tomography reconstruction
链接:https://arxiv.org/abs/2512.22674
作者:Jiashu Dong,Jiabing Xiang,Lisheng Geng,Suqing Tian,Wei Zhao
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
关键词:X-ray computed tomography, reduce radiation dose, X-ray computed, sparse-view reconstruction offering, computed tomography
备注: This paper is accepted by Fully3D 2025
点击查看摘要
Abstract:X-ray computed tomography (CT) is widely used in medical imaging, with sparse-view reconstruction offering an effective way to reduce radiation dose. However, ill-posed conditions often result in severe streak artifacts. Recent advances in deep learning-based methods have improved reconstruction quality, but challenges still remain. To address these challenges, we propose a novel semantic feature contrastive learning loss function that evaluates semantic similarity in high-level latent spaces and anatomical similarity in shallow latent spaces. Our approach utilizes a three-stage U-Net-based architecture: one for coarse reconstruction, one for detail refinement, and one for semantic similarity measurement. Tests on a chest dataset with orthogonal projections demonstrate that our method achieves superior reconstruction quality and faster processing compared to other algorithms. The results show significant improvements in image quality while maintaining low computational complexity, making it a practical solution for orthogonal CT reconstruction.
215. 【2512.22485】JParc: Joint cortical surface parcellation with registration
链接:https://arxiv.org/abs/2512.22485
作者:Jian Li,Karthik Gopinath,Brian L. Edlow,Adrian V. Dalca,Bruce Fischl
类目:Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
关键词:basic neuroscience research, neuroscience research, Cortical surface parcellation, parcellation, Cortical surface
备注: A. V. Dalca and B. Fischl are co-senior authors with equal contributions
点击查看摘要
Abstract:Cortical surface parcellation is a fundamental task in both basic neuroscience research and clinical applications, enabling more accurate mapping of brain regions. Model-based and learning-based approaches for automated parcellation alleviate the need for manual labeling. Despite the advancement in parcellation performance, learning-based methods shift away from registration and atlas propagation without exploring the reason for the improvement compared to traditional methods. In this study, we present JParc, a joint cortical registration and parcellation framework, that outperforms existing state-of-the-art parcellation methods. In rigorous experiments, we demonstrate that the enhanced performance of JParc is primarily attributable to accurate cortical registration and a learned parcellation atlas. By leveraging a shallow subnetwork to fine-tune the propagated atlas labels, JParc achieves a Dice score greater than 90% on the Mindboggle dataset, using only basic geometric features (sulcal depth, curvature) that describe cortical folding patterns. The superior accuracy of JParc can significantly increase the statistical power in brain mapping studies as well as support applications in surgical planning and many other downstream neuroscientific and clinical tasks.
216. 【2512.22463】MEGA-PCC: A Mamba-based Efficient Approach for Joint Geometry and Attribute Point Cloud Compression
链接:https://arxiv.org/abs/2512.22463
作者:Kai-Hsiang Hsieh,Monyneath Yim,Wen-Hsiao Peng,Jui-Chiu Chiang
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:essential for efficient, Joint compression, geometry and attribute, data representation, geometry
备注: Accepted at the IEEE/CVF Winter Conference on Applications of Computer Vision 2026 (WACV 2026)
点击查看摘要
Abstract:Joint compression of point cloud geometry and attributes is essential for efficient 3D data representation. Existing methods often rely on post-hoc recoloring procedures and manually tuned bitrate allocation between geometry and attribute bitstreams in inference, which hinders end-to-end optimization and increases system complexity. To overcome these limitations, we propose MEGA-PCC, a fully end-to-end, learning-based framework featuring two specialized models for joint compression. The main compression model employs a shared encoder that encodes both geometry and attribute information into a unified latent representation, followed by dual decoders that sequentially reconstruct geometry and then attributes. Complementing this, the Mamba-based Entropy Model (MEM) enhances entropy coding by capturing spatial and channel-wise correlations to improve probability estimation. Both models are built on the Mamba architecture to effectively model long-range dependencies and rich contextual features. By eliminating the need for recoloring and heuristic bitrate tuning, MEGA-PCC enables data-driven bitrate allocation during training and simplifies the overall pipeline. Extensive experiments demonstrate that MEGA-PCC achieves superior rate-distortion performance and runtime efficiency compared to both traditional and learning-based baselines, offering a powerful solution for AI-driven point cloud compression.
217. 【2512.22209】Super-Resolution Enhancement of Medical Images Based on Diffusion Model: An Optimization Scheme for Low-Resolution Gastric Images
链接:https://arxiv.org/abs/2512.22209
作者:Haozhe Jia
类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:enabled minimally invasive, captured images due, minimally invasive gastrointestinal, inherently low resolution, due to hardware
备注: 19 pages, 16 figures. Undergraduate final year project
点击查看摘要
Abstract:Capsule endoscopy has enabled minimally invasive gastrointestinal imaging, but its clinical utility is limited by the inherently low resolution of captured images due to hardware, power, and transmission constraints. This limitation hampers the identification of fine-grained mucosal textures and subtle pathological features essential for early diagnosis. This work investigates a diffusion-based super-resolution framework to enhance capsule endoscopy images in a data-driven and anatomically consistent manner. We adopt the SR3 (Super-Resolution via Repeated Refinement) framework built upon Denoising Diffusion Probabilistic Models (DDPMs) to learn a probabilistic mapping from low-resolution to high-resolution images. Unlike GAN-based approaches that often suffer from training instability and hallucination artifacts, diffusion models provide stable likelihood-based training and improved structural fidelity. The HyperKvasir dataset, a large-scale publicly available gastrointestinal endoscopy dataset, is used for training and evaluation. Quantitative results demonstrate that the proposed method significantly outperforms bicubic interpolation and GAN-based super-resolution methods such as ESRGAN, achieving PSNR of 27.5 dB and SSIM of 0.65 for a baseline model, and improving to 29.3 dB and 0.71 with architectural enhancements including attention mechanisms. Qualitative results show improved preservation of anatomical boundaries, vascular patterns, and lesion structures. These findings indicate that diffusion-based super-resolution is a promising approach for enhancing non-invasive medical imaging, particularly in capsule endoscopy where image resolution is fundamentally constrained.
Comments:
19 pages, 16 figures. Undergraduate final year project
Subjects:
Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
MSC classes:
68U10, 68T07
ACMclasses:
I.4.2; I.5.4
Cite as:
arXiv:2512.22209 [eess.IV]
(or
arXiv:2512.22209v1 [eess.IV] for this version)
https://doi.org/10.48550/arXiv.2512.22209
Focus to learn more
arXiv-issued DOI via DataCite (pending registration)</p>
218. 【2512.22202】Complex Swin Transformer for Accelerating Enhanced SMWI Reconstruction
链接:https://arxiv.org/abs/2512.22202
作者:Muhammad Usman,Sung-Min Gho
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:Susceptibility Map Weighted, Map Weighted Imaging, Susceptibility Map, magnetic resonance imaging, resonance imaging technique
备注: Published at ISMRM 2025 (Abstract #2651)
点击查看摘要
Abstract:Susceptibility Map Weighted Imaging (SMWI) is an advanced magnetic resonance imaging technique used to detect nigral hyperintensity in Parkinsons disease. However, full resolution SMWI acquisition is limited by long scan times. Efficient reconstruction methods are therefore required to generate high quality SMWI from reduced k space data while preserving diagnostic relevance. In this work, we propose a complex valued Swin Transformer based network for super resolution reconstruction of multi echo MRI data. The proposed method reconstructs high quality SMWI images from low resolution k space inputs. Experimental results demonstrate that the method achieves a structural similarity index of 0.9116 and a mean squared error of 0.076 when reconstructing SMWI from 256 by 256 k space data, while maintaining critical diagnostic features. This approach enables high quality SMWI reconstruction from reduced k space sampling, leading to shorter scan times without compromising diagnostic detail. The proposed method has the potential to improve the clinical applicability of SMWI for Parkinsons disease and support faster and more efficient neuroimaging workflows.
219. 【2512.22184】AI-Enhanced Virtual Biopsies for Brain Tumor Diagnosis in Low Resource Settings
链接:https://arxiv.org/abs/2512.22184
作者:Areeb Ehsan
类目:Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
关键词:expert neuroradiology interpretation, Timely brain tumor, high-end MRI hardware, invasive biopsy procedures, diagnosis remains challenging
备注: 6 pages, 10 figures
点击查看摘要
Abstract:Timely brain tumor diagnosis remains challenging in low-resource clinical environments where expert neuroradiology interpretation, high-end MRI hardware, and invasive biopsy procedures may be limited. Although deep learning has achieved strong performance in brain tumor analysis, real-world adoption is constrained by computational demands, dataset shift across scanners, and limited interpretability. This paper presents a prototype virtual biopsy pipeline for four-class classification of 2D brain MRI images using a lightweight convolutional neural network (CNN) and complementary radiomics-style handcrafted features. A MobileNetV2-based CNN is trained for classification, while an interpretable radiomics branch extracts eight features capturing lesion shape, intensity statistics, and gray-level co-occurrence matrix (GLCM) texture descriptors. A late fusion strategy concatenates CNN embeddings with radiomics features and trains a RandomForest classifier on the fused representation. Explainability is provided via Grad-CAM visualizations and radiomics feature importance analysis. Experiments on a public Kaggle brain tumor MRI dataset show improved validation performance for fusion relative to single-branch baselines, while robustness tests under reduced resolution and additive noise highlight sensitivity relevant to low-resource imaging conditions. The system is framed as decision support and not a substitute for clinical diagnosis or histopathology.
220. 【2512.22176】Field strength-dependent performance variability in deep learning-based analysis of magnetic resonance imaging
链接:https://arxiv.org/abs/2512.22176
作者:Muhammad Ibtsaam Qadir,Duane Schonlau,Ulrike Dydak,Fiona R. Kolbinger
类目:Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
关键词:study quantitatively evaluates, MRI scanner magnetic, study quantitatively, quantitatively evaluates, evaluates the impact
备注: 16 pages, 1 table, 4 figures
点击查看摘要
Abstract:This study quantitatively evaluates the impact of MRI scanner magnetic field strength on the performance and generalizability of deep learning-based segmentation algorithms. Three publicly available MRI datasets (breast tumor, pancreas, and cervical spine) were stratified by scanner field strength (1.5T vs. 3.0T). For each segmentation task, three nnU-Net-based models were developed: A model trained on 1.5T data only (m-1.5T), a model trained on 3.0T data only (m-3.0T), and a model trained on pooled 1.5T and 3.0T data (m-combined). Each model was evaluated on both 1.5T and 3.0T validation sets. Field-strength-dependent performance differences were investigated via Uniform Manifold Approximation and Projection (UMAP)-based clustering and radiomic analysis, including 23 first-order and texture features. For breast tumor segmentation, m-3.0T (DSC: 0.494 [1.5T] and 0.433 [3.0T]) significantly outperformed m-1.5T (DSC: 0.411 [1.5T] and 0.289 [3.0T]) and m-combined (DSC: 0.373 [1.5T] and 0.268[3.0T]) on both validation sets (p0.0001). Pancreas segmentation showed similar trends: m-3.0T achieved the highest DSC (0.774 [1.5T], 0.840 [3.0T]), while m-1.5T underperformed significantly (p0.0001). For cervical spine, models performed optimally on same-field validation sets with minimal cross-field performance degradation (DSC0.92 for all comparisons). Radiomic analysis revealed moderate field-strength-dependent clustering in soft tissues (silhouette scores 0.23-0.29) but minimal separation in osseous structures (0.12). These results indicate that magnetic field strength in the training data substantially influences the performance of deep learning-based segmentation models, particularly for soft-tissue structures (e.g., small lesions). This warrants consideration of magnetic field strength as a confounding factor in studies evaluating AI performance on MRI.




