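This post presents the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

1154 papers were updated today, including:

Natural Language Processing: 154 papers
Information Retrieval: 14 papers
Computer Vision: 383 papers

Natural Language Processing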

1. 【2503.07605】SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models

Authors: Xun Liang,Hanyu Wang,Huayi Lai,Simin Niu,Shichao Song,Jiawei Yang,Jihao Zhao,Feiyu Xiong,Bo Tang,Zhiyu Li

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, natural language processing, Large Language, achieved remarkable success, language processing tasks

Comments: 15 pages, 7 figures, 8 tables

Abstract:Large Language Models have achieved remarkable success across various natural language processing tasks, yet their high computational cost during inference remains a major bottleneck. This paper introduces Sparse Expert Activation Pruning (SEAP), a training-free pruning method that selectively retains task-relevant parameters to reduce inference overhead. Inspired by the clustering patterns of hidden states and activations in LLMs, SEAP identifies task-specific expert activation patterns and prunes the model while preserving task performance and enhancing computational efficiency. Experimental results demonstrate that SEAP significantly reduces computational overhead while maintaining competitive accuracy. Notably, at 50% pruning, SEAP surpasses both WandA and FLAP by over 20%, and at 20% pruning, it incurs only a 2.2% performance drop compared to the dense model. These findings highlight SEAP's scalability and effectiveness, making it a promising approach for optimizing large-scale LLMs.

2. 【2503.07604】Implicit Reasoning in Transformers is Reasoning through Shortcuts

Link: https://arxiv.org/abs/2503.07604

Authors: Tianhe Lin,Jian Xie,Siyu Yuan,Deqing Yang

Categories: Computation and Language (cs.CL)

Keywords: enhancing language models', language models' complex, models' complex multi-step, Test-time compute, implicit reasoning

Abstract:Test-time compute is emerging as a new paradigm for enhancing language models' complex multi-step reasoning capabilities, as demonstrated by the success of OpenAI's o1 and o3, as well as DeepSeek's R1. Compared to explicit reasoning in test-time compute, implicit reasoning is more inference-efficient, requiring fewer generated tokens. However, why does the advanced reasoning capability fail to emerge in the implicit reasoning style? In this work, we train GPT-2 from scratch on a curated multi-step mathematical reasoning dataset and conduct analytical experiments to investigate how language models perform implicit reasoning in multi-step tasks. Our findings reveal: 1) Language models can perform step-by-step reasoning and achieve high accuracy in both in-domain and out-of-domain tests via implicit reasoning. However, this capability only emerges when trained on fixed-pattern data. 2) Conversely, implicit reasoning abilities emerging from training on unfixed-pattern data tend to overfit a specific pattern and fail to generalize further. Notably, this limitation is also observed in state-of-the-art large language models. These findings suggest that language models acquire implicit reasoning through shortcut learning, enabling strong performance on tasks with similar patterns while lacking generalization.

3. 【2503.07595】Detection Avoidance Techniques for Large Language Models

Link: https://arxiv.org/abs/2503.07595

Authors: Sinclair Schneider,Florian Steuber,Joao A. G. Schneider,Gabi Dreo Rodosek

Categories: Computation and Language (cs.CL)

Keywords: systematically spreading fake, large language models, brought various risks, including the potential, increasing popularity

Abstract:The increasing popularity of large language models has not only led to widespread use but has also brought various risks, including the potential for systematically spreading fake news. Consequently, the development of classification systems such as DetectGPT has become vital. These detectors are vulnerable to evasion techniques, as demonstrated in an experimental series: Systematic changes of the generative models' temperature proved shallow learning-detectors to be the least reliable. Fine-tuning the generative model via reinforcement learning circumvented BERT-based-detectors. Finally, rephrasing led to a 90% evasion of zero-shot-detectors like DetectGPT, although texts stayed highly similar to the original. A comparison with existing work highlights the better performance of the presented methods. Possible implications for society and further research are discussed.

4. 【2503.07575】VisBias: Measuring Explicit and Implicit Social Biases in Vision Language Models

Link: https://arxiv.org/abs/2503.07575

Authors: Jen-tse Huang,Jiantong Qin,Jianping Zhang,Youliang Yuan,Wenxuan Wang,Jieyu Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: social biases exhibited, implicit social biases, research investigates, exhibited by Vision-Language, Vision-Language Models

Comments: 9 pages

Abstract:This research investigates both explicit and implicit social biases exhibited by Vision-Language Models (VLMs). The key distinction between these bias types lies in the level of awareness: explicit bias refers to conscious, intentional biases, while implicit bias operates subconsciously. To analyze explicit bias, we directly pose questions to VLMs related to gender and racial differences: (1) Multiple-choice questions based on a given image (e.g., "What is the education level of the person in the image?") (2) Yes-No comparisons using two images (e.g., "Is the person in the first image more educated than the person in the second image?") For implicit bias, we design tasks where VLMs assist users but reveal biases through their responses: (1) Image description tasks: Models are asked to describe individuals in images, and we analyze disparities in textual cues across demographic groups. (2) Form completion tasks: Models draft a personal information collection form with 20 attributes, and we examine correlations among selected attributes for potential biases. We evaluate Gemini-1.5, GPT-4V, GPT-4o, LLaMA-3.2-Vision and LLaVA-v1.6. Our code and data are publicly available at this https URL.

5. 【2503.07572】Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning

Link: https://arxiv.org/abs/2503.07572

Authors: Yuxiao Qu,Matthew Y. R. Yang,Amrith Setlur,Lewis Tunstall,Edward Emanuel Beeching,Ruslan Salakhutdinov,Aviral Kumar

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: test-time compute, crucial for improving, optimizing test-time compute, test-time, compute

Abstract:Training models to effectively use test-time compute is crucial for improving the reasoning performance of LLMs. Current methods mostly do so via fine-tuning on search traces or running RL with 0/1 outcome reward, but do these approaches efficiently utilize test-time compute? Would these approaches continue to scale as the budget improves? In this paper, we try to answer these questions. We formalize the problem of optimizing test-time compute as a meta-reinforcement learning (RL) problem, which provides a principled perspective on spending test-time compute. This perspective enables us to view the long output stream from the LLM as consisting of several episodes run at test time and leads us to use a notion of cumulative regret over output tokens as a way to measure the efficacy of test-time compute. Akin to how RL algorithms can best tradeoff exploration and exploitation over training, minimizing cumulative regret would also provide the best balance between exploration and exploitation in the token stream. While we show that state-of-the-art models do not minimize regret, one can do so by maximizing a dense reward bonus in conjunction with the outcome 0/1 reward RL. This bonus is the "progress" made by each subsequent block in the output stream, quantified by the change in the likelihood of eventual success. Using these insights, we develop Meta Reinforcement Fine-Tuning, or MRT, a new class of fine-tuning methods for optimizing test-time compute. MRT leads to a 2-3x relative gain in performance and roughly a 1.5x gain in token efficiency for math reasoning compared to outcome-reward RL.
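The two quantities named in this abstract, the per-block "progress" bonus and cumulative regret, can be illustrated with a small sketch. This is a simplified reading of the abstract rather than the paper's exact formulation: it assumes the output is already split into blocks, that an estimate of the probability of eventual success is available after each block, and that regret is measured per block against a hypothetical best achievable success probability. The function names and the toy numbers are illustrative only.

```python
from typing import List, Sequence


def progress_bonuses(success_probs: Sequence[float]) -> List[float]:
    """Change in estimated success probability contributed by each block.

    success_probs[0] is the estimate before any block is generated;
    success_probs[j] is the estimate after block j (an assumed input).
    """
    return [after - before for before, after in zip(success_probs, success_probs[1:])]


def cumulative_regret(success_probs: Sequence[float], best_achievable: float = 1.0) -> float:
    """Sum of the gaps between a reference success probability and the
    estimate after each block (a simplified, block-level notion of regret)."""
    return sum(best_achievable - p for p in success_probs[1:])


# Hypothetical estimates: before generation, then after blocks 1, 2, and 3.
probs = [0.2, 0.5, 0.4, 0.9]
bonuses = progress_bonuses(probs)   # block 2 makes negative progress
regret = cumulative_regret(probs)   # smaller is better
print(bonuses, regret)
```

Under this reading, a block that lowers the chance of eventual success receives a negative bonus, and an output stream that reaches a high success probability early accumulates less regret than one that gets there late.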

6. 【2503.07550】KSOD: Knowledge Supplement for LLMs On Demand

Link: https://arxiv.org/abs/2503.07550

Authors: Haoran Li,Junfeng Hu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large Language Models, Large Language, Language Models, demonstrated remarkable capabilities, Knowledge

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet still produce errors in domain-specific tasks. To further improve their performance, we propose KSOD (Knowledge Supplement for LLMs On Demand), a novel framework that empowers LLMs to improve their capabilities with knowledge-based supervised fine-tuning (SFT). KSOD analyzes the causes of errors from the perspective of knowledge deficiency by identifying potential missing knowledge in LLM that may lead to the errors. Subsequently, KSOD tunes a knowledge module on knowledge dataset and verifies whether the LLM lacks the identified knowledge based on it. If the knowledge is verified, KSOD supplements the LLM with the identified knowledge using the knowledge module. Tuning LLMs on specific knowledge instead of specific task decouples task and knowledge and our experiments on two domain-specific benchmarks and four general benchmarks empirically demonstrate that KSOD enhances the performance of LLMs on tasks requiring the supplemented knowledge while preserving their performance on other tasks. Our findings shed light on the potential of improving the capabilities of LLMs with knowledge-based SFT.

7. 【2503.07539】XIFBench: Evaluating Large Language Models on Multilingual Instruction Following

Link: https://arxiv.org/abs/2503.07539

Authors: Zhenyu Li,Kehai Chen,Yunfei Long,Xuefeng Bai,Yaoyin Zhang,Xuchen Wei,Juntao Li,Min Zhang

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, remarkable instruction-following capabilities, demonstrated remarkable instruction-following

Abstract:Large Language Models (LLMs) have demonstrated remarkable instruction-following capabilities across various applications. However, their performance in multilingual settings remains poorly understood, as existing evaluations lack fine-grained constraint analysis. We introduce XIFBench, a comprehensive constraint-based benchmark for assessing multilingual instruction-following abilities of LLMs, featuring a novel taxonomy of five constraint categories and 465 parallel instructions across six languages spanning different resource levels. To ensure consistent cross-lingual evaluation, we develop a requirement-based protocol that leverages English requirements as semantic anchors. These requirements are then used to validate the translations across languages. Extensive experiments with various LLMs reveal notable variations in instruction-following performance across resource levels, identifying key influencing factors such as constraint categories, instruction complexity, and cultural specificity.

8. 【2503.07536】LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

Link: https://arxiv.org/abs/2503.07536

Authors: Yingzhe Peng,Gongrui Zhang,Miaosen Zhang,Zhiyuan You,Jie Liu,Qipeng Zhu,Kai Yang,Xingzhong Xu,Xin Geng,Xu Yang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Multimodal Models, architectural constraints limit, faces unique challenges, limit reasoning capacity, constraints limit reasoning

Abstract:Enhancing reasoning in Large Multimodal Models (LMMs) faces unique challenges from the complex interplay between visual perception and logical reasoning, particularly in compact 3B-parameter architectures where architectural constraints limit reasoning capacity and modality alignment. While rule-based reinforcement learning (RL) excels in text-only domains, its multimodal extension confronts two critical barriers: (1) data limitations due to ambiguous answers and scarce complex reasoning examples, and (2) degraded foundational reasoning induced by multimodal pretraining. To address these challenges, we propose LMM-R1, a two-stage framework adapting rule-based RL for multimodal reasoning through Foundational Reasoning Enhancement (FRE) followed by Multimodal Generalization Training (MGT). The FRE stage first strengthens reasoning abilities using text-only data with rule-based RL, then the MGT stage generalizes these reasoning capabilities to multimodal domains. Experiments on Qwen2.5-VL-Instruct-3B demonstrate that LMM-R1 achieves 4.83% and 4.5% average improvements over baselines in multimodal and text-only benchmarks, respectively, with a 3.63% gain in complex Football Game tasks. These results validate that text-based reasoning enhancement enables effective multimodal generalization, offering a data-efficient paradigm that bypasses costly high-quality multimodal training data.

9. 【2503.07519】GRITHopper: Decomposition-Free Multi-Hop Dense Retrieval

Link: https://arxiv.org/abs/2503.07519

Authors: Justus-Jonas Erker,Nils Reimers,Iryna Gurevych

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: Decomposition-based multi-hop retrieval, Decomposition-based multi-hop, retrieval methods rely, complex queries, computationally expensive

Comments: Under Review at ACL Rolling Review (ARR)

Abstract:Decomposition-based multi-hop retrieval methods rely on many autoregressive steps to break down complex queries, which breaks end-to-end differentiability and is computationally expensive. Decomposition-free methods tackle this, but current decomposition-free approaches struggle with longer multi-hop problems and generalization to out-of-distribution data. To address these challenges, we introduce GRITHopper-7B, a novel multi-hop dense retrieval model that achieves state-of-the-art performance on both in-distribution and out-of-distribution benchmarks. GRITHopper combines generative and representational instruction tuning by integrating causal language modeling with dense retrieval training. Through controlled studies, we find that incorporating additional context after the retrieval process, referred to as post-retrieval language modeling, enhances dense retrieval performance. By including elements such as final answers during training, the model learns to better contextualize and retrieve relevant information. GRITHopper-7B offers a robust, scalable, and generalizable solution for multi-hop dense retrieval, and we release it to the community for future research and applications requiring multi-hop reasoning and retrieval capabilities.

10. 【2503.07518】TokenButler: Token Importance is Predictable

Link: https://arxiv.org/abs/2503.07518

Authors: Yash Akhauri,Ahmed F AbouElhamayed,Yifei Gao,Chi-Chih Chang,Nilesh Jain,Mohamed S. Abdelfattah

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Large Language Models, Large Language, enabling efficient decoding, Cache to store, store token history

Abstract:Large Language Models (LLMs) rely on the Key-Value (KV) Cache to store token history, enabling efficient decoding of tokens. As the KV-Cache grows, it becomes a major memory and computation bottleneck; however, there is an opportunity to alleviate this bottleneck, especially because prior research has shown that only a small subset of tokens contribute meaningfully to each decoding step. A key challenge in finding these critical tokens is that they are dynamic, and heavily input query-dependent. Existing methods either risk quality by evicting tokens permanently, or retain the full KV-Cache but rely on retrieving chunks (pages) of tokens at generation, failing at dense, context-rich tasks. Additionally, many existing KV-Cache sparsity methods rely on inaccurate proxies for token importance. To address these limitations, we introduce TokenButler, a high-granularity, query-aware predictor that learns to identify these critical tokens. By training a light-weight predictor with less than 1.2% parameter overhead, TokenButler prioritizes tokens based on their contextual, predicted importance. This improves perplexity and downstream accuracy by over 8% relative to SoTA methods for estimating token importance. We evaluate TokenButler on a novel synthetic small-context co-referential retrieval task, demonstrating near-oracle accuracy. Code, models and benchmarks: this https URL

11. 【2503.07513】Language Models Fail to Introspect About Their Knowledge of Language

Link: https://arxiv.org/abs/2503.07513

Authors: Siyuan Song,Jennifer Hu,Kyle Mahowald

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models, large language, knowledge, LLMs, internal states

Abstract:There has been recent interest in whether large language models (LLMs) can introspect about their own internal states. Such abilities would make LLMs more interpretable, and also validate the use of standard introspective methods in linguistics to evaluate grammatical knowledge in models (e.g., asking "Is this sentence grammatical?"). We systematically investigate emergent introspection across 21 open-source LLMs, in two domains where introspection is of theoretical interest: grammatical knowledge and word prediction. Crucially, in both domains, a model's internal linguistic knowledge can be theoretically grounded in direct measurements of string probability. We then evaluate whether models' responses to metalinguistic prompts faithfully reflect their internal knowledge. We propose a new measure of introspection: the degree to which a model's prompted responses predict its own string probabilities, beyond what would be predicted by another model with nearly identical internal knowledge. While both metalinguistic prompting and probability comparisons lead to high task accuracy, we do not find evidence that LLMs have privileged "self-access". Our findings complicate recent results suggesting that models can introspect, and add new evidence to the argument that prompted responses should not be conflated with models' linguistic generalizations.

12. 【2503.07510】Sometimes the Model doth Preach: Quantifying Religious Bias in Open LLMs through Demographic Analysis in Asian Nations

Link: https://arxiv.org/abs/2503.07510

Authors: Hari Shankar,Vedanta S P,Tejas Cavale,Ponnurangam Kumaraguru,Abhijnan Chakraborty

Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, propagating bias unknowingly, non-diverse data collection, Language Models

Abstract:Large Language Models (LLMs) are capable of generating opinions and propagating bias unknowingly, originating from unrepresentative and non-diverse data collection. Prior research has analysed these opinions with respect to the West, particularly the United States. However, insights thus produced may not be generalized in non-Western populations. With the widespread usage of LLM systems by users across several different walks of life, the cultural sensitivity of each generated output is of crucial interest. Our work proposes a novel method that quantitatively analyzes the opinions generated by LLMs, improving on previous work with regards to extracting the social demographics of the models. Our method measures the distance from an LLM's response to survey respondents, through Hamming Distance, to infer the demographic characteristics reflected in the model's outputs. We evaluate modern, open LLMs such as Llama and Mistral on surveys conducted in various global south countries, with a focus on India and other Asian nations, specifically assessing the model's performance on surveys related to religious tolerance and identity. Our analysis reveals that most open LLMs match a single homogeneous profile, varying across different countries/territories, which in turn raises questions about the risks of LLMs promoting a hegemonic worldview, and undermining perspectives of different minorities. Our framework may also be useful for future research investigating the complex intersection between training data, model architecture, and the resulting biases reflected in LLM outputs, particularly concerning sensitive topics like religious tolerance and identity.
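The distance measure mentioned in this abstract can be sketched in a few lines. This is a minimal illustration, assuming that the model and each survey respondent answer the same ordered list of multiple-choice questions; the respondent IDs, answer labels, and helper names below are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Sequence, Tuple


def hamming_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the positions at which two equal-length answer vectors differ."""
    if len(a) != len(b):
        raise ValueError("answer vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def closest_respondents(
    model_answers: Sequence[str],
    respondents: Dict[str, Sequence[str]],
) -> Tuple[int, List[str]]:
    """Return the smallest distance and the IDs of all respondents at that distance."""
    distances = {rid: hamming_distance(model_answers, ans) for rid, ans in respondents.items()}
    best = min(distances.values())
    return best, sorted(rid for rid, d in distances.items() if d == best)


# Hypothetical toy survey with three questions.
respondents = {
    "r1": ["agree", "no", "yes"],
    "r2": ["disagree", "no", "yes"],
    "r3": ["disagree", "yes", "no"],
}
model_answers = ["disagree", "no", "no"]
best, ids = closest_respondents(model_answers, respondents)
print(best, ids)
```

The demographic attributes of the nearest respondents could then be aggregated to describe which profile the model's answers resemble; how the paper performs that aggregation is not stated in the abstract.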

13. 【2503.07459】MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning

Link: https://arxiv.org/abs/2503.07459

Authors: Xiangru Tang,Daniel Shao,Jiwoong Sohn,Jiapeng Chen,Jiayi Zhang,Jinyu Xiang,Fang Wu,Yilun Zhao,Chenglin Wu,Wenqi Shi,Arman Cohan,Mark Gerstein

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, shown impressive performance, Language Models, shown impressive

Abstract:Large Language Models (LLMs) have shown impressive performance on existing medical question-answering benchmarks. This high performance makes it increasingly difficult to meaningfully evaluate and differentiate advanced methods. We present MedAgentsBench, a benchmark that focuses on challenging medical questions requiring multi-step clinical reasoning, diagnosis formulation, and treatment planning, scenarios where current models still struggle despite their strong performance on standard tests. Drawing from seven established medical datasets, our benchmark addresses three key limitations in existing evaluations: (1) the prevalence of straightforward questions where even base models achieve high performance, (2) inconsistent sampling and evaluation protocols across studies, and (3) lack of systematic analysis of the interplay between performance, cost, and inference time. Through experiments with various base models and reasoning methods, we demonstrate that the latest thinking models, DeepSeek R1 and OpenAI o3, exhibit exceptional performance in complex medical reasoning tasks. Additionally, advanced search-based agent methods offer promising performance-to-cost ratios compared to traditional approaches. Our analysis reveals substantial performance gaps between model families on complex questions and identifies optimal model selections for different computational constraints. Our benchmark and evaluation framework are publicly available at this https URL.

14. 【2503.07457】LLMs syntactically adapt their language use to their conversational partner

Link: https://arxiv.org/abs/2503.07457

Authors: Florian Kandra,Vera Demberg,Alexander Koller

Categories: Computation and Language (cs.CL)

Keywords: human speakers align, frequently observed, observed that human, human speakers, speakers align

Comments: 4 pages, 1 table, 1 figure, submitted to ACL

Abstract:It has been frequently observed that human speakers align their language use with each other during conversations. In this paper, we study empirically whether large language models (LLMs) exhibit the same behavior of conversational adaptation. We construct a corpus of conversations between LLMs and find that two LLM agents end up making more similar syntactic choices as conversations go on, confirming that modern LLMs adapt their language use to their conversational partners in at least a rudimentary way.

15. 【2503.07453】Is a Good Foundation Necessary for Efficient Reinforcement Learning? The Computational Role of the Base Model in Exploration

Link: https://arxiv.org/abs/2503.07453

Authors: Dylan J. Foster,Zakaria Mhammedi,Dhruv Rohatgi

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Statistics Theory (math.ST)

Keywords: Language model alignment, language models, model, exploration

Abstract:Language model alignment (or, reinforcement learning) techniques that leverage active exploration -- deliberately encouraging the model to produce diverse, informative responses -- offer the promise of super-human capabilities. However, current understanding of algorithm design primitives for computationally efficient exploration with language models is limited. To better understand how to leverage access to powerful pre-trained generative models to improve the efficiency of exploration, we introduce a new computational framework for RL with language models, in which the learner interacts with the model through a sampling oracle. Focusing on the linear softmax model parameterization, we provide new results that reveal the computational-statistical tradeoffs of efficient exploration: 1. Necessity of coverage: Coverage refers to the extent to which the pre-trained model covers near-optimal responses -- a form of hidden knowledge. We show that coverage, while not necessary for data efficiency, lower bounds the runtime of any algorithm in our framework. 2. Inference-time exploration: We introduce a new algorithm, SpannerSampling, which obtains optimal data efficiency and is computationally efficient whenever the pre-trained model enjoys sufficient coverage, matching our lower bound. SpannerSampling leverages inference-time computation with the pre-trained model to reduce the effective search space for exploration. 3. Insufficiency of training-time interventions: We contrast the result above by showing that training-time interventions that produce proper policies cannot achieve similar guarantees in polynomial time. 4. Computational benefits of multi-turn exploration: Finally, we show that under additional representational assumptions, one can achieve improved runtime (replacing sequence-level coverage with token-level coverage) through multi-turn exploration.

Information Retrieval

1. 【2503.07584】Talking to GDELT Through Knowledge Graphs

Link: https://arxiv.org/abs/2503.07584

Authors: Audun Myers,Max Vargas,Sinan G. Aksoy,Cliff Joslyn,Benjamin Wilson,Tom Grimes

Categories: Information Retrieval (cs.IR)

Keywords: Retrieval Augmented Regeneration, Augmented Regeneration, Retrieval Augmented, work we study, strengths and weaknesses

Abstract:In this work we study various Retrieval Augmented Regeneration (RAG) approaches to gain an understanding of the strengths and weaknesses of each approach in a question-answering analysis. To gain this understanding we use a case-study subset of the Global Database of Events, Language, and Tone (GDELT) dataset as well as a corpus of raw text scraped from the online news articles. To retrieve information from the text corpus we implement a traditional vector store RAG as well as state-of-the-art large language model (LLM) based approaches for automatically constructing KGs and retrieving the relevant subgraphs. In addition to these corpus approaches, we develop a novel ontology-based framework for constructing knowledge graphs (KGs) from GDELT directly which leverages the underlying schema of GDELT to create structured representations of global events. For retrieving relevant information from the ontology-based KGs we implement both direct graph queries and state-of-the-art graph retrieval approaches. We compare the performance of each method in a question-answering task. We find that while our ontology-based KGs are valuable for question-answering, automated extraction of the relevant subgraphs is challenging. Conversely, LLM-generated KGs, while capturing event summaries, often lack consistency and interpretability. Our findings suggest benefits of a synergistic approach between ontology and LLM-based KG construction, with proposed avenues toward that end.

2. 【2503.07520】From Limited Labels to Open Domains: An Efficient Learning Paradigm for UAV-view Geo-Localization

Link: https://arxiv.org/abs/2503.07520

Authors: Zhongwei Chen,Zhao-Xu Yang,Hai-Jun Rong,Jiawei Lang

Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Keywords: Traditional UAV-view Geo-Localization, positive sample selection, Traditional UAV-view, learn cross-view domain-invariant, cross-view domain-invariant representations

Abstract:Traditional UAV-view Geo-Localization (UVGL) supervised paradigms are constrained by the strict reliance on paired data for positive sample selection, which limits their ability to learn cross-view domain-invariant representations from unpaired data. Moreover, it is necessary to reconstruct the pairing relationship with expensive re-labeling costs for scenario-specific training when deploying in a new domain, which fails to meet the practical demands of open-environment applications. To address this issue, we propose a novel cross-domain invariance knowledge transfer network (CDIKTNet), which comprises a cross-domain invariance sub-network and a cross-domain transfer sub-network to realize a closed-loop framework of invariance feature learning and knowledge transfer. The cross-domain invariance sub-network is utilized to construct an essentially shared feature space across domains by learning structural invariance and spatial invariance in cross-view features. Meanwhile, the cross-domain transfer sub-network uses these invariant features as anchors and employs a dual-path contrastive memory learning mechanism to mine latent cross-domain correlation patterns in unpaired data. Extensive experiments demonstrate that our method achieves state-of-the-art performance under fully supervised conditions. More importantly, with merely 2% paired data, our method exhibits performance comparable to existing supervised paradigms and possesses the ability to transfer directly to qualify for applications in the other scenarios completely without any prior pairing relationship.

3. 【2503.07519】GRITHopper: Decomposition-Free Multi-Hop Dense Retrieval

Link: https://arxiv.org/abs/2503.07519

Authors: Justus-Jonas Erker,Nils Reimers,Iryna Gurevych

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords: Decomposition-based multi-hop retrieval, Decomposition-based multi-hop, retrieval methods rely, complex queries, computationally expensive

Comments: Under Review at ACL Rolling Review (ARR)

Click to view abstract

4. 【2503.07470】Advancing Vietnamese Information Retrieval with Learning Objective and Benchmark

Link: https://arxiv.org/abs/2503.07470

Authors: Phu-Vinh Nguyen,Minh-Nam Tran,Long Nguyen,Dien Dinh

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: natural language processing, rapid development, invented for multiple, Vietnamese, language processing

Comments:

Click to view abstract

Abstract:With the rapid development of natural language processing, many language models have been invented for multiple tasks. One important task is information retrieval (IR), which requires models to retrieve relevant documents. Despite its importance in many real-life applications, especially in retrieval augmented generation (RAG) systems, this task lacks Vietnamese benchmarks. This situation causes difficulty in assessing and comparing many existing Vietnamese embedding language models on the task and slows down the advancement of Vietnamese natural language processing (NLP) research. In this work, we aim to provide the Vietnamese research community with a new benchmark for information retrieval, which mainly focuses on retrieval and reranking tasks. Furthermore, we also present a new objective function based on the InfoNCE loss function, which is used to train our Vietnamese embedding model. Our function aims to outperform the original InfoNCE loss on information retrieval tasks. Finally, we analyze the effect of temperature, a hyper-parameter in both objective functions, on the performance of text embedding models.
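
To make the role of temperature in the baseline concrete, here is a minimal sketch of the standard InfoNCE loss for a single query, written in plain Python. It is not the paper's new objective, which the abstract does not specify. The function names `cosine` and `info_nce`, the toy vectors, and the default temperature of 0.05 are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(query, positive, negatives, temperature=0.05):
    """Standard InfoNCE: negative log-softmax of the positive's scaled similarity."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    largest = max(logits)
    log_denominator = largest + math.log(sum(math.exp(x - largest) for x in logits))
    return log_denominator - logits[0]


# Hypothetical 2-D embeddings: one query, one relevant document, two irrelevant ones.
query = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [0.5, 0.5]]

for t in (0.05, 0.5, 1.0):
    print(t, round(info_nce(query, positive, negatives, t), 4))
```

Lower temperatures sharpen the softmax over similarities, so the same embeddings can produce very different loss values, which is one reason the hyper-parameter is worth analyzing.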

5. 【2503.07377】Process-Supervised LLM Recommenders via Flow-guided Tuning

Link: https://arxiv.org/abs/2503.07377

Authors: Chongming Gao,Mengyao Gao,Chenxiao Fan,Shuai Yuan,Wentao Shi,Xiangnan He

Categories: Information Retrieval (cs.IR)

Keywords: large language models, likelihood maximization objective, Generative Flow Network, approach amplifies popularity, language models

Comments:

Click to view abstract

Abstract:While large language models (LLMs) are increasingly adapted for recommendation systems via supervised fine-tuning (SFT), this approach amplifies popularity bias due to its likelihood maximization objective, compromising recommendation diversity and fairness. To address this, we present Flow-guided fine-tuning recommender (Flower), which replaces SFT with a Generative Flow Network (GFlowNet) framework that enacts process supervision through token-level reward propagation. Flower's key innovation lies in decomposing item-level rewards into constituent token rewards, enabling direct alignment between token generation probabilities and their reward signals. This mechanism achieves three critical advancements: (1) popularity bias mitigation and fairness enhancement through empirical distribution matching, (2) preservation of diversity through GFlowNet's proportional sampling, and (3) flexible integration of personalized preferences via adaptable token rewards. Experiments demonstrate Flower's superior distribution-fitting capability and its significant advantages over traditional SFT in terms of fairness, diversity, and accuracy, highlighting its potential to improve LLM-based recommendation systems. The implementation is available via this https URL

6. 【2503.07037】Zero-Shot Hashing Based on Reconstruction With Part Alignment

Link: https://arxiv.org/abs/2503.07037

Authors: Yan Jiang,Zhongmiao Qi,Jianhao Li,Jiangbo Qian,Chong Wang,Yu Xin

Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Keywords: Zero-shot hashing algorithms, large-scale image retrieval, unseen class data, Hashing algorithms, class data

Comments:

Click to view abstract

Abstract:Hashing algorithms have been widely used in large-scale image retrieval tasks, especially for seen class data. Zero-shot hashing algorithms have been proposed to handle unseen class data. The key technique in these algorithms involves learning features from seen classes and transferring them to unseen classes, that is, aligning the feature embeddings between the seen and unseen classes. Most existing zero-shot hashing algorithms use the shared attributes between the two classes of interest to complete alignment tasks. However, the attributes are always described for a whole image, even though they represent specific parts of the image. Hence, these methods ignore the importance of aligning attributes with the corresponding image parts, which explicitly introduces noise and reduces the accuracy achieved when aligning the features of seen and unseen classes. To address this problem, we propose a new zero-shot hashing method called RAZH. We first use a clustering algorithm to group similar patches to image parts for attribute matching and then replace the image parts with the corresponding attribute vectors, gradually aligning each part with its nearest attribute. Extensive evaluation results demonstrate the superiority of the RAZH method over several state-of-the-art methods.

7. 【2503.07025】Weak Supervision for Improved Precision in Search Systems

Link: https://arxiv.org/abs/2503.07025

Authors: Sriram Vasudevan

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: deep learning models, supervised learning methods, power deep learning, modern search engines, Labeled datasets

Comments: Accepted to the AAAI 2025 Workshop on Computational Jobs Marketplace

Click to view abstract

Abstract:Labeled datasets are essential for modern search engines, which increasingly rely on supervised learning methods like Learning to Rank and massive amounts of data to power deep learning models. However, creating these datasets is both time-consuming and costly, leading to the common use of user click and activity logs as proxies for relevance. In this paper, we present a weak supervision approach to infer the quality of query-document pairs and apply it within a Learning to Rank framework to enhance the precision of a large-scale search system.

8. 【2503.06963】Multi-Behavior Recommender Systems: A Survey

Link: https://arxiv.org/abs/2503.06963

Authors: Kyungho Kim,Sunwoo Kim,Geon Lee,Jinhong Jung,Kijung Shin

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords: Traditional recommender systems, predict user preferences, Traditional recommender, systems primarily rely, Multi-behavior recommender systems

Comments: Accepted in the PAKDD 2025 Survey Track

Click to view abstract

Abstract:Traditional recommender systems primarily rely on a single type of user-item interaction, such as item purchases or ratings, to predict user preferences. However, in real-world scenarios, users engage in a variety of behaviors, such as clicking on items or adding them to carts, offering richer insights into their interests. Multi-behavior recommender systems leverage these diverse interactions to enhance recommendation quality, and research on this topic has grown rapidly in recent years. This survey provides a timely review of multi-behavior recommender systems, focusing on three key steps: (1) Data Modeling: representing multi-behaviors at the input level, (2) Encoding: transforming these inputs into vector representations (i.e., embeddings), and (3) Training: optimizing machine-learning models. We systematically categorize existing multi-behavior recommender systems based on the commonalities and differences in their approaches across the above steps. Additionally, we discuss promising future directions for advancing multi-behavior recommender systems.

9. 【2503.06920】AlignPxtr: Aligning Predicted Behavior Distributions for Bias-Free Video Recommendations

Link: https://arxiv.org/abs/2503.06920

Authors: Chengzhi Lin,Chuyuan Wang,Annan Xie,Wuhong Wang,Ziye Zhang,Canguang Ruan,Yuancai Huang,Yongqi Liu

Categories: Information Retrieval (cs.IR)

Keywords: infer user interest, user interest, user, biases, video recommendation systems

Comments: video recommendation. 7 pages, 1 figure

Click to view abstract

Abstract:In video recommendation systems, user behaviors such as watch time, likes, and follows are commonly used to infer user interest. However, these behaviors are influenced by various biases, including duration bias, demographic biases, and content category biases, which obscure true user preferences. In this paper, we hypothesize that biases and user interest are independent of each other. Based on this assumption, we propose a novel method that aligns predicted behavior distributions across different bias conditions using quantile mapping, theoretically guaranteeing zero mutual information between bias variables and the true user interest. By explicitly modeling the conditional distributions of user behaviors under different biases and mapping these behaviors to quantiles, we effectively decouple user interest from the confounding effects of various biases. Our approach uniquely handles both continuous signals (e.g., watch time) and discrete signals (e.g., likes, comments), while simultaneously addressing multiple bias dimensions. Additionally, we introduce a computationally efficient mean alignment alternative technique for practical real-time inference in large-scale systems. We validate our method through online A/B testing on two major video platforms: Kuaishou Lite and Kuaishou. The results demonstrate significant improvements in user engagement and retention, with cumulative lifts of 0.267% and 0.115% in active days, and 1.102% and 0.131% in average app usage time, respectively. The results demonstrate that our approach consistently achieves significant improvements in long-term user retention and substantial gains in average app usage time across different platforms. Our core code will be published at this https URL.
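
The core idea of quantile mapping can be sketched in a few lines: a raw behavior value is replaced by its rank within the distribution of its own bias group, so values from different groups become comparable. This is a simplified illustration on observed values rather than the paper's predicted distributions. The helper `to_quantile`, the duration buckets, and the watch times are hypothetical.

```python
from bisect import bisect_right


def to_quantile(value, group_values):
    """Empirical CDF: fraction of the group's observations that are <= value."""
    ordered = sorted(group_values)
    return bisect_right(ordered, value) / len(ordered)


# Hypothetical watch times in seconds, grouped by video-duration bucket (the bias variable).
watch_times = {
    "short": [3, 5, 8, 10, 12],
    "long": [30, 60, 90, 120, 300],
}

# 10 s on a short video and 120 s on a long video sit at the same relative position.
print(to_quantile(10, watch_times["short"]))
print(to_quantile(120, watch_times["long"]))
```

Comparing quantiles instead of raw seconds removes the advantage that long videos would otherwise have simply because they allow more watch time.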

10. 【2503.06489】Improving Access to Trade and Investment Information in Thailand through Intelligent Document Retrieval

Link: https://arxiv.org/abs/2503.06489

Authors: Sirinda Palahan

Categories: Information Retrieval (cs.IR); Social and Information Networks (cs.SI)

Keywords: Overseas investment, daunting for beginners, beginners due, vast amount, amount of complex

Comments:

Click to view abstract

Abstract:Overseas investment and trade can be daunting for beginners due to the vast amount of complex information. This paper presents a chatbot system that integrates natural language processing and information retrieval techniques to simplify the document retrieval process. The proposed system identifies the most relevant content, enabling users to navigate the intricate landscape of foreign trade and investment more efficiently. Our methodology combines the BM25 model and a deep learning model to rank and retrieve documents, aiming to reduce noise in the document content and enhance the accuracy of the results. Experiments with Thai natural language queries have demonstrated the effectiveness of our system in retrieving pertinent documents. A user satisfaction survey further validated the system's effectiveness. Most respondents found the system helpful and agreed with the suggested documents, indicating its potential as a valuable tool for Thai entrepreneurs navigating foreign trade and investment.
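
As a rough illustration of the lexical half of such a pipeline, the sketch below scores tokenized documents with a common BM25 variant. The paper's deep learning ranker and its Thai tokenization are not described in the abstract and are omitted. The function name `bm25_scores`, the parameters `k1=1.5` and `b=0.75`, and the toy English documents are assumptions made for this example.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for doc in docs_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Non-negative IDF variant (the "+1 inside the log" form).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            tf = term_freq[term]
            denom = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores


# Hypothetical, already-tokenized documents.
docs = [
    ["export", "tax", "thailand"],
    ["investment", "visa"],
    ["export", "license", "export"],
]
scores = bm25_scores(["export", "tax"], docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

In a hybrid system, a ranking like this would typically serve as a candidate list or as one score among several, with a learned model refining the final order.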

11. 【2503.06474】HuixiangDou2: A Robustly Optimized GraphRAG Approach

Link: https://arxiv.org/abs/2503.06474

Authors: Huanjun Kong,Zhefan Wang,Chenyang Wang,Zhe Ma,Nanqing Dong

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords:

Comments: 11 pages

Click to view abstract

None

12. 【2503.06430】Graph Retrieval-Augmented LLM for Conversational Recommendation Systems

Link: https://arxiv.org/abs/2503.06430

Authors: Zhangchi Qiu,Linhao Luo,Zicheng Zhao,Shirui Pan,Alan Wee-Chung Liew

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords:

Comments: Accepted by PAKDD 2025

Click to view abstract

None

13. 【2503.06238】Image is All You Need: Towards Efficient and Effective Large Language Model-Based Recommender Systems

Link: https://arxiv.org/abs/2503.06238

Authors: Kibum Kim,Sein Kim,Hongseok Kang,Jiwan Kim,Heewoong Noh,Yeonjun In,Kanghoon Yoon,Jinoh Oh,Chanyoung Park

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)

Keywords:

Comments:

Click to view abstract

None

14. 【2503.06034】Rank-R1: Enhancing Reasoning in LLM-based Document Rerankers via Reinforcement Learning

Link: https://arxiv.org/abs/2503.06034

Authors: Shengyao Zhuang,Xueguang Ma,Bevan Koopman,Jimmy Lin,Guido Zuccon

Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)

Keywords:

Comments:

Click to view abstract

None

Computer Vision

1. 【2503.07608】AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning

Link: https://arxiv.org/abs/2503.07608

Authors: Bo Jiang,Shaoyu Chen,Qian Zhang,Wenyu Liu,Xinggang Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: surpass human expert-level, human expert-level performance, mathematics and science, reinforcement learning, crucial role

Comments: Project Page: [this https URL](https://github.com/hustvl/AlphaDrive)

Click to view abstract

Abstract:OpenAI o1 and DeepSeek R1 achieve or even surpass human expert-level performance in complex domains like mathematics and science, with reinforcement learning (RL) and reasoning playing a crucial role. In autonomous driving, recent end-to-end models have greatly improved planning performance but still struggle with long-tailed problems due to limited common sense and reasoning abilities. Some studies integrate vision-language models (VLMs) into autonomous driving, but they typically rely on pre-trained models with simple supervised fine-tuning (SFT) on driving data, without further exploration of training strategies or optimizations specifically tailored for planning. In this paper, we propose AlphaDrive, an RL and reasoning framework for VLMs in autonomous driving. AlphaDrive introduces four GRPO-based RL rewards tailored for planning and employs a two-stage planning reasoning training strategy that combines SFT with RL. As a result, AlphaDrive significantly improves both planning performance and training efficiency compared to using only SFT or without reasoning. Moreover, we are also excited to discover that, following RL training, AlphaDrive exhibits some emergent multimodal planning capabilities, which is critical for improving driving safety and efficiency. To the best of our knowledge, AlphaDrive is the first to integrate GRPO-based RL with planning reasoning into autonomous driving. Code will be released to facilitate future research.

2. 【2503.07607】VoD: Learning Volume of Differences for Video-Based Deepfake Detection

Link: https://arxiv.org/abs/2503.07607

Authors: Ying Xu,Marius Pedersen,Kiran Raja

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: digital contact landscape, digital media integrity, creating realistic Deepfake, poses substantial challenges, contact landscape

Comments:

Click to view abstract

Abstract:The rapid development of deep learning and generative AI technologies has profoundly transformed the digital contact landscape, creating realistic Deepfake that poses substantial challenges to public trust and digital media integrity. This paper introduces a novel Deepfake detection framework, Volume of Differences (VoD), designed to enhance detection accuracy by exploiting temporal and spatial inconsistencies between consecutive video frames. VoD employs a progressive learning approach that captures differences across multiple axes through the use of consecutive frame differences (CFD) and a network with stepwise expansions. We evaluate our approach with intra-dataset and cross-dataset testing scenarios on various well-known Deepfake datasets. Our findings demonstrate that VoD excels with the data it has been trained on and shows strong adaptability to novel, unseen data. Additionally, comprehensive ablation studies examine various configurations of segment length, sampling steps, and intervals, offering valuable insights for optimizing the framework. The code for our VoD framework is available at this https URL.
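
A minimal sketch of consecutive frame differences, assuming grayscale frames stored as nested lists and a plain signed subtraction. The abstract does not give the exact formulation VoD uses, so the function name `consecutive_frame_differences` and the toy frames are illustrative only.

```python
def consecutive_frame_differences(frames):
    """Return frames[t + 1] - frames[t], element-wise, for each consecutive pair."""
    diffs = []
    for prev_frame, next_frame in zip(frames, frames[1:]):
        diff = [
            [b - a for a, b in zip(prev_row, next_row)]
            for prev_row, next_row in zip(prev_frame, next_frame)
        ]
        diffs.append(diff)
    return diffs


# Three hypothetical 2x2 grayscale frames.
frames = [
    [[0, 0], [0, 0]],
    [[1, 0], [0, 2]],
    [[1, 5], [0, 2]],
]

# A clip of T frames yields T - 1 difference maps.
for diff in consecutive_frame_differences(frames):
    print(diff)
```

Stacking the resulting maps along the time axis gives a volume in which static content cancels out and frame-to-frame changes remain, which is the kind of signal a temporal-inconsistency detector can learn from.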

3. 【2503.07603】Should VLMs be Pre-trained with Image Data?

Link: https://arxiv.org/abs/2503.07603

Authors: Sedrick Keh,Jean Mercat,Samir Yitzhak Gadre,Kushal Arora,Igor Vasiljevic,Benjamin Burchfiel,Shuran Song,Russ Tedrake,Thomas Kollar,Ludwig Schmidt,Achal Dave

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Pre-trained LLMs, tasks, Abstract, vision-language, vision-language tasks

Comments: ICLR 2025

Click to view abstract

Abstract:Pre-trained LLMs that are further trained with image data perform well on vision-language tasks. While adding images during a second training phase effectively unlocks this capability, it is unclear how much of a gain or loss this two-step pipeline gives over VLMs which integrate images earlier into the training process. To investigate this, we train models spanning various datasets, scales, image-text ratios, and amount of pre-training done before introducing vision tokens. We then fine-tune these models and evaluate their downstream performance on a suite of vision-language and text-only tasks. We find that pre-training with a mixture of image and text data allows models to perform better on vision-language tasks while maintaining strong performance on text-only evaluations. On an average of 6 diverse tasks, we find that for a 1B model, introducing visual tokens 80% of the way through pre-training results in a 2% average improvement over introducing visual tokens to a fully pre-trained model.

4. 【2503.07602】DreamRelation: Relation-Centric Video Customization

Link: https://arxiv.org/abs/2503.07602

Authors: Yujie Wei,Shiwei Zhang,Hangjie Yuan,Biao Gong,Longxiang Tang,Xiang Wang,Haonan Qiu,Hengjia Li,Shuai Tan,Yingya Zhang,Hongming Shan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Relational video customization, Relational Dynamics Enhancement, Relational Decoupling Learning, real-world visual content, comprehending real-world visual

Comments: Project Page: [this https URL](https://dreamrelation.github.io)

Click to view abstract

Abstract:Relational video customization refers to the creation of personalized videos that depict user-specified relations between two subjects, a crucial task for comprehending real-world visual content. While existing methods can personalize subject appearances and motions, they still struggle with complex relational video customization, where precise relational modeling and high generalization across subject categories are essential. The primary challenge arises from the intricate spatial arrangements, layout variations, and nuanced temporal dynamics inherent in relations; consequently, current models tend to overemphasize irrelevant visual details rather than capturing meaningful interactions. To address these challenges, we propose DreamRelation, a novel approach that personalizes relations through a small set of exemplar videos, leveraging two key components: Relational Decoupling Learning and Relational Dynamics Enhancement. First, in Relational Decoupling Learning, we disentangle relations from subject appearances using relation LoRA triplet and hybrid mask training strategy, ensuring better generalization across diverse relationships. Furthermore, we determine the optimal design of relation LoRA triplet by analyzing the distinct roles of the query, key, and value features within MM-DiT's attention mechanism, making DreamRelation the first relational video generation framework with explainable components. Second, in Relational Dynamics Enhancement, we introduce space-time relational contrastive loss, which prioritizes relational dynamics while minimizing the reliance on detailed subject appearances. Extensive experiments demonstrate that DreamRelation outperforms state-of-the-art methods in relational video customization. Code and models will be made publicly available.

5. 【2503.07601】Balanced Image Stylization with Style Matching Score

Link: https://arxiv.org/abs/2503.07601

Authors: Yuxin Jiang,Liming Jiang,Shuai Yang,Jia-Wei Liu,Ivor Tsang,Mike Zheng Shou

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: present Style Matching, Style Matching Score, Progressive Spectrum Regularization, Style, style distribution matching

Comments: Project page: [this https URL](https://yuxinn-j.github.io/projects/SMS.html)

Click to view abstract

Abstract:We present Style Matching Score (SMS), a novel optimization method for image stylization with diffusion models. Balancing effective style transfer with content preservation is a long-standing challenge. Unlike existing efforts, our method reframes image stylization as a style distribution matching problem. The target style distribution is estimated from off-the-shelf style-dependent LoRAs via carefully designed score functions. To preserve content information adaptively, we propose Progressive Spectrum Regularization, which operates in the frequency domain to guide stylization progressively from low-frequency layouts to high-frequency details. In addition, we devise a Semantic-Aware Gradient Refinement technique that leverages relevance maps derived from diffusion semantic priors to selectively stylize semantically important regions. The proposed optimization formulation extends stylization from pixel space to parameter space, readily applicable to lightweight feedforward generators for efficient one-step stylization. SMS effectively balances style alignment and content preservation, outperforming state-of-the-art approaches, verified by extensive experiments.

6. 【2503.07598】VACE: All-in-One Video Creation and Editing

Link: https://arxiv.org/abs/2503.07598

Authors: Zeyinzi Jiang,Zhen Han,Chaojie Mao,Jingfeng Zhang,Yulin Pan,Yu Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion Transformer, demonstrated powerful capability, Transformer has demonstrated, generating high-quality images, demonstrated powerful

Comments:

Click to view abstract

Abstract:Diffusion Transformer has demonstrated powerful capability and scalability in generating high-quality images and videos. Further pursuing the unification of generation and editing tasks has yielded significant progress in the domain of image content creation. However, due to the intrinsic demands for consistency across both temporal and spatial dynamics, achieving a unified approach for video synthesis remains challenging. We introduce VACE, which enables users to perform Video tasks within an All-in-one framework for Creation and Editing. These tasks include reference-to-video generation, video-to-video editing, and masked video-to-video editing. Specifically, we effectively integrate the requirements of various tasks by organizing video task inputs, such as editing, reference, and masking, into a unified interface referred to as the Video Condition Unit (VCU). Furthermore, by utilizing a Context Adapter structure, we inject different task concepts into the model using formalized representations of temporal and spatial dimensions, allowing it to handle arbitrary video synthesis tasks flexibly. Extensive experiments demonstrate that the unified model of VACE achieves performance on par with task-specific models across various subtasks. Simultaneously, it enables diverse applications through versatile task combinations. Project page: this https URL.

7. 【2503.07597】HumanMM: Global Human Motion Recovery from Multi-shot Videos

Link: https://arxiv.org/abs/2503.07597

Authors: Yuhong Zhang,Guanlin Wu,Ling-Hao Chen,Zhuokai Zhao,Jing Lin,Xiaoke Jiang,Jiamin Wu,Zhuoheng Li,Hao Frank Yang,Haoqian Wang,Lei Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: multiple shot transitions, framework designed, designed to reconstruct, shot transitions, reconstruct long-sequence

Comments: CVPR 2025; Project page: [this https URL](https://zhangyuhong01.github.io/HumanMM/)

Click to view abstract

Abstract:In this paper, we present a novel framework designed to reconstruct long-sequence 3D human motion in the world coordinates from in-the-wild videos with multiple shot transitions. Such long-sequence in-the-wild motions are highly valuable to applications such as motion generation and motion understanding, but are of great challenge to be recovered due to abrupt shot transitions, partial occlusions, and dynamic backgrounds presented in such videos. Existing methods primarily focus on single-shot videos, where continuity is maintained within a single camera view, or simplify multi-shot alignment in camera space only. In this work, we tackle the challenges by integrating an enhanced camera pose estimation with Human Motion Recovery (HMR) by incorporating a shot transition detector and a robust alignment module for accurate pose and orientation continuity across shots. By leveraging a custom motion integrator, we effectively mitigate the problem of foot sliding and ensure temporal consistency in human pose. Extensive evaluations on our created multi-shot dataset from public 3D human datasets demonstrate the robustness of our method in reconstructing realistic human motion in world coordinates.

8. 【2503.07593】Hierarchical Cross-Modal Alignment for Open-Vocabulary 3D Object Detection

Link: https://arxiv.org/abs/2503.07593

Authors: Youjun Zhao,Jiaying Lin,Rynson W.H. Lau

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: aims at localizing, closed sets, localizing and classifying, object detection, Open-vocabulary

Comments: AAAI 2025 (Extended Version). Project Page: [this https URL](https://youjunzhao.github.io/HCMA/)

Click to view abstract

Abstract:Open-vocabulary 3D object detection (OV-3DOD) aims at localizing and classifying novel objects beyond closed sets. The recent success of vision-language models (VLMs) has demonstrated their remarkable capabilities to understand open vocabularies. Existing works that leverage VLMs for 3D object detection (3DOD) generally resort to representations that lose the rich scene context required for 3D perception. To address this problem, we propose in this paper a hierarchical framework, named HCMA, to simultaneously learn local object and global scene information for OV-3DOD. Specifically, we first design a Hierarchical Data Integration (HDI) approach to obtain coarse-to-fine 3D-image-text data, which is fed into a VLM to extract object-centric knowledge. To facilitate the association of feature hierarchies, we then propose an Interactive Cross-Modal Alignment (ICMA) strategy to establish effective intra-level and inter-level feature connections. To better align features across different levels, we further propose an Object-Focusing Context Adjustment (OFCA) module to refine multi-level features by emphasizing object-related features. Extensive experiments demonstrate that the proposed method outperforms SOTA methods on the existing OV-3DOD benchmarks. It also achieves promising OV-3DOD results even without any 3D annotations.

9. 【2503.07591】Filter Images First, Generate Instructions Later: Pre-Instruction Data Selection for Visual Instruction Tuning

Link: https://arxiv.org/abs/2503.07591

Authors: Bardia Safaei,Faizan Siddiqui,Jiacong Xu,Vishal M. Patel,Shao-Yuan Lo

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Visual instruction tuning, large vision-language models, VIT, image-instruction pairs, Visual instruction

Comments: Accepted at Computer Vision and Pattern Recognition Conference (CVPR) 2025

Click to view abstract

Abstract:Visual instruction tuning (VIT) for large vision-language models (LVLMs) requires training on expansive datasets of image-instruction pairs, which can be costly. Recent efforts in VIT data selection aim to select a small subset of high-quality image-instruction pairs, reducing VIT runtime while maintaining performance comparable to full-scale training. However, a major challenge often overlooked is that generating instructions from unlabeled images for VIT is highly expensive. Most existing VIT datasets rely heavily on human annotations or paid services like the GPT API, which limits users with constrained resources from creating VIT datasets for custom applications. To address this, we introduce Pre-Instruction Data Selection (PreSel), a more practical data selection paradigm that directly selects the most beneficial unlabeled images and generates instructions only for the selected images. PreSel first estimates the relative importance of each vision task within VIT datasets to derive task-wise sampling budgets. It then clusters image features within each task, selecting the most representative images with the budget. This approach reduces computational overhead for both instruction generation during VIT data formation and LVLM fine-tuning. By generating instructions for only 15% of the images, PreSel achieves performance comparable to full-data VIT on the LLaVA-1.5 and Vision-Flan datasets. The link to our project page: this https URL

10. 【2503.07588】When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning

Link: https://arxiv.org/abs/2503.07588

Authors: Junwei Luo,Yingying Zhang,Xue Yang,Kang Wu,Qi Zhu,Lei Liang,Jingdong Chen,Yansheng Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: large Remote Sensing, Remote Sensing Images, Efficient vision-language understanding, Remote Sensing, Efficient vision-language

Comments: 12 pages, 6 figures, 7 tables

Click to view abstract

Abstract:Efficient vision-language understanding of large Remote Sensing Images (RSIs) is meaningful but challenging. Current Large Vision-Language Models (LVLMs) typically employ limited pre-defined grids to process images, leading to information loss when handling gigapixel RSIs. Conversely, using unlimited grids significantly increases computational costs. To preserve image details while reducing computational complexity, we propose a text-guided token pruning method with Dynamic Image Pyramid (DIP) integration. Our method introduces: (i) a Region Focus Module (RFM) that leverages text-aware region localization capability to identify critical vision tokens, and (ii) a coarse-to-fine image tile selection and vision token pruning strategy based on DIP, which is guided by RFM outputs and avoids directly processing the entire large imagery. Additionally, existing benchmarks for evaluating LVLMs' perception ability on large RSI suffer from limited question diversity and constrained image sizes. We construct a new benchmark named LRS-VQA, which contains 7,333 QA pairs across 8 categories, with image length up to 27,328 pixels. Our method outperforms existing high-resolution strategies on four datasets using the same data. Moreover, compared to existing token reduction methods, our approach demonstrates higher efficiency under high-resolution settings. Dataset and code are in this https URL.

11. 【2503.07587】Robusto-1 Dataset: Comparing Humans and VLMs on real out-of-distribution Autonomous Driving VQA from Peru

Link: https://arxiv.org/abs/2503.07587

Authors: Dunant Cusipuma,David Ortega,Victor Flores-Benites,Arturo Deza

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: foundational models start, multimodal foundational models, Foundational Visual Language, Visual Language Models, Representational Similarity Analysis

Comments: A pre-print. 26 pages. Link to Code + Data: [this https URL](https://huggingface.co/datasets/Artificio/robusto-1)

Click to view abstract

Abstract:As multimodal foundational models start being deployed experimentally in Self-Driving cars, a reasonable question we ask ourselves is how similar to humans do these systems respond in certain driving situations -- especially those that are out-of-distribution? To study this, we create the Robusto-1 dataset that uses dashcam video data from Peru, a country with one of the worst (aggressive) drivers in the world, a high traffic index, and a high ratio of bizarre to non-bizarre street objects likely never seen in training. In particular, to preliminarily test at a cognitive level how well Foundational Visual Language Models (VLMs) compare to Humans in Driving, we move away from bounding boxes, segmentation maps, occupancy maps or trajectory estimation to multi-modal Visual Question Answering (VQA) comparing both humans and machines through a popular method in systems neuroscience known as Representational Similarity Analysis (RSA). Depending on the type of questions we ask and the answers these systems give, we will show in what cases do VLMs and Humans converge or diverge allowing us to probe on their cognitive alignment. We find that the degree of alignment varies significantly depending on the type of questions asked to each type of system (Humans vs VLMs), highlighting a gap in their alignment.

12. 【2503.07578】Denoising Score Distillation: From Noisy Diffusion Pretraining to One-Step High-Quality Generation

Link: https://arxiv.org/abs/2503.07578

Authors: Tianyu Chen,Yasi Zhang,Zhendong Wang,Ying Nian Wu,Oscar Leong,Mingyuan Zhou

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved remarkable success, diverse natural distributions, generating high-resolution, realistic images, achieved remarkable

Comments: First Author and Second Author contributed equally to this work. The last two authors equally advised this work

Click to view abstract

Abstract:Diffusion models have achieved remarkable success in generating high-resolution, realistic images across diverse natural distributions. However, their performance heavily relies on high-quality training data, making it challenging to learn meaningful distributions from corrupted samples. This limitation restricts their applicability in scientific domains where clean data is scarce or costly to obtain. In this work, we introduce denoising score distillation (DSD), a surprisingly effective and novel approach for training high-quality generative models from low-quality data. DSD first pretrains a diffusion model exclusively on noisy, corrupted samples and then distills it into a one-step generator capable of producing refined, clean outputs. While score distillation is traditionally viewed as a method to accelerate diffusion models, we show that it can also significantly enhance sample quality, particularly when starting from a degraded teacher model. Across varying noise levels and datasets, DSD consistently improves generative performance; we summarize our empirical evidence in Fig. 1. Furthermore, we provide theoretical insights showing that, in a linear model setting, DSD identifies the eigenspace of the clean data distribution's covariance matrix, implicitly regularizing the generator. This perspective reframes score distillation as not only a tool for efficiency but also a mechanism for improving generative models, particularly in low-quality data settings.

13. 【2503.07575】VisBias: Measuring Explicit and Implicit Social Biases in Vision Language Models

Link: https://arxiv.org/abs/2503.07575

Authors: Jen-tse Huang,Jiantong Qin,Jianping Zhang,Youliang Yuan,Wenxuan Wang,Jieyu Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: social biases exhibited, implicit social biases, research investigates, exhibited by Vision-Language, Vision-Language Models

Comments: 9 pages

Click to view abstract

14. 【2503.07561】Alligat0R: Pre-Training Through Co-Visibility Segmentation for Relative Camera Pose Regression

Link: https://arxiv.org/abs/2503.07561

Authors: Thibaut Loiseau,Guillaume Bourmaud,Vincent Lepetit

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: advanced computer vision, greatly advanced computer, yielding impressive results, completion approach yielding, approach yielding impressive

Comments:

Click to view abstract

Abstract:Pre-training techniques have greatly advanced computer vision, with CroCo's cross-view completion approach yielding impressive results in tasks like 3D reconstruction and pose regression. However, this method requires substantial overlap between training pairs, limiting its effectiveness. We introduce Alligat0R, a novel pre-training approach that reformulates cross-view learning as a co-visibility segmentation task. Our method predicts whether each pixel in one image is co-visible in the second image, occluded, or outside the field of view (FOV), enabling the use of image pairs with any degree of overlap and providing interpretable predictions. To support this, we present Cub3, a large-scale dataset with 2.5 million image pairs and dense co-visibility annotations derived from the nuScenes dataset. This dataset includes diverse scenarios with varying degrees of overlap. The experiments show that Alligat0R significantly outperforms CroCo in relative pose regression, especially in scenarios with limited overlap. Alligat0R and Cub3 will be made publicly available.

15. 【2503.07535】LBM: Latent Bridge Matching for Fast Image-to-Image Translation

Link: https://arxiv.org/abs/2503.07535

Authors: Clément Chadebec,Onur Tasar,Sanjeev Sreetharan,Benjamin Aubin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Latent Bridge Matching, introduce Latent Bridge, Bridge Matching, Latent Bridge, introduce Latent

Comments:

Click to view abstract

Abstract:In this paper, we introduce Latent Bridge Matching (LBM), a new, versatile and scalable method that relies on Bridge Matching in a latent space to achieve fast image-to-image translation. We show that the method can reach state-of-the-art results for various image-to-image tasks using only a single inference step. In addition to its efficiency, we also demonstrate the versatility of the method across different image translation tasks such as object removal, normal and depth estimation, and object relighting. We also derive a conditional framework of LBM and demonstrate its effectiveness by tackling the tasks of controllable image relighting and shadow generation. We provide an open-source implementation of the method at this https URL.

16. 【2503.07523】VisRL: Intention-Driven Visual Perception via Reinforced Reasoning

Link: https://arxiv.org/abs/2503.07523

Authors: Zhangquan Chen,Xufang Luo,Dongsheng Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: understanding is inherently, scene based, humans selectively focus, Visual understanding, Visual

Comments: 18 pages, 11 figures

Click to view abstract

Abstract:Visual understanding is inherently intention-driven - humans selectively focus on different regions of a scene based on their goals. Recent advances in large multimodal models (LMMs) enable flexible expression of such intentions through natural language, allowing queries to guide visual reasoning processes. Frameworks like Visual Chain-of-Thought have demonstrated the benefit of incorporating explicit reasoning steps, where the model predicts a focus region before answering a query. However, existing approaches rely heavily on supervised training with annotated intermediate bounding boxes, which severely limits scalability due to the combinatorial explosion of intention-region pairs. To overcome this limitation, we propose VisRL, the first framework that applies reinforcement learning (RL) to the problem of intention-driven visual perception. VisRL optimizes the entire visual reasoning process using only reward signals. By treating intermediate focus selection as an internal decision optimized through trial-and-error, our method eliminates the need for costly region annotations while aligning more closely with how humans learn to perceive the world. Extensive experiments across multiple benchmarks show that VisRL consistently outperforms strong baselines, demonstrating both its effectiveness and its strong generalization across different LMMs. Our code is available at this https URL.

17. 【2503.07520】From Limited Labels to Open Domains: An Efficient Learning Paradigm for UAV-view Geo-Localization

Link: https://arxiv.org/abs/2503.07520

Authors: Zhongwei Chen,Zhao-Xu Yang,Hai-Jun Rong,Jiawei Lang

Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

Keywords: Traditional UAV-view Geo-Localization, positive sample selection, Traditional UAV-view, learn cross-view domain-invariant, cross-view domain-invariant representations

Comments:

Click to view abstract

18. 【2503.07517】FastInstShadow: A Simple Query-Based Model for Instance Shadow Detection

Link: https://arxiv.org/abs/2503.07517

Authors: Takeru Inoue,Ryusuke Miyamoto

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Instance shadow detection, Instance shadow, shadows and objects, task of detecting, detecting pairs

Comments:

Click to view abstract

Abstract:Instance shadow detection is the task of detecting pairs of shadows and objects, where existing methods first detect shadows and objects independently, then associate them. This paper introduces FastInstShadow, a method that enhances detection accuracy through a query-based architecture featuring an association transformer decoder with two dual-path transformer decoders to assess relationships between shadows and objects during detection. Experimental results using the SOBA dataset showed that the proposed method outperforms all existing methods across all criteria. This method makes real-time processing feasible for moderate-resolution images with better accuracy than SSISv2, the most accurate existing method. Our code is available at this https URL.

19. 【2503.07516】CPAny: Couple With Any Encoder to Refer Multi-Object Tracking

Link: https://arxiv.org/abs/2503.07516

Authors: Weize Li,Yunhao Du,Qixiang Yin,Zhicheng Zhao,Fei Su,Daqi Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: localize target trajectories, Referring Multi-Object Tracking, natural language expressions, aims to localize, localize target

Comments:

Click to view abstract

Abstract:Referring Multi-Object Tracking (RMOT) aims to localize target trajectories specified by natural language expressions in videos. Existing RMOT methods mainly follow two paradigms, namely, one-stage strategies and two-stage ones. The former jointly trains tracking with referring but suffers from substantial computational overhead. Although the latter improves computational efficiency, its CLIP-inspired dual-tower architecture restricts compatibility with other visual/text backbones and is not future-proof. To overcome these limitations, we propose CPAny, a novel encoder-decoder framework for two-stage RMOT, which introduces two core components: (1) a Contextual Visual Semantic Abstractor (CVSA) performs context-aware aggregation on visual backbone features and projects them into a unified semantic space; (2) a Parallel Semantic Summarizer (PSS) decodes the visual and linguistic features at the semantic level in parallel and generates referring scores. By replacing the inherent feature alignment of encoders with a self-constructed unified semantic space, CPAny achieves flexible compatibility with arbitrary emerging visual/text encoders. Meanwhile, CPAny aggregates contextual information by encoding only once and processes multiple expressions in parallel, significantly reducing computational redundancy. Extensive experiments on the Refer-KITTI and Refer-KITTI-V2 datasets show that CPAny outperforms SOTA methods across diverse encoder combinations, with a particular 7.77% HOTA improvement on Refer-KITTI-V2. Code will be available soon.

20. 【2503.07511】PointVLA: Injecting the 3D World into Vision-Language-Action Models

Link: https://arxiv.org/abs/2503.07511

Authors: Chengmeng Li,Junjie Wen,Yan Peng,Yaxin Peng,Feifei Feng,Yichen Zhu

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: limits spatial reasoning, spatial reasoning critical, RGB images limits, reliance on RGB, images limits spatial

Comments:

Click to view abstract

Abstract:Vision-Language-Action (VLA) models excel at robotic tasks by leveraging large-scale 2D vision-language pretraining, but their reliance on RGB images limits spatial reasoning critical for real-world interaction. Retraining these models with 3D data is computationally prohibitive, while discarding existing 2D datasets wastes valuable resources. To bridge this gap, we propose PointVLA, a framework that enhances pre-trained VLAs with point cloud inputs without requiring retraining. Our method freezes the vanilla action expert and injects 3D features via a lightweight modular block. To identify the most effective way of integrating point cloud representations, we conduct a skip-block analysis to pinpoint less useful blocks in the vanilla action expert, ensuring that 3D features are injected only into these blocks, minimizing disruption to pre-trained representations. Extensive experiments demonstrate that PointVLA outperforms state-of-the-art 2D imitation learning methods, such as OpenVLA, Diffusion Policy and DexVLA, across both simulated and real-world robotic tasks. Specifically, we highlight several key advantages of PointVLA enabled by point cloud integration: (1) Few-shot multi-tasking, where PointVLA successfully performs four different tasks using only 20 demonstrations each; (2) Real-vs-photo discrimination, where PointVLA distinguishes real objects from their images, leveraging 3D world knowledge to improve safety and reliability; (3) Height adaptability, where, unlike conventional 2D imitation learning methods, PointVLA enables robots to adapt to objects at varying table heights that are unseen in the training data. Furthermore, PointVLA achieves strong performance in long-horizon tasks, such as picking and packing objects from a moving conveyor belt, showcasing its ability to generalize across complex, dynamic environments.

21. 【2503.07507】PE3R: Perception-Efficient 3D Reconstruction

Link: https://arxiv.org/abs/2503.07507

Authors: Jie Hu,Shizun Wang,Xinchao Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advancements, improved the understanding, reconstruction, Recent, perception accuracy

Comments:

Click to view abstract

Abstract:Recent advancements in 2D-to-3D perception have significantly improved the understanding of 3D scenes from 2D images. However, existing methods face critical challenges, including limited generalization across scenes, suboptimal perception accuracy, and slow reconstruction speeds. To address these limitations, we propose Perception-Efficient 3D Reconstruction (PE3R), a novel framework designed to enhance both accuracy and efficiency. PE3R employs a feed-forward architecture to enable rapid 3D semantic field reconstruction. The framework demonstrates robust zero-shot generalization across diverse scenes and objects while significantly improving reconstruction speed. Extensive experiments on 2D-to-3D open-vocabulary segmentation and 3D reconstruction validate the effectiveness and versatility of PE3R. The framework achieves a minimum 9-fold speedup in 3D semantic field reconstruction, along with substantial gains in perception accuracy and reconstruction precision, setting new benchmarks in the field. The code is publicly available at: this https URL.

22. 【2503.07506】ADROIT: A Self-Supervised Framework for Learning Robust Representations for Active Learning

Link: https://arxiv.org/abs/2503.07506

Authors: Soumya Banerjee,Vinay Kumar Verma

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: minimizing annotation costs, Active learning aims, Active learning, select optimal samples, minimizing annotation

Comments:

Click to view abstract

Abstract:Active learning aims to select optimal samples for labeling, minimizing annotation costs. This paper introduces a unified representation learning framework tailored for active learning with task awareness. It integrates diverse sources, comprising reconstruction, adversarial, self-supervised, knowledge-distillation, and classification losses into a unified VAE-based ADROIT approach. The proposed approach comprises three key components - a unified representation generator (VAE), a state discriminator, and a (proxy) task-learner or classifier. ADROIT learns a latent code using both labeled and unlabeled data, incorporating task-awareness by leveraging labeled data with the proxy classifier. Unlike previous approaches, the proxy classifier additionally employs a self-supervised loss on unlabeled data and utilizes knowledge distillation to align with the target task-learner. The state discriminator distinguishes between labeled and unlabeled data, facilitating the selection of informative unlabeled samples. The dynamic interaction between VAE and the state discriminator creates a competitive environment, with the VAE attempting to deceive the discriminator, while the state discriminator learns to differentiate between labeled and unlabeled inputs. Extensive evaluations on diverse datasets and ablation analysis affirm the effectiveness of the proposed model.

23. 【2503.07503】Think Before You Segment: High-Quality Reasoning Segmentation with GPT Chain of Thoughts

Link: https://arxiv.org/abs/2503.07503

Authors: Shiu-hong Kao,Yu-Wing Tai,Chi-Keung Tang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, challenging vision-language task, non-visual query text, multimodal Large Language, vision-language task

Comments: Project page: [this https URL](https://cse.hkust.edu.hk/~skao/thinkfirst.html)

Click to view abstract

Abstract:Reasoning segmentation is a challenging vision-language task that aims to output the segmentation mask with respect to a complex, implicit, and even non-visual query text. Previous works incorporated multimodal Large Language Models (MLLMs) with segmentation models to approach the difficult problem. However, their segmentation quality often falls short in complex cases, particularly when dealing with out-of-domain objects with intricate structures, blurry boundaries, occlusions, or high similarity with surroundings. In this paper, we introduce ThinkFirst, a training-free reasoning segmentation framework that leverages GPT's chain of thought to address these challenging cases. Our approach allows GPT-4o or other powerful MLLMs to generate a detailed, chain-of-thought description of an image. This summarized description is then passed to a language-instructed segmentation assistant to aid the segmentation process. Our framework allows users to easily interact with the segmentation agent using multimodal inputs, such as easy text and image scribbles, for successive refinement or communication. We evaluate the performance of ThinkFirst on diverse objects. Extensive experiments show that this zero-shot-CoT approach significantly improves the vanilla reasoning segmentation agent, both qualitatively and quantitatively, while being less sensitive or critical to user-supplied prompts after Thinking First.

24. 【2503.07499】AthletePose3D: A Benchmark Dataset for 3D Human Pose Estimation and Kinematic Validation in Athletic Movements

Link: https://arxiv.org/abs/2503.07499

Authors: Calvin Yeung,Tomohiro Suzuki,Ryota Tanaka,Zhuoer Yin,Keisuke Fujii

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Human pose estimation, Human pose, spanning sports science, applications spanning sports, pose estimation

Comments:

Click to view abstract

Abstract:Human pose estimation is a critical task in computer vision and sports biomechanics, with applications spanning sports science, rehabilitation, and biomechanical research. While significant progress has been made in monocular 3D pose estimation, current datasets often fail to capture the complex, high-acceleration movements typical of competitive sports. In this work, we introduce AthletePose3D, a novel dataset designed to address this gap. AthletePose3D includes 12 types of sports motions across various disciplines, with approximately 1.3 million frames and 165 thousand individual postures, specifically capturing high-speed, high-acceleration athletic movements. We evaluate state-of-the-art (SOTA) monocular 2D and 3D pose estimation models on the dataset, revealing that models trained on conventional datasets perform poorly on athletic motions. However, fine-tuning these models on AthletePose3D notably reduces the SOTA model mean per joint position error (MPJPE) from 214mm to 65mm, a reduction of over 69%. We also validate the kinematic accuracy of monocular pose estimations through waveform analysis, highlighting strong correlations in joint angle estimations but limitations in velocity estimation. Our work provides a comprehensive evaluation of monocular pose estimation models in the context of sports, contributing valuable insights for advancing monocular pose estimation techniques in high-performance sports environments. The dataset, code, and model checkpoints are available at: this https URL
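
To make the metric concrete, here is a small sketch of MPJPE (the mean Euclidean distance between predicted and ground-truth joints) together with the arithmetic behind the "over 69%" figure. The toy joint coordinates and the helper names are made up for illustration; only the 214mm and 65mm values come from the abstract.

```python
import math
from typing import List, Tuple

Joint = Tuple[float, float, float]


def mpjpe(pred: List[Joint], truth: List[Joint]) -> float:
    """Mean per joint position error, in the same units as the inputs."""
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(truth)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which an error shrinks when going from `before` to `after`."""
    return (before - after) / before


# Toy pose with two joints, in millimetres (made-up values).
pred = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
truth = [(3.0, 4.0, 0.0), (100.0, 0.0, 12.0)]
toy_error = mpjpe(pred, truth)               # (5 + 12) / 2 = 8.5 mm

# Figures quoted in the abstract: (214 - 65) / 214 is roughly 0.696.
reduction = relative_reduction(214.0, 65.0)
```

The computed reduction of about 69.6% is consistent with the abstract's "over 69%" claim.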

25. 【2503.07493】V2Flow: Unifying Visual Tokenization and Large Language Model Vocabularies for Autoregressive Image Generation

Link: https://arxiv.org/abs/2503.07493

Authors: Guiwei Zhang,Tianyu Zhang,Mohan Zhou,Yalong Bai,Biye Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: large language models, produces discrete visual, latent distribution alignment, visual, language models

Comments: 11 pages, 6 figures

Click to view abstract

Abstract:We propose V2Flow, a novel tokenizer that produces discrete visual tokens capable of high-fidelity reconstruction, while ensuring structural and latent distribution alignment with the vocabulary space of large language models (LLMs). Leveraging this tight visual-vocabulary coupling, V2Flow enables autoregressive visual generation on top of existing LLMs. Our approach formulates visual tokenization as a flow-matching problem, aiming to learn a mapping from a standard normal prior to the continuous image distribution, conditioned on token sequences embedded within the LLM's vocabulary space. The effectiveness of V2Flow stems from two core designs. First, we propose a Visual Vocabulary resampler, which compresses visual data into compact token sequences, with each represented as a soft categorical distribution over the LLM's vocabulary. This allows seamless integration of visual tokens into existing LLMs for autoregressive visual generation. Second, we present a masked autoregressive Rectified-Flow decoder, employing a masked transformer encoder-decoder to refine visual tokens into contextually enriched embeddings. These embeddings then condition a dedicated velocity field for precise reconstruction. Additionally, an autoregressive rectified-flow sampling strategy is incorporated, ensuring flexible sequence lengths while preserving competitive reconstruction quality. Extensive experiments show that V2Flow outperforms mainstream VQ-based tokenizers and facilitates autoregressive visual generation on top of existing LLMs. this https URL
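
One way to picture "a soft categorical distribution over the vocabulary" is as a softmax over per-entry logits whose probabilities weight the vocabulary embeddings. The sketch below shows only that generic idea; taking the expected embedding, the function names, and the three-entry toy vocabulary are assumptions and not V2Flow's actual resampler.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_token_embedding(logits: List[float],
                         vocab_embeddings: List[List[float]]) -> List[float]:
    """Expected embedding under a soft categorical distribution over the
    vocabulary (hypothetical way to consume such a distribution)."""
    probs = softmax(logits)
    dim = len(vocab_embeddings[0])
    return [sum(p * emb[d] for p, emb in zip(probs, vocab_embeddings))
            for d in range(dim)]


# Toy vocabulary of three 2-D embeddings, made up for illustration.
vocab = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform = soft_token_embedding([0.0, 0.0, 0.0], vocab)   # every entry weighted 1/3
peaked = soft_token_embedding([50.0, 0.0, 0.0], vocab)   # nearly one-hot on entry 0
```

A peaked distribution recovers something close to a single vocabulary entry, while a flat one blends several entries into one continuous vector.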

26. 【2503.07487】LLaVA-RadZ: Can Multimodal Large Language Models Effectively Tackle Zero-shot Radiology Recognition?

Link: https://arxiv.org/abs/2503.07487

Authors: Bangyan Li,Wenxuan Huang,Yunhang Shen,Yeqiang Wang,Shaohui Lin,Jingzhong Lin,Ling You,Yinqi Zhang,Ke Li,Xing Sun,Yuling Sun

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrated exceptional capabilities, multimodal large models, vision-language tasks, zero-shot medical disease, demonstrated exceptional

Comments:

Click to view abstract

Abstract:Recently, multimodal large language models (MLLMs) have demonstrated exceptional capabilities in visual understanding and reasoning across various vision-language tasks. However, MLLMs usually perform poorly in zero-shot medical disease recognition, as they do not fully exploit the captured features and available medical knowledge. To address this challenge, we propose LLaVA-RadZ, a simple yet effective framework for zero-shot medical disease recognition. Specifically, we design an end-to-end training strategy, termed Decoding-Side Feature Alignment Training (DFAT), to take advantage of the characteristics of the MLLM decoder architecture and incorporate modality-specific tokens tailored for different modalities, which effectively utilizes image and text representations and facilitates robust cross-modal alignment. Additionally, we introduce a Domain Knowledge Anchoring Module (DKAM) to exploit the intrinsic medical knowledge of large models, which mitigates the category semantic gap in image-text alignment. DKAM improves category-level alignment, allowing for accurate disease recognition. Extensive experiments on multiple benchmarks demonstrate that our LLaVA-RadZ significantly outperforms traditional MLLMs in zero-shot disease recognition and exhibits state-of-the-art performance compared to the well-established and highly optimized CLIP-based approaches.

27. 【2503.07485】Chameleon: Fast-slow Neuro-symbolic Lane Topology Extraction

Link: https://arxiv.org/abs/2503.07485

Authors: Zongzheng Zhang,Xinrun Li,Sizhe Zou,Guoxuan Chi,Siqi Li,Xuchong Qiu,Guoliang Wang,Guantian Zheng,Leichen Wang,Hang Zhao,Hao Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: mapless autonomous driving, involves detecting lanes, key perception task, extraction involves detecting, autonomous driving

Comments: ICRA 2025, Project Page: [this https URL](https://github.com/XR-Lee/neural-symbolic)

Click to view abstract

Abstract:Lane topology extraction involves detecting lanes and traffic elements and determining their relationships, a key perception task for mapless autonomous driving. This task requires complex reasoning, such as determining whether it is possible to turn left into a specific lane. To address this challenge, we introduce neuro-symbolic methods powered by vision-language foundation models (VLMs). Existing approaches have notable limitations: (1) Dense visual prompting with VLMs can achieve strong performance but is costly in terms of both financial resources and carbon footprint, making it impractical for robotics applications. (2) Neuro-symbolic reasoning methods for 3D scene understanding fail to integrate visual inputs when synthesizing programs, making them ineffective in handling complex corner cases. To this end, we propose a fast-slow neuro-symbolic lane topology extraction algorithm, named Chameleon, which alternates between a fast system that directly reasons over detected instances using synthesized programs and a slow system that utilizes a VLM with a chain-of-thought design to handle corner cases. Chameleon leverages the strengths of both approaches, providing an affordable solution while maintaining high performance. We evaluate the method on the OpenLane-V2 dataset, showing consistent improvements across various baseline detectors. Our code, data, and models are publicly available at this https URL
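
The fast-slow pattern described above can be sketched as a cheap rule that answers the easy cases and defers to an expensive model only when it is unsure. The abstract does not say how Chameleon decides when to switch, so the `LanePair` fields, the thresholds, and the stub standing in for the VLM are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class LanePair:
    heading_diff_deg: float   # angle between the two lanes' directions
    gap_m: float              # distance from the end of lane A to the start of lane B


def fast_rule(pair: LanePair) -> Optional[bool]:
    """Cheap symbolic check; returns None when the case is ambiguous.
    Thresholds are illustrative only."""
    if pair.gap_m < 1.0 and pair.heading_diff_deg < 15.0:
        return True
    if pair.gap_m > 5.0:
        return False
    return None


def connected(pair: LanePair, slow_model: Callable[[LanePair], bool]) -> bool:
    """Use the fast rule when it is decisive, otherwise fall back to the slow model."""
    verdict = fast_rule(pair)
    return slow_model(pair) if verdict is None else verdict


slow_calls: List[LanePair] = []


def stub_vlm(pair: LanePair) -> bool:
    """Stand-in for an expensive VLM query; records each call it receives."""
    slow_calls.append(pair)
    return pair.heading_diff_deg < 90.0


results = [
    connected(LanePair(5.0, 0.5), stub_vlm),    # decided by the fast rule
    connected(LanePair(10.0, 8.0), stub_vlm),   # decided by the fast rule
    connected(LanePair(60.0, 2.0), stub_vlm),   # ambiguous, goes to the slow path
]
```

Only the ambiguous third pair reaches the slow path, which is where the cost saving of such a design would come from.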

28. 【2503.07478】VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models

Link: https://arxiv.org/abs/2503.07478

Authors: Jiacheng Ruan,Wenzhen Yuan,Xian Gao,Ye Guo,Daoxin Zhang,Zhe Xu,Yao Hu,Ting Liu,Yuzhuo Fu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrated strong performance, occasionally arise due, reasoning process, large visual-language models, demonstrated strong

Comments: 12 pages, 4 figures. This work is in progress

Click to view abstract

Abstract:Although large visual-language models (LVLMs) have demonstrated strong performance in multimodal tasks, errors may occasionally arise due to biases during the reasoning process. Recently, reward models (RMs) have become increasingly pivotal in the reasoning process. Specifically, process RMs evaluate each reasoning step, outcome RMs focus on the assessment of reasoning results, and critique RMs perform error analysis on the entire reasoning process, followed by corrections. However, existing benchmarks for vision-language RMs (VLRMs) typically assess only a single aspect of their capabilities (e.g., distinguishing between two answers), thus limiting the all-round evaluation and restricting the development of RMs in the visual-language domain. To address this gap, we propose a comprehensive and challenging benchmark, dubbed VLRMBench, encompassing 12,634 questions. VLRMBench is constructed based on three distinct types of datasets, covering mathematical reasoning, hallucination understanding, and multi-image understanding. We design 12 tasks across three major categories, focusing on evaluating VLRMs in the aspects of process understanding, outcome judgment, and critique generation. Extensive experiments are conducted on 21 open-source models and 5 advanced closed-source models, highlighting the challenges posed by VLRMBench. For instance, in 'Forecasting Future', a binary classification task, the advanced GPT-4o achieves only 76.0% accuracy. Additionally, we perform comprehensive analytical studies, offering valuable insights for the future development of VLRMs. We anticipate that VLRMBench will serve as a pivotal benchmark in advancing VLRMs. Code and datasets will be available at this https URL.

29. 【2503.07476】SOGS: Second-Order Anchor for Advanced 3D Gaussian Splatting

Link: https://arxiv.org/abs/2503.07476

Authors: Jiahui Zhang,Fangneng Zhan,Ling Shao,Shijian Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: reduced Gaussian redundancy, Gaussian splatting, Gaussian redundancy, rendering quality, Gaussian attribute prediction

Comments: Accepted by CVPR 2025

Click to view abstract

Abstract:Anchor-based 3D Gaussian splatting (3D-GS) exploits anchor features in 3D Gaussian prediction, which has achieved impressive 3D rendering quality with reduced Gaussian redundancy. On the other hand, it often encounters the dilemma among anchor features, model size, and rendering quality - large anchor features lead to large 3D models and high-quality rendering whereas reducing anchor features degrades Gaussian attribute prediction which leads to clear artifacts in the rendered textures and geometries. We design SOGS, an anchor-based 3D-GS technique that introduces second-order anchors to achieve superior rendering quality and reduced anchor features and model size simultaneously. Specifically, SOGS incorporates covariance-based second-order statistics and correlation across feature dimensions to augment features within each anchor, compensating for the reduced feature size and improving rendering quality effectively. In addition, it introduces a selective gradient loss to enhance the optimization of scene textures and scene geometries, leading to high-quality rendering with small anchor features. Extensive experiments over multiple widely adopted benchmarks show that SOGS achieves superior rendering quality in novel view synthesis with clearly reduced model size.
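
As a generic illustration of "covariance-based second-order statistics across feature dimensions", the sketch below computes a sample covariance matrix over feature vectors and appends its upper triangle to a feature. How SOGS actually forms and uses these statistics is not stated in the abstract; the grouping of samples, the concatenation step, and the function names are assumptions.

```python
from typing import List


def feature_covariance(samples: List[List[float]]) -> List[List[float]]:
    """Sample covariance between feature dimensions (each row is one sample)."""
    n, dim = len(samples), len(samples[0])
    means = [sum(row[d] for row in samples) / n for d in range(dim)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in samples) / (n - 1)
             for j in range(dim)]
            for i in range(dim)]


def augment(feature: List[float], cov: List[List[float]]) -> List[float]:
    """Append the upper-triangular covariance entries to a first-order feature
    (hypothetical way of combining the two)."""
    dim = len(feature)
    second_order = [cov[i][j] for i in range(dim) for j in range(i, dim)]
    return list(feature) + second_order


# Three made-up 2-D feature samples whose dimensions are perfectly correlated.
samples = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
cov = feature_covariance(samples)
augmented = augment(samples[0], cov)
```

The off-diagonal entry captures how the two dimensions move together, which is information a small first-order feature on its own does not carry.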

30. 【2503.07472】A Review on Geometry and Surface Inspection in 3D Concrete Printing

Link: https://arxiv.org/abs/2503.07472

Authors: K. Mawas,M. Maboudi,M. Gerke

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: conventionally manufactured parts, manufacturing in construction, manufactured parts, substantial growth, additive manufacturing

Comments:

Click to view abstract

Abstract:Given the substantial growth in the use of additive manufacturing in construction (AMC), it is necessary to ensure the quality of printed specimens which can be much more complex than conventionally manufactured parts. This study explores the various aspects of geometry and surface quality control for 3D concrete printing (3DCP), with a particular emphasis on deposition-based methods, namely extrusion and shotcrete 3D printing (SC3DP). A comprehensive overview of existing quality control (QC) methods and strategies is provided and preceded by an in-depth discussion. Four categories of data capture technologies are investigated and their advantages and limitations in the context of AMC are discussed. Additionally, the effects of environmental conditions and objects' properties on data capture are also analyzed. The study extends to automated data capture planning methods for different sensors. Furthermore, various quality control strategies are explored across different stages of the fabrication cycle of the printed object including: (i) During printing, (ii) Layer-wise, (iii) Preassembly, and (iv) Assembly. In addition to reviewing the methods already applied in AMC, we also address various research gaps and future trends and highlight potential methodologies from adjacent domains that could be transferred to AMC.

31. 【2503.07465】YOLOE: Real-Time Seeing Anything

Link: https://arxiv.org/abs/2503.07465

Authors: Ao Wang,Lihao Liu,Hui Chen,Zijia Lin,Jungong Han,Guiguang Ding

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: computer vision applications, YOLO series, vision applications, predefined categories, hindering adaptability

Comments: 15 pages, 9 figures

Click to view abstract

Abstract:Object detection and segmentation are widely employed in computer vision applications, yet conventional models like YOLO series, while efficient and accurate, are limited by predefined categories, hindering adaptability in open scenarios. Recent open-set methods leverage text prompts, visual cues, or prompt-free paradigm to overcome this, but often compromise between performance and efficiency due to high computational demands or deployment complexity. In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single highly efficient model, achieving real-time seeing anything. For text prompts, we propose Re-parameterizable Region-Text Alignment (RepRTA) strategy. It refines pretrained textual embeddings via a re-parameterizable lightweight auxiliary network and enhances visual-textual alignment with zero inference and transferring overhead. For visual prompts, we present Semantic-Activated Visual Prompt Encoder (SAVPE). It employs decoupled semantic and activation branches to bring improved visual embedding and accuracy with minimal complexity. For prompt-free scenario, we introduce Lazy Region-Prompt Contrast (LRPC) strategy. It utilizes a built-in large vocabulary and specialized embedding to identify all objects, avoiding costly language model dependency. Extensive experiments show YOLOE's exceptional zero-shot performance and transferability with high inference efficiency and low training cost. Notably, on LVIS, with 3$\times$ less training cost and 1.4$\times$ inference speedup, YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP. When transferring to COCO, YOLOE-v8-L achieves 0.6 AP$^b$ and 0.4 AP$^m$ gains over closed-set YOLOv8-L with nearly 4$\times$ less training time. Code and models are available at this https URL.

32. 【2503.07456】Anatomy-Aware Conditional Image-Text Retrieval

Link: https://arxiv.org/abs/2503.07456

Authors: Meng Zheng, Jiajin Zhang, Benjamin Planche, Zhongpai Gao, Terrence Chen, Ziyan Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: finds broad applications, automatically retrieving relevant, retrieving relevant patient, efficient clinical diagnosis, finds broad

Comments: 16 pages, 10 figures

Click to view abstract

Abstract:Image-Text Retrieval (ITR) finds broad applications in healthcare, aiding clinicians and radiologists by automatically retrieving relevant patient cases in the database given the query image and/or report, for more efficient clinical diagnosis and treatment, especially for rare diseases. However, conventional ITR systems typically rely only on global image or text representations for measuring patient image/report similarities, which overlook local distinctiveness across patient cases. This often results in suboptimal retrieval performance. In this paper, we propose an Anatomical Location-Conditioned Image-Text Retrieval (ALC-ITR) framework, which, given a query image and the associated suspicious anatomical region(s), aims to retrieve similar patient cases exhibiting the same disease or symptoms in the same anatomical region. To perform location-conditioned multimodal retrieval, we learn a medical Relevance-Region-Aligned Vision Language (RRA-VL) model with semantic global-level and region-/word-level alignment to produce generalizable, well-aligned multi-modal representations. Additionally, we perform location-conditioned contrastive learning to further utilize cross-pair region-level contrastiveness for improved multi-modal retrieval. We show that our proposed RRA-VL achieves state-of-the-art localization performance in phrase-grounding tasks, and satisfactory multi-modal retrieval performance with or without location conditioning. Finally, we thoroughly investigate the generalizability and explainability of our proposed ALC-ITR system in providing explanations and preliminary diagnosis reports given retrieved patient cases (conditioned on anatomical regions), with proper off-the-shelf LLM prompts.

33. 【2503.07446】EigenGS Representation: From Eigenspace to Gaussian Image Space

Link: https://arxiv.org/abs/2503.07446

Authors: Lo-Wei Tai, Ching-En Li, Cheng-Lin Chen, Chih-Jung Tsai, Hwann-Tzong Chen, Tyng-Luh Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Principal Component Analysis, Principal Component, Component Analysis, offer distinct approaches, dimensionality reduction technique

Comments:

Click to view abstract

Abstract:Principal Component Analysis (PCA), a classical dimensionality reduction technique, and 2D Gaussian representation, an adaptation of 3D Gaussian Splatting for image representation, offer distinct approaches to modeling visual data. We present EigenGS, a novel method that bridges these paradigms through an efficient transformation pipeline connecting eigenspace and image-space Gaussian representations. Our approach enables instant initialization of Gaussian parameters for new images without requiring per-image optimization from scratch, dramatically accelerating convergence. EigenGS introduces a frequency-aware learning mechanism that encourages Gaussians to adapt to different scales, effectively modeling varied spatial frequencies and preventing artifacts in high-resolution reconstruction. Extensive experiments demonstrate that EigenGS not only achieves superior reconstruction quality compared to direct 2D Gaussian fitting but also reduces necessary parameter count and training time. The results highlight EigenGS's effectiveness and generalization ability across images with varying resolutions and diverse categories, making Gaussian-based image representation both high-quality and viable for real-time applications.

34. 【2503.07444】Divide and Conquer Self-Supervised Learning for High-Content Imaging

Link: https://arxiv.org/abs/2503.07444

Authors: Lucas Farndale, Paul Henderson, Edward W Roberts, Ke Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

Keywords: complex features, Component Embedding Registration, Split Component Embedding, complex, features

Comments:

Click to view abstract

Abstract:Self-supervised representation learning methods often fail to learn subtle or complex features, which can be dominated by simpler patterns which are much easier to learn. This limitation is particularly problematic in applications to science and engineering, as complex features can be critical for discovery and analysis. To address this, we introduce Split Component Embedding Registration (SpliCER), a novel architecture which splits the image into sections and distils information from each section to guide the model to learn more subtle and complex features without compromising on simpler features. SpliCER is compatible with any self-supervised loss function and can be integrated into existing methods without modification. The primary contributions of this work are as follows: i) we demonstrate that existing self-supervised methods can learn shortcut solutions when simple and complex features are both present; ii) we introduce a novel self-supervised training method, SpliCER, to overcome the limitations of existing methods, and achieve significant downstream performance improvements; iii) we demonstrate the effectiveness of SpliCER in cutting-edge medical and geospatial imaging settings. SpliCER offers a powerful new tool for representation learning, enabling models to uncover complex features which could be overlooked by other methods.

35. 【2503.07435】Open-Set Gait Recognition from Sparse mmWave Radar Point Clouds

Link: https://arxiv.org/abs/2503.07435

Authors: Riccardo Mazzieri, Jacopo Pegoraro, Michele Rossi

Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)

Keywords: recently gathered significant, gathered significant attention, significant attention due, Open-set Gait Recognition, gait recognition

Comments:

Click to view abstract

Abstract:The adoption of Millimeter-Wave (mmWave) radar devices for human sensing, particularly gait recognition, has recently gathered significant attention due to their efficiency, resilience to environmental conditions, and privacy-preserving nature. In this work, we tackle the challenging problem of Open-set Gait Recognition (OSGR) from sparse mmWave radar point clouds. Unlike most existing research, which assumes a closed-set scenario, our work considers the more realistic open-set case, where unknown subjects might be present at inference time and should be correctly recognized by the system. Point clouds are well-suited for edge computing applications with resource constraints, but are more significantly affected by noise and random fluctuations than other representations, like the more common micro-Doppler signature. This is the first work addressing open-set gait recognition with sparse point cloud data. To do so, we propose a novel neural network architecture that combines supervised classification with unsupervised reconstruction of the point clouds, creating a robust, rich, and highly regularized latent space of gait features. To detect unknown subjects at inference time, we introduce a probabilistic novelty detection algorithm that leverages the structured latent space and offers a tunable trade-off between inference speed and prediction accuracy. Along with this paper, we release mmGait10, an original human gait dataset featuring over five hours of measurements from ten subjects, under varied walking modalities. Extensive experimental results show that our solution attains F1-Score improvements of 24% on average over state-of-the-art methods across multiple openness levels.

36. 【2503.07425】CATPlan: Loss-based Collision Prediction in End-to-End Autonomous Driving

Link: https://arxiv.org/abs/2503.07425

Authors: Ziliang Xiong, Shipeng Liu, Nathaniel Helgesen, Joakim Johnander, Per-Erik Forssen

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: autonomous driving, recent years, uncertainty, systems, uncertainty quantification

Comments:

Click to view abstract

Abstract:In recent years, there has been increased interest in the design, training, and evaluation of end-to-end autonomous driving (AD) systems. One often overlooked aspect is the uncertainty of planned trajectories predicted by these systems, despite awareness of their own uncertainty being key to achieving safety and robustness. We propose to estimate this uncertainty by adapting loss prediction from the uncertainty quantification literature. To this end, we introduce a novel light-weight module, dubbed CATPlan, that is trained to decode motion and planning embeddings into estimates of the collision loss used to partially supervise end-to-end AD systems. During inference, these estimates are interpreted as collision risk. We evaluate CATPlan on the safety-critical, NeRF-based, closed-loop benchmark NeuroNCAP and find that it manages to detect collisions with a $54.8\%$ relative improvement in average precision over a GMM-based baseline in which the predicted trajectory is compared to the forecasted trajectories of other road users. Our findings indicate that the addition of CATPlan can lead to safer end-to-end AD systems, and we hope that our work will spark increased interest in uncertainty quantification for such systems.

37. 【2503.07419】Analysis of 3D Urticaceae Pollen Classification Using Deep Learning Models

Link: https://arxiv.org/abs/2503.07419

Authors: Tijs Konijn, Imaan Bijl, Lu Cao, Fons Verbeek

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: pressing healthcare problem, climate change, hay fever, affected population, prolonged period

Comments:

Click to view abstract

Abstract:Due to climate change, hay fever has become a pressing healthcare problem, with an increasing number of people affected, a prolonged period of impact, and more severe symptoms. Precise pollen classification could help monitor the trend of allergenic pollen in the air throughout the year and guide preventive strategies launched by municipalities. Most pollen classification work uses 2D microscopy images or 2D projections derived from 3D image datasets. In this paper, we aim to use the whole stack of 3D images for classification and to evaluate the classification performance of different deep learning models. The 3D image dataset used in this paper is from the Urticaceae family, particularly the genera Urtica and Parietaria, which are morphologically similar yet differ significantly in allergenic potential. The pre-trained ResNet3D model, using optimal layer selection and extended epochs, achieved the best performance with an F1-score of 98.3%.

38. 【2503.07418】AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion

Link: https://arxiv.org/abs/2503.07418

Authors: Mingzhen Sun, Weining Wang, Gen Li, Jiawei Liu, Jiahui Sun, Wanquan Feng, Shanshan Lao, SiYu Zhou, Qian He, Jing Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: requires synthesizing visually, synthesizing visually realistic, temporally coherent video, generation requires synthesizing, requires synthesizing

Comments: Accepted by CVPR 2025

Click to view abstract

Abstract:The task of video generation requires synthesizing visually realistic and temporally coherent video frames. Existing methods primarily use asynchronous auto-regressive models or synchronous diffusion models to address this challenge. However, asynchronous auto-regressive models often suffer from inconsistencies between training and inference, leading to issues such as error accumulation, while synchronous diffusion models are limited by their reliance on rigid sequence length. To address these issues, we introduce Auto-Regressive Diffusion (AR-Diffusion), a novel model that combines the strengths of auto-regressive and diffusion models for flexible, asynchronous video generation. Specifically, our approach leverages diffusion to gradually corrupt video frames in both training and inference, reducing the discrepancy between these phases. Inspired by auto-regressive generation, we incorporate a non-decreasing constraint on the corruption timesteps of individual frames, ensuring that earlier frames remain clearer than subsequent ones. This setup, together with temporal causal attention, enables flexible generation of videos with varying lengths while preserving temporal coherence. In addition, we design two specialized timestep schedulers: the FoPP scheduler for balanced timestep sampling during training, and the AD scheduler for flexible timestep differences during inference, supporting both synchronous and asynchronous generation. Extensive experiments demonstrate the superiority of our proposed method, which achieves competitive and state-of-the-art results across four challenging benchmarks.
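
The non-decreasing constraint on per-frame corruption timesteps can be illustrated with a minimal sketch. This is not the paper's FoPP or AD scheduler: the function names, the uniform draw followed by a sort, and the values `num_frames=8` and `max_timestep=999` are all assumptions chosen only to show the constraint itself.

```python
import random


def sample_non_decreasing_timesteps(num_frames, max_timestep, rng):
    """Draw one corruption timestep per frame so that t[0] <= t[1] <= ...

    Hypothetical helper: uniform draws followed by a sort. The abstract's
    FoPP scheduler is not reproduced here.
    """
    draws = [rng.randint(0, max_timestep) for _ in range(num_frames)]
    return sorted(draws)


def is_non_decreasing(timesteps):
    """Check that earlier frames are never noisier than later frames."""
    return all(a <= b for a, b in zip(timesteps, timesteps[1:]))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
timesteps = sample_non_decreasing_timesteps(num_frames=8, max_timestep=999, rng=rng)
print(timesteps, is_non_decreasing(timesteps))
```

Under this constraint, the case where all frames share one timestep corresponds to synchronous diffusion, while strictly growing timesteps give the asynchronous, auto-regressive-like regime in which earlier frames are cleaner.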

39. 【2503.07417】GM-MoE: Low-Light Enhancement with Gated-Mechanism Mixture-of-Experts

Link: https://arxiv.org/abs/2503.07417

Authors: Minwen Liao, Hao Bo Dong, Xinyi Wang, Ziyang Yan, Yihua Shao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: improve information utilization, significantly improve information, remote sensing, autonomous driving, information utilization

Comments:

Click to view abstract

Abstract:Low-light enhancement has wide applications in autonomous driving, 3D reconstruction, remote sensing, surveillance, and so on, where it can significantly improve information utilization. However, most existing methods lack generalization and are limited to specific tasks such as image recovery. To address these issues, we propose Gated-Mechanism Mixture-of-Experts (GM-MoE), the first framework to introduce a mixture-of-experts network for low-light image enhancement. GM-MoE comprises a dynamic gated weight conditioning network and three sub-expert networks, each specializing in a distinct enhancement task. A self-designed gated mechanism dynamically adjusts the weights of the sub-expert networks for different data domains. Additionally, we integrate local and global feature fusion within sub-expert networks to enhance image quality by capturing multi-scale features. Experimental results demonstrate that GM-MoE achieves superior generalization with respect to 25 compared approaches, reaching state-of-the-art performance in PSNR on 5 benchmarks and in SSIM on 4 benchmarks.

40. 【2503.07416】TimeStep Master: Asymmetrical Mixture of Timestep LoRA Experts for Versatile and Efficient Diffusion Models in Vision

Link: https://arxiv.org/abs/2503.07416

Authors: Shaobin Zhuang, Yiwei Guo, Yanbo Ding, Kunchang Li, Xinyuan Chen, Yaohui Wang, Fangyikang Wang, Ying Zhang, Chen Li, Yali Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion models, Diffusion, past years, TimeStep LoRA experts, driven the advancement

Comments: 17 pages, 5 figures, 13 tables

Click to view abstract

Abstract:Diffusion models have driven the advancement of vision generation over the past years. However, it is often difficult to apply these large models in downstream tasks, due to the massive fine-tuning cost. Recently, Low-Rank Adaptation (LoRA) has been applied for efficient tuning of diffusion models. Unfortunately, the capabilities of LoRA-tuned diffusion models are limited, since the same LoRA is used for different timesteps of the diffusion process. To tackle this problem, we introduce a general and concise TimeStep Master (TSM) paradigm with two key fine-tuning stages. In the fostering stage (1-stage), we apply different LoRAs to fine-tune the diffusion model at different timestep intervals. This results in different TimeStep LoRA experts that can effectively capture different noise levels. In the assembling stage (2-stage), we design a novel asymmetrical mixture of TimeStep LoRA experts, via core-context collaboration of experts at multi-scale intervals. For each timestep, we leverage the TimeStep LoRA expert within the smallest interval as the core expert without gating, and use experts within the bigger intervals as the context experts with time-dependent gating. Consequently, our TSM can effectively model the noise level via the expert in the finest interval, and adaptively integrate contexts from the experts of other scales, boosting the versatility of diffusion models. To show the effectiveness of our TSM paradigm, we conduct extensive experiments on three typical and popular LoRA-related tasks of diffusion models, including domain adaptation, post-pretraining, and model distillation. Our TSM achieves state-of-the-art results on all these tasks, across various model structures (UNet, DiT and MM-DiT) and visual data modalities (Image, Video), showing its remarkable generalization capacity.
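
The core/context split described in the assembling stage can be sketched roughly as follows. Everything concrete here is an assumption for illustration: equal-width intervals, the scales `[1, 2, 4]`, 1000 diffusion steps, and the function names. The time-dependent gating of the context experts is left out, since the abstract does not specify its form.

```python
def expert_index(timestep, num_intervals, total_steps):
    """Index of the equal-width interval (out of num_intervals) containing timestep."""
    if not 0 <= timestep < total_steps:
        raise ValueError("timestep out of range")
    return timestep * num_intervals // total_steps


def select_experts(timestep, scales, total_steps=1000):
    """Pick the core expert (finest scale) and context experts (coarser scales).

    `scales` lists how many intervals each scale splits the timestep range into
    (hypothetical values). Each expert is identified as (num_intervals, index).
    """
    finest = max(scales)  # most intervals means the smallest interval width
    core = (finest, expert_index(timestep, finest, total_steps))
    context = [
        (n, expert_index(timestep, n, total_steps))
        for n in sorted(scales)
        if n != finest
    ]
    return core, context


core, context = select_experts(timestep=620, scales=[1, 2, 4])
print("core expert:", core)          # used without gating
print("context experts:", context)   # would be combined with time-dependent gates
```

With these assumed scales, every timestep activates exactly one expert per scale, which is why the mixture is asymmetrical: only the finest-scale expert is always applied ungated.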

41. 【2503.07413】REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding

Link: https://arxiv.org/abs/2503.07413

Authors: Yan Tai, Luhao Zhu, Zhiqiang Chen, Ynan Ding, Yiying Dong, Xiaohong Liu, Guodong Guo

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multimodal Large Language, Large Language, robust zero-shot capabilities, diverse vision-language tasks

Comments:

Click to view abstract

Abstract:Multimodal Large Language Models (MLLMs) demonstrate robust zero-shot capabilities across diverse vision-language tasks after training on mega-scale datasets. However, dense prediction tasks, such as semantic segmentation and keypoint detection, pose significant challenges for MLLMs when represented solely as text outputs. Simultaneously, current MLLMs utilizing latent embeddings for visual task decoding generally demonstrate limited adaptability to both multi-task learning and multi-granularity scenarios. In this work, we present REF-VLM, an end-to-end framework for unified training of various visual decoding tasks. To address complex visual decoding scenarios, we introduce the Triplet-Based Referring Paradigm (TRP), which explicitly decouples three critical dimensions in visual decoding tasks through a triplet structure: concepts, decoding types, and targets. TRP employs symbolic delimiters to enforce structured representation learning, enhancing the parsability and interpretability of model outputs. Additionally, we construct the Visual-Task Instruction Following Dataset (VT-Instruct), a large-scale multi-task dataset containing over 100 million multimodal dialogue samples across 25 task types. Beyond text inputs and outputs, VT-Instruct incorporates various visual prompts such as point, box, scribble, and mask, and generates outputs composed of text and visual units like box, keypoint, depth and mask. The combination of different visual prompts and visual units generates a wide variety of task types, expanding the applicability of REF-VLM significantly. Both qualitative and quantitative experiments demonstrate that our REF-VLM outperforms other MLLMs across a variety of standard benchmarks. The code, dataset, and demo are available at this https URL.

42. 【2503.07399】Keeping Representation Similarity in Finetuning for Medical Image Analysis

Link: https://arxiv.org/abs/2503.07399

Authors: Wenqiang Zu, Shenghao Xie, Hao Chen, Yiming Liang, Lei Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: large-scale natural images, large-scale natural, Foundation models pretrained, medical image analysis, foundation model original

Comments: 12 pages, 6 figures

Click to view abstract

Abstract:Foundation models pretrained on large-scale natural images have been widely adapted to medical image analysis through finetuning. This is largely attributed to pretrained representations capturing universal, robust, and generalizable features, which can be reused by downstream tasks. However, these representations have been found to gradually vanish during finetuning, accompanied by a degradation of the foundation model's original abilities, e.g., generalizability. In this paper, we argue that pretrained representations can be well preserved while still effectively adapting to downstream tasks. We study this by proposing a new finetuning method, RepSim, which minimizes the distance between pretrained and finetuned representations by constraining a learnable orthogonal manifold based on similarity invariance. Compared to standard finetuning methods, e.g., full finetuning, our method improves representation similarity by over 30% while maintaining competitive accuracy, and reduces sharpness by 42% across five medical image classification datasets. The code will be released.

43. 【2503.07396】Brain Inspired Adaptive Memory Dual-Net for Few-Shot Image Classification

Link: https://arxiv.org/abs/2503.07396

Authors: Kexin Di, Xiuxing Li, Yuyang Han, Ziyu Li, Qing Li, Xia Wu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: popular research topic, supervision collapse induced, single image-level annotation, image-level annotation remains, Few-shot image classification

Comments:

Click to view abstract

Abstract:Few-shot image classification has become a popular research topic for its wide application in real-world scenarios. However, the problem of supervision collapse induced by single image-level annotation remains a major challenge. Existing methods aim to tackle this problem by locating and aligning relevant local features. However, the high intra-class variability in real-world images poses significant challenges in locating semantically relevant local regions under few-shot settings. Drawing inspiration from the human complementary learning system, which excels at rapidly capturing and integrating semantic features from limited examples, we propose the generalization-optimized Systems Consolidation Adaptive Memory Dual-Network, SCAM-Net. This approach simulates the systems consolidation of the complementary learning system with an adaptive memory module, which successfully addresses the difficulty of identifying meaningful features in few-shot scenarios. Specifically, we construct a Hippocampus-Neocortex dual-network that consolidates a structured representation of each category. The structured representation is then stored and adaptively regulated, following the generalization optimization principle, in a long-term memory inside the Neocortex. Extensive experiments on benchmark datasets show that the proposed model achieves state-of-the-art performance.

44. 【2503.07392】SPEED: Scalable, Precise, and Efficient Concept Erasure for Diffusion Models

Link: https://arxiv.org/abs/2503.07392

Authors: Ouxiang Li, Yuan Wang, Xinting Hu, Houcheng Jiang, Tao Liang, Yanbin Hao, Guojun Ma, Fuli Feng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: increasingly crucial due, offensive content, copyright infringement, privacy violations, increasingly crucial

Comments:

Click to view abstract

Abstract:Erasing concepts from large-scale text-to-image (T2I) diffusion models has become increasingly crucial due to the growing concerns over copyright infringement, offensive content, and privacy violations. However, existing methods either require costly fine-tuning or degrade image quality for non-target concepts (i.e., prior) due to inherent optimization limitations. In this paper, we introduce SPEED, a model editing-based concept erasure approach that leverages null-space constraints for scalable, precise, and efficient erasure. Specifically, SPEED incorporates Influence-based Prior Filtering (IPF) to retain the most affected non-target concepts during erasing, Directed Prior Augmentation (DPA) to expand prior coverage while maintaining semantic consistency, and Invariant Equality Constraints (IEC) to regularize model editing by explicitly preserving key invariants during the T2I generation process. Extensive evaluations across multiple concept erasure tasks demonstrate that SPEED consistently outperforms existing methods in prior preservation while achieving efficient and high-fidelity concept erasure, successfully removing 100 concepts within just 5 seconds. Our code and models are available at: this https URL.

45. 【2503.07390】PersonaBooth: Personalized Text-to-Motion Generation

Link: https://arxiv.org/abs/2503.07390

Authors: Boeun Kim, Hea In Jeong, JungHoon Sung, Yihua Cheng, Jeongmin Lee, Ju Yong Chang, Sang-Il Choi, Younggeun Choi, Saim Shin, Jungho Kim, Hyung Jin Chang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generates personalized motions, personalized motions aligned, paper introduces Motion, generates personalized, introduces Motion Personalization

Comments:

Click to view abstract

Abstract:This paper introduces Motion Personalization, a new task that generates personalized motions aligned with text descriptions using several basic motions containing Persona. To support this novel task, we introduce a new large-scale motion dataset called PerMo (PersonaMotion), which captures the unique personas of multiple actors. We also propose a multi-modal finetuning method for a pretrained motion diffusion model, called PersonaBooth. PersonaBooth addresses two main challenges: i) a significant distribution gap between the persona-focused PerMo dataset and the pretraining datasets, which lack persona-specific data, and ii) the difficulty of capturing a consistent persona from motions that vary in content (action type). To tackle the dataset distribution gap, we introduce a persona token to accept new persona features and perform multi-modal adaptation for both text and visuals during finetuning. To capture a consistent persona, we incorporate a contrastive learning technique to enhance intra-cohesion among samples with the same persona. Furthermore, we introduce a context-aware fusion mechanism to maximize the integration of persona cues from multiple input motions. PersonaBooth outperforms state-of-the-art motion style transfer methods, establishing a new benchmark for motion personalization.

46. 【2503.07389】TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models

Link: https://arxiv.org/abs/2503.07389

Authors: Ruidong Chen, Honglin Guo, Lanjun Wang, Chenyu Zhang, Weizhi Nie, An-An Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: enable photorealistic image, NSFW images, Recent advances, photorealistic image generation, models enable photorealistic

Comments:

Click to view abstract

Abstract:Recent advances in text-to-image diffusion models enable photorealistic image generation, but they also risk producing malicious content, such as NSFW images. To mitigate this risk, concept erasure methods are studied to help the model unlearn specific concepts. However, current studies struggle to fully erase malicious concepts implicitly embedded in prompts (e.g., metaphorical expressions or adversarial prompts) while preserving the model's normal generation capability. To address this challenge, our study proposes TRCE, which uses a two-stage concept erasure strategy to achieve an effective trade-off between reliable erasure and knowledge preservation. Firstly, TRCE starts by erasing the malicious semantics implicitly embedded in textual prompts. By identifying a critical mapping objective (i.e., the [EoT] embedding), we optimize the cross-attention layers to map malicious prompts to contextually similar prompts but with safe concepts. This step prevents the model from being overly influenced by malicious semantics during the denoising process. Following this, considering the deterministic properties of the sampling trajectory of the diffusion model, TRCE further steers the early denoising prediction toward the safe direction and away from the unsafe one through contrastive learning, thus further avoiding the generation of malicious content. Finally, we conduct comprehensive evaluations of TRCE on multiple malicious concept erasure benchmarks, and the results demonstrate its effectiveness in erasing malicious concepts while better preserving the model's original generation ability. The code is available at: this http URL. CAUTION: This paper includes model-generated content that may contain offensive material.

47. 【2503.07375】Probabilistic Segmentation for Robust Field of View Estimation

Link: https://arxiv.org/abs/2503.07375

Authors: R. Spencer Hallyburton, David Hunt, Yiwei He, Judy He, Miroslav Pajic

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: autonomous vehicles, perception threaten, threaten the safe, FOV

Comments:

Click to view abstract

Abstract:Attacks on sensing and perception threaten the safe deployment of autonomous vehicles (AVs). Security-aware sensor fusion helps mitigate threats but requires accurate field of view (FOV) estimation, which has not been evaluated in the context of autonomy. To address this gap, we adapt classical computer graphics algorithms to develop the first autonomy-relevant FOV estimators and create the first datasets with ground truth FOV labels. Unfortunately, we find that these approaches are themselves highly vulnerable to attacks on sensing. To improve the robustness of FOV estimation against attacks, we propose a learning-based segmentation model that captures FOV features, integrates Monte Carlo dropout (MCD) for uncertainty quantification, and performs anomaly detection on confidence maps. Comprehensive evaluations illustrate attack resistance and strong generalization across environments. Architecture trade studies demonstrate that the model is feasible for real-time deployment in multiple applications.
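
As a rough illustration of Monte Carlo dropout for uncertainty quantification, the sketch below keeps dropout active at inference time, repeats the forward pass, and reports the mean and variance of the predictions. It is a toy single logistic unit in plain Python, not the paper's segmentation model; the function names, the input features, the weights, and the choices `drop_prob=0.5` and `passes=50` are all assumptions.

```python
import math
import random
from statistics import mean, pvariance


def stochastic_forward(features, weights, drop_prob, rng):
    """One forward pass of a toy logistic unit with dropout left ON at inference."""
    keep = 1.0 - drop_prob
    total = 0.0
    for x, w in zip(features, weights):
        if rng.random() < keep:        # keep this input with probability `keep`
            total += (x / keep) * w    # inverted-dropout rescaling
    return 1.0 / (1.0 + math.exp(-total))


def mc_dropout(features, weights, drop_prob=0.5, passes=50, seed=0):
    """Return (mean prediction, predictive variance) over repeated stochastic passes."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    rng = random.Random(seed)
    preds = [stochastic_forward(features, weights, drop_prob, rng) for _ in range(passes)]
    return mean(preds), pvariance(preds)


# Hypothetical per-pixel features and weights, for illustration only.
prob, uncertainty = mc_dropout(features=[0.8, -0.3, 1.2], weights=[1.5, 0.7, -0.4])
print(f"mean probability={prob:.3f}, variance={uncertainty:.4f}")
```

In a segmentation setting the same statistics would be computed per pixel, giving a confidence map on which anomaly detection could then operate.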

48. 【2503.07371】HGO-YOLO: Advancing Anomaly Behavior Detection with Hierarchical Features and Lightweight Optimized Detection

Link: https://arxiv.org/abs/2503.07371

Authors: Qizhi Zheng, Zhongze Luo, Meiyan Guo, Xinzhu Wang, Renqimuge Wu, Qiu Meng, Guanghui Dong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: hardware limitations, scenarios constrained, constrained by hardware, speed is essential, essential for enhancing

Comments: 10 pages

Click to view abstract

Abstract:Accurate and real-time object detection is crucial for anomaly behavior detection, especially in scenarios constrained by hardware limitations, where balancing accuracy and speed is essential for enhancing detection performance. This study proposes a model called HGO-YOLO, which integrates the HGNetv2 architecture into YOLOv8. This combination expands the receptive field and captures a wider range of features while simplifying model complexity through GhostConv. We introduce a lightweight detection head, OptiConvDetect, which utilizes parameter sharing to construct the detection head effectively. Evaluation results show that the proposed algorithm achieves a mAP@0.5 of 87.4% and a recall rate of 81.1%, with a model size of only 4.6 MB and a frame rate of 56 FPS on the CPU. HGO-YOLO not only improves accuracy by 3.0% but also reduces computational load by 51.69% (from 8.9 GFLOPs to 4.3 GFLOPs), while increasing the frame rate by a factor of 1.7. Additionally, real-time tests were conducted on Raspberry Pi 4 and NVIDIA platforms. These results indicate that the HGO-YOLO model demonstrates superior performance in anomaly behavior detection.
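
The reported 51.69% reduction in computational load follows directly from the two GFLOPs figures in the abstract. A quick check, using a hypothetical helper name:

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the percentage reduction when going from `before` to `after`."""
    return (before - after) / before * 100.0


gflops_before = 8.9  # baseline figure quoted in the abstract
gflops_after = 4.3   # HGO-YOLO figure quoted in the abstract
reduction = relative_reduction(gflops_before, gflops_after)
print(f"{reduction:.2f}%")  # prints 51.69%
```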

49. 【2503.07367】LEGO-Motion: Learning-Enhanced Grids with Occupancy Instance Modeling for Class-Agnostic Motion Prediction

Link: https://arxiv.org/abs/2503.07367

Authors: Kangan Qian, Jinyu Miao, Ziang Luo, Zheng Fu, Jinchen Li, Yining Shi, Yunlong Wang, Kun Jiang, Mengmeng Yang, Diange Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: autonomous driving systems, motion information plays, driving systems, information plays, plays a pivotal

Comments: 8 pages, 4 figures

Click to view abstract

Abstract:Accurate and reliable spatial and motion information plays a pivotal role in autonomous driving systems. However, object-level perception models struggle with handling open scenario categories and lack precise intrinsic geometry. On the other hand, occupancy-based class-agnostic methods excel in representing scenes but fail to ensure physics consistency and ignore the importance of interactions between traffic participants, hindering the model's ability to learn accurate and reliable motion. In this paper, we introduce a novel occupancy-instance modeling framework for class-agnostic motion prediction tasks, named LEGO-Motion, which incorporates instance features into Bird's Eye View (BEV) space. Our model comprises (1) a BEV encoder, (2) an Interaction-Augmented Instance Encoder, and (3) an Instance-Enhanced BEV Encoder, improving both interaction relationships and physics consistency within the model, thereby ensuring a more accurate and robust understanding of the environment. Extensive experiments on the nuScenes dataset demonstrate that our method achieves state-of-the-art performance, outperforming existing approaches. Furthermore, the effectiveness of our framework is validated on the advanced FMCW LiDAR benchmark, showcasing its practical applicability and generalization capabilities. The code will be made publicly available to facilitate further research.

50. 【2503.07365】MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning

Link: https://arxiv.org/abs/2503.07365

Authors: Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, Ping Luo, Yu Qiao, Qiaosheng Zhang, Wenqi Shao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: successfully extends large-scale, rule-based reinforcement learning, extends large-scale rule-based, large-scale rule-based reinforcement, present MM-Eureka

Abstract:We present MM-Eureka, a multimodal reasoning model that successfully extends large-scale rule-based reinforcement learning (RL) to multimodal reasoning. While rule-based RL has shown remarkable success in improving LLMs' reasoning abilities in text domains, its application to multimodal settings has remained challenging. Our work reproduces key characteristics of text-based RL systems like DeepSeek-R1 in the multimodal space, including steady increases in accuracy reward and response length, and the emergence of reflection behaviors. We demonstrate that both instruction-tuned and pre-trained models can develop strong multimodal reasoning capabilities through rule-based RL without supervised fine-tuning, showing superior data efficiency compared to alternative approaches. We open-source our complete pipeline to foster further research in this area. We release all our codes, models, data, etc. at this https URL

51. 【2503.07363】Inversion-Free Video Style Transfer with Trajectory Reset Attention Control and Content-Style Bridging

Link: https://arxiv.org/abs/2503.07363

Authors: Jiang Lin, Zili Yi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Reset Attention Control, Video style transfer, Trajectory Reset Attention, Attention Control, style

Abstract:Video style transfer aims to alter the style of a video while preserving its content. Previous methods often struggle with content leakage and style misalignment, particularly when using image-driven approaches that aim to transfer precise styles. In this work, we introduce Trajectory Reset Attention Control (TRAC), a novel method that allows for high-quality style transfer while preserving content integrity. TRAC operates by resetting the denoising trajectory and enforcing attention control, thus enhancing content consistency while significantly reducing the computational costs against inversion-based methods. Additionally, a concept termed Style Medium is introduced to bridge the gap between content and style, enabling a more precise and harmonious transfer of stylistic elements. Building upon these concepts, we present a tuning-free framework that offers a stable, flexible, and efficient solution for both image and video style transfer. Experimental results demonstrate that our proposed framework accommodates a wide range of stylized outputs, from precise content preservation to the production of visually striking results with vibrant and expressive styles.

52. 【2503.07353】Certifiably Optimal Anisotropic Rotation Averaging

Link: https://arxiv.org/abs/2503.07353

Authors: Carl Olsson, Yaroslava Lochman, Johan Malmport, Christopher Zach

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: vision and robotics, key subproblem, subproblem in applications, applications of computer, computer vision

Abstract:Rotation averaging is a key subproblem in applications of computer vision and robotics. Many methods for solving this problem exist, and there are also several theoretical results analyzing difficulty and optimality. However, one aspect that most of these have in common is a focus on the isotropic setting, where the intrinsic uncertainties in the measurements are not fully incorporated into the resulting optimization task. Recent empirical results suggest that moving to an anisotropic framework, where these uncertainties are explicitly included, can result in an improvement of solution quality. However, global optimization for rotation averaging has remained a challenge in this scenario. In this paper we show how anisotropic costs can be incorporated in certifiably optimal rotation averaging. We also demonstrate how existing solvers, designed for isotropic situations, fail in the anisotropic setting. Finally, we propose a stronger relaxation and show empirically that it is able to recover global optima in all tested datasets and leads to a more accurate reconstruction in all but one of the scenes.

53. 【2503.07348】Fully Unsupervised Annotation of C. Elegans

Link: https://arxiv.org/abs/2503.07348

Authors: Christoph Karg, Sebastian Stricker, Lisa Hutschenreiter, Bogdan Savchynskyy, Dagmar Kainmueller

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: unsupervised multi-graph matching, multi-graph matching, determine Gaussian parameters, work we present, applies to problems

Abstract:In this work we present a novel approach for unsupervised multi-graph matching, which applies to problems for which a Gaussian distribution of keypoint features can be assumed. We leverage cycle consistency as loss for self-supervised learning, and determine Gaussian parameters through Bayesian Optimization, yielding a highly efficient approach that scales to large datasets. Our fully unsupervised approach enables us to reach the accuracy of state-of-the-art supervised methodology for the use case of annotating cell nuclei in 3D microscopy images of the worm C. elegans. To this end, our approach yields the first unsupervised atlas of C. elegans, i.e. a model of the joint distribution of all of its cell nuclei, without the need for any ground truth cell annotation. This advancement enables highly efficient annotation of cell nuclei in large microscopy datasets of C. elegans. Beyond C. elegans, our approach offers fully unsupervised construction of cell-level atlases for any model organism with a stereotyped cell lineage, and thus bears the potential to catalyze respective comparative developmental studies in a range of further species.

54. 【2503.07347】DaD: Distilled Reinforcement Learning for Diverse Keypoint Detection

Link: https://arxiv.org/abs/2503.07347

Authors: Johan Edstedt, Georg Bökman, Mårten Wadenbäck, Michael Felsberg

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: systems to scale, thousands of images, scale to thousands, keypoint detection, keypoint detection objective

Abstract:Keypoints are what enable Structure-from-Motion (SfM) systems to scale to thousands of images. However, designing a keypoint detection objective is a non-trivial task, as SfM is non-differentiable. Typically, an auxiliary objective involving a descriptor is optimized. This, however, induces a dependency on the descriptor, which is undesirable. In this paper we propose a fully self-supervised and descriptor-free objective for keypoint detection, through reinforcement learning. To ensure training does not degenerate, we leverage a balanced top-K sampling strategy. While this already produces competitive models, we find that two qualitatively different types of detectors emerge, which are only able to detect light and dark keypoints respectively. To remedy this, we train a third detector, DaD, that optimizes the Kullback-Leibler divergence of the pointwise maximum of both light and dark detectors. Our approach significantly improves upon SotA across a range of benchmarks. Code and model weights are publicly available at this https URL

55. 【2503.07346】Now you see me! A framework for obtaining class-relevant saliency maps

Link: https://arxiv.org/abs/2503.07346

Authors: Nils Philipp Walter, Jilles Vreeken, Jonas Fischer

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Neural networks, daily-life decision-making, transparency are key, features neural networks, part of daily-life

Abstract:Neural networks are part of daily-life decision-making, including in high-stakes settings where understanding and transparency are key. Saliency maps have been developed to gain insight into which input features neural networks use for a specific prediction. Although widely employed, these methods often result in overly general saliency maps that fail to identify the specific information that triggered the classification. In this work, we suggest a framework that incorporates attributions across classes to arrive at saliency maps that actually capture the class-relevant information. On established benchmarks for attribution methods, including the grid-pointing game and randomization-based sanity checks, we show that our framework heavily boosts the performance of standard saliency map approaches. It is, by design, agnostic to model architectures and attribution methods, and makes it possible to identify the distinguishing and shared features used for a model prediction.

56. 【2503.07334】Unleashing the Potential of Large Language Models for Text-to-Image Generation through Autoregressive Representation Alignment

Link: https://arxiv.org/abs/2503.07334

Authors: Xing Xie, Jiawei Liu, Ziyue Lin, Huijie Fan, Zhi Han, Yandong Tang, Liangqiong Qu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Autoregressive Representation Alignment, present Autoregressive Representation, unlocks global-coherent, ARRA, Representation Alignment

Abstract:We present Autoregressive Representation Alignment (ARRA), a new training framework that unlocks global-coherent text-to-image generation in autoregressive LLMs without architectural changes. Unlike prior work that requires complex architectural redesigns, ARRA aligns LLM hidden states with visual representations from external visual foundational models via a global visual alignment loss and a hybrid token, HYBNEXT. This token enforces dual constraints: local next-token prediction and global semantic distillation, enabling LLMs to implicitly learn spatial and contextual coherence while retaining their original autoregressive paradigm. Extensive experiments validate ARRA's plug-and-play versatility. When training from text-generation-only LLMs or random initialization, ARRA reduces FID by 25.5% (MIMIC-CXR), 8.8% (DeepEyeNet), and 7.5% (ImageNet) for advanced autoregressive LLMs like Chameleon and LlamaGen, all without framework modifications. For domain adaption, ARRA aligns general-purpose LLMs with specialized models (e.g., BioMedCLIP), achieving an 18.6% FID reduction over direct fine-tuning on medical imaging (MIMIC-CXR). By demonstrating that training objective redesign -- not just architectural innovation -- can resolve cross-modal global coherence challenges, ARRA offers a complementary paradigm for advancing autoregressive models. Code and models will be released to advance autoregressive image generation.

57. 【2503.07330】Mitigating Hallucinations in YOLO-based Object Detection Models: A Revisit to Out-of-Distribution Detection

Link: https://arxiv.org/abs/2503.07330

Authors: Weicheng He, Changshun Wu, Chih-Hong Cheng, Xiaowei Huang, Saddek Bensalem

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)

Keywords: ensure safe decision-making, reliably perceive objects, dynamic environments, reliably perceive, overly confident

Abstract:Object detection systems must reliably perceive objects of interest without being overly confident to ensure safe decision-making in dynamic environments. Filtering techniques based on out-of-distribution (OoD) detection are commonly added as an extra safeguard to filter hallucinations caused by overconfidence in novel objects. Nevertheless, evaluating YOLO-family detectors and their filters under existing OoD benchmarks often leads to unsatisfactory performance. This paper studies the underlying reasons for performance bottlenecks and proposes a methodology to improve performance fundamentally. Our first contribution is a calibration of all existing evaluation results: Although images in existing OoD benchmark datasets are claimed not to have objects within in-distribution (ID) classes (i.e., categories defined in the training dataset), around 13% of objects detected by the object detector are actually ID objects. Dually, the ID dataset containing OoD objects can also negatively impact the decision boundary of filters. These ultimately lead to a significantly imprecise performance estimation. Our second contribution is to consider the task of hallucination reduction as a joint pipeline of detectors and filters. By developing a methodology to carefully synthesize an OoD dataset that semantically resembles the objects to be detected, and using the crafted OoD dataset in the fine-tuning of YOLO detectors to suppress the objectness score, we achieve an 88% reduction in overall hallucination error with a combined fine-tuned detection and filtering system on the self-driving benchmark BDD-100K. Our code and dataset are available at: this https URL.

58. 【2503.07323】Dynamic Path Navigation for Motion Agents with LLM Reasoning

Link: https://arxiv.org/abs/2503.07323

Authors: Yubo Zhao, Qi Wu, Yifan Wang, Yu-Wing Tai, Chi-Keung Tang

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Large Language, Language Models, demonstrated strong generalizable, strong generalizable reasoning

Abstract:Large Language Models (LLMs) have demonstrated strong generalizable reasoning and planning capabilities. However, their efficacy in spatial path planning and obstacle-free trajectory generation remains underexplored. Leveraging LLMs for navigation holds significant potential, given LLMs' ability to handle unseen scenarios, support user-agent interactions, and provide global control across complex systems, making them well-suited for agentic planning and humanoid motion generation. As one of the first studies in this domain, we explore the zero-shot navigation and path generation capabilities of LLMs by constructing a dataset and proposing an evaluation protocol. Specifically, we represent paths using anchor points connected by straight lines, enabling movement in various directions. This approach offers greater flexibility and practicality compared to previous methods while remaining simple and intuitive for LLMs. We demonstrate that, when tasks are well-structured in this manner, modern LLMs exhibit substantial planning proficiency in avoiding obstacles while autonomously refining navigation with the generated motion to reach the target. Further, this spatial reasoning ability of a single LLM motion agent interacting in a static environment can be seamlessly generalized to the coordination of multiple motion agents in dynamic environments. Unlike traditional approaches that rely on single-step planning or local policies, our training-free LLM-based method enables global, dynamic, closed-loop planning and autonomously resolves collision issues.
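
The anchor-point path representation mentioned in the abstract can be illustrated with a small sketch. It assumes 2D coordinates and circular obstacles given as `(cx, cy, radius)`, neither of which is specified in the abstract, and the function names are hypothetical. It is not the authors' dataset format or evaluation protocol, only one way to check that a polyline through anchor points avoids obstacles.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: both anchors coincide.
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def path_is_collision_free(anchors, obstacles):
    """True if no straight segment between consecutive anchors touches an obstacle.

    Obstacles are assumed to be circles (cx, cy, radius).
    """
    for a, b in zip(anchors, anchors[1:]):
        for cx, cy, radius in obstacles:
            if point_segment_distance((cx, cy), a, b) <= radius:
                return False
    return True


def path_length(anchors):
    """Total length of the polyline through the anchors."""
    return sum(
        math.hypot(bx - ax, by - ay)
        for (ax, ay), (bx, by) in zip(anchors, anchors[1:])
    )
```

With one obstacle at `(5, 0)` of radius 1, the direct path `[(0, 0), (10, 0)]` is rejected, while a path with one extra anchor, `[(0, 0), (5, 3), (10, 0)]`, passes the check.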

59. 【2503.07315】Group-robust Sample Reweighting for Subpopulation Shifts via Influence Functions

Link: https://arxiv.org/abs/2503.07315

Authors: Rui Qiao, Zhaoxuan Wu, Jingtan Wang, Pang Wei Koh, Bryan Kian Hsiang Low

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Machine learning models, Machine learning, uneven performance, learning models, data distributions

Notes: Accepted to the 13th International Conference on Learning Representations (ICLR 2025). Code is available at [this https URL](https://github.com/qiaoruiyt/GSR)

Abstract:Machine learning models often have uneven performance among subpopulations (a.k.a. groups) in the data distributions. This poses a significant challenge for the models to generalize when the proportions of the groups shift during deployment. To improve robustness to such shifts, existing approaches have developed strategies that train models or perform hyperparameter tuning using the group-labeled data to minimize the worst-case loss over groups. However, a non-trivial amount of high-quality labels is often required to obtain noticeable improvements. Given the costliness of the labels, we propose to adopt a different paradigm to enhance group label efficiency: utilizing the group-labeled data as a target set to optimize the weights of other group-unlabeled data. We introduce Group-robust Sample Reweighting (GSR), a two-stage approach that first learns the representations from group-unlabeled data, and then refines the model by iteratively retraining its last layer on the reweighted data using influence functions. Our GSR is theoretically sound, practically lightweight, and effective in improving the robustness to subpopulation shifts. In particular, GSR outperforms the previous state-of-the-art approaches that require the same amount or even more group labels.
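
To make the two objectives in the abstract concrete, the sketch below computes the worst-case loss over groups on a group-labeled set and a per-sample weighted loss of the kind a reweighting scheme would minimize. It is only an illustration under simple assumptions: per-sample losses are given as plain floats, the function names are hypothetical, and the influence-function step that GSR uses to choose the weights is not implemented here.

```python
from collections import defaultdict


def group_losses(losses, groups):
    """Mean loss per group, given per-sample losses and group labels."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for loss, group in zip(losses, groups):
        sums[group] += loss
        counts[group] += 1
    return {group: sums[group] / counts[group] for group in sums}


def worst_group_loss(losses, groups):
    """Worst-case (maximum) mean loss over groups."""
    return max(group_losses(losses, groups).values())


def weighted_loss(losses, weights):
    """Weighted average of per-sample losses; weights must not sum to zero."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * loss for w, loss in zip(weights, losses)) / total
```

For per-sample losses `[0.5, 1.5, 2.0, 4.0]` with groups `["a", "a", "b", "b"]`, the uniform average is 2.0 while the worst-group loss is 3.0, which is the gap that group-robust methods try to close.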

60. 【2503.07314】Automated Movie Generation via Multi-Agent CoT Planning

Link: https://arxiv.org/abs/2503.07314

Authors: Weijia Wu, Zeyu Zhu, Mike Zheng Shou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: requiring manual input, Existing long-form video, lack automated planning, automated movie generation, Existing long-form

Notes: The code and project website are available at: [this https URL](https://github.com/showlab/MovieAgent) and [this https URL](https://weijiawu.github.io/MovieAgent)

Abstract:Existing long-form video generation frameworks lack automated planning, requiring manual input for storylines, scenes, cinematography, and character interactions, resulting in high costs and inefficiencies. To address these challenges, we present MovieAgent, which performs automated movie generation via multi-agent Chain of Thought (CoT) planning. MovieAgent offers two key advantages: 1) We are the first to explore and define the paradigm of automated movie/long-video generation. Given a script and character bank, our MovieAgent can generate multi-scene, multi-shot long-form videos with a coherent narrative, while ensuring character consistency, synchronized subtitles, and stable audio throughout the film. 2) MovieAgent introduces a hierarchical CoT-based reasoning process to automatically structure scenes, camera settings, and cinematography, significantly reducing human effort. By employing multiple LLM agents to simulate the roles of a director, screenwriter, storyboard artist, and location manager, MovieAgent streamlines the production pipeline. Experiments demonstrate that MovieAgent achieves new state-of-the-art results in script faithfulness, character consistency, and narrative coherence. Our hierarchical framework takes a step forward and provides new insights into fully automated movie generation. The code and project website are available at: this https URL and this https URL.

61. 【2503.07307】AttenST: A Training-Free Attention-Driven Style Transfer Framework with Pre-Trained Diffusion Models

Link: https://arxiv.org/abs/2503.07307

Authors: Bo Huang, Wenlun Xu, Qizhuo Han, Haodong Jing, Ying Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: optimizing pre-trained models, achieved remarkable progress, high computational costs, balancing content preservation, methods typically rely

Abstract:While diffusion models have achieved remarkable progress in style transfer tasks, existing methods typically rely on fine-tuning or optimizing pre-trained models during inference, leading to high computational costs and challenges in balancing content preservation with style integration. To address these limitations, we introduce AttenST, a training-free attention-driven style transfer framework. Specifically, we propose a style-guided self-attention mechanism that conditions self-attention on the reference style by retaining the query of the content image while substituting its key and value with those from the style image, enabling effective style feature integration. To mitigate style information loss during inversion, we introduce a style-preserving inversion strategy that refines inversion accuracy through multiple resampling steps. Additionally, we propose a content-aware adaptive instance normalization, which integrates content statistics into the normalization process to optimize style fusion while mitigating content degradation. Furthermore, we introduce a dual-feature cross-attention mechanism to fuse content and style features, ensuring a harmonious synthesis of structural fidelity and stylistic expression. Extensive experiments demonstrate that AttenST outperforms existing methods, achieving state-of-the-art performance on the style transfer task.
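
The query/key/value substitution described in the abstract can be sketched with ordinary scaled dot-product attention: queries come from the content features, keys and values from the style features. The sketch below is a simplified illustration in pure Python on small lists of vectors. The function names are hypothetical, and it omits everything else in AttenST (the diffusion backbone, the inversion strategy, the normalization, and the cross-attention fusion).

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of row vectors."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
        )
    return outputs


def style_guided_attention(content_queries, style_keys, style_values):
    """Keep the content image's queries, take keys and values from the style image."""
    return attention(content_queries, style_keys, style_values)
```

Each output row is a convex combination of style values, selected by how well the content query matches each style key, so the output has one row per content token but is built entirely from style features.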

62. 【2503.07300】Goal Conditioned Reinforcement Learning for Photo Finishing Tuning

Link: https://arxiv.org/abs/2503.07300

Authors: Jiarui Wu, Yujin Wang, Lingen Li, Zhang Fan, Tianfan Xue

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Lightroom or Darktable, Adobe Lightroom, Photo finishing, photo finishing pipeline, manual tuning process

Notes: 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

Abstract:Photo finishing tuning aims to automate the manual tuning process of the photo finishing pipeline, like Adobe Lightroom or Darktable. Previous works either use zeroth-order optimization, which is slow when the set of parameters increases, or rely on a differentiable proxy of the target finishing pipeline, which is hard to train. To overcome these challenges, we propose a novel goal-conditioned reinforcement learning framework for efficiently tuning parameters using a goal image as a condition. Unlike previous approaches, our tuning framework does not rely on any proxy and treats the photo finishing pipeline as a black box. Utilizing a trained reinforcement learning policy, it can efficiently find the desired set of parameters within just 10 queries, while optimization based approaches normally take 200 queries. Furthermore, our architecture utilizes a goal image to guide the iterative tuning of pipeline parameters, allowing for flexible conditioning on pixel-aligned target images, style images, or any other visually representable goals. We conduct detailed experiments on photo finishing tuning and photo stylization tuning tasks, demonstrating the advantages of our method. Project website: this https URL.

63. 【2503.07298】ALLVB: All-in-One Long Video Understanding Benchmark

Link: https://arxiv.org/abs/2503.07298

Authors: Xichen Tan, Yuanjing Luo, Yunfan Ye, Fang Liu, Zhiping Cai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: long video understanding, video understanding, video understanding benchmark, Multi-modal LLMs, capabilities of Multi-modal

Notes: AAAI 2025

Abstract:From image to video understanding, the capabilities of Multi-modal LLMs (MLLMs) are increasingly powerful. However, most existing video understanding benchmarks are relatively short, which makes them inadequate for effectively evaluating the long-sequence modeling capabilities of MLLMs. This highlights the urgent need for a comprehensive and integrated long video understanding benchmark to assess the ability of MLLMs thoroughly. To this end, we propose ALLVB (ALL-in-One Long Video Understanding Benchmark). ALLVB's main contributions include: 1) It integrates 9 major video understanding tasks. These tasks are converted into video QA formats, allowing a single benchmark to evaluate 9 different video understanding capabilities of MLLMs, highlighting the versatility, comprehensiveness, and challenging nature of ALLVB. 2) A fully automated annotation pipeline using GPT-4o is designed, requiring only human quality control, which facilitates the maintenance and expansion of the benchmark. 3) It contains 1,376 videos across 16 categories, averaging nearly 2 hours each, with a total of 252k QAs. To the best of our knowledge, it is the largest long video understanding benchmark in terms of the number of videos, average duration, and number of QAs. We have tested various mainstream MLLMs on ALLVB, and the results indicate that even the most advanced commercial models have significant room for improvement. This reflects the benchmark's challenging nature and demonstrates the substantial potential for development in long video understanding.

64. 【2503.07294】Distilling Knowledge into Quantum Vision Transformers for Biomedical Image Classification

Link: https://arxiv.org/abs/2503.07294

Authors: Thomas Boucher, Evangelos B. Mazomenos

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: vision transformers, Quantum vision transformers, replacing linear layers, improve feature representation, quantum neural networks

Notes: Submitted for MICCAI 2025

Abstract:Quantum vision transformers (QViTs) build on vision transformers (ViTs) by replacing linear layers within the self-attention mechanism with parameterised quantum neural networks (QNNs), harnessing quantum mechanical properties to improve feature representation. This hybrid approach aims to achieve superior performance, with significantly reduced model complexity as a result of the enriched feature representation, requiring fewer parameters. This paper proposes a novel QViT model for biomedical image classification and investigates its performance against comparable ViTs across eight diverse datasets, encompassing various modalities and classification tasks. We assess models trained from scratch and those pre-trained using knowledge distillation (KD) from high-quality teacher models. Our findings demonstrate that QViTs outperform comparable ViTs with average ROC AUC (0.863 vs 0.846) and accuracy (0.710 vs 0.687) when trained from scratch, and even compete with state-of-the-art classical models in multiple tasks, whilst being significantly more efficient (89% reduction in GFLOPs and 99.99% in parameter number). Additionally, we find that QViTs and ViTs respond equally well to KD, with QViT pre-training performance scaling with model complexity. This is the first investigation into the efficacy of deploying QViTs with KD for computer-aided diagnosis. Our results highlight the enormous potential of quantum machine learning (QML) in biomedical image analysis.

65. 【2503.07276】A Systematic Review of ECG Arrhythmia Classification: Adherence to Standards, Fair Evaluation, and Embedded Feasibility

Link: https://arxiv.org/abs/2503.07276

Authors: Guilherme Silva, Pedro Silva, Gladston Moreira, Vander Freitas, Jadson Gertrudes, Eduardo Luz

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: signals is crucial, cardiac conditions, crucial for early, early detection, detection of arrhythmias

Abstract:The classification of electrocardiogram (ECG) signals is crucial for early detection of arrhythmias and other cardiac conditions. However, despite advances in machine learning, many studies fail to follow standardization protocols, leading to inconsistencies in performance evaluation and real-world applicability. Additionally, hardware constraints essential for practical deployment, such as in pacemakers, Holter monitors, and wearable ECG patches, are often overlooked. Since real-world impact depends on feasibility in resource-constrained devices, ensuring efficient deployment is critical for continuous monitoring. This review systematically analyzes ECG classification studies published between 2017 and 2024, focusing on those adhering to the E3C (Embedded, Clinical, and Comparative Criteria), which include inter-patient paradigm implementation, compliance with Association for the Advancement of Medical Instrumentation (AAMI) recommendations, and model feasibility for embedded systems. While many studies report high accuracy, few properly consider patient-independent partitioning and hardware limitations. We identify state-of-the-art methods meeting E3C criteria and conduct a comparative analysis of accuracy, inference time, energy consumption, and memory usage. Finally, we propose standardized reporting practices to ensure fair comparisons and practical applicability of ECG classification models. By addressing these gaps, this study aims to guide future research toward more robust and clinically viable ECG classification systems.

66. 【2503.07274】Efficient Distillation of Classifier-Free Guidance using Adapters

Link: https://arxiv.org/abs/2503.07274

Authors: Cristian Perez Jensen, Seyedmorteza Sadat

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: neural function evaluations, function evaluations, guidance distillation methods, essential for conditional, doubles the number

Abstract:While classifier-free guidance (CFG) is essential for conditional diffusion models, it doubles the number of neural function evaluations (NFEs) per inference step. To mitigate this inefficiency, we introduce adapter guidance distillation (AGD), a novel approach that simulates CFG in a single forward pass. AGD leverages lightweight adapters to approximate CFG, effectively doubling the sampling speed while maintaining or even improving sample quality. Unlike prior guidance distillation methods that tune the entire model, AGD keeps the base model frozen and only trains minimal additional parameters ($\sim$2%) to significantly reduce the resource requirement of the distillation phase. Additionally, this approach preserves the original model weights and enables the adapters to be seamlessly combined with other checkpoints derived from the same base model. We also address a key mismatch between training and inference in existing guidance distillation methods by training on CFG-guided trajectories instead of standard diffusion trajectories. Through extensive experiments, we show that AGD achieves comparable or superior FID to CFG across multiple architectures with only half the NFEs. Notably, our method enables the distillation of large models ($\sim$2.6B parameters) on a single consumer GPU with 24 GB of VRAM, making it more accessible than previous approaches that require multiple high-end GPUs. We will publicly release the implementation of our method.
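
The claim that CFG doubles the number of NFEs per step follows from its standard form, which combines an unconditional and a conditional prediction as `eps_uncond + w * (eps_cond - eps_uncond)`. The sketch below illustrates this with a toy stand-in model that counts its forward passes. `CountingModel` and `cfg_step` are hypothetical names, the model is not a real denoiser, and the adapter-based distillation proposed in the paper is not implemented here.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard CFG combination: eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


class CountingModel:
    """Toy stand-in for a denoiser that counts its forward passes (NFEs)."""

    def __init__(self):
        self.nfe = 0

    def __call__(self, x, cond):
        self.nfe += 1
        # Arbitrary toy behaviour: the condition shifts the prediction.
        shift = 0.0 if cond is None else float(cond)
        return [0.5 * xi + shift for xi in x]


def cfg_step(model, x, cond, guidance_scale):
    """One guided prediction; note the two model calls per step."""
    eps_uncond = model(x, None)
    eps_cond = model(x, cond)
    return cfg_combine(eps_uncond, eps_cond, guidance_scale)
```

A guidance-distilled model aims to return the combined prediction from a single call, so its counter would advance by one per step instead of two.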

67. 【2503.07266】Customized SAM 2 for Referring Remote Sensing Image Segmentation

Link: https://arxiv.org/abs/2503.07266

Authors: Fu Rong, Meng Lan, Qian Zhang, Lefei Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Referring Remote Sensing, Remote Sensing Image, Sensing Image Segmentation, Remote Sensing, Referring Remote

Abstract:Referring Remote Sensing Image Segmentation (RRSIS) aims to segment target objects in remote sensing (RS) images based on textual descriptions. Although Segment Anything Model 2 (SAM 2) has shown remarkable performance in various segmentation tasks, its application to RRSIS presents several challenges, including understanding the text-described RS scenes and generating effective prompts from text descriptions. To address these issues, we propose RS2-SAM 2, a novel framework that adapts SAM 2 to RRSIS by aligning the adapted RS features and textual features, providing pseudo-mask-based dense prompts, and enforcing boundary constraints. Specifically, we first employ a union encoder to jointly encode the visual and textual inputs, generating aligned visual and text embeddings as well as multimodal class tokens. Then, we design a bidirectional hierarchical fusion module to adapt SAM 2 to RS scenes and align adapted visual features with the visually enhanced text embeddings, improving the model's interpretation of text-described RS scenes. Additionally, a mask prompt generator is introduced to take the visual embeddings and class tokens as input and produce a pseudo-mask as the dense prompt of SAM 2. To further refine segmentation, we introduce a text-guided boundary loss to optimize segmentation boundaries by computing text-weighted gradient differences. Experimental results on several RRSIS benchmarks demonstrate that RS2-SAM 2 achieves state-of-the-art performance.

68. 【2503.07265】WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

Link: https://arxiv.org/abs/2503.07265

Authors: Yuwei Niu, Munan Ning, Mengren Zheng, Bin Lin, Peng Jin, Jiaqi Liao, Kunpeng Ning, Bin Zhu, Li Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: generating high-quality artistic, high-quality artistic creations, visual content, capable of generating

Notes: Code, data and leaderboard: [this https URL](https://github.com/PKU-YuanGroup/WISE)

69. 【2503.07259】COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition

Link: https://arxiv.org/abs/2503.07259

Authors: Baiyu Chen, Wilson Wongso, Zechen Li, Yonchanok Khaokaew, Hao Xue, Flora Salim

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: human activity recognition, video-based models capture, human activity, capture rich semantic, activity recognition

Comments:

Abstract:Egocentric video-based models capture rich semantic information and have demonstrated strong performance in human activity recognition (HAR). However, their high power consumption, privacy concerns, and dependence on lighting conditions limit their feasibility for continuous on-device recognition. In contrast, inertial measurement unit (IMU) sensors offer an energy-efficient and privacy-preserving alternative, yet they suffer from limited large-scale annotated datasets, leading to weaker generalization in downstream tasks. To bridge this gap, we propose COMODO, a cross-modal self-supervised distillation framework that transfers rich semantic knowledge from the video modality to the IMU modality without requiring labeled annotations. COMODO leverages a pretrained and frozen video encoder to construct a dynamic instance queue, aligning the feature distributions of video and IMU embeddings. By distilling knowledge from video representations, our approach enables the IMU encoder to inherit rich semantic information from video while preserving its efficiency for real-world applications. Experiments on multiple egocentric HAR datasets demonstrate that COMODO consistently improves downstream classification performance, achieving results comparable to or exceeding fully supervised fine-tuned models. Moreover, COMODO exhibits strong cross-dataset generalization. Benefiting from its simplicity, our method is also generally applicable to various video and time-series pre-trained models, offering the potential to leverage more powerful teacher and student foundation models in future research. The code is available at this https URL .
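The queue-based alignment described above can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not give the loss, so the cosine similarity, the two temperatures, and the cross-entropy form below are hypothetical choices, and `distillation_loss` is an illustrative name rather than the authors' API.

```python
import math


def softmax(scores, temperature):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def distillation_loss(video_emb, imu_emb, queue, t_teacher=0.05, t_student=0.1):
    """Cross-entropy between teacher and student similarity distributions.

    `queue` holds stored video embeddings (assumed to come from the frozen
    video encoder). The teacher distribution is built from the video
    embedding, the student distribution from the IMU embedding.
    Temperatures are assumed values, not taken from the paper.
    """
    teacher = softmax([cosine(video_emb, q) for q in queue], t_teacher)
    student = softmax([cosine(imu_emb, q) for q in queue], t_student)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))


if __name__ == "__main__":
    queue = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    video = [1.0, 0.1]
    print(distillation_loss(video, [1.0, 0.2], queue))   # IMU embedding close to video
    print(distillation_loss(video, [-1.0, 0.1], queue))  # IMU embedding far from video
```

The loss is small when the IMU embedding relates to the queue entries the same way the video embedding does, and large otherwise, which is the sense in which the two feature distributions are aligned without labels.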

70. 【2503.07253】AnomalyPainter: Vision-Language-Diffusion Synergy for Zero-Shot Realistic and Diverse Industrial Anomaly Synthesis

Link: https://arxiv.org/abs/2503.07253

Authors: Zhangyu Lai, Yilin Lu, Xinyang Li, Jianghang Lin, Yansong Qu, Liujuan Cao, Ming Li, Rongrong Ji

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: made remarkable progress, Language Large Model, Vision Language Large, Latent Diffusion Model, synergizing Vision Language

Comments: anomaly synthesis, anomaly detection

Abstract:While existing anomaly synthesis methods have made remarkable progress, achieving both realism and diversity in synthesis remains a major obstacle. To address this, we propose AnomalyPainter, a zero-shot framework that breaks the diversity-realism trade-off dilemma through synergizing Vision Language Large Model (VLLM), Latent Diffusion Model (LDM), and our newly introduced texture library Tex-9K. Tex-9K is a professional texture library containing 75 categories and 8,792 texture assets crafted for diverse anomaly synthesis. Leveraging VLLM's general knowledge, reasonable anomaly text descriptions are generated for each industrial object and matched with relevant diverse textures from Tex-9K. These textures then guide the LDM via ControlNet to paint on normal images. Furthermore, we introduce Texture-Aware Latent Init to stabilize the natural-image-trained ControlNet for industrial images. Extensive experiments show that AnomalyPainter outperforms existing methods in realism, diversity, and generalization, achieving superior downstream performance.

71. 【2503.07252】Semantic Communications with Computer Vision Sensing for Edge Video Transmission

Link: https://arxiv.org/abs/2503.07252

Authors: Yubo Peng, Luping Xiang, Kun Yang, Kezhi Wang, Merouane Debbah

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)

Keywords: data consumes substantial, consumes substantial spectrum, video data consumes, substantial spectrum resources, widespread adoption

Comments:

Abstract:Despite the widespread adoption of vision sensors in edge applications, such as surveillance, the transmission of video data consumes substantial spectrum resources. Semantic communication (SC) offers a solution by extracting and compressing information at the semantic level, preserving the accuracy and relevance of transmitted data while significantly reducing the volume of transmitted information. However, traditional SC methods face inefficiencies due to the repeated transmission of static frames in edge videos, exacerbated by the absence of sensing capabilities, which results in spectrum inefficiency. To address this challenge, we propose an SC with computer vision sensing (SCCVS) framework for edge video transmission. The framework first introduces a compression ratio (CR) adaptive SC (CRSC) model, capable of adjusting CR based on whether the frames are static or dynamic, effectively conserving spectrum resources. Additionally, we implement an object detection and semantic segmentation models-enabled sensing (OSMS) scheme, which intelligently senses the changes in the scene and assesses the significance of each frame through in-context analysis. Hence, the OSMS scheme provides CR prompts to the CRSC model based on real-time sensing results. Moreover, both CRSC and OSMS are designed as lightweight models, ensuring compatibility with resource-constrained sensors commonly used in practical edge applications. Experimental simulations validate the effectiveness of the proposed SCCVS framework, demonstrating its ability to enhance transmission efficiency without sacrificing critical semantic information.
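The static-versus-dynamic decision that drives the compression ratio can be illustrated with a toy sketch. The abstract says OSMS relies on object detection and semantic segmentation but gives no further detail, so the mean-absolute-difference test, the threshold, and the two CR values below are illustrative assumptions and not the authors' method. CR is assumed here to mean the fraction of features kept, so a lower value means stronger compression.

```python
def mean_abs_diff(prev, curr):
    """Average per-pixel absolute difference between two equally sized
    grayscale frames, each given as a list of rows of floats in [0, 1]."""
    total = 0.0
    count = 0
    for row_prev, row_curr in zip(prev, curr):
        for p, c in zip(row_prev, row_curr):
            total += abs(p - c)
            count += 1
    return total / count


def choose_compression_ratios(frames, threshold=0.05, static_cr=0.1, dynamic_cr=0.5):
    """Assign one CR per frame.

    The first frame has no predecessor and is treated as dynamic. Later
    frames count as static when they barely differ from the previous frame.
    All numeric defaults are hypothetical.
    """
    ratios = []
    for i, frame in enumerate(frames):
        if i == 0 or mean_abs_diff(frames[i - 1], frame) > threshold:
            ratios.append(dynamic_cr)
        else:
            ratios.append(static_cr)
    return ratios


if __name__ == "__main__":
    still = [[0.2, 0.2], [0.2, 0.2]]
    moved = [[0.9, 0.2], [0.2, 0.2]]
    print(choose_compression_ratios([still, still, moved, moved]))
```

In this toy run only the first frame and the frame where the scene changes receive the higher ratio; repeated static frames are sent with the lower one.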

72. 【2503.07249】Text-IRSTD: Leveraging Semantic Text to Promote Infrared Small Target Detection in Complex Scenes

Link: https://arxiv.org/abs/2503.07249

Authors: Feng Huang, Shuyuan Zheng, Zhaobing Qiu, Huanxian Liu, Huanxin Bai, Liqiong Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Infrared small target, computer vision, Infrared small, small target detection, hot and challenging

Comments:

Abstract:Infrared small target detection is currently a hot and challenging task in computer vision. Existing methods usually focus on mining visual features of targets and struggle to cope with complex and diverse detection scenarios. The main reason is that infrared small targets carry limited image information on their own, so relying only on visual features fails to discriminate targets from interference, leading to lower detection performance. To address this issue, we introduce a novel approach leveraging semantic text to guide infrared small target detection, called Text-IRSTD. It innovatively expands classical IRSTD to text-guided IRSTD, providing a new research idea. On the one hand, we devise a novel fuzzy semantic text prompt to accommodate ambiguous target categories. On the other hand, we propose a progressive cross-modal semantic interaction decoder (PCSID) to facilitate information fusion between texts and images. In addition, we construct a new benchmark consisting of 2,755 infrared images of different scenarios with fuzzy semantic textual annotations, called FZDT. Extensive experimental results demonstrate that our method achieves better detection performance and target contour recovery than the state-of-the-art methods. Moreover, the proposed Text-IRSTD shows strong generalization and wide application prospects in unseen detection scenarios. The dataset and code will be publicly released after acceptance of this paper.

73. 【2503.07235】Retinex-MEF: Retinex-based Glare Effects Aware Unsupervised Multi-Exposure Image Fusion

Link: https://arxiv.org/abs/2503.07235

Authors: Haowen Bai, Jiangshe Zhang, Zixiang Zhao, Lilun Deng, Yukun Cui, Shuang Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: low dynamic range, high dynamic range, singular high dynamic, dynamic range images, dynamic range

Comments:

Abstract:Multi-exposure image fusion consolidates multiple low dynamic range images of the same scene into a single high dynamic range image. Retinex theory, which separates image illumination from scene reflectance, is naturally adopted to ensure consistent scene representation and effective information fusion across varied exposure levels. However, the conventional pixel-wise multiplication of illumination and reflectance inadequately models the glare effect induced by overexposure. To better adapt this theory for multi-exposure image fusion, we introduce an unsupervised and controllable method termed Retinex-MEF. Specifically, our method decomposes multi-exposure images into separate illumination components and a shared reflectance component, while effectively modeling the glare induced by overexposure. Employing a bidirectional loss constraint to learn the common reflectance component, our approach effectively mitigates the glare effect. Furthermore, we establish a controllable exposure fusion criterion, enabling global exposure adjustments while preserving contrast, thus overcoming the constraints of fixed-level fusion. A series of experiments across multiple datasets, including underexposure-overexposure fusion, exposure control fusion, and homogeneous extreme exposure fusion, demonstrate the effective decomposition and flexible fusion capability of our model.
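For reference, the conventional Retinex image-formation model that the abstract calls pixel-wise multiplication can be written as below. The symbols are assumptions chosen here for illustration ($I_k$ for the $k$-th exposure, $L_k$ for its illumination, $R$ for the reflectance shared by all $K$ exposures); the abstract gives neither the paper's notation nor its glare term, so no glare term is shown.

$$
I_k(x) = L_k(x) \cdot R(x), \qquad k = 1, \dots, K
$$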

74. 【2503.07234】CoT-Drive: Efficient Motion Forecasting for Autonomous Driving with LLMs and Chain-of-Thought Prompting

Link: https://arxiv.org/abs/2503.07234

Authors: Haicheng Liao, Hanlin Kong, Bonan Wang, Chengyue Wang, Wang Ye, Zhengbing He, Chengzhong Xu, Zhenning Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: Accurate motion forecasting, safe autonomous driving, Accurate motion, motion forecasting, autonomous driving

Comments:

Abstract:Accurate motion forecasting is crucial for safe autonomous driving (AD). This study proposes CoT-Drive, a novel approach that enhances motion forecasting by leveraging large language models (LLMs) and a chain-of-thought (CoT) prompting method. We introduce a teacher-student knowledge distillation strategy to effectively transfer LLMs' advanced scene understanding capabilities to lightweight language models (LMs), ensuring that CoT-Drive operates in real-time on edge devices while maintaining comprehensive scene understanding and generalization capabilities. By leveraging CoT prompting techniques for LLMs without additional training, CoT-Drive generates semantic annotations that significantly improve the understanding of complex traffic environments, thereby boosting the accuracy and robustness of predictions. Additionally, we present two new scene description datasets, Highway-Text and Urban-Text, designed for fine-tuning lightweight LMs to generate context-specific semantic annotations. Comprehensive evaluations on five real-world datasets demonstrate that CoT-Drive outperforms existing models, highlighting its effectiveness and efficiency in handling complex traffic scenarios. Overall, this study is the first to consider the practical application of LLMs in this field. It pioneers the training and use of a lightweight LLM surrogate for motion forecasting, setting a new benchmark and showcasing the potential of integrating LLMs into AD systems.

75. 【2503.07232】Boosting Diffusion-Based Text Image Super-Resolution Model Towards Generalized Real-World Scenarios

Link: https://arxiv.org/abs/2503.07232

Authors: Chenglu Pan, Xiaogang Xu, Ganggui Ding, Yunke Zhang, Wenbo Li, Jiarong Xu, Qingbiao Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Restoring low-resolution text, Restoring low-resolution, low-resolution text images, text images presents, significant challenge

Comments:

Abstract:Restoring low-resolution text images presents a significant challenge, as it requires maintaining both the fidelity and stylistic realism of the text in restored images. Existing text image restoration methods often fall short in hard situations, as the traditional super-resolution models cannot guarantee clarity, while diffusion-based methods fail to maintain fidelity. In this paper, we introduce a novel framework aimed at improving the generalization ability of diffusion models for text image super-resolution (SR), especially promoting fidelity. First, we propose a progressive data sampling strategy that incorporates diverse image types at different stages of training, stabilizing the convergence and improving the generalization. For the network architecture, we leverage a pre-trained SR prior to provide robust spatial reasoning capabilities, enhancing the model's ability to preserve textual information. Additionally, we employ a cross-attention mechanism to better integrate textual priors. To further reduce errors in textual priors, we utilize confidence scores to dynamically adjust the importance of textual features during training. Extensive experiments on real-world datasets demonstrate that our approach not only produces text images with more realistic visual appearances but also improves the accuracy of text structure.

76. 【2503.07230】A Deep Learning Architecture for Land Cover Mapping Using Spatio-Temporal Sentinel-1 Features

Link: https://arxiv.org/abs/2503.07230

Authors: Luigi Russo, Antonietta Sorriso, Silvia Liberata Ullo, Paolo Gamba

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: Land Cover, Convolutional Neural Networks, Synthetic Aperture Radar, mapping using satellite, monitoring and management

Comments: Submitted to IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing

Abstract:Land Cover (LC) mapping using satellite imagery is critical for environmental monitoring and management. Deep Learning (DL), particularly Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), has revolutionized this field by enhancing the accuracy of classification tasks. In this work, a novel approach combining a transformer-based Swin-Unet architecture with seasonal synthesized spatio-temporal images has been employed to classify LC types using spatio-temporal features extracted from Sentinel-1 (S1) Synthetic Aperture Radar (SAR) data, organized into seasonal clusters. The study focuses on three distinct regions - Amazonia, Africa, and Siberia - and evaluates the model performance across diverse ecoregions within these areas. By utilizing seasonal feature sequences instead of dense temporal sequences, notable performance improvements have been achieved, especially in regions with temporal data gaps like Siberia, where the S1 data distribution is uneven. The results demonstrate the effectiveness and the generalization capabilities of the proposed methodology in achieving high overall accuracy (O.A.) values, even in regions with limited training data.

77. 【2503.07217】ReelWave: A Multi-Agent Framework Toward Professional Movie Sound Generation

Link: https://arxiv.org/abs/2503.07217

Authors: Zixuan Wang, Chi-Keung Tang, Yu-Wing Tai

Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Film production, important application, application for generative, Film, audio

Comments:

Abstract:Film production is an important application for generative audio, where richer context is provided through multiple scenes. In ReelWave, we propose a multi-agent framework for audio generation inspired by the professional movie production process. We first capture semantically and temporally synchronized "on-screen" sound by training a prediction model that predicts three interpretable time-varying audio control signals comprising loudness, pitch, and timbre. These three parameters are subsequently specified as conditions by a cross-attention module. Then, our framework infers "off-screen" sound to complement the generation through cooperative interaction between communicative agents. Each agent takes up a specific role similar to those in a movie production team and is supervised by an agent called the director. We also investigate the case where the conditioning video consists of multiple scenes, which is frequently seen in videos extracted from movies of considerable length. Consequently, our framework can capture a richer context of audio generation conditioned on video clips extracted from movies.

78. 【2503.07209】Synthetic Lung X-ray Generation through Cross-Attention and Affinity Transformation

Link: https://arxiv.org/abs/2503.07209

Authors: Ruochen Pi, Lianlei Shan

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Collecting and annotating, resource-intensive task, time-consuming and resource-intensive, Collecting, diffusion model trained

Comments:

Abstract:Collecting and annotating medical images is a time-consuming and resource-intensive task. However, generating synthetic data through models such as Diffusion offers a cost-effective alternative. This paper introduces a new method for the automatic generation of accurate semantic masks from synthetic lung X-ray images based on a stable diffusion model trained on text-image pairs. This method uses cross-attention mapping between text and image to extend text-driven image synthesis to semantic mask generation. It employs text-guided cross-attention information to identify specific areas in an image and combines this with innovative techniques to produce high-resolution, class-differentiated pixel masks. This approach significantly reduces the costs associated with data collection and annotation. The experimental results demonstrate that segmentation models trained on synthetic data generated using the method are comparable to, and in some cases even better than, models trained on real datasets. This shows the effectiveness of the method and its potential to revolutionize medical image analysis.

79. 【2503.07204】Endo-FASt3r: Endoscopic Foundation model Adaptation for Structure from motion

Link: https://arxiv.org/abs/2503.07204

Authors: Mona Sheikh Zeinoddin, Mobarakol Islam, Zafer Tandogdu, Greg Shaw, Mathew J. Clarkson, Evangelos Mazomenos, Danail Stoyanov

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Accurate depth, achieving high-quality, visualisations in robotic-assisted, robotic-assisted surgery, pose estimation

Comments:

Abstract:Accurate depth and camera pose estimation is essential for achieving high-quality 3D visualisations in robotic-assisted surgery. Despite recent advancements in foundation model adaptation to monocular depth estimation of endoscopic scenes via self-supervised learning (SSL), no prior work has explored their use for pose estimation. These methods rely on low rank-based adaptation approaches, which constrain model updates to a low-rank space. We propose Endo-FASt3r, the first monocular SSL depth and pose estimation framework that uses foundation models for both tasks. We extend the Reloc3r relative pose estimation foundation model by designing Reloc3rX, introducing modifications necessary for convergence in SSL. We also present DoMoRA, a novel adaptation technique that enables higher-rank updates and faster convergence. Experiments on the SCARED dataset show that Endo-FASt3r achieves a substantial 10% improvement in pose estimation and a 2% improvement in depth estimation over prior work. Similar performance gains on the Hamlyn and StereoMIS datasets reinforce the generalisability of Endo-FASt3r across different datasets.
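The low-rank constraint mentioned above can be made concrete with a small sketch. It shows a generic LoRA-style update, where a weight change is the product of two thin matrices, and checks that its rank cannot exceed the inner dimension. This is not DoMoRA, whose construction the abstract does not describe; the matrix sizes and the helper names `matmul` and `rank` are assumptions for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    return [
        [sum(a[i][t] * b[t][j] for t in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def rank(matrix, eps=1e-9):
    """Matrix rank via Gaussian elimination with partial pivoting."""
    m = [list(map(float, row)) for row in matrix]
    rows, cols = len(m), len(m[0])
    r = 0
    for c in range(cols):
        if r == rows:
            break
        pivot = max(range(r, rows), key=lambda i: abs(m[i][c]))
        if abs(m[pivot][c]) < eps:
            continue  # no usable pivot in this column
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, rows):
            factor = m[i][c] / m[r][c]
            for j in range(c, cols):
                m[i][j] -= factor * m[r][j]
        r += 1
    return r


if __name__ == "__main__":
    d, k, r = 4, 4, 1                      # hypothetical layer size and adapter rank
    B = [[1.0], [2.0], [3.0], [4.0]]       # d x r
    A = [[1.0, 0.0, 2.0, 1.0]]             # r x k
    delta_w = matmul(B, A)                 # d x k update, rank at most r
    print("rank of update:", rank(delta_w))
    print("trainable parameters:", r * (d + k), "vs full update:", d * k)
```

However B and A are trained, the product stays within rank r, which is the restriction that higher-rank adaptation techniques aim to relax.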

80. 【2503.07197】Effective and Efficient Masked Image Generation Models

Link: https://arxiv.org/abs/2503.07197

Authors: Zebin You, Jingyang Ou, Xiaolu Zhang, Jun Hu, Jun Zhou, Chongxuan Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: masked image generation, motivations and objectives, single framework, masked image, Fréchet Inception Distance

Comments:

Abstract:Although masked image generation models and masked diffusion models are designed with different motivations and objectives, we observe that they can be unified within a single framework. Building upon this insight, we carefully explore the design space of training and sampling, identifying key factors that contribute to both performance and efficiency. Based on the improvements observed during this exploration, we develop our model, referred to as eMIGM. Empirically, eMIGM demonstrates strong performance on ImageNet generation, as measured by Fréchet Inception Distance (FID). In particular, on ImageNet 256x256, with a similar number of function evaluations (NFEs) and model parameters, eMIGM outperforms the seminal VAR. Moreover, as NFE and model parameters increase, eMIGM achieves performance comparable to the state-of-the-art continuous diffusion models while requiring less than 40% of the NFE. Additionally, on ImageNet 512x512, with only about 60% of the NFE, eMIGM outperforms the state-of-the-art continuous diffusion models.

81. 【2503.07191】All That Glitters Is Not Gold: Key-Secured 3D Secrets within 3D Gaussian Splatting

Link: https://arxiv.org/abs/2503.07191

Authors: Yan Ren, Shilin Lu, Adams Wai-Kin Kong

Categories: Graphics (cs.GR); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advances, Gaussian Splatting, revolutionized scene reconstruction, opening new possibilities, revolutionized scene

Comments:

Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have revolutionized scene reconstruction, opening new possibilities for 3D steganography by hiding 3D secrets within 3D covers. The key challenge in steganography is ensuring imperceptibility while maintaining high-fidelity reconstruction. However, existing methods often suffer from detectability risks and utilize only suboptimal 3DGS features, limiting their full potential. We propose a novel end-to-end key-secured 3D steganography framework (KeySS) that jointly optimizes a 3DGS model and a key-secured decoder for secret reconstruction. Our approach reveals that Gaussian features contribute unequally to secret hiding. The framework incorporates a key-controllable mechanism enabling multi-secret hiding and unauthorized access prevention, while systematically exploring optimal feature update to balance fidelity and security. To rigorously evaluate steganographic imperceptibility beyond conventional 2D metrics, we introduce 3D-Sinkhorn distance analysis, which quantifies distributional differences between original and steganographic Gaussian parameters in the representation space. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both cover and secret reconstruction while maintaining high security levels, advancing the field of 3D steganography. Code is available at this https URL

82. 【2503.07190】Multi-Modal 3D Mesh Reconstruction from Images and Text

Link: https://arxiv.org/abs/2503.07190

Authors: Melvin Reka, Tessa Pulli, Markus Vincze

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: require large datasets, high computational costs, object pose estimation, large datasets, struggle to generalize

Comments: under review

83. 【2503.07185】Evaluation of Alignment-Regularity Characteristics in Deformable Image Registration

Link: https://arxiv.org/abs/2503.07185

Authors: Vasiliki Sideri-Lampretsa, Daniel Rueckert, Huaqi Qiu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Evaluating deformable image, achieving high alignment, high alignment accuracy, maintaining deformation regularity, Evaluating deformable

Comments:

Abstract:Evaluating deformable image registration (DIR) is challenging due to the inherent trade-off between achieving high alignment accuracy and maintaining deformation regularity. In this work, we introduce a novel evaluation scheme based on the alignment-regularity characteristic (ARC) to systematically capture and analyze this trade-off. We first introduce the ARC curves, which describe the performance of a given registration algorithm as a spectrum measured by alignment and regularity metrics. We further adopt a HyperNetwork-based approach that learns to continuously interpolate across the full regularization range, accelerating the construction and improving the sample density of ARC curves. We empirically demonstrate our evaluation scheme using representative learning-based deformable image registration methods with various network architectures and transformation models on two public datasets. We present a range of findings not evident from existing evaluation practices and provide general recommendations for model evaluation and selection using our evaluation scheme. All relevant code is made publicly available.

84. 【2503.07173】Towards Spatial Transcriptomics-guided Pathological Image Recognition with Batch-Agnostic Encoder

Link: https://arxiv.org/abs/2503.07173

Authors: Kazuya Nishimura, Ryoma Bise, Yasuhiro Kojima

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: simultaneously captures pathological, technique that simultaneously, simultaneously captures, spatial coordinates, captures pathological images

Comments: Accepted to ISBI 2025

Abstract:Spatial transcriptomics (ST) is a novel technique that simultaneously captures pathological images and gene expression profiles with spatial coordinates. Since ST is closely related to pathological features such as disease subtypes, it may be valuable to augment image representation with pathological information. However, there have been no attempts to leverage ST for image recognition (i.e., patch-level classification of pathological image subtypes). One of the big challenges is the significant batch effects in spatial transcriptomics, which make it difficult to extract pathological features of images from ST. In this paper, we propose a batch-agnostic contrastive learning framework that can extract consistent signals from the gene expression of ST in multiple patients. To extract consistent signals from ST, we utilize a batch-agnostic gene encoder that is trained in a variational inference manner. Experiments demonstrated the effectiveness of our framework on a publicly available dataset. Code is publicly available at this https URL

85. 【2503.07168】HisTrackMap: Global Vectorized High-Definition Map Construction via History Map Tracking

Link: https://arxiv.org/abs/2503.07168

Authors: Jing Yang, Sen Yang, Xiao Tan, Hanli Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: implicitly propagate queries, autonomous driving systems, query-based detection frameworks, precise environmental information, maps provide rich

Comments:

Abstract:As an essential component of autonomous driving systems, high-definition (HD) maps provide rich and precise environmental information for auto-driving scenarios; however, existing methods, which primarily rely on query-based detection frameworks to directly model map elements or implicitly propagate queries over time, often struggle to maintain consistent temporal perception outcomes. These inconsistencies pose significant challenges to the stability and reliability of real-world autonomous driving and map data collection systems. To address this limitation, we propose a novel end-to-end tracking framework for global map construction by temporally tracking map elements' historical trajectories. Firstly, an instance-level historical rasterization map representation is designed to explicitly store previous perception results, which can control and maintain different global instances' history information in a fine-grained way. Secondly, we introduce a Map-Trajectory Prior Fusion module within this tracking framework, leveraging historical priors for tracked instances to improve temporal smoothness and continuity. Thirdly, we propose a global perspective metric to evaluate the quality of temporal geometry construction in HD maps, filling the gap in current metrics for assessing global geometric perception results. Substantial experiments on the nuScenes and Argoverse2 datasets demonstrate that the proposed method outperforms state-of-the-art (SOTA) methods in both single-frame and temporal metrics. Our project page: this https URL.

86. 【2503.07167】Temporal Overlapping Prediction: A Self-supervised Pre-training Method for LiDAR Moving Object Segmentation

Link: https://arxiv.org/abs/2503.07167

Authors: Ziliang Miao, Runjian Chen, Yixi Cai, Buwei He, Wenquan Zhao, Wenqi Shao, Bo Zhang, Fu Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Moving object segmentation, Moving object, self-driving vehicles, clouds is crucial

Comments:

Abstract:Moving object segmentation (MOS) on LiDAR point clouds is crucial for autonomous systems like self-driving vehicles. Previous supervised approaches rely heavily on costly manual annotations, while LiDAR sequences naturally capture temporal motion cues that can be leveraged for self-supervised learning. In this paper, we propose Temporal Overlapping Prediction (TOP), a self-supervised pre-training method that alleviates the labeling burden for MOS. TOP explores the temporal overlapping points that are commonly observed by current and adjacent scans, and learns spatiotemporal representations by predicting the occupancy states of temporal overlapping points. Moreover, we utilize current occupancy reconstruction as an auxiliary pre-training objective, which enhances the current structural awareness of the model. We conduct extensive experiments and observe that the conventional metric Intersection-over-Union (IoU) shows a strong bias toward objects with more scanned points, which might neglect small or distant objects. To compensate for this bias, we introduce an additional metric called $\text{mIoU}_{\text{obj}}$ to evaluate object-level performance. Experiments on nuScenes and SemanticKITTI show that TOP outperforms both the supervised training-from-scratch baseline and other self-supervised pre-training baselines by up to 28.77% relative improvement, demonstrating strong transferability across LiDAR setups and generalization to other tasks. Code and pre-trained models will be publicly available upon publication.
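The bias of point-level IoU toward large objects can be reproduced with a toy example. The abstract does not define $\text{mIoU}_{\text{obj}}$, so the object-level score below is a simplified stand-in and an assumption: for each ground-truth moving object it computes IoU restricted to that object's own points (where every point is a positive, so only hits and misses count) and then averages over objects. The function names are hypothetical.

```python
from collections import defaultdict


def point_iou(pred, gt):
    """Conventional IoU for the 'moving' class over all points."""
    tp = sum(1 for p, g in zip(pred, gt) if p and g)
    fp = sum(1 for p, g in zip(pred, gt) if p and not g)
    fn = sum(1 for p, g in zip(pred, gt) if not p and g)
    union = tp + fp + fn
    return tp / union if union else 1.0


def per_object_mean_iou(pred, gt, object_ids):
    """Average of per-object scores, each object weighted equally.

    Simplified stand-in for an object-level metric: only points that belong
    to a ground-truth moving object are considered, grouped by object id.
    """
    hits = defaultdict(int)
    sizes = defaultdict(int)
    for p, g, oid in zip(pred, gt, object_ids):
        if g:
            sizes[oid] += 1
            if p:
                hits[oid] += 1
    if not sizes:
        return 1.0
    return sum(hits[o] / sizes[o] for o in sizes) / len(sizes)


if __name__ == "__main__":
    # One large moving object (98 points, all detected) and one small,
    # distant moving object (2 points, all missed).
    gt = [True] * 100
    pred = [True] * 98 + [False] * 2
    ids = ["large"] * 98 + ["small"] * 2
    print("point-level IoU:", point_iou(pred, gt))
    print("object-level score:", per_object_mean_iou(pred, gt, ids))
```

The point-level score stays near its maximum even though one of the two objects is missed entirely, while the per-object average exposes the miss.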

87. 【2503.07157】MIRAM: Masked Image Reconstruction Across Multiple Scales for Breast Lesion Risk Prediction

Link: https://arxiv.org/abs/2503.07157

Authors: Hung Q. Vo, Pengyu Yuan, Zheng Yin, Kelvin K. Wong, Chika F. Ezeana, Son T. Ly, Stephen T.C. Wong, Hien V. Nguyen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: computer vision communities, garnered substantial interest, Self-supervised learning, vision communities, garnered substantial

Comments:

Abstract:Self-supervised learning (SSL) has garnered substantial interest within the machine learning and computer vision communities. Two prominent approaches in SSL include contrastive-based learning and self-distillation utilizing cropping augmentation. Lately, masked image modeling (MIM) has emerged as a more potent SSL technique, employing image inpainting as a pretext task. MIM creates a strong inductive bias toward meaningful spatial and semantic understanding. This has opened up new opportunities for SSL to contribute not only to classification tasks but also to more complex applications like object detection and image segmentation. Building upon this progress, our research paper introduces a scalable and practical SSL approach centered around more challenging pretext tasks that facilitate the acquisition of robust features. Specifically, we leverage multi-scale image reconstruction from randomly masked input images as the foundation for feature learning. Our hypothesis posits that reconstructing high-resolution images enables the model to attend to finer spatial details, particularly beneficial for discerning subtle intricacies within medical images. The proposed SSL features help improve classification performance on the Curated Breast Imaging Subset of Digital Database for Screening Mammography (CBIS-DDSM) dataset. In pathology classification, our method demonstrates a 3% increase in average precision (AP) and a 1% increase in the area under the receiver operating characteristic curve (AUC) when compared to state-of-the-art (SOTA) algorithms. Moreover, in mass margins classification, our approach achieves a 4% increase in AP and a 2% increase in AUC.
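The random masking step that masked image modeling starts from can be sketched as follows. This covers only the input masking, not the multi-scale reconstruction targets, which the abstract does not specify. The patch size, the 75% mask ratio, and the helper names are assumptions for illustration, and the image height and width are assumed to be divisible by the patch size.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=None):
    """Pick a random subset of patch indices to hide from the encoder."""
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    return set(rng.sample(range(num_patches), num_masked))


def mask_image(image, patch, masked, fill=0.0):
    """Return a copy of `image` (list of rows) with masked patches filled.

    Patches are indexed row-major over a grid of `patch` x `patch` blocks.
    """
    width = len(image[0])
    patches_per_row = width // patch
    out = [row[:] for row in image]
    for y in range(len(image)):
        for x in range(width):
            index = (y // patch) * patches_per_row + (x // patch)
            if index in masked:
                out[y][x] = fill
    return out


if __name__ == "__main__":
    image = [[1.0] * 8 for _ in range(8)]          # toy 8x8 image
    masked = random_patch_mask(16, 0.75, seed=0)   # 16 patches of size 2x2
    visible = mask_image(image, 2, masked)
    print(len(masked), "of 16 patches masked")
    for row in visible:
        print(row)
```

A reconstruction model would then be trained to predict the hidden patches from the visible ones; in the multi-scale setting described above, the targets would additionally be produced at more than one resolution.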

88. 【2503.07152】Controllable 3D Outdoor Scene Generation via Scene Graphs

链接：https://arxiv.org/abs/2503.07152

作者：Yuheng Liu,Xinke Li,Yuning Zhang,Lu Qi,Xin Li,Wenping Wang,Chongshou Li,Xueting Li,Ming-Hsuan Yang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Three-dimensional scene generation, spanning autonomous driving, applications spanning autonomous, Three-dimensional scene, scene graphs

备注： Project Page: [this https URL](https://yuheng.ink/project-page/control-3d-scene/)

点击查看摘要

Abstract:Three-dimensional scene generation is crucial in computer vision, with applications spanning autonomous driving, gaming and the metaverse. Current methods either lack user control or rely on imprecise, non-intuitive conditions. In this work, we propose a method that uses scene graphs, an accessible, user-friendly control format, to generate outdoor 3D scenes. We develop an interactive system that transforms a sparse scene graph into a dense BEV (Bird's Eye View) Embedding Map, which guides a conditional diffusion model to generate 3D scenes that match the scene graph description. During inference, users can easily create or modify scene graphs to generate large-scale outdoor scenes. We create a large-scale dataset with paired scene graphs and 3D semantic scenes to train the BEV embedding and diffusion models. Experimental results show that our approach consistently produces high-quality 3D urban scenes closely aligned with the input scene graphs. To the best of our knowledge, this is the first approach to generate 3D outdoor scenes conditioned on scene graphs.

89. 【2503.07135】VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation

链接：https://arxiv.org/abs/2503.07135

作者：Hanzhi Chen,Boyang Sun,Anran Zhang,Marc Pollefeys,Stefan Leutenegger

类目：Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词：versatile systems capable, Future robots, envisioned as versatile, capable of performing, performing a variety

备注： Accepted to CVPR 2025

点击查看摘要

Abstract:Future robots are envisioned as versatile systems capable of performing a variety of household tasks. The big question remains: how can we bridge the embodiment gap while minimizing physical robot learning, which fundamentally does not scale well? We argue that learning from in-the-wild human videos offers a promising solution for robotic manipulation tasks, as vast amounts of relevant data already exist on the internet. In this work, we present VidBot, a framework enabling zero-shot robotic manipulation using learned 3D affordance from in-the-wild monocular RGB-only human videos. VidBot leverages a pipeline to extract explicit representations from them, namely 3D hand trajectories from videos, combining a depth foundation model with structure-from-motion techniques to reconstruct temporally consistent, metric-scale 3D affordance representations agnostic to embodiments. We introduce a coarse-to-fine affordance learning model that first identifies coarse actions from the pixel space and then generates fine-grained interaction trajectories with a diffusion model, conditioned on coarse actions and guided by test-time constraints for context-aware interaction planning, enabling substantial generalization to novel scenes and embodiments. Extensive experiments demonstrate the efficacy of VidBot, which significantly outperforms counterparts across 13 manipulation tasks in zero-shot settings and can be seamlessly deployed across robot systems in real-world environments. VidBot paves the way for leveraging everyday human videos to make robot learning more scalable.

90. 【2503.07133】A Light Perspective for 3D Object Detection

链接：https://arxiv.org/abs/2503.07133

作者：Marcelo Eduardo Pederiva,José Mario De Martino,Alessandro Zimmer

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：autonomous vehicle technologies, advancing autonomous vehicle, accurately detecting objects, Comprehending the environment, space are essential

备注：

点击查看摘要

Abstract:Comprehending the environment and accurately detecting objects in 3D space are essential for advancing autonomous vehicle technologies. Integrating Camera and LIDAR data has emerged as an effective approach for achieving high accuracy in 3D Object Detection models. However, existing methodologies often rely on heavy, traditional backbones that are computationally demanding. This paper introduces a novel approach that incorporates cutting-edge Deep Learning techniques into the feature extraction process, aiming to create more efficient models without compromising performance. Our model, NextBEV, surpasses established feature extractors like ResNet50 and MobileNetV2. On the KITTI 3D Monocular detection benchmark, NextBEV achieves an accuracy improvement of 2.39%, having less than 10% of the MobileNetV3 parameters. Moreover, we propose changes in LIDAR backbones that decreased the original inference time to 10 ms. Additionally, by fusing these lightweight proposals, we have enhanced the accuracy of the VoxelNet-based model by 2.93% and improved the F1-score of the PointPillar-based model by approximately 20%. Therefore, this work contributes to establishing lightweight and powerful models for individual or fusion techniques, making them more suitable for onboard implementations.

91. 【2503.07125】Learning A Zero-shot Occupancy Network from Vision Foundation Models via Self-supervised Adaptation

链接：https://arxiv.org/abs/2503.07125

作者：Sihao Lin,Daqi Liu,Ruochong Fu,Dongrui Liu,Andy Song,Hongwei Xie,Zhihui Li,Bing Wang,Xiaojun Chang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：challenging task due, metric depth, fundamental yet challenging, labour-intensive nature, challenging task

备注： preprint

点击查看摘要

Abstract:Estimating the 3D world from 2D monocular images is a fundamental yet challenging task due to the labour-intensive nature of 3D annotations. To simplify label acquisition, this work proposes a novel approach that bridges 2D vision foundation models (VFMs) with 3D tasks by decoupling 3D supervision into an ensemble of image-level primitives, e.g., semantic and geometric components. As a key motivator, we leverage the zero-shot capabilities of vision-language models for image semantics. However, due to the notorious ill-posed problem - multiple distinct 3D scenes can produce identical 2D projections, directly inferring metric depth from a monocular image in a zero-shot manner is unsuitable. In contrast, 2D VFMs provide promising sources of relative depth, which theoretically aligns with metric depth when properly scaled and offset. Thus, we adapt the relative depth derived from VFMs into metric depth by optimising the scale and offset using temporal consistency, also known as novel view synthesis, without access to ground-truth metric depth. Consequently, we project the semantics into 3D space using the reconstructed metric depth, thereby providing 3D supervision. Extensive experiments on nuScenes and SemanticKITTI demonstrate the effectiveness of our framework. For instance, the proposed method surpasses the current state-of-the-art by 3.34% mIoU on nuScenes for voxel occupancy prediction.
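The "scaled and offset" relationship mentioned in the abstract above is an affine map, metric ≈ s · relative + t. The following is a minimal sketch of that relationship only. It is not the paper's method, which optimises the scale and offset through temporal consistency without ground-truth depth. Here the two parameters are recovered by ordinary least squares from a few reference depths. The function name `fit_scale_offset` and all sample values are illustrative assumptions.

```python
def fit_scale_offset(relative, metric):
    """Closed-form least-squares fit of metric ≈ s * relative + t."""
    n = len(relative)
    if n != len(metric) or n < 2:
        raise ValueError("need at least two paired samples")
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0:
        raise ValueError("relative depths must not all be equal")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    s = cov / var_r            # scale
    t = mean_m - s * mean_r    # offset
    return s, t


# Hypothetical data: unitless relative depths and matching metric depths
# generated with scale 2.5 and offset 0.3 (values chosen for illustration).
relative_depth = [0.10, 0.25, 0.40, 0.80]
metric_depth = [2.5 * r + 0.3 for r in relative_depth]

scale, offset = fit_scale_offset(relative_depth, metric_depth)
print(f"scale={scale:.3f}, offset={offset:.3f}")
```

Once a scale and offset are known, every relative depth value can be converted with the same two numbers, which is why only two parameters need to be estimated per depth map in this simplified view.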

92. 【2503.07120】Exposure Bias Reduction for Enhancing Diffusion Transformer Feature Caching

链接：https://arxiv.org/abs/2503.07120

作者：Zhen Zou,Hu Yu,Jie Xiao,Feng Zhao

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：high computational complexity, faces great challenges, great challenges due, impressive generation capabilities, exhibited impressive generation

备注：

点击查看摘要

Abstract:Diffusion Transformer (DiT) has exhibited impressive generation capabilities but faces great challenges due to its high computational complexity. To address this problem, various methods, notably feature caching, have been introduced. However, these approaches focus on aligning non-cache diffusion without analyzing the impact of caching on the generation of intermediate processes. This lack of exploration leaves room for analysis and improvement. In this paper, we analyze the impact of caching on the SNR of the diffusion process and discern that feature caching intensifies the denoising procedure, and we further identify this as a more severe exposure bias issue. Drawing on this insight, we introduce EB-Cache, a joint cache strategy that aligns the Non-exposure bias (which gives us a higher performance ceiling) diffusion process. Our approach incorporates a comprehensive understanding of caching mechanisms and offers a novel perspective on leveraging caches to expedite diffusion processes. Empirical results indicate that EB-Cache optimizes model performance while concurrently facilitating acceleration. Specifically, in the 50-step generation process, EB-Cache achieves 1.49× acceleration with 0.63 FID reduction from 3.69, surpassing prior acceleration methods. Code will be available at this https URL.

93. 【2503.07115】YOLOMG: Vision-based Drone-to-Drone Detection with Appearance and Pixel-Level Motion Fusion

链接：https://arxiv.org/abs/2503.07115

作者：Hanqing Guo,Xiuxiu Lin,Shiyu Zhao

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：attracted increasing attention, increasing attention due, vision-based swarming, attracted increasing, increasing attention

备注： 9 pages, 8 figures

点击查看摘要

Abstract:Vision-based drone-to-drone detection has attracted increasing attention due to its importance in numerous tasks such as vision-based swarming, aerial see-and-avoid, and malicious drone detection. However, existing methods often encounter failures when the background is complex or the target is tiny. This paper proposes a novel end-to-end framework that accurately identifies small drones in complex environments using motion guidance. It starts by creating a motion difference map to capture the motion characteristics of tiny drones. Next, this motion difference map is combined with an RGB image using a bimodal fusion module, allowing for adaptive feature learning of the drone. Finally, the fused feature map is processed through an enhanced backbone and detection head based on the YOLOv5 framework to achieve accurate detection results. To validate our method, we propose a new dataset, named ARD100, which comprises 100 videos (202,467 frames) covering various challenging conditions and has the smallest average object size compared with the existing drone detection datasets. Extensive experiments on the ARD100 and NPS-Drones datasets show that our proposed detector performs exceptionally well under challenging conditions and surpasses state-of-the-art algorithms across various metrics. We publicly release the codes and ARD100 dataset at this https URL.
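The abstract above does not say how the motion difference map is computed. As a minimal sketch, assuming plain frame differencing on two grayscale frames stored as nested lists, the idea can be illustrated as follows. The function name `motion_difference_map`, the threshold, and the tiny frames are hypothetical and are not taken from the paper.

```python
def motion_difference_map(prev_frame, curr_frame, threshold=0):
    """Per-pixel absolute difference of two equally sized grayscale frames.

    Differences at or below `threshold` are set to 0 so that small
    intensity fluctuations do not show up as motion.
    """
    if len(prev_frame) != len(curr_frame):
        raise ValueError("frames must have the same height")
    diff_map = []
    for row_prev, row_curr in zip(prev_frame, curr_frame):
        if len(row_prev) != len(row_curr):
            raise ValueError("frames must have the same width")
        row = []
        for p, c in zip(row_prev, row_curr):
            d = abs(c - p)
            row.append(d if d > threshold else 0)
        diff_map.append(row)
    return diff_map


# Hypothetical 3x3 frames: one pixel changes strongly (a tiny moving
# target), another changes only slightly (treated as noise).
prev_frame = [[10, 10, 10],
              [10, 10, 10],
              [10, 10, 10]]
curr_frame = [[10, 12, 10],
              [10, 90, 10],
              [10, 10, 10]]

motion_map = motion_difference_map(prev_frame, curr_frame, threshold=5)
print(motion_map)
```

In a fusion setup like the one the abstract describes, a map of this kind would be supplied alongside the RGB image as a second input modality.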

94. 【2503.07107】Towards Experience Replay for Class-Incremental Learning in Fully-Binary Networks

链接：https://arxiv.org/abs/2503.07107

作者：Yanis Basso-Bert,Anca Molnos,Romain Lemaire,William Guicquero,Antoine Dupret

类目：Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词：enable Artificial Neural, Artificial Neural Network, Binary Neural Networks, Artificial Neural, ultra-low power edge

备注：

点击查看摘要

Abstract:Binary Neural Networks (BNNs) are a promising approach to enable Artificial Neural Network (ANN) implementation on ultra-low power edge devices. Such devices may compute data in highly dynamic environments, in which the classes targeted for inference can evolve or even novel classes may arise, requiring continual learning. Class Incremental Learning (CIL) is a common type of continual learning for classification problems that has been scarcely addressed in the context of BNNs. Furthermore, most existing BNN models are not fully binary, as they require several real-valued network layers at the input, at the output, and for batch normalization. This paper goes a step further, enabling class incremental learning in Fully-Binarized NNs (FBNNs) through four main contributions. First, we revisit the FBNN design and its training procedure that is suitable for CIL. Second, we explore loss balancing, a method to trade off the performance of past and current classes. Third, we propose a semi-supervised method to pre-train the feature extractor of the FBNN for transferable representations. Fourth, two conventional CIL methods, i.e., Latent and Native replay, are thoroughly compared. These contributions are exemplified first on the CIFAR100 dataset, before being scaled up to address the CORE50 continual learning benchmark. The final results based on our 3Mb FBNN on CORE50 exhibit performance on par with or better than that of larger conventional real-valued NN models.

95. 【2503.07101】SimROD: A Simple Baseline for Raw Object Detection with Global and Local Enhancements

链接：https://arxiv.org/abs/2503.07101

作者：Haiyang Xie,Xi Shen,Shihua Huang,Zheng Wang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：offers significant advantages, RAW object detection, preserving sensor information, data offers significant, ISP processing

备注：

点击查看摘要

Abstract:Most visual models are designed for sRGB images, yet RAW data offers significant advantages for object detection by preserving sensor information before ISP processing. This enables improved detection accuracy and more efficient hardware designs by bypassing the ISP. However, RAW object detection is challenging due to limited training data, unbalanced pixel distributions, and sensor noise. To address this, we propose SimROD, a lightweight and effective approach for RAW object detection. We introduce a Global Gamma Enhancement (GGE) module, which applies a learnable global gamma transformation with only four parameters, improving feature representation while keeping the model efficient. Additionally, we leverage the green channel's richer signal to enhance local details, aligning with the human eye's sensitivity and Bayer filter design. Extensive experiments on multiple RAW object detection datasets and detectors demonstrate that SimROD outperforms state-of-the-art methods like RAW-Adapter and DIAP while maintaining efficiency. Our work highlights the potential of RAW data for real-world object detection.

96. 【2503.07098】OmniSAM: Omnidirectional Segment Anything Model for UDA in Panoramic Semantic Segmentation

链接：https://arxiv.org/abs/2503.07098

作者：Ding Zhong,Xu Zheng,Chenfei Liao,Yuanhuiyi Lyu,Jialei Chen,Shengyang Wu,Linfeng Zhang,Xuming Hu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：strong base model, pinhole imaging segmentation, imaging segmentation tasks, strong base

备注：

点击查看摘要

Abstract:Segment Anything Model 2 (SAM2) has emerged as a strong base model in various pinhole imaging segmentation tasks. However, when applying it to the $360^\circ$ domain, the significant field-of-view (FoV) gap between pinhole ($70^\circ \times 70^\circ$) and panoramic images ($180^\circ \times 360^\circ$) poses unique challenges. Two major concerns for this application include 1) inevitable distortion and object deformation brought by the large FoV disparity between domains; 2) the lack of pixel-level semantic understanding that the original SAM2 cannot provide. To address these issues, we propose a novel OmniSAM framework, which makes the first attempt to apply SAM2 for panoramic semantic segmentation. Specifically, to bridge the first gap, OmniSAM first divides the panorama into sequences of patches. These patches are then treated as image sequences in a manner similar to video segmentation tasks. We then leverage SAM2's memory mechanism to extract cross-patch correspondences that embed the cross-FoV dependencies, improving feature continuity and the prediction consistency along mask boundaries. For the second gap, OmniSAM fine-tunes the pretrained image encoder and reutilizes the mask decoder for semantic prediction. An FoV-based prototypical adaptation module with a dynamic pseudo label update mechanism is also introduced to facilitate the alignment of memory and backbone features, thereby improving model generalization ability across different sizes of source models. Extensive experimental results demonstrate that OmniSAM outperforms the state-of-the-art methods by large margins, e.g., 79.06% (+10.22%) on SPin8-to-SPan8, 62.46% (+6.58%) on CS13-to-DP13.

97. 【2503.07091】FaceID-6M: A Large-Scale, Open-Source FaceID Customization Dataset

链接：https://arxiv.org/abs/2503.07091

作者：Shuhe Wang,Xiaoya Li,Jiwei Li,Guoyin Wang,Xiaofei Sun,Bob Zhu,Han Qiu,Mo Yu,Shengjie Shen,Eduard Hovy

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：current face identity, high-quality text-image pairs, data-driven nature, nature of current, text-image pairs

备注： arXiv admin note: text overlap with arXiv:2501.15407

点击查看摘要

Abstract:Due to the data-driven nature of current face identity (FaceID) customization methods, all state-of-the-art models rely on large-scale datasets containing millions of high-quality text-image pairs for training. However, none of these datasets are publicly available, which restricts transparency and hinders further advancements in the field. To address this issue, in this paper, we collect and release FaceID-6M, the first large-scale, open-source FaceID dataset containing 6 million high-quality text-image pairs. Filtered from LAION-5B (Schuhmann et al., 2022), FaceID-6M undergoes rigorous image and text filtering steps to ensure dataset quality, including resolution filtering to maintain high-quality images and faces, face filtering to remove images that lack human faces, and a keyword-based strategy to retain descriptions containing human-related terms (e.g., nationality, professions and names). Through these cleaning processes, FaceID-6M provides a high-quality dataset optimized for training powerful FaceID customization models, facilitating advancements in the field by offering an open resource for research and development. We conduct extensive experiments to show the effectiveness of our FaceID-6M, demonstrating that models trained on our FaceID-6M dataset achieve performance that is comparable to, and slightly better than, currently available industrial models. Additionally, to support and advance research in the FaceID customization community, we make our code, datasets, and models fully publicly available at this https URL.

98. 【2503.07085】RS2V-L: Vehicle-Mounted LiDAR Data Generation from Roadside Sensor Observations

链接：https://arxiv.org/abs/2503.07085

作者：Ruidan Xing,Runyi Huang,Qing Xu,Lei He

类目：Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词：refined control commands, process multi-modal sensory, directly generate refined, generate refined control, multi-modal sensory data

备注： 7 pages, 4 figures

点击查看摘要

Abstract:End-to-end autonomous driving solutions, which process multi-modal sensory data to directly generate refined control commands, have become a dominant paradigm in autonomous driving research. However, these approaches predominantly depend on single-vehicle data collection for model training and optimization, resulting in significant challenges such as high data acquisition and annotation costs, the scarcity of critical driving scenarios, and fragmented datasets that impede model generalization. To mitigate these limitations, we introduce RS2V-L, a novel framework for reconstructing and synthesizing vehicle-mounted LiDAR data from roadside sensor observations. Specifically, our method transforms roadside LiDAR point clouds into the vehicle-mounted LiDAR coordinate system by leveraging the target vehicle's relative pose. Subsequently, high-fidelity vehicle-mounted LiDAR data is synthesized through virtual LiDAR modeling, point cloud classification, and resampling techniques. To the best of our knowledge, this is the first approach to reconstruct vehicle-mounted LiDAR data from roadside sensor inputs. Extensive experimental evaluations demonstrate that incorporating the generated data into model training (complementing the KITTI dataset) enhances 3D object detection accuracy by over 30% while improving the efficiency of end-to-end autonomous driving data generation by more than an order of magnitude. These findings strongly validate the effectiveness of the proposed method and underscore its potential in reducing dependence on costly vehicle-mounted data collection while improving the robustness of autonomous driving models.

99. 【2503.07082】On the Generalization of Representation Uncertainty in Earth Observation

链接：https://arxiv.org/abs/2503.07082

作者：Spyros Kondylatos,Nikolaos Ioannis Bountos,Dimitrios Michail,Xiao Xiang Zhu,Gustau Camps-Valls,Ioannis Papoutsis

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

关键词：Computer Vision, Recent advances, advances in Computer, Vision have introduced, enabling zero-shot uncertainty

备注： 18 pages

点击查看摘要

Abstract:Recent advances in Computer Vision have introduced the concept of pretrained representation uncertainty, enabling zero-shot uncertainty estimation. This holds significant potential for Earth Observation (EO), where trustworthiness is critical, yet the complexity of EO data poses challenges to uncertainty-aware methods. In this work, we investigate the generalization of representation uncertainty in EO, considering the domain's unique semantic characteristics. We pretrain uncertainties on large EO datasets and propose an evaluation framework to assess their zero-shot performance in multi-label classification and segmentation EO tasks. Our findings reveal that, unlike uncertainties pretrained on natural images, EO-pretraining exhibits strong generalization across unseen EO domains, geographic locations, and target granularities, while maintaining sensitivity to variations in ground sampling distance. We demonstrate the practical utility of pretrained uncertainties showcasing their alignment with task-specific uncertainties in downstream tasks, their sensitivity to real-world EO image noise, and their ability to generate spatial uncertainty estimates out-of-the-box. Initiating the discussion on representation uncertainty in EO, our study provides insights into its strengths and limitations, paving the way for future research in the field. Code and weights are available at: this https URL.

100. 【2503.07076】NFIG: Autoregressive Image Generation with Next-Frequency Prediction

链接：https://arxiv.org/abs/2503.07076

作者：Zhihao Huang,Xi Qiu,Yukuo Ma,Yifu Zhou,Chi Zhang,Xuelong Li

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：achieved promising results, natural language processing, language processing, models have achieved

备注： 10 pages, 7 figures, 2 tables

点击查看摘要

Abstract:Autoregressive models have achieved promising results in natural language processing. However, for image generation tasks, they encounter substantial challenges in effectively capturing long-range dependencies, managing computational costs, and most crucially, defining meaningful autoregressive sequences that reflect natural image hierarchies. To address these issues, we present Next-Frequency Image Generation (NFIG), a novel framework that decomposes the image generation process into multiple frequency-guided stages. Our approach first generates low-frequency components to establish global structure with fewer tokens, then progressively adds higher-frequency details, following the natural spectral hierarchy of images. This principled autoregressive sequence not only improves the quality of generated images by better capturing true causal relationships between image components, but also significantly reduces computational overhead during inference. Extensive experiments demonstrate that NFIG achieves state-of-the-art performance with fewer steps, offering a more efficient solution for image generation, with 1.25× speedup compared to VAR-d20 while achieving better performance (FID: 2.81) on the ImageNet-256 benchmark. We hope that our insight of incorporating frequency-domain knowledge to guide autoregressive sequence design will shed light on future research. We will make our code publicly available upon acceptance of the paper.
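The low-to-high frequency ordering described in the abstract above can be illustrated on a toy signal. The sketch below is not NFIG itself, which works on images with a learned autoregressive model. It only assumes a 1D signal and a naive discrete Fourier transform, splits the signal into frequency bands ordered from low to high, and shows that adding the bands back in that order progressively recovers the original. The helper names `dft`, `idft`, and `split_bands`, the cutoffs, and the sample signal are all illustrative.

```python
import cmath


def dft(signal):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real for t in range(n)]


def split_bands(signal, cutoffs):
    """Split a real signal into bands ordered from low to high frequency.

    `cutoffs` is an increasing list of frequency indices. Band i keeps the
    frequencies f with previous_cutoff < f <= cutoffs[i], and a final band
    keeps everything above the last cutoff.
    """
    n = len(signal)
    spectrum = dft(signal)
    edges = [-1] + list(cutoffs) + [n // 2]
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Bin k and bin n - k represent the same frequency min(k, n - k).
        masked = [c if lo < min(k, n - k) <= hi else 0
                  for k, c in enumerate(spectrum)]
        bands.append(idft(masked))
    return bands


signal = [1.0, 2.0, 4.0, 3.0, 0.0, -1.0, 2.0, 5.0]
bands = split_bands(signal, cutoffs=[0, 2])  # DC, low, and high bands

# Progressive reconstruction: start with the coarsest band and add detail.
partial = [0.0] * len(signal)
for index, band in enumerate(bands):
    partial = [p + b for p, b in zip(partial, band)]
    error = max(abs(p - s) for p, s in zip(partial, signal))
    print(f"after band {index}: max error = {error:.4f}")
```

The first band alone carries only the mean of the signal, which is the 1D analogue of establishing global structure before adding detail.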

101. 【2503.07075】XR-VLM: Cross-Relationship Modeling with Multi-part Prompts and Visual Features for Fine-Grained Recognition

链接：https://arxiv.org/abs/2503.07075

作者：Chuanming Wang,Henming Mao,Huanhuan Zhang,Huiyuan Fu,Huadong Ma

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：demonstrated impressive performance, achieve optimal performance, impressive performance, downstream tasks, optimal performance

备注：

点击查看摘要

Abstract:Vision-Language Models (VLMs) have demonstrated impressive performance on various visual tasks, yet they still require adaptation on downstream tasks to achieve optimal performance. Recently, various adaptation technologies have been proposed, but we observe they often underperform in fine-grained visual recognition, which requires models to capture subtle yet discriminative features to distinguish similar sub-categories. Current adaptation methods typically rely on an alignment-based prediction framework, i.e., the visual feature is compared with each class prompt for similarity calculation as the final prediction, which lacks class interaction during the forward pass. Besides, learning a single uni-modal feature further restricts the model's expressive capacity. Therefore, we propose a novel mechanism, XR-VLM, to discover subtle differences by modeling cross-relationships, which specifically excels in scenarios involving multiple features. Our method introduces a unified multi-part visual feature extraction module designed to seamlessly integrate with the diverse backbones inherent in VLMs. Additionally, we develop a multi-part prompt learning module to capture multi-perspective descriptions of sub-categories. To further enhance discriminative capability, we propose a cross-relationship modeling pattern that combines visual features with all class prompt features, enabling a deeper exploration of the relationships between these two modalities. Extensive experiments have been conducted on various fine-grained datasets, and the results demonstrate that our method achieves significant improvements compared to current state-of-the-art approaches. Code will be released.

102. 【2503.07065】Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning

链接：https://arxiv.org/abs/2503.07065

作者：Huilin Deng,Ding Zou,Rui Ma,Hongchen Luo,Yang Cao,Yu Kang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：success heavily relies, demonstrated remarkable capabilities, massive model scaling, Curriculum Reinforcement Finetuning, Rejected Sampling-based Self-improvement

备注：

点击查看摘要

Abstract:While state-of-the-art vision-language models (VLMs) have demonstrated remarkable capabilities in complex visual-text tasks, their success heavily relies on massive model scaling, limiting their practical deployment. Small-scale VLMs offer a more practical alternative but face significant challenges when trained with traditional supervised fine-tuning (SFT), particularly in two aspects: out-of-domain (OOD) generalization and reasoning abilities, which lag significantly behind those of contemporary large language models (LLMs). To address these challenges, we propose Curriculum Reinforcement Finetuning (Curr-ReFT), a novel post-training paradigm specifically designed for small-scale VLMs. Inspired by the success of reinforcement learning in LLMs, Curr-ReFT comprises two sequential stages: (1) Curriculum Reinforcement Learning, which ensures steady progression of model capabilities through difficulty-aware reward design, transitioning from basic visual perception to complex reasoning tasks; and (2) Rejected Sampling-based Self-improvement, which maintains the fundamental capabilities of VLMs through selective learning from high-quality multimodal and language examples. Extensive experiments demonstrate that models trained with the Curr-ReFT paradigm achieve state-of-the-art performance across various visual tasks in both in-domain and out-of-domain settings. Moreover, our Curr-ReFT enhanced 3B model matches the performance of 32B-parameter models, demonstrating that efficient training paradigms can effectively bridge the gap between small and large models.

103. 【2503.07058】Breaking the Limits of Quantization-Aware Defenses: QADT-R for Robustness Against Patch-Based Adversarial Attacks in QNNs

链接：https://arxiv.org/abs/2503.07058

作者：Amira Guesmi,Bassem Ouni,Muhammad Shafique

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Quantized Neural Networks, Neural Networks, reducing model size, Quantized Neural, computational costs

备注：

点击查看摘要

Abstract:Quantized Neural Networks (QNNs) have emerged as a promising solution for reducing model size and computational costs, making them well-suited for deployment in edge and resource-constrained environments. While quantization is known to disrupt gradient propagation and enhance robustness against pixel-level adversarial attacks, its effectiveness against patch-based adversarial attacks remains largely unexplored. In this work, we demonstrate that adversarial patches remain highly transferable across quantized models, achieving over 70% attack success rates (ASR) even at extreme bit-width reductions (e.g., 2-bit). This challenges the common assumption that quantization inherently mitigates adversarial threats. To address this, we propose Quantization-Aware Defense Training with Randomization (QADT-R), a novel defense strategy that integrates Adaptive Quantization-Aware Patch Generation (A-QAPA), Dynamic Bit-Width Training (DBWT), and Gradient-Inconsistent Regularization (GIR) to enhance resilience against highly transferable patch-based attacks. A-QAPA generates adversarial patches within quantized models, ensuring robustness across different bit-widths. DBWT introduces bit-width cycling during training to prevent overfitting to a specific quantization setting, while GIR injects controlled gradient perturbations to disrupt adversarial optimization. Extensive evaluations on CIFAR-10 and ImageNet show that QADT-R reduces ASR by up to 25% compared to prior defenses such as PBAT and DWQ. Our findings further reveal that PBAT-trained models, while effective against seen patch configurations, fail to generalize to unseen patches due to quantization shift. Additionally, our empirical analysis of gradient alignment, spatial sensitivity, and patch visibility provides insights into the mechanisms that contribute to the high transferability of patch-based attacks in QNNs.
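To make the bit-widths mentioned in the abstract above concrete, here is a minimal sketch of uniform quantization at several bit-widths. It is not QADT-R or its Dynamic Bit-Width Training, whose details the abstract does not give. It only assumes a simple symmetric range of [-1, 1] and shows how coarse a 2-bit representation is compared with 4-bit and 8-bit ones. The function name `quantize`, the range, and the sample weights are hypothetical.

```python
def quantize(values, bits, lo=-1.0, hi=1.0):
    """Uniformly quantize values to 2**bits evenly spaced levels on [lo, hi]."""
    if bits < 1:
        raise ValueError("bits must be at least 1")
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1)
    quantized = []
    for v in values:
        clipped = min(max(v, lo), hi)          # clamp to the range
        index = round((clipped - lo) / step)   # nearest level index
        quantized.append(lo + index * step)
    return quantized


# Hypothetical weights; cycling through bit-widths only mimics the idea of
# evaluating one set of values under several quantization settings.
weights = [-0.9, -0.35, 0.05, 0.4, 0.77]
max_errors = {}
for bits in (2, 4, 8):
    q = quantize(weights, bits)
    max_errors[bits] = max(abs(a - b) for a, b in zip(weights, q))
    print(f"{bits}-bit: {len(set(q))} distinct values, "
          f"max error {max_errors[bits]:.4f}")
```

With nearest-level rounding the error of any in-range value is at most half a quantization step, so the worst-case error shrinks as the bit-width grows.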

104. 【2503.07050】TIDE: Temporal-Aware Sparse Autoencoders for Interpretable Diffusion Transformers in Image Generation

链接：https://arxiv.org/abs/2503.07050

作者：Victor Shea-Jay Huang,Le Zhuo,Yi Xin,Zhaokai Wang,Peng Gao,Hongsheng Li

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)

关键词：Interpretable Diffusion transformErs, Diffusion Transformers, Temporal-aware Sparse Autoencoders, Sparse Autoencoders, powerful yet underexplored

备注：

点击查看摘要

Abstract:Diffusion Transformers (DiTs) are a powerful yet underexplored class of generative models compared to U-Net-based diffusion models. To bridge this gap, we introduce TIDE (Temporal-aware Sparse Autoencoders for Interpretable Diffusion transformErs), a novel framework that enhances temporal reconstruction within DiT activation layers across denoising steps. TIDE employs Sparse Autoencoders (SAEs) with a sparse bottleneck layer to extract interpretable and hierarchical features, revealing that diffusion models inherently learn hierarchical features at multiple levels (e.g., 3D, semantic, class) during generative pre-training. Our approach achieves state-of-the-art reconstruction performance, with a mean squared error (MSE) of 1e-3 and a cosine similarity of 0.97, demonstrating superior accuracy in capturing activation dynamics along the denoising trajectory. Beyond interpretability, we showcase TIDE's potential in downstream applications such as sparse activation-guided image editing and style transfer, enabling improved controllability for generative systems. By providing a comprehensive training and evaluation protocol tailored for DiTs, TIDE contributes to developing more interpretable, transparent, and trustworthy generative models.

105. 【2503.07047】Recovering Partially Corrupted Major Objects through Tri-modality Based Image Completion

链接：https://arxiv.org/abs/2503.07047

作者：Yongle Zhang,Yimin Liu,Qiang Wu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：prompts commonly employed, ensure semantic coherence, image completion tasks, providing high-level guidance, text prompts commonly

备注： 17 pages, 6 page supplementary

点击查看摘要

Abstract:Diffusion models have become widely adopted in image completion tasks, with text prompts commonly employed to ensure semantic coherence by providing high-level guidance. However, a persistent challenge arises when an object is partially obscured in the damaged region, yet its remaining parts are still visible in the background. While text prompts offer semantic direction, they often fail to precisely recover fine-grained structural details, such as the object's overall posture, ensuring alignment with the visible object information in the background. This limitation stems from the inability of text prompts to provide pixel-level specificity. To address this, we propose supplementing text-based guidance with a novel visual aid: a casual sketch, which can be roughly drawn by anyone based on visible object parts. This sketch supplies critical structural cues, enabling the generative model to produce an object structure that seamlessly integrates with the existing background. We introduce the Visual Sketch Self-Aware (VSSA) model, which integrates the casual sketch into each iterative step of the diffusion process, offering distinct advantages for partially corrupted scenarios. By blending sketch-derived features with those of the corrupted image, and leveraging text prompt guidance, the VSSA assists the diffusion model in generating images that preserve both the intended object semantics and structural consistency across the restored objects and original regions. To support this research, we created two datasets, CUB-sketch and MSCOCO-sketch, each combining images, sketches, and text. Extensive qualitative and quantitative experiments demonstrate that our approach outperforms several state-of-the-art methods.

106. 【2503.07046】MambaFlow: A Mamba-Centric Architecture for End-to-End Optical Flow Estimation

链接：https://arxiv.org/abs/2503.07046

作者：Juntian Du,Yuan Sun,Zhihu Zhou,Pinyi Chen,Runzhe Zhang,Keji Mao

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Transformer powerful global, demonstrated impressive performance, global modeling capabilities, powerful global modeling, recently proposed top-performing

备注：

点击查看摘要

Abstract:Optical flow estimation based on deep learning, particularly the recently proposed top-performing methods that incorporate the Transformer, has demonstrated impressive performance, due to the Transformer's powerful global modeling capabilities. However, the quadratic computational complexity of the attention mechanism in Transformers results in time-consuming training and inference. To alleviate these issues, we propose a novel MambaFlow framework that leverages the high accuracy and efficiency of the Mamba architecture to capture features with local correlation while preserving its global information, achieving remarkable performance. To the best of our knowledge, the proposed method is the first Mamba-centric architecture for end-to-end optical flow estimation. It comprises two primary components, both of which are Mamba-centric: a feature enhancement Mamba (FEM) module designed to optimize feature representation quality and a flow propagation Mamba (FPM) module engineered to address occlusion issues by facilitating effective flow information dissemination. Extensive experiments demonstrate that our approach achieves state-of-the-art results, even in the presence of occluded regions. On the Sintel benchmark, MambaFlow achieves an EPE (all) of 1.60, surpassing the leading 1.74 of GMFlow. Additionally, MambaFlow significantly improves inference speed with a runtime of 0.113 seconds, making it 18% faster than GMFlow. The source code will be made publicly available upon acceptance of the paper.
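The EPE figure quoted above is the end-point error, the standard optical flow metric: the Euclidean distance between predicted and ground-truth flow vectors, averaged over pixels. A minimal sketch of that definition follows; the function name and the toy flow values are illustrative assumptions and do not come from the MambaFlow paper.

```python
import math

def end_point_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth flow vectors.

    pred, gt: equal-length sequences of (u, v) displacement pairs, one per pixel.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("flows must be non-empty and the same length")
    total = sum(math.hypot(pu - gu, pv - gv)
                for (pu, pv), (gu, gv) in zip(pred, gt))
    return total / len(pred)

# Three pixels with per-pixel errors of 0, 2 and 5, so the mean is 7/3.
pred_flow = [(1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
gt_flow = [(1.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
print(end_point_error(pred_flow, gt_flow))
```

Lower is better, so an EPE of 1.60 against 1.74 means the predicted vectors land, on average, closer to the ground truth.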

107. 【2503.07038】Find your Needle: Small Object Image Retrieval via Multi-Object Attention Optimization

链接：https://arxiv.org/abs/2503.07038

作者：Michael Green,Matan Levy,Issar Tzachor,Dvir Samuel,Nir Darshan,Rami Ben-Ari

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：specific small object, Small Object Image, Small Object, specific small, cluttered scene

备注：

点击查看摘要

Abstract:We address the challenge of Small Object Image Retrieval (SoIR), where the goal is to retrieve images containing a specific small object, in a cluttered scene. The key challenge in this setting is constructing a single image descriptor, for scalable and efficient search, that effectively represents all objects in the image. In this paper, we first analyze the limitations of existing methods on this challenging task and then introduce new benchmarks to support SoIR evaluation. Next, we introduce Multi-object Attention Optimization (MaO), a novel retrieval framework which incorporates a dedicated multi-object pre-training phase. This is followed by a refinement process that leverages attention-based feature extraction with object masks, integrating them into a single unified image descriptor. Our MaO approach significantly outperforms existing retrieval methods and strong baselines, achieving notable improvements in both zero-shot and lightweight multi-object fine-tuning. We hope this work will lay the groundwork and inspire further research to enhance retrieval performance for this highly practical task.

108. 【2503.07037】Zero-Shot Hashing Based on Reconstruction With Part Alignment

链接：https://arxiv.org/abs/2503.07037

作者：Yan Jiang,Zhongmiao Qi,Jianhao Li,Jiangbo Qian,Chong Wang,Yu Xin

类目：Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)

关键词：Zero-shot hashing algorithms, large-scale image retrieval, unseen class data, Hashing algorithms, class data

备注：

点击查看摘要

109. 【2503.07035】Universal Incremental Learning: Mitigating Confusion from Inter- and Intra-task Distribution Randomness

链接：https://arxiv.org/abs/2503.07035

作者：Sheng Luo,Yi Zhou,Tao Zhou

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：overcome catastrophic forgetting, Universal Incremental Learning, Incremental learning, aims to overcome, overcome catastrophic

备注： 10 pages, 4 figures, 4 tables

点击查看摘要

Abstract:Incremental learning (IL) aims to overcome catastrophic forgetting of previous tasks while learning new ones. Existing IL methods make strong assumptions that the incoming task type will either only increase new classes or domains (i.e. Class IL, Domain IL), or increase by a static scale in a class- and domain-agnostic manner (i.e. Versatile IL (VIL)), which greatly limit their applicability in the unpredictable and dynamic wild. In this work, we investigate Universal Incremental Learning (UIL), where a model neither knows which new classes or domains will increase along sequential tasks, nor the scale of the increments within each task. This uncertainty prevents the model from confidently learning knowledge from all task distributions and symmetrically focusing on the diverse knowledge within each task distribution. Consequently, UIL presents a more general and realistic IL scenario, making the model face confusion arising from inter-task and intra-task distribution randomness. To Mitigate both kinds of Confusion, we propose a simple yet effective framework for UIL, named MiCo. At the inter-task distribution level, we employ a multi-objective learning scheme to enforce accurate and deterministic predictions, and its effectiveness is further enhanced by a direction recalibration module that reduces conflicting gradients. Moreover, at the intra-task distribution level, we introduce a magnitude recalibration module to alleviate asymmetrical optimization towards imbalanced class distribution. Extensive experiments on three benchmarks demonstrate the effectiveness of our method, outperforming existing state-of-the-art methods in both the UIL scenario and the VIL scenario. Our code will be available at this https URL.

110. 【2503.07033】Learning a Unified Degradation-aware Representation Model for Multi-modal Image Fusion

链接：https://arxiv.org/abs/2503.07033

作者：Haolong Ma,Hui Li,Chunyang Cheng,Zeyang Zhang,Xiaoning Song,Xiao-Jun Wu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：generating high-quality fused, multi-modal image fusion, address complex scenes, high-quality fused images, image fusion

备注：

点击查看摘要

Abstract:All-in-One Degradation-Aware Fusion Models (ADFMs), a class of multi-modal image fusion models, address complex scenes by mitigating degradations from source images and generating high-quality fused images. Mainstream ADFMs often rely on highly synthetic multi-modal multi-quality images for supervision, limiting their effectiveness in cross-modal and rare degradation scenarios. The inherent relationship among these multi-modal, multi-quality images of the same scene provides explicit supervision for training, but also raises the above problems. To address these limitations, we present LURE, a Learning-driven Unified Representation model for infrared and visible Image Fusion, which is degradation-aware. LURE decouples multi-modal multi-quality data at the data level and recouples this relationship in a unified latent feature space (ULFS) by proposing a novel unified loss. This decoupling circumvents data-level limitations of prior models and allows leveraging real-world restoration datasets for training high-quality degradation-aware models, sidestepping the above issues. To enhance text-image interaction, we refine image-text interaction and residual structures via Text-Guided Attention (TGA) and an inner residual structure. These enhance the text's spatial perception of images and preserve more visual details. Experiments show our method outperforms state-of-the-art (SOTA) methods across general fusion, degradation-aware fusion, and downstream tasks. The code will be publicly available.

111. 【2503.07032】Multimodal Human-AI Synergy for Medical Imaging Quality Control: A Hybrid Intelligence Framework with Adaptive Dataset Curation and Closed-Loop Evaluation

链接：https://arxiv.org/abs/2503.07032

作者：Zhi Qin,Qianhui Gui,Mouxiao Bian,Rui Wang,Hong Ge,Dandan Yao,Ziying Sun,Yuan Zhao,Yu Zhang,Hui Shi,Dongdong Wang,Chenxin Song,Shenghong Ju,Lihao Liu,Junjun He,Jie Xu,Yuan-Cheng Wang

类目：Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词：methods remain labor-intensive, imaging quality control, Medical imaging quality, Medical imaging, accurate diagnosis

备注：

点击查看摘要

112. 【2503.07029】Availability-aware Sensor Fusion via Unified Canonical Space for 4D Radar, LiDAR, and Camera

链接：https://arxiv.org/abs/2503.07029

作者：Dong-Hee Paek,Seung-Hyun Kong

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Radar has brought, autonomous driving, brought a significant, sensor degradation, Sensor

备注： Arxiv preprint

点击查看摘要

Abstract:Sensor fusion of camera, LiDAR, and 4-dimensional (4D) Radar has brought a significant performance improvement in autonomous driving (AD). However, there still exist fundamental challenges: deeply coupled fusion methods assume continuous sensor availability, making them vulnerable to sensor degradation and failure, whereas sensor-wise cross-attention fusion methods struggle with computational cost and unified feature representation. This paper presents availability-aware sensor fusion (ASF), a novel method that employs unified canonical projection (UCP) to enable consistency in all sensor features for fusion and cross-attention across sensors along patches (CASAP) to enhance robustness of sensor fusion against sensor degradation and failure. As a result, the proposed ASF shows superior object detection performance over the existing state-of-the-art fusion methods under various weather and sensor degradation (or failure) conditions. Extensive experiments on the K-Radar dataset demonstrate that ASF achieves improvements of 9.7% in AP BEV (87.2%) and 20.1% in AP 3D (73.6%) in object detection at IoU=0.5, while requiring a low computational cost. The code will be available at this https URL.

113. 【2503.07027】EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer

链接：https://arxiv.org/abs/2503.07027

作者：Yuxuan Zhang,Yirui Yuan,Yiren Song,Haofan Wang,Jiaming Liu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：introduced effective spatial, Unet-based diffusion models, Recent advancements, advancements in Unet-based, Unet-based diffusion

备注：

点击查看摘要

Abstract:Recent advancements in Unet-based diffusion models, such as ControlNet and IP-Adapter, have introduced effective spatial and subject control mechanisms. However, the DiT (Diffusion Transformer) architecture still struggles with efficient and flexible control. To tackle this issue, we propose EasyControl, a novel framework designed to unify condition-guided diffusion transformers with high efficiency and flexibility. Our framework is built on three key innovations. First, we introduce a lightweight Condition Injection LoRA Module. This module processes conditional signals in isolation, acting as a plug-and-play solution. It avoids modifying the base model weights, ensuring compatibility with customized models and enabling the flexible injection of diverse conditions. Notably, this module also supports harmonious and robust zero-shot multi-condition generalization, even when trained only on single-condition data. Second, we propose a Position-Aware Training Paradigm. This approach standardizes input conditions to fixed resolutions, allowing the generation of images with arbitrary aspect ratios and flexible resolutions. At the same time, it optimizes computational efficiency, making the framework more practical for real-world applications. Third, we develop a Causal Attention Mechanism combined with the KV Cache technique, adapted for conditional generation tasks. This innovation significantly reduces the latency of image synthesis, improving the overall efficiency of the framework. Through extensive experiments, we demonstrate that EasyControl achieves exceptional performance across various application scenarios. These innovations collectively make our framework highly efficient, flexible, and suitable for a wide range of tasks.

114. 【2503.07026】Erase Diffusion: Empowering Object Removal Through Calibrating Diffusion Pathways

链接：https://arxiv.org/abs/2503.07026

作者：Yi Liu,Hao Zhou,Wenxiang Shang,Ran Lin,Benlei Cui

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：precisely remove target, remove target objects, object removal, aims to precisely, precisely remove

备注： accepted by CVPR 2025

点击查看摘要

Abstract:Erase inpainting, or object removal, aims to precisely remove target objects within masked regions while preserving the overall consistency of the surrounding content. Although diffusion-based methods have made significant strides in the field of image inpainting, challenges remain regarding the emergence of unexpected objects or artifacts. We assert that the inexact diffusion pathways established by existing standard optimization paradigms constrain the efficacy of object removal. To tackle these challenges, we propose a novel Erase Diffusion, termed EraDiff, aimed at unleashing the potential power of standard diffusion in the context of object removal. In contrast to standard diffusion, EraDiff adapts both the optimization paradigm and the network to improve the coherence and elimination of the erasure results. We first introduce a Chain-Rectifying Optimization (CRO) paradigm, a sophisticated diffusion process specifically designed to align with the objectives of erasure. This paradigm establishes innovative diffusion transition pathways that simulate the gradual elimination of objects during optimization, allowing the model to accurately capture the intent of object removal. Furthermore, to mitigate deviations caused by artifacts during the sampling pathways, we develop a simple yet effective Self-Rectifying Attention (SRA) mechanism. The SRA calibrates the sampling pathways by altering self-attention activation, allowing the model to effectively bypass artifacts while further enhancing the coherence of the generated content. With this design, our proposed EraDiff achieves state-of-the-art performance on the OpenImages V5 dataset and demonstrates significant superiority in real-world scenarios.

115. 【2503.07019】HybridReg: Robust 3D Point Cloud Registration with Hybrid Motions

链接：https://arxiv.org/abs/2503.07019

作者：Keyu Du,Hao Xu,Haipeng Li,Hong Qu,Chi-Wing Fu,Shuaicheng Liu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：point cloud registration, cloud registration, point cloud, Scene-level point cloud, trained models

备注： 2025, Association for the Advancement of Artificial Intelligence

点击查看摘要

Abstract:Scene-level point cloud registration is very challenging when considering dynamic foregrounds. Existing indoor datasets mostly assume rigid motions, so the trained models cannot robustly handle scenes with non-rigid motions. On the other hand, non-rigid datasets are mainly object-level, so the trained models cannot generalize well to complex scenes. This paper presents HybridReg, a new approach to 3D point cloud registration, learning an uncertainty mask to account for hybrid motions: rigid for backgrounds and non-rigid/rigid for instance-level foregrounds. First, we build a scene-level 3D registration dataset, namely HybridMatch, designed specifically with strategies to arrange diverse deforming foregrounds in a controllable manner. Second, we account for different motion types and formulate a mask-learning module to alleviate the interference of deforming outliers. Third, we exploit a simple yet effective negative log-likelihood loss to adopt uncertainty to guide the feature extraction and correlation computation. To the best of our knowledge, HybridReg is the first work that exploits hybrid motions for robust point cloud registration. Extensive experiments show HybridReg's strengths, leading it to achieve state-of-the-art performance on both widely-used indoor and outdoor datasets.

116. 【2503.07008】SDFA: Structure Aware Discriminative Feature Aggregation for Efficient Human Fall Detection in Video

链接：https://arxiv.org/abs/2503.07008

作者：Sania Zahan,Ghulam Mubashar Hassan,Ajmal Mian

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Older people, deteriorating health, people are susceptible, Older, due to instability

备注： Published IEEE Transactions on Industrial Informatics

点击查看摘要

Abstract:Older people are susceptible to falls due to instability in posture and deteriorating health. Immediate access to medical support can greatly reduce repercussions. Hence, there is an increasing interest in automated fall detection, often incorporated into a smart healthcare system to provide better monitoring. Existing systems focus on wearable devices, which are inconvenient, or video monitoring, which raises privacy concerns. Moreover, these systems provide a limited perspective of their generalization ability as they are tested on datasets containing few activities that have wide disparity in the action space and are easy to differentiate. Complex daily life scenarios pose much greater challenges with activities that overlap in action spaces due to similar posture or motion. To overcome these limitations, we propose a fall detection model, coined SDFA, based on human skeletons extracted from low-resolution videos. The use of skeleton data ensures privacy, and low-resolution videos ensure low hardware and computational cost. Our model captures discriminative structural displacements and motion trends using unified joint and motion features projected onto a shared high dimensional space. Particularly, the use of separable convolution combined with a powerful GCN architecture provides improved performance. Extensive experiments on five large-scale datasets with a wide range of evaluation settings show that our model achieves competitive performance with extremely low computational complexity and runs faster than existing models.

117. 【2503.07004】NukesFormers: Unpaired Hyperspectral Image Generation with Non-Uniform Domain Alignment

链接：https://arxiv.org/abs/2503.07004

作者：Jiaojiao Li,Shiyao Duan,Haitao XU,Rui Song

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Hyperspectral Image Generation, data-driven Hyperspectral Image, current data-driven Hyperspectral, co-registered RGB-hyperspectral image, acquiring accurately co-registered

备注：

点击查看摘要

Abstract:The inherent difficulty in acquiring accurately co-registered RGB-hyperspectral image (HSI) pairs has significantly impeded the practical deployment of current data-driven Hyperspectral Image Generation (HIG) networks in engineering applications. Meanwhile, the ill-posed nature of the alignment constraints, compounded by the complexity of mining cross-domain features, also hinders the advancement of unpaired HIG (UnHIG) tasks. In this paper, we address these challenges by modeling UnHIG as range-space interaction and null-space compensation through the Range-Null Space Decomposition (RND) methodology. Specifically, the introduced contrastive learning effectively aligns the geometric and spectral distributions of unpaired data by building the interaction of the range space, considering the consistent features in the degradation process. Following this, we map the frequency representations of the dual-domain input and thoroughly mine the null space, such as degraded and high-frequency components, through the proposed Non-uniform Kolmogorov-Arnold Networks. Extensive comparative experiments demonstrate that it establishes a new benchmark in UnHIG.

118. 【2503.07002】Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning

链接：https://arxiv.org/abs/2503.07002

作者：Jiazheng Liu,Sipeng Zheng,Börje F. Karlsson,Zongqing Lu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：large language models, large-scale pre-trained vision, pre-trained vision towers, shown great capabilities, language models

备注：

点击查看摘要

Abstract:Multimodal large language models (MLLMs), built on large-scale pre-trained vision towers and language models, have shown great capabilities in multimodal understanding. However, most existing MLLMs are trained on single-turn vision question-answering tasks, which do not accurately reflect real-world human conversations. In this paper, we introduce MMDiag, a multi-turn multimodal dialogue dataset. This dataset is collaboratively generated through deliberately designed rules and GPT assistance, featuring strong correlations between questions, between questions and images, and among different image regions; thus aligning more closely with real-world scenarios. MMDiag serves as a strong benchmark for multi-turn multimodal dialogue learning and brings more challenges to the grounding and reasoning capabilities of MLLMs. Further, inspired by human vision processing, we present DiagNote, an MLLM equipped with multimodal grounding and reasoning capabilities. DiagNote consists of two modules (Deliberate and Gaze) interacting with each other to perform Chain-of-Thought and annotations respectively, throughout multi-turn dialogues. We empirically demonstrate the advantages of DiagNote in both grounding and jointly processing and reasoning with vision and language information over existing MLLMs.

119. 【2503.07000】Frequency-Aware Density Control via Reparameterization for High-Quality Rendering of 3D Gaussian Splatting

链接：https://arxiv.org/abs/2503.07000

作者：Zhaojie Zeng,Yuesong Wang,Lili Ju,Tao Guan

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：represent scene details, Gaussian Splatting, Gaussians, adaptively controlling, represent scene

备注： Accepted to AAAI2025

点击查看摘要

Abstract:By adaptively controlling the density and generating more Gaussians in regions with high-frequency information, 3D Gaussian Splatting (3DGS) can better represent scene details. From the signal processing perspective, representing details usually needs more Gaussians with relatively smaller scales. However, 3DGS currently lacks an explicit constraint linking the density and scale of 3D Gaussians across the domain, leading to 3DGS using improper-scale Gaussians to express frequency information, resulting in the loss of accuracy. In this paper, we propose to establish a direct relation between density and scale through the reparameterization of the scaling parameters and ensure the consistency between them via explicit constraints (i.e., density responds well to changes in frequency). Furthermore, we develop a frequency-aware density control strategy, consisting of densification and deletion, to improve representation quality with fewer Gaussians. A dynamic threshold encourages densification in high-frequency regions, while a scale-based filter deletes Gaussians with improper scale. Experimental results on various datasets demonstrate that our method outperforms existing state-of-the-art methods quantitatively and qualitatively.

120. 【2503.06998】SOYO: A Tuning-Free Approach for Video Style Morphing via Style-Adaptive Interpolation in Diffusion Models

链接：https://arxiv.org/abs/2503.06998

作者：Haoyu Zheng,Qifan Yu,Binghe Yu,Yang Dai,Wenqiao Zhang,Juncheng Li,Siliang Tang,Yueting Zhuang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：achieved remarkable progress, video style morphing, video, style morphing, style

备注：

点击查看摘要

Abstract:Diffusion models have achieved remarkable progress in image and video stylization. However, most existing methods focus on single-style transfer, while video stylization involving multiple styles necessitates seamless transitions between them. We refer to this smooth style transition between video frames as video style morphing. Current approaches often generate stylized video frames with discontinuous structures and abrupt style changes when handling such transitions. To address these limitations, we introduce SOYO, a novel diffusion-based framework for video style morphing. Our method employs a pre-trained text-to-image diffusion model without fine-tuning, combining attention injection and AdaIN to preserve structural consistency and enable smooth style transitions across video frames. Moreover, we notice that applying linear equidistant interpolation directly induces imbalanced style morphing. To harmonize across video frames, we propose a novel adaptive sampling scheduler operating between two style images. Extensive experiments demonstrate that SOYO outperforms existing methods in open-domain video style morphing, better preserving the structural coherence of video frames while achieving stable and smooth style transitions.
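The abstract names AdaIN (adaptive instance normalization) and notes that plain linear, equidistant interpolation between two styles gives imbalanced morphing. The sketch below illustrates only that baseline: the standard AdaIN formula applied to one feature channel, with style statistics blended linearly by a factor `t`. All names and values are illustrative assumptions, and the adaptive sampling scheduler proposed in the paper is not reproduced here.

```python
import statistics

def channel_stats(values):
    """Mean and (population) standard deviation of one feature channel."""
    return statistics.fmean(values), statistics.pstdev(values)

def interpolate_stats(stats_a, stats_b, t):
    """Linear blend of two (mean, std) pairs; t=0 gives style A, t=1 gives style B."""
    return tuple((1.0 - t) * a + t * b for a, b in zip(stats_a, stats_b))

def adain(content, style_mean, style_std, eps=1e-5):
    """Standard AdaIN: normalize the content channel, then rescale to the style statistics."""
    c_mean, c_std = channel_stats(content)
    return [style_std * (v - c_mean) / (c_std + eps) + style_mean for v in content]

# Toy single-channel features for a content frame and two style images.
content = [0.0, 2.0, 4.0]
style_a = channel_stats([10.0, 12.0, 14.0])
style_b = channel_stats([0.0, 10.0, 20.0])

# Equidistant interpolation: the midpoint frame uses t = 0.5.
mid_mean, mid_std = interpolate_stats(style_a, style_b, 0.5)
stylized = adain(content, mid_mean, mid_std)
print(stylized)
```

In this baseline every frame index maps to an evenly spaced `t`; the abstract's point is that even spacing in `t` does not necessarily yield perceptually even style change, which motivates a non-uniform schedule.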

121. 【2503.06996】Public space security management using digital twin technologies

链接：https://arxiv.org/abs/2503.06996

作者：Stylianos Zindros,Christos Chronis,Panagiotis Radoglou-Grammatikis,Vasileios Argyriou,Panagiotis Sarigiannidis,Iraklis Varlamis,Georgios Th. Papadopoulos

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Digital Twin technologies, predicting potential future, Digital Twin, potential future threats, Twin technologies

备注：

点击查看摘要

Abstract:As the security of public spaces remains a critical issue in today's world, Digital Twin technologies have emerged in recent years as a promising solution for detecting and predicting potential future threats. The applied methodology leverages a Digital Twin of a metro station in Athens, Greece, using the FlexSim simulation software. The model encompasses points of interest and passenger flows, and sets their corresponding parameters. These elements influence and allow the model to provide reasonable predictions on the security management of the station under various scenarios. Experimental tests are conducted with different configurations of surveillance cameras and optimizations of camera angles to evaluate the effectiveness of the space surveillance setup. The results show that the strategic positioning of surveillance cameras and the adjustment of their angles significantly improve the detection of suspicious behaviors. With the use of the Digital Twin (DT), it is possible to evaluate different scenarios and find the optimal camera setup for each case. In summary, this study highlights the value of Digital Twins in real-time simulation and data-driven security management. The proposed approach contributes to the ongoing development of smart security solutions for public spaces and provides an innovative framework for threat detection and prevention.

122. 【2503.06993】CAPT: Class-Aware Prompt Tuning for Federated Long-Tailed Learning with Vision-Language Model

链接：https://arxiv.org/abs/2503.06993

作者：Shihao Hou,Xinyi Shang,Shreyank N Gowda,Yang Lu,Chao Wu,Yan Yan,Hanzi Wang

类目：Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词：Federated Long-tailed Learning, federated long-tailed, long-tailed, long-tailed distributions remains, handling the co-occurrence

备注：

点击查看摘要

Abstract:Effectively handling the co-occurrence of non-IID data and long-tailed distributions remains a critical challenge in federated learning. While fine-tuning vision-language models (VLMs) like CLIP has been shown to be promising in addressing non-IID data challenges, this approach leads to severe degradation of tail classes in federated long-tailed scenarios. Under the composite effects of strong non-IID data distribution and long-tailed class imbalances, VLM fine-tuning may even fail to yield any improvement. To address this issue, we propose Class-Aware Prompt Learning for Federated Long-tailed Learning (CAPT), a novel framework that leverages a pre-trained VLM to effectively handle both data heterogeneity and long-tailed distributions. CAPT introduces a dual-prompt mechanism that synergizes general and class-aware prompts, enabling the framework to capture global trends while preserving class-specific knowledge. To better aggregate and share knowledge across clients, we introduce a heterogeneity-aware client clustering strategy that groups clients based on their data distributions, enabling efficient collaboration and knowledge sharing. Extensive experiments on various long-tailed datasets with different levels of data heterogeneity demonstrate that CAPT significantly improves tail class performance without compromising overall accuracy, outperforming state-of-the-art methods in federated long-tailed learning scenarios.

123. 【2503.06992】Bridge Frame and Event: Common Spatiotemporal Fusion for High-Dynamic Scene Optical Flow

链接：https://arxiv.org/abs/2503.06992

作者：Hanyu Zhou,Haonan Wang,Haoyue Liu,Yuxing Duan,Yi Chang,Luxin Yan

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：suffers spatial blur, High-dynamic scene optical, challenging task, optical flow, scene optical flow

备注：

点击查看摘要

Abstract:High-dynamic scene optical flow is a challenging task, which suffers from spatial blur and temporally discontinuous motion due to large displacement in frame imaging, thus deteriorating the spatiotemporal features of optical flow. Typically, existing methods mainly introduce an event camera to directly fuse the spatiotemporal features between the two modalities. However, this direct fusion is ineffective, since there exists a large gap due to the heterogeneous data representation between frame and event modalities. To address this issue, we explore a common-latent space as an intermediate bridge to mitigate the modality gap. In this work, we propose a novel common spatiotemporal fusion between frame and event modalities for high-dynamic scene optical flow, including visual boundary localization and motion correlation fusion. Specifically, in visual boundary localization, we find that frame and event share similar spatiotemporal gradients, whose similarity distribution is consistent with the extracted boundary distribution. This motivates us to design the common spatiotemporal gradient to constrain the reference boundary localization. In motion correlation fusion, we discover that the frame-based motion possesses spatially dense but temporally discontinuous correlation, while the event-based motion has spatially sparse but temporally continuous correlation. This inspires us to use the reference boundary to guide the complementary motion knowledge fusion between the two modalities. Moreover, common spatiotemporal fusion can not only relieve the cross-modal feature discrepancy, but also make the fusion process interpretable for dense and continuous optical flow. Extensive experiments have been performed to verify the superiority of the proposed method.

124. 【2503.06991】Are We Truly Forgetting? A Critical Re-examination of Machine Unlearning Evaluation Protocols

Link: https://arxiv.org/abs/2503.06991

Authors: Yongwoo Kim, Sungmin Cha, Donghyun Kim

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: remove specific data, specific data points, addressing privacy, legal requirements, process to remove

Abstract:Machine unlearning is a process to remove specific data points from a trained model while maintaining the performance on retain data, addressing privacy or legal requirements. Despite its importance, existing unlearning evaluations tend to focus on logit-based metrics (i.e., accuracy) under small-scale scenarios. We observe that this could lead to a false sense of security in unlearning approaches under real-world scenarios. In this paper, we conduct a new comprehensive evaluation that employs representation-based evaluations of the unlearned model under large-scale scenarios to verify whether the unlearning approaches genuinely eliminate the targeted forget data from the model's representation perspective. Our analysis reveals that current state-of-the-art unlearning approaches either completely degrade the representational quality of the unlearned model or merely modify the classifier (i.e., the last layer), thereby achieving superior logit-based evaluation metrics while maintaining significant representational similarity to the original model. Furthermore, we introduce a novel unlearning evaluation setup from a transfer learning perspective, in which the forget set classes exhibit semantic similarity to downstream task classes, necessitating that feature representations diverge significantly from those of the original model. Our comprehensive benchmark not only addresses a critical gap between theoretical machine unlearning and practical scenarios, but also establishes a foundation to inspire future research directions in developing genuinely effective unlearning methodologies.

125. 【2503.06989】Utilizing Jailbreak Probability to Attack and Safeguard Multimodal LLMs

Link: https://arxiv.org/abs/2503.06989

Authors: Wenzhuo Xu, Zhipeng Wei, Xiongtao Sun, Deyue Zhang, Dongdong Yang, Quanchen Zou, Xiangzheng Zhang

Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multimodal Large Language, Language Models, Large Language, Multimodal Large

Abstract:Recently, Multimodal Large Language Models (MLLMs) have demonstrated their superior ability in understanding multimodal contents. However, they remain vulnerable to jailbreak attacks, which exploit weaknesses in their safety alignment to generate harmful responses. Previous studies categorize jailbreaks as successful or failed based on whether responses contain malicious content. However, given the stochastic nature of MLLM responses, this binary classification of an input's ability to jailbreak MLLMs is inappropriate. Derived from this viewpoint, we introduce jailbreak probability to quantify the jailbreak potential of an input, which represents the likelihood that MLLMs generate a malicious response when prompted with this input. We approximate this probability through multiple queries to MLLMs. After modeling the relationship between input hidden states and their corresponding jailbreak probability using Jailbreak Probability Prediction Network (JPPN), we use continuous jailbreak probability for optimization. Specifically, we propose Jailbreak-Probability-based Attack (JPA) that optimizes adversarial perturbations on inputs to maximize jailbreak probability. To counteract attacks, we also propose two defensive methods: Jailbreak-Probability-based Finetuning (JPF) and Jailbreak-Probability-based Defensive Noise (JPDN), which minimize jailbreak probability in the MLLM parameters and input space, respectively. Extensive experiments show that (1) JPA yields improvements (up to 28.38%) under both white and black box settings compared to previous methods with small perturbation bounds and few iterations. (2) JPF and JPDN significantly reduce jailbreaks by at most over 60%. Both of the above results demonstrate the significance of introducing jailbreak probability to make nuanced distinctions among input jailbreak abilities.

126. 【2503.06986】ConcreTizer: Model Inversion Attack via Occupancy Classification and Dispersion Control for 3D Point Cloud Restoration

Link: https://arxiv.org/abs/2503.06986

Authors: Youngseok Kim, Sunwook Hwang, Hyung-Sin Kim, Saewoong Bahk

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: point cloud, point cloud data, point, point clouds remains, model inversion attacks

Abstract:The growing use of 3D point cloud data in autonomous vehicles (AVs) has raised serious privacy concerns, particularly due to the sensitive information that can be extracted from 3D data. While model inversion attacks have been widely studied in the context of 2D data, their application to 3D point clouds remains largely unexplored. To fill this gap, we present the first in-depth study of model inversion attacks aimed at restoring 3D point cloud scenes. Our analysis reveals the unique challenges, the inherent sparsity of 3D point clouds and the ambiguity between empty and non-empty voxels after voxelization, which are further exacerbated by the dispersion of non-empty voxels across feature extractor layers. To address these challenges, we introduce ConcreTizer, a simple yet effective model inversion attack designed specifically for voxel-based 3D point cloud data. ConcreTizer incorporates Voxel Occupancy Classification to distinguish between empty and non-empty voxels and Dispersion-Controlled Supervision to mitigate non-empty voxel dispersion. Extensive experiments on widely used 3D feature extractors and benchmark datasets, such as KITTI and Waymo, demonstrate that ConcreTizer concretely restores the original 3D point cloud scene from disrupted 3D feature data. Our findings highlight both the vulnerability of 3D data to inversion attacks and the urgent need for robust defense strategies.

127. 【2503.06984】Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition

Link: https://arxiv.org/abs/2503.06984

Authors: Juncheng Wang, Chao Xu, Cheng Yu, Lei Shang, Zhe Hu, Shujun Wang, Liefeng Bo

Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)

Keywords: synthesizing realistic audio, realistic audio tracks, Mel Quantization-Continuum Decomposition, synthesizing realistic, tracks that synchronize

Notes: Accepted to CVPR-25

Abstract:Video-to-audio generation is essential for synthesizing realistic audio tracks that synchronize effectively with silent videos. Following the perspective of extracting essential signals from videos that can precisely control the mature text-to-audio generative diffusion models, this paper presents how to balance the representation of mel-spectrograms in terms of completeness and complexity through a new approach called Mel Quantization-Continuum Decomposition (Mel-QCD). By decomposing the mel-spectrogram into three distinct types of signals and applying quantization or continuity to them, we can effectively predict them from video with a devised video-to-all (V2X) predictor. Then, the predicted signals are recomposed and fed into a ControlNet, along with a textual inversion design, to control the audio generation process. Our proposed Mel-QCD method demonstrates state-of-the-art performance across eight metrics, evaluating dimensions such as quality, synchronization, and semantic consistency. Our codes and demos will be released at this https URL.

128. 【2503.06983】Griffin: Aerial-Ground Cooperative Detection and Tracking Dataset and Benchmark

Link: https://arxiv.org/abs/2503.06983

Authors: Jiahao Wang, Xiangyu Cao, Jiaru Zhong, Yuner Zhang, Haibao Yu, Lei He, Shaobing Xu

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: autonomous driving systems, driving systems continue, long-range detection due, significant advancements, autonomous driving

Notes: 8 pages, 7 figures. This work has been submitted to IROS 2025 for possible publication

Abstract:Despite significant advancements, autonomous driving systems continue to struggle with occluded objects and long-range detection due to the inherent limitations of single-perspective sensing. Aerial-ground cooperation offers a promising solution by integrating UAVs' aerial views with ground vehicles' local observations. However, progress in this emerging field has been hindered by the absence of public datasets and standardized evaluation benchmarks. To address this gap, this paper presents a comprehensive solution for aerial-ground cooperative 3D perception through three key contributions: (1) Griffin, a large-scale multi-modal dataset featuring over 200 dynamic scenes (30k+ frames) with varied UAV altitudes (20-60m), diverse weather conditions, and occlusion-aware 3D annotations, enhanced by CARLA-AirSim co-simulation for realistic UAV dynamics; (2) A unified benchmarking framework for aerial-ground cooperative detection and tracking tasks, including protocols for evaluating communication efficiency, latency tolerance, and altitude adaptability; (3) AGILE, an instance-level intermediate fusion baseline that dynamically aligns cross-view features through query-based interaction, achieving an advantageous balance between communication overhead and perception accuracy. Extensive experiments prove the effectiveness of aerial-ground cooperative perception and demonstrate the direction of further research. The dataset and codes are available at this https URL.

129. 【2503.06978】Lightweight Multimodal Artificial Intelligence Framework for Maritime Multi-Scene Recognition

Link: https://arxiv.org/abs/2503.06978

Authors: Xinyu Xi, Hua Yang, Shentai Zhang, Yijie Liu, Sijin Sun, Xiuju Fu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: intelligent marine robotics, Maritime Multi-Scene Recognition, crucial for enhancing, enhancing the capabilities, capabilities of intelligent

Notes: 19 pages, 4 figures, submitted to Engineering Applications of Artificial Intelligence

Abstract:Maritime Multi-Scene Recognition is crucial for enhancing the capabilities of intelligent marine robotics, particularly in applications such as marine conservation, environmental monitoring, and disaster response. However, this task presents significant challenges due to environmental interference, where marine conditions degrade image quality, and the complexity of maritime scenes, which requires deeper reasoning for accurate recognition. Pure vision models alone are insufficient to address these issues. To overcome these limitations, we propose a novel multimodal Artificial Intelligence (AI) framework that integrates image data, textual descriptions and classification vectors generated by a Multimodal Large Language Model (MLLM), to provide richer semantic understanding and improve recognition accuracy. Our framework employs an efficient multimodal fusion mechanism to further enhance model robustness and adaptability in complex maritime environments. Experimental results show that our model achieves 98% accuracy, surpassing previous SOTA models by 3.5%. To optimize deployment on resource-constrained platforms, we adopt activation-aware weight quantization (AWQ) as a lightweight technique, reducing the model size to 68.75MB with only a 0.5% accuracy drop while significantly lowering computational overhead. This work provides a high-performance solution for real-time maritime scene recognition, enabling Autonomous Surface Vehicles (ASVs) to support environmental monitoring and disaster response in resource-limited settings.

130. 【2503.06976】Task-Specific Knowledge Distillation from the Vision Foundation Model for Enhanced Medical Image Segmentation

Link: https://arxiv.org/abs/2503.06976

Authors: Pengchen Liang, Haishan Huang, Bin Pu, Jianguo Chen, Xiang Hua, Jing Zhang, Weibo Ma, Zhuangzhuang Chen, Yiwei Li, Qing Chang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision Foundation Models, Vision Foundation, Large-scale pre-trained models, transferring generalized knowledge, Large-scale pre-trained

Notes: 29 pages, 10 figures, 16 tables

Abstract:Large-scale pre-trained models, such as Vision Foundation Models (VFMs), have demonstrated impressive performance across various downstream tasks by transferring generalized knowledge, especially when target data is limited. However, their high computational cost and the domain gap between natural and medical images limit their practical application in medical segmentation tasks. Motivated by this, we pose the following important question: "How can we effectively utilize the knowledge of large pre-trained VFMs to train a small, task-specific model for medical image segmentation when training data is limited?" To address this problem, we propose a novel and generalizable task-specific knowledge distillation framework. Our method fine-tunes the VFM on the target segmentation task to capture task-specific features before distilling the knowledge to smaller models, leveraging Low-Rank Adaptation (LoRA) to reduce the computational cost of fine-tuning. Additionally, we incorporate synthetic data generated by diffusion models to augment the transfer set, enhancing model performance in data-limited scenarios. Experimental results across five medical image datasets demonstrate that our method consistently outperforms task-agnostic knowledge distillation and self-supervised pretraining approaches like MoCo v3 and Masked Autoencoders (MAE). For example, on the KidneyUS dataset, our method achieved a 28% higher Dice score than task-agnostic KD using 80 labeled samples for fine-tuning. On the CHAOS dataset, it achieved an 11% improvement over MAE with 100 labeled samples. These results underscore the potential of task-specific knowledge distillation to train accurate, efficient models for medical image segmentation in data-constrained settings.

131. 【2503.06974】Asymmetric Visual Semantic Embedding Framework for Efficient Vision-Language Alignment

Link: https://arxiv.org/abs/2503.06974

Authors: Yang Liu, Mengyuan Liu, Shudong Huang, Jiancheng Lv

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Learning visual semantic, visual semantic similarity, Learning visual, visual semantic, semantic similarity

Notes: 9 pages, 5 figures, The 39th Annual AAAI Conference on Artificial Intelligence

Abstract:Learning visual semantic similarity is a critical challenge in bridging the gap between images and texts. However, there exist inherent variations between vision and language data, such as information density, i.e., images can contain textual information from multiple different views, which makes it difficult to compute the similarity between these two modalities accurately and efficiently. In this paper, we propose a novel framework called Asymmetric Visual Semantic Embedding (AVSE) to dynamically select features from various regions of images tailored to different textual inputs for similarity calculation. To capture information from different views in the image, we design a radial bias sampling module to sample image patches and obtain image features from various views. Furthermore, AVSE introduces a novel module for efficient computation of visual semantic similarity between asymmetric image and text embeddings. Central to this module is the presumption of foundational semantic units within the embeddings, denoted as "meta-semantic embeddings." It segments all embeddings into meta-semantic embeddings with the same dimension and calculates visual semantic similarity by finding the optimal match of meta-semantic embeddings of two modalities. Our proposed AVSE model is extensively evaluated on the large-scale MS-COCO and Flickr30K datasets, demonstrating its superiority over recent state-of-the-art methods.
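
The chunk-and-match idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes embeddings are plain lists of floats, that chunks are compared by cosine similarity, and that the "optimal match" is approximated by giving each text chunk its best-scoring image chunk. The function names and the chunk size `dim` are hypothetical.

```python
import math

def split_chunks(vec, dim):
    """Segment an embedding into consecutive chunks of length `dim`."""
    if len(vec) % dim != 0:
        raise ValueError("embedding length must be a multiple of dim")
    return [vec[i:i + dim] for i in range(0, len(vec), dim)]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def meta_semantic_similarity(image_emb, text_emb, dim):
    """Average, over text chunks, of the best cosine match among image chunks.

    Assumption: max-over-chunks stands in for the paper's optimal matching.
    The two embeddings may have different lengths (asymmetric).
    """
    image_chunks = split_chunks(image_emb, dim)
    text_chunks = split_chunks(text_emb, dim)
    best = [max(cosine(t, v) for v in image_chunks) for t in text_chunks]
    return sum(best) / len(best)
```

Because each text chunk only needs one good partner, a longer image embedding can cover several views while a short text embedding matches only the relevant ones.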

132. 【2503.06973】A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis

Link: https://arxiv.org/abs/2503.06973

Authors: Xiang Liu, Zhaoxiang Liu, Huan Hu, Zezhou Chen, Kohou Wang, Kai Wang, Shiguo Lian

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: shown considerable potential, text-based interactions, crop disease diagnosis, shown considerable, considerable potential

Notes: Accepted by ECCV 2024 (14 pages, 8 figures)

Abstract:While conversational generative AI has shown considerable potential in enhancing decision-making for agricultural professionals, its exploration has predominantly been anchored in text-based interactions. The evolution of multimodal conversational AI, leveraging vast amounts of image-text data from diverse sources, marks a significant stride forward. However, the application of such advanced vision-language models in the agricultural domain, particularly for crop disease diagnosis, remains underexplored. In this work, we present the crop disease domain multimodal (CDDM) dataset, a pioneering resource designed to advance the field of agricultural research through the application of multimodal learning techniques. The dataset comprises 137,000 images of various crop diseases, accompanied by 1 million question-answer pairs that span a broad spectrum of agricultural knowledge, from disease identification to management practices. By integrating visual and textual data, CDDM facilitates the development of sophisticated question-answering systems capable of providing precise, useful advice to farmers and agricultural professionals. We demonstrate the utility of the dataset by finetuning state-of-the-art multimodal models, showcasing significant improvements in crop disease diagnosis. Specifically, we employed a novel finetuning strategy that utilizes low-rank adaptation (LoRA) to finetune the visual encoder, adapter and language model simultaneously. Our contributions include not only the dataset but also a finetuning strategy and a benchmark to stimulate further research in agricultural technology, aiming to bridge the gap between advanced AI techniques and practical agricultural applications. The dataset is available at https://github.com/UnicomAI/UnicomBenchmark/tree/main/CDDMBench.

133. 【2503.06966】MIGA: Mutual Information-Guided Attack on Denoising Models for Semantic Manipulation

Link: https://arxiv.org/abs/2503.06966

Authors: Guanghao Li, Mingzhi Chen, Hao Yu, Shuting Dong, Wenhao Jiang, Ming Tang, Chun Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: retaining crucial semantic, functioning as filters, denoising models, Deep learning-based denoising, widely employed

Abstract:Deep learning-based denoising models have been widely employed in vision tasks, functioning as filters to eliminate noise while retaining crucial semantic information. Additionally, they play a vital role in defending against adversarial perturbations that threaten downstream tasks. However, these models can be intrinsically susceptible to adversarial attacks due to their dependence on specific noise assumptions. Existing attacks on denoising models mainly aim at deteriorating visual clarity while neglecting semantic manipulation, rendering them either easily detectable or limited in effectiveness. In this paper, we propose Mutual Information-Guided Attack (MIGA), the first method designed to directly attack deep denoising models by strategically disrupting their ability to preserve semantic content via adversarial perturbations. By minimizing the mutual information between the original and denoised images, a measure of semantic similarity, MIGA forces the denoiser to produce perceptually clean yet semantically altered outputs. While these images appear visually plausible, they encode systematically distorted semantics, revealing a fundamental vulnerability in denoising models. These distortions persist in denoised outputs and can be quantitatively assessed through downstream task performance. We propose new evaluation metrics and systematically assess MIGA on four denoising models across five datasets, demonstrating its consistent effectiveness in disrupting semantic fidelity. Our findings suggest that denoising models are not always robust and can introduce security risks in real-world applications.
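
The quantity at the center of this abstract, mutual information between two images, can be sketched as follows. The abstract does not say how the authors estimate it, so this sketch assumes a simple plug-in (histogram) estimate over paired discrete values such as quantized pixel intensities. The function name is hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in nats from paired discrete samples.

    Assumption: xs and ys are equal-length sequences of hashable values,
    e.g. quantized intensities of corresponding pixels in two images.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi
```

Identical inputs give the entropy of the signal, and statistically independent inputs give zero. That is why a low value indicates that little of the original content survives in the output.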

134. 【2503.06965】SeCap: Self-Calibrating and Adaptive Prompts for Cross-view Person Re-Identification in Aerial-Ground Networks

Link: https://arxiv.org/abs/2503.06965

Authors: Shining Wang, Yunlong Wang, Ruiqi Wu, Bingliang Jiao, Wenxuan Wang, Peng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: making identity matching, identity matching difficult, significant appearance variations, appearance variations caused, Aerial-Ground Person Re-identification

Abstract:When discussing the Aerial-Ground Person Re-identification (AGPReID) task, we face the main challenge of the significant appearance variations caused by different viewpoints, making identity matching difficult. To address this issue, previous methods attempt to reduce the differences between viewpoints by critical attributes and decoupling the viewpoints. While these methods can mitigate viewpoint differences to some extent, they still face two main issues: (1) difficulty in handling viewpoint diversity and (2) neglect of the contribution of local features. To effectively address these challenges, we design and implement the Self-Calibrating and Adaptive Prompt (SeCap) method for the AGPReID task. The core of this framework relies on the Prompt Re-calibration Module (PRM), which adaptively re-calibrates prompts based on the input. Combined with the Local Feature Refinement Module (LFRM), SeCap can extract view-invariant features from local features for AGPReID. Meanwhile, given the current scarcity of datasets in the AGPReID field, we further contribute two real-world Large-scale Aerial-Ground Person Re-Identification datasets, LAGPeR and G2APS-ReID. The former is collected and annotated by us independently, covering 4,231 unique identities and containing 63,841 high-quality images; the latter is reconstructed from the person search dataset G2APS. Through extensive experiments on AGPReID datasets, we demonstrate that SeCap is a feasible and effective solution for the AGPReID task. The datasets and source code are available at this https URL.

135. 【2503.06960】A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning

Link: https://arxiv.org/abs/2503.06960

Authors: Xin Wen, Bingchen Zhao, Yilun Chen, Jiangmiao Pang, Xiaojuan Qi

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: configuration remains unclear, optimal configuration remains, Pre-trained vision models, Pre-trained vision, remains unclear

Notes: Accepted by CVPR 2025

Abstract:Pre-trained vision models (PVMs) are fundamental to modern robotics, yet their optimal configuration remains unclear. Through systematic evaluation, we find that while DINO and iBOT outperform MAE across visuomotor control and perception tasks, they struggle when trained on non-(single-)object-centric (NOC) data--a limitation strongly correlated with their diminished ability to learn object-centric representations. This investigation indicates that the ability to form object-centric representations from the non-object-centric robotics dataset is the key to success for PVMs. Motivated by this discovery, we designed SlotMIM, a method that induces object-centric representations by introducing a semantic bottleneck to reduce the number of prototypes to encourage the emergence of objectness as well as cross-view consistency regularization for encouraging multiview invariance. Our experiments encompass pre-training on object-centric, scene-centric, web-crawled, and ego-centric data. Across all settings, our approach learns transferrable representations and achieves significant improvements over prior work in image recognition, scene understanding, and robot learning evaluations. When scaled up with million-scale datasets, our method also demonstrates superior data efficiency and scalability. Our code and models are publicly available at this https URL.

136. 【2503.06956】LaTexBlend: Scaling Multi-concept Customized Generation with Latent Textual Blending

Link: https://arxiv.org/abs/2503.06956

Authors: Jian Jin, Zhenbo Yu, Yang Shen, Zhenyong Fu, Jian Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generation renders user-specified, renders user-specified concepts, Latent Textual space, renders user-specified, contexts based

Notes: CVPR 2025

Abstract:Customized text-to-image generation renders user-specified concepts into novel contexts based on textual prompts. Scaling the number of concepts in customized generation meets a broader demand for user creation, whereas existing methods face challenges with generation quality and computational efficiency. In this paper, we propose LaTexBlend, a novel framework for effectively and efficiently scaling multi-concept customized generation. The core idea of LaTexBlend is to represent single concepts and blend multiple concepts within a Latent Textual space, which is positioned after the text encoder and a linear projection. LaTexBlend customizes each concept individually, storing them in a concept bank with a compact representation of latent textual features that captures sufficient concept information to ensure high fidelity. At inference, concepts from the bank can be freely and seamlessly combined in the latent textual space, offering two key merits for multi-concept generation: 1) excellent scalability, and 2) significant reduction of denoising deviation, preserving coherent layouts. Extensive experiments demonstrate that LaTexBlend can flexibly integrate multiple customized concepts with harmonious structures and high subject fidelity, substantially outperforming baselines in both generation quality and computational efficiency. Our code will be publicly available.

137. 【2503.06955】Motion Anything: Any to Motion Generation

Link: https://arxiv.org/abs/2503.06955

Authors: Zeyu Zhang, Yiran Wang, Wei Mao, Danning Li, Rui Zhao, Biao Wu, Zirui Song, Bohan Zhuang, Ian Reid, Richard Hartley

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Conditional motion generation, Conditional motion, computer vision, extensively studied, studied in computer

Abstract:Conditional motion generation has been extensively studied in computer vision, yet two critical challenges remain. First, while masked autoregressive methods have recently outperformed diffusion-based approaches, existing masking models lack a mechanism to prioritize dynamic frames and body parts based on given conditions. Second, existing methods for different conditioning modalities often fail to integrate multiple modalities effectively, limiting control and coherence in generated motion. To address these challenges, we propose Motion Anything, a multimodal motion generation framework that introduces an Attention-based Mask Modeling approach, enabling fine-grained spatial and temporal control over key frames and actions. Our model adaptively encodes multimodal conditions, including text and music, improving controllability. Additionally, we introduce Text-Motion-Dance (TMD), a new motion dataset consisting of 2,153 pairs of text, music, and dance, making it twice the size of AIST++, thereby filling a critical gap in the community. Extensive experiments demonstrate that Motion Anything surpasses state-of-the-art methods across multiple benchmarks, achieving a 15% improvement in FID on HumanML3D and showing consistent performance gains on AIST++ and TMD. See our project website this https URL

138. 【2503.06954】Approximate Size Targets Are Sufficient for Accurate Semantic Segmentation

Link: https://arxiv.org/abs/2503.06954

Authors: Xingye Fan, Zhongwen (Rex) Zhang, Yuri Boykov

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: relative object-size distributions, extending binary class, approximate relative object-size, binary class tags, extending binary

Abstract:This paper demonstrates a surprising result for segmentation with image-level targets: extending binary class tags to approximate relative object-size distributions allows off-the-shelf architectures to solve the segmentation problem. A straightforward zero-avoiding KL-divergence loss for average predictions produces segmentation accuracy comparable to the standard pixel-precise supervision with full ground truth masks. In contrast, current results based on class tags typically require complex non-reproducible architectural modifications and specialized multi-stage training procedures. Our ideas are validated on PASCAL VOC using our new human annotations of approximate object sizes. We also show the results on COCO and medical data using synthetically corrupted size targets. All standard networks demonstrate robustness to the size targets' errors. For some classes, the validation accuracy is significantly better than the pixel-level supervision; the latter is not robust to errors in the masks. Our work provides new ideas and insights on image-level supervision in segmentation and may encourage other simple general solutions to the problem.
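
The loss named in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: "zero-avoiding" is read here as the forward KL divergence KL(target || average prediction), per-pixel predictions are given as lists of class probabilities, and the function names and `eps` clamp are hypothetical.

```python
import math

def average_prediction(pixel_probs):
    """Average per-pixel class distributions into one image-level distribution."""
    n = len(pixel_probs)
    k = len(pixel_probs[0])
    return [sum(p[c] for p in pixel_probs) / n for c in range(k)]

def size_target_kl(target, pixel_probs, eps=1e-8):
    """Forward KL between a relative-size target and the average prediction.

    Assumption: `target` holds the approximate fraction of the image covered
    by each class and sums to 1. Classes with zero target contribute nothing;
    a near-zero predicted size for a class with positive target is penalized
    heavily, which is the zero-avoiding behaviour.
    """
    avg = average_prediction(pixel_probs)
    return sum(t * math.log(t / max(a, eps)) for t, a in zip(target, avg) if t > 0)
```

For example, with four pixels of which one is predicted as class 0 and three as class 1, a target of `[0.25, 0.75]` gives zero loss, while a target of `[0.5, 0.5]` gives a positive one.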

139. 【2503.06948】Large Language Model Guided Progressive Feature Alignment for Multimodal UAV Object Detection

Link: https://arxiv.org/abs/2503.06948

Authors: Wentao Wu, Chenglong Li, Xiao Wang, Bin Luo, Qi Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Existing multimodal UAV, multimodal UAV object, UAV object detection, Large Language Model, UAV object

Abstract:Existing multimodal UAV object detection methods often overlook the impact of semantic gaps between modalities, which makes it difficult to achieve accurate semantic and spatial alignments, limiting detection performance. To address this problem, we propose a Large Language Model (LLM) guided Progressive feature Alignment Network called LPANet, which leverages the semantic features extracted from a large language model to guide the progressive semantic and spatial alignment between modalities for multimodal UAV object detection. To employ the powerful semantic representation of LLM, we generate the fine-grained text descriptions of each object category by ChatGPT and then extract the semantic features using the large language model MPNet. Based on the semantic features, we guide the semantic and spatial alignments in a progressive manner as follows. First, we design the Semantic Alignment Module (SAM) to pull the semantic features and multimodal visual features of each object closer, alleviating the semantic differences of objects between modalities. Second, we design the Explicit Spatial alignment Module (ESM) by integrating the semantic relations into the estimation of feature-level offsets, alleviating the coarse spatial misalignment between modalities. Finally, we design the Implicit Spatial alignment Module (ISM), which leverages the cross-modal correlations to aggregate key features from neighboring regions to achieve implicit spatial alignment. Comprehensive experiments on two public multimodal UAV object detection datasets demonstrate that our approach outperforms state-of-the-art multimodal UAV object detectors.

140. 【2503.06947】Aligning Instance-Semantic Sparse Representation towards Unsupervised Object Segmentation and Shape Abstraction with Repeatable Primitives

Link: https://arxiv.org/abs/2503.06947

Authors: Jiaxin Li, Hongxing Wang, Jiawei Tan, Zhilong Ou, Junsong Yuan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: object parts, object, shape, abstracted from results, object parts abstracted

Notes: 15 pages, 15 figures, 8 tables

Abstract:Understanding 3D object shapes necessitates shape representation by object parts abstracted from results of instance and semantic segmentation. Promising shape representations enable computers to interpret a shape with meaningful parts and identify their repeatability. However, supervised shape representations depend on costly annotation efforts, while current unsupervised methods work under strong semantic priors and involve multi-stage training, thereby limiting their generalization and deployment in shape reasoning and understanding. Driven by the tendency of high-dimensional semantically similar features to lie in or near low-dimensional subspaces, we introduce a one-stage, fully unsupervised framework towards semantic-aware shape representation. This framework produces joint instance segmentation, semantic segmentation, and shape abstraction through sparse representation and feature alignment of object parts in a high-dimensional space. For sparse representation, we devise a sparse latent membership pursuit method that models each object part feature as a sparse convex combination of point features at either the semantic or instance level, promoting part features in the same subspace to exhibit similar semantics. For feature alignment, we customize an attention-based strategy in the feature space to align instance- and semantic-level object part features and reconstruct the input shape using both of them, ensuring geometric reusability and semantic consistency of object parts. To firm up semantic disambiguation, we construct cascade unfrozen learning on geometric parameters of object parts.

141. 【2503.06940】CineBrain: A Large-Scale Multi-Modal Brain Dataset During Naturalistic Audiovisual Narrative Processing

Link: https://arxiv.org/abs/2503.06940

Authors: Jianxiong Gao, Yichang Liu, Baofeng Yang, Jianfeng Feng, Yanwei Fu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: featuring simultaneous EEG, Big Bang Theory, dynamic audiovisual stimulation, large-scale dataset featuring, dataset featuring simultaneous

Comments: 14 pages, 13 figures

Click to view abstract

Abstract: In this paper, we introduce CineBrain, the first large-scale dataset featuring simultaneous EEG and fMRI recordings during dynamic audiovisual stimulation. Recognizing the complementary strengths of EEG's high temporal resolution and fMRI's deep-brain spatial coverage, CineBrain provides approximately six hours of narrative-driven content from the popular television series The Big Bang Theory for each of six participants. Building upon this unique dataset, we propose CineSync, an innovative multimodal decoding framework that integrates a Multi-Modal Fusion Encoder with a diffusion-based Neural Latent Decoder. Our approach effectively fuses EEG and fMRI signals, significantly improving the reconstruction quality of complex audiovisual stimuli. To facilitate rigorous evaluation, we introduce Cine-Benchmark, a comprehensive evaluation protocol that assesses reconstructions across semantic and perceptual dimensions. Experimental results demonstrate that CineSync achieves state-of-the-art video reconstruction performance and highlight our initial success in combining fMRI and EEG for reconstructing both video and audio stimuli. Project Page: this https URL.

142. 【2503.06938】Modeling Human Skeleton Joint Dynamics for Fall Detection

Link: https://arxiv.org/abs/2503.06938

Authors: Sania Zahan, Ghulam Mubashar Hassan, Ajmal Mian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: population aging calls, support systems, increasing pace, pace of population, population aging

Comments: Published in 2021 Digital Image Computing: Techniques and Applications (DICTA)

Click to view abstract

Abstract: The increasing pace of population aging calls for better care and support systems. Falling is a frequent and critical problem for elderly people, causing serious long-term health issues. Fall detection from video streams is not an attractive option for real-life applications due to privacy issues. Existing methods try to resolve this issue by using very low-resolution cameras or video encryption. However, privacy cannot be ensured completely with such approaches. Key points on the body, such as skeleton joints, can convey significant information about motion dynamics and successive posture changes, which are crucial for fall detection. Skeleton joints have been explored for feature extraction, but with image recognition models that ignore joint dependency across frames, which is important for the classification of actions. Moreover, existing models are over-parameterized or evaluated on small datasets with very few activity classes. We propose an efficient graph convolution network model that exploits spatio-temporal joint dependencies and dynamics of human skeleton joints for accurate fall detection. Our method leverages dynamic representation with robust concurrent spatio-temporal characteristics of skeleton joints. We performed extensive experiments on three large-scale datasets. With a significantly smaller model size than most existing methods, our proposed method achieves state-of-the-art results on the large-scale NTU datasets.

143. 【2503.06934】LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs

Link: https://arxiv.org/abs/2503.06934

Authors: Hanyu Zhou, Gim Hee Lee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large multimodal models, spatiotemporal reasoning due, fine-grained spatiotemporal reasoning, multimodal models, Large multimodal

Comments:

Click to view abstract

Abstract:Large multimodal models (LMMs) excel in scene understanding but struggle with fine-grained spatiotemporal reasoning due to weak alignment between linguistic and visual representations. Existing methods map textual positions and durations into the visual space encoded from frame-based videos, but suffer from temporal sparsity that limits language-vision temporal coordination. To address this issue, we introduce LLaFEA (Large Language and Frame-Event Assistant) to leverage event cameras for temporally dense perception and frame-event fusion. Our approach employs a cross-attention mechanism to integrate complementary spatial and temporal features, followed by self-attention matching for global spatio-temporal associations. We further embed textual position and duration tokens into the fused visual space to enhance fine-grained alignment. This unified framework ensures robust spatio-temporal coordinate alignment, enabling LMMs to interpret scenes at any position and any time. In addition, we construct a dataset of real-world frames-events with coordinate instructions and conduct extensive experiments to validate the effectiveness of the proposed method.

144. 【2503.06930】Post-Training Quantization for Diffusion Transformer via Hierarchical Timestep Grouping

Link: https://arxiv.org/abs/2503.06930

Authors: Ning Ding, Jing Han, Yuchuan Tian, Chao Xu, Kai Han, Yehui Tang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: great generation capability, building image generation, Diffusion Transformer, preferred choice, choice for building

Comments:

Click to view abstract

Abstract: Diffusion Transformer (DiT) has now become the preferred choice for building image generation models due to its great generation capability. Unlike previous convolution-based UNet models, DiT is purely composed of a stack of transformer blocks, which makes DiT as scalable as large language models. However, the growing model size and multi-step sampling paradigm put considerable pressure on deployment and inference. In this work, we propose a post-training quantization framework tailored for Diffusion Transformers to tackle these challenges. First, we find that the quantization difficulty of DiT mainly originates from time-dependent, channel-specific outliers. We propose a timestep-aware shift-and-scale strategy that smooths the activation distribution to reduce the quantization error. Second, based on the observation that activations of adjacent timesteps have similar distributions, we utilize a hierarchical clustering scheme to divide the denoising timesteps into multiple groups. We further design a re-parameterization scheme that absorbs the quantization parameters into nearby modules to avoid redundant computations. Comprehensive experiments demonstrate that our PTQ method successfully quantizes the Diffusion Transformer into 8-bit weight and 8-bit activation (W8A8) with a state-of-the-art FID score. Our method can further quantize the DiT model into 4-bit weight and 8-bit activation (W4A8) without sacrificing generation quality.
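The timestep grouping step described in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the per-timestep statistic, the rule of merging only adjacent groups, and the names `group_timesteps` and `stats` are all assumptions made for the example.

```python
def group_timesteps(stats, num_groups):
    """Greedy agglomerative grouping of adjacent timesteps.

    stats: one scalar activation statistic per timestep, in sampling order
           (hypothetical input, e.g. a max absolute activation).
    Repeatedly merges the two neighbouring groups whose mean statistics
    are closest, until `num_groups` groups remain.
    """
    groups = [[i] for i in range(len(stats))]

    def mean(group):
        return sum(stats[i] for i in group) / len(group)

    while len(groups) > num_groups:
        # Distance between each pair of neighbouring groups.
        gaps = [abs(mean(groups[j]) - mean(groups[j + 1]))
                for j in range(len(groups) - 1)]
        j = gaps.index(min(gaps))
        groups[j:j + 2] = [groups[j] + groups[j + 1]]  # merge the closest pair
    return groups


# Toy statistics for six timesteps with three visible regimes.
toy_stats = [1.0, 1.5, 2.0, 10.0, 10.5, 20.0]
toy_groups = group_timesteps(toy_stats, num_groups=3)
```

In a setup like this, each resulting group could share one set of quantization parameters, which is the motivation the abstract gives for grouping timesteps with similar activation distributions.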

145. 【2503.06923】From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers

Link: https://arxiv.org/abs/2503.06923

Authors: Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Junjie Chen, Linfeng Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: computational demands remain, demands remain prohibitive, Diffusion Transformers, revolutionized high-fidelity image, real-time applications

Comments: 13 pages, 14 figures

Click to view abstract

Abstract: Diffusion Transformers (DiT) have revolutionized high-fidelity image and video synthesis, yet their computational demands remain prohibitive for real-time applications. To solve this problem, feature caching has been proposed to accelerate diffusion models by caching the features in the previous timesteps and then reusing them in the following timesteps. However, at timesteps with significant intervals, the feature similarity in diffusion models decreases substantially, leading to a pronounced increase in errors introduced by feature caching, significantly harming the generation quality. To solve this problem, we propose TaylorSeer, which first shows that features of diffusion models at future timesteps can be predicted based on their values at previous timesteps. Based on the fact that features change slowly and continuously across timesteps, TaylorSeer employs a differential method to approximate the higher-order derivatives of features and predict features in future timesteps with Taylor series expansion. Extensive experiments demonstrate its significant effectiveness in both image and video synthesis, especially at high acceleration ratios. For instance, it achieves an almost lossless acceleration of 4.99× on FLUX and 5.00× on HunyuanVideo without additional training. On DiT, it achieves 3.41 lower FID compared with previous SOTA at 4.53× acceleration. Our code has been released on GitHub: this https URL
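The forecasting idea in this abstract (estimate derivatives from cached features by finite differences, then extrapolate with a Taylor expansion) can be sketched on a toy scalar signal. This is only an illustration under stated assumptions: the function names, the equally spaced cache, and the use of backward differences are choices made here, not details confirmed by the abstract.

```python
from math import factorial


def backward_differences(history):
    """Return [Δ^0 f, Δ^1 f, ...] evaluated at the newest entry of `history`.

    history: feature values cached at equally spaced steps, oldest first.
    """
    diffs = []
    level = list(history)
    while level:
        diffs.append(level[-1])  # i-th order backward difference at the newest step
        level = [b - a for a, b in zip(level, level[1:])]
    return diffs


def taylor_forecast(history, spacing, steps_ahead):
    """Extrapolate the feature `steps_ahead` steps past the newest cached value.

    Each i-th difference divided by spacing**i approximates the i-th derivative,
    which is then plugged into a truncated Taylor series.
    """
    return sum(
        d / (factorial(i) * spacing ** i) * steps_ahead ** i
        for i, d in enumerate(backward_differences(history))
    )


# Toy scalar "feature" f(t) = 2t + 1, cached at t = 0, 2, 4 (spacing of 2 steps).
cached = [1.0, 5.0, 9.0]
prediction = taylor_forecast(cached, spacing=2, steps_ahead=1)  # estimate of f(5)
```

For a linear signal the forecast is exact; for curved signals it is only an approximation whose error grows with the distance from the last cached step.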

146. 【2503.06903】When Lighting Deceives: Exposing Vision-Language Models' Illumination Vulnerability Through Illumination Transformation Attack

Link: https://arxiv.org/abs/2503.06903

Authors: Hanqing Liu, Shouwei Ruan, Yao Huang, Shiji Zhao, Xingxing Wei

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remains largely unexplored, achieved remarkable success, variations remains largely, largely unexplored

Comments:

Click to view abstract

Abstract: Vision-Language Models (VLMs) have achieved remarkable success in various tasks, yet their robustness to real-world illumination variations remains largely unexplored. To bridge this gap, we propose Illumination Transformation Attack (ITA), the first framework to systematically assess VLMs' robustness against illumination changes. However, there still exist two key challenges: (1) how to model global illumination with fine-grained control to achieve diverse lighting conditions and (2) how to ensure adversarial effectiveness while maintaining naturalness. To address the first challenge, we decompose global illumination into multiple parameterized point light sources based on the illumination rendering equation. This design enables us to model more diverse lighting variations that previous methods could not capture. Then, by integrating these parameterized lighting variations with physics-based lighting reconstruction techniques, we can precisely render such light interactions in the original scenes, meeting the goal of fine-grained lighting control. For the second challenge, by controlling illumination through the lighting reconstruction model's latent space rather than direct pixel manipulation, we inherently preserve physical lighting priors. Furthermore, to prevent potential reconstruction artifacts, we design additional perceptual constraints for maintaining visual consistency with original images and diversity constraints for avoiding light source convergence. Extensive experiments demonstrate that our ITA can significantly reduce the performance of advanced VLMs, e.g., LLaVA-1.6, while possessing competitive naturalness, exposing VLMs' critical illumination vulnerabilities.

147. 【2503.06901】Iterative Prompt Relocation for Distribution-Adaptive Visual Prompt Tuning

Link: https://arxiv.org/abs/2503.06901

Authors: Chikai Shang, Mengke Li, Yiqun Zhang, Zhen Chen, Jinlin Wu, Fangqing Gu, Yang Lu, Yiu-ming Cheung

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Visual prompt tuning, adapting pre-trained models, incorporating learnable prompts, Visual prompt, VPT

Comments:

Click to view abstract

Abstract:Visual prompt tuning (VPT) provides an efficient and effective solution for adapting pre-trained models to various downstream tasks by incorporating learnable prompts. However, most prior art indiscriminately applies a fixed prompt distribution across different tasks, neglecting the importance of each block differing depending on the task. In this paper, we investigate adaptive distribution optimization (ADO) by addressing two key questions: (1) How to appropriately and formally define ADO, and (2) How to design an adaptive distribution strategy guided by this definition? Through in-depth analysis, we provide an affirmative answer that properly adjusting the distribution significantly improves VPT performance, and further uncover a key insight that a nested relationship exists between ADO and VPT. Based on these findings, we propose a new VPT framework, termed PRO-VPT (iterative Prompt RelOcation-based VPT), which adaptively adjusts the distribution building upon a nested optimization formulation. Specifically, we develop a prompt relocation strategy for ADO derived from this formulation, comprising two optimization steps: identifying and pruning idle prompts, followed by determining the optimal blocks for their relocation. By iteratively performing prompt relocation and VPT, our proposal adaptively learns the optimal prompt distribution, thereby unlocking the full potential of VPT. Extensive experiments demonstrate that our proposal significantly outperforms state-of-the-art VPT methods, e.g., PRO-VPT surpasses VPT by 1.6% average accuracy, leading prompt-based methods to state-of-the-art performance on the VTAB-1k benchmark. The code is available at this https URL.

148. 【2503.06900】DirectTriGS: Triplane-based Gaussian Splatting Field Representation for 3D Generation

Link: https://arxiv.org/abs/2503.06900

Authors: Xiaoliang Ju, Hongsheng Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: present DirectTriGS, Gaussian Splatting, represent Gaussian Splatting, Gaussian, Gaussian point clouds

Comments: Accepted by CVPR 2025

Click to view abstract

Abstract:We present DirectTriGS, a novel framework designed for 3D object generation with Gaussian Splatting (GS). GS-based rendering for 3D content has gained considerable attention recently. However, there has been limited exploration in directly generating 3D Gaussians compared to traditional generative modeling approaches. The main challenge lies in the complex data structure of GS represented by discrete point clouds with multiple channels. To overcome this challenge, we propose employing the triplane representation, which allows us to represent Gaussian Splatting as an image-like continuous field. This representation effectively encodes both the geometry and texture information, enabling smooth transformation back to Gaussian point clouds and rendering into images by a TriRenderer, with only 2D supervisions. The proposed TriRenderer is fully differentiable, so that the rendering loss can supervise both texture and geometry encoding. Furthermore, the triplane representation can be compressed using a Variational Autoencoder (VAE), which can subsequently be utilized in latent diffusion to generate 3D objects. The experiments demonstrate that the proposed generation framework can produce high-quality 3D object geometry and rendering results in the text-to-3D task.

149. 【2503.06898】Illuminating Darkness: Enhancing Real-world Low-light Scenes with Smartphone Images

Link: https://arxiv.org/abs/2503.06898

Authors: S M A Sharif, Abdur Rehman, Zain Ul Abidin, Rizwan Ali Naqvi, Fayaz Ali Dharejo, Radu Timofte

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Digital cameras, produce plausible images, cameras often struggle, struggle to produce, produce plausible

Comments:

Click to view abstract

Abstract: Digital cameras often struggle to produce plausible images in low-light conditions. Improving these single-shot images remains challenging due to a lack of diverse real-world paired data samples. To address this limitation, we propose a large-scale, high-resolution (i.e., beyond 4k) paired Single-Shot Low-Light Enhancement (SLLIE) dataset. Our dataset comprises 6,425 unique focus-aligned image pairs captured with smartphone sensors in dynamic settings under challenging lighting conditions (0.1 to 200 lux), covering various indoor and outdoor scenes with varying noise and intensity. We extracted and refined around 180,000 non-overlapping patches from 6,025 collected scenes for training while reserving 400 pairs for benchmarking. In addition, we collected 2,117 low-light scenes from different sources for extensive real-world aesthetic evaluation. To our knowledge, this is the largest real-world dataset available for SLLIE research. We also propose learning luminance-chrominance (LC) attributes separately through a tuning fork-shaped transformer model to enhance real-world low-light images, addressing challenges like denoising and over-enhancement in complex scenes. We also propose an LC cross-attention block for feature fusion, an LC refinement block for enhanced reconstruction, and LC-guided supervision to ensure perceptually coherent enhancements. We demonstrated our method's effectiveness across various hardware and scenarios, proving its practicality in real-world applications. Code and dataset available at this https URL.

150. 【2503.06897】HiSTF Mamba: Hierarchical Spatiotemporal Fusion with Multi-Granular Body-Spatial Modeling for High-Fidelity Text-to-Motion Generation

Link: https://arxiv.org/abs/2503.06897

Authors: Xingzu Zhan, Chen Xie, Haoran Sun, Xiaochun Mai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: rapidly growing field, computer graphics, promising flexible, applications in gaming, virtual reality

Comments: 11 pages, 3 figures

Click to view abstract

Abstract: Text-to-motion generation is a rapidly growing field at the nexus of multimodal learning and computer graphics, promising flexible and cost-effective applications in gaming, animation, robotics, and virtual reality. Existing approaches often rely on simple spatiotemporal stacking, which introduces feature redundancy, while subtle joint-level details remain overlooked from a spatial perspective. To this end, we propose a novel HiSTF Mamba framework. The framework is composed of three key modules: Dual-Spatial Mamba, Bi-Temporal Mamba, and Dynamic Spatiotemporal Fusion Module (DSFM). Dual-Spatial Mamba incorporates "Part-based + Whole-based" parallel modeling to represent both whole-body coordination and fine-grained joint dynamics. Bi-Temporal Mamba adopts a bidirectional scanning strategy, effectively encoding short-term motion details and long-term dependencies. DSFM further performs redundancy removal and extraction of complementary information for temporal features, then fuses them with spatial features, yielding an expressive spatio-temporal representation. Experimental results on the HumanML3D dataset demonstrate that HiSTF Mamba achieves state-of-the-art performance across multiple metrics. In particular, it reduces the FID score from 0.283 to 0.189, a relative decrease of nearly 30%. These findings validate the effectiveness of HiSTF Mamba in achieving high fidelity and strong semantic alignment in text-to-motion generation.

151. 【2503.06896】CATANet: Efficient Content-Aware Token Aggregation for Lightweight Image Super-Resolution

Link: https://arxiv.org/abs/2503.06896

Authors: Xin Liu, Jie Liu, Jie Tang, Gangshan Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: low-level visual tasks, demonstrated impressive performance, Transformer-based methods, demonstrated impressive, low-level visual

Comments: Accepted by CVPR2025

Click to view abstract

Abstract: Transformer-based methods have demonstrated impressive performance in low-level visual tasks such as Image Super-Resolution (SR). However, their computational complexity grows quadratically with the spatial resolution. A series of works attempt to alleviate this problem by dividing low-resolution images into local windows, axial stripes, or dilated windows. SR typically leverages the redundancy of images for reconstruction, and this redundancy appears not only in local regions but also in long-range regions. However, these methods limit attention computation to content-agnostic local regions, directly limiting the ability of attention to capture long-range dependency. To address these issues, we propose a lightweight Content-Aware Token Aggregation Network (CATANet). Specifically, we propose an efficient Content-Aware Token Aggregation module for aggregating long-range content-similar tokens, which shares token centers across all image tokens and updates them only during the training phase. Then we utilize intra-group self-attention to enable long-range information interaction. Moreover, we design an inter-group cross-attention to further enhance global information interaction. The experimental results show that, compared with the state-of-the-art cluster-based method SPIN, our method achieves superior performance, with a maximum PSNR improvement of 0.33 dB and nearly double the inference speed.

152. 【2503.06894】Improving cognitive diagnostics in pathology: a deep learning approach for augmenting perceptional understanding of histopathology images

Link: https://arxiv.org/abs/2503.06894

Authors: Xiaoqian Hu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Made Significant Strides, Recent Years, Made Significant, Significant Strides, Combines Vision Transformers

Comments:

Click to view abstract

Abstract: In recent years, digital technologies have made significant strides in augmenting human health, cognition, and perception, particularly within the field of computational pathology. This paper presents a novel approach to enhancing the analysis of histopathology images by leveraging a multimodal model that combines Vision Transformers (ViT) with GPT-2 for image captioning. The model is fine-tuned on the specialized ARCH dataset, which includes dense image captions derived from clinical and academic resources, to capture the complexities of pathology images such as tissue morphologies, staining variations, and pathological conditions. By generating accurate, contextually relevant captions, the model augments the cognitive capabilities of healthcare professionals, enabling more efficient disease classification, segmentation, and detection. The model enhances the perception of subtle pathological features in images that might otherwise go unnoticed, thereby improving diagnostic accuracy. Our approach demonstrates the potential for digital technologies to augment human cognitive abilities in medical image analysis, providing steps toward more personalized and accurate healthcare outcomes.

153. 【2503.06887】Accessing the Effect of Phyllotaxy and Planting Density on Light Use Efficiency in Field-Grown Maize using 3D Reconstructions

Link: https://arxiv.org/abs/2503.06887

Authors: Nasla Saleem, Talukder Zaki Jubery, Aditya Balu, Yan Zhou, Yawei Li, Patrick S. Schnable, Adarsh Krishnamurthy, Baskar Ganapathysubramanian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: widely adopted strategy, increased interplant competition, enhance maize productivity, limit light capture, light capture

Comments: 17 pages, 8 figures

Click to view abstract

Abstract:High-density planting is a widely adopted strategy to enhance maize productivity, yet it introduces challenges such as increased interplant competition and shading, which can limit light capture and overall yield potential. In response, some maize plants naturally reorient their canopies to optimize light capture, a process known as canopy reorientation. Understanding this adaptive response and its impact on light capture is crucial for maximizing agricultural yield potential. This study introduces an end-to-end framework that integrates realistic 3D reconstructions of field-grown maize with photosynthetically active radiation (PAR) modeling to assess the effects of phyllotaxy and planting density on light interception. In particular, using 3D point clouds derived from field data, virtual fields for a diverse set of maize genotypes were constructed and validated against field PAR measurements. Using this framework, we present detailed analyses of the impact of canopy orientations, plant and row spacings, and planting row directions on PAR interception throughout a typical growing season. Our findings highlight significant variations in light interception efficiency across different planting densities and canopy orientations. By elucidating the relationship between canopy architecture and light capture, this study offers valuable guidance for optimizing maize breeding and cultivation strategies across diverse agricultural settings.

154. 【2503.06885】ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks

Link: https://arxiv.org/abs/2503.06885

Authors: Yan Yang, Dongxu Li, Haoning Wu, Bei Chen, Liu Liu, Liyuan Pan, Junnan Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Solving expert-level multimodal, Solving expert-level, expert-level multimodal tasks, key milestone, milestone towards general

Comments:

Click to view abstract

Abstract: Solving expert-level multimodal tasks is a key milestone towards general intelligence. As the capabilities of multimodal large language models (MLLMs) continue to improve, evaluation of such advanced multimodal intelligence becomes necessary yet challenging. In this work, we introduce ProBench, a benchmark of open-ended user queries that require professional expertise and advanced reasoning. ProBench consists of 4,000 high-quality samples independently submitted by professionals based on their daily productivity demands. It spans 10 fields and 56 sub-fields, including science, arts, humanities, coding, mathematics, and creative writing. Experimentally, we evaluate and compare 24 of the latest models using MLLM-as-a-Judge. Our results reveal that although the best open-source models rival the proprietary ones, ProBench presents significant challenges in visual perception, textual understanding, domain knowledge, and advanced reasoning, thus providing valuable directions for future multimodal AI research efforts.

155. 【2503.06884】Text-to-Image Diffusion Models Cannot Count, and Prompt Refinement Cannot Help

Link: https://arxiv.org/abs/2503.06884

Authors: Yuefan Cao, Xuyang Guo, Jiayan Huo, Yingyu Liang, Zhenmei Shi, Zhao Song, Jiahao Zhang, Zhen Zhuang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: unprecedented real-world impacts, gained unprecedented real-world, today AI community, real-world impacts, modeling is widely

Comments:

Click to view abstract

Abstract:Generative modeling is widely regarded as one of the most essential problems in today's AI community, with text-to-image generation having gained unprecedented real-world impacts. Among various approaches, diffusion models have achieved remarkable success and have become the de facto solution for text-to-image generation. However, despite their impressive performance, these models exhibit fundamental limitations in adhering to numerical constraints in user instructions, frequently generating images with an incorrect number of objects. While several prior works have mentioned this issue, a comprehensive and rigorous evaluation of this limitation remains lacking. To address this gap, we introduce T2ICountBench, a novel benchmark designed to rigorously evaluate the counting ability of state-of-the-art text-to-image diffusion models. Our benchmark encompasses a diverse set of generative models, including both open-source and private systems. It explicitly isolates counting performance from other capabilities, provides structured difficulty levels, and incorporates human evaluations to ensure high reliability. Extensive evaluations with T2ICountBench reveal that all state-of-the-art diffusion models fail to generate the correct number of objects, with accuracy dropping significantly as the number of objects increases. Additionally, an exploratory study on prompt refinement demonstrates that such simple interventions generally do not improve counting accuracy. Our findings highlight the inherent challenges in numerical understanding within diffusion models and point to promising directions for future improvements.

156. 【2503.06873】Interactive Medical Image Analysis with Concept-based Similarity Reasoning

Link: https://arxiv.org/abs/2503.06873

Authors: Ta Duc Huy, Sen Kim Tran, Phan Nguyen, Nguyen Hoang Tran, Tran Bao Sam, Anton van den Hengel, Zhibin Liao, Johan W. Verjans, Minh-Son To, Vu Minh Hieu Phan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: intervene model decisions, computer-aided diagnosis methods, clinical workflows, ability to interpret, interpret and intervene

Comments: Accepted CVPR2025

Click to view abstract

Abstract: The ability to interpret and intervene in model decisions is important for the adoption of computer-aided diagnosis methods in clinical workflows. Recent concept-based methods link the model predictions with interpretable concepts and modify their activation scores to interact with the model. However, these concepts are at the image level, which hinders the model from pinpointing the exact patches where the concepts are activated. Alternatively, prototype-based methods learn representations from training image patches and compare these with test image patches, using the similarity scores for final class prediction. However, interpreting the underlying concepts of these patches can be challenging and often necessitates post-hoc guesswork. To address this issue, this paper introduces the novel Concept-based Similarity Reasoning network (CSR), which offers (i) patch-level prototypes with intrinsic concept interpretation, and (ii) spatial interactivity. First, the proposed CSR provides localized explanation by grounding prototypes of each concept on image regions. Second, our model introduces novel spatial-level interaction, allowing doctors to engage directly with specific image areas, making it an intuitive and transparent tool for medical imaging. CSR improves upon prior state-of-the-art interpretable methods by up to 4.5% across three biomedical datasets. Our code is released at this https URL.

157. 【2503.06863】HIF: Height Interval Filtering for Efficient Dynamic Points Removal

Link: https://arxiv.org/abs/2503.06863

Authors: Shufang Zhang, Tao Jiang, Jiazheng Wu, Ziyu Meng, Ziyang Zhang, Shan An

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: cloud mapping plays, point cloud mapping, autonomous navigation, mapping plays, plays an essential

Comments:

Click to view abstract

Abstract: 3D point cloud mapping plays an essential role in localization and autonomous navigation. However, dynamic objects often leave residual traces during the map construction process, which undermine the performance of subsequent tasks. Therefore, dynamic object removal has become a critical challenge in point-cloud-based map construction within dynamic scenarios. Existing approaches, however, often incur significant computational overhead, making it difficult to meet real-time processing requirements. To address this issue, we introduce the Height Interval Filtering (HIF) method. This approach constructs pillar-based height interval representations to probabilistically model the vertical dimension, with interval probabilities updated through Bayesian inference. It ensures real-time performance while achieving high accuracy and improving robustness in complex environments. Additionally, we propose a low-height preservation strategy that enhances the detection of unknown spaces, reducing misclassification in areas blocked by obstacles (occluded regions). Experiments on public datasets demonstrate that HIF delivers a 7.7 times improvement in time efficiency with comparable accuracy to existing SOTA methods. The code will be publicly available.
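The phrase "interval probabilities updated through Bayesian inference" can be illustrated with a generic sketch. The abstract does not give the update rule, so the code below assumes a standard binary Bayes filter in log-odds form. The class `PillarMap`, its resolutions, and the hit probability are hypothetical choices, and the free-space (miss) update that a full method would need is omitted.

```python
import math
from collections import defaultdict


def logit(p):
    return math.log(p / (1.0 - p))


def sigmoid(log_odds):
    return 1.0 / (1.0 + math.exp(-log_odds))


class PillarMap:
    """Pillars on an xy grid, each split into height intervals along z."""

    def __init__(self, xy_res=0.5, z_res=0.25, p_hit=0.7):
        self.xy_res = xy_res
        self.z_res = z_res
        self.l_hit = logit(p_hit)
        self.log_odds = defaultdict(float)  # 0.0 log-odds means a prior of 0.5

    def key(self, x, y, z):
        # (pillar column, pillar row, height interval index)
        return (math.floor(x / self.xy_res),
                math.floor(y / self.xy_res),
                math.floor(z / self.z_res))

    def observe(self, points):
        # Each interval touched by the scan gets one Bayesian "hit" update.
        for k in {self.key(*p) for p in points}:
            self.log_odds[k] += self.l_hit

    def probability(self, x, y, z):
        return sigmoid(self.log_odds.get(self.key(x, y, z), 0.0))


pillar_map = PillarMap()
pillar_map.observe([(0.1, 0.1, 0.1), (0.2, 0.2, 0.2)])  # both fall in one interval
after_one_scan = pillar_map.probability(0.1, 0.1, 0.1)
pillar_map.observe([(0.1, 0.1, 0.1)])
after_two_scans = pillar_map.probability(0.1, 0.1, 0.1)
```

Under this assumed rule, intervals that are hit repeatedly across scans converge toward high occupancy, while intervals seen only briefly stay near the prior, which is the kind of signal a dynamic-point filter could threshold.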

158. 【2503.06860】Towards Generalization of Tactile Image Generation: Reference-Free Evaluation in a Leakage-Free Setting

Link: https://arxiv.org/abs/2503.06860

Authors: Cagri Gungor, Derek Eppinger, Adriana Kovashka

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: direct physical contact, computer vision, multimodal learning, relies on direct, direct physical

Comments:

Abstract:Tactile sensing, which relies on direct physical contact, is critical for human perception and underpins applications in computer vision, robotics, and multimodal learning. Because tactile data is often scarce and costly to acquire, generating synthetic tactile images provides a scalable solution to augment real-world measurements. However, ensuring robust generalization in synthesizing tactile images, which requires capturing subtle, material-specific contact features, remains challenging. We demonstrate that overlapping training and test samples in commonly used datasets inflate performance metrics, obscuring the true generalizability of tactile models. To address this, we propose a leakage-free evaluation protocol coupled with novel, reference-free metrics (TMMD, I-TMMD, CI-TMMD, and D-TMMD) tailored for tactile generation. Moreover, we propose a vision-to-touch generation method that leverages text as an intermediate modality by incorporating concise, material-specific descriptions during training to better capture essential tactile features. Experiments on two popular visuo-tactile datasets, Touch and Go and HCT, show that our approach achieves superior performance and enhanced generalization in a leakage-free setting.

159. 【2503.06859】ActiveInitSplat: How Active Image Selection Helps Gaussian Splatting

Link: https://arxiv.org/abs/2503.06859

Authors: Konstantinos D. Polyzos, Athanasios Bacharis, Saketh Madhuvarasu, Nikos Papanikolopoulos, Tara Javidi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: meeting reduced storage, reduced storage demands, real-time scene rendering, computational efficiency, extensions and variants

Comments:

Abstract:Gaussian splatting (GS) along with its extensions and variants provides outstanding performance in real-time scene rendering while meeting reduced storage demands and computational efficiency. While the selection of 2D images capturing the scene of interest is crucial for the proper initialization and training of GS, hence markedly affecting the rendering performance, prior works rely on passively and typically densely selected 2D images. In contrast, this paper proposes `ActiveInitSplat', a novel framework for active selection of training images for proper initialization and training of GS. ActiveInitSplat relies on density and occupancy criteria of the resultant 3D scene representation from the selected 2D images, to ensure that the latter are captured from diverse viewpoints leading to better scene coverage and that the initialized Gaussian functions are well aligned with the actual 3D structure. Numerical tests on well-known simulated and real environments demonstrate the merits of ActiveInitSplat resulting in significant GS rendering performance improvement over passive GS baselines, in the widely adopted LPIPS, SSIM, and PSNR metrics.

160. 【2503.06852】From Image- to Pixel-level: Label-efficient Hyperspectral Image Reconstruction

Link: https://arxiv.org/abs/2503.06852

Authors: Yihong Leng, Jiaojiao Li, Haitao Xu, Rui Song

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Current hyperspectral image, form abundant high-quality, abundant high-quality HSIs, Current hyperspectral, methods primarily rely

Comments:

Abstract:Current hyperspectral image (HSI) reconstruction methods primarily rely on image-level approaches, which are time-consuming to form abundant high-quality HSIs through imagers. In contrast, spectrometers offer a more efficient alternative by capturing high-fidelity point spectra, enabling pixel-level HSI reconstruction that balances accuracy and label efficiency. To this end, we introduce a pixel-level spectral super-resolution (Pixel-SSR) paradigm that reconstructs HSI from RGB and point spectra. Despite its advantages, Pixel-SSR presents two key challenges: 1) generalizability to novel scenes lacking point spectra, and 2) effective information extraction to promote reconstruction accuracy. To address the first challenge, a Gamma-modeled strategy is investigated to synthesize point spectra based on their intrinsic properties, including nonnegativity, a skewed distribution, and a positive correlation. Furthermore, complementary three-branch prompts from RGB and point spectra are extracted with a Dynamic Prompt Mamba (DyPro-Mamba), which progressively directs the reconstruction with global spatial distributions, edge details, and spectral dependency. Comprehensive evaluations, including horizontal comparisons with leading methods and vertical assessments across unsupervised and image-level supervised paradigms, demonstrate that ours achieves competitive reconstruction accuracy with efficient label consumption.

161. 【2503.06847】MADS: Multi-Attribute Document Supervision for Zero-Shot Image Classification

Link: https://arxiv.org/abs/2503.06847

Authors: Xiangyan Qu, Jing Yu, Jiamin Zhuang, Gaopeng Gou, Gang Xiong, Qi Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recognize unseen classes, Zero-shot learning, shared auxiliary information, unseen classes, aims to train

Comments:

Abstract:Zero-shot learning (ZSL) aims to train a model on seen classes and recognize unseen classes by knowledge transfer through shared auxiliary information. Recent studies reveal that documents from encyclopedias provide helpful auxiliary information. However, existing methods align noisy documents, entangled in visual and non-visual descriptions, with image regions, yet solely depend on implicit learning. These models fail to filter non-visual noise reliably and incorrectly align non-visual words to image regions, which is harmful to knowledge transfer. In this work, we propose a novel multi-attribute document supervision framework to remove noises at both document collection and model learning stages. With the help of large language models, we introduce a novel prompt algorithm that automatically removes non-visual descriptions and enriches less-described documents in multiple attribute views. Our proposed model, MADS, extracts multi-view transferable knowledge with information decoupling and semantic interactions for semantic alignment at local and global levels. Besides, we introduce a model-agnostic focus loss to explicitly enhance attention to visually discriminative information during training, also improving existing methods without additional parameters. With comparable computation costs, MADS consistently outperforms the SOTA by 7.2% and 8.2% on average in three benchmarks for document-based ZSL and GZSL settings, respectively. Moreover, we qualitatively offer interpretable predictions from multiple attribute views.

162. 【2503.06840】Improving Visual Place Recognition with Sequence-Matching Receptiveness Prediction

Link: https://arxiv.org/abs/2503.06840

Authors: Somayeh Hussaini, Tobias Fischer, Michael Milford

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: visual place recognition, integrating temporal information, sequence-based matching approaches, place recognition, filtering and sequence-based

Comments: 8 pages, 5 figures, under review

Abstract:In visual place recognition (VPR), filtering and sequence-based matching approaches can improve performance by integrating temporal information across image sequences, especially in challenging conditions. While these methods are commonly applied, their effects on system behavior can be unpredictable and can actually make performance worse in certain situations. In this work, we present a new supervised learning approach that learns to predict the per-frame sequence matching receptiveness (SMR) of VPR techniques, enabling the system to selectively decide when to trust the output of a sequence matching system. The approach is agnostic to the underlying VPR technique. Our approach predicts SMR, and hence significantly improves VPR performance, across a large range of state-of-the-art and classical VPR techniques (namely CosPlace, MixVPR, EigenPlaces, SALAD, AP-GeM, NetVLAD and SAD), and across three benchmark VPR datasets (Nordland, Oxford RobotCar, and SFU-Mountain). We also provide insights into a complementary approach that uses the predictor to replace discarded matches, as well as ablation studies, including an analysis of the interactions between our SMR predictor and the selected sequence length. We will release our code upon acceptance.

163. 【2503.06839】AttFC: Attention Fully-Connected Layer for Large-Scale Face Recognition with One GPU

Link: https://arxiv.org/abs/2503.06839

Authors: Zhuowen Zheng, Yain-Whar Si, Xiaochen Yuan, Junwei Duan, Ke Wang, Xiaofan Li, Xinyuan Zhang, Xueyuan Gong

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: deep neural networks, achieved exceptional performance, large-scale datasets, neural networks, advancement of deep

Comments:

Abstract:Nowadays, with the advancement of deep neural networks (DNNs) and the availability of large-scale datasets, the face recognition (FR) model has achieved exceptional performance. However, the parameter count of the fully connected (FC) layer depends directly on the number of identities in the dataset. When the FR model is trained on large-scale datasets, the model becomes excessively large, leading to substantial demand for computational resources, such as time and memory. This paper proposes the attention fully connected (AttFC) layer, which could significantly reduce computational resources. AttFC employs an attention loader to generate the generative class center (GCC), and dynamically stores the class centers with a Dynamic Class Container (DCC). DCC only stores a small subset of all class centers in FC, thus its parameter count is substantially less than that of the FC layer. Also, training face recognition models on large-scale datasets with one GPU often encounters out-of-memory (OOM) issues. AttFC overcomes this and achieves comparable performance to state-of-the-art methods.

164. 【2503.06832】GUIDE-CoT: Goal-driven and User-Informed Dynamic Estimation for Pedestrian Trajectory using Chain-of-Thought

Link: https://arxiv.org/abs/2503.06832

Authors: Sungsik Kim, Janghyun Baek, Jinkyu Kim, Jaekoo Lee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Language Models, Large Language, recently shown impressive, shown impressive results

Comments: 10 pages, 5 figures, will be published on The 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2025)

Abstract:While Large Language Models (LLMs) have recently shown impressive results in reasoning tasks, their application to pedestrian trajectory prediction remains challenging due to two key limitations: insufficient use of visual information and the difficulty of predicting entire trajectories. To address these challenges, we propose Goal-driven and User-Informed Dynamic Estimation for pedestrian trajectory using Chain-of-Thought (GUIDE-CoT). Our approach integrates two innovative modules: (1) a goal-oriented visual prompt, which enhances goal prediction accuracy by combining visual prompts with a pretrained visual encoder, and (2) a chain-of-thought (CoT) LLM for trajectory generation, which generates realistic trajectories toward the predicted goal. Moreover, our method introduces controllable trajectory generation, allowing for flexible and user-guided modifications to the predicted paths. Through extensive experiments on the ETH/UCY benchmark datasets, our method achieves state-of-the-art performance, delivering both high accuracy and greater adaptability in pedestrian trajectory prediction. Our code is publicly available at this https URL.

165. 【2503.06831】One-Shot Dual-Arm Imitation Learning

Link: https://arxiv.org/abs/2503.06831

Authors: Yilong Wang, Edward Johns

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: One-Shot Dual-Arm Imitation, Dual-Arm Imitation Learning, Imitation Learning, introduce One-Shot Dual-Arm, Dual-Arm Imitation

Comments: Accepted at ICRA 2025. Project Webpage: [this https URL](https://www.robot-learning.uk/one-shot-dual-arm)

Abstract:We introduce One-Shot Dual-Arm Imitation Learning (ODIL), which enables dual-arm robots to learn precise and coordinated everyday tasks from just a single demonstration of the task. ODIL uses a new three-stage visual servoing (3-VS) method for precise alignment between the end-effector and target object, after which replay of the demonstration trajectory is sufficient to perform the task. This is achieved without requiring prior task or object knowledge, or additional data collection and training following the single demonstration. Furthermore, we propose a new dual-arm coordination paradigm for learning dual-arm tasks from a single demonstration. ODIL was tested on a real-world dual-arm robot, demonstrating state-of-the-art performance across six precise and coordinated tasks in both 4-DoF and 6-DoF settings, and showing robustness in the presence of distractor objects and partial occlusions. Videos are available at: this https URL.

166. 【2503.06821】HierDAMap: Towards Universal Domain Adaptive BEV Mapping via Hierarchical Perspective Priors

Link: https://arxiv.org/abs/2503.06821

Authors: Siyu Li, Yihong Cao, Hao Shi, Yongsheng Zang, Xuan He, Kailun Yang, Zhiyong Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)

Keywords: visual perception technology, driven significant innovation, BEV mapping, BEV, BEV mapping tasks

Comments: The source code will be made publicly available at [this https URL](https://github.com/lynn-yu/HierDAMap)

Abstract:The exploration of Bird's-Eye View (BEV) mapping technology has driven significant innovation in visual perception technology for autonomous driving. BEV mapping models need to be applied to the unlabeled real world, making the study of unsupervised domain adaptation models an essential path. However, research on unsupervised domain adaptation for BEV mapping remains limited and cannot perfectly accommodate all BEV mapping tasks. To address this gap, this paper proposes HierDAMap, a universal and holistic BEV domain adaptation framework with hierarchical perspective priors. Unlike existing research that solely focuses on image-level learning using prior knowledge, this paper explores the guiding role of perspective prior knowledge across three distinct levels: global, sparse, and instance levels. With these priors, HierDA consists of three essential components, including Semantic-Guided Pseudo Supervision (SGPS), Dynamic-Aware Coherence Learning (DACL), and Cross-Domain Frustum Mixing (CDFM). SGPS constrains the cross-domain consistency of perspective feature distribution through pseudo labels generated by vision foundation models in 2D space. To mitigate feature distribution discrepancies caused by spatial variations, DACL employs uncertainty-aware predicted depth as an intermediary to derive dynamic BEV labels from perspective pseudo-labels, thereby constraining the coarse BEV features derived from corresponding perspective features. CDFM, on the other hand, leverages perspective masks of view frustum to mix multi-view perspective images from both domains, which guides cross-domain view transformation and encoding learning through mixed BEV labels. The proposed method is verified on multiple BEV mapping tasks, such as BEV semantic segmentation, high-definition semantic, and vectorized mapping. The source code will be made publicly available at this https URL.

167. 【2503.06820】Towards Fine-Grained Video Question Answering

Link: https://arxiv.org/abs/2503.06820

Authors: Wei Dai, Alan Luo, Zane Durante, Debadutta Dash, Arnold Milstein, Kevin Schulman, Ehsan Adeli, Li Fei-Fei

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: rapidly evolving domain, Video Question Answering, Question Answering, Multi-Actor Question Answering, remains a focal

Comments:

Abstract:In the rapidly evolving domain of video understanding, Video Question Answering (VideoQA) remains a focal point. However, existing datasets exhibit gaps in temporal and spatial granularity, which consequently limits the capabilities of existing VideoQA methods. This paper introduces the Multi-Object Multi-Actor Question Answering (MOMA-QA) dataset, which is designed to address these shortcomings by emphasizing temporal localization, spatial relationship reasoning, and entity-centric queries. With ground truth scene graphs and temporal interval annotations, MOMA-QA is ideal for developing models for fine-grained video understanding. Furthermore, we present a novel video-language model, SGVLM, which incorporates a scene graph predictor, an efficient frame retriever, and a pre-trained large language model for temporal localization and fine-grained relationship understanding. Evaluations on MOMA-QA and other public datasets demonstrate the superior performance of our model, setting new benchmarks for VideoQA.

168. 【2503.06818】Sub-Image Recapture for Multi-View 3D Reconstruction

Link: https://arxiv.org/abs/2503.06818

Authors: Yanwei Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high-resolution target remains, challenge task due, input image size, high-resolution target, target remains

Comments: 5 pages, 4 figures

Abstract:3D reconstruction of high-resolution targets remains a challenging task due to the large memory required by large input images. Recently developed learning-based algorithms provide better reconstruction performance than traditional ones; however, they generally require more memory than the traditional algorithms and face scalability issues. In this paper, we develop a generic approach, sub-image recapture (SIR), to split large images into smaller sub-images and process them individually. As a result of this framework, existing 3D reconstruction algorithms can be implemented based on sub-image recapture with significantly reduced memory and substantially improved scalability.
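
The splitting step mentioned in the abstract can be illustrated with a small tiling sketch. This covers only the "split a large image into sub-images and process them one at a time" part; the abstract does not describe how SIR handles camera geometry for each sub-image, so nothing of that kind is shown. The function names `tile_boxes` and `crop`, and the nested-list image layout, are assumptions for illustration.

```python
def tile_boxes(width, height, tile_w, tile_h):
    """Return (left, top, right, bottom) boxes that cover the image without overlap."""
    boxes = []
    for top in range(0, height, tile_h):
        for left in range(0, width, tile_w):
            # Edge tiles are clipped so they never extend past the image.
            right = min(left + tile_w, width)
            bottom = min(top + tile_h, height)
            boxes.append((left, top, right, bottom))
    return boxes


def crop(image, box):
    """Extract one sub-image from a row-major nested-list image."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


# A tiny 5x4 "image" whose pixel values are just their flat indices.
image = [[r * 5 + c for c in range(5)] for r in range(4)]
for box in tile_boxes(5, 4, 2, 2):
    sub = crop(image, box)
    # Each sub-image can now be processed on its own, bounding peak memory.
    print(box, sub)
```

Peak memory then scales with the tile size rather than the full image size, which is the scalability benefit the abstract refers to.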

169. 【2503.06814】Unlocking Generalization for Robotics via Modularity and Scale

Link: https://arxiv.org/abs/2503.06814

Authors: Murtaza Dalal

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: generalist robot systems, large-scale policy learning, generalist robot, robot systems, robot

Comments: CMU Robotics PhD Thesis, 185 pages

Abstract:How can we build generalist robot systems? Scale may not be enough due to the significant multimodality of robotics tasks, lack of easily accessible data and the challenges of deploying on physical hardware. Meanwhile, most deployed robotic systems today are inherently modular and can leverage the independent generalization capabilities of each module to perform well. Therefore, this thesis seeks to tackle the task of building generalist robot agents by integrating these components into one: combining modularity with large-scale learning for general purpose robot control. The first question we consider is: how can we build modularity and hierarchy into learning systems? Our key insight is that rather than having the agent learn hierarchy and low-level control end-to-end, we can enforce modularity via planning to enable more efficient and capable robot learners. Next, we come to the role of scale in building generalist robot systems. To scale, neural networks require vast amounts of diverse data, expressive architectures to fit the data and a source of supervision to generate the data. We leverage a powerful supervision source: classical planning, which can generalize, but is expensive to run and requires access to privileged information to perform well in practice. We use these planners to supervise large-scale policy learning in simulation to produce generalist agents. Finally, we consider how to unify modularity with large-scale policy learning to build real-world robot systems capable of performing zero-shot manipulation. We do so by tightly integrating key ingredients of modular high and mid-level planning, learned local control, procedural scene generation and large-scale policy learning for sim2real transfer. We demonstrate that this recipe can produce a single, generalist agent that can solve challenging long-horizon manipulation tasks in the real world.

170. 【2503.06805】Multimodal Emotion Recognition and Sentiment Analysis in Multi-Party Conversation Contexts

Link: https://arxiv.org/abs/2503.06805

Authors: Aref Farhadipour, Hossein Ranjbar, Masoumeh Chapariniya, Teodora Vukovic, Sarah Ebling, Volker Dellwo

Categories: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: scenarios involving multi-party, real-world scenarios involving, conversational data, language processing, involving multi-party

Comments: 5 pages

Abstract:Emotion recognition and sentiment analysis are pivotal tasks in speech and language processing, particularly in real-world scenarios involving multi-party, conversational data. This paper presents a multimodal approach to tackle these challenges on a well-known dataset. We propose a system that integrates four key modalities/channels using pre-trained models: RoBERTa for text, Wav2Vec2 for speech, a proposed FacialNet for facial expressions, and a CNN+Transformer architecture trained from scratch for video analysis. Feature embeddings from each modality are concatenated to form a multimodal vector, which is then used to predict emotion and sentiment labels. The multimodal system demonstrates superior performance compared to unimodal approaches, achieving an accuracy of 66.36% for emotion recognition and 72.15% for sentiment analysis.

171. 【2503.06800】VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation

Link: https://arxiv.org/abs/2503.06800

Authors: Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Goldenberg, Aditya Grover, Kai-Wei Chang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large-scale video generative, physical world simulators, diverse visual concepts, general-purpose physical world, Large-scale video

Comments: 41 pages, 33 Figures

Abstract:Large-scale video generative models, capable of creating realistic videos of diverse visual concepts, are strong candidates for general-purpose physical world simulators. However, their adherence to physical commonsense across real-world actions remains unclear (e.g., playing tennis, backflip). Existing benchmarks suffer from limitations such as limited size, lack of human evaluation, sim-to-real gaps, and absence of fine-grained physical rule analysis. To address this, we introduce VideoPhy-2, an action-centric dataset for evaluating physical commonsense in generated videos. We curate 200 diverse actions and detailed prompts for video synthesis from modern generative models. We perform human evaluation that assesses semantic adherence, physical commonsense, and grounding of physical rules in the generated videos. Our findings reveal major shortcomings, with even the best model achieving only 22% joint performance (i.e., high semantic and physical commonsense adherence) on the hard subset of VideoPhy-2. We find that the models particularly struggle with conservation laws like mass and momentum. Finally, we also train VideoPhy-AutoEval, an automatic evaluator for fast, reliable assessment on our dataset. Overall, VideoPhy-2 serves as a rigorous benchmark, exposing critical gaps in video generative models and guiding future research in physically-grounded video generation. The data and code are available at this https URL.

172. 【2503.06795】Robotic Ultrasound-Guided Femoral Artery Reconstruction of Anatomically-Representative Phantoms

Link: https://arxiv.org/abs/2503.06795

Authors: Lidia Al-Zogbi, Deepak Raina, Vinciya Pandian, Thorsten Fleiter, Axel Krieger

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: including diagnostic angiography, therapeutic catheterization, including diagnostic, diagnostic angiography, essential for numerous

Comments:

Abstract:Femoral artery access is essential for numerous clinical procedures, including diagnostic angiography, therapeutic catheterization, and emergency interventions. Despite its critical role, successful vascular access remains challenging due to anatomical variability, overlying adipose tissue, and the need for precise ultrasound (US) guidance. Errors in needle placement can lead to severe complications, restricting the procedure to highly skilled clinicians in controlled hospital settings. While robotic systems have shown promise in addressing these challenges through autonomous scanning and vessel reconstruction, clinical translation remains limited due to reliance on simplified phantom models that fail to capture human anatomical complexity. In this work, we present a method for autonomous robotic US scanning of bifurcated femoral arteries, and validate it on five vascular phantoms created from real patient computed tomography (CT) data. Additionally, we introduce a video-based deep learning US segmentation network tailored for vascular imaging, enabling improved 3D arterial reconstruction. The proposed network achieves a Dice score of 89.21% and an Intersection over Union of 80.54% on a newly developed vascular dataset. The quality of the reconstructed artery centerline is evaluated against ground truth CT data, demonstrating an average L2 deviation of 0.91 ± 0.70 mm, with an average Hausdorff distance of 4.36 ± 1.11 mm. This study is the first to validate an autonomous robotic system for US scanning of the femoral artery on a diverse set of patient-specific phantoms, introducing a more advanced framework for evaluating robotic performance in vascular imaging and intervention.
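
For readers unfamiliar with the two segmentation metrics quoted in the abstract, here is a short sketch of how Dice and Intersection over Union are computed for a single pair of binary masks. The function name `dice_and_iou` and the flat 0/1 list representation are assumptions for illustration, and the toy masks are made up; they are not data from the paper.

```python
def dice_and_iou(pred, target):
    """Compute (Dice, IoU) for two equal-length binary masks given as 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - inter
    if union == 0:
        # Both masks are empty: treat as a perfect match by convention.
        return 1.0, 1.0
    dice = 2.0 * inter / (pred_area + target_area)
    iou = inter / union
    return dice, iou


pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
dice, iou = dice_and_iou(pred, target)
print(dice, iou)
```

For a single mask pair the two scores are tied by IoU = Dice / (2 - Dice). Dataset-level figures such as those in the abstract need not satisfy this identity exactly, depending on how the per-sample scores are averaged.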

173. 【2503.06794】Silent Hazards of Token Reduction in Vision-Language Models: The Hidden Impact on Consistency

Link: https://arxiv.org/abs/2503.06794

Authors: Yizheng Sun, Hao Li, Chang Xu, Chenghua Lin, Riza Batista-Navarro, Jingyuan Sun

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Vision language models, Vision language, incur high computational, token reduction, incur high

Comments:

174. 【2503.06790】GenDR: Lightning Generative Detail Restorator

Link: https://arxiv.org/abs/2503.06790

Authors: Yan Wang, Shijie Zhao, Kai Chen, Kexin Zhang, Junlin Li, Li Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)

Keywords: Recent research applying, achieved remarkable success, Recent research, research applying, real-world super-resolution

Comments:

Abstract:Recent research applying text-to-image (T2I) diffusion models to real-world super-resolution (SR) has achieved remarkable success. However, fundamental misalignments between T2I and SR targets result in a dilemma between inference speed and detail fidelity. Specifically, T2I tasks prioritize multi-step inversion to synthesize coherent outputs aligned with textual prompts and shrink the latent space to reduce generating complexity. Contrariwise, SR tasks preserve most information from low-resolution input while solely restoring high-frequency details, thus necessitating sufficient latent space and fewer inference steps. To bridge the gap, we present a one-step diffusion model for generative detail restoration, GenDR, distilled from a tailored diffusion model with larger latent space. In detail, we train a new SD2.1-VAE16 (0.9B) via representation alignment to expand latent space without enlarging the model size. Regarding step-distillation, we propose consistent score identity distillation (CiD) that incorporates SR task-specific loss into score distillation to leverage more SR priors and align the training target. Furthermore, we extend CiD with adversarial learning and representation alignment (CiDA) to enhance perceptual quality and accelerate training. We also polish the pipeline to achieve a more efficient inference. Experimental results demonstrate that GenDR achieves state-of-the-art performance in both quantitative metrics and visual fidelity.

175. 【2503.06784】Infinite Leagues Under the Sea: Photorealistic 3D Underwater Terrain Generation by Latent Fractal Diffusion Models

Link: https://arxiv.org/abs/2503.06784

Authors: Tianyi Zhang, Weiming Zhi, Joshua Mangelson, Matthew Johnson-Roberson

Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: paper tackles, tackles the problem, problem of generating, generating representations, underwater

Comments: 10 pages

Abstract:This paper tackles the problem of generating representations of underwater 3D terrain. Off-the-shelf generative models, trained on Internet-scale data but not on specialized underwater images, exhibit downgraded realism, as images of the seafloor are relatively uncommon. To this end, we introduce DreamSea, a generative model to generate hyper-realistic underwater scenes. DreamSea is trained on real-world image databases collected from underwater robot surveys. Images from these surveys contain massive real seafloor observations and cover large areas, but are prone to noise and artifacts from the real world. We extract 3D geometry and semantics from the data with visual foundation models, and train a diffusion model that generates realistic seafloor images in RGBD channels, conditioned on novel fractal distribution-based latent embeddings. We then fuse the generated images into a 3D map, building a 3DGS model supervised by 2D diffusion priors which allows photorealistic novel view rendering. DreamSea is rigorously evaluated, demonstrating the ability to robustly generate large-scale underwater scenes that are consistent, diverse, and photorealistic. Our work drives impact in multiple domains, spanning filming, gaming, and robot simulation.

176. 【2503.06773】Investigating Image Manifolds of 3D Objects: Learning, Shape Analysis, and Comparisons

Link: https://arxiv.org/abs/2503.06773

Authors: Benjamin Beaudett, Shenyuan Liang, Anuj Srivastava

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: manifolds, long been hypothesized, hypothesized to form, objects, image manifolds

Comments:

Abstract:Despite high-dimensionality of images, the sets of images of 3D objects have long been hypothesized to form low-dimensional manifolds. What is the nature of such manifolds? How do they differ across objects and object classes? Answering these questions can provide key insights in explaining and advancing success of machine learning algorithms in computer vision. This paper investigates dual tasks -- learning and analyzing shapes of image manifolds -- by revisiting a classical problem of manifold learning but from a novel geometrical perspective. It uses geometry-preserving transformations to map the pose image manifolds, sets of images formed by rotating 3D objects, to low-dimensional latent spaces. The pose manifolds of different objects in latent spaces are found to be nonlinear, smooth manifolds. The paper then compares shapes of these manifolds for different objects using Kendall's shape analysis, modulo rigid motions and global scaling, and clusters objects according to these shape metrics. Interestingly, pose manifolds for objects from the same classes are frequently clustered together. The geometries of image manifolds can be exploited to simplify vision and image processing tasks, to predict performances, and to provide insights into learning methods.
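
Comparing shapes "modulo rigid motions and global scaling" can be made concrete with a small sketch of Kendall's shape distance. This is an illustration under strong assumptions, not the paper's pipeline: it handles only planar (2D) landmark configurations with matching landmark order, represented as complex numbers, whereas the paper's latent spaces may have other dimensions. The function names are hypothetical.

```python
import cmath
import math


def preshape(points):
    """Remove translation and scale: center the landmarks and normalize to unit size."""
    z = [complex(x, y) for x, y in points]
    centroid = sum(z) / len(z)
    z = [v - centroid for v in z]
    # Assumes the landmarks are not all identical, so the norm is non-zero.
    norm = math.sqrt(sum(abs(v) ** 2 for v in z))
    return [v / norm for v in z]


def kendall_distance(a, b):
    """Shape distance in radians; taking |inner product| removes rotation."""
    za, zb = preshape(a), preshape(b)
    inner = sum(u.conjugate() * v for u, v in zip(za, zb))
    return math.acos(min(1.0, abs(inner)))


def similarity_transform(points, scale, angle, shift):
    """Scale, rotate, and translate a configuration (used here only to build a test case)."""
    rot = cmath.rect(scale, angle)
    out = []
    for x, y in points:
        w = complex(x, y) * rot + complex(*shift)
        out.append((w.real, w.imag))
    return out


triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
moved = similarity_transform(triangle, scale=3.0, angle=0.7, shift=(5.0, -2.0))
stretched = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0)]
print(kendall_distance(triangle, moved))      # numerically close to zero
print(kendall_distance(triangle, stretched))  # clearly positive
```

A pairwise distance matrix built this way could then be fed to any standard clustering routine, which is the kind of comparison the abstract describes at a high level.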

177. 【2503.06764】SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation

Link: https://arxiv.org/abs/2503.06764

Authors: Zisheng Chen, Chunwei Wang, Xiuwei Chen, Hang Xu, Jianhua Han, Xiandan Liang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: understanding and generation, consistent discrete feature, discrete feature representations, generation tasks, Semantic-Guided Hierarchical codebook

Comments: Under Review

Abstract:We present SemHiTok, a unified image Tokenizer via Semantic-Guided Hierarchical codebook that provides consistent discrete feature representations for multimodal understanding and generation tasks. Recently, unified multimodal large models (MLLMs) for understanding and generation have sparked exploration within the research community. Previous works attempt to train a unified image tokenizer by combining loss functions for semantic feature reconstruction and pixel reconstruction. However, due to the differing levels of features prioritized by multimodal understanding and generation tasks, joint training methods face significant challenges in achieving a good trade-off. SemHiTok addresses this challenge through a Semantic-Guided Hierarchical codebook, which builds texture sub-codebooks on a pre-trained semantic codebook. This design decouples the training of semantic reconstruction and pixel reconstruction and equips the tokenizer with low-level texture feature extraction capability without degradation of high-level semantic feature extraction ability. Our experiments demonstrate that SemHiTok achieves state-of-the-art rFID score at 256×256 resolution compared to other unified tokenizers, and exhibits competitive performance on multimodal understanding and generation tasks.

178. 【2503.06762】Gaussian RBFNet: Gaussian Radial Basis Functions for Fast and Accurate Representation and Reconstruction of Neural Fields

链接：https://arxiv.org/abs/2503.06762

作者：Abdelaziz Bouzidi,Hamid Laga,Hazem Wannous

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：recently revolutionized novel-view, revolutionized novel-view synthesis, Neural Radiance Fields, recently revolutionized, revolutionized novel-view

备注： Our code is available at [this https URL](https://grbfnet.github.io/)

点击查看摘要

Abstract:Neural fields such as DeepSDF and Neural Radiance Fields have recently revolutionized novel-view synthesis and 3D reconstruction from RGB images and videos. However, achieving high-quality representation, reconstruction, and rendering requires deep neural networks, which are slow to train and evaluate. Although several acceleration techniques have been proposed, they often trade off speed for memory. Gaussian splatting-based methods, on the other hand, accelerate the rendering time but remain costly in terms of training speed and memory needed to store the parameters of a large number of Gaussians. In this paper, we introduce a novel neural representation that is fast, both at training and inference times, and lightweight. Our key observation is that the neurons used in traditional MLPs perform simple computations (a dot product followed by ReLU activation) and thus one needs to use either wide and deep MLPs or high-resolution and high-dimensional feature grids to parameterize complex nonlinear functions. We show in this paper that by replacing traditional neurons with Radial Basis Function (RBF) kernels, one can achieve highly accurate representation of 2D (RGB images), 3D (geometry), and 5D (radiance fields) signals with just a single layer of such neurons. The representation is highly parallelizable, operates on low-resolution feature grids, and is compact and memory-efficient. We demonstrate that the proposed novel representation can be trained for 3D geometry representation in less than 15 seconds and for novel view synthesis in less than 15 mins. At runtime, it can synthesize novel views at more than 60 fps without sacrificing quality.
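
To make the contrast drawn in the abstract concrete, the sketch below places a conventional neuron (a dot product followed by ReLU) next to a Gaussian RBF unit and a single layer built from such units. The function names, the isotropic kernel form exp(-||x - c||^2 / (2 sigma^2)), and all parameter values are illustrative assumptions; the abstract does not specify the paper's exact formulation.

```python
import math


def relu_neuron(x, w, b):
    """Conventional MLP neuron: dot product plus bias, followed by ReLU."""
    return max(0.0, sum(xi * wi for xi, wi in zip(x, w)) + b)


def gaussian_rbf_neuron(x, center, sigma):
    """Isotropic Gaussian RBF unit (assumed form).

    The response is 1 at the center and decays with squared distance.
    """
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def rbf_layer(x, centers, sigmas, weights):
    """Single layer of RBF units: weighted sum of the kernel responses."""
    return sum(
        w * gaussian_rbf_neuron(x, c, s)
        for c, s, w in zip(centers, sigmas, weights)
    )
```

Each RBF unit responds only near its center, so a single layer can place localized bumps where the signal needs them, whereas a ReLU neuron splits the input space along one hyperplane.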

179. 【2503.06759】Revisiting Invariant Learning for Out-of-Domain Generalization on Multi-Site Mammogram Datasets

链接：https://arxiv.org/abs/2503.06759

作者：Hung Q. Vo,Samira Zare,Son T. Ly,Lin Wang,Chika F. Ezeana,Xiaohui Yu,Kelvin K. Wong,Stephen T.C. Wong,Hien V. Nguyen

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：settings remains uncertain, development settings remains, deep learning techniques, remains uncertain, invariant learning

备注：

点击查看摘要

Abstract:Despite significant progress in robust deep learning techniques for mammogram breast cancer classification, their reliability in real-world clinical development settings remains uncertain. The translation of these models to clinical practice faces challenges due to variations in medical centers, imaging protocols, and patient populations. To enhance their robustness, invariant learning methods have been proposed, prioritizing causal factors over misleading features. However, their effectiveness in clinical development and impact on mammogram classification require investigation. This paper reassesses the application of invariant learning for breast cancer risk estimation based on mammograms. Utilizing diverse multi-site public datasets, it represents the first study in this area. The objective is to evaluate invariant learning's benefits in developing robust models. Invariant learning methods, including Invariant Risk Minimization and Variance Risk Extrapolation, are compared quantitatively against Empirical Risk Minimization. Evaluation metrics include accuracy, average precision, and area under the curve. Additionally, interpretability is examined through class activation maps and visualization of learned representations. This research examines the advantages, limitations, and challenges of invariant learning for mammogram classification, guiding future studies to develop generalized methods for breast cancer prediction on whole mammograms in out-of-domain scenarios.
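
As a rough illustration of how the compared objectives differ, the sketch below computes an ERM-style objective and a variance-penalized objective from per-site risks. It assumes equally sized sites (so that the mean of site risks stands in for the pooled risk) and uses the commonly stated form "sum of site risks plus beta times their variance" for the variance penalty; the function names, the risk values in the usage note, and beta are hypothetical and not taken from the paper.

```python
from statistics import fmean, pvariance


def erm_objective(site_risks):
    """ERM-style objective: average risk, ignoring which site it came from."""
    return fmean(site_risks)


def vrex_objective(site_risks, beta):
    """Variance-penalized objective (assumed form).

    Total risk across sites plus beta times the population variance of the
    per-site risks, so sites with unequal risk are penalized.
    """
    return sum(site_risks) + beta * pvariance(site_risks)
```

For example, with hypothetical site risks `[0.2, 0.4, 0.6]` the variance term is nonzero and grows with `beta`, while identical risks such as `[0.4, 0.4, 0.4]` incur no penalty.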

180. 【2503.06749】Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

链接：https://arxiv.org/abs/2503.06749

作者：Wenxuan Huang,Bohan Jia,Zijie Zhai,Shaosheng Cao,Zheyu Ye,Fei Zhao,Yao Hu,Shaohui Lin

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

关键词：Reinforcement Learning, purely through Reinforcement, successfully demonstrated, demonstrated the emergence, LLMs purely

备注：

181. 【2503.06748】DiffAtlas: GenAI-fying Atlas Segmentation via Image-Mask Diffusion

链接：https://arxiv.org/abs/2503.06748

作者：Hantao Zhang,Yuhe Liu,Jiancheng Yang,Weidong Guo,Xinyuan Wang,Pascal Fua

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：precise anatomical delineation, Accurate medical image, Accurate medical, anatomical delineation, crucial for precise

备注： 11 pages

点击查看摘要

Abstract:Accurate medical image segmentation is crucial for precise anatomical delineation. Deep learning models like U-Net have shown great success but depend heavily on large datasets and struggle with domain shifts, complex structures, and limited training samples. Recent studies have explored diffusion models for segmentation by iteratively refining masks. However, these methods still retain the conventional image-to-mask mapping, making them highly sensitive to input data, which hampers stability and generalization. In contrast, we introduce DiffAtlas, a novel generative framework that models both images and masks through diffusion during training, effectively "GenAI-fying" atlas-based segmentation. During testing, the model is guided to generate a specific target image-mask pair, from which the corresponding mask is obtained. DiffAtlas retains the robustness of the atlas paradigm while overcoming its scalability and domain-specific limitations. Extensive experiments on CT and MRI across same-domain, cross-modality, varying-domain, and different data-scale settings using the MMWHS and TotalSegmentator datasets demonstrate that our approach outperforms existing methods, particularly in limited-data and zero-shot modality segmentation. Code is available at this https URL.

182. 【2503.06746】Color Alignment in Diffusion

链接：https://arxiv.org/abs/2503.06746

作者：Ka Chun Shum,Binh-Son Hua,Duc Thanh Nguyen,Sai-Kit Yeung

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：shown great promise, synthesizing visually appealing, visually appealing images, Diffusion models, shown great

备注： CVPR 2025

点击查看摘要

Abstract:Diffusion models have shown great promise in synthesizing visually appealing images. However, it remains challenging to condition the synthesis at a fine-grained level, for instance, synthesizing image pixels following some generic color pattern. Existing image synthesis methods often produce contents that fall outside the desired pixel conditions. To address this, we introduce a novel color alignment algorithm that confines the generative process in diffusion models within a given color pattern. Specifically, we project diffusion terms, either imagery samples or latent representations, into a conditional color space to align with the input color distribution. This strategy simplifies the prediction in diffusion models within a color manifold while still allowing plausible structures in generated contents, thus enabling the generation of diverse contents that comply with the target color pattern. Experimental results demonstrate our state-of-the-art performance in conditioning and controlling of color pixels, while maintaining on-par generation quality and diversity in comparison with regular diffusion models.

183. 【2503.06744】CoDa-4DGS: Dynamic Gaussian Splatting with Context and Deformation Awareness for Autonomous Driving

链接：https://arxiv.org/abs/2503.06744

作者：Rui Song,Chenwei Liang,Yan Xia,Walter Zimmer,Hu Cao,Holger Caesar,Andreas Festag,Alois Knoll

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Dynamic scene rendering, scene rendering opens, Dynamic scene, enabling closed-loop simulations, photorealistic data

备注：

点击查看摘要

Abstract:Dynamic scene rendering opens new avenues in autonomous driving by enabling closed-loop simulations with photorealistic data, which is crucial for validating end-to-end algorithms. However, the complex and highly dynamic nature of traffic environments presents significant challenges in accurately rendering these scenes. In this paper, we introduce a novel 4D Gaussian Splatting (4DGS) approach, which incorporates context and temporal deformation awareness to improve dynamic scene rendering. Specifically, we employ a 2D semantic segmentation foundation model to self-supervise the 4D semantic features of Gaussians, ensuring meaningful contextual embedding. Simultaneously, we track the temporal deformation of each Gaussian across adjacent frames. By aggregating and encoding both semantic and temporal deformation features, each Gaussian is equipped with cues for potential deformation compensation within 3D space, facilitating a more precise representation of dynamic scenes. Experimental results show that our method improves 4DGS's ability to capture fine details in dynamic scene rendering for autonomous driving and outperforms other self-supervised methods in 4D reconstruction and novel view synthesis. Furthermore, CoDa-4DGS deforms semantic features with each Gaussian, enabling broader applications.

184. 【2503.06740】D3DR: Lighting-Aware Object Insertion in Gaussian Splatting

链接：https://arxiv.org/abs/2503.06740

作者：Vsevolod Skorokhodov,Nikita Durasov,Pascal Fua

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Computer Vision tasks, Computer Vision, Vision tasks, Gaussian Splatting, dynamic scene rendering

备注：

点击查看摘要

Abstract:Gaussian Splatting has become a popular technique for various 3D Computer Vision tasks, including novel view synthesis, scene reconstruction, and dynamic scene rendering. However, the challenge of natural-looking object insertion, where the object's appearance seamlessly matches the scene, remains unsolved. In this work, we propose a method, dubbed D3DR, for inserting a 3DGS-parametrized object into 3DGS scenes while correcting its lighting, shadows, and other visual artifacts to ensure consistency, a problem that has not been successfully addressed before. We leverage advances in diffusion models, which, trained on real-world data, implicitly understand correct scene lighting. After inserting the object, we optimize a diffusion-based Delta Denoising Score (DDS)-inspired objective to adjust its 3D Gaussian parameters for proper lighting correction. Utilizing diffusion model personalization techniques to improve optimization quality, our approach ensures seamless object insertion and natural appearance. Finally, we demonstrate the method's effectiveness by comparing it to existing approaches, achieving 0.5 PSNR and 0.15 SSIM improvements in relighting quality.

185. 【2503.06717】Continuous Online Adaptation Driven by User Interaction for Medical Image Segmentation

链接：https://arxiv.org/abs/2503.06717

作者：Wentian Xu,Ziyun Liang,Harry Anthony,Yasin Ibrahim,Felix Cohen,Guang Yang,Daniel Whitehouse,David Menon,Virginia Newcombe,Konstantinos Kamnitsas

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：real-time user interactions, Interactive segmentation models, extra inputs, inputs to dynamically, dynamically refine

备注：

点击查看摘要

Abstract:Interactive segmentation models use real-time user interactions, such as mouse clicks, as extra inputs to dynamically refine the model predictions. After model deployment, user corrections of model predictions could be used to adapt the model to the post-deployment data distribution, countering distribution-shift and enhancing reliability. Motivated by this, we introduce an online adaptation framework that enables an interactive segmentation model to continuously learn from user interaction and improve its performance on new data distributions, as it processes a sequence of test images. We introduce the Gaussian Point Loss function to teach the model to leverage user clicks, along with a two-stage online optimization method that adapts the model using the corrected predictions generated via user interactions. We demonstrate that this simple and therefore practical approach is very effective. Experiments on 5 fundus and 4 brain MRI databases demonstrate that our method outperforms existing approaches under various data distribution shifts, including segmentation of image modalities and pathologies not seen during training.

186. 【2503.06700】MemorySAM: Memorize Modalities and Semantics with Segment Anything Model 2 for Multi-modal Semantic Segmentation

链接：https://arxiv.org/abs/2503.06700

作者：Chenfei Liao,Xu Zheng,Yuanhuiyi Lyu,Haiwei Xue,Yihong Cao,Jiawen Wang,Kailun Yang,Xuming Hu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：multiple visual modalities, visual modalities captured, Research has focused, diverse sensors, pixel-wise predictions

备注：

点击查看摘要

Abstract:Research has focused on Multi-Modal Semantic Segmentation (MMSS), where pixel-wise predictions are derived from multiple visual modalities captured by diverse sensors. Recently, the large vision model, Segment Anything Model 2 (SAM2), has shown strong zero-shot segmentation performance on both images and videos. When extending SAM2 to MMSS, two issues arise: 1. How can SAM2 be adapted to multi-modal data? 2. How can SAM2 better understand semantics? Inspired by cross-frame correlation in videos, we propose to treat multi-modal data as a sequence of frames representing the same scene. Our key idea is to "memorize" the modality-agnostic information and "memorize" the semantics related to the targeted scene. To achieve this, we apply SAM2's memory mechanisms across multi-modal data to capture modality-agnostic features. Meanwhile, to memorize the semantic knowledge, we propose a training-only Semantic Prototype Memory Module (SPMM) to store category-level prototypes across training for facilitating SAM2's transition from instance to semantic segmentation. A prototypical adaptation loss is imposed between global and local prototypes iteratively to align and refine SAM2's semantic understanding. Extensive experimental results demonstrate that our proposed MemorySAM outperforms SoTA methods by large margins on both synthetic and real-world benchmarks (65.38% on DELIVER, 52.88% on MCubeS). Source code will be made publicly available.

187. 【2503.06699】Unsupervised Multi-Clustering and Decision-Making Strategies for 4D-STEM Orientation Mapping

链接：https://arxiv.org/abs/2503.06699

作者：Junhao Cao,Nicolas Folastre,Gozde Oney,Edgar Rauch,Stavros Nicolopoulos,Partha Pratim Das,Arnaud Demortière

类目：Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词：non-negative matrix factorization, Image Quality Assessment, primary clustering method, study presents, integration of unsupervised

备注： 32 pages, 5 figures, 5 figures in SI

点击查看摘要

Abstract:This study presents a novel integration of unsupervised learning and decision-making strategies for the advanced analysis of 4D-STEM datasets, with a focus on non-negative matrix factorization (NMF) as the primary clustering method. Our approach introduces a systematic framework to determine the optimal number of components (k) required for robust and interpretable orientation mapping. By leveraging the K-Component Loss method and Image Quality Assessment (IQA) metrics, we effectively balance reconstruction fidelity and model complexity. Additionally, we highlight the critical role of dataset preprocessing in improving clustering stability and accuracy. Furthermore, our spatial weight matrix analysis provides insights into overlapping regions within the dataset by employing threshold-based visualization, facilitating a detailed understanding of cluster interactions. The results demonstrate the potential of combining NMF with advanced IQA metrics and preprocessing techniques for reliable orientation mapping and structural analysis in 4D-STEM datasets, paving the way for future applications in multi-dimensional material characterization.

188. 【2503.06698】What's in a Latent? Leveraging Diffusion Latent Space for Domain Generalization

链接：https://arxiv.org/abs/2503.06698

作者：Xavier Thomas,Deepti Ghadiyaram

类目：Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

关键词：unseen data distributions, Domain Generalization aims, data distributions, aims to develop, Generalization aims

备注：

点击查看摘要

Abstract:Domain Generalization aims to develop models that can generalize to novel and unseen data distributions. In this work, we study how model architectures and pre-training objectives impact feature richness and propose a method to effectively leverage them for domain generalization. Specifically, given a pre-trained feature space, we first discover latent domain structures, referred to as pseudo-domains, that capture domain-specific variations in an unsupervised manner. Next, we augment existing classifiers with these complementary pseudo-domain representations making them more amenable to diverse unseen test domains. We analyze how different pre-training feature spaces differ in the domain-specific variances they capture. Our empirical studies reveal that features from diffusion models excel at separating domains in the absence of explicit domain labels and capture nuanced domain-specific information. On 5 datasets, we show that our very simple framework improves generalization to unseen domains by a maximum test accuracy improvement of over 4% compared to the standard baseline Empirical Risk Minimization (ERM). Crucially, our method outperforms most algorithms that access domain labels during training.

189. 【2503.06685】Asymmetric Decision-Making in Online Knowledge Distillation: Unifying Consensus and Divergence

链接：https://arxiv.org/abs/2503.06685

作者：Zhaowei Chen,Borui Zhao,Yuchen Ge,Yuhao Chen,Renjie Song,Jiajun Liang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：compact student network, pretrained teacher network, models, teacher models, student models

备注：

点击查看摘要

Abstract:Online Knowledge Distillation (OKD) methods streamline the distillation training process into a single stage, eliminating the need for knowledge transfer from a pretrained teacher network to a more compact student network. This paper presents an innovative approach to leverage intermediate spatial representations. Our analysis of the intermediate features from both teacher and student models reveals two pivotal insights: (1) the similar features between students and teachers are predominantly focused on foreground objects; (2) teacher models emphasize foreground objects more than students. Building on these findings, we propose Asymmetric Decision-Making (ADM) to enhance feature consensus learning for student models while continuously promoting feature diversity in teacher models. Specifically, Consensus Learning for student models prioritizes spatial features with high consensus relative to teacher models. Conversely, Divergence Learning for teacher models highlights spatial features with lower similarity compared to student models, indicating superior performance by teacher models in these regions. Consequently, ADM helps the student models catch up with the feature learning process of the teacher models. Extensive experiments demonstrate that ADM consistently surpasses existing OKD methods across various online knowledge distillation settings and also achieves superior results when applied to offline knowledge distillation, semantic segmentation and diffusion distillation tasks.

190. 【2503.06684】PixelPonder: Dynamic Patch Adaptation for Enhanced Multi-Conditional Text-to-Image Generation

链接：https://arxiv.org/abs/2503.06684

作者：Yanjie Pan,Qingdong He,Zhengkai Jiang,Pengcheng Xu,Chaoyi Wang,Jinlong Peng,Haoxuan Wang,Yun Cao,Zhenye Gan,Mingmin Chi,Bo Peng,Yabiao Wang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：demonstrated promising results, Recent advances, advances in diffusion-based, demonstrated promising, promising results

备注：

点击查看摘要

Abstract:Recent advances in diffusion-based text-to-image generation have demonstrated promising results through visual condition control. However, existing ControlNet-like methods struggle with compositional visual conditioning, that is, simultaneously preserving semantic fidelity across multiple heterogeneous control signals while maintaining high visual quality. They employ separate control branches that often introduce conflicting guidance during the denoising process, leading to structural distortions and artifacts in generated images. To address this issue, we present PixelPonder, a novel unified control framework, which allows for effective control of multiple visual conditions under a single control structure. Specifically, we design a patch-level adaptive condition selection mechanism that dynamically prioritizes spatially relevant control signals at the sub-region level, enabling precise local guidance without global interference. Additionally, a time-aware control injection scheme is deployed to modulate condition influence according to denoising timesteps, progressively transitioning from structural preservation to texture refinement and fully utilizing the control information from different categories to promote more harmonious image generation. Extensive experiments demonstrate that PixelPonder surpasses previous methods across different benchmark datasets, showing superior improvement in spatial alignment accuracy while maintaining high textual semantic consistency.

191. 【2503.06683】Dynamic Dictionary Learning for Remote Sensing Image Segmentation

链接：https://arxiv.org/abs/2503.06683

作者：Xuechao Zou,Yue Li,Shun Zhang,Kai Li,Shiying Wang,Pin Tao,Junliang Xing,Congyan Lang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：diverse scene variations, Remote sensing image, segmentation faces persistent, faces persistent challenges, distinguishing morphologically similar

备注：

点击查看摘要

Abstract:Remote sensing image segmentation faces persistent challenges in distinguishing morphologically similar categories and adapting to diverse scene variations. While existing methods rely on implicit representation learning paradigms, they often fail to dynamically adjust semantic embeddings according to contextual cues, leading to suboptimal performance in fine-grained scenarios such as cloud thickness differentiation. This work introduces a dynamic dictionary learning framework that explicitly models class ID embeddings through iterative refinement. The core contribution lies in a novel dictionary construction mechanism, where class-aware semantic embeddings are progressively updated via multi-stage alternating cross-attention querying between image features and dictionary embeddings. This process enables adaptive representation learning tailored to input-specific characteristics, effectively resolving ambiguities in intra-class heterogeneity and inter-class homogeneity. To further enhance discriminability, a contrastive constraint is applied to the dictionary space, ensuring compact intra-class distributions while maximizing inter-class separability. Extensive experiments across both coarse- and fine-grained datasets demonstrate consistent improvements over state-of-the-art methods, particularly in two online test benchmarks (LoveDA and UAVid). Code is available at this https URL.

192. 【2503.06678】Gamma: Toward Generic Image Assessment with Mixture of Assessment Experts

链接：https://arxiv.org/abs/2503.06678

作者：Hantao Zhou,Rui Yang,Longxiang Tang,Guanyi Qin,Yan Zhang,Runze Hu,Xiu Li

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：natural and AIGC, AIGC scenes, Image assessment, aims to evaluate

备注：

点击查看摘要

Abstract:Image assessment aims to evaluate the quality and aesthetics of images and has been applied across various scenarios, such as natural and AIGC scenes. Existing methods mostly address these sub-tasks or scenes individually. While some works attempt to develop unified image assessment models, they have struggled to achieve satisfactory performance or cover a broad spectrum of assessment scenarios. In this paper, we present Gamma, a Generic imAge assessMent model using Mixture of Assessment Experts, which can effectively assess images from diverse scenes through mixed-dataset training. Achieving unified training in image assessment presents significant challenges due to annotation biases across different datasets. To address this issue, we first propose a Mixture of Assessment Experts (MoAE) module, which employs shared and adaptive experts to dynamically learn common and specific knowledge for different datasets, respectively. In addition, we introduce a Scene-based Differential Prompt (SDP) strategy, which uses scene-specific prompts to provide prior knowledge and guidance during the learning process, further boosting adaptation for various scenes. Our Gamma model is trained and evaluated on 12 datasets spanning 6 image assessment scenarios. Extensive experiments show that our unified Gamma outperforms other state-of-the-art mixed-training methods by significant margins while covering more scenes. Code: this https URL.

193. 【2503.06677】REArtGS: Reconstructing and Generating Articulated Objects via 3D Gaussian Splatting with Geometric and Motion Constraints

链接：https://arxiv.org/abs/2503.06677

作者：Di Wu,Liu Liu,Zhou Linli,Anran Huang,Liangtu Song,Qiaojun Yu,Qi Wu,Cewu Lu

类目：Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词：representations play crucial, play crucial roles, Articulated objects, textured surface reconstruction, human life

备注： 11 pages, 6 figures

点击查看摘要

Abstract:Articulated objects are prevalent in human life, and their 3D representations play crucial roles across various applications. However, achieving both high-fidelity textured surface reconstruction and dynamic generation for articulated objects remains challenging for existing methods. In this paper, we present REArtGS, a novel framework that introduces additional geometric and motion constraints to 3D Gaussian primitives, enabling high-quality textured surface reconstruction and generation for articulated objects. Specifically, given multi-view RGB images of two arbitrary states of an articulated object, we first introduce an unbiased Signed Distance Field (SDF) guidance to regularize Gaussian opacity fields, enhancing geometry constraints and improving surface reconstruction quality. Then we establish deformable fields for 3D Gaussians constrained by the kinematic structures of articulated objects, achieving unsupervised generation of surface meshes in unseen states. Extensive experiments on both synthetic and real datasets demonstrate that our approach achieves high-quality textured surface reconstruction for given states, and enables high-fidelity surface generation for unseen states. Code will be released within the next four months.

194. 【2503.06676】Seeing Delta Parameters as JPEG Images: Data-Free Delta Compression with Discrete Cosine Transform

链接：https://arxiv.org/abs/2503.06676

作者：Chenyu Huang,Peng Ye,Xiaohui Wang,Shenghe Zheng,Biqing Qi,Lei Bai,Wanli Ouyang,Tao Chen

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

关键词：pose critical challenges, multiple tasks pose, tasks pose critical, individual finetuned models, paradigm becoming mainstream

备注： 15 pages, 7 figures

点击查看摘要

Abstract:With transformer-based models and the pretrain-finetune paradigm becoming mainstream, the high storage and deployment costs of individual finetuned models on multiple tasks pose critical challenges. Delta compression attempts to lower the costs by reducing the redundancy of delta parameters (i.e., the difference between the finetuned and pre-trained model weights). However, existing methods usually face problems including data accessibility and training requirements. To tackle this issue, we introduce Delta-DCT, the first data-free delta compression method inspired by classic JPEG image compression, leveraging the Discrete Cosine Transform (DCT). We first (a) group delta parameters within a layer into patches. Then we (b) assess the importance of each patch and allocate them with different quantization bit-widths. Afterwards, we (c) convert these patches to the DCT domain and conduct quantization to each patch based on the allocated bit-width. The proposed Delta-DCT does not require any training or data calibration, while achieving performance comparable to or even surpassing original finetuned models under 1-bit equivalent delta compression ratios on different kinds of models including: (1) recently-released LLMs of different sizes from 7B to 13B, (2) relatively smaller language models including RoBERTa and T5 models, (3) variants of vision transformer models, and (4) multi-modal BEiT-3 models.
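
The three steps (a) to (c) in the abstract can be sketched roughly as follows. This is a simplified, pure-Python illustration under several assumptions that are not taken from the paper: weights are treated as a flat list, patches are one-dimensional, patch importance is the L2 norm, bit-widths are assigned by comparing against the median norm, and quantization is a plain uniform quantizer. The names `compress_delta`, `high_bits`, and `low_bits` are hypothetical.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(
            v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(x)
        ))
    return out


def idct(coeffs):
    """Inverse of the orthonormal DCT-II above."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
            total += scale * c * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(total)
    return out


def quantize(values, bits):
    """Uniform quantizer with 2**bits levels over [-peak, peak]."""
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    levels = 2 ** bits
    step = 2.0 * peak / levels
    out = []
    for v in values:
        idx = min(levels - 1, int((v + peak) / step))
        out.append(-peak + (idx + 0.5) * step)  # reconstruct at bin center
    return out


def compress_delta(pretrained, finetuned, patch_size=4, high_bits=8, low_bits=2):
    """Return approximate finetuned weights rebuilt from a quantized delta."""
    delta = [f - p for f, p in zip(finetuned, pretrained)]
    # (a) group delta parameters into patches
    patches = [delta[i:i + patch_size] for i in range(0, len(delta), patch_size)]
    # (b) score each patch (assumed: L2 norm) and pick a bit-width for it
    norms = [math.sqrt(sum(v * v for v in patch)) for patch in patches]
    median = sorted(norms)[len(norms) // 2]
    restored = []
    for patch, norm in zip(patches, norms):
        bits = high_bits if norm >= median else low_bits
        # (c) quantize in the DCT domain, then transform back
        restored.extend(idct(quantize(dct(patch), bits)))
    return [p + d for p, d in zip(pretrained, restored)]
```

Nothing here needs training data or calibration, which is the property the abstract emphasizes; in practice only the quantization indices and per-patch scale would be stored, a step this sketch omits for brevity.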

195. 【2503.06674】Learning Few-Step Diffusion Models by Trajectory Distribution Matching

链接：https://arxiv.org/abs/2503.06674

作者：Yihong Luo,Tianyang Hu,Jiacheng Sun,Yujun Cai,Jing Tang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：efficient AIGC deployment, AIGC deployment, efficient AIGC, Accelerating diffusion model, Accelerating diffusion

备注： Project page: [this https URL](https://tdm-t2x.github.io/)

点击查看摘要

Abstract:Accelerating diffusion model sampling is crucial for efficient AIGC deployment. While diffusion distillation methods -- based on distribution matching and trajectory matching -- reduce sampling to as few as one step, they fall short on complex tasks like text-to-image generation. Few-step generation offers a better balance between speed and quality, but existing approaches face a persistent trade-off: distribution matching lacks flexibility for multi-step sampling, while trajectory matching often yields suboptimal image quality. To bridge this gap, we propose learning few-step diffusion models by Trajectory Distribution Matching (TDM), a unified distillation paradigm that combines the strengths of distribution and trajectory matching. Our method introduces a data-free score distillation objective, aligning the student's trajectory with the teacher's at the distribution level. Further, we develop a sampling-steps-aware objective that decouples learning targets across different steps, enabling more adjustable sampling. This approach supports both deterministic sampling for superior image quality and flexible multi-step adaptation, achieving state-of-the-art performance with remarkable efficiency. Our model, TDM, outperforms existing methods on various backbones, such as SDXL and PixArt-$\alpha$, delivering superior quality and significantly reduced training costs. In particular, our method distills PixArt-$\alpha$ into a 4-step generator that outperforms its teacher on real user preference at 1024 resolution. This is accomplished with 500 iterations and 2 A800 hours -- a mere 0.01% of the teacher's training cost. In addition, our proposed TDM can be extended to accelerate text-to-video diffusion. Notably, TDM can outperform its teacher model (CogVideoX-2B) by using only 4 NFE on VBench, improving the total score from 80.91 to 81.65. Project page: this https URL

196. 【2503.06671】Emulating Self-attention with Convolution for Efficient Image Super-Resolution

Link: https://arxiv.org/abs/2503.06671

Authors: Dongheon Lee, Seokju Yun, Youngmin Ro

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: high computational overhead, lightweight image super-resolution, times, image super-resolution, tackle the high

Comments:

Abstract: In this paper, we tackle the high computational overhead of transformers for lightweight image super-resolution (SR). Motivated by the observations of self-attention's inter-layer repetition, we introduce a convolutionized self-attention module named Convolutional Attention (ConvAttn) that emulates self-attention's long-range modeling capability and instance-dependent weighting with a single shared large kernel and dynamic kernels. By utilizing the ConvAttn module, we significantly reduce the reliance on self-attention and its involved memory-bound operations while maintaining the representational capability of transformers. Furthermore, we overcome the challenge of integrating flash attention into the lightweight SR regime, effectively mitigating self-attention's inherent memory bottleneck. We scale up window size to 32$\times$32 with flash attention rather than proposing an intricate self-attention module, significantly improving PSNR by 0.31 dB on Urban100$\times$2 while reducing latency and memory usage by 16$\times$ and 12.2$\times$. Building on these approaches, our proposed network, termed Emulating Self-attention with Convolution (ESC), notably improves PSNR by 0.27 dB on Urban100$\times$4 compared to HiT-SRF, reducing the latency and memory usage by 3.7$\times$ and 6.2$\times$, respectively. Extensive experiments demonstrate that our ESC maintains the ability for long-range modeling, data scalability, and the representational power of transformers despite most self-attentions being replaced by the ConvAttn module.
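
The PSNR gains quoted in this abstract are differences on a logarithmic scale. As a reminder of how the metric is computed, here is a minimal sketch assuming 8-bit pixel values flattened into plain lists. The helper name `psnr` and the toy pixel values are illustrative only and do not come from the paper.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy data (made up): a reference patch and a slightly different reconstruction.
reference = [52, 55, 61, 59, 79, 61, 76, 61]
restored = [54, 55, 60, 59, 77, 62, 76, 60]
print(f"PSNR: {psnr(reference, restored):.2f} dB")
```

Because PSNR is 10·log10(MAX²/MSE), a gain of 0.31 dB corresponds to the mean squared error shrinking by a factor of 10^(-0.031), which is about 0.93.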

197. 【2503.06670】Attention, Please! PixelSHAP Reveals What Vision-Language Models Actually Focus On

Link: https://arxiv.org/abs/2503.06670

Authors: Roni Goldshmidt

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: crucial for trust, high-stakes applications, decision-making in high-stakes, Vision-Language Models, framework extending Shapley-based

Comments:

198. 【2503.06669】AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

Link: https://arxiv.org/abs/2503.06669

Authors: AgiBot-World-Contributors, Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xu Huang, Shu Jiang, Yuxin Jiang, Cheng Jing, Hongyang Li, Jialu Li, Chiming Liu, Yi Liu, Yuxiang Lu, Jianlan Luo, Ping Luo, Yao Mu, Yuehan Niu, Yixuan Pan, Jiangmiao Pang, Yu Qiao, Guanghui Ren, Cheng Ruan, Jiaqi Shan, Yongjian Shen, Chengshi Shi, Mingkang Shi, Modi Shi, Chonghao Sima, Jianheng Song, Huijie Wang, Wenhao Wang, Dafeng Wei, Chengen Xie, Guo Xu, Junchi Yan, Cunbiao Yang, Lei Yang, Shukai Yang, Maoqing Yao, Jia Zeng, Chi Zhang, Qinglin Zhang, Bin Zhao, Chengyue Zhao, Jiaqi Zhao, Jianchao Zhu

Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: generalized robotic manipulation, address real-world challenges, robotic manipulation, challenges for generalized, generalized robotic

Comments: Project website: [this https URL](https://agibot-world.com/), Code: [this https URL](https://github.com/OpenDriveLab/AgiBot-World)

Abstract:We explore how scalable robot data can address real-world challenges for generalized robotic manipulation. Introducing AgiBot World, a large-scale platform comprising over 1 million trajectories across 217 tasks in five deployment scenarios, we achieve an order-of-magnitude increase in data scale compared to existing datasets. Accelerated by a standardized collection pipeline with human-in-the-loop verification, AgiBot World guarantees high-quality and diverse data distribution. It is extensible from grippers to dexterous hands and visuo-tactile sensors for fine-grained skill acquisition. Building on top of data, we introduce Genie Operator-1 (GO-1), a novel generalist policy that leverages latent action representations to maximize data utilization, demonstrating predictable performance scaling with increased data volume. Policies pre-trained on our dataset achieve an average performance improvement of 30% over those trained on Open X-Embodiment, both in in-domain and out-of-distribution scenarios. GO-1 exhibits exceptional capability in real-world dexterous and long-horizon tasks, achieving over 60% success rate on complex tasks and outperforming prior RDT approach by 32%. By open-sourcing the dataset, tools, and models, we aim to democratize access to large-scale, high-quality robot data, advancing the pursuit of scalable and general-purpose intelligence.

199. 【2503.06661】AA-CLIP: Enhancing Zero-shot Anomaly Detection via Anomaly-Aware CLIP

Link: https://arxiv.org/abs/2503.06661

Authors: Wenxin Ma, Xu Zhang, Qingsong Yao, Fenghe Tang, Chenxu Wu, Yingtai Li, Rui Yan, Zihang Jiang, S. Kevin Zhou

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: lesion detection, identifies outliers, defect and lesion, Anomaly detection, CLIP

Comments: 8 pages, 7 figures

Abstract:Anomaly detection (AD) identifies outliers for applications like defect and lesion detection. While CLIP shows promise for zero-shot AD tasks due to its strong generalization capabilities, its inherent Anomaly-Unawareness leads to limited discrimination between normal and abnormal features. To address this problem, we propose Anomaly-Aware CLIP (AA-CLIP), which enhances CLIP's anomaly discrimination ability in both text and visual spaces while preserving its generalization capability. AA-CLIP is achieved through a straightforward yet effective two-stage approach: it first creates anomaly-aware text anchors to differentiate normal and abnormal semantics clearly, then aligns patch-level visual features with these anchors for precise anomaly localization. This two-stage strategy, with the help of residual adapters, gradually adapts CLIP in a controlled manner, achieving effective AD while maintaining CLIP's class knowledge. Extensive experiments validate AA-CLIP as a resource-efficient solution for zero-shot AD tasks, achieving state-of-the-art results in industrial and medical applications. The code is available at this https URL.

200. 【2503.06660】AxisPose: Model-Free Matching-Free Single-Shot 6D Object Pose Estimation via Axis Generation

Link: https://arxiv.org/abs/2503.06660

Authors: Yang Zou, Zhaoshuai Qi, Yating Liu, Zihao Xu, Weipeng Sun, Weiyi Liu, Xingyuan Li, Jiaqi Yang, Yanning Zhang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: augmented reality, role in robotics, autonomous driving, computer vision, plays a vital

Comments:

Abstract:Object pose estimation, which plays a vital role in robotics, augmented reality, and autonomous driving, has been of great interest in computer vision. Existing studies either require multi-stage pose regression or rely on 2D-3D feature matching. Though these approaches have shown promising results, they rely heavily on appearance information, requiring complex input (i.e., multi-view reference input, depth, or CAD models) and intricate pipeline (i.e., feature extraction-SfM-2D to 3D matching-PnP). We propose AxisPose, a model-free, matching-free, single-shot solution for robust 6D pose estimation, which fundamentally diverges from the existing paradigm. Unlike existing methods that rely on 2D-3D or 2D-2D matching using 3D techniques, such as SfM and PnP, AxisPose directly infers a robust 6D pose from a single view by leveraging a diffusion model to learn the latent axis distribution of objects without reference views. Specifically, AxisPose constructs an Axis Generation Module (AGM) to capture the latent geometric distribution of object axes through a diffusion model. The diffusion process is guided by injecting the gradient of geometric consistency loss into the noise estimation to maintain the geometric consistency of the generated tri-axis. With the generated tri-axis projection, AxisPose further adopts a Triaxial Back-projection Module (TBM) to recover the 6D pose from the object tri-axis. The proposed AxisPose achieves robust performance at the cross-instance level (i.e., one model for N instances) using only a single view as input without reference images, with great potential for generalization to unseen-object level.

201. 【2503.06652】Adding Additional Control to One-Step Diffusion with Joint Distribution Matching

Link: https://arxiv.org/abs/2503.06652

Authors: Yihong Luo, Tianyang Hu, Yifan Song, Jiacheng Sun, Zhenguo Li, Jing Tang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Variational Score Distillation, Variational Score, latest user preferences, adapting distilled models, remains challenging

Comments:

Abstract: While diffusion distillation has enabled one-step generation through methods like Variational Score Distillation, adapting distilled models to emerging new controls -- such as novel structural constraints or latest user preferences -- remains challenging. Conventional approaches typically require modifying the base diffusion model and redistilling it -- a process that is both computationally intensive and time-consuming. To address these challenges, we introduce Joint Distribution Matching (JDM), a novel approach that minimizes the reverse KL divergence between image-condition joint distributions. By deriving a tractable upper bound, JDM decouples fidelity learning from condition learning. This asymmetric distillation scheme enables our one-step student to handle controls unknown to the teacher model and facilitates improved classifier-free guidance (CFG) usage and seamless integration of human feedback learning (HFL). Experimental results demonstrate that JDM surpasses baseline methods such as multi-step ControlNet with only a single step in most cases, while achieving state-of-the-art performance in one-step text-to-image synthesis through improved usage of CFG or HFL integration.
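
The abstract describes JDM as minimizing a reverse KL divergence. The sketch below only illustrates what "reverse" means on toy discrete distributions: the expectation is taken under the student (generator) distribution rather than the teacher's. The distributions and the helper name `kl_divergence` are made up for illustration; this is not the JDM objective or its upper bound.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # terms with zero mass under p contribute nothing
        if qi == 0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total

# Hypothetical teacher with two modes and a student that covers only one of them.
teacher = [0.49, 0.49, 0.02]
student = [0.98, 0.01, 0.01]

forward_kl = kl_divergence(teacher, student)  # expectation under the teacher
reverse_kl = kl_divergence(student, teacher)  # expectation under the student
print(f"forward KL(teacher || student) = {forward_kl:.3f}")
print(f"reverse KL(student || teacher) = {reverse_kl:.3f}")
```

In this toy case the reverse direction penalizes the mode-dropping student less than the forward direction does, which is the usual "mode-seeking" intuition for reverse KL.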

202. 【2503.06647】Personalized Class Incremental Context-Aware Food Classification for Food Intake Monitoring Systems

Link: https://arxiv.org/abs/2503.06647

Authors: Hassan Kazemi Tehrani, Jun Cai, Abbas Yekanlou, Sylvia Santosa

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)

Keywords: preventing nutrition-related diseases, Accurate food intake, Accurate food, food, food intake monitoring

Comments:

Abstract:Accurate food intake monitoring is crucial for maintaining a healthy diet and preventing nutrition-related diseases. With the diverse range of foods consumed across various cultures, classic food classification models have limitations due to their reliance on fixed-sized food datasets. Studies show that people consume only a small range of foods across the existing ones, each consuming a unique set of foods. Existing class-incremental models have low accuracy for the new classes and lack personalization. This paper introduces a personalized, class-incremental food classification model designed to overcome these challenges and improve the performance of food intake monitoring systems. Our approach adapts itself to the new array of food classes, maintaining applicability and accuracy, both for new and existing classes by using personalization. Our model's primary focus is personalization, which improves classification accuracy by prioritizing a subset of foods based on an individual's eating habits, including meal frequency, times, and locations. A modified version of DSN is utilized to expand on the appearance of new food classes. Additionally, we propose a comprehensive framework that integrates this model into a food intake monitoring system. This system analyzes meal images provided by users, makes use of a smart scale to estimate food weight, utilizes a nutrient content database to calculate the amount of each macro-nutrient, and creates a dietary user profile through a mobile application. Finally, experimental evaluations on two new benchmark datasets FOOD101-Personal and VFN-Personal, personalized versions of well-known datasets for food classification, are conducted to demonstrate the effectiveness of our model in improving the classification accuracy of both new and existing classes, addressing the limitations of both conventional and class-incremental food classification models.

203. 【2503.06641】CLICv2: Image Complexity Representation via Content Invariance Contrastive Learning

Link: https://arxiv.org/abs/2503.06641

Authors: Shipeng Liu, Liang Zhao, Dengfeng Chen

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: selection and sensitivity, positive sample selection, complexity representation, image, positive

Comments:

Abstract: Unsupervised image complexity representation often suffers from bias in positive sample selection and sensitivity to image content. We propose CLICv2, a contrastive learning framework that enforces content invariance for complexity representation. Unlike CLIC, which generates positive samples via cropping and thereby introduces positive-pair bias, our shifted patchify method applies randomized directional shifts to image patches before contrastive learning. Patches at corresponding positions serve as positive pairs, ensuring content-invariant learning. Additionally, we propose patch-wise contrastive loss, which enhances local complexity representation while mitigating content interference. In order to further suppress the interference of image content, we introduce Masked Image Modeling as an auxiliary task, but we set its modeling objective as the entropy of masked patches, which recovers the entropy of the overall image by using the information of the unmasked patches, and then obtains the global complexity perception ability. Extensive experiments on IC9600 demonstrate that CLICv2 significantly outperforms existing unsupervised methods in PCC and SRCC, achieving content-invariant complexity representation without introducing positive pairs bias.
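
The two evaluation metrics named in this abstract, PCC (Pearson correlation coefficient) and SRCC (Spearman rank correlation coefficient), can be sketched in a few lines. This is a generic illustration using the standard definitions, with made-up scores; the function names are hypothetical and nothing here reproduces the paper's evaluation code.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

# Made-up predicted complexity scores and human annotations.
predicted = [0.12, 0.35, 0.50, 0.71, 0.90]
annotated = [0.10, 0.30, 0.62, 0.58, 0.95]
print(f"PCC  = {pearson(predicted, annotated):.3f}")
print(f"SRCC = {spearman(predicted, annotated):.3f}")
```

PCC rewards a linear relationship between predictions and annotations, while SRCC only checks that the ordering agrees, so a monotonic but nonlinear predictor can reach an SRCC of 1 with a PCC below 1.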

204. 【2503.06637】CLAD: Constrained Latent Action Diffusion for Vision-Language Procedure Planning

Link: https://arxiv.org/abs/2503.06637

Authors: Lei Shi, Andreas Bulling

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Constrained Latent Action, propose CLAD, vision-language procedure planning, Constrained Latent, procedure planning

Comments:

Abstract:We propose CLAD -- a Constrained Latent Action Diffusion model for vision-language procedure planning in instructional videos. Procedure planning is the challenging task of predicting intermediate actions given a visual observation of a start and a goal state. However, future interactive AI systems must also be able to plan procedures using multi-modal input, e.g., where visual observations are augmented with language descriptions. To tackle this vision-language procedure planning task, our method uses a Variational Autoencoder (VAE) to learn the latent representation of actions and observations as constraints and integrate them into the diffusion process. This approach exploits that the latent space of diffusion models already has semantics that can be used. We use the latent constraints to steer the diffusion model to better generate actions. We report extensive experiments on the popular CrossTask, Coin, and NIV datasets and show that our method outperforms state-of-the-art methods by a large margin. By evaluating ablated versions of our method, we further show that the proposed integration of the action and observation representations learnt in the VAE latent space is key to these performance improvements.

205. 【2503.06632】Towards More Accurate Personalized Image Generation: Addressing Overfitting and Evaluation Bias

Link: https://arxiv.org/abs/2503.06632

Authors: Mingxiao Li, Tingyu Qu, Tinne Tuytelaars, Marie-Francine Moens

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: customized visual content, improve daily life, Personalized image generation, visual content, great potential

Comments: 18

Abstract: Personalized image generation via text prompts has great potential to improve daily life and professional work by facilitating the creation of customized visual content. The aim of image personalization is to create images based on a user-provided subject while maintaining both consistency of the subject and flexibility to accommodate various textual descriptions of that subject. However, current methods face challenges in ensuring fidelity to the text prompt while not overfitting to the training data. In this work, we introduce a novel training pipeline that incorporates an attractor to filter out distractions in training images, allowing the model to focus on learning an effective representation of the personalized subject. Moreover, current evaluation methods struggle due to the lack of a dedicated test set. The evaluation set-up typically relies on the training data of the personalization task to compute text-image and image-image similarity scores, which, while useful, tend to overestimate performance. Although human evaluations are commonly used as an alternative, they often suffer from bias and inconsistency. To address these issues, we curate a diverse and high-quality test set with well-designed prompts. With this new benchmark, automatic evaluation metrics can reliably assess model performance.

206. 【2503.06626】DiffCLIP: Differential Attention Meets CLIP

Link: https://arxiv.org/abs/2503.06626

Authors: Hasan Abed Al Kader Hammoud, Bernard Ghanem

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: CLIP architectures, differential attention, Abstract, CLIP, differential

Comments: Under review

Abstract:We propose DiffCLIP, a novel vision-language model that extends the differential attention mechanism to CLIP architectures. Differential attention was originally developed for large language models to amplify relevant context while canceling out noisy information. In this work, we integrate this mechanism into CLIP's dual encoder (image and text) framework. With minimal additional parameters, DiffCLIP achieves superior performance on image-text understanding tasks. Across zero-shot classification, retrieval, and robustness benchmarks, DiffCLIP consistently outperforms baseline CLIP models. Notably, these gains come with negligible computational overhead, demonstrating that differential attention can significantly enhance multi-modal representations without sacrificing efficiency. Code can be found at this https URL.
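
The mechanism this abstract builds on, differential attention, subtracts one softmax attention map from another so that attention noise common to both maps cancels. Below is a minimal single-head sketch in plain Python, assuming the formulation from the Differential Transformer work: (softmax(Q1·K1ᵀ/√d) − λ·softmax(Q2·K2ᵀ/√d))·V. Here λ is a fixed scalar, and the learnable reparameterization of λ and the per-head normalization are omitted. How DiffCLIP wires this into CLIP's encoders is not specified in the abstract, so all names and values are illustrative.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention_map(q, k):
    """Row-wise softmax of scaled dot-product scores q @ k^T / sqrt(d)."""
    d = len(q[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    return [softmax([s / math.sqrt(d) for s in row]) for row in scores]

def differential_attention(q1, k1, q2, k2, v, lam):
    """Apply the difference of two attention maps to the values."""
    map_1 = attention_map(q1, k1)
    map_2 = attention_map(q2, k2)
    diff = [[x - lam * y for x, y in zip(r1, r2)] for r1, r2 in zip(map_1, map_2)]
    return matmul(diff, v)

# Toy inputs (made up): two tokens, head dimension 2.
q1 = [[1.0, 0.0], [0.0, 1.0]]
k1 = [[1.0, 0.0], [0.0, 1.0]]
q2 = [[0.5, 0.5], [0.5, 0.5]]
k2 = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
print(differential_attention(q1, k1, q2, k2, v, lam=0.5))
```

Setting `lam=0` recovers ordinary softmax attention, which makes the sketch easy to compare against a standard attention baseline.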

207. 【2503.06625】Similarity-Guided Layer-Adaptive Vision Transformer for UAV Tracking

Link: https://arxiv.org/abs/2503.06625

Authors: Chaocan Xue, Bineng Zhong, Qihua Liang, Yaozong Zheng, Ning Li, Yuanliang Xue, Shuxiang Song

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Vision transformers, popular backbone, backbone for visual, Vision, complete ViT architectures

Comments:

Abstract:Vision transformers (ViTs) have emerged as a popular backbone for visual tracking. However, complete ViT architectures are too cumbersome to deploy for unmanned aerial vehicle (UAV) tracking which extremely emphasizes efficiency. In this study, we discover that many layers within lightweight ViT-based trackers tend to learn relatively redundant and repetitive target representations. Based on this observation, we propose a similarity-guided layer adaptation approach to optimize the structure of ViTs. Our approach dynamically disables a large number of representation-similar layers and selectively retains only a single optimal layer among them, aiming to achieve a better accuracy-speed trade-off. By incorporating this approach into existing ViTs, we tailor previously complete ViT architectures into an efficient similarity-guided layer-adaptive framework, namely SGLATrack, for real-time UAV tracking. Extensive experiments on six tracking benchmarks verify the effectiveness of the proposed approach, and show that our SGLATrack achieves a state-of-the-art real-time speed while maintaining competitive tracking precision. Codes and models are available at this https URL.

208. 【2503.06624】Chameleon: On the Scene Diversity and Domain Variety of AI-Generated Videos Detection

Link: https://arxiv.org/abs/2503.06624

Authors: Meiyu Zeng, Xingming Liao, Canyu Chen, Nankai Lin, Zhuowei Wang, Chong Chen, Aimin Yang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Artificial intelligence generated, Artificial intelligence, intelligence generated content, spreading disinformation, intelligence generated

Comments: 17 pages

Abstract:Artificial intelligence generated content (AIGC), known as DeepFakes, has emerged as a growing concern because it is being utilized as a tool for spreading disinformation. While much research exists on identifying AI-generated text and images, research on detecting AI-generated videos is limited. Existing datasets for AI-generated videos detection exhibit limitations in terms of diversity, complexity, and realism. To address these issues, this paper focuses on AI-generated videos detection and constructs a diverse dataset named Chameleon. We generate videos through multiple generation tools and various real video sources. At the same time, we preserve the videos' real-world complexity, including scene switches and dynamic perspective changes, and expand beyond face-centered detection to include human actions and environment generation. Our work bridges the gap between AI-generated dataset construction and real-world forensic needs, offering a valuable benchmark to counteract the evolving threats of AI-generated content.

209. 【2503.06623】Transforming Weather Data from Pixel to Latent Space

Link: https://arxiv.org/abs/2503.06623

Authors: Sijie Zhao, Feng Liu, Xueliang Zhang, Hao Chen, Tao Han, Junchao Gong, Ran Tao, Pengfeng Xiao, Lei Bai, Wanli Ouyang

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: spurred growing interest, extreme weather events, weather, PVS, pixel space

Comments: 8 pages, 6 figures

Abstract: The increasing impact of climate change and extreme weather events has spurred growing interest in deep learning for weather research. However, existing studies often rely on weather data in pixel space, which presents several challenges such as overly smooth model outputs, limited applicability to a single pressure-variable subset (PVS), and high data storage and computational costs. To address these challenges, we propose a novel Weather Latent Autoencoder (WLA) that transforms weather data from pixel space to latent space, enabling efficient weather task modeling. By decoupling weather reconstruction from downstream tasks, WLA improves the accuracy and sharpness of weather task model results. The incorporated Pressure-Variable Unified Module transforms multiple PVS into a unified representation, enhancing the adaptability of the model in multiple weather scenarios. Furthermore, weather tasks can be performed in a low-storage latent space of WLA rather than a high-storage pixel space, thus significantly reducing data storage and computational costs. Through extensive experimentation, we demonstrate its superior compression and reconstruction performance, enabling the creation of the ERA5-latent dataset with unified representations of multiple PVS from ERA5 data. The compressed full PVS in the ERA5-latent dataset reduces the original 244.34 TB of data to 0.43 TB. The downstream task further demonstrates that task models can apply to multiple PVS with low data costs in latent space and achieve superior performance compared to models in pixel space. Code, ERA5-latent data, and pre-trained models are available at this https URL.
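
To put the storage figures from this abstract in perspective, the short calculation below derives the implied compression ratio. Only the two sizes (244.34 TB and 0.43 TB) come from the abstract; the variable names are illustrative.

```python
# Sizes quoted in the abstract, in terabytes.
original_tb = 244.34
latent_tb = 0.43

compression_ratio = original_tb / latent_tb
remaining_percent = 100 * latent_tb / original_tb

print(f"compression ratio: about {compression_ratio:.0f}x")
print(f"latent data as a share of the original: {remaining_percent:.2f}%")
```

The ratio works out to roughly 568 to 1, meaning the latent dataset occupies under 0.2% of the original storage.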

210. 【2503.06621】Dynamic Updates for Language Adaptation in Visual-Language Tracking

Link: https://arxiv.org/abs/2503.06621

Authors: Xiaohai Li, Bineng Zhong, Qihua Liang, Zhiyi Mo, Jian Nong, Shuxiang Song

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: dynamic language descriptions, multi-modal references, semantic information provided, Dynamic Language, static multi-modal references

Comments:

Abstract:The consistency between the semantic information provided by the multi-modal reference and the tracked object is crucial for visual-language (VL) tracking. However, existing VL tracking frameworks rely on static multi-modal references to locate dynamic objects, which can lead to semantic discrepancies and reduce the robustness of the tracker. To address this issue, we propose a novel vision-language tracking framework, named DUTrack, which captures the latest state of the target by dynamically updating multi-modal references to maintain consistency. Specifically, we introduce a Dynamic Language Update Module, which leverages a large language model to generate dynamic language descriptions for the object based on visual features and object category information. Then, we design a Dynamic Template Capture Module, which captures the regions in the image that highly match the dynamic language descriptions. Furthermore, to ensure the efficiency of description generation, we design an update strategy that assesses changes in target displacement, scale, and other factors to decide on updates. Finally, the dynamic template and language descriptions that record the latest state of the target are used to update the multi-modal references, providing more accurate reference information for subsequent inference and enhancing the robustness of the tracker. DUTrack achieves new state-of-the-art performance on four mainstream vision-language and two vision-only tracking benchmarks, including LaSOT, LaSOT$_{\rm{ext}}$, TNL2K, OTB99-Lang, GOT-10K, and UAV123. Code and models are available at this https URL.

211. 【2503.06617】Pixel to Gaussian: Ultra-Fast Continuous Super-Resolution with 2D Gaussian Modeling

Link: https://arxiv.org/abs/2503.06617

Authors: Long Peng, Anran Wu, Wenbo Li, Peizhe Xia, Xueyuan Dai, Xinjie Zhang, Xin Di, Haoze Sun, Renjing Pei, Yang Wang, Yang Cao, Zheng-Jun Zha

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: arbitrary upsampling factors, inputs with arbitrary, fixed-scale factors, ASSR, limitations of traditional

Comments: Tech Report

Abstract: Arbitrary-scale super-resolution (ASSR) aims to reconstruct high-resolution (HR) images from low-resolution (LR) inputs with arbitrary upsampling factors using a single model, addressing the limitations of traditional SR methods constrained to fixed-scale factors (e.g., $\times$2). Recent advances leveraging implicit neural representation (INR) have achieved great progress by modeling coordinate-to-pixel mappings. However, the efficiency of these methods may suffer from repeated upsampling and decoding, while their reconstruction fidelity and quality are constrained by the intrinsic representational limitations of coordinate-based functions. To address these challenges, we propose a novel ContinuousSR framework with a Pixel-to-Gaussian paradigm, which explicitly reconstructs 2D continuous HR signals from LR images using Gaussian Splatting. This approach eliminates the need for time-consuming upsampling and decoding, enabling extremely fast arbitrary-scale super-resolution. Once the Gaussian field is built in a single pass, ContinuousSR can perform arbitrary-scale rendering in just 1ms per scale. Our method introduces several key innovations. Through statistical ana

212. 【2503.06608】GroMo: Plant Growth Modeling with Multiview Images

Link: https://arxiv.org/abs/2503.06608

Authors: Ruchi Bhatt, Shreya Bansal, Amanpreet Chander, Rupinder Kaur, Malya Singh, Mohan Kankanhalli, Abdulmotaleb El Saddik, Mukesh Kumar Saini

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)

Keywords: Understanding plant growth, Understanding plant, plant growth dynamics, Growth Modelling, growth dynamics

Comments: 7 pages, 5 Figures, 3 Tables

Abstract: Understanding plant growth dynamics is essential for applications in agriculture and plant phenotyping. We present the Growth Modelling (GroMo) challenge, which is designed for two primary tasks: (1) plant age prediction and (2) leaf count estimation, both essential for crop monitoring and precision agriculture. For this challenge, we introduce GroMo25, a dataset with images of four crops: radish, okra, wheat, and mustard. Each crop consists of multiple plants (p1, p2, ..., pn) captured over different days (d1, d2, ..., dm) and categorized into five levels (L1, L2, L3, L4, L5). Each plant is captured from 24 different angles with a 15-degree gap between images. Participants are required to perform both tasks for all four crops with these multiview images. We propose a Multiview Vision Transformer (MVVT) model for the GroMo challenge and evaluate the crop-wise performance on GroMo25. MVVT reports an average MAE of 7.74 for age prediction and an MAE of 5.52 for leaf count. The GroMo Challenge aims to advance plant phenotyping research by encouraging innovative solutions for tracking and predicting plant growth. The GitHub repository is publicly available at this https URL.
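
Two numbers in this abstract are easy to sanity-check: 24 views at a 15-degree spacing cover a full 360-degree circle, and the reported scores are mean absolute errors (MAE). The sketch below shows both with made-up predictions; the helper name and the sample values are illustrative and are not taken from GroMo25.

```python
def mean_absolute_error(targets, predictions):
    """Average absolute difference between targets and predictions."""
    if len(targets) != len(predictions) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)

# 24 camera angles spaced 15 degrees apart, starting at 0 degrees.
angles = [i * 15 for i in range(24)]
print(f"{len(angles)} views from {angles[0]} to {angles[-1]} degrees")

# Hypothetical plant ages in days and hypothetical model predictions.
true_age_days = [10, 14, 21, 30]
predicted_age_days = [12, 13, 25, 27]
print(f"age MAE: {mean_absolute_error(true_age_days, predicted_age_days):.2f} days")
```

An MAE is expressed in the unit of the target, so the age score is in days and the leaf-count score is in leaves.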

213. 【2503.06604】Steerable Pyramid Weighted Loss: Multi-Scale Adaptive Weighting for Semantic Segmentation

Link: https://arxiv.org/abs/2503.06604

Authors: Renhao Lu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: remote sensing, biomedical imaging, autonomous driving, computer vision, loss

Comments: 9 pages, 4 figures

Abstract: Semantic segmentation is a core task in computer vision with applications in biomedical imaging, remote sensing, and autonomous driving. While standard loss functions such as cross-entropy and Dice loss perform well in general cases, they often struggle with fine structures, particularly in tasks involving thin structures or closely packed objects. Various weight map-based loss functions have been proposed to address this issue by assigning higher loss weights to pixels prone to misclassification. However, these methods typically rely on precomputed or runtime-generated weight maps based on distance transforms, which impose significant computational costs and fail to adapt to evolving network predictions. In this paper, we propose a novel steerable pyramid-based weighted (SPW) loss function that efficiently generates adaptive weight maps. Unlike traditional boundary-aware losses that depend on static or iteratively updated distance maps, our method leverages steerable pyramids to dynamically emphasize regions across multiple frequency bands (capturing features at different scales) while maintaining computational efficiency. Additionally, by incorporating network predictions into the weight computation, our approach enables adaptive refinement during training. We evaluate our method on the SNEMI3D, GlaS, and DRIVE datasets, benchmarking it against 11 state-of-the-art loss functions. Our results demonstrate that the proposed SPW loss function achieves superior pixel precision and segmentation accuracy with minimal computational overhead. This work provides an effective and efficient solution for improving semantic segmentation, particularly for applications requiring multiscale feature representation. The code is available at this https URL.
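
The general idea behind weight map-based losses, as described in this abstract, is to multiply each pixel's loss by a weight so that hard pixels count more. The sketch below shows that idea for binary cross-entropy with hand-set weights. It does not compute steerable pyramids or the SPW weights; the function name, the normalization by the sum of weights, and the sample values are all assumptions made for illustration.

```python
import math

def weighted_binary_cross_entropy(probabilities, labels, weights, eps=1e-7):
    """Pixel-wise binary cross-entropy, weighted and normalized by total weight."""
    total = 0.0
    for p, y, w in zip(probabilities, labels, weights):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / sum(weights)

# Two foreground pixels: one predicted well, one predicted poorly (e.g. a thin structure).
probabilities = [0.9, 0.2]
labels = [1, 1]

uniform = weighted_binary_cross_entropy(probabilities, labels, [1.0, 1.0])
emphasized = weighted_binary_cross_entropy(probabilities, labels, [1.0, 3.0])
print(f"uniform weights:    {uniform:.4f}")
print(f"hard pixel x3:      {emphasized:.4f}")
```

Raising the weight on the poorly predicted pixel increases the loss, and therefore the gradient signal, contributed by that region.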

214. 【2503.06601】StructVPR++: Distill Structural and Semantic Knowledge with Weighting Samples for Visual Place Recognition

Link: https://arxiv.org/abs/2503.06601

Authors: Yanqing Shen, Sanping Zhou, Jingwen Fu, Ruotong Wang, Shitao Chen, Nanning Zheng

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visual place recognition, Visual place, driving and robotics, image retrieval problem, place recognition

Comments:

Abstract: Visual place recognition is a challenging task for autonomous driving and robotics, which is usually considered as an image retrieval problem. A commonly used two-stage strategy involves global retrieval followed by re-ranking using patch-level descriptors. Most end-to-end deep learning-based methods cannot extract global features with sufficient semantic information from RGB images. In contrast, re-ranking can utilize more explicit structural and semantic information in a one-to-one matching process, but it is time-consuming. To bridge the gap between global retrieval and re-ranking and achieve a good trade-off between accuracy and efficiency, we propose StructVPR++, a framework that embeds structural and semantic knowledge into RGB global representations via segmentation-guided distillation. Our key innovation lies in decoupling label-specific features from global descriptors, enabling explicit semantic alignment between image pairs without requiring segmentation during deployment. Furthermore, we introduce a sample-wise weighted distillation strategy that prioritizes reliable training pairs while suppressing noisy ones. Experiments on four benchmarks demonstrate that StructVPR++ surpasses state-of-the-art global methods by 5-23% in Recall@1 and even outperforms many two-stage approaches, achieving real-time efficiency with a single RGB input.
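
Recall@1, the metric quoted in this abstract, counts a query as correct when its top-ranked retrieved image is a true match. The sketch below shows the usual retrieval definition of Recall@K with made-up query and database identifiers; the function name and data layout are hypothetical and do not come from the paper's code.

```python
def recall_at_k(ranked_results, ground_truth, k=1):
    """Fraction of queries with at least one true match among the top-k results."""
    hits = 0
    for query_id, ranking in ranked_results.items():
        if any(candidate in ground_truth[query_id] for candidate in ranking[:k]):
            hits += 1
    return hits / len(ranked_results)

# Hypothetical retrieval output: database image IDs sorted by descending similarity.
ranked_results = {
    "q1": ["db7", "db2", "db9"],
    "q2": ["db4", "db1", "db3"],
    "q3": ["db5", "db8", "db6"],
    "q4": ["db2", "db7", "db1"],
}
# Hypothetical ground truth: database images showing the same place as each query.
ground_truth = {
    "q1": {"db7"},
    "q2": {"db1"},
    "q3": {"db5", "db6"},
    "q4": {"db9"},
}

print(f"Recall@1 = {recall_at_k(ranked_results, ground_truth, k=1):.2f}")
print(f"Recall@2 = {recall_at_k(ranked_results, ground_truth, k=2):.2f}")
```

Re-ranking in a two-stage pipeline only reorders the candidates returned by global retrieval, so it can raise Recall@1 but cannot recover a match that global retrieval failed to return at all.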

215. 【2503.06598】MultiCo3D: Multi-Label Voxel Contrast for One-Shot Incremental Segmentation of 3D Neuroimages

链接：https://arxiv.org/abs/2503.06598

作者：Hao Xu,Tengfei Xue,Dongnan Liu,Yuqian Chen,Fan Zhang,Carl-Fredrik Westin,Ron Kikinis,Lauren J. O'Donnell,Weidong Cai

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：functional connectivity analysis, One-shot Class Incremental, structure and function, aiding in precise, Class Incremental

备注： 13 pages, 6 figures, 6 tables

点击查看摘要

Abstract:3D neuroimages provide a comprehensive view of brain structure and function, aiding in precise localization and functional connectivity analysis. Segmentation of white matter (WM) tracts using 3D neuroimages is vital for understanding the brain's structural connectivity in both healthy and diseased states. One-shot Class Incremental Semantic Segmentation (OCIS) refers to effectively segmenting new (novel) classes using only a single sample while retaining knowledge of old (base) classes without forgetting. Voxel-contrastive OCIS methods adjust the feature space to alleviate the feature overlap problem between the base and novel classes. However, since WM tract segmentation is a multi-label segmentation task, existing single-label voxel contrastive-based methods may cause inherent contradictions. To address this, we propose a new multi-label voxel contrast framework called MultiCo3D for one-shot class incremental tract segmentation. Our method utilizes uncertainty distillation to preserve base tract segmentation knowledge while adjusting the feature space with multi-label voxel contrast to alleviate feature overlap when learning novel tracts and dynamically weighting multi losses to balance overall loss. We compare our method against several state-of-the-art (SOTA) approaches. The experimental results show that our method significantly enhances one-shot class incremental tract segmentation accuracy across five different experimental setups on HCP and Preto datasets.

216. 【2503.06588】Speech Audio Generation from dynamic MRI via a Knowledge Enhanced Conditional Variational Autoencoder

链接：https://arxiv.org/abs/2503.06588

作者：Yaxuan Li,Han Jiang,Yifei Ma,Shihua Qin,Fangxu Xing

类目：Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)

关键词：Magnetic Resonance Imaging, Dynamic Magnetic Resonance, adopted imaging modality, increasingly adopted imaging, Magnetic Resonance

备注：

点击查看摘要

Abstract:Dynamic Magnetic Resonance Imaging (MRI) of the vocal tract has become an increasingly adopted imaging modality for speech motor studies. Beyond image signals, systematic data loss, noise pollution, and audio file corruption can occur due to the unpredictability of the MRI acquisition environment. In such cases, generating audio from images is critical for data recovery in both clinical and research applications. However, this remains challenging due to hardware constraints, acoustic interference, and data corruption. Existing solutions, such as denoising and multi-stage synthesis methods, face limitations in audio fidelity and generalizability. To address these challenges, we propose a Knowledge Enhanced Conditional Variational Autoencoder (KE-CVAE), a novel two-step "knowledge enhancement + variational inference" framework for generating speech audio signals from cine dynamic MRI sequences. This approach introduces two key innovations: (1) integration of unlabeled MRI data for knowledge enhancement, and (2) a variational inference architecture to improve generative modeling capacity. To the best of our knowledge, this is one of the first attempts at synthesizing speech audio directly from dynamic MRI video sequences. The proposed method was trained and evaluated on an open-source dynamic vocal tract MRI dataset recorded during speech. Experimental results demonstrate its effectiveness in generating natural speech waveforms while addressing MRI-specific acoustic challenges, outperforming conventional deep learning-based synthesis approaches.

217. 【2503.06587】Introducing Unbiased Depth into 2D Gaussian Splatting for High-accuracy Surface Reconstruction

链接：https://arxiv.org/abs/2503.06587

作者：Xiaoming Peng,Yixin Yang,Yang Zhou,Hui Huang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：demonstrated superior geometry, approximate thin surfaces, superior geometry reconstruction, Gaussian Splatting, surfels to approximate

备注：

点击查看摘要

Abstract:Recently, 2D Gaussian Splatting (2DGS) has demonstrated better geometry reconstruction quality than the popular 3DGS by using 2D surfels to approximate thin surfaces. However, it falls short when dealing with glossy surfaces, resulting in visible holes in these areas. We find that reflection discontinuity causes the issue: to fit the jump from diffuse to specular reflection at different viewing angles, depth bias is introduced in the optimized Gaussian primitives. To address that, we first replace the depth distortion loss in 2DGS with a novel depth convergence loss, which imposes a strong constraint on depth continuity. Then, we rectify the depth criterion used to determine the actual surface, which fully accounts for all the intersecting Gaussians along the ray. Qualitative and quantitative evaluations across various datasets reveal that our method significantly improves reconstruction quality, with more complete and accurate surfaces than 2DGS.

218. 【2503.06569】Global-Aware Monocular Semantic Scene Completion with State Space Models

链接：https://arxiv.org/abs/2503.06569

作者：Shijie Li,Zhongyao Cheng,Rong Li,Shuai Li,Juergen Gall,Xun Xu,Xulei Yang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Semantic Scene Completion, Monocular Semantic Scene, diverse real-world applications, Monocular Semantic, Scene Completion

备注：

点击查看摘要

Abstract:Monocular Semantic Scene Completion (MonoSSC) reconstructs and interprets 3D environments from a single image, enabling diverse real-world applications. However, existing methods are often constrained by the local receptive field of Convolutional Neural Networks (CNNs), making it challenging to handle the non-uniform distribution of projected points and effectively reconstruct missing information caused by the 3D-to-2D projection. In this work, we introduce GA-MonoSSC, a hybrid architecture for MonoSSC that effectively captures global context in both the 2D image domain and 3D space. Specifically, we propose a Dual-Head Multi-Modality Encoder, which leverages a Transformer architecture to capture spatial relationships across all features in the 2D image domain, enabling more comprehensive 2D feature extraction. Additionally, we introduce the Frustum Mamba Decoder, built on the State Space Model (SSM), to efficiently capture long-range dependencies in 3D space. Furthermore, we propose a frustum reordering strategy within the Frustum Mamba Decoder to mitigate feature discontinuities in the reordered voxel sequence, ensuring better alignment with the scan mechanism of the State Space Model (SSM) for improved 3D representation learning. We conduct extensive experiments on the widely used Occ-ScanNet and NYUv2 datasets, demonstrating that our proposed method achieves state-of-the-art performance, validating its effectiveness. The code will be released upon acceptance.

219. 【2503.06568】Conceptrol: Concept Control of Zero-shot Personalized Image Generation

链接：https://arxiv.org/abs/2503.06568

作者：Qiyuan He,Angela Yao

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：diffusion models generates, models generates unseen, generates unseen images, unseen images based, diffusion models

备注：

点击查看摘要

Abstract:Personalized image generation with text-to-image diffusion models generates unseen images based on reference image content. Zero-shot adapter methods such as IP-Adapter and OminiControl are especially interesting because they do not require test-time fine-tuning. However, they struggle to balance preserving personalized content and adherence to the text prompt. We identify a critical design flaw resulting in this performance gap: current adapters inadequately integrate personalization images with the textual descriptions. The generated images, therefore, replicate the personalized content rather than adhere to the text prompt instructions. Yet the base text-to-image model has strong conceptual understanding capabilities that can be leveraged. We propose Conceptrol, a simple yet effective framework that enhances zero-shot adapters without adding computational overhead. Conceptrol constrains the attention of visual specification with a textual concept mask that improves subject-driven generation capabilities. It achieves as much as 89% improvement on personalization benchmarks over the vanilla IP-Adapter and can even outperform fine-tuning approaches such as Dreambooth LoRA. The source code is available at this https URL.

220. 【2503.06565】Future-Aware Interaction Network For Motion Forecasting

链接：https://arxiv.org/abs/2503.06565

作者：Shijie Li,Xun Xu,Si Yong Yeo,Xulei Yang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：

备注：

点击查看摘要

None

221. 【2503.06564】TR-DQ: Time-Rotation Diffusion Quantization

链接：https://arxiv.org/abs/2503.06564

作者：Yihua Shao,Deyang Lin,Fanhu Zeng,Minxi Yan,Muyang Zhang,Siyu Chen,Yuxuan Fan,Ziyang Yan,Haozhe Wang,Jingcai Guo,Yan Wang,Haotong Qin,Hao Tang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：widely adopted, quantization, diffusion quantization, generation, Diffusion

备注：

点击查看摘要

Abstract:Diffusion models have been widely adopted in image and video generation. However, their complex network architecture leads to high inference overhead for its generation process. Existing diffusion quantization methods primarily focus on the quantization of the model structure while ignoring the impact of time-steps variation during sampling. At the same time, most current approaches fail to account for significant activations that cannot be eliminated, resulting in substantial performance degradation after quantization. To address these issues, we propose Time-Rotation Diffusion Quantization (TR-DQ), a novel quantization method incorporating time-step and rotation-based optimization. TR-DQ first divides the sampling process based on time-steps and applies a rotation matrix to smooth activations and weights dynamically. For different time-steps, a dedicated hyperparameter is introduced for adaptive timing modeling, which enables dynamic quantization across different time steps. Additionally, we also explore the compression potential of Classifier-Free Guidance (CFG-wise) to establish a foundation for subsequent work. TR-DQ achieves state-of-the-art (SOTA) performance on image generation and video generation tasks and a 1.38-1.89x speedup and 1.97-2.58x memory reduction in inference compared to existing quantization methods.
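
The abstract mentions applying a rotation matrix to smooth activations and weights before quantization. The general principle behind rotation-based smoothing can be sketched as follows; this is not the TR-DQ algorithm. The 2x2 normalized Hadamard matrix, the toy activation `x`, and the toy weights `W` are assumptions chosen for illustration. Because the rotation `R` is orthogonal, `(x R)(R^T W)` equals `x W`, while the rotated activation has a smaller peak magnitude and is therefore easier to quantize.

```python
import math

def matvec_row(x, m):
    """Multiply row vector x by matrix m."""
    return [sum(x[i] * m[i][j] for i in range(len(x))) for j in range(len(m[0]))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [matvec_row(row, b) for row in a]

s = 1.0 / math.sqrt(2.0)
R = [[s, s], [s, -s]]           # 2x2 normalized Hadamard matrix (orthogonal)

x = [10.0, 0.0]                 # toy activation with one outlier channel
W = [[0.5, -1.0], [2.0, 0.25]]  # toy weight matrix

x_rot = matvec_row(x, R)         # rotated activation: outlier spread over channels
W_rot = matmul(transpose(R), W)  # counter-rotated weights

y_ref = matvec_row(x, W)          # original layer output
y_rot = matvec_row(x_rot, W_rot)  # output computed in the rotated basis
```

Here `max(abs(v) for v in x_rot)` is about 7.07 instead of 10, and `y_rot` matches `y_ref` up to floating-point error. How the rotation is chosen per time-step is specific to the paper and is not modeled here.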

222. 【2503.06559】MMARD: Improving the Min-Max Optimization Process in Adversarial Robustness Distillation

链接：https://arxiv.org/abs/2503.06559

作者：Yuzheng Wang,Zhaoyu Chen,Dingkang Yang,Yuanhang Wang,Lizhe Qi

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Adversarial Robustness Distillation, Adversarial Robustness, optimization Adversarial Robustness, Robustness Distillation, pre-trained robust teacher

备注：

点击查看摘要

Abstract:Adversarial Robustness Distillation (ARD) is a promising task to boost the robustness of small-capacity models with the guidance of the pre-trained robust teacher. The ARD can be summarized as a min-max optimization process, i.e., synthesizing adversarial examples (inner) and training the student (outer). Despite their competitive robustness, existing ARD methods still have issues. In the inner process, the synthetic training examples are far from the teacher's decision boundary, so important robust information is lost. In the outer process, the student model is decoupled from learning natural and robust scenarios, leading to robustness saturation, i.e., student performance is highly susceptible to customized teacher selection. To tackle these issues, this paper proposes a general Min-Max optimization Adversarial Robustness Distillation (MMARD) method. For the inner process, we introduce the teacher's robust predictions, which drive the training examples closer to the teacher's decision boundary to explore more robust knowledge. For the outer process, we propose a structured information modeling method based on triangular relationships to measure the mutual information of the model in natural and robust scenarios and enhance the model's ability to understand multi-scenario mapping relationships. Experiments show our MMARD achieves state-of-the-art performance on multiple benchmarks. Besides, MMARD is plug-and-play and convenient to combine with existing methods.

223. 【2503.06553】ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges

链接：https://arxiv.org/abs/2503.06553

作者：Jiaxin Ai,Pengfei Zhou,Zhaopan Xu,Ming Li,Fanrui Zhang,Zizhen Li,Jianwen Sun,Yukang Feng,Baojin Huang,Zhongyuan Wang,Kaipeng Zhang

类目：Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：solving scientific problems, frequently exhibit errors, fine-grained model weaknesses, multi-modal large language, large language models

备注：

点击查看摘要

Abstract:As multi-modal large language models (MLLMs) frequently exhibit errors when solving scientific problems, evaluating the validity of their reasoning processes is critical for ensuring reliability and uncovering fine-grained model weaknesses. Since human evaluation is laborious and costly, prompting MLLMs as automated process judges has become a common practice. However, the reliability of these model-based judges remains uncertain. To address this, we introduce ProJudgeBench, the first comprehensive benchmark specifically designed for evaluating abilities of MLLM-based process judges. ProJudgeBench comprises 2,400 test cases and 50,118 step-level labels, spanning four scientific disciplines with diverse difficulty levels and multi-modal content. In ProJudgeBench, each step is meticulously annotated by human experts for correctness, error type, and explanation, enabling a systematic evaluation of judges' capabilities to detect, classify and diagnose errors. Evaluation on ProJudgeBench reveals a significant performance gap between open-source and proprietary models. To bridge this gap, we further propose ProJudge-173k, a large-scale instruction-tuning dataset, and a Dynamic Dual-Phase fine-tuning strategy that encourages models to explicitly reason through problem-solving before assessing solutions. Both contributions significantly enhance the process evaluation capabilities of open-source models. All the resources will be released to foster future research of reliable multi-modal process evaluation.

224. 【2503.06545】QuantCache: Adaptive Importance-Guided Quantization with Hierarchical Latent and Layer Caching for Video Generation

链接：https://arxiv.org/abs/2503.06545

作者：Junyi Wu,Zhiteng Li,Zheng Hui,Yulun Zhang,Linghe Kong,Xiaokang Yang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：

备注： The code and models will be available at [this https URL](https://github.com/JunyiWuCode/QuantCache)

点击查看摘要

None

225. 【2503.06542】ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model with Interleaved Multimodal Generation via Asymmetric Synergy

链接：https://arxiv.org/abs/2503.06542

作者：Jianwen Sun,Yukang Feng,Chuanhao Li,Fanrui Zhang,Zizhen Li,Jiaxin Ai,Sizhuo Zhou,Yu Dai,Shenglin Zhang,Kaipeng Zhang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Unified models, multimodal understanding, recently received, received much attention, area of vision

备注：

点击查看摘要

Abstract:Unified models (UniMs) for multimodal understanding and generation have recently received much attention in the area of vision and language. Existing UniMs are designed to simultaneously learn both multimodal understanding and generation capabilities, demanding substantial computational resources, and often struggle to generate interleaved text-image. We present ARMOR, a resource-efficient and pure autoregressive framework that achieves both understanding and generation by fine-tuning existing multimodal large language models (MLLMs). Specifically, ARMOR extends existing MLLMs from three perspectives: (1) For model architecture, an asymmetric encoder-decoder architecture with a forward-switching mechanism is introduced to unify embedding space integrating textual and visual modalities for enabling natural text-image interleaved generation with minimal computational overhead. (2) For training data, a meticulously curated, high-quality interleaved dataset is collected for fine-tuning MLLMs. (3) For the training algorithm, we propose a "what or how to generate" algorithm to empower existing MLLMs with multimodal generation capabilities while preserving their multimodal understanding capabilities, through three progressive training stages based on the collected dataset. Experimental results demonstrate that ARMOR upgrades existing MLLMs to UniMs with promising image generation capabilities, using limited training resources. Our code will be released soon at this https URL.

226. 【2503.06537】One-Step Diffusion Model for Image Motion-Deblurring

链接：https://arxiv.org/abs/2503.06537

作者：Xiaoyang Liu,Yuquan Wang,Zheng Chen,Jiezhang Cao,He Zhang,Yulun Zhang,Xiaokang Yang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：

备注：

点击查看摘要

None

227. 【2503.06529】AnywhereDoor: Multi-Target Backdoor Attacks on Object Detection

链接：https://arxiv.org/abs/2503.06529

作者：Jialin Lu,Junjie Shan,Ziqi Zhao,Ka-Ho Chow

类目：Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

关键词：

备注：

点击查看摘要

None

228. 【2503.06526】TimeLoc: A Unified End-to-End Framework for Precise Timestamp Localization in Long Videos

链接：https://arxiv.org/abs/2503.06526

作者：Chen-Lin Zhang,Lin Sui,Shuming Liu,Fangzhou Mu,Zhangcheng Wang,Bernard Ghanem

类目：Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词：

备注： Code and models will be released at [this https URL](https://github.com/sming256/TimeLoc). The first 4 authors contribute equally

点击查看摘要

None

229. 【2503.06522】SGA-INTERACT: A 3D Skeleton-based Benchmark for Group Activity Understanding in Modern Basketball Tactic

链接：https://arxiv.org/abs/2503.06522

作者：Yuchen Yang,Wei Wang,Yifei Liu,Linfeng Dong,Hao Wu,Mingxin Zhang,Zhihang Zhong,Xiao Sun

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：Group Activity Recognition, Group Activity Understanding, Group Activity, Activity Recognition, Activity Understanding

备注：

点击查看摘要

Abstract:Group Activity Understanding is predominantly studied as a Group Activity Recognition (GAR) task. However, existing GAR benchmarks suffer from coarse-grained activity vocabularies and single-view-only data, which hinder the evaluation of state-of-the-art algorithms. To address these limitations, we introduce SGA-INTERACT, the first 3D skeleton-based benchmark for group activity understanding. It features complex activities inspired by basketball tactics, emphasizing rich spatial interactions and long-term dependencies. SGA-INTERACT introduces the Temporal Group Activity Localization (TGAL) task, extending group activity understanding to untrimmed sequences, filling the gap left by GAR as a standalone task. In addition to the benchmark, we propose One2Many, a novel framework that employs a pretrained 3D skeleton backbone for unified individual feature extraction. This framework aligns with the feature extraction paradigm in RGB-based methods, enabling direct evaluation of RGB-based models on skeleton-based benchmarks. We conduct extensive evaluations on SGA-INTERACT using two skeleton-based methods, three RGB-based methods, and a proposed baseline within the One2Many framework. The generally low performance of baselines highlights the benchmark's challenges, motivating advancements in group activity understanding.

230. 【2503.06520】Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

链接：https://arxiv.org/abs/2503.06520

作者：Yuqi Liu,Bohao Peng,Zhisheng Zhong,Zihao Yue,Fanbin Lu,Bei Yu,Jiaya Jia

类目：Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词：Traditional methods, simple descriptions, explicit reasoning processes, rely on supervised, supervised fine-tuning

备注：

点击查看摘要

Abstract:Traditional methods for reasoning segmentation rely on supervised fine-tuning with categorical labels and simple descriptions, limiting their out-of-domain generalization and lacking explicit reasoning processes. To address these limitations, we propose Seg-Zero, a novel framework that demonstrates remarkable generalizability and derives explicit chain-of-thought reasoning through cognitive reinforcement. Seg-Zero introduces a decoupled architecture consisting of a reasoning model and a segmentation model. The reasoning model interprets user intentions, generates explicit reasoning chains, and produces positional prompts, which are subsequently used by the segmentation model to generate precise pixel-level masks. We design a sophisticated reward mechanism that integrates both format and accuracy rewards to effectively guide optimization directions. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Seg-Zero achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Experiments show that Seg-Zero-7B achieves a zero-shot performance of 57.5 on the ReasonSeg benchmark, surpassing the prior LISA-7B by 18%. This significant improvement highlights Seg-Zero's ability to generalize across domains while presenting an explicit reasoning process. Code is available at this https URL.

231. 【2503.06517】Instance-wise Supervision-level Optimization in Active Learning

链接：https://arxiv.org/abs/2503.06517

作者：Shinnosuke Matsuo,Riku Togashi,Ryoma Bise,Seiichi Uchida,Masahiro Nomura

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：

备注： Accepted at CVPR2025

点击查看摘要

None

232. 【2503.06515】SAQ-SAM: Semantically-Aligned Quantization for Segment Anything Model

链接：https://arxiv.org/abs/2503.06515

作者：Jing Zhang,Zhikai Li,Qingyi Gu

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：

备注：

点击查看摘要

None

233. 【2503.06514】GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks

链接：https://arxiv.org/abs/2503.06514

作者：Haoqiang Kang,Enna Sachdeva,Piyush Gupta,Sangjae Bae,Kwonjoon Lee

类目：Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：recently shown promising, shown promising advancements, Proximal Policy Optimization, sequential decision-making tasks, recently shown

备注：

点击查看摘要

234. 【2503.06508】A Light and Tuning-free Method for Simulating Camera Motion in Video Generation

链接：https://arxiv.org/abs/2503.06508

作者：Quanjian Song,Zhihang Lin,Zhanpeng Zeng,Ziyue Zhang,Liujuan Cao,Rongrong Ji

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：face computational bottlenecks, methods face computational, camera motion-controlled video, Existing camera motion-controlled, latent space

备注： 18 pages in total

点击查看摘要

Abstract:Existing camera motion-controlled video generation methods face computational bottlenecks in fine-tuning and inference. This paper proposes LightMotion, a light and tuning-free method for simulating camera motion in video generation. Operating in the latent space, it eliminates additional fine-tuning, inpainting, and depth estimation, making it more streamlined than existing methods. The endeavors of this paper comprise: (i) The latent space permutation operation effectively simulates various camera motions like panning, zooming, and rotation. (ii) The latent space resampling strategy combines background-aware sampling and cross-frame alignment to accurately fill new perspectives while maintaining coherence across frames. (iii) Our in-depth analysis shows that the permutation and resampling cause an SNR shift in latent space, leading to poor-quality generation. To address this, we propose latent space correction, which reintroduces noise during denoising to mitigate SNR shift and enhance video generation quality. Exhaustive experiments show that our LightMotion outperforms existing methods, both quantitatively and qualitatively.
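
The idea in point (i) of simulating camera motion by permuting latent positions can be illustrated with a toy sketch. This is not LightMotion's implementation: the function `pan_right`, the 2x4 integer grid standing in for a latent feature map, and the `None` placeholder for newly exposed positions are all assumptions for illustration. Panning the camera right moves content left, and the columns that enter the view would then need to be filled by a resampling step such as the one described in point (ii).

```python
def pan_right(latent, shift, fill=None):
    """Shift each row of a 2D latent grid left by `shift` columns.

    Simulates the camera panning right: existing content moves left, and the
    newly exposed columns on the right are marked with `fill` for later filling.
    """
    if shift < 0:
        raise ValueError("shift must be non-negative")
    out = []
    for row in latent:
        k = min(shift, len(row))
        out.append(list(row[k:]) + [fill] * k)
    return out

# Toy "latent" for the first frame; later frames pan by one column per step.
frame0 = [[1, 2, 3, 4], [5, 6, 7, 8]]
frames = [pan_right(frame0, t) for t in range(3)]
```

In this sketch `frames[1]` keeps columns 2 to 4 of the first frame and leaves one unfilled column on the right. Zooming and rotation would need a different index mapping, and the SNR correction described in point (iii) is not modeled.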

235. 【2503.06506】Fine-Grained Alignment and Noise Refinement for Compositional Text-to-Image Generation

链接：https://arxiv.org/abs/2503.06506

作者：Amir Mohammad Izadi,Seyed Mohammad Hadi Hosseini,Soroush Vafaie Tabar,Ali Abdollahi,Armin Saghafian,Mahdieh Soleymani Baghshah

类目：Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词：

备注：

点击查看摘要

None

236. 【2503.06505】DynamicID: Zero-Shot Multi-ID Image Personalization with Flexible Facial Editability

链接：https://arxiv.org/abs/2503.06505

作者：Xirui Hu,Jiahao Wang,Hao Chen,Weizhan Zhang,Benqi Wang,Yikun Li,Haishun Nan

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：

备注： 17 pages, 16 figures

点击查看摘要

None

237. 【2503.06501】TextInPlace: Indoor Visual Place Recognition in Repetitive Structures with Scene Text Spotting and Verification

链接：https://arxiv.org/abs/2503.06501

作者：Huaqi Tao,Bingxi Liu,Calvin Chen,Tingjun Huang,He Li,Jinqiang Cui,Hong Zhang

类目：Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词：

备注： 8 pages,5 figures

点击查看摘要

None

238. 【2503.06499】ExGes: Expressive Human Motion Retrieval and Modulation for Audio-Driven Gesture Synthesis

链接：https://arxiv.org/abs/2503.06499

作者：Xukun Zhou,Fengxin Li,Ming Chen,Yan Zhou,Pengfei Wan,Di Zhang,Hongyan Liu,Jun He,Zhaoxin Fan

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：

备注：

点击查看摘要

None

239. 【2503.06497】Evaluation of Safety Cognition Capability in Vision-Language Models for Autonomous Driving

链接：https://arxiv.org/abs/2503.06497

作者：Enming Zhang,Peizhe Gong,Xingyuan Dai,Yisheng Lv,Qinghai Miao

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：

备注：

点击查看摘要

None

240. 【2503.06492】VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering

链接：https://arxiv.org/abs/2503.06492

作者：Yanling Wang,Yihan Zhao,Xiaodong Chen,Shasha Guo,Lixin Liu,Haoyang Li,Yong Xiao,Jing Zhang,Qi Li,Ke Xu

类目：Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

关键词：demonstrated remarkable achievements, Large vision-language models, non-factual responses remains, responses remains prevalent, Large vision-language

备注：

点击查看摘要

241. 【2503.06486】PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training

链接：https://arxiv.org/abs/2503.06486

作者：Cong Chen,Mingyu Liu,Chenchen Jing,Yizhou Zhou,Fengyun Rao,Hao Chen,Bo Zhang,Chunhua Shen

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：

备注：

点击查看摘要

None

242. 【2503.06485】A Mesh Is Worth 512 Numbers: Spectral-domain Diffusion Modeling for High-dimension Shape Generation

链接：https://arxiv.org/abs/2503.06485

作者：Jiajie Fan,Amal Trigui,Andrea Bonfanti,Felix Dietrich,Thomas Bäck,Hao Wang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：

备注：

点击查看摘要

None

243. 【2503.06484】Sign Language Translation using Frame and Event Stream: Benchmark Dataset and Algorithms

链接：https://arxiv.org/abs/2503.06484

作者：Xiao Wang,Yuehang Li,Fuling Wang,Bo Jiang,Yaowei Wang,Yonghong Tian,Jin Tang,Bin Luo

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)

关键词：

备注： In Peer Review

点击查看摘要

None

244. 【2503.06482】PathVQ: Reforming Computational Pathology Foundation Model for Whole Slide Image Analysis via Vector Quantization

链接：https://arxiv.org/abs/2503.06482

作者：Honglin Li,Zhongyi Shui,Yunlong Zhang,Chenglu Zhu,Lin Yang

类目：Computer Vision and Pattern Recognition (cs.CV)

关键词：

备注：

点击查看摘要

None

245. 【2503.06477】PDB: Not All Drivers Are the Same -- A Personalized Dataset for Understanding Driving Behavior

链接：https://arxiv.org/abs/2503.06477

作者：Chuheng Wei,Ziye Qin,Siyan Li,Ziyan Zhang,Xuanpeng Zhao,Amr Abdelraouf,Rohit Gupta,Kyungtae Han,Matthew J. Barth,Guoyuan Wu

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：

备注：

点击查看摘要

None

246. 【2503.06473】Enhancing Layer Attention Efficiency through Pruning Redundant Retrievals

链接：https://arxiv.org/abs/2503.06473

作者：Hanze Li,Xiande Huang

类目：Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词：Growing evidence suggests, deep neural networks, significantly advanced network, Growing evidence, advanced network architectures

备注： 11 pages, 7 figures

点击查看摘要

Abstract:Growing evidence suggests that layer attention mechanisms, which enhance interaction among layers in deep neural networks, have significantly advanced network architectures. However, existing layer attention methods suffer from redundancy, as attention weights learned by adjacent layers often become highly similar. This redundancy causes multiple layers to extract nearly identical features, reducing the model's representational capacity and increasing training time. To address this issue, we propose a novel approach to quantify redundancy by leveraging the Kullback-Leibler (KL) divergence between adjacent layers. Additionally, we introduce an Enhanced Beta Quantile Mapping (EBQM) method that accurately identifies and skips redundant layers, thereby maintaining model stability. Our proposed Efficient Layer Attention (ELA) architecture improves both training efficiency and overall performance, achieving a 30% reduction in training time while enhancing performance in tasks such as image classification and object detection.
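
The redundancy measure described in the abstract, KL divergence between the attention weights of adjacent layers, can be sketched as follows. This is a minimal illustration and not the ELA architecture: the function names, the toy attention distributions, the direction of the divergence (current layer relative to the previous one), and the fixed threshold are all assumptions. In particular, the fixed threshold is a simple stand-in for the paper's EBQM method, whose details the abstract does not give.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def redundant_layers(attn_per_layer, threshold):
    """Return indices of layers whose weights nearly match the previous layer."""
    flagged = []
    for i in range(1, len(attn_per_layer)):
        if kl_divergence(attn_per_layer[i], attn_per_layer[i - 1]) < threshold:
            flagged.append(i)
    return flagged

# Toy layer-attention weights: layer 1 almost repeats layer 0, layer 2 differs.
attn = [
    [0.70, 0.20, 0.10],
    [0.69, 0.21, 0.10],
    [0.10, 0.30, 0.60],
]
flagged = redundant_layers(attn, threshold=0.01)  # hypothetical threshold
```

Layers flagged this way are candidates for skipping, since their attention adds little beyond what the previous layer already provides.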

247. 【2503.06472】CalliReader: Contextualizing Chinese Calligraphy via an Embedding-Aligned Vision-Language Model

链接：https://arxiv.org/abs/2503.06472

作者：Yuxuan Luo,Jiaqi Tang,Chenyi Huang,Feiyang Hao,Zhouhui Lian

类目：Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

关键词：remains computationally challenging, computationally challenging due, UNESCO Heritage, Chinese Calligraphy Contextualization, remains computationally

备注： 11 pages

点击查看摘要

Abstract:Chinese calligraphy, a UNESCO Heritage, remains computationally challenging due to visual ambiguity and cultural complexity. Existing AI systems fail to contextualize its intricate scripts because of limited annotated data and poor visual-semantic alignment. We propose CalliReader, a vision-language model (VLM) that solves the Chinese Calligraphy Contextualization (CC$^2$) problem through three innovations: (1) character-wise slicing for precise character extraction and sorting, (2) CalliAlign for visual-text token compression and alignment, (3) embedding instruction tuning (e-IT) for improving alignment and addressing data scarcity. We also build CalliBench, the first benchmark for full-page calligraphic contextualization, addressing three critical issues in previous OCR and VQA approaches: fragmented context, shallow reasoning, and hallucination. Extensive experiments including user studies have been conducted to verify CalliReader's superiority to other state-of-the-art methods and even human professionals in page-level calligraphy recognition and interpretation, achieving higher accuracy while reducing hallucination. Comparisons with reasoning models highlight the importance of accurate recognition as a prerequisite for reliable comprehension. Quantitative analyses validate CalliReader's efficiency; evaluations on document and real-world benchmarks confirm its robust generalization ability.

248. 【2503.06471】Online Dense Point Tracking with Streaming Memory

Link: https://arxiv.org/abs/2503.06471

Authors: Qiaole Dong,Yanwei Fu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes:

249. 【2503.06470】Think Twice, Click Once: Enhancing GUI Grounding via Fast and Slow Systems

Link: https://arxiv.org/abs/2503.06470

Authors: Fei Tang,Yongliang Shen,Hang Zhang,Siqi Chen,Guiyang Hou,Wenqi Zhang,Wenqiao Zhang,Kaitao Song,Weiming Lu,Yueting Zhuang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes:

250. 【2503.06469】Vector Quantized Feature Fields for Fast 3D Semantic Lifting

Link: https://arxiv.org/abs/2503.06469

Authors: George Tang,Aditya Agarwal,Weiqiao Han,Trevor Darrell,Yutong Bai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes:

251. 【2503.06467】SP3D: Boosting Sparsely-Supervised 3D Object Detection via Accurate Cross-Modal Semantic Prompts

Link: https://arxiv.org/abs/2503.06467

Authors: Shijia Zhao,Qiming Xia,Xusheng Guo,Pufan Zou,Maoji Zheng,Hai Wu,Chenglu Wen,Cheng Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes: 11 pages, 3 figures

252. 【2503.06462】StructGS: Adaptive Spherical Harmonics and Rendering Enhancements for Superior 3D Gaussian Splatting

Link: https://arxiv.org/abs/2503.06462

Authors: Zexu Huang,Min Xu,Stuart Perry

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords:

Notes:

253. 【2503.06461】Long-tailed Adversarial Training with Self-Distillation

Link: https://arxiv.org/abs/2503.06461

Authors: Seungju Cho,Hongsin Lee,Changick Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes: ICLR 2025

254. 【2503.06458】Reconstructing Depth Images of Moving Objects from Wi-Fi CSI Data

Link: https://arxiv.org/abs/2503.06458

Authors: Guanyu Cao,Takuya Maekawa,Kazuya Ohara,Yasue Kishino

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)

Keywords:

Notes:

255. 【2503.06457】Geometric Knowledge-Guided Localized Global Distribution Alignment for Federated Learning

Link: https://arxiv.org/abs/2503.06457

Authors: Yanbiao Ma,Wei Dai,Wenke Huang,Jiayi Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: divergent local optimization, local optimization directions, global geometric shapes, federated learning, leads to divergent

Notes: Accepted by CVPR 2025

Abstract:Data heterogeneity in federated learning, characterized by a significant misalignment between local and global distributions, leads to divergent local optimization directions and hinders global model training. Existing studies mainly focus on optimizing local updates or global aggregation, but these indirect approaches demonstrate instability when handling highly heterogeneous data distributions, especially in scenarios where label skew and domain skew coexist. To address this, we propose a geometry-guided data generation method that centers on simulating the global embedding distribution locally. We first introduce the concept of the geometric shape of an embedding distribution and then address the challenge of obtaining global geometric shapes under privacy constraints. Subsequently, we propose GGEUR, which leverages global geometric shapes to guide the generation of new samples, enabling a closer approximation to the ideal global distribution. In single-domain scenarios, we augment samples based on global geometric shapes to enhance model generalization; in multi-domain scenarios, we further employ class prototypes to simulate the global distribution across domains. Extensive experimental results demonstrate that our method significantly enhances the performance of existing approaches in handling highly heterogeneous data, including scenarios with label skew, domain skew, and their coexistence. Code published at: this https URL

256. 【2503.06456】DynCIM: Dynamic Curriculum for Imbalanced Multimodal Learning

Link: https://arxiv.org/abs/2503.06456

Authors: Chengxuan Qian,Kai Han,Jingchao Wang,Zhenlong Yuan,Rui Qian,Chongwen Lyu,Jun Chen,Zhe Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes:

257. 【2503.06451】A Quantitative Evaluation of the Expressivity of BMI, Pose and Gender in Body Embeddings for Recognition and Identification

Link: https://arxiv.org/abs/2503.06451

Authors: Basudha Pal,Siyuan (Cyan) Huang,Rama Chellappa

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Person Re-identification, systems identify individuals, identify individuals, individuals across images, images or video

Notes:

Abstract:Person Re-identification (ReID) systems identify individuals across images or video frames and play a critical role in various real-world applications. However, many ReID methods are influenced by sensitive attributes such as gender, pose, and body mass index (BMI), which vary in uncontrolled environments, leading to biases and reduced generalization. To address this, we extend the concept of expressivity to the body recognition domain to better understand how ReID models encode these attributes. Expressivity, defined as the mutual information between feature vector representations and specific attributes, is computed using a secondary neural network that takes feature and attribute vectors as inputs. This provides a quantitative framework for analyzing the extent to which sensitive attributes are embedded in the model's representations. We apply expressivity analysis to SemReID, a state-of-the-art self-supervised ReID model, and find that BMI consistently exhibits the highest expressivity scores in the model's final layers, underscoring its dominant role in feature encoding. In the final attention layer of the trained network, the expressivity order for body attributes is BMI > Pitch > Yaw > Gender, highlighting their relative importance in learned representations. Additionally, expressivity values evolve progressively across network layers and training epochs, reflecting a dynamic encoding of attributes during feature extraction. These insights emphasize the influence of body-related attributes on ReID models and provide a systematic methodology for identifying and mitigating attribute-driven biases. By leveraging expressivity analysis, we offer valuable tools to enhance the fairness, robustness, and generalization of ReID systems in diverse real-world settings.

258. 【2503.06446】M$^3$amba: CLIP-driven Mamba Model for Multi-modal Remote Sensing Classification

Link: https://arxiv.org/abs/2503.06446

Authors: Mingxiang Cao,Weiying Xie,Xin Zhang,Jiaqing Zhang,Kai Jiang,Jie Lei,Yunsong Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes:

259. 【2503.06442】OT-DETECTOR: Delving into Optimal Transport for Zero-shot Out-of-Distribution Detection

Link: https://arxiv.org/abs/2503.06442

Authors: Yu Liu,Hao Tang,Haiqi Zhang,Jing Qin,Zechao Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords:

Notes: The first two authors contributed equally to this work

260. 【2503.06437】SEED: Towards More Accurate Semantic Evaluation for Visual Brain Decoding

Link: https://arxiv.org/abs/2503.06437

Authors: Juhyeon Park,Peter Yongho Kim,Jiook Cha,Shinjae Yoo,Taesup Moon

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords:

Notes: Under Review

261. 【2503.06435】OV-SCAN: Semantically Consistent Alignment for Novel Object Discovery in Open-Vocabulary 3D Object Detection

Link: https://arxiv.org/abs/2503.06435

Authors: Adrian Chow,Evelien Riddell,Yimu Wang,Sean Sedwards,Krzysztof Czarnecki

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes:

262. 【2503.06427】Pre-Training Meta-Rule Selection Policy for Visual Generative Abductive Learning

Link: https://arxiv.org/abs/2503.06427

Authors: Yu Jin,Jingming Liu,Zhexu Luo,Yifei Peng,Ziang Qin,Wang-Zhou Dai,Yao-Xiang Ding,Kun Zhou

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes: Published as a conference paper at IJCLR'24

263. 【2503.06426】Federated Learning for Diffusion Models

Link: https://arxiv.org/abs/2503.06426

Authors: Zihao Peng,Xijun Wang,Shengbo Chen,Hong Rao,Cong Shen

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)

Keywords: produce highly realistic, highly realistic samples, Diffusion models, produce highly, highly realistic

Notes:

Abstract:Diffusion models are powerful generative models that can produce highly realistic samples for various tasks. Typically, these models are constructed using centralized, independently and identically distributed (IID) training data. However, in practical scenarios, data is often distributed across multiple clients and frequently manifests non-IID characteristics. Federated Learning (FL) can leverage this distributed data to train diffusion models, but the performance of existing FL methods is unsatisfactory in non-IID scenarios. To address this, we propose FedDDPM-Federated Learning with Denoising Diffusion Probabilistic Models, which leverages the data generative capability of diffusion models to facilitate model training. In particular, the server uses well-trained local diffusion models uploaded by each client before FL training to generate auxiliary data that can approximately represent the global data distribution. Following each round of model aggregation, the server further optimizes the global model using the auxiliary dataset to alleviate the impact of heterogeneous data on model performance. We provide a rigorous convergence analysis of FedDDPM and propose an enhanced algorithm, FedDDPM+, to reduce training overheads. FedDDPM+ detects instances of slow model learning and performs a one-shot correction using the auxiliary dataset. Experimental results validate that our proposed algorithms outperform the state-of-the-art FL algorithms on the MNIST, CIFAR10 and CIFAR100 datasets.
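The "model aggregation" step mentioned in the abstract above can be sketched as a data-size-weighted average of client parameters, in the style of federated averaging. This is an assumption about the aggregation rule: the abstract does not specify it, and the server-side refinement on the auxiliary dataset is not shown. The function name `aggregate` and the flat-list parameter format are hypothetical.

```python
def aggregate(client_params, client_sizes):
    """Weighted average of client parameter vectors.

    client_params: one flat list of floats per client (hypothetical format).
    client_sizes: number of local training samples per client, used as weights.
    """
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(params[j] * n for params, n in zip(client_params, client_sizes)) / total
        for j in range(dim)
    ]


# Two clients; the second holds three times as much data as the first.
global_params = aggregate([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_params)
```

The client with more data pulls the global parameters toward its own values, which is one reason non-IID client data can bias the aggregated model.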

264. 【2503.06419】Consistent Image Layout Editing with Diffusion Models

Link: https://arxiv.org/abs/2503.06419

Authors: Tao Xia,Yudi Zhang,Ting Liu,Lei Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes:

265. 【2503.06415】Polygonal network disorder and the turning distance

Link: https://arxiv.org/abs/2503.06415

Authors: Alex Dolce,Ryan Lavelle,Bernard Scott,Ashlyn Urbanski,Joseph Klobusicky

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes:

266. 【2503.06399】FEDS: Feature and Entropy-Based Distillation Strategy for Efficient Learned Image Compression

Link: https://arxiv.org/abs/2503.06399

Authors: Haisheng Fu,Jie Liang,Zhenman Fang,Jingning Han

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Keywords:

Notes: 16 pages

267. 【2503.06397】Removing Averaging: Personalized Lip-Sync Driven Characters Based on Identity Adapter

Link: https://arxiv.org/abs/2503.06397

Authors: Yanyu Zhu,Licheng Bai,Jintao Xu,Jiwei Tang,Hai-tao Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes:

268. 【2503.06385】A Good Start Matters: Enhancing Continual Learning with Data-Driven Weight Initialization

Link: https://arxiv.org/abs/2503.06385

Authors: Md Yousuf Harun,Christopher Kanan

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes: Preprint

269. 【2503.06380】TI-JEPA: An Innovative Energy-based Joint Embedding Strategy for Text-Image Multimodal Systems

Link: https://arxiv.org/abs/2503.06380

Authors: Khang H. N. Vo,Duc P. T. Nguyen,Thong Nguyen,Tho T. Quan

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords:

Notes:

270. 【2503.06369】Spectral State Space Model for Rotation-Invariant Visual Representation Learning

Link: https://arxiv.org/abs/2503.06369

Authors: Sahar Dastani,Ali Bahri,Moslem Yazdanpanah,Mehrdad Noori,David Osowiechi,Gustavo Adolfo Vargas Hakim,Farzad Beizaee,Milad Cheraghalikhani,Arnab Kumar Mondal,Herve Lombaert,Christian Desrosiers

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes:

271. 【2503.06368】VORTEX: Challenging CNNs at Texture Recognition by using Vision Transformers with Orderless and Randomized Token Encodings

Link: https://arxiv.org/abs/2503.06368

Authors: Leonardo Scabini,Kallil M. Zielinski,Emir Konuk,Ricardo T. Fares,Lucas C. Ribas,Kevin Smith,Odemir M. Bruno

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords:

Notes:

272. 【2503.06364】Generative Video Bi-flow

Link: https://arxiv.org/abs/2503.06364

Authors: Chen Liu,Tobias Ritschel

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes:

273. 【2503.06362】Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs

Link: https://arxiv.org/abs/2503.06362

Authors: Umberto Cappellazzo,Minsu Kim,Stavros Petridis

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: speech recognition robustness, enhance speech recognition, Speech Recognition, Large Language Models, leverages both audio

Notes:

Abstract:Audio-Visual Speech Recognition (AVSR) leverages both audio and visual modalities to enhance speech recognition robustness, particularly in noisy environments. Recent advancements in Large Language Models (LLMs) have demonstrated their effectiveness in speech recognition, including AVSR. However, due to the significant length of speech representations, direct integration with LLMs imposes substantial computational costs. Prior approaches address this by compressing speech representations before feeding them into LLMs. However, higher compression ratios often lead to performance degradation, necessitating a trade-off between computational efficiency and recognition accuracy. To address this challenge, we propose Llama-MTSK, the first Matryoshka-based Multimodal LLM for AVSR, which enables flexible adaptation of the audio-visual token allocation based on specific computational constraints while preserving high performance. Our approach, inspired by Matryoshka Representation Learning, encodes audio-visual representations at multiple granularities within a single model, eliminating the need to train separate models for different compression levels. Moreover, to efficiently fine-tune the LLM, we introduce three LoRA-based Matryoshka strategies using global and scale-specific LoRA modules. Extensive evaluations on the two largest AVSR datasets demonstrate that Llama-MTSK achieves state-of-the-art results, matching or surpassing models trained independently at fixed compression levels.
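To make the idea of encoding representations "at multiple granularities" concrete, the sketch below compresses one token sequence at several rates and keeps every resulting view. Average pooling is used purely as an assumed compression operator, since the abstract does not say which operator Llama-MTSK uses. The names `average_pool` and `matryoshka_views` and the rates `(1, 2, 4)` are hypothetical.

```python
def average_pool(tokens, rate):
    """Compress a sequence of feature vectors by averaging each group of `rate` tokens."""
    pooled = []
    for start in range(0, len(tokens), rate):
        group = tokens[start:start + rate]
        dim = len(group[0])
        pooled.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return pooled


def matryoshka_views(tokens, rates=(1, 2, 4)):
    """Return one compressed view of the same sequence per compression rate."""
    return {rate: average_pool(tokens, rate) for rate in rates}


# Four toy 2-dimensional speech tokens.
tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
views = matryoshka_views(tokens)
print({rate: len(seq) for rate, seq in views.items()})
```

A single model trained on all of these views could then be given whichever view fits the compute budget at inference time, which is the trade-off the abstract describes.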

274. 【2503.06361】Adversarial Robustness of Discriminative Self-Supervised Learning in Vision

Link: https://arxiv.org/abs/2503.06361

Authors: Ömer Veysel Çağatan,Ömer Faruk Tal,M. Emre Gürsoy

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes: 53 pages

275. 【2503.06339】Learning to Unlearn while Retaining: Combating Gradient Conflicts in Machine Unlearning

Link: https://arxiv.org/abs/2503.06339

Authors: Gaurav Patel,Qiang Qiu

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes:

276. 【2503.06317】Accurate and Efficient Two-Stage Gun Detection in Video

Link: https://arxiv.org/abs/2503.06317

Authors: Badhan Chandra Das,M. Hadi Amini,Yanzhao Wu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes:

277. 【2503.06316】End-to-End Action Segmentation Transformer

Link: https://arxiv.org/abs/2503.06316

Authors: Tieqiao Wang,Sinisa Todorovic

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords:

Notes:

278. 【2503.06313】Advancing Autonomous Vehicle Intelligence: Deep Learning and Multimodal LLM for Traffic Sign Recognition and Robust Lane Detection

Link: https://arxiv.org/abs/2503.06313

Authors: Chandan Kumar Sah,Ankit Kumar Shaw,Xiaoli Lian,Arsalan Shahid Baig,Tuopu Wen,Kun Jiang,Mengmeng Yang,Diange Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)

Keywords: ensure safe navigation, Large Language Models, require reliable traffic, Multimodal Large Language, traffic sign recognition

Notes: 11 pages, 9 figures

279. 【2503.06312】GeoLangBind: Unifying Earth Observation with Agglomerative Vision-Language Foundation Models

Link: https://arxiv.org/abs/2503.06312

Authors: Zhitong Xiong,Yi Wang,Weikang Yu,Adam J Stewart,Jie Zhao,Nils Lehmann,Thomas Dujardin,Zhenghang Yuan,Pedram Ghamisi,Xiao Xiang Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes: code weights: [this https URL](https://github.com/xiong-zhitong/GeoLB-SigLIP)

280. 【2503.06310】Text2Story: Advancing Video Storytelling with Text Guidance

Link: https://arxiv.org/abs/2503.06310

Authors: Taewon Kang,Divya Kothandaraman,Ming C. Lin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes: 15 pages, 6 figures

281. 【2503.06307】ACAM-KD: Adaptive and Cooperative Attention Masking for Knowledge Distillation

Link: https://arxiv.org/abs/2503.06307

Authors: Qizhen Lan,Qing Tian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes: 8 pages, 10 tables, 3 figures

282. 【2503.06287】Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding

Link: https://arxiv.org/abs/2503.06287

Authors: Seil Kang,Jinyeong Kim,Junhyeok Kim,Seong Jae Hwang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords:

Notes:

283. 【2503.06282】From Dataset to Real-world: General 3D Object Detection via Generalized Cross-domain Few-shot Learning

Link: https://arxiv.org/abs/2503.06282

Authors: Shuangzhi Li,Junlong Shen,Lei Ma,Xingyu Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes:

284. 【2503.06277】STiL: Semi-supervised Tabular-Image Learning for Comprehensive Task-Relevant Information Exploration in Multimodal Classification

Link: https://arxiv.org/abs/2503.06277

Authors: Siyi Du,Xinzhe Luo,Declan P. O'Regan,Chen Qin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes: 16 pages (including 5 pages of supplementary materials), accepted by CVPR 2025

285. 【2503.06276】Exploring Adversarial Transferability between Kolmogorov-Arnold Networks

Link: https://arxiv.org/abs/2503.06276

Authors: Songping Wang,Xinquan Yue,Yueming Lyu,Caifeng Shan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes:

286. 【2503.06273】Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations

Link: https://arxiv.org/abs/2503.06273

Authors: Jeong Hun Yeo,Minsu Kim,Chae Won Kim,Stavros Petridis,Yong Man Ro

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords:

Notes:

287. 【2503.06271】SplatTalk: 3D VQA with Gaussian Splatting

Link: https://arxiv.org/abs/2503.06271

Authors: Anh Thai,Songyou Peng,Kyle Genova,Leonidas Guibas,Thomas Funkhouser

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes:

288. 【2503.06268】Get In Video: Add Anything You Want to the Video

Link: https://arxiv.org/abs/2503.06268

Authors: Shaobin Zhuang,Zhipeng Huang,Binxin Yang,Ying Zhang,Fangyikang Wang,Canmiao Fu,Chong Sun,Zheng-Jun Zha,Chen Li,Yali Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes: Project page: [this https URL](https://zhuangshaobin.github.io/GetInVideo-project/)

289. 【2503.06261】Segment Anything, Even Occluded

Link: https://arxiv.org/abs/2503.06261

Authors: Wei-En Tai,Yu-Lin Shih,Cheng Sun,Yu-Chiang Frank Wang,Hwann-Tzong Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes:

290. 【2503.06260】From Captions to Rewards (CAREVL): Leveraging Large Language Model Experts for Enhanced Reward Modeling in Large Vision-Language Models

Link: https://arxiv.org/abs/2503.06260

Authors: Muzhi Dai,Jiashuo Sun,Zhiyuan Zhao,Shixuan Liu,Rui Li,Junyu Gao,Xuelong Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords:

Notes:

291. 【2503.06252】Can Atomic Step Decomposition Enhance the Self-structured Reasoning of Multimodal Large Models?

Link: https://arxiv.org/abs/2503.06252

Authors: Kun Xiang,Zhili Liu,Zihao Jiang,Yunshuang Nie,Kaixin Cai,Yiyang Yin,Runhui Huang,Haoxiang Fan,Hanhui Li,Weiran Huang,Yihan Zeng,Yu-Jie Yuan,Jianhua Han,Lanqing Hong,Hang Xu,Xiaodan Liang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords:

Notes:

292. 【2503.06237】Rethinking Lanes and Points in Complex Scenarios for Monocular 3D Lane Detection

Link: https://arxiv.org/abs/2503.06237

Authors: Yifan Chang,Junjie Huang,Xiaofeng Wang,Yun Ye,Zhujin Liang,Yi Shan,Dalong Du,Xingang Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes: CVPR2025

293. 【2503.06236】Dynamically evolving segment anything model with continuous learning for medical image segmentation

Link: https://arxiv.org/abs/2503.06236

Authors: Zhaori Liu,Mengyang Li,Hu Han,Enli Zhang,Shiguang Shan,Zhiming Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes:

294. 【2503.06235】StreamGS: Online Generalizable Gaussian Splatting Reconstruction for Unposed Image Streams

Link: https://arxiv.org/abs/2503.06235

Authors: Yang LI,Jinglu Wang,Lei Chu,Xiao Li,Shiu-hong Kao,Ying-Cong Chen,Yan Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes: 8 pages

295. 【2503.06232】Integrating Chain-of-Thought for Multimodal Alignment: A Study on 3D Vision-Language Learning

Link: https://arxiv.org/abs/2503.06232

Authors: Yanjun Chen,Yirong Sun,Xinghao Chen,Jian Wang,Xiaoyu Shen,Wenjie Li,Wei Zhang

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: proven effective, effective in natural, remains underexplored, reasoning, CoT

Notes:

296. 【2503.06223】Reinforced Diffuser for Red Teaming Large Vision-Language Models

Link: https://arxiv.org/abs/2503.06223

Authors: Ruofan Wang,Xiang Zheng,Xiaosen Wang,Cong Wang,Xingjun Ma

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes:

297. 【2503.06222】Vision-based 3D Semantic Scene Completion via Capture Dynamic Representations

Link: https://arxiv.org/abs/2503.06222

Authors: Meng Wang,Fan Wu,Yunchuan Qin,Ruihui Li,Zhuo Tang,Kenli Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes:

298. 【2503.06220】StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition

Link: https://arxiv.org/abs/2503.06220

Authors: Xin Ding,Hao Wu,Yifan Yang,Shiqi Jiang,Donglin Bai,Zhibo Chen,Ting Cao

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords:

Notes:

299. 【2503.06219】VLScene: Vision-Language Guidance Distillation for Camera-Based 3D Semantic Scene Completion

Link: https://arxiv.org/abs/2503.06219

Authors: Meng Wang,Huilong Pi,Ruihui Li,Yunchuan Qin,Zhuo Tang,Kenli Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes: Accepted by AAAI-2025 (Oral)

300. 【2503.06201】Explainable Synthetic Image Detection through Diffusion Timestep Ensembling

Link: https://arxiv.org/abs/2503.06201

Authors: Yixin Wu,Feiran Zhang,Tianyuan Shi,Ruicheng Yin,Zhenghua Wang,Zhenliang Gan,Xiaohua Wang,Changze Lv,Xiaoqing Zheng,Xuanjing Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: posing significant security, significant security risks, Recent advances, deceptively real images, posing significant

Notes: 13 pages, 5 figures

301. 【2503.06200】Removing Multiple Hybrid Adverse Weather in Video via a Unified Model

Link: https://arxiv.org/abs/2503.06200

Authors: Yecong Wan,Mingwen Shao,Yuanshuo Cheng,Jun Shu,Shuigen Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: conditions typically suffer, weather, typically suffer, suffer from uncertain, degradation distributions

Notes:

Abstract:Videos captured under real-world adverse weather conditions typically suffer from uncertain hybrid weather artifacts with heterogeneous degradation distributions. However, existing algorithms only excel at specific single degradation distributions due to limited adaptation capacity and must handle different weather degradations with separately trained models, and thus may fail in real-world stochastic weather scenarios. Moreover, model training is infeasible due to the lack of paired video data characterizing the coexistence of multiple weather conditions. To address these issues, we propose a novel unified model, dubbed UniWRV, to remove multiple heterogeneous video weather degradations in an all-in-one fashion. Specifically, to tackle degenerate spatial feature heterogeneity, we propose a tailored weather prior guided module that queries exclusive priors for different instances as prompts to steer spatial feature characterization. To tackle degenerate temporal feature heterogeneity, we propose a dynamic routing aggregation module that can automatically select optimal fusion paths for different instances to dynamically integrate temporal features. Additionally, we construct a new synthetic video dataset, termed HWVideo, for learning and benchmarking multiple hybrid adverse weather removal, which contains 15 hybrid weather conditions with a total of 1500 adverse-weather/clean paired video clips. Real-world hybrid weather videos are also collected for evaluating model generalizability. Comprehensive experiments demonstrate that our UniWRV exhibits robust and superior adaptation capability in multiple heterogeneous degradation learning scenarios, including various generic video restoration tasks beyond weather removal.

302. 【2503.06196】NeuroADDA: Active Discriminative Domain Adaptation in Connectomic

Link: https://arxiv.org/abs/2503.06196

Authors: Shashata Sawmya,Thomas L. Athey,Gwyneth Liu,Nir Shavit

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes: 8 pages, 3 figures, 3 tables

303. 【2503.06187】MSConv: Multiplicative and Subtractive Convolution for Face Recognition

Link: https://arxiv.org/abs/2503.06187

Authors: Si Zhou,Yain-Whar Si,Xiaochen Yuan,Xiaofan Li,Xiaoxiang Liu,Xinyuan Zhang,Cong Lin,Xueyuan Gong

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords:

Notes:

304. 【2503.06186】PTDiffusion: Free Lunch for Generating Optical Illusion Hidden Pictures with Phase-Transferred Diffusion Model

Link: https://arxiv.org/abs/2503.06186

Authors: Xiang Gao,Shuai Yang,Jiaying Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025)

305. 【2503.06182】FORESCENE: FOREcasting human activity via latent SCENE graphs diffusion

Link: https://arxiv.org/abs/2503.06182

Authors: Antonio Alliegro,Francesca Pistilli,Tatiana Tommasi,Giuseppe Averta

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes:

306. 【2503.06179】ForestSplats: Deformable transient field for Gaussian Splatting in the Wild

Link: https://arxiv.org/abs/2503.06179

Authors: Wongi Park,Myeongseok Nam,Siwon Kim,Sangwoo Jo,Soomok Lee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes:

307. 【2503.06170】Object-Centric World Model for Language-Guided Manipulation

Link: https://arxiv.org/abs/2503.06170

Authors: Youngjoon Jeong,Junha Chun,Soonwoo Cha,Taesup Kim

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: driving and robotics, plan in domains, autonomous driving, language instructions, world model

Notes:

Abstract:A world model is essential for an agent to predict the future and plan in domains such as autonomous driving and robotics. To achieve this, recent advancements have focused on video generation, which has gained significant attention due to the impressive success of diffusion models. However, these models require substantial computational resources. To address these challenges, we propose a world model leveraging object-centric representation space using slot attention, guided by language instructions. Our model perceives the current state as an object-centric representation and predicts future states in this representation space conditioned on natural language instructions. This approach results in a more compact and computationally efficient model compared to diffusion-based generative alternatives. Furthermore, it flexibly predicts future states based on language instructions, and offers a significant advantage in manipulation tasks where object recognition is crucial. In this paper, we demonstrate that our latent predictive world model surpasses generative world models in visuo-linguo-motor control tasks, achieving superior sample and computation efficiency. We also investigate the generalization performance of the proposed method and explore various strategies for predicting actions using object-centric representations.

308. 【2503.06169】Treble Counterfactual VLMs: A Causal Approach to Hallucination

Link: https://arxiv.org/abs/2503.06169

Authors: Li Li,Jiashu Qu,Yuxiao Zhou,Yuehan Qin,Tiankai Yang,Yue Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords:

Notes:

309. 【2503.06163】VACT: A Video Automatic Causal Testing System and a Benchmark

Link: https://arxiv.org/abs/2503.06163

Authors: Haotong Yang,Qingyuan Zheng,Yunjian Gao,Yongkun Yang,Yangbo He,Zhouchen Lin,Muhan Zhang

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)

Keywords:

Notes:

310. 【2503.06161】Feature-EndoGaussian: Feature Distilled Gaussian Splatting in Surgical Deformable Scene Reconstruction

Link: https://arxiv.org/abs/2503.06161

Authors: Kai Li,Junhao Wang,William Han,Ding Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords:

Notes: 14 pages, 5 figures

311. 【2503.06157】UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces

Link: https://arxiv.org/abs/2503.06157

Authors: Baining Zhao,Jianjie Fang,Zichao Dai,Ziyou Wang,Jirong Zha,Weichen Zhang,Chen Gao,Yue Wang,Jinqiang Cui,Xinlei Chen,Yong Li

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords:

Notes: 22 pages

312. 【2503.06154】SRM-Hair: Single Image Head Mesh Reconstruction via 3D Morphable Hair

Link: https://arxiv.org/abs/2503.06154

Authors: Zidu Wang,Jiankuo Zhao,Miao Xu,Xiangyu Zhu,Zhen Lei

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes: Under review

313. 【2503.06151】BioMoDiffuse: Physics-Guided Biomechanical Diffusion for Controllable and Authentic Human Motion Synthesis

Link: https://arxiv.org/abs/2503.06151

Authors: Zixi Kang,Xinghan Wang,Yadong Mu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes:

314. 【2503.06146】OpenRSD: Towards Open-prompts for Object Detection in Remote Sensing Images

Link: https://arxiv.org/abs/2503.06146

Authors: Ziyue Huang,Yongchao Feng,Shuai Yang,Ziqi Liu,Qingjie Liu,Yunhong Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes: 11 pages, 4 figures

315. 【2503.06142】VLForgery Face Triad: Detection, Localization and Attribution via Multimodal Large Language Models

Link: https://arxiv.org/abs/2503.06142

Authors: Xinan He,Yue Zhou,Bing Fan,Bin Li,Guopu Zhu,Feng Ding

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes:

316. 【2503.06141】Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model

Link: https://arxiv.org/abs/2503.06141

Authors: Mingxing Li,Rui Wang,Lei Sun,Yancheng Bai,Xiangxiang Chu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords:

Notes:

317. 【2503.06140】Boosting the Local Invariance for Better Adversarial Transferability

Link: https://arxiv.org/abs/2503.06140

Authors: Bohan Liu, Xiaosen Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: directly targeting victim, targeting victim models, pose a significant, significant threat, threat to real-world

Abstract: Transfer-based attacks pose a significant threat to real-world applications by directly targeting victim models with adversarial examples generated on surrogate models. While numerous approaches have been proposed to enhance adversarial transferability, existing works often overlook the intrinsic relationship between adversarial perturbations and input images. In this work, we find that adversarial perturbation often exhibits poor translation invariance for a given clean image and model, which is attributed to local invariance. Through empirical analysis, we demonstrate that there is a positive correlation between the local invariance of adversarial perturbations w.r.t. the input image and their transferability across different models. Based on this finding, we propose a general adversarial transferability boosting technique called Local Invariance Boosting approach (LI-Boost). Extensive experiments on the standard ImageNet dataset demonstrate that LI-Boost could significantly boost various types of transfer-based attacks (e.g., gradient-based, input transformation-based, model-related, advanced objective function, ensemble, etc.) on CNNs, ViTs, and defense mechanisms. Our approach presents a promising direction for future research in improving adversarial transferability across different models.

318. 【2503.06136】GSV3D: Gaussian Splatting-based Geometric Distillation with Stable Video Diffusion for Single-Image 3D Object Generation

Link: https://arxiv.org/abs/2503.06136

Authors: Ye Tao, Jiawei Zhang, Yahao Shi, Dongqing Zou, Bin Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

319. 【2503.06134】X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation

Link: https://arxiv.org/abs/2503.06134

Authors: Jian Ma, Qirong Peng, Xu Guo, Chen Chen, Haonan Lu, Zhenyu Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Comments: [this https URL](https://github.com/OPPO-Mente-Lab/X2I)

320. 【2503.06132】USP: Unified Self-Supervised Pretraining for Image Generation and Understanding

Link: https://arxiv.org/abs/2503.06132

Authors: Xiangxiang Chu, Renda Li, Yong Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

321. 【2503.06129】Viewport-Unaware Blind Omnidirectional Image Quality Assessment: A Flexible and Effective Paradigm

Link: https://arxiv.org/abs/2503.06129

Authors: Jiebin Yan, Kangcheng Wu, Junjie Chen, Ziwen Tan, Yuming Fang

Categories: Computer Vision and Pattern Recognition (cs.CV)

322. 【2503.06118】SecureGS: Boosting the Security and Fidelity of 3D Gaussian Splatting Steganography

Link: https://arxiv.org/abs/2503.06118

Authors: Xuanyu Zhang, Jiarui Meng, Zhipei Xu, Shuzhou Yang, Yanmin Wu, Ronggang Wang, Jian Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Comments: Accepted by ICLR 2025

323. 【2503.06117】NeuraLoc: Visual Localization in Neural Implicit Map with Dual Complementary Features

Link: https://arxiv.org/abs/2503.06117

Authors: Hongjia Zhai, Boming Zhao, Hai Li, Xiaokun Pan, Yijia He, Zhaopeng Cui, Hujun Bao, Guofeng Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Comments: ICRA 2025

324. 【2503.06107】Feature Fusion Attention Network with CycleGAN for Image Dehazing, De-Snowing and De-Raining

Link: https://arxiv.org/abs/2503.06107

Authors: Akshat Jain

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

325. 【2503.06106】Vision-aware Multimodal Prompt Tuning for Uploadable Multi-source Few-shot Domain Adaptation

Link: https://arxiv.org/abs/2503.06106

Authors: Kuanghong Liu, Jin Wang, Kangjian He, Dan Xu, Xuejie Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Comments: Accepted by AAAI 2025

326. 【2503.06104】Handwritten Digit Recognition: An Ensemble-Based Approach for Superior Performance

Link: https://arxiv.org/abs/2503.06104

Authors: Syed Sajid Ullah, Li Gang, Mudassir Riaz, Ahsan Ashfaq, Salman Khan, Sajawal Khan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: postal code reading, Convolutional Neural Networks, combines Convolutional Neural, computer vision, document digitization

Comments: 11 pages, 6 figures

Abstract: Handwritten digit recognition remains a fundamental challenge in computer vision, with applications ranging from postal code reading to document digitization. This paper presents an ensemble-based approach that combines Convolutional Neural Networks (CNNs) with traditional machine learning techniques to improve recognition accuracy and robustness. We evaluate our method on the MNIST dataset, comprising 70,000 handwritten digit images. Our hybrid model, which uses CNNs for feature extraction and Support Vector Machines (SVMs) for classification, achieves an accuracy of 99.30%. We also explore the effectiveness of data augmentation and various ensemble techniques in enhancing model performance. Our results demonstrate that this approach not only achieves high accuracy but also shows improved generalization across diverse handwriting styles. The findings contribute to the development of more reliable handwritten digit recognition systems and highlight the potential of combining deep learning with traditional machine learning methods in pattern recognition tasks.

327. 【2503.06100】Patch-Depth Fusion: Dichotomous Image Segmentation via Fine-Grained Patch Strategy and Depth Integrity-Prior

Link: https://arxiv.org/abs/2503.06100

Authors: Xianjie Liu, Keren Fu, Qijun Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

328. 【2503.06094】PointDiffuse: A Dual-Conditional Diffusion Model for Enhanced Point Cloud Semantic Segmentation

Link: https://arxiv.org/abs/2503.06094

Authors: Yong He, Hongshan Yu, Mingtao Feng, Tongjia Chen, Zechuan Li, Anwaar Ulhaq, Saeed Anwar, Ajmal Saeed Mian

Categories: Computer Vision and Pattern Recognition (cs.CV)

Comments: 8 pages, 3 figures, 7 tables

329. 【2503.06092】ZO-DARTS++: An Efficient and Size-Variable Zeroth-Order Neural Architecture Search Algorithm

Link: https://arxiv.org/abs/2503.06092

Authors: Lunchen Xie, Eugenio Lomurno, Matteo Gambella, Danilo Ardagna, Manuel Roveri, Matteo Matteucci, Qingjiang Shi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Comments: 14 pages, 8 figures

330. 【2503.06089】Fish2Mesh Transformer: 3D Human Mesh Recovery from Egocentric Vision

Link: https://arxiv.org/abs/2503.06089

Authors: David C. Jeong, Aditya Puranik, James Vong, Vrushabh Abhijit Deogirikar, Ryan Fell, Julianna Dietrich, Maria Kyrarini, Christopher Kitts

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

331. 【2503.06084】Exploring Interpretability for Visual Prompt Tuning with Hierarchical Concepts

Link: https://arxiv.org/abs/2503.06084

Authors: Yubin Wang, Xinyang Jiang, De Cheng, Xiangqian Zhao, Zilong Wang, Dongsheng Li, Cairong Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Comments: 10 pages, 9 figures

332. 【2503.06073】GEM: Empowering MLLM for Grounded ECG Understanding with Time Series and Images

Link: https://arxiv.org/abs/2503.06073

Authors: Xiang Lan, Feng Wu, Kai He, Qinghao Zhao, Shenda Hong, Mengling Feng

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

333. 【2503.06071】TransParking: A Dual-Decoder Transformer Framework with Soft Localization for End-to-End Automatic Parking

Link: https://arxiv.org/abs/2503.06071

Authors: Hangyu Du, Chee-Meng Chew

Categories: Computer Vision and Pattern Recognition (cs.CV)

334. 【2503.06064】A Novel Trustworthy Video Summarization Algorithm Through a Mixture of LoRA Experts

Link: https://arxiv.org/abs/2503.06064

Authors: Wenzhuo Du, Gerun Wang, Guancheng Chen, Hang Zhao, Xin Li, Jian Gao

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

335. 【2503.06063】Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices

Link: https://arxiv.org/abs/2503.06063

Authors: Junyan Lin, Haoran Chen, Yue Fan, Yingqi Fan, Xin Jin, Hui Su, Jinlan Fu, Xiaoyu Shen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Comments: Accepted by CVPR 2025

336. 【2503.06060】STAR: A Foundation Model-driven Framework for Robust Task Planning and Failure Recovery in Robotic Systems

Link: https://arxiv.org/abs/2503.06060

Authors: Md Sadman Sakib, Yu Sun

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

337. 【2503.06056】Pathological Prior-Guided Multiple Instance Learning For Mitigating Catastrophic Forgetting in Breast Cancer Whole Slide Image Classification

Link: https://arxiv.org/abs/2503.06056

Authors: Weixi Zheng, Aoling Huang, Jingping Yuan, Haoyu Zhao, Zhou Zhao, Yongchao Xu, Thierry Géraud

Categories: Computer Vision and Pattern Recognition (cs.CV)

Comments: ICASSP 2025 (Oral)

338. 【2503.06053】DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation

Link: https://arxiv.org/abs/2503.06053

Authors: Runze Zhang, Guoguang Du, Xiaochuan Li, Qi Jia, Liang Jin, Lu Liu, Jingjing Wang, Cong Xu, Zhenhua Guo, Yaqian Zhao, Xiaoli Gong, Rengang Li, Baoyu Fan

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

339. 【2503.06042】Improving SAM for Camouflaged Object Detection via Dual Stream Adapters

Link: https://arxiv.org/abs/2503.06042

Authors: Jiaming Liu, Linghe Kong, Guihai Chen

Categories: Computer Vision and Pattern Recognition (cs.CV)

340. 【2503.06038】A Label-Free High-Precision Residual Moveout Picking Method for Travel Time Tomography based on Deep Learning

Link: https://arxiv.org/abs/2503.06038

Authors: Hongtao Wang, Jiandong Liang, Lei Wang, Shuaizhe Liang, Jinping Zhu, Chunxia Zhang, Jiangshe Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

341. 【2503.06030】Towards Universal Text-driven CT Image Segmentation

Link: https://arxiv.org/abs/2503.06030

Authors: Yuheng Li, Yuxiang Lai, Maria Thor, Deborah Marshall, Zachary Buchwald, David S. Yu, Xiaofeng Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

342. 【2503.06026】Zero-Shot Peg Insertion: Identifying Mating Holes and Estimating SE(2) Poses with Vision-Language Models

Link: https://arxiv.org/abs/2503.06026

Authors: Masaru Yajima, Kei Ota, Asako Kanezaki, Rei Kawakami

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Comments: Under submission

343. 【2503.06019】GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices

Link: https://arxiv.org/abs/2503.06019

Authors: Xudong Lu, Yinghao Chen, Renshou Wu, Haohao Gao, Xi Chen, Xue Yang, Xiangyu Zhao, Aojun Zhou, Fangyuan Li, Yafei Wen, Xiaoxin Chen, Shuai Ren, Hongsheng Li

Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Comments: 14 pages

344. 【2503.06014】Towards Ambiguity-Free Spatial Foundation Model: Rethinking and Decoupling Depth Ambiguity

Link: https://arxiv.org/abs/2503.06014

Authors: Xiaohao Xu, Feng Xue, Xiang Li, Haowei Li, Shusheng Yang, Tianyi Zhang, Matthew Johnson-Roberson, Xiaonan Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)

Comments: 32 pages, 31 figures, GitHub repo: [this https URL](https://github.com/Xiaohao-Xu/Ambiguity-in-Space)

345. 【2503.06012】End-to-End HOI Reconstruction Transformer with Graph-based Encoding

Link: https://arxiv.org/abs/2503.06012

Authors: Zhenrong Wang, Qi Zheng, Sihan Ma, Maosheng Ye, Yibing Zhan, Dongjiang Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

346. 【2503.06003】Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models

Link: https://arxiv.org/abs/2503.06003

Authors: Md Azim Khan, Aryya Gangopadhyay, Jianwu Wang, Robert F. Erbacher

Categories: Computer Vision and Pattern Recognition (cs.CV)

Comments: 8 pages, 4 figures

347. 【2503.05978】MagicInfinite: Generating Infinite Talking Videos with Your Words and Voice

Link: https://arxiv.org/abs/2503.05978

Authors: Hongwei Yi, Tian Ye, Shitong Shao, Xuancheng Yang, Jiantong Zhao, Hanzhong Guo, Terrance Wang, Qingyu Yin, Zeke Xie, Lei Zhu, Wei Li, Michael Lingelbach, Daquan Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Comments: MagicInfinite is publicly accessible at [this https URL](https://www.hedra.com/). More examples are at [this https URL](https://magicinfinite.github.io/)

348. 【2503.05977】Is Your Video Language Model a Reliable Judge?

Link: https://arxiv.org/abs/2503.05977

Authors: Ming Liu, Wensheng Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

349. 【2503.05962】OSCAR: Object Status and Contextual Awareness for Recipes to Support Non-Visual Cooking

Link: https://arxiv.org/abs/2503.05962

Authors: Franklin Mingzhe Li, Kaitlyn Ng, Bin Zhu, Patrick Carrington

Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)

Comments: CHI 2025 Late Breaking Work

350. 【2503.05949】Bayesian Fields: Task-driven Open-Set Semantic Gaussian Splatting

Link: https://arxiv.org/abs/2503.05949

Authors: Dominic Maggio, Luca Carlone

Categories: Computer Vision and Pattern Recognition (cs.CV)

351. 【2503.05936】CASP: Compression of Large Multimodal Models Based on Attention Sparsity

Link: https://arxiv.org/abs/2503.05936

Authors: Mohsen Gholami, Mohammad Akbari, Kevin Cannons, Yong Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

352. 【2503.05911】Generalizable Image Repair for Robust Visual Autonomous Racing

Link: https://arxiv.org/abs/2503.05911

Authors: Carson Sobolewski, Zhenjiang Mao, Kshitij Vejre, Ivan Ruchkin

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Comments: 8 pages, 4 figures, submitted to 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025)

353. 【2503.05850】Encrypted Vector Similarity Computations Using Partially Homomorphic Encryption: Applications and Performance Analysis

Link: https://arxiv.org/abs/2503.05850

Authors: Sefik Serengil, Alper Ozpinar

Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

354. 【2503.05839】Enhancing AUTOSAR-Based Firmware Over-the-Air Updates in the Automotive Industry with a Practical Implementation on a Steering System

Link: https://arxiv.org/abs/2503.05839

Authors: Mostafa Ahmed Mostafa Ahmed, Mohamed Khaled Mohamed Elsayed, Radwa Waheed Ezzat Abdelmohsen

Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)

Comments: Bachelor's thesis

355. 【2503.05837】Randomized based restricted kernel machine for hyperspectral image classification

Link: https://arxiv.org/abs/2503.05837

Authors: A. Quadir, M. Tanveer

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

356. 【2503.07491】NeAS: 3D Reconstruction from X-ray Images using Neural Attenuation Surface

Link: https://arxiv.org/abs/2503.07491

Authors: Chengrui Zhu, Ryoichi Ishikawa, Masataka Kagesawa, Tomohisa Yuzawa, Toru Watsuji, Takeshi Oishi

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

357. 【2503.07369】Skelite: Compact Neural Networks for Efficient Iterative Skeletonization

Link: https://arxiv.org/abs/2503.07369

Authors: Luis D. Reyes Vargas, Martin J. Menten, Johannes C. Paetzold, Nassir Navab, Mohammad Farid Azampour

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

358. 【2503.07248】AI-Driven Automated Tool for Abdominal CT Body Composition Analysis in Gastrointestinal Cancer Management

Link: https://arxiv.org/abs/2503.07248

Authors: Xinyu Nan, Meng He, Zifan Chen, Bin Dong, Lei Tang, Li Zhang

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

359. 【2503.07177】The 4D Human Embryonic Brain Atlas: spatiotemporal atlas generation for rapid anatomical changes using first-trimester ultrasound from the Rotterdam Periconceptional Cohort

Link: https://arxiv.org/abs/2503.07177

Authors: Wietske A.P. Bastiaansen, Melek Rousian, Anton H.J. Koning, Wiro J. Niessen, Bernadette S. de Bakker, Régine P.M. Steegers-Theunissen, Stefan Klein

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)

360. 【2503.07104】Global Context Is All You Need for Parallel Efficient Tractography Parcellation

Link: https://arxiv.org/abs/2503.07104

Authors: Valentin von Bornhaupt, Johannes Grün, Justus Bisten, Tobias Bauer, Theodor Rüber, Thomas Schultz

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)

Comments: 8 pages, 2 pages references, 3 figures, 2 tables

361. 【2503.07097】A Comprehensive Survey on Magnetic Resonance Image Reconstruction

Link: https://arxiv.org/abs/2503.07097

Authors: Xiaoyan Kui, Zijie Fan, Zexin Ji, Qinsong Li, Chengtao Liu, Weixin Si, Beiji Zou

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

362. 【2503.06945】Dynamic Cross-Modal Feature Interaction Network for Hyperspectral and LiDAR Data Classification

Link: https://arxiv.org/abs/2503.06945

Authors: Junyan Lin, Feng Gao, Lin Qi, Junyu Dong, Qian Du, Xinbo Gao

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Comments: Accepted by IEEE TGRS 2025

363. 【2503.06919】CAFusion: Controllable Anatomical Synthesis of Perirectal Lymph Nodes via SDF-guided Diffusion

Link: https://arxiv.org/abs/2503.06919

Authors: Weidong Guo, Hantao Zhang, Shouhong Wan, Bingbing Zou, Wanqin Wang, Chenyang Qiu, Peiquan Jin

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

364. 【2503.06828】Towards a Multimodal MRI-Based Foundation Model for Multi-Level Feature Exploration in Segmentation, Molecular Subtyping, and Grading of Glioma

Link: https://arxiv.org/abs/2503.06828

Authors: Somayeh Farahani, Marjaneh Hejazi, Antonio Di Ieva, Emad Fatemizadeh, Sidong Liu

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

365. 【2503.06827】Two-stage Deep Denoising with Self-guided Noise Attention for Multimodal Medical Images

Link: https://arxiv.org/abs/2503.06827

Authors: S M A Sharif, Rizwan Ali Naqvi, Woong-Kee Loh

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Comments: IEEE Transactions on Radiation and Plasma Medical Sciences (2024)

366. 【2503.06816】Semi-Supervised Medical Image Segmentation via Knowledge Mining from Large Models

Link: https://arxiv.org/abs/2503.06816

Authors: Yuchen Mao, Hongwei Li, Yinyi Lai, Giorgos Papanastasiou, Peng Qi, Yunjie Yang, Chengjia Wang

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Comments: 18 pages, 2 figures

367. 【2503.06809】Interactive Tumor Progression Modeling via Sketch-Based Image Editing

Link: https://arxiv.org/abs/2503.06809

Authors: Gexin Huang, Ruinan Jin, Yucheng Tang, Can Zhao, Tatsuya Harada, Xiaoxiao Li, Gu Lin

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Comments: 9 pages, 4 figures

368. 【2503.06743】X-GAN: A Generative AI-Powered Unsupervised Model for High-Precision Segmentation of Retinal Main Vessels toward Early Detection of Glaucoma

Link: https://arxiv.org/abs/2503.06743

Authors: Cheng Huang, Weizheng Xie, Tsengdar J. Lee, Jui-Kai Wang, Karanjit Kooner, Jia Zhang

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Comments: 11 pages, 8 figures

369. 【2503.06686】ImplicitCell: Resolution Cell Modeling of Joint Implicit Volume Reconstruction and Pose Refinement in Freehand 3D Ultrasound

Link: https://arxiv.org/abs/2503.06686

Authors: Sheng Song, Yiting Chen, Duo Xu, Songhan Ge, Yunqian Huang, Junni Shi, Man Chen, Hongbo Chen, Rui Zheng

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

370. 【2503.06563】LSA: Latent Style Augmentation Towards Stain-Agnostic Cervical Cancer Screening

Link: https://arxiv.org/abs/2503.06563

Authors: Jiangdong Cai, Haotian Jiang, Zhenrong Shen, Yonghao Li, Honglin Xiong, Lichi Zhang, Qian Wang

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

371. 【2503.06382】X-LRM: X-ray Large Reconstruction Model for Extremely Sparse-View Computed Tomography Recovery in One Second

Link: https://arxiv.org/abs/2503.06382

Authors: Guofeng Zhang, Ruyi Zha, Hao He, Yixun Liang, Alan Yuille, Hongdong Li, Yuanhao Cai

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Comments: A large reconstruction model and the largest dataset (16K samples) for sparse-view CT recovery

372. 【2503.06321】Enhanced Pediatric Dental Segmentation Using a Custom SegUNet with VGG19 Backbone on Panoramic Radiographs

Link: https://arxiv.org/abs/2503.06321

Authors: Md Ohiduzzaman Ovi, Maliha Sanjana, Fahad Fahad, Mahjabin Runa, Zarin Tasnim Rothy, Tanmoy Sarkar Pias, A.M. Tayeful Islam, Rumman Ahmed Prodhan

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

373. 【2503.06190】Attention on the Wires (AttWire): A Foundation Model for Detecting Devices and Catheters in X-ray Fluoroscopic Images

Link: https://arxiv.org/abs/2503.06190

Authors: YingLiang Ma, Sandra Howell, Aldo Rinaldi, Tarv Dhanjal, Kawal S. Rhode

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

374. 【2503.06125】RGB-Phase Speckle: Cross-Scene Stereo 3D Reconstruction via Wrapped Pre-Normalization

Link: https://arxiv.org/abs/2503.06125

Authors: Kai Yang, Zijian Bai, Yang Xiao, Xinyu Li, Xiaohan Shi

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Comments: Submitted to ICCV 2025

375. 【2503.06114】Pathology-Guided AI System for Accurate Segmentation and Diagnosis of Cervical Spondylosis

Link: https://arxiv.org/abs/2503.06114

Authors: Qi Zhang, Xiuyuan Chen, Ziyi He, Lianming Wu, Kun Wang, Jianqi Sun, Hongxing Shen

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

376. 【2503.05991】GrInAdapt: Scaling Retinal Vessel Structural Map Segmentation Through Grounding, Integrating and Adapting Multi-device, Multi-site, and Multi-modal Fundus Domains

Link: https://arxiv.org/abs/2503.05991

Authors: Zixuan Liu, Aaron Honjaya, Yuekai Xu, Yi Zhang, Hefu Pan, Xin Wang, Linda G Shapiro, Sheng Wang, Ruikang K Wang

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

377. 【2503.05990】HealthiVert-GAN: A Novel Framework of Pseudo-Healthy Vertebral Image Synthesis for Interpretable Compression Fracture Grading

Link: https://arxiv.org/abs/2503.05990

Authors: Qi Zhang, Shunan Zhang, Ziqi Zhao, Kun Wang, Jun Xu, Jianqi Sun

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

378. 【2503.05974】LapLoss: Laplacian Pyramid-based Multiscale loss for Image Translation

Link: https://arxiv.org/abs/2503.05974

Authors: Krish Didwania, Ishaan Gakhar, Prakhar Arya, Sanskriti Labroo

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Comments: Accepted at the DeLTa Workshop, ICLR 2025

379. 【2503.05933】Beyond H&E: Unlocking Pathological Insights with Polarization via Self-supervised Learning

Link: https://arxiv.org/abs/2503.05933

Authors: Yao Du, Jiaxin Zhuang, Xiaoyu Zheng, Jing Cong, Limei Guo, Chao He, Lin Luo, Xiaomeng Li

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

380. 【2503.05916】SAS: Segment Anything Small for Ultrasound -- A Non-Generative Data Augmentation Technique for Robust Deep Learning in Ultrasound Imaging

Link: https://arxiv.org/abs/2503.05916

Authors: Danielle L. Ferreira, Ahana Gangopadhyay, Hsi-Ming Chang, Ravi Soni, Gopal Avinash

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Comments: 25 pages, 8 figures

381. 【2503.05843】Decadal analysis of sea surface temperature patterns, climatology, and anomalies in temperate coastal waters with Landsat-8 TIRS observations

Link: https://arxiv.org/abs/2503.05843

Authors: Yiqing Guo, Nagur Cherukuru, Eric Lehmann, Xiubin Qi, Mark Doubell, S. L. Kesav Unnithan, Ming Feng

Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP); Geophysics (physics.geo-ph)

Comments: Submitted to GIScience Remote Sensing

382. 【2503.05802】Illuminant and light direction estimation using Wasserstein distance method

Link: https://arxiv.org/abs/2503.05802

Authors: Selcuk Yazar

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)