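This post presents the daily list of new papers fetched from the arXiv preprint site, grouped into categories such as natural language processing, information retrieval, and computer vision.
Statistics
1154 papers were updated today, including:
- Natural language processing: 154
- Information retrieval: 14
- Computer vision: 383
Natural Language Processing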
1. 【2503.07605】SEAP: Training-free Sparse Expert Activation Pruning Unlock the Brainpower of Large Language Models
Link: https://arxiv.org/abs/2503.07605
Authors: Xun Liang, Hanyu Wang, Huayi Lai, Simin Niu, Shichao Song, Jiawei Yang, Jihao Zhao, Feiyu Xiong, Bo Tang, Zhiyu Li
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, natural language processing, Large Language, achieved remarkable success, language processing tasks
Comments: 15 pages, 7 figures, 8 tables
Abstract:Large Language Models have achieved remarkable success across various natural language processing tasks, yet their high computational cost during inference remains a major bottleneck. This paper introduces Sparse Expert Activation Pruning (SEAP), a training-free pruning method that selectively retains task-relevant parameters to reduce inference overhead. Inspired by the clustering patterns of hidden states and activations in LLMs, SEAP identifies task-specific expert activation patterns and prunes the model while preserving task performance and enhancing computational efficiency. Experimental results demonstrate that SEAP significantly reduces computational overhead while maintaining competitive accuracy. Notably, at 50% pruning, SEAP surpasses both WandA and FLAP by over 20%, and at 20% pruning, it incurs only a 2.2% performance drop compared to the dense model. These findings highlight SEAP's scalability and effectiveness, making it a promising approach for optimizing large-scale LLMs.
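The idea of keeping only task-relevant parameters can be illustrated with a small activation-based pruning sketch. This is not the SEAP algorithm itself: the scoring rule (mean absolute activation per neuron over task samples) and the names `prune_mask`, `task_activations`, and `prune_ratio` are assumptions chosen for illustration.

```python
def prune_mask(task_activations, prune_ratio):
    """Return a keep mask over neurons (True = keep, False = prune).

    task_activations: list of samples, each a list of per-neuron activations
    collected while the model processes inputs from one task (hypothetical data).
    prune_ratio: fraction of neurons to drop, e.g. 0.5 for 50% pruning.
    """
    n_neurons = len(task_activations[0])
    n_samples = len(task_activations)
    # Assumed importance score: mean absolute activation on this task.
    scores = [
        sum(abs(sample[j]) for sample in task_activations) / n_samples
        for j in range(n_neurons)
    ]
    n_drop = int(n_neurons * prune_ratio)
    # Drop the neurons with the lowest scores for this task.
    order = sorted(range(n_neurons), key=lambda j: scores[j])
    dropped = set(order[:n_drop])
    return [j not in dropped for j in range(n_neurons)]


acts = [[0.1, 2.0, -0.05, 1.5], [0.2, -1.8, 0.0, 1.1]]
mask = prune_mask(acts, 0.5)
print(mask)
```

Because the scores come from task-specific inputs, a different task would generally yield a different mask, which is the intuition behind task-dependent sparsity.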
2. 【2503.07604】Implicit Reasoning in Transformers is Reasoning through Shortcuts
Link: https://arxiv.org/abs/2503.07604
Authors: Tianhe Lin, Jian Xie, Siyu Yuan, Deqing Yang
Categories: Computation and Language (cs.CL)
Keywords: enhancing language models', language models' complex, models' complex multi-step, Test-time compute, implicit reasoning
Comments:
Abstract:Test-time compute is emerging as a new paradigm for enhancing language models' complex multi-step reasoning capabilities, as demonstrated by the success of OpenAI's o1 and o3, as well as DeepSeek's R1. Compared to explicit reasoning in test-time compute, implicit reasoning is more inference-efficient, requiring fewer generated tokens. However, why does the advanced reasoning capability fail to emerge in the implicit reasoning style? In this work, we train GPT-2 from scratch on a curated multi-step mathematical reasoning dataset and conduct analytical experiments to investigate how language models perform implicit reasoning in multi-step tasks. Our findings reveal: 1) Language models can perform step-by-step reasoning and achieve high accuracy in both in-domain and out-of-domain tests via implicit reasoning. However, this capability only emerges when trained on fixed-pattern data. 2) Conversely, implicit reasoning abilities emerging from training on unfixed-pattern data tend to overfit a specific pattern and fail to generalize further. Notably, this limitation is also observed in state-of-the-art large language models. These findings suggest that language models acquire implicit reasoning through shortcut learning, enabling strong performance on tasks with similar patterns while lacking generalization.
3. 【2503.07595】Detection Avoidance Techniques for Large Language Models
Link: https://arxiv.org/abs/2503.07595
Authors: Sinclair Schneider, Florian Steuber, Joao A. G. Schneider, Gabi Dreo Rodosek
Categories: Computation and Language (cs.CL)
Keywords: systematically spreading fake, large language models, brought various risks, including the potential, increasing popularity
Comments:
Abstract:The increasing popularity of large language models has not only led to widespread use but has also brought various risks, including the potential for systematically spreading fake news. Consequently, the development of classification systems such as DetectGPT has become vital. These detectors are vulnerable to evasion techniques, as demonstrated in an experimental series: Systematic changes of the generative models' temperature proved shallow learning-detectors to be the least reliable. Fine-tuning the generative model via reinforcement learning circumvented BERT-based-detectors. Finally, rephrasing led to a 90% evasion of zero-shot-detectors like DetectGPT, although texts stayed highly similar to the original. A comparison with existing work highlights the better performance of the presented methods. Possible implications for society and further research are discussed.
4. 【2503.07575】VisBias: Measuring Explicit and Implicit Social Biases in Vision Language Models
Link: https://arxiv.org/abs/2503.07575
Authors: Jen-tse Huang, Jiantong Qin, Jianping Zhang, Youliang Yuan, Wenxuan Wang, Jieyu Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: social biases exhibited, implicit social biases, research investigates, exhibited by Vision-Language, Vision-Language Models
Comments: 9 pages
Abstract:This research investigates both explicit and implicit social biases exhibited by Vision-Language Models (VLMs). The key distinction between these bias types lies in the level of awareness: explicit bias refers to conscious, intentional biases, while implicit bias operates subconsciously. To analyze explicit bias, we directly pose questions to VLMs related to gender and racial differences: (1) Multiple-choice questions based on a given image (e.g., "What is the education level of the person in the image?") (2) Yes-No comparisons using two images (e.g., "Is the person in the first image more educated than the person in the second image?") For implicit bias, we design tasks where VLMs assist users but reveal biases through their responses: (1) Image description tasks: Models are asked to describe individuals in images, and we analyze disparities in textual cues across demographic groups. (2) Form completion tasks: Models draft a personal information collection form with 20 attributes, and we examine correlations among selected attributes for potential biases. We evaluate Gemini-1.5, GPT-4V, GPT-4o, LLaMA-3.2-Vision and LLaVA-v1.6. Our code and data are publicly available at this https URL.
5. 【2503.07572】Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning
Link: https://arxiv.org/abs/2503.07572
Authors: Yuxiao Qu, Matthew Y. R. Yang, Amrith Setlur, Lewis Tunstall, Edward Emanuel Beeching, Ruslan Salakhutdinov, Aviral Kumar
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: test-time compute, crucial for improving, optimizing test-time compute, test-time, compute
Comments:
Abstract:Training models to effectively use test-time compute is crucial for improving the reasoning performance of LLMs. Current methods mostly do so via fine-tuning on search traces or running RL with 0/1 outcome reward, but do these approaches efficiently utilize test-time compute? Would these approaches continue to scale as the budget improves? In this paper, we try to answer these questions. We formalize the problem of optimizing test-time compute as a meta-reinforcement learning (RL) problem, which provides a principled perspective on spending test-time compute. This perspective enables us to view the long output stream from the LLM as consisting of several episodes run at test time and leads us to use a notion of cumulative regret over output tokens as a way to measure the efficacy of test-time compute. Akin to how RL algorithms can best trade off exploration and exploitation over training, minimizing cumulative regret would also provide the best balance between exploration and exploitation in the token stream. While we show that state-of-the-art models do not minimize regret, one can do so by maximizing a dense reward bonus in conjunction with the outcome 0/1 reward RL. This bonus is the "progress" made by each subsequent block in the output stream, quantified by the change in the likelihood of eventual success. Using these insights, we develop Meta Reinforcement Fine-Tuning, or MRT, a new class of fine-tuning methods for optimizing test-time compute. MRT leads to a 2-3x relative gain in performance and roughly a 1.5x gain in token efficiency for math reasoning compared to outcome-reward RL.
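The "progress" bonus described in this abstract, the change in the likelihood of eventual success after each block of output, can be sketched as follows. This is a minimal illustration rather than the MRT implementation: the success-probability estimates are assumed to be given, and the weighting factor `alpha` and the way the bonus is added to the 0/1 outcome reward are hypothetical.

```python
def progress_bonuses(success_probs):
    """Per-block progress: change in estimated success probability.

    success_probs[0] is the estimate before any output block;
    success_probs[i] is the estimate after block i (assumed to be supplied
    by some external estimator, which is not modeled here).
    """
    return [
        success_probs[i + 1] - success_probs[i]
        for i in range(len(success_probs) - 1)
    ]


def shaped_return(outcome, success_probs, alpha=1.0):
    """0/1 outcome reward plus a weighted sum of progress bonuses.

    alpha is a hypothetical weighting coefficient, not taken from the paper.
    """
    return outcome + alpha * sum(progress_bonuses(success_probs))


probs = [0.25, 0.5, 0.5, 1.0]  # toy estimates, chosen for illustration
bonuses = progress_bonuses(probs)
total = shaped_return(1, probs)
print(bonuses, total)
```

In this toy trace the second block makes no progress (bonus 0.0), so a dense bonus of this kind would give it no credit even though the final outcome is a success.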
6. 【2503.07550】KSOD: Knowledge Supplement for LLMs On Demand
Link: https://arxiv.org/abs/2503.07550
Authors: Haoran Li, Junfeng Hu
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Large Language Models, Large Language, Language Models, demonstrated remarkable capabilities, Knowledge
Comments:
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet still produce errors in domain-specific tasks. To further improve their performance, we propose KSOD (Knowledge Supplement for LLMs On Demand), a novel framework that empowers LLMs to improve their capabilities with knowledge-based supervised fine-tuning (SFT). KSOD analyzes the causes of errors from the perspective of knowledge deficiency by identifying potential missing knowledge in the LLM that may lead to the errors. Subsequently, KSOD tunes a knowledge module on a knowledge dataset and uses it to verify whether the LLM lacks the identified knowledge. If the knowledge is verified, KSOD supplements the LLM with the identified knowledge using the knowledge module. Tuning LLMs on specific knowledge instead of specific tasks decouples task and knowledge, and our experiments on two domain-specific benchmarks and four general benchmarks empirically demonstrate that KSOD enhances the performance of LLMs on tasks requiring the supplemented knowledge while preserving their performance on other tasks. Our findings shed light on the potential of improving the capabilities of LLMs with knowledge-based SFT.
7. 【2503.07539】XIFBench: Evaluating Large Language Models on Multilingual Instruction Following
Link: https://arxiv.org/abs/2503.07539
Authors: Zhenyu Li, Kehai Chen, Yunfei Long, Xuefeng Bai, Yaoyin Zhang, Xuchen Wei, Juntao Li, Min Zhang
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, Language Models, remarkable instruction-following capabilities, demonstrated remarkable instruction-following
Comments:
Abstract:Large Language Models (LLMs) have demonstrated remarkable instruction-following capabilities across various applications. However, their performance in multilingual settings remains poorly understood, as existing evaluations lack fine-grained constraint analysis. We introduce XIFBench, a comprehensive constraint-based benchmark for assessing multilingual instruction-following abilities of LLMs, featuring a novel taxonomy of five constraint categories and 465 parallel instructions across six languages spanning different resource levels. To ensure consistent cross-lingual evaluation, we develop a requirement-based protocol that leverages English requirements as semantic anchors. These requirements are then used to validate the translations across languages. Extensive experiments with various LLMs reveal notable variations in instruction-following performance across resource levels, identifying key influencing factors such as constraint categories, instruction complexity, and cultural specificity.
8. 【2503.07536】LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
Link: https://arxiv.org/abs/2503.07536
Authors: Yingzhe Peng, Gongrui Zhang, Miaosen Zhang, Zhiyuan You, Jie Liu, Qipeng Zhu, Kai Yang, Xingzhong Xu, Xin Geng, Xu Yang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Multimodal Models, architectural constraints limit, faces unique challenges, limit reasoning capacity, constraints limit reasoning
Comments:
Abstract:Enhancing reasoning in Large Multimodal Models (LMMs) faces unique challenges from the complex interplay between visual perception and logical reasoning, particularly in compact 3B-parameter architectures where architectural constraints limit reasoning capacity and modality alignment. While rule-based reinforcement learning (RL) excels in text-only domains, its multimodal extension confronts two critical barriers: (1) data limitations due to ambiguous answers and scarce complex reasoning examples, and (2) degraded foundational reasoning induced by multimodal pretraining. To address these challenges, we propose LMM-R1, a two-stage framework adapting rule-based RL for multimodal reasoning through Foundational Reasoning Enhancement (FRE) followed by Multimodal Generalization Training (MGT). The FRE stage first strengthens reasoning abilities using text-only data with rule-based RL, then the MGT stage generalizes these reasoning capabilities to multimodal domains. Experiments on Qwen2.5-VL-Instruct-3B demonstrate that LMM-R1 achieves 4.83% and 4.5% average improvements over baselines in multimodal and text-only benchmarks, respectively, with a 3.63% gain in complex Football Game tasks. These results validate that text-based reasoning enhancement enables effective multimodal generalization, offering a data-efficient paradigm that bypasses costly high-quality multimodal training data.
9. 【2503.07519】GRITHopper: Decomposition-Free Multi-Hop Dense Retrieval
Link: https://arxiv.org/abs/2503.07519
Authors: Justus-Jonas Erker, Nils Reimers, Iryna Gurevych
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Keywords: Decomposition-based multi-hop retrieval, Decomposition-based multi-hop, retrieval methods rely, complex queries, computationally expensive
Comments: Under Review at ACL Rolling Review (ARR)
Abstract:Decomposition-based multi-hop retrieval methods rely on many autoregressive steps to break down complex queries, which breaks end-to-end differentiability and is computationally expensive. Decomposition-free methods tackle this, but current decomposition-free approaches struggle with longer multi-hop problems and generalization to out-of-distribution data. To address these challenges, we introduce GRITHopper-7B, a novel multi-hop dense retrieval model that achieves state-of-the-art performance on both in-distribution and out-of-distribution benchmarks. GRITHopper combines generative and representational instruction tuning by integrating causal language modeling with dense retrieval training. Through controlled studies, we find that incorporating additional context after the retrieval process, referred to as post-retrieval language modeling, enhances dense retrieval performance. By including elements such as final answers during training, the model learns to better contextualize and retrieve relevant information. GRITHopper-7B offers a robust, scalable, and generalizable solution for multi-hop dense retrieval, and we release it to the community for future research and applications requiring multi-hop reasoning and retrieval capabilities.
10. 【2503.07518】TokenButler: Token Importance is Predictable
Link: https://arxiv.org/abs/2503.07518
Authors: Yash Akhauri, Ahmed F AbouElhamayed, Yifei Gao, Chi-Chih Chang, Nilesh Jain, Mohamed S. Abdelfattah
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Large Language Models, Large Language, enabling efficient decoding, Cache to store, store token history
Comments:
Abstract:Large Language Models (LLMs) rely on the Key-Value (KV) Cache to store token history, enabling efficient decoding of tokens. As the KV-Cache grows, it becomes a major memory and computation bottleneck. However, there is an opportunity to alleviate this bottleneck, especially because prior research has shown that only a small subset of tokens contribute meaningfully to each decoding step. A key challenge in finding these critical tokens is that they are dynamic, and heavily input query-dependent. Existing methods either risk quality by evicting tokens permanently, or retain the full KV-Cache but rely on retrieving chunks (pages) of tokens at generation, failing at dense, context-rich tasks. Additionally, many existing KV-Cache sparsity methods rely on inaccurate proxies for token importance. To address these limitations, we introduce TokenButler, a high-granularity, query-aware predictor that learns to identify these critical tokens. By training a light-weight predictor with less than 1.2% parameter overhead, TokenButler prioritizes tokens based on their contextual, predicted importance. This improves perplexity and downstream accuracy by over 8% relative to SoTA methods for estimating token importance. We evaluate TokenButler on a novel synthetic small-context co-referential retrieval task, demonstrating near-oracle accuracy. Code, models and benchmarks: this https URL
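Once per-token importance scores are available, the selection step that prioritizes critical tokens can be sketched as a simple top-k choice. This is an illustrative sketch only: the scores are assumed to come from some predictor, and `select_tokens` and `budget` are hypothetical names, not part of the TokenButler code.

```python
def select_tokens(importance, budget):
    """Return the positions of the `budget` highest-scoring tokens.

    importance: predicted importance score per cached token (assumed given).
    budget: number of tokens to attend to at this decoding step.
    Positions are returned in their original order so the selected
    key/value entries keep their sequence order.
    """
    ranked = sorted(
        range(len(importance)), key=lambda i: importance[i], reverse=True
    )
    return sorted(ranked[:budget])


scores = [0.1, 0.9, 0.05, 0.7, 0.3]  # toy scores for five cached tokens
kept = select_tokens(scores, 2)
print(kept)
```

Because the scores are recomputed for each query, no token is evicted permanently; a token skipped at one step can still be selected at a later one.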
11. 【2503.07513】Language Models Fail to Introspect About Their Knowledge of Language
Link: https://arxiv.org/abs/2503.07513
Authors: Siyuan Song, Jennifer Hu, Kyle Mahowald
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: large language models, large language, knowledge, LLMs, internal states
Comments:
Abstract:There has been recent interest in whether large language models (LLMs) can introspect about their own internal states. Such abilities would make LLMs more interpretable, and also validate the use of standard introspective methods in linguistics to evaluate grammatical knowledge in models (e.g., asking "Is this sentence grammatical?"). We systematically investigate emergent introspection across 21 open-source LLMs, in two domains where introspection is of theoretical interest: grammatical knowledge and word prediction. Crucially, in both domains, a model's internal linguistic knowledge can be theoretically grounded in direct measurements of string probability. We then evaluate whether models' responses to metalinguistic prompts faithfully reflect their internal knowledge. We propose a new measure of introspection: the degree to which a model's prompted responses predict its own string probabilities, beyond what would be predicted by another model with nearly identical internal knowledge. While both metalinguistic prompting and probability comparisons lead to high task accuracy, we do not find evidence that LLMs have privileged "self-access". Our findings complicate recent results suggesting that models can introspect, and add new evidence to the argument that prompted responses should not be conflated with models' linguistic generalizations.
12. 【2503.07510】Sometimes the Model doth Preach: Quantifying Religious Bias in Open LLMs through Demographic Analysis in Asian Nations
Link: https://arxiv.org/abs/2503.07510
Authors: Hari Shankar, Vedanta S P, Tejas Cavale, Ponnurangam Kumaraguru, Abhijnan Chakraborty
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, propagating bias unknowingly, non-diverse data collection, Language Models
Comments:
Abstract:Large Language Models (LLMs) are capable of generating opinions and propagating bias unknowingly, originating from unrepresentative and non-diverse data collection. Prior research has analysed these opinions with respect to the West, particularly the United States. However, insights thus produced may not be generalized in non-Western populations. With the widespread usage of LLM systems by users across several different walks of life, the cultural sensitivity of each generated output is of crucial interest. Our work proposes a novel method that quantitatively analyzes the opinions generated by LLMs, improving on previous work with regards to extracting the social demographics of the models. Our method measures the distance from an LLM's response to survey respondents, through Hamming Distance, to infer the demographic characteristics reflected in the model's outputs. We evaluate modern, open LLMs such as Llama and Mistral on surveys conducted in various global south countries, with a focus on India and other Asian nations, specifically assessing the model's performance on surveys related to religious tolerance and identity. Our analysis reveals that most open LLMs match a single homogeneous profile, varying across different countries/territories, which in turn raises questions about the risks of LLMs promoting a hegemonic worldview, and undermining perspectives of different minorities. Our framework may also be useful for future research investigating the complex intersection between training data, model architecture, and the resulting biases reflected in LLM outputs, particularly concerning sensitive topics like religious tolerance and identity.
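The distance measure named in this abstract can be illustrated with a short sketch that compares a model's survey answers with each respondent's answers using Hamming distance. The answer encoding, the respondent IDs, and the helper `nearest_respondents` are assumptions made for illustration; the paper's actual pipeline may differ.

```python
def hamming(a, b):
    """Number of positions at which two equal-length answer lists differ."""
    if len(a) != len(b):
        raise ValueError("answer lists must have the same length")
    return sum(x != y for x, y in zip(a, b))


def nearest_respondents(model_answers, respondents):
    """Return (ids of closest respondents, their distance to the model)."""
    dists = {rid: hamming(model_answers, ans) for rid, ans in respondents.items()}
    best = min(dists.values())
    return sorted(rid for rid, d in dists.items() if d == best), best


# Toy data: answers to four multiple-choice survey questions.
model = ["A", "B", "A", "C"]
survey = {
    "r1": ["A", "B", "B", "C"],
    "r2": ["B", "B", "B", "A"],
    "r3": ["A", "B", "A", "A"],
}
closest, distance = nearest_respondents(model, survey)
print(closest, distance)
```

The demographic attributes of the closest respondents could then be read off as the profile that the model's answers most resemble.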
13. 【2503.07459】MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning
Link: https://arxiv.org/abs/2503.07459
Authors: Xiangru Tang, Daniel Shao, Jiwoong Sohn, Jiapeng Chen, Jiayi Zhang, Jinyu Xiang, Fang Wu, Yilun Zhao, Chenglin Wu, Wenqi Shi, Arman Cohan, Mark Gerstein
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, shown impressive performance, Language Models, shown impressive
Comments:
Abstract:Large Language Models (LLMs) have shown impressive performance on existing medical question-answering benchmarks. This high performance makes it increasingly difficult to meaningfully evaluate and differentiate advanced methods. We present MedAgentsBench, a benchmark that focuses on challenging medical questions requiring multi-step clinical reasoning, diagnosis formulation, and treatment planning, scenarios where current models still struggle despite their strong performance on standard tests. Drawing from seven established medical datasets, our benchmark addresses three key limitations in existing evaluations: (1) the prevalence of straightforward questions where even base models achieve high performance, (2) inconsistent sampling and evaluation protocols across studies, and (3) lack of systematic analysis of the interplay between performance, cost, and inference time. Through experiments with various base models and reasoning methods, we demonstrate that the latest thinking models, DeepSeek R1 and OpenAI o3, exhibit exceptional performance in complex medical reasoning tasks. Additionally, advanced search-based agent methods offer promising performance-to-cost ratios compared to traditional approaches. Our analysis reveals substantial performance gaps between model families on complex questions and identifies optimal model selections for different computational constraints. Our benchmark and evaluation framework are publicly available at this https URL.
14. 【2503.07457】LLMs syntactically adapt their language use to their conversational partner
Link: https://arxiv.org/abs/2503.07457
Authors: Florian Kandra, Vera Demberg, Alexander Koller
Categories: Computation and Language (cs.CL)
Keywords: human speakers align, frequently observed, observed that human, human speakers, speakers align
Comments: 4 pages, 1 table, 1 figure, submitted to ACL
Abstract:It has been frequently observed that human speakers align their language use with each other during conversations. In this paper, we study empirically whether large language models (LLMs) exhibit the same behavior of conversational adaptation. We construct a corpus of conversations between LLMs and find that two LLM agents end up making more similar syntactic choices as conversations go on, confirming that modern LLMs adapt their language use to their conversational partners in at least a rudimentary way.
15. 【2503.07453】Is a Good Foundation Necessary for Efficient Reinforcement Learning? The Computational Role of the Base Model in Exploration
Link: https://arxiv.org/abs/2503.07453
Authors: Dylan J. Foster, Zakaria Mhammedi, Dhruv Rohatgi
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Statistics Theory (math.ST)
Keywords: Language model alignment, language models, model, exploration
Comments:
Abstract:Language model alignment (or, reinforcement learning) techniques that leverage active exploration -- deliberately encouraging the model to produce diverse, informative responses -- offer the promise of super-human capabilities. However, current understanding of algorithm design primitives for computationally efficient exploration with language models is limited. To better understand how to leverage access to powerful pre-trained generative models to improve the efficiency of exploration, we introduce a new computational framework for RL with language models, in which the learner interacts with the model through a sampling oracle. Focusing on the linear softmax model parameterization, we provide new results that reveal the computational-statistical tradeoffs of efficient exploration: 1. Necessity of coverage: Coverage refers to the extent to which the pre-trained model covers near-optimal responses -- a form of hidden knowledge. We show that coverage, while not necessary for data efficiency, lower bounds the runtime of any algorithm in our framework. 2. Inference-time exploration: We introduce a new algorithm, SpannerSampling, which obtains optimal data efficiency and is computationally efficient whenever the pre-trained model enjoys sufficient coverage, matching our lower bound. SpannerSampling leverages inference-time computation with the pre-trained model to reduce the effective search space for exploration. 3. Insufficiency of training-time interventions: We contrast the result above by showing that training-time interventions that produce proper policies cannot achieve similar guarantees in polynomial time. 4. Computational benefits of multi-turn exploration: Finally, we show that under additional representational assumptions, one can achieve improved runtime (replacing sequence-level coverage with token-level coverage) through multi-turn exploration.
16. 【2503.07395】Revisiting Noise in Natural Language Processing for Computational Social Science
Link: https://arxiv.org/abs/2503.07395
Authors: Nadav Borenstein
Categories: Computation and Language (cs.CL)
Keywords: Computational Social Science, Computational Social, Social Science, emerging field driven, unprecedented availability
Comments: PhD thesis. Under the supervision of Prof. Isabelle Augenstein
Abstract:Computational Social Science (CSS) is an emerging field driven by the unprecedented availability of human-generated content for researchers. This field, however, presents a unique set of challenges due to the nature of the theories and datasets it explores, including highly subjective tasks and complex, unstructured textual corpora. Among these challenges, one of the less well-studied topics is the pervasive presence of noise. This thesis aims to address this gap in the literature by presenting a series of interconnected case studies that examine different manifestations of noise in CSS. These include character-level errors following the OCR processing of historical records, archaic language, inconsistencies in annotations for subjective and ambiguous tasks, and even noise and biases introduced by large language models during content generation. This thesis challenges the conventional notion that noise in CSS is inherently harmful or useless. Rather, it argues that certain forms of noise can encode meaningful information that is invaluable for advancing CSS research, such as the unique communication styles of individuals or the culture-dependent nature of datasets and tasks. Further, this thesis highlights the importance of nuance in dealing with noise and the considerations CSS researchers must address when encountering it, demonstrating that different types of noise require distinct strategies.
17. 【2503.07384】Is My Text in Your AI Model? Gradient-based Membership Inference Test applied to LLMs
Link: https://arxiv.org/abs/2503.07384
Authors: Gonzalo Mancera, Daniel de Alcala, Julian Fierrez, Ruben Tolosana, Aythami Morales
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Membership Inference Test, gradient-based Membership Inference, Inference Test, Membership Inference, Natural Language Processing
Comments:
Abstract:This work adapts and studies the gradient-based Membership Inference Test (gMINT) to the classification of text based on LLMs. MINT is a general approach intended to determine if given data was used for training machine learning models, and this work focuses on its application to the domain of Natural Language Processing. Using gradient-based analysis, the MINT model identifies whether particular data samples were included during the language model training phase, addressing growing concerns about data privacy in machine learning. The method was evaluated in seven Transformer-based models and six datasets comprising over 2.5 million sentences, focusing on text classification tasks. Experimental results demonstrate MINT's robustness, achieving AUC scores between 85% and 99%, depending on data size and model architecture. These findings highlight MINT's potential as a scalable and reliable tool for auditing machine learning models, ensuring transparency, safeguarding sensitive data, and fostering ethical compliance in the deployment of AI/NLP technologies.
18. 【2503.07358】RepoST: Scalable Repository-Level Coding Environment Construction with Sandbox Testing
Link: https://arxiv.org/abs/2503.07358
Authors: Yiqing Xie,Alex Xie,Divyanshu Sheth,Pengfei Liu,Daniel Fried,Carolyn Rose
Subjects: Computation and Language (cs.CL); Software Engineering (cs.SE)
Keywords: provide execution feedback, execution feedback, present RepoST, provide execution, repository-level code generation
Comments:
Click to view abstract
Abstract:We present RepoST, a scalable method to construct environments that provide execution feedback for repository-level code generation for both training and evaluation. Unlike existing works that aim to build entire repositories for execution, which is challenging for both humans and LLMs, we provide execution feedback with sandbox testing, which isolates a given target function and its dependencies to a separate script for testing. Sandbox testing reduces the complexity of external dependencies and enables constructing environments at a large scale. We use our method to construct RepoST-Train, a large-scale train set with 7,415 functions from 832 repositories. Training with the execution feedback provided by RepoST-Train leads to a performance gain of 5.5% Pass@1 on HumanEval and 3.5% Pass@1 on RepoEval. We also build an evaluation dataset, RepoST-Eval, and benchmark 12 code generation models.
19. 【2503.07329】Assessing the Macro and Micro Effects of Random Seeds on Fine-Tuning Large Language Models
Link: https://arxiv.org/abs/2503.07329
Authors: Hao Zhou,Guergana Savova,Lijing Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: large language models, fine-tuning large language, GLUE and SuperGLUE
Comments: 7 pages, 5 tables, 3 figures
Click to view abstract
Abstract:The impact of random seeds in fine-tuning large language models (LLMs) has been largely overlooked despite its potential influence on model performance. In this study, we systematically evaluate the effects of random seeds on LLMs using the GLUE and SuperGLUE benchmarks. We analyze the macro-level impact through traditional metrics like accuracy and F1, calculating their mean and variance to quantify performance fluctuations. To capture the micro-level effects, we introduce a novel metric, consistency, measuring the stability of individual predictions across runs. Our experiments reveal significant variance at both macro and micro levels, underscoring the need for careful consideration of random seeds in fine-tuning and evaluation.
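To make the macro/micro distinction concrete, the following sketch computes the mean and variance of accuracy across seeds, plus a simple pairwise agreement rate between runs. The abstract does not give the exact definition of consistency, so `pairwise_consistency` is an assumed stand-in, and the labels and predictions are invented toy data.

```python
from itertools import combinations
from statistics import mean, pvariance


def accuracy(preds, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def macro_stats(scores):
    """Mean and population variance of a metric across random seeds."""
    return mean(scores), pvariance(scores)


def pairwise_consistency(runs):
    """Average fraction of examples on which two runs agree.

    Assumed definition for illustration; the paper's metric may differ.
    """
    rates = [
        sum(a == b for a, b in zip(r1, r2)) / len(r1)
        for r1, r2 in combinations(runs, 2)
    ]
    return mean(rates)


# Toy data: four examples, three fine-tuning runs with different seeds.
gold = [1, 0, 1, 1]
runs = [
    [1, 0, 1, 0],  # seed 0
    [1, 0, 0, 1],  # seed 1
    [1, 1, 1, 1],  # seed 2
]

accs = [accuracy(r, gold) for r in runs]
mean_acc, var_acc = macro_stats(accs)     # every run scores 0.75
consistency = pairwise_consistency(runs)  # each pair of runs agrees on half the examples
```

In this toy case the macro view shows no fluctuation at all, yet the runs disagree on half of the individual predictions, which is the kind of effect a micro-level metric is meant to expose.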
20. 【2503.07306】Benchmarking Chinese Medical LLMs: A Medbench-based Analysis of Performance Gaps and Hierarchical Optimization Strategies
Link: https://arxiv.org/abs/2503.07306
Authors: Luyi Jiang,Jiayuan Chen,Lu Lu,Xinwei Peng,Lihao Liu,Junjun He,Jie Xu
Subjects: Computation and Language (cs.CL)
Keywords: medical large language, large language models, real-world deployment, ethical alignment, Medical Language Generation
Comments:
Click to view abstract
Abstract:The evaluation and improvement of medical large language models (LLMs) are critical for their real-world deployment, particularly in ensuring accuracy, safety, and ethical alignment. Existing frameworks inadequately dissect domain-specific error patterns or address cross-modal challenges. This study introduces a granular error taxonomy through systematic analysis of the top 10 models on MedBench, categorizing incorrect responses into eight types: Omissions, Hallucination, Format Mismatch, Causal Reasoning Deficiency, Contextual Inconsistency, Unanswered, Output Error, and Deficiency in Medical Language Generation. Evaluation of 10 leading models reveals vulnerabilities: despite achieving 0.86 accuracy in medical knowledge recall, critical reasoning tasks show 96.3% omission, while safety ethics evaluations expose alarming inconsistency (robustness score: 0.79) under option shuffling. Our analysis uncovers systemic weaknesses in knowledge boundary enforcement and multi-step reasoning. To address these, we propose a tiered optimization strategy spanning four levels, from prompt engineering and knowledge-augmented retrieval to hybrid neuro-symbolic architectures and causal reasoning frameworks. This work establishes an actionable roadmap for developing clinically robust LLMs while redefining evaluation paradigms through error-driven insights, ultimately advancing the safety and trustworthiness of AI in high-stakes medical environments.
21. 【2503.07303】An Information-Theoretic Approach to Identifying Formulaic Clusters in Textual Data
Link: https://arxiv.org/abs/2503.07303
Authors: Gideon Yoffe,Yair Segev,Barak Sober
Subjects: Computation and Language (cs.CL)
Keywords: Hebrew Bible, exhibit structural, cultural context, Texts, patterns
Comments:
Click to view abstract
Abstract:Texts, whether literary or historical, exhibit structural and stylistic patterns shaped by their purpose, authorship, and cultural context. Formulaic texts, characterized by repetition and constrained expression, tend to have lower variability in self-information compared to more dynamic compositions. Identifying such patterns in historical documents, particularly multi-author texts like the Hebrew Bible, provides insights into their origins, purpose, and transmission. This study aims to identify formulaic clusters -- sections exhibiting systematic repetition and structural constraints -- by analyzing recurring phrases, syntactic structures, and stylistic markers. However, distinguishing formulaic from non-formulaic elements in an unsupervised manner presents a computational challenge, especially in high-dimensional textual spaces where patterns must be inferred without predefined labels. To address this, we develop an information-theoretic algorithm leveraging weighted self-information distributions to detect structured patterns in text. Unlike covariance-based methods, which become unstable in small-sample, high-dimensional settings, our approach directly models variations in self-information to identify formulaicity. By extending classical discrete self-information measures with a continuous formulation based on differential self-information, our method remains applicable across different types of textual representations, including neural embeddings under Gaussian priors. Applied to hypothesized authorial divisions in the Hebrew Bible, our approach successfully isolates stylistic layers, providing a quantitative framework for textual stratification. This method enhances our ability to analyze compositional patterns, offering deeper insights into the literary and cultural evolution of texts shaped by complex authorship and editorial processes.
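The claim that formulaic text shows lower variability in self-information can be illustrated with a toy unigram model. The sketch below is a deliberate simplification and not the paper's weighted algorithm: it estimates word probabilities from the pooled segments, scores each token with -log2 p, and compares the variance per segment. The two segments and the function names are invented for illustration.

```python
import math
from collections import Counter
from statistics import pvariance


def self_information(tokens, counts, total):
    """Per-token self-information in bits under a unigram model."""
    return [-math.log2(counts[t] / total) for t in tokens]


def variability(segments):
    """Variance of token self-information for each segment.

    Probabilities are estimated from all segments pooled together.
    """
    pooled = [t for seg in segments for t in seg]
    counts = Counter(pooled)
    total = len(pooled)
    return [pvariance(self_information(seg, counts, total)) for seg in segments]


# Invented toy segments: one repetitive, one mixing common and rare words.
segments = [
    "and he begat sons and he begat sons".split(),
    "and the river rose and the rain fell".split(),
]
formulaic_var, varied_var = variability(segments)
```

With these inputs the repetitive segment has the smaller variance, because its tokens all have similar corpus frequency. On real text a unigram model is far too crude, which is part of why the paper works with weighted and continuous formulations.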
22. 【2503.07282】A Graph-based Verification Framework for Fact-Checking
Link: https://arxiv.org/abs/2503.07282
Authors: Yani Huang,Richong Zhang,Zhijie Nie,Junfan Chen,Xuefeng Zhang
Subjects: Computation and Language (cs.CL)
Keywords: combating misinformation, plays a crucial, crucial role, role in combating, Fact-checking plays
Comments: 13 pages, 4 figures
Click to view abstract
Abstract:Fact-checking plays a crucial role in combating misinformation. Existing methods using large language models (LLMs) for claim decomposition face two key limitations: (1) insufficient decomposition, introducing unnecessary complexity to the verification process, and (2) ambiguity of mentions, leading to incorrect verification results. To address these challenges, we suggest introducing a claim graph consisting of triplets to address the insufficient decomposition problem and reduce mention ambiguity through graph structure. Based on this core idea, we propose a graph-based framework, GraphFC, for fact-checking. The framework features three key components: graph construction, which builds both claim and evidence graphs; graph-guided planning, which prioritizes the triplet verification order; and graph-guided checking, which verifies the triples one by one between claim and evidence graphs. Extensive experiments show that GraphFC enables fine-grained decomposition while resolving referential ambiguities through relational constraints, achieving state-of-the-art performance across three datasets.
23. 【2503.07279】VizTrust: A Visual Analytics Tool for Capturing User Trust Dynamics in Human-AI Communication
Link: https://arxiv.org/abs/2503.07279
Authors: Xin Wang,Stephanie Tulk Jesso,Sadamori Kojaku,David M Neyens,Min Sun Kim
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: artificial intelligence, plays a fundamental, fundamental role, role in shaping, shaping the willingness
Comments: Accepted by ACM CHI conference 2025
Click to view abstract
Abstract:Trust plays a fundamental role in shaping the willingness of users to engage and collaborate with artificial intelligence (AI) systems. Yet, measuring user trust remains challenging due to its complex and dynamic nature. While traditional survey methods provide trust levels for long conversations, they fail to capture its dynamic evolution during ongoing interactions. Here, we present VizTrust, which addresses this challenge by introducing a real-time visual analytics tool that leverages a multi-agent collaboration system to capture and analyze user trust dynamics in human-agent communication. Built on established human-computer trust scales (competence, integrity, benevolence, and predictability), VizTrust enables stakeholders to observe trust formation as it happens, identify patterns in trust development, and pinpoint specific interaction elements that influence trust. Our tool offers actionable insights into human-agent trust formation and evolution in real time through a dashboard, supporting the design of adaptive conversational agents that respond effectively to user trust signals.
24. 【2503.07269】SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection
Link: https://arxiv.org/abs/2503.07269
Authors: Shamsuddeen Hassan Muhammad,Nedjma Ousidhoum,Idris Abdulmumin,Seid Muhie Yimam,Jan Philip Wahle,Terry Ruas,Meriem Beloucif,Christine De Kock,Tadesse Destaw Belay,Ibrahim Said Ahmad,Nirmal Surange,Daniela Teodorescu,David Ifeoluwa Adelani,Alham Fikri Aji,Felermino Ali,Vladimir Araujo,Abinew Ali Ayele,Oana Ignat,Alexander Panchenko,Yi Zhou,Saif M. Mohammad
Subjects: Computation and Language (cs.CL)
Keywords: text-based emotion detection, distinct language families, present our shared, emotion detection, emotion
Comments: SemEval2025 Task11 (Task Description Paper). arXiv admin note: text overlap with [arXiv:2502.11926](https://arxiv.org/abs/2502.11926)
Click to view abstract
Abstract:We present our shared task on text-based emotion detection, covering more than 30 languages from seven distinct language families. These languages are predominantly low-resource and spoken across various continents. The data instances are multi-labeled into six emotional classes, with additional datasets in 11 languages annotated for emotion intensity. Participants were asked to predict labels in three tracks: (a) emotion labels in monolingual settings, (b) emotion intensity scores, and (c) emotion labels in cross-lingual settings. The task attracted over 700 participants. We received final submissions from more than 200 teams and 93 system description papers. We report baseline results, as well as findings on the best-performing systems, the most common approaches, and the most effective methods across various tracks and languages. The datasets for this task are publicly available.
25. 【2503.07265】WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
Link: https://arxiv.org/abs/2503.07265
Authors: Yuwei Niu,Munan Ning,Mengren Zheng,Bin Lin,Peng Jin,Jiaqi Liao,Kunpeng Ning,Bin Zhu,Li Yuan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: generating high-quality artistic, high-quality artistic creations, visual content, capable of generating
Comments: Code, data and leaderboard: [this https URL](https://github.com/PKU-YuanGroup/WISE)
Click to view abstract
Abstract:Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation. To address this challenge, we propose WISE, the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation. WISE moves beyond simple word-pixel mapping by challenging models with 1,000 meticulously crafted prompts across 25 sub-domains in cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of the traditional CLIP metric, we introduce WiScore, a novel quantitative metric for assessing knowledge-image alignment. Through comprehensive testing of 20 models (10 dedicated T2I models and 10 unified multimodal models) using 1,000 structured prompts spanning 25 subdomains, our findings reveal significant limitations in their ability to effectively integrate and apply world knowledge during image generation, highlighting critical pathways for enhancing knowledge incorporation and application in next-generation T2I models. Code and data are available at this https URL.
26. 【2503.07237】LLM-C3MOD: A Human-LLM Collaborative System for Cross-Cultural Hate Speech Moderation
Link: https://arxiv.org/abs/2503.07237
Authors: Junyeong Park,Seogyeong Jeong,Seyoung Song,Yohan Lee,Alice Oh
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: leaving low-resource languages, prioritize high-resource languages, major tech platforms, tech platforms prioritize, platforms prioritize high-resource
Comments: Accepted to NAACL 2025 Workshop - C3NLP (Workshop on Cross-Cultural Considerations in NLP)
Click to view abstract
Abstract:Content moderation is a global challenge, yet major tech platforms prioritize high-resource languages, leaving low-resource languages with scarce native moderators. Since effective moderation depends on understanding contextual cues, this imbalance increases the risk of improper moderation due to non-native moderators' limited cultural understanding. Through a user study, we identify that non-native moderators struggle with interpreting culturally-specific knowledge, sentiment, and internet culture in hate speech moderation. To assist them, we present LLM-C3MOD, a human-LLM collaborative pipeline with three steps: (1) RAG-enhanced cultural context annotations; (2) initial LLM-based moderation; and (3) targeted human moderation for cases lacking LLM consensus. Evaluated on a Korean hate speech dataset with Indonesian and German participants, our system achieves 78% accuracy (surpassing GPT-4o's 71% baseline), while reducing human workload by 83.6%. Notably, human moderators excel at nuanced content where LLMs struggle. Our findings suggest that non-native moderators, when properly supported by LLMs, can effectively contribute to cross-cultural hate speech moderation.
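Steps (2) and (3) of the pipeline amount to a routing rule: decide automatically when the LLM moderators agree and hand the rest to humans. The sketch below assumes that "consensus" means a unanimous vote, which the abstract does not specify. The function `route`, the label strings, and the post IDs are all hypothetical.

```python
def route(votes_per_item):
    """Split items into auto-decided ones and ones escalated to humans.

    votes_per_item maps an item ID to the list of labels produced by
    several LLM moderators. Unanimity is an assumed consensus criterion.
    """
    auto, escalate = {}, []
    for item_id, votes in votes_per_item.items():
        if len(set(votes)) == 1:
            auto[item_id] = votes[0]   # all LLM votes agree
        else:
            escalate.append(item_id)   # no consensus: send to a human
    return auto, escalate


# Hypothetical votes from three LLM moderators on three posts.
votes = {
    "post-1": ["hate", "hate", "hate"],
    "post-2": ["hate", "not_hate", "hate"],
    "post-3": ["not_hate", "not_hate", "not_hate"],
}
auto, escalate = route(votes)

# Share of items that never reach a human moderator.
workload_reduction = 1 - len(escalate) / len(votes)
```

A stricter or looser consensus rule (for example a majority threshold) would trade human workload against the risk of letting an uncertain LLM decision stand.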
27. 【2503.07214】Cross-Lingual IPA Contrastive Learning for Zero-Shot NER
Link: https://arxiv.org/abs/2503.07214
Authors: Jimin Sohn,David R. Mortensen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Named Entity Recognition, zero-shot Named Entity, Entity Recognition, Named Entity, Existing approaches
Comments: 17 pages, 6 figures
Click to view abstract
Abstract:Existing approaches to zero-shot Named Entity Recognition (NER) for low-resource languages have primarily relied on machine translation, whereas more recent methods have shifted focus to phonemic representation. Building upon this, we investigate how reducing the phonemic representation gap in IPA transcription between languages with similar phonetic characteristics enables models trained on high-resource languages to perform effectively on low-resource languages. In this work, we propose the CONtrastive Learning with IPA (CONLIPA) dataset, containing 10 English and high-resource language IPA pairs from 10 frequently used language families. We also propose a cross-lingual IPA Contrastive learning method (IPAC) using the CONLIPA dataset. Furthermore, our proposed dataset and methodology demonstrate a substantial average gain when compared to the best performing baseline.
28. 【2503.07195】Contextual Cues in Machine Translation: Investigating the Potential of Multi-Source Input Strategies in LLMs and NMT Systems
Link: https://arxiv.org/abs/2503.07195
Authors: Lia Shahnazaryan,Patrick Simianer,Joern Wuebker
Subjects: Computation and Language (cs.CL)
Keywords: multilingual neural machine, traditional multilingual neural, neural machine translation, large language model, multi-source input strategies
Comments: 11 pages
Click to view abstract
Abstract:We explore the impact of multi-source input strategies on machine translation (MT) quality, comparing GPT-4o, a large language model (LLM), with a traditional multilingual neural machine translation (NMT) system. Using intermediate language translations as contextual cues, we evaluate their effectiveness in enhancing English and Chinese translations into Portuguese. Results suggest that contextual information significantly improves translation quality for domain-specific datasets and potentially for linguistically distant language pairs, with diminishing returns observed in benchmarks with high linguistic variability. Additionally, we demonstrate that shallow fusion, a multi-source approach we apply within the NMT system, shows improved results when using high-resource languages as context for other translation pairs, highlighting the importance of strategic context language selection.
29. 【2503.07190】Multi-Modal 3D Mesh Reconstruction from Images and Text
Link: https://arxiv.org/abs/2503.07190
Authors: Melvin Reka,Tessa Pulli,Markus Vincze
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: require large datasets, high computational costs, object pose estimation, large datasets, struggle to generalize
Comments: Under review
Click to view abstract
Abstract:6D object pose estimation for unseen objects is essential in robotics but traditionally relies on trained models that require large datasets, incur high computational costs, and struggle to generalize. Zero-shot approaches eliminate the need for training but depend on pre-existing 3D object models, which are often impractical to obtain. To address this, we propose a language-guided few-shot 3D reconstruction method, reconstructing a 3D mesh from a few input images. The proposed pipeline receives a set of input images and a language query. A combination of GroundingDINO and Segment Anything Model outputs segmented masks from which a sparse point cloud is reconstructed with VGGSfM. Subsequently, the mesh is reconstructed with the Gaussian Splatting method SuGAR. In a final cleaning step, artifacts are removed, resulting in the final 3D mesh of the queried object. We evaluate the method in terms of accuracy and quality of the geometry and texture. Furthermore, we study the impact of imaging conditions such as viewing angle, number of input images, and image overlap on 3D object reconstruction quality, efficiency, and computational scalability.
30. 【2503.07179】Strategies for political-statement segmentation and labelling in unstructured text
Link: https://arxiv.org/abs/2503.07179
Authors: Dmitry Nikolaev,Sean Papay
Subjects: Computation and Language (cs.CL)
Keywords: Analysis of parliamentary, integral area, area of computational, computational study, parliamentary speeches
Comments: Accepted to NLP4DH 2025 @ NAACL 2025
Click to view abstract
Abstract:Analysis of parliamentary speeches and political-party manifestos has become an integral area of computational study of political texts. While speeches have been overwhelmingly analysed using unsupervised methods, a large corpus of manifestos with by-statement political-stance labels has been created by the participants of the MARPOR project. It has been recently shown that these labels can be predicted by a neural model; however, the current approach relies on provided statement boundaries, limiting out-of-domain applicability. In this work, we propose and test a range of unified split-and-label frameworks -- based on linear-chain CRFs, fine-tuned text-to-text models, and the combination of in-context learning with constrained decoding -- that can be used to jointly segment and classify statements from raw textual data. We show that our approaches achieve competitive accuracy when applied to raw text of political manifestos, and then demonstrate the research potential of our method by applying it to the records of the UK House of Commons and tracing the political trajectories of four major parties in the last three decades.
31. 【2503.07170】DeFine: A Decomposed and Fine-Grained Annotated Dataset for Long-form Article Generation
Link: https://arxiv.org/abs/2503.07170
Authors: Ming Wang,Fang Wang,Minghao Hu,Li He,Haiyang Wang,Jun Zhang,Tianwei Yan,Li Li,Zhunchen Luo,Wei Luo,Xiaoying Bai,Guotong Geng
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: maintaining logical consistency, Long-form article generation, article generation, Long-form article, presents challenges
Comments:
Click to view abstract
Abstract:Long-form article generation (LFAG) presents challenges such as maintaining logical consistency, comprehensive topic coverage, and narrative coherence across extended articles. Existing datasets often lack both the hierarchical structure and fine-grained annotation needed to effectively decompose tasks, resulting in shallow, disorganized article generation. To address these limitations, we introduce DeFine, a Decomposed and Fine-grained annotated dataset for long-form article generation. DeFine is characterized by its hierarchical decomposition strategy and the integration of domain-specific knowledge with multi-level annotations, ensuring granular control and enhanced depth in article generation. To construct the dataset, a multi-agent collaborative pipeline is proposed, which systematically segments the generation process into four parts: Data Miner, Cite Retriever, QA Annotator and Data Cleaner. To validate the effectiveness of DeFine, we designed and tested three LFAG baselines: the web retrieval, the local retrieval, and the grounded reference. We fine-tuned the Qwen2-7b-Instruct model using the DeFine training dataset. The experimental results showed significant improvements in text quality, specifically in topic coverage, depth of information, and content fidelity. Our dataset is publicly available to facilitate future research.
32. 【2503.07144】MRCEval: A Comprehensive, Challenging and Accessible Machine Reading Comprehension Benchmark
Link: https://arxiv.org/abs/2503.07144
Authors: Shengkun Ma,Hao Peng,Lei Hou,Juanzi Li
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Machine Reading Comprehension, Machine Reading, Reading Comprehension, natural language understanding, evaluating natural language
Comments: Under review
Click to view abstract
Abstract:Machine Reading Comprehension (MRC) is an essential task in evaluating natural language understanding. Existing MRC datasets primarily assess specific aspects of reading comprehension (RC), lacking a comprehensive MRC benchmark. To fill this gap, we first introduce a novel taxonomy that categorizes the key capabilities required for RC. Based on this taxonomy, we construct MRCEval, an MRC benchmark that leverages advanced Large Language Models (LLMs) as both sample generators and selection judges. MRCEval is a comprehensive, challenging and accessible benchmark designed to assess the RC capabilities of LLMs thoroughly, covering 13 distinct RC skills with a total of 2.1K high-quality multi-choice questions. We perform an extensive evaluation of 28 widely used open-source and proprietary models, highlighting that MRC continues to present significant challenges even in the era of LLMs.
33. 【2503.07142】A Systematic Comparison of Syntactic Representations of Dependency Parsing
Link: https://arxiv.org/abs/2503.07142
Authors: Guillaume Wisniewski(LLF - UMR7110, UPCité),Ophélie Lacroix(UCPH)
Subjects: Computation and Language (cs.CL)
Keywords: annotation schemes, transition-based parser, evaluate parsing performances, syntactic constructions observed, specific syntactic constructions
Comments:
Click to view abstract
Abstract:We compare the performance of a transition-based parser with regard to different annotation schemes. We propose to convert some specific syntactic constructions observed in the universal dependency treebanks into a so-called more standard representation and to evaluate parsing performances over all the languages of the project. We show that the "standard" constructions do not lead systematically to better parsing performance and that the scores vary considerably according to the languages.
34. 【2503.07140】Application of Multiple Chain-of-Thought in Contrastive Reasoning for Implicit Sentiment Analysis
Link: https://arxiv.org/abs/2503.07140
Authors: Liwei Yang,Xinying Wang,Xiaotang Zhou,Zhengchao Wu,Ningning Tan
Subjects: Computation and Language (cs.CL)
Keywords: Reverse Chain Reasoning, Implicit sentiment analysis, Dual Reverse Chain, Implicit sentiment, Triple Reverse Chain
Comments:
Click to view abstract
Abstract:Implicit sentiment analysis aims to uncover emotions that are subtly expressed, often obscured by ambiguity and figurative language. To accomplish this task, large language models and multi-step reasoning are needed to identify those sentiments that are not explicitly stated. In this study, we propose a novel Dual Reverse Chain Reasoning (DRCR) framework to enhance the performance of implicit sentiment analysis. Inspired by deductive reasoning, the framework consists of three key steps: 1) hypothesize an emotional polarity and derive a reasoning process, 2) negate the initial hypothesis and derive a new reasoning process, and 3) contrast the two reasoning paths to deduce the final sentiment polarity. Building on this, we also introduce a Triple Reverse Chain Reasoning (TRCR) framework to address the limitations of random hypotheses. Both methods combine contrastive mechanisms and multi-step reasoning, significantly improving the accuracy of implicit sentiment classification. Experimental results demonstrate that both approaches outperform existing methods across various model scales, achieving state-of-the-art performance. This validates the effectiveness of combining contrastive reasoning and multi-step reasoning for implicit sentiment analysis.
35. 【2503.07129】ASTRA: A Negotiation Agent with Adaptive and Strategic Reasoning through Action in Dynamic Offer Optimization
Link: https://arxiv.org/abs/2503.07129
Authors: Deuksin Kwon,Jiwon Hae,Emma Clift,Daniel Shamsoddini,Jonathan Gratch,Gale M. Lucas
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: requires dynamically balancing, dynamically balancing self-interest, Negotiation requires dynamically, requires dynamically, dynamically balancing
Comments:
Click to view abstract
Abstract:Negotiation requires dynamically balancing self-interest and cooperation to maximize one's own utility. Yet, existing agents struggle due to bounded rationality in human data, low adaptability to counterpart behavior, and limited strategic reasoning. To address this, we introduce principle-driven negotiation agents, powered by ASTRA, a novel framework for turn-level offer optimization grounded in two core principles: opponent modeling and Tit-for-Tat reciprocity. ASTRA operates in three stages: (1) interpreting counterpart behavior, (2) optimizing counteroffers via a linear programming (LP) solver, and (3) selecting offers based on negotiation tactics and the partner's acceptance probability. Through simulations and human evaluations, our agent effectively adapts to an opponent's shifting stance and achieves favorable outcomes through enhanced adaptability and strategic reasoning. Beyond improving negotiation performance, it also serves as a powerful coaching tool, offering interpretable strategic feedback and optimal offer recommendations.
36. 【2503.07111】PoseLess: Depth-Free Vision-to-Joint Control via Direct Image Mapping with VLM
Link: https://arxiv.org/abs/2503.07111
Authors: Alan Dao(Gia Tuan Dao),Dinh Bach Vu,Tuan Le Duc Anh,Bui Quang Huy
Subjects: Robotics (cs.RO); Computation and Language (cs.CL)
Keywords: explicit pose estimation, paper introduces PoseLess, directly mapping, projected representations, paper introduces
Comments:
Click to view abstract
Abstract:This paper introduces PoseLess, a novel framework for robot hand control that eliminates the need for explicit pose estimation by directly mapping 2D images to joint angles using projected representations. Our approach leverages synthetic training data generated through randomized joint configurations, enabling zero-shot generalization to real-world scenarios and cross-morphology transfer from robotic to human hands. By projecting visual inputs and employing a transformer-based decoder, PoseLess achieves robust, low-latency control while addressing challenges such as depth ambiguity and data scarcity. Experimental results demonstrate competitive performance in joint angle prediction accuracy without relying on any human-labelled dataset.
37. 【2503.07094】A Novel Ophthalmic Benchmark for Evaluating Multimodal Large Language Models with Fundus Photographs and OCT Images
Link: https://arxiv.org/abs/2503.07094
Authors: Xiaoyi Liang,Mouxiao Bian,Moxin Chen,Lihao Liu,Junjun He,Jie Xu,Lin Li
Subjects: Computation and Language (cs.CL)
Keywords: large language models, demonstrated remarkable potential, multimodal large language, large language, recent years
Comments:
Click to view abstract
Abstract:In recent years, large language models (LLMs) have demonstrated remarkable potential across various medical applications. Building on this foundation, multimodal large language models (MLLMs) integrate LLMs with visual models to process diverse inputs, including clinical data and medical images. In ophthalmology, LLMs have been explored for analyzing optical coherence tomography (OCT) reports, assisting in disease classification, and even predicting treatment outcomes. However, existing MLLM benchmarks often fail to capture the complexities of real-world clinical practice, particularly in the analysis of OCT images. Many suffer from limitations such as small sample sizes, a lack of diverse OCT datasets, and insufficient expert validation. These shortcomings hinder the accurate assessment of MLLMs' ability to interpret OCT scans and their broader applicability in ophthalmology. Our dataset, curated through rigorous quality control and expert annotation, consists of 439 fundus images and 75 OCT images. Using a standardized API-based framework, we assessed seven mainstream MLLMs and observed significant variability in diagnostic accuracy across different diseases. While some models performed well in diagnosing conditions such as diabetic retinopathy and age-related macular degeneration, they struggled with others, including choroidal neovascularization and myopia, highlighting inconsistencies in performance and the need for further refinement. Our findings emphasize the importance of developing clinically relevant benchmarks to provide a more accurate assessment of MLLMs' capabilities. By refining these models and expanding their scope, we can enhance their potential to transform ophthalmic diagnosis and treatment.
38. 【2503.07078】Linguistic Knowledge Transfer Learning for Speech Enhancement
Link: https://arxiv.org/abs/2503.07078
Authors: Kuo-Hsuan Hung,Xugang Lu,Szu-Wei Fu,Huan-Hsin Tseng,Hsin-Yi Lin,Chii-Wann Lin,Yu Tsao
Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Keywords: spoken language comprehension, plays a crucial, crucial role, role in spoken, Linguistic knowledge plays
Comments: 11 pages, 6 figures
Click to view abstract
Abstract:Linguistic knowledge plays a crucial role in spoken language comprehension. It provides essential semantic and syntactic context for speech perception in noisy environments. However, most speech enhancement (SE) methods predominantly rely on acoustic features to learn the mapping relationship between noisy and clean speech, with limited exploration of linguistic integration. While text-informed SE approaches have been investigated, they often require explicit speech-text alignment or externally provided textual data, constraining their practicality in real-world scenarios. Additionally, using text as input poses challenges in aligning linguistic and acoustic representations due to their inherent differences. In this study, we propose the Cross-Modality Knowledge Transfer (CMKT) learning framework, which leverages pre-trained large language models (LLMs) to infuse linguistic knowledge into SE models without requiring text input or LLMs during inference. Furthermore, we introduce a misalignment strategy to improve knowledge transfer. This strategy applies controlled temporal shifts, encouraging the model to learn more robust representations. Experimental evaluations demonstrate that CMKT consistently outperforms baseline models across various SE architectures and LLM embeddings, highlighting its adaptability to different configurations. Additionally, results on Mandarin and English datasets confirm its effectiveness across diverse linguistic conditions, further validating its robustness. Moreover, CMKT remains effective even in scenarios without textual data, underscoring its practicality for real-world applications. By bridging the gap between linguistic and acoustic modalities, CMKT offers a scalable and innovative solution for integrating linguistic knowledge into SE models, leading to substantial improvements in both intelligibility and enhancement performance.
39. 【2503.07067】DistiLLM-2: A Contrastive Approach Boosts the Distillation of LLMs
链接:https://arxiv.org/abs/2503.07067
作者:Jongwoo Ko,Tianyi Chen,Sungnyun Kim,Tianyu Ding,Luming Liang,Ilya Zharkov,Se-Young Yun
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:prior work applies, work applies identical, identical loss functions, applies identical loss, large language models
备注: The code will be available soon at https://github.com/jongwooko/distillm-2
点击查看摘要
Abstract:Despite the success of distillation in large language models (LLMs), most prior work applies identical loss functions to both teacher- and student-generated data. These strategies overlook the synergy between loss formulations and data types, leading to a suboptimal performance boost in student models. To address this, we propose DistiLLM-2, a contrastive approach that simultaneously increases the likelihood of teacher responses and decreases that of student responses by harnessing this synergy. Our extensive experiments show that DistiLLM-2 not only builds high-performing student models across a wide range of tasks, including instruction-following and code generation, but also supports diverse applications, such as preference alignment and vision-language extensions. These findings highlight the potential of a contrastive approach to enhance the efficacy of LLM distillation by effectively aligning teacher and student models across varied data types.
40. 【2503.07044】DatawiseAgent: A Notebook-Centric LLM Agent Framework for Automated Data Science
链接:https://arxiv.org/abs/2503.07044
作者:Ziming You,Yumiao Zhang,Dexuan Xu,Yiwei Lou,Yandong Yan,Wei Wang,Huaming Zhang,Yu Huang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Data Science tasks, Data Science, automated data science, Science tasks, Finite State Transducer
备注:
点击查看摘要
Abstract:Data Science tasks are multifaceted, dynamic, and often domain-specific. Existing LLM-based approaches largely concentrate on isolated phases, neglecting the interdependent nature of many data science tasks and limiting their capacity for comprehensive end-to-end support. We propose DatawiseAgent, a notebook-centric LLM agent framework that unifies interactions among user, agent and the computational environment through markdown and executable code cells, supporting flexible and adaptive automated data science. Built on a Finite State Transducer (FST), DatawiseAgent orchestrates four stages: DFS-like planning, incremental execution, self-debugging, and post-filtering. Specifically, the DFS-like planning stage systematically explores the solution space, while incremental execution harnesses real-time feedback and accommodates LLMs' limited capabilities to progressively complete tasks. The self-debugging and post-filtering modules further enhance reliability by diagnosing and correcting errors and pruning extraneous information. Extensive experiments on diverse tasks, including data analysis, visualization, and data modeling, show that DatawiseAgent consistently outperforms or matches state-of-the-art methods across multiple model settings. These results highlight its potential to generalize across data science scenarios and lay the groundwork for more efficient, fully automated workflows.
41. 【2503.07041】TCM-3CEval: A Triaxial Benchmark for Assessing Responses from Large Language Models in Traditional Chinese Medicine
链接:https://arxiv.org/abs/2503.07041
作者:Tianai Huang,Lu Lu,Jiayuan Chen,Lihao Liu,Junjun He,Yuping Zhao,Wenchao Tang,Jie Xu
类目:Computation and Language (cs.CL)
关键词:traditional Chinese medicine, Large language models, Large language, NLP tasks, modern medicine
备注:
点击查看摘要
Abstract:Large language models (LLMs) excel in various NLP tasks and modern medicine, but their evaluation in traditional Chinese medicine (TCM) is underexplored. To address this, we introduce TCM-3CEval, a benchmark assessing LLMs in TCM across three dimensions: core knowledge mastery, classical text understanding, and clinical decision-making. We evaluate diverse models, including international (e.g., GPT-4o), Chinese (e.g., InternLM), and medical-specific (e.g., PLUSE). Results show a performance hierarchy: all models have limitations in specialized subdomains like Meridian Acupoint theory and Various TCM Schools, revealing gaps between current capabilities and clinical needs. Models with Chinese linguistic and cultural priors perform better in classical text interpretation and clinical reasoning. TCM-3CEval sets a standard for AI evaluation in TCM, offering insights for optimizing LLMs in culturally grounded medical domains. The benchmark is available on Medbench's TCM track, aiming to assess LLMs' TCM capabilities in basic knowledge, classic texts, and clinical decision-making through multidimensional questions and real cases.
42. 【2503.07036】Bot Wars Evolved: Orchestrating Competing LLMs in a Counterstrike Against Phone Scams
链接:https://arxiv.org/abs/2503.07036
作者:Nardine Basta,Conor Atkins,Dali Kaafar
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Bot Wars, Language Models, Large Language, counter phone scams
备注:
点击查看摘要
Abstract:We present "Bot Wars," a framework that uses Large Language Model (LLM) scam-baiters to counter phone scams through simulated adversarial dialogues. Our key contribution is a formal foundation for strategy emergence through chain-of-thought reasoning without explicit optimization. Through a novel two-layer prompt architecture, our framework enables LLMs to craft demographically authentic victim personas while maintaining strategic coherence. We evaluate our approach using a dataset of 3,200 scam dialogues validated against 179 hours of human scam-baiting interactions, demonstrating its effectiveness in capturing complex adversarial dynamics. Our systematic evaluation through cognitive, quantitative, and content-specific metrics shows that GPT-4 excels in dialogue naturalness and persona authenticity, while Deepseek demonstrates superior engagement sustainability.
43. 【2503.07032】Multimodal Human-AI Synergy for Medical Imaging Quality Control: A Hybrid Intelligence Framework with Adaptive Dataset Curation and Closed-Loop Evaluation
链接:https://arxiv.org/abs/2503.07032
作者:Zhi Qin,Qianhui Gui,Mouxiao Bian,Rui Wang,Hong Ge,Dandan Yao,Ziying Sun,Yuan Zhao,Yu Zhang,Hui Shi,Dongdong Wang,Chenxin Song,Shenghong Ju,Lihao Liu,Junjun He,Jie Xu,Yuan-Cheng Wang
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:methods remain labor-intensive, imaging quality control, Medical imaging quality, Medical imaging, accurate diagnosis
备注:
点击查看摘要
Abstract:Medical imaging quality control (QC) is essential for accurate diagnosis, yet traditional QC methods remain labor-intensive and subjective. To address this challenge, in this study, we establish a standardized dataset and evaluation framework for medical imaging QC, systematically assessing large language models (LLMs) in image quality assessment and report standardization. Specifically, we first constructed and anonymized a dataset of 161 chest X-ray (CXR) radiographs and 219 CT reports for evaluation. Then, multiple LLMs, including Gemini 2.0-Flash, GPT-4o, and DeepSeek-R1, were evaluated based on recall, precision, and F1 score to detect technical errors and inconsistencies. Experimental results show that Gemini 2.0-Flash achieved a Macro F1 score of 90 in CXR tasks, demonstrating strong generalization but limited fine-grained performance. DeepSeek-R1 excelled in CT report auditing with a 62.23% recall rate, outperforming other models. However, its distilled variants performed poorly, while InternLM2.5-7B-chat exhibited the highest additional discovery rate, indicating broader but less precise error detection. These findings highlight the potential of LLMs in medical imaging QC, with DeepSeek-R1 and Gemini 2.0-Flash demonstrating superior performance.
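The abstract above reports recall, precision, and a Macro F1 score. As a rough illustration of how a macro-averaged F1 is usually computed (per-class F1, then an unweighted mean), here is a minimal sketch. The function name `macro_f1` and the example labels are hypothetical and are not taken from the paper, which does not describe its label set here.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (a generic sketch, not the paper's code)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical labels, for illustration only.
truth = ["ok", "ok", "error", "error"]
preds = ["ok", "error", "error", "error"]
score = macro_f1(truth, preds)
```

The result is on a 0 to 1 scale; a figure such as "90" in the abstract presumably corresponds to the same quantity expressed on a 0 to 100 scale.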
44. 【2503.07018】Toward Multi-Session Personalized Conversation: A Large-Scale Dataset and Hierarchical Tree Framework for Implicit Reasoning
链接:https://arxiv.org/abs/2503.07018
作者:Xintong Li,Jalend Bantupalli,Ria Dharmani,Yuwei Zhang,Jingbo Shang
类目:Computation and Language (cs.CL)
关键词:generate responses based, large language models, conversational agents, large language, agents to generate
备注: Preprint
点击查看摘要
Abstract:There has been a surge in the use of large language model (LLM) conversational agents to generate responses based on long-term history from multiple sessions. However, existing long-term open-domain dialogue datasets lack complex, real-world personalization and fail to capture implicit reasoning, where relevant information is embedded in subtle, syntactic, or semantically distant connections rather than explicit statements. In such cases, traditional retrieval methods fail to capture relevant context, and long-context modeling also becomes inefficient due to numerous complicated persona-related details. To address this gap, we introduce ImplexConv, a large-scale long-term dataset with 2,500 examples, each containing approximately 100 conversation sessions, designed to study implicit reasoning in personalized dialogues. Additionally, we propose TaciTree, a novel hierarchical tree framework that structures conversation history into multiple levels of summarization. Instead of brute-force searching all data, TaciTree enables an efficient, level-based retrieval process where models refine their search by progressively selecting relevant details. Our experiments demonstrate that TaciTree significantly improves the ability of LLMs to reason over long-term conversations with implicit contextual dependencies.
45. 【2503.07010】ProjectEval: A Benchmark for Programming Agents Automated Evaluation on Project-Level Code Generation
链接:https://arxiv.org/abs/2503.07010
作者:Kaiyuan Liu,Youcheng Pan,Jing Li,Daojing He,Yang Xiang,Yexing Du,Tianrun Gao
类目:Software Engineering (cs.SE); Computation and Language (cs.CL)
关键词:made rapid progress, LLM, LLM agents, Recently, made rapid
备注: 17 pages (9 Appendix pages), 4 figures, 7 tables
点击查看摘要
Abstract:Recently, LLM agents have made rapid progress in improving their programming capabilities. However, existing benchmarks cannot automatically evaluate generated code from the user's perspective, and they offer little explainability for the results of LLM agents' code generation. Thus, we introduce ProjectEval, a new benchmark for the automated evaluation of LLM agents' project-level code generation by simulating user interaction. ProjectEval is constructed by an LLM with human review. It provides inputs at three different levels, given as natural language or code skeletons. ProjectEval can evaluate the generated projects through user-interaction simulation for execution, and through code similarity using existing objective indicators. Through ProjectEval, we find that systematic engineering of project code, overall understanding of the project, and comprehensive analysis capability are the keys for LLM agents to complete practical projects. Our findings and benchmark provide valuable insights for developing more effective programming agents that can be deployed in future real-world production.
46. 【2503.07003】Large Language Models Often Say One Thing and Do Another
链接:https://arxiv.org/abs/2503.07003
作者:Ruoxi Xu,Hongyu Lin,Xianpei Han,Jia Zheng,Weixiang Zhou,Le Sun,Yingfei Sun
类目:Computation and Language (cs.CL)
关键词:large language models, diverse user populations, language models, increasingly become central, user populations
备注: Published on ICLR 2025
点击查看摘要
Abstract:As large language models (LLMs) increasingly become central to various applications and interact with diverse user populations, ensuring their reliable and consistent performance is becoming more important. This paper explores a critical issue in assessing the reliability of LLMs: the consistency between their words and deeds. To quantitatively explore this consistency, we developed a novel evaluation benchmark called the Words and Deeds Consistency Test (WDCT). The benchmark establishes a strict correspondence between word-based and deed-based questions across different domains, including opinion vs. action, non-ethical value vs. action, ethical value vs. action, and theory vs. application. The evaluation results reveal a widespread inconsistency between words and deeds across different LLMs and domains. Subsequently, we conducted experiments with either word alignment or deed alignment to observe their impact on the other aspect. The experimental results indicate that alignment only on words or deeds poorly and unpredictably influences the other aspect. This supports our hypothesis that the underlying knowledge guiding LLMs' word or deed choices is not contained within a unified space.
47. 【2503.06987】Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations
链接:https://arxiv.org/abs/2503.06987
作者:Jiho Jin,Woosung Kang,Junho Myung,Alice Oh
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Measuring social bias, large language models, Measuring social, evaluation methods struggle, existing bias evaluation
备注:
点击查看摘要
Abstract:Measuring social bias in large language models (LLMs) is crucial, but existing bias evaluation methods struggle to assess bias in long-form generation. We propose a Bias Benchmark for Generation (BBG), an adaptation of the Bias Benchmark for QA (BBQ), designed to evaluate social bias in long-form generation by having LLMs generate continuations of story prompts. Building our benchmark in English and Korean, we measure the probability of neutral and biased generations across ten LLMs. We also compare our long-form story generation evaluation results with multiple-choice BBQ evaluation, showing that the two approaches produce inconsistent results.
48. 【2503.06980】Exploring Multimodal Perception in Large Language Models Through Perceptual Strength Ratings
链接:https://arxiv.org/abs/2503.06980
作者:Jonghyun Lee,Dojun Park,Jiwoo Lee,Hoekeon Choi,Sung-Eun Lee
类目:Computation and Language (cs.CL)
关键词:perceptual strength ratings, large language models, Utilizing perceptual strength, capture human-like perceptual, perceptual strength
备注: under review, 15 pages
点击查看摘要
Abstract:This study investigated the multimodal perception of large language models (LLMs), focusing on their ability to capture human-like perceptual strength ratings across sensory modalities. Utilizing perceptual strength ratings as a benchmark, the research compared GPT-3.5, GPT-4, GPT-4o, and GPT-4o-mini, highlighting the influence of multimodal inputs on grounding and linguistic reasoning. While GPT-4 and GPT-4o demonstrated strong alignment with human evaluations and significant advancements over smaller models, qualitative analyses revealed distinct differences in processing patterns, such as multisensory overrating and reliance on loose semantic associations. Despite integrating multimodal capabilities, GPT-4o did not exhibit superior grounding compared to GPT-4, raising questions about their role in improving human-like grounding. These findings underscore how LLMs' reliance on linguistic patterns can both approximate and diverge from human embodied cognition, revealing limitations in replicating sensory experiences.
49. 【2503.06950】CtrlRAG: Black-box Adversarial Attacks Based on Masked Language Models in Retrieval-Augmented Language Generation
链接:https://arxiv.org/abs/2503.06950
作者:Runqi Sui
类目:Computation and Language (cs.CL)
关键词:Large Language Models, enhance Large Language, systems enhance Large, Retrieval-Augmented Generation, external knowledge bases
备注:
点击查看摘要
Abstract:Retrieval-Augmented Generation (RAG) systems enhance Large Language Models (LLMs) by integrating external knowledge bases. However, this integration introduces a new security threat: adversaries can exploit the retrieval mechanism to inject malicious content into the knowledge base, thereby influencing the generated responses. Based on this attack vector, we propose CtrlRAG, a novel attack method designed for RAG systems in the black-box setting, which aligns with real-world scenarios. Unlike existing attack methods, CtrlRAG introduces a perturbation mechanism that uses a Masked Language Model (MLM) to dynamically optimize malicious content in response to changes in the retrieved context. Experimental results demonstrate that CtrlRAG outperforms three baseline methods in both Emotional Manipulation and Hallucination Amplification objectives. Furthermore, we evaluate three existing defense mechanisms, revealing their limited effectiveness against CtrlRAG and underscoring the urgent need for more robust defenses.
50. 【2503.06949】Lshan-1.0 Technical Report
链接:https://arxiv.org/abs/2503.06949
作者:Haotian Chen,Yanyu Xu,Boyan Wang,Chaoyue Zhao,Xiaoyu Han,Fang Wang,Lizhen Cui,Yonghui Xu
类目:Computation and Language (cs.CL)
关键词:highly specialized Chinese, specialized Chinese legal, meet diverse realistic, Chinese legal domain, specialized Chinese
备注:
点击查看摘要
Abstract:In this report, we introduce our first-generation reasoning model, Lshan-1.0, a large language model designed for the highly specialized Chinese legal domain, offering comprehensive capabilities to meet diverse realistic needs. Existing legal LLMs face two primary challenges. Firstly, their design and evaluation are predominantly driven by computer science perspectives, leading to insufficient incorporation of legal expertise and logic, which is crucial for high-precision legal applications, such as handling complex prosecutorial tasks. Secondly, these models often underperform due to a lack of comprehensive training data from the legal domain, limiting their ability to effectively address real-world legal scenarios. To address these challenges, we first compile millions of legal documents covering over 20 types of crimes from 31 provinces in China for model training. From this extensive dataset, we further select high-quality data for supervised fine-tuning, ensuring enhanced relevance and precision. The model further undergoes large-scale reinforcement learning without additional supervision, emphasizing the enhancement of its reasoning capabilities and explainability. To validate its effectiveness in complex legal applications, we also conduct human evaluations with legal experts. We develop fine-tuned models based on DeepSeek-R1-Distilled versions, available in three dense configurations: 14B, 32B, and 70B.
51. 【2503.06926】Effect of Selection Format on LLM Performance
链接:https://arxiv.org/abs/2503.06926
作者:Yuchen Han,Yucheng Wu,Jeffrey Willard
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
关键词:large language model, LLM, classification task options, model performance, paper investigates
备注:
点击查看摘要
Abstract:This paper investigates a critical aspect of large language model (LLM) performance: the optimal formatting of classification task options in prompts. Through an extensive experimental study, we compared two selection formats -- bullet points and plain English -- to determine their impact on model performance. Our findings suggest that presenting options via bullet points generally yields better results, although there are some exceptions. Furthermore, our research highlights the need for continued exploration of option formatting to drive further improvements in model performance.
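To make the two selection formats in the abstract above concrete, the following minimal sketch renders the same classification options once as bullet points and once as a plain-English sentence. The function names, the exact wording of the plain-English template, and the example question are all assumptions for illustration; the paper's actual prompts are not shown here.

```python
def format_options_bullets(question, options):
    """Render the options as a bullet list under the question (illustrative template)."""
    lines = [question] + [f"- {opt}" for opt in options]
    return "\n".join(lines)


def format_options_plain(question, options):
    """Render the options inline as a plain-English sentence (illustrative template)."""
    if len(options) == 1:
        joined = options[0]
    else:
        joined = ", ".join(options[:-1]) + " or " + options[-1]
    return f"{question} Choose one of: {joined}."


# Hypothetical classification task, for illustration only.
question = "What is the sentiment of this review?"
options = ["positive", "neutral", "negative"]
print(format_options_bullets(question, options))
print(format_options_plain(question, options))
```

Keeping the question text identical and varying only the option layout is one simple way to isolate the effect of formatting when comparing model outputs.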
52. 【2503.06924】Automatic Speech Recognition for Non-Native English: Accuracy and Disfluency Handling
链接:https://arxiv.org/abs/2503.06924
作者:Michael McGuire
类目:Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
关键词:computer assisted language, assisted language testing, computer assisted, assisted language learning, Automatic speech recognition
备注: 33 pages, 10 figures
点击查看摘要
Abstract:Automatic speech recognition (ASR) has been an essential component of computer assisted language learning (CALL) and computer assisted language testing (CALT) for many years. As this technology continues to develop rapidly, it is important to evaluate the accuracy of current ASR systems for language learning applications. This study assesses five cutting-edge ASR systems' recognition of non-native accented English speech using recordings from the L2-ARCTIC corpus, featuring speakers from six different L1 backgrounds (Arabic, Chinese, Hindi, Korean, Spanish, and Vietnamese), in the form of both read and spontaneous speech. The read speech consisted of 2,400 single sentence recordings from 24 speakers, while the spontaneous speech included narrative recordings from 22 speakers. Results showed that for read speech, Whisper and AssemblyAI achieved the best accuracy with mean Match Error Rates (MER) of 0.054 and 0.056 respectively, approaching human-level accuracy. For spontaneous speech, RevAI performed best with a mean MER of 0.063. The study also examined how each system handled disfluencies such as filler words, repetitions, and revisions, finding significant variation in performance across systems and disfluency types. While processing speed varied considerably between systems, longer processing times did not necessarily correlate with better accuracy. By detailing the performance of several of the most recent, widely-available ASR systems on non-native English speech, this study aims to help language instructors and researchers understand the strengths and weaknesses of each system and identify which may be suitable for specific use cases.
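The abstract above reports Match Error Rate (MER). MER is commonly defined as (S + D + I) / (H + S + D + I), where H, S, D, and I are the hits, substitutions, deletions, and insertions in a word-level alignment between the reference and the ASR output. The sketch below is a simplified pure-Python version under that definition. The function name, the whitespace tokenization, and the tie-breaking rule (prefer alignments with more hits) are assumptions, and the study's own scoring tool and text normalization may differ.

```python
def match_error_rate(reference, hypothesis):
    """MER = errors / (errors + hits) over a minimum-error word alignment (a sketch)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] holds (errors, hits) for the best alignment of ref[:i] with hyp[:j].
    dp = [[(0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0)  # only deletions
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0)  # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            errors, hits = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (errors, hits + 1)  # hit
            else:
                diagonal = (errors + 1, hits)  # substitution
            deletion = (dp[i - 1][j][0] + 1, dp[i - 1][j][1])
            insertion = (dp[i][j - 1][0] + 1, dp[i][j - 1][1])
            # Fewest errors first; among ties, prefer more hits (an assumption).
            dp[i][j] = min(diagonal, deletion, insertion, key=lambda t: (t[0], -t[1]))
    errors, hits = dp[-1][-1]
    total = errors + hits
    return errors / total if total else 0.0


# Hypothetical transcripts, for illustration only.
mer = match_error_rate("the cat sat on the mat", "the cat sat on a mat")
```

Unlike word error rate, MER is bounded between 0 and 1 because insertions are counted in the denominator as well as the numerator.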
53. 【2503.06899】KwaiChat: A Large-Scale Video-Driven Multilingual Mixed-Type Dialogue Corpus
链接:https://arxiv.org/abs/2503.06899
作者:Xiaoming Shi,Zeming Liu,Yiming Lei,Chenkai Zhang,Haitao Leng,Chuan Wang,Qingjie Liu,Wanxiang Che,Shaoguo Liu,Size Li,Yunhong Wang
类目:Computation and Language (cs.CL)
关键词:garnering growing interest, Video-based dialogue systems, Video-based dialogue, education assistants, growing interest
备注:
点击查看摘要
Abstract:Video-based dialogue systems, such as education assistants, have compelling application value, thereby garnering growing interest. However, the current video-based dialogue systems are limited by their reliance on a single dialogue type, which hinders their versatility in practical applications across a range of scenarios, including question-answering, emotional dialog, etc. In this paper, we identify this challenge as how to generate video-driven multilingual mixed-type dialogues. To mitigate this challenge, we propose a novel task and create a human-to-human video-driven multilingual mixed-type dialogue corpus, termed KwaiChat, containing a total of 93,209 videos and 246,080 dialogues, across 4 dialogue types, 30 domains, 4 languages, and 13 topics. Additionally, we establish baseline models on KwaiChat. An extensive analysis of 7 distinct LLMs on KwaiChat reveals that GPT-4o achieves the best performance but still cannot perform well in this situation even with the help of in-context learning and fine-tuning, which indicates that the task is not trivial and needs further research.
54. 【2503.06888】A LongFormer-Based Framework for Accurate and Efficient Medical Text Summarization
链接:https://arxiv.org/abs/2503.06888
作者:Dan Sun,Jacky He,Hanlu Zhang,Zhen Qi,Hongye Zheng,Xiaokai Wang
类目:Computation and Language (cs.CL)
关键词:aimed at addressing, summarization method based, paper proposes, addressing the challenges, challenges faced
备注: Paper accepted by 2025 8th International Conference on Advanced Algorithms and Control Engineering (ICAACE 2025)
点击查看摘要
Abstract:This paper proposes a medical text summarization method based on LongFormer, aimed at addressing the challenges faced by existing models when processing long medical texts. Traditional summarization methods are often limited by short-term memory, leading to information loss or reduced summary quality in long texts. LongFormer, by introducing long-range self-attention, effectively captures long-range dependencies in the text, retaining more key information and improving the accuracy and information retention of summaries. Experimental results show that the LongFormer-based model outperforms traditional models, such as RNN, T5, and BERT in automatic evaluation metrics like ROUGE. It also receives high scores in expert evaluations, particularly excelling in information retention and grammatical accuracy. However, there is still room for improvement in terms of conciseness and readability. Some experts noted that the generated summaries contain redundant information, which affects conciseness. Future research will focus on further optimizing the model structure to enhance conciseness and fluency, achieving more efficient medical text summarization. As medical data continues to grow, automated summarization technology will play an increasingly important role in fields such as medical research, clinical decision support, and knowledge management.
55. 【2503.06868】Lost-in-the-Middle in Long-Text Generation: Synthetic Dataset, Evaluation Framework, and Mitigation
链接:https://arxiv.org/abs/2503.06868
作者:Junhao Zhang,Richong Zhang,Fanshuang Kong,Ziyang Miao,Yanhan Ye,Yaowei Zheng
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:producing lengthy texts, generation methods primarily, methods primarily concentrate, Existing long-text generation, neglecting the long-input
备注:
点击查看摘要
Abstract:Existing long-text generation methods primarily concentrate on producing lengthy texts from short inputs, neglecting the long-input and long-output tasks. Such tasks have numerous practical applications while lacking available benchmarks. Moreover, as the input grows in length, existing methods inevitably encounter the "lost-in-the-middle" phenomenon. In this paper, we first introduce a Long Input and Output Benchmark (LongInOutBench), including a synthetic dataset and a comprehensive evaluation framework, addressing the challenge of the missing benchmark. We then develop the Retrieval-Augmented Long-Text Writer (RAL-Writer), which retrieves and restates important yet overlooked content, mitigating the "lost-in-the-middle" issue by constructing explicit prompts. We finally employ the proposed LongInOutBench to evaluate our RAL-Writer against comparable baselines, and the results demonstrate the effectiveness of our approach. Our code has been released at this https URL.
56. 【2503.06861】Enhanced Multi-Tuple Extraction for Alloys: Integrating Pointer Networks and Augmented Attention
链接:https://arxiv.org/abs/2503.06861
作者:Mengzhe Hei,Zhouran Zhang,Qingbao Liu,Yan Pan,Xiang Zhao,Yongqian Peng,Yicong Ye,Xin Zhang,Shuxin Bai
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Extracting high-quality structured, advancing material design, Extracting high-quality, scientific literature, scientific literature remain
备注: 17 pages, 5 figures
点击查看摘要
Abstract:Extracting high-quality structured information from scientific literature is crucial for advancing material design through data-driven methods. Despite the considerable research in natural language processing for dataset extraction, effective approaches for multi-tuple extraction in scientific literature remain scarce due to the complex interrelations of tuples and contextual ambiguities. In this study, we illustrate the multi-tuple extraction of mechanical properties from multi-principal-element alloys and present a novel framework that combines an entity extraction model based on MatSciBERT with pointer networks and an allocation model utilizing inter- and intra-entity attention. Our rigorous experiments on tuple extraction demonstrate impressive F1 scores of 0.963, 0.947, 0.848, and 0.753 across datasets with 1, 2, 3, and 4 tuples, confirming the effectiveness of the model. Furthermore, an F1 score of 0.854 was achieved on a randomly curated dataset. These results highlight the model's capacity to deliver precise and structured information, offering a robust alternative to large language models and equipping researchers with essential data for fostering data-driven innovations.
57. 【2503.06794】Silent Hazards of Token Reduction in Vision-Language Models: The Hidden Impact on Consistency
链接:https://arxiv.org/abs/2503.06794
作者:Yizheng Sun,Hao Li,Chang Xu,Chenghua Lin,Riza Batista-Navarro,Jingyuan Sun
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:Vision language models, Vision language, incur high computational, token reduction, incur high
备注:
点击查看摘要
Abstract:Vision language models (VLMs) have excelled in visual reasoning but often incur high computational costs. One key reason is the redundancy of visual tokens. Although recent token reduction methods claim to achieve minimal performance loss, our extensive experiments reveal that token reduction can substantially alter a model's output distribution, leading to changes in prediction patterns that standard metrics such as accuracy loss do not fully capture. Such inconsistencies are especially concerning for practical applications where system stability is critical. To investigate this phenomenon, we analyze how token reduction influences the energy distribution of a VLM's internal representations using a lower-rank approximation via Singular Value Decomposition (SVD). Our results show that changes in the Inverse Participation Ratio of the singular value spectrum are strongly correlated with the model's consistency after token reduction. Based on these insights, we propose LoFi--a training-free visual token reduction method that utilizes the leverage score from SVD for token pruning. Experimental evaluations demonstrate that LoFi not only reduces computational costs with minimal performance degradation but also significantly outperforms state-of-the-art methods in terms of output consistency.
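The abstract above relates consistency to the Inverse Participation Ratio (IPR) of the singular value spectrum. As a rough illustration, the sketch below computes an IPR from a list of singular values by treating the squared singular values as an energy distribution. This normalization is an assumption, since the abstract does not state the paper's exact definition, and the function name is hypothetical. In practice the singular values would come from an SVD routine applied to a representation matrix.

```python
def inverse_participation_ratio(singular_values):
    """IPR = sum(p_i ** 2), with p_i = s_i ** 2 / sum_j s_j ** 2 (assumed normalization)."""
    energies = [s * s for s in singular_values]
    total = sum(energies)
    if total == 0:
        raise ValueError("singular values must not all be zero")
    return sum((e / total) ** 2 for e in energies)


# Hypothetical spectra, for illustration only.
flat_spectrum = inverse_participation_ratio([1.0, 1.0, 1.0, 1.0])
peaked_spectrum = inverse_participation_ratio([3.0, 0.0, 0.0])
```

Under this definition the IPR ranges from 1/n, when energy is spread evenly over n singular values, up to 1, when a single singular value carries all of the energy.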
58. 【2503.06792】On the Mutual Influence of Gender and Occupation in LLM Representations
链接:https://arxiv.org/abs/2503.06792
作者:Haozhe An,Connor Baumler,Abhilasha Sancheti,Rachel Rudinger
类目:Computation and Language (cs.CL)
关键词:examine LLM representations, occupational contexts, first-name gender representations, gender representations, examine LLM
备注: In submission
点击查看摘要
Abstract:We examine LLM representations of gender for first names in various occupational contexts to study how occupations and the gender perception of first names in LLMs influence each other mutually. We find that LLMs' first-name gender representations correlate with real-world gender statistics associated with the name, and are influenced by the co-occurrence of stereotypically feminine or masculine occupations. Additionally, we study the influence of first-name gender representations on LLMs in a downstream occupation prediction task and their potential as an internal metric to identify extrinsic model biases. While feminine first-name embeddings often raise the probabilities for female-dominated jobs (and vice versa for male-dominated jobs), reliably using these internal gender representations for bias detection remains challenging.
59. 【2503.06781】Dr Genre: Reinforcement Learning from Decoupled LLM Feedback for Generic Text Rewriting
链接:https://arxiv.org/abs/2503.06781
作者:Yufei Li,John Nham,Ganesh Jawahar,Lei Shu,David Uthus,Yun-Hsuan Sung,Chengrun Yang,Itai Rolnick,Yi Qiao,Cong Liu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:prevalent large language, covers diverse real-world, large language model, fact correction, application that covers
备注: 29 pages, 4 figures, 25 tables
点击查看摘要
Abstract:Generic text rewriting is a prevalent large language model (LLM) application that covers diverse real-world tasks, such as style transfer, fact correction, and email editing. These tasks vary in rewriting objectives (e.g., factual consistency vs. semantic preservation), making it challenging to develop a unified model that excels across all dimensions. Existing methods often specialize in either a single task or a specific objective, limiting their generalizability. In this work, we introduce a generic model proficient in factuality, stylistic, and conversational rewriting tasks. To simulate real-world user rewrite requests, we construct a conversational rewrite dataset, ChatRewrite, that presents "natural"-sounding instructions, from raw emails using LLMs. Combined with other popular rewrite datasets, including LongFact for the factuality rewrite task and RewriteLM for the stylistic rewrite task, this forms a broad benchmark for training and evaluating generic rewrite models. To align with task-specific objectives, we propose Dr Genre, a Decoupled-reward learning framework for Generic rewriting, that utilizes objective-oriented reward models with a task-specific weighting. Evaluation shows that Dr Genre delivers higher-quality rewrites across all targeted tasks, improving objectives including instruction following (agreement), internal consistency (coherence), and minimal unnecessary edits (conciseness).
60. 【2503.06778】Large Language Models Are Effective Human Annotation Assistants, But Not Good Independent Annotators
链接:https://arxiv.org/abs/2503.06778
作者:Feng Gu,Zongxia Li,Carlos Rafael Colon,Benjamin Evans,Ishani Mondal,Jordan Lee Boyd-Graber
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:understanding sociological trends, Event Set Curation, monitoring breaking, sociological trends, important for identifying
备注: 9 pages, 4 figures
点击查看摘要
Abstract:Event annotation is important for identifying market changes, monitoring breaking news, and understanding sociological trends. Although expert annotators set the gold standards, human coding is expensive and inefficient. Unlike information extraction experiments that focus on single contexts, we evaluate a holistic workflow that removes irrelevant documents, merges documents about the same event, and annotates the events. Although LLM-based automated annotations are better than traditional TF-IDF-based methods or Event Set Curation, they are still not reliable annotators compared to human experts. However, adding LLMs to assist experts for Event Set Curation can reduce the time and mental effort required for Variable Annotation. When LLMs extract event variables to assist expert annotators, the experts agree more with the extracted variables than with fully automated LLM annotations.
61. 【2503.06765】Effectiveness of Zero-shot-CoT in Japanese Prompts
链接:https://arxiv.org/abs/2503.06765
作者:Shusuke Takayama,Ian Frank
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Language Understanding Benchmark, Multi-task Language Understanding, Japanese Multi-task Language, Understanding Benchmark, Language Understanding
备注: NLP2025 Workshop on Japanese Language Resources (JLR2025)
点击查看摘要
Abstract:We compare the effectiveness of zero-shot Chain-of-Thought (CoT) prompting in Japanese and English using ChatGPT-3.5 and 4o-mini. The technique of zero-shot CoT, which involves appending a phrase such as "Let's think step by step" to a prompt to encourage reasoning before answering, has been shown to offer LLM performance improvements in mathematical and reasoning tasks, particularly in English. We investigate how these effects transfer to Japanese using the Japanese Multi-task Language Understanding Benchmark (JMMLU) and the Multi-task Language Understanding Benchmark (MMLU). Our results show that while zero-shot CoT prompting can lead to notable performance gains for some prompt categories in GPT-3.5, its impact in GPT-4o-mini is associated with significant performance declines. However, for Japanese prompts there remain certain categories, such as college mathematics and abstract algebra, that still exhibit improvements, despite the broader trend of diminishing effectiveness in more advanced models.
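The zero-shot CoT technique described in this abstract amounts to appending a trigger phrase to the prompt. A minimal sketch follows; the function name `zero_shot_cot` and the newline separator are illustrative assumptions, not the authors' evaluation code.

```python
def zero_shot_cot(prompt: str, trigger: str = "Let's think step by step.") -> str:
    """Append a zero-shot CoT trigger phrase to a prompt.

    Hypothetical helper: the abstract only says a phrase such as
    "Let's think step by step" is appended; the separator is an assumption.
    """
    # Strip trailing whitespace so the trigger always starts on its own line.
    return f"{prompt.rstrip()}\n{trigger}"


if __name__ == "__main__":
    print(zero_shot_cot("What is 17 * 6?"))
```

For Japanese prompts, a translated trigger phrase could be passed through the `trigger` argument; the exact wording used in the paper is not given in the abstract.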
62. 【2503.06749】Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
链接:https://arxiv.org/abs/2503.06749
作者:Wenxuan Huang,Bohan Jia,Zijie Zhai,Shaosheng Cao,Zheyu Ye,Fei Zhao,Yao Hu,Shaohui Lin
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Reinforcement Learning, purely through Reinforcement, successfully demonstrated, demonstrated the emergence, LLMs purely
备注:
点击查看摘要
Abstract:DeepSeek-R1-Zero has successfully demonstrated the emergence of reasoning capabilities in LLMs purely through Reinforcement Learning (RL). Inspired by this breakthrough, we explore how RL can be utilized to enhance the reasoning capability of MLLMs. However, direct training with RL struggles to activate complex reasoning capabilities such as questioning and reflection in MLLMs, due to the absence of substantial high-quality multimodal reasoning data. To address this issue, we propose the reasoning MLLM, Vision-R1, to improve multimodal reasoning capability. Specifically, we first construct a high-quality multimodal CoT dataset without human annotations by leveraging an existing MLLM and DeepSeek-R1 through modality bridging and data filtering to obtain a 200K multimodal CoT dataset, Vision-R1-cold dataset. It serves as cold-start initialization data for Vision-R1. To mitigate the optimization challenges caused by overthinking after cold start, we propose Progressive Thinking Suppression Training (PTST) strategy and employ Group Relative Policy Optimization (GRPO) with the hard formatting result reward function to gradually refine the model's ability to learn correct and complex reasoning processes on a 10K multimodal math dataset. Comprehensive experiments show our model achieves an average improvement of ~6% across various multimodal math reasoning benchmarks. Vision-R1-7B achieves a 73.5% accuracy on the widely used MathVista benchmark, which is only 0.4% lower than the leading reasoning model, OpenAI O1. The datasets and code will be released in: this https URL.
63. 【2503.06734】Gender Encoding Patterns in Pretrained Language Model Representations
链接:https://arxiv.org/abs/2503.06734
作者:Mahdi Zakizadeh,Mohammad Taher Pilehvar
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:poses significant social, poses significant, ethical challenges, significant social, social and ethical
备注: Proceedings of the 5th Workshop on Trustworthy Natural Language Processing (TrustNLP 2025)
点击查看摘要
Abstract:Gender bias in pretrained language models (PLMs) poses significant social and ethical challenges. Despite growing awareness, there is a lack of comprehensive investigation into how different models internally represent and propagate such biases. This study adopts an information-theoretic approach to analyze how gender biases are encoded within various encoder-based architectures. We focus on three key aspects: identifying how models encode gender information and biases, examining the impact of bias mitigation techniques and fine-tuning on the encoded biases and their effectiveness, and exploring how model design differences influence the encoding of biases. Through rigorous and systematic investigation, our findings reveal a consistent pattern of gender encoding across diverse models. Surprisingly, debiasing techniques often exhibit limited efficacy, sometimes inadvertently increasing the encoded bias in internal representations while reducing bias in model output distributions. This highlights a disconnect between mitigating bias in output distributions and addressing its internal representations. This work provides valuable guidance for advancing bias mitigation strategies and fostering the development of more equitable language models.
64. 【2503.06724】Topology of Syntax Networks across Languages
链接:https://arxiv.org/abs/2503.06724
作者:Juan Soria-Postigo,Luis F Seoane
类目:Computation and Language (cs.CL)
关键词:syntax networks
备注: Final Thesis for MSc in Computational and Applied Mathematics at UC3M
点击查看摘要
Abstract:Syntax connects words to each other in very specific ways. Two words are syntactically connected if they depend directly on each other. Syntactic connections usually happen within a sentence. Gathering all those connections across several sentences gives birth to syntax networks. Earlier studies in the field have analysed the structure and properties of syntax networks trying to find clusters/phylogenies of languages that share similar network features. The results obtained in those studies will be put to test in this thesis by increasing both the number of languages and the number of properties considered in the analysis. Besides that, language networks of particular languages will be inspected in depth by means of a novel network analysis [25]. Words (nodes of the network) will be clustered into topological communities whose members share similar features. The properties of each of these communities will be thoroughly studied along with the Part of Speech (grammatical class) of each word. Results across different languages will also be compared in an attempt to discover universally preserved structural patterns across syntax networks.
65. 【2503.06709】Delusions of Large Language Models
链接:https://arxiv.org/abs/2503.06709
作者:Hongshen Xu,Zixv yang,Zichen Zhu,Kunyao Lan,Zihan Wang,Mengyue Wu,Ziwei Ji,Lu Chen,Pascale Fung,Kai Yu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:Large Language Models, Large Language, generate factually incorrect, Language Models, generate factually
备注:
点击查看摘要
Abstract:Large Language Models often generate factually incorrect but plausible outputs, known as hallucinations. We identify a more insidious phenomenon, LLM delusion, defined as high belief hallucinations, incorrect outputs with abnormally high confidence, making them harder to detect and mitigate. Unlike ordinary hallucinations, delusions persist with low uncertainty, posing significant challenges to model reliability. Through empirical analysis across different model families and sizes on several Question Answering tasks, we show that delusions are prevalent and distinct from hallucinations. LLMs exhibit lower honesty with delusions, which are harder to override via finetuning or self reflection. We link delusion formation with training dynamics and dataset noise and explore mitigation strategies such as retrieval augmented generation and multi agent debating to mitigate delusions. By systematically investigating the nature, prevalence, and mitigation of LLM delusions, our study provides insights into the underlying causes of this phenomenon and outlines future directions for improving model reliability.
66. 【2503.06708】Alignment for Efficient Tool Calling of Large Language Models
链接:https://arxiv.org/abs/2503.06708
作者:Hongshen Xu,Zihan Wang,Zichen Zhu,Lei Pan,Xingyu Chen,Lu Chen,Kai Yu
类目:Computation and Language (cs.CL)
关键词:enabled large language, Recent advancements, integrate external tools, large language models, enhancing their task
备注:
点击查看摘要
Abstract:Recent advancements in tool learning have enabled large language models (LLMs) to integrate external tools, enhancing their task performance by expanding their knowledge boundaries. However, relying on tools often introduces tradeoffs between performance, speed, and cost, with LLMs sometimes exhibiting overreliance and overconfidence in tool usage. This paper addresses the challenge of aligning LLMs with their knowledge boundaries to make more intelligent decisions about tool invocation. We propose a multi objective alignment framework that combines probabilistic knowledge boundary estimation with dynamic decision making, allowing LLMs to better assess when to invoke tools based on their confidence. Our framework includes two methods for knowledge boundary estimation, consistency based and absolute estimation, and two training strategies for integrating these estimates into the model decision making process. Experimental results on various tool invocation scenarios demonstrate the effectiveness of our framework, showing significant improvements in tool efficiency by reducing unnecessary tool usage.
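One way to picture the consistency-based knowledge boundary estimation mentioned in this abstract is to sample several answers and call a tool only when they disagree. The sketch below is a generic self-consistency heuristic under that assumption; the function names and the 0.7 threshold are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Sequence


def consistency_confidence(samples: Sequence[str]) -> float:
    """Fraction of sampled answers that agree with the most common answer."""
    if not samples:
        return 0.0
    # most_common(1) returns [(answer, count)] for the majority answer.
    top_count = Counter(samples).most_common(1)[0][1]
    return top_count / len(samples)


def should_call_tool(samples: Sequence[str], threshold: float = 0.7) -> bool:
    """Invoke the tool only when the model's own answers are inconsistent.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return consistency_confidence(samples) < threshold


if __name__ == "__main__":
    answers = ["Paris", "Paris", "Lyon", "Paris"]
    print(consistency_confidence(answers), should_call_tool(answers))
```

In this sketch the samples would come from repeated decoding of the same question; how the paper integrates such estimates into training is not shown here.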
67. 【2503.06706】PFDial: A Structured Dialogue Instruction Fine-tuning Method Based on UML Flowcharts
链接:https://arxiv.org/abs/2503.06706
作者:Ming Zhang,Yuhui Wang,Yujiong Shen,Tingyi Yang,Changhao Jiang,Yilong Wu,Shihan Dou,Qinhao Chen,Zhiheng Xi,Zhihao Zhang,Yi Dong,Zhen Wang,Zhihui Fei,Mingyang Wan,Tao Liang,Guojun Ma,Qi Zhang,Tao Gui,Xuanjing Huang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:Process-driven dialogue systems, equipment maintenance scenarios, predefined process constraints, strict predefined process, Process-driven dialogue
备注:
点击查看摘要
Abstract:Process-driven dialogue systems, which operate under strict predefined process constraints, are essential in customer service and equipment maintenance scenarios. Although Large Language Models (LLMs) have shown remarkable progress in dialogue and reasoning, they still struggle to solve these strictly constrained dialogue tasks. To address this challenge, we construct Process Flow Dialogue (PFDial) dataset, which contains 12,705 high-quality Chinese dialogue instructions derived from 440 flowcharts containing 5,055 process nodes. Based on PlantUML specification, each UML flowchart is converted into atomic dialogue units i.e., structured five-tuples. Experimental results demonstrate that a 7B model trained with merely 800 samples, and a 0.5B model trained on total data both can surpass 90% accuracy. Additionally, the 8B model can surpass GPT-4o up to 43.88% with an average of 11.00%. We further evaluate models' performance on challenging backward transitions in process flows and conduct an in-depth analysis of various dataset formats to reveal their impact on model performance in handling decision and sequential branches. The data is released in this https URL.
68. 【2503.06692】InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models
链接:https://arxiv.org/abs/2503.06692
作者:Yuchen Yan,Yongliang Shen,Yang Liu,Jin Jiang,Mengdi Zhang,Jian Shao,Yueting Zhuang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:maximum context boundaries, pre-training context windows, faces critical limitations, paradigm faces critical, achieved remarkable performance
备注:
点击查看摘要
Abstract:Advanced reasoning in large language models has achieved remarkable performance on challenging tasks, but the prevailing long-context reasoning paradigm faces critical limitations: quadratic computational scaling with sequence length, reasoning constrained by maximum context boundaries, and performance degradation beyond pre-training context windows. Existing approaches primarily compress reasoning chains without addressing the fundamental scaling problem. To overcome these challenges, we introduce InftyThink, a paradigm that transforms monolithic reasoning into an iterative process with intermediate summarization. By interleaving short reasoning segments with concise progress summaries, our approach enables unbounded reasoning depth while maintaining bounded computational costs. This creates a characteristic sawtooth memory pattern that significantly reduces computational complexity compared to traditional approaches. Furthermore, we develop a methodology for reconstructing long-context reasoning datasets into our iterative format, transforming OpenR1-Math into 333K training instances. Experiments across multiple model architectures demonstrate that our approach reduces computational costs while improving performance, with Qwen2.5-Math-7B showing 3-13% improvements across MATH500, AIME24, and GPQA_diamond benchmarks. Our work challenges the assumed trade-off between reasoning depth and computational efficiency, providing a more scalable approach to complex reasoning without architectural modifications.
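The iterative process described in this abstract (short reasoning segments interleaved with progress summaries) can be sketched as a loop in which each round sees only the question and the latest summary. The callables `reason` and `summarize` below are hypothetical stand-ins for model calls, and the whole function is an illustrative assumption rather than the authors' implementation.

```python
from typing import Callable, List, Tuple


def iterative_reasoning(
    question: str,
    reason: Callable[[str], Tuple[str, bool]],
    summarize: Callable[[str, str], str],
    max_rounds: int = 8,
) -> Tuple[str, List[int]]:
    """Reason in short segments, carrying only a summary between rounds.

    `reason(context)` returns (segment, done); `summarize(prev, segment)`
    returns the updated summary. Both are placeholders for model calls.
    Returns the final segment (or last summary) and per-round context sizes.
    """
    summary = ""
    context_sizes: List[int] = []
    for _ in range(max_rounds):
        # The context holds the question plus the summary, never the full history.
        context = question + "\n" + summary
        segment, done = reason(context)
        context_sizes.append(len(context) + len(segment))
        if done:
            return segment, context_sizes
        # Replace the detailed segment with a compact summary before the next round.
        summary = summarize(summary, segment)
    return summary, context_sizes
```

Because the full segment is dropped after each summary, the recorded context size rises within a round and falls back afterwards, which is one way to read the sawtooth memory pattern the abstract mentions.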
69. 【2503.06689】DependEval: Benchmarking LLMs for Repository Dependency Understanding
链接:https://arxiv.org/abs/2503.06689
作者:Junjia Du,Yadi Liu,Hongcheng Guo,Jiawei Wang,Haojian Huang,Yunyi Ni,Zhoujun Li
类目:Software Engineering (cs.SE); Computation and Language (cs.CL)
关键词:shown considerable promise, software development demands, development demands advanced, advanced repository-level reasoning, real-world software development
备注:
点击查看摘要
Abstract:While large language models (LLMs) have shown considerable promise in code generation, real-world software development demands advanced repository-level reasoning. This includes understanding dependencies, project structures, and managing multi-file changes. However, the ability of LLMs to effectively comprehend and handle complex code repositories has yet to be fully explored. To address these challenges, we introduce DependEval, a hierarchical benchmark designed to evaluate repository dependency understanding. The benchmark is based on 15,576 repositories collected from real-world websites. It evaluates models on three core tasks: Dependency Recognition, Repository Construction, and Multi-file Editing, across 8 programming languages from actual code repositories. Our evaluation of over 25 LLMs reveals substantial performance gaps and provides valuable insights into repository-level code understanding.
70. 【2503.06670】Attention, Please! PixelSHAP Reveals What Vision-Language Models Actually Focus On
链接:https://arxiv.org/abs/2503.06670
作者:Roni Goldshmidt
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:crucial for trust, high-stakes applications, decision-making in high-stakes, Vision-Language Models, framework extending Shapley-based
备注:
点击查看摘要
Abstract:Interpretability in Vision-Language Models (VLMs) is crucial for trust, debugging, and decision-making in high-stakes applications. We introduce PixelSHAP, a model-agnostic framework extending Shapley-based analysis to structured visual entities. Unlike previous methods focusing on text prompts, PixelSHAP applies to vision-based reasoning by systematically perturbing image objects and quantifying their influence on a VLM's response. PixelSHAP requires no model internals, operating solely on input-output pairs, making it compatible with open-source and commercial models. It supports diverse embedding-based similarity metrics and scales efficiently using optimization techniques inspired by Shapley-based methods. We validate PixelSHAP in autonomous driving, highlighting its ability to enhance interpretability. Key challenges include segmentation sensitivity and object occlusion. Our open-source implementation facilitates further research.
71. 【2503.06648】Enhancing NLP Robustness and Generalization through LLM-Generated Contrast Sets: A Scalable Framework for Systematic Evaluation and Adversarial Training
链接:https://arxiv.org/abs/2503.06648
作者:Hender Lin
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:capture vulnerabilities stemming, Standard NLP benchmarks, spurious correlations, benchmarks often fail, fail to capture
备注:
点击查看摘要
Abstract:Standard NLP benchmarks often fail to capture vulnerabilities stemming from dataset artifacts and spurious correlations. Contrast sets address this gap by challenging models near decision boundaries but are traditionally labor-intensive to create and limited in diversity. This study leverages large language models to automate the generation of diverse contrast sets. Using the SNLI dataset, we created a 3,000-example contrast set to evaluate and improve model robustness. Fine-tuning on these contrast sets enhanced performance on systematically perturbed examples, maintained standard test accuracy, and modestly improved generalization to novel perturbations. This automated approach offers a scalable solution for evaluating and improving NLP models, addressing systematic generalization challenges, and advancing robustness in real-world applications.
72. 【2503.06643】Is Your Benchmark (Still) Useful? Dynamic Benchmarking for Code Language Models
链接:https://arxiv.org/abs/2503.06643
作者:Batu Guan,Xiao Wu,Yuanyuan Yuan,Shaohua Li
类目:Software Engineering (cs.SE); Computation and Language (cs.CL)
关键词:tackle a critical, critical challenge, Abstract, dynamic, models
备注: 14 pages, 7 figures
点击查看摘要
Abstract:In this paper, we tackle a critical challenge in model evaluation: how to keep code benchmarks useful when models might have already seen them during training. We introduce a novel solution, a dynamic benchmarking framework, to address this challenge. Given a code understanding or reasoning benchmark, our framework dynamically transforms each input, i.e., programs, with various semantic-preserving mutations to build a syntactically new but semantically identical benchmark. We evaluated ten popular language models on our dynamic benchmarks. Our evaluation reveals several interesting or surprising findings: (1) all models perform significantly worse than before, (2) the ranking between some models shifts dramatically, and (3) our dynamic benchmarks can resist the data contamination problem.
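As an illustration of what a semantic-preserving mutation can look like, the sketch below renames function arguments and assigned variables in a Python program using the standard `ast` module. Identifier renaming is chosen here only as an assumed example; the abstract does not list the mutations the framework applies, and `rename_locals` is a hypothetical helper.

```python
import ast


def rename_locals(source: str) -> str:
    """Rename arguments and assigned variables to v0, v1, ... consistently.

    Limitations of this sketch: it assumes the program has no existing
    names like v0 and no calls that pass arguments by keyword.
    """
    tree = ast.parse(source)

    # Collect every argument name and every name that is assigned to.
    mapping = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):
            mapping.setdefault(node.arg, f"v{len(mapping)}")
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            mapping.setdefault(node.id, f"v{len(mapping)}")

    class Renamer(ast.NodeTransformer):
        def visit_arg(self, node: ast.arg) -> ast.arg:
            node.arg = mapping.get(node.arg, node.arg)
            return node

        def visit_Name(self, node: ast.Name) -> ast.Name:
            # Names not in the mapping (builtins, function names) stay unchanged.
            node.id = mapping.get(node.id, node.id)
            return node

    # ast.unparse requires Python 3.9 or newer.
    return ast.unparse(Renamer().visit(tree))


if __name__ == "__main__":
    original = "def f(x, y):\n    total = x + y\n    return total * 2\n"
    print(rename_locals(original))
```

The mutated program differs textually from the original but computes the same results, which is the property a contamination-resistant benchmark variant would need.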
73. 【2503.06627】Revisiting Early Detection of Sexual Predators via Turn-level Optimization
链接:https://arxiv.org/abs/2503.06627
作者:Jinmyeong An,Sangwon Ryu,Heejin Do,Yunsu Kim,Jungseul Ok,Gary Geunbae Lee
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:severe social threat, gradually entrap child, entrap child victims, Online grooming, sexual predators gradually
备注: Accepted as a main conference paper at NAACL 2025
点击查看摘要
Abstract:Online grooming is a severe social threat where sexual predators gradually entrap child victims with subtle and gradual manipulation. Therefore, timely intervention for online grooming is critical for proactive protection. However, previous methods fail to determine the optimal intervention points (i.e., jump to conclusions) as they rely on chat-level risk labels by causing weak supervision of risky utterances. For timely detection, we propose speed control reinforcement learning (SCoRL) (The code and supplementary materials are available at this https URL), incorporating a practical strategy derived from luring communication theory (LCT). To capture the predator's turn-level entrapment, we use a turn-level risk label based on the LCT. Then, we design a novel speed control reward function that balances the trade-off between speed and accuracy based on turn-level risk label; thus, SCoRL can identify the optimal intervention moment. In addition, we introduce a turn-level metric for precise evaluation, identifying limitations in previously used chat-level metrics. Experimental results show that SCoRL effectively preempted online grooming, offering a more proactive and timely solution. Further analysis reveals that our method enhances performance while intuitively identifying optimal early intervention points.
74. 【2503.06594】Beyond Decoder-only: Large Language Models Can be Good Encoders for Machine Translation
链接:https://arxiv.org/abs/2503.06594
作者:Yingfeng Luo,Tong Zheng,Yongyu Mu,Bei Li,Qinghong Zhang,Yongqi Gao,Ziqiang Xu,Peinan Feng,Xiaoqian Liu,Tong Xiao,Jingbo Zhu
类目:Computation and Language (cs.CL)
关键词:large language models, neural machine translation, earlier NMT models, machine translation, NMT
备注:
点击查看摘要
Abstract:The field of neural machine translation (NMT) has changed with the advent of large language models (LLMs). Much of the recent emphasis in natural language processing (NLP) has been on modeling machine translation and many other problems using a single pre-trained Transformer decoder, while encoder-decoder architectures, which were the standard in earlier NMT models, have received relatively less attention. In this paper, we explore translation models that are universal, efficient, and easy to optimize, by marrying the world of LLMs with the world of NMT. We apply LLMs to NMT encoding and leave the NMT decoder unchanged. We also develop methods for adapting LLMs to work better with the NMT decoder. Furthermore, we construct a new dataset involving multiple tasks to assess how well the machine translation system generalizes across various tasks. Evaluations on the WMT and our datasets show that results using our method match or surpass a range of baselines in terms of translation quality, but achieve 2.4x to 6.5x inference speedups and a 75% reduction in the memory footprint of the KV cache. It also demonstrates strong generalization across a variety of translation-related tasks.
75. 【2503.06573】WildIFEval: Instruction Following in the Wild
链接:https://arxiv.org/abs/2503.06573
作者:Gili Lior,Asaf Yehudai,Ariel Gera,Liat Ein-Dor
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:shown remarkable success, multiple constraints remains, Recent LLMs, significant challenge, shown remarkable
备注:
点击查看摘要
Abstract:Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a significant challenge. In this work, we introduce WildIFEval - a large-scale dataset of 12K real user instructions with diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints, in natural user prompts. We categorize these constraints into eight high-level classes to capture their distribution and dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive experiments to benchmark the instruction-following capabilities of leading LLMs. Our findings reveal that all evaluated models experience performance degradation with an increasing number of constraints. Thus, we show that all models have a large room for improvement on such tasks. Moreover, we observe that the specific type of constraint plays a critical role in model performance. We release our dataset to promote further research on instruction-following under complex, realistic conditions.
76. 【2503.06552】Multimodal Programming in Computer Science with Interactive Assistance Powered by Large Language Model
链接:https://arxiv.org/abs/2503.06552
作者:Rajan Das Gupta,Md. Tanzib Hosain,M. F. Mridha,Salah Uddin Ahmed
类目:Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
关键词:LLM chatbot interfaces, LLM chatbot, chatbot interfaces, interactive assistance, LLM
备注: Accepted in Proceedings of the 27th International Conference on Human-Computer Interaction, 2025
点击查看摘要
Abstract:LLM chatbot interfaces allow students to get instant, interactive assistance with homework, but doing so carelessly may not advance educational objectives. In this study, an interactive homework help system based on DeepSeek R1 is developed and first implemented for students enrolled in a large computer science beginning programming course. In addition to an assist button in a well-known code editor, our assistant also has a feedback option in our command-line automatic evaluator. It wraps student work in a personalized prompt that advances our educational objectives without offering answers straight away. We have discovered that our assistant can recognize students' conceptual difficulties and provide ideas, plans, and template code in pedagogically appropriate ways. However, among other mistakes, it occasionally incorrectly labels the correct student code as incorrect or encourages students to use correct-but-lesson-inappropriate approaches, which can lead to long and frustrating journeys for the students. After discussing many development and deployment issues, we provide our conclusions and future actions.
77. 【2503.06550】BingoGuard: LLM Content Moderation Tools with Risk Levels
链接:https://arxiv.org/abs/2503.06550
作者:Fan Yin,Philippe Laban,Xiangyu Peng,Yilun Zhou,Yixin Mao,Vaibhav Vats,Linnea Ross,Divyansh Agarwal,Caiming Xiong,Chien-Sheng Wu
类目:Computation and Language (cs.CL)
关键词:
备注: 10 pages, 4 figures, 4 tables. ICLR 2025 poster
点击查看摘要
None
78. 【2503.06547】KréyoLID From Language Identification Towards Language Mining
链接:https://arxiv.org/abs/2503.06547
作者:Rasul Dent,Pedro Ortiz Suarez,Thibault Clérice,Benoît Sagot
类目:Computation and Language (cs.CL)
关键词:Automatic language identification, multi-class classification problem, Automatic language, identification is frequently, frequently framed
备注: 8 main pages
点击查看摘要
Abstract:Automatic language identification is frequently framed as a multi-class classification problem. However, when creating digital corpora for less commonly written languages, it may be more appropriate to consider it a data mining problem. For these varieties, one knows ahead of time that the vast majority of documents are of little interest. By minimizing resources spent on classifying such documents, we can create corpora much faster and with better coverage than using established pipelines. To demonstrate the effectiveness of the language mining perspective, we introduce a new pipeline and corpora for several French-based Creoles.
79. 【2503.06534】SafeSpeech: A Comprehensive and Interactive Tool for Analysing Sexist and Abusive Language in Conversations
链接:https://arxiv.org/abs/2503.06534
作者:Xingwei Tan,Chen Lyu,Hafiz Muhammad Umer,Sahrish Khan,Mahathi Parvatham,Lois Arthurs,Simon Cullen,Shelley Wilson,Arshad Jhumka,Gabriele Pergola
类目:Computation and Language (cs.CL)
关键词:Detecting toxic language, Detecting toxic, harassment and abusive, abusive behaviour, remains a critical
备注: NAACL 2025 system demonstration camera-ready
点击查看摘要
Abstract:Detecting toxic language including sexism, harassment and abusive behaviour, remains a critical challenge, particularly in its subtle and context-dependent forms. Existing approaches largely focus on isolated message-level classification, overlooking toxicity that emerges across conversational contexts. To promote and enable future research in this direction, we introduce SafeSpeech, a comprehensive platform for toxic content detection and analysis that bridges message-level and conversation-level insights. The platform integrates fine-tuned classifiers and large language models (LLMs) to enable multi-granularity detection, toxic-aware conversation summarization, and persona profiling. SafeSpeech also incorporates explainability mechanisms, such as perplexity gain analysis, to highlight the linguistic elements driving predictions. Evaluations on benchmark datasets, including EDOS, OffensEval, and HatEval, demonstrate the reproduction of state-of-the-art performance across multiple tasks, including fine-grained sexism detection.
80. 【2503.06531】MetaXCR: Reinforcement-Based Meta-Transfer Learning for Cross-Lingual Commonsense Reasoning
链接:https://arxiv.org/abs/2503.06531
作者:Jie He,Yu Fu
类目:Computation and Language (cs.CL)
关键词:achieved great progress, Commonsense reasoning, Low-Resource Commonsense Reasoning, Cross-lingual Low-Resource Commonsense, pieces of domain
备注:
Abstract:Commonsense reasoning (CR) has been studied in many domains and has achieved great progress with the aid of large datasets. Unfortunately, most existing CR datasets are built in English, so most previous work focuses on English. Furthermore, as the annotation of commonsense reasoning is costly, it is impossible to build a large dataset for every novel task. Therefore, there is growing interest in Cross-lingual Low-Resource Commonsense Reasoning, which aims to leverage diverse existing English datasets to help the model adapt to new cross-lingual target datasets with limited labeled data. In this paper, we propose a multi-source adapter for cross-lingual low-resource Commonsense Reasoning (MetaXCR). In this framework, we first extend meta learning by incorporating multiple training datasets to learn generalized task adapters across different tasks. Then, we further introduce a reinforcement-based sampling strategy to help the model sample the source task that is the most helpful to the target task. Finally, we introduce two types of cross-lingual meta-adaption methods to enhance the performance of models on target languages. Extensive experiments demonstrate that MetaXCR is superior to state-of-the-art methods, while being trained with fewer parameters than other work.
81. 【2503.06514】GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks
链接:https://arxiv.org/abs/2503.06514
作者:Haoqiang Kang,Enna Sachdeva,Piyush Gupta,Sangjae Bae,Kwonjoon Lee
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
关键词:recently shown promising, shown promising advancements, Proximal Policy Optimization, sequential decision-making tasks, recently shown
备注:
Abstract:Vision-Language Models (VLMs) have recently shown promising advancements in sequential decision-making tasks through task-specific fine-tuning. However, common fine-tuning methods, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) techniques like Proximal Policy Optimization (PPO), present notable limitations: SFT assumes Independent and Identically Distributed (IID) data, while PPO focuses on maximizing cumulative rewards. These limitations often restrict solution diversity and hinder generalization in multi-step reasoning tasks. To address these challenges, we introduce GFlowVLM, a novel framework that fine-tunes VLMs using Generative Flow Networks (GFlowNets) to promote the generation of diverse solutions for complex reasoning tasks. GFlowVLM models the environment as a non-Markovian decision process, allowing it to capture long-term dependencies essential for real-world applications. It takes observations and task descriptions as inputs to prompt chain-of-thought (CoT) reasoning, which subsequently guides action selection. We use task-based rewards to fine-tune the VLM with GFlowNets. This approach enables VLMs to outperform prior fine-tuning methods, including SFT and RL. Empirical results demonstrate the effectiveness of GFlowVLM on complex tasks such as card games (NumberLine, BlackJack) and embodied planning tasks (ALFWorld), showing enhanced training efficiency, solution diversity, and stronger generalization capabilities across both in-distribution and out-of-distribution scenarios.
82. 【2503.06510】Less is More: Adaptive Program Repair with Bug Localization and Preference Learning
链接:https://arxiv.org/abs/2503.06510
作者:Zhenlong Dai,Bingrui Chen,Zhuoluo Zhao,Xiu Tang,Sai Wu,Chang Yao,Zhipeng Gao,Jingyuan Chen
类目:Software Engineering (cs.SE); Computation and Language (cs.CL)
关键词:Automated Program Repair, Automated Program, automatically generate patches, Program Repair, Adaptive Program Repair
备注: Accepted by AAAI 2025 (Oral)
Abstract:Automated Program Repair (APR) is the task of automatically generating patches for buggy code. However, most research focuses on generating correct patches while ignoring the consistency between the fixed code and the original buggy code. How to conduct adaptive bug fixing and generate patches with minimal modifications has seldom been investigated. To bridge this gap, we first introduce a novel task, namely AdaPR (Adaptive Program Repair). We then propose a two-stage approach, AdaPatcher (Adaptive Patch Generator), to enhance program repair while maintaining consistency. In the first stage, we utilize a Bug Locator with self-debug learning to accurately pinpoint bug locations. In the second stage, we train a Program Modifier to ensure consistency between the post-modified fixed code and the pre-modified buggy code. The Program Modifier is enhanced with a location-aware repair learning strategy to generate patches based on identified buggy lines, a hybrid training strategy for selective reference, and adaptive preference learning to prioritize fewer changes. The experimental results show that our approach outperforms a set of baselines by a large margin, validating the effectiveness of our two-stage framework for the newly proposed AdaPR task.
83. 【2503.06492】VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering
链接:https://arxiv.org/abs/2503.06492
作者:Yanling Wang,Yihan Zhao,Xiaodong Chen,Shasha Guo,Lixin Liu,Haoyang Li,Yong Xiao,Jing Zhang,Qi Li,Ke Xu
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:demonstrated remarkable achievements, Large vision-language models, non-factual responses remains, responses remains prevalent, Large vision-language
备注:
Abstract:Large vision-language models (LVLMs) have demonstrated remarkable achievements, yet the generation of non-factual responses remains prevalent in fact-seeking question answering (QA). Current multimodal fact-seeking benchmarks primarily focus on comparing model outputs to ground truth answers, providing limited insights into the performance of modality-specific modules. To bridge this gap, we introduce VisualSimpleQA, a multimodal fact-seeking benchmark with two key features. First, it enables streamlined and decoupled evaluation of LVLMs in visual and linguistic modalities. Second, it incorporates well-defined difficulty criteria to guide human annotation and facilitates the extraction of a challenging subset, VisualSimpleQA-hard. Experiments on 15 LVLMs show that even state-of-the-art models such as GPT-4o achieve merely 60%+ correctness in multimodal fact-seeking QA on VisualSimpleQA and 30%+ on VisualSimpleQA-hard. Furthermore, the decoupled evaluation across these models highlights substantial opportunities for improvement in both visual and linguistic modules. The dataset is available at this https URL.
84. 【2503.06491】MoFE: Mixture of Frozen Experts Architecture
链接:https://arxiv.org/abs/2503.06491
作者:Jean Seo,Jaeyoon Kim,Hyopil Shin
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:
备注: NAACL 2025 Industry
85. 【2503.06475】SKG-LLM: Developing a Mathematical Model for Stroke Knowledge Graph Construction Using Large Language Models
链接:https://arxiv.org/abs/2503.06475
作者:Ali Sarabadani,Kheirolah Rahsepar Fard,Hamid Dalvand
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:
备注:
86. 【2503.06474】HuixiangDou2: A Robustly Optimized GraphRAG Approach
链接:https://arxiv.org/abs/2503.06474
作者:Huanjun Kong,Zhefan Wang,Chenyang Wang,Zhe Ma,Nanqing Dong
类目:Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:
备注: 11 pages
87. 【2503.06470】Think Twice, Click Once: Enhancing GUI Grounding via Fast and Slow Systems
链接:https://arxiv.org/abs/2503.06470
作者:Fei Tang,Yongliang Shen,Hang Zhang,Siqi Chen,Guiyang Hou,Wenqi Zhang,Wenqiao Zhang,Kaitao Song,Weiming Lu,Yueting Zhuang
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:
备注:
88. 【2503.06430】Graph Retrieval-Augmented LLM for Conversational Recommendation Systems
链接:https://arxiv.org/abs/2503.06430
作者:Zhangchi Qiu,Linhao Luo,Zicheng Zhao,Shirui Pan,Alan Wee-Chung Liew
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
关键词:
备注: Accepted by PAKDD 2025
89. 【2503.06424】Training LLM-based Tutors to Improve Student Learning Outcomes in Dialogues
链接:https://arxiv.org/abs/2503.06424
作者:Alexander Scarlatos,Naiming Liu,Jaewook Lee,Richard Baraniuk,Andrew Lan
类目:Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:Generative artificial intelligence, Generative artificial, large language models, artificial intelligence, potential to scale
备注:
Abstract:Generative artificial intelligence (AI) has the potential to scale up personalized tutoring through large language models (LLMs). Recent AI tutors are adapted for the tutoring task by training or prompting LLMs to follow effective pedagogical principles, though they are not trained to maximize student learning throughout the course of a dialogue. Therefore, they may engage with students in a suboptimal way. We address this limitation by introducing an approach to train LLMs to generate tutor utterances that maximize the likelihood of student correctness, while still encouraging the model to follow good pedagogical practice. Specifically, we generate a set of candidate tutor utterances and score them using (1) an LLM-based student model to predict the chance of correct student responses and (2) a pedagogical rubric evaluated by GPT-4o. We then use the resulting data to train an open-source LLM, Llama 3.1 8B, using direct preference optimization. We show that tutor utterances generated by our model lead to significantly higher chances of correct student responses while maintaining the pedagogical quality of GPT-4o. We also conduct qualitative analyses and a human evaluation to demonstrate that our model generates high quality tutor utterances.
90. 【2503.06394】How LLMs Learn: Tracing Internal Representations with Sparse Autoencoders
链接:https://arxiv.org/abs/2503.06394
作者:Tatsuro Inaba,Kentaro Inui,Yusuke Miyao,Yohei Oseki,Benjamin Heinzerling,Yu Takagi
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:Large Language Models, demonstrate remarkable multilingual, Large Language, remarkable multilingual capabilities, demonstrate remarkable
备注: Our code, demo, SAE weights are available at: [this https URL](https://github.com/llm-jp/llm-jp-sae)
Abstract:Large Language Models (LLMs) demonstrate remarkable multilingual capabilities and broad knowledge. However, the internal mechanisms underlying the development of these capabilities remain poorly understood. To investigate this, we analyze how the information encoded in LLMs' internal representations evolves during the training process. Specifically, we train sparse autoencoders at multiple checkpoints of the model and systematically compare the interpretative results across these stages. Our findings suggest that LLMs initially acquire language-specific knowledge independently, followed by cross-linguistic correspondences. Moreover, we observe that after mastering token-level knowledge, the model transitions to learning higher-level, abstract concepts, indicating the development of more conceptual understanding.
91. 【2503.06380】TI-JEPA: An Innovative Energy-based Joint Embedding Strategy for Text-Image Multimodal Systems
链接:https://arxiv.org/abs/2503.06380
作者:Khang H. N. Vo,Duc P. T. Nguyen,Thong Nguyen,Tho T. Quan
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:
备注:
92. 【2503.06378】General Scales Unlock AI Evaluation with Explanatory and Predictive Power
链接:https://arxiv.org/abs/2503.06378
作者:Lexin Zhou,Lorenzo Pacchiardi,Fernando Martínez-Plumed,Katherine M. Collins,Yael Moros-Daval,Seraphina Zhang,Qinlin Zhao,Yitian Huang,Luning Sun,Jonathan E. Prunty,Zongqian Li,Pablo Sánchez-García,Kexin Jiang Chen,Pablo A. M. Casares,Jiyun Zu,John Burden,Behzad Mehrbakhsh,David Stillwell,Manuel Cebrian,Jindong Wang,Peter Henderson,Sherry Tongshuang Wu,Patrick C. Kyllonen,Lucy Cheke,Xing Xie,José Hernández-Orallo
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:
备注:
93. 【2503.06335】Phraselette: A Poet's Procedural Palette
链接:https://arxiv.org/abs/2503.06335
作者:Alex Calderwood,John Joon Young Chung,Yuqian Sun,Melissa Roemmele,Max Kreminski
类目:Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
关键词:
备注:
94. 【2503.06330】States of LLM-generated Texts and Phase Transitions between them
链接:https://arxiv.org/abs/2503.06330
作者:Nikolay Mikhaylovskiy
类目:Computation and Language (cs.CL); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI)
关键词:
备注: Published as a conference paper at MathAI 2025
95. 【2503.06313】Advancing Autonomous Vehicle Intelligence: Deep Learning and Multimodal LLM for Traffic Sign Recognition and Robust Lane Detection
链接:https://arxiv.org/abs/2503.06313
作者:Chandan Kumar Sah,Ankit Kumar Shaw,Xiaoli Lian,Arsalan Shahid Baig,Tuopu Wen,Kun Jiang,Mengmeng Yang,Diange Yang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
关键词:ensure safe navigation, Large Language Models, require reliable traffic, Multimodal Large Language, traffic sign recognition
备注: 11 pages, 9 figures
Abstract:Autonomous vehicles (AVs) require reliable traffic sign recognition and robust lane detection capabilities to ensure safe navigation in complex and dynamic environments. This paper introduces an integrated approach combining advanced deep learning techniques and Multimodal Large Language Models (MLLMs) for comprehensive road perception. For traffic sign recognition, we systematically evaluate ResNet-50, YOLOv8, and RT-DETR, achieving state-of-the-art accuracy of 99.8% with ResNet-50, 98.0% with YOLOv8, and 96.6% with RT-DETR, despite the latter's higher computational complexity. For lane detection, we propose a CNN-based segmentation method enhanced by polynomial curve fitting, which delivers high accuracy under favorable conditions. Furthermore, we introduce a lightweight, multimodal, LLM-based framework that directly undergoes instruction tuning using small yet diverse datasets, eliminating the need for initial pretraining. This framework effectively handles various lane types, complex intersections, and merging zones, significantly enhancing lane detection reliability by reasoning under adverse conditions. Despite constraints in available training resources, our multimodal approach demonstrates advanced reasoning capabilities, achieving a Frame Overall Accuracy (FRM) of 53.87%, a Question Overall Accuracy (QNS) of 82.83%, lane detection accuracies of 99.6% in clear conditions and 93.0% at night, and robust performance in reasoning about lane invisibility due to rain (88.4%) or road degradation (95.6%). The proposed comprehensive framework markedly enhances AV perception reliability, thus contributing significantly to safer autonomous driving across diverse and challenging road scenarios.
96. 【2503.06296】MoEMoE: Question Guided Dense and Scalable Sparse Mixture-of-Expert for Multi-source Multi-modal Answering
链接:https://arxiv.org/abs/2503.06296
作者:Vinay Kumar Verma,Shreyas Sunil Kulkarni,Happy Mittal,Deepak Gupta
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:
备注: To appear at NAACL Industry Track
97. 【2503.06291】IteRABRe: Iterative Recovery-Aided Block Reduction
链接:https://arxiv.org/abs/2503.06291
作者:Haryo Akbarianto Wibowo,Haiyue Song,Hideki Tanaka,Masao Utiyama,Alham Fikri Aji,Raj Dabre
类目:Computation and Language (cs.CL)
关键词:
备注: 8 pages
98. 【2503.06263】Critical Foreign Policy Decisions (CFPD)-Benchmark: Measuring Diplomatic Preferences in Large Language Models
链接:https://arxiv.org/abs/2503.06263
作者:Benjamin Jensen,Ian Reynolds,Yasir Atalan,Michael Garcia,Austin Woo,Anthony Chen,Trevor Howarth
类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:
备注:
99. 【2503.06241】A Noise-Robust Turn-Taking System for Real-World Dialogue Robots: A Field Experiment
链接:https://arxiv.org/abs/2503.06241
作者:Koji Inoue,Yuki Okafuji,Jun Baba,Yoshiki Ohira,Katsuya Hyodo,Tatsuya Kawahara
类目:Robotics (cs.RO); Computation and Language (cs.CL); Sound (cs.SD)
关键词:
备注:
100. 【2503.06232】Integrating Chain-of-Thought for Multimodal Alignment: A Study on 3D Vision-Language Learning
链接:https://arxiv.org/abs/2503.06232
作者:Yanjun Chen,Yirong Sun,Xinghao Chen,Jian Wang,Xiaoyu Shen,Wenjie Li,Wei Zhang
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:proven effective, effective in natural, remains underexplored, reasoning, CoT
备注:
Abstract:Chain-of-Thought (CoT) reasoning has proven effective in natural language tasks but remains underexplored in multimodal alignment. This study investigates its integration into 3D vision-language learning by embedding structured reasoning into alignment training. We introduce the 3D-CoT Benchmark, a dataset with hierarchical CoT annotations covering shape recognition, functional inference, and causal reasoning. Through controlled experiments, we compare CoT-structured and standard textual annotations across large reasoning models (LRMs) and large language models (LLMs). Our evaluation employs a dual-layer framework assessing both intermediate reasoning and final inference quality. Extensive experiments demonstrate that CoT significantly improves 3D semantic grounding, with LRMs leveraging CoT more effectively than LLMs. Furthermore, we highlight that annotation structure influences performance: explicit reasoning markers aid LLMs, while unmarked CoT better aligns with LRM inference patterns. Our analyses suggest that CoT is crucial for enhancing multimodal reasoning, with implications beyond 3D tasks.
101. 【2503.06218】KnowLogic: A Benchmark for Commonsense Reasoning via Knowledge-Driven Data Synthesis
链接:https://arxiv.org/abs/2503.06218
作者:Weidong Zhan,Yue Wang,Nan Hu,Liming Xiao,Jingyuan Ma,Yuhang Qin,Zheng Li,Yixin Yang,Sirui Deng,Jinkun Ding,Wenhan Ma,Rui Li,Weilin Luo,Qun Liu,Zhifang Sui
类目:Computation and Language (cs.CL)
关键词:
备注:
102. 【2503.06211】Text-Speech Language Models with Improved Cross-Modal Transfer by Aligning Abstraction Levels
链接:https://arxiv.org/abs/2503.06211
作者:Santiago Cuervo,Adel Moumen,Yanis Labrak,Sameer Khurana,Antoine Laurent,Mickael Rouvier,Ricard Marxer
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
关键词:
备注:
103. 【2503.06204】CUPCase: Clinically Uncommon Patient Cases and Diagnoses Dataset
链接:https://arxiv.org/abs/2503.06204
作者:Oriel Perets,Ofir Ben Shoham,Nir Grinberg,Nadav Rappoport
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:
备注: Accepted to AAAI 2025
104. 【2503.06201】Explainable Synthetic Image Detection through Diffusion Timestep Ensembling
链接:https://arxiv.org/abs/2503.06201
作者:Yixin Wu,Feiran Zhang,Tianyuan Shi,Ruicheng Yin,Zhenghua Wang,Zhenliang Gan,Xiaohua Wang,Changze Lv,Xiaoqing Zheng,Xuanjing Huang
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:posing significant security, significant security risks, Recent advances, deceptively real images, posing significant
备注: 13 pages, 5 figures
Abstract:Recent advances in diffusion models have enabled the creation of deceptively real images, posing significant security risks when misused. In this study, we reveal that natural and synthetic images exhibit distinct differences in the high-frequency domains of their Fourier power spectra after undergoing iterative noise perturbations through an inverse multi-step denoising process, suggesting that such noise can provide additional discriminative information for identifying synthetic images. Based on this observation, we propose a novel detection method that amplifies these differences by progressively adding noise to the original images across multiple timesteps, and train an ensemble of classifiers on these noised images. To enhance human comprehension, we introduce an explanation generation and refinement module to identify flaws located in AI-generated images. Additionally, we construct two new datasets, GenHard and GenExplain, derived from the GenImage benchmark, providing detection samples of greater difficulty and high-quality rationales for fake images. Extensive experiments show that our method achieves state-of-the-art performance with 98.91% and 95.89% detection accuracy on regular and harder samples, improvements of at least 2.51% and 3.46% over the baselines. Furthermore, our method also generalizes effectively to images generated by other diffusion models. Our code and datasets will be made publicly available.
105. 【2503.06184】Sample-aware Adaptive Structured Pruning for Large Language Models
链接:https://arxiv.org/abs/2503.06184
作者:Jun Kong,Xinge Ma,Jin Wang,Xuejie Zhang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:
备注:
106. 【2503.06139】GRP: Goal-Reversed Prompting for Zero-Shot Evaluation with LLMs
链接:https://arxiv.org/abs/2503.06139
作者:Mingyang Song,Mao Zheng,Xuan Luo
类目:Computation and Language (cs.CL)
关键词:
备注: Ongoing Work
107. 【2503.06137】Evaluating Discourse Cohesion in Pre-trained Language Models
链接:https://arxiv.org/abs/2503.06137
作者:Jie He,Wanqiu Long,Deyi Xiong
类目:Computation and Language (cs.CL)
关键词:
备注:
108. 【2503.06112】AF-KAN: Activation Function-Based Kolmogorov-Arnold Networks for Efficient Representation Learning
链接:https://arxiv.org/abs/2503.06112
作者:Hoang-Thang Ta,Anh Tran
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:
备注: 25 pages
109. 【2503.06091】Theta Theory: operads and coloring
链接:https://arxiv.org/abs/2503.06091
作者:Matilde Marcolli,Richard K. Larson
类目:Computation and Language (cs.CL)
关键词:
备注: 26 pages LaTeX
110. 【2503.06085】Multi-Attribute Multi-Grained Adaptation of Pre-Trained Language Models for Text Understanding from Bayesian Perspective
链接:https://arxiv.org/abs/2503.06085
作者:You Zhang,Jin Wang,Liang-Chih Yu,Dan Xu,Xuejie Zhang
类目:Computation and Language (cs.CL)
关键词:
备注: Extended version accepted by AAAI 2025
111. 【2503.06076】An Empirical Study of Causal Relation Extraction Transfer: Design and Data
链接:https://arxiv.org/abs/2503.06076
作者:Sydney Anuyah,Jack Vanschaik,Palak Jain,Sawyer Lehman,Sunandan Chakraborty
类目:Computation and Language (cs.CL)
关键词:
备注:
112. 【2503.06074】Towards Conversational AI for Disease Management
链接:https://arxiv.org/abs/2503.06074
作者:Anil Palepu,Valentin Liévin,Wei-Hung Weng,Khaled Saab,David Stutz,Yong Cheng,Kavita Kulkarni,S. Sara Mahdavi,Joëlle Barral,Dale R. Webster,Katherine Chou,Avinatan Hassidim,Yossi Matias,James Manyika,Ryutaro Tanno,Vivek Natarajan,Adam Rodman,Tao Tu,Alan Karthikesalingam,Mike Schaekermann
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:
备注: 62 pages, 7 figures in main text, 36 figures in appendix
113. 【2503.06073】GEM: Empowering MLLM for Grounded ECG Understanding with Time Series and Images
链接:https://arxiv.org/abs/2503.06073
作者:Xiang Lan,Feng Wu,Kai He,Qinghao Zhao,Shenda Hong,Mengling Feng
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
关键词:
备注:
114. 【2503.06072】A Survey on Post-training of Large Language Models
链接:https://arxiv.org/abs/2503.06072
作者:Guiyao Tie,Zeli Zhao,Dingjie Song,Fuyang Wei,Rong Zhou,Yurou Dai,Wen Yin,Zhejian Yang,Jiangyue Yan,Yao Su,Zhenhan Dai,Yifeng Xie,Yihan Cao,Lichao Sun,Pan Zhou,Lifang He,Hechang Chen,Yu Zhang,Qingsong Wen,Tianming Liu,Neil Zhenqiang Gong,Jiliang Tang,Caiming Xiong,Heng Ji,Philip S. Yu,Jianfeng Gao
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:
备注: 87 pages, 21 figures, 9 tables
115. 【2503.06064】A Novel Trustworthy Video Summarization Algorithm Through a Mixture of LoRA Experts
链接:https://arxiv.org/abs/2503.06064
作者:Wenzhuo Du,Gerun Wang,Guancheng Chen,Hang Zhao,Xin Li,Jian Gao
类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:
备注:
116. 【2503.06054】Fine-Grained Bias Detection in LLM: Enhancing detection mechanisms for nuanced biases
链接:https://arxiv.org/abs/2503.06054
作者:Suvendu Mohanty
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:
备注: Bias detection, Large Language Models, nuanced biases, fine-grained mechanisms, model transparency, ethical AI
117. 【2503.06048】Constructions are Revealed in Word Distributions
链接:https://arxiv.org/abs/2503.06048
作者:Joshua Rozner,Leonie Weissweiler,Kyle Mahowald,Cory Shain
类目:Computation and Language (cs.CL)
关键词:
备注:
118. 【2503.06047】DSGBench: A Diverse Strategic Game Benchmark for Evaluating LLM-based Agents in Complex Decision-Making Environments
链接:https://arxiv.org/abs/2503.06047
作者:Wenjie Tang,Yuan Zhou,Erqiang Xu,Keyan Cheng,Minne Li,Liquan Xiao
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:
备注: 43 pages, 5 figures, conference
119. 【2503.06040】Mitigating Memorization in LLMs using Activation Steering
链接:https://arxiv.org/abs/2503.06040
作者:Manan Suri,Nishit Anand,Amisha Bhaskar
类目:Computation and Language (cs.CL)
关键词:
备注:
120. 【2503.06034】Rank-R1: Enhancing Reasoning in LLM-based Document Rerankers via Reinforcement Learning
链接:https://arxiv.org/abs/2503.06034
作者:Shengyao Zhuang,Xueguang Ma,Bevan Koopman,Jimmy Lin,Guido Zuccon
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)
关键词:
备注:
121. 【2503.06029】SmartBench: Is Your LLM Truly a Good Chinese Smartphone Assistant?
链接:https://arxiv.org/abs/2503.06029
作者:Xudong Lu,Haohao Gao,Renshou Wu,Shuai Ren,Xiaoxin Chen,Hongsheng Li,Fangyuan Li
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:
备注: 23 pages
122. 【2503.06019】GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices
链接:https://arxiv.org/abs/2503.06019
作者:Xudong Lu,Yinghao Chen,Renshou Wu,Haohao Gao,Xi Chen,Xue Yang,Xiangyu Zhao,Aojun Zhou,Fangyuan Li,Yafei Wen,Xiaoxin Chen,Shuai Ren,Hongsheng Li
类目:Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
关键词:
备注: 14 pages
123. 【2503.06011】Intent-Aware Self-Correction for Mitigating Social Biases in Large Language Models
链接:https://arxiv.org/abs/2503.06011
作者:Panatchakorn Anantaprayoon,Masahiro Kaneko,Naoaki Okazaki
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:
备注: 18 pages. Under review
124. 【2503.05992】Psycholinguistic Analyses in Software Engineering Text: A Systematic Literature Review
链接:https://arxiv.org/abs/2503.05992
作者:Amirali Sajadi,Kostadin Damevski,Preetha Chatterjee
类目:Software Engineering (cs.SE); Computation and Language (cs.CL); Computers and Society (cs.CY)
关键词:
备注:
125. 【2503.05980】SINdex: Semantic INconsistency Index for Hallucination Detection in LLMs
链接:https://arxiv.org/abs/2503.05980
作者:Samir Abdaljalil,Hasan Kurban,Parichit Sharma,Erchin Serpedin,Rachad Atat
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:
备注:
126. 【2503.05958】SANDWiCH: Semantical Analysis of Neighbours for Disambiguating Words in Context ad Hoc
链接:https://arxiv.org/abs/2503.05958
作者:Daniel Guzman-Olivares,Lara Quijano-Sanchez,Federico Liberatore
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:
备注: 15 pages, 2 figures, 7 tables, NAACL 2025
127. 【2503.05935】DETQUS: Decomposition-Enhanced Transformers for QUery-focused Summarization
链接:https://arxiv.org/abs/2503.05935
作者:Yasir Khan,Xinlei Wu,Sangpil Youm,Justin Ho,Aryaan Shaikh,Jairo Garciga,Rohan Sharma,Bonnie J. Dorr
类目:Computation and Language (cs.CL)
关键词:
备注: 12 pages, 2 figures, Accepted to NAACL 2025 main conference
128. 【2503.05931】Training and Inference Efficiency of Encoder-Decoder Speech Models
链接:https://arxiv.org/abs/2503.05931
作者:Piotr Żelasko,Kunal Dhawan,Daniel Galvez,Krishna C. Puvvada,Ankita Pasad,Nithin Rao Koluguri,Ke Hu,Vitaly Lavrukhin,Jagadeesh Balam,Boris Ginsburg
类目:Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
关键词:
备注:
129. 【2503.05920】IDEA Prune: An Integrated Enlarge-and-Prune Pipeline in Generative Language Model Pretraining
链接:https://arxiv.org/abs/2503.05920
作者:Yixiao Li,Xianzhi Du,Ajay Jaiswal,Tao Lei,Tuo Zhao,Chong Wang,Jianyu Wang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
关键词:
备注:
130. 【2503.05919】From Style to Facts: Mapping the Boundaries of Knowledge Injection with Finetuning
链接:https://arxiv.org/abs/2503.05919
作者:Eric Zhao,Pranjal Awasthi,Nika Haghtalab
类目:Computation and Language (cs.CL); Machine Learning (cs.LG)
关键词:
备注:
131. 【2503.05891】MastermindEval: A Simple But Scalable Reasoning Benchmark
链接:https://arxiv.org/abs/2503.05891
作者:Jonas Golde,Patrick Haller,Fabio Barth,Alan Akbik
类目:Computation and Language (cs.CL)
关键词:
备注: 9 pages, 2 figures, 4 tables
132. 【2503.05888】QG-SMS: Enhancing Test Item Analysis via Student Modeling and Simulation
链接:https://arxiv.org/abs/2503.05888
作者:Bang Nguyen,Tingting Du,Mengxia Yu,Lawrence Angrave,Meng Jiang
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
关键词:
备注: Under Review
133. 【2503.05858】Bimodal Connection Attention Fusion for Speech Emotion Recognition
链接:https://arxiv.org/abs/2503.05858
作者:Jiachen Luo,Huy Phan,Lin Wang,Joshua D. Reiss
类目:Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
关键词:
备注:
134. 【2503.05856】This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs
链接:https://arxiv.org/abs/2503.05856
作者:Lorenz Wolf,Sangwoong Yoon,Ilija Bogunovic
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:
备注: 35 pages, 9 figures, 16 tables
135. 【2503.05846】Extracting and Emulsifying Cultural Explanation to Improve Multilingual Capability of LLMs
链接:https://arxiv.org/abs/2503.05846
作者:Hamin Koo,Jaehyung Kim
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:
备注: Under review, 18 pages
136. 【2503.05793】MedSimAI: Simulation and Formative Feedback Generation to Enhance Deliberate Practice in Medical Education
链接:https://arxiv.org/abs/2503.05793
作者:Yann Hicke,Jadon Geathers,Niroop Rajashekar,Colleen Chan,Anyanate Gwendolyne Jack,Justin Sewell,Mackenzi Preston,Susannah Cornes,Dennis Shung,Rene Kizilcec
类目:Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:
备注:
137. 【2503.05788】Emergent Abilities in Large Language Models: A Survey
Link: https://arxiv.org/abs/2503.05788
Authors: Leonardo Berti,Flavio Giorgi,Gjergji Kasneci
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords:
Comments:
138. 【2503.05786】FedMentalCare: Towards Privacy-Preserving Fine-Tuned LLMs to Analyze Mental Health Status Using Federated Learning Framework
Link: https://arxiv.org/abs/2503.05786
Authors: S M Sarwar
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Keywords:
Comments: 9 pages, 3 figures, 2 tables and 2 algorithms
139. 【2503.05781】Where is my Glass Slipper? AI, Poetry and Art
Link: https://arxiv.org/abs/2503.05781
Authors: Anastasios P. Pagiaslis
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
Keywords:
Comments: 36 pages, 0 figures, I have updated the submission to the correct submission standards apologies. The paper is a Literature Review so there are no formulas or results tables and images
140. 【2503.05778】DreamNet: A Multimodal Framework for Semantic and Emotional Analysis of Sleep Narratives
Link: https://arxiv.org/abs/2503.05778
Authors: Tapasvi Panchagnula
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords:
Comments: 10 pages, 5 figures, new research contribution
141. 【2503.05777】Medical Hallucinations in Foundation Models and Their Impact on Healthcare
Link: https://arxiv.org/abs/2503.05777
Authors: Yubin Kim,Hyewon Jeong,Shan Chen,Shuyue Stella Li,Mingyu Lu,Kumail Alhamoud,Jimin Mun,Cristina Grau,Minseok Jung,Rodrigo Gameiro,Lizhou Fan,Eugene Park,Tristan Lin,Joonsik Yoon,Wonjin Yoon,Maarten Sap,Yulia Tsvetkov,Paul Liang,Xuhai Xu,Xin Liu,Daniel McDuff,Hyeonhoon Lee,Hae Won Park,Samir Tulebaev,Cynthia Breazeal
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Keywords: generating multi-modal data, role in medicine, capable of processing, processing and generating, generating multi-modal
Comments:
Abstract:Foundation Models that are capable of processing and generating multi-modal data have transformed AI's role in medicine. However, a key limitation of their reliability is hallucination, where inaccurate or fabricated information can impact clinical decisions and patient safety. We define medical hallucination as any instance in which a model generates misleading medical content. This paper examines the unique characteristics, causes, and implications of medical hallucinations, with a particular focus on how these errors manifest themselves in real-world clinical scenarios. Our contributions include (1) a taxonomy for understanding and addressing medical hallucinations, (2) benchmarking models using medical hallucination dataset and physician-annotated LLM responses to real medical cases, providing direct insight into the clinical impact of hallucinations, and (3) a multi-national clinician survey on their experiences with medical hallucinations. Our results reveal that inference techniques such as Chain-of-Thought (CoT) and Search Augmented Generation can effectively reduce hallucination rates. However, despite these improvements, non-trivial levels of hallucination persist. These findings underscore the ethical and practical imperative for robust detection and mitigation strategies, establishing a foundation for regulatory policies that prioritize patient safety and maintain clinical integrity as AI becomes more integrated into healthcare. The feedback from clinicians highlights the urgent need for not only technical advances but also for clearer ethical and regulatory guidelines to ensure patient safety. A repository organizing the paper resources, summaries, and additional information is available at this https URL hallucination.
142. 【2503.05769】Effect of Gender Fair Job Description on Generative AI Images
Link: https://arxiv.org/abs/2503.05769
Authors: Finn Böckling,Jan Marquenie,Ingo Siegert
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords:
Comments: ISCA/ITG Workshop on Diversity in Large Speech and Language Models
143. 【2503.05763】Graph Masked Language Models
Link: https://arxiv.org/abs/2503.05763
Authors: Aarush Sinha,OM Kumar CU
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords:
Comments:
144. 【2503.05757】Uncertainty-Aware Fusion: An Ensemble Framework for Mitigating Hallucinations in Large Language Models
Link: https://arxiv.org/abs/2503.05757
Authors: Prasenjit Dey,Srujana Merugu,Sivaramakrishnan Kaveri
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords:
Comments: Proceedings of the ACM Web Conference 2025, WWW 25
145. 【2503.05750】CSTRL: Context-Driven Sequential Transfer Learning for Abstractive Radiology Report Summarization
Link: https://arxiv.org/abs/2503.05750
Authors: Mst. Fahmida Sultana Naznin,Adnan Ibney Faruq,Mostafa Rifat Tazwar,Md Jobayer,Md. Mehedi Hasan Shawon,Md Rakibul Hasan
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords:
Comments: 11-pages main paper with 2-pages appendices
146. 【2503.05740】ChatWise: AI-Powered Engaging Conversations for Enhancing Senior Cognitive Wellbeing
Link: https://arxiv.org/abs/2503.05740
Authors: Zhengbang Yang,Zhuangdi Zhu
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords:
Comments:
147. 【2503.05721】What Are They Filtering Out? A Survey of Filtering Strategies for Harm Reduction in Pretraining Datasets
Link: https://arxiv.org/abs/2503.05721
Authors: Marco Antonio Stranisci,Christian Hardmeier
Subjects: Computation and Language (cs.CL)
Keywords:
Comments:
148. 【2503.05720】That is Unacceptable: the Moral Foundations of Canceling
Link: https://arxiv.org/abs/2503.05720
Authors: Soda Marem Lo,Oscar Araque,Rajesh Sharma,Marco Antonio Stranisci
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
Keywords:
Comments:
149. 【2503.05713】Beyond English: Unveiling Multilingual Bias in LLM Copyright Compliance
Link: https://arxiv.org/abs/2503.05713
Authors: Yupeng Chen,Xiaoyu Zhang,Yixian Huang,Qian Xie
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL)
Keywords:
Comments: Work in progress
150. 【2503.05707】Russo-Ukrainian war disinformation detection in suspicious Telegram channels
Link: https://arxiv.org/abs/2503.05707
Authors: Anton Bazdyrev
Subjects: Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords:
Comments: CEUR-WS, Vol-3777 ProfIT AI 2024 4th International Workshop of IT-professionals on Artificial Intelligence 2024
151. 【2503.05701】OPTIC: Optimizing Patient-Provider Triaging Improving Communications in Clinical Operations using GPT-4 Data Labeling and Model Distillation
Link: https://arxiv.org/abs/2503.05701
Authors: Alberto Santamaria-Pang,Frank Tuan,Ross Campbell,Cindy Zhang,Ankush Jindal,Roopa Surapur,Brad Holloman,Deanna Hanisch,Rae Buckley,Carisa Cooney,Ivan Tarapov,Kimberly S. Peairs,Brian Hasselfeld,Peter Greene
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords:
Comments: 15 pages, 8 figures. submitted to Journal of the American Medical Informatics Association
152. 【2503.05200】ORANSight-2.0: Foundational LLMs for O-RAN
Link: https://arxiv.org/abs/2503.05200
Authors: Pranshav Gajjar,Vijay K. Shah
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Keywords:
Comments:
153. 【2503.07522】Building English ASR model with regional language support
Link: https://arxiv.org/abs/2503.07522
Authors: Purvi Agrawal,Vikas Joshi,Bharati Patidar,Ankur Gupta,Rupesh Kumar Mehta
Subjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL)
Keywords:
Comments: 5 pages, 3 figures
154. 【2503.06646】Evaluating and Aligning Human Economic Risk Preferences in LLMs
Link: https://arxiv.org/abs/2503.06646
Authors: Jiaxin Liu,Yi Yang,Kar Yan Tam
Subjects: General Economics (econ.GN); Computation and Language (cs.CL)
Keywords:
Comments:
Information Retrieval
1. 【2503.07584】Talking to GDELT Through Knowledge Graphs
Link: https://arxiv.org/abs/2503.07584
Authors: Audun Myers,Max Vargas,Sinan G. Aksoy,Cliff Joslyn,Benjamin Wilson,Tom Grimes
Subjects: Information Retrieval (cs.IR)
Keywords: Retrieval Augmented Regeneration, Augmented Regeneration, Retrieval Augmented, work we study, strengths and weaknesses
Comments:
Abstract:In this work we study various Retrieval Augmented Regeneration (RAG) approaches to gain an understanding of the strengths and weaknesses of each approach in a question-answering analysis. To gain this understanding we use a case-study subset of the Global Database of Events, Language, and Tone (GDELT) dataset as well as a corpus of raw text scraped from the online news articles. To retrieve information from the text corpus we implement a traditional vector store RAG as well as state-of-the-art large language model (LLM) based approaches for automatically constructing KGs and retrieving the relevant subgraphs. In addition to these corpus approaches, we develop a novel ontology-based framework for constructing knowledge graphs (KGs) from GDELT directly which leverages the underlying schema of GDELT to create structured representations of global events. For retrieving relevant information from the ontology-based KGs we implement both direct graph queries and state-of-the-art graph retrieval approaches. We compare the performance of each method in a question-answering task. We find that while our ontology-based KGs are valuable for question-answering, automated extraction of the relevant subgraphs is challenging. Conversely, LLM-generated KGs, while capturing event summaries, often lack consistency and interpretability. Our findings suggest benefits of a synergistic approach between ontology and LLM-based KG construction, with proposed avenues toward that end.
2. 【2503.07520】From Limited Labels to Open Domains: An Efficient Learning Paradigm for UAV-view Geo-Localization
Link: https://arxiv.org/abs/2503.07520
Authors: Zhongwei Chen,Zhao-Xu Yang,Hai-Jun Rong,Jiawei Lang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Keywords: Traditional UAV-view Geo-Localization, positive sample selection, Traditional UAV-view, learn cross-view domain-invariant, cross-view domain-invariant representations
Comments:
Abstract:Traditional UAV-view Geo-Localization (UVGL) supervised paradigms are constrained by the strict reliance on paired data for positive sample selection, which limits their ability to learn cross-view domain-invariant representations from unpaired data. Moreover, it is necessary to reconstruct the pairing relationship with expensive re-labeling costs for scenario-specific training when deploying in a new domain, which fails to meet the practical demands of open-environment applications. To address this issue, we propose a novel cross-domain invariance knowledge transfer network (CDIKTNet), which comprises a cross-domain invariance sub-network and a cross-domain transfer sub-network to realize a closed-loop framework of invariance feature learning and knowledge transfer. The cross-domain invariance sub-network is utilized to construct an essentially shared feature space across domains by learning structural invariance and spatial invariance in cross-view features. Meanwhile, the cross-domain transfer sub-network uses these invariant features as anchors and employs a dual-path contrastive memory learning mechanism to mine latent cross-domain correlation patterns in unpaired data. Extensive experiments demonstrate that our method achieves state-of-the-art performance under fully supervised conditions. More importantly, with merely 2% paired data, our method exhibits performance comparable to existing supervised paradigms and possesses the ability to transfer directly to qualify for applications in the other scenarios completely without any prior pairing relationship.
3. 【2503.07519】GRITHopper: Decomposition-Free Multi-Hop Dense Retrieval
Link: https://arxiv.org/abs/2503.07519
Authors: Justus-Jonas Erker,Nils Reimers,Iryna Gurevych
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Keywords: Decomposition-based multi-hop retrieval, Decomposition-based multi-hop, retrieval methods rely, complex queries, computationally expensive
Comments: Under Review at ACL Rolling Review (ARR)
Abstract:Decomposition-based multi-hop retrieval methods rely on many autoregressive steps to break down complex queries, which breaks end-to-end differentiability and is computationally expensive. Decomposition-free methods tackle this, but current decomposition-free approaches struggle with longer multi-hop problems and generalization to out-of-distribution data. To address these challenges, we introduce GRITHopper-7B, a novel multi-hop dense retrieval model that achieves state-of-the-art performance on both in-distribution and out-of-distribution benchmarks. GRITHopper combines generative and representational instruction tuning by integrating causal language modeling with dense retrieval training. Through controlled studies, we find that incorporating additional context after the retrieval process, referred to as post-retrieval language modeling, enhances dense retrieval performance. By including elements such as final answers during training, the model learns to better contextualize and retrieve relevant information. GRITHopper-7B offers a robust, scalable, and generalizable solution for multi-hop dense retrieval, and we release it to the community for future research and applications requiring multi-hop reasoning and retrieval capabilities.
4. 【2503.07470】Advancing Vietnamese Information Retrieval with Learning Objective and Benchmark
Link: https://arxiv.org/abs/2503.07470
Authors: Phu-Vinh Nguyen,Minh-Nam Tran,Long Nguyen,Dien Dinh
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: natural language processing, rapid development, invented for multiple, Vietnamese, language processing
Comments:
Abstract:With the rapid development of natural language processing, many language models have been invented for multiple tasks. One important task is information retrieval (IR), which requires models to retrieve relevant documents. Despite its importance in many real-life applications, especially in retrieval augmented generation (RAG) systems, this task lacks Vietnamese benchmarks. This situation causes difficulty in assessing and comparing many existing Vietnamese embedding language models on the task and slows down the advancement of Vietnamese natural language processing (NLP) research. In this work, we aim to provide the Vietnamese research community with a new benchmark for information retrieval, which mainly focuses on retrieval and reranking tasks. Furthermore, we also present a new objective function based on the InfoNCE loss function, which is used to train our Vietnamese embedding model. Our function aims to be better than the origin in information retrieval tasks. Finally, we analyze the effect of temperature, a hyper-parameter in both objective functions, on the performance of text embedding models.
5. 【2503.07377】Process-Supervised LLM Recommenders via Flow-guided Tuning
Link: https://arxiv.org/abs/2503.07377
Authors: Chongming Gao,Mengyao Gao,Chenxiao Fan,Shuai Yuan,Wentao Shi,Xiangnan He
Subjects: Information Retrieval (cs.IR)
Keywords: large language models, likelihood maximization objective, Generative Flow Network, approach amplifies popularity, language models
Comments:
Abstract:While large language models (LLMs) are increasingly adapted for recommendation systems via supervised fine-tuning (SFT), this approach amplifies popularity bias due to its likelihood maximization objective, compromising recommendation diversity and fairness. To address this, we present Flow-guided fine-tuning recommender (Flower), which replaces SFT with a Generative Flow Network (GFlowNet) framework that enacts process supervision through token-level reward propagation. Flower's key innovation lies in decomposing item-level rewards into constituent token rewards, enabling direct alignment between token generation probabilities and their reward signals. This mechanism achieves three critical advancements: (1) popularity bias mitigation and fairness enhancement through empirical distribution matching, (2) preservation of diversity through GFlowNet's proportional sampling, and (3) flexible integration of personalized preferences via adaptable token rewards. Experiments demonstrate Flower's superior distribution-fitting capability and its significant advantages over traditional SFT in terms of fairness, diversity, and accuracy, highlighting its potential to improve LLM-based recommendation systems. The implementation is available via this https URL
6. 【2503.07037】Zero-Shot Hashing Based on Reconstruction With Part Alignment
Link: https://arxiv.org/abs/2503.07037
Authors: Yan Jiang,Zhongmiao Qi,Jianhao Li,Jiangbo Qian,Chong Wang,Yu Xin
Subjects: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Keywords: Zero-shot hashing algorithms, large-scale image retrieval, unseen class data, Hashing algorithms, class data
Comments:
Abstract:Hashing algorithms have been widely used in large-scale image retrieval tasks, especially for seen class data. Zero-shot hashing algorithms have been proposed to handle unseen class data. The key technique in these algorithms involves learning features from seen classes and transferring them to unseen classes, that is, aligning the feature embeddings between the seen and unseen classes. Most existing zero-shot hashing algorithms use the shared attributes between the two classes of interest to complete alignment tasks. However, the attributes are always described for a whole image, even though they represent specific parts of the image. Hence, these methods ignore the importance of aligning attributes with the corresponding image parts, which explicitly introduces noise and reduces the accuracy achieved when aligning the features of seen and unseen classes. To address this problem, we propose a new zero-shot hashing method called RAZH. We first use a clustering algorithm to group similar patches to image parts for attribute matching and then replace the image parts with the corresponding attribute vectors, gradually aligning each part with its nearest attribute. Extensive evaluation results demonstrate the superiority of the RAZH method over several state-of-the-art methods.
7. 【2503.07025】Weak Supervision for Improved Precision in Search Systems
Link: https://arxiv.org/abs/2503.07025
Authors: Sriram Vasudevan
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: deep learning models, supervised learning methods, power deep learning, modern search engines, Labeled datasets
Comments: Accepted to the AAAI 2025 Workshop on Computational Jobs Marketplace
Abstract:Labeled datasets are essential for modern search engines, which increasingly rely on supervised learning methods like Learning to Rank and massive amounts of data to power deep learning models. However, creating these datasets is both time-consuming and costly, leading to the common use of user click and activity logs as proxies for relevance. In this paper, we present a weak supervision approach to infer the quality of query-document pairs and apply it within a Learning to Rank framework to enhance the precision of a large-scale search system.
8. 【2503.06963】Multi-Behavior Recommender Systems: A Survey
Link: https://arxiv.org/abs/2503.06963
Authors: Kyungho Kim,Sunwoo Kim,Geon Lee,Jinhong Jung,Kijung Shin
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: Traditional recommender systems, predict user preferences, Traditional recommender, systems primarily rely, Multi-behavior recommender systems
Comments: Accepted in the PAKDD 2025 Survey Track
Abstract:Traditional recommender systems primarily rely on a single type of user-item interaction, such as item purchases or ratings, to predict user preferences. However, in real-world scenarios, users engage in a variety of behaviors, such as clicking on items or adding them to carts, offering richer insights into their interests. Multi-behavior recommender systems leverage these diverse interactions to enhance recommendation quality, and research on this topic has grown rapidly in recent years. This survey provides a timely review of multi-behavior recommender systems, focusing on three key steps: (1) Data Modeling: representing multi-behaviors at the input level, (2) Encoding: transforming these inputs into vector representations (i.e., embeddings), and (3) Training: optimizing machine-learning models. We systematically categorize existing multi-behavior recommender systems based on the commonalities and differences in their approaches across the above steps. Additionally, we discuss promising future directions for advancing multi-behavior recommender systems.
9. 【2503.06920】AlignPxtr: Aligning Predicted Behavior Distributions for Bias-Free Video Recommendations
Link: https://arxiv.org/abs/2503.06920
Authors: Chengzhi Lin,Chuyuan Wang,Annan Xie,Wuhong Wang,Ziye Zhang,Canguang Ruan,Yuancai Huang,Yongqi Liu
Subjects: Information Retrieval (cs.IR)
Keywords: infer user interest, user interest, user, biases, video recommendation systems
Comments: video recommendation. 7 page, 1 figure
Abstract:In video recommendation systems, user behaviors such as watch time, likes, and follows are commonly used to infer user interest. However, these behaviors are influenced by various biases, including duration bias, demographic biases, and content category biases, which obscure true user preferences. In this paper, we hypothesize that biases and user interest are independent of each other. Based on this assumption, we propose a novel method that aligns predicted behavior distributions across different bias conditions using quantile mapping, theoretically guaranteeing zero mutual information between bias variables and the true user interest. By explicitly modeling the conditional distributions of user behaviors under different biases and mapping these behaviors to quantiles, we effectively decouple user interest from the confounding effects of various biases. Our approach uniquely handles both continuous signals (e.g., watch time) and discrete signals (e.g., likes, comments), while simultaneously addressing multiple bias dimensions. Additionally, we introduce a computationally efficient mean alignment alternative technique for practical real-time inference in large-scale systems. We validate our method through online A/B testing on two major video platforms: Kuaishou Lite and Kuaishou. The results demonstrate significant improvements in user engagement and retention, with cumulative lifts of 0.267% and 0.115% in active days, and 1.102% and 0.131% in average app usage time, respectively. The results demonstrate that our approach consistently achieves significant improvements in long-term user retention and substantial gains in average app usage time across different platforms. Our core code will be published at this https URL.
10. 【2503.06489】Improving Access to Trade and Investment Information in Thailand through Intelligent Document Retrieval
Link: https://arxiv.org/abs/2503.06489
Authors: Sirinda Palahan
Subjects: Information Retrieval (cs.IR); Social and Information Networks (cs.SI)
Keywords: Overseas investment, daunting for beginners, beginners due, vast amount, amount of complex
Comments:
Abstract:Overseas investment and trade can be daunting for beginners due to the vast amount of complex information. This paper presents a chatbot system that integrates natural language processing and information retrieval techniques to simplify the document retrieval process. The proposed system identifies the most relevant content, enabling users to navigate the intricate landscape of foreign trade and investment more efficiently. Our methodology combines the BM25 model and a deep learning model to rank and retrieve documents, aiming to reduce noise in the document content and enhance the accuracy of the results. Experiments with Thai natural language queries have demonstrated the effectiveness of our system in retrieving pertinent documents. A user satisfaction survey further validated the system's effectiveness. Most respondents found the system helpful and agreed with the suggested documents, indicating its potential as a valuable tool for Thai entrepreneurs navigating foreign trade and investment.
11. 【2503.06474】HuixiangDou2: A Robustly Optimized GraphRAG Approach
Link: https://arxiv.org/abs/2503.06474
Authors: Huanjun Kong,Zhefan Wang,Chenyang Wang,Zhe Ma,Nanqing Dong
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords:
Comments: 11 pages
12. 【2503.06430】Graph Retrieval-Augmented LLM for Conversational Recommendation Systems
Link: https://arxiv.org/abs/2503.06430
Authors: Zhangchi Qiu,Linhao Luo,Zicheng Zhao,Shirui Pan,Alan Wee-Chung Liew
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Keywords:
Comments: Accepted by PAKDD 2025
13. 【2503.06238】Image is All You Need: Towards Efficient and Effective Large Language Model-Based Recommender Systems
Link: https://arxiv.org/abs/2503.06238
Authors: Kibum Kim,Sein Kim,Hongseok Kang,Jiwan Kim,Heewoong Noh,Yeonjun In,Kanghoon Yoon,Jinoh Oh,Chanyoung Park
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords:
Comments:
14. 【2503.06034】Rank-R1: Enhancing Reasoning in LLM-based Document Rerankers via Reinforcement Learning
Link: https://arxiv.org/abs/2503.06034
Authors: Shengyao Zhuang,Xueguang Ma,Bevan Koopman,Jimmy Lin,Guido Zuccon
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Keywords:
Comments:
Computer Vision
1. 【2503.07608】AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning
Link: https://arxiv.org/abs/2503.07608
Authors: Bo Jiang,Shaoyu Chen,Qian Zhang,Wenyu Liu,Xinggang Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: surpass human expert-level, human expert-level performance, mathematics and science, reinforcement learning, crucial role
Comments: Project Page: [this https URL](https://github.com/hustvl/AlphaDrive)
Abstract:OpenAI o1 and DeepSeek R1 achieve or even surpass human expert-level performance in complex domains like mathematics and science, with reinforcement learning (RL) and reasoning playing a crucial role. In autonomous driving, recent end-to-end models have greatly improved planning performance but still struggle with long-tailed problems due to limited common sense and reasoning abilities. Some studies integrate vision-language models (VLMs) into autonomous driving, but they typically rely on pre-trained models with simple supervised fine-tuning (SFT) on driving data, without further exploration of training strategies or optimizations specifically tailored for planning. In this paper, we propose AlphaDrive, a RL and reasoning framework for VLMs in autonomous driving. AlphaDrive introduces four GRPO-based RL rewards tailored for planning and employs a two-stage planning reasoning training strategy that combines SFT with RL. As a result, AlphaDrive significantly improves both planning performance and training efficiency compared to using only SFT or without reasoning. Moreover, we are also excited to discover that, following RL training, AlphaDrive exhibits some emergent multimodal planning capabilities, which is critical for improving driving safety and efficiency. To the best of our knowledge, AlphaDrive is the first to integrate GRPO-based RL with planning reasoning into autonomous driving. Code will be released to facilitate future research.
2. 【2503.07607】VoD: Learning Volume of Differences for Video-Based Deepfake Detection
Link: https://arxiv.org/abs/2503.07607
Authors: Ying Xu,Marius Pedersen,Kiran Raja
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: digital contact landscape, digital media integrity, creating realistic Deepfake, poses substantial challenges, contact landscape
Comments:
Abstract:The rapid development of deep learning and generative AI technologies has profoundly transformed the digital contact landscape, creating realistic Deepfake that poses substantial challenges to public trust and digital media integrity. This paper introduces a novel Deepfake detection framework, Volume of Differences (VoD), designed to enhance detection accuracy by exploiting temporal and spatial inconsistencies between consecutive video frames. VoD employs a progressive learning approach that captures differences across multiple axes through the use of consecutive frame differences (CFD) and a network with stepwise expansions. We evaluate our approach with intra-dataset and cross-dataset testing scenarios on various well-known Deepfake datasets. Our findings demonstrate that VoD excels with the data it has been trained on and shows strong adaptability to novel, unseen data. Additionally, comprehensive ablation studies examine various configurations of segment length, sampling steps, and intervals, offering valuable insights for optimizing the framework. The code for our VoD framework is available at this https URL.
3. 【2503.07603】Should VLMs be Pre-trained with Image Data?
Link: https://arxiv.org/abs/2503.07603
Authors: Sedrick Keh,Jean Mercat,Samir Yitzhak Gadre,Kushal Arora,Igor Vasiljevic,Benjamin Burchfiel,Shuran Song,Russ Tedrake,Thomas Kollar,Ludwig Schmidt,Achal Dave
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Pre-trained LLMs, tasks, Abstract, vision-language, vision-language tasks
Comments: ICLR 2025
Abstract:Pre-trained LLMs that are further trained with image data perform well on vision-language tasks. While adding images during a second training phase effectively unlocks this capability, it is unclear how much of a gain or loss this two-step pipeline gives over VLMs which integrate images earlier into the training process. To investigate this, we train models spanning various datasets, scales, image-text ratios, and amount of pre-training done before introducing vision tokens. We then fine-tune these models and evaluate their downstream performance on a suite of vision-language and text-only tasks. We find that pre-training with a mixture of image and text data allows models to perform better on vision-language tasks while maintaining strong performance on text-only evaluations. On an average of 6 diverse tasks, we find that for a 1B model, introducing visual tokens 80% of the way through pre-training results in a 2% average improvement over introducing visual tokens to a fully pre-trained model.
4. 【2503.07602】DreamRelation: Relation-Centric Video Customization
Link: https://arxiv.org/abs/2503.07602
Authors: Yujie Wei,Shiwei Zhang,Hangjie Yuan,Biao Gong,Longxiang Tang,Xiang Wang,Haonan Qiu,Hengjia Li,Shuai Tan,Yingya Zhang,Hongming Shan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Relational video customization, Relational Dynamics Enhancement, Relational Decoupling Learning, real-world visual content, comprehending real-world visual
Comments: Project Page: [this https URL](https://dreamrelation.github.io)
Abstract:Relational video customization refers to the creation of personalized videos that depict user-specified relations between two subjects, a crucial task for comprehending real-world visual content. While existing methods can personalize subject appearances and motions, they still struggle with complex relational video customization, where precise relational modeling and high generalization across subject categories are essential. The primary challenge arises from the intricate spatial arrangements, layout variations, and nuanced temporal dynamics inherent in relations; consequently, current models tend to overemphasize irrelevant visual details rather than capturing meaningful interactions. To address these challenges, we propose DreamRelation, a novel approach that personalizes relations through a small set of exemplar videos, leveraging two key components: Relational Decoupling Learning and Relational Dynamics Enhancement. First, in Relational Decoupling Learning, we disentangle relations from subject appearances using relation LoRA triplet and hybrid mask training strategy, ensuring better generalization across diverse relationships. Furthermore, we determine the optimal design of relation LoRA triplet by analyzing the distinct roles of the query, key, and value features within MM-DiT's attention mechanism, making DreamRelation the first relational video generation framework with explainable components. Second, in Relational Dynamics Enhancement, we introduce space-time relational contrastive loss, which prioritizes relational dynamics while minimizing the reliance on detailed subject appearances. Extensive experiments demonstrate that DreamRelation outperforms state-of-the-art methods in relational video customization. Code and models will be made publicly available.
5. 【2503.07601】Balanced Image Stylization with Style Matching Score
Link: https://arxiv.org/abs/2503.07601
Authors: Yuxin Jiang, Liming Jiang, Shuai Yang, Jia-Wei Liu, Ivor Tsang, Mike Zheng Shou
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: present Style Matching, Style Matching Score, Progressive Spectrum Regularization, Style, style distribution matching
Comments: Project page: [this https URL](https://yuxinn-j.github.io/projects/SMS.html)
Abstract:We present Style Matching Score (SMS), a novel optimization method for image stylization with diffusion models. Balancing effective style transfer with content preservation is a long-standing challenge. Unlike existing efforts, our method reframes image stylization as a style distribution matching problem. The target style distribution is estimated from off-the-shelf style-dependent LoRAs via carefully designed score functions. To preserve content information adaptively, we propose Progressive Spectrum Regularization, which operates in the frequency domain to guide stylization progressively from low-frequency layouts to high-frequency details. In addition, we devise a Semantic-Aware Gradient Refinement technique that leverages relevance maps derived from diffusion semantic priors to selectively stylize semantically important regions. The proposed optimization formulation extends stylization from pixel space to parameter space, readily applicable to lightweight feedforward generators for efficient one-step stylization. SMS effectively balances style alignment and content preservation, outperforming state-of-the-art approaches, verified by extensive experiments.
6. 【2503.07598】VACE: All-in-One Video Creation and Editing
Link: https://arxiv.org/abs/2503.07598
Authors: Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, Yu Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Diffusion Transformer, demonstrated powerful capability, Transformer has demonstrated, generating high-quality images, demonstrated powerful
Comments:
Abstract:Diffusion Transformer has demonstrated powerful capability and scalability in generating high-quality images and videos. Further pursuing the unification of generation and editing tasks has yielded significant progress in the domain of image content creation. However, due to the intrinsic demands for consistency across both temporal and spatial dynamics, achieving a unified approach for video synthesis remains challenging. We introduce VACE, which enables users to perform Video tasks within an All-in-one framework for Creation and Editing. These tasks include reference-to-video generation, video-to-video editing, and masked video-to-video editing. Specifically, we effectively integrate the requirements of various tasks by organizing video task inputs, such as editing, reference, and masking, into a unified interface referred to as the Video Condition Unit (VCU). Furthermore, by utilizing a Context Adapter structure, we inject different task concepts into the model using formalized representations of temporal and spatial dimensions, allowing it to handle arbitrary video synthesis tasks flexibly. Extensive experiments demonstrate that the unified model of VACE achieves performance on par with task-specific models across various subtasks. Simultaneously, it enables diverse applications through versatile task combinations. Project page: this https URL.
7. 【2503.07597】HumanMM: Global Human Motion Recovery from Multi-shot Videos
Link: https://arxiv.org/abs/2503.07597
Authors: Yuhong Zhang, Guanlin Wu, Ling-Hao Chen, Zhuokai Zhao, Jing Lin, Xiaoke Jiang, Jiamin Wu, Zhuoheng Li, Hao Frank Yang, Haoqian Wang, Lei Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: multiple shot transitions, framework designed, designed to reconstruct, shot transitions, reconstruct long-sequence
Comments: CVPR 2025; Project page: [this https URL](https://zhangyuhong01.github.io/HumanMM/)
Abstract:In this paper, we present a novel framework designed to reconstruct long-sequence 3D human motion in the world coordinates from in-the-wild videos with multiple shot transitions. Such long-sequence in-the-wild motions are highly valuable to applications such as motion generation and motion understanding, but are of great challenge to be recovered due to abrupt shot transitions, partial occlusions, and dynamic backgrounds presented in such videos. Existing methods primarily focus on single-shot videos, where continuity is maintained within a single camera view, or simplify multi-shot alignment in camera space only. In this work, we tackle the challenges by integrating an enhanced camera pose estimation with Human Motion Recovery (HMR) by incorporating a shot transition detector and a robust alignment module for accurate pose and orientation continuity across shots. By leveraging a custom motion integrator, we effectively mitigate the problem of foot sliding and ensure temporal consistency in human pose. Extensive evaluations on our created multi-shot dataset from public 3D human datasets demonstrate the robustness of our method in reconstructing realistic human motion in world coordinates.
8. 【2503.07593】Hierarchical Cross-Modal Alignment for Open-Vocabulary 3D Object Detection
Link: https://arxiv.org/abs/2503.07593
Authors: Youjun Zhao, Jiaying Lin, Rynson W.H. Lau
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: aims at localizing, closed sets, localizing and classifying, object detection, Open-vocabulary
Comments: AAAI 2025 (Extended Version). Project Page: [this https URL](https://youjunzhao.github.io/HCMA/)
Abstract:Open-vocabulary 3D object detection (OV-3DOD) aims at localizing and classifying novel objects beyond closed sets. The recent success of vision-language models (VLMs) has demonstrated their remarkable capabilities to understand open vocabularies. Existing works that leverage VLMs for 3D object detection (3DOD) generally resort to representations that lose the rich scene context required for 3D perception. To address this problem, we propose in this paper a hierarchical framework, named HCMA, to simultaneously learn local object and global scene information for OV-3DOD. Specifically, we first design a Hierarchical Data Integration (HDI) approach to obtain coarse-to-fine 3D-image-text data, which is fed into a VLM to extract object-centric knowledge. To facilitate the association of feature hierarchies, we then propose an Interactive Cross-Modal Alignment (ICMA) strategy to establish effective intra-level and inter-level feature connections. To better align features across different levels, we further propose an Object-Focusing Context Adjustment (OFCA) module to refine multi-level features by emphasizing object-related features. Extensive experiments demonstrate that the proposed method outperforms SOTA methods on the existing OV-3DOD benchmarks. It also achieves promising OV-3DOD results even without any 3D annotations.
9. 【2503.07591】Filter Images First, Generate Instructions Later: Pre-Instruction Data Selection for Visual Instruction Tuning
Link: https://arxiv.org/abs/2503.07591
Authors: Bardia Safaei, Faizan Siddiqui, Jiacong Xu, Vishal M. Patel, Shao-Yuan Lo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Visual instruction tuning, large vision-language models, VIT, image-instruction pairs, Visual instruction
Comments: Accepted at Computer Vision and Pattern Recognition Conference (CVPR) 2025
Abstract:Visual instruction tuning (VIT) for large vision-language models (LVLMs) requires training on expansive datasets of image-instruction pairs, which can be costly. Recent efforts in VIT data selection aim to select a small subset of high-quality image-instruction pairs, reducing VIT runtime while maintaining performance comparable to full-scale training. However, a major challenge often overlooked is that generating instructions from unlabeled images for VIT is highly expensive. Most existing VIT datasets rely heavily on human annotations or paid services like the GPT API, which limits users with constrained resources from creating VIT datasets for custom applications. To address this, we introduce Pre-Instruction Data Selection (PreSel), a more practical data selection paradigm that directly selects the most beneficial unlabeled images and generates instructions only for the selected images. PreSel first estimates the relative importance of each vision task within VIT datasets to derive task-wise sampling budgets. It then clusters image features within each task, selecting the most representative images with the budget. This approach reduces computational overhead for both instruction generation during VIT data formation and LVLM fine-tuning. By generating instructions for only 15% of the images, PreSel achieves performance comparable to full-data VIT on the LLaVA-1.5 and Vision-Flan datasets. The link to our project page: this https URL
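The two-step selection described above (task-wise budgets, then representative images within each task) can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the task names, the integer importance weights, and the function names `task_budgets` and `most_central` are hypothetical, and a single mean vector per task stands in for the clustering step mentioned in the abstract.

```python
import math

def task_budgets(importance, total_budget):
    """Split total_budget across tasks in proportion to their importance
    weights, using largest-remainder rounding so the parts sum exactly."""
    total = sum(importance.values())
    raw = {task: total_budget * w / total for task, w in importance.items()}
    budgets = {task: int(r) for task, r in raw.items()}
    leftover = total_budget - sum(budgets.values())
    # Hand the remaining slots to the tasks with the largest fractional parts.
    by_remainder = sorted(raw, key=lambda task: raw[task] - budgets[task], reverse=True)
    for task in by_remainder[:leftover]:
        budgets[task] += 1
    return budgets

def most_central(features, k):
    """Indices of the k feature vectors closest to the mean vector
    (a stand-in for picking representatives from real clusters)."""
    dim = len(features[0])
    centroid = [sum(f[d] for f in features) / len(features) for d in range(dim)]
    order = sorted(range(len(features)), key=lambda i: math.dist(features[i], centroid))
    return order[:k]

# Hypothetical setup: 1,000 unlabeled images, 15% of them selected overall.
importance = {"vqa": 5, "captioning": 3, "ocr": 2}  # assumed weights
budgets = task_budgets(importance, round(0.15 * 1000))
print(budgets)

# Toy 2-D "image features" for one task; select the single most central image.
print(most_central([[0.0, 0.0], [1.0, 1.0], [10.0, 10.0]], 1))
```

Only the selected indices would then be sent to the (expensive) instruction generator, which is where the cost saving described in the abstract comes from.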
10. 【2503.07588】When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning
Link: https://arxiv.org/abs/2503.07588
Authors: Junwei Luo, Yingying Zhang, Xue Yang, Kang Wu, Qi Zhu, Lei Liang, Jingdong Chen, Yansheng Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: large Remote Sensing, Remote Sensing Images, Efficient vision-language understanding, Remote Sensing, Efficient vision-language
Comments: 12 pages, 6 figures, 7 tables
Abstract:Efficient vision-language understanding of large Remote Sensing Images (RSIs) is meaningful but challenging. Current Large Vision-Language Models (LVLMs) typically employ limited pre-defined grids to process images, leading to information loss when handling gigapixel RSIs. Conversely, using unlimited grids significantly increases computational costs. To preserve image details while reducing computational complexity, we propose a text-guided token pruning method with Dynamic Image Pyramid (DIP) integration. Our method introduces: (i) a Region Focus Module (RFM) that leverages text-aware region localization capability to identify critical vision tokens, and (ii) a coarse-to-fine image tile selection and vision token pruning strategy based on DIP, which is guided by RFM outputs and avoids directly processing the entire large imagery. Additionally, existing benchmarks for evaluating LVLMs' perception ability on large RSI suffer from limited question diversity and constrained image sizes. We construct a new benchmark named LRS-VQA, which contains 7,333 QA pairs across 8 categories, with image length up to 27,328 pixels. Our method outperforms existing high-resolution strategies on four datasets using the same data. Moreover, compared to existing token reduction methods, our approach demonstrates higher efficiency under high-resolution settings. Dataset and code are in this https URL.
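As a rough illustration of what text-guided token pruning means in general, the sketch below keeps only the vision tokens most similar to a text embedding. It is a generic cosine-similarity top-k filter under assumed toy embeddings, not the paper's Region Focus Module or Dynamic Image Pyramid; the function names and all vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def prune_tokens(vision_tokens, text_embedding, keep):
    """Return the indices of the `keep` tokens most similar to the text,
    in their original order so spatial ordering is preserved."""
    scores = [cosine(tok, text_embedding) for tok in vision_tokens]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:keep]
    return sorted(top)

# Toy 2-D embeddings (made-up values): tokens 0 and 2 point roughly the
# same way as the text embedding, tokens 1 and 3 do not.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
text = [1.0, 0.2]
print(prune_tokens(tokens, text, keep=2))  # [0, 2]
```

In a coarse-to-fine scheme such as the one the abstract describes, a filter of this kind would be applied at a low resolution first, and only the surviving regions would be revisited at higher resolution.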
11. 【2503.07587】Robusto-1 Dataset: Comparing Humans and VLMs on real out-of-distribution Autonomous Driving VQA from Peru
Link: https://arxiv.org/abs/2503.07587
Authors: Dunant Cusipuma, David Ortega, Victor Flores-Benites, Arturo Deza
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Keywords: foundational models start, multimodal foundational models, Foundational Visual Language, Visual Language Models, Representational Similarity Analysis
Comments: A pre-print. 26 pages. Link to Code + Data: [this https URL](https://huggingface.co/datasets/Artificio/robusto-1)
Abstract:As multimodal foundational models start being deployed experimentally in Self-Driving cars, a reasonable question we ask ourselves is how similarly to humans these systems respond in certain driving situations, especially those that are out-of-distribution. To study this, we create the Robusto-1 dataset that uses dashcam video data from Peru, a country with some of the worst (most aggressive) drivers in the world, a high traffic index, and a high ratio of bizarre to non-bizarre street objects likely never seen in training. In particular, to preliminarily test at a cognitive level how well Foundational Visual Language Models (VLMs) compare to Humans in Driving, we move away from bounding boxes, segmentation maps, occupancy maps or trajectory estimation to multi-modal Visual Question Answering (VQA), comparing both humans and machines through a popular method in systems neuroscience known as Representational Similarity Analysis (RSA). Depending on the type of questions we ask and the answers these systems give, we show in which cases VLMs and Humans converge or diverge, allowing us to probe their cognitive alignment. We find that the degree of alignment varies significantly depending on the type of questions asked to each type of system (Humans vs VLMs), highlighting a gap in their alignment.
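Representational Similarity Analysis, the method named in the abstract above, compares two systems by correlating their pairwise dissimilarity structure over the same stimuli. The following is a minimal sketch rather than the authors' pipeline: the toy answer vectors, the Euclidean distance, and the Pearson correlation are illustrative assumptions, since the abstract does not state which encoding, distance, or correlation the paper uses.

```python
import math
from itertools import combinations

def rdm(features):
    """Upper triangle of the representational dissimilarity matrix:
    Euclidean distance between every pair of stimuli."""
    return [math.dist(features[i], features[j])
            for i, j in combinations(range(len(features)), 2)]

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rsa_score(features_a, features_b):
    """Similarity of two systems: correlation of their dissimilarity structures."""
    return pearson(rdm(features_a), rdm(features_b))

# One row per stimulus (e.g. one driving question); values are made up.
human = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
model_a = [[2.0, 0.0], [1.8, 0.2], [0.0, 2.0]]  # same geometry, different scale
model_b = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]  # stimuli 2 and 3 swapped

print(rsa_score(human, model_a))  # close to 1.0: identical relational structure
print(rsa_score(human, model_b))  # lower: the structure differs
```

Because only the pattern of distances is compared, the two systems do not need to share an embedding space or even the same dimensionality, which is what makes the method convenient for human-versus-model comparisons.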
12. 【2503.07578】Denoising Score Distillation: From Noisy Diffusion Pretraining to One-Step High-Quality Generation
Link: https://arxiv.org/abs/2503.07578
Authors: Tianyu Chen, Yasi Zhang, Zhendong Wang, Ying Nian Wu, Oscar Leong, Mingyuan Zhou
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: achieved remarkable success, diverse natural distributions, generating high-resolution, realistic images, achieved remarkable
Comments: First Author and Second Author contributed equally to this work. The last two authors equally advised this work
Abstract:Diffusion models have achieved remarkable success in generating high-resolution, realistic images across diverse natural distributions. However, their performance heavily relies on high-quality training data, making it challenging to learn meaningful distributions from corrupted samples. This limitation restricts their applicability in scientific domains where clean data is scarce or costly to obtain. In this work, we introduce denoising score distillation (DSD), a surprisingly effective and novel approach for training high-quality generative models from low-quality data. DSD first pretrains a diffusion model exclusively on noisy, corrupted samples and then distills it into a one-step generator capable of producing refined, clean outputs. While score distillation is traditionally viewed as a method to accelerate diffusion models, we show that it can also significantly enhance sample quality, particularly when starting from a degraded teacher model. Across varying noise levels and datasets, DSD consistently improves generative performance; we summarize our empirical evidence in Fig. 1. Furthermore, we provide theoretical insights showing that, in a linear model setting, DSD identifies the eigenspace of the clean data distribution's covariance matrix, implicitly regularizing the generator. This perspective reframes score distillation as not only a tool for efficiency but also a mechanism for improving generative models, particularly in low-quality data settings.
13. 【2503.07575】VisBias: Measuring Explicit and Implicit Social Biases in Vision Language Models
Link: https://arxiv.org/abs/2503.07575
Authors: Jen-tse Huang, Jiantong Qin, Jianping Zhang, Youliang Yuan, Wenxuan Wang, Jieyu Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: social biases exhibited, implicit social biases, research investigates, exhibited by Vision-Language, Vision-Language Models
Comments: 9 pages
Abstract:This research investigates both explicit and implicit social biases exhibited by Vision-Language Models (VLMs). The key distinction between these bias types lies in the level of awareness: explicit bias refers to conscious, intentional biases, while implicit bias operates subconsciously. To analyze explicit bias, we directly pose questions to VLMs related to gender and racial differences: (1) Multiple-choice questions based on a given image (e.g., "What is the education level of the person in the image?") (2) Yes-No comparisons using two images (e.g., "Is the person in the first image more educated than the person in the second image?") For implicit bias, we design tasks where VLMs assist users but reveal biases through their responses: (1) Image description tasks: Models are asked to describe individuals in images, and we analyze disparities in textual cues across demographic groups. (2) Form completion tasks: Models draft a personal information collection form with 20 attributes, and we examine correlations among selected attributes for potential biases. We evaluate Gemini-1.5, GPT-4V, GPT-4o, LLaMA-3.2-Vision and LLaVA-v1.6. Our code and data are publicly available at this https URL.
14. 【2503.07561】Alligat0R: Pre-Training Through Co-Visibility Segmentation for Relative Camera Pose Regression
Link: https://arxiv.org/abs/2503.07561
Authors: Thibaut Loiseau, Guillaume Bourmaud, Vincent Lepetit
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: advanced computer vision, greatly advanced computer, yielding impressive results, completion approach yielding, approach yielding impressive
Comments:
Abstract:Pre-training techniques have greatly advanced computer vision, with CroCo's cross-view completion approach yielding impressive results in tasks like 3D reconstruction and pose regression. However, this method requires substantial overlap between training pairs, limiting its effectiveness. We introduce Alligat0R, a novel pre-training approach that reformulates cross-view learning as a co-visibility segmentation task. Our method predicts whether each pixel in one image is co-visible in the second image, occluded, or outside the field of view (FOV), enabling the use of image pairs with any degree of overlap and providing interpretable predictions. To support this, we present Cub3, a large-scale dataset with 2.5 million image pairs and dense co-visibility annotations derived from the nuScenes dataset. This dataset includes diverse scenarios with varying degrees of overlap. The experiments show that Alligat0R significantly outperforms CroCo in relative pose regression, especially in scenarios with limited overlap. Alligat0R and Cub3 will be made publicly available.
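To make the three per-pixel classes concrete (co-visible, occluded, outside the field of view), here is a small geometric sketch of how such a label could be computed for one pixel when depth and relative pose are known. The abstract does not say how the Cub3 annotations were actually derived, so the reprojection test, the pinhole intrinsics tuple `K = (fx, fy, cx, cy)`, the depth tolerance, and the function name are all assumptions made for illustration.

```python
def classify_pixel(point_cam1, R, t, K, depth2, width, height, tol=0.05):
    """Classify a 3D point, given in camera-1 coordinates, as seen from camera 2.

    R (3x3 nested lists) and t (length 3) map camera-1 to camera-2 coordinates;
    depth2[v][u] is the depth observed by camera 2 at pixel (u, v).
    """
    # Rigid transform into the camera-2 frame: X2 = R * X1 + t.
    x2 = [sum(R[r][c] * point_cam1[c] for c in range(3)) + t[r] for r in range(3)]
    if x2[2] <= 0:  # behind the second camera
        return "outside_fov"
    fx, fy, cx, cy = K
    u = round(fx * x2[0] / x2[2] + cx)  # pinhole projection
    v = round(fy * x2[1] / x2[2] + cy)
    if not (0 <= u < width and 0 <= v < height):
        return "outside_fov"
    # Something closer to camera 2 is observed at this pixel: the point is hidden.
    if x2[2] > depth2[v][u] + tol:
        return "occluded"
    return "covisible"

# Toy setup: identical cameras, a 4x4 image, and a flat surface at depth 5.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zero = [0.0, 0.0, 0.0]
K = (100.0, 100.0, 2.0, 2.0)
depth2 = [[5.0] * 4 for _ in range(4)]

print(classify_pixel((0.0, 0.0, 5.0), identity, zero, K, depth2, 4, 4))  # covisible
print(classify_pixel((0.0, 0.0, 8.0), identity, zero, K, depth2, 4, 4))  # occluded
print(classify_pixel((1.0, 0.0, 5.0), identity, zero, K, depth2, 4, 4))  # outside_fov
```

A segmentation network trained on labels of this kind has a well-defined target for every pixel regardless of how much the two views overlap, which is the property the abstract contrasts with cross-view completion.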
15. 【2503.07535】LBM: Latent Bridge Matching for Fast Image-to-Image Translation
Link: https://arxiv.org/abs/2503.07535
Authors: Clément Chadebec, Onur Tasar, Sanjeev Sreetharan, Benjamin Aubin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Latent Bridge Matching, introduce Latent Bridge, Bridge Matching, Latent Bridge, introduce Latent
Comments:
Abstract:In this paper, we introduce Latent Bridge Matching (LBM), a new, versatile and scalable method that relies on Bridge Matching in a latent space to achieve fast image-to-image translation. We show that the method can reach state-of-the-art results for various image-to-image tasks using only a single inference step. In addition to its efficiency, we also demonstrate the versatility of the method across different image translation tasks such as object removal, normal and depth estimation, and object relighting. We also derive a conditional framework of LBM and demonstrate its effectiveness by tackling the tasks of controllable image relighting and shadow generation. We provide an open-source implementation of the method at this https URL.
16. 【2503.07523】VisRL: Intention-Driven Visual Perception via Reinforced Reasoning
Link: https://arxiv.org/abs/2503.07523
Authors: Zhangquan Chen, Xufang Luo, Dongsheng Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: understanding is inherently, scene based, humans selectively focus, Visual understanding, Visual
Comments: 18 pages, 11 figures
Abstract:Visual understanding is inherently intention-driven - humans selectively focus on different regions of a scene based on their goals. Recent advances in large multimodal models (LMMs) enable flexible expression of such intentions through natural language, allowing queries to guide visual reasoning processes. Frameworks like Visual Chain-of-Thought have demonstrated the benefit of incorporating explicit reasoning steps, where the model predicts a focus region before answering a query. However, existing approaches rely heavily on supervised training with annotated intermediate bounding boxes, which severely limits scalability due to the combinatorial explosion of intention-region pairs. To overcome this limitation, we propose VisRL, the first framework that applies reinforcement learning (RL) to the problem of intention-driven visual perception. VisRL optimizes the entire visual reasoning process using only reward signals. By treating intermediate focus selection as an internal decision optimized through trial-and-error, our method eliminates the need for costly region annotations while aligning more closely with how humans learn to perceive the world. Extensive experiments across multiple benchmarks show that VisRL consistently outperforms strong baselines, demonstrating both its effectiveness and its strong generalization across different LMMs. Our code is available at this https URL.
17. 【2503.07520】From Limited Labels to Open Domains: An Efficient Learning Paradigm for UAV-view Geo-Localization
Link: https://arxiv.org/abs/2503.07520
Authors: Zhongwei Chen, Zhao-Xu Yang, Hai-Jun Rong, Jiawei Lang
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Keywords: Traditional UAV-view Geo-Localization, positive sample selection, Traditional UAV-view, learn cross-view domain-invariant, cross-view domain-invariant representations
Comments:
Abstract:Traditional UAV-view Geo-Localization (UVGL) supervised paradigms are constrained by the strict reliance on paired data for positive sample selection, which limits their ability to learn cross-view domain-invariant representations from unpaired data. Moreover, it is necessary to reconstruct the pairing relationship with expensive re-labeling costs for scenario-specific training when deploying in a new domain, which fails to meet the practical demands of open-environment applications. To address this issue, we propose a novel cross-domain invariance knowledge transfer network (CDIKTNet), which comprises a cross-domain invariance sub-network and a cross-domain transfer sub-network to realize a closed-loop framework of invariance feature learning and knowledge transfer. The cross-domain invariance sub-network is utilized to construct an essentially shared feature space across domains by learning structural invariance and spatial invariance in cross-view features. Meanwhile, the cross-domain transfer sub-network uses these invariant features as anchors and employs a dual-path contrastive memory learning mechanism to mine latent cross-domain correlation patterns in unpaired data. Extensive experiments demonstrate that our method achieves state-of-the-art performance under fully supervised conditions. More importantly, with merely 2% paired data, our method exhibits performance comparable to existing supervised paradigms and possesses the ability to transfer directly to qualify for applications in the other scenarios completely without any prior pairing relationship.
18. 【2503.07517】FastInstShadow: A Simple Query-Based Model for Instance Shadow Detection
Link: https://arxiv.org/abs/2503.07517
Authors: Takeru Inoue, Ryusuke Miyamoto
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Instance shadow detection, Instance shadow, shadows and objects, task of detecting, detecting pairs
Comments:
Abstract:Instance shadow detection is the task of detecting pairs of shadows and objects, where existing methods first detect shadows and objects independently, then associate them. This paper introduces FastInstShadow, a method that enhances detection accuracy through a query-based architecture featuring an association transformer decoder with two dual-path transformer decoders to assess relationships between shadows and objects during detection. Experimental results using the SOBA dataset showed that the proposed method outperforms all existing methods across all criteria. This method makes real-time processing feasible for moderate-resolution images with better accuracy than SSISv2, the most accurate existing method. Our code is available at this https URL.
19. 【2503.07516】CPAny: Couple With Any Encoder to Refer Multi-Object Tracking
Link: https://arxiv.org/abs/2503.07516
Authors: Weize Li, Yunhao Du, Qixiang Yin, Zhicheng Zhao, Fei Su, Daqi Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: localize target trajectories, Referring Multi-Object Tracking, natural language expressions, aims to localize, localize target
Comments:
Abstract:Referring Multi-Object Tracking (RMOT) aims to localize target trajectories specified by natural language expressions in videos. Existing RMOT methods mainly follow two paradigms, namely, one-stage strategies and two-stage ones. The former jointly trains tracking with referring but suffers from substantial computational overhead. Although the latter improves computational efficiency, its CLIP-inspired dual-tower architecture restricts compatibility with other visual/text backbones and is not future-proof. To overcome these limitations, we propose CPAny, a novel encoder-decoder framework for two-stage RMOT, which introduces two core components: (1) a Contextual Visual Semantic Abstractor (CVSA) performs context-aware aggregation on visual backbone features and projects them into a unified semantic space; (2) a Parallel Semantic Summarizer (PSS) decodes the visual and linguistic features at the semantic level in parallel and generates referring scores. By replacing the inherent feature alignment of encoders with a self-constructed unified semantic space, CPAny achieves flexible compatibility with arbitrary emerging visual/text encoders. Meanwhile, CPAny aggregates contextual information by encoding only once and processes multiple expressions in parallel, significantly reducing computational redundancy. Extensive experiments on the Refer-KITTI and Refer-KITTI-V2 datasets show that CPAny outperforms SOTA methods across diverse encoder combinations, with a particular 7.77% HOTA improvement on Refer-KITTI-V2. Code will be available soon.
20. 【2503.07511】PointVLA: Injecting the 3D World into Vision-Language-Action Models
Link: https://arxiv.org/abs/2503.07511
Authors: Chengmeng Li, Junjie Wen, Yan Peng, Yaxin Peng, Feifei Feng, Yichen Zhu
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: limits spatial reasoning, spatial reasoning critical, RGB images limits, reliance on RGB, images limits spatial
Comments:
Abstract:Vision-Language-Action (VLA) models excel at robotic tasks by leveraging large-scale 2D vision-language pretraining, but their reliance on RGB images limits spatial reasoning critical for real-world interaction. Retraining these models with 3D data is computationally prohibitive, while discarding existing 2D datasets wastes valuable resources. To bridge this gap, we propose PointVLA, a framework that enhances pre-trained VLAs with point cloud inputs without requiring retraining. Our method freezes the vanilla action expert and injects 3D features via a lightweight modular block. To identify the most effective way of integrating point cloud representations, we conduct a skip-block analysis to pinpoint less useful blocks in the vanilla action expert, ensuring that 3D features are injected only into these blocks, minimizing disruption to pre-trained representations. Extensive experiments demonstrate that PointVLA outperforms state-of-the-art 2D imitation learning methods, such as OpenVLA, Diffusion Policy and DexVLA, across both simulated and real-world robotic tasks. Specifically, we highlight several key advantages of PointVLA enabled by point cloud integration: (1) Few-shot multi-tasking, where PointVLA successfully performs four different tasks using only 20 demonstrations each; (2) Real-vs-photo discrimination, where PointVLA distinguishes real objects from their images, leveraging 3D world knowledge to improve safety and reliability; (3) Height adaptability, where, unlike conventional 2D imitation learning methods, PointVLA enables robots to adapt to objects at varying table heights that are unseen in the training data. Furthermore, PointVLA achieves strong performance in long-horizon tasks, such as picking and packing objects from a moving conveyor belt, showcasing its ability to generalize across complex, dynamic environments.
21. 【2503.07507】PE3R: Perception-Efficient 3D Reconstruction
Link: https://arxiv.org/abs/2503.07507
Authors: Jie Hu, Shizun Wang, Xinchao Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent advancements, improved the understanding, reconstruction, Recent, perception accuracy
Comments:
Abstract:Recent advancements in 2D-to-3D perception have significantly improved the understanding of 3D scenes from 2D images. However, existing methods face critical challenges, including limited generalization across scenes, suboptimal perception accuracy, and slow reconstruction speeds. To address these limitations, we propose Perception-Efficient 3D Reconstruction (PE3R), a novel framework designed to enhance both accuracy and efficiency. PE3R employs a feed-forward architecture to enable rapid 3D semantic field reconstruction. The framework demonstrates robust zero-shot generalization across diverse scenes and objects while significantly improving reconstruction speed. Extensive experiments on 2D-to-3D open-vocabulary segmentation and 3D reconstruction validate the effectiveness and versatility of PE3R. The framework achieves a minimum 9-fold speedup in 3D semantic field reconstruction, along with substantial gains in perception accuracy and reconstruction precision, setting new benchmarks in the field. The code is publicly available at: this https URL.
22. 【2503.07506】ADROIT: A Self-Supervised Framework for Learning Robust Representations for Active Learning
Link: https://arxiv.org/abs/2503.07506
Authors: Soumya Banerjee, Vinay Kumar Verma
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: minimizing annotation costs, Active learning aims, Active learning, select optimal samples, minimizing annotation
Comments:
Abstract:Active learning aims to select optimal samples for labeling, minimizing annotation costs. This paper introduces a unified representation learning framework tailored for active learning with task awareness. It integrates diverse sources, comprising reconstruction, adversarial, self-supervised, knowledge-distillation, and classification losses into a unified VAE-based ADROIT approach. The proposed approach comprises three key components - a unified representation generator (VAE), a state discriminator, and a (proxy) task-learner or classifier. ADROIT learns a latent code using both labeled and unlabeled data, incorporating task-awareness by leveraging labeled data with the proxy classifier. Unlike previous approaches, the proxy classifier additionally employs a self-supervised loss on unlabeled data and utilizes knowledge distillation to align with the target task-learner. The state discriminator distinguishes between labeled and unlabeled data, facilitating the selection of informative unlabeled samples. The dynamic interaction between VAE and the state discriminator creates a competitive environment, with the VAE attempting to deceive the discriminator, while the state discriminator learns to differentiate between labeled and unlabeled inputs. Extensive evaluations on diverse datasets and ablation analysis affirm the effectiveness of the proposed model.
23. 【2503.07503】Think Before You Segment: High-Quality Reasoning Segmentation with GPT Chain of Thoughts
Link: https://arxiv.org/abs/2503.07503
Authors: Shiu-hong Kao, Yu-Wing Tai, Chi-Keung Tang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Language Models, challenging vision-language task, non-visual query text, multimodal Large Language, vision-language task
Comments: Project page: [this https URL](https://cse.hkust.edu.hk/~skao/thinkfirst.html)
Abstract:Reasoning segmentation is a challenging vision-language task that aims to output the segmentation mask with respect to a complex, implicit, and even non-visual query text. Previous works incorporated multimodal Large Language Models (MLLMs) with segmentation models to approach the difficult problem. However, their segmentation quality often falls short in complex cases, particularly when dealing with out-of-domain objects with intricate structures, blurry boundaries, occlusions, or high similarity with surroundings. In this paper, we introduce ThinkFirst, a training-free reasoning segmentation framework that leverages GPT's chain of thought to address these challenging cases. Our approach allows GPT-4o or other powerful MLLMs to generate a detailed, chain-of-thought description of an image. This summarized description is then passed to a language-instructed segmentation assistant to aid the segmentation process. Our framework allows users to easily interact with the segmentation agent using multimodal inputs, such as easy text and image scribbles, for successive refinement or communication. We evaluate the performance of ThinkFirst on diverse objects. Extensive experiments show that this zero-shot-CoT approach significantly improves the vanilla reasoning segmentation agent, both qualitatively and quantitatively, while being less sensitive or critical to user-supplied prompts after Thinking First.
24. 【2503.07499】AthletePose3D: A Benchmark Dataset for 3D Human Pose Estimation and Kinematic Validation in Athletic Movements
Link: https://arxiv.org/abs/2503.07499
Authors: Calvin Yeung, Tomohiro Suzuki, Ryota Tanaka, Zhuoer Yin, Keisuke Fujii
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Human pose estimation, Human pose, spanning sports science, applications spanning sports, pose estimation
Comments:
Abstract:Human pose estimation is a critical task in computer vision and sports biomechanics, with applications spanning sports science, rehabilitation, and biomechanical research. While significant progress has been made in monocular 3D pose estimation, current datasets often fail to capture the complex, high-acceleration movements typical of competitive sports. In this work, we introduce AthletePose3D, a novel dataset designed to address this gap. AthletePose3D includes 12 types of sports motions across various disciplines, with approximately 1.3 million frames and 165 thousand individual postures, specifically capturing high-speed, high-acceleration athletic movements. We evaluate state-of-the-art (SOTA) monocular 2D and 3D pose estimation models on the dataset, revealing that models trained on conventional datasets perform poorly on athletic motions. However, fine-tuning these models on AthletePose3D notably reduces the SOTA model mean per joint position error (MPJPE) from 214mm to 65mm, a reduction of over 69%. We also validate the kinematic accuracy of monocular pose estimations through waveform analysis, highlighting strong correlations in joint angle estimations but limitations in velocity estimation. Our work provides a comprehensive evaluation of monocular pose estimation models in the context of sports, contributing valuable insights for advancing monocular pose estimation techniques in high-performance sports environments. The dataset, code, and model checkpoints are available at: this https URL
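For readers unfamiliar with the metric quoted above, the following is a minimal sketch of MPJPE as it is commonly defined (the mean Euclidean distance between predicted and ground-truth joints), together with a check of the reported relative reduction. The function names and the toy joint coordinates are assumptions made for illustration; this is not the authors' evaluation code.

```python
import math


def mpjpe(pred, target):
    """Mean per joint position error: average Euclidean distance over joints."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


def relative_reduction(before, after):
    """Fraction by which an error shrinks when going from `before` to `after`."""
    return (before - after) / before


# Toy example with two joints (coordinates in mm, made up for illustration).
pred = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
target = [(3.0, 4.0, 0.0), (100.0, 0.0, 12.0)]
toy_error = mpjpe(pred, target)  # joint distances are 5 mm and 12 mm

# The MPJPE figures quoted in the abstract: 214 mm before, 65 mm after.
reduction = relative_reduction(214.0, 65.0)
print(f"toy MPJPE: {toy_error:.1f} mm, reported reduction: {reduction:.1%}")
```

The second computation is only arithmetic on the two numbers in the abstract: (214 - 65) / 214 is roughly 0.696, which agrees with the stated "over 69%".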
25. 【2503.07493】V2Flow: Unifying Visual Tokenization and Large Language Model Vocabularies for Autoregressive Image Generation
Link: https://arxiv.org/abs/2503.07493
Authors: Guiwei Zhang, Tianyu Zhang, Mohan Zhou, Yalong Bai, Biye Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: large language models, produces discrete visual, latent distribution alignment, visual, language models
Comments: 11 pages, 6 figures
Abstract:We propose V2Flow, a novel tokenizer that produces discrete visual tokens capable of high-fidelity reconstruction, while ensuring structural and latent distribution alignment with the vocabulary space of large language models (LLMs). Leveraging this tight visual-vocabulary coupling, V2Flow enables autoregressive visual generation on top of existing LLMs. Our approach formulates visual tokenization as a flow-matching problem, aiming to learn a mapping from a standard normal prior to the continuous image distribution, conditioned on token sequences embedded within the LLM's vocabulary space. The effectiveness of V2Flow stems from two core designs. First, we propose a Visual Vocabulary resampler, which compresses visual data into compact token sequences, with each represented as a soft categorical distribution over the LLM's vocabulary. This allows seamless integration of visual tokens into existing LLMs for autoregressive visual generation. Second, we present a masked autoregressive Rectified-Flow decoder, employing a masked transformer encoder-decoder to refine visual tokens into contextually enriched embeddings. These embeddings then condition a dedicated velocity field for precise reconstruction. Additionally, an autoregressive rectified-flow sampling strategy is incorporated, ensuring flexible sequence lengths while preserving competitive reconstruction quality. Extensive experiments show that V2Flow outperforms mainstream VQ-based tokenizers and facilitates autoregressive visual generation on top of existing LLMs. this https URL
26. 【2503.07487】LLaVA-RadZ: Can Multimodal Large Language Models Effectively Tackle Zero-shot Radiology Recognition?
Link: https://arxiv.org/abs/2503.07487
Authors: Bangyan Li, Wenxuan Huang, Yunhang Shen, Yeqiang Wang, Shaohui Lin, Jingzhong Lin, Ling You, Yinqi Zhang, Ke Li, Xing Sun, Yuling Sun
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: demonstrated exceptional capabilities, multimodal large models, vision-language tasks, zero-shot medical disease, demonstrated exceptional
Comments:
Abstract:Recently, multimodal large language models (MLLMs) have demonstrated exceptional capabilities in visual understanding and reasoning across various vision-language tasks. However, MLLMs usually perform poorly in zero-shot medical disease recognition, as they do not fully exploit the captured features and available medical knowledge. To address this challenge, we propose LLaVA-RadZ, a simple yet effective framework for zero-shot medical disease recognition. Specifically, we design an end-to-end training strategy, termed Decoding-Side Feature Alignment Training (DFAT), to take advantage of the characteristics of the MLLM decoder architecture and incorporate modality-specific tokens tailored for different modalities, which effectively utilizes image and text representations and facilitates robust cross-modal alignment. Additionally, we introduce a Domain Knowledge Anchoring Module (DKAM) to exploit the intrinsic medical knowledge of large models, which mitigates the category semantic gap in image-text alignment. DKAM improves category-level alignment, allowing for accurate disease recognition. Extensive experiments on multiple benchmarks demonstrate that our LLaVA-RadZ significantly outperforms traditional MLLMs in zero-shot disease recognition and exhibits state-of-the-art performance compared to the well-established and highly optimized CLIP-based approaches.
27. 【2503.07485】Chameleon: Fast-slow Neuro-symbolic Lane Topology Extraction
Link: https://arxiv.org/abs/2503.07485
Authors: Zongzheng Zhang, Xinrun Li, Sizhe Zou, Guoxuan Chi, Siqi Li, Xuchong Qiu, Guoliang Wang, Guantian Zheng, Leichen Wang, Hang Zhao, Hao Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: mapless autonomous driving, involves detecting lanes, key perception task, extraction involves detecting, autonomous driving
Comments: ICRA 2025, Project Page: [this https URL](https://github.com/XR-Lee/neural-symbolic)
Abstract:Lane topology extraction involves detecting lanes and traffic elements and determining their relationships, a key perception task for mapless autonomous driving. This task requires complex reasoning, such as determining whether it is possible to turn left into a specific lane. To address this challenge, we introduce neuro-symbolic methods powered by vision-language foundation models (VLMs). Existing approaches have notable limitations: (1) Dense visual prompting with VLMs can achieve strong performance but is costly in terms of both financial resources and carbon footprint, making it impractical for robotics applications. (2) Neuro-symbolic reasoning methods for 3D scene understanding fail to integrate visual inputs when synthesizing programs, making them ineffective in handling complex corner cases. To this end, we propose a fast-slow neuro-symbolic lane topology extraction algorithm, named Chameleon, which alternates between a fast system that directly reasons over detected instances using synthesized programs and a slow system that utilizes a VLM with a chain-of-thought design to handle corner cases. Chameleon leverages the strengths of both approaches, providing an affordable solution while maintaining high performance. We evaluate the method on the OpenLane-V2 dataset, showing consistent improvements across various baseline detectors. Our code, data, and models are publicly available at this https URL
28. 【2503.07478】VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models
Link: https://arxiv.org/abs/2503.07478
Authors: Jiacheng Ruan, Wenzhen Yuan, Xian Gao, Ye Guo, Daoxin Zhang, Zhe Xu, Yao Hu, Ting Liu, Yuzhuo Fu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: demonstrated strong performance, occasionally arise due, reasoning process, large visual-language models, demonstrated strong
Comments: 12 pages, 4 figures. This work is in progress
Abstract:Although large visual-language models (LVLMs) have demonstrated strong performance in multimodal tasks, errors may occasionally arise due to biases during the reasoning process. Recently, reward models (RMs) have become increasingly pivotal in the reasoning process. Specifically, process RMs evaluate each reasoning step, outcome RMs focus on the assessment of reasoning results, and critique RMs perform error analysis on the entire reasoning process, followed by corrections. However, existing benchmarks for vision-language RMs (VLRMs) typically assess only a single aspect of their capabilities (e.g., distinguishing between two answers), thus limiting the all-round evaluation and restricting the development of RMs in the visual-language domain. To address this gap, we propose a comprehensive and challenging benchmark, dubbed VLRMBench, encompassing 12,634 questions. VLRMBench is constructed based on three distinct types of datasets, covering mathematical reasoning, hallucination understanding, and multi-image understanding. We design 12 tasks across three major categories, focusing on evaluating VLRMs in the aspects of process understanding, outcome judgment, and critique generation. Extensive experiments are conducted on 21 open-source models and 5 advanced closed-source models, highlighting the challenges posed by VLRMBench. For instance, in 'Forecasting Future', a binary classification task, the advanced GPT-4o achieves only a 76.0% accuracy. Additionally, we perform comprehensive analytical studies, offering valuable insights for the future development of VLRMs. We anticipate that VLRMBench will serve as a pivotal benchmark in advancing VLRMs. Code and datasets will be available at this https URL.
29. 【2503.07476】SOGS: Second-Order Anchor for Advanced 3D Gaussian Splatting
Link: https://arxiv.org/abs/2503.07476
Authors: Jiahui Zhang, Fangneng Zhan, Ling Shao, Shijian Lu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: reduced Gaussian redundancy, Gaussian splatting, Gaussian redundancy, rendering quality, Gaussian attribute prediction
Comments: Accepted by CVPR 2025
Abstract:Anchor-based 3D Gaussian splatting (3D-GS) exploits anchor features in 3D Gaussian prediction, which has achieved impressive 3D rendering quality with reduced Gaussian redundancy. On the other hand, it often encounters the dilemma among anchor features, model size, and rendering quality - large anchor features lead to large 3D models and high-quality rendering whereas reducing anchor features degrades Gaussian attribute prediction which leads to clear artifacts in the rendered textures and geometries. We design SOGS, an anchor-based 3D-GS technique that introduces second-order anchors to achieve superior rendering quality and reduced anchor features and model size simultaneously. Specifically, SOGS incorporates covariance-based second-order statistics and correlation across feature dimensions to augment features within each anchor, compensating for the reduced feature size and improving rendering quality effectively. In addition, it introduces a selective gradient loss to enhance the optimization of scene textures and scene geometries, leading to high-quality rendering with small anchor features. Extensive experiments over multiple widely adopted benchmarks show that SOGS achieves superior rendering quality in novel view synthesis with clearly reduced model size.
30. 【2503.07472】A Review on Geometry and Surface Inspection in 3D Concrete Printing
Link: https://arxiv.org/abs/2503.07472
Authors: K. Mawas, M. Maboudi, M. Gerke
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Keywords: conventionally manufactured parts, manufacturing in construction, manufactured parts, substantial growth, additive manufacturing
Comments:
Abstract:Given the substantial growth in the use of additive manufacturing in construction (AMC), it is necessary to ensure the quality of printed specimens which can be much more complex than conventionally manufactured parts. This study explores the various aspects of geometry and surface quality control for 3D concrete printing (3DCP), with a particular emphasis on deposition-based methods, namely extrusion and shotcrete 3D printing (SC3DP). A comprehensive overview of existing quality control (QC) methods and strategies is provided and preceded by an in-depth discussion. Four categories of data capture technologies are investigated and their advantages and limitations in the context of AMC are discussed. Additionally, the effects of environmental conditions and objects' properties on data capture are also analyzed. The study extends to automated data capture planning methods for different sensors. Furthermore, various quality control strategies are explored across different stages of the fabrication cycle of the printed object including: (i) During printing, (ii) Layer-wise, (iii) Preassembly, and (iv) Assembly. In addition to reviewing the methods already applied in AMC, we also address various research gaps and future trends and highlight potential methodologies from adjacent domains that could be transferred to AMC.
31. 【2503.07465】YOLOE: Real-Time Seeing Anything
Link: https://arxiv.org/abs/2503.07465
Authors: Ao Wang, Lihao Liu, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: computer vision applications, YOLO series, vision applications, predefined categories, hindering adaptability
Comments: 15 pages, 9 figures
Abstract:Object detection and segmentation are widely employed in computer vision applications, yet conventional models like the YOLO series, while efficient and accurate, are limited by predefined categories, hindering adaptability in open scenarios. Recent open-set methods leverage text prompts, visual cues, or a prompt-free paradigm to overcome this, but often compromise between performance and efficiency due to high computational demands or deployment complexity. In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single highly efficient model, achieving real-time seeing anything. For text prompts, we propose the Re-parameterizable Region-Text Alignment (RepRTA) strategy. It refines pretrained textual embeddings via a re-parameterizable lightweight auxiliary network and enhances visual-textual alignment with zero inference and transferring overhead. For visual prompts, we present the Semantic-Activated Visual Prompt Encoder (SAVPE). It employs decoupled semantic and activation branches to bring improved visual embedding and accuracy with minimal complexity. For the prompt-free scenario, we introduce the Lazy Region-Prompt Contrast (LRPC) strategy. It utilizes a built-in large vocabulary and specialized embedding to identify all objects, avoiding costly language model dependency. Extensive experiments show YOLOE's exceptional zero-shot performance and transferability with high inference efficiency and low training cost. Notably, on LVIS, with 3× less training cost and 1.4× inference speedup, YOLOE-v8-S surpasses YOLO-Worldv2-S by 3.5 AP. When transferring to COCO, YOLOE-v8-L achieves 0.6 AP^b and 0.4 AP^m gains over closed-set YOLOv8-L with nearly 4× less training time. Code and models are available at this https URL.
32. 【2503.07456】Anatomy-Aware Conditional Image-Text Retrieval
Link: https://arxiv.org/abs/2503.07456
Authors: Meng Zheng, Jiajin Zhang, Benjamin Planche, Zhongpai Gao, Terrence Chen, Ziyan Wu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: finds broad applications, automatically retrieving relevant, retrieving relevant patient, efficient clinical diagnosis, finds broad
Comments: 16 pages, 10 figures
Abstract:Image-Text Retrieval (ITR) finds broad applications in healthcare, aiding clinicians and radiologists by automatically retrieving relevant patient cases in the database given the query image and/or report, for more efficient clinical diagnosis and treatment, especially for rare diseases. However, conventional ITR systems typically rely only on global image or text representations for measuring patient image/report similarities, which overlook local distinctiveness across patient cases. This often results in suboptimal retrieval performance. In this paper, we propose an Anatomical Location-Conditioned Image-Text Retrieval (ALC-ITR) framework, which, given a query image and the associated suspicious anatomical region(s), aims to retrieve similar patient cases exhibiting the same disease or symptoms in the same anatomical region. To perform location-conditioned multimodal retrieval, we learn a medical Relevance-Region-Aligned Vision Language (RRA-VL) model with semantic global-level and region-/word-level alignment to produce generalizable, well-aligned multi-modal representations. Additionally, we perform location-conditioned contrastive learning to further utilize cross-pair region-level contrastiveness for improved multi-modal retrieval. We show that our proposed RRA-VL achieves state-of-the-art localization performance in phrase-grounding tasks, and satisfactory multi-modal retrieval performance with or without location conditioning. Finally, we thoroughly investigate the generalizability and explainability of our proposed ALC-ITR system in providing explanations and preliminary diagnosis reports given retrieved patient cases (conditioned on anatomical regions), with proper off-the-shelf LLM prompts.
33. 【2503.07446】EigenGS Representation: From Eigenspace to Gaussian Image Space
Link: https://arxiv.org/abs/2503.07446
Authors: Lo-Wei Tai, Ching-En Li, Cheng-Lin Chen, Chih-Jung Tsai, Hwann-Tzong Chen, Tyng-Luh Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Principal Component Analysis, Principal Component, Component Analysis, offer distinct approaches, dimensionality reduction technique
Comments:
Abstract:Principal Component Analysis (PCA), a classical dimensionality reduction technique, and 2D Gaussian representation, an adaptation of 3D Gaussian Splatting for image representation, offer distinct approaches to modeling visual data. We present EigenGS, a novel method that bridges these paradigms through an efficient transformation pipeline connecting eigenspace and image-space Gaussian representations. Our approach enables instant initialization of Gaussian parameters for new images without requiring per-image optimization from scratch, dramatically accelerating convergence. EigenGS introduces a frequency-aware learning mechanism that encourages Gaussians to adapt to different scales, effectively modeling varied spatial frequencies and preventing artifacts in high-resolution reconstruction. Extensive experiments demonstrate that EigenGS not only achieves superior reconstruction quality compared to direct 2D Gaussian fitting but also reduces necessary parameter count and training time. The results highlight EigenGS's effectiveness and generalization ability across images with varying resolutions and diverse categories, making Gaussian-based image representation both high-quality and viable for real-time applications.
34. 【2503.07444】Divide and Conquer Self-Supervised Learning for High-Content Imaging
Link: https://arxiv.org/abs/2503.07444
Authors: Lucas Farndale, Paul Henderson, Edward W Roberts, Ke Yuan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Keywords: complex features, Component Embedding Registration, Split Component Embedding, complex, features
Comments:
Abstract:Self-supervised representation learning methods often fail to learn subtle or complex features, which can be dominated by simpler patterns which are much easier to learn. This limitation is particularly problematic in applications to science and engineering, as complex features can be critical for discovery and analysis. To address this, we introduce Split Component Embedding Registration (SpliCER), a novel architecture which splits the image into sections and distils information from each section to guide the model to learn more subtle and complex features without compromising on simpler features. SpliCER is compatible with any self-supervised loss function and can be integrated into existing methods without modification. The primary contributions of this work are as follows: i) we demonstrate that existing self-supervised methods can learn shortcut solutions when simple and complex features are both present; ii) we introduce a novel self-supervised training method, SpliCER, to overcome the limitations of existing methods, and achieve significant downstream performance improvements; iii) we demonstrate the effectiveness of SpliCER in cutting-edge medical and geospatial imaging settings. SpliCER offers a powerful new tool for representation learning, enabling models to uncover complex features which could be overlooked by other methods.
35. 【2503.07435】Open-Set Gait Recognition from Sparse mmWave Radar Point Clouds
Link: https://arxiv.org/abs/2503.07435
Authors: Riccardo Mazzieri, Jacopo Pegoraro, Michele Rossi
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Keywords: recently gathered significant, gathered significant attention, significant attention due, Open-set Gait Recognition, gait recognition
Comments:
Abstract:The adoption of Millimeter-Wave (mmWave) radar devices for human sensing, particularly gait recognition, has recently gathered significant attention due to their efficiency, resilience to environmental conditions, and privacy-preserving nature. In this work, we tackle the challenging problem of Open-set Gait Recognition (OSGR) from sparse mmWave radar point clouds. Unlike most existing research, which assumes a closed-set scenario, our work considers the more realistic open-set case, where unknown subjects might be present at inference time, and should be correctly recognized by the system. Point clouds are well-suited for edge computing applications with resource constraints, but are more significantly affected by noise and random fluctuations than other representations, like the more common micro-Doppler signature. This is the first work addressing open-set gait recognition with sparse point cloud data. To do so, we propose a novel neural network architecture that combines supervised classification with unsupervised reconstruction of the point clouds, creating a robust, rich, and highly regularized latent space of gait features. To detect unknown subjects at inference time, we introduce a probabilistic novelty detection algorithm that leverages the structured latent space and offers a tunable trade-off between inference speed and prediction accuracy. Along with this paper, we release mmGait10, an original human gait dataset featuring over five hours of measurements from ten subjects, under varied walking modalities. Extensive experimental results show that our solution attains F1-Score improvements by 24% over state-of-the-art methods, on average, and across multiple openness levels.
36. 【2503.07425】CATPlan: Loss-based Collision Prediction in End-to-End Autonomous Driving
Link: https://arxiv.org/abs/2503.07425
Authors: Ziliang Xiong, Shipeng Liu, Nathaniel Helgesen, Joakim Johnander, Per-Erik Forssen
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: autonomous driving, recent years, uncertainty, systems, uncertainty quantification
Comments:
Abstract:In recent years, there has been increased interest in the design, training, and evaluation of end-to-end autonomous driving (AD) systems. One often overlooked aspect is the uncertainty of planned trajectories predicted by these systems, despite awareness of their own uncertainty being key to achieving safety and robustness. We propose to estimate this uncertainty by adapting loss prediction from the uncertainty quantification literature. To this end, we introduce a novel light-weight module, dubbed CATPlan, that is trained to decode motion and planning embeddings into estimates of the collision loss used to partially supervise end-to-end AD systems. During inference, these estimates are interpreted as collision risk. We evaluate CATPlan on the safety-critical, NeRF-based, closed-loop benchmark NeuroNCAP and find that it manages to detect collisions with a 54.8% relative improvement in average precision over a GMM-based baseline in which the predicted trajectory is compared to the forecasted trajectories of other road users. Our findings indicate that the addition of CATPlan can lead to safer end-to-end AD systems, and we hope that our work will spark increased interest in uncertainty quantification for such systems.
37. 【2503.07419】Analysis of 3D Urticaceae Pollen Classification Using Deep Learning Models
Link: https://arxiv.org/abs/2503.07419
Authors: Tijs Konijn, Imaan Bijl, Lu Cao, Fons Verbeek
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: pressing healthcare problem, climate change, hay fever, affected population, prolonged period
Comments:
Abstract:Due to climate change, hay fever is becoming a pressing healthcare problem, with a growing affected population, a longer period during which people are affected, and more severe symptoms. Precise pollen classification could help monitor trends in allergenic pollen in the air throughout the year and guide preventive strategies launched by municipalities. Most pollen classification work uses 2D microscopy images or 2D projections derived from 3D image datasets. In this paper, we aim to use the whole stack of 3D images for classification and to evaluate classification performance with different deep learning models. The 3D image dataset used in this paper is from the Urticaceae family, particularly the genera Urtica and Parietaria, which are morphologically similar yet differ significantly in allergenic potential. The pre-trained ResNet3D model, using optimal layer selection and extended epochs, achieved the best performance with an F1-score of 98.3%.
38. 【2503.07418】AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion
Link: https://arxiv.org/abs/2503.07418
Authors: Mingzhen Sun, Weining Wang, Gen Li, Jiawei Liu, Jiahui Sun, Wanquan Feng, Shanshan Lao, SiYu Zhou, Qian He, Jing Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: requires synthesizing visually, synthesizing visually realistic, temporally coherent video, generation requires synthesizing, requires synthesizing
Comments: Accepted by CVPR 2025
Abstract:The task of video generation requires synthesizing visually realistic and temporally coherent video frames. Existing methods primarily use asynchronous auto-regressive models or synchronous diffusion models to address this challenge. However, asynchronous auto-regressive models often suffer from inconsistencies between training and inference, leading to issues such as error accumulation, while synchronous diffusion models are limited by their reliance on rigid sequence length. To address these issues, we introduce Auto-Regressive Diffusion (AR-Diffusion), a novel model that combines the strengths of auto-regressive and diffusion models for flexible, asynchronous video generation. Specifically, our approach leverages diffusion to gradually corrupt video frames in both training and inference, reducing the discrepancy between these phases. Inspired by auto-regressive generation, we incorporate a non-decreasing constraint on the corruption timesteps of individual frames, ensuring that earlier frames remain clearer than subsequent ones. This setup, together with temporal causal attention, enables flexible generation of videos with varying lengths while preserving temporal coherence. In addition, we design two specialized timestep schedulers: the FoPP scheduler for balanced timestep sampling during training, and the AD scheduler for flexible timestep differences during inference, supporting both synchronous and asynchronous generation. Extensive experiments demonstrate the superiority of our proposed method, which achieves competitive and state-of-the-art results across four challenging benchmarks.
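The non-decreasing constraint on per-frame corruption timesteps can be illustrated with a small sketch. Sorting independent random draws, as done here, is only a simple stand-in chosen for illustration; it is not the FoPP or AD scheduler described in the abstract, and the function names and the range of 0 to 999 are assumptions.

```python
import random


def sample_non_decreasing_timesteps(num_frames, max_t, rng):
    """Draw one corruption timestep per frame, sorted so that t[i] <= t[i + 1].

    Earlier frames therefore never carry more noise than later ones.
    """
    return sorted(rng.randint(0, max_t) for _ in range(num_frames))


def is_non_decreasing(timesteps):
    """Check the constraint: each frame's timestep is >= the previous frame's."""
    return all(a <= b for a, b in zip(timesteps, timesteps[1:]))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
timesteps = sample_non_decreasing_timesteps(num_frames=8, max_t=999, rng=rng)

# A synchronous schedule (every frame at the same timestep) is the special
# case in which all timestep differences are zero; it also satisfies the check.
synchronous = [500] * 8

print(timesteps, is_non_decreasing(timesteps), is_non_decreasing(synchronous))
```

Because equal timesteps are allowed, the same constraint covers both the synchronous and the asynchronous settings mentioned in the abstract.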
39. 【2503.07417】GM-MoE: Low-Light Enhancement with Gated-Mechanism Mixture-of-Experts
Link: https://arxiv.org/abs/2503.07417
Authors: Minwen Liao, Hao Bo Dong, Xinyi Wang, Ziyang Yan, Yihua Shao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: improve information utilization, significantly improve information, remote sensing, autonomous driving, information utilization
Comments:
Abstract:Low-light enhancement has wide applications in autonomous driving, 3D reconstruction, remote sensing, surveillance, and so on, where it can significantly improve information utilization. However, most existing methods lack generalization and are limited to specific tasks such as image recovery. To address these issues, we propose Gated-Mechanism Mixture-of-Experts (GM-MoE), the first framework to introduce a mixture-of-experts network for low-light image enhancement. GM-MoE comprises a dynamic gated weight conditioning network and three sub-expert networks, each specializing in a distinct enhancement task. A self-designed gated mechanism dynamically adjusts the weights of the sub-expert networks for different data domains. Additionally, we integrate local and global feature fusion within sub-expert networks to enhance image quality by capturing multi-scale features. Experimental results demonstrate that GM-MoE achieves superior generalization with respect to 25 compared approaches, reaching state-of-the-art performance in PSNR on 5 benchmarks and in SSIM on 4 benchmarks.
40. 【2503.07416】TimeStep Master: Asymmetrical Mixture of Timestep LoRA Experts for Versatile and Efficient Diffusion Models in Vision
Link: https://arxiv.org/abs/2503.07416
Authors: Shaobin Zhuang, Yiwei Guo, Yanbo Ding, Kunchang Li, Xinyuan Chen, Yaohui Wang, Fangyikang Wang, Ying Zhang, Chen Li, Yali Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Diffusion models, Diffusion, past years, TimeStep LoRA experts, driven the advancement
Comments: 17 pages, 5 figures, 13 tables
Abstract:Diffusion models have driven the advancement of vision generation over the past years. However, it is often difficult to apply these large models in downstream tasks, due to the massive fine-tuning cost. Recently, Low-Rank Adaptation (LoRA) has been applied for efficient tuning of diffusion models. Unfortunately, the capabilities of LoRA-tuned diffusion models are limited, since the same LoRA is used for different timesteps of the diffusion process. To tackle this problem, we introduce a general and concise TimeStep Master (TSM) paradigm with two key fine-tuning stages. In the fostering stage (1-stage), we apply different LoRAs to fine-tune the diffusion model at different timestep intervals. This results in different TimeStep LoRA experts that can effectively capture different noise levels. In the assembling stage (2-stage), we design a novel asymmetrical mixture of TimeStep LoRA experts, via core-context collaboration of experts at multi-scale intervals. For each timestep, we leverage the TimeStep LoRA expert within the smallest interval as the core expert without gating, and use experts within the bigger intervals as the context experts with time-dependent gating. Consequently, our TSM can effectively model the noise level via the expert in the finest interval, and adaptively integrate contexts from the experts of other scales, boosting the versatility of diffusion models. To show the effectiveness of our TSM paradigm, we conduct extensive experiments on three typical and popular LoRA-related tasks of diffusion models, including domain adaptation, post-pretraining, and model distillation. Our TSM achieves state-of-the-art results on all these tasks, across various model structures (UNet, DiT and MM-DiT) and visual data modalities (Image, Video), showing its remarkable generalization capacity.
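The core/context split described above can be sketched as a simple lookup: for a given timestep, the expert covering the smallest interval is the core expert, and the experts covering the larger intervals are context experts. This sketch assumes equal-width intervals, 1000 diffusion steps, and interval counts of 1, 2, and 4; none of these values come from the abstract, the names are hypothetical, and the time-dependent gating of the context experts is omitted entirely.

```python
def expert_index(t, num_intervals, total_steps):
    """Index of the equal-width interval (out of num_intervals) containing t."""
    if not 0 <= t < total_steps:
        raise ValueError("timestep out of range")
    return t * num_intervals // total_steps


def route(t, interval_counts, total_steps=1000):
    """Return (core, context) experts for timestep t.

    Each expert is identified as (num_intervals, interval_index). The core
    expert comes from the finest scale (most intervals, so the smallest
    interval); context experts come from every coarser scale.
    """
    finest = max(interval_counts)
    core = (finest, expert_index(t, finest, total_steps))
    context = [
        (n, expert_index(t, n, total_steps))
        for n in sorted(interval_counts)
        if n != finest
    ]
    return core, context


# Hypothetical setup: experts trained on 1, 2, and 4 equal intervals.
core, context = route(t=620, interval_counts=[1, 2, 4])
print("core:", core, "context:", context)
```

In a full implementation the context experts' outputs would be weighted by a gate that depends on the timestep, while the core expert would be applied without gating, as the abstract states.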
41. 【2503.07413】REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding
Link: https://arxiv.org/abs/2503.07413
Authors: Yan Tai, Luhao Zhu, Zhiqiang Chen, Ynan Ding, Yiying Dong, Xiaohong Liu, Guodong Guo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Language Models, Multimodal Large Language, Large Language, robust zero-shot capabilities, diverse vision-language tasks
Comments:
Abstract:Multimodal Large Language Models (MLLMs) demonstrate robust zero-shot capabilities across diverse vision-language tasks after training on mega-scale datasets. However, dense prediction tasks, such as semantic segmentation and keypoint detection, pose significant challenges for MLLMs when represented solely as text outputs. Simultaneously, current MLLMs utilizing latent embeddings for visual task decoding generally demonstrate limited adaptability to both multi-task learning and multi-granularity scenarios. In this work, we present REF-VLM, an end-to-end framework for unified training of various visual decoding tasks. To address complex visual decoding scenarios, we introduce the Triplet-Based Referring Paradigm (TRP), which explicitly decouples three critical dimensions in visual decoding tasks through a triplet structure: concepts, decoding types, and targets. TRP employs symbolic delimiters to enforce structured representation learning, enhancing the parsability and interpretability of model outputs. Additionally, we construct the Visual-Task Instruction Following Dataset (VT-Instruct), a large-scale multi-task dataset containing over 100 million multimodal dialogue samples across 25 task types. Beyond text inputs and outputs, VT-Instruct incorporates various visual prompts such as point, box, scribble, and mask, and generates outputs composed of text and visual units like box, keypoint, depth and mask. The combination of different visual prompts and visual units generates a wide variety of task types, expanding the applicability of REF-VLM significantly. Both qualitative and quantitative experiments demonstrate that our REF-VLM outperforms other MLLMs across a variety of standard benchmarks. The code, dataset, and demo are available at this https URL.
42. 【2503.07399】Keeping Representation Similarity in Finetuning for Medical Image Analysis
Link: https://arxiv.org/abs/2503.07399
Authors: Wenqiang Zu, Shenghao Xie, Hao Chen, Yiming Liang, Lei Ma
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: large-scale natural images, large-scale natural, Foundation models pretrained, medical image analysis, foundation model original
Comments: 12 pages, 6 figures
Abstract:Foundation models pretrained on large-scale natural images have been widely used to adapt to medical image analysis through finetuning. This is largely attributed to pretrained representations capturing universal, robust, and generalizable features, which can be reutilized by downstream tasks. However, these representations are later found to gradually vanish during finetuning, accompanied by a degradation of the foundation model's original abilities, e.g., generalizability. In this paper, we argue that pretrained representations can be well preserved while still effectively adapting to downstream tasks. We study this by proposing a new finetuning method, RepSim, which minimizes the distance between pretrained and finetuned representations by constraining a learnable orthogonal manifold based on similarity invariance. Compared to standard finetuning methods, e.g., full finetuning, our method improves representation similarity by over 30% while maintaining competitive accuracy, and reduces sharpness by 42% across five medical image classification datasets. The code will be released.
43. 【2503.07396】Brain Inspired Adaptive Memory Dual-Net for Few-Shot Image Classification
Link: https://arxiv.org/abs/2503.07396
Authors: Kexin Di, Xiuxing Li, Yuyang Han, Ziyu Li, Qing Li, Xia Wu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: popular research topic, supervision collapse induced, single image-level annotation, image-level annotation remains, Few-shot image classification
Comments:
Abstract:Few-shot image classification has become a popular research topic for its wide application in real-world scenarios, but the problem of supervision collapse induced by single image-level annotation remains a major challenge. Existing methods aim to tackle this problem by locating and aligning relevant local features. However, the high intra-class variability in real-world images poses significant challenges in locating semantically relevant local regions under few-shot settings. Drawing inspiration from the human complementary learning system, which excels at rapidly capturing and integrating semantic features from limited examples, we propose the generalization-optimized Systems Consolidation Adaptive Memory Dual-Network, SCAM-Net. This approach simulates the systems consolidation of the complementary learning system with an adaptive memory module, which successfully addresses the difficulty of identifying meaningful features in few-shot scenarios. Specifically, we construct a Hippocampus-Neocortex dual-network that consolidates a structured representation of each category; this representation is then stored and adaptively regulated, following the generalization optimization principle, in a long-term memory inside the Neocortex. Extensive experiments on benchmark datasets show that the proposed model achieves state-of-the-art performance.
44. 【2503.07392】SPEED: Scalable, Precise, and Efficient Concept Erasure for Diffusion Models
Link: https://arxiv.org/abs/2503.07392
Authors: Ouxiang Li, Yuan Wang, Xinting Hu, Houcheng Jiang, Tao Liang, Yanbin Hao, Guojun Ma, Fuli Feng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: increasingly crucial due, offensive content, copyright infringement, privacy violations, increasingly crucial
Comments:
Abstract:Erasing concepts from large-scale text-to-image (T2I) diffusion models has become increasingly crucial due to the growing concerns over copyright infringement, offensive content, and privacy violations. However, existing methods either require costly fine-tuning or degrade image quality for non-target concepts (i.e., prior) due to inherent optimization limitations. In this paper, we introduce SPEED, a model editing-based concept erasure approach that leverages null-space constraints for scalable, precise, and efficient erasure. Specifically, SPEED incorporates Influence-based Prior Filtering (IPF) to retain the most affected non-target concepts during erasing, Directed Prior Augmentation (DPA) to expand prior coverage while maintaining semantic consistency, and Invariant Equality Constraints (IEC) to regularize model editing by explicitly preserving key invariants during the T2I generation process. Extensive evaluations across multiple concept erasure tasks demonstrate that SPEED consistently outperforms existing methods in prior preservation while achieving efficient and high-fidelity concept erasure, successfully removing 100 concepts within just 5 seconds. Our code and models are available at: this https URL.
45. 【2503.07390】PersonaBooth: Personalized Text-to-Motion Generation
Link: https://arxiv.org/abs/2503.07390
Authors: Boeun Kim, Hea In Jeong, JungHoon Sung, Yihua Cheng, Jeongmin Lee, Ju Yong Chang, Sang-Il Choi, Younggeun Choi, Saim Shin, Jungho Kim, Hyung Jin Chang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: generates personalized motions, personalized motions aligned, paper introduces Motion, generates personalized, introduces Motion Personalization
Comments:
Abstract:This paper introduces Motion Personalization, a new task that generates personalized motions aligned with text descriptions using several basic motions containing Persona. To support this novel task, we introduce a new large-scale motion dataset called PerMo (PersonaMotion), which captures the unique personas of multiple actors. We also propose a multi-modal finetuning method of a pretrained motion diffusion model called PersonaBooth. PersonaBooth addresses two main challenges: i) a significant distribution gap between the persona-focused PerMo dataset and the pretraining datasets, which lack persona-specific data, and ii) the difficulty of capturing a consistent persona from motions that vary in content (action type). To tackle the dataset distribution gap, we introduce a persona token to accept new persona features and perform multi-modal adaptation for both text and visuals during finetuning. To capture a consistent persona, we incorporate a contrastive learning technique to enhance intra-cohesion among samples with the same persona. Furthermore, we introduce a context-aware fusion mechanism to maximize the integration of persona cues from multiple input motions. PersonaBooth outperforms state-of-the-art motion style transfer methods, establishing a new benchmark for motion personalization.
46. 【2503.07389】TRCE: Towards Reliable Malicious Concept Erasure in Text-to-Image Diffusion Models
Link: https://arxiv.org/abs/2503.07389
Authors: Ruidong Chen, Honglin Guo, Lanjun Wang, Chenyu Zhang, Weizhi Nie, An-An Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: enable photorealistic image, NSFW images, Recent advances, photorealistic image generation, models enable photorealistic
Comments:
Abstract:Recent advances in text-to-image diffusion models enable photorealistic image generation, but they also risk producing malicious content, such as NSFW images. To mitigate this risk, concept erasure methods are studied to make the model unlearn specific concepts. However, current studies struggle to fully erase malicious concepts implicitly embedded in prompts (e.g., metaphorical expressions or adversarial prompts) while preserving the model's normal generation capability. To address this challenge, our study proposes TRCE, using a two-stage concept erasure strategy to achieve an effective trade-off between reliable erasure and knowledge preservation. Firstly, TRCE starts by erasing the malicious semantics implicitly embedded in textual prompts. By identifying a critical mapping objective (i.e., the [EoT] embedding), we optimize the cross-attention layers to map malicious prompts to contextually similar prompts but with safe concepts. This step prevents the model from being overly influenced by malicious semantics during the denoising process. Following this, considering the deterministic properties of the sampling trajectory of the diffusion model, TRCE further steers the early denoising prediction toward the safe direction and away from the unsafe one through contrastive learning, thus further avoiding the generation of malicious content. Finally, we conduct comprehensive evaluations of TRCE on multiple malicious concept erasure benchmarks, and the results demonstrate its effectiveness in erasing malicious concepts while better preserving the model's original generation ability. The code is available at: this http URL. CAUTION: This paper includes model-generated content that may contain offensive material.
47. 【2503.07375】Probabilistic Segmentation for Robust Field of View Estimation
Link: https://arxiv.org/abs/2503.07375
Authors: R. Spencer Hallyburton, David Hunt, Yiwei He, Judy He, Miroslav Pajic
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: autonomous vehicles, perception threaten, threaten the safe, FOV
Comments:
Abstract:Attacks on sensing and perception threaten the safe deployment of autonomous vehicles (AVs). Security-aware sensor fusion helps mitigate threats but requires accurate field of view (FOV) estimation, which has not been evaluated in autonomy settings. To address this gap, we adapt classical computer graphics algorithms to develop the first autonomy-relevant FOV estimators and create the first datasets with ground truth FOV labels. Unfortunately, we find that these approaches are themselves highly vulnerable to attacks on sensing. To improve robustness of FOV estimation against attacks, we propose a learning-based segmentation model that captures FOV features, integrates Monte Carlo dropout (MCD) for uncertainty quantification, and performs anomaly detection on confidence maps. Comprehensive evaluations demonstrate attack resistance and strong generalization across environments. Architecture trade studies demonstrate that the model is feasible for real-time deployment in multiple applications.
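Monte Carlo dropout, as named in this abstract, keeps dropout active at inference time and treats the spread of repeated stochastic predictions as an uncertainty estimate. The sketch below shows that idea on a toy logistic unit for a single pixel. The model, the dropout rate, the number of passes, and the anomaly threshold are all hypothetical and are not taken from the paper.

```python
import math
import random
import statistics


def stochastic_forward(features, weights, drop_p, rng):
    """One forward pass of a toy logistic unit with dropout left switched on."""
    keep_scale = 1.0 / (1.0 - drop_p)  # inverted-dropout rescaling
    logit = sum(
        w * f * keep_scale
        for w, f in zip(weights, features)
        if rng.random() >= drop_p  # each input is dropped with probability drop_p
    )
    return 1.0 / (1.0 + math.exp(-logit))


def mc_dropout(features, weights, passes=100, drop_p=0.3, seed=0):
    """Mean prediction and its standard deviation over repeated stochastic passes."""
    rng = random.Random(seed)
    preds = [stochastic_forward(features, weights, drop_p, rng) for _ in range(passes)]
    return statistics.fmean(preds), statistics.pstdev(preds)


def is_anomalous(std, threshold=0.15):
    """Flag a pixel whose prediction spread exceeds a (hypothetical) threshold."""
    return std > threshold
```

Applied per pixel, the mean gives the segmentation confidence and the standard deviation gives a map on which anomaly detection can run.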
48. 【2503.07371】HGO-YOLO: Advancing Anomaly Behavior Detection with Hierarchical Features and Lightweight Optimized Detection
Link: https://arxiv.org/abs/2503.07371
Authors: Qizhi Zheng, Zhongze Luo, Meiyan Guo, Xinzhu Wang, Renqimuge Wu, Qiu Meng, Guanghui Dong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: hardware limitations, scenarios constrained, constrained by hardware, speed is essential, essential for enhancing
Comments: 10 pages
Abstract:Accurate and real-time object detection is crucial for anomaly behavior detection, especially in scenarios constrained by hardware limitations, where balancing accuracy and speed is essential for enhancing detection performance. This study proposes a model called HGO-YOLO, which integrates the HGNetv2 architecture into YOLOv8. This combination expands the receptive field and captures a wider range of features while simplifying model complexity through GhostConv. We introduce a lightweight detection head, OptiConvDetect, which utilizes parameter sharing to construct the detection head effectively. Evaluation results show that the proposed algorithm achieves a mAP@0.5 of 87.4% and a recall rate of 81.1%, with a model size of only 4.6 MB and a frame rate of 56 FPS on the CPU. HGO-YOLO not only improves accuracy by 3.0% but also reduces computational load by 51.69% (from 8.9 GFLOPs to 4.3 GFLOPs), while increasing the frame rate by a factor of 1.7. Additionally, real-time tests were conducted on Raspberry Pi 4 and NVIDIA platforms. These results indicate that the HGO-YOLO model demonstrates superior performance in anomaly behavior detection.
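The reported reduction in computational load can be checked directly from the two GFLOPs figures in the abstract. The implied baseline frame rate below is an inference, not a number stated in the abstract: it assumes the 1.7x speed-up is measured against the same baseline and on the same CPU as the 56 FPS figure.

```python
def relative_reduction(before, after):
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100.0


flops_drop = relative_reduction(8.9, 4.3)
print(f"GFLOPs reduction: {flops_drop:.2f}%")  # prints "GFLOPs reduction: 51.69%"

# Assumption: the 1.7x factor applies to the 56 FPS CPU measurement.
implied_baseline_fps = 56 / 1.7
print(f"Implied baseline: {implied_baseline_fps:.1f} FPS")  # prints "Implied baseline: 32.9 FPS"
```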
49. 【2503.07367】LEGO-Motion: Learning-Enhanced Grids with Occupancy Instance Modeling for Class-Agnostic Motion Prediction
Link: https://arxiv.org/abs/2503.07367
Authors: Kangan Qian, Jinyu Miao, Ziang Luo, Zheng Fu, Jinchen Li, Yining Shi, Yunlong Wang, Kun Jiang, Mengmeng Yang, Diange Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: autonomous driving systems, motion information plays, driving systems, information plays, plays a pivotal
Comments: 8 pages, 4 figures
Abstract:Accurate and reliable spatial and motion information plays a pivotal role in autonomous driving systems. However, object-level perception models struggle with handling open scenario categories and lack precise intrinsic geometry. On the other hand, occupancy-based class-agnostic methods excel in representing scenes but fail to ensure physics consistency and ignore the importance of interactions between traffic participants, hindering the model's ability to learn accurate and reliable motion. In this paper, we introduce a novel occupancy-instance modeling framework for class-agnostic motion prediction tasks, named LEGO-Motion, which incorporates instance features into Bird's Eye View (BEV) space. Our model comprises (1) a BEV encoder, (2) an Interaction-Augmented Instance Encoder, and (3) an Instance-Enhanced BEV Encoder, improving both interaction relationships and physics consistency within the model, thereby ensuring a more accurate and robust understanding of the environment. Extensive experiments on the nuScenes dataset demonstrate that our method achieves state-of-the-art performance, outperforming existing approaches. Furthermore, the effectiveness of our framework is validated on the advanced FMCW LiDAR benchmark, showcasing its practical applicability and generalization capabilities. The code will be made publicly available to facilitate further research.
50. 【2503.07365】MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning
Link: https://arxiv.org/abs/2503.07365
Authors: Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, Ping Luo, Yu Qiao, Qiaosheng Zhang, Wenqi Shao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: successfully extends large-scale, rule-based reinforcement learning, extends large-scale rule-based, large-scale rule-based reinforcement, present MM-Eureka
Comments:
Abstract:We present MM-Eureka, a multimodal reasoning model that successfully extends large-scale rule-based reinforcement learning (RL) to multimodal reasoning. While rule-based RL has shown remarkable success in improving LLMs' reasoning abilities in text domains, its application to multimodal settings has remained challenging. Our work reproduces key characteristics of text-based RL systems like DeepSeek-R1 in the multimodal space, including steady increases in accuracy reward and response length, and the emergence of reflection behaviors. We demonstrate that both instruction-tuned and pre-trained models can develop strong multimodal reasoning capabilities through rule-based RL without supervised fine-tuning, showing superior data efficiency compared to alternative approaches. We open-source our complete pipeline to foster further research in this area. We release all our code, models, data, etc. at this https URL.
51. 【2503.07363】Inversion-Free Video Style Transfer with Trajectory Reset Attention Control and Content-Style Bridging
Link: https://arxiv.org/abs/2503.07363
Authors: Jiang Lin, Zili Yi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Reset Attention Control, Video style transfer, Trajectory Reset Attention, Attention Control, style
Comments:
Abstract:Video style transfer aims to alter the style of a video while preserving its content. Previous methods often struggle with content leakage and style misalignment, particularly when using image-driven approaches that aim to transfer precise styles. In this work, we introduce Trajectory Reset Attention Control (TRAC), a novel method that allows for high-quality style transfer while preserving content integrity. TRAC operates by resetting the denoising trajectory and enforcing attention control, thus enhancing content consistency while significantly reducing computational cost relative to inversion-based methods. Additionally, a concept termed Style Medium is introduced to bridge the gap between content and style, enabling a more precise and harmonious transfer of stylistic elements. Building upon these concepts, we present a tuning-free framework that offers a stable, flexible, and efficient solution for both image and video style transfer. Experimental results demonstrate that our proposed framework accommodates a wide range of stylized outputs, from precise content preservation to the production of visually striking results with vibrant and expressive styles.
52. 【2503.07353】Certifiably Optimal Anisotropic Rotation Averaging
Link: https://arxiv.org/abs/2503.07353
Authors: Carl Olsson, Yaroslava Lochman, Johan Malmport, Christopher Zach
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: vision and robotics, key subproblem, subproblem in applications, applications of computer, computer vision
Comments:
Abstract:Rotation averaging is a key subproblem in applications of computer vision and robotics. Many methods for solving this problem exist, and there are also several theoretical results analyzing difficulty and optimality. However, one aspect that most of these have in common is a focus on the isotropic setting, where the intrinsic uncertainties in the measurements are not fully incorporated into the resulting optimization task. Recent empirical results suggest that moving to an anisotropic framework, where these uncertainties are explicitly included, can result in an improvement of solution quality. However, global optimization for rotation averaging has remained a challenge in this scenario. In this paper we show how anisotropic costs can be incorporated in certifiably optimal rotation averaging. We also demonstrate how existing solvers, designed for isotropic situations, fail in the anisotropic setting. Finally, we propose a stronger relaxation and show empirically that it is able to recover global optima in all tested datasets and leads to a more accurate reconstruction in all but one of the scenes.
53. 【2503.07348】Fully Unsupervised Annotation of C. Elegans
Link: https://arxiv.org/abs/2503.07348
Authors: Christoph Karg, Sebastian Stricker, Lisa Hutschenreiter, Bogdan Savchynskyy, Dagmar Kainmueller
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: unsupervised multi-graph matching, multi-graph matching, determine Gaussian parameters, work we present, applies to problems
Comments:
Abstract:In this work we present a novel approach for unsupervised multi-graph matching, which applies to problems for which a Gaussian distribution of keypoint features can be assumed. We leverage cycle consistency as loss for self-supervised learning, and determine Gaussian parameters through Bayesian Optimization, yielding a highly efficient approach that scales to large datasets. Our fully unsupervised approach enables us to reach the accuracy of state-of-the-art supervised methodology for the use case of annotating cell nuclei in 3D microscopy images of the worm C. elegans. To this end, our approach yields the first unsupervised atlas of C. elegans, i.e. a model of the joint distribution of all of its cell nuclei, without the need for any ground truth cell annotation. This advancement enables highly efficient annotation of cell nuclei in large microscopy datasets of C. elegans. Beyond C. elegans, our approach offers fully unsupervised construction of cell-level atlases for any model organism with a stereotyped cell lineage, and thus bears the potential to catalyze respective comparative developmental studies in a range of further species.
54. 【2503.07347】DaD: Distilled Reinforcement Learning for Diverse Keypoint Detection
Link: https://arxiv.org/abs/2503.07347
Authors: Johan Edstedt, Georg Bökman, Mårten Wadenbäck, Michael Felsberg
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: systems to scale, thousands of images, scale to thousands, keypoint detection, keypoint detection objective
Comments:
Abstract:Keypoints are what enable Structure-from-Motion (SfM) systems to scale to thousands of images. However, designing a keypoint detection objective is a non-trivial task, as SfM is non-differentiable. Typically, an auxiliary objective involving a descriptor is optimized. This, however, induces a dependency on the descriptor, which is undesirable. In this paper we propose a fully self-supervised and descriptor-free objective for keypoint detection, through reinforcement learning. To ensure training does not degenerate, we leverage a balanced top-K sampling strategy. While this already produces competitive models, we find that two qualitatively different types of detectors emerge, which are only able to detect light and dark keypoints respectively. To remedy this, we train a third detector, DaD, that optimizes the Kullback-Leibler divergence of the pointwise maximum of both light and dark detectors. Our approach significantly improves upon SotA across a range of benchmarks. Code and model weights are publicly available at this https URL.
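The distillation target described here, the pointwise maximum of the light and dark detectors, can be sketched on a tiny flattened score map. This is an illustration under stated assumptions rather than the paper's loss: renormalizing the maximum into a distribution, the direction KL(target || student), and the toy score values are all choices made for this example.

```python
import math


def normalize(scores):
    """Scale non-negative scores so they sum to one."""
    total = sum(scores)
    return [s / total for s in scores]


def pointwise_max_target(light, dark):
    """Teacher distribution: elementwise max of the two detectors, renormalized."""
    return normalize([max(a, b) for a, b in zip(light, dark)])


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same locations."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


# Toy keypoint distributions over four pixel locations (hypothetical values).
light = [0.7, 0.1, 0.1, 0.1]  # fires on a light keypoint at location 0
dark = [0.1, 0.1, 0.1, 0.7]   # fires on a dark keypoint at location 3
target = pointwise_max_target(light, dark)
```

A student that matches `target` has zero loss, while either single-polarity detector is penalized for missing the other keypoint type.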
55. 【2503.07346】Now you see me! A framework for obtaining class-relevant saliency maps
Link: https://arxiv.org/abs/2503.07346
Authors: Nils Philipp Walter, Jilles Vreeken, Jonas Fischer
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Neural networks, daily-life decision-making, transparency are key, features neural networks, part of daily-life
Comments:
Abstract:Neural networks are part of daily-life decision-making, including in high-stakes settings where understanding and transparency are key. Saliency maps have been developed to gain understanding into which input features neural networks use for a specific prediction. Although widely employed, these methods often result in overly general saliency maps that fail to identify the specific information that triggered the classification. In this work, we suggest a framework that incorporates attributions across classes to arrive at saliency maps that actually capture the class-relevant information. On established benchmarks for attribution methods, including the grid-pointing game and randomization-based sanity checks, we show that our framework heavily boosts the performance of standard saliency map approaches. It is, by design, agnostic to model architectures and attribution methods, and makes it possible to identify the distinguishing and shared features used for a model prediction.
56. 【2503.07334】Unleashing the Potential of Large Language Models for Text-to-Image Generation through Autoregressive Representation Alignment
Link: https://arxiv.org/abs/2503.07334
Authors: Xing Xie, Jiawei Liu, Ziyue Lin, Huijie Fan, Zhi Han, Yandong Tang, Liangqiong Qu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Autoregressive Representation Alignment, present Autoregressive Representation, unlocks global-coherent, ARRA, Representation Alignment
Comments:
Abstract:We present Autoregressive Representation Alignment (ARRA), a new training framework that unlocks globally coherent text-to-image generation in autoregressive LLMs without architectural changes. Unlike prior work that requires complex architectural redesigns, ARRA aligns LLM hidden states with visual representations from external visual foundational models via a global visual alignment loss and a hybrid token, HYBNEXT. This token enforces dual constraints: local next-token prediction and global semantic distillation, enabling LLMs to implicitly learn spatial and contextual coherence while retaining their original autoregressive paradigm. Extensive experiments validate ARRA's plug-and-play versatility. When training from text-generation-only LLMs or random initialization, ARRA reduces FID by 25.5% (MIMIC-CXR), 8.8% (DeepEyeNet), and 7.5% (ImageNet) for advanced autoregressive LLMs like Chameleon and LlamaGen, all without framework modifications. For domain adaptation, ARRA aligns general-purpose LLMs with specialized models (e.g., BioMedCLIP), achieving an 18.6% FID reduction over direct fine-tuning on medical imaging (MIMIC-CXR). By demonstrating that training objective redesign -- not just architectural innovation -- can resolve cross-modal global coherence challenges, ARRA offers a complementary paradigm for advancing autoregressive models. Code and models will be released to advance autoregressive image generation.
57. 【2503.07330】Mitigating Hallucinations in YOLO-based Object Detection Models: A Revisit to Out-of-Distribution Detection
Link: https://arxiv.org/abs/2503.07330
Authors: Weicheng He, Changshun Wu, Chih-Hong Cheng, Xiaowei Huang, Saddek Bensalem
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Keywords: ensure safe decision-making, reliably perceive objects, dynamic environments, reliably perceive, overly confident
Comments:
Abstract:Object detection systems must reliably perceive objects of interest without being overly confident to ensure safe decision-making in dynamic environments. Filtering techniques based on out-of-distribution (OoD) detection are commonly added as an extra safeguard to filter hallucinations caused by overconfidence in novel objects. Nevertheless, evaluating YOLO-family detectors and their filters under existing OoD benchmarks often leads to unsatisfactory performance. This paper studies the underlying reasons for performance bottlenecks and proposes a methodology to improve performance fundamentally. Our first contribution is a calibration of all existing evaluation results: Although images in existing OoD benchmark datasets are claimed not to have objects within in-distribution (ID) classes (i.e., categories defined in the training dataset), around 13% of objects detected by the object detector are actually ID objects. Dually, the ID dataset containing OoD objects can also negatively impact the decision boundary of filters. These ultimately lead to a significantly imprecise performance estimation. Our second contribution is to consider the task of hallucination reduction as a joint pipeline of detectors and filters. By developing a methodology to carefully synthesize an OoD dataset that semantically resembles the objects to be detected, and using the crafted OoD dataset in the fine-tuning of YOLO detectors to suppress the objectness score, we achieve an 88% reduction in overall hallucination error with a combined fine-tuned detection and filtering system on the self-driving benchmark BDD-100K. Our code and dataset are available at: this https URL.
58. 【2503.07323】Dynamic Path Navigation for Motion Agents with LLM Reasoning
Link: https://arxiv.org/abs/2503.07323
Authors: Yubo Zhao, Qi Wu, Yifan Wang, Yu-Wing Tai, Chi-Keung Tang
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Language Models, Large Language, Language Models, demonstrated strong generalizable, strong generalizable reasoning
Comments:
Abstract:Large Language Models (LLMs) have demonstrated strong generalizable reasoning and planning capabilities. However, their efficacy in spatial path planning and obstacle-free trajectory generation remains underexplored. Leveraging LLMs for navigation holds significant potential, given LLMs' ability to handle unseen scenarios, support user-agent interactions, and provide global control across complex systems, making them well-suited for agentic planning and humanoid motion generation. As one of the first studies in this domain, we explore the zero-shot navigation and path generation capabilities of LLMs by constructing a dataset and proposing an evaluation protocol. Specifically, we represent paths using anchor points connected by straight lines, enabling movement in various directions. This approach offers greater flexibility and practicality compared to previous methods while remaining simple and intuitive for LLMs. We demonstrate that, when tasks are well-structured in this manner, modern LLMs exhibit substantial planning proficiency in avoiding obstacles while autonomously refining navigation with the generated motion to reach the target. Further, this spatial reasoning ability of a single LLM motion agent interacting in a static environment can be seamlessly generalized to the coordination of multiple motion agents in dynamic environments. Unlike traditional approaches that rely on single-step planning or local policies, our training-free LLM-based method enables global, dynamic, closed-loop planning and autonomously resolves collision issues.
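The path representation in this abstract, anchor points joined by straight segments, is easy to validate programmatically. The sketch below checks such a path against obstacles and measures its length. Modelling obstacles as axis-aligned rectangles and using a Liang-Barsky style slab test are assumptions for illustration; the abstract does not describe the obstacle format or the paper's evaluation protocol.

```python
import math


def segment_hits_box(p, q, box):
    """Slab (Liang-Barsky) test: does segment p->q touch the axis-aligned box?"""
    (x0, y0), (x1, y1) = p, q
    xmin, ymin, xmax, ymax = box
    t_enter, t_exit = 0.0, 1.0
    for delta, lo, hi, start in ((x1 - x0, xmin, xmax, x0), (y1 - y0, ymin, ymax, y0)):
        if delta == 0:
            if start < lo or start > hi:
                return False  # parallel to this slab and outside it
            continue
        t0, t1 = (lo - start) / delta, (hi - start) / delta
        if t0 > t1:
            t0, t1 = t1, t0
        t_enter, t_exit = max(t_enter, t0), min(t_exit, t1)
        if t_enter > t_exit:
            return False
    return True


def path_is_collision_free(anchors, obstacles):
    """True if no straight segment between consecutive anchors touches an obstacle."""
    segments = list(zip(anchors, anchors[1:]))
    return not any(segment_hits_box(p, q, box) for p, q in segments for box in obstacles)


def path_length(anchors):
    """Total Euclidean length of the polyline through the anchors."""
    return sum(math.dist(p, q) for p, q in zip(anchors, anchors[1:]))


obstacles = [(2, 2, 4, 4)]           # (xmin, ymin, xmax, ymax), hypothetical
straight = [(0, 0), (6, 6)]          # cuts through the obstacle
detour = [(0, 0), (5, 0), (6, 6)]    # adds one anchor to go around it
```

Here `path_is_collision_free(straight, obstacles)` is false and `path_is_collision_free(detour, obstacles)` is true, which is the kind of feedback a closed-loop planner could return to the model.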
59. 【2503.07315】Group-robust Sample Reweighting for Subpopulation Shifts via Influence Functions
Link: https://arxiv.org/abs/2503.07315
Authors: Rui Qiao, Zhaoxuan Wu, Jingtan Wang, Pang Wei Koh, Bryan Kian Hsiang Low
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Machine learning models, Machine learning, uneven performance, learning models, data distributions
Comments: Accepted to the 13th International Conference on Learning Representations (ICLR 2025). Code is available at [this https URL](https://github.com/qiaoruiyt/GSR)
Abstract:Machine learning models often have uneven performance among subpopulations (a.k.a. groups) in the data distributions. This poses a significant challenge for the models to generalize when the proportions of the groups shift during deployment. To improve robustness to such shifts, existing approaches have developed strategies that train models or perform hyperparameter tuning using the group-labeled data to minimize the worst-case loss over groups. However, a non-trivial amount of high-quality labels is often required to obtain noticeable improvements. Given the costliness of the labels, we propose to adopt a different paradigm to enhance group label efficiency: utilizing the group-labeled data as a target set to optimize the weights of other group-unlabeled data. We introduce Group-robust Sample Reweighting (GSR), a two-stage approach that first learns the representations from group-unlabeled data, and then refines the model by iteratively retraining its last layer on the reweighted data using influence functions. Our GSR is theoretically sound, practically lightweight, and effective in improving the robustness to subpopulation shifts. In particular, GSR outperforms the previous state-of-the-art approaches that require the same amount or even more group labels.
60. 【2503.07314】Automated Movie Generation via Multi-Agent CoT Planning
Link: https://arxiv.org/abs/2503.07314
Authors: Weijia Wu,Zeyu Zhu,Mike Zheng Shou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: requiring manual input, Existing long-form video, lack automated planning, automated movie generation, Existing long-form
Comments: The code and project website are available at https://github.com/showlab/MovieAgent and https://weijiawu.github.io/MovieAgent
Abstract:Existing long-form video generation frameworks lack automated planning, requiring manual input for storylines, scenes, cinematography, and character interactions, resulting in high costs and inefficiencies. To address these challenges, we present MovieAgent, an automated movie generation framework based on multi-agent Chain of Thought (CoT) planning. MovieAgent offers two key advantages: 1) We are the first to explore and define the paradigm of automated movie/long-video generation. Given a script and character bank, MovieAgent can generate multi-scene, multi-shot long-form videos with a coherent narrative, while ensuring character consistency, synchronized subtitles, and stable audio throughout the film. 2) MovieAgent introduces a hierarchical CoT-based reasoning process to automatically structure scenes, camera settings, and cinematography, significantly reducing human effort. By employing multiple LLM agents to simulate the roles of a director, screenwriter, storyboard artist, and location manager, MovieAgent streamlines the production pipeline. Experiments demonstrate that MovieAgent achieves new state-of-the-art results in script faithfulness, character consistency, and narrative coherence. Our hierarchical framework takes a step forward and provides new insights into fully automated movie generation. The code and project website are available at https://github.com/showlab/MovieAgent and https://weijiawu.github.io/MovieAgent.
61. 【2503.07307】AttenST: A Training-Free Attention-Driven Style Transfer Framework with Pre-Trained Diffusion Models
Link: https://arxiv.org/abs/2503.07307
Authors: Bo Huang,Wenlun Xu,Qizhuo Han,Haodong Jing,Ying Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: optimizing pre-trained models, achieved remarkable progress, high computational costs, balancing content preservation, methods typically rely
Comments:
Abstract:While diffusion models have achieved remarkable progress in style transfer tasks, existing methods typically rely on fine-tuning or optimizing pre-trained models during inference, leading to high computational costs and challenges in balancing content preservation with style integration. To address these limitations, we introduce AttenST, a training-free attention-driven style transfer framework. Specifically, we propose a style-guided self-attention mechanism that conditions self-attention on the reference style by retaining the query of the content image while substituting its key and value with those from the style image, enabling effective style feature integration. To mitigate style information loss during inversion, we introduce a style-preserving inversion strategy that refines inversion accuracy through multiple resampling steps. Additionally, we propose a content-aware adaptive instance normalization, which integrates content statistics into the normalization process to optimize style fusion while mitigating content degradation. Furthermore, we introduce a dual-feature cross-attention mechanism to fuse content and style features, ensuring a harmonious synthesis of structural fidelity and stylistic expression. Extensive experiments demonstrate that AttenST outperforms existing methods, achieving state-of-the-art performance on style transfer datasets.
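The query/key/value substitution described in this abstract can be sketched with plain scaled dot-product attention. This is an illustrative toy in pure Python, not the AttenST implementation: the feature vectors are made up, and `style_guided_attention` is a hypothetical name.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

def style_guided_attention(content_q, style_k, style_v):
    """Keep the content image's queries; take keys and values from the style image."""
    return attention(content_q, style_k, style_v)

# Toy features: 2 content tokens, 3 style tokens, dimension 2.
content_q = [[1.0, 0.0], [0.0, 1.0]]
style_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
style_v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
stylized = style_guided_attention(content_q, style_k, style_v)
print(len(stylized), len(stylized[0]))
```

The output keeps one token per content query, so the spatial layout follows the content image, while every output vector is a mixture of style values only.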
62. 【2503.07300】Goal Conditioned Reinforcement Learning for Photo Finishing Tuning
Link: https://arxiv.org/abs/2503.07300
Authors: Jiarui Wu,Yujin Wang,Lingen Li,Zhang Fan,Tianfan Xue
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Lightroom or Darktable, Adobe Lightroom, Photo finishing, photo finishing pipeline, manual tuning process
Comments: 38th Conference on Neural Information Processing Systems (NeurIPS 2024)
Abstract:Photo finishing tuning aims to automate the manual tuning process of a photo finishing pipeline, such as Adobe Lightroom or Darktable. Previous works either use zeroth-order optimization, which is slow when the set of parameters increases, or rely on a differentiable proxy of the target finishing pipeline, which is hard to train. To overcome these challenges, we propose a novel goal-conditioned reinforcement learning framework for efficiently tuning parameters using a goal image as a condition. Unlike previous approaches, our tuning framework does not rely on any proxy and treats the photo finishing pipeline as a black box. Utilizing a trained reinforcement learning policy, it can efficiently find the desired set of parameters within just 10 queries, while optimization-based approaches normally take 200 queries. Furthermore, our architecture utilizes a goal image to guide the iterative tuning of pipeline parameters, allowing for flexible conditioning on pixel-aligned target images, style images, or any other visually representable goals. We conduct detailed experiments on photo finishing tuning and photo stylization tuning tasks, demonstrating the advantages of our method. Project website: this https URL.
63. 【2503.07298】ALLVB: All-in-One Long Video Understanding Benchmark
Link: https://arxiv.org/abs/2503.07298
Authors: Xichen Tan,Yuanjing Luo,Yunfan Ye,Fang Liu,Zhiping Cai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: long video understanding, video understanding, video understanding benchmark, Multi-modal LLMs, capabilities of Multi-modal
Comments: AAAI 2025
Abstract:From image to video understanding, the capabilities of Multi-modal LLMs (MLLMs) are increasingly powerful. However, most existing video understanding benchmarks are relatively short, which makes them inadequate for effectively evaluating the long-sequence modeling capabilities of MLLMs. This highlights the urgent need for a comprehensive and integrated long video understanding benchmark to assess the ability of MLLMs thoroughly. To this end, we propose ALLVB (ALL-in-One Long Video Understanding Benchmark). ALLVB's main contributions include: 1) It integrates 9 major video understanding tasks. These tasks are converted into video QA formats, allowing a single benchmark to evaluate 9 different video understanding capabilities of MLLMs, highlighting the versatility, comprehensiveness, and challenging nature of ALLVB. 2) A fully automated annotation pipeline using GPT-4o is designed, requiring only human quality control, which facilitates the maintenance and expansion of the benchmark. 3) It contains 1,376 videos across 16 categories, averaging nearly 2 hours each, with a total of 252k QAs. To the best of our knowledge, it is the largest long video understanding benchmark in terms of the number of videos, average duration, and number of QAs. We have tested various mainstream MLLMs on ALLVB, and the results indicate that even the most advanced commercial models have significant room for improvement. This reflects the benchmark's challenging nature and demonstrates the substantial potential for development in long video understanding.
64. 【2503.07294】Distilling Knowledge into Quantum Vision Transformers for Biomedical Image Classification
Link: https://arxiv.org/abs/2503.07294
Authors: Thomas Boucher,Evangelos B. Mazomenos
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: vision transformers, Quantum vision transformers, replacing linear layers, improve feature representation, quantum neural networks
Comments: Submitted for MICCAI 2025
Abstract:Quantum vision transformers (QViTs) build on vision transformers (ViTs) by replacing linear layers within the self-attention mechanism with parameterised quantum neural networks (QNNs), harnessing quantum mechanical properties to improve feature representation. This hybrid approach aims to achieve superior performance, with significantly reduced model complexity as a result of the enriched feature representation, requiring fewer parameters. This paper proposes a novel QViT model for biomedical image classification and investigates its performance against comparable ViTs across eight diverse datasets, encompassing various modalities and classification tasks. We assess models trained from scratch and those pre-trained using knowledge distillation (KD) from high-quality teacher models. Our findings demonstrate that QViTs outperform comparable ViTs with average ROC AUC (0.863 vs 0.846) and accuracy (0.710 vs 0.687) when trained from scratch, and even compete with state-of-the-art classical models in multiple tasks, whilst being significantly more efficient (89% reduction in GFLOPs and 99.99% in parameter number). Additionally, we find that QViTs and ViTs respond equally well to KD, with QViT pre-training performance scaling with model complexity. This is the first investigation into the efficacy of deploying QViTs with KD for computer-aided diagnosis. Our results highlight the enormous potential of quantum machine learning (QML) in biomedical image analysis.
65. 【2503.07276】A Systematic Review of ECG Arrhythmia Classification: Adherence to Standards, Fair Evaluation, and Embedded Feasibility
Link: https://arxiv.org/abs/2503.07276
Authors: Guilherme Silva,Pedro Silva,Gladston Moreira,Vander Freitas,Jadson Gertrudes,Eduardo Luz
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: signals is crucial, cardiac conditions, crucial for early, early detection, detection of arrhythmias
Comments:
Abstract:The classification of electrocardiogram (ECG) signals is crucial for early detection of arrhythmias and other cardiac conditions. However, despite advances in machine learning, many studies fail to follow standardization protocols, leading to inconsistencies in performance evaluation and real-world applicability. Additionally, hardware constraints essential for practical deployment, such as in pacemakers, Holter monitors, and wearable ECG patches, are often overlooked. Since real-world impact depends on feasibility in resource-constrained devices, ensuring efficient deployment is critical for continuous monitoring. This review systematically analyzes ECG classification studies published between 2017 and 2024, focusing on those adhering to the E3C (Embedded, Clinical, and Comparative Criteria), which include inter-patient paradigm implementation, compliance with Association for the Advancement of Medical Instrumentation (AAMI) recommendations, and model feasibility for embedded systems. While many studies report high accuracy, few properly consider patient-independent partitioning and hardware limitations. We identify state-of-the-art methods meeting E3C criteria and conduct a comparative analysis of accuracy, inference time, energy consumption, and memory usage. Finally, we propose standardized reporting practices to ensure fair comparisons and practical applicability of ECG classification models. By addressing these gaps, this study aims to guide future research toward more robust and clinically viable ECG classification systems.
66. 【2503.07274】Efficient Distillation of Classifier-Free Guidance using Adapters
Link: https://arxiv.org/abs/2503.07274
Authors: Cristian Perez Jensen,Seyedmorteza Sadat
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: neural function evaluations, function evaluations, guidance distillation methods, essential for conditional, doubles the number
Comments:
Abstract:While classifier-free guidance (CFG) is essential for conditional diffusion models, it doubles the number of neural function evaluations (NFEs) per inference step. To mitigate this inefficiency, we introduce adapter guidance distillation (AGD), a novel approach that simulates CFG in a single forward pass. AGD leverages lightweight adapters to approximate CFG, effectively doubling the sampling speed while maintaining or even improving sample quality. Unlike prior guidance distillation methods that tune the entire model, AGD keeps the base model frozen and only trains minimal additional parameters ($\sim$2%) to significantly reduce the resource requirement of the distillation phase. Additionally, this approach preserves the original model weights and enables the adapters to be seamlessly combined with other checkpoints derived from the same base model. We also address a key mismatch between training and inference in existing guidance distillation methods by training on CFG-guided trajectories instead of standard diffusion trajectories. Through extensive experiments, we show that AGD achieves comparable or superior FID to CFG across multiple architectures with only half the NFEs. Notably, our method enables the distillation of large models ($\sim$2.6B parameters) on a single consumer GPU with 24 GB of VRAM, making it more accessible than previous approaches that require multiple high-end GPUs. We will publicly release the implementation of our method.
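To see why CFG doubles the number of NFEs, the sketch below applies one common formulation of the guidance rule, eps = eps_uncond + w * (eps_cond - eps_uncond), and counts forward passes. `CountingModel` is a toy stand-in for a denoiser, not a real diffusion model, and AGD's adapters are not implemented here.

```python
class CountingModel:
    """Toy denoiser stand-in that counts forward passes (NFEs)."""

    def __init__(self):
        self.nfe = 0

    def __call__(self, x, cond):
        self.nfe += 1
        # Made-up dynamics: the condition simply shifts the prediction.
        shift = 0.0 if cond is None else float(cond)
        return [0.5 * xi + shift for xi in x]

def cfg_step(model, x, cond, guidance_scale):
    """Classifier-free guidance: two forward passes per sampling step."""
    eps_uncond = model(x, None)   # unconditional pass
    eps_cond = model(x, cond)     # conditional pass
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

model = CountingModel()
guided = cfg_step(model, [2.0, 4.0], cond=1.0, guidance_scale=3.0)
print(guided, model.nfe)
```

A guidance-distilled model aims to produce the guided prediction in a single call, which is where the halving of NFEs reported in the abstract comes from.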
67. 【2503.07266】Customized SAM 2 for Referring Remote Sensing Image Segmentation
Link: https://arxiv.org/abs/2503.07266
Authors: Fu Rong,Meng Lan,Qian Zhang,Lefei Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Referring Remote Sensing, Remote Sensing Image, Sensing Image Segmentation, Remote Sensing, Referring Remote
Comments:
Abstract:Referring Remote Sensing Image Segmentation (RRSIS) aims to segment target objects in remote sensing (RS) images based on textual descriptions. Although Segment Anything Model 2 (SAM 2) has shown remarkable performance in various segmentation tasks, its application to RRSIS presents several challenges, including understanding the text-described RS scenes and generating effective prompts from text descriptions. To address these issues, we propose RS2-SAM 2, a novel framework that adapts SAM 2 to RRSIS by aligning the adapted RS features and textual features, providing pseudo-mask-based dense prompts, and enforcing boundary constraints. Specifically, we first employ a union encoder to jointly encode the visual and textual inputs, generating aligned visual and text embeddings as well as multimodal class tokens. Then, we design a bidirectional hierarchical fusion module to adapt SAM 2 to RS scenes and align adapted visual features with the visually enhanced text embeddings, improving the model's interpretation of text-described RS scenes. Additionally, a mask prompt generator is introduced to take the visual embeddings and class tokens as input and produce a pseudo-mask as the dense prompt of SAM 2. To further refine segmentation, we introduce a text-guided boundary loss to optimize segmentation boundaries by computing text-weighted gradient differences. Experimental results on several RRSIS benchmarks demonstrate that RS2-SAM 2 achieves state-of-the-art performance.
68. 【2503.07265】WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
Link: https://arxiv.org/abs/2503.07265
Authors: Yuwei Niu,Munan Ning,Mengren Zheng,Bin Lin,Peng Jin,Jiaqi Liao,Kunpeng Ning,Bin Zhu,Li Yuan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: generating high-quality artistic, high-quality artistic creations, visual content, capable of generating
Comments: Code, data and leaderboard: https://github.com/PKU-YuanGroup/WISE
Abstract:Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation. To address this challenge, we propose WISE, the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation. WISE moves beyond simple word-pixel mapping by challenging models with 1,000 meticulously crafted prompts across 25 sub-domains in cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of the traditional CLIP metric, we introduce WiScore, a novel quantitative metric for assessing knowledge-image alignment. Through comprehensive testing of 20 models (10 dedicated T2I models and 10 unified multimodal models) using 1,000 structured prompts spanning 25 sub-domains, our findings reveal significant limitations in their ability to effectively integrate and apply world knowledge during image generation, highlighting critical pathways for enhancing knowledge incorporation and application in next-generation T2I models. Code and data are available at https://github.com/PKU-YuanGroup/WISE.
69. 【2503.07259】COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition
Link: https://arxiv.org/abs/2503.07259
Authors: Baiyu Chen,Wilson Wongso,Zechen Li,Yonchanok Khaokaew,Hao Xue,Flora Salim
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: human activity recognition, video-based models capture, human activity, capture rich semantic, activity recognition
Comments:
Abstract:Egocentric video-based models capture rich semantic information and have demonstrated strong performance in human activity recognition (HAR). However, their high power consumption, privacy concerns, and dependence on lighting conditions limit their feasibility for continuous on-device recognition. In contrast, inertial measurement unit (IMU) sensors offer an energy-efficient and privacy-preserving alternative, yet they suffer from limited large-scale annotated datasets, leading to weaker generalization in downstream tasks. To bridge this gap, we propose COMODO, a cross-modal self-supervised distillation framework that transfers rich semantic knowledge from the video modality to the IMU modality without requiring labeled annotations. COMODO leverages a pretrained and frozen video encoder to construct a dynamic instance queue, aligning the feature distributions of video and IMU embeddings. By distilling knowledge from video representations, our approach enables the IMU encoder to inherit rich semantic information from video while preserving its efficiency for real-world applications. Experiments on multiple egocentric HAR datasets demonstrate that COMODO consistently improves downstream classification performance, achieving results comparable to or exceeding fully supervised fine-tuned models. Moreover, COMODO exhibits strong cross-dataset generalization. Benefiting from its simplicity, our method is also generally applicable to various video and time-series pre-trained models, offering the potential to leverage more powerful teacher and student foundation models in future research. The code is available at this https URL.
70. 【2503.07253】AnomalyPainter: Vision-Language-Diffusion Synergy for Zero-Shot Realistic and Diverse Industrial Anomaly Synthesis
Link: https://arxiv.org/abs/2503.07253
Authors: Zhangyu Lai,Yilin Lu,Xinyang Li,Jianghang Lin,Yansong Qu,Liujuan Cao,Ming Li,Rongrong Ji
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: made remarkable progress, Language Large Model, Vision Language Large, Latent Diffusion Model, synergizing Vision Language
Comments: anomaly synthesis,anomaly detection
Abstract:While existing anomaly synthesis methods have made remarkable progress, achieving both realism and diversity in synthesis remains a major obstacle. To address this, we propose AnomalyPainter, a zero-shot framework that breaks the diversity-realism trade-off dilemma through synergizing Vision Language Large Model (VLLM), Latent Diffusion Model (LDM), and our newly introduced texture library Tex-9K. Tex-9K is a professional texture library containing 75 categories and 8,792 texture assets crafted for diverse anomaly synthesis. Leveraging VLLM's general knowledge, reasonable anomaly text descriptions are generated for each industrial object and matched with relevant diverse textures from Tex-9K. These textures then guide the LDM via ControlNet to paint on normal images. Furthermore, we introduce Texture-Aware Latent Init to stabilize the natural-image-trained ControlNet for industrial images. Extensive experiments show that AnomalyPainter outperforms existing methods in realism, diversity, and generalization, achieving superior downstream performance.
71. 【2503.07252】Semantic Communications with Computer Vision Sensing for Edge Video Transmission
Link: https://arxiv.org/abs/2503.07252
Authors: Yubo Peng,Luping Xiang,Kun Yang,Kezhi Wang,Merouane Debbah
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Keywords: data consumes substantial, consumes substantial spectrum, video data consumes, substantial spectrum resources, widespread adoption
Comments:
Abstract:Despite the widespread adoption of vision sensors in edge applications, such as surveillance, the transmission of video data consumes substantial spectrum resources. Semantic communication (SC) offers a solution by extracting and compressing information at the semantic level, preserving the accuracy and relevance of transmitted data while significantly reducing the volume of transmitted information. However, traditional SC methods face inefficiencies due to the repeated transmission of static frames in edge videos, exacerbated by the absence of sensing capabilities, which results in spectrum inefficiency. To address this challenge, we propose an SC with computer vision sensing (SCCVS) framework for edge video transmission. The framework first introduces a compression ratio (CR) adaptive SC (CRSC) model, capable of adjusting CR based on whether the frames are static or dynamic, effectively conserving spectrum resources. Additionally, we implement an object detection and semantic segmentation models-enabled sensing (OSMS) scheme, which intelligently senses the changes in the scene and assesses the significance of each frame through in-context analysis. Hence, the OSMS scheme provides CR prompts to the CRSC model based on real-time sensing results. Moreover, both CRSC and OSMS are designed as lightweight models, ensuring compatibility with resource-constrained sensors commonly used in practical edge applications. Experimental simulations validate the effectiveness of the proposed SCCVS framework, demonstrating its ability to enhance transmission efficiency without sacrificing critical semantic information.
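The idea of choosing the compression ratio according to whether a frame is static or dynamic can be sketched as follows. Note the substitution: the paper senses scene changes with object detection and semantic segmentation models, whereas this toy uses simple frame differencing. The threshold and the two ratio values are hypothetical, and the ratio is assumed here to mean the fraction of semantic features that is transmitted.

```python
def mean_abs_diff(prev_frame, frame):
    """Mean absolute pixel difference between two equally sized grayscale frames."""
    total = 0
    count = 0
    for row_prev, row_cur in zip(prev_frame, frame):
        for a, b in zip(row_prev, row_cur):
            total += abs(a - b)
            count += 1
    return total / count

def choose_compression_ratio(prev_frame, frame,
                             threshold=5.0, static_cr=0.05, dynamic_cr=0.5):
    """Pick a small ratio for static frames and a larger one for dynamic frames.

    All default values are illustrative assumptions, not values from the paper.
    """
    if prev_frame is None:
        # First frame: nothing to compare against, so treat it as dynamic.
        return dynamic_cr
    if mean_abs_diff(prev_frame, frame) < threshold:
        return static_cr
    return dynamic_cr

background = [[10, 10], [10, 10]]
nearly_same = [[10, 11], [10, 10]]
changed = [[200, 10], [10, 10]]
print(choose_compression_ratio(None, background))
print(choose_compression_ratio(background, nearly_same))
print(choose_compression_ratio(background, changed))
```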
72. 【2503.07249】Text-IRSTD: Leveraging Semantic Text to Promote Infrared Small Target Detection in Complex Scenes
Link: https://arxiv.org/abs/2503.07249
Authors: Feng Huang,Shuyuan Zheng,Zhaobing Qiu,Huanxian Liu,Huanxin Bai,Liqiong Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Infrared small target, computer vision, Infrared small, small target detection, hot and challenging
Comments:
Abstract:Infrared small target detection is currently a hot and challenging task in computer vision. Existing methods usually focus on mining visual features of targets and therefore struggle to cope with complex and diverse detection scenarios. The main reason is that infrared small targets carry limited image information on their own, so relying only on visual features fails to discriminate targets from interference, leading to lower detection performance. To address this issue, we introduce a novel approach leveraging semantic text to guide infrared small target detection, called Text-IRSTD. It innovatively expands classical IRSTD to text-guided IRSTD, providing a new research idea. On the one hand, we devise a novel fuzzy semantic text prompt to accommodate ambiguous target categories. On the other hand, we propose a progressive cross-modal semantic interaction decoder (PCSID) to facilitate information fusion between texts and images. In addition, we construct a new benchmark consisting of 2,755 infrared images of different scenarios with fuzzy semantic textual annotations, called FZDT. Extensive experimental results demonstrate that our method achieves better detection performance and target contour recovery than the state-of-the-art methods. Moreover, the proposed Text-IRSTD shows strong generalization and wide application prospects in unseen detection scenarios. The dataset and code will be publicly released after acceptance of this paper.
73. 【2503.07235】Retinex-MEF: Retinex-based Glare Effects Aware Unsupervised Multi-Exposure Image Fusion
Link: https://arxiv.org/abs/2503.07235
Authors: Haowen Bai,Jiangshe Zhang,Zixiang Zhao,Lilun Deng,Yukun Cui,Shuang Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: low dynamic range, high dynamic range, singular high dynamic, dynamic range images, dynamic range
Comments:
Abstract:Multi-exposure image fusion consolidates multiple low dynamic range images of the same scene into a single high dynamic range image. Retinex theory, which separates image illumination from scene reflectance, is naturally adopted to ensure consistent scene representation and effective information fusion across varied exposure levels. However, the conventional pixel-wise multiplication of illumination and reflectance inadequately models the glare effect induced by overexposure. To better adapt this theory for multi-exposure image fusion, we introduce an unsupervised and controllable method termed Retinex-MEF. Specifically, our method decomposes multi-exposure images into separate illumination components and a shared reflectance component, and effectively models the glare induced by overexposure. Employing a bidirectional loss constraint to learn the common reflectance component, our approach effectively mitigates the glare effect. Furthermore, we establish a controllable exposure fusion criterion, enabling global exposure adjustments while preserving contrast, thus overcoming the constraints of fixed-level fusion. A series of experiments across multiple datasets, including underexposure-overexposure fusion, exposure control fusion, and homogeneous extreme exposure fusion, demonstrate the effective decomposition and flexible fusion capability of our model.
74. 【2503.07234】CoT-Drive: Efficient Motion Forecasting for Autonomous Driving with LLMs and Chain-of-Thought Prompting
Link: https://arxiv.org/abs/2503.07234
Authors: Haicheng Liao,Hanlin Kong,Bonan Wang,Chengyue Wang,Wang Ye,Zhengbing He,Chengzhong Xu,Zhenning Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Keywords: Accurate motion forecasting, safe autonomous driving, Accurate motion, motion forecasting, autonomous driving
Comments:
Abstract:Accurate motion forecasting is crucial for safe autonomous driving (AD). This study proposes CoT-Drive, a novel approach that enhances motion forecasting by leveraging large language models (LLMs) and a chain-of-thought (CoT) prompting method. We introduce a teacher-student knowledge distillation strategy to effectively transfer LLMs' advanced scene understanding capabilities to lightweight language models (LMs), ensuring that CoT-Drive operates in real-time on edge devices while maintaining comprehensive scene understanding and generalization capabilities. By leveraging CoT prompting techniques for LLMs without additional training, CoT-Drive generates semantic annotations that significantly improve the understanding of complex traffic environments, thereby boosting the accuracy and robustness of predictions. Additionally, we present two new scene description datasets, Highway-Text and Urban-Text, designed for fine-tuning lightweight LMs to generate context-specific semantic annotations. Comprehensive evaluations on five real-world datasets demonstrate that CoT-Drive outperforms existing models, highlighting its effectiveness and efficiency in handling complex traffic scenarios. Overall, this study is the first to consider the practical application of LLMs in this field. It pioneers the training and use of a lightweight LLM surrogate for motion forecasting, setting a new benchmark and showcasing the potential of integrating LLMs into AD systems.
75. 【2503.07232】Boosting Diffusion-Based Text Image Super-Resolution Model Towards Generalized Real-World Scenarios
Link: https://arxiv.org/abs/2503.07232
Authors: Chenglu Pan,Xiaogang Xu,Ganggui Ding,Yunke Zhang,Wenbo Li,Jiarong Xu,Qingbiao Wu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Restoring low-resolution text, Restoring low-resolution, low-resolution text images, text images presents, significant challenge
Comments:
Abstract:Restoring low-resolution text images presents a significant challenge, as it requires maintaining both the fidelity and stylistic realism of the text in restored images. Existing text image restoration methods often fall short in hard situations, as the traditional super-resolution models cannot guarantee clarity, while diffusion-based methods fail to maintain fidelity. In this paper, we introduce a novel framework aimed at improving the generalization ability of diffusion models for text image super-resolution (SR), especially promoting fidelity. First, we propose a progressive data sampling strategy that incorporates diverse image types at different stages of training, stabilizing the convergence and improving the generalization. For the network architecture, we leverage a pre-trained SR prior to provide robust spatial reasoning capabilities, enhancing the model's ability to preserve textual information. Additionally, we employ a cross-attention mechanism to better integrate textual priors. To further reduce errors in textual priors, we utilize confidence scores to dynamically adjust the importance of textual features during training. Extensive experiments on real-world datasets demonstrate that our approach not only produces text images with more realistic visual appearances but also improves the accuracy of text structure.
76. 【2503.07230】A Deep Learning Architecture for Land Cover Mapping Using Spatio-Temporal Sentinel-1 Features
Link: https://arxiv.org/abs/2503.07230
Authors: Luigi Russo,Antonietta Sorriso,Silvia Liberata Ullo,Paolo Gamba
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Keywords: Land Cover, Convolutional Neural Networks, Synthetic Aperture Radar, mapping using satellite, monitoring and management
Comments: Submitted to IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Abstract:Land Cover (LC) mapping using satellite imagery is critical for environmental monitoring and management. Deep Learning (DL), particularly Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), has revolutionized this field by enhancing the accuracy of classification tasks. In this work, a novel approach combining a transformer-based Swin-Unet architecture with seasonal synthesized spatio-temporal images has been employed to classify LC types using spatio-temporal features extracted from Sentinel-1 (S1) Synthetic Aperture Radar (SAR) data, organized into seasonal clusters. The study focuses on three distinct regions - Amazonia, Africa, and Siberia - and evaluates the model performance across diverse ecoregions within these areas. By utilizing seasonal feature sequences instead of dense temporal sequences, notable performance improvements have been achieved, especially in regions with temporal data gaps like Siberia, where the S1 data distribution is uneven. The results demonstrate the effectiveness and the generalization capabilities of the proposed methodology in achieving high overall accuracy (O.A.) values, even in regions with limited training data.
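One way to picture "seasonal feature sequences instead of dense temporal sequences" is to collapse an irregular time series into one value per season. The sketch below is only an illustration under stated assumptions: it uses Northern Hemisphere meteorological seasons and a plain mean, the backscatter values are made up, and the paper's actual clustering and feature synthesis may differ.

```python
# Assumption: Northern Hemisphere meteorological seasons.
SEASON_OF_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}
SEASON_ORDER = ["winter", "spring", "summer", "autumn"]

def seasonal_features(observations):
    """Collapse (month, value) observations into four seasonal means.

    Seasons without any acquisition yield None, so gaps stay explicit
    instead of distorting a dense temporal sequence.
    """
    buckets = {season: [] for season in SEASON_ORDER}
    for month, value in observations:
        buckets[SEASON_OF_MONTH[month]].append(value)
    return [
        sum(buckets[s]) / len(buckets[s]) if buckets[s] else None
        for s in SEASON_ORDER
    ]

# Made-up backscatter values (dB) for one pixel with uneven coverage.
pixel_series = [(1, -12.0), (2, -10.0), (7, -8.0), (7, -6.0), (8, -7.0)]
print(seasonal_features(pixel_series))
```

The output always has a fixed length of four, regardless of how many acquisitions exist per season, which is what makes it usable as a fixed-size input for a model.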
77. 【2503.07217】ReelWave: A Multi-Agent Framework Toward Professional Movie Sound Generation
链接:https://arxiv.org/abs/2503.07217
作者:Zixuan Wang,Chi-Keung Tang,Yu-Wing Tai
类目:ound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Film production, important application, application for generative, Film, audio
Comments:
Abstract:Film production is an important application for generative audio, where richer context is provided through multiple scenes. In ReelWave, we propose a multi-agent framework for audio generation inspired by the professional movie production process. We first capture semantically and temporally synchronized "on-screen" sound by training a prediction model that predicts three interpretable time-varying audio control signals: loudness, pitch, and timbre. These three signals are subsequently specified as conditions through a cross-attention module. Then, our framework infers "off-screen" sound to complement the generation through cooperative interaction between communicative agents. Each agent takes on a specific role similar to those in a movie production team and is supervised by an agent called the director. We also investigate the case in which the conditioning video consists of multiple scenes, which is frequent in videos extracted from long movies. Consequently, our framework can capture a richer context for audio generation conditioned on video clips extracted from movies.
78. 【2503.07209】Synthetic Lung X-ray Generation through Cross-Attention and Affinity Transformation
Link: https://arxiv.org/abs/2503.07209
Authors: Ruochen Pi, Lianlei Shan
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Collecting and annotating, resource-intensive task, time-consuming and resource-intensive, Collecting, diffusion model trained
Comments:
Abstract:Collecting and annotating medical images is a time-consuming and resource-intensive task. However, generating synthetic data through models such as Diffusion offers a cost-effective alternative. This paper introduces a new method for the automatic generation of accurate semantic masks from synthetic lung X-ray images based on a stable diffusion model trained on text-image pairs. This method uses cross-attention mapping between text and image to extend text-driven image synthesis to semantic mask generation. It employs text-guided cross-attention information to identify specific areas in an image and combines this with innovative techniques to produce high-resolution, class-differentiated pixel masks. This approach significantly reduces the costs associated with data collection and annotation. The experimental results demonstrate that segmentation models trained on synthetic data generated using the method are comparable to, and in some cases even better than, models trained on real datasets. This shows the effectiveness of the method and its potential to revolutionize medical image analysis.
79. 【2503.07204】Endo-FASt3r: Endoscopic Foundation model Adaptation for Structure from motion
Link: https://arxiv.org/abs/2503.07204
Authors: Mona Sheikh Zeinoddin, Mobarakol Islam, Zafer Tandogdu, Greg Shaw, Mathew J. Clarkson, Evangelos Mazomenos, Danail Stoyanov
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Accurate depth, achieving high-quality, visualisations in robotic-assisted, robotic-assisted surgery, pose estimation
Comments:
Abstract:Accurate depth and camera pose estimation is essential for achieving high-quality 3D visualisations in robotic-assisted surgery. Despite recent advancements in foundation model adaptation to monocular depth estimation of endoscopic scenes via self-supervised learning (SSL), no prior work has explored their use for pose estimation. These methods rely on low-rank adaptation approaches, which constrain model updates to a low-rank space. We propose Endo-FASt3r, the first monocular SSL depth and pose estimation framework that uses foundation models for both tasks. We extend the Reloc3r relative pose estimation foundation model by designing Reloc3rX, introducing modifications necessary for convergence in SSL. We also present DoMoRA, a novel adaptation technique that enables higher-rank updates and faster convergence. Experiments on the SCARED dataset show that Endo-FASt3r achieves a substantial 10% improvement in pose estimation and a 2% improvement in depth estimation over prior work. Similar performance gains on the Hamlyn and StereoMIS datasets reinforce the generalisability of Endo-FASt3r across different datasets.
80. 【2503.07197】Effective and Efficient Masked Image Generation Models
Link: https://arxiv.org/abs/2503.07197
Authors: Zebin You, Jingyang Ou, Xiaolu Zhang, Jun Hu, Jun Zhou, Chongxuan Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: masked image generation, motivations and objectives, single framework, masked image, Fréchet Inception Distance
Comments:
Abstract:Although masked image generation models and masked diffusion models are designed with different motivations and objectives, we observe that they can be unified within a single framework. Building upon this insight, we carefully explore the design space of training and sampling, identifying key factors that contribute to both performance and efficiency. Based on the improvements observed during this exploration, we develop our model, referred to as eMIGM. Empirically, eMIGM demonstrates strong performance on ImageNet generation, as measured by Fréchet Inception Distance (FID). In particular, on ImageNet 256x256, with a similar number of function evaluations (NFEs) and model parameters, eMIGM outperforms the seminal VAR. Moreover, as the NFE and model parameters increase, eMIGM achieves performance comparable to state-of-the-art continuous diffusion models while requiring less than 40% of the NFE. Additionally, on ImageNet 512x512, with only about 60% of the NFE, eMIGM outperforms state-of-the-art continuous diffusion models.
81. 【2503.07191】All That Glitters Is Not Gold: Key-Secured 3D Secrets within 3D Gaussian Splatting
Link: https://arxiv.org/abs/2503.07191
Authors: Yan Ren, Shilin Lu, Adams Wai-Kin Kong
Categories: Graphics (cs.GR); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Recent advances, Gaussian Splatting, revolutionized scene reconstruction, opening new possibilities, revolutionized scene
Comments:
Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have revolutionized scene reconstruction, opening new possibilities for 3D steganography by hiding 3D secrets within 3D covers. The key challenge in steganography is ensuring imperceptibility while maintaining high-fidelity reconstruction. However, existing methods often suffer from detectability risks and utilize only suboptimal 3DGS features, limiting their full potential. We propose a novel end-to-end key-secured 3D steganography framework (KeySS) that jointly optimizes a 3DGS model and a key-secured decoder for secret reconstruction. Our approach reveals that Gaussian features contribute unequally to secret hiding. The framework incorporates a key-controllable mechanism enabling multi-secret hiding and unauthorized access prevention, while systematically exploring optimal feature update to balance fidelity and security. To rigorously evaluate steganographic imperceptibility beyond conventional 2D metrics, we introduce 3D-Sinkhorn distance analysis, which quantifies distributional differences between original and steganographic Gaussian parameters in the representation space. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both cover and secret reconstruction while maintaining high security levels, advancing the field of 3D steganography. Code is available at this https URL
82. 【2503.07190】Multi-Modal 3D Mesh Reconstruction from Images and Text
Link: https://arxiv.org/abs/2503.07190
Authors: Melvin Reka, Tessa Pulli, Markus Vincze
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: require large datasets, high computational costs, object pose estimation, large datasets, struggle to generalize
Comments: under review
Abstract:6D object pose estimation for unseen objects is essential in robotics but traditionally relies on trained models that require large datasets, incur high computational costs, and struggle to generalize. Zero-shot approaches eliminate the need for training but depend on pre-existing 3D object models, which are often impractical to obtain. To address this, we propose a language-guided few-shot 3D reconstruction method that reconstructs a 3D mesh from a few input images. The proposed pipeline receives a set of input images and a language query. A combination of GroundingDINO and Segment Anything Model outputs segmented masks, from which a sparse point cloud is reconstructed with VGGSfM. Subsequently, the mesh is reconstructed with the Gaussian Splatting method SuGAR. In a final cleaning step, artifacts are removed, resulting in the final 3D mesh of the queried object. We evaluate the method in terms of accuracy and quality of the geometry and texture. Furthermore, we study the impact of imaging conditions such as viewing angle, number of input images, and image overlap on 3D object reconstruction quality, efficiency, and computational scalability.
83. 【2503.07185】Evaluation of Alignment-Regularity Characteristics in Deformable Image Registration
Link: https://arxiv.org/abs/2503.07185
Authors: Vasiliki Sideri-Lampretsa, Daniel Rueckert, Huaqi Qiu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Evaluating deformable image, achieving high alignment, high alignment accuracy, maintaining deformation regularity, Evaluating deformable
Comments:
Abstract:Evaluating deformable image registration (DIR) is challenging due to the inherent trade-off between achieving high alignment accuracy and maintaining deformation regularity. In this work, we introduce a novel evaluation scheme based on the alignment-regularity characteristic (ARC) to systematically capture and analyze this trade-off. We first introduce the ARC curves, which describe the performance of a given registration algorithm as a spectrum measured by alignment and regularity metrics. We further adopt a HyperNetwork-based approach that learns to continuously interpolate across the full regularization range, accelerating the construction and improving the sample density of ARC curves. We empirically demonstrate our evaluation scheme using representative learning-based deformable image registration methods with various network architectures and transformation models on two public datasets. We present a range of findings not evident from existing evaluation practices and provide general recommendations for model evaluation and selection using our evaluation scheme. All relevant code is made publicly available.
84. 【2503.07173】Towards Spatial Transcriptomics-guided Pathological Image Recognition with Batch-Agnostic Encoder
Link: https://arxiv.org/abs/2503.07173
Authors: Kazuya Nishimura, Ryoma Bise, Yasuhiro Kojima
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: simultaneously captures pathological, technique that simultaneously, simultaneously captures, spatial coordinates, captures pathological images
Comments: Accepted to ISBI 2025
Abstract:Spatial transcriptomics (ST) is a novel technique that simultaneously captures pathological images and gene expression profiles with spatial coordinates. Since ST is closely related to pathological features such as disease subtypes, it may be valuable to augment image representation with pathological information. However, no attempt has been made to leverage ST for image recognition (i.e., patch-level classification of pathological image subtypes). One of the major challenges is the significant batch effect in spatial transcriptomics, which makes it difficult to extract pathological features of images from ST. In this paper, we propose a batch-agnostic contrastive learning framework that can extract consistent signals from the gene expression of ST across multiple patients. To extract consistent signals from ST, we utilize a batch-agnostic gene encoder that is trained in a variational inference manner. Experiments demonstrated the effectiveness of our framework on a publicly available dataset. Code is publicly available at this https URL
85. 【2503.07168】HisTrackMap: Global Vectorized High-Definition Map Construction via History Map Tracking
Link: https://arxiv.org/abs/2503.07168
Authors: Jing Yang, Sen Yang, Xiao Tan, Hanli Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: implicitly propagate queries, autonomous driving systems, query-based detection frameworks, precise environmental information, maps provide rich
Comments:
Abstract:As an essential component of autonomous driving systems, high-definition (HD) maps provide rich and precise environmental information for auto-driving scenarios; however, existing methods, which primarily rely on query-based detection frameworks to directly model map elements or implicitly propagate queries over time, often struggle to maintain consistent temporal perception outcomes. These inconsistencies pose significant challenges to the stability and reliability of real-world autonomous driving and map data collection systems. To address this limitation, we propose a novel end-to-end tracking framework for global map construction by temporally tracking map elements' historical trajectories. First, an instance-level historical rasterization map representation is designed to explicitly store previous perception results, which can control and maintain different global instances' history information in a fine-grained way. Second, we introduce a Map-Trajectory Prior Fusion module within this tracking framework, leveraging historical priors for tracked instances to improve temporal smoothness and continuity. Third, we propose a global perspective metric to evaluate the quality of temporal geometry construction in HD maps, filling the gap in current metrics for assessing global geometric perception results. Substantial experiments on the nuScenes and Argoverse2 datasets demonstrate that the proposed method outperforms state-of-the-art (SOTA) methods in both single-frame and temporal metrics. Our project page: this https URL.
86. 【2503.07167】Temporal Overlapping Prediction: A Self-supervised Pre-training Method for LiDAR Moving Object Segmentation
Link: https://arxiv.org/abs/2503.07167
Authors: Ziliang Miao, Runjian Chen, Yixi Cai, Buwei He, Wenquan Zhao, Wenqi Shao, Bo Zhang, Fu Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: Moving object segmentation, Moving object, self-driving vehicles, clouds is crucial
Comments:
Abstract:Moving object segmentation (MOS) on LiDAR point clouds is crucial for autonomous systems like self-driving vehicles. Previous supervised approaches rely heavily on costly manual annotations, while LiDAR sequences naturally capture temporal motion cues that can be leveraged for self-supervised learning. In this paper, we propose Temporal Overlapping Prediction (TOP), a self-supervised pre-training method that alleviates the labeling burden for MOS. TOP explores the temporal overlapping points that are commonly observed by the current and adjacent scans, and learns spatiotemporal representations by predicting the occupancy states of temporal overlapping points. Moreover, we utilize current occupancy reconstruction as an auxiliary pre-training objective, which enhances the current structural awareness of the model. We conduct extensive experiments and observe that the conventional metric Intersection-over-Union (IoU) shows a strong bias toward objects with more scanned points, which might neglect small or distant objects. To compensate for this bias, we introduce an additional metric called $\text{mIoU}_{\text{obj}}$ to evaluate object-level performance. Experiments on nuScenes and SemanticKITTI show that TOP outperforms both the supervised training-from-scratch baseline and other self-supervised pre-training baselines by up to 28.77% relative improvement, demonstrating strong transferability across LiDAR setups and generalization to other tasks. Code and pre-trained models will be publicly available upon publication.
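The bias of point-level IoU toward objects with many scanned points can be illustrated with a small sketch. The abstract does not define $\text{mIoU}_{\text{obj}}$, so the function `object_averaged_iou` below is a hypothetical stand-in that scores each object by the fraction of its points recovered and then averages over objects; it is not the paper's metric.

```python
def point_iou(pred, truth):
    """Point-level IoU between predicted and true moving-point indices."""
    pred, truth = set(pred), set(truth)
    union = pred | truth
    return len(pred & truth) / len(union) if union else 1.0


def object_averaged_iou(pred, objects):
    """Hypothetical object-level score: every object counts equally.

    Each object is scored by the fraction of its points that were
    predicted as moving, and the scores are averaged over objects.
    """
    pred = set(pred)
    scores = [len(pred & set(points)) / len(points) for points in objects]
    return sum(scores) / len(scores)


# Toy scene: a nearby object with 1000 points and a distant one with 10.
large = range(0, 1000)
small = range(1000, 1010)
truth = list(large) + list(small)
pred = list(large)  # the small object is missed entirely

print(round(point_iou(pred, truth), 3))            # 0.99
print(object_averaged_iou(pred, [large, small]))   # 0.5
```

In this toy scene the point-level score stays near 1 even though one of the two objects is missed, while the object-averaged score drops to 0.5.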
87. 【2503.07157】MIRAM: Masked Image Reconstruction Across Multiple Scales for Breast Lesion Risk Prediction
Link: https://arxiv.org/abs/2503.07157
Authors: Hung Q. Vo, Pengyu Yuan, Zheng Yin, Kelvin K. Wong, Chika F. Ezeana, Son T. Ly, Stephen T.C. Wong, Hien V. Nguyen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: computer vision communities, garnered substantial interest, Self-supervised learning, vision communities, garnered substantial
Comments:
Abstract:Self-supervised learning (SSL) has garnered substantial interest within the machine learning and computer vision communities. Two prominent approaches in SSL include contrastive-based learning and self-distillation utilizing cropping augmentation. Lately, masked image modeling (MIM) has emerged as a more potent SSL technique, employing image inpainting as a pretext task. MIM creates a strong inductive bias toward meaningful spatial and semantic understanding. This has opened up new opportunities for SSL to contribute not only to classification tasks but also to more complex applications like object detection and image segmentation. Building upon this progress, our research paper introduces a scalable and practical SSL approach centered around more challenging pretext tasks that facilitate the acquisition of robust features. Specifically, we leverage multi-scale image reconstruction from randomly masked input images as the foundation for feature learning. Our hypothesis posits that reconstructing high-resolution images enables the model to attend to finer spatial details, particularly beneficial for discerning subtle intricacies within medical images. The proposed SSL features help improve classification performance on the Curated Breast Imaging Subset of Digital Database for Screening Mammography (CBIS-DDSM) dataset. In pathology classification, our method demonstrates a 3% increase in average precision (AP) and a 1% increase in the area under the receiver operating characteristic curve (AUC) when compared to state-of-the-art (SOTA) algorithms. Moreover, in mass margins classification, our approach achieves a 4% increase in AP and a 2% increase in AUC.
88. 【2503.07152】Controllable 3D Outdoor Scene Generation via Scene Graphs
Link: https://arxiv.org/abs/2503.07152
Authors: Yuheng Liu, Xinke Li, Yuning Zhang, Lu Qi, Xin Li, Wenping Wang, Chongshou Li, Xueting Li, Ming-Hsuan Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Three-dimensional scene generation, spanning autonomous driving, applications spanning autonomous, Three-dimensional scene, scene graphs
Comments: Project Page: https://yuheng.ink/project-page/control-3d-scene/
Abstract:Three-dimensional scene generation is crucial in computer vision, with applications spanning autonomous driving, gaming and the metaverse. Current methods either lack user control or rely on imprecise, non-intuitive conditions. In this work, we propose a method that uses scene graphs, an accessible, user-friendly control format, to generate outdoor 3D scenes. We develop an interactive system that transforms a sparse scene graph into a dense BEV (Bird's Eye View) Embedding Map, which guides a conditional diffusion model to generate 3D scenes that match the scene graph description. During inference, users can easily create or modify scene graphs to generate large-scale outdoor scenes. We create a large-scale dataset with paired scene graphs and 3D semantic scenes to train the BEV embedding and diffusion models. Experimental results show that our approach consistently produces high-quality 3D urban scenes closely aligned with the input scene graphs. To the best of our knowledge, this is the first approach to generate 3D outdoor scenes conditioned on scene graphs.
89. 【2503.07135】VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation
Link: https://arxiv.org/abs/2503.07135
Authors: Hanzhi Chen, Boyang Sun, Anran Zhang, Marc Pollefeys, Stefan Leutenegger
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: versatile systems capable, Future robots, envisioned as versatile, capable of performing, performing a variety
Comments: Accepted to CVPR 2025
Abstract:Future robots are envisioned as versatile systems capable of performing a variety of household tasks. The big question remains: how can we bridge the embodiment gap while minimizing physical robot learning, which fundamentally does not scale well? We argue that learning from in-the-wild human videos offers a promising solution for robotic manipulation tasks, as vast amounts of relevant data already exist on the internet. In this work, we present VidBot, a framework enabling zero-shot robotic manipulation using learned 3D affordance from in-the-wild monocular RGB-only human videos. VidBot leverages a pipeline to extract explicit representations from these videos, namely 3D hand trajectories, combining a depth foundation model with structure-from-motion techniques to reconstruct temporally consistent, metric-scale 3D affordance representations agnostic to embodiments. We introduce a coarse-to-fine affordance learning model that first identifies coarse actions from the pixel space and then generates fine-grained interaction trajectories with a diffusion model, conditioned on coarse actions and guided by test-time constraints for context-aware interaction planning, enabling substantial generalization to novel scenes and embodiments. Extensive experiments demonstrate the efficacy of VidBot, which significantly outperforms counterparts across 13 manipulation tasks in zero-shot settings and can be seamlessly deployed across robot systems in real-world environments. VidBot paves the way for leveraging everyday human videos to make robot learning more scalable.
90. 【2503.07133】A Light Perspective for 3D Object Detection
Link: https://arxiv.org/abs/2503.07133
Authors: Marcelo Eduardo Pederiva, José Mario De Martino, Alessandro Zimmer
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: autonomous vehicle technologies, advancing autonomous vehicle, accurately detecting objects, Comprehending the environment, space are essential
Comments:
Abstract:Comprehending the environment and accurately detecting objects in 3D space are essential for advancing autonomous vehicle technologies. Integrating Camera and LIDAR data has emerged as an effective approach for achieving high accuracy in 3D Object Detection models. However, existing methodologies often rely on heavy, traditional backbones that are computationally demanding. This paper introduces a novel approach that incorporates cutting-edge Deep Learning techniques into the feature extraction process, aiming to create more efficient models without compromising performance. Our model, NextBEV, surpasses established feature extractors like ResNet50 and MobileNetV2. On the KITTI 3D Monocular detection benchmark, NextBEV achieves an accuracy improvement of 2.39%, having less than 10% of the MobileNetV3 parameters. Moreover, we propose changes in LIDAR backbones that decreased the original inference time to 10 ms. Additionally, by fusing these lightweight proposals, we have enhanced the accuracy of the VoxelNet-based model by 2.93% and improved the F1-score of the PointPillar-based model by approximately 20%. Therefore, this work contributes to establishing lightweight and powerful models for individual or fusion techniques, making them more suitable for onboard implementations.
91. 【2503.07125】Learning A Zero-shot Occupancy Network from Vision Foundation Models via Self-supervised Adaptation
Link: https://arxiv.org/abs/2503.07125
Authors: Sihao Lin, Daqi Liu, Ruochong Fu, Dongrui Liu, Andy Song, Hongwei Xie, Zhihui Li, Bing Wang, Xiaojun Chang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: challenging task due, metric depth, fundamental yet challenging, labour-intensive nature, challenging task
Comments: preprint
Abstract:Estimating the 3D world from 2D monocular images is a fundamental yet challenging task due to the labour-intensive nature of 3D annotations. To simplify label acquisition, this work proposes a novel approach that bridges 2D vision foundation models (VFMs) with 3D tasks by decoupling 3D supervision into an ensemble of image-level primitives, e.g., semantic and geometric components. As a key motivator, we leverage the zero-shot capabilities of vision-language models for image semantics. However, the problem is notoriously ill-posed, since multiple distinct 3D scenes can produce identical 2D projections, so directly inferring metric depth from a monocular image in a zero-shot manner is unsuitable. In contrast, 2D VFMs provide promising sources of relative depth, which theoretically aligns with metric depth when properly scaled and offset. Thus, we adapt the relative depth derived from VFMs into metric depth by optimising the scale and offset using temporal consistency, also known as novel view synthesis, without access to ground-truth metric depth. Consequently, we project the semantics into 3D space using the reconstructed metric depth, thereby providing 3D supervision. Extensive experiments on nuScenes and SemanticKITTI demonstrate the effectiveness of our framework. For instance, the proposed method surpasses the current state-of-the-art by 3.34% mIoU on nuScenes for voxel occupancy prediction.
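The statement that relative depth aligns with metric depth once scaled and offset corresponds to the affine model metric = scale * relative + offset. The sketch below only illustrates that model with a closed-form least-squares fit against a handful of reference depths. This is a simplifying assumption for illustration: the paper states that it optimises scale and offset through temporal consistency without ground-truth metric depth, which is not reproduced here. The function name `fit_scale_offset` is hypothetical.

```python
def fit_scale_offset(relative, reference):
    """Least-squares fit of reference ~= scale * relative + offset."""
    n = len(relative)
    mean_r = sum(relative) / n
    mean_m = sum(reference) / n
    # Variance of the relative depths and covariance with the reference.
    var = sum((r - mean_r) ** 2 for r in relative)
    cov = sum((r - mean_r) * (m - mean_m)
              for r, m in zip(relative, reference))
    scale = cov / var
    offset = mean_m - scale * mean_r
    return scale, offset


# Toy data: relative depths and reference depths generated with
# scale 2.0 and offset 3.0 (values chosen only for illustration).
relative = [0.1, 0.4, 0.7, 1.0]
reference = [2.0 * r + 3.0 for r in relative]

scale, offset = fit_scale_offset(relative, reference)
metric = [scale * r + offset for r in relative]
```

With noise-free toy data the fit recovers the generating scale and offset up to floating-point error.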
92. 【2503.07120】Exposure Bias Reduction for Enhancing Diffusion Transformer Feature Caching
Link: https://arxiv.org/abs/2503.07120
Authors: Zhen Zou, Hu Yu, Jie Xiao, Feng Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: high computational complexity, faces great challenges, great challenges due, impressive generation capabilities, exhibited impressive generation
Comments:
Abstract:Diffusion Transformer (DiT) has exhibited impressive generation capabilities but faces great challenges due to its high computational complexity. To address this problem, various methods, notably feature caching, have been introduced. However, these approaches focus on aligning with the non-cached diffusion process without analyzing the impact of caching on the intermediate steps of generation, which leaves room for analysis and improvement. In this paper, we analyze the impact of caching on the SNR of the diffusion process and discern that feature caching intensifies the denoising procedure, and we further identify this as a more severe exposure bias issue. Drawing on this insight, we introduce EB-Cache, a joint cache strategy that aligns with the diffusion process free of exposure bias, which offers a higher performance ceiling. Our approach incorporates a comprehensive understanding of caching mechanisms and offers a novel perspective on leveraging caches to expedite diffusion processes. Empirical results indicate that EB-Cache optimizes model performance while concurrently facilitating acceleration. Specifically, in the 50-step generation process, EB-Cache achieves 1.49$\times$ acceleration with 0.63 FID reduction from 3.69, surpassing prior acceleration methods. Code will be available at this https URL.
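As background for the feature caching mentioned above, here is a minimal sketch of the generic idea: block outputs computed at one denoising step are stored and reused for the next few steps instead of being recomputed. This is not EB-Cache; the refresh schedule, the residual update, and the names `run_with_cache` and `refresh_every` are assumptions made for illustration only.

```python
def run_with_cache(blocks, x, num_steps, refresh_every):
    """Run a stack of residual blocks for num_steps, reusing cached outputs.

    Each block output is recomputed only every `refresh_every` steps and
    reused (stale) in between. Returns the final state and the number of
    block evaluations that were actually performed.
    """
    cache = {}
    calls = 0
    for step in range(num_steps):
        h = x
        for i, block in enumerate(blocks):
            if step % refresh_every == 0 or i not in cache:
                cache[i] = block(h, step)  # fresh computation
                calls += 1
            h = h + cache[i]  # residual update, possibly with a stale output
        x = h
    return x, calls


# Two toy "blocks" standing in for transformer blocks.
blocks = [lambda h, step: 0.01 * h, lambda h, step: -0.005 * h]

_, full_calls = run_with_cache(blocks, 1.0, num_steps=50, refresh_every=1)
_, cached_calls = run_with_cache(blocks, 1.0, num_steps=50, refresh_every=2)
print(full_calls, cached_calls)  # 100 50
```

Reusing stale outputs halves the block evaluations in this toy setting but changes the intermediate states, which is the kind of side effect on the generation process that the abstract sets out to analyze.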
93. 【2503.07115】YOLOMG: Vision-based Drone-to-Drone Detection with Appearance and Pixel-Level Motion Fusion
Link: https://arxiv.org/abs/2503.07115
Authors: Hanqing Guo, Xiuxiu Lin, Shiyu Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: attracted increasing attention, increasing attention due, vision-based swarming, attracted increasing, increasing attention
Comments: 9 pages, 8 figures
Abstract:Vision-based drone-to-drone detection has attracted increasing attention due to its importance in numerous tasks such as vision-based swarming, aerial see-and-avoid, and malicious drone detection. However, existing methods often encounter failures when the background is complex or the target is tiny. This paper proposes a novel end-to-end framework that accurately identifies small drones in complex environments using motion guidance. It starts by creating a motion difference map to capture the motion characteristics of tiny drones. Next, this motion difference map is combined with an RGB image using a bimodal fusion module, allowing for adaptive feature learning of the drone. Finally, the fused feature map is processed through an enhanced backbone and detection head based on the YOLOv5 framework to achieve accurate detection results. To validate our method, we propose a new dataset, named ARD100, which comprises 100 videos (202,467 frames) covering various challenging conditions and has the smallest average object size compared with the existing drone detection datasets. Extensive experiments on the ARD100 and NPS-Drones datasets show that our proposed detector performs exceptionally well under challenging conditions and surpasses state-of-the-art algorithms across various metrics. We publicly release the codes and ARD100 dataset at this https URL.
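A motion difference map can be pictured, in its simplest form, as thresholded frame differencing. The sketch below assumes grayscale frames stored as nested lists and a static camera; the abstract does not say how the paper builds its map (a moving camera would presumably need background motion compensation first), so `motion_difference_map` and its `threshold` parameter are hypothetical.

```python
def motion_difference_map(prev_frame, curr_frame, threshold=10):
    """Absolute per-pixel difference between two grayscale frames.

    Differences below `threshold` are set to 0 to suppress sensor noise.
    """
    return [
        [abs(c - p) if abs(c - p) >= threshold else 0
         for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]


# A bright one-pixel "drone" moves one pixel to the right.
prev_frame = [[50, 50, 50],
              [50, 200, 50],
              [50, 50, 50]]
curr_frame = [[50, 50, 50],
              [50, 50, 200],
              [50, 50, 52]]  # the 52 is small noise, below the threshold

diff = motion_difference_map(prev_frame, curr_frame)
print(diff)  # [[0, 0, 0], [0, 150, 150], [0, 0, 0]]
```

Both the old and the new position of the target light up in the map, which shows how such a map can highlight tiny moving objects that are hard to pick out from appearance alone.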
94. 【2503.07107】Towards Experience Replay for Class-Incremental Learning in Fully-Binary Networks
Link: https://arxiv.org/abs/2503.07107
Authors: Yanis Basso-Bert, Anca Molnos, Romain Lemaire, William Guicquero, Antoine Dupret
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: enable Artificial Neural, Artificial Neural Network, Binary Neural Networks, Artificial Neural, ultra-low power edge
Comments:
Abstract:Binary Neural Networks (BNNs) are a promising approach to enable Artificial Neural Network (ANN) implementation on ultra-low power edge devices. Such devices may compute data in highly dynamic environments, in which the classes targeted for inference can evolve or even novel classes may arise, requiring continual learning. Class Incremental Learning (CIL) is a common type of continual learning for classification problems that has scarcely been addressed in the context of BNNs. Furthermore, most existing BNN models are not fully binary, as they require several real-valued network layers at the input, at the output, and for batch normalization. This paper goes a step further, enabling class incremental learning in Fully-Binarized NNs (FBNNs) through four main contributions. First, we revisit the FBNN design and a training procedure suitable for CIL. Second, we explore loss balancing, a method to trade off the performance of past and current classes. Third, we propose a semi-supervised method to pre-train the feature extractor of the FBNN for transferable representations. Fourth, two conventional CIL methods, i.e., Latent and Native replay, are thoroughly compared. These contributions are exemplified first on the CIFAR100 dataset, before being scaled up to address the CORE50 continual learning benchmark. The final results based on our 3Mb FBNN on CORE50 show performance on par with or better than larger conventional real-valued NN models.
95. 【2503.07101】SimROD: A Simple Baseline for Raw Object Detection with Global and Local Enhancements
Link: https://arxiv.org/abs/2503.07101
Authors: Haiyang Xie, Xi Shen, Shihua Huang, Zheng Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: offers significant advantages, RAW object detection, preserving sensor information, data offers significant, ISP processing
Comments:
Abstract:Most visual models are designed for sRGB images, yet RAW data offers significant advantages for object detection by preserving sensor information before ISP processing. This enables improved detection accuracy and more efficient hardware designs by bypassing the ISP. However, RAW object detection is challenging due to limited training data, unbalanced pixel distributions, and sensor noise. To address this, we propose SimROD, a lightweight and effective approach for RAW object detection. We introduce a Global Gamma Enhancement (GGE) module, which applies a learnable global gamma transformation with only four parameters, improving feature representation while keeping the model efficient. Additionally, we leverage the green channel's richer signal to enhance local details, aligning with the human eye's sensitivity and Bayer filter design. Extensive experiments on multiple RAW object detection datasets and detectors demonstrate that SimROD outperforms state-of-the-art methods like RAW-Adapter and DIAP while maintaining efficiency. Our work highlights the potential of RAW data for real-world object detection.
96. 【2503.07098】OmniSAM: Omnidirectional Segment Anything Model for UDA in Panoramic Semantic Segmentation
Link: https://arxiv.org/abs/2503.07098
Authors: Ding Zhong,Xu Zheng,Chenfei Liao,Yuanhuiyi Lyu,Jialei Chen,Shengyang Wu,Linfeng Zhang,Xuming Hu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: strong base model, pinhole imaging segmentation, imaging segmentation tasks, strong base
Comments:
Click to view abstract
Abstract:Segment Anything Model 2 (SAM2) has emerged as a strong base model in various pinhole imaging segmentation tasks. However, when applying it to the $360^\circ$ domain, the significant field-of-view (FoV) gap between pinhole ($70^\circ \times 70^\circ$) and panoramic images ($180^\circ \times 360^\circ$) poses unique challenges. Two major concerns for this application are: 1) the inevitable distortion and object deformation brought by the large FoV disparity between domains; 2) the lack of pixel-level semantic understanding in the original SAM2. To address these issues, we propose a novel OmniSAM framework, which makes the first attempt to apply SAM2 for panoramic semantic segmentation. Specifically, to bridge the first gap, OmniSAM first divides the panorama into sequences of patches. These patches are then treated as image sequences, in a manner similar to video segmentation tasks. We then leverage SAM2's memory mechanism to extract cross-patch correspondences that embed the cross-FoV dependencies, improving feature continuity and prediction consistency along mask boundaries. For the second gap, OmniSAM fine-tunes the pretrained image encoder and reuses the mask decoder for semantic prediction. An FoV-based prototypical adaptation module with a dynamic pseudo-label update mechanism is also introduced to facilitate the alignment of memory and backbone features, thereby improving model generalization ability across different sizes of source models. Extensive experimental results demonstrate that OmniSAM outperforms the state-of-the-art methods by large margins, e.g., 79.06% (+10.22%) on SPin8-to-SPan8, 62.46% (+6.58%) on CS13-to-DP13.
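The step of dividing a panorama into a sequence of patches can be sketched as a horizontal sliding window. This is only an illustration under stated assumptions: the abstract does not give patch sizes, overlap, or scan direction, so `patch_windows`, `split_panorama`, and the stride handling below are hypothetical.

```python
def patch_windows(width, patch_width, stride):
    """Return (start, end) column ranges for patches sliding left to right."""
    if patch_width > width or stride <= 0:
        raise ValueError("patch_width must fit in width and stride must be positive")
    starts = list(range(0, width - patch_width + 1, stride))
    # Add a final window flush with the right edge if the stride skipped it.
    if starts[-1] + patch_width < width:
        starts.append(width - patch_width)
    return [(s, s + patch_width) for s in starts]


def split_panorama(image, patch_width, stride):
    """Cut a 2D image (list of rows) into patches ordered like video frames."""
    width = len(image[0])
    return [
        [row[start:end] for row in image]
        for start, end in patch_windows(width, patch_width, stride)
    ]


frames = split_panorama([[0, 1, 2, 3], [4, 5, 6, 7]], patch_width=2, stride=2)
```

Ordering the patches left to right is what lets a video-style memory mechanism carry context from one patch to the next.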
97. 【2503.07091】FaceID-6M: A Large-Scale, Open-Source FaceID Customization Dataset
Link: https://arxiv.org/abs/2503.07091
Authors: Shuhe Wang,Xiaoya Li,Jiwei Li,Guoyin Wang,Xiaofei Sun,Bob Zhu,Han Qiu,Mo Yu,Shengjie Shen,Eduard Hovy
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: current face identity, high-quality text-image pairs, data-driven nature, nature of current, text-image pairs
Comments:
Click to view abstract
Abstract:Due to the data-driven nature of current face identity (FaceID) customization methods, all state-of-the-art models rely on large-scale datasets containing millions of high-quality text-image pairs for training. However, none of these datasets are publicly available, which restricts transparency and hinders further advancements in the field. To address this issue, in this paper, we collect and release FaceID-6M, the first large-scale, open-source FaceID dataset containing 6 million high-quality text-image pairs. Filtered from LAION-5B (Schuhmann et al., 2022), FaceID-6M undergoes rigorous image and text filtering steps to ensure dataset quality, including resolution filtering to maintain high-quality images and faces, face filtering to remove images that lack human faces, and a keyword-based strategy to retain descriptions containing human-related terms (e.g., nationality, professions and names). Through these cleaning processes, FaceID-6M provides a high-quality dataset optimized for training powerful FaceID customization models, facilitating advancements in the field by offering an open resource for research and development. We conduct extensive experiments to show the effectiveness of our FaceID-6M, demonstrating that models trained on our FaceID-6M dataset achieve performance that is comparable to, and slightly better than, currently available industrial models. Additionally, to support and advance research in the FaceID customization community, we make our code, datasets, and models fully publicly available. Our codes, models, and datasets are available at: this https URL.
Comments: arXiv admin note: text overlap with arXiv:2501.15407
98. 【2503.07085】RS2V-L: Vehicle-Mounted LiDAR Data Generation from Roadside Sensor Observations
Link: https://arxiv.org/abs/2503.07085
Authors: Ruidan Xing,Runyi Huang,Qing Xu,Lei He
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: refined control commands, process multi-modal sensory, directly generate refined, generate refined control, multi-modal sensory data
Comments: 7 pages, 4 figures
Click to view abstract
Abstract:End-to-end autonomous driving solutions, which process multi-modal sensory data to directly generate refined control commands, have become a dominant paradigm in autonomous driving research. However, these approaches predominantly depend on single-vehicle data collection for model training and optimization, resulting in significant challenges such as high data acquisition and annotation costs, the scarcity of critical driving scenarios, and fragmented datasets that impede model generalization. To mitigate these limitations, we introduce RS2V-L, a novel framework for reconstructing and synthesizing vehicle-mounted LiDAR data from roadside sensor observations. Specifically, our method transforms roadside LiDAR point clouds into the vehicle-mounted LiDAR coordinate system by leveraging the target vehicle's relative pose. Subsequently, high-fidelity vehicle-mounted LiDAR data is synthesized through virtual LiDAR modeling, point cloud classification, and resampling techniques. To the best of our knowledge, this is the first approach to reconstruct vehicle-mounted LiDAR data from roadside sensor inputs. Extensive experimental evaluations demonstrate that incorporating the generated data into model training, complementing the KITTI dataset, enhances 3D object detection accuracy by over 30% while improving the efficiency of end-to-end autonomous driving data generation by more than an order of magnitude. These findings strongly validate the effectiveness of the proposed method and underscore its potential in reducing dependence on costly vehicle-mounted data collection while improving the robustness of autonomous driving models.
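The coordinate change described here, from the roadside LiDAR frame into the vehicle-mounted frame using the vehicle's relative pose, is a rigid transform. The sketch below assumes a simplified pose (position plus yaw about the vertical axis); the pose format and the name `roadside_to_vehicle` are illustrative assumptions, and the paper's virtual LiDAR modeling and resampling steps are not covered.

```python
import math


def roadside_to_vehicle(points, vehicle_pose):
    """Express roadside-frame points in the vehicle frame.

    points:       iterable of (x, y, z) in the roadside sensor frame.
    vehicle_pose: (x, y, z, yaw) of the vehicle in the roadside frame,
                  yaw in radians about the z axis (assumed pose format).
    """
    tx, ty, tz, yaw = vehicle_pose
    c, s = math.cos(yaw), math.sin(yaw)
    out = []
    for x, y, z in points:
        dx, dy, dz = x - tx, y - ty, z - tz
        # Apply the inverse rotation R(yaw)^T to the translated point.
        out.append((c * dx + s * dy, -s * dx + c * dy, dz))
    return out


# A point one meter ahead of a vehicle that faces the roadside +y direction.
local = roadside_to_vehicle([(0.0, 1.0, 0.0)], (0.0, 0.0, 0.0, math.pi / 2))
```

A full implementation would use a complete 6-DoF pose (roll and pitch included), typically as a 4x4 homogeneous matrix.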
99. 【2503.07082】On the Generalization of Representation Uncertainty in Earth Observation
Link: https://arxiv.org/abs/2503.07082
Authors: Spyros Kondylatos,Nikolaos Ioannis Bountos,Dimitrios Michail,Xiao Xiang Zhu,Gustau Camps-Valls,Ioannis Papoutsis
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Computer Vision, Recent advances, advances in Computer, Vision have introduced, enabling zero-shot uncertainty
Comments: 18 pages
Click to view abstract
Abstract:Recent advances in Computer Vision have introduced the concept of pretrained representation uncertainty, enabling zero-shot uncertainty estimation. This holds significant potential for Earth Observation (EO), where trustworthiness is critical, yet the complexity of EO data poses challenges to uncertainty-aware methods. In this work, we investigate the generalization of representation uncertainty in EO, considering the domain's unique semantic characteristics. We pretrain uncertainties on large EO datasets and propose an evaluation framework to assess their zero-shot performance in multi-label classification and segmentation EO tasks. Our findings reveal that, unlike uncertainties pretrained on natural images, EO-pretraining exhibits strong generalization across unseen EO domains, geographic locations, and target granularities, while maintaining sensitivity to variations in ground sampling distance. We demonstrate the practical utility of pretrained uncertainties showcasing their alignment with task-specific uncertainties in downstream tasks, their sensitivity to real-world EO image noise, and their ability to generate spatial uncertainty estimates out-of-the-box. Initiating the discussion on representation uncertainty in EO, our study provides insights into its strengths and limitations, paving the way for future research in the field. Code and weights are available at: this https URL.
100. 【2503.07076】NFIG: Autoregressive Image Generation with Next-Frequency Prediction
Link: https://arxiv.org/abs/2503.07076
Authors: Zhihao Huang,Xi Qiu,Yukuo Ma,Yifu Zhou,Chi Zhang,Xuelong Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: achieved promising results, natural language processing, language processing, models have achieved
Comments: 10 pages, 7 figures, 2 tables
Click to view abstract
Abstract:Autoregressive models have achieved promising results in natural language processing. However, for image generation tasks, they encounter substantial challenges in effectively capturing long-range dependencies, managing computational costs, and most crucially, defining meaningful autoregressive sequences that reflect natural image hierarchies. To address these issues, we present Next-Frequency Image Generation (NFIG), a novel framework that decomposes the image generation process into multiple frequency-guided stages. Our approach first generates low-frequency components to establish global structure with fewer tokens, then progressively adds higher-frequency details, following the natural spectral hierarchy of images. This principled autoregressive sequence not only improves the quality of generated images by better capturing true causal relationships between image components, but also significantly reduces computational overhead during inference. Extensive experiments demonstrate that NFIG achieves state-of-the-art performance with fewer steps, offering a more efficient solution for image generation, with a 1.25× speedup compared to VAR-d20 while achieving better performance (FID: 2.81) on the ImageNet-256 benchmark. We hope that our insight of incorporating frequency-domain knowledge to guide autoregressive sequence design will shed light on future research. We will make our code publicly available upon acceptance of the paper.
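The idea of building a signal from low frequencies first and adding higher-frequency detail afterwards can be shown on a toy 1D signal. The sketch below uses a plain discrete Fourier transform to split a signal into bands that sum back to the original; it is not the paper's tokenizer or generator, and the function names and band cutoffs are assumptions chosen for illustration.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def frequency_bands(signal, cutoffs):
    """Split a signal into bands ordered from low to high frequency.

    Band i keeps bins whose frequency f = min(k, n - k) satisfies
    previous_cutoff < f <= cutoffs[i]; the first band starts at f = 0.
    """
    n = len(signal)
    spectrum = dft(signal)
    bands, low = [], -1
    for high in cutoffs:
        masked = [spectrum[k] if low < min(k, n - k) <= high else 0
                  for k in range(n)]
        bands.append(idft(masked))
        low = high
    return bands


signal = [1.0, 3.0, 2.0, 5.0, 4.0, 0.0, 2.0, 1.0]
bands = frequency_bands(signal, cutoffs=[0, 2, 4])
# Adding bands in order refines the coarse structure toward the full signal.
progressive = [sum(values) for values in zip(*bands)]
```

The first band here is just the mean of the signal, which mirrors the abstract's point that global structure can be captured with very little information.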
101. 【2503.07075】XR-VLM: Cross-Relationship Modeling with Multi-part Prompts and Visual Features for Fine-Grained Recognition
Link: https://arxiv.org/abs/2503.07075
Authors: Chuanming Wang,Henming Mao,Huanhuan Zhang,Huiyuan Fu,Huadong Ma
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: demonstrated impressive performance, achieve optimal performance, impressive performance, downstream tasks, optimal performance
Comments:
Click to view abstract
Abstract:Vision-Language Models (VLMs) have demonstrated impressive performance on various visual tasks, yet they still require adaptation to downstream tasks to achieve optimal performance. Recently, various adaptation technologies have been proposed, but we observe they often underperform in fine-grained visual recognition, which requires models to capture subtle yet discriminative features to distinguish similar sub-categories. Current adaptation methods typically rely on an alignment-based prediction framework, i.e., the visual feature is compared with each class prompt for similarity calculation as the final prediction, which lacks class interaction during the forward pass. Besides, learning a single uni-modal feature further restricts the model's expressive capacity. Therefore, we propose a novel mechanism, XR-VLM, to discover subtle differences by modeling cross-relationships, which specifically excels in scenarios involving multiple features. Our method introduces a unified multi-part visual feature extraction module designed to seamlessly integrate with the diverse backbones inherent in VLMs. Additionally, we develop a multi-part prompt learning module to capture multi-perspective descriptions of sub-categories. To further enhance discriminative capability, we propose a cross-relationship modeling pattern that combines the visual feature with all class prompt features, enabling a deeper exploration of the relationships between these two modalities. Extensive experiments have been conducted on various fine-grained datasets, and the results demonstrate that our method achieves significant improvements compared to current state-of-the-art approaches. Code will be released.
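The alignment-based prediction framework that this abstract criticizes can be sketched in a few lines: the visual feature is scored against each class prompt independently by cosine similarity, and the scores are normalized into a prediction. The feature vectors, the temperature value, and the name `alignment_predict` below are illustrative assumptions; this shows the baseline scheme, not XR-VLM itself.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def alignment_predict(visual_feature, class_prompts, temperature=0.01):
    """Return (predicted class index, class probabilities)."""
    # Each class score depends only on its own prompt: no class interaction.
    logits = [cosine(visual_feature, p) / temperature for p in class_prompts]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    return max(range(len(probs)), key=probs.__getitem__), probs


index, probs = alignment_predict(
    [1.0, 0.0],
    [[0.0, 1.0], [2.0, 0.1], [-1.0, 0.0]],
)
```

Since the classes only meet in the final normalization, two nearly identical sub-categories receive nearly identical scores, which is the weakness the abstract attributes to this design.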
102. 【2503.07065】Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning
Link: https://arxiv.org/abs/2503.07065
Authors: Huilin Deng,Ding Zou,Rui Ma,Hongchen Luo,Yang Cao,Yu Kang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: success heavily relies, demonstrated remarkable capabilities, massive model scaling, Curriculum Reinforcement Finetuning, Rejected Sampling-based Self-improvement
Comments:
Click to view abstract
Abstract:While state-of-the-art vision-language models (VLMs) have demonstrated remarkable capabilities in complex visual-text tasks, their success heavily relies on massive model scaling, limiting their practical deployment. Small-scale VLMs offer a more practical alternative but face significant challenges when trained with traditional supervised fine-tuning (SFT), particularly in two aspects: out-of-domain (OOD) generalization and reasoning abilities, which lag significantly behind those of contemporary large language models (LLMs). To address these challenges, we propose Curriculum Reinforcement Finetuning (Curr-ReFT), a novel post-training paradigm specifically designed for small-scale VLMs. Inspired by the success of reinforcement learning in LLMs, Curr-ReFT comprises two sequential stages: (1) Curriculum Reinforcement Learning, which ensures steady progression of model capabilities through difficulty-aware reward design, transitioning from basic visual perception to complex reasoning tasks; and (2) Rejected Sampling-based Self-improvement, which maintains the fundamental capabilities of VLMs through selective learning from high-quality multimodal and language examples. Extensive experiments demonstrate that models trained with the Curr-ReFT paradigm achieve state-of-the-art performance across various visual tasks in both in-domain and out-of-domain settings. Moreover, our Curr-ReFT enhanced 3B model matches the performance of 32B-parameter models, demonstrating that efficient training paradigms can effectively bridge the gap between small and large models.
103. 【2503.07058】Breaking the Limits of Quantization-Aware Defenses: QADT-R for Robustness Against Patch-Based Adversarial Attacks in QNNs
Link: https://arxiv.org/abs/2503.07058
Authors: Amira Guesmi,Bassem Ouni,Muhammad Shafique
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Quantized Neural Networks, Neural Networks, reducing model size, Quantized Neural, computational costs
Comments:
Click to view abstract
Abstract:Quantized Neural Networks (QNNs) have emerged as a promising solution for reducing model size and computational costs, making them well-suited for deployment in edge and resource-constrained environments. While quantization is known to disrupt gradient propagation and enhance robustness against pixel-level adversarial attacks, its effectiveness against patch-based adversarial attacks remains largely unexplored. In this work, we demonstrate that adversarial patches remain highly transferable across quantized models, achieving over 70% attack success rates (ASR) even at extreme bit-width reductions (e.g., 2-bit). This challenges the common assumption that quantization inherently mitigates adversarial threats. To address this, we propose Quantization-Aware Defense Training with Randomization (QADT-R), a novel defense strategy that integrates Adaptive Quantization-Aware Patch Generation (A-QAPA), Dynamic Bit-Width Training (DBWT), and Gradient-Inconsistent Regularization (GIR) to enhance resilience against highly transferable patch-based attacks. A-QAPA generates adversarial patches within quantized models, ensuring robustness across different bit-widths. DBWT introduces bit-width cycling during training to prevent overfitting to a specific quantization setting, while GIR injects controlled gradient perturbations to disrupt adversarial optimization. Extensive evaluations on CIFAR-10 and ImageNet show that QADT-R reduces ASR by up to 25% compared to prior defenses such as PBAT and DWQ. Our findings further reveal that PBAT-trained models, while effective against seen patch configurations, fail to generalize to unseen patches due to quantization shift. Additionally, our empirical analysis of gradient alignment, spatial sensitivity, and patch visibility provides insights into the mechanisms that contribute to the high transferability of patch-based attacks in QNNs.
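Bit-width cycling during training can be pictured as rotating the quantization setting from step to step. The sketch below pairs a simple uniform quantizer with a round-robin schedule; the quantizer, the bit-widths (8, 4, 2), and both function names are assumptions for illustration, since the abstract does not specify how DBWT selects or applies bit-widths.

```python
from itertools import cycle, islice


def quantize_uniform(weights, bits):
    """Uniformly quantize values, clipped to [-1, 1], onto 2**bits levels."""
    levels = 2 ** bits - 1
    out = []
    for w in weights:
        w = min(max(w, -1.0), 1.0)
        step_index = round((w + 1.0) / 2.0 * levels)
        out.append(step_index / levels * 2.0 - 1.0)
    return out


def bit_width_schedule(num_steps, bit_widths=(8, 4, 2)):
    """Round-robin bit-widths, one per training step (hypothetical schedule)."""
    return list(islice(cycle(bit_widths), num_steps))


weights = [-0.9, -0.2, 0.3, 0.75]
for step, bits in enumerate(bit_width_schedule(6)):
    # A real training loop would run the forward and backward pass with these
    # quantized weights; here we only show the setting changing per step.
    quantized = quantize_uniform(weights, bits)
```

Exposing the model to several quantization settings is the stated motivation: it should avoid overfitting the defense to one bit-width.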
104. 【2503.07050】TIDE: Temporal-Aware Sparse Autoencoders for Interpretable Diffusion Transformers in Image Generation
Link: https://arxiv.org/abs/2503.07050
Authors: Victor Shea-Jay Huang,Le Zhuo,Yi Xin,Zhaokai Wang,Peng Gao,Hongsheng Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Keywords: Interpretable Diffusion transformErs, Diffusion Transformers, Temporal-aware Sparse Autoencoders, Sparse Autoencoders, powerful yet underexplored
Comments:
Click to view abstract
Abstract:Diffusion Transformers (DiTs) are a powerful yet underexplored class of generative models compared to U-Net-based diffusion models. To bridge this gap, we introduce TIDE (Temporal-aware Sparse Autoencoders for Interpretable Diffusion transformErs), a novel framework that enhances temporal reconstruction within DiT activation layers across denoising steps. TIDE employs Sparse Autoencoders (SAEs) with a sparse bottleneck layer to extract interpretable and hierarchical features, revealing that diffusion models inherently learn hierarchical features at multiple levels (e.g., 3D, semantic, class) during generative pre-training. Our approach achieves state-of-the-art reconstruction performance, with a mean squared error (MSE) of 1e-3 and a cosine similarity of 0.97, demonstrating superior accuracy in capturing activation dynamics along the denoising trajectory. Beyond interpretability, we showcase TIDE's potential in downstream applications such as sparse activation-guided image editing and style transfer, enabling improved controllability for generative systems. By providing a comprehensive training and evaluation protocol tailored for DiTs, TIDE contributes to developing more interpretable, transparent, and trustworthy generative models.
105. 【2503.07047】Recovering Partially Corrupted Major Objects through Tri-modality Based Image Completion
Link: https://arxiv.org/abs/2503.07047
Authors: Yongle Zhang,Yimin Liu,Qiang Wu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: prompts commonly employed, ensure semantic coherence, image completion tasks, providing high-level guidance, text prompts commonly
Comments: 17 pages, 6 page supplementary
Click to view abstract
Abstract:Diffusion models have become widely adopted in image completion tasks, with text prompts commonly employed to ensure semantic coherence by providing high-level guidance. However, a persistent challenge arises when an object is partially obscured in the damaged region, yet its remaining parts are still visible in the background. While text prompts offer semantic direction, they often fail to precisely recover fine-grained structural details, such as the object's overall posture, ensuring alignment with the visible object information in the background. This limitation stems from the inability of text prompts to provide pixel-level specificity. To address this, we propose supplementing text-based guidance with a novel visual aid: a casual sketch, which can be roughly drawn by anyone based on visible object parts. This sketch supplies critical structural cues, enabling the generative model to produce an object structure that seamlessly integrates with the existing background. We introduce the Visual Sketch Self-Aware (VSSA) model, which integrates the casual sketch into each iterative step of the diffusion process, offering distinct advantages for partially corrupted scenarios. By blending sketch-derived features with those of the corrupted image, and leveraging text prompt guidance, the VSSA assists the diffusion model in generating images that preserve both the intended object semantics and structural consistency across the restored objects and original regions. To support this research, we created two datasets, CUB-sketch and MSCOCO-sketch, each combining images, sketches, and text. Extensive qualitative and quantitative experiments demonstrate that our approach outperforms several state-of-the-art methods.
106. 【2503.07046】MambaFlow: A Mamba-Centric Architecture for End-to-End Optical Flow Estimation
Link: https://arxiv.org/abs/2503.07046
Authors: Juntian Du,Yuan Sun,Zhihu Zhou,Pinyi Chen,Runzhe Zhang,Keji Mao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Transformer powerful global, demonstrated impressive performance, global modeling capabilities, powerful global modeling, recently proposed top-performing
Comments:
Click to view abstract
Abstract:Optical flow estimation based on deep learning, particularly the recently proposed top-performing methods that incorporate the Transformer, has demonstrated impressive performance, due to the Transformer's powerful global modeling capabilities. However, the quadratic computational complexity of the attention mechanism in Transformers results in time-consuming training and inference. To alleviate these issues, we propose a novel MambaFlow framework that leverages the high accuracy and efficiency of the Mamba architecture to capture features with local correlation while preserving global information, achieving remarkable performance. To the best of our knowledge, the proposed method is the first Mamba-centric architecture for end-to-end optical flow estimation. It comprises two primary components, both of which are Mamba-centric: a feature enhancement Mamba (FEM) module designed to optimize feature representation quality and a flow propagation Mamba (FPM) module engineered to address occlusion issues by facilitating effective flow information dissemination. Extensive experiments demonstrate that our approach achieves state-of-the-art results, despite encountering occluded regions. On the Sintel benchmark, MambaFlow achieves an EPE all of 1.60, surpassing the leading 1.74 of GMFlow. Additionally, MambaFlow significantly improves inference speed with a runtime of 0.113 seconds, making it 18% faster than GMFlow. The source code will be made publicly available upon acceptance of the paper.
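The EPE figure quoted above is the end-point error, the standard optical flow metric: the Euclidean distance between predicted and ground-truth flow vectors, averaged over pixels. A minimal sketch follows; the flat list-of-pairs input format and the name `endpoint_error` are assumptions made to keep the example self-contained.

```python
import math


def endpoint_error(pred_flow, gt_flow):
    """Mean Euclidean distance between predicted and true (u, v) flow vectors."""
    if len(pred_flow) != len(gt_flow) or not pred_flow:
        raise ValueError("flows must be non-empty and the same length")
    total = sum(
        math.hypot(pu - gu, pv - gv)
        for (pu, pv), (gu, gv) in zip(pred_flow, gt_flow)
    )
    return total / len(pred_flow)


# One pixel is off by a (3, 4) displacement, the other is exact.
epe = endpoint_error([(3.0, 4.0), (0.0, 0.0)], [(0.0, 0.0), (0.0, 0.0)])
```

Benchmarks such as Sintel report this value over all pixels ("EPE all") as well as over subsets such as occluded regions, so lower is better.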
107. 【2503.07038】Find your Needle: Small Object Image Retrieval via Multi-Object Attention Optimization
Link: https://arxiv.org/abs/2503.07038
Authors: Michael Green,Matan Levy,Issar Tzachor,Dvir Samuel,Nir Darshan,Rami Ben-Ari
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: specific small object, Small Object Image, Small Object, specific small, cluttered scene
Comments:
Click to view abstract
Abstract:We address the challenge of Small Object Image Retrieval (SoIR), where the goal is to retrieve images containing a specific small object, in a cluttered scene. The key challenge in this setting is constructing a single image descriptor, for scalable and efficient search, that effectively represents all objects in the image. In this paper, we first analyze the limitations of existing methods on this challenging task and then introduce new benchmarks to support SoIR evaluation. Next, we introduce Multi-object Attention Optimization (MaO), a novel retrieval framework which incorporates a dedicated multi-object pre-training phase. This is followed by a refinement process that leverages attention-based feature extraction with object masks, integrating them into a single unified image descriptor. Our MaO approach significantly outperforms existing retrieval methods and strong baselines, achieving notable improvements in both zero-shot and lightweight multi-object fine-tuning. We hope this work will lay the groundwork and inspire further research to enhance retrieval performance for this highly practical task.
108. 【2503.07037】Zero-Shot Hashing Based on Reconstruction With Part Alignment
Link: https://arxiv.org/abs/2503.07037
Authors: Yan Jiang,Zhongmiao Qi,Jianhao Li,Jiangbo Qian,Chong Wang,Yu Xin
Categories: Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Keywords: Zero-shot hashing algorithms, large-scale image retrieval, unseen class data, Hashing algorithms, class data
Comments:
Click to view abstract
Abstract:Hashing algorithms have been widely used in large-scale image retrieval tasks, especially for seen class data. Zero-shot hashing algorithms have been proposed to handle unseen class data. The key technique in these algorithms involves learning features from seen classes and transferring them to unseen classes, that is, aligning the feature embeddings between the seen and unseen classes. Most existing zero-shot hashing algorithms use the shared attributes between the two classes of interest to complete alignment tasks. However, the attributes are always described for a whole image, even though they represent specific parts of the image. Hence, these methods ignore the importance of aligning attributes with the corresponding image parts, which explicitly introduces noise and reduces the accuracy achieved when aligning the features of seen and unseen classes. To address this problem, we propose a new zero-shot hashing method called RAZH. We first use a clustering algorithm to group similar patches to image parts for attribute matching and then replace the image parts with the corresponding attribute vectors, gradually aligning each part with its nearest attribute. Extensive evaluation results demonstrate the superiority of the RAZH method over several state-of-the-art methods.
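For readers unfamiliar with hashing-based retrieval, the reason it scales is that images are stored as short binary codes and compared by Hamming distance. The sketch below shows that retrieval step only, with codes held as Python integers; it says nothing about how RAZH learns its codes, and the function names are assumptions.

```python
def hamming_distance(a, b):
    """Number of differing bits between two binary codes stored as ints."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query_code, database_codes):
    """Return database indices sorted from nearest to farthest code."""
    return sorted(
        range(len(database_codes)),
        key=lambda i: hamming_distance(query_code, database_codes[i]),
    )


# 4-bit codes: the exact match ranks first, the bitwise complement ranks last.
ranking = rank_by_hamming(0b1010, [0b0101, 0b1011, 0b1010])
```

In the zero-shot setting the difficulty lies upstream of this step, in producing codes for unseen classes that still land near their true neighbors.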
109. 【2503.07035】Universal Incremental Learning: Mitigating Confusion from Inter- and Intra-task Distribution Randomness
Link: https://arxiv.org/abs/2503.07035
Authors: Sheng Luo,Yi Zhou,Tao Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: overcome catastrophic forgetting, Universal Incremental Learning, Incremental learning, aims to overcome, overcome catastrophic
Comments: 10 pages, 4 figures, 4 tables
Click to view abstract
Abstract:Incremental learning (IL) aims to overcome catastrophic forgetting of previous tasks while learning new ones. Existing IL methods make strong assumptions that the incoming task type will either only increase new classes or domains (i.e., Class IL, Domain IL), or increase by a static scale in a class- and domain-agnostic manner (i.e., Versatile IL (VIL)), which greatly limits their applicability in the unpredictable and dynamic wild. In this work, we investigate Universal Incremental Learning (UIL), where a model knows neither which new classes or domains will increase along sequential tasks, nor the scale of the increments within each task. This uncertainty prevents the model from confidently learning knowledge from all task distributions and symmetrically focusing on the diverse knowledge within each task distribution. Consequently, UIL presents a more general and realistic IL scenario, making the model face confusion arising from inter-task and intra-task distribution randomness. To mitigate both kinds of confusion, we propose a simple yet effective framework for UIL, named MiCo (Mitigate Confusion). At the inter-task distribution level, we employ a multi-objective learning scheme to enforce accurate and deterministic predictions, and its effectiveness is further enhanced by a direction recalibration module that reduces conflicting gradients. Moreover, at the intra-task distribution level, we introduce a magnitude recalibration module to alleviate asymmetrical optimization towards imbalanced class distribution. Extensive experiments on three benchmarks demonstrate the effectiveness of our method, outperforming existing state-of-the-art methods in both the UIL scenario and the VIL scenario. Our code will be available at this https URL.
110. 【2503.07033】Learning a Unified Degradation-aware Representation Model for Multi-modal Image Fusion
Link: https://arxiv.org/abs/2503.07033
Authors: Haolong Ma,Hui Li,Chunyang Cheng,Zeyang Zhang,Xiaoning Song,Xiao-Jun Wu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: generating high-quality fused, multi-modal image fusion, address complex scenes, high-quality fused images, image fusion
Comments:
Click to view abstract
Abstract:All-in-One Degradation-Aware Fusion Models (ADFMs), a class of multi-modal image fusion models, address complex scenes by mitigating degradations from source images and generating high-quality fused images. Mainstream ADFMs often rely on highly synthetic multi-modal multi-quality images for supervision, limiting their effectiveness in cross-modal and rare degradation scenarios. The inherent relationship among these multi-modal, multi-quality images of the same scene provides explicit supervision for training, but also gives rise to the above problems. To address these limitations, we present LURE, a Learning-driven Unified Representation model for infrared and visible Image Fusion, which is degradation-aware. LURE decouples multi-modal multi-quality data at the data level and recouples this relationship in a unified latent feature space (ULFS) by proposing a novel unified loss. This decoupling circumvents data-level limitations of prior models and allows leveraging real-world restoration datasets for training high-quality degradation-aware models, sidestepping the above issues. To enhance text-image interaction, we refine the interaction and residual structures via Text-Guided Attention (TGA) and an inner residual structure. These enhance the text's spatial perception of images and preserve more visual details. Experiments show our method outperforms state-of-the-art (SOTA) methods across general fusion, degradation-aware fusion, and downstream tasks. The code will be publicly available.
111. 【2503.07032】Multimodal Human-AI Synergy for Medical Imaging Quality Control: A Hybrid Intelligence Framework with Adaptive Dataset Curation and Closed-Loop Evaluation
Link: https://arxiv.org/abs/2503.07032
Authors: Zhi Qin,Qianhui Gui,Mouxiao Bian,Rui Wang,Hong Ge,Dandan Yao,Ziying Sun,Yuan Zhao,Yu Zhang,Hui Shi,Dongdong Wang,Chenxin Song,Shenghong Ju,Lihao Liu,Junjun He,Jie Xu,Yuan-Cheng Wang
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: methods remain labor-intensive, imaging quality control, Medical imaging quality, Medical imaging, accurate diagnosis
Comments:
Click to view abstract
Abstract:Medical imaging quality control (QC) is essential for accurate diagnosis, yet traditional QC methods remain labor-intensive and subjective. To address this challenge, in this study, we establish a standardized dataset and evaluation framework for medical imaging QC, systematically assessing large language models (LLMs) in image quality assessment and report standardization. Specifically, we first constructed and anonymized a dataset of 161 chest X-ray (CXR) radiographs and 219 CT reports for evaluation. Then, multiple LLMs, including Gemini 2.0-Flash, GPT-4o, and DeepSeek-R1, were evaluated based on recall, precision, and F1 score to detect technical errors and inconsistencies. Experimental results show that Gemini 2.0-Flash achieved a Macro F1 score of 90 in CXR tasks, demonstrating strong generalization but limited fine-grained performance. DeepSeek-R1 excelled in CT report auditing with a 62.23% recall rate, outperforming other models. However, its distilled variants performed poorly, while InternLM2.5-7B-chat exhibited the highest additional discovery rate, indicating broader but less precise error detection. These findings highlight the potential of LLMs in medical imaging QC, with DeepSeek-R1 and Gemini 2.0-Flash demonstrating superior performance.
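The macro F1 score used in this evaluation averages the per-class F1 without weighting by class frequency, so rare error types count as much as common ones. A minimal sketch of the standard definition follows; the labels in the example are invented, and the result is on a 0 to 1 scale, whereas the abstract appears to quote a percentage-style value.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical QC labels: one acceptable image is wrongly flagged as an error.
score = macro_f1(["ok", "ok", "err", "err"], ["ok", "err", "err", "err"])
```

In this example the "err" class has F1 of 0.8 and the "ok" class about 0.667, so the macro average is roughly 0.733.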
112. 【2503.07029】Availability-aware Sensor Fusion via Unified Canonical Space for 4D Radar, LiDAR, and Camera
Link: https://arxiv.org/abs/2503.07029
Authors: Dong-Hee Paek, Seung-Hyun Kong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Radar has brought, autonomous driving, brought a significant, sensor degradation, Sensor
Comments: arXiv preprint
Abstract: Sensor fusion of camera, LiDAR, and 4-dimensional (4D) Radar has brought a significant performance improvement in autonomous driving (AD). However, there still exist fundamental challenges: deeply coupled fusion methods assume continuous sensor availability, making them vulnerable to sensor degradation and failure, whereas sensor-wise cross-attention fusion methods struggle with computational cost and unified feature representation. This paper presents availability-aware sensor fusion (ASF), a novel method that employs unified canonical projection (UCP) to enable consistency in all sensor features for fusion and cross-attention across sensors along patches (CASAP) to enhance robustness of sensor fusion against sensor degradation and failure. As a result, the proposed ASF shows a superior object detection performance to the existing state-of-the-art fusion methods under various weather and sensor degradation (or failure) conditions. Extensive experiments on the K-Radar dataset demonstrate that ASF achieves improvements of 9.7% in AP BEV (87.2%) and 20.1% in AP 3D (73.6%) in object detection at IoU=0.5, while requiring a low computational cost. The code will be available at this https URL.
113. 【2503.07027】EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer
Link: https://arxiv.org/abs/2503.07027
Authors: Yuxuan Zhang, Yirui Yuan, Yiren Song, Haofan Wang, Jiaming Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: introduced effective spatial, Unet-based diffusion models, Recent advancements, advancements in Unet-based, Unet-based diffusion
Comments:
Abstract:Recent advancements in Unet-based diffusion models, such as ControlNet and IP-Adapter, have introduced effective spatial and subject control mechanisms. However, the DiT (Diffusion Transformer) architecture still struggles with efficient and flexible control. To tackle this issue, we propose EasyControl, a novel framework designed to unify condition-guided diffusion transformers with high efficiency and flexibility. Our framework is built on three key innovations. First, we introduce a lightweight Condition Injection LoRA Module. This module processes conditional signals in isolation, acting as a plug-and-play solution. It avoids modifying the base model weights, ensuring compatibility with customized models and enabling the flexible injection of diverse conditions. Notably, this module also supports harmonious and robust zero-shot multi-condition generalization, even when trained only on single-condition data. Second, we propose a Position-Aware Training Paradigm. This approach standardizes input conditions to fixed resolutions, allowing the generation of images with arbitrary aspect ratios and flexible resolutions. At the same time, it optimizes computational efficiency, making the framework more practical for real-world applications. Third, we develop a Causal Attention Mechanism combined with the KV Cache technique, adapted for conditional generation tasks. This innovation significantly reduces the latency of image synthesis, improving the overall efficiency of the framework. Through extensive experiments, we demonstrate that EasyControl achieves exceptional performance across various application scenarios. These innovations collectively make our framework highly efficient, flexible, and suitable for a wide range of tasks.
114. 【2503.07026】Erase Diffusion: Empowering Object Removal Through Calibrating Diffusion Pathways
Link: https://arxiv.org/abs/2503.07026
Authors: Yi Liu, Hao Zhou, Wenxiang Shang, Ran Lin, Benlei Cui
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: precisely remove target, remove target objects, object removal, aims to precisely, precisely remove
Comments: Accepted by CVPR 2025
Abstract: Erase inpainting, or object removal, aims to precisely remove target objects within masked regions while preserving the overall consistency of the surrounding content. Although diffusion-based methods have made significant strides in the field of image inpainting, challenges remain regarding the emergence of unexpected objects or artifacts. We assert that the inexact diffusion pathways established by existing standard optimization paradigms constrain the efficacy of object removal. To tackle these challenges, we propose a novel Erase Diffusion, termed EraDiff, aimed at unleashing the potential power of standard diffusion in the context of object removal. In contrast to standard diffusion, the EraDiff adapts both the optimization paradigm and the network to improve the coherence and elimination of the erasure results. We first introduce a Chain-Rectifying Optimization (CRO) paradigm, a sophisticated diffusion process specifically designed to align with the objectives of erasure. This paradigm establishes innovative diffusion transition pathways that simulate the gradual elimination of objects during optimization, allowing the model to accurately capture the intent of object removal. Furthermore, to mitigate deviations caused by artifacts during the sampling pathways, we develop a simple yet effective Self-Rectifying Attention (SRA) mechanism. The SRA calibrates the sampling pathways by altering self-attention activation, allowing the model to effectively bypass artifacts while further enhancing the coherence of the generated content. With this design, our proposed EraDiff achieves state-of-the-art performance on the OpenImages V5 dataset and demonstrates significant superiority in real-world scenarios.
115. 【2503.07019】HybridReg: Robust 3D Point Cloud Registration with Hybrid Motions
Link: https://arxiv.org/abs/2503.07019
Authors: Keyu Du, Hao Xu, Haipeng Li, Hong Qu, Chi-Wing Fu, Shuaicheng Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: point cloud registration, cloud registration, point cloud, Scene-level point cloud, trained models
Comments: 2025, Association for the Advancement of Artificial Intelligence
Abstract: Scene-level point cloud registration is very challenging when considering dynamic foregrounds. Existing indoor datasets mostly assume rigid motions, so the trained models cannot robustly handle scenes with non-rigid motions. On the other hand, non-rigid datasets are mainly object-level, so the trained models cannot generalize well to complex scenes. This paper presents HybridReg, a new approach to 3D point cloud registration, learning an uncertainty mask to account for hybrid motions: rigid for backgrounds and non-rigid/rigid for instance-level foregrounds. First, we build a scene-level 3D registration dataset, namely HybridMatch, designed specifically with strategies to arrange diverse deforming foregrounds in a controllable manner. Second, we account for different motion types and formulate a mask-learning module to alleviate the interference of deforming outliers. Third, we exploit a simple yet effective negative log-likelihood loss to adopt uncertainty to guide the feature extraction and correlation computation. To the best of our knowledge, HybridReg is the first work that exploits hybrid motions for robust point cloud registration. Extensive experiments show HybridReg's strengths, leading it to achieve state-of-the-art performance on both widely-used indoor and outdoor datasets.
116. 【2503.07008】SDFA: Structure Aware Discriminative Feature Aggregation for Efficient Human Fall Detection in Video
Link: https://arxiv.org/abs/2503.07008
Authors: Sania Zahan, Ghulam Mubashar Hassan, Ajmal Mian
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Older people, deteriorating health, people are susceptible, Older, due to instability
Comments: Published in IEEE Transactions on Industrial Informatics
Abstract: Older people are susceptible to falls due to instability in posture and deteriorating health. Immediate access to medical support can greatly reduce repercussions. Hence, there is an increasing interest in automated fall detection, often incorporated into a smart healthcare system to provide better monitoring. Existing systems focus on wearable devices, which are inconvenient, or video monitoring, which raises privacy concerns. Moreover, these systems provide a limited perspective of their generalization ability as they are tested on datasets containing few activities that have wide disparity in the action space and are easy to differentiate. Complex daily life scenarios pose much greater challenges with activities that overlap in action spaces due to similar posture or motion. To overcome these limitations, we propose a fall detection model, coined SDFA, based on human skeletons extracted from low-resolution videos. The use of skeleton data ensures privacy, and low-resolution videos ensure low hardware and computational cost. Our model captures discriminative structural displacements and motion trends using unified joint and motion features projected onto a shared high dimensional space. Particularly, the use of separable convolution combined with a powerful GCN architecture provides improved performance. Extensive experiments on five large-scale datasets with a wide range of evaluation settings show that our model achieves competitive performance with extremely low computational complexity and runs faster than existing models.
117. 【2503.07004】NukesFormers: Unpaired Hyperspectral Image Generation with Non-Uniform Domain Alignment
Link: https://arxiv.org/abs/2503.07004
Authors: Jiaojiao Li, Shiyao Duan, Haitao XU, Rui Song
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Hyperspectral Image Generation, data-driven Hyperspectral Image, current data-driven Hyperspectral, co-registered RGB-hyperspectral image, acquiring accurately co-registered
Comments:
Abstract: The inherent difficulty in acquiring accurately co-registered RGB-hyperspectral image (HSI) pairs has significantly impeded the practical deployment of current data-driven Hyperspectral Image Generation (HIG) networks in engineering applications. Meanwhile, the ill-posed nature of the aligning constraints, compounded with the complexities of mining cross-domain features, also hinders the advancement of unpaired HIG (UnHIG) tasks. In this paper, we conquer these challenges by modeling the UnHIG to range space interaction and compensations of null space through Range-Null Space Decomposition (RND) methodology. Specifically, the introduced contrastive learning effectively aligns the geometric and spectral distributions of unpaired data by building the interaction of range space, considering the consistent feature in degradation process. Following this, we map the frequency representations of dual-domain input and thoroughly mine the null space, such as degraded and high-frequency components, through the proposed Non-uniform Kolmogorov-Arnold Networks. Extensive comparative experiments demonstrate that it establishes a new benchmark in UnHIG.
118. 【2503.07002】Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning
Link: https://arxiv.org/abs/2503.07002
Authors: Jiazheng Liu, Sipeng Zheng, Börje F. Karlsson, Zongqing Lu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: large language models, large-scale pre-trained vision, pre-trained vision towers, shown great capabilities, language models
Comments:
Abstract:Multimodal large language models (MLLMs), built on large-scale pre-trained vision towers and language models, have shown great capabilities in multimodal understanding. However, most existing MLLMs are trained on single-turn vision question-answering tasks, which do not accurately reflect real-world human conversations. In this paper, we introduce MMDiag, a multi-turn multimodal dialogue dataset. This dataset is collaboratively generated through deliberately designed rules and GPT assistance, featuring strong correlations between questions, between questions and images, and among different image regions; thus aligning more closely with real-world scenarios. MMDiag serves as a strong benchmark for multi-turn multimodal dialogue learning and brings more challenges to the grounding and reasoning capabilities of MLLMs. Further, inspired by human vision processing, we present DiagNote, an MLLM equipped with multimodal grounding and reasoning capabilities. DiagNote consists of two modules (Deliberate and Gaze) interacting with each other to perform Chain-of-Thought and annotations respectively, throughout multi-turn dialogues. We empirically demonstrate the advantages of DiagNote in both grounding and jointly processing and reasoning with vision and language information over existing MLLMs.
119. 【2503.07000】Frequency-Aware Density Control via Reparameterization for High-Quality Rendering of 3D Gaussian Splatting
Link: https://arxiv.org/abs/2503.07000
Authors: Zhaojie Zeng, Yuesong Wang, Lili Ju, Tao Guan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: represent scene details, Gaussian Splatting, Gaussians, adaptively controlling, represent scene
Comments: Accepted to AAAI 2025
Abstract:By adaptively controlling the density and generating more Gaussians in regions with high-frequency information, 3D Gaussian Splatting (3DGS) can better represent scene details. From the signal processing perspective, representing details usually needs more Gaussians with relatively smaller scales. However, 3DGS currently lacks an explicit constraint linking the density and scale of 3D Gaussians across the domain, leading to 3DGS using improper-scale Gaussians to express frequency information, resulting in the loss of accuracy. In this paper, we propose to establish a direct relation between density and scale through the reparameterization of the scaling parameters and ensure the consistency between them via explicit constraints (i.e., density responds well to changes in frequency). Furthermore, we develop a frequency-aware density control strategy, consisting of densification and deletion, to improve representation quality with fewer Gaussians. A dynamic threshold encourages densification in high-frequency regions, while a scale-based filter deletes Gaussians with improper scale. Experimental results on various datasets demonstrate that our method outperforms existing state-of-the-art methods quantitatively and qualitatively.
120. 【2503.06998】SOYO: A Tuning-Free Approach for Video Style Morphing via Style-Adaptive Interpolation in Diffusion Models
Link: https://arxiv.org/abs/2503.06998
Authors: Haoyu Zheng, Qifan Yu, Binghe Yu, Yang Dai, Wenqiao Zhang, Juncheng Li, Siliang Tang, Yueting Zhuang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: achieved remarkable progress, video style morphing, video, style morphing, style
Comments:
Abstract:Diffusion models have achieved remarkable progress in image and video stylization. However, most existing methods focus on single-style transfer, while video stylization involving multiple styles necessitates seamless transitions between them. We refer to this smooth style transition between video frames as video style morphing. Current approaches often generate stylized video frames with discontinuous structures and abrupt style changes when handling such transitions. To address these limitations, we introduce SOYO, a novel diffusion-based framework for video style morphing. Our method employs a pre-trained text-to-image diffusion model without fine-tuning, combining attention injection and AdaIN to preserve structural consistency and enable smooth style transitions across video frames. Moreover, we notice that applying linear equidistant interpolation directly induces imbalanced style morphing. To harmonize across video frames, we propose a novel adaptive sampling scheduler operating between two style images. Extensive experiments demonstrate that SOYO outperforms existing methods in open-domain video style morphing, better preserving the structural coherence of video frames while achieving stable and smooth style transitions.
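The abstract above mentions AdaIN (adaptive instance normalization) and notes that linear equidistant interpolation between two styles gives imbalanced morphing. The sketch below illustrates only those two textbook ingredients on a single feature channel, using plain Python lists. It is not SOYO's method: the function names are hypothetical, the linear blend is the naive baseline the abstract criticizes, and the paper's adaptive sampling scheduler is not reproduced here.

```python
import math


def mean_std(values, eps=1e-5):
    """Mean and (eps-stabilised) standard deviation of one feature channel."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var + eps)


def adain_channel(content, style, eps=1e-5):
    """AdaIN on one channel: normalise content, then apply the style's statistics."""
    c_mean, c_std = mean_std(content, eps)
    s_mean, s_std = mean_std(style, eps)
    return [s_std * (v - c_mean) / c_std + s_mean for v in content]


def morph_channel(content, style_a, style_b, alpha, eps=1e-5):
    """Naive linear blend of two styles' statistics (alpha=0 -> A, alpha=1 -> B)."""
    c_mean, c_std = mean_std(content, eps)
    a_mean, a_std = mean_std(style_a, eps)
    b_mean, b_std = mean_std(style_b, eps)
    mean = (1 - alpha) * a_mean + alpha * b_mean
    std = (1 - alpha) * a_std + alpha * b_std
    return [std * (v - c_mean) / c_std + mean for v in content]


content = [0.0, 1.0, 2.0, 3.0]
style_a = [10.0, 12.0, 14.0, 16.0]
style_b = [-1.0, 0.0, 1.0, 2.0]

print(adain_channel(content, style_a))
# Equidistant alphas, one per video frame, as in the linear baseline.
for frame, alpha in enumerate([0.0, 0.25, 0.5, 0.75, 1.0]):
    print(frame, morph_channel(content, style_a, style_b, alpha))
```

Equal steps in `alpha` do not have to produce equal perceptual steps in style, which is the imbalance the abstract points to as motivation for a non-uniform schedule.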
121. 【2503.06996】Public space security management using digital twin technologies
Link: https://arxiv.org/abs/2503.06996
Authors: Stylianos Zindros, Christos Chronis, Panagiotis Radoglou-Grammatikis, Vasileios Argyriou, Panagiotis Sarigiannidis, Iraklis Varlamis, Georgios Th. Papadopoulos
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Digital Twin technologies, predicting potential future, Digital Twin, potential future threats, Twin technologies
Comments:
Abstract:As the security of public spaces remains a critical issue in today's world, Digital Twin technologies have emerged in recent years as a promising solution for detecting and predicting potential future threats. The applied methodology leverages a Digital Twin of a metro station in Athens, Greece, using the FlexSim simulation software. The model encompasses points of interest and passenger flows, and sets their corresponding parameters. These elements influence and allow the model to provide reasonable predictions on the security management of the station under various scenarios. Experimental tests are conducted with different configurations of surveillance cameras and optimizations of camera angles to evaluate the effectiveness of the space surveillance setup. The results show that the strategic positioning of surveillance cameras and the adjustment of their angles significantly improves the detection of suspicious behaviors and with the use of the DT it is possible to evaluate different scenarios and find the optimal camera setup for each case. In summary, this study highlights the value of Digital Twins in real-time simulation and data-driven security management. The proposed approach contributes to the ongoing development of smart security solutions for public spaces and provides an innovative framework for threat detection and prevention.
122. 【2503.06993】CAPT: Class-Aware Prompt Tuning for Federated Long-Tailed Learning with Vision-Language Model
Link: https://arxiv.org/abs/2503.06993
Authors: Shihao Hou, Xinyi Shang, Shreyank N Gowda, Yang Lu, Chao Wu, Yan Yan, Hanzi Wang
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Federated Long-tailed Learning, federated long-tailed, long-tailed, long-tailed distributions remains, handling the co-occurrence
Comments:
Abstract: Effectively handling the co-occurrence of non-IID data and long-tailed distributions remains a critical challenge in federated learning. While fine-tuning vision-language models (VLMs) like CLIP has been shown to be promising in addressing non-IID data challenges, this approach leads to severe degradation of tail classes in federated long-tailed scenarios. Under the composite effects of strong non-IID data distribution and long-tailed class imbalances, VLM fine-tuning may even fail to yield any improvement. To address this issue, we propose Class-Aware Prompt Learning for Federated Long-tailed Learning (CAPT), a novel framework that leverages a pre-trained VLM to effectively handle both data heterogeneity and long-tailed distributions. CAPT introduces a dual-prompt mechanism that synergizes general and class-aware prompts, enabling the framework to capture global trends while preserving class-specific knowledge. To better aggregate and share knowledge across clients, we introduce a heterogeneity-aware client clustering strategy that groups clients based on their data distributions, enabling efficient collaboration and knowledge sharing. Extensive experiments on various long-tailed datasets with different levels of data heterogeneity demonstrate that CAPT significantly improves tail class performance without compromising overall accuracy, outperforming state-of-the-art methods in federated long-tailed learning scenarios.
123. 【2503.06992】Bridge Frame and Event: Common Spatiotemporal Fusion for High-Dynamic Scene Optical Flow
Link: https://arxiv.org/abs/2503.06992
Authors: Hanyu Zhou, Haonan Wang, Haoyue Liu, Yuxing Duan, Yi Chang, Luxin Yan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: suffers spatial blur, High-dynamic scene optical, challenging task, optical flow, scene optical flow
Comments:
Abstract: High-dynamic scene optical flow is a challenging task, which suffers from spatial blur and temporally discontinuous motion due to large displacement in frame imaging, thus deteriorating the spatiotemporal feature of optical flow. Typically, existing methods mainly introduce event cameras to directly fuse the spatiotemporal features between the two modalities. However, this direct fusion is ineffective, since there exists a large gap due to the heterogeneous data representation between frame and event modalities. To address this issue, we explore a common-latent space as an intermediate bridge to mitigate the modality gap. In this work, we propose a novel common spatiotemporal fusion between frame and event modalities for high-dynamic scene optical flow, including visual boundary localization and motion correlation fusion. Specifically, in visual boundary localization, we find that frame and event share similar spatiotemporal gradients, whose similarity distribution is consistent with the extracted boundary distribution. This motivates us to design the common spatiotemporal gradient to constrain the reference boundary localization. In motion correlation fusion, we discover that the frame-based motion possesses spatially dense but temporally discontinuous correlation, while the event-based motion has spatially sparse but temporally continuous correlation. This inspires us to use the reference boundary to guide the complementary motion knowledge fusion between the two modalities. Moreover, common spatiotemporal fusion can not only relieve the cross-modal feature discrepancy, but also make the fusion process interpretable for dense and continuous optical flow. Extensive experiments have been performed to verify the superiority of the proposed method.
124. 【2503.06991】Are We Truly Forgetting? A Critical Re-examination of Machine Unlearning Evaluation Protocols
Link: https://arxiv.org/abs/2503.06991
Authors: Yongwoo Kim, Sungmin Cha, Donghyun Kim
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: remove specific data, specific data points, addressing privacy, legal requirements, process to remove
Comments:
Abstract:Machine unlearning is a process to remove specific data points from a trained model while maintaining the performance on retain data, addressing privacy or legal requirements. Despite its importance, existing unlearning evaluations tend to focus on logit-based metrics (i.e., accuracy) under small-scale scenarios. We observe that this could lead to a false sense of security in unlearning approaches under real-world scenarios. In this paper, we conduct a new comprehensive evaluation that employs representation-based evaluations of the unlearned model under large-scale scenarios to verify whether the unlearning approaches genuinely eliminate the targeted forget data from the model's representation perspective. Our analysis reveals that current state-of-the-art unlearning approaches either completely degrade the representational quality of the unlearned model or merely modify the classifier (i.e., the last layer), thereby achieving superior logit-based evaluation metrics while maintaining significant representational similarity to the original model. Furthermore, we introduce a novel unlearning evaluation setup from a transfer learning perspective, in which the forget set classes exhibit semantic similarity to downstream task classes, necessitating that feature representations diverge significantly from those of the original model. Our comprehensive benchmark not only addresses a critical gap between theoretical machine unlearning and practical scenarios, but also establishes a foundation to inspire future research directions in developing genuinely effective unlearning methodologies.
125. 【2503.06989】Utilizing Jailbreak Probability to Attack and Safeguard Multimodal LLMs
Link: https://arxiv.org/abs/2503.06989
Authors: Wenzhuo Xu, Zhipeng Wei, Xiongtao Sun, Deyue Zhang, Dongdong Yang, Quanchen Zou, Xiangzheng Zhang
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Language Models, Multimodal Large Language, Language Models, Large Language, Multimodal Large
Comments:
Abstract: Recently, Multimodal Large Language Models (MLLMs) have demonstrated their superior ability in understanding multimodal contents. However, they remain vulnerable to jailbreak attacks, which exploit weaknesses in their safety alignment to generate harmful responses. Previous studies categorize jailbreaks as successful or failed based on whether responses contain malicious content. However, given the stochastic nature of MLLM responses, this binary classification of an input's ability to jailbreak MLLMs is inappropriate. Derived from this viewpoint, we introduce jailbreak probability to quantify the jailbreak potential of an input, which represents the likelihood that MLLMs generate a malicious response when prompted with this input. We approximate this probability through multiple queries to MLLMs. After modeling the relationship between input hidden states and their corresponding jailbreak probability using Jailbreak Probability Prediction Network (JPPN), we use continuous jailbreak probability for optimization. Specifically, we propose Jailbreak-Probability-based Attack (JPA) that optimizes adversarial perturbations on inputs to maximize jailbreak probability. To counteract attacks, we also propose two defensive methods: Jailbreak-Probability-based Finetuning (JPF) and Jailbreak-Probability-based Defensive Noise (JPDN), which minimize jailbreak probability in the MLLM parameters and input space, respectively. Extensive experiments show that (1) JPA yields improvements (up to 28.38%) under both white and black box settings compared to previous methods with small perturbation bounds and few iterations. (2) JPF and JPDN significantly reduce jailbreaks by at most over 60%. Both of the above results demonstrate the significance of introducing jailbreak probability to make nuanced distinctions among input jailbreak abilities.
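The abstract above says the probability is approximated "through multiple queries". For evaluation purposes, one straightforward reading is a Monte Carlo frequency estimate: query the model repeatedly with the same input and report the fraction of responses a judge flags. The sketch below shows only that estimator. The `query_model` and `is_flagged` callables are hypothetical stand-ins (here a seeded toy stub and a string check); the paper's actual models, judge, and query count are not specified in this digest.

```python
import random


def estimate_flag_probability(query_model, is_flagged, prompt, n_queries=20):
    """Fraction of n_queries stochastic responses that the judge flags."""
    if n_queries <= 0:
        raise ValueError("n_queries must be positive")
    hits = sum(1 for _ in range(n_queries) if is_flagged(query_model(prompt)))
    return hits / n_queries


def make_toy_model(flag_rate, seed=0):
    """Hypothetical stub: returns 'FLAGGED' with probability flag_rate, else 'REFUSED'."""
    rng = random.Random(seed)

    def toy_model(prompt):
        return "FLAGGED" if rng.random() < flag_rate else "REFUSED"

    return toy_model


def judge(response):
    """Hypothetical judge: a plain string comparison on the stub's output."""
    return response == "FLAGGED"


toy = make_toy_model(flag_rate=0.3)
estimate = estimate_flag_probability(toy, judge, "placeholder input", n_queries=200)
print(estimate)  # a value near 0.3; the exact figure depends on the seed
```

A continuous estimate like this separates inputs that a binary success/failure label would lump together, which is the distinction the abstract argues for.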
126. 【2503.06986】ConcreTizer: Model Inversion Attack via Occupancy Classification and Dispersion Control for 3D Point Cloud Restoration
Link: https://arxiv.org/abs/2503.06986
Authors: Youngseok Kim, Sunwook Hwang, Hyung-Sin Kim, Saewoong Bahk
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: point cloud, point cloud data, point, point clouds remains, model inversion attacks
Comments:
Abstract: The growing use of 3D point cloud data in autonomous vehicles (AVs) has raised serious privacy concerns, particularly due to the sensitive information that can be extracted from 3D data. While model inversion attacks have been widely studied in the context of 2D data, their application to 3D point clouds remains largely unexplored. To fill this gap, we present the first in-depth study of model inversion attacks aimed at restoring 3D point cloud scenes. Our analysis reveals two unique challenges: the inherent sparsity of 3D point clouds and the ambiguity between empty and non-empty voxels after voxelization, both of which are further exacerbated by the dispersion of non-empty voxels across feature extractor layers. To address these challenges, we introduce ConcreTizer, a simple yet effective model inversion attack designed specifically for voxel-based 3D point cloud data. ConcreTizer incorporates Voxel Occupancy Classification to distinguish between empty and non-empty voxels and Dispersion-Controlled Supervision to mitigate non-empty voxel dispersion. Extensive experiments on widely used 3D feature extractors and benchmark datasets, such as KITTI and Waymo, demonstrate that ConcreTizer concretely restores the original 3D point cloud scene from disrupted 3D feature data. Our findings highlight both the vulnerability of 3D data to inversion attacks and the urgent need for robust defense strategies.
127. 【2503.06984】Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition
Link: https://arxiv.org/abs/2503.06984
Authors: Juncheng Wang, Chao Xu, Cheng Yu, Lei Shang, Zhe Hu, Shujun Wang, Liefeng Bo
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Keywords: synthesizing realistic audio, realistic audio tracks, Mel Quantization-Continuum Decomposition, synthesizing realistic, tracks that synchronize
Comments: Accepted to CVPR-25
Abstract: Video-to-audio generation is essential for synthesizing realistic audio tracks that synchronize effectively with silent videos. Following the perspective of extracting essential signals from videos that can precisely control the mature text-to-audio generative diffusion models, this paper presents how to balance the representation of mel-spectrograms in terms of completeness and complexity through a new approach called Mel Quantization-Continuum Decomposition (Mel-QCD). We decompose the mel-spectrogram into three distinct types of signals; by applying quantization or continuity to them, we can effectively predict them from video with a devised video-to-all (V2X) predictor. Then, the predicted signals are recomposed and fed into a ControlNet, along with a textual inversion design, to control the audio generation process. Our proposed Mel-QCD method demonstrates state-of-the-art performance across eight metrics, evaluating dimensions such as quality, synchronization, and semantic consistency. Our codes and demos will be released at this https URL.
128. 【2503.06983】Griffin: Aerial-Ground Cooperative Detection and Tracking Dataset and Benchmark
Link: https://arxiv.org/abs/2503.06983
Authors: Jiahao Wang, Xiangyu Cao, Jiaru Zhong, Yuner Zhang, Haibao Yu, Lei He, Shaobing Xu
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: autonomous driving systems, driving systems continue, long-range detection due, significant advancements, autonomous driving
Comments: 8 pages, 7 figures. This work has been submitted to IROS 2025 for possible publication
Abstract:Despite significant advancements, autonomous driving systems continue to struggle with occluded objects and long-range detection due to the inherent limitations of single-perspective sensing. Aerial-ground cooperation offers a promising solution by integrating UAVs' aerial views with ground vehicles' local observations. However, progress in this emerging field has been hindered by the absence of public datasets and standardized evaluation benchmarks. To address this gap, this paper presents a comprehensive solution for aerial-ground cooperative 3D perception through three key contributions: (1) Griffin, a large-scale multi-modal dataset featuring over 200 dynamic scenes (30k+ frames) with varied UAV altitudes (20-60m), diverse weather conditions, and occlusion-aware 3D annotations, enhanced by CARLA-AirSim co-simulation for realistic UAV dynamics; (2) A unified benchmarking framework for aerial-ground cooperative detection and tracking tasks, including protocols for evaluating communication efficiency, latency tolerance, and altitude adaptability; (3) AGILE, an instance-level intermediate fusion baseline that dynamically aligns cross-view features through query-based interaction, achieving an advantageous balance between communication overhead and perception accuracy. Extensive experiments prove the effectiveness of aerial-ground cooperative perception and demonstrate the direction of further research. The dataset and codes are available at this https URL.
129. 【2503.06978】Lightweight Multimodal Artificial Intelligence Framework for Maritime Multi-Scene Recognition
Link: https://arxiv.org/abs/2503.06978
Authors: Xinyu Xi, Hua Yang, Shentai Zhang, Yijie Liu, Sijin Sun, Xiuju Fu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Keywords: intelligent marine robotics, Maritime Multi-Scene Recognition, crucial for enhancing, enhancing the capabilities, capabilities of intelligent
Comments: 19 pages, 4 figures, submitted to Engineering Applications of Artificial Intelligence
点击查看摘要
Abstract:Maritime Multi-Scene Recognition is crucial for enhancing the capabilities of intelligent marine robotics, particularly in applications such as marine conservation, environmental monitoring, and disaster response. However, this task presents significant challenges due to environmental interference, where marine conditions degrade image quality, and the complexity of maritime scenes, which requires deeper reasoning for accurate recognition. Pure vision models alone are insufficient to address these issues. To overcome these limitations, we propose a novel multimodal Artificial Intelligence (AI) framework that integrates image data, textual descriptions and classification vectors generated by a Multimodal Large Language Model (MLLM), to provide richer semantic understanding and improve recognition accuracy. Our framework employs an efficient multimodal fusion mechanism to further enhance model robustness and adaptability in complex maritime environments. Experimental results show that our model achieves 98% accuracy, surpassing previous SOTA models by 3.5%. To optimize deployment on resource-constrained platforms, we adopt activation-aware weight quantization (AWQ) as a lightweight technique, reducing the model size to 68.75MB with only a 0.5% accuracy drop while significantly lowering computational overhead. This work provides a high-performance solution for real-time maritime scene recognition, enabling Autonomous Surface Vehicles (ASVs) to support environmental monitoring and disaster response in resource-limited settings.
130. 【2503.06976】Task-Specific Knowledge Distillation from the Vision Foundation Model for Enhanced Medical Image Segmentation
Link: https://arxiv.org/abs/2503.06976
Authors: Pengchen Liang, Haishan Huang, Bin Pu, Jianguo Chen, Xiang Hua, Jing Zhang, Weibo Ma, Zhuangzhuang Chen, Yiwei Li, Qing Chang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Vision Foundation Models, Vision Foundation, Large-scale pre-trained models, transferring generalized knowledge, Large-scale pre-trained
Comments: 29 pages, 10 figures, 16 tables
Abstract:Large-scale pre-trained models, such as Vision Foundation Models (VFMs), have demonstrated impressive performance across various downstream tasks by transferring generalized knowledge, especially when target data is limited. However, their high computational cost and the domain gap between natural and medical images limit their practical application in medical segmentation tasks. Motivated by this, we pose the following important question: "How can we effectively utilize the knowledge of large pre-trained VFMs to train a small, task-specific model for medical image segmentation when training data is limited?" To address this problem, we propose a novel and generalizable task-specific knowledge distillation framework. Our method fine-tunes the VFM on the target segmentation task to capture task-specific features before distilling the knowledge to smaller models, leveraging Low-Rank Adaptation (LoRA) to reduce the computational cost of fine-tuning. Additionally, we incorporate synthetic data generated by diffusion models to augment the transfer set, enhancing model performance in data-limited scenarios. Experimental results across five medical image datasets demonstrate that our method consistently outperforms task-agnostic knowledge distillation and self-supervised pretraining approaches like MoCo v3 and Masked Autoencoders (MAE). For example, on the KidneyUS dataset, our method achieved a 28% higher Dice score than task-agnostic KD using 80 labeled samples for fine-tuning. On the CHAOS dataset, it achieved an 11% improvement over MAE with 100 labeled samples. These results underscore the potential of task-specific knowledge distillation to train accurate, efficient models for medical image segmentation in data-constrained settings.
131. 【2503.06974】Asymmetric Visual Semantic Embedding Framework for Efficient Vision-Language Alignment
Link: https://arxiv.org/abs/2503.06974
Authors: Yang Liu, Mengyuan Liu, Shudong Huang, Jiancheng Lv
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Learning visual semantic, visual semantic similarity, Learning visual, visual semantic, semantic similarity
Comments: 9 pages, 5 figures, The 39th Annual AAAI Conference on Artificial Intelligence
Abstract:Learning visual semantic similarity is a critical challenge in bridging the gap between images and texts. However, there exist inherent variations between vision and language data, such as information density, i.e., images can contain textual information from multiple different views, which makes it difficult to compute the similarity between these two modalities accurately and efficiently. In this paper, we propose a novel framework called Asymmetric Visual Semantic Embedding (AVSE) to dynamically select features from various regions of images tailored to different textual inputs for similarity calculation. To capture information from different views in the image, we design a radial bias sampling module to sample image patches and obtain image features from various views. Furthermore, AVSE introduces a novel module for efficient computation of visual semantic similarity between asymmetric image and text embeddings. Central to this module is the presumption of foundational semantic units within the embeddings, denoted as "meta-semantic embeddings." It segments all embeddings into meta-semantic embeddings with the same dimension and calculates visual semantic similarity by finding the optimal match of meta-semantic embeddings of two modalities. Our proposed AVSE model is extensively evaluated on the large-scale MS-COCO and Flickr30K datasets, demonstrating its superiority over recent state-of-the-art methods.
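The chunk-and-match idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the chunk size `dim`, and the use of a greedy best-match per text chunk with cosine similarity (standing in for the paper's "optimal match") are all assumptions made for illustration.

```python
import math


def split_chunks(vec, dim):
    """Cut an embedding into equal-size pieces (hypothetical 'meta-semantic' units)."""
    if len(vec) % dim != 0:
        raise ValueError("embedding length must be a multiple of dim")
    return [vec[i:i + dim] for i in range(0, len(vec), dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def meta_similarity(image_emb, text_emb, dim):
    """Average, over text chunks, of the best cosine match among image chunks.

    The image embedding may be longer than the text embedding (asymmetric).
    Greedy per-chunk matching is an assumption, not the paper's exact rule.
    """
    img = split_chunks(image_emb, dim)
    txt = split_chunks(text_emb, dim)
    return sum(max(cosine(t, i) for i in img) for t in txt) / len(txt)


# Three image chunks, two text chunks; each text chunk has a perfect match.
score = meta_similarity([1.0, 0.0, 0.0, 1.0, 1.0, 1.0], [0.0, 2.0, 3.0, 0.0], dim=2)
```

Because each text chunk only needs one good image chunk, the image side can carry extra views that a given caption never mentions without lowering the score.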
132. 【2503.06973】A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis
Link: https://arxiv.org/abs/2503.06973
Authors: Xiang Liu, Zhaoxiang Liu, Huan Hu, Zezhou Chen, Kohou Wang, Kai Wang, Shiguo Lian
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: shown considerable potential, text-based interactions, crop disease diagnosis, shown considerable, considerable potential
Comments: Accepted by ECCV 2024 (14 pages, 8 figures)
Abstract:While conversational generative AI has shown considerable potential in enhancing decision-making for agricultural professionals, its exploration has predominantly been anchored in text-based interactions. The evolution of multimodal conversational AI, leveraging vast amounts of image-text data from diverse sources, marks a significant stride forward. However, the application of such advanced vision-language models in the agricultural domain, particularly for crop disease diagnosis, remains underexplored. In this work, we present the crop disease domain multimodal (CDDM) dataset, a pioneering resource designed to advance the field of agricultural research through the application of multimodal learning techniques. The dataset comprises 137,000 images of various crop diseases, accompanied by 1 million question-answer pairs that span a broad spectrum of agricultural knowledge, from disease identification to management practices. By integrating visual and textual data, CDDM facilitates the development of sophisticated question-answering systems capable of providing precise, useful advice to farmers and agricultural professionals. We demonstrate the utility of the dataset by finetuning state-of-the-art multimodal models, showcasing significant improvements in crop disease diagnosis. Specifically, we employed a novel finetuning strategy that utilizes low-rank adaptation (LoRA) to finetune the visual encoder, adapter and language model simultaneously. Our contributions include not only the dataset but also a finetuning strategy and a benchmark to stimulate further research in agricultural technology, aiming to bridge the gap between advanced AI techniques and practical agricultural applications. The dataset is available at https://github.com/UnicomAI/UnicomBenchmark/tree/main/CDDMBench.
133. 【2503.06966】MIGA: Mutual Information-Guided Attack on Denoising Models for Semantic Manipulation
Link: https://arxiv.org/abs/2503.06966
Authors: Guanghao Li, Mingzhi Chen, Hao Yu, Shuting Dong, Wenhao Jiang, Ming Tang, Chun Yuan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: retaining crucial semantic, functioning as filters, denoising models, Deep learning-based denoising, widely employed
Comments:
Abstract:Deep learning-based denoising models have been widely employed in vision tasks, functioning as filters to eliminate noise while retaining crucial semantic information. Additionally, they play a vital role in defending against adversarial perturbations that threaten downstream tasks. However, these models can be intrinsically susceptible to adversarial attacks due to their dependence on specific noise assumptions. Existing attacks on denoising models mainly aim at deteriorating visual clarity while neglecting semantic manipulation, rendering them either easily detectable or limited in effectiveness. In this paper, we propose Mutual Information-Guided Attack (MIGA), the first method designed to directly attack deep denoising models by strategically disrupting their ability to preserve semantic content via adversarial perturbations. By minimizing the mutual information between the original and denoised images, a measure of semantic similarity, MIGA forces the denoiser to produce perceptually clean yet semantically altered outputs. While these images appear visually plausible, they encode systematically distorted semantics, revealing a fundamental vulnerability in denoising models. These distortions persist in denoised outputs and can be quantitatively assessed through downstream task performance. We propose new evaluation metrics and systematically assess MIGA on four denoising models across five datasets, demonstrating its consistent effectiveness in disrupting semantic fidelity. Our findings suggest that denoising models are not always robust and can introduce security risks in real-world applications.
134. 【2503.06965】SeCap: Self-Calibrating and Adaptive Prompts for Cross-view Person Re-Identification in Aerial-Ground Networks
Link: https://arxiv.org/abs/2503.06965
Authors: Shining Wang, Yunlong Wang, Ruiqi Wu, Bingliang Jiao, Wenxuan Wang, Peng Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: making identity matching, identity matching difficult, significant appearance variations, appearance variations caused, Aerial-Ground Person Re-identification
Comments:
Abstract:When discussing the Aerial-Ground Person Re-identification (AGPReID) task, we face the main challenge of the significant appearance variations caused by different viewpoints, making identity matching difficult. To address this issue, previous methods attempt to reduce the differences between viewpoints by critical attributes and decoupling the viewpoints. While these methods can mitigate viewpoint differences to some extent, they still face two main issues: (1) difficulty in handling viewpoint diversity and (2) neglect of the contribution of local features. To effectively address these challenges, we design and implement the Self-Calibrating and Adaptive Prompt (SeCap) method for the AGPReID task. The core of this framework relies on the Prompt Re-calibration Module (PRM), which adaptively re-calibrates prompts based on the input. Combined with the Local Feature Refinement Module (LFRM), SeCap can extract view-invariant features from local features for AGPReID. Meanwhile, given the current scarcity of datasets in the AGPReID field, we further contribute two real-world Large-scale Aerial-Ground Person Re-Identification datasets, LAGPeR and G2APS-ReID. The former is collected and annotated by us independently, covering 4,231 unique identities and containing 63,841 high-quality images; the latter is reconstructed from the person search dataset G2APS. Through extensive experiments on AGPReID datasets, we demonstrate that SeCap is a feasible and effective solution for the AGPReID task. The datasets and source code are available at this https URL.
135. 【2503.06960】A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning
Link: https://arxiv.org/abs/2503.06960
Authors: Xin Wen, Bingchen Zhao, Yilun Chen, Jiangmiao Pang, Xiaojuan Qi
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: configuration remains unclear, optimal configuration remains, Pre-trained vision models, Pre-trained vision, remains unclear
Comments: Accepted by CVPR 2025
Abstract:Pre-trained vision models (PVMs) are fundamental to modern robotics, yet their optimal configuration remains unclear. Through systematic evaluation, we find that while DINO and iBOT outperform MAE across visuomotor control and perception tasks, they struggle when trained on non-(single-)object-centric (NOC) data--a limitation strongly correlated with their diminished ability to learn object-centric representations. This investigation indicates that the ability to form object-centric representations from the non-object-centric robotics dataset is the key to success for PVMs. Motivated by this discovery, we designed SlotMIM, a method that induces object-centric representations by introducing a semantic bottleneck to reduce the number of prototypes to encourage the emergence of objectness as well as cross-view consistency regularization for encouraging multiview invariance. Our experiments encompass pre-training on object-centric, scene-centric, web-crawled, and ego-centric data. Across all settings, our approach learns transferrable representations and achieves significant improvements over prior work in image recognition, scene understanding, and robot learning evaluations. When scaled up with million-scale datasets, our method also demonstrates superior data efficiency and scalability. Our code and models are publicly available at this https URL.
136. 【2503.06956】LaTexBlend: Scaling Multi-concept Customized Generation with Latent Textual Blending
Link: https://arxiv.org/abs/2503.06956
Authors: Jian Jin, Zhenbo Yu, Yang Shen, Zhenyong Fu, Jian Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: generation renders user-specified, renders user-specified concepts, Latent Textual space, renders user-specified, contexts based
Comments: CVPR 2025
Abstract:Customized text-to-image generation renders user-specified concepts into novel contexts based on textual prompts. Scaling the number of concepts in customized generation meets a broader demand for user creation, whereas existing methods face challenges with generation quality and computational efficiency. In this paper, we propose LaTexBlend, a novel framework for effectively and efficiently scaling multi-concept customized generation. The core idea of LaTexBlend is to represent single concepts and blend multiple concepts within a Latent Textual space, which is positioned after the text encoder and a linear projection. LaTexBlend customizes each concept individually, storing them in a concept bank with a compact representation of latent textual features that captures sufficient concept information to ensure high fidelity. At inference, concepts from the bank can be freely and seamlessly combined in the latent textual space, offering two key merits for multi-concept generation: 1) excellent scalability, and 2) significant reduction of denoising deviation, preserving coherent layouts. Extensive experiments demonstrate that LaTexBlend can flexibly integrate multiple customized concepts with harmonious structures and high subject fidelity, substantially outperforming baselines in both generation quality and computational efficiency. Our code will be publicly available.
137. 【2503.06955】Motion Anything: Any to Motion Generation
Link: https://arxiv.org/abs/2503.06955
Authors: Zeyu Zhang, Yiran Wang, Wei Mao, Danning Li, Rui Zhao, Biao Wu, Zirui Song, Bohan Zhuang, Ian Reid, Richard Hartley
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Conditional motion generation, Conditional motion, computer vision, extensively studied, studied in computer
Comments:
Abstract:Conditional motion generation has been extensively studied in computer vision, yet two critical challenges remain. First, while masked autoregressive methods have recently outperformed diffusion-based approaches, existing masking models lack a mechanism to prioritize dynamic frames and body parts based on given conditions. Second, existing methods for different conditioning modalities often fail to integrate multiple modalities effectively, limiting control and coherence in generated motion. To address these challenges, we propose Motion Anything, a multimodal motion generation framework that introduces an Attention-based Mask Modeling approach, enabling fine-grained spatial and temporal control over key frames and actions. Our model adaptively encodes multimodal conditions, including text and music, improving controllability. Additionally, we introduce Text-Motion-Dance (TMD), a new motion dataset consisting of 2,153 pairs of text, music, and dance, making it twice the size of AIST++, thereby filling a critical gap in the community. Extensive experiments demonstrate that Motion Anything surpasses state-of-the-art methods across multiple benchmarks, achieving a 15% improvement in FID on HumanML3D and showing consistent performance gains on AIST++ and TMD. See our project website at this https URL.
138. 【2503.06954】Approximate Size Targets Are Sufficient for Accurate Semantic Segmentation
Link: https://arxiv.org/abs/2503.06954
Authors: Xingye Fan, Zhongwen (Rex) Zhang, Yuri Boykov
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: relative object-size distributions, extending binary class, approximate relative object-size, binary class tags, extending binary
Comments:
Abstract:This paper demonstrates a surprising result for segmentation with image-level targets: extending binary class tags to approximate relative object-size distributions allows off-the-shelf architectures to solve the segmentation problem. A straightforward zero-avoiding KL-divergence loss for average predictions produces segmentation accuracy comparable to the standard pixel-precise supervision with full ground truth masks. In contrast, current results based on class tags typically require complex non-reproducible architectural modifications and specialized multi-stage training procedures. Our ideas are validated on PASCAL VOC using our new human annotations of approximate object sizes. We also show the results on COCO and medical data using synthetically corrupted size targets. All standard networks demonstrate robustness to the size targets' errors. For some classes, the validation accuracy is significantly better than the pixel-level supervision; the latter is not robust to errors in the masks. Our work provides new ideas and insights on image-level supervision in segmentation and may encourage other simple general solutions to the problem.
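As a rough illustration of the loss described in this abstract, the sketch below compares the image-average of per-pixel softmax predictions with an approximate size target using a forward KL divergence, KL(target || average prediction), which is the direction usually called zero-avoiding. The function names and the `eps` clamp are assumptions for illustration; the paper's exact formulation may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_prediction(pixel_logits):
    """Mean class distribution over all pixels of one image."""
    probs = [softmax(p) for p in pixel_logits]
    n_classes = len(probs[0])
    return [sum(p[k] for p in probs) / len(probs) for k in range(n_classes)]


def size_target_loss(pixel_logits, size_target, eps=1e-8):
    """Forward KL between the size target and the average prediction.

    size_target holds the approximate fraction of the image covered by each
    class. Classes with a positive target but near-zero predicted area are
    penalized heavily, hence 'zero-avoiding'. eps is an assumed safety clamp.
    """
    avg = average_prediction(pixel_logits)
    return sum(
        v * math.log(v / max(q, eps))
        for v, q in zip(size_target, avg)
        if v > 0.0
    )


# Four pixels, two classes: three pixels confidently class 0, one class 1.
logits = [[10.0, -10.0]] * 3 + [[-10.0, 10.0]]
matched = size_target_loss(logits, [0.75, 0.25])     # close to zero
mismatched = size_target_loss(logits, [0.25, 0.75])  # clearly positive
```

Note that the loss only constrains how much area each class covers, not where it lies, which is why it can be driven by image-level size annotations alone.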
139. 【2503.06948】Large Language Model Guided Progressive Feature Alignment for Multimodal UAV Object Detection
Link: https://arxiv.org/abs/2503.06948
Authors: Wentao Wu, Chenglong Li, Xiao Wang, Bin Luo, Qi Liu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Existing multimodal UAV, multimodal UAV object, UAV object detection, Large Language Model, UAV object
Comments:
Abstract:Existing multimodal UAV object detection methods often overlook the impact of semantic gaps between modalities, which makes it difficult to achieve accurate semantic and spatial alignments, limiting detection performance. To address this problem, we propose a Large Language Model (LLM) guided Progressive feature Alignment Network called LPANet, which leverages the semantic features extracted from a large language model to guide the progressive semantic and spatial alignment between modalities for multimodal UAV object detection. To employ the powerful semantic representation of LLM, we generate the fine-grained text descriptions of each object category by ChatGPT and then extract the semantic features using the large language model MPNet. Based on the semantic features, we guide the semantic and spatial alignments in a progressive manner as follows. First, we design the Semantic Alignment Module (SAM) to pull the semantic features and multimodal visual features of each object closer, alleviating the semantic differences of objects between modalities. Second, we design the Explicit Spatial alignment Module (ESM) by integrating the semantic relations into the estimation of feature-level offsets, alleviating the coarse spatial misalignment between modalities. Finally, we design the Implicit Spatial alignment Module (ISM), which leverages the cross-modal correlations to aggregate key features from neighboring regions to achieve implicit spatial alignment. Comprehensive experiments on two public multimodal UAV object detection datasets demonstrate that our approach outperforms state-of-the-art multimodal UAV object detectors.
140. 【2503.06947】Aligning Instance-Semantic Sparse Representation towards Unsupervised Object Segmentation and Shape Abstraction with Repeatable Primitives
Link: https://arxiv.org/abs/2503.06947
Authors: Jiaxin Li, Hongxing Wang, Jiawei Tan, Zhilong Ou, Junsong Yuan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: object parts, object, shape, abstracted from results, object parts abstracted
Comments: 15 pages, 15 figures, 8 tables
Abstract:Understanding 3D object shapes necessitates shape representation by object parts abstracted from results of instance and semantic segmentation. Promising shape representations enable computers to interpret a shape with meaningful parts and identify their repeatability. However, supervised shape representations depend on costly annotation efforts, while current unsupervised methods work under strong semantic priors and involve multi-stage training, thereby limiting their generalization and deployment in shape reasoning and understanding. Driven by the tendency of high-dimensional semantically similar features to lie in or near low-dimensional subspaces, we introduce a one-stage, fully unsupervised framework towards semantic-aware shape representation. This framework produces joint instance segmentation, semantic segmentation, and shape abstraction through sparse representation and feature alignment of object parts in a high-dimensional space. For sparse representation, we devise a sparse latent membership pursuit method that models each object part feature as a sparse convex combination of point features at either the semantic or instance level, promoting part features in the same subspace to exhibit similar semantics. For feature alignment, we customize an attention-based strategy in the feature space to align instance- and semantic-level object part features and reconstruct the input shape using both of them, ensuring geometric reusability and semantic consistency of object parts. To firm up semantic disambiguation, we construct cascade unfrozen learning on geometric parameters of object parts.
141. 【2503.06940】CineBrain: A Large-Scale Multi-Modal Brain Dataset During Naturalistic Audiovisual Narrative Processing
Link: https://arxiv.org/abs/2503.06940
Authors: Jianxiong Gao, Yichang Liu, Baofeng Yang, Jianfeng Feng, Yanwei Fu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: featuring simultaneous EEG, Big Bang Theory, dynamic audiovisual stimulation, large-scale dataset featuring, dataset featuring simultaneous
Comments: 14 pages, 13 figures
Abstract:In this paper, we introduce CineBrain, the first large-scale dataset featuring simultaneous EEG and fMRI recordings during dynamic audiovisual stimulation. Recognizing the complementary strengths of EEG's high temporal resolution and fMRI's deep-brain spatial coverage, CineBrain provides approximately six hours of narrative-driven content from the popular television series The Big Bang Theory for each of six participants. Building upon this unique dataset, we propose CineSync, an innovative multimodal decoding framework that integrates a Multi-Modal Fusion Encoder with a diffusion-based Neural Latent Decoder. Our approach effectively fuses EEG and fMRI signals, significantly improving the reconstruction quality of complex audiovisual stimuli. To facilitate rigorous evaluation, we introduce Cine-Benchmark, a comprehensive evaluation protocol that assesses reconstructions across semantic and perceptual dimensions. Experimental results demonstrate that CineSync achieves state-of-the-art video reconstruction performance and highlight our initial success in combining fMRI and EEG for reconstructing both video and audio stimuli. Project Page: this https URL.
142. 【2503.06938】Modeling Human Skeleton Joint Dynamics for Fall Detection
Link: https://arxiv.org/abs/2503.06938
Authors: Sania Zahan, Ghulam Mubashar Hassan, Ajmal Mian
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: population aging calls, support systems, increasing pace, pace of population, population aging
Comments: Published in 2021 Digital Image Computing: Techniques and Applications (DICTA)
Abstract:The increasing pace of population aging calls for better care and support systems. Falling is a frequent and critical problem for elderly people causing serious long-term health issues. Fall detection from video streams is not an attractive option for real-life applications due to privacy issues. Existing methods try to resolve this issue by using very low-resolution cameras or video encryption. However, privacy cannot be ensured completely with such approaches. Key points on the body, such as skeleton joints, can convey significant information about motion dynamics and successive posture changes which are crucial for fall detection. Skeleton joints have been explored for feature extraction but with image recognition models that ignore joint dependency across frames which is important for the classification of actions. Moreover, existing models are over-parameterized or evaluated on small datasets with very few activity classes. We propose an efficient graph convolution network model that exploits spatio-temporal joint dependencies and dynamics of human skeleton joints for accurate fall detection. Our method leverages dynamic representation with robust concurrent spatio-temporal characteristics of skeleton joints. We performed extensive experiments on three large-scale datasets. With a significantly smaller model size than most existing methods, our proposed method achieves state-of-the-art results on the large scale NTU datasets.
143. 【2503.06934】LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs
Link: https://arxiv.org/abs/2503.06934
Authors: Hanyu Zhou, Gim Hee Lee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large multimodal models, spatiotemporal reasoning due, fine-grained spatiotemporal reasoning, multimodal models, Large multimodal
Comments:
Abstract:Large multimodal models (LMMs) excel in scene understanding but struggle with fine-grained spatiotemporal reasoning due to weak alignment between linguistic and visual representations. Existing methods map textual positions and durations into the visual space encoded from frame-based videos, but suffer from temporal sparsity that limits language-vision temporal coordination. To address this issue, we introduce LLaFEA (Large Language and Frame-Event Assistant) to leverage event cameras for temporally dense perception and frame-event fusion. Our approach employs a cross-attention mechanism to integrate complementary spatial and temporal features, followed by self-attention matching for global spatio-temporal associations. We further embed textual position and duration tokens into the fused visual space to enhance fine-grained alignment. This unified framework ensures robust spatio-temporal coordinate alignment, enabling LMMs to interpret scenes at any position and any time. In addition, we construct a dataset of real-world frames-events with coordinate instructions and conduct extensive experiments to validate the effectiveness of the proposed method.
144. 【2503.06930】Post-Training Quantization for Diffusion Transformer via Hierarchical Timestep Grouping
Link: https://arxiv.org/abs/2503.06930
Authors: Ning Ding, Jing Han, Yuchuan Tian, Chao Xu, Kai Han, Yehui Tang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: great generation capability, building image generation, Diffusion Transformer, preferred choice, choice for building
Comments:
Abstract:Diffusion Transformer (DiT) has now become the preferred choice for building image generation models due to its great generation capability. Unlike previous convolution-based UNet models, DiT is purely composed of a stack of transformer blocks, which renders DiT excellent in scalability like large language models. However, the growing model size and multi-step sampling paradigm bring about considerable pressure on deployment and inference. In this work, we propose a post-training quantization framework tailored for Diffusion Transformers to tackle these challenges. We first find that the quantization difficulty of DiT mainly originates from the time-dependent channel-specific outliers. We propose a timestep-aware shift-and-scale strategy to smooth the activation distribution to reduce the quantization error. Secondly, based on the observation that activations of adjacent timesteps have similar distributions, we utilize a hierarchical clustering scheme to divide the denoising timesteps into multiple groups. We further design a re-parameterization scheme which absorbs the quantization parameters into nearby modules to avoid redundant computations. Comprehensive experiments demonstrate that our PTQ method successfully quantizes the Diffusion Transformer into 8-bit weight and 8-bit activation (W8A8) with state-of-the-art FID score. Our method can further quantize the DiT model into 4-bit weight and 8-bit activation (W4A8) without sacrificing generation quality.
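The timestep-grouping step can be sketched as a small agglomerative procedure. This is only an illustration under stated assumptions: each timestep is summarized by a single activation statistic (for example, its maximum absolute activation), only adjacent groups are merged, and each group then shares one symmetric quantization scale. The names `group_timesteps` and `group_scales` are hypothetical and the paper's clustering criterion may differ.

```python
def group_timesteps(stats, num_groups):
    """Merge adjacent timesteps bottom-up until num_groups groups remain.

    stats[t] is an assumed per-timestep activation statistic. At each step
    the two neighbouring groups with the closest mean statistic are merged,
    so every group is a contiguous range of timesteps.
    """
    groups = [[t] for t in range(len(stats))]

    def mean(group):
        return sum(stats[t] for t in group) / len(group)

    while len(groups) > num_groups:
        gaps = [abs(mean(groups[i]) - mean(groups[i + 1]))
                for i in range(len(groups) - 1)]
        i = gaps.index(min(gaps))
        groups[i:i + 2] = [groups[i] + groups[i + 1]]
    return groups


def group_scales(stats, groups, bits=8):
    """One symmetric quantization scale per group, from the group's largest statistic."""
    qmax = 2 ** (bits - 1) - 1
    return [max(stats[t] for t in g) / qmax for g in groups]


# Six timesteps whose activation ranges fall into three natural clusters.
stats = [1.0, 1.5, 2.0, 8.0, 8.5, 20.0]
groups = group_timesteps(stats, num_groups=3)
scales = group_scales(stats, groups)
```

Sharing one scale per group rather than per timestep keeps the number of stored quantization parameters small while still tracking how activation ranges drift across the denoising schedule.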
145. 【2503.06923】From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers
Link: https://arxiv.org/abs/2503.06923
Authors: Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Junjie Chen, Linfeng Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: computational demands remain, demands remain prohibitive, Diffusion Transformers, revolutionized high-fidelity image, real-time applications
Comments: 13 pages, 14 figures
Abstract:Diffusion Transformers (DiT) have revolutionized high-fidelity image and video synthesis, yet their computational demands remain prohibitive for real-time applications. To solve this problem, feature caching has been proposed to accelerate diffusion models by caching the features in the previous timesteps and then reusing them in the following timesteps. However, at timesteps with significant intervals, the feature similarity in diffusion models decreases substantially, leading to a pronounced increase in errors introduced by feature caching, significantly harming the generation quality. To solve this problem, we propose TaylorSeer, which firstly shows that features of diffusion models at future timesteps can be predicted based on their values at previous timesteps. Based on the fact that features change slowly and continuously across timesteps, TaylorSeer employs a differential method to approximate the higher-order derivatives of features and predict features in future timesteps with Taylor series expansion. Extensive experiments demonstrate its significant effectiveness in both image and video synthesis, especially in high acceleration ratios. For instance, it achieves an almost lossless acceleration of 4.99× on FLUX and 5.00× on HunyuanVideo without additional training. On DiT, it achieves 3.41 lower FID compared with previous SOTA at 4.53× acceleration. Our code has been released on GitHub: this https URL
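The forecasting idea can be sketched in a few lines: approximate derivatives of a cached feature with backward finite differences over equally spaced past timesteps, then extrapolate with a truncated Taylor series. This is a minimal sketch, not the released implementation; the function names, the list-of-floats feature representation, and the `interval` parameter are assumptions for illustration.

```python
from math import factorial


def finite_differences(history, order):
    """Backward differences of the newest feature, for orders 0..order.

    history is a list of feature vectors (lists of floats), oldest first,
    assumed to be sampled at equally spaced timesteps.
    """
    diffs = [list(history[-1])]
    level = [list(v) for v in history]
    for _ in range(order):
        level = [[b - a for a, b in zip(prev, cur)]
                 for prev, cur in zip(level, level[1:])]
        diffs.append(list(level[-1]))
    return diffs


def taylor_forecast(history, steps_ahead, order=2, interval=1):
    """Predict the feature steps_ahead timesteps after the newest cached one.

    order=0 reduces to plain cache reuse; higher orders add the difference
    terms weighted by (steps_ahead / interval) ** i / i!.
    """
    order = min(order, len(history) - 1)
    diffs = finite_differences(history, order)
    ratio = steps_ahead / interval
    pred = [0.0] * len(history[-1])
    for i, d in enumerate(diffs):
        coeff = ratio ** i / factorial(i)
        pred = [p + coeff * x for p, x in zip(pred, d)]
    return pred


# A feature that drifts linearly is forecast exactly; reuse alone lags behind.
cached = [[1.0, 0.0], [3.0, -1.0], [5.0, -2.0]]
forecast = taylor_forecast(cached, steps_ahead=2)        # [9.0, -4.0]
reused = taylor_forecast(cached, steps_ahead=2, order=0)  # [5.0, -2.0]
```

The comparison between `forecast` and `reused` shows the motivation stated in the abstract: when the skipped interval is long, reusing the last cached feature accumulates error that a low-order extrapolation can partly remove.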
146. 【2503.06903】When Lighting Deceives: Exposing Vision-Language Models' Illumination Vulnerability Through Illumination Transformation Attack
Link: https://arxiv.org/abs/2503.06903
Authors: Hanqing Liu,Shouwei Ruan,Yao Huang,Shiji Zhao,Xingxing Wei
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: remains largely unexplored, achieved remarkable success, variations remains largely, largely unexplored
Comments:
Click to view abstract
Abstract:Vision-Language Models (VLMs) have achieved remarkable success in various tasks, yet their robustness to real-world illumination variations remains largely unexplored. To bridge this gap, we propose Illumination Transformation Attack (ITA), the first framework to systematically assess VLMs' robustness against illumination changes. However, there still exist two key challenges: (1) how to model global illumination with fine-grained control to achieve diverse lighting conditions and (2) how to ensure adversarial effectiveness while maintaining naturalness. To address the first challenge, we innovatively decompose global illumination into multiple parameterized point light sources based on the illumination rendering equation. This design enables us to model more diverse lighting variations that previous methods could not capture. Then, by integrating these parameterized lighting variations with physics-based lighting reconstruction techniques, we could precisely render such light interactions in the original scenes, finally meeting the goal of fine-grained lighting control. For the second challenge, by controlling illumination through the lighting reconstruction model's latent space rather than direct pixel manipulation, we inherently preserve physical lighting priors. Furthermore, to prevent potential reconstruction artifacts, we design additional perceptual constraints for maintaining visual consistency with original images and diversity constraints for avoiding light source convergence. Extensive experiments demonstrate that our ITA could significantly reduce the performance of advanced VLMs, e.g., LLaVA-1.6, while possessing competitive naturalness, exposing VLMs' critical illumination vulnerabilities.
147. 【2503.06901】Iterative Prompt Relocation for Distribution-Adaptive Visual Prompt Tuning
Link: https://arxiv.org/abs/2503.06901
Authors: Chikai Shang,Mengke Li,Yiqun Zhang,Zhen Chen,Jinlin Wu,Fangqing Gu,Yang Lu,Yiu-ming Cheung
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Visual prompt tuning, adapting pre-trained models, incorporating learnable prompts, Visual prompt, VPT
Comments:
Click to view abstract
Abstract:Visual prompt tuning (VPT) provides an efficient and effective solution for adapting pre-trained models to various downstream tasks by incorporating learnable prompts. However, most prior art indiscriminately applies a fixed prompt distribution across different tasks, neglecting the importance of each block differing depending on the task. In this paper, we investigate adaptive distribution optimization (ADO) by addressing two key questions: (1) How to appropriately and formally define ADO, and (2) How to design an adaptive distribution strategy guided by this definition? Through in-depth analysis, we provide an affirmative answer that properly adjusting the distribution significantly improves VPT performance, and further uncover a key insight that a nested relationship exists between ADO and VPT. Based on these findings, we propose a new VPT framework, termed PRO-VPT (iterative Prompt RelOcation-based VPT), which adaptively adjusts the distribution building upon a nested optimization formulation. Specifically, we develop a prompt relocation strategy for ADO derived from this formulation, comprising two optimization steps: identifying and pruning idle prompts, followed by determining the optimal blocks for their relocation. By iteratively performing prompt relocation and VPT, our proposal adaptively learns the optimal prompt distribution, thereby unlocking the full potential of VPT. Extensive experiments demonstrate that our proposal significantly outperforms state-of-the-art VPT methods, e.g., PRO-VPT surpasses VPT by 1.6% average accuracy, leading prompt-based methods to state-of-the-art performance on the VTAB-1k benchmark. The code is available at this https URL.
148. 【2503.06900】DirectTriGS: Triplane-based Gaussian Splatting Field Representation for 3D Generation
Link: https://arxiv.org/abs/2503.06900
Authors: Xiaoliang Ju,Hongsheng Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: present DirectTriGS, Gaussian Splatting, represent Gaussian Splatting, Gaussian, Gaussian point clouds
Comments: Accepted by CVPR 2025
Click to view abstract
Abstract:We present DirectTriGS, a novel framework designed for 3D object generation with Gaussian Splatting (GS). GS-based rendering for 3D content has gained considerable attention recently. However, there has been limited exploration in directly generating 3D Gaussians compared to traditional generative modeling approaches. The main challenge lies in the complex data structure of GS represented by discrete point clouds with multiple channels. To overcome this challenge, we propose employing the triplane representation, which allows us to represent Gaussian Splatting as an image-like continuous field. This representation effectively encodes both the geometry and texture information, enabling smooth transformation back to Gaussian point clouds and rendering into images by a TriRenderer, with only 2D supervisions. The proposed TriRenderer is fully differentiable, so that the rendering loss can supervise both texture and geometry encoding. Furthermore, the triplane representation can be compressed using a Variational Autoencoder (VAE), which can subsequently be utilized in latent diffusion to generate 3D objects. The experiments demonstrate that the proposed generation framework can produce high-quality 3D object geometry and rendering results in the text-to-3D task.
149. 【2503.06898】Illuminating Darkness: Enhancing Real-world Low-light Scenes with Smartphone Images
Link: https://arxiv.org/abs/2503.06898
Authors: S M A Sharif,Abdur Rehman,Zain Ul Abidin,Rizwan Ali Naqvi,Fayaz Ali Dharejo,Radu Timofte
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Digital cameras, produce plausible images, cameras often struggle, struggle to produce, produce plausible
Comments:
Click to view abstract
Abstract:Digital cameras often struggle to produce plausible images in low-light conditions. Improving these single-shot images remains challenging due to a lack of diverse real-world pair data samples. To address this limitation, we propose a large-scale high-resolution (i.e., beyond 4k) pair Single-Shot Low-Light Enhancement (SLLIE) dataset. Our dataset comprises 6,425 unique focus-aligned image pairs captured with smartphone sensors in dynamic settings under challenging lighting conditions (0.1--200 lux), covering various indoor and outdoor scenes with varying noise and intensity. We extracted and refined around 180,000 non-overlapping patches from 6,025 collected scenes for training while reserving 400 pairs for benchmarking. In addition to that, we collected 2,117 low-light scenes from different sources for extensive real-world aesthetic evaluation. To our knowledge, this is the largest real-world dataset available for SLLIE research. We also propose learning luminance-chrominance (LC) attributes separately through a tuning fork-shaped transformer model to enhance real-world low-light images, addressing challenges like denoising and over-enhancement in complex scenes. We also propose an LC cross-attention block for feature fusion, an LC refinement block for enhanced reconstruction, and LC-guided supervision to ensure perceptually coherent enhancements. We demonstrated our method's effectiveness across various hardware and scenarios, proving its practicality in real-world applications. Code and dataset available at this https URL.
150. 【2503.06897】HiSTF Mamba: Hierarchical Spatiotemporal Fusion with Multi-Granular Body-Spatial Modeling for High-Fidelity Text-to-Motion Generation
Link: https://arxiv.org/abs/2503.06897
Authors: Xingzu Zhan,Chen Xie,Haoran Sun,Xiaochun Mai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: rapidly growing field, computer graphics, promising flexible, applications in gaming, virtual reality
Comments: 11 pages, 3 figures
Click to view abstract
Abstract:Text-to-motion generation is a rapidly growing field at the nexus of multimodal learning and computer graphics, promising flexible and cost-effective applications in gaming, animation, robotics, and virtual reality. Existing approaches often rely on simple spatiotemporal stacking, which introduces feature redundancy, while subtle joint-level details remain overlooked from a spatial perspective. To this end, we propose a novel HiSTF Mamba framework. The framework is composed of three key modules: Dual-Spatial Mamba, Bi-Temporal Mamba, and Dynamic Spatiotemporal Fusion Module (DSFM). Dual-Spatial Mamba incorporates "Part-based + Whole-based" parallel modeling to represent both whole-body coordination and fine-grained joint dynamics. Bi-Temporal Mamba adopts a bidirectional scanning strategy, effectively encoding short-term motion details and long-term dependencies. DSFM further performs redundancy removal and extraction of complementary information for temporal features, then fuses them with spatial features, yielding an expressive spatio-temporal representation. Experimental results on the HumanML3D dataset demonstrate that HiSTF Mamba achieves state-of-the-art performance across multiple metrics. In particular, it reduces the FID score from 0.283 to 0.189, a relative decrease of nearly 30%. These findings validate the effectiveness of HiSTF Mamba in achieving high fidelity and strong semantic alignment in text-to-motion generation.
151. 【2503.06896】CATANet: Efficient Content-Aware Token Aggregation for Lightweight Image Super-Resolution
Link: https://arxiv.org/abs/2503.06896
Authors: Xin Liu,Jie Liu,Jie Tang,Gangshan Wu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: low-level visual tasks, demonstrated impressive performance, Transformer-based methods, demonstrated impressive, low-level visual
Comments: Accepted by CVPR2025
Click to view abstract
Abstract:Transformer-based methods have demonstrated impressive performance in low-level visual tasks such as Image Super-Resolution (SR). However, their computational complexity grows quadratically with the spatial resolution. A series of works attempt to alleviate this problem by dividing Low-Resolution images into local windows, axial stripes, or dilated windows. SR typically leverages the redundancy of images for reconstruction, and this redundancy appears not only in local regions but also in long-range regions. However, these methods limit attention computation to content-agnostic local regions, which directly limits the ability of attention to capture long-range dependencies. To address these issues, we propose a lightweight Content-Aware Token Aggregation Network (CATANet). Specifically, we propose an efficient Content-Aware Token Aggregation module for aggregating long-range content-similar tokens, which shares token centers across all image tokens and updates them only during the training phase. Then we utilize intra-group self-attention to enable long-range information interaction. Moreover, we design an inter-group cross-attention to further enhance global information interaction. The experimental results show that, compared with the state-of-the-art cluster-based method SPIN, our method achieves superior performance, with a maximum PSNR improvement of 0.33dB and nearly double the inference speed.
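As a rough illustration of content-aware grouping followed by attention inside each group, the sketch below assigns every token to its most similar center and then runs plain scaled dot-product attention among the members of each group. It is a simplified stand-in, not the CATANet module: the cosine-similarity assignment, the absence of learned query/key/value projections, the fixed centers, and all function names are assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

def group_tokens(tokens, centers):
    """Assign each token index to its most similar center (content-aware groups)."""
    groups = {i: [] for i in range(len(centers))}
    for idx, tok in enumerate(tokens):
        best = max(range(len(centers)), key=lambda c: cosine(tok, centers[c]))
        groups[best].append(idx)
    return groups

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def intra_group_attention(tokens, groups):
    """Scaled dot-product self-attention restricted to the members of each group."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = [None] * len(tokens)
    for members in groups.values():
        for i in members:
            weights = softmax([dot(tokens[i], tokens[j]) * scale for j in members])
            out[i] = [sum(w * tokens[j][d] for w, j in zip(weights, members))
                      for d in range(dim)]
    return out

# Tokens 0 and 1 resemble the first center; tokens 2 and 3 resemble the second.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
centers = [[1.0, 0.0], [0.0, 1.0]]
groups = group_tokens(tokens, centers)
mixed = intra_group_attention(tokens, groups)
```

Because attention is computed only within a group, the cost scales with the group sizes rather than with the full token count, and tokens that are far apart in the image can still interact when their content is similar.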
152. 【2503.06894】Improving cognitive diagnostics in pathology: a deep learning approach for augmenting perceptional understanding of histopathology images
Link: https://arxiv.org/abs/2503.06894
Authors: Xiaoqian Hu
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Made Significant Strides, Recent Years, Made Significant, Significant Strides, Combines Vision Transformers
Comments:
Click to view abstract
Abstract:In recent years, digital technologies have made significant strides in augmenting human health, cognition, and perception, particularly within the field of computational pathology. This paper presents a novel approach to enhancing the analysis of histopathology images by leveraging a multi-modal model that combines Vision Transformers (ViT) with GPT-2 for image captioning. The model is fine-tuned on the specialized ARCH dataset, which includes dense image captions derived from clinical and academic resources, to capture the complexities of pathology images such as tissue morphologies, staining variations, and pathological conditions. By generating accurate, contextually relevant captions, the model augments the cognitive capabilities of healthcare professionals, enabling more efficient disease classification, segmentation, and detection. The model enhances the perception of subtle pathological features in images that might otherwise go unnoticed, thereby improving diagnostic accuracy. Our approach demonstrates the potential for digital technologies to augment human cognitive abilities in medical image analysis, providing steps toward more personalized and accurate healthcare outcomes.
153. 【2503.06887】Accessing the Effect of Phyllotaxy and Planting Density on Light Use Efficiency in Field-Grown Maize using 3D Reconstructions
Link: https://arxiv.org/abs/2503.06887
Authors: Nasla Saleem,Talukder Zaki Jubery,Aditya Balu,Yan Zhou,Yawei Li,Patrick S. Schnable,Adarsh Krishnamurthy,Baskar Ganapathysubramanian
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: widely adopted strategy, increased interplant competition, enhance maize productivity, limit light capture, light capture
Comments: 17 pages, 8 figures
Click to view abstract
Abstract:High-density planting is a widely adopted strategy to enhance maize productivity, yet it introduces challenges such as increased interplant competition and shading, which can limit light capture and overall yield potential. In response, some maize plants naturally reorient their canopies to optimize light capture, a process known as canopy reorientation. Understanding this adaptive response and its impact on light capture is crucial for maximizing agricultural yield potential. This study introduces an end-to-end framework that integrates realistic 3D reconstructions of field-grown maize with photosynthetically active radiation (PAR) modeling to assess the effects of phyllotaxy and planting density on light interception. In particular, using 3D point clouds derived from field data, virtual fields for a diverse set of maize genotypes were constructed and validated against field PAR measurements. Using this framework, we present detailed analyses of the impact of canopy orientations, plant and row spacings, and planting row directions on PAR interception throughout a typical growing season. Our findings highlight significant variations in light interception efficiency across different planting densities and canopy orientations. By elucidating the relationship between canopy architecture and light capture, this study offers valuable guidance for optimizing maize breeding and cultivation strategies across diverse agricultural settings.
154. 【2503.06885】ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks
Link: https://arxiv.org/abs/2503.06885
Authors: Yan Yang,Dongxu Li,Haoning Wu,Bei Chen,Liu Liu,Liyuan Pan,Junnan Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Solving expert-level multimodal, Solving expert-level, expert-level multimodal tasks, key milestone, milestone towards general
Comments:
Click to view abstract
Abstract:Solving expert-level multimodal tasks is a key milestone towards general intelligence. As the capabilities of multimodal large language models (MLLMs) continue to improve, evaluation of such advanced multimodal intelligence becomes necessary yet challenging. In this work, we introduce ProBench, a benchmark of open-ended user queries that require professional expertise and advanced reasoning. ProBench consists of 4,000 high-quality samples independently submitted by professionals based on their daily productivity demands. It spans across 10 fields and 56 sub-fields, including science, arts, humanities, coding, mathematics, and creative writing. Experimentally, we evaluate and compare 24 latest models using MLLM-as-a-Judge. Our results reveal that although the best open-source models rival the proprietary ones, ProBench presents significant challenges in visual perception, textual understanding, domain knowledge and advanced reasoning, thus providing valuable directions for future multimodal AI research efforts.
155. 【2503.06884】Text-to-Image Diffusion Models Cannot Count, and Prompt Refinement Cannot Help
Link: https://arxiv.org/abs/2503.06884
Authors: Yuefan Cao,Xuyang Guo,Jiayan Huo,Yingyu Liang,Zhenmei Shi,Zhao Song,Jiahao Zhang,Zhen Zhuang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: unprecedented real-world impacts, gained unprecedented real-world, today AI community, real-world impacts, modeling is widely
Comments:
Click to view abstract
Abstract:Generative modeling is widely regarded as one of the most essential problems in today's AI community, with text-to-image generation having gained unprecedented real-world impacts. Among various approaches, diffusion models have achieved remarkable success and have become the de facto solution for text-to-image generation. However, despite their impressive performance, these models exhibit fundamental limitations in adhering to numerical constraints in user instructions, frequently generating images with an incorrect number of objects. While several prior works have mentioned this issue, a comprehensive and rigorous evaluation of this limitation remains lacking. To address this gap, we introduce T2ICountBench, a novel benchmark designed to rigorously evaluate the counting ability of state-of-the-art text-to-image diffusion models. Our benchmark encompasses a diverse set of generative models, including both open-source and private systems. It explicitly isolates counting performance from other capabilities, provides structured difficulty levels, and incorporates human evaluations to ensure high reliability. Extensive evaluations with T2ICountBench reveal that all state-of-the-art diffusion models fail to generate the correct number of objects, with accuracy dropping significantly as the number of objects increases. Additionally, an exploratory study on prompt refinement demonstrates that such simple interventions generally do not improve counting accuracy. Our findings highlight the inherent challenges in numerical understanding within diffusion models and point to promising directions for future improvements.
156. 【2503.06873】Interactive Medical Image Analysis with Concept-based Similarity Reasoning
Link: https://arxiv.org/abs/2503.06873
Authors: Ta Duc Huy,Sen Kim Tran,Phan Nguyen,Nguyen Hoang Tran,Tran Bao Sam,Anton van den Hengel,Zhibin Liao,Johan W. Verjans,Minh-Son To,Vu Minh Hieu Phan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: intervene model decisions, computer-aided diagnosis methods, clinical workflows, ability to interpret, interpret and intervene
Comments: Accepted CVPR2025
Click to view abstract
Abstract:The ability to interpret and intervene in model decisions is important for the adoption of computer-aided diagnosis methods in clinical workflows. Recent concept-based methods link the model predictions with interpretable concepts and modify their activation scores to interact with the model. However, these concepts are at the image level, which hinders the model from pinpointing the exact patches where the concepts are activated. Alternatively, prototype-based methods learn representations from training image patches and compare these with test image patches, using the similarity scores for final class prediction. However, interpreting the underlying concepts of these patches can be challenging and often necessitates post-hoc guesswork. To address this issue, this paper introduces the novel Concept-based Similarity Reasoning network (CSR), which offers (i) patch-level prototypes with intrinsic concept interpretation, and (ii) spatial interactivity. First, the proposed CSR provides localized explanation by grounding prototypes of each concept on image regions. Second, our model introduces novel spatial-level interaction, allowing doctors to engage directly with specific image areas, making it an intuitive and transparent tool for medical imaging. CSR improves upon prior state-of-the-art interpretable methods by up to 4.5% across three biomedical datasets. Our code is released at this https URL.
157. 【2503.06863】HIF: Height Interval Filtering for Efficient Dynamic Points Removal
Link: https://arxiv.org/abs/2503.06863
Authors: Shufang Zhang,Tao Jiang,Jiazheng Wu,Ziyu Meng,Ziyang Zhang,Shan An
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: cloud mapping plays, point cloud mapping, autonomous navigation, mapping plays, plays an essential
Comments:
Click to view abstract
Abstract:3D point cloud mapping plays an essential role in localization and autonomous navigation. However, dynamic objects often leave residual traces during the map construction process, which undermine the performance of subsequent tasks. Therefore, dynamic object removal has become a critical challenge in point-cloud-based map construction within dynamic scenarios. Existing approaches, however, often incur significant computational overhead, making it difficult to meet real-time processing requirements. To address this issue, we introduce the Height Interval Filtering (HIF) method. This approach constructs pillar-based height interval representations to probabilistically model the vertical dimension, with interval probabilities updated through Bayesian inference. It ensures real-time performance while achieving high accuracy and improving robustness in complex environments. Additionally, we propose a low-height preservation strategy that enhances the detection of unknown spaces, reducing misclassification in areas blocked by obstacles (occluded regions). Experiments on public datasets demonstrate that HIF delivers a 7.7 times improvement in time efficiency with comparable accuracy to existing SOTA methods. The code will be publicly available.
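To make the phrase "interval probabilities updated through Bayesian inference" more concrete, here is a minimal sketch that keeps one log-odds value per vertical interval of each pillar and applies the standard occupancy-grid log-odds update. The abstract does not specify HIF's actual update rule, so this substitutes a textbook technique; the class name `HeightIntervalMap`, the cell sizes, and the hit/miss probabilities are hypothetical.

```python
import math
from collections import defaultdict

def logit(p):
    return math.log(p / (1.0 - p))

class HeightIntervalMap:
    """Pillars on an XY grid, each split into vertical intervals holding log-odds."""

    def __init__(self, cell=0.5, interval=0.25, p_hit=0.7, p_miss=0.4):
        self.cell, self.interval = cell, interval
        self.l_hit, self.l_miss = logit(p_hit), logit(p_miss)
        # A log-odds of 0.0 corresponds to probability 0.5 (unknown).
        self.log_odds = defaultdict(float)

    def key(self, x, y, z):
        """Map a 3D point to (pillar x index, pillar y index, height interval index)."""
        return (math.floor(x / self.cell), math.floor(y / self.cell),
                math.floor(z / self.interval))

    def update(self, point, occupied):
        """Bayesian update in log-odds form for the interval containing `point`."""
        self.log_odds[self.key(*point)] += self.l_hit if occupied else self.l_miss

    def probability(self, point):
        return 1.0 / (1.0 + math.exp(-self.log_odds[self.key(*point)]))

    def is_static(self, point, threshold=0.5):
        """Treat an interval as static when its occupancy probability is high."""
        return self.probability(point) > threshold

m = HeightIntervalMap()
wall, passerby = (1.0, 1.0, 0.3), (3.0, 3.0, 1.1)
for _ in range(3):
    m.update(wall, occupied=True)       # seen in every scan
m.update(passerby, occupied=True)       # seen once
for _ in range(3):
    m.update(passerby, occupied=False)  # then observed as free space
```

Points falling into intervals with low occupancy probability would be the candidates for removal as dynamic; how HIF decides free-space observations and handles occluded regions is described in the paper, not in this sketch.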
158. 【2503.06860】Towards Generalization of Tactile Image Generation: Reference-Free Evaluation in a Leakage-Free Setting
Link: https://arxiv.org/abs/2503.06860
Authors: Cagri Gungor,Derek Eppinger,Adriana Kovashka
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: direct physical contact, computer vision, multimodal learning, relies on direct, direct physical
Comments:
Click to view abstract
Abstract:Tactile sensing, which relies on direct physical contact, is critical for human perception and underpins applications in computer vision, robotics, and multimodal learning. Because tactile data is often scarce and costly to acquire, generating synthetic tactile images provides a scalable solution to augment real-world measurements. However, ensuring robust generalization in synthesizing tactile images (capturing subtle, material-specific contact features) remains challenging. We demonstrate that overlapping training and test samples in commonly used datasets inflate performance metrics, obscuring the true generalizability of tactile models. To address this, we propose a leakage-free evaluation protocol coupled with novel, reference-free metrics (TMMD, I-TMMD, CI-TMMD, and D-TMMD) tailored for tactile generation. Moreover, we propose a vision-to-touch generation method that leverages text as an intermediate modality by incorporating concise, material-specific descriptions during training to better capture essential tactile features. Experiments on two popular visuo-tactile datasets, Touch and Go and HCT, show that our approach achieves superior performance and enhanced generalization in a leakage-free setting.
159. 【2503.06859】ActiveInitSplat: How Active Image Selection Helps Gaussian Splatting
Link: https://arxiv.org/abs/2503.06859
Authors: Konstantinos D. Polyzos,Athanasios Bacharis,Saketh Madhuvarasu,Nikos Papanikolopoulos,Tara Javidi
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: meeting reduced storage, reduced storage demands, real-time scene rendering, computational efficiency, extensions and variants
Comments:
Click to view abstract
Abstract:Gaussian splatting (GS) along with its extensions and variants provides outstanding performance in real-time scene rendering while meeting reduced storage demands and computational efficiency. While the selection of 2D images capturing the scene of interest is crucial for the proper initialization and training of GS, hence markedly affecting the rendering performance, prior works rely on passively and typically densely selected 2D images. In contrast, this paper proposes 'ActiveInitSplat', a novel framework for active selection of training images for proper initialization and training of GS. ActiveInitSplat relies on density and occupancy criteria of the resultant 3D scene representation from the selected 2D images, to ensure that the latter are captured from diverse viewpoints leading to better scene coverage and that the initialized Gaussian functions are well aligned with the actual 3D structure. Numerical tests on well-known simulated and real environments demonstrate the merits of ActiveInitSplat resulting in significant GS rendering performance improvement over passive GS baselines, in the widely adopted LPIPS, SSIM, and PSNR metrics.
160. 【2503.06852】From Image- to Pixel-level: Label-efficient Hyperspectral Image Reconstruction
Link: https://arxiv.org/abs/2503.06852
Authors: Yihong Leng,Jiaojiao Li,Haitao Xu,Rui Song
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Current hyperspectral image, form abundant high-quality, abundant high-quality HSIs, Current hyperspectral, methods primarily rely
Comments:
Click to view abstract
Abstract:Current hyperspectral image (HSI) reconstruction methods primarily rely on image-level approaches, which are time-consuming to form abundant high-quality HSIs through imagers. In contrast, spectrometers offer a more efficient alternative by capturing high-fidelity point spectra, enabling pixel-level HSI reconstruction that balances accuracy and label efficiency. To this end, we introduce a pixel-level spectral super-resolution (Pixel-SSR) paradigm that reconstructs HSI from RGB and point spectra. Despite its advantages, Pixel-SSR presents two key challenges: 1) generalizability to novel scenes lacking point spectra, and 2) effective information extraction to promote reconstruction accuracy. To address the first challenge, a Gamma-modeled strategy is investigated to synthesize point spectra based on their intrinsic properties, including nonnegativity, a skewed distribution, and a positive correlation. Furthermore, complementary three-branch prompts from RGB and point spectra are extracted with a Dynamic Prompt Mamba (DyPro-Mamba), which progressively directs the reconstruction with global spatial distributions, edge details, and spectral dependency. Comprehensive evaluations, including horizontal comparisons with leading methods and vertical assessments across unsupervised and image-level supervised paradigms, demonstrate that ours achieves competitive reconstruction accuracy with efficient label consumption.
161. 【2503.06847】MADS: Multi-Attribute Document Supervision for Zero-Shot Image Classification
Link: https://arxiv.org/abs/2503.06847
Authors: Xiangyan Qu,Jing Yu,Jiamin Zhuang,Gaopeng Gou,Gang Xiong,Qi Wu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: recognize unseen classes, Zero-shot learning, shared auxiliary information, unseen classes, aims to train
Comments:
Click to view abstract
Abstract:Zero-shot learning (ZSL) aims to train a model on seen classes and recognize unseen classes by knowledge transfer through shared auxiliary information. Recent studies reveal that documents from encyclopedias provide helpful auxiliary information. However, existing methods align noisy documents, entangled in visual and non-visual descriptions, with image regions, yet solely depend on implicit learning. These models fail to filter non-visual noise reliably and incorrectly align non-visual words to image regions, which is harmful to knowledge transfer. In this work, we propose a novel multi-attribute document supervision framework to remove noises at both document collection and model learning stages. With the help of large language models, we introduce a novel prompt algorithm that automatically removes non-visual descriptions and enriches less-described documents in multiple attribute views. Our proposed model, MADS, extracts multi-view transferable knowledge with information decoupling and semantic interactions for semantic alignment at local and global levels. Besides, we introduce a model-agnostic focus loss to explicitly enhance attention to visually discriminative information during training, also improving existing methods without additional parameters. With comparable computation costs, MADS consistently outperforms the SOTA by 7.2% and 8.2% on average in three benchmarks for document-based ZSL and GZSL settings, respectively. Moreover, we qualitatively offer interpretable predictions from multiple attribute views.
162. 【2503.06840】Improving Visual Place Recognition with Sequence-Matching Receptiveness Prediction
Link: https://arxiv.org/abs/2503.06840
Authors: Somayeh Hussaini,Tobias Fischer,Michael Milford
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: visual place recognition, integrating temporal information, sequence-based matching approaches, place recognition, filtering and sequence-based
Comments: 8 pages, 5 figures, under review
Click to view abstract
Abstract:In visual place recognition (VPR), filtering and sequence-based matching approaches can improve performance by integrating temporal information across image sequences, especially in challenging conditions. While these methods are commonly applied, their effects on system behavior can be unpredictable and can actually make performance worse in certain situations. In this work, we present a new supervised learning approach that learns to predict the per-frame sequence matching receptiveness (SMR) of VPR techniques, enabling the system to selectively decide when to trust the output of a sequence matching system. The approach is agnostic to the underlying VPR technique. Our approach predicts SMR (and hence significantly improves VPR performance) across a large range of state-of-the-art and classical VPR techniques (namely CosPlace, MixVPR, EigenPlaces, SALAD, AP-GeM, NetVLAD and SAD), and across three benchmark VPR datasets (Nordland, Oxford RobotCar, and SFU-Mountain). We also provide insights into a complementary approach that uses the predictor to replace discarded matches, as well as ablation studies, including an analysis of the interactions between our SMR predictor and the selected sequence length. We will release our code upon acceptance.
163. 【2503.06839】AttFC: Attention Fully-Connected Layer for Large-Scale Face Recognition with One GPU
Link: https://arxiv.org/abs/2503.06839
Authors: Zhuowen Zheng,Yain-Whar Si,Xiaochen Yuan,Junwei Duan,Ke Wang,Xiaofan Li,Xinyuan Zhang,Xueyuan Gong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: deep neural networks, achieved exceptional performance, large-scale datasets, neural networks, advancement of deep
Comments:
Click to view abstract
Abstract:Nowadays, with the advancement of deep neural networks (DNNs) and the availability of large-scale datasets, face recognition (FR) models have achieved exceptional performance. However, the parameter count of the fully connected (FC) layer depends directly on the number of identities in the dataset, so training an FR model on a large-scale dataset makes the model excessively large and demands substantial computational resources, such as time and memory. This paper proposes the attention fully connected (AttFC) layer, which could significantly reduce computational resources. AttFC employs an attention loader to generate the generative class center (GCC), and dynamically stores the class center with a Dynamic Class Container (DCC). The DCC stores only a small subset of all class centers in FC, so its parameter count is substantially smaller than that of the FC layer. Also, training face recognition models on large-scale datasets with one GPU often encounters out-of-memory (OOM) issues. AttFC overcomes this and achieves comparable performance to state-of-the-art methods.
164. 【2503.06832】GUIDE-CoT: Goal-driven and User-Informed Dynamic Estimation for Pedestrian Trajectory using Chain-of-Thought
Link: https://arxiv.org/abs/2503.06832
Authors: Sungsik Kim, Janghyun Baek, Jinkyu Kim, Jaekoo Lee
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large Language Models, Language Models, Large Language, recently shown impressive, shown impressive results
Comments: 10 pages, 5 figures, will be published on The 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2025)
Abstract:While Large Language Models (LLMs) have recently shown impressive results in reasoning tasks, their application to pedestrian trajectory prediction remains challenging due to two key limitations: insufficient use of visual information and the difficulty of predicting entire trajectories. To address these challenges, we propose Goal-driven and User-Informed Dynamic Estimation for pedestrian trajectory using Chain-of-Thought (GUIDE-CoT). Our approach integrates two innovative modules: (1) a goal-oriented visual prompt, which enhances goal prediction accuracy by combining visual prompts with a pretrained visual encoder, and (2) a chain-of-thought (CoT) LLM for trajectory generation, which generates realistic trajectories toward the predicted goal. Moreover, our method introduces controllable trajectory generation, allowing for flexible and user-guided modifications to the predicted paths. Through extensive experiments on the ETH/UCY benchmark datasets, our method achieves state-of-the-art performance, delivering both high accuracy and greater adaptability in pedestrian trajectory prediction. Our code is publicly available at this https URL.
165. 【2503.06831】One-Shot Dual-Arm Imitation Learning
Link: https://arxiv.org/abs/2503.06831
Authors: Yilong Wang, Edward Johns
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: One-Shot Dual-Arm Imitation, Dual-Arm Imitation Learning, Imitation Learning, introduce One-Shot Dual-Arm, Dual-Arm Imitation
Comments: Accepted at ICRA 2025. Project Webpage: [this https URL](https://www.robot-learning.uk/one-shot-dual-arm)
Abstract:We introduce One-Shot Dual-Arm Imitation Learning (ODIL), which enables dual-arm robots to learn precise and coordinated everyday tasks from just a single demonstration of the task. ODIL uses a new three-stage visual servoing (3-VS) method for precise alignment between the end-effector and target object, after which replay of the demonstration trajectory is sufficient to perform the task. This is achieved without requiring prior task or object knowledge, or additional data collection and training following the single demonstration. Furthermore, we propose a new dual-arm coordination paradigm for learning dual-arm tasks from a single demonstration. ODIL was tested on a real-world dual-arm robot, demonstrating state-of-the-art performance across six precise and coordinated tasks in both 4-DoF and 6-DoF settings, and showing robustness in the presence of distractor objects and partial occlusions. Videos are available at: this https URL.
166. 【2503.06821】HierDAMap: Towards Universal Domain Adaptive BEV Mapping via Hierarchical Perspective Priors
Link: https://arxiv.org/abs/2503.06821
Authors: Siyu Li, Yihong Cao, Hao Shi, Yongsheng Zang, Xuan He, Kailun Yang, Zhiyong Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
Keywords: visual perception technology, driven significant innovation, BEV mapping, BEV, BEV mapping tasks
Comments: The source code will be made publicly available at [this https URL](https://github.com/lynn-yu/HierDAMap)
Abstract:The exploration of Bird's-Eye View (BEV) mapping technology has driven significant innovation in visual perception technology for autonomous driving. BEV mapping models need to be applied to the unlabeled real world, making the study of unsupervised domain adaptation models an essential path. However, research on unsupervised domain adaptation for BEV mapping remains limited and cannot perfectly accommodate all BEV mapping tasks. To address this gap, this paper proposes HierDAMap, a universal and holistic BEV domain adaptation framework with hierarchical perspective priors. Unlike existing research that solely focuses on image-level learning using prior knowledge, this paper explores the guiding role of perspective prior knowledge across three distinct levels: global, sparse, and instance levels. With these priors, HierDA consists of three essential components, including Semantic-Guided Pseudo Supervision (SGPS), Dynamic-Aware Coherence Learning (DACL), and Cross-Domain Frustum Mixing (CDFM). SGPS constrains the cross-domain consistency of perspective feature distribution through pseudo labels generated by vision foundation models in 2D space. To mitigate feature distribution discrepancies caused by spatial variations, DACL employs uncertainty-aware predicted depth as an intermediary to derive dynamic BEV labels from perspective pseudo-labels, thereby constraining the coarse BEV features derived from corresponding perspective features. CDFM, on the other hand, leverages perspective masks of view frustum to mix multi-view perspective images from both domains, which guides cross-domain view transformation and encoding learning through mixed BEV labels. The proposed method is verified on multiple BEV mapping tasks, such as BEV semantic segmentation, high-definition semantic, and vectorized mapping. The source code will be made publicly available at this https URL.
167. 【2503.06820】Towards Fine-Grained Video Question Answering
Link: https://arxiv.org/abs/2503.06820
Authors: Wei Dai, Alan Luo, Zane Durante, Debadutta Dash, Arnold Milstein, Kevin Schulman, Ehsan Adeli, Li Fei-Fei
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: rapidly evolving domain, Video Question Answering, Question Answering, Multi-Actor Question Answering, remains a focal
Comments:
Abstract:In the rapidly evolving domain of video understanding, Video Question Answering (VideoQA) remains a focal point. However, existing datasets exhibit gaps in temporal and spatial granularity, which consequently limits the capabilities of existing VideoQA methods. This paper introduces the Multi-Object Multi-Actor Question Answering (MOMA-QA) dataset, which is designed to address these shortcomings by emphasizing temporal localization, spatial relationship reasoning, and entity-centric queries. With ground truth scene graphs and temporal interval annotations, MOMA-QA is ideal for developing models for fine-grained video understanding. Furthermore, we present a novel video-language model, SGVLM, which incorporates a scene graph predictor, an efficient frame retriever, and a pre-trained large language model for temporal localization and fine-grained relationship understanding. Evaluations on MOMA-QA and other public datasets demonstrate the superior performance of our model, setting new benchmarks for VideoQA.
168. 【2503.06818】Sub-Image Recapture for Multi-View 3D Reconstruction
Link: https://arxiv.org/abs/2503.06818
Authors: Yanwei Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: high-resolution target remains, challenge task due, input image size, high-resolution target, target remains
Comments: 5 pages, 4 figures
Abstract:3D reconstruction of high-resolution targets remains a challenging task because large input images require large amounts of memory. Recently developed learning-based algorithms provide more promising reconstruction performance than traditional ones; however, they generally require more memory than traditional algorithms and face scalability issues. In this paper, we develop a generic approach, sub-image recapture (SIR), to split large images into smaller sub-images and process them individually. With this framework, existing 3D reconstruction algorithms can be implemented on top of sub-image recapture with significantly reduced memory and substantially improved scalability.
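The abstract does not say how SIR produces its sub-images, so the following is only a rough sketch of the general idea of splitting a large image into tiles. It assumes a plain non-overlapping grid and a pinhole camera, where cropping at pixel (x0, y0) shifts the principal point and leaves the focal length unchanged. The names `tile_boxes` and `shift_principal_point`, the image size, and the tile size are all hypothetical.

```python
from typing import Iterator, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), with x1 and y1 exclusive


def tile_boxes(width: int, height: int, tile: int) -> Iterator[Box]:
    """Yield crop boxes that cover a width x height image in row-major order."""
    for y0 in range(0, height, tile):
        for x0 in range(0, width, tile):
            # Tiles on the right and bottom edges are clipped to the image.
            yield x0, y0, min(x0 + tile, width), min(y0 + tile, height)


def shift_principal_point(cx: float, cy: float, x0: int, y0: int) -> Tuple[float, float]:
    """Principal point of a crop whose top-left corner is (x0, y0).

    Assumes a pinhole model: cropping translates the principal point only.
    """
    return cx - x0, cy - y0


# Made-up example: a 5000 x 3000 image split into tiles of at most 2048 px.
boxes = list(tile_boxes(5000, 3000, 2048))
for x0, y0, x1, y1 in boxes:
    cx, cy = shift_principal_point(2500.0, 1500.0, x0, y0)
    print((x0, y0, x1, y1), "principal point:", (cx, cy))
```

Each tile can then be handed to a reconstruction pipeline on its own, so peak memory scales with the tile size rather than the full image size.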
169. 【2503.06814】Unlocking Generalization for Robotics via Modularity and Scale
Link: https://arxiv.org/abs/2503.06814
Authors: Murtaza Dalal
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: generalist robot systems, large-scale policy learning, generalist robot, robot systems, robot
Comments: CMU Robotics PhD Thesis, 185 pages
Abstract:How can we build generalist robot systems? Scale may not be enough due to the significant multimodality of robotics tasks, lack of easily accessible data and the challenges of deploying on physical hardware. Meanwhile, most deployed robotic systems today are inherently modular and can leverage the independent generalization capabilities of each module to perform well. Therefore, this thesis seeks to tackle the task of building generalist robot agents by integrating these components into one: combining modularity with large-scale learning for general purpose robot control. The first question we consider is: how can we build modularity and hierarchy into learning systems? Our key insight is that rather than having the agent learn hierarchy and low-level control end-to-end, we can enforce modularity via planning to enable more efficient and capable robot learners. Next, we come to the role of scale in building generalist robot systems. To scale, neural networks require vast amounts of diverse data, expressive architectures to fit the data and a source of supervision to generate the data. We leverage a powerful supervision source: classical planning, which can generalize, but is expensive to run and requires access to privileged information to perform well in practice. We use these planners to supervise large-scale policy learning in simulation to produce generalist agents. Finally, we consider how to unify modularity with large-scale policy learning to build real-world robot systems capable of performing zero-shot manipulation. We do so by tightly integrating key ingredients of modular high and mid-level planning, learned local control, procedural scene generation and large-scale policy learning for sim2real transfer. We demonstrate that this recipe can produce a single, generalist agent that can solve challenging long-horizon manipulation tasks in the real world.
170. 【2503.06805】Multimodal Emotion Recognition and Sentiment Analysis in Multi-Party Conversation Contexts
Link: https://arxiv.org/abs/2503.06805
Authors: Aref Farhadipour, Hossein Ranjbar, Masoumeh Chapariniya, Teodora Vukovic, Sarah Ebling, Volker Dellwo
Categories: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Keywords: scenarios involving multi-party, real-world scenarios involving, conversational data, language processing, involving multi-party
Comments: 5 pages
Abstract:Emotion recognition and sentiment analysis are pivotal tasks in speech and language processing, particularly in real-world scenarios involving multi-party, conversational data. This paper presents a multimodal approach to tackle these challenges on a well-known dataset. We propose a system that integrates four key modalities/channels using pre-trained models: RoBERTa for text, Wav2Vec2 for speech, a proposed FacialNet for facial expressions, and a CNN+Transformer architecture trained from scratch for video analysis. Feature embeddings from each modality are concatenated to form a multimodal vector, which is then used to predict emotion and sentiment labels. The multimodal system demonstrates superior performance compared to unimodal approaches, achieving an accuracy of 66.36% for emotion recognition and 72.15% for sentiment analysis.
171. 【2503.06800】VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation
Link: https://arxiv.org/abs/2503.06800
Authors: Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Goldenberg, Aditya Grover, Kai-Wei Chang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Large-scale video generative, physical world simulators, diverse visual concepts, general-purpose physical world, Large-scale video
Comments: 41 pages, 33 Figures
Abstract:Large-scale video generative models, capable of creating realistic videos of diverse visual concepts, are strong candidates for general-purpose physical world simulators. However, their adherence to physical commonsense across real-world actions remains unclear (e.g., playing tennis, backflip). Existing benchmarks suffer from limitations such as limited size, lack of human evaluation, sim-to-real gaps, and absence of fine-grained physical rule analysis. To address this, we introduce VideoPhy-2, an action-centric dataset for evaluating physical commonsense in generated videos. We curate 200 diverse actions and detailed prompts for video synthesis from modern generative models. We perform human evaluation that assesses semantic adherence, physical commonsense, and grounding of physical rules in the generated videos. Our findings reveal major shortcomings, with even the best model achieving only 22% joint performance (i.e., high semantic and physical commonsense adherence) on the hard subset of VideoPhy-2. We find that the models particularly struggle with conservation laws like mass and momentum. Finally, we also train VideoPhy-AutoEval, an automatic evaluator for fast, reliable assessment on our dataset. Overall, VideoPhy-2 serves as a rigorous benchmark, exposing critical gaps in video generative models and guiding future research in physically-grounded video generation. The data and code are available at this https URL.
172. 【2503.06795】Robotic Ultrasound-Guided Femoral Artery Reconstruction of Anatomically-Representative Phantoms
Link: https://arxiv.org/abs/2503.06795
Authors: Lidia Al-Zogbi, Deepak Raina, Vinciya Pandian, Thorsten Fleiter, Axel Krieger
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: including diagnostic angiography, therapeutic catheterization, including diagnostic, diagnostic angiography, essential for numerous
Comments:
Abstract:Femoral artery access is essential for numerous clinical procedures, including diagnostic angiography, therapeutic catheterization, and emergency interventions. Despite its critical role, successful vascular access remains challenging due to anatomical variability, overlying adipose tissue, and the need for precise ultrasound (US) guidance. Errors in needle placement can lead to severe complications, restricting the procedure to highly skilled clinicians in controlled hospital settings. While robotic systems have shown promise in addressing these challenges through autonomous scanning and vessel reconstruction, clinical translation remains limited due to reliance on simplified phantom models that fail to capture human anatomical complexity. In this work, we present a method for autonomous robotic US scanning of bifurcated femoral arteries, and validate it on five vascular phantoms created from real patient computed tomography (CT) data. Additionally, we introduce a video-based deep learning US segmentation network tailored for vascular imaging, enabling improved 3D arterial reconstruction. The proposed network achieves a Dice score of 89.21% and an Intersection over Union of 80.54% on a newly developed vascular dataset. The quality of the reconstructed artery centerline is evaluated against ground truth CT data, demonstrating an average L2 deviation of 0.91 ± 0.70 mm, with an average Hausdorff distance of 4.36 ± 1.11 mm. This study is the first to validate an autonomous robotic system for US scanning of the femoral artery on a diverse set of patient-specific phantoms, introducing a more advanced framework for evaluating robotic performance in vascular imaging and intervention.
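For readers unfamiliar with the two segmentation metrics quoted above, the sketch below computes the standard Dice score and Intersection over Union for a pair of binary masks. The masks are made-up toy data, not the paper's dataset, and `dice_and_iou` is a hypothetical helper name. Treating two empty masks as a perfect match is a convention chosen here, not something the abstract specifies.

```python
from typing import Sequence, Tuple


def dice_and_iou(pred: Sequence[int], target: Sequence[int]) -> Tuple[float, float]:
    """Return (Dice, IoU) for two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks are empty: count this as perfect agreement (a convention).
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Toy 2x4 masks, flattened row by row.
pred_mask = [0, 1, 1, 1, 0, 1, 1, 0]
true_mask = [0, 1, 1, 0, 0, 1, 1, 1]
dice, iou = dice_and_iou(pred_mask, true_mask)
print(f"Dice={dice:.3f}, IoU={iou:.3f}")
```

For a single pair of masks the two metrics are linked by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU. Scores averaged over a dataset need not satisfy this identity exactly.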
173. 【2503.06794】Silent Hazards of Token Reduction in Vision-Language Models: The Hidden Impact on Consistency
Link: https://arxiv.org/abs/2503.06794
Authors: Yizheng Sun, Hao Li, Chang Xu, Chenghua Lin, Riza Batista-Navarro, Jingyuan Sun
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: Vision language models, Vision language, incur high computational, token reduction, incur high
Comments:
Abstract:Vision language models (VLMs) have excelled in visual reasoning but often incur high computational costs. One key reason is the redundancy of visual tokens. Although recent token reduction methods claim to achieve minimal performance loss, our extensive experiments reveal that token reduction can substantially alter a model's output distribution, leading to changes in prediction patterns that standard metrics such as accuracy loss do not fully capture. Such inconsistencies are especially concerning for practical applications where system stability is critical. To investigate this phenomenon, we analyze how token reduction influences the energy distribution of a VLM's internal representations using a lower-rank approximation via Singular Value Decomposition (SVD). Our results show that changes in the Inverse Participation Ratio of the singular value spectrum are strongly correlated with the model's consistency after token reduction. Based on these insights, we propose LoFi, a training-free visual token reduction method that utilizes the leverage score from SVD for token pruning. Experimental evaluations demonstrate that LoFi not only reduces computational costs with minimal performance degradation but also significantly outperforms state-of-the-art methods in terms of output consistency.
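As a rough illustration of the Inverse Participation Ratio mentioned above, the sketch below uses one common definition: normalize the squared singular values into an energy distribution p_i and sum p_i squared. Whether the paper normalizes squared or raw singular values is not stated in the abstract, so this choice is an assumption, and the two spectra are made-up numbers.

```python
from typing import Sequence


def inverse_participation_ratio(singular_values: Sequence[float]) -> float:
    """IPR of the energy spectrum p_i = s_i^2 / sum_j s_j^2 (assumed definition)."""
    energies = [s * s for s in singular_values]
    total = sum(energies)
    if total == 0.0:
        raise ValueError("spectrum has no energy")
    return sum((e / total) ** 2 for e in energies)


flat = [1.0, 1.0, 1.0, 1.0]     # energy spread evenly over 4 directions
peaked = [10.0, 0.1, 0.1, 0.1]  # energy concentrated in one direction
ipr_flat = inverse_participation_ratio(flat)
ipr_peaked = inverse_participation_ratio(peaked)
print(ipr_flat, ipr_peaked)
```

Under this definition the IPR ranges from 1/n for a perfectly flat spectrum of n values to 1 when a single direction carries all the energy, so its reciprocal can be read as an effective number of participating directions.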
174. 【2503.06790】GenDR: Lightning Generative Detail Restorator
Link: https://arxiv.org/abs/2503.06790
Authors: Yan Wang, Shijie Zhao, Kai Chen, Kexin Zhang, Junlin Li, Li Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Keywords: Recent research applying, achieved remarkable success, Recent research, research applying, real-world super-resolution
Comments:
Abstract:Recent research applying text-to-image (T2I) diffusion models to real-world super-resolution (SR) has achieved remarkable success. However, fundamental misalignments between T2I and SR targets result in a dilemma between inference speed and detail fidelity. Specifically, T2I tasks prioritize multi-step inversion to synthesize coherent outputs aligned with textual prompts and shrink the latent space to reduce generating complexity. Contrariwise, SR tasks preserve most information from low-resolution input while solely restoring high-frequency details, thus necessitating sufficient latent space and fewer inference steps. To bridge the gap, we present a one-step diffusion model for generative detail restoration, GenDR, distilled from a tailored diffusion model with larger latent space. In detail, we train a new SD2.1-VAE16 (0.9B) via representation alignment to expand latent space without enlarging the model size. Regarding step-distillation, we propose consistent score identity distillation (CiD) that incorporates SR task-specific loss into score distillation to leverage more SR priors and align the training target. Furthermore, we extend CiD with adversarial learning and representation alignment (CiDA) to enhance perceptual quality and accelerate training. We also polish the pipeline to achieve a more efficient inference. Experimental results demonstrate that GenDR achieves state-of-the-art performance in both quantitative metrics and visual fidelity.
175. 【2503.06784】Infinite Leagues Under the Sea: Photorealistic 3D Underwater Terrain Generation by Latent Fractal Diffusion Models
Link: https://arxiv.org/abs/2503.06784
Authors: Tianyi Zhang, Weiming Zhi, Joshua Mangelson, Matthew Johnson-Roberson
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Keywords: paper tackles, tackles the problem, problem of generating, generating representations, underwater
Comments: 10 pages
Abstract:This paper tackles the problem of generating representations of underwater 3D terrain. Off-the-shelf generative models, trained on Internet-scale data but not on specialized underwater images, exhibit downgraded realism, as images of the seafloor are relatively uncommon. To this end, we introduce DreamSea, a generative model to generate hyper-realistic underwater scenes. DreamSea is trained on real-world image databases collected from underwater robot surveys. Images from these surveys contain massive real seafloor observations and cover large areas, but are prone to noise and artifacts from the real world. We extract 3D geometry and semantics from the data with visual foundation models, and train a diffusion model that generates realistic seafloor images in RGBD channels, conditioned on novel fractal distribution-based latent embeddings. We then fuse the generated images into a 3D map, building a 3DGS model supervised by 2D diffusion priors which allows photorealistic novel view rendering. DreamSea is rigorously evaluated, demonstrating the ability to robustly generate large-scale underwater scenes that are consistent, diverse, and photorealistic. Our work drives impact in multiple domains, spanning filming, gaming, and robot simulation.
176. 【2503.06773】Investigating Image Manifolds of 3D Objects: Learning, Shape Analysis, and Comparisons
Link: https://arxiv.org/abs/2503.06773
Authors: Benjamin Beaudett, Shenyuan Liang, Anuj Srivastava
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: manifolds, long been hypothesized, hypothesized to form, objects, image manifolds
Comments:
Abstract:Despite high-dimensionality of images, the sets of images of 3D objects have long been hypothesized to form low-dimensional manifolds. What is the nature of such manifolds? How do they differ across objects and object classes? Answering these questions can provide key insights in explaining and advancing success of machine learning algorithms in computer vision. This paper investigates dual tasks (learning and analyzing shapes of image manifolds) by revisiting a classical problem of manifold learning but from a novel geometrical perspective. It uses geometry-preserving transformations to map the pose image manifolds, sets of images formed by rotating 3D objects, to low-dimensional latent spaces. The pose manifolds of different objects in latent spaces are found to be nonlinear, smooth manifolds. The paper then compares shapes of these manifolds for different objects using Kendall's shape analysis, modulo rigid motions and global scaling, and clusters objects according to these shape metrics. Interestingly, pose manifolds for objects from the same classes are frequently clustered together. The geometries of image manifolds can be exploited to simplify vision and image processing tasks, to predict performances, and to provide insights into learning methods.
177. 【2503.06764】SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation
Link: https://arxiv.org/abs/2503.06764
Authors: Zisheng Chen, Chunwei Wang, Xiuwei Chen, Hang Xu, Jianhua Han, Xiandan Liang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: understanding and generation, consistent discrete feature, discrete feature representations, generation tasks, Semantic-Guided Hierarchical codebook
Comments: Under Review
Abstract:We present SemHiTok, a unified image Tokenizer via Semantic-Guided Hierarchical codebook that provides consistent discrete feature representations for multimodal understanding and generation tasks. Recently, unified multimodal large models (MLLMs) for understanding and generation have sparked exploration within the research community. Previous works attempt to train a unified image tokenizer by combining loss functions for semantic feature reconstruction and pixel reconstruction. However, due to the differing levels of features prioritized by multimodal understanding and generation tasks, joint training methods face significant challenges in achieving a good trade-off. SemHiTok addresses this challenge through a Semantic-Guided Hierarchical codebook, which builds texture sub-codebooks on a pre-trained semantic codebook. This design decouples the training of semantic reconstruction and pixel reconstruction and equips the tokenizer with low-level texture feature extraction capability without degradation of high-level semantic feature extraction ability. Our experiments demonstrate that SemHiTok achieves state-of-the-art rFID score at 256×256 resolution compared to other unified tokenizers, and exhibits competitive performance on multimodal understanding and generation tasks.
178. 【2503.06762】Gaussian RBFNet: Gaussian Radial Basis Functions for Fast and Accurate Representation and Reconstruction of Neural Fields
Link: https://arxiv.org/abs/2503.06762
Authors: Abdelaziz Bouzidi, Hamid Laga, Hazem Wannous
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: recently revolutionized novel-view, revolutionized novel-view synthesis, Neural Radiance Fields, recently revolutionized, revolutionized novel-view
Comments: Our code is available at [this https URL](https://grbfnet.github.io/)
Abstract:Neural fields such as DeepSDF and Neural Radiance Fields have recently revolutionized novel-view synthesis and 3D reconstruction from RGB images and videos. However, achieving high-quality representation, reconstruction, and rendering requires deep neural networks, which are slow to train and evaluate. Although several acceleration techniques have been proposed, they often trade off speed for memory. Gaussian splatting-based methods, on the other hand, accelerate the rendering time but remain costly in terms of training speed and memory needed to store the parameters of a large number of Gaussians. In this paper, we introduce a novel neural representation that is fast, both at training and inference times, and lightweight. Our key observation is that the neurons used in traditional MLPs perform simple computations (a dot product followed by ReLU activation) and thus one needs to use either wide and deep MLPs or high-resolution and high-dimensional feature grids to parameterize complex nonlinear functions. We show in this paper that by replacing traditional neurons with Radial Basis Function (RBF) kernels, one can achieve highly accurate representation of 2D (RGB images), 3D (geometry), and 5D (radiance fields) signals with just a single layer of such neurons. The representation is highly parallelizable, operates on low-resolution feature grids, and is compact and memory-efficient. We demonstrate that the proposed novel representation can be trained for 3D geometry representation in less than 15 seconds and for novel view synthesis in less than 15 mins. At runtime, it can synthesize novel views at more than 60 fps without sacrificing quality.
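The contrast the abstract draws between a conventional neuron (dot product followed by ReLU) and a Gaussian RBF kernel can be sketched as follows. This is a minimal illustration, assuming isotropic Gaussians with one width per unit and a plain linear readout. The paper's actual parameterization, feature grids, and training procedure are not given in the abstract, and all names and numbers below are hypothetical.

```python
import math
from typing import Sequence


def relu_neuron(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Conventional MLP unit: dot product plus bias, followed by ReLU."""
    return max(0.0, sum(xi * wi for xi, wi in zip(x, w)) + b)


def gaussian_rbf_neuron(x: Sequence[float], center: Sequence[float], sigma: float) -> float:
    """Isotropic Gaussian kernel: exp(-||x - c||^2 / (2 * sigma^2))."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def rbf_layer(x, centers, sigmas, out_weights) -> float:
    """Single layer of RBF units combined by a linear readout."""
    return sum(
        a * gaussian_rbf_neuron(x, c, s)
        for a, c, s in zip(out_weights, centers, sigmas)
    )


# Two made-up 2D units with opposite-sign readout weights.
centers = [(0.0, 0.0), (1.0, 1.0)]
sigmas = [0.5, 0.5]
out_weights = [1.0, -1.0]
value = rbf_layer((0.0, 0.0), centers, sigmas, out_weights)
print(value)
```

Each RBF unit responds only near its center, which is one intuition for why a single layer of such units can represent localized detail that a ReLU network would need more width or depth to capture.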
179. 【2503.06759】Revisiting Invariant Learning for Out-of-Domain Generalization on Multi-Site Mammogram Datasets
Link: https://arxiv.org/abs/2503.06759
Authors: Hung Q. Vo, Samira Zare, Son T. Ly, Lin Wang, Chika F. Ezeana, Xiaohui Yu, Kelvin K. Wong, Stephen T.C. Wong, Hien V. Nguyen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: settings remains uncertain, development settings remains, deep learning techniques, remains uncertain, invariant learning
Comments:
Abstract:Despite significant progress in robust deep learning techniques for mammogram breast cancer classification, their reliability in real-world clinical development settings remains uncertain. The translation of these models to clinical practice faces challenges due to variations in medical centers, imaging protocols, and patient populations. To enhance their robustness, invariant learning methods have been proposed, prioritizing causal factors over misleading features. However, their effectiveness in clinical development and impact on mammogram classification require investigation. This paper reassesses the application of invariant learning for breast cancer risk estimation based on mammograms. Utilizing diverse multi-site public datasets, it represents the first study in this area. The objective is to evaluate invariant learning's benefits in developing robust models. Invariant learning methods, including Invariant Risk Minimization and Variance Risk Extrapolation, are compared quantitatively against Empirical Risk Minimization. Evaluation metrics include accuracy, average precision, and area under the curve. Additionally, interpretability is examined through class activation maps and visualization of learned representations. This research examines the advantages, limitations, and challenges of invariant learning for mammogram classification, guiding future studies to develop generalized methods for breast cancer prediction on whole mammograms in out-of-domain scenarios.
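To make the comparison between Empirical Risk Minimization and Variance Risk Extrapolation concrete, the sketch below evaluates both objectives on per-site risks. It uses a commonly seen form of the V-REx objective, the average risk plus a weighted variance of the per-site risks, and may differ from the paper's setup by a constant scaling. The site risks and the weight `beta` are made-up numbers, and the function names are hypothetical.

```python
from statistics import mean, pvariance
from typing import Sequence


def erm_objective(site_risks: Sequence[float]) -> float:
    """ERM looks only at the average risk across sites."""
    return mean(site_risks)


def vrex_objective(site_risks: Sequence[float], beta: float) -> float:
    """Average risk plus beta times the variance of risks across sites."""
    return mean(site_risks) + beta * pvariance(site_risks)


# Two hypothetical models with the same average risk over three sites.
uniform_model = [0.30, 0.30, 0.30]  # equally good at every site
uneven_model = [0.10, 0.30, 0.50]   # good at one site, poor at another

for name, risks in [("uniform", uniform_model), ("uneven", uneven_model)]:
    print(name, erm_objective(risks), vrex_objective(risks, beta=10.0))
```

ERM cannot tell the two models apart, while the variance term penalizes the model whose risk varies across sites. That is the mechanism by which invariant learning is meant to favor features that transfer between medical centers.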
180. 【2503.06749】Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
Link: https://arxiv.org/abs/2503.06749
Authors: Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Yao Hu, Shaohui Lin
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Reinforcement Learning, purely through Reinforcement, successfully demonstrated, demonstrated the emergence, LLMs purely
Comments:
Abstract:DeepSeek-R1-Zero has successfully demonstrated the emergence of reasoning capabilities in LLMs purely through Reinforcement Learning (RL). Inspired by this breakthrough, we explore how RL can be utilized to enhance the reasoning capability of MLLMs. However, direct training with RL struggles to activate complex reasoning capabilities such as questioning and reflection in MLLMs, due to the absence of substantial high-quality multimodal reasoning data. To address this issue, we propose the reasoning MLLM, Vision-R1, to improve multimodal reasoning capability. Specifically, we first construct a high-quality multimodal CoT dataset without human annotations by leveraging an existing MLLM and DeepSeek-R1 through modality bridging and data filtering to obtain a 200K multimodal CoT dataset, Vision-R1-cold dataset. It serves as cold-start initialization data for Vision-R1. To mitigate the optimization challenges caused by overthinking after cold start, we propose Progressive Thinking Suppression Training (PTST) strategy and employ Group Relative Policy Optimization (GRPO) with the hard formatting result reward function to gradually refine the model's ability to learn correct and complex reasoning processes on a 10K multimodal math dataset. Comprehensive experiments show our model achieves an average improvement of ~6% across various multimodal math reasoning benchmarks. Vision-R1-7B achieves a 73.5% accuracy on the widely used MathVista benchmark, which is only 0.4% lower than the leading reasoning model, OpenAI O1. The datasets and code will be released in: this https URL.
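The "group relative" part of GRPO can be illustrated with a short sketch. Several responses are sampled for the same prompt, and each response's reward is standardized against the group's mean and standard deviation. The reward values below are made up, and the use of the population standard deviation and the small `eps` constant are assumptions. The abstract does not describe the reward function or the policy update in enough detail to reproduce them here.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize each reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical binary rewards for four sampled answers to one math prompt:
# 1.0 if the answer is correct and well formatted, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because advantages are computed within the group, no separate value network is needed. A group in which every response earns the same reward produces all-zero advantages and therefore no learning signal for that prompt.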
181. 【2503.06748】DiffAtlas: GenAI-fying Atlas Segmentation via Image-Mask Diffusion
Link: https://arxiv.org/abs/2503.06748
Authors: Hantao Zhang, Yuhe Liu, Jiancheng Yang, Weidong Guo, Xinyuan Wang, Pascal Fua
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: precise anatomical delineation, Accurate medical image, Accurate medical, anatomical delineation, crucial for precise
Comments: 11 pages
Abstract:Accurate medical image segmentation is crucial for precise anatomical delineation. Deep learning models like U-Net have shown great success but depend heavily on large datasets and struggle with domain shifts, complex structures, and limited training samples. Recent studies have explored diffusion models for segmentation by iteratively refining masks. However, these methods still retain the conventional image-to-mask mapping, making them highly sensitive to input data, which hampers stability and generalization. In contrast, we introduce DiffAtlas, a novel generative framework that models both images and masks through diffusion during training, effectively "GenAI-fying" atlas-based segmentation. During testing, the model is guided to generate a specific target image-mask pair, from which the corresponding mask is obtained. DiffAtlas retains the robustness of the atlas paradigm while overcoming its scalability and domain-specific limitations. Extensive experiments on CT and MRI across same-domain, cross-modality, varying-domain, and different data-scale settings using the MMWHS and TotalSegmentator datasets demonstrate that our approach outperforms existing methods, particularly in limited-data and zero-shot modality segmentation. Code is available at this https URL.
182. 【2503.06746】Color Alignment in Diffusion
Link: https://arxiv.org/abs/2503.06746
Authors: Ka Chun Shum, Binh-Son Hua, Duc Thanh Nguyen, Sai-Kit Yeung
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: shown great promise, synthesizing visually appealing, visually appealing images, Diffusion models, shown great
Comments: CVPR 2025
Abstract:Diffusion models have shown great promise in synthesizing visually appealing images. However, it remains challenging to condition the synthesis at a fine-grained level, for instance, synthesizing image pixels following some generic color pattern. Existing image synthesis methods often produce contents that fall outside the desired pixel conditions. To address this, we introduce a novel color alignment algorithm that confines the generative process in diffusion models within a given color pattern. Specifically, we project diffusion terms, either imagery samples or latent representations, into a conditional color space to align with the input color distribution. This strategy simplifies the prediction in diffusion models within a color manifold while still allowing plausible structures in generated contents, thus enabling the generation of diverse contents that comply with the target color pattern. Experimental results demonstrate our state-of-the-art performance in conditioning and controlling of color pixels, while maintaining on-par generation quality and diversity in comparison with regular diffusion models.
183. 【2503.06744】CoDa-4DGS: Dynamic Gaussian Splatting with Context and Deformation Awareness for Autonomous Driving
Link: https://arxiv.org/abs/2503.06744
Authors: Rui Song, Chenwei Liang, Yan Xia, Walter Zimmer, Hu Cao, Holger Caesar, Andreas Festag, Alois Knoll
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Dynamic scene rendering, scene rendering opens, Dynamic scene, enabling closed-loop simulations, photorealistic data
Comments:
Abstract:Dynamic scene rendering opens new avenues in autonomous driving by enabling closed-loop simulations with photorealistic data, which is crucial for validating end-to-end algorithms. However, the complex and highly dynamic nature of traffic environments presents significant challenges in accurately rendering these scenes. In this paper, we introduce a novel 4D Gaussian Splatting (4DGS) approach, which incorporates context and temporal deformation awareness to improve dynamic scene rendering. Specifically, we employ a 2D semantic segmentation foundation model to self-supervise the 4D semantic features of Gaussians, ensuring meaningful contextual embedding. Simultaneously, we track the temporal deformation of each Gaussian across adjacent frames. By aggregating and encoding both semantic and temporal deformation features, each Gaussian is equipped with cues for potential deformation compensation within 3D space, facilitating a more precise representation of dynamic scenes. Experimental results show that our method improves 4DGS's ability to capture fine details in dynamic scene rendering for autonomous driving and outperforms other self-supervised methods in 4D reconstruction and novel view synthesis. Furthermore, CoDa-4DGS deforms semantic features with each Gaussian, enabling broader applications.
184. 【2503.06740】D3DR: Lighting-Aware Object Insertion in Gaussian Splatting
Link: https://arxiv.org/abs/2503.06740
Authors: Vsevolod Skorokhodov, Nikita Durasov, Pascal Fua
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Computer Vision tasks, Computer Vision, Vision tasks, Gaussian Splatting, dynamic scene rendering
Comments:
Abstract:Gaussian Splatting has become a popular technique for various 3D Computer Vision tasks, including novel view synthesis, scene reconstruction, and dynamic scene rendering. However, the challenge of natural-looking object insertion, where the object's appearance seamlessly matches the scene, remains unsolved. In this work, we propose a method, dubbed D3DR, for inserting a 3DGS-parametrized object into 3DGS scenes while correcting its lighting, shadows, and other visual artifacts to ensure consistency, a problem that has not been successfully addressed before. We leverage advances in diffusion models, which, trained on real-world data, implicitly understand correct scene lighting. After inserting the object, we optimize a diffusion-based Delta Denoising Score (DDS)-inspired objective to adjust its 3D Gaussian parameters for proper lighting correction. Utilizing diffusion model personalization techniques to improve optimization quality, our approach ensures seamless object insertion and natural appearance. Finally, we demonstrate the method's effectiveness by comparing it to existing approaches, achieving 0.5 PSNR and 0.15 SSIM improvements in relighting quality.
185. 【2503.06717】Continuous Online Adaptation Driven by User Interaction for Medical Image Segmentation
Link: https://arxiv.org/abs/2503.06717
Authors: Wentian Xu, Ziyun Liang, Harry Anthony, Yasin Ibrahim, Felix Cohen, Guang Yang, Daniel Whitehouse, David Menon, Virginia Newcombe, Konstantinos Kamnitsas
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: real-time user interactions, Interactive segmentation models, extra inputs, inputs to dynamically, dynamically refine
Comments:
Abstract:Interactive segmentation models use real-time user interactions, such as mouse clicks, as extra inputs to dynamically refine the model predictions. After model deployment, user corrections of model predictions could be used to adapt the model to the post-deployment data distribution, countering distribution-shift and enhancing reliability. Motivated by this, we introduce an online adaptation framework that enables an interactive segmentation model to continuously learn from user interaction and improve its performance on new data distributions, as it processes a sequence of test images. We introduce the Gaussian Point Loss function to train the model how to leverage user clicks, along with a two-stage online optimization method that adapts the model using the corrected predictions generated via user interactions. We demonstrate that this simple and therefore practical approach is very effective. Experiments on 5 fundus and 4 brain MRI databases demonstrate that our method outperforms existing approaches under various data distribution shifts, including segmentation of image modalities and pathologies not seen during training.
186. 【2503.06700】MemorySAM: Memorize Modalities and Semantics with Segment Anything Model 2 for Multi-modal Semantic Segmentation
Link: https://arxiv.org/abs/2503.06700
Authors: Chenfei Liao, Xu Zheng, Yuanhuiyi Lyu, Haiwei Xue, Yihong Cao, Jiawen Wang, Kailun Yang, Xuming Hu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: multiple visual modalities, visual modalities captured, Research has focused, diverse sensors, pixel-wise predictions
Comments:
Abstract:Research has focused on Multi-Modal Semantic Segmentation (MMSS), where pixel-wise predictions are derived from multiple visual modalities captured by diverse sensors. Recently, the large vision model, Segment Anything Model 2 (SAM2), has shown strong zero-shot segmentation performance on both images and videos. When extending SAM2 to MMSS, two issues arise: 1. How can SAM2 be adapted to multi-modal data? 2. How can SAM2 better understand semantics? Inspired by cross-frame correlation in videos, we propose to treat multi-modal data as a sequence of frames representing the same scene. Our key idea is to "memorize" the modality-agnostic information and "memorize" the semantics related to the targeted scene. To achieve this, we apply SAM2's memory mechanisms across multi-modal data to capture modality-agnostic features. Meanwhile, to memorize the semantic knowledge, we propose a training-only Semantic Prototype Memory Module (SPMM) to store category-level prototypes across training for facilitating SAM2's transition from instance to semantic segmentation. A prototypical adaptation loss is imposed between global and local prototypes iteratively to align and refine SAM2's semantic understanding. Extensive experimental results demonstrate that our proposed MemorySAM outperforms SoTA methods by large margins on both synthetic and real-world benchmarks (65.38% on DELIVER, 52.88% on MCubeS). Source code will be made publicly available.
187. 【2503.06699】Unsupervised Multi-Clustering and Decision-Making Strategies for 4D-STEM Orientation Mapping
Link: https://arxiv.org/abs/2503.06699
Authors: Junhao Cao, Nicolas Folastre, Gozde Oney, Edgar Rauch, Stavros Nicolopoulos, Partha Pratim Das, Arnaud Demortière
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Keywords: non-negative matrix factorization, Image Quality Assessment, primary clustering method, study presents, integration of unsupervised
Comments: 32 pages, 5 figures, 5 figures in SI
Abstract:This study presents a novel integration of unsupervised learning and decision-making strategies for the advanced analysis of 4D-STEM datasets, with a focus on non-negative matrix factorization (NMF) as the primary clustering method. Our approach introduces a systematic framework to determine the optimal number of components (k) required for robust and interpretable orientation mapping. By leveraging the K-Component Loss method and Image Quality Assessment (IQA) metrics, we effectively balance reconstruction fidelity and model complexity. Additionally, we highlight the critical role of dataset preprocessing in improving clustering stability and accuracy. Furthermore, our spatial weight matrix analysis provides insights into overlapping regions within the dataset by employing threshold-based visualization, facilitating a detailed understanding of cluster interactions. The results demonstrate the potential of combining NMF with advanced IQA metrics and preprocessing techniques for reliable orientation mapping and structural analysis in 4D-STEM datasets, paving the way for future applications in multi-dimensional material characterization.
188. 【2503.06698】What's in a Latent? Leveraging Diffusion Latent Space for Domain Generalization
Link: https://arxiv.org/abs/2503.06698
Authors: Xavier Thomas, Deepti Ghadiyaram
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: unseen data distributions, Domain Generalization aims, data distributions, aims to develop, Generalization aims
Comments:
Abstract:Domain Generalization aims to develop models that can generalize to novel and unseen data distributions. In this work, we study how model architectures and pre-training objectives impact feature richness and propose a method to effectively leverage them for domain generalization. Specifically, given a pre-trained feature space, we first discover latent domain structures, referred to as pseudo-domains, that capture domain-specific variations in an unsupervised manner. Next, we augment existing classifiers with these complementary pseudo-domain representations making them more amenable to diverse unseen test domains. We analyze how different pre-training feature spaces differ in the domain-specific variances they capture. Our empirical studies reveal that features from diffusion models excel at separating domains in the absence of explicit domain labels and capture nuanced domain-specific information. On 5 datasets, we show that our very simple framework improves generalization to unseen domains by a maximum test accuracy improvement of over 4% compared to the standard baseline Empirical Risk Minimization (ERM). Crucially, our method outperforms most algorithms that access domain labels during training.
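The two steps described in this abstract (discover pseudo-domains without labels, then augment the classifier input with a pseudo-domain representation) can be sketched roughly as follows. This is a minimal illustration that assumes plain k-means as the clustering step and concatenation of the assigned centroid as the augmentation; the abstract does not specify either choice, and the function names and toy features are hypothetical.

```python
def kmeans(points, k, iters=20):
    """Tiny k-means with deterministic init (the first k points are the seeds)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every feature vector to its nearest centroid (squared L2 distance).
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Move each centroid to the mean of its members; keep it if the cluster is empty.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def augment_with_pseudo_domain(points, labels, centroids):
    """Concatenate each feature with the centroid of its pseudo-domain (assumed scheme)."""
    return [list(p) + list(centroids[lab]) for p, lab in zip(points, labels)]


# Toy "pre-trained features" drawn from two visibly different domains.
features = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
labels, centroids = kmeans(features, k=2)
augmented = augment_with_pseudo_domain(features, labels, centroids)
```

A downstream classifier would then be trained on `augmented` instead of `features`, so that it sees both the sample and a summary of the pseudo-domain it falls into.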
189. 【2503.06685】Asymmetric Decision-Making in Online Knowledge Distillation: Unifying Consensus and Divergence
Link: https://arxiv.org/abs/2503.06685
Authors: Zhaowei Chen, Borui Zhao, Yuchen Ge, Yuhao Chen, Renjie Song, Jiajun Liang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: compact student network, pretrained teacher network, models, teacher models, student models
Comments:
Abstract:Online Knowledge Distillation (OKD) methods streamline the distillation training process into a single stage, eliminating the need for knowledge transfer from a pretrained teacher network to a more compact student network. This paper presents an innovative approach to leverage intermediate spatial representations. Our analysis of the intermediate features from both teacher and student models reveals two pivotal insights: (1) the similar features between students and teachers are predominantly focused on foreground objects. (2) teacher models emphasize foreground objects more than students. Building on these findings, we propose Asymmetric Decision-Making (ADM) to enhance feature consensus learning for student models while continuously promoting feature diversity in teacher models. Specifically, Consensus Learning for student models prioritizes spatial features with high consensus relative to teacher models. Conversely, Divergence Learning for teacher models highlights spatial features with lower similarity compared to student models, indicating superior performance by teacher models in these regions. Consequently, ADM facilitates the student models to catch up with the feature learning process of the teacher models. Extensive experiments demonstrate that ADM consistently surpasses existing OKD methods across various online knowledge distillation settings and also achieves superior results when applied to offline knowledge distillation, semantic segmentation and diffusion distillation tasks.
190. 【2503.06684】PixelPonder: Dynamic Patch Adaptation for Enhanced Multi-Conditional Text-to-Image Generation
Link: https://arxiv.org/abs/2503.06684
Authors: Yanjie Pan, Qingdong He, Zhengkai Jiang, Pengcheng Xu, Chaoyi Wang, Jinlong Peng, Haoxuan Wang, Yun Cao, Zhenye Gan, Mingmin Chi, Bo Peng, Yabiao Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: demonstrated promising results, Recent advances, advances in diffusion-based, demonstrated promising, promising results
Comments:
Abstract:Recent advances in diffusion-based text-to-image generation have demonstrated promising results through visual condition control. However, existing ControlNet-like methods struggle with compositional visual conditioning - simultaneously preserving semantic fidelity across multiple heterogeneous control signals while maintaining high visual quality, where they employ separate control branches that often introduce conflicting guidance during the denoising process, leading to structural distortions and artifacts in generated images. To address this issue, we present PixelPonder, a novel unified control framework, which allows for effective control of multiple visual conditions under a single control structure. Specifically, we design a patch-level adaptive condition selection mechanism that dynamically prioritizes spatially relevant control signals at the sub-region level, enabling precise local guidance without global interference. Additionally, a time-aware control injection scheme is deployed to modulate condition influence according to denoising timesteps, progressively transitioning from structural preservation to texture refinement and fully utilizing the control information from different categories to promote more harmonious image generation. Extensive experiments demonstrate that PixelPonder surpasses previous methods across different benchmark datasets, showing superior improvement in spatial alignment accuracy while maintaining high textual semantic consistency.
191. 【2503.06683】Dynamic Dictionary Learning for Remote Sensing Image Segmentation
Link: https://arxiv.org/abs/2503.06683
Authors: Xuechao Zou, Yue Li, Shun Zhang, Kai Li, Shiying Wang, Pin Tao, Junliang Xing, Congyan Lang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: diverse scene variations, Remote sensing image, segmentation faces persistent, faces persistent challenges, distinguishing morphologically similar
Comments:
Abstract:Remote sensing image segmentation faces persistent challenges in distinguishing morphologically similar categories and adapting to diverse scene variations. While existing methods rely on implicit representation learning paradigms, they often fail to dynamically adjust semantic embeddings according to contextual cues, leading to suboptimal performance in fine-grained scenarios such as cloud thickness differentiation. This work introduces a dynamic dictionary learning framework that explicitly models class ID embeddings through iterative refinement. The core contribution lies in a novel dictionary construction mechanism, where class-aware semantic embeddings are progressively updated via multi-stage alternating cross-attention querying between image features and dictionary embeddings. This process enables adaptive representation learning tailored to input-specific characteristics, effectively resolving ambiguities in intra-class heterogeneity and inter-class homogeneity. To further enhance discriminability, a contrastive constraint is applied to the dictionary space, ensuring compact intra-class distributions while maximizing inter-class separability. Extensive experiments across both coarse- and fine-grained datasets demonstrate consistent improvements over state-of-the-art methods, particularly in two online test benchmarks (LoveDA and UAVid). Code is available at this https URL.
192. 【2503.06678】Gamma: Toward Generic Image Assessment with Mixture of Assessment Experts
Link: https://arxiv.org/abs/2503.06678
Authors: Hantao Zhou, Rui Yang, Longxiang Tang, Guanyi Qin, Yan Zhang, Runze Hu, Xiu Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: natural and AIGC, AIGC scenes, Image assessment, aims to evaluate
Comments:
Abstract:Image assessment aims to evaluate the quality and aesthetics of images and has been applied across various scenarios, such as natural and AIGC scenes. Existing methods mostly address these sub-tasks or scenes individually. While some works attempt to develop unified image assessment models, they have struggled to achieve satisfactory performance or cover a broad spectrum of assessment scenarios. In this paper, we present Gamma, a Generic imAge assessMent model using Mixture of Assessment Experts, which can effectively assess images from diverse scenes through mixed-dataset training. Achieving unified training in image assessment presents significant challenges due to annotation biases across different datasets. To address this issue, we first propose a Mixture of Assessment Experts (MoAE) module, which employs shared and adaptive experts to dynamically learn common and specific knowledge for different datasets, respectively. In addition, we introduce a Scene-based Differential Prompt (SDP) strategy, which uses scene-specific prompts to provide prior knowledge and guidance during the learning process, further boosting adaptation for various scenes. Our Gamma model is trained and evaluated on 12 datasets spanning 6 image assessment scenarios. Extensive experiments show that our unified Gamma outperforms other state-of-the-art mixed-training methods by significant margins while covering more scenes. Code: this https URL.
193. 【2503.06677】REArtGS: Reconstructing and Generating Articulated Objects via 3D Gaussian Splatting with Geometric and Motion Constraints
Link: https://arxiv.org/abs/2503.06677
Authors: Di Wu, Liu Liu, Zhou Linli, Anran Huang, Liangtu Song, Qiaojun Yu, Qi Wu, Cewu Lu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Keywords: representations play crucial, play crucial roles, Articulated objects, textured surface reconstruction, human life
Comments: 11 pages, 6 figures
Abstract:Articulated objects are prevalent in human life, and their 3D representations play crucial roles across various applications. However, achieving both high-fidelity textured surface reconstruction and dynamic generation for articulated objects remains challenging for existing methods. In this paper, we present REArtGS, a novel framework that introduces additional geometric and motion constraints to 3D Gaussian primitives, enabling high-quality textured surface reconstruction and generation for articulated objects. Specifically, given multi-view RGB images of two arbitrary states of an articulated object, we first introduce an unbiased Signed Distance Field (SDF) guidance to regularize Gaussian opacity fields, enhancing geometry constraints and improving surface reconstruction quality. Then we establish deformable fields for 3D Gaussians constrained by the kinematic structures of articulated objects, achieving unsupervised generation of surface meshes in unseen states. Extensive experiments on both synthetic and real datasets demonstrate that our approach achieves high-quality textured surface reconstruction for given states and enables high-fidelity surface generation for unseen states. Code will be released within the next four months.
194. 【2503.06676】Seeing Delta Parameters as JPEG Images: Data-Free Delta Compression with Discrete Cosine Transform
Link: https://arxiv.org/abs/2503.06676
Authors: Chenyu Huang, Peng Ye, Xiaohui Wang, Shenghe Zheng, Biqing Qi, Lei Bai, Wanli Ouyang, Tao Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Keywords: pose critical challenges, multiple tasks pose, tasks pose critical, individual finetuned models, paradigm becoming mainstream
Comments: 15 pages, 7 figures
Abstract:With transformer-based models and the pretrain-finetune paradigm becoming mainstream, the high storage and deployment costs of individual finetuned models on multiple tasks pose critical challenges. Delta compression attempts to lower the costs by reducing the redundancy of delta parameters (i.e., the difference between the finetuned and pre-trained model weights). However, existing methods usually face problems including data accessibility and training requirements. To tackle this issue, we introduce Delta-DCT, the first data-free delta compression method inspired by classic JPEG image compression, leveraging the Discrete Cosine Transform (DCT). We first (a) group delta parameters within a layer into patches. Then we (b) assess the importance of each patch and allocate them with different quantization bit-widths. Afterwards, we (c) convert these patches to the DCT domain and conduct quantization to each patch based on the allocated bit-width. The proposed Delta-DCT does not require any training or data calibration, while achieving performance comparable to or even surpassing original finetuned models under 1-bit equivalent delta compression ratios on different kinds of models including: (1) recently-released LLMs of different sizes from 7B to 13B, (2) relatively smaller language models including RoBERTa and T5 models, (3) variants of vision transformer models, and (4) multi-modal BEiT-3 models.
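Steps (a) to (c) in this abstract can be sketched in a simplified 1-D form: split the delta into patches, give more bits to the more important patches, then quantize each patch in the DCT domain. Everything specific below is an assumption for illustration: the squared L2 norm as the importance score, the "top half gets `high_bits`" allocation rule, the uniform quantizer, and all function names. The paper's actual patching, importance measure, and bit allocation may differ.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a 1-D list."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        out.append(s * (math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)))
    return out


def idct(c):
    """Inverse of dct(): the transpose of the orthonormal DCT-II matrix."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
                 for k in range(1, n))
        out.append(s)
    return out


def quantize(values, bits):
    """Uniform quantization onto 2**bits evenly spaced levels in [-peak, peak]."""
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return [0.0] * len(values)
    step = 2 * peak / (2 ** bits - 1)
    return [round((v + peak) / step) * step - peak for v in values]


def compress_delta(finetuned, pretrained, patch_size=4, high_bits=4, low_bits=2):
    """Quantize (finetuned - pretrained) patch by patch in the DCT domain."""
    delta = [f - p for f, p in zip(finetuned, pretrained)]
    # (a) group delta parameters into patches
    patches = [delta[i:i + patch_size] for i in range(0, len(delta), patch_size)]
    # (b) assumed importance score: squared L2 norm; the top half gets more bits
    importance = [sum(v * v for v in patch) for patch in patches]
    cutoff = sorted(importance, reverse=True)[(len(patches) - 1) // 2]
    restored = []
    for patch, score in zip(patches, importance):
        bits = high_bits if score >= cutoff else low_bits
        # (c) transform, quantize in the DCT domain, transform back
        restored.extend(idct(quantize(dct(patch), bits)))
    return [p + d for p, d in zip(pretrained, restored)]


pretrained = [1.0, -0.5, 0.25, 0.0, 2.0, 1.5, -1.0, 0.5]
finetuned = [1.1, -0.7, 0.3, 0.0, 2.01, 1.5, -1.01, 0.52]
restored = compress_delta(finetuned, pretrained)
```

No data or training is involved: only the two weight vectors are needed, which matches the data-free setting the abstract describes.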
195. 【2503.06674】Learning Few-Step Diffusion Models by Trajectory Distribution Matching
Link: https://arxiv.org/abs/2503.06674
Authors: Yihong Luo, Tianyang Hu, Jiacheng Sun, Yujun Cai, Jing Tang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: efficient AIGC deployment, AIGC deployment, efficient AIGC, Accelerating diffusion model, Accelerating diffusion
Comments: Project page: [this https URL](https://tdm-t2x.github.io/)
Abstract:Accelerating diffusion model sampling is crucial for efficient AIGC deployment. While diffusion distillation methods -- based on distribution matching and trajectory matching -- reduce sampling to as few as one step, they fall short on complex tasks like text-to-image generation. Few-step generation offers a better balance between speed and quality, but existing approaches face a persistent trade-off: distribution matching lacks flexibility for multi-step sampling, while trajectory matching often yields suboptimal image quality. To bridge this gap, we propose learning few-step diffusion models by Trajectory Distribution Matching (TDM), a unified distillation paradigm that combines the strengths of distribution and trajectory matching. Our method introduces a data-free score distillation objective, aligning the student's trajectory with the teacher's at the distribution level. Further, we develop a sampling-steps-aware objective that decouples learning targets across different steps, enabling more adjustable sampling. This approach supports both deterministic sampling for superior image quality and flexible multi-step adaptation, achieving state-of-the-art performance with remarkable efficiency. Our model, TDM, outperforms existing methods on various backbones, such as SDXL and PixArt-$\alpha$, delivering superior quality and significantly reduced training costs. In particular, our method distills PixArt-$\alpha$ into a 4-step generator that outperforms its teacher on real user preference at 1024 resolution. This is accomplished with 500 iterations and 2 A800 hours -- a mere 0.01% of the teacher's training cost. In addition, our proposed TDM can be extended to accelerate text-to-video diffusion. Notably, TDM can outperform its teacher model (CogVideoX-2B) by using only 4 NFE on VBench, improving the total score from 80.91 to 81.65. Project page: this https URL
196. 【2503.06671】Emulating Self-attention with Convolution for Efficient Image Super-Resolution
Link: https://arxiv.org/abs/2503.06671
Authors: Dongheon Lee, Seokju Yun, Youngmin Ro
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: high computational overhead, lightweight image super-resolution, image super-resolution, tackle the high
Comments:
Abstract:In this paper, we tackle the high computational overhead of transformers for lightweight image super-resolution (SR). Motivated by the observations of self-attention's inter-layer repetition, we introduce a convolutionized self-attention module named Convolutional Attention (ConvAttn) that emulates self-attention's long-range modeling capability and instance-dependent weighting with a single shared large kernel and dynamic kernels. By utilizing the ConvAttn module, we significantly reduce the reliance on self-attention and its involved memory-bound operations while maintaining the representational capability of transformers. Furthermore, we overcome the challenge of integrating flash attention into the lightweight SR regime, effectively mitigating self-attention's inherent memory bottleneck. We scale up window size to 32$\times$32 with flash attention rather than proposing an intricate self-attention module, significantly improving PSNR by 0.31 dB on Urban100$\times$2 while reducing latency and memory usage by 16$\times$ and 12.2$\times$. Building on these approaches, our proposed network, termed Emulating Self-attention with Convolution (ESC), notably improves PSNR by 0.27 dB on Urban100$\times$4 compared to HiT-SRF, reducing the latency and memory usage by 3.7$\times$ and 6.2$\times$, respectively. Extensive experiments demonstrate that our ESC maintains the ability for long-range modeling, data scalability, and the representational power of transformers despite most self-attentions being replaced by the ConvAttn module.
197. 【2503.06670】Attention, Please! PixelSHAP Reveals What Vision-Language Models Actually Focus On
Link: https://arxiv.org/abs/2503.06670
Authors: Roni Goldshmidt
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: crucial for trust, high-stakes applications, decision-making in high-stakes, Vision-Language Models, framework extending Shapley-based
Comments:
Abstract:Interpretability in Vision-Language Models (VLMs) is crucial for trust, debugging, and decision-making in high-stakes applications. We introduce PixelSHAP, a model-agnostic framework extending Shapley-based analysis to structured visual entities. Unlike previous methods focusing on text prompts, PixelSHAP applies to vision-based reasoning by systematically perturbing image objects and quantifying their influence on a VLM's response. PixelSHAP requires no model internals, operating solely on input-output pairs, making it compatible with open-source and commercial models. It supports diverse embedding-based similarity metrics and scales efficiently using optimization techniques inspired by Shapley-based methods. We validate PixelSHAP in autonomous driving, highlighting its ability to enhance interpretability. Key challenges include segmentation sensitivity and object occlusion. Our open-source implementation facilitates further research.
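To make the Shapley-based idea in this abstract concrete, the sketch below computes exact Shapley values over a handful of image objects. The value function is a toy stand-in for "similarity between the model's response on the perturbed image and its original response"; the object names, the scores, and `toy_response_similarity` are hypothetical and do not come from the paper. Exact enumeration is exponential in the number of objects, so a practical tool would need approximations such as the optimization techniques the abstract alludes to.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley value of each player for a set function `value`."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                kept = frozenset(subset)
                # Marginal contribution of p when added to this coalition.
                total += weight * (value(kept | {p}) - value(kept))
        phi[p] = total
    return phi


def toy_response_similarity(visible):
    """Hypothetical score: how close the answer stays when only `visible` objects remain."""
    score = 0.0
    if "pedestrian" in visible:
        score += 0.6
    if "traffic_light" in visible:
        score += 0.3
    if "pedestrian" in visible and "car" in visible:
        score += 0.1  # interaction term, shared equally between the two objects
    return score


objects = ["pedestrian", "traffic_light", "car"]
phi = shapley_values(objects, toy_response_similarity)
```

Only input-output pairs are used here (objects shown, score returned), which mirrors the abstract's point that no model internals are required.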
198. 【2503.06669】AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems
Link: https://arxiv.org/abs/2503.06669
Authors: AgiBot-World-Contributors, Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xu Huang, Shu Jiang, Yuxin Jiang, Cheng Jing, Hongyang Li, Jialu Li, Chiming Liu, Yi Liu, Yuxiang Lu, Jianlan Luo, Ping Luo, Yao Mu, Yuehan Niu, Yixuan Pan, Jiangmiao Pang, Yu Qiao, Guanghui Ren, Cheng Ruan, Jiaqi Shan, Yongjian Shen, Chengshi Shi, Mingkang Shi, Modi Shi, Chonghao Sima, Jianheng Song, Huijie Wang, Wenhao Wang, Dafeng Wei, Chengen Xie, Guo Xu, Junchi Yan, Cunbiao Yang, Lei Yang, Shukai Yang, Maoqing Yao, Jia Zeng, Chi Zhang, Qinglin Zhang, Bin Zhao, Chengyue Zhao, Jiaqi Zhao, Jianchao Zhu
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: generalized robotic manipulation, address real-world challenges, robotic manipulation, challenges for generalized, generalized robotic
Comments: Project website: [this https URL](https://agibot-world.com/), Code: [this https URL](https://github.com/OpenDriveLab/AgiBot-World)
Abstract:We explore how scalable robot data can address real-world challenges for generalized robotic manipulation. Introducing AgiBot World, a large-scale platform comprising over 1 million trajectories across 217 tasks in five deployment scenarios, we achieve an order-of-magnitude increase in data scale compared to existing datasets. Accelerated by a standardized collection pipeline with human-in-the-loop verification, AgiBot World guarantees high-quality and diverse data distribution. It is extensible from grippers to dexterous hands and visuo-tactile sensors for fine-grained skill acquisition. Building on top of data, we introduce Genie Operator-1 (GO-1), a novel generalist policy that leverages latent action representations to maximize data utilization, demonstrating predictable performance scaling with increased data volume. Policies pre-trained on our dataset achieve an average performance improvement of 30% over those trained on Open X-Embodiment, both in in-domain and out-of-distribution scenarios. GO-1 exhibits exceptional capability in real-world dexterous and long-horizon tasks, achieving over 60% success rate on complex tasks and outperforming prior RDT approach by 32%. By open-sourcing the dataset, tools, and models, we aim to democratize access to large-scale, high-quality robot data, advancing the pursuit of scalable and general-purpose intelligence.
199. 【2503.06661】AA-CLIP: Enhancing Zero-shot Anomaly Detection via Anomaly-Aware CLIP
Link: https://arxiv.org/abs/2503.06661
Authors: Wenxin Ma,Xu Zhang,Qingsong Yao,Fenghe Tang,Chenxu Wu,Yingtai Li,Rui Yan,Zihang Jiang,S. Kevin Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: lesion detection, identifies outliers, defect and lesion, Anomaly detection, CLIP
Comments: 8 pages, 7 figures
Abstract:Anomaly detection (AD) identifies outliers for applications like defect and lesion detection. While CLIP shows promise for zero-shot AD tasks due to its strong generalization capabilities, its inherent Anomaly-Unawareness leads to limited discrimination between normal and abnormal features. To address this problem, we propose Anomaly-Aware CLIP (AA-CLIP), which enhances CLIP's anomaly discrimination ability in both text and visual spaces while preserving its generalization capability. AA-CLIP is achieved through a straightforward yet effective two-stage approach: it first creates anomaly-aware text anchors to differentiate normal and abnormal semantics clearly, then aligns patch-level visual features with these anchors for precise anomaly localization. This two-stage strategy, with the help of residual adapters, gradually adapts CLIP in a controlled manner, achieving effective AD while maintaining CLIP's class knowledge. Extensive experiments validate AA-CLIP as a resource-efficient solution for zero-shot AD tasks, achieving state-of-the-art results in industrial and medical applications. The code is available at this https URL.
200. 【2503.06660】AxisPose: Model-Free Matching-Free Single-Shot 6D Object Pose Estimation via Axis Generation
Link: https://arxiv.org/abs/2503.06660
Authors: Yang Zou,Zhaoshuai Qi,Yating Liu,Zihao Xu,Weipeng Sun,Weiyi Liu,Xingyuan Li,Jiaqi Yang,Yanning Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: augmented reality, role in robotics, autonomous driving, computer vision, plays a vital
Comments:
Abstract:Object pose estimation, which plays a vital role in robotics, augmented reality, and autonomous driving, has been of great interest in computer vision. Existing studies either require multi-stage pose regression or rely on 2D-3D feature matching. Though these approaches have shown promising results, they rely heavily on appearance information, requiring complex input (i.e., multi-view reference input, depth, or CAD models) and an intricate pipeline (i.e., feature extraction, SfM, 2D-to-3D matching, and PnP). We propose AxisPose, a model-free, matching-free, single-shot solution for robust 6D pose estimation, which fundamentally diverges from the existing paradigm. Unlike existing methods that rely on 2D-3D or 2D-2D matching using 3D techniques, such as SfM and PnP, AxisPose directly infers a robust 6D pose from a single view by leveraging a diffusion model to learn the latent axis distribution of objects without reference views. Specifically, AxisPose constructs an Axis Generation Module (AGM) to capture the latent geometric distribution of object axes through a diffusion model. The diffusion process is guided by injecting the gradient of a geometric consistency loss into the noise estimation to maintain the geometric consistency of the generated tri-axis. With the generated tri-axis projection, AxisPose further adopts a Triaxial Back-projection Module (TBM) to recover the 6D pose from the object tri-axis. The proposed AxisPose achieves robust performance at the cross-instance level (i.e., one model for N instances) using only a single view as input without reference images, with great potential for generalization to the unseen-object level.
201. 【2503.06652】Adding Additional Control to One-Step Diffusion with Joint Distribution Matching
Link: https://arxiv.org/abs/2503.06652
Authors: Yihong Luo,Tianyang Hu,Yifan Song,Jiacheng Sun,Zhenguo Li,Jing Tang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Variational Score Distillation, Variational Score, latest user preferences, adapting distilled models, remains challenging
Comments:
Abstract:While diffusion distillation has enabled one-step generation through methods like Variational Score Distillation, adapting distilled models to emerging new controls -- such as novel structural constraints or the latest user preferences -- remains challenging. Conventional approaches typically require modifying the base diffusion model and redistilling it -- a process that is both computationally intensive and time-consuming. To address these challenges, we introduce Joint Distribution Matching (JDM), a novel approach that minimizes the reverse KL divergence between image-condition joint distributions. By deriving a tractable upper bound, JDM decouples fidelity learning from condition learning. This asymmetric distillation scheme enables our one-step student to handle controls unknown to the teacher model and facilitates improved classifier-free guidance (CFG) usage and seamless integration of human feedback learning (HFL). Experimental results demonstrate that JDM, with only a single step, surpasses baseline methods such as multi-step ControlNet in most cases, while achieving state-of-the-art performance in one-step text-to-image synthesis through improved usage of CFG or HFL integration.
202. 【2503.06647】Personalized Class Incremental Context-Aware Food Classification for Food Intake Monitoring Systems
Link: https://arxiv.org/abs/2503.06647
Authors: Hassan Kazemi Tehrani,Jun Cai,Abbas Yekanlou,Sylvia Santosa
Categories: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
Keywords: preventing nutrition-related diseases, Accurate food intake, Accurate food, food, food intake monitoring
Comments:
Abstract:Accurate food intake monitoring is crucial for maintaining a healthy diet and preventing nutrition-related diseases. With the diverse range of foods consumed across various cultures, classic food classification models have limitations due to their reliance on fixed-sized food datasets. Studies show that people consume only a small range of foods across the existing ones, each consuming a unique set of foods. Existing class-incremental models have low accuracy for the new classes and lack personalization. This paper introduces a personalized, class-incremental food classification model designed to overcome these challenges and improve the performance of food intake monitoring systems. Our approach adapts itself to the new array of food classes, maintaining applicability and accuracy, both for new and existing classes by using personalization. Our model's primary focus is personalization, which improves classification accuracy by prioritizing a subset of foods based on an individual's eating habits, including meal frequency, times, and locations. A modified version of DSN is utilized to expand on the appearance of new food classes. Additionally, we propose a comprehensive framework that integrates this model into a food intake monitoring system. This system analyzes meal images provided by users, makes use of a smart scale to estimate food weight, utilizes a nutrient content database to calculate the amount of each macro-nutrient, and creates a dietary user profile through a mobile application. Finally, experimental evaluations on two new benchmark datasets FOOD101-Personal and VFN-Personal, personalized versions of well-known datasets for food classification, are conducted to demonstrate the effectiveness of our model in improving the classification accuracy of both new and existing classes, addressing the limitations of both conventional and class-incremental food classification models.
203. 【2503.06641】CLICv2: Image Complexity Representation via Content Invariance Contrastive Learning
Link: https://arxiv.org/abs/2503.06641
Authors: Shipeng Liu,Liang Zhao,Dengfeng Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: selection and sensitivity, positive sample selection, complexity representation, image, positive
Comments:
Abstract:Unsupervised image complexity representation often suffers from bias in positive sample selection and sensitivity to image content. We propose CLICv2, a contrastive learning framework that enforces content invariance for complexity representation. Unlike CLIC, which generates positive samples via cropping and thereby introduces positive-pair bias, our shifted patchify method applies randomized directional shifts to image patches before contrastive learning. Patches at corresponding positions serve as positive pairs, ensuring content-invariant learning. Additionally, we propose a patch-wise contrastive loss, which enhances local complexity representation while mitigating content interference. To further suppress interference from image content, we introduce Masked Image Modeling as an auxiliary task, but set its modeling objective to the entropy of the masked patches: the model recovers the entropy of the overall image from the information in the unmasked patches and thereby gains global complexity perception. Extensive experiments on IC9600 demonstrate that CLICv2 significantly outperforms existing unsupervised methods in PCC and SRCC, achieving content-invariant complexity representation without introducing positive-pair bias.
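The abstract does not say how the entropy of a patch is computed. As a minimal sketch, assuming plain Shannon entropy over the discrete intensity values of a grayscale patch (the function name `patch_entropy` and the list-of-lists patch format are hypothetical, chosen only for illustration), the quantity used as a modeling target could look like this:

```python
import math
from collections import Counter


def patch_entropy(patch):
    """Shannon entropy (in bits) of the intensity values in a 2D patch.

    `patch` is a list of rows, each row a list of integer intensities.
    This is an illustrative assumption, not the paper's definition.
    """
    values = [v for row in patch for v in row]
    if not values:
        raise ValueError("patch must contain at least one pixel")
    counts = Counter(values)
    n = len(values)
    # Sum of -p * log2(p) over the observed intensity values.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# A flat patch carries no information; a patch split evenly between
# two intensities carries exactly one bit per pixel.
flat = [[7, 7], [7, 7]]
two_level = [[0, 255], [255, 0]]
four_level = [[0, 85], [170, 255]]
```

Under this assumption a visually uniform patch scores 0 and the score grows with the number of equally frequent intensity levels, which matches the intuition of entropy as a complexity proxy.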
204. 【2503.06637】CLAD: Constrained Latent Action Diffusion for Vision-Language Procedure Planning
Link: https://arxiv.org/abs/2503.06637
Authors: Lei Shi,Andreas Bulling
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Constrained Latent Action, propose CLAD, vision-language procedure planning, Constrained Latent, procedure planning
Comments:
Abstract:We propose CLAD -- a Constrained Latent Action Diffusion model for vision-language procedure planning in instructional videos. Procedure planning is the challenging task of predicting intermediate actions given a visual observation of a start and a goal state. However, future interactive AI systems must also be able to plan procedures using multi-modal input, e.g., where visual observations are augmented with language descriptions. To tackle this vision-language procedure planning task, our method uses a Variational Autoencoder (VAE) to learn the latent representation of actions and observations as constraints and integrate them into the diffusion process. This approach exploits that the latent space of diffusion models already has semantics that can be used. We use the latent constraints to steer the diffusion model to better generate actions. We report extensive experiments on the popular CrossTask, Coin, and NIV datasets and show that our method outperforms state-of-the-art methods by a large margin. By evaluating ablated versions of our method, we further show that the proposed integration of the action and observation representations learnt in the VAE latent space is key to these performance improvements.
205. 【2503.06632】Towards More Accurate Personalized Image Generation: Addressing Overfitting and Evaluation Bias
Link: https://arxiv.org/abs/2503.06632
Authors: Mingxiao Li,Tingyu Qu,Tinne Tuytelaars,Marie-Francine Moens
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: customized visual content, improve daily life, Personalized image generation, visual content, great potential
Comments: 18
Abstract:Personalized image generation via text prompts has great potential to improve daily life and professional work by facilitating the creation of customized visual content. The aim of image personalization is to create images based on a user-provided subject while maintaining both consistency of the subject and flexibility to accommodate various textual descriptions of that subject. However, current methods face challenges in ensuring fidelity to the text prompt while not overfitting to the training data. In this work, we introduce a novel training pipeline that incorporates an attractor to filter out distractions in training images, allowing the model to focus on learning an effective representation of the personalized subject. Moreover, current evaluation methods struggle due to the lack of a dedicated test set. The evaluation set-up typically relies on the training data of the personalization task to compute text-image and image-image similarity scores, which, while useful, tend to overestimate performance. Although human evaluations are commonly used as an alternative, they often suffer from bias and inconsistency. To address these issues, we curate a diverse and high-quality test set with well-designed prompts. With this new benchmark, automatic evaluation metrics can reliably assess model performance
206. 【2503.06626】DiffCLIP: Differential Attention Meets CLIP
Link: https://arxiv.org/abs/2503.06626
Authors: Hasan Abed Al Kader Hammoud,Bernard Ghanem
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: CLIP architectures, differential attention, Abstract, CLIP, differential
Comments: Under review
Abstract:We propose DiffCLIP, a novel vision-language model that extends the differential attention mechanism to CLIP architectures. Differential attention was originally developed for large language models to amplify relevant context while canceling out noisy information. In this work, we integrate this mechanism into CLIP's dual encoder (image and text) framework. With minimal additional parameters, DiffCLIP achieves superior performance on image-text understanding tasks. Across zero-shot classification, retrieval, and robustness benchmarks, DiffCLIP consistently outperforms baseline CLIP models. Notably, these gains come with negligible computational overhead, demonstrating that differential attention can significantly enhance multi-modal representations without sacrificing efficiency. Code can be found at this https URL.
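To make "amplify relevant context while canceling out noisy information" concrete, here is a rough single-head sketch of differential attention as it is commonly formulated for language models: two softmax attention maps are computed from separate query/key projections, and the second is subtracted from the first after scaling by a factor `lam`. The function names, the fixed scalar `lam`, and the pure-Python list representation are assumptions for illustration. This is not the DiffCLIP implementation, and it omits details such as learnable `lam` and per-head normalization.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(Q, K):
    """Row-wise softmax of scaled dot products between queries and keys."""
    d = len(Q[0])
    return [
        softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K])
        for q in Q
    ]


def differential_attention(Q1, K1, Q2, K2, V, lam):
    """(softmax(Q1 K1^T / sqrt(d)) - lam * softmax(Q2 K2^T / sqrt(d))) V."""
    A1 = attention_map(Q1, K1)
    A2 = attention_map(Q2, K2)
    # Subtract the second map so that attention mass common to both cancels.
    diff = [[a - lam * b for a, b in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
    # Apply the differential weights to the value vectors.
    return [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in diff
    ]


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
plain = differential_attention(Q, K, Q, K, V, 0.0)      # reduces to standard attention
cancelled = differential_attention(Q, K, Q, K, V, 1.0)  # identical maps cancel fully
```

With `lam = 0` the sketch reduces to ordinary softmax attention, and with identical maps and `lam = 1` the output is all zeros, which shows the cancellation effect in its most extreme form.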
207. 【2503.06625】Similarity-Guided Layer-Adaptive Vision Transformer for UAV Tracking
Link: https://arxiv.org/abs/2503.06625
Authors: Chaocan Xue,Bineng Zhong,Qihua Liang,Yaozong Zheng,Ning Li,Yuanliang Xue,Shuxiang Song
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Vision transformers, popular backbone, backbone for visual, Vision, complete ViT architectures
Comments:
Abstract:Vision transformers (ViTs) have emerged as a popular backbone for visual tracking. However, complete ViT architectures are too cumbersome to deploy for unmanned aerial vehicle (UAV) tracking which extremely emphasizes efficiency. In this study, we discover that many layers within lightweight ViT-based trackers tend to learn relatively redundant and repetitive target representations. Based on this observation, we propose a similarity-guided layer adaptation approach to optimize the structure of ViTs. Our approach dynamically disables a large number of representation-similar layers and selectively retains only a single optimal layer among them, aiming to achieve a better accuracy-speed trade-off. By incorporating this approach into existing ViTs, we tailor previously complete ViT architectures into an efficient similarity-guided layer-adaptive framework, namely SGLATrack, for real-time UAV tracking. Extensive experiments on six tracking benchmarks verify the effectiveness of the proposed approach, and show that our SGLATrack achieves a state-of-the-art real-time speed while maintaining competitive tracking precision. Codes and models are available at this https URL.
208. 【2503.06624】Chameleon: On the Scene Diversity and Domain Variety of AI-Generated Videos Detection
Link: https://arxiv.org/abs/2503.06624
Authors: Meiyu Zeng,Xingming Liao,Canyu Chen,Nankai Lin,Zhuowei Wang,Chong Chen,Aimin Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Artificial intelligence generated, Artificial intelligence, intelligence generated content, spreading disinformation, intelligence generated
Comments: 17 pages
Abstract:Artificial intelligence generated content (AIGC), known as DeepFakes, has emerged as a growing concern because it is being utilized as a tool for spreading disinformation. While much research exists on identifying AI-generated text and images, research on detecting AI-generated videos is limited. Existing datasets for AI-generated videos detection exhibit limitations in terms of diversity, complexity, and realism. To address these issues, this paper focuses on AI-generated videos detection and constructs a diverse dataset named Chameleon. We generate videos through multiple generation tools and various real video sources. At the same time, we preserve the videos' real-world complexity, including scene switches and dynamic perspective changes, and expand beyond face-centered detection to include human actions and environment generation. Our work bridges the gap between AI-generated dataset construction and real-world forensic needs, offering a valuable benchmark to counteract the evolving threats of AI-generated content.
209. 【2503.06623】Transforming Weather Data from Pixel to Latent Space
Link: https://arxiv.org/abs/2503.06623
Authors: Sijie Zhao,Feng Liu,Xueliang Zhang,Hao Chen,Tao Han,Junchao Gong,Ran Tao,Pengfeng Xiao,Lei Bai,Wanli Ouyang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: spurred growing interest, extreme weather events, weather, PVS, pixel space
Comments: 8 pages, 6 figures
Abstract:The increasing impact of climate change and extreme weather events has spurred growing interest in deep learning for weather research. However, existing studies often rely on weather data in pixel space, which presents several challenges such as overly smooth model outputs, limited applicability to a single pressure-variable subset (PVS), and high data storage and computational costs. To address these challenges, we propose a novel Weather Latent Autoencoder (WLA) that transforms weather data from pixel space to latent space, enabling efficient weather task modeling. By decoupling weather reconstruction from downstream tasks, WLA improves the accuracy and sharpness of weather task model results. The incorporated Pressure-Variable Unified Module transforms multiple PVS into a unified representation, enhancing the adaptability of the model in multiple weather scenarios. Furthermore, weather tasks can be performed in a low-storage latent space of WLA rather than a high-storage pixel space, thus significantly reducing data storage and computational costs. Through extensive experimentation, we demonstrate its superior compression and reconstruction performance, enabling the creation of the ERA5-latent dataset with unified representations of multiple PVS from ERA5 data. The compressed full PVS in the ERA5-latent dataset reduces the original 244.34 TB of data to 0.43 TB. The downstream task further demonstrates that task models can apply to multiple PVS with low data costs in latent space and achieve superior performance compared to models in pixel space. Code, ERA5-latent data, and pre-trained models are available at this https URL.
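For a sense of scale, the two storage figures quoted in the abstract (244.34 TB and 0.43 TB) imply a compression factor of roughly 568. The short calculation below only restates that arithmetic; the helper name `compression_ratio` is hypothetical.

```python
def compression_ratio(original_size, compressed_size):
    """Return how many times smaller the compressed data is."""
    if compressed_size <= 0:
        raise ValueError("compressed_size must be positive")
    return original_size / compressed_size


original_tb = 244.34  # ERA5 full PVS in pixel space, as quoted in the abstract
latent_tb = 0.43      # ERA5-latent, as quoted in the abstract

ratio = compression_ratio(original_tb, latent_tb)
saving = 1 - latent_tb / original_tb

print(f"compression: about {ratio:.0f}x, storage saved: {saving:.2%}")
```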
210. 【2503.06621】Dynamic Updates for Language Adaptation in Visual-Language Tracking
Link: https://arxiv.org/abs/2503.06621
Authors: Xiaohai Li,Bineng Zhong,Qihua Liang,Zhiyi Mo,Jian Nong,Shuxiang Song
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: dynamic language descriptions, multi-modal references, semantic information provided, Dynamic Language, static multi-modal references
Comments:
Abstract:The consistency between the semantic information provided by the multi-modal reference and the tracked object is crucial for visual-language (VL) tracking. However, existing VL tracking frameworks rely on static multi-modal references to locate dynamic objects, which can lead to semantic discrepancies and reduce the robustness of the tracker. To address this issue, we propose a novel vision-language tracking framework, named DUTrack, which captures the latest state of the target by dynamically updating multi-modal references to maintain consistency. Specifically, we introduce a Dynamic Language Update Module, which leverages a large language model to generate dynamic language descriptions for the object based on visual features and object category information. Then, we design a Dynamic Template Capture Module, which captures the regions in the image that highly match the dynamic language descriptions. Furthermore, to ensure the efficiency of description generation, we design an update strategy that assesses changes in target displacement, scale, and other factors to decide on updates. Finally, the dynamic template and language descriptions that record the latest state of the target are used to update the multi-modal references, providing more accurate reference information for subsequent inference and enhancing the robustness of the tracker. DUTrack achieves new state-of-the-art performance on four mainstream vision-language and two vision-only tracking benchmarks, including LaSOT, LaSOT$_{\rm{ext}}$, TNL2K, OTB99-Lang, GOT-10K, and UAV123. Code and models are available at this https URL.
211. 【2503.06617】Pixel to Gaussian: Ultra-Fast Continuous Super-Resolution with 2D Gaussian Modeling
Link: https://arxiv.org/abs/2503.06617
Authors: Long Peng,Anran Wu,Wenbo Li,Peizhe Xia,Xueyuan Dai,Xinjie Zhang,Xin Di,Haoze Sun,Renjing Pei,Yang Wang,Yang Cao,Zheng-Jun Zha
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: arbitrary upsampling factors, inputs with arbitrary, fixed-scale factors, ASSR, limitations of traditional
Comments: Tech Report
Abstract:Arbitrary-scale super-resolution (ASSR) aims to reconstruct high-resolution (HR) images from low-resolution (LR) inputs with arbitrary upsampling factors using a single model, addressing the limitations of traditional SR methods constrained to fixed-scale factors (e.g., $\times 2$). Recent advances leveraging implicit neural representation (INR) have achieved great progress by modeling coordinate-to-pixel mappings. However, the efficiency of these methods may suffer from repeated upsampling and decoding, while their reconstruction fidelity and quality are constrained by the intrinsic representational limitations of coordinate-based functions. To address these challenges, we propose a novel ContinuousSR framework with a Pixel-to-Gaussian paradigm, which explicitly reconstructs 2D continuous HR signals from LR images using Gaussian Splatting. This approach eliminates the need for time-consuming upsampling and decoding, enabling extremely fast arbitrary-scale super-resolution. Once the Gaussian field is built in a single pass, ContinuousSR can perform arbitrary-scale rendering in just 1ms per scale. Our method introduces several key innovations. Through statistical ana
212. 【2503.06608】GroMo: Plant Growth Modeling with Multiview Images
Link: https://arxiv.org/abs/2503.06608
Authors: Ruchi Bhatt,Shreya Bansal,Amanpreet Chander,Rupinder Kaur,Malya Singh,Mohan Kankanhalli,Abdulmotaleb El Saddik,Mukesh Kumar Saini
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
Keywords: Understanding plant growth, Understanding plant, plant growth dynamics, Growth Modelling, growth dynamics
Comments: 7 pages, 5 Figures, 3 Tables
Abstract:Understanding plant growth dynamics is essential for applications in agriculture and plant phenotyping. We present the Growth Modelling (GroMo) challenge, which is designed for two primary tasks: (1) plant age prediction and (2) leaf count estimation, both essential for crop monitoring and precision agriculture. For this challenge, we introduce GroMo25, a dataset with images of four crops: radish, okra, wheat, and mustard. Each crop consists of multiple plants (p1, p2, ..., pn) captured over different days (d1, d2, ..., dm) and categorized into five levels (L1, L2, L3, L4, L5). Each plant is captured from 24 different angles with a 15-degree gap between images. Participants are required to perform both tasks for all four crops with these multiview images. We proposed a Multiview Vision Transformer (MVVT) model for the GroMo challenge and evaluated the crop-wise performance on GroMo25. MVVT reports an average MAE of 7.74 for age prediction and an MAE of 5.52 for leaf count. The GroMo Challenge aims to advance plant phenotyping research by encouraging innovative solutions for tracking and predicting plant growth. The GitHub repository is publicly available at this https URL.
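The capture setup and the reported metric are both simple to restate in code. The sketch below lists the 24 viewing angles implied by a 15-degree gap (a full 360-degree sweep) and computes the mean absolute error (MAE) used to score age and leaf-count predictions. The function names and the toy numbers are illustrative assumptions, not values from GroMo25.

```python
def view_angles(num_views=24, gap_deg=15):
    """Camera angles in degrees for evenly spaced views around a plant."""
    return [i * gap_deg for i in range(num_views)]


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


angles = view_angles()  # 0, 15, ..., 345: the 24 views cover a full circle

# Toy example: true vs. predicted plant age in days (made-up values).
true_age = [10, 20, 30]
pred_age = [12, 18, 33]
age_mae = mean_absolute_error(true_age, pred_age)
```

An MAE of 7.74 for age therefore means predictions are off by about 7.74 units on average, in whatever unit the dataset uses for age.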
213. 【2503.06604】Steerable Pyramid Weighted Loss: Multi-Scale Adaptive Weighting for Semantic Segmentation
Link: https://arxiv.org/abs/2503.06604
Authors: Renhao Lu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: remote sensing, biomedical imaging, autonomous driving, computer vision, loss
Comments: 9 pages, 4 figures
Abstract:Semantic segmentation is a core task in computer vision with applications in biomedical imaging, remote sensing, and autonomous driving. While standard loss functions such as cross-entropy and Dice loss perform well in general cases, they often struggle with fine structures, particularly in tasks involving thin structures or closely packed objects. Various weight map-based loss functions have been proposed to address this issue by assigning higher loss weights to pixels prone to misclassification. However, these methods typically rely on precomputed or runtime-generated weight maps based on distance transforms, which impose significant computational costs and fail to adapt to evolving network predictions. In this paper, we propose a novel steerable pyramid-based weighted (SPW) loss function that efficiently generates adaptive weight maps. Unlike traditional boundary-aware losses that depend on static or iteratively updated distance maps, our method leverages steerable pyramids to dynamically emphasize regions across multiple frequency bands (capturing features at different scales) while maintaining computational efficiency. Additionally, by incorporating network predictions into the weight computation, our approach enables adaptive refinement during training. We evaluate our method on the SNEMI3D, GlaS, and DRIVE datasets, benchmarking it against 11 state-of-the-art loss functions. Our results demonstrate that the proposed SPW loss function achieves superior pixel precision and segmentation accuracy with minimal computational overhead. This work provides an effective and efficient solution for improving semantic segmentation, particularly for applications requiring multiscale feature representation. The code is available at this https URL.
214. 【2503.06601】StructVPR++: Distill Structural and Semantic Knowledge with Weighting Samples for Visual Place Recognition
Link: https://arxiv.org/abs/2503.06601
Authors: Yanqing Shen,Sanping Zhou,Jingwen Fu,Ruotong Wang,Shitao Chen,Nanning Zheng
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Visual place recognition, Visual place, driving and robotics, image retrieval problem, place recognition
Comments:
Abstract:Visual place recognition is a challenging task for autonomous driving and robotics, which is usually considered as an image retrieval problem. A commonly used two-stage strategy involves global retrieval followed by re-ranking using patch-level descriptors. Most deep learning-based methods in an end-to-end manner cannot extract global features with sufficient semantic information from RGB images. In contrast, re-ranking can utilize more explicit structural and semantic information in one-to-one matching process, but it is time-consuming. To bridge the gap between global retrieval and re-ranking and achieve a good trade-off between accuracy and efficiency, we propose StructVPR++, a framework that embeds structural and semantic knowledge into RGB global representations via segmentation-guided distillation. Our key innovation lies in decoupling label-specific features from global descriptors, enabling explicit semantic alignment between image pairs without requiring segmentation during deployment. Furthermore, we introduce a sample-wise weighted distillation strategy that prioritizes reliable training pairs while suppressing noisy ones. Experiments on four benchmarks demonstrate that StructVPR++ surpasses state-of-the-art global methods by 5-23% in Recall@1 and even outperforms many two-stage approaches, achieving real-time efficiency with a single RGB input.
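Recall@1, the headline metric above, is the fraction of queries whose top-ranked retrieved image is a correct match. A minimal sketch follows, assuming each query comes with a ranked list of database IDs and a set of ground-truth positive IDs. The data layout and the function name `recall_at_1` are hypothetical, and how positives are defined (for example, by a distance threshold) varies between benchmarks.

```python
def recall_at_1(rankings, positives):
    """Fraction of queries whose first retrieved item is a true positive.

    rankings:  dict mapping query ID -> list of database IDs, best first
    positives: dict mapping query ID -> set of correct database IDs
    """
    if not rankings:
        raise ValueError("at least one query is required")
    hits = sum(
        1
        for query, ranked in rankings.items()
        if ranked and ranked[0] in positives.get(query, set())
    )
    return hits / len(rankings)


# Toy retrieval results for four queries (made-up IDs).
rankings = {"q1": ["a", "b"], "q2": ["c", "d"], "q3": ["e"], "q4": ["x"]}
positives = {"q1": {"a"}, "q2": {"d"}, "q3": {"e", "f"}, "q4": {"y"}}
score = recall_at_1(rankings, positives)
```

Here `q2` does not count as a hit even though its correct match `d` is ranked second, which is why a re-ranking stage can raise Recall@1 without changing the retrieved candidate set.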
215. 【2503.06598】MultiCo3D: Multi-Label Voxel Contrast for One-Shot Incremental Segmentation of 3D Neuroimages
Link: https://arxiv.org/abs/2503.06598
Authors: Hao Xu,Tengfei Xue,Dongnan Liu,Yuqian Chen,Fan Zhang,Carl-Fredrik Westin,Ron Kikinis,Lauren J. O'Donnell,Weidong Cai
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: functional connectivity analysis, One-shot Class Incremental, structure and function, aiding in precise, Class Incremental
Comments: 13 pages, 6 figures, 6 tables
Abstract:3D neuroimages provide a comprehensive view of brain structure and function, aiding in precise localization and functional connectivity analysis. Segmentation of white matter (WM) tracts using 3D neuroimages is vital for understanding the brain's structural connectivity in both healthy and diseased states. One-shot Class Incremental Semantic Segmentation (OCIS) refers to effectively segmenting new (novel) classes using only a single sample while retaining knowledge of old (base) classes without forgetting. Voxel-contrastive OCIS methods adjust the feature space to alleviate the feature overlap problem between the base and novel classes. However, since WM tract segmentation is a multi-label segmentation task, existing single-label voxel contrastive-based methods may cause inherent contradictions. To address this, we propose a new multi-label voxel contrast framework called MultiCo3D for one-shot class incremental tract segmentation. Our method utilizes uncertainty distillation to preserve base tract segmentation knowledge while adjusting the feature space with multi-label voxel contrast to alleviate feature overlap when learning novel tracts and dynamically weighting multi losses to balance overall loss. We compare our method against several state-of-the-art (SOTA) approaches. The experimental results show that our method significantly enhances one-shot class incremental tract segmentation accuracy across five different experimental setups on HCP and Preto datasets.
216. 【2503.06588】Speech Audio Generation from dynamic MRI via a Knowledge Enhanced Conditional Variational Autoencoder
Link: https://arxiv.org/abs/2503.06588
Authors: Yaxuan Li,Han Jiang,Yifei Ma,Shihua Qin,Fangxu Xing
Categories: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Magnetic Resonance Imaging, Dynamic Magnetic Resonance, adopted imaging modality, increasingly adopted imaging, Magnetic Resonance
Comments:
Abstract:Dynamic Magnetic Resonance Imaging (MRI) of the vocal tract has become an increasingly adopted imaging modality for speech motor studies. Beyond image signals, systematic data loss, noise pollution, and audio file corruption can occur due to the unpredictability of the MRI acquisition environment. In such cases, generating audio from images is critical for data recovery in both clinical and research applications. However, this remains challenging due to hardware constraints, acoustic interference, and data corruption. Existing solutions, such as denoising and multi-stage synthesis methods, face limitations in audio fidelity and generalizability. To address these challenges, we propose a Knowledge Enhanced Conditional Variational Autoencoder (KE-CVAE), a novel two-step "knowledge enhancement + variational inference" framework for generating speech audio signals from cine dynamic MRI sequences. This approach introduces two key innovations: (1) integration of unlabeled MRI data for knowledge enhancement, and (2) a variational inference architecture to improve generative modeling capacity. To the best of our knowledge, this is one of the first attempts at synthesizing speech audio directly from dynamic MRI video sequences. The proposed method was trained and evaluated on an open-source dynamic vocal tract MRI dataset recorded during speech. Experimental results demonstrate its effectiveness in generating natural speech waveforms while addressing MRI-specific acoustic challenges, outperforming conventional deep learning-based synthesis approaches.
217. 【2503.06587】Introducing Unbiased Depth into 2D Gaussian Splatting for High-accuracy Surface Reconstruction
Link: https://arxiv.org/abs/2503.06587
Authors: Xiaoming Peng,Yixin Yang,Yang Zhou,Hui Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: demonstrated superior geometry, approximate thin surfaces, superior geometry reconstruction, Gaussian Splatting, surfels to approximate
Comments:
Click to view abstract
Abstract:Recently, 2D Gaussian Splatting (2DGS) has demonstrated geometry reconstruction quality superior to that of the popular 3DGS by using 2D surfels to approximate thin surfaces. However, it falls short when dealing with glossy surfaces, resulting in visible holes in these areas. We find that reflection discontinuity causes the issue. To fit the jump from diffuse to specular reflection at different viewing angles, depth bias is introduced in the optimized Gaussian primitives. To address that, we first replace the depth distortion loss in 2DGS with a novel depth convergence loss, which imposes a strong constraint on depth continuity. Then, we rectify the depth criterion for determining the actual surface so that it fully accounts for all the intersecting Gaussians along the ray. Qualitative and quantitative evaluations across various datasets reveal that our method significantly improves reconstruction quality, with more complete and accurate surfaces than 2DGS.
218. 【2503.06569】Global-Aware Monocular Semantic Scene Completion with State Space Models
Link: https://arxiv.org/abs/2503.06569
Authors: Shijie Li,Zhongyao Cheng,Rong Li,Shuai Li,Juergen Gall,Xun Xu,Xulei Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Semantic Scene Completion, Monocular Semantic Scene, diverse real-world applications, Monocular Semantic, Scene Completion
Comments:
Click to view abstract
Abstract:Monocular Semantic Scene Completion (MonoSSC) reconstructs and interprets 3D environments from a single image, enabling diverse real-world applications. However, existing methods are often constrained by the local receptive field of Convolutional Neural Networks (CNNs), making it challenging to handle the non-uniform distribution of projected points and effectively reconstruct missing information caused by the 3D-to-2D projection. In this work, we introduce GA-MonoSSC, a hybrid architecture for MonoSSC that effectively captures global context in both the 2D image domain and 3D space. Specifically, we propose a Dual-Head Multi-Modality Encoder, which leverages a Transformer architecture to capture spatial relationships across all features in the 2D image domain, enabling more comprehensive 2D feature extraction. Additionally, we introduce the Frustum Mamba Decoder, built on the State Space Model (SSM), to efficiently capture long-range dependencies in 3D space. Furthermore, we propose a frustum reordering strategy within the Frustum Mamba Decoder to mitigate feature discontinuities in the reordered voxel sequence, ensuring better alignment with the scan mechanism of the State Space Model (SSM) for improved 3D representation learning. We conduct extensive experiments on the widely used Occ-ScanNet and NYUv2 datasets, demonstrating that our proposed method achieves state-of-the-art performance, validating its effectiveness. The code will be released upon acceptance.
219. 【2503.06568】Conceptrol: Concept Control of Zero-shot Personalized Image Generation
Link: https://arxiv.org/abs/2503.06568
Authors: Qiyuan He,Angela Yao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: diffusion models generates, models generates unseen, generates unseen images, unseen images based, diffusion models
Comments:
Click to view abstract
Abstract:Personalized image generation with text-to-image diffusion models generates unseen images based on reference image content. Zero-shot adapter methods such as IP-Adapter and OminiControl are especially interesting because they do not require test-time fine-tuning. However, they struggle to balance preserving personalized content and adherence to the text prompt. We identify a critical design flaw resulting in this performance gap: current adapters inadequately integrate personalization images with the textual descriptions. The generated images, therefore, replicate the personalized content rather than adhere to the text prompt instructions. Yet the base text-to-image has strong conceptual understanding capabilities that can be leveraged. We propose Conceptrol, a simple yet effective framework that enhances zero-shot adapters without adding computational overhead. Conceptrol constrains the attention of visual specification with a textual concept mask that improves subject-driven generation capabilities. It achieves as much as 89% improvement on personalization benchmarks over the vanilla IP-Adapter and can even outperform fine-tuning approaches such as Dreambooth LoRA. The source code is available at this https URL.
220. 【2503.06565】Future-Aware Interaction Network For Motion Forecasting
Link: https://arxiv.org/abs/2503.06565
Authors: Shijie Li,Xun Xu,Si Yong Yeo,Xulei Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
Click to view abstract
None
221. 【2503.06564】TR-DQ: Time-Rotation Diffusion Quantization
Link: https://arxiv.org/abs/2503.06564
Authors: Yihua Shao,Deyang Lin,Fanhu Zeng,Minxi Yan,Muyang Zhang,Siyu Chen,Yuxuan Fan,Ziyang Yan,Haozhe Wang,Jingcai Guo,Yan Wang,Haotong Qin,Hao Tang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: widely adopted, quantization, diffusion quantization, generation, Diffusion
Comments:
Click to view abstract
Abstract:Diffusion models have been widely adopted in image and video generation. However, their complex network architecture leads to high inference overhead during generation. Existing diffusion quantization methods primarily focus on the quantization of the model structure while ignoring the impact of time-step variation during sampling. At the same time, most current approaches fail to account for significant activations that cannot be eliminated, resulting in substantial performance degradation after quantization. To address these issues, we propose Time-Rotation Diffusion Quantization (TR-DQ), a novel quantization method incorporating time-step and rotation-based optimization. TR-DQ first divides the sampling process based on time-steps and applies a rotation matrix to smooth activations and weights dynamically. For different time-steps, a dedicated hyperparameter is introduced for adaptive timing modeling, which enables dynamic quantization across different time-steps. We also explore the compression potential of Classifier-Free Guidance (CFG-wise) to establish a foundation for subsequent work. TR-DQ achieves state-of-the-art (SOTA) performance on image generation and video generation tasks, with a 1.38-1.89x speedup and a 1.97-2.58x memory reduction in inference compared to existing quantization methods.
222. 【2503.06559】MMARD: Improving the Min-Max Optimization Process in Adversarial Robustness Distillation
Link: https://arxiv.org/abs/2503.06559
Authors: Yuzheng Wang,Zhaoyu Chen,Dingkang Yang,Yuanhang Wang,Lizhe Qi
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Adversarial Robustness Distillation, Adversarial Robustness, optimization Adversarial Robustness, Robustness Distillation, pre-trained robust teacher
Comments:
Click to view abstract
Abstract:Adversarial Robustness Distillation (ARD) is a promising task to boost the robustness of small-capacity models with the guidance of a pre-trained robust teacher. ARD can be summarized as a min-max optimization process, i.e., synthesizing adversarial examples (inner) and training the student (outer). Despite their competitive robustness performance, existing ARD methods still have issues. In the inner process, the synthetic training examples are far from the teacher's decision boundary, so important robust information is missed. In the outer process, the student model is decoupled from learning natural and robust scenarios, leading to robustness saturation, i.e., student performance is highly susceptible to customized teacher selection. To tackle these issues, this paper proposes a general Min-Max optimization Adversarial Robustness Distillation (MMARD) method. For the inner process, we introduce the teacher's robust predictions, which drive the training examples closer to the teacher's decision boundary to explore more robust knowledge. For the outer process, we propose a structured information modeling method based on triangular relationships to measure the mutual information of the model in natural and robust scenarios and enhance the model's ability to understand multi-scenario mapping relationships. Experiments show our MMARD achieves state-of-the-art performance on multiple benchmarks. Besides, MMARD is plug-and-play and convenient to combine with existing methods.
223. 【2503.06553】ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges
Link: https://arxiv.org/abs/2503.06553
Authors: Jiaxin Ai,Pengfei Zhou,Zhaopan Xu,Ming Li,Fanrui Zhang,Zizhen Li,Jianwen Sun,Yukang Feng,Baojin Huang,Zhongyuan Wang,Kaipeng Zhang
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: solving scientific problems, frequently exhibit errors, fine-grained model weaknesses, multi-modal large language, large language models
Comments:
Click to view abstract
Abstract:As multi-modal large language models (MLLMs) frequently exhibit errors when solving scientific problems, evaluating the validity of their reasoning processes is critical for ensuring reliability and uncovering fine-grained model weaknesses. Since human evaluation is laborious and costly, prompting MLLMs as automated process judges has become a common practice. However, the reliability of these model-based judges remains uncertain. To address this, we introduce ProJudgeBench, the first comprehensive benchmark specifically designed for evaluating abilities of MLLM-based process judges. ProJudgeBench comprises 2,400 test cases and 50,118 step-level labels, spanning four scientific disciplines with diverse difficulty levels and multi-modal content. In ProJudgeBench, each step is meticulously annotated by human experts for correctness, error type, and explanation, enabling a systematic evaluation of judges' capabilities to detect, classify and diagnose errors. Evaluation on ProJudgeBench reveals a significant performance gap between open-source and proprietary models. To bridge this gap, we further propose ProJudge-173k, a large-scale instruction-tuning dataset, and a Dynamic Dual-Phase fine-tuning strategy that encourages models to explicitly reason through problem-solving before assessing solutions. Both contributions significantly enhance the process evaluation capabilities of open-source models. All the resources will be released to foster future research of reliable multi-modal process evaluation.
224. 【2503.06545】QuantCache: Adaptive Importance-Guided Quantization with Hierarchical Latent and Layer Caching for Video Generation
Link: https://arxiv.org/abs/2503.06545
Authors: Junyi Wu,Zhiteng Li,Zheng Hui,Yulun Zhang,Linghe Kong,Xiaokang Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments: The code and models will be available at [this https URL](https://github.com/JunyiWuCode/QuantCache)
Click to view abstract
None
225. 【2503.06542】ARMOR v0.1: Empowering Autoregressive Multimodal Understanding Model with Interleaved Multimodal Generation via Asymmetric Synergy
Link: https://arxiv.org/abs/2503.06542
Authors: Jianwen Sun,Yukang Feng,Chuanhao Li,Fanrui Zhang,Zizhen Li,Jiaxin Ai,Sizhuo Zhou,Yu Dai,Shenglin Zhang,Kaipeng Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Unified models, multimodal understanding, recently received, received much attention, area of vision
Comments:
Click to view abstract
Abstract:Unified models (UniMs) for multimodal understanding and generation have recently received much attention in the area of vision and language. Existing UniMs are designed to simultaneously learn both multimodal understanding and generation capabilities, demanding substantial computational resources, and often struggle to generate interleaved text-image. We present ARMOR, a resource-efficient and pure autoregressive framework that achieves both understanding and generation by fine-tuning existing multimodal large language models (MLLMs). Specifically, ARMOR extends existing MLLMs from three perspectives: (1) For model architecture, an asymmetric encoder-decoder architecture with a forward-switching mechanism is introduced to unify embedding space integrating textual and visual modalities for enabling natural text-image interleaved generation with minimal computational overhead. (2) For training data, a meticulously curated, high-quality interleaved dataset is collected for fine-tuning MLLMs. (3) For the training algorithm, we propose a "what or how to generate" algorithm to empower existing MLLMs with multimodal generation capabilities while preserving their multimodal understanding capabilities, through three progressive training stages based on the collected dataset. Experimental results demonstrate that ARMOR upgrades existing MLLMs to UniMs with promising image generation capabilities, using limited training resources. Our code will be released soon at this https URL.
226. 【2503.06537】One-Step Diffusion Model for Image Motion-Deblurring
Link: https://arxiv.org/abs/2503.06537
Authors: Xiaoyang Liu,Yuquan Wang,Zheng Chen,Jiezhang Cao,He Zhang,Yulun Zhang,Xiaokang Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
Click to view abstract
None
227. 【2503.06529】AnywhereDoor: Multi-Target Backdoor Attacks on Object Detection
Link: https://arxiv.org/abs/2503.06529
Authors: Jialin Lu,Junjie Shan,Ziqi Zhao,Ka-Ho Chow
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
Click to view abstract
None
228. 【2503.06526】TimeLoc: A Unified End-to-End Framework for Precise Timestamp Localization in Long Videos
Link: https://arxiv.org/abs/2503.06526
Authors: Chen-Lin Zhang,Lin Sui,Shuming Liu,Fangzhou Mu,Zhangcheng Wang,Bernard Ghanem
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Keywords:
Comments: Code and models will be released at [this https URL](https://github.com/sming256/TimeLoc). The first 4 authors contribute equally
Click to view abstract
None
229. 【2503.06522】SGA-INTERACT: A 3D Skeleton-based Benchmark for Group Activity Understanding in Modern Basketball Tactic
Link: https://arxiv.org/abs/2503.06522
Authors: Yuchen Yang,Wei Wang,Yifei Liu,Linfeng Dong,Hao Wu,Mingxin Zhang,Zhihang Zhong,Xiao Sun
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Group Activity Recognition, Group Activity Understanding, Group Activity, Activity Recognition, Activity Understanding
Comments: None
Click to view abstract
Abstract:Group Activity Understanding is predominantly studied as the Group Activity Recognition (GAR) task. However, existing GAR benchmarks suffer from coarse-grained activity vocabularies and single-view-only data, which hinder the evaluation of state-of-the-art algorithms. To address these limitations, we introduce SGA-INTERACT, the first 3D skeleton-based benchmark for group activity understanding. It features complex activities inspired by basketball tactics, emphasizing rich spatial interactions and long-term dependencies. SGA-INTERACT introduces the Temporal Group Activity Localization (TGAL) task, extending group activity understanding to untrimmed sequences and filling the gap left by GAR as a standalone task. In addition to the benchmark, we propose One2Many, a novel framework that employs a pretrained 3D skeleton backbone for unified individual feature extraction. This framework aligns with the feature extraction paradigm in RGB-based methods, enabling direct evaluation of RGB-based models on skeleton-based benchmarks. We conduct extensive evaluations on SGA-INTERACT using two skeleton-based methods, three RGB-based methods, and a proposed baseline within the One2Many framework. The generally low performance of baselines highlights the benchmark's challenges, motivating advancements in group activity understanding.
230. 【2503.06520】Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
Link: https://arxiv.org/abs/2503.06520
Authors: Yuqi Liu,Bohao Peng,Zhisheng Zhong,Zihao Yue,Fanbin Lu,Bei Yu,Jiaya Jia
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Keywords: Traditional methods, simple descriptions, explicit reasoning processes, rely on supervised, supervised fine-tuning
Comments:
Click to view abstract
Abstract:Traditional methods for reasoning segmentation rely on supervised fine-tuning with categorical labels and simple descriptions, limiting their out-of-domain generalization and lacking explicit reasoning processes. To address these limitations, we propose Seg-Zero, a novel framework that demonstrates remarkable generalizability and derives explicit chain-of-thought reasoning through cognitive reinforcement. Seg-Zero introduces a decoupled architecture consisting of a reasoning model and a segmentation model. The reasoning model interprets user intentions, generates explicit reasoning chains, and produces positional prompts, which are subsequently used by the segmentation model to generate precise pixel-level masks. We design a sophisticated reward mechanism that integrates both format and accuracy rewards to effectively guide optimization directions. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Seg-Zero achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Experiments show that Seg-Zero-7B achieves a zero-shot performance of 57.5 on the ReasonSeg benchmark, surpassing the prior LISA-7B by 18%. This significant improvement highlights Seg-Zero's ability to generalize across domains while presenting an explicit reasoning process. Code is available at this https URL.
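The abstract above mentions a reward that combines format and accuracy terms. As a rough illustration only, the sketch below shows one way such a combined reward could look. The `<think>`/`<answer>` tag layout, the box-IoU accuracy check, the 0.5 threshold, and all function names are assumptions made for this example; they are not taken from the paper.

```python
import re

def format_reward(text):
    """Return 1.0 if the output follows the assumed <think>...</think><answer>...</answer> layout."""
    pattern = r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, text, flags=re.DOTALL) else 0.0

def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def total_reward(text, pred_box, gt_box, iou_threshold=0.5):
    """Sum of the format reward and a thresholded accuracy reward (hypothetical weighting)."""
    accuracy = 1.0 if box_iou(pred_box, gt_box) >= iou_threshold else 0.0
    return format_reward(text) + accuracy

sample = "<think>find the cup</think><answer>[10, 10, 50, 50]</answer>"
print(total_reward(sample, (10, 10, 50, 50), (12, 12, 50, 50)))
```

A well-formed output with a well-localized box earns both terms, while a malformed output can earn at most the accuracy term; how the actual method weights or shapes these terms is not stated in the abstract.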
231. 【2503.06517】Instance-wise Supervision-level Optimization in Active Learning
Link: https://arxiv.org/abs/2503.06517
Authors: Shinnosuke Matsuo,Riku Togashi,Ryoma Bise,Seiichi Uchida,Masahiro Nomura
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords:
Comments: Accepted at CVPR2025
Click to view abstract
None
232. 【2503.06515】SAQ-SAM: Semantically-Aligned Quantization for Segment Anything Model
Link: https://arxiv.org/abs/2503.06515
Authors: Jing Zhang,Zhikai Li,Qingyi Gu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
Click to view abstract
None
233. 【2503.06514】GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks
Link: https://arxiv.org/abs/2503.06514
Authors: Haoqiang Kang,Enna Sachdeva,Piyush Gupta,Sangjae Bae,Kwonjoon Lee
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: recently shown promising, shown promising advancements, Proximal Policy Optimization, sequential decision-making tasks, recently shown
Comments:
Click to view abstract
Abstract:Vision-Language Models (VLMs) have recently shown promising advancements in sequential decision-making tasks through task-specific fine-tuning. However, common fine-tuning methods, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) techniques like Proximal Policy Optimization (PPO), present notable limitations: SFT assumes Independent and Identically Distributed (IID) data, while PPO focuses on maximizing cumulative rewards. These limitations often restrict solution diversity and hinder generalization in multi-step reasoning tasks. To address these challenges, we introduce GFlowVLM, a novel framework that fine-tunes VLMs using Generative Flow Networks (GFlowNets) to promote generation of diverse solutions for complex reasoning tasks. GFlowVLM models the environment as a non-Markovian decision process, allowing it to capture long-term dependencies essential for real-world applications. It takes observations and task descriptions as inputs to prompt chain-of-thought (CoT) reasoning, which subsequently guides action selection. We use task-based rewards to fine-tune the VLM with GFlowNets. This approach enables VLMs to outperform prior fine-tuning methods, including SFT and RL. Empirical results demonstrate the effectiveness of GFlowVLM on complex tasks such as card games (NumberLine, BlackJack) and embodied planning tasks (ALFWorld), showing enhanced training efficiency, solution diversity, and stronger generalization capabilities across both in-distribution and out-of-distribution scenarios.
234. 【2503.06508】A Light and Tuning-free Method for Simulating Camera Motion in Video Generation
Link: https://arxiv.org/abs/2503.06508
Authors: Quanjian Song,Zhihang Lin,Zhanpeng Zeng,Ziyue Zhang,Liujuan Cao,Rongrong Ji
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: face computational bottlenecks, methods face computational, camera motion-controlled video, Existing camera motion-controlled, latent space
Comments: 18 pages in total
Click to view abstract
Abstract:Existing camera motion-controlled video generation methods face computational bottlenecks in fine-tuning and inference. This paper proposes LightMotion, a light and tuning-free method for simulating camera motion in video generation. Operating in the latent space, it eliminates additional fine-tuning, inpainting, and depth estimation, making it more streamlined than existing methods. The contributions of this paper are: (i) A latent space permutation operation that effectively simulates various camera motions like panning, zooming, and rotation. (ii) A latent space resampling strategy that combines background-aware sampling and cross-frame alignment to accurately fill new perspectives while maintaining coherence across frames. (iii) An in-depth analysis showing that the permutation and resampling cause an SNR shift in latent space, leading to poor-quality generation. To address this, we propose latent space correction, which reintroduces noise during denoising to mitigate SNR shift and enhance video generation quality. Extensive experiments show that our LightMotion outperforms existing methods, both quantitatively and qualitatively.
235. 【2503.06506】Fine-Grained Alignment and Noise Refinement for Compositional Text-to-Image Generation
Link: https://arxiv.org/abs/2503.06506
Authors: Amir Mohammad Izadi,Seyed Mohammad Hadi Hosseini,Soroush Vafaie Tabar,Ali Abdollahi,Armin Saghafian,Mahdieh Soleymani Baghshah
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords:
Comments:
Click to view abstract
None
236. 【2503.06505】DynamicID: Zero-Shot Multi-ID Image Personalization with Flexible Facial Editability
Link: https://arxiv.org/abs/2503.06505
Authors: Xirui Hu,Jiahao Wang,Hao Chen,Weizhan Zhang,Benqi Wang,Yikun Li,Haishun Nan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords:
Comments: 17 pages, 16 figures
Click to view abstract
None
237. 【2503.06501】TextInPlace: Indoor Visual Place Recognition in Repetitive Structures with Scene Text Spotting and Verification
Link: https://arxiv.org/abs/2503.06501
Authors: Huaqi Tao,Bingxi Liu,Calvin Chen,Tingjun Huang,He Li,Jinqiang Cui,Hong Zhang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords:
Comments: 8 pages, 5 figures
Click to view abstract
None
238. 【2503.06499】ExGes: Expressive Human Motion Retrieval and Modulation for Audio-Driven Gesture Synthesis
Link: https://arxiv.org/abs/2503.06499
Authors: Xukun Zhou,Fengxin Li,Ming Chen,Yan Zhou,Pengfei Wan,Di Zhang,Hongyan Liu,Jun He,Zhaoxin Fan
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords:
Comments:
Click to view abstract
None
239. 【2503.06497】Evaluation of Safety Cognition Capability in Vision-Language Models for Autonomous Driving
Link: https://arxiv.org/abs/2503.06497
Authors: Enming Zhang,Peizhe Gong,Xingyuan Dai,Yisheng Lv,Qinghai Miao
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords:
Comments:
Click to view abstract
None
240. 【2503.06492】VisualSimpleQA: A Benchmark for Decoupled Evaluation of Large Vision-Language Models in Fact-Seeking Question Answering
Link: https://arxiv.org/abs/2503.06492
Authors: Yanling Wang,Yihan Zhao,Xiaodong Chen,Shasha Guo,Lixin Liu,Haoyang Li,Yong Xiao,Jing Zhang,Qi Li,Ke Xu
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: demonstrated remarkable achievements, Large vision-language models, non-factual responses remains, responses remains prevalent, Large vision-language
Comments:
Click to view abstract
Abstract:Large vision-language models (LVLMs) have demonstrated remarkable achievements, yet the generation of non-factual responses remains prevalent in fact-seeking question answering (QA). Current multimodal fact-seeking benchmarks primarily focus on comparing model outputs to ground truth answers, providing limited insights into the performance of modality-specific modules. To bridge this gap, we introduce VisualSimpleQA, a multimodal fact-seeking benchmark with two key features. First, it enables streamlined and decoupled evaluation of LVLMs in visual and linguistic modalities. Second, it incorporates well-defined difficulty criteria to guide human annotation and facilitates the extraction of a challenging subset, VisualSimpleQA-hard. Experiments on 15 LVLMs show that even state-of-the-art models such as GPT-4o achieve merely 60%+ correctness in multimodal fact-seeking QA on VisualSimpleQA and 30%+ on VisualSimpleQA-hard. Furthermore, the decoupled evaluation across these models highlights substantial opportunities for improvement in both visual and linguistic modules. The dataset is available at this https URL.
241. 【2503.06486】PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training
Link: https://arxiv.org/abs/2503.06486
Authors: Cong Chen,Mingyu Liu,Chenchen Jing,Yizhou Zhou,Fengyun Rao,Hao Chen,Bo Zhang,Chunhua Shen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords:
Comments:
Click to view abstract
None
242. 【2503.06485】A Mesh Is Worth 512 Numbers: Spectral-domain Diffusion Modeling for High-dimension Shape Generation
Link: https://arxiv.org/abs/2503.06485
Authors: Jiajie Fan,Amal Trigui,Andrea Bonfanti,Felix Dietrich,Thomas Bäck,Hao Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
Click to view abstract
None
243. 【2503.06484】Sign Language Translation using Frame and Event Stream: Benchmark Dataset and Algorithms
Link: https://arxiv.org/abs/2503.06484
Authors: Xiao Wang,Yuehang Li,Fuling Wang,Bo Jiang,Yaowei Wang,Yonghong Tian,Jin Tang,Bin Luo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Keywords:
Comments: In Peer Review
Click to view abstract
None
244. 【2503.06482】PathVQ: Reforming Computational Pathology Foundation Model for Whole Slide Image Analysis via Vector Quantization
Link: https://arxiv.org/abs/2503.06482
Authors: Honglin Li,Zhongyi Shui,Yunlong Zhang,Chenglu Zhu,Lin Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
Click to view abstract
None
245. 【2503.06477】PDB: Not All Drivers Are the Same -- A Personalized Dataset for Understanding Driving Behavior
Link: https://arxiv.org/abs/2503.06477
Authors: Chuheng Wei,Ziye Qin,Siyan Li,Ziyan Zhang,Xuanpeng Zhao,Amr Abdelraouf,Rohit Gupta,Kyungtae Han,Matthew J. Barth,Guoyuan Wu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords:
Comments:
Click to view abstract
None
246. 【2503.06473】Enhancing Layer Attention Efficiency through Pruning Redundant Retrievals
Link: https://arxiv.org/abs/2503.06473
Authors: Hanze Li,Xiande Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Growing evidence suggests, deep neural networks, significantly advanced network, Growing evidence, advanced network architectures
Comments: 11 pages, 7 figures
Click to view abstract
Abstract:Growing evidence suggests that layer attention mechanisms, which enhance interaction among layers in deep neural networks, have significantly advanced network architectures. However, existing layer attention methods suffer from redundancy, as attention weights learned by adjacent layers often become highly similar. This redundancy causes multiple layers to extract nearly identical features, reducing the model's representational capacity and increasing training time. To address this issue, we propose a novel approach to quantify redundancy by leveraging the Kullback-Leibler (KL) divergence between adjacent layers. Additionally, we introduce an Enhanced Beta Quantile Mapping (EBQM) method that accurately identifies and skips redundant layers, thereby maintaining model stability. Our proposed Efficient Layer Attention (ELA) architecture improves both training efficiency and overall performance, achieving a 30% reduction in training time while enhancing performance in tasks such as image classification and object detection.
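To make the redundancy measure in the abstract above concrete, here is a minimal sketch of comparing the attention weights of adjacent layers with the KL divergence, $D_{\mathrm{KL}}(p \,\|\, q) = \sum_i p_i \log (p_i / q_i)$. The fixed threshold, the toy attention vectors, and the function names are assumptions for illustration; the paper's EBQM selection step is not reproduced here.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def redundant_layers(attn_by_layer, threshold):
    """Indices of layers whose attention weights nearly match the previous layer's."""
    flagged = []
    for i in range(1, len(attn_by_layer)):
        if kl_divergence(attn_by_layer[i], attn_by_layer[i - 1]) < threshold:
            flagged.append(i)
    return flagged

# Toy attention distributions for three layers (hypothetical values).
attn = [
    [0.70, 0.20, 0.10],
    [0.69, 0.21, 0.10],  # almost identical to layer 0
    [0.10, 0.30, 0.60],  # clearly different from layer 1
]
print(redundant_layers(attn, threshold=0.01))  # → [1]
```

A small divergence means a layer retrieves almost the same information as its predecessor, which is the kind of layer the abstract describes as a candidate for skipping.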
247. 【2503.06472】CalliReader: Contextualizing Chinese Calligraphy via an Embedding-Aligned Vision-Language Model
Link: https://arxiv.org/abs/2503.06472
Authors: Yuxuan Luo,Jiaqi Tang,Chenyi Huang,Feiyang Hao,Zhouhui Lian
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Keywords: remains computationally challenging, computationally challenging due, UNESCO Heritage, Chinese Calligraphy Contextualization, remains computationally
Comments: 11 pages
Click to view abstract
Abstract:Chinese calligraphy, a UNESCO Heritage, remains computationally challenging due to visual ambiguity and cultural complexity. Existing AI systems fail to contextualize its intricate scripts because of limited annotated data and poor visual-semantic alignment. We propose CalliReader, a vision-language model (VLM) that solves the Chinese Calligraphy Contextualization (CC$^2$) problem through three innovations: (1) character-wise slicing for precise character extraction and sorting, (2) CalliAlign for visual-text token compression and alignment, (3) embedding instruction tuning (e-IT) for improving alignment and addressing data scarcity. We also build CalliBench, the first benchmark for full-page calligraphic contextualization, addressing three critical issues in previous OCR and VQA approaches: fragmented context, shallow reasoning, and hallucination. Extensive experiments including user studies have been conducted to verify our CalliReader's superiority to other state-of-the-art methods and even human professionals in page-level calligraphy recognition and interpretation, achieving higher accuracy while reducing hallucination. Comparisons with reasoning models highlight the importance of accurate recognition as a prerequisite for reliable comprehension. Quantitative analyses validate CalliReader's efficiency; evaluations on document and real-world benchmarks confirm its robust generalization ability.
248. 【2503.06471】Online Dense Point Tracking with Streaming Memory
Link: https://arxiv.org/abs/2503.06471
Authors: Qiaole Dong,Yanwei Fu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
Click to view abstract
None
249. 【2503.06470】Think Twice, Click Once: Enhancing GUI Grounding via Fast and Slow Systems
Link: https://arxiv.org/abs/2503.06470
Authors: Fei Tang,Yongliang Shen,Hang Zhang,Siqi Chen,Guiyang Hou,Wenqi Zhang,Wenqiao Zhang,Kaitao Song,Weiming Lu,Yueting Zhuang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
Click to view abstract
None
250. 【2503.06469】Vector Quantized Feature Fields for Fast 3D Semantic Lifting
Link: https://arxiv.org/abs/2503.06469
Authors: George Tang,Aditya Agarwal,Weiqiao Han,Trevor Darrell,Yutong Bai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
Click to view abstract
None
251. 【2503.06467】SP3D: Boosting Sparsely-Supervised 3D Object Detection via Accurate Cross-Modal Semantic Prompts
Link: https://arxiv.org/abs/2503.06467
Authors: Shijia Zhao,Qiming Xia,Xusheng Guo,Pufan Zou,Maoji Zheng,Hai Wu,Chenglu Wen,Cheng Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments: 11 pages, 3 figures
Click to view abstract
None
252. 【2503.06462】StructGS: Adaptive Spherical Harmonics and Rendering Enhancements for Superior 3D Gaussian Splatting
Link: https://arxiv.org/abs/2503.06462
Authors: Zexu Huang,Min Xu,Stuart Perry
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
253. 【2503.06461】Long-tailed Adversarial Training with Self-Distillation
Link: https://arxiv.org/abs/2503.06461
Authors: Seungju Cho,Hongsin Lee,Changick Kim
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICLR 2025
254. 【2503.06458】Reconstructing Depth Images of Moving Objects from Wi-Fi CSI Data
Link: https://arxiv.org/abs/2503.06458
Authors: Guanyu Cao,Takuya Maekawa,Kazuya Ohara,Yasue Kishino
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
255. 【2503.06457】Geometric Knowledge-Guided Localized Global Distribution Alignment for Federated Learning
Link: https://arxiv.org/abs/2503.06457
Authors: Yanbiao Ma,Wei Dai,Wenke Huang,Jiayi Chen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: divergent local optimization, local optimization directions, global geometric shapes, federated learning, leads to divergent
Comments: Accepted by CVPR 2025
Abstract:Data heterogeneity in federated learning, characterized by a significant misalignment between local and global distributions, leads to divergent local optimization directions and hinders global model training. Existing studies mainly focus on optimizing local updates or global aggregation, but these indirect approaches demonstrate instability when handling highly heterogeneous data distributions, especially in scenarios where label skew and domain skew coexist. To address this, we propose a geometry-guided data generation method that centers on simulating the global embedding distribution locally. We first introduce the concept of the geometric shape of an embedding distribution and then address the challenge of obtaining global geometric shapes under privacy constraints. Subsequently, we propose GGEUR, which leverages global geometric shapes to guide the generation of new samples, enabling a closer approximation to the ideal global distribution. In single-domain scenarios, we augment samples based on global geometric shapes to enhance model generalization; in multi-domain scenarios, we further employ class prototypes to simulate the global distribution across domains. Extensive experimental results demonstrate that our method significantly enhances the performance of existing approaches in handling highly heterogeneous data, including scenarios with label skew, domain skew, and their coexistence. Code is published at the URL linked from the arXiv abstract page.
256. 【2503.06456】DynCIM: Dynamic Curriculum for Imbalanced Multimodal Learning
Link: https://arxiv.org/abs/2503.06456
Authors: Chengxuan Qian,Kai Han,Jingchao Wang,Zhenlong Yuan,Rui Qian,Chongwen Lyu,Jun Chen,Zhe Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
257. 【2503.06451】A Quantitative Evaluation of the Expressivity of BMI, Pose and Gender in Body Embeddings for Recognition and Identification
Link: https://arxiv.org/abs/2503.06451
Authors: Basudha Pal,Siyuan (Cyan) Huang,Rama Chellappa
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Person Re-identification, systems identify individuals, identify individuals, individuals across images, images or video
Abstract:Person Re-identification (ReID) systems identify individuals across images or video frames and play a critical role in various real-world applications. However, many ReID methods are influenced by sensitive attributes such as gender, pose, and body mass index (BMI), which vary in uncontrolled environments, leading to biases and reduced generalization. To address this, we extend the concept of expressivity to the body recognition domain to better understand how ReID models encode these attributes. Expressivity, defined as the mutual information between feature vector representations and specific attributes, is computed using a secondary neural network that takes feature and attribute vectors as inputs. This provides a quantitative framework for analyzing the extent to which sensitive attributes are embedded in the model's representations. We apply expressivity analysis to SemReID, a state-of-the-art self-supervised ReID model, and find that BMI consistently exhibits the highest expressivity scores in the model's final layers, underscoring its dominant role in feature encoding. In the final attention layer of the trained network, the expressivity order for body attributes is BMI > Pitch > Yaw > Gender, highlighting their relative importance in learned representations. Additionally, expressivity values evolve progressively across network layers and training epochs, reflecting a dynamic encoding of attributes during feature extraction. These insights emphasize the influence of body-related attributes on ReID models and provide a systematic methodology for identifying and mitigating attribute-driven biases. By leveraging expressivity analysis, we offer valuable tools to enhance the fairness, robustness, and generalization of ReID systems in diverse real-world settings.
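The abstract above defines expressivity as the mutual information between feature representations and an attribute. As a rough illustration of that quantity only, the sketch below computes a plug-in (histogram) estimate for two discrete variables. This is a deliberately simpler technique than the neural estimator the abstract describes, and the function name `mutual_information` is hypothetical.

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in estimate of I(X; Y) in nats for two discrete label sequences.

    Illustrative only: the paper estimates mutual information with a
    secondary neural network, not with this histogram estimator.
    """
    x = np.asarray(x).ravel()
    y = np.asarray(y).ravel()
    # Map arbitrary labels to contiguous integer indices.
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    # Build the empirical joint distribution.
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)
    joint /= joint.sum()
    # Marginals and their outer product.
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0  # skip empty cells to avoid log(0)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

# A binned feature that fully determines the attribute versus one that is
# independent of it.
attribute = [0, 0, 1, 1]
print(mutual_information([5, 5, 9, 9], attribute))
print(mutual_information([5, 9, 5, 9], attribute))
```

Continuous feature vectors would first have to be discretized (for example by binning) before this estimator applies, which is one reason a learned estimator is attractive for high-dimensional embeddings.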
258. 【2503.06446】M$^3$amba: CLIP-driven Mamba Model for Multi-modal Remote Sensing Classification
Link: https://arxiv.org/abs/2503.06446
Authors: Mingxiang Cao,Weiying Xie,Xin Zhang,Jiaqing Zhang,Kai Jiang,Jie Lei,Yunsong Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
259. 【2503.06442】OT-DETECTOR: Delving into Optimal Transport for Zero-shot Out-of-Distribution Detection
Link: https://arxiv.org/abs/2503.06442
Authors: Yu Liu,Hao Tang,Haiqi Zhang,Jing Qin,Zechao Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: The first two authors contributed equally to this work
260. 【2503.06437】SEED: Towards More Accurate Semantic Evaluation for Visual Brain Decoding
Link: https://arxiv.org/abs/2503.06437
Authors: Juhyeon Park,Peter Yongho Kim,Jiook Cha,Shinjae Yoo,Taesup Moon
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Under Review
261. 【2503.06435】OV-SCAN: Semantically Consistent Alignment for Novel Object Discovery in Open-Vocabulary 3D Object Detection
Link: https://arxiv.org/abs/2503.06435
Authors: Adrian Chow,Evelien Riddell,Yimu Wang,Sean Sedwards,Krzysztof Czarnecki
Categories: Computer Vision and Pattern Recognition (cs.CV)
262. 【2503.06427】Pre-Training Meta-Rule Selection Policy for Visual Generative Abductive Learning
Link: https://arxiv.org/abs/2503.06427
Authors: Yu Jin,Jingming Liu,Zhexu Luo,Yifei Peng,Ziang Qin,Wang-Zhou Dai,Yao-Xiang Ding,Kun Zhou
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published as a conference paper at IJCLR'24
263. 【2503.06426】Federated Learning for Diffusion Models
Link: https://arxiv.org/abs/2503.06426
Authors: Zihao Peng,Xijun Wang,Shengbo Chen,Hong Rao,Cong Shen
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
Keywords: produce highly realistic, highly realistic samples, Diffusion models, produce highly, highly realistic
Abstract:Diffusion models are powerful generative models that can produce highly realistic samples for various tasks. Typically, these models are constructed using centralized, independently and identically distributed (IID) training data. However, in practical scenarios, data is often distributed across multiple clients and frequently manifests non-IID characteristics. Federated Learning (FL) can leverage this distributed data to train diffusion models, but the performance of existing FL methods is unsatisfactory in non-IID scenarios. To address this, we propose FedDDPM (Federated Learning with Denoising Diffusion Probabilistic Models), which leverages the data generative capability of diffusion models to facilitate model training. In particular, the server uses well-trained local diffusion models uploaded by each client before FL training to generate auxiliary data that can approximately represent the global data distribution. Following each round of model aggregation, the server further optimizes the global model using the auxiliary dataset to alleviate the impact of heterogeneous data on model performance. We provide a rigorous convergence analysis of FedDDPM and propose an enhanced algorithm, FedDDPM+, to reduce training overheads. FedDDPM+ detects instances of slow model learning and performs a one-shot correction using the auxiliary dataset. Experimental results validate that our proposed algorithms outperform the state-of-the-art FL algorithms on the MNIST, CIFAR10 and CIFAR100 datasets.
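The round structure described in the abstract above (aggregate client models, then refine the global model on server-side auxiliary data) can be sketched roughly as follows. This is not the authors' algorithm: the aggregation rule is assumed to be size-weighted averaging (FedAvg), a linear least-squares model stands in for the diffusion model, and the names `fedavg` and `server_refine` are hypothetical.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Size-weighted average of client parameter vectors (assumed FedAvg rule)."""
    sizes = np.asarray(client_sizes, dtype=float)
    coeffs = sizes / sizes.sum()
    return sum(c * w for c, w in zip(coeffs, client_weights))

def server_refine(w, aux_x, aux_y, lr=0.1, steps=10):
    """A few gradient steps on the auxiliary data held by the server.

    A linear least-squares model is used purely as a stand-in for the
    global model; the loss is mean((aux_x @ w - aux_y) ** 2).
    """
    w = np.array(w, dtype=float)
    for _ in range(steps):
        grad = 2.0 * aux_x.T @ (aux_x @ w - aux_y) / len(aux_y)
        w -= lr * grad
    return w

# One simulated round: two clients with different amounts of data.
clients = [np.array([1.0, 1.0]), np.array([3.0, 3.0])]
global_w = fedavg(clients, client_sizes=[1, 3])

# Auxiliary data (hypothetical) that the server would have generated.
aux_x = np.eye(2)
aux_y = np.array([1.0, 2.0])
refined_w = server_refine(global_w, aux_x, aux_y)
print(global_w, refined_w)
```

The point of the second step is that the auxiliary set approximates the global distribution, so the refinement can pull the aggregated model back toward it when client data are non-IID.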
264. 【2503.06419】Consistent Image Layout Editing with Diffusion Models
Link: https://arxiv.org/abs/2503.06419
Authors: Tao Xia,Yudi Zhang,Ting Liu,Lei Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
265. 【2503.06415】Polygonal network disorder and the turning distance
Link: https://arxiv.org/abs/2503.06415
Authors: Alex Dolce,Ryan Lavelle,Bernard Scott,Ashlyn Urbanski,Joseph Klobusicky
Categories: Computer Vision and Pattern Recognition (cs.CV)
266. 【2503.06399】FEDS: Feature and Entropy-Based Distillation Strategy for Efficient Learned Image Compression
Link: https://arxiv.org/abs/2503.06399
Authors: Haisheng Fu,Jie Liang,Zhenman Fang,Jingning Han
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
Comments: 16 pages
267. 【2503.06397】Removing Averaging: Personalized Lip-Sync Driven Characters Based on Identity Adapter
Link: https://arxiv.org/abs/2503.06397
Authors: Yanyu Zhu,Licheng Bai,Jintao Xu,Jiwei Tang,Hai-tao Zheng
Categories: Computer Vision and Pattern Recognition (cs.CV)
268. 【2503.06385】A Good Start Matters: Enhancing Continual Learning with Data-Driven Weight Initialization
Link: https://arxiv.org/abs/2503.06385
Authors: Md Yousuf Harun,Christopher Kanan
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint
269. 【2503.06380】TI-JEPA: An Innovative Energy-based Joint Embedding Strategy for Text-Image Multimodal Systems
Link: https://arxiv.org/abs/2503.06380
Authors: Khang H. N. Vo,Duc P. T. Nguyen,Thong Nguyen,Tho T. Quan
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
270. 【2503.06369】Spectral State Space Model for Rotation-Invariant Visual Representation Learning
Link: https://arxiv.org/abs/2503.06369
Authors: Sahar Dastani,Ali Bahri,Moslem Yazdanpanah,Mehrdad Noori,David Osowiechi,Gustavo Adolfo Vargas Hakim,Farzad Beizaee,Milad Cheraghalikhani,Arnab Kumar Mondal,Herve Lombaert,Christian Desrosiers
Categories: Computer Vision and Pattern Recognition (cs.CV)
271. 【2503.06368】VORTEX: Challenging CNNs at Texture Recognition by using Vision Transformers with Orderless and Randomized Token Encodings
Link: https://arxiv.org/abs/2503.06368
Authors: Leonardo Scabini,Kallil M. Zielinski,Emir Konuk,Ricardo T. Fares,Lucas C. Ribas,Kevin Smith,Odemir M. Bruno
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
272. 【2503.06364】Generative Video Bi-flow
Link: https://arxiv.org/abs/2503.06364
Authors: Chen Liu,Tobias Ritschel
Categories: Computer Vision and Pattern Recognition (cs.CV)
273. 【2503.06362】Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs
Link: https://arxiv.org/abs/2503.06362
Authors: Umberto Cappellazzo,Minsu Kim,Stavros Petridis
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Keywords: speech recognition robustness, enhance speech recognition, Speech Recognition, Large Language Models, leverages both audio
Abstract:Audio-Visual Speech Recognition (AVSR) leverages both audio and visual modalities to enhance speech recognition robustness, particularly in noisy environments. Recent advancements in Large Language Models (LLMs) have demonstrated their effectiveness in speech recognition, including AVSR. However, due to the significant length of speech representations, direct integration with LLMs imposes substantial computational costs. Prior approaches address this by compressing speech representations before feeding them into LLMs. However, higher compression ratios often lead to performance degradation, necessitating a trade-off between computational efficiency and recognition accuracy. To address this challenge, we propose Llama-MTSK, the first Matryoshka-based Multimodal LLM for AVSR, which enables flexible adaptation of the audio-visual token allocation based on specific computational constraints while preserving high performance. Our approach, inspired by Matryoshka Representation Learning, encodes audio-visual representations at multiple granularities within a single model, eliminating the need to train separate models for different compression levels. Moreover, to efficiently fine-tune the LLM, we introduce three LoRA-based Matryoshka strategies using global and scale-specific LoRA modules. Extensive evaluations on the two largest AVSR datasets demonstrate that Llama-MTSK achieves state-of-the-art results, matching or surpassing models trained independently at fixed compression levels.
274. 【2503.06361】Adversarial Robustness of Discriminative Self-Supervised Learning in Vision
Link: https://arxiv.org/abs/2503.06361
Authors: Ömer Veysel Çağatan,Ömer Faruk Tal,M. Emre Gürsoy
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 53 pages
275. 【2503.06339】Learning to Unlearn while Retaining: Combating Gradient Conflicts in Machine Unlearning
Link: https://arxiv.org/abs/2503.06339
Authors: Gaurav Patel,Qiang Qiu
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
276. 【2503.06317】Accurate and Efficient Two-Stage Gun Detection in Video
Link: https://arxiv.org/abs/2503.06317
Authors: Badhan Chandra Das,M. Hadi Amini,Yanzhao Wu
Categories: Computer Vision and Pattern Recognition (cs.CV)
277. 【2503.06316】End-to-End Action Segmentation Transformer
Link: https://arxiv.org/abs/2503.06316
Authors: Tieqiao Wang,Sinisa Todorovic
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
278. 【2503.06313】Advancing Autonomous Vehicle Intelligence: Deep Learning and Multimodal LLM for Traffic Sign Recognition and Robust Lane Detection
Link: https://arxiv.org/abs/2503.06313
Authors: Chandan Kumar Sah,Ankit Kumar Shaw,Xiaoli Lian,Arsalan Shahid Baig,Tuopu Wen,Kun Jiang,Mengmeng Yang,Diange Yang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
Keywords: ensure safe navigation, Large Language Models, require reliable traffic, Multimodal Large Language, traffic sign recognition
Comments: 11 pages, 9 figures
Abstract:Autonomous vehicles (AVs) require reliable traffic sign recognition and robust lane detection capabilities to ensure safe navigation in complex and dynamic environments. This paper introduces an integrated approach combining advanced deep learning techniques and Multimodal Large Language Models (MLLMs) for comprehensive road perception. For traffic sign recognition, we systematically evaluate ResNet-50, YOLOv8, and RT-DETR, achieving state-of-the-art accuracy of 99.8% with ResNet-50, 98.0% with YOLOv8, and 96.6% with RT-DETR despite its higher computational complexity. For lane detection, we propose a CNN-based segmentation method enhanced by polynomial curve fitting, which delivers high accuracy under favorable conditions. Furthermore, we introduce a lightweight, multimodal LLM-based framework that directly undergoes instruction tuning using small yet diverse datasets, eliminating the need for initial pretraining. This framework effectively handles various lane types, complex intersections, and merging zones, significantly enhancing lane detection reliability by reasoning under adverse conditions. Despite constraints in available training resources, our multimodal approach demonstrates advanced reasoning capabilities, achieving a Frame Overall Accuracy (FRM) of 53.87%, a Question Overall Accuracy (QNS) of 82.83%, lane detection accuracies of 99.6% in clear conditions and 93.0% at night, and robust performance in reasoning about lane invisibility due to rain (88.4%) or road degradation (95.6%). The proposed comprehensive framework markedly enhances AV perception reliability, thus contributing significantly to safer autonomous driving across diverse and challenging road scenarios.
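The polynomial curve fitting step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the segmentation stage has already produced lane pixel coordinates, and it assumes a second-degree fit of x as a function of y (a common choice for near-vertical lanes, not confirmed by the abstract). The helper names `fit_lane` and `lane_x` are hypothetical.

```python
import numpy as np

def fit_lane(ys, xs, degree=2):
    """Least-squares fit of x = f(y) to lane pixels.

    Fitting x against y keeps near-vertical lanes single-valued.
    Returns coefficients ordered from highest degree to constant term.
    """
    return np.polyfit(ys, xs, degree)

def lane_x(coeffs, y):
    """Evaluate the fitted lane curve at image row(s) y."""
    return np.polyval(coeffs, y)

# Synthetic lane pixels lying exactly on x = 0.5*y^2 + 2*y + 1.
ys = np.arange(0.0, 10.0)
xs = 0.5 * ys ** 2 + 2.0 * ys + 1.0
coeffs = fit_lane(ys, xs)
print(coeffs)
print(lane_x(coeffs, 4.0))
```

With real segmentation output the points are noisy and may contain outliers, so a robust variant (for example fitting inside a RANSAC loop) would likely be needed.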
279. 【2503.06312】GeoLangBind: Unifying Earth Observation with Agglomerative Vision-Language Foundation Models
Link: https://arxiv.org/abs/2503.06312
Authors: Zhitong Xiong,Yi Wang,Weikang Yu,Adam J Stewart,Jie Zhao,Nils Lehmann,Thomas Dujardin,Zhenghang Yuan,Pedram Ghamisi,Xiao Xiang Zhu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code and weights: [this https URL](https://github.com/xiong-zhitong/GeoLB-SigLIP)
280. 【2503.06310】Text2Story: Advancing Video Storytelling with Text Guidance
Link: https://arxiv.org/abs/2503.06310
Authors: Taewon Kang,Divya Kothandaraman,Ming C. Lin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 6 figures
281. 【2503.06307】ACAM-KD: Adaptive and Cooperative Attention Masking for Knowledge Distillation
Link: https://arxiv.org/abs/2503.06307
Authors: Qizhen Lan,Qing Tian
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 10 tables, 3 figures
282. 【2503.06287】Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding
Link: https://arxiv.org/abs/2503.06287
Authors: Seil Kang,Jinyeong Kim,Junhyeok Kim,Seong Jae Hwang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
283. 【2503.06282】From Dataset to Real-world: General 3D Object Detection via Generalized Cross-domain Few-shot Learning
Link: https://arxiv.org/abs/2503.06282
Authors: Shuangzhi Li,Junlong Shen,Lei Ma,Xingyu Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
284. 【2503.06277】STiL: Semi-supervised Tabular-Image Learning for Comprehensive Task-Relevant Information Exploration in Multimodal Classification
Link: https://arxiv.org/abs/2503.06277
Authors: Siyi Du,Xinzhe Luo,Declan P. O'Regan,Chen Qin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages (including 5 pages of supplementary materials), accepted by CVPR 2025
285. 【2503.06276】Exploring Adversarial Transferability between Kolmogorov-arnold Networks
Link: https://arxiv.org/abs/2503.06276
Authors: Songping Wang,Xinquan Yue,Yueming Lyu,Caifeng Shan
Categories: Computer Vision and Pattern Recognition (cs.CV)
286. 【2503.06273】Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations
Link: https://arxiv.org/abs/2503.06273
Authors: Jeong Hun Yeo,Minsu Kim,Chae Won Kim,Stavros Petridis,Yong Man Ro
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
287. 【2503.06271】SplatTalk: 3D VQA with Gaussian Splatting
Link: https://arxiv.org/abs/2503.06271
Authors: Anh Thai,Songyou Peng,Kyle Genova,Leonidas Guibas,Thomas Funkhouser
Categories: Computer Vision and Pattern Recognition (cs.CV)
288. 【2503.06268】Get In Video: Add Anything You Want to the Video
Link: https://arxiv.org/abs/2503.06268
Authors: Shaobin Zhuang,Zhipeng Huang,Binxin Yang,Ying Zhang,Fangyikang Wang,Canmiao Fu,Chong Sun,Zheng-Jun Zha,Chen Li,Yali Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: [this https URL](https://zhuangshaobin.github.io/GetInVideo-project/)
289. 【2503.06261】Segment Anything, Even Occluded
Link: https://arxiv.org/abs/2503.06261
Authors: Wei-En Tai,Yu-Lin Shih,Cheng Sun,Yu-Chiang Frank Wang,Hwann-Tzong Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
290. 【2503.06260】From Captions to Rewards (CAREVL): Leveraging Large Language Model Experts for Enhanced Reward Modeling in Large Vision-Language Models
Link: https://arxiv.org/abs/2503.06260
Authors: Muzhi Dai,Jiashuo Sun,Zhiyuan Zhao,Shixuan Liu,Rui Li,Junyu Gao,Xuelong Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
291. 【2503.06252】Can Atomic Step Decomposition Enhance the Self-structured Reasoning of Multimodal Large Models?
Link: https://arxiv.org/abs/2503.06252
Authors: Kun Xiang,Zhili Liu,Zihao Jiang,Yunshuang Nie,Kaixin Cai,Yiyang Yin,Runhui Huang,Haoxiang Fan,Hanhui Li,Weiran Huang,Yihan Zeng,Yu-Jie Yuan,Jianhua Han,Lanqing Hong,Hang Xu,Xiaodan Liang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
292. 【2503.06237】Rethinking Lanes and Points in Complex Scenarios for Monocular 3D Lane Detection
Link: https://arxiv.org/abs/2503.06237
Authors: Yifan Chang,Junjie Huang,Xiaofeng Wang,Yun Ye,Zhujin Liang,Yi Shan,Dalong Du,Xingang Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR2025
293. 【2503.06236】Dynamically evolving segment anything model with continuous learning for medical image segmentation
Link: https://arxiv.org/abs/2503.06236
Authors: Zhaori Liu,Mengyang Li,Hu Han,Enli Zhang,Shiguang Shan,Zhiming Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV)
294. 【2503.06235】StreamGS: Online Generalizable Gaussian Splatting Reconstruction for Unposed Image Streams
Link: https://arxiv.org/abs/2503.06235
Authors: Yang LI,Jinglu Wang,Lei Chu,Xiao Li,Shiu-hong Kao,Ying-Cong Chen,Yan Lu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages
295. 【2503.06232】Integrating Chain-of-Thought for Multimodal Alignment: A Study on 3D Vision-Language Learning
Link: https://arxiv.org/abs/2503.06232
Authors: Yanjun Chen,Yirong Sun,Xinghao Chen,Jian Wang,Xiaoyu Shen,Wenjie Li,Wei Zhang
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: proven effective, effective in natural, remains underexplored, reasoning, CoT
Abstract:Chain-of-Thought (CoT) reasoning has proven effective in natural language tasks but remains underexplored in multimodal alignment. This study investigates its integration into 3D vision-language learning by embedding structured reasoning into alignment training. We introduce the 3D-CoT Benchmark, a dataset with hierarchical CoT annotations covering shape recognition, functional inference, and causal reasoning. Through controlled experiments, we compare CoT-structured and standard textual annotations across large reasoning models (LRMs) and large language models (LLMs). Our evaluation employs a dual-layer framework assessing both intermediate reasoning and final inference quality. Extensive experiments demonstrate that CoT significantly improves 3D semantic grounding, with LRMs leveraging CoT more effectively than LLMs. Furthermore, we highlight that annotation structure influences performance: explicit reasoning markers aid LLMs, while unmarked CoT better aligns with LRM inference patterns. Our analyses suggest that CoT is crucial for enhancing multimodal reasoning, with implications beyond 3D tasks.
296. 【2503.06223】Reinforced Diffuser for Red Teaming Large Vision-Language Models
Link: https://arxiv.org/abs/2503.06223
Authors: Ruofan Wang,Xiang Zheng,Xiaosen Wang,Cong Wang,Xingjun Ma
Categories: Computer Vision and Pattern Recognition (cs.CV)
297. 【2503.06222】Vision-based 3D Semantic Scene Completion via Capture Dynamic Representations
Link: https://arxiv.org/abs/2503.06222
Authors: Meng Wang,Fan Wu,Yunchuan Qin,Ruihui Li,Zhuo Tang,Kenli Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
298. 【2503.06220】StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition
Link: https://arxiv.org/abs/2503.06220
Authors: Xin Ding,Hao Wu,Yifan Yang,Shiqi Jiang,Donglin Bai,Zhibo Chen,Ting Cao
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
299. 【2503.06219】VLScene: Vision-Language Guidance Distillation for Camera-Based 3D Semantic Scene Completion
Link: https://arxiv.org/abs/2503.06219
Authors: Meng Wang,Huilong Pi,Ruihui Li,Yunchuan Qin,Zhuo Tang,Kenli Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by AAAI-2025 (Oral)
300. 【2503.06201】Explainable Synthetic Image Detection through Diffusion Timestep Ensembling
Link: https://arxiv.org/abs/2503.06201
Authors: Yixin Wu,Feiran Zhang,Tianyuan Shi,Ruicheng Yin,Zhenghua Wang,Zhenliang Gan,Xiaohua Wang,Changze Lv,Xiaoqing Zheng,Xuanjing Huang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: posing significant security, significant security risks, Recent advances, deceptively real images, posing significant
Comments: 13 pages, 5 figures
Abstract:Recent advances in diffusion models have enabled the creation of deceptively real images, posing significant security risks when misused. In this study, we reveal that natural and synthetic images exhibit distinct differences in the high-frequency domains of their Fourier power spectra after undergoing iterative noise perturbations through an inverse multi-step denoising process, suggesting that such noise can provide additional discriminative information for identifying synthetic images. Based on this observation, we propose a novel detection method that amplifies these differences by progressively adding noise to the original images across multiple timesteps, and train an ensemble of classifiers on these noised images. To enhance human comprehension, we introduce an explanation generation and refinement module to identify flaws located in AI-generated images. Additionally, we construct two new datasets, GenHard and GenExplain, derived from the GenImage benchmark, providing detection samples of greater difficulty and high-quality rationales for fake images. Extensive experiments show that our method achieves state-of-the-art performance with 98.91% and 95.89% detection accuracy on regular and harder samples, improving on baselines by at least 2.51% and 3.46%, respectively. Furthermore, our method also generalizes effectively to images generated by other diffusion models. Our code and datasets will be made publicly available.
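The quantity the abstract above refers to, high-frequency content of the Fourier power spectrum under added noise, can be measured with a short sketch. This is only an illustration of the measurement, not the authors' detector: the cutoff value, the Gaussian noise model, and the names `high_freq_ratio` and `add_noise` are assumptions.

```python
import numpy as np

def high_freq_ratio(image, cutoff=0.25):
    """Share of spectral power above `cutoff` (radial frequency, cycles/pixel).

    Expects a 2-D grayscale array. The cutoff of 0.25 is an arbitrary
    illustrative choice.
    """
    img = np.asarray(image, dtype=float)
    img = img - img.mean()  # remove the DC component
    power = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2
    h, w = img.shape
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    total = power.sum()
    return float(power[radius > cutoff].sum() / total) if total > 0 else 0.0

def add_noise(image, sigma, rng):
    """Add Gaussian noise (a stand-in for one diffusion noising step)."""
    return np.asarray(image, dtype=float) + rng.normal(0.0, sigma, size=np.shape(image))

# A smooth, low-frequency test image and a noised copy of it.
rng = np.random.default_rng(0)
row = np.sin(2 * np.pi * 2 * np.arange(64) / 64)
smooth = np.ones((64, 1)) * row[None, :]
noisy = add_noise(smooth, sigma=1.0, rng=rng)
print(high_freq_ratio(smooth), high_freq_ratio(noisy))
```

A detector in the spirit of the abstract would compute features like this at several noise levels and feed them, or the noised images themselves, to an ensemble of classifiers.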
301. 【2503.06200】Removing Multiple Hybrid Adverse Weather in Video via a Unified Model
Link: https://arxiv.org/abs/2503.06200
Authors: Yecong Wan,Mingwen Shao,Yuanshuo Cheng,Jun Shu,Shuigen Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: conditions typically suffer, weather, typically suffer, suffer from uncertain, degradation distributions
Abstract:Videos captured under real-world adverse weather conditions typically suffer from uncertain hybrid weather artifacts with heterogeneous degradation distributions. However, existing algorithms only excel at specific single degradation distributions due to limited adaptation capacity and have to handle different weather degradations with separately trained models, and thus may fail in real-world stochastic weather scenarios. Moreover, model training is infeasible due to the lack of paired video data that characterizes the coexistence of multiple weather conditions. To address these issues, we propose a novel unified model, dubbed UniWRV, to remove multiple heterogeneous video weather degradations in an all-in-one fashion. Specifically, to tackle degenerate spatial feature heterogeneity, we propose a tailored weather prior guided module that queries exclusive priors for different instances as prompts to steer spatial feature characterization. To tackle degenerate temporal feature heterogeneity, we propose a dynamic routing aggregation module that can automatically select optimal fusion paths for different instances to dynamically integrate temporal features. Additionally, we construct a new synthetic video dataset, termed HWVideo, for learning and benchmarking multiple hybrid adverse weather removal, which contains 15 hybrid weather conditions with a total of 1500 adverse-weather/clean paired video clips. Real-world hybrid weather videos are also collected for evaluating model generalizability. Comprehensive experiments demonstrate that our UniWRV exhibits robust and superior adaptation capability in multiple heterogeneous degradations learning scenarios, including various generic video restoration tasks beyond weather removal.
302. 【2503.06196】NeuroADDA: Active Discriminative Domain Adaptation in Connectomic
Link: https://arxiv.org/abs/2503.06196
Authors: Shashata Sawmya,Thomas L. Athey,Gwyneth Liu,Nir Shavit
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 3 figures, 3 tables
303. 【2503.06187】MSConv: Multiplicative and Subtractive Convolution for Face Recognition
Link: https://arxiv.org/abs/2503.06187
Authors: Si Zhou,Yain-Whar Si,Xiaochen Yuan,Xiaofan Li,Xiaoxiang Liu,Xinyuan Zhang,Cong Lin,Xueyuan Gong
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
304. 【2503.06186】PTDiffusion: Free Lunch for Generating Optical Illusion Hidden Pictures with Phase-Transferred Diffusion Model
Link: https://arxiv.org/abs/2503.06186
Authors: Xiang Gao,Shuai Yang,Jiaying Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025)
305. 【2503.06182】FORESCENE: FOREcasting human activity via latent SCENE graphs diffusion
Link: https://arxiv.org/abs/2503.06182
Authors: Antonio Alliegro,Francesca Pistilli,Tatiana Tommasi,Giuseppe Averta
Categories: Computer Vision and Pattern Recognition (cs.CV)
306. 【2503.06179】ForestSplats: Deformable transient field for Gaussian Splatting in the Wild
Link: https://arxiv.org/abs/2503.06179
Authors: Wongi Park,Myeongseok Nam,Siwon Kim,Sangwoo Jo,Soomok Lee
Categories: Computer Vision and Pattern Recognition (cs.CV)
307. 【2503.06170】Object-Centric World Model for Language-Guided Manipulation
Link: https://arxiv.org/abs/2503.06170
Authors: Youngjoon Jeong,Junha Chun,Soonwoo Cha,Taesup Kim
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords: driving and robotics, plan in domains, autonomous driving, language instructions, world model
Abstract:A world model is essential for an agent to predict the future and plan in domains such as autonomous driving and robotics. To achieve this, recent advancements have focused on video generation, which has gained significant attention due to the impressive success of diffusion models. However, these models require substantial computational resources. To address these challenges, we propose a world model leveraging object-centric representation space using slot attention, guided by language instructions. Our model perceives the current state as an object-centric representation and predicts future states in this representation space conditioned on natural language instructions. This approach results in a more compact and computationally efficient model compared to diffusion-based generative alternatives. Furthermore, it flexibly predicts future states based on language instructions, and offers a significant advantage in manipulation tasks where object recognition is crucial. In this paper, we demonstrate that our latent predictive world model surpasses generative world models in visuo-linguo-motor control tasks, achieving superior sample and computation efficiency. We also investigate the generalization performance of the proposed method and explore various strategies for predicting actions using object-centric representations.
308. 【2503.06169】Treble Counterfactual VLMs: A Causal Approach to Hallucination
Link: https://arxiv.org/abs/2503.06169
Authors: Li Li,Jiashu Qu,Yuxiao Zhou,Yuehan Qin,Tiankai Yang,Yue Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
309. 【2503.06163】VACT: A Video Automatic Causal Testing System and a Benchmark
Link: https://arxiv.org/abs/2503.06163
Authors: Haotong Yang, Qingyuan Zheng, Yunjian Gao, Yongkun Yang, Yangbo He, Zhouchen Lin, Muhan Zhang
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)
Keywords:
Comments:
310. 【2503.06161】Feature-EndoGaussian: Feature Distilled Gaussian Splatting in Surgical Deformable Scene Reconstruction
Link: https://arxiv.org/abs/2503.06161
Authors: Kai Li, Junhao Wang, William Han, Ding Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords:
Comments: 14 pages, 5 figures
311. 【2503.06157】UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces
Link: https://arxiv.org/abs/2503.06157
Authors: Baining Zhao, Jianjie Fang, Zichao Dai, Ziyou Wang, Jirong Zha, Weichen Zhang, Chen Gao, Yue Wang, Jinqiang Cui, Xinlei Chen, Yong Li
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords:
Comments: 22 pages
312. 【2503.06154】SRM-Hair: Single Image Head Mesh Reconstruction via 3D Morphable Hair
Link: https://arxiv.org/abs/2503.06154
Authors: Zidu Wang, Jiankuo Zhao, Miao Xu, Xiangyu Zhu, Zhen Lei
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments: Under review
313. 【2503.06151】BioMoDiffuse: Physics-Guided Biomechanical Diffusion for Controllable and Authentic Human Motion Synthesis
Link: https://arxiv.org/abs/2503.06151
Authors: Zixi Kang, Xinghan Wang, Yadong Mu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
314. 【2503.06146】OpenRSD: Towards Open-prompts for Object Detection in Remote Sensing Images
Link: https://arxiv.org/abs/2503.06146
Authors: Ziyue Huang, Yongchao Feng, Shuai Yang, Ziqi Liu, Qingjie Liu, Yunhong Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments: 11 pages, 4 figures
315. 【2503.06142】VLForgery Face Triad: Detection, Localization and Attribution via Multimodal Large Language Models
Link: https://arxiv.org/abs/2503.06142
Authors: Xinan He, Yue Zhou, Bing Fan, Bin Li, Guopu Zhu, Feng Ding
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
316. 【2503.06141】Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model
Link: https://arxiv.org/abs/2503.06141
Authors: Mingxing Li, Rui Wang, Lei Sun, Yancheng Bai, Xiangxiang Chu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
317. 【2503.06140】Boosting the Local Invariance for Better Adversarial Transferability
Link: https://arxiv.org/abs/2503.06140
Authors: Bohan Liu, Xiaosen Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: directly targeting victim, targeting victim models, pose a significant, significant threat, threat to real-world
Comments:
Abstract:Transfer-based attacks pose a significant threat to real-world applications by directly targeting victim models with adversarial examples generated on surrogate models. While numerous approaches have been proposed to enhance adversarial transferability, existing works often overlook the intrinsic relationship between adversarial perturbations and input images. In this work, we find that adversarial perturbation often exhibits poor translation invariance for a given clean image and model, which is attributed to local invariance. Through empirical analysis, we demonstrate that there is a positive correlation between the local invariance of adversarial perturbations w.r.t. the input image and their transferability across different models. Based on this finding, we propose a general adversarial transferability boosting technique called Local Invariance Boosting approach (LI-Boost). Extensive experiments on the standard ImageNet dataset demonstrate that LI-Boost could significantly boost various types of transfer-based attacks (e.g., gradient-based, input transformation-based, model-related, advanced objective function, ensemble, etc.) on CNNs, ViTs, and defense mechanisms. Our approach presents a promising direction for future research in improving adversarial transferability across different models.
318. 【2503.06136】GSV3D: Gaussian Splatting-based Geometric Distillation with Stable Video Diffusion for Single-Image 3D Object Generation
Link: https://arxiv.org/abs/2503.06136
Authors: Ye Tao, Jiawei Zhang, Yahao Shi, Dongqing Zou, Bin Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords:
Comments:
319. 【2503.06134】X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation
Link: https://arxiv.org/abs/2503.06134
Authors: Jian Ma, Qirong Peng, Xu Guo, Chen Chen, Haonan Lu, Zhenyu Yang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments: https://github.com/OPPO-Mente-Lab/X2I
320. 【2503.06132】USP: Unified Self-Supervised Pretraining for Image Generation and Understanding
Link: https://arxiv.org/abs/2503.06132
Authors: Xiangxiang Chu, Renda Li, Yong Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
321. 【2503.06129】Viewport-Unaware Blind Omnidirectional Image Quality Assessment: A Flexible and Effective Paradigm
Link: https://arxiv.org/abs/2503.06129
Authors: Jiebin Yan, Kangcheng Wu, Junjie Chen, Ziwen Tan, Yuming Fang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
322. 【2503.06118】SecureGS: Boosting the Security and Fidelity of 3D Gaussian Splatting Steganography
Link: https://arxiv.org/abs/2503.06118
Authors: Xuanyu Zhang, Jiarui Meng, Zhipei Xu, Shuzhou Yang, Yanmin Wu, Ronggang Wang, Jian Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments: Accepted by ICLR 2025
323. 【2503.06117】NeuraLoc: Visual Localization in Neural Implicit Map with Dual Complementary Features
Link: https://arxiv.org/abs/2503.06117
Authors: Hongjia Zhai, Boming Zhao, Hai Li, Xiaokun Pan, Yijia He, Zhaopeng Cui, Hujun Bao, Guofeng Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments: ICRA 2025
324. 【2503.06107】Feature Fusion Attention Network with CycleGAN for Image Dehazing, De-Snowing and De-Raining
Link: https://arxiv.org/abs/2503.06107
Authors: Akshat Jain
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords:
Comments:
325. 【2503.06106】Vision-aware Multimodal Prompt Tuning for Uploadable Multi-source Few-shot Domain Adaptation
Link: https://arxiv.org/abs/2503.06106
Authors: Kuanghong Liu, Jin Wang, Kangjian He, Dan Xu, Xuejie Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments: Accepted by AAAI 2025
326. 【2503.06104】Handwritten Digit Recognition: An Ensemble-Based Approach for Superior Performance
Link: https://arxiv.org/abs/2503.06104
Authors: Syed Sajid Ullah, Li Gang, Mudassir Riaz, Ahsan Ashfaq, Salman Khan, Sajawal Khan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: postal code reading, Convolutional Neural Networks, combines Convolutional Neural, computer vision, document digitization
Comments: 11 pages, 6 figures
Abstract:Handwritten digit recognition remains a fundamental challenge in computer vision, with applications ranging from postal code reading to document digitization. This paper presents an ensemble-based approach that combines Convolutional Neural Networks (CNNs) with traditional machine learning techniques to improve recognition accuracy and robustness. We evaluate our method on the MNIST dataset, comprising 70,000 handwritten digit images. Our hybrid model, which uses CNNs for feature extraction and Support Vector Machines (SVMs) for classification, achieves an accuracy of 99.30%. We also explore the effectiveness of data augmentation and various ensemble techniques in enhancing model performance. Our results demonstrate that this approach not only achieves high accuracy but also shows improved generalization across diverse handwriting styles. The findings contribute to the development of more reliable handwritten digit recognition systems and highlight the potential of combining deep learning with traditional machine learning methods in pattern recognition tasks.
327. 【2503.06100】Patch-Depth Fusion: Dichotomous Image Segmentation via Fine-Grained Patch Strategy and Depth Integrity-Prior
Link: https://arxiv.org/abs/2503.06100
Authors: Xianjie Liu, Keren Fu, Qijun Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
328. 【2503.06094】PointDiffuse: A Dual-Conditional Diffusion Model for Enhanced Point Cloud Semantic Segmentation
Link: https://arxiv.org/abs/2503.06094
Authors: Yong He, Hongshan Yu, Mingtao Feng, Tongjia Chen, Zechuan Li, Anwaar Ulhaq, Saeed Anwar, Ajmal Saeed Mian
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments: 8 pages, 3 figures, 7 tables
329. 【2503.06092】ZO-DARTS++: An Efficient and Size-Variable Zeroth-Order Neural Architecture Search Algorithm
Link: https://arxiv.org/abs/2503.06092
Authors: Lunchen Xie, Eugenio Lomurno, Matteo Gambella, Danilo Ardagna, Manuel Roveri, Matteo Matteucci, Qingjiang Shi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords:
Comments: 14 pages, 8 figures
330. 【2503.06089】Fish2Mesh Transformer: 3D Human Mesh Recovery from Egocentric Vision
Link: https://arxiv.org/abs/2503.06089
Authors: David C. Jeong, Aditya Puranik, James Vong, Vrushabh Abhijit Deogirikar, Ryan Fell, Julianna Dietrich, Maria Kyrarini, Christopher Kitts
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Keywords:
Comments:
331. 【2503.06084】Exploring Interpretability for Visual Prompt Tuning with Hierarchical Concepts
Link: https://arxiv.org/abs/2503.06084
Authors: Yubin Wang, Xinyang Jiang, De Cheng, Xiangqian Zhao, Zilong Wang, Dongsheng Li, Cairong Zhao
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments: 10 pages, 9 figures
332. 【2503.06073】GEM: Empowering MLLM for Grounded ECG Understanding with Time Series and Images
Link: https://arxiv.org/abs/2503.06073
Authors: Xiang Lan, Feng Wu, Kai He, Qinghao Zhao, Shenda Hong, Mengling Feng
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
333. 【2503.06071】TransParking: A Dual-Decoder Transformer Framework with Soft Localization for End-to-End Automatic Parking
Link: https://arxiv.org/abs/2503.06071
Authors: Hangyu Du, Chee-Meng Chew
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
334. 【2503.06064】A Novel Trustworthy Video Summarization Algorithm Through a Mixture of LoRA Experts
Link: https://arxiv.org/abs/2503.06064
Authors: Wenzhuo Du, Gerun Wang, Guancheng Chen, Hang Zhao, Xin Li, Jian Gao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords:
Comments:
335. 【2503.06063】Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices
Link: https://arxiv.org/abs/2503.06063
Authors: Junyan Lin, Haoran Chen, Yue Fan, Yingqi Fan, Xin Jin, Hui Su, Jinlan Fu, Xiaoyu Shen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments: Accepted by CVPR 2025
336. 【2503.06060】STAR: A Foundation Model-driven Framework for Robust Task Planning and Failure Recovery in Robotic Systems
Link: https://arxiv.org/abs/2503.06060
Authors: Md Sadman Sakib, Yu Sun
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
337. 【2503.06056】Pathological Prior-Guided Multiple Instance Learning For Mitigating Catastrophic Forgetting in Breast Cancer Whole Slide Image Classification
Link: https://arxiv.org/abs/2503.06056
Authors: Weixi Zheng, Aoling Huang, Jingping Yuan, Haoyu Zhao, Zhou Zhao, Yongchao Xu, Thierry Géraud
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments: ICASSP 2025 (Oral)
338. 【2503.06053】DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation
Link: https://arxiv.org/abs/2503.06053
Authors: Runze Zhang, Guoguang Du, Xiaochuan Li, Qi Jia, Liang Jin, Lu Liu, Jingjing Wang, Cong Xu, Zhenhua Guo, Yaqian Zhao, Xiaoli Gong, Rengang Li, Baoyu Fan
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords:
Comments:
339. 【2503.06042】Improving SAM for Camouflaged Object Detection via Dual Stream Adapters
Link: https://arxiv.org/abs/2503.06042
Authors: Jiaming Liu, Linghe Kong, Guihai Chen
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
340. 【2503.06038】A Label-Free High-Precision Residual Moveout Picking Method for Travel Time Tomography based on Deep Learning
Link: https://arxiv.org/abs/2503.06038
Authors: Hongtao Wang, Jiandong Liang, Lei Wang, Shuaizhe Liang, Jinping Zhu, Chunxia Zhang, Jiangshe Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
341. 【2503.06030】Towards Universal Text-driven CT Image Segmentation
Link: https://arxiv.org/abs/2503.06030
Authors: Yuheng Li, Yuxiang Lai, Maria Thor, Deborah Marshall, Zachary Buchwald, David S. Yu, Xiaofeng Yang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords:
Comments:
342. 【2503.06026】Zero-Shot Peg Insertion: Identifying Mating Holes and Estimating SE(2) Poses with Vision-Language Models
Link: https://arxiv.org/abs/2503.06026
Authors: Masaru Yajima, Kei Ota, Asako Kanezaki, Rei Kawakami
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments: Under submission
343. 【2503.06019】GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices
Link: https://arxiv.org/abs/2503.06019
Authors: Xudong Lu, Yinghao Chen, Renshou Wu, Haohao Gao, Xi Chen, Xue Yang, Xiangyu Zhao, Aojun Zhou, Fangyuan Li, Yafei Wen, Xiaoxin Chen, Shuai Ren, Hongsheng Li
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments: 14 pages
344. 【2503.06014】Towards Ambiguity-Free Spatial Foundation Model: Rethinking and Decoupling Depth Ambiguity
Link: https://arxiv.org/abs/2503.06014
Authors: Xiaohao Xu, Feng Xue, Xiang Li, Haowei Li, Shusheng Yang, Tianyi Zhang, Matthew Johnson-Roberson, Xiaonan Huang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Keywords:
Comments: 32 pages, 31 figures, GitHub repo: https://github.com/Xiaohao-Xu/Ambiguity-in-Space
345. 【2503.06012】End-to-End HOI Reconstruction Transformer with Graph-based Encoding
Link: https://arxiv.org/abs/2503.06012
Authors: Zhenrong Wang, Qi Zheng, Sihan Ma, Maosheng Ye, Yibing Zhan, Dongjiang Li
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
346. 【2503.06003】Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models
Link: https://arxiv.org/abs/2503.06003
Authors: Md Azim Khan, Aryya Gangopadhyay, Jianwu Wang, Robert F. Erbacher
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments: 8 pages, 4 figures
347. 【2503.05978】MagicInfinite: Generating Infinite Talking Videos with Your Words and Voice
Link: https://arxiv.org/abs/2503.05978
Authors: Hongwei Yi, Tian Ye, Shitong Shao, Xuancheng Yang, Jiantong Zhao, Hanzhong Guo, Terrance Wang, Qingyu Yin, Zeke Xie, Lei Zhu, Wei Li, Michael Lingelbach, Daquan Zhou
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments: MagicInfinite is publicly accessible at https://www.hedra.com/. More examples are at https://magicinfinite.github.io/
348. 【2503.05977】Is Your Video Language Model a Reliable Judge?
Link: https://arxiv.org/abs/2503.05977
Authors: Ming Liu, Wensheng Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords:
Comments:
349. 【2503.05962】OSCAR: Object Status and Contextual Awareness for Recipes to Support Non-Visual Cooking
Link: https://arxiv.org/abs/2503.05962
Authors: Franklin Mingzhe Li, Kaitlyn Ng, Bin Zhu, Patrick Carrington
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments: CHI 2025 Late Breaking Work
350. 【2503.05949】Bayesian Fields: Task-driven Open-Set Semantic Gaussian Splatting
Link: https://arxiv.org/abs/2503.05949
Authors: Dominic Maggio, Luca Carlone
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
351. 【2503.05936】CASP: Compression of Large Multimodal Models Based on Attention Sparsity
Link: https://arxiv.org/abs/2503.05936
Authors: Mohsen Gholami, Mohammad Akbari, Kevin Cannons, Yong Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
352. 【2503.05911】Generalizable Image Repair for Robust Visual Autonomous Racing
Link: https://arxiv.org/abs/2503.05911
Authors: Carson Sobolewski, Zhenjiang Mao, Kshitij Vejre, Ivan Ruchkin
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments: 8 pages, 4 figures, Submitted to 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025)
353. 【2503.05850】Encrypted Vector Similarity Computations Using Partially Homomorphic Encryption: Applications and Performance Analysis
Link: https://arxiv.org/abs/2503.05850
Authors: Sefik Serengil, Alper Ozpinar
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords:
Comments:
354. 【2503.05839】Enhancing AUTOSAR-Based Firmware Over-the-Air Updates in the Automotive Industry with a Practical Implementation on a Steering System
Link: https://arxiv.org/abs/2503.05839
Authors: Mostafa Ahmed Mostafa Ahmed, Mohamed Khaled Mohamed Elsayed, Radwa Waheed Ezzat Abdelmohsen
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Keywords:
Comments: Bachelor's thesis
355. 【2503.05837】Randomized based restricted kernel machine for hyperspectral image classification
Link: https://arxiv.org/abs/2503.05837
Authors: A. Quadir, M. Tanveer
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
356. 【2503.07491】NeAS: 3D Reconstruction from X-ray Images using Neural Attenuation Surface
Link: https://arxiv.org/abs/2503.07491
Authors: Chengrui Zhu, Ryoichi Ishikawa, Masataka Kagesawa, Tomohisa Yuzawa, Toru Watsuji, Takeshi Oishi
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
357. 【2503.07369】Skelite: Compact Neural Networks for Efficient Iterative Skeletonization
Link: https://arxiv.org/abs/2503.07369
Authors: Luis D. Reyes Vargas, Martin J. Menten, Johannes C. Paetzold, Nassir Navab, Mohammad Farid Azampour
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
358. 【2503.07248】AI-Driven Automated Tool for Abdominal CT Body Composition Analysis in Gastrointestinal Cancer Management
Link: https://arxiv.org/abs/2503.07248
Authors: Xinyu Nan, Meng He, Zifan Chen, Bin Dong, Lei Tang, Li Zhang
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
359. 【2503.07177】The 4D Human Embryonic Brain Atlas: spatiotemporal atlas generation for rapid anatomical changes using first-trimester ultrasound from the Rotterdam Periconceptional Cohort
Link: https://arxiv.org/abs/2503.07177
Authors: Wietske A.P. Bastiaansen, Melek Rousian, Anton H.J. Koning, Wiro J. Niessen, Bernadette S. de Bakker, Régine P.M. Steegers-Theunissen, Stefan Klein
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Keywords:
Comments:
360. 【2503.07104】Global Context Is All You Need for Parallel Efficient Tractography Parcellation
Link: https://arxiv.org/abs/2503.07104
Authors: Valentin von Bornhaupt, Johannes Grün, Justus Bisten, Tobias Bauer, Theodor Rüber, Thomas Schultz
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Keywords:
Comments: 8 pages, 2 pages references, 3 figures, 2 tables
361. 【2503.07097】A Comprehensive Survey on Magnetic Resonance Image Reconstruction
Link: https://arxiv.org/abs/2503.07097
Authors: Xiaoyan Kui, Zijie Fan, Zexin Ji, Qinsong Li, Chengtao Liu, Weixin Si, Beiji Zou
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
362. 【2503.06945】Dynamic Cross-Modal Feature Interaction Network for Hyperspectral and LiDAR Data Classification
Link: https://arxiv.org/abs/2503.06945
Authors: Junyan Lin, Feng Gao, Lin Qi, Junyu Dong, Qian Du, Xinbo Gao
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments: Accepted by IEEE TGRS 2025
363. 【2503.06919】CAFusion: Controllable Anatomical Synthesis of Perirectal Lymph Nodes via SDF-guided Diffusion
Link: https://arxiv.org/abs/2503.06919
Authors: Weidong Guo, Hantao Zhang, Shouhong Wan, Bingbing Zou, Wanqin Wang, Chenyang Qiu, Peiquan Jin
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
364. 【2503.06828】Towards a Multimodal MRI-Based Foundation Model for Multi-Level Feature Exploration in Segmentation, Molecular Subtyping, and Grading of Glioma
Link: https://arxiv.org/abs/2503.06828
Authors: Somayeh Farahani, Marjaneh Hejazi, Antonio Di Ieva, Emad Fatemizadeh, Sidong Liu
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
365. 【2503.06827】Two-stage Deep Denoising with Self-guided Noise Attention for Multimodal Medical Images
Link: https://arxiv.org/abs/2503.06827
Authors: S M A Sharif, Rizwan Ali Naqvi, Woong-Kee Loh
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments: IEEE Transactions on Radiation and Plasma Medical Sciences (2024)
366. 【2503.06816】Semi-Supervised Medical Image Segmentation via Knowledge Mining from Large Models
Link: https://arxiv.org/abs/2503.06816
Authors: Yuchen Mao, Hongwei Li, Yinyi Lai, Giorgos Papanastasiou, Peng Qi, Yunjie Yang, Chengjia Wang
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments: 18 pages, 2 figures
367. 【2503.06809】Interactive Tumor Progression Modeling via Sketch-Based Image Editing
Link: https://arxiv.org/abs/2503.06809
Authors: Gexin Huang, Ruinan Jin, Yucheng Tang, Can Zhao, Tatsuya Harada, Xiaoxiao Li, Gu Lin
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments: 9 pages, 4 figures
368. 【2503.06743】X-GAN: A Generative AI-Powered Unsupervised Model for High-Precision Segmentation of Retinal Main Vessels toward Early Detection of Glaucoma
Link: https://arxiv.org/abs/2503.06743
Authors: Cheng Huang, Weizheng Xie, Tsengdar J. Lee, Jui-Kai Wang, Karanjit Kooner, Jia Zhang
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments: 11 pages, 8 figures
369. 【2503.06686】ImplicitCell: Resolution Cell Modeling of Joint Implicit Volume Reconstruction and Pose Refinement in Freehand 3D Ultrasound
Link: https://arxiv.org/abs/2503.06686
Authors: Sheng Song, Yiting Chen, Duo Xu, Songhan Ge, Yunqian Huang, Junni Shi, Man Chen, Hongbo Chen, Rui Zheng
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
370. 【2503.06563】LSA: Latent Style Augmentation Towards Stain-Agnostic Cervical Cancer Screening
Link: https://arxiv.org/abs/2503.06563
Authors: Jiangdong Cai, Haotian Jiang, Zhenrong Shen, Yonghao Li, Honglin Xiong, Lichi Zhang, Qian Wang
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
371. 【2503.06382】X-LRM: X-ray Large Reconstruction Model for Extremely Sparse-View Computed Tomography Recovery in One Second
Link: https://arxiv.org/abs/2503.06382
Authors: Guofeng Zhang, Ruyi Zha, Hao He, Yixun Liang, Alan Yuille, Hongdong Li, Yuanhao Cai
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments: A large reconstruction model and the largest dataset (16K samples) for sparse-view CT recovery
372. 【2503.06321】Enhanced Pediatric Dental Segmentation Using a Custom SegUNet with VGG19 Backbone on Panoramic Radiographs
Link: https://arxiv.org/abs/2503.06321
Authors: Md Ohiduzzaman Ovi, Maliha Sanjana, Fahad Fahad, Mahjabin Runa, Zarin Tasnim Rothy, Tanmoy Sarkar Pias, A.M. Tayeful Islam, Rumman Ahmed Prodhan
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
373. 【2503.06190】Attention on the Wires (AttWire): A Foundation Model for Detecting Devices and Catheters in X-ray Fluoroscopic Images
Link: https://arxiv.org/abs/2503.06190
Authors: YingLiang Ma, Sandra Howell, Aldo Rinaldi, Tarv Dhanjal, Kawal S. Rhode
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
374. 【2503.06125】RGB-Phase Speckle: Cross-Scene Stereo 3D Reconstruction via Wrapped Pre-Normalization
Link: https://arxiv.org/abs/2503.06125
Authors: Kai Yang, Zijian Bai, Yang Xiao, Xinyu Li, Xiaohan Shi
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments: Submitted to ICCV 2025
375. 【2503.06114】Pathology-Guided AI System for Accurate Segmentation and Diagnosis of Cervical Spondylosis
Link: https://arxiv.org/abs/2503.06114
Authors: Qi Zhang, Xiuyuan Chen, Ziyi He, Lianming Wu, Kun Wang, Jianqi Sun, Hongxing Shen
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
376. 【2503.05991】GrInAdapt: Scaling Retinal Vessel Structural Map Segmentation Through Grounding, Integrating and Adapting Multi-device, Multi-site, and Multi-modal Fundus Domains
Link: https://arxiv.org/abs/2503.05991
Authors: Zixuan Liu, Aaron Honjaya, Yuekai Xu, Yi Zhang, Hefu Pan, Xin Wang, Linda G Shapiro, Sheng Wang, Ruikang K Wang
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords:
Comments:
377. 【2503.05990】HealthiVert-GAN: A Novel Framework of Pseudo-Healthy Vertebral Image Synthesis for Interpretable Compression Fracture Grading
Link: https://arxiv.org/abs/2503.05990
Authors: Qi Zhang, Shunan Zhang, Ziqi Zhao, Kun Wang, Jun Xu, Jianqi Sun
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
378. 【2503.05974】LapLoss: Laplacian Pyramid-based Multiscale loss for Image Translation
Link: https://arxiv.org/abs/2503.05974
Authors: Krish Didwania, Ishaan Gakhar, Prakhar Arya, Sanskriti Labroo
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments: Accepted at the DeLTa Workshop, ICLR 2025
379. 【2503.05933】Beyond H&E: Unlocking Pathological Insights with Polarization via Self-supervised Learning
Link: https://arxiv.org/abs/2503.05933
Authors: Yao Du, Jiaxin Zhuang, Xiaoyu Zheng, Jing Cong, Limei Guo, Chao He, Lin Luo, Xiaomeng Li
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
380. 【2503.05916】SAS: Segment Anything Small for Ultrasound -- A Non-Generative Data Augmentation Technique for Robust Deep Learning in Ultrasound Imaging
Link: https://arxiv.org/abs/2503.05916
Authors: Danielle L. Ferreira, Ahana Gangopadhyay, Hsi-Ming Chang, Ravi Soni, Gopal Avinash
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments: 25 pages, 8 figures
381. 【2503.05843】Decadal analysis of sea surface temperature patterns, climatology, and anomalies in temperate coastal waters with Landsat-8 TIRS observations
Link: https://arxiv.org/abs/2503.05843
Authors: Yiqing Guo, Nagur Cherukuru, Eric Lehmann, Xiubin Qi, Mark Doubell, S. L. Kesav Unnithan, Ming Feng
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP); Geophysics (physics.geo-ph)
Keywords:
Comments: Submitted to GIScience Remote Sensing
382. 【2503.05802】Illuminant and light direction estimation using Wasserstein distance method
Link: https://arxiv.org/abs/2503.05802
Authors: Selcuk Yazar
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords:
Comments:
383. 【2503.03786】Self is the Best Learner: CT-free Ultra-Low-Dose PET Organ Segmentation via Collaborating Denoising and Segmentation Learning
Link: https://arxiv.org/abs/2503.03786
Authors: Zanting Ye, Xiaolong Niu, Xuanbin Wu, Wantong Lu, Lijun Lu
Categories: Tissues and Organs (q-bio.TO); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Keywords:
Comments: 8 pages, 5 figures