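This post presents the daily list of the latest papers retrieved from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.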

Statistics

306 papers were updated today, including:

  • Natural Language Processing: 43
  • Information Retrieval: 6
  • Computer Vision: 75

Natural Language Processing

1. 【2501.09012】Multimodal LLMs Can Reason about Aesthetics in Zero-Shot

Link: https://arxiv.org/abs/2501.09012

Authors: Ruixiang Jiang, Changwen Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)

Keywords: Multimodal LLMs, elicited to evaluate, Multimodal, Abstract, reasoning ability

Comments: WIP, Homepage [this https URL](https://github.com/songrise/MLLM4Art)

Abstract:We present the first study on how Multimodal LLMs' (MLLMs) reasoning ability shall be elicited to evaluate the aesthetics of artworks. To facilitate this investigation, we construct MM-StyleBench, a novel high-quality dataset for benchmarking artistic stylization. We then develop a principled method for human preference modeling and perform a systematic correlation analysis between MLLMs' responses and human preference. Our experiments reveal an inherent hallucination issue of MLLMs in art evaluation, associated with response subjectivity. ArtCoT is proposed, demonstrating that art-specific task decomposition and the use of concrete language boost MLLMs' reasoning ability for aesthetics. Our findings offer valuable insights into MLLMs for art and can benefit a wide range of downstream applications, such as style transfer and artistic image generation. Code available at this https URL.

2. 【2501.09004】Aegis2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails

Link: https://arxiv.org/abs/2501.09004

Authors: Shaona Ghosh, Prasoon Varshney, Makesh Narsimhan Sreedhar, Aishwarya Padmakumar, Traian Rebedea, Jibin Rajan Varghese, Christopher Parisien

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, increasingly widespread, concerns about content, grown in parallel

Comments: arXiv admin note: text overlap with [arXiv:2404.05993](https://arxiv.org/abs/2404.05993)

Abstract:As Large Language Models (LLMs) and generative AI become increasingly widespread, concerns about content safety have grown in parallel. Currently, there is a clear lack of high-quality, human-annotated datasets that address the full spectrum of LLM-related safety risks and are usable for commercial applications. To bridge this gap, we propose a comprehensive and adaptable taxonomy for categorizing safety risks, structured into 12 top-level hazard categories with an extension to 9 fine-grained subcategories. This taxonomy is designed to meet the diverse requirements of downstream users, offering more granular and flexible tools for managing various risk types. Using a hybrid data generation pipeline that combines human annotations with a multi-LLM "jury" system to assess the safety of responses, we obtain Aegis 2.0, a carefully curated collection of 34,248 samples of human-LLM interactions, annotated according to our proposed taxonomy. To validate its effectiveness, we demonstrate that several lightweight models, trained using parameter-efficient techniques on Aegis 2.0, achieve performance competitive with leading safety models fully fine-tuned on much larger, non-commercial datasets. In addition, we introduce a novel training blend that combines safety with topic following data. This approach enhances the adaptability of guard models, enabling them to generalize to new risk categories defined during inference. We plan to open-source Aegis 2.0 data and models to the research community to aid in the safety guardrailing of LLMs.

3. 【2501.08985】Personality Modeling for Persuasion of Misinformation using AI Agent

Link: https://arxiv.org/abs/2501.08985

Authors: Qianmin Lou, Wentao Xu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)

Keywords: individual personality traits, personality traits, platforms has highlighted, understand how individual, misinformation

Comments: none

Abstract:The proliferation of misinformation on social media platforms has highlighted the need to understand how individual personality traits influence susceptibility to and propagation of misinformation. This study employs an innovative agent-based modeling approach to investigate the relationship between personality traits and misinformation dynamics. Using six AI agents embodying different dimensions of the Big Five personality traits (Extraversion, Agreeableness, and Neuroticism), we simulated interactions across six diverse misinformation topics. The experiment, implemented through the AgentScope framework using the GLM-4-Flash model, generated 90 unique interactions, revealing complex patterns in how personality combinations affect persuasion and resistance to misinformation. Our findings demonstrate that analytical and critical personality traits enhance effectiveness in evidence-based discussions, while non-aggressive persuasion strategies show unexpected success in misinformation correction. Notably, agents with critical traits achieved a 59.4% success rate in HIV-related misinformation discussions, while those employing non-aggressive approaches maintained consistent persuasion rates above 40% across different personality combinations. The study also revealed a non-transitive pattern in persuasion effectiveness, challenging conventional assumptions about personality-based influence. These results provide crucial insights for developing personality-aware interventions in digital environments and suggest that effective misinformation countermeasures should prioritize emotional connection and trust-building over confrontational approaches. The findings contribute to both theoretical understanding of personality-misinformation dynamics and practical strategies for combating misinformation in social media contexts.

4. 【2501.08974】Learning to Extract Cross-Domain Aspects and Understanding Sentiments Using Large Language Models

Link: https://arxiv.org/abs/2501.08974

Authors: Karukriti Kaushik Ghosh, Chiranjib Sur

Categories: Computation and Language (cs.CL)

Keywords: sentiment analysis, sentiment, Aspect-based sentiment analysis, refined approach, extract and classify

Comments: none

Abstract:Aspect-based sentiment analysis (ABSA) is a refined approach to sentiment analysis that aims to extract and classify sentiments based on specific aspects or features of a product, service, or entity. Unlike traditional sentiment analysis, which assigns a general sentiment score to entire reviews or texts, ABSA focuses on breaking down the text into individual components or aspects (e.g., quality, price, service) and evaluating the sentiment towards each. This allows for a more granular level of understanding of customer opinions, enabling businesses to pinpoint specific areas of strength and improvement. The process involves several key steps, including aspect extraction, sentiment classification, and aspect-level sentiment aggregation for a review paragraph or any other form that the users have provided. ABSA has significant applications in areas such as product reviews, social media monitoring, customer feedback analysis, and market research. By leveraging techniques from natural language processing (NLP) and machine learning, ABSA facilitates the extraction of valuable insights, enabling companies to make data-driven decisions that enhance customer satisfaction and optimize offerings. As ABSA evolves, it holds the potential to greatly improve personalized customer experiences by providing a deeper understanding of sentiment across various product aspects. In this work, we have analyzed the strength of LLMs for a complete cross-domain aspect-based sentiment analysis with the aim of defining the framework for certain products and using it for other similar situations. We argue that this is possible, reaching 92% accuracy on the Aspect Based Sentiment Analysis dataset of SemEval-2015 Task 12.

5. 【2501.08946】Applying General Turn-taking Models to Conversational Human-Robot Interaction

Link: https://arxiv.org/abs/2501.08946

Authors: Gabriel Skantze, Bahar Irfan

Categories: Computation and Language (cs.CL); Robotics (cs.RO)

Keywords: current Human-Robot Interaction, Voice Activity Projection, Human-Robot Interaction, aspect of conversation, rely on simplistic

Comments: Accepted at HRI 2025 (the IEEE/ACM International Conference on Human-Robot Interaction)

Abstract:Turn-taking is a fundamental aspect of conversation, but current Human-Robot Interaction (HRI) systems often rely on simplistic, silence-based models, leading to unnatural pauses and interruptions. This paper investigates, for the first time, the application of general turn-taking models, specifically TurnGPT and Voice Activity Projection (VAP), to improve conversational dynamics in HRI. These models are trained on human-human dialogue data using self-supervised learning objectives, without requiring domain-specific fine-tuning. We propose methods for using these models in tandem to predict when a robot should begin preparing responses, take turns, and handle potential interruptions. We evaluated the proposed system in a within-subject study against a traditional baseline system, using the Furhat robot with 39 adults in a conversational setting, in combination with a large language model for autonomous response generation. The results show that participants significantly prefer the proposed system, and it significantly reduces response delays and interruptions.

6. 【2501.08925】Disentangling Exploration of Large Language Models by Optimal Exploitation

Link: https://arxiv.org/abs/2501.08925

Authors: Tim Grams, Patrick Betz, Christian Bartelt

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: open-ended problem-solving, crucial skill, skill for self-improvement, self-improvement and open-ended, Exploration

Comments: none

Abstract:Exploration is a crucial skill for self-improvement and open-ended problem-solving. However, it remains uncertain whether large language models can effectively explore the state-space. Existing evaluations predominantly focus on the trade-off between exploration and exploitation, often assessed in multi-armed bandit problems. In contrast, this work isolates exploration as the sole objective, tasking the agent with delivering information that enhances future returns. For the evaluation, we propose to decompose missing rewards into exploration and exploitation components by measuring the optimal achievable return for the states already explored. Our experiments with various LLMs reveal that most models struggle to sufficiently explore the state-space and that weak exploration is insufficient. We observe a positive correlation between model size and exploration performance, with larger models demonstrating superior capabilities. Furthermore, we show that our decomposition provides insights into differences in behaviors driven by agent instructions during prompt engineering, offering a valuable tool for refining LLM performance in exploratory tasks.
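One way to read the decomposition described in this abstract is sketched below. It is a minimal illustration under stated assumptions, not the authors' implementation: the function name `decompose_missing_reward`, the toy state-return table, and the exact definition of the two gaps are all hypothetical. The assumed idea is that the reward an agent misses, relative to the best achievable return, splits into a part lost because good states were never explored and a part lost even though the explored states allowed more.

```python
def decompose_missing_reward(optimal_return, explored_optimal_return, achieved_return):
    """Split the gap to the optimal return into two parts (assumed definition).

    exploration gap: return lost because the best states were never explored
    exploitation gap: return lost although the explored states allowed more
    """
    exploration_gap = optimal_return - explored_optimal_return
    exploitation_gap = explored_optimal_return - achieved_return
    return exploration_gap, exploitation_gap


# Toy example: best return reachable from each state (hypothetical numbers).
state_returns = {"A": 1.0, "B": 4.0, "C": 10.0}
explored = {"A", "B"}   # the agent never visited state C
achieved = 1.0          # return the agent actually obtained

optimal = max(state_returns.values())
explored_optimal = max(state_returns[s] for s in explored)

exploration_gap, exploitation_gap = decompose_missing_reward(
    optimal, explored_optimal, achieved
)
print(exploration_gap, exploitation_gap)
```

In this toy setup the two gaps always sum to the total missing reward (`optimal - achieved`), which is what makes the split usable as a diagnostic.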

7. 【2501.08913】GenAI Content Detection Task 3: Cross-Domain Machine-Generated Text Detection Challenge

Link: https://arxiv.org/abs/2501.08913

Authors: Liam Dugan, Andrew Zhu, Firoj Alam, Preslav Nakov, Marianna Apidianaki, Chris Callison-Burch

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Language Models, Large Language, Language Models, shared tasks targeting, targeting the detection

Comments: COLING 2025

Abstract:Recently there have been many shared tasks targeting the detection of generated text from Large Language Models (LLMs). However, these shared tasks tend to focus either on cases where text is limited to one particular domain or cases where text can be from many domains, some of which may not be seen during test time. In this shared task, using the newly released RAID benchmark, we aim to answer whether or not models can detect generated text from a large, yet fixed, number of domains and LLMs, all of which are seen during training. Over the course of three months, our task was attempted by 9 teams with 23 detector submissions. We find that multiple participants were able to obtain accuracies of over 99% on machine-generated text from RAID while maintaining a 5% False Positive Rate -- suggesting that detectors are able to robustly detect text from many domains and models simultaneously. We discuss potential interpretations of this result and provide directions for future research.
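The metric quoted in this abstract, accuracy on machine-generated text at a fixed 5% False Positive Rate, can be illustrated with a small sketch. This is not the RAID benchmark's evaluation code: the helper names `threshold_at_fpr` and `rate_above` and all scores are hypothetical, and the sketch assumes a detector that outputs one score per text, with higher scores meaning "more likely machine-generated". The threshold is chosen on human-written texts only, then applied to machine-generated texts.

```python
def threshold_at_fpr(human_scores, target_fpr):
    """Return a threshold that flags at most target_fpr of human-written texts.

    A text is flagged as machine-generated when its score is strictly
    greater than the returned threshold.
    """
    ranked = sorted(human_scores, reverse=True)
    # Number of human-written texts we can afford to flag by mistake.
    allowed = min(int(target_fpr * len(ranked)), len(ranked) - 1)
    # At most `allowed` human scores are strictly greater than ranked[allowed].
    return ranked[allowed]


def rate_above(scores, threshold):
    """Fraction of scores strictly greater than the threshold."""
    return sum(s > threshold for s in scores) / len(scores)


# Hypothetical detector scores (higher means "more likely machine-generated").
human_scores = list(range(0, 100, 5))                     # 20 human-written texts
machine_scores = [99, 97, 93, 91, 88, 96, 92, 60, 98, 94]  # 10 generated texts

threshold = threshold_at_fpr(human_scores, target_fpr=0.05)
false_positive_rate = rate_above(human_scores, threshold)
detection_rate = rate_above(machine_scores, threshold)
print(threshold, false_positive_rate, detection_rate)
```

Fixing the false positive rate first makes detectors comparable: each one is allowed the same share of wrongly flagged human texts, and only the share of generated text it catches varies.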

8. 【2501.08838】ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind

Link: https://arxiv.org/abs/2501.08838

Authors: Kazutoshi Shinoda, Nobukatsu Hojo, Kyosuke Nishida, Saki Mizuno, Keita Suzuki, Ryo Masumura, Hiroaki Sugiyama, Kuniko Saito

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Existing Theory, Theory of Mind, comprehensively explored, diverge from real-world, real-world scenarios

Comments: Accepted by AAAI 2025

Abstract:Existing Theory of Mind (ToM) benchmarks diverge from real-world scenarios in three aspects: 1) they assess a limited range of mental states such as beliefs, 2) false beliefs are not comprehensively explored, and 3) the diverse personality traits of characters are overlooked. To address these challenges, we introduce ToMATO, a new ToM benchmark formulated as multiple-choice QA over conversations. ToMATO is generated via LLM-LLM conversations featuring information asymmetry. By employing a prompting method that requires role-playing LLMs to verbalize their thoughts before each utterance, we capture both first- and second-order mental states across five categories: belief, intention, desire, emotion, and knowledge. These verbalized thoughts serve as answers to questions designed to assess the mental states of characters within conversations. Furthermore, the information asymmetry introduced by hiding thoughts from others induces the generation of false beliefs about various mental states. Assigning distinct personality traits to LLMs further diversifies both utterances and thoughts. ToMATO consists of 5.4k questions, 753 conversations, and 15 personality trait patterns. Our analysis shows that this dataset construction approach frequently generates false beliefs due to the information asymmetry between role-playing LLMs, and effectively reflects diverse personalities. We evaluate nine LLMs on ToMATO and find that even GPT-4o mini lags behind human performance, especially in understanding false beliefs, and lacks robustness to various personality traits.

9. 【2501.08828】MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents

Link: https://arxiv.org/abs/2501.08828

Authors: Kuicai Dong, Yujing Chang, Xin Deik Goh, Dexun Li, Ruiming Tang, Yong Liu

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multi-modal document retrieval, Multi-modal document, document retrieval, designed to identify, identify and retrieve

Comments: [this https URL](https://huggingface.co/MMDocIR)

Abstract:Multi-modal document retrieval is designed to identify and retrieve various forms of multi-modal content, such as figures, tables, charts, and layout information from extensive documents. Despite its significance, there is a notable lack of a robust benchmark to effectively evaluate the performance of systems in multi-modal document retrieval. To address this gap, this work introduces a new benchmark, named MMDocIR, encompassing two distinct tasks: page-level and layout-level retrieval. The former focuses on localizing the most relevant pages within a long document, while the latter targets the detection of specific layouts, offering finer granularity than whole-page analysis. A layout can refer to a variety of elements such as textual paragraphs, equations, figures, tables, or charts. The MMDocIR benchmark comprises a rich dataset featuring expertly annotated labels for 1,685 questions and bootstrapped labels for 173,843 questions, making it a pivotal resource for advancing multi-modal document retrieval for both training and evaluation. Through rigorous experiments, we reveal that (i) visual retrievers significantly outperform their text counterparts, (ii) the MMDocIR train set can effectively benefit the training process of multi-modal document retrieval and (iii) text retrievers leveraging VLM-text perform much better than those using OCR-text. These findings underscore the potential advantages of integrating visual elements for multi-modal document retrieval.

10. 【2501.08814】SAIF: A Comprehensive Framework for Evaluating the Risks of Generative AI in the Public Sector

Link: https://arxiv.org/abs/2501.08814

Authors: Kyeongryul Lee, Heehyeon Kim, Joyce Jiyoung Whang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: automated public assistance, encompassing diverse applications, diverse applications ranging, public sector, immigration processes

Comments: 6 pages, 2 figures, 1 table. AI for Public Missions (AIPM) Workshop at the 39th AAAI Conference on Artificial Intelligence (AAAI 2025)

Abstract:The rapid adoption of generative AI in the public sector, encompassing diverse applications ranging from automated public assistance to welfare services and immigration processes, highlights its transformative potential while underscoring the pressing need for thorough risk assessments. Despite its growing presence, evaluations of risks associated with AI-driven systems in the public sector remain insufficiently explored. Building upon an established taxonomy of AI risks derived from diverse government policies and corporate guidelines, we investigate the critical risks posed by generative AI in the public sector while extending the scope to account for its multimodal capabilities. In addition, we propose a Systematic dAta generatIon Framework for evaluating the risks of generative AI (SAIF). SAIF involves four key stages: breaking down risks, designing scenarios, applying jailbreak methods, and exploring prompt types. It ensures the systematic and consistent generation of prompt data, facilitating a comprehensive evaluation while providing a solid foundation for mitigating the risks. Furthermore, SAIF is designed to accommodate emerging jailbreak methods and evolving prompt types, thereby enabling effective responses to unforeseen risk scenarios. We believe that this study can play a crucial role in fostering the safe and responsible integration of generative AI into the public sector.

11. 【2501.08769】Enhanced Large Language Models for Effective Screening of Depression and Anxiety

Link: https://arxiv.org/abs/2501.08769

Authors: June M. Liu, Mengxia Gao, Sahand Sabour, Zhuang Chen, Minlie Huang, Tatia M.C. Lee

Categories: Computation and Language (cs.CL)

Keywords: necessitating timely identification, Large Language Models, necessitating timely, identification and management, timely identification

Comments: none

Abstract:Depressive and anxiety disorders are widespread, necessitating timely identification and management. Recent advances in Large Language Models (LLMs) offer potential solutions, yet high costs and ethical concerns about training data remain challenges. This paper introduces a pipeline for synthesizing clinical interviews, resulting in 1,157 interactive dialogues (PsyInterview), and presents EmoScan, an LLM-based emotional disorder screening system. EmoScan distinguishes between coarse (e.g., anxiety or depressive disorders) and fine disorders (e.g., major depressive disorders) and conducts high-quality interviews. Evaluations showed that EmoScan exceeded the performance of base models and other LLMs like GPT-4 in screening emotional disorders (F1-score=0.7467). It also delivers superior explanations (BERTScore=0.9408) and demonstrates robust generalizability (F1-score of 0.67 on an external dataset). Furthermore, EmoScan outperforms baselines in interviewing skills, as validated by automated ratings and human evaluations. This work highlights the importance of scalable data-generative pipelines for developing effective mental health LLM tools.

12. 【2501.08758】Expanding Vietnamese SentiWordNet to Improve Performance of Vietnamese Sentiment Analysis Models

Link: https://arxiv.org/abs/2501.08758

Authors: Hong-Viet Tran, Van-Tan Bui, Lam-Quan Tran

Categories: Computation and Language (cs.CL)

Keywords: Natural Language Processing, Language Processing, Natural Language, machine learning models, Sentiment analysis

Comments: none

Abstract:Sentiment analysis is one of the most crucial tasks in Natural Language Processing (NLP), involving the training of machine learning models to classify text based on the polarity of opinions. Pre-trained Language Models (PLMs) can be applied to downstream tasks through fine-tuning, eliminating the need to train the model from scratch. Specifically, PLMs have been employed for Sentiment Analysis, a process that involves detecting, analyzing, and extracting the polarity of text sentiments. Numerous models have been proposed to address this task, with pre-trained PhoBERT-V2 models standing out as the state-of-the-art language models for Vietnamese. The PhoBERT-V2 pre-training approach is based on RoBERTa, optimizing the BERT pre-training method for more robust performance. In this paper, we introduce a novel approach that combines PhoBERT-V2 and SentiWordnet for Sentiment Analysis of Vietnamese reviews. Our proposed model utilizes PhoBERT-V2 for Vietnamese, offering a robust optimization for the prominent BERT model in the context of Vietnamese language, and leverages SentiWordNet, a lexical resource explicitly designed to support sentiment classification applications. Experimental results on the VLSP 2016 and AIVIVN 2019 datasets demonstrate that our sentiment analysis system has achieved excellent performance in comparison to other models.

13. 【2501.08716】The Inherent Limits of Pretrained LLMs: The Unexpected Convergence of Instruction Tuning and In-Context Learning Capabilities

Link: https://arxiv.org/abs/2501.08716

Authors: Irina Bigoulaeva, Harish Tayyar Madabushi, Iryna Gurevych

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, demonstrated remarkable abilities, extensive web-scale corpora, Language Models

Comments: The code for this paper is available at: [this https URL](https://github.com/UKPLab/arxiv2025-inherent-limits-plms)

Abstract:Large Language Models (LLMs), trained on extensive web-scale corpora, have demonstrated remarkable abilities across diverse tasks, especially as they are scaled up. Nevertheless, even state-of-the-art models struggle in certain cases, sometimes failing at problems solvable by young children, indicating that traditional notions of task complexity are insufficient for explaining LLM capabilities. However, exploring LLM capabilities is complicated by the fact that most widely-used models are also "instruction-tuned" to respond appropriately to prompts. With the goal of disentangling the factors influencing LLM performance, we investigate whether instruction-tuned models possess fundamentally different capabilities from base models that are prompted using in-context examples. Through extensive experiments across various model families, scales and task types, which included instruction tuning 90 different LLMs, we demonstrate that the performance of instruction-tuned models is significantly correlated with the in-context performance of their base counterparts. By clarifying what instruction-tuning contributes, we extend prior research into in-context learning, which suggests that base models use priors from pretraining data to solve tasks. Specifically, we extend this understanding to instruction-tuned models, suggesting that their pretraining data similarly sets a limiting boundary on the tasks they can solve, with the added influence of the instruction-tuning dataset.

14. 【2501.08696】Deep Learning-Based Feature Fusion for Emotion Analysis and Suicide Risk Differentiation in Chinese Psychological Support Hotlines

Link: https://arxiv.org/abs/2501.08696

Authors: Han Wang, Jianqiang Li, Qing Zhao, Zhonglong Chen, Changwei Song, Jing Tang, Yuning Huang, Wei Zhai, Yongsheng Tong, Guanghui Fu

Categories: Computation and Language (cs.CL)

Keywords: providing mental health, mental health assistance, public health issue, global public health, Mental health

Comments: none

Abstract:Mental health is a critical global public health issue, and psychological support hotlines play a pivotal role in providing mental health assistance and identifying suicide risks at an early stage. However, the emotional expressions conveyed during these calls remain underexplored in current research. This study introduces a method that combines pitch acoustic features with deep learning-based features to analyze and understand emotions expressed during hotline interactions. Using data from China's largest psychological support hotline, our method achieved an F1-score of 79.13% for negative binary emotion classification. The proposed approach was also validated on an open dataset for multi-class emotion classification, where it demonstrated better performance compared to the state-of-the-art methods. To explore its clinical relevance, we applied the model to analyze the frequency of negative emotions and the rate of emotional change in the conversation, comparing 46 subjects with suicidal behavior to those without. While the suicidal group exhibited more frequent emotional changes than the non-suicidal group, the difference was not statistically significant. Our findings suggest that emotional fluctuation intensity and frequency could serve as novel features for psychological assessment scales and suicide risk prediction. The proposed method provides valuable insights into emotional dynamics and has the potential to advance early intervention and improve suicide prevention strategies through integration with clinical tools and assessments. The source code is publicly available at this https URL.

15. 【2501.08686】Knowledge Graph-based Retrieval-Augmented Generation for Schema Matching

Link: https://arxiv.org/abs/2501.08686

Authors: Chuangtao Ma, Sriom Chakrabarti, Arijit Khan, Bálint Molnár

Categories: Databases (cs.DB); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Traditional similarity-based schema, resolving semantic ambiguities, schema matching, Traditional similarity-based, similarity-based schema matching

Comments: Under Review

Abstract:Traditional similarity-based schema matching methods are incapable of resolving semantic ambiguities and conflicts in domain-specific complex mapping scenarios due to missing commonsense and domain-specific knowledge. The hallucination problem of large language models (LLMs) also makes it challenging for LLM-based schema matching to address the above issues. Therefore, we propose a Knowledge Graph-based Retrieval-Augmented Generation model for Schema Matching, referred to as the KG-RAG4SM. In particular, KG-RAG4SM introduces novel vector-based, graph traversal-based, and query-based graph retrievals, as well as a hybrid approach and ranking schemes that identify the most relevant subgraphs from external large knowledge graphs (KGs). We showcase that KG-based retrieval-augmented LLMs are capable of generating more accurate results for complex matching cases without any re-training. Our experimental results show that KG-RAG4SM outperforms the LLM-based state-of-the-art (SOTA) methods (e.g., Jellyfish-8B) by 35.89% and 30.50% in terms of precision and F1 score on the MIMIC dataset, respectively; KG-RAG4SM with GPT-4o-mini outperforms the pre-trained language model (PLM)-based SOTA methods (e.g., SMAT) by 69.20% and 21.97% in terms of precision and F1 score on the Synthea dataset, respectively. The results also demonstrate that our approach is more efficient in end-to-end schema matching, and scales to retrieve from large KGs. Our case studies on the dataset from the real-world schema matching scenario exhibit that the hallucination problem of LLMs for schema matching is well mitigated by our solution.

16. 【2501.08648】MAGNET: Augmenting Generative Decoders with Representation Learning and Infilling Capabilities

Link: https://arxiv.org/abs/2501.08648

Authors: Savya Khosla, Kushal Kafle, Simon Jenni, Handong Zhao, John Collomosse, Jing Shi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: unidirectional generative modeling, large language models, generative modeling, decoder-only large language, originally designed

Comments: none

Abstract:While originally designed for unidirectional generative modeling, decoder-only large language models (LLMs) are increasingly being adapted for bidirectional modeling. However, unidirectional and bidirectional models are typically trained separately with distinct objectives (generation and representation learning, respectively). This separation overlooks the opportunity for developing a more versatile language model and for these objectives to complement each other. In this work, we introduce MAGNET, an adaptation of decoder-only LLMs that enhances their ability to generate robust representations and infill missing text spans, while preserving their knowledge and text generation capabilities. MAGNET employs three self-supervised training objectives and introduces an attention mechanism that combines bidirectional and causal attention, enabling unified training across all objectives. Our results demonstrate that LLMs adapted with MAGNET (1) surpass strong text encoders on token-level and sentence-level representation learning tasks, (2) generate contextually appropriate text infills by leveraging future context, (3) retain the ability for open-ended text generation without exhibiting repetition problem, and (4) preserve the knowledge gained by the LLM during pretraining.

17. 【2501.08641】Reassessing the Role of Chain-of-Thought in Sentiment Analysis: Insights and Limitations

Link: https://arxiv.org/abs/2501.08641

Authors: Kaiyuan Zheng, Qinghua Zhao, Lei Li

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: unresolved philosophical issue, remains an unresolved, unresolved philosophical, philosophical issue, language

Comments: none

Abstract:The relationship between language and thought remains an unresolved philosophical issue. Existing viewpoints can be broadly categorized into two schools: one asserting their independence, and another arguing that language constrains thought. In the context of large language models, this debate raises a crucial question: Does a language model's grasp of semantic meaning depend on thought processes? To explore this issue, we investigate whether reasoning techniques can facilitate semantic understanding. Specifically, we conceptualize thought as reasoning, employ chain-of-thought prompting as a reasoning technique, and examine its impact on sentiment analysis tasks. The experiments show that chain-of-thought has a minimal impact on sentiment analysis tasks. Both the standard and chain-of-thought prompts focus on aspect terms rather than sentiment in the generated content. Furthermore, counterfactual experiments reveal that the model's handling of sentiment tasks primarily depends on information from demonstrations. The experimental results support the first viewpoint.

18. 【2501.08631】SWSC: Shared Weight for Similar Channel in LLM

Link: https://arxiv.org/abs/2501.08631

Authors: Binrui Zeng, Yongtao Tang, Xiaodong Liu, Xiaopeng Li

Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)

Keywords: Large language models, Large language, multiple industries, spurred development, development in multiple

Comments: 5 pages, 3 figures, work in progress

Abstract:Large language models (LLMs) have spurred development in multiple industries. However, the growing number of their parameters brings substantial storage and computing burdens, making it essential to explore model compression techniques for parameter reduction and easier deployment. We propose SWSC, an LLM compression method based on the concept of Shared Weight for Similar Channel. It uses the K-Means clustering algorithm to cluster model weights channel-by-channel, generating clusters with highly similar vectors within each. A representative vector from each cluster is selected to approximately replace all vectors in the cluster, significantly reducing the number of model weight parameters. However, approximate restoration will inevitably cause damage to the performance of the model. To tackle this issue, we perform singular value decomposition on the weight error values before and after compression and retain the larger singular values and their corresponding singular vectors to compensate for the accuracy. The experimental results show that our method can effectively ensure the performance of the compressed LLM even under low-precision conditions.

19. 【2501.08621】ViBidirectionMT-Eval: Machine Translation for Vietnamese-Chinese and Vietnamese-Lao language pair

Link: https://arxiv.org/abs/2501.08621

Authors: Hong-Viet Tran, Minh-Quy Nguyen, Van-Vinh Nguyen

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: VLSP 2022-2023 Machine, 2022-2023 Machine Translation, Translation Shared Tasks, Machine Translation, VLSP 2022-2023

Abstract:This paper presents the results of the VLSP 2022-2023 Machine Translation Shared Tasks, focusing on Vietnamese-Chinese and Vietnamese-Lao machine translation. The tasks were organized as part of the 9th and 10th annual workshops on Vietnamese Language and Speech Processing (VLSP 2022, VLSP 2023). The objective of the shared task was to build machine translation systems, specifically targeting Vietnamese-Chinese and Vietnamese-Lao translation (corresponding to 4 translation directions). The submissions were evaluated on 1,000 pairs for testing (news and general domains) using established metrics like BLEU [11] and SacreBLEU [12]. Additionally, system outputs were also evaluated with human judgment provided by experts in Chinese and Lao languages. These human assessments played a crucial role in ranking the performance of the machine translation models, ensuring a more comprehensive evaluation.

20. 【2501.08618】Disjoint Processing Mechanisms of Hierarchical and Linear Grammars in Large Language Models

Link: https://arxiv.org/abs/2501.08618

Authors: Aruna Sankaranarayanan, Dylan Hadfield-Menell, Aaron Mueller

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: grammars, structured hierarchically, hierarchical, natural languages, language

Abstract:All natural languages are structured hierarchically. In humans, this structural restriction is neurologically coded: when two grammars are presented with identical vocabularies, brain areas responsible for language processing are only sensitive to hierarchical grammars. Using large language models (LLMs), we investigate whether such functionally distinct hierarchical processing regions can arise solely from exposure to large-scale language distributions. We generate inputs using English, Italian, Japanese, or nonce words, varying the underlying grammars to conform to either hierarchical or linear/positional rules. Using these grammars, we first observe that language models show distinct behaviors on hierarchical versus linearly structured inputs. Then, we find that the components responsible for processing hierarchical grammars are distinct from those that process linear grammars; we causally verify this in ablation experiments. Finally, we observe that hierarchy-selective components are also active on nonce grammars; this suggests that hierarchy sensitivity is not tied to meaning, nor in-distribution inputs.

21. 【2501.08617】RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation

Link: https://arxiv.org/abs/2501.08617

Authors: Kaiqu Liang, Haimin Hu, Ryan Liu, Thomas L. Griffiths, Jaime Fernández Fisac

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Generative AI systems, helpful and trustworthy, Reinforcement Learning, Goodhart Law dynamics, Feedback

Abstract:Generative AI systems like foundation models (FMs) must align well with human values to ensure their behavior is helpful and trustworthy. While Reinforcement Learning from Human Feedback (RLHF) has shown promise for optimizing model performance using human judgments, existing RLHF pipelines predominantly rely on immediate feedback, which can fail to accurately reflect the downstream impact of an interaction on users' utility. We demonstrate that feedback based on evaluators' foresight estimates of downstream consequences systematically induces Goodhart's Law dynamics, incentivizing misaligned behaviors like sycophancy and deception and ultimately degrading user outcomes. To alleviate this, we propose decoupling evaluation from prediction by refocusing RLHF on hindsight feedback. Our theoretical analysis reveals that conditioning evaluator feedback on downstream observations mitigates misalignment and improves expected human utility, even when these observations are simulated by the AI system itself. To leverage this insight in a practical alignment algorithm, we introduce Reinforcement Learning from Hindsight Simulation (RLHS), which first simulates plausible consequences and then elicits feedback to assess what behaviors were genuinely beneficial in hindsight. We apply RLHS to two widely-employed online and offline preference optimization methods -- Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) -- and show empirically that misalignment is significantly reduced with both methods. Through an online human user study, we show that RLHS consistently outperforms RLHF in helping users achieve their goals and earns higher satisfaction ratings, despite being trained solely with simulated hindsight feedback. These results underscore the importance of focusing on long-term consequences, even simulated ones, to mitigate misalignment in RLHF.

22. 【2501.08613】Assessing the Alignment of FOL Closeness Metrics with Human Judgement

Link: https://arxiv.org/abs/2501.08613

Authors: Ramya Keerthy Thatikonda, Wray Buntine, Ehsan Shareghi

Categories: Computation and Language (cs.CL)

Keywords: external theorem provers, recent successful paradigm, solving logical reasoning, logical reasoning problems, First-Order Logic

Notes: Code: https://github.com/RamyaKeerthy/AlignmentFOL

Abstract:The recent successful paradigm of solving logical reasoning problems with tool-augmented large language models (LLMs) leverages translation of natural language statements into First-Order Logic (FOL) and external theorem provers. However, the correctness of FOL statements, comprising operators and text predicates, often goes unverified due to the lack of a reliable evaluation metric for comparing generated and ground-truth FOLs. In this paper, we present a comprehensive study of sensitivity of existing metrics and their alignment with human judgement on FOL evaluation. Using ground-truth FOLs, we carefully designed various perturbations on the ground-truth to assess metric sensitivity. We sample FOL translation candidates for natural language statements and measure the ranking alignment between automatic metrics and human annotators. Our empirical findings highlight oversensitivity in the n-gram metric BLEU for text perturbations, the semantic graph metric Smatch++ for structural perturbations, and FOL metric for operator perturbation. We also observe a closer alignment between BertScore and human judgement. Additionally, we show that combining metrics enhances both alignment and sensitivity compared to using individual metrics.

23. 【2501.08597】Dynamic Knowledge Integration for Enhanced Vision-Language Reasoning

Link: https://arxiv.org/abs/2501.08597

Authors: Julian Perry, Surasakdi Siripong, Thanakorn Phonchai

Categories: Computation and Language (cs.CL)

Keywords: Large Vision-Language Models, demonstrated impressive capabilities, visual question answering, Large Vision-Language, external knowledge integration

Abstract:Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities in multimodal tasks, but their performance is often constrained by the lack of external knowledge integration, limiting their ability to handle knowledge-intensive tasks such as visual question answering and reasoning. To address this challenge, we propose a novel method, Adaptive Knowledge-Guided Pretraining for Large Vision-Language Models (AKGP-LVLM), which dynamically incorporates structured and unstructured knowledge into LVLMs during pretraining and fine-tuning. Our approach employs a knowledge encoder to represent external knowledge, a retrieval mechanism to select task-relevant information, and a dynamic adaptor to align multimodal and knowledge representations effectively. We evaluate our method on four benchmark datasets, demonstrating significant performance improvements over state-of-the-art models. Furthermore, human evaluations highlight the superior correctness and relevance of our model's outputs. Extensive analyses confirm the robustness, efficiency, and scalability of AKGP-LVLM, making it a compelling solution for real-world knowledge-intensive tasks.

24. 【2501.08582】LoRS: Efficient Low-Rank Adaptation for Sparse Large Language Model

Link: https://arxiv.org/abs/2501.08582

Authors: Yuxuan Hu, Jing Zhang, Xiaodong Chen, Zhe Zhao, Cuiping Li, Hong Chen

Categories: Computation and Language (cs.CL)

Keywords: large language models, Existing low-rank adaptation, methods face challenges, sparse large language, low-rank adaptation

Notes: 12 pages, 4 figures

Abstract:Existing low-rank adaptation (LoRA) methods face challenges on sparse large language models (LLMs) due to the inability to maintain sparsity. Recent works introduced methods that maintain sparsity by augmenting LoRA techniques with additional masking mechanisms. Despite these successes, such approaches suffer from an increased memory and computation overhead, which affects efficiency of LoRA methods. In response to this limitation, we introduce LoRS, an innovative method designed to achieve both memory and computation efficiency when fine-tuning sparse LLMs. To mitigate the substantial memory and computation demands associated with preserving sparsity, our approach incorporates strategies of weight recompute and computational graph rearrangement. In addition, we also improve the effectiveness of LoRS through better adapter initialization. These innovations lead to a notable reduction in memory and computation consumption during the fine-tuning phase, all while achieving performance levels that outperform existing LoRA approaches.
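To make the sparsity problem concrete, here is a small sketch of the masking idea that the abstract attributes to prior work: a dense low-rank update `B @ A` would fill in pruned weights, so the update is applied only where the original weight is non-zero. It is not LoRS itself, whose weight recompute and computational graph rearrangement are not shown. The helper names are hypothetical and the matrices are toy values.

```python
def matmul(a, b):
    """Multiply two small matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def masked_lora_weight(w, b, a):
    """Add the low-rank update B @ A only at positions where w is non-zero."""
    delta = matmul(b, a)
    return [
        [w_ij + d_ij if w_ij != 0 else 0.0 for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


sparse_w = [[1.0, 0.0], [0.0, 2.0]]   # pruned weight: zeros must stay zero
lora_b = [[1.0], [2.0]]               # rank-1 adapter factors
lora_a = [[0.5, 0.5]]
merged = masked_lora_weight(sparse_w, lora_b, lora_a)
print(merged)
```

Without the mask, the merged matrix would be fully dense. The cost of materializing the mask and the dense `delta` is the kind of memory and computation overhead the abstract says LoRS aims to reduce.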

25. 【2501.08579】What Limits LLM-based Human Simulation: LLMs or Our Design?

Link: https://arxiv.org/abs/2501.08579

Authors: Qian Wang, Jiaying Wu, Zhenheng Tang, Bingqiao Luo, Nuo Chen, Wei Chen, Bingsheng He

Categories: Computation and Language (cs.CL)

Keywords: advancing LLM-based human, simulation requires addressing, LLM-based human simulation, human simulation requires, LLM-based human

Abstract:We argue that advancing LLM-based human simulation requires addressing both LLMs' inherent limitations and simulation framework design challenges. Recent studies have revealed significant gaps between LLM-based human simulations and real-world observations, highlighting these dual challenges. To address these gaps, we present a comprehensive analysis of LLM limitations and our design issues, proposing targeted solutions for both aspects. Furthermore, we explore future directions that address both challenges simultaneously, particularly in data collection, LLM generation, and evaluation. To support further research in this field, we provide a curated collection of LLM-based human simulation resources.

26. 【2501.08570】Information Entropy Invariance: Enhancing Length Extrapolation in Attention Mechanisms

Link: https://arxiv.org/abs/2501.08570

Authors: Kewei Li, Yanwen Kong, Yiping Xu, Lan Huang, Ruochi Zhang, Fengfeng Zhou

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, natural language processing, Large Language, capabilities of Large, language processing

Abstract:Improving the length extrapolation capabilities of Large Language Models (LLMs) remains a critical challenge in natural language processing. Many recent efforts have focused on modifying the scaled dot-product attention mechanism, and often introduce scaled temperatures without rigorous theoretical justification. To fill this gap, we introduce a novel approach based on information entropy invariance. We propose two new scaled temperatures to enhance length extrapolation. First, a training-free method InfoScale is designed for dot-product attention, and preserves focus on original tokens during length extrapolation by ensuring information entropy remains consistent. Second, we theoretically analyze the impact of scaling (CosScale) on cosine attention. Experimental data demonstrates that combining InfoScale and CosScale achieves state-of-the-art performance on the GAU-α model with a context window extended to 64 times the training length, and outperforms seven existing methods. Our analysis reveals that significantly increasing CosScale approximates windowed attention, and highlights the significance of attention score dilution as a key challenge in long-range context handling. The code and data are available at this https URL.
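The entropy-invariance idea can be illustrated numerically. The sketch below is not the paper's InfoScale formula, which the abstract does not give. It only shows that softmax attention becomes more diffuse (higher entropy) as the number of tokens grows, and that a larger scale on the scores can restore the entropy seen at the shorter length. The scores are synthetic Gaussian draws, and `matching_scale` is a hypothetical helper that finds the scale by bisection.

```python
import math
import random


def attention_entropy(scores, scale=1.0):
    """Shannon entropy (in nats) of softmax(scale * scores)."""
    top = max(scores)
    exps = [math.exp(scale * (s - top)) for s in scores]  # shifted for stability
    total = sum(exps)
    return -sum((e / total) * math.log(e / total) for e in exps if e > 0)


def matching_scale(scores, target_entropy, lo=0.0, hi=64.0, steps=60):
    """Bisect for the scale at which the attention entropy hits the target.

    Entropy decreases monotonically as the scale grows, so bisection applies.
    """
    for _ in range(steps):
        mid = (lo + hi) / 2
        if attention_entropy(scores, mid) > target_entropy:
            lo = mid  # still too diffuse: sharpen further
        else:
            hi = mid
    return (lo + hi) / 2


rng = random.Random(0)
long_scores = [rng.gauss(0.0, 1.0) for _ in range(512)]  # synthetic attention logits
short_scores = long_scores[:64]                          # "training length" prefix

target = attention_entropy(short_scores)
scale = matching_scale(long_scores, target)
print(f"entropy at 64 tokens: {target:.3f}")
print(f"entropy at 512 tokens, unscaled: {attention_entropy(long_scores):.3f}")
print(f"scale that restores the 64-token entropy: {scale:.3f}")
```

A closed-form temperature, as proposed in the paper, would avoid the per-query search used here.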

27. 【2501.08540】Knowledge prompt chaining for semantic modeling

Link: https://arxiv.org/abs/2501.08540

Authors: Ning Pei Ding, Jingge Du, Zaiwen Feng

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)

Keywords: knowledge representation field, XML files, representation field, files is highly, highly relevant

Abstract:The task of building semantics for structured data such as CSV, JSON, and XML files is highly relevant in the knowledge representation field. Even though we have a vast amount of structured data on the internet, mapping them to domain ontologies to build semantics for them is still very challenging as it requires the construction model to understand and learn graph-structured knowledge. Otherwise, the task will require human beings' effort and cost. In this paper, we propose a novel automatic semantic modeling framework: Knowledge Prompt Chaining. It can serialize the graph-structured knowledge and inject it into the LLMs properly in a Prompt Chaining architecture. Through this knowledge injection and prompt chaining, the model in our framework can learn the structure information and latent space of the graph and generate the semantic labels and semantic graphs following the chains' instructions naturally. Based on experimental results, our method achieves better performance than existing leading techniques, despite using reduced structured input data.

28. 【2501.08537】Complexity Control Facilitates Reasoning-Based Compositional Generalization in Transformers

Link: https://arxiv.org/abs/2501.08537

Authors: Zhongwang Zhang, Pengxiao Lin, Zhiwei Wang, Yaoyu Zhang, Zhi-Qin John Xu

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: demonstrated impressive capabilities, Transformers have demonstrated, compositional problems remains, subject of debate, demonstrated impressive

Notes: Mistakenly submitted as a replacement to [2405.05409v4](https://arxiv.org/abs/2405.05409v4)

Abstract:Transformers have demonstrated impressive capabilities across various tasks, yet their performance on compositional problems remains a subject of debate. In this study, we investigate the internal mechanisms underlying Transformers' behavior in compositional tasks. We find that complexity control strategies significantly influence whether the model learns primitive-level rules that generalize out-of-distribution (reasoning-based solutions) or relies solely on memorized mappings (memory-based solutions). By applying masking strategies to the model's information circuits and employing multiple complexity metrics, we reveal distinct internal working mechanisms associated with different solution types. Further analysis reveals that reasoning-based solutions exhibit a lower complexity bias, which aligns with the well-studied neuron condensation phenomenon. This lower complexity bias is hypothesized to be the key factor enabling these solutions to learn reasoning rules. We validate these conclusions across multiple real-world datasets, including image generation and natural language processing tasks, confirming the broad applicability of our findings.

29. 【2501.08523】Doc-Guided Sent2Sent++: A Sent2Sent++ Agent with Doc-Guided memory for Document-level Machine Translation

Link: https://arxiv.org/abs/2501.08523

Authors: Jiaxin Guo, Yuanchang Luo, Daimeng Wei, Ling Zhang, Zongyao Li, Hengchao Shang, Zhiqiang Rao, Shaojun Li, Jinlong Yang, Zhanglin Wu, Hao Yang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, natural language processing, capabilities of Large, witnessed significant advancements, Document-level Machine Translation

Abstract:The field of artificial intelligence has witnessed significant advancements in natural language processing, largely attributed to the capabilities of Large Language Models (LLMs). These models form the backbone of Agents designed to address long-context dependencies, particularly in Document-level Machine Translation (DocMT). DocMT presents unique challenges, with quality, consistency, and fluency being the key metrics for evaluation. Existing approaches, such as Doc2Doc and Doc2Sent, either omit sentences or compromise fluency. This paper introduces Doc-Guided Sent2Sent++, an Agent that employs an incremental sentence-level forced decoding strategy to ensure every sentence is translated while enhancing the fluency of adjacent sentences. Our Agent leverages a Doc-Guided Memory, focusing solely on the summary and its translation, which we find to be an efficient approach to maintaining consistency. Through extensive testing across multiple languages and domains, we demonstrate that Sent2Sent++ outperforms other methods in terms of quality, consistency, and fluency. The results indicate that our approach has achieved significant improvements in metrics such as s-COMET, d-COMET, LTCR-$1_f$, and document-level perplexity (d-ppl). The contributions of this paper include a detailed analysis of current DocMT research, the introduction of the Sent2Sent++ decoding method, the Doc-Guided Memory mechanism, and validation of its effectiveness across languages and domains.

30. 【2501.08502】Adapting Whisper for Regional Dialects: Enhancing Public Services for Vulnerable Populations in the United Kingdom

Link: https://arxiv.org/abs/2501.08502

Authors: Melissa Torgbi, Andrew Clayman, Jordan J. Speight, Harish Tayyar Madabushi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: United Kingdom, automatic speech recognition, capturing regional differences, Scotland with distinct, biased ASR models

Abstract:We collect novel data in the public service domain to evaluate the capability of the state-of-the-art automatic speech recognition (ASR) models in capturing regional differences in accents in the United Kingdom (UK), specifically focusing on two accents from Scotland with distinct dialects. This study addresses real-world problems where biased ASR models can lead to miscommunication in public services, disadvantaging individuals with regional accents, particularly those in vulnerable populations. We first examine the out-of-the-box performance of the Whisper large-v3 model on a baseline dataset and our data. We then explore the impact of fine-tuning Whisper on the performance in the two UK regions and investigate the effectiveness of existing model evaluation techniques for our real-world application through manual inspection of model errors. We observe that the Whisper model has a higher word error rate (WER) on our test datasets compared to the baseline data, and that fine-tuning on given data improves performance on the test dataset with the same domain and accent. The fine-tuned models also appear to show improved performance when applied to test data from outside the region they were trained on, suggesting that fine-tuned models may be transferable within parts of the UK. Our manual analysis of model outputs reveals the benefits and drawbacks of using WER as an evaluation metric and fine-tuning to adapt to regional dialects.
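For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The following is a minimal sketch of that standard definition. The whitespace tokenization and the example sentences are illustrative assumptions and are not taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                          # deletion
                curr[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Every word error counts equally here, whether or not it changes the meaning of the utterance, which is one reason a manual inspection of errors can tell a different story from the WER figure alone.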

31. 【2501.08496】Quantifying the Importance of Data Alignment in Downstream Model Performance

Link: https://arxiv.org/abs/2501.08496

Authors: Krrish Chawla, Aryan Sahai, Mario DePavia, Sudharsan Sundar, Brando Miranda

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL)

Keywords: capable Large Language, training capable Large, Large Language Models, capable Large, Large Language

Abstract:Contrary to the conventional emphasis on dataset size, we explore the role of data alignment -- an often overlooked aspect of data quality -- in training capable Large Language Models (LLMs). To do so, we use the Task2Vec-based alignment coefficient, a quantitative measure of the similarity between two datasets, to quantify the impact of alignment between training data and evaluation data on downstream performance. In particular, we conduct controlled interventional experiments for two settings: 1. the impact of increased alignment coefficients between various pre-training (pt) and evaluation datasets, and 2. the impact of increased alignment coefficients between domain-specific fine-tuning (ft) and domain-specific evaluation datasets. The domain-specific task we explore is Autoformalization -- the machine translation task between natural language and code for formal verification. In both settings, we find a strong, predictable negative correlation between the alignment coefficient of a model's training and evaluation data and the model's loss/perplexity on the respective downstream task. These findings suggest a re-evaluation of LLM training approaches, demonstrating the relevance of data alignment compared to data quantity, especially in specialized downstream tasks such as Autoformalization.

32. 【2501.08474】The Theater Stage as Laboratory: Review of Real-Time Comedy LLM Systems for Live Performance

Link: https://arxiv.org/abs/2501.08474

Authors: Piotr Wojciech Mirowski, Boyd Branch, Kory Wallace Mathewson

Categories: Computation and Language (cs.CL)

Keywords: eclectic recent history, artistic works involving, involving computational systems, position paper, review the eclectic

Notes: 8 pages, 1st Workshop on Computational Humor (CHum), COLING 2025

Abstract:In this position paper, we review the eclectic recent history of academic and artistic works involving computational systems for humor generation, and focus specifically on live performance. We make the case that AI comedy should be evaluated in live conditions, in front of audiences sharing either physical or online spaces, and under real-time constraints. We further suggest that improvised comedy is therefore the perfect substrate for deploying and assessing computational humor systems. Using examples of successful AI-infused shows, we demonstrate that live performance raises three sets of challenges for computational humor generation: 1) questions around robotic embodiment, anthropomorphism and competition between humans and machines, 2) questions around comedic timing and the nature of audience interaction, and 3) questions about the human interpretation of seemingly absurd AI-generated humor. We argue that these questions impact the choice of methodologies for evaluating computational humor, as any such method needs to work around the constraints of live audiences and performance spaces. These interrogations also highlight different types of collaborative relationships between human comedians and AI tools.

33. 【2501.08468】Selective Attention Merging for low resource tasks: A case study of Child ASR

Link: https://arxiv.org/abs/2501.08468

Authors: Natarajan Balaji Shankar, Zilai Wang, Eray Eren, Abeer Alwan

Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: Automatic Speech Recognition, child Automatic Speech, Speech Foundation Models, Speech Recognition, child Automatic

Notes: To appear in ICASSP 2025

Abstract:While Speech Foundation Models (SFMs) excel in various speech tasks, their performance for low-resource tasks such as child Automatic Speech Recognition (ASR) is hampered by limited pretraining data. To address this, we explore different model merging techniques to leverage knowledge from models trained on larger, more diverse speech corpora. This paper also introduces Selective Attention (SA) Merge, a novel method that selectively merges task vectors from attention matrices to enhance SFM performance on low-resource tasks. Experiments on the MyST database show significant reductions in relative word error rate of up to 14%, outperforming existing model merging and data augmentation techniques. By combining data augmentation techniques with SA Merge, we achieve a new state-of-the-art WER of 8.69 on the MyST database for the Whisper-small model, highlighting the potential of SA Merge for improving low-resource ASR.
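A task vector is the difference between a fine-tuned model's weights and the weights of the base model it started from. The sketch below shows one plausible reading of "selectively merging task vectors from attention matrices": blend the task vectors of two fine-tuned models on attention parameters only, and take every other parameter from the target model. The exact SA Merge rule is not given in the abstract, so the interpolation weight `alpha`, the `"attn"` name filter, and the parameter names are all assumptions.

```python
def merge_attention_task_vectors(base, target, donor, alpha=0.5):
    """Blend task vectors on attention weights; copy everything else from target.

    Each argument maps a parameter name to a flat list of floats.
    """
    merged = {}
    for name, base_vals in base.items():
        if "attn" in name:  # hypothetical naming convention for attention matrices
            merged[name] = [
                b + alpha * (t - b) + (1 - alpha) * (d - b)
                for b, t, d in zip(base_vals, target[name], donor[name])
            ]
        else:
            merged[name] = list(target[name])
    return merged


base = {"attn.q": [0.0, 0.0], "mlp.w": [1.0, 1.0]}    # shared pretrained weights
target = {"attn.q": [2.0, 2.0], "mlp.w": [3.0, 3.0]}  # fine-tuned on the low-resource task
donor = {"attn.q": [4.0, 0.0], "mlp.w": [9.0, 9.0]}   # fine-tuned on a larger corpus
merged = merge_attention_task_vectors(base, target, donor)
print(merged)
```

Restricting the merge to attention parameters leaves the rest of the target model untouched, which is the "selective" part of the idea as read here.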

34. 【2501.08460】Towards Zero-Shot Explainable Video Description by Reasoning over Graphs of Events in Space and Time

Link: https://arxiv.org/abs/2501.08460

Authors: Mihai Masala, Marius Leordeanu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Machine Learning, era of Machine, natural language processing, language processing, current era

Abstract:In the current era of Machine Learning, Transformers have become the de facto approach across a variety of domains, such as computer vision and natural language processing. Transformer-based solutions are the backbone of current state-of-the-art methods for language generation, image and video classification, segmentation, action and object recognition, among many others. Interestingly enough, while these state-of-the-art methods produce impressive results in their respective domains, the problem of understanding the relationship between vision and language is still beyond our reach. In this work, we propose a common ground between vision and language based on events in space and time in an explainable and programmatic way, to connect learning-based vision and language state-of-the-art models and provide a solution to the long-standing problem of describing videos in natural language. We validate that our algorithmic approach is able to generate coherent, rich and relevant textual descriptions on videos collected from a variety of datasets, using both standard metrics (e.g. BLEU, ROUGE) and the modern LLM-as-a-Jury approach.

35. 【2501.08457】Large Language Models For Text Classification: Case Study And Comprehensive Review

Link: https://arxiv.org/abs/2501.08457

Authors: Arina Kostina, Marios D. Dikaiakos, Dimosthenis Stefanidis, George Pallis

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Language Models, Unlocking the potential, natural language processing, potential of Large, Large Language

Abstract:Unlocking the potential of Large Language Models (LLMs) in data classification represents a promising frontier in natural language processing. In this work, we evaluate the performance of different LLMs in comparison with state-of-the-art deep-learning and machine-learning models, in two different classification scenarios: i) the classification of employees' working locations based on job reviews posted online (multiclass classification), and ii) the classification of news articles as fake or not (binary classification). Our analysis encompasses a diverse range of language models differing in size, quantization, and architecture. We explore the impact of alternative prompting techniques and evaluate the models based on the weighted F1-score. Also, we examine the trade-off between performance (F1-score) and time (inference response time) for each language model to provide a more nuanced understanding of each model's practical applicability. Our work reveals significant variations in model responses based on the prompting strategies. We find that LLMs, particularly Llama3 and GPT-4, can outperform traditional methods in complex classification tasks, such as multiclass classification, though at the cost of longer inference times. In contrast, simpler ML models offer better performance-to-time trade-offs in simpler binary classification tasks.
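The weighted F1-score used for evaluation here averages the per-class F1 scores, weighting each class by its number of true instances. A minimal sketch of that standard definition follows. The labels are toy values, and the function name `weighted_f1` is an illustrative choice.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n_true - tp
        f1 = 2 * tp / (2 * tp + fp + fn)  # denominator > 0 because n_true >= 1
        total += f1 * n_true
    return total / len(y_true)


y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
# Class "a": F1 = 0.8 with support 3. Class "b": F1 = 2/3 with support 1.
print(weighted_f1(y_true, y_pred))
```

Weighting by support keeps frequent classes from being drowned out by rare ones, at the cost of hiding poor performance on small classes.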

36. 【2501.08454】TagTab: Pretraining Data Detection in Large Language Models Using Keyword-Based Membership Inference Attack

Link: https://arxiv.org/abs/2501.08454

Authors: Sagiv Antebi, Edan Habler, Asaf Shabtai, Yuval Elovici

Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)

Keywords: task assistance tools, essential digital task, digital task assistance, Large language models, Large language

Abstract:Large language models (LLMs) have become essential digital task assistance tools. Their training relies heavily on the collection of vast amounts of data, which may include copyright-protected or sensitive information. Recent studies on the detection of pretraining data in LLMs have primarily focused on sentence-level or paragraph-level membership inference attacks (MIAs), usually involving probability analysis of the target model prediction tokens. However, the proposed methods often demonstrate poor performance, specifically in terms of accuracy, failing to account for the semantic importance of textual content and word significance. To address these shortcomings, we propose TagTab, a novel approach for detecting data that has been used as part of the LLM pretraining. Our method leverages advanced natural language processing (NLP) techniques to tag keywords in the input text - a process we term Tagging. Then, the LLM is used to obtain the probabilities of these keywords and calculate their average log-likelihood to determine input text membership, a process we refer to as Tabbing. Our experiments on three benchmark datasets (BookMIA, MIMIR, and the Pile) and several open-source LLMs of varying sizes demonstrate an average increase in the AUC scores ranging from 4.1% to 12.1% over state-of-the-art methods. TagTab not only sets a new standard for data leakage detection in LLMs, but its outstanding performance is a testament to the importance of words in MIAs on LLMs.
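The scoring step described above (the "Tabbing" stage) can be sketched as follows: average the target model's log-probabilities over the tagged keyword tokens and compare the result with a threshold. This is only an illustration of the idea as stated in the abstract. The log-probabilities and keyword positions are made-up inputs that would in practice come from the target LLM and the keyword tagger, and the threshold rule and function names are assumptions.

```python
def keyword_score(token_logprobs, keyword_positions):
    """Mean log-probability the model assigns to the tagged keyword tokens."""
    picked = [token_logprobs[i] for i in keyword_positions]
    return sum(picked) / len(picked)


def is_member(token_logprobs, keyword_positions, threshold):
    """Flag the text as likely pretraining data if keywords are unusually likely."""
    return keyword_score(token_logprobs, keyword_positions) > threshold


# Toy values: per-token log-probabilities for a four-token text,
# with tokens 1 and 3 tagged as keywords.
logprobs = [-0.1, -3.0, -0.2, -5.0]
keywords = [1, 3]
print(keyword_score(logprobs, keywords))
print(is_member(logprobs, keywords, threshold=-4.5))
```

Restricting the average to keywords means that common function words, which any model predicts easily, do not dominate the membership signal.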

37. 【2501.08442】Jochre 3 and the Yiddish OCR corpus

Link: https://arxiv.org/abs/2501.08442

Authors: Assaf Urieli, Amber Clooney, Michelle Sigiel, Grisha Leyfer

Categories: Computation and Language (cs.CL)

Keywords: Alto OCR layer, Yiddish OCR Corpus, OCR layer generation, open source OCR, Yiddish Book Center

Notes: 10 pages, 4 figures

Abstract:We describe the construction of a publicly available Yiddish OCR Corpus, and describe and evaluate the open source OCR tool suite Jochre 3, including an Alto editor for corpus annotation, OCR software for Alto OCR layer generation, and a customizable OCR search engine. The current version of the Yiddish OCR corpus contains 658 pages, 186K tokens and 840K glyphs. The Jochre 3 OCR tool uses various fine-tuned YOLOv8 models for top-down page layout analysis, and a custom CNN network for glyph recognition. It attains a CER of 1.5% on our test corpus, far out-performing all other existing public models for Yiddish. We analyzed the full 660M word Yiddish Book Center with Jochre 3 OCR, and the new OCR is searchable through the Yiddish Book Center OCR search engine.

38. 【2501.08441】Religious Bias Landscape in Language and Text-to-Image Models: Analysis, Detection, and Debiasing Strategies

Link: https://arxiv.org/abs/2501.08441

Authors: Ajwad Abrar, Nafisa Tabassum Oeshy, Mohsinul Kabir, Sophia Ananiadou

Categories: Computation and Language (cs.CL)

Keywords: potentially offensive content, offensive content related, presented solely, academic purposes, language models

Abstract:Note: This paper includes examples of potentially offensive content related to religious bias, presented solely for academic purposes. The widespread adoption of language models highlights the need for critical examinations of their inherent biases, particularly concerning religion. This study systematically investigates religious bias in both language models and text-to-image generation models, analyzing both open-source and closed-source systems. We construct approximately 400 unique, naturally occurring prompts to probe language models for religious bias across diverse tasks, including mask filling, prompt completion, and image generation. Our experiments reveal concerning instances of underlying stereotypes and biases associated disproportionately with certain religions. Additionally, we explore cross-domain biases, examining how religious bias intersects with demographic factors such as gender, age, and nationality. This study further evaluates the effectiveness of targeted debiasing techniques by employing corrective prompts designed to mitigate the identified biases. Our findings demonstrate that language models continue to exhibit significant biases in both text and image generation tasks, emphasizing the urgent need to develop fairer language models to achieve global acceptability.

39. 【2501.08413】Ensemble of Large Language Models for Curated Labeling and Rating of Free-text Data

Link: https://arxiv.org/abs/2501.08413

Authors: Jiaxing Qiu, Dongliang Guo, Papini Natalie, Peace Noelle, Levinson Cheri, Teague R. Henry

Subjects: Computation and Language (cs.CL)

Keywords: providing rich qualitative, rich qualitative insights, free-text data, psychological studies, providing rich

Comments

Abstract:Free-text responses are commonly collected in psychological studies, providing rich qualitative insights that quantitative measures may not capture. Labeling curated topics of research interest in free-text data by multiple trained human coders is typically labor-intensive and time-consuming. Though large language models (LLMs) excel in language processing, LLM-assisted labeling techniques relying on closed-source LLMs cannot be directly applied to free-text data, without explicit consent for external use. In this study, we propose a framework of assembling locally-deployable LLMs to enhance the labeling of predetermined topics in free-text data under privacy constraints. Analogous to annotation by multiple human raters, this framework leverages the heterogeneity of diverse open-source LLMs. The ensemble approach seeks a balance between the agreement and disagreement across LLMs, guided by a relevancy scoring methodology that utilizes embedding distances between topic descriptions and LLMs' reasoning. We evaluated the ensemble approach using both publicly accessible Reddit data from eating disorder related forums, and free-text responses from eating disorder patients, both complemented by human annotations. We found that: (1) there is heterogeneity in the performance of labeling among same-sized LLMs, with some showing low sensitivity but high precision, while others exhibit high sensitivity but low precision. (2) Compared to individual LLMs, the ensemble of LLMs achieved the highest accuracy and optimal precision-sensitivity trade-off in predicting human annotations. (3) The relevancy scores across LLMs showed greater agreement than dichotomous labels, indicating that the relevancy scoring method effectively mitigates the heterogeneity in LLMs' labeling.

40. 【2501.08406】OptiChat: Bridging Optimization Models and Practitioners with Large Language Models

Link: https://arxiv.org/abs/2501.08406

Authors: Hao Chen, Gonzalo Esteban Constante-Flores, Krishna Sri Ipsit Mantri, Sai Madhukiran Kompalli, Akshdeep Singh Ahluwalia, Can Li

Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Machine Learning (cs.LG); Optimization and Control (math.OC)

Keywords: decision-making problems, Optimization models, applied to solve, solve a wide, wide variety

Comments

Abstract:Optimization models have been applied to solve a wide variety of decision-making problems. These models are usually developed by optimization experts but are used by practitioners without optimization expertise in various application domains. As a result, practitioners often struggle to interact with and draw useful conclusions from optimization models independently. To fill this gap, we introduce OptiChat, a natural language dialogue system designed to help practitioners interpret model formulation, diagnose infeasibility, analyze sensitivity, retrieve information, evaluate modifications, and provide counterfactual explanations. By augmenting large language models (LLMs) with functional calls and code generation tailored for optimization models, we enable seamless interaction and minimize the risk of hallucinations in OptiChat. We develop a new dataset to evaluate OptiChat's performance in explaining optimization models. Experiments demonstrate that OptiChat effectively bridges the gap between optimization models and practitioners, delivering autonomous, accurate, and instant responses.

41. 【2501.08365】Towards Best Practices for Open Datasets for LLM Training

Link: https://arxiv.org/abs/2501.08365

Authors: Stefan Baack, Stella Biderman, Kasia Odrozek, Aviya Skowron, Ayah Bdeir, Jillian Bommarito, Jennifer Ding, Maximilian Gahntz, Paul Keller, Pierre-Carl Langlais, Greg Lindahl, Sebastian Majstorovic, Nik Marda, Guilherme Penedo, Maarten Van Segbroeck, Jennifer Wang, Leandro von Werra, Mitchell Baker, Julie Belião, Kasia Chmielinski, Marzieh Fadaee, Lisa Gutermuth, Hynek Kydlíček, Greg Leppert, EM Lewis-Jong, Solana Larsen, Shayne Longpre, Angela Oduor Lungati, Cullen Miller, Victor Miller, Max Ryabinin, Kathleen Siminyu, Andrew Strait, Mark Surman, Anna Tumadóttir, Maurice Weber, Rebecca Weiss, Lee White, Thomas Wolf

Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: large language models, United States, training language models, copyright owners, large language

Comments

Abstract:Many AI companies are training their large language models (LLMs) on data without the permission of the copyright owners. The permissibility of doing so varies by jurisdiction: in countries like the EU and Japan, this is allowed under certain restrictions, while in the United States, the legal landscape is more ambiguous. Regardless of the legal status, concerns from creative producers have led to several high-profile copyright lawsuits, and the threat of litigation is commonly cited as a reason for the recent trend towards minimizing the information shared about training datasets by both corporate and public interest actors. This trend in limiting data information causes harm by hindering transparency, accountability, and innovation in the broader ecosystem by denying researchers, auditors, and impacted individuals access to the information needed to understand AI models. While this could be mitigated by training language models on open access and public domain data, at the time of writing, there are no such models (trained at a meaningful scale) due to the substantial technical and sociological challenges in assembling the necessary corpus. These challenges include incomplete and unreliable metadata, the cost and complexity of digitizing physical records, and the diverse set of legal and technical skills required to ensure relevance and responsibility in a quickly changing landscape. Building towards a future where AI systems can be trained on openly licensed data that is responsibly curated and governed requires collaboration across legal, technical, and policy domains, along with investments in metadata standards, digitization, and fostering a culture of openness.

42. 【2501.08335】MERaLiON-TextLLM: Cross-Lingual Understanding of Large Language Models in Chinese, Indonesian, Malay, and Singlish

Link: https://arxiv.org/abs/2501.08335

Authors: Xin Huang, Tarun Kumar Vangani, Minh Duc Pham, Xunlong Zou, Bin Wang, Zhengyuan Liu, Ai Ti Aw

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Multilingual large language, Multilingual large, shown impressive capabilities, shown impressive, Multilingual

Comments

Abstract:Multilingual large language models (MLLMs) have shown impressive capabilities across a variety of languages. However, efficacy can differ greatly between different language families, especially for those with limited linguistic resources. This report presents MERaLiON-TextLLM, a series of open-source language models specifically tailored to improve understanding and generation in Chinese, Indonesian, Malay, and Singlish. The initial released model is built on Llama-3-8B-Base and refined through a meticulously crafted process of continued pre-training and weight merging. Our approach achieves performance improvements across benchmarks in these languages, exceeding the capabilities of the official Llama-3 models. We provide the model checkpoints as a resource to support further research and development in cross-lingual language understanding.

43. 【2501.08421】SEAL: Speaker Error Correction using Acoustic-conditioned Large Language Models

Link: https://arxiv.org/abs/2501.08421

Authors: Anurag Kumar, Rohit Paturi, Amber Afshan, Sundararajan Srinivasan

Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)

Keywords: Speaker Diarization, ASR pipelines, component of modern, crucial component, Diarization

Comments: Accepted at ICASSP 2025

Abstract: Speaker Diarization (SD) is a crucial component of modern end-to-end ASR pipelines. Traditional SD systems, which are typically audio-based and operate independently of ASR, often introduce speaker errors, particularly during speaker transitions and overlapping speech. Recently, language models, including fine-tuned large language models (LLMs), have been shown to be effective as second-pass speaker error correctors by leveraging lexical context in the transcribed output. In this work, we introduce a novel acoustic conditioning approach to provide more fine-grained information from the acoustic diarizer to the LLM. We also show that a simpler constrained decoding strategy reduces LLM hallucinations, while avoiding complicated post-processing. Our approach significantly reduces the speaker error rates by 24-43% across Fisher, Callhome, and RT03-CTS datasets, compared to the first-pass Acoustic SD.

Information Retrieval

1. 【2501.08828】MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents

Link: https://arxiv.org/abs/2501.08828

Authors: Kuicai Dong, Yujing Chang, Xin Deik Goh, Dexun Li, Ruiming Tang, Yong Liu

Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multi-modal document retrieval, Multi-modal document, document retrieval, designed to identify, identify and retrieve

Comments: [this https URL](https://huggingface.co/MMDocIR)

Abstract: Multi-modal document retrieval is designed to identify and retrieve various forms of multi-modal content, such as figures, tables, charts, and layout information from extensive documents. Despite its significance, there is a notable lack of a robust benchmark to effectively evaluate the performance of systems in multi-modal document retrieval. To address this gap, this work introduces a new benchmark, named MMDocIR, encompassing two distinct tasks: page-level and layout-level retrieval. The former focuses on localizing the most relevant pages within a long document, while the latter targets the detection of specific layouts, offering a finer granularity than whole-page analysis. A layout can refer to a variety of elements such as textual paragraphs, equations, figures, tables, or charts. The MMDocIR benchmark comprises a rich dataset featuring expertly annotated labels for 1,685 questions and bootstrapped labels for 173,843 questions, making it a pivotal resource for advancing multi-modal document retrieval for both training and evaluation. Through rigorous experiments, we reveal that (i) visual retrievers significantly outperform their text counterparts, (ii) the MMDocIR train set can effectively benefit the training process of multi-modal document retrieval and (iii) text retrievers leveraging VLM-text perform much better than those using OCR-text. These findings underscore the potential advantages of integrating visual elements for multi-modal document retrieval.

2. 【2501.08717】$\texttt{InfoHier}$: Hierarchical Information Extraction via Encoding and Embedding

Link: https://arxiv.org/abs/2501.08717

Authors: Tianru Zhang, Li Ju, Prashant Singh, Salman Toor

Subjects: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Analyzing large-scale datasets, Analyzing large-scale, large-scale datasets, Analyzing, SSL

Comments: 10 pages, 4 figures

Abstract:Analyzing large-scale datasets, especially involving complex and high-dimensional data like images, is particularly challenging. While self-supervised learning (SSL) has proven effective for learning representations from unlabelled data, it typically focuses on flat, non-hierarchical structures, missing the multi-level relationships present in many real-world datasets. Hierarchical clustering (HC) can uncover these relationships by organizing data into a tree-like structure, but it often relies on rigid similarity metrics that struggle to capture the complexity of diverse data types. To address these we envision $\texttt{InfoHier}$, a framework that combines SSL with HC to jointly learn robust latent representations and hierarchical structures. This approach leverages SSL to provide adaptive representations, enhancing HC's ability to capture complex patterns. Simultaneously, it integrates HC loss to refine SSL training, resulting in representations that are more attuned to the underlying information hierarchy. $\texttt{InfoHier}$ has the potential to improve the expressiveness and performance of both clustering and representation learning, offering significant benefits for data analysis, management, and information retrieval.

3. 【2501.08695】Real-time Indexing for Large-scale Recommendation by Streaming Vector Quantization Retriever

Link: https://arxiv.org/abs/2501.08695

Authors: Xingyan Bin, Jianfei Cui, Wujie Yan, Zhichen Zhao, Xintian Han, Chongyang Yan, Feng Zhang, Xun Zhou, Qi Wu, Zuotao Liu

Subjects: Information Retrieval (cs.IR)

Keywords: strict latency limitations, important recommendation stages, latency limitations, recommendation stages, important recommendation

Comments

Abstract:Retrievers, which form one of the most important recommendation stages, are responsible for efficiently selecting possible positive samples to the later stages under strict latency limitations. Because of this, large-scale systems always rely on approximate calculations and indexes to roughly shrink candidate scale, with a simple ranking model. Considering simple models lack the ability to produce precise predictions, most of the existing methods mainly focus on incorporating complicated ranking models. However, another fundamental problem of index effectiveness remains unresolved, which also bottlenecks complication. In this paper, we propose a novel index structure: streaming Vector Quantization model, as a new generation of retrieval paradigm. Streaming VQ attaches items with indexes in real time, granting it immediacy. Moreover, through meticulous verification of possible variants, it achieves additional benefits like index balancing and reparability, enabling it to support complicated ranking models as existing approaches. As a lightweight and implementation-friendly architecture, streaming VQ has been deployed and replaced all major retrievers in Douyin and Douyin Lite, resulting in remarkable user engagement gain.

4. 【2501.08686】Knowledge Graph-based Retrieval-Augmented Generation for Schema Matching

Link: https://arxiv.org/abs/2501.08686

Authors: Chuangtao Ma, Sriom Chakrabarti, Arijit Khan, Bálint Molnár

Subjects: Databases (cs.DB); Computation and Language (cs.CL); Information Retrieval (cs.IR)

Keywords: Traditional similarity-based schema, resolving semantic ambiguities, schema matching, Traditional similarity-based, similarity-based schema matching

Comments: Under Review

Abstract:Traditional similarity-based schema matching methods are incapable of resolving semantic ambiguities and conflicts in domain-specific complex mapping scenarios due to missing commonsense and domain-specific knowledge. The hallucination problem of large language models (LLMs) also makes it challenging for LLM-based schema matching to address the above issues. Therefore, we propose a Knowledge Graph-based Retrieval-Augmented Generation model for Schema Matching, referred to as the KG-RAG4SM. In particular, KG-RAG4SM introduces novel vector-based, graph traversal-based, and query-based graph retrievals, as well as a hybrid approach and ranking schemes that identify the most relevant subgraphs from external large knowledge graphs (KGs). We showcase that KG-based retrieval-augmented LLMs are capable of generating more accurate results for complex matching cases without any re-training. Our experimental results show that KG-RAG4SM outperforms the LLM-based state-of-the-art (SOTA) methods (e.g., Jellyfish-8B) by 35.89% and 30.50% in terms of precision and F1 score on the MIMIC dataset, respectively; KG-RAG4SM with GPT-4o-mini outperforms the pre-trained language model (PLM)-based SOTA methods (e.g., SMAT) by 69.20% and 21.97% in terms of precision and F1 score on the Synthea dataset, respectively. The results also demonstrate that our approach is more efficient in end-to-end schema matching, and scales to retrieve from large KGs. Our case studies on the dataset from the real-world schema matching scenario exhibit that the hallucination problem of LLMs for schema matching is well mitigated by our solution.

5. 【2501.08572】DNMDR: Dynamic Networks and Multi-view Drug Representations for Safe Medication Recommendation

Link: https://arxiv.org/abs/2501.08572

Authors: Guanlin Liu, Xiaomei Yu, Zihao Liu, Xue Li, Xingxu Fan, Xiangwei Zheng

Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR)

Keywords: promising research topic, booms diverse applications, diverse medical events, clinical domains, promising research

Comments

Abstract:Medication Recommendation (MR) is a promising research topic which booms diverse applications in the healthcare and clinical domains. However, existing methods mainly rely on sequential modeling and static graphs for representation learning, which ignore the dynamic correlations in diverse medical events of a patient's temporal visits, leading to insufficient global structural exploration on nodes. Additionally, mitigating drug-drug interactions (DDIs) is another issue determining the utility of the MR systems. To address the challenges mentioned above, this paper proposes a novel MR method with the integration of dynamic networks and multi-view drug representations (DNMDR). Specifically, weighted snapshot sequences for dynamic heterogeneous networks are constructed based on discrete visits in temporal EHRs, and all the dynamic networks are jointly trained to gain both structural correlations in diverse medical events and temporal dependency in historical health conditions, for achieving comprehensive patient representations with both semantic features and structural relationships. Moreover, combining the drug co-occurrences and adverse drug-drug interactions (DDIs) in internal view of drug molecule structure and interactive view of drug pairs, the safe drug representations are available to obtain high-quality medication combination recommendation. Finally, extensive experiments on real world datasets are conducted for performance evaluation, and the experimental results demonstrate that the proposed DNMDR method outperforms the state-of-the-art baseline models with a large margin on various metrics such as PRAUC, Jaccard, DDI rates and so on.

6. 【2501.08927】Continuous Approach to Phase (Norm) Retrieval Frames

Link: https://arxiv.org/abs/2501.08927

Authors: Ramin Farshchian, Rajab Ali Kamyabi-Gol, Fahimeh Arabyani-Neyshaburi, Fatemeh Esmaeelzadeh

Subjects: Functional Analysis (math.FA); Information Retrieval (cs.IR); Mathematical Physics (math-ph); Numerical Analysis (math.NA); Optics (physics.optics)

Keywords: paper investigates, norm retrieval, Hilbert spaces, retrieval, phase retrieval

Comments

Abstract:This paper investigates the properties of continuous frames, with a particular focus on phase retrieval and norm retrieval in the context of Hilbert spaces. We introduce the concept of continuous near-Riesz bases and prove their invariance under invertible operators. Some equivalent conditions for phase and norm retrieval property of continuous frames are presented. We study the stability of phase retrieval under perturbations. Furthermore, tensor product frames for separable Hilbert spaces are studied, and we establish the equivalence of phase retrieval and norm retrieval properties between components and their tensor products.

Computer Vision

1. 【2501.09019】Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion

Link: https://arxiv.org/abs/2501.09019

Authors: Jingyuan Chen, Fuchen Long, Jie An, Zhaofan Qiu, Ting Yao, Jiebo Luo, Tao Mei

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recently emerged, effective approach, approach for tuning-free, tuning-free long video, frames

Comments

Abstract:The first-in-first-out (FIFO) video diffusion, built on a pre-trained text-to-video model, has recently emerged as an effective approach for tuning-free long video generation. This technique maintains a queue of video frames with progressively increasing noise, continuously producing clean frames at the queue's head while Gaussian noise is enqueued at the tail. However, FIFO-Diffusion often struggles to keep long-range temporal consistency in the generated videos due to the lack of correspondence modeling across frames. In this paper, we propose Ouroboros-Diffusion, a novel video denoising framework designed to enhance structural and content (subject) consistency, enabling the generation of consistent videos of arbitrary length. Specifically, we introduce a new latent sampling technique at the queue tail to improve structural consistency, ensuring perceptually smooth transitions among frames. To enhance subject consistency, we devise a Subject-Aware Cross-Frame Attention (SACFA) mechanism, which aligns subjects across frames within short segments to achieve better visual coherence. Furthermore, we introduce self-recurrent guidance. This technique leverages information from all previous cleaner frames at the front of the queue to guide the denoising of noisier frames at the end, fostering rich and contextual global information interaction. Extensive experiments of long video generation on the VBench benchmark demonstrate the superiority of our Ouroboros-Diffusion, particularly in terms of subject consistency, motion smoothness, and temporal consistency.
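
The FIFO queue described in the abstract can be pictured with a small bookkeeping sketch. This is not the authors' implementation and involves no diffusion model: it only tracks, under assumed integer noise levels, how frames move from the noisy tail to the clean head. The function name `fifo_schedule` and the (frame_id, noise_level) representation are hypothetical.

```python
from collections import deque


def fifo_schedule(queue_len: int, num_outputs: int):
    """Simulate noise-level bookkeeping of a FIFO denoising queue.

    Each entry is (frame_id, noise_level); level 0 means a clean frame.
    Returns the ids of emitted clean frames and the final queue contents.
    """
    # Noise increases from the head (level 1) to the tail (level queue_len).
    queue = deque((fid, fid + 1) for fid in range(queue_len))
    next_id = queue_len
    emitted = []
    while len(emitted) < num_outputs:
        # One denoising step lowers every frame's noise level by one.
        queue = deque((fid, level - 1) for fid, level in queue)
        # The head frame is now clean (level 0) and leaves the queue.
        fid, _ = queue.popleft()
        emitted.append(fid)
        # A fresh, fully noisy frame is enqueued at the tail.
        queue.append((next_id, queue_len))
        next_id += 1
    return emitted, list(queue)
```

Because one frame leaves and one enters per step, the queue length stays fixed while the number of emitted frames can grow without bound, which is what makes arbitrary-length generation possible in this scheme.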

2. 【2501.09012】Multimodal LLMs Can Reason about Aesthetics in Zero-Shot

Link: https://arxiv.org/abs/2501.09012

Authors: Ruixiang Jiang, Changwen Chen

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)

Keywords: Multimodal LLMs', elicited to evaluate, Multimodal, Abstract, reasoning ability

Comments: WIP, Homepage [this https URL](https://github.com/songrise/MLLM4Art)

Abstract:We present the first study on how Multimodal LLMs' (MLLMs) reasoning ability shall be elicited to evaluate the aesthetics of artworks. To facilitate this investigation, we construct MM-StyleBench, a novel high-quality dataset for benchmarking artistic stylization. We then develop a principled method for human preference modeling and perform a systematic correlation analysis between MLLMs' responses and human preference. Our experiments reveal an inherent hallucination issue of MLLMs in art evaluation, associated with response subjectivity. ArtCoT is proposed, demonstrating that art-specific task decomposition and the use of concrete language boost MLLMs' reasoning ability for aesthetics. Our findings offer valuable insights into MLLMs for art and can benefit a wide range of downstream applications, such as style transfer and artistic image generation. Code available at this https URL.

3. 【2501.09008】SimGen: A Diffusion-Based Framework for Simultaneous Surgical Image and Segmentation Mask Generation

Link: https://arxiv.org/abs/2501.09008

Authors: Aditya Bhat, Rupak Bose, Chinedu Innocent Nwoye, Nicolas Padoy

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: significant expert involvement, requiring significant expert, Acquiring and annotating, annotating surgical data, ethical constraining

Comments: 12 pages, 17 figures, 4 tables, project page at [this https URL](https://camma-public.github.io/endogen/)

Abstract: Acquiring and annotating surgical data is often resource-intensive and ethically constrained, and requires significant expert involvement. While generative AI models like text-to-image can alleviate data scarcity, incorporating spatial annotations, such as segmentation masks, is crucial for precision-driven surgical applications, simulation, and education. This study introduces both a novel task and method, SimGen, for Simultaneous Image and Mask Generation. SimGen is a diffusion model based on the DDPM framework and Residual U-Net, designed to jointly generate high-fidelity surgical images and their corresponding segmentation masks. The model leverages cross-correlation priors to capture dependencies between continuous image and discrete mask distributions. Additionally, a Canonical Fibonacci Lattice (CFL) is employed to enhance class separability and uniformity in the RGB space of the masks. SimGen delivers high-fidelity images and accurate segmentation masks, outperforming baselines across six public datasets assessed on image and semantic inception distance metrics. An ablation study shows that the CFL improves mask quality and spatial separation. Downstream experiments suggest generated image-mask pairs are usable if regulations limit human data release for research. This work offers a cost-effective solution for generating paired surgical images and complex labels, advancing surgical AI development by reducing the need for expensive manual annotations.

4. 【2501.08994】RepVideo: Rethinking Cross-Layer Representation for Video Generation

Link: https://arxiv.org/abs/2501.08994

Authors: Chenyang Si, Weichen Fan, Zhengyao Lv, Ziqi Huang, Yu Qiao, Ziwei Liu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: achieved remarkable progress, Video generation, video generation process, achieved remarkable, remarkable progress

Comments: Project page: [this https URL](https://vchitect.github.io/RepVid-Webpage)

Abstract:Video generation has achieved remarkable progress with the introduction of diffusion models, which have significantly improved the quality of generated videos. However, recent research has primarily focused on scaling up model training, while offering limited insights into the direct impact of representations on the video generation process. In this paper, we initially investigate the characteristics of features in intermediate layers, finding substantial variations in attention maps across different layers. These variations lead to unstable semantic representations and contribute to cumulative differences between features, which ultimately reduce the similarity between adjacent frames and negatively affect temporal coherence. To address this, we propose RepVideo, an enhanced representation framework for text-to-video diffusion models. By accumulating features from neighboring layers to form enriched representations, this approach captures more stable semantic information. These enhanced representations are then used as inputs to the attention mechanism, thereby improving semantic expressiveness while ensuring feature consistency across adjacent frames. Extensive experiments demonstrate that our RepVideo not only significantly enhances the ability to generate accurate spatial appearances, such as capturing complex spatial relationships between multiple objects, but also improves temporal consistency in video generation.

5. 【2501.08983】CityDreamer4D: Compositional Generative Model of Unbounded 4D Cities

Link: https://arxiv.org/abs/2501.08983

Authors: Haozhe Xie, Zhaoxi Chen, Fangzhou Hong, Ziwei Liu

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: made significant progress, garnered growing attention, significant progress, garnered growing, growing attention

Comments

Abstract:3D scene generation has garnered growing attention in recent years and has made significant progress. Generating 4D cities is more challenging than 3D scenes due to the presence of structurally complex, visually diverse objects like buildings and vehicles, and heightened human sensitivity to distortions in urban environments. To tackle these issues, we propose CityDreamer4D, a compositional generative model specifically tailored for generating unbounded 4D cities. Our main insights are 1) 4D city generation should separate dynamic objects (e.g., vehicles) from static scenes (e.g., buildings and roads), and 2) all objects in the 4D scene should be composed of different types of neural fields for buildings, vehicles, and background stuff. Specifically, we propose Traffic Scenario Generator and Unbounded Layout Generator to produce dynamic traffic scenarios and static city layouts using a highly compact BEV representation. Objects in 4D cities are generated by combining stuff-oriented and instance-oriented neural fields for background stuff, buildings, and vehicles. To suit the distinct characteristics of background stuff and instances, the neural fields employ customized generative hash grids and periodic positional embeddings as scene parameterizations. Furthermore, we offer a comprehensive suite of datasets for city generation, including OSM, GoogleEarth, and CityTopia. The OSM dataset provides a variety of real-world city layouts, while the Google Earth and CityTopia datasets deliver large-scale, high-quality city imagery complete with 3D instance annotations. Leveraging its compositional design, CityDreamer4D supports a range of downstream applications, such as instance editing, city stylization, and urban simulation, while delivering state-of-the-art performance in generating realistic 4D cities.

6. 【2501.08982】CityLoc: 6 DoF Localization of Text Descriptions in Large-Scale Scenes with Gaussian Representation

Link: https://arxiv.org/abs/2501.08982

Authors: Qi Ma, Runyi Yang, Bin Ren, Ender Konukoglu, Luc Van Gool, Danda Pani Paudel

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Localizing text descriptions, Localizing text, text descriptions, code

Comments

Abstract:Localizing text descriptions in large-scale 3D scenes is inherently an ambiguous task. This nonetheless arises while describing general concepts, e.g. all traffic lights in a city. To facilitate reasoning based on such concepts, text localization in the form of distribution is required. In this paper, we generate the distribution of the camera poses conditioned upon the textual description. To facilitate such generation, we propose a diffusion-based architecture that conditionally diffuses the noisy 6DoF camera poses to their plausible locations. The conditional signals are derived from the text descriptions, using the pre-trained text encoders. The connection between text descriptions and pose distribution is established through pretrained Vision-Language-Model, i.e. CLIP. Furthermore, we demonstrate that the candidate poses for the distribution can be further refined by rendering potential poses using 3D Gaussian splatting, guiding incorrectly posed samples towards locations that better align with the textual description, through visual reasoning. We demonstrate the effectiveness of our method by comparing it with both standard retrieval methods and learning-based approaches. Our proposed method consistently outperforms these baselines across all five large-scale datasets. Our source code and dataset will be made publicly available.

7. 【2501.08962】An analysis of data variation and bias in image-based dermatological datasets for machine learning classification

Link: https://arxiv.org/abs/2501.08962

Authors: Francisco Mauro,Emanoel Thyago,Othon Vinicius,Rodrigo Abreu,Kelvin Cunha,José Gabriel,Rafael Barros,Thales Bezerra,Manoel Henriques,Natalia Lopes,Érico Moutinho,Jéssica Guido,Tsang Ing Ren,Paulo Borba

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: professionals in healthcare, valuable in aiding, aiding professionals, clinical, critical decision demands

Comments: 10 pages, 1 figure

Click to view abstract

Abstract:AI algorithms have become valuable in aiding professionals in healthcare. The increasing confidence obtained by these models is helpful in critical decision demands. In clinical dermatology, classification models can detect malignant lesions on patients' skin using only RGB images as input. However, most learning-based methods employ data acquired from dermoscopic datasets on training, which are large and validated by a gold standard. Clinical models aim to deal with classification on users' smartphone cameras that do not contain the corresponding resolution provided by dermoscopy. Also, clinical applications bring new challenges. It can contain captures from uncontrolled environments, skin tone variations, viewpoint changes, noises in data and labels, and unbalanced classes. A possible alternative would be to use transfer learning to deal with the clinical images. However, as the number of samples is low, it can cause degradations on the model's performance; the source distribution used in training differs from the test set. This work aims to evaluate the gap between dermoscopic and clinical samples and understand how the dataset variations impact training. It assesses the main differences between distributions that disturb the model's prediction. Finally, from experiments on different architectures, we argue how to combine the data from divergent distributions, decreasing the impact on the model's final accuracy.

8. 【2501.08931】Visual WetlandBirds Dataset: Bird Species Identification and Behavior Recognition in Videos

Link: https://arxiv.org/abs/2501.08931

Authors: Javier Rodriguez-Juan,David Ortiz-Perez,Manuel Benavent-Lledo,David Mulero-Pérez,Pablo Ruiz-Ponce,Adrian Orihuela-Torres,Jose Garcia-Rodriguez,Esther Sebastián-González

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: loss crisis makes, crisis makes animal, current biodiversity loss, biodiversity loss crisis, makes animal monitoring

Comments:

Click to view abstract

Abstract:The current biodiversity loss crisis makes animal monitoring a relevant field of study. In light of this, data collected through monitoring can provide essential insights, and information for decision-making aimed at preserving global biodiversity. Despite the importance of such data, there is a notable scarcity of datasets featuring videos of birds, and none of the existing datasets offer detailed annotations of bird behaviors in video format. In response to this gap, our study introduces the first fine-grained video dataset specifically designed for bird behavior detection and species classification. This dataset addresses the need for comprehensive bird video datasets and provides detailed data on bird actions, facilitating the development of deep learning models to recognize these, similar to the advancements made in human action recognition. The proposed dataset comprises 178 videos recorded in Spanish wetlands, capturing 13 different bird species performing 7 distinct behavior classes. In addition, we also present baseline results using state of the art models on two tasks: bird behavior recognition and species classification.

9. 【2501.08924】Learning Joint Denoising, Demosaicing, and Compression from the Raw Natural Image Noise Dataset

Link: https://arxiv.org/abs/2501.08924

Authors: Benoit Brummer,Christophe De Vleeschouwer

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: Image Noise Dataset, Natural Image Noise, Noise Dataset, Raw Natural Image, Raw Natural

Comments:

Click to view abstract

Abstract:This paper introduces the Raw Natural Image Noise Dataset (RawNIND), a diverse collection of paired raw images designed to support the development of denoising models that generalize across sensors, image development workflows, and styles. Two denoising methods are proposed: one operates directly on raw Bayer data, leveraging computational efficiency, while the other processes linear RGB images for improved generalization to different sensors, with both preserving flexibility for subsequent development. Both methods outperform traditional approaches which rely on developed images. Additionally, the integration of denoising and compression at the raw data level significantly enhances rate-distortion performance and computational efficiency. These findings suggest a paradigm shift toward raw data workflows for efficient and flexible image processing.

10. 【2501.08912】Empowering Agricultural Insights: RiceLeafBD - A Novel Dataset and Optimal Model Selection for Rice Leaf Disease Diagnosis through Transfer Learning Technique

Link: https://arxiv.org/abs/2501.08912

Authors: Sadia Afrin Rimi,Md. Jalal Uddin Chowdhury,Rifat Abdullah,Iftekhar Ahmed,Mahrima Akter Mim,Mohammad Shoaib Rahman

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: lush greenery, daily basis, number of people, people living, agricultural nation

Comments:

Click to view abstract

Abstract:The number of people living in this agricultural nation of ours, which is surrounded by lush greenery, is growing on a daily basis. As a result of this, the level of arable land is decreasing, as well as residential houses and industrial factories. The food crisis is becoming the main threat for us in the upcoming days. Because on the one hand, the population is increasing, and on the other hand, the amount of food crop production is decreasing due to the attack of diseases. Rice is one of the most significant cultivated crops since it provides food for more than half of the world's population. Bangladesh is dependent on rice (Oryza sativa) as a vital crop for its agriculture, but it faces a significant problem as a result of the ongoing decline in rice yield brought on by common diseases. Early disease detection is the main difficulty in rice crop cultivation. In this paper, we proposed our own dataset, which was collected from the Bangladesh field, and also applied deep learning and transfer learning models for the evaluation of the datasets. We elaborately explain our dataset and also give direction for further research work to serve society using this dataset. We applied a light CNN model and pre-trained InceptionNet-V2, EfficientNet-V2, and MobileNet-V2 models, which achieved 91.5% performance for the EfficientNet-V2 model of this work. The results obtained assaulted other models and even exceeded approaches that are considered to be part of the state of the art. It has been demonstrated by this study that it is possible to precisely and effectively identify diseases that affect rice leaves using this unbiased datasets. After analysis of the performance of different models, the proposed datasets are significant for the society for research work to provide solutions for decreasing rice leaf disease.

11. 【2501.08910】Lights, Camera, Matching: The Role of Image Illumination in Fair Face Recognition

Link: https://arxiv.org/abs/2501.08910

Authors: Gabriella Pangelinan,Grace Bezold,Haiyu Wu,Michael C. King,Kevin W. Bowyer

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: quality factor impacting, key image quality, image quality factor, recognition accuracy differentials, factor impacting face

Comments: 14 pages, 11 figures, Conference submission

Click to view abstract

Abstract:Facial brightness is a key image quality factor impacting face recognition accuracy differentials across demographic groups. In this work, we aim to decrease the accuracy gap between the similarity score distributions for Caucasian and African American female mated image pairs, as measured by d' between distributions. To balance brightness across demographic groups, we conduct three experiments, interpreting brightness in the face skin region either as median pixel value or as the distribution of pixel values. Balancing based on median brightness alone yields up to a 46.8% decrease in d', while balancing based on brightness distribution yields up to a 57.6% decrease. In all three cases, the similarity scores of the individual distributions improve, with mean scores maximally improving 5.9% for Caucasian females and 3.7% for African American females.
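For readers unfamiliar with d', here is a minimal sketch of one common definition: the absolute gap between two distribution means divided by the root of their average variance. The abstract does not spell out the exact formula the authors use, so treat this definition, the function name `d_prime`, and the toy scores as assumptions for illustration only.

```python
import math
import statistics


def d_prime(scores_a, scores_b):
    """d' between two samples: |mean gap| over the root of the average variance."""
    mean_a = statistics.fmean(scores_a)
    mean_b = statistics.fmean(scores_b)
    # Population variance of each sample (an assumption; sample variance also appears in practice).
    var_a = statistics.pvariance(scores_a)
    var_b = statistics.pvariance(scores_b)
    return abs(mean_a - mean_b) / math.sqrt((var_a + var_b) / 2)


# Hypothetical similarity scores for two groups of mated image pairs.
group_a = [0.60, 0.70, 0.80]
group_b = [0.40, 0.50, 0.60]
print(round(d_prime(group_a, group_b), 3))
```

Under this definition, a smaller d' means the two score distributions overlap more, which is the direction the reported decreases point in.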

12. 【2501.08900】Enhanced Multi-Scale Cross-Attention for Person Image Generation

Link: https://arxiv.org/abs/2501.08900

Authors: Hao Tang,Ling Shao,Nicu Sebe,Luc Van Gool

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: generative adversarial network, generative adversarial, adversarial network, image generation task, GAN

Comments: Accepted to TPAMI, an extended version of a paper published in ECCV2020. arXiv admin note: substantial text overlap with [arXiv:2007.09278](https://arxiv.org/abs/2007.09278)

Click to view abstract

Abstract:In this paper, we propose a novel cross-attention-based generative adversarial network (GAN) for the challenging person image generation task. Cross-attention is a novel and intuitive multi-modal fusion method in which an attention/correlation matrix is calculated between two feature maps of different modalities. Specifically, we propose the novel XingGAN (or CrossingGAN), which consists of two generation branches that capture the person's appearance and shape, respectively. Moreover, we propose two novel cross-attention blocks to effectively transfer and update the person's shape and appearance embeddings for mutual improvement. This has not been considered by any other existing GAN-based image generation work. To further learn the long-range correlations between different person poses at different scales and sub-regions, we propose two novel multi-scale cross-attention blocks. To tackle the issue of independent correlation computations within the cross-attention mechanism leading to noisy and ambiguous attention weights, which hinder performance improvements, we propose a module called enhanced attention (EA). Lastly, we introduce a novel densely connected co-attention module to fuse appearance and shape features at different stages effectively. Extensive experiments on two public datasets demonstrate that the proposed method outperforms current GAN-based methods and performs on par with diffusion-based methods. However, our method is significantly faster than diffusion-based methods in both training and inference.
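The generic idea of cross-attention mentioned in the abstract, an attention/correlation matrix computed between two feature maps, can be sketched as below. This is plain scaled dot-product cross-attention in pure Python, not the XingGAN blocks themselves; the names `cross_attention`, `appearance`, and `shape` and the toy features are assumptions.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """queries: n x d from one modality; keys (m x d) and values (m x dv) from the other."""
    scale = math.sqrt(len(queries[0]))
    # Correlation matrix (n x m): scaled dot product of every query with every key.
    weights = [
        softmax([sum(q * k for q, k in zip(q_row, k_row)) / scale for k_row in keys])
        for q_row in queries
    ]
    # Each output row is a weighted mix of the other modality's value rows.
    fused = [
        [sum(w * v_row[j] for w, v_row in zip(w_row, values)) for j in range(len(values[0]))]
        for w_row in weights
    ]
    return weights, fused


# Hypothetical flattened feature maps: 2 appearance positions, 3 shape positions, 2 channels.
appearance = [[1.0, 0.0], [0.0, 1.0]]
shape = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, fused = cross_attention(appearance, shape, shape)
print(len(weights), len(weights[0]), len(fused), len(fused[0]))
```

Here the appearance features act as queries and the shape features as keys and values; swapping the roles gives the update in the other direction.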

13. 【2501.08885】Feature-based One-For-All: A Universal Framework for Heterogeneous Knowledge Distillation

Link: https://arxiv.org/abs/2501.08885

Authors: Jhe-Hao Lin,Yi Yao,Chan-Feng Hsu,Hongxia Xie,Hong-Han Shuai,Wen-Huang Cheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: involves transferring knowledge, maintaining comparable effectiveness, pre-trained heavy teacher, transferring knowledge, Convolutional Neural Networks

Comments:

Click to view abstract

Abstract:Knowledge distillation (KD) involves transferring knowledge from a pre-trained heavy teacher model to a lighter student model, thereby reducing the inference cost while maintaining comparable effectiveness. Prior KD techniques typically assume homogeneity between the teacher and student models. However, as technology advances, a wide variety of architectures have emerged, ranging from initial Convolutional Neural Networks (CNNs) to Vision Transformers (ViTs), and Multi-Level Perceptrons (MLPs). Consequently, developing a universal KD framework compatible with any architecture has become an important research topic. In this paper, we introduce a feature-based one-for-all (FOFA) KD framework to enable feature distillation across diverse architecture. Our framework comprises two key components. First, we design prompt tuning blocks that incorporate student feedback, allowing teacher features to adapt to the student model's learning process. Second, we propose region-aware attention to mitigate the view mismatch problem between heterogeneous architecture. By leveraging these two modules, effective distillation of intermediate features can be achieved across heterogeneous architectures. Extensive experiments on CIFAR, ImageNet, and COCO demonstrate the superiority of the proposed method.

14. 【2501.08861】Generative Planning with 3D-vision Language Pre-training for End-to-End Autonomous Driving

Link: https://arxiv.org/abs/2501.08861

Authors: Tengpeng Li,Hanli Wang,Xianfei Li,Wenlong Liao,Tao He,Pai Peng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: safe trajectory planning, task that requires, requires perceiving, surrounding environment, environment for safe

Comments:

Click to view abstract

Abstract:Autonomous driving is a challenging task that requires perceiving and understanding the surrounding environment for safe trajectory planning. While existing vision-based end-to-end models have achieved promising results, these methods are still facing the challenges of vision understanding, decision reasoning and scene generalization. To solve these issues, a generative planning with 3D-vision language pre-training model named GPVL is proposed for end-to-end autonomous driving. The proposed paradigm has two significant aspects. On one hand, a 3D-vision language pre-training module is designed to bridge the gap between visual perception and linguistic understanding in the bird's eye view. On the other hand, a cross-modal language model is introduced to generate holistic driving decisions and fine-grained trajectories with perception and navigation information in an auto-regressive manner. Experiments on the challenging nuScenes dataset demonstrate that the proposed scheme achieves excellent performances compared with state-of-the-art methods. Besides, the proposed GPVL presents strong generalization ability and real-time potential when handling high-level commands in various scenarios. It is believed that the effective, robust and efficient performance of GPVL is crucial for the practical application of future autonomous driving systems. Code is available at this https URL

15. 【2501.08841】Exploring Task-Level Optimal Prompts for Visual In-Context Learning

Link: https://arxiv.org/abs/2501.08841

Authors: Yan Zhu,Huan Ma,Changqing Zhang

Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Visual In-Context Learning, Vision Foundation Models, Vision Foundation, Visual In-Context, In-Context Learning

Comments:

Click to view abstract

Abstract:With the development of Vision Foundation Models (VFMs) in recent years, Visual In-Context Learning (VICL) has become a better choice compared to modifying models in most scenarios. Different from retraining or fine-tuning model, VICL does not require modifications to the model's weights or architecture, and only needs a prompt with demonstrations to teach VFM how to solve tasks. Currently, significant computational cost for finding optimal prompts for every test sample hinders the deployment of VICL, as determining which demonstrations to use for constructing prompts is very costly. In this paper, however, we find a counterintuitive phenomenon that most test samples actually achieve optimal performance under the same prompts, and searching for sample-level prompts only costs more time but results in completely identical prompts. Therefore, we propose task-level prompting to reduce the cost of searching for prompts during the inference stage and introduce two time-saving yet effective task-level prompt search strategies. Extensive experimental results show that our proposed method can identify near-optimal prompts and reach the best VICL performance with a minimal cost that prior work has never achieved.

16. 【2501.08837】MANTA: Diffusion Mamba for Efficient and Effective Stochastic Long-Term Dense Anticipation

Link: https://arxiv.org/abs/2501.08837

Authors: Olga Zatsarynna,Emad Bahrami,Yazan Abu Farha,Gianpiero Francesca,Juergen Gall

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: addresses the problem, stochastic long-term dense, long-term dense anticipation, future, dense anticipation

Comments:

Click to view abstract

Abstract:Our work addresses the problem of stochastic long-term dense anticipation. The goal of this task is to predict actions and their durations several minutes into the future based on provided video observations. Anticipation over extended horizons introduces high uncertainty, as a single observation can lead to multiple plausible future outcomes. To address this uncertainty, stochastic models are designed to predict several potential future action sequences. Recent work has further proposed to incorporate uncertainty modelling for observed frames by simultaneously predicting per-frame past and future actions in a unified manner. While such joint modelling of actions is beneficial, it requires long-range temporal capabilities to connect events across distant past and future time points. However, the previous work struggles to achieve such a long-range understanding due to its limited and/or sparse receptive field. To alleviate this issue, we propose a novel MANTA (MAmba for ANTicipation) network. Our model enables effective long-term temporal modelling even for very long sequences while maintaining linear complexity in sequence length. We demonstrate that our approach achieves state-of-the-art results on three datasets - Breakfast, 50Salads, and Assembly101 - while also significantly improving computational and memory efficiency.

17. 【2501.08828】MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents

Link: https://arxiv.org/abs/2501.08828

Authors: Kuicai Dong,Yujing Chang,Xin Deik Goh,Dexun Li,Ruiming Tang,Yong Liu

Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multi-modal document retrieval, Multi-modal document, document retrieval, designed to identify, identify and retrieve

Comments: [this https URL](https://huggingface.co/MMDocIR)

Click to view abstract

Abstract:Multi-modal document retrieval is designed to identify and retrieve various forms of multi-modal content, such as figures, tables, charts, and layout information from extensive documents. Despite its significance, there is a notable lack of a robust benchmark to effectively evaluate the performance of systems in multi-modal document retrieval. To address this gap, this work introduces a new benchmark, named as MMDocIR, encompassing two distinct tasks: page-level and layout-level retrieval. The former focuses on localizing the most relevant pages within a long document, while the latter targets the detection of specific layouts, offering a more fine-grained granularity than whole-page analysis. A layout can refer to a variety of elements such as textual paragraphs, equations, figures, tables, or charts. The MMDocIR benchmark comprises a rich dataset featuring expertly annotated labels for 1,685 questions and bootstrapped labels for 173,843 questions, making it a pivotal resource for advancing multi-modal document retrieval for both training and evaluation. Through rigorous experiments, we reveal that (i) visual retrievers significantly outperform their text counterparts, (ii) MMDocIR train set can effectively benefit the training process of multi-modal document retrieval and (iii) text retrievers leveraging on VLM-text perform much better than those using OCR-text. These findings underscores the potential advantages of integrating visual elements for multi-modal document retrieval.

18. 【2501.08816】IDEA: Image Description Enhanced CLIP-Adapter

Link: https://arxiv.org/abs/2501.08816

Authors: Zhipeng Ye,Feng Jiang,Qiufeng Wang,Kaizhu Huang,Jiaqi Huang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Contrastive Language-Image Pre-training, attained great success, Contrastive Language-Image, Language-Image Pre-training, attained great

Comments:

Click to view abstract

Abstract:CLIP (Contrastive Language-Image Pre-training) has attained great success in pattern recognition and computer vision. Transferring CLIP to downstream tasks (e.g. zero- or few-shot classification) is a hot topic in multimodal learning. However, current studies primarily focus on either prompt learning for text or adapter tuning for vision, without fully exploiting the complementary information and correlations among image-text pairs. In this paper, we propose an Image Description Enhanced CLIP-Adapter (IDEA) method to adapt CLIP to few-shot image classification tasks. This method captures fine-grained features by leveraging both visual features and textual descriptions of images. IDEA is a training-free method for CLIP, and it can be comparable to or even exceeds state-of-the-art models on multiple tasks. Furthermore, we introduce Trainable-IDEA (T-IDEA), which extends IDEA by adding two lightweight learnable components (i.e., a projector and a learnable latent space), further enhancing the model's performance and achieving SOTA results on 11 datasets. As one important contribution, we employ the Llama model and design a comprehensive pipeline to generate textual descriptions for images of 11 datasets, resulting in a total of 1,637,795 image-text pairs, named "IMD-11". Our code and data are released at this https URL.

19. 【2501.08815】Human Pose-Constrained UV Map Estimation

Link: https://arxiv.org/abs/2501.08815

Authors: Matej Suchanek,Miroslav Purkrabek,Jiri Matas

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Continuous Surface Embeddings, posture or activity, computer vision, vision for detailed, detailed analysis

Comments:

Click to view abstract

Abstract:UV map estimation is used in computer vision for detailed analysis of human posture or activity. Previous methods assign pixels to body model vertices by comparing pixel descriptors independently, without enforcing global coherence or plausibility in the UV map. We propose Pose-Constrained Continuous Surface Embeddings (PC-CSE), which integrates estimated 2D human pose into the pixel-to-vertex assignment process. The pose provides global anatomical constraints, ensuring that UV maps remain coherent while preserving local precision. Evaluation on DensePose COCO demonstrates consistent improvement, regardless of the chosen 2D human pose model. Whole-body poses offer better constraints by incorporating additional details about the hands and feet. Conditioning UV maps with human pose reduces invalid mappings and enhances anatomical plausibility. In addition, we highlight inconsistencies in the ground-truth annotations.

20. 【2501.08807】Multi-visual modality micro drone-based structural damage detection

Link: https://arxiv.org/abs/2501.08807

Authors: Isaac Osei Agyemanga,Liaoyuan Zeng,Jianwen Chena,Isaac Adjei-Mensah,Daniel Acheampong

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: structural damage detection, Accurate detection, civil infrastructure, important in ensuring, ensuring the continuous

Comments:

Click to view abstract

Abstract:Accurate detection and resilience of object detectors in structural damage detection are important in ensuring the continuous use of civil infrastructure. However, achieving robustness in object detectors remains a persistent challenge, impacting their ability to generalize effectively. This study proposes DetectorX, a robust framework for structural damage detection coupled with a micro drone. DetectorX addresses the challenges of object detector robustness by incorporating two innovative modules: a stem block and a spiral pooling technique. The stem block introduces a dynamic visual modality by leveraging the outputs of two Deep Convolutional Neural Network (DCNN) models. The framework employs the proposed event-based reward reinforcement learning to constrain the actions of a parent and child DCNN model leading to a reward. This results in the induction of two dynamic visual modalities alongside the Red, Green, and Blue (RGB) data. This enhancement significantly augments DetectorX's perception and adaptability in diverse environmental situations. Further, a spiral pooling technique, an online image augmentation method, strengthens the framework by increasing feature representations by concatenating spiraled and average/max pooled features. In three extensive experiments: (1) comparative and (2) robustness, which use the Pacific Earthquake Engineering Research Hub ImageNet dataset, and (3) field-experiment, DetectorX performed satisfactorily across varying metrics, including precision (0.88), recall (0.84), average precision (0.91), mean average precision (0.76), and mean average recall (0.73), compared to the competing detectors including You Only Look Once X-medium (YOLOX-m) and others. The study's findings indicate that DetectorX can provide satisfactory results and demonstrate resilience in challenging environments.

21. 【2501.08799】Exploring ChatGPT for Face Presentation Attack Detection in Zero and Few-Shot in-Context Learning

Link: https://arxiv.org/abs/2501.08799

Authors: Alain Komaty,Hatef Otroshi Shahreza,Anjith George,Sebastien Marcel

Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

Keywords: Presentation Attack Detection, including commercial solutions, Face Presentation Attack, Attack Detection, Presentation Attack

Comments: Accepted in WACV workshop 2025

Click to view abstract

Abstract:This study highlights the potential of ChatGPT (specifically GPT-4o) as a competitive alternative for Face Presentation Attack Detection (PAD), outperforming several PAD models, including commercial solutions, in specific scenarios. Our results show that GPT-4o demonstrates high consistency, particularly in few-shot in-context learning, where its performance improves as more examples are provided (reference data). We also observe that detailed prompts enable the model to provide scores reliably, a behavior not observed with concise prompts. Additionally, explanation-seeking prompts slightly enhance the model's performance by improving its interpretability. Remarkably, the model exhibits emergent reasoning capabilities, correctly predicting the attack type (print or replay) with high accuracy in few-shot scenarios, despite not being explicitly instructed to classify attack types. Despite these strengths, GPT-4o faces challenges in zero-shot tasks, where its performance is limited compared to specialized PAD systems. Experiments were conducted on a subset of the SOTERIA dataset, ensuring compliance with data privacy regulations by using only data from consenting individuals. These findings underscore GPT-4o's promise in PAD applications, laying the groundwork for future research to address broader data privacy concerns and improve cross-dataset generalization. Code available here: this https URL

22. 【2501.08771】Admitting Ignorance Helps the Video Question Answering Models to Answer

Link: https://arxiv.org/abs/2501.08771

Authors: Haopeng Li,Tom Drummond,Mingming Gong,Mohammed Bennamoun,Qiuhong Ke

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Significant progress, large-scale pretraining, video question answering, deep learning, learning and large-scale

Comments:

Click to view abstract

Abstract:Significant progress has been made in the field of video question answering (VideoQA) thanks to deep learning and large-scale pretraining. Despite the presence of sophisticated model structures and powerful video-text foundation models, most existing methods focus solely on maximizing the correlation between answers and video-question pairs during training. We argue that these models often establish shortcuts, resulting in spurious correlations between questions and answers, especially when the alignment between video and text data is suboptimal. To address these spurious correlations, we propose a novel training framework in which the model is compelled to acknowledge its ignorance when presented with an intervened question, rather than making guesses solely based on superficial question-answer correlations. We introduce methodologies for intervening in questions, utilizing techniques such as displacement and perturbation, and design frameworks for the model to admit its lack of knowledge in both multi-choice VideoQA and open-ended settings. In practice, we integrate a state-of-the-art model into our framework to validate its effectiveness. The results clearly demonstrate that our framework can significantly enhance the performance of VideoQA models with minimal structural modifications.

23. 【2501.08763】Few-Shot Learner Generalizes Across AI-Generated Image Detection

Link: https://arxiv.org/abs/2501.08763

Authors: Shiyu Wu,Jing Liu,Jing Li,Yequan Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: limited studied generative, Current fake image, Current fake, datasets perform satisfactorily, large synthetic image

Comments: 11 pages, 5 figures

Click to view abstract

Abstract:Current fake image detectors trained on large synthetic image datasets perform satisfactorily on limited studied generative models. However, they suffer a notable performance decline over unseen models. Besides, collecting adequate training data from online generative models is often expensive or infeasible. To overcome these issues, we propose Few-Shot Detector (FSD), a novel AI-generated image detector which learns a specialized metric space to effectively distinguish unseen fake images by utilizing very few samples. Experiments show FSD achieves state-of-the-art performance by $+7.4\%$ average ACC on GenImage dataset. More importantly, our method is better capable of capturing the intra-category common features in unseen images without further training.
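To make the "metric space with very few samples" idea concrete, here is a minimal nearest-prototype sketch in the style of prototypical networks. The abstract does not describe FSD's actual architecture or distance function, so the Euclidean distance, the function names, the class labels, and the 2-D embeddings below are all assumptions.

```python
import math


def prototype(embeddings):
    """Mean of a handful of support embeddings for one class."""
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]


def classify(query, prototypes):
    """Return the label whose prototype is nearest to the query in Euclidean distance."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))


# Hypothetical 2-D embeddings: two support samples per class.
support = {
    "real": [[0.9, 0.1], [1.0, 0.2]],
    "fake_generator_x": [[0.1, 0.9], [0.2, 1.0]],
}
prototypes = {label: prototype(vectors) for label, vectors in support.items()}
print(classify([0.8, 0.3], prototypes))
```

Adding an unseen generator in this setup only requires embedding a few of its samples and averaging them into a new prototype, with no retraining of the embedding model.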

24. 【2501.08717】$\texttt{InfoHier}$: Hierarchical Information Extraction via Encoding and Embedding

Link: https://arxiv.org/abs/2501.08717

Authors: Tianru Zhang, Li Ju, Prashant Singh, Salman Toor

Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Analyzing large-scale datasets, Analyzing large-scale, large-scale datasets, Analyzing, SSL

Notes: 10 pages, 4 figures

Abstract:Analyzing large-scale datasets, especially involving complex and high-dimensional data like images, is particularly challenging. While self-supervised learning (SSL) has proven effective for learning representations from unlabelled data, it typically focuses on flat, non-hierarchical structures, missing the multi-level relationships present in many real-world datasets. Hierarchical clustering (HC) can uncover these relationships by organizing data into a tree-like structure, but it often relies on rigid similarity metrics that struggle to capture the complexity of diverse data types. To address these limitations, we envision $\texttt{InfoHier}$, a framework that combines SSL with HC to jointly learn robust latent representations and hierarchical structures. This approach leverages SSL to provide adaptive representations, enhancing HC's ability to capture complex patterns. Simultaneously, it integrates an HC loss to refine SSL training, resulting in representations that are more attuned to the underlying information hierarchy. $\texttt{InfoHier}$ has the potential to improve the expressiveness and performance of both clustering and representation learning, offering significant benefits for data analysis, management, and information retrieval.

25. 【2501.08712】Self-supervised Transformation Learning for Equivariant Representations

Link: https://arxiv.org/abs/2501.08712

Authors: Jaemyung Yu, Jaehyun Choi, Dong-Jae Lee, HyeongGwon Hong, Junmo Kim

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Unsupervised representation learning, Unsupervised representation, significantly advanced, advanced various machine, Unsupervised

Notes: 38th Conference on Neural Information Processing Systems (NeurIPS 2024)

Abstract:Unsupervised representation learning has significantly advanced various machine learning tasks. In the computer vision domain, state-of-the-art approaches utilize transformations like random crop and color jitter to achieve invariant representations, embedding semantically the same inputs despite transformations. However, this can degrade performance in tasks requiring precise features, such as localization or flower classification. To address this, recent research incorporates equivariant representation learning, which captures transformation-sensitive information. However, current methods depend on transformation labels and thus struggle with interdependency and complex transformations. We propose Self-supervised Transformation Learning (STL), replacing transformation labels with transformation representations derived from image pairs. The proposed method ensures transformation representation is image-invariant and learns corresponding equivariant transformations, enhancing performance without increased batch complexity. We demonstrate the approach's effectiveness across diverse classification and detection tasks, outperforming existing methods in 7 out of 11 benchmarks and excelling in detection. By integrating complex transformations like AugMix, unusable by prior equivariant methods, this approach enhances performance across tasks, underscoring its adaptability and resilience. Additionally, its compatibility with various base models highlights its flexibility and broad applicability. The code is available at this https URL.

26. 【2501.08682】RealVVT: Towards Photorealistic Video Virtual Try-on via Spatio-Temporal Consistency

Link: https://arxiv.org/abs/2501.08682

Authors: Siqi Li, Zhengkai Jiang, Jiawei Zhou, Zhihong Liu, Xiaowei Chi, Haoqian Wang

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: clothing items fit, Virtual try-on, aimed at digitally, Video Virtual Try-on, intersection of computer

Notes: 10 pages (8 pages main text, 2 pages references), 5 figures in the main text, and 4 pages supplementary materials with 3 additional figures

Abstract:Virtual try-on has emerged as a pivotal task at the intersection of computer vision and fashion, aimed at digitally simulating how clothing items fit on the human body. Despite notable progress in single-image virtual try-on (VTO), current methodologies often struggle to preserve a consistent and authentic appearance of clothing across extended video sequences. This challenge arises from the complexities of capturing dynamic human pose and maintaining target clothing characteristics. We leverage pre-existing video foundation models to introduce RealVVT, a photoRealistic Video Virtual Try-on framework tailored to bolster stability and realism within dynamic video contexts. Our methodology encompasses a Clothing Temporal Consistency strategy, an Agnostic-guided Attention Focus Loss mechanism to ensure spatial consistency, and a Pose-guided Long Video VTO technique adept at handling extended video sequences. Extensive experiments across various datasets confirm that our approach outperforms existing state-of-the-art models in both single-image and video VTO tasks, offering a viable solution for practical applications within the realms of fashion e-commerce and virtual fitting environments.

27. 【2501.08676】FlexiClip: Locality-Preserving Free-Form Character Animation

Link: https://arxiv.org/abs/2501.08676

Authors: Anant Khandelwal

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: Animating clipart images, coherence presents significant, presents significant challenges, temporal coherence presents, maintaining visual fidelity

Notes: 13 pages, 4 figures, 7 tables

Abstract:Animating clipart images with seamless motion while maintaining visual fidelity and temporal coherence presents significant challenges. Existing methods, such as AniClipart, effectively model spatial deformations but often fail to ensure smooth temporal transitions, resulting in artifacts like abrupt motions and geometric distortions. Similarly, text-to-video (T2V) and image-to-video (I2V) models struggle to handle clipart due to the mismatch in statistical properties between natural video and clipart styles. This paper introduces FlexiClip, a novel approach designed to overcome these limitations by addressing the intertwined challenges of temporal consistency and geometric integrity. FlexiClip extends traditional Bézier curve-based trajectory modeling with key innovations: temporal Jacobians to correct motion dynamics incrementally, continuous-time modeling via probability flow ODEs (pfODEs) to mitigate temporal noise, and a flow matching loss inspired by GFlowNet principles to optimize smooth motion transitions. These enhancements ensure coherent animations across complex scenarios involving rapid movements and non-rigid deformations. Extensive experiments validate the effectiveness of FlexiClip in generating animations that are not only smooth and natural but also structurally consistent across diverse clipart types, including humans and animals. By integrating spatial and temporal modeling with pre-trained video diffusion models, FlexiClip sets a new standard for high-quality clipart animation, offering robust performance across a wide range of visual content. Project Page: this https URL

28. 【2501.08672】GS-LIVO: Real-Time LiDAR, Inertial, and Visual Multi-sensor Fused Odometry with Gaussian Mapping

Link: https://arxiv.org/abs/2501.08672

Authors: Sheng Hong, Chunran Zheng, Yishu Shen, Changze Li, Fu Zhang, Tong Qin, Shaojie Shen

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: scene representation approach, Gaussian map, global Gaussian map, recent years, representation approach

Abstract:In recent years, 3D Gaussian splatting (3D-GS) has emerged as a novel scene representation approach. However, existing vision-only 3D-GS methods often rely on hand-crafted heuristics for point-cloud densification and face challenges in handling occlusions and high GPU memory and computation consumption. LiDAR-Inertial-Visual (LIV) sensor configuration has demonstrated superior performance in localization and dense mapping by leveraging complementary sensing characteristics: rich texture information from cameras, precise geometric measurements from LiDAR, and high-frequency motion data from IMU. Inspired by this, we propose a novel real-time Gaussian-based simultaneous localization and mapping (SLAM) system. Our map system comprises a global Gaussian map and a sliding window of Gaussians, along with an IESKF-based odometry. The global Gaussian map consists of hash-indexed voxels organized in a recursive octree, effectively covering sparse spatial volumes while adapting to different levels of detail and scales. The Gaussian map is initialized through multi-sensor fusion and optimized with photometric gradients. Our system incrementally maintains a sliding window of Gaussians, significantly reducing GPU computation and memory consumption by only optimizing the map within the sliding window. Moreover, we implement a tightly coupled multi-sensor fusion odometry with an iterative error state Kalman filter (IESKF), leveraging real-time updating and rendering of the Gaussian map. Our system represents the first real-time Gaussian-based SLAM framework deployable on resource-constrained embedded systems, demonstrated on the NVIDIA Jetson Orin NX platform. The framework achieves real-time performance while maintaining robust multi-sensor fusion capabilities. All implementation algorithms, hardware designs, and CAD models will be publicly available.
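
To make the idea of hash-indexed voxels concrete, here is a minimal sketch of the general technique, not the authors' implementation. It assumes a single fixed voxel size and stores raw points per voxel; the recursive octree inside each voxel and the Gaussian parameters described in the abstract are omitted. The names `voxel_key` and `build_voxel_map` are hypothetical.

```python
import math
from collections import defaultdict


def voxel_key(point, voxel_size):
    """Integer voxel coordinates of the cell that contains a 3-D point."""
    return tuple(math.floor(c / voxel_size) for c in point)


def build_voxel_map(points, voxel_size):
    """Group points by voxel key; only occupied voxels consume memory."""
    voxels = defaultdict(list)
    for p in points:
        voxels[voxel_key(p, voxel_size)].append(p)
    return voxels


# Two nearby points share a voxel; a distant point gets its own entry.
points = [(0.1, 0.2, 0.3), (0.4, 0.4, 0.4), (5.2, 0.1, -0.3)]
voxel_map = build_voxel_map(points, voxel_size=0.5)
for key, members in voxel_map.items():
    print(key, len(members))
```

A dictionary keyed by integer coordinates covers sparse volumes without allocating empty space, which is why hashing suits large outdoor maps where most cells are unoccupied.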

29. 【2501.08665】A Survey on Facial Image Privacy Preservation in Cloud-Based Services

Link: https://arxiv.org/abs/2501.08665

Authors: Chen Chen, Mengyuan Sun, Xueluan Gong, Yanjiao Chen, Qian Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Facial recognition models, government agencies, commercial enterprises, identity verification, increasingly employed

Abstract:Facial recognition models are increasingly employed by commercial enterprises, government agencies, and cloud service providers for identity verification, consumer services, and surveillance. These models are often trained using vast amounts of facial data processed and stored in cloud-based platforms, raising significant privacy concerns. Users' facial images may be exploited without their consent, leading to potential data breaches and misuse. This survey presents a comprehensive review of current methods aimed at preserving facial image privacy in cloud-based services. We categorize these methods into two primary approaches: image obfuscation-based protection and adversarial perturbation-based protection. We provide an in-depth analysis of both categories, offering qualitative and quantitative comparisons of their effectiveness. Additionally, we highlight unresolved challenges and propose future research directions to improve privacy preservation in cloud computing environments.

30. 【2501.08659】BRIGHT-VO: Brightness-Guided Hybrid Transformer for Visual Odometry with Multi-modality Refinement Module

Link: https://arxiv.org/abs/2501.08659

Authors: Dongzhihan Wang, Yang Yang, Liang Xu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: robotic navigation, plays a crucial, autonomous driving, crucial role, role in autonomous

Notes: 9 pages, 7 figures

Abstract:Visual odometry (VO) plays a crucial role in autonomous driving, robotic navigation, and other related tasks by estimating the position and orientation of a camera based on visual input. Significant progress has been made in data-driven VO methods, particularly those leveraging deep learning techniques to extract image features and estimate camera poses. However, these methods often struggle in low-light conditions because of the reduced visibility of features and the increased difficulty of matching keypoints. To address this limitation, we introduce BrightVO, a novel VO model based on Transformer architecture, which not only performs front-end visual feature extraction, but also incorporates a multi-modality refinement module in the back-end that integrates Inertial Measurement Unit (IMU) data. Using pose graph optimization, this module iteratively refines pose estimates to reduce errors and improve both accuracy and robustness. Furthermore, we create a synthetic low-light dataset, KiC4R, which includes a variety of lighting conditions to facilitate the training and evaluation of VO frameworks in challenging environments. Experimental results demonstrate that BrightVO achieves state-of-the-art performance on both the KiC4R dataset and the KITTI benchmarks. Specifically, it provides an average improvement of 20% in pose estimation accuracy in normal outdoor environments and 259% in low-light conditions, outperforming existing methods. For widespread use and further development, the research work is fully open-source at this https URL.

31. 【2501.08654】StereoGen: High-quality Stereo Image Generation from a Single Image

Link: https://arxiv.org/abs/2501.08654

Authors: Xianqi Wang, Hao Yang, Gangwei Xu, Junda Cheng, Min Lin, Yong Deng, Jinliang Zang, Yurui Chen, Xin Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: supervised stereo matching, achieved amazing results, achieved amazing, stereo matching methods, supervised stereo

Abstract:State-of-the-art supervised stereo matching methods have achieved amazing results on various benchmarks. However, these data-driven methods generalize poorly to real-world scenarios due to the lack of real-world annotated data. In this paper, we propose StereoGen, a novel pipeline for high-quality stereo image generation. This pipeline utilizes arbitrary single images as left images and pseudo disparities generated by a monocular depth estimation model to synthesize high-quality corresponding right images. Unlike previous methods that fill the occluded area in warped right images using random backgrounds or using convolutions to take nearby pixels selectively, we fine-tune a diffusion inpainting model to recover the background. Images generated by our model possess better details and undamaged semantic structures. Besides, we propose Training-free Confidence Generation and Adaptive Disparity Selection. The former suppresses the negative effect of harmful pseudo ground truth during stereo training, while the latter helps generate a wider disparity distribution and better synthetic images. Experiments show that models trained under our pipeline achieve state-of-the-art zero-shot generalization results among all published methods. The code will be available upon publication of the paper.
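
The warping step that produces a right image from a left image and a disparity map can be sketched on a single image row. This is a simplified illustration under stated assumptions: disparities are non-negative integers, a left pixel at column `x` with disparity `d` lands at column `x - d`, and larger disparity (nearer surface) wins on collisions. The function name `warp_row_to_right` is hypothetical, and the inpainting model that would fill the holes is not shown.

```python
def warp_row_to_right(left_row, disparity_row):
    """Forward-warp one image row from the left view to the right view.

    Returns the warped row and a mask that is True where no left pixel
    landed, i.e. the disoccluded columns that still need to be filled.
    """
    width = len(left_row)
    right_row = [None] * width
    best_disp = [-1] * width  # largest disparity written to each column so far
    for x, (value, d) in enumerate(zip(left_row, disparity_row)):
        target = x - d
        # Nearer surfaces (larger disparity) overwrite farther ones.
        if 0 <= target < width and d > best_disp[target]:
            right_row[target] = value
            best_disp[target] = d
    hole_mask = [v is None for v in right_row]
    return right_row, hole_mask


# "c" and "d" form a foreground object with disparity 2; the rest is background.
left = ["a", "b", "c", "d", "e", "f"]
disp = [0, 0, 2, 2, 0, 0]
right, holes = warp_row_to_right(left, disp)
print(right)
print(holes)
```

In this example the foreground shifts two columns to the left and exposes two background columns that the left view never saw; these are the occluded regions the abstract proposes to recover with a fine-tuned diffusion inpainting model.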

32. 【2501.08649】Joint Learning of Depth and Appearance for Portrait Image Animation

Link: https://arxiv.org/abs/2501.08649

Authors: Xinya Ji, Gaspard Zoss, Prashanth Chandran, Lingchen Yang, Xun Cao, Barbara Solenthaler, Derek Bradley

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: experienced significant advancements, recent years, experienced significant, significant advancements, advancements in recent

Abstract:2D portrait animation has experienced significant advancements in recent years. Much research has utilized the prior knowledge embedded in large generative diffusion models to enhance high-quality image manipulation. However, most methods only focus on generating RGB images as output, and the co-generation of consistent visual plus 3D output remains largely under-explored. In our work, we propose to jointly learn the visual appearance and depth in a diffusion-based portrait image generator. Our method embraces the end-to-end diffusion paradigm and introduces a new architecture suitable for learning this conditional joint distribution, consisting of a reference network and a channel-expanded diffusion backbone. Once trained, our framework can be efficiently adapted to various downstream applications, such as facial depth-to-image and image-to-depth generation, portrait relighting, and audio-driven talking head animation with consistent 3D output.

33. 【2501.08643】MonSter: Marry Monodepth to Stereo Unleashes Power

Link: https://arxiv.org/abs/2501.08643

Authors: Junda Cheng, Longliang Liu, Gangwei Xu, Xianqi Wang, Zhaoxing Zhang, Yong Deng, Jinliang Zang, Yurui Chen, Zhipeng Cai, Xin Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: image correspondences, Stereo matching, Stereo matching recovers, matching recovers depth, Stereo

Abstract:Stereo matching recovers depth from image correspondences. Existing methods struggle to handle ill-posed regions with limited matching cues, such as occlusions and textureless areas. To address this, we propose MonSter, a novel method that leverages the complementary strengths of monocular depth estimation and stereo matching. MonSter integrates monocular depth and stereo matching into a dual-branch architecture to iteratively improve each other. Confidence-based guidance adaptively selects reliable stereo cues for monodepth scale-shift recovery. The refined monodepth in turn guides stereo effectively in ill-posed regions. Such iterative mutual enhancement enables MonSter to evolve monodepth priors from coarse object-level structures to pixel-level geometry, fully unlocking the potential of stereo matching. As shown in Fig.1, MonSter ranks 1st across the five most commonly used leaderboards: SceneFlow, KITTI 2012, KITTI 2015, Middlebury, and ETH3D, achieving up to a 49.5% improvement (Bad 1.0 on ETH3D) over the previous best method. Comprehensive analysis verifies the effectiveness of MonSter in ill-posed regions. In terms of zero-shot generalization, MonSter significantly and consistently outperforms state-of-the-art across the board. The code is publicly available at: this https URL.
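
Scale-shift recovery can be illustrated with a confidence-weighted least-squares fit. This is a generic sketch, not the procedure from the paper: it assumes relative monocular values `m` relate to stereo disparities `d` through `d ≈ scale * m + shift`, and that per-pixel confidences act as weights so that unreliable stereo estimates contribute little. The function name `fit_scale_shift` and the sample values are hypothetical.

```python
def fit_scale_shift(mono, stereo, confidence):
    """Closed-form weighted least squares for stereo ≈ scale * mono + shift."""
    sw = sum(confidence)
    sm = sum(w * m for w, m in zip(confidence, mono))
    sd = sum(w * d for w, d in zip(confidence, stereo))
    smm = sum(w * m * m for w, m in zip(confidence, mono))
    smd = sum(w * m * d for w, m, d in zip(confidence, mono, stereo))
    denom = sw * smm - sm * sm
    if abs(denom) < 1e-12:
        raise ValueError("need at least two confident pixels with distinct values")
    scale = (sw * smd - sm * sd) / denom
    shift = (sd - scale * sm) / sw
    return scale, shift


# Three reliable pixels follow d = 2 * m + 5; the fourth is a stereo outlier
# (for example an occluded pixel) and is given zero confidence.
mono = [1.0, 2.0, 3.0, 4.0]
stereo = [7.0, 9.0, 11.0, 100.0]
confidence = [1.0, 1.0, 1.0, 0.0]
scale, shift = fit_scale_shift(mono, stereo, confidence)
print(scale, shift)
```

With the outlier masked out, the fit recovers the underlying scale and shift, after which the aligned monocular prediction can supply values where stereo matching is unreliable.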

34. 【2501.08639】Detecting Wildfire Flame and Smoke through Edge Computing using Transfer Learning Enhanced Deep Learning Models

Link: https://arxiv.org/abs/2501.08639

Authors: Giovanny Vazquez, Shengjie Zhai, Mei Yang

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: Autonomous unmanned aerial, dramatically reducing latency, capabilities empower real-time, empower real-time data, real-time data processing

Notes: 11 pages, 7 figures

Abstract:Autonomous unmanned aerial vehicles (UAVs) integrated with edge computing capabilities empower real-time data processing directly on the device, dramatically reducing latency in critical scenarios such as wildfire detection. This study underscores Transfer Learning's (TL) significance in boosting the performance of object detectors for identifying wildfire smoke and flames, especially when trained on limited datasets, and investigates the impact TL has on edge computing metrics. The latter analysis focuses on how TL-enhanced You Only Look Once (YOLO) models perform in terms of inference time, power usage, and energy consumption on edge computing devices. This study utilizes the Aerial Fire and Smoke Essential (AFSE) dataset as the target, with the Flame and Smoke Detection Dataset (FASDD) and the Microsoft Common Objects in Context (COCO) dataset serving as source datasets. We explore a two-stage cascaded TL method, utilizing D-Fire or FASDD as initial-stage target datasets and AFSE as the subsequent stage. Through fine-tuning, TL significantly enhances detection precision, achieving up to 79.2% mean Average Precision (mAP@0.5), reduces training time, and increases model generalizability across the AFSE dataset. However, cascaded TL yielded no notable improvements, and TL alone did not benefit the edge computing metrics evaluated. Lastly, this work found that YOLOv5n remains a powerful model when hardware acceleration is lacking: it can process images nearly twice as fast as its newer counterpart, YOLO11n. Overall, the results affirm TL's role in augmenting the accuracy of object detectors while also illustrating that additional enhancements are needed to improve edge computing performance.

35. 【2501.08629】Self-Organizing Edge Computing Distribution Framework for Visual SLAM

Link: https://arxiv.org/abs/2501.08629

Authors: Jussi Kalliola, Lauri Suomela, Sergio Moreschini, David Hästbacka

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)

Keywords: crucial capability, SLAM, Simultaneous Localization, Localization and Mapping, mobile robots

Notes: 8 pages, 5 figures

Abstract:Localization within a known environment is a crucial capability for mobile robots. Simultaneous Localization and Mapping (SLAM) is a prominent solution to this problem. SLAM is a framework that consists of a diverse set of computational tasks ranging from real-time tracking to computation-intensive map optimization. This combination can present a challenge for resource-limited mobile robots. Previously, edge-assisted SLAM methods have demonstrated promising real-time execution capabilities by offloading heavy computations while performing real-time tracking onboard. However, the common approach of utilizing a client-server architecture for offloading is sensitive to server and network failures. In this article, we propose a novel edge-assisted SLAM framework capable of self-organizing fully distributed SLAM execution across a network of devices or functioning on a single device without connectivity. The architecture consists of three layers and is designed to be device-agnostic, resilient to network failures, and minimally invasive to the core SLAM system. We have implemented and demonstrated the framework for monocular ORB SLAM3 and evaluated it in both fully distributed and standalone SLAM configurations against the ORB SLAM3. The experiment results demonstrate that the proposed design matches the accuracy and resource utilization of the monolithic approach while enabling collaborative execution.

36. 【2501.08609】Computerized Assessment of Motor Imitation for Distinguishing Autism in Video (CAMI-2DNet)

Link: https://arxiv.org/abs/2501.08609

Authors: Kaleab A. Kinfu, Carolina Pacheco, Alice D. Sperry, Deana Crocetti, Bahar Tunçgenç, Stewart H. Mostofsky, René Vidal

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: autism spectrum conditions, addressing autism heterogeneity, Motor imitation, Motor imitation impairments, motor imitation assessment

Notes: This work has been submitted to the IEEE for possible publication

Abstract:Motor imitation impairments are commonly reported in individuals with autism spectrum conditions (ASCs), suggesting that motor imitation could be used as a phenotype for addressing autism heterogeneity. Traditional methods for assessing motor imitation are subjective, labor-intensive, and require extensive human training. Modern Computerized Assessment of Motor Imitation (CAMI) methods, such as CAMI-3D for motion capture data and CAMI-2D for video data, are less subjective. However, they rely on labor-intensive data normalization and cleaning techniques, and human annotations for algorithm training. To address these challenges, we propose CAMI-2DNet, a scalable and interpretable deep learning-based approach to motor imitation assessment in video data, which eliminates the need for data normalization, cleaning and annotation. CAMI-2DNet uses an encoder-decoder architecture to map a video to a motion encoding that is disentangled from nuisance factors such as body shape and camera views. To learn a disentangled representation, we employ synthetic data generated by motion retargeting of virtual characters through the reshuffling of motion, body shape, and camera views, as well as real participant data. To automatically assess how well an individual imitates an actor, we compute a similarity score between their motion encodings, and use it to discriminate individuals with ASCs from neurotypical (NT) individuals. Our comparative analysis demonstrates that CAMI-2DNet has a strong correlation with human scores while outperforming CAMI-2D in discriminating ASC vs NT children. Moreover, CAMI-2DNet performs comparably to CAMI-3D while offering greater practicality by operating directly on video data and without the need for ad-hoc data normalization and human annotations.

37. 【2501.08605】PACF: Prototype Augmented Compact Features for Improving Domain Adaptive Object Detection

Link: https://arxiv.org/abs/2501.08605

Authors: Chenguang Liu, Yongchao Feng, Yanan Zhang, Qingjie Liu, Yunhong Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: recent years, object detection, advancement in object, significant advancement, Prototype Augmented Compact

Abstract:In recent years, there has been significant advancement in object detection. However, applying off-the-shelf detectors to a new domain leads to a significant performance drop, caused by the domain gap. These detectors exhibit higher-variance class-conditional distributions in the target domain than in the source domain, along with mean shift. To address this problem, we propose the Prototype Augmented Compact Features (PACF) framework to regularize the distribution of intra-class features. Specifically, we provide an in-depth theoretical analysis on the lower bound of the target features-related likelihood and derive the prototype cross entropy loss to further calibrate the distribution of target RoI features. Furthermore, a mutual regularization strategy is designed to enable the linear and prototype-based classifiers to learn from each other, promoting feature compactness while enhancing discriminability. Thanks to this PACF framework, we have obtained a more compact cross-domain feature space, within which the variance of the target features' class-conditional distributions has significantly decreased, and the class-mean shift between the two domains has also been further reduced. The results on different adaptation settings are state-of-the-art, which demonstrates the broad applicability and effectiveness of the proposed approach.

38. 【2501.08604】Watermarking in Diffusion Model: Gaussian Shading with Exact Diffusion Inversion via Coupled Transformations (EDICT)

Link: https://arxiv.org/abs/2501.08604

Authors: Krishna Panthi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Coupled Transformations, Exact Diffusion Inversion, Gaussian Shading, Gaussian Shading traditionally, Exact Diffusion

Notes: 5 pages

Abstract:This paper introduces a novel approach to enhance the performance of Gaussian Shading, a prevalent watermarking technique, by integrating the Exact Diffusion Inversion via Coupled Transformations (EDICT) framework. While Gaussian Shading traditionally embeds watermarks in a noise latent space, followed by iterative denoising for image generation and noise addition for watermark recovery, its inversion process is not exact, leading to potential watermark distortion. We propose to leverage EDICT's ability to derive exact inverse mappings to refine this process. Our method involves duplicating the watermark-infused noisy latent and employing a reciprocal, alternating denoising and noising scheme between the two latents, facilitated by EDICT. This allows for a more precise reconstruction of both the image and the embedded watermark. Empirical evaluation on standard datasets demonstrates that our integrated approach yields a slight, yet statistically significant improvement in watermark recovery fidelity. These results highlight the potential of EDICT to enhance existing diffusion-based watermarking techniques by providing a more accurate and robust inversion mechanism. To the best of our knowledge, this is the first work to explore the synergy between EDICT and Gaussian Shading for digital watermarking, opening new avenues for research in robust and high-fidelity watermark embedding and extraction.
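
The exact-inversion property behind the alternating two-latent scheme can be demonstrated on scalars. The sketch below follows the coupled affine update pattern of EDICT as best understood here, under loud simplifications: the coefficients `A`, `B`, and the mixing weight `P` are constant placeholders rather than a real noise schedule, and `eps` is a toy deterministic function standing in for the noise-prediction network. No watermark embedding is shown.

```python
import math


def eps(z, t):
    """Toy stand-in for a noise-prediction network; any deterministic function works."""
    return math.sin(z + 0.1 * t)


def denoise_step(x, y, t, a, b, p):
    """One coupled denoising step: each latent is updated using the other."""
    x_inter = a * x + b * eps(y, t)
    y_inter = a * y + b * eps(x_inter, t)
    x_prev = p * x_inter + (1 - p) * y_inter
    y_prev = p * y_inter + (1 - p) * x_prev
    return x_prev, y_prev


def invert_step(x_prev, y_prev, t, a, b, p):
    """Algebraic inverse of denoise_step, undoing the operations in reverse order."""
    y_inter = (y_prev - (1 - p) * x_prev) / p
    x_inter = (x_prev - (1 - p) * y_inter) / p
    y = (y_inter - b * eps(x_inter, t)) / a
    x = (x_inter - b * eps(y, t)) / a
    return x, y


A, B, P = 0.98, 0.05, 0.93  # placeholder coefficients, not a real schedule
STEPS = 10

x0 = y0 = 0.7  # the starting latent is duplicated into two coupled copies
x, y = x0, y0
for t in range(STEPS, 0, -1):  # generation direction
    x, y = denoise_step(x, y, t, A, B, P)

xr, yr = x, y
for t in range(1, STEPS + 1):  # inversion direction
    xr, yr = invert_step(xr, yr, t, A, B, P)

print(abs(xr - x0), abs(yr - y0))
```

Each update only evaluates `eps` on a quantity that is still available during inversion, so every step can be undone algebraically, up to floating-point error. Recovering the starting latent this precisely is what would allow a watermark embedded in it to be read back with less distortion.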

39. 【2501.08593】Image-to-Force Estimation for Soft Tissue Interaction in Robotic-Assisted Surgery Using Structured Light

Link: https://arxiv.org/abs/2501.08593

Authors: Jiayin Wang, Mingfeng Yao, Yanran Wei, Xiaoyu Guo, Ayong Zheng, Weidong Zhao

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Minimally Invasive Surgical, Invasive Surgical, Minimally Invasive, accurate haptic interaction, interaction force feedback

Abstract:For Minimally Invasive Surgical (MIS) robots, accurate haptic interaction force feedback is essential for ensuring the safety of interacting with soft tissue. However, most existing MIS robotic systems cannot facilitate direct measurement of the interaction force with hardware sensors due to space limitations. This letter introduces an effective vision-based scheme that utilizes a One-Shot structured light projection with a designed pattern on soft tissue coupled with haptic information processing through a trained image-to-force neural network. The images captured from the endoscopic stereo camera are analyzed to reconstruct high-resolution 3D point clouds for soft tissue deformation. Based on this, a modified PointNet-based force estimation method is proposed, which excels in representing the complex mechanical properties of soft tissue. Numerical force interaction experiments are conducted on three silicon materials with different stiffness. The results validate the effectiveness of the proposed scheme.

40. 【2501.08580】Densely Connected Parameter-Efficient Tuning for Referring Image Segmentation

Link: https://arxiv.org/abs/2501.08580

Authors: Jiaqi Huang, Zunnan Xu, Ting Liu, Yong Liu, Haonan Han, Kehong Yuan, Xiu Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: computer vision, domain of computer, increasingly replacing, replacing the traditional, traditional paradigm

Notes: Accepted by AAAI2025

Abstract:In the domain of computer vision, Parameter-Efficient Tuning (PET) is increasingly replacing the traditional paradigm of pre-training followed by full fine-tuning. PET is particularly favored for its effectiveness in large foundation models, as it streamlines transfer learning costs and optimizes hardware utilization. However, the current PET methods are mainly designed for single-modal optimization. While some pioneering studies have undertaken preliminary explorations, they still remain at the level of aligned encoders (e.g., CLIP) and lack exploration of misaligned encoders. These methods show sub-optimal performance with misaligned encoders, as they fail to effectively align the multimodal features during fine-tuning. In this paper, we introduce DETRIS, a parameter-efficient tuning framework designed to enhance low-rank visual feature propagation by establishing dense interconnections between each layer and all preceding layers, which enables effective cross-modal feature interaction and adaptation to misaligned encoders. We also suggest using text adapters to improve textual features. Our simple yet efficient approach greatly surpasses state-of-the-art methods with 0.9% to 1.8% backbone parameter updates, evaluated on challenging benchmarks. Our project is available at this https URL.

41. 【2501.08577】Scalable and High-Quality Neural Implicit Representation for 3D Reconstruction

Link: https://arxiv.org/abs/2501.08577

Authors: Leyuan Yang, Bailin Deng, Juyong Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: remarkable modeling capabilities, demonstrated remarkable modeling, modeling capabilities, demonstrated remarkable, remarkable modeling

Abstract:Various SDF-based neural implicit surface reconstruction methods have been proposed recently, and have demonstrated remarkable modeling capabilities. However, due to the global nature and limited representation ability of a single network, existing methods still suffer from many drawbacks, such as limited accuracy and scale of the reconstruction. In this paper, we propose a versatile, scalable and high-quality neural implicit representation to address these issues. We integrate a divide-and-conquer approach into the neural SDF-based reconstruction. Specifically, we model the object or scene as a fusion of multiple independent local neural SDFs with overlapping regions. The construction of our representation involves three key steps: (1) constructing the distribution and overlap relationship of the local radiance fields based on object structure or data distribution, (2) relative pose registration for adjacent local SDFs, and (3) SDF blending. Thanks to the independent representation of each local region, our approach can not only achieve high-fidelity surface reconstruction, but also enable scalable scene reconstruction. Extensive experimental results demonstrate the effectiveness and practicality of our proposed method.

42. 【2501.08575】GOTLoc: General Outdoor Text-based Localization Using Scene Graph Retrieval with OpenStreetMap

Link: https://arxiv.org/abs/2501.08575

Authors: Donghwi Jung, Keonwoo Kim, Seong-Woo Kim

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: environments where GPS, GPS signals, signals are unavailable, capable of operating, robust localization

Abstract:We propose GOTLoc, a robust localization method capable of operating even in outdoor environments where GPS signals are unavailable. The method achieves this robust localization by leveraging comparisons between scene graphs generated from text descriptions and maps. Existing text-based localization studies typically represent maps as point clouds and identify the most similar scenes by comparing embeddings of text and point cloud data. However, point cloud maps have limited scalability as it is impractical to pre-generate maps for all outdoor spaces. Furthermore, their large data size makes it challenging to store and utilize them directly on actual robots. To address these issues, GOTLoc leverages compact data structures, such as scene graphs, to store spatial information, enabling individual robots to carry and utilize large amounts of map data. Additionally, by utilizing publicly available map data, such as OpenStreetMap, which provides global information on outdoor spaces, we eliminate the need for additional effort to create custom map data. For performance evaluation, we utilized the KITTI360Pose dataset in conjunction with corresponding OpenStreetMap data to compare the proposed method with existing approaches. Our results demonstrate that the proposed method achieves accuracy comparable to algorithms relying on point cloud maps. Moreover, in city-scale tests, GOTLoc required significantly less storage compared to point cloud-based methods and completed overall processing within a few seconds, validating its applicability to real-world robotics. Our code is available at this https URL.

43. 【2501.08562】MIAFEx: An Attention-based Feature Extraction Method for Medical Image Classification

Link: https://arxiv.org/abs/2501.08562

Authors: Oscar Ramos-Soto, Jorge Ramos-Frutos, Ezequiel Perez-Zarate, Diego Oliva, Sandra E. Balderas-Mata

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: exhibit significant limitations, providing sufficient discriminative, sufficient discriminative information, Convolutional Neural Networks, machine learning classifiers

Comments: In preparation for Journal Submission

Abstract:Feature extraction techniques are crucial in medical image classification; however, classical feature extractors combined with traditional machine learning classifiers often exhibit significant limitations in providing sufficient discriminative information for complex image sets. While Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have shown promise in feature extraction, they are prone to overfitting due to the inherent characteristics of medical imaging data, including small sample sizes or high intra-class variance. In this work, the Medical Image Attention-based Feature Extractor (MIAFEx) is proposed, a novel method that employs a learnable refinement mechanism to enhance the classification token within the Transformer encoder architecture. This mechanism adjusts the token based on learned weights, improving the extraction of salient features and enhancing the model's adaptability to the challenges presented by medical imaging data. The quality of the MIAFEx output features is compared against classical feature extractors using traditional and hybrid classifiers. Also, the performance of these features is compared against modern CNN and ViT models in classification tasks, demonstrating its superiority in accuracy and robustness across multiple complex medical imaging classification datasets. This advantage is particularly pronounced in scenarios with limited training data, where traditional and modern models often struggle to generalize effectively. The source code of this proposal can be found at this https URL.

44. 【2501.08553】DynamicFace: High-Quality and Consistent Video Face Swapping using Composable 3D Facial Priors

Link: https://arxiv.org/abs/2501.08553

Authors: Runqi Wang, Sijie Xu, Tianyao He, Yang Chen, Wei Zhu, Dejia Song, Nemo Chen, Xu Tang, Yao Hu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Face, Face swapping, target face, retaining the attributes, Face swapping transfers

Abstract:Face swapping transfers the identity of a source face to a target face while retaining the attributes like expression, pose, hair, and background of the target face. Advanced face swapping methods have achieved attractive results. However, these methods often inadvertently transfer identity information from the target face, compromising expression-related details and accurate identity. We propose a novel method DynamicFace that leverages the power of diffusion model and plug-and-play temporal layers for video face swapping. First, we introduce four fine-grained face conditions using 3D facial priors. All conditions are designed to be disentangled from each other for precise and unique control. Then, we adopt Face Former and ReferenceNet for high-level and detailed identity injection. Through experiments on the FF++ dataset, we demonstrate that our method achieves state-of-the-art results in face swapping, showcasing superior image quality, identity preservation, and expression accuracy. Besides, our method could be easily transferred to video domain with temporal attention layer. Our code and results will be available on the project page: this https URL

45. 【2501.08549】The Devil is in Temporal Token: High Quality Video Reasoning Segmentation

Link: https://arxiv.org/abs/2501.08549

Authors: Sitong Gong, Yunzhi Zhuge, Lu Zhang, Zongxin Yang, Pingping Zhang, Huchuan Lu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: inadequately capturing spatial, capturing spatial complexity, Video Reasoning Segmentation, Multimodal Large Language, Segmentation rely heavily

Abstract:Existing methods for Video Reasoning Segmentation rely heavily on a single special token to represent the object in the keyframe or the entire video, inadequately capturing spatial complexity and inter-frame motion. To overcome these challenges, we propose VRS-HQ, an end-to-end video reasoning segmentation approach that leverages Multimodal Large Language Models (MLLMs) to inject rich spatiotemporal features into hierarchical tokens. Our key innovations include a Temporal Dynamic Aggregation (TDA) and a Token-driven Keyframe Selection (TKS). Specifically, we design frame-level SEG and temporal-level TAK tokens that utilize MLLM's autoregressive learning to effectively capture both local and global information. Subsequently, we apply a similarity-based weighted fusion and frame selection strategy, then utilize SAM2 to perform keyframe segmentation and propagation. To enhance keyframe localization accuracy, the TKS filters keyframes based on SAM2's occlusion scores during inference. VRS-HQ achieves state-of-the-art performance on ReVOS, surpassing VISA by 5.9%/12.5%/9.1% in J&F scores across the three subsets. These results highlight the strong temporal reasoning and segmentation capabilities of our method. Code and model weights will be released at VRS-HQ.
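
The similarity-based weighted fusion and frame selection mentioned in the abstract can be pictured with a generic sketch. Everything specific here is an assumption and not taken from the paper: cosine similarity as the similarity measure, softmax as the normalization, `argmax` as the selection rule, and the function name `fuse_and_select`. The SAM2 occlusion-score filtering is not modeled.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_and_select(frame_tokens, temporal_token):
    """Fuse frame-level tokens by their similarity to a temporal-level token.

    Returns (fused_token, weights, keyframe_index), where keyframe_index is
    the frame whose token is most similar to the temporal token.
    """
    sims = [cosine(f, temporal_token) for f in frame_tokens]
    weights = softmax(sims)
    dim = len(temporal_token)
    fused = [sum(w * f[i] for w, f in zip(weights, frame_tokens)) for i in range(dim)]
    keyframe = max(range(len(sims)), key=sims.__getitem__)
    return fused, weights, keyframe

# Toy embeddings: the third frame points in the same direction as the temporal token.
frame_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
temporal_token = [1.0, 1.0]
fused, weights, keyframe = fuse_and_select(frame_tokens, temporal_token)
```

Frames that agree with the video-level token dominate the fused representation, and the same similarity scores double as a ranking for picking a keyframe.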

46. 【2501.08545】Comprehensive Subjective and Objective Evaluation Method for Text-generated Video

Link: https://arxiv.org/abs/2501.08545

Authors: Zelu Qi, Ping Shi, Shuqi Wang, Zhaoyang Zhang, Zefeng Ying, Da Pan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: technology advancements, applicability and popularity, significantly broadened, broadened its applicability

Abstract:Recent text-to-video (T2V) technology advancements, as demonstrated by models such as Gen3, Pika, and Sora, have significantly broadened its applicability and popularity. This progress has created a growing demand for accurate quality assessment metrics to evaluate the perceptual quality of text-generated videos and optimize video generation models. However, assessing the quality of text-generated videos remains challenging due to the presence of highly complex distortions, such as unnatural actions and phenomena that defy human cognition. To address these challenges, we constructed a large-scale benchmark dataset for Text-generated Video evaluation, T2VEval-Bench, comprising 148 textual words and 1,783 videos generated by 12 models. During the subjective evaluation, we collected five key scores: overall impression, video quality, aesthetic quality, realness, and text-video consistency. For objective evaluation, we developed the T2VEval model, which assesses videos across three branches: quality, authenticity, and consistency. Using an attention-based fusion module, T2VEval effectively integrates features from each branch and predicts scores with the aid of a large oracle model. Additionally, we implemented a progressive training strategy, enabling each branch to learn targeted knowledge while maintaining synergy with the others. Experimental results demonstrate that T2VEval achieves state-of-the-art performance across multiple metrics. The dataset and code will be open-sourced upon completion of the follow-up work.

47. 【2501.08514】Multimodal Fake News Video Explanation Generation

Link: https://arxiv.org/abs/2501.08514

Authors: Lizhi Chen, Zhong Qian, Peifeng Li, Qiaoming Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: Multi-modal explanation involves, multiple information modalities, Multi-modal explanation, information modalities, involves the assessment

Abstract:Multi-modal explanation involves the assessment of the veracity of a variety of different content, and relies on multiple information modalities to comprehensively consider the relevance and consistency between modalities. Most existing fake news video detection methods focus on improving accuracy while ignoring the importance of providing explanations. In this paper, we propose a novel problem, Fake News Video Explanation (FNVE): given a multimodal news post containing both video and caption text, we aim to generate natural language explanations to reveal the truth of predictions. To this end, we develop FakeNVE, a new dataset of explanations for truthfully multimodal posts, where each explanation is a natural language (English) sentence describing the attribution of a news thread. We benchmark FakeNVE by using a multimodal transformer-based architecture. Subsequently, a BART-based autoregressive decoder is used as the generator. Empirical results show compelling results for various baselines (applicable to FNVE) across multiple evaluation metrics. We also perform human evaluation on explanation generation, achieving high scores for both adequacy and fluency.

48. 【2501.08506】Exploring the Efficacy of Meta-Learning: Unveiling Superior Data Diversity Utilization of MAML Over Pre-training

Link: https://arxiv.org/abs/2501.08506

Authors: Kavita Selva, Satita Vittayaareekul, Brando Miranda

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: dominate the narrative, model size dominate, data diversity, diversity, data

Abstract:Currently, data and model size dominate the narrative in the training of super-large, powerful models. However, there has been a lack of exploration on the effect of other attributes of the training dataset on model performance. We hypothesize that dataset diversity can impact the performance of vision models. Our study shows positive correlations between test set accuracy and data diversity, providing an argument for furthering the research of dataset attributes beyond size. We analyzed pre-training and model-agnostic meta-learning methods on twelve popular visual datasets (e.g., Omniglot, CIFAR-FS, Aircraft) and five model configurations, including MAML variants with different numbers of inner gradient steps and supervised learning. We show moderate to strong positive correlations (R-squared: 0.15-0.42) between accuracy and data diversity and weaker but significant correlations (R-squared: ~0.2) between loss and diversity. These findings support our hypothesis and demonstrate a promising way for a deeper exploration of how formal data diversity influences model performance. This initial study highlights the potential of (Task2Vec) data diversity as a valuable measure in the rapidly evolving field of large-scale learning and emphasizes that understanding the dataset is key to building more powerful and generalizable models.

49. 【2501.08505】Yuan: Yielding Unblemished Aesthetics Through A Unified Network for Visual Imperfections Removal in Generated Images

Link: https://arxiv.org/abs/2501.08505

Authors: Zhenyu Yu, Chee Seng Chan

Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: Generative AI presents, presents transformative potential, scientific visualization, presents transformative, creative arts

Abstract:Generative AI presents transformative potential across various domains, from creative arts to scientific visualization. However, the utility of AI-generated imagery is often compromised by visual flaws, including anatomical inaccuracies, improper object placements, and misplaced textual elements. These imperfections pose significant challenges for practical applications. To overcome these limitations, we introduce Yuan, a novel framework that autonomously corrects visual imperfections in text-to-image synthesis. Yuan uniquely conditions on both the textual prompt and the segmented image, generating precise masks that identify areas in need of refinement without requiring manual intervention, a common constraint in previous methodologies. Following the automated masking process, an advanced inpainting module seamlessly integrates contextually coherent content into the identified regions, preserving the integrity and fidelity of the original image and associated text prompts. Through extensive experimentation on publicly available datasets such as ImageNet100 and Stanford Dogs, along with a custom-generated dataset, Yuan demonstrated superior performance in eliminating visual imperfections. Our approach consistently achieved higher scores in quantitative metrics, including NIQE, BRISQUE, and PI, alongside favorable qualitative evaluations. These results underscore Yuan's potential to significantly enhance the quality and applicability of AI-generated images across diverse fields.

50. 【2501.08504】SuperSAM: Crafting a SAM Supernetwork via Structured Pruning and Unstructured Parameter Prioritization

Link: https://arxiv.org/abs/2501.08504

Authors: Waqwoya Abebe, Sadegh Jafari, Sixing Yu, Akash Dutta, Jan Strube, Nathan R. Tallent, Luanzheng Guo, Pablo Munoz, Ali Jannesari

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Neural Architecture Search, efficient neural architectures, search space, NAS, Neural Architecture

Abstract:Neural Architecture Search (NAS) is a powerful approach of automating the design of efficient neural architectures. In contrast to traditional NAS methods, recently proposed one-shot NAS methods prove to be more efficient in performing NAS. One-shot NAS works by generating a singular weight-sharing supernetwork that acts as a search space (container) of subnetworks. Despite its achievements, designing the one-shot search space remains a major challenge. In this work we propose a search space design strategy for Vision Transformer (ViT)-based architectures. In particular, we convert the Segment Anything Model (SAM) into a weight-sharing supernetwork called SuperSAM. Our approach involves automating the search space design via layer-wise structured pruning and parameter prioritization. While the structured pruning applies probabilistic removal of certain transformer layers, parameter prioritization performs weight reordering and slicing of MLP-blocks in the remaining layers. We train supernetworks on several datasets using the sandwich rule. For deployment, we enhance subnetwork discovery by utilizing a program autotuner to identify efficient subnetworks within the search space. The resulting subnetworks are 30-70% smaller in size compared to the original pre-trained SAM ViT-B, yet outperform the pretrained model. Our work introduces a new and effective method for ViT NAS search-space design.
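
The idea of weight reordering and slicing in an MLP block can be shown on a toy two-layer MLP. This is a sketch under stated assumptions, not SuperSAM's procedure: the importance score (L1 norm of each hidden unit's incoming weights) and the names `reorder_by_importance` and `mlp_forward` are hypothetical, and biases are omitted.

```python
def relu(x):
    return x if x > 0 else 0.0

def mlp_forward(w1, w2, x, width=None):
    """Two-layer MLP without biases, using only the first `width` hidden units.

    w1 has one row per hidden unit; w2 has one row per output unit.
    """
    width = len(w1) if width is None else width
    hidden = [relu(sum(w * xi for w, xi in zip(w1[j], x))) for j in range(width)]
    return [sum(row[j] * hidden[j] for j in range(width)) for row in w2]

def reorder_by_importance(w1, w2):
    """Permute hidden units so the most important ones come first.

    Importance is assumed to be the L1 norm of a unit's incoming weights.
    Rows of w1 and columns of w2 are permuted together, which leaves the
    full-width network's function unchanged.
    """
    importance = [sum(abs(w) for w in row) for row in w1]
    order = sorted(range(len(w1)), key=lambda j: importance[j], reverse=True)
    new_w1 = [w1[j] for j in order]
    new_w2 = [[row[j] for j in order] for row in w2]
    return new_w1, new_w2

w1 = [[0.1, 0.0], [1.0, -2.0], [0.5, 0.5]]   # 3 hidden units, 2 inputs
w2 = [[1.0, 1.0, 1.0]]                       # 1 output
x = [1.0, 0.5]

full_before = mlp_forward(w1, w2, x)
r1, r2 = reorder_by_importance(w1, w2)
full_after = mlp_forward(r1, r2, x)          # same function after reordering
sub_output = mlp_forward(r1, r2, x, width=2) # subnetwork: top-2 hidden units
```

Because the units are sorted once, every subnetwork is a prefix slice of the same weight matrices, which is what lets a supernetwork share weights across widths.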

51. 【2501.08490】FLAVARS: A Multimodal Foundational Language and Vision Alignment Model for Remote Sensing

Link: https://arxiv.org/abs/2501.08490

Authors: Isaac Corley, Simone Fobi Nsutezo, Anthony Ortiz, Caleb Robinson, Rahul Dodhia, Juan M. Lavista Ferres, Peyman Najafirad

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Remote sensing imagery, contextual visual information, Remote sensing, visual information, sensing imagery

Abstract:Remote sensing imagery is dense with objects and contextual visual information. There is a recent trend to combine paired satellite images and text captions for pretraining performant encoders for downstream tasks. However, while contrastive image-text methods like CLIP enable vision-language alignment and zero-shot classification ability, vision-only downstream performance tends to degrade compared to image-only pretraining, such as MAE. In this paper, we propose FLAVARS, a pretraining method that combines the best of both contrastive learning and masked modeling, along with geospatial alignment via contrastive location encoding. We find that FLAVARS significantly outperforms a baseline of SkyCLIP for vision-only tasks such as KNN classification and semantic segmentation, +6% mIOU on SpaceNet1, while retaining the ability to perform zero-shot classification, unlike MAE pretrained methods.

52. 【2501.08471】Benchmarking Classical, Deep, and Generative Models for Human Activity Recognition

Link: https://arxiv.org/abs/2501.08471

Authors: Md Meem Hossain, TheAnh Han, Safina Showkat Ara, Zia Ush Shamszaman

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Human Activity Recognition, Human Activity, Activity Recognition, gained significant importance, Restricted Boltzmann Machines

Comments: 48 pages, 21 Figures

Abstract:Human Activity Recognition (HAR) has gained significant importance with the growing use of sensor-equipped devices and large datasets. This paper evaluates the performance of three categories of models: classical machine learning, deep learning architectures, and Restricted Boltzmann Machines (RBMs) using five key benchmark datasets of HAR (UCI-HAR, OPPORTUNITY, PAMAP2, WISDM, and Berkeley MHAD). We assess various models, including Decision Trees, Random Forests, Convolutional Neural Networks (CNN), and Deep Belief Networks (DBNs), using metrics such as accuracy, precision, recall, and F1-score for a comprehensive comparison. The results show that CNN models offer superior performance across all datasets, especially on the Berkeley MHAD. Classical models like Random Forest do well on smaller datasets but face challenges with larger, more complex data. RBM-based models also show notable potential, particularly for feature learning. This paper offers a detailed comparison to help researchers choose the most suitable model for HAR tasks.

53. 【2501.08470】Detecting Contextual Anomalies by Discovering Consistent Spatial Regions

Link: https://arxiv.org/abs/2501.08470

Authors: Zhengye Yang, Richard J. Radke

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: video anomaly detection, modeling spatial context, enable video anomaly, anomaly detection, describe a method

Abstract:We describe a method for modeling spatial context to enable video anomaly detection. The main idea is to discover regions that share similar object-level activities by clustering joint object attributes using Gaussian mixture models. We demonstrate that this straightforward approach, using orders of magnitude fewer parameters than competing models, achieves state-of-the-art performance in the challenging spatial-context-dependent Street Scene dataset. As a side benefit, the high-resolution discovered regions learned by the model also provide explainable normalcy maps for human operators without the need for any pre-trained segmentation model.

54. 【2501.08465】Predicting Performance of Object Detection Models in Electron Microscopy Using Random Forests

Link: https://arxiv.org/abs/2501.08465

Authors: Ni Li, Ryan Jacobs, Matthew Lynch, Vidit Agrawal, Kevin Field, Dane Morgan

Categories: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci)

Keywords: applied machine learning, object detection models, applying object detection, object detection, forest regression model

Comments: 14 pages, 9 figures, 3 tables

Abstract:Quantifying prediction uncertainty when applying object detection models to new, unlabeled datasets is critical in applied machine learning. This study introduces an approach to estimate the performance of deep learning-based object detection models for quantifying defects in transmission electron microscopy (TEM) images, focusing on detecting irradiation-induced cavities in TEM images of metal alloys. We developed a random forest regression model that predicts the object detection F1 score, a statistical metric used to evaluate the ability to accurately locate and classify objects of interest. The random forest model uses features extracted from the predictions of the object detection model whose uncertainty is being quantified, enabling fast prediction on new, unlabeled images. The mean absolute error (MAE) for predicting F1 of the trained model on test data is 0.09, and the $R^2$ score is 0.77, indicating there is a significant correlation between the random forest regression model predicted and true defect detection F1 scores. The approach is shown to be robust across three distinct TEM image datasets with varying imaging and material domains. Our approach enables users to estimate the reliability of a defect detection and segmentation model predictions and assess the applicability of the model to their specific datasets, providing valuable information about possible domain shifts and whether the model needs to be fine-tuned or trained on additional data to be maximally effective for the desired use case.
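
The two figures quoted in the abstract, mean absolute error and $R^2$, follow standard definitions that can be computed in a few lines. The F1 values below are invented purely to exercise the formulas and are not data from the paper.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up example: true detection F1 per image vs. F1 predicted by a regressor.
true_f1 = [0.5, 0.7, 0.9]
predicted_f1 = [0.6, 0.7, 0.8]

mae = mean_absolute_error(true_f1, predicted_f1)
r2 = r2_score(true_f1, predicted_f1)
```

An MAE near zero means the predicted F1 is close to the true F1 on average, while $R^2$ measures how much of the variation in true F1 across images the regressor explains.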

55. 【2501.08460】Towards Zero-Shot Explainable Video Description by Reasoning over Graphs of Events in Space and Time

Link: https://arxiv.org/abs/2501.08460

Authors: Mihai Masala, Marius Leordeanu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Machine Learning, era of Machine, natural language processing, language processing, current era

Abstract:In the current era of Machine Learning, Transformers have become the de facto approach across a variety of domains, such as computer vision and natural language processing. Transformer-based solutions are the backbone of current state-of-the-art methods for language generation, image and video classification, segmentation, action and object recognition, among many others. Interestingly enough, while these state-of-the-art methods produce impressive results in their respective domains, the problem of understanding the relationship between vision and language is still beyond our reach. In this work, we propose a common ground between vision and language based on events in space and time in an explainable and programmatic way, to connect learning-based vision and language state of the art models and provide a solution to the long standing problem of describing videos in natural language. We validate that our algorithmic approach is able to generate coherent, rich and relevant textual descriptions on videos collected from a variety of datasets, using both standard metrics (e.g. Bleu, ROUGE) and the modern LLM-as-a-Jury approach.

56. 【2501.08453】Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models

Link: https://arxiv.org/abs/2501.08453

Authors: Weichen Fan, Chenyang Si, Junhao Song, Zhenyu Yang, Yinan He, Long Zhuo, Ziqi Huang, Ziyue Dong, Jingwen He, Dongwei Pan, Yi Wang, Yuming Jiang, Yaohui Wang, Peng Gao, Xinyuan Chen, Hengjie Li, Dahua Lin, Yu Qiao, Ziwei Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: parallel transformer architecture, transformer architecture designed, Multimodal Diffusion Block, models for large-scale, parallel transformer

Abstract:We present Vchitect-2.0, a parallel transformer architecture designed to scale up video diffusion models for large-scale text-to-video generation. The overall Vchitect-2.0 system has several key designs. (1) By introducing a novel Multimodal Diffusion Block, our approach achieves consistent alignment between text descriptions and generated video frames, while maintaining temporal coherence across sequences. (2) To overcome memory and computational bottlenecks, we propose a Memory-efficient Training framework that incorporates hybrid parallelism and other memory reduction techniques, enabling efficient training of long video sequences on distributed systems. (3) Additionally, our enhanced data processing pipeline ensures the creation of Vchitect T2V DataVerse, a high-quality million-scale training dataset through rigorous annotation and aesthetic evaluation. Extensive benchmarking demonstrates that Vchitect-2.0 outperforms existing methods in video quality, training efficiency, and scalability, serving as a suitable base for high-fidelity video generation.

57. 【2501.08446】Poseidon: A ViT-based Architecture for Multi-Frame Pose Estimation with Adaptive Frame Weighting and Multi-Scale Feature Fusion

Link: https://arxiv.org/abs/2501.08446

Authors: Cesare Davide Pace, Alessandro Marco De Nunzio, Claudio De Stefano, Francesco Fontanella, Mario Molinara

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: localising human joints, Human pose estimation, localising human, human joints, pose estimation

Abstract:Human pose estimation, a vital task in computer vision, involves detecting and localising human joints in images and videos. While single-frame pose estimation has seen significant progress, it often fails to capture the temporal dynamics for understanding complex, continuous movements. We propose Poseidon, a novel multi-frame pose estimation architecture that extends the ViTPose model by integrating temporal information for enhanced accuracy and robustness to address these limitations. Poseidon introduces key innovations: (1) an Adaptive Frame Weighting (AFW) mechanism that dynamically prioritises frames based on their relevance, ensuring that the model focuses on the most informative data; (2) a Multi-Scale Feature Fusion (MSFF) module that aggregates features from different backbone layers to capture both fine-grained details and high-level semantics; and (3) a Cross-Attention module for effective information exchange between central and contextual frames, enhancing the model's temporal coherence. The proposed architecture improves performance in complex video scenarios and offers scalability and computational efficiency suitable for real-world applications. Our approach achieves state-of-the-art performance on the PoseTrack21 and PoseTrack18 datasets, achieving mAP scores of 88.3 and 87.8, respectively, outperforming existing methods.

58. 【2501.08443】Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models

Link: https://arxiv.org/abs/2501.08443

Authors: Xu Li, Yi Zheng, Haotian Chen, Xiaolei Chen, Yuxuan Liang, Chenghang Lai

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Large Vision-Language Models, large language models, achieved significant success, Vision-Language Models, language models

Abstract:Large Vision-Language Models (LVLMs) have achieved significant success in multimodal tasks by combining pre-trained vision encoders and large language models. However, current LVLMs mainly rely on features from the final layers of the vision encoder, neglecting complementary information in shallower layers. While recent methods have explored multi-layer features, they are often task-agnostic. We investigate the contributions of visual features from different encoder layers across 18 benchmarks and 6 task categories. Our results show that multi-layer features provide complementary strengths with varying task dependencies, and uniform fusion performs suboptimally. Based on these findings, we propose an instruction-guided vision aggregator that dynamically integrates multi-layer features based on textual instructions, without increasing the number of visual tokens. Extensive evaluations show superior performance, and analysis reveals the dominance of mid-to-high-level features in semantic tasks and the critical role of low-level features in fine-grained perception. This work provides valuable insights into the adaptive use of hierarchical visual features in LVLMs, advancing more flexible multimodal systems.

59. 【2501.08440】FARE: A Deep Learning-Based Framework for Radar-based Face Recognition and Out-of-distribution Detection

Link: https://arxiv.org/abs/2501.08440

Authors: Sabri Mustafa Kahya, Boran Hamdi Sivrikaya, Muhammet Sami Yavuz, Eckehard Steinbach

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP)

Keywords: short-range FMCW radar, OOD detection, OOD detection AUROC, micro Range-Doppler Images, FMCW radar

Comments: Accepted at ICASSP 2025

Abstract:In this work, we propose a novel pipeline for face recognition and out-of-distribution (OOD) detection using short-range FMCW radar. The proposed system utilizes Range-Doppler and micro Range-Doppler Images. The architecture features a primary path (PP) responsible for the classification of in-distribution (ID) faces, complemented by intermediate paths (IPs) dedicated to OOD detection. The network is trained in two stages: first, the PP is trained using triplet loss to optimize ID face classification. In the second stage, the PP is frozen, and the IPs, which comprise simple linear autoencoder networks, are trained specifically for OOD detection. Using our dataset generated with a 60 GHz FMCW radar, our method achieves an ID classification accuracy of 99.30% and an OOD detection AUROC of 96.91%.
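
The two training signals named in the abstract can be sketched in their textbook forms. The triplet loss below is the standard margin-based formulation with Euclidean distance. Using autoencoder reconstruction error as an OOD score is a common practice but an assumption here, since the abstract does not say how the intermediate paths score a sample. The margin value and function names are illustrative.

```python
import math

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: pull the positive closer than the negative by `margin`."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

def reconstruction_error(x, x_hat):
    """Mean squared error between an input and its autoencoder reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

# Embeddings of the same identity are close, so this triplet is already satisfied.
satisfied = triplet_loss((0.0, 0.0), (0.0, 1.0), (3.0, 4.0))
# Swapping positive and negative violates the margin and yields a positive loss.
violated = triplet_loss((0.0, 0.0), (3.0, 4.0), (0.0, 1.0))

# Assumed OOD score: a poorly reconstructed sample gets a larger error.
id_score = reconstruction_error([1.0, 2.0], [1.0, 2.0])
ood_score = reconstruction_error([1.0, 2.0], [0.0, 2.0])
```

Under that assumption, thresholding the reconstruction error separates samples the autoencoder has learned to reproduce from unfamiliar ones, while the triplet loss shapes the embedding space used for ID classification.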

60. 【2501.08415】Cross-Modal Transferable Image-to-Video Attack on Video Quality Metrics

Link: https://arxiv.org/abs/2501.08415

Authors: Georgii Gotin, Ekaterina Shumitskaya, Anastasia Antsiferova, Dmitriy Vatolin

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Recent studies, video quality assessment, VQA, studies have revealed, quality assessment

Comments: Accepted for VISAPP 2025

Abstract:Recent studies have revealed that modern image and video quality assessment (IQA/VQA) metrics are vulnerable to adversarial attacks. An attacker can manipulate a video through preprocessing to artificially increase its quality score according to a certain metric, despite no actual improvement in visual quality. Most of the attacks studied in the literature are white-box attacks, while black-box attacks in the context of VQA have received less attention. Moreover, some research indicates a lack of transferability of adversarial examples generated for one model to another when applied to VQA. In this paper, we propose a cross-modal attack method, IC2VQA, aimed at exploring the vulnerabilities of modern VQA models. This approach is motivated by the observation that the low-level feature spaces of images and videos are similar. We investigate the transferability of adversarial perturbations across different modalities; specifically, we analyze how adversarial perturbations generated on a white-box IQA model with an additional CLIP module can effectively target a VQA model. The addition of the CLIP module serves as a valuable aid in increasing transferability, as the CLIP model is known for its effective capture of low-level semantics. Extensive experiments demonstrate that IC2VQA achieves a high success rate in attacking three black-box VQA models. We compare our method with existing black-box attack strategies, highlighting its superiority in terms of attack success within the same number of iterations and levels of attack strength. We believe that the proposed method will contribute to the deeper analysis of robust VQA metrics.

61. 【2501.08411】BiDepth Multimodal Neural Network: Bidirectional Depth Deep Learning Architecture for Spatial-Temporal Prediction

Link: https://arxiv.org/abs/2501.08411

Authors: Sina Ehsani, Fenglian Pan, Qingpei Hu, Jian Liu

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP)

Keywords: dynamic systems, challenging problem, Accurate prediction, mobility and weather, crucial yet challenging

Notes: This paper has been submitted to Applied Intelligence for review

Abstract:Accurate prediction of spatial-temporal (ST) information in dynamic systems, such as urban mobility and weather patterns, is a crucial yet challenging problem. The complexity stems from the intricate interplay between spatial proximity and temporal relevance, where both long-term trends and short-term fluctuations are present in convoluted patterns. Existing approaches, including traditional statistical methods and conventional neural networks, may provide inaccurate results due to the lack of an effective mechanism that simultaneously incorporates information at variable temporal depths while maintaining spatial context, resulting in a trade-off between comprehensive long-term historical analysis and responsiveness to short-term new information. To bridge this gap, this paper proposes the BiDepth Multimodal Neural Network (BDMNN) with bidirectional depth modulation that enables a comprehensive understanding of both long-term seasonality and short-term fluctuations, adapting to the complex ST context. Case studies with real-world public data demonstrate significant improvements in prediction accuracy, with a 12% reduction in Mean Squared Error for urban traffic prediction and a 15% improvement in rain precipitation forecasting compared to state-of-the-art benchmarks, without demanding extra computational resources.

62. 【2501.08408】Leveraging 2D Masked Reconstruction for Domain Adaptation of 3D Pose Estimation

Link: https://arxiv.org/abs/2501.08408

Authors: Hansoo Park, Chanwoo Kim, Jihyeon Kim, Hoseong Cho, Nhat Nguyen Bao Truong, Taehwan Kim, Seungryul Baek

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: emergence of high-quality, development of deep, deep learning, existing methods, data

Notes: 16 pages, 7 figures

Abstract:RGB-based 3D pose estimation methods have been successful with the development of deep learning and the emergence of high-quality 3D pose datasets. However, most existing methods do not operate well for testing images whose distribution is far from that of training data. This problem might be alleviated by involving diverse data during training, but it is non-trivial to collect such diverse data with corresponding labels (i.e. 3D pose). In this paper, we introduce an unsupervised domain adaptation framework for 3D pose estimation that utilizes unlabeled data in addition to labeled data via a masked image modeling (MIM) framework. Foreground-centric reconstruction and attention regularization are further proposed to increase the effectiveness of unlabeled data usage. Experiments are conducted on various datasets in human and hand pose estimation tasks, especially in the cross-domain scenario. We demonstrate the effectiveness of our framework by achieving state-of-the-art accuracy on all datasets.
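For readers unfamiliar with masked image modeling, the masking step can be sketched as below. This is only an illustration under stated assumptions: it uses uniform random patch masking, which is the generic MIM recipe and not the foreground-centric reconstruction the paper proposes. The function name `random_patch_mask`, the 196-patch grid, and the 0.75 mask ratio are all hypothetical choices made for the example.

```python
import random
from typing import List


def random_patch_mask(num_patches: int, mask_ratio: float, seed: int = 0) -> List[bool]:
    """Return one flag per image patch; True means the patch is hidden.

    Assumption: patches are masked uniformly at random. The paper's
    foreground-centric variant is NOT implemented here.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    num_masked = round(num_patches * mask_ratio)
    rng = random.Random(seed)  # seeded so the mask is reproducible
    masked_ids = set(rng.sample(range(num_patches), num_masked))
    return [i in masked_ids for i in range(num_patches)]


# Hypothetical setting: a 14 x 14 patch grid with 75% of the patches hidden.
mask = random_patch_mask(num_patches=196, mask_ratio=0.75)
```

A reconstruction model would then see only the patches where the flag is False and be trained to predict the hidden ones, which requires no pose labels.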

63. 【2501.08370】3D Gaussian Splatting with Normal Information for Mesh Extraction and Improved Rendering

Link: https://arxiv.org/abs/2501.08370

Authors: Meenakshi Krishnan, Liam Fowl, Ramani Duraiswami

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: real-time novel-view synthesis, representing complex scenes, enabling high-quality real-time, high-quality real-time novel-view, flexible rendering technique

Notes: ICASSP 2025: Workshop on Generative Data Augmentation for Real-World Signal Processing Applications

Abstract:Differentiable 3D Gaussian splatting has emerged as an efficient and flexible rendering technique for representing complex scenes from a collection of 2D views and enabling high-quality real-time novel-view synthesis. However, its reliance on photometric losses can lead to imprecisely reconstructed geometry and extracted meshes, especially in regions with high curvature or fine detail. We propose a novel regularization method using the gradients of a signed distance function estimated from the Gaussians, to improve the quality of rendering while also extracting a surface mesh. The regularizing normal supervision facilitates better rendering and mesh reconstruction, which is crucial for downstream applications in video generation, animation, AR-VR and gaming. We demonstrate the effectiveness of our approach on datasets such as Mip-NeRF360, Tanks and Temples, and Deep-Blending. Our method scores higher on photorealism metrics compared to other mesh extracting rendering methods without compromising mesh quality.

64. 【2501.08361】Weight Averaging for Out-of-Distribution Generalization and Few-Shot Domain Adaptation

Link: https://arxiv.org/abs/2501.08361

Authors: Shijian Xu

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Empirical risk minimization, Empirical risk, generalization, ERM, generalization performance

Notes: Master Thesis

Abstract:Empirical risk minimization (ERM) is not robust to changes in the distribution of data. When the distribution of test data is different from that of training data, the problem is known as out-of-distribution generalization. Recently, two techniques have been developed for addressing out-of-distribution generalization in computer vision: weight averaging (WA) and sharpness-aware minimization (SAM). WA involves training multiple models with different hyperparameters and then averaging the weights of these models, which can significantly improve out-of-distribution generalization performance. SAM optimizes a neural network to find minima in flat regions, which have been proven to perform well under distribution shifts. While these techniques have made great progress, there is still room for improvement and further exploration. In this thesis, we propose increasing the model diversity in WA explicitly by introducing gradient similarity as a loss regularizer to further improve out-of-distribution generalization performance. We also propose combining WA and SAM to solve the problem of few-shot domain adaptation. Our extensive experiments on digits datasets (MNIST, SVHN, USPS, MNIST-M) and other domain adaptation datasets (VLCS, PACS) show that combining WA and SAM leads to improved out-of-distribution generalization performance and significantly increases few-shot domain adaptation accuracy.
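The weight averaging step described in the abstract can be sketched in a few lines. This is a minimal illustration, assuming each trained model is represented as a plain dictionary mapping a parameter name to a flat list of floats; the names `average_weights`, `model_a`, and `model_b` are hypothetical, and the gradient-similarity regularizer and SAM from the thesis are not shown.

```python
from typing import Dict, List

# Assumed representation: parameter name -> flat list of values.
StateDict = Dict[str, List[float]]


def average_weights(models: List[StateDict]) -> StateDict:
    """Average the parameters of several models element by element."""
    if not models:
        raise ValueError("need at least one model")
    averaged: StateDict = {}
    for name in models[0]:
        # Group the i-th value of this parameter across all models.
        columns = zip(*(m[name] for m in models))
        averaged[name] = [sum(col) / len(models) for col in columns]
    return averaged


# Two toy "models" standing in for runs with different hyperparameters.
model_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
model_b = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
averaged = average_weights([model_a, model_b])
```

The averaged dictionary would then be loaded into a single network of the same architecture, so inference costs the same as for one model.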

65. 【2501.08352】A Preliminary Survey of Semantic Descriptive Model for Images

Link: https://arxiv.org/abs/2501.08352

Authors: Chengxi Yan, Jie Jian, Yang Li

Categories: Digital Libraries (cs.DL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Ancient Chinese Paintings, Beijing Palace Museum, Palace Museum ACP, Museum ACP collections, Chinese Paintings

Notes: 3 pages, 2 figures

Abstract:Considering the lack of a unified framework for image description and deep cultural analysis at the subject level in the field of Ancient Chinese Paintings (ACP), this study utilized the Beijing Palace Museum's ACP collections to develop a semantic descriptive model (SDM) integrating iconological theory with a new workflow for term extraction and mapping. Our findings underscore the model's effectiveness. SDM can be used to support further art-related knowledge organization and cultural exploration of ACPs.

66. 【2501.08347】SCOT: Self-Supervised Contrastive Pretraining For Zero-Shot Compositional Retrieval

Link: https://arxiv.org/abs/2501.08347

Authors: Bhavin Jawade, Joao V. B. Soares, Kapil Thadani, Deen Dayal Mohan, Amir Erfan Eshratifar, Benjamin Culpepper, Paloma de Juan, Srirangaraj Setlur, Venu Govindaraju

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: multimodal learning task, user-provided text modification, CIR finds applications, modification to retrieve, CIR

Notes: Paper accepted at WACV 2025 in round 1

Abstract:Compositional image retrieval (CIR) is a multimodal learning task where a model combines a query image with a user-provided text modification to retrieve a target image. CIR finds applications in a variety of domains including product retrieval (e-commerce) and web search. Existing methods primarily focus on fully-supervised learning, wherein models are trained on datasets of labeled triplets such as FashionIQ and CIRR. This poses two significant challenges: (i) curating such triplet datasets is labor intensive; and (ii) models lack generalization to unseen objects and domains. In this work, we propose SCOT (Self-supervised COmpositional Training), a novel zero-shot compositional pretraining strategy that combines existing large image-text pair datasets with the generative capabilities of large language models to contrastively train an embedding composition network. Specifically, we show that the text embedding from a large-scale contrastively-pretrained vision-language model can be utilized as proxy target supervision during compositional pretraining, replacing the target image embedding. In zero-shot settings, this strategy surpasses SOTA zero-shot compositional retrieval methods as well as many fully-supervised methods on standard benchmarks such as FashionIQ and CIRR.
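The idea of using a text embedding as the proxy target can be sketched with a standard contrastive objective. This is a hedged illustration and not the authors' implementation: the abstract does not state the exact loss, so the InfoNCE form, the 0.07 temperature, and the names `cosine` and `info_nce` are assumptions, and the toy two-dimensional vectors stand in for real embeddings.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(composed: List[Sequence[float]],
             targets: List[Sequence[float]],
             temperature: float = 0.07) -> float:
    """Mean InfoNCE loss where targets[i] is the positive for composed[i].

    Assumption: `targets` holds text embeddings of the target captions,
    used in place of target image embeddings as the abstract describes.
    """
    total = 0.0
    for i, query in enumerate(composed):
        logits = [cosine(query, t) / temperature for t in targets]
        top = max(logits)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(composed)


# Toy batch: composed (image + modification text) embeddings and proxy targets.
composed_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(composed_batch, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = info_nce(composed_batch, [[0.0, 1.0], [1.0, 0.0]])
```

When each composed embedding matches its own proxy target, the loss is close to zero; pairing it with the wrong target makes the loss large, which is the signal that would train the composition network.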

67. 【2501.09001】Vision Foundation Models for Computed Tomography

Link: https://arxiv.org/abs/2501.09001

Authors: Suraj Pai (1 and 2 and 3), Ibrahim Hadzic (1 and 2 and 3), Dennis Bontempi (1 and 2 and 3), Keno Bressem (4 and 5), Benjamin H. Kann (1 and 3), Andriy Fedorov (6), Raymond H. Mak (1 and 3), Hugo J. W. L. Aerts (1 and 2 and 3 and 6) ((1) Artificial Intelligence in Medicine (AIM) Program, Mass General Brigham, Harvard Medical School, (2) Radiology and Nuclear Medicine, CARIM & GROW, Maastricht University, (3) Department of Radiation Oncology, Brigham and Women's Hospital, Dana-Farber Cancer Institute, Harvard Medical School, (4) Department of Diagnostic and Interventional Radiology, Technical University of Munich, School of Medicine and Health, Klinikum rechts der Isar, TUM University Hospital, (5) Department of Cardiovascular Radiology and Nuclear Medicine, Technical University of Munich, School of Medicine and Health, German Heart Center, TUM University Hospital, (6) Department of Radiology, Brigham and Women's Hospital, Dana-Farber Cancer Institute, Harvard Medical School)

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: shown transformative potential, performing diverse, shown transformative, transformative potential, Imaging Data Commons

Notes: 6 figures, followed by 9 Extended Data Figures and a Supplementary Information document

Abstract:Foundation models (FMs) have shown transformative potential in radiology by performing diverse, complex tasks across imaging modalities. Here, we developed CT-FM, a large-scale 3D image-based pre-trained model designed explicitly for various radiological tasks. CT-FM was pre-trained using 148,000 computed tomography (CT) scans from the Imaging Data Commons through label-agnostic contrastive learning. We evaluated CT-FM across four categories of tasks, namely, whole-body and tumor segmentation, head CT triage, medical image retrieval, and semantic understanding, showing superior performance against state-of-the-art models. Beyond quantitative success, CT-FM demonstrated the ability to cluster regions anatomically and identify similar anatomical and structural concepts across scans. Furthermore, it remained robust across test-retest settings and indicated reasonable salient regions attached to its embeddings. This study demonstrates the value of large-scale medical imaging foundation models and by open-sourcing the model weights, code, and data, aims to support more adaptable, reliable, and interpretable AI solutions in radiology.

68. 【2501.08902】Multi-View Transformers for Airway-To-Lung Ratio Inference on Cardiac CT Scans: The C4R Study

Link: https://arxiv.org/abs/2501.08902

Authors: Sneha N. Naik, Elsa D. Angelini, Eric A. Hoffman, Elizabeth C. Oelsner, R. Graham Barr, Benjamin M. Smith, Andrew F. Laine

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: obstructive pulmonary disease, airway tree lumen, major risk factor, chronic obstructive pulmonary, full-lung computed tomography

Notes: Accepted to appear in Proceedings of International Symposium on Biomedical Imaging (ISBI), 2025

Abstract:The ratio of airway tree lumen to lung size (ALR), assessed at full inspiration on high resolution full-lung computed tomography (CT), is a major risk factor for chronic obstructive pulmonary disease (COPD). There is growing interest to infer ALR from cardiac CT images, which are widely available in epidemiological cohorts, to investigate the relationship of ALR to severe COVID-19 and post-acute sequelae of SARS-CoV-2 infection (PASC). Previously, cardiac scans included approximately 2/3 of the total lung volume with 5-6x greater slice thickness than high-resolution (HR) full-lung (FL) CT. In this study, we present a novel attention-based Multi-view Swin Transformer to infer FL ALR values from segmented cardiac CT scans. For the supervised training we exploit paired full-lung and cardiac CTs acquired in the Multi-Ethnic Study of Atherosclerosis (MESA). Our network significantly outperforms a proxy direct ALR inference on segmented cardiac CT scans and achieves accuracy and reproducibility comparable with a scan-rescan reproducibility of the FL ALR ground-truth.

69. 【2501.08819】Boosting Diffusion Guidance via Learning Degradation-Aware Models for Blind Super Resolution

Link: https://arxiv.org/abs/2501.08819

Authors: Shao-Hao Lu, Ren Wang, Ching-Chun Huang, Wei-Chen Chiu

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: abundant high-frequency detail, shown great ability, generate high-resolution images, diffusion-based blind super-resolution, expense of fidelity

Notes: To appear in WACV 2025. Code is available at: [this https URL](https://github.com/ryanlu2240/Boosting-Diffusion-Guidance-via-Learning-Degradation-Aware-Models-for-Blind-Super-Resolution)

Abstract:Recently, diffusion-based blind super-resolution (SR) methods have shown great ability to generate high-resolution images with abundant high-frequency detail, but the detail is often achieved at the expense of fidelity. Meanwhile, another line of research focusing on rectifying the reverse process of diffusion models (i.e., diffusion guidance) has demonstrated the power to generate high-fidelity results for non-blind SR. However, these methods rely on known degradation kernels, making them difficult to apply to blind SR. To address these issues, we introduce degradation-aware models that can be integrated into the diffusion guidance framework, eliminating the need to know degradation kernels. Additionally, we propose two novel techniques, input perturbation and guidance scalar, to further improve our performance. Extensive experimental results show that our proposed method has superior performance over state-of-the-art methods on blind SR benchmarks.

70. 【2501.08667】TimeFlow: Longitudinal Brain Image Registration and Aging Progression Analysis

Link: https://arxiv.org/abs/2501.08667

Authors: Bailiang Jian, Jiazhen Pan, Yitong Li, Fabian Bongratz, Ruochen Li, Daniel Rueckert, Benedikt Wiestler, Christian Wachinger

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Predicting future brain, brain MRI registration, Longitudinal brain MRI, Predicting future, brain MRI

Abstract:Predicting future brain states is crucial for understanding healthy aging and neurodegenerative diseases. Longitudinal brain MRI registration, a cornerstone for such analyses, has long been limited by its inability to forecast future developments, reliance on extensive, dense longitudinal data, and the need to balance registration accuracy with temporal smoothness. In this work, we present TimeFlow, a novel framework for longitudinal brain MRI registration that overcomes all these challenges. Leveraging a U-Net architecture with temporal conditioning inspired by diffusion models, TimeFlow enables accurate longitudinal registration and facilitates prospective analyses through future image prediction. Unlike traditional methods that depend on explicit smoothness regularizers and dense sequential data, TimeFlow achieves temporal consistency and continuity without these constraints. Experimental results highlight its superior performance in both future timepoint prediction and registration accuracy compared to state-of-the-art methods. Additionally, TimeFlow supports novel biological brain aging analyses, effectively differentiating neurodegenerative conditions from healthy aging. It eliminates the need for segmentation, thereby avoiding the challenges of non-trivial annotation and inconsistent segmentation errors. TimeFlow paves the way for accurate, data-efficient, and annotation-free prospective analyses of brain aging and chronic diseases.

71. 【2501.08662】Product of Gaussian Mixture Diffusion Model for non-linear MRI Inversion

Link: https://arxiv.org/abs/2501.08662

Authors: Laurenz Nagler, Martin Zach, Thomas Pock

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: recently shown remarkable, magnetic resonance imaging, shown remarkable results, resonance imaging reconstruction, recently shown

Abstract:Diffusion models have recently shown remarkable results in magnetic resonance imaging reconstruction. However, the employed networks typically are black-box estimators of the (smoothed) prior score with tens of millions of parameters, restricting interpretability and increasing reconstruction time. Furthermore, parallel imaging reconstruction algorithms either rely on off-line coil sensitivity estimation, which is prone to misalignment and restricts sampling trajectories, or perform per-coil reconstruction, making the computational cost proportional to the number of coils. To overcome this, we jointly reconstruct the image and the coil sensitivities using the lightweight, parameter-efficient, and interpretable product of Gaussian mixture diffusion model as an image prior and a classical smoothness prior on the coil sensitivities. The proposed method delivers promising results while allowing for fast inference and demonstrating robustness to contrast out-of-distribution data and sampling trajectories, comparable to classical variational penalties such as total variation. Finally, the probabilistic formulation allows the calculation of the posterior expectation and pixel-wise variance.

72. 【2501.08585】A Systematic Review of Machine Learning Methods for Multimodal EEG Data in Clinical Application

Link: https://arxiv.org/abs/2501.08585

Authors: Siqi Zhao (1), Wangyang Li (1), Xiru Wang (1), Stevie Foglia (2), Hongzhao Tan (1), Bohan Zhang (1), Ameer Hamoodi (2), Aimee Nelson (2 and 3), Zhen Gao (1 and 2) ((1) W Booth School of Engineering Practice and Technology, McMaster University, Hamilton, Ontario, Canada, (2) School of Biomedical Engineering, McMaster University, Hamilton, Ontario, Canada, (3) Department of Kinesiology, McMaster University, Hamilton, Ontario, Canada)

Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: multimodal EEG data, deep learning, multimodal EEG, EEG data, EEG

Notes: This paper includes 4 figures, 6 tables, and totals 18 pages

Abstract:Machine learning (ML) and deep learning (DL) techniques have been widely applied to analyze electroencephalography (EEG) signals for disease diagnosis and brain-computer interfaces (BCI). The integration of multimodal data has been shown to enhance the accuracy of ML and DL models. Combining EEG with other modalities can improve clinical decision-making by addressing complex tasks in clinical populations. This systematic literature review explores the use of multimodal EEG data in ML and DL models for clinical applications. A comprehensive search was conducted across PubMed, Web of Science, and Google Scholar, yielding 16 relevant studies after three rounds of filtering. These studies demonstrate the application of multimodal EEG data in addressing clinical challenges, including neuropsychiatric disorders, neurological conditions (e.g., seizure detection), neurodevelopmental disorders (e.g., autism spectrum disorder), and sleep stage classification. Data fusion occurred at three levels: signal, feature, and decision levels. The most commonly used ML models were support vector machines (SVM) and decision trees. Notably, 11 out of the 16 studies reported improvements in model accuracy with multimodal EEG data. This review highlights the potential of multimodal EEG-based ML models in enhancing clinical diagnostics and problem-solving.

73. 【2501.08495】Automotive Elevation Mapping with Interferometric Synthetic Aperture Radar

Link: https://arxiv.org/abs/2501.08495

Authors: Leyla A. Kabuli, Griffin Foster

Categories: Signal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

Keywords: arrival analysis, resolution and sensitivity, low-cost and ubiquitous, performing direction, direction of arrival

Notes: 9 pages, 6 figures

Abstract:Radar is a low-cost and ubiquitous automotive sensor, but is limited by array resolution and sensitivity when performing direction of arrival analysis. Synthetic Aperture Radar (SAR) is a class of techniques to improve azimuth resolution and sensitivity for radar. Interferometric SAR (InSAR) can be used to extract elevation from the variations in phase measurements in SAR images. Utilizing InSAR we show that a typical, low-resolution radar array mounted on a vehicle can be used to accurately localize detections in 3D space for both urban and agricultural environments. We generate point clouds in each environment by combining InSAR with a signal processing scheme tailored to automotive driving. This low-compute approach allows radar to be used as a primary sensor to map fine details in complex driving environments, and be used to make autonomous perception decisions.

74. 【2501.08458】RWKV-UNet: Improving UNet with Long-Range Cooperation for Effective Medical Image Segmentation

Link: https://arxiv.org/abs/2501.08458

Authors: Juntao Jiang, Jiangning Zhang, Weixuan Liu, Muxuan Gao, Xiaobin Hu, Xiaoxiao Yan, Feiyue Huang, Yong Liu

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: convolutional neural networks, Receptance Weighted Key, recent years, neural networks, medical image analysis

Abstract:In recent years, there have been significant advancements in deep learning for medical image analysis, especially with convolutional neural networks (CNNs) and transformer models. However, CNNs face limitations in capturing long-range dependencies while transformers suffer high computational complexities. To address this, we propose RWKV-UNet, a novel model that integrates the RWKV (Receptance Weighted Key Value) structure into the U-Net architecture. This integration enhances the model's ability to capture long-range dependencies and improve contextual understanding, which is crucial for accurate medical image segmentation. We build a strong encoder with developed inverted residual RWKV (IR-RWKV) blocks combining CNNs and RWKVs. We also propose a Cross-Channel Mix (CCM) module to improve skip connections with multi-scale feature fusion, achieving global channel information integration. Experiments on benchmark datasets, including Synapse, ACDC, BUSI, CVC-ClinicDB, CVC-ColonDB, Kvasir-SEG, ISIC 2017 and GLAS show that RWKV-UNet achieves state-of-the-art performance on various types of medical image segmentation. Additionally, smaller variants, RWKV-UNet-S and RWKV-UNet-T, balance accuracy and computational efficiency, making them suitable for broader clinical applications.

75. 【2501.08334】High-throughput digital twin framework for predicting neurite deterioration using MetaFormer attention

Link: https://arxiv.org/abs/2501.08334

Authors: Kuanren Qian, Genesis Omana Suarez, Toshihiko Nambara, Takahisa Kanekiyo, Yongjie Jessica Zhang

Categories: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: autism spectrum disorder, peripheral nervous systems, including autism spectrum, Neurodevelopmental disorders, hyperactivity disorder

Notes: 17 pages, 8 figures

Abstract:Neurodevelopmental disorders (NDDs) cover a variety of conditions, including autism spectrum disorder, attention-deficit/hyperactivity disorder, and epilepsy, which impair the central and peripheral nervous systems. Their high comorbidity and complex etiologies present significant challenges for accurate diagnosis and effective treatments. Conventional clinical and experimental studies are time-intensive, burdening research progress considerably. This paper introduces a high-throughput digital twin framework for modeling neurite deteriorations associated with NDDs, integrating synthetic data generation, experimental images, and machine learning (ML) models. The synthetic data generator utilizes an isogeometric analysis (IGA)-based phase field model to capture diverse neurite deterioration patterns such as neurite retraction, atrophy, and fragmentation while mitigating the limitations of scarce experimental data. The ML model utilizes MetaFormer-based gated spatiotemporal attention architecture with deep temporal layers and provides fast predictions. The framework effectively captures long-range temporal dependencies and intricate morphological transformations with average errors of 1.9641% and 6.0339% for synthetic and experimental neurite deterioration, respectively. Seamlessly integrating simulations, experiments, and ML, the digital twin framework can guide researchers to make informed experimental decisions by predicting potential experimental outcomes, significantly reducing costs and saving valuable time. It can also advance our understanding of neurite deterioration and provide a scalable solution for exploring complex neurological mechanisms, contributing to the development of targeted treatments.