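This post lists the latest papers fetched daily from arXiv, grouped into categories such as natural language processing, information retrieval, and computer vision.

Statistics

471 papers were updated today, including:

  • Natural language processing: 52
  • Information retrieval: 9
  • Computer vision: 150

Natural Language Processing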

1. [2411.18620] Cross-modal Information Flow in Multimodal Large Language Models

Link: https://arxiv.org/abs/2411.18620

Authors: Zhi Zhang, Srishti Yadav, Fengze Han, Ekaterina Shutova

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrated promising progress, large language models, auto-regressive multimodal large, large language, multimodal large language

Comments: (none)

Abstract:The recent advancements in auto-regressive multimodal large language models (MLLMs) have demonstrated promising progress for vision-language tasks. While there exists a variety of studies investigating the processing of linguistic information within large language models, little is currently known about the inner working mechanism of MLLMs and how linguistic and visual information interact within these models. In this study, we aim to fill this gap by examining the information flow between different modalities -- language and vision -- in MLLMs, focusing on visual question answering. Specifically, given an image-question pair as input, we investigate where in the model and how the visual and linguistic information are combined to generate the final prediction. Conducting experiments with a series of models from the LLaVA series, we find that there are two distinct stages in the process of integration of the two modalities. In the lower layers, the model first transfers the more general visual features of the whole image into the representations of (linguistic) question tokens. In the middle layers, it once again transfers visual information about specific objects relevant to the question to the respective token positions of the question. Finally, in the higher layers, the resulting multimodal representation is propagated to the last position of the input sequence for the final prediction. Overall, our findings provide a new and comprehensive perspective on the spatial and functional aspects of image and language processing in the MLLMs, thereby facilitating future research into multimodal information localization and editing.

2. [2411.18583] Automated Literature Review Using NLP Techniques and LLM-Based Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2411.18583

Authors: Nurshat Fateh Ali, Md. Mahdi Mohtasim, Shakil Mosharrof, T. Gopi Krishna

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Large Language Model, Natural Language Processing, compares multiple approaches, Large Language, Language Model

Comments: Key Words: T5, SpaCy, Large Language Model, GPT, ROUGE, Literature Review, Natural Language Processing, Retrieval-augmented generation

Abstract:This research presents and compares multiple approaches to automate the generation of literature reviews using several Natural Language Processing (NLP) techniques and retrieval-augmented generation (RAG) with a Large Language Model (LLM). The ever-increasing number of research articles provides a huge challenge for manual literature review. It has resulted in an increased demand for automation. Developing a system capable of automatically generating the literature reviews from only the PDF files as input is the primary objective of this research work. The effectiveness of several Natural Language Processing (NLP) strategies, such as the frequency-based method (spaCy), the transformer model (Simple T5), and retrieval-augmented generation (RAG) with Large Language Model (GPT-3.5-turbo), is evaluated to meet the primary objective. The SciTLDR dataset is chosen for this research experiment and three distinct techniques are utilized to implement three different systems for auto-generating the literature reviews. The ROUGE scores are used for the evaluation of all three systems. Based on the evaluation, the Large Language Model GPT-3.5-turbo achieved the highest ROUGE-1 score, 0.364. The transformer model comes in second place and spaCy is at the last position. Finally, a graphical user interface is created for the best system based on the large language model.
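The abstract above ranks the three systems by ROUGE-1. As a rough illustration of what that metric measures, here is a minimal sketch of unigram-overlap ROUGE-1. It assumes lowercased whitespace tokenization and reports precision, recall, and F1; the abstract does not say which variant or tooling the authors used, and the function name `rouge1` and the example sentences are hypothetical.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict:
    """Unigram-overlap ROUGE-1 (assumes lowercased whitespace tokens)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example pair, not taken from SciTLDR.
scores = rouge1(
    "the model generates a review",
    "the system generates a literature review",
)
print(scores)
```

Published ROUGE implementations typically add stemming and more careful tokenization, so scores from this sketch are not comparable with the 0.364 reported above.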

3. [2411.18577] On Importance of Code-Mixed Embeddings for Hate Speech Identification

Link: https://arxiv.org/abs/2411.18577

Authors: Shruti Jagdale, Omkar Khade, Gauri Takalikar, Mihir Inamdar, Raviraj Joshi

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: people commonly speak, India where people, commonly speak multiple, occurs in multilingual, multilingual communities

Comments: (none)

Abstract:Code-mixing is the practice of using two or more languages in a single sentence, which often occurs in multilingual communities such as India where people commonly speak multiple languages. Classic NLP tools, trained on monolingual data, face challenges when dealing with code-mixed data. Extracting meaningful information from sentences containing multiple languages becomes difficult, particularly in tasks like hate speech detection, due to linguistic variation, cultural nuances, and data sparsity. To address this, we aim to analyze the significance of code-mixed embeddings and evaluate the performance of BERT and HingBERT models (trained on a Hindi-English corpus) in hate speech detection. Our study demonstrates that HingBERT models, benefiting from training on the extensive Hindi-English dataset L3Cube-HingCorpus, outperform BERT models when tested on hate speech text datasets. We also found that code-mixed Hing-FastText performs better than standard English FastText and vanilla BERT models.

4. [2411.18571] Challenges in Adapting Multilingual LLMs to Low-Resource Languages using LoRA PEFT Tuning

Link: https://arxiv.org/abs/2411.18571

Authors: Omkar Khade, Shruti Jagdale, Abhishek Phaltankar, Gauri Takalikar, Raviraj Joshi

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: Large Language Models, demonstrated remarkable multilingual, Large Language, multilingual Gemma models, demonstrated remarkable

Comments: (none)

Abstract:Large Language Models (LLMs) have demonstrated remarkable multilingual capabilities, yet challenges persist in adapting these models for low-resource languages. In this study, we investigate the effects of Low-Rank Adaptation (LoRA) Parameter-Efficient Fine-Tuning (PEFT) on multilingual Gemma models for Marathi, a language with limited resources. Using a translated Alpaca dataset with 52,000 instruction-response pairs, our findings reveal that while evaluation metrics often show a performance decline post-fine-tuning, manual assessments frequently suggest that the fine-tuned models outperform their original counterparts. The observations indicate improvements in target language generation capabilities but a reduction in reasoning abilities following language adaptation. These results underscore the need for improved evaluation methodologies and the creation of high-quality native datasets to accurately assess language-specific model performance in low-resource settings.
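For readers unfamiliar with LoRA, the low-rank update used in the study above can be sketched in a few lines. This is a minimal, dependency-free illustration of the standard formulation W' = W + (alpha / r) * B A, not the authors' training code; the helper names, matrix sizes, and values are all hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, b, a, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the weight LoRA applies at inference.

    w: frozen weight (d_out x d_in); b: (d_out x rank); a: (rank x d_in).
    Only b and a would be trained; w stays frozen.
    """
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy 2x2 example with rank 1 (values chosen only for illustration).
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]
a = [[0.5, 0.0]]
print(lora_effective_weight(w, b, a, alpha=2.0, rank=1))
```

With rank much smaller than the layer dimensions, the trainable matrices B and A hold far fewer parameters than W, which is why the method is described as parameter-efficient.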

5. [2411.18564] A Pipeline of Neural-Symbolic Integration to Enhance Spatial Reasoning in Large Language Models

Link: https://arxiv.org/abs/2411.18564

Authors: Rong Wang, Kun Sun, Jonas Kuhn

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, demonstrated impressive capabilities, Answer Set Programming

Comments: (none)

Abstract:Large Language Models (LLMs) have demonstrated impressive capabilities across various tasks. However, LLMs often struggle with spatial reasoning which is one essential part of reasoning and inference and requires understanding complex relationships between objects in space. This paper proposes a novel neural-symbolic framework that enhances LLMs' spatial reasoning abilities. We evaluate our approach on two benchmark datasets: StepGame and SparQA, implementing three distinct strategies: (1) ASP (Answer Set Programming)-based symbolic reasoning, (2) LLM + ASP pipeline using DSPy, and (3) Fact + Logical rules. Our experiments demonstrate significant improvements over the baseline prompting methods, with accuracy increases of 40-50% on the StepGame dataset and 3-13% on the more complex SparQA dataset. The "LLM + ASP" pipeline achieves particularly strong results on the tasks of Finding Relations (FR) and Finding Block (FB) questions, though performance varies across different question types. The impressive results suggest that while neural-symbolic approaches offer promising directions for enhancing spatial reasoning in LLMs, their effectiveness depends heavily on the specific task characteristics and implementation strategies. We propose an integrated, simple yet effective set of strategies using a neural-symbolic pipeline to boost spatial reasoning abilities in LLMs. This pipeline and its strategies demonstrate strong and broader applicability to other reasoning domains in LLMs, such as temporal reasoning, deductive inference etc.

6. [2411.18553] Retrofitting (Large) Language Models with Dynamic Tokenization

Link: https://arxiv.org/abs/2411.18553

Authors: Darius Feher, Benjamin Minixhofer, Ivan Vulić

Categories: Computation and Language (cs.CL)

Keywords: Current language models, Current language, static subword tokenizer, Current, subword tokenizer

Comments: (none)

Abstract:Current language models (LMs) use a fixed, static subword tokenizer. This choice, often taken for granted, typically results in degraded efficiency and capabilities in languages other than English, and makes it challenging to apply LMs to new domains or languages. To address these issues, we propose retrofitting LMs with dynamic tokenization: a way to dynamically decide on token boundaries based on the input text. For encoder-style models, we introduce a subword-merging algorithm inspired by byte-pair encoding (BPE), but at a batch level. We merge frequent subword sequences in a batch, then apply a pretrained embedding-prediction hypernetwork to compute the token embeddings on-the-fly. When applied with word-level boundaries, this on average reduces token sequence lengths by 20% across 14 languages on XNLI with XLM-R while degrading its task performance by less than 2%. For decoder-style models, we apply dynamic tokenization in two ways: 1) for prefilling, maintaining performance of Mistral-7B almost completely with up to 40% sequence reduction - relative to the word-level; and 2) via an approximate nearest neighbor index, achieving fast generation with a one million token vocabulary, demonstrating scalability to even larger, dynamic vocabularies. Overall, our findings show that dynamic tokenization substantially improves inference speed and promotes fairness across languages, making a leap towards overcoming the limitations of static tokenization and enabling more equitable and adaptable LMs.
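The batch-level merging idea in the abstract above can be illustrated with a single BPE-style merge step computed over a whole batch instead of a training corpus. This is only a sketch under simplifying assumptions: it ignores the word-level boundaries mentioned in the abstract and the hypernetwork that predicts embeddings for merged tokens, and the function name and toy batch are hypothetical.

```python
from collections import Counter


def merge_most_frequent_pair(batch):
    """Apply one BPE-style merge, chosen by pair frequency across the batch.

    batch: list of token sequences (lists of strings).
    Returns the merged batch and the pair that was merged (or None).
    """
    pair_counts = Counter()
    for seq in batch:
        pair_counts.update(zip(seq, seq[1:]))  # adjacent token pairs
    if not pair_counts:
        return batch, None
    best = max(pair_counts, key=pair_counts.get)

    merged_batch = []
    for seq in batch:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                out.append(seq[i] + seq[i + 1])  # fuse the pair into one token
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged_batch.append(out)
    return merged_batch, best


# Toy batch of pre-tokenized subwords (hypothetical).
batch = [["to", "ken", "iz", "er"], ["to", "ken", "s"], ["bro", "ken"]]
merged, pair = merge_most_frequent_pair(batch)
print(pair, merged)
```

Repeating this step shortens the sequences further. In a real model each newly merged token still needs an embedding, and per the abstract that is the role of the pretrained embedding-prediction hypernetwork.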

7. [2411.18530] Emergence of Self-Identity in AI: A Mathematical Framework and Empirical Study with Generative Large Language Models

Link: https://arxiv.org/abs/2411.18530

Authors: Minhyeok Lee

Categories: Computation and Language (cs.CL); Metric Geometry (math.MG)

Keywords: addressing a critical, paper introduces, introduces a mathematical, defining and quantifying

Comments: (none)

Abstract:This paper introduces a mathematical framework for defining and quantifying self-identity in artificial intelligence (AI) systems, addressing a critical gap in the theoretical foundations of artificial consciousness. While existing approaches to artificial self-awareness often rely on heuristic implementations or philosophical abstractions, we present a formal framework grounded in metric space theory, measure theory, and functional analysis. Our framework posits that self-identity emerges from two mathematically quantifiable conditions: the existence of a connected continuum of memories $C \subseteq \mathcal{M}$ in a metric space $(\mathcal{M}, d_{\mathcal{M}})$, and a continuous mapping $I: \mathcal{M} \to \mathcal{S}$ that maintains consistent self-recognition across this continuum, where $(\mathcal{S}, d_{\mathcal{S}})$ represents the metric space of possible self-identities. To validate this theoretical framework, we conducted empirical experiments using the Llama 3.2 1B model, employing Low-Rank Adaptation (LoRA) for efficient fine-tuning. The model was trained on a synthetic dataset containing temporally structured memories, designed to capture the complexity of coherent self-identity formation. Our evaluation metrics included quantitative measures of self-awareness, response consistency, and linguistic precision. The experimental results demonstrate substantial improvements in measurable self-awareness metrics, with the primary self-awareness score increasing from 0.276 to 0.801. This enables the structured creation of AI systems with validated self-identity features. The implications of our study are immediately relevant to the fields of humanoid robotics and autonomous systems.

8. [2411.18478] Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS

Link: https://arxiv.org/abs/2411.18478

Authors: Jinyang Wu, Mingkuan Feng, Shuai Zhang, Feihu Che, Zengqi Wen, Jianhua Tao

Categories: Computation and Language (cs.CL)

Keywords: In-context Learning, enables large language, large language models, tackle downstream tasks, enables large

Comments: (none)

Abstract:In-context Learning (ICL) enables large language models (LLMs) to tackle downstream tasks through sophisticated prompting and high-quality demonstrations. However, this traditional ICL paradigm shows limitations when facing complex mathematical reasoning tasks, primarily due to its heavy dependence on example quality and the necessity for human intervention in challenging scenarios. To address these limitations, this paper presents HiAR-ICL, a High-level Automated Reasoning paradigm in ICL that shifts focus from specific examples to abstract thinking patterns, extending the conventional concept of context in ICL. HiAR-ICL introduces five atomic reasoning actions as fundamental components for constructing chain-structured patterns. Using Monte Carlo Tree Search, we explore reasoning paths and construct thought cards to guide subsequent inference. We then develop a cognitive complexity framework that dynamically matches problems with appropriate thought cards. Experimental results demonstrate HiAR-ICL's effectiveness, achieving state-of-the-art accuracy (79.6%) on the MATH benchmark with Qwen2.5-7B-Instruct, surpassing GPT-4o (76.6%) and Claude 3.5 (71.1%).

9. [2411.18472] Isolating authorship from content with semantic embeddings and contrastive learning

Link: https://arxiv.org/abs/2411.18472

Authors: Javier Huertas-Tato, Adrián Girón-Jiménez, Alejandro Martín, David Camacho

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: entangled style, style, content inside, contrastive learning, content

Comments: (none)

Abstract:Authorship has entangled style and content inside. Authors frequently write about the same topics in the same style, so when different authors write about the exact same topic the easiest way out to distinguish them is by understanding the nuances of their style. Modern neural models for authorship can pick up these features using contrastive learning, however, some amount of content leakage is always present. Our aim is to reduce the inevitable impact and correlation between content and authorship. We present a technique to use contrastive learning (InfoNCE) with additional hard negatives synthetically created using a semantic similarity model. This disentanglement technique aims to distance the content embedding space from the style embedding space, leading to embeddings more informed by style. We demonstrate the performance with ablations on two different datasets and compare them on out-of-domain challenges. Improvements are clearly shown on challenging evaluations on prolific authors with up to a 10% increase in accuracy when the settings are particularly hard. Trials on challenges also demonstrate the preservation of zero-shot capabilities of this method as fine tuning.
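To make the InfoNCE objective mentioned above concrete, here is a minimal sketch of the loss for one anchor with one positive and a list of negatives, using cosine similarity and a temperature. It is not the authors' implementation: the temperature of 0.1, the helper names, and the toy vectors are assumptions, and the semantic-similarity model that produces the hard negatives is outside the sketch.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for a single anchor: -log softmax score of the positive.

    Similarities are cosine similarities divided by an assumed temperature.
    """
    a = normalize(anchor)
    logits = [dot(a, normalize(positive)) / temperature]
    logits += [dot(a, normalize(n)) / temperature for n in negatives]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


# Toy 2-d embeddings (hypothetical): a hard negative sits close to the anchor.
anchor, positive = [1.0, 0.0], [0.9, 0.1]
easy_negative, hard_negative = [0.0, 1.0], [0.8, 0.2]
print(info_nce(anchor, positive, [easy_negative]))
print(info_nce(anchor, positive, [easy_negative, hard_negative]))
```

The second call returns a larger loss because the hard negative competes with the positive in the softmax. That extra pressure is what pushes the encoder to separate texts that share content but differ in authorship.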

10. [2411.18468] Parole de présidents (1958-2022)

Link: https://arxiv.org/abs/2411.18468

Authors: Dominique Labbé, Jacques Savoy

Categories: Computation and Language (cs.CL)

Keywords: République française, soixante ans, sont succédé, Giscard d'Estaing, huit présidents

Comments: in French

Abstract:En plus de soixante ans, huit présidents se sont succédé à la tête de la Ve République française (de Gaulle, Pompidou, Giscard d'Estaing, Mitterrand, Chirac, Sarkozy, Hollande, Macron). Après avoir présenté le corpus de leurs discours -- soit 9202 textes et plus de 20 millions de mots étiquetés -- le style de chacun des présidents sera caractérisé à l'aide de leurs vocabulaire (vocables et catégories grammaticales). Une analyse plus approfondie révèle les séquences typiques de chaque locataire de l'Élysée. Basée sur les distances entre l'ensemble des allocutions, une figure illustre les similitudes et différences entre les différents présidents. Over the past sixty-six years, eight presidents successively headed the Fifth French Republic (de Gaulle, Pompidou, Giscard d'Estaing, Mitterrand, Chirac, Sarkozy, Hollande, Macron). After presenting the corpus of their speeches -- 9,202 texts and more than 20 million labelled words -- the style of each of them will be characterized by their vocabulary (lemmas and part-of-speech). A deeper analysis reveals the typical sequences of each tenant of the Elysée. Based on an intertextual distance between all presidential speeches, a synthesis can be drawn reflecting the similarities and differences between presidents.

11. [2411.18462] Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding

Link: https://arxiv.org/abs/2411.18462

Authors: Ziyin Zhang, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Rui Wang, Zhaopeng Tu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: large language models, Speculative Decoding, language models, speculative decoding systems, important technique

Comments: Code at https://github.com/Geralt-Targaryen/SVIP

Abstract:Speculative Decoding (SD) has become an important technique in accelerating the inference speed of large language models. Conventional SD methods employ a fixed draft length, which ignores the token generation difficulty across tasks. Consequently, in this paper, we address such an issue and introduce SVIP - a difficulty-aware dynamic draft length policy for speculative decoding systems. Based on a theoretical lower bound of draft token acceptance rate and its inference-time approximation, SVIP adaptively determines the lengths of draft sequences based on the entropy of each draft token distribution. Experimental results on mainstream SD benchmarks and frameworks demonstrate the superior performance of SVIP, achieving up to 20% walltime speedup on SpecBench over baseline SD methods and 60% speedup on MT-Bench for long-form generation of up to 8K tokens. Moreover, SVIP is totally training-free and compatible with any existing SD methods that generate draft tokens autoregressively. Experimental results also show that SVIP yields consistent walltime improvement on top of GliDe CaPE and EAGLE-2.
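The idea of choosing the draft length from the entropy of each draft token distribution can be sketched as follows. This is a simplified illustration, not the SVIP policy itself: the abstract says the stopping rule is derived from a lower bound on the acceptance rate, which is not reproduced here, so the fixed threshold `max_entropy`, the cap `max_len`, and the function names are all assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def draft_length(draft_distributions, max_entropy=1.0, max_len=8):
    """Count how many draft tokens to keep before handing over to the verifier.

    Drafting stops at the first position whose distribution is too uncertain
    (entropy above the assumed threshold) or when max_len is reached.
    """
    length = 0
    for probs in draft_distributions[:max_len]:
        if entropy(probs) > max_entropy:
            break
        length += 1
    return length


# Hypothetical per-step distributions from a draft model over 4 tokens.
steps = [
    [0.97, 0.01, 0.01, 0.01],  # confident
    [0.90, 0.05, 0.03, 0.02],  # still fairly confident
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain: stop drafting here
    [0.97, 0.01, 0.01, 0.01],
]
print(draft_length(steps))
```

Easy continuations therefore get long drafts and hard ones get short drafts, in contrast to the fixed draft length the abstract attributes to conventional methods.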

12. [2411.18444] Is my Meeting Summary Good? Estimating Quality with a Multi-LLM Evaluator

Link: https://arxiv.org/abs/2411.18444

Authors: Frederic Kirstein, Terry Ruas, Bela Gipp

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: meeting summaries generated, natural language generation, measure automatically, meeting summaries, summaries generated

Comments: (none)

Abstract:The quality of meeting summaries generated by natural language generation (NLG) systems is hard to measure automatically. Established metrics such as ROUGE and BERTScore have a relatively low correlation with human judgments and fail to capture nuanced errors. Recent studies suggest using large language models (LLMs), which have the benefit of better context understanding and adaption of error definitions without training on a large number of human preference judgments. However, current LLM-based evaluators risk masking errors and can only serve as a weak proxy, leaving human evaluation the gold standard despite being costly and hard to compare across studies. In this work, we present MESA, an LLM-based framework employing a three-step assessment of individual error types, multi-agent discussion for decision refinement, and feedback-based self-training to refine error definition understanding and alignment with human judgment. We show that MESA's components enable thorough error detection, consistent rating, and adaptability to custom error guidelines. Using GPT-4o as its backbone, MESA achieves mid to high Point-Biserial correlation with human judgment in error detection and mid Spearman and Kendall correlation in reflecting error impact on summary quality, on average 0.25 higher than previous methods. The framework's flexibility in adapting to custom error guidelines makes it suitable for various tasks with limited human-labeled data.

13. [2411.18403] Politicians vs ChatGPT. A study of presuppositions in French and Italian political communication

Link: https://arxiv.org/abs/2411.18403

Authors: Davide Garassino, Vivana Masia, Nicola Brocca, Alice Delorme Benites

Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)

Keywords: European Union, chatbot counterparts created, French and Italian, produced by French, Italian politicians

Comments: Published: 2024-07-04

Abstract:This paper aims to provide a comparison between texts produced by French and Italian politicians on polarizing issues, such as immigration and the European Union, and their chatbot counterparts created with ChatGPT 3.5. In this study, we focus on implicit communication, in particular on presuppositions and their functions in discourse, which have been considered in the literature as a potential linguistic feature of manipulation. This study also aims to contribute to the emerging literature on the pragmatic competences of Large Language Models.

14. [2411.18383] Topic Modeling and Sentiment Analysis on Japanese Online Media's Coverage of Nuclear Energy

Link: https://arxiv.org/abs/2411.18383

Authors: Yifan Sun, Hirofumi Tsuruta, Masaya Kumagai, Ken Kurosaki

Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)

Keywords: Fukushima Daiichi nuclear, Fukushima Daiichi, power plant accident, Daiichi nuclear power, plants remain shut

Comments: 15 pages, 9 figures, 4 tables

Abstract:Thirteen years after the Fukushima Daiichi nuclear power plant accident, Japan's nuclear energy accounts for only approximately 6% of electricity production, as most nuclear plants remain shut down. To revitalize the nuclear industry and achieve sustainable development goals, effective communication with Japanese citizens, grounded in an accurate understanding of public sentiment, is of paramount importance. While nationwide surveys have traditionally been used to gauge public views, the rise of social media in recent years has provided a promising new avenue for understanding public sentiment. To explore domestic sentiment on nuclear energy-related issues expressed online, we analyzed the content and comments of over 3,000 YouTube videos covering topics related to nuclear energy. Topic modeling was used to extract the main topics from the videos, and sentiment analysis with large language models classified user sentiments towards each topic. Additionally, word co-occurrence network analysis was performed to examine the shift in online discussions during August and September 2023 regarding the release of treated water. Overall, our results provide valuable insights into the online discourse on nuclear energy and contribute to a more comprehensive understanding of public sentiment in Japan.

15. [2411.18382] ChatGPT as speechwriter for the French presidents

Link: https://arxiv.org/abs/2411.18382

Authors: Dominique Labbé, Cyril Labbé, Jacques Savoy

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: large language models, Generative AI proposes, language models, users' requests, proposes several large

Comments: (none)

Abstract:Generative AI proposes several large language models (LLMs) to automatically generate a message in response to users' requests. Such scientific breakthroughs promote new writing assistants but with some fears. The main focus of this study is to analyze the written style of one LLM called ChatGPT by comparing its generated messages with those of the recent French presidents. To achieve this, we compare end-of-the-year addresses written by Chirac, Sarkozy, Hollande, and Macron with those automatically produced by ChatGPT. We found that ChatGPT tends to overuse nouns, possessive determiners, and numbers. On the other hand, the generated speeches employ less verbs, pronouns, and adverbs and include, in mean, too standardized sentences. Considering some words, one can observe that ChatGPT tends to overuse "to must" (devoir), "to continue" or the lemma "we" (nous). Moreover, GPT underuses the auxiliary verb "to be" (être), or the modal verbs "to will" (vouloir) or "to have to" (falloir). In addition, when a short text is provided as example to ChatGPT, the machine can generate a short message with a style closed to the original wording. Finally, we reveal that ChatGPT style exposes distinct features compared to real presidential speeches.

16. [2411.18368] AMPS: ASR with Multimodal Paraphrase Supervision

Link: https://arxiv.org/abs/2411.18368

Authors: Amruta Parulekar, Abhishek Gupta, Sameep Chattopadhyay, Preethi Jyothi

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Keywords: automatic speech recognition, conversational multilingual speech, automatic speech, speech recognition, multilingual speech presents

Comments: (none)

Abstract:Spontaneous or conversational multilingual speech presents many challenges for state-of-the-art automatic speech recognition (ASR) systems. In this work, we present a new technique AMPS that augments a multilingual multimodal ASR system with paraphrase-based supervision for improved conversational ASR in multiple languages, including Hindi, Marathi, Malayalam, Kannada, and Nyanja. We use paraphrases of the reference transcriptions as additional supervision while training the multimodal ASR model and selectively invoke this paraphrase objective for utterances with poor ASR performance. Using AMPS with a state-of-the-art multimodal model SeamlessM4T, we obtain significant relative reductions in word error rates (WERs) of up to 5%. We present detailed analyses of our system using both objective and human evaluation metrics.

17. [2411.18365] GPT as ghostwriter at the White House

Link: https://arxiv.org/abs/2411.18365

Authors: Jacques Savoy

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Keywords: large language models, Recently several large, language models, user request, large language

Comments: (none)

Abstract:Recently several large language models (LLMs) have demonstrated their capability to generate a message in response to a user request. Such scientific breakthroughs promote new perspectives but also some fears. The main focus of this study is to analyze the written style of one LLM called ChatGPT 3.5 by comparing its generated messages with those of the recent US presidents. To achieve this objective, we compare the State of the Union addresses written by Reagan to Obama with those automatically produced by ChatGPT. We found that ChatGPT tends to overuse the lemma "we" as well as nouns and commas. On the other hand, the generated speeches employ less verbs and include, in mean, longer sentences. Even when imposing a given style to ChatGPT, the resulting speech remains distinct from messages written by the target author. Moreover, ChatGPT opts for a neutral tone with mainly positive emotional expressions and symbolic terms (e.g., freedom, nation). Finally, we show that the GPT's style exposes distinct features compared to real presidential addresses.

18. [2411.18337] Can LLMs assist with Ambiguity? A Quantitative Evaluation of various Large Language Models on Word Sense Disambiguation

Link: https://arxiv.org/abs/2411.18337

Authors: T.G.D.K. Sumanathilaka, Nicholas Micallef, Julian Hough

Categories: Computation and Language (cs.CL)

Keywords: found in modern, Large Language Models, Word Sense Disambiguation, modern digital communications, Ambiguous words

Comments: 12 pages, 6 tables, 1 figure, Proceedings of the 1st International Conference on NLP AI for Cyber Security

Abstract:Ambiguous words are often found in modern digital communications. Lexical ambiguity challenges traditional Word Sense Disambiguation (WSD) methods, due to limited data. Consequently, the efficiency of translation, information retrieval, and question-answering systems is hindered by these limitations. This study investigates the use of Large Language Models (LLMs) to improve WSD using a novel approach combining a systematic prompt augmentation mechanism with a knowledge base (KB) consisting of different sense interpretations. The proposed method incorporates a human-in-loop approach for prompt augmentation where prompt is supported by Part-of-Speech (POS) tagging, synonyms of ambiguous words, aspect-based sense filtering and few-shot prompting to guide the LLM. By utilizing a few-shot Chain of Thought (COT) prompting-based approach, this work demonstrates a substantial improvement in performance. The evaluation was conducted using FEWS test data and sense tags. This research advances accurate word interpretation in social media and digital communication.

19. [2411.18320] Continual Learning in Machine Speech Chain Using Gradient Episodic Memory

Link: https://arxiv.org/abs/2411.18320

Authors: Geoffrey Tyndall, Kurniawati Azizah, Dipta Tanaya, Ayu Purwarianti, Dessi Puji Lestari, Sakriani Sakti

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)

Keywords: avoid catastrophic forgetting, machine speech chain, previously learned tasks, automatic speech recognition, Continual learning

Comments: Published as a conference paper at O-COCOSDA 2024. 6 pages; 2 figures

Abstract:Continual learning for automatic speech recognition (ASR) systems poses a challenge, especially with the need to avoid catastrophic forgetting while maintaining performance on previously learned tasks. This paper introduces a novel approach leveraging the machine speech chain framework to enable continual learning in ASR using gradient episodic memory (GEM). By incorporating a text-to-speech (TTS) component within the machine speech chain, we support the replay mechanism essential for GEM, allowing the ASR model to learn new tasks sequentially without significant performance degradation on earlier tasks. Our experiments, conducted on the LJ Speech dataset, demonstrate that our method outperforms traditional fine-tuning and multitask learning approaches, achieving a substantial error rate reduction while maintaining high performance across varying noise conditions. We showed the potential of our semi-supervised machine speech chain approach for effective and efficient continual learning in speech recognition.

20. [2411.18294] Aligning Pre-trained Models for Spoken Language Translation

Link: https://arxiv.org/abs/2411.18294

Authors: Šimon Sedláček, Santosh Kesiraju, Alexander Polok, Jan Černocký

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: aligning frozen pre-trained, frozen pre-trained automatic, automatic speech recognition, pre-trained automatic speech, transforming ASR encoder

Comments: (none)

Abstract:This paper investigates a novel approach to end-to-end speech translation (ST) based on aligning frozen pre-trained automatic speech recognition (ASR) and machine translation (MT) models via a small connector module (Q-Former, our Subsampler-Transformer Encoder). This connector bridges the gap between the speech and text modalities, transforming ASR encoder embeddings into the latent representation space of the MT encoder while being the only part of the system optimized during training. Experiments are conducted on the How2 English-Portuguese dataset as we investigate the alignment approach in a small-scale scenario focusing on ST. While keeping the size of the connector module constant and small in comparison (< 5% of the size of the larger aligned models), increasing the size and capability of the foundation ASR and MT models universally improves translation results. We also find that the connectors can serve as domain adapters for the foundation MT models, significantly improving translation performance in the aligned ST setting. We conclude that this represents a viable and scalable approach to training end-to-end ST systems.

21. 【2411.18280】Neutralizing Backdoors through Information Conflicts for Large Language Models

Link: https://arxiv.org/abs/2411.18280

Authors: Chen Chen, Yuchen Sun, Xueluan Gong, Jiaxin Gao, Kwok-Yan Lam

Categories: Computation and Language (cs.CL)

Keywords: Natural Language Processing, Large language models, Large language, Language Processing, Natural Language

Abstract:Large language models (LLMs) have seen significant advancements, achieving superior performance in various Natural Language Processing (NLP) tasks, from understanding to reasoning. However, they remain vulnerable to backdoor attacks, where models behave normally for standard queries but generate harmful responses or unintended output when specific triggers are activated. Existing backdoor defenses often suffer from drawbacks that they either focus on detection without removal, rely on rigid assumptions about trigger properties, or prove to be ineffective against advanced attacks like multi-trigger backdoors. In this paper, we present a novel method to eliminate backdoor behaviors from LLMs through the construction of information conflicts using both internal and external mechanisms. Internally, we leverage a lightweight dataset to train a conflict model, which is then merged with the backdoored model to neutralize malicious behaviors by embedding contradictory information within the model's parametric memory. Externally, we incorporate convincing contradictory evidence into the prompt to challenge the model's internal backdoor knowledge. Experimental results on classification and conversational tasks across 4 widely used LLMs demonstrate that our method outperforms 8 state-of-the-art backdoor defense baselines. We can reduce the attack success rate of advanced backdoor attacks by up to 98% while maintaining over 90% clean data accuracy. Furthermore, our method has proven to be robust against adaptive backdoor attacks. The code will be open-sourced upon publication.

22. 【2411.18279】Large Language Model-Brained GUI Agents: A Survey

Link: https://arxiv.org/abs/2411.18279

Authors: Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)

Keywords: LLM-brained GUI agents, GUI agents, GUI, LLM-brained GUI, providing an intuitive

Abstract:GUIs have long been central to human-computer interaction, providing an intuitive and visually-driven way to access and interact with digital systems. The advent of LLMs, particularly multimodal models, has ushered in a new era of GUI automation. They have demonstrated exceptional capabilities in natural language understanding, code generation, and visual processing. This has paved the way for a new generation of LLM-brained GUI agents capable of interpreting complex GUI elements and autonomously executing actions based on natural language instructions. These agents represent a paradigm shift, enabling users to perform intricate, multi-step tasks through simple conversational commands. Their applications span across web navigation, mobile app interactions, and desktop automation, offering a transformative user experience that revolutionizes how individuals interact with software. This emerging field is rapidly advancing, with significant progress in both research and industry. To provide a structured understanding of this trend, this paper presents a comprehensive survey of LLM-brained GUI agents, exploring their historical evolution, core components, and advanced techniques. We address research questions such as existing GUI agent frameworks, the collection and utilization of data for training specialized GUI agents, the development of large action models tailored for GUI tasks, and the evaluation metrics and benchmarks necessary to assess their effectiveness. Additionally, we examine emerging applications powered by these agents. Through a detailed analysis, this survey identifies key research gaps and outlines a roadmap for future advancements in the field. By consolidating foundational knowledge and state-of-the-art developments, this work aims to guide both researchers and practitioners in overcoming challenges and unlocking the full potential of LLM-brained GUI agents.

23. 【2411.18269】Hidden Data Privacy Breaches in Federated Learning

Link: https://arxiv.org/abs/2411.18269

Authors: Xueluan Gong, Yuji Wang, Shuaike Li, Mengyuan Sun, Songze Li, Qian Wang, Kwok-Yan Lam, Chen Chen

Categories: Computation and Language (cs.CL); Cryptography and Security (cs.CR)

Keywords: conducting machine learning, Federated Learning, promising enhanced privacy, machine learning, promising enhanced

Abstract:Federated Learning (FL) emerged as a paradigm for conducting machine learning across broad and decentralized datasets, promising enhanced privacy by obviating the need for direct data sharing. However, recent studies show that attackers can steal private data through model manipulation or gradient analysis. Existing attacks are constrained by low theft quantity or low-resolution data, and they are often detected through anomaly monitoring in gradients or weights. In this paper, we propose a novel data-reconstruction attack leveraging malicious code injection, supported by two key techniques, i.e., distinctive and sparse encoding design and block partitioning. Unlike conventional methods that require detectable changes to the model, our method stealthily embeds a hidden model using parameter sharing to systematically extract sensitive data. The Fibonacci-based index design ensures efficient, structured retrieval of memorized data, while the block partitioning method enhances our method's capability to handle high-resolution images by dividing them into smaller, manageable units. Extensive experiments on 4 datasets confirmed that our method is superior to the five state-of-the-art data-reconstruction attacks under the five respective detection methods. Our method can handle large-scale and high-resolution data without being detected or mitigated by state-of-the-art data reconstruction defense methods. In contrast to baselines, our method can be directly applied to both FedAVG and FedSGD scenarios, underscoring the need for developers to devise new defenses against such vulnerabilities. We will open-source our code upon acceptance.

24. 【2411.18260】MetaphorShare: A Dynamic Collaborative Repository of Open Metaphor Datasets

Link: https://arxiv.org/abs/2411.18260

Authors: Joanne Boisson, Arif Mehmood, Jose Camacho-Collados

Categories: Computation and Language (cs.CL)

Keywords: developed numerous valuable, numerous valuable labelled, valuable labelled corpora, developed numerous, numerous valuable

Abstract:The metaphor studies community has developed numerous valuable labelled corpora in various languages over the years. Many of these resources are not only unknown to the NLP community, but are also often not easily shared among the researchers. Both in human sciences and in NLP, researchers could benefit from a centralised database of labelled resources, easily accessible and unified under an identical format. To facilitate this, we present MetaphorShare, a website to integrate metaphor datasets making them open and accessible. With this effort, our aim is to encourage researchers to share and upload more datasets in any language in order to facilitate metaphor studies and the development of future metaphor processing NLP systems. The website is accessible at this http URL.

25. 【2411.18247】A gentle push funziona benissimo: making instructed models in Italian via contrastive activation steering

Link: https://arxiv.org/abs/2411.18247

Authors: Daniel Scalena, Elisabetta Fersini, Malvina Nissim

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: pre-training data requires, data requires fine-tuning, computational resources, Adapting models, partially present

Abstract:Adapting models to a language that was only partially present in the pre-training data requires fine-tuning, which is expensive in terms of both data and computational resources. As an alternative to fine-tuning, we explore the potential of activation steering-based techniques to enhance model performance on Italian tasks. Through our experiments we show that Italian steering (i) can be successfully applied to different models, (ii) achieves performances comparable to, or even better than, fine-tuned models for Italian, and (iii) yields higher quality and consistency in Italian generations. We also discuss the utility of steering and fine-tuning in the contemporary LLM landscape where models are anyway getting high Italian performances even if not explicitly trained in this language.
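
A minimal sketch of contrastive activation steering in its common difference-of-means form, assuming activations have already been collected at one layer for Italian prompts (`positive`) and English prompts (`negative`). All names, the choice of layer, and the strength `alpha` are illustrative assumptions; the paper's exact recipe may differ.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def steering_vector(positive: Sequence[Sequence[float]],
                    negative: Sequence[Sequence[float]]) -> Vector:
    """Difference between the mean target-language and mean source-language activations."""
    mean_pos = mean_vector(positive)
    mean_neg = mean_vector(negative)
    return [p - q for p, q in zip(mean_pos, mean_neg)]


def steer(hidden: Sequence[float], vector: Sequence[float], alpha: float = 1.0) -> Vector:
    """Shift a hidden state along the steering direction at inference time."""
    return [h + alpha * v for h, v in zip(hidden, vector)]
```

No weights are updated: the vector is computed once from a small set of contrastive prompts and then added to the hidden state during generation.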

26. 【2411.18242】Thai Financial Domain Adaptation of THaLLE -- Technical Report

Link: https://arxiv.org/abs/2411.18242

Authors: KBTG Labs, Atthakorn Petchsod, Pornchanan Balee, Danupat Khamnuansin, Anuruth Lertpiya, Chanatip Saetia, Tawunrat Chalothorn, Thadpong Pongthawornkamol, Monchai Lertsutthiwong

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: Large Language Models, Large Language, Thai Financial LLM, Thai financial, excel in general

Abstract:Large Language Models (LLMs) excel in general tasks but struggle with domain-specific challenges, such as specialized terminology and localized regulations. Existing financial LLMs, like FinGPT and BloombergGPT, lack support for the Thai financial domain. We developed a Thai Financial LLM using the Investment Consultant (IC) exam dataset from the Stock Exchange of Thailand. To address dataset limitations, we applied data augmentation, ReLoRA for efficient training, Continued Pretraining (CPT) for domain knowledge, and Rank-Stabilized LoRA (rsLoRA) for fine-tuning. Supervised Fine-Tuning (SFT) simulated exam scenarios, while Direct Preference Optimization (DPO) refined the model using feedback. The model achieved scores of 72%, 72%, and 84% on IC exam levels P1, P2, and P3, respectively, demonstrating its effectiveness in Thai financial advisory tasks and its potential for specialized applications.

27. 【2411.18217】How to Learn a New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario

Link: https://arxiv.org/abs/2411.18217

Authors: Shih-Heng Wang, Zih-Ching Chen, Jiatong Shi, Ming-To Chuang, Guan-Ting Lin, Kuan-Po Huang, David Harwath, Shang-Wen Li, Hung-yi Lee

Categories: Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

Keywords: Automatic Speech Recognition, speech Self-Supervised Learning, Speech Recognition, Automatic Speech, Self-Supervised Learning

Abstract:The utilization of speech Self-Supervised Learning (SSL) models achieves impressive performance on Automatic Speech Recognition (ASR). However, in low-resource language ASR, they encounter the domain mismatch problem between pre-trained and low-resource languages. Typical solutions like fine-tuning the SSL model suffer from high computation costs while using frozen SSL models as feature extractors comes with poor performance. To handle these issues, we extend a conventional efficient fine-tuning scheme based on the adapter. We add an extra intermediate adaptation to warm up the adapter and downstream model initialization. Remarkably, we update only 1-5% of the total model parameters to achieve the adaptation. Experimental results on the ML-SUPERB dataset show that our solution outperforms conventional efficient fine-tuning. It achieves up to a 28% relative improvement in the Character/Phoneme error rate when adapting to unseen languages.

28. 【2411.18203】Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning

Link: https://arxiv.org/abs/2411.18203

Authors: Di Zhang, Jingdi Lei, Junxian Li, Xunzhi Wang, Yujie Liu, Zonglin Yang, Jiatong Li, Weida Wang, Suorong Yang, Jianbo Wu, Peng Ye, Wanli Ouyang, Dongzhan Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: shown remarkable advancements, Vision-language models, reasoning, critic, shown remarkable

Comments: 16 pages, 11 figures

Abstract:Vision-language models (VLMs) have shown remarkable advancements in multimodal reasoning tasks. However, they still often generate inaccurate or irrelevant responses due to issues like hallucinated image understandings or unrefined reasoning paths. To address these challenges, we introduce Critic-V, a novel framework inspired by the Actor-Critic paradigm to boost the reasoning capability of VLMs. This framework decouples the reasoning process and critic process by integrating two independent components: the Reasoner, which generates reasoning paths based on visual and textual inputs, and the Critic, which provides constructive critique to refine these paths. In this approach, the Reasoner generates reasoning responses according to text prompts, which can evolve iteratively as a policy based on feedback from the Critic. This interaction process was theoretically driven by a reinforcement learning framework where the Critic offers natural language critiques instead of scalar rewards, enabling more nuanced feedback to boost the Reasoner's capability on complex reasoning tasks. The Critic model is trained using Direct Preference Optimization (DPO), leveraging a preference dataset of critiques ranked by Rule-based Reward (RBR) to enhance its critic capabilities. Evaluation results show that the Critic-V framework significantly outperforms existing methods, including GPT-4V, on 5 out of 8 benchmarks, especially regarding reasoning accuracy and efficiency. Combining a dynamic text-based policy for the Reasoner and constructive feedback from the preference-optimized Critic enables a more reliable and context-sensitive multimodal reasoning process. Our approach provides a promising solution to enhance the reliability of VLMs, improving their performance in real-world reasoning-heavy multimodal applications such as autonomous driving and embodied intelligence.

29. 【2411.18162】SentiXRL: An advanced large language Model Framework for Multilingual Fine-Grained Emotion Classification in Complex Text Environment

Link: https://arxiv.org/abs/2411.18162

Authors: Jie Wang, Yichen Wang, Zhilin Zhang, Jianhao Zeng, Kaidi Wang, Zhiyang Chen

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, strong expressive capabilities, generative models effectively, Large Language, models effectively capture

Abstract:With strong expressive capabilities in Large Language Models (LLMs), generative models effectively capture sentiment structures and deep semantics; however, challenges remain in fine-grained sentiment classification across multi-lingual and complex contexts. To address this, we propose the Sentiment Cross-Lingual Recognition and Logic Framework (SentiXRL), which incorporates two modules: an emotion retrieval enhancement module to improve sentiment classification accuracy in complex contexts through historical dialogue and logical reasoning, and a self-circulating analysis negotiation mechanism (SANM) to facilitate autonomous decision-making within a single model for classification tasks. We have validated SentiXRL's superiority on multiple standard datasets, outperforming existing models on CPED and CH-SIMS, and achieving overall better performance on MELD, Emorynlp and IEMOCAP. Notably, we unified labels across several fine-grained sentiment annotation datasets and conducted category confusion experiments, revealing challenges and impacts of class imbalance in standard datasets.

30. 【2411.18157】A survey on cutting-edge relation extraction techniques based on language models

Link: https://arxiv.org/abs/2411.18157

Authors: Jose A. Diaz-Garcia, Julio Amador Diaz Lopez

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: comprehensive survey delves, natural language processing, language processing essential, applications across biomedical, legal sectors

Comments: 50 pages, under review in Artificial Intelligence Review

Abstract:This comprehensive survey delves into the latest advancements in Relation Extraction (RE), a pivotal task in natural language processing essential for applications across biomedical, financial, and legal sectors. This study highlights the evolution and current state of RE techniques by analyzing 137 papers presented at the Association for Computational Linguistics (ACL) conferences over the past four years, focusing on models that leverage language models. Our findings underscore the dominance of BERT-based methods in achieving state-of-the-art results for RE while also noting the promising capabilities of emerging large language models (LLMs) like T5, especially in few-shot relation extraction scenarios where they excel in identifying previously unseen relations.

31. 【2411.18152】MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models

Link: https://arxiv.org/abs/2411.18152

Authors: Thai-Binh Nguyen, Alexander Waibel

Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)

Keywords: Speaker-attributed automatic speech, Speaker-attributed automatic, automatic speech recognition, aims to transcribe, assigning transcripts

Abstract:Speaker-attributed automatic speech recognition (SA-ASR) aims to transcribe speech while assigning transcripts to the corresponding speakers accurately. Existing methods often rely on complex modular systems or require extensive fine-tuning of joint modules, limiting their adaptability and general efficiency. This paper introduces a novel approach, leveraging a frozen multilingual ASR model to incorporate speaker attribution into the transcriptions, using only standard monolingual ASR datasets. Our method involves training a speaker module to predict speaker embeddings based on weak labels without requiring additional ASR model modifications. Despite being trained exclusively with non-overlapping monolingual data, our approach effectively extracts speaker attributes across diverse multilingual datasets, including those with overlapping speech. Experimental results demonstrate competitive performance compared to strong baselines, highlighting the model's robustness and potential for practical applications.

32. 【2411.18126】Curriculum Demonstration Selection for In-Context Learning

Link: https://arxiv.org/abs/2411.18126

Authors: Duc Anh Vu, Nguyen Tran Cong Duy, Xiaobao Wu, Hoang Minh Nhat, Du Mingzhe, Nguyen Thanh Thong, Anh Tuan Luu

Categories: Computation and Language (cs.CL)

Keywords: Large Language Models, Large Language, Language Models, shown strong in-context, strong in-context learning

Comments: Accepted at the 40th ACM/SIGAPP Symposium On Applied Computing (SAC 2025), Main Conference

Abstract:Large Language Models (LLMs) have shown strong in-context learning (ICL) abilities with a few demonstrations. However, one critical challenge is how to select demonstrations to elicit the full potential of LLMs. In this paper, we propose Curriculum Demonstration Selection (CDS), a novel demonstration selection method for ICL. Instead of merely using similarity, CDS additionally partitions samples by their complexity measurements. Following curriculum learning, CDS then selects demonstrations from easy to difficult. Thus the selected demonstrations cover a wide range of difficulty levels, enabling LLMs to learn from varied complexities within the training set. Experiments demonstrate that our CDS consistently outperforms baseline methods, achieving notable improvements across nine LLMs on three benchmarks. Moreover, CDS proves especially effective in enhancing LLM performance in solving challenging problems.
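
The selection procedure described in the abstract can be sketched roughly as follows, assuming each candidate already carries a numeric complexity score. The token-overlap similarity is a stand-in for whatever retriever a real system would use, and `select_demonstrations` is a hypothetical name, not the authors' code.

```python
from typing import List, Tuple

Demo = Tuple[str, float]  # (demonstration text, complexity score)


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a simple stand-in for embedding similarity."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def select_demonstrations(query: str, pool: List[Demo], k: int) -> List[Demo]:
    """Pick one demonstration per complexity partition, ordered easy to difficult."""
    if not 0 < k <= len(pool):
        raise ValueError("k must be between 1 and the pool size")
    ranked = sorted(pool, key=lambda demo: demo[1])
    n = len(ranked)
    # Cut the ranked pool into k contiguous, non-empty complexity partitions.
    bounds = [i * n // k for i in range(k + 1)]
    selected = []
    for low, high in zip(bounds, bounds[1:]):
        partition = ranked[low:high]
        # Within a partition, keep the candidate most similar to the query.
        selected.append(max(partition, key=lambda demo: jaccard(query, demo[0])))
    return selected
```

Because the partitions are visited in ascending complexity, the returned list is already in curriculum order and can be concatenated into the prompt as is.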

33. 【2411.18104】Training and Evaluating Language Models with Template-based Data Generation

Link: https://arxiv.org/abs/2411.18104

Authors: Yifan Zhang

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: showcasing remarkable capabilities, Llama has significantly, significantly transformed natural, showcasing remarkable, large language models

Comments: 8 pages, 2 figures

Abstract:The rapid advancement of large language models (LLMs) such as GPT-3, PaLM, and Llama has significantly transformed natural language processing, showcasing remarkable capabilities in understanding and generating language. However, these models often struggle with tasks requiring complex reasoning, particularly in mathematical problem-solving, due in part to the scarcity of large-scale, high-quality, domain-specific datasets necessary for training sophisticated reasoning abilities. To address this limitation, we introduce Template-based Data Generation (TDG), a novel approach that leverages LLMs (GPT-4) to automatically generate parameterized meta-templates, which are then used to synthesize a vast array of high-quality problems and solutions. Leveraging TDG, we create TemplateMath Part I: TemplateGSM, a dataset comprising over 7 million synthetically generated grade school math problems--each accompanied by code-based and natural language solutions--with the potential to generate an effectively unlimited number more. This dataset alleviates the scarcity of large-scale mathematical datasets and serves as a valuable resource for pre-training, fine-tuning, and evaluating LLMs in mathematical reasoning. Our method not only enables the generation of virtually infinite data but also elevates data augmentation to a new level by using GPT-4 for meta-template generation, ensuring diverse and high-quality problem structures. The TemplateMath Part I: TemplateGSM dataset is publicly available at this https URL. The code is available at this https URL.
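
To illustrate the idea of a parameterized template, here is a small hand-written example that samples parameters and derives the problem text, a code-based solution, and a natural language solution from them. The template itself, the name lists, and the function `make_problem` are assumptions made for illustration; in the paper the meta-templates are generated by GPT-4 rather than written by hand.

```python
import random
from typing import Dict, Union


def make_problem(rng: random.Random) -> Dict[str, Union[str, int]]:
    """Instantiate one hypothetical grade-school template with sampled parameters."""
    name = rng.choice(["Ava", "Ben", "Chen"])
    item = rng.choice(["apples", "pencils", "marbles"])
    start = rng.randint(5, 50)
    bought = rng.randint(1, 20)
    given = rng.randint(1, start)  # never gives away more than the starting amount
    answer = start + bought - given
    return {
        "problem": (
            f"{name} has {start} {item}, buys {bought} more, and gives away {given}. "
            f"How many {item} does {name} have now?"
        ),
        # Code-based solution: executing it reproduces the answer.
        "solution_code": f"result = {start} + {bought} - {given}",
        # Natural language solution derived from the same parameters.
        "solution_text": (
            f"{name} starts with {start}, adds {bought} to get {start + bought}, "
            f"then subtracts {given}, leaving {answer}."
        ),
        "answer": answer,
    }
```

Since every field is derived from the same sampled parameters, the problem and both solutions stay consistent by construction, and each new seed yields a new instance.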

34. 【2411.18099】Fine-Tuning Small Embeddings for Elevated Performance

Link: https://arxiv.org/abs/2411.18099

Authors: Biraj Silwal

Categories: Computation and Language (cs.CL)

Keywords: language processing tasks, Contextual Embeddings, natural language processing, processing tasks, Nepali language

Abstract:Contextual Embeddings have yielded state-of-the-art results in various natural language processing tasks. However, these embeddings are constrained by models requiring large amounts of data and huge computing power. This is an issue for low-resource languages like Nepali as the amount of data available over the internet is not always sufficient for the models. This work has taken an incomplete BERT model with six attention heads pretrained on Nepali language and finetuned it on previously unseen data. The obtained results from intrinsic and extrinsic evaluations have been compared to the results drawn from the original model baseline and a complete BERT model pretrained on Nepali language as the oracle. The results demonstrate that even though the oracle is better on average, finetuning the small embeddings drastically improves results compared to the original baseline.

35. 【2411.18077】Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache

Link: https://arxiv.org/abs/2411.18077

Authors: Akshat Sharma, Hangliang Ding, Jianping Li, Neel Dani, Minjia Zhang

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: exceptionally challenging due, efficiently serve LLMs, computation requirements, long context tasks, efficiently serve

Abstract:How to efficiently serve LLMs in practice has become exceptionally challenging due to their prohibitive memory and computation requirements. In this study, we investigate optimizing the KV cache, whose memory footprint poses a critical bottleneck in LLM inference, especially when dealing with long context tasks. To tackle the challenge, we introduce MiniKV, a KV cache optimization method that simultaneously preserves long context task accuracy while significantly reducing KV cache size via a novel 2-bit layer-discriminative KV cache. More importantly, we develop specialized CUDA kernels to make MiniKV compatible with FlashAttention. Experiments on a wide range of long context tasks show that MiniKV effectively achieves 86% KV cache compression ratio while recovering over 98.5% of accuracy, outperforming state-of-the-art methods while achieving excellent measured system performance improvements.
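
For readers unfamiliar with 2-bit quantization, the sketch below shows generic uniform min-max quantization with four values packed per byte. It is not MiniKV's layer-discriminative scheme, whose details the abstract does not give; the function names and the per-vector scale are assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple


def quantize_2bit(values: Sequence[float]) -> Tuple[bytes, float, float, int]:
    """Map floats to 4 levels (2 bits each) and pack 4 codes into every byte."""
    if not values:
        raise ValueError("values must be non-empty")
    low, high = min(values), max(values)
    scale = (high - low) / 3 or 1.0  # 4 levels span 3 steps; avoid a zero scale
    codes = [min(3, max(0, round((v - low) / scale))) for v in values]
    packed = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for j, code in enumerate(codes[i:i + 4]):
            byte |= code << (2 * j)
        packed.append(byte)
    return bytes(packed), low, scale, len(values)


def dequantize_2bit(packed: bytes, low: float, scale: float, count: int) -> List[float]:
    """Unpack the 2-bit codes and map them back to approximate float values."""
    restored = []
    for i in range(count):
        code = (packed[i // 4] >> (2 * (i % 4))) & 0b11
        restored.append(low + code * scale)
    return restored
```

Each value costs 2 bits plus the shared `low` and `scale`, and the reconstruction error per value is at most half a quantization step.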

36. 【2411.18021】Can bidirectional encoder become the ultimate winner for downstream applications of foundation models?

Link: https://arxiv.org/abs/2411.18021

Authors: Lewen Yang, Xuanyu Zhou, Juao Fan, Xinyi Xie, Shengxin Zhu

Categories: Computation and Language (cs.CL)

Keywords: Artificial Intelligence, machine learning stage, deep learning stage, initial machine learning, learning stage

Comments: 9 pages, 4 figures, FLLM2024

Abstract:Over the past few decades, Artificial Intelligence (AI) has progressed from the initial machine learning stage to the deep learning stage, and now to the stage of foundational models. Foundational models have the characteristics of pre-training, transfer learning, and self-supervised learning, and pre-trained models can be fine-tuned and applied to various downstream tasks. Under the framework of foundational models, models such as Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT) have greatly advanced the development of natural language processing (NLP), especially the emergence of many models based on BERT. BERT broke through the limitation of only using one-way methods for language modeling in pre-training by using a masked language model. It can capture bidirectional context information to predict the masked words in the sequence, which can improve the feature extraction ability of the model. This makes the model very useful for downstream tasks, especially for specialized applications. The model using the bidirectional encoder can better understand the domain knowledge and be better applied to these downstream tasks. So we hope to help understand how this technology has evolved and improved model performance in various natural language processing tasks under the background of foundational models and reveal its importance in capturing context information and improving the model's performance on downstream tasks. This article analyzes one-way and bidirectional models based on GPT and BERT and compares their differences based on the purpose of the model. It also briefly analyzes BERT and the improvements of some models based on BERT. The model's performance on the Stanford Question Answering Dataset (SQuAD) and General Language Understanding Evaluation (GLUE) was compared.

37. 【2411.17993】DRS: Deep Question Reformulation With Structured Output

Link: https://arxiv.org/abs/2411.17993

Authors: Zhecheng Li, Yiwei Wang, Bryan Hooi, Yujun Cai, Nanyun Peng, Kai-Wei Chang

Categories: Computation and Language (cs.CL)

Keywords: large language models, large language, language models, fundamental capability, language

Abstract:Question answering is a fundamental capability of large language models (LLMs). However, when people encounter completely new knowledge texts, they often ask questions that the text cannot answer due to a lack of understanding of the knowledge. Recent research shows that large language models identify the unanswerability of questions, but they lack the ability to help people reformulate their questions. Even powerful models like GPT-3.5 perform poorly in this regard. To enhance the ability of LLMs to assist humans in reformulating questions to extract relevant knowledge from new documents, we propose a zero-shot method called DRS: Deep Question Reformulation With Structured Output. Our proposed method leverages large language models and the DFS-based algorithm to iteratively search for possible entity combinations and constrain the output with certain entities, effectively improving the capabilities of large language models in this area. Extensive experimental results show that our zero-shot DRS method significantly improves the reformulation accuracy of GPT-3.5 from 23.03% to 70.42% and effectively improves the score of open-source large language models, such as Gemma2-9B, from 26.35% to 56.75%.
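
The abstract gives few details about the search itself, so the following is only a generic sketch of a depth-first search over entity combinations. The `accept` callback is a hypothetical placeholder for the step where an LLM would produce a question constrained to the chosen entities and that question would be checked for answerability; none of these names come from the paper.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Combination = Tuple[str, ...]


def dfs_entity_search(
    entities: Sequence[str],
    accept: Callable[[Combination], bool],
    max_size: int,
) -> Optional[Combination]:
    """Return the first entity combination (in depth-first order) that `accept` approves."""

    def visit(start: int, chosen: List[str]) -> Optional[Combination]:
        if chosen and accept(tuple(chosen)):
            return tuple(chosen)
        if len(chosen) == max_size:
            return None  # do not grow combinations beyond the size limit
        for index in range(start, len(entities)):
            chosen.append(entities[index])
            found = visit(index + 1, chosen)
            chosen.pop()  # backtrack before trying the next entity
            if found is not None:
                return found
        return None

    return visit(0, [])
```

A caller would pass the entities extracted from the document and a predicate wrapping the model call; the search stops at the first accepted combination.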

38. 【2411.17992】New Faithfulness-Centric Interpretability Paradigms for Natural Language Processing

Link: https://arxiv.org/abs/2411.17992

Authors: Andreas Madsen

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: prevent unintended behavior, critical applications, unintended behavior, machine learning, prevent unintended

Comments: Doctoral thesis

Abstract:As machine learning becomes more widespread and is used in more critical applications, it's important to provide explanations for these models, to prevent unintended behavior. Unfortunately, many current interpretability methods struggle with faithfulness. Therefore, this Ph.D. thesis investigates the question "How to provide and ensure faithful explanations for complex general-purpose neural NLP models?" The main thesis is that we should develop new paradigms in interpretability. This is achieved by first developing solid faithfulness metrics and then applying the lessons learned from this investigation to develop new paradigms. The two new paradigms explored are faithfulness measurable models (FMMs) and self-explanations. The idea in self-explanations is to have large language models explain themselves, we identify that current models are not capable of doing this consistently. However, we suggest how this could be achieved. The idea of FMMs is to create models that are designed such that measuring faithfulness is cheap and precise. This makes it possible to optimize an explanation towards maximum faithfulness, which makes FMMs designed to be explained. We find that FMMs yield explanations that are near theoretical optimal in terms of faithfulness. Overall, from all investigations of faithfulness, results show that post-hoc and intrinsic explanations are by default model and task-dependent. However, this was not the case when using FMMs, even with the same post-hoc explanation methods. This shows, that even simple modifications to the model, such as randomly masking the training dataset, as was done in FMMs, can drastically change the situation and result in consistently faithful explanations. This answers the question of how to provide and ensure faithful explanations.

39. 【2411.17991】VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format

Link: https://arxiv.org/abs/2411.17991

Authors: Yueqian Wang, Xiaojun Meng, Yuxuan Wang, Jianxin Liang, Jiansheng Wei, Huishuai Zhang, Dongyan Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Recent researches, duet interaction format, interaction format, large language models, video large language

Comments: 9 pages

Abstract:Recent research on video large language models (VideoLLM) predominantly focuses on model architectures and training datasets, leaving the interaction format between the user and the model under-explored. In existing works, users often interact with VideoLLMs by using the entire video and a query as input, after which the model generates a response. This interaction format constrains the application of VideoLLMs in scenarios such as live-streaming comprehension where videos do not end and responses are required in a real-time manner, and also results in unsatisfactory performance on time-sensitive tasks that require localizing video segments. In this paper, we focus on a video-text duet interaction format. This interaction format is characterized by the continuous playback of the video, and both the user and the model can insert their text messages at any position during the video playback. When a text message ends, the video continues to play, akin to the alternation of two performers in a duet. We construct MMDuetIT, a video-text training dataset designed to adapt VideoLLMs to the video-text duet interaction format. We also introduce the Multi-Answer Grounded Video Question Answering (MAGQA) task to benchmark the real-time response ability of VideoLLMs. Trained on MMDuetIT, MMDuet demonstrates that adopting the video-text duet interaction format enables the model to achieve significant improvements in various time-sensitive tasks (76% CIDEr on YouCook2 dense video captioning, 90% mAP on QVHighlights highlight detection and 25% R@0.5 on Charades-STA temporal video grounding) with minimal training effort, and also enables VideoLLMs to reply in a real-time manner as the video plays. Code, data and demo are available at: this https URL.

40. 【2411.17967】QuaLLM-Health: An Adaptation of an LLM-Based Framework for Quantitative Data Extraction from Online Health Discussions

Link: https://arxiv.org/abs/2411.17967

Authors: Ramez Kouzy, Roxanna Attar-Olyaee, Michael K. Rooney, Comron J. Hassanzadeh, Junyi Jessy Li, Osama Mohamad

Categories: Computation and Language (cs.CL)

Keywords: Reddit offer valuable, Health-related discussions, text is challenging, Reddit offer, quantitative data

Comments:

Abstract:Health-related discussions on social media like Reddit offer valuable insights, but extracting quantitative data from unstructured text is challenging. In this work, we present an adapted framework from QuaLLM into QuaLLM-Health for extracting clinically relevant quantitative data from Reddit discussions about glucagon-like peptide-1 (GLP-1) receptor agonists using large language models (LLMs). We collected 410k posts and comments from five GLP-1-related communities using the Reddit API in July 2024. After filtering for cancer-related discussions, 2,059 unique entries remained. We developed annotation guidelines to manually extract variables such as cancer survivorship, family cancer history, cancer types mentioned, risk perceptions, and discussions with physicians. Two domain experts independently annotated a random sample of 100 entries to create a gold-standard dataset. We then employed iterative prompt engineering with OpenAI's "GPT-4o-mini" on the gold-standard dataset to build an optimized pipeline that allowed us to extract variables from the large dataset. The optimized LLM achieved accuracies above 0.85 for all variables, with macro-averaged precision, recall, and F1 score of 0.90, indicating balanced performance. Stability testing showed a 95% match rate across runs, confirming consistency. Applying the framework to the full dataset enabled efficient extraction of variables necessary for downstream analysis, costing under $3 and completing in approximately one hour. QuaLLM-Health demonstrates that LLMs can effectively and efficiently extract clinically relevant quantitative data from unstructured social media content. Incorporating human expertise and iterative prompt refinement ensures accuracy and reliability. This methodology can be adapted for large-scale analysis of patient-generated data across various health domains, facilitating valuable insights for healthcare research.
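
To make the macro-averaged metrics mentioned in this abstract concrete, here is a minimal Python sketch. It assumes single-label classification for one extracted variable; the function name `macro_prf` and the toy labels are hypothetical and are not taken from the paper or its code.

```python
def macro_prf(y_true, y_pred):
    """Macro-averaged precision, recall and F1 over all observed labels.

    Illustrative only: per-class scores are computed first, then averaged
    with equal weight per class (the usual meaning of "macro").
    """
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical gold annotations vs. model output for one variable.
gold = ["yes", "yes", "no", "no"]
pred = ["yes", "no", "no", "no"]
print(macro_prf(gold, pred))
```

Because every class counts equally, a rare class that is extracted poorly lowers the macro average as much as a common one would.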

41. 【2411.17943】Evaluating Generative AI-Enhanced Content: A Conceptual Framework Using Qualitative, Quantitative, and Mixed-Methods Approaches

Link: https://arxiv.org/abs/2411.17943

Authors: Saman Sarraf

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Keywords: offering transformative capabilities, improving language coherence, revolutionized content generation, offering transformative, transformative capabilities

Comments:

Abstract:Generative AI (GenAI) has revolutionized content generation, offering transformative capabilities for improving language coherence, readability, and overall quality. This manuscript explores the application of qualitative, quantitative, and mixed-methods research approaches to evaluate the performance of GenAI models in enhancing scientific writing. Using a hypothetical use case involving a collaborative medical imaging manuscript, we demonstrate how each method provides unique insights into the impact of GenAI. Qualitative methods gather in-depth feedback from expert reviewers, analyzing their responses using thematic analysis tools to capture nuanced improvements and identify limitations. Quantitative approaches employ automated metrics such as BLEU, ROUGE, and readability scores, as well as user surveys, to objectively measure improvements in coherence, fluency, and structure. Mixed-methods research integrates these strengths, combining statistical evaluations with detailed qualitative insights to provide a comprehensive assessment. These research methods enable quantifying improvement levels in GenAI-generated content, addressing critical aspects of linguistic quality and technical accuracy. They also offer a robust framework for benchmarking GenAI tools against traditional editing processes, ensuring the reliability and effectiveness of these technologies. By leveraging these methodologies, researchers can evaluate the performance boost driven by GenAI, refine its applications, and guide its responsible adoption in high-stakes domains like healthcare and scientific research. This work underscores the importance of rigorous evaluation frameworks for advancing trust and innovation in GenAI.

42. 【2411.17891】HOPPR Medical-Grade Platform for Medical Imaging AI

Link: https://arxiv.org/abs/2411.17891

Authors: Kalina P. Slavkova, Melanie Traughber, Oliver Chen, Robert Bakos, Shayna Goldstein, Dan Harms, Bradley J. Erickson, Khan M. Siddiqui

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: large vision language, Technological advances, vision language models, artificial intelligence, HOPPR Platform

Comments: 6 pages, 3 figures

Abstract:Technological advances in artificial intelligence (AI) have enabled the development of large vision language models (LVLMs) that are trained on millions of paired image and text samples. Subsequent research efforts have demonstrated the great potential of LVLMs to achieve high performance in medical imaging use cases (e.g., radiology report generation), but there remain barriers that hinder the ability to deploy these solutions broadly. These include the cost of extensive computational requirements for developing large-scale models, expertise in the development of sophisticated AI models, and the difficulty in accessing substantially large, high-quality datasets that adequately represent the population in which the LVLM solution is to be deployed. The HOPPR Medical-Grade Platform addresses these barriers by providing powerful computational infrastructure, a suite of foundation models on top of which developers can fine-tune for their specific use cases, and a robust quality management system that sets a standard for evaluating fine-tuned models for deployment in clinical settings. The HOPPR Platform has access to millions of imaging studies and text reports sourced from hundreds of imaging centers from diverse populations to pretrain foundation models and enable use case-specific cohorts for fine-tuning. All data are deidentified and securely stored for HIPAA compliance. Additionally, developers can securely host models on the HOPPR platform and access them via an API to make inferences using these models within established clinical workflows. With the Medical-Grade Platform, HOPPR's mission is to expedite the deployment of LVLM solutions for medical imaging and ultimately optimize radiologists' workflows and meet the growing demands of the field.

43. 【2411.17876】Leveraging Large Language Models and Topic Modeling for Toxicity Classification

Link: https://arxiv.org/abs/2411.17876

Authors: Haniyeh Ehsani Oskouie, Christina Chance, Claire Huang, Margaret Capetz, Elizabeth Eyeson, Majid Sarrafzadeh

Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: represent critical tasks, classification represent critical, significant social implications, toxicity classification represent, social implications

Comments:

Abstract:Content moderation and toxicity classification represent critical tasks with significant social implications. However, studies have shown that major classification models exhibit tendencies to magnify or reduce biases and potentially overlook or disadvantage certain marginalized groups within their classification processes. Researchers suggest that the positionality of annotators influences the gold-standard labels, so that models trained on those labels propagate the annotators' bias. To further investigate the impact of annotator positionality, we delve into fine-tuning BERTweet and HateBERT on the dataset while using topic-modeling strategies for content moderation. The results indicate that fine-tuning the models on specific topics results in a notable improvement in the F1 score of the models when compared to the predictions generated by other prominent classification models such as GPT-4, PerspectiveAPI, and RewireAPI. These findings further reveal that the state-of-the-art large language models exhibit significant limitations in accurately detecting and interpreting text toxicity compared with earlier methodologies. Code is available at this https URL.

44. 【2411.17863】LongKey: Keyphrase Extraction for Long Documents

Link: https://arxiv.org/abs/2411.17863

Authors: Jeovane Honorio Alves, Radu State, Cinthia Obladen de Almendra Freitas, Jean Paul Barddal

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: information overload, manually annotating, increasingly impractical, era of information, annotating the vast

Comments: Accepted for presentation at the 2024 IEEE International Conference on Big Data (IEEE BigData 2024). Code available at [this https URL](https://github.com/jeohalves/longkey)

Abstract:In an era of information overload, manually annotating the vast and growing corpus of documents and scholarly papers is increasingly impractical. Automated keyphrase extraction addresses this challenge by identifying representative terms within texts. However, most existing methods focus on short documents (up to 512 tokens), leaving a gap in processing long-context documents. In this paper, we introduce LongKey, a novel framework for extracting keyphrases from lengthy documents, which uses an encoder-based language model to capture extended text intricacies. LongKey uses a max-pooling embedder to enhance keyphrase candidate representation. Validated on the comprehensive LDKP datasets and six diverse, unseen datasets, LongKey consistently outperforms existing unsupervised and language model-based keyphrase extraction methods. Our findings demonstrate LongKey's versatility and superior performance, marking an advancement in keyphrase extraction for varied text lengths and domains.
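
The abstract mentions a max-pooling embedder but does not say exactly what is pooled. As a rough illustration only, the sketch below assumes that a candidate keyphrase has several embedding vectors (for example, one per occurrence in the document) and combines them with element-wise max pooling. The function name `max_pool` and the toy vectors are hypothetical and should not be read as LongKey's actual implementation.

```python
def max_pool(vectors):
    """Element-wise maximum over a list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    if len({len(v) for v in vectors}) != 1:
        raise ValueError("all vectors must have the same length")
    # zip(*vectors) groups the i-th component of every vector together.
    return [max(components) for components in zip(*vectors)]


# Two hypothetical embeddings of the same candidate phrase.
occurrence_embeddings = [
    [0.1, 0.9, -0.3],
    [0.4, 0.2, -0.1],
]
print(max_pool(occurrence_embeddings))  # [0.4, 0.9, -0.1]
```

Max pooling keeps the strongest activation in each dimension, so the pooled vector has a fixed size no matter how many vectors go in.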

45. 【2411.17835】Arabic-Nougat: Fine-Tuning Vision Transformers for Arabic OCR and Markdown Extraction

Link: https://arxiv.org/abs/2411.17835

Authors: Mohamed Rashad

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: structured Markdown text, Arabic book pages, converting Arabic book, Meta Nougat architecture, Markdown text

Comments: 7 pages, 1 figure

Abstract:We present Arabic-Nougat, a suite of OCR models for converting Arabic book pages into structured Markdown text. Based on Meta's Nougat architecture, Arabic-Nougat includes three specialized models: arabic-small-nougat, arabic-base-nougat, and arabic-large-nougat. These models are fine-tuned on a synthetic dataset, arabic-img2md, comprising 13.7k pairs of Arabic book pages and their Markdown representations. Key contributions include the Aranizer-PBE-86k tokenizer, designed for efficient tokenization, and the use of torch.bfloat16 precision with Flash Attention 2 for optimized training and inference. Our models achieve state-of-the-art performance, with arabic-large-nougat delivering the highest Markdown Structure Accuracy and the lowest Character Error Rate. Additionally, we release a large-scale dataset containing 1.1 billion Arabic tokens extracted from over 8,500 books using our best-performing model, providing a valuable resource for Arabic OCR research. All models, datasets, and code are open-sourced and available at this https URL.

46. 【2411.17799】Signs as Tokens: An Autoregressive Multilingual Sign Language Generator

Link: https://arxiv.org/abs/2411.17799

Authors: Ronglai Zuo, Rolandos Alexandros Potamias, Evangelos Ververas, Jiankang Deng, Stefanos Zafeiriou

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: primary communication method, Sign language, Sign, language, features of natural

Comments:

Abstract:Sign language is a visual language that encompasses all linguistic features of natural languages and serves as the primary communication method for the deaf and hard-of-hearing communities. While many studies have successfully adapted pretrained language models (LMs) for sign language translation (sign-to-text), drawing inspiration from its linguistic characteristics, the reverse task of sign language generation (SLG, text-to-sign) remains largely unexplored. Most existing approaches treat SLG as a visual content generation task, employing techniques such as diffusion models to produce sign videos, 2D keypoints, or 3D avatars based on text inputs, overlooking the linguistic properties of sign languages. In this work, we introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs using a pretrained LM. To align sign language with the LM, we develop a decoupled tokenizer that discretizes continuous signs into token sequences representing various body parts. These sign tokens are integrated into the raw text vocabulary of the LM, allowing for supervised fine-tuning on sign language datasets. To facilitate multilingual SLG research, we further curate a large-scale Chinese sign language dataset, CSL-Daily, with high-quality 3D pose annotations. Extensive qualitative and quantitative evaluations demonstrate the effectiveness of SOKE. The project page is available at this https URL.

47. 【2411.17792】$H^3$Fusion: Helpful, Harmless, Honest Fusion of Aligned LLMs

Link: https://arxiv.org/abs/2411.17792

Authors: Selim Furkan Tekin, Fatih Ilhan, Tiansheng Huang, Sihao Hu, Zachary Yahn, Ling Liu

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: reflect human preference, creating fine-tuned models, human preference, alignment fusion, critical for creating

Comments:

Abstract:Alignment of pretrained LLMs using instruction-based datasets is critical for creating fine-tuned models that reflect human preference. A growing number of alignment-based fine-tuning algorithms and benchmarks have emerged recently, fueling the efforts on effective alignment of pre-trained LLMs to ensure helpful, harmless, and honest answers from both open-source and closed-source LLMs. This paper tackles this problem by developing an alignment fusion approach, coined as $H^3$Fusion, with three unique characteristics. First, $H^3$Fusion ensembles multiple individually aligned LLMs to create a final fine-tuned alignment model with enhanced capabilities beyond those of individual models, delivering robust alignment through promoting helpful, harmless, honest fusion. Second, $H^3$Fusion leverages the mixture-of-experts (MoE) methodology in two steps. We first freeze the multi-head attention weights of each individual model while tuning the FFN layer during alignment fusion. Then we merge the aligned model weights with an expert router according to the type of input instruction and dynamically select a subset of experts that are best suited for producing the output response. Finally, we boost the performance of the resulting $H^3$Fusion model by introducing gating loss and regularization terms. The former penalizes the selection errors of the expert router, and the latter mediates the expert weights drifting during fine-tuning and dynamically adjusts the fusion behavior of the resulting model by canalizing the activations on the experts. Extensive evaluations on three benchmark datasets show that $H^3$Fusion is more helpful, less harmful, and more honest from two aspects: it outperforms each individually aligned model by $11.37\%$, and it provides stronger robustness compared to the state-of-the-art LLM ensemble approaches by $13.77\%$. Code is available at this http URL.

48. 【2411.17760】Efficient Self-Improvement in Multimodal Large Language Models: A Model-Level Judge-Free Approach

Link: https://arxiv.org/abs/2411.17760

Authors: Shijian Deng, Wentian Zhao, Yu-Jhe Li, Kun Wan, Daniel Miranda, Ajinkya Kale, Yapeng Tian

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: multimodal large language, large language models, reliability and robustness, multimodal large, large language

Comments:

Abstract:Self-improvement in multimodal large language models (MLLMs) is crucial for enhancing their reliability and robustness. However, current methods often rely heavily on MLLMs themselves as judges, leading to high computational costs and potential pitfalls like reward hacking and model collapse. This paper introduces a novel, model-level judge-free self-improvement framework. Our approach employs a controlled feedback mechanism while eliminating the need for MLLMs in the verification loop. We generate preference learning pairs using a controllable hallucination mechanism and optimize data quality by leveraging lightweight, contrastive language-image encoders to evaluate and reverse pairs when necessary. Evaluations across public benchmarks and our newly introduced IC dataset designed to challenge hallucination control demonstrate that our model outperforms conventional techniques. We achieve superior precision and recall with significantly lower computational demands. This method offers an efficient pathway to scalable self-improvement in MLLMs, balancing performance gains with reduced resource requirements.

49. 【2411.17719】SlideSpawn: An Automatic Slides Generation System for Research Publications

Link: https://arxiv.org/abs/2411.17719

Authors: Keshav Kumar, Ravindranath Chowdary

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Research, Research papers, structured documents, Abstract, PDF

Comments: 6 pages, 4 figures, 2 tables, 5 equations, 41 references

Abstract:Research papers are well-structured documents. They have text, figures, equations, tables, etc., to convey their ideas and findings. They are divided into sections like Introduction, Model, Experiments, etc., which deal with different aspects of research. Characteristics like these set research papers apart from ordinary documents and allow us to significantly improve their summarization. In this paper, we propose a novel system, SlideSpawn, that takes the PDF of a research document as input and generates a quality presentation providing its summary in a visual and concise fashion. The system first converts the PDF of the paper to an XML document that has the structural information about various elements. Then a machine learning model, trained on the PS5K dataset and the Aminer 9.5K Insights dataset (that we introduce), is used to predict the salience of each sentence in the paper. Sentences for slides are selected using ILP and clustered based on their similarity, with each cluster being given a suitable title. Finally, a slide is generated by placing any graphical element referenced in the selected sentences next to them. Experiments on a test set of 650 pairs of papers and slides demonstrate that our system generates presentations with better quality.

50. 【2411.17708】Towards Efficient Neurally-Guided Program Induction for ARC-AGI

Link: https://arxiv.org/abs/2411.17708

Authors: Simon Ouellette

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)

Keywords: open-world problem domain, ability to generalize, crucial quality, open-world problem, problem domain

Comments:

Abstract:ARC-AGI is an open-world problem domain in which the ability to generalize out-of-distribution is a crucial quality. Under the program induction paradigm, we present a series of experiments that reveal the efficiency and generalization characteristics of various neurally-guided program induction approaches. The three paradigms we consider are Learning the grid space, Learning the program space, and Learning the transform space. We implement and experiment thoroughly on the first two, and retain the second one for ARC-AGI submission. After identifying the strengths and weaknesses of both of these approaches, we suggest the third as a potential solution, and run preliminary experiments.

51. 【2411.18138】SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation

Link: https://arxiv.org/abs/2411.18138

Authors: Wenyi Yu, Siyin Wang, Xiaoyu Yang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Guangzhi Sun, Lu Lu, Yuxuan Wang, Chao Zhang

Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)

Keywords: seamless human-machine conversations, multimodal large language, addressing diverse speech, large language models, Full-duplex multimodal large

Comments: Technical report

Abstract:Full-duplex multimodal large language models (LLMs) provide a unified framework for addressing diverse speech understanding and generation tasks, enabling more natural and seamless human-machine conversations. Unlike traditional modularised conversational AI systems, which separate speech recognition, understanding, and text-to-speech generation into distinct components, multimodal LLMs operate as single end-to-end models. This streamlined design eliminates error propagation across components and fully leverages the rich non-verbal information embedded in input speech signals. We introduce SALMONN-omni, a codec-free, full-duplex speech understanding and generation model capable of simultaneously listening to its own generated speech and background sounds while speaking. To support this capability, we propose a novel duplex spoken dialogue framework incorporating a "thinking" mechanism that facilitates asynchronous text and speech generation relying on embeddings instead of codecs (quantized speech and audio tokens). Experimental results demonstrate SALMONN-omni's versatility across a broad range of streaming speech tasks, including speech recognition, speech enhancement, and spoken question answering. Additionally, SALMONN-omni excels at managing turn-taking, barge-in, and echo cancellation scenarios, establishing its potential as a robust prototype for full-duplex conversational AI systems. To the best of our knowledge, SALMONN-omni is the first codec-free model of its kind. A full technical report along with model checkpoints will be released soon.

52. 【2411.18010】JPPO: Joint Power and Prompt Optimization for Accelerated Large Language Model Services

Link: https://arxiv.org/abs/2411.18010

Authors: Feiran You, Hongyang Du, Kaibin Huang, Abbas Jamalipour

Categories: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)

Keywords: Large Language Models, Small Language Model, demonstrated remarkable capabilities, Large Language, Language Models

Comments:

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, leading to their increasing deployment in wireless networks for a wide variety of user services. However, the trend toward increasingly long prompts highlights the crucial issues of computational resource demand and heavy communication load. To address this challenge, we propose Joint Power and Prompt Optimization (JPPO), a framework that combines Small Language Model (SLM)-based prompt compression with wireless power allocation optimization. By deploying an SLM at user devices for prompt compression and employing Deep Reinforcement Learning for joint optimization of compression ratio and transmission power, JPPO effectively balances service quality with resource efficiency. Experimental results demonstrate that our framework achieves high service fidelity and low bit error rates while optimizing power usage in wireless LLM services. The system reduces response time by about 17%, with the improvement varying based on the length of the original prompt.

Information Retrieval

1. 【2411.18583】Automated Literature Review Using NLP Techniques and LLM-Based Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2411.18583

Authors: Nurshat Fateh Ali, Md. Mahdi Mohtasim, Shakil Mosharrof, T. Gopi Krishna

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Large Language Model, Natural Language Processing, compares multiple approaches, Large Language, Language Model

Comments: Key Words: T5, SpaCy, Large Language Model, GPT, ROUGE, Literature Review, Natural Language Processing, Retrieval-augmented generation

Abstract:This research presents and compares multiple approaches to automate the generation of literature reviews using several Natural Language Processing (NLP) techniques and retrieval-augmented generation (RAG) with a Large Language Model (LLM). The ever-increasing number of research articles provides a huge challenge for manual literature review. It has resulted in an increased demand for automation. Developing a system capable of automatically generating the literature reviews from only the PDF files as input is the primary objective of this research work. The effectiveness of several Natural Language Processing (NLP) strategies, such as the frequency-based method (spaCy), the transformer model (Simple T5), and retrieval-augmented generation (RAG) with Large Language Model (GPT-3.5-turbo), is evaluated to meet the primary objective. The SciTLDR dataset is chosen for this research experiment and three distinct techniques are utilized to implement three different systems for auto-generating the literature reviews. The ROUGE scores are used for the evaluation of all three systems. Based on the evaluation, the Large Language Model GPT-3.5-turbo achieved the highest ROUGE-1 score, 0.364. The transformer model comes in second place and spaCy is at the last position. Finally, a graphical user interface is created for the best system based on the large language model.
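
For readers unfamiliar with the ROUGE-1 metric reported in this abstract, the following is a minimal sketch of unigram-overlap scoring. It assumes lowercased whitespace tokenization with no stemming, and it is not the scoring package used in the paper (the abstract does not name one). The function name `rouge1` and the sample sentences are hypothetical.

```python
from collections import Counter


def rouge1(candidate, reference):
    """Return (precision, recall, f1) based on clipped unigram overlap."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical generated summary vs. reference summary.
print(rouge1("the cat sat", "the cat sat on the mat"))
```

Published ROUGE implementations differ in tokenization and stemming, so scores from this sketch are only indicative and are not comparable to the 0.364 reported above.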

2. 【2411.18306】Delineating Feminist Studies through bibliometric analysis

Link: https://arxiv.org/abs/2411.18306

Authors: Natsumi S. Shokida, Diego Kozlowski, Vincent Larivière

Categories: Digital Libraries (cs.DL); Information Retrieval (cs.IR)

Keywords: Feminist Studies presents, socially anchored nature, Studies presents unique, feminist and LGBTQIA, Gender Studies

Comments: 2 tables, 5 figures

Abstract:The multidisciplinary and socially anchored nature of Feminist Studies presents unique challenges for bibliometric analysis, as this research area transcends traditional disciplinary boundaries and reflects discussions from feminist and LGBTQIA+ social movements. This paper proposes a novel approach for identifying gender/sex related publications scattered across diverse scientific disciplines. Using the Dimensions database, we employ bibliometric techniques, natural language processing (NLP) and manual curation to compile a dataset of scientific publications that allows for the analysis of Gender Studies and its influence across different disciplines. This is achieved through a methodology that combines a core of specialized journals with a comprehensive keyword search over titles. These keywords are obtained by applying Topic Modeling (BERTopic) to the corpus of titles and abstracts from the core. This methodological strategy, divided into two stages, reflects the dynamic interaction between Gender Studies and its dialogue with different disciplines. This hybrid system surpasses basic keyword search by mitigating potential biases introduced through manual keyword enumeration. The resulting dataset comprises over 1.9 million scientific documents published between 1668 and 2023, spanning four languages. This dataset enables a characterization of Gender Studies in terms of addressed topics, citation and collaboration dynamics, and institutional and regional participation. By addressing the methodological challenges of studying "more-than-disciplinary" research areas, this approach could also be adapted to delineate other conversations where disciplinary boundaries are difficult to disentangle.

3. 【2411.18262】Break the ID-Language Barrier: An Adaption Framework for Sequential Recommendation

Link: https://arxiv.org/abs/2411.18262

Authors: Xiaohan Yu, Li Zhang, Xin Zhao, Yue Wang

Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: natural language processing, large language models, recent breakthrough, breakthrough of large, processing has sparked

Comments:

Abstract:The recent breakthrough of large language models (LLMs) in natural language processing has sparked exploration in recommendation systems; however, their limited domain-specific knowledge remains a critical bottleneck. Specifically, LLMs lack key pieces of information crucial for sequential recommendations, such as user behavior patterns. To address this critical gap, we propose IDLE-Adapter, a novel framework that integrates pre-trained ID embeddings, rich in domain-specific knowledge, into LLMs to improve recommendation accuracy. IDLE-Adapter acts as a bridge, transforming sparse user-item interaction data into dense, LLM-compatible representations through a Pre-trained ID Sequential Model, Dimensionality Alignment, Layer-wise Embedding Refinement, and Layer-wise Distribution Alignment. Furthermore, IDLE-Adapter demonstrates remarkable flexibility by seamlessly integrating ID embeddings from diverse ID-based sequential models and LLM architectures. Extensive experiments across various datasets demonstrate the superiority of IDLE-Adapter, achieving over 10% and 20% improvements in HitRate@5 and NDCG@5 metrics, respectively, compared to state-of-the-art methods.

4. 【2411.18161】The Rn-index: a more accurate variant of the Rk-index

Link: https://arxiv.org/abs/2411.18161

Authors: Alonso Rodriguez-Navarro

Categories: Digital Libraries (cs.DL); Information Retrieval (cs.IR)

Keywords: common bibliometric indicators, bibliometric indicators, pushing the boundaries, boundaries of knowledge, critical metric

Comments: 6 pages; 2 figures; 4 tables

Abstract:The contribution to pushing the boundaries of knowledge is a critical metric for evaluating the research performance of countries and institutions, which in many cases is not revealed by common bibliometric indicators. The Rk-index was specifically designed to assess such contributions, and the Rn-index is a variant that corrects the weakness of the Rk-index, particularly in the evaluation of countries that produce a high proportion of global advancements. This is the case of the USA and China in many technological fields. Additionally, the Rn-index is simple to calculate and understand, as it involves only summing the ratios between the local and global ranks of papers, ordered by their citation count. Moreover, the Rn-index may also be fractionally counted.
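The calculation described in the abstract (summing the ratios between local and global ranks of papers ordered by citation count) can be sketched as follows. This is one reading of that sentence, not the paper's reference implementation: the function name, the tie-breaking rule, and the toy citation counts are assumptions, and fractional counting is not handled.

```python
def rn_index(global_citations, local_ids):
    """Sum of local_rank / global_rank over a country's papers.

    global_citations: dict mapping paper id -> citation count (all papers worldwide)
    local_ids: set of paper ids attributed to the country being evaluated
    """
    # Order all papers by descending citation count.
    # Assumption: ties are broken by paper id; the abstract does not say how.
    ranked = sorted(global_citations, key=lambda pid: (-global_citations[pid], pid))
    total = 0.0
    local_rank = 0
    for global_rank, pid in enumerate(ranked, start=1):
        if pid in local_ids:
            local_rank += 1
            total += local_rank / global_rank
    return total

# Hypothetical toy data: five papers worldwide, three from the country of interest.
citations = {"p1": 100, "p2": 80, "p3": 60, "p4": 40, "p5": 20}
rn_country = rn_index(citations, {"p1", "p3", "p4"})  # 1/1 + 2/3 + 3/4
rn_world = rn_index(citations, set(citations))        # every ratio is 1
```

Under this reading, a country that authors every paper scores one per paper, and a country whose papers sit far down the global ranking contributes ratios close to zero.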

5. 【2411.18073】DuMapper: Towards Automatic Verification of Large-Scale POIs with Street Views at Baidu Maps

Link: https://arxiv.org/abs/2411.18073

Authors: Miao Fan, Jizhou Huang, Haifeng Wang

Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)

Keywords: Web mapping services, Web mapping, POI verification, mobile devices, increased popularity

Abstract: With the increased popularity of mobile devices, Web mapping services have become an indispensable tool in our daily lives. To provide user-satisfied services, such as location searches, the point of interest (POI) database is the fundamental infrastructure, as it archives multimodal information on billions of geographic locations closely related to people's lives, such as a shop or a bank. Therefore, verifying the correctness of a large-scale POI database is vital. To achieve this goal, many industrial companies adopt volunteered geographic information (VGI) platforms that enable thousands of crowdworkers and expert mappers to verify POIs seamlessly; but to do so, they have to spend millions of dollars every year. To save the tremendous labor costs, we devised DuMapper, an automatic system for large-scale POI verification with the multimodal street-view data at Baidu Maps. DuMapper takes the signboard image and the coordinates of a real-world place as input to generate a low-dimensional vector, which can be leveraged by ANN algorithms to conduct a more accurate search through billions of archived POIs in the database for verification within milliseconds. It can significantly increase the throughput of POI verification by 50 times. DuMapper has already been deployed in production since \DuMPOnline, which dramatically improves the productivity and efficiency of POI verification at Baidu Maps. As of December 31, 2021, it has enacted over 405 million iterations of POI verification within a 3.5-year period, representing an approximate workload of 800 high-performance expert mappers.

6. 【2411.18069】Overview of TREC 2024 Biomedical Generative Retrieval (BioGen) Track

Link: https://arxiv.org/abs/2411.18069

Authors: Deepak Gupta, Dina Demner-Fushman, William Hersh, Steven Bedrick, Kirk Roberts

Categories: Information Retrieval (cs.IR)

Keywords: lay language summarization, clinical note summarization, large language models, language summarization, lay language

Abstract:With the advancement of large language models (LLMs), the biomedical domain has seen significant progress and improvement in multiple tasks such as biomedical question answering, lay language summarization of the biomedical literature, clinical note summarization, etc. However, hallucinations or confabulations remain one of the key challenges when using LLMs in the biomedical and other domains. Inaccuracies may be particularly harmful in high-risk situations, such as making clinical decisions or appraising biomedical research. Studies on the evaluation of the LLMs' abilities to ground generated statements in verifiable sources have shown that models perform significantly worse on lay-user generated questions, and often fail to reference relevant sources. This can be problematic when those seeking information want evidence from studies to back up the claims from LLMs[3]. Unsupported statements are a major barrier to using LLMs in any applications that may affect health. Methods for grounding generated statements in reliable sources along with practical evaluation approaches are needed to overcome this barrier. Towards this, in our pilot task organized at TREC 2024, we introduced the task of reference attribution as a means to mitigate the generation of false statements by LLMs answering biomedical questions.

7. 【2411.17863】LongKey: Keyphrase Extraction for Long Documents

Link: https://arxiv.org/abs/2411.17863

Authors: Jeovane Honorio Alves, Radu State, Cinthia Obladen de Almendra Freitas, Jean Paul Barddal

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: information overload, manually annotating, increasingly impractical, era of information, annotating the vast

Comments: Accepted for presentation at the 2024 IEEE International Conference on Big Data (IEEE BigData 2024). Code available at [this https URL](https://github.com/jeohalves/longkey)

Abstract:In an era of information overload, manually annotating the vast and growing corpus of documents and scholarly papers is increasingly impractical. Automated keyphrase extraction addresses this challenge by identifying representative terms within texts. However, most existing methods focus on short documents (up to 512 tokens), leaving a gap in processing long-context documents. In this paper, we introduce LongKey, a novel framework for extracting keyphrases from lengthy documents, which uses an encoder-based language model to capture extended text intricacies. LongKey uses a max-pooling embedder to enhance keyphrase candidate representation. Validated on the comprehensive LDKP datasets and six diverse, unseen datasets, LongKey consistently outperforms existing unsupervised and language model-based keyphrase extraction methods. Our findings demonstrate LongKey's versatility and superior performance, marking an advancement in keyphrase extraction for varied text lengths and domains.
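The abstract mentions a max-pooling embedder for keyphrase candidates without giving details. A minimal sketch of the general idea is shown below, assuming each candidate occurs at several places in a document and that each occurrence already has an embedding vector; the element-wise maximum over occurrences then yields one vector per candidate. The data layout, names, and numbers are hypothetical and are not taken from the LongKey code.

```python
def max_pool(vectors):
    """Element-wise maximum over a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one occurrence embedding")
    return [max(column) for column in zip(*vectors)]

# Hypothetical occurrence embeddings: candidate -> one vector per occurrence.
occurrence_embeddings = {
    "keyphrase extraction": [[0.1, 0.9, -0.2], [0.4, 0.3, 0.0]],
    "long documents": [[-0.5, 0.2, 0.7]],
}

# One pooled vector per candidate, regardless of how often it occurs.
candidate_embeddings = {
    phrase: max_pool(vectors) for phrase, vectors in occurrence_embeddings.items()
}
```

Pooling this way keeps the strongest signal in each dimension across all occurrences, so a candidate mentioned many times in a long document still maps to a single fixed-size representation.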

8. 【2411.17719】SlideSpawn: An Automatic Slides Generation System for Research Publications

Link: https://arxiv.org/abs/2411.17719

Authors: Keshav Kumar, Ravindranath Chowdary

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)

Keywords: Research, Research papers, structured documents, Abstract, PDF

Comments: 6 pages, 4 figures, 2 tables, 5 equations, 41 references

Abstract: Research papers are well-structured documents. They have text, figures, equations, tables, etc., to convey their ideas and findings. They are divided into sections like Introduction, Model, Experiments, etc., which deal with different aspects of research. Characteristics like these set research papers apart from ordinary documents and allow us to significantly improve their summarization. In this paper, we propose a novel system, SlideSpawn, that takes the PDF of a research document as input and generates a quality presentation providing its summary in a visual and concise fashion. The system first converts the PDF of the paper to an XML document that has the structural information about various elements. Then a machine learning model, trained on the PS5K dataset and the Aminer 9.5K Insights dataset (that we introduce), is used to predict the salience of each sentence in the paper. Sentences for slides are selected using ILP and clustered based on their similarity, with each cluster being given a suitable title. Finally, a slide is generated by placing any graphical element referenced in the selected sentences next to them. Experiments on a test set of 650 pairs of papers and slides demonstrate that our system generates presentations with better quality.

9. 【2411.18502】Isometry pursuit

Link: https://arxiv.org/abs/2411.18502

Authors: Samson Koelle, Marina Meila

Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Methodology (stat.ME)

Keywords: identifying orthonormal column-submatrices, Isometry pursuit, wide matrices, convex algorithm, algorithm for identifying

Abstract: Isometry pursuit is a convex algorithm for identifying orthonormal column-submatrices of wide matrices. It consists of a novel normalization method followed by multitask basis pursuit. Applied to Jacobians of putative coordinate functions, it helps identify isometric embeddings from within interpretable dictionaries. We provide theoretical and experimental results justifying this method. For problems involving coordinate selection and diversification, it offers a synergistic alternative to greedy and brute force search.

Computer Vision

1. 【2411.18625】Textured Gaussians for Enhanced 3D Scene Appearance Modeling

Link: https://arxiv.org/abs/2411.18625

Authors: Brian Chao, Hung-Yu Tseng, Lorenzo Porzi, Chen Gao, Tuotuo Li, Qinbo Li, Ayush Saraf, Jia-Bin Huang, Johannes Kopf, Gordon Wetzstein, Changil Kim

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: rendering technique due, Gaussian, Gaussian Splatting, reconstruction and rendering, rendering time

Comments: Project website: [this https URL](https://textured-gaussians.github.io/)

Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a state-of-the-art 3D reconstruction and rendering technique due to its high-quality results and fast training and rendering time. However, pixels covered by the same Gaussian are always shaded in the same color up to a Gaussian falloff scaling factor. Furthermore, the finest geometric detail any individual Gaussian can represent is a simple ellipsoid. These properties of 3DGS greatly limit the expressivity of individual Gaussian primitives. To address these issues, we draw inspiration from texture and alpha mapping in traditional graphics and integrate it with 3DGS. Specifically, we propose a new generalized Gaussian appearance representation that augments each Gaussian with alpha (A), RGB, or RGBA texture maps to model spatially varying color and opacity across the extent of each Gaussian. As such, each Gaussian can represent a richer set of texture patterns and geometric structures, instead of just a single color and ellipsoid as in naive Gaussian Splatting. Surprisingly, we found that the expressivity of Gaussians can be greatly improved by using alpha-only texture maps, and further augmenting Gaussians with RGB texture maps achieves the highest expressivity. We validate our method on a wide variety of standard benchmark datasets and our own custom captures at both the object and scene levels. We demonstrate image quality improvements over existing methods while using a similar or lower number of Gaussians.

2. 【2411.18624】GeneMAN: Generalizable Single-Image 3D Human Reconstruction from Multi-Source Human Data

Link: https://arxiv.org/abs/2411.18624

Authors: Wentao Wang, Hang Ye, Fangzhou Hong, Xue Yang, Jianfu Zhang, Yizhou Wang, Ziwei Liu, Liang Pan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: human, remains a challenging, challenging task, task to reconstruct, high-quality human data

Comments: Project page: [this https URL](https://roooooz.github.io/GeneMAN/)

Abstract:Given a single in-the-wild human photo, it remains a challenging task to reconstruct a high-fidelity 3D human model. Existing methods face difficulties including a) the varying body proportions captured by in-the-wild human images; b) diverse personal belongings within the shot; and c) ambiguities in human postures and inconsistency in human textures. In addition, the scarcity of high-quality human data intensifies the challenge. To address these problems, we propose a Generalizable image-to-3D huMAN reconstruction framework, dubbed GeneMAN, building upon a comprehensive multi-source collection of high-quality human data, including 3D scans, multi-view videos, single photos, and our generated synthetic human data. GeneMAN encompasses three key modules. 1) Without relying on parametric human models (e.g., SMPL), GeneMAN first trains a human-specific text-to-image diffusion model and a view-conditioned diffusion model, serving as GeneMAN 2D human prior and 3D human prior for reconstruction, respectively. 2) With the help of the pretrained human prior models, the Geometry Initialization--Sculpting pipeline is leveraged to recover high-quality 3D human geometry given a single image. 3) To achieve high-fidelity 3D human textures, GeneMAN employs the Multi-Space Texture Refinement pipeline, consecutively refining textures in the latent and the pixel spaces. Extensive experimental results demonstrate that GeneMAN could generate high-quality 3D human models from a single image input, outperforming prior state-of-the-art methods. Notably, GeneMAN could reveal much better generalizability in dealing with in-the-wild images, often yielding high-quality 3D human models in natural poses with common items, regardless of the body proportions in the input images.

3. 【2411.18623】Lift3D Foundation Policy: Lifting 2D Large-Scale Pretrained Models for Robust 3D Robotic Manipulation

Link: https://arxiv.org/abs/2411.18623

Authors: Yueru Jia, Jiaming Liu, Sixiang Chen, Chenyang Gu, Zhilue Wang, Longzan Luo, Lily Lee, Pengwei Wang, Zhongyuan Wang, Renrui Zhang, Shanghang Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: intricate spatial configurations, interact with intricate, spatial relationships, spatial configurations, manipulation tasks

Abstract:3D geometric information is essential for manipulation tasks, as robots need to perceive the 3D environment, reason about spatial relationships, and interact with intricate spatial configurations. Recent research has increasingly focused on the explicit extraction of 3D features, while still facing challenges such as the lack of large-scale robotic 3D data and the potential loss of spatial geometry. To address these limitations, we propose the Lift3D framework, which progressively enhances 2D foundation models with implicit and explicit 3D robotic representations to construct a robust 3D manipulation policy. Specifically, we first design a task-aware masked autoencoder that masks task-relevant affordance patches and reconstructs depth information, enhancing the 2D foundation model's implicit 3D robotic representation. After self-supervised fine-tuning, we introduce a 2D model-lifting strategy that establishes a positional mapping between the input 3D points and the positional embeddings of the 2D model. Based on the mapping, Lift3D utilizes the 2D foundation model to directly encode point cloud data, leveraging large-scale pretrained knowledge to construct explicit 3D robotic representations while minimizing spatial information loss. In experiments, Lift3D consistently outperforms previous state-of-the-art methods across several simulation benchmarks and real-world scenarios.

4. 【2411.18622】Leveraging Semi-Supervised Learning to Enhance Data Mining for Image Classification under Limited Labeled Data

Link: https://arxiv.org/abs/2411.18622

Authors: Aoran Shen, Minghao Dai, Jiacheng Hu, Yingbin Liang, Shiru Wang, Junliang Du

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: extracting valuable information, effectively extracting valuable, big data technology, information age, valuable information

Abstract:In the 21st-century information age, with the development of big data technology, effectively extracting valuable information from massive data has become a key issue. Traditional data mining methods are inadequate when faced with large-scale, high-dimensional and complex data. Especially when labeled data is scarce, their performance is greatly limited. This study optimizes data mining algorithms by introducing semi-supervised learning methods, aiming to improve the algorithm's ability to utilize unlabeled data, thereby achieving more accurate data analysis and pattern recognition under limited labeled data conditions. Specifically, we adopt a self-training method and combine it with a convolutional neural network (CNN) for image feature extraction and classification, and continuously improve the model prediction performance through an iterative process. The experimental results demonstrate that the proposed method significantly outperforms traditional machine learning techniques such as Support Vector Machine (SVM), XGBoost, and Multi-Layer Perceptron (MLP) on the CIFAR-10 image classification dataset. Notable improvements were observed in key performance metrics, including accuracy, recall, and F1 score. Furthermore, the robustness and noise-resistance capabilities of the semi-supervised CNN model were validated through experiments under varying noise levels, confirming its practical applicability in real-world scenarios.
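The self-training loop mentioned in this abstract can be illustrated with a small sketch. To keep it dependency-free, a nearest-centroid classifier on 2-D points stands in for the paper's CNN, and confidence is measured as the gap between the two nearest centroid distances; both substitutions, the threshold, and the toy data are assumptions for illustration only.

```python
import math

def centroids(points, labels):
    """Mean point of each class."""
    sums, counts = {}, {}
    for (x, y), label in zip(points, labels):
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {c: (sx / counts[c], sy / counts[c]) for c, (sx, sy) in sums.items()}

def predict_with_margin(point, cents):
    """Return (label, margin); margin = second-nearest minus nearest centroid distance."""
    dists = sorted((math.dist(point, c), label) for label, c in cents.items())
    (d1, label), (d2, _) = dists[0], dists[1]
    return label, d2 - d1

def self_train(labeled, labels, unlabeled, margin_threshold=1.0, max_rounds=10):
    """Repeatedly pseudo-label confident points and refit; needs at least two classes."""
    labeled, labels, pool = list(labeled), list(labels), list(unlabeled)
    for _ in range(max_rounds):
        cents = centroids(labeled, labels)
        confident, remaining = [], []
        for p in pool:
            label, margin = predict_with_margin(p, cents)
            (confident if margin >= margin_threshold else remaining).append((p, label))
        if not confident:
            break  # nothing new to learn from
        for p, label in confident:
            labeled.append(p)
            labels.append(label)
        pool = [p for p, _ in remaining]
    return centroids(labeled, labels), pool

# Hypothetical toy data: one labeled point per class, five unlabeled points.
final_centroids, still_unlabeled = self_train(
    labeled=[(0.0, 0.0), (4.0, 4.0)],
    labels=[0, 1],
    unlabeled=[(0.5, 0.2), (3.8, 4.1), (1.0, 0.8), (3.0, 3.2), (2.0, 2.0)],
)
```

In this toy run the four points near a class centre are pseudo-labeled in the first round, while the ambiguous midpoint never clears the confidence threshold and stays unlabeled, which is the usual safeguard against reinforcing wrong pseudo-labels.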

5. 【2411.18620】Cross-modal Information Flow in Multimodal Large Language Models

Link: https://arxiv.org/abs/2411.18620

Authors: Zhi Zhang, Srishti Yadav, Fengze Han, Ekaterina Shutova

Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrated promising progress, large language models, auto-regressive multimodal large, large language, multimodal large language

Abstract:The recent advancements in auto-regressive multimodal large language models (MLLMs) have demonstrated promising progress for vision-language tasks. While there exists a variety of studies investigating the processing of linguistic information within large language models, little is currently known about the inner working mechanism of MLLMs and how linguistic and visual information interact within these models. In this study, we aim to fill this gap by examining the information flow between different modalities -- language and vision -- in MLLMs, focusing on visual question answering. Specifically, given an image-question pair as input, we investigate where in the model and how the visual and linguistic information are combined to generate the final prediction. Conducting experiments with a series of models from the LLaVA series, we find that there are two distinct stages in the process of integration of the two modalities. In the lower layers, the model first transfers the more general visual features of the whole image into the representations of (linguistic) question tokens. In the middle layers, it once again transfers visual information about specific objects relevant to the question to the respective token positions of the question. Finally, in the higher layers, the resulting multimodal representation is propagated to the last position of the input sequence for the final prediction. Overall, our findings provide a new and comprehensive perspective on the spatial and functional aspects of image and language processing in the MLLMs, thereby facilitating future research into multimodal information localization and editing.

6. 【2411.18616】Diffusion Self-Distillation for Zero-Shot Customized Image Generation

Link: https://arxiv.org/abs/2411.18616

Authors: Shengqu Cai, Eric Chan, Yunzhi Zhang, Leonidas Guibas, Jiajun Wu, Gordon Wetzstein

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)

Keywords: desire fine-grained control, produce impressive results, models produce impressive, diffusion models produce, fine-grained control

Comments: Project page: [this https URL](https://primecai.github.io/dsd/)

Abstract:Text-to-image diffusion models produce impressive results but are frustrating tools for artists who desire fine-grained control. For example, a common use case is to create images of a specific instance in novel contexts, i.e., "identity-preserving generation". This setting, along with many other tasks (e.g., relighting), is a natural fit for image+text-conditional generative models. However, there is insufficient high-quality paired data to train such a model directly. We propose Diffusion Self-Distillation, a method for using a pre-trained text-to-image model to generate its own dataset for text-conditioned image-to-image tasks. We first leverage a text-to-image diffusion model's in-context generation ability to create grids of images and curate a large paired dataset with the help of a Visual-Language Model. We then fine-tune the text-to-image model into a text+image-to-image model using the curated paired dataset. We demonstrate that Diffusion Self-Distillation outperforms existing zero-shot methods and is competitive with per-instance tuning techniques on a wide range of identity-preservation generation tasks, without requiring test-time optimization.

7. 【2411.18615】Proactive Gradient Conflict Mitigation in Multi-Task Learning: A Sparse Training Perspective

Link: https://arxiv.org/abs/2411.18615

Authors: Zhi Zhang, Jiayi Shen, Congfeng Cao, Gaole Dai, Shiji Zhou, Qizhe Zhang, Shanghang Zhang, Ekaterina Shutova

Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: generalist agents necessitates, Advancing towards generalist, multiple downstream tasks, multiple downstream, generalist agents

Abstract:Advancing towards generalist agents necessitates the concurrent processing of multiple tasks using a unified model, thereby underscoring the growing significance of simultaneous model training on multiple downstream tasks. A common issue in multi-task learning is the occurrence of gradient conflict, which leads to potential competition among different tasks during joint training. This competition often results in improvements in one task at the expense of deterioration in another. Although several optimization methods have been developed to address this issue by manipulating task gradients for better task balancing, they cannot decrease the incidence of gradient conflict. In this paper, we systematically investigate the occurrence of gradient conflict across different methods and propose a strategy to reduce such conflicts through sparse training (ST), wherein only a portion of the model's parameters are updated during training while keeping the rest unchanged. Our extensive experiments demonstrate that ST effectively mitigates conflicting gradients and leads to superior performance. Furthermore, ST can be easily integrated with gradient manipulation techniques, thus enhancing their effectiveness.
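Two ideas from this abstract lend themselves to a short illustration: gradient conflict, commonly detected as a negative cosine similarity between two task gradients, and sparse training, where only a masked subset of parameters is updated. The sketch below assumes that definition of conflict and uses a hand-picked mask and made-up gradients; the abstract does not say how the paper selects which parameters to train, so the example shows the mechanics and is not evidence for the paper's claim.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity; a negative value is treated here as a gradient conflict."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def restrict(grad, mask):
    """Keep only the gradient entries of parameters selected for training."""
    return [g for g, keep in zip(grad, mask) if keep]

def sparse_step(params, task_grads, mask, lr=0.1):
    """One SGD step on the summed task gradients; masked-out parameters stay frozen."""
    summed = [sum(gs) for gs in zip(*task_grads)]
    return [p - lr * g if keep else p for p, g, keep in zip(params, summed, mask)]

# Hypothetical gradients of two tasks with respect to four shared parameters.
grad_a = [1.0, -2.0, 0.5, 0.0]
grad_b = [0.5, 2.0, 1.0, -1.0]
mask = [True, False, True, True]  # hand-picked: freeze the second parameter

full_cos = cosine(grad_a, grad_b)                                    # conflict on all parameters
sparse_cos = cosine(restrict(grad_a, mask), restrict(grad_b, mask))  # trained subset only
new_params = sparse_step([1.0, 1.0, 1.0, 1.0], [grad_a, grad_b], mask)
```

Here the two tasks pull the second parameter in opposite directions, so the full gradients conflict; once that parameter is frozen, the remaining gradient components agree in sign of their inner product.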

8. 【2411.18613】CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models

Link: https://arxiv.org/abs/2411.18613

Authors: Rundi Wu, Ruiqi Gao, Ben Poole, Alex Trevithick, Changxi Zheng, Jonathan T. Barron, Aleksander Holynski

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: method for creating, monocular video, multi-view video, multi-view video diffusion, view synthesis

Comments: Project page: [this https URL](https://cat-4d.github.io/)

Abstract: We present CAT4D, a method for creating 4D (dynamic 3D) scenes from monocular video. CAT4D leverages a multi-view video diffusion model trained on a diverse combination of datasets to enable novel view synthesis at any specified camera poses and timestamps. Combined with a novel sampling approach, this model can transform a single monocular video into a multi-view video, enabling robust 4D reconstruction via optimization of a deformable 3D Gaussian representation. We demonstrate competitive performance on novel view synthesis and dynamic scene reconstruction benchmarks, and highlight the creative capabilities for 4D scene generation from real or generated videos. See our project page for results and interactive demos: https://cat-4d.github.io/.

9. 【2411.18597】Structured light with a million light planes per second

Link: https://arxiv.org/abs/2411.18597

Authors: Dhawal Sirikonda, Praneeth Chakravarthula, Ioannis Gkioulekas, Adithya Pediredla

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: structured light system, light scanning device, light, system that captures, thousand frames

Abstract:We introduce a structured light system that captures full-frame depth at rates of a thousand frames per second, four times faster than the previous state of the art. Our key innovation to this end is the design of an acousto-optic light scanning device that can scan light planes at rates up to two million planes per second. We combine this device with an event camera for structured light, using the sparse events triggered on the camera as we sweep a light plane on the scene for depth triangulation. In contrast to prior work, where light scanning is the bottleneck towards faster structured light operation, our light scanning device is three orders of magnitude faster than the event camera's full-frame bandwidth, thus allowing us to take full advantage of the event camera's fast operation. To surpass this bandwidth, we additionally demonstrate adaptive scanning of only regions of interest, at speeds an order of magnitude faster than the theoretical full-frame limit for event cameras.

10. 【2411.18594】Biomolecular Analysis of Soil Samples and Rock Imagery for Tracing Evidence of Life Using a Mobile Robot

Link: https://arxiv.org/abs/2411.18594

Authors: Shah Md Ahasan Siddique, Ragib Tahshin Rinath, Shakil Mosharrof, Syed Tanjib Mahmud, Sakib Ahmed

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: advanced robotic technologies, search for evidence, evidence of past, past life, requires the usage

Comments: Key words: Mars, Rover, Phoenix, Biosignatures, Biomolecular Analysis, Microscopy, Spectroscopy, Sampling, Astrobiology

Abstract:The search for evidence of past life on Mars presents a tremendous challenge that requires the usage of very advanced robotic technologies to overcome it. Current digital microscopic imagers and spectrometers used for astrobiological examination suffer from limitations such as insufficient resolution, narrow detection range, and lack of portability. To overcome these challenges, this research study presents modifications to the Phoenix rover to expand its capability for detecting biosignatures on Mars. This paper examines the modifications implemented on the Phoenix rover to enhance its capability to detect a broader spectrum of biosignatures. One of the notable improvements comprises the integration of advanced digital microscopic imagers and spectrometers, enabling high-resolution examination of soil samples. Additionally, the mechanical components of the device have been reinforced to enhance maneuverability and optimize subsurface sampling capabilities. Empirical investigations have demonstrated that Phoenix has the capability to navigate diverse geological environments and procure samples for the purpose of biomolecular analysis. The biomolecular instrumentation and hybrid analytical methods showcased in this study demonstrate considerable potential for future astrobiology missions on Mars. The potential for enhancing the system lies in the possibility of broadening the range of detectable biomarkers and biosignatures.

11. 【2411.18588】Hierarchical Information Flow for Generalized Efficient Image Restoration

Link: https://arxiv.org/abs/2411.18588

Authors: Yawei Li, Bin Ren, Jingyun Liang, Rakesh Ranjan, Mengyuan Liu, Nicu Sebe, Ming-Hsuan Yang, Luca Benini

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: transformers show promise, vision transformers show, numerous image restoration, vision transformers, promise in numerous

Abstract:While vision transformers show promise in numerous image restoration (IR) tasks, the challenge remains in efficiently generalizing and scaling up a model for multiple IR tasks. To strike a balance between efficiency and model capacity for a generalized transformer-based IR method, we propose a hierarchical information flow mechanism for image restoration, dubbed Hi-IR, which progressively propagates information among pixels in a bottom-up manner. Hi-IR constructs a hierarchical information tree representing the degraded image across three levels. Each level encapsulates different types of information, with higher levels encompassing broader objects and concepts and lower levels focusing on local details. Moreover, the hierarchical tree architecture removes long-range self-attention, improves the computational efficiency and memory utilization, thus preparing it for effective model scaling. Based on that, we explore model scaling to improve our method's capabilities, which is expected to positively impact IR in large-scale training settings. Extensive experimental results show that Hi-IR achieves state-of-the-art performance in seven common image restoration tasks, affirming its effectiveness and generalizability.

12. 【2411.18572】Exploring Depth Information for Detecting Manipulated Face Videos

Link: https://arxiv.org/abs/2411.18572

Authors: Haoyue Wang, Sheng Li, Ji He, Zhenxing Qian, Xinpeng Zhang, Shaolin Fan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: face depth map, Face manipulation detection, face depth, depth map, Face

Comments: 12 pages, 10 figures. arXiv admin note: substantial text overlap with [arXiv:2212.14230](https://arxiv.org/abs/2212.14230)

Abstract: Face manipulation detection has been receiving a lot of attention for the reliability and security of face images and videos. Recent studies focus on using auxiliary information or prior knowledge to capture robust manipulation traces, which have been shown to be promising. As one of the important face features, the face depth map, which has been shown to be effective in other areas such as face recognition or face detection, has unfortunately received little attention in the literature on face manipulation detection. In this paper, we explore the possibility of incorporating the face depth map as auxiliary information for robust face manipulation detection. To this end, we first propose a Face Depth Map Transformer (FDMT) to estimate the face depth map patch by patch from an RGB face image, which is able to capture the local depth anomaly created by manipulation. The estimated face depth map is then used as auxiliary information and integrated with the backbone features using a newly designed Multi-head Depth Attention (MDA) mechanism. We also propose an RGB-Depth Inconsistency Attention (RDIA) module to effectively capture the inter-frame inconsistency for multi-frame input. Various experiments demonstrate the advantage of our proposed method for face manipulation detection.

13. 【2411.18562】DexDiffuser: Interaction-aware Diffusion Planning for Adaptive Dexterous Manipulation

Link: https://arxiv.org/abs/2411.18562

Authors: Zhixuan Liang, Yao Mu, Yixiao Wang, Fei Ni, Tianxing Chen, Wenqi Shao, Wei Zhan, Masayoshi Tomizuka, Ping Luo, Mingyu Ding

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: advanced robotics, crucial for advanced, Dexterous manipulation, manipulation, adaptive dexterous manipulation

Comments: 27 pages. Project page: [this https URL](https://dexdiffuser.github.io/)

Abstract:Dexterous manipulation with contact-rich interactions is crucial for advanced robotics. While recent diffusion-based planning approaches show promise for simpler manipulation tasks, they often produce unrealistic ghost states (e.g., the object automatically moves without hand contact) or lack adaptability when handling complex sequential interactions. In this work, we introduce DexDiffuser, an interaction-aware diffusion planning framework for adaptive dexterous manipulation. DexDiffuser models joint state-action dynamics through a dual-phase diffusion process which consists of pre-interaction contact alignment and post-contact goal-directed control, enabling goal-adaptive generalizable dexterous manipulation. Additionally, we incorporate dynamics model-based dual guidance and leverage large language models for automated guidance function generation, enhancing generalizability for physical interactions and facilitating diverse goal adaptation through language cues. Experiments on physical interaction tasks such as door opening, pen and block re-orientation, and hammer striking demonstrate DexDiffuser's effectiveness on goals outside training distributions, achieving over twice the average success rate (59.2% vs. 29.5%) compared to existing methods. Our framework achieves 70.0% success on 30-degree door opening, 40.0% and 36.7% on pen and block half-side re-orientation respectively, and 46.7% on hammer nail half drive, highlighting its robustness and flexibility in contact-rich manipulation.

14. 【2411.18552】FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion

链接https://arxiv.org/abs/2411.18552

作者:Haosen Yang,Adrian Bulat,Isma Hadji,Hai X. Pham,Xiatian Zhu,Georgios Tzimiropoulos,Brais Martinez

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:generating high-quality images, high-quality images, proficient at generating, generating high-quality, Diffusion

备注

点击查看摘要

Abstract:Diffusion models are proficient at generating high-quality images. They are however effective only when operating at the resolution used during training. Inference at a scaled resolution leads to repetitive patterns and structural distortions. Retraining at higher resolutions quickly becomes prohibitive. Thus, methods enabling pre-existing diffusion models to operate at flexible test-time resolutions are highly desirable. Previous works suffer from frequent artifacts and often introduce large latency overheads. We propose two simple modules that combine to solve these issues. We introduce a Frequency Modulation (FM) module that leverages the Fourier domain to improve the global structure consistency, and an Attention Modulation (AM) module which improves the consistency of local texture patterns, a problem largely ignored in prior works. Our method, coined Fam diffusion, can seamlessly integrate into any latent diffusion model and requires no additional training. Extensive qualitative results highlight the effectiveness of our method in addressing structural and local artifacts, while quantitative results show state-of-the-art performance. Also, our method avoids redundant inference tricks for improved consistency such as patch-based or progressive generation, leading to negligible latency overheads.

15. 【2411.18548】PhyCAGE: Physically Plausible Compositional 3D Asset Generation from a Single Image

链接https://arxiv.org/abs/2411.18548

作者:Han Yan,Mingrui Zhang,Yang Li,Chao Ma,Pan Ji

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:present PhyCAGE, physically plausible compositional, Score Distillation Sampling, Gaussian Splatting representations, asset generation

备注: Project page: [this https URL](https://wolfball.github.io/phycage/)

点击查看摘要

Abstract:We present PhyCAGE, the first approach for physically plausible compositional 3D asset generation from a single image. Given an input image, we first generate consistent multi-view images for components of the assets. These images are then fitted with 3D Gaussian Splatting representations. To ensure that the Gaussians representing objects are physically compatible with each other, we introduce a Physical Simulation-Enhanced Score Distillation Sampling (PSE-SDS) technique to further optimize the positions of the Gaussians. It is achieved by setting the gradient of the SDS loss as the initial velocity of the physical simulation, allowing the simulator to act as a physics-guided optimizer that progressively corrects the Gaussians' positions to a physically compatible state. Experimental results demonstrate that the proposed method can generate physically plausible compositional 3D assets given a single image.

16. 【2411.18539】AdaVLN: Towards Visual Language Navigation in Continuous Indoor Environments with Moving Humans

链接https://arxiv.org/abs/2411.18539

作者:Dillon Loh,Tomasz Bednarz,Xinxing Xia,Frank Guan

类目:Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

关键词:Visual Language Navigation, natural language instructions, Adaptive Visual Language, Visual Language, realistic environments based

备注

点击查看摘要

Abstract:Visual Language Navigation is a task that challenges robots to navigate in realistic environments based on natural language instructions. While previous research has largely focused on static settings, real-world navigation must often contend with dynamic human obstacles. Hence, we propose an extension to the task, termed Adaptive Visual Language Navigation (AdaVLN), which seeks to narrow this gap. AdaVLN requires robots to navigate complex 3D indoor environments populated with dynamically moving human obstacles, adding a layer of complexity to navigation tasks that mimic the real world. To support exploration of this task, we also present the AdaVLN simulator and AdaR2R datasets. The AdaVLN simulator enables easy inclusion of fully animated human models directly into common datasets like Matterport3D. We also introduce a "freeze-time" mechanism for both the navigation task and simulator, which pauses world state updates during agent inference, enabling fair comparisons and experimental reproducibility across different hardware. We evaluate several baseline models on this task, analyze the unique challenges introduced by AdaVLN, and demonstrate its potential to bridge the sim-to-real gap in VLN research.
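
The "freeze-time" idea in this abstract can be illustrated with a toy simulation loop in which the simulated clock advances by a fixed step per agent action, regardless of how long inference takes on a given machine. This is only a sketch: the names `FrozenTimeWorld`, `run`, `slow_agent`, and `fast_agent`, the 0.1 s step, and the 1 m/s walking speed are all hypothetical and are not taken from the AdaVLN simulator.

```python
import time


class FrozenTimeWorld:
    """Toy world whose simulated clock ignores wall-clock inference time."""

    def __init__(self, dt=0.1):
        self.dt = dt          # simulated seconds per agent action (assumed)
        self.sim_time = 0.0   # simulated clock
        self.human_x = 0.0    # position of a moving human obstacle

    def observe(self):
        return {"t": self.sim_time, "human_x": self.human_x}

    def step(self, action):
        # World state is updated only here, after the agent has decided.
        self.sim_time += self.dt
        self.human_x += 1.0 * self.dt  # human walks at 1 m/s (assumed)


def run(world, agent, n_steps):
    for _ in range(n_steps):
        obs = world.observe()
        action = agent(obs)   # the world is paused while the agent "thinks"
        world.step(action)
    return world.observe()


def slow_agent(obs):
    time.sleep(0.01)          # stands in for slow model inference
    return "forward"


def fast_agent(obs):
    return "forward"


slow_result = run(FrozenTimeWorld(), slow_agent, 5)
fast_result = run(FrozenTimeWorld(), fast_agent, 5)
```

Because the world only changes inside `step`, the slow and fast agents end in identical world states, which is the property that makes comparisons across hardware fair.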

17. 【2411.18533】Utilizing the Mean Teacher with Supcontrast Loss for Wafer Pattern Recognition

链接https://arxiv.org/abs/2411.18533

作者:Qiyu Wei,Xun Xu,Zeng Zeng,Xulei Yang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:helping engineers identify, map pattern recognition, wafer map pattern, wafer maps play, play a crucial

备注: 5 pages, 1 figure

点击查看摘要

Abstract:The patterns on wafer maps play a crucial role in helping engineers identify the causes of production issues during semiconductor manufacturing. In order to reduce costs and improve accuracy, automation technology is essential, and recent developments in deep learning have led to impressive results in wafer map pattern recognition. In this context, inspired by the effectiveness of semi-supervised learning and contrastive learning methods, we introduce an innovative approach that integrates the Mean Teacher framework with the supervised contrastive learning loss for enhanced wafer map pattern recognition. Our methodology not only addresses the nuances of wafer patterns but also tackles challenges arising from limited labeled data. To further refine the process, we address data imbalance in the wafer dataset by employing SMOTE and under-sampling techniques. We conduct a comprehensive analysis of our proposed method and demonstrate its effectiveness through experiments using real-world dataset WM811K obtained from semiconductor manufacturers. Compared to the baseline method, our method has achieved 5.46%, 6.68%, 5.42%, and 4.53% improvements in Accuracy, Precision, Recall, and F1 score, respectively.
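
For readers unfamiliar with the Mean Teacher framework mentioned above, its core step is that the teacher's weights track an exponential moving average (EMA) of the student's weights. The sketch below shows only that update in plain Python. The function name `ema_update`, the flat dict of scalar weights, and the decay value are illustrative assumptions, not details from the paper.

```python
def ema_update(teacher, student, alpha=0.99):
    """Return teacher weights moved toward the student by one EMA step.

    teacher, student: dicts mapping parameter names to floats (hypothetical
    stand-ins for real tensors). alpha is the EMA decay (assumed value).
    """
    return {
        name: alpha * teacher[name] + (1.0 - alpha) * student[name]
        for name in teacher
    }


teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
teacher = ema_update(teacher, student, alpha=0.9)
```

A decay close to 1 makes the teacher change slowly, so its predictions on unlabeled data are smoother than the student's.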

18. 【2411.18513】Enhancing weed detection performance by means of GenAI-based image augmentation

链接https://arxiv.org/abs/2411.18513

作者:Sourav Modak,Anthony Stein

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Precise weed management, sustaining crop productivity, Precise weed, ecological balance, management is essential

备注

点击查看摘要

Abstract:Precise weed management is essential for sustaining crop productivity and ecological balance. Traditional herbicide applications face economic and environmental challenges, emphasizing the need for intelligent weed control systems powered by deep learning. These systems require vast amounts of high-quality training data. The reality of scarcity of well-annotated training data, however, is often addressed through generating more data using data augmentation. Nevertheless, conventional augmentation techniques such as random flipping, color changes, and blurring lack sufficient fidelity and diversity. This paper investigates a generative AI-based augmentation technique that uses the Stable Diffusion model to produce diverse synthetic images that improve the quantity and quality of training datasets for weed detection models. Moreover, this paper explores the impact of these synthetic images on the performance of real-time detection systems, thus focusing on compact CNN-based models such as YOLO nano for edge devices. The experimental results show substantial improvements in mean Average Precision (mAP50 and mAP50-95) scores for YOLO models trained with generative AI-augmented datasets, demonstrating the promising potential of synthetic data to enhance model robustness and accuracy.

19. 【2411.18499】GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation

链接https://arxiv.org/abs/2411.18499

作者:Pengfei Zhou,Xiaopeng Peng,Jiajun Song,Chuanhao Li,Zhaopan Xu,Yue Yang,Ziyao Guo,Hao Zhang,Yuqi Lin,Yefei He,Lirui Zhao,Shuo Liu,Tianhua Li,Yuxuan Xie,Xiaojun Chang,Yu Qiao,Wenqi Shao,Kaipeng Zhang

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Multimodal Large Language, Large Language Models, Large Language, made significant strides, Multimodal Large

备注: 53 pages, 19 figures

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding and generation tasks. However, generating interleaved image-text content remains a challenge, which requires integrated multimodal understanding and generation abilities. While the progress in unified models offers new solutions, existing benchmarks are insufficient for evaluating these methods due to data size and diversity limitations. To bridge this gap, we introduce GATE OpenING (OpenING), a comprehensive benchmark comprising 5,400 high-quality human-annotated instances across 56 real-world tasks. OpenING covers diverse daily scenarios such as travel guide, design, and brainstorming, offering a robust platform for challenging interleaved generation methods. In addition, we present IntJudge, a judge model for evaluating open-ended multimodal generation methods. Trained with a novel data pipeline, our IntJudge achieves an agreement rate of 82.42% with human judgments, outperforming GPT-based evaluators by 11.34%. Extensive experiments on OpenING reveal that current interleaved generation methods still have substantial room for improvement. Key findings on interleaved image-text generation are further presented to guide the development of next-generation models. The OpenING is open-sourced at this https URL.

20. 【2411.18476】A comparison of extended object tracking with multi-modal sensors in indoor environment

链接https://arxiv.org/abs/2411.18476

作者:Jiangtao Shuai,Martin Baerveldt,Manh Nguyen-Duc,Anh Le-Tuan,Manfred Hauswirth,Danh Le-Phuoc

类目:Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

关键词:cloud sensory sources, point cloud sensory, significant price differences, object tracking approach, sensory sources

备注

点击查看摘要

Abstract:This paper presents a preliminary study of an efficient object tracking approach, comparing the performance of two different 3D point cloud sensory sources: LiDAR and stereo cameras, which have significant price differences. In this preliminary work, we focus on single object tracking. We first developed a fast heuristic object detector that utilizes prior information about the environment and target. The resulting target points are subsequently fed into an extended object tracking framework, where the target shape is parameterized using a star-convex hypersurface model. Experimental results show that our object tracking method using a stereo camera achieves performance similar to that of a LiDAR sensor, with a cost difference of more than tenfold.

21. 【2411.18475】Weakly Supervised Framework Considering Multi-temporal Information for Large-scale Cropland Mapping with Satellite Imagery

链接https://arxiv.org/abs/2411.18475

作者:Yuze Wang,Aoran Hu,Ji Qi,Yang Liu,Chao Tao

类目:Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

关键词:agricultural production management, Accurately mapping large-scale, large-scale cropland mapping, Accurately mapping, cropland mapping

备注

点击查看摘要

Abstract:Accurately mapping large-scale cropland is crucial for agricultural production management and planning. Currently, the combination of remote sensing data and deep learning techniques has shown outstanding performance in cropland mapping. However, those approaches require massive precise labels, which are labor-intensive. To reduce the label cost, this study presented a weakly supervised framework considering multi-temporal information for large-scale cropland mapping. Specifically, we extract high-quality labels according to their consistency among global land cover (GLC) products to construct the supervised learning signal. On the one hand, to alleviate the overfitting problem caused by the model's over-trust of remaining errors in high-quality labels, we encode the similarity/aggregation of cropland in the visual/spatial domain to construct the unsupervised learning signal, and take it as the regularization term to constrain the supervised part. On the other hand, to sufficiently leverage the plentiful information in the samples without high-quality labels, we also incorporate the unsupervised learning signal in these samples, enriching the diversity of the feature space. After that, to capture the phenological features of croplands, we introduce dense satellite image time series (SITS) to extend the proposed framework in the temporal dimension. We also visualized the high dimensional phenological features to uncover how multi-temporal information benefits cropland extraction, and assessed the method's robustness under conditions of data scarcity. The proposed framework has been experimentally validated for strong adaptability across three study areas (Hunan Province, Southeast France, and Kansas) in large-scale cropland mapping, and the internal mechanism and temporal generalizability are also investigated.

22. 【2411.18473】HEMGS: A Hybrid Entropy Model for 3D Gaussian Splatting Data Compression

链接https://arxiv.org/abs/2411.18473

作者:Lei Liu,Zhenghao Chen,Dong Xu

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Gaussian Splatting, creates big challenges, Fast progress, Gaussians popular, modeling and image

备注

点击查看摘要

Abstract:Fast progress in 3D Gaussian Splatting (3DGS) has made 3D Gaussians popular for 3D modeling and image rendering, but this creates big challenges in data storage and transmission. To obtain a highly compact 3DGS representation, we propose a hybrid entropy model for Gaussian Splatting (HEMGS) data compression, which comprises two primary components, a hyperprior network and an autoregressive network. To effectively reduce structural redundancy across attributes, we apply a progressive coding algorithm to generate hyperprior features, in which we use previously compressed attributes and location as prior information. In particular, to better extract the location features from these compressed attributes, we adopt a domain-aware and instance-aware architecture to respectively capture domain-aware structural relations without additional storage costs and reveal scene-specific features through MLPs. Additionally, to reduce redundancy within each attribute, we leverage relationships between neighboring compressed elements within the attributes through an autoregressive network. Given its unique structure, we propose an adaptive context coding algorithm with flexible receptive fields to effectively capture adjacent compressed elements. Overall, we integrate our HEMGS into an end-to-end optimized 3DGS compression framework and the extensive experimental results on four benchmarks indicate that our method achieves about 40% average reduction in size while maintaining the rendering quality over our baseline method and achieving state-of-the-art compression results.

23. 【2411.18466】Complexity Experts are Task-Discriminative Learners for Any Image Restoration

链接https://arxiv.org/abs/2411.18466

作者:Eduard Zamfir,Zongwei Wu,Nancy Mehta,Yuedong Tan,Danda Pani Paudel,Yulun Zhang,Radu Timofte

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Recent advancements, image restoration models, image restoration, unified framework, Recent

备注

点击查看摘要

Abstract:Recent advancements in all-in-one image restoration models have revolutionized the ability to address diverse degradations through a unified framework. However, parameters tied to specific tasks often remain inactive for other tasks, making mixture-of-experts (MoE) architectures a natural extension. Despite this, MoEs often show inconsistent behavior, with some experts unexpectedly generalizing across tasks while others struggle within their intended scope. This hinders leveraging MoEs' computational benefits by bypassing irrelevant experts during inference. We attribute this undesired behavior to the uniform and rigid architecture of traditional MoEs. To address this, we introduce "complexity experts" -- flexible expert blocks with varying computational complexity and receptive fields. A key challenge is assigning tasks to each expert, as degradation complexity is unknown in advance. Thus, we execute tasks with a simple bias toward lower complexity. To our surprise, this preference effectively drives task-specific allocation, assigning tasks to experts with the appropriate complexity. Extensive experiments validate our approach, demonstrating the ability to bypass irrelevant experts during inference while maintaining superior performance. The proposed MoCE-IR model outperforms state-of-the-art methods, affirming its efficiency and practical applicability. The source will be publicly made available at this https URL.

24. 【2411.18415】Neural Image Unfolding: Flattening Sparse Anatomical Structures using Neural Fields

链接https://arxiv.org/abs/2411.18415

作者:Leonhard Rist,Pluvio Stephan,Noah Maul,Linda Vorberg,Hendrik Ditt,Michael Sühling,Andreas Maier,Bernhard Egger,Oliver Taubmann

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:imaging reveals internal, Tomographic imaging reveals, reveals internal structures, medical diagnoses, imaging reveals

备注

点击查看摘要

Abstract:Tomographic imaging reveals internal structures of 3D objects and is crucial for medical diagnoses. Visualizing the morphology and appearance of non-planar sparse anatomical structures that extend over multiple 2D slices in tomographic volumes is inherently difficult but valuable for decision-making and reporting. Hence, various organ-specific unfolding techniques exist to map their densely sampled 3D surfaces to a distortion-minimized 2D representation. However, there is no versatile framework to flatten complex sparse structures including vascular, duct or bone systems. We deploy a neural field to fit the transformation of the anatomy of interest to a 2D overview image. We further propose distortion regularization strategies and combine geometric with intensity-based loss formulations to also display non-annotated and auxiliary targets. In addition to improved versatility, our unfolding technique outperforms mesh-based baselines for sparse structures w.r.t. peak distortion and our regularization scheme yields smoother transformations compared to Jacobian formulations from neural field-based image registration.

25. 【2411.18412】Adaptive Blind All-in-One Image Restoration

链接https://arxiv.org/abs/2411.18412

作者:David Serrano-Lozano,Luis Herranz,Shaolin Su,Javier Vazquez-Corral

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:restoration models aim, aim to recover, recover a high-quality, input degraded, degraded with unknown

备注: 17 pages

点击查看摘要

Abstract:Blind all-in-one image restoration models aim to recover a high-quality image from an input degraded with unknown distortions. However, these models require all the possible degradation types to be defined during the training stage while showing limited generalization to unseen degradations, which limits their practical application in complex cases. In this paper, we propose a simple but effective adaptive blind all-in-one restoration (ABAIR) model, which can address multiple degradations, generalize well to unseen degradations, and efficiently incorporate new degradations by training a small fraction of parameters. First, we train our baseline model on a large dataset of natural images with multiple synthetic degradations, augmented with a segmentation head to estimate per-pixel degradation types, resulting in a powerful backbone able to generalize to a wide range of degradations. Second, we adapt our baseline model to varying image restoration tasks using independent low-rank adapters. Third, we learn to adaptively combine adapters to versatile images via a flexible and lightweight degradation estimator. Our model is both powerful in handling specific distortions and flexible in adapting to complex tasks. It not only outperforms the state-of-the-art by a large margin on five- and three-task IR setups, but also shows improved generalization to unseen degradations and composite distortions.
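
One common way to picture "independent low-rank adapters" that are "adaptively combined" is a frozen weight matrix plus a weighted sum of low-rank updates, W' = W + Σ_k p_k · B_k A_k. The sketch below illustrates that general idea in plain Python. It is an assumption about the form of the combination, not the ABAIR implementation: the function names, the adapter labels `denoise` and `derain`, the matrix values, and the mixing weights (which a degradation estimator would supply) are all made up.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def combine_adapters(base, adapters, weights):
    """Return base + sum_k weights[k] * (B_k @ A_k).

    base: d_out x d_in frozen weight matrix.
    adapters: list of (B, A) pairs, where B is d_out x r and A is r x d_in.
    weights: one mixing coefficient per adapter, e.g. produced by a
    degradation estimator (hypothetical here).
    """
    out = [row[:] for row in base]
    for (b, a), w in zip(adapters, weights):
        delta = matmul(b, a)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                out[i][j] += w * value
    return out


base = [[1.0, 0.0], [0.0, 1.0]]
denoise = ([[1.0], [0.0]], [[0.0, 2.0]])   # rank-1 adapter (made-up values)
derain = ([[0.0], [1.0]], [[4.0, 0.0]])    # rank-1 adapter (made-up values)
mixed = combine_adapters(base, [denoise, derain], weights=[0.5, 0.25])
```

Only the small `B` and `A` matrices would be trained for a new degradation, which is why adding one touches a small fraction of the parameters.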

26. 【2411.18409】Deep Fourier-embedded Network for Bi-modal Salient Object Detection

链接https://arxiv.org/abs/2411.18409

作者:Pengfei Lyu,Xiaosheng Yu,Chengdong Wu,Jagath C. Rajapakse

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:RGB and thermal, rapid development, significant improvement, Fourier, thermal images

备注: 13 pages, 13 figures. Submitted to TMM on April 29, 2024

点击查看摘要

Abstract:The rapid development of deep learning provides a significant improvement of salient object detection combining both RGB and thermal images. However, existing deep learning-based models suffer from two major shortcomings. First, the computation and memory demands of Transformer-based models with quadratic complexity are unbearable, especially in handling high-resolution bi-modal feature fusion. Second, even if learning converges to an ideal solution, there remains a frequency gap between the prediction and ground truth. Therefore, we propose a purely fast Fourier transform-based model, namely deep Fourier-embedded network (DFENet), for learning bi-modal information of RGB and thermal images. On one hand, fast Fourier transform efficiently fetches global dependencies with low complexity. Inspired by this, we design modal-coordinated perception attention to fuse the frequency gap between RGB and thermal modalities with multi-dimensional representation enhancement. To obtain reliable detailed information during decoding, we design the frequency-decomposed edge-aware module (FEM) to clarify object edges by deeply decomposing low-level features. Moreover, we equip proposed Fourier residual channel attention block in each decoder layer to prioritize high-frequency information while aligning channel global relationships. On the other hand, we propose co-focus frequency loss (CFL) to steer FEM towards minimizing the frequency gap. CFL dynamically weights hard frequencies during edge frequency reconstruction by cross-referencing the bi-modal edge information in the Fourier domain. This frequency-level refinement of edge features further contributes to the quality of the final pixel-level prediction. Extensive experiments on four bi-modal salient object detection benchmark datasets demonstrate our proposed DFENet outperforms twelve existing state-of-the-art models.

27. 【2411.18391】GeneQuery: A General QA-based Framework for Spatial Gene Expression Predictions from Histology Images

链接https://arxiv.org/abs/2411.18391

作者:Ying Xiong,Linjing Liu,Yufei Cui,Shangyu Wu,Xue Liu,Antoni B. Chan,Chun Jason Xue

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Gene expression, presents significant challenges, Gene expression profiling, Gene, molecular mechanisms

备注

点击查看摘要

Abstract:Gene expression profiling provides profound insights into molecular mechanisms, but its time-consuming and costly nature often presents significant challenges. In contrast, whole-slide hematoxylin and eosin (HE) stained histological images are readily accessible and allow for detailed examinations of tissue structure and composition at the microscopic level. Recent advancements have utilized these histological images to predict spatially resolved gene expression profiles. However, state-of-the-art works treat gene expression prediction as a multi-output regression problem, where each gene is learned independently with its own weights, failing to capture the shared dependencies and co-expression patterns between genes. Besides, existing works can only predict gene expression values for genes seen during training, limiting their ability to generalize to new, unseen genes. To address the above limitations, this paper presents GeneQuery, which aims to solve this gene expression prediction task in a question-answering (QA) manner for better generality and flexibility. Specifically, GeneQuery takes gene-related texts as queries and whole-slide images as contexts and then predicts the queried gene expression values. With such a transformation, GeneQuery can implicitly estimate the gene distribution by introducing the gene random variable. Besides, the proposed GeneQuery consists of two architecture implementations, i.e., spot-aware GeneQuery for capturing patterns between images and gene-aware GeneQuery for capturing patterns between genes. Comprehensive experiments on spatial transcriptomics datasets show that the proposed GeneQuery outperforms existing state-of-the-art methods on known and unseen genes. More results also demonstrate that GeneQuery can potentially analyze the tissue structure.

28. 【2411.18388】Convolutional Neural Networks Do Work with Pre-Defined Filters

链接https://arxiv.org/abs/2411.18388

作者:Christoph Linse,Erhardt Barth,Thomas Martinetz

类目:Computer Vision and Pattern Recognition (cs.CV)

关键词:Convolutional Neural Networks, Filter Convolutional Neural, Neural Networks called, Convolutional Neural, Neural Networks

备注

点击查看摘要

Abstract:We present a novel class of Convolutional Neural Networks called Pre-defined Filter Convolutional Neural Networks (PFCNNs), where all n×n convolution kernels with n>1 are pre-defined and constant during training. It involves a special form of depthwise convolution operation called a Pre-defined Filter Module (PFM). In the channel-wise convolution part, the 1×n×n kernels are drawn from a fixed pool of only a few (16) different pre-defined kernels. In the 1×1 convolution part, linear combinations of the pre-defined filter outputs are learned. Despite this harsh restriction, complex and discriminative features are learned. These findings provide a novel perspective on how information is processed within deep CNNs. We discuss various properties of PFCNNs and prove their effectiveness using the popular datasets Caltech101, CIFAR10, CUB-200-2011, FGVC-Aircraft, Flowers102, and Stanford Cars. Our implementation of PFCNNs is provided on GitHub this https URL
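
The two-part structure described in this abstract (fixed spatial filters, then a learned linear combination of their outputs) can be sketched in plain Python. This is a single-channel simplification under stated assumptions: the two Sobel kernels stand in for the paper's pool of 16 kernels, whose contents the abstract does not specify, and the names `conv2d_valid`, `pfm_forward`, and `mix_weights` are hypothetical.

```python
# Two fixed 3x3 kernels standing in for the paper's pool of 16 (assumed choice).
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
FIXED_POOL = [SOBEL_X, SOBEL_Y]


def conv2d_valid(image, kernel):
    """Plain 'valid' 2D cross-correlation of one channel with one kernel."""
    n = len(kernel)
    rows = len(image) - n + 1
    cols = len(image[0]) - n + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(n) for j in range(n))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def pfm_forward(image, mix_weights):
    """Fixed spatial filters, then a per-pixel linear combination.

    mix_weights plays the role of the 1x1 convolution; it is the only part
    that would be trained.
    """
    responses = [conv2d_valid(image, k) for k in FIXED_POOL]
    rows, cols = len(responses[0]), len(responses[0][0])
    return [
        [
            sum(w * resp[r][c] for w, resp in zip(mix_weights, responses))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# A 4x4 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1]] * 4
out = pfm_forward(image, mix_weights=[1.0, 0.0])
```

With all mixing weight on the horizontal-gradient kernel, every output pixel responds to the vertical edge; training would adjust only `mix_weights`, never the kernels.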

29. 【2411.18385】Federated Learning with Uncertainty and Personalization via Efficient Second-order Optimization

链接https://arxiv.org/abs/2411.18385

作者:Shivam Pal,Aishwarya Gupta,Saqib Sarwar,Piyush Rai

类目:Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

关键词:Federated Learning, Bayesian, Bayesian approach, decentralized and heterogeneous, Bayesian approach enables

备注

点击查看摘要

Abstract:Federated Learning (FL) has emerged as a promising method to collaboratively learn from decentralized and heterogeneous data available at different clients without the requirement of data ever leaving the clients. Recent works on FL have advocated taking a Bayesian approach to FL as it offers a principled way to account for the model and predictive uncertainty by learning a posterior distribution for the client and/or server models. Moreover, Bayesian FL also naturally enables personalization in FL to handle data heterogeneity across the different clients by having each client learn its own distinct personalized model. In particular, the hierarchical Bayesian approach enables all the clients to learn their personalized models while also taking into account the commonalities via a prior distribution provided by the server. However, despite their promise, Bayesian approaches for FL can be computationally expensive and can have high communication costs as well because of the requirement of computing and sending the posterior distributions. We present a novel Bayesian FL method using an efficient second-order optimization approach, with a computational cost that is similar to first-order optimization methods like Adam, but also provides the various benefits of the Bayesian approach for FL (e.g., uncertainty, personalization), while also being significantly more efficient and accurate than SOTA Bayesian FL methods (both for standard as well as personalized FL settings). Our method achieves improved predictive accuracies as well as better uncertainty estimates as compared to the baselines which include both optimization based as well as Bayesian FL methods.

30. 【2411.18377】XR-MBT: Multi-modal Full Body Tracking for XR through Self-Supervision with Learned Depth Point Cloud Registration

链接https://arxiv.org/abs/2411.18377

作者:Denys Rozumnyi,Nadine Bertsch,Othman Sbai,Filippo Arcadu,Yuhua Chen,Artsiom Sanakoyeu,Manoj Kumar,Catherine Herold,Robin Kips

类目:Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

关键词:authentic social presence, social presence, fundamental challenge, challenge to bring, bring a sense

备注: Accepted to WACV 2025

点击查看摘要

Abstract:Tracking the full body motions of users in XR (AR/VR) devices is a fundamental challenge to bring a sense of authentic social presence. Due to the absence of dedicated leg sensors, currently available body tracking methods adopt a synthesis approach to generate plausible motions given a 3-point signal from the head and controller tracking. In order to enable mixed reality features, modern XR devices are capable of estimating depth information of the headset surroundings using available sensors combined with dedicated machine learning models. Such egocentric depth sensing cannot drive the body directly, as it is not registered and is incomplete due to limited field-of-view and body self-occlusions. For the first time, we propose to leverage the available depth sensing signal combined with self-supervision to learn a multi-modal pose estimation model capable of tracking full body motions in real time on XR devices. We demonstrate how current 3-point motion synthesis models can be extended to point cloud modalities using a semantic point cloud encoder network combined with a residual network for multi-modal pose estimation. These modules are trained jointly in a self-supervised way, leveraging a combination of real unregistered point clouds and simulated data obtained from motion capture. We compare our approach against several state-of-the-art systems for XR body tracking and show that our method accurately tracks a diverse range of body motions. XR-MBT tracks legs in XR for the first time, whereas traditional synthesis approaches based on partial body tracking are blind.

31. 【2411.18375】Individual Content and Motion Dynamics Preserved Pruning for Video Diffusion Models

链接https://arxiv.org/abs/2411.18375

作者:Yiming Wu,Huan Wang,Zhenghao Chen,Dong Xu

类目:Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)

关键词:video diffusion model, Diffusion Model Compression, high computational cost, individual content, diffusion model

备注: 9 figures, 9 tables

点击查看摘要

Abstract:The high computational cost and slow inference time are major obstacles to deploying the video diffusion model (VDM) in practical applications. To overcome this, we introduce a new Video Diffusion Model Compression approach using individual content and motion dynamics preserved pruning and consistency loss. First, we empirically observe that deeper VDM layers are crucial for maintaining the quality of motion dynamics, e.g., coherence of the entire video, while shallower layers are more focused on individual content, e.g., individual frames. Therefore, we prune redundant blocks from the shallower layers while preserving more of the deeper layers, resulting in a lightweight VDM variant called VDMini. Additionally, we propose an Individual Content and Motion Dynamics (ICMD) Consistency Loss to gain comparable generation performance as larger VDM, i.e., the teacher to VDMini i.e., the student. Particularly, we first use the Individual Content Distillation (ICD) Loss to ensure consistency in the features of each generated frame between the teacher and student models. Next, we introduce a Multi-frame Content Adversarial (MCA) Loss to enhance the motion dynamics across the generated video as a whole. This method significantly accelerates inference time while maintaining high-quality video generation. Extensive experiments demonstrate the effectiveness of our VDMini on two important video generation tasks, Text-to-Video (T2V) and Image-to-Video (I2V), where we respectively achieve an average 2.5× and 1.4× speed up for the I2V method SF-V and the T2V method T2V-Turbo-v2, while maintaining the quality of the generated videos on two benchmarks, i.e., UCF101 and VBench.

32. 【2411.18369】G3Flow: Generative 3D Semantic Flow for Pose-aware and Generalizable Object Manipulation

Link: https://arxiv.org/abs/2411.18369

Authors: Tianxing Chen, Yao Mu, Zhixuan Liang, Zanxin Chen, Shijia Peng, Qiangyu Chen, Mingkun Xu, Ruizhen Hu, Hongyuan Zhang, Xuelong Li, Ping Luo

Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)

Keywords: Recent advances, shown promising results, advances in imitation, imitation learning, shown promising

Comments: Webpage: [this https URL](https://tianxingchen.github.io/G3Flow/)

Click to view abstract

Abstract:Recent advances in imitation learning for 3D robotic manipulation have shown promising results with diffusion-based policies. However, achieving human-level dexterity requires seamless integration of geometric precision and semantic understanding. We present G3Flow, a novel framework that constructs real-time semantic flow, a dynamic, object-centric 3D semantic representation by leveraging foundation models. Our approach uniquely combines 3D generative models for digital twin creation, vision foundation models for semantic feature extraction, and robust pose tracking for continuous semantic flow updates. This integration enables complete semantic understanding even under occlusions while eliminating manual annotation requirements. By incorporating semantic flow into diffusion policies, we demonstrate significant improvements in both terminal-constrained manipulation and cross-object generalization. Extensive experiments across five simulation tasks show that G3Flow consistently outperforms existing approaches, achieving up to 68.3% and 50.1% average success rates on terminal-constrained manipulation and cross-object generalization tasks respectively. Our results demonstrate the effectiveness of G3Flow in enhancing real-time dynamic semantic feature understanding for robotic manipulation policies.

33. 【2411.18363】ChatRex: Taming Multimodal LLM for Joint Perception and Understanding

Link: https://arxiv.org/abs/2411.18363

Authors: Qing Jiang, Gen Luo, Yuqin Yang, Yuda Xiong, Yihao Chen, Zhaoyang Zeng, Tianhe Ren, Lei Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: computer vision, pillars of computer, Perception, understanding, MLLM

Comments: 35 pages, 19 figures

Click to view abstract

Abstract:Perception and understanding are two pillars of computer vision. While multimodal large language models (MLLMs) have demonstrated remarkable visual understanding capabilities, they arguably lack accurate perception abilities, e.g., the state-of-the-art model Qwen2-VL only achieves a 43.9 recall rate on the COCO dataset, limiting many tasks that require the combination of perception and understanding. In this work, we aim to bridge this perception gap from both model design and data development perspectives. We first introduce ChatRex, an MLLM with a decoupled perception design. Instead of having the LLM directly predict box coordinates, we feed the output boxes from a universal proposal network into the LLM, allowing it to output the corresponding box indices to represent its detection results, turning the regression task into a retrieval-based task that the LLM handles more proficiently. From the data perspective, we build a fully automated data engine and construct the Rexverse-2M dataset, which possesses multiple granularities to support the joint training of perception and understanding. After standard two-stage training, ChatRex demonstrates strong perception capabilities while preserving multimodal understanding performance. The combination of these two capabilities simultaneously unlocks many attractive applications, demonstrating the complementary roles of both perception and understanding in MLLMs. Code is available at this https URL.

34. 【2411.18350】TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models

Link: https://arxiv.org/abs/2411.18350

Authors: Riza Velioglu, Petra Bevandic, Robin Chan, Barbara Hammer

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: introduces Virtual Try-Off, paper introduces Virtual, generating standardized garment, standardized garment images, clothed individuals

Comments:

Click to view abstract

Abstract:This paper introduces Virtual Try-Off (VTOFF), a novel task focused on generating standardized garment images from single photos of clothed individuals. Unlike traditional Virtual Try-On (VTON), which digitally dresses models, VTOFF aims to extract a canonical garment image, posing unique challenges in capturing garment shape, texture, and intricate patterns. This well-defined target makes VTOFF particularly effective for evaluating reconstruction fidelity in generative models. We present TryOffDiff, a model that adapts Stable Diffusion with SigLIP-based visual conditioning to ensure high fidelity and detail retention. Experiments on a modified VITON-HD dataset show that our approach outperforms baseline methods based on pose transfer and virtual try-on with fewer pre- and post-processing steps. Our analysis reveals that traditional image generation metrics inadequately assess reconstruction quality, prompting us to rely on DISTS for more accurate evaluation. Our results highlight the potential of VTOFF to enhance product imagery in e-commerce applications, advance generative model evaluation, and inspire future work on high-fidelity reconstruction. Demo, code, and models are available at: this https URL

35. 【2411.18335】Helvipad: A Real-World Dataset for Omnidirectional Stereo Depth Estimation

Link: https://arxiv.org/abs/2411.18335

Authors: Mehdi Zayene, Jannik Endres, Albias Havolli, Charles Corbière, Salim Cherkaoui, Alexandre Kontouli, Alexandre Alahi

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)

Keywords: imaging remains underexplored, remains underexplored, stereo depth estimation, considerable progress, depth estimation

Comments: Project page: [this https URL](https://vita-epfl.github.io/Helvipad)

Click to view abstract

Abstract:Despite considerable progress in stereo depth estimation, omnidirectional imaging remains underexplored, mainly due to the lack of appropriate data. We introduce Helvipad, a real-world dataset for omnidirectional stereo depth estimation, consisting of 40K frames from video sequences across diverse environments, including crowded indoor and outdoor scenes with diverse lighting conditions. Collected using two 360° cameras in a top-bottom setup and a LiDAR sensor, the dataset includes accurate depth and disparity labels by projecting 3D point clouds onto equirectangular images. Additionally, we provide an augmented training set with a significantly increased label density by using depth completion. We benchmark leading stereo depth estimation models for both standard and omnidirectional images. The results show that while recent stereo methods perform decently, a significant challenge persists in accurately estimating depth in omnidirectional imaging. To address this, we introduce necessary adaptations to stereo models, achieving improved performance.
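The label-generation step mentioned above, projecting 3D points onto an equirectangular image, can be illustrated with a minimal sketch. This is not the Helvipad pipeline: the function name `project_to_equirectangular`, the axis convention (x right, y down, z forward) and the choice of placing longitude 0 at the image centre are all assumptions made for illustration.

```python
import math


def project_to_equirectangular(point, width, height):
    """Map a 3D point to (u, v) pixel coordinates plus its range.

    Assumed convention (hypothetical, not taken from the paper):
    x points right, y points down, z points forward, and the forward
    direction lands at the centre of the image.
    """
    x, y, z = point
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        raise ValueError("a point at the camera centre has no direction")
    lon = math.atan2(x, z)   # longitude in [-pi, pi], 0 = straight ahead
    lat = math.asin(y / r)   # latitude in [-pi/2, pi/2], positive = down
    u = (lon / (2.0 * math.pi) + 0.5) * width
    v = (lat / math.pi + 0.5) * height
    return u, v, r


# A point one metre straight ahead lands at the image centre.
print(project_to_equirectangular((0.0, 0.0, 1.0), 1000, 500))
```

The returned range `r` is the value that would be written into a depth map at pixel `(u, v)`; a real pipeline would also need sensor calibration and occlusion handling, which this sketch leaves out.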

36. 【2411.18328】EventCrab: Harnessing Frame and Point Synergy for Event-based Action Recognition and Beyond

Link: https://arxiv.org/abs/2411.18328

Authors: Meiqi Cao, Xiangbo Shu, Jiachao Zhang, Rui Yan, Zechao Li, Jinhui Tang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Event-based Action Recognition, traditional action recognition, Action Recognition, Event-based Action, high-temporal resolution capturing

Comments:

Click to view abstract

Abstract:Event-based Action Recognition (EAR) possesses the advantages of high-temporal resolution capturing and privacy preservation compared with traditional action recognition. Current leading EAR solutions typically follow two regimes: project unconstructed event streams into dense constructed event frames and adopt powerful frame-specific networks, or employ lightweight point-specific networks to handle sparse unconstructed event points directly. However, both regimes are blind to a fundamental issue: they fail to accommodate the unique dense temporal and sparse spatial properties of asynchronous event data. In this article, we present a synergy-aware framework, i.e., EventCrab, that adeptly integrates the "lighter" frame-specific networks for dense event frames with the "heavier" point-specific networks for sparse event points, balancing accuracy and efficiency. Furthermore, we establish a joint frame-text-point representation space to bridge distinct event frames and points. Specifically, to better exploit the unique spatiotemporal relationships inherent in asynchronous event points, we devise two strategies for the "heavier" point-specific embedding: (i) a Spiking-like Context Learner (SCL) that extracts contextualized event points from raw event streams, and (ii) an Event Point Encoder (EPE) that further explores event-point long spatiotemporal features in a Hilbert-scan way. Experiments on four datasets demonstrate the significant performance of our proposed EventCrab, particularly gaining improvements of 5.17% on SeAct and 7.01% on HARDVS.

37. 【2411.18322】Mixture of Experts in Image Classification: What's the Sweet Spot?

Link: https://arxiv.org/abs/2411.18322

Authors: Mathurin Videau, Alessandro Leite, Marc Schoenauer, Olivier Teytaud

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: shown promising potential, shown promising, promising potential, potential for parameter-efficient, parameter-efficient scaling

Comments:

Click to view abstract

Abstract:Mixture-of-Experts (MoE) models have shown promising potential for parameter-efficient scaling across various domains. However, the implementation in computer vision remains limited, and often requires large-scale datasets comprising billions of samples. In this study, we investigate the integration of MoE within computer vision models and explore various MoE configurations on open datasets. When introducing MoE layers in image classification, the best results are obtained for models with a moderate number of activated parameters per sample. However, such improvements gradually vanish when the number of parameters per sample increases.

38. 【2411.18314】Real-time Video Target Tracking Algorithm Utilizing Convolutional Neural Networks (CNN)

Link: https://arxiv.org/abs/2411.18314

Authors: Chaoyi Tan, Xiangtian Li, Xiaobo Wang, Zhen Qi, Ao Xiang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: http URL continuously updates the target model through an online, http URL study successfully applies CNN to real-time video target, http URL is expected to provide new solutions for target tracking tasks in, This paper aims to research and implement a real-time video target tracking algorithm based on Convolutional Neural Networks, URL is expected to provide new solutions for target tracking tasks in video surveillance and intelligent transportation domains

Comments:

Click to view abstract

Abstract:This paper aims to research and implement a real-time video target tracking algorithm based on Convolutional Neural Networks (CNN), enhancing the accuracy and robustness of target tracking in complex this http URL algorithms in handling issues such as target occlusion, morphological changes, and background interference, our this http URL continuously updates the target model through an online learning mechanism to adapt to changes in the target's this http URL, when dealing with situations involving rapid motion, partial occlusion, and complex backgrounds, the proposed algorithm exhibits higher tracking success rates and lower failure rates this http URL study successfully applies CNN to real-time video target tracking, improving the accuracy and stability of the tracking algorithm while maintaining high processing speeds, thus this http URL is expected to provide new solutions for target tracking tasks in video surveillance and intelligent transportation domains.

39. 【2411.18311】Neural Surface Priors for Editable Gaussian Splatting

Link: https://arxiv.org/abs/2411.18311

Authors: Jakub Szymkowiak, Weronika Jakubowska, Dawid Malarz, Weronika Smolak-Dyżewska, Maciej Zięba, Przemysław Musialski, Wojtek Pałubicki, Przemysław Spurek

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: easily modifiable representations, recover easily modifiable, Signed Distance Field, computer graphics, easily modifiable

Comments: 9 pages, 7 figures

Click to view abstract

Abstract:In computer graphics, there is a need to recover easily modifiable representations of 3D geometry and appearance from image data. We introduce a novel method for this task using 3D Gaussian Splatting, which enables intuitive scene editing through mesh adjustments. Starting with input images and camera poses, we reconstruct the underlying geometry using a neural Signed Distance Field and extract a high-quality mesh. Our model then estimates a set of Gaussians, where each component is flat, and the opacity is conditioned on the recovered neural surface. To facilitate editing, we produce a proxy representation that encodes information about the Gaussians' shape and position. Unlike other methods, our pipeline allows modifications applied to the extracted mesh to be propagated to the proxy representation, from which we recover the updated parameters of the Gaussians. This effectively transfers the mesh edits back to the recovered appearance representation. By leveraging mesh-guided transformations, our approach simplifies 3D scene editing and offers improvements over existing methods in terms of usability and visual fidelity of edits. The complete source code for this project can be accessed at this https URL.

40. 【2411.18309】MvKeTR: Chest CT Report Generation with Multi-View Perception and Knowledge Enhancement

Link: https://arxiv.org/abs/2411.18309

Authors: Xiwei Deng, Xianchun He, Yudan Zhou, Shuhui Cai, Congbo Cai, Zhong Chen

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: relieving clinicians' workload, improving patient care, automatically generate diagnostic, perception Knowledge-enhanced Transformer, aims to automatically

Comments: 10 pages, 10 figures

Click to view abstract

Abstract:CT report generation (CTRG) aims to automatically generate diagnostic reports for 3D volumes, relieving clinicians' workload and improving patient care. Despite their clinical value, existing works fail to effectively incorporate diagnostic information from multiple anatomical views and lack the related clinical expertise essential for accurate and reliable diagnosis. To resolve these limitations, we propose a novel Multi-view perception Knowledge-enhanced Transformer (MvKeTR) to mimic the diagnostic workflow of clinicians. Just as radiologists first examine CT scans from multiple planes, a Multi-View Perception Aggregator (MVPA) with view-aware attention effectively synthesizes diagnostic information from multiple anatomical views. Then, inspired by how radiologists further refer to relevant clinical records to guide diagnostic decision-making, a Cross-Modal Knowledge Enhancer (CMKE) retrieves the most similar reports based on the query volume to incorporate domain knowledge into the diagnosis procedure. Furthermore, instead of traditional MLPs, we employ Kolmogorov-Arnold Networks (KANs) with learnable nonlinear activation functions as the fundamental building blocks of both modules to better capture intricate diagnostic patterns in CT interpretation. Extensive experiments on the public CTRG-Chest-548K dataset demonstrate that our method outpaces prior state-of-the-art models across all metrics.

41. 【2411.18303】InfiniDreamer: Arbitrarily Long Human Motion Generation via Segment Score Distillation

Link: https://arxiv.org/abs/2411.18303

Authors: Wenjie Zhuo, Fan Ma, Hehe Fan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: arbitrarily long human, human motion generation, motion, long human motion, present InfiniDreamer

Comments:

Click to view abstract

Abstract:We present InfiniDreamer, a novel framework for arbitrarily long human motion generation. InfiniDreamer addresses the limitations of current motion generation methods, which are typically restricted to short sequences due to the lack of long motion training data. To achieve this, we first generate sub-motions corresponding to each textual description and then assemble them into a coarse, extended sequence using randomly initialized transition segments. We then introduce an optimization-based method called Segment Score Distillation (SSD) to refine the entire long motion sequence. SSD is designed to utilize an existing motion prior, which is trained only on short clips, in a training-free manner. Specifically, SSD iteratively refines overlapping short segments sampled from the coarsely extended long motion sequence, progressively aligning them with the pre-trained motion diffusion prior. This process ensures local coherence within each segment, while the refined transitions between segments maintain global consistency across the entire sequence. Extensive qualitative and quantitative experiments validate the superiority of our framework, showcasing its ability to generate coherent, contextually aware motion sequences of arbitrary length.
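The idea of refining overlapping short segments of a long sequence can be made concrete with a small index-level sketch. It is only an illustration under stated assumptions: the abstract says segments are sampled, whereas this sketch enumerates a deterministic sliding window, and the function name `overlapping_segments` and its `seg_len` and `stride` parameters are hypothetical.

```python
def overlapping_segments(total_len, seg_len, stride):
    """Return (start, end) frame ranges that cover [0, total_len).

    Hypothetical helper: a deterministic sliding window standing in for
    the segment sampling described in the abstract. Consecutive windows
    overlap whenever stride < seg_len.
    """
    if seg_len <= 0 or stride <= 0:
        raise ValueError("seg_len and stride must be positive")
    if stride > seg_len:
        raise ValueError("stride larger than seg_len would leave gaps")
    if total_len <= seg_len:
        return [(0, total_len)]
    starts = list(range(0, total_len - seg_len + 1, stride))
    # Add a final window so the tail of the sequence is also covered.
    if starts[-1] + seg_len < total_len:
        starts.append(total_len - seg_len)
    return [(s, s + seg_len) for s in starts]


# Ten frames, windows of four frames, moving two frames at a time.
print(overlapping_segments(10, 4, 2))
```

Each window is short enough for a prior trained on short clips, and because neighbouring windows share frames, a refinement applied per window also touches the transitions between them.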

42. 【2411.18301】Enhancing MMDiT-Based Text-to-Image Models for Similar Subject Generation

Link: https://arxiv.org/abs/2411.18301

Authors: Tianyi Wei, Dongdong Chen, Yifan Zhou, Xingang Pan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Diffusion Transformer, latest Multimodal Diffusion, Diffusion Transformer, Multimodal Diffusion, Representing the cutting-edge

Comments:

Click to view abstract

Abstract:Representing the cutting-edge technique of text-to-image models, the latest Multimodal Diffusion Transformer (MMDiT) largely mitigates many generation issues existing in previous models. However, we discover that it still suffers from subject neglect or mixing when the input text prompt contains multiple subjects of similar semantics or appearance. We identify three possible ambiguities within the MMDiT architecture that cause this problem: Inter-block Ambiguity, Text Encoder Ambiguity, and Semantic Ambiguity. To address these issues, we propose to repair the ambiguous latent on-the-fly by test-time optimization at early denoising steps. In detail, we design three loss functions: Block Alignment Loss, Text Encoder Alignment Loss, and Overlap Loss, each tailored to mitigate these ambiguities. Despite significant improvements, we observe that semantic ambiguity persists when generating multiple similar subjects, as the guidance provided by overlap loss is not explicit enough. Therefore, we further propose Overlap Online Detection and Back-to-Start Sampling Strategy to alleviate the problem. Experimental results on a newly constructed challenging dataset of similar subjects validate the effectiveness of our approach, showing superior generation quality and much higher success rates over existing methods. Our code will be available at this https URL.

43. 【2411.18296】HUPE: Heuristic Underwater Perceptual Enhancement with Semantic Collaborative Learning

Link: https://arxiv.org/abs/2411.18296

Authors: Zengxi Zhang, Zhiying Jiang, Long Ma, Jinyuan Liu, Xin Fan, Risheng Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: refraction and absorption, reducing visibility, affected by light, light refraction, visibility and interfering

Comments: 22 pages, 21 figures

Click to view abstract

Abstract:Underwater images are often affected by light refraction and absorption, reducing visibility and interfering with subsequent applications. Existing underwater image enhancement methods primarily focus on improving visual quality while overlooking practical implications. To strike a balance between visual quality and application, we propose a heuristic invertible network for underwater perception enhancement, dubbed HUPE, which enhances visual quality and demonstrates flexibility in handling other downstream tasks. Specifically, we introduce an information-preserving reversible transformation with an embedded Fourier transform to establish a bidirectional mapping between underwater images and their clear counterparts. Additionally, a heuristic prior is incorporated into the enhancement process to better capture scene information. To further bridge the feature gap between vision-based enhancement images and application-oriented images, a semantic collaborative learning module is applied in the joint optimization process of the visual enhancement task and the downstream task, which guides the proposed enhancement model to extract more task-oriented semantic features while obtaining visually pleasing images. Extensive experiments, both quantitative and qualitative, demonstrate the superiority of our HUPE over state-of-the-art methods. The source code is available at this https URL.

44. 【2411.18293】HiFiVFS: High Fidelity Video Face Swapping

Link: https://arxiv.org/abs/2411.18293

Authors: Xu Chen, Keke He, Junwei Zhu, Yanhao Ge, Wei Li, Chengjie Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Face swapping, Face swapping aims, video face swapping, aims to generate, generate results

Comments:

Click to view abstract

Abstract:Face swapping aims to generate results that combine the identity from the source with attributes from the target. Existing methods primarily focus on image-based face swapping. When processing videos, each frame is handled independently, making it difficult to ensure temporal stability. From a model perspective, face swapping is gradually shifting from generative adversarial networks (GANs) to diffusion models (DMs), as DMs have been shown to possess stronger generative capabilities. Current diffusion-based approaches often employ inpainting techniques, which struggle to preserve fine-grained attributes like lighting and makeup. To address these challenges, we propose a high fidelity video face swapping (HiFiVFS) framework, which leverages the strong generative capability and temporal prior of Stable Video Diffusion (SVD). We build a fine-grained attribute module to extract identity-disentangled and fine-grained attribute features through identity desensitization and adversarial learning. Additionally, we introduce detailed identity injection to further enhance identity similarity. Extensive experiments demonstrate that our method achieves state-of-the-art (SOTA) results in video face swapping, both qualitatively and quantitatively.

45. 【2411.18289】Don't Let Your Robot be Harmful: Responsible Robotic Manipulation

Link: https://arxiv.org/abs/2411.18289

Authors: Minheng Ni, Lei Zhang, Zihan Chen, Lei Zhang, Wangmeng Zuo

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Unthinking execution, responsible robotic manipulation, robotic manipulation, severe safety risks, execution of human

Comments:

Click to view abstract

Abstract:Unthinking execution of human instructions in robotic manipulation can lead to severe safety risks, such as poisonings, fires, and even explosions. In this paper, we present responsible robotic manipulation, which requires robots to consider potential hazards in the real-world environment while completing instructions and performing complex operations safely and efficiently. However, such scenarios in the real world are variable and risky for training. To address this challenge, we propose Safety-as-policy, which includes (i) a world model to automatically generate scenarios containing safety risks and conduct virtual interactions, and (ii) a mental model to infer consequences with reflections and gradually develop the cognition of safety, allowing robots to accomplish tasks while avoiding dangers. Additionally, we create the SafeBox synthetic dataset, which includes one hundred responsible robotic manipulation tasks with different safety risk scenarios and instructions, effectively reducing the risks associated with real-world experiments. Experiments demonstrate that Safety-as-policy can avoid risks and efficiently complete tasks in both the synthetic dataset and real-world experiments, significantly outperforming baseline methods. Our SafeBox dataset shows consistent evaluation results with real-world scenarios, serving as a safe and effective benchmark for future research.

46. 【2411.18288】Optimizing Multispectral Object Detection: A Bag of Tricks and Comprehensive Benchmarks

Link: https://arxiv.org/abs/2411.18288

Authors: Chen Zhou, Peng Cheng, Junfeng Fang, Yifan Zhang, Yibo Yan, Xiaojun Jia, Yanyan Xu, Kun Wang, Xiaochun Cao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: RGB and TIR, Multispectral object detection, Multispectral object, thermal infrared, object detection

Comments:

Click to view abstract

Abstract:Multispectral object detection, utilizing RGB and TIR (thermal infrared) modalities, is widely recognized as a challenging task. It requires not only the effective extraction of features from both modalities and robust fusion strategies, but also the ability to address issues such as spectral discrepancies, spatial misalignment, and environmental dependencies between RGB and TIR images. These challenges significantly hinder the generalization of multispectral detection systems across diverse scenarios. Although numerous studies have attempted to overcome these limitations, it remains difficult to clearly distinguish the performance gains of multispectral detection systems from the impact of these "optimization techniques". Worse still, despite the rapid emergence of high-performing single-modality detection models, there is still a lack of specialized training techniques that can effectively adapt these models for multispectral detection tasks. The absence of a standardized benchmark with fair and consistent experimental setups also poses a significant barrier to evaluating the effectiveness of new approaches. To this end, we propose the first fair and reproducible benchmark specifically designed to evaluate the training "techniques", which systematically classifies existing multispectral object detection methods, investigates their sensitivity to hyper-parameters, and standardizes the core configurations. A comprehensive evaluation is conducted across multiple representative multispectral object detection datasets, utilizing various backbone networks and detection frameworks. Additionally, we introduce an efficient and easily deployable multispectral object detection framework that can seamlessly optimize high-performing single-modality models into dual-modality models, integrating our advanced training techniques.

47. 【2411.18281】MotionCharacter: Identity-Preserving and Motion Controllable Human Video Generation

Link: https://arxiv.org/abs/2411.18281

Authors: Haopeng Fang, Di Qiu, Binjie Mao, Pengfei Yan, He Tang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: integrating character-specific identities, Recent advancements, advancements in personalized, importance of integrating, integrating character-specific

Comments:

Click to view abstract

Abstract:Recent advancements in personalized Text-to-Video (T2V) generation highlight the importance of integrating character-specific identities and actions. However, previous T2V models struggle with identity consistency and controllable motion dynamics, mainly due to limited fine-grained facial and action-based textual prompts, and datasets that overlook key human attributes and actions. To address these challenges, we propose MotionCharacter, an efficient and high-fidelity human video generation framework designed for identity preservation and fine-grained motion control. We introduce an ID-preserving module to maintain identity fidelity while allowing flexible attribute modifications, and further integrate ID-consistency and region-aware loss mechanisms, significantly enhancing identity consistency and detail fidelity. Additionally, our approach incorporates a motion control module that prioritizes action-related text while maintaining subject consistency, along with a dataset, Human-Motion, which utilizes large language models to generate detailed motion descriptions. To simplify user control during inference, we parameterize motion intensity through a single coefficient, allowing for easy adjustments. Extensive experiments highlight the effectiveness of MotionCharacter, demonstrating significant improvements in ID-preserving, high-quality video generation.

48. 【2411.18275】Visual Adversarial Attack on Vision-Language Models for Autonomous Driving

Link: https://arxiv.org/abs/2411.18275

Authors: Tianyuan Zhang, Lu Wang, Xinwei Zhang, Yitong Zhang, Boyi Jia, Siyuan Liang, Shengshan Hu, Qiang Fu, Aishan Liu, Xianglong Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enhancing reasoning capabilities, significantly advanced autonomous, Vision-language models, advanced autonomous driving, reasoning capabilities

Comments:

Click to view abstract

Abstract:Vision-language models (VLMs) have significantly advanced autonomous driving (AD) by enhancing reasoning capabilities. However, these models remain highly vulnerable to adversarial attacks. While existing research has primarily focused on general VLM attacks, the development of attacks tailored to the safety-critical AD context has been largely overlooked. In this paper, we take the first step toward designing adversarial attacks specifically targeting VLMs in AD, exposing the substantial risks these attacks pose within this critical domain. We identify two unique challenges for effective adversarial attacks on AD VLMs: the variability of textual instructions and the time-series nature of visual scenarios. To this end, we propose ADvLM, the first visual adversarial attack framework specifically designed for VLMs in AD. Our framework introduces Semantic-Invariant Induction, which uses a large language model to create a diverse prompt library of textual instructions with consistent semantic content, guided by semantic entropy. Building on this, we introduce Scenario-Associated Enhancement, an approach where attention mechanisms select key frames and perspectives within driving scenarios to optimize adversarial perturbations that generalize across the entire scenario. Extensive experiments on several AD VLMs over multiple benchmarks show that ADvLM achieves state-of-the-art attack effectiveness. Moreover, real-world attack studies further validate its applicability and potential in practice.

49. 【2411.18270】Grid-augmented vision: A simple yet effective approach for enhanced spatial understanding in multi-modal agents

Link: https://arxiv.org/abs/2411.18270

Authors: Joongwon Chae, Zhenyu Wang, Peiwu Qin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: demonstrated impressive capabilities, Recent advances, scene understanding, advances in multimodal, demonstrated impressive

Comments: 10 pages, 2 figures

Click to view abstract

Abstract:Recent advances in multimodal models have demonstrated impressive capabilities in object recognition and scene understanding. However, these models often struggle with precise spatial localization - a critical capability for real-world applications. Inspired by how humans use grid-based references like chess boards and maps, we propose introducing explicit visual position encoding through a simple grid overlay approach. By adding a 9x9 black grid pattern onto input images, our method provides visual spatial guidance analogous to how positional encoding works in transformers, but in an explicit, visual form. Experiments on the COCO 2017 dataset demonstrate that our grid-based approach achieves significant improvements in localization accuracy, with a 107.4% increase in IoU (from 0.27 to 0.56) and a 194.4% improvement in GIoU (from 0.18 to 0.53) compared to baseline performance. Through attention visualization analysis, we show how this visual position encoding helps models better ground spatial relationships. Our method's simplicity and effectiveness make it particularly valuable for applications requiring accurate spatial reasoning, such as robotic manipulation, medical imaging, and autonomous navigation.
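The grid overlay described above is simple enough to sketch directly. The following is a minimal illustration, not the authors' code: it assumes a grayscale image stored as a list of rows, reads "9x9" as nine cells per side (so eight interior lines in each direction), and uses a hypothetical function name `overlay_grid`.

```python
def overlay_grid(image, cells=9, line_value=0):
    """Return a copy of a 2D grayscale image with grid lines drawn on it.

    Assumptions (not from the paper): the image is a list of equal-length
    rows of pixel values, `cells` is the number of grid cells per side,
    and lines are one pixel wide with value `line_value` (0 = black).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    out = [list(row) for row in image]  # copy, leaving the input untouched
    # Pixel rows and columns on which the interior grid lines fall.
    line_rows = {round(i * height / cells) for i in range(1, cells)}
    line_cols = {round(j * width / cells) for j in range(1, cells)}
    for y in range(height):
        for x in range(width):
            if y in line_rows or x in line_cols:
                out[y][x] = line_value
    return out


# An 18x18 white image gets a line on every second row and column.
white = [[255] * 18 for _ in range(18)]
gridded = overlay_grid(white)
print(gridded[1][1], gridded[2][1], gridded[1][2])
```

In practice the overlay would be drawn on an RGB image with an imaging library and thicker lines; the list-of-rows form is used here only to keep the sketch free of dependencies.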

50. 【2411.18267】Incomplete Multi-view Multi-label Classification via a Dual-level Contrastive Learning Framework

Link: https://arxiv.org/abs/2411.18267

Authors: Bingyan Nie, Wulin Xie, Jiang Long, Xiaohuan Lu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: comprehensive data analysis, multi-view multi-label classification, multi-label classification, analysis and exploration, significant domains

Comments:

Click to view abstract

Abstract:Recently, multi-view and multi-label classification have become significant domains for comprehensive data analysis and exploration. However, incompleteness both in views and labels is still a real-world scenario for multi-view multi-label classification. In this paper, we seek to focus on double missing multi-view multi-label classification tasks and propose our dual-level contrastive learning framework to solve this issue. Different from the existing works, which couple consistent information and view-specific information in the same feature space, we decouple the two heterogeneous properties into different spaces and employ contrastive learning theory to fully disentangle the two properties. Specifically, our method first introduces a two-channel decoupling module that contains a shared representation and a view-proprietary representation to effectively extract consistency and complementarity information across all views. Second, to efficiently filter out high-quality consistent information from multi-view representations, two consistency objectives based on contrastive learning are conducted on the high-level features and the semantic labels, respectively. Extensive experiments on several widely used benchmark datasets demonstrate that the proposed method has more stable and superior classification performance.

51. 【2411.18263】TSD-SR: One-Step Diffusion with Target Score Distillation for Real-World Image Super-Resolution

Link: https://arxiv.org/abs/2411.18263

Authors: Linwei Dong, Qingnan Fan, Yihong Guo, Zhonghao Wang, Qi Zhang, Jinwei Chen, Yawei Luo, Changqing Zou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: real-world image super-resolution, increasingly applied, diffusion models, image super-resolution, Target Score Distillation

Comments:

Click to view abstract

Abstract: Pre-trained text-to-image diffusion models are increasingly applied to the real-world image super-resolution (Real-ISR) task. Given the iterative refinement nature of diffusion models, most existing approaches are computationally expensive. While methods such as SinSR and OSEDiff have emerged to condense inference steps via distillation, their performance in image restoration or detail recovery is not satisfactory. To address this, we propose TSD-SR, a novel distillation framework specifically designed for real-world image super-resolution, aiming to construct an efficient and effective one-step model. We first introduce the Target Score Distillation, which leverages the priors of diffusion models and real image references to achieve more realistic image restoration. Secondly, we propose a Distribution-Aware Sampling Module to make detail-oriented gradients more readily accessible, addressing the challenge of recovering fine details. Extensive experiments demonstrate that our TSD-SR has superior restoration results (most of the metrics perform the best) and the fastest inference speed (e.g. 40 times faster than SeeSR) compared to the past Real-ISR approaches based on pre-trained diffusion priors.

52. 【2411.18229】SharpDepth: Sharpening Metric Depth Predictions Using Diffusion Distillation

Link: https://arxiv.org/abs/2411.18229

Authors: Duc-Hai Pham, Tung Do, Phong Nguyen, Binh-Son Hua, Khoi Nguyen, Rang Nguyen

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: sharpness typically achieved, fine-grained boundary sharpness, boundary sharpness typically, depth estimation methods, monocular metric depth

Comments: Uncompressed version can be found in [this https URL](https://drive.google.com/file/d/1MG4-d_xDERVBCRfLDolNLnMLLuqd7qRz)

Click to view abstract

Abstract: We propose SharpDepth, a novel approach to monocular metric depth estimation that combines the metric accuracy of discriminative depth estimation methods (e.g., Metric3D, UniDepth) with the fine-grained boundary sharpness typically achieved by generative methods (e.g., Marigold, Lotus). Traditional discriminative models trained on real-world data with sparse ground-truth depth can accurately predict metric depth but often produce over-smoothed or low-detail depth maps. Generative models, in contrast, are trained on synthetic data with dense ground truth, generating depth maps with sharp boundaries yet only providing relative depth with low accuracy. Our approach bridges these limitations by integrating metric accuracy with detailed boundary preservation, resulting in depth predictions that are both metrically precise and visually sharp. Our extensive zero-shot evaluations on standard depth estimation benchmarks confirm SharpDepth's effectiveness, showing its ability to achieve both high depth accuracy and detailed representation, making it well-suited for applications requiring high-quality depth perception across diverse, real-world environments.

53. 【2411.18225】PATHS: A Hierarchical Transformer for Efficient Whole Slide Image Analysis

Link: https://arxiv.org/abs/2411.18225

Authors: Zak Buzzard, Konstantin Hemker, Nikola Simidjievski, Mateja Jamnik

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: significant research progress, recent years, research progress, progress in recent, applications ranging

Comments:

Click to view abstract

Abstract:Computational analysis of whole slide images (WSIs) has seen significant research progress in recent years, with applications ranging across important diagnostic and prognostic tasks such as survival or cancer subtype prediction. Many state-of-the-art models process the entire slide - which may be as large as $150,000 \times 150,000$ pixels - as a bag of many patches, the size of which necessitates computationally cheap feature aggregation methods. However, a large proportion of these patches are uninformative, such as those containing only healthy or adipose tissue, adding significant noise and size to the bag. We propose Pathology Transformer with Hierarchical Selection (PATHS), a novel top-down method for hierarchical weakly supervised representation learning on slide-level tasks in computational pathology. PATHS is inspired by the cross-magnification manner in which a human pathologist examines a slide, recursively filtering patches at each magnification level to a small subset relevant to the diagnosis. Our method overcomes the complications of processing the entire slide, enabling quadratic self-attention and providing a simple interpretable measure of region importance. We apply PATHS to five datasets of The Cancer Genome Atlas (TCGA), and achieve superior performance on slide-level prediction tasks when compared to previous methods, despite processing only a small proportion of the slide.
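The top-down filtering idea described in this abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions: patches are identified by hypothetical `(level, x, y)` tuples, each patch splits into four children at the next magnification, and `score` stands in for the learned importance measure. None of these details come from the paper.

```python
def children(patch):
    """Split a patch into its four sub-patches at the next magnification level."""
    level, x, y = patch
    return [(level + 1, 2 * x + dx, 2 * y + dy)
            for dy in (0, 1) for dx in (0, 1)]


def select_hierarchically(root_patches, score, keep, levels):
    """Keep the `keep` highest-scoring patches per level and zoom into those only.

    `score` is a caller-supplied function mapping a patch to a number; it is
    a hypothetical placeholder for a learned relevance score.
    Returns the list of kept patches for each level, coarsest first.
    """
    current = list(root_patches)
    kept_per_level = []
    for _ in range(levels):
        ranked = sorted(current, key=score, reverse=True)
        kept = ranked[:keep]
        kept_per_level.append(kept)
        # Only the kept patches are expanded, so the number of patches
        # scored per level stays bounded by 4 * keep after the first level.
        current = [child for patch in kept for child in children(patch)]
    return kept_per_level
```

Because only the retained patches are expanded, the work per level stays bounded instead of growing with the full slide resolution, which is the property the abstract attributes to processing only a small proportion of the slide.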

54. 【2411.18224】KANs for Computer Vision: An Experimental Study

Link: https://arxiv.org/abs/2411.18224

Authors: Karthik Mohan, Hanxiao Wang, Xiatian Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Convolutional Neural Networks, Kolmogorov-Arnold Networks, Neural Networks, image classification, paper presents

Comments: 11 pages, 4 figures

Click to view abstract

Abstract: This paper presents an experimental study of Kolmogorov-Arnold Networks (KANs) applied to computer vision tasks, particularly image classification. KANs introduce learnable activation functions on edges, offering flexible non-linear transformations compared to the fixed activation functions of conventional networks such as Multi-Layer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs). While KANs have shown promise mostly in simplified or small-scale datasets, their effectiveness for more complex real-world tasks such as computer vision tasks remains less explored. To fill this gap, this experimental study aims to provide extended observations and insights into the strengths and limitations of KANs. We reveal that although KANs can perform well in specific vision tasks, they face significant challenges, including increased hyperparameter sensitivity and higher computational costs. These limitations suggest that KANs require architectural adaptations, such as integration with other architectures, to be practical for large-scale vision problems. This study focuses on empirical findings rather than proposing new methods, aiming to inform future research on optimizing KANs, in particular for computer vision and similar applications.

55. 【2411.18211】TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability

Link: https://arxiv.org/abs/2411.18211

Authors: Shimin Chen, Xiaohan Lan, Yitian Yuan, Zequn Jie, Lin Ma

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: large language models, multimodal large language, advanced multimodal large, large language, significantly advanced multimodal

Comments:

Click to view abstract

Abstract: Rapid development of large language models (LLMs) has significantly advanced multimodal large language models (LMMs), particularly in vision-language tasks. However, existing video-language models often overlook precise temporal localization and struggle with videos of varying lengths. We introduce TimeMarker, a versatile Video-LLM designed for high-quality dialogue based on video content, emphasizing temporal localization. TimeMarker integrates Temporal Separator Tokens to enhance temporal awareness, accurately marking specific moments within videos. It employs the AnyLength mechanism for dynamic frame sampling and adaptive token merging, enabling effective handling of both short and long videos. Additionally, TimeMarker utilizes diverse datasets, including further transformed temporal-related video QA datasets, to bolster its temporal understanding capabilities. Image and interleaved data are also employed to further enhance the model's semantic perception ability. Evaluations demonstrate that TimeMarker achieves state-of-the-art performance across multiple benchmarks, excelling in both short and long video categories. Our project page is at this https URL.

56. 【2411.18207】From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects

Link: https://arxiv.org/abs/2411.18207

Authors: Zizhao Li, Zhengkang Xiang, Joseph West, Kourosh Khoshelham

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Traditional object detection, Traditional object, closed-set assumption, fixed number, OVD models

Comments:

Click to view abstract

Abstract: Traditional object detection methods operate under the closed-set assumption, where models can only detect a fixed number of objects predefined in the training set. Recent works on open vocabulary object detection (OVD) enable the detection of objects defined by an unbounded vocabulary, which reduces the cost of training models for specific tasks. However, OVD heavily relies on accurate prompts provided by an "oracle", which limits its use in critical applications such as driving scene perception. OVD models tend to misclassify near-out-of-distribution (NOOD) objects that have similar semantics to known classes, and ignore far-out-of-distribution (FOOD) objects. To address these limitations, we propose a framework that enables OVD models to operate in open world settings, by identifying and incrementally learning novel objects. To detect FOOD objects, we propose Open World Embedding Learning (OWEL) and introduce the concept of Pseudo Unknown Embedding which infers the location of unknown classes in a continuous semantic space based on the information of known classes. We also propose Multi-Scale Contrastive Anchor Learning (MSCAL), which enables the identification of misclassified unknown objects by promoting the intra-class consistency of object embeddings at different scales. The proposed method achieves state-of-the-art performance in common open world object detection and autonomous driving benchmarks.

57. 【2411.18203】Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning

Link: https://arxiv.org/abs/2411.18203

Authors: Di Zhang, Jingdi Lei, Junxian Li, Xunzhi Wang, Yujie Liu, Zonglin Yang, Jiatong Li, Weida Wang, Suorong Yang, Jianbo Wu, Peng Ye, Wanli Ouyang, Dongzhan Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: shown remarkable advancements, Vision-language models, reasoning, critic, shown remarkable

Comments: 16 pages, 11 figures

Click to view abstract

Abstract: Vision-language models (VLMs) have shown remarkable advancements in multimodal reasoning tasks. However, they still often generate inaccurate or irrelevant responses due to issues like hallucinated image understandings or unrefined reasoning paths. To address these challenges, we introduce Critic-V, a novel framework inspired by the Actor-Critic paradigm to boost the reasoning capability of VLMs. This framework decouples the reasoning process and critic process by integrating two independent components: the Reasoner, which generates reasoning paths based on visual and textual inputs, and the Critic, which provides constructive critique to refine these paths. In this approach, the Reasoner generates reasoning responses according to text prompts, which can evolve iteratively as a policy based on feedback from the Critic. This interaction process was theoretically driven by a reinforcement learning framework where the Critic offers natural language critiques instead of scalar rewards, enabling more nuanced feedback to boost the Reasoner's capability on complex reasoning tasks. The Critic model is trained using Direct Preference Optimization (DPO), leveraging a preference dataset of critiques ranked by Rule-based Reward (RBR) to enhance its critic capabilities. Evaluation results show that the Critic-V framework significantly outperforms existing methods, including GPT-4V, on 5 out of 8 benchmarks, especially regarding reasoning accuracy and efficiency. Combining a dynamic text-based policy for the Reasoner and constructive feedback from the preference-optimized Critic enables a more reliable and context-sensitive multimodal reasoning process. Our approach provides a promising solution to enhance the reliability of VLMs, improving their performance in real-world reasoning-heavy multimodal applications such as autonomous driving and embodied intelligence.
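The Reasoner and Critic interaction described in this abstract might look roughly like the loop below. This is a minimal sketch, not the authors' implementation: `reasoner` and `critic` are hypothetical caller-supplied callables standing in for the two models, the image input is omitted, and the stopping rule (the critic returning `None`) is an assumption.

```python
def refine_with_critic(question, reasoner, critic, max_rounds=3):
    """Alternate between a reasoner and a critic until the critic has no objection.

    reasoner(question, critiques) -> answer string, conditioned on all
        critiques collected so far (the evolving text prompt).
    critic(question, answer) -> natural-language critique, or None when
        the answer is accepted.
    Both callables are placeholders for models and are assumptions here.
    """
    critiques = []
    answer = reasoner(question, critiques)
    for _ in range(max_rounds):
        critique = critic(question, answer)
        if critique is None:
            break  # the critic accepts the current answer
        critiques.append(critique)
        answer = reasoner(question, critiques)
    return answer, critiques
```

The feedback passed back to the reasoner is text rather than a scalar reward, which is the distinction the abstract draws from a conventional actor-critic setup.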

58. 【2411.18197】Make-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D Characters

Link: https://arxiv.org/abs/2411.18197

Authors: Zhiyang Guo, Jinxu Xiang, Kai Ma, Wengang Zhou, Houqiang Li, Ran Zhang

Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: modern creative industries, creative industries, extensive manual work, demands extensive manual, essential to modern

Comments: Project Page: [this https URL](https://jasongzy.github.io/Make-It-Animatable/)

Click to view abstract

Abstract:3D characters are essential to modern creative industries, but making them animatable often demands extensive manual work in tasks like rigging and skinning. Existing automatic rigging tools face several limitations, including the necessity for manual annotations, rigid skeleton topologies, and limited generalization across diverse shapes and poses. An alternative approach is to generate animatable avatars pre-bound to a rigged template mesh. However, this method often lacks flexibility and is typically limited to realistic human shapes. To address these issues, we present Make-It-Animatable, a novel data-driven method to make any 3D humanoid model ready for character animation in less than one second, regardless of its shapes and poses. Our unified framework generates high-quality blend weights, bones, and pose transformations. By incorporating a particle-based shape autoencoder, our approach supports various 3D representations, including meshes and 3D Gaussian splats. Additionally, we employ a coarse-to-fine representation and a structure-aware modeling strategy to ensure both accuracy and robustness, even for characters with non-standard skeleton structures. We conducted extensive experiments to validate our framework's effectiveness. Compared to existing methods, our approach demonstrates significant improvements in both quality and speed.

59. 【2411.18180】DistinctAD: Distinctive Audio Description Generation in Contexts

Link: https://arxiv.org/abs/2411.18180

Authors: Bo Fang, Wenhao Wu, Qiangqiang Wu, Yuxin Song, Antoni B. Chan

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Audio Descriptions, aim to provide, text form, scene establishment, provide a narration

Comments:

Click to view abstract

Abstract:Audio Descriptions (ADs) aim to provide a narration of a movie in text form, describing non-dialogue-related narratives, such as characters, actions, or scene establishment. Automatic generation of ADs remains challenging due to: i) the domain gap between movie-AD data and existing data used to train vision-language models, and ii) the issue of contextual redundancy arising from highly similar neighboring visual clips in a long movie. In this work, we propose DistinctAD, a novel two-stage framework for generating ADs that emphasize distinctiveness to produce better narratives. To address the domain gap, we introduce a CLIP-AD adaptation strategy that does not require additional AD corpora, enabling more effective alignment between movie and AD modalities at both global and fine-grained levels. In Stage-II, DistinctAD incorporates two key innovations: (i) a Contextual Expectation-Maximization Attention (EMA) module that reduces redundancy by extracting common bases from consecutive video clips, and (ii) an explicit distinctive word prediction loss that filters out repeated words in the context, ensuring the prediction of unique terms specific to the current AD. Comprehensive evaluations on MAD-Eval, CMD-AD, and TV-AD benchmarks demonstrate the superiority of DistinctAD, with the model consistently outperforming baselines, particularly in Recall@k/N, highlighting its effectiveness in producing high-quality, distinctive ADs.
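To make the notion of distinctive words concrete, the sketch below picks the words of the current description that do not occur in neighboring descriptions. It only illustrates how such target words could be selected; the loss itself is not specified in the abstract, and the function name, whitespace tokenization, and case-insensitive matching are all assumptions.

```python
def distinctive_words(current_ad, context_ads):
    """Return words of `current_ad` that appear in none of `context_ads`.

    Tokenization is plain whitespace splitting and matching ignores case;
    punctuation is not handled. Each distinctive word is returned once,
    in order of first appearance.
    """
    seen = {word.lower() for ad in context_ads for word in ad.split()}
    result = []
    for word in current_ad.split():
        key = word.lower()
        if key not in seen:
            seen.add(key)  # avoid returning the same word twice
            result.append(word)
    return result
```

For example, with the context `["She walks to the door", "The man waits"]` and the current description `"She opens the red door"`, the function returns `["opens", "red"]`.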

60. 【2411.18172】Enhancing Computer Vision with Knowledge: a Rummikub Case Study

Link: https://arxiv.org/abs/2411.18172

Authors: Simon Vandevelde, Laurent Mertens, Sverre Lauwers, Joost Vennekens

Categories: Computer Vision and Pattern Recognition (cs.CV); Logic in Computer Science (cs.LO)

Keywords: Artificial Neural Networks, Neural Networks excel, Artificial Neural, identifying individual components, Neural Networks

Comments: Submitted to ESANN2025

Click to view abstract

Abstract: Artificial Neural Networks excel at identifying individual components in an image. However, out-of-the-box, they do not manage to correctly integrate and interpret these components as a whole. One way to alleviate this weakness is to expand the network with explicit knowledge and a separate reasoning component. In this paper, we evaluate an approach to this end, applied to the solving of the popular board game Rummikub. We demonstrate that, for this particular example, the added background knowledge is as valuable as two-thirds of the data set, and allows the training time to be cut to half the original time.

61. 【2411.18169】PDZSeg: Adapting the Foundation Model for Dissection Zone Segmentation with Visual Prompts in Robot-assisted Endoscopic Submucosal Dissection

Link: https://arxiv.org/abs/2411.18169

Authors: Mengya Xu, Wenjin Mo, Guankun Wang, Huxin Gao, An Wang, Zhen Li, Xiaoxiao Yang, Hongliang Ren

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: dissection zone segmentation, Endoscopic surgical environments, surgical environments present, environments present challenges, dissection zone

Comments:

Click to view abstract

Abstract:Purpose: Endoscopic surgical environments present challenges for dissection zone segmentation due to unclear boundaries between tissue types, leading to segmentation errors where models misidentify or overlook edges. This study aims to provide precise dissection zone suggestions during endoscopic submucosal dissection (ESD) procedures, enhancing ESD safety. Methods: We propose the Prompted-based Dissection Zone Segmentation (PDZSeg) model, designed to leverage diverse visual prompts such as scribbles and bounding boxes. By overlaying these prompts onto images and fine-tuning a foundational model on a specialized dataset, our approach improves segmentation performance and user experience through flexible input methods. Results: The PDZSeg model was validated using three experimental setups: in-domain evaluation, variability in visual prompt availability, and robustness assessment. Using the ESD-DZSeg dataset, results show that our method outperforms state-of-the-art segmentation approaches. This is the first study to integrate visual prompt design into dissection zone segmentation. Conclusion: The PDZSeg model effectively utilizes visual prompts to enhance segmentation performance and user experience, supported by the novel ESD-DZSeg dataset as a benchmark for dissection zone segmentation in ESD. Our work establishes a foundation for future research.

Submission history: [v1] submitted by Wenjin Mo, Wed, 27 Nov 2024 09:28:50 UTC (827 KB)

62. 【2411.18165】KAN See Your Face

Link: https://arxiv.org/abs/2411.18165

Authors: Dong Han, Yong Li, Joachim Denzler

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: enhanced facial privacy, privacy-preserving face recognition, facial privacy protection, secure face recognition, face recognition

Comments: 16 pages, 8 figures

Click to view abstract

Abstract:With the advancement of face reconstruction (FR) systems, privacy-preserving face recognition (PPFR) has gained popularity for its secure face recognition, enhanced facial privacy protection, and robustness to various attacks. Besides, specific models and algorithms are proposed for face embedding protection by mapping embeddings to a secure space. However, there is a lack of studies on investigating and evaluating the possibility of extracting face images from embeddings of those systems, especially for PPFR. In this work, we introduce the first approach to exploit Kolmogorov-Arnold Network (KAN) for conducting embedding-to-face attacks against state-of-the-art (SOTA) FR and PPFR systems. Face embedding mapping (FEM) models are proposed to learn the distribution mapping relation between the embeddings from the initial domain and target domain. In comparison with Multi-Layer Perceptrons (MLP), we provide two variants, FEM-KAN and FEM-MLP, for efficient non-linear embedding-to-embedding mapping in order to reconstruct realistic face images from the corresponding face embedding. To verify our methods, we conduct extensive experiments with various PPFR and FR models. We also measure reconstructed face images with different metrics to evaluate the image quality. Through comprehensive experiments, we demonstrate the effectiveness of FEMs in accurate embedding mapping and face reconstruction.

63. 【2411.18164】RPEE-HEADS: A Novel Benchmark for Pedestrian Head Detection in Crowd Videos

Link: https://arxiv.org/abs/2411.18164

Authors: Mohamad Abubaker, Zubayda Alsadder, Hamed Abdelhaq, Maik Boltes, Ahmed Alia

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: platforms and event, railway platforms, management tasks, analysis and management, high-risk settings

Comments: 17 pages, 8 figures, 7 tables

Click to view abstract

Abstract:The automatic detection of pedestrian heads in crowded environments is essential for crowd analysis and management tasks, particularly in high-risk settings such as railway platforms and event entrances. These environments, characterized by dense crowds and dynamic movements, are underrepresented in public datasets, posing challenges for existing deep learning models. To address this gap, we introduce the Railway Platforms and Event Entrances-Heads (RPEE-Heads) dataset, a novel, diverse, high-resolution, and accurately annotated resource. It includes 109,913 annotated pedestrian heads across 1,886 images from 66 video recordings, with an average of 56.2 heads per image. Annotations include bounding boxes for visible head regions. In addition to introducing the RPEE-Heads dataset, this paper evaluates eight state-of-the-art object detection algorithms using the RPEE-Heads dataset and analyzes the impact of head size on detection accuracy. The experimental results show that You Only Look Once v9 and Real-Time Detection Transformer outperform the other algorithms, achieving mean average precisions of 90.7% and 90.8%, with inference times of 11 and 14 milliseconds, respectively. Moreover, the findings underscore the need for specialized datasets like RPEE-Heads for training and evaluating accurate models for head detection in railway platforms and event entrances. The dataset and pretrained models are available at this https URL.

64. 【2411.18159】Type-R: Automatically Retouching Typos for Text-to-Image Generation

Link: https://arxiv.org/abs/2411.18159

Authors: Wataru Shimoda, Naoto Inoue, Daichi Haraguchi, Hayato Mitani, Seichi Uchida, Kota Yamaguchi

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: reflect detailed instructions, face significant challenges, generate photorealistic images, accurately rendering words, detailed instructions

Comments:

Click to view abstract

Abstract:While recent text-to-image models can generate photorealistic images from text prompts that reflect detailed instructions, they still face significant challenges in accurately rendering words in the image. In this paper, we propose to retouch erroneous text renderings in the post-processing pipeline. Our approach, called Type-R, identifies typographical errors in the generated image, erases the erroneous text, regenerates text boxes for missing words, and finally corrects typos in the rendered words. Through extensive experiments, we show that Type-R, in combination with the latest text-to-image models such as Stable Diffusion or Flux, achieves the highest text rendering accuracy while maintaining image quality and also outperforms text-focused generation baselines in terms of balancing text accuracy and image quality.

65. 【2411.18147】Online Knowledge Integration for 3D Semantic Mapping: A Survey

Link: https://arxiv.org/abs/2411.18147

Authors: Felix Igelbrink, Marian Renz, Martin Günther, Piper Powell, Lennart Niecksch, Oscar Lima, Martin Atzmueller, Joachim Hertzberg

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Semantic mapping, structured environments, Semantic, key component, component of robots

Comments: Submitted to Robotics and Autonomous Systems

Click to view abstract

Abstract: Semantic mapping is a key component of robots operating in and interacting with objects in structured environments. Traditionally, geometric and knowledge representations within a semantic map have only been loosely integrated. However, recent advances in deep learning now allow full integration of prior knowledge, represented as knowledge graphs or language concepts, into sensor data processing and semantic mapping pipelines. Semantic scene graphs and language models enable modern semantic mapping approaches to incorporate graph-based prior knowledge or to leverage the rich information in human language both during and after the mapping process. This has sparked substantial advances in semantic mapping, leading to previously impossible novel applications. This survey reviews these recent developments comprehensively, with a focus on online integration of knowledge into semantic mapping. We specifically focus on methods using semantic scene graphs for integrating symbolic prior knowledge and language models for respective capture of implicit common-sense knowledge and natural language concepts.

66. 【2411.18145】COREval: A Comprehensive and Objective Benchmark for Evaluating the Remote Sensing Capabilities of Large Vision-Language Models

Link: https://arxiv.org/abs/2411.18145

Authors: Xiao An, Jiaxing Sun, Zihan Gui, Wei He

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Vision-Language Models, sensing Earth observation, remote sensing Earth, remote sensing capabilities, remote sensing

Comments: 20 pages, 12 figures

Click to view abstract

Abstract: With the rapid development of Large Vision-Language Models (VLMs), both general-domain models and those specifically tailored for remote sensing Earth observation have demonstrated exceptional perception and reasoning abilities within this specific field. However, the current absence of a comprehensive benchmark for holistically evaluating the remote sensing capabilities of these VLMs represents a significant gap. To bridge this gap, we propose COREval, the first benchmark designed to comprehensively and objectively evaluate the hierarchical remote sensing capabilities of VLMs. Concentrating on 2 primary capability dimensions essential to remote sensing, perception and reasoning, we further categorize 6 secondary dimensions and 22 leaf tasks to ensure a well-rounded assessment coverage for this specific field. COREval guarantees the quality of the total of 6,263 problems through a rigorous process of data collection from 50 globally distributed cities, question construction and quality control, and the format of multiple-choice questions with definitive answers allows for an objective and straightforward evaluation of VLM performance. We conducted a holistic evaluation of 13 prominent open-source VLMs from both the general and remote sensing domains, highlighting current shortcomings in their remote sensing capabilities and providing directions for improvements in their application within this specialized context. We hope that COREval will serve as a valuable resource and offer deeper insights into the challenges and potential of VLMs in the field of remote sensing.

67. 【2411.18142】Enhancing Visual Reasoning with Autonomous Imagination in Multimodal Large Language Models

Link: https://arxiv.org/abs/2411.18142

Authors: Jingming Liu, Yumeng Li, Boyuan Xiao, Yichang Jian, Ziang Qin, Tianjia Shao, Yao-Xiang Ding, Kun Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multimodal Large Language, Language Models, Multimodal Large, Large Language

Comments:

Click to view abstract

Abstract:There have been recent efforts to extend the Chain-of-Thought (CoT) paradigm to Multimodal Large Language Models (MLLMs) by finding visual clues in the input scene, advancing the visual reasoning ability of MLLMs. However, current approaches are specially designed for the tasks where clue finding plays a major role in the whole reasoning process, leading to the difficulty in handling complex visual scenes where clue finding does not actually simplify the whole reasoning task. To deal with this challenge, we propose a new visual reasoning paradigm enabling MLLMs to autonomously modify the input scene to new ones based on its reasoning status, such that CoT is reformulated as conducting simple closed-loop decision-making and reasoning steps under a sequence of imagined visual scenes, leading to natural and general CoT construction. To implement this paradigm, we introduce a novel plug-and-play imagination space, where MLLMs conduct visual modifications through operations like focus, ignore, and transform based on their native reasoning ability without specific training. We validate our approach through a benchmark spanning dense counting, simple jigsaw puzzle solving, and object placement, challenging the reasoning ability beyond clue finding. The results verify that while existing techniques fall short, our approach enables MLLMs to effectively reason step by step through autonomous imagination. Project page: this https URL.

68. 【2411.18135】ModeDreamer: Mode Guiding Score Distillation for Text-to-3D Generation using Reference Image Prompts

Link: https://arxiv.org/abs/2411.18135

Authors: Uy Dieu Tran, Minh Luu, Phong Ha Nguyen, Khoi Nguyen, Binh-Son Hua

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Existing Score Distillation, Score Distillation Sampling, driven significant progress, Distillation Sampling, Existing Score

Comments:

Click to view abstract

Abstract:Existing Score Distillation Sampling (SDS)-based methods have driven significant progress in text-to-3D generation. However, 3D models produced by SDS-based methods tend to exhibit over-smoothing and low-quality outputs. These issues arise from the mode-seeking behavior of current methods, where the scores used to update the model oscillate between multiple modes, resulting in unstable optimization and diminished output quality. To address this problem, we introduce a novel image prompt score distillation loss named ISD, which employs a reference image to direct text-to-3D optimization toward a specific mode. Our ISD loss can be implemented by using IP-Adapter, a lightweight adapter for integrating image prompt capability to a text-to-image diffusion model, as a mode-selection module. A variant of this adapter, when not being prompted by a reference image, can serve as an efficient control variate to reduce variance in score estimates, thereby enhancing both output quality and optimization stability. Our experiments demonstrate that the ISD loss consistently achieves visually coherent, high-quality outputs and improves optimization speed compared to prior text-to-3D methods, as demonstrated through both qualitative and quantitative evaluations on the T3Bench benchmark suite.

69. 【2411.18133】Towards Cross-device and Training-free Robotic Grasping in 3D Open World

Link: https://arxiv.org/abs/2411.18133

Authors: Weiguang Zhao, Chenru Jiang, Chengrui Zhang, Jie Sun, Yuyao Yan, Rui Zhang, Kaizhu Huang

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Robotic grasping, automation processes, open world, critical component, component of manufacturing

Abstract:Robotic grasping in the open world is a critical component of manufacturing and automation processes. While numerous existing approaches depend on 2D segmentation output to facilitate the grasping procedure, accurately determining depth from 2D imagery remains a challenge, often leading to limited performance in complex stacking scenarios. In contrast, techniques utilizing 3D point cloud data inherently capture depth information, thus enabling adeptly navigating and manipulating a diverse range of complex stacking scenes. However, such efforts are considerably hindered by the variance in data capture devices and the unstructured nature of the data, which limits their generalizability. Consequently, much research is narrowly concentrated on managing designated objects within specific settings, which confines their real-world applicability. This paper presents a novel pipeline capable of executing object grasping tasks in open-world scenarios even on previously unseen objects without the necessity for training. Additionally, our pipeline supports the flexible use of different 3D point cloud segmentation models across a variety of scenes. Leveraging the segmentation results, we propose to engage a training-free binary clustering algorithm that not only improves segmentation precision but also possesses the capability to cluster and localize unseen objects for executing grasping operations. In our experiments, we investigate a range of open-world scenarios, and the outcomes underscore the remarkable robustness and generalizability of our pipeline, consistent across various environments, robots, cameras, and objects. The code will be made available upon acceptance of the paper.

70. 【2411.18115】Spectral-Spatial Transformer with Active Transfer Learning for Hyperspectral Image Classification

Link: https://arxiv.org/abs/2411.18115

Authors: Muhammad Ahmad, Manuel Mazzara, Salvatore Distefano

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: challenging task due, high spectral dimensionality, hyperspectral images, challenging task, task due

Abstract:The classification of hyperspectral images (HSI) is a challenging task due to the high spectral dimensionality and limited labeled data typically available for training. In this study, we propose a novel multi-stage active transfer learning (ATL) framework that integrates a Spatial-Spectral Transformer (SST) with an active learning process for efficient HSI classification. Our approach leverages a pre-trained (initially trained) SST model, fine-tuned iteratively on newly acquired labeled samples using an uncertainty-diversity (Spatial-Spectral Neighborhood Diversity) querying mechanism. This mechanism identifies the most informative and diverse samples, thereby optimizing the transfer learning process to reduce both labeling costs and model uncertainty. We further introduce a dynamic freezing strategy, selectively freezing layers of the SST model to minimize computational overhead while maintaining adaptability to spectral variations in new data. One of the key innovations in our work is the self-calibration of spectral and spatial attention weights, achieved through uncertainty-guided active learning. This not only enhances the model's robustness in handling dynamic and disjoint spectral profiles but also improves generalization across multiple HSI datasets. Additionally, we present a diversity-promoting sampling strategy that ensures the selected samples span distinct spectral regions, preventing overfitting to particular spectral classes. Experiments on benchmark HSI datasets demonstrate that the SST-ATL framework significantly outperforms existing CNN and SST-based methods, offering superior accuracy, efficiency, and computational performance. The source code can be accessed at this https URL.
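The uncertainty-diversity querying idea in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names (`entropy`, `select_queries`), the use of predictive entropy as the uncertainty score, and the greedy minimum-distance rule standing in for diversity are all assumptions made for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy of a class-probability vector (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_queries(probs, features, budget, min_dist):
    """Greedily pick up to `budget` unlabeled samples to send for labeling.

    Candidates are visited from most to least uncertain; a candidate is
    skipped if it lies closer than `min_dist` (in feature space) to a sample
    that was already picked. Both rules are illustrative assumptions.
    """
    order = sorted(range(len(probs)), key=lambda i: entropy(probs[i]), reverse=True)
    picked = []
    for i in order:
        if len(picked) == budget:
            break
        if all(euclidean(features[i], features[j]) >= min_dist for j in picked):
            picked.append(i)
    return picked

# Toy pool: samples 0 and 1 are both uncertain but nearly identical,
# sample 2 is uncertain and far away, sample 3 is confidently classified.
probs = [[0.5, 0.5], [0.52, 0.48], [0.45, 0.55], [0.99, 0.01]]
features = [[0.0, 0.0], [0.01, 0.0], [1.0, 1.0], [5.0, 5.0]]
print(select_queries(probs, features, budget=2, min_dist=0.5))  # [0, 2]
```

In this toy pool the near-duplicate sample 1 is skipped in favour of sample 2, which is the behaviour a diversity term is meant to produce.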

71. 【2411.18111】When Large Vision-Language Models Meet Person Re-Identification

Link: https://arxiv.org/abs/2411.18111

Authors: Qizao Wang, Bin Li, Xiangyang Xue

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Large Vision-Language Models, Large Language, Language Models, incorporate visual models

Abstract:Large Vision-Language Models (LVLMs) that incorporate visual models and Large Language Models (LLMs) have achieved impressive results across various cross-modal understanding and reasoning tasks. In recent years, person re-identification (ReID) has also started to explore cross-modal semantics to improve the accuracy of identity recognition. However, effectively utilizing LVLMs for ReID remains an open challenge. While LVLMs operate under a generative paradigm by predicting the next output word, ReID requires the extraction of discriminative identity features to match pedestrians across cameras. In this paper, we propose LVLM-ReID, a novel framework that harnesses the strengths of LVLMs to promote ReID. Specifically, we employ instructions to guide the LVLM in generating one pedestrian semantic token that encapsulates key appearance semantics from the person image. This token is further refined through our Semantic-Guided Interaction (SGI) module, establishing a reciprocal interaction between the semantic token and visual tokens. Ultimately, the reinforced semantic token serves as the pedestrian identity representation. Our framework integrates the semantic understanding and generation capabilities of LVLMs into end-to-end ReID training, allowing LVLMs to capture rich semantic cues from pedestrian images during both training and inference. Our method achieves competitive results on multiple benchmarks without additional image-text annotations, demonstrating the potential of LVLM-generated semantics to advance person ReID and offering a promising direction for future research.

72. 【2411.18109】Training Data Synthesis with Difficulty Controlled Diffusion Model

Link: https://arxiv.org/abs/2411.18109

Authors: Zerun Wang, Jiafeng Mao, Xueting Wang, Toshihiko Yamasaki

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Semi-supervised learning, public image sources, SSL, low costs, performance by leveraging

Abstract:Semi-supervised learning (SSL) can improve model performance by leveraging unlabeled images, which can be collected from public image sources at low cost. In recent years, synthetic images have become increasingly common in public image sources due to rapid advances in generative models. Therefore, it is becoming inevitable to include existing synthetic images in the unlabeled data for SSL. How this kind of contamination will affect SSL remains unexplored. In this paper, we introduce a new task, Real-Synthetic Hybrid SSL (RS-SSL), to investigate the impact of unlabeled data contaminated by synthetic images for SSL. First, we set up a new RS-SSL benchmark to evaluate current SSL methods and find that they struggle to benefit from unlabeled synthetic images and are sometimes even negatively affected. To this end, we propose RSMatch, a novel SSL method specifically designed to handle the challenges of RS-SSL. RSMatch effectively identifies unlabeled synthetic data and further utilizes them for improvement. Extensive experimental results show that RSMatch can turn synthetic unlabeled data from "obstacles" into "resources." The effectiveness is further verified through ablation studies and visualization.

73. 【2411.18101】Aligning Knowledge Concepts to Whole Slide Images for Precise Histopathology Image Analysis

Link: https://arxiv.org/abs/2411.18101

Authors: Weiqin Zhao, Ziyu Guo, Yinshuang Fan, Yuming Jiang, Maximus Yeung, Lequan Yu

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Multiple Instance Learning, Slide Images, Instance Learning, Multiple Instance, fine-grained annotation

Abstract:Due to the large size and lack of fine-grained annotation, Whole Slide Images (WSIs) analysis is commonly approached as a Multiple Instance Learning (MIL) problem. However, previous studies only learn from training data, posing a stark contrast to how human clinicians teach each other and reason about histopathologic entities and factors. Here we present a novel knowledge concept-based MIL framework, named ConcepPath, to fill this gap. Specifically, ConcepPath utilizes GPT-4 to induce reliable disease-specific human expert concepts from medical literature, and incorporates them with a group of purely learnable concepts to extract complementary knowledge from training data. In ConcepPath, WSIs are aligned to these linguistic knowledge concepts by utilizing a pathology vision-language model as the basic building component. In the applications of lung cancer subtyping, breast cancer HER2 scoring, and gastric cancer immunotherapy-sensitive subtyping, ConcepPath significantly outperformed previous SOTA methods, which lack the guidance of human expert knowledge.

74. 【2411.18092】Training Noise Token Pruning

Link: https://arxiv.org/abs/2411.18092

Authors: Mingxing Rao, Bohan Jiang, Daniel Moyer

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: present Training Noise, Training Noise Token, vision transformers, present work, present Training

Comments: 25 pages, 8 figures

Abstract:In the present work we present Training Noise Token (TNT) Pruning for vision transformers. Our method relaxes the discrete token dropping condition to continuous additive noise, providing smooth optimization in training, while retaining discrete dropping computational gains in deployment settings. We provide theoretical connections to Rate-Distortion literature, and empirical evaluations on the ImageNet dataset using ViT and DeiT architectures demonstrating TNT's advantages over previous pruning methods.
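The contrast this abstract draws between continuous additive noise in training and discrete dropping at deployment can be sketched as follows. This is a minimal illustration, not the paper's method: the per-token keep scores, the linear noise schedule `noise_scale * (1 - score)`, and the top-k rule are assumptions, since the abstract does not specify them.

```python
import random

def noisy_tokens(tokens, keep_scores, noise_scale, rng):
    """Training-time relaxation: rather than dropping a token, add Gaussian
    noise whose standard deviation grows as its keep score shrinks.
    A score of 1.0 leaves the token untouched."""
    out = []
    for tok, s in zip(tokens, keep_scores):
        sigma = noise_scale * (1.0 - s)
        out.append([x + rng.gauss(0.0, sigma) for x in tok])
    return out

def drop_tokens(tokens, keep_scores, k):
    """Deployment-time pruning: keep only the k highest-scoring tokens,
    preserving their original order."""
    top = sorted(range(len(tokens)), key=lambda i: keep_scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]

tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
scores = [1.0, 0.2, 0.9]  # hypothetical keep scores in [0, 1]
rng = random.Random(0)

train_view = noisy_tokens(tokens, scores, noise_scale=0.5, rng=rng)
deploy_view = drop_tokens(tokens, scores, k=2)
print(train_view[0])  # [1.0, 2.0]: score 1.0 means no noise is added
print(deploy_view)    # [[1.0, 2.0], [5.0, 6.0]]
```

The training view keeps every token, so the sequence length and gradients stay well defined, while the deployment view shortens the sequence and thereby saves computation.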

75. 【2411.18082】Dual-view X-ray Detection: Can AI Detect Prohibited Items from Dual-view X-ray Images like Humans?

Link: https://arxiv.org/abs/2411.18082

Authors: Renshuai Tao, Haoyu Wang, Yuzhe Guo, Hairong Chen, Li Zhang, Xianglong Liu, Yunchao Wei, Yao Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: inspectors typically rely, detect prohibited items, human inspectors typically, dual-view X-ray images, vertical and side

Comments: 10 pages, 6 figures

Abstract:To detect prohibited items in challenging categories, human inspectors typically rely on images from two distinct views (vertical and side). Can AI detect prohibited items from dual-view X-ray images in the same way humans do? Existing X-ray datasets often suffer from limitations, such as single-view imaging or insufficient sample diversity. To address these gaps, we introduce the Large-scale Dual-view X-ray (LDXray) dataset, which consists of 353,646 instances across 12 categories, providing a diverse and comprehensive resource for training and evaluating models. To emulate human intelligence in dual-view detection, we propose the Auxiliary-view Enhanced Network (AENet), a novel detection framework that leverages both the main and auxiliary views of the same object. The main-view pipeline focuses on detecting common categories, while the auxiliary-view pipeline handles more challenging categories using "expert models" learned from the main view. Extensive experiments on the LDXray dataset demonstrate that the dual-view mechanism significantly enhances detection performance, e.g., achieving improvements of up to 24.7% for the challenging category of umbrellas. Furthermore, our results show that AENet exhibits strong generalization across seven different detection models for X-ray inspection.

76. 【2411.18078】Dual-Level Boost Network for Long-Tail Prohibited Items Detection in X-ray Security Inspection

Link: https://arxiv.org/abs/2411.18078

Authors: Renshuai Tao, Haoyu Wang, Wei Wang, Yunchao Wei, Yao Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: X-ray security, Dual-level Boost Network, X-ray, prohibited items, X-ray security inspections

Comments: 10 pages, 4 figures

Abstract:The detection of prohibited items in X-ray security inspections is vital for ensuring public safety. However, the long-tail distribution of item categories, where certain prohibited items are far less common, poses a big challenge for detection models, as rare categories often lack sufficient training data. Existing methods struggle to classify these rare items accurately due to this imbalance. In this paper, we propose a Dual-level Boost Network (DBNet) specifically designed to overcome these challenges in X-ray security screening. Our approach introduces two key innovations: (1) a specific data augmentation strategy employing Poisson blending, inspired by the characteristics of X-ray images, to generate realistic synthetic instances of rare items which can effectively mitigate data imbalance; and (2) a context-aware feature enhancement module that captures the spatial and semantic interactions between objects and their surroundings, enhancing classification accuracy for underrepresented categories. Extensive experimental results demonstrate that DBNet improves detection performance for tail categories, outperforming state-of-the-art methods in X-ray security inspection scenarios by a large margin (17.2%), thereby ensuring enhanced public safety.

77. 【2411.18072】SmileSplat: Generalizable Gaussian Splats for Unconstrained Sparse Images

Link: https://arxiv.org/abs/2411.18072

Authors: Yanyan Li, Yixin Fang, Federico Tombari, Gim Hee Lee

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Generalizable Gaussian Splatting, Sparse Multi-view Images, Gaussian Splatting approaches, wider application prospects, Gaussian Splatting

Abstract:Sparse multi-view images can be used to predict explicit radiance fields via generalizable Gaussian Splatting approaches, which have wider application prospects in real life when ground-truth camera parameters are not required as inputs. In this paper, a novel generalizable Gaussian Splatting method, SmileSplat, is proposed to reconstruct pixel-aligned Gaussian surfels for diverse scenarios, requiring only unconstrained sparse multi-view images. First, Gaussian surfels are predicted based on the multi-head Gaussian regression decoder; they are represented with fewer degrees of freedom but have better multi-view consistency. Furthermore, the normal vectors of the Gaussian surfels are enhanced based on high-quality normal priors. Second, the Gaussians and camera parameters (both extrinsic and intrinsic) are optimized to obtain high-quality Gaussian radiance fields for novel view synthesis tasks based on the proposed Bundle-Adjusting Gaussian Splatting module. Extensive experiments on novel view rendering and depth map prediction tasks are conducted on public datasets, demonstrating that the proposed method achieves state-of-the-art performance in various 3D vision tasks. More information can be found on our project page (this https URL).

78. 【2411.18070】Large Scale Evaluation of Deep Learning-based Explainable Solar Flare Forecasting Models with Attribution-based Proximity Analysis

Link: https://arxiv.org/abs/2411.18070

Authors: Temitope Adeyeha, Chetraj Pandey, Berkay Aydin

Categories: Machine Learning (cs.LG); Solar and Stellar Astrophysics (astro-ph.SR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)

Keywords: potentially significant impact, impact on Earth, Earth and space-based, Accurate and reliable, space-based infrastructure

Comments: This is a preprint accepted at the IEEE International Conference on Big Data 2024 (IEEE BigData 2024)

Abstract:Accurate and reliable predictions of solar flares are essential due to their potentially significant impact on Earth and space-based infrastructure. Although deep learning models have shown notable predictive capabilities in this domain, current evaluations often focus on accuracy while neglecting interpretability and reliability--factors that are especially critical in operational settings. To address this gap, we propose a novel proximity-based framework for analyzing post hoc explanations to assess the interpretability of deep learning models for solar flare prediction. Our study compares two models trained on full-disk line-of-sight (LoS) magnetogram images to predict $\geq$M-class solar flares within a 24-hour window. We employ the Guided Gradient-weighted Class Activation Mapping (Guided Grad-CAM) method to generate attribution maps from these models, which we then analyze to gain insights into their decision-making processes. To support the evaluation of explanations in operational systems, we introduce a proximity-based metric that quantitatively assesses the accuracy and relevance of local explanations when regions of interest are known. Our findings indicate that the models' predictions align with active region characteristics to varying degrees, offering valuable insights into their behavior. This framework enhances the evaluation of model interpretability in solar flare forecasting and supports the development of more transparent and reliable operational systems.

79. 【2411.18068】PersonaCraft: Personalized Full-Body Image Synthesis for Multiple Identities from Single References Using 3D-Model-Conditioned Diffusion

Link: https://arxiv.org/abs/2411.18068

Authors: Gwanghyun Kim, Suh Yoon Jeon, Seunggyu Lee, Se Young Chun

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: significantly advanced, enabling the creation, Personalized image generation, creation of highly, Personalized image

Comments: Project page: [this https URL](https://gwang-kim.github.io/persona_craft)

Abstract:Personalized image generation has been significantly advanced, enabling the creation of highly realistic and customized images. However, existing methods often struggle with generating images of multiple people due to occlusions and fail to accurately personalize full-body shapes. In this paper, we propose PersonaCraft, a novel approach that combines diffusion models with 3D human modeling to address these limitations. Our method effectively manages occlusions by incorporating 3D-aware pose conditioning with SMPLx-ControlNet and accurately personalizes human full-body shapes through SMPLx fitting. Additionally, PersonaCraft enables user-defined body shape adjustments, adding flexibility for individual body customization. Experimental results demonstrate the superior performance of PersonaCraft in generating high-quality, realistic images of multiple individuals while resolving occlusion issues, thus establishing a new standard for multi-person personalized image synthesis. Project page: this https URL

80. 【2411.18066】GLS: Geometry-aware 3D Language Gaussian Splatting

Link: https://arxiv.org/abs/2411.18066

Authors: Jiaxiong Qiu, Liu Liu, Zhizhong Su, Tianwei Lin

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Gaussian Splatting, achieved significant performance, surface reconstruction, indoor surface reconstruction, open-vocabulary segmentation

Comments: Technical Report

Abstract:Recently, 3D Gaussian Splatting (3DGS) has achieved significant performance on indoor surface reconstruction and open-vocabulary segmentation. This paper presents GLS, a unified framework of surface reconstruction and open-vocabulary segmentation based on 3DGS. GLS extends two fields by exploring the correlation between them. For indoor surface reconstruction, we introduce surface normal prior as a geometric cue to guide the rendered normal, and use the normal error to optimize the rendered depth. For open-vocabulary segmentation, we employ 2D CLIP features to guide instance features and utilize DEVA masks to enhance their view consistency. Extensive experiments demonstrate the effectiveness of jointly optimizing surface reconstruction and open-vocabulary segmentation, where GLS surpasses state-of-the-art approaches of each task on MuSHRoom, ScanNet++, and LERF-OVS datasets. Code will be available at this https URL.

81. 【2411.18064】Lightweight Gaze Estimation Model Via Fusion Global Information

Link: https://arxiv.org/abs/2411.18064

Authors: Zhang Cheng, Yanxia Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Deep learning-based appearance, gaining popularity due, Deep learning-based, learning-based appearance gaze, appearance gaze estimation

Abstract:Deep learning-based appearance gaze estimation methods are gaining popularity due to their high accuracy and fewer constraints from the environment. However, existing high-precision models often rely on deeper networks, leading to problems such as large parameter counts, long training time, and slow convergence. To address this issue, this paper proposes a novel lightweight gaze estimation model, FGI-Net (Fusion Global Information). The model fuses global information into the CNN, effectively compensating for the need of multi-layer convolution and pooling to indirectly capture global information, while reducing the complexity of the model and improving its accuracy and convergence speed. To validate the performance of the model, a large number of experiments are conducted, comparing accuracy with existing classical models and lightweight models, comparing convergence speed with models of different architectures, and conducting ablation experiments. Experimental results show that, compared with GazeCaps, the latest gaze estimation model, FGI-Net achieves a smaller angle error with 87.1% and 79.1% reductions in parameters and FLOPs, respectively (3.74° on MPIIFaceGaze, 5.15° on EyeDiap, 10.50° on Gaze360, and 6.02° on RT-Gene). Moreover, compared with models of different architectures such as CNN and Transformer, FGI-Net converges quickly to a higher accuracy range with fewer training iterations: when achieving optimal accuracy on the Gaze360 and EyeDiap datasets, FGI-Net needs 25% and 37.5% fewer training iterations than GazeTR, respectively.

82. 【2411.18061】Multi-task Gaze Estimation Via Unidirectional Convolution

Link: https://arxiv.org/abs/2411.18061

Authors: Zhang Cheng, Yanxia Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: gaze estimation tasks, Global Convolution Module, Multi-task Regression Module, lightweight models, significant performance degradation

Abstract:Using lightweight models as backbone networks in gaze estimation tasks often results in significant performance degradation. The main reason is that the number of feature channels in lightweight networks is usually small, which limits the model's expressive ability. In order to improve the performance of lightweight models in gaze estimation tasks, a network model named Multitask-Gaze is proposed. The main components of Multitask-Gaze include Unidirectional Convolution (UC), Spatial and Channel Attention (SCA), Global Convolution Module (GCM), and Multi-task Regression Module (MRM). UC not only significantly reduces the number of parameters and FLOPs, but also extends the receptive field and improves the long-distance modeling capability of the model, thereby improving the model performance. SCA highlights gaze-related features and suppresses gaze-irrelevant features. The GCM replaces the pooling layer and avoids the performance degradation due to information loss. MRM improves the accuracy of individual tasks and strengthens the connections between tasks for overall performance improvement. The experimental results show that, compared with the state-of-the-art method SUGE, the performance of Multitask-Gaze on the MPIIFaceGaze and Gaze360 datasets is improved by 1.71% and 2.75%, respectively, while the number of parameters and FLOPs are significantly reduced by 75.5% and 86.88%.

83. 【2411.18042】HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation

Link: https://arxiv.org/abs/2411.18042

Authors: Trong-Thuan Nguyen, Pha Nguyen, Jackson Cothren, Alper Yilmaz, Khoa Luu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Scene Graph Generation, Video Scene Graph, Scene Graph, understanding video scenes, Scene Graph Anticipation

Abstract:Multimodal LLMs have advanced vision-language tasks but still struggle with understanding video scenes. To bridge this gap, Video Scene Graph Generation (VidSGG) has emerged to capture multi-object relationships across video frames. However, prior methods rely on pairwise connections, limiting their ability to handle complex multi-object interactions and reasoning. To this end, we propose Multimodal LLMs on a Scene HyperGraph (HyperGLM), promoting reasoning about multi-way interactions and higher-order relationships. Our approach uniquely integrates entity scene graphs, which capture spatial relationships between objects, with a procedural graph that models their causal transitions, forming a unified HyperGraph. Significantly, HyperGLM enables reasoning by injecting this unified HyperGraph into LLMs. Additionally, we introduce a new Video Scene Graph Reasoning (VSGR) dataset featuring 1.9M frames from third-person, egocentric, and drone views and supports five tasks: Scene Graph Generation, Scene Graph Anticipation, Video Question Answering, Video Captioning, and Relation Reasoning. Empirically, HyperGLM consistently outperforms state-of-the-art methods across five tasks, effectively modeling and reasoning complex relationships in diverse video scenes.

84. 【2411.18038】VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis

Link: https://arxiv.org/abs/2411.18038

Authors: Donggoo Kang, Dasol Jeong, Hyunmin Lee, Sangwoo Park, Hasil Park, Sunkyu Kwon, Yeongjoon Kim, Joonki Paik

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Large Vision Language, recently addressed remarkable, Vision Language Model, Large Vision, addressed remarkable progress

Comments: 18 pages

Abstract:The Large Vision Language Model (VLM) has recently achieved remarkable progress in bridging two fundamental modalities. VLM, trained on a sufficiently large dataset, exhibits a comprehensive understanding of both visual and linguistic information to perform diverse tasks. To distill this knowledge accurately, in this paper, we introduce a novel approach that explicitly utilizes VLM as an objective function form for the Human-Object Interaction (HOI) detection task (VLM-HOI). Specifically, we propose a method that quantifies the similarity of the predicted HOI triplet using the Image-Text matching technique. We represent HOI triplets linguistically to fully utilize the language comprehension of VLMs, which are more suitable than CLIP models due to their localization and object-centric nature. This matching score is used as an objective for contrastive optimization. To our knowledge, this is the first utilization of VLM language abilities for HOI detection. Experiments demonstrate the effectiveness of our method, achieving state-of-the-art HOI detection accuracy on benchmarks. We believe integrating VLMs into HOI detection represents important progress towards more advanced and interpretable analysis of human-object interactions.
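The two steps this abstract names, rendering HOI triplets as text and using an image-text matching score as a contrastive objective, can be sketched as below. This is an illustration under stated assumptions rather than the paper's formulation: the sentence template in `triplet_to_text`, the InfoNCE-style form of `contrastive_loss`, and the temperature value are hypothetical, and the matching scores are supplied by hand instead of coming from a real vision-language model.

```python
import math

def triplet_to_text(subject, verb, obj):
    """Render an HOI triplet as a short sentence (hypothetical template)."""
    return f"a {subject} is {verb} a {obj}"

def contrastive_loss(match_scores, positive_index, temperature=0.1):
    """InfoNCE-style loss over image-text matching scores.

    `match_scores[i]` stands for the score a vision-language model would
    assign to the image paired with candidate sentence i; the scores are
    passed in directly because no real model is called in this sketch.
    """
    logits = [s / temperature for s in match_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[positive_index]

candidates = [
    triplet_to_text("person", "riding", "bicycle"),   # ground-truth triplet
    triplet_to_text("person", "holding", "bicycle"),  # wrong verb
    triplet_to_text("person", "riding", "horse"),     # wrong object
]

# The loss is small when the ground-truth sentence gets the highest score
# and large when a wrong sentence does.
good = contrastive_loss([0.9, 0.4, 0.1], positive_index=0)
bad = contrastive_loss([0.1, 0.4, 0.9], positive_index=0)
print(candidates[0])  # a person is riding a bicycle
print(good < bad)     # True
```

Minimizing such a loss pushes the matching score of the correct triplet sentence above the scores of the incorrect ones, which is one common way a matching score can serve as a training signal.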

85. 【2411.18025】Pixel-aligned RGB-NIR Stereo Imaging and Dataset for Robot Vision

Link: https://arxiv.org/abs/2411.18025

Authors: Jinnyeong Kim, Seung-Hwan Baek

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Integrating RGB, potentially enhancing robotic, NIR stereo imaging, RGB and NIR, NIR stereo

Abstract:Integrating RGB and NIR stereo imaging provides complementary spectral information, potentially enhancing robotic 3D vision in challenging lighting conditions. However, existing datasets and imaging systems lack pixel-level alignment between RGB and NIR images, posing challenges for downstream vision tasks. In this paper, we introduce a robotic vision system equipped with pixel-aligned RGB-NIR stereo cameras and a LiDAR sensor mounted on a mobile robot. The system simultaneously captures pixel-aligned pairs of RGB stereo images, NIR stereo images, and temporally synchronized LiDAR points. Utilizing the mobility of the robot, we present a dataset containing continuous video frames under diverse lighting conditions. We then introduce two methods that utilize the pixel-aligned RGB-NIR images: an RGB-NIR image fusion method and a feature fusion method. The first approach enables existing RGB-pretrained vision models to directly utilize RGB-NIR information without fine-tuning. The second approach fine-tunes existing vision models to more effectively utilize RGB-NIR information. Experimental results demonstrate the effectiveness of using pixel-aligned RGB-NIR images across diverse lighting conditions.

86. 【2411.18013】FASIONAD : FAst and Slow FusION Thinking Systems for Human-Like Autonomous Driving with Adaptive Feedback

Link: https://arxiv.org/abs/2411.18013

Authors: Kangan Qian, Zhikun Ma, Yangfan He, Ziang Luo, Tianyu Shi, Tianze Zhu, Jiayin Li, Jianhui Wang, Ziyu Chen, Xiao He, Yining Shi, Zheng Fu, Xinyu Jiao, Kun Jiang, Diange Yang, Takafumi Matsumaru

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Ensuring safe, critical goal, Fast, Ensuring, fast system

Abstract:Ensuring safe, comfortable, and efficient navigation is a critical goal for autonomous driving systems. While end-to-end models trained on large-scale datasets excel in common driving scenarios, they often struggle with rare, long-tail events. Recent progress in large language models (LLMs) has introduced enhanced reasoning capabilities, but their computational demands pose challenges for real-time decision-making and precise planning. This paper presents FASIONAD, a novel dual-system framework inspired by the cognitive model "Thinking, Fast and Slow." The fast system handles routine navigation tasks using rapid, data-driven path planning, while the slow system focuses on complex reasoning and decision-making in challenging or unfamiliar situations. A dynamic switching mechanism based on score distribution and feedback allows seamless transitions between the two systems. Visual prompts generated by the fast system enable human-like reasoning in the slow system, which provides high-quality feedback to enhance the fast system's decision-making. To evaluate FASIONAD, we introduce a new benchmark derived from the nuScenes dataset, specifically designed to differentiate fast and slow scenarios. FASIONAD achieves state-of-the-art performance on this benchmark, establishing a new standard for frameworks integrating fast and slow cognitive processes in autonomous driving. This approach paves the way for more adaptive, human-like autonomous driving systems.

87. 【2411.18011】Manual-PA: Learning 3D Part Assembly from Instruction Diagrams

Link: https://arxiv.org/abs/2411.18011

Authors: Jiahao Zhang, Anoop Cherian, Cristian Rodriguez, Weijian Deng, Stephen Gould

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Assembling furniture amounts, physically realistic manner, discrete-continuous optimization task, Assembling furniture, realistic manner

Comments:

Abstract:Assembling furniture amounts to solving the discrete-continuous optimization task of selecting the furniture parts to assemble and estimating their connecting poses in a physically realistic manner. The problem is hampered by its combinatorially large yet sparse solution space, which makes learning to assemble a challenging task for current machine learning models. In this paper, we attempt to solve this task by leveraging the assembly instructions provided in diagrammatic manuals that typically accompany the furniture parts. Our key insight is to use the cues in these diagrams to split the problem into discrete and continuous phases. Specifically, we present Manual-PA, a transformer-based instruction Manual-guided 3D Part Assembly framework that learns to semantically align 3D parts with their illustrations in the manuals using a contrastive learning backbone towards predicting the assembly order and infers the 6D pose of each part via relating it to the final furniture depicted in the manual. To validate the efficacy of our method, we conduct experiments on the benchmark PartNet dataset. Our results show that using the diagrams and the order of the parts leads to significant improvements in assembly performance against the state of the art. Further, Manual-PA demonstrates strong generalization to real-world IKEA furniture assembly on the IKEA-Manual dataset.

88. 【2411.18009】Monocular Obstacle Avoidance Based on Inverse PPO for Fixed-wing UAVs

Link: https://arxiv.org/abs/2411.18009

Authors: Haochen Chai, Meimei Su, Yang Lyu, Zhunga Liu, Chunhui Zhao, Quan Pan

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Unmanned Aerial Vehicles, Urban Air Mobility, Fixed-wing Unmanned Aerial, burgeoning Low-altitude Economy, Aerial Vehicles

Comments:

Abstract:Fixed-wing Unmanned Aerial Vehicles (UAVs) are one of the most commonly used platforms for the burgeoning Low-altitude Economy (LAE) and Urban Air Mobility (UAM), due to their long endurance and high-speed capabilities. Classical obstacle avoidance systems, which rely on prior maps or sophisticated sensors, face limitations in unknown low-altitude environments and small UAV platforms. In response, this paper proposes a lightweight deep reinforcement learning (DRL) based UAV collision avoidance system that enables a fixed-wing UAV to avoid unknown obstacles at cruise speed over 30 m/s, with only onboard visual sensors. The proposed system employs a single-frame image depth inference module with a streamlined network architecture to ensure real-time obstacle detection, optimized for edge computing devices. After that, a reinforcement learning controller with a novel reward function is designed to balance the target approach and flight trajectory smoothness, satisfying the specific dynamic constraints and stability requirements of a fixed-wing UAV platform. An adaptive entropy adjustment mechanism is introduced to mitigate the exploration-exploitation trade-off inherent in DRL, improving training convergence and obstacle avoidance success rates. Extensive software-in-the-loop and hardware-in-the-loop experiments demonstrate that the proposed framework outperforms other methods in obstacle avoidance efficiency and flight trajectory smoothness and confirm the feasibility of implementing the algorithm on edge devices. The source code is publicly available at this https URL.

89. 【2411.18007】AI-Driven Smartphone Solution for Digitizing Rapid Diagnostic Test Kits and Enhancing Accessibility for the Visually Impaired

Link: https://arxiv.org/abs/2411.18007

Authors: R. B. Dastagir, J. T. Jami, S. Chanda, F. Hafiz, M. Rahman, K. Dey, M. M. Rahman, M. Qureshi, M. M. Chowdhury

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: timely disease detection, results remains challenging, test results remains, test result interpretation, diagnostic test result

Comments:

Abstract:Rapid diagnostic tests are crucial for timely disease detection and management, yet accurate interpretation of test results remains challenging. In this study, we propose a novel approach to enhance the accuracy and reliability of rapid diagnostic test result interpretation by integrating artificial intelligence (AI) algorithms, including convolutional neural networks (CNN), within a smartphone-based application. The app enables users to take pictures of their test kits, which YOLOv8 then processes to precisely crop and extract the membrane region, even if the test kit is not centered in the frame or is positioned at the very edge of the image. This capability offers greater accessibility, allowing even visually impaired individuals to capture test images without needing perfect alignment, thus promoting user independence and inclusivity. The extracted image is analyzed by an additional CNN classifier that determines if the results are positive, negative, or invalid, providing users with the results and a confidence level. Through validation experiments with commonly used rapid test kits across various diagnostic applications, our results demonstrate that the synergistic integration of AI significantly improves sensitivity and specificity in test result interpretation. This improvement can be attributed to the extraction of the membrane zones from the test kit images using the state-of-the-art YOLO algorithm. Additionally, we performed SHapley Additive exPlanations (SHAP) analysis to investigate the factors influencing the model's decisions, identifying reasons behind both correct and incorrect classifications. By facilitating the differentiation of genuine test lines from background noise and providing valuable insights into test line intensity and uniformity, our approach offers a robust solution to challenges in rapid test interpretation.

90. 【2411.18002】An End-to-End Two-Stream Network Based on RGB Flow and Representation Flow for Human Action Recognition

Link: https://arxiv.org/abs/2411.18002

Authors: Song-Jiang Lai, Tsun-Hin Cheung, Ka-Chun Fung, Tian-Shan Liu, Kin-Man Lam

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: video based action, making two-stream neural, based action recognition, two-stream neural networks, computer vision tasks

Comments: 6 pages, 3 figures, 9 tables

Abstract:With the rapid advancements in deep learning, computer vision tasks have seen significant improvements, making two-stream neural networks a popular focus for video-based action recognition. Traditional models using RGB and optical flow streams achieve strong performance but at a high computational cost. To address this, we introduce a representation flow algorithm to replace the optical flow branch in the egocentric action recognition model, enabling end-to-end training while reducing computational cost and prediction time. Our model, designed for egocentric action recognition, uses class activation maps (CAMs) to improve accuracy and ConvLSTM for spatio-temporal encoding with spatial attention. When evaluated on the GTEA61, EGTEA GAZE+, and HMDB datasets, our model matches the accuracy of the original model on GTEA61 and exceeds it by 0.65% and 0.84% on EGTEA GAZE+ and HMDB, respectively. Prediction runtimes are significantly reduced to 0.1881s, 0.1503s, and 0.1459s, compared to the original model's 101.6795s, 25.3799s, and 203.9958s. Ablation studies were also conducted to study the impact of different parameters on model performance. Keywords: two-stream, egocentric, action recognition, CAM, representation flow, ConvLSTM

91. 【2411.18000】Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models

Link: https://arxiv.org/abs/2411.18000

Authors: Shuyang Hao, Bryan Hooi, Jun Liu, Kai-Wei Chang, Zi Huang, Yujun Cai

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: underlying language models, inheriting security measures, Vision-Language Models, language models, models

Comments:

Abstract:Despite inheriting security measures from underlying language models, Vision-Language Models (VLMs) may still be vulnerable to safety alignment issues. Through empirical analysis, we uncover two critical findings: scenario-matched images can significantly amplify harmful outputs, and contrary to common assumptions in gradient-based attacks, minimal loss values do not guarantee optimal attack effectiveness. Building on these insights, we introduce MLAI (Multi-Loss Adversarial Images), a novel jailbreak framework that leverages scenario-aware image generation for semantic alignment, exploits flat minima theory for robust adversarial image selection, and employs multi-image collaborative attacks for enhanced effectiveness. Extensive experiments demonstrate MLAI's significant impact, achieving attack success rates of 77.75% on MiniGPT-4 and 82.80% on LLaVA-2, substantially outperforming existing methods by margins of 34.37% and 12.77% respectively. Furthermore, MLAI shows considerable transferability to commercial black-box VLMs, achieving up to 60.11% success rate. Our work reveals fundamental visual vulnerabilities in current VLMs safety mechanisms and underscores the need for stronger defenses. Warning: This paper contains potentially harmful example text.

92. 【2411.17995】Revisiting Misalignment in Multispectral Pedestrian Detection: A Language-Driven Approach for Cross-modal Alignment Fusion

Link: https://arxiv.org/abs/2411.17995

Authors: Taeheon Kim, Sangyun Chung, Youngjoon Yu, Yong Man Ro

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: crucial component, Multispectral pedestrian detection, Multispectral pedestrian, critical applications, heavily misaligned

Comments:

Abstract:Multispectral pedestrian detection is a crucial component in various critical applications. However, a significant challenge arises from misalignment between the RGB and thermal modalities, particularly under real-world conditions where data often appear heavily misaligned. Conventional methods developed on well-aligned or minimally misaligned datasets fail to address these discrepancies adequately. This paper introduces a new framework for multispectral pedestrian detection designed specifically to handle heavily misaligned datasets without the need for costly and complex traditional pre-processing calibration. By leveraging Large-scale Vision-Language Models (LVLM) for cross-modal semantic alignment, our approach seeks to enhance detection accuracy by aligning semantic information across the RGB and thermal domains. This method not only simplifies the operational requirements but also extends the usability of multispectral detection technologies in practical applications.

93. 【2411.17994】Differentiable Inverse Rendering with Interpretable Basis BRDFs

Link: https://arxiv.org/abs/2411.17994

Authors: Hoon-Gyu Chung, Seokjun Choi, Seung-Hwan Baek

Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)

Keywords: basis BRDFs, basis, Inverse rendering, BRDFs, Inverse rendering seeks

Comments: This paper is submitted to CVPR 2025. This is a different paper from my previous paper "Differentiable Point-based Inverse Rendering". It must not be removed automatically

Abstract:Inverse rendering seeks to reconstruct both geometry and spatially varying BRDFs (SVBRDFs) from captured images. To address the inherent ill-posedness of inverse rendering, basis BRDF representations are commonly used, modeling SVBRDFs as spatially varying blends of a set of basis BRDFs. However, existing methods often yield basis BRDFs that lack intuitive separation and have limited scalability to scenes of varying complexity. In this paper, we introduce a differentiable inverse rendering method that produces interpretable basis BRDFs. Our approach models a scene using 2D Gaussians, where the reflectance of each Gaussian is defined by a weighted blend of basis BRDFs. We efficiently render an image from the 2D Gaussians and basis BRDFs using differentiable rasterization and impose a rendering loss with the input images. During this analysis-by-synthesis optimization process of differentiable inverse rendering, we dynamically adjust the number of basis BRDFs to fit the target scene while encouraging sparsity in the basis weights. This ensures that the reflectance of each Gaussian is represented by only a few basis BRDFs. This approach enables the reconstruction of accurate geometry and interpretable basis BRDFs that are spatially separated. Consequently, the resulting scene representation, comprising basis BRDFs and 2D Gaussians, supports physically-based novel-view relighting and intuitive scene editing.
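The blending idea in this abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes each primitive holds one logit per basis, that a softmax turns the logits into non-negative weights summing to one, and that each basis BRDF is stood in for by a single RGB value instead of a full angular function. The names `blend_reflectance` and `sparsity_penalty` are hypothetical, and the entropy penalty is only one possible way to encourage sparse weights.

```python
import numpy as np

def blend_reflectance(weight_logits, basis_values):
    """Blend K basis reflectances per primitive.

    weight_logits: (G, K) array, one row of logits per primitive (assumed parameterization).
    basis_values:  (K, 3) array, an RGB stand-in for each basis BRDF.
    Returns the (G, K) weights and the (G, 3) blended reflectance.
    """
    # Softmax over the basis axis: weights are non-negative and sum to 1.
    shifted = weight_logits - weight_logits.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights, weights @ basis_values

def sparsity_penalty(weights, eps=1e-12):
    """Mean entropy of the weights; small when each primitive relies on few bases."""
    return float(-(weights * np.log(weights + eps)).sum(axis=1).mean())

# Two primitives, three bases: the first is dominated by one basis, the second is uniform.
logits = np.array([[10.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
basis = np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]])
weights, reflectance = blend_reflectance(logits, basis)
print(reflectance.round(3))
print(sparsity_penalty(weights))
```

Adding such a penalty to a rendering loss would push each primitive toward a few dominant bases, which is the behavior the abstract describes; the actual loss used in the paper may differ.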

94. 【2411.17991】VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format

Link: https://arxiv.org/abs/2411.17991

Authors: Yueqian Wang, Xiaojun Meng, Yuxuan Wang, Jianxin Liang, Jiansheng Wei, Huishuai Zhang, Dongyan Zhao

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: Recent researches, duet interaction format, interaction format, large language models, video large language

Comments: 9 pages

Abstract:Recent research on video large language models (VideoLLM) predominantly focuses on model architectures and training datasets, leaving the interaction format between the user and the model under-explored. In existing works, users often interact with VideoLLMs by using the entire video and a query as input, after which the model generates a response. This interaction format constrains the application of VideoLLMs in scenarios such as live-streaming comprehension where videos do not end and responses are required in a real-time manner, and also results in unsatisfactory performance on time-sensitive tasks that require localizing video segments. In this paper, we focus on a video-text duet interaction format. This interaction format is characterized by the continuous playback of the video, and both the user and the model can insert their text messages at any position during the video playback. When a text message ends, the video continues to play, akin to the alternation of two performers in a duet. We construct MMDuetIT, a video-text training dataset designed to adapt VideoLLMs to the video-text duet interaction format. We also introduce the Multi-Answer Grounded Video Question Answering (MAGQA) task to benchmark the real-time response ability of VideoLLMs. Trained on MMDuetIT, MMDuet demonstrates that adopting the video-text duet interaction format enables the model to achieve significant improvements in various time-sensitive tasks (76% CIDEr on YouCook2 dense video captioning, 90% mAP on QVHighlights highlight detection and 25% R@0.5 on Charades-STA temporal video grounding) with minimal training efforts, and also enables VideoLLMs to reply in a real-time manner as the video plays. Code, data and demo are available at: this https URL.

95. 【2411.17984】RS-vHeat: Heat Conduction Guided Efficient Remote Sensing Foundation Model

Link: https://arxiv.org/abs/2411.17984

Authors: Huiyang Hu, Peijin Wang, Hanbo Bi, Boyuan Tong, Zhaozhi Wang, Wenhui Diao, Hao Chang, Yingchao Feng, Ziqi Zhang, Qixiang Ye, Kun Fu, Xian Sun

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: offering greater scalability, Remote sensing foundation, Remote sensing, remote sensing images, models largely break

Comments: 18 pages, 9 figures and 9 tables

Abstract:Remote sensing foundation models largely break away from the traditional paradigm of designing task-specific models, offering greater scalability across multiple tasks. However, they face challenges such as low computational efficiency and limited interpretability, especially when dealing with high-resolution remote sensing images. To overcome these, we draw inspiration from heat conduction, a physical process modeling local heat diffusion. Building on this idea, we are the first to explore the potential of using the parallel computing model of heat conduction to simulate the local region correlations in high-resolution remote sensing images, and introduce RS-vHeat, an efficient multi-modal remote sensing foundation model. Specifically, RS-vHeat 1) applies the Heat Conduction Operator (HCO) with a complexity of $O(N^{1.5})$ and a global receptive field, reducing computational overhead while capturing remote sensing object structure information to guide heat diffusion; 2) learns the frequency distribution representations of various scenes through a self-supervised strategy based on frequency domain hierarchical masking and multi-domain reconstruction; 3) significantly improves efficiency and performance over state-of-the-art techniques across 4 tasks and 10 datasets. Compared to attention-based remote sensing foundation models, RS-vHeat reduces memory consumption by 84%, decreases FLOPs by 24% and improves throughput by 2.7 times.
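To get a feel for the $O(N^{1.5})$ figure quoted above, the following sketch compares it with the $O(N^2)$ growth of standard self-attention over $N$ tokens. It only compares asymptotic growth with all constant factors ignored, so it does not reproduce or predict the measured memory, FLOPs, or throughput numbers in the abstract. The function name and the feature-map sizes are hypothetical.

```python
def attention_to_hco_growth_ratio(num_tokens: int) -> float:
    """Ratio of N^2 to N^1.5 operation-count growth; algebraically equal to sqrt(N)."""
    return (num_tokens ** 2) / (num_tokens ** 1.5)

# Hypothetical square feature maps; N is the number of tokens (side * side).
for side in (32, 64, 128):
    n = side * side
    ratio = attention_to_hco_growth_ratio(n)
    print(f"{side}x{side} map, N={n}: N^2 / N^1.5 is about {ratio:.0f}")
```

Because the ratio grows as the square root of $N$, the gap widens with image resolution, which is consistent with the abstract's emphasis on high-resolution imagery.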

96. 【2411.17982】HI-SLAM2: Geometry-Aware Gaussian SLAM for Fast Monocular Scene Reconstruction

Link: https://arxiv.org/abs/2411.17982

Authors: Wei Zhang, Qing Cheng, David Skuddis, Niclas Zeller, Daniel Cremers, Norbert Haala

Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Keywords: geometry-aware Gaussian SLAM, Gaussian SLAM system, Existing Neural SLAM, RGB input, Neural SLAM

Comments: Under review process

Abstract:We present HI-SLAM2, a geometry-aware Gaussian SLAM system that achieves fast and accurate monocular scene reconstruction using only RGB input. Existing Neural SLAM or 3DGS-based SLAM methods often trade off between rendering quality and geometry accuracy; our research demonstrates that both can be achieved simultaneously with RGB input alone. The key idea of our approach is to enhance the ability for geometry estimation by combining easy-to-obtain monocular priors with learning-based dense SLAM, and then using 3D Gaussian splatting as our core map representation to efficiently model the scene. Upon loop closure, our method ensures on-the-fly global consistency through efficient pose graph bundle adjustment and instant map updates by explicitly deforming the 3D Gaussian units based on anchored keyframe updates. Furthermore, we introduce a grid-based scale alignment strategy to maintain improved scale consistency in prior depths for finer depth details. Through extensive experiments on Replica, ScanNet, and ScanNet++, we demonstrate significant improvements over existing Neural SLAM methods and even surpass RGB-D-based methods in both reconstruction and rendering quality. The project page and source code will be made available at this https URL.

97. 【2411.17980】Vision Mamba Distillation for Low-resolution Fine-grained Image Classification

Link: https://arxiv.org/abs/2411.17980

Authors: Yao Chen, Jiabao Wang, Peichao Wang, Rui Zhang, Yang Li

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: made significant progress, recently made significant, Low-resolution fine-grained image, vision Mamba classification, Mamba classification network

Comments:

Abstract:Low-resolution fine-grained image classification has recently made significant progress, largely thanks to the super-resolution techniques and knowledge distillation methods. However, these approaches lead to an exponential increase in the number of parameters and computational complexity of models. In order to solve this problem, in this letter, we propose a Vision Mamba Distillation (ViMD) approach to enhance the effectiveness and efficiency of low-resolution fine-grained image classification. Concretely, a lightweight super-resolution vision Mamba classification network (SRVM-Net) is proposed to improve its capability for extracting visual features by redesigning the classification sub-network with Mamba modeling. Moreover, we design a novel multi-level Mamba knowledge distillation loss boosting the performance, which can transfer prior knowledge obtained from a High-resolution Vision Mamba classification Network (HRVM-Net) as a teacher into the proposed SRVM-Net as a student. Extensive experiments on seven public fine-grained classification datasets related to benchmarks confirm our ViMD achieves a new state-of-the-art performance. While having higher accuracy, ViMD outperforms similar methods with fewer parameters and FLOPs, which is more suitable for embedded device applications. Code is available at this https URL.

98. 【2411.17973】Improved implicit diffusion model with knowledge distillation to estimate the spatial distribution density of carbon stock in remote sensing imagery

Link: https://arxiv.org/abs/2411.17973

Authors: Zhenyu Yu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: mitigating climate change, significant terrestrial carbon, effectively reducing atmospheric, carbon stock mechanism, concentrations and mitigating

Comments: Under review

Abstract:The forest serves as the most significant terrestrial carbon stock mechanism, effectively reducing atmospheric CO$_2$ concentrations and mitigating climate change. Remote sensing provides high data accuracy and enables large-scale observations. Optical images facilitate long-term monitoring, which is crucial for future carbon stock estimation studies. This study focuses on Huize County, Qujing City, Yunnan Province, China, utilizing GF-1 WFV satellite imagery. The KD-VGG and KD-UNet modules were introduced for initial feature extraction, and the improved implicit diffusion model (IIDM) was proposed. The results showed: (1) The VGG module improved initial feature extraction, improving accuracy, and reducing inference time with optimized model parameters. (2) The Cross-attention + MLPs module enabled effective feature fusion, establishing critical relationships between global and local features, achieving high-accuracy estimation. (3) The IIDM model, a novel contribution, demonstrated the highest estimation accuracy with an RMSE of 12.17%, significantly improving by 41.69% to 42.33% compared to the regression model. In carbon stock estimation, the generative model excelled in extracting deeper features, significantly outperforming other models, demonstrating the feasibility of AI-generated content in quantitative remote sensing. The 16-meter resolution estimates provide a robust basis for tailoring forest carbon sink regulations, enhancing regional carbon stock management.

99. 【2411.17959】Adversarial Training in Low-Label Regimes with Margin-Based Interpolation

Link: https://arxiv.org/abs/2411.17959

Authors: Tian Ye, Rajgopal Kannan, Viktor Prasanna

Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: train robust neural, robust neural network, neural network models, train robust, robust neural

Comments:

Abstract:Adversarial training has emerged as an effective approach to train robust neural network models that are resistant to adversarial attacks, even in low-label regimes where labeled data is scarce. In this paper, we introduce a novel semi-supervised adversarial training approach that enhances both robustness and natural accuracy by generating effective adversarial examples. Our method begins by applying linear interpolation between clean and adversarial examples to create interpolated adversarial examples that cross decision boundaries by a controlled margin. This sample-aware strategy tailors adversarial examples to the characteristics of each data point, enabling the model to learn from the most informative perturbations. Additionally, we propose a global epsilon scheduling strategy that progressively adjusts the upper bound of perturbation strengths during training. The combination of these strategies allows the model to develop increasingly complex decision boundaries with better robustness and natural accuracy. Empirical evaluations show that our approach effectively enhances performance against various adversarial attacks, such as PGD and AutoAttack.
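The two ingredients named in this abstract, interpolation between a clean and an adversarial example until the decision boundary is crossed by a controlled margin, and a schedule for the perturbation bound, can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: the bisection search, the logit-difference definition of margin, the linear ramp, and all function names are hypothetical, and the search assumes the margin decreases monotonically along the segment (true for the linear toy classifier used here).

```python
import numpy as np

def margin(logits, label):
    """True-class logit minus the best other logit; negative means misclassified."""
    return logits[label] - np.delete(logits, label).max()

def interpolate_to_margin(predict, x_clean, x_adv, label, target_margin=-0.1, steps=20):
    """Find a small lam in [0, 1] so that x_clean + lam * (x_adv - x_clean)
    has margin <= target_margin, using bisection (assumes monotonic margin)."""
    if margin(predict(x_adv), label) > target_margin:
        return x_adv, 1.0  # the adversarial endpoint does not cross far enough; keep it
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        x_mid = x_clean + mid * (x_adv - x_clean)
        if margin(predict(x_mid), label) <= target_margin:
            hi = mid  # still past the target margin: try a smaller step
        else:
            lo = mid
    return x_clean + hi * (x_adv - x_clean), hi

def epsilon_schedule(epoch, total_epochs, eps_max):
    """Linear ramp of the perturbation upper bound (one possible global schedule)."""
    return eps_max * min(1.0, (epoch + 1) / total_epochs)

# Toy linear classifier: logits are W @ x.
W = np.eye(2)
predict = lambda x: W @ x
x_interp, lam = interpolate_to_margin(predict, np.array([1.0, 0.0]), np.array([0.0, 1.0]), label=0)
print(lam, margin(predict(x_interp), 0))
```

The point of the interpolation is that each sample receives a perturbation just strong enough to cross the boundary by the chosen margin, instead of always using the full-strength adversarial example.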

100. 【2411.17957】Optimization-Free Image Immunization Against Diffusion-Based Editing

Link: https://arxiv.org/abs/2411.17957

Authors: Tarik Can Ozden, Ozgur Kara, Oguzhan Akcin, Kerem Zaman, Shashank Srivastava, Sandeep P. Chinchali, James M. Rehg

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: embed imperceptible noise, Current image immunization, immunization defense techniques, Current image, editing embed imperceptible

Comments: Project webpage: [this https URL](https://diffvax.github.io/)

Abstract:Current image immunization defense techniques against diffusion-based editing embed imperceptible noise in target images to disrupt editing models. However, these methods face scalability challenges, as they require time-consuming re-optimization for each image, taking hours for small batches. To address these challenges, we introduce DiffVax, a scalable, lightweight, and optimization-free framework for image immunization, specifically designed to prevent diffusion-based editing. Our approach enables effective generalization to unseen content, reducing computational costs and cutting immunization time from days to milliseconds, achieving a 250,000x speedup. This is achieved through a loss term that ensures the failure of editing attempts and the imperceptibility of the perturbations. Extensive qualitative and quantitative results demonstrate that our model is scalable, optimization-free, adaptable to various diffusion-based editing tools, robust against counter-attacks, and, for the first time, effectively protects video content from editing. Our code is provided on our project webpage.

101. 【2411.17949】ROICtrl: Boosting Instance Control for Visual Generation

Link: https://arxiv.org/abs/2411.17949

Authors: Yuchao Gu, Yipin Zhou, Yunfan Ye, Yixin Nie, Licheng Yu, Pingchuan Ma, Kevin Qinghong Lin, Mike Zheng Shou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: accurately associate positional, limits current text-based, simpler compositions featuring, Natural language, current text-based visual

Comments: Project page at [this https URL](https://roictrl.github.io/)

Abstract:Natural language often struggles to accurately associate positional and attribute information with multiple instances, which limits current text-based visual generation models to simpler compositions featuring only a few dominant instances. To address this limitation, this work enhances diffusion models by introducing regional instance control, where each instance is governed by a bounding box paired with a free-form caption. Previous methods in this area typically rely on implicit position encoding or explicit attention masks to separate regions of interest (ROIs), resulting in either inaccurate coordinate injection or large computational overhead. Inspired by ROI-Align in object detection, we introduce a complementary operation called ROI-Unpool. Together, ROI-Align and ROI-Unpool enable explicit, efficient, and accurate ROI manipulation on high-resolution feature maps for visual generation. Building on ROI-Unpool, we propose ROICtrl, an adapter for pretrained diffusion models that enables precise regional instance control. ROICtrl is compatible with community-finetuned diffusion models, as well as with existing spatial-based add-ons (e.g., ControlNet, T2I-Adapter) and embedding-based add-ons (e.g., IP-Adapter, ED-LoRA), extending their applications to multi-instance generation. Experiments show that ROICtrl achieves superior performance in regional instance control while significantly reducing computational costs.

102. 【2411.17945】MARVEL-40M+: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation

Link: https://arxiv.org/abs/2411.17945

Authors: Sankalp Sinha, Mohammad Sadil Khan, Muhammad Usama, Shino Sam, Didier Stricker, Sk Aziz Ali, Muhammad Zeshan Afzal

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)

Keywords: computer vision due, Generating high-fidelity, text prompts remains, limited size, prompts remains

Comments:

Abstract:Generating high-fidelity 3D content from text prompts remains a significant challenge in computer vision due to the limited size, diversity, and annotation depth of the existing datasets. To address this, we introduce MARVEL-40M+, an extensive dataset with 40 million text annotations for over 8.9 million 3D assets aggregated from seven major 3D datasets. Our contribution is a novel multi-stage annotation pipeline that integrates open-source pretrained multi-view VLMs and LLMs to automatically produce multi-level descriptions, ranging from detailed (150-200 words) to concise semantic tags (10-20 words). This structure supports both fine-grained 3D reconstruction and rapid prototyping. Furthermore, we incorporate human metadata from source datasets into our annotation pipeline to add domain-specific information in our annotation and reduce VLM hallucinations. Additionally, we develop MARVEL-FX3D, a two-stage text-to-3D pipeline. We fine-tune Stable Diffusion with our annotations and use a pretrained image-to-3D network to generate 3D textured meshes within 15s. Extensive evaluations show that MARVEL-40M+ significantly outperforms existing datasets in annotation quality and linguistic diversity, achieving win rates of 72.41% by GPT-4 and 73.40% by human evaluators.

103. 【2411.17936】Stealthy Multi-Task Adversarial Attacks

Link: https://arxiv.org/abs/2411.17936

Authors: Jiacheng Guo, Tianyun Zhang, Lei Li, Haochen Yang, Hongkai Yu, Minghai Qin

Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Deep Neural Networks, Neural Networks exhibit, Networks exhibit inherent, Deep Neural, Neural Networks

Comments:

Abstract:Deep Neural Networks exhibit inherent vulnerabilities to adversarial attacks, which can significantly compromise their outputs and reliability. While existing research primarily focuses on attacking single-task scenarios or indiscriminately targeting all tasks in multi-task environments, we investigate selectively targeting one task while preserving performance in others within a multi-task framework. This approach is motivated by varying security priorities among tasks in real-world applications, such as autonomous driving, where misinterpreting critical objects (e.g., signs, traffic lights) poses a greater security risk than minor depth miscalculations. Consequently, attackers may seek to target security-sensitive tasks while leaving non-critical tasks uncompromised, thereby evading detection before crucial functions are compromised. In this paper, we propose a stealthy multi-task attack framework that utilizes multiple algorithms to inject imperceptible noise into the input. This novel method demonstrates remarkable efficacy in compromising the target task while simultaneously maintaining or even enhancing performance across non-targeted tasks, a criterion hitherto unexplored in the field. Additionally, we introduce an automated approach for searching the weighting factors in the loss function, further enhancing attack efficiency. Experimental results validate our framework's ability to successfully attack the target task while preserving the performance of non-targeted tasks. The automated loss function weight searching method demonstrates comparable efficacy to manual tuning, establishing a state-of-the-art multi-task attack framework.

104. 【2411.17922】Exploring Superpixel Segmentation Methods in the Context of Citizen Science and Deforestation Detection

Link: https://arxiv.org/abs/2411.17922

Authors: Hugo Resende, Isabela Borlido, Victor Sundermann, Eduardo B. Neto, Silvio Jamil F. Guimarães, Fabio Faria, Alvaro Luiz Fazenda

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Tropical forests play, Tropical forests, citizen science campaigns, planet ecosystem, making the conservation

Comments: Paper was accepted for presentation at SAC 2025

Abstract:Tropical forests play an essential role in the planet's ecosystem, making the conservation of these biomes a worldwide priority. However, ongoing deforestation and degradation pose a significant threat to their existence, necessitating effective monitoring and the proposal of actions to mitigate the damage caused by these processes. In this regard, initiatives range from government and private sector monitoring programs to solutions based on citizen science campaigns, for example. Particularly in the context of citizen science campaigns, the segmentation of remote sensing images to identify deforested areas and subsequently submit them to analysis by non-specialized volunteers is necessary. Thus, segmentation using superpixel-based techniques proves to be a viable solution for this important task. Therefore, this paper presents an analysis of 22 superpixel-based segmentation methods applied to remote sensing images, aiming to identify which of them are more suitable for generating segments for citizen science campaigns. The results reveal that seven of the segmentation methods outperformed the baseline method (SLIC) currently employed in the ForestEyes citizen science project, indicating an opportunity for improvement in this important stage of campaign development.

105. 【2411.17917】DECODE: Domain-aware Continual Domain Expansion for Motion Prediction

Link: https://arxiv.org/abs/2411.17917

Authors: Boqi Li, Haojie Zhu, Henry X. Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: effectively navigate complex, navigate complex environments, Motion prediction, traffic participants, prediction is critical

Comments: This work has been submitted to the IEEE for possible publication

Abstract:Motion prediction is critical for autonomous vehicles to effectively navigate complex environments and accurately anticipate the behaviors of other traffic participants. As autonomous driving continues to evolve, the need to assimilate new and varied driving scenarios necessitates frequent model updates through retraining. To address these demands, we introduce DECODE, a novel continual learning framework that begins with a pre-trained generalized model and incrementally develops specialized models for distinct domains. Unlike existing continual learning approaches that attempt to develop a unified model capable of generalizing across diverse scenarios, DECODE uniquely balances specialization with generalization, dynamically adjusting to real-time demands. The proposed framework leverages a hypernetwork to generate model parameters, significantly reducing storage requirements, and incorporates a normalizing flow mechanism for real-time model selection based on likelihood estimation. Furthermore, DECODE merges outputs from the most relevant specialized and generalized models using deep Bayesian uncertainty estimation techniques. This integration ensures optimal performance in familiar conditions while maintaining robustness in unfamiliar scenarios. Extensive evaluations confirm the effectiveness of the framework, achieving a notably low forgetting rate of 0.044 and an average minADE of 0.584 m, significantly surpassing traditional learning strategies and demonstrating adaptability across a wide range of driving conditions.
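
The abstract reports an average minADE of 0.584 m. As a reading aid, here is a minimal sketch of how minADE is commonly computed: the average displacement error (mean Euclidean distance per waypoint) of the best of K candidate trajectories. The function names and toy trajectories are illustrative assumptions and are not taken from the DECODE code.

```python
import math


def average_displacement_error(pred, truth):
    """Mean Euclidean distance between matching (x, y) waypoints."""
    if len(pred) != len(truth) or not truth:
        raise ValueError("trajectories must be non-empty and equally long")
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(truth)


def min_ade(candidates, truth):
    """Smallest ADE among the K candidate trajectories for one agent."""
    return min(average_displacement_error(c, truth) for c in candidates)


# Toy data (hypothetical): ground truth moves along the x-axis.
truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
candidates = [
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],  # constant 1 m lateral offset
    [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)],  # drifts away over time
]
best = min_ade(candidates, truth)
print(f"minADE = {best:.3f} m")
```

Because only the best candidate counts, minADE rewards a predictor for covering the true future with at least one of its hypotheses.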

106. 【2411.17911】Passive Deepfake Detection Across Multi-modalities: A Comprehensive Survey

Link: https://arxiv.org/abs/2411.17911

Authors: Hong-Hanh Nguyen-Le, Van-Tuan Tran, Dinh-Thuc Nguyen, Nhien-An Le-Khac

Categories: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

Keywords: artists' style imitation, misinformation spreading, recent years, malicious purposes, individual impersonation

Comments: 26 pages

Abstract:In recent years, deepfakes (DFs) have been utilized for malicious purposes, such as individual impersonation, misinformation spreading, and artists' style imitation, raising ethical and security concerns. However, existing surveys have focused on the accuracy of passive DF detection approaches for single modalities, such as image, video, or audio. This comprehensive survey explores passive approaches across multiple modalities, including image, video, audio, and multi-modal domains, and extends the discussion beyond detection accuracy to generalization, robustness, attribution, and interpretability. Additionally, we discuss threat models for passive approaches, including potential adversarial strategies and different levels of adversary knowledge and capabilities. We also highlight current challenges in DF detection, including the lack of generalization across different generative models, the need for comprehensive trustworthiness evaluation, and the limitations of existing multi-modal approaches. Finally, we propose future research directions that address these unexplored and emerging issues in the field of passive DF detection, such as adaptive learning, dynamic benchmarks, holistic trustworthiness evaluation, and multi-modal detectors for talking-face video generation.

107. 【2411.17897】Automating grapevine LAI features estimation with UAV imagery and machine learning

Link: https://arxiv.org/abs/2411.17897

Authors: Muhammad Waseem Akram, Marco Vannucci, Giorgio Buttazzo, Valentina Colla, Stefano Roccella, Andrea Vannini, Giovanni Caruso, Simone Nesi, Alessandra Francini, Luca Sebastiani

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)

Keywords: determines crop health, index determines crop, leaf area index, health and growth, area index determines

Comments: Accepted at the 2024 IEEE International Workshop on Metrology for Agriculture and Forestry

Abstract:The leaf area index determines crop health and growth. Traditional methods for calculating it are time-consuming, destructive, costly, and limited to a scale. In this study, we automate the index estimation method using drone image data of grapevine plants and a machine learning model. Traditional feature extraction and deep learning methods are used to obtain helpful information from the data and enhance the performance of the different machine learning models employed for the leaf area index prediction. The results showed that deep learning based feature extraction is more effective than traditional methods. The new approach is a significant improvement over old methods, offering a faster, non-destructive, and cost-effective leaf area index calculation, which enhances precision agriculture practices.

108. 【2411.17891】HOPPR Medical-Grade Platform for Medical Imaging AI

Link: https://arxiv.org/abs/2411.17891

Authors: Kalina P. Slavkova, Melanie Traughber, Oliver Chen, Robert Bakos, Shayna Goldstein, Dan Harms, Bradley J. Erickson, Khan M. Siddiqui

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: large vision language, Technological advances, vision language models, artificial intelligence, HOPPR Platform

Comments: 6 pages, 3 figures

Abstract:Technological advances in artificial intelligence (AI) have enabled the development of large vision language models (LVLMs) that are trained on millions of paired image and text samples. Subsequent research efforts have demonstrated the great potential of LVLMs to achieve high performance in medical imaging use cases (e.g., radiology report generation), but there remain barriers that hinder the ability to deploy these solutions broadly. These include the cost of the extensive computational resources needed to develop large-scale models, the expertise needed to develop sophisticated AI models, and the difficulty of accessing substantially large, high-quality datasets that adequately represent the population in which the LVLM solution is to be deployed. The HOPPR Medical-Grade Platform addresses these barriers by providing powerful computational infrastructure, a suite of foundation models that developers can fine-tune for their specific use cases, and a robust quality management system that sets a standard for evaluating fine-tuned models for deployment in clinical settings. The HOPPR Platform has access to millions of imaging studies and text reports sourced from hundreds of imaging centers from diverse populations to pretrain foundation models and enable use case-specific cohorts for fine-tuning. All data are deidentified and securely stored for HIPAA compliance. Additionally, developers can securely host models on the HOPPR platform and access them via an API to make inferences using these models within established clinical workflows. With the Medical-Grade Platform, HOPPR's mission is to expedite the deployment of LVLM solutions for medical imaging and ultimately optimize radiologists' workflows and meet the growing demands of the field.

109. 【2411.17886】Multimodal Crash Likelihood Prediction: A Complexity-Infused Approach Integrating Semantic, Contextual, and Driving Features

Link: https://arxiv.org/abs/2411.17886

Authors: Meng Wang, Zach Noonan, Pnina Gershon, Shannon C. Roberts

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: improving traffic safety, Predicting crash likelihood, complex driving environments, advancing autonomous driving, Predicting crash

Comments: none

Abstract:Predicting crash likelihood in complex driving environments is essential for improving traffic safety and advancing autonomous driving. Previous studies have used statistical models and deep learning to predict crashes based on semantic, contextual, or driving features, but none have examined the combined influence of these factors, termed roadway complexity in this study. This paper introduces a two-stage framework that integrates roadway complexity features for crash prediction. In the first stage, an encoder extracts hidden contextual information from these features, generating complexity-infused features. The second stage uses both original and complexity-infused features to predict crash likelihood, achieving an accuracy of 87.98% with original features alone and 90.15% with the added complexity-infused features. Ablation studies confirm that a combination of semantic, driving, and contextual features yields the best results, which emphasize their role in capturing roadway complexity. Additionally, complexity index annotations generated by Large Language Models outperform those by Amazon Mechanical Turk, highlighting the potential of automated tools for accurate, scalable crash prediction systems.

110. 【2411.17869】ReC-TTT: Contrastive Feature Reconstruction for Test-Time Training

Link: https://arxiv.org/abs/2411.17869

Authors: Marco Colussi, Sergio Mascetti, Jose Dolz, Christian Desrosiers

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: computer vision tasks, showcases outstanding results, showcases outstanding, remarkable progress, progress in deep

Comments: none

Abstract:Deep learning (DL) has made remarkable progress, delivering outstanding results in various computer vision tasks. However, adaptation to real-time variations in data distributions remains an important challenge. Test-Time Training (TTT) was proposed as an effective solution to this issue: it increases the generalization ability of trained models by adding an auxiliary task at train time and then using its loss at test time to adapt the model. Inspired by the recent achievements of contrastive representation learning in unsupervised tasks, we propose ReC-TTT, a test-time training technique that can adapt a DL model to new unseen domains by generating discriminative views of the input data. ReC-TTT uses cross-reconstruction as an auxiliary task between a frozen encoder and two trainable encoders, taking advantage of a single shared decoder. At test time, this makes it possible to adapt the encoders so that they extract features that are correctly reconstructed by the decoder, which in this phase is frozen on the source domain. Experimental results show that ReC-TTT achieves better results than other state-of-the-art techniques in most domain shift classification challenges.

111. 【2411.17864】Generative Image Layer Decomposition with Visual Effects

Link: https://arxiv.org/abs/2411.17864

Authors: Jinrui Yang, Qing Liu, Yijun Li, Soo Ye Kim, Daniil Pakhomov, Mengwei Ren, Jianming Zhang, Zhe Lin, Cihang Xie, Yuyin Zhou

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Recent advancements, advancements in large, significantly enhanced, enhanced the capabilities, visual effects

Comments: Project page: [https://rayjryang.github.io/LayerDecomp](https://rayjryang.github.io/LayerDecomp)

Abstract:Recent advancements in large generative models, particularly diffusion-based methods, have significantly enhanced the capabilities of image editing. However, achieving precise control over image composition tasks remains a challenge. Layered representations, which allow for independent editing of image components, are essential for user-driven content creation, yet existing approaches often struggle to decompose an image into plausible layers with accurately retained transparent visual effects such as shadows and reflections. We propose LayerDecomp, a generative framework for image layer decomposition which outputs photorealistic clean backgrounds and high-quality transparent foregrounds with faithfully preserved visual effects. To enable effective training, we first introduce a dataset preparation pipeline that automatically scales up simulated multi-layer data with synthesized visual effects. To further enhance real-world applicability, we supplement this simulated dataset with camera-captured images containing natural visual effects. Additionally, we propose a consistency loss which forces the model to learn accurate representations for the transparent foreground layer when ground-truth annotations are not available. Our method achieves superior quality in layer decomposition, outperforming existing approaches in object removal and spatial editing tasks across several benchmarks and multiple user studies, unlocking various creative possibilities for layer-wise image editing. The project page is https://rayjryang.github.io/LayerDecomp.

112. 【2411.17837】OracleSage: Towards Unified Visual-Linguistic Understanding of Oracle Bone Scripts through Cross-Modal Knowledge Fusion

Link: https://arxiv.org/abs/2411.17837

Authors: Hanqi Jiang, Yi Pan, Junhao Chen, Zhengliang Liu, Yifan Zhou, Peng Shu, Yiwei Li, Huaqin Zhao, Stephen Mihm, Lewis C Howe, Tianming Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Oracle bone script, modern Chinese characters, China earliest mature, mature writing system, present significant challenges

Comments: none

Abstract:Oracle bone script (OBS), as China's earliest mature writing system, presents significant challenges in automatic recognition due to its complex pictographic structures and divergence from modern Chinese characters. We introduce OracleSage, a novel cross-modal framework that integrates hierarchical visual understanding with graph-based semantic reasoning. Specifically, we propose (1) a Hierarchical Visual-Semantic Understanding module that enables multi-granularity feature extraction through progressive fine-tuning of LLaVA's visual backbone, (2) a Graph-based Semantic Reasoning Framework that captures relationships between visual components and semantic concepts through dynamic message passing, and (3) OracleSem, a semantically enriched OBS dataset with comprehensive pictographic and semantic annotations. Experimental results demonstrate that OracleSage significantly outperforms state-of-the-art vision-language models. This research establishes a new paradigm for ancient text interpretation while providing valuable technical support for archaeological studies.

113. 【2411.17835】Arabic-Nougat: Fine-Tuning Vision Transformers for Arabic OCR and Markdown Extraction

Link: https://arxiv.org/abs/2411.17835

Authors: Mohamed Rashad

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: structured Markdown text, Arabic book pages, converting Arabic book, Meta Nougat architecture, Markdown text

Comments: 7 pages, 1 figure

Abstract:We present Arabic-Nougat, a suite of OCR models for converting Arabic book pages into structured Markdown text. Based on Meta's Nougat architecture, Arabic-Nougat includes three specialized models: arabic-small-nougat, arabic-base-nougat, and arabic-large-nougat. These models are fine-tuned on a synthetic dataset, arabic-img2md, comprising 13.7k pairs of Arabic book pages and their Markdown representations. Key contributions include the Aranizer-PBE-86k tokenizer, designed for efficient tokenization, and the use of torch.bfloat16 precision with Flash Attention 2 for optimized training and inference. Our models achieve state-of-the-art performance, with arabic-large-nougat delivering the highest Markdown Structure Accuracy and the lowest Character Error Rate. Additionally, we release a large-scale dataset containing 1.1 billion Arabic tokens extracted from over 8,500 books using our best-performing model, providing a valuable resource for Arabic OCR research. All models, datasets, and code are open-sourced and available at this https URL.
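
The abstract cites Character Error Rate as a headline metric. Below is a minimal sketch of one common definition: the Levenshtein (edit) distance between the OCR output and the reference, divided by the reference length. The exact evaluation protocol used for Arabic-Nougat is not given in the abstract, so the normalization and the function names here are assumptions.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the number of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical OCR output with two swapped letters.
cer = character_error_rate("markdown", "markdwon")
print(f"CER = {cer:.2%}")
```

Python strings are sequences of Unicode code points, so the same functions work on Arabic text, although combining diacritics are then counted as separate characters.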

114. 【2411.17832】SVGDreamer++: Advancing Editability and Diversity in Text-Guided SVG Generation

Link: https://arxiv.org/abs/2411.17832

Authors: Ximing Xing, Qian Yu, Chuang Wang, Haitao Zhou, Jing Zhang, Dong Xu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: demonstrated significant potential, iconography and sketching, Particle-based Score Distillation, demonstrated significant, significant potential

Comments: 17 pages, 17 figures. arXiv admin note: substantial text overlap with [arXiv:2312.16476](https://arxiv.org/abs/2312.16476)

Abstract:Recently, text-guided scalable vector graphics (SVG) synthesis has demonstrated significant potential in domains such as iconography and sketching. However, SVGs generated from existing Text-to-SVG methods often lack editability and exhibit deficiencies in visual quality and diversity. In this paper, we propose a novel text-guided vector graphics synthesis method to address these limitations. To improve the diversity of output SVGs, we present a Vectorized Particle-based Score Distillation (VPSD) approach. VPSD addresses over-saturation issues in existing methods and enhances sample diversity. A pre-trained reward model is incorporated to re-weight vector particles, improving aesthetic appeal and enabling faster convergence. Additionally, we design a novel adaptive vector primitives control strategy, which allows for the dynamic adjustment of the number of primitives, thereby enhancing the presentation of graphic details. Extensive experiments validate the effectiveness of the proposed method, demonstrating its superiority over baseline methods in terms of editability, visual quality, and diversity. We also show that our new method supports up to six distinct vector styles, capable of generating high-quality vector assets suitable for stylized vector design and poster design.

115. 【2411.17831】Rapid Distributed Fine-tuning of a Segmentation Model Onboard Satellites

Link: https://arxiv.org/abs/2411.17831

Authors: Meghan Plumridge, Rasmus Maråk, Chiara Ceccobello, Pablo Gómez, Gabriele Meoni, Filip Svoboda, Nicholas D. Lane

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)

Keywords: Earth observation, natural hazard analysis, natural hazard, Earth, data

Comments: Accepted at the Sixth IEEE International Conference on Image Processing Applications and Systems (IPAS) 2025

Abstract:Segmentation of Earth observation (EO) satellite data is critical for natural hazard analysis and disaster response. However, processing EO data at ground stations introduces delays due to data transmission bottlenecks and communication windows. Using segmentation models capable of near-real-time data analysis onboard satellites can therefore improve response times. This study presents a proof-of-concept using MobileSAM, a lightweight, pre-trained segmentation model, onboard Unibap iX10-100 satellite hardware. We demonstrate the segmentation of water bodies from Sentinel-2 satellite imagery and integrate MobileSAM with PASEOS, an open-source Python module that simulates satellite operations. This integration allows us to evaluate MobileSAM's performance under simulated conditions of a satellite constellation. Our research investigates the potential of fine-tuning MobileSAM in a decentralised way onboard multiple satellites in rapid response to a disaster. Our findings show that MobileSAM can be rapidly fine-tuned and benefits from decentralised learning, considering the constraints imposed by the simulated orbital environment. We observe improvements in segmentation performance with minimal training data and fast fine-tuning when satellites frequently communicate model updates. This study contributes to the field of onboard AI by emphasising the benefits of decentralised learning and fine-tuning pre-trained models for rapid response scenarios. Our work builds on recent related research at a critical time; as extreme weather events increase in frequency and magnitude, rapid response with onboard data analysis is essential.

116. 【2411.17820】CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos

Link: https://arxiv.org/abs/2411.17820

Authors: Xinhao Liu, Jintong Li, Yichen Jiang, Niranjan Sujay, Zhicheng Yang, Juexiao Zhang, John Abanes, Jing Zhang, Chen Feng

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: requiring advanced spatial, environments presents significant, advanced spatial reasoning, Navigating dynamic urban, urban environments presents

Comments: none

Abstract:Navigating dynamic urban environments presents significant challenges for embodied agents, requiring advanced spatial reasoning and adherence to common-sense norms. Despite progress, existing visual navigation methods struggle in map-free or off-street settings, limiting the deployment of autonomous agents like last-mile delivery robots. To overcome these obstacles, we propose a scalable, data-driven approach for human-like urban navigation by training agents on thousands of hours of in-the-wild city walking and driving videos sourced from the web. We introduce a simple and scalable data processing pipeline that extracts action supervision from these videos, enabling large-scale imitation learning without costly annotations. Our model learns sophisticated navigation policies to handle diverse challenges and critical scenarios. Experimental results show that training on large-scale, diverse datasets significantly enhances navigation performance, surpassing current methods. This work shows the potential of using abundant online video data to develop robust navigation policies for embodied agents in dynamic urban settings.

117. 【2411.17814】Low-rank Adaptation-based All-Weather Removal for Autonomous Navigation

Link: https://arxiv.org/abs/2411.17814

Authors: Sudarshan Rajagopalan, Vishal M. Patel

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: adverse weather conditions, reliable autonomous navigation, crucial for reliable, weather conditions, autonomous navigation

Comments: Project page: [https://sudraj2002.github.io/loraapage/](https://sudraj2002.github.io/loraapage/)

Abstract:All-weather image restoration (AWIR) is crucial for reliable autonomous navigation under adverse weather conditions. AWIR models are trained to address a specific set of weather conditions such as fog, rain, and snow. As a result, they often struggle with out-of-distribution (OoD) samples or unseen degradations, which limits their effectiveness for real-world autonomous navigation. To overcome this issue, existing models must either be retrained or fine-tuned, both of which are inefficient and impractical: retraining requires access to large datasets, and fine-tuning involves many parameters. In this paper, we propose using Low-Rank Adaptation (LoRA) to efficiently adapt a pre-trained all-weather model to novel weather restoration tasks. Furthermore, we observe that LoRA lowers the performance of the adapted model on the pre-trained restoration tasks. To address this issue, we introduce a LoRA-based fine-tuning method called LoRA-Align (LoRA-A), which seeks to align the singular vectors of the fine-tuned and pre-trained weight matrices using Singular Value Decomposition (SVD). This alignment helps preserve the model's knowledge of its original tasks while adapting it to unseen tasks. We show that images restored with LoRA and LoRA-A can be effectively used for computer vision tasks in autonomous navigation, such as semantic segmentation and depth estimation.
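
For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update in plain Python: the pre-trained weight W stays frozen and the effective weight is W + (alpha / r) · B · A, where r is the rank. This illustrates plain LoRA only; the SVD-based alignment step of LoRA-Align is described in the paper and is not reproduced here. The helper names and toy matrices are assumptions made for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_effective_weight(w, b, a, alpha):
    """Return W + (alpha / r) * B @ A, with r the shared inner dimension."""
    r = len(a)
    if r == 0 or any(len(row) != r for row in b):
        raise ValueError("B needs r columns, matching the r rows of A")
    delta = matmul(b, a)
    scale = alpha / r
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy example (hypothetical): 3x3 frozen weight with a rank-1 update.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b = [[1.0], [0.0], [2.0]]  # shape 3 x r
a = [[0.5, 0.0, 1.0]]      # shape r x 3
w_adapted = lora_effective_weight(w, b, a, alpha=1.0)
print(w_adapted)
```

Only B and A are trained, which is r · (d_out + d_in) values instead of d_out · d_in; here that is 6 instead of 9, and the gap widens quickly for realistic layer sizes.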

118. 【2411.17807】From memorization to generalization: a theoretical framework for diffusion-based generative models

Link: https://arxiv.org/abs/2411.17807

Authors: Indranil Halder

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Diffusion-based generative models, training set increases, generative models demonstrate, Diffusion-based generative, training dataset

Comments: 22 pages

Abstract:Diffusion-based generative models demonstrate a transition from memorizing the training dataset to a non-memorization regime as the size of the training set increases. Here, we begin by introducing a mathematically precise definition of this transition in terms of a relative distance: the model is said to be in the non-memorization/`generalization' regime if the generated distribution is almost surely far from the probability distribution associated with a Gaussian kernel approximation to the training dataset, relative to the sampling distribution. Then, we develop an analytically tractable diffusion model and establish a lower bound on Kullback-Leibler divergence between the generated and sampling distribution. The model also features the transition, according to our definition in terms of the relative distance, when the training data is sampled from an isotropic Gaussian distribution. Further, our study reveals that this transition occurs when the individual distance between the generated and underlying sampling distribution begins to decrease with the addition of more training samples. This is to be contrasted with an alternative scenario, where the model's memorization performance degrades, but generalization performance doesn't improve. We also provide empirical evidence indicating that realistic diffusion models exhibit the same alignment of scales.
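
The abstract works with Kullback-Leibler divergences and isotropic Gaussian data. As background only, here is the textbook closed form for the KL divergence between two isotropic Gaussians, written as a small Python function. It is not the paper's lower bound or its definition of relative distance; the function name and arguments are assumptions.

```python
import math


def kl_isotropic_gaussians(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, var_p * I) || N(mu_q, var_q * I) ) in nats."""
    if len(mu_p) != len(mu_q):
        raise ValueError("means must have the same dimension")
    if var_p <= 0 or var_q <= 0:
        raise ValueError("variances must be positive")
    d = len(mu_p)
    sq_dist = sum((x - y) ** 2 for x, y in zip(mu_p, mu_q))
    return 0.5 * (d * var_p / var_q + sq_dist / var_q - d
                  + d * math.log(var_q / var_p))


# Same covariance, means one unit apart in 2-D.
shifted = kl_isotropic_gaussians([0.0, 0.0], 1.0, [1.0, 0.0], 1.0)
# KL is not symmetric: swapping the arguments changes the value.
forward = kl_isotropic_gaussians([0.0, 0.0], 1.0, [0.0, 0.0], 4.0)
backward = kl_isotropic_gaussians([0.0, 0.0], 4.0, [0.0, 0.0], 1.0)
print(shifted, forward, backward)
```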

119. 【2411.17799】Signs as Tokens: An Autoregressive Multilingual Sign Language Generator

Link: https://arxiv.org/abs/2411.17799

Authors: Ronglai Zuo, Rolandos Alexandros Potamias, Evangelos Ververas, Jiankang Deng, Stefanos Zafeiriou

Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Keywords: primary communication method, Sign language, Sign, language, features of natural

Comments: none

Abstract:Sign language is a visual language that encompasses all linguistic features of natural languages and serves as the primary communication method for the deaf and hard-of-hearing communities. While many studies have successfully adapted pretrained language models (LMs) for sign language translation (sign-to-text), drawing inspiration from its linguistic characteristics, the reverse task of sign language generation (SLG, text-to-sign) remains largely unexplored. Most existing approaches treat SLG as a visual content generation task, employing techniques such as diffusion models to produce sign videos, 2D keypoints, or 3D avatars based on text inputs, overlooking the linguistic properties of sign languages. In this work, we introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs using a pretrained LM. To align sign language with the LM, we develop a decoupled tokenizer that discretizes continuous signs into token sequences representing various body parts. These sign tokens are integrated into the raw text vocabulary of the LM, allowing for supervised fine-tuning on sign language datasets. To facilitate multilingual SLG research, we further curate a large-scale Chinese sign language dataset, CSL-Daily, with high-quality 3D pose annotations. Extensive qualitative and quantitative evaluations demonstrate the effectiveness of SOKE. The project page is available at this https URL.

120. 【2411.17794】NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects?

Link: https://arxiv.org/abs/2411.17794

Authors: Jiaxuan Li, Junwen Mo, MinhDuc Vo, Akihiro Sugimoto, Hideki Nakayama

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Multimodal Large Language, Large Language Models, Large Language, made notable advances, specific attributes remain

Comments: none

Abstract:Multimodal Large Language Models (MLLMs) have made notable advances in visual understanding, yet their ability to recognize objects modified by specific attributes remains an open question. To address this, we explore MLLMs' reasoning capabilities in object recognition, ranging from commonsense to beyond-commonsense scenarios. We introduce a novel benchmark, NEMO, which comprises 900 images of origiNal fruits and their corresponding attributE-MOdified ones, along with a set of 2,700 questions of open, multiple-choice, and unsolvable types. We assess 26 recent open-sourced and commercial models using our benchmark. The findings highlight pronounced performance gaps in recognizing objects in NEMO and reveal distinct answer preferences across different models. Although stronger vision encoders improve performance, MLLMs still lag behind standalone vision encoders. Interestingly, scaling up the model size does not consistently yield better outcomes, as deeper analysis reveals that larger LLMs can weaken vision encoders during fine-tuning. These insights shed light on critical limitations in current MLLMs and suggest potential pathways toward developing more versatile and resilient multimodal models.

121. 【2411.17790】Self-supervised Monocular Depth and Pose Estimation for Endoscopy with Generative Latent Priors

Link: https://arxiv.org/abs/2411.17790

Authors: Ziang Xu, Bin Li, Yang Hu, Chenyu Zhang, James East, Sharib Ali, Jens Rittscher

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: holistic lesion characterization, Generative Latent Bank, requiring reliable depth, endoscopy enables quantitative, holistic lesion

Comments: none

Abstract:Accurate 3D mapping in endoscopy enables quantitative, holistic lesion characterization within the gastrointestinal (GI) tract, requiring reliable depth and pose estimation. However, endoscopy systems are monocular, and existing methods relying on synthetic datasets or complex models often lack generalizability in challenging endoscopic conditions. We propose a robust self-supervised monocular depth and pose estimation framework that incorporates a Generative Latent Bank and a Variational Autoencoder (VAE). The Generative Latent Bank leverages extensive depth scenes from natural images to condition the depth network, enhancing realism and robustness of depth predictions through latent feature priors. For pose estimation, we reformulate it within a VAE framework, treating pose transitions as latent variables to regularize scale, stabilize z-axis prominence, and improve x-y sensitivity. This dual refinement pipeline enables accurate depth and pose predictions, effectively addressing the GI tract's complex textures and lighting. Extensive evaluations on SimCol and EndoSLAM datasets confirm our framework's superior performance over published self-supervised methods in endoscopic depth and pose estimation.

122. 【2411.17788】Geometric Point Attention Transformer for 3D Shape Reassembly

Link: https://arxiv.org/abs/2411.17788

Authors: Jiahan Li, Chaoran Cheng, Jianzhu Ma, Ge Liu

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: gained significant interest, reassemble separate parts, Geometric Point Attention, Point Attention Transformer, complete object

Abstract:Shape assembly, which aims to reassemble separate parts into a complete object, has gained significant interest in recent years. Existing methods primarily rely on networks to predict the poses of individual parts, but often fail to effectively capture the geometric interactions between the parts and their poses. In this paper, we present the Geometric Point Attention Transformer (GPAT), a network specifically designed to address the challenges of reasoning about geometric relationships. In the geometric point attention module, we integrate both global shape information and local pairwise geometric features, along with poses represented as rotation and translation vectors for each part. To enable iterative updates and dynamic reasoning, we introduce a geometric recycling scheme, where each prediction is fed into the next iteration for refinement. We evaluate our model on both the semantic and geometric assembly tasks, showing that it outperforms previous methods in absolute pose estimation, achieving accurate pose predictions and high alignment accuracy.

123. 【2411.17787】Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient

Link: https://arxiv.org/abs/2411.17787

Authors: Zigeng Chen, Xinyin Ma, Gongfan Fang, Xinchao Wang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: next-scale prediction approach, rapidly advancing field, garnered considerable attention, innovative next-scale prediction, Visual Auto-Regressive

Comments: Work in progress. Code repository: [this https URL](https://github.com/czg1225/CoDe)

Abstract:In the rapidly advancing field of image generation, Visual Auto-Regressive (VAR) modeling has garnered considerable attention for its innovative next-scale prediction approach. This paradigm offers substantial improvements in efficiency, scalability, and zero-shot generalization. Yet, the inherently coarse-to-fine nature of VAR introduces a prolonged token sequence, leading to prohibitive memory consumption and computational redundancies. To address these bottlenecks, we propose Collaborative Decoding (CoDe), a novel efficient decoding strategy tailored for the VAR framework. CoDe capitalizes on two critical observations: the substantially reduced parameter demands at larger scales and the exclusive generation patterns across different scales. Based on these insights, we partition the multi-scale inference process into a seamless collaboration between a large model and a small model. The large model serves as the 'drafter', specializing in generating low-frequency content at smaller scales, while the smaller model serves as the 'refiner', solely focusing on predicting high-frequency details at larger scales. This collaboration yields remarkable efficiency with minimal impact on quality: CoDe achieves a 1.7x speedup, slashes memory usage by around 50%, and preserves image quality with only a negligible FID increase from 1.95 to 1.98. When drafting steps are further decreased, CoDe can achieve an impressive 2.9x acceleration ratio, reaching 41 images/s at 256x256 resolution on a single NVIDIA 4090 GPU, while preserving a commendable FID of 2.27. The code is available at this https URL

124. 【2411.17786】DreamCache: Finetuning-Free Lightweight Personalized Image Generation via Feature Caching

Link: https://arxiv.org/abs/2411.17786

Authors: Emanuele Aiello, Umberto Michieli, Diego Valsesia, Mete Ozay, Enrico Magli

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

Keywords: Personalized image generation, image generation requires, capture the core, Personalized image, generation requires

Comments: 16 pages, 8 figures

Abstract:Personalized image generation requires text-to-image generative models that capture the core features of a reference subject to allow for controlled generation across different contexts. Existing methods face challenges due to complex training requirements, high inference costs, limited flexibility, or a combination of these issues. In this paper, we introduce DreamCache, a scalable approach for efficient and high-quality personalized image generation. By caching a small number of reference image features from a subset of layers and a single timestep of the pretrained diffusion denoiser, DreamCache enables dynamic modulation of the generated image features through lightweight, trained conditioning adapters. DreamCache achieves state-of-the-art image and text alignment, utilizing an order of magnitude fewer extra parameters, and is both more computationally effective and versatile than existing models.

125. 【2411.17784】Diffusion Autoencoders for Few-shot Image Generation in Hyperbolic Space

Link: https://arxiv.org/abs/2411.17784

Authors: Lingxiao Li, Kaixuan Fan, Boqing Gong, Xiangyu Yue

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Few-shot image generation, Few-shot image, Hyperbolic Diffusion Autoencoders, image generation aims, Few-shot

Abstract:Few-shot image generation aims to generate diverse and high-quality images for an unseen class given only a few examples in that class. However, existing methods often suffer from a trade-off between image quality and diversity while offering limited control over the attributes of newly generated images. In this work, we propose Hyperbolic Diffusion Autoencoders (HypDAE), a novel approach that operates in hyperbolic space to capture hierarchical relationships among images and texts from seen categories. By leveraging pre-trained foundation models, HypDAE generates diverse new images for unseen categories with exceptional quality by varying semantic codes or guided by textual instructions. Most importantly, the hyperbolic representation introduces an additional degree of control over semantic diversity through the adjustment of radii within the hyperbolic disk. Extensive experiments and visualizations demonstrate that HypDAE significantly outperforms prior methods by achieving a superior balance between quality and diversity with limited data and offers a highly controllable and interpretable generation process.

126. 【2411.17777】Network Inversion and Its Applications

Link: https://arxiv.org/abs/2411.17777

Authors: Pirzada Suhail, Hao Tang, Amit Sethi

Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Logic in Computer Science (cs.LO)

Keywords: remains opaque, emerged as powerful, powerful tools, process often remains, Network inversion

Comments: arXiv admin note: substantial text overlap with [arXiv:2410.16884](https://arxiv.org/abs/2410.16884), [arXiv:2407.18002](https://arxiv.org/abs/2407.18002)

Abstract: Neural networks have emerged as powerful tools across various applications, yet their decision-making process often remains opaque, leading to them being perceived as "black boxes." This opacity raises concerns about their interpretability and reliability, especially in safety-critical scenarios. Network inversion techniques offer a solution by allowing us to peek inside these black boxes, revealing the features and patterns learned by the networks behind their decision-making processes and thereby providing valuable insights into how neural networks arrive at their conclusions, making them more interpretable and trustworthy. This paper presents a simple yet effective approach to network inversion using a meticulously conditioned generator that learns the data distribution in the input space of the trained neural network, enabling the reconstruction of inputs that would most likely lead to the desired outputs. To capture the diversity in the input space for a given output, instead of simply revealing the conditioning labels to the generator, we encode the conditioning label information into vectors and intermediate matrices and further minimize the cosine similarity between features of the generated images. Additionally, we incorporate feature orthogonality as a regularization term to boost image diversity; it penalises the deviations of the Gram matrix of the features from the identity matrix, ensuring orthogonality and promoting distinct, non-redundant representations for each label. The paper concludes by exploring immediate applications of the proposed network inversion approach in interpretability, out-of-distribution detection, and training data reconstruction.
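
The feature-orthogonality regularizer mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the L2 normalization of the features, the mean-squared form of the penalty, and the function names are assumptions made for illustration only.

```python
import math


def l2_normalize(vector):
    """Scale a feature vector to unit length (assumed preprocessing step)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def gram_matrix(features):
    """Pairwise dot products between unit-normalized feature vectors."""
    unit = [l2_normalize(f) for f in features]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


def orthogonality_penalty(features):
    """Mean squared deviation of the Gram matrix from the identity matrix."""
    gram = gram_matrix(features)
    n = len(gram)
    total = 0.0
    for i in range(n):
        for j in range(n):
            target = 1.0 if i == j else 0.0
            total += (gram[i][j] - target) ** 2
    return total / (n * n)


# Hypothetical feature sets for two generated images.
orthogonal = [[1.0, 0.0], [0.0, 2.0]]  # distinct directions
redundant = [[1.0, 0.0], [3.0, 0.0]]   # same direction, different scale

print(orthogonality_penalty(orthogonal))
print(orthogonality_penalty(redundant))
```

Under these assumptions, features pointing in distinct directions incur no penalty, while features that are collinear are penalized, which matches the stated goal of promoting non-redundant representations.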

127. 【2411.17776】Beyond Walking: A Large-Scale Image-Text Benchmark for Text-based Person Anomaly Search

Link: https://arxiv.org/abs/2411.17776

Authors: Shuyu Yang, Yaxiong Wang, Li Zhu, Zhedong Zheng

Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Keywords: natural language descriptions, retrieve specific individuals, person search aims, Text-based person search, text-based person anomaly

Abstract: Text-based person search aims to retrieve specific individuals across camera networks using natural language descriptions. However, current benchmarks often exhibit biases towards common actions like walking or standing, neglecting the critical need for identifying abnormal behaviors in real-world scenarios. To meet such demands, we propose a new task, text-based person anomaly search, which locates pedestrians engaged in either routine or anomalous activities via text. To enable the training and evaluation of this new task, we construct a large-scale image-text Pedestrian Anomaly Behavior (PAB) benchmark, featuring a broad spectrum of actions, e.g., running, performing, playing soccer, and the corresponding anomalies of the same identity, e.g., lying, being hit, and falling. The training set of PAB comprises 1,013,605 synthesized image-text pairs of both normalities and anomalies, while the test set includes 1,978 real-world image-text pairs. To validate the potential of PAB, we introduce a cross-modal pose-aware framework, which integrates human pose patterns with identity-based hard negative pair sampling. Extensive experiments on the proposed benchmark show that synthetic training data facilitates fine-grained behavior retrieval in the real-world test set, while the proposed pose-aware method further improves the recall@1 by 2.88%. We will release the dataset, code, and checkpoints to facilitate further research and ensure the reproducibility of our results.

128. 【2411.17773】Efficient Multi-modal Large Language Models via Visual Token Grouping

Link: https://arxiv.org/abs/2411.17773

Authors: Minbin Huang, Runhui Huang, Han Shi, Yimeng Chen, Chuanyang Zheng, Xiangguo Sun, Xin Jiang, Zhenguo Li, Hong Cheng

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Large Language Models, Multi-modal Large Language, enhances Large Language, Language Models, Large Language

Abstract: The development of Multi-modal Large Language Models (MLLMs) enhances Large Language Models (LLMs) with the ability to perceive data formats beyond text, significantly advancing a range of downstream applications, such as visual question answering and image captioning. However, the substantial computational costs associated with processing high-resolution images and videos pose a barrier to their broader adoption. To address this challenge, compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs. While existing methods conduct token reduction in the feature alignment phase, in this paper we introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments without the need for segmentation masks. Specifically, we concatenate semantic tokens to represent image semantic segments after the linear projection layer before feeding into the vision encoder. Besides, with the isolated attention we adopt, VisToG can identify and eliminate redundant visual tokens utilizing the prior knowledge in the pre-trained vision encoder, which effectively reduces computational demands. Extensive experiments demonstrate the effectiveness of VisToG, maintaining 98.1% of the original performance while achieving a reduction of over 27% in inference time.

129. 【2411.17772】MVBoost: Boost 3D Reconstruction with Multi-View Refinement

Link: https://arxiv.org/abs/2411.17772

Authors: Xiangyu Liu, Xiaomei Zhang, Zhiyuan Ma, Xiangyu Zhu, Zhen Lei

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: Recent advancements, models rely heavily, heavily on existing, rely heavily, reconstruction model

Abstract:Recent advancements in 3D object reconstruction have been remarkable, yet most current 3D models rely heavily on existing 3D datasets. The scarcity of diverse 3D datasets results in limited generalization capabilities of 3D reconstruction models. In this paper, we propose a novel framework for boosting 3D reconstruction with multi-view refinement (MVBoost) by generating pseudo-GT data. The key of MVBoost is combining the advantages of the high accuracy of the multi-view generation model and the consistency of the 3D reconstruction model to create a reliable data source. Specifically, given a single-view input image, we employ a multi-view diffusion model to generate multiple views, followed by a large 3D reconstruction model to produce consistent 3D data. MVBoost then adaptively refines these multi-view images, rendered from the consistent 3D data, to build a large-scale multi-view dataset for training a feed-forward 3D reconstruction model. Additionally, the input view optimization is designed to optimize the corresponding viewpoints based on the user's input image, ensuring that the most important viewpoint is accurately tailored to the user's needs. Extensive evaluations demonstrate that our method achieves superior reconstruction results and robust generalization compared to prior works.

130. 【2411.17771】DiagramQG: A Dataset for Generating Concept-Focused Questions from Diagrams

Link: https://arxiv.org/abs/2411.17771

Authors: Xinyu Zhang, Lingling Zhang, Yanrui Wu, Muye Huang, Wenjun Wu, Bo Li, Shaowei Wang, Jun Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: gained significant attention, significant attention due, Visual Question Generation, Question Generation, Diagram Question Generation

Abstract: Visual Question Generation (VQG) has gained significant attention due to its potential in educational applications. However, VQG research mainly focuses on natural images, neglecting diagrams in educational materials used to assess students' conceptual understanding. To address this gap, we introduce DiagramQG, a dataset containing 8,372 diagrams and 19,475 questions across various subjects. DiagramQG introduces concept and target text constraints, guiding the model to generate concept-focused questions for educational purposes. Meanwhile, we present the Hierarchical Knowledge Integration framework for Diagram Question Generation (HKI-DQG) as a strong baseline. This framework obtains multi-scale patches of diagrams and acquires knowledge using a visual language model with frozen parameters. It then integrates knowledge, text constraints and patches to generate concept-focused questions. We evaluate the performance of existing VQG models, open-source and closed-source vision-language models, and HKI-DQG on the DiagramQG dataset. Our HKI-DQG outperforms existing methods, demonstrating that it serves as a strong baseline. Furthermore, to assess its generalizability, we apply HKI-DQG to two other VQG datasets of natural images, namely VQG-COCO and K-VQG, achieving state-of-the-art performance. The dataset and code are available at this https URL.

131. 【2411.17769】Omegance: A Single Parameter for Various Granularities in Diffusion-Based Synthesis

Link: https://arxiv.org/abs/2411.17769

Authors: Xinyu Hou, Zongsheng Yue, Xiaoming Li, Chen Change Loy

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: introduce a single, omega, effectively control granularity, single parameter, control

Comments: Project page: [this https URL](https://itsmag11.github.io/Omegance/)

Abstract:In this work, we introduce a single parameter $\omega$, to effectively control granularity in diffusion-based synthesis. This parameter is incorporated during the denoising steps of the diffusion model's reverse process. Our approach does not require model retraining, architectural modifications, or additional computational overhead during inference, yet enables precise control over the level of details in the generated outputs. Moreover, spatial masks or denoising schedules with varying $\omega$ values can be applied to achieve region-specific or timestep-specific granularity control. Prior knowledge of image composition from control signals or reference images further facilitates the creation of precise $\omega$ masks for granularity control on specific objects. To highlight the parameter's role in controlling subtle detail variations, the technique is named Omegance, combining "omega" and "nuance". Our method demonstrates impressive performance across various image and video synthesis tasks and is adaptable to advanced diffusion models. The code is available at this https URL.

132. 【2411.17767】Exploring Aleatoric Uncertainty in Object Detection via Vision Foundation Models

Link: https://arxiv.org/abs/2411.17767

Authors: Peng Cui, Guande He, Dan Zhang, Zhijie Deng, Yinpeng Dong, Jun Zhu

Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: open world unavoidably, world unavoidably suffer, randomness or noisiness, open world, world unavoidably

Abstract: Datasets collected from the open world unavoidably suffer from various forms of randomness or noisiness, leading to the ubiquity of aleatoric (data) uncertainty. Quantifying such uncertainty is particularly pivotal for object detection, where images contain multi-scale objects with occlusion, obscureness, and even noisy annotations, in contrast to images with centric and similar-scale objects in classification. This paper suggests modeling and exploiting the uncertainty inherent in object detection data with vision foundation models and develops a data-centric reliable training paradigm. Technically, we propose to estimate the data uncertainty of each object instance based on the feature space of vision foundation models, which are trained on ultra-large-scale datasets and able to exhibit universal data representation. In particular, we assume a mixture-of-Gaussian structure of the object features and devise Mahalanobis distance-based measures to quantify the data uncertainty. Furthermore, we suggest two crucial and practical usages of the estimated uncertainty: 1) defining an uncertainty-aware sample filter to abandon noisy and redundant instances to avoid over-fitting, and 2) defining a sample-adaptive regularizer to balance easy/hard samples for adaptive training. The estimated aleatoric uncertainty serves as an extra level of annotations of the dataset, so it can be utilized in a plug-and-play manner with any model. Extensive empirical studies verify the effectiveness of the proposed aleatoric uncertainty measure on various advanced detection models and challenging benchmarks.
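
As a rough illustration of a Mahalanobis distance-based measure of the kind this abstract describes, the sketch below fits one Gaussian per class and scores how far an object feature lies from its class mean. The diagonal covariance, the `eps` smoothing term, and the toy features are simplifying assumptions for illustration, not details taken from the paper.

```python
import math


def fit_diagonal_gaussian(features, eps=1e-6):
    """Estimate a per-dimension mean and variance from class features.

    A diagonal covariance is an assumption made to keep the sketch short;
    `eps` avoids division by zero for constant dimensions.
    """
    n = len(features)
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in features) / n + eps
           for d in range(dim)]
    return mean, var


def mahalanobis(x, mean, var):
    """Mahalanobis distance of x to a Gaussian with diagonal covariance."""
    return math.sqrt(sum((xi - mi) ** 2 / vi
                         for xi, mi, vi in zip(x, mean, var)))


# Hypothetical 2-D features of object instances from one class.
class_features = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
mean, var = fit_diagonal_gaussian(class_features)

typical = mahalanobis([1.0, 1.0], mean, var)   # at the class mean
outlier = mahalanobis([4.0, 1.0], mean, var)   # far along the first axis
print(typical, outlier)
```

A larger distance marks an instance as less typical of its class, which is the kind of per-instance signal that a sample filter or a sample-adaptive regularizer could consume.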

133. 【2411.17765】I2VControl: Disentangled and Unified Video Motion Synthesis Control

Link: https://arxiv.org/abs/2411.17765

Authors: Wanquan Feng, Tianhao Qi, Jiawei Liu, Mingzhen Sun, Pengqi Tu, Tianxiang Ma, Fei Dai, Songtao Zhao, Siyu Zhou, Qian He

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: undergoing rapid progress, Video synthesis techniques, rapid progress, usability for end-users, techniques are undergoing

Comments: Project page: [this https URL](https://wanquanf.github.io/I2VControl)

Abstract:Video synthesis techniques are undergoing rapid progress, with controllability being a significant aspect of practical usability for end-users. Although text condition is an effective way to guide video synthesis, capturing the correct joint distribution between text descriptions and video motion remains a substantial challenge. In this paper, we present a disentangled and unified framework, namely I2VControl, that unifies multiple motion control tasks in image-to-video synthesis. Our approach partitions the video into individual motion units and represents each unit with disentangled control signals, which allows for various control types to be flexibly combined within our single system. Furthermore, our methodology seamlessly integrates as a plug-in for pre-trained models and remains agnostic to specific model architectures. We conduct extensive experiments, achieving excellent performance on various control tasks, and our method further facilitates user-driven creative combinations, enhancing innovation and creativity. The project page is: this https URL .

134. 【2411.17763】Symmetry Strikes Back: From Single-Image Symmetry Detection to 3D Generation

Link: https://arxiv.org/abs/2411.17763

Authors: Xiang Li, Zixuan Huang, Anh Thai, James M. Rehg

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: structure interpretation, ubiquitous and fundamental, fundamental property, critical cue, cue for perception

Comments: Project page: [this https URL](https://ryanxli.github.io/reflect3d/)

Abstract:Symmetry is a ubiquitous and fundamental property in the visual world, serving as a critical cue for perception and structure interpretation. This paper investigates the detection of 3D reflection symmetry from a single RGB image, and reveals its significant benefit on single-image 3D generation. We introduce Reflect3D, a scalable, zero-shot symmetry detector capable of robust generalization to diverse and real-world scenarios. Inspired by the success of foundation models, our method scales up symmetry detection with a transformer-based architecture. We also leverage generative priors from multi-view diffusion models to address the inherent ambiguity in single-view symmetry detection. Extensive evaluations on various data sources demonstrate that Reflect3D establishes a new state-of-the-art in single-image symmetry detection. Furthermore, we show the practical benefit of incorporating detected symmetry into single-image 3D generation pipelines through a symmetry-aware optimization process. The integration of symmetry significantly enhances the structural accuracy, cohesiveness, and visual fidelity of the reconstructed 3D geometry and textures, advancing the capabilities of 3D content creation.

135. 【2411.17762】MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding

Link: https://arxiv.org/abs/2411.17762

Authors: Rongchang Xie, Chen Du, Ping Song, Chang Liu

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: Semantic discrete Encoding, introduce MUSE-VL, discrete Encoding, Semantic discrete, Semantic

Abstract: We introduce MUSE-VL, a Unified Vision-Language Model through Semantic discrete Encoding for multimodal understanding and generation. Recently, the research community has begun exploring unified models for visual generation and understanding. However, existing vision tokenizers (e.g., VQGAN) only consider low-level information, which makes it difficult to align with textual semantic features. This results in high training complexity and necessitates a large amount of training data to achieve optimal performance. Additionally, their performance is still far from that of dedicated understanding models. This paper proposes Semantic Discrete Encoding (SDE), which effectively aligns the information of visual tokens and language tokens by adding semantic constraints to the visual tokenizer. This greatly reduces training difficulty and improves the performance of the unified model. The proposed model significantly surpasses the previous state-of-the-art in various vision-language benchmarks and achieves better performance than dedicated understanding models.

136. 【2411.17761】OpenAD: Open-World Autonomous Driving Benchmark for 3D Object Detection

Link: https://arxiv.org/abs/2411.17761

Authors: Zhongyu Xia, Jishuo Li, Zhiwei Lin, Xinhao Wang, Yongtao Wang, Ming-Hsuan Yang

Categories: Computer Vision and Pattern Recognition (cs.CV)

Keywords: encompasses domain generalization, driving encompasses domain, autonomous driving encompasses, autonomous driving, Open-world autonomous driving

Abstract:Open-world autonomous driving encompasses domain generalization and open-vocabulary. Domain generalization refers to the capabilities of autonomous driving systems across different scenarios and sensor parameter configurations. Open vocabulary pertains to the ability to recognize various semantic categories not encountered during training. In this paper, we introduce OpenAD, the first real-world open-world autonomous driving benchmark for 3D object detection. OpenAD is built on a corner case discovery and annotation pipeline integrating with a multimodal large language model (MLLM). The proposed pipeline annotates corner case objects in a unified format for five autonomous driving perception datasets with 2000 scenarios. In addition, we devise evaluation methodologies and evaluate various 2D and 3D open-world and specialized models. Moreover, we propose a vision-centric 3D open-world object detection baseline and further introduce an ensemble method by fusing general and specialized models to address the issue of lower precision in existing open-world methods for the OpenAD benchmark. Annotations, toolkit code, and all evaluation codes will be released.

137. 【2411.17760】Efficient Self-Improvement in Multimodal Large Language Models: A Model-Level Judge-Free Approach

Link: https://arxiv.org/abs/2411.17760

Authors: Shijian Deng, Wentian Zhao, Yu-Jhe Li, Kun Wan, Daniel Miranda, Ajinkya Kale, Yapeng Tian

Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: multimodal large language, large language models, reliability and robustness, multimodal large, large language

Abstract:Self-improvement in multimodal large language models (MLLMs) is crucial for enhancing their reliability and robustness. However, current methods often rely heavily on MLLMs themselves as judges, leading to high computational costs and potential pitfalls like reward hacking and model collapse. This paper introduces a novel, model-level judge-free self-improvement framework. Our approach employs a controlled feedback mechanism while eliminating the need for MLLMs in the verification loop. We generate preference learning pairs using a controllable hallucination mechanism and optimize data quality by leveraging lightweight, contrastive language-image encoders to evaluate and reverse pairs when necessary. Evaluations across public benchmarks and our newly introduced IC dataset designed to challenge hallucination control demonstrate that our model outperforms conventional techniques. We achieve superior precision and recall with significantly lower computational demands. This method offers an efficient pathway to scalable self-improvement in MLLMs, balancing performance gains with reduced resource requirements.

138. 【2411.17746】UVCG: Leveraging Temporal Consistency for Universal Video Protection

Link: https://arxiv.org/abs/2411.17746

Authors: KaiZhou Li, Jindong Gu, Xinchun Yu, Junjie Cao, Yansong Tang, Xiao-Ping Zhang

Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Keywords: garnered significant attention, AI-driven video editing, Video Consistency Guard, Universal Video Consistency, significant attention

Abstract: The security risks of AI-driven video editing have garnered significant attention. Although recent studies indicate that adding perturbations to images can protect them from malicious edits, directly applying image-based methods to perturb each frame in a video becomes ineffective, as video editing techniques leverage the consistency of inter-frame information to restore individually perturbed content. To address this challenge, we leverage the temporal consistency of video content to propose a straightforward and efficient, yet highly effective and broadly applicable approach, Universal Video Consistency Guard (UVCG). UVCG embeds the content of another video (the target video) within a protected video by introducing continuous, imperceptible perturbations that force the encoder of editing models to map continuous inputs to misaligned continuous outputs, thereby inhibiting the generation of videos consistent with the intended textual prompts. Additionally, leveraging the similarity in perturbations between adjacent frames, we improve the computational efficiency of perturbation generation by employing a perturbation-reuse strategy. We applied UVCG across various versions of Latent Diffusion Models (LDM) and assessed its effectiveness and generalizability across multiple LDM-based editing pipelines. The results confirm the effectiveness, transferability, and efficiency of our approach in safeguarding video content from unauthorized modifications.

139. 【2411.17735】SnapMem: Snapshot-based 3D Scene Memory for Embodied Exploration and Reasoning

Link: https://arxiv.org/abs/2411.17735

Authors: Yuncong Yang, Han Yang, Jiachen Zhou, Peihao Chen, Hongxin Zhang, Yilun Du, Chuang Gan

Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)

Keywords: Constructing compact, Constructing, memory, scene, exploration

Abstract: Constructing compact and informative 3D scene representations is essential for effective embodied exploration and reasoning, especially in complex environments over long periods. Existing scene representations, such as object-centric 3D scene graphs, have significant limitations. They oversimplify spatial relationships by modeling scenes as individual objects, with inter-object relationships described by restrictive texts, making it difficult to answer queries that require nuanced spatial understanding. Furthermore, these representations lack natural mechanisms for active exploration and memory management, which hampers their application to lifelong autonomy. In this work, we propose SnapMem, a novel snapshot-based scene representation serving as 3D scene memory for embodied agents. SnapMem employs informative images, termed Memory Snapshots, to capture rich visual information of explored regions. It also integrates frontier-based exploration by introducing Frontier Snapshots (glimpses of unexplored areas) that enable agents to make informed exploration decisions by considering both known and potential new information. Meanwhile, to support lifelong memory in active exploration settings, we further present an incremental construction pipeline for SnapMem, as well as an effective memory retrieval technique for memory management. Experimental results on three benchmarks demonstrate that SnapMem significantly enhances agents' exploration and reasoning capabilities in 3D environments over extended periods, highlighting its potential for advancing applications in embodied AI.

140. 【2411.18602】Evaluating and Improving the Effectiveness of Synthetic Chest X-Rays for Medical Image Analysis

Link: https://arxiv.org/abs/2411.18602

Authors: Eva Prakash, Jeya Maria Jose Valanarasu, Zhihong Chen, Eduardo Pontes Reis, Andrew Johnston, Anuj Pareek, Christian Bluethgen, Sergios Gatidis, Cameron Olsen, Akshay Chaudhari, Andrew Ng, Curtis Langlotz

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: synthetic chest X-ray, chest X-ray images, explore best-practice approaches, augmenting medical imaging, chest X-ray

Abstract:Purpose: To explore best-practice approaches for generating synthetic chest X-ray images and augmenting medical imaging datasets to optimize the performance of deep learning models in downstream tasks like classification and segmentation. Materials and Methods: We utilized a latent diffusion model to condition the generation of synthetic chest X-rays on text prompts and/or segmentation masks. We explored methods like using a proxy model and using radiologist feedback to improve the quality of synthetic data. These synthetic images were then generated from relevant disease information or geometrically transformed segmentation masks and added to ground truth training set images from the CheXpert, CANDID-PTX, SIIM, and RSNA Pneumonia datasets to measure improvements in classification and segmentation model performance on the test sets. F1 and Dice scores were used to evaluate classification and segmentation respectively. One-tailed t-tests with Bonferroni correction assessed the statistical significance of performance improvements with synthetic data. Results: Across all experiments, the synthetic data we generated resulted in a maximum mean classification F1 score improvement of 0.150453 (CI: 0.099108-0.201798; P=0.0031) compared to using only real data. For segmentation, the maximum Dice score improvement was 0.14575 (CI: 0.108267-0.183233; P=0.0064). Conclusion: Best practices for generating synthetic chest X-ray images for downstream tasks include conditioning on single-disease labels or geometrically transformed segmentation masks, as well as potentially using proxy modeling for fine-tuning such generations.
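
For readers unfamiliar with the two metrics named in this abstract, the following minimal sketch computes a binary F1 score from labels and a Dice score from flattened binary masks. The handling of empty inputs and the toy data are assumptions for illustration; the paper's own evaluation code is not shown in the abstract.

```python
def f1_score(y_true, y_pred):
    """Binary F1: 2*TP / (2*TP + FP + FN); returns 0.0 if undefined (assumed convention)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def dice_score(mask_true, mask_pred):
    """Dice overlap: 2*|A and B| / (|A| + |B|) on flattened binary masks."""
    intersection = sum(1 for t, p in zip(mask_true, mask_pred) if t and p)
    total = sum(mask_true) + sum(mask_pred)
    # Two empty masks are treated as a perfect match (assumed convention).
    return 2 * intersection / total if total else 1.0


# Hypothetical classification labels and flattened segmentation masks.
labels_true = [1, 1, 0, 0]
labels_pred = [1, 0, 1, 0]
mask_true = [1, 1, 1, 0]
mask_pred = [0, 1, 1, 1]

print(f1_score(labels_true, labels_pred))  # 0.5
print(dice_score(mask_true, mask_pred))
```

For binary inputs the two formulas coincide; they are conventionally reported under different names for classification and segmentation.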

141. 【2411.18440】Learning the Evolution of Physical Structure of Galaxies via Diffusion Models

Link: https://arxiv.org/abs/2411.18440

Authors: Andrew Lizarraga, Eric Hanchen Jiang, Jacob Nowack, Yun Qi Li, Ying Nian Wu, Bernie Boscoe, Tuan Do

Categories: Astrophysics of Galaxies (astro-ph.GA); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Denoising Diffusion Probabilistic, Diffusion Probabilistic Models, conditioning Denoising Diffusion, primarily through imaging, imaging data

Comments

Click to view abstract

Abstract:In astrophysics, understanding the evolution of galaxies, primarily through imaging data, is fundamental to comprehending the formation of the Universe. This paper introduces a novel approach to conditioning Denoising Diffusion Probabilistic Models (DDPM) on redshifts for generating galaxy images. We explore whether this advanced generative model can accurately capture the physical characteristics of galaxies based solely on their images and redshift measurements. Our findings demonstrate that this model not only produces visually realistic galaxy images but also encodes the underlying changes in physical properties with redshift that are the result of galaxy evolution. This approach marks a significant advancement in using generative models to enhance our scientific insight into cosmic phenomena.

142. 【2411.18290】Leveraging Semantic Asymmetry for Precise Gross Tumor Volume Segmentation of Nasopharyngeal Carcinoma in Planning CT

Link: https://arxiv.org/abs/2411.18290

Authors: Zi Li, Ying Chen, Zeli Chen, Yanzhou Su, Tai Ma, Tony C. W. Mok, Yan-Jie Zhou, Yunhai Bai, Zhinlin Zheng, Le Lu, Yirui Wang, Jia Ge, Xianghua Ye, Senxiang Yan, Dakai Jin

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: clinicians typically delineate, radiation dose delivery, ensure accurate radiation, accurate radiation dose, planning computed tomography

Comments

Click to view abstract

Abstract:In the radiation therapy of nasopharyngeal carcinoma (NPC), clinicians typically delineate the gross tumor volume (GTV) using non-contrast planning computed tomography to ensure accurate radiation dose delivery. However, the low contrast between tumors and adjacent normal tissues necessitates that radiation oncologists manually delineate the tumors, often relying on diagnostic MRI for guidance. In this study, we propose a novel approach to directly segment NPC gross tumors on non-contrast planning CT images, circumventing potential registration errors when aligning MRI or MRI-derived tumor masks to planning CT. To address the low contrast issues between tumors and adjacent normal structures in planning CT, we introduce a 3D Semantic Asymmetry Tumor segmentation (SATs) method. Specifically, we posit that a healthy nasopharyngeal region is characteristically bilaterally symmetric, whereas the emergence of nasopharyngeal carcinoma disrupts this symmetry. Then, we propose a Siamese contrastive learning segmentation framework that minimizes the voxel-wise distance between original and flipped areas without tumor and encourages a larger distance between original and flipped areas with tumor. Thus, our approach enhances the sensitivity of features to semantic asymmetries. Extensive experiments demonstrate that the proposed SATs achieves the leading NPC GTV segmentation performance in both internal and external testing, e.g., with at least 2% absolute Dice score improvement and 12% average distance error reduction when compared to other state-of-the-art methods in the external testing.
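The idea of pulling original and flipped features together where there is no tumor and pushing them apart where there is one can be sketched with a classical margin-based contrastive loss. This is only a minimal illustration: the abstract does not give the paper's exact loss, so the margin formulation, the function names, and the toy feature vectors below are all assumptions.

```python
import math


def mirror_row(row_features):
    """Mirror one left-to-right row of voxel features about the midline."""
    return row_features[::-1]


def asymmetry_contrastive_loss(feats, flipped_feats, tumor_mask, margin=1.0):
    """Margin-based contrastive loss over paired voxel features.

    feats, flipped_feats: equal-length lists of feature vectors.
    tumor_mask: 1 where the voxel belongs to tumor, 0 elsewhere.
    margin: hypothetical hyperparameter, not taken from the paper.
    """
    total = 0.0
    for f, g, is_tumor in zip(feats, flipped_feats, tumor_mask):
        dist = math.dist(f, g)
        if is_tumor:
            # Tumor voxels: penalize pairs closer than the margin.
            total += max(0.0, margin - dist) ** 2
        else:
            # Tumor-free voxels: penalize any distance at all.
            total += dist ** 2
    return total / len(feats)


if __name__ == "__main__":
    # Toy 1D row of 2-dimensional voxel features; the last voxel is "tumor".
    row = [[0.0, 0.0], [0.5, 0.5], [0.5, 0.5], [3.0, 0.0]]
    mask = [0, 0, 0, 1]
    print(asymmetry_contrastive_loss(row, mirror_row(row), mask))
```

In a real Siamese setup the two feature sets would come from a shared encoder applied to the image and its mirrored copy, with the mask mirrored accordingly; the sketch skips that and works on precomputed features.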

143. 【2411.18249】Deep End-to-end Adaptive k-Space Sampling, Reconstruction, and Registration for Dynamic MRI

Link: https://arxiv.org/abs/2411.18249

Authors: George Yiasemis, Jan-Jakob Sonke, Jonas Teuwen

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)

Keywords: Dynamic MRI enables, organ motion tracking, Dynamic MRI, MRI enables, range of clinical

Comments: 39 pages, 19 figures, 4 tables

Click to view abstract

Abstract:Dynamic MRI enables a range of clinical applications, including cardiac function assessment, organ motion tracking, and radiotherapy guidance. However, fully sampling the dynamic k-space data is often infeasible due to time constraints and physiological motion such as respiratory and cardiac motion. This necessitates undersampling, which degrades the quality of reconstructed images. Poor image quality not only hinders visualization but also impairs the estimation of deformation fields, crucial for registering dynamic (moving) images to a static reference image. This registration enables tasks such as motion correction, treatment planning, and quantitative analysis in applications like cardiac imaging and MR-guided radiotherapy. To overcome the challenges posed by undersampling and motion, we introduce an end-to-end deep learning (DL) framework that integrates adaptive dynamic k-space sampling, reconstruction, and registration. Our approach begins with a DL-based adaptive sampling strategy, optimizing dynamic k-space acquisition to capture the most relevant data for each specific case. This is followed by a DL-based reconstruction module that produces images optimized for accurate deformation field estimation from the undersampled moving data. Finally, a registration module estimates the deformation fields aligning the reconstructed dynamic images with a static reference. The proposed framework is independent of specific reconstruction and registration modules allowing for plug-and-play integration of these components. The entire framework is jointly trained using a combination of supervised and unsupervised loss functions, enabling end-to-end optimization for improved performance across all components. Through controlled experiments and ablation studies, we validate each component, demonstrating that each choice contributes to robust motion estimation from undersampled dynamic data.

144. 【2411.18189】Towards Lensless Image Deblurring with Prior-Embedded Implicit Neural Representations in the Low-Data Regime

Link: https://arxiv.org/abs/2411.18189

Authors: Abeer Banerjee, Sanjay Singh

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: promising paradigm shift, Generative Adversarial Networks, computational imaging problems, inverse computational imaging, leveraging Generative Adversarial

Comments

Click to view abstract

Abstract:The field of computational imaging has witnessed a promising paradigm shift with the emergence of untrained neural networks, offering novel solutions to inverse computational imaging problems. While existing techniques have demonstrated impressive results, they often operate either in the high-data regime, leveraging Generative Adversarial Networks (GANs) as image priors, or through untrained iterative reconstruction in a data-agnostic manner. This paper delves into lensless image reconstruction, a subset of computational imaging that replaces traditional lenses with computation, enabling the development of ultra-thin and lightweight imaging systems. To the best of our knowledge, we are the first to leverage implicit neural representations for lensless image deblurring, achieving reconstructions without the requirement of prior training. We perform prior-embedded untrained iterative optimization to enhance reconstruction performance and speed up convergence, effectively bridging the gap between the no-data and high-data regimes. Through a thorough comparative analysis encompassing various untrained and low-shot methods, including under-parameterized non-convolutional methods and domain-restricted low-shot methods, we showcase the superior performance of our approach by a significant margin.

145. 【2411.18063】Mortality Prediction of Pulmonary Embolism Patients with Deep Learning and XGBoost

Link: https://arxiv.org/abs/2411.18063

Authors: Yalcin Tur, Vedat Cicek, Tufan Cinar, Elif Keles, Bradlay D. Allen, Hatice Savas, Gorkem Durak, Alpay Medetalibeyoglu, Ulas Bagci

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

Keywords: Pulmonary Embolism, enhanced diagnostic strategies, Extreme Gradient Boosting, critical illness, cardiovascular condition

Comments: Published at IEEE ICECCME 2024, Maldives, 4-6 November 2024

Click to view abstract

Abstract:Pulmonary Embolism (PE) is a serious cardiovascular condition that remains a leading cause of mortality and critical illness, underscoring the need for enhanced diagnostic strategies. Conventional clinical methods have limited success in predicting 30-day in-hospital mortality of PE patients. In this study, we present a new algorithm, called PEP-Net, for 30-day mortality prediction of PE patients based on the initial imaging data (CT). It opportunistically integrates a 3D Residual Network (3DResNet) with the Extreme Gradient Boosting (XGBoost) algorithm, using patient-level binary labels without annotations of the emboli and their extent. Our proposed system offers a comprehensive prediction strategy by handling class imbalance problems, reducing overfitting via regularization, and reducing the prediction variance for more stable predictions. PEP-Net was tested in a cohort of 193 volumetric CT scans diagnosed with Acute PE, and it significantly outperformed baseline models (76-78%), with an accuracy of 94.5% (+/-0.3) and 94.0% (+/-0.7) when the input image is the lung region (Lung-ROI) or the heart region (Cardiac-ROI), respectively. Our results advance PE prognostics by using only initial imaging data, setting a new benchmark in the field. While purely deep learning models have become the go-to for many medical classification (diagnostic) tasks, the combined ResNet and XGBoost models herein outperform purely deep learning models, potentially because of the limited amount of available data.

146. 【2411.18018】Neural Finite-State Machines for Surgical Phase Recognition

Link: https://arxiv.org/abs/2411.18018

Authors: Hao Ding, Zhongpai Gao, Benjamin Planche, Tianyu Luan, Abhishek Sharma, Meng Zheng, Ange Lou, Terrence Chen, Mathias Unberath, Ziyan Wu

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: analyzing procedure-specific surgical, procedure-specific surgical videos, essential for analyzing, analyzing procedure-specific, procedure-specific surgical

Comments

Click to view abstract

Abstract:Surgical phase recognition is essential for analyzing procedure-specific surgical videos. While recent transformer-based architectures have advanced sequence processing capabilities, they struggle with maintaining consistency across lengthy surgical procedures. Drawing inspiration from classical hidden Markov models' finite-state interpretations, we introduce the neural finite-state machine (NFSM) module, which bridges procedural understanding with deep learning approaches. NFSM combines procedure-level understanding with neural networks through global state embeddings, attention-based dynamic transition tables, and transition-aware training and inference mechanisms for offline and online applications. When integrated into our future-aware architecture, NFSM improves video-level accuracy, phase-level precision, recall, and Jaccard indices on Cholec80 datasets by 2.3, 3.2, 3.0, and 4.8 percentage points respectively. As an add-on module to existing state-of-the-art models like Surgformer, NFSM further enhances performance, demonstrating its complementary value. Extended experiments on non-surgical datasets validate NFSM's generalizability beyond surgical domains. Comprehensive experiments demonstrate that incorporating NFSM into deep learning frameworks enables more robust and consistent phase recognition across long procedural videos.
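To make the role of a transition table concrete, here is a small Python sketch of classical Viterbi decoding over per-frame phase scores. It is not the NFSM module: the abstract describes learned, attention-based dynamic transition tables, whereas this sketch assumes a fixed, hand-written table, and all names and numbers are illustrative.

```python
import math


def viterbi_decode(frame_log_probs, log_transition):
    """Most likely phase sequence under a fixed transition table.

    frame_log_probs: one list of per-phase log-probabilities per frame.
    log_transition[i][j]: log-probability of moving from phase i to phase j
    (use -inf to forbid a transition).
    """
    num_states = len(log_transition)
    score = list(frame_log_probs[0])
    backpointers = []
    for obs in frame_log_probs[1:]:
        new_score, pointers = [], []
        for j in range(num_states):
            # Best previous phase from which to enter phase j.
            best_i = max(range(num_states),
                         key=lambda i: score[i] + log_transition[i][j])
            new_score.append(score[best_i] + log_transition[best_i][j] + obs[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final phase.
    state = max(range(num_states), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    log = math.log
    # Frame-wise argmax would give [0, 1, 0, 1], i.e. a jump back to phase 0.
    frames = [[log(0.9), log(0.1)], [log(0.2), log(0.8)],
              [log(0.6), log(0.4)], [log(0.1), log(0.9)]]
    # Phase 1 can never return to phase 0.
    transition = [[log(0.5), log(0.5)], [-math.inf, 0.0]]
    print(viterbi_decode(frames, transition))
```

The toy example shows the kind of consistency a transition table buys: the noisy third frame is decoded as phase 1 because going back to phase 0 is forbidden.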

147. 【2411.18003】HAAT: Hybrid Attention Aggregation Transformer for Image Super-Resolution

Link: https://arxiv.org/abs/2411.18003

Authors: Song-Jiang Lai, Tsun-Hin Cheung, Ka-Chun Fung, Kai-wen Xue, Kin-Man Lama

Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)

Keywords: global spatial modeling, shifting window attention, research area, global spatial, spatial modeling

Comments: 6 pages, 2 figures, 1 table

Click to view abstract

Abstract:In the research area of image super-resolution, Swin-transformer-based models are favored for their global spatial modeling and shifting window attention mechanism. However, existing methods often limit self-attention to non-overlapping windows to cut costs and ignore the useful information that exists across channels. To address this issue, this paper introduces a novel model, the Hybrid Attention Aggregation Transformer (HAAT), designed to better leverage feature information. HAAT is constructed by integrating Swin-Dense-Residual-Connected Blocks (SDRCB) with Hybrid Grid Attention Blocks (HGAB). SDRCB expands the receptive field while maintaining a streamlined architecture, resulting in enhanced performance. HGAB incorporates channel attention, sparse attention, and window attention to improve nonlocal feature fusion and achieve more visually compelling results. Experimental evaluations demonstrate that HAAT surpasses state-of-the-art methods on benchmark datasets. Keywords: Image super-resolution, Computer vision, Attention mechanism, Transformer

148. 【2411.17870】Breast Tumor Classification Using EfficientNet Deep Learning Model

Link: https://arxiv.org/abs/2411.17870

Authors: Majid Behzadpour, Bengie L. Ortiz, Ebrahim Azizi, Kai Wu

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Precise breast cancer, Precise breast, breast cancer classification, outcome in oncology, breast cancer

Comments: 19 pages, 7 figures

Click to view abstract

Abstract:Precise breast cancer classification on histopathological images has the potential to greatly improve diagnosis and patient outcomes in oncology. The data imbalance problem largely stems from the inherent imbalance within medical image datasets, where certain tumor subtypes appear much less frequently. This is a considerable limitation, since it biases model predictions, which can then overlook critical but rare classes. In this work, we adopted EfficientNet, a state-of-the-art convolutional neural network (CNN) model that balances high accuracy with computational cost efficiency. To address data imbalance, we introduce an intensive data augmentation pipeline and cost-sensitive learning, improving representation and ensuring that the model does not overly favor majority classes. This approach provides the ability to learn effectively from rare tumor types, improving its robustness. Additionally, we fine-tuned the model using transfer learning, where weights initially trained on a binary classification task were adapted to multi-class classification, improving the capability to detect complex patterns within the BreakHis dataset. Our results underscore significant improvements in the binary classification performance, achieving an exceptional recall increase for benign cases from 0.92 to 0.95, alongside an accuracy enhancement from 97.35% to 98.23%. Our approach improved the performance of multi-class tasks from 91.27% with regular augmentation to 94.54% with intensive augmentation, reaching 95.04% with transfer learning. This framework demonstrated substantial gains in precision in the minority classes, such as Mucinous carcinoma and Papillary carcinoma, while maintaining high recall consistently across these critical subtypes, as further confirmed by confusion matrix analysis.
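Cost-sensitive learning, as mentioned in the abstract above, is often implemented by weighting the loss per class. The Python sketch below shows one common variant, inverse-frequency class weights with a weighted cross-entropy. The abstract does not say which weighting scheme the authors used, so the scheme, the function names, and the toy labels are assumptions.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Class weights proportional to 1 / class frequency.

    Scaled so that a perfectly balanced dataset gives every class weight 1.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalized by the total weight."""
    loss_sum = 0.0
    weight_sum = 0.0
    for p, y in zip(probs, labels):
        w = weights[y]
        loss_sum += -w * math.log(p[y])
        weight_sum += w
    return loss_sum / weight_sum


if __name__ == "__main__":
    # Toy labels: class 1 is the rare class.
    labels = [0, 0, 0, 1]
    weights = inverse_frequency_weights(labels)
    print(weights)
    # Confident on the majority class, poor on the rare one.
    probs = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.7, 0.3]]
    print(weighted_cross_entropy(probs, labels, weights))
```

With these weights a mistake on the rare class costs three times as much as one on the majority class, which is what discourages the model from favoring the majority.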

149. 【2411.17850】Reliability of deep learning models for anatomical landmark detection: The role of inter-rater variability

Link: https://arxiv.org/abs/2411.17850

Authors: Soorena Salari, Hassan Rivaz, Yiming Xiao

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: Automated detection, anatomical landmarks plays, surgical applications, anatomical landmark detection, diagnostic and surgical

Comments: Accepted to SPIE Medical Imaging 2025

Click to view abstract

Abstract:Automated detection of anatomical landmarks plays a crucial role in many diagnostic and surgical applications. Progress in deep learning (DL) methods has resulted in significant performance enhancements in tasks related to anatomical landmark detection. While current research focuses on accurately localizing these landmarks in medical scans, the importance of inter-rater annotation variability in building DL models is often overlooked. Understanding how inter-rater variability impacts the performance and reliability of the resulting DL algorithms, which are crucial for clinical deployment, can inform the improvement of training data construction and boost DL models' outcomes. In this paper, we conducted a thorough study of different annotation-fusion strategies to preserve inter-rater variability in DL models for anatomical landmark detection, aiming to boost the performance and reliability of the resulting algorithms. Additionally, we explored the characteristics and reliability of four metrics, including a novel Weighted Coordinate Variance metric to quantify landmark detection uncertainty/inter-rater variability. Our research highlights the crucial connection between inter-rater variability, DL model performance, and uncertainty, revealing how different approaches for multi-rater landmark annotation fusion can influence these factors.

150. 【2411.17845】CAMLD: Contrast-Agnostic Medical Landmark Detection with Consistency-Based Regularization

Link: https://arxiv.org/abs/2411.17845

Authors: Soorena Salari, Arash Harirpoush, Hassan Rivaz, Yiming Xiao

Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)

Keywords: including disease diagnosis, Anatomical landmark detection, research applications, surgical planning, disease diagnosis

Comments: 14 pages, 6 figures, 3 tables

Click to view abstract

Abstract:Anatomical landmark detection in medical images is essential for various clinical and research applications, including disease diagnosis and surgical planning. However, manual landmark annotation is time-consuming and requires significant expertise. Existing deep learning (DL) methods often require large amounts of well-annotated data, which are costly to acquire. In this paper, we introduce CAMLD, a novel self-supervised DL framework for anatomical landmark detection in unlabeled scans with varying contrasts by using only a single reference example. To achieve this, we employed an inter-subject landmark consistency loss with an image registration loss while introducing a 3D convolution-based contrast augmentation strategy to promote model generalization to new contrasts. Additionally, we utilize an adaptive mixed loss function to schedule the contributions of different sub-tasks for optimal outcomes. We demonstrate the proposed method with the intricate task of MRI-based 3D brain landmark detection. With comprehensive experiments on four diverse clinical and public datasets, including both T1w and T2w MRI scans at different MRI field strengths, we demonstrate that CAMLD outperforms the state-of-the-art methods in terms of mean radial errors (MREs) and success detection rates (SDRs). Our framework provides a robust and accurate solution for anatomical landmark detection, reducing the need for extensively annotated datasets and generalizing well across different imaging contrasts. Our code will be publicly available at: this https URL.
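The two evaluation metrics named in the abstract above, mean radial error (MRE) and success detection rate (SDR), can be sketched in a few lines of Python. The sketch assumes landmark coordinates are already in millimetres and counts a detection as successful when its error is at or below the threshold; the function names and that threshold convention are assumptions rather than details from the paper.

```python
import math


def radial_errors(predicted, reference):
    """Euclidean distance between each predicted and reference landmark."""
    return [math.dist(p, r) for p, r in zip(predicted, reference)]


def mean_radial_error(predicted, reference):
    """Average landmark localization error (same unit as the coordinates)."""
    errors = radial_errors(predicted, reference)
    return sum(errors) / len(errors)


def success_detection_rate(predicted, reference, threshold):
    """Fraction of landmarks whose error is within the threshold."""
    errors = radial_errors(predicted, reference)
    return sum(1 for e in errors if e <= threshold) / len(errors)


if __name__ == "__main__":
    # Toy 3D landmarks, assumed to be in millimetres.
    predicted = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
    reference = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
    print(mean_radial_error(predicted, reference))
    print(success_detection_rate(predicted, reference, threshold=2.0))
```

SDR is usually reported at several thresholds, since a single MRE value can hide a few large misses among many accurate detections.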