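This post lists the latest papers retrieved daily from arXiv, grouped by category: natural language processing, information retrieval, computer vision, and others.
Statistics
526 papers were updated today, including:
- Natural language processing: 111 papers
- Information retrieval: 28 papers
- Computer vision: 78 papers
Natural Language Processing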
1. [2502.19416] Norm Growth and Stability Challenges in Localized Sequential Knowledge Editing
Link: https://arxiv.org/abs/2502.19416
Authors: Akshat Gupta, Christine Fang, Atahan Ozdemir, Maochuan Lu, Ahmed Alaa, Thomas Hartvigsen, Gopala Anumanchipalli
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: modifying specific facts, broader model capabilities, large language models, altering broader model, study investigates
Comments: Accepted for Oral Presentation at KnowFM @ AAAI 2025. arXiv admin note: text overlap with [arXiv:2502.01636](https://arxiv.org/abs/2502.01636)
Abstract:This study investigates the impact of localized updates to large language models (LLMs), specifically in the context of knowledge editing - a task aimed at incorporating or modifying specific facts without altering broader model capabilities. We first show that across different post-training interventions like continuous pre-training, full fine-tuning and LoRA-based fine-tuning, the Frobenius norm of the updated matrices always increases. This increasing norm is especially detrimental for localized knowledge editing, where only a subset of matrices are updated in a model. We reveal a consistent phenomenon across various editing techniques, including fine-tuning, hypernetwork-based approaches, and locate-and-edit methods: the norm of the updated matrix invariably increases with successive updates. Such growth disrupts model balance, particularly when isolated matrices are updated while the rest of the model remains static, leading to potential instability and degradation of downstream performance. Upon deeper investigations of the intermediate activation vectors, we find that the norm of internal activations decreases and is accompanied by shifts in the subspaces occupied by these activations, which shows that these activation vectors now occupy completely different regions in the representation space compared to the unedited model. With our paper, we highlight the technical challenges with continuous and localized sequential knowledge editing and their implications for maintaining model stability and utility.
2. [2502.19413] Project Alexandria: Towards Freeing Scientific Knowledge from Copyright Burdens via LLMs
Link: https://arxiv.org/abs/2502.19413
Authors: Christoph Schuhmann, Gollam Rabby, Ameya Prabhu, Tawsif Ahmed, Andreas Hochlehnert, Huu Nguyen, Nick Akinci Heidrich, Ludwig Schmidt, Robert Kaczmarczyk, Sören Auer, Jenia Jitsev, Matthias Bethge
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: knowledge, Knowledge Units, scientific knowledge, rules often restrict, restrict the broad
Comments: Technical Report
Abstract:Paywalls, licenses and copyright rules often restrict the broad dissemination and reuse of scientific knowledge. We take the position that it is both legally and technically feasible to extract the scientific knowledge in scholarly texts. Current methods, like text embeddings, fail to reliably preserve factual content, and simple paraphrasing may not be legally sound. We urge the community to adopt a new idea: convert scholarly documents into Knowledge Units using LLMs. These units use structured data capturing entities, attributes and relationships without stylistic content. We provide evidence that Knowledge Units: (1) form a legally defensible framework for sharing knowledge from copyrighted research texts, based on legal analyses of German copyright law and U.S. Fair Use doctrine, and (2) preserve most (~95%) factual knowledge from original text, measured by MCQ performance on facts from the original copyrighted text across four research domains. Freeing scientific knowledge from copyright promises transformative benefits for scientific research and education by allowing language models to reuse important facts from copyrighted text. To support this, we share open-source tools for converting research documents into Knowledge Units. Overall, our work posits the feasibility of democratizing access to scientific knowledge while respecting copyright.
3. [2502.19412] The Mighty ToRR: A Benchmark for Table Reasoning and Robustness
Link: https://arxiv.org/abs/2502.19412
Authors: Shir Ashury-Tahan, Yifan Mai, Rajmohan C, Ariel Gera, Yotam Perlitz, Asaf Yehudai, Elron Bandel, Leshem Choshen, Eyal Shnarch, Percy Liang, Michal Shmueli-Scheuer
Categories: Computation and Language (cs.CL)
Keywords: real-world significance, leaving uncertainty, configuration to adopt, tabular data, data remains underexplored
Abstract:Despite its real-world significance, model performance on tabular data remains underexplored, leaving uncertainty about which model to rely on and which prompt configuration to adopt. To address this gap, we create ToRR, a benchmark for Table Reasoning and Robustness, that measures model performance and robustness on table-related tasks. The benchmark includes 10 datasets that cover different types of table reasoning capabilities across varied domains. ToRR goes beyond model performance rankings, and is designed to reflect whether models can handle tabular data consistently and robustly, across a variety of common table representation formats. We present a leaderboard as well as comprehensive analyses of the results of leading models over ToRR. Our results reveal a striking pattern of brittle model behavior, where even strong models are unable to perform robustly on tabular data tasks. Although no specific table format leads to consistently better performance, we show that testing over multiple formats is crucial for reliably estimating model capabilities. Moreover, we show that the reliability boost from testing multiple prompts can be equivalent to adding more test examples. Overall, our findings show that table understanding and reasoning tasks remain a significant challenge.
4. [2502.19411] Code to Think, Think to Code: A Survey on Code-Enhanced Reasoning and Reasoning-Driven Code Intelligence in LLMs
Link: https://arxiv.org/abs/2502.19411
Authors: Dayu Yang, Tianyang Liu, Daoan Zhang, Antoine Simoulin, Xiaoyi Liu, Yuwei Cao, Zhaopu Teng, Xin Qian, Grey Yang, Jiebo Luo, Julian McAuley
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Keywords: translates high-level goals, large language models, reasoning translates high-level, goals into smaller, executable steps
Comments: Project Repo: [this https URL](https://github.com/dayuyang1999/Awesome-Code-Reasoning)
Abstract:In large language models (LLMs), code and reasoning reinforce each other: code offers an abstract, modular, and logic-driven structure that supports reasoning, while reasoning translates high-level goals into smaller, executable steps that drive more advanced code intelligence. In this study, we examine how code serves as a structured medium for enhancing reasoning: it provides verifiable execution paths, enforces logical decomposition, and enables runtime validation. We also explore how improvements in reasoning have transformed code intelligence from basic completion to advanced capabilities, enabling models to address complex software engineering tasks through planning and debugging. Finally, we identify key challenges and propose future research directions to strengthen this synergy, ultimately improving LLMs' performance in both areas.
5. [2502.19409] ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models
Link: https://arxiv.org/abs/2502.19409
Authors: Danae Sánchez Villegas, Ingo Ziegler, Desmond Elliott
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: large language models, multimodal large language, remains a challenge, large language, language models
Comments: Code, dataset, and checkpoints are publicly available at [this https URL](https://github.com/danaesavi/ImageChain)
Abstract:Reasoning over sequences of images remains a challenge for multimodal large language models (MLLMs). While recent models incorporate multi-image data during pre-training, they still struggle to recognize sequential structures, often treating images independently. This work introduces ImageChain, a framework that enhances MLLMs with sequential reasoning capabilities over image data by modeling visual sequences as a multi-turn conversation. In ImageChain, images are interleaved with corresponding textual descriptions to form a controlled dialogue that explicitly captures temporal dependencies and narrative progression. Our method optimizes for the task of next-scene description, where the model generates a context-aware description of an upcoming scene based on preceding visual and textual cues. We demonstrate that our approach improves performance on the next-scene description task -- achieving an average improvement from 3.7% to 19% in SimRate, a metric that quantifies semantic similarity to human-annotated ground truths. Moreover, ImageChain achieves robust zero-shot out-of-domain performance in applications ranging from comics to robotics. Extensive experiments validate that instruction-tuning in a multimodal, multi-turn conversation design is key to bridging the gap between static image understanding and temporally-aware reasoning.
6. [2502.19407] Learning Code-Edit Embedding to Model Student Debugging Behavior
Link: https://arxiv.org/abs/2502.19407
Authors: Hasnain Heickal, Andrew Lan
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Keywords: Providing effective feedback, computer science education, Providing effective, iteratively submitting code, students solve problems
Abstract:Providing effective feedback for programming assignments in computer science education can be challenging: students solve problems by iteratively submitting code, executing it, and using limited feedback from the compiler or the auto-grader to debug. Analyzing student debugging behavior in this process may reveal important insights into their knowledge and inform better personalized support tools. In this work, we propose an encoder-decoder-based model that learns meaningful code-edit embeddings between consecutive student code submissions, to capture their debugging behavior. Our model leverages information on whether a student code submission passes each test case to fine-tune large language models (LLMs) to learn code editing representations. It enables personalized next-step code suggestions that maintain the student's coding style while improving test case correctness. Our model also enables us to analyze student code-editing patterns to uncover common student errors and debugging behaviors, using clustering techniques. Experimental results on a real-world student code submission dataset demonstrate that our model excels at code reconstruction and personalized code suggestion while revealing interesting patterns in student debugging behavior.
7. [2502.19400] TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding
Link: https://arxiv.org/abs/2502.19400
Authors: Max Ku, Thomas Chong, Jonathan Leung, Krish Shah, Alvin Yu, Wenhu Chen
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Keywords: Understanding domain-specific theorems, Understanding domain-specific, effective communication, structured visual explanations, communication through structured
Abstract:Understanding domain-specific theorems often requires more than just text-based reasoning; effective communication through structured visual explanations is crucial for deeper comprehension. While large language models (LLMs) demonstrate strong performance in text-based theorem reasoning, their ability to generate coherent and pedagogically meaningful visual explanations remains an open challenge. In this work, we introduce TheoremExplainAgent, an agentic approach for generating long-form theorem explanation videos (over 5 minutes) using Manim animations. To systematically evaluate multimodal theorem explanations, we propose TheoremExplainBench, a benchmark covering 240 theorems across multiple STEM disciplines, along with 5 automated evaluation metrics. Our results reveal that agentic planning is essential for generating detailed long-form videos, and the o3-mini agent achieves a success rate of 93.8% and an overall score of 0.77. However, our quantitative and qualitative studies show that most of the videos produced exhibit minor issues with visual element layout. Furthermore, multimodal explanations expose deeper reasoning flaws that text-based explanations fail to reveal, highlighting the importance of multimodal explanations.
8. [2502.19387] Residual Speech Embeddings for Tone Classification: Removing Linguistic Content to Enhance Paralinguistic Analysis
Link: https://arxiv.org/abs/2502.19387
Authors: Hamdan Al Ahbabi, Gautier Marti, Saeed AlMarri, Ibrahim Elfadel
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Keywords: analyze tone independently, Self-supervised learning models, making it challenging, challenging to analyze, independently of spoken
Abstract:Self-supervised learning models for speech processing, such as wav2vec2, HuBERT, WavLM, and Whisper, generate embeddings that capture both linguistic and paralinguistic information, making it challenging to analyze tone independently of spoken content. In this work, we introduce a method for disentangling paralinguistic features from linguistic content by regressing speech embeddings onto their corresponding text embeddings and using the residuals as a representation of vocal tone. We evaluate this approach across multiple self-supervised speech embeddings, demonstrating that residual embeddings significantly improve tone classification performance compared to raw speech embeddings. Our results show that this method enhances linear separability, enabling improved classification even with simple models such as logistic regression. Visualization of the residual embeddings further confirms the successful removal of linguistic information while preserving tone-related features. These findings highlight the potential of residual embeddings for applications in sentiment analysis, speaker characterization, and paralinguistic speech processing.
9. [2502.19363] DataMan: Data Manager for Pre-training Large Language Models
Link: https://arxiv.org/abs/2502.19363
Authors: Ru Peng, Kexin Yang, Yawen Zeng, Junyang Lin, Dayiheng Liu, Junbo Zhao
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: scaling laws makes, data increasingly important, data scaling laws, increasingly important, emergence of large
Comments: ICLR 2025 paper
Abstract:The performance emergence of large language models (LLMs) driven by data scaling laws makes the selection of pre-training data increasingly important. However, existing methods rely on limited heuristics and human intuition, lacking comprehensive and clear guidelines. To address this, we are inspired by "reverse thinking" -- prompting LLMs to self-identify which criteria benefit their performance. As their pre-training capabilities are related to perplexity (PPL), we derive 14 quality criteria from the causes of text perplexity anomalies and introduce 15 common application domains to support domain mixing. In this paper, we train a Data Manager (DataMan) to learn quality ratings and domain recognition from pointwise rating, and use it to annotate a 447B token pre-training corpus with 14 quality ratings and domain type. Our experiments validate our approach, using DataMan to select 30B tokens to train a 1.3B-parameter language model, demonstrating significant improvements in in-context learning (ICL), perplexity, and instruction-following ability over the state-of-the-art baseline. The best-performing model, based on the Overall Score l=5, surpasses a model trained with 50% more data using uniform sampling. We continue pre-training with high-rated, domain-specific data annotated by DataMan to enhance domain-specific ICL performance and thus verify DataMan's domain mixing ability. Our findings emphasize the importance of quality ranking, the complementary nature of quality criteria, and their low correlation with perplexity, analyzing misalignment between PPL and ICL performance. We also thoroughly analyzed our pre-training dataset, examining its composition, the distribution of quality ratings, and the original document sources.
10. [2502.19361] Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?
Link: https://arxiv.org/abs/2502.19361
Authors: Yancheng He, Shilong Li, Jiaheng Liu, Weixun Wang, Xingyuan Bu, Ge Zhang, Zhongyuan Peng, Zhaoxiang Zhang, Wenbo Su, Bo Zheng
Categories: Computation and Language (cs.CL)
Keywords: existing Large Language, Large Language Models, Large Language, drawn significant attention, long CoT reasoning
Comments: The first three authors contributed equally, 27 pages
Abstract:Recently, o1-like models have drawn significant attention, where these models produce the long Chain-of-Thought (CoT) reasoning steps to improve the reasoning abilities of existing Large Language Models (LLMs). In this paper, to understand the qualities of these long CoTs and measure the critique abilities of existing LLMs on these long CoTs, we introduce the DeltaBench, including the generated long CoTs from different o1-like models (e.g., QwQ, DeepSeek-R1) for different reasoning tasks (e.g., Math, Code, General Reasoning), to measure the ability to detect errors in long CoT reasoning. Based on DeltaBench, we first perform fine-grained analysis of the generated long CoTs to discover the effectiveness and efficiency of different o1-like models. Then, we conduct extensive evaluations of existing process reward models (PRMs) and critic models to detect the errors of each annotated process, which aims to investigate the boundaries and limitations of existing PRMs and critic models. Finally, we hope that DeltaBench could guide developers to better understand the long CoT reasoning abilities of their models.
11. [2502.19347] Controlled Diversity: Length-optimized Natural Language Generation
Link: https://arxiv.org/abs/2502.19347
Authors: Diana Marie Schenke, Timo Baumann
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: strict length requirements, length requirements, improve their usefulness, usefulness in applications, applications that require
Comments: ISCA/ITG Workshop on Diversity in Large Speech and Language Models
Abstract:LLMs are not generally able to adjust the length of their outputs based on strict length requirements, a capability that would improve their usefulness in applications that require adherence to diverse user and system requirements. We present an approach to train LLMs to acquire this capability by augmenting existing data and applying existing fine-tuning techniques, which we compare based on the trained models' adherence to the length requirement and overall response quality relative to the baseline model. Our results demonstrate that these techniques can be successfully applied to train LLMs to adhere to length requirements, with the trained models generating texts which better align to the length requirements. Our results indicate that our method may change the response quality when using training data that was not generated by the baseline model. This allows simultaneous alignment to another training objective in certain scenarios, but is undesirable otherwise. Training on a dataset containing the model's own responses eliminates this issue.
12. [2502.19339] Evaluating LLMs and Pre-trained Models for Text Summarization Across Diverse Datasets
Link: https://arxiv.org/abs/2502.19339
Authors: Tohida Rehman, Soumabha Ghosh, Kuntal Das, Souvik Bhattacharjee, Debarshi Kumar Sanyal, Samiran Chattopadhyay
Categories: Computation and Language (cs.CL)
Keywords: Text summarization plays, condensing large volumes, natural language processing, plays a crucial, crucial role
Comments: 5 pages, 2 figures, 6 tables
Abstract:Text summarization plays a crucial role in natural language processing by condensing large volumes of text into concise and coherent summaries. As digital content continues to grow rapidly and the demand for effective information retrieval increases, text summarization has become a focal point of research in recent years. This study offers a thorough evaluation of four leading pre-trained and open-source large language models: BART, FLAN-T5, LLaMA-3-8B, and Gemma-7B, across five diverse datasets CNN/DM, Gigaword, News Summary, XSum, and BBC News. The evaluation employs widely recognized automatic metrics, including ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, and METEOR, to assess the models' capabilities in generating coherent and informative summaries. The results reveal the comparative strengths and limitations of these models in processing various text types.
13. [2502.19328] Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems
Link: https://arxiv.org/abs/2502.19328
Authors: Hao Peng, Yunjia Qi, Xiaozhi Wang, Zijun Yao, Bin Xu, Lei Hou, Juanzi Li
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: large language models, Reward models, Reward, verifiable correctness signals, inference-time scaling
Comments: 16 pages, 5 figures
Abstract:Reward models (RMs) are crucial for the training and inference-time scaling up of large language models (LLMs). However, existing reward models primarily focus on human preferences, neglecting verifiable correctness signals which have shown strong potential in training LLMs. In this paper, we propose agentic reward modeling, a reward system that combines reward models with verifiable correctness signals from different aspects to provide reliable rewards. We empirically implement a reward agent, named RewardAgent, that combines human preference rewards with two verifiable signals: factuality and instruction following, to provide more reliable rewards. We conduct comprehensive experiments on existing reward model benchmarks and inference time best-of-n searches on real-world downstream tasks. RewardAgent significantly outperforms vanilla reward models, demonstrating its effectiveness. We further construct training preference pairs using RewardAgent and train an LLM with the DPO objective, achieving superior performance on various NLP benchmarks compared to conventional reward models. Our codes are publicly released to facilitate further research (this https URL).
14. [2502.19320] Shh, don't say that! Domain Certification in LLMs
Link: https://arxiv.org/abs/2502.19320
Authors: Cornelius Emde, Alasdair Paren, Preetham Arvind, Maxime Kayser, Tom Rainforth, Thomas Lukasiewicz, Bernard Ghanem, Philip H.S. Torr, Adel Bibi
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Machine Learning (stat.ML)
Keywords: perform constrained tasks, Large language models, Large language, constrained tasks, deployed to perform
Comments: 10 pages, includes appendix. Published in the International Conference on Learning Representations (ICLR) 2025
Abstract:Large language models (LLMs) are often deployed to perform constrained tasks, with narrow domains. For example, customer support bots can be built on top of LLMs, relying on their broad language understanding and capabilities to enhance performance. However, these LLMs are adversarially susceptible, potentially generating outputs outside the intended domain. To formalize, assess, and mitigate this risk, we introduce domain certification; a guarantee that accurately characterizes the out-of-domain behavior of language models. We then propose a simple yet effective approach, which we call VALID that provides adversarial bounds as a certificate. Finally, we evaluate our method across a diverse set of datasets, demonstrating that it yields meaningful certificates, which bound the probability of out-of-domain samples tightly with minimum penalty to refusal behavior.
15. [2502.19312] FSPO: Few-Shot Preference Optimization of Synthetic Preference Data in LLMs Elicits Effective Personalization to Real Users
Link: https://arxiv.org/abs/2502.19312
Authors: Anikait Singh, Sheryl Hsu, Kyle Hsu, Eric Mitchell, Stefano Ermon, Tatsunori Hashimoto, Archit Sharma, Chelsea Finn
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (stat.ML)
Keywords: Effective personalization, content curation, broad range, range of user-interfacing, user-interfacing applications
Comments: Website: [this https URL](https://fewshot-preference-optimization.github.io/)
Abstract:Effective personalization of LLMs is critical for a broad range of user-interfacing applications such as virtual assistants and content curation. Inspired by the strong in-context learning capabilities of LLMs, we propose Few-Shot Preference Optimization (FSPO), which reframes reward modeling as a meta-learning problem. Under this framework, an LLM learns to quickly adapt to a user via a few labeled preferences from that user, constructing a personalized reward function for them. Additionally, since real-world preference data is scarce and challenging to collect at scale, we propose careful design choices to construct synthetic preference datasets for personalization, generating over 1M synthetic personalized preferences using publicly available LLMs. In particular, to successfully transfer from synthetic data to real users, we find it crucial for the data to exhibit both high diversity and coherent, self-consistent structure. We evaluate FSPO on personalized open-ended generation for up to 1,500 synthetic users across three domains: movie reviews, pedagogical adaptation based on educational background, and general question answering, along with a controlled human study. Overall, FSPO achieves an 87% Alpaca Eval winrate on average in generating responses that are personalized to synthetic users and a 72% winrate with real human users in open-ended question answering.
16. [2502.19279] CritiQ: Mining Data Quality Criteria from Human Preferences
Link: https://arxiv.org/abs/2502.19279
Authors: Honglin Guo, Kai Lv, Qipeng Guo, Tianyi Liang, Zhiheng Xi, Demin Song, Qiuyinzhe Zhang, Yu Sun, Kai Chen, Xipeng Qiu, Tao Gui
Categories: Computation and Language (cs.CL)
Keywords: Language model heavily, model heavily depends, Language model, heavily depends, depends on high-quality
Abstract:Language models depend heavily on high-quality data for optimal performance. Existing approaches rely on manually designed heuristics, the perplexity of existing models, training classifiers, or careful prompt engineering, which require significant expert experience and human annotation effort while introducing biases. We introduce CritiQ, a novel data selection method that automatically mines criteria from human preferences for data quality with only ~30 human-annotated pairs and performs efficient data selection. The main component, CritiQ Flow, employs a manager agent to evolve quality criteria and worker agents to make pairwise judgments. We build a knowledge base that extracts quality criteria from previous work to boost CritiQ Flow. Compared to perplexity- and classifier-based methods, verbal criteria are more interpretable and possess reusable value. After deriving the criteria, we train the CritiQ Scorer to give quality scores and perform efficient data selection. We demonstrate the effectiveness of our method in the code, math, and logic domains, achieving high accuracy on human-annotated test sets. To validate the quality of the selected data, we continually train Llama 3.1 models and observe improved performance on downstream tasks compared to uniform sampling. Ablation studies validate the benefits of the knowledge base and the reflection process. We analyze how criteria evolve and the effectiveness of majority voting.
17. [2502.19276] Disentangled VAD Representations via a Variational Framework for Political Stance Detection
Link: https://arxiv.org/abs/2502.19276
Authors: Beiyu Xu, Zhiwei Liu, Sophia Ananiadou
Categories: Computation and Language (cs.CL)
Keywords: stance detection, aims to categorise, stance, detection, detection task aims
Abstract:The stance detection task aims to categorise the stance regarding specified targets. Current methods face challenges in effectively integrating sentiment information for stance detection. Moreover, the role of highly granular sentiment labelling in stance detection has been largely overlooked. This study presents a novel stance detection framework utilizing a variational autoencoder (VAE) to disentangle latent emotional features (valence, arousal, and dominance; VAD) from political discourse on social media. This approach addresses limitations in current methods, particularly in in-target and cross-target stance detection scenarios. This research uses an advanced emotional annotation tool to annotate seven-class sentiment labels for P-STANCE. Evaluations on benchmark datasets, including P-STANCE and SemEval-2016, reveal that PoliStance-VAE achieves state-of-the-art performance, surpassing models like BERT, BERTweet, and GPT-4o. PoliStance-VAE offers a robust and interpretable solution for stance detection, demonstrating the effectiveness of integrating nuanced emotional representations. This framework paves the way for advancements in natural language processing tasks, particularly those requiring detailed emotional understanding.
18. [2502.19261] Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization
Link: https://arxiv.org/abs/2502.19261
Authors: Taishi Nakamura, Takuya Akiba, Kazuki Fujii, Yusuke Oda, Rio Yokota, Jun Suzuki
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: inference cost significantly, cost significantly compared, architecture reduces, equivalent capacity, inference cost
Comments: To appear at the 13th International Conference on Learning Representations (ICLR 2025)
Abstract:The Mixture of Experts (MoE) architecture reduces the training and inference cost significantly compared to a dense model of equivalent capacity. Upcycling is an approach that initializes and trains an MoE model using a pre-trained dense model. While upcycling leads to initial performance gains, the training progresses slower than when trained from scratch, leading to suboptimal performance in the long term. We propose Drop-Upcycling - a method that effectively addresses this problem. Drop-Upcycling combines two seemingly contradictory approaches: utilizing the knowledge of pre-trained dense models while statistically re-initializing some parts of the weights. This approach strategically promotes expert specialization, significantly enhancing the MoE model's efficiency in knowledge acquisition. Extensive large-scale experiments demonstrate that Drop-Upcycling significantly outperforms previous MoE construction methods in the long term, specifically when training on hundreds of billions of tokens or more. As a result, our MoE model with 5.9B active parameters achieves comparable performance to a 13B dense model in the same model family, while requiring approximately 1/4 of the training FLOPs. All experimental resources, including source code, training data, model checkpoints and logs, are publicly available to promote reproducibility and future research on MoE.
19. [2502.19249] Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases
Link: https://arxiv.org/abs/2502.19249
Authors: Michael Y. Hu, Jackson Petty, Chuan Shi, William Merrill, Tal Linzen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: natural language, formal language impart, language, formal language, formal language pretraining
Abstract:Pretraining language models on formal languages can improve their acquisition of natural language, but it is unclear which features of the formal language impart an inductive bias that leads to effective transfer. Drawing on insights from linguistics and complexity theory, we hypothesize that effective transfer occurs when the formal language both captures dependency structures in natural language and remains within the computational limitations of the model architecture. Focusing on transformers, we find that formal languages with both these properties enable language models to achieve lower loss on natural language and better linguistic generalization compared to other languages. In fact, pre-pretraining, or training on formal-then-natural language, reduces loss more efficiently than the same amount of natural language. For a 1B-parameter language model trained on roughly 1.6B tokens of natural language, pre-pretraining achieves the same loss and better linguistic generalization with a 33% smaller token budget. We also give mechanistic evidence of cross-task transfer from formal to natural language: attention heads acquired during formal language pretraining remain crucial for the model's performance on syntactic evaluations.
20. 【2502.19230】Two Heads Are Better Than One: Dual-Model Verbal Reflection at Inference-Time
Link:https://arxiv.org/abs/2502.19230
Authors:Jiazheng Li,Yuxiang Zhou,Junru Lu,Gladys Tyen,Lin Gui,Cesare Aloisi,Yulan He
Categories:Computation and Language (cs.CL)
Keywords:Large Language Models, Large Language, complex reasoning scenarios, Language Models, struggle with complex
Comments:
Abstract:Large Language Models (LLMs) often struggle with complex reasoning scenarios. While preference optimization methods enhance reasoning performance through training, they often lack transparency in why one reasoning outcome is preferred over another. Verbal reflection techniques improve explainability but are limited in LLMs' critique and refinement capacity. To address these challenges, we introduce a contrastive reflection synthesis pipeline that enhances the accuracy and depth of LLM-generated reflections. We further propose a dual-model reasoning framework within a verbal reinforcement learning paradigm, decoupling inference-time self-reflection into specialized, trained models for reasoning critique and refinement. Extensive experiments show that our framework outperforms traditional preference optimization methods across all evaluation metrics. Our findings also show that "two heads are better than one", demonstrating that a collaborative Reasoner-Critic model achieves superior reasoning performance and transparency, compared to single-model approaches.
21. 【2502.19211】Negation-Induced Forgetting in LLMs
Link:https://arxiv.org/abs/2502.19211
Authors:Francesca Capuano,Ellen Boschert,Barbara Kaup
Categories:Computation and Language (cs.CL)
Keywords:Large Language Models, Large Language, negating incorrect attributes, affirming correct attributes, object or event
Comments: ISCA/ITG Workshop on Diversity in Large Speech and Language Models
Abstract:The study explores whether Large Language Models (LLMs) exhibit negation-induced forgetting (NIF), a cognitive phenomenon observed in humans where negating incorrect attributes of an object or event leads to diminished recall of this object or event compared to affirming correct attributes (Mayo et al., 2014; Zang et al., 2023). We adapted the experimental framework of Zang et al. (2023) to test this effect in ChatGPT-3.5, GPT-4o mini and Llama-3-70B-Instruct. Our results show that ChatGPT-3.5 exhibits NIF, with negated information being less likely to be recalled than affirmed information. GPT-4o mini showed a marginally significant NIF effect, while Llama-3-70B-Instruct did not exhibit NIF. The findings provide initial evidence of negation-induced forgetting in some LLMs, suggesting that similar cognitive biases may emerge in these models. This work is a preliminary step in understanding how memory-related phenomena manifest in LLMs.
22. 【2502.19209】Bi'an: A Bilingual Benchmark and Model for Hallucination Detection in Retrieval-Augmented Generation
Link:https://arxiv.org/abs/2502.19209
Authors:Zhouyu Jiang,Mengshu Sun,Zhiqiang Zhang,Lei Liang
Categories:Computation and Language (cs.CL)
Keywords:Large Language Models, Large Language, Retrieval-Augmented Generation, effectively reduces hallucinations, Language Models
Comments:
Abstract:Retrieval-Augmented Generation (RAG) effectively reduces hallucinations in Large Language Models (LLMs) but can still produce inconsistent or unsupported content. Although LLM-as-a-Judge is widely used for RAG hallucination detection due to its implementation simplicity, it faces two main challenges: the absence of comprehensive evaluation benchmarks and the lack of domain-optimized judge models. To bridge these gaps, we introduce Bi'an, a novel framework featuring a bilingual benchmark dataset and lightweight judge models. The dataset supports rigorous evaluation across multiple RAG scenarios, while the judge models are fine-tuned from compact open-source LLMs. Extensive experimental evaluations on Bi'anBench show our 14B model outperforms baseline models with over five times larger parameter scales and rivals state-of-the-art closed-source LLMs. We will release our data and models soon at this https URL.
23. 【2502.19208】MultiConAD: A Unified Multilingual Conversational Dataset for Early Alzheimer's Detection
Link:https://arxiv.org/abs/2502.19208
Authors:Arezo Shakeri,Mina Farmanbar,Krisztian Balog
Categories:Computation and Language (cs.CL)
Keywords:Alzheimer disease, progressive cognitive syndrome, syndrome with Alzheimer, Mild Cognitive Impairment, progressive cognitive
Comments: 11 pages, 3 figures
Abstract:Dementia is a progressive cognitive syndrome with Alzheimer's disease (AD) as the leading cause. Conversation-based AD detection offers a cost-effective alternative to clinical methods, as language dysfunction is an early biomarker of AD. However, most prior research has framed AD detection as a binary classification problem, limiting the ability to identify Mild Cognitive Impairment (MCI), a crucial stage for early intervention. In addition, studies primarily rely on single-language datasets, mainly in English, restricting cross-language generalizability. To address this gap, we make three key contributions. First, we introduce a novel, multilingual dataset for AD detection by unifying 16 publicly available dementia-related conversational datasets. This corpus spans English, Spanish, Chinese, and Greek and incorporates both audio and text data derived from a variety of cognitive assessment tasks. Second, we perform finer-grained classification, including MCI, and evaluate various classifiers using sparse and dense text representations. Third, we conduct experiments in monolingual and multilingual settings, finding that some languages benefit from multilingual training while others perform better independently. This study highlights the challenges in multilingual AD detection and enables future research on both language-specific approaches and techniques aimed at improving model generalization and robustness.
24. 【2502.19207】FaithUn: Toward Faithful Forgetting in Language Models by Investigating the Interconnectedness of Knowledge
Link:https://arxiv.org/abs/2502.19207
Authors:Nakyeong Yang,Minsung Kim,Seunghyun Yoon,Joongbo Shin,Kyomin Jung
Categories:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords:unauthorized exposure, unlearning, sensitive or private, language model, model to prevent
Comments: 16 pages
Abstract:Various studies have attempted to remove sensitive or private knowledge from a language model to prevent its unauthorized exposure. However, prior studies have overlooked the complex and interconnected nature of knowledge, where related knowledge must be carefully examined. Specifically, they have failed to evaluate whether an unlearning method faithfully erases interconnected knowledge that should be removed while retaining knowledge that appears relevant but exists in a completely different context. To resolve this problem, we first define a new concept called superficial unlearning, which refers to the phenomenon where an unlearning method either fails to erase the interconnected knowledge it should remove or unintentionally erases irrelevant knowledge. Based on the definition, we introduce a new benchmark, FaithUn, to analyze and evaluate the faithfulness of unlearning in real-world knowledge QA settings. Furthermore, we propose a novel unlearning method, KLUE, which updates only knowledge-related neurons to achieve faithful unlearning. KLUE identifies knowledge neurons using an explainability method and updates only those neurons using selected unforgotten samples. Experimental results demonstrate that widely used unlearning methods fail to ensure faithful unlearning, while our method shows significant effectiveness in real-world QA unlearning.
25. 【2502.19202】LiGT: Layout-infused Generative Transformer for Visual Question Answering on Vietnamese Receipts
Link:https://arxiv.org/abs/2502.19202
Authors:Thanh-Phong Le,Trung Le Chi Phan,Nghia Hieu Nguyen,Kiet Van Nguyen
Categories:Computation and Language (cs.CL)
Keywords:Visual Question Answering, Document Visual Question, Question Answering, Visual Question
Comments: Accepted at IJDAR
Abstract:Purpose: Document Visual Question Answering (document VQA) challenges multimodal systems to holistically handle textual, layout, and visual modalities to provide appropriate answers. Document VQA has gained popularity in recent years due to the increasing number of documents and the high demand for digitization. Nonetheless, most document VQA datasets are developed in high-resource languages such as English. Methods: In this paper, we present ReceiptVQA (Receipt Visual Question Answering), the first large-scale document VQA dataset in Vietnamese dedicated to receipts, a document type with high commercial potential. The dataset encompasses 9,000+ receipt images and 60,000+ manually annotated question-answer pairs. In addition, we introduce LiGT (Layout-infused Generative Transformer), a layout-aware encoder-decoder architecture designed to leverage embedding layers of language models to operate layout embeddings, minimizing the use of additional neural modules. Results: Experiments on ReceiptVQA show that our architecture yields promising performance, achieving competitive results compared with strong baselines. Furthermore, by analyzing the experimental results, we found evident patterns showing that encoder-only model architectures are at a considerable disadvantage compared to architectures that can generate answers. We also observed that it is necessary to combine multiple modalities to tackle our dataset, despite the critical role of semantic understanding from language models. Conclusion: We hope that our work will encourage and facilitate future development in Vietnamese document VQA, contributing to a diverse multimodal research community in the Vietnamese language.
26. 【2502.19187】BIG-Bench Extra Hard
Link:https://arxiv.org/abs/2502.19187
Authors:Mehran Kazemi,Bahare Fatemi,Hritik Bansal,John Palowitch,Chrysovalantis Anastasiou,Sanket Vaibhav Mehta,Lalit K. Jain,Virginia Aglietti,Disha Jindal,Peter Chen,Nishanth Dikkala,Gladys Tyen,Xin Liu,Uri Shalit,Silvia Chiappa,Kate Olszewska,Yi Tay,Vinh Q. Tran,Quoc V. Le,Orhan Firat
Categories:Computation and Language (cs.CL)
Keywords:Large language models, Large language, everyday applications, general reasoning capabilities, demanding robust general
Comments:
Abstract:Large language models (LLMs) are increasingly deployed in everyday applications, demanding robust general reasoning capabilities and a diverse reasoning skillset. However, current LLM reasoning benchmarks predominantly focus on mathematical and coding abilities, leaving a gap in evaluating broader reasoning proficiencies. One particular exception is the BIG-Bench dataset, which has served as a crucial benchmark for evaluating the general reasoning capabilities of LLMs, thanks to its diverse set of challenging tasks that allowed for a comprehensive assessment of general reasoning across various skills within a unified framework. However, recent advances in LLMs have led to saturation on BIG-Bench and its harder version, BIG-Bench Hard (BBH). State-of-the-art models achieve near-perfect scores on many tasks in BBH, thus diminishing its utility. To address this limitation, we introduce BIG-Bench Extra Hard (BBEH), a new benchmark designed to push the boundaries of LLM reasoning evaluation. BBEH replaces each task in BBH with a novel task that probes a similar reasoning capability but exhibits significantly increased difficulty. We evaluate various models on BBEH and observe a (harmonic) average accuracy of 9.8% for the best general-purpose model and 44.8% for the best reasoning-specialized model, indicating substantial room for improvement and highlighting the ongoing challenge of achieving robust general reasoning in LLMs. We release BBEH publicly at: this https URL.
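To see why a harmonic average is a stricter summary than an arithmetic one, the sketch below compares the two on made-up per-task accuracies. The task names and numbers are hypothetical, and the abstract does not say how zero scores are handled, so returning 0.0 in that case is an assumption of this sketch.

```python
def arithmetic_mean(values):
    """Plain average of a non-empty list of numbers."""
    return sum(values) / len(values)


def harmonic_mean(values):
    """Harmonic mean of non-negative values.

    Assumption: if any value is 0 the result is defined as 0.0, which is
    the limit of the harmonic mean as that value approaches zero.
    """
    if not values:
        raise ValueError("need at least one value")
    if any(v < 0 for v in values):
        raise ValueError("values must be non-negative")
    if any(v == 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


# Hypothetical per-task accuracies in percent (not taken from the paper).
task_accuracy = {"task_a": 50.0, "task_b": 10.0, "task_c": 2.0}
scores = list(task_accuracy.values())

print("arithmetic:", arithmetic_mean(scores))
print("harmonic:  ", harmonic_mean(scores))
```

The harmonic mean is dominated by the weakest tasks, so a model cannot reach a high aggregate score by excelling on a few tasks while failing the rest.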
27. 【2502.19175】MEDDxAgent: A Unified Modular Agent Framework for Explainable Automatic Differential Diagnosis
Link:https://arxiv.org/abs/2502.19175
Authors:Daniel Rose,Chia-Chien Hung,Marco Lepri,Israa Alqassem,Kiril Gashteovski,Carolin Lawrence
Categories:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords:physicians iteratively refine, Differential Diagnosis, clinical decision-making, based on symptoms, fundamental yet complex
Comments:
Abstract:Differential Diagnosis (DDx) is a fundamental yet complex aspect of clinical decision-making, in which physicians iteratively refine a ranked list of possible diseases based on symptoms, antecedents, and medical knowledge. While recent advances in large language models have shown promise in supporting DDx, existing approaches face key limitations, including single-dataset evaluations, isolated optimization of components, unrealistic assumptions about complete patient profiles, and single-attempt diagnosis. We introduce a Modular Explainable DDx Agent (MEDDxAgent) framework designed for interactive DDx, where diagnostic reasoning evolves through iterative learning, rather than assuming a complete patient profile is accessible. MEDDxAgent integrates three modular components: (1) an orchestrator (DDxDriver), (2) a history taking simulator, and (3) two specialized agents for knowledge retrieval and diagnosis strategy. To ensure robust evaluation, we introduce a comprehensive DDx benchmark covering respiratory, skin, and rare diseases. We analyze single-turn diagnostic approaches and demonstrate the importance of iterative refinement when patient profiles are not available at the outset. Our broad evaluation demonstrates that MEDDxAgent achieves over 10% accuracy improvements in interactive DDx across both large and small LLMs, while offering critical explainability into its diagnostic reasoning process.
28. 【2502.19163】TestNUC: Enhancing Test-Time Computing Approaches through Neighboring Unlabeled Data Consistency
Link:https://arxiv.org/abs/2502.19163
Authors:Henry Peng Zou,Zhengyao Gu,Yue Zhou,Yankai Chen,Weizhi Zhang,Liancheng Fang,Yibo Wang,Yangning Li,Kay Liu,Philip S. Yu
Categories:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords:leverage additional computational, additional computational resources, enhancing large language, Test-time computing approaches, large language model
Comments:
Abstract:Test-time computing approaches, which leverage additional computational resources during inference, have been proven effective in enhancing large language model performance. This work introduces a novel, linearly scaling approach, TestNUC, that improves test-time predictions by leveraging the local consistency of neighboring unlabeled data: it classifies an input instance by considering not only the model's prediction on that instance but also its predictions on neighboring unlabeled instances. We evaluate TestNUC across eight diverse datasets, spanning intent classification, topic mining, domain discovery, and emotion detection, demonstrating its consistent superiority over baseline methods such as standard prompting and self-consistency. Furthermore, TestNUC can be seamlessly integrated with existing test-time computing approaches, substantially boosting their performance. Our analysis reveals that TestNUC scales effectively with increasing amounts of unlabeled data and performs robustly across different embedding models, making it practical for real-world applications. Our code is available at this https URL.
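The neighbor-consistency idea described in this abstract can be sketched as a vote over the model's prediction on the input and its predictions on the nearest unlabeled instances. This is a minimal illustration, not the authors' implementation: the function names, the cosine-similarity neighbor search, the unweighted majority vote, and the toy embeddings and labels are all assumptions.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def neighbor_vote(query_emb, query_pred, pool_embs, pool_preds, k=3):
    """Combine the prediction on the query with the predictions on its
    k nearest unlabeled neighbors by unweighted majority vote.

    Ties are resolved in favor of the query's own prediction, because
    Counter.most_common keeps first-inserted items first among equals.
    """
    ranked = sorted(
        range(len(pool_embs)),
        key=lambda i: cosine(query_emb, pool_embs[i]),
        reverse=True,
    )
    votes = Counter([query_pred])
    votes.update(pool_preds[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data: 2-d embeddings and hypothetical model predictions.
pool_embs = [[0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]]
pool_preds = ["refund", "refund", "billing", "billing"]

# The model's own (noisy) guess is "billing", but both nearest neighbors
# were predicted as "refund", so the vote flips the label.
label = neighbor_vote([1.0, 0.0], "billing", pool_embs, pool_preds, k=2)
```

The neighbor search here is a full scan for clarity; with a large unlabeled pool an approximate nearest-neighbor index would be the more practical choice.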
29. 【2502.19160】Detecting Linguistic Indicators for Stereotype Assessment with Large Language Models
Link:https://arxiv.org/abs/2502.19160
Authors:Rebekka Görge,Michael Mock,Héctor Allende-Cid
Categories:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords:introduce data bias, Large Language Models, bias into Large, Large Language, linguistic indicators
Comments:
Abstract:Social categories and stereotypes are embedded in language and can introduce data bias into Large Language Models (LLMs). Despite safeguards, these biases often persist in model behavior, potentially leading to representational harm in outputs. While sociolinguistic research provides valuable insights into the formation of stereotypes, NLP approaches for stereotype detection rarely draw on this foundation and often lack objectivity, precision, and interpretability. To fill this gap, in this work we propose a new approach that detects and quantifies the linguistic indicators of stereotypes in a sentence. We derive linguistic indicators from the Social Category and Stereotype Communication (SCSC) framework, which indicate strong social category formulation and stereotyping in language, and use them to build a categorization scheme. To automate this approach, we instruct different LLMs using in-context learning to apply the approach to a sentence, where the LLM examines the linguistic properties and provides a basis for a fine-grained assessment. Based on an empirical evaluation of the importance of different linguistic indicators, we learn a scoring function that measures the linguistic indicators of a stereotype. Our annotations of stereotyped sentences show that these indicators are present in these sentences and explain the strength of a stereotype. In terms of model performance, our results show that the models generally perform well in detecting and classifying linguistic indicators of category labels used to denote a category, but sometimes struggle to correctly evaluate the associated behaviors and characteristics. Using more few-shot examples within the prompts significantly improves performance. Model performance increases with size, as Llama-3.3-70B-Instruct and GPT-4 achieve comparable results that surpass those of Mixtral-8x7B-Instruct, GPT-4-mini and Llama-3.1-8B-Instruct.
30. 【2502.19158】When Personalization Meets Reality: A Multi-Faceted Analysis of Personalized Preference Learning
Link:https://arxiv.org/abs/2502.19158
Authors:Yijiang River Dong,Tiancheng Hu,Yinhong Liu,Ahmet Üstün,Nigel Collier
Categories:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords:Large Language Models, align Large Language, overlooking diverse human, Language Models, Large Language
Comments:
Abstract:While Reinforcement Learning from Human Feedback (RLHF) is widely used to align Large Language Models (LLMs) with human preferences, it typically assumes homogeneous preferences across users, overlooking diverse human values and minority viewpoints. Although personalized preference learning addresses this by tailoring separate preferences for individual users, the field lacks standardized methods to assess its effectiveness. We present a multi-faceted evaluation framework that measures not only performance but also fairness, unintended effects, and adaptability across varying levels of preference divergence. Through extensive experiments comparing eight personalization methods across three preference datasets, we demonstrate that performance differences between methods could reach 36% when users strongly disagree, and personalization can introduce up to 20% safety misalignment. These findings highlight the critical need for holistic evaluation approaches to advance the development of more effective and inclusive preference learning systems.
31. 【2502.19149】Isolating Language-Coding from Problem-Solving: Benchmarking LLMs with PseudoEval
Link:https://arxiv.org/abs/2502.19149
Authors:Jiarong Wu,Songqiang Chen,Jialun Cao,Hau Ching Lo,Shing-Chi Cheung
Categories:Software Engineering (cs.SE); Computation and Language (cs.CL)
Keywords:Large Language Models, HumanEval and MBPP, MBPP are designed, Language Models, code generation
Comments:
Abstract:Existing code generation benchmarks for Large Language Models (LLMs) such as HumanEval and MBPP are designed to study LLMs' end-to-end performance, where the benchmarks feed a problem description in natural language as input and examine the generated code in specific programming languages. However, the evaluation scores obtained this way give little hint as to the bottleneck of code generation: whether LLMs are struggling with their problem-solving capability or their language-coding capability. To answer this question, we construct PseudoEval, a multilingual code generation benchmark that provides a solution written in pseudocode as input. By doing so, the bottleneck of code generation in various programming languages can be isolated and identified. Our study yields several interesting findings. For example, we identify that the bottleneck of LLMs in Python programming is problem-solving, while in Rust they struggle relatively more with language-coding. Also, our study indicates that problem-solving capability may transfer across programming languages, while language-coding needs more language-specific effort, especially for undertrained programming languages. Finally, we release the pipeline for constructing PseudoEval to facilitate its extension to existing benchmarks. PseudoEval is available at: this https URL.
32. 【2502.19148】Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of LLMs
Link:https://arxiv.org/abs/2502.19148
Authors:Zhaowei Zhang,Fengshuo Bai,Qizhi Chen,Chengdong Ma,Mingzhi Wang,Haoran Sun,Zilong Zheng,Yaodong Yang
Categories:Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords:align large language, large language models, static general dataset, user preferences, frequently studied
Comments: Accepted by ICLR 2025, Project page: [this https URL](https://zowiezhang.github.io/projects/Amulet)
Abstract:How to align large language models (LLMs) with user preferences from a static general dataset has been frequently studied. However, user preferences are usually personalized, changing, and diverse regarding culture, values, or time. This leads to the problem that the actual user preferences often do not coincide with those trained by the model developers in the practical use of LLMs. Since we cannot collect enough data and retrain for every demand, researching efficient real-time preference adaptation methods based on the backbone LLMs during test time is important. To this end, we introduce Amulet, a novel, training-free framework that formulates the decoding process of every token as a separate online learning problem with the guidance of simple user-provided prompts, thus enabling real-time optimization to satisfy users' personalized preferences. To reduce the computational cost brought by this optimization process for each token, we additionally provide a closed-form solution for each iteration step of the optimization process, thereby reducing the computational time cost to a negligible level. The detailed experimental results demonstrate that Amulet can achieve significant performance improvements in rich settings with combinations of different LLMs, datasets, and user preferences, while maintaining acceptable computational efficiency.
33. 【2502.19127】Self-Memory Alignment: Mitigating Factual Hallucinations with Generalized Improvement
Link:https://arxiv.org/abs/2502.19127
Authors:Siyuan Zhang,Yichi Zhang,Yinpeng Dong,Hang Su
Categories:Computation and Language (cs.CL)
Keywords:Large Language Models, Large Language, Language Models, objective facts, struggle to align
Comments: 29 pages, 17 figures
Abstract:Large Language Models (LLMs) often struggle to align their responses with objective facts, resulting in factual hallucinations, which can be difficult to detect and can mislead users without relevant knowledge. While post-training techniques have been employed to mitigate the issue, existing methods usually suffer from poor generalization and trade-offs between different capabilities. In this paper, we propose to address it by directly augmenting an LLM's fundamental ability to precisely leverage its existing memory, i.e., the knowledge acquired from pre-training data. We introduce self-memory alignment (SMA), which fine-tunes the model on self-generated responses to precise and simple factual questions through preference optimization. Furthermore, we construct FactualBench, a comprehensive and precise factual QA dataset containing 181k Chinese entries spanning 21 domains, to facilitate both evaluation and training. Extensive experiments show that SMA significantly improves LLMs' overall performance, with consistent enhancement across various benchmarks concerning factuality, as well as helpfulness and comprehensive skills.
34. 【2502.19115】Improving customer service with automatic topic detection in user emails
Link:https://arxiv.org/abs/2502.19115
Authors:Bojana Bašaragin,Darija Medvecki,Gorana Gojić,Milena Oparnica,Dragiša Mišković
Categories:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords:Natural Language Processing, leading Serbian telecommunications, Telekom Srbija, Serbian telecommunications company, Language Processing pipeline
Comments: Paper submitted to the 15th International Conference on Information Society and Technology (ICIST), Kopaonik, Serbia, 9-12 March 2025
Abstract:This study introduces a novel Natural Language Processing pipeline that enhances customer service efficiency at Telekom Srbija, a leading Serbian telecommunications company, through automated email topic detection and labelling. Central to the pipeline is BERTopic, a modular architecture that allows unsupervised topic modelling. After a series of preprocessing and post-processing steps, we assign one of 12 topics and several additional labels to incoming emails, allowing customer service to filter and access them through a custom-made application. The model's performance was evaluated by assessing the speed and correctness of the automatically assigned topics across a test dataset of 100 customer emails. The pipeline shows broad applicability across languages, particularly for those that are low-resourced and morphologically rich. The system now operates in the company's production environment, streamlining customer service operations through automated email classification.
35. 【2502.19110】Conformal Linguistic Calibration: Trading-off between Factuality and Specificity
Link:https://arxiv.org/abs/2502.19110
Authors:Zhengping Jiang,Anqi Liu,Benjamin Van Durme
Categories:Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords:Language model outputs, Language model, prompts research, linguistic calibration, Language
Comments:
Abstract:Language model outputs are not always reliable; this prompts research into methods for adapting model responses based on uncertainty. Common approaches include abstention, where models refrain from generating responses when uncertain, and linguistic calibration, where models hedge their statements using uncertainty quantifiers. However, abstention can withhold valuable information, while linguistically calibrated responses are often challenging to leverage in downstream tasks. We propose a unifying view of both approaches, Conformal Linguistic Calibration (CLC), reinterpreting linguistic calibration as answer set prediction. We begin by presenting a unified framework that connects abstention and linguistic calibration through the lens of linguistic pragmatics. We then describe an implementation that allows for controlling the level of imprecision in model responses. Experimental results show that our method produces calibrated outputs with conformal guarantees on factual accuracy. Furthermore, our approach enables fine-tuning models to perform uncertainty-aware adaptive claim rewriting, offering a controllable balance between factuality and specificity.
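For readers unfamiliar with conformal guarantees, the sketch below shows the standard split-conformal calibration step in general form; it is not the CLC implementation. The scores are a hypothetical stand-in: each one is imagined as the smallest "imprecision level" at which a model's claim on a calibration question becomes correct.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal threshold for a target miscoverage rate alpha.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration score.
    If calibration and test scores are exchangeable, a new score falls at
    or below this threshold with probability at least 1 - alpha.
    """
    n = len(cal_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points to certify this alpha.
        return math.inf
    return sorted(cal_scores)[rank - 1]


# Hypothetical calibration scores (assumed, not from the paper).
cal_levels = [3, 1, 4, 2, 7, 5, 6]

# With n = 7 and alpha = 0.25 the rank is ceil(8 * 0.75) = 6,
# so the threshold is the 6th smallest score.
threshold = conformal_threshold(cal_levels, alpha=0.25)
```

Lowering `alpha` asks for a stronger guarantee and pushes the threshold up, which in this framing means less specific answers; that is the factuality versus specificity trade-off the abstract refers to.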
36. 【2502.19104】Evaluating Gender Bias in German Machine Translation
Link:https://arxiv.org/abs/2502.19104
Authors:Michelle Kappl
Categories:Computation and Language (cs.CL)
Keywords:German machine translation, test set designed, assess occupational stereotyping, evaluation test set, machine translation
Comments: ISCA/ITG Workshop on Diversity in Large Speech and Language Models
Abstract:We present WinoMTDE, a new gender bias evaluation test set designed to assess occupational stereotyping and underrepresentation in German machine translation (MT) systems. Building on the automatic evaluation method introduced by arXiv:1906.00591v1 [cs.CL], we extend the approach to German, a language with grammatical gender. The WinoMTDE dataset comprises 288 German sentences that are balanced in regard to gender, as well as stereotype, which was annotated using German labor statistics. We conduct a large-scale evaluation of five widely used MT systems and a large language model. Our results reveal persistent bias in most models, with the LLM outperforming traditional systems. The dataset and evaluation code are publicly available under this https URL.
37. 【2502.19103】LongEval: A Comprehensive Analysis of Long-Text Generation Through a Plan-based Paradigm
Link:https://arxiv.org/abs/2502.19103
Authors:Siwei Wu,Yizhi Li,Xingwei Qu,Rishi Ravikumar,Yucheng Li,Tyler Loakman,Shanghaoran Quan,Xiaoyong Wei,Riza Batista-Navarro,Chenghua Lin
Categories:Computation and Language (cs.CL)
Keywords:language processing tasks, Large Language Models, natural language processing, Large Language, achieved remarkable success
Comments: Under review
Abstract:Large Language Models (LLMs) have achieved remarkable success in various natural language processing tasks, yet their ability to generate long-form content remains poorly understood and evaluated. Our analysis reveals that current LLMs struggle with length requirements and information density in long-text generation, with performance deteriorating as text length increases. To quantitatively locate this performance degradation and provide further insights for model development, we present LongEval, a benchmark that evaluates long-text generation through both direct and plan-based generation paradigms, inspired by cognitive and linguistic writing models. The comprehensive experiments in this work reveal interesting findings, for example that while model size correlates with generation ability, a small-scale model (e.g., LongWriter) that is well trained on long texts achieves comparable performance. All code and datasets are released at this https URL.
38. 【2502.19078】Sparse Brains are Also Adaptive Brains: Cognitive-Load-Aware Dynamic Activation for LLMs
Link:https://arxiv.org/abs/2502.19078
Authors:Yiheng Yang,Yujie Wang,Chi Ma,Lei Yu,Emmanuele Chersoni,Chu-Ren Huang
Categories:Computation and Language (cs.CL)
Keywords:Dense large language, Dense large, large language models, face critical efficiency
Comments:
Abstract:Dense large language models (LLMs) face critical efficiency bottlenecks as they rigidly activate all parameters regardless of input complexity. While existing sparsity methods (static pruning or dynamic activation) address this partially, they either lack adaptivity to contextual or model structural demands or incur prohibitive computational overhead. Inspired by the human brain's dual-process mechanisms - predictive coding (N400) for backbone sparsity and structural reanalysis (P600) for complex context - we propose CLADA, a Cognitive-Load-Aware Dynamic Activation framework that synergizes statistical sparsity with semantic adaptability. Our key insight is that LLM activations exhibit two complementary patterns: 1) global statistical sparsity driven by sequence-level prefix information, and 2) local semantic adaptability modulated by cognitive load metrics (e.g., surprisal and entropy). CLADA employs a hierarchical thresholding strategy: a baseline from offline error-controlled optimization ensures 40%+ sparsity, dynamically adjusted by real-time cognitive signals. Evaluations across six mainstream LLMs and nine benchmarks demonstrate that CLADA achieves ~20% average speedup with 2% accuracy drop, outperforming Griffin (5%+ degradation) and TT (negligible speedup). Crucially, we establish the first formal connection between neurolinguistic event-related potential (ERP) components and LLM efficiency mechanisms through multi-level regression analysis ($R^2=0.17$ for sparsity-adaptation synergy). Requiring no retraining or architectural changes, CLADA offers a deployable solution for resource-aware LLM inference while advancing biologically-inspired AI design. Our code is available at this https URL.
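The two cognitive-load metrics named in this abstract, surprisal and entropy, have standard information-theoretic definitions that are easy to compute from a next-token distribution. The sketch below shows only those definitions; how CLADA turns them into activation thresholds is not described in the abstract and is not modeled here. The toy distribution is hypothetical.

```python
import math


def surprisal(prob):
    """Surprisal in bits of an observed token with probability prob."""
    return -math.log2(prob)


def entropy(dist):
    """Shannon entropy in bits of a probability distribution.

    Zero-probability outcomes contribute nothing to the sum.
    """
    return -sum(p * math.log2(p) for p in dist if p > 0)


# Hypothetical next-token distribution over a three-word vocabulary.
next_token = {"the": 0.5, "a": 0.25, "cat": 0.25}

# A less likely token carries more surprisal (2 bits for probability 0.25).
cat_surprisal = surprisal(next_token["cat"])

# Entropy summarizes the uncertainty of the whole distribution (1.5 bits).
uncertainty = entropy(next_token.values())
```

Surprisal is measured after a token is observed, while entropy is available before it, so the two give complementary views of how demanding a position in the sequence is.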
39. 【2502.19074】Improving the quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics
Link: https://arxiv.org/abs/2502.19074
Authors: Aloka Fernando, Surangika Ranathunga, Nisansa de Silva
Categories: Computation and Language (cs.CL)
Keywords: Parallel Data Curation, Data Curation, Parallel Data, Multilingual Language Models, Pre-trained Multilingual Language
Comments:
Abstract: Parallel Data Curation (PDC) techniques aim to filter out noisy parallel sentences from web-mined corpora. Prior research has demonstrated that ranking sentence pairs using similarity scores on sentence embeddings derived from Pre-trained Multilingual Language Models (multiPLMs), and training NMT systems with the top-ranked samples, produces better NMT performance than training on the full dataset. However, previous research has shown that the choice of multiPLM significantly impacts the ranking quality. This paper investigates the reasons behind this disparity across multiPLMs. Using the web-mined corpora CCMatrix and CCAligned for En$\rightarrow$Si, En$\rightarrow$Ta and Si$\rightarrow$Ta, we show that different multiPLMs (LASER3, XLM-R, and LaBSE) are biased towards certain types of sentences, which allows noisy sentences to creep into the top-ranked samples. We show that by employing a series of heuristics, this noise can be removed to a certain extent. This improves the results of NMT systems trained with web-mined corpora and reduces the disparity across multiPLMs.
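The ranking step described in the abstract (score each sentence pair by embedding similarity, keep the top-ranked pairs) can be sketched as follows. This is a minimal illustration, assuming cosine similarity as the score and hand-written toy vectors in place of real multiPLM embeddings; the function names are hypothetical and the paper's debiasing heuristics are not reproduced.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_ranked_pairs(pairs, k: int) -> list[str]:
    """Rank (pair_id, src_embedding, tgt_embedding) triples by similarity, keep the top k ids."""
    scored = [(cosine(src, tgt), pair_id) for pair_id, src, tgt in pairs]
    scored.sort(reverse=True)
    return [pair_id for _, pair_id in scored[:k]]


# Toy embeddings standing in for multiPLM sentence embeddings (assumed values).
toy_pairs = [
    ("aligned", [1.0, 0.0, 1.0], [0.9, 0.1, 1.1]),
    ("noisy", [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]),
    ("partial", [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]),
]
selected = top_ranked_pairs(toy_pairs, k=2)
```

A pure similarity ranking like this has no notion of sentence type, which is consistent with the abstract's observation that noisy pairs can still reach the top of the list.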
40. 【2502.19067】IndicEval-XL: Bridging Linguistic Diversity in Code Generation Across Indic Languages
Link: https://arxiv.org/abs/2502.19067
Authors: Ujjwal Singh, Aditi Sharma, Nikhil Gupta, Deepakshi, Vivek Kumar Jha
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Keywords: Large Language Models, demonstrated remarkable capabilities, natural language prompts, Large Language, revolutionizing software development
Comments:
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation from natural language prompts, revolutionizing software development workflows. As we advance towards agent-based development paradigms, these models form the cornerstone of next-generation software development lifecycles. However, current benchmarks for evaluating multilingual code generation capabilities are predominantly English-centric, limiting their applicability across the global developer community. To address this limitation, we present IndicEval-XL, a comprehensive benchmark for code generation that incorporates 6 major Indic languages, collectively spoken by approximately 14% of the world's population. Our benchmark bridges these languages with 12 programming languages, creating a robust evaluation framework. This work is particularly significant given India's representation of one-eighth of the global population and the crucial role Indic languages play in Indian society. IndicEval-XL represents a significant step toward expanding the linguistic diversity in code generation systems and evaluation frameworks. By developing resources that support multiple languages, we aim to make AI-powered development tools more inclusive and accessible to developers of various linguistic backgrounds. To facilitate further research and development in this direction, we make our dataset and evaluation benchmark publicly available at this https URL
41. 【2502.19064】Can Large Language Models Outperform Non-Experts in Poetry Evaluation? A Comparative Study Using the Consensual Assessment Technique
Link: https://arxiv.org/abs/2502.19064
Authors: Piotr Sawicki, Marek Grześ, Dan Brown, Fabrício Góes
Categories: Computation and Language (cs.CL)
Keywords: Consensual Assessment Technique, Assessment Technique, holistic expert judgments, Consensual Assessment, Large Language Models
Comments:
Abstract: The Consensual Assessment Technique (CAT) evaluates creativity through holistic expert judgments. We investigate the use of two advanced Large Language Models (LLMs), Claude-3-Opus and GPT-4o, to evaluate poetry using a methodology inspired by the CAT. Using a dataset of 90 poems, we found that these LLMs can surpass the results achieved by non-expert human judges at matching a ground truth based on publication venue, particularly when assessing smaller subsets of poems. Claude-3-Opus performed slightly better than GPT-4o. We show that LLMs are viable tools for accurately assessing poetry, paving the way for their broader application in other creative domains.
42. 【2502.19058】MathClean: A Benchmark for Synthetic Mathematical Data Cleaning
Link: https://arxiv.org/abs/2502.19058
Authors: Hao Liang, Meiyi Qiang, Yuying Li, Zefeng He, Yongzhen Guo, Zhengzhou Zhu, Wentao Zhang, Bin Cui
Categories: Computation and Language (cs.CL)
Keywords: large language models, data, training data, rapid development, development of large
Comments:
Abstract: With the rapid development of large language models (LLMs), the quality of training data has become crucial. Among the various types of training data, mathematical data plays a key role in enabling LLMs to acquire strong reasoning abilities. While high-quality open-source data is important, it is often insufficient for pre-training, necessitating the addition of synthetic math problems. However, synthetic math questions and answers can introduce inaccuracies, which may degrade both the training data and web data. Therefore, an effective method for cleaning synthetic math data is essential. In this paper, we propose the MathClean benchmark to evaluate the effectiveness of math data cleaning models. The MathClean benchmark consists of 2,000 correct questions and 2,000 erroneous questions with additional 2,000 correct and erroneous answers sourced from augmented data based on GSM8K and MATH. Moreover, we also annotate error types for each question or answer, since this allows us to assess whether models can correctly identify the error categories for future improvements. Finally, we present comprehensive evaluations using state-of-the-art (SOTA) models. Our results demonstrate that even strong models like GPT-o1 and DeepSeek-R1 perform poorly on this benchmark, highlighting the utility of MathClean. Our code and data are available at this https URL.
43. 【2502.19024】Ground-level Viewpoint Vision-and-Language Navigation in Continuous Environments
Link: https://arxiv.org/abs/2502.19024
Authors: Zerui Li, Gengze Zhou, Haodong Hong, Yanyan Shao, Wenqi Lyu, Yanyuan Qiao, Qi Wu
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: make sequential decisions, Ground-level Viewpoint Navigation, associate time-sequenced visual, Viewpoint Navigation, empowers agents
Comments: Accepted by ICRA 2025
Abstract: Vision-and-Language Navigation (VLN) empowers agents to associate time-sequenced visual observations with corresponding instructions to make sequential decisions. However, generalization remains a persistent challenge, particularly when dealing with visually diverse scenes or transitioning from simulated environments to real-world deployment. In this paper, we address the mismatch between human-centric instructions and quadruped robots with a low-height field of view, proposing a Ground-level Viewpoint Navigation (GVNav) approach to mitigate this issue. This work represents the first attempt to highlight the generalization gap in VLN across varying heights of visual observation in realistic robot deployments. Our approach leverages weighted historical observations as enriched spatiotemporal contexts for instruction following, effectively managing feature collisions within cells by assigning appropriate weights to identical features across different viewpoints. This enables low-height robots to overcome challenges such as visual obstructions and perceptual mismatches. Additionally, we transfer the connectivity graph from the HM3D and Gibson datasets as an extra resource to enhance spatial priors and provide a more comprehensive representation of real-world scenarios, leading to improved performance and generalizability of the waypoint predictor in real-world environments. Extensive experiments demonstrate that our Ground-level Viewpoint Navigation (GVNav) approach significantly improves performance in both simulated environments and real-world deployments with quadruped robots.
44. 【2502.19008】Binary Neural Networks for Large Language Model: A Survey
Link: https://arxiv.org/abs/2502.19008
Authors: Liangdong Liu, Zhitong Zheng, Cong Wang, Tianhuang Su, Zhenyu Yang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: natural language processing, field of natural, Large language models, quantization, NLP
Comments: 23 pages, 7 figures
Abstract: Large language models (LLMs), such as GPT-4 and Llama, have wide applications in the field of natural language processing (NLP). However, with the exponential growth of model parameter sizes, LLMs bring significant resource overheads. Low-bit quantization, as a key technique, reduces memory usage and computational demands by decreasing the bit-width of model parameters, activations, and gradients. Previous quantization methods for LLMs have largely employed Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). PTQ does not require any retraining of the original model, while QAT involves optimizing precision during training to achieve the best quantization parameters. The BitNet team proposed a radically different approach, where quantization is performed from the start of model training, utilizing low-precision binary weights during the training process. This approach has led to the emergence of many binary quantization techniques for large language models. This paper provides a comprehensive review of these binary quantization techniques. Specifically, we will introduce binary quantization techniques in deep neural networks and further explore their application to LLMs, reviewing their various contributions, implementations, and applications.
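To make "binary weights" concrete, the sketch below shows one common textbook form of weight binarization: keep only the sign of each weight plus a single per-tensor scale equal to the mean absolute value. This is an illustrative assumption, not the exact scheme of BitNet or of any specific method covered by the survey, and the function names are hypothetical.

```python
def binarize(weights: list[float]) -> tuple[list[int], float]:
    """Return (signs, scale): signs in {-1, +1}, scale = mean absolute weight."""
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return signs, scale


def dequantize(signs: list[int], scale: float) -> list[float]:
    """Reconstruct an approximation of the original weights."""
    return [scale * s for s in signs]


# Toy weight vector (illustrative values only).
weights = [0.4, -0.2, 0.1, -0.5]
signs, scale = binarize(weights)   # one bit per weight plus one float
approx = dequantize(signs, scale)  # every entry has magnitude `scale`
```

Each weight is stored in one bit, at the cost of collapsing all magnitudes to a single shared value.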
45. 【2502.18993】MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering
Link: https://arxiv.org/abs/2502.18993
Authors: Teng Lin
Categories: Computation and Language (cs.CL); Databases (cs.DB)
Keywords: represents significant challenges, large language models, Multi-entity question answering, represents significant, retrieval-augmented generation
Comments:
Abstract: Multi-entity question answering (MEQA) presents significant challenges for large language models (LLMs) and retrieval-augmented generation (RAG) systems, which frequently struggle to consolidate scattered information across diverse documents. While existing methods excel at single-document comprehension, they often struggle with cross-document aggregation, particularly when resolving entity-dense questions like "What is the distribution of ACM Fellows among various fields of study?", which require integrating entity-centric insights from heterogeneous sources (e.g., Wikipedia pages). To address this gap, we introduce MEBench, a novel multi-document, multi-entity benchmark designed to systematically evaluate LLMs' capacity to retrieve, consolidate, and reason over fragmented information. Our benchmark comprises 4,780 questions which are systematically categorized into three primary categories, further divided into eight distinct types, ensuring broad coverage of real-world multi-entity reasoning scenarios. Our experiments on state-of-the-art LLMs (e.g., GPT-4, Llama-3) and RAG pipelines reveal critical limitations: even advanced models achieve only 59% accuracy on MEBench. Our benchmark emphasizes the importance of completeness and factual precision of information extraction in MEQA tasks, using the Entity-Attributed F1 (EA-F1) metric for granular evaluation of entity-level correctness and attribution validity. MEBench not only highlights systemic weaknesses in current LLM frameworks but also provides a foundation for advancing robust, entity-aware QA architectures.
46. 【2502.18990】GenTool: Enhancing Tool Generalization in Language Models through Zero-to-One and Weak-to-Strong Simulation
Link: https://arxiv.org/abs/2502.18990
Authors: Jie He, Jennifer Neville, Mengting Wan, Longqi Yang, Hui Liu, Xiaofeng Xu, Xia Song, Jeff Z. Pan, Pei Zhou
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, Language Models, integrating external tools, range of information
Comments:
Abstract: Large Language Models (LLMs) can enhance their capabilities as AI assistants by integrating external tools, allowing them to access a wider range of information. While recent LLMs are typically fine-tuned with tool usage examples during supervised fine-tuning (SFT), questions remain about their ability to develop robust tool-usage skills and to generalize effectively to unseen queries and tools. In this work, we present GenTool, a novel training framework that prepares LLMs for diverse generalization challenges in tool utilization. Our approach addresses two fundamental dimensions critical for real-world applications: Zero-to-One Generalization, enabling the model to address queries initially lacking a suitable tool by adopting and utilizing one when it becomes available, and Weak-to-Strong Generalization, allowing models to leverage enhanced versions of existing tools to solve queries. To achieve this, we develop synthetic training data simulating these two dimensions of tool usage and introduce a two-stage fine-tuning approach: optimizing tool ranking, then refining tool selection. Through extensive experiments across four generalization scenarios, we demonstrate that our method significantly enhances the tool-usage capabilities of LLMs ranging from 1B to 8B parameters, achieving performance that surpasses GPT-4o. Furthermore, our analysis also provides valuable insights into the challenges LLMs encounter in tool generalization.
47. 【2502.18980】PEToolLLM: Towards Personalized Tool Learning in Large Language Models
Link: https://arxiv.org/abs/2502.18980
Authors: Qiancheng Xu, Yongqi Li, Heming Xia, Fan Liu, Min Yang, Wenjie Li
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models', extending Large Language, Language Models', Large Language, extending Large
Comments:
Abstract:Tool learning has emerged as a promising direction by extending Large Language Models' (LLMs) capabilities with external tools. Existing tool learning studies primarily focus on the general-purpose tool-use capability, which addresses explicit user requirements in instructions. However, they overlook the importance of personalized tool-use capability, leading to an inability to handle implicit user preferences. To address the limitation, we first formulate the task of personalized tool learning, which integrates user's interaction history towards personalized tool usage. To fill the gap of missing benchmarks, we construct PEToolBench, featuring diverse user preferences reflected in interaction history under three distinct personalized settings, and encompassing a wide range of tool-use scenarios. Moreover, we propose a framework PEToolLLaMA to adapt LLMs to the personalized tool learning task, which is trained through supervised fine-tuning and direct preference optimization. Extensive experiments on PEToolBench demonstrate the superiority of PEToolLLaMA over existing LLMs.
48. 【2502.18978】Low-Confidence Gold: Refining Low-Confidence Samples for Efficient Instruction Tuning
Link: https://arxiv.org/abs/2502.18978
Authors: Hongyi Cai, Jie Li, Wenzhen Dong
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, fine-tuning for Large, training datasets, fundamentally constrained
Comments: 8 pages
Abstract:The effectiveness of instruction fine-tuning for Large Language Models is fundamentally constrained by the quality and efficiency of training datasets. This work introduces Low-Confidence Gold (LCG), a novel filtering framework that employs centroid-based clustering and confidence-guided selection for identifying valuable instruction pairs. Through a semi-supervised approach using a lightweight classifier trained on representative samples, LCG curates high-quality subsets while preserving data diversity. Experimental evaluation demonstrates that models fine-tuned on LCG-filtered subsets of 6K samples achieve superior performance compared to existing methods, with substantial improvements on MT-bench and consistent gains across comprehensive evaluation metrics. The framework's efficacy while maintaining model performance establishes a promising direction for efficient instruction tuning.
49. 【2502.18969】(Mis)Fitting: A Survey of Scaling Laws
Link: https://arxiv.org/abs/2502.18969
Authors: Margaret Li, Sneha Kudugunta, Luke Zettlemoyer
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Methodology (stat.ME)
Keywords: Modern foundation models, foundation models rely, models rely heavily, Modern foundation, crucial training decisions
Comments: 41 pages, 3 figures, first two authors contributed equally. ICLR, 2025
Abstract: Modern foundation models rely heavily on using scaling laws to guide crucial training decisions. Researchers often extrapolate the optimal architecture and hyperparameter settings from smaller training runs by describing the relationship between loss, or task performance, and scale. All components of this process vary, from the specific equation being fit, to the training setup, to the optimization method. Each of these factors may affect the fitted law, and therefore, the conclusions of a given study. We discuss discrepancies in the conclusions that several prior works reach, on questions such as the optimal token to parameter ratio. We augment this discussion with our own analysis of the critical impact that changes in specific details may have in a scaling study, and the resulting altered conclusions. Additionally, we survey over 50 papers that study scaling trends: while 45 of these papers quantify these trends using a power law, most under-report crucial details needed to reproduce their findings. To mitigate this, we propose a checklist for authors to consider while contributing to scaling law research.
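As a concrete picture of what "fitting a power law" involves, the sketch below fits loss = a * N^(-b) by least squares in log-log space. The functional form (no irreducible-loss term), the fitting method, and the synthetic data are all assumptions chosen for this example; they are exactly the kind of detail the abstract says can change a study's conclusions.

```python
import math


def fit_power_law(sizes: list[float], losses: list[float]) -> tuple[float, float]:
    """Fit loss = a * size**(-b) by ordinary least squares on log-transformed data."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


# Synthetic, noise-free data generated from a known law (a=5.0, b=0.1).
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * n ** -0.1 for n in sizes]
a, b = fit_power_law(sizes, losses)
```

On noise-free data the fit recovers the generating parameters; with real training runs, the choice of equation, loss, and optimizer for the fit would all need to be reported for the result to be reproducible.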
50. 【2502.18968】Know You First and Be You Better: Modeling Human-Like User Simulators via Implicit Profiles
Link: https://arxiv.org/abs/2502.18968
Authors: Kuang Wang, Xianfei Li, Shenghao Yang, Li Zhou, Feng Jiang, Haizhou Li
Categories: Computation and Language (cs.CL)
Keywords: large language models, replicating human interactions, supporting both collaborative, language models, crucial for replicating
Comments: 9 pages
Abstract:User simulators are crucial for replicating human interactions with dialogue systems, supporting both collaborative training and automatic evaluation, especially for large language models (LLMs). However, existing simulators often rely solely on text utterances, missing implicit user traits such as personality, speaking style, and goals. In contrast, persona-based methods lack generalizability, as they depend on predefined profiles of famous individuals or archetypes. To address these challenges, we propose User Simulator with implicit Profiles (USP), a framework that infers implicit user profiles from human-machine conversations and uses them to generate more personalized and realistic dialogues. We first develop an LLM-driven extractor with a comprehensive profile schema. Then, we refine the simulation through conditional supervised fine-tuning and reinforcement learning with cycle consistency, optimizing it at both the utterance and conversation levels. Finally, we adopt a diverse profile sampler to capture the distribution of real-world user profiles. Experimental results demonstrate that USP outperforms strong baselines in terms of authenticity and diversity while achieving comparable performance in consistency. Furthermore, dynamic multi-turn evaluations based on USP strongly align with mainstream benchmarks, demonstrating its effectiveness in real-world applications.
51. 【2502.18943】Towards Label-Only Membership Inference Attack against Pre-trained Large Language Models
Link: https://arxiv.org/abs/2502.18943
Authors: Yu He, Boheng Li, Liu Liu, Zhongjie Ba, Wei Dong, Yiming Li, Zhan Qin, Kui Ren, Chun Chen
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Keywords: model training set, Large Language Models, data sample belongs, aim to predict, model training
Comments: Accepted by USENIX Security 2025
Abstract: Membership Inference Attacks (MIAs) aim to predict whether a data sample belongs to the model's training set or not. Although prior research has extensively explored MIAs in Large Language Models (LLMs), existing attacks typically require access to complete output logits (i.e., logits-based attacks), which are usually not available in practice. In this paper, we study the vulnerability of pre-trained LLMs to MIAs in the label-only setting, where the adversary can only access generated tokens (text). We first reveal that existing label-only MIAs have minor effects in attacking pre-trained LLMs, although they are highly effective in inferring fine-tuning datasets used for personalized LLMs. We find that their failure stems from two main reasons, including better generalization and overly coarse perturbation. Specifically, due to the extensive pre-training corpora and exposing each sample only a few times, LLMs exhibit minimal robustness differences between members and non-members. This makes token-level perturbations too coarse to capture such differences. To alleviate these problems, we propose PETAL: a label-only membership inference attack based on PEr-Token semAntic simiLarity. Specifically, PETAL leverages token-level semantic similarity to approximate output probabilities and subsequently calculate the perplexity. It finally exposes membership based on the common assumption that members are 'better' memorized and have smaller perplexity. We conduct extensive experiments on the WikiMIA benchmark and the more challenging MIMIR benchmark. Empirically, our PETAL performs better than the extensions of existing label-only attacks against personalized LLMs and even on par with other advanced logit-based attacks across all metrics on five prevalent open-source LLMs.
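The final decision step in the abstract (compute perplexity, flag low-perplexity samples as members) can be sketched as below. This is only an illustration of that last step: the per-token probabilities are supplied directly as made-up values, whereas PETAL approximates them from token-level semantic similarity, which is not reproduced here. The function names and the threshold are hypothetical.

```python
import math


def perplexity(token_probs: list[float]) -> float:
    """Perplexity = exp of the mean negative log-probability per token."""
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)


def is_member(token_probs: list[float], threshold: float) -> bool:
    """Flag a sample as a likely training member when its perplexity is below the threshold."""
    return perplexity(token_probs) < threshold


# Made-up per-token probabilities for two samples (illustrative values only).
member_like = [0.9, 0.8, 0.95, 0.85]
non_member_like = [0.3, 0.2, 0.4, 0.25]
threshold = 2.0  # hypothetical cut-off; in practice it would be calibrated

member_flag = is_member(member_like, threshold)
non_member_flag = is_member(non_member_like, threshold)
```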
52. 【2502.18940】MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors
Link: https://arxiv.org/abs/2502.18940
Authors: Jakub Macina, Nico Daheim, Ido Hakimi, Manu Kapur, Iryna Gurevych, Mrinmaya Sachan
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: making guided progress, Evaluating the pedagogical, capabilities of AI-based, critical for making, making guided
Comments: [this https URL](https://eth-lre.github.io/mathtutorbench)
Abstract:Evaluating the pedagogical capabilities of AI-based tutoring models is critical for making guided progress in the field. Yet, we lack a reliable, easy-to-use, and simple-to-run evaluation that reflects the pedagogical abilities of models. To fill this gap, we present MathTutorBench, an open-source benchmark for holistic tutoring model evaluation. MathTutorBench contains a collection of datasets and metrics that broadly cover tutor abilities as defined by learning sciences research in dialog-based teaching. To score the pedagogical quality of open-ended teacher responses, we train a reward model and show it can discriminate expert from novice teacher responses with high accuracy. We evaluate a wide set of closed- and open-weight models on MathTutorBench and find that subject expertise, indicated by solving ability, does not immediately translate to good teaching. Rather, pedagogy and subject expertise appear to form a trade-off that is navigated by the degree of tutoring specialization of the model. Furthermore, tutoring appears to become more challenging in longer dialogs, where simpler questioning strategies begin to fail. We release the benchmark, code, and leaderboard openly to enable rapid benchmarking of future models.
53. 【2502.18935】JailBench: A Comprehensive Chinese Security Assessment Benchmark for Large Language Models
Link: https://arxiv.org/abs/2502.18935
Authors: Shuyi Liu, Simiao Cui, Haoran Bu, Yuming Shang, Xi Zhang
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large language models, demonstrated remarkable capabilities, Large language, comprehensive safety evaluations, enhanced Chinese language
Comments: 12 pages, 5 figures, accepted at PAKDD 2025
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across various applications, highlighting the urgent need for comprehensive safety evaluations. In particular, the enhanced Chinese language proficiency of LLMs, combined with the unique characteristics and complexity of Chinese expressions, has driven the emergence of Chinese-specific benchmarks for safety assessment. However, these benchmarks generally fall short in effectively exposing LLM safety vulnerabilities. To address the gap, we introduce JailBench, the first comprehensive Chinese benchmark for evaluating deep-seated vulnerabilities in LLMs, featuring a refined hierarchical safety taxonomy tailored to the Chinese context. To improve generation efficiency, we employ a novel Automatic Jailbreak Prompt Engineer (AJPE) framework for JailBench construction, which incorporates jailbreak techniques to enhance assessing effectiveness and leverages LLMs to automatically scale up the dataset through context-learning. The proposed JailBench is extensively evaluated over 13 mainstream LLMs and achieves the highest attack success rate against ChatGPT compared to existing Chinese benchmarks, underscoring its efficacy in identifying latent vulnerabilities in LLMs, as well as illustrating the substantial room for improvement in the security and trustworthiness of LLMs within the Chinese context. Our benchmark is publicly available at this https URL.
54. 【2502.18934】Kanana: Compute-efficient Bilingual Language Models
Link: https://arxiv.org/abs/2502.18934
Authors: Kanana LLM Team: Yunju Bak, Hojin Lee, Minho Ryu, Jiyeon Ham, Seungjae Jung, Daniel Wontae Nam, Taegyeong Eo, Donghun Lee, Doohae Jung, Boseop Kim, Nayeon Kim, Jaesun Park, Hyunho Kim, Hyunwoong Ko, Changmin Lee, Kyoung-Woon On, Seulye Baeg, Junrae Cho, Sunghee Jung, Jieun Kang, EungGyun Kim, Eunhwa Kim, Byeongil Ko, Daniel Lee, Minchul Lee, Miok Lee, Shinbok Lee, Gaeun Seo
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: demonstrate exceeding performance, performance in English, exceeding performance, demonstrate exceeding, introduce Kanana
Comments: 40 pages, 15 figures
Abstract:We introduce Kanana, a series of bilingual language models that demonstrate exceeding performance in Korean and competitive performance in English. The computational cost of Kanana is significantly lower than that of state-of-the-art models of similar size. The report details the techniques employed during pre-training to achieve compute-efficient yet competitive models, including high quality data filtering, staged pre-training, depth up-scaling, and pruning and distillation. Furthermore, the report outlines the methodologies utilized during the post-training of the Kanana models, encompassing supervised fine-tuning and preference optimization, aimed at enhancing their capability for seamless interaction with users. Lastly, the report elaborates on plausible approaches used for language model adaptation to specific scenarios, such as embedding, retrieval augmented generation, and function calling. The Kanana model series spans from 2.1B to 32.5B parameters with 2.1B models (base, instruct, embedding) publicly released to promote research on Korean language models.
55. 【2502.18915】END: Early Noise Dropping for Efficient and Effective Context Denoising
Link: https://arxiv.org/abs/2502.18915
Authors: Hongye Jin, Pei Chen, Jingfeng Yang, Zhengyang Wang, Meng Jiang, Yifan Gao, Binxuan Huang, Xinyang Zhang, Zheng Li, Tianyi Liu, Huasheng Li, Bing Yin
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, language processing tasks, natural language processing, Large Language, Language Models
Comments:
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, they are often distracted by irrelevant or noisy context in input sequences that degrades output quality. This problem affects both long- and short-context scenarios, such as retrieval-augmented generation, table question-answering, and in-context learning. We reveal that LLMs can implicitly identify whether input sequences contain useful information at early layers, prior to token generation. Leveraging this insight, we introduce Early Noise Dropping (END), a novel approach to mitigate this issue without requiring fine-tuning the LLMs. END segments input sequences into chunks and employs a linear prober on the early layers of LLMs to differentiate between informative and noisy chunks. By discarding noisy chunks early in the process, END preserves critical information, reduces distraction, and lowers computational overhead. Extensive experiments demonstrate that END significantly improves both performance and efficiency across different LLMs on multiple evaluation datasets. Furthermore, by investigating LLMs' implicit understanding of the input with the prober, this work also deepens understanding of how LLMs reason over context internally.
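The chunk-then-probe idea in the abstract can be pictured with the sketch below: split the input into fixed-size chunks, score each chunk with a linear probe, and keep only chunks scored as informative. Everything concrete here is an assumption for illustration: the hidden-state vectors are made-up stand-ins for early-layer LLM representations, and the probe weights, threshold, and function names are hypothetical rather than taken from END.

```python
import math


def chunk(tokens: list[str], size: int) -> list[list[str]]:
    """Split a token sequence into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def probe_score(hidden: list[float], weights: list[float], bias: float) -> float:
    """Linear probe followed by a sigmoid: estimated probability the chunk is informative."""
    logit = sum(h * w for h, w in zip(hidden, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def drop_noisy_chunks(chunks, hidden_states, weights, bias, threshold=0.5):
    """Keep only chunks whose probe score reaches the threshold."""
    return [
        c for c, h in zip(chunks, hidden_states)
        if probe_score(h, weights, bias) >= threshold
    ]


tokens = ["Paris", "is", "the", "capital", "lorem", "ipsum", "dolor", "sit"]
chunks = chunk(tokens, size=4)
hidden_states = [[2.0, 0.5], [-1.5, 0.2]]  # made-up per-chunk representations
weights, bias = [1.0, 0.0], 0.0            # hypothetical probe parameters
kept = drop_noisy_chunks(chunks, hidden_states, weights, bias)
```

Because dropped chunks never reach the later layers, a filter of this kind can reduce both distraction and compute, which matches the efficiency claim in the abstract.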
56. 【2502.18913】CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition
Link: https://arxiv.org/abs/2502.18913
Authors: Jiaming Zhou, Yujie Guo, Shiwan Zhao, Haoqin Sun, Hui Wang, Jiabei He, Aobo Kong, Shiyao Wang, Xi Yang, Yequan Wang, Yonghua Lin, Yong Qin
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Keywords: automatic speech recognition, full-length dialogue recordings, presents significant challenges, ASR, Code-switching
Comments:
Abstract: Code-switching (CS), the alternation between two or more languages within a single conversation, presents significant challenges for automatic speech recognition (ASR) systems. Existing Mandarin-English code-switching datasets often suffer from limitations in size, spontaneity, and the lack of full-length dialogue recordings with transcriptions, hindering the development of robust ASR models for real-world conversational scenarios. This paper introduces CS-Dialogue, a novel large-scale Mandarin-English code-switching speech dataset comprising 104 hours of spontaneous conversations from 200 speakers. Unlike previous datasets, CS-Dialogue provides full-length dialogue recordings with complete transcriptions, capturing naturalistic code-switching patterns in continuous speech. We describe the data collection and annotation processes, present detailed statistics of the dataset, and establish benchmark ASR performance using state-of-the-art models. Our experiments, using Transformer, Conformer, and Branchformer, demonstrate the challenges of code-switching ASR, and show that existing pre-trained models such as Whisper still have room for improvement. The CS-Dialogue dataset will be made freely available for all academic purposes.
57. 【2502.18890】From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens
Link: https://arxiv.org/abs/2502.18890
Authors: Tong Wu, Junzhe Shen, Zixia Jia, Yuxuan Wang, Zilong Zheng
Categories: Computation and Language (cs.CL)
Keywords: highly time-intensive task, Generating ultra-long sequences, Generating ultra-long, large language models, time-intensive task
Comments:
Abstract:Generating ultra-long sequences with large language models (LLMs) has become increasingly crucial but remains a highly time-intensive task, particularly for sequences up to 100K tokens. While traditional speculative decoding methods exist, simply extending their generation limits fails to accelerate the process and can be detrimental. Through an in-depth analysis, we identify three major challenges hindering efficient generation: frequent model reloading, dynamic key-value (KV) management and repetitive generation. To address these issues, we introduce TOKENSWIFT, a novel framework designed to substantially accelerate the generation process of ultra-long sequences while maintaining the target model's inherent quality. Experimental results demonstrate that TOKENSWIFT achieves over 3 times speedup across models of varying scales (1.5B, 7B, 8B, 14B) and architectures (MHA, GQA). This acceleration translates to hours of time savings for ultra-long sequence generation, establishing TOKENSWIFT as a scalable and effective solution at unprecedented lengths. Code can be found at this https URL.
58. 【2502.18889】Clip-TTS: Contrastive Text-content and Mel-spectrogram, A High-Quality Text-to-Speech Method based on Contextual Semantic Understanding
Link: https://arxiv.org/abs/2502.18889
Authors: Tianyun Liu
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Keywords: methods primarily focus, traditional TTS systems, primarily focus, focus on establishing, establishing a mapping
Comments:
Abstract: Traditional text-to-speech (TTS) methods primarily focus on establishing a mapping between phonemes and mel-spectrograms. However, during the phoneme encoding stage, there is often a lack of real mel-spectrogram auxiliary information, which results in the encoding process lacking true semantic understanding. At the same time, traditional TTS systems often struggle to balance the inference speed of the model with the quality of the synthesized speech. Methods that generate high-quality synthesized speech tend to have slower inference speeds, while faster inference methods often sacrifice speech quality. In this paper, I propose Clip-TTS, a TTS method based on the Clip architecture. This method uses the Clip framework to establish a connection between text content and real mel-spectrograms during the text encoding stage, enabling the text encoder to directly learn the true semantics of the global context, thereby ensuring the quality of the synthesized speech. In terms of model architecture, I adopt the basic structure of Transformer, which allows Clip-TTS to achieve fast inference speeds. Experimental results show that on the LJSpeech and Baker datasets, the speech generated by Clip-TTS achieves state-of-the-art MOS scores, and it also performs excellently on multi-emotion datasets. Audio samples are available at: this https URL.
59. 【2502.18886】On Pruning State-Space LLMs
Link: https://arxiv.org/abs/2502.18886
Authors: Tamer Ghattas, Michael Hassid, Roy Schwartz
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Recent work proposed, work proposed state-space, Recent work, proposed state-space models, work proposed
Comments:
Abstract: Recent work proposed state-space models (SSMs) as an efficient alternative to transformer-based LLMs. Can these models be pruned to further reduce their computation costs? We adapt several pruning methods to the SSM structure, and apply them to four SSM-based LLMs across multiple tasks. We find that such models are quite robust to some pruning methods (e.g., WANDA), while other methods lead to fast performance degradation.
60. 【2502.18878】Learning to Generate Structured Output with Schema Reinforcement Learning
Link: https://arxiv.org/abs/2502.18878
Authors: Yaxi Lu, Haolun Li, Xin Cong, Zhong Zhang, Yesai Wu, Yankai Lin, Zhiyuan Liu, Fangming Liu, Maosong Sun
Categories: Computation and Language (cs.CL)
Keywords: producing valid JSON, JSON, large language models, valid JSON, structured generation capabilities
Comments: 8 pages, 4 figures
Abstract: This study investigates the structured generation capabilities of large language models (LLMs), focusing on producing valid JSON outputs against a given schema. Despite the widespread use of JSON in integrating language models with programs, there is a lack of comprehensive analysis and benchmarking of these capabilities. We explore various aspects of JSON generation, such as structure understanding, escaping, and natural language description, to determine how to assess and enable LLMs to generate valid responses. Building upon this, we propose SchemaBench, which features around 40K different JSON schemas, to obtain and assess models' abilities in generating valid JSON. We find that the latest LLMs are still struggling to generate a valid JSON string. Moreover, we demonstrate that incorporating reinforcement learning with a Fine-grained Schema Validator can further enhance models' understanding of JSON schema, leading to improved performance. Our models demonstrate significant improvement in both generating JSON outputs and downstream tasks.
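To make "valid JSON against a given schema" concrete, here is a minimal Python sketch of the kind of check involved. It is not SchemaBench or the paper's Fine-grained Schema Validator, whose details the abstract does not give. The names `conforms`, `is_valid_output`, and `SCHEMA` are hypothetical, and only a small subset of JSON Schema keywords (`type`, `required`, `properties`, `items`) is handled.

```python
import json

# Hypothetical mapping from JSON Schema type names to Python types.
_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "null": type(None),
}


def conforms(value, schema):
    """Check a parsed JSON value against a tiny subset of JSON Schema."""
    expected = schema.get("type")
    if expected is not None:
        # In Python, bool is a subclass of int, so reject it for numeric types.
        if isinstance(value, bool) and expected in ("integer", "number"):
            return False
        if not isinstance(value, _TYPES[expected]):
            return False
    if isinstance(value, dict):
        # Every required key must be present.
        if any(key not in value for key in schema.get("required", [])):
            return False
        # Present keys must match their sub-schemas.
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value and not conforms(value[key], sub_schema):
                return False
    if isinstance(value, list) and "items" in schema:
        return all(conforms(item, schema["items"]) for item in value)
    return True


def is_valid_output(text, schema):
    """Return True if the model output parses as JSON and fits the schema."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        return False
    return conforms(value, schema)


# Illustrative schema (an assumption, not taken from the benchmark).
SCHEMA = {
    "type": "object",
    "required": ["name", "age"],
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
}
```

A boolean pass/fail signal like this is the coarsest possible check. A validator used as a reinforcement learning reward would presumably return finer-grained feedback, but that is a guess from the name only.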
61. 【2502.18874】Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework
Link: https://arxiv.org/abs/2502.18874
Authors: Kaishuai Xu, Tiezheng Yu, Wenjun Hou, Yi Cheng, Liangyou Li, Xin Jiang, Lifeng Shang, Qun Liu, Wenjie Li
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large Language Models, Large Language, Language Models, extensively for automated, powerful proprietary models
Comments:
Abstract:Large Language Models (LLMs) are being used more and more extensively for automated evaluation in various scenarios. Previous studies have attempted to fine-tune open-source LLMs to replicate the evaluation explanations and judgments of powerful proprietary models, such as GPT-4. However, these methods are largely limited to text-based analyses under predefined general criteria, resulting in reduced adaptability for unseen instructions and demonstrating instability in evaluating adherence to quantitative and structural constraints. To address these limitations, we propose a novel evaluation framework, ARJudge, that adaptively formulates evaluation criteria and synthesizes both text-based and code-driven analyses to evaluate LLM responses. ARJudge consists of two components: a fine-tuned Analyzer that generates multi-faceted evaluation analyses and a tuning-free Refiner that combines and refines all analyses to make the final judgment. We construct a Composite Analysis Corpus that integrates tasks for evaluation criteria generation alongside text-based and code-driven analysis generation to train the Analyzer. Our results demonstrate that ARJudge outperforms existing fine-tuned evaluators in effectiveness and robustness. Furthermore, it demonstrates the importance of multi-faceted evaluation and code-driven analyses in enhancing evaluation capabilities.
62. 【2502.18873】Multi-LLM Collaborative Search for Complex Problem Solving
Link: https://arxiv.org/abs/2502.18873
Authors: Sen Yang, Yafu Li, Wai Lam, Yu Cheng
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large language models, Large language, vast reasoning space, language models, natural language
Comments:
Abstract:Large language models (LLMs) often struggle with complex reasoning tasks due to their limitations in addressing the vast reasoning space and inherent ambiguities of natural language. We propose the Mixture-of-Search-Agents (MoSA) paradigm, a novel approach leveraging the collective expertise of multiple LLMs to enhance search-based reasoning. MoSA integrates diverse reasoning pathways by combining independent exploration with iterative refinement among LLMs, mitigating the limitations of single-model approaches. Using Monte Carlo Tree Search (MCTS) as a backbone, MoSA enables multiple agents to propose and aggregate reasoning steps, resulting in improved accuracy. Our comprehensive evaluation across four reasoning benchmarks demonstrates MoSA's consistent performance improvements over single-agent and other multi-agent baselines, particularly in complex mathematical and commonsense reasoning tasks.
63. 【2502.18864】Towards an AI co-scientist
Link: https://arxiv.org/abs/2502.18864
Authors: Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan, Vikram Dhillon, Eeshit Dhaval Vaishnav, Byron Lee, Tiago R D Costa, José R Penadés, Gary Peltz, Yunhan Xu, Annalisa Pawlosky, Alan Karthikesalingam, Vivek Natarajan
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Physics and Society (physics.soc-ph); Other Quantitative Biology (q-bio.OT)
Keywords: undergo rigorous experimental, undergo rigorous, Scientific discovery relies, rigorous experimental validation, Scientific discovery
Comments: 81 pages in total (main 38 pages, appendix 43 pages), 13 main figures, 40 appendix figures, 1 main table, 2 appendix tables, 143 main references, 7 appendix references
Abstract:Scientific discovery relies on scientists generating novel hypotheses that undergo rigorous experimental validation. To augment this process, we introduce an AI co-scientist, a multi-agent system built on Gemini 2.0. The AI co-scientist is intended to help uncover new, original knowledge and to formulate demonstrably novel research hypotheses and proposals, building upon prior evidence and aligned to scientist-provided research objectives and guidance. The system's design incorporates a generate, debate, and evolve approach to hypothesis generation, inspired by the scientific method and accelerated by scaling test-time compute. Key contributions include: (1) a multi-agent architecture with an asynchronous task execution framework for flexible compute scaling; (2) a tournament evolution process for self-improving hypotheses generation. Automated evaluations show continued benefits of test-time compute, improving hypothesis quality. While general purpose, we focus development and validation in three biomedical areas: drug repurposing, novel target discovery, and explaining mechanisms of bacterial evolution and anti-microbial resistance. For drug repurposing, the system proposes candidates with promising validation findings, including candidates for acute myeloid leukemia that show tumor inhibition in vitro at clinically applicable concentrations. For novel target discovery, the AI co-scientist proposed new epigenetic targets for liver fibrosis, validated by anti-fibrotic activity and liver cell regeneration in human hepatic organoids. Finally, the AI co-scientist recapitulated unpublished experimental results via a parallel in silico discovery of a novel gene transfer mechanism in bacterial evolution. These results, detailed in separate, co-timed reports, demonstrate the potential to augment biomedical and scientific discovery and usher an era of AI empowered scientists.
64. 【2502.18860】Exploring Rewriting Approaches for Different Conversational Tasks
Link: https://arxiv.org/abs/2502.18860
Authors: Md Mehrab Tanjim, Ryan A. Rossi, Mike Rimer, Xiang Chen, Sungchul Kim, Vaishnavi Muppala, Tong Yu, Zhengmian Hu, Ritwik Sinha, Wei Zhang, Iftikhar Ahamath Burhanuddin, Franck Dernoncourt
Categories: Computation and Language (cs.CL)
Keywords: question rewriting algorithm, algorithm that leverages, leverages a subset, subset of past, past interactions
Comments: Preprint
Abstract:Conversational assistants often require a question rewriting algorithm that leverages a subset of past interactions to provide a more meaningful (accurate) answer to the user's question or request. However, the exact rewriting approach may often depend on the use case and application-specific tasks supported by the conversational assistant, among other constraints. In this paper, we systematically investigate two different approaches, denoted as rewriting and fusion, on two fundamentally different generation tasks, including a text-to-text generation task and a multimodal generative task that takes as input text and generates a visualization or data table that answers the user's question. Our results indicate that the specific rewriting or fusion approach highly depends on the underlying use case and generative task. In particular, we find that for a conversational question-answering assistant, the query rewriting approach performs best, whereas for a data analysis assistant that generates visualizations and data tables based on the user's conversation with the assistant, the fusion approach works best. Notably, we explore two datasets for the data analysis assistant use case, for short and long conversations, and we find that query fusion always performs better, whereas for the conversational text-based question-answering, the query rewrite approach performs best.
65. 【2502.18848】A Causal Lens for Evaluating Faithfulness Metrics
Link: https://arxiv.org/abs/2502.18848
Authors: Kerem Zaman, Shashank Srivastava
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
Keywords: Large Language Models, Large Language, offer natural language, feature attribution methods, alternative to feature
Comments: 18 pages, 18 figures, 6 tables
Abstract:Large Language Models (LLMs) offer natural language explanations as an alternative to feature attribution methods for model interpretability. However, despite their plausibility, they may not reflect the model's internal reasoning faithfully, which is crucial for understanding the model's true decision-making processes. Although several faithfulness metrics have been proposed, a unified evaluation framework remains absent. To address this gap, we present Causal Diagnosticity, a framework to evaluate faithfulness metrics for natural language explanations. Our framework employs the concept of causal diagnosticity, and uses model-editing methods to generate faithful-unfaithful explanation pairs. Our benchmark includes four tasks: fact-checking, analogy, object counting, and multi-hop reasoning. We evaluate a variety of faithfulness metrics, including post-hoc explanation and chain-of-thought-based methods. We find that all tested faithfulness metrics often fail to surpass a random baseline. Our work underscores the need for improved metrics and more reliable interpretability methods in LLMs.
66. 【2502.18845】Sliding Window Attention Training for Efficient Large Language Models
Link: https://arxiv.org/abs/2502.18845
Authors: Zichuan Fu, Wentao Song, Yejing Wang, Xian Wu, Yefeng Zheng, Yingying Zhang, Derong Xu, Xuetao Wei, Tong Xu, Xiangyu Zhao
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: transformer-based Large Language, Large Language Models, Large Language, demonstrated remarkable capabilities, Recent advances
Comments: 14 pages, 5 figures
Abstract:Recent advances in transformer-based Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks. However, their quadratic computational complexity concerning sequence length remains a significant bottleneck for processing long documents. As a result, many efforts like sparse attention and state space models have been proposed to improve the efficiency of LLMs over long sequences. Though effective, these approaches compromise the performance or introduce structural complexity. This calls for a simple yet efficient model that preserves the fundamental Transformer architecture. To this end, we introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training. This paper first attributes the inefficiency of Transformers to the attention sink phenomenon resulting from the high variance of softmax operation. Then, we replace softmax with the sigmoid function and utilize a balanced ALiBi and Rotary Position Embedding for efficient information compression and retention. Experiments demonstrate that SWAT achieves SOTA performance compared with state-of-the-art linear recurrent architectures on eight benchmarks. Code is available at this https URL.
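As a rough illustration of the two ingredients the abstract names, a sliding attention window and a sigmoid in place of softmax, here is a minimal pure-Python sketch. It is not the SWAT implementation: it omits the balanced ALiBi and Rotary Position Embedding components, uses a single head, and the function names and the scaled dot-product scoring are assumptions made for the example.

```python
import math


def sigmoid(x):
    """Logistic function, used here instead of a softmax over the window."""
    return 1.0 / (1.0 + math.exp(-x))


def sliding_window_sigmoid_attention(queries, keys, values, window):
    """Causal attention where token i only sees the last `window` tokens.

    queries, keys: lists of n vectors of dimension d.
    values: list of n vectors of dimension d_v.
    Each weight is sigmoid(score), so weights are not normalized to sum to 1.
    """
    n = len(queries)
    d = len(queries[0])
    d_v = len(values[0])
    outputs = []
    for i in range(n):
        start = max(0, i - window + 1)  # left edge of the causal window
        acc = [0.0] * d_v
        for j in range(start, i + 1):
            score = sum(a * b for a, b in zip(queries[i], keys[j])) / math.sqrt(d)
            weight = sigmoid(score)
            for t in range(d_v):
                acc[t] += weight * values[j][t]
        outputs.append(acc)
    return outputs
```

Each token touches at most `window` keys, so the cost grows as O(n · window) rather than O(n²) in sequence length. Because sigmoid weights are computed independently, no token is forced to absorb leftover probability mass the way softmax normalization does.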
67. 【2502.18841】Sentiment Analysis of Movie Reviews Using BERT
Link: https://arxiv.org/abs/2502.18841
Authors: Gibson Nkhata, Usman Anjum, Justin Zhan
Categories: Computation and Language (cs.CL)
Keywords: Natural Language Processing, opinion mining, kind of text, Sentiment Analysis, emotions and opinions
Comments: 7 pages, 3 figures, published in the proceedings of the Fifteenth International Conference on Information, Process, and Knowledge Management (eKNOW 2023)
Abstract: Sentiment Analysis (SA), or opinion mining, is the analysis of emotions and opinions from any kind of text. SA helps in tracking people's viewpoints, and it is an important factor in social media monitoring, product and brand recognition, customer satisfaction, customer loyalty, advertising and promotion success, and product acceptance. That is why SA is one of the active research areas in Natural Language Processing (NLP). SA is applied to data sourced from various media platforms to mine sentiment knowledge from them. Various approaches have been deployed in the literature to solve the problem. Most techniques devise complex and sophisticated frameworks in order to attain optimal accuracy. This work aims to fine-tune Bidirectional Encoder Representations from Transformers (BERT) with Bidirectional Long Short-Term Memory (BiLSTM) for movie review sentiment analysis and still provide better accuracy than the State-of-The-Art (SOTA) methods. The paper also shows how sentiment analysis can be applied if someone wants to recommend a certain movie, for example, by computing the overall polarity of its sentiments predicted by the model. That is, our proposed method serves as an upper-bound baseline in the prediction of a predominant reaction to a movie. To compute overall polarity, a heuristic algorithm is applied to the BERTBiLSTM output vector. Our model can be extended to three-class, four-class, or any fine-grained classification, with the overall polarity computation applied again. This is intended to be exploited in future work.
68. 【2502.18823】Evidence-Driven Marker Extraction for Social Media Suicide Risk Detection
Link: https://arxiv.org/abs/2502.18823
Authors: Carter Adams, Caleb Carter, Jackson Simmons
Categories: Computation and Language (cs.CL)
Keywords: social media text, Early detection, Large Language Models, timely intervention, social media
Comments:
Abstract:Early detection of suicide risk from social media text is crucial for timely intervention. While Large Language Models (LLMs) offer promising capabilities in this domain, challenges remain in terms of interpretability and computational efficiency. This paper introduces Evidence-Driven LLM (ED-LLM), a novel approach for clinical marker extraction and suicide risk classification. ED-LLM employs a multi-task learning framework, jointly training a Mistral-7B based model to identify clinical marker spans and classify suicide risk levels. This evidence-driven strategy enhances interpretability by explicitly highlighting textual evidence supporting risk assessments. Evaluated on the CLPsych datasets, ED-LLM demonstrates competitive performance in risk classification and superior capability in clinical marker span identification compared to baselines including fine-tuned LLMs, traditional machine learning, and prompt-based methods. The results highlight the effectiveness of multi-task learning for interpretable and efficient LLM-based suicide risk assessment, paving the way for clinically relevant applications.
69. 【2502.18817】Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models
Link: https://arxiv.org/abs/2502.18817
Authors: Shuliang Liu, Xinze Li, Zhenghao Liu, Yukun Yan, Cheng Yang, Zheni Zeng, Zhiyuan Liu, Maosong Sun, Ge Yu
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, Large Language, Retrieval-Augmented Generation, RAG models, hallucinations for Large
Comments:
Abstract:Retrieval-Augmented Generation (RAG) has proven its effectiveness in alleviating hallucinations for Large Language Models (LLMs). However, existing automated evaluation metrics cannot fairly evaluate the outputs generated by RAG models during training and evaluation. LLM-based judgment models provide the potential to produce high-quality judgments, but they are highly sensitive to evaluation prompts, leading to inconsistencies when judging the output of RAG models. This paper introduces the Judge-Consistency (ConsJudge) method, which aims to enhance LLMs to generate more accurate evaluations for RAG models. Specifically, ConsJudge prompts LLMs to generate different judgments based on various combinations of judgment dimensions, utilize the judge-consistency to evaluate these judgments and select the accepted and rejected judgments for DPO training. Our experiments show that ConsJudge can effectively provide more accurate judgments for optimizing RAG models across various RAG models and datasets. Further analysis reveals that judgments generated by ConsJudge have a high agreement with the superior LLM. All codes are available at this https URL.
70. 【2502.18810】Holistic Audit Dataset Generation for LLM Unlearning via Knowledge Graph Traversal and Redundancy Removal
Link: https://arxiv.org/abs/2502.18810
Authors: Weipeng Jiang, Juan Zhai, Shiqing Ma, Ziyan Lei, Xiaofei Xie, Yige Wang, Chao Shen
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Large Language Models, Large Language, Language Models, remove sensitive information, faced increasing demands
Comments: 11 pages, 4 figures
Abstract: In recent years, Large Language Models (LLMs) have faced increasing demands to selectively remove sensitive information, protect privacy, and comply with copyright regulations through Machine Unlearning. While evaluating unlearning effectiveness is crucial, existing benchmarks are limited in scale and comprehensiveness, typically containing only a few hundred test cases. We identify two critical challenges in generating holistic audit datasets: ensuring audit adequacy and handling knowledge redundancy between the forget and retain datasets. To address these challenges, we propose HANKER, an automated framework for holistic audit dataset generation leveraging knowledge graphs to achieve fine-grained coverage and eliminate redundant knowledge. Applying HANKER to the popular MUSE benchmark, we successfully generated over 69,000 and 111,000 audit cases for the News and Books datasets respectively, identifying thousands of knowledge memorization instances that the previous benchmark failed to detect. Our empirical analysis uncovers how knowledge redundancy significantly skews unlearning effectiveness metrics, with redundant instances artificially inflating the observed memorization measurements (ROUGE from 19.7% to 26.1% and Entailment Scores from 32.4% to 35.2%), highlighting the necessity of systematic deduplication for accurate assessment.
71. 【2502.18802】Language Models Grow Less Humanlike beyond Phase Transition
Link: https://arxiv.org/abs/2502.18802
Authors: Tatsuya Aoyama, Ethan Wilcox
Categories: Computation and Language (cs.CL)
Keywords: psychometric predictive power, human reading behavior, tipping point, reading behavior, psychometric predictive
Comments:
Abstract:LMs' alignment with human reading behavior (i.e. psychometric predictive power; PPP) is known to improve during pretraining up to a tipping point, beyond which it either plateaus or degrades. Various factors, such as word frequency, recency bias in attention, and context size, have been theorized to affect PPP, yet there is no current account that explains why such a tipping point exists, and how it interacts with LMs' pretraining dynamics more generally. We hypothesize that the underlying factor is a pretraining phase transition, characterized by the rapid emergence of specialized attention heads. We conduct a series of correlational and causal experiments to show that such a phase transition is responsible for the tipping point in PPP. We then show that, rather than producing attention patterns that contribute to the degradation in PPP, phase transitions alter the subsequent learning dynamics of the model, such that further training keeps damaging PPP.
72. 【2502.18798】ANPMI: Assessing the True Comprehension Capabilities of LLMs for Multiple Choice Questions
Link: https://arxiv.org/abs/2502.18798
Authors: Gyeongje Cho, Yeonkyoung So, Jaejin Lee
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Multiple-choice benchmarks, model natural language, natural language understanding, language understanding capability, natural language
Comments:
Abstract:Multiple-choice benchmarks, consisting of various prompts and choices, are among the most widely used methods to assess a language model's natural language understanding capability. Given a specific prompt, we typically compute $P(Choice|Prompt)$ to evaluate how likely a language model is to generate the correct choice compared to incorrect ones. However, we observe that performance measured using this approach reflects not only the model's comprehension of the prompt but also its inherent biases for certain choices regardless of the prompt. This issue makes it challenging to accurately measure a model's natural language understanding, as models may select the answer without fully understanding the prompt. To address this limitation, we propose a novel metric called ANPMI, which normalizes Pointwise Mutual Information (PMI) by $-\log P(Choice)$. ANPMI provides a more accurate assessment of the model's natural language understanding by ensuring that it is challenging to answer a question without properly understanding the prompt.
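The normalization described in this abstract can be sketched in a few lines of Python. The formula below is one reading of "normalizes PMI by $-\log P(Choice)$", namely $\mathrm{ANPMI} = \frac{\log P(Choice|Prompt) - \log P(Choice)}{-\log P(Choice)}$; the paper may define details differently, for instance how $P(Choice)$ is estimated. The function names and the toy probabilities are illustrative assumptions.

```python
import math


def pmi(log_p_choice_given_prompt, log_p_choice):
    """Pointwise mutual information between a choice and a prompt."""
    return log_p_choice_given_prompt - log_p_choice


def anpmi(log_p_choice_given_prompt, log_p_choice):
    """PMI divided by -log P(Choice), as read from the abstract (assumption)."""
    if log_p_choice >= 0.0:
        raise ValueError("P(Choice) must be strictly less than 1")
    return pmi(log_p_choice_given_prompt, log_p_choice) / (-log_p_choice)


# Toy numbers (made up): choice A is likely with or without the prompt,
# choice B becomes much more likely once the prompt is seen.
score_a = anpmi(math.log(0.5), math.log(0.5))
score_b = anpmi(math.log(0.4), math.log(0.1))
```

With these toy numbers, ranking by $P(Choice|Prompt)$ alone prefers A (0.5 versus 0.4), while the normalized score prefers B, because A's probability does not depend on the prompt at all. Under this reading the score is 0 when the prompt carries no information about the choice and reaches 1 when $P(Choice|Prompt) = 1$.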
73. 【2502.18795】Anything Goes? A Crosslinguistic Study of (Im)possible Language Learning in LMs
Link: https://arxiv.org/abs/2502.18795
Authors: Xiulin Yang, Tatsuya Aoyama, Yuekun Yao, Ethan Wilcox
Categories: Computation and Language (cs.CL)
Keywords: LLMs offer insights, human language learning, offer insights, LLMs offer, languages
Comments:
Abstract:Do LLMs offer insights into human language learning? A common argument against this idea is that because their architecture and training paradigm are so vastly different from humans, LLMs can learn arbitrary inputs as easily as natural languages. In this paper, we test this claim by training LMs to model impossible and typologically unattested languages. Unlike previous work, which has focused exclusively on English, we conduct experiments on 12 natural languages from 4 language families. Our results show that while GPT-2 small can primarily distinguish attested languages from their impossible counterparts, it does not achieve perfect separation between all the attested languages and all the impossible ones. We further test whether GPT-2 small distinguishes typologically attested from unattested languages with different NP orders by manipulating word order based on Greenberg's Universal 20. We find that the model's perplexity scores do not distinguish attested vs. unattested word orders, as long as the unattested variants maintain constituency structure. These findings suggest that language models exhibit some human-like inductive biases, though these biases are weaker than those found in human learners.
74. 【2502.18791】Seeing the Forest for the Trees: A Large Scale, Continuously Updating Meta-Analysis of Frontier LLMs
Link: https://arxiv.org/abs/2502.18791
Authors: Jungsoo Park, Junmo Kang, Gabriel Stanovsky, Alan Ritter
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: studies makes synthesizing, makes synthesizing, LLM studies makes, data extraction, studies makes
Comments: 21 pages, 9 figures
Abstract: The surge of LLM studies makes synthesizing their findings challenging. Meta-analysis can uncover important trends across studies, but its use is limited by the time-consuming nature of manual data extraction. Our study presents a semi-automated approach for meta-analysis that accelerates data extraction using LLMs. It automatically identifies relevant arXiv papers, extracts experimental results and related attributes, and organizes them into a structured dataset. We conduct a comprehensive meta-analysis of frontier LLMs using an automatically extracted dataset, reducing the effort of paper surveying and data extraction by more than 93% compared to manual approaches. We validate our dataset by showing that it reproduces key findings from a recent manual meta-analysis about Chain-of-Thought (CoT), and also uncovers new insights that go beyond it, showing for example that in-context examples benefit multimodal tasks but offer limited gains in mathematical tasks compared to CoT. Our automatically updatable dataset enables continuous tracking of target models by extracting evaluation studies as new data becomes available. Through our scientific artifacts and empirical analysis, we provide novel insights into LLMs while facilitating ongoing meta-analyses of their behavior.
75. 【2502.18782】Active Few-Shot Learning for Text Classification
Link: https://arxiv.org/abs/2502.18782
Authors: Saeed Ahmadnia, Arash Yousefi Jordehi, Mahsa Hosseini Khasheh Heyran, Seyed Abolghasem Mirroshandel, Owen Rambow, Cornelia Caragea
Categories: Computation and Language (cs.CL)
Keywords: Large Language Models, natural language processing, Language Models, Large Language, rise of Large
Comments: Accepted to NAACL 2025 Main Conference; 18 pages, 8 figures, 13 tables including Appendix
Abstract:The rise of Large Language Models (LLMs) has boosted the use of Few-Shot Learning (FSL) methods in natural language processing, achieving acceptable performance even when working with limited training data. The goal of FSL is to effectively utilize a small number of annotated samples in the learning process. However, the performance of FSL suffers when unsuitable support samples are chosen. This problem arises due to the heavy reliance on a limited number of support samples, which hampers consistent performance improvement even when more support samples are added. To address this challenge, we propose an active learning-based instance selection mechanism that identifies effective support instances from the unlabeled pool and can work with different LLMs. Our experiments on five tasks show that our method frequently improves the performance of FSL. We make our implementation available on GitHub.
76. 【2502.18779】Towards Optimal Multi-draft Speculative Decoding
Link: https://arxiv.org/abs/2502.18779
Authors: Zhengmian Hu, Tong Zheng, Vignesh Viswanathan, Ziyi Chen, Ryan A. Rossi, Yihan Wu, Dinesh Manocha, Heng Huang
Categories: Data Structures and Algorithms (cs.DS); Computation and Language (cs.CL)
Keywords: Large Language Models, language processing tasks, natural language processing, Large Language, optimal acceptance rate
Comments:
Abstract:Large Language Models (LLMs) have become an indispensable part of natural language processing tasks. However, autoregressive sampling has become an efficiency bottleneck. Multi-Draft Speculative Decoding (MDSD) is a recent approach where, when generating each token, a small draft model generates multiple drafts, and the target LLM verifies them in parallel, ensuring that the final output conforms to the target model distribution. The two main design choices in MDSD are the draft sampling method and the verification algorithm. For a fixed draft sampling method, the optimal acceptance rate is a solution to an optimal transport problem, but the complexity of this problem makes it difficult to solve for the optimal acceptance rate and measure the gap between existing verification algorithms and the theoretical upper bound. This paper discusses the dual of the optimal transport problem, providing a way to efficiently compute the optimal acceptance rate. For the first time, we measure the theoretical upper bound of MDSD efficiency for vocabulary sizes in the thousands and quantify the gap between existing verification algorithms and this bound. We also compare different draft sampling methods based on their optimal acceptance rates. Our results show that the draft sampling method strongly influences the optimal acceptance rate, with sampling without replacement outperforming sampling with replacement. Additionally, existing verification algorithms do not reach the theoretical upper bound for both without replacement and with replacement sampling. Our findings suggest that carefully designed draft sampling methods can potentially improve the optimal acceptance rate and enable the development of verification algorithms that closely match the theoretical upper bound.
77. 【2502.18778】M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance
链接:https://arxiv.org/abs/2502.18778
作者:Qingpei Guo,Kaiyou Song,Zipeng Feng,Ziping Ma,Qinglong Zhang,Sirui Gao,Xuzheng Yu,Yunxiao Sun,Tai-Wei Chang,Jingdong Chen,Ming Yang,Jun Zhou
类目:Machine Learning (cs.LG); Computation and Language (cs.CL)
关键词:Large Language Models, empowers Large Language, Large Language, achieves competitive performance, empowers Large
备注:
点击查看摘要
Abstract:We present M2-omni, a cutting-edge, open-source omni-MLLM that achieves performance competitive with GPT-4o. M2-omni employs a unified multimodal sequence modeling framework, which empowers Large Language Models (LLMs) to acquire comprehensive cross-modal understanding and generation capabilities. Specifically, M2-omni can process arbitrary combinations of audio, video, image, and text modalities as input, generating multimodal sequences interleaving with audio, image, or text outputs, thereby enabling an advanced and interactive real-time experience. The training of such an omni-MLLM is challenged by significant disparities in data quantity and convergence rates across modalities. To address these challenges, we propose a step balance strategy during pre-training to handle the quantity disparities in modality-specific data. Additionally, a dynamically adaptive balance strategy is introduced during the instruction tuning stage to synchronize the modality-wise training progress, ensuring optimal convergence. Notably, we prioritize preserving strong performance on pure text tasks to maintain the robustness of M2-omni's language understanding capability throughout the training process. To the best of our knowledge, M2-omni is currently a highly competitive open-source alternative to GPT-4o, characterized by its comprehensive modality and task support, as well as its exceptional performance. We expect M2-omni will advance the development of omni-MLLMs, thus facilitating future research in this domain.
78. 【2502.18772】Plutus: Benchmarking Large Language Models in Low-Resource Greek Finance
链接:https://arxiv.org/abs/2502.18772
作者:Xueqing Peng,Triantafillos Papadopoulos,Efstathia Soufleri,Polydoros Giannouris,Ruoyu Xiang,Yan Wang,Lingfei Qian,Jimin Huang,Qianqian Xie,Sophia Ananiadou
类目:Computation and Language (cs.CL)
关键词:Greece pivotal role, Greek financial NLP, Greek financial, Greece pivotal, Greek
备注: 18 pages, 6 figures
点击查看摘要
Abstract:Despite Greece's pivotal role in the global economy, large language models (LLMs) remain underexplored for Greek financial context due to the linguistic complexity of Greek and the scarcity of domain-specific datasets. Previous efforts in multilingual financial natural language processing (NLP) have exposed considerable performance disparities, yet no dedicated Greek financial benchmarks or Greek-specific financial LLMs have been developed until now. To bridge this gap, we introduce Plutus-ben, the first Greek Financial Evaluation Benchmark, and Plutus-8B, the pioneering Greek Financial LLM, fine-tuned with Greek domain-specific data. Plutus-ben addresses five core financial NLP tasks in Greek: numeric and textual named entity recognition, question answering, abstractive summarization, and topic classification, thereby facilitating systematic and reproducible LLM assessments. To underpin these tasks, we present three novel, high-quality Greek financial datasets, thoroughly annotated by expert native Greek speakers, augmented by two existing resources. Our comprehensive evaluation of 22 LLMs on Plutus-ben reveals that Greek financial NLP remains challenging due to linguistic complexity, domain-specific terminology, and financial reasoning gaps. These findings underscore the limitations of cross-lingual transfer, the necessity for financial expertise in Greek-trained models, and the challenges of adapting financial LLMs to Greek text. We release Plutus-ben, Plutus-8B, and all associated datasets publicly to promote reproducible research and advance Greek financial NLP, fostering broader multilingual inclusivity in finance.
79. 【2502.18770】Reward Shaping to Mitigate Reward Hacking in RLHF
链接:https://arxiv.org/abs/2502.18770
作者:Jiayi Fu,Xuandong Zhao,Chengyuan Yao,Heng Wang,Qi Han,Yanghua Xiao
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Human Feedback, aligning large language, reward, large language models, Reinforcement Learning
备注: 19 pages
点击查看摘要
Abstract:Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human values. However, RLHF is susceptible to reward hacking, where the agent exploits flaws in the reward function rather than learning the intended behavior, thus degrading alignment. While reward shaping helps stabilize RLHF and partially mitigate reward hacking, a systematic investigation into shaping techniques and their underlying principles remains lacking. To bridge this gap, we present a comprehensive study of the prevalent reward shaping methods. Our analysis suggests three key design principles: (1) RL reward is ideally bounded, (2) RL benefits from rapid initial growth followed by gradual convergence, and (3) RL reward is best formulated as a function of centered reward. Guided by these insights, we propose Preference As Reward (PAR), a novel approach that leverages the latent preferences embedded within the reward model itself as the signal for reinforcement learning. We evaluated PAR on two base models, Gemma2-2B and Llama3-8B, using two datasets, Ultrafeedback-Binarized and HH-RLHF. Experimental results demonstrate PAR's superior performance over other reward shaping methods. On the AlpacaEval 2.0 benchmark, PAR achieves a win rate at least 5 percentage points higher than competing approaches. Furthermore, PAR exhibits remarkable data efficiency, requiring only a single reference reward for optimal performance, and maintains robustness against reward hacking even after two full epochs of training. Code is available at this https URL.
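The three design principles listed in the abstract above (bounded reward, fast early growth that then levels off, and dependence on a centered reward) can be illustrated with a small sketch. The code below applies a sigmoid to the difference between a reward and a reference reward. This is an assumption chosen because a sigmoid satisfies all three principles; the abstract does not state PAR's exact formula, and the names `shaped_reward` and `reference_reward` are hypothetical.

```python
import math


def shaped_reward(reward: float, reference_reward: float) -> float:
    """Sigmoid of the centered reward (illustrative, not the paper's code).

    - Bounded: the output stays between 0 and 1.
    - Centered: only the gap to the reference reward matters.
    - Steepest near a gap of zero, then saturating for large gaps.
    """
    centered = reward - reference_reward
    # Numerically stable sigmoid that avoids overflow for large |centered|.
    if centered >= 0:
        return 1.0 / (1.0 + math.exp(-centered))
    e = math.exp(centered)
    return e / (1.0 + e)


if __name__ == "__main__":
    for raw in (-4.0, 0.0, 1.0, 4.0, 40.0):
        print(raw, round(shaped_reward(raw, reference_reward=1.0), 4))
```

Because the output saturates, a policy gains little from pushing the raw reward far beyond the reference, which is one intuition for why a bounded shaping function could limit reward hacking.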
80. 【2502.18746】Automatic Prompt Optimization via Heuristic Search: A Survey
链接:https://arxiv.org/abs/2502.18746
作者:Wendi Cui,Jiaxin Zhang,Zhuohang Li,Hao Sun,Damien Lopez,Kamalika Das,Bradley A. Malin,Sricharan Kumar
类目:Computation and Language (cs.CL)
关键词:Natural Language Processing, Large Language Models, Language Processing tasks, guiding model outputs, Large Language
备注:
点击查看摘要
Abstract:Recent advances in Large Language Models have led to remarkable achievements across a variety of Natural Language Processing tasks, making prompt engineering increasingly central to guiding model outputs. While manual methods can be effective, they typically rely on intuition and do not automatically refine prompts over time. In contrast, automatic prompt optimization employing heuristic-based search algorithms can systematically explore and improve prompts with minimal human oversight. This survey proposes a comprehensive taxonomy of these methods, categorizing them by where optimization occurs, what is optimized, what criteria drive the optimization, which operators generate new prompts, and which iterative search algorithms are applied. We further highlight specialized datasets and tools that support and accelerate automated prompt refinement. We conclude by discussing key open challenges pointing toward future opportunities for more robust and versatile LLM applications.
81. 【2502.18744】Like Father, Like Son: Kinship-Aware Preference Mapping (KARMA) for Automatic Alignment in Large Language Models
链接:https://arxiv.org/abs/2502.18744
作者:Jeesu Jung,Chanjun Park,Sangkeun Jung
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Large Language Model, Large Language, leveraging pretrained models, Recent advancements, advancements in Large
备注: 14 pages,5 figures,3 tables,4 graphs
点击查看摘要
Abstract:Recent advancements in Large Language Model (LLM) alignment have sought to mitigate the cost of human annotations by leveraging pretrained models to generate preference data. However, existing methods often compare responses from models with substantially different capabilities, yielding superficial distinctions that fail to provide meaningful guidance on what constitutes a superior response. To address this limitation, we propose Kinship-Aware pReference MApping (KARMA), a novel framework that systematically pairs responses from models with comparable competencies. By constraining preference comparisons to outputs of similar complexity and quality, KARMA enhances the informativeness of preference data and improves the granularity of alignment signals. Empirical evaluations demonstrate that our kinship-aware approach leads to more consistent and interpretable alignment outcomes, ultimately facilitating a more principled and reliable pathway for aligning LLM behavior with human preferences.
82. 【2502.18734】Beyond RNNs: Benchmarking Attention-Based Image Captioning Models
链接:https://arxiv.org/abs/2502.18734
作者:Hemanth Teja Yanambakkam,Rahul Chinthala
类目:Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
关键词:generate meaningful textual, meaningful textual descriptions, challenging task, intersection of computer, computer vision
备注: 10 pages, 6 figures. Code and additional results are available on GitHub under the handle HemanthTejaY
点击查看摘要
Abstract:Image captioning is a challenging task at the intersection of computer vision and natural language processing, requiring models to generate meaningful textual descriptions of images. Traditional approaches rely on recurrent neural networks (RNNs), but recent advancements in attention mechanisms have demonstrated significant improvements. This study benchmarks the performance of attention-based image captioning models against RNN-based approaches using the MS-COCO dataset. We evaluate the effectiveness of Bahdanau attention in enhancing the alignment between image features and generated captions. The models are assessed using natural language processing metrics such as BLEU, METEOR, GLEU, and WER. Our results show that attention-based models outperform RNNs in generating more accurate and semantically rich captions, with better alignment to human evaluation. This work provides insights into the impact of attention mechanisms in image captioning and highlights areas for future improvements.
83. 【2502.18729】Random Forest-of-Thoughts: Uncertainty-aware Reasoning for Computational Social Science
链接:https://arxiv.org/abs/2502.18729
作者:Xiaohua Wu,Xiaohui Tao,Wenjie Wu,Yuefeng Li,Lin Li
类目:Computation and Language (cs.CL)
关键词:elaborate domain theories, social survey analysis, computational social science, interviewee deep thoughts, social survey
备注: 11 pages
点击查看摘要
Abstract:Social surveys in computational social science are well-designed by elaborate domain theories that can effectively reflect the interviewee's deep thoughts without concealing their true feelings. The candidate questionnaire options highly depend on the interviewee's previous answer, which results in the complexity of social survey analysis, the time, and the expertise required. The ability of large language models (LLMs) to perform complex reasoning is well-enhanced by prompting learning such as Chain-of-thought (CoT) but still confined to left-to-right decision-making processes or limited paths during inference. This means they can fall short in problems that require exploration and uncertainty searching. In response, a novel large language model prompting method, called Random Forest of Thoughts (RFoT), is proposed for generating uncertainty reasoning to fit the area of computational social science. The RFoT allows LLMs to perform deliberate decision-making by generating diverse thought space and randomly selecting the sub-thoughts to build the forest of thoughts. It can extend the exploration and prediction of overall performance, benefiting from the extensive research space of response. The method is applied to optimize computational social science analysis on two datasets covering a spectrum of social survey analysis problems. Our experiments show that RFoT significantly enhances language models' abilities on two novel social survey analysis problems requiring non-trivial reasoning.
84. 【2502.18725】Talking to the brain: Using Large Language Models as Proxies to Model Brain Semantic Representation
链接:https://arxiv.org/abs/2502.18725
作者:Xin Liu,Ziyue Zhang,Jingxin Nie
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
关键词:psychological experiments utilizing, Visual Question Answering, experiments utilizing naturalistic, Traditional psychological experiments, ecological validity
备注: 20 pages, 6 figures
点击查看摘要
Abstract:Traditional psychological experiments utilizing naturalistic stimuli face challenges in manual annotation and ecological validity. To address this, we introduce a novel paradigm leveraging multimodal large language models (LLMs) as proxies to extract rich semantic information from naturalistic images through a Visual Question Answering (VQA) strategy for analyzing human visual semantic representation. LLM-derived representations successfully predict established neural activity patterns measured by fMRI (e.g., faces, buildings), validating its feasibility and revealing hierarchical semantic organization across cortical regions. A brain semantic network constructed from LLM-derived representations identifies meaningful clusters reflecting functional and contextual associations. This innovative methodology offers a powerful solution for investigating brain semantic organization with naturalistic stimuli, overcoming limitations of traditional annotation methods and paving the way for more ecologically valid explorations of human cognition.
85. 【2502.18702】A Cooperative Multi-Agent Framework for Zero-Shot Named Entity Recognition
链接:https://arxiv.org/abs/2502.18702
作者:Zihan Wang,Ziqi Zhao,Yougang Lyu,Zhumin Chen,Maarten de Rijke,Zhaochun Ren
类目:Information Retrieval (cs.IR); Computation and Language (cs.CL)
关键词:unannotated text corpora, develop entity recognition, aims to develop, text corpora, entity recognition
备注: Accepted at WWW 2025
点击查看摘要
Abstract:Zero-shot named entity recognition (NER) aims to develop entity recognition systems from unannotated text corpora. This task presents substantial challenges due to minimal human intervention. Recent work has adapted large language models (LLMs) for zero-shot NER by crafting specialized prompt templates. It advances model self-learning abilities by incorporating self-annotated demonstrations. However, two important challenges persist: (i) Correlations between contexts surrounding entities are overlooked, leading to wrong type predictions or entity omissions. (ii) The indiscriminate use of task demonstrations, retrieved through shallow similarity-based strategies, severely misleads LLMs during inference. In this paper, we introduce the cooperative multi-agent system (CMAS), a novel framework for zero-shot NER that uses the collective intelligence of multiple agents to address the challenges outlined above. CMAS has four main agents: (i) a self-annotator, (ii) a type-related feature (TRF) extractor, (iii) a demonstration discriminator, and (iv) an overall predictor. To explicitly capture correlations between contexts surrounding entities, CMAS reformulates NER into two subtasks: recognizing named entities and identifying entity type-related features within the target sentence. To enable controllable utilization of demonstrations, a demonstration discriminator is established to incorporate the self-reflection mechanism, automatically evaluating helpfulness scores for the target sentence. Experimental results show that CMAS significantly improves zero-shot NER performance across six benchmarks, including both domain-specific and general-domain scenarios. Furthermore, CMAS demonstrates its effectiveness in few-shot settings and with various LLM backbones.
86. 【2502.18699】MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment
链接:https://arxiv.org/abs/2502.18699
作者:Tianze Wang,Dongnan Gui,Yifan Hu,Shuhang Lin,Linjun Zhang
类目:Computation and Language (cs.CL); Machine Learning (cs.LG); Methodology (stat.ME)
关键词:aligning large language, large language models, Reinforcement Learning, shown promise, promise in aligning
备注:
点击查看摘要
Abstract:Reinforcement Learning from Human Feedback (RLHF) has shown promise in aligning large language models (LLMs). Yet its reliance on a singular reward model often overlooks the diversity of human preferences. Recent approaches address this limitation by leveraging multi-dimensional feedback to fine-tune corresponding reward models and train LLMs using reinforcement learning. However, the process is costly and unstable, especially given the competing and heterogeneous nature of human preferences. In this paper, we propose Mixing Preference Optimization (MPO), a post-processing framework for aggregating single-objective policies as an alternative to both multi-objective RLHF (MORLHF) and MaxMin-RLHF. MPO avoids alignment from scratch. Instead, it log-linearly combines existing policies into a unified one with the weight of each policy computed via a batch stochastic mirror descent. Empirical results demonstrate that MPO achieves balanced performance across diverse preferences, outperforming or matching existing models with significantly reduced computational costs.
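The log-linear combination of policies mentioned in the abstract above can be sketched for a single next-token distribution. In this minimal example the mixed policy is proportional to the product of each policy's probability raised to its weight. The weights are simply passed in; the abstract says they are computed via batch stochastic mirror descent, which is not reproduced here. The function name `log_linear_mix` and the toy distributions are hypothetical.

```python
import math
from typing import Dict, List


def log_linear_mix(policies: List[Dict[str, float]],
                   weights: List[float]) -> Dict[str, float]:
    """Combine policies as pi(y) proportional to prod_i pi_i(y) ** w_i.

    Assumes every policy assigns a strictly positive probability to the
    same set of tokens (illustrative sketch, not the paper's code).
    """
    tokens = list(policies[0].keys())
    # Weighted sum of log-probabilities per token.
    log_scores = {
        t: sum(w * math.log(p[t]) for p, w in zip(policies, weights))
        for t in tokens
    }
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores.values())
    unnormalized = {t: math.exp(s - top) for t, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {t: u / total for t, u in unnormalized.items()}


if __name__ == "__main__":
    helpful = {"yes": 0.8, "no": 0.2}
    harmless = {"yes": 0.2, "no": 0.8}
    print(log_linear_mix([helpful, harmless], [0.5, 0.5]))
```

Setting one weight to 1 and the others to 0 recovers that single policy, so the weights act as a dial between the single-objective policies.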
87. 【2502.18685】Speaking the Right Language: The Impact of Expertise Alignment in User-AI Interactions
链接:https://arxiv.org/abs/2502.18685
作者:Shramay Palta,Nirupama Chandrasekaran,Rachel Rudinger,Scott Counts
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
关键词:Bing Copilot conversations, Bing Copilot, Copilot conversations, agent responds, user experience
备注: arXiv Version
点击查看摘要
Abstract:Using a sample of 25,000 Bing Copilot conversations, we study how the agent responds to users of varying levels of domain expertise and the resulting impact on user experience along multiple dimensions. Our findings show that across a variety of topical domains, the agent largely responds at proficient or expert levels of expertise (77% of conversations) which correlates with positive user experience regardless of the user's level of expertise. Misalignment, such that the agent responds at a level of expertise below that of the user, has a negative impact on overall user experience, with the impact more profound for more complex tasks. We also show that users engage more, as measured by the number of words in the conversation, when the agent responds at a level of expertise commensurate with that of the user. Our findings underscore the importance of alignment between user and AI when designing human-centered AI systems, to ensure satisfactory and productive interactions.
88. 【2502.18679】Discriminative Finetuning of Generative Large Language Models without Reward Models and Preference Data
链接:https://arxiv.org/abs/2502.18679
作者:Siqi Guo,Ilgee Hong,Vicente Balmaseda,Tuo Zhao,Tianbao Yang
类目:Computation and Language (cs.CL)
关键词:improving pretrained large, pretrained large language, large language models, significant performance gains, Supervised fine-tuning
备注: 15 pages, 6 figures
点击查看摘要
Abstract:Supervised fine-tuning (SFT) followed by preference optimization (PO) denoted by SFT$\rightarrow$PO has become the standard for improving pretrained large language models (LLMs), with PO demonstrating significant performance gains. However, PO methods rely on either human-labeled preference data or a strong reward model to generate preference data. Can we fine-tune LLMs without preference data or reward models while achieving competitive performance to SFT$\rightarrow$PO? We address this question by introducing Discriminative Fine-Tuning (DFT), a novel approach that eliminates the need for preference data. Unlike SFT, which employs a generative approach and overlooks negative data, DFT adopts a discriminative paradigm that increases the probability of positive answers while suppressing potentially negative ones, shifting from token prediction to data prediction. Our contributions include: (i) a discriminative probabilistic framework for fine-tuning LLMs by explicitly modeling the discriminative likelihood of an answer among all possible outputs given an input; (ii) efficient algorithms to optimize this discriminative likelihood; and (iii) extensive experiments demonstrating DFT's effectiveness, achieving performance better than SFT and comparable to, if not better than, SFT$\rightarrow$PO. The code can be found at this https URL.
89. 【2502.18673】Scaffolding Empathy: Training Counselors with Simulated Patients and Utterance-level Performance Visualizations
链接:https://arxiv.org/abs/2502.18673
作者:Ian Steenstra,Farnaz Nouraei,Timothy W. Bickmore
类目:Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
关键词:significant role-play experience, intermittent granular feedback, involves significant role-play, manual training methods, current manual training
备注: This is a preprint version of the paper conditionally accepted to CHI'25
点击查看摘要
Abstract:Learning therapeutic counseling involves significant role-play experience with mock patients, with current manual training methods providing only intermittent granular feedback. We seek to accelerate and optimize counselor training by providing frequent, detailed feedback to trainees as they interact with a simulated patient. Our first application domain involves training motivational interviewing skills for counselors. Motivational interviewing is a collaborative counseling style in which patients are guided to talk about changing their behavior, with empathetic counseling an essential ingredient. We developed and evaluated an LLM-powered training system that features a simulated patient and visualizations of turn-by-turn performance feedback tailored to the needs of counselors learning motivational interviewing. We conducted an evaluation study with professional and student counselors, demonstrating high usability and satisfaction with the system. We present design implications for the development of automated systems that train users in counseling skills and their generalizability to other types of social skills training.
90. 【2502.18653】Enhancing Text Classification with a Novel Multi-Agent Collaboration Framework Leveraging BERT
链接:https://arxiv.org/abs/2502.18653
作者:Hediyeh Baban,Sai A Pidapar,Aashutosh Nema,Sichen Lu
类目:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
关键词:collaboration framework designed, text classification models, multi-agent collaboration framework, designed to enhance, system comprising Lexical
备注:
点击查看摘要
Abstract:We introduce a novel multi-agent collaboration framework designed to enhance the accuracy and robustness of text classification models. Leveraging BERT as the primary classifier, our framework dynamically escalates low-confidence predictions to a specialized multi-agent system comprising Lexical, Contextual, Logic, Consensus, and Explainability agents. This collaborative approach allows for comprehensive analysis and consensus-driven decision-making, significantly improving classification performance across diverse text classification tasks. Empirical evaluations on benchmark datasets demonstrate that our framework achieves a 5.5% increase in accuracy compared to standard BERT-based classifiers, underscoring its effectiveness and academic novelty in advancing multi-agent systems within natural language processing.
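The escalation step described in the abstract above, where low-confidence predictions from the primary classifier are handed to the multi-agent system, can be sketched as a simple threshold on the classifier's top probability. The threshold value of 0.8, the function name `route`, and the returned labels are hypothetical; the abstract does not say how confidence is measured or what cutoff is used.

```python
from typing import List, Optional, Tuple


def route(class_probs: List[float],
          threshold: float = 0.8) -> Tuple[str, Optional[int]]:
    """Keep confident predictions, escalate the rest (illustrative sketch).

    class_probs: probabilities from the primary classifier, one per class.
    threshold:   hypothetical cutoff on the top probability.
    """
    predicted = max(range(len(class_probs)), key=class_probs.__getitem__)
    confidence = class_probs[predicted]
    if confidence >= threshold:
        return ("classifier", predicted)
    # Below the cutoff: defer the decision to the agent system.
    return ("escalate", None)


if __name__ == "__main__":
    print(route([0.93, 0.07]))
    print(route([0.55, 0.45]))
```

A lower threshold sends fewer inputs to the agents, which is cheaper but keeps more of the primary classifier's uncertain predictions.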
91. 【2502.18650】Single- vs. Dual-Prompt Dialogue Generation with LLMs for Job Interviews in Human Resources
链接:https://arxiv.org/abs/2502.18650
作者:Joachim De Baer,A. Seza Doğruöz,Thomas Demeester,Chris Develder
类目:Computation and Language (cs.CL)
关键词:Optimizing language models, Optimizing language, requires large quantities, conversational agents requires, large language models
备注: 11 pages
点击查看摘要
Abstract:Optimizing language models for use in conversational agents requires large quantities of example dialogues. Increasingly, these dialogues are synthetically generated by using powerful large language models (LLMs), especially in domains with challenges to obtain authentic human data. One such domain is human resources (HR). In this context, we compare two LLM-based dialogue generation methods for the use case of generating HR job interviews, and assess whether one method generates higher-quality dialogues that are more challenging to distinguish from genuine human discourse. The first method uses a single prompt to generate the complete interview dialog. The second method uses two agents that converse with each other. To evaluate dialogue quality under each method, we ask a judge LLM to determine whether AI was used for interview generation, using pairwise interview comparisons. We demonstrate that despite a sixfold increase in token cost, interviews generated with the dual-prompt method achieve a win rate up to ten times higher than those generated with the single-prompt method. This difference remains consistent regardless of whether GPT-4o or Llama 3.3 70B is used for either interview generation or judging quality.
92. 【2502.18644】Steered Generation via Gradient Descent on Sparse Features
链接:https://arxiv.org/abs/2502.18644
作者:Sumanta Bhattacharyya,Pedram Rooshenas
类目:Computation and Language (cs.CL)
关键词:Large language models, Large language, specific target characteristics, encode a diverse, diverse range
备注:
点击查看摘要
Abstract:Large language models (LLMs) encode a diverse range of linguistic features within their latent representations, which can be harnessed to steer their output toward specific target characteristics. In this paper, we modify the internal structure of LLMs by training sparse autoencoders to learn a sparse representation of the query embedding, allowing precise control over the model's attention distribution. We demonstrate that manipulating this sparse representation effectively transforms the output toward different stylistic and cognitive targets. Specifically, in an educational setting, we show that the cognitive complexity of LLM-generated feedback can be systematically adjusted by modifying the encoded query representation at a specific layer. To achieve this, we guide the learned sparse embedding toward the representation of samples from the desired cognitive complexity level, using gradient-based optimization in the latent space.
93. 【2502.18642】Contextual effects of sentiment deployment in human and machine translation
链接:https://arxiv.org/abs/2502.18642
作者:Lindy Comstock,Priyanshu Sharma,Mikhail Belov
类目:Computation and Language (cs.CL)
关键词:semantic similarity metrics, automated sentiment analyses, utilize machine translation, machine translation, similarity metrics
备注: ISCA/ITG Workshop on Diversity in Large Speech and Language Models
点击查看摘要
Abstract:This paper illustrates how the overall sentiment of a text may be shifted in translation and the implications for automated sentiment analyses, particularly those that utilize machine translation and assess findings via semantic similarity metrics. While human and machine translation will produce more lemmas that fit the expected frequency of sentiment in the target language, only machine translation will also reduce the overall semantic field of the text, particularly in regard to words with epistemic content.
94. 【2502.18635】Faster, Cheaper, Better: Multi-Objective Hyperparameter Optimization for LLM and RAG Systems
链接:https://arxiv.org/abs/2502.18635
作者:Matthew Barker,Andrew Bell,Evan Thomas,James Carr,Thomas Andrews,Umang Bhatt
类目:Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
关键词:Retrieval Augmented Generation, Augmented Generation, improving Large Language, Retrieval Augmented, Large Language Model
备注:
点击查看摘要
Abstract:While Retrieval Augmented Generation (RAG) has emerged as a popular technique for improving Large Language Model (LLM) systems, it introduces a large number of choices, parameters and hyperparameters that must be made or tuned. This includes the LLM, embedding, and ranker models themselves, as well as hyperparameters governing individual RAG components. Yet, collectively optimizing the entire configuration in a RAG or LLM system remains under-explored - especially in multi-objective settings - due to intractably large solution spaces, noisy objective evaluations, and the high cost of evaluations. In this work, we introduce the first approach for multi-objective parameter optimization of cost, latency, safety and alignment over entire LLM and RAG systems. We find that Bayesian optimization methods significantly outperform baseline approaches, obtaining a superior Pareto front on two new RAG benchmark tasks. We conclude our work with important considerations for practitioners who are designing multi-objective RAG systems, highlighting nuances such as how optimal configurations may not generalize across tasks and objectives.
95. 【2502.18632】Automated Knowledge Component Generation and Knowledge Tracing for Coding Problems
链接:https://arxiv.org/abs/2502.18632
作者:Zhangqi Duan,Nigel Fernandez,Sri Kanakadandi,Bita Akram,Andrew Lan
类目:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG); Software Engineering (cs.SE)
关键词:online learning platforms, facilitating personalized learning, tracking their mastery, mastery levels, levels on fine-grained
备注:
点击查看摘要
Abstract:Knowledge components (KCs) mapped to problems help model student learning, tracking their mastery levels on fine-grained skills thereby facilitating personalized learning and feedback in online learning platforms. However, crafting and tagging KCs to problems, traditionally performed by human domain experts, is highly labor-intensive. We present a fully automated, LLM-based pipeline for KC generation and tagging for open-ended programming problems. We also develop an LLM-based knowledge tracing (KT) framework to leverage these LLM-generated KCs, which we refer to as KCGen-KT. We conduct extensive quantitative and qualitative evaluations validating the effectiveness of KCGen-KT. On a real-world dataset of student code submissions to open-ended programming problems, KCGen-KT outperforms existing KT methods. We investigate the learning curves of generated KCs and show that LLM-generated KCs have a comparable level-of-fit to human-written KCs under the performance factor analysis (PFA) model. We also conduct a human evaluation to show that the KC tagging accuracy of our pipeline is reasonably accurate when compared to that by human domain experts.
96. 【2502.18600】Chain of Draft: Thinking Faster by Writing Less
链接:https://arxiv.org/abs/2502.18600
作者:Silei Xu,Wenhao Xie,Lingxiao Zhao,Pengcheng He
类目:Computation and Language (cs.CL)
关键词:Large Language Models, Large Language, Language Models, demonstrated remarkable performance, emphasizes verbose
备注:
点击查看摘要
Abstract:Large Language Models (LLMs) have demonstrated remarkable performance in solving complex reasoning tasks through mechanisms like Chain-of-Thought (CoT) prompting, which emphasizes verbose, step-by-step reasoning. However, humans typically employ a more efficient strategy: drafting concise intermediate thoughts that capture only essential information. In this work, we propose Chain of Draft (CoD), a novel paradigm inspired by human cognitive processes, where LLMs generate minimalistic yet informative intermediate reasoning outputs while solving tasks. By reducing verbosity and focusing on critical insights, CoD matches or surpasses CoT in accuracy while using as little as 7.6% of the tokens, significantly reducing cost and latency across various reasoning tasks.
97. 【2502.18590】Neurobiber: Fast and Interpretable Stylistic Feature Extraction
Link: https://arxiv.org/abs/2502.18590
Authors: Kenan Alkiek, Anna Wegmann, Jian Zhu, David Jurgens
Categories: Computation and Language (cs.CL)
Keywords: fulfill communicative purposes, scale remains challenging, extracting detailed stylistic, Biber Multidimensional Analysis, texts convey meaning
Comments:
Abstract:Linguistic style is pivotal for understanding how texts convey meaning and fulfill communicative purposes, yet extracting detailed stylistic features at scale remains challenging. We present Neurobiber, a transformer-based system for fast, interpretable style profiling built on Biber's Multidimensional Analysis (MDA). Neurobiber predicts 96 Biber-style features from our open-source BiberPlus library (a Python toolkit that computes stylistic features and provides integrated analytics, e.g., PCA and factor analysis). Despite being up to 56 times faster than existing open source systems, Neurobiber replicates classic MDA insights on the CORE corpus and achieves competitive performance on the PAN 2020 authorship verification task without extensive retraining. Its efficient and interpretable representations readily integrate into downstream NLP pipelines, facilitating large-scale stylometric research, forensic analysis, and real-time text monitoring. All components are made publicly available.
98. 【2502.18583】What are Foundation Models Cooking in the Post-Soviet World?
Link: https://arxiv.org/abs/2502.18583
Authors: Anton Lavrouk, Tarek Naous, Alan Ritter, Wei Xu
Categories: Computation and Language (cs.CL)
Keywords: influence current events, states is complex, current events, turbulent history, history that continues
Comments:
Abstract:The culture of the Post-Soviet states is complex, shaped by a turbulent history that continues to influence current events. In this study, we investigate the Post-Soviet cultural food knowledge of foundation models by constructing BORSch, a multimodal dataset encompassing 1147 and 823 dishes in the Russian and Ukrainian languages, centered around the Post-Soviet region. We demonstrate that leading models struggle to correctly identify the origins of dishes from Post-Soviet nations in both text-only and multimodal Question Answering (QA), instead over-predicting countries linked to the language the question is asked in. Through analysis of pretraining data, we show that these results can be explained by misleading dish-origin co-occurrences, along with linguistic phenomena such as Russian-Ukrainian code mixing. Finally, to move beyond QA-based assessments, we test models' abilities to produce accurate visual descriptions of dishes. The weak correlation between this task and QA suggests that QA alone may be insufficient as an evaluation of cultural understanding. To foster further research, we will make BORSch publicly available.
99. 【2502.18581】Scalable Best-of-N Selection for Large Language Models via Self-Certainty
Link: https://arxiv.org/abs/2502.18581
Authors: Zhewei Kang, Xuandong Zhao, Dawn Song
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Large Language Models, Large Language, increased test-time computation, Language Models, test-time computation
Comments:
Abstract:Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models (LLMs) through increased test-time computation. Current state-of-the-art methods often employ computationally intensive reward models for response evaluation and selection. Reward-free alternatives, like self-consistency and universal self-consistency, are limited in their ability to handle open-ended generation tasks or scale effectively. To address these limitations, we propose self-certainty, a novel and efficient metric that leverages the inherent probability distribution of LLM outputs to estimate response quality without requiring external reward models. We hypothesize that higher distributional self-certainty, aggregated across multiple samples, correlates with improved response accuracy, as it reflects greater confidence in the generated output. Through extensive experiments on various reasoning tasks, we demonstrate that self-certainty (1) scales effectively with increasing sample size $N$, akin to reward models but without the computational overhead; (2) complements chain-of-thought, improving reasoning performance beyond greedy decoding; and (3) generalizes to open-ended tasks where traditional self-consistency methods fall short. Our findings establish self-certainty as a practical and efficient way for improving LLM reasoning capabilities. The code is publicly available.
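The abstract above does not spell out how self-certainty is computed, so the following is only a minimal sketch of reward-free Best-of-N selection under an assumed scoring rule: the average KL divergence from the uniform distribution to each next-token distribution (more peaked distributions score higher). The function names `self_certainty` and `best_of_n`, and the input format, are hypothetical and not taken from the paper's code.

```python
import math

def self_certainty(token_distributions):
    """Assumed confidence score: mean KL(U || p) over the response's tokens.

    token_distributions is a list of next-token probability lists,
    one per generated token. Uniform distributions score 0; peaked
    distributions score higher.
    """
    total = 0.0
    for probs in token_distributions:
        v = len(probs)
        # KL(U || p) = -(1/v) * sum_j log(v * p_j); clamp to avoid log(0)
        total += -sum(math.log(v * max(p, 1e-12)) for p in probs) / v
    return total / len(token_distributions)

def best_of_n(candidates):
    """Pick the candidate text with the highest score.

    candidates is a list of (text, token_distributions) pairs.
    """
    return max(candidates, key=lambda c: self_certainty(c[1]))[0]
```

No external reward model is called here: the only inputs are the probabilities the generating model already produced, which is the property the abstract emphasizes.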
100. 【2502.18573】FactReasoner: A Probabilistic Approach to Long-Form Factuality Assessment for Large Language Models
Link: https://arxiv.org/abs/2502.18573
Authors: Radu Marinescu, Debarun Bhattacharjya, Junkyu Lee, Tigran Tchrakian, Javier Carnerero Cano, Yufang Hou, Elizabeth Daly, Alessandra Pascale
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Keywords: Large language models, demonstrated vast capabilities, Large language, recent years, demonstrated vast
Comments:
Abstract:Large language models (LLMs) have demonstrated vast capabilities on generative tasks in recent years, yet they struggle with guaranteeing the factual correctness of the generated content. This makes these models unreliable in realistic situations where factually accurate responses are expected. In this paper, we propose FactReasoner, a new factuality assessor that relies on probabilistic reasoning to assess the factuality of a long-form generated response. Specifically, FactReasoner decomposes the response into atomic units, retrieves relevant contexts for them from an external knowledge source, and constructs a joint probability distribution over the atoms and contexts using probabilistic encodings of the logical relationships (entailment, contradiction) between the textual utterances corresponding to the atoms and contexts. FactReasoner then computes the posterior probability of whether atomic units in the response are supported by the retrieved contexts. Our experiments on labeled and unlabeled benchmark datasets demonstrate clearly that FactReasoner improves considerably over state-of-the-art prompt-based approaches in terms of both factual precision and recall.
101. 【2502.18545】PII-Bench: Evaluating Query-Aware Privacy Protection Systems
Link: https://arxiv.org/abs/2502.18545
Authors: Hao Shen, Zhouhong Gu, Haokai Hong, Weili Han
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Large Language Models, Large Language, personally identifiable information, adoption of Large, Language Models
Comments:
Abstract:The widespread adoption of Large Language Models (LLMs) has raised significant privacy concerns regarding the exposure of personally identifiable information (PII) in user prompts. To address this challenge, we propose a query-unrelated PII masking strategy and introduce PII-Bench, the first comprehensive evaluation framework for assessing privacy protection systems. PII-Bench comprises 2,842 test samples across 55 fine-grained PII categories, featuring diverse scenarios from single-subject descriptions to complex multi-party interactions. Each sample is carefully crafted with a user query, context description, and standard answer indicating query-relevant PII. Our empirical evaluation reveals that while current models perform adequately in basic PII detection, they show significant limitations in determining PII query relevance. Even state-of-the-art LLMs struggle with this task, particularly in handling complex multi-subject scenarios, indicating substantial room for improvement in achieving intelligent PII masking.
102. 【2502.18536】FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA
Link: https://arxiv.org/abs/2502.18536
Authors: S M Sarwar
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: Visual Question Answering, Question Answering requires, generate accurate answers, Answering requires models, textual understanding
Comments: 12 pages, 6 figures and 2 tables
Abstract:Visual Question Answering requires models to generate accurate answers by integrating visual and textual understanding. However, VQA models still struggle with hallucinations, producing convincing but incorrect answers, particularly in knowledge-driven and Out-of-Distribution scenarios. We introduce FilterRAG, a retrieval-augmented framework that combines BLIP-VQA with Retrieval-Augmented Generation to ground answers in external knowledge sources like Wikipedia and DBpedia. FilterRAG achieves 36.5% accuracy on the OK-VQA dataset, demonstrating its effectiveness in reducing hallucinations and improving robustness in both in-domain and Out-of-Distribution settings. These findings highlight the potential of FilterRAG to improve Visual Question Answering systems for real-world deployment.
103. 【2502.18531】Enhancing Hepatopathy Clinical Trial Efficiency: A Secure, Large Language Model-Powered Pre-Screening Pipeline
Link: https://arxiv.org/abs/2502.18531
Authors: Xiongbin Gui, Hanlin Lv, Xiao Wang, Longting Lv, Yi Xiao, Lei Wang
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: requires interpreting semantically, complex liver diseases, involving complex liver, cohorts involving complex, interpreting semantically complex
Comments: 30 pages, 5 figures
Abstract:Background: Recruitment for cohorts involving complex liver diseases, such as hepatocellular carcinoma and liver cirrhosis, often requires interpreting semantically complex criteria. Traditional manual screening methods are time-consuming and prone to errors. While AI-powered pre-screening offers potential solutions, challenges remain regarding accuracy, efficiency, and data privacy. Methods: We developed a novel patient pre-screening pipeline that leverages clinical expertise to guide the precise, safe, and efficient application of large language models. The pipeline breaks down complex criteria into a series of composite questions and then employs two strategies to perform semantic question-answering through electronic health records - (1) Pathway A, Anthropomorphized Experts' Chain of Thought strategy, and (2) Pathway B, Preset Stances within an Agent Collaboration strategy, particularly in managing complex clinical reasoning scenarios. The pipeline is evaluated on three key metrics-precision, time consumption, and counterfactual inference - at both the question and criterion levels. Results: Our pipeline achieved high precision (0.921, in criteria level) and efficiency (0.44s per task). Pathway B excelled in complex reasoning, while Pathway A was effective in precise data extraction with faster processing times. Both pathways achieved comparable precision. The pipeline showed promising results in hepatocellular carcinoma (0.878) and cirrhosis trials (0.843). Conclusions: This data-secure and time-efficient pipeline shows high precision in hepatopathy trials, providing promising solutions for streamlining clinical trial workflows. Its efficiency and adaptability make it suitable for improving patient recruitment. And its capability to function in resource-constrained environments further enhances its utility in clinical settings.
104. 【2502.18513】Analyzing User Perceptions of Large Language Models (LLMs) on Reddit: Sentiment and Topic Modeling of ChatGPT and DeepSeek Discussions
Link: https://arxiv.org/abs/2502.18513
Authors: Krishnaveni Katta
Categories: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Keywords: large language models, online platforms, increased discourse, discourse on large, perceive these models
Comments: 13 pages, 8 figures
Abstract:While there is an increased discourse on large language models (LLMs) like ChatGPT and DeepSeek, there is no comprehensive understanding of how users of online platforms, like Reddit, perceive these models. This is an important omission because public opinion can influence AI development, trust, and future policy. This study aims at analyzing Reddit discussions about ChatGPT and DeepSeek using sentiment and topic modeling to advance the understanding of user attitudes. Some of the significant topics such as trust in AI, user expectations, potential uses of the tools, reservations about AI biases, and ethical implications of their use are explored in this study. By examining these concerns, the study provides a sense of how public sentiment might shape the direction of AI development going forward. The report also mentions whether users have faith in the technology and what they see as its future. A word frequency approach is used to identify broad topics and sentiment trends. Also, topic modeling through the Latent Dirichlet Allocation (LDA) method identifies top topics in users' language, for example, potential benefits of LLMs, their technological applications, and their overall social ramifications. The study aims to inform developers and policymakers by making it easier to see how users comprehend and experience these game-changing technologies.
105. 【2502.18505】Comprehensive Analysis of Transparency and Accessibility of ChatGPT, DeepSeek, And other SoTA Large Language Models
Link: https://arxiv.org/abs/2502.18505
Authors: Ranjan Sapkota, Shaina Raza, Manoj Karkee
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Large Language Models, open-source Artificial Intelligence, Artificial Intelligence, Large Language, increasing discussions
Comments:
Abstract:Despite increasing discussions on open-source Artificial Intelligence (AI), existing research lacks a discussion on the transparency and accessibility of state-of-the-art (SoTA) Large Language Models (LLMs). The Open Source Initiative (OSI) has recently released its first formal definition of open-source software. This definition, when combined with standard dictionary definitions and the sparse published literature, provide an initial framework to support broader accessibility to AI models such as LLMs, but more work is essential to capture the unique dynamics of openness in AI. In addition, concerns about open-washing, where models claim openness but lack full transparency, has been raised, which limits the reproducibility, bias mitigation, and domain adaptation of these models. In this context, our study critically analyzes SoTA LLMs from the last five years, including ChatGPT, DeepSeek, LLaMA, and others, to assess their adherence to transparency standards and the implications of partial openness. Specifically, we examine transparency and accessibility from two perspectives: open-source vs. open-weight models. Our findings reveal that while some models are labeled as open-source, this does not necessarily mean they are fully open-sourced. Even in the best cases, open-source models often do not report model training data, and code as well as key metrics, such as weight accessibility, and carbon emissions. To the best of our knowledge, this is the first study that systematically examines the transparency and accessibility of over 100 different SoTA LLMs through the dual lens of open-source and open-weight models. The findings open avenues for further research and call for responsible and sustainable AI practices to ensure greater transparency, accountability, and ethical deployment of these models.(DeepSeek transparency, ChatGPT accessibility, open source, DeepSeek open source)
106. 【2502.18504】TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice
Link: https://arxiv.org/abs/2502.18504
Authors: Aman Goel, Xian Carrie Wu, Zhe Wang, Dmitriy Bespalov, Yanjun Qi
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Jailbreaking large-language models, effective jailbreaking templates, involves testing, withstand prompt attacks, testing their robustness
Comments: Accepted at NAACL 2025 industry track, 12 pages, 5 figures
Abstract:Jailbreaking large-language models (LLMs) involves testing their robustness against adversarial prompts and evaluating their ability to withstand prompt attacks that could elicit unauthorized or malicious responses. In this paper, we present TurboFuzzLLM, a mutation-based fuzzing technique for efficiently finding a collection of effective jailbreaking templates that, when combined with harmful questions, can lead a target LLM to produce harmful responses through black-box access via user prompts. We describe the limitations of directly applying existing template-based attacking techniques in practice, and present functional and efficiency-focused upgrades we added to mutation-based fuzzing to generate effective jailbreaking templates automatically. TurboFuzzLLM achieves $\geq$ 95% attack success rates (ASR) on public datasets for leading LLMs (including GPT-4o & GPT-4 Turbo), shows impressive generalizability to unseen harmful questions, and helps in improving model defenses to prompt attacks.
107. 【2502.18499】Mechanistic Understanding of Language Models in Syntactic Code Completion
Link: https://arxiv.org/abs/2502.18499
Authors: Samuel Miller, Daking Rai, Ziyu Yao
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: shown impressive proficiency, code generation tasks, Code LMs, closing parenthesis task, code-specific datasets
Comments: 10 pages, 4 figures, accepted to the AAAI 2025 Workshop on Towards Knowledgeable Foundation Models
Abstract:Recently, language models (LMs) have shown impressive proficiency in code generation tasks, especially when fine-tuned on code-specific datasets, commonly known as Code LMs. However, our understanding of the internal decision-making processes of Code LMs, such as how they use their (syntactic or semantic) knowledge, remains limited, which could lead to unintended harm as they are increasingly used in real life. This motivates us to conduct one of the first Mechanistic Interpretability works to understand how Code LMs perform a syntactic completion task, specifically the closing parenthesis task, on the CodeLlama-7b model (Roziere et al. 2023). Our findings reveal that the model requires middle-later layers until it can confidently predict the correct label for the closing parenthesis task. Additionally, we identify that while both multi-head attention (MHA) and feed-forward (FF) sub-layers play essential roles, MHA is particularly crucial. Furthermore, we also discover attention heads that keep track of the number of already closed parentheses precisely but may or may not promote a correct number of closing parentheses that are still missing, leading to a positive or negative impact on the model's performance.
108. 【2502.18487】AuPair: Golden Example Pairs for Code Repair
Link: https://arxiv.org/abs/2502.18487
Authors: Aditi Mavalankar, Hassan Mansoor, Zita Marinho, Masha Samsikova, Tom Schaul
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: Large Language Models, Large Language, Language Models, inference-time compute, valuable strategy
Comments:
Abstract:Scaling up inference-time compute has proven to be a valuable strategy in improving the performance of Large Language Models (LLMs) without fine-tuning. An important task that can benefit from additional inference-time compute is self-repair; given an initial flawed response, or guess, the LLM corrects its own mistake and produces an improved response, or fix. We leverage the in-context learning ability of LLMs to perform self-repair in the coding domain. The key contribution of our paper is an approach that synthesises and selects an ordered set of golden example pairs, or AuPairs, of these initial guesses and subsequent fixes for the corresponding problems. Each such AuPair is provided as a single in-context example at inference time to generate a repaired solution. For an inference-time compute budget of $N$ LLM calls per problem, $N$ AuPairs are used to generate $N$ repaired solutions, out of which the highest-scoring solution is selected as the final answer. The underlying intuition is that if the LLM is given a different example of fixing an incorrect guess each time, it can subsequently generate a diverse set of repaired solutions. Our algorithm selects these AuPairs in a manner that maximises complementarity and usefulness. We demonstrate the results of our algorithm on 5 LLMs across 7 competitive programming datasets for the code repair task. Our algorithm yields a significant boost in performance compared to best-of-$N$ and self-repair, and also exhibits strong generalisation across datasets and models. Moreover, our approach shows significantly stronger scaling with inference-time compute budget compared to baselines.
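The inference-time procedure described in the abstract above (one AuPair per LLM call, then keep the highest-scoring repair) can be sketched roughly as follows. This is an illustration only: `repair_with_aupairs`, `generate_fix`, and `score` are hypothetical names, the LLM call and the scorer are passed in as plain callables, and the way AuPairs are synthesised and ordered is not covered.

```python
from typing import Callable, List, Optional, Tuple

# An AuPair is a (flawed guess, fix) example pair used as an in-context example.
Pair = Tuple[str, str]

def repair_with_aupairs(
    problem: str,
    guess: str,
    aupairs: List[Pair],
    generate_fix: Callable[[str, str, Pair], str],  # hypothetical LLM call
    score: Callable[[str, str], float],             # hypothetical scorer, e.g. test pass rate
    budget: int,
) -> Optional[str]:
    """Spend up to `budget` LLM calls, one per AuPair, and keep the best repair."""
    best_fix, best_score = None, float("-inf")
    for pair in aupairs[:budget]:
        # Each call sees a different in-context example of guess -> fix
        candidate = generate_fix(problem, guess, pair)
        s = score(problem, candidate)
        if s > best_score:
            best_fix, best_score = candidate, s
    return best_fix
```

Because the pairs are consumed in order, a smaller budget uses only the earliest pairs in the list, which is why the abstract stresses that the set is ordered.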
109. 【2502.18482】MixLLM: Dynamic Routing in Mixed Large Language Models
Link: https://arxiv.org/abs/2502.18482
Authors: Xinyuan Wang, Yanchi Liu, Wei Cheng, Xujiang Zhao, Zhengzhang Chen, Wenchao Yu, Yanjie Fu, Haifeng Chen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)
Keywords: Large Language Models, Large Language, exhibit potential artificial, generic intelligence recently, potential artificial generic
Comments: 11 pages, 7 figures, accepted by NAACL 2025 main conference
Abstract:Large Language Models (LLMs) exhibit potential artificial generic intelligence recently, however, their usage is costly with high response latency. Given mixed LLMs with their own strengths and weaknesses, LLM routing aims to identify the most suitable model for each query in the stream to maximize response quality and minimize cost and latency. However, the challenges involve: (1) dynamic trade-offs among quality, cost, and latency; (2) enabling continual learning in deployed systems; and (3) navigating a varying (e.g., new LLM addition or old LLM removal) set of LLM candidates over time. To bridge these gaps, we develop MixLLM, a dynamic contextual-bandit-based routing system for query-LLM assignment. Specifically, we first leverage query tags to enhance query embeddings for the routing task. Next, we design lightweight prediction models to estimate the response qualities and costs of queries over LLMs. We then devise a meta-decision maker to choose the query-LLM assignments to best tradeoff response quality, cost, and latency. Finally, the system benefits from continual training, allowing it to adapt to evolving queries and user feedback over time. Our extensive experiments show that MixLLM achieves the best trade-offs in response quality, cost, and latency (97.25% of GPT-4's quality at 24.18% of the cost under the time constraint).
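As a rough illustration of the query-LLM assignment step described in the abstract above, the sketch below picks a model by trading predicted quality against predicted cost and latency. Everything here is an assumption for illustration: the `Estimate` fields, the linear scoring rule, the weights, and the `route` function are hypothetical, and the contextual-bandit learning and continual training that MixLLM relies on are not modelled.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Estimate:
    model: str
    quality: float  # predicted response quality, higher is better
    cost: float     # predicted cost, lower is better
    latency: float  # predicted latency in seconds, lower is better

def route(
    estimates: List[Estimate],
    cost_weight: float = 1.0,
    latency_weight: float = 0.1,
    max_latency: Optional[float] = None,
) -> str:
    """Return the model name with the best assumed quality/cost/latency trade-off."""
    # Apply the optional latency constraint first
    feasible = [e for e in estimates if max_latency is None or e.latency <= max_latency]
    if not feasible:
        feasible = estimates  # no model meets the constraint; fall back to all
    best = max(
        feasible,
        key=lambda e: e.quality - cost_weight * e.cost - latency_weight * e.latency,
    )
    return best.model
```

Setting both weights to zero reduces the rule to "pick the highest predicted quality", while larger weights push traffic toward cheaper, faster models.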
110. 【2502.18480】QExplorer: Large Language Model Based Query Extraction for Toxic Content Exploration
Link: https://arxiv.org/abs/2502.18480
Authors: Shaola Ren, Li Ke, Longtao Huang, Dehong Gao, Hui Xue
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Automatically extracting effective, Large Language Model, Automatically extracting, extracting effective queries, toxic content exploration
Comments:
Abstract:Automatically extracting effective queries is challenging in information retrieval, especially in toxic content exploration, as such content is likely to be disguised. With the recent achievements in generative Large Language Model (LLM), we are able to leverage the capabilities of LLMs to extract effective queries for similar content exploration directly. This study proposes QExplorer, an approach of large language model based Query Extraction for toxic content Exploration. The QExplorer approach involves a 2-stage training process: instruction Supervised FineTuning (SFT) and preference alignment using Direct Preference Optimization (DPO), as well as the datasets construction with feedback of search system. To verify the effectiveness of QExplorer, a series of offline and online experiments are conducted on our real-world system. The offline empirical results demonstrate that the performance of our automatic query extraction outperforms that of several LLMs and humans. The online deployment shows a significant increase in the detection of toxic items.
111. 【2502.18471】FinBloom: Knowledge Grounding Large Language Model with Real-time Financial Data
Link: https://arxiv.org/abs/2502.18471
Authors: Ankur Sinha, Chaitanya Agarwal, Pekka Malo
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Statistical Finance (q-fin.ST)
Keywords: Large language models, generating human-like responses, Large language, Financial Context Dataset, Financial
Comments: 27 pages, 9 tables
Abstract:Large language models (LLMs) excel at generating human-like responses but often struggle with interactive tasks that require access to real-time information. This limitation poses challenges in finance, where models must access up-to-date information, such as recent news or price movements, to support decision-making. To address this, we introduce Financial Agent, a knowledge-grounding approach for LLMs to handle financial queries using real-time text and tabular data. Our contributions are threefold: First, we develop a Financial Context Dataset of over 50,000 financial queries paired with the required context. Second, we train FinBloom 7B, a custom 7 billion parameter LLM, on 14 million financial news articles from Reuters and Deutsche Presse-Agentur, alongside 12 million Securities and Exchange Commission (SEC) filings. Third, we fine-tune FinBloom 7B using the Financial Context Dataset to serve as a Financial Agent. This agent generates relevant financial context, enabling efficient real-time data retrieval to answer user queries. By reducing latency and eliminating the need for users to manually provide accurate data, our approach significantly enhances the capability of LLMs to handle dynamic financial tasks. Our proposed approach makes real-time financial decisions, algorithmic trading and other related tasks streamlined, and is valuable in contexts with high-velocity data flows.
Information Retrieval
1. 【2502.19298】Agent-centric Information Access
Link: https://arxiv.org/abs/2502.19298
Authors: Evangelos Kanoulas, Panagiotis Eustratiadis, Yongkang Li, Yougang Lyu, Vaishali Pal, Gabrielle Poerwawinata, Jingfen Qiao, Zihan Wang
Categories: Information Retrieval (cs.IR)
Keywords: large language models, specific domains, large language, envision a future, trained on proprietary
Comments:
Abstract:As large language models (LLMs) become more specialized, we envision a future where millions of expert LLMs exist, each trained on proprietary data and excelling in specific domains. In such a system, answering a query requires selecting a small subset of relevant models, querying them efficiently, and synthesizing their responses. This paper introduces a framework for agent-centric information access, where LLMs function as knowledge agents that are dynamically ranked and queried based on their demonstrated expertise. Unlike traditional document retrieval, this approach requires inferring expertise on the fly, rather than relying on static metadata or predefined model descriptions. This shift introduces several challenges, including efficient expert selection, cost-effective querying, response aggregation across multiple models, and robustness against adversarial manipulation. To address these issues, we propose a scalable evaluation framework that leverages retrieval-augmented generation and clustering techniques to construct and assess thousands of specialized models, with the potential to scale toward millions.
2. 【2502.19280】Efficient Federated Search for Retrieval-Augmented Generation
Link: https://arxiv.org/abs/2502.19280
Authors: Rachid Guerraoui, Anne-Marie Kermarrec, Diana Petrescu, Rafael Pires, Mathis Randl, Martijn de Vos
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Information Retrieval (cs.IR)
Keywords: Large language models, demonstrated remarkable capabilities, Large language, limiting their reliability, hallucinations and inconsistencies
Comments: To appear in the proceedings of EuroMLSys'25
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities across various domains but remain susceptible to hallucinations and inconsistencies, limiting their reliability. Retrieval-augmented generation (RAG) mitigates these issues by grounding model responses in external knowledge sources. Existing RAG workflows often leverage a single vector database, which is impractical in the common setting where information is distributed across multiple repositories. We introduce RAGRoute, a novel mechanism for federated RAG search. RAGRoute dynamically selects relevant data sources at query time using a lightweight neural network classifier. By not querying every data source, this approach significantly reduces query overhead, improves retrieval efficiency, and minimizes the retrieval of irrelevant information. We evaluate RAGRoute using the MIRAGE and MMLU benchmarks and demonstrate its effectiveness in retrieving relevant documents while reducing the number of queries. RAGRoute reduces the total number of queries up to 77.5% and communication volume up to 76.2%.
3. 【2502.19271】Multiview graph dual-attention deep learning and contrastive learning for multi-criteria recommender systems
Link: https://arxiv.org/abs/2502.19271
Authors: Saman Forouzandeh, Pavel N. Krivitsky, Rohitash Chandra
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Keywords: Multi-Criteria Recommender Systems, Recommender systems leveraging, Recommender systems, systems leveraging deep, single-criteria recommender systems
Comments:
Abstract:Recommender systems leveraging deep learning models have been crucial for assisting users in selecting items aligned with their preferences and interests. However, a significant challenge persists in single-criteria recommender systems, which often overlook the diverse attributes of items that have been addressed by Multi-Criteria Recommender Systems (MCRS). Shared embedding vector for multi-criteria item ratings but have struggled to capture the nuanced relationships between users and items based on specific criteria. In this study, we present a novel representation for Multi-Criteria Recommender Systems (MCRS) based on a multi-edge bipartite graph, where each edge represents one criterion rating of items by users, and Multiview Dual Graph Attention Networks (MDGAT). Employing MDGAT is beneficial and important for adequately considering all relations between users and items, given the presence of both local (criterion-based) and global (multi-criteria) relations. Additionally, we define anchor points in each view based on similarity and employ local and global contrastive learning to distinguish between positive and negative samples across each view and the entire graph. We evaluate our method on two real-world datasets and assess its performance based on item rating predictions. The results demonstrate that our method achieves higher accuracy compared to the baseline method for predicting item ratings on the same datasets. MDGAT effectively capture the local and global impact of neighbours and the similarity between nodes.
4. 【2502.19178】UQABench: Evaluating User Embedding for Prompting LLMs in Personalized Question Answering
Link: https://arxiv.org/abs/2502.19178
Authors: Langming Liu, Shilei Liu, Yujin Yuan, Yizhen Zhang, Bencheng Yan, Zhiyuan Zeng, Zihao Wang, Jiaqi Liu, Di Wang, Wenbo Su, Pengjie Wang, Jian Xu, Bo Zheng
Categories: Information Retrieval (cs.IR)
Keywords: Large language models, natural language processing, achieve remarkable success, Large language, language models
Comments: 10 pages, 3 figures, 7 tables
Abstract:Large language models (LLMs) achieve remarkable success in natural language processing (NLP). In practical scenarios like recommendations, as users increasingly seek personalized experiences, it becomes crucial to incorporate user interaction history into the context of LLMs to enhance personalization. However, from a practical utility perspective, user interactions' extensive length and noise present challenges when used directly as text prompts. A promising solution is to compress and distill interactions into compact embeddings, serving as soft prompts to assist LLMs in generating personalized responses. Although this approach brings efficiency, a critical concern emerges: Can user embeddings adequately capture valuable information and prompt LLMs? To address this concern, we propose UQABench, a benchmark designed to evaluate the effectiveness of user embeddings in prompting LLMs for personalization. We establish a fair and standardized evaluation process, encompassing pre-training, fine-tuning, and evaluation stages. To thoroughly evaluate user embeddings, we design three dimensions of tasks: sequence understanding, action prediction, and interest perception. These evaluation tasks cover the industry's demands in traditional recommendation tasks, such as improving prediction accuracy, and its aspirations for LLM-based methods, such as accurately understanding user interests and enhancing the user experience. We conduct extensive experiments on various state-of-the-art methods for modeling user embeddings. Additionally, we reveal the scaling laws of leveraging user embeddings to prompt LLMs. The benchmark is available online.
5. 【2502.19163】TestNUC: Enhancing Test-Time Computing Approaches through Neighboring Unlabeled Data Consistency
Link: https://arxiv.org/abs/2502.19163
Authors: Henry Peng Zou,Zhengyao Gu,Yue Zhou,Yankai Chen,Weizhi Zhang,Liancheng Fang,Yibo Wang,Yangning Li,Kay Liu,Philip S. Yu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: leverage additional computational, additional computational resources, enhancing large language, Test-time computing approaches, large language model
Comments:
Abstract:Test-time computing approaches, which leverage additional computational resources during inference, have been proven effective in enhancing large language model performance. This work introduces a novel, linearly scaling approach, TestNUC, that improves test-time predictions by leveraging the local consistency of neighboring unlabeled data: it classifies an input instance by considering not only the model's prediction on that instance but also on neighboring unlabeled instances. We evaluate TestNUC across eight diverse datasets, spanning intent classification, topic mining, domain discovery, and emotion detection, demonstrating its consistent superiority over baseline methods such as standard prompting and self-consistency. Furthermore, TestNUC can be seamlessly integrated with existing test-time computing approaches, substantially boosting their performance. Our analysis reveals that TestNUC scales effectively with increasing amounts of unlabeled data and performs robustly across different embedding models, making it practical for real-world applications. Our code is available at this https URL.
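The neighbor-consistency idea in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' released code: it assumes embeddings and per-instance model predictions are already available, and it uses a plain unweighted majority vote, which may differ from the aggregation used in the paper. The function names and toy data are hypothetical.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def neighbor_consistent_label(query_vec, query_pred, pool, k=3):
    """Vote over the model's label for the query and for its k nearest unlabeled neighbors.

    pool: list of (embedding, predicted_label) pairs for unlabeled instances.
    """
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    votes = Counter([query_pred])
    votes.update(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data: the model's own prediction on the query disagrees with its neighborhood.
pool = [
    ([1.0, 0.1], "billing"),
    ([0.9, 0.2], "billing"),
    ([0.8, 0.0], "billing"),
    ([0.0, 1.0], "shipping"),
]
label = neighbor_consistent_label([1.0, 0.0], "shipping", pool, k=3)
```

In this toy case the three nearest neighbors are all predicted as "billing", so the vote overrides the model's isolated "shipping" prediction on the query.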
6. 【2502.19108】A 106K Multi-Topic Multilingual Conversational User Dataset with Emoticons
Link: https://arxiv.org/abs/2502.19108
Authors: Heng Er Metilda Chee,Jiayin Wang,Zhiqiang Guo,Weizhi Ma,Qinglang Guo,Min Zhang
Subjects: Information Retrieval (cs.IR); Multimedia (cs.MM)
Keywords: emoticons enabling users, Instant messaging, form of communication, ideas efficiently, predominant form
Comments:
Abstract:Instant messaging has become a predominant form of communication, with texts and emoticons enabling users to express emotions and ideas efficiently. Emoticons, in particular, have gained significant traction as a medium for conveying sentiments and information, leading to the growing importance of emoticon retrieval and recommendation systems. However, one of the key challenges in this area has been the absence of datasets that capture both the temporal dynamics and user-specific interactions with emoticons, limiting the progress of personalized user modeling and recommendation approaches. To address this, we introduce the emoticon dataset, a comprehensive resource that includes time-based data along with anonymous user identifiers across different conversations. As the largest publicly accessible emoticon dataset to date, it comprises 22K unique users, 370K emoticons, and 8.3M messages. The data was collected from a widely-used messaging platform across 67 conversations and 720 hours of crawling. Strict privacy and safety checks were applied to ensure the integrity of both text and image data. Spanning across 10 distinct domains, the emoticon dataset provides rich insights into temporal, multilingual, and cross-domain behaviors, which were previously unavailable in other emoticon-based datasets. Our in-depth experiments, both quantitative and qualitative, demonstrate the dataset's potential in modeling user behavior and personalized recommendation systems, opening up new possibilities for research in personalized retrieval and conversational AI. The dataset is freely accessible.
7. 【2502.18992】OntologyRAG: Better and Faster Biomedical Code Mapping with Retrieval-Augmented Generation (RAG) Leveraging Ontology Knowledge Graphs and Large Language Models
Link: https://arxiv.org/abs/2502.18992
Authors: Hui Feng,Yuntzu Yin,Emiliano Reynares,Jay Nanavati
Subjects: Information Retrieval (cs.IR)
Keywords: domain-specific information representations, formalizing domain-specific information, comprehensively define concepts, information representations, biomedical entities
Comments: This paper has been accepted as a workshop paper for KEIR@ECIR 2025
Abstract:Biomedical ontologies, which comprehensively define concepts and relations for biomedical entities, are crucial for structuring and formalizing domain-specific information representations. Biomedical code mapping identifies similarity or equivalence between concepts from different ontologies. Obtaining high-quality mapping usually relies on automatic generation of unrefined mapping with ontology domain fine-tuned language models (LMs), followed by manual selections or corrections by coding experts who have extensive domain expertise and familiarity with ontology schemas. The LMs usually provide unrefined code mapping suggestions as a list of candidates without reasoning or supporting evidence, hence coding experts still need to verify each suggested candidate against ontology sources to pick the best matches. This is also a recurring task as ontology sources are updated regularly to incorporate new research findings. Consequently, the need for regular LM retraining and manual refinement makes code mapping time-consuming and labour-intensive. In this work, we created OntologyRAG, an ontology-enhanced retrieval-augmented generation (RAG) method that leverages the inductive biases from ontological knowledge graphs for in-context learning (ICL) in large language models (LLMs). Our solution grounds LLMs to knowledge graphs with unrefined mappings between ontologies and processes questions by generating an interpretable set of results that include a prediction rationale with a mapping proximity assessment. Our solution doesn't require re-training LMs, as all ontology updates can be reflected by updating the knowledge graphs with a standard process. Evaluation results on a self-curated gold dataset show the promise of using our method to enable coding experts to achieve better and faster code mapping. The code is available at this https URL.
8. 【2502.18965】OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment
Link: https://arxiv.org/abs/2502.18965
Authors: Jiaxin Deng,Shiyao Wang,Kuo Cai,Lejian Ren,Qigen Hu,Weifeng Ding,Qiang Luo,Guorui Zhou
Subjects: Information Retrieval (cs.IR)
Keywords: promising paradigm, generative retrieval-based recommendation, generative model, retrieval-based recommendation systems, generative retrieval-based
Comments:
Abstract:Recently, generative retrieval-based recommendation systems have emerged as a promising paradigm. However, most modern recommender systems adopt a retrieve-and-rank strategy, where the generative model functions only as a selector during the retrieval stage. In this paper, we propose OneRec, which replaces the cascaded learning framework with a unified generative model. To the best of our knowledge, this is the first end-to-end generative model that significantly surpasses current complex and well-designed recommender systems in real-world scenarios. Specifically, OneRec includes: 1) an encoder-decoder structure, which encodes the user's historical behavior sequences and gradually decodes the videos that the user may be interested in. We adopt sparse Mixture-of-Experts (MoE) to scale model capacity without proportionally increasing computational FLOPs. 2) a session-wise generation approach. In contrast to traditional next-item prediction, we propose a session-wise generation, which is more elegant and contextually coherent than point-by-point generation that relies on hand-crafted rules to properly combine the generated results. 3) an Iterative Preference Alignment module combined with Direct Preference Optimization (DPO) to enhance the quality of the generated results. Unlike DPO in NLP, a recommendation system typically has only one opportunity to display results for each user's browsing request, making it impossible to obtain positive and negative samples simultaneously. To address this limitation, we design a reward model to simulate user generation and customize the sampling strategy. Extensive experiments have demonstrated that a limited number of DPO samples can align user interest preferences and significantly improve the quality of generated results. We deployed OneRec in the main scene of Kuaishou, achieving a 1.6% increase in watch-time, which is a substantial improvement.
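For readers unfamiliar with Direct Preference Optimization, the standard per-pair DPO loss can be written as a short function. This is a sketch of the generic objective only: how OneRec picks the chosen and rejected sessions (via its reward model) and what value of `beta` it uses are not stated in the abstract, so the argument names and the default `beta=0.1` are assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))).

    Each argument is the total log-probability of a full response under either
    the policy being trained (logp_*) or the frozen reference model (ref_logp_*).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) equals softplus(-m); computed in a numerically stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# When the policy equals the reference, the margin is zero and the loss is log(2).
baseline = dpo_loss(-1.0, -2.0, -1.0, -2.0)
# Raising the chosen response's likelihood relative to the reference lowers the loss.
improved = dpo_loss(-0.5, -3.0, -1.0, -2.0)
```

The loss falls as the policy assigns relatively more probability to the preferred output than the reference model does, which is the behavior the alignment step relies on.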
9. 【2502.18927】A Multifacet Hierarchical Sentiment-Topic Model with Application to Multi-Brand Online Review Analysis
Link: https://arxiv.org/abs/2502.18927
Authors: Qiao Liang,Xinwei Deng
Subjects: Information Retrieval (cs.IR); Methodology (stat.ME)
Keywords: Multi-brand analysis based, Multi-brand analysis, analysis based, comments and ratings, commonly used strategy
Comments: 21 pages, 6 figures, 4 tables
Abstract:Multi-brand analysis based on review comments and ratings is a commonly used strategy to compare different brands in marketing. It can help consumers make more informed decisions and help marketers understand their brand's position in the market. In this work, we propose a multifacet hierarchical sentiment-topic model (MH-STM) to detect brand-associated sentiment polarities towards multiple comparative aspects from online customer reviews. The proposed method is built on a unified generative framework that explains review words with a hierarchical brand-associated topic model and the overall polarity score with a regression model on the empirical topic distribution. Moreover, a novel hierarchical Polya urn (HPU) scheme is proposed to enhance the topic-word association within the topic hierarchy, such that the general topics shared by all brands are separated effectively from the unique topics specific to individual brands. The performance of the proposed method is evaluated on both synthetic data and two real-world review corpora. Experimental studies demonstrate that the proposed method can be effective in detecting a reasonable topic hierarchy and deriving accurate brand-associated rankings across multiple aspects.
10. 【2502.18877】Hierarchical corpus encoder: Fusing generative retrieval and dense indices
Link: https://arxiv.org/abs/2502.18877
Authors: Tongfei Chen,Ankita Sharma,Adam Pauls,Benjamin Van Durme
Subjects: Information Retrieval (cs.IR)
Keywords: inter alia, retrieval employs sequence, document IDs based, employs sequence models, DSI
Comments:
Abstract:Generative retrieval employs sequence models for conditional generation of document IDs based on a query (DSI (Tay et al., 2022); NCI (Wang et al., 2022); inter alia). While this has led to improved performance in zero-shot retrieval, it is a challenge to support documents not seen during training. We identify that the performance of generative retrieval lies in contrastive training between sibling nodes in a document hierarchy. This motivates our proposal, the hierarchical corpus encoder (HCE), which can be supported by traditional dense encoders. Our experiments show that HCE achieves better results than generative retrieval models under both unsupervised zero-shot and supervised settings, while also allowing the easy addition and removal of documents to the index.
11. 【2502.18803】On Aggregation Queries over Predicted Nearest Neighbors
Link: https://arxiv.org/abs/2502.18803
Authors: Carrie Wang,Sihem Amer-Yahia,Laks V. S. Lakshmanan,Reynold Cheng
Subjects: Data Structures and Algorithms (cs.DS); Databases (cs.DB); Information Retrieval (cs.IR)
Keywords: introduce Aggregation Queries, Aggregation Queries, designated object, Queries, Queries over Nearest
Comments: 14 pages, 11 figures, 9 tables
Abstract:We introduce Aggregation Queries over Nearest Neighbors (AQNNs), a novel type of aggregation query over the predicted neighborhood of a designated object. AQNNs are prevalent in modern applications where, for instance, a medical professional may want to compute "the average systolic blood pressure of patients whose predicted condition is similar to a given insomnia patient". Since prediction typically involves an expensive deep learning model or a human expert, we formulate query processing as the problem of returning an approximate aggregate by combining an expensive oracle and a cheaper model (e.g., a simple ML model) to compute the predictions. We design the Sampler with Precision-Recall in Target (SPRinT) framework for answering AQNNs. SPRinT consists of sampling, nearest neighbor refinement, and aggregation, and is tailored for various aggregation functions. It enjoys provable theoretical guarantees, including bounds on sample size and on error in approximate aggregates. Our extensive experiments on medical, e-commerce, and video datasets demonstrate that SPRinT consistently achieves the lowest aggregation error with minimal computation cost compared to its baselines. Scalability results show that SPRinT's execution time and aggregation error remain stable as the dataset size increases, confirming its suitability for large-scale applications.
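The general pattern of pairing a cheap proxy with an expensive oracle can be sketched as below. This is not SPRinT: the abstract does not give its refinement step or its error bounds, so the sketch simply lets the proxy pick candidates, checks a random sample of them with the oracle, and averages over the confirmed ones. All function names, thresholds, and patient records are hypothetical.

```python
import random

def approximate_neighbor_average(objects, proxy_is_neighbor, oracle_is_neighbor,
                                 value_of, sample_size, seed=0):
    """Estimate an average over predicted neighbors while limiting oracle calls."""
    # The cheap model screens every object.
    candidates = [o for o in objects if proxy_is_neighbor(o)]
    # The expensive oracle only sees a random sample of the candidates.
    rng = random.Random(seed)
    sample = rng.sample(candidates, min(sample_size, len(candidates)))
    confirmed = [o for o in sample if oracle_is_neighbor(o)]
    if not confirmed:
        return None  # nothing confirmed, so no estimate is available
    return sum(value_of(o) for o in confirmed) / len(confirmed)

# Hypothetical records: distances to the query patient under each predictor.
patients = [
    {"proxy": 0.1, "true": 0.2, "systolic": 120},
    {"proxy": 0.3, "true": 0.9, "systolic": 150},
    {"proxy": 0.2, "true": 0.1, "systolic": 130},
    {"proxy": 0.8, "true": 0.7, "systolic": 160},
]
estimate = approximate_neighbor_average(
    patients,
    proxy_is_neighbor=lambda p: p["proxy"] < 0.5,
    oracle_is_neighbor=lambda p: p["true"] < 0.5,
    value_of=lambda p: p["systolic"],
    sample_size=3,
)
```

Note that any true neighbor the proxy misses is never seen by the oracle, so this naive version can be biased; controlling that kind of precision and recall error is the problem the paper's framework targets.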
12. 【2502.18757】Training Large Recommendation Models via Graph-Language Token Alignment
Link: https://arxiv.org/abs/2502.18757
Authors: Mingdai Yang,Zhiwei Liu,Liangwei Yang,Xiaolong Liu,Chen Wang,Hao Peng,Philip S. Yu
Subjects: Information Retrieval (cs.IR)
Keywords: Recommender systems, helping users efficiently, users efficiently navigate, social platforms, essential tools
Comments: 5 pages. Accepted by WWW'25 as a short paper
Abstract:Recommender systems (RS) have become essential tools for helping users efficiently navigate the overwhelming amount of information on e-commerce and social platforms. However, traditional RS relying on Collaborative Filtering (CF) struggle to integrate the rich semantic information from textual data. Meanwhile, large language models (LLMs) have shown promising results in natural language processing, but directly using LLMs for recommendation introduces challenges, such as ambiguity in generating item predictions and inefficiencies in scalability. In this paper, we propose a novel framework to train large recommendation models via Graph-Language Token Alignment (GLTA). By aligning item and user nodes from the interaction graph with pretrained LLM tokens, GLTA effectively leverages the reasoning abilities of LLMs. Furthermore, we introduce Graph-Language Logits Matching (GLLM) to optimize token alignment for end-to-end item prediction, eliminating the ambiguity of free-form text as recommendation results. Extensive experiments on three benchmark datasets demonstrate the effectiveness of GLTA, with ablation studies validating each component.
13. 【2502.18754】AgentSociety Challenge: Designing LLM Agents for User Modeling and Recommendation on Web Platforms
Link: https://arxiv.org/abs/2502.18754
Authors: Yuwei Yan,Yu Shang,Qingbin Zeng,Yu Li,Keyu Zhao,Zhiheng Zheng,Xuefei Ning,Tianji Wu,Shengen Yan,Yu Wang,Fengli Xu,Yong Li
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: Large Language Model, modeling user behavior, Language Model, enhancing recommender systems, Large Language
Comments: 8 pages, 10 figures, in Proceedings of the ACM Web Conference 2025 (WWW '25)
Abstract:The AgentSociety Challenge is the first competition in the Web Conference that aims to explore the potential of Large Language Model (LLM) agents in modeling user behavior and enhancing recommender systems on web platforms. The Challenge consists of two tracks: the User Modeling Track and the Recommendation Track. Participants are tasked to utilize a combined dataset from Yelp, Amazon, and Goodreads, along with an interactive environment simulator, to develop innovative LLM agents. The Challenge has attracted 295 teams across the globe and received over 1,400 submissions in total over the course of 37 official competition days. The participants have achieved 21.9% and 20.3% performance improvement for Track 1 and Track 2 in the Development Phase, and 9.1% and 15.9% in the Final Phase, representing a significant accomplishment. This paper discusses the detailed designs of the Challenge, analyzes the outcomes, and highlights the most successful LLM agent designs. To support further research and development, we have open-sourced the benchmark environment at this https URL.
14. 【2502.18702】A Cooperative Multi-Agent Framework for Zero-Shot Named Entity Recognition
Link: https://arxiv.org/abs/2502.18702
Authors: Zihan Wang,Ziqi Zhao,Yougang Lyu,Zhumin Chen,Maarten de Rijke,Zhaochun Ren
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Keywords: unannotated text corpora, develop entity recognition, aims to develop, text corpora, entity recognition
Comments: Accepted at WWW 2025
Abstract:Zero-shot named entity recognition (NER) aims to develop entity recognition systems from unannotated text corpora. This task presents substantial challenges due to minimal human intervention. Recent work has adapted large language models (LLMs) for zero-shot NER by crafting specialized prompt templates. It advances model self-learning abilities by incorporating self-annotated demonstrations. However, two important challenges persist: (i) Correlations between contexts surrounding entities are overlooked, leading to wrong type predictions or entity omissions. (ii) The indiscriminate use of task demonstrations, retrieved through shallow similarity-based strategies, severely misleads LLMs during inference. In this paper, we introduce the cooperative multi-agent system (CMAS), a novel framework for zero-shot NER that uses the collective intelligence of multiple agents to address the challenges outlined above. CMAS has four main agents: (i) a self-annotator, (ii) a type-related feature (TRF) extractor, (iii) a demonstration discriminator, and (iv) an overall predictor. To explicitly capture correlations between contexts surrounding entities, CMAS reformulates NER into two subtasks: recognizing named entities and identifying entity type-related features within the target sentence. To enable controllable utilization of demonstrations, a demonstration discriminator is established to incorporate the self-reflection mechanism, automatically evaluating helpfulness scores for the target sentence. Experimental results show that CMAS significantly improves zero-shot NER performance across six benchmarks, including both domain-specific and general-domain scenarios. Furthermore, CMAS demonstrates its effectiveness in few-shot settings and with various LLM backbones.
15. 【2502.18545】PII-Bench: Evaluating Query-Aware Privacy Protection Systems
Link: https://arxiv.org/abs/2502.18545
Authors: Hao Shen,Zhouhong Gu,Haokai Hong,Weili Han
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Keywords: Large Language Models, Large Language, personally identifiable information, adoption of Large, Language Models
Comments:
Abstract:The widespread adoption of Large Language Models (LLMs) has raised significant privacy concerns regarding the exposure of personally identifiable information (PII) in user prompts. To address this challenge, we propose a query-unrelated PII masking strategy and introduce PII-Bench, the first comprehensive evaluation framework for assessing privacy protection systems. PII-Bench comprises 2,842 test samples across 55 fine-grained PII categories, featuring diverse scenarios from single-subject descriptions to complex multi-party interactions. Each sample is carefully crafted with a user query, context description, and standard answer indicating query-relevant PII. Our empirical evaluation reveals that while current models perform adequately in basic PII detection, they show significant limitations in determining PII query relevance. Even state-of-the-art LLMs struggle with this task, particularly in handling complex multi-subject scenarios, indicating substantial room for improvement in achieving intelligent PII masking.
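The query-unrelated masking strategy mentioned in this abstract can be illustrated with a toy example. This sketch assumes PII spans have already been detected and that relevance is decided per PII type; both are simplifications, and the function names, type labels, and sample text are hypothetical rather than taken from PII-Bench.

```python
def find_span(text, value, pii_type):
    """Locate a PII value in the text and return (start, end, pii_type)."""
    start = text.index(value)
    return (start, start + len(value), pii_type)

def mask_query_unrelated_pii(text, pii_spans, relevant_types):
    """Replace every PII span whose type is not needed by the query with a placeholder.

    pii_spans: non-overlapping (start, end, pii_type) tuples.
    """
    pieces = []
    cursor = 0
    for start, end, pii_type in sorted(pii_spans):
        pieces.append(text[cursor:start])
        if pii_type in relevant_types:
            pieces.append(text[start:end])   # keep PII the query depends on
        else:
            pieces.append(f"[{pii_type}]")   # mask PII unrelated to the query
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)

context = "Alice, reachable at alice@example.com, was born on 1990-04-02."
spans = [
    find_span(context, "Alice", "NAME"),
    find_span(context, "alice@example.com", "EMAIL"),
    find_span(context, "1990-04-02", "BIRTH_DATE"),
]
# For a query such as "How old is Alice?", only the name and birth date matter.
masked = mask_query_unrelated_pii(context, spans, relevant_types={"NAME", "BIRTH_DATE"})
```

The hard part the benchmark evaluates is deciding which PII is query-relevant in the first place; here that decision is simply passed in as `relevant_types`.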
16. 【2502.18536】FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA
Link: https://arxiv.org/abs/2502.18536
Authors: S M Sarwar
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: Visual Question Answering, Question Answering requires, generate accurate answers, Answering requires models, textual understanding
Comments: 12 pages, 6 figures and 2 tables
Abstract:Visual Question Answering requires models to generate accurate answers by integrating visual and textual understanding. However, VQA models still struggle with hallucinations, producing convincing but incorrect answers, particularly in knowledge-driven and Out-of-Distribution scenarios. We introduce FilterRAG, a retrieval-augmented framework that combines BLIP-VQA with Retrieval-Augmented Generation to ground answers in external knowledge sources like Wikipedia and DBpedia. FilterRAG achieves 36.5% accuracy on the OK-VQA dataset, demonstrating its effectiveness in reducing hallucinations and improving robustness in both in-domain and Out-of-Distribution settings. These findings highlight the potential of FilterRAG to improve Visual Question Answering systems for real-world deployment.
17. 【2502.18495】A Comprehensive Survey on Composed Image Retrieval
Link: https://arxiv.org/abs/2502.18495
Authors: Xuemeng Song,Haoqiang Lin,Haokun Wen,Bohan Hou,Mingzhu Xu,Liqiang Nie
Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Keywords: Composed Image Retrieval, Composed Image, Image Retrieval, reference image, target images
Comments:
Abstract:Composed Image Retrieval (CIR) is an emerging yet challenging task that allows users to search for target images using a multimodal query, comprising a reference image and a modification text specifying the user's desired changes to the reference image. Given its significant academic and practical value, CIR has become a rapidly growing area of interest in the computer vision and machine learning communities, particularly with the advances in deep learning. To the best of our knowledge, there is currently no comprehensive review of CIR to provide a timely overview of this field. Therefore, we synthesize insights from over 120 publications in top conferences and journals, including ACM TOIS, SIGIR, and CVPR. In particular, we systematically categorize existing supervised CIR and zero-shot CIR models using a fine-grained taxonomy. For a comprehensive review, we also briefly discuss approaches for tasks closely related to CIR, such as attribute-based CIR and dialog-based CIR. Additionally, we summarize benchmark datasets for evaluation and analyze existing supervised and zero-shot CIR methods by comparing experimental results across multiple datasets. Furthermore, we present promising future directions in this field, offering practical insights for researchers interested in further exploration.
18. 【2502.18484】AI Enhanced Ontology Driven NLP for Intelligent Cloud Resource Query Processing Using Knowledge Graphs
Link: https://arxiv.org/abs/2502.18484
Authors: Krishna Chaitanya Sunkara(Independent Researcher),Krishnaiah Narukulla(Independent Researcher)
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: demand exact matches, significant user effort, cloud infrastructure relies, searches or GUIDs, infrastructure relies
Comments: 8 pages, 5 figures, 4 tables. This paper has not been published elsewhere yet. The experimental setup may be revised using real-time resources. Authors: Krishna Chaitanya Sunkara (IEEE Senior Member, Raleigh, NC, USA, Independent Researcher), Krishnaiah Narukulla (IEEE Senior Member, San Jose, CA, USA, Independent Researcher)
Abstract:Conventional resource search in cloud infrastructure relies on keyword-based searches or GUIDs, which demand exact matches and significant user effort to locate resources. These approaches often fail to interpret the intent behind natural language queries, making resource discovery inefficient and inaccessible to users. Although some NLP-based search engines exist, they are limited and focus mainly on analyzing the query itself and extracting identifiers to find resources. They cannot search for resources based on their behavior, operations, capabilities, relationships, features, business relevance, dynamically changing state, or the knowledge these resources hold. Search criteria have been changing with the influx of AI-based services, which involve discovering not just the requested resources and identifiers but also insights. The real intent of a search has never been merely to list resources; it carries context, such as understanding the causes of some system behavior, compliance checks, capacity estimations, network constraints, troubleshooting, or business insights. This paper proposes an advanced Natural Language Processing (NLP) approach enhanced by ontology-based semantics to enable intuitive, human-readable queries that allow users to discover the intent of the search itself. By constructing an ontology of cloud resources, their interactions, and behaviors, the proposed framework enables dynamic intent extraction and relevance ranking using Latent Semantic Indexing (LSI) and AI models. It introduces an automated pipeline that integrates ontology extraction by AI-powered data crawlers, building a semantic knowledge base for context-aware resource discovery.
19. 【2502.18483】Modeling Churn in Recommender Systems with Aggregated Preferences
Link: https://arxiv.org/abs/2502.18483
Authors: Gur Keinan,Omer Ben-Porat
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: technological shifts necessitate, shifts necessitate reliance, individual user data, extensive individual user, recommender systems
Comments:
Abstract:While recommender systems (RSs) traditionally rely on extensive individual user data, regulatory and technological shifts necessitate reliance on aggregated user information. This shift significantly impacts the recommendation process, requiring RSs to engage in intensive exploration to identify user preferences. However, this approach risks user churn due to potentially unsatisfactory recommendations. In this paper, we propose a model that addresses the dual challenges of leveraging aggregated user information and mitigating churn risk. Our model assumes that the RS operates with a probabilistic prior over user types and aggregated satisfaction levels for various content types. We demonstrate that optimal policies naturally transition from exploration to exploitation in finite time, develop a branch-and-bound algorithm for computing these policies, and empirically validate its effectiveness.
20. 【2502.18482】MixLLM: Dynamic Routing in Mixed Large Language Models
Link: https://arxiv.org/abs/2502.18482
Authors: Xinyuan Wang,Yanchi Liu,Wei Cheng,Xujiang Zhao,Zhengzhang Chen,Wenchao Yu,Yanjie Fu,Haifeng Chen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB); Information Retrieval (cs.IR)
Keywords: Large Language Models, Large Language, exhibit potential artificial, generic intelligence recently, potential artificial generic
Comments: 11 pages, 7 figures, accepted by NAACL 2025 main conference
Abstract:Large Language Models (LLMs) have recently exhibited potential artificial generic intelligence; however, their usage is costly and comes with high response latency. Given mixed LLMs with their own strengths and weaknesses, LLM routing aims to identify the most suitable model for each query in the stream to maximize response quality and minimize cost and latency. However, the challenges involve: (1) dynamic trade-offs among quality, cost, and latency; (2) enabling continual learning in deployed systems; and (3) navigating a varying (e.g., new LLM addition or old LLM removal) set of LLM candidates over time. To bridge these gaps, we develop MixLLM, a dynamic contextual-bandit-based routing system for query-LLM assignment. Specifically, we first leverage query tags to enhance query embeddings for the routing task. Next, we design lightweight prediction models to estimate the response qualities and costs of queries over LLMs. We then devise a meta-decision maker to choose the query-LLM assignments that best trade off response quality, cost, and latency. Finally, the system benefits from continual training, allowing it to adapt to evolving queries and user feedback over time. Our extensive experiments show that MixLLM achieves the best trade-offs in response quality, cost, and latency (97.25% of GPT-4's quality at 24.18% of the cost under the time constraint).
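The final routing decision described in this abstract, trading off predicted quality against cost under a latency limit, can be sketched as a simple scoring rule. This is an illustration only, not MixLLM's meta-decision maker or its contextual bandit: the linear score, the `cost_weight` parameter, the model names, and the numbers are all assumptions.

```python
def route_query(predictions, cost_weight=1.0, max_latency=None):
    """Pick the model with the best quality-minus-weighted-cost score within a latency limit.

    predictions: {model_name: {"quality": float, "cost": float, "latency": float}}
    """
    best_model, best_score = None, float("-inf")
    for name, p in predictions.items():
        if max_latency is not None and p["latency"] > max_latency:
            continue  # skip models predicted to be too slow
        score = p["quality"] - cost_weight * p["cost"]
        if score > best_score:
            best_model, best_score = name, score
    return best_model

# Hypothetical per-query estimates, standing in for the lightweight predictors' output.
predictions = {
    "large":  {"quality": 0.95, "cost": 0.60, "latency": 4.0},
    "medium": {"quality": 0.90, "cost": 0.15, "latency": 1.5},
    "small":  {"quality": 0.70, "cost": 0.02, "latency": 0.5},
}
balanced = route_query(predictions, cost_weight=0.5)
quality_only = route_query(predictions, cost_weight=0.0)
fast_only = route_query(predictions, cost_weight=0.0, max_latency=2.0)
```

Changing `cost_weight` moves the choice along the quality-cost trade-off, which is the knob a deployed router would need to adjust dynamically.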
21. 【2502.18481】MDE: Modality Discrimination Enhancement for Multi-modal Recommendation
Link: https://arxiv.org/abs/2502.18481
Authors: Hang Zhou,Yucheng Wang,Huijing Zhan
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Keywords: user behavior data, recommendation systems aim, item content features, behavior data, systems aim
Comments:
Abstract:Multi-modal recommendation systems aim to enhance performance by integrating an item's content features across various modalities with user behavior data. Effective utilization of features from different modalities requires addressing two challenges: preserving semantic commonality across modalities (modality-shared) and capturing unique characteristics for each modality (modality-specific). Most existing approaches focus on aligning feature spaces across modalities, which helps represent modality-shared features. However, modality-specific distinctions are often neglected, especially when there are significant semantic variations between modalities. To address this, we propose a Modality Distinctiveness Enhancement (MDE) framework that prioritizes extracting modality-specific information to improve recommendation accuracy while maintaining shared features. MDE enhances differences across modalities through a novel multi-modal fusion module and introduces a node-level trade-off mechanism to balance cross-modal alignment and differentiation. Extensive experiments on three public datasets show that our approach significantly outperforms other state-of-the-art methods, demonstrating the effectiveness of jointly considering modality-shared and modality-specific features.
22. 【2502.18480】QExplorer: Large Language Model Based Query Extraction for Toxic Content Exploration
Link: https://arxiv.org/abs/2502.18480
Authors: Shaola Ren, Li Ke, Longtao Huang, Dehong Gao, Hui Xue
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Keywords: Automatically extracting effective, Large Language Model, Automatically extracting, extracting effective queries, toxic content exploration
Notes:
Abstract:Automatically extracting effective queries is challenging in information retrieval, especially in toxic content exploration, as such content is likely to be disguised. With the recent achievements in generative Large Language Model (LLM), we are able to leverage the capabilities of LLMs to extract effective queries for similar content exploration directly. This study proposes QExplorer, an approach of large language model based Query Extraction for toxic content Exploration. The QExplorer approach involves a 2-stage training process: instruction Supervised FineTuning (SFT) and preference alignment using Direct Preference Optimization (DPO), as well as the datasets construction with feedback of search system. To verify the effectiveness of QExplorer, a series of offline and online experiments are conducted on our real-world system. The offline empirical results demonstrate that the performance of our automatic query extraction outperforms that of several LLMs and humans. The online deployment shows a significant increase in the detection of toxic items.
23. 【2502.18479】Disrupt Your Research Using Generative AI Powered ScienceSage
Link: https://arxiv.org/abs/2502.18479
Authors: Yong Zhang, Eric Herrison Gyamfi, Kelly Anderson, Sasha Roberts, Matt Barker
Categories: Information Retrieval (cs.IR)
Keywords: Large Language Models, Large Language, Language Models, subjects and industries, disrupting science
Notes: This paper has been accepted by Workshop of Deployable AI at AAAI 2025
Abstract:Large Language Models (LLMs) are disrupting science and research in different subjects and industries. Here we report a minimum-viable-product (MVP) web application called ScienceSage. It leverages generative artificial intelligence (GenAI) to help researchers disrupt the speed, magnitude and scope of product innovation. ScienceSage enables researchers to build, store, update and query a knowledge base (KB). A KB codifies a user's knowledge/information of a given domain in both a vector index and a knowledge graph (KG) index for efficient information retrieval and querying. The knowledge/information can be extracted from the user's textual documents, images, videos, audio and/or the research reports generated based on a research question and the latest relevant information on the internet. The same set of KBs interconnects three functions on ScienceSage: 'Generate Research Report', 'Chat With Your Documents' and 'Chat With Anything'. We share our learnings to encourage discussion and improvement of GenAI's role in scientific research.
24. 【2502.18478】Beyond Self-Consistency: Loss-Balanced Perturbation-Based Regularization Improves Industrial-Scale Ads Ranking
Link: https://arxiv.org/abs/2502.18478
Authors: Ilqar Ramazanli, Hamid Eghbalzadeh, Xiaoyi Liu, Yang Wang, Jiaxiang Fu, Kaushik Rangadurai, Sem Park, Bo Long, Xue Feng
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Perturbation-based regularization techniques, Small Perturbation Regularization, perturbation-based regularization algorithms, Perturbation-based regularization, sparse labels
Notes:
Abstract:Perturbation-based regularization techniques address many challenges in industrial-scale large models, particularly with sparse labels, and emphasize consistency and invariance for perturbation in model predictions. One of the popular regularization techniques has been various forms of self-consistency, which involve making small modifications to input data while preserving contextual information and enforcing similar predictions through auxiliary loss functions. In this work, we explore the first successful application of perturbation-based regularization algorithms in large-scale ads ranking models, and further propose a novel regularization algorithm, namely Loss-Balanced Small Perturbation Regularization (LSPR), that can be used in potentially any deep learning model. We successfully demonstrate that both Self-Consistency Regularization approaches (SCR) and LSPR are scalable and can improve ads delivery systems. By conducting industrial-scale experiments and numerical analysis, we additionally show that our proposed LSPR performs consistently better than SCR across various groups and signal availability setups. Finally, we report a successful application of the proposed LSPR in a billion-scale industrial ranking system, which, to the best of our knowledge, is the first of its kind, and it is specially designed to address the various scalability challenges (e.g., various surfaces, geographical locations, clients and so on) as we will mention in this paper.
25. 【2502.18477】Recommendations Beyond Catalogs: Diffusion Models for Personalized Generation
Link: https://arxiv.org/abs/2502.18477
Authors: Gabriel Patron, Zhiwei Xu, Ishan Kapnadak, Felipe Maia Polo
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Modern recommender systems, Modern recommender, follow the guiding, guiding principle, principle of serving
Notes:
Abstract:Modern recommender systems follow the guiding principle of serving the right user the right item at the right time. One of their main limitations is that they are typically restricted to items already in the catalog. We propose REcommendations BEyond CAtalogs, REBECA, a new class of probabilistic diffusion-based recommender systems that synthesize new items tailored to individual tastes rather than retrieve items from the catalog. REBECA combines efficient training in embedding space with a novel diffusion prior that only requires users' past ratings of items. We evaluate REBECA on real-world data and propose novel personalization metrics for generative recommender systems. Extensive experiments demonstrate that REBECA produces high-quality, personalized recommendations, generating images that align with users' unique preferences.
26. 【2502.18471】FinBloom: Knowledge Grounding Large Language Model with Real-time Financial Data
Link: https://arxiv.org/abs/2502.18471
Authors: Ankur Sinha, Chaitanya Agarwal, Pekka Malo
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Statistical Finance (q-fin.ST)
Keywords: Large language models, generating human-like responses, Large language, Financial Context Dataset, Financial
Notes: 27 pages, 9 tables
Abstract:Large language models (LLMs) excel at generating human-like responses but often struggle with interactive tasks that require access to real-time information. This limitation poses challenges in finance, where models must access up-to-date information, such as recent news or price movements, to support decision-making. To address this, we introduce Financial Agent, a knowledge-grounding approach for LLMs to handle financial queries using real-time text and tabular data. Our contributions are threefold: First, we develop a Financial Context Dataset of over 50,000 financial queries paired with the required context. Second, we train FinBloom 7B, a custom 7 billion parameter LLM, on 14 million financial news articles from Reuters and Deutsche Presse-Agentur, alongside 12 million Securities and Exchange Commission (SEC) filings. Third, we fine-tune FinBloom 7B using the Financial Context Dataset to serve as a Financial Agent. This agent generates relevant financial context, enabling efficient real-time data retrieval to answer user queries. By reducing latency and eliminating the need for users to manually provide accurate data, our approach significantly enhances the capability of LLMs to handle dynamic financial tasks. Our proposed approach makes real-time financial decisions, algorithmic trading and other related tasks streamlined, and is valuable in contexts with high-velocity data flows.
27. 【2502.18470】Spatial-RAG: Spatial Retrieval Augmented Generation for Real-World Spatial Reasoning Questions
Link: https://arxiv.org/abs/2502.18470
Authors: Dazhou Yu, Riyang Bao, Gengchen Mai, Liang Zhao
Categories: Information Retrieval (cs.IR); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Keywords: Large Language Models, Language Models, Large Language, Spatial reasoning remains, challenge for Large
Notes:
Abstract:Spatial reasoning remains a challenge for Large Language Models (LLMs), which struggle with spatial data retrieval and reasoning. We propose Spatial Retrieval-Augmented Generation (Spatial-RAG), a framework that extends RAG to spatial tasks by integrating sparse spatial retrieval (spatial databases) and dense semantic retrieval (LLM-based similarity). A multi-objective ranking strategy balances spatial constraints and semantic relevance, while an LLM-guided generator ensures coherent responses. Experiments on a real-world tourism dataset show that Spatial-RAG significantly improves spatial question answering, bridging the gap between LLMs and spatial intelligence.
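As a rough illustration of balancing spatial constraints against semantic relevance, the sketch below ranks candidates by a weighted sum of two scores. The weighted sum, the exponential distance decay, the `weight` and `scale_km` parameters, and the toy places are assumptions made for this example; the abstract does not specify the paper's actual multi-objective ranking strategy.

```python
import math


def spatial_score(point, target, scale_km=1.0):
    """Map planar distance (km) to (0, 1]; closer candidates score higher."""
    dist = math.hypot(point[0] - target[0], point[1] - target[1])
    return math.exp(-dist / scale_km)


def rank(candidates, target, weight=0.5):
    """Sort candidates by weight * spatial + (1 - weight) * semantic, best first."""
    def combined(c):
        return weight * spatial_score(c["xy"], target) + (1 - weight) * c["semantic"]
    return sorted(candidates, key=combined, reverse=True)


# Hypothetical candidates: planar coordinates in km plus a precomputed
# semantic similarity to the question (e.g. from an embedding model).
places = [
    {"name": "cafe_a", "xy": (0.1, 0.0), "semantic": 0.4},
    {"name": "cafe_b", "xy": (2.0, 0.0), "semantic": 0.9},
    {"name": "cafe_c", "xy": (0.5, 0.5), "semantic": 0.7},
]
for place in rank(places, target=(0.0, 0.0)):
    print(place["name"])
```

Setting `weight=0.0` ranks purely by semantic relevance and `weight=1.0` purely by proximity, which makes the effect of the trade-off easy to inspect.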
28. 【2502.18469】Using LLM-Based Approaches to Enhance and Automate Topic Labeling
Link: https://arxiv.org/abs/2502.18469
Authors: Trishia Khandelwal
Categories: Information Retrieval (cs.IR)
Keywords: analyzing text data, extracting meaningful insights, Large Language Models, text data, analyzing text
Notes: 7 pages, 2 tables
Abstract:Topic modeling has become a crucial method for analyzing text data, particularly for extracting meaningful insights from large collections of documents. However, the output of these models typically consists of lists of keywords that require manual interpretation for precise labeling. This study explores the use of Large Language Models (LLMs) to automate and enhance topic labeling by generating more meaningful and contextually appropriate labels. After applying BERTopic for topic modeling, we explore different approaches to select keywords and document summaries within each topic, which are then fed into an LLM to generate labels. Each approach prioritizes different aspects, such as dominant themes or diversity, to assess their impact on label quality. Additionally, recognizing the lack of quantitative methods for evaluating topic labels, we propose a novel metric that measures how semantically representative a label is of all documents within a topic.
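One simple way to score how representative a label is of a topic's documents is the mean cosine similarity between the label embedding and each document embedding. The sketch below shows that idea only; it is an assumption for illustration, not the metric proposed in the paper, and the two-dimensional vectors stand in for real sentence embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def label_representativeness(label_vec, doc_vecs):
    """Mean cosine similarity between a label embedding and each document embedding."""
    return sum(cosine(label_vec, d) for d in doc_vecs) / len(doc_vecs)


# Toy embeddings (hypothetical values).
docs = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8]]
good_label = [1.0, 0.5]
poor_label = [-1.0, 0.2]
print(label_representativeness(good_label, docs) > label_representativeness(poor_label, docs))
```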
Computer Vision
1. 【2502.19409】ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models
Link: https://arxiv.org/abs/2502.19409
Authors: Danae Sánchez Villegas, Ingo Ziegler, Desmond Elliott
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
Keywords: large language models, multimodal large language, remains a challenge, large language, language models
Notes: Code, dataset, and checkpoints are publicly available at [this https URL](https://github.com/danaesavi/ImageChain)
Abstract:Reasoning over sequences of images remains a challenge for multimodal large language models (MLLMs). While recent models incorporate multi-image data during pre-training, they still struggle to recognize sequential structures, often treating images independently. This work introduces ImageChain, a framework that enhances MLLMs with sequential reasoning capabilities over image data by modeling visual sequences as a multi-turn conversation. In ImageChain, images are interleaved with corresponding textual descriptions to form a controlled dialogue that explicitly captures temporal dependencies and narrative progression. Our method optimizes for the task of next-scene description, where the model generates a context-aware description of an upcoming scene based on preceding visual and textual cues. We demonstrate that our approach improves performance on the next-scene description task -- achieving an average improvement from 3.7% to 19% in SimRate, a metric that quantifies semantic similarity to human-annotated ground truths. Moreover, ImageChain achieves robust zero-shot out-of-domain performance in applications ranging from comics to robotics. Extensive experiments validate that instruction-tuning in a multimodal, multi-turn conversation design is key to bridging the gap between static image understanding and temporally-aware reasoning.
2. 【2502.19400】TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding
Link: https://arxiv.org/abs/2502.19400
Authors: Max Ku, Thomas Chong, Jonathan Leung, Krish Shah, Alvin Yu, Wenhu Chen
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Keywords: Understanding domain-specific theorems, Understanding domain-specific, effective communication, structured visual explanations, communication through structured
Notes:
Abstract:Understanding domain-specific theorems often requires more than just text-based reasoning; effective communication through structured visual explanations is crucial for deeper comprehension. While large language models (LLMs) demonstrate strong performance in text-based theorem reasoning, their ability to generate coherent and pedagogically meaningful visual explanations remains an open challenge. In this work, we introduce TheoremExplainAgent, an agentic approach for generating long-form theorem explanation videos (over 5 minutes) using Manim animations. To systematically evaluate multimodal theorem explanations, we propose TheoremExplainBench, a benchmark covering 240 theorems across multiple STEM disciplines, along with 5 automated evaluation metrics. Our results reveal that agentic planning is essential for generating detailed long-form videos, and the o3-mini agent achieves a success rate of 93.8% and an overall score of 0.77. However, our quantitative and qualitative studies show that most of the videos produced exhibit minor issues with visual element layout. Furthermore, multimodal explanations expose deeper reasoning flaws that text-based explanations fail to reveal, highlighting the importance of multimodal explanations.
3. 【2502.19337】Consistent Amortized Clustering via Generative Flow Networks
Link: https://arxiv.org/abs/2502.19337
Authors: Irit Chelly, Roy Uziel, Oren Freifeld, Ari Pakman
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Keywords: avoiding lengthy Markov, lengthy Markov chain, Markov chain runs, lengthy Markov, Markov chain
Notes: Accepted to AISTATS 2025 on January 21, 2025
Abstract:Neural models for amortized probabilistic clustering yield samples of cluster labels given a set-structured input, while avoiding lengthy Markov chain runs and the need for explicit data likelihoods. Existing methods which label each data point sequentially, like the Neural Clustering Process, often lead to cluster assignments highly dependent on the data order. Alternatively, methods that sequentially create full clusters, do not provide assignment probabilities. In this paper, we introduce GFNCP, a novel framework for amortized clustering. GFNCP is formulated as a Generative Flow Network with a shared energy-based parametrization of policy and reward. We show that the flow matching conditions are equivalent to consistency of the clustering posterior under marginalization, which in turn implies order invariance. GFNCP also outperforms existing methods in clustering performance on both synthetic and real-world data.
4. 【2502.19318】Does 3D Gaussian Splatting Need Accurate Volumetric Rendering?
Link: https://arxiv.org/abs/2502.19318
Authors: Adam Celarek, George Kopanas, George Drettakis, Michael Wimmer, Bernhard Kerbl
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Keywords: allowing real-time novel-view, fast training times, important reference method, real-time novel-view synthesis, Gaussian Splatting
Notes: To be published in Eurographics 2025, code: [this https URL](https://github.com/cg-tuwien/does_3d_gaussian_splatting_need_accurate_volumetric_rendering)
Abstract:Since its introduction, 3D Gaussian Splatting (3DGS) has become an important reference method for learning 3D representations of a captured scene, allowing real-time novel-view synthesis with high visual quality and fast training times. Neural Radiance Fields (NeRFs), which preceded 3DGS, are based on a principled ray-marching approach for volumetric rendering. In contrast, while sharing a similar image formation model with NeRF, 3DGS uses a hybrid rendering solution that builds on the strengths of volume rendering and primitive rasterization. A crucial benefit of 3DGS is its performance, achieved through a set of approximations, in many cases with respect to volumetric rendering theory. A naturally arising question is whether replacing these approximations with more principled volumetric rendering solutions can improve the quality of 3DGS. In this paper, we present an in-depth analysis of the various approximations and assumptions used by the original 3DGS solution. We demonstrate that, while more accurate volumetric rendering can help for low numbers of primitives, the power of efficient optimization and the large number of Gaussians allows 3DGS to outperform volumetric rendering despite its approximations.
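The image formation model that 3DGS shares with NeRF accumulates color front to back along a ray, weighting each contribution by its opacity and by the transmittance left over from everything in front of it. The sketch below shows that compositing rule for a single grayscale ray; the function name `composite` and the sample values are made up for illustration, and none of the specific approximations analyzed in the paper are modeled.

```python
def composite(samples):
    """Front-to-back alpha blending of (color, alpha) pairs sorted near to far."""
    color = 0.0
    transmittance = 1.0  # fraction of light not yet blocked by nearer samples
    for c, alpha in samples:
        color += transmittance * alpha * c
        transmittance *= 1.0 - alpha
    return color, transmittance


# Grayscale toy example: two semi-transparent primitives along one ray.
color, remaining = composite([(1.0, 0.5), (0.2, 0.5)])
print(color, remaining)
```

The first sample contributes 0.5 and leaves half the light; the second contributes 0.5 × 0.5 × 0.2 = 0.05, so the ray accumulates roughly 0.55 with a quarter of the transmittance remaining.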
5. 【2502.19316】Model Adaptation: Unsupervised Domain Adaptation without Source Data
Link: https://arxiv.org/abs/2502.19316
Authors: Rui Li, Qianfen Jiao, Wenming Cao, Hau-San Wong, Si Wu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: unsupervised domain adaptation, source data, data, unsupervised model adaptation, challenging unsupervised domain
Notes: accepted by CVPR2020
Abstract:In this paper, we investigate a challenging unsupervised domain adaptation setting -- unsupervised model adaptation. We aim to explore how to rely only on unlabeled target data to improve performance of an existing source prediction model on the target domain, since labeled source data may not be available in some real-world scenarios due to data privacy issues. For this purpose, we propose a new framework, which is referred to as collaborative class conditional generative adversarial net to bypass the dependence on the source data. Specifically, the prediction model is to be improved through generated target-style data, which provides more accurate guidance for the generator. As a result, the generator and the prediction model can collaborate with each other without source data. Furthermore, due to the lack of supervision from source data, we propose a weight constraint that encourages similarity to the source model. A clustering-based regularization is also introduced to produce more discriminative features in the target domain. Compared to conventional domain adaptation methods, our model achieves superior performance on multiple adaptation tasks with only unlabeled target data, which verifies its effectiveness in this challenging setting.
6. 【2502.19313】CoopDETR: A Unified Cooperative Perception Framework for 3D Detection via Object Query
Link: https://arxiv.org/abs/2502.19313
Authors: Zhe Wang, Shaocong Xu, Xucai Zhuang, Tongda Xu, Yan Wang, Jingjing Liu, Yilun Chen, Ya-Qin Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: individual perception capabilities, Cooperative perception enhances, autonomous vehicles, enhances the individual, capabilities of autonomous
Notes: 8 pages, 8 figures, ICRA 2025
Abstract:Cooperative perception enhances the individual perception capabilities of autonomous vehicles (AVs) by providing a comprehensive view of the environment. However, balancing perception performance and transmission costs remains a significant challenge. Current approaches that transmit region-level features across agents are limited in interpretability and demand substantial bandwidth, making them unsuitable for practical applications. In this work, we propose CoopDETR, a novel cooperative perception framework that introduces object-level feature cooperation via object query. Our framework consists of two key modules: single-agent query generation, which efficiently encodes raw sensor data into object queries, reducing transmission cost while preserving essential information for detection; and cross-agent query fusion, which includes Spatial Query Matching (SQM) and Object Query Aggregation (OQA) to enable effective interaction between queries. Our experiments on the OPV2V and V2XSet datasets demonstrate that CoopDETR achieves state-of-the-art performance and significantly reduces transmission costs to 1/782 of previous methods.
7. 【2502.19293】Pathology Report Generation and Multimodal Representation Learning for Cutaneous Melanocytic Lesions
Link: https://arxiv.org/abs/2502.19293
Authors: Ruben T. Lucassen, Sander P.J. Moonemans, Tijn van de Luijtgaarden, Gerben E. Breimer, Willeke A.M. Blokx, Mitko Veta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: ordinary moles, concern common nevi, melanocytic skin lesions, Millions, Millions of melanocytic
Notes: 11 pages, 2 figures
Abstract:Millions of melanocytic skin lesions are examined by pathologists each year, the majority of which concern common nevi (i.e., ordinary moles). While most of these lesions can be diagnosed in seconds, writing the corresponding pathology report is much more time-consuming. Automating part of the report writing could, therefore, alleviate the increasing workload of pathologists. In this work, we develop a vision-language model specifically for the pathology domain of cutaneous melanocytic lesions. The model follows the Contrastive Captioner framework and was trained and evaluated using a melanocytic lesion dataset of 42,512 HE-stained whole slide images and 19,645 corresponding pathology reports. Our results show that the quality scores of model-generated reports were on par with pathologist-written reports for common nevi, as assessed by an expert pathologist in a reader study. While report generation proved to be more difficult for rare melanocytic lesion subtypes, the cross-modal retrieval performance for these cases was considerably better.
8. 【2502.19285】On the Importance of Text Preprocessing for Multimodal Representation Learning and Pathology Report Generation
Link: https://arxiv.org/abs/2502.19285
Authors: Ruben T. Lucassen, Tijn van de Luijtgaarden, Sander P.J. Moonemans, Gerben E. Breimer, Willeke A.M. Blokx, Mitko Veta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: enable multimodal case, pathology enable multimodal, reports, generated reports, pathology reports
Notes: 11 pages, 1 figure
Abstract:Vision-language models in pathology enable multimodal case retrieval and automated report generation. Many of the models developed so far, however, have been trained on pathology reports that include information which cannot be inferred from paired whole slide images (e.g., patient history), potentially leading to hallucinated sentences in generated reports. To this end, we investigate how the selection of information from pathology reports for vision-language modeling affects the quality of the multimodal representations and generated reports. More concretely, we compare a model trained on full reports against a model trained on preprocessed reports that only include sentences describing the cell and tissue appearances based on the HE-stained slides. For the experiments, we built upon the BLIP-2 framework and used a cutaneous melanocytic lesion dataset of 42,433 HE-stained whole slide images and 19,636 corresponding pathology reports. Model performance was assessed using image-to-text and text-to-image retrieval, as well as qualitative evaluation of the generated reports by an expert pathologist. Our results demonstrate that text preprocessing prevents hallucination in report generation. Despite the improvement in the quality of the generated reports, training the vision-language model on full reports showed better cross-modal retrieval performance.
9. 【2502.19269】Neural Antidote: Class-Wise Prompt Tuning for Purifying Backdoors in Pre-trained Vision-Language Models
Link: https://arxiv.org/abs/2502.19269
Authors: Jiawei Kong, Hao Fang, Sihang Guo, Chenxi Qing, Bin Chen, Bin Wang, Shu-Tao Xia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: CLIP exhibit excellent, exhibit excellent representational, excellent representational capabilities, CLIP exhibit, pre-trained Vision-Language Models
Notes:
Abstract:While pre-trained Vision-Language Models (VLMs) such as CLIP exhibit excellent representational capabilities for multimodal data, recent studies have shown that they are vulnerable to backdoor attacks. To alleviate the threat, existing defense strategies primarily focus on fine-tuning the entire suspicious model, yet offer only marginal resistance to state-of-the-art attacks and often result in a decrease in clean accuracy, particularly in data-limited scenarios. Their failure may be attributed to the mismatch between insufficient fine-tuning data and massive parameters in VLMs. To address this challenge, we propose Class-wise Backdoor Prompt Tuning (CBPT) defense, an efficient and effective method that operates on the text prompts to indirectly purify the poisoned VLMs. Specifically, we first employ contrastive learning with our carefully crafted positive and negative samples to effectively invert the backdoor triggers that are potentially adopted by the attacker. Once the dummy trigger is established, we utilize the efficient prompt tuning technique to optimize these class-wise text prompts, modifying the model's decision boundary to reclassify the feature regions of backdoor triggers. Extensive experiments demonstrate that CBPT significantly mitigates backdoor threats while preserving model utility, e.g., an average Clean Accuracy (CA) of 58.86% and an Attack Success Rate (ASR) of 0.39% across seven mainstream backdoor attacks. These results underscore the superiority of our prompt purifying design to strengthen model robustness against backdoor attacks.
10. 【2502.19260】EMT: A Visual Multi-Task Benchmark Dataset for Autonomous Driving in the Arab Gulf Region
Link: https://arxiv.org/abs/2502.19260
Authors: Nadya Abdel Madjid, Murad Mebrahtu, Abdelmoamen Nasser, Bilal Hassan, Naoufel Werghi, Jorge Dias, Majid Khonji
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Arab Gulf region, Arab Gulf, Emirates Multi-Task, Gulf region, introduces the Emirates
Notes: 19 pages, 6 figures
Abstract:This paper introduces the Emirates Multi-Task (EMT) dataset - the first publicly available dataset for autonomous driving collected in the Arab Gulf region. The EMT dataset captures the unique road topology, high traffic congestion, and distinctive characteristics of the Gulf region, including variations in pedestrian clothing and weather conditions. It contains over 30,000 frames from a dash-camera perspective, along with 570,000 annotated bounding boxes, covering approximately 150 kilometers of driving routes. The EMT dataset supports three primary tasks: tracking, trajectory forecasting and intention prediction. Each benchmark dataset is complemented with corresponding evaluations: (1) multi-agent tracking experiments, focusing on multi-class scenarios and occlusion handling; (2) trajectory forecasting evaluation using deep sequential and interaction-aware models; and (3) intention benchmark experiments conducted for predicting agents' intentions from observed trajectories. The dataset is publicly available at this https URL, and pre-processing scripts along with evaluation models can be accessed at this https URL.
11. 【2502.19250】ObjectVLA: End-to-End Open-World Object Manipulation Without Demonstration
Link: https://arxiv.org/abs/2502.19250
Authors: Minjie Zhu, Yichen Zhu, Jinming Li, Zhongyi Zhou, Junjie Wen, Xiaoyu Liu, Chaomin Shen, Yaxin Peng, Feifei Feng
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: dexterous manipulation skills, teaching robots dexterous, robots dexterous manipulation, Imitation learning, dexterous manipulation
Notes: Project page at [this https URL](https://objectvla.github.io/)
Abstract:Imitation learning has proven to be highly effective in teaching robots dexterous manipulation skills. However, it typically relies on large amounts of human demonstration data, which limits its scalability and applicability in dynamic, real-world environments. One key challenge in this context is object generalization, where a robot trained to perform a task with one object, such as "hand over the apple," struggles to transfer its skills to a semantically similar but visually different object, such as "hand over the peach." This gap in generalization to new objects beyond those in the same category has yet to be adequately addressed in previous work on end-to-end visuomotor policy learning. In this paper, we present a simple yet effective approach for achieving object generalization through Vision-Language-Action (VLA) models, referred to as ObjectVLA. Our model enables robots to generalize learned skills to novel objects without requiring explicit human demonstrations for each new target object. By leveraging vision-language pair data, our method provides a lightweight and scalable way to inject knowledge about the target object, establishing an implicit link between the object and the desired action. We evaluate ObjectVLA on a real robotic platform, demonstrating its ability to generalize across 100 novel objects with a 64% success rate in selecting objects not seen during training. Furthermore, we propose a more accessible method for enhancing object generalization in VLA models, using a smartphone to capture a few images and fine-tune the pre-trained model. These results highlight the effectiveness of our approach in enabling object-level generalization and reducing the need for extensive human demonstrations, paving the way for more flexible and scalable robotic learning systems.
12. 【2502.19247】ProxyTransformation: Preshaping Point Cloud Manifold With Proxy Attention For 3D Visual Grounding
Link: https://arxiv.org/abs/2502.19247
Authors: Qihang Peng, Henry Zheng, Gao Huang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Embodied intelligence requires, real time based, Embodied intelligence, intelligence requires agents, point cloud
Notes: 11 pages, 3 figures
Abstract:Embodied intelligence requires agents to interact with 3D environments in real time based on language instructions. A foundational task in this domain is ego-centric 3D visual grounding. However, the point clouds rendered from RGB-D images retain a large amount of redundant background data and inherent noise, both of which can interfere with the manifold structure of the target regions. Existing point cloud enhancement methods often require a tedious process to improve the manifold, which is not suitable for real-time tasks. We propose Proxy Transformation, a method suited to multimodal tasks that efficiently improves the point cloud manifold. Our method first leverages Deformable Point Clustering to identify the point cloud sub-manifolds in target regions. Then, we propose a Proxy Attention module that utilizes multimodal proxies to guide point cloud transformation. Built upon Proxy Attention, we design a submanifold transformation generation module where textual information globally guides translation vectors for different submanifolds, optimizing relative spatial relationships of target regions. Simultaneously, image information guides linear transformations within each submanifold, refining the local point cloud manifold of target regions. Extensive experiments demonstrate that Proxy Transformation significantly outperforms all existing methods, achieving an impressive improvement of 7.49% on easy targets and 4.60% on hard targets, while reducing the computational overhead of attention blocks by 40.6%. These results establish a new SOTA in ego-centric 3D visual grounding, showcasing the effectiveness and robustness of our approach.
13. 【2502.19238】Arbitrary Volumetric Refocusing of Dense and Sparse Light Fields
Link: https://arxiv.org/abs/2502.19238
Authors: Tharindu Samarakoon, Kalana Abeywardena, Chamira U. S. Edussooriya
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Keywords: four-dimensional light field, textural information, light field, four-dimensional light, two-dimensional image
Comments: 9 pages, 7 figures, 3 tables
Abstract:A four-dimensional light field (LF) captures both textural and geometrical information of a scene, in contrast to a two-dimensional image that captures only the textural information of a scene. Post-capture refocusing is an exciting application of LFs enabled by the geometric information captured. Previously proposed LF refocusing methods are mostly limited to the refocusing of a single planar or volumetric region of a scene corresponding to a depth range and cannot simultaneously generate in-focus and out-of-focus regions having the same depth range. In this paper, we propose an end-to-end pipeline to simultaneously refocus multiple arbitrary planar or volumetric regions of a dense or a sparse LF. We employ pixel-dependent shifts with the typical shift-and-sum method to refocus an LF. The pixel-dependent shifts enable each pixel of an LF to be refocused independently. For sparse LFs, the shift-and-sum method introduces ghosting artifacts due to the spatial undersampling. We employ a deep learning model based on the U-Net architecture to almost completely eliminate the ghosting artifacts. The experimental results obtained with several LF datasets confirm the effectiveness of the proposed method. In particular, sparse LFs refocused with the proposed method achieve a structural similarity index higher than 0.9 despite having only 20% of data compared to dense LFs.
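The classic shift-and-sum baseline that this abstract builds on can be sketched as below. This is only an illustration under stated assumptions, not the authors' pipeline: it applies one global integer slope to all pixels (the paper uses pixel-dependent shifts), it wraps around at image borders for brevity, and the names `shift_and_sum`, `light_field` and `slope` are hypothetical.

```python
def shift_and_sum(light_field, slope):
    """Refocus a light field by shifting every view and averaging.

    light_field: dict mapping angular coordinates (u, v) to a 2D list
                 (height x width) of pixel intensities.
    slope: disparity in pixels per unit of angular offset; it selects
           which depth plane ends up in focus.
    Assumption: shifts wrap around at the image borders.
    """
    views = list(light_field.items())
    # Use the mean angular position as the reference view.
    u0 = sum(u for (u, _), _ in views) / len(views)
    v0 = sum(v for (_, v), _ in views) / len(views)
    height = len(views[0][1])
    width = len(views[0][1][0])
    out = [[0.0] * width for _ in range(height)]
    for (u, v), img in views:
        # Shift proportional to the angular offset from the reference view.
        dy = round(slope * (u - u0))
        dx = round(slope * (v - v0))
        for y in range(height):
            for x in range(width):
                out[y][x] += img[(y + dy) % height][(x + dx) % width]
    # Average over all views.
    return [[value / len(views) for value in row] for row in out]


# A point with a disparity of one pixel per view, seen from three views.
lf = {
    (0, -1): [[0, 1, 0, 0, 0]],
    (0, 0): [[0, 0, 1, 0, 0]],
    (0, 1): [[0, 0, 0, 1, 0]],
}
focused = shift_and_sum(lf, 1)   # slope matches the disparity: point is sharp
blurred = shift_and_sum(lf, 0)   # slope does not match: point is smeared
```

With a matching slope the three copies of the point line up and average to full intensity; with a mismatched slope the energy is spread across neighbouring pixels, which is the out-of-focus blur.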
14. 【2502.19217】A Lightweight and Extensible Cell Segmentation and Classification Model for Whole Slide Images
Link: https://arxiv.org/abs/2502.19217
Authors: Nikita Shvetsov, Thomas K. Kilvaer, Masoud Tafavvoghi, Anders Sildnes, Kajsa Møllersen, Lill-Tove Rasmussen Busund, Lars Ailo Bongo
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: cell-level analysis tools, remains challenging due, Developing clinically, pathology remains challenging, high computational demands
Comments: 27 pages, 11 figures
Abstract:Developing clinically useful cell-level analysis tools in digital pathology remains challenging due to limitations in dataset granularity, inconsistent annotations, high computational demands, and difficulties integrating new technologies into workflows. To address these issues, we propose a solution that enhances data quality, model performance, and usability by creating a lightweight, extensible cell segmentation and classification model. First, we update data labels through cross-relabeling to refine annotations of PanNuke and MoNuSAC, producing a unified dataset with seven distinct cell types. Second, we leverage the H-Optimus foundation model as a fixed encoder to improve feature representation for simultaneous segmentation and classification tasks. Third, to address foundation models' computational demands, we distill knowledge to reduce model size and complexity while maintaining comparable performance. Finally, we integrate the distilled model into QuPath, a widely used open-source digital pathology platform. Results demonstrate improved segmentation and classification performance using the H-Optimus-based model compared to a CNN-based model. Specifically, average $R^2$ improved from 0.575 to 0.871, and average $PQ$ score improved from 0.450 to 0.492, indicating better alignment with actual cell counts and enhanced segmentation quality. The distilled model maintains comparable performance while reducing parameter count by a factor of 48. By reducing computational complexity and integrating into workflows, this approach may significantly impact diagnostics, reduce pathologist workload, and improve outcomes. Although the method shows promise, extensive validation is necessary prior to clinical deployment.
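For readers unfamiliar with the $PQ$ score, the following sketch assumes it denotes panoptic quality, as is common for nuclei segmentation benchmarks such as PanNuke; the abstract itself does not spell this out. Instances are represented as sets of pixel coordinates, and the function name `panoptic_quality` is hypothetical.

```python
def panoptic_quality(pred_instances, gt_instances, iou_threshold=0.5):
    """Panoptic quality: sum of matched IoUs / (TP + 0.5 * FP + 0.5 * FN).

    pred_instances, gt_instances: lists of sets of (row, col) pixels,
    one set per segmented instance.
    """
    matched_gt = set()
    iou_sum = 0.0
    true_pos = 0
    for pred in pred_instances:
        for gi, gt in enumerate(gt_instances):
            if gi in matched_gt:
                continue
            union = len(pred | gt)
            iou = len(pred & gt) / union if union else 0.0
            # With non-overlapping instances, IoU > 0.5 makes the match unique.
            if iou > iou_threshold:
                matched_gt.add(gi)
                iou_sum += iou
                true_pos += 1
                break
    false_pos = len(pred_instances) - true_pos
    false_neg = len(gt_instances) - true_pos
    denom = true_pos + 0.5 * false_pos + 0.5 * false_neg
    return iou_sum / denom if denom else 0.0


gt = [{(0, 0), (0, 1)}, {(5, 5)}]
pred = [{(0, 0), (0, 1), (0, 2)}, {(9, 9)}]
pq = panoptic_quality(pred, gt)  # one match with IoU 2/3, one FP, one FN
```

The score penalises both poor overlap of matched cells (the numerator) and missed or spurious cells (the denominator), which is why it is read as a combined segmentation and detection measure.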
15. 【2502.19204】Distill Any Depth: Distillation Creates a Stronger Monocular Depth Estimator
Link: https://arxiv.org/abs/2502.19204
Authors: Xiankang He, Dongyan Guo, Hongji Li, Ruibo Li, Ying Cui, Chi Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: single RGB image, single RGB, RGB image, Monocular depth estimation, Monocular depth
Comments: project page: [this https URL](https://distill-any-depth-official.github.io/)
Abstract:Monocular depth estimation (MDE) aims to predict scene depth from a single RGB image and plays a crucial role in 3D scene understanding. Recent advances in zero-shot MDE leverage normalized depth representations and distillation-based learning to improve generalization across diverse scenes. However, current depth normalization methods for distillation, relying on global normalization, can amplify noisy pseudo-labels, reducing distillation effectiveness. In this paper, we systematically analyze the impact of different depth normalization strategies on pseudo-label distillation. Based on our findings, we propose Cross-Context Distillation, which integrates global and local depth cues to enhance pseudo-label quality. Additionally, we introduce a multi-teacher distillation framework that leverages complementary strengths of different depth estimation models, leading to more robust and accurate depth predictions. Extensive experiments on benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods, both quantitatively and qualitatively.
16. 【2502.19200】HDM: Hybrid Diffusion Model for Unified Image Anomaly Detection
Link: https://arxiv.org/abs/2502.19200
Authors: Zekang Weng, Jinjin Shi, Jinwei Wang, Zeming Han
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: improving product quality, industrial quality inspection, quality inspection, product quality, medical imaging
Comments:
Abstract:Image anomaly detection plays a vital role in applications such as industrial quality inspection and medical imaging, where it directly contributes to improving product quality and system reliability. However, existing methods often struggle with complex and diverse anomaly patterns. In particular, the separation between generation and discrimination tasks limits the effective coordination between anomaly sample generation and anomaly region detection. To address these challenges, we propose a novel hybrid diffusion model (HDM) that integrates generation and discrimination into a unified framework. The model consists of three key modules: the Diffusion Anomaly Generation Module (DAGM), the Diffusion Discriminative Module (DDM), and the Probability Optimization Module (POM). DAGM generates realistic and diverse anomaly samples, improving their representativeness. DDM then applies a reverse diffusion process to capture the differences between generated and normal samples, enabling precise anomaly region detection and localization based on probability distributions. POM refines the probability distributions during both the generation and discrimination phases, ensuring high-quality samples are used for training. Extensive experiments on multiple industrial image datasets demonstrate that our method outperforms state-of-the-art approaches, significantly improving both image-level and pixel-level anomaly detection performance, as measured by AUROC.
17. 【2502.19199】EGR-Net: A Novel Embedding Gramian Representation CNN for Intelligent Fault Diagnosis
Link: https://arxiv.org/abs/2502.19199
Authors: Linshan Jia
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: intelligent fault diagnosis, rotating machinery, extraction is crucial, crucial in intelligent, fault diagnosis methods
Comments:
Abstract:Feature extraction is crucial in intelligent fault diagnosis of rotating machinery. It is easier for convolutional neural networks (CNNs) to visually recognize and learn fault features by converting the complicated one-dimensional (1D) vibrational signals into two-dimensional (2D) images with simple textures. However, the existing representation methods for encoding 1D signals as images have two main problems, including complicated computation and low separability. Meanwhile, the existing 2D-CNN fault diagnosis methods taking 2D images as the only inputs still suffer from the inevitable information loss because of the conversion process. Considering the above issues, this paper proposes a new 1D-to-2D conversion method called Embedding Gramian Representation (EGR), which is easy to calculate and shows good separability. In EGR, 1D signals are projected in the embedding space and the intrinsic periodicity of vibrational signals is captured, enabling the faulty characteristics contained in raw signals to be uncovered. Second, aiming at the information loss problem of existing CNN models with the single input of converted images, a double-branch EGR-based CNN, called EGR-Net, is proposed to learn faulty features from both raw signal feature maps and their corresponding EGRs. The bridge connection is designed to improve the feature learning interaction between the two branches. Widely used open-domain gearbox and bearing datasets are used to verify the effectiveness and efficiency of the proposed methods. EGR-Net is compared with traditional and state-of-the-art approaches, and the results show that the proposed method can deliver enhanced performance.
18. 【2502.19194】Self-supervised conformal prediction for uncertainty quantification in Poisson imaging problems
Link: https://arxiv.org/abs/2502.19194
Authors: Bernardin Tamo Amougou, Marcelo Pereyra, Barbara Pascal
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
Keywords: Image restoration, leading to significant, reconstructed images, image restoration methods, Conformal prediction
Comments:
Abstract:Image restoration problems are often ill-posed, leading to significant uncertainty in reconstructed images. Accurately quantifying this uncertainty is essential for the reliable interpretation of reconstructed images. However, image restoration methods often lack uncertainty quantification capabilities. Conformal prediction offers a rigorous framework to augment image restoration methods with accurate uncertainty quantification estimates, but it typically requires abundant ground truth data for calibration. This paper presents a self-supervised conformal prediction method for Poisson imaging problems which leverages the Poisson Unbiased Risk Estimator to eliminate the need for ground truth data. The resulting self-calibrating conformal prediction approach is applicable to any Poisson linear imaging problem that is ill-conditioned, and is particularly effective when combined with modern self-supervised image restoration techniques trained directly on measurement data. The proposed method is demonstrated through numerical experiments on image denoising and deblurring; its performance is comparable to supervised conformal prediction methods relying on ground truth data.
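To make the calibration step concrete, here is a minimal sketch of standard split conformal calibration, that is, the supervised baseline that needs ground truth. It is not the paper's self-supervised method: there, the ground-truth-based scores would be replaced by estimates from the Poisson Unbiased Risk Estimator, which is not shown. The function name `conformal_quantile` and the example scores are hypothetical.

```python
import math


def conformal_quantile(calibration_scores, alpha):
    """Threshold q such that a fresh score exceeds q with probability <= alpha.

    calibration_scores: nonconformity scores (e.g. reconstruction errors
                        measured against ground truth) on held-out data.
    alpha: target miscoverage level, e.g. 0.1 for 90% coverage.
    """
    n = len(calibration_scores)
    # Finite-sample correction: take the ceil((n + 1)(1 - alpha))-th smallest score.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points to certify this coverage level.
        return float("inf")
    return sorted(calibration_scores)[rank - 1]


scores = [5, 1, 9, 3, 7, 2, 8, 4, 6]
q_half = conformal_quantile(scores, 0.5)    # rank ceil(10 * 0.5) = 5
q_strict = conformal_quantile(scores, 0.01)  # rank 10 > 9, so no finite bound
```

The `(n + 1)` correction is what gives the finite-sample coverage guarantee under exchangeability; with very small calibration sets and strict `alpha`, the only valid threshold is infinite.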
19. 【2502.19177】Knowledge Distillation for Semantic Segmentation: A Label Space Unification Approach
Link: https://arxiv.org/abs/2502.19177
Authors: Anton Backhaus, Thorsten Luettel, Mirko Maehlisch
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: datasets sharing similar, sharing similar domains, past few years, increasing number, sharing similar
Comments:
Abstract:An increasing number of datasets sharing similar domains for semantic segmentation have been published over the past few years. But despite the growing amount of overall data, it is still difficult to train bigger and better models due to inconsistency in taxonomy and/or labeling policies of different datasets. To this end, we propose a knowledge distillation approach that also serves as a label space unification method for semantic segmentation. In short, a teacher model is trained on a source dataset with a given taxonomy, then used to pseudo-label additional data for which ground truth labels of a related label space exist. By mapping the related taxonomies to the source taxonomy, we create constraints within which the model can predict pseudo-labels. Using the improved pseudo-labels we train student models that consistently outperform their teachers in two challenging domains, namely urban and off-road driving. Our ground truth-corrected pseudo-labels span 12 and 7 public datasets with 388,230 and 18,558 images for the urban and off-road domains, respectively, creating the largest compound datasets for autonomous driving to date.
20. 【2502.19159】A Sliding Layer Merging Method for Efficient Depth-Wise Pruning in LLMs
Link: https://arxiv.org/abs/2502.19159
Authors: Xuan Ding, Yao Zhu, Yunjian Zhang, Chuanlong Xie
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: significantly accelerate inference, entire Transformer layer, resource-constrained scenarios, significantly accelerate, pruning
Comments:
Abstract:Compared to width-wise pruning, depth-wise pruning can significantly accelerate inference in resource-constrained scenarios. However, treating the entire Transformer layer as the minimum pruning unit may degrade model performance by indiscriminately discarding the entire information of the layer. This paper reveals the "Patch-like" feature relationship between layers in large language models by analyzing the correlation of the outputs of different layers in the reproducing kernel Hilbert space. Building on this observation, we propose a sliding layer merging method that dynamically selects and fuses consecutive layers from top to bottom according to a pre-defined similarity threshold, thereby simplifying the model structure while maintaining its performance. Extensive experiments on LLMs with various architectures and different parameter scales show that our method outperforms existing pruning techniques in both zero-shot inference performance and retraining recovery quality after pruning. In particular, in the experiment with 35% pruning on the Vicuna-7B model, our method achieved a 1.654% improvement in average performance on zero-shot tasks compared to the existing method. Moreover, we further reveal the potential of combining depth pruning with width pruning to enhance the pruning effect. Our codes are available at this https URL.
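The idea of selecting consecutive layers from top to bottom by a similarity threshold can be illustrated with a simplified grouping pass. This is a rough sketch and not the authors' algorithm: it assumes the similarities between adjacent layer outputs are already computed (the paper measures them in a reproducing kernel Hilbert space), it only decides which layers would be grouped, and it does not show how grouped layers are fused. The name `group_layers_top_down` is hypothetical.

```python
def group_layers_top_down(adjacent_similarity, threshold):
    """Group consecutive layers whose outputs are similar, scanning top-down.

    adjacent_similarity[i]: similarity between the outputs of layer i and
                            layer i + 1 (higher means more redundant).
    threshold: minimum similarity for two neighbours to share a group.
    Returns a list of groups, each a list of layer indices, top layer first.
    """
    num_layers = len(adjacent_similarity) + 1
    groups = []
    current = [num_layers - 1]  # start the window at the top layer
    for i in range(num_layers - 2, -1, -1):
        if adjacent_similarity[i] >= threshold:
            current.append(i)       # similar enough: extend the current group
        else:
            groups.append(current)  # close the group and start a new one
            current = [i]
    groups.append(current)
    return groups


# Five layers; layers 1 to 3 produce highly similar outputs.
groups = group_layers_top_down([0.2, 0.95, 0.97, 0.5], threshold=0.9)
```

Each multi-layer group is a candidate for replacement by a single merged layer, so the number of groups gives the depth of the pruned model under this simplified rule.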
21. 【2502.19128】SCA3D: Enhancing Cross-modal 3D Retrieval via 3D Shape and Caption Paired Data Augmentation
Link: https://arxiv.org/abs/2502.19128
Authors: Junlong Ren, Hao Wu, Hui Xiong, Hao Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: retrieval task aims, achieve mutual matching, task aims, aims to achieve, achieve mutual
Comments: ICRA 2025
Abstract:The cross-modal 3D retrieval task aims to achieve mutual matching between text descriptions and 3D shapes. This has the potential to enhance the interaction between natural language and the 3D environment, especially within the realms of robotics and embodied artificial intelligence (AI) applications. However, the scarcity and expensiveness of 3D data constrain the performance of existing cross-modal 3D retrieval methods. These methods heavily rely on features derived from the limited number of 3D shapes, resulting in poor generalization ability across diverse scenarios. To address this challenge, we introduce SCA3D, a novel 3D shape and caption online data augmentation method for cross-modal 3D retrieval. Our approach uses the LLaVA model to create a component library, captioning each segmented part of every 3D shape within the dataset. Notably, it facilitates the generation of extensive new 3D-text pairs containing new semantic features. We employ both inter and intra distances to align various components into a new 3D shape, ensuring that the components do not overlap and are closely fitted. Further, text templates are utilized to process the captions of each component and generate new text descriptions. Besides, we use unimodal encoders to extract embeddings for 3D shapes and texts based on the enriched dataset. We then calculate fine-grained cross-modal similarity using Earth Mover's Distance (EMD) and enhance cross-modal matching with contrastive learning, enabling bidirectional retrieval between texts and 3D shapes. Extensive experiments show our SCA3D outperforms previous works on the Text2Shape dataset, raising the Shape-to-Text RR@1 score from 20.03 to 27.22 and the Text-to-Shape RR@1 score from 13.12 to 16.67. Codes can be found in this https URL.
22. 【2502.19125】The NeRF Signature: Codebook-Aided Watermarking for Neural Radiance Fields
Link: https://arxiv.org/abs/2502.19125
Authors: Ziyuan Luo, Anderson Rocha, Boxin Shi, Qing Guo, Haoliang Li, Renjie Wan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Neural Radiance Fields, Neural Radiance, Radiance Fields, content representation, gaining attention
Comments: 16 pages, accepted by TPAMI
Abstract:Neural Radiance Fields (NeRF) have been gaining attention as a significant form of 3D content representation. With the proliferation of NeRF-based creations, the need for copyright protection has emerged as a critical issue. Although some approaches have been proposed to embed digital watermarks into NeRF, they often neglect essential model-level considerations and incur substantial time overheads, resulting in reduced imperceptibility and robustness, along with user inconvenience. In this paper, we extend the previous criteria for image watermarking to the model level and propose NeRF Signature, a novel watermarking method for NeRF. We employ a Codebook-aided Signature Embedding (CSE) that does not alter the model structure, thereby maintaining imperceptibility and enhancing robustness at the model level. Furthermore, after optimization, any desired signatures can be embedded through the CSE, and no fine-tuning is required when NeRF owners want to use new binary signatures. Then, we introduce a joint pose-patch encryption watermarking strategy to hide signatures into patches rendered from a specific viewpoint for higher robustness. In addition, we explore a Complexity-Aware Key Selection (CAKS) scheme to embed signatures in high visual complexity patches to enhance imperceptibility. The experimental results demonstrate that our method outperforms other baseline methods in terms of imperceptibility and robustness. The source code is available at: this https URL.
23. 【2502.19106】A Survey on Foundation-Model-Based Industrial Defect Detection
Link: https://arxiv.org/abs/2502.19106
Authors: Tianle Yang, Luyao Chang, Jiadong Yan, Juntao Li, Zhi Wang, Ke Zhang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: defect detection receives, product defect features, three-dimensional visual feature, visual feature modeling, defect detection
Comments: 14 pages, 4 figures
Abstract:As industrial products become abundant and sophisticated, visual industrial defect detection receives much attention, including two-dimensional and three-dimensional visual feature modeling. Traditional methods use statistical analysis, abnormal data synthesis modeling, and generation-based models to separate product defect features and complete defect detection. Recently, the emergence of foundation models has brought visual and textual semantic prior knowledge. Many methods are based on foundation models (FM) to improve the accuracy of detection, but at the same time, increase model complexity and slow down inference speed. Some FM-based methods have begun to explore lightweight modeling ways, which have gradually attracted attention and deserve to be systematically analyzed. In this paper, we conduct a systematic survey with comparisons and discussions of foundation model methods from different aspects and briefly review non-foundation model (NFM) methods recently published. Furthermore, we discuss the differences between FM and NFM methods from training objectives, model structure and scale, model performance, and potential directions for future exploration. Through comparison, we find FM methods are more suitable for few-shot and zero-shot learning, which are more in line with actual industrial application scenarios and worthy of in-depth research.
24. 【2502.19101】An anatomically-informed correspondence initialisation method to improve learning-based registration for radiotherapy
Link: https://arxiv.org/abs/2502.19101
Authors: Edward G. A. Henderson, Marcel van Herk, Andrew F. Green, Eliana M. Vasquez Osorio
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: propose an anatomically-informed, interpatient CT non-rigid, model to estimate, anatomically-informed initialisation method, established NRR methods
Comments: Presented at the XXth International Conference on the use of Computers in Radiation therapy. Pages 99-102 in XXth ICCR Proceedings, found here [this https URL](https://udl.hal.science/hal-04720234v1)
Abstract:We propose an anatomically-informed initialisation method for interpatient CT non-rigid registration (NRR), using a learning-based model to estimate correspondences between organ structures. A thin plate spline (TPS) deformation, set up using the correspondence predictions, is used to initialise the scans before a second NRR step. We compare two established NRR methods for the second step: a B-spline iterative optimisation-based algorithm and a deep learning-based approach. Registration performance is evaluated with and without the initialisation by assessing the similarity of propagated structures. Our proposed initialisation improved the registration performance of the learning-based method to more closely match the traditional iterative algorithm, with the mean distance-to-agreement reduced by 1.8mm for structures included in the TPS and 0.6mm for structures not included, while maintaining a substantial speed advantage (5 vs. 72 seconds).
25. 【2502.19090】EndoMamba: An Efficient Foundation Model for Endoscopic Videos
Link: https://arxiv.org/abs/2502.19090
Authors: Qingyao Tian, Huai Liao, Xinyan Huang, Bingyu Yang, Dongdong Lei, Sebastien Ourselin, Hongbin Liu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: minimally invasive surgeries, providing real-time assistance, play a crucial, visual navigation, crucial role
Comments:
Abstract:Endoscopic video-based tasks, such as visual navigation and surgical phase recognition, play a crucial role in minimally invasive surgeries by providing real-time assistance. While recent video foundation models have shown promise, their applications are hindered by (1) computational inefficiencies and (2) suboptimal performance caused by limited data for pre-training in endoscopy. To address these issues, we present EndoMamba, a foundation model designed for real-time inference while learning generalized spatiotemporal representations. First, to mitigate computational inefficiencies, we propose the EndoMamba backbone, optimized for real-time inference. Inspired by recent advancements in state space models, EndoMamba integrates Bidirectional Mamba blocks for spatial modeling within individual frames and vanilla Mamba blocks for past-to-present reasoning across the temporal domain. This design enables both strong spatiotemporal modeling and efficient inference in online video streams. Second, we propose a self-supervised hierarchical pre-training diagram to enhance EndoMamba's representation learning using endoscopic videos and incorporating general video domain knowledge. Specifically, our approach combines masked reconstruction with auxiliary supervision, leveraging low-level reconstruction to capture spatial-temporal structures and high-level alignment to transfer broader knowledge from a pretrained general-video domain foundation model. Extensive experiments on four downstream tasks--classification, segmentation, surgical phase recognition, and localization--demonstrate that EndoMamba outperforms existing foundation models and task-specific methods while maintaining real-time inference speed. The source code will be released upon acceptance.
26. 【2502.19068】Dynamic Degradation Decomposition Network for All-in-One Image Restoration
Link: https://arxiv.org/abs/2502.19068
Authors: Huiqiang Wang, Mingchen Song, Guoqiang Zhong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: restoring clean images, Net, degradation, restoring clean, degradation types
Comments:
Abstract:Currently, restoring clean images from a variety of degradation types using a single model is still a challenging task. Existing all-in-one image restoration approaches struggle with addressing complex and ambiguously defined degradation types. In this paper, we introduce a dynamic degradation decomposition network for all-in-one image restoration, named D$^3$Net. D$^3$Net achieves degradation-adaptive image restoration with guided prompt through cross-domain interaction and dynamic degradation decomposition. Concretely, in D$^3$Net, the proposed Cross-Domain Degradation Analyzer (CDDA) engages in deep interaction between frequency domain degradation characteristics and spatial domain image features to identify and model variations of different degradation types on the image manifold, generating a degradation correction prompt and a strategy prompt, which guide the following decomposition process. Furthermore, we propose a prompt-based Dynamic Decomposition Mechanism (DDM) for progressive degradation decomposition, which encourages the network to adaptively select restoration strategies utilizing the two-level prompt generated by CDDA. Thanks to the synergistic cooperation between CDDA and DDM, D$^3$Net achieves superior flexibility and scalability in handling unknown degradation, while effectively reducing unnecessary computational overhead. Extensive experiments on multiple image restoration tasks demonstrate that D$^3$Net significantly outperforms the state-of-the-art approaches, especially improving PSNR by 5.47dB and 3.30dB on the SOTS-Outdoor and GoPro datasets, respectively.
27. 【2502.19048】An Improved 3D Skeletons UP-Fall Dataset: Enhancing Data Quality for Efficient Impact Fall Detection
Link: https://arxiv.org/abs/2502.19048
Authors: Tresor Y. Koffi, Youssef Mourchid, Mohammed Hindawi, Yohan Dupuis
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: individual makes contact, fall detection systems, fall detection, impact fall detection, Detecting impact
Comments: 17th International Conference on Machine Vision (ICMV 2024) will take place in Edinburgh, UK during October 10-13, 2024
Abstract:Detecting impact where an individual makes contact with the ground within a fall event is crucial in fall detection systems, particularly for elderly care where prompt intervention can prevent serious injuries. The UP-Fall dataset, a key resource in fall detection research, has proven valuable but suffers from limitations in data accuracy and comprehensiveness. These limitations cause confusion in distinguishing between non-impact events, such as sliding, and real falls with impact, where the person actually hits the ground. This confusion compromises the effectiveness of current fall detection systems. This study presents enhancements to the UP-Fall dataset aiming at improving it for impact fall detection by incorporating 3D skeleton data. Our preprocessing techniques ensure high data accuracy and comprehensiveness, enabling a more reliable impact fall detection. Extensive experiments were conducted using various machine learning and deep learning algorithms to benchmark the improved 3D skeletons dataset. The results demonstrate substantial improvements in the performance of fall detection models trained on the enhanced dataset. This contribution aims to enhance the safety and well-being of the elderly population at risk. To support further research and development of building more reliable impact fall detection systems, we have made the improved 3D skeletons UP-Fall dataset publicly available at this link this https URL.
28. 【2502.19047】A Dual-Purpose Framework for Backdoor Defense and Backdoor Amplification in Diffusion Models
Link: https://arxiv.org/abs/2502.19047
Authors: Vu Tuan Truong, Long Bao Le
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: high-quality multi-modal samples, producing high-quality multi-modal, backdoor, excelling in producing, multi-modal samples
Comments:
Abstract:Diffusion models have emerged as state-of-the-art generative frameworks, excelling in producing high-quality multi-modal samples. However, recent studies have revealed their vulnerability to backdoor attacks, where backdoored models generate specific, undesirable outputs called backdoor targets (e.g., harmful images) when a pre-defined trigger is embedded in their inputs. In this paper, we propose PureDiffusion, a dual-purpose framework that simultaneously serves two contrasting roles: backdoor defense and backdoor attack amplification. For defense, we introduce two novel loss functions to invert backdoor triggers embedded in diffusion models. The first leverages trigger-induced distribution shifts across multiple timesteps of the diffusion process, while the second exploits the denoising consistency effect when a backdoor is activated. Once an accurate trigger inversion is achieved, we develop a backdoor detection method that analyzes both the inverted trigger and the generated backdoor targets to identify backdoor attacks. In terms of attack amplification with the role of an attacker, we describe how our trigger inversion algorithm can be used to reinforce the original trigger embedded in the backdoored diffusion model. This significantly boosts attack performance while reducing the required backdoor training time. Experimental results demonstrate that PureDiffusion achieves near-perfect detection accuracy, outperforming existing defenses by a large margin, particularly against complex trigger patterns. Additionally, in an attack scenario, our attack amplification approach elevates the attack success rate (ASR) of existing backdoor attacks to nearly 100% while reducing training time by up to 20x.
29. 【2502.19038】FungalZSL: Zero-Shot Fungal Classification with Image Captioning Using a Synthetic Data Approach
Link: https://arxiv.org/abs/2502.19038
Authors: Anju Rani, Daniel O. Arroyo, Petar Durdevic
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Contrastive Language-Image Pre-training, Language-Image Pre-training, well-aligned text-image datasets, Contrastive Language-Image, large vision-language models
Comments: 11 pages, 5 Figures, 1 Table
Abstract:The effectiveness of zero-shot classification in large vision-language models (VLMs), such as Contrastive Language-Image Pre-training (CLIP), depends on access to extensive, well-aligned text-image datasets. In this work, we introduce two complementary data sources, one generated by large language models (LLMs) to describe the stages of fungal growth and another comprising a diverse set of synthetic fungi images. These datasets are designed to enhance CLIP's zero-shot classification capabilities for fungi-related tasks. To ensure effective alignment between text and image data, we project them into CLIP's shared representation space, focusing on different fungal growth stages. We generate text using LLaMA3.2 to bridge modality gaps and synthetically create fungi images. Furthermore, we investigate knowledge transfer by comparing text outputs from different LLM techniques to refine classification across growth stages.
30. 【2502.19024】Ground-level Viewpoint Vision-and-Language Navigation in Continuous Environments
Link: https://arxiv.org/abs/2502.19024
Authors: Zerui Li, Gengze Zhou, Haodong Hong, Yanyan Shao, Wenqi Lyu, Yanyuan Qiao, Qi Wu
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Keywords: make sequential decisions, Ground-level Viewpoint Navigation, associate time-sequenced visual, Viewpoint Navigation, empowers agents
Comments: Accepted by ICRA 2025
Abstract:Vision-and-Language Navigation (VLN) empowers agents to associate time-sequenced visual observations with corresponding instructions to make sequential decisions. However, generalization remains a persistent challenge, particularly when dealing with visually diverse scenes or transitioning from simulated environments to real-world deployment. In this paper, we address the mismatch between human-centric instructions and quadruped robots with a low-height field of view, proposing a Ground-level Viewpoint Navigation (GVNav) approach to mitigate this issue. This work represents the first attempt to highlight the generalization gap in VLN across varying heights of visual observation in realistic robot deployments. Our approach leverages weighted historical observations as enriched spatiotemporal contexts for instruction following, effectively managing feature collisions within cells by assigning appropriate weights to identical features across different viewpoints. This enables low-height robots to overcome challenges such as visual obstructions and perceptual mismatches. Additionally, we transfer the connectivity graph from the HM3D and Gibson datasets as an extra resource to enhance spatial priors and provide a more comprehensive representation of real-world scenarios, leading to improved performance and generalizability of the waypoint predictor in real-world environments. Extensive experiments demonstrate that our Ground-level Viewpoint Navigation (GVNav) approach significantly improves performance in both simulated environments and real-world deployments with quadruped robots.
31. 【2502.18982】Enhanced Neuromorphic Semantic Segmentation Latency through Stream Event
Link: https://arxiv.org/abs/2502.18982
Authors: D. Hareb, J. Martinet, B. Miramond
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Achieving optimal semantic, Achieving optimal, vision sensors poses, frame-based vision sensors, self-driving cars
Abstract:Achieving optimal semantic segmentation with frame-based vision sensors poses significant challenges for real-time systems like UAVs and self-driving cars, which require rapid and precise processing. Traditional frame-based methods often struggle to balance latency, accuracy, and energy efficiency. To address these challenges, we leverage event streams from event-based cameras, bio-inspired sensors that trigger events in response to changes in the scene. Specifically, we analyze the number of events triggered between successive frames, with a high number indicating significant changes and a low number indicating minimal changes. We exploit this event information to solve the semantic segmentation task by employing a Spiking Neural Network (SNN), a bio-inspired computing paradigm known for its low energy consumption. Our experiments on the DSEC dataset show that our approach significantly reduces latency with only a limited drop in accuracy. Additionally, by using SNNs, we achieve low power consumption, making our method suitable for energy-constrained real-time applications. To the best of our knowledge, our approach is the first to effectively balance reduced latency, minimal accuracy loss, and energy efficiency using event streams to enhance semantic segmentation in dynamic and resource-limited environments.
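The event-counting idea in this abstract can be illustrated with a small sketch. This is not the authors' pipeline: the abstract only states that the number of events between successive frames is analyzed, so the gating rule below (run segmentation only on frames preceded by enough events) and the names `events_between`, `frames_to_process`, and `threshold` are assumptions made for illustration.

```python
from bisect import bisect_left


def events_between(timestamps, t_start, t_end):
    """Count events with t_start <= t < t_end in a sorted list of timestamps."""
    return bisect_left(timestamps, t_end) - bisect_left(timestamps, t_start)


def frames_to_process(timestamps, frame_times, threshold):
    """Return indices of frames preceded by at least `threshold` events.

    Frame 0 is always kept because there is no earlier frame to compare with.
    The thresholding rule is a hypothetical stand-in for the paper's method.
    """
    selected = [0]
    for i in range(1, len(frame_times)):
        n_events = events_between(timestamps, frame_times[i - 1], frame_times[i])
        if n_events >= threshold:  # many events: the scene changed noticeably
            selected.append(i)
    return selected


# Toy data: event timestamps and frame timestamps, both in seconds.
event_ts = [0.01, 0.02, 0.03, 0.04, 0.11, 0.25, 0.26, 0.27]
frame_ts = [0.0, 0.1, 0.2, 0.3]
chosen = frames_to_process(event_ts, frame_ts, threshold=3)
```

With this toy data, `chosen` is `[0, 1, 3]`: the interval before frame 2 contains a single event, so that frame would be skipped under the assumed rule.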
32. 【2502.18923】Brain-inspired analogical mixture prototypes for few-shot class-incremental learning
Link: https://arxiv.org/abs/2502.18923
Authors: Wanyi Li, Wei Wei, Yongkang Luo, Peng Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Keywords: Few-shot class-incremental learning, poses significant challenges, previously learned tasks, artificial neural networks, neural networks due
Comments: under review
Abstract:Few-shot class-incremental learning (FSCIL) poses significant challenges for artificial neural networks due to the need to efficiently learn from limited data while retaining knowledge of previously learned tasks. Inspired by the brain's mechanisms for categorization and analogical learning, we propose a novel approach called Brain-inspired Analogical Mixture Prototypes (BAMP). BAMP has three components: mixed prototypical feature learning, statistical analogy, and soft voting. Starting from a pre-trained Vision Transformer (ViT), mixed prototypical feature learning represents each class using a mixture of prototypes and fine-tunes these representations during the base session. The statistical analogy calibrates the mean and covariance matrix of prototypes for new classes according to similarity to the base classes, and computes the classification score with Mahalanobis distance. Soft voting combines the merits of both statistical analogy and an off-the-shelf FSCIL method. Our experiments on benchmark datasets demonstrate that BAMP outperforms the state of the art in both the traditional big-start FSCIL setting and the challenging small-start FSCIL setting. The study suggests that brain-inspired analogical mixture prototypes can alleviate catastrophic forgetting and over-fitting problems in FSCIL.
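The Mahalanobis-distance scoring mentioned in this abstract can be sketched as follows. This is only the generic scoring step, $d^2(x) = (x - \mu)^\top \Sigma^{-1} (x - \mu)$, with the class of smallest distance chosen; it does not reproduce BAMP's calibration of means and covariances or its mixture prototypes. The class names, feature values, and helper names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance (x - mean)^T cov^{-1} (x - mean)."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    y = solve(cov, diff)  # y = cov^{-1} diff, without forming the inverse
    return sum(d * yi for d, yi in zip(diff, y))


def classify(x, prototypes):
    """Pick the class whose (mean, covariance) prototype is closest to x."""
    return min(prototypes, key=lambda name: mahalanobis_sq(x, *prototypes[name]))


# Hypothetical 2-D prototypes: "dog" has a much larger variance along the first axis.
prototypes = {
    "cat": ([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]),
    "dog": ([4.0, 0.0], [[16.0, 0.0], [0.0, 1.0]]),
}
label = classify([1.5, 0.0], prototypes)
```

The query point is closer to the "cat" mean in Euclidean terms (1.5 versus 2.5), yet `label` is `"dog"`, because the squared Mahalanobis distances are 2.25 and about 0.39 respectively. That sensitivity to per-class covariance is the reason to prefer this distance over a plain nearest-mean rule.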
33. 【2502.18871】Inscanner: Dual-Phase Detection and Classification of Auxiliary Insulation Using YOLOv8 Models
Link: https://arxiv.org/abs/2502.18871
Authors: Youngtae Kim, Soonju Jeong, Sardar Arslan, Dhananjay Agnihotri, Yahya Ahmed, Ali Nawaz, Jinhee Song, Hyewon Kim
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: study proposes, proposes a two-phase, two-phase methodology, methodology for detecting, detecting and classifying
Abstract:This study proposes a two-phase methodology for detecting and classifying auxiliary insulation in structural components. In the detection phase, a YOLOv8x model is trained on a dataset of complete structural blueprints, each annotated with bounding boxes indicating areas that should contain insulation. In the classification phase, these detected insulation patches are cropped and categorized into two classes: present or missing. These are then used to train a YOLOv8x-CLS model that determines the presence or absence of auxiliary insulation. Preprocessing steps for both datasets included annotation, augmentation, and appropriate cropping of the insulation regions. The detection model achieved a mean average precision (mAP) score of 82%, while the classification model attained an accuracy of 98%. These findings demonstrate the effectiveness of the proposed approach in automating insulation detection and classification, providing a foundation for further advancements in this domain.
34. 【2502.18867】Enhanced Transformer-Based Tracking for Skiing Events: Overcoming Multi-Camera Challenges, Scale Variations and Rapid Motion -- SkiTB Visual Tracking Challenge 2025
Link: https://arxiv.org/abs/2502.18867
Authors: Akhil Penta, Vaibhav Adwani, Ankush Chopra
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Accurate skier tracking, injury prevention, Accurate skier, performance analysis, alpine sports
Abstract:Accurate skier tracking is essential for performance analysis, injury prevention, and optimizing training strategies in alpine sports. Traditional tracking methods often struggle with occlusions, dynamic movements, and varying environmental conditions, limiting their effectiveness. In this work, we used STARK (Spatio-Temporal Transformer Network for Visual Tracking), a transformer-based model, to track skiers. We adapted STARK to address domain-specific challenges such as camera movements, camera changes, occlusions, etc. by optimizing the model's architecture and hyperparameters to better suit the dataset.
35. 【2502.18863】Sherlock: Towards Multi-scene Video Abnormal Event Extraction and Localization via a Global-local Spatial-sensitive LLM
Link: https://arxiv.org/abs/2502.18863
Authors: Junxiao Ma, Jingjing Wang, Jiamin Luo, Peiying Yu, Guodong Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Video Anomaly Detection, Anomaly Detection, structured video semantic, abnormal event happen, Video Abnormal Event
Abstract:Prior studies on Video Anomaly Detection (VAD) mainly focus on detecting whether each video frame is abnormal or not, largely ignoring the structured video semantic information (i.e., what, when, and where the abnormal event happens). With this in mind, we propose a new chat-paradigm Multi-scene Video Abnormal Event Extraction and Localization (M-VAE) task, aiming to extract the abnormal event quadruples (i.e., subject, event type, object, scene) and localize such events. Further, this paper argues that this new task faces two key challenges, i.e., global-local spatial modeling and global-local spatial balancing. To this end, this paper proposes a Global-local Spatial-sensitive Large Language Model (LLM) named Sherlock, i.e., acting like Sherlock Holmes to track down the criminal events, for this M-VAE task. Specifically, this model designs a Global-local Spatial-enhanced MoE (GSM) module and a Spatial Imbalance Regulator (SIR) to address the two challenges respectively. Extensive experiments on our M-VAE instruction dataset show the significant advantages of Sherlock over several advanced Video-LLMs. This justifies the importance of global-local spatial information for the M-VAE task and the effectiveness of Sherlock in capturing such information.
36. 【2502.18844】BarkXAI: A Lightweight Post-Hoc Explainable Method for Tree Species Classification with Quantifiable Concepts
Link: https://arxiv.org/abs/2502.18844
Authors: Yunmei Huang, Songlin Hou, Zachary Nelson Horve, Songlin Fei
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: environmental monitoring, global visual features, visual features, global visual, tree species
Abstract:The precise identification of tree species is fundamental to forestry, conservation, and environmental monitoring. Though many studies have demonstrated that high accuracy can be achieved using bark-based species classification, these models often function as "black boxes", limiting interpretability, trust, and adoption in critical forestry applications. Attribution-based Explainable AI (XAI) methods have been used to address this issue in related works. However, XAI applications are often dependent on local features (such as a head shape or paw in animal applications) and cannot describe global visual features (such as ruggedness or smoothness) that are present in texture-dominant images such as tree bark. Concept-based XAI methods, on the other hand, offer explanations based on global visual features with concepts, but they tend to require large overhead in building external concept image datasets and the concepts can be vague and subjective without good means of precise quantification. To address these challenges, we propose a lightweight post-hoc method to interpret visual models for tree species classification using operators and quantifiable concepts. Our approach eliminates computational overhead, enables the quantification of complex concepts, and evaluates both concept importance and the model's reasoning process. To the best of our knowledge, our work is the first study to explain bark vision models in terms of global visual features with concepts. Using a human-annotated dataset as ground truth, our experiments demonstrate that our method significantly outperforms TCAV and Llama3.2 in concept importance ranking based on Kendall's Tau, highlighting its superior alignment with human perceptions.
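The abstract evaluates concept importance rankings with Kendall's Tau. As a rough illustration of that metric (not of the paper's evaluation code), the sketch below computes tau-a, the variant for rankings without ties: the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs. The example rankings are made up.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a for two rankings of the same items, assuming no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        # Positive product: both rankings order items i and j the same way.
        product = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical importance ranks for four concepts (1 = most important).
human_ranks = [1, 2, 3, 4]
method_ranks = [1, 3, 2, 4]
tau = kendall_tau(human_ranks, method_ranks)
```

Here five of the six pairs agree and one disagrees, so `tau` is 4/6, about 0.67. Identical rankings give 1.0 and fully reversed rankings give -1.0.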
37. 【2502.18842】Attention-Guided Integration of CLIP and SAM for Precise Object Masking in Robotic Manipulation
Link: https://arxiv.org/abs/2502.18842
Authors: Muhammad A. Muttaqien, Tomohiro Motoda, Ryo Hanai, Domae Yukiyasu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: paper introduces, enhance the precision, specific domain, domain of masking, object masking
Comments: 2025 IEEE/SICE International Symposium on System Integration
Abstract:This paper introduces a novel pipeline to enhance the precision of object masking for robotic manipulation within the specific domain of masking products in convenience stores. The approach integrates two advanced AI models, CLIP and SAM, focusing on their synergistic combination and the effective use of multimodal data (image and text). Emphasis is placed on utilizing gradient-based attention mechanisms and customized datasets to fine-tune performance. While CLIP, SAM, and Grad-CAM are established components, their integration within this structured pipeline represents a significant contribution to the field. The resulting segmented masks, generated through this combined approach, can be effectively utilized as inputs for robotic systems, enabling more precise and adaptive object manipulation in the context of convenience store products.
38. 【2502.18816】Grad-ECLIP: Gradient-based Visual and Textual Explanations for CLIP
Link: https://arxiv.org/abs/2502.18816
Authors: Chenyang Zhao, Kun Wang, Janet H. Hsiao, Antoni B. Chan
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Contrastive Language-Image Pre-training, Significant progress, Language-Image Pre-training, Contrastive Language-Image, CLIP
Abstract:Significant progress has been achieved on the improvement and downstream usages of the Contrastive Language-Image Pre-training (CLIP) vision-language model, while less attention is paid to the interpretation of CLIP. We propose a Gradient-based visual and textual Explanation method for CLIP (Grad-ECLIP), which interprets the matching result of CLIP for a specific input image-text pair. By decomposing the architecture of the encoder and discovering the relationship between the matching similarity and intermediate spatial features, Grad-ECLIP produces effective heat maps that show the influence of image regions or words on the CLIP results. Different from the previous Transformer interpretation methods that focus on the utilization of self-attention maps, which are typically extremely sparse in CLIP, we produce high-quality visual explanations by applying channel and spatial weights on token features. Qualitative and quantitative evaluations verify the effectiveness and superiority of Grad-ECLIP compared with the state-of-the-art methods. Furthermore, a series of analyses is conducted based on our visual and textual explanation results, from which we explore the working mechanism of image-text matching, the strengths and limitations in attribution identification of CLIP, and the relationship between the concreteness/abstractness of a word and its usage in CLIP. Finally, based on the ability of the explanation map to indicate the text-specific salient region of an input image, we also propose an application of Grad-ECLIP, which is adopted to boost the fine-grained alignment in CLIP fine-tuning. The code of Grad-ECLIP is available here: this https URL.
39. 【2502.18748】Spectral-Enhanced Transformers: Leveraging Large-Scale Pretrained Models for Hyperspectral Object Tracking
Link: https://arxiv.org/abs/2502.18748
Authors: Shaheer Mohamed, Tharindu Fernando, Sridha Sridharan, Peyman Moghadam, Clinton Fookes
Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Keywords: snapshot mosaic cameras, enhanced spectral information, spectral information alongside, information alongside spatial, Hyperspectral object tracking
Comments: Accepted to 14th Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS)
Abstract:Hyperspectral object tracking using snapshot mosaic cameras is emerging as it provides enhanced spectral information alongside spatial data, contributing to a more comprehensive understanding of material properties. Using transformers, which have consistently outperformed convolutional neural networks (CNNs) in learning better feature representations, would be expected to be effective for hyperspectral object tracking. However, training large transformers necessitates extensive datasets and prolonged training periods. This is particularly critical for complex tasks like object tracking, and the scarcity of large datasets in the hyperspectral domain acts as a bottleneck in achieving the full potential of powerful transformer models. This paper proposes an effective methodology that adapts large pretrained transformer-based foundation models for hyperspectral object tracking. We propose an adaptive, learnable spatial-spectral token fusion module that can be extended to any transformer-based backbone for learning inherent spatial-spectral features in hyperspectral data. Furthermore, our model incorporates a cross-modality training pipeline that facilitates effective learning across hyperspectral datasets collected with different sensor modalities. This enables the extraction of complementary knowledge from additional modalities, whether or not they are present during testing. Our proposed model also achieves superior performance with minimal training iterations.
40. 【2502.18745】MaskPlanner: Learning-Based Object-Centric Motion Generation from 3D Point Clouds
Link: https://arxiv.org/abs/2502.18745
Authors: Gabriele Tiboni, Raffaello Camoriano, Tatiana Tommasi
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Object-Centric Motion Generation, Motion Generation, plan multiple long-horizon, Object-Centric Motion, multiple long-horizon trajectories
Comments: Project website at [this https URL](https://gabrieletiboni.github.io/MaskPlanner/)
Abstract:Object-Centric Motion Generation (OCMG) plays a key role in a variety of industrial applications, such as robotic spray painting and welding, requiring efficient, scalable, and generalizable algorithms to plan multiple long-horizon trajectories over free-form 3D objects. However, existing solutions rely on specialized heuristics, expensive optimization routines, or restrictive geometry assumptions that limit their adaptability to real-world scenarios. In this work, we introduce a novel, fully data-driven framework that tackles OCMG directly from 3D point clouds, learning to generalize expert path patterns across free-form surfaces. We propose MaskPlanner, a deep learning method that predicts local path segments for a given object while simultaneously inferring "path masks" to group these segments into distinct paths. This design induces the network to capture both local geometric patterns and global task requirements in a single forward pass. Extensive experimentation on a realistic robotic spray painting scenario shows that our approach attains near-complete coverage (above 99%) for unseen objects, while it remains task-agnostic and does not explicitly optimize for paint deposition. Moreover, our real-world validation on a 6-DoF specialized painting robot demonstrates that the generated trajectories are directly executable and yield expert-level painting quality. Our findings crucially highlight the potential of the proposed learning method for OCMG to reduce engineering overhead and seamlessly adapt to several industrial use cases.
41. 【2502.18735】QueryAdapter: Rapid Adaptation of Vision-Language Models in Response to Natural Language Queries
Link: https://arxiv.org/abs/2502.18735
Authors: Nicolas Harvey Chapman, Feras Dayoub, Will Browne, Christopher Lehnert
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Keywords: domain shift exists, raw image streams, image streams collected, domain shift, shift exists
Abstract:A domain shift exists between the large-scale, internet data used to train a Vision-Language Model (VLM) and the raw image streams collected by a robot. Existing adaptation strategies require the definition of a closed set of classes, which is impractical for a robot that must respond to diverse natural language queries. In response, we present QueryAdapter, a novel framework for rapidly adapting a pre-trained VLM in response to a natural language query. QueryAdapter leverages unlabelled data collected during previous deployments to align VLM features with semantic classes related to the query. By optimising learnable prompt tokens and actively selecting objects for training, an adapted model can be produced in a matter of minutes. We also explore how objects unrelated to the query should be dealt with when using real-world data for adaptation. In turn, we propose the use of object captions as negative class labels, helping to produce better calibrated confidence scores during adaptation. Extensive experiments on ScanNet++ demonstrate that QueryAdapter significantly enhances object retrieval performance compared to state-of-the-art unsupervised VLM adapters and 3D scene graph methods. Furthermore, the approach exhibits robust generalization to abstract affordance queries and other datasets, such as Ego4D.
42. 【2502.18734】Beyond RNNs: Benchmarking Attention-Based Image Captioning Models
Link: https://arxiv.org/abs/2502.18734
Authors: Hemanth Teja Yanambakkam, Rahul Chinthala
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Keywords: generate meaningful textual, meaningful textual descriptions, challenging task, intersection of computer, computer vision
Comments: 10 pages, 6 figures. Code and additional results are available on GitHub under the handle HemanthTejaY
Abstract:Image captioning is a challenging task at the intersection of computer vision and natural language processing, requiring models to generate meaningful textual descriptions of images. Traditional approaches rely on recurrent neural networks (RNNs), but recent advancements in attention mechanisms have demonstrated significant improvements. This study benchmarks the performance of attention-based image captioning models against RNN-based approaches using the MS-COCO dataset. We evaluate the effectiveness of Bahdanau attention in enhancing the alignment between image features and generated captions. The models are assessed using natural language processing metrics such as BLEU, METEOR, GLEU, and WER. Our results show that attention-based models outperform RNNs in generating more accurate and semantically rich captions, with better alignment to human evaluation. This work provides insights into the impact of attention mechanisms in image captioning and highlights areas for future improvements.
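Bahdanau (additive) attention, which the abstract evaluates, scores each image region against the decoder state as $e_i = v^\top \tanh(W_f f_i + W_s s)$, normalizes the scores with a softmax, and forms a context vector as the weighted sum of region features. The sketch below is a generic plain-Python illustration of that formula, not the benchmarked models: the weight matrices are hand-picked stand-ins for parameters that would be learned in practice, and all names are hypothetical.

```python
import math


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def additive_attention(features, state, w_f, w_s, v):
    """Return (weights, context) for additive attention over region features."""
    proj_state = matvec(w_s, state)
    scores = []
    for f in features:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(w_f, f), proj_state)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    peak = max(scores)  # subtract the maximum for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    context = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return weights, context


# Three hypothetical 2-D region features and a 2-D decoder state.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = [1.0, 0.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, context = additive_attention(features, state, identity, identity, v=[1.0, 0.0])
```

With these toy parameters the first and third regions receive equal weight and the second receives less, and the weights sum to 1. In a captioning decoder the context vector would be recomputed at every step from the current decoder state.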
43. 【2502.18724】Adversarial Universal Stickers: Universal Perturbation Attacks on Traffic Sign using Stickers
Link: https://arxiv.org/abs/2502.18724
Authors: Anthony Etim, Jakub Szefer
Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Keywords: deep learning models, deep learning, street sign, street, universal perturbations
Abstract:Adversarial attacks on deep learning models have proliferated in recent years. In many cases, a different adversarial perturbation is required to be added to each image to cause the deep learning model to misclassify it. This is ineffective as each image has to be modified in a different way. Meanwhile, research on universal perturbations focuses on designing a single perturbation that can be applied to all images in a data set, and cause a deep learning model to misclassify the images. This work advances the field of universal perturbations by exploring universal perturbations in the context of traffic signs and autonomous vehicle systems. This work introduces a novel method for generating universal perturbations that visually look like simple black and white stickers, and using them to cause incorrect street sign predictions. Unlike traditional adversarial perturbations, the adversarial universal stickers are designed to be applicable to any street sign: the same sticker, or stickers, can be applied in the same location to any street sign and cause it to be misclassified. Further, to enable safe experimentation with adversarial images and street signs, this work presents a virtual setting that leverages Street View images of street signs, rather than the need to physically modify street signs, to test the attacks. The experiments in the virtual setting demonstrate that these stickers can consistently mislead deep learning models used commonly in street sign recognition, and achieve high attack success rates on a dataset of US traffic signs. The findings highlight the practical security risks posed by simple stickers applied to traffic signs, and the ease with which adversaries can generate adversarial universal stickers that can be applied to many street signs.
44. 【2502.18691】Enhancing Image Classification with Augmentation: Data Augmentation Techniques for Improved Image Classification
Link: https://arxiv.org/abs/2502.18691
Authors: Saorj Kumar, Prince Asiamah, Oluwatoyin Jolaoso, Ugochukwu Esiowu
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Convolutional Neural Networks, Convolutional Neural, Neural Networks, finding applications, workhorse of deep
Abstract:Convolutional Neural Networks (CNNs) serve as the workhorse of deep learning, finding applications in various fields that rely on images. Given sufficient data, they exhibit the capacity to learn a wide range of concepts across diverse settings. However, a notable limitation of CNNs is their susceptibility to overfitting when trained on small datasets. The augmentation of such datasets can significantly enhance CNN performance by introducing additional data points for learning. In this study, we explore the effectiveness of 11 different sets of data augmentation techniques, which include three novel sets proposed in this work. The first set of data augmentation employs pairwise channel transfer, transferring Red, Green, Blue, Hue, and Saturation values from randomly selected images in the database to all images in the dataset. The second set introduces a novel occlusion approach, where objects in the images are occluded by randomly selected objects from the dataset. The third set involves a novel masking approach, using vertical, horizontal, circular, and checkered masks to occlude portions of the images. In addition to these novel techniques, we investigate other existing augmentation methods, including rotation, horizontal and vertical flips, resizing, translation, blur, color jitter, and random erasing, and their effects on accuracy and overfitting. We fine-tune a base EfficientNet-B0 model for each augmentation method and conduct a comparative analysis to showcase their efficacy. For the evaluation and comparison of these augmentation techniques, we utilize the Caltech-101 dataset. The ensemble of image augmentation techniques proposed emerges as the most effective on the Caltech-101 dataset. The results demonstrate that diverse data augmentation techniques present a viable means of enhancing datasets for improved image classification.
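The third augmentation set in this abstract occludes parts of an image with vertical, horizontal, circular, and checkered masks. The abstract does not give the mask geometry, so the sketch below is one plausible reading under stated assumptions: masks are centered, `size` controls the stripe half-width, circle radius, or checker cell size, and images are grayscale grids stored as lists of lists. The function names are hypothetical.

```python
def make_mask(kind, height, width, size=2):
    """Return a height x width grid of 0/1 values, where 0 marks occluded pixels."""
    cy, cx = (height - 1) / 2, (width - 1) / 2  # geometric center of the grid
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            if kind == "vertical":
                hidden = abs(x - cx) < size  # central vertical stripe
            elif kind == "horizontal":
                hidden = abs(y - cy) < size  # central horizontal stripe
            elif kind == "circular":
                hidden = (y - cy) ** 2 + (x - cx) ** 2 < size ** 2
            elif kind == "checkered":
                hidden = (y // size + x // size) % 2 == 1  # alternate cells
            else:
                raise ValueError(f"unknown mask kind: {kind}")
            row.append(0 if hidden else 1)
        mask.append(row)
    return mask


def apply_mask(image, mask):
    """Zero out the occluded pixels of a grayscale image."""
    return [[p * m for p, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


# A uniform white 8x8 image makes the occlusion pattern easy to inspect.
image = [[255] * 8 for _ in range(8)]
masked = apply_mask(image, make_mask("checkered", 8, 8, size=2))
```

With `size=2` on an 8x8 grid, the checkered mask hides exactly half of the pixels. For RGB images the same mask would be applied to each channel.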
45. 【2502.18620】Diffusion Models for conditional MRI generation
Link: https://arxiv.org/abs/2502.18620
Authors: Miguel Herencia García del Castillo, Ricardo Moya Garcia, Manuel Jesús Cerezo Mazón, Ekaitz Arriola Garcia, Pablo Menéndez Fernández-Miranda
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Keywords: Magnetic Resonance Imaging, brain Magnetic Resonance, Latent Diffusion Model, Resonance Imaging, Latent Diffusion
Abstract:In this article, we present a Latent Diffusion Model (LDM) for the generation of brain Magnetic Resonance Imaging (MRI), conditioning its generation based on pathology (Healthy, Glioblastoma, Sclerosis, Dementia) and acquisition modality (T1w, T1ce, T2w, Flair, PD). To evaluate the quality of the generated images, the Fréchet Inception Distance (FID) and Multi-Scale Structural Similarity Index (MS-SSIM) metrics were employed. The results indicate that the model generates images with a distribution similar to real ones, maintaining a balance between visual fidelity and diversity. Additionally, the model demonstrates extrapolation capability, enabling the generation of configurations that were not present in the training data. The results validate the potential of the model to increase the number of samples in clinical datasets, balancing underrepresented classes, and evaluating AI models in medicine, contributing to the development of diagnostic tools in radiology without compromising patient privacy.
46. 【2502.18592】DeBUGCN -- Detecting Backdoors in CNNs Using Graph Convolutional Networks
Link: https://arxiv.org/abs/2502.18592
Authors: Akash Vartak, Khondoker Murad Hossain, Tim Oates
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Deep neural networks, Deep neural, critical applications, making their susceptibility, significant problem
Comments: 18 pages, 11 tables, 8 figures
Abstract:Deep neural networks (DNNs) are becoming commonplace in critical applications, making their susceptibility to backdoor (trojan) attacks a significant problem. In this paper, we introduce a novel backdoor attack detection pipeline, detecting attacked models using graph convolution networks (DeBUGCN). To the best of our knowledge, ours is the first use of GCNs for trojan detection. We use the static weights of a DNN to create a graph structure of its layers. A GCN is then used as a binary classifier on these graphs, yielding a trojan or clean determination for the DNN. To demonstrate the efficacy of our pipeline, we train hundreds of clean and trojaned CNN models on the MNIST handwritten digits and CIFAR-10 image datasets, and show the DNN classification results using DeBUGCN. For a true In-the-Wild use case, our pipeline is evaluated on the TrojAI dataset which consists of various CNN architectures, thus showing the robustness and model-agnostic behaviour of DeBUGCN. Furthermore, on comparing our results on several datasets with state-of-the-art trojan detection algorithms, DeBUGCN is faster and more accurate.
47. 【2502.18586】Autonomous Vision-Guided Resection of Central Airway Obstruction
Link: https://arxiv.org/abs/2502.18586
Authors: M. E. Smith, N. Yilmaz, T. Watts, P. M. Scheikl, J. Ge, A. Deguet, A. Kuntz, A. Krieger
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: robotic advancements offer, Existing tracheal tumor, Existing tracheal, tumor resection methods, autonomous resection
Comments: Submitted to World Scientific, Journal of Medical Robotics Research (JMRR) 2025. 10 pages, 11 figures
Abstract:Existing tracheal tumor resection methods often lack the precision required for effective airway clearance, and robotic advancements offer new potential for autonomous resection. We present a vision-guided, autonomous approach for palliative resection of tracheal tumors. This system models the tracheal surface with a fifth-degree polynomial to plan tool trajectories, while a custom Faster R-CNN segmentation pipeline identifies the trachea and tumor boundaries. The electrocautery tool angle is optimized using handheld surgical demonstrations, and trajectories are planned to maintain a 1 mm safety clearance from the tracheal surface. We validated the workflow in five consecutive experiments on ex-vivo animal tissue models, clearing the airway obstruction without trachea perforation in all cases (with more than 90% volumetric tumor removal). These results support the feasibility of an autonomous resection platform, paving the way for future developments in minimally-invasive autonomous resection.
48. 【2502.18555】Application of Attention Mechanism with Bidirectional Long Short-Term Memory (BiLSTM) and CNN for Human Conflict Detection using Computer Vision
Link: https://arxiv.org/abs/2502.18555
Authors: Erick da Silva Farias, Eduardo Palhares Junior
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: public safety policies, Convolutional Neural Networks, computer vision, safety policies, crucial area
Abstract:The automatic detection of human conflicts through videos is a crucial area in computer vision, with significant applications in monitoring and public safety policies. However, the scarcity of public datasets and the complexity of human interactions make this task challenging. This study investigates the integration of advanced deep learning techniques, including Attention Mechanism, Convolutional Neural Networks (CNNs), and Bidirectional Long Short-Term Memory (BiLSTM), to improve the detection of violent behaviors in videos. The research explores how the use of the attention mechanism can help focus on the most relevant parts of the video, enhancing the accuracy and robustness of the model. The experiments indicate that the combination of CNNs with BiLSTM and the attention mechanism provides a promising solution for conflict monitoring, offering insights into the effectiveness of different strategies. This work opens new possibilities for the development of automated surveillance systems that can operate more efficiently in real-time detection of violent events.
49. 【2502.18546】Multi-class Seismic Building Damage Assessment from InSAR Imagery using Quadratic Variational Causal Bayesian Inference
Link: https://arxiv.org/abs/2502.18546
Authors: Xuechun Li, Susu Xu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: Interferometric Synthetic Aperture, Synthetic Aperture Radar, Interferometric Synthetic, Synthetic Aperture, Aperture Radar
Comments: Submitted to Remote Sensing and Environment
Abstract:Interferometric Synthetic Aperture Radar (InSAR) technology uses satellite radar to detect surface deformation patterns and monitor earthquake impacts on buildings. While vital for emergency response planning, extracting multi-class building damage classifications from InSAR data faces challenges: overlapping damage signatures with environmental noise, computational complexity in multi-class scenarios, and the need for rapid regional-scale processing. Our novel multi-class variational causal Bayesian inference framework with quadratic variational bounds provides rigorous approximations while ensuring efficiency. By integrating InSAR observations with USGS ground failure models and building fragility functions, our approach separates building damage signals while maintaining computational efficiency through strategic pruning. Evaluation across five major earthquakes (Haiti 2021, Puerto Rico 2020, Zagreb 2020, Italy 2016, Ridgecrest 2019) shows improved damage classification accuracy (AUC: 0.94-0.96), achieving up to 35.7% improvement over existing methods. Our approach maintains high accuracy (AUC 0.93) across all damage categories while reducing computational overhead by over 40% without requiring extensive ground truth data.
50. 【2502.18536】FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA
Link: https://arxiv.org/abs/2502.18536
Authors: S M Sarwar
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Keywords: Visual Question Answering, Question Answering requires, generate accurate answers, Answering requires models, textual understanding
Comments: 12 pages, 6 figures and 2 tables
Abstract:Visual Question Answering requires models to generate accurate answers by integrating visual and textual understanding. However, VQA models still struggle with hallucinations, producing convincing but incorrect answers, particularly in knowledge-driven and Out-of-Distribution scenarios. We introduce FilterRAG, a retrieval-augmented framework that combines BLIP-VQA with Retrieval-Augmented Generation to ground answers in external knowledge sources like Wikipedia and DBpedia. FilterRAG achieves 36.5% accuracy on the OK-VQA dataset, demonstrating its effectiveness in reducing hallucinations and improving robustness in both in-domain and Out-of-Distribution settings. These findings highlight the potential of FilterRAG to improve Visual Question Answering systems for real-world deployment.
51. 【2502.18533】Convolutional neural networks for mineral prospecting through alteration mapping with remote sensing data
Link: https://arxiv.org/abs/2502.18533
Authors: Ehsan Farahbakhsh, Dakshi Goel, Dhiraj Pimparkar, R. Dietmar Muller, Rohitash Chandra
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Keywords: rock sample analysis, based on field, field observations, observations and rock, rock sample
Abstract:Traditional geological mapping, based on field observations and rock sample analysis, is inefficient for continuous spatial mapping of features like alteration zones. Deep learning models, such as convolutional neural networks (CNNs), have revolutionised remote sensing data analysis by automatically extracting features for classification and regression tasks. CNNs can detect specific mineralogical changes linked to mineralisation by identifying subtle features in remote sensing data. This study uses CNNs with Landsat 8, Landsat 9, and ASTER data to map alteration zones north of Broken Hill, New South Wales, Australia. The model is trained using ground truth data and an automated approach with selective principal component analysis (PCA). We compare CNNs with traditional machine learning models, including k-nearest neighbours, support vector machines, and multilayer perceptron. Results show that ground truth-based training yields more reliable maps, with CNNs slightly outperforming conventional models in capturing spatial patterns. Landsat 9 outperforms Landsat 8 in mapping iron oxide areas using ground truth-trained CNNs, while ASTER data provides the most accurate argillic and propylitic alteration maps. This highlights CNNs' effectiveness in improving geological mapping precision, especially for identifying subtle mineralisation-related alterations.
52. 【2502.18530】IMPROVE: Iterative Model Pipeline Refinement and Optimization Leveraging LLM Agents
Link: https://arxiv.org/abs/2502.18530
Authors: Eric Xue, Zeyi Huang, Yuyang Ji, Haohan Wang
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: computer vision models, Computer vision, including plant monitoring, Iterative Refinement, high-performance computer vision
Abstract:Computer vision is a critical component in a wide range of real-world applications, including plant monitoring in agriculture and handwriting classification in digital systems. However, developing high-performance computer vision models traditionally demands both machine learning (ML) expertise and domain-specific knowledge, making the process costly, labor-intensive, and inaccessible to many. Large language model (LLM) agents have emerged as a promising solution to automate this workflow, but most existing methods share a common limitation: they attempt to optimize entire pipelines in a single step before evaluation, making it difficult to attribute improvements to specific changes. This lack of granularity leads to unstable optimization and slower convergence, limiting their effectiveness. To address this, we introduce Iterative Refinement, a novel strategy for LLM-driven ML pipeline design inspired by how human ML experts iteratively refine models, focusing on one component at a time rather than making sweeping changes all at once. By systematically updating individual components based on real training feedback, Iterative Refinement improves stability, interpretability, and overall model performance. We implement this strategy in IMPROVE, an end-to-end LLM agent framework for automating and optimizing object classification pipelines. Through extensive evaluations across datasets of varying sizes and domains, including standard benchmarks and Kaggle competition datasets, we demonstrate that Iterative Refinement enables IMPROVE to consistently achieve better performance over existing zero-shot LLM-based approaches. These findings establish Iterative Refinement as an effective new strategy for LLM-driven ML automation and position IMPROVE as an accessible solution for building high-quality computer vision models without requiring ML expertise.
53. 【2502.18521】Optimized Custom CNN for Real-Time Tomato Leaf Disease Detection
Link: https://arxiv.org/abs/2502.18521
Authors: Mangsura Kabir Oni, Tabia Tanzin Prama
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: custom CNN model, custom CNN, staple vegetable, culinary applications, Convolutional Neural Networks
Abstract:In Bangladesh, tomatoes are a staple vegetable, prized for their versatility in various culinary applications. However, the cultivation of tomatoes is often hindered by a range of diseases that can significantly reduce crop yields and quality. Early detection of these diseases is crucial for implementing timely interventions and ensuring the sustainability of tomato production. Traditional manual inspection methods, while effective, are labor-intensive and prone to human error. To address these challenges, this research paper sought to develop an automated disease detection system using Convolutional Neural Networks (CNNs). A comprehensive dataset of tomato leaves was collected from the Brahmanbaria district, preprocessed to enhance image quality, and then applied to various deep learning models. Comparative performance analysis was conducted between YOLOv5, MobileNetV2, ResNet18, and our custom CNN model. In our study, the Custom CNN model achieved an impressive accuracy of 95.2%, significantly outperforming the other models, which achieved an accuracy of 77%, 89.38% and 71.88% respectively. While other models showed solid performance, our Custom CNN demonstrated superior results specifically tailored for the task of tomato leaf disease detection. These findings highlight the strong potential of deep learning techniques for improving early disease detection in tomato crops. By leveraging these advanced technologies, farmers can gain valuable insights to detect diseases at an early stage, allowing for more effective management practices. This approach not only promises to boost tomato yields but also contributes to the sustainability and resilience of the agricultural sector, helping to mitigate the impact of plant diseases on crop production.
54. 【2502.18514】CipherFace: A Fully Homomorphic Encryption-Driven Framework for Secure Cloud-Based Facial Recognition
Link: https://arxiv.org/abs/2502.18514
Authors: Sefik Serengil, Alper Ozpinar
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: recognition systems rely, pre-tuned threshold, systems rely, determine identity, identity by verifying
Abstract:Facial recognition systems rely on embeddings to represent facial images and determine identity by verifying if the distance between embeddings is below a pre-tuned threshold. While embeddings are not reversible to original images, they still contain sensitive information, making their security critical. Traditional encryption methods like AES are limited in securely utilizing cloud computational power for distance calculations. Homomorphic Encryption, allowing calculations on encrypted data, offers a robust alternative. This paper introduces CipherFace, a homomorphic encryption-driven framework for secure cloud-based facial recognition, which we have open-sourced at this http URL. By leveraging FHE, CipherFace ensures the privacy of embeddings while utilizing the cloud for efficient distance computation. Furthermore, we propose a novel encrypted distance computation method for both Euclidean and Cosine distances, addressing key challenges in performing secure similarity calculations on encrypted data. We also conducted experiments with different facial recognition models, various embedding sizes, and cryptosystem configurations, demonstrating the scalability and effectiveness of CipherFace in real-world applications.
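The verification rule described in this abstract (two embeddings match when their distance falls below a pre-tuned threshold) can be sketched in plaintext as follows. This is only an illustration of the unencrypted decision rule, not CipherFace's encrypted computation; the function names, the toy 3-dimensional embeddings, and the threshold of 0.3 are all hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Euclidean (L2) distance between two equal-length embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """Cosine distance, defined here as 1 minus cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def is_same_identity(a, b, threshold, metric=cosine_distance):
    """Declare a match when the distance is below the threshold."""
    return metric(a, b) < threshold


# Toy embeddings (hypothetical; real models emit hundreds of dimensions).
enrolled = [0.6, 0.8, 0.0]
probe_same = [0.8, 0.6, 0.0]
probe_other = [0.0, 0.0, 1.0]

THRESHOLD = 0.3  # hypothetical value; in practice tuned per model and metric

print(is_same_identity(enrolled, probe_same, THRESHOLD))
print(is_same_identity(enrolled, probe_other, THRESHOLD))
```

In a homomorphic setting, the distance would be computed on ciphertexts in the cloud and only the key holder would decrypt the result and compare it to the threshold; the sketch above shows just the comparison being protected.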
55. 【2502.18512】FCoT-VL:Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression
Link: https://arxiv.org/abs/2502.18512
Authors: Jianjian Li, Junquan Fan, Feng Tang, Gang Huang, Shitao Zhu, Songlin Liu, Nian Xie, Wulong Liu, Yong Liao
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Keywords: Vision Large Language, Large Language Models, Vision Large, Large Language, success of Vision
Comments: 20 pages, 18 figures, 6 tables
Abstract:The rapid success of Vision Large Language Models (VLLMs) often depends on high-resolution images with abundant visual tokens, which hinders training and deployment efficiency. Current training-free visual token compression methods exhibit serious performance degradation in tasks involving high-resolution, text-oriented image understanding and reasoning. In this paper, we propose an efficient visual token compression framework for text-oriented VLLMs in high-resolution scenarios. In particular, we employ a light-weight self-distillation pre-training stage to compress the visual tokens, requiring a limited number of image-text pairs and minimal learnable parameters. Afterwards, to mitigate potential performance degradation of token-compressed models, we construct a high-quality post-train stage. To validate the effectiveness of our method, we apply it to an advanced VLLM, InternVL2. Experimental results show that our approach significantly reduces computational overhead while outperforming the baselines across a range of text-oriented benchmarks. We will release the models and code soon.
56. 【2502.18510】Multi-Teacher Knowledge Distillation with Reinforcement Learning for Visual Recognition
Link: https://arxiv.org/abs/2502.18510
Authors: Chuanguang Yang, Xinqiang Yu, Han Yang, Zhulin An, Chengqing Yu, Libo Huang, Yongjun Xu
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: transfers diverse knowledge, Multi-teacher Knowledge Distillation, Knowledge Distillation, transfers diverse, Multi-teacher Knowledge
Comments: AAAI-2025
Abstract:Multi-teacher Knowledge Distillation (KD) transfers diverse knowledge from a teacher pool to a student network. The core problem of multi-teacher KD is how to balance distillation strengths among various teachers. Most existing methods often develop weighting strategies from an individual perspective of teacher performance or teacher-student gaps, lacking comprehensive information for guidance. This paper proposes Multi-Teacher Knowledge Distillation with Reinforcement Learning (MTKD-RL) to optimize multi-teacher weights. In this framework, we construct both teacher performance and teacher-student gaps as state information to an agent. The agent outputs the teacher weight and can be updated by the return reward from the student. MTKD-RL reinforces the interaction between the student and teacher using an agent in an RL-based decision mechanism, achieving better matching capability with more meaningful weights. Experimental results on visual recognition tasks, including image classification, object detection, and semantic segmentation tasks, demonstrate that MTKD-RL achieves state-of-the-art performance compared to the existing multi-teacher KD works.
57. 【2502.18508】REFINE: Inversion-Free Backdoor Defense via Model Reprogramming
Link: https://arxiv.org/abs/2502.18508
Authors: Yukun Chen, Shuo Shao, Enhao Huang, Yiming Li, Pin-Yu Chen, Zhan Qin, Kui Ren
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: deep neural networks, significant security threat, implant hidden malicious, hidden malicious behaviors, model training phase
Comments: This paper is accepted by ICLR 2025. The first two authors contributed equally to this work. Our code is available at BackdoorBox ( [this https URL](https://github.com/THUYimingLi/BackdoorBox) ) and Github repository ( [this https URL](https://github.com/WhitolfChen/REFINE) ). 28 pages
Abstract:Backdoor attacks on deep neural networks (DNNs) have emerged as a significant security threat, allowing adversaries to implant hidden malicious behaviors during the model training phase. Pre-processing-based defense, which is one of the most important defense paradigms, typically focuses on input transformations or backdoor trigger inversion (BTI) to deactivate or eliminate embedded backdoor triggers during the inference process. However, these methods suffer from inherent limitations: transformation-based defenses often fail to balance model utility and defense performance, while BTI-based defenses struggle to accurately reconstruct trigger patterns without prior knowledge. In this paper, we propose REFINE, an inversion-free backdoor defense method based on model reprogramming. REFINE consists of two key components: (1) an input transformation module that disrupts both benign and backdoor patterns, generating new benign features; and (2) an output remapping module that redefines the model's output domain to guide the input transformations effectively. By further integrating supervised contrastive loss, REFINE enhances the defense capabilities while maintaining model utility. Extensive experiments on various benchmark datasets demonstrate the effectiveness of our REFINE and its resistance to potential adaptive attacks.
58. 【2502.18496】Physical Depth-aware Early Accident Anticipation: A Multi-dimensional Visual Feature Fusion Framework
Link: https://arxiv.org/abs/2502.18496
Authors: Hongpu Huang, Wei Zhou, Chen Wang
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: Early accident anticipation, proposed framework, advanced accident anticipation, accident anticipation approaches, accident anticipation
Abstract:Early accident anticipation from dashcam videos is a highly desirable yet challenging task for improving the safety of intelligent vehicles. Existing advanced accident anticipation approaches commonly model the interaction among traffic agents (e.g., vehicles, pedestrians, etc.) in the coarse 2D image space, which may not adequately capture their true positions and interactions. To address this limitation, we propose a physical depth-aware learning framework that incorporates the monocular depth features generated by a large model named Depth-Anything to introduce more fine-grained spatial 3D information. Furthermore, the proposed framework also integrates visual interaction features and visual dynamic features from traffic scenes to provide a more comprehensive perception towards the scenes. Based on these multi-dimensional visual features, the framework captures early indicators of accidents through the analysis of interaction relationships between objects in sequential frames. Additionally, the proposed framework introduces a reconstruction adjacency matrix for key traffic participants that are occluded, mitigating the impact of occluded objects on graph learning and maintaining the spatio-temporal continuity. Experimental results on public datasets show that the proposed framework attains state-of-the-art performance, highlighting the effectiveness of incorporating visual depth features and the superiority of the proposed framework.
59. 【2502.18495】A Comprehensive Survey on Composed Image Retrieval
Link: https://arxiv.org/abs/2502.18495
Authors: Xuemeng Song, Haoqiang Lin, Haokun Wen, Bohan Hou, Mingzhu Xu, Liqiang Nie
Categories: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Keywords: Composed Image Retrieval, Composed Image, Image Retrieval, reference image, target images
Abstract:Composed Image Retrieval (CIR) is an emerging yet challenging task that allows users to search for target images using a multimodal query, comprising a reference image and a modification text specifying the user's desired changes to the reference image. Given its significant academic and practical value, CIR has become a rapidly growing area of interest in the computer vision and machine learning communities, particularly with the advances in deep learning. To the best of our knowledge, there is currently no comprehensive review of CIR to provide a timely overview of this field. Therefore, we synthesize insights from over 120 publications in top conferences and journals, including ACM TOIS, SIGIR, and CVPR. In particular, we systematically categorize existing supervised CIR and zero-shot CIR models using a fine-grained taxonomy. For a comprehensive review, we also briefly discuss approaches for tasks closely related to CIR, such as attribute-based CIR and dialog-based CIR. Additionally, we summarize benchmark datasets for evaluation and analyze existing supervised and zero-shot CIR methods by comparing experimental results across multiple datasets. Furthermore, we present promising future directions in this field, offering practical insights for researchers interested in further exploration.
60. 【2502.18490】Event-based Solutions for Human-centered Applications: A Comprehensive Review
Link: https://arxiv.org/abs/2502.18490
Authors: Mira Adra, Simone Melcarne, Nelida Mirabet-Herranz, Jean-Luc Dugelay
Categories: Computer Vision and Pattern Recognition (cs.CV)
Keywords: groundbreaking sensors capable, light intensity asynchronously, offering exceptional temporal, exceptional temporal resolution, dynamic vision sensors
Abstract:Event cameras, often referred to as dynamic vision sensors, are groundbreaking sensors capable of capturing changes in light intensity asynchronously, offering exceptional temporal resolution and energy efficiency. These attributes make them particularly suited for human-centered applications, as they capture both the most intricate details of facial expressions and the complex motion dynamics of the human body. Despite growing interest, research in human-centered applications of event cameras remains scattered, with no comprehensive overview encompassing both body and face tasks. This survey bridges that gap by being the first to unify these domains, presenting an extensive review of advancements, challenges, and opportunities. We also examine less-explored areas, including event compression techniques and simulation frameworks, which are essential for the broader adoption of event cameras. This survey is designed to serve as a foundational reference that helps both new and experienced researchers understand the current state of the field and identify promising directions for future work in human-centered event camera applications. A summary of this survey can be found at this https URL
61. 【2502.19390】Multi-modal Contrastive Learning for Tumor-specific Missing Modality Synthesis
Link: https://arxiv.org/abs/2502.19390
Authors: Minjoo Lim, Bogyeong Kang, Tae-Eui Kam
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: magnetic resonance imaging, Multi-modal magnetic resonance, providing complementary information, resonance imaging, anatomy and pathology
Abstract:Multi-modal magnetic resonance imaging (MRI) is essential for providing complementary information about brain anatomy and pathology, leading to more accurate diagnoses. However, obtaining high-quality multi-modal MRI in a clinical setting is difficult due to factors such as time constraints, high costs, and patient movement artifacts. To overcome this difficulty, there is increasing interest in developing generative models that can synthesize missing target modality images from the available source ones. Therefore, we design a generative model for missing MRI that integrates multi-modal contrastive learning with a focus on critical tumor regions. Specifically, we integrate multi-modal contrastive learning, tailored for multiple source modalities, and enhance its effectiveness by selecting features based on entropy during the contrastive learning process. Additionally, our network not only generates the missing target modality images but also predicts segmentation outputs simultaneously. This approach improves the generator's capability to precisely generate tumor regions, ultimately improving performance in downstream segmentation tasks. By leveraging a combination of contrastive, segmentation, and additional self-representation losses, our model effectively reflects target-specific information and generates high-quality target images. Consequently, our results in the Brain MR Image Synthesis challenge demonstrate that the proposed model excelled in generating the missing modality.
62. 【2502.19351】Deep Learning-Based Transfer Learning for Classification of Cassava Disease
Link: https://arxiv.org/abs/2502.19351
Authors: Ademir G. Costa Junior, Fábio S. da Silva, Ricardo Rios
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Convolutional Neural Network, Neural Network architectures, Convolutional Neural, Neural Network, classifying cassava disease
Comments: 12 pages, in Portuguese language, 3 figures
Abstract:This paper presents a performance comparison among four Convolutional Neural Network architectures (EfficientNet-B3, InceptionV3, ResNet50, and VGG16) for classifying cassava disease images. The images were sourced from an imbalanced dataset from a competition. Appropriate metrics were employed to address class imbalance. The results indicate that EfficientNet-B3 achieved an accuracy of 87.7%, precision of 87.8%, recall of 87.8%, and F1-Score of 87.7% on this task. These findings suggest that EfficientNet-B3 could be a valuable tool to support Digital Agriculture.
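To illustrate why accuracy alone can mislead on an imbalanced dataset, the following minimal sketch computes accuracy alongside macro-averaged precision, recall, and F1. The abstract does not state which averaging scheme the authors used, so macro averaging is an assumption here, and the class labels and predictions are made-up toy data.

```python
def per_class_counts(y_true, y_pred, label):
    """Return (true positives, false positives, false negatives) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, fp, fn


def macro_precision_recall_f1(y_true, y_pred):
    """Average per-class precision, recall, and F1 with equal class weight."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp, fp, fn = per_class_counts(y_true, y_pred, label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


def accuracy(y_true, y_pred):
    """Fraction of samples whose prediction equals the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy imbalanced data (hypothetical labels): 8 majority, 2 minority samples.
y_true = ["healthy"] * 8 + ["blight"] * 2
y_pred = ["healthy"] * 8 + ["healthy", "blight"]

print(accuracy(y_true, y_pred))
print(macro_precision_recall_f1(y_true, y_pred))
```

In this toy case accuracy is 0.9 while macro recall is only 0.75, because half of the minority class is missed; that gap is what imbalance-aware metrics are meant to expose.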
63. 【2502.19258】Deep learning and classical computer vision techniques in medical image analysis: Case studies on brain MRI tissue segmentation, lung CT COPD registration, and skin lesion classification
Link: https://arxiv.org/abs/2502.19258
Authors: Anyimadu Daniel Tweneboah, Suleiman Taofik Ahmed, Hossain Mohammad Imran
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: Medical imaging spans, Medical imaging, imaging spans diverse, spans diverse tasks, treatment planning
Comments: 27 pages, 18 figures
Abstract:Medical imaging spans diverse tasks and modalities which play a pivotal role in disease diagnosis, treatment planning, and monitoring. This study presents a novel exploration, being the first to systematically evaluate segmentation, registration, and classification tasks across multiple imaging modalities. Integrating both classical and deep learning (DL) approaches in addressing brain MRI tissue segmentation, lung CT image registration, and skin lesion classification from dermoscopic images, we demonstrate the complementary strengths of these methodologies in diverse applications. For brain tissue segmentation, 3D DL models outperformed 2D and patch-based models: nnU-Net achieved a Dice of 0.9397, while 3D U-Net models with a ResNet34 backbone offered competitive results with a Dice of 0.8946. Multi-Atlas methods provided robust alternatives for cases where DL methods are not feasible, achieving an average Dice of 0.7267. In lung CT registration, classical Elastix-based methods outperformed DL models, achieving a minimum Target Registration Error (TRE) of 6.68 mm, highlighting the effectiveness of parameter tuning. HighResNet performed best among DL models with a TRE of 7.40 mm. For skin lesion classification, ensembles of DL models like InceptionResNetV2 and ResNet50 excelled, achieving up to 90.44% and 93.62% accuracy for binary and multiclass classification, respectively. With a One-vs-All approach, DL attained accuracies of 94.64% (mel vs. others), 95.35% (bcc vs. others), and 96.93% (scc vs. others), while ML models, specifically a Multi-Layer Perceptron (MLP) on handcrafted features, offered interpretable alternatives with 85.04% accuracy on the multi-class task (using SMOTE for class imbalance correction) and 83.27% on the binary-class task. Links to source code are available on request.
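The two headline metrics in this abstract, the Dice coefficient for segmentation and the Target Registration Error (TRE) for registration, can be sketched as below. This is a minimal illustration on toy data using the standard definitions; the flat binary masks, the landmark coordinates, and the convention of returning 1.0 for two empty masks are assumptions, not details taken from the paper.

```python
import math


def dice_coefficient(pred_mask, target_mask):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for flat binary masks of 0s and 1s."""
    intersection = sum(p * t for p, t in zip(pred_mask, target_mask))
    total = sum(pred_mask) + sum(target_mask)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


def target_registration_error(fixed_landmarks, warped_landmarks):
    """Mean Euclidean distance between corresponding landmark pairs."""
    distances = [
        math.sqrt(sum((f - w) ** 2 for f, w in zip(fixed, warped)))
        for fixed, warped in zip(fixed_landmarks, warped_landmarks)
    ]
    return sum(distances) / len(distances)


# Toy segmentation masks (hypothetical), flattened to one dimension.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
print(dice_coefficient(pred, target))

# Toy 3D landmarks in millimetres (hypothetical).
fixed = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
warped = [(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)]
print(target_registration_error(fixed, warped))
```

Higher Dice is better (1.0 means perfect overlap), whereas lower TRE is better, which is why the abstract reports a minimum TRE but maximum Dice values.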
64. 【2502.19181】Multi-level Attention-guided Graph Neural Network for Image Restoration
Link: https://arxiv.org/abs/2502.19181
Authors: Jiatao Jiang, Zhen Cui, Chunyan Xu, Jian Yang
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: achieved remarkable success, information, global, deep learning, image
Abstract:In recent years, deep learning has achieved remarkable success in the field of image restoration. However, most convolutional neural network-based methods typically focus on a single scale, neglecting the incorporation of multi-scale information. In image restoration tasks, local features of an image are often insufficient, necessitating the integration of global features to complement them. Although recent neural network algorithms have made significant strides in feature extraction, many models do not explicitly model global features or consider the relationship between global and local features. This paper proposes multi-level attention-guided graph neural network. The proposed network explicitly constructs element block graphs and element graphs within feature maps using multi-attention mechanisms to extract both local structural features and global representation information of the image. Since the network struggles to effectively extract global information during image degradation, the structural information of local feature blocks can be used to correct and supplement the global information. Similarly, when element block information in the feature map is missing, it can be refined using global element representation information. The graph within the network learns real-time dynamic connections through the multi-attention mechanism, and information is propagated and aggregated via graph convolution algorithms. By combining local element block information and global element representation information from the feature map, the algorithm can more effectively restore missing information in the image. Experimental results on several classic image restoration tasks demonstrate the effectiveness of the proposed method, achieving state-of-the-art performance.
65. 【2502.19153】RetinaRegen: A Hybrid Model for Readability and Detail Restoration in Fundus Images
Link: https://arxiv.org/abs/2502.19153
Authors: Yuhan Tang, Yudian Wang, Weizhen Li, Ye Yue, Chengchang Pan, Honggang Qi
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: increasing diagnostic uncertainty, diagnosing eye diseases, Fundus image quality, eye diseases, increasing diagnostic
Abstract:Fundus image quality is crucial for diagnosing eye diseases, but real-world conditions often result in blurred or unreadable images, increasing diagnostic uncertainty. To address these challenges, this study proposes RetinaRegen, a hybrid model for retinal image restoration that integrates a readability classification model, a Diffusion Model, and a Variational Autoencoder (VAE). Experiments on the SynFundus-1M dataset show that the proposed method achieves a PSNR of 27.4521, an SSIM of 0.9556, and an LPIPS of 0.1911 for the readability labels of the optic disc (RO) region. These results demonstrate superior performance in restoring key regions, offering an effective solution to enhance fundus image quality and support clinical diagnosis.
66. 【2502.19123】From Traditional to Deep Learning Approaches in Whole Slide Image Registration: A Methodological Review
Link: https://arxiv.org/abs/2502.19123
Authors: Behnaz Elhaminia, Abdullah Alsalemi, Esha Nasir, Mostafa Jahanifar, Ruqayya Awan, Lawrence S. Young, Nasir M. Rajpoot, Fayyaz Minhas, Shan E Ahmed Raza
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: tumour microenvironment, analysing the tumour, slide image, essential task, WSI registration
Abstract:Whole slide image (WSI) registration is an essential task for analysing the tumour microenvironment (TME) in histopathology. It involves the alignment of spatial information between WSIs of the same section or serial sections of a tissue sample. The tissue sections are usually stained with single or multiple biomarkers before imaging, and the goal is to identify neighbouring nuclei along the Z-axis for creating a 3D image or identifying subclasses of cells in the TME. This task is considerably more challenging compared to radiology image registration, such as magnetic resonance imaging or computed tomography, due to various factors. These include gigapixel size of images, variations in appearance between differently stained tissues, changes in structure and morphology between non-consecutive sections, and the presence of artefacts, tears, and deformations. Currently, there is a noticeable gap in the literature regarding a review of the current approaches and their limitations, as well as the challenges and opportunities they present. We aim to provide a comprehensive understanding of the available approaches and their application for various purposes. Furthermore, we investigate current deep learning methods used for WSI registration, emphasising their diverse methodologies. We examine the available datasets and explore tools and software employed in the field. Finally, we identify open challenges and potential future trends in this area of research.
67. 【2502.19046】Max360IQ: Blind Omnidirectional Image Quality Assessment with Multi-axis Attention
Link: https://arxiv.org/abs/2502.19046
Authors: Jiebin Yan, Ziwen Tan, Yuming Fang, Jiale Rao, Yifan Zuo
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: realistic immersive feelings, omnidirectional images, distorted omnidirectional images, omnidirectional image quality, Omnidirectional
Abstract:An omnidirectional image, also called a 360-degree image, captures the entire 360-degree scene, thereby providing a more realistic immersive experience for users than general 2D and stereoscopic images. Meanwhile, this feature brings great challenges to measuring the perceptual quality of omnidirectional images, which is closely related to users' quality of experience, especially when the omnidirectional images suffer from non-uniform distortion. In this paper, we propose a novel and effective blind omnidirectional image quality assessment (BOIQA) model with multi-axis attention (Max360IQ), which can proficiently measure not only the quality of uniformly distorted omnidirectional images but also the quality of non-uniformly distorted omnidirectional images. Specifically, the proposed Max360IQ is mainly composed of a backbone with stacked multi-axis attention modules for capturing both global and local spatial interactions of extracted viewports, a multi-scale feature integration (MSFI) module to fuse multi-scale features, and a quality regression module with deep semantic guidance for predicting the quality of omnidirectional images. Experimental results demonstrate that the proposed Max360IQ outperforms the state-of-the-art Assessor360 by 3.6% in terms of SRCC on the JUFE database with non-uniform distortion, and gains improvements of 0.4% and 0.8% in terms of SRCC on the OIQA and CVIQ databases, respectively. The source code is available at this https URL.
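The SRCC figures quoted above refer to the Spearman rank-order correlation coefficient, which measures how well predicted quality scores preserve the ranking of subjective scores. The following is a minimal, standard-library-only sketch of the metric itself, not code from the Max360IQ paper; the function names and the sample scores are illustrative assumptions.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def srcc(predicted, subjective):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(predicted), average_ranks(subjective))


# Hypothetical scores: the prediction is monotonic in the subjective score,
# so the ranking is preserved even though the relation is not linear.
predicted = [0.1, 0.4, 0.9]
subjective = [1, 10, 100]
print(srcc(predicted, subjective))
```

Because only ranks matter, SRCC rewards a model for ordering images correctly by quality, regardless of the scale of its raw outputs.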
68. 【2502.19037】PolypFlow: Reinforcing Polyp Segmentation with Flow-Driven Dynamics
Link: https://arxiv.org/abs/2502.19037
Authors: Pu Wang, Huaizhi Ma, Zhihua Zhang, Zhuoran Zheng
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: irregular lesion morphologies, heterogeneous imaging conditions, remains challenging due, Accurate polyp segmentation, segmentation remains challenging
Notes:
Abstract:Accurate polyp segmentation remains challenging due to irregular lesion morphologies, ambiguous boundaries, and heterogeneous imaging conditions. While U-Net variants excel at local feature fusion, they often lack explicit mechanisms to model the dynamic evolution of segmentation confidence under uncertainty. Inspired by the interpretable nature of flow-based models, we present PolypFlow, a flow-matching enhanced architecture that injects physics-inspired optimization dynamics into segmentation refinement. Unlike conventional cascaded networks, our framework solves an ordinary differential equation (ODE) to progressively align coarse initial predictions with ground truth masks through learned velocity fields. This trajectory-based refinement offers two key advantages: 1) Interpretable Optimization: Intermediate flow steps visualize how the model corrects under-segmented regions and sharpens boundaries at each ODE-solver iteration, demystifying the "black-box" refinement process; 2) Boundary-Aware Robustness: The flow dynamics explicitly model gradient directions along polyp edges, enhancing resilience to low-contrast regions and motion artifacts. Numerous experimental results show that PolypFlow achieves state-of-the-art performance while maintaining consistent performance in different lighting scenarios.
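The idea of refining a coarse prediction by solving an ODE along a velocity field can be illustrated with a forward Euler integrator. This is only a toy sketch under stated assumptions, not the PolypFlow implementation: `euler_refine`, `toy_velocity`, and the sample values are hypothetical, and the velocity here is a hand-written constant field that uses the target directly, whereas a real model would predict the velocity with a trained network and would not see the ground truth at inference time.

```python
def euler_refine(x0, velocity, steps=4):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with forward Euler.

    Returns every intermediate state so the trajectory can be inspected.
    """
    dt = 1.0 / steps
    x = list(x0)
    trajectory = [list(x)]
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        trajectory.append(list(x))
    return trajectory


# Hypothetical per-pixel foreground probabilities.
coarse = [0.2, 0.6, 0.4]   # coarse initial prediction
target = [0.0, 1.0, 1.0]   # mask the flow should reach at t=1


def toy_velocity(x, t):
    # Stand-in for a learned field: the constant velocity of the straight
    # path from the coarse prediction to the target.
    return [b - a for a, b in zip(coarse, target)]


for step, state in enumerate(euler_refine(coarse, toy_velocity, steps=4)):
    print(step, [round(s, 3) for s in state])
```

Printing each intermediate state mirrors the interpretability argument in the abstract: every solver step is a partial refinement that can be visualized.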
69. 【2502.19026】InternVQA: Advancing Compressed Video Quality Assessment with Distilling Large Foundation Model
Link: https://arxiv.org/abs/2502.19026
Authors: Fengbin Guan, Zihao Yu, Yiting Lu, Xin Li, Zhibo Chen
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: tasks rely heavily, Video quality assessment, rich features required, semantic information, temporal motion
Notes: Accepted by ISCAS 2025 (Lecture)
Abstract:Video quality assessment tasks rely heavily on the rich features required for video understanding, such as semantic information, texture, and temporal motion. The existing video foundation model, InternVideo2, has demonstrated strong potential in video understanding tasks due to its large parameter size and large-scale multimodal pretraining data. Building on this, we explored the transferability of InternVideo2 to video quality assessment under compression scenarios. To design a lightweight model suitable for this task, we proposed a distillation method to equip the smaller model with rich compression quality priors. Additionally, we examined the performance of different backbones during the distillation process. The results showed that, compared to other methods, our lightweight model distilled from InternVideo2 achieved excellent performance in compressed video quality assessment.
70. 【2502.18775】Subclass Classification of Gliomas Using MRI Fusion Technique
Link: https://arxiv.org/abs/2502.18775
Authors: Kiranmayee Janardhan, Christy Bobby Thomas
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Keywords: exhibits diverse aggressiveness, diverse aggressiveness levels, prevalent primary brain, MRI images, glioma subclass classification
Notes: 15 pages, 7 figures, 1 algorithm, 4 tables, journal paper
Abstract:Glioma, the prevalent primary brain tumor, exhibits diverse aggressiveness levels and prognoses. Precise classification of glioma is paramount for treatment planning and predicting prognosis. This study aims to develop an algorithm to fuse the MRI images from T1, T2, T1ce, and fluid-attenuated inversion recovery (FLAIR) sequences to enhance the efficacy of glioma subclass classification as no tumor, necrotic core, peritumoral edema, and enhancing tumor. The MRI images from BraTS datasets were used in this work. The images were pre-processed using max-min normalization to ensure consistency in pixel intensity values across different images. The segmentation of the necrotic core, peritumoral edema, and enhancing tumor was performed on 2D and 3D images separately using the UNET architecture. Further, the segmented regions from multimodal MRI images were fused using the weighted averaging technique. Integrating 2D and 3D segmented outputs enhances classification accuracy by capturing detailed features like tumor shape, boundaries, and intensity distribution in slices, while also providing a comprehensive view of spatial extent, shape, texture, and localization within the brain volume. The fused images were used as input to the pre-trained ResNet50 model for glioma subclass classification. The network is trained on 80% and validated on 20% of the data. The proposed method achieved a classification accuracy of 99.25%, precision of 99.30%, recall of 99.10%, F1 score of 99.19%, Intersection over Union of 84.49%, and specificity of 99.76%, which showed a significantly higher performance than existing techniques. These findings emphasize the significance of glioma segmentation and classification in aiding accurate diagnosis.
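The two pre-processing steps named in the abstract, max-min normalization and weighted-average fusion, can be sketched generically as follows. This is an illustration under stated assumptions rather than the authors' code: the function names, the flattened-list image representation, and the weights are hypothetical, and the abstract does not state which weights were used.

```python
def min_max_normalize(pixels):
    """Rescale intensities linearly to [0, 1]; a constant image maps to zeros."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return [0.0 for _ in pixels]
    return [(p - lo) / (hi - lo) for p in pixels]


def weighted_average_fusion(images, weights):
    """Fuse same-sized images pixel by pixel using normalized weights."""
    if len(images) != len(weights):
        raise ValueError("need exactly one weight per image")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    size = len(images[0])
    return [
        sum(w * img[i] for w, img in zip(weights, images)) / total
        for i in range(size)
    ]


# Hypothetical flattened slices from two modalities with different ranges.
t1 = min_max_normalize([100, 300, 500])
flair = min_max_normalize([20, 20, 60])
fused = weighted_average_fusion([t1, flair], weights=[3, 1])
print(fused)
```

Normalizing first keeps a modality with a larger raw intensity range from dominating the fused result.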
71. 【2502.18704】TerraTrace: Temporal Signature Land Use Mapping System
Link: https://arxiv.org/abs/2502.18704
Authors: Angela Busheska, Vikram Iyer, Bruno Silva, Peder Olsen, Ranveer Chandra, Vaishnavi Ranganathan
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: tracking events related, Difference Vegetation Index, Understanding land, Normalized Difference Vegetation, time is critical
Notes:
Abstract:Understanding land use over time is critical to tracking events related to climate change, like deforestation. However, satellite-based remote sensing tools used for monitoring struggle to differentiate vegetation types in farms and orchards from forests. We observe that metrics such as the Normalized Difference Vegetation Index (NDVI), based on plant photosynthesis, have unique temporal signatures that reflect agricultural practices and seasonal cycles. We analyze yearly NDVI changes on 20 farms for 10 unique crops. Initial results show that NDVI curves are coherent with agricultural practices, are unique to each crop, consistent globally, and can differentiate farms from forests. We develop a novel longitudinal NDVI dataset for the state of California from 2020-2023 with 500 m resolution and over 70 million points. We use this to develop the TerraTrace platform, an end-to-end analytic tool that classifies land use using NDVI signatures and allows users to query the system through an LLM chatbot and graphical interface.
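For reference, NDVI is computed from near-infrared (NIR) and red reflectance as (NIR - Red) / (NIR + Red), and a temporal signature is simply that value tracked over time. The sketch below shows the standard formula; the function names and the monthly reflectance values are made up for illustration and do not come from the TerraTrace dataset.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel or region."""
    denom = nir + red
    if denom == 0:
        return 0.0  # no signal in either band; avoid division by zero
    return (nir - red) / denom


def ndvi_series(nir_values, red_values):
    """NDVI over time: one value per observation date."""
    return [ndvi(n, r) for n, r in zip(nir_values, red_values)]


# Hypothetical reflectances for one field across four observation dates:
# bare soil, early growth, peak canopy, after harvest.
nir = [0.30, 0.40, 0.50, 0.32]
red = [0.30, 0.20, 0.10, 0.28]
print([round(v, 3) for v in ndvi_series(nir, red)])
```

A curve that rises and falls within a season, as in this made-up example, is the kind of shape the abstract describes as distinguishing cropland from forest.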
72. 【2502.18550】A Comparative Review of the Histogram-based Image Segmentation Methods
Link: https://arxiv.org/abs/2502.18550
Authors: ZhenZhou Wang
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: histogram-based image segmentation, image segmentation methods, numerical grayscale distribution, image segmentation, histogram-based image
Notes:
Abstract:The histogram of an image is an accurate graphical representation of its numerical grayscale distribution and also an estimate of the probability distribution of the image pixels. The histogram has therefore been widely adopted to calculate clustering means and partitioning thresholds for image segmentation. Many classical histogram-based image segmentation methods have been proposed and have played important roles in both academia and industry. In this article, the history and recent advances of histogram-based image segmentation techniques are first reviewed, and the techniques are then divided into four categories: (1) the means-based method; (2) the Gaussian-mixture-model-based method; (3) the entropy-based method; and (4) the feature-points-based method. The principles of the classical histogram-based image segmentation methods are described first, and then their performances are compared objectively. In addition, the histogram-based image segmentation methods are compared with general-purpose deep learning methods in segmenting objects with uniform or simple backgrounds. The histogram-based image segmentation methods are more accurate than the universal deep learning methods without special training in segmenting many types of images.
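To make "partitioning thresholds from a histogram" concrete, here is a small sketch of Otsu's method, a classical histogram-based thresholding technique chosen for illustration. The abstract does not say which specific methods the review covers, so this should not be read as one of its benchmarked algorithms; the pixel values are invented.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes between-class variance.

    Pixels with value <= t form one class; the rest form the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0  # pixel count and intensity sum of the lower class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue  # lower class still empty
        if w1 == 0:
            break     # upper class empty from here on
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Proportional to the between-class variance (pixel counts are
        # used in place of class probabilities).
        between_var = w0 * w1 * (mu0 - mu1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


# Hypothetical 8-bit image: dark background plus a bright object.
pixels = [10] * 50 + [12] * 10 + [200] * 40
t = otsu_threshold(pixels)
mask = [p > t for p in pixels]
print(t, sum(mask))
```

The threshold is derived from the histogram alone, with no training data, which is the property the abstract contrasts with general-purpose deep learning methods.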
73. 【2502.18523】End-to-End Deep Learning for Structural Brain Imaging: A Unified Framework
Link: https://arxiv.org/abs/2502.18523
Authors: Yao Su, Keqi Han, Mingjie Zeng, Lichao Sun, Liang Zhan, Carl Yang, Lifang He, Xiangnan Kong
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: providing valuable insights, Brain imaging analysis, Brain imaging, brain structure, fundamental in neuroscience
Notes:
Abstract:Brain imaging analysis is fundamental in neuroscience, providing valuable insights into brain structure and function. Traditional workflows follow a sequential pipeline (brain extraction, registration, segmentation, parcellation, network generation, and classification), treating each step as an independent task. These methods rely heavily on task-specific training data and expert intervention to correct intermediate errors, making them particularly burdensome for high-dimensional neuroimaging data, where annotations and quality control are costly and time-consuming. We introduce UniBrain, a unified end-to-end framework that integrates all processing steps into a single optimization process, allowing tasks to interact and refine each other. Unlike traditional approaches that require extensive task-specific annotations, UniBrain operates with minimal supervision, leveraging only low-cost labels (i.e., classification and extraction) and a single labeled atlas. By jointly optimizing extraction, registration, segmentation, parcellation, network generation, and classification, UniBrain enhances both accuracy and computational efficiency while significantly reducing annotation effort. Experimental results demonstrate its superiority over existing methods across multiple tasks, offering a more scalable and reliable solution for neuroimaging analysis. Our code and data can be found at this https URL.
74. 【2502.18522】Rewards-based image analysis in microscopy
Link: https://arxiv.org/abs/2502.18522
Authors: Kamyar Barakati, Yu Liu, Utkarsh Pratiush, Boris N. Slautin, Sergei V. Kalinin
Categories: Image and Video Processing (eess.IV); Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applied Physics (physics.app-ph)
Keywords: including biology, Analyzing imaging, scientific fields, crucial across scientific, Analyzing
Notes: 38 pages, 11 figures
Abstract:Analyzing imaging and hyperspectral data is crucial across scientific fields, including biology, medicine, chemistry, and physics. The primary goal is to transform high-resolution or high-dimensional data into an interpretable format to generate actionable insights, aiding decision-making and advancing knowledge. Currently, this task relies on complex, human-designed workflows comprising iterative steps such as denoising, spatial sampling, keypoint detection, feature generation, clustering, dimensionality reduction, and physics-based deconvolutions. The introduction of machine learning over the past decade has accelerated tasks like image segmentation and object detection via supervised learning, and dimensionality reduction via unsupervised methods. However, both classical and NN-based approaches still require human input, whether for hyperparameter tuning, data labeling, or both. The growing use of automated imaging tools, from atomically resolved imaging to biological applications, demands unsupervised methods that optimize data representation for human decision-making or autonomous experimentation. Here, we discuss advances in reward-based workflows, which adopt expert decision-making principles and demonstrate strong transfer learning across diverse tasks. We represent image analysis as a decision-making process over possible operations and identify desiderata and their mappings to classical decision-making frameworks. Reward-driven workflows enable a shift from supervised, black-box models sensitive to distribution shifts to explainable, unsupervised, and robust optimization in image analysis. They can function as wrappers over classical and DCNN-based methods, making them applicable to both unsupervised and supervised workflows (e.g., classification, regression for structure-property mapping) across imaging and hyperspectral data.
75. 【2502.18519】FreeTumor: Large-Scale Generative Tumor Synthesis in Computed Tomography Images for Improving Tumor Recognition
Link: https://arxiv.org/abs/2502.18519
Authors: Linshan Wu, Jiaxin Zhuang, Yanning Zhou, Sunan He, Jiabo Ma, Luyang Luo, Xi Wang, Xuefeng Ni, Xiaoling Zhong, Mingxiang Wu, Yinghua Zhao, Xiaohui Duan, Varut Vardhanabhuti, Pranav Rajpurkar, Hao Chen
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: million deaths attributed, death worldwide, million deaths, deaths attributed, diseases every year
Notes:
Abstract:Tumor is a leading cause of death worldwide, with an estimated 10 million deaths attributed to tumor-related diseases every year. AI-driven tumor recognition unlocks new possibilities for more precise and intelligent tumor screening and diagnosis. However, the progress is heavily hampered by the scarcity of annotated datasets, which demands extensive annotation efforts by radiologists. To tackle this challenge, we introduce FreeTumor, an innovative Generative AI (GAI) framework to enable large-scale tumor synthesis for mitigating data scarcity. Specifically, FreeTumor effectively leverages a combination of limited labeled data and large-scale unlabeled data for tumor synthesis training. Unleashing the power of large-scale data, FreeTumor is capable of synthesizing a large number of realistic tumors on images for augmenting training datasets. To this end, we create the largest training dataset for tumor synthesis and recognition by curating 161,310 publicly available Computed Tomography (CT) volumes from 33 sources, with only 2.3% containing annotated tumors. To validate the fidelity of synthetic tumors, we engaged 13 board-certified radiologists in a Visual Turing Test to discern between synthetic and real tumors. Rigorous clinician evaluation validates the high quality of our synthetic tumors, as they achieved only 51.1% sensitivity and 60.8% accuracy in distinguishing our synthetic tumors from real ones. Through high-quality tumor synthesis, FreeTumor scales up the recognition training datasets by over 40 times, showcasing a notable superiority over state-of-the-art AI methods including various synthesis methods and foundation models. These findings indicate promising prospects of FreeTumor in clinical applications, potentially advancing tumor treatments and improving the survival rates of patients.
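The Visual Turing Test result is reported as sensitivity and accuracy, both of which follow from a confusion matrix. The sketch below shows the standard definitions; treating "synthetic" as the positive class is an assumption made here for illustration, and the counts are hypothetical, not the study's data.

```python
def sensitivity(tp, fn):
    """Fraction of actual positives that were identified as positive."""
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    """Fraction of all cases that were labeled correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical reader results, with "synthetic" as the positive class:
tp, fn = 51, 49   # synthetic tumors called synthetic / called real
tn, fp = 70, 30   # real tumors called real / called synthetic
print(sensitivity(tp, fn), accuracy(tp, tn, fp, fn))
```

On a balanced set, guessing at random would give about 50% on both measures, which is why values close to that level are read as evidence that the synthetic tumors are hard to tell apart from real ones.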
76. 【2502.18516】Gradient entropy (GradEn): The two dimensional version of slope entropy for image analysis
Link: https://arxiv.org/abs/2502.18516
Authors: Runze Jiang, Pengjian Shang
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Keywords: theory and Shannon, Shannon entropy, essential for quantifying, quantifying irregularity, irregularity in complex
Notes:
Abstract:Information theory and Shannon entropy are essential for quantifying irregularity in complex systems or signals. Recently, two-dimensional entropy methods, such as two-dimensional sample entropy, distribution entropy, and permutation entropy, have been proposed for analyzing 2D texture or image data. This paper introduces Gradient entropy (GradEn), an extension of slope entropy to 2D, which considers both symbolic patterns and amplitude information, enabling better feature extraction from image data. We evaluate GradEn with simulated data, including 2D colored noise, 2D mixed processes, and the logistic map. Results show the ability of GradEn to distinguish images with various characteristics while maintaining low computational cost. Real-world datasets, consisting of texture, fault gear, and railway corrugation signals, demonstrate the superior performance of GradEn in classification tasks compared to other 2D entropy methods. In conclusion, GradEn is an effective tool for image characterization, offering a novel approach for image processing and recognition.
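As background for the entropy measures mentioned above, the following sketch computes plain Shannon entropy of an image's gray-level distribution. It is not GradEn or slope entropy, whose definitions are not given in the abstract; the function name and the sample images are illustrative assumptions.

```python
from collections import Counter
from math import log2


def shannon_entropy(values):
    """Shannon entropy in bits of the empirical distribution of `values`."""
    counts = Counter(values)
    n = len(values)
    # sum of p * log2(1 / p) with p = c / n
    return sum((c / n) * log2(n / c) for c in counts.values())


# Hypothetical flattened 2x2 images.
flat = [7, 7, 7, 7]            # a single gray level
checker = [0, 255, 255, 0]     # two levels, alternating
halves = [0, 0, 255, 255]      # same two levels, arranged in blocks
print(shannon_entropy(flat), shannon_entropy(checker), shannon_entropy(halves))
```

The last two images have identical histogram entropy despite different spatial layouts, which illustrates why two-dimensional entropy measures that account for local patterns are of interest for texture analysis.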
77. 【2502.18506】Exploring Patient Data Requirements in Training Effective AI Models for MRI-based Breast Cancer Classification
Link: https://arxiv.org/abs/2502.18506
Authors: Solha Kang, Wesley De Neve, Francois Rameau, Utku Ozbulak
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Keywords: clinical decision support, companies offering AI-based, offering AI-based solutions, medical institutions, witnessed a substantial
Notes: Accepted for publication in MICCAI 2024 Deep Breast Workshop on AI and Imaging for Diagnostic and Treatment Challenges in Breast Care
Abstract:The past decade has witnessed a substantial increase in the number of startups and companies offering AI-based solutions for clinical decision support in medical institutions. However, the critical nature of medical decision-making raises several concerns about relying on external software. Key issues include potential variations in image modalities and the medical devices used to obtain these images, potential legal issues, and adversarial attacks. Fortunately, the open-source nature of machine learning research has made foundation models publicly available and straightforward to use for medical applications. This accessibility allows medical institutions to train their own AI-based models, thereby mitigating the aforementioned concerns. Given this context, an important question arises: how much data do medical institutions need to train effective AI models? In this study, we explore this question in relation to breast cancer detection, a particularly contested area due to the prevalence of this disease, which affects approximately 1 in every 8 women. Through large-scale experiments on various patient sizes in the training set, we show that medical institutions do not need a decade's worth of MRI images to train an AI model that performs competitively with the state-of-the-art, provided the model leverages foundation models. Furthermore, we observe that for patient counts greater than 50, the number of patients in the training set has a negligible impact on the performance of models and that simple ensembles further improve the results without additional complexity.
78. 【2502.18485】Deciphering Functions of Neurons in Vision-Language Models
Link: https://arxiv.org/abs/2502.18485
Authors: Jiaqi Xu, Cuiling Lan, Xuejin Chen, Yan Lu
Categories: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)
Keywords: open-sourced vision-language models, diverse domains, neurons, burgeoning growth, growth of open-sourced
Notes: 22 pages, 23 figures
Abstract:The burgeoning growth of open-sourced vision-language models (VLMs) has catalyzed a plethora of applications across diverse domains. Ensuring the transparency and interpretability of these models is critical for fostering trustworthy and responsible AI systems. In this study, our objective is to delve into the internals of VLMs to interpret the functions of individual neurons. We observe the activations of neurons with respect to the input visual tokens and text tokens, and reveal some interesting findings. In particular, we find neurons that respond only to visual information, only to text information, or to both, which we refer to as visual neurons, text neurons, and multi-modal neurons, respectively. We build a framework that automates the explanation of neurons with the assistance of GPT-4o. Meanwhile, for visual neurons, we propose an activation simulator to assess the reliability of their explanations. Systematic statistical analyses on one representative VLM, LLaVA, uncover the behaviors and characteristics of different categories of neurons.